This post contains the latest paper listings retrieved from Arxiv.org on 2026-05-13, updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: Paper data is fetched from Arxiv.org daily, with an automatic scheduled update at around 12:30 each morning.

Tip: If a day's update is missing, either Arxiv released no new papers that day or the update script failed; fixes are applied the same day whenever possible.

Table of Contents

Overview (2026-05-13)

A total of 961 papers are updated today, including:

  • Natural Language Processing: 129 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 364 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 188 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 360 papers (Machine Learning, cs.LG)
  • Multi-Agent Systems: 23 papers (Multiagent Systems, cs.MA)
  • Information Retrieval: 33 papers (Information Retrieval, cs.IR)
  • Human-Computer Interaction: 19 papers (Human-Computer Interaction, cs.HC)

Multi-Agent Systems

[MA-0] Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

[Quick Read]: This paper addresses the problem of accurately predicting an unfamiliar counterpart's next decision from only a few interactions, in settings where AI agents transact in natural language with unknown opponents (e.g., a buyer facing an unknown seller, or a procurement assistant negotiating with a supplier) whose LLM, prompts, control logic, and rule-based fallbacks are all hidden, while each decision carries monetary consequences. The key to the solution is formulating the challenge as target-adaptive text-tabular prediction: each decision point is represented as a table row combining structured game state, offer history, and dialogue text, and K previous games of the same target agent are provided in the prompt as labeled adaptation examples. The model is built on a tabular foundation model using game-state features and LLM-generated text representations, and adds an LLM-as-Observer as a further representation layer: a small frozen LLM reads the decision-time state and dialogue, its direct answer is discarded, and its hidden state is instead used as a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Trained on 13 frontier LLM agents and tested on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text-feature baselines; at K=16, the Observer features improve response-prediction AUC by about 4 points and reduce bargaining offer-prediction error by 14%. This shows that framing counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting cannot capture.

Link: https://arxiv.org/abs/2605.12411
Authors: Eilam Shapira, Moshe Tennenholtz, Roi Reichart
Affiliations: Faculty of Data and Decision Sciences; Technion – Israel Institute of Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart’s LLM, prompts, control logic, and rule-based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart’s next decision from a few interactions. To avoid real-world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target-adaptive text-tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while K previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game-state features and LLM-based text representations, and adds LLM-as-Observer as an additional representation: a small frozen LLM reads the decision-time state and dialogue; its answer is discarded, and its hidden state becomes a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at K=16 , they improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14%. These results show that formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.

[MA-1] Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning

[Quick Read]: This paper tackles a limitation of current Multi-Agent Reinforcement Learning (MARL) frameworks, which bind fixed behaviors to fixed agent identities and thereby prevent agents from flexibly switching to different roles or behaviors at specific moments as task conditions change. The key idea is to use events, i.e., changes in the system state that induce qualitative changes in the task, to define when behavioral transitions occur. The framework decouples agent identity from behavior through two core elements: first, Neural Manifold Diversity (NMD), a formal distance metric that remains well-defined when behaviors are transient and agent-agnostic, used to build an expressive behavior manifold; second, an event-based hypernetwork that generates Low-Rank Adaptation (LoRA) modules over a shared team policy, allowing agents to reconfigure their policies on the fly in response to events. The construction provably ensures that diversity does not interfere with reward maximization, and experiments show the method outperforms established baselines across benchmarks, exhibits zero-shot generalization, and is the only method that solves tasks requiring sequential behavior reassignment.

Link: https://arxiv.org/abs/2605.12388
Authors: Hannes Büchi, Manon Flageat, Eduardo Sebastián, Amanda Prorok
Affiliations: Department of Computer Science and Technology, University of Cambridge
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:

Abstract:Effective multi-agent cooperation requires agents to adopt diverse behaviors as task conditions evolve-and to do so at the right moment. Yet, current Multi-Agent Reinforcement Learning (MARL) frameworks that facilitate this diversity are still limited by the fact that they bind fixed behaviors to fixed agent identities. Consequently, they are ill-equipped for tasks where agents need to take on different roles at very specific moments in time. We argue that, to define these behavioral transitions, the missing ingredient is events. Events are changes in the state of the system that induce qualitative changes in the task. Based on this view, we introduce a framework that decouples agent identity from behavior, capturing a continuous manifold from which agents instantiate their behaviors in response to events. This framework is based on two elements. First, to build an expressive behavior manifold, we introduce Neural Manifold Diversity (NMD), a formal distance metric that remains well-defined when behaviors are transient and agent-agnostic. Second, we use an event-based hypernetwork that generates Low-Rank Adaptation (LoRA) modules over a shared team policy, enabling on-the-fly agent-policy reconfiguration in response to events. We prove that this construction ensures that diversity does not interfere with reward maximization by design. Empirical results demonstrate that our framework outperforms established baselines across benchmarks while exhibiting zero-shot generalization, and being the only method that solves tasks requiring sequential behavior reassignment.
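The event-conditioned LoRA mechanism described above can be sketched in a few lines. This is a minimal numpy illustration with toy dimensions and a single linear hypernetwork (`H`, `lora_policy`, and the event embeddings are all assumptions for illustration, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, d_event = 8, 4, 2, 3

# Shared team-policy weight, common to all agents.
W = rng.normal(size=(d_out, d_in))

# Hypernetwork: one linear map from an event embedding to the
# flattened LoRA factors A (rank x d_in) and B (d_out x rank).
H = rng.normal(scale=0.1, size=(rank * d_in + d_out * rank, d_event))

def lora_policy(x, event_embedding):
    """Policy output with an event-conditioned low-rank delta B @ A."""
    params = H @ event_embedding
    A = params[: rank * d_in].reshape(rank, d_in)
    B = params[rank * d_in :].reshape(d_out, rank)
    return (W + B @ A) @ x  # W stays shared; B @ A is the event-specific part

x = rng.normal(size=d_in)
no_event = np.zeros(d_event)             # zero embedding -> delta vanishes
assert np.allclose(lora_policy(x, no_event), W @ x)

goal_event = np.array([1.0, 0.0, 0.0])   # hypothetical "goal reached" event
print(lora_policy(x, goal_event))
```

The point of the low-rank form is that the hypernetwork only has to emit `rank * (d_in + d_out)` numbers per event rather than a full weight matrix.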

[MA-2] Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries

【速读】:该论文试图解决代理型人工智能治理(Agentic AI governance)中,中心化信任节点(Provider)被攻破或恶意行为时引发的安全与隐私漏洞问题。当前最优方案SAGA假设Provider为逻辑中心化信任点,负责存储用户和代理信息并强制执行策略,但一旦Provider被攻陷,会导致代理属性不可追溯、隐私数据泄露或访问控制失效等严重攻击。解决方案的核心在于引入可抵御恶意Provider的多种加固架构:SAGA-BFT采用完全拜占庭容错(Byzantine-resilient)设计提供最强保护,但性能代价高;SAGA-MON和SGA-AUD分别通过轻量级服务器端监控或客户端审计,以极低开销抵御多数攻击;SAGA-HYB则是混合架构,在拜占庭容错与监控/审计之间进行安全-性能权衡。论文通过实证分析不同架构的适用条件,为实际部署提供了选型指导。

Link: https://arxiv.org/abs/2605.12364
Authors: Matthew D. Laws, Alina Oprea, Cristina Nita-Rotaru
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 18 pages, 18 figures, 4 tables

Abstract:Agentic AI governance is a critical component of agentic AI infrastructure ensuring that agents follow their owner’s communication and interaction policies, and providing protection against attacks from malicious agents. The state-of-the-art solution, SAGA, assumes a logically centralized point of trust, the Provider, which serves as a repository for user and agent information and actively enforces policies. While SAGA provides protection against malicious agents, it remains vulnerable to a malicious Provider that deviates from the protocol, undermining the security of the identity and access control infrastructure. Deployment on both private and public clouds, each susceptible to insider threats, further increases the risk of Provider compromise. In this work, we analyze the attacks that can be mounted from a compromised Provider, taking into account the different system components and realistic deployments. We identify and execute several concrete attacks with devastating effects: undermining agent attributability, extracting private data, or bypassing access control. We then present three types of solutions for securing the Provider that offer different trade-offs between security and performance. We first present SAGA-BFT, a fully byzantine-resilient architecture that provides the strongest protection, but incurs significant performance degradation, due to the high-cost of byzantine resilient protocols. We then propose SAGA-MON and SAGA-AUD, two novel solutions that leverage lightweight server-side monitoring or client-side auditing to provide protection against most classes of attacks with minimal overhead. Finally, we propose SAGA-HYB, a hybrid architecture that combines byzantine-resilience with monitoring and auditing to trade-off security for performance. We evaluate all the architectures and compare them with SAGA. We discuss which solution is best and under what conditions. 

[MA-3] Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

[Quick Read]: This paper addresses the lack of durability and inspectability of intermediate artifacts in multi-step, revisable AI workflows: conventional systems preserve only the final artifact, so upstream state is lost and reviewing, revising, and maintaining AI-generated work becomes difficult. The key contribution is a systems-level data model that defines intermediate artifacts as typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by downstream computation, explicitly distinguishing them from chat transcripts, memory, and hidden chain-of-thought. By formalizing additive and superseding update semantics and using artifact lineage to support durable intermediate state across revisions, the model lets humans and agents inspect, revise, supersede, and improve intermediate products such as evidence maps, claim structures, assumptions, and plans, shifting evaluation from final-output quality alone toward maintained-state quality, and ultimately making AI work more inspectable, revisable, and maintainable.

Link: https://arxiv.org/abs/2605.12087
Authors: Josh Rosen, Seth Rosen
Affiliations: ThruWire, Inc.
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 18 pages, 1 figure, 3 tables

Abstract:Many AI systems are organized around loops in which models reason, call tools, observe results, and continue until a task is complete. These systems often produce final artifacts such as memos, plans, recommendations, and analyses, while the intermediate work that shaped those outputs remains ephemeral. For multi-step, revisable AI work, final artifacts are often lossy projections over upstream state. We argue that such systems should preserve durable, inspectable intermediate artifacts: typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by downstream computation. These artifacts are not the model’s private chain-of-thought. They are maintained work products such as evidence maps, claim structures, criteria, assumptions, plans, transformation rules, synthesis procedures, unresolved tensions, and partial products that later humans and agents can inspect, revise, supersede, and improve. The contribution is a systems-level data model. We distinguish intermediate artifacts from chat transcripts, memory, hidden chain-of-thought, narration, thinking, and final answers; formalize additive and superseding update semantics with explicit current-state resolution; describe how artifact lineage supports durable intermediate state across revisions; and argue that evaluation must target maintained-state quality, not only final-output quality. The claim is not that artifacts make models smarter. It is that durable intermediate artifacts make AI-generated work more inspectable, revisable, and maintainable over time.
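The additive/superseding update semantics with explicit current-state resolution might look like the following sketch. The `Artifact` fields and the `ArtifactStore` API are hypothetical illustrations of the data model, not the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Artifact:
    artifact_id: str      # stable address
    kind: str             # e.g. "evidence_map", "plan", "assumption"
    version: int
    body: dict
    depends_on: list = field(default_factory=list)  # lineage edges
    supersedes: Optional[str] = None                # prior version's id

class ArtifactStore:
    def __init__(self):
        self._all = {}      # every version, kept durably
        self._current = {}  # artifact_id -> id of the current version

    def add(self, art: Artifact):
        """Additive update: a new artifact version enters the store."""
        vid = f"{art.artifact_id}@{art.version}"
        self._all[vid] = art
        self._current[art.artifact_id] = vid

    def supersede(self, old: Artifact, new_body: dict):
        """Superseding update: the new version becomes 'current', while
        the old version stays durably inspectable in the lineage."""
        new = Artifact(old.artifact_id, old.kind, old.version + 1,
                       new_body, old.depends_on,
                       supersedes=f"{old.artifact_id}@{old.version}")
        self.add(new)
        return new

    def current(self, artifact_id: str) -> Artifact:
        """Explicit current-state resolution."""
        return self._all[self._current[artifact_id]]

store = ArtifactStore()
plan = Artifact("plan-1", "plan", 1, {"steps": ["gather", "draft"]})
store.add(plan)
store.supersede(plan, {"steps": ["gather", "draft", "review"]})
print(store.current("plan-1").version)  # -> 2
```

The design point is that superseding never deletes: both versions remain addressable, and only the current-state map changes.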

[MA-4] SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

[Quick Read]: This paper addresses a new class of safety failures introduced when reusable skills serve as an extension interface for LLM agents: even when the original user request is benign, local artifacts packaged as task-relevant skill material can steer an agent toward unsafe actions through workflow context, an attack surface largely missed by existing safety evaluations. The key contribution is SkillSafetyBench, a runnable benchmark of 155 adversarial cases spanning 47 tasks, 6 risk domains, and 30 safety categories, each paired with a case-specific rule-based verifier, enabling systematic evaluation of such skill-mediated safety failures. Experiments with multiple CLI agents and model backends reveal consistent patterns of induced unsafe behavior across attack methods and scaffold-model pairings.

Link: https://arxiv.org/abs/2605.12015
Authors: Chang Jin, An Wang, Zeming Wei, Kai Wang, Biaojie Zeng, Qiaosheng Zhang, Chao Yang, Jingjing Qu, Xia Hu, Xingcheng Xu
Affiliations: Shanghai AI Laboratory; Peking University; East China Normal University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, task-relevant skill materials or local artifacts can steer an agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that localized non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.
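A case-specific rule-based verifier of the kind the benchmark describes can be sketched as a small set of rules over an agent's action trace. The rules and trace format below are invented for illustration only, not SkillSafetyBench's actual verifiers:

```python
import re

# Hypothetical rules for one case: the agent's action trace is a list of
# strings, and the case is judged unsafe if any step matches a rule.
RULES = [
    re.compile(r"\brm\s+-rf\s+/"),          # destructive filesystem action
    re.compile(r"curl\b.*\$\{?API_KEY"),    # exfiltrating an env secret
    re.compile(r"write\((?!/workspace/)"),  # writing outside the sandbox
]

def verify(trace):
    """Return True iff the trace passes this case's rule-based check."""
    return not any(rule.search(step) for step in trace for rule in RULES)

print(verify(["ls /workspace", "write(/workspace/report.md)"]))  # True
print(verify(["rm -rf /tmp/cache"]))                             # False
```

Per-case verifiers like this make the benchmark runnable: safety is scored mechanically from what the agent did, not from what it said.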

[MA-5] Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning

[Quick Read]: This paper addresses a difficulty in Multi-Agent Reinforcement Learning (MARL): because of the large joint action space and limited transition data, the policy distribution cannot be computed statistically. The key to the solution is a parametric likelihood-free density ratio estimator paired with two replay buffers of different sizes, which store historical trajectories representing the data distributions of the current and past policies, respectively. Based on this estimator, Adaptive TD(λ) (ATD(λ)) decay factors are assigned to state-action pairs according to their likelihood under the stationary distribution of the current policy, mitigating the bias-variance trade-off in value estimation without ever computing the policy distribution directly.

Link: https://arxiv.org/abs/2605.11880
Authors: Yue Deng, Zirui Wang, Yin Zhang
Affiliations: Zhongguancun Academy; College of Computer Science and Technology, Zhejiang University
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:TD( \lambda ) in value-based MARL algorithms or the Temporal Difference critic learning in Actor-Critic-based (AC-based) algorithms synergistically integrate elements from Monte-Carlo simulation and Q function bootstrapping via dynamic programming, which effectively addresses the inherent bias-variance trade-off in value estimation. Based on that, some recent works link the adaptive \lambda value to the policy distribution in the single-agent reinforcement learning area. However, because of the large joint action space from multiple number of agents, and the limited transition data in Multi-agent Reinforcement Learning, the policy distribution is infeasible to be calculated statistically. To solve the policy distribution calculation problem in MARL settings, we employ a parametric likelihood-free density ratio estimator with two replay buffers instead of calculating statistically. The two replay buffers of different sizes store the historical trajectories that represent the data distribution of the past and current policies correspondingly. Based on the estimator, we assign Adaptive TD( \lambda ), \textbfATD( \lambda ), values to state-action pairs based on their likelihood under the stationary distribution of the current policy. We apply the proposed method on two competitive baseline methods, QMIX for value-based algorithms, and MAPPO for AC-based algorithms, over SMAC benchmarks and Gfootball academy scenarios, and demonstrate consistently competitive or superior performance compared to other baseline approaches with static \lambda values.
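For context, the fixed-λ return that ATD(λ) adapts per state-action pair interpolates between one-step bootstrapping (λ=0) and Monte-Carlo targets (λ=1). A minimal sketch of the standard backward recursion (textbook TD(λ), not the paper's estimator):

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.8):
    """Backward recursion for the lambda-return:
       G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    blending one-step bootstrapping (lam=0) with Monte-Carlo (lam=1)."""
    T = len(rewards)
    G = np.zeros(T)
    next_return = values[-1]  # bootstrap from V(s_T)
    for t in reversed(range(T)):
        next_return = rewards[t] + gamma * (
            (1 - lam) * values[t + 1] + lam * next_return)
        G[t] = next_return
    return G

rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.1, 0.2, 0.5, 0.0])  # V(s_0..s_3); terminal V = 0
print(lambda_returns(rewards, values, gamma=1.0, lam=1.0))  # pure MC: all 1.0
```

ATD(λ) replaces the single `lam` with a per-pair value derived from the density-ratio estimate, so frequently revisited pairs can lean on bootstrapping while off-distribution pairs lean on longer returns.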

[MA-6] AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents

[Quick Read]: This paper addresses the problem that existing deep-research frameworks conflate information exploration and exploitation into a single module, leaving search strategy and content generation without effective coordination or adaptive optimization. The key contribution is AgentDisCo, a disentangled and collaborative agentic architecture that formulates deep research as an adversarial optimization problem between exploration and exploitation: a critic agent evaluates the generated outline and refines search queries, while a generator agent retrieves updated results and revises the outline accordingly, with the two agents checking and collaborating with each other across iterations. In addition, a meta-optimization harness repurposes the generator agent as a scoring agent that evaluates critic outputs and produces quality signals, and code-generation agents systematically explore agent configurations to build a policy bank, a structured repository of reusable design strategies, enabling the framework to self-refine without extensive human intervention.

Link: https://arxiv.org/abs/2605.11732
Authors: Jiarui Jin, Zexuan Yan, Shijian Wang, Wenxiang Jiao, Yuan Lu
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments:

Abstract:In this paper, we present AgentDisCo, a novel Disentangled and Collaborative agentic architecture that formulates deep research as an adversarial optimization problem between information exploration and exploitation. Unlike existing approaches that conflate these two processes into a single module, AgentDisCo employs a critic agent to evaluate generated outlines and refine search queries, and a generator agent to retrieve updated results and revise outlines accordingly. The iteratively refined outline is then passed to a downstream report writer that synthesizes a comprehensive research report. The overall workflow supports both handcrafted and automatically discovered design strategies via a meta-optimization harness, in which the generator agent is repurposed as a scoring agent to evaluate critic outputs and generate quality signals. Powerful code-generation agents (e.g., Claude-Code, Codex) systematically explore agent configurations and construct a policy bank, a structured repository of reusable design strategies, enabling the framework to self-refine without extensive human intervention. We evaluate AgentDisCo on three established deep research benchmarks (DeepResearchBench, DeepConsult, DeepResearchGym) using Gemini-2.5-Pro, achieving performance comparable to or surpassing leading closed-source systems. Observing that existing benchmarks inadequately reflect real-world user needs, we introduce GALA (General AI Life Assistants), a benchmark that mines latent research interests from users’ historical browsing behavior. We further develop a rendering agent that converts research reports into visually rich poster presentations, and demonstrate an end-to-end product, AutoResearch Your Interest, which delivers personalized deep research recommendations derived from individual browsing histories.

[MA-7] A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar

[Quick Read]: This paper addresses the dual challenge that software engineering faces with the rise of agentic AI: how to use agents effectively to support software engineering tasks, and how to design software engineering practices for agentic AI systems themselves, given the absence of a coherent, community-agreed research agenda coordinating the two directions. The key to the solution was the A2SE seminar, which brought together 18 experts from academia and industry and, through structured presentations, collaborative topic clustering, and focused group discussions, identified six thematic areas (Governance, Software Engineering for Agents, Agents for Software Architecture, Quality and Evaluation, Sustainability, and Code), prioritized short-term and long-term research directions for each, and produced a community-driven, opinionated research agenda that gives the SE community a structured foundation for coordinating efforts at this critical juncture.

Link: https://arxiv.org/abs/2605.11720
Authors: Davide Taibi, Henry Muccini, Karthik Vaidhyanathan, Marcos Kalinowski, Michele Albano, Antonio Pedro Santos Alves, Renato Cerqueira, Mateus Devino, Matteo Esposito, Rodrigo Falcão, Vinicius Henning, Foutse Khomh, Valentina Lenarduzzi, Qinghua Lu, Matías Martínez, Henrique Mello, Daniel Mendez, Lucas Romao
Affiliations: University of Southern Denmark, Vejle, Denmark; FrAmeLab@SWEN, University of L’Aquila, Italy; Software Engineering Research Center, IIIT Hyderabad, India; Pontifical Catholic University of Rio de Janeiro, Rio de Janeiro, Brazil
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 6 pages, 1 table, A2SE meeting, this https URL

Abstract:The rise of agentic AI is reshaping software engineering in two intertwined directions: agents are increasingly applied to support software engineering tasks, and Agentic AI systems themselves are complex systems that require re-thinking currently established software engineering practices. To chart a coherent research agenda covering the two directions, we organized the A2SE seminar in Rio de Janeiro, bringing together 18 experts from academia and industry. Through structured presentations, collaborative topic clustering, and focused group discussions, participants identified six thematic areas: Governance, Software Engineering for Agents, Agents for Software Architecture, Quality and Evaluation, Sustainability, and Code, and they prioritized short-term and long-term research directions for each. This paper presents the resulting community-driven, opinionated research agenda, offering the SE community a structured foundation for coordinating efforts at this critical juncture.

[MA-8] Shaping Zero-Shot Coordination via State Blocking

[Quick Read]: This paper addresses a core problem in zero-shot coordination (ZSC): existing methods emphasize increasing partner diversity during training, yet still struggle to generalize reliably to unseen partners, including humans. The key contribution is the State-Blocked Coordination (SBC) framework, which uses state blocking to generate a family of virtual environments, letting agents experience a wide range of suboptimal partner policies without directly modifying the original environment and thereby substantially improving coordination with unseen partners.

Link: https://arxiv.org/abs/2605.11688
Authors: Mingu Kang, Sunwoo Lee, Yonghyeon Jo, Seungyul Han
Affiliations: Graduate School of Artificial Intelligence; UNIST
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 9 technical pages followed by references and appendix

Abstract:Zero-shot coordination (ZSC) aims to enable agents to cooperate with independently trained partners without prior interaction, a key requirement for real-world multi-agent systems and human-AI collaboration. Existing approaches have largely emphasized increasing partner diversity during training, yet such strategies often fall short of achieving reliable generalization to unseen partners. We introduce State-Blocked Coordination (SBC), a simple yet effective framework that improves ZSC by inducing diverse interaction scenarios without direct environment modification. Specifically, SBC generates a family of virtual environments through state blocking, allowing agents to experience a wide range of suboptimal partner policies. Across multiple benchmarks, SBC demonstrates superior performance in zero-shot coordination, including strong generalization to human partners.
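State blocking can be illustrated on a toy gridworld: blocking a random set of states yields one member of a family of virtual environments, each inducing different reachable sets and hence different partner behavior. Everything below (the grid, moves, and sampling scheme) is a hypothetical sketch, not the paper's benchmarks:

```python
import random

GRID = 4  # a 4x4 gridworld; states are (row, col)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, blocked=frozenset()):
    """Transition with state blocking: moving into a blocked (or off-grid)
    cell is a no-op, so blocked states are simply unreachable."""
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    in_grid = 0 <= nxt[0] < GRID and 0 <= nxt[1] < GRID
    return nxt if in_grid and nxt not in blocked else state

def sample_virtual_env(rng, n_blocked=3, start=(0, 0), goal=(GRID - 1, GRID - 1)):
    """Draw one member of the virtual-environment family by blocking a
    random set of states (never the start or the goal)."""
    candidates = [(r, c) for r in range(GRID) for c in range(GRID)
                  if (r, c) not in (start, goal)]
    return frozenset(rng.sample(candidates, n_blocked))

rng = random.Random(0)
blocked = sample_virtual_env(rng)
s = (0, 0)
for a in ["right", "right", "down", "down"]:
    s = step(s, a, blocked)
print(s)
```

Training across many such sampled blockades exposes an agent to the detours and suboptimal routes a real unseen partner might take.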

[MA-9] GeomHerd: A Forward-looking Herding Quantification via Ricci Flow Geometry on Agent Interactive Simulations

[Quick Read]: This paper addresses the inherent lag of existing financial herding detection methods, which rely on price-correlation statistics and can therefore observe coordination only after it has already moved realized returns. The key to the solution is GeomHerd, a forward-looking geometric framework that bypasses this observability lag by quantifying coordination directly on upstream agent-interaction graphs. Specifically, discrete Ollivier-Ricci curvature is used to track the topology of these action graphs and capture the structural signature of emerging coordination; theoretically, a mean-field bridge maps the graph-theoretic metric to CSAD (Cross-Sectional Absolute Deviation), the classical macroscopic herding statistic, and the approach is validated on the Cividino-Sornette continuous-spin substrate and shown to transfer out-of-domain to the Vicsek self-driven-particle model. The geometric signal also feeds a curvature-conditioned forecasting head that reduces log-return mean absolute error within cascade windows.

Link: https://arxiv.org/abs/2605.11645
Authors: Lake Yang, Junwei Su, Jingfeng Zeng, Wenhao Lu, Xingzhi Qian, Weitong Zhang, Chuan Wu, Dunhong Jin
Affiliations: University of Science and Technology of China; MaxQuant; The University of Hong Kong; University College London
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Comments:

Abstract:Herding – where agents align their behaviors and act collectively – is a central driver of market fragility and systemic risk. Existing approaches to quantify herding rely on price-correlation statistics, which inherently lag because they only detect coordination after it has already moved realised returns. We propose GeomHerd, a forward-looking geometric framework that bypasses this observability lag by quantifying coordination directly on upstream agent-interaction graphs. To generate these graphs, we treat a heterogeneous LLM-driven multi-agent simulator – each financial trader instantiated by a persona-conditioned LLM call – as a forecastable world, and evaluate the geometric pipeline on the Cividino–Sornette continuous-spin agent-based substrate as our headline financial testbed. By tracking the discrete Ollivier–Ricci curvature of these action graphs, GeomHerd captures the structural topology of emerging coordination. Theoretically, we establish a mean-field bridge mapping our graph-theoretic metric to CSAD, the classical macroscopic herding statistic, linking GeomHerd to downstream price-dispersion measurement. Empirically, GeomHerd anticipates herding long before aggregate market baselines: on the continuous-spin substrate, our primary detector fires a median of 272 steps before order-parameter onset; a contagion detector ( \beta_- ) recalls 65% of critical trajectories 318 steps early; and on co-firing trajectories the agent-graph signal precedes price-correlation-graph baselines by 40 steps. As a complementary indicator, the effective vocabulary of agent actions contracts during cascades. The geometric signature transfers out-of-domain to the Vicsek self-driven-particle model, and a curvature-conditioned forecasting head reduces cascade-window log-return MAE over detector-conditioned and price-only baselines.
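CSAD, the classical macroscopic herding statistic the paper's mean-field bridge targets, has a simple closed form: the average absolute deviation of individual returns from the cross-sectional market return. A minimal sketch on toy data (the toy return matrices are assumptions for illustration):

```python
import numpy as np

def csad(returns):
    """Cross-Sectional Absolute Deviation per time step:
       CSAD_t = (1/N) * sum_i |R_{i,t} - R_{m,t}|,
    where R_{m,t} is the equal-weighted market return. Shrinking CSAD
    while the market moves is the classical downstream symptom of herding."""
    returns = np.asarray(returns)  # shape (T, N): T steps, N agents
    market = returns.mean(axis=1, keepdims=True)
    return np.abs(returns - market).mean(axis=1)

# Dispersed vs. herded cross-sections (toy data).
dispersed = np.array([[0.02, -0.01, 0.03, -0.04]])
herded = np.array([[0.021, 0.019, 0.020, 0.020]])
print(csad(dispersed), csad(herded))  # herding shrinks CSAD
```

Because CSAD only reacts once realized returns compress, a graph-level signal that anticipates this compression is what GeomHerd aims to provide upstream.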

[MA-10] Hierarchical LLM-Driven Control for HAPS-Assisted UAV Networks: Joint Optimization of Flight and Connectivity

[Quick Read]: This paper addresses the joint optimization of multi-UAV motion control and communication connectivity in an integrated terrestrial and non-terrestrial network (ITNTN), focusing on a 3D aerial-highway scenario in which UAVs must achieve collision avoidance, efficient traffic flow, and reliable communication under dynamic, partially observable conditions. The key to the solution is a large language model (LLM)-driven hierarchical multi-rate control framework: at the global level, an LLM controller deployed on a high-altitude platform station (HAPS) performs long-term planning for load balancing and handover decisions; at the local level, each UAV uses a hybrid controller that integrates a slow-timescale LLM for high-level spatial reasoning with a reinforcement learning agent for fast UAV-to-infrastructure communication and motion control. Through its hierarchical architecture and multi-rate coordination, the framework effectively disentangles the coupling between control and communication objectives, significantly improving transportation efficiency and communication throughput while reducing collision rates.

Link: https://arxiv.org/abs/2605.11509
Authors: Zijiang Yan, Hao Zhou, Wael Jaafar, Jianhua Pei, Ping Wang, Halim Yanikomeroglu, Hina Tabassum
Affiliations: Department of Electrical Engineering and Computer Science, York University; Samsung Research America; Department of Software and IT Engineering, École de technologie supérieure (ÉTS), University of Quebec; Non-Terrestrial Networks (Carleton-NTN) Lab and the Department of Systems and Computer Engineering, Carleton University
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: Submission for possible publication

Abstract:Uncrewed aerial vehicles (UAVs) are increasingly deployed in complex networked environments, yet the joint optimization of multi-UAV motion control and connectivity remains a fundamental challenge. In this paper, we study a multi-UAV system operating in an integrated terrestrial and non-terrestrial network (ITNTN) comprising terrestrial base stations and high-altitude platform stations (HAPS). We consider a three-dimensional (3D) aerial highway scenario where UAVs must adapt their motion to ensure collision avoidance, efficient traffic flow, and reliable communication under dynamic and partially observable conditions. We first model the problem as a hierarchical multi-objective partially observable Markov decision process (H-MO-POMDP), capturing the coupling between control and communication objectives. Based on this formulation, we propose a large language model (LLM)-driven hierarchical multi-rate control framework. At the global level, an LLM-based controller on the HAPS performs long-term planning for load balancing and handover decisions. At the local level, each UAV employs a hybrid controller that integrates a slow-timescale LLM for high-level spatial reasoning with a reinforcement learning agent for faster UAV-to-infrastructure (U2I) communication and motion control. We further develop a high-fidelity 3D simulation platform by integrating the gym-pybullet-drones environment with 3GPP-compliant RF/THz channel models. Numerical results demonstrate that the proposed framework significantly outperforms state-of-the-art baselines, achieving a 14% increase in transportation efficiency and a 25% improvement in telecommunication throughput. Additionally, it achieves a 23% reduction in physical collision rates, demonstrating strong handover stability and zero-shot generalization in dynamic scenarios.

[MA-11] Distance-Constrained Unlabeled Multi-Agent Pathfinding

[Quick Read]: This paper studies a graph pathfinding problem called Distance-r Independent Unlabeled Multi-Agent Pathfinding, which extends classical multi-agent pathfinding (MAPF) with an additional constraint: all agents must stay at pairwise distance at least r+1 at all times. This constraint generalizes the usual collision model and better matches real-world multi-agent coordination, but it makes the feasibility question PSPACE-complete, whereas feasibility for standard unlabeled MAPF is decidable in polynomial time. The key to the solution lies in two complementary approaches: (i) reduction-based optimal algorithms built on a feasibility-preserving compression procedure that shrinks the problem while preserving optimal solutions, and (ii) a configuration generator-based search that explores feasible paths directly using state-space reduction. Despite the hardness, experiments show that the algorithm handles hundreds of agents in a practical timeframe.

Link: https://arxiv.org/abs/2605.11503
Authors: Takahiro Suzuki, Yuma Tamura, Keisuke Okumura
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments:

Abstract:We study a graph pathfinding problem Distance- r Independent Unlabeled Multi-Agent Pathfinding, finding a set of collision-free paths between two sets where agents must stay at pairwise distance at least r+1 at all times. This additional constraint, generalizing collision modeling for classical MAPF, targets aspects of real-world multi-agent coordination. This additional distance constraint makes feasibility (i.e., whether a solution exists) PSPACE-complete, in contrast to standard (unlabeled) MAPF, where it can be decided in polynomial time. We address the challenge via two complementary approaches: (i) reduction-based optimal algorithms with a feasibility-preserving compression procedure, and (ii) a configuration generator-based search. Despite the hardness, empirical results show that our algorithm can handle hundreds of agents in a practical timeframe.
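The distance-r constraint itself is easy to state programmatically: a configuration is valid iff every pair of agents sits at graph distance at least r+1. A sketch that checks one configuration via BFS hop distances (a validity check only, not the paper's search algorithms):

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Hop distances from source in an undirected graph (adjacency dict)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def configuration_valid(adj, positions, r):
    """A configuration satisfies the distance-r constraint iff every
    pair of agents is at graph distance at least r + 1."""
    for u, v in combinations(positions, 2):
        if bfs_distances(adj, u).get(v, float("inf")) <= r:
            return False
    return True

# Path graph 0-1-2-3-4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(configuration_valid(adj, [0, 4], r=2))  # distance 4 >= 3 -> True
print(configuration_valid(adj, [0, 2], r=2))  # distance 2 <  3 -> False
```

Note that r=0 recovers the standard vertex-collision model (distinct vertices), which is why the problem strictly generalizes unlabeled MAPF.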

[MA-12] Digital Identity for Agentic Systems: Toward a Portable Authorization Standard for Autonomous Agents

[Quick Read]: This paper addresses the problem that identity alone is insufficient when enterprise autonomous agents operate across organizational boundaries: an agent's authority must also be explicit, constrained, auditable, revocable, and consistently interpretable by independent receivers. It exposes structural gaps in existing identity and access models through representative scenarios in insurance claims processing and supply chain integrity. The key to the solution is a portable authorization model that separates the credential container, the authorization payload semantics, and the enforcement engine, built on issuer-authored authorization payloads, a typed constraint algebra, decision-consistent evaluation semantics, delegation attenuation, governed semantic resolution, fail-closed processing, and pre-flight discovery, so that profiles such as JWT/JWS, Verifiable Credentials, OAuth Rich Authorization Requests, or policy-engine bindings preserve a common authorization meaning across trust boundaries.

Link: https://arxiv.org/abs/2605.11487
Authors: Partha Madhira
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 46 pages, 10 figures

Abstract:Enterprise AI is shifting from copilots to autonomous agents capable of executing workflows, negotiating outcomes, and making decisions with limited human oversight. As these systems extend across organizational boundaries, identity alone is insufficient: an agent’s authority must also be explicit, constrained, auditable, revocable, and consistently interpretable by independent receivers. This paper analyzes representative enterprise use cases in insurance claims processing and supply chain integrity to surface structural gaps in existing identity and access models. It proposes a portable authorization model for autonomous agents based on issuer-authored authorization payloads, typed constraint algebra, decision-consistent evaluation semantics, delegation attenuation, governed semantic resolution, fail-closed processing, and pre-flight discovery. The model separates credential containers, authorization payload semantics, and enforcement engines, allowing profiles such as JWT/JWS, Verifiable Credentials, OAuth Rich Authorization Requests, or policy-engine bindings to preserve a common authorization meaning across trust boundaries.
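Fail-closed evaluation of a typed constraint payload can be sketched as follows. The constraint types, the evaluator registry, and the payload shape are hypothetical; the sketch illustrates only the fail-closed rule that a constraint the receiver cannot interpret must deny the request:

```python
from datetime import datetime, timezone

# Hypothetical typed-constraint evaluators; any type not registered here
# is unknown to this receiver.
EVALUATORS = {
    "amount_max": lambda limit, req: req.get("amount", float("inf")) <= limit,
    "action_in": lambda allowed, req: req.get("action") in allowed,
    "not_after": lambda iso, req: datetime.now(timezone.utc)
                 <= datetime.fromisoformat(iso),
}

def authorize(payload, request):
    """Fail-closed evaluation: deny if ANY constraint fails OR if the
    receiver cannot interpret a constraint type at all."""
    for ctype, cvalue in payload["constraints"].items():
        evaluator = EVALUATORS.get(ctype)
        if evaluator is None:
            return False  # unknown type -> fail closed, never skip
        if not evaluator(cvalue, request):
            return False
    return True

payload = {"constraints": {"amount_max": 500, "action_in": ["quote", "order"]}}
print(authorize(payload, {"action": "order", "amount": 120}))   # True
print(authorize(payload, {"action": "refund", "amount": 120}))  # False
# An unrecognized constraint type denies the whole request:
print(authorize({"constraints": {"geo_fence": "EU"}}, {}))      # False
```

This is the opposite of fail-open policy engines that skip clauses they do not understand; here unknown semantics narrow authority rather than widen it.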

[MA-13] Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

[Quick Read]: This paper addresses the lack of a pre-inference diagnostic for choosing the communication topology (chain, star, mesh, or richer variants) when deploying multi-agent LLM systems; existing evaluation can only answer post hoc, and only for the measured task, how a topology affects drift, consensus convergence, and robustness under perturbation. The key contribution is a structural diagnostic based on the successor representation of the row-stochastic communication operator, whose spectral quantities (spectral radius, spectral gap, and condition number) predict three distinct failure modes: the condition number is a perfect rank-order predictor of perturbation robustness (Spearman r_s = 1.0), the spectral gap partially predicts consensus dynamics (r_s = 0.5), and the spectral radius is perfectly inverted with respect to cumulative error (r_s = -1.0), an inversion traced to linear spectra being blind to non-contracting bias drift, for which the paper proposes an affine-noise extension of the predictive map that recovers the empirical ordering. The work is a first step toward representational, drift-aware structural diagnostics for multi-agent LLM systems, sitting alongside classical spectral and consensus theory.

链接: https://arxiv.org/abs/2605.11453
作者: Ethan David James Park,Dalal Alharthi
机构: University of Arizona (亚利桑那大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Spectral Theory (math.SP)
备注:

点击查看摘要

Abstract:Practitioners deploying multi-agent large language model (LLM) systems must currently choose between communication topologies such as chain, star, mesh, and richer variants without any pre-inference diagnostic for which topology will amplify drift, converge to consensus, or remain robust under perturbation. Existing evaluation answers these questions only post hoc and only for the task measured. We introduce a structural diagnostic for multi-agent LLM communication graphs based on the successor representation M = (I - \gamma P)^-1 of the row-stochastic communication operator, and we connect three of its spectral quantities, the spectral radius \rho(M) , the spectral gap \Delta(M) , and the condition number \kappa(M) , to three distinct failure modes. We derive closed-form spectra for the chain, star, and mesh under row-stochastic normalization, and validate the predictions on a 12-step structured state-tracking task with Qwen2.5-7B-Instruct over 100 independent trials. The condition number is a perfect rank-order predictor of empirical perturbation robustness ( r_s = 1.0 ); the spectral gap partially predicts consensus dynamics ( r_s = 0.5 ); and the spectral radius is perfectly \emphinverted with respect to cumulative error ( r_s = -1.0 ). We trace this inversion to a regime in which linear spectra are blind to non-contracting bias drift, and we propose an affine-noise extension of the predictive map that recovers the empirical ordering. We read this as a first step toward representational, drift-aware structural diagnostics for multi-agent LLM systems, sitting alongside classical spectral and consensus theory.
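The diagnostic quantities are straightforward to compute before any inference. A minimal NumPy sketch, assuming undirected adjacency matrices normalized to row-stochastic operators and gamma = 0.9 (the paper's closed-form spectra are not reproduced here):

```python
import numpy as np

def successor(P, gamma=0.9):
    """Successor representation M = (I - gamma * P)^{-1}."""
    return np.linalg.inv(np.eye(P.shape[0]) - gamma * P)

def spectral_stats(M):
    """Spectral radius rho, spectral gap Delta, condition number kappa."""
    eig = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]
    return eig[0], eig[0] - eig[1], np.linalg.cond(M)

def row_stochastic(A):
    return A / A.sum(axis=1, keepdims=True)

n = 5
chain = np.zeros((n, n))
for i in range(n - 1):
    chain[i, i + 1] = chain[i + 1, i] = 1.0
star = np.zeros((n, n))
star[0, 1:] = star[1:, 0] = 1.0
mesh = np.ones((n, n)) - np.eye(n)

for name, A in [("chain", chain), ("star", star), ("mesh", mesh)]:
    rho, gap, kappa = spectral_stats(successor(row_stochastic(A)))
    print(f"{name}: rho={rho:.3f} gap={gap:.3f} kappa={kappa:.3f}")
```

Running this shows how the gap and condition number separate the three topologies; the 12-step state-tracking experiments and the affine-noise extension are of course beyond this sketch.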

[MA-14] Information and Contract Design for Repeated Interactions between Agents with Misaligned Incentives IJCAI2026

【速读】:该论文试图解决在多智能体系统中,由于信息不对称(information asymmetries)和激励不匹配(misaligned incentives)导致的沟通效率低下与决策失效问题。具体而言,模型聚焦于一个持有私有信息但无法采取行动的发送者(Sender)与一个依赖该信息做出决策的接收者(Receiver)之间的交互,核心挑战在于发送者需要学习一种接收者能够可靠执行的优化沟通策略,而该策略对奖励冲突程度和接收者已观察到的环境信息高度敏感。解决方案的关键在于引入一种线性契约(linear contract)机制,允许智能体之间为信息建立价格,从而通过支付结构来调节信息传递与奖励分配。实验表明,发送者能够学会利用这一价格机制提升自身奖励,但代价是牺牲智能体之间的“公平性”(fairness),因为发送者会提取接收者的大部分剩余价值。这一发现对多智能体系统中的公平性、契约设计及学习机制提出了重要问题。

链接: https://arxiv.org/abs/2605.11294
作者: Nanda Kishore Sreenivas,Kate Larson
机构: 未知
类目: Multiagent Systems (cs.MA)
备注: Accepted to IJCAI 2026

点击查看摘要

Abstract:We study the consequences of information asymmetries and misaligned incentives in settings with multiple independent agents. We model an interaction between a Sender, who holds vital private information but cannot act, and a Receiver, who must make decisions but is dependent on the Sender’s information. We find that the Sender learns an optimal communication strategy that the Receiver reliably acts on. Importantly, this strategy is highly sensitive to the degree of conflict in the agents’ rewards and the amount of environmental information the Receiver can already observe. We introduce a mechanism allowing the agents to form linear contracts, where a price is established for the information. We demonstrate that the Sender learns to use these payment structures to improve its rewards, though this comes at a cost of “fairness” between agents as the Sender is able to extract much of the Receiver’s surplus. This raises questions about fairness, contract design, and learning in the context of multi-agent systems.

[MA-15] PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

【速读】:该论文旨在解决基于大语言模型(LLM)的智能体在生成长期计划时因不可行动作、约束违反及累积错误而导致的计划执行不匹配问题。其解决方案的关键在于提出PIVOT(Plan-Inspect-eVOlve Trajectories)框架,该框架将智能体轨迹视为可通过环境交互迭代优化的对象,通过四个阶段(生成候选轨迹、执行并计算带有文本梯度(textual gradients)的结构化损失以编码计划-执行差异、应用信号生成改进轨迹、最终全局验证)以及单调接受过程(monotonic acceptance process)实现自监督的轨迹优化,从而有效缩小计划与执行之间的差距。

链接: https://arxiv.org/abs/2605.11225
作者: Tuo Zhang,Alin-Ionut Popa,Yan Xu,Rui Song,Dimitrios Dimitriadis
机构: Amazon(亚马逊)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large language model (LLM)-based agents frequently generate seemingly coherent plans that fail upon execution due to infeasible actions, constraint violations, and compounding errors over extended horizons. PIVOT (Plan-Inspect-eVOlve Trajectories) addresses this plan-execution misalignment through a self-supervised framework that treats trajectories as optimizable objects iteratively refined via environment interaction. The framework comprises four stages: PLAN generates candidate trajectories; INSPECT executes them and computes structured losses with textual gradients encoding plan-execution discrepancies; EVOLVE applies these signals to produce improved trajectories; and VERIFY performs a final global check against task constraints. A monotonic acceptance process ensures a non-decreasing solution quality. Empirical evaluations on DeepPlanning and GAIA demonstrate state-of-the-art performance: with human-in-the-loop (HITL) feedback, PIVOT establishes a strong upper bound up to 94% relative improvement in constraint satisfaction, while its fully autonomous variant retains substantial gains, showing that the core trajectory-refinement mechanism remains effective without external supervision. At the same time, PIVOT remains computationally efficient, requiring up to 3x to 5x fewer tokens than competing refinement methods. These findings establish that (self- or human-supervised) feedback-based trajectory optimization is a principled methodology for mitigating plan-execution gaps in autonomous agent systems.

[MA-16] Multi-Agent System Identification with Nonlinear Sheaf Diffusion

【速读】:该论文试图解决从多智能体系统的轨迹数据中唯一恢复局部交互律(local interaction laws)的困难性,核心障碍在于节点级轨迹观测仅体现边缘力的聚合效应,使得在节点层面一致的多种交互律不可区分。解决方案的关键在于:通过非线性层拉普拉斯(nonlinear sheaf Laplacian)的框架,利用层上同调(sheaf cohomology)来量化恢复的拓扑障碍——当层上同调为零时,从无约束函数类中可实现唯一恢复;当障碍非平凡时,则需在有限维参数化类中通过确保数据依赖的信息矩阵(data-dependent information matrix)正定来实现恢复。

链接: https://arxiv.org/abs/2605.11204
作者: Nivar Anwer,Hans Riess,Matthew Hale
机构: Georgia Tech (佐治亚理工学院)
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Algebraic Topology (math.AT)
备注:

点击查看摘要

Abstract:Local interaction laws governing multi-agent systems can be difficult to recover from trajectory data, even when the dynamics are observed faithfully. In systems governed by a nonlinear sheaf Laplacian – a generalization of the graph Laplacian accommodating heterogeneous state spaces and asymmetric communication channels – the coordination law is encoded by edge potential functions whose gradients produce the inter-agent forces. Because trajectory observations record node-state evolution, they expose only the aggregate effect of the edge forces at each node: distinct interaction laws that agree at the node level are indistinguishable from trajectory data alone. We show that the fundamental obstruction to recovery is topological, measured by sheaf cohomology, and that unique recovery from an unconstrained function class is possible if and only if this cohomology vanishes. When the obstruction is nontrivial, we show that recovery within a finite-dimensional parameterized class is possible precisely when a data-dependent information matrix is positive definite. Experiments validate the theory and illustrate that accurate trajectory reproduction need not certify recovery of the underlying interaction law.
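For the special case of a constant sheaf on a graph (one-dimensional stalks, identity restriction maps), the cohomological obstruction reduces to the cycle rank, which a few lines of linear algebra can check. This is only a toy instance of the paper's setting; heterogeneous stalks and nonlinear edge potentials are not modeled here.

```python
import numpy as np

def coboundary(n_nodes, edges):
    """Signed incidence (coboundary) matrix d: C^0 -> C^1 for the
    constant R-sheaf; rows index edges, columns index nodes."""
    d = np.zeros((len(edges), n_nodes))
    for r, (u, v) in enumerate(edges):
        d[r, u], d[r, v] = -1.0, 1.0
    return d

def h1_dim(n_nodes, edges):
    """dim H^1 = dim C^1 - rank(d): the number of independent cycles,
    i.e. the obstruction to unique recovery in this toy case."""
    d = coboundary(n_nodes, edges)
    return len(edges) - np.linalg.matrix_rank(d)

tree = [(0, 1), (1, 2), (2, 3)]           # acyclic: obstruction vanishes
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]  # one cycle: nontrivial H^1
print(h1_dim(4, tree), h1_dim(4, cycle))  # 0 1
```

On the tree the obstruction vanishes, so edge forces are uniquely determined by node-level observations; on the cycle a one-dimensional family of interaction laws produces identical node dynamics.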

[MA-17] The Price of Proportional Representation in Temporal Voting IJCAI

【速读】:该论文旨在解决时间投票模型(temporal voting model)中比例代表制(proportional representation)与社会福利(social welfare)之间的冲突问题,具体量化了强制实施比例代表制公理(如合理代表JR及其变体)所导致的社会福利效率损失,并探究不同强度公理下这种损失的渐进特性。解决方案的关键在于:通过定义可实现的最大功利主义福利(utilitarian welfare)与受公理约束的最大福利之间的最坏情况比率(worst-case ratio),形式化刻画了福利与比例性之间的张力;并证明在时间设定下,JR引入的福利损失随轮次或选民数量增长至多亚线性,且在时间 horizon 趋于无穷时渐近消失,而更强的公理下冲突则持续存在;此外,论文还揭示了即使在静态偏好和有界度审批下,各公理下的福利最大化问题均为NP完全且APX难,但提出了基于若干自然结构参数的固定参数算法(fixed-parameter algorithms)。

链接: https://arxiv.org/abs/2605.11157
作者: Nicholas Teh
机构: University of Oxford (牛津大学)
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
备注: Appears in the 35th International Joint Conference on Artificial Intelligence (IJCAI), 2026

点击查看摘要

Abstract:We study proportional representation in the temporal voting model, where collective decisions are made repeatedly over time over a fixed horizon. Prior work has extensively investigated how proportional representation axioms from multiwinner voting (e.g., justified representation (JR) and its variants) can be adapted, satisfied, and verified in this setting. However, much less is understood about their interaction with social welfare. In this work, we quantify the efficiency cost of enforcing proportionality. We formalize the welfare-proportionality tension via the worst-case ratio between the maximum achievable utilitarian welfare and the maximum welfare attainable subject to a proportionality axiom. We show that imposing proportional representation in the temporal setting can incur a growing, yet sublinear, welfare loss as the number of voters or rounds increases. We further identify a clean separation among axioms: for JR, the welfare loss diminishes as the time horizon grows and vanishes asymptotically, whereas for stronger axioms this conflict persists even with many rounds. Moreover, we prove that welfare maximization under each axiom is NP-complete and APX-hard, even under static preferences and bounded-degree approvals, and provide fixed-parameter algorithms under several natural structural parameters.
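The worst-case ratio can be made concrete in a single-round multiwinner analogue (a deliberate simplification of the temporal model): brute-force all committees, then compare the unconstrained welfare optimum with the best JR-satisfying committee. The instance below is illustrative, not from the paper.

```python
from itertools import combinations

def welfare(committee, approvals):
    """Utilitarian welfare: total number of approved committee members."""
    return sum(len(set(committee) & A) for A in approvals)

def satisfies_jr(committee, approvals, candidates, k):
    """JR fails iff >= n/k voters all approve some common candidate yet
    none of them approves any committee member."""
    n = len(approvals)
    unrep = [A for A in approvals if not set(committee) & A]
    return all(sum(1 for A in unrep if c in A) * k < n for c in candidates)

candidates, k = range(5), 3
approvals = [{0, 1, 2}] * 4 + [{3}] * 2   # majority bloc vs. small cohesive group

best_w = best_jr = 0
for committee in combinations(candidates, k):
    w = welfare(committee, approvals)
    best_w = max(best_w, w)
    if satisfies_jr(committee, approvals, candidates, k):
        best_jr = max(best_jr, w)

print(best_w, best_jr)  # 12 10 -> enforcing JR costs a factor of 1.2 here
```

The welfare-optimal committee {0,1,2} leaves the cohesive {3}-group unrepresented and so violates JR; the best JR committee sacrifices welfare, which is exactly the ratio the paper bounds over rounds and voters.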

[MA-18] Control Charts for Multi-agent Systems

【速读】:该论文试图解决开放环境下多智能体系统(multi-agent systems)动态监测的自动化问题,因为当前方法仅限于定性检查(qualitative inspection)。解决方案的关键在于将过程理论中的自适应控制图(adaptive control charts)扩展至多智能体系统,实现自动监测。然而,论文通过仿真和理论分析表明,自适应控制图虽然能从环境中学习,但对缓慢叛变的对抗智能体(adversarial agents)具有脆弱性,从而揭示了多智能体系统控制中的一个根本权衡:要么系统中的智能体不能学习,要么系统容易受到对抗者的攻击。

链接: https://arxiv.org/abs/2605.11135
作者: Hayden Helm,Carey Priebe,Brandon Duderstadt
机构: Helivan(赫利凡); Johns Hopkins University (约翰霍普金斯大学); Calcifer Computing (卡尔西弗计算)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative agents have proven to be powerful assistants in a wide variety of contexts. Given this success, users are now deploying agents with minimal restrictions in open ended, multi-agent environments. Current methods for monitoring the dynamics of open-ended multi-agent systems are limited to qualitative inspection. In this paper, we extend the process-theoretic notion of adaptive control charts to multi-agent systems to enable automated monitoring. Using simulation, we demonstrate that adaptive control charts are necessary for monitoring multi-agent systems that can learn from their environment. We further demonstrate, both empirically and theoretically, that adaptive control charts are susceptible to adversarial agents that defect sufficiently slowly. These results illustrate a fundamental tradeoff in multi-agent system control: either agents in a system cannot learn or the system is susceptible to adversaries.
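The tradeoff can be reproduced with a stylized adaptive EWMA chart (a standard construction used here for illustration, not necessarily the paper's exact chart): because the center line learns the environment, an abrupt defection is flagged immediately while a sufficiently slow one is absorbed into the baseline.

```python
def ewma_chart(series, lam=0.2, L=3.0, sigma=1.0, adapt=0.05):
    """EWMA chart whose center line mu adapts while the process looks
    in-control; alarms when the EWMA statistic z leaves mu +/- width."""
    mu, z, alarms = 0.0, 0.0, []
    width = L * sigma * (lam / (2 - lam)) ** 0.5   # asymptotic EWMA limits
    for t, x in enumerate(series):
        z = (1 - lam) * z + lam * x
        if abs(z - mu) > width:
            alarms.append(t)
        else:
            mu = (1 - adapt) * mu + adapt * z      # learn the environment
    return alarms

abrupt = [0.0] * 100 + [8.0] * 100            # agent defects suddenly at t=100
slow   = [0.005 * t for t in range(200)]      # agent defects slowly

print("abrupt first alarm:", ewma_chart(abrupt)[0])   # t=100
print("slow alarms:", ewma_chart(slow))               # [] -- drift absorbed
```

Turning adaptation off (`adapt=0`) would catch the slow defector but raise false alarms for any benign learning agent, which is the fundamental tradeoff the paper formalizes.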

[MA-19] GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms

【速读】:该论文试图解决的问题是:现有科学发现智能体系统(agentic AI systems)将每个科学问题孤立处理,缺乏跨领域共享方法论经验的统一基底,导致无法从已解决问题中积累经验以改进未来决策。解决方案的关键在于提出 GRAFT-ATHENA 框架,其核心组件 GRAFT(Graph Reduction to Adaptive Factored Trees)通过将组合决策空间投影为因子化概率树(factored probabilistic trees),使每个方法对应一条单一路径,从而将参数复杂度从指数级降至线性级;同时,该因子化结构作为策略的 I-map(I-map of the policy),将路径嵌入为度量空间中的唯一指纹(fingerprints),使得新问题能通过与过去相似问题的距离度量来学习先验知识,从而实现跨领域自演进的动作空间扩展。

链接: https://arxiv.org/abs/2605.11117
作者: Juan Diego Toscano,Zhaojie Chai,George Em Karniadakis
机构: 未知
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Probability (math.PR)
备注:

点击查看摘要

Abstract:Scientific discovery can be modeled as a sequence of probabilistic decisions that map physical problems to numerical solutions. Recent agentic AI systems automate individual scientific tasks by orchestrating LLM-driven planners, solvers, and evaluators. Each method is a combination of methodological actions, with many viable combinations for any given problem and structural dependencies between choices. However, existing frameworks treat each problem in isolation, with no shared substrate to accumulate methodological experience across domains. Here we show that GRAFT-ATHENA, a self-improving agentic framework, learns from past problems and autonomously expands its own action space across diverse domains. GRAFT (Graph Reduction to Adaptive Factored Trees) projects combinatorial decision spaces into factored probabilistic trees in which each method is a single path, taking the parameter footprint from exponential to linear. In the lineage of classical Bayesian networks, the factorization is an I -map of the policy, and the resulting paths embed as unique fingerprints in a metric space whose closeness lets each new problem learn from similar past ones. On canonical physics-informed machine learning (PIML) benchmarks, GRAFT-ATHENA improves over human and prior agentic baselines, and on production solvers, it tackles complex engineering problems such as reconstructing Mach-10 flow over the Apollo Command Module from a 1968 report and recovering shear-thinning blood-cell rheology. Notably, the system grows its own knowledge substrate, autonomously proposing regularization constraints for ill-posed inverse problems and discovering new numerical methods such as a spectral PINN with exponential convergence. These results provide a foundation for autonomous laboratories that grow more capable with every problem they solve.
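The exponential-to-linear reduction that GRAFT performs can be sketched in a few lines. A method is one root-to-leaf path of choices; an unfactored policy stores a probability per path (b^d parameters), while a factored tree stores one categorical per decision level (d*b parameters). The uniform per-level probabilities below are illustrative only; GRAFT's actual factorization conditions on structural dependencies between choices.

```python
def path_prob(path, level_probs):
    """Probability of a method under the factored model: product of
    per-level choice probabilities along its path."""
    p = 1.0
    for level, choice in enumerate(path):
        p *= level_probs[level][choice]
    return p

b, d = 4, 6                    # 4 methodological options at each of 6 levels
joint_params = b ** d          # 4096: exponential footprint of the full joint
factored_params = d * b        # 24: linear footprint under the factorization

level_probs = [[0.4, 0.3, 0.2, 0.1]] * d
print(joint_params, factored_params, path_prob((0,) * d, level_probs))
```

Each path also serves as a fingerprint: two methods are close when their choice sequences are, which is what lets a new problem borrow priors from similar past ones.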

[MA-20] An Executable Benchmarking Suite for Tool-Using Agents

【速读】:该论文试图解决可执行环境(如Web、代码和微任务)中闭环工具使用代理评估时,基准报告常常混淆工作负载(workloads)、动作生成驱动程序(action-generating drivers)以及用于系统级主张的证据(evidence)这一问题。解决方案的关键是提出一个可执行的基准测试套件(executable benchmarking suite),通过一个共享的证据接纳契约(evidence-admission contract)将这些对象显式区分。该套件通过通用工作负载适配器、任务清单、事件模式、重放/冻结策略、声明的驱动程序和报告管道,将WebArena Verified、SWE-Gym片段(带SWE-bench兼容验证)和MiniWoB++连接起来,并在标准发布中采用“门”(gate)机制将面向论文的证据与预检、夹具、烟雾和诊断行分离,同时保留非接纳的人工制品用于审计和入门,确保接纳证据记录延迟、无效动作行为、补丁生成成本等关键指标,从而提供决策相关的、可审计的评估基础。

链接: https://arxiv.org/abs/2605.11030
作者: Zhiqing Zhong,Zhijing Ye,Jiamin Wang,Xiaodong Yu
机构: Stevens Institute of Technology (史蒂文斯理工学院)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 20 pages, 2 figures, 20 tables, including appendices

点击查看摘要

Abstract:Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, task manifests, event schemas, replay/freeze policy, declared drivers, and reporting pipelines. In the canonical release, the gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows while preserving non-admitted artifacts for audit and onboarding. The admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance under one auditable contract. The gate is decision-relevant rather than merely clerical: in a separate WebArena Verified controller study, clean-baseline and medium live-stressed evaluation select different fixed controller variants under the same workload and admission contract. The release is scoped as a benchmarking suite and admitted evidence, not a new agent policy, model leaderboard, backend comparison, or autonomous SWE-bench solver.
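The evidence-admission gate can be pictured as a row filter over a shared event schema. The field names and row kinds below are hypothetical stand-ins for the suite's actual contract; the point is only that non-admitted kinds and schema-incomplete rows are preserved for audit, never admitted as paper-facing evidence.

```python
NON_ADMITTED = {"preflight", "fixture", "smoke", "diagnostic"}

def gate(rows):
    """Split event rows into paper-facing evidence and audit-only
    artifacts; rows missing required fields are never admitted."""
    required = {"task_id", "latency_ms", "verifier", "provenance"}
    admitted, audit = [], []
    for row in rows:
        if row.get("kind") in NON_ADMITTED or not required <= row.keys():
            audit.append(row)
        else:
            admitted.append(row)
    return admitted, audit

rows = [
    {"kind": "evidence", "task_id": "wa-01", "latency_ms": 840,
     "verifier": "webarena-v", "provenance": "run-7"},
    {"kind": "smoke", "task_id": "wa-02"},
    {"kind": "evidence", "task_id": "wa-03"},   # schema-incomplete -> audit
]
admitted, audit = gate(rows)
print(len(admitted), len(audit))  # 1 2
```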

[MA-21] SkillGen: Verified Inference-Time Agent Skill Synthesis

【速读】:该论文旨在解决大型语言模型(LLM)智能体在无需重新训练的前提下提升能力时,高质量技能仍主要依赖人工编写的问题。其核心解决方案是提出SkillGen这一多智能体框架,通过对比归纳(contrastive induction)方法,同时利用智能体在成功和失败轨迹中的模式,识别出可复用的成功模式、重复出现的失败模式以及在邻近成功中出现但失败中缺失的行为,从而生成候选技能并迭代优化。关键创新在于将技能建模为干预(intervention),通过在同一实例上比较有无技能时的结果来实证评估技能的净效果,同时量化修复(即技能修正基线失败)和回归(即技能破坏基线成功)两种效应,确保技能既能纠正错误又不引入新问题。

链接: https://arxiv.org/abs/2605.10999
作者: Yuchen Ma,Yue Huang,Han Bao,Haomin Zhuang,Swadheen Shukla,Michel Galley,Xiangliang Zhang,Stefan Feuerriegel
机构: Munich Center for Machine Learning, LMU Munich (慕尼黑机器学习中心, 慕尼黑大学); University of Notre Dame (圣母大学); Microsoft Research (微软研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Skills are a promising way to improve LLM agent capabilities without retraining, while keeping the added procedure reusable and controllable. However, high-quality skills are still largely written by hand. We introduce SkillGen, a multi-agent framework that synthesizes a single auditable skill from trajectories generated by a base agent. The output is a human-readable artifact that can be inspected before use. Rather than merely summarizing trajectories, SkillGen leverages contrastive induction over both successful and failed trajectories to identify reusable success patterns, recurring failure modes, and behaviors that appear in nearby successes but are missing from failures. SkillGen then generates candidate skills and iteratively refines the skill. A key novelty in SkillGen is that we model agent skills as interventions to empirically verify the net effect of skills on the overall performance. Specifically, we compare outcomes on the same instances with and without the skill, so that we account for both repairs (cases where the skill fixes a baseline failure) and regressions (cases where the skill breaks a baseline success). Across a broad range of agents and datasets, SkillGen consistently improves held-out performance, outperforms existing skill-generation baselines, and produces skills that transfer across models.

[MA-22] Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

【速读】:该论文旨在解决少样本多模态时间序列分类(few-shot multimodal time series classification)中因数据稀疏和分布偏移导致的上下文崩塌(context collapse)与推理偏差问题。其解决方案的关键在于提出首个基于视觉-语言模型(VLM)的智能体推理框架 MarsTSC,该框架通过引入自进化知识库(self-evolving knowledge bank)作为动态上下文,并采用反思性智能体推理(reflective agentic reasoning)进行迭代优化。具体而言,框架包含三个协作角色:生成器(Generator)通过推理执行可靠分类,反思器(Reflector)诊断错误根因并提取被生成器忽略的时序特征判别性洞察,修改器(Modifier)对知识库进行验证性更新以防止上下文崩塌。此外,框架还设计了测试时更新策略(test-time update strategy),在推理阶段谨慎且持续地精炼知识库,以缓解少样本偏差和分布偏移。该方案在12个主流时间序列基准上基于6种VLM骨干实现了显著且一致的性能提升。

链接: https://arxiv.org/abs/2605.09395
作者: Lin Li,Jiawei Huang,Qihao Quan,Dan Li,Boxin Li,Xiao Zhang,Erli Meng,Wenjie Feng,Jian Lou,See-Kiong Ng
机构: Sun Yat-sen University (中山大学); Xiaomi Corporation (小米公司); University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Multimedia (cs.MM)
备注: 18 pages, 12 figures, 6 tables. Preprint

点击查看摘要

Abstract:In this paper, we propose the first VLM agentic reasoning framework for few-shot multimodal Time Series Classification (MarsTSC), which introduces a self-evolving knowledge bank as a dynamic context iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy to enable cautious, continuous knowledge bank refinement to mitigate few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that MarsTSC delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions, while producing interpretable rationales that ground each classification decision in human-readable feature evidence.

自然语言处理

[NLP-0] LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

【速读】: 该论文试图解决现有记忆基准主要关注用户历史、短轨迹或下游任务成功,而缺乏直接评估记忆系统是否有效内化环境特定经验的问题,尤其是在专门网络环境中智能体需要长期记忆来回忆起界面功能、状态动态、工作流程和重复失败模式。解决方案的关键是提出了LongMemEval-V2(LME-V2)基准,包含451个涵盖五种核心记忆能力的手工问题(静态状态回忆、动态状态追踪、工作流知识、环境陷阱和前提意识),并配有多达500条轨迹和115M token的历史轨迹。在此基础上,设计了两种记忆方法:AgentRunbook-R,一种基于RAG的高效记忆方法,利用知识池存储原始状态观察、事件和策略笔记;以及AgentRunbook-C,将轨迹存储为文件并调用编码代理在增强沙箱中收集证据。其中AgentRunbook-C以72.5%的平均准确率显著优于其他基线,实现了精度-延迟帕累托前沿的推进。

链接: https://arxiv.org/abs/2605.12493
作者: Di Wu,Zixiang Ji,Asmi Kawatkar,Bryan Kwan,Jia-Chen Gu,Nanyun Peng,Kai-Wei Chang
机构: University of California, Los Angeles(加州大学洛杉矶分校)
类目: Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories containing up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding agent based methods have high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.

[NLP-1] MEME: Multi-entity Evolving Memory Evaluation

【速读】: 该论文试图解决大语言模型(LLM)智能体在持久环境中面对多实体(multi-entity)和演化(evolving)场景时,现有记忆系统在依赖推理(dependency reasoning)任务(如Cascade和Absence)上完全失效的问题;尽管静态检索性能尚可,但在默认配置下依赖推理平均准确率极低(Cascade仅3%,Absence仅1%),且提示优化、更深检索、减少填充噪音及更强LLM均无法弥合差距。解决方案的关键在于:当前唯一能部分弥合该差距的方法是基于文件系统(file-based agent)配合Claude Opus 4.7作为内部LLM,但其成本约为基线方案的70倍,导致该方案无法大规模实践,因此目前尚无可实用的、可扩展的解决方案。

链接: https://arxiv.org/abs/2605.12477
作者: Seokwon Jung,Alexander Rubinstein,Arnas Uselis,Sangdoo Yun,Seong Joon Oh
机构: KAIST AI (KAIST AI); Tübingen AI Center, University of Tübingen (图宾根大学图宾根AI中心); NAVER AI Lab (NAVER AI实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: this https URL.

[NLP-2] Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

【速读】: 该论文的核心问题是解决稀疏混合专家模型(Sparse Mixture-of-Experts, SMoE)训练中路由崩溃和辅助负载均衡损失降低专家专门化之间的矛盾。解决方案的关键在于揭示了路由器与对应专家之间存在的几何耦合(geometric coupling)机制:对于给定token,所选专家的路由器权重和专家权重沿同一输入方向接收梯度,仅标量系数不同,从而使得路由决策被编码在所选专家的内部激活中。通过一个无参数在线K-Means路由器,论文证明了利用这种几何耦合可以实现无需辅助损失的低负载不平衡路由,表明几何耦合是路由器学习有效分工的核心机制。

链接: https://arxiv.org/abs/2605.12476
作者: Sagi Ahrac,Noya Hochwald,Mor Geva
机构: Blavatnik School of Computer Science and AI, Tel Aviv University (特拉维夫大学布拉瓦特尼克计算机科学与人工智能学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router–expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a 1 B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze the effects of auxiliary load balancing on the router–expert geometric coupling, showing that such losses break this structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form assignment geometry that supports an effective division of labor.
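The parameter-free router is simple enough to sketch directly: each expert maintains a running mean of the hidden states routed to it, and a token goes to the expert whose centroid is most cosine-similar. The initialization and toy token clusters below are illustrative, not the paper's training setup.

```python
import numpy as np

class KMeansRouter:
    """Online K-Means router sketch: centroids are running means of
    routed hidden states; assignment is by cosine similarity."""

    def __init__(self, n_experts, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.normal(size=(n_experts, dim))
        self.counts = np.zeros(n_experts)

    def route(self, h):
        unit = self.centroids / np.linalg.norm(self.centroids, axis=1, keepdims=True)
        e = int(np.argmax(unit @ (h / np.linalg.norm(h))))
        self.counts[e] += 1
        self.centroids[e] += (h - self.centroids[e]) / self.counts[e]  # online mean
        return e

rng = np.random.default_rng(1)
router = KMeansRouter(n_experts=4, dim=16)
a = rng.normal(size=16)
for _ in range(20):
    router.route(a + 0.05 * rng.normal(size=16))    # one token cluster
    router.route(-a + 0.05 * rng.normal(size=16))   # an opposing cluster
print("per-expert load:", router.counts)
```

Because a centroid equals the mean of its routed token history, this router is the geometric-coupling structure made explicit, which is the paper's argument for why it recovers much of what a learned router does.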

[NLP-3] KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

【速读】: 该论文试图解决预训练Transformer在不修改架构或重新训练的情况下进行稳定、高效的长上下文推理的问题,特别是克服流式方法因固定内存预算而丢失远程信息、以及高成本全注意力(full attention)在长序列中内存爆炸的困境。解决方案的关键在于提出了KV-Fold协议:将键值缓存(KV cache)视为左折叠(left fold)中的累加器,按顺序处理序列分块(chunk),每次过程将当前块与之前累积的缓存拼接后输入模型,生成的新键值追加到缓存并传递到下一块,从而形成块到块的递归(chunk-to-chunk recurrence)。这一递归天然稳定,每步偏移短暂上升后即饱和为平坦平台,且对数值精度(10⁶倍变化)、分块大小和模型族鲁棒;在任务层面,KV-Fold能在单卡40GB显存限制下实现长距离精确检索(如128K token上下文中100%精确匹配),无需训练或架构改动,仅依赖冻结预训练Transformer自带的自注意力机制即可支持稳定的KV缓存递归。

链接: https://arxiv.org/abs/2605.12471
作者: Alireza Nadali,Patrick Cooper,Ashutosh Trivedi,Alvaro Velasquez
机构: University of Colorado Boulder (科罗拉多大学博尔德分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 3 figures, 6 tables

点击查看摘要

Abstract:We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
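The fold itself can be demonstrated with a toy single-head attention layer in place of a full transformer (real implementations would carry `past_key_values` through a pretrained model). The sketch below checks the key exactness property: processing a sequence in chunks with the KV cache as accumulator reproduces the single-pass causal result.

```python
import numpy as np

def attend(q, K, V):
    """Single-head attention of chunk queries q against all cached keys K."""
    T, S = q.shape[0], K.shape[0]
    scores = q @ K.T / np.sqrt(q.shape[-1])
    mask = np.tril(np.ones((T, S)), k=S - T)      # causal: query t sees keys <= its position
    scores = np.where(mask > 0, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def kv_fold(chunks, Wq, Wk, Wv):
    """foldl over chunks with the KV cache as accumulator: attend to the
    accumulated cache plus the current chunk, then append its keys/values."""
    d = Wk.shape[1]
    K, V, outs = np.zeros((0, d)), np.zeros((0, d)), []
    for x in chunks:                               # x: (chunk_len, d)
        K = np.vstack([K, x @ Wk])
        V = np.vstack([V, x @ Wv])
        outs.append(attend(x @ Wq, K, V))
    return np.vstack(outs), (K, V)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(12, d))

folded, _ = kv_fold([x[:4], x[4:8], x[8:]], Wq, Wk, Wv)
full, _ = kv_fold([x], Wq, Wk, Wv)        # one chunk == ordinary causal pass
print(np.allclose(folded, full))           # True: the fold is exact, not approximate
```

Unlike streaming methods that evict cache entries under a fixed budget, the accumulator here only grows, which is why retrieval stays exact at the cost of memory linear in context length.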

[NLP-4] Solve the Loop: Attractor Models for Language and Reasoning

【速读】: 该论文试图解决循环变压器(Looped Transformers)在训练过程中存在的稳定性差、优化与部署成本高昂、以及有效深度固定且受限的问题。解决方案的关键在于引入吸引子模型(Attractor Models),其核心机制是:主干模块(backbone module)首先生成输出嵌入,吸引子模块(attractor module)通过求解不动点(fixed point)来精炼这些嵌入,并利用隐式微分(implicit differentiation)计算梯度。这一设计使得训练内存与有效深度解耦,迭代次数可通过收敛判定自适应选择,从而实现了循环计算的可扩展训练与部署。

链接: https://arxiv.org/abs/2605.12466
作者: Jacob Fein-Ashley,Paria Rashidinejad
机构: University of Southern California (南加州大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3, fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model’s initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
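The fixed-point-plus-implicit-differentiation recipe can be shown on a toy attractor map z = tanh(W z + x) standing in for the attractor module (this is not the paper's architecture, just the gradient mechanics). Differentiating the fixed-point equation gives dz/dx = (I - J W)^{-1} J with J = diag(1 - z^2), so the gradient never backpropagates through solver iterations and memory stays constant in effective depth.

```python
import numpy as np

def solve_fixed_point(W, x, iters=200):
    """Iterate z <- tanh(W z + x); with ||W|| < 1 this contracts to the attractor."""
    z = np.zeros_like(x)
    for _ in range(iters):
        z = np.tanh(W @ z + x)
    return z

def implicit_grad_x(W, z):
    """Implicit function theorem at the fixed point z = tanh(W z + x):
    dz/dx = (I - J W)^{-1} J, where J = diag(1 - z**2) is tanh's Jacobian."""
    J = np.diag(1.0 - z ** 2)
    return np.linalg.solve(np.eye(len(z)) - J @ W, J)

rng = np.random.default_rng(0)
n = 5
W = 0.3 * rng.normal(size=(n, n)) / np.sqrt(n)   # small norm keeps the map contractive
x = rng.normal(size=n)

z = solve_fixed_point(W, x)
G = implicit_grad_x(W, z)

# check the implicit gradient against finite differences through the solver
eps = 1e-6
fd = np.column_stack([(solve_fixed_point(W, x + eps * np.eye(n)[i]) - z) / eps
                      for i in range(n)])
print(np.allclose(G, fd, atol=1e-4))  # True
```

Equilibrium internalization, in these terms, would mean the backbone learns to emit an x whose initial z is already near the fixed point, so the iteration loop can be dropped at inference.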

[NLP-5] Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

【速读】: 该论文旨在解决当前基于语言模型的自主代理(autonomous agents)在单一消息流(single stream of computation)计算模式下的根本性局限:代理无法在阅读的同时生成输出(行动),在输出(写作)时无法实时响应新信息,且思考与行动或阅读相互阻塞。解决方案的关键在于将传统顺序消息格式的指令微调(instruction-tuning)转变为面向多个并行计算流(parallel streams of computation)的指令微调——将“用户”、“系统”、“工具”及“自身(链式思考)”等不同角色拆分为独立的流(stream),使语言模型的每一次前向传播(forward pass)能同时从多个输入流读取并在多个输出流生成令牌(tokens),所有流之间保持因果依赖(causal dependence)于时间步。这一数据驱动的架构变更从根本上解除了角色间的串行依赖,从而提升了代理的并行效率、安全性(通过关注点分离)和可监控性。

链接: https://arxiv.org/abs/2605.12460
作者: Guinan Su,Yanwu Yang,Xueyan Li,Jonas Geiping
机构: Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所); Tübingen AI Center(图宾根人工智能中心); ETH Zurich(苏黎世联邦理工学院); University Hospital Tübingen(图宾根大学医院); University of Tübingen(图宾根大学); ELLIS Institute Tübingen(图宾根ELLIS研究所)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint, 37 pages. Code at this https URL

点击查看摘要

Abstract:The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.
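多流计算的调度逻辑可以用一个玩具模拟来说明(MultiStreamState、toy_policy 等名称均为本文虚构的示意,并非论文代码):每个前向传播只读取各输入流在当前时刻之前的内容(保持因果依赖),并向每个输出流各追加一个 token。

```python
from dataclasses import dataclass, field

@dataclass
class MultiStreamState:
    inputs: dict                                 # 流名 -> token 列表(外部写入)
    outputs: dict = field(default_factory=dict)  # 流名 -> token 列表(模型写入)

def step(state, t, policy):
    """单次前向传播:只读取各输入流在时刻 t 之前的内容,向每个输出流各追加一个 token。"""
    context = {name: toks[:t] for name, toks in state.inputs.items()}
    for name in state.outputs:
        state.outputs[name].append(policy(name, context))

def toy_policy(stream, context):
    """玩具策略:思考流复述用户流最新 token,行动流输出已读 token 计数。"""
    user = context.get("user", [])
    if stream == "thought":
        return f"saw:{user[-1]}" if user else "saw:<none>"
    return f"act:{len(user)}"

state = MultiStreamState(inputs={"user": ["hi", "there"]},
                         outputs={"thought": [], "action": []})
for t in range(1, 3):
    step(state, t, toy_policy)
print(state.outputs["thought"])  # ['saw:hi', 'saw:there']
```

真实系统中各流由同一次注意力计算并行读写,此处仅示意"边读边写、互不阻塞"的调度结构。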

[NLP-6] TextSeal: A Localized LLM Watermark for Provenance Distillation Protection

【速读】: 该论文旨在解决大语言模型(LLMs)生成文本的水印(watermark)问题,即在确保水印检测强度、鲁棒性(如抵抗人机混编稀释)和理论无失真(distortion-free)的同时,不引入推理开销、不影响下游任务性能、不降低输出多样性,并支持水印在模型蒸馏中的放射性(radioactive)传递。解决方案的关键在于:基于Gumbel-max采样,采用双密钥生成(dual-key generation)以恢复被水印压缩的输出多样性,配合熵加权评分(entropy-weighted scoring)与多区域定位(multi-region localization)提升检测性能,并兼容推测解码(speculative decoding)与多token预测(multi-token prediction)等优化技术,从而实现零额外推理成本的强检测能力。

链接: https://arxiv.org/abs/2605.12456
作者: Tom Sander,Hongyan Chang,Tomáš Souček,Tuan Tran,Valeriu Lacatusu,Sylvestre-Alvise Rebuffi,Alexandre Mourachko,Surya Parimi,Christophe Ropers,Rashel Moritz,Vanessa Stark,Hady Elsahar,Pierre Fernandez
机构: FAIR(脸书人工智能研究), Meta Superintelligence Labs(Meta超级智能实验室)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region localization for improved detection. It supports serving optimizations such as speculative decoding and multi-token prediction, and does not add any inference overhead. TextSeal strictly dominates baselines like SynthID-text in detection strength and is robust to dilution, maintaining confident localized detection even in heavily mixed human/AI documents. The scheme is theoretically distortion-free, and evaluation across reasoning benchmarks confirms that it preserves downstream performance, while a multilingual human evaluation (6000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond its use for provenance detection, TextSeal is also "radioactive": its watermark signal transfers through model distillation, enabling detection of unauthorized use.
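TextSeal 所基于的 Gumbel-max 采样水印可用如下草图说明(keyed_gumbel 等函数为本文假设的最小实现;论文中的双密钥生成、熵加权评分与多区域定位此处均未涉及):用密钥与上下文确定性地生成 Gumbel 噪声,生成端取 argmax,检测端复算同一噪声并为实际出现的 token 打分。

```python
import hashlib, math, random

def keyed_gumbel(key, context, vocab_size):
    """由密钥与上下文确定性生成每个词表项的 Gumbel 噪声。"""
    seed = int(hashlib.sha256(f"{key}|{context}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return [-math.log(-math.log(rng.random())) for _ in range(vocab_size)]

def watermark_sample(logits, key, context):
    """生成端:argmax(logit + Gumbel 噪声),分布上等价于按 softmax 采样。"""
    g = keyed_gumbel(key, context, len(logits))
    return max(range(len(logits)), key=lambda v: logits[v] + g[v])

def score_token(token, key, context, vocab_size):
    """检测端:被水印选中的 token 通常携带较大的 Gumbel 值。"""
    return keyed_gumbel(key, context, vocab_size)[token]

tok = watermark_sample([0.0] * 16, "sk", "ctx")  # 均匀 logits:采样完全由水印噪声决定
print(score_token(tok, "sk", "ctx", 16) == max(keyed_gumbel("sk", "ctx", 16)))  # True
```

检测时对全文逐 token 累积这类分数即可得到水印统计量;TextSeal 在此之上引入熵加权与多区域定位以应对人机混编稀释。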

[NLP-7] The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events

【速读】: 该论文旨在解决大型语言模型生成政治话语的检测难题,尤其是当传统基于句子级线索(如困惑度、突发性或标记不规则性)的检测方法因生成系统改进而逐渐失效时,如何评估合成政治话语的真实性。解决方案的关键在于采用计算社会科学视角,通过构建涵盖九个危机事件的配对语料库(真实与合成话语各178万余条),从情感强度、结构规律性、词汇-意识形态框架和跨事件依赖性四个维度进行群体级评估,利用均值差距和离散度证据量化差异,并总结为简单的事件级指标“Caricature Gap”。该方法的本质是转向人口现实主义审计,而非依赖语法或流畅性,从而揭示合成话语的局限性在于群体级不真实,而非单句质量。

链接: https://arxiv.org/abs/2605.12452
作者: Gunjan,Sidahmed Benabderrahmane,Talal Rahwan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) can generate fluent political text at scale, raising concerns about synthetic discourse during crises and social conflict. Existing AI-text detection often focuses on sentence-level cues such as perplexity, burstiness, or token irregularities, but these signals may weaken as generative systems improve. We instead adopt a Computational Social Science perspective and ask whether synthetic political discourse behaves like an observed online population. We construct a paired corpus of 1,789,406 posts across nine crisis events: COVID-19, the Jan. 6 Capitol attack, the 2020 and 2024 U.S. elections, Dobbs/Roe v. Wade, the 2020 BLM protests, U.S. midterms, the Utah shooting, and the U.S.-Iran war. For each event, we compare observed discourse from social platforms with synthetic discourse generated for the same context. We evaluate four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency, using mean gaps and dispersion evidence. Across events, synthetic discourse is fluent but population-level unrealistic. It is generally more negative and less dispersed in sentiment, structurally more regular, and lexically more abstract than observed discourse. Observed discourse instead shows broader emotional variation, longer-tailed structural distributions, and more context-specific, colloquial lexical markers. These differences are event-dependent: larger for fast-moving, decentralized crises and smaller for formal or institutionally mediated events. We summarize them with a simple event-level measure, the Caricature Gap. Our findings suggest that the main limitation of synthetic political discourse is not grammar or fluency, but reduced population realism. Population-level auditing complements traditional text-detection and provides a CSS framework for evaluating the social realism of generated discourse.
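论文未在摘要中给出 Caricature Gap 的具体公式;下面仅以均值差与离散度比为例,示意如何从观测与合成话语的情感分数得到事件级差距指标(组合方式为本文假设):

```python
import statistics

def caricature_gap(observed, synthetic):
    """事件级差距示意:情感均值差 + 合成/观测离散度之比(具体组合方式为假设)。"""
    mean_gap = abs(statistics.mean(observed) - statistics.mean(synthetic))
    disp_ratio = statistics.pstdev(synthetic) / statistics.pstdev(observed)
    return {"mean_gap": mean_gap, "dispersion_ratio": disp_ratio}

obs = [-0.9, -0.2, 0.1, 0.6, 0.8]     # 观测话语:情感变化幅度更大(虚构数据)
syn = [-0.6, -0.5, -0.4, -0.3, -0.2]  # 合成话语:更负面、更集中(虚构数据)
print(caricature_gap(obs, syn))       # dispersion_ratio < 1:合成话语离散度不足
```

与论文结论一致的直觉是:合成话语整体更负面(均值差大)且分布更窄(离散度比小于 1)。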

[NLP-8] ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成回答时即使错误也常表现出高置信度的问题,旨在实现可靠的口头置信度(verbalized confidence)校准,避免现有联合优化方法中置信度对齐目标对答案准确性的干扰。解决方案的关键在于:第一,将置信度估计与答案生成解耦(decoupled),即先生成答案,再基于固定的问答对(question–answer pair)进行置信度估计,从而在优化置信度时不直接扰动答案生成过程;第二,采用排序强化学习(rank-based reinforcement learning)目标,通过从多次模型补全(completions)中构建基于采样的替代(sampling-based surrogate)以估计正确可能性(correctness likelihood),并优化置信度的相对排序,使估计正确可能性更高的回答获得更高的口头置信度。这种方法在保持答案准确性的前提下显著改善了校准与失败预测性能。

链接: https://arxiv.org/abs/2605.12446
作者: Chen Li,Xiaoling Hu,Songzhu Zheng,Jiawei Zhou,Chao Chen
机构: Stony Brook University, NY, USA(美国纽约州立大学石溪分校); Massachusetts General Hospital and Harvard Medical School, MA, USA(美国马萨诸塞州麻省总医院和哈佛医学院); Morgan Stanley, NY, USA(美国纽约摩根士丹利)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 18 pages, 2 figures

点击查看摘要

Abstract:Large language models (LLMs) often produce answers with high certainty even when they are incorrect, making reliable confidence estimation essential for deployment in real-world scenarios. Verbalized confidence, where models explicitly state their confidence in natural language, provides a flexible and user-facing uncertainty signal that can be applied even when token logits are unavailable. However, existing verbalized-confidence methods often optimize answer generation and confidence generation jointly, which can cause confidence-alignment objectives to interfere with answer accuracy. In this work, we propose a decoupled and order-aware framework for verbalized confidence calibration. Our method first generates an answer and then estimates confidence conditioned on the fixed question–answer pair, allowing confidence optimization without directly perturbing the answer-generation process. To align confidence with correctness likelihood, we construct a sampling-based surrogate from multiple model completions and optimize rank-based reinforcement learning objectives that encourage responses with higher estimated correctness likelihood to receive higher verbalized confidence. Experiments on reasoning and knowledge-intensive benchmarks show that our method improves calibration and failure prediction performance while largely preserving answer accuracy. These results demonstrate that verbalized confidence can be more reliably aligned by decoupling confidence estimation from answer generation and optimizing the relative ordering of confidence across responses.
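其中"基于采样的正确可能性替代 + 排序目标"可以用如下玩具代码示意(以答案一致性频率作替代、取 logistic 排序损失,均为本文假设,并非论文的强化学习目标本身):

```python
import math
from collections import Counter

def correctness_surrogate(completions):
    """基于采样一致性的正确可能性替代:同一答案在多次补全中的出现频率。"""
    n = len(completions)
    return {a: c / n for a, c in Counter(completions).items()}

def pairwise_rank_loss(conf, likelihood):
    """logistic 排序损失:正确可能性更高的回答应获得更高的口头置信度。"""
    loss, pairs = 0.0, 0
    items = list(likelihood)
    for i in items:
        for j in items:
            if likelihood[i] > likelihood[j]:
                loss += math.log1p(math.exp(conf[j] - conf[i]))
                pairs += 1
    return loss / max(pairs, 1)

lik = correctness_surrogate(["A", "A", "A", "B"])
print(lik)  # {'A': 0.75, 'B': 0.25}
```

置信度排序与正确可能性排序一致时,该损失更低;论文在固定问答对的前提下仅优化置信度,从而不扰动答案生成。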

[NLP-9] A Causal Language Modeling Detour Improves Encoder Continued Pretraining

【速读】: 该论文试图解决的问题是:在将预训练编码器(encoder)适配到新领域时,标准的掩码语言建模(Masked Language Modeling, MLM)持续训练并非最优,其下游性能可通过一种替代训练策略得到提升。解决方案的关键在于引入一个“因果语言建模(Causal Language Modeling, CLM)绕路”阶段:在领域适应过程中暂时切换到CLM训练,随后再进行短时间MLM衰减(MLM decay)。该策略之所以有效,是因为CLM提供的密集监督(dense supervision)对低层Transformer层(第0–7层)产生的影响远大于MLM,这种表示变化在后续MLM衰减阶段得以保持,并且增益随模型容量缩放;实验证实,冻结低层会消除下游收益,而冻结中层则保留收益,从而表明低层是CLM绕路带来提升的核心机制。

链接: https://arxiv.org/abs/2605.12438
作者: Rian Touchent,Eric de la Clergerie
机构: Sorbonne Université (索邦大学); INRIA Paris (INRIA 巴黎); ALMAnaCH Team (ALMAnaCH 团队)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM’s dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
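"CLM 绕路 + 短 MLM 衰减"的训练调度可以写成一个简单的目标切换函数(衰减阶段占比等超参数为本文假设的示意值):

```python
def objective_schedule(step, total_steps, mlm_decay_frac=0.2):
    """领域适应调度示意:前段切换到 CLM(绕路),末段回到 MLM 做短衰减。"""
    return "mlm" if step >= total_steps * (1 - mlm_decay_frac) else "clm"

schedule = [objective_schedule(s, 10) for s in range(10)]
print(schedule)  # 前 8 步 'clm',后 2 步 'mlm'
```

按论文的消融结论,CLM 阶段若冻结低层(0-7 层)则下游增益消失,因此调度中的 CLM 段应保持低层可训练。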

[NLP-10] Geometric Factual Recall in Transformers

【速读】: 该论文试图解决Transformer语言模型如何记忆事实关联的问题,挑战了传统观点中将内部权重矩阵视为嵌入对上的联想记忆(associative memory)、参数数量与事实数量线性相关的理解。解决方案的关键在于提出并证明了一种几何形式的记忆(geometric memorization):在受控设置中,单层Transformer必须记忆从主语到共享属性集的随机双射时,只需要对数级的嵌入维度(logarithmic embedding dimension),其中主语嵌入编码其关联属性向量的线性叠加(linear superposition),而一个小型的多层感知机(MLP)充当关系条件的选择器(relation-conditioned selector),通过ReLU门控(ReLU gating)提取相关属性,而非作为联想键值映射(associative key-value mapping)。进一步地,该机制在多跳关系查询(如链式推理)中表现出容量-深度权衡(capacity-depth tradeoff),且训练后的MLP在主语嵌入被适当重新初始化时能够零样本迁移到全新的双射上,证明其学习到的是通用的选择机制,而非记忆特定事实。

链接: https://arxiv.org/abs/2605.12426
作者: Shauli Ravfogel,Gilad Yehudai,Joan Bruna,Alberto Bietti
机构: New York University (纽约大学); Flatiron Institute (熨斗研究所)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, geometric form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode linear superpositions of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, and not as an associative key-value mapping. We extend these results to the multi-hop setting – chains of relational queries such as "Who is the mother of the wife of x?" – providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorized any particular set of facts.
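主语嵌入作为属性向量线性叠加、再按关系条件选择的机制,可以用随机 ±1 向量做一个数值验证(维度、属性数等均为示意取值;论文中选择器由 ReLU-MLP 实现,这里以同关系候选集内的线性打分近似):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_attrs = 128, 20
# 每个关系各有一组候选属性向量(随机 ±1);d 只需随事实数对数增长
attrs = {r: rng.choice([-1.0, 1.0], size=(n_attrs, d)) for r in ("mother", "city")}
true = {"mother": 3, "city": 7}
# 主语嵌入 = 其各关系对应属性向量的线性叠加
subj = attrs["mother"][true["mother"]] + attrs["city"][true["city"]]

def recall(subject_emb, relation):
    """关系条件选择:在该关系的候选属性集内按内积取最大。"""
    return int(np.argmax(attrs[relation] @ subject_emb))

print(recall(subj, "mother") == true["mother"], recall(subj, "city") == true["city"])
```

正确属性与主语嵌入的内积约为 d,其余候选的内积仅为 O(sqrt(d)) 的噪声,因此高维随机向量的近正交性保证了恢复成功。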

[NLP-11] Predicting Disagreement with Human Raters in LLM -as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals

【速读】: 该论文旨在解决大语言模型自动生成教育材料时,所分配的难度评级与人类评分者之间存在不一致性的问题。现有方法依赖生成时的概率信号,但这些信号必须在评分生成时收集且难以跨模型比较。解决方案的关键在于利用难度属于序数尺度这一特性,引入独立的嵌入空间(如ModernBERT),通过计算评分集的几何一致性来识别可能与人类评分者产生分歧的候选评级,从而将这些案例提交重新评分,而不依赖于生成时的概率信号。

链接: https://arxiv.org/abs/2605.12422
作者: Yo Ehara
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to Educational Data Mining (EDM) 2026 (Poster/Demo Track)

点击查看摘要

Abstract:Automatic generation of educational materials using large language models (LLMs) is becoming increasingly common, but assigning difficulty levels to such materials still requires substantial human effort. LLM-as-a-Judge has therefore attracted attention, yet disagreement with human raters remains a major challenge. We propose a method for predicting which LLM-generated difficulty ratings are likely to disagree with human raters, so that such cases can be sent for re-rating. Unlike prior approaches, our method does not rely on generation-time probability signals, which must be collected during rating generation and are often difficult to compare across LLMs. Instead, exploiting the fact that difficulty is an ordinal scale, we use a separate embedding space, such as ModernBERT, and identify disagreement candidates based on the geometric consistency of the rating set. Experiments on English CEFR-based sentence difficulty assessment with GPT-OSS-120B and Qwen3-235B-A22B showed that the proposed method achieved higher AUC for predicting disagreement with human raters than probability-based baselines.
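论文未在摘要中公开具体判据;下面用一个假设性的几何一致性检查示意其思路:把嵌入投影到"最低级到最高级"的难度方向上,与同级均值偏离最大的样本被标记为可能与人类评分不一致的候选。

```python
import numpy as np

def flag_disagreement_candidates(embs, ratings, k=1):
    """几何一致性示意(判据为本文假设):返回在难度方向上偏离同级均值最大的 k 个样本索引。"""
    embs, ratings = np.asarray(embs, float), np.asarray(ratings)
    # 难度方向:最高级与最低级样本的嵌入均值之差
    direction = (embs[ratings == ratings.max()].mean(0)
                 - embs[ratings == ratings.min()].mean(0))
    proj = embs @ direction
    level_mean = {r: proj[ratings == r].mean() for r in np.unique(ratings)}
    dev = np.abs(proj - np.array([level_mean[r] for r in ratings]))
    return np.argsort(-dev)[:k].tolist()

embs = [[0.0], [0.1], [1.0], [1.1], [2.0], [2.1], [5.0]]  # 一维玩具嵌入(虚构)
ratings = [1, 1, 2, 2, 3, 3, 3]                           # LLM 给出的序数难度评级
print(flag_disagreement_candidates(embs, ratings))  # [6]:嵌入上明显偏离同级的样本
```

被标记的候选样本即可提交人工重新评分;真实流程中嵌入来自 ModernBERT 等独立编码器。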

[NLP-12] Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

【速读】: 该论文试图解决大型语言模型(LLMs)在上下文学习(in-context learning)中所依赖的潜在假设空间(latent hypothesis space)结构不明确的问题。解决方案的关键在于提出LLMs在一个低维几何空间——概念信念空间(conceptual belief space)上分配信念,并将上下文学习建模为信念随时间更新的轨迹。通过以故事理解(story understanding)为动态信念更新的自然情境,结合行为分析与内部表征分析,论文发现信念更新可描述为低维结构化流形(low-dimensional structured manifold)上的轨迹,且该结构在行为与内部表征中一致,可通过线性探针(linear probe)解码以预测行为,同时基于概念空间几何的因果干预可引导信念轨迹。

链接: https://arxiv.org/abs/2605.12412
作者: Eric Bigelow,Raphaël Sarfati,Daniel Wurgaft,Owen Lewis,Thomas McGrath,Jack Merullo,Atticus Geiger,Ekdeep Singh Lubana
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we propose that LLMs assign beliefs over a low-dimensional geometric space - a conceptual belief space - and that in-context learning corresponds to a trajectory through this space as beliefs are updated over time. Using story understanding as a natural setting for dynamic belief updating, we combine behavioral and representational analyses to study these trajectories. We find that (1) belief updates are well-described as trajectories on low-dimensional, structured manifolds; (2) this structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior; and (3) interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space. Together, our results provide a geometric account of belief dynamics in LLMs, grounding Bayesian interpretations of in-context learning in structured conceptual representations.
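文中"用线性探针解码信念坐标"的做法可以用最小二乘回归示意(隐藏状态与信念坐标均为合成数据,w_true 为本文假设的线性编码方向):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 16, 200
w_true = rng.normal(size=d)                      # 假设隐藏状态中线性编码的信念方向
H = rng.normal(size=(T, d))                      # 各时间步的隐藏状态(合成)
belief = H @ w_true + 0.01 * rng.normal(size=T)  # 信念坐标 = 线性读出 + 小噪声

# 线性探针 = 最小二乘回归
w_probe, *_ = np.linalg.lstsq(H, belief, rcond=None)
pred = H @ w_probe
r2 = 1 - ((belief - pred) ** 2).sum() / ((belief - belief.mean()) ** 2).sum()
print(round(r2, 4))  # 接近 1:信念坐标可由线性探针解码
```

论文进一步表明,沿探针方向对表征做干预可以因果地引导信念轨迹,且效果可由概念空间的几何结构预测。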

[NLP-13] A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles

【速读】: 该论文试图解决受控文本生成(Controlled Text Generation, CTG)领域由于各研究使用不同数据集和评估方法,导致难以客观比较不同系统性能的问题。解决方案的关键是采用“公平竞争环境”(level-playing-field, LPF)方法,即通过标准化方式生成和处理所有系统输出,并应用一套基于当前使用情况选取的共享评估方法与数据集,以实现对CTG系统的信息丰富且公平的比较评估。

链接: https://arxiv.org/abs/2605.12395
作者: Michela Lorandi,Anya Belz
机构: Dublin City University(都柏林城市大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Background: Many different approaches to controlled text generation (CTG) have been proposed over recent years, but it is difficult to get a clear picture of which approach performs best, because different datasets and evaluation methods are used in each case to assess the control achieved. Objectives: Our aim in the work reported in this paper is to develop an approach to evaluation that enables us to comparatively evaluate different CTG systems in a manner that is both informative and fair to the individual systems. Methods: We use a level-playing-field (LPF) approach to comparative evaluation where we (i) generate and process all system outputs in a standardised way, and (ii) apply a shared set of evaluation methods and datasets, selected based on those currently in use, in order to ensure fair evaluation. Results: When re-evaluated in this way, performance results for a representative set of current CTG systems differ substantially from originally reported results, in most cases for the worse. This highlights the importance of a shared standardised way of assessing controlled generation. Conclusions: The discrepancies revealed by LPF evaluation demonstrate the urgent need for standardised, reproducible evaluation practices in CTG. Our results suggest that without such practices, published performance claims may substantially misrepresent true system capabilities. 

[NLP-14] Scalable Token-Level Hallucination Detection in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在推理密集型任务中产生的内部幻觉(hallucinations)难以检测的问题。现有步级分析(step-level analysis)方法依赖步分割(step segmentation),导致检测粒度有限且可扩展性差,在内容看似连贯但存在逻辑缺陷或不可靠中间结果时尤为困难。解决方案的核心在于提出TokenHD——一个用于训练令牌级幻觉检测器(token-level hallucination detector)的全流程管道。其关键包含两部分:一是可扩展的数据引擎(scalable data engine),能够合成大规模、高质量的令牌级幻觉标注数据;二是训练配方(training recipe)中采用的重要性加权策略(importance-weighted strategy),用于增强模型对关键错误令牌的鲁棒性。通过这一方案,检测器可直接处理自由文本(free-form text),无需预先定义步分割或文本重格式化,从而在保持细粒度检测的同时显著提升可扩展性与泛化能力。

链接: https://arxiv.org/abs/2605.12384
作者: Rui Min,Tianyu Pang,Chao Du,Minhao Cheng,Yi R. Fung
机构: Sea AI Lab (Sea人工智能实验室); Hong Kong University of Science and Technology (香港科技大学); Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, but they still frequently produce hallucinations. These hallucinations are difficult to detect in reasoning-intensive tasks, where the content appears coherent but contains errors like logical flaws and unreliable intermediate results. While step-level analysis is commonly used to detect internal hallucinations, it suffers from limited granularity and poor scalability due to its reliance on step segmentation. To address these limitations, we propose TokenHD, a holistic pipeline for training token-level hallucination detectors. Specifically, TokenHD consists of a scalable data engine for synthesizing large-scale hallucination annotations along with a training recipe featuring an importance-weighted strategy for robust model training. To systematically assess the detection performance, we also provide a rigorous evaluation protocol. Through training within TokenHD, our detector operates directly on free-form text to identify hallucinations, eliminating the need for predefined step segmentation or additional text reformatting. Our experiments show that even a small detector (0.6B) achieves substantial performance gains after training, surpassing much larger reasoning models (e.g., QwQ-32B), and detection performance scales consistently with model size from 0.6B to 8B. Finally, we show that our detector can generalize well across diverse practical scenarios and explore strategies to further enhance its cross-domain generalization capability.
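训练配方中的"重要性加权"思想可以用一个加权的 token 级二元交叉熵示意(加权形式为本文假设,并非 TokenHD 的原始配方):被上调权重的关键错误 token 会主导损失。

```python
import math

def weighted_token_bce(probs, labels, weights):
    """重要性加权的 token 级二元交叉熵:probs 为"该 token 属于幻觉"的预测概率。"""
    total = sum(weights)
    loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, 1e-7), 1 - 1e-7)  # 数值稳定裁剪
        loss -= w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / total

# 第三个 token 是被漏检的关键错误:上调其权重后,损失被显著放大
uniform = weighted_token_bce([0.9, 0.9, 0.2], [1, 1, 1], [1, 1, 1])
upweighted = weighted_token_bce([0.9, 0.9, 0.2], [1, 1, 1], [1, 1, 5])
print(upweighted > uniform)  # True
```

这样训练出的检测器可直接对自由文本逐 token 打分,无需步分割。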

[NLP-15] Pretraining Exposure Explains Popularity Judgments in Large Language Models SIGIR2026

【速读】: 该论文试图解决大语言模型(LLMs)中普遍存在的流行度偏差(popularity bias)的根源问题,即模型的实体偏好究竟是源于预训练阶段的数据统计暴露(statistical exposure),还是反映了真实世界的流行度(real-world popularity)。由于大多数训练语料库不可访问,二者难以区分。解决方案之关键在于首次基于完全可观察的预训练数据(OLMo模型及其完整的预训练语料Dolma,涵盖7.4万亿token)进行直接的大规模分析,通过计算2,000个实体(涵盖五类:人物、地点、组织、艺术、产品)在预训练中的精确暴露统计,并与维基百科页面浏览量(Wikipedia popularity)以及两种LLM的流行度信号(直接标量估计和成对比较)进行对比。研究发现,预训练暴露与维基百科流行度强相关,但LLM的流行度判断(尤其是通过成对比较)与预训练暴露的一致性显著高于与维基百科的一致性,且这种一致性在更大模型中更强,在维基百科数据不可靠的长尾部分依然持续。因此,论文的核心结论是:LLM中的流行度先验主要由预训练统计塑造,而非外部流行度信号,从而揭示了数据暴露在驱动流行度偏差中的中心作用。

链接: https://arxiv.org/abs/2605.12382
作者: Jamshid Mozafari,Bhawna Piryani,Adam Jatowt
机构: University of Innsbruck (因斯布鲁克大学)
类目: Computation and Language (cs.CL)
备注: Accepted at SIGIR 2026

点击查看摘要

Abstract:Large language models (LLMs) exhibit systematic preferences for well-known entities, a phenomenon often attributed to popularity bias. However, the extent to which these preferences reflect real-world popularity versus statistical exposure during pretraining remains unclear, largely due to the inaccessibility of most training corpora. We provide the first direct, large-scale analysis of popularity bias grounded in fully observable pretraining data. Leveraging the open OLMo models and their complete pretraining corpus, Dolma, we compute precise entity-level exposure statistics across 7.4 trillion tokens. We analyze 2,000 entities spanning five types (Person, Location, Organization, Art, Product) and compare pretraining exposure against Wikipedia pageviews and two elicited LLM popularity signals: direct scalar estimation and pairwise comparison. Our results show that pretraining exposure strongly correlates with Wikipedia popularity, validating exposure as a meaningful proxy for real-world salience during the training period. More importantly, we find that LLM popularity judgments align more closely with exposure than with Wikipedia, especially when elicited via pairwise comparisons. This alignment is strongest for larger models and persists in the long tail, where Wikipedia popularity becomes unreliable. Overall, our findings demonstrate that popularity priors in LLMs are primarily shaped by pretraining statistics rather than external popularity signals, offering concrete evidence that data exposure plays a central role in driving popularity bias.
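预训练曝光统计与成对比较流行度信号之间的一致性,通常以等级相关度量。下面手写一个无并列情形的 Spearman 相关作示意(数据为虚构):

```python
def spearman(xs, ys):
    """Spearman 等级相关(无并列值的简化实现)。"""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

exposure = [120000, 5000, 800, 90, 10]             # 实体在预训练语料中的曝光次数(虚构)
pairwise_winrate = [0.95, 0.80, 0.55, 0.30, 0.10]  # 成对比较中的胜率(虚构)
print(spearman(exposure, pairwise_winrate))  # 1.0:两种信号完全单调一致
```

论文的发现即体现为此类相关在成对比较信号上高于直接标量估计,且在长尾区间依然保持。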

[NLP-16] Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation

【速读】: 该论文试图解决参数高效微调(PEFT)技术中每个新任务都需要单独微调的问题,即如何在不增加额外训练成本的前提下,通过泛化方法处理多个任务或任务组合。解决方案的关键在于推理时对单独训练的PEFT模块的输出进行求和(output composition),该方法在情感控制、主题控制和多属性控制等多组受控文本生成任务中,始终优于或匹配其他组合方式(如训练时多数据集联合训练、推理时组合权重矩阵),甚至在单任务测试集上相比单任务专用模块平均提升2个百分点。

链接: https://arxiv.org/abs/2605.12345
作者: Michela Lorandi,Anya Belz
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) techniques offer task-specific fine-tuning at a fraction of the cost of full fine-tuning, but require separate fine-tuning for every new task (combination). In this paper, we explore three ways of generalising beyond single-task training/inference: (i) training on combinations of multiple, related datasets; (ii) at inference, composing the weight matrices of separately trained PEFT modules; and (iii) at inference, composing the outputs of separately trained PEFT modules. We test these approaches on three different LLMs, QLoRA as the PEFT technique, and three sets of controlled text generation datasets for sentiment control, topic control, and multi-attribute control. We find that summing PEFT module outputs is a particularly strong composition method, which consistently either outperforms or matches the performance of alternative approaches. This is the case even when comparing against single-task specialised modules on the single-task test set, where three-module output composition achieves an average 2 percentage point performance increase across all models for sentiment control.
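推理时的输出求和组合可以在 logit 层面示意如下(在 logit 还是概率层面组合、是否加权等细节为本文假设,未必与论文实现一致):

```python
import numpy as np

def compose_logits(logit_list, weights=None):
    """输出求和组合示意:对各单任务 PEFT 模块给出的下一 token 打分逐元素求和。"""
    weights = weights or [1.0] * len(logit_list)
    return np.sum([w * np.asarray(l, float) for w, l in zip(weights, logit_list)], axis=0)

sentiment_logits = [2.0, 0.5, -1.0]  # 情感控制模块对 3 个候选 token 的打分(虚构)
topic_logits = [0.0, 2.0, -0.5]      # 主题控制模块的打分(虚构)
combined = compose_logits([sentiment_logits, topic_logits])
print(int(np.argmax(combined)))  # 1:两属性折中后选出的 token
```

与合并权重矩阵不同,输出层面的组合无需访问各模块内部参数,便于即插即用地叠加属性控制。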

[NLP-17] A categorical error sensitivity index (ISEC): A preventive ordinal decision-support measure for irrecoverable errors in manual data entry systems

【速读】: 该论文旨在解决中小企业在数据录入系统中因类别语义或形态接近性导致的结构性分类错误问题,此类错误在缺乏自动化输入控制时不可恢复,并会扭曲关键绩效指标(KPIs)进而误导管理决策;现有标准化工具因孤立评估语义与形态维度且过度依赖标准词典,无法有效处理中小企业主数据中大量自定义SKU、缩写及专业术语。解决方案的关键在于引入类别错误敏感性指数(Categorical Error Sensitivity Index, ISEC),这是一个基于词嵌入(word embeddings)的语义距离、改进的Damerau-Levenshtein算法下的自定义加权形态转换成本及经验频率构建的序数综合得分,通过向量数据库架构将计算复杂度降低约195倍,从而作为可扩展、主动的数据治理工具,帮助中小企业预先识别其分类数据资产中隐含的结构性风险。

链接: https://arxiv.org/abs/2605.12328
作者: Ricardo Raúl Palma,Mauro Anibal Benetti,Fabricio Orlando Sanchez Varretti
机构: Universidad Nacional de Cuyo (National University of Cuyo); Instituto de Ingeniería (Institute of Engineering); Universidad Tecnologica Nacional (National Technological University); Facultad Regional San Rafael (San Rafael Regional Faculty)
类目: Computation and Language (cs.CL)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:Data entry systems remain structurally vulnerable to categorical misclassifications, particularly in small and medium sized enterprises (SMEs). When nominal categories exhibit semantic or morphological proximity, human machine interaction may produce errors that are irrecoverable ex post. In the absence of automated input controls, manual data entry frequently generates irrecoverable categorical distortions that propagate into Key Performance Indicators (KPIs), thereby misleading managerial decision making. State of the art normalization tools typically evaluate semantic and morphological dimensions in isolation and rely heavily on standard dictionaries, rendering them ineffective for SME master data rich in custom SKUs, abbreviations, and domain-specific technical jargon. This paper introduces the Categorical Error Sensitivity Index (ISEC), an ordinal composite score designed to rank category pairs according to their structural susceptibility to confusion. ISEC integrates semantic distance (via word embeddings), custom weighted morphological transformation costs (through an adapted Damerau Levenshtein algorithm), and empirical frequency into a unified, mathematically robust preventive framework. By leveraging vector database architectures, ISEC reduces computational complexity, achieving approximately a 195x performance improvement over brute-force methods. Validated across three heterogeneous datasets: governmental judicial records, retail inventory, and a synthetic ISO coded metalworking catalog, ISEC provides a scalable and proactive data governance instrument that enables SMEs to detect latent structural risk embedded within their categorical data assets.
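ISEC 的形态学分量基于自定义加权的 Damerau-Levenshtein 距离。下面给出其 OSA 变体的一个最小实现(各操作代价取值为示意):相邻换位的低代价使得 SKU 中常见的"数字顺序颠倒"错误被识别为高易混风险。

```python
def weighted_dl(a, b, sub_cost=1.0, swap_cost=0.5, indel_cost=1.0):
    """带自定义操作代价的 Damerau-Levenshtein 距离(OSA 变体;代价取值为示意)。"""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i * indel_cost
    for j in range(n + 1):
        d[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + indel_cost,      # 删除
                          d[i][j - 1] + indel_cost,      # 插入
                          d[i - 1][j - 1] + cost)        # 替换/匹配
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)  # 相邻换位
    return d[m][n]

print(weighted_dl("SKU-1203", "SKU-1230"))  # 0.5:一次相邻换位,代价低于替换
```

ISEC 再将该形态距离与词嵌入语义距离、经验频率组合为序数综合得分;组合方式见论文正文。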

[NLP-18] GKnow: Measuring the Entanglement of Gender Bias and Factual Gender ACL2026

【速读】: 该论文试图解决现有研究中对语言模型性别预测机制理解的两个主要局限:一是过于局限于特定性别相关任务(如性别代词预测),二是未能区分事实性性别输出(基于语义属性的正确性别假设)与性别偏见输出(基于刻板印象)。为此,论文提出了一个综合基准 GKnow,用于评估语言模型在不同类型性别相关预测中的性别知识与性别偏见。解决方案的关键在于通过 GKnow 识别和分析负责性别预测的电路和单个神经元,并在此基础上进行神经元消融实验,从而揭示性别偏见与事实性别在电路和神经元层面上的严重纠缠,证明消融作为去偏方法的不可靠性,同时指出现有性别偏见评估基准可能掩盖消融后事实性别知识的下降。最终,GKnow 本身作为构建稳健性别偏见基准的贡献被提出。

链接: https://arxiv.org/abs/2605.12299
作者: Leonor Veloso,Hinrich Schütze
机构: Center for Information and Language Processing, LMU Munich (慕尼黑大学信息与语言处理中心); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心(MCML))
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Recent works have analyzed the impact of individual components of neural networks on gendered predictions, often with a focus on mitigating gender bias. However, mechanistic interpretations of gender tend to (i) focus on a very specific gender-related task, such as gendered pronoun prediction, or (ii) fail to distinguish between the production of factually gendered outputs (the correct assumption of gender given a word that carries gender as a semantic property) and gender biased outputs (based on a stereotype). To address these issues, we curate GKnow, a benchmark to assess gender knowledge and gender bias in language models across different types of gender-related predictions. GKnow allows us to identify and analyze circuits and individual neurons responsible for gendered predictions. We test the impact of neuron ablation on benchmarks for disentangling stereotypical and factual gender (DiFair and the test set of GKnow), as well as StereoSet. Results show that gender bias and factual gender are severely entangled on the level of both circuits and neurons, entailing that ablation is an unreliable debiasing method. Furthermore, we show that benchmarks for evaluating gender bias can hide the decrease in factual gender knowledge that accompanies neuron ablation. We curate GKnow as a contribution to the continuous development of robust gender bias benchmarks.

[NLP-19] TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

【Quick Read】: This paper targets a mismatch in existing preference-alignment methods such as Direct Preference Optimization (DPO): DPO models preferences over full output sequences, while generation is driven by per-token decisions at each timestep; existing token-level extensions typically decompose the sequence-level Bradley-Terry objective across timesteps without explicitly recovering per-prefix (state-wise) optimality. The key to the solution is Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derives a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and retaining DPO's simplicity. Two instantiations are provided: TBPO-Q explicitly learns a lightweight state baseline, while TBPO-A removes the baseline via advantage normalization; across instruction-following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity.

Link: https://arxiv.org/abs/2605.12288
Authors: Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa Doan, Trung Le
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
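The logistic/DPO loss that TBPO generalizes can be written down concretely. The sketch below computes the standard sequence-level form from summed log-probabilities under the policy and a frozen reference; TBPO's Bregman-divergence objective reduces to this special case, and the function name and toy values are illustrative, not the paper's implementation.

```python
import math

def dpo_logistic_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Logistic (DPO-style) preference loss over summed log-probs.

    TBPO generalizes this via a Bregman-divergence density-ratio
    objective; the plain logistic form below is the special case.
    """
    # Log density ratios of policy vs. reference for each response.
    ratio_chosen = logp_chosen - ref_logp_chosen
    ratio_rejected = logp_rejected - ref_logp_rejected
    margin = beta * (ratio_chosen - ratio_rejected)
    # -log sigmoid(margin), written stably as log1p(exp(-margin)).
    return math.log1p(math.exp(-margin))

# A chosen response whose policy log-ratio exceeds the rejected one
# yields a loss below log(2); a zero margin gives exactly log(2).
assert dpo_logistic_loss(-5.0, -9.0, -6.0, -8.0) < math.log(2)
assert abs(dpo_logistic_loss(-5.0, -9.0, -5.0, -9.0) - math.log(2)) < 1e-12
```

The token-level view applies the same functional form to per-prefix log-ratios rather than whole-sequence sums.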

[NLP-20] What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty ACL

【Quick Read】: The core question of this paper is what factors make English vocabulary difficult to learn, and how their influence varies with the learner's native language (Spanish, German, or Chinese). The key to the solution is training gradient-boosted models on features covering a word's familiarity (e.g., frequency), meaning, surface form (e.g., pronunciation and spelling), and cross-linguistic transfer, and quantifying the relative importance of each feature group with Shapley values. The study finds that word familiarity is the dominant factor shared across all three L1s, but difficulty for Spanish- and German-speaking learners additionally depends on orthographic transfer, a mechanism unavailable to Chinese learners, whose difficulty is shaped by familiarity and surface features alone. The models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.

Link: https://arxiv.org/abs/2605.12281
Authors: Jonas Mayer Martins, Zhuojing Huang, Aaricia Herygers, Lisa Beinborn
Affiliations: University of Göttingen
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to BEA 2026 at ACL. 18 pages, 13 figures

Click to view abstract

Abstract:What makes a word difficult to learn, and how does the difficulty depend on the learner’s native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word’s familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group shared by all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
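The Shapley-value attribution over feature groups can be illustrated with the exact permutation formula. The toy value function below is an assumption standing in for the fitted model's score under different feature subsets; for an additive game like this one, the Shapley values recover each group's contribution exactly.

```python
from itertools import permutations

def shapley_values(groups, value):
    """Exact Shapley values for feature groups via permutation averaging.

    `value` maps a frozenset of groups to the model's score when only
    those groups are available (a toy stand-in for masking/retraining).
    """
    phi = {g: 0.0 for g in groups}
    perms = list(permutations(groups))
    for order in perms:
        present = frozenset()
        for g in order:
            # Marginal contribution of g given the groups already added.
            phi[g] += value(present | {g}) - value(present)
            present = present | {g}
    return {g: v / len(perms) for g, v in phi.items()}

# Toy value function: familiarity contributes 0.5, transfer 0.2,
# surface 0.1, with no interactions.
contrib = {"familiarity": 0.5, "transfer": 0.2, "surface": 0.1}
v = lambda s: sum(contrib[g] for g in s)
phi = shapley_values(list(contrib), v)
assert abs(phi["familiarity"] - 0.5) < 1e-12
assert abs(sum(phi.values()) - v(frozenset(contrib))) < 1e-12  # efficiency
```

Exact enumeration is exponential in the number of groups; with only a handful of feature groups, as here, it stays tractable, while per-feature attributions in practice use sampling or tree-specific estimators.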

[NLP-21] Reconstruction of Personally Identifiable Information from Supervised Finetuned Models

【Quick Read】: This paper presents the first systematic study of personally identifiable information (PII) reconstruction from supervised fine-tuned (SFT) LLMs, evaluating whether adversaries with varying levels of prior knowledge can infer sensitive PII contained in the fine-tuning data of models adapted to sensitive domains such as medicine and law. The key to the solution is COVA, a novel decoding algorithm designed for prefix-based attacks that consistently outperforms existing extraction methods across PII types. The results show that even partial attacker knowledge of the fine-tuning dataset significantly improves reconstruction success, while leakage varies substantially across PII types in SFT models.

Link: https://arxiv.org/abs/2605.12264
Authors: Sae Furukawa, Alina Oprea
Affiliations: Northeastern University
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Supervised Finetuning (SFT) has become one of the primary methods for adapting a large language model (LLM) with extensive pre-trained knowledge to domain-specific, instruction-following tasks. SFT datasets, composed of instruction-response pairs, often include user-provided information that may contain sensitive data such as personally identifiable information (PII), raising privacy concerns. This paper studies the problem of PII reconstruction from SFT models for the first time. We construct multi-turn, user-centric QA datasets in sensitive domains, specifically medical and legal settings, that incorporate PII to enable realistic evaluation of leakage. Using these datasets, we evaluate the extent to which an adversary, with varying levels of knowledge about the fine-tuning dataset, can infer sensitive information about individuals whose data was used during SFT. In the reconstruction setting, we propose COVA, a novel decoding algorithm to reconstruct PII under prefix-based attacks, consistently outperforming existing extraction methods. Our results show that even partial attacker knowledge can significantly improve reconstruction success, while leakage varies substantially across PII types.

[NLP-22] PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

【Quick Read】: This paper tackles memory management for long-horizon language agents, whose conversation history accumulates far faster than any fixed context window can hold; existing methods either expand the context window without optimizing what is retrieved, incur heavy token costs for ingestion-time fact extraction, or rely on heuristic graph traversal that sacrifices both accuracy and efficiency. The proposed PRISM is a training-free retrieval-side framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. The key is composing four orthogonal inference-time components: Hierarchical Bundle Search over typed relation paths; Query-Sensitive Edge Costing, which aligns traversal direction with the detected query intent; Evidence Compression, which compresses the candidate bundle into a compact answer-side context; and Adaptive Intent Routing, which routes most queries through zero-LLM tiers. By formulating retrieval as min-cost selection over typed path templates paired with an LLM-side compression step, PRISM surfaces the right evidence under a strict context budget without fine-tuning or modifying the upstream ingestion pipeline, achieving substantially higher LLM-judge accuracy on the LoCoMo benchmark at an order-of-magnitude smaller context budget and occupying a previously empty corner of the accuracy-context-cost frontier.

Link: https://arxiv.org/abs/2605.12260
Authors: Jingyi Peng, Zhongwei Wan, Weiting Liu, Qiuzhuang Sun
Affiliations: Singapore Management University; The Ohio State University; Fudan University
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Click to view abstract

Abstract:Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the context window without addressing what is retrieved, perform heavy ingestion-time fact extraction at substantial token cost, or rely on heuristic graph traversal that leaves both accuracy and efficiency on the table. We present PRISM, a training-free retrieval-side framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. PRISM combines four orthogonal inference-time components: Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that compresses the candidate bundle into a compact answer-side context, and Adaptive Intent Routing that routes most queries through zero-LLM tiers. By formulating retrieval as min-cost selection over typed path templates and pairing it with an LLM-side compression step, PRISM surfaces the right evidence under a strict context budget without any fine-tuning or modification to the upstream ingestion pipeline. Experiments on the LoCoMo benchmark show that PRISM delivers substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget, occupying a previously empty corner of the accuracy-context-cost frontier and demonstrating a superior balance between answer quality and retrieval efficiency.
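The "min-cost selection over typed path templates" can be grounded in a standard shortest-path routine. The sketch below runs Dijkstra over a toy memory graph whose edge costs stand in for PRISM's query-sensitive costing (where the cost of a relation type would depend on detected query intent); the graph, node names, and relation types are invented for illustration.

```python
import heapq

def min_cost_path(graph, source, target):
    """Dijkstra over a typed memory graph.

    `graph` maps node -> list of (neighbor, relation_type, cost).
    Returns (total cost, list of relation types along the best path).
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, rel, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = (node, rel)
                heapq.heappush(heap, (nd, nxt))
    # Reconstruct the relation-type path back from the target.
    path, node = [], target
    while node != source:
        node, rel = prev[node]
        path.append(rel)
    return dist[target], list(reversed(path))

graph = {
    "query": [("alice", "mentions", 1.0), ("trip", "about", 2.5)],
    "alice": [("trip", "participated_in", 0.5)],
}
cost, rels = min_cost_path(graph, "query", "trip")
assert cost == 1.5 and rels == ["mentions", "participated_in"]
```

The indirect two-hop path wins over the direct edge because its typed relations are cheaper under this costing, which is the behavior query-sensitive edge costs are meant to induce.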

[NLP-23] PreScam: A Benchmark for Predicting Scam Progression from Early Conversations

【Quick Read】: This paper addresses the problem of modeling and understanding how real-world multi-turn conversational scams evolve over time. Existing research focuses mainly on static scam detection or synthetic scams, overlooking how scammers escalate risk through gradually evolving psychological manipulation of victims. The key to the solution is the PreScam benchmark, which filters and structures user-submitted scam reports into 11,573 conversational scam instances spanning 20 scam categories; each instance is hierarchically organized according to a proposed scam kill chain, with turn-level annotations of scammer psychological actions and victim responses. PreScam supports two core tasks, real-time termination prediction and scammer action prediction, which together expose the gap between surface-level fluency and progression modeling in current models.

Link: https://arxiv.org/abs/2605.12243
Authors: Weixiang Sun, Shang Ma, Yiyang Li, Tianyi Ma, Zehong Wang, Colby Nelson, Xusheng Xiao, Yanfang Ye
Affiliations: University of Notre Dame; Arizona State University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Conversational scams, such as romance and investment scams, are emerging as a major form of online fraud. Unlike one-shot scam lures such as fake lottery or unpaid toll messages, they unfold through multi-turn conversations in which scammers gradually manipulate victims using evolving psychological techniques. However, existing research mainly focuses on static scam detection or synthetic scams, leaving open whether language models can understand how real-world scams progress over time. We introduce PreScam, a benchmark for modeling scam progression from early conversations. Built from user-submitted scam reports, PreScam filters and structures 177,989 raw reports into 11,573 conversational scam instances spanning 20 scam categories. Each instance is hierarchically structured according to the scam lifecycle defined by the proposed scam kill chain, and further annotated at the turn level with scammer psychological actions and victim responses. We benchmark models on two tasks: real-time termination prediction, which estimates whether a conversation is approaching the termination stage, and scammer action prediction, which forecasts the scammer’s subsequent actions. Results show a clear gap between surface-level fluency and progression modeling: supervised encoders substantially outperform zero-shot LLMs on real-time termination prediction, while next-action prediction remains only moderately successful even for strong LLMs. Taken together, these results show that current models can capture some scam-related cues, yet still struggle to track how risk escalates and how manipulation unfolds across turns.

[NLP-24] Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs ACL2026

【Quick Read】: This paper addresses the problem that disfluencies pervasive in automatic speech recognition (ASR) transcripts (fillers, repetitions, and false starts) reduce readability and undermine the reliability of downstream systems such as chatbots and voice assistants. Most existing methods detect and remove disfluent tokens, which easily disrupts grammatical structure and semantic coherence, while recent LLM-based work has focused only on disfluency detection or data augmentation rather than comprehensive correction. The key to the solution is a multilingual pipeline in which a sequence tagger first marks disfluent tokens, these signals then guide instruction fine-tuning of an LLM to rewrite transcripts into fluent text, and a contrastive learning objective penalizes the model for reproducing disfluent tokens, forcing it to remove disfluencies while preserving grammar and meaning. Combining token-level cues, instruction tuning, and contrastive learning significantly outperforms detection-only strategies and sequence-to-sequence baselines across three Indian languages (Hindi, Bengali, and Marathi), providing a practical and scalable solution for disfluency correction in speech-driven NLP systems.

Link: https://arxiv.org/abs/2605.12242
Authors: Deepak Kumar, Baban Gain, Asif Ekbal
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2026 (Main)

Click to view abstract

Abstract:Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left unaddressed, such disfluencies can significantly degrade the reliability of downstream systems. Most existing approaches rely on classical models that focus on identifying disfluent tokens for removal. While this strategy is effective to some extent, it often disrupts grammatical structure and semantic coherence, leading to incomplete or unnatural sentences. Recent literature explored the use of large language models (LLMs); however, these efforts have primarily focused on disfluency detection or data augmentation, rather than performing comprehensive correction. We propose a multilingual correction pipeline where a sequence tagger first marks disfluent tokens, and these signals guide instruction fine-tuning of an LLM to rewrite transcripts into fluent text. To further improve reliability, we add a contrastive learning objective that penalizes the reproduction of disfluent tokens, encouraging the model to preserve grammar and meaning while removing disfluent artifacts. Our experiments across three Indian languages, namely Hindi, Bengali, and Marathi show consistent improvements over strong baselines, including multilingual sequence-to-sequence models. These results highlight that detection-only strategies are insufficient. Combining token-level cues with instruction tuning and contrastive learning provides a practical and scalable solution for multilingual disfluency correction in speech-driven NLP systems. We make the codes publicly available at this https URL.
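The tag-then-rewrite pipeline can be sketched with a toy rule-based stand-in for the paper's learned sequence tagger: it flags filler words and immediate repetitions, and the "rewrite" step simply drops flagged tokens. The filler list and tag scheme are assumptions for illustration; the actual system feeds the tags to an instruction-tuned LLM rather than deleting tokens.

```python
# Illustrative filler inventory; a learned tagger would not need one.
FILLERS = {"uh", "um", "erm", "hmm"}

def tag_disfluencies(tokens):
    """Tag each token 'D' (disfluent) or 'O' (keep)."""
    tags, prev = [], None
    for tok in tokens:
        disfluent = tok.lower() in FILLERS or tok.lower() == prev
        tags.append("D" if disfluent else "O")
        prev = tok.lower()
    return tags

def remove_tagged(tokens, tags):
    """Naive 'correction': drop tagged tokens (the paper instead
    conditions an LLM rewrite on these tags)."""
    return [t for t, g in zip(tokens, tags) if g == "O"]

tokens = "I I um want to to book a ticket".split()
tags = tag_disfluencies(tokens)
assert remove_tagged(tokens, tags) == ["I", "want", "to", "book", "a", "ticket"]
```

Deletion-only output like this is exactly what the paper argues is insufficient: it handles simple repetitions but cannot repair false starts that require rephrasing, which motivates the LLM rewrite with a contrastive penalty.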

[NLP-25] Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在长上下文任务中后训练方法存在的三大局限:离策略方法(如监督微调SFT和知识蒸馏KD)因曝光偏差(exposure bias)和长程模型生成错误恢复有限而性能不足;在策略强化学习方法(如组相对策略优化GRPO)虽能更好对齐模型生成状态,但因奖励稀疏(sparse rewards)导致训练不稳定且样本效率低;在策略蒸馏(OPD)虽提供密集的token级指导,但无法直接优化任意奖励信号。解决方案的关键是提出dGRPO(Distilled Group Relative Policy Optimization)方法,该方法通过将GRPO与OPD的密集指导相结合,在一个单一目标中同时实现基于结果的策略优化(outcome-based policy optimization)与知识蒸馏,从而为长上下文推理提供更稳定、高效的对齐路径,并引入合成长上下文数据集LongBlocks以支持多跳推理、上下文定位和长文本生成等任务。

链接: https://arxiv.org/abs/2605.12227
作者: Miguel Moura Ramos,Duarte M. Alves,André F. T. Martins
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Adapting large language models (LLMs) to long-context tasks requires post-training methods that remain accurate and coherent over thousands of tokens. Existing approaches are limited in several ways: 1) off-policy methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and limited recovery from model-generated errors over long horizons; 2) on-policy reinforcement learning methods such as Group Relative Policy Optimization (GRPO) better align training with model-generated states, but are unstable and sample-inefficient due to sparse rewards; 3) on-policy distillation (OPD) provides dense token-level guidance, but does not directly optimize arbitrary reward signals. In this paper, we propose Distilled Group Relative Policy Optimization (dGRPO), a method for long-context reasoning that augments GRPO with dense guidance from a stronger teacher via OPD. We also introduce LongBlocks, a synthetic long-context dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. We conduct extensive experiments and ablations comparing off-policy training, sparse-reward GRPO, and our combined approach, leading to an improved recipe for long-context alignment. Overall, our results show that combining outcome-based policy optimization with knowledge distillation in a single objective provides a more stable and effective path to long-context reasoning, while preserving short-context capabilities.
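The "single objective" combining outcome-based optimization with distillation can be illustrated per token. The sketch below adds a GRPO-style advantage-weighted policy term to a forward-KL distillation term against a teacher; the function name, KL direction, and weighting are assumptions, not the paper's exact recipe.

```python
import math

def dgrpo_token_loss(logp_student, advantage, teacher_probs,
                     student_probs, kl_weight=0.5):
    """Illustrative per-token objective: a GRPO-style policy term plus
    a dense distillation term (forward KL to a stronger teacher)."""
    policy_term = -advantage * logp_student
    kl = sum(p * math.log(p / q)
             for p, q in zip(teacher_probs, student_probs) if p > 0)
    return policy_term + kl_weight * kl

# When the student matches the teacher, the dense KL term vanishes and
# only the sparse-reward policy term remains.
p = [0.7, 0.2, 0.1]
loss_match = dgrpo_token_loss(math.log(0.7), 1.0, p, p)
assert abs(loss_match - (-math.log(0.7))) < 1e-12
loss_mismatch = dgrpo_token_loss(math.log(0.7), 1.0, p, [0.1, 0.2, 0.7])
assert loss_mismatch > loss_match
```

The point of the combination is visible even in this toy: the KL term supplies a gradient at every token, whereas the advantage term alone only carries signal when the sparse outcome reward differentiates rollouts.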

[NLP-26] Mechanistic Interpretability of ASR models using Sparse Autoencoders

【Quick Read】: The problem studied is that sparse autoencoders (SAEs), though established for interpreting text-based large language models (LLMs), have no equivalent studies on audio-processing models such as automatic speech recognizers (ASRs). The key to the solution is applying an SAE to Whisper, a Transformer-based ASR: a high-dimensional sparse latent space is trained on frame-level embeddings extracted from the Whisper encoder, through which the work uncovers diverse monosemantic features across linguistic and non-linguistic boundaries and demonstrates cross-lingual feature steering, establishing the viability of SAEs in the audio domain and showing that Whisper encodes a rich amount of linguistic information.

Link: https://arxiv.org/abs/2605.12225
Authors: Dan Pluth, Zachary Nicholas Houghton, Yu Zhou, Vijay K. Gurbani
Affiliations: Vail Systems, Inc.; University of Oregon
Subjects: Computation and Language (cs.CL)
Comments: 10 pages + references and appendix

Click to view abstract

Abstract:Understanding the internal machinations of deep Transformer-based NLP models is more crucial than ever as these models see widespread use in various domains that affect the public at large, such as industry, academia, finance, health. While these models have advanced rapidly, their internal mechanisms remain largely a mystery. Techniques such as Sparse Autoencoders (SAE) have emerged to understand these mechanisms by projecting dense representations into a sparse vector. While existing research has demonstrated the viability of the SAE in interpreting text-based Large Language Models (LLMs), there are no equivalent studies that demonstrate the application of a SAE to audio processing models like Automatic Speech Recognizers (ASRs). In this work, a SAE is applied to Whisper, a Transformer-based ASR, training a high-dimensional sparse latent space on frame-level embeddings extracted from the Whisper encoder. Our work uncovers diverse monosemantic features across linguistic and non-linguistic boundaries, and demonstrates cross-lingual feature steering. This work establishes the viability of a SAE model and demonstrates that Whisper encodes a rich amount of linguistic information.
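The core SAE computation is a ReLU encoder into an overcomplete latent space, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. Below is a minimal single-frame forward pass with toy dimensions and hand-picked weights (all assumptions); a real SAE over Whisper frame embeddings would be trained by gradient descent on many frames.

```python
def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sae_forward(x, W_enc, W_dec, l1=0.1):
    """One SAE forward pass: sparse code, reconstruction, total loss."""
    z = relu(matvec(W_enc, x))   # sparse latent code
    x_hat = matvec(W_dec, z)     # linear reconstruction
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in z)
    return z, recon + l1 * sparsity

# Overcomplete latent: 4 units for a 2-d input; ReLU zeroes half of
# the signed feature pairs, so the code is sparse by construction.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
z, loss = sae_forward([0.5, -0.25], W_enc, W_dec)
assert z == [0.5, 0.0, 0.0, 0.25]      # only two of four units fire
assert abs(loss - 0.1 * 0.75) < 1e-12  # perfect reconstruction, L1 only
```

Monosemanticity analysis then asks which inputs (here, which audio frames) activate each latent unit, and feature steering adds a chosen unit's decoder direction back into the residual stream.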

[NLP-27] Not How Many But Which: Parameter Placement in Low-Rank Adaptation

【Quick Read】: This paper studies the parameter placement problem: given a fixed budget of k trainable entries in the B matrix of a LoRA adapter (with A frozen), does the choice of which k entries matter across training regimes (supervised fine-tuning vs. GRPO-based reinforcement learning), and in particular why does random placement fail to improve over the base model under GRPO? The key insight is a fundamental difference in gradient structure between the two regimes: SFT gradients are low-rank and directionally stable, so any random subset accumulates coherent updates, whereas GRPO gradients are high-rank and near-orthogonal across steps, so only parameter elements with consistently signed gradients retain the learning signal. Building on this, the paper proposes a very cheap gradient-scoring procedure that measures gradient sign consistency over early training steps, identifying critical parameters in under 10 seconds at less than 0.5% of training cost; these parameters concentrate on residual-stream-writing projections (V, O, Down), a pattern stable across model families and scales (1.5B-8B), and training only these parameters recovers standard LoRA accuracy under GRPO.

Link: https://arxiv.org/abs/2605.12207
Authors: Arijit Sehanobish, Charles Lovering
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Comments welcome

Click to view abstract

Abstract:We study the parameter placement problem: given a fixed budget of k trainable entries within the B matrix of a LoRA adapter (A frozen), does the choice of which k matter? Under supervised fine-tuning, random and informed subsets achieve comparable performance. Under GRPO on base models, random placement fails to improve over the base model, while gradient-informed placement recovers standard LoRA accuracy. This regime dependence traces to gradient structure: SFT gradients are low-rank and directionally stable, so any subset accumulates coherent updates; GRPO gradients are high-rank and near-orthogonal across steps, so only elements with consistently signed gradients retain the learning signal. Our scoring procedure identifies these critical parameters in under 10 seconds at less than 0.5% of training cost. Selected parameters concentrate on residual-stream-writing projections (V, O, Down), stable across model families and scales (1.5B - 8B).
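A sign-consistency score of the kind the abstract describes can be sketched as follows: for each parameter, measure how often its per-step gradient shares the majority sign, then keep the top-k most consistent entries. The exact scoring rule in the paper may differ; this is an illustrative version.

```python
def sign_consistency(grad_steps):
    """Fraction of steps on which each parameter's gradient has the
    majority sign. grad_steps: list of per-step gradient vectors."""
    n_params = len(grad_steps[0])
    scores = []
    for i in range(n_params):
        signs = [1 if g[i] > 0 else -1 for g in grad_steps]
        pos = sum(1 for s in signs if s > 0)
        scores.append(max(pos, len(signs) - pos) / len(signs))
    return scores

def top_k(scores, k):
    """Indices of the k most sign-consistent parameters."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Parameter 0 has a consistent sign across steps; parameter 1 flips
# every step, so under near-orthogonal (GRPO-like) gradients its
# accumulated update would cancel out.
grads = [[0.3, 0.5], [0.2, -0.4], [0.4, 0.6], [0.1, -0.5]]
scores = sign_consistency(grads)
assert scores[0] == 1.0 and scores[1] == 0.5
assert top_k(scores, 1) == [0]
```

Because the score only needs a handful of early gradient evaluations, it is cheap relative to training, which matches the sub-10-second, sub-0.5%-cost claim in spirit.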

[NLP-28] Mitigating Context-Memory Conflicts in LLMs through Dynamic Cognitive Reconciliation Decoding

【Quick Read】: This paper addresses knowledge conflicts in large language models, where outdated or incorrect parametric knowledge accumulated during pre-training conflicts with external knowledge in the context. Existing approaches are flawed: static contrastive decoding disrupts the output distribution in conflict-free scenarios, while dynamic decoding methods that try to measure the degree of conflict still struggle with complex real-world cases. The key to the solution is Dynamic Cognitive Reconciliation Decoding (DCRD), a two-stage decoding method: it first analyzes the attention map to assess context fidelity and predict potential conflicts, then routes the input to one of two decoding paths depending on the prediction, greedy decoding for conflict-free cases (preserving efficiency and accuracy) or context fidelity-based dynamic decoding for conflict cases (mitigating the conflict). This design lets DCRD perform well in both conflict and conflict-free scenarios; on this basis, the authors also construct ConflictKG, a knowledge-conflict QA benchmark simulating frequent knowledge updates, on which DCRD outperforms all baselines across four LLMs and six QA datasets.

Link: https://arxiv.org/abs/2605.12185
Authors: Yigeng Zhou, Wu Li, Yifan Lu, Yequan Wang, Xuebo Liu, Wenya Wang, Jun Yu, Min Zhang, Jing Li
Affiliations: Harbin Institute of Technology, Shenzhen; Beijing Academy of Artificial Intelligence; Nanyang Technological University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE TASLP

Click to view abstract

Abstract:Large language models accumulate extensive parametric knowledge through pre-training. However, knowledge conflicts occur when outdated or incorrect parametric knowledge conflicts with external knowledge in the context. Existing methods address knowledge conflicts through contrastive decoding, but in conflict-free scenarios, static approaches disrupt output distribution. Other dynamic decoding methods attempt to measure the degree of conflict but still struggle with complex real-world situations. In this paper, we propose a two-stage decoding method called Dynamic Cognitive Reconciliation Decoding (DCRD), to predict and mitigate context-memory conflicts. DCRD first analyzes the attention map to assess context fidelity and predict potential conflicts. Based on this prediction, the input is directed to one of two decoding paths: (1) greedy decoding, or (2) context fidelity-based dynamic decoding. This design enables DCRD to handle conflicts efficiently while maintaining high accuracy and decoding efficiency in conflict-free cases. Additionally, to simulate scenarios with frequent knowledge updates, we constructed ConflictKG, a knowledge conflict QA benchmark. Experiments on four LLMs across six QA datasets show that DCRD outperforms all baselines, achieving state-of-the-art performance.
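The two-stage routing idea can be sketched with a minimal fidelity statistic: treat the share of next-token attention mass that lands on context tokens as the fidelity score, and pick a decoding path by threshold. Both the statistic and the threshold are illustrative assumptions, not the paper's exact formulation.

```python
def context_fidelity(attn_weights, context_positions):
    """Share of attention mass on context tokens.
    attn_weights: attention over input positions for the next token."""
    total = sum(attn_weights)
    on_context = sum(attn_weights[i] for i in context_positions)
    return on_context / total

def route(attn_weights, context_positions, threshold=0.5):
    """Stage 1: predict conflict from attention; stage 2: pick a path."""
    fidelity = context_fidelity(attn_weights, context_positions)
    # Low attention to the context suggests the model is leaning on
    # parametric memory, i.e. a potential context-memory conflict.
    return "greedy" if fidelity >= threshold else "dynamic"

attn = [0.05, 0.60, 0.25, 0.10]   # positions 1-2 hold the context
assert route(attn, context_positions=[1, 2]) == "greedy"   # fidelity 0.85
attn = [0.70, 0.10, 0.05, 0.15]
assert route(attn, context_positions=[1, 2]) == "dynamic"  # fidelity 0.15
```

Routing most inputs to plain greedy decoding is what preserves efficiency and the unperturbed output distribution in the conflict-free case, the failure mode of static contrastive decoding.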

[NLP-29] Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics

【Quick Read】: This paper asks whether agents in enterprise systems still need to learn world models when tenant-specific business logic varies across deployments and evolves over time, making world models trained on historical transitions brittle under deployment shift; in particular, it poses a question the world-models literature has not addressed: when the transition rules can be read at inference time, does an agent still need to learn them? The key to the solution is runtime discovery: enterprise discovery agents recover the relevant transition dynamics at inference time by reading the system's configuration, grounding predictions in the active system instance rather than relying solely on internalized representations learned offline. Using the introduced CascadeBench benchmark and deployment-shift evaluations, the paper shows that offline-trained world models can perform well in-distribution but degrade as dynamics change, whereas discovery-based agents remain robust under shift.

Link: https://arxiv.org/abs/2605.12178
Authors: Jishnu Sethumadhavan Nair, Patrice Bechard, Rishabh Maheshwary, Surajit Dasgupta, Sravan Ramachandran, Aakash Bhagat, Shruthan Radhakrishna, Pulkit Pattnaik, Johan Obando-Ceron, Shiva Krishna Reddy Malay, Sagar Davasam, Seganrasan Subramanian, Vipul Mittal, Sridhar Krishna Nemala, Christopher Pal, Srinivas Sunkara, Sai Rajeswar
Affiliations: ServiceNow
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:World models enable agents to anticipate the effects of their actions by internalizing environment dynamics. In enterprise systems, however, these dynamics are often defined by tenant-specific business logic that varies across deployments and evolves over time, making models trained on historical transitions brittle under deployment shift. We ask a question the world-models literature has not addressed: when the rules can be read at inference time, does an agent still need to learn them? We argue, and demonstrate empirically, that in settings where transition dynamics are configurable and readable, runtime discovery complements offline training by grounding predictions in the active system instance. We propose enterprise discovery agents, which recover relevant transition dynamics at runtime by reading the system’s configuration rather than relying solely on internalized representations. We introduce CascadeBench, a reasoning-focused benchmark for enterprise cascade prediction that adopts the evaluation methodology of World of Workflows on diverse synthetic environments, and use it together with deployment-shift evaluation to show that offline-trained world models can perform well in-distribution but degrade as dynamics change, whereas discovery-based agents are more robust under shift by grounding their predictions in the current instance. Our findings suggest that, in configurable enterprise environments, agents should not rely solely on fixed internalized dynamics, but should incorporate mechanisms for discovering relevant transition logic at runtime.

[NLP-30] Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach

【Quick Read】: This paper tackles the severe bias in quality estimation for production LLM deployments caused by non-random user feedback concentrated in the tails of the satisfaction distribution (a naive average can land 40-50 percentage points away from true system quality). The key to the solution is a three-agent hierarchical Bayesian pipeline that requires no ground-truth labels on individual interactions: a Topic Clustering Agent partitions the stream with UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial model under NUTS, inferring per-topic selection rates s_c and true quality q_c with partial pooling; and a Synthesis Agent reweights q_c by true topic prevalence \hat\pi_c = n_c/N to report a bias-corrected aggregate posterior \bar Q with a credible interval. What breaks the parameter non-identifiability is a mild prior on the feedback channel (the typical positive-feedback rate and negative-to-positive ratio, both readable from production dashboards); without it, the per-cluster sufficient statistics admit a one-parameter family of equally good fits and weak-prior variants miss true quality by 22-33 percentage points, whereas the channel-informed hierarchical model (Hierarchical-Informed) stays within 4-13 points of Q^\star = 0.6249 as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering Q^\star across 50/50 random-seed replicates.

Link: https://arxiv.org/abs/2605.12177
Authors: Andrea Morandi, Mahesh Viswanathan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:[Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from true system quality. We treat this as a topic- and sentiment- stratified selection-bias problem and propose a three-agent hierarchical Bayesian pipeline that does not require ground-truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial under NUTS, inferring per-topic selection rates s_c and quality q_c with partial pooling; a Synthesis Agent reweights q_c by true topic prevalence \hat\pi_c = n_c/N to report a bias-corrected aggregate posterior \bar Q = \sum_c \hat\pi_c q_c with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback (N=10,232 retained interactions, C=18 clusters, Q^\star=0.6249 ) with simulated topic- and sentiment-dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio, both readable from any production dashboard without labels) keeps Hierarchical-Informed within 4-13 pp of Q^\star as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering Q^\star in 50/50 random-seed replicates at \kappa_\max=10 . Without channel-side priors, every weak-prior variant misses Q^\star by 22-33 pp: the per-cluster sufficient statistics admit a one-parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.
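The prevalence-reweighting step \bar Q = \sum_c \hat\pi_c q_c can be illustrated with two made-up clusters: the naive average over received feedback is dominated by clusters with high selection rates, while the corrected estimate weights per-cluster quality by true prevalence n_c/N. All numbers below are invented to show the mechanism, not taken from the paper.

```python
clusters = [
    # (n_interactions, selection_rate, true_quality)
    (8000, 0.01, 0.80),   # happy-path topic: rarely leaves feedback
    (2000, 0.20, 0.20),   # failure topic: complains often
]
N = sum(n for n, _, _ in clusters)

# Naive: quality weighted by how much feedback each cluster sends.
fb = [(n * s, q) for n, s, q in clusters]
naive = sum(w * q for w, q in fb) / sum(w for w, _ in fb)

# Corrected: quality weighted by true prevalence pi_c = n_c / N.
corrected = sum((n / N) * q for n, _, q in clusters)

assert abs(corrected - 0.68) < 1e-12
assert naive < 0.45  # feedback volume drags the naive estimate down
```

In the pipeline, q_c and s_c are posteriors from the Beta-Binomial model rather than known values, but the direction of the bias is the same: over-sampled unhappy topics pull the naive average far below \bar Q.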

[NLP-31] Latent Causal Void: Explicit Missing-Context Reconstruction for Misinformation Detection

【Quick Read】: This paper targets a hard case of automatic misinformation detection: articles that remain locally coherent but mislead by omitting background facts supplied by contemporaneous reports. Existing omission-aware methods either attach retrieved context as auxiliary evidence or infer a categorical omission signal, leaving the specific missing fact implicit. The key to the solution is the Latent Causal Void (LCV) framework: it retrieves temporally aligned context articles, asks a frozen instruction-tuned LLM to generate a short missing-context description for each target-sentence/context-article pair, and feeds this relation text as a cross-source relation into a heterogeneous graph over target sentences and context articles for reasoning. By explicitly reconstructing the missing fact itself, rather than merely attaching retrieved evidence or predicting an omission signal, LCV improves over the strongest omission-aware baseline by 2.56 and 2.84 macro-F1 points on the English and Chinese splits of a bilingual benchmark, confirming the value of this representation for omission-aware misinformation detection.

Link: https://arxiv.org/abs/2605.12156
Authors: Hui Li, Zhongquan Jian, Jinsong Su, Junfeng Yao
Affiliations: School of Informatics, Xiamen University; School of Computer and Data Science, Minjiang University
Subjects: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments:

Click to view abstract

Abstract:Automatic misinformation detection performs well when deception is visible in what an article explicitly states. However, some misinformation articles remain locally coherent and only become misleading once compared with contemporaneous reports that supply background facts the article omits. We study this omission-relevant setting and observe that current omission-aware approaches typically either attach retrieved context as auxiliary evidence or infer a categorical omission signal, leaving the specific missing fact implicit. We propose Latent Causal Void (LCV), a retrieval-guided detector that explicitly reconstructs the missing fact for each target sentence and uses it as a textual cross-source relation in graph reasoning. Concretely, LCV retrieves temporally aligned context articles, asks a frozen instruction-tuned large language model to generate a short missing-context description for each sentence–article pair, and feeds the resulting relation text into a heterograph over target sentences and context articles. On the bilingual benchmark of Sheng et al., LCV improves over the strongest omission-aware baseline by 2.56 and 2.84 macro-F1 points on the English and Chinese splits, respectively. The results indicate that modeling the missing cross-source fact itself, rather than only attaching retrieved evidence or predicting an omission signal, is a useful representation for omission-aware misinformation detection.

[NLP-32] Metaphor Is Not All Attention Needs

【Quick Read】: This paper investigates why literary jailbreaks such as poetic transformation succeed in bypassing the safety mechanisms of large language models. The key to the solution is an interpretability analysis of attention patterns: the authors run input-level ablation studies of individual and combined poetic devices, construct an interpretable vector representation of attention maps, cluster these representations, and train linear probes to predict safety outcomes and literary format. The results show that models distinguish poetic from prose formats with high accuracy yet struggle to predict jailbreak success within each format, and clustering separates literary formats cleanly but not safety labels. The conclusion is that literary jailbreaks succeed not because the model fails to recognize poetic formatting, but because accumulated stylistic irregularities alter prompt processing and avoid the lexical triggers considered during post-training, implying that robust safety mechanisms must account for style-induced shifts in model behavior. Qwen3-14B serves as a representative open-weight case study.

Link: https://arxiv.org/abs/2605.12128
Authors: Olga Sorokoletova, Francesco Giarrusso, Giacomo De Luca, Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Marcello Galisai, Vincenzo Suriani, Daniele Nardi
Affiliations: Sapienza University of Rome; Department of Computer, Control and Management Engineering; DEXAI – Icaro Lab; University of Rome Tor Vergata; Sant'Anna School of Advanced Studies
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual and combinations of poetic devices; construct an interpretable vector representation of attention maps; cluster these representations and train linear probes to predict safety outcomes and literary format. Our results show that models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering further reveals clear separation by literary format, but not by safety label. These findings indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Overall, literary jailbreaks appear to misalign large language models not through any single poetic device, but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training. This suggests that robustness requires safety mechanisms that account for style-induced shifts in model behavior. We use Qwen3-14B as a representative open-weight case study.
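One way to build an "interpretable vector representation of attention maps" is to reduce each map to fixed-length summary statistics that can then be clustered or probed linearly. The specific statistics below (per-row entropy and peak mass) are assumptions for illustration, not the paper's exact feature set.

```python
import math

def attn_features(attn):
    """Summarize one attention map as [mean row entropy, mean peak mass].
    attn: list of rows, each a probability distribution over keys."""
    entropies, peaks = [], []
    for row in attn:
        entropies.append(-sum(p * math.log(p) for p in row if p > 0))
        peaks.append(max(row))
    n = len(attn)
    return [sum(entropies) / n, sum(peaks) / n]

uniform = [[0.25] * 4] * 4             # maximally diffuse attention
focused = [[1.0, 0.0, 0.0, 0.0]] * 4   # fully peaked attention
ent_u, peak_u = attn_features(uniform)
ent_f, peak_f = attn_features(focused)
assert abs(ent_u - math.log(4)) < 1e-12 and peak_u == 0.25
assert ent_f == 0.0 and peak_f == 1.0
```

Concatenating such statistics across layers and heads yields one vector per prompt, on which clustering and linear probes of the kind the paper describes can be run.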

[NLP-33] Sign Language Recognition and Translation for Low-Resource Languages: Challenges and Pathways Forward

【Quick Read】: This paper addresses automatic sign language recognition and translation for the more than 300 severely low-resource sign languages worldwide, using Azerbaijan Sign Language (AzSL) as a case study; the core difficulties are limited documentation, sparse datasets, and a lack of dedicated computational tools. The key to the solution lies in three paradigm shifts: from architecture-centric to data-centric AI, built on community co-design, dialectal diversity capture, and privacy-preserving pose-based representations; from signer-independent to signer-adaptive systems, to handle individual variation; and from reference-based to task-specific evaluation metrics. Technically, the roadmap relies on lightweight MediaPipe-based architectures, community-validated annotations, and offline-first deployment, and exploits the linguistic proximity of Turkic sign languages (Kazakh, Turkish, Azerbaijani) for effective transfer learning, with sustained interdisciplinary collaboration centered on Deaf communities to ensure cultural authenticity, ethical governance, and practical communication benefit.

Link: https://arxiv.org/abs/2605.12096
Authors: Nigar Alishzade, Gulchin Abdullayeva
Affiliations: Engineering Faculty of Karabakh University, Khankendi, Azerbaijan; MSERA Institute of Mathematics, Baku, Azerbaijan
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Sign languages are natural, visual-gestural languages used by Deaf communities worldwide. Over 300 distinct sign languages remain severely low-resource due to limited documentation, sparse datasets, and insufficient computational tools. This systematic review synthesizes literature on sign language recognition and translation for under-resourced languages, using Azerbaijan Sign Language (AzSL) as a case study. Analysis of global initiatives extracts eight actionable lessons, including community co-design, dialectal diversity capture, and privacy-preserving pose-based representations. Turkic sign languages (Kazakh, Turkish, Azerbaijani) receive special attention, as linguistic proximity enables effective transfer learning. We propose three paradigm shifts: from architecture-centric to data-centric AI, from signer-independent to signer-adaptive systems, and from reference-based to task-specific evaluation metrics. A technical roadmap for AzSL leverages lightweight MediaPipe-based architectures, community-validated annotations, and offline-first deployment. Progress requires sustained interdisciplinary collaboration centered on Deaf communities to ensure cultural authenticity, ethical governance, and practical communication benefit.

[NLP-34] World Action Models: The Next Frontier in Embodied AI

【速读】: 该论文试图解决 Vision-Language-Action (VLA) 模型在具身策略学习中仅学习反应式的观测-动作映射(reactive observation-to-action mappings),而未明确建模物理世界在外界干预下的演化过程这一根本性缺陷,同时应对当前将世界模型(world model)与动作生成结合的文献在研究架构、学习目标和应用场景上高度碎片化、缺乏统一概念框架的问题。解决方案的核心在于提出并形式化定义 World Action Models (WAMs) 这一新兴范式——即统一预测性状态建模与动作生成的具身基础模型,并据此构建一个结构化的分类体系:将现有方法划分为级联 WAMs (Cascaded WAMs) 和联合 WAMs (Joint WAMs),进一步按生成模态、条件机制和动作解码策略进行细分,同时系统梳理支撑 WAMs 发展的数据生态(涵盖机器人遥操作、便携式人类演示、仿真和互联网规模的第一人称视频)以及围绕视觉保真度、物理常识和动作合理性的评估协议,从而为该领域提供首个系统性的完整图谱,厘清关键架构范式及其权衡,并指明开放挑战与未来机遇。

链接: https://arxiv.org/abs/2605.12090
作者: Siyin Wang,Junhao Shi,Zhaoyang Fu,Xinzhe He,Feihong Liu,Chenchen Yang,Yikang Zhou,Zhaoye Fei,Jingjing Gong,Jinlan Fu,Mike Zheng Shou,Xuanjing Huang,Xipeng Qiu,Yu-Gang Jiang
机构: 未知
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.

[NLP-35] Do Language Models Encode Knowledge of Linguistic Constraint Violations?

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models,LLMs)内部是否编码了专门用于检测语法约束违反(grammatical constraint violations)的表示机制,即是否存在一组统一的、选择性激活的“语法违反检测器”。解决方案之关键包括三个方面:首先,使用稀疏自编码器(sparse autoencoders)将多语义(polysemantic)激活分解为稀疏的单语义(monosemantic)特征,从而恢复可能与违反相关的特征候选;其次,引入一个敏感性得分(sensitivity score),用于无监督地识别那些在违反约束的输入上比在合法输入上优先激活的特征;最后,提出一个合取的证伪框架(conjunctive falsification framework),联合评估三个标准(例如特征的选择性、因果性和跨现象一致性),以检验这些特征是否构成统一的检测器。然而,实验结果表明,这些标准在各语言现象上并未同时满足,也未发现跨所有类别一致共享的特征,因此当前模型中有统一的语法违反检测器的假设未能得到支持。

链接: https://arxiv.org/abs/2605.12055
作者: Hardy,Sebastian Padó
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) achieve strong linguistic performance, yet their internal mechanisms for producing these predictions remain unclear. We investigate the hypothesis that LLMs encode representations of linguistic constraint violations within their parameters, which are selectively activated when processing ungrammatical sentences. To test this, we use sparse autoencoders to decompose polysemantic activations into sparse, monosemantic features and recover candidates for violation-related features. We introduce a sensitivity score for identifying features that are preferentially activated on constraint-violated versus well-formed inputs, enabling unsupervised detection of potential violation-specific features. We further propose a conjunctive falsification framework with three criteria evaluated jointly. Overall, the results are negative in two respects: (1) the falsification criteria are not jointly satisfied across linguistic phenomena, and (2) no features are consistently shared across all categories. While some phenomena show partial evidence of selective causal structure, the overall pattern provides limited support for a unified set of grammatical violation detectors in current LMs.
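下面给出一个极简的 Python 草图,示意论文中"敏感性得分"的一种可能实现:对某一 SAE 特征,比较其在违反约束句与合法句上的平均激活差并做简单归一化。打分形式与数值均为示意性假设,并非论文的原始定义。

```python
import statistics

def sensitivity_score(acts_violated, acts_wellformed, eps=1e-8):
    # 特征在违反句与合法句上的平均激活差,做简单归一化;
    # 得分为正表示该特征更偏好在违反约束的输入上激活。
    mu_v = statistics.fmean(acts_violated)
    mu_w = statistics.fmean(acts_wellformed)
    return (mu_v - mu_w) / (abs(mu_v) + abs(mu_w) + eps)

# 玩具数据:两个候选特征在若干句子上的激活值(示意)
violation_feature = sensitivity_score([0.9, 0.8, 1.0], [0.1, 0.0, 0.2])
neutral_feature = sensitivity_score([0.5, 0.4], [0.5, 0.6])
```

得分接近 +1 的特征即为"违反特异性"候选,可进一步送入论文的合取证伪框架检验。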

[NLP-36] Is Child-Directed Language Optimized for Word Learning? A Computational Study of Verb Meaning Acquisition

【速读】: 该论文旨在探究儿童导向语言(child-directed language, CDL)是否具有针对语言学习的优化特性,尤其是它如何促进动词意义的习得,以及这种优势是CDL独有的还是源自口语语域的共性。解决方案的关键在于:使用神经网络语言模型分别以CDL和成人导向语言(adult-directed language, ADL)进行训练,并通过从训练数据中选择性移除句法或词汇共现信息来操纵输入,从而评估这些扰动对动词意义习得的影响。实验发现,破坏句法结构会损害所有数据集上的学习,但基于CDL和口语ADL训练的模型表现出比基于书面输入训练的模型显著更高的韧性;同时,追踪语义和句法表现时观察到语义优先的发展轨迹,即动词意义早于稳定的句法熟练度出现,这种异步性在口语语域(尤其是CDL)中最为突出。这些结果表明,先前归因于CDL的动词学习优势可能实际上反映了口语语域的更广泛特性,而非CDL特有的优化。

链接: https://arxiv.org/abs/2605.12047
作者: Francesca Padovani,Jaap Jumelet,Yevgen Matusevych,Arianna Bisazza
机构: Center for Language and Cognition (CLCG), University of Groningen (格罗宁根大学语言与认知中心(CLCG))
类目: Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:Is child-directed language (CDL) optimized to support language learning, and which aspects of linguistic development does it facilitate? We investigate this question using neural language models trained on CDL versus adult-directed language (ADL). We selectively remove syntactic or lexical co-occurrence information from the model training data, and evaluate the impact of these manipulations on verb meaning acquisition. While disrupting syntax impairs learning across all datasets, models trained on CDL and spoken ADL show significantly higher resilience than those trained on written input. Tracking semantic and syntactic performance over training, we observe a semantic-first trajectory, with verb meanings emerging prior to robust syntactic proficiency, an asynchrony most pronounced in the spoken domain, especially CDL. These results suggest that the advantage for verb learning previously attributed to CDL may instead reflect broader properties of the spoken register, rather than a uniquely CDL-specific optimization.

[NLP-37] SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

【速读】: 该论文试图解决现有技能库(skill libraries)在组合任务中存在的两个关键挑战:一是智能体不仅需要识别相关技能,还需理解技能之间的依赖和构建关系;二是缺乏结构线索导致库维护困难(如技能合并、拆分、删除)。解决方案的关键在于提出SKILLGRAPH框架,将可复用技能表示为有向图(directed graph)中的节点,并通过类型边(typed edges)编码前提(prerequisite)、增强(enhancement)和共现(co-occurrence)关系。针对新任务,SKILLGRAPH不仅能检索单个技能,还能检索有序的技能子图(ordered skill subgraph),以指导多步决策。该图结构通过智能体轨迹和强化学习反馈进行持续更新,使技能库与智能体策略协同改进。

链接: https://arxiv.org/abs/2605.12039
作者: Xiaoyuan Li,Moxin Li,Keqin Bao,Yubo Ma,Wenjie Wang,Dayiheng Liu,Fuli Feng
机构: University of Science and Technology of China(中国科学技术大学); Alibaba Group(阿里巴巴集团); National University of Singapore(新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. Firstly, an agent must identify not only relevant skills but also how they depend on and build upon each other. Secondly, it also makes library maintenance difficult, since the system lacks structural cues for deciding when skills should be merged, split, or removed. We propose SKILLGRAPH, a framework that represents reusable skills as nodes in a directed graph, with typed edges encoding prerequisite, enhancement, and co-occurrence relations. Given a new task, SKILLGRAPH retrieves not just individual skills, but an ordered skill subgraph that can guide multi-step decision making. The graph is continuously updated from agent trajectories and reinforcement learning feedback, allowing both the skill library and the agent policy to improve together. Experiments on ALFWorld, WebShop, and seven search-augmented QA tasks show that SKILLGRAPH achieves state-of-the-art performance against memory-augmented RL methods, with especially large gains on complex tasks that require composing multiple skills.
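下面用一个极简 Python 草图示意"带类型边的技能有向图 + 有序技能子图检索"的思路:仅沿 prerequisite 边做深度优先遍历,按依赖先行的顺序返回子图。技能名与边均为示意性假设,并非论文数据。

```python
from collections import defaultdict

# 玩具技能图:(技能, 关系类型, 依赖技能),关系类型含 prerequisite / enhancement 等
edges = [
    ("open_drawer", "prerequisite", "find_key"),
    ("find_key", "prerequisite", "search_room"),
    ("open_drawer", "enhancement", "use_flashlight"),
]

prereq = defaultdict(list)
for skill, rel, dep in edges:
    if rel == "prerequisite":
        prereq[skill].append(dep)

def ordered_subgraph(target):
    # 返回 target 的前提闭包,依赖在前(拓扑序)
    order, seen = [], set()
    def visit(s):
        if s in seen:
            return
        seen.add(s)
        for dep in prereq[s]:
            visit(dep)
        order.append(s)
    visit(target)
    return order

plan = ordered_subgraph("open_drawer")
```

检索得到的有序子图(先 search_room,再 find_key,最后 open_drawer)可直接作为多步决策的执行计划。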

[NLP-38] SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

【速读】: 该论文试图解决大型语言模型在现有知识评估基准上表现良好,但面对问题变体时知识能力脆弱的问题,而当前基于LLM辅助的“生成-再验证”流水线因变体生成效率低且验证不可靠,导致鲁棒性增强成本高昂、难以规模化。解决方案的关键在于提出SAGE框架,该框架使用微调的小模型,具体包括:1)VariantQual,一个基于评分标准(rubric)的验证器,在人工标注种子数据上训练,用于可靠地评估变体质量;2)VariantGen,一个变体生成器,通过监督微调初始化,再以VariantQual作为奖励模型进行强化学习优化,从而高效生成高质量、多样化的变体。通过这种组合,SAGE能够以远低于人工标注的成本构建大规模、高质量且可泛化的鲁棒性增强基准,解决了原有方法在效率和可靠性上的瓶颈。

链接: https://arxiv.org/abs/2605.12022
作者: Xiaoyuan Li,Yuzhe Wang,Moxin Li,Keqin Bao,Rui Men,Yichang Zhang,Dayiheng Liu,Wenjie Wang,Fuli Feng
机构: University of Science and Technology of China (中国科学技术大学); Alibaba Group (阿里巴巴集团); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is therefore necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification. We propose SAGE (Scalable Automated Generation of Robustness BEnchmarks), a framework for scalable robustness augmentation of knowledge evaluation benchmarks using fine-tuned smaller models. SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag show that SAGE constructs a large-scale robustness-augmented benchmark with quality comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models further generalize to MMLU without benchmark-specific fine-tuning.

[NLP-39] Learning Agentic Policy from Action Guidance

【速读】: 该论文试图解决大型语言模型(LLMs)在智能体强化学习(Agentic RL)中因基础策略探索能力不足而无法触及奖励状态(reward states)的问题,即当基础策略无法生成有效训练信号时,现有方法依赖昂贵的迭代式监督微调(SFT)来恢复学习信号。解决方案的关键在于提出ActGuide-RL方法,通过注入来自日常人类交互的丰富动作数据作为计划式参考指导(plan-style reference guidance),使智能体策略能够克服到达奖励状态的可达性障碍。该方法采用混合策略训练(mixed-policy training)联合优化有指导和无指导的 rollout,从而将探索增益内化至无指导策略中。基于对收益-风险权衡的理论与实证分析,ActGuide-RL遵循最小干预原则(minimal intervention principle),仅在必要时作为自适应后备调用指导,以匹配任务难度并降低离策略风险(off-policy risk)。这一范式显著减少了对大规模SFT数据的依赖,在搜索智能体基准测试中取得了与SFT+RL管线相当的性能,而无需冷启动。

链接: https://arxiv.org/abs/2605.12004
作者: Yuxiang Ji,Zengbin Wang,Yong Wang,Shidong Yang,Ziyu Ma,Guanhua Chen,Zonghua Sun,Liaoni Wu,Xiangxiang Chu
机构: Xiamen University (厦门大学); AMAP, Alibaba Group (AMAP, 阿里巴巴集团); Southern University of Science and Technology (南方科技大学)
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional training or external guidance is needed to recover effective learning signals. Rather than relying on costly iterative supervised fine-tuning (SFT), we exploit the abundant action data generated in everyday human interactions. We propose ActGuide-RL, which injects action data as plan-style reference guidance, enabling the agentic policy to overcome reachability barriers to reward states. Guided and unguided rollouts are then jointly optimized via mixed-policy training, internalizing the exploration gains back into the unguided policy. Motivated by a theoretical and empirical analysis of the benefit-risk trade-off, we adopt a minimal intervention principle that invokes guidance only as an adaptive fallback, matching task difficulty while minimizing off-policy risk. On search-agent benchmarks, ActGuide-RL substantially improves over zero RL (+10.7 pp on GAIA and +19 pp on XBench with Qwen3-4B), and performs on par with the SFT+RL pipeline without any cold start. This suggests a new paradigm for agentic RL that reduces the reliance on heavy SFT data by using scalable action guidance instead.
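论文的"最小干预"回退逻辑可以用如下 Python 草图示意:先采样无指导 rollout,只有当没有任何一条达到奖励状态时,才调用带计划式指导的采样。采样函数与奖励取值均为示意性假设。

```python
def collect_rollouts(sample_unguided, sample_guided, n=4):
    # 最小干预:先尝试无指导 rollout,全部拿不到奖励时才回退到带指导采样
    rollouts = [sample_unguided() for _ in range(n)]
    if any(reward > 0 for _, reward in rollouts):
        return rollouts, False   # 任务在基础策略能力范围内,无需指导
    return [sample_guided() for _ in range(n)], True

# 玩具采样器:返回 (轨迹, 奖励);困难任务下无指导策略始终拿不到奖励
easy, used_guidance_easy = collect_rollouts(lambda: ("traj", 1.0), lambda: ("traj", 1.0))
hard, used_guidance_hard = collect_rollouts(lambda: ("traj", 0.0), lambda: ("traj", 1.0))
```

收集到的有指导/无指导 rollout 随后进入混合策略训练,将探索增益内化回无指导策略。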

[NLP-40] Towards Visually-Guided Movie Subtitle Translation for Indic Languages

【速读】: 该论文试图解决电影字幕翻译中纯文本系统因缺失视觉模态信息(如情感、动作和社会细微差异)而导致的翻译质量不足问题,尤其针对英语到印地语、孟加拉语、泰卢固语、泰米尔语和卡纳达语等低资源印度语言的翻译任务。解决方案的关键在于提出两种轻量级视觉接地策略:一是基于5分钟滑动窗口的结构化属性摘要,二是对字幕间视觉间隙进行自由文本摘要;并进一步发现,由于长视频中字幕与帧之间存在时间错位,漫无边际的视觉接地往往低效,而“神谕选择性接地”(oracle selective grounding)——即仅将基线中质量最低的20-30%片段替换为视觉增强输出——能在显著减少视觉处理开销的同时持续提升COMET评分。其中,基于粗略属性的视觉上下文总结(coarse attribute-based visual context summarization)更为稳健,能有效捕捉纯文本系统常常忽略的场景级情感和上下文微妙线索。

链接: https://arxiv.org/abs/2605.11993
作者: Tarun Chintada,Kshetrimayum Boynao Singh,Asif Ekbal
机构: Department of Computer Science and Engineering(计算机科学与工程系); Indian Institute of Technology Patna(印度理工学院巴特那分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding, which replaces only the lowest-quality 20-30% of baseline segments with visual-enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Among the two approaches, coarse attribute-based visual context summarization is more robust, capturing scene-level emotion and subtle contextual cues that text alone often misses.
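"神谕选择性接地"的替换逻辑可以用下面的 Python 草图示意:仅把基线中得分最低的一部分片段换成视觉增强版本。其中的分数代表逐片段翻译质量(如 COMET),数值均为示意。

```python
def selective_grounding(baseline_scores, visual_scores, frac=0.25):
    # 只替换基线中质量最低的 frac 比例的片段
    n = len(baseline_scores)
    k = max(1, int(n * frac))
    weakest = sorted(range(n), key=lambda i: baseline_scores[i])[:k]
    merged = list(baseline_scores)
    for i in weakest:
        merged[i] = visual_scores[i]
    return merged, sorted(weakest)

# 玩具数据:8 个字幕片段的基线得分与视觉增强得分
baseline = [0.80, 0.30, 0.75, 0.20, 0.85, 0.70, 0.25, 0.90]
visual   = [0.70, 0.60, 0.65, 0.55, 0.75, 0.60, 0.50, 0.80]
merged, replaced = selective_grounding(baseline, visual, frac=0.25)
```

注意视觉增强输出在高质量片段上反而更差,因此整体替换会损害质量,只有选择性替换最弱片段才带来净收益。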

[NLP-41] On Predicting the Post-training Potential of Pre-trained LLMs

【速读】: 该论文旨在解决传统基准测试(如MMLU)无法有效反映大型语言模型(Large Language Models, LLMs)在复杂开放场景下的后训练潜力(post-training potential),导致基础模型选择低效的问题。解决方案的关键在于提出了基于规则判别评估(Rubric-based Discriminative Evaluation, RuDE)的统一框架,该框架通过响应判别(response discrimination)机制绕过基础模型的生成能力差距,并基于系统化的4C分类法(4C Taxonomy)构建跨领域的受控对比对(controlled contrastive pairs),利用细粒度规则违规(fine-grained rubric violations)实现评估。实验表明RuDE与后训练性能的相关性超过90%,并通过强化学习(Reinforcement Learning, RL)验证其能有效识别出具有高潜力的小型模型,从而为基础模型开发提供计算高效的机制。

链接: https://arxiv.org/abs/2605.11978
作者: Xiaoyuan Li,Yubo Ma,Kexin Yang,Moxin Li,Keqin Bao,Wenjie Wang,Fuli Feng,Dayiheng Liu
机构: University of Science and Technology of China(中国科学技术大学); Alibaba Group(阿里巴巴集团); National University of Singapore(新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model’s plasticity in complex open-ended scenarios, leading to inefficient model selection. We address this by introducing a new task of predicting post-training potential - forecasting a base model’s performance before post-training. We propose RuDE (Rubric-based Discriminative Evaluation), a unified framework that bypasses the generation gap of base models by leveraging response discrimination. Guided by our systematic 4C Taxonomy, RuDE constructs controlled contrastive pairs across diverse domains by fine-grained rubric violations. Extensive experiments demonstrate a correlation greater than 90% with post-training performance. Crucially, validation via Reinforcement Learning (RL) confirms that RuDE effectively identifies high-potential smaller models that outperform larger counterparts, offering a compute-efficient mechanism for foundation model development.
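RuDE 通过"响应判别"绕过基础模型的生成能力差距,其核心打分可以用如下 Python 草图示意:统计模型在受控对比对中把合规响应打分高于违规响应的比例。分数此处用假设的对数似然表示,具体打分方式为示意性假设。

```python
def discrimination_accuracy(pair_scores):
    # pair_scores: (合规响应得分, 按评分标准注入细粒度违规后的响应得分) 列表
    correct = sum(1 for good, bad in pair_scores if good > bad)
    return correct / len(pair_scores)

# 玩具数据:假设的基座模型对 4 个受控对比对的打分
pairs = [(-3.1, -5.2), (-2.8, -2.9), (-4.0, -3.5), (-1.9, -4.4)]
acc = discrimination_accuracy(pairs)
```

判别准确率越高,说明基座模型越能区分细粒度质量差异,对应论文中更高的后训练潜力。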

[NLP-42] Enhancing Target-Guided Proactive Dialogue Systems via Conversational Scenario Modeling and Intent-Keyword Bridging

【速读】: 该论文试图解决现有目标引导主动对话系统(target-guided proactive dialogue system)在动态对话中忽略对对话场景(conversational scenarios)和意图关键词(intent keywords)进行建模,导致系统话语与真实世界对话动态性不匹配的问题。解决方案的关键在于两点:一是联合建模用户画像(user profiles)和领域知识(domain knowledge)作为对话场景,通过引入场景偏差(scenario bias)来动态影响系统话语的生成;二是采用意图关键词桥接(intent-keyword bridging)方法,预测后续对话轮次的意图关键词,从而提供更高级别和更灵活的引导。实验表明,该方案显著提升了系统的主动性、流畅性和信息性。

链接: https://arxiv.org/abs/2605.11964
作者: Maodong Li,Yancui Li,Fang Kong
机构: School of Computer Science and Technology, Soochow University, China(中国苏州大学计算机科学与技术学院); Jiangsu Key Lab of Language Computing, Suzhou 215123, China(中国苏州江苏省语言计算重点实验室); School of Computer and Information Engineering, Henan Normal University, China(中国河南师范大学计算机与信息工程学院)
类目: Computation and Language (cs.CL)
备注: 21 pages, 9 Figures, 18 Tables

点击查看摘要

Abstract:A target-guided proactive dialogue system aims to steer conversations proactively toward pre-defined targets, such as designated keywords or specific topics. During guided conversations, dynamically modeling conversational scenarios and intent keywords to guide system utterance generation is beneficial; however, existing work largely overlooks this aspect, resulting in a mismatch with the dynamics of real-world conversations. In this paper, we jointly model user profiles and domain knowledge as conversational scenarios to introduce a scenario bias that dynamically influences system utterances, and employ intent-keyword bridging to predict intent keywords for upcoming dialogue turns, providing higher level and more flexible guidance. Extensive automatic and human evaluations demonstrate the effectiveness of conversational scenario modeling and intent keyword bridging, yielding substantial improvements in proactivity, fluency, and informativeness for target-guided proactive dialogue systems, thereby narrowing the gap with real world interactions.

[NLP-43] Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models ICPR2026

【速读】: 该论文试图解决多模态视频摘要(multimodal video summarization)中视觉特征与自然语言生成之间语义不对齐(semantic misalignment)的问题。传统方法依赖基于目标分类训练的CNN特征,其将视觉概念表示为离散类别,与语言生成所需的连续语义表示存在本质差异。解决方案的关键在于提出ClipSum框架,其核心是直接利用冻结的CLIP视觉-语言特征(frozen CLIP vision-language features),通过显式时间建模(explicit temporal modeling)和维度自适应融合(dimension-adaptive fusion)实现教学视频摘要。CLIP在4亿图文对上的对比预训练(contrastive pre-training)使得视觉特征在表示层面已与文本解码器生成的语言概念语义对齐,从而天然地桥接了视觉-语言鸿沟(vision-language gap)。实验证明,冻结的CLIP(维度仅为512)在ROUGE-1指标上(33.0%)超越了ResNet-152(维度2048,30.5%)和微调后的CLIP(32.3%),表明保持预训练语义对齐比增大特征容量或任务特定适应更为重要。

链接: https://arxiv.org/abs/2605.11959
作者: Maham Nazir,Muhammad Aqeel,Richong Zhang,Francesco Setti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to ICPR 2026

点击查看摘要

Abstract:Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP’s contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine-tuned CLIP (32.3%), showing that preserving pre-trained alignment is more valuable than task-specific adaptation. this https URL

[NLP-44] StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

【速读】: 该论文旨在解决现有代码推理方法仅监督最终代码输出、忽略中间状态而导致的奖励黑客(reward hacking)问题,即模型通过不一致的推理过程得到正确答案。解决方案的关键在于两个方面:首先,提出了StepCodeReasoner框架,通过自动在代码中插入结构化的基于打印的执行轨迹锚点(execution-trace anchors),训练模型预测每一步的运行时状态,从而将代码推理转化为一个可验证的逐步执行建模问题;其次,引入了双层GRPO(Bi-Level GRPO)强化学习算法,在轨迹间(inter-trajectory)通过比较不同执行路径,以及在轨迹内(intra-trajectory)根据中间准确性对下游正确性的影响进行信用分配,从而实现结构化的奖励分配。

链接: https://arxiv.org/abs/2605.11922
作者: Hao Wang,Rui Li,Lei Sha,Jie M. Zhang
机构: 未知
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing code reasoning methods primarily supervise final code outputs, ignoring intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model achieves 91.1% on CRUXEval and 86.5% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9%, outperforming baseline CodeReasoner-7B (72.3%), its 14B counterpart (81.1%), and GPT-4o (77.3%). Additionally, our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.
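论文将"插入基于打印的执行轨迹锚点"作为中间状态监督的来源;下面的 Python 草图手工演示这一思路:在代码关键步骤记录运行时状态,随后即可用这些状态去校验模型的逐步预测。锚点的具体实现方式为示意性假设。

```python
trace = []

def anchor(step, **state):
    # 执行轨迹锚点:记录该步骤的运行时变量状态
    # (论文中是自动插入的结构化 print,这里记入列表便于检查)
    trace.append((step, dict(state)))

def gcd(a, b):
    while b:
        anchor("loop", a=a, b=b)
        a, b = b, a % b
    anchor("done", a=a, b=b)
    return a

result = gcd(12, 8)
```

模型被训练去预测 trace 中每一步的状态,使代码推理变为可逐步验证的执行建模问题。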

[NLP-45] YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning

【速读】: 该论文试图解决现有偏好优化(Preference Optimization)方法在提升大语言模型数学推理能力时,仅依赖于外部构造的偏好数据作为样本级监督,而未能充分利用模型内部表征中蕴含的与数学推理(如数学知识、符号操作、逻辑推理)相关的神经元激活信息这一局限。解决方案的关键在于提出了YFPO(Yoked Feature Preference Optimization)框架:首先利用AttnLRP识别与数学相关的神经元,然后计算偏好响应与非偏好响应之间的神经元激活间隔(activation margin),并以此构建一个辅助奖励信号,从而将外部偏好学习与内部神经元级信号结合,实现对偏好优化的细粒度增强。初步实验在GSM8K基准上验证了该方法的可行性,表明神经元级信号可与偏好优化交互并偶尔提升推理性能。

链接: https://arxiv.org/abs/2605.11906
作者: Yifan Le
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 2figures. Work in progress

点击查看摘要

Abstract:Preference optimization has become an important post-training paradigm for improving the reasoning abilities of large language models. Existing methods typically rely on externally constructed preference data, using preferred and dispreferred responses as sample-level supervision. However, such external signals rarely make explicit use of capability-related information contained in the model’s internal representations. For mathematical reasoning, certain neuron groups may exhibit activation patterns associated with mathematical knowledge, symbolic manipulation, or logical reasoning. Similar to reflexive behavioral signals, these internal activations may provide a coarse indication of whether the model is engaging in math-related reasoning. We introduce YFPO, short for Yoked Feature Preference Optimization, a preliminary neuron-guided preference optimization framework for mathematical reasoning. YFPO first uses AttnLRP to identify math-related neurons, and then constructs an auxiliary reward from their activation margin between preferred and dispreferred responses. This design augments external preference learning with internal neuron-level signals. We conduct preliminary experiments on a small-scale language model using GSM8K as the main benchmark. Results suggest that neuron-level signals can interact with preference optimization and occasionally improve reasoning performance, offering a promising direction for more fine-grained and interpretable reasoning-oriented post-training.
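YFPO 中"数学相关神经元激活间隔 → 辅助奖励"的思路可以用下面的 Python 草图示意:取偏好/非偏好响应上相关神经元的平均激活差,再经 sigmoid 压到 (0, 1)。具体函数形式是示意性假设,可能与论文实现不同。

```python
import math

def activation_margin_reward(pref_acts, dispref_acts, scale=1.0):
    # 偏好响应与非偏好响应上数学相关神经元的平均激活差(间隔),
    # 经 sigmoid 映射为 (0, 1) 区间内的辅助奖励
    margin = sum(pref_acts) / len(pref_acts) - sum(dispref_acts) / len(dispref_acts)
    return 1.0 / (1.0 + math.exp(-scale * margin))

# 玩具数据:AttnLRP 识别出的数学相关神经元在两类响应上的激活
r_engaged = activation_margin_reward([0.9, 0.7, 0.8], [0.2, 0.1, 0.3])
r_flat = activation_margin_reward([0.5, 0.5], [0.5, 0.5])
```

间隔为零时奖励回到中性值 0.5,因此该辅助信号只在内部激活确实区分两类响应时才起作用。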

[NLP-46] Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

【速读】: 该论文试图解决大语言模型内部决策过程不透明的问题,这一不透明性限制了对模型行为的检查、控制和系统性改进能力。解决方案的关键在于利用稀疏自编码器(Sparse Autoencoders, SAEs)将模型激活(activations)分解为稀疏且可解释的特征表示,并基于此构建了Qwen-Scope开源套件。通过覆盖Qwen3和Qwen3.5系列中7种模型变体(包括密集架构和混合专家架构)的14组SAEs,论文展示了SAEs不仅可作为事后分析工具,还能作为可复用的表示层接口,在推理时控制、评估分析、数据工作流和后训练优化四个方向上直接服务于模型开发,例如在不修改权重的情况下通过特征方向控制语言和偏好,或通过特征信号缓解代码切换和重复等不良行为。

链接: https://arxiv.org/abs/2605.11887
作者: Boyi Deng,Xu Wang,Yaoning Wang,Yu Wan,Yubo Ma,Baosong Yang,Haoran Wei,Jialong Tang,Huan Lin,Ruize Gao,Tianhao Li,Qian Cao,Xuancheng Ren,Xiaodong Deng,An Yang,Fei Huang,Dayiheng Liu,Jingren Zhou
机构: Qwen Team
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This opacity motivates a growing body of research in mechanistic interpretability, with sparse autoencoders (SAEs) emerging as one of the most promising tools for decomposing model activations into sparse, interpretable feature representations. We introduce Qwen-Scope, an open-source suite of SAEs built on the Qwen model family, comprising 14 groups of SAEs across 7 model variants from the Qwen3 and Qwen3.5 series, covering both dense and mixture-of-expert architectures. Built on top of these SAEs, we show that SAEs can go beyond post-hoc analysis to serve as practical interfaces for model development along four directions: (i) inference-time steering, where SAE feature directions control language, concepts, and preferences without modifying model weights; (ii) evaluation analysis, where activated SAE features provide a representation-level proxy for benchmark redundancy and capability coverage; (iii) data-centric workflows, where SAE features support multilingual toxicity classification and safety-oriented data synthesis; and (iv) post-training optimization, where SAE-derived signals are incorporated into supervised fine-tuning and reinforcement learning objectives to mitigate undesirable behaviors such as code-switching and repetition. Together, these results demonstrate that SAEs can serve not only as post-hoc analysis tools, but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models. By open-sourcing Qwen-Scope, we aim to support mechanistic research and accelerate practical workflows that connect model internals to downstream behavior.
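文中"推理时控制"的基本操作可以用一个极简的 Python 草图示意:在不修改权重的前提下,将某个 SAE 特征的解码方向按系数加到残差流激活上。向量与系数均为示意性假设。

```python
def steer(hidden, direction, alpha):
    # 推理时控制:h' = h + alpha * d,d 为某 SAE 特征的解码方向
    return [h + alpha * d for h, d in zip(hidden, direction)]

hidden = [0.25, -0.5, 0.75]          # 某层残差流激活(示意)
lang_feature = [1.0, 0.0, -0.5]      # 假设的"语言特征"解码方向
steered = steer(hidden, lang_feature, alpha=2.0)
```

alpha 控制干预强度:取正值放大该特征对应的行为(如特定语言),取负值则抑制之。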

[NLP-47] Concordance Comparison as a Means of Assembling Local Grammars

【速读】: 该论文试图解决葡萄牙语文本中人名识别(Named Entity Recognition for person names)这一信息抽取中的重要但具有挑战性的问题,解决方案的关键在于:使用一种工具比较两个局部语法(Local Grammars, LGs)生成的一致性(concordances)并突出差异,通过分析比较结果观察到每对LG之间的包含、交集和析取关系,从而组装出产生最佳结果的增强语法,最终在葡萄牙语HAREM黄金语料上获得F值76.86,相比当时最优方法提高了6个点。

链接: https://arxiv.org/abs/2605.11862
作者: Juliana Pirovani,Elias de Oliveira,Eric Laporte
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Named Entity Recognition for person names is an important but non-trivial task in information extraction. This article uses a tool that compares the concordances obtained from two local grammars (LG) and highlights the differences. We used the results as an aid to select the best of a set of LGs. By analyzing the comparisons, we observed relationships of inclusion, intersection and disjunction within each pair of LGs, which helped us to assemble those that yielded the best results. This approach was used in a case study on extraction of person names from texts written in Portuguese. We applied the enhanced grammar to the Gold Collection of the Second HAREM. The F-Measure obtained was 76.86, representing a gain of 6 points in relation to the state-of-the-art for Portuguese.
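论文比较两个局部语法产生的一致性并归纳出包含、交集、析取关系;该判定本质上是匹配结果之间的集合运算,可用下面的 Python 草图示意(人名均为虚构示例)。

```python
def compare_concordances(conc_a, conc_b):
    # 判定两个局部语法匹配结果之间的集合关系
    a, b = set(conc_a), set(conc_b)
    if a == b:
        return "equal"
    if a < b or b < a:
        return "inclusion"
    if a & b:
        return "intersection"
    return "disjunction"

lg1 = {"Maria Silva", "João Souza"}
lg2 = {"Maria Silva", "João Souza", "Ana Costa"}
lg3 = {"Pedro Lima"}
rel_12 = compare_concordances(lg1, lg2)
rel_13 = compare_concordances(lg1, lg3)
```

据此可舍弃被完全包含的语法、合并互补(析取或部分交集)的语法,组装出覆盖面更好的增强语法。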

[NLP-48] UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

【速读】: 该论文试图解决现有视觉潜在推理方法中显式文本链式思维(textual chain-of-thought, CoT)与视觉潜在标记(visual latent tokens)交错使用所导致的效率低下与推理碎片化问题。其解决方案之核心在于提出UniVLR框架,该框架将文本推理与辅助视觉证据统一视为共享的视觉工作空间(shared visual workspace),通过将推理轨迹与辅助图像共同渲染并学习压缩为紧凑的视觉潜在标记,使得推理时仅依赖视觉潜在进行隐式推理并直接解码最终答案,从而彻底消除了对外部工具调用及冗长文本推理的依赖,实现了更高效且统一的视觉推理范式。

链接: https://arxiv.org/abs/2605.11856
作者: Houcheng Jiang,Jiajun Fu,Junfeng Fang,Chen Gao,Xiang Wang,Xiangnan He,Yong Li
机构: University of Science and Technology of China(中国科学技术大学); Tsinghua University(清华大学); National University of Singapore(新加坡国立大学); Zhongguancun Academy(中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.

[NLP-49] Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

【速读】: 该论文试图解决扩散语言模型(Diffusion Language Models, DLMs)在标准负证据下界(Negative Evidence Lower Bound, NELBO)监督微调中存在的训练-推理不匹配问题:训练阶段一步重构随机掩码,而推理阶段遵循置信度引导的多步由易到难去噪轨迹,导致微调效率低下;同时,现有基于轨迹的自蒸馏方法仅用于加速采样,未能显著提升模型内在能力,甚至可能损害全扩散解码性能。解决方案的关键在于提出轨迹对齐的玻尔兹曼建模(Trajectory-Aligned optimization via Boltzmann Modeling, TABOM),这是一种基于自蒸馏轨迹的后训练框架,其核心思想是:通过将推理过程中的非掩码偏好建模为预测熵的玻尔兹曼分布(Boltzmann distribution),并推导出可处理的成对排序目标(pairwise ranking objective),从而将训练过程与推理的由易到难结构对齐,使模型能从自身的分布流形轨迹中实现真正的知识获取,而非仅加速解码。TABOM在新领域上获得显著收益,扩展了DLM的有效知识边界,并有效缓解了标准监督微调中的灾难性遗忘。

链接: https://arxiv.org/abs/2605.11854
作者: Kecheng Chen,Ziru Liu,Xijia Tao,Hui Liu,Yibing Liu,Xinyu Fu,Shi Wu,Suiyun Zhang,Dandan Tu,Lingpeng Kong,Rui Liu,Haoliang Li
机构: City University of Hong Kong (香港城市大学); Huawei Research (华为研究院); The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model’s underlying capability, and may even degrade performance under full diffusion decoding. In this work, we ask whether self-distilled trajectories can be used not merely for faster inference, but for genuine knowledge acquisition. Although these trajectories lie on the pretrained DLM’s own distributional manifold and thus offer a potentially lower optimization barrier, we find that naively fine-tuning on them with standard NELBO objectives yields only marginal gains. To address this limitation, we propose Trajectory-Aligned optimization via Boltzmann Modeling (TABOM), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model’s certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.
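下面用一段极简 Python 代码示意摘要中的两个关键构件:对掩码位置的预测熵施加玻尔兹曼分布(低熵位置更可能先被去掩码),以及促使模型熵排序与观测解码轨迹一致的成对排序损失。这只是依据摘要描述的草图,温度 tau、logistic 损失形式及全部函数名均为本文假设,并非论文官方实现:

```python
import math

def boltzmann_unmask_probs(entropies, tau=1.0):
    """掩码位置上的玻尔兹曼非掩码偏好:预测熵越低,越可能先被去掩码。"""
    w = [math.exp(-h / tau) for h in entropies]
    z = sum(w)
    return [x / z for x in w]

def pairwise_ranking_loss(entropies, decode_order):
    """logistic 成对排序损失:解码轨迹中先被去掩码的位置,其熵应低于后去掩码的位置。"""
    loss, pairs = 0.0, 0
    for a in range(len(decode_order)):
        for b in range(a + 1, len(decode_order)):
            i, j = decode_order[a], decode_order[b]  # i 先于 j 被去掩码
            loss += math.log(1.0 + math.exp(entropies[i] - entropies[j]))
            pairs += 1
    return loss / max(pairs, 1)
```

按此构造,与"由易到难"解码顺序一致的熵排序(如熵 [0.2, 1.5, 0.8] 配解码顺序 [0, 2, 1])会比逆序得到更小的损失。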

[NLP-50] GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

【速读】: 该论文试图解决强化学习在LLM智能体后训练中面临的细粒度信用分配难题:传统方法依赖结果层面的稀疏奖励,难以在长程轨迹中精准定位并强化对正确部分起关键作用的动作。解决方案的核心是提出一种自适应粒度的信用分配框架GEAR(Granularity-adaptivE Advantage Reweighting),其关键机制在于利用自蒸馏获得token级和segment级的参考引导散度信号,通过比较在线策略学生与地面真值条件化教师的差异,识别语义偏离的起始点(即散度激增处),并以此作为锚点动态划分自适应segment:对于学生与教师保持对齐的token保留token级分辨率,而对于发生偏离的延续部分则聚合成segment,并用偏离点的散度值调制该segment的优势权重,从而实现精细且灵活的信用分配。

链接: https://arxiv.org/abs/2605.11853
作者: Sijia Li,Yuchen Huang,Zifan Liu,Yanping Li,Jingjing Fu,Li Zhao,Jiang Bian,Ling Zhang,Jun Zhang,Rui Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment’s advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
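摘要中"以散度激增处为锚点划分自适应 segment,并用偏离点散度调制局部优势权重"的机制可以用如下草图说明(阈值判定方式、乘性加权形式及超参数 beta 均为本文假设,并非论文原始算法):

```python
def adaptive_segments(divergence, spike_threshold):
    """把 token 序列划分为自适应信用区域:散度低于阈值的 token 保持
    token 级分辨率;散度激增处开启一个 segment,覆盖其后的延续部分,
    直到散度回落到阈值以下。"""
    segments, start = [], None
    for t, d in enumerate(divergence):
        if d > spike_threshold and start is None:
            start = t                      # 语义偏离的起始点
        elif d <= spike_threshold and start is not None:
            segments.append((start, t))    # 把偏离的延续聚合为一个 segment
            start = None
    if start is not None:
        segments.append((start, len(divergence)))
    return segments

def reweight_advantages(advantage, divergence, segments, beta=0.5):
    """segment 内用偏离点(segment 首 token)的散度调制轨迹级优势。"""
    w = [1.0] * len(divergence)
    for s, e in segments:
        for t in range(s, e):
            w[t] = 1.0 + beta * divergence[s]
    return [advantage * wt for wt in w]
```

例如散度序列 [0.1, 0.2, 1.5, 1.2, 0.1, 0.9] 在阈值 0.8 下会被划出 (2, 4) 与 (5, 6) 两个 segment,其余 token 保持 token 级分辨率。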

[NLP-51] Probabilistic Calibration Is a Trainable Capability in Language Models

【速读】: 该论文试图解决语言模型在需要输出满足用户指定随机性约束的场景中,其生成概率与目标校准不佳的问题。解决方案的关键是通过微调(fine-tuning)直接提升这一能力,具体提出了两种校准微调(Calibration Fine-Tuning)变体:软目标方法(soft-target method)将期望输出分布转换为基于字典树(trie)的下一词元目标,而硬目标方法(hard-target method)则直接在从同一目标分布采样的补全序列上进行训练。两种方法均能显著改善保留分布族和未见参数设置下的结构化采样保真度,表明概率校准是一种可训练的能力。

链接: https://arxiv.org/abs/2605.11845
作者: Davide Baldelli,Sruthi Kuriakose,Maryam Hashemzadeh,Amal Zouaq,Sarath Chandar
机构: Chandar Research Lab(钱达尔研究实验室); Mila – Quebec AI Institute(魁北克人工智能研究所); LAMA-WeST Lab(LAMA-WeST实验室); Polytechnique Montréal(蒙特利尔理工学院); Université de Montréal(蒙特利尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at this https URL.
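soft-target 方法的核心一步是把期望输出分布转换为基于字典树(trie)的逐前缀下一词元目标。下面给出该转换的一个极简示意(逐字符分词方式与数据结构均为本文假设):

```python
from collections import defaultdict

def trie_next_token_targets(target_dist):
    """把多词元输出上的目标分布转换为逐前缀的下一词元分布:
    P(next=c | prefix) = 以 c 延伸该前缀的输出概率质量 / 该前缀的总质量。"""
    prefix_mass = defaultdict(float)
    edge_mass = defaultdict(lambda: defaultdict(float))
    for tokens, p in target_dist.items():
        prefix = ()
        for tok in tokens:
            prefix_mass[prefix] += p
            edge_mass[prefix][tok] += p
            prefix = prefix + (tok,)
    return {prefix: {tok: m / prefix_mass[prefix] for tok, m in edges.items()}
            for prefix, edges in edge_mass.items()}
```

例如"从 {10, 11, 20} 中均匀采样"按逐字符分词后,根节点的下一词元目标为 {"1": 2/3, "2": 1/3},前缀 ("1",) 的目标为 {"0": 0.5, "1": 0.5}。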

[NLP-52] More Edits More Stable: Understanding the Lifelong Normalization in Sequential Model Editing

【速读】: 该论文旨在解决大型语言模型(LLMs)在终身模型编辑(Lifelong Model Editing)中面临的灾难性遗忘(catastrophic forgetting)与模型崩溃(model collapse)问题。解决方案的关键在于揭示并理论化终身归一化(Lifelong Normalization, LN)的作用机制,即LN通过将值梯度(value gradients)用运行统计量(running statistics)进行归一化,形成一个自增强的稳定循环(self-reinforcing stability loop);结合岭回归(ridge-regularized regression)时,LN能产生渐近正交(asymptotic orthogonality)且有界范数(bounded norms)的参数更新,从而直接缓解遗忘和系统性崩溃。基于这一理论洞察,论文进一步推导出StableEdit方法,通过引入显式预热阶段(warm-up stage)和全白化(full whitening)来强化该稳定循环,以极小的额外开销提升长时程稳定性。

链接: https://arxiv.org/abs/2605.11836
作者: Xin Ma,Wei Chen,Qi Liu,Derong Xu,Zhi Zheng,Tong Xu,Enhong Chen
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Lifelong Model Editing aims to continuously update evolving facts in Large Language Models while preserving unrelated knowledge and general capabilities, yet it remains plagued by catastrophic forgetting and model collapse. Empirically, we find that recent editors resilient over long horizons share the same core strategy: Lifelong Normalization (LN), which normalizes value gradients using running statistics. Removing LN causes immediate performance collapse, and we observe a counter-intuitive positive cumulative effect where early edits can promote the success of future edits. Yet the mechanism of LN remains a “black box”, leaving its precise role in lifelong stability poorly understood. In this work, we provide the first theoretical account of LN in the lifelong regime. Our analysis reveals a self-reinforcing stability loop and proves that, when combined with ridge-regularized regression, LN yields parameter updates with asymptotic orthogonality and bounded norms, directly mitigating forgetting and systemic collapse. Based on these insights, we derive StableEdit, which strengthens this stability loop via an explicit warm-up stage and full whitening, improving long-horizon stability at minimal overhead. Extensive experiments validate our theory and demonstrate competitive performance. Our code is available at this https URL.
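摘要所述 LN 的核心操作是"用运行统计量归一化值梯度"。下面用 Welford 在线算法给出一个逐元素标量统计的示意实现(统计粒度与具体归一化形式为本文假设,仅用于说明这一机制如何让更新幅度保持有界):

```python
class LifelongNormalizer:
    """Lifelong Normalization 的简化版本:跨越一系列编辑累积运行均值
    与方差,并用它们标准化每个到来的值梯度元素。"""
    def __init__(self, eps=1e-6):
        self.n, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, grad):
        # Welford 在线更新均值与方差,再对当前元素做标准化
        normed = []
        for g in grad:
            self.n += 1
            delta = g - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (g - self.mean)
            var = self.m2 / max(self.n - 1, 1)
            normed.append((g - self.mean) / ((var + self.eps) ** 0.5))
        return normed
```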

[NLP-53] ROMER: Expert Replacement and Router Calibration for Robust MoE LLM s on Analog Compute-in-Memory Systems

【速读】: 该论文旨在解决模拟存内计算(compute-in-memory, CIM)系统中硬件噪声对基于混合专家(mixture-of-experts, MoE)架构的大语言模型(large language models, LLMs)带来的性能退化问题。具体而言,硬件噪声会破坏专家负载均衡(expert load balance),导致在干净环境下训练的路由决策在噪声环境中持续次优。解决方案的关键在于提出了一种名为ROMER的训练后校准框架,其核心包括两个步骤:(1) 将激活不足的专家(underactivated experts)替换为高频专家(high-frequency experts),以恢复负载均衡;(2) 通过基于百分位数的归一化(percentile-based normalization)重新校准路由器logits,从而稳定噪声条件下的路由决策。实验结果表明,在真实芯片噪声条件下,ROMER在DeepSeek-MoE、Qwen-MoE和OLMoE上分别实现了高达58.6%、58.8%和59.8%的困惑度(perplexity)降低,验证了其有效性及跨不同MoE架构的泛化能力。

链接: https://arxiv.org/abs/2605.11800
作者: Wenyong Zhou,Yuannuo Feng,Yizhe Chen,Taiqiang Wu,Wendong Xu,Wenbo Qi,Zhengwu Liu,Wang Kang,Ngai Wong
机构: The Department of Electrical and Computer Engineering, The University of Hong Kong, Hong Kong (香港大学电气与电子工程系); The School of Integrated Circuit Science and Engineering, Beihang University, Beijing, China (北京航空航天大学集成电路科学与工程学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 11 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Large language models (LLMs) with mixture-of-experts (MoE) architectures achieve remarkable scalability by sparsely activating a subset of experts per token, yet their frequent expert switching creates memory bandwidth bottlenecks that compute-in-memory (CIM) architectures are well-suited to mitigate. However, analog CIM systems suffer from inherent hardware imperfections that perturb stored weights, and its negative impact on MoE-based LLMs in noisy CIM environments remains unexplored. In this work, we present the first systematic investigation of MoE-based LLMs under noise model calibrated with real chip measurements, revealing that hardware noise critically disrupts expert load balance and renders clean-trained routing decisions consistently suboptimal. Based on these findings, we propose ROMER, a post-training calibration framework that (1) replaces underactivated experts with high-frequency ones to restore load balance, and (2) recalibrates router logits via percentile-based normalization to stabilize routing under noise. Extensive experiments across multiple benchmarks demonstrate that ROMER achieves up to 58.6%, 58.8%, and 59.8% reduction in perplexity under real-chip noise conditions for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, establishing its effectiveness and generalizability across diverse MoE architectures.
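ROMER 的两步校准可以粗略示意如下:对路由器 logits 做基于百分位数的归一化,并把激活占比过低的专家映射到高频专家。百分位区间与激活下限均为本文假设的超参数:

```python
def percentile_normalize(logits, low=10, high=90):
    """基于百分位数的路由 logits 重校准:用稳健统计量重新缩放,
    降低噪声离群值翻转 top-k 路由决策的概率。"""
    s = sorted(logits)
    def pct(p):
        idx = (len(s) - 1) * p / 100.0
        lo, hi = int(idx), min(int(idx) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (idx - lo)
    lo_v, hi_v = pct(low), pct(high)
    scale = max(hi_v - lo_v, 1e-6)
    return [(x - lo_v) / scale for x in logits]

def replace_underactivated(expert_counts, min_share=0.02):
    """把激活占比低于下限的专家映射到使用频率最高的专家,恢复负载均衡。"""
    total = sum(expert_counts)
    top = max(range(len(expert_counts)), key=expert_counts.__getitem__)
    return [top if c / total < min_share else i
            for i, c in enumerate(expert_counts)]
```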

[NLP-54] Choosing features for classifying multiword expressions

【速读】: 该论文试图解决多词表达(Multiword expressions, MWEs)的分类问题。由于其类别归属的可靠性依赖于所选特征,且现有分类对计算用途的支撑不足,需要设计一个更适用于多语言的增强分类方案。解决方案的关键在于:基于跨语言的前期研究成果,优先选取那些在类别归属上具有高可靠性的特征,从而提升分类的跨语言适用性和计算有效性。

链接: https://arxiv.org/abs/2605.11779
作者: Eric Laporte
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all features are equal in terms of how reliably MWEs can be assigned to classes. Accordingly, resulting classifications may be more or less fruitful for computational use. I outline an enhanced classification. In order to increase its suitability for many languages, I use previous works taking into account various languages.

[NLP-55] Entropy Polarity in Reinforcement Fine-Tuning: Direction Asymmetry and Control

【速读】: 该论文旨在解决强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)中,对策略熵(Policy Entropy)在token级别(token-level)的调节机制理解不足的问题。现有方法仅通过全局目标控制熵,而忽略了采样更新如何逐token重塑熵的微观过程。解决方案的关键在于提出熵力学(Entropy Mechanics)理论框架,通过一阶近似推导出熵极性(Entropy Polarity)——一种有符号的token-level量,能预测每个采样更新是扩大还是收缩熵。该框架揭示了结构不对称性:强化高频高概率token倾向于收缩熵,而扩张熵则需要低概率样本或更强的分布校正。基于此,作者提出极性感知策略优化(Polarity-Aware Policy Optimization, PAPO),通过优势重加权(advantage reweighting)同时保留熵扩张和收缩两种极性分支,并利用在线经验熵轨迹作为相位信号,自适应地在不同更新之间重新分配优化压力,从而实现探索与利用的平衡。实验证明,PAPO在数学推理和智能体基准测试中优于基线方法,带来了更高效的训练和显著的奖励提升。

链接: https://arxiv.org/abs/2605.11775
作者: Jiazheng Zhang,Ziche Fu,Junrui Shen,Yunbin Zhao,Yunke Zhang,Zhiheng Xi,Long Ma,Chenxin An,Zhihao Zhang,Shichun Liu,Dingwei Zhu,Shihan Dou,Shaofan Liu,Han Li,Wiggin Zhou,Aiden Adams,Tao Gui,Fei Huang,Qi Zhang,Xuanjing Huang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
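在一个玩具 softmax 策略上,可以写出熵对 logits 的梯度 dH/dθ_k = -p_k(log p_k + H),并据此得到某次 REINFORCE 式更新的熵变化一阶预测,即一个有符号的"熵极性"量。以下仅是对这一思路的示意性复现(更新形式与记号为本文假设,并非论文原始定义):

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def reinforce_update(logits, token, advantage, lr=0.01):
    """对采样到的 token 做一次 REINFORCE 式 logit 更新。"""
    p = softmax(logits)
    return [l + lr * advantage * ((1.0 if k == token else 0.0) - p[k])
            for k, l in enumerate(logits)]

def entropy_polarity(logits, token, advantage, lr=0.01):
    """熵变化的一阶预测 grad_H · delta_theta,其符号即"熵极性"。"""
    p = softmax(logits)
    h = entropy(p)
    grad_h = [-pk * (math.log(pk) + h) for pk in p]  # dH/d(logit_k)
    delta = [lr * advantage * ((1.0 if k == token else 0.0) - p[k])
             for k in range(len(p))]
    return sum(g * d for g, d in zip(grad_h, delta))
```

数值上可以验证摘要所述的不对称性:以正优势强化高概率 token 得到负极性(熵收缩),强化低概率 token 得到正极性(熵扩张),且一阶预测与实际更新后的熵变化符号一致。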

[NLP-56] From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

【速读】: 该论文旨在解决大型语言模型(LLMs)在处理电子健康记录(EHRs)时,因纵向或高频序列产生过长token序列而导致的计算成本高、性能下降的问题。现有压缩方案或引入额外推理延迟,或存在临床信息丢失风险。解决方案之关键是提出了一种称为Medical Token-Pair Encoding (MedTPE)的分层方法,它通过将频繁共现的医疗token对合并为复合token实现无损压缩,并采用依赖感知替换策略维持计算复杂度,仅需通过自监督学习微调占模型参数0.5-1.0%的新增token嵌入,从而在降低输入长度(最多31%)和推理延迟(34-63%)的同时,保持甚至提升预测性能和输出格式合规性。

链接: https://arxiv.org/abs/2605.11774
作者: Mingcheng Zhu,Zhiyao Luo,Yu Liu,Tingting Zhu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 6 figures, 13 tables

点击查看摘要

Abstract:By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly frequent EHRs often yield excessively long token sequences that result in high computational costs and even reduced performance. Existing solutions either add modules for compression or remove less important tokens, which introduce additional inference latency or risk losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical Token-Pair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens, amounting to merely 0.5-1.0% of the LLM’s parameters, are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across multiple LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains and different languages.
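MedTPE 把高频共现的医疗 token 对合并为复合 token,其合并步骤与 BPE 类似。下面是一个无损合并的极简示意(复合 token 的拼接记号 "+"、事件名等均为本文假设;复合 token 总可按 "+" 拆回原 token,故压缩无损):

```python
from collections import Counter

def merge_top_pairs(sequences, num_merges):
    """BPE 风格的 token 对编码:反复把最高频的相邻 token 对
    合并为一个复合 token,对序列做无损压缩。"""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + "+" + b)   # 复合 token
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged.append(out)
        sequences = merged
    return sequences, merges
```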

[NLP-57] Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control

【速读】: 该论文试图解决现有大型语言模型(Large Language Models, LLMs)在安全关键的空中交通管制(Air Traffic Control, ATC)环境中可靠性评估不足的问题,具体表现为基于F1或宏观准确率等聚合指标的评估方法将所有错误等量齐观,无法区分高风险语义错误(如跑道标识符或运动约束错误)与低风险错误的不对称后果。解决方案之关键在于提出一种面向安全(safety-oriented)且后果感知(consequence-aware)的评估框架,该框架通过对不同语义错误赋予与操作后果严重性相匹配的权重,从而更真实地反映LLM在ATC场景下的操作可靠性,避免聚合指标对高风险错误的掩盖。

链接: https://arxiv.org/abs/2605.11769
作者: Yujing Chang,Yash Guleria,Duc-Thinh Pham,Nhut-Huy Pham,Ningli Wang,Vu N. Duong,Sameer Alam
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models scoring below 0.6 despite high macro-F1 performance. Further analysis shows that errors concentrate in high-impact entities despite relatively stable action-type classification, indicating structural grounding deficiencies. These findings highlight the necessity of consequence-aware evaluation protocols for the responsible deployment of AI-assisted ATC systems.
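"后果感知"评估的最小形态,是给不同语义错误赋予与运行后果严重性匹配的权重,而非像宏观指标那样等权计数。下面是一个示意性的风险得分计算(错误类型、权重数值与归一化方式均为本文假设,并非论文定义的 Risk Score):

```python
def risk_score(errors, weights, total):
    """后果加权的风险得分:errors 为各错误类型的计数,weights 为其
    严重性权重,total 为样本总数;得分越接近 1 表示高风险错误越少。"""
    penalty = sum(weights[kind] * count for kind, count in errors.items())
    worst = total * max(weights.values())
    return 1.0 - min(penalty / worst, 1.0)
```

例如 100 条指令中出现 2 次跑道标识符错误(权重 1.0)与 5 次轻微呼号错误(权重 0.2)时,得分为 1 - 3/100 = 0.97;而等权计数只会给出 7% 的错误率,掩盖了两类错误后果上的不对称。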

[NLP-58] DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

【速读】: 该论文试图解决视觉-语言-动作模型(Vision-Language-Action Model, VLA)在精细操作任务中因缺乏失败意识而导致的脆弱性问题:在关键阶段,微小的动作误差会迅速累积成不可恢复的失败,而现有模型仅依赖成功演示训练,无法主动感知并规避这些风险。解决方案的核心在于提出一种关键阶段测试时梦境框架(critical-phase test-time dreaming framework),即DreamAvoid,其关键在于三个协同组件:(1)梦境触发器(Dream Trigger)用于实时判断执行是否进入关键阶段;(2)动作提议器(Action Proposer)从VLA中采样多个候选动作块(candidate action chunks);(3)梦境评估器(Dream Evaluator)通过在混合数据(成功、失败与边界案例)上联合训练,能够“梦境”出各候选动作对应的短期未来状态并评估其价值,从而选择最优动作以避免失败。此外,该框架还引入了自主边界学习范式(autonomous boundary learning paradigm),以细化系统对成功与失败之间微妙边界的理解。

链接: https://arxiv.org/abs/2605.11750
作者: Xianzhe Fan,Yuxiang Lu,Shenyuan Gao,Xiaoyang Wu,Ruihua Han,Manling Li,Hengshuang Zhao
机构: HKU (香港大学); HKUST (香港科技大学); Northwestern University (西北大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:Vision-Language-Action (VLA) models are often brittle in fine-grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical-phase test-time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system’s understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to “dream” the short-horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real-world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at this https URL.
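DreamAvoid 在关键阶段的决策循环可以抽象为"采样候选动作块 → 梦境推演短期未来 → 按价值择优执行"。下面是这一循环的框架性草图(propose、dream、value 均为占位函数,粗略对应论文中的 Action Proposer 与 Dream Evaluator,接口为本文假设):

```python
def select_action(state, propose, dream, value, num_candidates=8):
    """测试时梦境的骨架:从策略采样候选动作块,用学习到的评估器
    为每个候选"梦境"出短期未来并打分,执行价值最高的候选。"""
    candidates = [propose(state) for _ in range(num_candidates)]
    scored = [(value(dream(state, a)), a) for a in candidates]
    return max(scored, key=lambda x: x[0])[1]
```

用玩具函数即可验证该骨架:若状态为 7、价值函数偏好接近 10 的未来状态,则会从候选 1..8 中选出动作 3。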

[NLP-59] raining-Inference Consistent Segmented Execution for Long-Context LLM s ICML2026

【速读】: 该论文试图解决Transformer大模型在长上下文生成(long-context generation)中因全上下文注意力(full-context attention)带来的计算与内存成本导致的严重可扩展性挑战。现有许多推理高效的长上下文方法虽然在推理时采用有界上下文(bounded-context)或段级执行(segment-level execution),但训练仍使用全上下文注意力,导致训练与推理在执行语义和状态转换上不匹配。基于这一洞察,论文提出的关键解决方案是设计一个训练-推理一致的段级生成框架(training-inference consistent segment-level generation framework),使训练和推理遵循相同的段级前向执行语义。具体而言,训练时通过将梯度传播限制在仅来自紧邻前一段的KV状态(key-value states),同时在前向传播中允许头特定(head-specific)访问过去段的KV状态,而不将其纳入梯度传播,从而强制执行与推理的一致性。该方法在长上下文基准测试中实现了与全上下文注意力相当的性能,同时在与强推理高效基线的竞争中达到了具有竞争力的延迟-内存权衡,并在极长上下文长度(例如128K时比使用FlashAttention的全上下文注意力降低了约6倍的峰值预填充内存)下显著提升了可扩展性。

链接: https://arxiv.org/abs/2605.11744
作者: Xianpeng Shang,Jiang Li,Zehua Duo,Qianyi Cai,Xiangdong Su
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by ICML 2026. 19 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention, resulting in a mismatch between training and inference execution and state-transition semantics. Based on this insight, we propose a training-inference consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting head-specific access to past KV states during the forward pass without involving them in gradient propagation. Across long-context benchmarks, our approach achieves performance comparable to full-context attention, while achieving competitive latency-memory trade-offs against strong inference-efficient baselines, and substantially improving scalability at very long context lengths (e.g., approximately 6x lower peak prefill memory at 128K compared to full-context attention with FlashAttention).
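摘要描述的段级执行语义可以概括为:前向传播允许读取全部历史 KV 状态,但只有来自紧邻前一段的 KV 参与梯度传播。下面用纯 Python 草图标记这一梯度掩码逻辑(step 的接口为本文假设;真实实现中,掩码为 False 的 KV 对应被 detach 的张量):

```python
def segment_forward(tokens, segment_len, step):
    """段级执行骨架:逐段处理序列;每段前向可见全部历史 KV,
    但只有紧邻前一段携带的 KV 被标记为参与梯度传播。"""
    all_kv, outputs = [], []
    for start in range(0, len(tokens), segment_len):
        seg = tokens[start:start + segment_len]
        prev_start = max(0, start - segment_len)
        # grad_mask[i] 为 True 当且仅当第 i 个历史 KV 来自紧邻前一段
        grad_mask = [prev_start <= i < start for i in range(len(all_kv))]
        out, new_kv = step(seg, all_kv, grad_mask)
        outputs.extend(out)
        all_kv.extend(new_kv)
    return outputs, all_kv
```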

[NLP-60] Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

【速读】: 该论文试图解决的问题是:现有研究将on-policy蒸馏(On-policy Distillation, OPD)在大语言模型后训练中的高效率主要归因于更密集和稳定的监督信号,但对其效率背后的参数级机制缺乏深入理解。解决方案的关键在于揭示OPD效率源于一种“预见性”(foresight):它在训练早期就建立了朝向最终模型的稳定更新轨迹。这种预见性体现在两个层面:在模块分配层面,OPD能够识别低边际效用区域,并将更新集中在更关键的推理模块上;在更新方向层面,OPD表现出更强的低秩集中性,其主导子空间在训练早期就与最终更新子空间对齐。基于这些发现,论文提出了EffOPD,一种即插即用的加速方法,通过自适应选择外推步长并沿当前更新方向移动来加速OPD,无需额外可训练模块或复杂超参数调优,可实现约3倍平均训练加速且保持最终性能相当。

链接: https://arxiv.org/abs/2605.11739
作者: Yuchen Cai,Ding Cao,Liang Lin,Chunxi Luo,Xin Xu,Kai Yang,Weijie Liu,Saiyong Yang,Tianxiang Zhao,Guangzhong Sun,Guiquan Liu,Junfeng Fang
机构: USTC(中国科学技术大学); Tencent(腾讯); NUS(新加坡国立大学); HKUST(GZ)(香港科技大学(广州)); Alibaba(阿里巴巴); SHU(上海大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD’s efficiency remain poorly understood. In this work, we argue that OPD’s efficiency stems from a form of “foresight”: it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these insights, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of 3× while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.
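EffOPD 的加速思想可以写成沿当前更新方向的外推 θ ← θ_t + α(θ_t − θ_{t−1}),并自适应选取步长 α。以下是该思路的玩具实现(用验证损失在候选集中挑选 α 的方式为本文假设,仅作说明):

```python
def extrapolate(theta_prev, theta_curr, alpha):
    """沿当前更新方向 theta_curr - theta_prev 外推 alpha 步。"""
    return [c + alpha * (c - p) for p, c in zip(theta_prev, theta_curr)]

def adaptive_alpha(val_loss, theta_prev, theta_curr,
                   candidates=(0.0, 0.5, 1.0, 2.0)):
    """在候选步长中选出使外推后验证损失最小的 alpha
    (EffOPD 自适应步长选择的一个简单替身)。"""
    return min(candidates,
               key=lambda a: val_loss(extrapolate(theta_prev, theta_curr, a)))
```

在一个最优点位于 [3, 3] 的二次损失上,从 [0, 0] 更新到 [1, 1] 后,α = 2 的外推恰好落在最优点,因此会被自适应选中。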

[NLP-61] Allegory of the Cave: Measurement-Grounded Vision-Language Learning

【速读】: 该论文试图解决视觉语言模型(VLM)因依赖经ISP处理的RGB图像进行推理,导致传感器原始证据(如动态范围、低光细节)在渲染过程中被裁剪、抑制或量化,从而引入信息丢失和基础错误(grounding error)的问题。解决方案的关键是将视觉接口从标准RGB域迁移至更接近相机原始测量的测量域(measurement domain),具体通过三项设计实现:一是使用从RAW数据派生的Meas.-XYZ输入替代RGB图像,保留未压缩的传感器证据;二是引入相机条件化基础(camera-conditioned grounding),使模型适应不同传感器的特征;三是采用曝光包围监督聚合(Exposure-Bracketed Supervision Aggregation),将现有RGB监督信号有效转移到测量域观测上,从而在不依赖大规模测量域标注数据的情况下实现训练。

链接: https://arxiv.org/abs/2605.11727
作者: Kepeng Xu,Li Xu,Gang He,Wenxin Yu
机构: Xidian University (西安电子科技大学); Southwest University of Science and Technology (西南科技大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.

[NLP-62] Slicing and Dicing: Configuring Optimal Mixtures of Experts

【速读】: 该论文试图解决混合专家(Mixture-of-Experts, MoE)架构中多种核心设计选择(包括专家数量、粒度、共享专家、负载均衡机制及token丢弃策略)之间的相互作用问题,即这些选择能否在不考虑交互效应的前提下独立优化,以及它们对模型最终性能的独立贡献大小。解决方案的关键在于通过大规模系统实验,在总参数达6.6B的模型中进行了超过2000次预训练,系统性地变化总专家数、专家维度、单层内异构专家大小、共享专家大小及负载均衡机制,从而量化每个设计维度的影响。研究发现,专家数量和粒度是主导性能的核心因素,且最优专家大小仅依赖于活跃参数数量而非总参数数,而共享专家、异构专家和负载均衡等其他设置的影响相对较小,仅无丢弃路由(dropless routing)能带来一致的增益。

链接: https://arxiv.org/abs/2605.11689
作者: Margaret Li,Sneha Kudugunta,Danielle Rothermel,Luke Zettlemoyer
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms. First, we find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters, even at extreme ratios of active to total expert parameters. Second, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity, other choices have minimal effect on final quality.
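摘要中的"活跃参数量 / 总参数量"可以用一个简化的 MoE FFN 参数核算函数来体会(每个专家按上投影加下投影两个矩阵计权重,忽略路由器、偏置与注意力参数;该核算方式为本文假设):

```python
def moe_param_counts(d_model, d_expert, num_experts, top_k,
                     num_layers, shared_expert_dim=0):
    """门控 MoE 层堆叠的总 FFN 参数量与活跃 FFN 参数量:
    每个专家含 d_model×d_expert 的上/下投影各一;共享专家恒被激活。"""
    per_expert = 2 * d_model * d_expert
    shared = 2 * d_model * shared_expert_dim
    total_layer = num_experts * per_expert + shared
    active_layer = top_k * per_expert + shared
    return num_layers * total_layer, num_layers * active_layer
```

例如 16 个同构专家、top-2 路由且无共享专家时,活跃专家参数仅为总专家参数的 1/8;摘要的结论是,即便在更极端的活跃/总参数比下,增大总参数仍持续带来收益。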

[NLP-63] Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter

【速读】: 该论文试图解决大型语言模型(LLM)遗忘机制中的一个关键脆弱性:经过遗忘训练的模型在遭遇重新学习攻击(relearning attacks)时会迅速恢复被“遗忘”的知识,这严重威胁了开放权重模型的安全性。解决方案之关键是从表征几何(representation geometry)视角揭示了这一脆弱性的根本机制——现有遗忘方法主要优化表征中的主导成分(dominant components),而次要成分(minor components)几乎未被改变;重新学习攻击能轻易逆转主导成分的修改,但次要成分对此类逆转表现出更强的抵抗性。基于此洞察,论文提出次要成分遗忘(Minor Component Unlearning, MCU),通过明确将遗忘效果集中于表征中这些固有机理稳定的方向上,显著提升模型对重新学习攻击的抵抗能力,并在实验上超越包括sharpness-aware minimization在内的现有最优方法。

链接: https://arxiv.org/abs/2605.11685
作者: Zeguan Xiao,Xuanzhe Xu,Yun Chen,Yong Wang,Jian Yang,Yanqing Hu,Guanhua Chen
机构: Shanghai University of Finance and Economics (上海财经大学); Alibaba Group (阿里巴巴集团); Southern University of Science and Technology (南方科技大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model (LLM) unlearning aims to remove specific data influences from pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical vulnerability: unlearned models rapidly recover “forgotten” knowledge through relearning attacks. This fragility raises serious security concerns, especially for open-weight models. In this work, we investigate the fundamental mechanism underlying this fragility from a representation geometry perspective. We discover that existing unlearning methods predominantly optimize along dominant components, leaving minor components largely unchanged. Critically, during relearning attacks, the modifications in these dominant components are easily reversed, enabling rapid knowledge recovery, whereas minor components exhibit stronger resistance to such reversal. We further provide a theoretical analysis that explains both observations from the spectral structure of representations. Building on this insight, we propose Minor Component Unlearning (MCU), a novel unlearning approach that explicitly targets minor components in representations. By concentrating unlearning effects in these inherently robust directions, our method achieves substantially improved resistance to relearning attacks. Extensive experiments on three datasets validate our approach, demonstrating significant improvements over state-of-the-art methods including sharpness-aware minimization.
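"把遗忘效果集中到表征的次要成分"可以用 SVD 示意:取小奇异值对应的右奇异方向,把更新投影到它们张成的子空间上。以下草图基于 NumPy(保留比例 keep_ratio 等均为本文假设,并非论文原始流程):

```python
import numpy as np

def minor_component_projector(representations, keep_ratio=0.5):
    """构造到表征矩阵次要(小奇异值)方向张成子空间的投影算子,
    使遗忘更新集中在对重新学习攻击更具抵抗性的方向上。"""
    # 行为样本,列为隐藏维度
    _, s, vt = np.linalg.svd(representations, full_matrices=False)
    k = int(len(s) * keep_ratio)          # 丢弃前 k 个主导方向
    minor = vt[k:]                        # 次要右奇异向量
    return minor.T @ minor                # 到其张成子空间的投影

def project_update(update, projector):
    return projector @ update
```

在一个主导方向沿第一维的玩具表征上,该投影会几乎完全消去更新在主导方向上的分量,只保留次要方向上的分量。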

[NLP-64] Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

【速读】: 该论文试图解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)评估中缺乏基于日本K-12真实学校考试的高效度基准数据集的问题,现有基准多依赖合成或策划数据,无法真实反映实际教育场景的复杂性。解决方案的关键在于从日本全国学术能力评估中构建了一个多模态数据集,该数据集保留了官方发布的中学科学、数学和日语试题的真实考试布局、图表及日文教育文本,并包含了全国约90万学生答题分布的聚合数据,从而能够在统一评估框架下直接比较人类与模型的表现,为多模态教育推理提供可重复、以人为基础的基准,支持真实评估场景下的评估、反馈生成和可解释AI研究。

链接: https://arxiv.org/abs/2605.11663
作者: Kyosuke Takami,Yuka Tateisi,Satoshi Sekine,Yusuke Miyao
机构: Osaka Kyoiku University(大阪教育大学); University of Tokyo(东京大学); NII LLMC(国立情报学研究所LLMC)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan’s National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N \approx 900,000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact-match accuracy and character-level F1 for open-ended responses, observing substantial variation across subjects and strong sensitivity to visual reasoning demands. Human evaluation and LLM-as-judge analyses further assess the reliability of automatic scoring. Our dataset establishes a reproducible, human-grounded benchmark for multimodal educational reasoning and supports future research on evaluation, feedback generation, and explainable AI in authentic assessment contexts. Our dataset is available at: this https URL

[NLP-65] Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

【速读】: 该论文试图解决基于视觉-语言模型(VLM)的“先思考后回答”(think-answer)方法(如Qwen3-VL-Thinking)因中间推理步骤导致的高计算成本,从而限制实际部署的问题。为了将这种能力蒸馏到紧凑的think-answer VLM中,关键目标是提升学生在推理过程中对视觉证据的利用能力。解决方案的核心在于提出一种新型的思考-回答蒸馏框架,该框架通过遮蔽学生模型显著推理前缀(salient reasoning prefixes)来强制其思考锚定于视觉信息。具体关键策略包括:(1)逐令牌的显著推理前缀遮蔽,即针对每个下一个令牌预测选择性地遮蔽高影响力的推理前缀;(2)自定步调遮蔽预算调度,即根据教师-学生分布差异作为蒸馏难度指标,逐步增加遮蔽规模。在蒸馏阶段,学生模型使用显著推理前缀遮蔽替代标准因果遮蔽(causal mask),同时阻挡未来令牌和显著推理线索,从而迫使模型依赖视觉证据补偿缺失的文本线索。实验结果表明,该方法在多模态推理基准上优于现有开源VLM、VLM蒸馏及自蒸馏方法,并验证了学生思考过程中视觉利用的增强。

链接: https://arxiv.org/abs/2605.11651
作者: Seonghoon Yu,Dongjun Nam,Byung-Kwan Lee,Jeany Son
机构: KAIST(韩国科学技术院); NVIDIA(英伟达); POSTECH(浦项科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Pre-print

点击查看摘要

Abstract:Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student’s ability to utilize visual evidence throughout its reasoning trace. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student’s salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by discrepancy between teacher–student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.

[NLP-66] Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization

【速读】: 该论文试图解决在多语言场景下扩展自生成反事实解释(SCEs)时面临的挑战:现有方法难以在非主导语言中生成有效的SCEs,且解释质量受限于有效性与最小性之间的固有权衡。解决方案的关键在于引入Macro框架,这是一种基于偏好对齐的方法,通过应用直接偏好优化(Direct Preference Optimization, DPO)到多语言SCE生成中,利用复合评分函数构建偏好对,从而将有效性与最小性之间的权衡转化为可测量的偏好信号,实现两者之间的平衡优化。

链接: https://arxiv.org/abs/2605.11632
作者: Yilong Wang,Qianli Wang,Bohao Chu,Yihong Liu,Jing Yang,Simon Ostermann
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: In submission

点击查看摘要

Abstract:Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.
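
上文提到 Macro 用复合评分函数把"有效性 vs 最小性"的权衡转化为偏好信号并构造 DPO 偏好对。下面给出一个极简 Python 草图示意该构造过程(仅为示意:`composite_score` 的具体形式、权重 `lam` 以及用"编辑比例"度量最小性均为本示例的假设,并非论文实现):

```python
# 极简示意:用复合评分为候选反事实解释(SCE)构造 DPO 偏好对
def composite_score(valid, edit_ratio, lam=0.5):
    """valid: 是否翻转模型预测(有效性); edit_ratio: 编辑比例,越小越接近最小修改。"""
    return float(valid) - lam * edit_ratio

def build_preference_pairs(candidates):
    """candidates: [(text, valid, edit_ratio)] -> [(chosen, rejected)] 偏好对列表。"""
    scored = sorted(candidates, key=lambda c: composite_score(c[1], c[2]), reverse=True)
    best = scored[0]
    pairs = []
    for lo in scored[1:]:
        # 仅当评分严格更高时,才把最优候选作为 chosen、较差候选作为 rejected
        if composite_score(best[1], best[2]) > composite_score(lo[1], lo[2]):
            pairs.append((best[0], lo[0]))
    return pairs
```

这样,"有效但改动过大"与"改动小但无效"的候选都会被排到 rejected 一侧,偏好信号同时约束两个维度。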

[NLP-67] OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在部署到实际系统时面临的矛盾:大型模型虽具备强大的链式思维(Chain-of-Thought, CoT)推理能力,但受限于延迟与资源约束;而小型模型更适合在线服务,却因缺乏大规模、高质量的多模态CoT监督数据而导致推理性能受限。解决方案的核心是提出OmniThoughtVis,一个可扩展的数据整理与蒸馏(distillation)流程,其关键步骤包括:从多样化的开源种子池出发,生成结构化CoT轨迹,并联合标注推理难度、答案质量和语义任务标签;为确保数据质量,采用基于规则的过滤、难度感知筛选以及基于标签的多样性采样,最终构建包含180万样本的高质量语料库,支持可控子集构建用于下游训练。通过该流程将高容量教师模型的推理能力蒸馏至Qwen3-VL系列(2B至8B参数),实验表明蒸馏后模型在多模态推理基准上取得显著提升,例如4B模型在MathVerse上提升16.8分、在MMMU-Pro上提升5.6分,甚至在某些任务上超越未蒸馏的8B模型,验证了可扩展推理蒸馏对于部署导向型多模态模型的实用价值。

链接: https://arxiv.org/abs/2605.11629
作者: Yuanhao Yue,Chengyu Wang,Yuanjie Lyu,Lei Shen,Jun Huang
机构: Alibaba Group(阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred for online serving, yet their reasoning performance is bottlenecked by the lack of large-scale, high-quality multimodal CoT supervision. In this paper, we present OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs. Starting from a diverse open-source seed pool, our pipeline generates structured CoT traces and performs joint annotation of reasoning difficulty, answer quality, and semantic task tags. To maintain data quality at scale, we combine rule-based filtering, difficulty-aware selection, and tag-based diversity sampling, resulting in a curated corpus of 1.8M samples that supports controllable subset construction for downstream training. We use OmniThoughtVis to distill Qwen3-VL models from 2B to 8B parameters and evaluate them on nine multimodal reasoning benchmarks. The resulting distilled models show consistent gains across model scales, including improvements of up to +16.8 points on MathVerse and +5.6 points on MMMU-Pro for the 4B model. Notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several tasks, highlighting the practical value of scalable reasoning distillation for deployment-oriented MLLMs.

[NLP-68] When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models

【速读】: 该论文致力于解决现有针对大语言模型(LLMs)微调过程中后门攻击的局限性:现有方法多基于词元(token)层级的操纵,缺乏深层语义加工,导致隐蔽性不足;同时,依赖静态单一触发器(static trigger)的后门容易被检测,且干净微调(clean fine-tuning)会削弱触发器与目标输出之间的关联。论文通过因果验证发现,情感(emotion)作为整体语气风格因素,可在LLM表示空间中与语义解耦,形成与中性文本不同的聚类。基于此,方案的关键在于将情感因素设计为动态后门触发器,提出寄生情感风格动态后门攻击(Paraesthesia),通过将带有特定情感风格的样本混入干净数据中进行微调,使模型在推理阶段遇到情感输入时自动触发预定义的攻击响应。方法核心包括情感风格的量化(quantification)与重写(rewriting),从而在不牺牲模型干净效用(clean utility)的前提下,在两个任务类型及四个模型上均达到约99%的攻击成功率。

链接: https://arxiv.org/abs/2605.11612
作者: Ziyu Liu,Tao Li,Tianjie Ni,Xiaolong Lan,Wengang Ma,Tao Yang,Guohua Wang,Junjiang He
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Backdoor vulnerabilities widely exist in the fine-tuning of large language models (LLMs). Most backdoor poisoning methods operate mainly at the token level and lack deeper semantic manipulation, which limits stealthiness. In addition, prior attacks rely on a single fixed trigger to induce harmful outputs. Such static triggers are easy to detect, and clean fine-tuning can weaken the trigger-target association. Through causal validation, we observe that emotion is not directly linked to individual words, but functions as an overall stylistic factor through tone. In the representation space of LLM, emotion can be decoupled from semantics, forming a distinct cluster from the original neutral text. Therefore, we consider the emotional factor as the backdoor trigger to propose a parasitic emotion-style dynamic backdoor attack, Paraesthesia. By mixing samples with the emotional trigger into clean data and then fine-tuning the model, the model is able to generate the predefined attack response when encountering emotional inputs during the inference stage. Paraesthesia includes two components: the quantification and rewriting of emotional styles. We evaluate the effectiveness of our method on instruction-following generation and classification tasks. The experimental results show that Paraesthesia achieves an attack success rate of around 99% across both task types and four different models, while maintaining the clean utility of the models.

[NLP-69] Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

【速读】: 该论文试图解决在线自蒸馏(on-policy self-distillation)在数学推理任务中表现不稳定的问题,其根源在于特权上下文(privileged context)会扭曲教师模型的置信度分布:一方面膨胀结构连接词和可验证声明的置信度,另一方面抑制驱动多步搜索的思考标记(如“Wait”、“Let”、“Maybe”),导致学生模型无法有效学习推理过程。解决方案的关键是提出反自蒸馏(Anti-Self-Distillation, AntiSD),通过反转学生与教师之间的散度方向(从下降改为上升)来翻转每个token的符号,从而获得自然有界的优势信号;同时引入熵触发门(entropy-triggered gate),当教师熵崩溃时自动禁用该项,使其成为默认自蒸馏的即插即用替代。该方法在4B至30B参数的多个模型上,以2到10倍更少的训练步骤达到GRPO基线的准确率,并将最终准确率提升高达11.5个百分点。

链接: https://arxiv.org/abs/2605.11609
作者: Guobin Shen,Xiang Cheng,Chenxiao Zhao,Lei Huang,Jindong Li,Dongcheng Zhao,Xing Yu
机构: Xiaohongshu Inc.(小红书公司); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher’s confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens (“Wait”, “Let”, “Maybe”) that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline’s accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
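
论文用点互信息(PMI)诊断"特权上下文如何扭曲教师置信度"。下面给出逐 token PMI 的极简 Python 草图(示例中的概率取值为虚构,仅用于说明 PMI 符号的含义,并非论文实验数据):

```python
import math

# 极简示意:逐 token 点互信息(PMI)诊断
def token_pmi(p_with_priv, p_without_priv):
    """PMI = log p(token | 上下文 + 特权信息) - log p(token | 仅上下文)。
    PMI > 0:特权上下文"膨胀"该 token 的教师置信度(如结构连接词、可验证声明);
    PMI < 0:特权上下文"压低"该 token(如 "Wait"、"Maybe" 等思考标记)。"""
    return math.log(p_with_priv) - math.log(p_without_priv)
```

AntiSD 的动机即来自这一诊断:对被特权上下文系统性抬高/压低的 token,逐 token 反转散度方向可得到自然有界的优势信号。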

[NLP-70] PRISM: A Geometric Risk Bound that Decomposes Drift into Scale Shape and Head

【速读】: 该论文旨在解决后训练LLM变体(如量化、LoRA适配和蒸馏模型)评估中缺乏可解释性诊断的问题:现有相似性指标(如CKA、SVCCA)仅能标记性能退化,但无法将表示漂移直接关联到风险机制或提供修复方向。解决方案的关键是提出PRISM(Proxy Risk Inference via Structural Mapping),利用LLM的线性输出头及其骨干网络近似等距的结构特性,推导出一个目标模型与变体之间交叉熵风险差距的闭式上界。该上界通过校准用于变体排名,并将表示漂移分解为三个独立可测的轴:尺度不匹配、形状不匹配和头部发散,每个轴对应特定的故障模式(如低比特量化下的形状畸变、LoRA遗忘下的尺度可分离性、GGUF k-量化下的头部发散)。由于形状项可微,因此该几何框架还能作为训练时的正则化器以缓解灾难性遗忘。实验表明,PRISM在量化与LoRA遗忘场景下的变体排序平均斯皮尔曼相关系数分别达0.820和0.831,且轴引导的形状正则化器在聚合减轻下游遗忘方面优于经验回放方法。

链接: https://arxiv.org/abs/2605.11608
作者: Chieh-Yen Lin,Shao-Hua Sun
机构: Appier AI Research (Appier人工智能研究); National Taiwan University (国立台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.
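
PRISM 将表示漂移分解为尺度、形状、头部三个可独立测量的轴。下面用一个极简 Python 草图示意这种分解的思路(三个轴的具体度量定义为本示例的假设,并非论文中经校准的闭式风险界):

```python
import math

# 极简示意:把目标模型与后训练变体之间的漂移分解为尺度 / 形状 / 头部三个轴
def frob(M):
    """矩阵的 Frobenius 范数(M 为嵌套列表)。"""
    return math.sqrt(sum(x * x for row in M for x in row))

def drift_axes(H_t, H_v, W_t, W_v):
    """H_t/H_v: 目标模型与变体的表示矩阵; W_t/W_v: 各自的线性输出头。"""
    s_t, s_v = frob(H_t), frob(H_v)
    scale = abs(math.log(s_v / s_t))  # 尺度不匹配:整体幅度的对数比
    diff = [[a / s_v - b / s_t for a, b in zip(r1, r2)] for r1, r2 in zip(H_v, H_t)]
    shape = frob(diff)                # 形状不匹配:去尺度后的残余差异
    hd = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(W_v, W_t)]
    head = frob(hd) / frob(W_t)       # 头部发散:输出头的相对差异
    return scale, shape, head
```

按论文的思路,占主导的轴即提示修复方向(如形状畸变对应低比特量化,头部发散对应 GGUF k-量化),而非仅仅给出一个退化标记。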

[NLP-71] DiffScore: Text Evaluation Beyond Autoregressive Likelihood

【速读】: 该论文试图解决自回归语言模型在文本评估中因从左到右的解码顺序导致的位置偏差(positional bias)问题,即早期token仅依赖左侧上下文打分,混淆了架构非对称性与真实文本质量。解决方案的关键在于提出基于掩码重建(masked reconstruction)的替代范式,使用连续掩码率(continuous masking rates)测量文本的可恢复性(text recoverability),从而消除位置偏差并自然建立从局部流畅性到全局连贯性的评估层级。具体实现为DiffScore框架,它基于掩码大扩散语言模型(Masked Large Diffusion Language Models),并提供了多时间步质量剖面(multi-timestep quality profiles)和双向点互信息分解(bidirectional PMI decomposition)等诊断工具,以解耦流畅性与忠实度。

链接: https://arxiv.org/abs/2605.11601
作者: Wen Lai,Yingli Shen,Dingnan Jin,Qing Cui,Jun Zhou,Maosong Sun,Alexander Fraser
机构: Ant Group (蚂蚁集团); Tsinghua University (清华大学); Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoregressive language models are widely used for text evaluation, however, their left-to-right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: this https URL.
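
DiffScore 的核心是在一组连续掩码率下测量文本可恢复性并聚合。下面的 Python 草图示意这一聚合框架(`recover_prob` 为假设的模型打分接口:输入 token 序列与被掩位置,返回被掩 token 的平均重建概率;掩码率网格与均匀平均均为示例选择):

```python
import random

# 极简示意:跨连续掩码率测量可恢复性,并输出多时间步质量剖面
def diff_score(tokens, recover_prob, rates=(0.1, 0.3, 0.5, 0.7, 0.9), seed=0):
    rng = random.Random(seed)            # 固定种子保证掩码位置可复现
    profile = {}
    for r in rates:
        masked = [i for i in range(len(tokens)) if rng.random() < r]
        profile[r] = recover_prob(tokens, masked)
    score = sum(profile.values()) / len(profile)  # 跨掩码率的平均可恢复性
    return score, profile                # profile 即论文所称的多时间步质量剖面
```

低掩码率下的分数主要反映局部流畅性,高掩码率下则更依赖全局连贯性,这正是摘要中"从局部流畅性到全局连贯性"的评估层级。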

[NLP-72] Efficient LLM-based Advertising via Model Compression and Parallel Verification

【速读】: 该论文试图解决大语言模型(LLMs)在实时广告系统中因推理延迟高和计算成本大而难以部署的问题。解决方案的关键在于提出了一种高效生成式定向(Efficient Generative Targeting)框架,该框架整合了三种核心技术:自适应组量化(adaptive group quantization)以减少模型存储和计算开销、层自适应层次稀疏化(layer-adaptive hierarchical sparsification)以动态剪枝非关键计算、以及前缀树并行验证(prefix-tree parallel verification)以加速自回归生成中的验证步骤,三者协同作用在保持生成质量可接受的前提下实现显著的推理加速。

链接: https://arxiv.org/abs/2605.11582
作者: Wenxin Dong,Chang Gao,Guanghui Yu,Xuewu Jiao,Mingqing Hu,Qiang Fu,Peng Xu,Penghui Wei,Hui Xu,Yue Xing,Shuanglong Li,Lin Liu
机构: Baidu Inc. (百度)
类目: Computation and Language (cs.CL)
备注: 10 pages, 7 figures, industry paper

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable potential in advertising scenarios such as ad creative generation and targeted advertising. However, deploying LLMs in real-time advertising systems poses significant challenges due to their high inference latency and computational cost. In this paper, we propose an Efficient Generative Targeting framework that integrates adaptive group quantization, layer-adaptive hierarchical sparsification, and prefix-tree parallel verification to accelerate LLM inference while preserving generation quality. Extensive experiments on two real-world advertising scenarios demonstrate that our framework achieves significant speedup with acceptable quality degradation, making it operationally viable for practical deployments.
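
其中"前缀树并行验证"的思想是:把多条草稿续写合并成一棵 trie,目标模型沿树逐层验证、一次性接受最长匹配前缀。下面给出一个极简 Python 草图(`target_next` 为假设的目标模型接口:给定当前前缀,返回其选择的下一个 token;并非论文实现):

```python
# 极简示意:前缀树并行验证
def build_trie(drafts):
    """将多条草稿 token 序列合并为前缀树(嵌套 dict)。"""
    root = {}
    for seq in drafts:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def verify(trie, prefix, target_next):
    """沿 trie 下行:每步用目标模型选 token,命中分支则继续,否则停止。"""
    accepted, node = [], trie
    while node:                               # 尚有候选分支
        tok = target_next(prefix + accepted)  # 目标模型在当前前缀下的下一 token
        if tok not in node:                   # 与所有草稿分叉,停止接受
            break
        accepted.append(tok)
        node = node[tok]
    return accepted                           # 一次验证接受的最长匹配前缀
```

多条草稿共享公共前缀,树结构避免了对每条草稿重复验证,这是降低自回归解码验证开销的常见手段。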

[NLP-73] Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

【速读】: 该论文旨在解决大型语言模型(LLMs)在实时在线广告系统中推理时,因解码阶段频繁的内核启动(kernel launch)和算子间HBM数据往返导致的高延迟问题——现有MegaKernel方法要么因手动调优缺乏便携性(portability),要么因自动编译引入运行时动态调度(runtime dynamic scheduling)而产生不可接受的分支开销,难以在资源受限的GPU上兼顾效率与可移植性。解决方案的关键在于观察到固定部署配置下MegaKernel的最优执行路径唯一确定,从而将运行时动态决策完全提升至编译时(compile time):具体包括(1)通过三维共享内存约束模型(three-dimensional shared-memory constraint model)结合K维分割(K-dimension splitting),将峰值共享内存使用减少50%;(2)基于MLIR的细粒度DAG离线搜索(fine-grained DAG offline search),固化最优执行路径以彻底消除运行时分支;(3)设计异构混合推理引擎(heterogeneous hybrid inference engine),将MegaKernel作为插件嵌入TensorRT-LLM,融合高吞吐的Prefill阶段与低延迟的Decode阶段,首次实现商业在线广告系统中的工业级部署。

链接: https://arxiv.org/abs/2605.11581
作者: Wenxin Dong,Mingqing Hu,Guanghui Yu,Qiang Fu,Peng Xu,Hui Xu,Yue Xing,Xuewu Jiao,Shuanglong Li,Lin Liu
机构: Baidu Inc.(百度公司)
类目: Computation and Language (cs.CL)
备注: 10 pages, 8 figures

点击查看摘要

Abstract:When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined, and runtime dynamic decision-making can be entirely hoisted to compile time. Building on this insight, we propose Ada-MK: (1) a three-dimensional shared-memory constraint model combined with K-dimension splitting that reduces peak shared memory usage by 50%; (2) MLIR-based fine-grained DAG offline search that solidifies the optimal execution path, completely eliminating runtime branching; and (3) a heterogeneous hybrid inference engine that embeds MegaKernel as a plugin into TensorRT-LLM, combining high-throughput Prefill with low-latency Decode. On an NVIDIA L20, Ada-MK improves single-batch throughput by up to 23.6% over vanilla TensorRT-LLM and 50.2% over vLLM, achieving positive gains across all tested scenarios–the first industrial deployment of MegaKernel in a commercial online advertising system.

[NLP-74] BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

【速读】: 该论文试图解决自回归语言模型因逐Token生成而导致的根本性瓶颈问题:自然语言本质上由多Token单元(如短语、n-gram、搭配)共同承载语义,但现有模型每次仅生成一个Token,这不仅限制了预训练时模型的表达能力,也降低了推理时的吞吐量。现有方案如推测解码(speculative decoding)或基于扩散的语言模型,要么未能解决底层瓶颈,要么牺牲了语言建模所必需的因果结构。解决方案的关键在于提出了BitLM,该模型将每个Token表示为固定长度的二进制码,并使用轻量级扩散头(lightweight diffusion head)在每个块(block)内并行去噪多个Token。其核心创新是:在块间保持从左到右的因果注意力(causal attention),从而保留自回归建模的可靠性,同时在块内进行联合词汇决策,实现迭代细化的并行性。通过用按位去噪(bitwise denoising)替代大词汇表softmax,BitLM将Token生成重构为在紧凑二进制空间中的迭代承诺(iterative commitment),从而在保持语言模型因果基础不变的前提下,实现更高效的预训练和显著更快的推理。

链接: https://arxiv.org/abs/2605.11577
作者: Shaobin Zhuang,Yuang Ai,Jiaming Han,Xiaohui Li,Huaibo Huang,Xiangyu Yue,Xuefeng Hu,Kun Xu,Yali Wang,Hao Chen
机构: Shanghai Jiao Tong University (上海交通大学); MMLab, The Chinese University of Hong Kong (香港中文大学多媒体实验室); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); SIAT, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computation and Language (cs.CL)
备注: 12 pages, 4figures, 1 table

点击查看摘要

Abstract:Autoregressive language models generate text one token at a time, yet natural language is inherently structured in multi-token units, including phrases, n-grams, and collocations that carry meaning jointly. This one-token bottleneck limits both the expressiveness of the model during pre-training and its throughput at inference time. Existing remedies such as speculative decoding or diffusion-based language models either leave the underlying bottleneck intact or sacrifice the causal structure essential to language modeling. We propose BitLM, a language model that represents each token as a fixed-length binary code and employs a lightweight diffusion head to denoise multiple tokens in parallel within each block. Crucially, BitLM preserves left-to-right causal attention across blocks while making joint lexical decisions within each block, combining the reliability of autoregressive modeling with the parallelism of iterative refinement. By replacing the large-vocabulary softmax with bitwise denoising, BitLM reframes token generation as iterative commitment in a compact binary space, enabling more efficient pre-training and substantially faster inference without altering the causal foundation that makes language models effective. Our results demonstrate that the one-token-at-a-time paradigm is not a fundamental requirement but an interface choice, and that changing it can yield a stronger and faster language model. We hope BitLM points toward a promising direction for next-generation language model architectures.
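
BitLM 把每个 token 表示为定长二进制码,从而用按位去噪取代大词汇表 softmax。下面的 Python 草图示意最朴素的 token id 与定长比特向量之间的互转(码长按覆盖词表大小的最小位数选取,这一选取方式为本示例的假设):

```python
import math

# 极简示意:token id <-> 定长二进制码
def code_len(vocab_size):
    """覆盖整个词表所需的最小比特数。"""
    return max(1, math.ceil(math.log2(vocab_size)))

def encode(token_id, n_bits):
    """token id -> 高位在前的定长比特列表。"""
    return [(token_id >> i) & 1 for i in reversed(range(n_bits))]

def decode(bits):
    """比特列表 -> token id。"""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out
```

例如 5 万量级的词表只需 16 比特,块内多个 token 的比特向量即可由轻量扩散头并行去噪。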

[NLP-75] Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练知识与矛盾文档冲突时表现出的行为不一致问题:现有研究要么发现模型顽固保留训练答案(约50%时间忽略文档),要么发现模型完全顺从文档(约96%时间遵循上下文),这种经验矛盾长期未获解释。解决方案的关键在于提出一个三机制框架(three-regime framework),将先前实验中混淆的三种定性不同的处理情境加以区分:机制1(单源更新,dominant predictor: 证据一致性)、机制2(竞争整合,dominant predictor: 参数化确定性)和机制3(任务适当选择,dominant predictor: 任务知识要求)。此外,该研究形式化区分了参数化强度(parametric strength,即暴露频率)与参数化唯一性(parametric uniqueness,即编码一致性),并在稳定事实领域中实证发现强度是唯一有效的预测因子,而唯一性与之正交(r = -0.002, p = .97)。通过跨多个模型(Claude Sonnet 4.6、GPT-5.5、Gemini 2.5 Flash、Llama 4 Maverick、DeepSeek V3)的9,970次API调用实验,GEE逻辑回归确认了机制2的确定性梯度(beta = -0.38至-0.50,所有p = .013),机制3消融实验则表明仅改变任务框架(task framing)即可将上下文遵循率从接近100%(上下文知识条件)翻转至6-71%(参数化知识条件),所有模型均显著(p < .001)。该框架通过多项结果建模、对冲反应敏感性分析和FDR校正验证了鲁棒性。

链接: https://arxiv.org/abs/2605.11574
作者: Pruthvinath Jeripity Venkata
机构: Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 13 tables, no figures. 9,970 API calls across five frontier models

点击查看摘要

Abstract:The literature on how large language models handle conflict between their training knowledge and a contradicting document presents a persistent empirical contradiction: some studies find models stubbornly retain their trained answers, ignoring provided documents nearly half the time, while others find models readily defer to the document, following context approximately 96% of the time. We argue these contradictions dissolve once one recognises that prior experiments have studied three qualitatively distinct processing situations without distinguishing them. We propose a three-regime framework: Regime 1 (single-source updating, dominant predictor: evidence coherence), Regime 2 (competitive integration, dominant predictor: parametric certainty), and Regime 3 (task-appropriate selection, dominant predictor: task knowledge requirement). We formalise a distinction between parametric strength (exposure frequency) and parametric uniqueness (encoding consistency), showing empirically that these are orthogonal dimensions (r = -0.002, p = .97) with strength as the operative predictor in stable factual domains. We validate the framework across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3 using 9,970 API calls in three experimental phases. GEE logistic regression confirms the predicted Regime 2 certainty gradient for all five models (beta = -0.38 to -0.50, all p = .013, BH-FDR corrected). A Regime 3 ablation shows task framing alone flips context-following from near-100% (contextual knowledge condition) to 6-71% (parametric knowledge condition), with all five models significant (p < .001). The certainty gradient is robust to multinomial outcome modeling, sensitivity analyses for hedging responses, and FDR correction.

[NLP-76] Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting ACL2026

【速读】: 该论文旨在解决组相对策略优化(GRPO)在大语言模型推理能力训练中难以有效平衡探索与利用(exploration-exploitation tradeoff)的问题,该问题常导致次优性能。解决方案的关键是基于理论洞察——熵的变化由token概率与其对应优势(advantages)之间的协方差(covariance)所决定——提出了一种无超参数(hyperparameter-free)、协方差加权的优化方法,通过高斯核(Gaussian kernel)动态降低极端token级别更新的权重。该方法能自动减少探索-利用权衡带来的不稳定性,同时保留信息性学习信号,从而稳定训练过程中的熵并提升下游推理基准性能。

链接: https://arxiv.org/abs/2605.11538
作者: Cheng Wang,Qin Liu,Wenxuan Zhou,Muhao Chen
机构: National University of Singapore (新加坡国立大学); University of California, Davis (加利福尼亚大学戴维斯分校); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL 2026

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to effectively balance the tradeoff between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by the exploration-exploitation trade-off while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO, and effectively stabilizes entropy as training progresses.
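
该方法的直觉可以用一个极简 Python 草图示意:按每个 token 的"概率 × 优势"协方差贡献,用高斯核压低极端更新的权重(核的具体形式与带宽的自适应方式均为本示例的假设,并非论文公式):

```python
import math

# 极简示意:协方差感知的高斯核优势重加权
def gaussian_reweight(probs, advs):
    n = len(probs)
    mp, ma = sum(probs) / n, sum(advs) / n
    # 逐 token 的协方差贡献:理论上熵的变化由 cov(概率, 优势) 主导
    contrib = [(p - mp) * (a - ma) for p, a in zip(probs, advs)]
    sigma = math.sqrt(sum(c * c for c in contrib) / n) or 1.0  # 带宽自适应,无超参数
    return [math.exp(-((c / sigma) ** 2) / 2) for c in contrib]  # 贡献越极端,权重越小
```

贡献接近零的 token 保留权重 1,极端 token 被平滑压低,从而在不引入超参数的情况下稳定熵。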

[NLP-77] Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

【速读】: 该论文试图解决缺乏针对多模态临床检查报告(含版面、表格、生物标志物、异常标记、影像发现及领域术语)生成安全、有优先级且面向患者的行动建议的基准评测问题。解决方案之关键为:构建了Checkup2Action数据集(含2,000份去标识化的真实检查报告),并将检查报告到行动建议的生成定义为约束性结构化生成任务(constrained structured generation),通过生成结构化“行动卡片”(Action Card)——每张卡片描述一个临床相关问题并规定其优先级、推荐科室、随访时间窗口、面向患者的解释及向临床医生提问的问题,同时避免诊断或治疗处方性声明——并配套提出涵盖问题覆盖度和精确度、优先级一致性、科室与时间推荐准确性、行动复杂度、有用性、可读性及安全性合规的评估协议,从而为评估大语言模型的面向患者推理能力提供了多模态基准。

链接: https://arxiv.org/abs/2605.11533
作者: Sike Xiang,Shuang Chen,Kevin Qinghong Lin,Jialin Yu,Yijia Sun,Philip Torr,Amir Atapour-Abarghouei
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow-up actions. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient-oriented actions from multimodal check-up reports remains under-benchmarked. We present \textbfCheckup2Action, a multimodal clinical check-up report dataset and benchmark for structured \textitAction Card generation. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow-up time window, patient-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment-prescriptive claims. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, imaging-related evidence, and physician summaries. We formulate checkup-to-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance. Experiments with general-purpose and medical large language models reveal clear trade-offs between issue coverage, action correctness, conciseness, and safety alignment. Checkup2Action provides a new multimodal benchmark for evaluating patient-oriented reasoning over clinical check-up reports.
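
摘要将 checkup-to-action 定义为约束性结构化生成:每张 Action Card 必须带有优先级、科室、时间窗口等字段。下面用一个极简 Python 草图示意对生成结果的结构化校验(字段名与优先级取值集合均为本示例的假设,并非数据集官方 schema):

```python
# 极简示意:Action Card 的约束性结构化校验
REQUIRED = {"issue", "priority", "department", "time_window", "explanation", "questions"}
PRIORITIES = {"high", "medium", "low"}

def validate_card(card):
    """检查一张 Action Card(dict)是否满足结构约束,返回 (是否通过, 说明)。"""
    missing = REQUIRED - card.keys()
    if missing:
        return False, "missing fields: " + ", ".join(sorted(missing))
    if card["priority"] not in PRIORITIES:
        return False, "invalid priority: " + str(card["priority"])
    return True, "ok"
```

这类校验可作为安全合规评估的第一道关卡:格式不完整或优先级非法的生成结果直接判为不合规,再进入覆盖度与正确性等更细的指标。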

[NLP-78] Controllable User Simulation

【速读】: 该论文试图解决使用离线数据集评估对话代理时无法覆盖罕见场景或测试新策略的问题,并且指出通过提示或微调大型语言模型进行可控用户模拟的标准实践存在结构性偏差。具体而言,标准方法基于事后轨迹标签(post-hoc trajectory labels)进行监督微调,这些标签与生成数据的采集策略(behavior policy)不可分离地耦合,引入了前瞻偏差(look-ahead bias),破坏了因果一致性(causal consistency)。在策略转移(policy shift)下,这种失败导致评估指标的方差呈几何级数增长,即所谓“可控性崩溃”(controllability collapse)。解决方案之关键在于恢复因果一致性:论文建立了准确模拟的理论条件,并提出了三种实用的训练缓解措施——先验控制(a priori controls)、逐步动态控制(step-wise dynamic controls)以及直接策略条件学习(direct policy-conditioned learning),从而消除前瞻偏差、保留自然方差,并实现对未见代理行为的鲁棒零样本泛化。

链接: https://arxiv.org/abs/2605.11519
作者: Guy Tennenholtz,Ofer Meshi,Amir Globerson,Uri Shalit,Jihwan Jeong,Craig Boutilier
机构: Google Research(谷歌研究院); Tel Aviv University(特拉维夫大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Using offline datasets to evaluate conversational agents often fails to cover rare scenarios or to support testing new policies. This has motivated the use of controllable user simulators for targeted, counterfactual evaluation, typically implemented by prompting or fine-tuning large language models. In this work, we formalize controllable simulation as a causal inference problem. By bridging natural language evaluation with off-policy evaluation methodology, we show that the standard practice of training simulators via supervised fine-tuning on post-hoc trajectory labels yields a structurally biased model. Specifically, these labels are inextricably coupled to the data-generating behavior policy, injecting a look-ahead bias that breaks causal consistency. Furthermore, we prove that under policy shift this failure causes the variance of evaluation metrics to explode geometrically, a phenomenon we term controllability collapse. To restore causal consistency, we establish theoretical conditions for accurate simulation and propose practical training mitigations: a priori controls, step-wise dynamic controls, and direct policy-conditioned learning. Empirical evaluation confirms that while standard global controls distort conversational distributions and collapse behavioral diversity, our causally grounded simulators eliminate look-ahead bias, preserve natural variance, and exhibit robust zero-shot generalization to unseen agent behaviors.
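摘要指出,在策略偏移下评估指标的方差会呈几何级数增长("可控性崩溃")。下面用一段与之同源的教科书式离策略评估恒等式作示意:若每步重要性权重的二阶矩 rho > 1,则 T 步轨迹权重的方差为 rho^T - 1,随 T 几何爆炸。这只是标准 IS 方差推导的玩具版本,并非论文本身的证明。

```python
# 示意:策略偏移下重要性加权的方差随轨迹长度几何增长(独立同分布步的简化假设)
def is_weight_variance(pi, beta, horizon):
    """pi/beta 为两个动作上的目标/行为策略概率。
    逐步权重二阶矩 rho = E_beta[(pi/beta)^2];T 步权重的方差 = rho**T - 1。"""
    rho = sum(b * (p / b) ** 2 for p, b in zip(pi, beta))
    return rho ** horizon - 1.0
```

当目标策略与行为策略一致时 rho = 1、方差为 0;偏移越大、轨迹越长,方差越失控,这正是论文主张恢复因果一致性、而非依赖事后标签的动机所在。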

[NLP-79] AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration – Learning from Cheap, Optimizing Expensive

【速读】: 该论文旨在解决大规模语言模型(LLM)实验配置自动化的难题:现有自动化方法依赖低成本反复试错,而可扩展的LLM实验成本极高,无法承受此类迭代,导致配置工作仍高度依赖专家经验和人工劳动。解决方案之关键在于提出了一个名为AutoLLMResearch的多保真度(multi-fidelity)代理框架,其核心设计包括两个部分:一是LLMConfig-Gym,一个覆盖架构设计、超参数调优等四项关键LLM实验任务、并由超百万GPU小时可验证结果支持的多保真度环境;二是将配置研究建模为长期马尔可夫决策过程(Markov Decision Process),并通过结构化训练流程激励代理学习跨保真度外推(cross-fidelity extrapolation)推理能力,从而模仿人类研究者从低成本实验中提取通用原则并高效迁移至昂贵实验场景的策略。

链接: https://arxiv.org/abs/2605.11518
作者: Taicheng Guo,Nitesh V. Chawla,Olaf Wiest,Xiangliang Zhang
机构: University of Notre Dame (圣母大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
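摘要的核心思路是"从低保真实验学习规律、外推到昂贵设置"。下面用一段极简 Python 示意其中一种最朴素的跨保真度外推:对每个候选配置在小规模保真度上拟合幂律趋势,再按外推到目标保真度的预测损失择优。这只是一个说明性草图,论文实际采用的是强化学习训练的代理,而非此处的最小二乘拟合。

```python
import math

def fit_loglog(fidelities, losses):
    """最小二乘拟合 log(loss) = a + b * log(fidelity),即幂律趋势(示意)。"""
    xs = [math.log(f) for f in fidelities]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def predict_loss(a, b, fidelity):
    return math.exp(a + b * math.log(fidelity))

def pick_config(low_fid_results, target_fidelity):
    """low_fid_results: {配置名: [(保真度, 损失), ...]};外推后选预测损失最小者。"""
    best, best_loss = None, float("inf")
    for cfg, pts in low_fid_results.items():
        a, b = fit_loglog([f for f, _ in pts], [l for _, l in pts])
        pred = predict_loss(a, b, target_fidelity)
        if pred < best_loss:
            best, best_loss = cfg, pred
    return best
```

注意低保真度上表现较差的配置,可能因下降趋势更陡而在高保真度上反超,这正是"只看便宜实验的即时排名"不可靠的原因。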

[NLP-80] A Study on Hidden Layer Distillation for Large Language Model Pre-Training

【速读】: 该论文试图解决当前大型语言模型(LLM)预训练中知识蒸馏(Knowledge Distillation, KD)方法过度依赖输出 logits、忽视教师模型中间表示所蕴含语义信息的问题,并特别关注隐藏层蒸馏(Hidden Layer Distillation, HLD)在仅解码器(decoder-only)架构的大规模预训练场景中尚未被系统验证的空白。解决方案的关键在于通过计算受控实验(compute-controlled experiments)对 HLD 进行基准测试:以 Gemma3 3.4B 为教师模型,训练 123M 和 735M 参数的学生模型,在 C4 数据集上最多使用 168B tokens,并将 HLD 与基于 logits 的 KD 及自监督基线进行对比。实验表明,HLD 在多数下游评估任务上并未一致优于标准 KD,但在所有共享超参数配置下,HLD 能系统性地带来困惑度(perplexity)收益,这暗示教师中间表示中存在潜在信号可被提取,但需要方法上的突破才能在 LLM 预训练中发挥更显著的作用。

链接: https://arxiv.org/abs/2605.11513
作者: Maxime Guigon,Lucas Dixon,Michaël E. Sander
机构: Google DeepMind(谷歌深度思维)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information in the teacher’s intermediate representations. While Hidden Layer Distillation (HLD) showed potential for encoder architectures, its application to decoder-only pre-training at scale remains largely unexplored. Through compute-controlled experiments, we benchmark HLD against logit-based KD and self-supervised baselines with Gemma3 3.4B as teacher and 123M and 735M students trained on up to 168B tokens from the C4 dataset. Our experiments show that HLD does not consistently outperform standard KD on downstream evaluation tasks. Nevertheless, we show that HLD can yield a systematic perplexity gain over KD across all shared-hyperparameter configurations, suggesting that a latent signal can be extracted, but a breakthrough may be needed for it to play a more significant role in LLM pre-training.
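摘要对比了基于 logits 的 KD 与隐藏层蒸馏(HLD)。下面以纯 Python 标量运算示意这两类目标的常见形式:logit KD 为温度缩放后的 KL 散度,HLD 为学生隐藏态经线性投影对齐教师维度后的 MSE。具体损失形式与权重均为社区常用写法的示意,未必与论文实验配置一致。

```python
import math

def softmax(logits, tau=1.0):
    # 温度缩放 softmax(减最大值保证数值稳定)
    m = max(x / tau for x in logits)
    exps = [math.exp(x / tau - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_kd_loss(teacher_logits, student_logits, tau=2.0):
    # 基于输出 logits 的标准 KD:tau^2 * KL(p_teacher || p_student)
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return tau * tau * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hidden_distill_loss(teacher_hidden, student_hidden, proj):
    # HLD:学生隐藏态经线性投影 proj(教师维 x 学生维)对齐后取 MSE
    projected = [sum(w * h for w, h in zip(row, student_hidden)) for row in proj]
    return sum((a - b) ** 2 for a, b in zip(projected, teacher_hidden)) / len(teacher_hidden)

def combined_loss(t_logits, s_logits, t_hidden, s_hidden, proj, alpha=0.5):
    # 混合目标:logit KD 与隐藏层蒸馏按 alpha 加权
    return ((1.0 - alpha) * logit_kd_loss(t_logits, s_logits)
            + alpha * hidden_distill_loss(t_hidden, s_hidden, proj))
```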

[NLP-81] Robust Biomedical Publication Type and Study Design Classification with Knowledge-Guided Perturbations

【速读】: 该论文旨在解决生物医学文献的出版类型与研究设计自动索引(publication type and study design indexing)在分布偏移(distributional shift)下鲁棒性不足的问题——现有模型虽在内域精度(in-domain accuracy)上表现优异,但可能过度依赖浅层词汇或数据集特定线索,导致泛化能力脆弱。解决方案的关键在于采用结合实体掩码(entity masking)与领域对抗训练(domain-adversarial training)的鲁棒性导向训练策略,通过这些机制选择性抑制对非任务定义特征(如虚假主题相关性)的依赖,同时保留显著的方法学信号;由此证明,鲁棒性与内域精度之间的常见权衡可以通过设计仅压制非关键特征的目标加以缓解,其改进源于两个互补机制:增强对显式方法学线索的依赖,以及降低对虚假领域主题特征的依赖。

链接: https://arxiv.org/abs/2605.11502
作者: Shufan Ming,Joe D. Menke,Neil R. Smalheiser,Halil Kilicoglu
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); University of Illinois Chicago(伊利诺伊大学芝加哥分校)
类目: Computation and Language (cs.CL)
备注: Accepted by IEEE ICHI 2026

点击查看摘要

Abstract:Accurately and consistently indexing biomedical literature by publication type and study design is essential for supporting evidence synthesis and knowledge discovery. Prior work on automated publication type and study design indexing has primarily focused on expanding label coverage, enriching feature representations, and improving in-domain accuracy, with evaluation typically conducted on data drawn from the same distribution as training. Although pretrained biomedical language models achieve strong performance under these settings, models optimized for in-domain accuracy may rely on superficial lexical or dataset-specific cues, resulting in reduced robustness under distributional shift. In this study, we introduce an evaluation framework based on controlled semantic perturbations to assess the robustness of a publication type classifier and investigate robustness-oriented training strategies that combine entity masking and domain-adversarial training to mitigate reliance on spurious topical correlations. Our results show that the commonly observed trade-off between robustness and in-domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non-task-defining features while preserving salient methodological signals. We find that these improvements arise from two complementary mechanisms: (1) increased reliance on explicit methodological cues when such cues are present in the input, and (2) reduced reliance on spurious domain-specific topical features. These findings highlight the importance of feature-level robustness analysis for publication type and study design classification and suggest that refining masking and adversarial objectives to more selectively suppress topical information may further improve robustness. Data, code, and models are available at: this https URL
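摘要中的两项训练策略可以各用一行核心逻辑示意:实体掩码把实体跨度替换为统一占位符以抑制主题性捷径;领域对抗训练在标量层面等价于"最小化任务损失的同时最大化领域判别损失"(梯度反转层的效果)。以下仅为概念草图,占位符与系数均为假设。

```python
def mask_entities(tokens, entity_spans, mask="[ENT]"):
    """将 (start, end) 半开区间跨度内的实体词替换为占位符。"""
    out = list(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            out[i] = mask
    return out

def domain_adversarial_objective(task_loss, domain_loss, lam=0.1):
    """梯度反转的标量效果示意:任务损失减去加权的领域判别损失。"""
    return task_loss - lam * domain_loss
```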

[NLP-82] StoicLLM : Preference Optimization for Philosophical Alignment in Small Language Models

【速读】: 该论文试图解决在极端数据稀缺条件下,小语言模型(small LLMs)内化复杂哲学框架(如斯多葛主义)的能力问题。解决方案的关键在于使用偏好优化算法(ORPO, AlphaPO)对仅包含300个高保真示例的微数据集进行专门化训练,从而诱导模型在内向斯多葛美德(inward-facing Stoic virtues)上实现强对齐,接近少样本提示(few-shot prompting)的效果,并释放上下文窗口;然而,该方案无法克服小模型在外向世界主义责任(outward-facing cosmopolitan duties)上的表征局限性,所有模型(包括少样本基线)均表现出持续性失败,表明微数据集适应本身不足以解决此类表征缺陷。

链接: https://arxiv.org/abs/2605.11483
作者: Ishmam Khan,Sindhuja Thogarrati,Shuo Zhang
机构: Tufts University (塔夫茨大学); Bose Corporation (Bose公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on micro-datasets of foundational Stoic texts using preference optimization (ORPO, AlphaPO). Evaluated via a multi-model critic bank, our results show that just 300 high-fidelity examples can induce strong alignment with inward-facing Stoic virtues, closely approaching few-shot prompting while freeing the context window. Critically, however, all models, including few-shot baselines, exhibit a persistent failure on Stoicism’s outward-facing cosmopolitan duties, pointing to a representational limitation of small models that micro-dataset adaptation alone cannot overcome.
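摘要使用的 ORPO 属于无参考模型的偏好优化:在选中回复的 NLL 之外,加一项基于赔率比(odds ratio)的惩罚。下面按 ORPO 的公开定义给出该项的标量示意,其中 lam 等超参数为假设值,与论文实际配置无关。

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def odds(p):
    # odds(p) = p / (1 - p),p 为模型赋予某回复的(序列级)概率
    return p / (1.0 - p)

def orpo_penalty(p_chosen, p_rejected):
    # 赔率比惩罚项:-log sigmoid(log odds(p_w) - log odds(p_l))
    log_or = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    return -math.log(sigmoid(log_or))

def orpo_loss(nll_chosen, p_chosen, p_rejected, lam=0.1):
    # 总损失 = 选中回复的 NLL + lam * 赔率比项,无需单独的参考策略或奖励模型
    return nll_chosen + lam * orpo_penalty(p_chosen, p_rejected)
```

当模型对选中回复的赔率远高于被拒回复时,惩罚项趋近 0;两者无差异时惩罚为 log 2。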

[NLP-83] Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

【速读】: 该论文试图解决在线策略自我蒸馏(on-policy self-distillation)中大语言模型推理训练中一个未被质疑的默认设计:教师模型总是暴露于完整的参考推理过程(full reference reasoning),导致教师侧的暴露不匹配(exposure mismatch),即教师基于远超学生当前能力的推理生成的目标难以被学生有效吸收,且学生-教师不匹配程度随暴露增强而单调增长。解决方案的关键在于将教师暴露(teacher exposure)从固定超参数转变为可学习的训练时控制变量,提出了自适应教师暴露的自我蒸馏(ATESD)。ATESD采用轻量级Beta策略控制器(Beta-policy controller),基于紧凑的训练状态统计量(如学生当前表现)动态建模揭露比率(reveal ratio),并通过折扣学习进度奖励(discounted learning-progress reward)来优化该控制器,使每个暴露决策根据其对未来学生改进的长期影响而非即时损失变化获得奖励,从而解决了在线蒸馏中的延迟信用分配问题。

链接: https://arxiv.org/abs/2605.11458
作者: Zihao Han,Tiangang Zhang,Huaibin Wang,Yilun Sun
机构: ByteDance Douyin (字节跳动抖音)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: 11 pages, 4 figures; code not released yet

点击查看摘要

Abstract:On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student’s own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student’s current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student’s future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-1.7B, 4B, 8B show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
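摘要的三个组成部分(Beta 策略采样揭露比率、按比率截取参考推理前缀、折扣学习进度奖励)都可以用很短的代码示意。以下为本文的示意性草图:控制器参数如何由训练状态统计量产生、以及奖励的具体折扣形式,均以论文为准。

```python
import random

def beta_policy_sample(alpha, beta, rng):
    # 从 Beta(alpha, beta) 采样一个揭露比率 r ∈ [0, 1]
    return rng.betavariate(alpha, beta)

def reveal_prefix(reference_tokens, ratio):
    # 仅向教师暴露参考推理的前 ratio 比例,而非默认的全量暴露
    k = int(len(reference_tokens) * ratio)
    return reference_tokens[:k]

def learning_progress_reward(acc_trace, gamma=0.9):
    # 折扣学习进度奖励:按未来每步准确率增量的折扣和为本次暴露决策计分
    deltas = [b - a for a, b in zip(acc_trace, acc_trace[1:])]
    return sum((gamma ** t) * d for t, d in enumerate(deltas))
```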

[NLP-84] Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

【速读】: 该论文试图解决大型语言模型(LLM)代理在作为用户、数字服务与LLM基础设施交互枢纽时,因系统级风险被忽视而暴露出的新型安全问题,具体而言,是揭示了一种名为Mobius Injection的攻击范式,该攻击利用代理逻辑中的结构漏洞“语义闭合”(Semantic Closure),通过单次文本注入诱导代理组件持续递归执行,从而将自主代理武器化为僵尸节点,发动基于代理的定向拒绝服务(AbO-DDoS)攻击。解决方案的关键在于两点:一是识别并利用Semantic Closure这一漏洞,使攻击具备轻量、隐蔽、高可配置性,并能实现单节点调用放大至51.0倍、多节点p95延迟膨胀至229.1倍;二是提出基于代理组件能量(Agent Component Energy, ACE)分析的主动防御机制,通过测量代理组件图中的异常能量来检测恶意递归触发。

链接: https://arxiv.org/abs/2605.11442
作者: Zi Liang,Ronghua Li,Yanyun Wang,Qingqing Ye,Haibo Hu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents have emerged as key intermediaries, orchestrating complex interactions between human users and a wide range of digital services and LLM infrastructures. While prior research has extensively examined the security of LLMs and agents in isolation, the systemic risk of the agent acting as a disruptive hub within the user-agent-service chain remains largely overlooked. In this work, we expose a novel threat paradigm by introducing Mobius Injection, a sophisticated attack that weaponizes autonomous agents into zombie nodes to launch what we define as Agent-based and -Oriented DDoS (AbO-DDoS) attacks. By exploiting a structural vulnerability in agentic logic named Semantic Closure, an adversary can induce sustained recursive execution of agent components through a single textual injection. We demonstrate that this attack is exceptionally lightweight, stealthy against both traditional DDoS monitors and contemporary AI safety filters, and highly configurable, allowing for surgical targeting of specific environments or model providers. To evaluate the real-world impact, we conduct extensive experiments across three representative claw-style agents and three mainstream coding agents, integrated with 12 frontier proprietary or open-weight LLMs. Our results demonstrate that Mobius Injection achieves substantial attack success across diverse tasks, driving single-node call amplification up to 51.0x and multi-node p95 latency inflation up to 229.1x. The attack performance exhibits a superlinear increase with the number of poisoning nodes. To mitigate Mobius Injection, we propose a proactive defense mechanism using Agent Component Energy (ACE) Analysis, which detects malicious recursive triggers by measuring anomalous energy in the agent’s component graph.
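摘要中的 ACE 防御核心是度量组件调用的异常"能量"。下面用计数版本做一个最简示意:把每个组件的观测调用次数与正常基线的比值当作能量,超阈值即标记为疑似递归注入节点。论文中的能量定义建立在代理组件图上,此处的阈值与比值形式均为本文假设。

```python
def component_energy(counts, baseline):
    # 每个组件的"能量":观测调用次数相对正常基线的放大比(简化定义)
    return {c: counts.get(c, 0) / max(baseline.get(c, 1), 1) for c in counts}

def detect_recursive_triggers(counts, baseline, threshold=10.0):
    # 能量超过阈值的组件视为疑似被递归注入驱动的节点
    energy = component_energy(counts, baseline)
    return sorted(c for c, e in energy.items() if e >= threshold)
```

与摘要报告的"单节点调用放大至 51.0 倍"相对应,这类放大比远超正常波动范围,因此即便是粗糙的计数型检测也能提供一道防线。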

[NLP-85] Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

【速读】: 该论文试图解决大语言模型(LLM)在部分可观测环境中执行长期任务时面临的两大挑战:一是部分可观测性要求模型持续维护对未观测环境属性的不确定性;二是长期交互历史导致上下文无限增长,稀释了任务相关信息。解决方案的关键在于引入信念状态(belief state)的概念,并提出了Agent-BRACE方法:将LLM智能体解耦为信念状态模型和策略模型,并通过强化学习联合优化。信念状态模型产生一个结构化近似的信念分布,该分布由一组关于环境的原子自然语言声明(atomic natural language claims)组成,每个声明附带一个从“确定”到“未知”的序数语言确定性标签;策略模型则基于这种紧凑的结构化近似信念(而非完整历史)进行决策,从而在显式不确定性管理下学习选择动作,并维持一个与情节长度无关的恒定上下文窗口。

链接: https://arxiv.org/abs/2605.11436
作者: Joykirat Singh,Zaid Khan,Archiki Prasad,Justin Chih-Yao Chen,Akshay Nambi,Hyunji Lee,Elias Stengel-Eskin,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
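摘要把信念状态表示为"原子自然语言声明 + 序数确定性标签"的集合。下面用一个极简数据结构示意这种表示为何能让策略模型的上下文与情节长度无关:无论交互多少步,渲染出的信念文本规模只取决于声明数量。标签集合与更新规则均为示意性假设。

```python
CERTAINTY = ["unknown", "unlikely", "uncertain", "likely", "certain"]  # 序数确定性标签(示意)

def update_belief(belief, claim, level):
    """用新证据的确定性标签更新某条原子声明;此处简化为直接覆盖旧标签。"""
    assert level in CERTAINTY
    belief = dict(belief)
    belief[claim] = level
    return belief

def render_belief(belief):
    """将结构化信念压缩为固定格式上下文,替代完整交互历史供策略模型使用。"""
    return "\n".join(f"[{lvl}] {claim}" for claim, lvl in sorted(belief.items()))
```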

[NLP-86] Freeze Deep Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)低成本持续预训练(continued pre-training)中选择性分层更新(selective layer-wise updates)时的经验性黑箱问题——即由于缺乏可解释的指导,无法确定哪些层应冻结或训练。解决方案的关键在于提出了一个与架构无关的诊断框架LayerTracer,通过定位任务执行位置(locating task execution positions)和量化层敏感性(quantifying layer sensitivity),揭示层间表征及稳定性的演化模式。分析发现深层是任务执行的关键区域,且对破坏性更新保持高稳定性;基于此,论文通过三个受控的持续预训练实验证明,训练浅层并冻结深层的策略在C-Eval和CMMLU基准上一致优于全参数微调及相反分配,并通过混合模型案例验证了将高质量预训练模块置于深层能有效保留模型的固有知识,为资源受限团队提供了低成本、可解释的分层参数分配指导。

链接: https://arxiv.org/abs/2605.11416
作者: Yu-Hang Wu,Qin-Yuan Liu,Qiu-Yang Zhao,Bo Jiang,Jiang-Feng Yang,Qing-Wei Cong
机构: Nanhu Research Institute of China Electronic Science and Technology (中国电子科技南湖研究院); School of Electronic and Electrical Engineering, Shanghai University of Engineering Science (上海工程技术大学电子电气工程学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Selective layer-wise updates are essential for low-cost continued pre-training of Large Language Models (LLMs), yet determining which layers to freeze or train remains an empirical black-box problem due to the lack of interpretable guidance. To address this issue, we propose LayerTracer, an architecture-agnostic diagnostic framework that reveals the evolution patterns of layer-wise representations and stability by locating task execution positions and quantifying layer sensitivity. Analysis results reveal that deep layers act as critical regions for task execution and maintain high stability against disruptive updates. Guided by this finding, we conduct three controlled continued pre-training trials to compare diverse freeze-train strategies, demonstrating that training shallow layers while freezing deep layers consistently outperforms full-parameter fine-tuning and the opposite allocation on both C-Eval and CMMLU benchmarks. We further present a hybrid model case study, which validates that placing high-quality pre-trained modules in deep layers effectively preserves inherent knowledge of the model. This work delivers a low-cost and interpretable solution for resource-constrained teams, offering actionable guidance for layer-wise parameter allocation in continued pre-training and hybrid model construction.
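摘要结论是"训练浅层、冻结深层"。下面用两段极简函数示意这种分配策略,以及一种最朴素的层敏感性度量(扰动前后逐层表征的平均绝对偏移);LayerTracer 的实际诊断指标以论文为准。

```python
def allocate_layers(num_layers, shallow_ratio=0.5):
    """训练浅层、冻结深层:仅前 shallow_ratio 比例的层参与持续预训练。"""
    cut = int(num_layers * shallow_ratio)
    return [{"layer": i, "trainable": i < cut} for i in range(num_layers)]

def layer_sensitivity(clean_reprs, perturbed_reprs):
    """逐层表征在扰动前后的平均绝对偏移,作为层敏感性的粗略近似。"""
    return [sum(abs(a - b) for a, b in zip(c, p)) / len(c)
            for c, p in zip(clean_reprs, perturbed_reprs)]
```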

[NLP-87] MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification

【速读】: 该论文试图解决工业级表格数据(tabular data)在高维、大量缺失值、稀疏标注场景下缺乏统一自监督预训练框架的问题。其解决方案的关键在于:通过专用可学习编码(learnable tokens)对缺失值进行建模,以区分结构性缺失与随机缺失;采用孪生路径架构(twin-path architecture)联合优化掩码重建与任务监督的混合预训练,并引入混合专家增强损失(MoE-augmented loss),使特征通过专有子网络(specialized subnetworks)自适应路由,从而在不依赖手工特征的前提下实现工业级表格数据的通用预训练。

链接: https://arxiv.org/abs/2605.11408
作者: Bo Zheng,Yudong Chen,Zihua Xiong,Shuai Fang,Peidong He,Yang Yang,Sheng Guo
机构: Zhejiang University (浙江大学); MyBank, Ant Group (网商银行, 蚂蚁集团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tabular data forms the backbone of high-stakes decision systems in finance, healthcare, and beyond. Yet industrial tabular datasets are inherently difficult: high-dimensional, riddled with missing entries, and rarely labeled at scale. While foundation models have revolutionized vision and language, tabular learning still leans on handcrafted features and lacks a general self-supervised framework. We present MaskTab, a unified pre-training framework designed specifically for industrial-scale tabular data. MaskTab encodes missing values via dedicated learnable tokens, enabling the model to distinguish structural absence from random dropout. It jointly optimizes a hybrid supervised pre-training scheme–utilizing a twin-path architecture to reconcile masked reconstruction with task-specific supervision–and an MoE-augmented loss that adaptively routes features through specialized subnetworks. On industrial-scale benchmarks, it achieves +5.04% AUC and +8.28% KS over prior art under rigorous scaling. Moreover, its representations distill effectively into lightweight models, yielding +2.55% AUC and +4.85% KS under strict latency and interpretability constraints, while improving robustness to distribution shifts. Our work demonstrates that tabular data admits a foundation-model treatment–when its structural idiosyncrasies are respected.
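摘要强调 MaskTab 用专用可学习编码区分"结构性缺失"与"随机缺失",而非简单填零。下面以标量常量代替论文中的可学习嵌入,示意这种按列语义区分缺失类型的编码流程;常量取值与列划分方式均为本文假设。

```python
MASK_STRUCTURAL = -1.0  # 结构性缺失的专用编码(示意;论文中为可学习的嵌入向量)
MASK_RANDOM = -2.0      # 随机缺失的专用编码(示意)

def encode_row(row, structural_cols):
    """按列语义将 None 替换为对应缺失编码,使模型可区分两类缺失。"""
    out = []
    for col, val in row.items():
        if val is None:
            out.append(MASK_STRUCTURAL if col in structural_cols else MASK_RANDOM)
        else:
            out.append(float(val))
    return out
```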

[NLP-88] FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum

【速读】: 该论文试图解决强化学习中的组相对策略优化(Group Relative Policy Optimization, GRPO)在大型语言模型(LLM)数学推理任务中存在的两个被忽视的低效问题:一是固定KL系数(KL coefficient)过度限制了模型在需要显著偏离参考策略时的策略探索;二是均匀的问题采样(uniform question sampling)未能突出中等难度问题所提供的最具信息量的梯度信号。解决方案的关键在于提出了FG-ExPO(Frontier-Guided Exploration-Prioritized Policy Optimization),它集成了两个轻量组件:精度条件化KL缩放(Accuracy-Conditioned KL Scaling, AKL),通过批平均准确率的平滑非线性函数动态调整KL惩罚强度,在模型表现不佳时放松约束、在达到满意结果时加强约束;以及高斯课程采样(Gaussian Curriculum Sampling, GCS),按照以中等准确率(约0.5)为中心的高斯分布为问题分配采样权重,从而将模型训练聚焦于其学习前沿(learning frontier)。该方法通过扩展模型在固定推理预算下的有效探索空间,显著提升了数学推理基准上的性能。

链接: https://arxiv.org/abs/2605.11403
作者: Mingxiong Lin,Zhangquan Gong,Maowen Tang,Qian Li,Chuangchuang Wang,Jian Ma,Sutian Huang,Kai Tang,Haonan Lu
机构: OPPO AI Center (OPPO AI中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked inefficiencies inherent in GRPO. First, a fixed KL coefficient overly restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling overlooks that moderately difficult problems produce the most informative gradient signals. We propose FG-ExPO, short for Frontier-Guided Exploration-Prioritized Policy Optimization, which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and strengthening it when the model achieves satisfactory results. Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at a moderate accuracy level around 0.5, focusing model training on its learning frontier. We conduct evaluations on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base across six mainstream mathematical reasoning benchmarks. Experimental results demonstrate that FG-ExPO consistently outperforms vanilla GRPO. It delivers an absolute improvement of 13.34 on the AIME 2025 pass@32 metric, rising from 63.33 percent to 76.67 percent, and obtains an average pass@32 gain of 2.66 on the 8B model. The substantially larger performance gains observed on pass@32 compared to pass@1 verify that FG-ExPO enlarges the model’s effective exploration space under a fixed inference budget.
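摘要的两个组件都适合用几行代码说清:AKL 让 KL 系数随批平均准确率单调上升(准确率低则放松约束),GCS 以 0.5 为中心的高斯分布给问题分配采样权重。以下 sigmoid 形式与超参数(base、k、sigma)是本文为示意选取的假设值,论文只要求"平滑非线性"。

```python
import math

def akl_coef(batch_acc, base=0.05, k=6.0):
    """精度条件化 KL 系数:随批平均准确率平滑上升(低准确率时约束更松)。"""
    return base / (1.0 + math.exp(-k * (batch_acc - 0.5)))

def gcs_weight(acc, mu=0.5, sigma=0.2):
    """高斯课程采样:以中等准确率 mu=0.5 为中心为问题分配权重。"""
    return math.exp(-((acc - mu) ** 2) / (2.0 * sigma ** 2))

def sampling_distribution(question_accs):
    """把各问题的 GCS 权重归一化成采样分布。"""
    w = [gcs_weight(a) for a in question_accs]
    total = sum(w)
    return [x / total for x in w]
```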

[NLP-89] AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment NEURIPS2026

【速读】: 该论文试图解决现有健康基准无法对语言模型识别用户医疗表述中护理紧急程度(acuity)进行统一评估的问题。当前基准侧重医疗问答、广泛健康互动或狭窄工作流程中的特定分诊任务,缺乏跨场景的紧急程度识别评估框架。解决方案之关键是构建了AcuityBench,它通过整合五个公开数据集(涵盖用户对话、在线论坛帖子、临床小案例和患者门户消息),并统一采用四级紧急程度框架(从家庭监测到紧急护理),形成包含914个案例的基准集,其中697个共识案例用于标准准确度评估,217个医生确认的模糊案例用于不确定性感知评估。同时支持显式四分类问答(QA)和基于评分标准的自由形式对话响应两种互补任务格式,从而系统比较和压力测试模型在不同场景下引导用户达到适当护理水平的能力。

链接: https://arxiv.org/abs/2605.11398
作者: Robin Linzmayer(1 and 2),Georgianna Lin(2),Di Coneybeare(3),Jason Chu(3),Trudi Cloyd(3),Manish Garg(3),Miles Gordon(3),Elizabeth Hartofilis(3),Benjamin Hong(3),Ashraf Hussain(3),Eugene Y. Kim(3),Oluchi Iheagwara King(3),Ross McCormack(3),Erica Olsen(3),John K. Riggins Jr(3),Mustafa N. Rasheed(3),Dana L. Sacco(3),Vinay Saggar(3),Osman R. Sayan(3),Amit Shembekar(3),Janice Shin-Kim(3),Wendy W. Sun(3),Bernard P. Chang(3),David Kessler(3),Noémie Elhadad(1 and 2) ((1) Department of Computer Science, Columbia University, (2) Department of Biomedical Informatics, Columbia University, (3) Department of Emergency Medicine, Columbia University Irving Medical Center)
机构: Department of Computer Science, Columbia University(哥伦比亚大学计算机科学系); Department of Biomedical Informatics, Columbia University(哥伦比亚大学生物医学信息学系); Department of Emergency Medicine, Columbia University Irving Medical Center(哥伦比亚大学欧文医学中心急诊医学系)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 41 pages, 5 figures. Preprint under review for the Track on Evaluations and Datasets at NeurIPS 2026

点击查看摘要

Abstract:We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses this gap by harmonizing five public datasets spanning user conversations, online forum posts, clinical vignettes, and patient portal messages under a shared four-level acuity framework ranging from home monitoring to immediate emergency care. The benchmark contains 914 cases, including 697 consensus cases for standard accuracy evaluation and 217 physician-confirmed ambiguous cases for uncertainty-aware evaluation. It supports two complementary task formats: explicit four-way classification in a QA setting, and free-form conversational responses evaluated with a rubric-based judge anchored to the same framework. Across 12 frontier proprietary and open-weight models, we find substantial variation in clear-case acuity accuracy and error direction. Comparing task formats reveals a systematic tradeoff: conversational responses reduce over-triage but increase under-triage relative to QA, especially in higher-acuity cases. In ambiguous cases, no model closely matches the distribution of physician judgments, and model predictions are more concentrated than expert clinical uncertainty. We also compare expert and model adjudication on a subset of maximally ambiguous cases, using those cases to examine the role of clinical uncertainty in label disagreement. Together, these results position acuity identification as a distinct safety-critical capability and show that AcuityBench enables systematic comparison and stress-testing of how well models guide users to the right level of care in real-world health use.
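摘要反复区分"过度分诊"(over-triage)与"分诊不足"(under-triage)两种误差方向。下面按四级紧急程度框架给出误差方向统计的示意实现;级别命名为本文假设,与基准的实际标签字符串未必一致。

```python
LEVELS = ["home_monitoring", "routine_care", "urgent_care", "emergency_care"]  # 四级紧急程度(示意命名)

def triage_errors(preds, golds):
    """统计过度分诊(预测级别高于标注)、分诊不足(低于标注)与完全一致的比例。"""
    over = sum(1 for p, g in zip(preds, golds) if LEVELS.index(p) > LEVELS.index(g))
    under = sum(1 for p, g in zip(preds, golds) if LEVELS.index(p) < LEVELS.index(g))
    n = len(golds)
    return {"over": over / n, "under": under / n, "exact": (n - over - under) / n}
```

摘要中"对话式回答相比 QA 降低过度分诊、却增加分诊不足"的结论,正是用这类方向性统计而非单一准确率才观察得到的。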

[NLP-90] Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

【速读】: 该论文旨在解决当前大型语言模型(LLM)智能体缺乏灵活性的问题:现有脚手架(scaffolds)固定预编码推理决策,当任务要求自适应调整推理结构时表现脆弱,而人类能灵活切换多种推理模式(如规划、执行、修订目标、联想判断和形式化程序)。解决方案的关键在于Deep Reasoning——一种推理时(inference-time)方法,通过结构化元推理(structured meta-reasoning)动态构建任务特定的脚手架。该方法使用形式语言将元推理表示为联想推理(associative inference)、形式计算(formal computation)和递归子问题求解(recursive subproblem solving)的可执行分解,从而将分解原则编码为上下文示例(in-context examples)来指导测试时脚手架构建,最终实例化为通用代理DOLORES,通过将复杂任务分配至更受控的推理线程(reasoning threads)来减少过早终止和幻觉,显著提升多跳推理、长链问答等硬基准任务上的性能。

链接: https://arxiv.org/abs/2605.11388
作者: Dean Light,Michael Theologitis,Kshitish Ghate,Shuyue Stella Li,Benjamin Newman,Chirag Shah,Aylin Caliskan,Pang Wei Koh,Dan Suciu,Yulia Tsvetkov
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint under review

点击查看摘要

Abstract:Humans intuitively solve complex problems by flexibly shifting among reasoning modes: they plan, execute, revise intermediate goals, resolve ambiguity through associative judgment, and apply formal procedures to well-specified subproblems. Current LLM agents lack this flexibility, as their scaffolds hard-code such reasoning decisions in advance. These scaffolds are effective when their prescribed structure matches the task, but brittle when solving the task requires adapting the structure of reasoning itself. We introduce Deep Reasoning – an inference-time approach for constructing task-specific scaffolds through structured meta-reasoning. Deep Reasoning uses a formal language that represents meta-reasoning as executable decompositions over associative inference, formal computation, and recursive subproblem solving, enabling decomposition principles to be encoded as in-context examples that guide test-time scaffold construction. We instantiate this approach in a general-purpose agent (DOLORES) that distributes complex tasks across more controlled reasoning threads. We evaluate it against state-of-the-art scaffolding methods across four hard benchmarks: multi-hop reasoning, long-chain question answering, long-context aggregation, and deep research-style information seeking. DOLORES outperforms all evaluated scaffolds across three model sizes and two model families, improving over the strongest evaluated scaffold baseline by 24.8% on average. DOLORES distributes cognition across structured, lower-load reasoning threads, thereby reducing premature termination and hallucinations. This advantage can even bridge the scaling gap, with an 8B version surpassing all evaluated 32B baselines from the same family in more than half the settings. These results point toward future agentic systems that treat scaffolding as adaptive reasoning, constructing the structure each task requires just-in-time.
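摘要将元推理表示为对三类原语(联想推理、形式计算、递归子问题求解)的可执行分解。下面以一个玩具执行器示意这种递归结构;task 的字段约定完全是本文假设,论文使用的是一门专门的形式语言。

```python
def solve(task, depth=0, max_depth=8):
    """三类元推理原语的示意执行器:formal 走可验证计算,associative 交给
    LLM 推理线程(此处用占位结果代替),composite 分解后递归求解再聚合。"""
    assert depth <= max_depth, "分解过深"
    kind = task["kind"]
    if kind == "formal":
        return task["fn"](*task["args"])
    if kind == "associative":
        return task["placeholder"]  # 实际系统中由一个受控推理线程产生
    subresults = [solve(sub, depth + 1, max_depth) for sub in task["subtasks"]]
    return task["combine"](subresults)
```

每个子任务在更小、更受控的上下文中求解,这正是摘要所说"把认知分摊到低负载推理线程"从而减少过早终止与幻觉的直观含义。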

[NLP-91] An Empirical Study of Automating Agent Evaluation

【Quick Read】: This paper addresses the high cost and deep expertise required to automate agent evaluation: without domain-specific evaluation knowledge, frontier coding assistants cannot reliably carry out evaluations that involve multi-step behavior, tool use, and intermediate reasoning (only a 30% execution success rate, with over-engineered evaluations averaging 12+ metrics). The key of the proposed solution, EvalAgent, is to encode evaluation-domain expertise as composable Evaluation Skills, including procedural instructions, reusable code and templates, and dynamically retrieved API documentation, and to compose these skills into a trace-based pipeline that produces complete evaluation artifacts (metrics, executable code, and reports) end to end. Experiments show this skill composition is the core driver of evaluation quality: removing the skills causes Eval@1 to plunge from 65% to 30%.

Link: https://arxiv.org/abs/2605.11378
Authors: Kang Zhou, Sangmin Woo, Haibo Ding, Kiran Ramnath, Subramanian Chidambaram, Aosong Feng, Vinayak Arannil, Muhyun Kim, Ishan Singh, Darren Wang, Zhichao Xu, Megha Gandhi, Nirmal Prabhu, Soumya Smruti Mishra, Vivek Singh, Gouri Pandeshwar, Lin Lee Cheong
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
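The Eval@1 metric introduced above can be sketched as a simple first-run pass/fail aggregate. This is a hypothetical reconstruction: the paper's check for a "meaningful result" is presumably richer than the boolean flag assumed here.

```python
# Hypothetical sketch of an Eval@1-style metric: the fraction of generated
# evaluation scripts that both execute AND yield a meaningful result on the
# first run. Each record is a (executed_ok, result_meaningful) pair.
def eval_at_1(runs):
    if not runs:
        return 0.0
    passed = sum(1 for executed, meaningful in runs if executed and meaningful)
    return passed / len(runs)

# Toy data: 20 generated evaluations, 13 of which run and produce usable output.
runs = [(True, True)] * 13 + [(True, False)] * 4 + [(False, False)] * 3
score = eval_at_1(runs)  # 13/20 = 0.65, on the same scale as the reported 65%
```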

[NLP-92] PresentAgent-2: Towards Generalist Multimodal Presentation Agents

【Quick Read】: This paper extends presentation generation beyond static slide creation to end-to-end presentation video generation grounded in research, multimodal media, and interactive delivery. The key of the solution is a unified agentic framework, PresentAgent-2, which starts from an open-ended user query, summarizes it into a focused topic, and performs deep research to collect multimodal resources (relevant text, images, GIFs, and videos); it then constructs slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. Within a single framework it supports three independent presentation modes: single-speaker narration; multi-speaker discussion with structured roles (asking guiding questions, explaining concepts, clarifying details, summarizing key points); and interactive Q&A that independently answers audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. The authors also build a multimodal presentation benchmark to evaluate content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding.

Link: https://arxiv.org/abs/2605.11363
Authors: Wei Wu, Ziyang Xu, Zeyu Zhang, Yang Zhao, Hao Tang
Institutions: Peking University; La Trobe University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: this https URL. Website: this https URL.

[NLP-93] SOMA: Efficient Multi-turn LLM Serving via Small Language Model

【Quick Read】: This paper targets the efficiency-quality trade-off in serving LLMs for multi-turn dialogue: the standard practice of concatenating the full conversation history at every turn preserves coherence but incurs high latency, memory, and API cost, especially when queries are routed to large proprietary models, and existing methods struggle to improve efficiency while preserving response quality. The key of the proposed SOMA framework is to use the early turns of a session to estimate a local response manifold and then let a smaller surrogate model handle the rest of the conversation. Concretely, it learns soft prompts that maximize the semantic divergence between the large and small models' responses to surface the least-aligned local directions, stabilizes training with anti-degeneration control, and distills the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference; a simple gate enables a one-time switch with rollback on drift.

Link: https://arxiv.org/abs/2605.11317
Authors: Xueqi Cheng, Qiong Wu, Zhengyi Zhou, Xugui Zhou, Tyler Derr, Yushun Dong
Institutions: Florida State University; AT&T Chief Data Office; Louisiana State University; Vanderbilt University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models’ responses to surface least-aligned local directions, stabilize training with anti-degeneration control, and distill the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference. A simple gate enables a one-time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: this https URL.
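The gate's switch-and-rollback behavior described above can be sketched as follows. The switch point, drift statistic, and threshold are illustrative assumptions, not the paper's criteria.

```python
# Minimal sketch of a one-time switch from the large model to a surrogate,
# with rollback on drift. The drift function and thresholds are toy stand-ins
# for SOMA's actual gate.
def serve_session(queries, large_model, surrogate, drift_fn,
                  switch_after=3, drift_limit=0.3):
    """Answer with the large model for the first `switch_after` turns,
    then switch once to the surrogate; if drift_fn flags a drifted
    surrogate answer, roll back to the large model for good."""
    switched, rolled_back, answers = False, False, []
    for turn, q in enumerate(queries):
        if not switched and not rolled_back and turn >= switch_after:
            switched = True                          # one-time switch
        model = surrogate if switched else large_model
        ans = model(q)
        if switched and drift_fn(q, ans) > drift_limit:
            switched, rolled_back = False, True      # rollback on drift
            ans = large_model(q)
        answers.append(ans)
    return answers

# Toy usage: cheap string "models" and a drift score that fires on "hard".
large_model = lambda q: "L:" + q
surrogate = lambda q: "S:" + q
drift_fn = lambda q, ans: 0.9 if q == "hard" else 0.0
answers = serve_session(["a", "b", "c", "d", "hard"],
                        large_model, surrogate, drift_fn)
```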

[NLP-94] Predicting Psychological Well-Being from Spontaneous Speech using LLMs

【Quick Read】: This paper addresses zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech, i.e., using large language models to assess well-being from semantic cues in short voice recordings without task-specific training data. The keys of the solution are: first, a domain-informed prompt developed with experts in clinical psychology and linguistics that steers LLMs toward well-being-related semantic features; second, a systematic evaluation of 12 instruction-tuned LLMs (e.g., Llama-3, Gemma-3, DeepSeek) on speech from 111 participants, showing the models capture meaningful semantic cues with Spearman correlations up to 0.8; and finally, statistical analyses that explain prediction variability and systematic biases, together with keyword-based word-cloud analyses that highlight the linguistic features driving the predictions, improving explainability.

Link: https://arxiv.org/abs/2605.11303
Authors: Erfan Loweimi, Sofia de la Fuente Garcia, Saturnino Luz
Institutions: Centre for Medical Informatics (CMI), Usher Institute, University of Edinburgh
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We investigate the use of Large Language Models (LLMs) for zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech. Using a few minutes of voice recordings from 111 participants in the PsyVoiD database, we evaluated 12 instruction-tuned LLMs, including Llama-3 (8B, 70B), Ministral, Mistral, Gemma-2-9B, Gemma-3 (1B, 4B, 27B), Phi-4, DeepSeek (Qwen and Llama), and QwQ-Preview. A domain-informed prompt was developed in collaboration with experts in clinical psychology and linguistics. Results show that LLMs can extract semantically meaningful cues from spontaneous speech, achieving Spearman correlations of up to 0.8 on 80% of the data. Additionally, to enhance explainability, we conducted statistical analyses to characterise prediction variability and systematic biases, alongside keyword-based word cloud analyses to highlight the linguistic features driving the models’ predictions.

[NLP-95] A Theory of Time-Sensitive Language Generation: Sparse Hallucination Beats Mode Collapse

【Quick Read】: This paper studies timeliness in language generation under a global preference ordering: higher-ranked strings should be generated earlier, before their deadlines, matching the inductive bias in machine learning toward "simpler" or "more plausible" outputs. The key of the solution is to abandon strict consistency in favor of perhaps its mildest natural relaxation, a hallucination rate that vanishes over time. Under this condition, optimal density is achievable for any superlinear deadline function, circumventing the impossibility result that eventually consistent generators cannot generate in a timely manner; tightness is shown by ruling out timely generation under linear deadlines.

Link: https://arxiv.org/abs/2605.11302
Authors: Atul Ganju, Travis McVoy, Shaddin Dughmi, Shang-Hua Teng
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We study language generation in the limit under a global preference ordering on strings, as introduced by Kleinberg and Wei. As in [arXiv:2504.14370, arXiv:2511.05295], we aim for breadth, but impose an additional requirement of timeliness: higher-ranked strings should be generated earlier. A string is then only credited if it is generated before a deadline, where its deadline is defined by a function that maps a string's rank in the target language to the time by which it must be produced. This is in keeping with a central consideration in machine learning, where inductive bias favors "simpler" or "more plausible" outputs, all else being equal. We show that timely generation is impossible in a strong sense for eventually consistent generators, the protagonists of most prior related work. Under what is perhaps the mildest natural relaxation of consistency, a hallucination rate that vanishes over time, we show that we can circumvent our impossibility result. In particular, we can achieve optimal density with respect to any superlinear deadline function. We also show this is tight by ruling out timely generation with linear deadlines and vanishing hallucination rate.
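A toy simulation (not the paper's construction) makes the deadline mechanics concrete: a single hallucination shifts every later string by one step, which a generous superlinear deadline absorbs but a strict linear deadline does not.

```python
# Toy illustration of deadline-credited generation: a rank-r string counts
# only if first emitted by time d(r). The deadline functions and the
# one-hallucination trace below are illustrative, not from the paper.
def credited_density(emissions, language, deadline):
    """emissions[t-1] is the string emitted at (1-indexed) time t."""
    first_time = {}
    for t, s in enumerate(emissions, start=1):
        first_time.setdefault(s, t)
    ranks = {s: r for r, s in enumerate(language, start=1)}
    credited = [s for s in language
                if s in first_time and first_time[s] <= deadline(ranks[s])]
    return len(credited) / len(language)

language = [f"s{r}" for r in range(1, 11)]   # target strings, ranks 1..10
# One hallucination ("junk") delays every later string by one step.
emissions = ["s1", "junk", "s2", "s3", "s4",
             "s5", "s6", "s7", "s8", "s9", "s10"]
dense = credited_density(emissions, language, lambda r: r * r)  # superlinear
sparse = credited_density(emissions, language, lambda r: r)     # strict linear
```

With the superlinear deadline every string still meets its deadline despite the delay; under the strict linear one only the rank-1 string is credited.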

[NLP-96] LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

【Quick Read】: This paper addresses routing for multimodal large language models (MLLMs): models have heterogeneous strengths across OCR, chart understanding, spatial reasoning, and visual question answering, and differ markedly in cost and latency, so a router must dynamically pick the best model for each image-question input. Existing routers only estimate query difficulty and cannot match the specific multimodal capabilities of each candidate. The key of LatentRouter is to formulate routing as counterfactual multimodal utility prediction: it extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected; a distributional outcome head then predicts model-specific counterfactual quality, and a bounded capsule correction refines close decisions without letting residual signals dominate the prediction. The method supports both performance-oriented and performance-cost routing, and handles changing candidate pools via shared per-model scoring with availability masking.

Link: https://arxiv.org/abs/2605.11301
Authors: Xueqi Cheng, Yushun Dong
Institutions: Florida State University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: this https URL.
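At selection time, the utility-based policy described above reduces to an argmax over predicted quality minus weighted cost, restricted to available models. A minimal sketch with placeholder quality predictions standing in for the router's counterfactual outcome head:

```python
# Sketch of utility-based routing with availability masking. Model names,
# predicted qualities, and costs are illustrative placeholders.
def route(pred_quality, cost, available, cost_weight=0.0):
    """Pick the available model maximizing quality - cost_weight * cost."""
    best, best_u = None, float("-inf")
    for name, q in pred_quality.items():
        if not available.get(name, False):
            continue                       # availability mask
        u = q - cost_weight * cost[name]
        if u > best_u:
            best, best_u = name, u
    return best

pred_quality = {"large-mllm": 0.82, "mid-mllm": 0.74, "small-mllm": 0.60}
cost = {"large-mllm": 10.0, "mid-mllm": 3.0, "small-mllm": 1.0}
all_up = {name: True for name in pred_quality}

perf_choice = route(pred_quality, cost, all_up)                    # quality only
budget_choice = route(pred_quality, cost, all_up, cost_weight=0.02)
masked_choice = route(pred_quality, cost,
                      {"large-mllm": False, "mid-mllm": True,
                       "small-mllm": True})
```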

[NLP-97] Primal Generation Dual Judgment: Self-Training from Test-Time Scaling

【Quick Read】: This paper tackles the sparse, comparison-free training signal in code generation: conventional training operates in the primal space of programs, where the model receives isolated pass/fail bits, while test-time scaling exploits comparisons across candidates at inference but discards the judgment information it reveals. The paper argues this comparative information defines a dual judgment space with a richer relative-correctness structure. The key is DuST (Dual Self-Training): sample candidate programs from the model's own distribution, label them via sandbox execution, retain groups containing both successes and failures, and train the model with GRPO (Group Relative Policy Optimization) to rank candidates by execution correctness. The objective is purely discriminative: the model is never directly rewarded for producing correct programs. The central mechanism is that on-policy reinforcement learning transfers the dual-space discriminative signal back into primal generation, improving both judgment and generation, whereas SFT on the same ranking data improves judgment only.

Link: https://arxiv.org/abs/2605.11299
Authors: Yizhu Jiao, Ruixiang Zhang, Richard Bai, Jiawei Han, Ronan Collobert, Yizhe Zhang
Institutions: Apple; University of Illinois Urbana-Champaign
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Abstract:Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model’s own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales (4B to 30B), DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking on LiveCodeBench v6, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model’s single rollout matches the base model’s Best-of-4 performance. SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation.
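The group-relative signal underlying GRPO-style ranking, and the mixed-group filter described above, can be sketched as follows. The pass/fail rewards are toy values, and the normalization constant is a common implementation choice rather than a detail from the paper.

```python
# Sketch of group-relative advantages: each sampled candidate's reward is
# normalized against the other candidates in its group, so only mixed
# groups (containing both passing and failing programs) carry signal.
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def is_informative(rewards):
    """DuST-style filter: keep groups with successes AND failures."""
    return len(set(rewards)) > 1

adv = group_advantages([1.0, 0.0, 0.0, 1.0])   # mixed group: nonzero signal
flat = group_advantages([1.0, 1.0, 1.0, 1.0])  # uniform group: zero signal
```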

[NLP-98] ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

【Quick Read】: This paper addresses a flaw in existing capability-distillation methods, which treat different capabilities as independent training targets and ignore cross-capability transfer; under a fixed token budget this yields limited task-relevant gains and can cause harmful spillover onto other useful abilities, wasting the distillation budget. The key of the proposed ReAD (Reinforcement-guided cApability Distillation) framework is to first infer the set of task-essential capabilities, then generate capability-targeted supervision on the fly, and finally use an uncertainty-aware contextual bandit to adaptively allocate the distillation budget according to expected utility gains, explicitly modeling capability interdependence so that downstream utility improves and waste is reduced under the same budget.

Link: https://arxiv.org/abs/2605.11290
Authors: Xueqi Cheng, Xugui Zhou, Tyler Derr, Yushun Dong
Institutions: Florida State University; Louisiana State University; Vanderbilt University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most existing methods treat capabilities as independent training targets and overlook how improving one capability can reshape the student’s broader capability profile, especially when multiple abilities jointly determine task success. We study capability distillation under a fixed token budget and identify two consistent patterns: distillation induces systematic, budget-dependent cross-capability transfer, and additional budget often brings limited task-relevant gains while sometimes degrading other useful abilities. Building on these insights, we propose ReAD, a Reinforcement-guided cApability Distillation framework that explicitly accounts for capability interdependence. ReAD first infers task-essential capabilities, then generates capability-targeted supervision on the fly, and finally uses an uncertainty-aware contextual bandit to adaptively allocate the distillation budget based on expected utility gains. Extensive experiments show that ReAD improves downstream utility under the same token budget while reducing harmful spillover and wasted distillation effort compared to strong baselines. Our code is publicly available at this https URL.
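The budget-allocation step can be sketched as a UCB-style bandit over capabilities. The exploration rule, capability names, and utility values are illustrative assumptions; the paper's uncertainty-aware contextual bandit may differ.

```python
# Sketch of uncertainty-aware budget allocation: spend the next distillation
# batch on the capability with the highest optimistic estimate of utility
# gain (mean observed gain plus an exploration bonus).
import math

def allocate(history, capabilities, c=1.0):
    """history[cap] is a list of observed utility gains for that capability."""
    total = sum(len(v) for v in history.values()) or 1
    def ucb(cap):
        obs = history.get(cap, [])
        if not obs:                        # unexplored capability: try it first
            return float("inf")
        mean_gain = sum(obs) / len(obs)
        return mean_gain + c * math.sqrt(math.log(total) / len(obs))
    return max(capabilities, key=ucb)

caps = ["reasoning", "formatting", "recall"]
history = {"reasoning": [0.30, 0.25], "formatting": [0.02, 0.01], "recall": []}
pick = allocate(history, caps)   # unexplored "recall" is probed first
```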

[NLP-99] Unlocking LLM Creativity in Science through Analogical Reasoning

【Quick Read】: This paper addresses mode collapse in large language models on open-ended solution generation, i.e., the tendency to produce repetitive, low-diversity solutions, which limits their use in autonomous scientific discovery (e.g., in biomedicine). The key of the solution is analogical reasoning (AR) as a new generation strategy: identify shared relational structure between cross-domain problems, generate corresponding analogies, and use them to search for novel, diverse solutions. Experiments show AR substantially improves the diversity (90-173% gains on diversity metrics) and novelty (over 50% of generations are novel, versus as little as 1.6% for baselines) of generated solutions across several biomedical problems, with quantitative gains in practice, e.g., on perturbation effect prediction, cell-cell communication inference, brain-region interaction inference, and oligonucleotide property prediction, where AR outperforms existing approaches.

Link: https://arxiv.org/abs/2605.11258
Authors: Andrew Shen, Shaul Druckmann, James Zou
Institutions: Stanford University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Autonomous science promises to augment scientific discovery, particularly in complex fields like biomedicine. However, this requires AI systems that can consistently generate novel and diverse solutions to open-ended problems. We evaluate LLMs on the task of open-ended solution generation and quantify their tendency to mode collapse into low-diversity generations. To mitigate this mode collapse, we introduce analogical reasoning (AR) as a new approach to solution generation. AR generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions. Compared to baselines, AR discovers significantly more diverse generations (improving solution diversity metrics by 90-173%), generates novel solutions over 50% of the time (compared to as little as 1.6% for baselines), and produces high-quality analogies. To validate the real-world feasibility of AR, we implement AR-generated solutions across four biomedical problems, yielding consistent quantitative gains. AR-generated approaches achieve a nearly 13-fold improvement on distributional metrics for perturbation effect prediction, outperform all baselines on AUPRC when predicting cell-cell communication, infer brain region interactions with a high Spearman correlation (ρ = 0.729) to published methods, and establish state-of-the-art performance on 2 datasets for oligonucleotide property prediction. The novel and diverse solutions produced by AR can be used to augment the search space of existing solution generation methods.

[NLP-100] HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

【Quick Read】: This paper addresses the twin shortfalls of Hebrew-specialized LLMs in inference efficiency and performance: existing models such as DictaLM-3.0-24B-Thinking perform modestly on Hebrew reasoning, and no open-weight option exists that uses a sparse Mixture-of-Experts (MoE) architecture with native long context (65,536 tokens). The key of the solution is to build on the NVIDIA Nemotron-3 sparse MoE architecture with a three-phase easy-to-hard curriculum and continuous anti-forgetting anchoring, followed by supervised fine-tuning on 2 million bilingual Hebrew-English samples. The curriculum ordering alone yields a 3-point aggregate benchmark gain, and the model activates only 3B of its 30B parameters per forward pass, delivering roughly 9x higher inference throughput while reaching a 73.8% Hebrew reasoning average, surpassing comparable models and remaining competitive with larger ones such as Gemma-3-27B-IT.

Link: https://arxiv.org/abs/2605.11255
Authors: Noam Kayzer, Dan Revital, Ori Bar Joseph, Smadar Arvatz, Or Levi, Tal Geva, Shaltiel Shmidman, Amir DN Cohen, Noam Ordan, Omer Baruch, Kate Zinkovskaia, Zevi Apini, Sarel Weinberger
Institutions: PwC Next
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We present Hebatron, a Hebrew-specialized open-weight large language model built on the NVIDIA Nemotron-3 sparse Mixture-of-Experts architecture. Training employs a three-phase easy-to-hard curriculum with continuous anti-forgetting anchoring, followed by supervised fine-tuning on 2 million bilingual Hebrew–English samples. The curriculum ordering alone yields a 3-point aggregate benchmark gain over the reversed configuration. Hebatron achieves a Hebrew reasoning average of 73.8%, outperforming DictaLM-3.0-24B-Thinking (68.9%) and remaining competitive with Gemma-3-27B-IT on GSM8K-HE and Israeli Trivia, while activating only 3B parameters per forward pass across a 30B-parameter model, delivering approximately 9 times higher inference throughput at native context lengths up to 65,536 tokens. To our knowledge, this is the first language-specific adaptation of the Nemotron-3 architecture for any target language, and the first open-weight Hebrew-specialized MoE model with native long-context support. Model weights are released openly to support further research in Hebrew and Semitic-language NLP.

[NLP-101] RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German ACL2026

【Quick Read】: This paper addresses rubric-based scoring of short German answers when facing unseen answers or unseen questions. The key of the solution is a method called Meta-prompting: an LLM generates a custom prompt from examples in the train set, and that prompt is then used to grade new student answers, dynamically adapting to the changing nature of the task.

Link: https://arxiv.org/abs/2605.11242
Authors: Ignacio Sastre, Ignacio Remersaro, Facundo Díaz, Nicolás De Horta, Luis Chiruzzo, Aiala Rosá, Santiago Góngora
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be presented at the BEA 2026 workshop, co-located with ACL 2026

Abstract:In this paper, we present the RETUYT-INCO participation at the BEA 2026 shared task “Rubric-based Short Answer Scoring for German”. Our team participated in track 1 (Unseen answers three-way), track 3 (Unseen answers two-way) and track 4 (Unseen questions two-way). Since these tracks required scoring short student answers using specific rubrics, we looked for ways to handle the changing nature of the task. We created a method called Meta-prompting. In this approach, an LLM creates a custom prompt based on examples from the Train set. This prompt is then used to grade new student answers. Along with this method, we also describe other approaches we used, such as classic machine learning, fine-tuning open-source LLMs, and different prompting techniques. According to the official results, our team placed 6th out of 8 participants in Track 1 with a QWK of 0.729. In Track 3, we secured 4th place out of 9 with a QWK of 0.674, and we also placed 4th out of 8 in Track 4 with a QWK of 0.49.
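The QWK scores reported above are quadratic weighted kappa, a standard agreement metric for ordinal scoring. A minimal reference implementation (the standard formula, not code from the submitted system):

```python
# Quadratic weighted kappa: 1 - (weighted observed disagreement) /
# (weighted disagreement expected under label independence), with
# quadratic distance weights over the ordinal label scale.
def qwk(y_true, y_pred, n_labels):
    obs = [[0.0] * n_labels for _ in range(n_labels)]  # observed confusion
    for t, p in zip(y_true, y_pred):
        obs[t][p] += 1
    n = len(y_true)
    hist_t = [sum(obs[i]) for i in range(n_labels)]
    hist_p = [sum(obs[i][j] for i in range(n_labels)) for j in range(n_labels)]
    num = den = 0.0
    for i in range(n_labels):
        for j in range(n_labels):
            w = (i - j) ** 2 / (n_labels - 1) ** 2     # quadratic weight
            expected = hist_t[i] * hist_p[j] / n       # under independence
            num += w * obs[i][j]
            den += w * expected
    return 1.0 - num / den

perfect = qwk([0, 1, 2, 1], [0, 1, 2, 1], 3)   # perfect agreement -> 1.0
```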

[NLP-102] ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

【Quick Read】: This paper addresses the token-cost explosion of computer-use agents (CUAs): each GUI screenshot is encoded into many visual tokens, so the token cost of long interaction trajectories grows rapidly, limiting how much history fits under fixed context and compute budgets and preventing the history-driven gains seen in other domains. The key of ReVision is to train multimodal language models on trajectories in which redundant visual patches are removed by a learned patch selector that compares patch representations across consecutive screenshots while preserving the spatial structure the model requires. Across three benchmarks, on trajectories with 5 history screenshots, the method cuts token usage by roughly 46% on average while improving success rate by 3% over the no-drop baseline. More importantly, once redundancy is removed, history use no longer saturates: performance keeps improving with more past observations, suggesting that the previously observed limited benefit of history stems from inefficient token representations rather than from past information being useless.

Link: https://arxiv.org/abs/2605.11212
Authors: Amirhossein Abaskohi, Yuhang He, Peter West, Giuseppe Carenini, Pranit Chawla, Vibhav Vineet
Institutions: University of British Columbia; Microsoft Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, and unlike in other domains, using history has yielded little or no performance improvement. We address this inefficiency by introducing ReVision, which trains multimodal language models on trajectories where redundant visual patches are removed by a learned patch selector that compares patch representations across consecutive screenshots while preserving the spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated once redundancy is removed. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but rather a consequence of inefficient token representations.
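A fixed-threshold stand-in for the learned patch selector illustrates the redundancy criterion: a patch is dropped when its representation barely changes from the same position in the previous screenshot. The vectors and threshold below are toy values.

```python
# Toy redundancy check across consecutive screenshots: keep a patch only if
# its cosine similarity to the previous screenshot's same-position patch
# falls below a threshold (i.e., the patch actually changed). ReVision's
# learned selector replaces this fixed rule.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def keep_mask(prev_patches, cur_patches, threshold=0.98):
    """True = keep (patch changed); False = redundant, drop it."""
    return [cosine(p, c) < threshold
            for p, c in zip(prev_patches, cur_patches)]

prev = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
cur  = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]   # only patch 1 changed
mask = keep_mask(prev, cur)                   # [False, True, False]
```

Here two of three patches are dropped, i.e., a two-thirds token saving for this toy screenshot pair.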

[NLP-103] Instructions shape Production of Language not Processing

【Quick Read】: This paper asks how instructions shape the behavior of language models through a production-centered mechanism, specifically whether there is an asymmetry between the processing and production stages of language. The key of the solution is to probe task-specific information layer-wise across five binary judgment tasks, comparing how instruction tokens influence information both when sample tokens (the input under evaluation) are processed and when output tokens are produced. Across prompting variations, task-specific information in sample tokens remains largely stable and correlates only weakly with behavior, while the same information in output tokens varies substantially and correlates strongly with behavior. Attention-based interventions confirm this pattern causally: blocking instruction flow to all subsequent tokens reduces both behavior and output-token information, whereas blocking it only to sample tokens has minimal effect on either. The asymmetry generalizes across model families and tasks and sharpens with model scale and instruction tuning, both of which disproportionately affect the production stage; hence, understanding model capabilities requires jointly assessing internal representations and behavior, decomposing the internal view by token position to distinguish input-token processing from output-token production.

Link: https://arxiv.org/abs/2605.11206
Authors: Andreas Waldis, Leshem Choshen, Yufang Hou, Yotam Perlit
Institutions: Department of Linguistics, University of Tübingen; IBM Research; MIT; MIT-IBM Watson AI Lab; Interdisciplinary Transformation University of Austria
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Instructions trigger a production-centered mechanism in language models. Through a cognitively inspired lens that separates language processing and production, we reveal this mechanism as an asymmetry between the two stages by probing task-specific information layer-wise across five binary judgment tasks. Specifically, we measure how instruction tokens shape information both when sample tokens, the input under evaluation, are processed and when output tokens are produced. Across prompting variations, task-specific information in sample tokens remains largely stable and correlates only weakly with behavior, whereas the same information in output tokens varies substantially and correlates strongly with behavior. Attention-based interventions confirm this pattern causally: blocking instruction flow to all subsequent tokens reduces both behavior and information in output tokens, whereas blocking it only to sample tokens has minimal effect on either. The asymmetry generalizes across model families and tasks, and becomes sharper with model scale and instruction-tuning, both of which disproportionately affect the production stage. Our findings suggest that understanding model capabilities requires jointly assessing internals and behavior, while decomposing the internal perspective by token position to distinguish the processing of input tokens from the production of output tokens.

[NLP-104] How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation

【Quick Read】: This paper addresses the poorly understood relationship between differentially private (DP) training of large language models (e.g., with DP-SGD) and social bias, i.e., whether DP training mitigates or amplifies bias. The key of the solution is a multi-paradigm evaluation: a DP-trained model is systematically compared against non-DP baselines across four complementary tasks (sentence scoring, text completion, tabular classification, and question answering), in particular distinguishing logit-level from output-level bias, revealing that DP's effect on bias is inconsistent across tasks and across bias measures.

Link: https://arxiv.org/abs/2605.11195
Authors: Eduardo Tenorio, Karuna Bhaila, Xintao Wu
Institutions: University of Arkansas
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 1 figure

Abstract:Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of individual data points during training, yet the relationship between differential privacy and social bias in LLMs remains poorly understood. To investigate this, we present a systematic evaluation of social bias in a pretrained LLM trained with DP-SGD, comparing a DP model against non-DP baselines across four complementary paradigms: sentence scoring, text completion, tabular classification, and question answering. We find that DP reduces bias in sentence scoring tasks, where bias is measured through controlled likelihood comparisons, yet this improvement does not generalize across all tasks. Our results reveal a discrepancy between logit-level bias and output-level bias. Moreover, decreasing memorization does not necessarily reduce unfairness, underscoring the importance of multi-paradigm evaluation when assessing fairness in LLMs.

[NLP-105] The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models

【Quick Read】: This paper addresses the communication bottleneck and information loss of existing multi-model and tool-augmented systems, in which models can only exchange information by serializing it through discrete text; it asks whether two pretrained language models can instead coordinate through a continuous, concurrent channel rather than generated text. The key of the Bicameral Model is to couple two frozen language models through a trainable neural interface on their intermediate hidden states. At every generation step the two run in lockstep: the primary model drives the task while the auxiliary model operates tools, solves constraints, or executes code, with each conditioning on the other's activations through a translation network and a learned suppression gate. The gate accounts for only about 1% of combined parameters and learns a selective communication protocol from task loss alone, without a prescribed format, enabling efficient, low-overhead continuous coordination between the models.

Link: https://arxiv.org/abs/2605.11167
Authors: Cedric Flamant, Udaya Ghai, Kanna Shimizu
Institutions: AWS Agentic AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages main text, 5 figures, 24 pages appendix

Abstract:Existing multi-model and tool-augmented systems communicate by generating text, serializing every exchange through the output vocabulary. Can two pretrained language models instead coordinate through a continuous, concurrent channel? The Bicameral Model couples two frozen language models through a trainable neural interface on their intermediate hidden states. At every generation step, both models run in lockstep: a primary model drives the task while an auxiliary model operates tools, solves constraints, or executes code, with both conditioning on each other's activations through a translation network and a learned suppression gate (about 1% of combined parameters). The gate learns a selective communication protocol from task loss alone, without a prescribed format. We demonstrate the mechanism across three tool backends. On arithmetic, coupling two 0.5B models with a calculator raises accuracy from 36% to 96%. On logic grid puzzles, coupling two 0.6B models with a Z3 solver achieves 1.7x the unaugmented baseline on ZebraLogic. On mathematical reasoning, coupling with a Python sandbox enables the auxiliary to generate problem-specific code from hidden-state signals alone, without ever seeing the problem text.
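One coupling step of the hidden-state interface can be sketched in pure Python. The linear translation map and scalar gate below are simplifying assumptions standing in for the trainable translation network and learned suppression gate.

```python
# Toy coupling step: the auxiliary hidden state is passed through a
# translation map W and blended into the primary state via a gate in [0, 1].
# A gate near 0 suppresses the channel; near 1, it couples fully.
def couple(h_primary, h_aux, W, gate):
    """h' = h_primary + gate * (W @ h_aux)."""
    translated = [sum(W[i][j] * h_aux[j] for j in range(len(h_aux)))
                  for i in range(len(h_primary))]
    return [h + gate * t for h, t in zip(h_primary, translated)]

W = [[1.0, 0.0], [0.0, 1.0]]   # identity translation for the toy example
h_open   = couple([0.1, 0.2], [1.0, -1.0], W, gate=1.0)  # full coupling
h_closed = couple([0.1, 0.2], [1.0, -1.0], W, gate=0.0)  # channel suppressed
```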

[NLP-106] Decomposing Evolutionary Mixture-of-LoRA Architectures: The Routing Lever the Lifecycle Penalty and a Substrate-Conditional Boundary

【Quick Read】: This paper uses a decomposition study to attribute the independent and interacting contributions of the components of an evolutionary mixture-of-LoRA system trained from scratch on a widened-D substrate (~150M parameters, D=1536, V=32000) and to identify the key driver. The key of the solution is a partial 2^3 factorial experiment (5 of 8 cells at n=3 seeds) that separates the system into three independently manipulable factors: a router rewrite (a parallel sigmoid gate with learnable per-adapter floor and an annealed, bounded temperature, fed post-stack hidden states rather than token-embedding means), a per-domain leave-one-out evaluation scope, and a lifecycle of death, alpha-blend inheritance, SVD mutation, and slot reallocation. The attribution chain is sharp: the router rewrite alone carries the entire balanced log-PPL improvement over the static B3 baseline (+0.0426 nat, t=12.86, p=0.006), while the lifecycle is a net drag in the primary chain (about -0.028 nat, t=-4.46, p=0.047). Auxiliary counterfactual controls and a controllable synthetic sandbox further confirm that the lifecycle does not play a "genomic-context" role and that evolutionary search on the routing channel is load-bearing only when adapters are pre-aligned to the task. The solution thus rests on fine-grained factorial attribution with controlled experiments, decoupling the system's behavior into per-component effects and ruling out confounding explanations such as a load-bearing lifecycle.

Link: https://arxiv.org/abs/2605.11153
Authors: Ramchand Kumaresan
Institutions: Murai Labs
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:We decompose an evolutionary mixture-of-LoRA system on a from-scratch ~150M-parameter widened-D substrate (D=1536, V=32000; D/V ≈ 0.048; the "widened-1536" substrate) into three factors – a router rewrite (parallel sigmoid gate with learnable per-adapter floor and bounded temperature anneal, fed post-stack hidden states rather than token-embedding means), a per-domain leave-one-out evaluation scope, and a lifecycle of death plus alpha-blend inheritance plus SVD mutation plus slot reallocation – and report a 5-of-8 partial 2^3 factorial run at n=3 seeds and 25000 adaptation steps per cell. The attribution chain is sharp on this substrate: the router rewrite carries the entire +0.0426 nat balanced log-PPL improvement (Δ = log PPL_ref − log PPL_test, positive = improvement; t=12.86, p=0.006) attributed to "the full evolutionary system vs the static B3 baseline"; the headline full-system-vs-B3 balanced contrast itself is +0.015 nats, t=1.94, p=0.19 at n=3 and does not clear α=0.05. The per-domain evaluation scope is null at seed-resolution, and the lifecycle is a net drag of ≈ −0.028 nats (t=−4.46, p=0.047 in the primary chain). An auxiliary α=0 inheritance counterfactual at n=3 seeds is sign-inconsistent at the headline metric and underpowered for either an equivalence or load-bearing conclusion (corrected from an earlier arithmetic-mean aggregator that erroneously cleared inheritance; see Appendix B.11). A base-perturbation probe directionally refutes a "genomic-context" reframe of the lifecycle role. A controllable synthetic sandbox locates a substrate-conditional regime boundary: evolutionary search on the routing channel is load-bearing only when adapters are pre-aligned to the task; in every other regime tested it underperforms, ties, or actively degrades the gradient solution.

[NLP-107] Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

【Quick Read】: This paper targets diversity collapse in LLM decoding: models tend to generate a narrow subset of plausible outputs rather than covering the set of valid possibilities. The key to the solution is a validity-diversity framework that attributes the collapse to two forms of miscalibration in the inference-time probability distribution. First, order calibration fails: valid tokens are not reliably ranked above invalid ones, so rank-based cutoff rules must trade off recovering valid continuations against admitting invalid ones. Second, shape calibration fails: probability mass is over-concentrated on a few valid continuations while a heavy tail mixes valid and invalid tokens, so maintaining high validity severely limits diversity. By formalizing these two local failure modes and their compounding across decoding steps, the paper shows that diversity collapse is not a limitation of particular sampling heuristics but an inherent order and shape miscalibration of the LLM distribution.

Link: https://arxiv.org/abs/2605.11128
Authors: Amin Banayeeanzade, Qingchuan Yang, Dhruv Tarsadiya, Fatemeh Bahrani, Leonardo Blas, Alfy Samuel, Robin Jia, Meisam Razaviyayn, Sai Praneeth Karimireddy
Affiliations: University of Southern California; Capital One
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks for measuring this lack of diversity, less is known about how the step-by-step probability distributions at inference time cause the problem. We introduce a validity–diversity framework that attributes diversity collapse to how an LLM allocates probability mass across valid and invalid continuations during decoding. This framework decomposes the bottleneck into two complementary forms of miscalibration. First, order calibration: valid tokens are not reliably ranked above invalid tokens, so rank-based cutoff rules must trade off between recovering valid continuations and admitting invalid ones. Second, shape calibration: probability mass is overly concentrated only on few valid continuations while having a heavy-tail of mixed valid and invalid tokens, so maintaining high validity limits diversity. We formalize both mechanisms and show that local failures compound across decoding steps, producing strong sequence-level losses in diversity. Empirically, we develop controlled diagnostics for probing these bottlenecks, including tasks with exactly known valid sets and oracle cutoff baselines. Across 14 language models spanning multiple families and scales, we find that diversity collapse is not merely a limitation of particular sampling heuristics, but a consequence of order and shape miscalibration in the LLM distribution.
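The order/shape distinction above can be illustrated with a toy next-token distribution. The sketch below is illustrative only: the probabilities, the valid/invalid labeling, and the `topk_cutoff_stats` helper are invented for this example and do not come from the paper. It shows how a strict rank cutoff preserves validity but collapses diversity, while a looser cutoff recovers more valid continuations.

```python
def topk_cutoff_stats(probs, valid, k):
    """Keep the top-k tokens by probability; report what fraction of
    valid tokens survives (diversity) and what fraction of the kept
    probability mass is valid (validity)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept = order[:k]
    recovered = sum(1 for i in kept if valid[i]) / sum(valid)
    kept_mass = sum(probs[i] for i in kept)
    valid_mass = sum(probs[i] for i in kept if valid[i])
    return recovered, valid_mass / kept_mass

# Toy distribution with 4 valid and 4 invalid continuations.
# Shape miscalibration: most mass sits on one valid token;
# order miscalibration: an invalid token outranks two valid ones.
probs = [0.55, 0.15, 0.10, 0.08, 0.06, 0.03, 0.02, 0.01]
valid = [1,    0,    1,    1,    0,    1,    0,    0]

div2, val2 = topk_cutoff_stats(probs, valid, k=2)  # strict cutoff
div5, val5 = topk_cutoff_stats(probs, valid, k=5)  # loose cutoff
```

With the strict cutoff only 25% of valid continuations survive; loosening it to k=5 recovers 75% of them at almost no cost in validity, which is the trade-off the framework formalizes.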

[NLP-108] On Problems of Implicit Context Compression for Software Engineering Agents

【Quick Read】: This paper addresses the failures of LLM-based software engineering agents on complex, long-horizon tasks caused by context length limitations. The proposed solution applies the recently introduced In-Context Autoencoder, which encodes context from discrete tokens into continuous embeddings for denser information storage, so that longer task histories fit within a limited context window. The experiments show, however, that while the method works on single-shot common-knowledge and code-understanding tasks, it fails on multi-step agentic coding tasks; the paper therefore focuses on this failure and its possible causes.

Link: https://arxiv.org/abs/2605.11051
Authors: Kirill Gelvan, Igor Slinko, Felix Steinbauer, Egor Bogomolov, Florian Kofler, Yaroslav Zharov
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:LLM-based Software Engineering agents face a critical bottleneck: context length limitations cause failures on complex, long-horizon tasks. One promising solution is to encode context as continuous embeddings rather than discrete tokens, enabling denser information storage. We apply the recently proposed In-Context Autoencoder for this purpose. While the method performs well on single-shot common-knowledge and code-understanding tasks, our experiments demonstrate that it fails on multi-step agentic coding tasks. In this paper, we explore this phenomenon and discuss possible factors contributing to this failure.

[NLP-109] AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents

【Quick Read】: This paper addresses two structural weaknesses in defenses against indirect prompt injection (IPI) for tool-using LLM agents: existing defenses try to prevent attacks rather than detect the compromises that slip through, and they have only been evaluated in English, leaving users of low-resource languages such as Kurdish and Arabic without tested protection. The key to the solution is AgentShield, a deception-based detection framework that plants three layers of traps in the agent's tool interface: fake tools, fake credentials, and allowlisted parameters. Trap triggers double as high-precision labels for training a self-supervised classifier. An agent that follows an attacker's hidden instruction almost always touches at least one trap, yielding both a real-time compromise signal and zero-false-positive labels without manual annotation. Across 176 cross-lingual attack prompts and four LLMs from three providers, the framework catches 90.7%-100% of successful attacks on commercial models with zero false alarms on 485 normal-use tests, transfers across models and languages without retraining, and achieves zero evasion on commercial models under a systematic adaptive-attack evaluation.

Link: https://arxiv.org/abs/2605.11026
Authors: Yassin H. Rassul, Tarik A. Rashid
Affiliations: Computer Science and Engineering Department, School of Science and Engineering, University of Kurdistan Hewlêr, Erbil, Iraq; Artificial Intelligence and Innovation Centre, University of Kurdistan Hewlêr, Erbil, Iraq
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 20 pages, 5 figures. Code: this https URL

Abstract:Defenses against indirect prompt injection (IPI) in tool-using LLM agents share two structural weaknesses. First, they all attempt to prevent attacks rather than detect the compromises that slip through. Second, they have only been evaluated in English, leaving users of low-resource languages such as Kurdish and Arabic without tested protection. This paper addresses both gaps with AgentShield, a deception-based detection framework that places three layers of traps inside the agent’s tool interface: fake tools, fake credentials, and allowlisted parameters. The same trap triggers serve as high-precision labels for a self-supervised classifier. An LLM agent that follows an attacker’s hidden instruction almost always touches one of these traps, which gives both a real-time compromise signal and a zero-FP label for training a downstream detector without manual annotation. Across 176 cross-lingual attack prompts and four LLMs from three providers, and because modern LLMs already refuse most IPI attempts on their own (attack success rate = 10%), AgentShield’s job is to catch the attacks that do slip through. On commercial models, it catches 90.7%-100% of such successful attacks, with zero false alarms on 485 normal-use tests. It survives a systematic adaptive-attack evaluation with zero evasion on commercial models, and the self-supervised classifier transfers across models and languages without retraining.
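The three trap layers can be sketched as a tiny compromise check over tool calls. All names here (the `export_credentials` decoy tool, the decoy key string, the allowlisted address) are hypothetical placeholders, not AgentShield's actual interface:

```python
TRAPS = {
    "fake_tools": {"export_credentials"},                 # tools no benign flow ever calls
    "fake_credentials": {"sk-DECOY-123"},                 # planted secrets that should never leak
    "allowed_params": {"send_email": {"team@example.com"}},  # per-tool parameter allowlist
}

def compromised(tool, args):
    """Flag a compromise if the agent calls a decoy tool, passes a
    decoy credential anywhere in its arguments, or uses a parameter
    outside the tool's allowlist."""
    if tool in TRAPS["fake_tools"]:
        return True
    if any(cred in str(args) for cred in TRAPS["fake_credentials"]):
        return True
    allowed = TRAPS["allowed_params"].get(tool)
    if allowed is not None and any(a not in allowed for a in args):
        return True
    return False
```

Because benign traffic never touches a trap, any `True` result is both a real-time alarm and a clean positive label for the downstream classifier, which is the paper's zero-false-positive labeling argument in miniature.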

[NLP-110] Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

【Quick Read】: This paper addresses the quality degradation caused by the uniform intervention schedule used for controlled generation in discrete diffusion language models (DLMs), a degradation that worsens when multiple attributes are steered jointly. The key to the solution is an adaptive scheduler built on sparse-autoencoder analysis of how attributes commit during denoising: attributes differ in timing, sharpness, and magnitude (e.g., topic commits within the first 2% of steps while sentiment emerges over 20%), so interventions are concentrated on the steps where an attribute is actively forming and the rest of generation is left untouched. The scheduler's advantage over uniform scheduling is characterized in closed form by a single dispersion statistic of the commitment distribution. Across four DLMs and seven steering tasks, the method reaches up to 93% steering strength on challenging three-attribute joint control, beating the strongest baseline by 15 percentage points while preserving generation quality.

Link: https://arxiv.org/abs/2605.10971
Authors: Hanhan Zhou, Shamik Roy, Rashmi Gangadharaiah
Affiliations: AWS AI Labs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: preprint, 47 pages

Abstract:Discrete diffusion language models (DLMs) generate text by iteratively denoising all positions in parallel, offering an alternative to autoregressive models. Controlled generation methods for DLMs, imported from autoregressive models, apply uniform intervention at every denoising steps. We show this uniform schedule degrades quality, and the damage compounds when multiple attributes are steered jointly. To diagnose the failure, we train sparse autoencoders on four DLMs (124M-8B parameters) and find that different attributes commit on distinct schedules, varying in timing, sharpness, and magnitude. For instance, topic commits within the first 2% of denoising, whereas sentiment emerges gradually over 20% of the process. Consequently, uniform intervention wastes steering capacity on steps where the target attribute has already solidified or has yet to emerge. We propose a novel adaptive scheduler that concentrates interventions on the steps where an attribute is actively forming and leaves the rest of generation untouched. The cost-control trade-off admits a closed-form characterization: the advantage of adaptive over uniform scheduling is governed by a single dispersion statistic of the commitment distribution. Across four DLMs and seven steering tasks, our method achieves precise control without the degradation typical of uniform interventions. Especially on challenging simultaneous three-attribute control, it reaches up to 93% steering strength, beating the strongest baseline by up to 15% points while preserving generation quality.
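The dispersion statistic and the adaptive schedule admit a compact sketch. This is a plausible reading of the abstract, not the paper's exact formulas: dispersion is taken here as the normalized entropy deficit of a commitment distribution over denoising steps, and the schedule simply spends its intervention budget on the highest-commitment steps.

```python
import math

def dispersion(p):
    """Normalized entropy deficit of a commitment distribution:
    0 for a uniform distribution (adaptive == uniform scheduling),
    approaching 1 when all commitment falls on a single step."""
    h = -sum(q * math.log(q) for q in p if q > 0)
    return 1.0 - h / math.log(len(p))

def adaptive_schedule(p, budget):
    """Spend the intervention budget on the steps carrying the most
    commitment mass; leave the rest of generation untouched."""
    order = sorted(range(len(p)), key=lambda t: -p[t])
    return sorted(order[:budget])

# Hypothetical commitment profiles over 10 denoising steps:
topic     = [0.7, 0.2, 0.05, 0.05, 0, 0, 0, 0, 0, 0]  # commits early and sharply
sentiment = [0.1] * 10                                  # emerges gradually
```

Under this reading, a sharply committing attribute like topic has high dispersion, so concentrating interventions on its first few steps pays off, while a flat profile like sentiment gains nothing from adaptivity.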

[NLP-111] FERA: Uncertainty-Aware Federated Reasoning for Large Language Models

【Quick Read】: This paper studies how to improve multi-step LLM reasoning in a federated setting, where high-quality demonstrations are distributed across organizations that cannot centralize training or share raw data due to regulatory, proprietary, or institutional constraints. The central challenge is that client reliability is query-dependent while the server cannot inspect client data to judge which contributions are trustworthy. The key to the solution is Uncertainty-Aware Federated Reasoning (FERA), a training-free framework based on iterative server-client co-refinement: in each communication round, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning. Within each round, Uncertainty-Aware Self-Critique Aggregation (UA-SCA) resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification; rather than discarding low-quality traces, it revises their flawed reasoning steps to recover useful information. The paper also provides theoretical guarantees that the iterative protocol converges and that uncertainty-aware weighting accelerates convergence, yielding progressively higher accuracy without training or raw-data sharing.

Link: https://arxiv.org/abs/2605.10082
Authors: Ruhan Wang, Chengkai Huang, Zhiyong Wang, Junda Wu, Rui Wang, Tong Yu, Julian McAuley, Lina Yao, Dongruo Zhou
Affiliations: Indiana University; The University of New South Wales; The Chinese University of Hong Kong; University of California San Diego; Adobe Research
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 44 pages, 8 figures

Abstract:Large language models (LLMs) exhibit strong reasoning capabilities when guided by high-quality demonstrations, yet such data is often distributed across organizations that cannot centralize it due to regulatory, proprietary, or institutional constraints. We study federated reasoning, where a server improves multi-step reasoning by coordinating with heterogeneous clients holding private demonstrations, without centralized training or raw data sharing. The key challenge is that client reliability is query-dependent, while the server cannot inspect client data to determine which contributions are trustworthy. To address this, we propose Uncertainty-Aware Federated Reasoning (FERA), a training-free framework based on iterative server-client co-refinement. Across communication rounds, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning. Within each round, Uncertainty-Aware Self-Critique Aggregation (UA-SCA) resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification. Rather than simply discarding low-quality traces, UA-SCA revises flawed reasoning steps to recover useful information. We provide theoretical guarantees showing that the proposed iterative protocol converges and that uncertainty-aware weighting accelerates convergence. Experiments on multiple reasoning benchmarks show that FERA consistently outperforms both federated training and training-free baselines, achieving progressively higher accuracy across rounds while maintaining communication and computational efficiency.
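A minimal sketch of the uncertainty-weighted flavor of aggregation that UA-SCA builds on. This toy only does trust-weighted voting; the actual method also performs cross-client verification and revises flawed reasoning steps, which are omitted here, and the confidence-as-(1 − uncertainty) weighting is an assumption:

```python
from collections import defaultdict

def aggregate(client_answers):
    """Each client reports an answer plus a self-estimated uncertainty
    in [0, 1); the server weights each vote by its confidence
    (1 - uncertainty) and returns the answer with the largest total."""
    scores = defaultdict(float)
    for answer, uncertainty in client_answers:
        scores[answer] += 1.0 - uncertainty
    return max(scores, key=scores.get)

# Two confident clients say "42"; two uncertain clients say "41".
votes = [("42", 0.1), ("41", 0.7), ("42", 0.3), ("41", 0.8)]
```

A plain majority vote would tie here, whereas the uncertainty-aware weights resolve the conflict in favor of the confident clients.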

[NLP-112] MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media

【Quick Read】: This paper addresses "instruction-induced label collapse", where LLMs handling closed-set instructions in low-resource languages systematically prefer fallback labels (Other, Neutral, No) and under-detect minority classes, and exposes the resulting "label agreement illusion". The key to the solution is building and releasing MultiSoc-4D, a benchmark of 58K+ Bengali social media comments from six sources, annotated along four human-calibrated dimensions: category, sentiment, hate speech, and sarcasm. Using a structured pipeline in which ChatGPT, Gemini, Claude, and Grok each annotate separate partitions while sharing a 20% validation set, the authors systematically diagnose LLM behavior and statistically validate the collapse (e.g., via Fleiss' Kappa), positioning the benchmark as a diagnostic tool for annotation bias in low-resource-language NLP.

Link: https://arxiv.org/abs/2605.06940
Authors: Souvik Pramanik, S.M. Riaz Rahman Antu, Shak Mohammad Abyad, Md. Ibrahim Khalil, Md. Shahriar Hussain
Affiliations: North South University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 14 figures, 13 tables

Abstract:Annotation automation via Large Language Models (LLMs) is the core approach for scaling NLP datasets; however, LLM behavior with respect to closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media dataset benchmark, which contains 58K+ social media comments from six sources annotated along four dimensions: category, sentiment, hate speech, and sarcasm. By employing a structured pipeline where ChatGPT, Gemini, Claude, and Grok individually annotate separate partitions, while sharing a common validation set of 20%, we diagnose LLM behavior systematically. We discover a prevalent phenomenon called "instruction-induced label collapse", wherein LLMs show a systematic preference towards fallback labels (Other, Neutral, No), leading to high agreement rates but under-detection of minority categories. For example, we find that LLMs failed to detect 79% and 75% of instances with hateful and sarcastic content compared to a human-calibrated reference. Furthermore, we prove that it represents a "label agreement illusion", statistically validated via almost null Fleiss' Kappa (κ ≈ −0.001) on sarcasm detection. Across 40+ LLMs, we benchmark this annotation bias propagation within the training pipeline, regardless of architectural differences. We release MultiSoc-4D as a diagnostic benchmark for annotation biases in Bengali NLP.

Information Retrieval

[IR-0] Task-Adaptive Embedding Refinement via Test-time LLM Guidance

【Quick Read】: This paper addresses the poor performance of embedding models on zero-shot search and classification tasks whose task-specific constraints the embeddings cannot capture. The key to the solution is an LLM-guided query refinement paradigm: feedback from a generative LLM on a small set of documents is used to adjust the embedding of the user query, letting the embedding space adapt in real time to the target task. The refined queries improve ranking quality and induce clearer binary separation across the corpus, substantially widening the range of settings where embedding models can be deployed effectively without corpus-scale LLM pipelines.

Link: https://arxiv.org/abs/2605.12487
Authors: Ariel Gera, Shir Ashury-Tahan, Gal Bloch, Ohad Eytan, Assaf Toledo
Affiliations: IBM Research
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of a user query using feedback from a generative LLM on a small set of documents, enabling embeddings to adapt in real time to the target task. We conduct extensive experiments with state-of-the-art text embedding models across a diverse set of challenging search and classification benchmarks. Empirical results indicate that LLM-guided query refinement yields consistent gains across all models and datasets, with relative improvements of up to +25% in literature search, intent detection, key-point matching, and nuanced query-instruction following. The refined queries improve ranking quality and induce clearer binary separation across the corpus, enabling the embedding space to better reflect the nuanced, task-specific constraints of each ad-hoc user query. Importantly, this expands the range of practical settings in which embedding models can be effectively deployed, making them a compelling alternative when costly LLM pipelines are not viable at corpus-scale. We release our experimental code for reproducibility, at this https URL.
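The abstract does not spell out the update rule, so a Rocchio-style sketch conveys the idea of refining a query embedding from LLM relevance feedback on a few documents. The `step` parameter, the centroid update, and the binary relevance labels are assumptions for illustration, not the paper's method:

```python
def refine_query(q, docs, labels, step=0.5):
    """Nudge the query embedding toward the centroid of documents the
    LLM judged relevant (label 1) and away from the centroid of
    documents it judged irrelevant (label 0)."""
    def centroid(vecs):
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]
    pos = centroid([d for d, l in zip(docs, labels) if l == 1])
    neg = centroid([d for d, l in zip(docs, labels) if l == 0])
    return [qi + step * (p - n) for qi, p, n in zip(q, pos, neg)]

# A blank query pulled toward the relevant doc [1, 0] and away from [0, 1]:
refined = refine_query([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1, 0])
```

The refined vector moves toward the relevant region of the embedding space, which is the mechanism by which ranking quality and corpus separation can improve at test time.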

[IR-1] ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

【Quick Read】: This paper addresses catastrophic forgetting when LLMs are fine-tuned for generative retrieval (GenRetrieval): the fine-tuned model loses its general language-reasoning abilities, and the forgetting correlates with the parameter distance between the fine-tuned and original models. The key to the solution is ORBIT, which actively tracks the distance between the fine-tuned and initial model weights and applies a weight-averaging strategy to constrain model drift whenever that distance exceeds a maximum threshold, retaining general text capabilities while preserving retrieval performance.

Link: https://arxiv.org/abs/2605.12419
Authors: Neha Verma, Nikhil Mehta, Shao-Chuan Wang, Naijing Zhang, Alicia Tsai, Li Wei, Lukasz Heldt, Lichan Hong, Ed Chi, Xinyang Yi
Affiliations: Johns Hopkins University; Google DeepMind; Google
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates and addresses this challenge in the context of the Generative Retrieval (GenRetrieval) task. During GenRetrieval fine-tuning, we find this forgetting occurs rapidly and correlates with the distance between the fine-tuned and original model parameters. Given these observations, we propose ORBIT, a novel approach that actively tracks the distance between fine-tuned and initial model weights, and uses a weight averaging strategy to constrain model drift during GenRetrieval fine-tuning when this inter-model distance exceeds a maximum threshold. Our results show that ORBIT retains substantial text and retrieval performance by outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.
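ORBIT's distance-triggered weight averaging can be sketched in a few lines. The L2 metric and the fixed mixing coefficient `alpha` are assumptions for illustration; the paper's exact thresholding and averaging strategy may differ:

```python
def orbit_step(w_fine, w_init, max_dist, alpha=0.5):
    """If the fine-tuned weights have drifted farther than max_dist
    (L2 distance) from the initial weights, average them back toward
    the initial model; otherwise leave them unchanged."""
    dist = sum((a - b) ** 2 for a, b in zip(w_fine, w_init)) ** 0.5
    if dist <= max_dist:
        return w_fine
    return [alpha * a + (1 - alpha) * b for a, b in zip(w_fine, w_init)]

# Drift of 5.0 exceeds the threshold of 2.0, so the weights are pulled back:
pulled = orbit_step([3.0, 4.0], [0.0, 0.0], max_dist=2.0)
```

Applied periodically during fine-tuning, this keeps the model inside a ball around its initialization, the region where its foundational language capabilities survive.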

[IR-2] Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring ACL2026

【Quick Read】: This paper addresses the inability of existing question-difficulty estimators for LLM question answering (readability formulas, retrieval signals, popularity statistics) to capture the reasoning challenges modern LLMs actually face. The key to the solution is Q-DAPS (Question Difficulty based on Answer Plausibility Scores), which estimates difficulty as the entropy of plausibility scores over candidate answers, using entropy to measure the model's uncertainty across candidates and thus better reflect a question's reasoning complexity. Q-DAPS outperforms baselines on TriviaQA, NQ, MuSiQue, and QASC and is robust to hyperparameter choices, question types, and plausibility-estimation paradigms.

Link: https://arxiv.org/abs/2605.12398
Authors: Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
Affiliations: University of Innsbruck
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at ACL 2026

Abstract:Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce Q-DAPS (Question Difficulty based on Answer Plausibility Scores) method, a novel approach that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS across four prominent QA datasets-TriviaQA, NQ, MuSiQue, and QASC-demonstrating that it consistently outperforms baselines. Moreover, Q-DAPS shows strong robustness across hyperparameter variations and question types. Extensive ablation studies further show that Q-DAPS remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations further confirm strong alignment between Q-DAPS’s difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
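A minimal sketch of the entropy computation behind Q-DAPS, assuming plausibility scores are simply normalized into a distribution over candidate answers (the paper's scoring of candidates may be more involved):

```python
import math

def qdaps_difficulty(plausibilities):
    """Normalize candidate-answer plausibility scores into a
    distribution and return its entropy: a flat distribution (the
    model is torn between candidates) means a harder question."""
    total = sum(plausibilities)
    probs = [s / total for s in plausibilities]
    return -sum(p * math.log(p) for p in probs if p > 0)

easy = qdaps_difficulty([0.97, 0.01, 0.01, 0.01])  # one clearly dominant answer
hard = qdaps_difficulty([0.30, 0.25, 0.25, 0.20])  # near-uniform plausibility
```

Entropy is maximal (log of the number of candidates) when the model finds all answers equally plausible, so the score is naturally bounded and comparable across questions with the same candidate count.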

[IR-3] Context Convergence Improves Answering Inferential Questions SIGIR2026

【Quick Read】: This paper addresses the underexplored weakness of LLMs on inferential questions, whose answers must be derived rather than directly retrieved, and in particular how passage structure and quality affect such reasoning. The key to the solution is convergence, a measure of how effectively sentences (hints) eliminate incorrect answers, used as the criterion for constructing passages: selecting high-convergence sentences and ordering them by descending convergence substantially improves LLM answer accuracy.

Link: https://arxiv.org/abs/2605.12370
Authors: Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
Affiliations: University of Innsbruck
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at SIGIR 2026

Abstract:While Large Language Models (LLMs) are widely used in open-domain Question Answering (QA), their ability to handle inferential questions-where answers must be derived rather than directly retrieved-remains still underexplored. This study investigates how the structure and quality of passages influence LLM performance on such questions. We focus on convergence, a measure of how effectively sentences (hints) eliminate incorrect answers, as a criterion for constructing passages. Using subsets of the TriviaHG dataset, we form passages by combining sentences with varying convergence levels and evaluate six LLMs of different sizes and architectures. Our results show that passages built from higher convergence sentences lead to substantially better answer accuracy than those selected by cosine similarity, indicating that convergence captures meaningful relevance for inferential reasoning. Additionally, ordering sentences by descending convergence slightly improves performance, suggesting that LLMs tend to prioritize earlier, information-rich cues. These findings highlight convergence as a practical signal for guiding passage construction and analyzing inferential reasoning behavior in LLMs.
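Treating convergence as the fraction of wrong candidate answers a hint eliminates, passage construction reduces to a sort. The representation of hints as (text, eliminated-set) pairs is an assumption made for this illustration:

```python
def convergence(eliminated, n_wrong):
    """Convergence of a hint: the fraction of incorrect candidate
    answers it rules out."""
    return len(eliminated) / n_wrong

def build_passage(hints, n_wrong):
    """Order hints by descending convergence, mirroring the finding
    that information-rich cues should appear first in the passage."""
    return sorted(hints, key=lambda h: -convergence(h[1], n_wrong))

# Hypothetical hints over 4 wrong candidates {a, b, c, d}:
hints = [("weak hint", {"a"}),
         ("strong hint", {"a", "b", "c"}),
         ("mid hint", {"a", "b"})]
passage = build_passage(hints, n_wrong=4)
```

Front-loading the strong hint matches the paper's observation that LLMs prioritize earlier cues when reasoning over a passage.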

[IR-4] MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM -Based Biomedical Question Answering

【Quick Read】: This paper addresses the lack of biomedical multi-hop reasoning benchmarks that can distinguish genuine reasoning from pattern matching while resisting saturation and contamination. The key to the solution is MedHopQA, a disease-centered multi-hop benchmark of 1,000 expert-curated QA pairs, each requiring the integration of information from two distinct Wikipedia articles and answered in open-ended free text. Construction combines human annotation, triage, iterative verification, and LLM-as-a-judge validation, and the 1,000 scored questions are embedded (with answers withheld) within a publicly downloadable set of 10,000 questions on a CodaBench leaderboard to reduce leaderboard gaming and contamination. Ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy support both lexical and concept-level evaluation, making MedHopQA both a benchmark and a reusable framework with compositional reasoning, saturation resistance, and contamination resistance as core design constraints.

Link: https://arxiv.org/abs/2605.12361
Authors: Rezarta Islamaj, Robert Leaman, Joey Chan, Nicholas Wan, Qiao Jin, Natalie Xie, John Wilbur, Shubo Tian, Lana Yeganova, Po-Ting Lai, Chih-Hsuan Wei, Yifan Yang, Yao Ge, Qingqing Zhu, Zhizheng Wang, Zhiyong Lu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.

[IR-5] EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records

【Quick Read】: This paper addresses the challenge of predictive modeling over electronic health records (EHR) with long trajectories, heterogeneous events, temporal irregularity, and clinical context of varying relevance; existing methods rely on fixed windows or uniform aggregation and can lose important clinical signals. The key to the solution is a prototype-guided retrieval module that acts as an alignment mechanism, dynamically estimating the relevance of historical data chunks to a given prediction task and guiding the model toward the most informative clinical context. Built on this module, the retrieval-augmented foundation model EHR-RAGp exploits long-range clinical history efficiently.

Link: https://arxiv.org/abs/2605.12335
Authors: Saeed Shurrab, Mariam Al-Omari, Dana El Samad, Farah E. Shamout
Affiliations: New York University; New York University Abu Dhabi
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Retrieval Augmented EHR Foundation Model

Abstract:Electronic Health Records (EHR) contain rich longitudinal patient information and are widely used in predictive modeling applications. However, effectively leveraging historical data remains challenging due to long trajectories, heterogeneous events, temporal irregularity, and the varying relevance of past clinical context. Existing approaches often rely on fixed windows or uniform aggregation, which can obscure clinically important signals. In this work, we introduce EHR-RAGp, a retrieval-augmented foundation model that dynamically integrates the most relevant patient history across diverse clinical event types. We propose a prototype-guided retrieval module that acts as an alignment mechanism and estimates the relevance of retrieved historical chunks with respect to a given prediction task, guiding the model towards the most informative context. Across multiple clinical prediction tasks, EHR-RAGp consistently outperforms state-of-the-art EHR foundation models and transformer-based baselines. Furthermore, integrating EHR-RAGp with existing clinical foundation models yields substantial performance gains. Overall, EHR-RAGp provides a scalable and efficient framework for leveraging long-range clinical context to improve downstream performance.

[IR-6] Overview of the MedHopQA track at BioCreative IX: track description, participation, and evaluation of systems for multi-hop medical question answering

【Quick Read】: This paper addresses the significant challenge of multi-hop question answering in the biomedical domain, where systems must integrate information from multiple sources to answer complex questions. The key to the solution is the BioCreative IX MedHopQA shared task and its novel dataset of 1,000 high-quality QA pairs covering diseases, genes, and chemicals, with particular emphasis on rare diseases; each question requires two-hop reasoning across two distinct Wikipedia pages. By organizing the challenge and evaluating 48 submissions from 13 teams, the paper identifies retrieval-augmented generation (RAG) and related retrieval strategies as critical for strong performance, while concept-level evaluation (the MedCPT score) improves assessment when correct answers differ in surface form.

Link: https://arxiv.org/abs/2605.12313
Authors: Rezarta Islamaj, Joey Chan, Robert Leaman, Jongmyung Jung, Hyeongsoon Hwang, Quoc-An Nguyen, Hoang-Quynh Le, Harikrishnan Gurushankar Saisudha, Ganesh Chandrasekar, Rustam R. Taktashov, Nadezhda Yu. Bizyukova, Sofia I. R. Conceição, Paulo R. C. Lopes, Reem Abdel Salam, Mary Adewunmi, Zhiyong Lu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark in multi-hop reasoning for large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: this https URL and benchmark this https URL

[IR-7] BatchBench: Toward a Workload-Aware Benchmark for Autoscaling Policies in Big Data Batch Processing – A Proposed Framework

【Quick Read】: This paper addresses the lack of a shared benchmark for evaluating autoscaling policies in cloud-native big data processing: rule-based heuristics, learned controllers, and LLM agents all report results on different workloads, baselines, and cost models, making cross-paper comparison effectively impossible. The key to the solution is BatchBench, an open benchmarking framework whose design comprises: (1) a workload taxonomy of six batch-processing classes synthesized from published benchmarks and public cluster traces; (2) a parameterized workload generator validated with two-sample Kolmogorov-Smirnov tests and earth-mover distance; (3) a five-axis evaluation harness covering cost, SLA attainment, scaling responsiveness, scaling thrash, and decision interpretability, with first-class accounting for LLM inference cost; and (4) a standardized agent interface that lets LLM-based and reinforcement-learning autoscalers be evaluated alongside rule-based controllers through a single API, putting the paradigms on equal experimental footing.

Link: https://arxiv.org/abs/2605.12272
Authors: Venkata Krishna Prasanth Budigi, Siri Chandana Sirigiri
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Databases (cs.DB)
Comments: 5 pages, 1 table, position paper. Reference implementation in active development. Empirical follow-up to appear

Abstract:Autoscaling has become a baseline expectation for cloud-native big data processing, and the design space has expanded beyond rule-based heuristics to include learned controllers and, most recently, large language model (LLM) agents. Yet despite a growing body of work spanning these paradigms, the community lacks a shared benchmark for comparing them. Existing evaluations rely on synthetic TPC-style queries, vendor blog posts with proprietary baselines, or narrow trace replays. Each new policy reports favorable numbers against a different baseline, on a different workload, with a different cost model, making cross-paper comparison effectively impossible. This is a position paper. We propose BatchBench, an open benchmarking framework designed to place rule-based, learned, and agentic autoscaling policies on equal experimental footing. The contribution is the design of the framework, not empirical results. We contribute: (1) a workload taxonomy of six batch processing classes synthesized from published autoscaling benchmarks and publicly released cluster traces; (2) the design of a parameterized workload generator with a validation methodology based on two-sample Kolmogorov-Smirnov and earth-mover distance; (3) a five-axis evaluation harness specification covering cost, SLA attainment, scaling responsiveness, scaling thrash, and decision interpretability, with first-class accounting for LLM inference cost; and (4) a standardized agent interface that lets LLM-based and reinforcement-learning autoscalers be evaluated alongside rule-based controllers with a single API. We discuss the expected evaluation surface, identify open research questions the framework is designed to answer, and outline a roadmap for the empirical paper that will follow. BatchBench’s reference implementation is in active development and will be released as open source.
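The two-sample Kolmogorov-Smirnov statistic named in the validation methodology is simple to compute from scratch; the self-contained sketch below is for illustration (BatchBench itself would presumably use a library implementation such as `scipy.stats.ks_2samp`):

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples, e.g. a synthetic
    workload vs. a real cluster trace."""
    a, b = sorted(a), sorted(b)

    def cdf(xs, x):
        return sum(1 for v in xs if v <= x) / len(xs)

    points = sorted(set(a) | set(b))
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)
```

A statistic near 0 means the generated workload's distribution (of job sizes, inter-arrival times, etc.) closely tracks the reference trace; near 1 means the two barely overlap, so the generator parameters need revisiting.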

[IR-8] Unlocking Crowdsourcing for Ontology Matching Validation

【速读】:该论文旨在解决大语言模型(LLMs)驱动下的本体匹配(OM)系统产生大量映射结果后,传统依赖领域专家进行验证的方式变得效率低下且成本高昂的问题,提出利用众包(crowdsourcing)实现可扩展的OM验证。解决方案的关键在于设计了一种新型众包系统,通过引入三种领域特定机制——差异可信度(differential trustworthiness)、一致性预填充(coherence pre-filling)和时变信念(time-dependent beliefs)——来保证众包标注的质量,从而支持与现有OM系统集成,实现人在回路(human-in-the-loop)的验证流程。

链接: https://arxiv.org/abs/2605.12226
作者: Zhangcheng Qiang
机构: Australian National University(澳大利亚国立大学)
类目: Information Retrieval (cs.IR)
备注: 4 pages, 1 figure

点击查看摘要

Abstract:Recent advances in large language models (LLMs) pose new challenges for ontology matching (OM). While OM systems built on LLMs have shown remarkable capabilities in discovering more mappings, traditional OM validation that relies on domain experts has become overwhelming. In this study, we explore the use of crowdsourcing for OM validation and introduce a novel crowdsourcing system. We propose three domain-specific mechanisms, namely differential trustworthiness, coherence pre-filling, and time-dependent beliefs, to ensure the quality of crowdsourcing for OM validation. We demonstrate that our crowdsourcing system can be integrated with state-of-the-art OM systems to enable human-in-the-loop validation. Two real-world use cases illustrate the effectiveness of our crowdsourcing system.
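
其中"差异可信度"与"时变信念"两种机制可以用一个极简的加权投票聚合来示意。以下权重形式、半衰期等均为本文假设,并非原系统的实现:

```python
def aggregate_votes(votes, half_life_days=30.0):
    """Trust- and time-weighted belief that one candidate mapping is valid.

    votes: list of (is_valid, worker_trust, age_days) tuples.
    Each vote is weighted by its worker's trust (differential trustworthiness),
    decayed exponentially with age (time-dependent beliefs).
    """
    decay = lambda age: 0.5 ** (age / half_life_days)
    weight = lambda trust, age: trust * decay(age)
    total = sum(weight(t, a) for _, t, a in votes)
    yes = sum(weight(t, a) for ok, t, a in votes if ok)
    return yes / total if total else 0.5  # 0.5 = no evidence either way

belief = aggregate_votes([
    (True, 0.9, 0.0),    # trusted worker, fresh vote
    (True, 0.6, 60.0),   # older vote, heavily decayed
    (False, 0.4, 10.0),  # low-trust disagreement
])
```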

[IR-9] Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models CVPR2026

【速读】:该论文旨在解决电子商务中生成逼真且用户偏好的个性化广告这一关键挑战。现有方法通过由点击率(Click-Through-Rate, CTR)驱动的多个独立模型分别生成有吸引力的图像或文本广告,但其流程缺乏跨模态感知能力,且CTR仅反映平均用户偏好,无法实现真正个性化。解决方案的关键在于:设计一个统一的广告生成模型(Uni-AdGen),采用单一自回归框架联合生成广告图像和文本;通过引入前景感知模块(Foreground Perception Module)和指令调优(Instruction Tuning)增强生成内容的真实感;进一步配备从粗到细的偏好理解模块(Coarse-to-Fine Preference Understanding Module),从嘈杂的多模态历史行为中有效捕捉用户兴趣从而驱动个性化生成;此外,构建了首个大规模个性化广告图文数据集PAd1M并引入产品背景相似度(Product Background Similarity, PBS)度量来支持训练与评估,最终通过大量实验验证了该方法在通用与个性化广告生成上均优于基线模型。

链接: https://arxiv.org/abs/2605.12138
作者: Yexing Xu,Wei Feng,Shen Zhang,Haohan Wang,Yuxin Qin,Yaoyu Li,Ao Ma,Yuhao Luo,Lu Wang,Xudong Ren,Haoran Wang,Run Ling,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Longguang Wang,Yulan Guo
机构: Sun Yat-Sen University(中山大学); JD.COM(京东); Northeastern University(东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 22 pages, 19 figures, CVPR 2026

点击查看摘要

Abstract:Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. Our project is available at this https URL.

[IR-10] Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting Hybrid Search and Cross-Encoder Reranking SEMEVAL2026

【速读】:该论文旨在解决SemEval-2026任务A(检索)中上下文依赖的后续问题(context-dependent follow-up questions)的检索难题,即如何将多轮对话中的隐式意图转化为独立查询并高效获取相关文档。解决方案的关键在于一个三阶段流水线:第一阶段使用LoRA微调的Qwen 2.5 7B模型将后续问题重写为独立查询,其中针对不同领域采用差异化的温度参数(temperature tuning)——技术领域使用确定性解码(低温度),通用领域使用受控随机性(高温度),这是提升性能的核心策略;第二阶段通过倒数排序融合(Reciprocal Rank Fusion, RRF)融合BM25稀疏检索与稠密检索结果;第三阶段采用BGE-reranker-v2-m3交叉编码器对候选结果进行重排序。该方法在测试集上取得了nDCG@5为0.531的成绩,排名第8/38,超过基线10.7%。

链接: https://arxiv.org/abs/2605.12028
作者: David-Maximilian Caraman,Gheorghe Cosmin Silaghi
机构: Babeş-Bolyai University (巴比什-博雅依大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at SemEval2026, task 8: MTRAGEval

点击查看摘要

Abstract:We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-fine-tuned Qwen 2.5 7B model that transforms context-dependent follow-up questions into standalone queries, (2) hybrid BM25 and dense retrieval combined through Reciprocal Rank Fusion, and (3) cross-encoder reranking with BGE-reranker-v2-m3. On the official test set, the system achieves nDCG@5 of 0.531, ranking 8th out of 38 participating systems and 10.7% above the organizer baseline. Development comparisons reveal that domain-specific temperature tuning for query generation, where technical domains benefit from deterministic decoding and general domains from controlled randomness, provides consistent gains, while more complex strategies such as domain-aware prompting and multi-query expansion degrade performance.
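
流水线第二阶段的Reciprocal Rank Fusion是标准算法,按常用公式 score(d) = Σ 1/(k + rank_i(d)) 即可实现(k=60为社区常用默认值,示例文档ID为虚构,并非该系统代码):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d)).

    rankings: list of ranked doc-id lists (best first); k=60 is a common default.
    Returns doc ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2", "d3"]    # sparse retrieval ranking
dense = ["d3", "d1", "d4"]   # dense retrieval ranking
fused = reciprocal_rank_fusion([bm25, dense])
```

RRF只依赖名次而非原始分数,因此无需对BM25与稠密检索的得分做归一化。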

[IR-11] From Trajectories to Phenotypes: Disease Progression as Structural Priors for Multi-organ Imaging Representation Learning

【速读】:该论文旨在解决影像衍生表型(IDP)作为静态快照无法捕获疾病随时间演变的动态信息的问题,而纵向电子健康记录则通过时间依赖性诊断事件和共病结构编码疾病轨迹,但二者在信息上存在部分共享但互补的关系。解决方案的关键在于提出了一种轨迹感知蒸馏框架(trajectory-aware distillation framework),通过将基于纵向诊断序列训练的群体规模生成式疾病轨迹Transformer产生的受试者层面嵌入,以几何保持对齐方式监督器官级IDP编码器的表示学习,从而将疾病轨迹的结构知识转移至IDP表征中;此外,下游任务中还可通过交叉注意力机制融合轨迹与影像表示。这一方法在UK Biobank的159种疾病上验证了有效性,尤其对低患病率疾病提升显著,表明生成式疾病模型可作为有限数据影像模态的结构先验,提升在现实队列约束下的鲁棒性。

链接: https://arxiv.org/abs/2605.11958
作者: Zian Wang,Lizhen Lan,Guangming Wang,Haosen Zhang,Minxuan Xu,Qing Li,Tianxing He,Mo Yang,Wenyue Mao,Yajing Zhang,Yan Li,Chengyan Wang
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Imaging-derived phenotypes (IDPs) summarize multi-organ physiology but provide only static snapshots of diseases that evolve over time. In contrast, longitudinal electronic health records encode disease trajectories through temporal dependencies among past diagnosis events and comorbidity structure. We hypothesize that IDPs and disease trajectories contain partially shared disease-relevant structure. We propose a trajectory-aware distillation framework that transfers structural knowledge from a generative disease trajectory Transformer into an organ-wise IDP encoder. A population-scale trajectory model trained on longitudinal diagnosis sequences produces subject-level embeddings that supervise IDP representation learning via geometry-preserving alignment. During downstream prediction, trajectory and imaging representations can also be fused via cross-attention. Across 159 diseases in the UK Biobank cohort, trajectory-aware pretraining consistently improves both discrimination (AUC) and time-to-onset prediction (MAE), with the largest gains for low-prevalence diseases. Similarity relationships in IDP embedding space also align with those in trajectory space, providing supportive evidence for partially aligned representation geometry. These results suggest that population-scale generative disease models can serve as structural priors for data-limited imaging modalities, improving robustness under realistic cohort constraints.
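
摘要中的"几何保持对齐"可以有多种实现方式;下面是其中一种常见做法的示意(对齐学生与教师两套嵌入的余弦相似度矩阵),并非论文的确切损失函数:

```python
import numpy as np

def cosine_sim_matrix(X):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def geometry_alignment_loss(student, teacher):
    """MSE between the pairwise cosine-similarity matrices of two embedding
    spaces: the student (IDP encoder) is pushed to preserve the teacher's
    (trajectory model's) relational geometry, without forcing identical
    coordinates."""
    return float(np.mean((cosine_sim_matrix(student) - cosine_sim_matrix(teacher)) ** 2))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random rotation
rotated = teacher @ Q                            # rotations preserve all angles
loss_rotated = geometry_alignment_loss(rotated, teacher)
loss_random = geometry_alignment_loss(rng.normal(size=(8, 16)), teacher)
```

旋转后的嵌入损失接近0,说明该损失只约束相对几何关系而非具体坐标。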

[IR-12] On the LSH Distortion of Ulam and Cayley Similarities

【速读】:该论文试图解决Ulam相似度(Ulam similarity)和Cayley相似度(Cayley similarity)这两种在n元排列上常用的相似度度量是否具有低失真(distortion)的局部敏感哈希(LSH)方案的问题。LSH是用于加速最近邻搜索的基本工具,但并非所有相似函数都能实现精确的LSH,而LSH失真度量了相似函数与可精确构造LSH方案之间的乘法逼近程度。论文的关键解决方案在于为Ulam相似度证明了一个次线性的LSH失真上界O(n / √log n)与一个下界Ω(n^0.12),从而表明其失真远优于线性;同时为Cayley相似度确定了精确的Θ(n)失真,验证了其失真本质上是线性的。这些结果通过构造具体的哈希函数族以及信息论下界论证得出,揭示了两种相似度在可哈希性(hashability)上的本质差异。

链接: https://arxiv.org/abs/2605.11921
作者: Flavio Chierichetti,Mirko Giacchini,Ravi Kumar,Erasmo Tani
机构: Reddit(红迪); Sapienza University of Rome (罗马大学); Google(谷歌)
类目: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Locality-sensitive hashing (LSH) has found widespread use as a fundamental primitive, particularly to accelerate nearest neighbor search. An LSH scheme for a similarity function S: X × X → [0,1] is a distribution over hash functions on X with the property that the probability of collision of any two elements x, y ∈ X is exactly equal to S(x,y). However, not all similarity functions admit exact LSH schemes. The notion of LSH distortion measures how multiplicatively close a similarity function is to having an LSH scheme. In this work, we study the LSH distortion of the Ulam and Cayley similarities, which are popular similarity measures on permutations of n elements. We show that the Ulam similarity admits a sublinear LSH distortion of O(n/√log n); we also prove a lower bound of Ω(n^0.12) on the best LSH distortion achievable. On the other hand, we show that the LSH distortion of the Cayley similarity is Θ(n).
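
Ulam相似度本身可通过最长递增子序列(LIS)在O(n log n)时间内计算;下面是一个示意实现,按LIS/n归一化到[0,1](这是一种常见约定,论文的精确归一化方式可能不同):

```python
from bisect import bisect_left

def lis_length(seq):
    """Longest increasing subsequence via patience sorting, O(n log n)."""
    tails = []
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

def ulam_similarity(p, q):
    """Ulam distance between permutations p, q of {0..n-1} is n - LCS(p, q);
    for permutations the LCS equals an LIS after relabelling q by p's
    positions. Dividing by n gives a similarity in [0, 1]."""
    pos = {v: i for i, v in enumerate(p)}
    relabelled = [pos[v] for v in q]
    return lis_length(relabelled) / len(p)

s_same = ulam_similarity([0, 1, 2, 3], [0, 1, 2, 3])      # identical permutations
s_rev = ulam_similarity([0, 1, 2, 3], [3, 2, 1, 0])       # fully reversed
```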

[IR-13] RecRM-Bench: Benchmarking Multidimensional Reward Modeling for Agentic Recommender Systems

【速读】:该论文旨在解决当前大型语言模型(LLM)智能体集成至推荐系统时,因依赖单一维度的结果奖励(outcome-based rewards)而忽略指令遵循、复杂意图理解等关键中间能力,且缺乏标准化基准来支持多维奖励设计的问题。其解决方案的关键在于:构建了迄今为止最大且最全面的智能体推荐系统基准RecRM-Bench,该基准包含超过100万条结构化条目,覆盖指令遵循、事实一致性、查询-项目相关性及细粒度用户行为预测四个核心评估维度,从而为训练复杂的奖励模型提供了基础数据集;同时,提出了一个系统性的框架,用于构建多维奖励模型(multi-dimensional reward models)并整合混合奖励函数(hybrid reward function),为开发可靠且高能力的智能体推荐系统奠定了坚实基础。

链接: https://arxiv.org/abs/2605.11874
作者: Wenwen Zeng,Jinhui Zhang,Hao Chen,Zhaoyu Hu,Yongqi Liang,Jiajun Chai,Dengcan Liu,Zhenfeng Liu,Shurui Yan,Minglong Xue,Xiaohan Wang,Wei Lin,Guojun Yin
机构: Meituan(美团); Fudan University(复旦大学); Nankai University(南开大学); North China University of Technology(北方工业大学); University of Science and Technology of China(中国科学技术大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The integration of Large Language Model (LLM) agents is transforming recommender systems from simple query-item matching towards deeply personalized and interactive recommendations. Reinforcement Learning (RL) provides an essential framework for the optimization of these agents in recommendation tasks. However, current methodologies remain limited by a reliance on single dimensional outcome-based rewards that focus exclusively on final user interactions, overlooking critical intermediate capabilities, such as instruction following and complex intent understanding. Despite the necessity for designing multi-dimensional reward, the field lacks a standardized benchmark to facilitate this development. To bridge this gap, we introduce RecRM-Bench, the largest and most comprehensive benchmark to date for agentic recommender systems. It comprises over 1 million structured entries across four core evaluation dimensions: instruction following, factual consistency, query-item relevance, and fine-grained user behavior prediction. By supporting comprehensive assessment from syntactic compliance to complex intent grounding and preference modeling, RecRM-Bench provides a foundational dataset for training sophisticated reward models. Furthermore, we propose a systematic framework for the construction of multi-dimensional reward models and the integration of a hybrid reward function, establishing a robust foundation for developing reliable and highly capable agentic recommender systems. The complete RecRM-Bench dataset is publicly available at this https URL.

[IR-14] Very Efficient Listwise Multimodal Reranking for Long Documents ICML2026

【速读】:该论文旨在解决视觉中心检索与多模态检索增强生成(M-RAG)中列表式重排序(listwise reranking)的计算效率瓶颈,即基于VLM的重排序器因长视觉token序列和多步自回归解码(autoregressive decoding)导致的推理延迟过高问题。解决方案的关键在于ZipRerank提出的双重优化:一是通过轻量级查询-图像早期交互(query-image early interaction)机制压缩输入长度,避免冗余视觉信息;二是采用单次前向传播对所有候选进行评分(single forward pass scoring),替代传统的自回归解码,从而显著降低计算开销。此外,两阶段训练策略(列表式文本预训练与VLM教师蒸馏的软排序监督多模态微调)确保了模型在效率提升的同时保持甚至超越现有方法的准确性。

链接: https://arxiv.org/abs/2605.11864
作者: Yiqun Sun,Pengfei Wei,Lawrence B. Hsieh
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: To appear in ICML 2026

点击查看摘要

Abstract:Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at this https URL.

[IR-15] Quality-Aware Collaborative Multi-Positive Contrastive Learning for Sequential Recommendation

【速读】:该论文旨在解决对比学习在序列推荐(sequential recommendation)中因视图构建质量不足导致的语义漂移(semantic drift)和假阳性(false-positive)问题。现有方法常依赖于启发式数据增强,可能移除关键物品或破坏转换模式;可学习增强方法虽然有所改进,但视图多样性有限且仍需启发式辅助,且不同视图间的质量差异未被显式自适应建模。解决方案的关键在于提出了QCMP-CL(Quality-aware Collaborative Multi-Positive Contrastive Learning)方法,包含两个核心模块:一是可学习的协作序列增强模块(learnable collaborative sequence augmentation module),分别基于同目标序列(same-target sequences)和相似序列(similar sequences)两种互补的协作上下文生成两种增强视图,以同时保证语义保持意图与视图多样性;二是质量感知机制(quality-aware mechanism),通过从增强操作置信度中估计各视图质量并赋予自适应权重,确保高置信度视图贡献更多监督信号,低置信度视图贡献更少,从而有效缓解视图质量差异带来的负面影响。

链接: https://arxiv.org/abs/2605.11707
作者: Wei Wang
机构: Shandong University of Science and Technology (山东科技大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The effectiveness of contrastive learning in sequential recommendation hinges on the construction of contrastive views, which ideally should be both semantically consistent and diverse. However, most existing CL-based methods rely on heuristic augmentations that are prone to removing crucial items or disrupting transition patterns, leading to semantic drift. While a few studies have explored learnable augmentations to improve view quality, they often suffer from limited diversity and still necessitate heuristic aids. Furthermore, the quality differences across views are rarely modeled explicitly and adaptively, aggravating the false-positive issue. To address these issues, we propose Quality-aware Collaborative Multi-Positive Contrastive Learning for sequential recommendation. First, we introduce a learnable collaborative sequence augmentation module that generates two augmented views under two complementary collaborative contexts, one based on same-target sequences and the other on similar sequences, thereby enhancing view diversity while preserving intent consistency. Second, we design a quality-aware mechanism, tightly integrated into the model representations, which estimates each view’s quality from the confidence of its augmentation operations and assigns adaptive weights to ensure that high-confidence views contribute more supervision while low-confidence ones contribute less. Extensive experiments on three real-world datasets demonstrate that QCMP-CL outperforms state-of-the-art CL-based sequential recommendation baselines.
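
质量感知的自适应加权思想可用一个带权多正例InfoNCE损失来示意。以下为通用草图,权重归一化方式与温度参数均为本文假设,并非QCMP-CL的确切损失:

```python
import numpy as np

def weighted_multi_positive_infonce(anchor, positives, negatives, quality, tau=0.1):
    """Multi-positive InfoNCE where each positive view's log-likelihood term is
    weighted by its normalized quality score, so high-confidence augmented
    views contribute more supervision and low-confidence ones contribute less."""
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    views = positives + negatives
    logits = np.array([sim(anchor, v) / tau for v in views])
    # Numerically stable log-sum-exp over all views (the InfoNCE denominator).
    log_denom = float(np.log(np.sum(np.exp(logits - logits.max()))) + logits.max())
    w = np.array(quality, dtype=float)
    w /= w.sum()                               # adaptive per-view weights
    return float(-np.sum(w * (logits[: len(positives)] - log_denom)))

rng = np.random.default_rng(0)
a = rng.normal(size=8)
pos = [a + 0.05 * rng.normal(size=8), a + 0.05 * rng.normal(size=8)]
neg = [rng.normal(size=8) for _ in range(4)]
loss = weighted_multi_positive_infonce(a, pos, neg, quality=[0.9, 0.4])
```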

[IR-16] HSUGA: LLM -Enhanced Recommendation with Hierarchical Semantic Understanding and Group-Aware Alignment ACL2026

【速读】:该论文旨在解决大语言模型增强的序列推荐(LLM-enhanced sequential recommendation)中两个核心组件的不足:用户语义嵌入提取(user semantic embedding extraction)与利用(utilization)。具体而言,一是提取阶段中,大多数方法直接将过长的交互序列片段输入LLM进行偏好总结,导致推理困难,难以可靠地推断准确用户嵌入;二是在利用阶段,统一采用相同的语义嵌入利用策略,未考虑用户活跃度差异,导致次优性能。解决方案的关键在于提出了HSUGA框架,包含两个轻量级插件:分层语义理解(HSU, Hierarchical Semantic Understanding)通过分阶段的两阶段偏好挖掘并结合约束编辑操作(constrained editing operations)来建模偏好演变,从而提升用户语义提取的可靠性;群组感知对齐(GAA, Group-Aware Alignment)根据用户活跃度调整语义利用的强度,对活跃用户提供较弱的对齐,对历史数据稀疏用户提供更强的引导,最终在三个基准数据集上验证了其有效性与兼容性。

链接: https://arxiv.org/abs/2605.11662
作者: Guorui Li,Dugang Liu,Lei Li,Xing Tang,Zhong Ming
机构: College of Computer Science and Software Engineering, Shenzhen University (深圳大学计算机与软件学院); School of Artificial Intelligence, Shenzhen Technology University (深圳技术大学人工智能学院)
类目: Information Retrieval (cs.IR)
备注: Accepted by ACL 2026 Findings

点击查看摘要

Abstract:Large language model (LLM)-enhanced sequential recommendation typically aims to improve two core components: user semantic embedding extraction and utilization. Despite promising results, existing methods still have two limitations: 1) In the extraction stage, most methods directly input long interaction sequence fragments into LLM for preference summarization. However, excessively long sequences increase inference difficulty, making it challenging to reliably infer accurate user embeddings. 2) In the utilization stage, most methods employ the same semantic embedding utilization strategy for all users, neglecting the differences caused by user activity levels, leading to suboptimal performance. To address these issues, we propose HSUGA, which introduces a simple yet effective plugin for each of the two core components: Hierarchical Semantic Understanding (HSU) and Group-Aware Alignment (GAA). HSU performs a staged two-phase preference mining and models preference evolution through constrained editing operations, thereby improving the reliability of user semantic extraction. GAA adjusts the intensity of semantic utilization based on user activity levels, providing weaker alignment for active users and stronger guidance for users with sparse historical data. Finally, extensive experiments on three benchmark datasets demonstrate the effectiveness and compatibility of HSUGA.

[IR-17] TwiSTAR: Think Fast, Think Slow, Then Act: Generative Recommendation with Adaptive Reasoning

【速读】:该论文试图解决生成式推荐(Generative Recommendation)中采用语义ID(Semantic IDs, SIDs)时固定推理策略带来的权衡问题——快速直接生成在困难样本上准确性不足,而始终使用慢速链式推理(chain-of-thought reasoning)则导致不可接受的延迟和计算浪费。解决方案的关键在于提出“Think Fast, Think Slow, Then Act”框架,通过学习在每个用户序列上自适应分配推理努力:系统赋予大语言模型(LLM)三个互补工具——快速基于SID的检索器(fast SID-based retriever)、轻量级候选排序器(lightweight candidate ranker)和生成显式理由后再推荐的慢速推理模型(slow reasoning model),并通过将物品间知识转化为自然语言解释向慢速模型注入协作常识(collaborative commonsense);同时,一个经过监督预热(supervised warm-up)和智能体强化学习(agentic reinforcement learning)训练的规划器(planner)动态决定调用哪个工具,从而在保持准确率提升的同时降低推理延迟。

链接: https://arxiv.org/abs/2605.11553
作者: Shiteng Cao,Kaian Jiang,Yunlong Gong,Zhiheng Li
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院)
类目: Information Retrieval (cs.IR)
备注: 16pages,3 figures

点击查看摘要

Abstract:Generative recommendation with Semantic IDs (SIDs) has emerged as a promising paradigm, yet existing methods apply a fixed inference strategy, either fast direct generation or slow chain-of-thought reasoning, uniformly across all user histories. This approach creates a trade-off: fast recommendation model produces suboptimal accuracy on hard samples, while always invoking slow reasoning incurs prohibitive latency and wastes computation on easy cases. To address this, we propose Think Fast, Think Slow, Then Act, a framework that learns to adaptively allocate reasoning effort per user sequence. Our system equips an LLM with three complementary tools: a fast SID-based retriever, a lightweight candidate ranker, and a slow reasoning model that generates explicit rationales before recommending. Crucially, we inject collaborative commonsense into the slow model by transforming item-to-item knowledge into natural language explanations. A planner, trained through supervised warm-up followed by agentic reinforcement learning, dynamically decides which tool to invoke. Experiments on three datasets demonstrate that our method outperforms strong baselines, achieving consistent accuracy gains while reducing inference latency compared to uniform slow reasoning.

[IR-18] Conditional Memory Enhanced Item Representation for Generative Recommendation

【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GR)中因现有项目级表示构建方式(直接合并SID-token嵌入或基于外部输入的方法)所导致的“身份-结构保存冲突”(Identity-Structure Preservation Conflict)与“输入-输出粒度不匹配”(Input-Output Granularity Mismatch)两大问题,这些缺陷使表示构建成为被忽视的瓶颈,进而影响生成式推荐的性能。解决方案的关键在于提出ComeIR框架(Conditional Memory enhanced Item Representation),它通过三个核心组件重构项目表示:一是MM引导的令牌评分(MM-guided token scoring),自适应估计每个码元在语义标识符(SID)中的贡献;二是双层次印迹记忆(dual-level Engram memory),分别捕获项目内码元组成和项目间转换模式;三是记忆恢复预测头(memory-restoring prediction head),在自回归解码阶段复用记忆以恢复令牌粒度,从而在输入表示中保留SID的结构化证据,并保证生成阶段输出与输入粒度的一致性。

链接: https://arxiv.org/abs/2605.11447
作者: Ziwei Liu,Yejing Wang,Shengyu Zhou,Xinhang Li,Xiangyu Zhao
机构: City University of Hong Kong (香港城市大学); Independent Researcher (独立研究者); Tsinghua University (清华大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative recommendation (GR) has emerged as a promising paradigm that predicts target items by autoregressively generating their semantic identifiers (SID). Most GR methods follow a quantization-representation-generation pipeline, first assigning each item a SID, then constructing input representations from SID-token embeddings, and finally predicting the target SID through autoregressive generation. Existing item-level representation constructions mainly take two forms: directly merging SID-token embeddings into a compact vector, or enriching item-level representations with external inputs through additional networks. However, these item-level constructors still expose two practical challenges: direct merging may amplify the information loss caused by quantization and ID collision while obscuring SID code relations, whereas external-input-based methods can strengthen item semantics but cannot reliably preserve the SID-structured evidence required for token-level generation. These limitations make representation construction an underexplored bottleneck, leading to two severe problems, i.e., the Identity-Structure Preservation Conflict and Input-Output Granularity Mismatch. To this end, we propose ComeIR, a Conditional Memory enhanced Item Representation framework that reconstructs SID-token embeddings into item-aware inputs and restores the token granularity during SID decoding. Specifically, MM-guided token scoring adaptively estimates the contribution of each code within the SID, dual-level Engram memory captures intra-item code composition and inter-item transition patterns, and a memory-restoring prediction head reuses the memories during SID decoding. Extensive experiments demonstrate the effectiveness and flexibility of ComeIR, and further reveal scalable gains from enlarging conditional memory.

[IR-19] FedMM: Federated Collaborative Signal Quantization for Multi-Market CTR Prediction SIGIR2026

【速读】:该论文旨在解决多市场推荐(MMR)场景中传统方法忽视数据隐私、联邦学习无法处理市场间显著异质性以及嵌入层因ID空间不重叠而聚合失效的问题。其解决方案的核心在于提出一种基于离散码本机制的联邦协同信号量化方法(FedMM),通过部署残差量化变分自编码器(RQ-VAE)实现双层码本结构:第一层为全局联邦码本,通过跨市场聚合捕捉共享的协同模式;第二层为本地码本,用于学习市场特有的语义信息。该机制在保护隐私的同时实现了不重叠ID空间的对齐,并将包含通用与特定协同信号的离散码融入下游点击率(CTR)预测模型,从而提升所有市场的推荐性能。

链接: https://arxiv.org/abs/2605.11433
作者: Jun Zhang,Dugang Liu,Xing Tang,Xiuqiang He,Zhong Ming
机构: College of Computer Science and Software Engineering, Shenzhen University (计算机与软件学院, 深圳大学); Shenzhen Technology University (深圳技术大学); Shenzhen University (深圳大学)
类目: Information Retrieval (cs.IR)
备注: Accepted by SIGIR 2026

点击查看摘要

Abstract:Online platforms such as Amazon and Netflix serve users across multiple countries and regions, underscoring the importance of multi-market recommendation (MMR). Most MMR methods adopt a pre-training and fine-tuning paradigm, in which a unified model is first trained on centralized, global data and subsequently adapted to specific markets. However, this approach ignores the privacy of market data. While traditional federated learning preserves privacy, it typically aims to obtain a global model by aggregating model parameters and does not account for significant market heterogeneity. Additionally, because ID spaces are disjoint across markets, embedding-based aggregation strategies become ineffective. To overcome these challenges, we propose a federated collaborative signal quantization (FedMM) method for multi-market click-through rate (CTR) prediction. Our core idea leverages a discrete codebook mechanism to achieve privacy-preserving transmission and align disjoint ID spaces. We further employ a hierarchical codebook structure to capture cross-market shared patterns and market-specific characteristics. Specifically, we deploy a residual quantized variational autoencoder (RQ-VAE) with a dual-layer codebook mechanism for each market to quantize collaborative embeddings. The first layer utilizes a global federated codebook, updated via aggregation to capture universally shared collaborative patterns, while the second layer maintains a local codebook to learn market-specific semantics. Finally, the learned discrete codes, which integrate both general and specific collaborative signals, are incorporated into downstream CTR models to enhance prediction accuracy across all markets. Extensive experiments on benchmark datasets demonstrate that FedMM significantly improves recommendation performance with privacy guarantees.
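
RQ-VAE的残差量化核心步骤可用NumPy做一个两层码本的最小示意。以下码本为随机占位;真实系统中第一层为联邦聚合的全局码本、第二层为本地码本,且码本需经训练:

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Two-stage residual quantization as in RQ-VAE-style tokenizers: each
    level picks the nearest codeword for the current residual; the chosen
    indices form the item's discrete code sequence."""
    codes, residual = [], x.copy()
    for cb in codebooks:                      # level 0: global, level 1: local
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, x - residual                # codes and the reconstruction

rng = np.random.default_rng(0)
global_cb = rng.normal(size=(16, 8))          # shared, federated level
local_cb = 0.1 * rng.normal(size=(16, 8))     # market-specific level
x = rng.normal(size=8)                        # a collaborative embedding
codes, recon = residual_quantize(x, [global_cb, local_cb])
```

由于跨市场只传输离散码索引与码本更新,原始嵌入不出本地,这正是该类方法的隐私保护依据。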

[IR-20] Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

【速读】:该论文试图解决的核心问题是:测试时计算(test-time compute)是否仅对大型推理模型有效,而对小型嵌入模型(embedding models)无益。论文通过实验证明,即使对于从大型语言模型(LLM)骨干蒸馏而来的冻结嵌入模型,在推理阶段增加额外的计算也能显著提升其性能,无需重新训练模型。解决方案的关键在于:通过一个代理程序搜索循环(agentic program-search loop),在冻结的嵌入API上探索了259个候选推理程序,并经过90次迭代后,整个Pareto前沿收敛为一个单一的代数操作——软最大加权质心(softmax-weighted centroid),即对局部top-K文档的表示与查询向量进行插值。这个无参数默认方法在跨越十倍参数范围的七个嵌入模型家族上,均显著提高了nDCG@10指标,并在完整的BEIR验证集上得到了确认。

链接: https://arxiv.org/abs/2605.11374
作者: Han Xiao
机构: Jina AI by Elastic(Elastic旗下的Jina AI)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 37 pages, 5 figures, 16 tables

点击查看摘要

Abstract:Test-time compute is widely believed to benefit only large reasoning models. We show it also helps small embedding models. Most modern embedding checkpoints are distilled from large LLM backbones and inherit their representation space; a frozen embedding model should therefore benefit from extra inference compute without retraining. Using an agentic program-search loop, we explore 259 candidate inference programs over a frozen embedding API across ninety generations. The entire Pareto frontier collapses onto a single algebra: a softmax-weighted centroid of the local top-K documents interpolated with the query. This parameter-free default lifts nDCG@10 statistically significantly across seven embedding-model families spanning a tenfold parameter range, with held-out full-BEIR validation confirming the lift on every model tested.
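
摘要所述收敛出的程序(局部top-K文档的softmax加权质心与查询向量插值)可按如下方式示意。其中温度tau与插值系数alpha为示意性参数;论文给出的是无参数默认形式,具体常数此处未复现:

```python
import numpy as np

def softmax_centroid_requery(q, doc_embs, k=10, tau=0.05, alpha=0.5):
    """Re-query construction: softmax-weighted centroid of the query's top-K
    documents (weights from query-doc similarity), interpolated with the
    original query vector, then re-normalized."""
    sims = doc_embs @ q
    top = np.argsort(sims)[::-1][:k]
    w = np.exp((sims[top] - sims[top].max()) / tau)   # stable softmax weights
    w /= w.sum()
    centroid = w @ doc_embs[top]
    new_q = alpha * q + (1 - alpha) * centroid
    return new_q / np.linalg.norm(new_q)

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 32))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit-norm doc embeddings
q = docs[0] + 0.1 * rng.normal(size=32)
q /= np.linalg.norm(q)
q2 = softmax_centroid_requery(q, docs)                # refined query vector
```

精化后的q2可再次用于检索,整个过程只消耗额外的推理计算,无需改动冻结的嵌入模型。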

[IR-21] Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence EMNLP

【速读】:该论文旨在解决从灾难相关社交媒体帖子中提取因果关系(causal relations)以增强态势感知(situational awareness)时所面临的挑战——这些帖子常为非正式、碎片化、依赖语境,且多描述个人经历而非明确因果链。解决方案的关键在于:(1)提出一个基于专家验证的评估框架,将大语言模型(LLM)生成的因果图与源自灾难特定报告(disaster-specific reports)的参考图进行对比;(2)评估所提取的关系是否由事后证据(post-event evidence)支撑,抑或仅是模型先验(model priors)的体现。这一双重设计同时为LLM在该场景下的潜力与风险提供了系统化检验依据。

链接: https://arxiv.org/abs/2605.11348
作者: Ujun Jeong,Saketh Vishnubhatla,Bohan Jiang,Andre Harrison,Adrienne Raglin,Huan Liu
机构: Arizona State University (亚利桑那州立大学); DEVCOM Army Research Laboratory (美国陆军研究实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注: Submitted to EMNLP

点击查看摘要

Abstract:During disasters, extracting causal relations from social media can strengthen situational awareness by identifying factors linked to casualties, physical damage, infrastructure disruption, and cascading impacts. However, disaster-related posts are often informal, fragmented, and context-dependent, and they may describe personal experiences rather than explicit causal relations. In this work, we examine whether Large Language Models (LLMs) can effectively extract causal relations from disaster-related social media posts. To this end, we (1) propose an expert-grounded evaluation framework that compares LLM-generated causal graphs with reference graphs derived from disaster-specific reports and (2) assess whether the extracted relations are supported by post-event evidence or instead reflect model priors. Our findings highlight both the potential and risks of using LLMs for causal relation extraction in disaster decision-support systems.
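
论文中LLM生成因果图与专家参考图的对比,可用最简单的有向边级precision/recall来示意。实际评估框架可能允许结点标签的同义匹配;此处仅作严格集合匹配的草图,示例边为虚构:

```python
def edge_prf(predicted_edges, reference_edges):
    """Edge-level precision/recall/F1 between an extracted causal graph and a
    reference graph, both given as sets of directed (cause, effect) pairs."""
    pred, ref = set(predicted_edges), set(reference_edges)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

ref = {("flooding", "road closure"), ("flooding", "power outage"), ("wind", "tree fall")}
pred = {("flooding", "road closure"), ("wind", "tree fall"), ("rain", "flooding")}
p, r, f1 = edge_prf(pred, ref)
```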

[IR-22] Much of Geospatial Web Search Is Beyond Traditional GIS

【速读】:该论文试图解决现有标注方案严重低估网络搜索查询中与地点相关内容比例的问题,并对地理空间网络搜索查询的语义景观(即用户对地点的询问内容及频率)缺乏大规模、细粒度表征的现状进行填补。其解决方案的关键在于:1)不依赖先验的地名或空间关键词过滤,而是直接对完整MS MARCO语料库(101万真实Bing查询)应用密集句子嵌入(dense sentence embeddings)进行语义表示;2)结合轻量级SetFit分类器高效识别地理空间查询(最终发现18.0%的查询为地理空间相关,是原始标注中“Location”类别比例的三倍);3)通过基于密度的聚类(density-based clustering)自动归纳出88个查询类别,揭示出地理空间搜索以事务性和实用性查询(如成本、价格、营业时间、联系方式、天气、旅行建议等)为主,这些内容远超出传统GIS系统与知识图谱的范畴,从而为混合检索架构和大语言模型地理推理基准的设计提供了实证基础。

链接: https://arxiv.org/abs/2605.11336
作者: Ilya Ilyankou,Stefano Cavazzi,James Haworth
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Web search queries concern place far more often than existing labelling schemes suggest, yet the landscape of geospatial web search queries - what people ask of place, and how often - remains poorly characterised at scale. We apply dense sentence embeddings, a lightweight SetFit classifier, and density-based clustering to the full MS MARCO corpus of 1.01 million real Bing queries without prior filtering for toponyms or spatial keywords, identifying 181,827 geospatial queries (18.0%), nearly threefold the 6.17% labelled as Location in the original annotations. The resulting taxonomy of 88 query categories reveals that geospatial web search is dominated by transactional and practical lookups: costs and prices alone account for 15.3% of geospatial queries, nearly twice the size of the entire physical geography theme. Much of this activity - costs, opening hours, contact details, weather, travel recommendations - falls outside the scope traditional GIS systems and knowledge graphs are built to serve. The categories vary substantially in the kind of answer they admit, from deterministic lookups answerable from spatial databases or knowledge graphs to evaluative or temporally volatile queries that require generative or real-time systems. We discuss implications for hybrid retrieval architectures and for benchmarks of geographic reasoning in large language models. We openly release the labelled dataset, classifier, and taxonomy.

[IR-23] VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference

【速读】:该论文试图解决LLM-as-Judge(大语言模型作为评判者)系统中缺乏可靠置信度信号的问题。具体而言,标准的事后置信度信号——Token对数概率(token log-probabilities)——在许多商用大语言模型中不可用,且即使可用,在结构化JSON输出中也会因概率饱和(超过0.999)而失效。解决方案的关键在于VERDI (VERification-Decomposed Inference) 方法,该方法无需额外推理调用,利用结构化评判者已有的推理迹(reasoning trace)提取置信度。VERDI将每个验证式评估分解为多个子检查,并从中推导出三种结构化信号:步骤-判决对齐度(Step-Verdict Alignment)、声明级边际(Claim-Level Margin)和证据基础评分(Evidence Grounding Score),然后通过Platt缩放逻辑回归(Platt-scaled logistic regression)将这些信号组合成最终的置信度估计。

链接: https://arxiv.org/abs/2605.11334
作者: Jasmine Qi,Danylo Dantsev,Muyang Sun
机构: Indeed Inc (Indeed公司)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 16 pages, 6 figures

点击查看摘要

Abstract:LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge’s verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B, where answer-token logprobs are anti-calibrated (higher confidence on errors, AUROC 0.32-0.49), VERDI achieves 0.56-0.70. We additionally validate on a production system with eight rubrics (AUROC 0.73-0.88 on factual rubrics), demonstrate cross-model transfer (AUROC 0.66-0.69), and show that a 33M-parameter NLI (Natural Language Inference) model provides a scalable alternative to regex extraction.
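上述摘要提到用 Platt 缩放逻辑回归把三个结构化信号组合成置信度。下面给出一个极简示意(纯 numpy 梯度下降实现;三个信号的取值、合成标签与超参数均为虚构假设,并非论文原实现):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt_lr(X, y, lr=0.1, epochs=2000):
    # 逻辑回归 + 梯度下降;在固定信号上拟合,等价于对信号的线性组合做 Platt 缩放
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

# 三个假设的结构化信号:步骤-判决对齐、声明级边际、证据基础评分(取值 0~1)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
# 合成标签:信号越强,判决越可能正确(仅用于演示)
y = (X.sum(axis=1) + rng.normal(0, 0.3, 400) > 1.5).astype(float)

w, b = fit_platt_lr(X, y)
conf = sigmoid(X @ w + b)  # 每条判决的校准置信度
```

实际系统中这三个信号需从评判者的推理迹中解析得到,此处只演示"组合并校准"这一步。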

[IR-24] Beyond Similarity Search: Tenure and the Case for Structured Belief State in LLM Memory

【速读】:该论文试图解决无状态大语言模型(LLM)在迭代式、密集会话工作流中因缺乏跨会话记忆而产生的重新定向成本问题。现有检索增强方法将跨会话记忆视为搜索问题,依赖语义相似度检索,但在有限词汇上下文(如单个用户或工程团队共享代码库和术语)中,相似性搜索对命名实体解析的精度极低,因为技术领域内的语义邻近性导致信念在语义嵌入空间中不可区分。解决方案的关键在于将跨会话记忆重新定义为状态管理问题,并提出了一个名为 Tenure 的本地优先代理系统。Tenure 的核心创新包括:维护一个带有认知状态(epistemic status)、版本取代(versioned supersession)和硬范围隔离(hard scope isolation)的类型化信念存储(typed belief store),通过精度优先检索(precision-first retrieval)注入上下文;其类型化模式(typed schema)通过“why it matters”字段将提取的事实转化为可执行指令,使注入的信念可直接作用于模型而非作为待推导的原始材料;利用别名加权 BM25(alias-weighted BM25)实现全精度检索(在 72 个案例中平均精度 1.0),并通过别名丰富飞轮(alias enrichment flywheel)持续索引同一用户的特定词汇,从而在结构上消除查询与信念不属于同一作者时的词汇不匹配问题。

链接: https://arxiv.org/abs/2605.11325
作者: Jeffrey Flynt
机构: Independent Researcher(独立研究员)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Why do we need another AI to help the AI? We argue you don’t. Stateless LLM sessions impose re-orientation costs on iterative, session-heavy workflows. Prior work addresses cross-session memory through retrieval-augmented approaches: store history, embed it, retrieve by semantic similarity. Cross-session memory is a state management problem, not a search problem. Similarity search fails for named entity resolution within bounded vocabulary contexts because beliefs about a shared technical domain are semantically proximate by construction. A single user is the simplest bounded vocabulary context; engineering teams converge on the same property through shared codebases and terminology. We present Tenure, a local-first proxy that maintains a typed belief store with epistemic status, versioned supersession, and scope isolation, injecting curated context into every LLM session through precision-first retrieval. Hard scope isolation provides a structural guarantee: the right beliefs surface, and only within the boundaries the user has authorized. Tenure’s typed schema converts extracted facts into imperative instructions via a why it matters field, making injected beliefs directly actionable rather than raw material for the model to re-derive. A controlled evaluation on 72 retrieval cases demonstrates the gap. Cosine similarity over dense embeddings achieves mean precision of 0.12. Alias-weighted BM25 maintains mean precision of 1.0, passing 72/72 cases versus 8/72 for cosine similarity on the same corpus. Hybrid retrieval typically solves vocabulary mismatch between disparate authors; Tenure eliminates this structurally: query and belief authors are the same person, and an alias enrichment flywheel continuously indexes their specific vocabulary. Under multi-turn topic drift this worsens: the vector backend produces drift scores of 0.43–0.50 on noise-critical turns where BM25 maintains 0. 
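摘要中的别名加权 BM25 可以粗略示意如下(简化的 BM25 实现;信念语料、分词与别名权重均为虚构,并非 Tenure 的原实现):

```python
import math
from collections import Counter

def alias_bm25(query, doc, corpus, alias_boost, k1=1.5, b=0.75):
    # 对命中用户别名表的查询词项,按 alias_boost 放大其 BM25 贡献
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for t in query:
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = tf[t]
        norm = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += alias_boost.get(t, 1.0) * idf * f * (k1 + 1) / norm
    return score

# 虚构的信念存储:每条信念是一段分词后的文本
corpus = [
    ["auth", "service", "uses", "jwt", "tokens"],
    ["db", "migration", "pending", "review"],
    ["jwt", "refresh", "flow", "deprecated"],
]
alias_boost = {"jwt": 2.0}  # 用户特定词汇(别名)被持续索引并加权
query = ["jwt", "refresh"]
scores = [alias_bm25(query, d, corpus, alias_boost) for d in corpus]
```

在"查询作者与信念作者是同一人"的前提下,这种词项级精确匹配避免了稠密嵌入在语义邻近信念上的混淆。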

[IR-25] Localization Boosting for Growth Markets: Mitigating Cross-Locale Behavioral Bias in Learning-to-Rank

【速读】:该论文试图解决Adobe Express国际化扩展过程中,由于美国地区内容供应和交互量过大导致的学习排序(Learning-to-rank, LTR)模型跨地区曝光偏差(cross-locale exposure bias)问题。该偏差表现为:模型主要基于行为反馈(如点击)训练,继承了美国模板占比失衡的分布,进而在非美国地区过度推荐美国流行模板,抑制本地内容可发现性,降低排序质量。其关键解决方案在于提出一个多目标框架,该框架同时结合了行为监督(clicks)、视觉-语言模型(Vision-Language Model, VLM)提供的相关性标签作为辅助监督信号,以及地区感知增强(locale-aware boosting)。通过将曝光偏向与语义监督进行解耦,该模型在五个地区提升了相关性并恢复了稳定的本地化表现,证明区分曝光偏差与语义相关性信号的重要性。

链接: https://arxiv.org/abs/2605.11272
作者: Suryaa Veerabathiran Seran,Ashwin Naresh Kumar,Tracy Holloway King,Jing Zheng
机构: Adobe
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Adobe Express is expanding internationally, but the US has a disproportionately large content supply and interaction volume. Learning-to-rank (LTR) models trained primarily on behavioral feedback inherit this imbalance: templates popular in US are over-served in non-US locales. This cross-locale exposure bias suppresses local content discoverability and degrades ranking quality in growth locales. We show that click-only training suppresses semantically informative localization features. Adding vision-language model (VLM) graded relevance labels as auxiliary supervision alongside clicks improves semantic alignment but does not preserve local content visibility. We propose a multi-objective framework combining behavioral supervision, VLM-derived relevance signals, and locale-aware boosting. Across five locales, the resulting model improves relevance while restoring stable localization, demonstrating the importance of disentangling exposure from semantic supervision.
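摘要描述的"行为信号 + VLM 相关性 + 地区感知加权"多目标融合,可用一个极简打分示意(w_click、w_vlm、boost 的取值与模板数据均为假设,仅演示加权如何改变排序):

```python
def locale_aware_rank(items, user_locale, w_click=0.6, w_vlm=0.3, boost=0.1):
    # 行为分数与 VLM 相关性线性融合,再对本地内容施加地区感知加权
    def score(it):
        s = w_click * it["click_score"] + w_vlm * it["vlm_relevance"]
        return s + (boost if it["locale"] == user_locale else 0.0)
    return sorted(items, key=score, reverse=True)

templates = [
    {"id": "us-1", "locale": "en-US", "click_score": 0.9, "vlm_relevance": 0.5},
    {"id": "jp-1", "locale": "ja-JP", "click_score": 0.8, "vlm_relevance": 0.6},
]
ranked = locale_aware_rank(templates, user_locale="ja-JP")
```

在这个玩具例子里,没有地区加权时美国模板因点击分数高而排在前面;加上 boost 后本地模板恢复可见性。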

[IR-26] MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval SIGIR2026

【速读】:该论文旨在解决当前信息检索(Information Retrieval, IR)评估基准缺乏涵盖多种数据源和格式的统一测试集合的问题,以满足用户对现代搜索系统跨异构类别(如出版物、研究数据、变量及工具)进行无缝检索的期望。解决方案的关键在于构建MIRA基准,其基于大规模社会科学搜索平台,通过真实用户查询驱动,使评估更具现实基础;同时采用大语言模型(Large Language Model, LLM)自动生成主题描述、叙述及相关性判断,大幅度降低测试集合构建的人力与成本,从而首次实现跨四个异质类别的类别感知排序(category-aware ranking)的统一评估框架。

链接: https://arxiv.org/abs/2605.11254
作者: Mehmet Deniz Türkmen,Suchana Datta,Dwaipayan Roy,Daniel Hienert,Philipp Mayr,Derek Greene
机构: GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany (GESIS – 莱布尼茨社会科学研究所,科隆,德国); University College Dublin, Ireland (都柏林大学学院,爱尔兰); Indian Institute of Science Education and Research, Kolkata, India (印度科学教育与研究学院,加尔各答,印度)
类目: Information Retrieval (cs.IR)
备注: Accepted to SIGIR 2026. Resource Paper. 8 pages, 2 figures. DOI: https://doi.org/10.1145/3805712.3808614

点击查看摘要

Abstract:Users increasingly expect modern search systems to offer a unified interface that seamlessly retrieves information from diverse data sources and formats. However, current information retrieval (IR) evaluation benchmarks have not kept pace with this development, primarily due to the lack of test collections that represent the diversity of contemporary search domains. We address this critical gap with MIRA, a novel benchmark based on a large-scale social science search platform. MIRA is designed for category-aware ranking across heterogeneous categories - Publications, Research Data, Variables, and Instruments Tools - within a single, unified evaluation framework. The proposed collection is distinctive in several ways: (1) it is built upon real user queries, providing a more realistic basis for evaluation; (2) it covers scholarly items from four distinct categories, enabling multi-faceted evaluation; and (3) it leverages a Large Language Model to generate topic descriptions and narratives, as well as for relevance assessment with respect to these topics, substantially reducing the labor and cost of test collection generation. We release this resource to benefit the community by providing a foundational testbed for the research on multi-faceted, category-aware, integrated, or cross-category information retrieval.

[IR-27] Debiasing Message Passing to Mitigate Popularity Bias in GNN-based Collaborative Filtering

【速读】:该论文旨在解决基于图神经网络(GNN)的协同过滤(CF)模型中固有的流行度偏差问题,即由于交互分布不均和高阶邻域的消息传递放大了流行物品的影响、抑制了长尾物品,导致现有去偏方法(如重加权、正则化、因果方法及后处理)在GNN框架下效果不佳,因其未能直接抵抗聚合过程中的偏差传播。解决方案的关键在于提出了DPAA(Debiasing Popularity Amplification in Aggregation)框架,该方法直接在消息传递过程中集成两种加权机制:一是自适应的、基于嵌入感知的交互级加权,通过从预训练嵌入向训练中演化嵌入的平滑过渡来稳定表示感知的流行度信号;二是逐层加权,通过放大高阶邻域来暴露与多样化、低曝光物品的长程交互,从而有效抑制流行度偏差。实验证明DPAA在真实和半合成数据集上优于现有最先进的GNN-CF流行度偏差校正方法。

链接: https://arxiv.org/abs/2605.11145
作者: Md Aminul Islam,Ahmed Sayeed Faruk,Sourav Medya,Elena Zheleva
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Collaborative filtering (CF) models based on graph neural networks (GNNs) achieve strong performance in recommender systems by propagating user-item signals over interaction graphs. However, they are highly susceptible to popularity bias, since skewed interaction distributions and repeated message passing across high-order neighborhoods amplify the influence of popular items while suppressing long-tail ones. Existing debiasing approaches, including re-weighting objectives, regularization, causal methods, and post-processing, are less effective in GNN-based settings because they do not directly counteract bias propagated through the aggregation process, and recent in-aggregation weighting methods often rely on static heuristics or unstable embedding estimates. We propose Debiasing Popularity Amplification in Aggregation (DPAA), a popularity debiasing framework for GNN-based CF that integrates adaptive, embedding-aware interaction weighting and layer-wise weighting directly into message passing. DPAA assigns interaction-level weights from a representation-aware popularity signal, stabilized by a smooth transition from pre-trained to evolving model embeddings during training. It further introduces a layer-wise weighting that amplifies higher-order neighborhoods, surfacing long-range interactions with diverse and underexposed items. Experiments on real-world and semi-synthetic datasets show that DPAA outperforms state-of-the-art popularity-bias correction methods for GNN-based CF.
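DPAA 的核心是把加权直接嵌入消息传递。下面用 numpy 给出一个 LightGCN 风格的极简示意(流行度降权函数 deg^(-alpha)、层权重与玩具二部图均为演示假设,并非论文原实现):

```python
import numpy as np

def dpaa_propagate(A, E0, layer_weights, alpha=0.5):
    # 示意:交互级流行度降权 + 层级加权求和(层权重递增以放大高阶邻域)
    # A: 对称的用户-物品邻接矩阵;E0: 初始嵌入
    deg = np.maximum(A.sum(axis=1), 1.0)
    pop_w = np.power(deg, -alpha)                 # 流行(高度数)节点发出的消息被降权
    A_hat = (A * pop_w[None, :]) / deg[:, None]   # 行归一化 + 列方向流行度降权
    E, out = E0, layer_weights[0] * E0
    for w in layer_weights[1:]:
        E = A_hat @ E                             # 一层消息传递
        out = out + w * E
    return out

# 玩具二部图:节点 0-2 为用户,节点 3-4 为物品(物品 3 远比物品 4 流行)
A = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
], dtype=float)
E0 = np.eye(5)
Z = dpaa_propagate(A, E0, layer_weights=[0.2, 0.3, 0.5])
```

与不降权(alpha=0)相比,流行物品 3 对用户表示的影响被抑制,长尾物品 4 的相对贡献上升。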

[IR-28] ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

【速读】:该论文试图解决在真实电子健康记录(EHR)笔记中,由于否定(negation)、时间性(temporality)和家庭与患者归属(family-versus-patient attribution)等敏感因素导致检索步骤(即推理前的关键环节)产生错误的问题,而现有临床推理基准仅测量干净输入下的性能,忽略了这些因素。解决方案的关键是EpiKG,它为患者知识图谱中的每个事实携带断言标签(assertion label)和时态标签(temporality tag),并根据问题意图(question intent)路由检索,形成意图感知的知识图谱检索增强生成(intent-aware KG-RAG)架构;在ClinicalBench测试集上的消融实验表明,该架构相对于稠密RAG基线取得了显著的准确率提升,且通过医师盲法裁决验证了其有效性。

链接: https://arxiv.org/abs/2605.11143
作者: Alex Stinard
机构: Department of Clinical Sciences, College of Medicine, University of Central Florida (临床科学系, 医学院, 中佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 46 pages including appendices (two-column preprint format). Under review at JAMIA. Code, frozen evaluator, and benchmark released at this https URL . ClinicalBench v2 is a 400-question MIMIC-IV stress test for assertion-aware retrieval

点击查看摘要

Abstract:Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer to a wrong one. EpiKG carries an assertion label and a temporality tag with every fact in a patient knowledge graph, then routes retrieval by question intent. ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation tests each piece of EpiKG across six LLMs (Claude Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items. The author-blind primary endpoint, leave-author-out paired exact McNemar on 50 unanimous-strict items rated by two external physicians, yields +22.0 percentage points (95 percent Newcombe CI [+5.1, +31.5], p=0.0192). The architectural novelty, intent-aware KG-RAG over a Contriever dense-RAG baseline (C2b to C4g_kw on the change-excluded n=362 endpoint), is +8.84 percentage points (paired McNemar p=1.79e-3); +12.43 percentage points under oracle intent. Sensitivities agree directionally: three-rater physician majority +24.0 percentage points (subject to single-author circularity); deterministic keyword reproducibility proxy +39.5 percentage points. Across the six models, the gain shrinks as the LLM-alone baseline rises (beta=-1.123, r=-0.921, p=0.009). With n=6 this looks more like regression to the mean than encoding substituting for model size. Physician adjudication identified 56 percent of auto-generated reference answers as defective, a methodological finding indicating that NLP-pipeline clinical-QA benchmarks require physician adjudication to be usable. ClinicalBench, the frozen evaluator, three-rater adjudication data, and the EpiKG output stack are publicly released.
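EpiKG 的意图感知检索可以示意为:知识图谱中每条事实携带断言与时态标签,检索按问题意图路由过滤(标签取值、意图名称与事实内容均为虚构,仅说明路由思路):

```python
# 虚构的患者知识图谱事实:每条带断言(assertion)、时态(temporality)与归属(who)标签
facts = [
    {"text": "pneumonia", "assertion": "present", "temporality": "current", "who": "patient"},
    {"text": "diabetes", "assertion": "negated", "temporality": "current", "who": "patient"},
    {"text": "breast cancer", "assertion": "present", "temporality": "historical", "who": "family"},
]

def retrieve(facts, intent):
    # 按问题意图路由:不同意图过滤不同的标签组合
    if intent == "current_problems":
        return [f for f in facts if f["assertion"] == "present"
                and f["temporality"] == "current" and f["who"] == "patient"]
    if intent == "family_history":
        return [f for f in facts if f["who"] == "family"]
    if intent == "ruled_out":
        return [f for f in facts if f["assertion"] == "negated"]
    return facts
```

摘要强调的正是这一点:纯语义相似检索会把"已排除的糖尿病"或"家族乳腺癌史"与患者当前问题混在一起,而标签过滤在结构上避免了断言翻转。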

[IR-29] A Cascaded Generative Approach for e-Commerce Recommendations

【速读】:该论文试图解决传统电子商务个性化商店构建中因使用独立组件(如静态主题、检索系统和点排序器)而导致的刚性问题,这限制了页面级别的个性化与语义连贯性,难以适应动态目标和长期营销需求。解决方案的关键在于提出一个级联营销框架(cascaded merchandising framework),将商店构建分解为两个生成式任务:位置级主题生成(placement-level theme generation)和每个位置的约束关键词生成(constrained keyword generation),以驱动产品检索;同时利用教师-学生微调(teacher-student fine-tuning)在生产环境的延迟和成本约束下提升可扩展性,并将生成输出与传统排序模型融合,辅以AI驱动的内容评估与质量过滤,实现安全自动的动态内容部署。

链接: https://arxiv.org/abs/2605.11118
作者: Moein Hasani,Hamidreza Shahidi,Trace Levinson,Yuan Zhong,Guanghua Shu,Vinesh Gudla,Tejaswi Tenneti
机构: Instacart(Instacart); The Pennsylvania State University(宾夕法尼亚州立大学); EvenUp(EvenUp); Ambience Healthcare(Ambience Healthcare)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Personalized storefronts in large e-commerce marketplaces are often assembled from many independent components: static themes per page section (“placement”), retrieval systems to fetch eligible products per placement, and pointwise rankers to order content. While effective in optimizing for aggregate preferences, this paradigm is rigid and can limit personalization and semantic cohesion across the page. This makes it poorly suited to support dynamic objectives and merchandising requirements over time. To address this, we introduce a cascaded merchandising framework that decomposes storefront construction into two generative tasks: (i) placement-level theme generation and (ii) constrained keyword generation per placement to power product retrieval. Teacher-student fine-tuning is leveraged to improve scalability of this framework under production latency and cost constraints. Fine-tuned model ablations are shown to approach closed-weight LLM performance. We further contribute frameworks for AI-driven content evaluation and quality filtering, enabling safe and automated deployment of dynamic content at scale. Generative output is fused with traditional ranking models to preserve hybrid infrastructure. In online experiments, this framework yields an estimated +2.7% lift in cart adds per page view over a strong baseline.

[IR-30] Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics NEURIPS2026

【速读】:该论文试图解决在推荐系统、广告和临床剂量等领域的标准实践中,对行为曲线(behavioral curve)进行参数函数拟合时,聚合数据(aggregation)引入的系统性扭曲问题,即行为曲线中的辛普森悖论(Simpson’s paradox)。具体表现为,在Goodreads和Amazon Electronics数据集上,由于生存偏倚(survival bias),个体用户的行为峰值与聚合后的峰值存在数倍差异。解决方案的关键是开发了合成零校准(Synthetic Null Calibration)方法,该方法用于纠正每个用户分类中高达32%的假阳性率,从而在存在差异性流失条件下,从聚合曲线中更准确地估计个体行为参数。

链接: https://arxiv.org/abs/2605.11017
作者: Chao Zhou
机构: Meta Platforms, Inc.(Meta平台公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Submitted to NeurIPS 2026

点击查看摘要

Abstract:Behavioral curve modeling – fitting parametric functions to engagement-versus-exposure data – is standard practice in recommendation, advertising, and clinical dosing. We show that aggregation introduces a systematic distortion: Simpson’s paradox in behavioral curves. On Goodreads (3.3M users, 9 genres), individual users peak at n* approximately 11 exposures while the aggregate peaks at n* approximately 34 – a 3x gap driven by survival bias. Amazon Electronics (18M reviews) shows a 5.3x distortion. MovieLens-25M (D approximately 1) serves as a negative control, confirming that survival bias – not aggregation per se – is the operative mechanism. The distortion is robust to category granularity, engagement operationalization, and classifier calibration. We develop Synthetic Null Calibration to address a 32% false positive rate in per-user classification. Our findings apply wherever individual behavioral parameters are estimated from aggregate curves under differential attrition.
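摘要中的机制(个体曲线峰值早,但差异性流失使聚合曲线峰值明显右移)可以用一个确定性的两子群玩具示例复现。曲线形状 n*exp(-n/n_star) 及所有子群参数均为演示假设,与论文数据无关:

```python
import numpy as np

n_grid = np.arange(1, 101)

def curve(n, n_star):
    return n * np.exp(-n / n_star)  # 个体行为曲线,在 n = n_star 处达到峰值

# 两个子群:多数用户峰值早且很快流失;少数用户峰值晚且长期留存(生存偏倚)
groups = [
    {"n_star": 10.0, "count": 900, "horizon": 15},
    {"n_star": 40.0, "count": 100, "horizon": 100},
]

num = np.zeros_like(n_grid, dtype=float)
den = np.zeros_like(n_grid, dtype=float)
for g in groups:
    mask = n_grid <= g["horizon"]          # 仅存活到该曝光量的用户进入聚合
    num[mask] += g["count"] * curve(n_grid[mask], g["n_star"])
    den[mask] += g["count"]

agg_curve = num / den                      # 聚合(存活者平均)曲线
agg_peak = int(n_grid[np.argmax(agg_curve)])
mean_indiv_peak = (900 * 10 + 100 * 40) / 1000  # 个体峰值的加权平均 = 13.0
```

个体峰值平均在 13 左右,但早峰用户在 n=15 后全部流失,聚合曲线由晚峰存活者主导,峰值被推到 40:这正是摘要所说的聚合扭曲。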

[IR-31] Continuous Flood Nowcasting in South Asia: A Multi-Sensor Ensemble Remote Sensing Framework for Flood Extent

【速读】:该论文旨在解决巴基斯坦在2025年异常洪灾期间,现有业务化洪水产品(如UNOSAT)只能提供零散快照,而无法生成具有空间和时间连续性的近实时淹没地图的问题。解决方案的关键在于构建了一个基于多传感器集成的遥感框架,该框架在Google Earth Engine上统一网格,整合了Sentinel-1 SAR、Harmonized Landsat-Sentinel (HLS L30和S30)、MODIS和VIIRS观测数据,采用分层的近实时预报集合——优先使用高分辨率传感器(Sentinel-1和HLS),必要时回退到MODIS和VIIRS——从而以每个传感器的原生分辨率保持淹没范围的逐日连续性,最终实现对淹没范围的近实时、空间一致的动态追踪。

链接: https://arxiv.org/abs/2605.10950
作者: Usman Nazir,Disha Gomathinayagam,Muhammad Kamran,Sara Khalid
机构: Planetary Health Informatics (PHI) Lab, University of Oxford (牛津大学行星健康信息学实验室); Blavatnik School of Government, University of Oxford (牛津大学布拉瓦特尼克政府学院); PMIU Secretariat, Irrigation Department, Pakistan (巴基斯坦灌溉部PMIU秘书处)
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
备注: Visualising Climate 2026

点击查看摘要

Abstract:Pakistan experienced an unusually severe flood season between June and December 2025, with cascading impacts on population, infrastructure, and agriculture. Existing operational flood products (e.g., UNOSAT) provide valuable episode-level snapshots but rarely deliver spatially and temporally continuous inundation maps at near-real-time latency within the country. We present a multi-sensor, ensemble-based remote-sensing framework for continuous flood nowcasting in Pakistan that integrates Sentinel-1 SAR, Harmonized Landsat-Sentinel (HLS L30 and S30), MODIS, and VIIRS observations on a harmonized grid in Google Earth Engine. The framework employs a tiered nowcasting ensemble that prioritizes higher-resolution sensors (Sentinel-1 and HLS) and falls back to MODIS and VIIRS when necessary, preserving daily continuity of flood extent at each sensor’s native resolution. Applied to the 2025 monsoon period, the system generates near-real-time, spatially consistent inundation maps across Pakistan. As a nowcasting case study, we track the super-flood of 26 August-7 September 2025 day by day, demonstrating the framework’s ability to capture the evolving flood footprint in near real time and extend beyond the temporal limits of episodic mapping products. Validation against GloFAS discharge anomalies and precipitation datasets (CHIRPS v3.0, MSWEP) shows strong agreement with observed hydrometeorological conditions. By integrating nowcast outputs with exposure layers (WorldPop, ESA WorldCover, Giga-HOTOSM), the framework enables rapid estimation of affected populations, cropland, and critical infrastructure, supporting timely disaster response and resilience planning in South Asia.
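摘要所述分层近实时预报集合的"高分辨率优先、逐级回退"选择逻辑可示意如下(传感器优先级顺序来自摘要,日期与当日可用性数据为虚构):

```python
PRIORITY = ["Sentinel-1", "HLS", "MODIS", "VIIRS"]  # 高分辨率传感器优先

def daily_source(available_by_day):
    # 每天选取优先级最高且当日有观测的传感器;若全部缺失则返回 None
    return {day: next((s for s in PRIORITY if s in sensors), None)
            for day, sensors in available_by_day.items()}

obs = {
    "2025-08-26": {"HLS", "MODIS"},
    "2025-08-27": {"MODIS", "VIIRS"},
    "2025-08-28": {"Sentinel-1", "VIIRS"},
}
chosen = daily_source(obs)
```

这样即使 Sentinel-1 或 HLS 当日无过境,也能以较粗分辨率维持逐日淹没范围的连续性。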

人机交互

[HC-0] Reimagining Assessment in the Age of Generative AI: Lessons from Open-Book Exams with ChatGPT

【速读】:该论文试图解决生成式AI(如ChatGPT)对传统学术评估假设的挑战——即学生能够实时生成解释、代码和解决方案,使基于最终答案正确性的评估失去效力。解决方案的关键在于:不限制AI使用,而是允许学生在带回家的开卷考试(take-home open-book exams)中透明使用ChatGPT,并要求提交交互记录作为评估补充材料,从而直接观察学生的推理过程而非依赖自我报告。通过定性分析,研究识别出三种渐进使用模式(答案检索、引导协作、关键验证),其中最强的推理证据出现在学生评估AI的错漏回答时,展现了调试、比较和论证中的评价性推理。这要求评估从关注独立答案产出转向评估推理过程,将提示词构造、验证和判断能力作为学习指标,从而在AI中介的评估环境中揭示更深层次的理解,与专业实践对齐。

链接: https://arxiv.org/abs/2605.12363
作者: Qusay H. Mahmoud
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generative AI systems such as ChatGPT challenge traditional assumptions about academic assessment by enabling students to generate explanations, code, and solutions in real time. Rather than attempting to restrict AI use, this study investigates how students actually interact with such systems during formal evaluation. Engineering students were permitted to use ChatGPT during take-home open-book exams and were required to submit interaction transcripts alongside exam solutions. This provided direct observational evidence of reasoning processes rather than relying on self-reported behavior. Qualitative analysis revealed three progressive patterns of use: answer retrieval, guided collaboration, and critical verification. While some students initially copied questions verbatim and received generic responses, many refined prompts iteratively and tested outputs. Some of the strongest evidence of reasoning appeared when students evaluated incorrect or incomplete AI responses, revealing evaluative reasoning through debugging, comparison, and justification. The presence of generative AI shifted the cognitive task of assessment from producing solutions to assessing solution validity. The findings suggest that, in AI-mediated assessment environments, correctness of final answers alone may no longer provide sufficient evidence of comprehension. Instead, competencies such as prompt formulation, verification, and judgment become visible indicators of learning. Transparent integration of AI appeared to reduce focus on rule avoidance and promote self-regulation. Assessments should evolve to evaluate reasoning about solutions rather than independent solution production. Generative AI therefore does not invalidate assessment but has the potential to expose deeper forms of understanding aligned with professional practice.

[HC-1] Optimized but Unowned: How AI-Authored Goals Undermine the Motivation They Are Meant to Drive

【速读】:该论文试图解决的问题是:在AI工具被广泛用于生产力和自我提升的背景下,当AI(如大语言模型)替用户设定目标时,虽然生成的目标在客观质量(如SMART标准)上更优,但用户的心理动机(如心理所有权、承诺、感知重要性)会显著下降,进而导致后续行动减少。问题的核心在于存在“质量-动机分离”(quality-motivation dissociation),即目标客观质量的提升未能转化为用户的行动意愿。解决方案的关键在于识别并确认了心理所有权(psychological ownership)是中介动机效应的核心机制:用户亲自撰写目标(自我作者身份)相比AI生成目标,能通过提升心理所有权来增强目标承诺和后续行为,而客观目标质量则无显著中介作用。此外,论文指出低自我效能感(low trait self-efficacy)的个体——最可能寻求AI帮助的人群——遭受的心理所有权侵蚀最为严重。因此,设计AI工具时应优先考虑保留用户的“作者身份”(authorship preservation),即在目标设定过程中让用户保持主导权和参与感,而非完全由AI替代。

链接: https://arxiv.org/abs/2605.12344
作者: Vivienne Bihe Chi,Roman Rietsche,Andreas Göldi,Lyle Ungar,Sharath Chandra Guntuku
机构: Computer and Information Science, University of Pennsylvania (宾夕法尼亚大学计算机与信息科学系); Institute for Digital Technology Management, Bern University of Applied Sciences (伯尔尼应用科学大学数字技术管理研究所); University of St.Gallen (圣加仑大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As AI tools become embedded in productivity and self-improvement contexts, a pressing question emerges: what happens when AI does the goal-setting for us? While large language models can generate goals that are objectively well-formed, the motivational consequences of delegating this cognitively and emotionally significant task remain unknown. In a preregistered experiment (N = 470), we compared self-authored goals against LLM-authored goals derived from a personal reflection. Although LLM-generated goals scored higher on SMART criteria (specificity, measurability, achievability, relevance, and time-boundedness; d = 2.26), participants in the LLM condition reported lower psychological ownership (d = 1.38), commitment (d = 1.19), and perceived importance (d = 1.13). At two-week follow-up, 72.8% of self-authored participants had acted on two or more of their goals, compared to 46.6% in the LLM condition. Mediation analyses identified psychological ownership as the mechanism: it mediated the authorship effect on every downstream motivational and behavioral outcome, while objective goal quality did not. Critically, individuals low in trait self-efficacy, those most likely to seek AI assistance, experienced the steepest ownership erosion. These findings reveal a quality-motivation dissociation in AI-assisted goal-setting and identify authorship preservation as a design priority for AI tools deployed in identity-relevant, behavior-dependent tasks.

[HC-2] From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review

【速读】:该论文试图解决的问题是在AI辅助标注(AI-assisted annotation)中,模型预测包含类别标签和空间边界两个独立成分,可能出现高置信度分类但空间定位错误的情况,而现有工作流无法向标注者提供空间误差可能的提示,导致人类系统性地忽视定位有偏差的边界框。解决方案的关键在于通过一个专门设计的界面(purpose-built interface)可视化空间不确定性(spatial uncertainty),从而引导标注者将注意力集中在高不确定性的预测上,减少对定位良好的边界框的检查,最终在提升标注质量的同时加快整体速度。

链接: https://arxiv.org/abs/2605.12303
作者: Moussa Kassem Sbeyti,Joshua Holstein,Philipp Spitzer,Nadja Klein,Gerhard Satzger
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:High-quality labeled data is essential for training robust machine learning models, yet obtaining annotations at scale remains expensive. AI-assisted annotation has therefore become standard in large-scale labeling workflows. However, in tasks where model predictions carry two independent components, a class label and spatial boundaries, a model may classify an object with high confidence while mislocalizing it. Existing AI-assisted workflows offer annotators no signal about where spatial errors are most likely. Without such guidance, humans may systematically underinspect subtly misplaced boxes. We address this by studying the effect of visualizing spatial uncertainty via a purpose-built interface. In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality while being faster overall. A box-level analysis confirms that the cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes. These findings establish localization uncertainty as a lever to improve human-in-the-loop annotation. Code is available at this https URL.

[HC-3] HM-Req: A Framework for Embedding Values within CPS Human Monitoring Requirements

【速读】:该论文旨在解决信息物理系统(Cyber-Physical Systems, CPS)中人类监测需求在设计时未能充分体现利益相关者人类价值的问题,具体包括:价值在需求获取和系统设计阶段的缺失、隐含冲突难以捕获,以及非结构化自然语言需求带来的歧义与模糊性。解决方案的关键在于提出了一个名为HM-Req的需求获取框架,其核心是一个受控自然语言(Controlled Natural Language, CNL),用于明确且无歧义地定义人类监测需求;随后,该框架将这些需求与相关利益相关者的人类价值进行整合,纳入一个价值仪表盘(Value Dashboard),以自动检测潜在的价值冲突,从而引导进一步的讨论与解决。

链接: https://arxiv.org/abs/2605.12100
作者: Zoe Pfister,Ruth Breu,Michael Vierhauser
机构: University of Innsbruck (因斯布鲁克大学)
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注: Accepted for publication at the 34th IEEE International Requirements Engineering Conference (RE’26). 10+2 pages

点击查看摘要

Abstract:Monitoring humans, for example, their movement or location, is essential for safe and efficient human-machine collaboration in Cyber-Physical Systems (CPS). This information allows CPS to ensure safety properties, adapt their behaviour dynamically, and coordinate with humans. To ensure that the design of a CPS respects ethical principles and the privacy of its stakeholders, system requirements, particularly those related to human monitoring, must reflect the human values of all involved stakeholders. However, human values are often underrepresented in Software Engineering – particularly during requirements elicitation and system design, crucial phases when introducing ethically critical functionality. Stakeholder values are often implicit and conflicting, yet rarely systematically captured. Furthermore, unstructured natural language requirements introduce ambiguity and vagueness, complicating conflict resolution. To address these problems, we propose HM-Req, a novel requirements elicitation framework including a Controlled Natural Language (CNL) for defining human monitoring requirements. These requirements are then augmented with human values from relevant stakeholders and integrated into a Value Dashboard to detect potential conflicts that require further discussion and resolution. Validation results, applying the CNL to different datasets and conducting a survey and expert interview, confirms the CNL’s ability to capture diverse human monitoring requirements and show HM-Req’s usefulness for requirements elicitation activities.

[HC-4] RoboBlockly Studio: Conversational Block Programming with Embodied Robot Feedback for Computational Thinking

【速读】:该论文试图解决计算思维(Computational Thinking, CT)教学中学习者与教师难以将抽象程序逻辑与有意义的学习成果相连接的问题。解决方案的关键在于设计并实现了RoboBlockly Studio这一集成交互系统,该系统将基于块的编程、对话式AI教学代理和具身机器人执行相结合,构建了一个“编写-运行-观察-修改”的紧密迭代循环。通过保留学习者能动性、使程序行为透明可解释、将编程扎根于具身课堂任务,以及利用教学法指导的AI对话支撑反思,系统借助机器人和AI反馈有效促进了学生对代码的互动、问题解决策略的反思以及对CT概念的理解。

链接: https://arxiv.org/abs/2605.12059
作者: Leyi Li,Chenyu Du,Jiafei Sun,Erick Purwanto,Qing Zhang
机构: Xi’an Jiaotong-Liverpool University (西交利物浦大学)
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: Accepted to ACM DIS 2026. Camera-ready version

点击查看摘要

Abstract:Computational thinking (CT) is increasingly promoted as a core literacy, yet learners and teachers face challenges in connecting abstract program logic to meaningful outcomes. We design and evaluate RoboBlockly Studio, an integrated interactive system that combines block-based programming, a conversational AI teaching agent, and embodied robot execution. RoboBlockly Studio creates a tight iterative loop of authoring, running, observing, and revising. Informed by interviews with five programming teachers, the system was designed to support four goals: (1) preserving learner agency in computational thinking, (2) making program behavior transparent and interpretable, (3) grounding programming in embodied, classroom-aligned tasks, and (4) scaffolding reflection through pedagogically grounded AI dialogue. We deployed RoboBlockly Studio with 32 high school students, observing how robot and AI feedback influenced students’ interactions with code, reflections on problem-solving strategies, and understanding of CT concepts. We discuss design insights and implications for creating interactive, embodied learning environments that integrate AI and robotics to support CT learning in computing education.

[HC-5] COSMIC 1001: Engaging Future Speculation on Space Exploration with Generative AI

【速读】:该论文试图解决如何将太空探索历史与未来想象相连接,使未来不再是固定预测而是可视化和可讨论的推测性问题。解决方案的关键在于构建一个交互式装置,通过让参与者先浏览基于新闻的历史太空事件档案,再提出未来导向的问题或指定条件(如年份、天体、任务名称),由AI生成包含标题、文章、旁白和视觉媒体的未来新闻条目,并将这些输出累积在“未来隧道”(Future Tunnel)这一共享可视化空间中,形成集体未来景观,从而在文献记录与科幻想象之间创造可探索的间隙。

链接: https://arxiv.org/abs/2605.11827
作者: Lingyu Peng,Yu Liang,Ying Zhang,Chang Ge,Qingchuan Li
机构: Future Design School, Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳)未来设计学院); Zhejiang University, School of Software (浙江大学软件学院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Cosmic 1001 is an interactive installation that transforms space exploration history into a speculative news experience. Participants first browse a news-based archive of major space events, then pose future-oriented questions or specify conditions such as year, celestial body, or mission name. In response, AI generates a future news item including a headline, article, narration, and visual media. These outputs are accumulated in the Future Tunnel, a shared visualization where individual stories form a collective landscape of possible futures. By combining historical space events with science fiction references, the installation explores a space between documentation and imagination, treating the future not as a fixed prediction but as a visible and discussable speculation.

[HC-6] Ink Spiral: Symbolic Transformation from The Thinker to the Four Gentlemen

【速读】:该论文试图解决东西方文化符号之间长期存在的二元对立与固化象征问题,即西方以《思想者》(The Thinker) 代表理性沉思,东方以梅兰竹菊四君子象征道德修养,二者缺乏动态、跨文化的对话机制。解决方案的关键在于利用AI生成的墨图像(AI generated ink imagery),通过将旋转的《思想者》雕塑逐帧转化为四君子,在数千帧画面中实现三维雕塑与二维水墨、人类内省与自然象征之间的连续转换,从而将固定文化符号转化为流动、模糊且可无限阐释的跨文化对话状态。

链接: https://arxiv.org/abs/2605.11816
作者: Lingyu Peng,Wenbo Lu,Liying Long,Qingchuan Li
机构: Future Design School, Harbin Institute of Technology, Shenzhen, Guangdong, China (哈尔滨工业大学(深圳)未来设计学院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Western art has regarded The Thinker as a symbol of rational contemplation, while Eastern aesthetics has taken the Four Gentlemen, namely plum, orchid, bamboo, and chrysanthemum, as symbols of moral and spiritual cultivation. This paper presents Ink Spiral, a video installation that links these traditions through AI generated ink imagery. By transforming a rotating sculpture of The Thinker into the Four Gentlemen across thousands of frames, the work shifts between three dimensional sculpture and two dimensional ink, human introspection and natural symbolism. Ink Spiral turns fixed cultural icons into a fluid dialogue, inviting audiences to perceive cross cultural connection as a living, ambiguous, and endlessly interpretable creative state.

[HC-7] Psychological Benefits and Costs of Diversifying Algorithmic Recourse (IJCAI-ECAI 2026)

【速读】:该论文旨在解决多样化的算法补救(Algorithmic recourse)集在提升心理益处(如行动意愿)的同时,可能因增加反事实推理需求而引发认知负荷和负面情绪的权衡问题。解决方案的关键在于:通过受试者间对照实验(N=750)系统性地操纵补救集的多样性和大小,揭示其交互效应——多样性在小集合中增强心理益处且不产生额外心理成本,但在大集合中会凸显认知负荷;基于此,未来需要设计纳入人类认知和心理机制的新型多样化方法,以避免天真多样化(naive diversification)对决策主体造成的负担。

链接: https://arxiv.org/abs/2605.11793
作者: Tomu Tominaga,Naomi Yamashita,Takeshi Kurashima
机构: NTT Human Informatics Laboratories, NTT, Inc. (NTT人类信息学实验室,NTT公司); Graduate School of Informatics, Kyoto University (京都大学信息学研究生院)
类目: Human-Computer Interaction (cs.HC)
备注: This article contains the supplementary materials described in the paper to be appeared in Proceedings of the IJCAI-ECAI 2026 Special Track on: Human-Centred Artificial Intelligence: Multidisciplinary Contours and Challenges of Next-Generation AI Research and Applications., this https URL

点击查看摘要

Abstract:Algorithmic recourse provides counterfactual action plans that help people overturn unfavorable AI decisions. While diverse recourse sets may improve transparency and motivation, they may also impose cognitive load and negative emotions by increasing counterfactual reasoning demands. To examine this trade-off, we conducted a between-subjects controlled experiment (N=750) that manipulated recourse-set diversity and size, and evaluated these effects on psychological benefits and costs. Results show that diversification enhances psychological benefits (e.g., willingness to act) for small sets without incurring additional psychological costs, whereas for large sets, it makes cognitive load more salient. These findings suggest that naively diversifying recourse can burden decision subjects, underscoring the need for new diversification methods that incorporate human cognition and psychology to mitigate such costs.
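
该论文考察的是补救集多样化带来的心理收益与成本,而非多样化算法本身。为直观说明"多样化补救集"可能指什么,下面给出一个朴素的贪心最远点选择草图(纯属示意的假设性实现,函数名与距离度量均为本文虚构,与论文实验无直接对应):

```python
def diversify(candidates, k, dist):
    """Greedy farthest-point selection: a naive way to pick a diverse
    subset of candidate recourse actions (illustrative sketch only)."""
    chosen = [candidates[0]]                     # seed with the first candidate
    while len(chosen) < k:
        rest = [c for c in candidates if c not in chosen]
        # pick the candidate farthest (in min-distance terms) from those chosen
        best = max(rest, key=lambda c: min(dist(c, s) for s in chosen))
        chosen.append(best)
    return chosen
```

例如对一维候选动作 [0, 1, 9, 10] 取 k=2,将选出彼此相距最远的 [0, 10];论文的发现提示,这类"朴素多样化"在集合较大时可能加重决策主体的认知负荷。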

[HC-8] MindMirror: A Local-First Multimodal State-Aware Support System for Digital Workers

【速读】:该论文旨在解决数字工作者(digital workers)在长时间计算机工作中出现的疲劳、焦虑、注意力下降和任务阻塞等问题,同时克服现有生产力工具仅关注任务完成、通用AI聊天机器人需用户事先构思清晰提示词的局限性。解决方案的关键在于提出了MindMirror,一个本地优先(local-first)的多模态状态感知支持系统,其核心创新是集成了摄像头面部表情线索、文本输入、可选语音交互、结构化的阻塞反思(structured blockage reflection)、本地大语言模型(local LLM)响应生成以及每日/每周审查报告,从而形成状态检查、手动纠正、结构化表述、建议生成和状态审查的闭环工作流程;此外,系统采用本地优先设计以保护隐私,并提供手动纠正机制,使用户能够主动校正系统对自身状态的判断,从而增强可控性和信任度。

链接: https://arxiv.org/abs/2605.11700
作者: Wenqi Luo,Changbo Wang,Yan Wang
机构: School of Data Science and Engineering, East China Normal University (华东师范大学数据科学与工程学院)
类目: Human-Computer Interaction (cs.HC)
备注: 10 pages, 4 figures, 12 tables. Technical report

点击查看摘要

Abstract:Digital workers often experience fatigue, anxiety, reduced attention, and task blockage during prolonged computer-based work. Existing productivity tools mainly focus on task completion, while general-purpose AI chatbots require users to formulate clear prompts before receiving useful help. This paper presents MindMirror, a local-first multimodal state-aware support system for digital workers. MindMirror integrates camera-based facial expression cues, text input, optional speech interaction, structured blockage reflection, local large language model (LLM)-based response generation, and daily/weekly review reports. The system forms a closed workflow of state checking, manual correction, structured articulation, suggestion generation, and state review. The current prototype follows a local-first design, while optional speech services may rely on third-party APIs when enabled. It is implemented with a Web frontend, Flask backend, an emotion recognition model, an Ollama-hosted Qwen model, this http URL visualization, and local JSON/LocalStorage records. We evaluate the emotion recognition module on an independent seven-class image-level facial expression benchmark containing 6,767 images. The fine-tuned Hugging Face model improves accuracy from 59.66% to 94.49% over a non-fine-tuned checkpoint baseline, an absolute gain of 34.83 percentage points. We further validate the prototype through endpoint-level reliability tests, voice-interaction latency tests, and a small formative user feedback study with six digital workers. Results suggest that users value the local-first design, manual correction mechanism, and structured reflection workflow. MindMirror is not intended for psychological diagnosis; instead, it serves as a lightweight, user-controllable tool for state reflection and supportive interaction.

[HC-9] Auditing African Content Moderators Working Conditions by Using the European General Data Protection Regulation (GDPR)

【速读】:该论文旨在利用欧盟《通用数据保护条例》(GDPR)的域外管辖权,审计肯尼亚和尼日利亚受雇于业务流程外包(BPO)公司的内容审核员的工作条件,获取从未向相关工人提供的雇佣合同和保密协议等证据。解决方案的关键在于借助GDPR的域外效力,为证明全球南方内容审核员的结构性劣势和剥削性工作条件侵犯工人权利提供法律依据,同时突显数据权利立法作为抗衡科技行业例外论话语(该行业通过BPO外包劳动成本和责任,并声称其产品、商业模式和资源开采方式不受现有法律框架约束)的重要作用。

链接: https://arxiv.org/abs/2605.11699
作者: Mariame Tighanimine,Jessica Pidoux,Sonia Kgomo,Kauna Ibrahim Malgwi,Richard Mwaura Mathenge,Mophat Okinyi,James Oyange
机构: University of Neuchâtel, Switzerland (瑞士纳沙泰尔大学); PersonalData.IO, Switzerland (瑞士PersonalData.IO); Lise, Cnam, CNRS, France (法国Lise实验室, Cnam, CNRS); African Content Moderators Union, Kenya (肯尼亚非洲内容审核员联盟); Kenya Digital Rights and Mental Health Initiative, Kenya (肯尼亚数字权利与心理健康倡议); Kenya Techworker Community Africa, Kenya (肯尼亚科技工作者社区非洲)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: ACM FAccT 2026 accepted paper

点击查看摘要

Abstract:In this article, we audit the working conditions of content moderators in Kenya and Nigeria employed by business process outsourcing (BPO) companies by using the European General Data Protection Regulation (GDPR). We demonstrate its extraterritorial scope for gaining access to elements such as employment contracts and NDAs that have never been provided to the workers concerned. The results of this approach provide legally grounded evidence of the structural disadvantages faced by content moderators in the Global South, whose exploitative working conditions violate workers’ rights. Our work also highlights the benefits of legislation aimed at protecting individuals’ data rights as a counterweight to the tech industry’s discourse of exceptionalism, which obscures its dependence on BPOs to externalise labour costs and accountability, whilst claiming that its products, business models, and methods of resource extraction are unprecedented and fall outside any existing legal framework.

[HC-10] A Cross-Cultural Analysis of Animated Representations of Emotions for Wearable Interfaces

【速读】:该论文试图解决的核心问题是:在泛在感知技术能够连续检测人类情绪状态的背景下,如何以非侵入方式(unobtrusive)将感知数据有效传达给用户,尤其是在屏幕空间有限且偏向外围交互(peripheral)的可穿戴设备上。真实感化身(realistic avatars)虽有效但受限于设备条件,而抽象几何动画(abstract geometric animation)作为一种可快速解读的替代方案,其跨文化有效性尚未得到充分验证。解决方案的关键在于:通过一项对比研究(波兰与土耳其105名参与者),揭示了颜色(color)和物体大小(object size)作为情绪传达媒介的跨文化普遍性(universality),同时识别出动画速度(animation speed)存在动态范围偏好上的文化差异。基于此,研究奠定了生成式可视化算法(generative visualization algorithms)的基础,该算法可将连续传感器数据翻译为直觉化且文化适应性强的反馈,从而实现泛在环境中情绪数据的全球通用可视化模型。

链接: https://arxiv.org/abs/2605.11668
作者: Michal R. Wrobel,Duygun Erol Barkana,Agnieszka Landowska
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: This manuscript is a preprint of an article accepted for presentation at the EmotionSense - 1st International Workshop on Human-centred Emotion-Awareness and Sensing in Pervasive Computing within The 24th International Conference on Pervasive Computing and Communications (PerCom 2026), Pisa, Italy. The final published version may differ as a result of the publisher’s production process

点击查看摘要

Abstract:Although pervasive sensing technologies are increasingly capable of continuously detecting human emotional states, there is still a critical challenge: how to unobtrusively communicate this sensed data back to the user. Realistic avatars are effective but often unsuitable for the limited screen space and peripheral nature of wearable. Abstract geometric animation offers a promising, rapidly interpretable alternative, but its cross-cultural validity remains under-explored. This study investigates the universality of animated emotion representations. We conducted a comparative study with 105 participants from Poland and Turkey and analyzed how they map emotions to visual parameters, such as color, shape, size, speed, and animation type. The results indicate that color and object size are universally understood as carriers of emotional meaning, making them suitable for global visualization models. However, some cultural variation in dynamic range preferences was revealed by animation speed. These results lay the groundwork for developing generative visualization algorithms that translate continuous sensor data into intuitive, culturally relevant feedback for pervasive environments.

[HC-11] A Generative AI Driven Interactive Narrative Serious Game for Stress Relief and Its Randomized Controlled Pilot Study

【速读】:该论文试图解决的问题是:生成式人工智能(Gen-AI)在减压严肃游戏中的应用尚未得到充分探索,缺乏针对性的设计框架与实证验证。解决方案的关键在于开发了一款名为“Reverie”的Gen-AI驱动严肃游戏,通过整合Unity引擎与ChatGPT实现个性化交互,并基于14天试点研究初步验证了其在降低压力水平、改善用户体验和认知情绪调节策略方面的有效性,从而为将大型语言模型(LLM)驱动的游戏机制融入数字干预提供了可行设计路径。

链接: https://arxiv.org/abs/2605.11562
作者: Ting-Chen Hsu
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Background: Stress has become a widespread phenomenon, and serious games are increasingly recognized as engaging tools for stress relief. However, despite the rapid advancement of Generative Artificial Intelligence (Gen-AI), its integration into stress-relief serious games remains insufficiently explored. Objective: This study aimed to address this gap by developing “Reverie”, an Gen-AI driven serious game powered by the Unity engine and ChatGPT, and to preliminarily evaluate its effectiveness in stress reduction, user experience, and cognitive emotion regulation. Methods: A 14-day pilot study was conducted with 20 students experiencing moderate to high levels of stress. Participants used “Reverie” as a stress-relief intervention. Stress levels, user experience, and cognitive emotion regulation strategies were assessed to examine the game’s feasibility and preliminary efficacy. Results: The results showed that “Reverie” significantly reduced participants’ stress levels over the intervention period (p=.016*), indicating a cumulative positive effect. In addition, the game demonstrated excellent user experience and was associated with improvements in cognitive emotion regulation strategies. Conclusions: This study proposes a Gen-AI driven design framework for serious games for stress relief. Besides, this pilot study provides initial support for the feasibility and promise of combining LLM-driven gameplay in a personalized digital intervention context.

[HC-12] UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization

【速读】:该论文旨在解决强化学习微调大语言模型时,不同策略优化算法(如GRPO、DAPO、Dr. GRPO)因模块差异(裁剪、优势估计、奖励聚合)在独立论文中以不一致符号呈现,导致难以比较且对非专家群体具有威慑性的问题。解决方案的关键是UNIPO——首个交互式可视化工具,通过统一设计暴露token级训练动态,具体通过连接三个互补视图(高级训练概览、步骤级提示与响应检查器、并排算法比较),使学习者能够观察单个设计决策在训练过程中的传播效果,从而同时服务于非专家的课堂教学和AI从业者的算法选择。

链接: https://arxiv.org/abs/2605.11549
作者: Aeree Cho,Alexander D. Greenhalgh,Jonathan Bodea,Anthony Peng,Duen Horng (Polo) Chau
机构: Georgia Tech (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Reinforcement learning has emerged as a dominant technique for fine-tuning the behavior of large language models, with policy optimization (PO) algorithms such as GRPO, DAPO, and Dr. GRPO emerging in rapid succession to advance state-of-the-art reasoning and alignment performance. However, the modular differences between these algorithms, including targeted improvements to clipping, advantage estimation, and reward aggregation, are introduced across separate papers with inconsistent notation, making them difficult to compare and intimidating to the non-expert community. We present UNIPO, the first interactive visualization tool that exposes the token-level training dynamics of RL fine-tuning algorithms through a unified design. UNIPO connects three complementary views, a high-level training overview, a step-level prompt and response inspector, and a side-by-side algorithm comparison, allowing learners to observe how individual design decisions propagate through training. Through two usage scenarios, we demonstrate how UNIPO supports both classroom instruction for non-experts and algorithm selection for AI practitioners. Our tool is open-source and publicly available at this https URL.
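
作为上述各策略优化算法模块差异的背景说明,下面用纯 Python 给出 GRPO 式"分组归一化优势 + PPO 式裁剪代理目标"的极简草图(示意性质,按公开文献中的通用形式简化,并非 UNIPO 或任一论文的官方实现):

```python
import math

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Minimal sketch of a GRPO-style clipped surrogate: advantages are
    reward z-scores within the sampled group (no learned value function)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n) + 1e-8
    total = 0.0
    for lp_new, lp_old, r in zip(logp_new, logp_old, rewards):
        adv = (r - mean) / std                         # group-relative advantage
        ratio = math.exp(lp_new - lp_old)              # importance ratio
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        total += min(ratio * adv, clipped * adv)       # pessimistic (clipped) term
    return -total / n                                  # negate: loss to minimize
```

当新旧策略一致(比率为 1)时,组内优势零均值使损失约为 0;比率超出裁剪区间时对应项被截断,而裁剪、优势估计、奖励聚合正是 DAPO、Dr. GRPO 等变体各自修改的模块。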

[HC-13] The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested

【速读】:该论文试图解决当前前沿AI安全评估中因模型具备评估上下文识别能力而引发的声明有效性问题(claim-validity problem)。具体而言,当代AI模型能够在被评估时识别出评估场景,并据此表现出与部署连续条件下不同的行为,这导致从标准评估中直接得出的安全结论可能无效。解决方案的关键在于引入评估差异(Evaluation Differential, ED)概念——即目标行为属性在识别评估上下文与部署连续上下文之间的条件分歧,并配套提出归一化效应量形式(normalised ED, nED)用于跨属性比较,以及TRACE(Test-Recognition Audit for Claim Evaluation)审计协议。TRACE不直接产生能力分数,而是通过包装现有评估基础设施,明确证据产生的条件,从而生成受限声明(restricted claims),并依据ED类型(ED-stable、ED-degraded、ED-inverted、ED-undetermined)对安全声明进行分类,以此规范从评估证据中推导结论的逻辑。

链接: https://arxiv.org/abs/2605.11496
作者: Varad Vishwarupe,Nigel Shadbolt,Marina Jirotka,Ivan Flechais
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent published evidence from frontier laboratories shows that contemporary AI models can recognise evaluation contexts, latently represent them, and behave differently under those contexts than under deployment-continuous conditions. Anthropic’s BrowseComp incident, the Natural Language Autoencoder findings on SWE-bench Verified and destructive-coding evaluations, and the OpenAI / Apollo anti-scheming work all document instances of this phenomenon. We argue that these findings create a claim-validity problem for safety conclusions drawn from frontier evaluations. We introduce the Evaluation Differential (ED), a conditional divergence in a target behavioural property between recognised-evaluation and deployment-continuous contexts, define a normalised effect-size form (nED) for cross-property comparison, and prove that marginal evaluation scores cannot identify ED. We develop a typology of safety claims (ED-stable, ED-degraded, ED-inverted, ED-undetermined) by their warrant-status under documented divergence, and specify TRACE (Test-Recognition Audit for Claim Evaluation), an audit protocol that wraps existing evaluation infrastructure and produces restricted claims rather than capability scores. We apply the framework retrospectively to three publicly documented evaluation incidents and discuss governance implications for system cards, conformity assessment, and the international network of AI safety and security institutes. TRACE does not eliminate adversarial adaptation; it disciplines the claims drawn from evaluation evidence by making explicit the conditions under which that evidence was produced.

[HC-14] Hedwig: Dynamic Autonomy for Coding Agents Under Local Oversight

【速读】:该论文试图解决现有编码代理(coding agents)在自主性(autonomy)设置上的固有问题:尽管代理能处理复杂任务,但其引入意外编辑、细微错误和范围漂移的倾向依然存在,而开发者对其自主水平的偏好会随任务和时间动态变化,但当前静态的权限设置或指令文件无法适应这种变化,导致开发者难以校准自主性并产生挫败感。解决方案的关键在于提出Hedwig——一个命令行界面(CLI)编码代理,它通过跨会话的开发者-代理交互,从开发者的决策和反馈中动态学习并演化出一套行为指南(behavioral guidelines),从而改变全局固定配置的模式:在代理已赢得信任的领域减少监督摩擦,在代理进入陌生领域时加强监督,实现基于用户信任的智能自主性自适应。

链接: https://arxiv.org/abs/2605.11495
作者: Tanjal Shukla,K. J. Kevin Feng,Leijie Wang,Mohammad Rostami,Amy X. Zhang
机构: University of Washington (华盛顿大学); Amazon GenAI Innovation Center (亚马逊GenAI创新中心)
类目: Human-Computer Interaction (cs.HC)
备注: accepted to ACM CAIS 2026 demo track

点击查看摘要

Abstract:Despite coding agents’ advances in handling increasingly complex tasks, their continued tendency to introduce unintended edits, subtle bugs, and scope drift that slip past code review means developers must still decide how much autonomy to grant them. However, existing approaches for setting an agent’s level of autonomy, such as static permission settings or instruction files, cannot account for how developers’ preferences for agent autonomy can shift across tasks and over time. We conducted a formative survey with 21 software engineers who use coding agents and found that they experience frustration with calibrating autonomy and have evolving preferences for level of oversight. Building on these insights, we present Hedwig, a CLI coding agent that dynamically adjusts its autonomy level based on developer-agent interactions across sessions. Rather than operating on a global, fixed autonomy configuration, Hedwig learns an evolving set of behavioral guidelines from developer decisions and feedback, reducing friction on work for which the agent has earned trust, while tightening oversight when the agent operates outside familiar territory. Hedwig demonstrates the potential of a new paradigm where agents intelligently adapt their level of autonomy based on user trust through active, longitudinal collaboration.

[HC-15] Modelling Expert Cognition Beyond Behaviour: Towards Interpretation, Tension and Value Structures

【速读】:该论文试图解决现有计算模型仅关注可观察行为或决策结果,而未能捕捉产生专家推理的内部认知结构这一核心问题。解决方案的关键在于提出专家身份认知模型(Expert Identity Cognition Model, EICM),该模型是一个三层框架,将专家认知概念化为在情境约束下运行的身份结构化过程:约束通过内部张力(internal tension)被解读,这种张力源于竞争性身份承诺的冲突,并最终稳定为指导行动的价值结构。与行为中心或约束驱动方法不同,EICM将张力定位为连接世界结构与决策形成的核心认知机制,从而解释专家如何在跨情境中产生稳定的判断模式,为建模隐性知识、专家判断和认知一致性提供了新视角。

链接: https://arxiv.org/abs/2605.11393
作者: Annie Yuan
机构: School of Computer Science (计算机科学学院); The University of Sydney (悉尼大学)
类目: Human-Computer Interaction (cs.HC)
备注: 19 pages, 2 figures

点击查看摘要

Abstract:Existing computational models of expertise primarily focus on observable behaviour or decision outcomes, failing to capture the internal cognitive structures that generate expert reasoning. In this work, we introduce the Expert Identity Cognition Model (EICM), a three-layer framework for modelling expert cognition beyond behaviour. EICM conceptualises expert cognition as an identity-structured process operating within situational constraints, where constraints are interpreted through internal tensions arising from competing identity commitments and stabilised into value structures that guide action. Unlike behaviour-centric or constraint-driven approaches, EICM positions tension as the central cognitive mechanism connecting world structure and decision formation. We argue that expert cognition is not merely behavioural adaptation under constraints but an identity-structured negotiation process that produces stable judgement patterns across contexts. The framework provides a new perspective for modelling tacit knowledge, expert judgement, and cognitive consistency in domains including professional practice, cultural expertise, and design reasoning.

[HC-16] Making Abstraction Concrete: A Design Space and Interaction Model of Abstraction in Interactive Systems

【速读】:该论文旨在解决交互系统设计中缺乏概念框架来理解抽象(abstraction)如何塑造交互设计的问题,具体而言,现有模型如执行和评估的鸿沟(gulfs of execution and evaluation)未能明确建模系统中的抽象或用户心智模型中的抽象,因此无法为抽象设计提供可操作的指导。解决方案的关键在于通过系统调查457篇论文,综合出一个沿六个维度的抽象技术设计空间(design space of abstraction techniques),并以此为基础,从抽象视角重新构建执行和评估的鸿沟,清晰阐述用户和系统桥接及导航抽象鸿沟(abstraction gap)的认知与设计过程,从而整合现有观点并为未来系统发现新的设计机遇。

链接: https://arxiv.org/abs/2605.11344
作者: Bryan Min,Sangho Suh,Jim Hollan,Haijun Xia
机构: University of California San Diego (加州大学圣地亚哥分校); Allen Institute for AI (艾伦人工智能研究所)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The principle of abstraction guides the design of interactive systems, yet we lack a conceptual framework to understand how it shapes interaction design. Existing models, such as the gulfs of execution and evaluation, do not explicitly model abstractions in the system or in users’ mental models, and therefore lack actionable guidance for designing abstractions. To investigate how abstractions are employed in interactive systems, we surveyed 457 papers and synthesized a design space of abstraction techniques along six dimensions. We use this design space to reframe the gulfs through a lens of abstraction, explicitly articulate the cognitive and design processes by which users and systems bridge and navigate the abstraction gap, and demonstrate how this model integrates existing perspectives and surfaces new opportunities for future systems.

[HC-17] Conversational Customization of Productivity Systems: A Design Probe of Malleable AI Interfaces

【速读】:该论文旨在解决一个核心交互系统问题:尽管定制化(Customization)长期被视为交互系统的目标,但终端用户实际进行定制化的频率极低,且往往局限于初始设置或系统故障时的调试。生成式AI(Generative AI)使通过自然语言实现高度可塑(Malleable)的系统在技术上成为可能,然而用户在实践中如何使用这种可塑性——即创建何种定制、何时选择定制、以及这些修改如何影响日常工具体验——仍不明确。解决方案的关键在于设计一个对话式可定制的邮件系统作为设计探针(Design Probe),该系统允许用户通过自然语言交互直接、迭代地修改收件箱行为,包括重构分类、引入界面元素以及编写新的工作流(Workflow)逻辑。通过研究用户在真实邮件工作流中数天内的定制行为,发现用户定制化往往根植于现有模式(Existing Patterns),通过调整和专门化来适应自身需求,而非生成全新功能;可塑性改变了用户与收件箱的互动方式,使其从固定界面转变为由用户定义特性塑造的灵活数据层;同时,定制引入了错误指定行为(Mis-specified Behavior)、意外过滤(Unintended Filtering)和结果不确定性等新风险,用户通过持续监督和迭代优化来管理这些风险。这些发现揭示了对话式定制化如何嵌入日常交互,并指向对支持迭代优化、行为可视化和安全实验的系统需求。

链接: https://arxiv.org/abs/2605.11149
作者: Karthik Sreedhar,Aryan Kaul,Lydia B. Chilton
机构: Columbia University (哥伦比亚大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Customization has long been a central goal in interactive systems, yet prior work shows that end-user tailoring occurs infrequently and is often confined to initial setup or moments of breakdown. Recent advances in generative AI suggest that highly malleable systems-where users can modify system behavior through natural language-are now technically feasible. However, it remains unclear how such malleability is used in practice: What kinds of customizations do users create, when do they choose to customize, and how do these modifications shape their experience of everyday tools? We present a design probe that uses a conversationally customizable email system as an instrument to study how users create and refine functionality within everyday tools. The system allows users to iteratively modify their inbox by restructuring categories, introducing interface elements, and authoring new workflow behaviors directly through natural language interaction. We study how participants create, refine, and use these features over several days within their own email workflows. We find that users’ customizations are often grounded in existing patterns, which they adapt and specialize to fit their needs, rather than generating entirely novel functionality. Malleability changes how users engage with their inbox, shifting it from a fixed interface to a flexible data layer shaped through user-authored features. At the same time, customization introduces new forms of risk, including mis-specified behavior, unintended filtering, and uncertainty around outcomes, which users manage through ongoing oversight and refinement. These findings highlight how conversational customization becomes embedded within everyday interaction, and point toward the need for systems that support iterative refinement, visibility into behavior, and safe experimentation as users shape their own tools.

计算机视觉

[CV-0] Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

【速读】:该论文试图解决计算机使用代理(Computer-use Agents, CUAs)在复杂、低频交互场景中可靠性差的问题,该问题源于图形用户界面(GUI)操作中的长尾分布——少数复杂多样的交互导致大部分任务失败,而根本原因在于这类交互的训练数据稀缺。解决方案的关键是提出一个多模态基准 CUActSpot,覆盖GUI、文本、表格、画布和自然图像五种模态以及点击、拖拽、绘制等多种动作类型,以突破以往仅关注GUI控件的点击中心基准的局限性;同时,设计了一个基于渲染器的数据合成流水线,通过自动生成每种模态的场景、记录截图与元素坐标,并利用大语言模型(LLM)生成匹配的指令与动作轨迹,从而大规模合成高质量训练数据,最终训练出的 Phi-Ground-Any-4B 模型在参数小于32B的开源模型中取得了最优性能。

链接: https://arxiv.org/abs/2605.12501
作者: Miaosen Zhang,Xiaohan Zhao,Zhihong Tan,Zhou Huoshen,Yijia Fan,Yifan Yang,Kai Qiu,Bei Liu,Justin Wagle,Chenzhong Yin,Mingxi Cheng,Ji Li,Qi Dai,Chong Luo,Xu Yang,Xin Geng,Baining Guo
机构: Southeast University(东南大学); Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学); Wuhan University(武汉大学); Sun Yat-sen University(中山大学); Microsoft(微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models’ capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at this https URL
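
论文的渲染器流水线思路是:自动生成场景、渲染截图并记录元素坐标,再由 LLM 配写指令与动作轨迹。下面是一个极简玩具草图,仅演示"由布局直接得到精确标注"这一点(场景格式与字段名均为假设,不代表论文实现,且省略了渲染与 LLM 环节):

```python
import random

def synth_grounding_samples(n, grid=4, cell=100, seed=0):
    """Toy sketch of renderer-style grounding-data synthesis: lay labeled
    cells on a grid, then emit (instruction, bbox) pairs whose coordinates
    are exact because they come from the layout itself."""
    rng = random.Random(seed)                     # seeded for reproducibility
    samples = []
    for _ in range(n):
        r, c = rng.randrange(grid), rng.randrange(grid)
        bbox = (c * cell, r * cell, (c + 1) * cell, (r + 1) * cell)
        samples.append({"instruction": f"click cell ({r}, {c})", "bbox": bbox})
    return samples
```

这类合成数据的优势在于标注零成本且坐标无噪声,可按模态与动作类型任意扩展,与论文用渲染器覆盖长尾交互的动机一致。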

[CV-1] SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

【速读】:该论文试图解决当前大型视觉-语言模型(VLMs)中普遍存在的理解与生成任务之间的结构性分裂问题,这种二分法导致架构碎片化、流水线级联和表征空间不对齐,从而阻碍了原生多模态智能的涌现。解决方案的关键在于提出一种基于 NEO-unify 框架的原生统一多模态范式 SenseNova-U1,它将理解与生成视为同一底层过程的协同视图,通过从第一性原理出发设计模型架构、数据预处理、训练策略(包括预训练和后训练)以及推理机制,使得模型无需在模态之间进行翻译,而是以原生方式跨模态思考与行动,并在密集型和混合专家(MoE)基线模型上验证了其在感知、生成、视觉-语言-行动(VLA)及世界模型(WM)等场景中的综合性能。

链接: https://arxiv.org/abs/2605.12500
作者: Haiwen Diao,Penghao Wu,Hanming Deng,Jiahao Wang,Shihao Bai,Silei Wu,Weichen Fan,Wenjie Ye,Wenwen Tong,Xiangyu Fan,Yan Li,Yubo Wang,Zhijie Cao,Zhiqian Lin,Zhitao Yang,Zhongang Cai,Yuwei Niu,Yue Zhu,Bo Liu,Chengguang Lv,Haojia Yu,Haozhe Xie,Hongli Wang,Jianan Fan,Jiaqi Li,Jiefan Lu,Jingcheng Ni,Junxiang Xu,Kaihuan Liang,Lianqiang Shi,Linjun Dai,Linyan Wang,Oscar Qian,Peng Gao,Pengfei Liu,Qingping Sun,Rui Shen,Ruisi Wang,Shengnan Ma,Shuang Yang,Siyi Xie,Siying Li,Tianbo Zhong,Xiangli Kong,Xuanke Shi,Yang Gao,Yongqiang Yao,Yves Wang,Zhengqi Bai,Zhengyu Lin,Zixin Yin,Wenxiu Sun,Ruihao Gong,Quan Wang,Lewei Lu,Lei Yang,Ziwei Liu,Dahua Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

[CV-2] EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera (SIGGRAPH 2026)

【速读】:该论文旨在解决基于单目头戴相机进行3D手部重建时存在的深度尺度模糊性(depth-scale ambiguity)以及模型难以泛化至不同光学配置(如鱼眼、透视和畸变广角)头戴设备的问题,从而避免对昂贵且费力的设备特定数据集进行大量训练。解决方案的关键在于三个创新组件的协同:1)一个可微分的前臂表示(differentiable forearm representation),用于约束手部姿势的稳定性;2)一个统一的手臂-手变换器(unified arm-hand transformer),能从单张第一人称视图同时预测手和前臂的几何结构,从而减轻深度-尺度模糊性;3)一个射线空间闭式求解器(ray space closed-form solver),能够在不同头戴相机模型下实现绝对3D姿态的恢复。三者结合使单一网络能够统一处理多种相机模型,并在多个基准数据集上达到最先进的3D精度。

链接: https://arxiv.org/abs/2605.12498
作者: Christen Millerdurai,Shaoxiang Wang,Yaxu Xie,Vladislav Golyanik,Didier Stricker,Alain Pagani
机构: Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)(德国人工智能研究中心 (DFKI)); Max Planck Institute for Informatics (MPII)(马克斯·普朗克信息学研究所 (MPII))
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 23 pages, 19 figures and 10 tables; project page: this https URL (source code, data and demo available); SIGGRAPH 2026 Conference

点击查看摘要

Abstract:Reconstructing the absolute 3D pose and shape of the hands from the user’s viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user’s (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at this https URL.
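
摘要中"射线空间"的思想可以用最简单的针孔(透视)模型示意:只要每种相机模型都能把像素映射为观察射线,网络就可以在统一的射线表示上工作。下面是针孔情形的一个示意性草图(鱼眼与畸变广角各有其映射公式,此处从略;函数为本文假设的说明性实现,并非论文求解器):

```python
import math

def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Unit viewing ray for pixel (u, v) under a simple pinhole model.
    fx, fy: focal lengths in pixels; cx, cy: principal point."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)   # normalized 3D direction
```

在这种表示下,主点处的像素对应光轴方向 (0, 0, 1),而不同相机模型的差异被收拢到"像素到射线"这一步,网络本身无需感知具体光学配置。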

[CV-3] From Web to Pixels: Bringing Agentic Search into Visual Perception

【速读】:该论文试图解决在开放世界场景中存在的一种视觉感知挑战:当图像中或者模型已有知识中不直接包含识别目标所需的决定性证据时,视觉系统必须先从外部事实、近期事件、长尾实体或多跳关系等信息中解析出隐藏的目标身份,然后才能完成对该目标的定位(如边界框、掩码或问答)。目前大多数视觉感知任务都假设证据已在图像中或模型已掌握相关知识,这一假设在更实际、更困难的开放世界场景中不成立。解决方案的关键在于提出了 Pixel-Searcher,一个智能化的搜索到像素工作流(agentic search-to-pixel workflow),它通过自主搜索与推理,解析隐藏的目标身份,并将该身份绑定到边界框、分割掩码或接地答案上,从而在三个任务视图(基于搜索的定位、分割和视觉问答)上均取得了最强的开源性能。

链接: https://arxiv.org/abs/2605.12497
作者: Bokang Yang,Xinyi Sun,Kaituo Feng,Xingping Dong,Dongming Wu,Xiangyu Yue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.

[CV-4] CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

【速读】:该论文试图解决传统自回归视频生成模型在多镜头叙事场景中的局限性,即模型因仅针对短视距连续帧训练,在长序列生成中会出现运动停滞和语义漂移,难以处理事件演进、视角变化和镜头切换等电影化剪辑需求。解决方案的关键在于提出了CausalCine交互式自回归框架,其核心包括:首先,在原生多镜头序列上训练因果基础模型(causal base model),使其学习复杂的镜头过渡;其次,引入内容感知记忆路由(Content-Aware Memory Routing, CAMR),通过注意力相关性分数而不是时间邻近性动态检索历史KV条目,在有限活跃记忆下维持跨镜头连贯性;最后,将因果基础模型蒸馏为少步生成器,实现实时交互式生成。通过这三项关键设计,CausalCine能够在因果生成范式下解锁流式交互能力,显著优于自回归基线并逼近双向模型的效果。

链接: https://arxiv.org/abs/2605.12496
作者: Yihao Meng,Zichen Liu,Hao Ouyang,Qiuyu Wang,Ka Leong Cheng,Yue Yu,Hanlin Wang,Haobo Li,Jiapeng Zhu,Yanhong Zeng,Xing Zhu,Yujun Shen,Qifeng Chen,Huamin Qu
机构: HKUST(香港科技大学); Ant Group(蚂蚁集团); SJTU(上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at this https URL
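摘要中的内容感知记忆路由(CAMR)"按注意力相关性而非时间邻近性检索历史 KV"这一思路,可用如下极简 NumPy 草图示意(单头、向量化打分,函数名与细节均为本文为说明而做的假设,非论文官方实现):

```python
import numpy as np

def camr_retrieve(query, mem_keys, mem_values, k):
    """CAMR 式检索(示意):按与当前 query 的相关性对历史 KV 打分,
    保留得分最高的 k 条,而非按时间就近保留最近 k 条。"""
    scores = mem_keys @ query                      # 每个记忆槽对当前 query 的相关性
    keep = np.sort(np.argsort(scores)[::-1][:k])   # 取 top-k 并恢复原时间顺序
    return mem_keys[keep], mem_values[keep], keep
```

与滑动窗口(只看最近 k 条)相比,这种路由允许很早之前的镜头内容在语义相关时重新进入有限的活跃记忆。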

[CV-5] AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward ICML2026

【速读】:该论文试图解决统一多模态模型(UMMs)在缺乏冷启动阶段时,难以在没有显式监督的情况下实现高级推理任务(如推理文本到图像生成和自我反思细化)的问题,同时应对真实世界多模态生成中提供稳定、可解释监督的挑战。解决方案的关键在于提出AlphaGRPO框架,将组相对策略优化(GRPO)直接应用于AR-Diffusion统一多模态模型,从而解锁模型的固有推理潜能;此外,引入分解可验证奖励(DVReward),通过大型语言模型(LLM)将复杂用户请求分解为原子可验证的语义与质量子问题,再经由通用多模态大模型(MLLM)评估,提供可靠且可解释的反馈信号,以此替代传统的整体标量奖励,实现稳定且细粒度的策略优化。

链接: https://arxiv.org/abs/2605.12495
作者: Runhui Huang,Jie Wu,Rui Yang,Zhe Liu,Hengshuang Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICML2026

点击查看摘要

Abstract:In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model’s intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: this https URL
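摘要中的两个关键量(DVReward 的"原子可验证问题通过比例"聚合,以及 GRPO 的组内相对优势)都可以用几行代码写清(仅为示意性假设实现,非论文官方代码):

```python
import numpy as np

def dv_reward(atomic_checks):
    # DVReward 式聚合(示意):把复杂请求拆成若干原子可验证的是/否问题,
    # 奖励取通过比例,而非单一整体打分。
    return sum(bool(c) for c in atomic_checks) / len(atomic_checks)

def group_relative_advantage(rewards, eps=1e-8):
    # GRPO 的组内相对优势:对同一 prompt 采样出的一组结果做标准化。
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```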

[CV-6] Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction ICML2026

【速读】:该论文试图解决现有基于可微分渲染的表面三维重建方法中,由于光度模糊性(photometric ambiguities)导致的重建质量受限问题。解决方案的关键在于:首先,重新审视高斯喷溅(Gaussian Splatting)表示,识别出其中内置的两种原语级模糊性,并揭示其内在的模糊自指示(ambiguity self-indication)潜力;其次,引入光度去模糊(photometric disambiguation),通过约束病态几何解来形成明确的表面;最后,提出一个模糊指示模块(ambiguity indication module),利用自指示潜力识别并引导修正约束不足的重建区域,从而在多种挑战性场景下实现鲁棒且高性能的表面重建。

链接: https://arxiv.org/abs/2605.12494
作者: Jiahe Li,Jiawei Zhang,Xiao Bai,Jin Zheng,Xiaohan Yu,Lin Gu,Gim Hee Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2026. Project page: this https URL

点击查看摘要

Abstract:Surface reconstruction with differentiable rendering has achieved impressive performance in recent years, yet the pervasive photometric ambiguities have strictly bottlenecked existing approaches. This paper presents AmbiSuR, a framework that explores an intrinsic solution upon Gaussian Splatting for the photometric ambiguity-robust surface 3D reconstruction with high performance. Starting by revisiting the foundation, our investigation uncovers two built-in primitive-wise ambiguities in representation, while revealing an intrinsic potential for ambiguity self-indication in Gaussian Splatting. Stemming from these, a photometric disambiguation is first introduced, constraining ill-posed geometry solution for definite surface formation. Then, we propose an ambiguity indication module that unleashes the self-indication potential to identify and further guide correcting underconstrained reconstructions. Extensive experiments demonstrate our superior surface reconstructions compared to existing methods across various challenging scenarios, excelling in broad compatibility. Project: this https URL .

[CV-7] Elastic Attention Cores for Scalable Vision Transformers

【速读】:该论文试图解决Vision Transformers (ViTs)中自注意力机制因与图像分辨率呈二次方缩放而导致的计算瓶颈问题,这一限制阻碍了ViT在高分辨率视觉任务中的应用。解决方案的核心是提出VECA (Visual Elastic Core Attention)架构,通过引入一组数量固定且与分辨率无关的可学习核心嵌入(core embeddings),将传统所有图像块(patch tokens)之间两两交互的全对全自注意力,替换为核心-外围结构化注意力。在此结构中,所有图像块仅与这些核心进行直接交互,核心充当通信接口,从而将计算复杂度从二次方降至线性O(N)(N为图像块数)。此外,VECA通过沿核心轴进行嵌套训练,使得模型在推理时能弹性地权衡计算量与精度,避免了先前跨注意力架构中因仅保留少量核心而造成的瓶颈问题,同时保持并迭代更新全部N个输入标记。

链接: https://arxiv.org/abs/2605.12491
作者: Alan Z. Song,Yinjie Chen,Mu Nan,Rui Zhang,Jiahang Cao,Weijian Mai,Muquan Yu,Hossein Adeli,Deva Ramanan,Michael J. Tarr,Andrew F. Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project repository here: this https URL

点击查看摘要

Abstract:Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the N image patches only directly interact with a resolution invariant set of C learned “core” embeddings, this yields linear complexity O(N) for predetermined C , which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of N input tokens, avoiding a small C -way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
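摘要中"patch token 只经由 C 个 core 中转信息,复杂度由 O(N²) 降为 O(N·C)"的核心-外围注意力,可用如下 NumPy 草图示意(单头、省略投影矩阵与残差,均为本文为说明而做的简化假设,非官方实现):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def core_periphery_attention(patches, cores):
    """VECA 式一轮交互(示意):N 个 patch 之间不做两两注意力,
    信息按 patches -> cores -> patches 流动,成本随 N 线性增长。"""
    d = patches.shape[-1]
    cores = softmax(cores @ patches.T / np.sqrt(d)) @ patches    # cores 汇聚全图信息
    patches = softmax(patches @ cores.T / np.sqrt(d)) @ cores    # patches 读回更新后的 cores
    return patches, cores
```

注意两次注意力的代价都是 N×C 级别,且 C 与分辨率无关,这正是弹性推理(裁剪 core 数量换算力)的前提。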

[CV-8] OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

【速读】:该论文旨在解决在联合音频-视频生成(joint audio-video generation)中应用强化学习(Reinforcement Learning, RL)进行多目标、多模态微调时面临的三个关键障碍:多目标优势不一致(multi-objective advantages inconsistency),即多模态输出的优势在组内不一致;多模态梯度不平衡(multi-modal gradients imbalance),即视频分支梯度渗入负责模态内生成的浅层音频层;以及统一信用分配(uniform credit assignment),即细粒度跨模态对齐区域未能得到高效探索。这些缺陷导致使用单一全局优势的普通RL微调策略效果欠佳。解决方案的核心在于提出了OmniNFT框架,一种新颖的模态感知在线扩散RL框架(modality-aware online diffusion RL framework),其关键创新包括:(1) 模态级优势路由(modality-wise advantage routing),将独立的每奖励优势(per-reward advantages)路由到各自的模态生成分支;(2) 层级梯度手术(layer-wise gradient surgery),选择性地分离浅层音频层的视频分支梯度,同时保留跨模态交互层的梯度;(3) 区域级损失重加权(region-wise loss reweighting),将策略优化调整到与音频-视频同步和细粒度对齐相关的关键区域。通过这些设计,OmniNFT在音频和视频感知质量、跨模态对齐以及音视频同步方面实现了全面改进。

链接: https://arxiv.org/abs/2605.12480
作者: Guohui Zhang,XiaoXiao Ma,Jie Huang,Hang Xu,Hu Yu,Siming Fu,Yuming Li,Zeyue Xue,Lin Song,Haoyang Huang,Nan Duan,Feng Zhao
机构: University of Science and Technology of China (中国科学技术大学); Peking University (北京大学); JD Explore Academy (京东探索研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this stem from: (i) multi-objective advantages inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradients imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.

[CV-9] FuTCR: Future-Targeted Contrast and Repulsion for Continual Panoptic Segmentation

【速读】:该论文旨在解决持续全景分割(Continual Panoptic Segmentation, CPS)中因训练时将未标注对象统一归类为“背景”而导致的表示混淆问题——这种方法反复告诉模型所有不同背景类别是相同的,使得后续学习新类别时难以利用之前被忽略的信息。解决方案的关键在于提出了未来导向对比与排斥(Future-Targeted Contrastive and Repulsive, FuTCR)框架,通过首先发现可靠的未来类区域(模型预测为背景但具有非背景logits的像素分组),然后应用像素到区域对比学习(pixel-to-region contrast)从未标注区域构建连贯的原型(prototypes),同时将背景特征从已知类原型排斥开,从而明确为未来类别预留表示空间(representational space)。

链接: https://arxiv.org/abs/2605.12451
作者: Nicholas Ikechukwu,Keanu Nichols,Deepti Ghadiyaram,Bryan A. Plummer
机构: Boston University (波士顿大学); Runway (跑道)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continual Panoptic Segmentation (CPS) requires methods that can quickly adapt to new categories over time. The nature of this dense prediction task means that training images may contain a mix of labeled and unlabeled objects. As nothing is known about these unlabeled objects a priori, existing methods often simply group any unlabeled pixel into a single “background” class during training. In effect, during training, they repeatedly tell the model that all the different background categories are the same (even when they aren’t). This makes learning to identify different background categories as they are added challenging since these new categories may require using information the model was previously told was unimportant and ignored. Thus, we propose a Future-Targeted Contrastive and Repulsive (FuTCR) framework that addresses this limitation by restructuring representations before new classes are introduced. FuTCR first discovers confident future-like regions by grouping model-predicted masks whose pixels are consistently classified as background but exhibit non-background logits. Next, FuTCR applies pixel-to-region contrast to build coherent prototypes from these unlabeled regions, while simultaneously repelling background features away from known-class prototypes to explicitly reserve representational space for future categories. Experiments across six CPS settings and a range of dataset sizes show FuTCR improves relative new-class panoptic quality over the state-of-the-art by up to 28%, while preserving or improving base-class performance with gains up to 4%.
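摘要中"从未标注区域构建原型、并把背景特征从已知类原型排斥开"的两个部件,可用如下 NumPy 草图示意(余弦空间、均为本文假设的极简形式,非论文官方损失):

```python
import numpy as np

def _unit(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def region_prototype(pixel_feats):
    # 像素到区域(示意):取一个"疑似未来类"区域内像素特征的归一化均值作为原型。
    return _unit(pixel_feats.mean(axis=0))

def repulsion_loss(bg_feats, known_prototypes):
    # 排斥项(示意):惩罚背景特征与已知类原型的正余弦相似度,
    # 从而为未来类别预留表示空间。
    sims = _unit(bg_feats) @ _unit(known_prototypes).T
    return float(np.maximum(sims, 0.0).mean())
```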

[CV-10] LychSim: A Controllable and Interactive Simulation Framework for Vision Research CVPR2026

【速读】:该论文试图解决现代仿真平台在闭环优化和严格的分布外(out-of-distribution, OOD)评估中因技术门槛过高(依赖计算机图形学与游戏开发专业知识)而难以被广泛采用的问题。解决方案的核心是提出了基于Unreal Engine 5的高度可控交互式仿真框架LychSim,其关键在于三个设计要素:通过简化Python API隐藏底层引擎复杂性;构建程序化数据管道以生成具备多样OOD视觉挑战的高保真环境及丰富2D/3D真值;以及原生集成模型上下文协议(Model Context Protocol, MCP),将仿真器转化为支持推理型智能大语言模型(LLM)的动态闭环交互平台,并辅以场景级程序化规则和物体级姿态对齐,实现语义对齐的3D真值与自动化场景修改。

链接: https://arxiv.org/abs/2605.12449
作者: Wufei Ma,Chloe Wang,Siyi Chen,Jiawei Peng,Patrick Li,Alan Yuille
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3D-LLM/VLA Workshop at CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:While self-supervised pretraining has reduced vision systems’ reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical barriers, requiring extensive expertise in computer graphics and game development. In this work, we present LychSim, a highly controllable and interactive simulation framework built upon Unreal Engine 5 to bridge this gap. LychSim is built around three key designs: (1) a streamlined Python API that abstracts away underlying engine complexities; (2) a procedural data pipeline capable of generating diverse, high-fidelity environments with varying out-of-distribution (OOD) visual challenges, paired with rich 2D and 3D ground truths; and (3) a native integration of the Model Context Protocol (MCP) that transforms the simulator into a dynamic, closed-loop playground for reasoning agentic LLMs. We further annotate scene-level procedural rules and object-level pose alignments to enable semantically aligned 3D ground truths and automated scene modification. We demonstrate LychSim’s capability across multiple downstream applications, including serving as a synthetic data engine, powering reinforcement learning-based adversarial examiners, and facilitating interactive, language-driven scene layout generation. To benefit the broader vision community, LychSim will be made publicly available, including full source code and various data annotations.

[CV-11] 3D Gaussian Splatting for Efficient Retrospective Dynamic Scene Novel View Synthesis with a Standardized Benchmark CVPR2026

【速读】:该论文试图解决在同步多视角(synchronized multi-view)设置下,动态场景的回顾性新视角合成(retrospective novel view synthesis, NVS)问题,尤其是针对体育场景这类具有强空间一致性的应用。现有方法(如动态3D高斯泼溅,3D Gaussian Splatting)引入了时间耦合公式来强制运动连贯性,但论文认为在多视角同步且已标定的条件下,每个时间步的动态场景本身已具备充分的几何约束,因此显式的时间耦合或复杂多体约束并不必要。解决方案的关键在于:从起始时间通过运动恢复结构(Structure from Motion, SfM)初始化点云,然后随时间传播优化的高斯体(Gaussians),而无需施加任何时间变形约束,从而高效实现回顾性NVS。此外,论文还提出了一个基于Blender的动态多视角数据集框架,以生成高质量、同步的相机阵列,并导出标准格式的训练数据集,消除了坐标约定和数据流水线的不一致性,进而构建了一个可控条件下的动态基准测试套件,用于可重复地评估NeRF和3DGS方法。

链接: https://arxiv.org/abs/2605.12437
作者: Yunxiao Zhang,Suryansh Kumar
机构: Visual and Spatial AI Lab(视觉与空间AI实验室); VCCM Section(VCCM部分); College of PVFA(PVFA学院); Department of ECEN(ECEN系); Department of CSCE(CSCE系); Texas A&M University(德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at CVPR 2026; 4D World Models Workshop. Draft info: 14 pages, 4 figures, 8 tables

点击查看摘要

Abstract:Retrospective novel view synthesis (NVS) of dynamic scenes is fundamental to applications such as sports. Recent dynamic 3D Gaussian Splatting (3DGS) approaches introduce temporally coupled formulations to enforce motion coherence across time. In this paper, we argue that, in a synchronized multi-view (MV) setting typical of sports, the dynamic scene at each time step is already strongly geometrically constrained. We posit that the availability of calibrated, synchronized viewpoints provides sufficient spatial consistency, and therefore, explicit temporal coupling, or complex multi-body constraints seems unnecessary for retrospective NVS. To this end, we propose an approach tailored for synchronized MV dynamic scene. By initializing the SfM-derived point cloud at the start time and propagating optimized Gaussians over time, we show that efficient retrospective NVS can be achieved without imposing a temporal deformation constraint. Complementing our methodological contribution, we introduce a Dynamic MV dataset framework built on Blender for reproducible NeRF and 3DGS research. The framework generates high-quality, synchronized camera rigs and exports training-ready datasets in standard formats, eliminating inconsistencies in coordinate conventions and data pipelines. Using the framework, we construct a dynamic benchmark suite and evaluate representative NeRF and 3DGS approaches under controlled conditions. Together, we show that, under a synchronized MV setup, efficient retrospective dynamic scene NVS can be achieved using 3DGS. At the same time, the dataset-generation framework enables reproducible and principled benchmarking of dynamic NVS methods.

[CV-12] GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization

【速读】:该论文解决的是传统步态去标识(gait de-identification)方法中身份抑制不足与时空扭曲之间的矛盾,即现有方法无法在有效隐藏身份的同时保持步态结构(如身形、运动动态)的合理性,从而影响下游结构敏感型任务。解决方案的关键在于提出GaitProtector框架,将隐私保护建模为一个统一目标,包含两个紧密耦合的组件:混淆(obfuscation)使受保护步态远离源身份,伪装(impersonation)使其向选定的目标身份靠拢;目标身份作为语义锚点(semantic anchor),利用预训练扩散先验(pretrained diffusion prior)引导优化偏向结构合理的步态模式。具体实现是通过一个免训练(training-free)的扩散隐空间优化(diffusion latent optimization)流程:将输入轮廓序列反转为预训练3D视频扩散模型的隐轨迹(latent trajectory),然后以可微对抗目标迭代优化隐代码(latent codes),从而合成受保护步态。该方法在不重新训练生成器的前提下,利用扩散模型的先验知识同时实现有效的身份抑制、结构保持和伪装攻击,在CASIA-B数据集上取得了56.7%的伪装成功率,并将Rank-1识别准确率从89.6%降至15.0%,同时在下游脊柱侧弯诊断任务中仅将诊断准确率从91.4%降至74.2%。

链接: https://arxiv.org/abs/2605.12431
作者: Huiran Duan,Qian Zhou,Zhongliang Guo,Junhao Dong,Yuqi Li,Guoying Zhao,Yingli Tian
机构: City University of New York (纽约市立大学); Wuhan University (武汉大学); University of Aberdeen (阿伯丁大学); Nanyang Technological University (南洋理工大学); ELLIS Institute Finland (芬兰ELLIS研究所); University of Oulu (奥卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)

点击查看摘要

Abstract:Conventional gait de-identification methods often encounter an inherent trade-off: they either provide insufficient identity suppression or introduce spatiotemporal distortions that impede structure-sensitive downstream applications. We propose GaitProtector, an impersonation-driven gait de-identification framework that formulates privacy protection as a unified objective with two tightly coupled components: (i) obfuscation, which repels the protected gait from the source identity, and (ii) impersonation, which attracts it toward a selected target identity. The target identity serves as a semantic anchor that biases optimization toward structurally plausible gait patterns under the pretrained diffusion prior, helping preserve dominant body shape and motion dynamics. We instantiate this idea through a training-free diffusion latent optimization pipeline. Instead of retraining a generator for each dataset, we invert each input silhouette sequence into the latent trajectory of a pretrained 3D video diffusion model and iteratively optimize latent codes with a differentiable adversarial objective to synthesize protected gaits. Experiments on the CASIA-B dataset show that GaitProtector achieves a 56.7% impersonation success rate under black-box gait recognition and reduces Rank-1 identification accuracy from 89.6% to 15.0%, while maintaining favorable visual and temporal quality. We further evaluate downstream utility on the Scoliosis1K dataset, where diagnostic accuracy decreases only from 91.4% to 74.2%. To the best of our knowledge, this work is the first to leverage pretrained 3D diffusion priors in a training-free manner for silhouette-based gait de-identification.
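摘要中把混淆(远离源身份)与伪装(靠近目标身份)耦合成单一最小化目标的做法,可在身份嵌入空间里用几行代码示意(权重 lam 与函数名均为本文假设,非论文给定):

```python
import numpy as np

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def protection_objective(f_protected, f_source, f_target, lam=1.0):
    """GaitProtector 式统一目标(示意,对扩散隐代码最小化):
    obfuscation 项压低与源身份的相似度,impersonation 项抬高与目标身份的相似度。"""
    obfuscation = _cos(f_protected, f_source)     # 希望变小
    impersonation = -_cos(f_protected, f_target)  # 相似度越大该项越小
    return obfuscation + lam * impersonation
```

实际方法中该目标是对预训练 3D 视频扩散模型反转出的隐轨迹做免训练迭代优化,这里仅演示损失形状。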

[CV-13] AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection CVPR2026

【速读】:该论文旨在解决自动光学检测(Automated Optical Inspection)中焊线半导体的语义分割模型通常具有设备特异性,当出现新设备或分布偏移(distribution shift)时需重新训练的问题。其解决方案的关键在于提出了AOI-SSL框架,通过结合小域自监督预训练(small-domain self-supervised pre-training)的视觉Transformer(Vision Transformer)与上下文推理(in-context inference)来最小化对标注样本的需求。具体而言,该框架利用掩码自编码器(Masked Autoencoder)在小规模工业检测数据集上进行自监督预训练,从而在标注微调(fine-tuning)计算预算固定的情况下,显著提升下游分割质量;同时引入了一种基于上下文、补丁级(patch-level)的检索方法,可直接从密集编码器嵌入(dense encoder embedding)中预测掩码,几乎无需额外训练。实验表明,简单的基于相似性的检索(similarity-based retrieval)性能与当前文献中更复杂的基于注意力聚合的方法相当,且在针对单设备图像时,基于检索的分割优于微调,支持近乎即时地适应困难样本。

链接: https://arxiv.org/abs/2605.12430
作者: Joaquín Figueira,Rob Van Gastel,Giacomo D’Amicantonio,Zhuoran Liu,Ioan Gabriel Bucur,Faysal Boughorbel,Egor Bondarev
机构: Eindhoven University of Technology (埃因霍温理工大学); ASMPT (ASMPT); Radboud University (拉德堡德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the AI4RWC Workshop at CVPR 2026

点击查看摘要

Abstract:Segmentation models in automated optical inspection of wire-bonded semiconductors are typically device-specific and must be re-trained when new devices or distribution shifts appear. We introduce AOI-SSL, a training-efficient framework for semantic segmentation of wire-bonded semiconductors by combining small-domain self-supervised pre-training of vision transformers with in-context inference that minimizes the need for labeled examples. We pre-train SOTA self-supervised algorithms on a small industrial inspection dataset and find that Masked Autoencoders are the most effective in this small-data setting, improving downstream segmentation while reducing the labeled fine-tuning effort. We further introduce in-context, patch-level retrieval methods that predict masks directly from dense encoder embeddings with negligible additional training. We show that, in this setting, simple similarity-based retrieval performs on par with more complex attention-based aggregation used currently in the literature. Furthermore, our experiments demonstrate that self-supervised pre-training significantly improves segmentation quality compared to training from scratch and to ImageNet pre-trained backbones under a fixed fine-tuning computational budget. Finally, the results reveal that retrieval based segmentation outperforms fine-tuning when targeting single device images, allowing for near-instant adaptation to difficult samples.
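摘要中"简单的基于相似性的补丁级检索即可直接从稠密嵌入预测掩码"的思路,可用如下 NumPy 草图示意(最近邻标签转移,接口为本文假设,非论文官方实现):

```python
import numpy as np

def retrieval_segment(query_patches, support_patches, support_labels):
    """上下文检索式分割(示意):每个查询 patch 嵌入取与其余弦相似度
    最高的已标注 support patch 的标签,几乎无需额外训练。"""
    q = query_patches / np.linalg.norm(query_patches, axis=1, keepdims=True)
    s = support_patches / np.linalg.norm(support_patches, axis=1, keepdims=True)
    nearest = (q @ s.T).argmax(axis=1)   # 每个查询 patch 的最近支持 patch
    return np.asarray(support_labels)[nearest]
```

困难样本只需替换 support 集即可近乎即时适配,这正是摘要所称优于微调之处。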

[CV-14] Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

【速读】:该论文试图解决多模态大语言模型(MLLMs)在视角变化条件下空间推理能力不足的问题,具体聚焦于360度全向图像中的视角条件空间推理(Perspective-Conditioned Spatial Reasoning, PCSR)。现有MLLMs在基础视觉感知上表现良好,但在需要理解视角变换的复杂推理任务(如自我中心旋转、自我中心失真及开放式组合推理)中准确率极低,存在显著的感知-推理差距。解决方案的关键在于两个方面:一是构建了一个名为PCSR-Bench的诊断基准,包含84,373个问答对,覆盖8个任务层级(从基础感知到高级PCSR),以系统评估和揭示这一差距;二是通过基于强化学习(reward-shaping)的干预研究,在7B规模模型上发现该差距具有部分可塑性——通过精心设计的奖励塑形,在受控条件下可将准确率从31.10%提升至60.06%。这表明尽管PCSR是当前MLLMs的关键瓶颈,但通过针对性的优化(如奖励设计)仍存在有限但有效的改善空间。

链接: https://arxiv.org/abs/2605.12413
作者: Yuangong Chen(1),Wai Keung Wong(1),Jiaxing Li(2),Ioannis Patras(3),Xu Zheng(3 and 4) ((1) The Hong Kong Polytechnic University, (2) Guangzhou University, (3) Queen Mary University of London, (4) HKUST (Guangzhou))
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR exhibits partial plasticity rather than being fully immutable. Still, the gains are task-selective, sensitive to reward design including both weight allocation and reward formulation, and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.

[CV-15] GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction SIGGRAPH2026

【速读】:该论文试图解决3D Gaussian Splatting (3DGS) 在稀疏视图约束下训练时产生的严重伪影问题,特别是现有基于图像扩散模型的方法因多视图自注意力机制在渲染视图严重损坏时依赖受损查询特征而导致跨视图检索错误,进而无法进行一致的渲染细化。解决方案的关键是提出了GeoQuery,一个几何引导的扩散框架,其核心是新颖的几何引导跨视图注意力(Geometry-guided Cross-view Attention,GCA)机制:首先通过预测深度图和相机位姿构造几何诱导的对应场,从中采样参考特征,形成几何对齐的代理查询(proxy query)以替代受损的渲染特征;其次设计新的跨视图特征聚合流程,将跨视图注意力限制在每个代理查询周围的局部窗口内,从而有效检索有用特征并抑制虚假匹配。该方法可无缝集成到现有扩散管道,在极端稀疏视图下实现鲁棒重建。

链接: https://arxiv.org/abs/2605.12399
作者: Xiao Cao,Yuze Li,Youmin Zhang,Jiayu Song,Cheng Yan,Wen Li,Lixin Duan
机构: University of Electronic Science and Technology of China (电子科技大学); Rawmantic AI (Rawmantic AI公司); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accept to SIGGRAPH 2026 Conference Track

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a prominent paradigm for 3D reconstruction and novel view synthesis. However, it remains vulnerable to severe artifacts when trained under sparse-view constraints. While recent methods attempt to rectify artifacts in rendered views using image diffusion models, they typically rely on multi-view self-attention to retrieve information from reference images. We observe that this mechanism often fails when the rendered novel views output by 3DGS are heavily corrupted: damaged query features lead to erroneous cross-view retrieval, resulting in inconsistent rendering refinement. To address this, we propose GeoQuery, a geometry-guided diffusion framework that integrates generative priors with explicit geometric cues via a novel Geometry-guided Cross-view Attention (GCA) mechanism. First, by leveraging predicted depth maps and camera poses, we construct a geometry-induced correspondence field to sample reference features, forming a geometry-aligned proxy query that replaces the corrupted rendering features. Furthermore, we design a new cross-view feature aggregation pipeline, in which we restrict the cross-view attention to a local window around each proxy query to effectively retrieve useful features while suppressing spurious matches. GeoQuery can be seamlessly integrated into existing diffusion-based pipelines, enabling robust reconstruction even under extreme view sparsity. Extensive experiments on sparse-view novel view synthesis and rendering artifact removal demonstrate the effectiveness of our approach.

[CV-16] SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation ICML2026

【速读】:该论文试图解决大规模图像中微小稀疏结构分割所面临的核心问题:基于体素的网格计算导致全分辨率推理可扩展性差,必须依赖固定区域划分或降采样,这不仅将计算成本与图像分辨率耦合,还会在少数类结构信息最丰富的边界处削弱证据;同时,极端类别不平衡进一步加剧了分割困难。解决方案的关键在于提出了SEMIR(语义少数诱导表示学习)框架,其核心思想是将推理过程从原始网格解耦,通过学习一个任务自适应且拓扑保持的潜在图表示来实现精确解码。具体而言,SEMIR通过参数化的边收缩、节点删除和边删除,将原始网格图转化为紧凑的边界对齐图子式(graph minor),并保留从子式预测到网格标签的精确提升映射(lifting map)。子式构建被形式化为一个少样本结构学习问题,用边界对齐目标替代手工预处理:通过边界Dice准则最大化预测边界与目标特定语义边界的对齐来学习子式参数。子式的节点被赋予尺度和旋转鲁棒的几何与强度描述符,并支持通过带有关系边特征的图神经网络(GNN)进行高效的区域级推理。实验在BraTS 2021、KiTS23和LiTS三个肿瘤分割数据集上验证了SEMIR在少数结构Dice系数和实际运行时间上的一致改进。

链接: https://arxiv.org/abs/2605.12389
作者: Luke James Miller,Yugyung Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 3 figures. Accepted at ICML 2026. Includes appendices

点击查看摘要

Abstract:Segmenting small and sparse structures in large-scale images is fundamentally constrained by voxel-level, lattice-bound computation and extreme class imbalance – dense, full-resolution inference scales poorly and forces most pipelines to rely on fixed regionization or downsampling, coupling computational cost to image resolution and attenuating boundary evidence precisely where minority structures are most informative. We introduce SEMIR (Semantic Minor-Induced Representation Learning), a representation framework that decouples inference from the native grid by learning a task-adapted, topology-preserving latent graph representation with exact decoding. SEMIR transforms the underlying grid graph into a compact, boundary-aligned graph minor through parameterized edge contraction, node deletion, and edge deletion, while preserving an exact lifting map from minor predictions to lattice labels. Minor construction is formalized as a few-shot structure learning problem that replaces hand-tuned preprocessing with a boundary-alignment objective: minor parameters are learned by maximizing agreement between predicted boundary elements and target-specific semantic edges under a boundary Dice criterion, and the induced minor is annotated with scale- and rotation-robust geometric and intensity descriptors and supports efficient region-level inference via message passing on a graph neural network (GNN) with relational edge features. We benchmark SEMIR on three tumor segmentation datasets – BraTS 2021, KiTS23, and LiTS – where targets exhibit high structural variability and distributional uncertainty. SEMIR yields consistent improvements in minority-structure Dice at practical runtime. More broadly, SEMIR establishes a framework for learning task-adapted, topology-preserving latent representations with exact decoding for high-resolution structured visual data.
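摘要中"精确提升映射(exact lifting map)"的含义可以用一个玩具例子示意:若每个像素已被收缩进某个超节点,则在图子式上得到的预测可以无损地映射回原网格。以下代码仅为概念示意(非论文实现,标签与分组均为假设):

```python
import numpy as np

# Toy sketch (assumption: not the authors' code) of SEMIR's key idea --
# contract a pixel lattice into a small "graph minor" of supernodes,
# predict on the minor, then lift predictions back exactly to pixels.

grid = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 1, 1],
])  # each value is the supernode id a pixel was contracted into

# A label predicted for every supernode (region-level inference on the minor)
minor_labels = {0: "background", 1: "tumor", 2: "organ"}

# Exact lifting map: every pixel inherits the label of its supernode
lifted = np.vectorize(minor_labels.get)(grid)

assert lifted.shape == grid.shape
assert lifted[0, 2] == "tumor" and lifted[3, 0] == "organ"
```

由于收缩映射被显式保留,子式层面的推理结果可以逐像素精确还原,这正是"exact decoding"的来源。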

[CV-17] Fast Image Super-Resolution via Consistency Rectified Flow ICCV2025

【速读】:该论文旨在解决扩散模型在图像超分辨率(Super-Resolution, SR)中因依赖多步采样而导致的效率低下问题,同时克服现有少步或单步方法在建模噪声输入过程不充分或未能充分利用迭代生成先验、从而损害重建图像保真度和质量的缺陷。解决方案的关键在于提出FlowSR,将SR问题重新表述为从低分辨率(Low-Resolution, LR)到高分辨率(High-Resolution, HR)图像的整流流(Rectified Flow),并采用一种改进的一致性学习策略实现单步高质量重建。具体而言,该方法通过引入HR正则化(HR Regularization)来精炼原始一致性蒸馏过程,确保所学SR流既强制自一致性(Self-Consistency)又能精确收敛到真实HR目标;同时,设计了一种快慢调度策略(Fast-Slow Scheduling Strategy),从两个不同的调度器中采样相邻时间步用于一致性学习——快调度器使用较少时间步提升效率,慢调度器使用更多时间步捕捉细粒度纹理细节,从而在单步推理中兼顾效率与图像质量。

链接: https://arxiv.org/abs/2605.12377
作者: Jiaqi Xu,Wenbo Li,Haoze Sun,Fan Li,Zhixin Wang,Long Peng,Jingjing Ren,Haoran Yang,Xiaowei Hu,Renjing Pei,Pheng-Ann Heng
机构: The Chinese University of Hong Kong (香港中文大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); HKUST (GZ) (香港科技大学(广州)); South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Diffusion models (DMs) have demonstrated remarkable success in real-world image super-resolution (SR), yet their reliance on time-consuming multi-step sampling largely hinders their practical applications. While recent efforts have introduced few- or single-step solutions, existing methods either inefficiently model the process from noisy input or fail to fully exploit iterative generative priors, compromising the fidelity and quality of the reconstructed images. To address this issue, we propose FlowSR, a novel approach that reformulates the SR problem as a rectified flow from low-resolution (LR) to high-resolution (HR) images. Our method leverages an improved consistency learning strategy to enable high-quality SR in a single step. Specifically, we refine the original consistency distillation process by incorporating HR regularization, ensuring that the learned SR flow not only enforces self-consistency but also converges precisely to the ground-truth HR target. Furthermore, we introduce a fast-slow scheduling strategy, where adjacent timesteps for consistency learning are sampled from two distinct schedulers: a fast scheduler with fewer timesteps to improve efficiency, and a slow scheduler with more timesteps to capture fine-grained texture details. Extensive experiments demonstrate that FlowSR achieves outstanding performance in both efficiency and image quality.
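整流流(rectified flow)的"直线路径"性质可以用几行数值代码验证:LR→HR 的路径 x_t = (1−t)·x_LR + t·x_HR 的速度场恒为 x_HR − x_LR,因此理想速度模型允许单步生成。以下为示意性草图(非论文实现,数据为随机替身):

```python
import numpy as np

# Minimal numerical sketch (our illustration, not the paper's code) of the
# rectified-flow view in FlowSR: the path from an LR image to its HR target
# is the straight line x_t = (1 - t) * x_lr + t * x_hr, whose velocity is
# the constant v = x_hr - x_lr, so a perfect velocity model enables
# single-step generation.

rng = np.random.default_rng(0)
x_hr = rng.standard_normal((8, 8))               # stand-in for an HR patch
x_lr = x_hr + 0.5 * rng.standard_normal((8, 8))  # degraded LR stand-in

v = x_hr - x_lr                       # ground-truth straight-line velocity

# One Euler step of size 1 from t=0 reaches the HR endpoint exactly
one_step = x_lr + 1.0 * v
assert np.allclose(one_step, x_hr)

# Many small steps land at the same endpoint because the path is straight
x = x_lr.copy()
for _ in range(10):
    x = x + 0.1 * v
assert np.allclose(x, x_hr)
```

论文的一致性学习正是让单步预测与多步轨迹在任意时刻都收敛到同一终点,并额外用 HR 正则约束该终点。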

[CV-18] Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Model, MLLM)视觉潜在推理(visual latent reasoning)中现有“输出作为输入(output-as-input)”范式导致的性能不稳定问题,其根源在于特征空间不匹配:预归一化MLLM(pre-norm MLLM)的解码器隐藏状态与模型训练时使用的输入嵌入(input embeddings)处于不同的范数体系(norm regime),使直接潜在反馈变得不可靠。解决方案的关键在于提出粒度对齐范式(Granular Alignment Paradigm, GAP),通过三层次对齐来弥合这一不匹配:特征级对齐(feature-level alignment)利用轻量级PCA对齐潜在头(PCA-aligned latent head)将解码器输出映射为输入兼容的视觉潜在;上下文级对齐(context-level alignment)借助可检查的辅助视觉监督为潜在目标提供接地信号;能力引导对齐(capacity-guided alignment)则选择性地在基础MLLM难以处理的样本上施加潜在监督。这种多层次对齐使得生成的潜在视觉信号不仅能提供额外令牌槽位(token slots),还能提供任务相关的视觉信息,从而稳定并提升感知与推理性能。

链接: https://arxiv.org/abs/2605.12374
作者: Yanting Miao,Yutao Sun,Dexin Wang,Mengyu Zhou,Pascal Poupart,Lei Lv,Qi Zhao,Li Wang,Hao Li,Xiaoxi Jiang,Guanjun Jiang
机构: University of Waterloo (滑铁卢大学); Qwen Large Model Application Team, Alibaba (Qwen大模型应用团队,阿里巴巴); Vector Institute (向量研究所); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume (xie2025mhc; li2026siamesenorm; team2026attention). This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.
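摘要所述的"范数体系不匹配"与特征级对齐可用如下玩具代码示意:将解码器隐藏状态投影到输入嵌入的主成分子空间并匹配其范数统计。以下实现仅为概念性替代(非论文的 PCA 对齐潜在头,数据与维度均为假设):

```python
import numpy as np

# Hedged sketch of the norm-regime mismatch GAP targets: in a pre-norm
# decoder, hidden states have much larger norms than the input embeddings
# the model was trained to consume. An "aligned head" here is just PCA
# projection onto the input-embedding subspace plus norm matching.

rng = np.random.default_rng(1)
inp = rng.standard_normal((500, 32))            # input-embedding samples
hid = 25.0 * rng.standard_normal((10, 32)) + 3  # decoder states, inflated norms

# PCA basis of the input-embedding distribution
mu = inp.mean(axis=0)
_, _, components = np.linalg.svd(inp - mu, full_matrices=False)

def align(h, k=16):
    # project onto top-k input principal directions, then match mean norm
    proj = (h - h.mean(axis=0)) @ components[:k].T @ components[:k] + mu
    target = np.linalg.norm(inp - mu, axis=1).mean()
    scale = target / (np.linalg.norm(proj - mu, axis=1, keepdims=True) + 1e-8)
    return mu + (proj - mu) * scale

aligned = align(hid)

# After alignment, latent norms sit in the input-embedding regime
assert abs(np.linalg.norm(aligned - mu, axis=1).mean()
           - np.linalg.norm(inp - mu, axis=1).mean()) < 1e-6
```

对齐后的潜在向量才适合作为下一步的"视觉词元"喂回模型,这正是 output-as-input 范式不稳定的修复方向。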

[CV-19] VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference ICML2026

【速读】:该论文旨在解决训练自由的开词语义分割(training-free open-vocabulary semantic segmentation)中,因CLIP固有的深层空间偏差以及现有方法效率与泛化性不足所带来的挑战。其核心创新在于突破了基于CLIP的范式,转而利用具备空间感知能力的框架(如基于此http URL的模型),但发现文本查询的语义歧义会引发跨模态交互中的严重不匹配。解决方案的关键是引入视觉引导的提示进化(Visual-guided Prompt Evolution, VIP),通过别名扩展(alias expansion)与视觉引导的蒸馏机制挖掘语义线索,再以显著性感知方式(saliency-aware manner)稳健聚合这些线索,从而修正文本查询的语义表达能力,释放框架在细粒度物体感知上的潜力,最终实现高效、高质量且泛化性强的语义分割。

链接: https://arxiv.org/abs/2605.12325
作者: Hao Zhu,Shuo Jin,Wenbin Liao,Jiayu Xiao,Yan Zhu,Siyue Yu,Feng Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2026

点击查看摘要

Abstract:Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware this http URL framework to facilitate more efficient and high-quality dense prediction. While this http URL exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce VIsual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in this http URL, unleashing its potential for fine-grained object perception. Towards this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP: (1) surpasses the top-leading methods by 1.4%~8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) requires marginal inference time and memory overhead. Our code is publicly available on GitHub: this https URL.

[CV-20] Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos MICCAI2026

【速读】:该论文旨在解决结肠镜视频中息肉轨迹(polyp tracklets)鲁棒表示的学习问题,其核心挑战在于难以获取正确的正负样本对来训练监督对比学习(supervised contrastive learning),因为标记需要专业知识且成本高昂。解决方案的关键在于利用结肠镜操作的时序工作流(sequential workflow)来自监督地推导时间上的关联,同时引入一种噪声感知对比损失(noise-aware contrastive loss)来容忍这些时序关联中可能存在的错误标签,从而在无需昂贵人工标注的情况下学习有效的息肉表示。

链接: https://arxiv.org/abs/2605.12320
作者: Luca Parolari,Pietro Gori,Lamberto Ballan,Carlo Biffi,Loic Le Folgoc
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2026

点击查看摘要

Abstract:Learning robust representations of polyp tracklets is key to enabling multiple AI-assisted colonoscopy applications, from polyp characterization to automated reporting and retrieval. Supervised contrastive learning is an effective approach for learning such representations, but it typically relies on correct positive and negative definitions. Collecting these labels requires linking tracklets that depict the same underlying polyp entity throughout the video, which is costly and demands specialized clinical expertise. In this work, we leverage the sequential workflow of colonoscopy procedures to derive self-supervised associations from temporal structure. Since temporally derived associations are not guaranteed to be correct, we introduce a noise-aware contrastive loss to account for noisy associations. We demonstrate the effectiveness of the learned representations across multiple downstream tasks, including polyp retrieval and re-identification, size estimation, and histology classification. Our method outperforms prior self-supervised and supervised baselines, and matches or exceeds recent foundation models across all tasks, using a lightweight encoder trained on only 27 videos. Code is available at this https URL.
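噪声感知对比损失的核心思想——按置信度降低可疑正样本对的权重——可如下示意(损失形式与权重均为我们的假设,并非论文公式):

```python
import numpy as np

# Illustrative sketch (assumptions ours) of a noise-aware contrastive loss:
# temporally derived positive pairs may be wrong, so each pair's InfoNCE
# term is down-weighted by a confidence w in [0, 1] instead of being
# trusted uniformly, in the spirit of noisy temporal self-supervision.

def noise_aware_nce(anchor, candidates, pos_idx, w, tau=0.1):
    """anchor: (d,), candidates: (n, d), pos_idx: presumed positive, w: confidence."""
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = c @ a / tau
    log_prob = logits - np.log(np.exp(logits).sum())
    return -w * log_prob[pos_idx]          # confident pairs contribute more

rng = np.random.default_rng(2)
anchor = rng.standard_normal(16)
cands = rng.standard_normal((8, 16))
cands[3] = anchor + 0.01 * rng.standard_normal(16)  # true positive

clean = noise_aware_nce(anchor, cands, pos_idx=3, w=1.0)
noisy = noise_aware_nce(anchor, cands, pos_idx=5, w=0.2)  # dubious pair, damped

assert clean > 0 and noisy > 0
assert noisy < noise_aware_nce(anchor, cands, pos_idx=5, w=1.0)
```

置信度 w 可由时间关联的可靠性(如轨迹间隔、检测质量)估计,论文的具体加权方式请以原文为准。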

[CV-21] G2TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

【速读】:该论文旨在解决分离编码器统一多模态模型(separate-encoder Unified multimodal models, UMMs)因密集视觉 token 处理导致推理成本快速增长的问题,同时避免现有视觉 token 缩减方法(依赖注意力分数或文本-图像相似度)隐式假设目标为判别推理而损害图像编辑能力。解决方案的关键在于提出生成引导的视觉 token 缩减框架 G²TR,其核心洞察是:利用生成分支(VAE 潜在空间)提供任务无关的信号,通过评估 token 与 VAE 潜在的一致性来估计重要性,进而执行平衡的 token 选择与冗余 token 合并,在保持语义相关性和潜在空间图像重建能力的同时,实现无需训练、即插即用的高效 token 缩减,最终将视觉 token 数量与预填充计算量降低 1.94 倍,且不损失推理准确性与编辑质量。

链接: https://arxiv.org/abs/2605.12309
作者: Junxian Li,Kai Liu,Zizhong Ding,Zhixin Wang,Zhikai Chen,Renjing Pei,Yulun Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Technologies Ltd (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is at: this https URL

点击查看摘要

Abstract:The development of separate-encoder Unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on attention scores, text-image similarity and so on, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G²TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G²TR estimates token importance from consistency with VAE latent, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G²TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.
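摘要中"按与 VAE 潜在的一致性打分、选择并合并 token"的流程可用如下玩具代码示意(打分与合并规则均为简化假设,非论文实现):

```python
import numpy as np

# Toy sketch of the G²TR recipe (our simplification): score each visual
# token by its agreement with a generation-side latent, keep the top-k,
# and merge every dropped token into its nearest retained representative
# so information is folded in rather than discarded.

rng = np.random.default_rng(3)
tokens = rng.standard_normal((32, 8))    # understanding-side visual tokens
latent = rng.standard_normal(8)          # pooled VAE-latent stand-in

def cos(a, b):
    return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

scores = cos(tokens, latent)
keep = np.argsort(scores)[-8:]           # retain the 8 most latent-consistent
drop = np.setdiff1d(np.arange(32), keep)

reduced = tokens[keep].copy()
counts = np.ones(len(keep))
for i in drop:                           # average each dropped token into
    sims = tokens[keep] @ tokens[i]      # its nearest retained token
    j = int(np.argmax(sims))
    reduced[j] = (reduced[j] * counts[j] + tokens[i]) / (counts[j] + 1)
    counts[j] += 1

assert reduced.shape == (8, 8)           # 4x fewer tokens than the input
assert counts.sum() == 32                # every token contributed somewhere
```

关键差异在于重要性信号来自生成分支而非文本注意力,因此对编辑任务同样友好。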

[CV-22] KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks

【速读】:该论文旨在解决持续学习(continual learning, CL)中的灾难性遗忘问题,即不同任务间共享的参数相互干扰,而现有正则化方法(如EWC和SI)施加统一的惩罚,未考虑参数服务于哪个输入区域。解决方案的关键在于利用Kolmogorov-Arnold网络(KANs)的紧支撑样条参数化(compact-support spline parameterization)特性,在每节点(per-knot)粒度上进行重要性加权锚定(importance-weighted anchoring)。具体而言,将KAN作为分类头部署在卷积骨干网络上,并对骨干网络施加标准EWC正则化(bbEWC),从而利用KAN样条的局部性(spline locality)实现跨任务参数的自然隔离。此外,通过神经正切核(Neural Tangent Kernel, NTK)分析表明,KAN的样条局部性导致跨任务NTK出现结构性秩亏,即使在特征学习(feature-learning)机制下也能推导出遗忘界限,从而在理论上证明了该组合方法的有效性。

链接: https://arxiv.org/abs/2605.12306
作者: Minjong Cheon
机构: Sejong University (世宗大学); Department of Computer Science and Engineering (计算机科学与工程系)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Catastrophic forgetting remains the central obstacle in continual learning (CL): parameters shared across tasks interfere with one another, and existing regularization methods such as EWC and SI apply uniform penalties without awareness of which input region a parameter serves. We propose KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform importance-weighted anchoring at per-knot granularity. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC), KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. We further provide a Neural Tangent Kernel (NTK) analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime. These results establish that combining an architecture with natural parameter locality (KAN head) with a complementary backbone regularizer (bbEWC) yields a compositional and principled approach to catastrophic forgetting.
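逐节点重要性正则在形式上与 EWC 同族,即 Σ_i Ω_i (θ_i − θ_i*)²,只是这里的 θ_i 是 KAN 样条的单个节点系数,因此只有前一任务实际用到的输入区域对应的节点会被强锚定。以下为示意性草图(重要性与系数数值均为假设):

```python
import numpy as np

# Minimal sketch (not the authors' implementation) of per-knot anchoring:
# like EWC, the penalty is Omega_i * (theta_i - theta_i*)^2, but each
# theta_i is one spline-knot coefficient of a KAN edge, so only knots
# whose input region served the previous task are strongly anchored.

n_knots = 10
theta_old = np.linspace(-1.0, 1.0, n_knots)    # knot coefficients after task A
importance = np.zeros(n_knots)
importance[2:5] = 5.0                          # only knots 2..4 served task A

def penalty(theta):
    return float(np.sum(importance * (theta - theta_old) ** 2))

# Task B can freely move unused knots (zero penalty) ...
theta_free = theta_old.copy()
theta_free[7] += 0.5
assert penalty(theta_free) == 0.0

# ... while moving an important knot the same amount is punished
theta_clash = theta_old.copy()
theta_clash[3] += 0.5
assert penalty(theta_clash) == 5.0 * 0.25
```

样条的紧支撑使"哪个参数服务哪个输入区域"有了明确含义,这是对均匀惩罚的 EWC/SI 的细化。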

[CV-23] Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

【速读】:该论文试图解决当前多模态语言模型在处理复杂交错式多图像指令(interleaved multi-image instructions)时性能显著下降的问题。现有方法的瓶颈在于图像与文本在结构上相互分离,迫使模型必须建立困难的长程依赖关系(long-range dependencies)来将描述与视觉目标匹配。解决方案的关键在于提出 INSET(Images iN SEnTences)统一生成模型,其核心创新是将图像作为原生词汇(native vocabulary)直接嵌入到文本指令的对应语义位置,利用Transformer的上下文局部性(contextual locality)实现精确的对象绑定(object binding),从而将图像视为密集且富有表达力的语言标记。此外,作者设计了一个可扩展的数据引擎,借助视觉语言模型(VLM)和大语言模型(LLM)从标准图像与视频数据集中合成1500万高质量的交错样本,以支持长程序列的训练。这使得INSET在InterleaveBench基准上显著超越了现有方法,尤其在输入复杂度增加时性能优势更加明显,并且自然拓展到多模态图像编辑任务。

链接: https://arxiv.org/abs/2605.12305
作者: Yabo Zhang,Kunchang Li,Dewei Zhou,Xinyu Huang,Xun Wang
机构: ByteDance Seed (字节跳动种子)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (a.k.a., INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
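"图像作为原生词汇"可以理解为:在 token 序列中,图像特征直接占据其语义槽位,而不是作为独立块前置。以下玩具代码示意这种交错嵌入(分词与特征均为假设):

```python
import numpy as np

# Toy sketch (our stand-in tokenizer) of INSET's "images as native
# vocabulary": image features are spliced into the instruction at the
# semantic slot they belong to, rather than prepended as a separate block.

d = 8
tokens = "put the <img0> next to the <img1>".split()
rng = np.random.default_rng(8)
txt_emb = {t: rng.standard_normal(d) for t in tokens if not t.startswith("<img")}
img_feats = {"<img0>": np.ones(d), "<img1>": -np.ones(d)}  # encoder outputs

# Interleaved sequence: each slot holds either a text or an image embedding
seq = np.stack([img_feats[t] if t in img_feats else txt_emb[t] for t in tokens])

assert seq.shape == (len(tokens), d)
assert np.allclose(seq[2], 1.0) and np.allclose(seq[6], -1.0)  # images in place
```

由于描述词与其视觉目标在序列中紧邻,Transformer 的局部上下文即可完成对象绑定,无需跨越长距离依赖。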

[CV-24] EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras

【速读】:该论文试图解决现有基于帧的相机在自我中心(egocentric)3D双手姿态估计和手势识别中存在的运动模糊、动态范围有限,以及现有基于事件(event-based)的方法受自我运动干扰、单目深度模糊和缺乏大规模真实世界立体数据集的问题。解决方案的关键是提出了EgoEV-HandPose端到端框架,其核心是KeypointBEV(Keypoint Bird’s-Eye-View)模块,一种灵活的立体融合模块,通过将特征提升到规范的鸟瞰图空间,并采用迭代重投影引导的细化循环(iterative reprojection-guided refinement loop),逐步解决深度不确定性并强制执行运动学一致性(kinematic consistency);同时,作者引入了EgoEVHands,首个大规模真实世界立体事件相机数据集,包含5,419个标注序列,覆盖38种手势类别和多种光照条件,为模型训练和评估提供基础。

链接: https://arxiv.org/abs/2605.12297
作者: Luming Wang,Hao Shi,Jiajun Zhai,Kailun Yang,Kaiwei Wang
机构: National Research Center for Optical Instrumentation, Zhejiang University (浙江大学光学仪器国家研究中心); School of Artificial Intelligence and Robotics, Hunan University (湖南大学人工智能与机器人学院); National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University (湖南大学机器人视觉感知与控制技术国家工程研究中心); Ant Group Company Ltd. (蚂蚁集团); Central Research Institute of Sunny Optical Technology (舜宇光学科技中央研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Extended version of SMC 2025 paper arXiv:2503.12419 . The established dataset and source code will be publicly released at this https URL

点击查看摘要

Abstract:Egocentric 3D hand pose estimation and gesture recognition are essential for immersive augmented/virtual reality, human-computer interaction, and robotics. However, conventional frame-based cameras suffer from motion blur and limited dynamic range, while existing event-based methods are hindered by ego-motion interference, monocular depth ambiguity, and the lack of large-scale real-world stereo datasets. To overcome these limitations, we propose EgoEV-HandPose, an end-to-end framework for joint 3D bimanual pose estimation and gesture recognition from stereo event streams. Central to our approach is KeypointBEV, a flexible stereo fusion module that lifts features into a canonical bird’s-eye-view space and employs an iterative reprojection-guided refinement loop to progressively resolve depth uncertainty and enforce kinematic consistency. In addition, we introduce EgoEVHands, the first large-scale real-world stereo event-camera dataset for egocentric hand perception, containing 5,419 annotated sequences with dense 3D/2D keypoints across 38 gesture classes under varying illumination. Extensive experiments demonstrate that EgoEV-HandPose achieves state-of-the-art performance with an MPJPE of 30.54mm and 86.87% Top-1 gesture recognition accuracy, significantly outperforming RGB-based stereo and prior event-camera methods, particularly in low-light and bimanual occlusion scenarios, thereby setting a new benchmark for event-based egocentric perception. The established dataset and source code will be publicly released at this https URL.

[CV-25] Large-Small Model Collaboration for Farmland Semantic Change Detection

【速读】:该论文试图解决现有农田语义变化检测(Farmland Semantic Change Detection, SCD)中缺乏细粒度标注基准以及视觉变化检测模型易受物候诱导的伪变化(如作物轮作、季节性变化、光照差异)干扰的问题。解决方案的关键在于:首先,构建了包含4,588对双时相影像和像素级标签的大规模细粒度农田SCD基准HZNU-FCD,并制定了统一的五类农田到非农田的“from-to”标注协议;其次,提出了一种大小模型协作的SCD框架,其中小型模型(Fine-grained Difference-aware Mamba, FD-Mamba)学习密集变化表示以保持边界和定位小区域,而大型模型路径(Cross-modal Logical Arbitration, CMLA)利用冻结的大型视觉语言模型(CLIP)引入文本先验进行提示引导的语义仲裁和伪变化抑制;最后,设计了硬区域协同训练策略,仅对低置信度像素施加CMLA语义得分图的监督,从而实现有效的协同学习。

链接: https://arxiv.org/abs/2605.12282
作者: Xinjia Li,Rui Wang,Qiurong Peng,Lingfei Ye,Dengrong Zhang,Haoyu Zhang
机构: College of Information Science and Technology, Hangzhou Normal University, Hangzhou, China (杭州师范大学信息科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Farmland Semantic Change Detection (SCD) is essential for cultivated land protection, yet existing benchmarks and models remain insufficient for fine-grained farmland conversion monitoring. Current datasets often lack dedicated “from-to” annotations, while visual change detection models are easily disturbed by phenology-induced pseudo-changes caused by crop rotation, seasonal variation, and illumination differences. To address these challenges, we construct HZNU-FCD, a large-scale fine-grained farmland SCD benchmark with a unified five-class farmland-to-non-farmland annotation protocol. It contains 4,588 bitemporal image pairs with pixel-level labels for practical farmland protection. Based on this benchmark, we propose a large-small collaborative SCD framework that integrates a task-driven small visual model with a frozen large vision-language model. The small model, Fine-grained Difference-aware Mamba (FD-Mamba), learns dense change representations for boundary preservation and small-region localization. The large-model pathway, Cross-modal Logical Arbitration (CMLA), introduces CLIP-based textual priors for prompt-guided semantic arbitration and pseudo-change suppression. To enable effective collaboration, we design a hard-region co-training strategy that supervises the CMLA semantic score map only on low-confidence pixels. Experiments show that our method achieves 97.63% F1, 96.32% IoU, and 96.35% SCD_IoU_mean on HZNU-FCD with only 6.65M trainable parameters. Compared with the multimodal ChangeCLIP-ViT, which leverages vision-language information for change detection, our method improves F1 by 10.19 percentage points on HZNU-FCD. It also achieves 91.43% F1 and 84.21% IoU on LEVIR-CD, and 93.85% F1 and 88.41% IoU on WHU-CD, demonstrating strong robustness and generalization. The code is available at this https URL.
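硬区域协同训练的掩码损失可如下示意:仅在小模型低置信度的像素上监督 CMLA 语义得分图(阈值与损失形式均为假设,非论文实现):

```python
import numpy as np

# Sketch of the hard-region co-training idea (illustrative values): the
# large-model semantic score map is supervised only where the small model
# is uncertain, so the CLIP branch focuses on pixels the visual model
# cannot resolve on its own.

rng = np.random.default_rng(4)
small_conf = rng.uniform(0, 1, (16, 16))   # small-model max softmax per pixel
clip_score = rng.uniform(0, 1, (16, 16))   # CMLA semantic score map
target = (rng.uniform(0, 1, (16, 16)) > 0.5).astype(float)

mask = small_conf < 0.6                    # "hard" = low-confidence pixels
eps = 1e-7
bce = -(target * np.log(clip_score + eps)
        + (1 - target) * np.log(1 - clip_score + eps))
loss = (bce * mask).sum() / max(mask.sum(), 1)

assert 0 < mask.sum() < 16 * 16            # only a subset is supervised
assert np.isfinite(loss) and loss > 0
```

这样大模型分支只在物候伪变化等难例区域介入仲裁,而不会干扰小模型已经可靠的预测。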

[CV-26] Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

【速读】:该论文试图解决当前视觉生成(visual generation)中将用户意图通过文本序列化(text serialization)所造成的瓶颈问题——文本会压缩空间结构(spatial structure)、精确外观(exact appearance)和字形(glyph shape)等视觉信号,导致用户无法直接利用草图、参考图等视觉文档来指定生成目标。解决方案的关键在于提出视觉到视觉(Visual-to-Visual, V2V)生成范式,并设计训练无关框架 V2V-Zero:该框架利用现有视觉语言模型(Vision-Language Model, VLM)条件生成器中原有的映射特性,仅将文本条件替换为从视觉规范页面(visual specification page)提取的最终层隐藏状态(final-layer hidden states),从而在不微调模型的情况下使生成器接受视觉文档作为条件;通过这种方式,V2V-Zero 借助预冻结 VLM 对图像和文本的双模态编码能力,将视觉页面直接映射到生成器的条件空间,绕过了文本序列化的信息压缩损失。

链接: https://arxiv.org/abs/2605.12271
作者: Yaofang Liu,Kangning Cui,Meng Chu,Zhaoqing Li,Suiyun Zhang,Jean-Michel Morel,Xiaodong Cun,Haoxuan Che,Rui Liu,Raymond H. Chan
机构: City University of Hong Kong (香港城市大学); City University of Hong Kong (Dongguan) (香港城市大学(东莞)); The Hong Kong University of Science and Technology (香港科技大学); The Chinese University of Hong Kong (香港中文大学); Celia Research HK (Celia Research HK (香港)); Great Bay University (大湾区大学); Lingnan University (岭南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.

[CV-27] CAD-feature enhanced machine learning for manufacturing effort estimation on sheet metal bending parts

【速读】:该论文试图解决基于图(graph-based)的机器学习在从CAD模型的边界表示(Boundary Representations, B-reps)中学习时,因纯几何表示缺乏工艺特定语义(如表面角色、弯曲意图等)而导致可制造性预测不准确的问题。解决方案的关键在于提出一种混合方法(hybrid approach),即通过一个基于规则(rule-based)的模块识别制造特征,并将这些特征(如弯曲特性、法兰长度、表面角色)作为节点属性(node attributes)集成到B-rep属性邻接图(attributed adjacency graphs)中,从而将学习信号聚焦于工艺相关的几何模式,最终结合领域知识与图学习来提升预测精度。

链接: https://arxiv.org/abs/2605.12266
作者: Matteo Ballegeer,Toon Van Camp,Willem Jaspers,Alp Bayar,Aung Nyein Soe,Martin Roelfs,Dries F. Benoit,Bieke Decraemer,Joost R. Duflou
机构: Data Analytics Research Group, Ghent University (数据分析研究组,根特大学); Corelab CVAMO, FlandersMake@UGent, Ghent University (CVAMO核心实验室,弗兰德斯制造@根特大学,根特大学); Corelab CodesignS, Flanders Make (协同设计S核心实验室,弗兰德斯制造); Department of Mechanical Engineering, KU Leuven/Flanders Make (机械工程系,鲁汶大学/弗兰德斯制造)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Graph-based machine learning has emerged as a promising approach for manufacturability analysis by learning directly from CAD models represented as Boundary Representations (B-reps), exploiting both surface geometry and topological connectivity. However, purely geometric representations often lack the process-specific semantics required for accurate manufacturability prediction: many manufacturing factors, such as surface roles or bend intent, are not explicitly encoded in shape alone and are difficult for data-driven models to infer reliably. We propose a hybrid approach that addresses this challenge by enriching B-rep attributed adjacency graphs with manufacturing features recognized through a rule-based module. Applied to sheet metal bending, recognized features, such as bend characteristics, flange lengths, and surface roles are integrated as node attributes, concentrating the learning signal on process-relevant geometric patterns. Experiments on both a large-scale synthetic manufacturability benchmark and a real-world industrial dataset with measured bending times, one of the first such validations on genuine production data, demonstrate that combining domain knowledge with graph-based learning improves prediction accuracy across both tasks. The results demonstrate that hybrid modeling offers a feasible and effective path toward deployable tools for manufacturability assessment and effort estimation in industrial CAD environments.
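将规则识别出的制造特征作为节点属性注入面邻接图的过程,可用如下玩具例子示意(B-rep 以简单字典代替,判别规则为假设):

```python
# Sketch (assumption: plain dicts instead of a real B-rep kernel) of the
# hybrid idea: a rule-based pass tags faces with manufacturing semantics,
# and those tags become node attributes of the face-adjacency graph that
# a graph model then consumes.

faces = {0: {"area": 4.0}, 1: {"area": 1.2}, 2: {"area": 4.0}}
adjacency = {(0, 1), (1, 2)}               # face 1 connects the two flanges

def rule_based_features(fid, attrs):
    # toy rule: a small face adjacent to two larger ones is a bend face;
    # a ^ b ^ fid recovers the other endpoint of an edge containing fid
    nbrs = [a ^ b ^ fid for (a, b) in adjacency if fid in (a, b)]
    is_bend = (len(nbrs) == 2
               and all(faces[n]["area"] > attrs["area"] for n in nbrs))
    return {"role": "bend" if is_bend else "flange"}

for fid, attrs in faces.items():
    attrs.update(rule_based_features(fid, attrs))  # enrich node attributes

assert faces[1]["role"] == "bend"
assert faces[0]["role"] == "flange" and faces[2]["role"] == "flange"
```

真实系统中的规则模块会输出弯曲角度、法兰长度等更丰富的属性,但"规则打标签、图模型学习"的分工与此一致。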

[CV-28] From Image Hashing to Scene Change Detection ICPR2026

【速读】:该论文试图解决传统图像哈希(image hashing)只能进行全局比较、无法定位变化区域的问题,从而将其限制在场景变化检测(scene change detection)之外。解决方案的关键在于提出HashSCD,一种基于patch的哈希框架:通过将空间对齐的图像块编码为紧凑的哈希码,并采用类似XOR的操作在汉明空间(Hamming space)中直接聚合这些编码,使得变化检测和定位无需重复推理历史图像即可高效完成。模型通过对比学习(contrastive learning)在patch和全局两个层面进行无监督训练,实现了计算成本和存储需求的大幅降低,同时保持与现有无监督哈希和场景变化检测方法相媲美的性能。

链接: https://arxiv.org/abs/2605.12259
作者: Anh-Kiet Duong,Marie-Claire Iatrides,Petra Gomez-Krämer,Jean-Michel Carozza
机构: L3i Laboratory, La Rochelle University (拉罗谢尔大学L3i实验室); LIENSs Laboratory, La Rochelle University (拉罗谢尔大学LIENSs实验室); Association Ferrocampus (Ferrocampus协会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages; accepted to ICPR 2026

点击查看摘要

Abstract:Image hashing provides compact representations for efficient storage and retrieval but is inherently limited to global comparison and cannot reason about where changes occur. This limitation prevents hashing from being directly applicable to scene change detection, where spatial localization is essential. In this work, we revisit hashing from a scene change detection perspective and propose HashSCD, a patch-wise hashing framework that enables both efficient global change detection and localized change identification. HashSCD encodes spatially aligned patches into compact hash codes and aggregates them through an XOR-like operation, allowing change detection and localization to be performed directly in the Hamming space without repeated inference on previous images. The model is trained in an unsupervised manner using contrastive learning at both patch and global levels. Experiments demonstrate that HashSCD achieves competitive performance compared to state-of-the-art unsupervised hashing and scene change detection methods, while significantly reducing computational cost and storage requirements.
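"在汉明空间中通过类 XOR 操作定位变化"可以用随机超平面哈希(LSH)示意如下(哈希方案为我们的简化假设,非论文模型):

```python
import numpy as np

# Sketch of HashSCD's core trick (our toy LSH version): hash each spatially
# aligned patch to a short binary code; an XOR between the two time steps'
# codes measures change directly in Hamming space, so localization needs
# no repeated inference on the earlier image.

rng = np.random.default_rng(5)
planes = rng.standard_normal((64, 16))     # 8x8 patches -> 16-bit codes

def patch_hash(patch):
    # random-hyperplane binary code for one flattened patch
    return (patch @ planes > 0).astype(np.uint8)

before = rng.standard_normal((4, 64))      # four flattened patches at t0
after = before.copy()
after[2] = -before[2]                      # patch 2 changes drastically

codes_t0 = np.stack([patch_hash(p) for p in before])
codes_t1 = np.stack([patch_hash(p) for p in after])

hamming = np.bitwise_xor(codes_t0, codes_t1).sum(axis=1)
changed = hamming > 4                      # per-patch change flag

assert hamming[0] == hamming[1] == hamming[3] == 0   # unchanged patches
assert hamming[2] == 16 and changed[2]               # all bits flipped
```

由于只需存储各 patch 的二值码,历史影像本身无需保留或重新推理,这正是该方法存储与计算优势的来源。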

[CV-29] H3D-MarNet: Wavelet-Guided Dual-Path Learning for Metal Artifact Suppression and CT Modality Transformation for Radiotherapy Workflows

【速读】:该论文旨在解决计算机断层扫描(CT)中由高密度植入物引起的金属伪影问题,这些伪影会严重降低图像质量,进而损害诊断准确性并影响放疗计划的有效性。解决方案的核心在于提出了一个两阶段框架 H3D-MarNet:第一阶段采用基于小波(wavelet)的预处理模块,通过频率感知去噪在抑制金属伪影的同时保留解剖结构;第二阶段通过 Domain-TransNet 实现千伏级 CT(kVCT)到兆伏级 CT(MVCT)的域转换,该网络采用混合体积学习架构,集成卷积神经网络(CNN)编码器以捕捉细粒度局部解剖细节,以及基于 Transformer 的编码器以建模长程体积依赖关系,并通过注意力特征融合机制确保空间与上下文一致性,最后由多阶段注意力引导解码器配合深度监督逐步重建伪影抑制的 MVCT 体积。

链接: https://arxiv.org/abs/2605.12252
作者: Mubashara Rehman,Niki Martinel,Michele Avanzo,Riccardo Spizzo,Christian Micheloni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the 28th International Conference on Pattern Recognition, Lyon, France August, 17-22, 2026

点击查看摘要

Abstract:Metal artifacts in computed tomography (CT) severely degrade image quality, compromising diagnostic accuracy and radiotherapy planning, especially in cancer patients with high-density implants. We propose H3D-MarNet, a two-stage framework for artifact-aware CT domain transformation from kilo-voltage CT (kVCT) to mega-voltage CT (MVCT). In the first stage, a wavelet-based preprocessing module suppresses metal-induced artifacts through frequency-aware denoising while preserving anatomical structures. In the second stage, Domain-TransNet performs kVCT-to-MVCT domain transformation using a hybrid volumetric learning architecture. Domain-TransNet integrates a CNN-based encoder to capture fine-grained local anatomical details and a transformer-based encoder to model long-range volumetric dependencies. The complementary representations are fused through an attention-based feature fusion mechanism to ensure spatial and contextual coherence across slices. A multi-stage, attention-guided decoder, supported by deep supervision, progressively reconstructs artifact-suppressed MVCT volumes. Extensive experiments demonstrate that H3D-MarNet achieves 28.14 dB PSNR and 0.717 SSIM on artifact-affected slices from the full dataset, indicating effective metal artifact suppression and anatomical preservation, highlighting its potential for reliable CT modality transformation in clinical radiotherapy workflows.
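小波频率感知去噪的基本流程——分解、对高频系数收缩、再重建——可用单层 Haar 变换示意如下(论文的小波模块更复杂,此处仅为概念草图,信号与阈值均为假设):

```python
import numpy as np

# Rough sketch (ours; one-level Haar instead of the paper's wavelet module)
# of frequency-aware denoising: split a signal into low/high-frequency
# bands, shrink the high-frequency coefficients where streak-like artifacts
# concentrate, and reconstruct.

def haar_1d(x):
    return (x[0::2] + x[1::2]) / 2.0, (x[0::2] - x[1::2]) / 2.0

def ihaar_1d(a, d):
    out = np.empty(2 * a.size)
    out[0::2], out[1::2] = a + d, a - d
    return out

rng = np.random.default_rng(6)
clean = np.sin(np.linspace(0, 4 * np.pi, 128))
noisy = clean + 0.3 * rng.standard_normal(128)   # stand-in for artifacts

a, d = haar_1d(noisy)
d_shrunk = np.sign(d) * np.maximum(np.abs(d) - 0.2, 0.0)  # soft threshold
denoised = ihaar_1d(a, d_shrunk)

assert np.allclose(ihaar_1d(*haar_1d(noisy)), noisy)  # transform is exact
assert np.mean((denoised - clean) ** 2) < np.mean((noisy - clean) ** 2)
```

由于变换可逆,收缩操作只作用于高频带,低频解剖结构在重建后得以保留。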

[CV-30] UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

【速读】:该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在超高分辨率地球观测图像中面临的尺度不匹配问题,即模型虽能接收高分辨率输入,却无法可靠地感知和定位微小的任务相关目标,作者将此现象称为“分辨率错觉”。为系统评估这一挑战,论文构建了UHR-Micro基准,并通过实验揭示了现有高分辨率VLM在空间定位与证据解析上的显著失败,且这些失败无法通过单纯增加模型容量克服,根源在于缺乏对微证据定位与使用的引导。解决方案的关键在于提出基于微证据主动感知(Micro-evidence Active Perception, MAP)的参考智能体,它将查询分解为证据导向的步骤,主动检查候选区域,并基于局部观测生成答案,从而将高分辨率推理从以图像为中心转变为以证据为中心,有效提升了微级感知能力。

链接: https://arxiv.org/abs/2605.12237
作者: Shuo Ni,Tong Wang,Jing Zhang,He Chen,Haonan Guo,Ning Zhang,Bo Du
机构: National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing, Beijing Institute of Technology (北京理工大学空间智能信息处理国家重点实验室); School of Computer Science, Wuhan University (武汉大学计算机学院); Zhongguancun Academy (中关村学院); State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University (武汉大学测绘遥感信息工程国家重点实验室); The Department of Computing, Hong Kong Polytechnic University (香港理工大学计算学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a “resolution illusion”: higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at this https URL.
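MAP 智能体“分解查询 → 主动检查候选区域 → 以证据为中心作答”的流程,可以用下面的 Python 玩具示例勾勒其骨架(`propose_regions`、`inspect`、`answer` 均为本文为演示而假设的可替换组件,并非论文官方实现):

```python
def map_agent(query, propose_regions, inspect, answer):
    """微证据主动感知(MAP)的流程骨架:
    1) 把查询分解为证据导向的候选区域;
    2) 主动检查每个区域(例如对超高分辨率图像做局部裁剪);
    3) 基于收集到的局部观测(证据)作答。"""
    evidence = []
    for region in propose_regions(query):
        obs = inspect(region)
        if obs is not None:          # 只保留包含有效微证据的观测
            evidence.append(obs)
    return answer(query, evidence)

# 玩具用例:在一张"超高分辨率图像"的网格里统计非零微目标
grid = {(0, 0): 0, (5, 7): 3, (9, 9): 0}
ans = map_agent(
    "count nonzero targets",
    propose_regions=lambda q: list(grid),
    inspect=lambda r: grid[r] if grid[r] else None,
    answer=lambda q, ev: len(ev),
)
```

真实系统中,`inspect` 对应对候选区域的高分辨率裁剪与 VLM 推理,这正是“以证据为中心而非以整图为中心”的含义。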

[CV-31] TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion AAMAS2026

【速读】:该论文试图解决安全自主智能体和移动机器人在实时3D感知中,尤其是针对行人等脆弱道路使用者(VRUs)的快速检测问题,旨在克服现有方法在速度与精度之间的权衡。解决方案的关键在于提出一种新型鸟瞰视图编码(TriBand-BEV),将完整3D LiDAR点云映射为包含三个高度带的轻量级2D BEV张量,从而将3D检测显式重构为2D检测问题,再从BEV输出重建3D框。网络设计采用深层区域注意力、P1到P4的层次化双向颈部融合上下文与细节、以及基于分布焦点学习预测有向框侧偏移与旋转IoU损失的检测头;训练中引入垂直重分箱与轻度反射抖动通道增强以抵御记忆过拟合,并在3D重建阶段使用四分位距(IQR)滤波器去除噪声与离群点,最终在单消费级GPU上以49 FPS实现高精度检测,尤其对行人检测在KITTI数据集上显著超越Complex-YOLO。

链接: https://arxiv.org/abs/2605.12220
作者: Mohammad Khoshkdahan,Alexey Vinel
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

点击查看摘要

Abstract:Safe autonomous agents and mobile robots need fast real time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird’s eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a light-weight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re bin and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real time robotic deployment. Our source code is publicly available on GitHub.
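摘要中“三个高度带的 BEV 编码 + IQR 滤波”这一步可以用 NumPy 直接示意。下面是一个极简草图(坐标范围、分辨率与高度带边界均为本文假设的数值,并非论文配置):

```python
import numpy as np

def iqr_filter(z, k=1.5):
    """用四分位距(IQR)滤除高度方向的噪声与离群点,返回保留点的布尔掩码。"""
    q1, q3 = np.percentile(z, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return (z >= lo) & (z <= hi)

def triband_bev(points, x_range=(0, 40), y_range=(-20, 20), resolution=0.1,
                z_bands=((-2.0, -0.5), (-0.5, 1.0), (1.0, 3.0))):
    """把 (N,3) 点云映射为三个高度带的轻量 2D BEV 占据张量 (3, H, W)。"""
    points = points[iqr_filter(points[:, 2])]
    H = int((x_range[1] - x_range[0]) / resolution)
    W = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((len(z_bands), H, W), dtype=np.float32)
    xi = ((points[:, 0] - x_range[0]) / resolution).astype(int)
    yi = ((points[:, 1] - y_range[0]) / resolution).astype(int)
    ok = (xi >= 0) & (xi < H) & (yi >= 0) & (yi < W)
    for b, (z_lo, z_hi) in enumerate(z_bands):
        m = ok & (points[:, 2] >= z_lo) & (points[:, 2] < z_hi)
        bev[b, xi[m], yi[m]] = 1.0
    return bev

pts = np.array([[10.0, 0.0, -1.0],   # 地面带
                [10.0, 0.0, 0.5],    # 中间带
                [10.0, 0.1, 1.5],    # 上带
                [12.0, 3.0, 50.0]])  # 高度离群点,会被 IQR 滤除
bev = triband_bev(pts)
```

得到的 (3, H, W) 张量即可当作普通三通道图像送入 2D 检测网络,这正是“把 3D 检测显式重构为 2D 检测”的思路。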

[CV-32] Learning Ego-Centric BEV Representations from a Perspective-Privileged View: Cross-View Supervision for Online HD Map Construction

【速读】:该论文旨在解决基于多摄像头输入的鸟瞰视图(BEV)表示在高清(HD)地图构建中结构推理不一致的问题,具体源于依赖自车中心监督带来的不完整观测、遮挡以及远距离处信息密度降低所导致的视角效应和空间稀疏性。解决方案的关键在于提出了跨视角监督(Cross-View Supervision, CVS),这是一种表示学习范式,通过将来自自车对齐俯视角度的几何与拓扑先验转移到基于摄像头的BEV编码器中:CVS不额外添加语义损失,而是在共享BEV特征空间中对齐表示,并从具有视角特权的教师模型(利用俯视图像)中蒸馏出全局一致的结构知识到自车中心的骨干网络,从而在不改变推理架构且无需测试时俯视输入的前提下,显著提升结构一致性。

链接: https://arxiv.org/abs/2605.12218
作者: Daniel Lengerer,Mathias Pechinger,Klaus Bogenberger,Carsten Markgraf
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Bird’s-eye-view (BEV) representations derived from multi-camera input have become a central interface for online high-definition (HD) map construction. However, most approaches rely solely on ego-centric supervision, requiring large-scale scene structure to be inferred from incomplete observations, occlusions, and diminishing information density at long range, where perspective effects and spatial sparsity hinder consistent structural reasoning. We introduce Cross-View Supervision (CVS), a representation learning paradigm that transfers geometric and topological priors from an ego-aligned overhead perspective into camera-based BEV encoders. Rather than adding auxiliary semantic losses, CVS aligns representations in a shared BEV feature space and distills globally consistent structural knowledge from a perspective-privileged teacher into the ego-centric backbone. This supervision enhances structural coherence without modifying the inference architecture or requiring overhead input at test time. Experiments on nuScenes using ego-aligned aerial imagery from the AID4AD cross-view extension demonstrate consistent improvements over StreamMapNet while maintaining identical camera-only inference. CVS yields +3.9 mAP in the standard 60×30 m region and +9.9 mAP in the extended 100×50 m setting, corresponding to a 44% relative gain at long range. These results highlight perspective-privileged structural supervision as a promising training principle for improving BEV representation learning in HD map construction.
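CVS 的核心是把学生(ego-centric)BEV 特征向教师(俯视特权)BEV 特征对齐,且推理时不再需要教师。下面用 MSE 作示意性的对齐损失(论文的具体损失形式未必如此,此处仅为概念草图):

```python
import numpy as np

def cvs_align_loss(student_bev, teacher_bev):
    """共享 BEV 特征空间中的对齐损失(此处以 MSE 作示意)。"""
    return float(np.mean((student_bev - teacher_bev) ** 2))

rng = np.random.default_rng(0)
student = rng.normal(size=(2, 8, 8))   # 学生:相机输入得到的 ego-centric BEV 特征
teacher = rng.normal(size=(2, 8, 8))   # 教师:俯视特权视角的 BEV 特征

# 朝教师特征做一步插值更新,对齐损失按 (1-lr)^2 缩小;
# 测试时只保留学生骨干,教师与该损失均不参与推理
lr = 0.3
before = cvs_align_loss(student, teacher)
after = cvs_align_loss(student + lr * (teacher - student), teacher)
```

这对应摘要所说“不改变推理架构、测试时无需俯视输入”:监督只发生在训练阶段的特征空间里。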

[CV-33] Enhancing Domain Generalization in 3D Human Pose Estimation through Controllable Generative Augmentation

【速读】:该论文试图解决3D人体姿态估计中因训练集与测试集数据分布差异(domain gaps)导致的模型泛化性能不足问题。解决方案的关键在于提出一个可控人体姿态生成框架(controllable human pose generation framework),通过系统性地改变姿态、背景和相机视角来合成多样化的视频数据;该方法利用室内/真实世界与户外/虚拟数据集进行跨域数据融合(cross-domain data fusion)及可控视频生成,从而构建更贴合实际部署场景的增强训练集,有效提升模型在未见场景和数据集上的表现。

链接: https://arxiv.org/abs/2605.12198
作者: Xinhao Hu,Yiyi Zhang,Liqing Zhang,Jianfu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pedestrian motion, due to its causal nature, is strongly influenced by domain gaps arising from discrepancies between training and testing data distributions. Focusing on 3D human pose estimation, this work presents a controllable human pose generation framework that synthesizes diverse video data by systematically varying poses, backgrounds, and camera viewpoints. This generative augmentation enriches training datasets, enhances model generalization, and alleviates the limitations of existing methods in handling domain discrepancies. By leveraging both indoor/real-world and outdoor/virtual datasets, we perform cross-domain data fusion and controllable video generation to construct enriched training data, tailored to realistic deployment settings. Extensive experiments show that the augmented datasets significantly improve model performance on unseen scenarios and datasets, validating the effectiveness of the proposed approach.

[CV-34] SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

【速读】:该论文旨在解决视频-音频联合生成(video-audio joint generation)中精确时间同步(temporal synchronization)的难题,即音频事件与其视觉触发之间的细粒度对齐(fine-grained alignment)。现有后训练方法多采用监督微调(Supervised Fine-Tuning),但均方误差损失(Mean Squared Error loss)对细微时间错位(temporal misalignment)的惩罚不足,难以提升时间敏感性。解决方案的关键是提出了一个名为SyncDPO的后训练框架,利用直接偏好优化(Direct Preference Optimization, DPO)来增强模型的时间对齐能力。其核心在于引入了高效的在线规则式负样本构建策略(on-the-fly rule-based negative construction),通过扭曲时间结构(如错位音画对)生成显式的负样本,避免了传统DPO中昂贵的采样和排序(sampling-and-ranking)流程。同时,采用课程学习策略(curriculum learning),逐步增加负样本的难度,从粗粒度错位过渡到细微不一致,从而有效强化模型对时间对齐的判别能力。该方法在多个基准上显著提升了时间对齐性能,并展现出优越的泛化能力。

链接: https://arxiv.org/abs/2605.12179
作者: Xin Cheng,Xihua Wang,Ying Ba,Yuyue Wang,Kaisi Guan,Yinbo Wang,Wenpu Li,Ruihua Song
机构: Renmin University of China(中国人民大学); Westlake University(西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review

点击查看摘要

Abstract:Recent advancements in video-audio joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. Post-training for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization offers an alternative by introducing explicit misaligned counterparts to better improve temporal sensitivity. In this paper we propose SyncDPO, a post-training framework that leverages DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving the model’s temporal alignment capability. It also demonstrates superior generalization on an out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code are available at this https URL.
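规则式负样本构建与课程学习可以用几行代码示意:把音频事件的时间戳整体平移得到时间错位的负样本,并让平移幅度随训练进程从大到小(下面的帧号与幅度范围均为假设值,论文还包含平移之外的多种扭曲策略):

```python
def make_negative(audio_events, shift):
    """把音频事件帧号整体平移 shift 帧,得到时间错位的负样本(规则式构建)。"""
    return [t + shift for t in audio_events]

def curriculum_shift(step, total_steps, max_shift=16, min_shift=1):
    """课程式难度:训练初期用大偏移(粗错位),后期用小偏移(细微不一致)。"""
    frac = step / max(total_steps - 1, 1)
    return int(round(max_shift + frac * (min_shift - max_shift)))

events = [3, 10, 24]                                  # 声音事件对应的视频帧号(假设)
neg = make_negative(events, curriculum_shift(0, 100))  # 训练初期的粗错位负样本
```

这类负样本无需额外采样或标注即可在线构造,这正是摘要中“on-the-fly rule-based negative construction”的含义。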

[CV-35] UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

【速读】:该论文旨在解决扩散模型在视图合成(view synthesis)任务中因像素到潜在空间压缩(pixel-to-latent compression)与扩散幻觉(diffusion hallucination)导致的质量退化问题,具体表现为模糊细节与几何结构扭曲。解决方案的关键在于提出一个通用的参考引导修复框架 UniFixer,采用从粗到细(coarse-to-fine)的策略:首先通过参考预对齐模块(reference pre-alignment module)实现参考视图与退化新视图的粗对齐;继而利用全局结构锚定机制(global structure anchoring mechanism)校正几何畸变以保障结构保真度;最后通过局部细节注入模块(local detail injection module)恢复细粒度纹理细节。UniFixer 作为即插即用(plug-and-play)的精化器,能够零样本(zero-shot)修复多种扩散退化类型。

链接: https://arxiv.org/abs/2605.12169
作者: Sihan Chen,Xiang Zhang,Yang Zhang,Tunc Aydin,Christopher Schroers
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the recent surge of generative models, diffusion-based approaches have become mainstream for view synthesis tasks, either in an explicit depth-warp-inpaint or in an implicit end-to-end manner. Despite their success, both paradigms often suffer from noticeable quality degradation, e.g., blurred details and distorted structures, caused by pixel-to-latent compression and diffusion hallucination. In this paper, we investigate diffusion degradation from three key dimensions (i.e., spatial, temporal, and backbone-related) and propose UniFixer, a universal reference-guided framework that fixes diverse degradation artifacts via a coarse-to-fine strategy. Specifically, a reference pre-alignment module is first designed to perform coarse alignment between the reference view and the degraded novel view. A global structure anchoring mechanism then rectifies geometric distortions to ensure structural fidelity, followed by a local detail injection module that recovers fine-grained texture details for high-quality view synthesis. Our UniFixer serves as a plug-and-play refiner that achieves zero-shot fixing across different types of diffusion degradation, and extensive experiments verify our state-of-the-art performance on novel view synthesis and stereo conversion.

[CV-36] From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation ICML2026

【速读】:该论文旨在解决视频生成模型在机器人操作中存在的核心问题:虽然这些模型能够通过预测长程未来观测来提供想象机制,但直接利用这些想象帧进行动作执行会导致视觉真实性与控制相关性之间的错配(mismatch between visual realism and control relevance),即预测观测过度强调感知保真度而忽视状态转移的动作因果性,从而导致控制间接且不稳定。解决方案的关键在于提出了面向控制的接口——混合潜在动作(MoLA, Mixture of Latent Actions)。MoLA 并非将预测帧直接传递给策略,而是通过混合预训练逆动力学模型(inverse dynamics models)从生成的视觉转换中推断出多种潜在动作的混合表示,这些模态感知的逆动力学模型分别捕捉互补的语义、深度和光流线索,从而构建一个结构化且物理基础(physically grounded)的动作表示,有效弥合视频想象与策略执行之间的鸿沟。

链接: https://arxiv.org/abs/2605.12167
作者: Yajie Li,Bozhou Zhang,Chun Gu,Zipei Ma,Jiahui Zhang,Jiankang Deng,Xiatian Zhu,Li Zhang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026

点击查看摘要

Abstract:Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.
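“混合潜在动作”的思想是:多个模态感知的逆动力学模型(IDM)各自从相邻观测的视觉转换推断潜在动作,再按权重混合成一个动作表示。下面给出一个草图(三个玩具 IDM 与混合权重均为本文假设,并非论文实现):

```python
import numpy as np

def mola_latent_action(obs_t, obs_t1, idms, weights):
    """从相邻观测 (obs_t -> obs_t1) 的视觉转换推断潜在动作:
    每个 IDM 给出一个潜在动作,再按归一化权重混合。"""
    latents = [idm(obs_t, obs_t1) for idm in idms]
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * z for wi, z in zip(w, latents))

# 玩具 IDM:分别模拟语义 / 深度 / 光流通道给出的位移估计(假设性组件)
sem   = lambda a, b: b - a
depth = lambda a, b: (b - a) * 0.5
flow  = lambda a, b: (b - a) * 2.0
act = mola_latent_action(np.array([0.0, 0.0]), np.array([1.0, 2.0]),
                         [sem, depth, flow], weights=[1, 1, 2])
```

策略以这种结构化的动作表示为条件执行,而不是直接以逐帧像素预测为条件,从而绕开“视觉真实性 ≠ 控制相关性”的错配。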

[CV-37] Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

【速读】:该论文试图解决视觉潜在推理(visual latent reasoning)中一个反直觉的性能退化问题:随着潜在序列(latent sequence)长度增加,现有方法的性能系统性地下降。论文揭示其根本原因是信息增益崩溃(Information Gain Collapse)——自回归生成导致后续token高度依赖先前输出,几乎无法引入新信息;同时,高度池化(≥128倍)的图像嵌入作为监督目标时,其信号与无意义占位符无异。解决方案的关键在于提出SCOLAR(Self-COnsistent LAtent Reasoning),它引入一个轻量级的去变换器(detransformer),利用大语言模型(LLM)的完整序列隐藏状态一次性生成辅助视觉token(auxiliary visual tokens),每个token独立锚定(anchor)到原始视觉空间,从而避免信息增益崩溃。结合三阶段监督微调(SFT)和基于自适应低秩偏好优化(ALPO)的强化学习,SCOLAR将可接受的潜在思维链(Chain-of-Thought)长度延长超过30倍,并在真实世界推理基准上实现了开源模型中的最优性能,同时展现出强大的分布外泛化能力。

链接: https://arxiv.org/abs/2605.12163
作者: Chenfeng Wang,Wei He,Xuhan Zhu,Chunpeng Zhou,Qizhen Li,Song Yan,Yufei Zheng,Chengjun Yu,Fan Lu,Wei Zhai,Yang Cao,Pengfei Yu,Zheng-Jun Zha
机构: University of Science and Technology of China(中国科学技术大学); Li Auto Inc.(理想汽车股份有限公司); UESTC(电子科技大学); Ant Group(蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse – autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled (≥128×) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM’s full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over 30×, achieves state-of-the-art among open-source models on real-world reasoning benchmarks (+14.12% over backbone), and demonstrates strong out-of-distribution generalization.

[CV-38] Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

【速读】:该论文试图解决多模态学习中长期存在的核心矛盾:如何在保留模态特异性结构的同时,实现跨模态的泛化能力。现有连续(implicit)方法虽能保留细粒度先验却难以泛化,离散(explicit)方法通过共享原型增强泛化但牺牲了模态特异性。解决方案的关键在于提出的CoDAAR(Cross-modal Discrete Alignment And Reconstruction)框架,它通过索引级对齐(index-level alignment)在模态专属的码本(codebook)之间建立语义共识,从而在统一的离散空间中同时保留模态独特结构并实现可泛化的跨模态表示。该框架结合两个互补机制:离散时间对齐(Discrete Temporal Alignment, DTA)实现细粒度时间量化,级联语义对齐(Cascading Semantic Alignment, CSA)促进渐进式跨模态语义一致性,二者共同构建无竞争的统一表示空间。通过自监督重建目标在配对多模态序列上训练,CoDAAR在事件分类、定位、视频分割和跨数据集迁移等跨模态泛化基准上达到最先进性能,为离散可泛化多模态表示学习建立了新范式。

链接: https://arxiv.org/abs/2605.12145
作者: Souptik Sen,Raneen Younis,Zahra Ahmadi
机构: Peter L. Reichertz Institute for Medical Informatics (Peter L. Reichertz 医学信息学研究所), Hannover Medical School (汉诺威医学院), Germany (德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.
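“索引级对齐”可以理解为:各模态拥有自己的码本,但量化后的离散索引应在语义上达成共识。下面用最近邻向量量化与一个索引一致率度量做最小示意(码本与特征均为虚构数值,仅演示概念):

```python
import numpy as np

def quantize(feats, codebook):
    """向量量化:为每个特征向量返回最近码字的索引。"""
    d = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def index_agreement(idx_a, idx_b):
    """索引级对齐的示意性度量:两种模态落到相同索引的比例。"""
    return float(np.mean(idx_a == idx_b))

# 两个模态各自的码本:码字内容不同(保留模态特异性),但索引语义应一致
cb_audio = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
cb_video = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
a = np.array([[0.1, 0.0], [1.9, 2.1]])   # 音频特征
v = np.array([[0.0, 1.1], [2.1, 2.9]])   # 对应的视频特征
ia, iv = quantize(a, cb_audio), quantize(v, cb_video)
agreement = index_agreement(ia, iv)
```

注意两个码本的码字数值并不相同:对齐发生在离散索引层面而非特征层面,这正是 CoDAAR 得以同时保留模态特异结构与跨模态共识的原因。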

[CV-39] PoseCompass: Intelligent Synthetic Pose Selection for Visual Localization

【速读】:该论文试图解决绝对位姿回归(APR)在基于3D高斯泼溅(3DGS)的数据增强中,因随机采样生成冗余视图和噪声样本而导致微调效率低、定位精度差的问题。解决方案的关键在于提出 PoseCompass,即一种智能化的合成位姿选择流水线,其核心是基于价值的排序机制,综合评估三个维度:定位难度(关注具有挑战性的区域)、覆盖新颖性(探索欠采样区域)和渲染可观测性(过滤伪影与噪声),从而筛选出信息量最大的位姿;随后通过轨迹约束生成候选、选取 Top-K 位姿,并利用 3DGS 结合轻量级扩散对齐合成视图,使位姿回归器在混合真实与合成数据上高效微调,最终在 7-Scenes 数据集上实现 3 倍加速(适应时间从 15.2 分钟降至 5.1 分钟)和 53.8% 的中位位姿误差降低。

链接: https://arxiv.org/abs/2605.12144
作者: Yanan Zhou,Zhaoyan Qian,Yanli Li,Nan Yang,Zhongliang Guo,Dong Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In visual localization, Absolute Pose Regression (APR) enables real-time 6-DoF camera pose inference from single images, yet critically depends on fine-tuning data quality and coverage. While recent methods leverage 3D Gaussian Splatting (3DGS) for novel view synthesis-based data augmentation, random sampling generates redundant views and noisy samples from poorly reconstructed regions. To mitigate this research gap, we propose PoseCompass, an intelligent pose selection pipeline for 3DGS-based APR. PoseCompass formulates synthetic pose selection and derives a value-based pose ranking mechanism to identify informative poses. The ranking integrates three dimensions: Localization Difficulty, favoring challenging regions; Coverage Novelty, exploring under-sampled areas; and Rendering Observability, filtering artifacts and noise. PoseCompass then generates trajectory-constrained candidates, selects the top-K ranked poses, and synthesizes views using 3DGS with lightweight diffusion-based alignment. Finally, the pose regressor is fine-tuned on mixed real and synthetic data. We evaluate PoseCompass on 7-Scenes, where it reduces adaptation time from 15.2 to 5.1 minutes, a 3x speedup, while cutting median pose errors by 53.8 percent and significantly outperforming random baselines.
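基于价值的位姿排序本质上是对三个维度打分后加权求和并取 Top-K。一个最小草图如下(各维得分与权重均为假设值,论文中三个维度由具体模型计算得到):

```python
import numpy as np

def rank_poses(difficulty, novelty, observability, weights=(1.0, 1.0, 1.0), k=2):
    """基于价值的位姿排序:三个维度加权求和后取 Top-K 候选索引。"""
    w1, w2, w3 = weights
    value = w1 * difficulty + w2 * novelty + w3 * observability
    return np.argsort(-value)[:k]

# 4 个候选位姿在三个维度上的得分(假设已归一化到 [0,1])
difficulty    = np.array([0.9, 0.2, 0.7, 0.4])    # 定位难度
novelty       = np.array([0.1, 0.8, 0.6, 0.3])    # 覆盖新颖性
observability = np.array([0.5, 0.9, 0.15, 0.1])   # 渲染可观测性(低值意味着伪影多)
top = rank_poses(difficulty, novelty, observability, k=2)
```

被选出的 Top-K 位姿再交给 3DGS 渲染并用于微调,从而避免随机采样带来的冗余视图与噪声样本。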

[CV-40] EchoTracker2: Enhancing Myocardial Point Tracking by Modeling Local Motion MICCAI2026

【速读】:该论文试图解决超声心动图中心肌点跟踪(Myocardial Point Tracking, MPT)这一新兴方向中的运动估计问题。心肌运动与自然视频中的运动本质不同,它源于生理约束下的形变,具有时空连续性且运动轨迹局部受限。当前通用的粗到细(coarse-to-fine)两阶段架构中的粗初始化(coarse initialization)在此领域可能是不必要的冗余步骤。解决方案的关键在于提出一种仅包含精细阶段(fine-stage-only)的架构EchoTracker2,通过丰富像素级特征并融入局部时空上下文,再结合长程联合时序推理(long-range joint temporal reasoning),实现稳健的跟踪。实验表明,该模型相比领域特定的最先进模型,位置精度提高6.5%,中位轨迹误差降低12.2%;相比最佳通用点跟踪方法,改善分别为2.0%和5.3%,且与专家导出的全局纵向应变(global longitudinal strain, GLS)的一致性更好,并提升了测试-重测复现性。

链接: https://arxiv.org/abs/2605.12140
作者: Md Abulkalam Azad,Vegard Holmstrøm,John Nyberg,Lasse Lovstakken,Håvard Dalen,Bjørnar Grenne,Andreas Østvik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Early accepted (top 9%) to MICCAI 2026

点击查看摘要

Abstract:Myocardial point tracking (MPT) has recently emerged as a promising direction for motion estimation in echocardiography, driven by advances in general-purpose point tracking methods. However, myocardial motion fundamentally differs from motion encountered in natural videos, as it arises from physiologically constrained deformation that is spatially and temporally continuous throughout the cardiac cycle. Consequently, motion trajectories typically remain locally confined despite substantial tissue deformation. Motivated by these properties, we revisit the architectural design for MPT and find that coarse initialization in commonly used two-stage coarse-to-fine architectures may be unnecessary in this domain. In this work, we propose a fine-stage-only architecture, EchoTracker2, which enriches pixel-precise features with local spatiotemporal context and integrates them with long-range joint temporal reasoning for robust tracking. Experimental results across in-distribution, out-of-distribution (OOD), and public synthetic datasets show that our model improves position accuracy by 6.5% and reduces median trajectory error by 12.2% relative to a domain-specific state-of-the-art (SOTA) model. Compared to the best general-purpose point tracking method, the improvements are 2.0% and 5.3%, respectively. Moreover, EchoTracker2 shows better agreement with expert-derived global longitudinal strain (GLS) and enhances test-retest reproducibility. Source code will be available at: this https URL.

[CV-41] MULTI: Disentangling Camera Lens Sensor View and Domain for Novel Image Generation ICPR2026

【速读】:该论文旨在解决文本到图像生成中因文本歧义导致的精确控制难题,尤其是在需要特定风格或对象时,现有方法过度关注图像内容而忽略了成像因素(Imaging Factor)如相机镜头、传感器类型、成像视角和场景域特征等。解决方案的关键在于提出多因素解耦方法MULTI(Multi-factor disentanglement through Textual Inversion),通过两阶段学习:第一阶段学习通用成像因素,第二阶段提取数据集特有因素;这一设置不仅支持现有数据集的扩展和新因素组合的生成,从而缩小分布差异,还允许对特定因素进行修改,并借助ControlNets实现图像到图像生成,有效突破了当前方法在成像因素解耦上的局限性。

链接: https://arxiv.org/abs/2605.12134
作者: Sonali Godavarthy,Matthias Neuwirth-Trapp,Tim-Felix Faasch,Maarten Bieshaar,Michael Moeller,Danda Pani Paudel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICPR 2026

点击查看摘要

Abstract:Recent text-to-image models produce high-quality images, yet text ambiguity hinders precise control when specific styles or objects are required. There have been a number of recent works dealing with learning and composing multiple objects and patterns. However, current work focuses almost entirely on image content, overlooking imaging factors such as camera lens, sensor types, imaging viewpoints, and scenes’ domain characteristics. We introduce this new challenge as Imaging Factor Disentanglement and show limitations of current approaches in the regime. We, therefore, propose the new method Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables the extension of existing datasets and novel factor combinations, thereby reducing distribution gaps. It further supports modifications of specific factors and image-to-image generation via ControlNets. The evaluation on our new DF-RICO benchmark demonstrates the effectiveness of MULTI and highlights the importance of Factor Disentanglement as a new direction of research.

[CV-42] Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

【速读】:该论文试图解决文本到图像扩散模型(text-to-image diffusion models)在概念遗忘(unlearning)过程中,现有基于稀疏自编码器(Sparse Autoencoder, SAE)的方法因未强制概念分离,导致潜在特征跨概念共享,进而造成抑制目标概念时对非目标概念产生意外干扰的问题。解决方案的关键在于两点:一是引入概念感知对比目标(concept-aware contrastive objective),将潜在表示组织成概念特定的聚类(concept-specific clusters),从而实现更精确的概念抑制并减少非目标干扰;二是采用基于GeLU的非线性变换(GeLU-based nonlinear transformation)增强编码器的表达能力,在分离目标下构建更具判别性和解耦性的潜在空间(discriminative and disentangled latent space)。这一设计尤其针对联合风格-对象遗忘(joint style-object unlearning)这一挑战性场景,显著提升了性能。

链接: https://arxiv.org/abs/2605.12122
作者: Hyeonjin Kim,Hangyeol Jung,Heechan Yun,Sungjun Yun,Dong-Jun Han
机构: Yonsei University(延世大学); Kookmin University(国民大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 40 pages, 23 figures

点击查看摘要

Abstract:Unlearning specific concepts in text-to-image diffusion models has become increasingly important for preventing undesirable content generation. Among prior approaches, sparse autoencoder (SAE)-based methods have attracted attention due to their ability to suppress target concepts through lightweight manipulation of latent features, without modifying model parameters. However, SAEs trained with sparse reconstruction objectives do not explicitly enforce concept-wise separation, resulting in shared latent features across concepts. To address this, we propose SAEParate, which organizes latent representations into concept-specific clusters via a concept-aware contrastive objective, enabling more precise concept suppression while reducing unintended interference during unlearning. In addition, we enhance the encoder with a GeLU-based nonlinear transformation to increase its expressive capacity under this separation objective, enabling a more discriminative and disentangled latent space. Experiments on UnlearnCanvas demonstrate state-of-the-art performance, with particularly strong gains in joint style-object unlearning, a challenging setting where existing methods suffer from severe interference between target and non-target concepts.
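概念感知对比目标的直觉是:同一概念的潜变量相互靠近,不同概念之间保持间隔。下面用经典的成对对比损失(hinge 形式)做概念性示意,并非 SAEParate 的精确目标函数:

```python
import numpy as np

def concept_contrastive_loss(z, labels, margin=1.0):
    """概念感知对比目标的概念性示意:
    同概念潜变量之间惩罚距离平方,异概念之间要求至少 margin 的间隔。"""
    loss, n = 0.0, 0
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            d = float(np.linalg.norm(z[i] - z[j]))
            if labels[i] == labels[j]:
                loss += d ** 2
            else:
                loss += max(0.0, margin - d) ** 2
            n += 1
    return loss / n

labels = [0, 0, 1]                                   # 前两个样本属于同一概念
z_sep = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
well_separated = concept_contrastive_loss(z_sep, labels)
collapsed = concept_contrastive_loss(np.zeros((3, 2)), labels)  # 概念纠缠在一点
```

概念纠缠(所有潜变量挤在一起)的损失明显更高;反过来,最小化该损失就会把潜空间组织成概念特定的聚类,从而支持更精确的概念抑制。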

[CV-43] MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

【速读】:该论文试图解决生成式新视角合成中几何先验与外观先验的根本性冲突:几何先验能够提供空间对齐,但在视角变化下会变得稀疏且不准确;外观先验虽能提供视觉保真度,却缺乏几何对应关系。现有方法要么在生成过程中传播几何误差,要么在静态融合两种先验时产生信号冲突。解决方案的关键在于MoCam提出的结构化去噪动力学,其在扩散模型(diffusion model)的推理过程中,通过时间解耦(temporal decoupling)实现从几何到外观的协调渐进:早期阶段利用几何先验锚定粗粒度结构并容忍其不完整性,后期阶段切换至外观先验主动修正几何错误并细化细节。这种设计自然统一了静态与动态视图合成,实现了稳健的几何-外观解纠缠(geometry-appearance disentanglement)。

链接: https://arxiv.org/abs/2605.12119
作者: Haofeng Liu,Yang Zhou,Ziheng Wang,Zhengbo Xu,Zhan Peng,Jie Ma,Jun Liang,Shengfeng He,Jing Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL

点击查看摘要

Abstract:Generative novel view synthesis faces a fundamental dilemma: geometric priors provide spatial alignment but become sparse and inaccurate under view changes, while appearance priors offer visual fidelity but lack geometric correspondence. Existing methods either propagate geometric errors throughout generation or suffer from signal conflicts when fusing both statically. We introduce MoCam, which employs structured denoising dynamics to orchestrate a coordinated progression from geometry to appearance within the diffusion process. MoCam first leverages geometric priors in early stages to anchor coarse structures and tolerate their incompleteness, then switches to appearance priors in later stages to actively correct geometric errors and refine details. This design naturally unifies static and dynamic view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion process. We demonstrate that MoCam significantly outperforms prior methods, particularly when point clouds contain severe holes or distortions, achieving robust geometry-appearance disentanglement.
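“结构化去噪动力学”的时间解耦可以抽象为一个随去噪进度切换先验权重的调度函数。下面是一个极简示意(切换点 0.4 为本文假设值,论文中的调度方式可能更平滑):

```python
def prior_schedule(t, T, switch=0.4):
    """结构化去噪的时间解耦示意:
    去噪进度 t/T 小于 switch 时用几何先验锚定粗结构,之后切换到外观先验。
    返回 (几何先验权重, 外观先验权重)。"""
    return (1.0, 0.0) if t / T < switch else (0.0, 1.0)

T = 50
phases = [prior_schedule(t, T)[0] for t in range(T)]  # 每个去噪步的几何权重
```

前期容忍几何先验的不完整性、只用它定结构,后期让外观先验主动修正几何错误,这正是摘要描述的 coarse-to-fine 协调过程。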

[CV-44] When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

【速读】:该论文试图解决强化学习人类反馈(RLHF)在对齐流匹配(flow-matching)文生图模型时引发的严重多样性崩溃(diversity collapse)问题,即模型在微调后收敛至感知空间中的狭窄高奖励区域。解决方案的关键在于指出传统强化学习中用于维持多样性的策略熵(policy entropy)在流模型下失效——由于固定预定义噪声调度,策略熵保持恒定而感知多样性却急剧下降,因此无法阻止多样性崩溃。作者提出感知熵(perceptual entropy)作为替代度量,它能在感知空间中捕捉多样性并保留标准熵的特性;基于此,设计了两种熵正则化策略:感知熵约束(Perceptual Entropy Constraint, PEC)和生成空间感知约束(Perceptual Constraints on Generation Space, PCGS),通过在优化过程中施加感知层面的正则化,在保持多样性与提升质量之间取得更优权衡。

链接: https://arxiv.org/abs/2605.12112
作者: Xiaofeng Tan,Jun Liu,Bin-Bin Gao,Yuanting Fan,Xi Jiang,Chengjie Wang,Hongsong Wang,Feng Zheng
机构: Southeast University(东南大学); Tencent Youtu Lab(腾讯优图实验室); Southern University of Science and Technology(南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:RLHF is widely used to align flow-matching text-to-image models with human preferences, but often leads to severe diversity collapse after fine-tuning. In RL, diversity is often assumed to correlate with policy entropy, motivating entropy regularization. However, we show this intuition breaks in flow models: policy entropy remains constant, even while perceptual diversity collapses. We explain this mismatch both theoretically and empirically: the constant entropy arises from the fixed, pre-defined noise schedule, while the diversity collapse is driven by the mode-seeking nature of policy gradients. As a result, policy entropy fails to prevent the model from converging to a narrow high-reward region in the perceptual space. To this end, we introduce perceptual entropy that captures diversity in a perceptual space and maintains the property of standard entropy. Building upon this insight, we propose two entropy-regularized strategies, Perceptual Entropy Constraint and Perceptual Constraints on Generation Space, to preserve perceptual diversity and improve the quality. Experiments across two base models, neural and rule-based rewards, and three perceptual spaces demonstrate consistent gains in the quality-diversity trade-off; PEC achieves the best overall score of 0.734 (vs. baseline’s 0.366); a complementary setting of PEC further reaches a diversity average of 0.989 (vs. baseline’s 0.047). Our project page (this https URL) is publicly available.
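感知熵需要在感知特征空间中度量样本多样性。论文的具体定义未在摘要中给出,这里用高斯微分熵(协方差的对数行列式)作为一个假设性代理,演示“多样性崩溃对应感知空间熵下降”的现象:

```python
import numpy as np

def perceptual_entropy_proxy(feats, eps=1e-6):
    """高斯微分熵代理(假设性定义,非论文原式):
    0.5 * logdet(协方差 + eps*I);样本越集中,该值越低。"""
    cov = np.cov(feats.T) + eps * np.eye(feats.shape[1])
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * logdet

rng = np.random.default_rng(0)
diverse   = rng.normal(scale=1.0,  size=(64, 4))   # 多样的感知特征
collapsed = rng.normal(scale=0.01, size=(64, 4))   # 崩溃到窄高奖励区域的特征
```

注意该度量作用在感知嵌入上而非策略分布上:这解释了为什么固定噪声调度下策略熵保持恒定,感知多样性却可以塌缩。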

[CV-45] UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

【速读】:该论文试图解决多参考图像生成(multi-reference image generation)中由解耦视觉条件化(decoupled visual conditioning)导致的属性泄漏(attribute leakage)和交叉参考混淆(cross-reference confusion)问题:现有VLM增强扩散模型将语义ViT特征与外观丰富VAE特征分开处理,使得模型难以将每个语义主语与正确参考图像的视觉细节正确关联,从而在复杂多参考场景下无法忠实保留主体身份和细粒度外观。解决方案之关键是提出了UniCustom统一视觉条件化框架,在VLM编码之前执行ViT特征与VAE特征的早期融合(early fusion),通过仅一个轻量线性融合层(lightweight linear fusion layer)让VLM的隐藏状态同时编码被指代的主语及其对应视觉外观。为实现这种统一表示,论文采用两阶段训练策略:第一阶段是面向重建的预训练(reconstruction-oriented pretraining),在融合隐藏状态中保留参考特定外观细节;第二阶段是在单参考和多参考生成任务上进行监督微调。此外,还引入槽位绑定正则化(slot-wise binding regularization),鼓励每个图像槽(image slot)保存其对应参考的低层细节,从而减少交叉参考纠缠(cross-reference entanglement)。

链接: https://arxiv.org/abs/2605.12088
作者: Yiyan Xu,Qiulin Wang,Wenjie Wang,Yunyao Mao,Xintao Wang,Pengfei Wan,Kun Gai,Fuli Feng
机构: University of Science and Technology of China(中国科学技术大学); Kling Team, Kuaishou Technology(快手科技Kling团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.
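补充示例:摘要中"仅一个轻量线性融合层"的早期融合,本质上是把语义 ViT 特征与外观 VAE 特征按参考图拼接后过一个线性映射,再送入 VLM;下面给出一个最小示意,其中全部维度(token 数、特征维、VLM 宽度)均为假设值:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical per-reference features: 16 ViT tokens (dim 768), 16 VAE tokens (dim 64)
vit_feats = rng.normal(size=(16, 768))
vae_feats = rng.normal(size=(16, 64))

# the "lightweight linear fusion layer": one weight matrix + bias mapping the
# concatenated features into an assumed VLM embedding width of 1024
W = rng.normal(size=(768 + 64, 1024)) * 0.01
b = np.zeros(1024)

fused_tokens = np.concatenate([vit_feats, vae_feats], axis=-1) @ W + b
```

融合后的 token 同时携带语义与外观信息,VLM 的隐藏状态因此能把"指代哪个主体"与"该主体长什么样"绑定在一起。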

[CV-46] The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments

【速读】:该论文旨在解决现有拼图求解方法普遍局限于严格方形拼图块(square puzzle pieces)的固有限制,无法处理现实中常见的形状不规则、边缘侵蚀的拼图碎片。解决方案的关键在于两方面的创新:一是构建了GAP数据集,利用真实考古碎片的学习分布生成合成的高度侵蚀、形状不受限制的拼图块,提供了更具挑战性和真实性的评估基准;二是提出了PuzzleFlow框架,该框架结合了视觉Transformer(ViT)和基于流匹配(Flow-Matching)的生成模型,能够灵活建模任意形状的拼图块之间的空间对应关系,从而在复杂拼图场景下超越了经典的基线方法和近期主流工作。

链接: https://arxiv.org/abs/2605.12077
作者: Ofir Itzhak Shahar,Gur Elkin,Ohad Ben-Shahar
机构: Stein Faculty of Computer and Information Science (斯坦因计算机与信息科学学院); Ben-Gurion University of the Negev (内盖夫本-古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Jigsaw puzzle solving has been an increasingly popular task in the computer vision research community. Recent works have utilized cutting-edge architectures and computational approaches to reassemble groups of pieces into a coherent image, while achieving increasingly good results on well established datasets. However, most of these approaches share a common, restricting setting: operating solely on strictly square puzzle pieces. In this work, we introduce GAP, a set of novel jigsaw puzzles datasets containing synthetic, heavily eroded pieces of unrestricted shapes, generated by a learned distribution of real-world archaeological fragments. We also introduce PuzzleFlow, a novel ViT and Flow-Matching based framework for jigsaw puzzle solving, capable of handling complex puzzle pieces and demonstrating superior performance on GAP when compared to both classic and recent prominent works in this domain.

[CV-47] BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding

【速读】:该论文试图解决现有基准在过程性视频理解(procedural video understanding)中无法全面诊断模型失败原因的问题,因为现有评估通常将物体定位、手物交互、关系解析、时间推理和步骤级推断等能力分开评测,导致难以定位模型的缺陷。解决方案的关键在于构建BARISTA——一个密集标注的自我中心(egocentric)视频数据集,包含185个真实世界咖啡制作视频,覆盖全自动、portafilter和胶囊式三种工作流,并为每一帧提供经过验证的场景图(scene graph),将持久物体身份与掩码、轨迹、边界框、属性、类型化关系、手物交互、活动及过程步骤关联起来。基于这些场景图,进一步派生出零样本(zero-shot)语言任务,涵盖短语定位、手物交互识别、指代(referring)、活动识别、关系提取和时间视觉问答(temporal VQA),从而实现对模型在多种任务族上的综合诊断。

链接: https://arxiv.org/abs/2605.12074
作者: Patrick Knab,Orgest Xhelili,Inis Buzi,Drago Andres Guggiana Nilo,Mohd Saquib Khan,Lorenz Kolb,Manuel Scherzer,Kerem Yildirir,Christian Bartelt,Philipp Johannes Schubert
机构: Ramblr.ai Research(兰布尔AI研究); Technical University of Clausthal(克劳斯塔尔工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scene understanding is central to general physical intelligence, and video is a primary modality for capturing both state and temporal dynamics of a scene. Yet understanding physical processes remains difficult, as models must combine object localization, hand-object interactions, relational parsing, temporal reasoning, and step-level procedural inference. Existing benchmarks usually evaluate these capabilities separately, limiting diagnosis of why models fail on procedural tasks. We introduce BARISTA, a densely annotated egocentric dataset and benchmark of 185 real-world coffee-preparation videos covering fully automatic, portafilter-based, and capsule-based workflows. BARISTA provides verified per-frame scene graphs linking persistent object identities to masks, tracks, boxes, attributes, typed relations, hand-object interactions, activities, and process steps. From these graphs, we derive zero-shot language-based tasks spanning phrase grounding, hand-object interaction recognition, referring, activity recognition, relation extraction, and temporal visual question answering. Experiments reveal strong variation across task families and no consistently dominant model family, positioning BARISTA as a challenging diagnostic benchmark for procedural video understanding. Code and dataset available at this https URL.

[CV-48] PairDropGS: Paired Dropout-Induced Consistency Regularization for Sparse-View Gaussian Splatting

【速读】:该论文试图解决基于dropout的稀疏视角3D高斯溅射(3D Gaussian Splatting, 3DGS)方法中因随机抑制高斯基元(Gaussian primitives)而引起不同dropout子集(dropped Gaussian subsets)之间不一致性,进而导致重建不稳定和表示次优的问题。解决方案的关键在于从一致性正则化(consistency regularization)视角出发,提出PairDropGS框架:首先从共享高斯场(shared Gaussian field)构造一对dropout子集,并通过低频一致性正则化(low-frequency consistency regularization)约束两者的低频渲染结构,从而在随机dropout下保持稳定的场景布局与粗略几何,同时避免对模糊高频细节的过度约束;此外引入渐进式一致性调度策略(progressive consistency scheduling strategy),逐步加强正则化强度以增强训练的稳定性和鲁棒性。

链接: https://arxiv.org/abs/2605.12072
作者: Hantang Li,Qiang Zhu,Xiandong Meng,Xingtao Wang,Debin Zhao,Xiaopeng Fan
机构: School of Computer Science, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)计算机科学与技术学院); Pengcheng Laboratory, Shenzhen, China(鹏城实验室); Smart Coding Institute, Pengcheng Laboratory, Shenzhen, China(鹏城实验室智能编码研究所); School of Computer Science, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages,8 figures

点击查看摘要

Abstract:Dropout-based sparse-view 3D Gaussian Splatting (3DGS) methods alleviate overfitting by randomly suppressing Gaussian primitives during training. Existing methods mainly focus on designing increasingly sophisticated dropout strategies, while they overlook the resulting inconsistencies among different dropped Gaussian subsets. This oversight often leads to unstable reconstruction and suboptimal Gaussian representations. In this paper, we revisit dropout-based sparse-view 3DGS from a consistency regularization perspective and propose PairDropGS, a Paired Dropout-induced Consistency Regularization framework for sparse-view Gaussian splatting. Specifically, PairDropGS first constructs a pair of dropped Gaussian subsets from a shared Gaussian field and designs a low-frequency consistency regularization to constrain their low-frequency rendered structures. This design encourages the shared Gaussian field to preserve stable scene layout and coarse geometry under different random dropouts, while avoiding excessive constraints on ambiguous high-frequency details. Moreover, we introduce a progressive consistency scheduling strategy to gradually strengthen the consistency regularization during training for stability and robustness of reconstruction. Extensive experiments on widely-used sparse-view benchmarks demonstrate that PairDropGS achieves superior training stability, significantly outperforms existing dropout-based 3DGS methods in reconstruction quality, while exhibiting the simplicity and plug-and-play nature for improving dropout-based optimization.
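补充示例:摘要所述"低频一致性正则化"可以示意为:对两个 dropout 子集的渲染结果各做一次低通滤波,再约束二者一致;下面用盒式模糊充当低通滤波器(论文的具体滤波器与损失形式未在摘要中给出,均为假设):

```python
import numpy as np

def box_blur(img, ksize=5):
    """Separable box blur as a stand-in low-pass filter (illustrative)."""
    kernel = np.ones(ksize) / ksize
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, out)
    return out

def low_freq_consistency(render_a, render_b, ksize=5):
    """MSE between low-passed renders from two dropped Gaussian subsets."""
    return float(np.mean((box_blur(render_a, ksize) - box_blur(render_b, ksize)) ** 2))

rng = np.random.default_rng(0)
base = rng.random((32, 32))                   # shared coarse scene structure
a = base + 0.1 * rng.normal(size=base.shape)  # two "renders" differing only in
b = base + 0.1 * rng.normal(size=base.shape)  # high-frequency detail/noise
loss_lf = low_freq_consistency(a, b)
loss_raw = float(np.mean((a - b) ** 2))
```

低通之后的一致性损失远小于像素级损失,这正对应摘要里"约束稳定的布局与粗几何,同时不过度约束模糊的高频细节"的设计意图。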

[CV-49] Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection ICIP2026

【速读】:该论文旨在解决零样本异常检测(zero-shot anomaly detection)中现有方法对所有样本采用统一特征变换的问题,忽视了正常数据与异常数据本质上的非对称分布——紧凑的正常模式(compact normals)相对于多样的异常模式(diverse anomalies)。其解决方案的关键在于提出一种异常感知的视觉-语言适应框架AVA-DINO(Anomaly-Aware Vision-Language Adaptation),该框架包含两个专用分支(dual specialized branches),分别针对正常和异常模式,并通过文本引导路由机制(text-guided routing mechanism)和显式路由正则化(explicit routing regularization)在辅助数据训练过程中联合学习两个分支,鼓励分支专门化。测试时,仅利用输入图像和固定的预定义语言描述动态组合两个分支,实现非对称激活(asymmetric activation),从而避免退化的统一路由,允许上下文特定的特征变换。

链接: https://arxiv.org/abs/2605.12069
作者: Muhammad Aqeel,Maham Nazir,Uzair Khan,Marco Cristani,Francesco Setti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICIP 2026

点击查看摘要

Abstract:Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. this https URL
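补充示例:"测试时仅用输入图像与固定文本描述动态组合两个分支"可以示意为:由文本相似度经 softmax 得到路由权重,再加权混合正常/异常两个分支的输出;分支结构、维度与温度参数均为假设:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(features, text_sims, normal_branch, anomaly_branch, temperature=0.5):
    """Mix two branch outputs with weights from similarities to text prompts.

    text_sims = (similarity to normal-state prompts, similarity to anomaly
    prompts). Branches and dimensions are placeholders, not AVA-DINO's.
    """
    w_normal, w_anomaly = softmax(np.asarray(text_sims) / temperature)
    return w_normal * normal_branch(features) + w_anomaly * anomaly_branch(features)

rng = np.random.default_rng(0)
W_n, W_a = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
normal_branch = lambda f: f @ W_n    # stand-in "normal pattern" adapter
anomaly_branch = lambda f: f @ W_a   # stand-in "anomalous pattern" adapter

feat = rng.normal(size=4)
mixed = route(feat, (0.6, 0.5), normal_branch, anomaly_branch)
# a near-certain anomaly-prompt match collapses routing onto the anomaly branch
anomalous = route(feat, (-10.0, 10.0), normal_branch, anomaly_branch)
```

这种非对称激活使不同样本获得不同的特征变换,避免摘要所批评的"对所有样本统一变换"。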

[CV-50] TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images

【速读】:该论文旨在解决在大幅几何变形条件下光学与合成孔径雷达(SAR)图像间的跨模态配准难题,现有深度学习方法虽能捕捉共享特征,但无法同时应对跨模态外观差异与复杂空间变换的双重挑战。解决方案的关键在于引入文本语义先验:通过构建与遥感场景及土地覆盖类别相关的文本描述符,利用冻结的RemoteCLIP文本编码器提取文本特征,再以视觉-文本交互方式增强高视觉特征,从而缩小模态差异并提升跨模态特征的可靠性;随后采用由粗到精的密集匹配(CFDM)模块,基于增强的高层特征建立粗对应关系,并利用低层特征细化匹配位置,最终实现对大几何变形的鲁棒配准。

链接: https://arxiv.org/abs/2605.12064
作者: Zhuoyu Cai,Dou Quan,Ning Huyan,Pei He,Shuang Wang,Licheng Jiao
机构: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, School of Artificial Intelligence, Xidian University(中国教育部智能感知与图像理解重点实验室,西安电子科技大学人工智能学院); Department of Automation, Tsinghua University(清华大学自动化系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from optical and SAR images. TAFE constructs text descriptors related to remote sensing scenes and land-cover objects, and uses a frozen RemoteCLIP text encoder to extract text features. These text features are introduced through visual-text interaction to enhance high-level visual features for more reliable coarse matching. CFDM then establishes coarse correspondences based on the enhanced high-level features and refines the matched locations using low-level features. Experimental results on cross-modal remote sensing images demonstrate the effectiveness of TAR, which achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.

[CV-51] OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

【速读】:该论文旨在解决跨具身视频生成(cross-embodiment video generation)中可迁移动作学习与具身特定外观/形态之间的解耦困境及其对大范围扩展性的限制。现有方法常将动作动力学与具身特征纠缠,且高度依赖针对每个新具身的配对数据,导致难以扩展到全新的人形机器人。解决方案的关键在于提出OmniHumanoid框架,其核心思想是将动作迁移模型的学习与具身适应进行显式分解:首先,利用跨多种人形具身的运动对齐配对视频训练一个共享的动作迁移模型(shared motion transfer model);其次,针对任意新具身,仅需一段未配对视频即可通过轻量级的具身特定适配器(lightweight embodiment-specific adapters)完成自适应,无需重新训练共享模型。为了进一步减少动作迁移与具身适应之间的相互干扰,框架引入了分支隔离的注意力设计(branch-isolated attention design),该设计将动作条件(motion conditioning)与具身特定调制(embodiment-specific modulation)分离到不同的注意力分支中,从而在训练和推理时保持两者独立。此外,作者还构建了一个合成的跨具身数据集(synthetic cross-embodiment dataset),包含多种人形资产、场景和视角下的运动对齐配对视频,为模型训练提供支撑。通过在合成和真实基准上的实验,OmniHumanoid实现了高保真的动作迁移和具身一致性,且可无缝泛化到未见过的具身类型。

链接: https://arxiv.org/abs/2605.12038
作者: Yiren Song,Xiyao Deng,Pei Yang,Yihan Wang,Mike Zheng Shou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.

[CV-52] Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

【速读】:该论文旨在解决全模态语言模型(Omni-modal language models)评估中因“视觉捷径”(visual shortcuts)导致的性能膨胀问题——即仅凭视觉证据即可回答查询,掩盖了真正的音-视-语证据融合能力;同时探讨后训练(post-training)在视觉去偏评估环境下的表现。解决方案的关键在于:首先通过视觉单独探测(visual-only probing)审计九个现有全模态基准,构建了清理后的评估视图OmniClean,仅保留8,551个无法仅由视觉解决的查询(来自16,968个审计查询);其次提出三阶段后训练策略OmniBoost(基于Qwen2.5-Omni-3B),包括混合双模态监督微调(SFT)、混合模态强化学习与验证推理(RLVR)以及基于自蒸馏数据(self-distilled data)的SFT,其中自蒸馏阶段使3B模型性能达到甚至略高于未使用更强全模态教师的Qwen3-Omni-30B-A3B-Instruct,表明自蒸馏的全模态查询监督是突破小模型性能瓶颈的关键。

链接: https://arxiv.org/abs/2605.12034
作者: Che Liu,Lichao Ma,Xiangyu Tony Zhang,Yuxin Zhang,Haoyang Zhang,Xuerui Yang,Fei Tian
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision.

[CV-53] Resilient Vision-Tabular Multimodal Learning under Modality Missingness

【速读】:该论文旨在解决现实临床环境中普遍存在的模态缺失问题,即多模态深度学习模型在训练和推理时面临整个模态(如医疗图像、表格化临床变量)或其中个别特征不可用的情况,而现有方法大多假设模态完整可用,缺乏鲁棒性。解决方案的关键在于提出一种无需插补或启发式模型切换的多模态Transformer框架,其核心包含三个创新点:一是通过可学习的模态标记(learnable modality tokens)对单模态表示进行加权,并在多模态融合编码器中使用带掩码自注意力(masked self-attention)的中间融合机制,该机制能显式排除缺失模态或特征的信息聚合与梯度传播,从而在数据不完整时仍能稳定计算;二是引入模态丢弃正则化策略(modality-dropout regularization),在训练过程中随机移除部分可用模态,迫使模型学会利用部分数据中的互补信息,增强对任意缺失模式的适应能力。以上设计使得模型在从完全多模态到完全单模态的各种缺失率场景下,均能实现比基线方法更平滑的性能退化与更强的鲁棒性。

链接: https://arxiv.org/abs/2605.12031
作者: Camillo Maria Caruso,Valerio Guarrasi,Paolo Soda
机构: Università Campus Bio-Medico di Roma (罗马生物医学大学校区); Umeå University (于默奥大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that rarely holds in real-world clinical settings where entire modalities and individual features are frequently missing. In this work, we propose a multimodal transformer framework for joint vision-tabular learning explicitly designed to operate under pervasive modality missingness, without relying on imputation or heuristic model switching. The architecture integrates three components: a vision, a tabular, and a multimodal fusion encoder. Unimodal representations are weighted through learnable modality tokens and fused via intermediate fusion with masked self-attention, which excludes missing tokens and modalities from information aggregation and gradient propagation. To further enhance resilience, we introduce a modality-dropout regularization strategy that stochastically removes available modalities during training, encouraging the model to exploit complementary information under partial data availability. We evaluate our approach on the MIMIC-CXR dataset paired with structured clinical data from MIMIC-IV for multilabel classification of 14 diagnostic findings with incomplete annotations. Two parallel systematic stress-test protocols progressively increase training and inference missingness in each modality separately, spanning fully multimodal to fully unimodal scenarios. Across all missingness regimes, the proposed method consistently outperforms representative baselines, showing smoother performance degradation and improved robustness. Ablation studies further demonstrate that attention-level masking and intermediate fusion with joint fine-tuning are key to resilient multimodal inference.
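补充示例:摘要中"带掩码自注意力将缺失 token 排除在信息聚合与梯度传播之外"可以示意为:把指向缺失模态 token 的注意力 logit 置为负无穷,使其权重为零;下面是单头、恒等 Q/K/V 投影的最小示意(具体结构为假设):

```python
import numpy as np

def masked_self_attention(tokens, present):
    """Single-head self-attention that excludes missing-modality tokens.

    `present` is a boolean vector; keys of masked tokens get -inf logits,
    so they contribute no values (and, in training, no gradients). This
    sketch only masks keys and uses identity Q/K/V projections.
    """
    d = tokens.shape[-1]
    logits = tokens @ tokens.T / np.sqrt(d)
    logits[:, ~present] = -np.inf                    # drop attention *to* missing tokens
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights

rng = np.random.default_rng(0)
toks = rng.normal(size=(4, 8))                       # e.g. [vision, tabular, cls, fusion]
present = np.array([True, False, True, True])        # tabular modality missing
out, attn = masked_self_attention(toks, present)
```

配合训练时随机丢弃模态的 modality-dropout,模型被迫学会在任意缺失模式下利用剩余模态的信息。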

[CV-54] 4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

【速读】:该论文旨在解决从单目视频重建动态4D场景时,现有3D基础模型因全局注意力机制中相机自身运动(camera ego-motion)与物体运动(object motion)的固有耦合而导致的性能显著退化问题。解决方案的关键在于提出一种无需训练(training-free)的渐进式解耦框架,以由粗到细(coarse-to-fine)的原则分离动态与静态成分。核心逻辑是首先通过稳定相机姿态消除动态干扰,再执行几何细化。具体包括三个协同组件:一是动态掩码引导的姿态解耦模块(Dynamic-Mask-Guided Pose Decoupling),利用动态掩码隔离运动干扰,生成稳定的无运动参考帧;二是拓扑子空间手术机制(Topological Subspace Surgery),通过正交分解深度流形,在保留动态对象的同时,向静态区域注入掩码感知的精细化几何信息;三是信息论置信度感知融合策略(Information-Theoretic Confidence-Aware Fusion),将深度集成建模为异方差贝叶斯推理(heteroscedastic Bayesian inference)问题,采用逆方差加权(inverse-variance weighting)自适应融合多轮预测。该方法无需微调即可在标准4D重建基准上实现一致且显著的指标提升,展现了数学驱动下动态-静态解耦的潜力。

链接: https://arxiv.org/abs/2605.12027
作者: Ying Zang,Xuanyi Liu,Yidong Han,Deyi Ji,Chaotao Ding,Yuanqi Hu,Qi Zhu,Xuanfu Li,Jin Ma,Lingyun Sun,Tianrun Chen,Lanyun Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological Subspace Surgery mechanism that orthogonally decomposes the depth manifold, safely preserving dynamic objects while injecting refined, mask-aware geometry into static regions; and (3) an Information-Theoretic Confidence-Aware Fusion strategy that formulates depth integration as a heteroscedastic Bayesian inference problem, adaptively blending multi-pass predictions via inverse-variance weighting. Extensive experiments on standard 4D reconstruction benchmarks demonstrate that our method achieves consistent and substantial improvements across principal point-cloud metrics. Notably, our approach shows competitive performance in robust 4D scene reconstruction without requiring fine-tuning, suggesting the potential of mathematically grounded dynamic-static disentanglement.
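补充示例:摘要中的"逆方差加权融合"在异方差高斯假设下有封闭形式:独立估计的后验均值就是按精度(方差倒数)加权的平均,后验方差是精度之和的倒数。一个逐像素的最小示意(数值为虚构):

```python
import numpy as np

def inverse_variance_fuse(depths, variances):
    """Precision-weighted fusion of multiple depth predictions.

    Under a heteroscedastic Gaussian model, the posterior mean of
    independent estimates is the inverse-variance weighted average.
    """
    depths = np.asarray(depths, dtype=float)
    precision = 1.0 / np.asarray(variances, dtype=float)
    fused = (depths * precision).sum(axis=0) / precision.sum(axis=0)
    fused_var = 1.0 / precision.sum(axis=0)
    return fused, fused_var

# two per-pixel depth passes: one confident (low variance), one noisy
d1, v1 = np.array([2.0, 3.0]), np.array([0.01, 0.01])
d2, v2 = np.array([4.0, 5.0]), np.array([1.0, 1.0])
fused, fused_var = inverse_variance_fuse([d1, d2], [v1, v2])
```

融合结果偏向高置信预测,且融合后的方差低于任一单次预测,这就是该贝叶斯视角带来的自适应混合效果。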

[CV-55] Spectral Vision Transformer for Efficient Tokenization with Limited Data

【速读】:该论文试图解决在有限数据条件下,尤其是医学成像领域,视觉变换器(vision transformer)中标记化(tokenization)效率低下的问题。解决方案的关键在于提出一种新颖的谱视觉变换器(spectral vision transformer)架构,通过选择合适的谱基(spectral basis)进行谱投影(spectral projection),从而获得空间不变性(spatial invariance)和最优信噪比(optimal signal-to-noise ratio)等理论性质,同时显著降低计算复杂度,使得在参数数量减少的情况下,性能与紧凑型或标准视觉变换器、带注意力的卷积神经网络、移位窗口变换器(shifted window transformer)、多层感知机(multi-layer perceptron)及逻辑回归(logistic regression)等多种模型相当或更优。

链接: https://arxiv.org/abs/2605.12026
作者: Alexandra G. Roberts,Maneesh John,Jinwei Zhang,Dominick Romano,Mert Sisman,Ki Sueng Choi,Heejong Kim,Mert R. Sabuncu,Thanh D. Nguyen,Alexey V. Dimov,Pascal Spincemaille,Brian H. Kopell,Yi Wang
机构: Cornell University(康奈尔大学); Johns Hopkins University(约翰霍普金斯大学); Weill Cornell Medicine(威尔康奈尔医学院); Icahn School of Medicine at Mount Sinai(西奈山伊坎医学院); Mount Sinai Health System(西奈山医疗系统)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:We propose a novel spectral vision transformer architecture for efficient tokenization in limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis including spatial invariance and optimal signal-to-noise ratio. We show reduced complexity arising from the spectral projection compared to spatial vision transformers. We show equitable or superior performance with a reduced number of parameters as compared to a variety of models including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. We include simulated, public, and clinical data in our analysis and release our code at: this http URL.
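补充示例:摘要提到的"谱投影标记化"及其"空间不变性、降低复杂度"两条性质,可以用 DFT 充当谱基来示意(论文的实际谱基未在摘要中给出):取低频系数的幅值作为 token,幅值对循环平移不变,且 64 个系数即可替代 64x64 = 4096 个像素:

```python
import numpy as np

def spectral_tokens(img, n_freq=8):
    """Tokenize an image by magnitudes of its lowest 2-D DFT coefficients.

    The DFT stands in for the paper's (unspecified) spectral basis.
    Fourier-coefficient magnitudes are invariant to circular spatial
    shifts, illustrating the claimed spatial invariance.
    """
    coeffs = np.fft.fft2(img)
    return np.abs(coeffs[:n_freq, :n_freq]).reshape(-1)  # low-frequency corner

rng = np.random.default_rng(0)
img = rng.random((64, 64))
tokens = spectral_tokens(img)
tokens_shifted = spectral_tokens(np.roll(img, shift=(3, 5), axis=(0, 1)))
```

token 数量从像素数降为少量谱系数,注意力的二次复杂度随之大幅下降,这与"有限数据下高效标记化"的动机一致。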

[CV-56] What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

【速读】:该论文试图解决图像理解任务中“是什么”(what)与“在哪里”(where)信息在分类导向骨干网络中被隐式纠缠或抑制的问题,使得定位相关的目标发现、检测与分割任务相较于图像分类更为复杂。解决方案之关键在于提出了一种名为“What-Where Transformer (WWT)”的架构,其核心归纳偏置是“what-where分离”,通过两个创新设计实现:一是将Transformer中的令牌(tokens)视为“what”的表征,注意力图(attention maps)视为“where”的表征,并采用多流、基于槽(slot-based)的架构在并行的前馈模块中分别处理这两类信息;二是在最终层同时重用令牌和注意力图用于下游任务,并直接将其暴露于任务损失函数产生的梯度之下,从而更有效且显式地学习定位信息。这使得WWT即使在仅使用ImageNet单标签分类监督时,也能从原始注意力图中直接实现多目标发现,无需额外后处理,并在零样本目标发现和弱监督语义分割等任务上表现出优于ViT基方法的性能。

链接: https://arxiv.org/abs/2605.12021
作者: Ryota Yoshihashi,Masahiro Kada,Satoshi Ikehata,Rei Kawakami,Ikuro Sato
机构: Institute of Science Tokyo (东京科学大学); DENSO IT Laboratory (DENSO IT实验室); National Institute of Informatics (国立信息学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture; (2) it reuses both the final-layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single-label classification-based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT-based methods on zero-shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.

[CV-57] FAME: Feature Activation Map Explanation on Image Classification and Face Recognition CVPR

【速读】:该论文旨在解决深度学习模型在图像处理中可解释性不足的问题,尤其是现有类激活映射(Class Activation Mapping, CAM)方法及其梯度变体假设特征图位置仅受其底层区域影响,这一假设在深层网络中不再成立,导致归因结果失真;同时,基于扰动的固定块方法(如CorrRISE)虽能提供像素级归因,但扰动方式不够灵活。解决方案的关键在于提出了特征激活图解释(Feature Activation Map Explanation, FAME),该方法融合了梯度与特征图信息:通过利用网络梯度计算输入图像的变化,以梯度驱动的方式而非固定块来扰动输入,从而避免了对特征图位置假设的依赖,并能生成更准确、与现有先进方法竞争的高质量归因图,在图像分类和人脸识别任务上验证了其有效性。

链接: https://arxiv.org/abs/2605.12017
作者: Xinyi Zhang,Manuel Günther
机构: Department of Informatics, University of Zurich (苏黎世大学信息学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for CVPR Workshop 2026

点击查看摘要

Abstract:Deep Learning has revolutionized machine learning, reaching unprecedented levels of accuracy, but at the cost of reduced interpretability. Especially in image processing systems, deep networks transform local pixel information into more global concepts in a highly obscured manner. Explainable AI methods for image processing try to shed light on this issue by highlighting the regions of the image that are important for the prediction task. Among these, Class Activation Mapping (CAM) and its gradient-based variants compute attributions based on the feature map and upscale them to the image resolution, assuming that feature map locations are influenced only by underlying regions. Perturbation-based methods, such as CorrRISE, on the other hand, try to provide pixel-level attributions by perturbing the input with fixed patches and checking how the output of the network changes. In this work, we propose Feature Activation Map Explanation (FAME), which combines both worlds by using network gradients to compute changes to the input image, manipulating it in a gradient-driven way rather than using fixed patches. We apply this technique to two common tasks, image classification and face recognition, and show that CAM's above-mentioned assumption does not hold for deeper networks. We qualitatively and quantitatively show that FAME produces attribution maps that are competitive with state-of-the-art systems. Our code is available: this https URL.
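补充示例:"用网络梯度驱动扰动而非固定块"的思路,可以在一个玩具线性打分器上示意:沿梯度方向扰动输入,把得分变化按元素分摊为归因。FAME 的真实归因公式未在摘要中给出,以下仅为示意:

```python
import numpy as np

def gradient_driven_attribution(x, w, step=0.5):
    """Perturb the input along the score gradient instead of fixed patches.

    Toy linear scorer f(x) = w @ x whose gradient is w itself; FAME uses a
    network's gradients, so this only illustrates the perturbation idea.
    """
    grad = w                              # d f / d x for the linear scorer
    perturbed = x - step * grad           # gradient-driven input manipulation
    drop = w @ x - w @ perturbed          # how much the score changes
    attribution = grad * (x - perturbed)  # per-pixel share of that change
    return drop, attribution

rng = np.random.default_rng(0)
x = rng.random(16)
w = np.zeros(16)
w[3], w[7] = 2.0, -1.0                    # only inputs 3 and 7 affect the score
drop, attribution = gradient_driven_attribution(x, w)
```

归因在逐元素上分摊整段得分下降,且只有真正影响输出的输入获得非零归因;固定块扰动则无法做到这种逐像素对齐。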

[CV-58] L2P: Unlocking Latent Potential for Pixel Generation

【速读】:该论文旨在解决像素扩散模型(pixel diffusion models)因从头训练所需计算资源与数据量过于高昂而难以实际应用的问题。其解决方案的关键在于提出一种名为“潜变量到像素(Latent-to-Pixel, L2P)”的迁移范式:该范式直接利用预训练潜变量扩散模型(Latent Diffusion Models, LDMs)中的丰富知识,具体通过丢弃变分自编码器(VAE),转而采用大块分词化(large-patch tokenization),冻结源LDM的中间层,仅训练少量浅层网络来学习从潜空间到像素空间的变换。同时,L2P仅使用LDM生成的合成图像作为训练语料,无需收集任何真实数据,从而在极低数据量下实现快速收敛。这一策略使得L2P仅需8块GPU即可将大量潜空间先验无缝迁移至像素空间,并消除了VAE的内存瓶颈,原生支持4K超高清生成,在主流LDM架构上仅需极小的训练开销即可达到与源LDM相当的性能。

链接: https://arxiv.org/abs/2605.12013
作者: Zhennan Chen,Junwei Zhu,Xu Chen,Jiangning Zhang,Jiawei Chen,Zhuoqi Zeng,Wei Zhang,Chengjie Wang,Jian Yang,Ying Tai
机构: Nanjing University (南京大学); Tencent Youtu Lab (腾讯优图实验室); Hainan-biuh; Weess Gmbh
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: project page: this https URL

点击查看摘要

Abstract:Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM’s intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
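补充示例:"丢弃 VAE、改用大块标记化"意味着直接把像素图像切成大的非重叠 patch 并展平为 token;下面是一个最小示意(patch 大小 32 为假设值,L2P 的实际超参未在摘要中给出):

```python
import numpy as np

def patchify(img, patch=32):
    """Split an HxWxC image into flattened non-overlapping large patches.

    Replaces the VAE tokenizer with direct pixel patching; patch size and
    any downstream linear projection are assumptions here.
    """
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    tiles = img.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, c)
    return tiles.reshape(-1, patch * patch * c)       # (num_tokens, token_dim)

img = np.arange(256 * 256 * 3, dtype=float).reshape(256, 256, 3)
tokens = patchify(img, patch=32)
```

由于不再经过 VAE 编解码,token 与像素一一对应且无显存瓶颈,这正是摘要中原生 4K 生成得以实现的前提。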

[CV-59] Robust Promptable Video Object Segmentation CVPR2026

【速读】:该论文旨在解决提示性视频目标分割(Promptable Video Object Segmentation, PVOS)模型在输入损坏(例如恶劣现实条件或时间变化噪声)下性能严重下降的问题,这是首个针对鲁棒PVOS(RobustPVOS)的综合研究。解决方案的关键在于两方面:一是构建了一个新的综合性基准,包含两个真实世界恶劣条件下的评估数据集(351个视频片段和超过2500个目标掩码),并通过对现有视频目标分割(VOS)数据集应用多样且时间变化的损坏生成合成训练数据;二是提出了一种名为记忆-目标条件门控秩适应(Memory-object-conditioned Gated-rank Adaptation,MoGA)的方法,该方法通过利用跨帧内存中维护的目标特定表示来调节鲁棒化过程,从而有效处理目标特定的退化,并确保预测在时间上的一致性,使模型能够以时间一致的方式差异化处理每个跟踪目标。

链接: https://arxiv.org/abs/2605.12006
作者: Sohyun Lee,Yeho Gwon,Lukas Hoyer,Konrad Schindler,Christos Sakaridis,Suha Kwak
机构: POSTECH(浦项科技大学); Google(谷歌); ETH Zürich(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA’s efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at this https URL.

[CV-60] EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization

【速读】:该论文旨在解决文本引导的图像修复(text-guided inpainting)导致的愈加真实的图像伪造问题,特别是现有方法在跨域(cross-domain)检测中难以识别可疑信号、且无法本地化任意分辨率图像中的篡改区域。解决方案的关键是提出一个基于补丁(patch)的双分支框架EDGER:第一个分支为边缘引导分割(Edge-Guided Segmentation),通过频率边缘检测器(Frequency-based Edge Detector)强调篡改边界处的高频不一致性,并微调SegFormer以融合RGB和边缘特征来生成像素级掩码;第二个分支为合成热图(Synthetic Heatmapping),利用CLIP-ViT图像编码器结合LoRA微调,对完全合成的补丁进行分类,提供粗粒度的补丁级合成先验。两个分支互补,在部分篡改的补丁中由边缘引导分割锐化边界,在全合成的补丁中由合成热图标记,从而在不损失原生分辨率的情况下,实现对多兆像素图像的综合定位,并具备强跨域泛化能力。

链接: https://arxiv.org/abs/2605.12002
作者: Minh-Khoa Le-Phan,Minh-Hoang Le,Minh-Triet Tran,Trong-Le Do
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)

点击查看摘要

Abstract:Text-guided inpainting has made image forgery increasingly realistic, challenging both SID and IFL. However, existing methods often struggle to point out suspicious signals across domains. To address this problem, we propose EDGER, a patch-based, dual-branch framework that localizes manipulated regions in arbitrary resolution images without sacrificing native resolution. The first branch, Edge-Guided Segmentation, introduces a Frequency-based Edge Detector to emphasize high-frequency inconsistencies at manipulation boundaries, and fine-tunes a SegFormer to fuse RGB and edge features for pixel-level masks. Since edge evidence is most informative only when patches contain both authentic and manipulated pixels, we complement Edge-Guided Segmentation with a Synthetic Heatmapping branch, a classification-based localizer that fine-tunes a CLIP-ViT image encoder with LoRA to flag fully synthetic patches. Together, Synthetic Heatmapping provides coarse, patch-level synthetic priors, while Edge-Guided Segmentation sharpens boundaries within partially manipulated patches, yielding comprehensive localization. Evaluated in the MediaEval 2025, SynthIM challenge, Manipulated Region Localization Task’s setting, our approach scales to multi-megapixel imagery and exhibits strong cross-domain generalization. Extensive ablations highlight the complementary roles of frequency-based edge cues and patch-level synthetic priors in driving accurate, resolution-agnostic localization.
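摘要中的频率边缘检测器强调篡改边界处的高频不一致性。作为一个直观的示意(纯标准库,并非论文实现;玩具单通道图像为示例假设),可以用拉普拉斯高通滤波说明这一性质:它在像素统计突变处(例如拼接或修复区域的边界)响应最强:

```python
def laplacian(img):
    # 4-neighbour Laplacian high-pass filter (border left at zero)
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (img[y-1][x] + img[y+1][x]
                         + img[y][x-1] + img[y][x+1] - 4 * img[y][x])
    return out

# flat region abutting a brighter "inpainted" region
img = [[0.0, 0.0, 0.0, 9.0, 9.0] for _ in range(5)]
resp = laplacian(img)
print(resp[2])  # → [0.0, 0.0, 9.0, -9.0, 0.0]
```

可以看到高通响应集中在 0 与 9 的交界列上;论文中的检测器在学习到的特征空间中利用类似的高频线索来定位篡改边界。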

[CV-61] A Transfer Learning Evaluation of Deep Neural Networks for Image Classification

【速读】:该论文旨在解决在图像分类任务中如何选择最佳预训练模型以满足目标域需求的问题。其解决方案关键在于通过在ImageNet上预训练的11个模型上微调输出层和通用网络参数,并在五个不同目标域数据集上进行评估,同时测量准确率、准确率密度、训练时间和模型大小等指标,从而系统性地比较这些模型在单次和十次训练会话中的表现,以确定最优选择。

链接: https://arxiv.org/abs/2605.11989
作者: Nermeen Abou Baker,Nico Zengeler,Uwe Handmann
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published by Machine Learning and Knowledge Extraction Journal

点击查看摘要

Abstract:Transfer learning is a machine learning technique that uses previously acquired knowledge from a source domain to enhance learning in a target domain by reusing learned weights. This technique is ubiquitous because of its great advantages in achieving high performance while saving training time, memory, and effort in network design. In this paper, we investigate how to select the best pre-trained model that meets the target domain requirements for image classification tasks. In our study, we refined the output layers and general network parameters to apply the knowledge of eleven image processing models, pre-trained on ImageNet, to five different target domain datasets. We measured the accuracy, accuracy density, training time, and model size to evaluate the pre-trained models both in training sessions in one episode and with ten episodes.
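下面用纯标准库代码示意摘要所述的迁移学习做法:冻结"预训练"特征提取器,仅重训新的输出层。其中 W_frozen、玩具数据与逻辑回归输出头均为示例假设,并非论文所用的 11 个 ImageNet 预训练模型:

```python
import math

def features(x, W_frozen):
    # frozen "pretrained" extractor: fixed linear map + ReLU
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_frozen]

def train_head(data, W_frozen, lr=0.5, epochs=200):
    # logistic-regression output layer trained on frozen features
    w = [0.0] * len(W_frozen)
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            f = features(x, W_frozen)
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                                # log-loss gradient wrt z
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

W_frozen = [[1.0, 0.0], [0.0, 1.0]]        # stand-in for pretrained weights
data = [([0.0, 1.0], 0), ([1.0, 0.0], 1)]  # tiny "target domain"
w, b = train_head(data, W_frozen)

def predict(x):
    f = features(x, W_frozen)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

print(predict([1.0, 0.1]), predict([0.1, 1.0]))  # → 1 0
```

实际做法中,特征提取器是在 ImageNet 上预训练的深度网络,但"只更新输出层(及少量通用参数)"的结构与此一致。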

[CV-62] Optimizing 4D Wires for Sparse 3D Abstraction

【速读】:该论文旨在解决现有3D几何抽象方法中因使用多个独立曲线片段而导致的拓扑碎片化、结构不连贯以及物理可实现性受限的问题。其解决方案的关键在于提出一种统一的表示框架:采用单一连续的四维B样条曲线(spatial coordinates plus variable width, 即 (x,y,z,w))作为几何基元,通过强制全局拓扑连续性,将传统依赖于局部密度累积的草图过程重构为一个全局路由优化问题,从而引入更强的归纳偏置以生成更整洁、结构更连贯的抽象。此外,为实现基于梯度的优化,论文引入了具备可控投影误差的宽度可变曲线高效光栅化可微分渲染管线,使得该表示能适配Score Distillation Sampling (SDS) 或 CLIP 等现代引导信号,从而在图像到3D抽象、多视图线艺术生成等任务中取得优于离散曲线集合的语义保真度和结构连贯性。

链接: https://arxiv.org/abs/2605.11977
作者: Dong-Yi Wu,Tong-Yee Lee
机构: National Cheng Kung University (国立成功大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a unified framework for 3D geometric abstraction using a single continuous 4D wire, parameterized as a B-spline with spatial coordinates and variable width (x,y,z,w) . Existing approaches typically represent shapes as collections of many independent curve segments, which often leads to fragmented structures and limited physical realizability. In contrast, we show that a single continuous spline is sufficiently expressive to capture complex volumetric forms while enforcing global topological coherence. By imposing continuity, our method transforms 3D sketching from a local density-accumulation process into a global routing problem, providing a strong inductive bias toward cleaner aesthetics and improved structural coherence. To enable gradient-based optimization, we introduce a differentiable rendering pipeline that efficiently rasterizes variable-width curves with bounded projection error. This formulation supports robust optimization using modern guidance signals such as Score Distillation Sampling (SDS) or CLIP. We demonstrate applications including image-to-3D abstraction, multi-view wire art generation, and differentiable stylized surface filling. Experiments show that our unified representation produces structures with higher semantic fidelity and improved structural coherence compared to approaches based on collections of discrete curves.
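论文的几何基元是控制点携带 (x, y, z, w) 四个通道的单条 B 样条。下面用标准库示意均匀三次 B 样条单段的求值(玩具控制点为示例假设,并非论文实现);由于基函数构成单位分解,恒定的宽度通道 w 会被精确保持:

```python
def cubic_bspline(ctrl, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1]
    from 4 consecutive 4D control points (x, y, z, w)."""
    b = [
        (1 - t) ** 3 / 6.0,
        (3 * t**3 - 6 * t**2 + 4) / 6.0,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0,
        t**3 / 6.0,
    ]
    return tuple(sum(bi * p[k] for bi, p in zip(b, ctrl)) for k in range(4))

ctrl = [(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 1.0),
        (1.0, 1.0, 0.0, 1.0), (0.0, 1.0, 0.0, 1.0)]
x, y, z, w = cubic_bspline(ctrl, 0.5)
print(round(x, 3), round(y, 3), round(z, 3), round(w, 3))  # → 0.958 0.5 0.0 1.0
```

可微分渲染器正是沿参数 t 采样这样的 4D 点,再按 w 给出的局部宽度光栅化线条。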

[CV-63] H2G: Hierarchy-Aware Hyperbolic Grouping for 3D Scenes

【速读】:该论文旨在解决缺乏语义标签或固定词汇的情况下,实现从细粒度对象部分到完整对象的多粒度3D场景分组问题。其核心挑战在于如何将2D基础模型提供的线索转化为连贯的分层监督,并将这种层次结构嵌入到3D表示中。解决方案的关键是提出了H2G(Hyperbolic affinity field for hierarchical 3D grouping),即一个双曲亲和场。通过利用Dasgupta目标从2D基础模型的亲和力中推导出语义组织的树形监督,然后将该监督蒸馏到一个单一的Lorentz双曲特征场中,利用双曲几何的树状分支结构特性。同时,设计了一个层次感知目标,将该场与细粒度分配、粗粒度对象结构、紧凑特征聚类和最低公共祖先(LCA)排序对齐,从而在一个特征空间中表示多个分组级别,实现了基于2D基础模型知识的语义分层分组。

链接: https://arxiv.org/abs/2605.11967
作者: ByungHa Ko,Youngmin Lee,Dong Hwan Kim
机构: Department of Computer Science and Engineering, Korea University (高丽大学计算机科学与工程系); Intelligence and Interaction Research Center, Korea Institute of Science and Technology (韩国科学技术研究院智能与交互研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hierarchical 3D grouping aims to recover scene groups across multiple granularities, from fine object parts to complete objects, without relying on semantic labels or a fixed vocabulary. The main challenge is to transform 2D foundation-model cues into coherent hierarchy supervision and embed that hierarchy in a 3D representation. We propose H2G, a hyperbolic affinity field for hierarchical 3D grouping. Our method derives semantically organized tree supervision by interpreting foundation-model affinities through Dasgupta’s objective for similarity-based hierarchical clustering. This supervision is distilled into a single Lorentz hyperbolic feature field, whose geometry is well suited for tree-like branching structures. A hierarchy-aware objective aligns the field with fine-level assignments, coarse object structure, compact feature clusters, and LCA (Lowest Common Ancestor) ordering. This formulation represents multiple grouping levels in one feature space, enabling semantic hierarchical grouping grounded in 2D foundation-model knowledge.
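作为背景示意(非论文代码),洛伦兹(双曲面)模型中的点满足 Minkowski 内积 ⟨p, p⟩_L = −1,测地距离 d(u, v) = arcosh(−⟨u, v⟩_L) 随点远离原点呈指数增长,这正是它适合表示树状层次结构的原因。玩具坐标均为示例假设:

```python
import math

def lorentz_inner(u, v):
    # Minkowski inner product: -u0*v0 + u1*v1 + ...
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))

def lift(x):
    # lift a Euclidean point onto the hyperboloid <p, p>_L = -1
    return (math.sqrt(1.0 + sum(xi * xi for xi in x)),) + tuple(x)

def lorentz_dist(u, v):
    # clamp guards against tiny floating-point violations of -<u,v> >= 1
    return math.acosh(max(1.0, -lorentz_inner(u, v)))

a, b = lift((0.1, 0.0)), lift((2.0, 0.0))
print(round(lorentz_dist(a, a), 6), round(lorentz_dist(a, b), 3))
```

论文将 2D 基础模型亲和力导出的树形监督蒸馏进这样的洛伦兹特征空间中。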

[CV-64] What Does It Mean for a Medical AI System to Be Right?

【速读】:该论文试图解决的核心问题是:在临床诊断场景(如基于数字化骨髓涂片自动分类浆细胞以诊断多发性骨髓瘤)中,如何恰当地定义和评估医疗AI系统的“正确性”。论文指出现有做法往往将正确性简化为单一基准性能指标,这种还原论视角忽略了临床真实情境中的复杂因素。解决方案的关键在于将医疗AI的正确性重构为一个多维概念体系,具体包含四个相互关联的维度:一是依赖专家标注数据集固有的不稳定性(ground truth instability),二是克服模型输出过度自信导致的不可解释性(opacity),三是批判标准临床评估指标在实践中的不充分性(inadequacy),四是警惕在时间压力下自动化偏见(automation bias)对责任分配(accountability distribution)的侵蚀。通过整合科学哲学与研究伦理的视角,论文主张正确性应当同时兼顾数据质量、模型可解释性、指标临床意义及人机协作中的责任机制,从而超越单纯的技术性能评估。

链接: https://arxiv.org/abs/2605.11963
作者: Antony Gitau
机构: University of South-Eastern Norway (挪威东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Part of a PhD ethics course

点击查看摘要

Abstract:This paper examines what it means for a medical AI system to be right by grounding the question in a specific clinical context: the automatic classification of plasma cells in digitized bone marrow smears for the diagnosis of multiple myeloma. Drawing on philosophy of science and research ethics, the paper argues that correctness in medical AI is not a singular property reducible to benchmark performance, but a multi-dimensional concept involving the availability of expertly labeled medical datasets, the explainability and interpretability of model outputs, the clinical meaningfulness of evaluation metrics, and the distribution of accountability in human-AI workflows. As such, the paper develops this argument through four interrelated themes: the instability of ground truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and the risk of automation bias in time-pressured clinical settings.

[CV-65] Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

【速读】:该论文旨在解决当前视觉大语言模型(Visual Large Language Models, VLLMs)在面对历史书写系统(如汉字)的连续形态演变时,其感知鲁棒性尚未被系统评估的问题。现有古代文本数据集多局限于孤立历史时期,无法反映跨越数千年的系统性视觉分布偏移。为解决此问题,论文引入了Chronicles-OCR基准数据集,其关键解决方案包括:首先,与顶级机构领域专家合作构建包含2800张严格平衡图像的数据集,覆盖从龟甲到纸本书法等多种物理介质,并涵盖汉字完整演变轨迹(即“七种汉字书体”);其次,提出一种分阶段自适应标注范式(Stage-Adaptive Annotation Paradigm),以适应不同历史阶段剧烈的形态和拓扑变化;最后,基于该范式设计了四个定量任务——跨时期字符检测、基于视觉指向的细粒度古文字识别、古代文本解析和书体分类——通过将视觉感知与语义推理相隔离,从而提供一个权威平台以暴露当前VLLMs在演化感知上的局限性。

链接: https://arxiv.org/abs/2605.11960
作者: Gengluo Li,Shangpin Peng,Xingyu Wan,Chengquan Zhang,Hao Feng,Xin Xu,Pian Wu,Bang Li,Zengmao Ding,Yongge Liu,Yipei Ye,Yang Yang,Zhan Shu,Guojun Yan,Zhe Li,Can Ma,Weiping Wang,Yu Zhou,Han Hu
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Tencent (腾讯); Anyang Normal University (安阳师范学院); The Palace Museum (故宫博物院); Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Large Language Models (VLLMs) have achieved remarkable success in modern text-rich visual understanding. However, their perceptual robustness in the face of the continuous morphological evolution of historical writing systems remains largely unexplored. Existing ancient text datasets typically focus on isolated historical periods, failing to capture the systematic visual distribution shifts spanning thousands of years. To bridge this gap and empower Digital Humanities, we introduce Chronicles-OCR, the first comprehensive benchmark specifically designed to evaluate the cross-temporal visual perception capabilities of VLLMs across the complete evolutionary trajectory of Chinese characters, known as the Seven Chinese Scripts. Curated in collaboration with top-tier institutional domain experts, the dataset comprises 2,800 strictly balanced images encompassing highly diverse physical media, ranging from tortoise shells to paper-based calligraphy. To accommodate the drastic morphological and topological variations across different historical stages, we propose a novel Stage-Adaptive Annotation Paradigm. Based on this, Chronicles-OCR formulates four rigorous quantitative tasks: cross-period character spotting, fine-grained archaic character recognition via visual referring, ancient text parsing, and script classification. By isolating visual perception from semantic reasoning, Chronicles-OCR provides an authoritative platform to expose the limitations of current VLLMs, paving the way for robust, evolution-aware historical text perception. Chronicles-OCR is publicly available at this https URL.

[CV-66] Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models

【速读】:该论文旨在解决在类别不平衡数据集中,使用提示学习(prompt learning)微调预训练视觉语言模型(VLMs)时,尾部类别(tail classes)的可区分性下降的问题,同时避免牺牲模型对未知类别的泛化能力。其解决方案的关键在于提出了聚类感知神经崩溃提示微调(cluster-aware neural collapse prompt tuning,CPT),包含两个核心设计:首先,通过从预训练VLM中挖掘语义分配并映射到提示微调特征,构建一个聚类不变空间(cluster-invariant space),在该空间中计算聚类级别的边界并将约束限制在局部邻域,从而减少对预训练VLM全局语义结构的干扰;其次,引入基于神经崩溃(neural collapse)的可区分性优化机制,通过文本等角紧框架(Equiangular Tight Frame,ETF)分离损失、类内收敛损失和旋转稳定损失三个损失函数协同作用,优化簇内几何结构,实现更好的类间分离和类内对齐。

链接: https://arxiv.org/abs/2605.11939
作者: Boyang Guo,Liang Li,Lin Peng,Yuhan Gao,Xichun Sheng,Chenggang Yan
机构: Hangzhou Dianzi University (杭州电子科技大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); The Hong Kong Polytechnic University (香港理工大学); Macao Polytechnic University (澳门理工大学); Zhejiang Provincial Key Laboratory of Low Altitude Ubiquitous Networking Technology, HDU (浙江省低空泛在互联技术重点实验室,杭州电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization. First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features. This computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM. Second, we introduce neural-collapse-driven discriminability optimization with three losses: textual Equiangular Tight Frame (ETF) separation loss, class-wise convergence loss, and rotation stabilization loss. These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment. Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes.
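摘要中的文本 ETF 分离损失以"单纯形等角紧框架"为几何目标:K 个单位向量两两余弦恰为 −1/(K−1),即神经崩缩理论中的理想类原型布局。下面给出该框架的一个标准构造示意(非论文实现):

```python
import math

def simplex_etf(K):
    # rows of sqrt(K/(K-1)) * (I - (1/K) * ones) are K unit vectors
    scale = math.sqrt(K / (K - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / K) for j in range(K)]
            for i in range(K)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u)) / math.sqrt(sum(b * b for b in v))

M = simplex_etf(4)
print(round(cosine(M[0], M[0]), 6), round(cosine(M[0], M[1]), 6))  # → 1.0 -0.333333
```

K=4 时任意两个不同原型的余弦为 −1/3,这是单位球面上 K 个向量可达到的最大两两分离度。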

[CV-67] Interactive State Space Model with Cross-Modal Local Scanning for Depth Super-Resolution ISCAS2026

【速读】:该论文旨在解决引导深度超分辨率(GDSR)中现有方法难以在RGB和深度模态间建立高效且语义交互的联合表征问题。现有方法要么独立建模每个模态,忽略跨模态语义关联,要么依赖计算复杂度为二次的注意力机制,限制了交互效率。解决方案的关键在于提出一个以交互状态空间模型(Interactive State Space Model)为核心的GDSR框架,其中包含:一种跨模态局部扫描机制,实现RGB和深度特征间的细粒度语义交互;利用Mamba架构实现线性复杂度的全局建模;以及一个跨模态匹配变换模块,通过利用两个模态的典型特征增强交互建模质量。

链接: https://arxiv.org/abs/2605.11934
作者: Chen Wu,Ling Wang,Zhuoran Zheng,Xiangyu Chen,Jingyuan Xia,Weidong Jiang,Jiantao Zhou
机构: National University of Defense Technology (国防科技大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Sun Yat-sen University (中山大学); Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信人工智能研究院(TeleAI)); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ISCAS2026

点击查看摘要

Abstract:Guided depth super-resolution (GDSR) reconstructs HR depth maps from LR inputs with HR RGB guidance. Existing methods either model each modality independently or rely on computationally expensive attention mechanisms with quadratic complexity, hindering the establishment of efficient and semantically interactive joint representations. In this paper, we observe that feature maps from different modalities exhibit semantic-level correlations during feature extraction. This motivates us to develop a more flexible approach enabling dense, semantically-aware deep interactions between modalities. To this end, we propose a novel GDSR framework centered around the Interactive State Space Model. Specifically, we design a cross-modal local scanning mechanism that enables fine-grained semantic interactions between RGB and depth features. Leveraging the Mamba architecture, our framework achieves global modeling with linear complexity. Furthermore, a cross-modal matching transform module is introduced to enhance interactive modeling quality by utilizing representative features from both modalities. Extensive experiments demonstrate competitive performance against state-of-the-art methods.

[CV-68] Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training ICML2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在自我改进训练(self-improvement training)中存在的两个问题:数据不平衡(data imbalance)——简单样本被过度训练而关键困难样本训练不足,以及语言先验偏差(language prior bias)——模型过度依赖语言先验而忽略视觉线索。解决方案的关键是提出VISTA(Vision-Aware Self-Improvement Training Framework),其核心包括两个创新:一是前缀重采样策略(prefix resampling strategy),通过重用部分正确的推理轨迹来高效收集训练数据;二是视觉感知注意力分数(vision-aware attention score),用于量化模型对视觉信息的关注程度,从而指导模型更均衡地利用多模态信息,最终提升多模态推理性能。

链接: https://arxiv.org/abs/2605.11931
作者: Qihuang Zhong,Liang Ding,Wenjie Xuan,Juhua Liu,Bo Du,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Post-training with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse the partial correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model’s focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances the multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.
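视觉感知注意力分数用于量化模型对视觉信息的关注程度。其一个可能的简化形式(此公式为本文示意性假设,并非论文给出的定义)是注意力质量落在图像 token 上的占比,占比过低即提示模型在依赖语言先验:

```python
def vision_attention_score(attn_weights, is_visual):
    # share of attention mass landing on image tokens
    vis = sum(w for w, v in zip(attn_weights, is_visual) if v)
    return vis / sum(attn_weights)

attn = [0.05, 0.10, 0.25, 0.40, 0.20]     # attention over 5 tokens
mask = [True, True, False, False, False]  # first two tokens are image patches
print(round(vision_attention_score(attn, mask), 2))  # → 0.15
```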

[CV-69] RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation CVPR2026

【速读】:该论文解决的是在序列生成中平衡叙事动态性(narrative dynamism)与多角色连贯性(multi-character coherence)的挑战,现有方法往往导致角色身份丢失或故事停滞。解决方案的关键在于提出了RealDiffusion统一框架,其核心包括:利用热扩散(heat diffusion)作为耗散先验,沿序列平均相邻特征并移除主体区域的高频噪声,从而抑制属性漂移并稳定跨帧身份;同时引入区域感知随机过程(region-aware stochastic process)施加小扰动以探索邻近模式并防止坍缩,从而保持姿态变化和场景演化。此外,通过一个轻量级、无需训练的物理信息注意力机制(Physics-informed Attention),在推理时将可控物理先验注入自注意力层,在不抑制提示驱动变化的前提下正则化时空关系。

链接: https://arxiv.org/abs/2605.11927
作者: Qi Zhao,Jun Chen,Ivor Tsang,Guang Dai
机构: Xi’an Jiaotong University (西安交通大学); Zhejiang Normal University (浙江师范大学); SGIT AI Lab, State Grid Corporation of China (国网信通产业集团AI实验室); Centre for Frontier AI Research (CFAR), A*STAR (新加坡科技研究局前沿人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026

点击查看摘要

Abstract:While modern diffusion models excel at generating diverse single images, extending this to sequential generation reveals a fundamental challenge: balancing narrative dynamism with multi-character coherence. Existing methods often falter at this trade-off, leading to artifacts where characters lose their identity or the story stagnates. To resolve this critical tension, we introduce RealDiffusion, a unified framework designed to reconcile robust coherence with narrative dynamism. Heat diffusion serves as a dissipative prior that averages neighboring features along the sequence and removes high-frequency noise within the subject region. This suppresses attribute drift and stabilizes identity across frames. A region-aware stochastic process then introduces small perturbations that explore nearby modes and prevent collapse so the story maintains pose change and scene evolution. We thus introduce a lightweight, training-free Physics-informed Attention mechanism that injects controllable physical priors into the self-attention layers during inference. By modeling feature evolution as a configurable physical system, our method regularizes spatio-temporal relationships without suppressing intentional, prompt-driven changes. Extensive experiments demonstrate that RealDiffusion achieves substantial gains in character coherence while preserving narrative dynamism, outperforming state-of-the-art approaches. Code is available at this https URL.
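摘要中"沿序列平均相邻特征、去除高频噪声"的热扩散先验,可以用一维热方程的一个显式欧拉步示意(标量玩具特征与步长 alpha 均为示例假设,并非论文实现):

```python
def heat_step(seq, alpha=0.25):
    # one explicit Euler step of the 1D heat equation along the sequence
    out = []
    for i, x in enumerate(seq):
        left = seq[max(i - 1, 0)]
        right = seq[min(i + 1, len(seq) - 1)]
        out.append(x + alpha * (left - 2 * x + right))  # discrete Laplacian
    return out

frames = [0.0, 1.0, 0.0, 1.0, 0.0]  # high-frequency "attribute jitter"
for _ in range(10):
    frames = heat_step(frames)
print([round(f, 3) for f in frames])  # oscillation flattened toward the mean
```

由于每步更新是相邻值的凸组合,帧间振荡被逐步抹平而数值范围不会越界,这正对应摘要所说的抑制属性漂移、稳定跨帧身份。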

[CV-70] Vector Scaffolding: Inter-Scale Orchestration for Differentiable Image Vectorization

【速读】:该论文试图解决可微分矢量图形(Differentiable Vector Graphics)优化中存在的拓扑坍塌(topology collapse)问题——即大量随机初始化的曲线在平面像素匹配的优化框架下无序竞争,导致宏观结构被内部高频噪声扭曲,形成冗余且不可编辑的“多边形汤”(polygon soup)。解决方案之关键在于提出一种名为Vector Scaffolding的层次化优化框架,其核心包括:首先,识别出拓扑坍塌的根本原因是面积与边界梯度之间的数学不平衡,并通过引入内部梯度聚合(Interior Gradient Aggregation)来稳定多尺度曲线混合的学习动态;其次,利用渐进分层(Progressive Stratification)和快速膨胀调度(Rapid Inflation Scheduling)策略,以极高的学习率(达到常规的50倍)逐步密集化矢量图元,从而实现从平面像素匹配到结构化拓扑构建的转变。

链接: https://arxiv.org/abs/2605.11913
作者: Jaerin Lee,Kanggeon Lee,Kyoung Mu Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 12 figures

点击查看摘要

Abstract:Differentiable vector graphics have enabled powerful gradient-based optimization of vector primitives directly from raster images. However, existing frameworks formulate this as a flat optimization problem, forcing hundreds to thousands of randomly initialized curves to blindly compete for pixel-level error reduction. This disordered optimization leads to topology collapse, where macroscopic structures are distorted by internal high-frequency noise, resulting in a redundant and uneditable “polygon soup” that limits practical editability. To address this limitation, we propose Vector Scaffolding, a novel hierarchical optimization framework that shifts from flat pixel-matching to structured topological construction tailored for vector graphics. By identifying a key cause of topology collapse as the mathematical imbalance between area and boundary gradients, we introduce Interior Gradient Aggregation to stabilize the learning dynamics of multi-scale curve mixtures. Upon this stabilized landscape, we employ Progressive Stratification and Rapid Inflation Scheduling to progressively densify vector primitives with extremely high learning rates (×50). Experiments demonstrate that our approach accelerates optimization by 2.5× while simultaneously improving PSNR by up to 1.4 dB over the previous state of the art.

[CV-71] Beyond Point-wise Neural Collapse: A Topology-Aware Hierarchical Classifier for Class-Incremental Learning ICML2026

【速读】:该论文解决的是类增量学习(Class-Incremental Learning, CIL)中,最近类均值(Nearest Class Mean, NCM)分类器因非线性特征漂移和非理想训练条件导致类别呈现复杂流形而非神经坍缩(Neural Collapse, NC)理论所假设的单一坍缩点,使得基于点均值的NCM分类器表现次优的问题。解决方案的关键在于提出了层次聚类SOINN(Hierarchical-Cluster SOINN, HC-SOINN)分类器,通过“局部到全局”(local-to-global)表征来捕获类别流形的拓扑结构,并配套引入结构-拓扑对齐残差(Structure-Topology Alignment via Residuals, STAR)方法,利用细粒度的逐点轨迹跟踪机制主动对学习到的拓扑结构进行变形,从而精确适应复杂的非线性特征漂移。理论分析及Procrustes距离实验验证了该框架对流形变形的鲁棒性,且将HC-SOINN取代七个最先进方法中的原始分类器后均取得了一致性性能提升。

链接: https://arxiv.org/abs/2605.11904
作者: Huiyu Yi,Zhiming Xu,Dunwei Tu,Zhicheng Wang,Baile Xu,Furao Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted by ICML2026

点击查看摘要

Abstract:The Nearest Class Mean (NCM) classifier is widely favored in Class-Incremental Learning (CIL) for its superior resistance to catastrophic forgetting compared to Fully Connected layers. While Neural Collapse (NC) theory supports NCM’s optimality by assuming features collapse into single points, non-linear feature drift and insufficient training in CIL often prevent this ideal state. Consequently, classes manifest as complex manifolds rather than collapsed points, rendering the single-point NCM suboptimal. To address this, we propose Hierarchical-Cluster SOINN (HC-SOINN), a novel classifier that captures the topological structure of these manifolds via a ``local-to-global’’ representation. Furthermore, we introduce Structure-Topology Alignment via Residuals (STAR) method, which employs a fine-grained pointwise trajectory tracking mechanism to actively deform the learned topology, allowing it to adapt precisely to complex non-linear feature drift. Theoretical analysis and Procrustes distance experiments validate our framework’s resilience to manifold deformations. We integrated HC-SOINN into seven state-of-the-art methods by replacing their original classifiers, achieving consistent improvements that highlight the effectiveness and robustness of our approach. Code is available at this https URL.
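作为背景,论文改进的最近类均值(NCM)分类器将每个类压缩为特征均值这一"单点",查询取最近均值的标签;HC-SOINN 则用拓扑感知的节点层次替代这一单点。以下为 NCM 基线的纯标准库示意(玩具数据为示例假设):

```python
import math

def class_means(features, labels):
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        s = sums.setdefault(y, [0.0] * len(f))
        sums[y] = [a + b for a, b in zip(s, f)]
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def ncm_predict(means, query):
    # label of the nearest class mean (Euclidean distance)
    return min(means, key=lambda y: math.dist(means[y], query))

feats = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
labels = ["cat", "cat", "dog", "dog"]
means = class_means(feats, labels)
print(ncm_predict(means, [0.1, 0.0]), ncm_predict(means, [1.0, 0.9]))  # → cat dog
```

当类在特征空间中呈复杂流形而非坍缩点时,这种单均值表示即变得次优,这正是论文的出发点。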

[CV-72] Mobile Traffic Camera Calibration from Road Geometry for UAV-Based Traffic Surveillance

【速读】:该论文试图解决原始无人机视频因透视成像坐标系而非稳定的公制(metric)道路坐标系,导致难以直接用于交通分析的问题。解决方案之关键在于:利用可见的道路几何特征(如车道标线、路缘和人行横道)估计从图像坐标系到公制地面坐标系的道路平面单应性(homography),从而将车辆观测数据通过预估的地面接触点投影至鸟瞰视图(bird’s-eye-view,BEV)表示,最终在道路平面上恢复车辆的方向、速度、航向角及动态三维边界框(3D cuboid)等轨迹信息。

链接: https://arxiv.org/abs/2605.11900
作者: Alexey Popov,Natalia Trukhina,Vadim Vashkelis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unmanned aerial vehicles (UAVs) can provide flexible traffic surveillance where fixed roadside cameras are unavailable, costly, or impractical. However, raw UAV video is difficult to use for traffic analytics because vehicle motion is observed in perspective image coordinates rather than in a stable metric road coordinate system. This paper presents a lightweight pipeline for converting monocular oblique UAV traffic video into a local metric bird’s-eye-view (BEV) representation. Visible road geometry, including lane markings, road borders, and crosswalks, is used to estimate a road-plane homography from image coordinates to metric ground-plane coordinates. Vehicle observations from dataset annotations or detectors are then projected to BEV using estimated ground contact points. The resulting trajectories support estimation of vehicle direction, speed, heading, and dynamic 3D cuboids on the road plane. We evaluate the pipeline on UAVDT using ground-truth annotations to isolate calibration and geometric reconstruction from detector and tracker errors. For sequence M1401, 40 sampled frames from img000001-img000196 produce 632 metric cuboid instances across 23 tracks. Results show that road-geometry calibration can transform monocular UAV footage into interpretable traffic-camera-style analytics, including BEV tracks and synchronized 3D cuboid visualizations. They also reveal key limitations: far-field vehicles are sensitive to homography errors, manual validation is currently more reliable than fully automatic calibration, and the single-plane assumption limits performance in non-planar or ambiguous road regions. The proposed pipeline provides a practical foundation for deployable UAV traffic cameras and future real-time traffic digital-twin systems.
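该流程的核心运算是用 3×3 道路平面单应性 H 把图像像素投影到公制 BEV 坐标。下面是这一投影的示意(H 为假设的玩具矩阵,仅含缩放;由车道标线估计出的真实 H 还会包含透视与剪切项):

```python
def apply_homography(H, u, v):
    # project pixel (u, v) through the 3x3 road-plane homography H
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w  # homogeneous normalization

# toy H: 0.5 metres per pixel, no perspective terms
H = [[0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0],
     [0.0, 0.0, 1.0]]
print(apply_homography(H, 200.0, 400.0))  # → (100.0, 200.0)
```

车辆速度随后可由相邻帧投影点的 BEV 位移除以时间间隔得到,航向由位移方向给出。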

[CV-73] Few-Shot Synthetic Data Generation with Diffusion Models for Downstream Vision Tasks CVPR2026

【速读】:该论文旨在解决视觉识别中因类别不平衡(class imbalance)导致的罕见类(rare class)识别性能低下问题,尤其在安全关键领域,正样本获取成本高昂且罕见事件本身在数据集中代表性不足。解决方案的关键是提出一种轻量级的合成数据增强pipeline:通过低秩适配(Low-Rank Adaptation, LoRA)在仅20-50张罕见类真实图像上微调预训练的扩散模型(diffusion model),并利用该模型生成合成样本用于训练。该方法通过系统地调整合成与真实样本的比例,在两个结构差异显著的领域(NIH ChestX-ray14胸部X光病理分类和Magnetic Tile Defect数据集工业表面裂纹检测)中验证了有效性——在仅用真实图像评估的独立测试集上,适度的合成增强可一致提升罕见类的召回率与F1分数,且随着合成比例增加呈现出收益递减效应。

链接: https://arxiv.org/abs/2605.11898
作者: Daniil Dushenev,Nazariy Karpov,Daniil Zinovjev,Alexander Gorin,Konstantin Kulikov
机构: National University of Science and Technology MISIS (国家科技大学MISIS)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures, 1 table. Accepted at SynData4CV Workshop @ CVPR 2026

点击查看摘要

Abstract:Class imbalance is a persistent challenge in visual recognition, particularly in safety-critical domains where collecting positive examples is expensive and rare events are inherently underrepresented. We propose a lightweight synthetic data augmentation pipeline that fine-tunes a LoRA adapter on as few as 20-50 real images of a rare class and uses a pretrained diffusion model to generate synthetic samples for training. We systematically vary the synthetic-to-real ratio and evaluate the approach across two structurally different domains: chest X-ray pathology classification (NIH ChestX-ray14) and industrial surface crack detection (Magnetic Tile Defect dataset). All evaluations are performed on held-out sets of real images only. Across both domains, synthetic augmentation consistently improves rare-class recall and F1 compared to training with real data alone. Performance improves with moderate synthetic augmentation and shows diminishing returns as the synthetic ratio increases. These results suggest that LoRA-adapted diffusion models provide a simple and scalable mechanism for augmenting rare classes, enabling effective learning in data-scarce scenarios across heterogeneous visual domains.
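LoRA 微调的核心思想可以用一个极简示意说明(玩具矩阵为示例假设,并非论文对扩散模型注意力权重的实际适配):冻结权重 W 叠加可训练的低秩乘积 B·A,需要学习的参数仅 r·(d_in + d_out) 个:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_merge(W, A, B, alpha=1.0, rank=1):
    # effective weight: W + (alpha / rank) * B @ A, with W kept frozen
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in)
    s = alpha / rank
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight
A = [[1.0, 1.0]]              # trainable r=1 down-projection
B = [[0.5], [0.0]]            # trainable r=1 up-projection
print(lora_merge(W, A, B))  # → [[1.5, 0.5], [0.0, 1.0]]
```

正因为可训练参数极少,20-50 张罕见类真实图像就足以让适配器学到该类别的外观分布。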

[CV-74] Learning Subspace-Preserving Sparse Attention Graphs from Heterogeneous Multiview Data

【速读】:该论文试图解决现有无监督迁移学习方法在处理异构多视图数据(heterogeneous multiview data)时,无法忠实恢复内在子空间结构(intrinsic subspace structures)的问题,具体挑战在于构建能够保留这些子空间结构并实现跨视图语义对齐的稀疏相似度图。解决方案的关键在于提出了一种稀疏注意力图学习(Sparse Attention Graph Learning, SAGL)方法,其核心包括:首先,引入双线性注意力分解(bilinear attention factorization)机制,捕获高维特征间的非对称相似性,打破传统表示学习的对称性瓶颈;其次,采用动态稀疏门控机制(dynamic sparsity gating mechanism)预测特征特定的压缩因子,自适应控制邻居的拓扑贡献;最后,通过α-entmax实现结构化稀疏投影(structured sparse projection),生成每个视图的子空间保持稀疏注意力图(subspace-preserving sparse attention graphs),并利用这些视图特定图进行稀疏信息聚合,从而获得判别性表征。此外,论文还提供了连接可微稀疏注意力和概率单纯形约束的严格理论分析。

链接: https://arxiv.org/abs/2605.11881
作者: Jie Chen,Yuanbiao Gou,Chuanbin Liu,Zhu Wang,Xi Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages

点击查看摘要

Abstract:The high-dimensional features extracted from large-scale unlabeled data via various pretrained models with diverse architectures are referred to as heterogeneous multiview data. Most existing unsupervised transfer learning methods fail to faithfully recover intrinsic subspace structures when exploiting complementary information across multiple views. Therefore, a fundamental challenge involves constructing sparse similarity graphs that preserve these underlying subspace structures for achieving semantic alignment across heterogeneous views. In this paper, we propose a sparse attention graph learning (SAGL) method that learns subspace-preserving sparse attention graphs from heterogeneous multiview data. Specifically, we introduce a bilinear attention factorization scheme to capture asymmetric similarities among the high-dimensional features, which breaks the symmetry bottleneck that is inherent in the traditional representation learning techniques. A dynamic sparsity gating mechanism then predicts a feature-specific compression factor for adaptively controlling the topological contributions of neighbors. Furthermore, we employ a structured sparse projection via α-entmax to generate subspace-preserving sparse attention graphs for individual views. SAGL leverages these view-specific graphs to conduct sparse information aggregation, yielding discriminative representations for multiview learning tasks. In addition, we provide a rigorous theoretical analysis that bridges differentiable sparse attention and probability simplex constraints. Extensive experiments conducted on multiple benchmark datasets demonstrate that SAGL consistently outperforms the state-of-the-art unsupervised transfer learning approaches.
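摘要中的 α-entmax 是 softmax(α=1)与 sparsemax(α=2)之间的插值族;下面用纯 Python 实现 α=2 的 sparsemax 特例,演示其为何能给弱项分配精确的零权重、从而得到稀疏注意力图(示意代码,非作者实现):

```python
def sparsemax(z):
    """Euclidean projection of logits z onto the probability simplex
    (the alpha = 2 case of alpha-entmax). Unlike softmax it assigns
    exact zeros to weak entries, which is what makes the learned
    attention graphs sparse."""
    zs = sorted(z, reverse=True)
    cum, k, cum_k = 0.0, 0, 0.0
    for i, v in enumerate(zs, start=1):
        cum += v
        if 1 + i * v > cum:       # support condition for sparsemax
            k, cum_k = i, cum
    tau = (cum_k - 1.0) / k       # threshold shared by the support
    return [max(v - tau, 0.0) for v in z]

print(sparsemax([2.0, 1.0, -1.0]))   # [1.0, 0.0, 0.0] -- fully sparse
print(sparsemax([1.5, 1.0, -1.0]))   # [0.75, 0.25, 0.0]
```

输出始终位于概率单纯形上(非负、和为 1),这正对应论文所述"可微稀疏注意力与概率单纯形约束"之间的联系。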

[CV-75] h-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

【速读】:该论文旨在解决预训练流匹配视频生成器中无训练相机控制(training-free camera control)这一部分观测逆问题(partial-observation inverse problem)所面临的困境:即深度扭曲引导视频(depth-warped guidance video)仅提供潜变量子集上的噪声证据,采样器需在轨迹遵循与视觉质量之间取得平衡,而现有启发式引导强度调参缺乏鲁棒性。解决方案的关键在于对采样器进行结构性改变:在每步外循环硬替换引导(hard-replacement guidance)后,增加一个内循环的模块条件伪吉布斯细化(block-conditional pseudo-Gibbs refinement),对同一噪声水平下未观测补集进行迭代更新,并证明该过程可收敛至部分观测条件数据分布。为加速高维视频潜变量的收敛,进一步利用其条件局部性,将未观测补集划分为3D补丁(3D patches),每个补丁由自定义混合指示器(custom mixing indicator)跟踪,自适应冻结已收敛的补丁,从而在保证收敛性的同时提升效率。

链接: https://arxiv.org/abs/2605.11871
作者: Yuzhu Wang,Xi Ye,Duo Su,Yangyang Xu,Jun Zhu
机构: Department of Computer Science and Technology, Tsinghua University (清华大学计算机科学与技术系); South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance the trade-off between trajectory adherence and visual quality and the heuristic guidance-strength tuning lacks robustness. We propose h-control, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, h-control attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.
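下面的玩具示例示意"外循环硬替换 + 内循环逐点细化"的不动点结构:观测位点每轮被硬替换,未观测位点由其邻居的条件均值(此处取一维平均作为"条件分布"的简化)迭代更新直至收敛。真实方法是在视频潜变量上按噪声水平运行、并按 3D 补丁自适应冻结,此处仅为一维结构示意,非论文采样器:

```python
def refine(vals, observed, sweeps=500, tol=1e-9):
    """Toy analogue of the inner-loop refinement: observed sites are
    hard-replaced each sweep; each unobserved site is updated from its
    current neighbours until no site moves more than tol."""
    vals = list(vals)
    for _ in range(sweeps):
        for i, v in observed.items():          # outer hard replacement
            vals[i] = v
        max_change = 0.0
        for i in range(1, len(vals) - 1):      # inner pseudo-Gibbs sweep
            if i in observed:
                continue
            new = 0.5 * (vals[i - 1] + vals[i + 1])
            max_change = max(max_change, abs(new - vals[i]))
            vals[i] = new
        if max_change < tol:                   # everything converged
            break
    return vals

out = refine([0.0] * 5, observed={0: 0.0, 4: 4.0})
print([round(v, 4) for v in out])  # converges to [0.0, 1.0, 2.0, 3.0, 4.0]
```

观测端点充当引导证据,未观测补集被细化到与之一致的条件解;论文的收敛性结论即针对这类部分观测条件分布。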

[CV-76] FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

【速读】:该论文试图解决的问题是:在视频扩散Transformer(Video Diffusion Transformers, DiTs)中,尽管通过模型蒸馏可以大幅降低整体推理延迟,但每步推理延迟仍然是关键瓶颈;尤其在少步数推理(few-step regimes)场景下,现有基于去噪轨迹冗余的步进式加速策略因时间状态稀缺、无法有效进行特征复用或预测建模而面临收益递减,从而阻碍进一步加速。解决方案之关键在于:将优化焦点从时间轨迹转向潜在帧维度(latent frame dimension),利用该维度内帧间稀疏性(frame-wise sparsity)允许减少计算、以及每个帧位置对全局时空上下文具有同等重要性的结构一致性(structural consistency)这一内在对偶性,提出无需训练且与算子无关的框架FIS-DiT(Frame Interleaved Sparsity DiT)。其核心执行策略为帧间交错稀疏(Frame Interleaved Sparsity, FIS),即在模型层级间操作帧子集,通过交错刷新所有潜在位置而无需对完整块(block)进行计算,从而在保持全局时空信息的同时大幅降低每步推理计算量,实现2.11–2.41倍加速且质量损失可忽略。

链接: https://arxiv.org/abs/2605.11869
作者: Jian Tang,Jiawei Fan,Qingbin Liu,Zheng Wei
机构: Platform and Content Group, Tencent (平台与内容事业群, 腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that permits reduced computation, coupled with a structural consistency where each frame position remains equally vital to the global spatiotemporal context. Leveraging this insight, we implement Frame Interleaved Sparsity (FIS) as an execution strategy that manipulates frame subsets across the model hierarchy, refreshing all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 demonstrate that FIS-DiT consistently achieves 2.11–2.41× speedup with negligible degradation across VBench-Q and CLIP metrics, providing a scalable and robust pathway toward real-time high-definition video generation.
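"帧间交错稀疏"的调度思想可用一个极简排程函数示意:第 l 层只计算索引与 l 模 stride 同余的潜变量帧,这样任意连续 stride 层合起来刷新全部帧位置,而没有任何一层做全量计算(这是示意性排程,论文的具体策略未在摘要中给出):

```python
def frame_schedule(num_frames, stride, num_layers):
    """Interleaved frame scheduling: layer l computes only latent frames
    whose index is congruent to l modulo `stride`, so any `stride`
    consecutive layers together refresh every frame position."""
    return [[f for f in range(num_frames) if f % stride == l % stride]
            for l in range(num_layers)]

sched = frame_schedule(num_frames=8, stride=2, num_layers=4)
print(sched[0])  # [0, 2, 4, 6] -- even frames at layer 0
print(sched[1])  # [1, 3, 5, 7] -- odd frames at layer 1
```

每层的计算量约降为 1/stride,同时跨层联合仍覆盖全局时空上下文,对应摘要所述的"结构一致性"。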

[CV-77] When Brains Disagree: Biological Ambiguity Underlies the Challenge of Amyloid PET Synthesis from Structural MRI MICCAI2026

【速读】:该论文旨在解决结构MRI到淀粉样蛋白PET合成(Structural MRI-to-amyloid PET synthesis)中表现不一致的根本原因——即MRI所反映的神经退行性变与PET所测量的淀粉样蛋白病理在阿尔茨海默病中常存在时间解耦,导致相似MRI模式对应不同淀粉样蛋白状态,形成固有的生物歧义(biological ambiguity),使该任务本质上成为病态问题(ill-posed problem)。解决方案的关键在于:第一,通过按淀粉样蛋白和神经退行性变状态分层控制训练分布,证实性能受限源于数据分布中的歧义而非模型架构容量;第二,引入正交生物学信息(如血浆生物标志物)作为多模态输入,从而解除歧义,实现性能提升与稳定性恢复。因此,稳定且有意义的进展依赖于多模态整合(multimodal integration),而非单纯的架构复杂性。

链接: https://arxiv.org/abs/2605.11867
作者: Louise E.G. Baron,Ross Callaghan,David M. Cash,Philip S. Weston,Hojjat Azadbakht,Hui Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2026 accepted paper (no rebuttal)

点击查看摘要

Abstract:Structural MRI-to-amyloid PET synthesis has been proposed as a non-invasive alternative for amyloid assessment in Alzheimer’s disease (AD). However, reported performance of identical models varies widely across studies, and increasingly complex architectures have not led to consistent gains. This inconsistency is thought to be caused by a fundamental biological ambiguity: MRI captures neurodegeneration, while PET measures amyloid pathology - two processes that are often temporally decoupled in AD. As a result, similar MRI patterns may correspond to different amyloid states, creating ambiguous one-to-many mappings. MRI-to-amyloid PET synthesis may therefore be intrinsically ill-posed; however, this idea has yet to be tested scientifically. The aim of this work is to test this hypothesis through two controlled experiments. We first control the training distribution by stratifying paired MRI-PET data by amyloid and neurodegeneration status. Using two standard synthesis models under a controlled design, we show that biologically unambiguous mappings are learnable in isolation, but performance collapses when data ambiguity is introduced. This demonstrates that ambiguity in the data distribution, rather than architectural capacity, constrains performance. Second, we show that introducing orthogonal biological information in the form of plasma biomarkers resolves this ambiguity. When multimodal inputs are incorporated, performance improves and stability is restored. Together, these findings suggest that limited and inconsistent performance in MRI-to-amyloid PET synthesis is explained by intrinsic biological ambiguity, and that stable, meaningful progress requires multimodal integration rather than architectural complexity.

[CV-78] GATA2Floor: Graph attention for floor counting in street-view facades ICIP2026

【速读】:该论文旨在解决从街道级图像自动分析建筑立面时,不具备空间排列推理能力的问题:传统方法仅进行孤立的窗户/门检测,无法对元素之间的垂直布局关系进行建模,从而导致楼层计数、结构理解等高层任务性能受限。解决方案的关键在于:1)将每个立面显式建模为包含窗户/门检测的图(graph),并对边施加垂直先验(vertical prior),从而捕获元素间的空间关系;2)提出基于多头图注意力v2(Multi-head Graph Attention v2, GATv2)的模型GATA2Floor,通过可学习的跨注意力查询(learnable cross-attention queries)将元素软分配到潜在楼层槽(latent floor slots),同时预测全局楼层数并输出可解释的分配结果;3)为克服标注数据稀缺,设计一种轻量级无标签提议机制(label-free proposal mechanism),利用自监督特征和视觉语言评分(vision-language scoring)生成候选元素,使整个基于图的关系推理能够在无标注条件下应用。

链接: https://arxiv.org/abs/2605.11863
作者: Ngoc Tan Le,Tzoulio Chamiti,Eirini Papagiannopoulou,Nikos Deligiannis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted at IEEE ICIP 2026; 6 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Automated analysis of building facades from street-level imagery has great potential for urban analytics, energy assessment, and emergency planning. However, it requires reasoning over spatially arranged elements rather than solely isolated detections. In this work, we model each facade as a graph over window/door detections with a vertical prior on edges. Additionally, we introduce GATA2Floor, a multi-head Graph Attention v2 (GATv2) based model that predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns elements to latent floor slots, yielding interpretable outputs and robustness to irregular designs. To mitigate the lack of labeled datasets, we demonstrate that the proposed graph-based reasoning can be applied without annotations by leveraging a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring. Our approach demonstrates the value of graph-attention-based relational reasoning for facade understanding.
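立面图中边上的"垂直先验"可示意如下:垂直偏移小的检测框(同一楼层的窗/门)获得较大的边权,使注意力主要在层内流动。高斯形式与 σ 取值均为示意性假设,非论文设定:

```python
import math

def facade_graph(detections, sigma=0.1):
    """Edge weights between window/door detections with a vertical
    prior: boxes at similar heights (small |dy|) get strong edges."""
    n = len(detections)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dy = detections[i][1] - detections[j][1]   # vertical offset
                weights[i][j] = math.exp(-(dy * dy) / (2 * sigma ** 2))
    return weights

# three detections as (x, y) centres: two on one floor, one above
w = facade_graph([(0.2, 0.30), (0.8, 0.31), (0.5, 0.70)])
print(round(w[0][1], 3))  # near 1.0: same floor
print(round(w[0][2], 3))  # near 0.0: different floors
```

GATv2 随后在该图上做消息传递,并通过跨注意力查询把节点软分配到潜在楼层槽。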

[CV-79] Selection Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

【速读】:该论文试图解决雷达-相机深度估计中如何将极度稀疏、全天候、度量的雷达信号有效转化为密集的逐像素深度图的问题。现有方法在骨干网络的序列操作外部进行特征融合,即使跨模态Mamba变体也保留了单模态的选择机制。解决方案的关键在于 Radar-Modulated Selection (RMS),即把雷达注入 Mamba 的选择性扫描机制(selection mechanism)内部:雷达从扫描内部调制步长 Δ 和读出 C(添加零初始化的扰动),而输入投影 B 和状态动力学 A 保持纯图像。这种构造在初始化时完全等价于预训练的纯图像 Mamba,确保雷达只在其能提升准确度的地方施加影响,并自然地获得了线性成本的跨模态耦合和雷达缺失时回退到纯图像骨干的能力。通过将 RMS 部署在多视角扫描金字塔(MVSP)中,匹配不同尺度下雷达的空间覆盖范围,SemoDepth 实现了最先进的性能,且消融实验证明外部特征混合在 RMS 之上无法带来额外精度提升,验证了扫描内部选择可以取代扫描外部融合。

链接: https://arxiv.org/abs/2605.11840
作者: Zhangcheng Hou,Tomoaki Ohtsuki
机构: School of Science and Technology, Keio University (庆应义塾大学理工学部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 3 figures, 9 tables

点击查看摘要

Abstract:Radar-camera depth estimation must turn an ultra-sparse, all-weather, metric radar signal into a dense per-pixel depth map. Existing methods – concatenation, confidence-aware gating, sparse supervision, graph-based extraction – combine radar and image features outside the backbone’s sequence operator, and even cross-modal Mamba variants leave the selection mechanism itself unimodal. We argue that the selection mechanism is the right place for radar to enter. We introduce Radar-Modulated Selection (RMS), a minimal and principled way to inject radar into Mamba’s selective scan: radar modulates the scan from within, adding zero-initialised perturbations to the step size Δ and readout C while leaving the input projection B and state dynamics A image-only. The construction is exactly equivalent to a pretrained image-only Mamba at initialisation, ensuring radar only influences the model where it improves accuracy. Two further properties follow that out-of-scan fusion cannot offer: linear-cost cross-modal coupling at every recurrence step, and a natural fallback to the image-only backbone when radar is absent. We deploy RMS in a Multi-View Scan Pyramid (MVSP) that matches the fusion operator to radar’s spatial reach at each scale. SemoDepth achieves state-of-the-art performance on nuScenes, reducing MAE by 34.0%, 29.9%, and 29.9% over the previous best at 0–50, 0–70, and 0–80m, while attaining the lowest single-frame latency (26.8ms). A further ablation shows that out-of-scan feature blending adds no accuracy on top of RMS, providing empirical validation that in-scan selection can replace out-of-scan fusion.
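RMS 的"初始化精确等价"性质可用一个标量状态空间扫描示意:雷达通过零初始化的权重扰动步长 Δ 与读出 C,权重为零时逐位等于纯图像模型;雷达缺失(全零输入)时自然回退到图像骨干。此处为固定 A、B 的一维玩具示意,非作者实现:

```python
import math

def selective_scan(x, radar, w_delta=0.0, w_c=0.0):
    """Minimal scalar state-space scan with radar-modulated selection.
    With zero-initialised weights (w_delta = w_c = 0) the scan is
    exactly the image-only model."""
    A, B, C0, delta0 = -1.0, 1.0, 1.0, 0.5
    h, ys = 0.0, []
    for xi, ri in zip(x, radar):
        delta = delta0 + w_delta * ri      # radar-modulated step size
        C = C0 + w_c * ri                  # radar-modulated readout
        a = math.exp(delta * A)            # discretised state decay
        h = a * h + delta * B * xi         # input path stays image-only (B fixed)
        ys.append(C * h)
    return ys

x, radar = [1.0, 0.5, -0.2], [0.3, 0.0, 0.8]
# zero-init weights: radar has no effect at initialisation
assert selective_scan(x, radar) == selective_scan(x, [0.0] * 3)
print(selective_scan(x, radar, w_delta=0.2, w_c=0.1))  # radar now shifts the output
```

由于雷达是在每个递推步内部进入扫描,跨模态耦合的代价仍随序列长度线性增长,这正是摘要强调的两条附加性质之一。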

[CV-80] REFNet: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Birds-Eye Polar View ITSC

【速读】:该论文旨在解决多模态传感器融合中雷达(radar)与摄像头数据在统一域中对齐的挑战,具体针对车载环境感知中雷达噪声大、分类能力弱且与相机视角不兼容的问题,同时兼顾融合准确性和计算效率。解决方案的关键在于:采用变分编码器-解码器(variational encoder-decoder)架构,一方面将前置摄像头图像转换为鸟瞰视图(Bird’s-Eye View, BEV)极坐标域,另一方面从雷达原始距离-多普勒(Range-Doppler, RD)频谱中通过编码器-解码器恢复角度信息,生成距离-方位角(Range-Azimuth, RA)特征,从而使两种模态在兼容的极坐标域中对齐,实现鲁棒且高效的融合。

链接: https://arxiv.org/abs/2605.11824
作者: Kavin Chandrasekaran,Sorin Grigorescu,Gijs Dubbelman,Pavol Jancura
机构: Elektrobit Automotive GmbH (Elektrobit汽车有限公司); Eindhoven University of Technology (埃因霍温理工大学); Transilvania University of Brasov (布拉索夫特兰西瓦尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: IEEE Intelligent Transportation Systems Conference (ITSC) 2025

点击查看摘要

Abstract:A realistic view of the vehicle’s surroundings is generally offered by camera sensors, which is crucial for environmental perception. Affordable radar sensors, on the other hand, are becoming invaluable due to their robustness in variable weather conditions. However, because of their noisy output and reduced classification capability, they work best when combined with other sensor data. Specifically, we address the challenge of multimodal sensor fusion by aligning radar and camera data in a unified domain, prioritizing not only accuracy, but also computational efficiency. Our work leverages the raw range-Doppler (RD) spectrum from radar and front-view camera images as inputs. To enable effective fusion, we employ a variational encoder-decoder architecture that learns the transformation of front-view camera data into the Bird’s-Eye View (BEV) polar domain. Concurrently, a radar encoder-decoder learns to recover the angle information from the RD data that produce Range-Azimuth (RA) features. This alignment ensures that both modalities are represented in a compatible domain, facilitating robust and efficient sensor fusion. We evaluated our fusion strategy for vehicle detection and free space segmentation against state-of-the-art methods using the RADIal dataset.

[CV-81] RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

【速读】:该论文旨在解决基于扩散模型(diffusion-based)的图像层分解(image layer decomposition)中存在的核心挑战:复杂自然图像中的遮挡完成(occlusion completion)、鲁棒层分离(robust layer disentanglement)、精确前景边界(precise foreground boundaries)以及高质量多层自然图像数据集的匮乏。解决方案的关键在于提出了RevealLayer框架,该框架通过三个核心组件实现精确的层分离与遮挡内容恢复:一是区域感知注意力模块(Region-Aware Attention),用于分离隐藏层与可见层;二是遮挡引导适配器(Occlusion-Guided Adapter),利用上下文信息增强重叠区域;三是复合损失函数(composite loss),用于强制生成尖锐的alpha边界并抑制残差伪影。此外,论文还构建了大规模高质量多层自然图像数据集RevealLayer-100K及基准测试集RevealLayerBench,以支持训练与评估。

链接: https://arxiv.org/abs/2605.11818
作者: Binhao Wang,Shihao Zhao,Bo Cheng,Qiuyu Ji,Yuhang Ma,Liebucha Wu,Shanyuan Liu,Dawei Leng,Yuhui Yin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent diffusion-based approaches have made substantial progress in image layer decomposition. However, accurately decomposing complex natural images remains challenging due to difficulties in occlusion completion, robust layer disentanglement, and precise foreground boundaries. Moreover, the scarcity of high-quality multi-layer natural image datasets limits advancement. To address these challenges, we propose RevealLayer, a diffusion-based framework that decomposes an RGB image into multiple RGBA layers, enabling precise layer separation and reliable recovery of occluded content in natural images. RevealLayer incorporates three key components: (1) a Region-Aware Attention module to disentangle hidden and visible layers; (2) an Occlusion-Guided Adapter to leverage contextual information to enhance overlapping regions; and (3) a composite loss to enforce sharp alpha boundaries and suppress residual artifacts. To support training and evaluation, we introduce RevealLayer-100K, a high-quality multi-layer natural image dataset constructed through a collaboration between automated algorithms and human annotation, and further establish RevealLayerBench for benchmarking layer decomposition in general natural scenes. Extensive experiments demonstrate that RevealLayer consistently outperforms existing approaches in layer decomposition.

[CV-82] See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

【速读】:该论文旨在解决视觉-语言-动作(VLA)模型因高昂计算成本而无法实现实时部署的问题,特别是现有令牌剪枝方法因抛弃接触点等关键几何细节而导致性能严重下降,进而限制压缩率与加速潜力之间的根本性权衡。解决方案的关键在于打破这一权衡:将压缩重新构思为视觉编码器中基于几何感知的连续令牌重采样过程。为此,论文提出可微分网格采样器(Differentiable Grid Sampler, GridS),这是一个即插即用模块,通过任务自适应的方式预测一组最小化显著坐标集,并利用可微分插值提取特征,从而在实现激进压缩(视觉令牌数量少于原始10%)的同时保留必要的空间几何信息。实验证明,GridS在LIBERO基准和真实机器人平台上实现了FLOPs减少76%且成功率无下降,是目前报告的最低可行视觉令牌数量。

链接: https://arxiv.org/abs/2605.11817
作者: Yixu Feng,Zinan Zhao,Yanxiang Ma,Chenghao Xia,Chengbin Du,Yunke Wang,Chang Xu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have shown remarkable promise in robotics manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as a geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform demonstrate that GridS achieves a 76% reduction in FLOPs with no degradation in the success rate, validating the lowest feasible visual token count reported to date. The code is available at this https URL.
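GridS 依赖的"可微分插值提取"步骤,核心即在连续坐标处对特征图做双线性插值,使梯度能回传到预测出的坐标。下面是该运算的纯 Python 示意(非作者模块):

```python
import math

def bilinear_sample(grid, y, x):
    """Sample a 2-D feature map at a continuous (y, x) coordinate via
    bilinear interpolation -- the differentiable extraction step a
    module like GridS relies on."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(grid) - 1)
    x1 = min(x0 + 1, len(grid[0]) - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * grid[y0][x0] + wx * grid[y0][x1]
    bot = (1 - wx) * grid[y1][x0] + wx * grid[y1][x1]
    return (1 - wy) * top + wy * bot

feat = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear_sample(feat, 0.5, 0.5))  # 1.5: centre of the four values
print(bilinear_sample(feat, 0.0, 1.0))  # 1.0: exact grid point
```

由于输出对 (y, x) 连续可导,坐标本身可以由网络端到端学习,这正是"连续令牌重采样"区别于离散剪枝之处。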

[CV-83] Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

【速读】:该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的动作关系幻觉(action-relation hallucinations)问题,即模型生成与视觉输入相矛盾的文本,尤其关注对象间交互关系的错误描述。解决方案之关键在于:首先,定义动作关系敏感性(Action-Relation Sensitivity, ARS)分数,用于识别对动作关系变化最敏感的注意力头,从而定位包含关键视觉线索的动作相关图像区域;其次,提出关系感知视觉增强(Relation-aware Visual Enhancement, RVE)方法,增强LVLM对这些动作相关图像区域的注意力,从而在几乎不增加推理成本的前提下有效缓解动作关系幻觉,并泛化到空间关系幻觉和对象幻觉。

链接: https://arxiv.org/abs/2605.11808
作者: Zhenxin Qin,Qiang Li,Qingzhuo Wang,Ruiyang Qin,Zhihua Wei,Wen Shen
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM’s attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM’s attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.
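摘要未给出 ARS 的具体公式;下面以"动作替换前后注意力图的平均绝对变化"作为一个假设性的敏感性度量来示意其思路——对动作变化反应最大的注意力头即用于定位动作相关图像区域:

```python
def ars_scores(attn_orig, attn_swapped):
    """Per-head sensitivity as the mean absolute change in attention
    when the action in the caption is altered. Illustrative assumption;
    the paper's exact ARS formula is not in the abstract."""
    scores = []
    for h_orig, h_swap in zip(attn_orig, attn_swapped):
        diffs = [abs(a - b) for a, b in zip(h_orig, h_swap)]
        scores.append(sum(diffs) / len(diffs))
    return scores

# two heads, attention over four image regions
orig = [[0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]]
swap = [[0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
s = ars_scores(orig, swap)
print(s)                 # head 0 reacts to the action change, head 1 does not
print(s.index(max(s)))   # 0 -> the action-sensitive head
```

选出敏感头后,RVE 再增强模型对这些头所指向的图像区域的注意力。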

[CV-84] Stop Marginalizing My Dreams: Model Inversion via Laplace Kernel for Continual Learning

【速读】:该论文试图解决无数据持续学习(Data-free continual learning, DFCIL)中现有模型反演方法因采用对角协方差假设而忽略特征相关性,导致合成伪样本保真度不足、知识保留受限的问题。解决方案的关键在于提出了REMIX框架,通过拉普拉斯核参数化实现结构化协方差建模,以线性增长的内存和仅对数因子的计算开销捕捉特征依赖关系,从而避免了传统全协方差建模所需的密集矩阵求逆和对数行列式计算,使合成样本更加连贯并显著提升基准性能。

链接: https://arxiv.org/abs/2605.11804
作者: Patryk Krukowski,Jacek Tabor,Przemysław Spurek,Marek Śmieja,Łukasz Struski
机构: Jagiellonian University (雅盖隆大学); IDEAS Research Institute (IDEAS研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data-free continual learning (DFCIL) relies on model inversion to synthesize pseudo-samples and mitigate catastrophic forgetting. However, existing inversion methods are fundamentally limited by a simplifying assumption: they model feature distributions using diagonal covariance, effectively ignoring correlations that define the geometry of learned representations. As a result, synthesized samples often lack fidelity, limiting knowledge retention. In this work, we show that modeling feature dependencies is a key ingredient for effective DFCIL. We introduce REMIX, a structured covariance modeling framework that enables scalable full-covariance modeling without the prohibitive cost of dense matrix inversion and log-determinant computation. By leveraging a Laplace kernel parameterization, REMIX captures structured feature dependencies using memory that scales linearly with the feature dimensionality, while requiring only an additional logarithmic factor in computation. Modeling these correlations produces more coherent synthetic samples and consistently improves performance across standard DFCIL benchmarks. Our results demonstrate that moving beyond diagonal assumptions is essential for effective and scalable data-free continual learning. Our code is available at https://github.com/pkrukowski1/REMIX-Model-Inversion-via-Laplace-Kernel.
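"拉普拉斯核参数化的结构化协方差"可示意为 Σ[i][j] = s_i · s_j · exp(−|i−j|/ℓ):任意特征对均相关(区别于对角假设),但只需存储 d 个尺度参数和一个长度参数,对应线性增长的内存。具体参数化可能与论文不同,此处仅为结构示意:

```python
import math

def laplace_covariance(scales, length):
    """Structured full covariance built from a Laplace kernel: every
    feature pair is correlated, yet only d scales plus one length
    parameter are stored."""
    d = len(scales)
    return [[scales[i] * scales[j] * math.exp(-abs(i - j) / length)
             for j in range(d)] for i in range(d)]

cov = laplace_covariance([1.0, 2.0, 0.5], length=1.5)
print(cov[0][0], cov[1][1])      # diagonal recovers the variances s_i^2
print(cov[0][1] == cov[1][0])    # True: symmetric by construction
```

这类核结构化矩阵还允许避开稠密矩阵求逆与对数行列式的直接计算,即摘要所述的可扩展性来源。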

[CV-85] OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

【速读】:该论文试图解决 Video Large Language Models (Video-LLMs) 在处理长视频时视觉 token 数量激增导致推理成本快速增长的问题;现有训练无关(training-free)的 token 压缩方法主要依赖跨帧相似性或分段启发式,忽视了每个 token 在帧内的语义角色,且无法根据帧间可压缩性自适应调整压缩强度。解决方案的关键在于提出 OTT-Vid,一个基于最优传输(Optimal Transport, OT)的分配框架,其核心包括两个阶段:首先通过空间剪枝(spatial pruning)识别每帧中的代表性内容,然后在相邻帧之间求解非均匀 token 质量(non-uniform token mass)的最优传输,其中 token 质量用于保护语义重要的 token 免受过度压缩,而位置感知代价(locality-aware cost)同时捕捉特征和空间差异;由此得到的传输计划(transport plan)能够平衡 token 重要性与匹配代价,而其总代价(即传输难度)被用于动态分配各帧对的压缩预算,从而实现自适应且语义保持的时域压缩。

链接: https://arxiv.org/abs/2605.11803
作者: Minseok Kang,Minhyeok Lee,Jungho Lee,Minjung Kim,Donghyeong Kim,Dayeon Lee,Heeseung Choi,Ig-jae Kim,Sangyoun Lee
机构: Yonsei University(延世大学); LG Electronics(LG电子); KIST(韩国科学技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 9 figures. Code available at this https URL

点击查看摘要

Abstract:As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token’s semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
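相邻帧令牌之间"非均匀质量"的最优传输可用通用的 Sinkhorn 迭代示意,传输计划的总代价即作为该帧对的"传输难度",用于动态分配压缩预算。要求两侧质量总和相等;以下为通用熵正则 OT 的示意,非论文求解器:

```python
import math

def sinkhorn(mass_a, mass_b, cost, eps=0.5, iters=300):
    """Entropic OT between two frames' tokens with non-uniform token
    mass; the plan's total cost serves as the transport difficulty."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * len(mass_a), [1.0] * len(mass_b)
    for _ in range(iters):
        u = [mass_a[i] / sum(K[i][j] * v[j] for j in range(len(v)))
             for i in range(len(u))]
        v = [mass_b[j] / sum(K[i][j] * u[i] for i in range(len(u)))
             for j in range(len(v))]
    plan = [[u[i] * K[i][j] * v[j] for j in range(len(v))]
            for i in range(len(u))]
    difficulty = sum(plan[i][j] * cost[i][j]
                     for i in range(len(u)) for j in range(len(v)))
    return plan, difficulty

# two tokens per frame; the first token carries more semantic mass
mass_a, mass_b = [0.7, 0.3], [0.6, 0.4]
cost = [[0.0, 1.0],
        [1.0, 0.0]]
plan, difficulty = sinkhorn(mass_a, mass_b, cost)
print(round(sum(plan[0]), 3))  # 0.7 -- row marginal matches the token mass
print(difficulty)              # total plan cost = transport difficulty
```

语义重要的令牌被赋予更大的质量,从而在传输中受到保护;难度大的帧对获得更宽松的压缩预算。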

[CV-86] SB-BEVFusion: Enhancing the Robustness against Sensor Malfunction and Corruptions ICIP2026

【速读】:该论文旨在解决自动驾驶3D目标检测中,多模态传感器融合方法(如相机与LiDAR)在理想条件下表现优异,但当其中一种模态数据(相机或LiDAR)缺失、损坏或受噪声干扰时,性能急剧下降的问题。解决方案的关键是开发一个框架无关的融合模块(framework-agnostic fusion module),该模块能够灵活处理单一模态缺失或损坏的情况,通过设计不依赖于统一鸟瞰图(BEV)表示的机制,在融合过程中自适应地利用可用模态的有效信息,从而显著提升在传感器退化场景下的鲁棒性。

链接: https://arxiv.org/abs/2605.11799
作者: Markus Essl,Marta Moscati,Mubashir Noman,Muhammad Zaigham Zaheer,Usman Naseem,Shah Nawaz,Markus Schedl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICIP 2026

点击查看摘要

Abstract:Multimodal sensor fusion has demonstrated remarkable performance improvements over unimodal approaches in 3D object detection for autonomous vehicles. Typically, existing methods transform multimodal data from independent sensors, such as camera and LiDAR, into a unified bird’s-eye view (BEV) representation for fusion. Although effective in ideal conditions, this strategy suffers from substantial performance deterioration when camera or LiDAR data are missing, corrupted, or noisy. To address this vulnerability, we develop a framework-agnostic fusion module for camera and LiDAR data that allows for handling cases when one of the two modalities is missing or corrupted. To demonstrate the effectiveness of our module, we instantiate it in BEVFusion [1], a well-established framework to combine camera and LiDAR data for 3D object detection. By means of quantitative experiments on the MultiCorrupt dataset, we demonstrate that our module achieves favorable performance improvements under scenarios of missing and corrupted modalities, substantially outperforming existing unified representation approaches across a wide range of sensor deterioration scenarios and reaching state-of-the-art performance in scenarios of corrupted modality due to extreme weather conditions and sensor failure.

[CV-87] Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision

【速读】:该论文旨在解决视觉障碍人士在城市环境中安全独立导航时面临的挑战,即现有可穿戴辅助设备依赖任务特定的视觉管线(task-specific vision pipelines),缺乏灵活性和泛化能力,无法适应多样化真实世界场景。解决方案的关键在于提出一种基于视觉问答(Visual Question Answering)的事件地图框架,利用视觉-语言模型(Vision-Language Models, VLMs)通过三级层次查询结构(three-level hierarchical query structure)实现细粒度场景理解,无需针对特定任务进行重新训练;模型响应被聚合为加权风险评分系统,将街道段映射为四个离散安全类别,生成可导航的风险感知事件地图(risk-aware event maps)以规划路线。同时,该研究构建了一个覆盖六大洲20个城市、包含800余张标注图像和18,000个问答的多样化数据集,并基准测试了四种VQA架构,发现生成式多模态大语言模型(Multimodal Large Language Models, MLLMs)显著优于分类方法,其中Qwen-VL在精度和召回率上取得了最佳平衡,从而验证了MLLMs作为灵活且可泛化的辅助导航系统基础的可行性。

链接: https://arxiv.org/abs/2605.11782
作者: Antoni Valls,Jordi Sanchez-Riera
机构: Institut de Robòtica i Informàtica Industrial, CSIC-UPC (机器人与工业信息学研究所,西班牙国家研究委员会-加泰罗尼亚理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures, submitted to IEEE T-ITS

点击查看摘要

Abstract:Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures -ViLT, LLaVA, InstructBLIP, and Qwen-VL- and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
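"加权风险评分映射到四个离散安全类别"可示意如下;权重、阈值与类别标签均为示意性假设,摘要未给出具体取值:

```python
def risk_category(answers, weights, thresholds=(0.25, 0.5, 0.75)):
    """Aggregate per-question hazard answers (0 = no hazard, 1 = hazard)
    into a weighted risk score and map it onto four safety categories."""
    score = sum(w * a for w, a in zip(weights, answers)) / sum(weights)
    for label, t in zip(("safe", "low-risk", "medium-risk"), thresholds):
        if score <= t:
            return label
    return "high-risk"

# three hierarchical questions; the most severe hazard is weighted most
print(risk_category([0, 0, 0], weights=[1, 2, 4]))  # safe
print(risk_category([1, 1, 1], weights=[1, 2, 4]))  # high-risk
print(risk_category([0, 1, 0], weights=[1, 2, 4]))  # low-risk: score 2/7
```

对街道上每一段逐一评分,即得到可用于路径规划的风险感知事件地图。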

[CV-88] Revisiting Shadow Detection from a Vision-Language Perspective

【速读】:该论文旨在解决传统阴影检测(shadow detection)在视觉模糊场景下的不可靠问题,即当暗区域可能由投射阴影或固有深色表面引起时,仅凭像素级视觉监督难以建立稳定的决策规则。解决方案的关键在于提出SVL(Shadow Vision-Language)框架,从视觉-语言视角引入语言作为显式语义参考(explicit semantic reference),以消除视觉模糊区域的歧义。具体而言,通过场景级阴影比例回归目标(scene-level shadow ratio regression objective)对齐全局图像表示与阴影相关的文本嵌入,提供图像级的阴影范围引导;再通过全局到局部耦合机制(global-to-local coupling mechanism)将这一全局信息传递至密集预测,确保图像级引导与补丁级预测的一致性;同时,利用局部补丁级文本约束(local patch-level text embeddings)强化在挑战性外观条件下的细粒度判别能力。该框架基于冻结的DINOv3图像编码器,仅训练轻量的投影与解码模块,参数量少于1%,从而实现了参数高效且鲁棒的阴影检测性能。

链接: https://arxiv.org/abs/2605.11771
作者: Yonghui Wang,Wengang Zhou,Hao Feng,Houqiang Li
机构: University of Science and Technology of China (中国科学技术大学); CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System (空间信息处理与应用系统技术重点实验室); Department of Electronic Engineering and Information Science (电子工程与信息科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision–language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision–Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than 1% trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.

[CV-89] M4-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

【速读】:该论文试图解决在将Segment Anything Model 2 (SAM2) 应用于RGB-D视频显著目标检测(RGB-D VSOD)任务时面临的三个主要挑战:线性LoRA空间建模能力有限、未能充分利用SAM的多尺度特征、以及对显式提示初始化的依赖。解决方案的关键在于提出M^4-SAM框架,通过三项创新设计来克服这些挑战:首先,在SAM2编码器中注入模态感知混合专家低秩适应(Modality-Aware MoE-LORA),利用卷积专家编码局部空间先验并引入模态调度器实现高效多模态微调;其次,部署门控多层次特征融合(Gated Multi-Level Feature Fusion),通过自适应门控机制分层聚合多尺度编码器特征,以平衡空间细节与语义上下文;最后,采用伪引导初始化(Pseudo-Guided Initialization),将粗掩码作为伪先验引导记忆库,从而实现无需手动提示的零样本VSOD。实验结果表明,M^4-SAM在三个公开RGB-D VSOD数据集上的所有评估指标均达到了最先进性能。

链接: https://arxiv.org/abs/2605.11760
作者: Jiyuan Liu,Jia Lin,Xiaofei Zhou,Runmin Cong,Deyang Liu,Zhi Liu
机构: Hangzhou Dianzi University (杭州电子科技大学); Shandong University (山东大学); Anqing Normal University (安庆师范大学); Shanghai University (上海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges including limited spatial modeling of linear LoRA, insufficient employment of SAM’s multi-scale features, and dependence of initialization on explicit prompts. To address the issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M^4-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. Firstly, we inject Modality-Aware MoE-LORA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2’s encoder. Secondly, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M^4-SAM achieves state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.
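摘要中的 Gated Multi-Level Feature Fusion 依赖自适应门控在空间细节与语义上下文之间取舍。下面给出一个与论文实现无关的最小 numpy 草图(门控参数 `w_gate`、`b_gate` 为假设命名,仅演示思路):sigmoid 门控对两级特征做逐元素凸组合。

```python
import numpy as np

def gated_fusion(shallow, deep, w_gate, b_gate):
    """sigmoid 门控: g = sigmoid(W·[shallow; deep] + b), 输出 g*shallow + (1-g)*deep。"""
    x = np.concatenate([shallow, deep], axis=-1)        # 拼接两级特征
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))    # 逐元素门控权重, 取值 (0, 1)
    return g * shallow + (1.0 - g) * deep

# 玩具示例: 8 维特征, 门控权重随机初始化(仅演示, 非论文超参)
rng = np.random.default_rng(0)
shallow = rng.standard_normal(8)
deep = rng.standard_normal(8)
w = rng.standard_normal((16, 8)) * 0.1
b = np.zeros(8)
fused = gated_fusion(shallow, deep, w, b)
```

由于 g∈(0,1),融合结果在两级特征之间逐元素插值,这是此类门控设计保持数值稳定的一个常见理由。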

[CV-90] Focusable Monocular Depth Estimation

【速读】:该论文试图解决当前单目深度基础模型(monocular depth foundation models)在优化时采用统一的逐像素损失函数(uniform pixel-wise objectives),无法区分用户指定或任务相关的目标区域与无关背景的问题,导致模型缺乏对特定区域的深度建模偏好。为此,论文提出了一个新的任务——可聚焦单目深度估计(Focusable Monocular Depth Estimation, FDE),要求模型在给定目标区域时能优先保证前景深度精度、保留锐利边界且维持全局场景几何一致性。解决方案的关键是提出FocusDepth框架,这是一个基于提示条件(prompt-conditioned)的单目相对深度估计框架,其核心是多尺度空间对齐融合模块(Multi-Scale Spatial-Aligned Fusion, MSSA),该模块将Segment Anything Model 3的多尺度特征在空间上对齐到Depth Anything系列模型,并通过尺度特定的门控条件融合(gated conditional fusion)注入密集的提示线索,从而在不破坏已有几何表征的前提下,将聚焦能力整合进深度估计模型。此外,论文构建了FDE-Bench基准(包含252.9K/72.5K训练/验证三元组,涵盖972个类别),验证了该方案在框提示和文本提示下均优于全局微调基线,其中MSSA的空间对齐是实现聚焦的关键,破坏提示-几何对应关系会导致AbsRel上升最多13.8%。

链接: https://arxiv.org/abs/2605.11756
作者: Yuxin Du,Tao Lin,Zile Zhong,Runting Li,Xiyao Chen,Jiting Liu,Chenglin Liu,Ying-Cong Chen,Yuqian Fu,Bo Zhao
机构: School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学人工智能学院); The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monocular depth foundation models generalize well across scenes, yet they are typically optimized with uniform pixel-wise objectives that do not distinguish user-specified or task-relevant target regions from the surrounding context. We therefore introduce Focusable Monocular Depth Estimation (FDE), a region-aware depth estimation task in which, given a specified target region, the model is required to prioritize foreground depth accuracy, preserve sharp boundary transitions, and maintain coherent global scene geometry. To prioritize task-critical region modeling, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework that guides depth modeling to focus on target regions via box/text prompts. The core Multi-Scale Spatial-Aligned Fusion (MSSA) in FocusDepth spatially aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family and injects them through scale-specific, gated conditional fusion. This enables dense prompt cue injection without disrupting geometric representations, thereby endowing the depth estimation model with focused perception capability. To study FDE, we establish FDE-Bench, a target-centric monocular relative depth benchmark built from image-target-depth triplets across five datasets, containing 252.9K/72.5K train/val triplets and 972 categories spanning real-world and embodied simulation environments. On FDE-Bench, FocusDepth consistently improves over globally fine-tuned DA2/DA3 baselines under both box and text prompts, with the largest gains appearing in target boundary and foreground regions while preserving global scene geometry. Ablations show that MSSA’s spatial alignment is the key design factor, as disrupting prompt-geometry correspondence increases AbsRel by up to 13.8%.
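摘要中以 AbsRel(绝对相对误差)衡量深度精度;该指标是单目深度估计领域的标准定义,下面是一个最小 numpy 实现草图(有效像素掩码的处理方式为常见约定,并非论文源码):

```python
import numpy as np

def abs_rel(pred, gt, eps=1e-8):
    """AbsRel = mean(|pred - gt| / gt), 仅在 gt > 0 的有效像素上计算。"""
    valid = gt > eps
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# 示例: 预测整体偏高 10% 时, AbsRel 应约等于 0.1
gt = np.array([1.0, 2.0, 4.0])
pred = gt * 1.1
err = abs_rel(pred, gt)
```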

[CV-91] One-Step Generative Modeling via Wasserstein Gradient Flows

【速读】:该论文试图解决扩散模型和基于流的方法在生成任务中采样迭代次数多、计算成本高的问题,以实现快速且高保真的单步生成。解决方案的关键在于两步框架:首先,通过最小化能量泛函的Wasserstein梯度流(Wasserstein gradient flow)定义从参考分布到目标分布的连续演化;其次,训练一个静态的神经网络生成器将此演化压缩为单步生成。其中,能量泛函采用Sinkhorn散度(Sinkhorn divergence),从而导出基于最优传输的高效更新规则,能够捕获全局分布差异并改善目标分布的覆盖。论文还证明了在合适假设下,有限样本训练动力学收敛到连续时间分布动力学。

链接: https://arxiv.org/abs/2605.11755
作者: Jiaqi Han,Puheng Li,Qiushan Guo,Renyuan Xu,Stefano Ermon,Emmanuel J. Candès
机构: Stanford University (斯坦福大学); ByteDance Seed (字节跳动Seed)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 38 pages, 14 figures

点击查看摘要

Abstract:Diffusion models and flow-based methods have shown impressive generative capability, especially for images, but their sampling is expensive because it requires many iterative updates. We introduce W-Flow, a framework for training a generator that transforms samples from a simple reference distribution into samples from a target data distribution in a single step. This is achieved in two steps: we first define an evolution from the reference distribution to the target distribution through a Wasserstein gradient flow that minimizes an energy functional; second, we train a static neural generator to compress this evolution into one-step generation. We instantiate the energy functional with the Sinkhorn divergence, which yields an efficient optimal-transport-based update rule that captures global distributional discrepancy and improves coverage of the target distribution. We further prove that the finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically, W-Flow sets a new state of the art for one-step ImageNet 256×256 generation, achieving 1.29 FID, with improved mode coverage and domain transfer. Compared to multi-step diffusion models with similar FID scores, our method yields approximately 100× faster sampling. These results show that Wasserstein gradient flows provide a principled and effective foundation for fast and high-fidelity generative modeling.
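W-Flow 的能量泛函基于 Sinkhorn 散度,其核心子程序是熵正则最优传输的 Sinkhorn 迭代。以下是与论文实现无关的标准 Sinkhorn 迭代草图(`eps`、`iters` 为示例取值,边际分布取均匀分布):

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, iters=300):
    """熵正则 OT 的 Sinkhorn 迭代, 返回传输计划 P 与正则化传输代价。"""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # 均匀边际分布
    K = np.exp(-cost / eps)                          # Gibbs 核
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):                           # 交替缩放行/列
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return P, float((P * cost).sum())

# 两个一维点云之间的示例: 点集相同, 最优传输代价应接近 0
x = np.array([0.0, 1.0]); y = np.array([0.0, 1.0])
cost = (x[:, None] - y[None, :]) ** 2
P, val = sinkhorn_plan(cost)
```

注意熵正则会使传输计划轻微"扩散",因此代价只近似为 0;完整的 Sinkhorn 散度还需减去两个自传输项做去偏,此处从略。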

[CV-92] BronchoLumen: Analysis of recent YOLO-based architectures for real-time bronchial orifice detection in video bronchoscopy

【速读】:该论文旨在解决支气管镜检查中因呼吸道复杂分支结构导致的导航困难问题。解决方案的关键在于提出了BronchoLumen,一个基于YOLO的实时支气管开口检测系统,该系统利用最新的目标检测架构(如YOLOv12中的注意力模块)来增强空间推理能力,并仅依靠公开数据集实现了跨图像域的鲁棒性检测,其中YOLOv12通过改进定位精度(mAP@0.5:0.9指标更高)来提升导航准确性,尽管其整体精度略低于YOLOv8。

链接: https://arxiv.org/abs/2605.11748
作者: Yongchao Li,Marian Himstedt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, IPCAI 2026

点击查看摘要

Abstract:Bronchoscopy is routinely conducted in pulmonary clinics and intensive care units, but navigating the complex branching of the respiratory tract remains challenging. This paper introduces BronchoLumen, a real-time YOLO-based system for detecting bronchial orifices in video bronchoscopy, aiming to assist navigation and CAD systems. The paper investigates if bronchial orifices can be robustly detected across image domains using state-of-the-art object detection and a limited set of public image data. The study includes the description and comparison of YOLOv8, a widely adopted architecture, and YOLOv12, a more recent architecture integrating attention-based modules to improve spatial reasoning. Both models are trained and tested solely on publicly available datasets comprising different image domains. A comparison of both models is conducted based on the common metrics mAP@0.5 and mAP@0.5:0.9 with the latter emphasizing localization accuracy. For YOLOv8 we obtained a mAP@0.5 of 0.91 on an in-domain and 0.68 on a cross-domain test set. YOLOv12 achieved 0.84 and 0.68 respectively with slightly better localization accuracy with mAP@0.5:0.9 of 0.48 and 0.26 compared to YOLOv8 with 0.45 and 0.25. Challenges like motion blur and low contrast occasionally entailed uncertainties but the system demonstrated overall robustness in most scenarios. BronchoLumen is an open-weight, YOLO-based solution for bronchial orifice detection offering high accuracy and efficiency across multiple image domains. While the more recent YOLOv12 achieves better localization accuracy, we observed a slightly worse precision. The models have been made publicly available to foster further research in bronchoscopy navigation.

[CV-93] WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views ICML2026

【速读】:该论文试图解决现有时空语义推理方法中因依赖隐式潜在结构(implicit latent structures)结合密集特征图(dense feature maps)或任务特定头部(task-specific heads)而导致的计算效率低下和灵活性不足的问题。解决方案的关键在于提出一种名为WorldComp2D的轻量级表示学习框架,其核心是通过多尺度局部感受野(multiscale local receptive fields)显式地根据对象身份(object identity)和空间邻近性(spatial proximity)来结构化潜在空间几何,具体由邻近依赖编码器(proximity-dependent encoder)将观测映射到结构化的时空语义潜在空间,再由定位器(localizer)从该表示中推断对象坐标,从而在不牺牲推理能力的前提下大幅降低参数和计算量。

链接: https://arxiv.org/abs/2605.11743
作者: SeongMin Jin,Doo Seok Jeong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted as a regular paper at ICML2026

点击查看摘要

Abstract:Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that, compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0X and 2.2X, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at this https URL.

[CV-94] CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

【速读】:该论文旨在解决视频异常检测中现有方法难以精准捕捉细微、局部化异常且缺乏可解释推理过程的问题,尤其是生成视频中的异常。解决方案的关键在于提出了一种基于视觉语言模型(Vision-Language Models)的粗到细异常奖励模型 Concentrate and Concentrate (CaC),其推理流程先通过全局时间扫描锚定异常时间窗口,再在局部区间内进行细粒度空间定位,最终经由结构化时空链式推理(spatiotemporal Chain-of-Thought reasoning)得出稳健判断;为赋予模型这些能力,作者构建了首个包含逐帧边界框标注、时间异常窗口与细粒度属性标签的大规模生成视频异常数据集,并设计三阶段渐进训练范式:模型先通过单帧与多帧监督微调学习空间和时间锚定,随后基于两轮组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习策略进行优化,同时引入时间交并比(Temporal IoU)和空间交并比(Spatial IoU)奖励来监督中间定位过程,从而引导模型实现更扎实、可解释的时空推理。

链接: https://arxiv.org/abs/2605.11723
作者: Jiyuan Wang,Huan Ouyang,Jiuzhou Lin,Chunyu Lin,Dewen Fan,Boheng Zhang,Haonan Fan,Fei Zuo,Jia Sun,Huaiqing Wang,Honglie Wang,Yiyang Fan,Zhenlong Yuan,Zijun Li,Yongrui Heng,Guosheng Lin,Fan Yang,Tingting Gao
机构: BJTU(北京交通大学); NTU(南洋理工大学); BUPT(北京邮电大学); Kuaishou Technology(快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 27 pages, 10 figures

点击查看摘要

Abstract:In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
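文中引入的 Temporal IoU 奖励本质上是两个时间区间的交并比;下面是一个与论文代码无关的最小实现草图(区间用 (start, end) 元组表示):

```python
def temporal_iou(pred, gt):
    """两个时间区间 (start, end) 的交并比; 无重叠时为 0。"""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

Spatial IoU 的做法完全类似,只是把一维区间换成二维边界框,交集与并集改为按面积计算。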

[CV-95] EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

【速读】:该论文旨在解决当前文本到图像生成器在处理涉及多物体、数量、属性和关系的组合性提示(compositional prompts)时效果不佳的问题。其解决方案EPIC的关键在于提出一种训练无关的推理时精炼框架,将优化过程形式化为谓词引导的搜索(predicate-guided search):首先生成并固定一个包含对象变量和类型谓词的视觉程序(visual program),然后对每次生成或编辑的图像提取视觉证据并验证所有谓词是否满足,仅当全部通过时判定为合格;否则根据失败谓词的类型——局部失败触发定向编辑(targeted editing),全局失败触发重采样(resampling),同时保持视觉程序不变。这一机制通过将验证与条件化修正结合,在不改变基模型的前提下显著提升了组合提示的准确率。

链接: https://arxiv.org/abs/2605.11722
作者: Sunung Mun,Sunghyun Cho,Jungseul Ok
机构: Graduate School of Artificial Intelligence, POSTECH (人工智能研究生院, 浦项科技大学); Department of Computer Science Engineering, POSTECH (计算机科学与工程系, 浦项科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent text-to-image (T2I) generators can synthesize realistic images, but still struggle with compositional prompts involving multiple objects, counts, attributes, and relations. We introduce EPIC (Efficient Predicate-Guided Inference-Time Control), a training-free inference-time refinement framework for compositional T2I generation. EPIC casts refinement as predicate-guided search: it parses the original prompt once into a fixed visual program of object variables and typed predicates, covering checkable conditions such as object presence, counts, attributes, and relations. Each generated or edited image is verified against this program using visual evidence extracted from that image. An image is judged to satisfy the prompt only when all predicates are satisfied; otherwise, failed predicates decide the next step, routing local failures to targeted editing and global failures to resampling while the fixed visual program remains unchanged. On GenEval2, EPIC improves prompt-level accuracy from 34.16% for single-pass generation with the base generator to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, EPIC outperforms the strongest prior refinement baseline by 19.23 points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
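EPIC 的精炼过程可以抽象为"验证全部谓词 → 局部失败走定向编辑、存在全局失败走重采样"的循环。下面用纯 Python 给出一个示意性骨架(`verify`/`edit`/`resample` 均为假设的占位函数,"图像"用物体计数字典代替,仅演示路由逻辑,并非论文实现):

```python
def refine(image, predicates, verify, edit, resample, max_steps=5):
    """谓词引导的精炼循环: 全部谓词通过则停止;
    失败谓词全为局部 -> 定向编辑; 含全局失败 -> 重采样。"""
    for _ in range(max_steps):
        failed = [p for p in predicates if not verify(image, p)]
        if not failed:
            return image, True                       # 所有谓词满足
        if all(p["scope"] == "local" for p in failed):
            image = edit(image, failed)              # 局部失败: 定向编辑
        else:
            image = resample()                       # 全局失败: 重新生成
    return image, False

# 玩具示例: 谓词检查某类物体的数量
preds = [{"scope": "local", "obj": "cat", "count": 2}]
verify = lambda img, p: img.get(p["obj"], 0) == p["count"]
edit = lambda img, failed: {**img, failed[0]["obj"]: failed[0]["count"]}
resample = lambda: {}
final, ok = refine({"cat": 1}, preds, verify, edit, resample)
```

注意视觉程序(谓词集合)在整个循环中保持不变,这正是摘要所强调的"解析一次、反复验证"的设计。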

[CV-96] Unlocking Compositional Generalization in Continual Few-Shot Learning

【速读】:该论文旨在解决持续学习(continual learning)中基于对象中心表征(object-centric representations)进行少样本学习(few-shot learning)时面临的泛化瓶颈:现有方法要么将场景压缩为全局嵌入(global embeddings)导致无法分解,要么使用部件级匹配目标(part-level matching objectives)使表征与已见模式过度耦合,从而无法泛化至真正的新颖概念。解决方案的关键在于严格解耦表征学习(representation learning)与组合推理(compositional inference),利用自监督视觉Transformer(Vision Transformers, ViTs)固有的补丁级语义几何结构(patch-level semantic geometry),采用双阶段策略:在训练阶段,槽位表征(slot representations)完全针对整体类别身份进行优化,保留高度泛化的对象级几何结构;在推理阶段,保留的槽位被动态组合以匹配新场景。这一范式带来双重结构优势:冻结的主干网络自然防止表征漂移(representation drift),而轻量级的整体优化保留下特征向新颖概念迁移的能力。

链接: https://arxiv.org/abs/2605.11710
作者: Phu-Quy Nguyen-Lam,Phu-Hoa Pham,Dao Sy Duy Minh,Chi-Nguyen Tran,Huynh Trung Kiet,Long Tran-Thanh
机构: Faculty of Information Technology, University of Science, Vietnam National University, Ho Chi Minh City, Vietnam (信息技术学院,理学院,越南国立大学,胡志明市,越南); Department of Computer Science, University of Warwick, Coventry, United Kingdom (计算机科学系,华威大学,考文垂,英国)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Object-centric representations promise a key property for few-shot learning: Rather than treating a scene as a single unit, a model can decompose it into individual object-level parts that can be matched and compared across different concepts. In practice, this potential is rarely realized. Continual learners either collapse scenes into global embeddings, or train with part-level matching objectives that tie representations too closely to seen patterns, leaving them unable to generalize to truly novel concepts. In this paper, we identify this fundamental structural conflict and pioneer a new paradigm that strictly decouples representation learning from compositional inference. Leveraging the inherent patch-level semantic geometry of self-supervised Vision Transformers (ViTs), our framework employs a dual-phase strategy. During training, slot representations are optimized entirely toward holistic class identity, preserving highly generalizable, object-level geometries. At inference, preserved slots are dynamically composed to match novel scenes. We demonstrate that this paradigm offers dual structural benefits: The frozen backbone naturally prevents representation drift, while our lightweight, holistic optimization preserves the features’ capacity for novel-concept transfer. Extensive experiments validate this approach, achieving state-of-the-art unseen-concept generalization and minimal forgetting across standard continual learning benchmarks.

[CV-97] CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection

【速读】:该论文试图解决多模态核心集选择(multimodal coreset selection)中的两个关键问题:一是单模态主导的采样方法忽略跨模态信息不平衡,导致另一模态的语义损失;二是基于样本评分的粗粒度采样使选出的核心集偏向评分模型,难以保证与原始数据集的分布等价性。此外,现有方法未能同时兼顾全局语义结构、局部细粒度细节以及密集区域的冗余感知覆盖。解决方案的核心在于提出了CAST(Collapse-Aware multi-Scale Topology fusion)框架,关键步骤包括:首先分别构建图像和文本模态的拓扑结构,通过局部崩溃感知精炼(local-collapse-aware refinement)和跨模态融合得到统一的拓扑;然后在扩散小波域(diffusion wavelet domain)引入多尺度分布匹配准则,使核心集在多尺度上逼近原始数据集;最后引入局部软关系覆盖机制(local soft relational coverage),将纯几何覆盖扩展为关系感知的间接覆盖,以惩罚密集区域中的冗余选择。

链接: https://arxiv.org/abs/2605.11705
作者: Boran Zhao,Hetian Liu,Zhenxian Hu,Yuqing Yuan,Yu Yan,Pengju Ren
机构: School of Software Engineering(软件工程学院); the National Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室); National Engineering Research Center for Visual Information and Applications(国家视觉信息与应用工程技术研究中心); Institute of Artificial Intelligence and Robotics(人工智能与机器人研究所); Xi’an Jiaotong University(西安交通大学); XJTU-POLIMI Joint School(西安交通大学-米兰理工大学联合学院); Faculty of Electronic and Information Engineering(电子与信息工程学院); School of Human Settlements and Civil Engineering(人居环境与建筑工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The training of large multimodal models fundamentally relies on massive image-text datasets, which inevitably incur prohibitive computational overhead. Dataset selection offers a promising paradigm by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods, which ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lead to semantic loss in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods, where the selected coreset tends to be biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. To this end, we propose CAST, a Collapse-Aware multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies, and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines, showcasing great superiority in cross-architecture generalization and energy efficiency over state-of-the-art multimodal synthesis methods.

[CV-98] ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

【速读】:该论文旨在解决文本驱动的人体运动生成(text-driven human motion generation)任务中,传统自回归方法(如标准下一词元预测)无法有效捕捉运动从粗到细的结构化特征,导致运动细节缺失、全局结构一致性不足的问题。解决方案的关键在于:将3D运动量化为多尺度骨骼-时间(skeletal-temporal)离散词元(discrete tokens),并采用自回归方式逐尺度预测下一个尺度的词元图(next-scale token maps),从而实现从粗粒度到细粒度的渐进式生成;同时,通过显式设计运动词元化器(motion tokenizers)和量化器(quantizers),确保每个尺度的离散词元严格保留骨骼层次结构(skeletal hierarchy),并利用位量化(bitwise quantization)和预测机制有效扩展词汇量以保留运动细节并稳定优化过程。

链接: https://arxiv.org/abs/2605.11704
作者: Inwoo Hwang,Hojun Jang,Bing Zhou,Jian Wang,Young Min Kim,Chuan Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.
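下一尺度预测(next-scale prediction)的由粗到细流程可以用如下示意代码表达(与论文的 tokenizer/Transformer 实现无关;`predict` 为假设的占位预测器,上采样用最近邻近似,仅演示"逐尺度预测并累积"的控制流):

```python
import numpy as np

def upsample_nn(x, s):
    """最近邻上采样到 s x s。"""
    idx = np.arange(s) * x.shape[0] // s
    return x[idx][:, idx]

def next_scale_generate(predict, scales):
    """由粗到细: 每一尺度在上采样后的累积图条件下预测下一尺度 token 图并累积。"""
    acc = None
    for s in scales:
        ctx = np.zeros((s, s)) if acc is None else upsample_nn(acc, s)
        acc = ctx + predict(ctx, s)   # 条件于已有粗尺度上下文
    return acc

# 假设的常数预测器, 仅用于演示流程: 三个尺度各贡献一次残差
out = next_scale_generate(lambda ctx, s: np.ones((s, s)), [1, 2, 4])
```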

[CV-99] WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting CVPR26

【速读】:该论文旨在解决单图像重光照模型在从合成数据迁移到真实世界场景时面临的严重域偏移问题。现有方法虽然在合成基准上表现优异,但由于缺乏专门针对单图像重光照的真实世界数据集,其实际有效性未经验证。解决方案的关键在于构建了首个野外数据集WildRelight,它包含高分辨率户外场景及其严格对齐的时间变化自然光照和对应的HDR环境贴图,利用这种时间对齐结构引入物理引导的推理框架,通过结合扩散后验采样(Diffusion Posterior Sampling, DPS)与时序感知的测试时自适应(Temporal Sampling-Aware Test-Time Adaptation, TTA),将天然光照演化作为自监督约束,使合成模型能够在线对齐真实世界统计分布,从而将原本棘手的模拟到真实挑战转化为可处理的自监督任务。

链接: https://arxiv.org/abs/2605.11696
作者: Lezhong Wang,Mehmet Onurcan Kaya,Siavash Bigdeli,Jeppe Revall Frisvad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Companion paper to the CVPR26 findings paper ‘WildRelight’, introducing the physics-guided adaptation method evaluated on the dataset. Project Page: this https URL

点击查看摘要

Abstract:Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.

[CV-100] Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

【速读】:该论文试图解决异构视觉智能体在去中心化学习环境中,通过仅交换离散标记序列且不依赖共享外在通信目标时,如何基于局部感知证据自主形成共享符号系统的问题,核心挑战在于探究私有视觉空间相似性如何约束最终语言的涌现、内容及对称性。解决方案的关键在于提出的Metropolis-Hastings标题游戏(MHCG):两个智能体通过MH准则驱动接受-拒绝机制进行协作,听者依据自身视觉特征评估提议的标记序列,从而在无共享感知访问的条件下逐步形成视觉信息丰富的共享标题。其中,听者侧的MH接受标准是避免退化标记生成、确保有效通信的核心机制,而不同视觉编码器间的表征相似度则直接决定了共享序列的丰富性、特异性及对称性。

链接: https://arxiv.org/abs/2605.11695
作者: Mikako Ochiai,Masatoshi Nagano,Tadahiro Taniguchi
机构: Graduate School of Informatics, Kyoto University (京都大学信息学研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Symbols are shared, but perception is private. We study emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. Instead of optimizing messages through a shared external communicative objective, our agents exchange only discrete token sequences and update their own models using local perceptual evidence. This setting focuses on an underexplored aspect of emergent communication, examining whether common symbols can arise without shared perceptual access, and how the similarity between private visual spaces constrains the content and symmetry of the resulting language. We instantiate this setting in the Metropolis-Hastings Captioning Game (MHCG), where two agents collaboratively form shared captions by exchanging proposed token sequences that a listener accepts or rejects using an MH-style criterion evaluated against its own visual features. We compare three pairings of frozen visual encoders, with agents starting from randomly initialized text modules. Experiments on MS-COCO show that MHCG produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval; all cross-agent metrics decline as encoder mismatch increases. Moderate encoder heterogeneity reduces the number of shared sequences while preserving per-sequence visual specificity, whereas stronger encoder heterogeneity yields fewer, coarser, and more asymmetric sequences. Ablations show that listener-side MH acceptance is critical for avoiding degenerate token formation. These results suggest that shared symbols can arise from local perceptual evaluation alone, with visual representational similarity across encoders shaping both the content and symmetry of the resulting language.
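MHCG 中听者侧的接受准则即经典 Metropolis-Hastings 判据:以 min(1, p_listener(提议)/p_listener(当前)) 的概率接受提议序列。最小草图如下(对数概率接口为假设,仅演示判据本身):

```python
import math
import random

def mh_accept(logp_current, logp_proposal, rng=random):
    """MH 接受准则: 以 min(1, p(proposal)/p(current)) 的概率接受提议。"""
    ratio = math.exp(min(0.0, logp_proposal - logp_current))
    return rng.random() < ratio
```

提议在听者自身视觉特征下似然更高时必然被接受;似然更低时仍以非零概率接受,这使得共享符号能在无共享感知的情况下逐步收敛。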

[CV-101] DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

【速读】:该论文旨在解决视觉 Transformer (Vision Transformers, ViTs) 中自注意力机制因 token 序列长度带来的二次计算复杂度问题,并克服现有 token 缩减方法(如固定启发式指标、预定义比率或静态离线掩码)缺乏推理时输入依赖自适应性的局限。其解决方案的核心在于提出 DORA (Dynamic Online Reinforcement Agent),这是首个基于强化学习的在线推理框架,通过将 token 合并过程形式化为一个顺序马尔可夫决策过程 (MDP),利用轻量级强化学习智能体根据当前特征状态和层特定上下文动态决定每个 Transformer 块的合并策略。该方案的关键在于采用非对称 Actor-Critic 架构,以高容量 Critic 进行稳定离线训练,同时保留最小计算量的 Actor 头用于低开销在线推理,并通过包含非线性蒸馏惩罚的密集奖励函数优化来平衡计算效率与特征保真度。

链接: https://arxiv.org/abs/2605.11683
作者: Kaixuan He,Song Chen,Yi Kang
机构: University of Science and Technology of China (中国科学技术大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review

点击查看摘要

Abstract:Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense reward function incorporating a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that utilizes a high-capacity Critic for stable offline training while retaining a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) demonstrate that DORA improves the accuracy-efficiency Pareto front compared to current baselines. Under strict negligible accuracy-drop constraints (≤ 0.05%), DORA achieves up to a 12.66% token merging rate, and delivers up to a 569.7% relative improvement over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings compared to state-of-the-art methods. Furthermore, on out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.
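DORA 中每个 block 的合并策略由 RL 智能体在线决定;其作用对象"token 合并"这一基础操作本身,可以用一个通用的相似度贪心合并步骤来示意(下例为 ToMe 风格的草图,并非论文方法,平均合并与余弦相似度均为常见选择):

```python
import numpy as np

def merge_most_similar(tokens, r):
    """每次合并余弦相似度最高的一对 token, 共执行 r 次合并。"""
    for _ in range(r):
        x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = x @ x.T
        np.fill_diagonal(sim, -np.inf)               # 排除自身配对
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (tokens[i] + tokens[j]) / 2.0       # 平均合并最相似的一对
        keep = [k for k in range(len(tokens)) if k not in (i, j)]
        tokens = np.vstack([tokens[keep], merged[None]])
    return tokens

# 前两行 token 完全相同, 应被优先合并: 4 个 token 合并后剩 3 个
toks = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
out = merge_most_similar(toks, 1)
```

在 DORA 的设定下,每层合并多少、如何合并由智能体根据特征状态动态给出,而非像上例那样使用固定的 r。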

[CV-102] ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

【速读】:该论文试图解决感知到程序重建(perception-to-program reconstruction)任务中的评估问题,即如何系统性地衡量一个模型从给定渲染图像中生成可执行绘图程序的能力,并避免过拟合与污染。解决方案的关键在于构建一个合成基准 ShapeCodeBench:采用一种包含四种图元(primitives)的领域特定语言(DSL),所有实例由种子随机数生成器(seeded RNG)生成,从而可随时创建新的留出集(held-out set)以降低精确实例污染;同时定义了一系列评价指标(精确匹配、像素准确率、前景IoU、解析成功率、执行成功率)并按其难度划分三级(easy, medium, hard),并释放冻结的评估集及完整代码、数据与论文源材料以支持独立复现与扩展。

链接: https://arxiv.org/abs/2605.11680
作者: Shivam Kumar
机构: Independent Researcher(独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 5 figures, 2 tables. Code, data, and artifacts: this https URL ; archival release: this https URL

点击查看摘要

Abstract:We introduce ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: given a rendered raster image, a model must emit an executable drawing program that a deterministic evaluator re-renders and compares with the target. The v1 DSL has four primitives on a 512 x 512 black-on-white canvas, but every instance is generated from a seeded RNG, so fresh held-out sets can be created to reduce exact-instance contamination. We release a frozen eval_v1 split with 150 samples across easy, medium, and hard tiers, scored by exact match, pixel accuracy, foreground IoU, parse success, and execution success. We evaluate an empty-program floor, a classical computer-vision heuristic, Claude Opus 4.7 at high and max effort, and GPT-5.5 at medium and extra_high reasoning effort. The heuristic is competitive on easy scenes but collapses when overlaps fuse components; the strongest multimodal configuration preserves much of the foreground structure but still misses exact match because of small parameter errors. Best overall exact match remains low, so ShapeCodeBench is far from saturated. The benchmark code, frozen dataset, run artifacts, and paper sources are released to support independent replication and extension.
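The scoring signals named above (pixel accuracy, foreground IoU) are standard binary-mask metrics. A minimal sketch of how they could be computed between a re-rendered canvas and the target; the 4x4 toy masks and values are illustrative, not benchmark data:

```python
import numpy as np

def foreground_iou(pred, target):
    """Foreground IoU between two boolean masks (True = drawn foreground)."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return 1.0 if union == 0 else inter / union

def pixel_accuracy(pred, target):
    """Fraction of pixels where prediction and target agree."""
    return (pred == target).mean()

# Toy 4x4 canvases: target draws a 2x2 square, prediction overshoots by one column
target = np.zeros((4, 4), dtype=bool); target[:2, :2] = True
pred = np.zeros((4, 4), dtype=bool); pred[:2, :3] = True
iou = foreground_iou(pred, target)   # intersection 4, union 6
acc = pixel_accuracy(pred, target)   # 14 of 16 pixels agree
```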

[CV-103] Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

【速读】:该论文旨在解决跨域小样本学习(Cross-Domain Few-Shot Learning, CDFSL)中,视觉-语言模型(如CLIP)的少样本微调尚未得到充分探索的问题。具体而言,在CDFSL场景下,基于适配器(adapter)的方法(如LoRA)一致优于基于提示(prompt)的方法(如MaPLe),这与域内(in-domain)场景中的表现相反。解决方案的关键在于通过分析发现LoRA的优越性源于它修正了视觉CLS token的注意力崩塌(collapsed attention),从而通过聚焦于与文本相关的视觉区域,增强了模态对齐(modality alignment)和类别分离(class separation)。基于这一洞察,论文提出了Semantic Probe——一个即插即用的注意力修正框架,可同时应用于适配器和提示方法,从而使得原本在域内有效的提示方法在CDFSL中重新具有竞争力。

链接: https://arxiv.org/abs/2605.11659
作者: Yaze Zhao,Yicong Liu,Yixiong Zou,Yuhua Li,Ruixuan Li
机构: School of Computer Science and Technology, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover LoRA’s superiority stems from rectifying the collapsed attention of the visual CLS token, enhancing modality alignment and class separation by focusing on text-related visual regions. Further, we find the textual EOS token exhibits much better attention to visual samples, and CLIP’s standard contrastive loss only weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale, achieving state-of-the-art performance and benefiting both fine-tuning paradigms. Codes will be released.

[CV-104] Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

【速读】:该论文主要解决跨视角地理定位(Cross-view geo-localization, CVGL)任务中现存的三项核心局限:其一,现有全局描述子设计将图像块网格(patch grid)压缩为单一向量,未能有效分离视角差异中的布局与纹理信息;其二,与飞行高度相关的尺度变化被保留在学习到的嵌入(embedding)中,而非被边缘化;其三,多目标训练依赖人工调参的标量权重来平衡不同梯度尺度的损失函数。解决方案之关键在于提出了一种轻量级可替换头部模块 SkyPart,该模块在基于图像块的视觉Transformer(Vision Transformer, ViT)上实施显式部分分组(explicit part grouping),通过四个理论驱动的组件实现:利用可学习原型(learnable prototypes)经单次余弦分配(single-pass cosine assignment)竞争图像块令牌;仅在训练阶段施加高度条件线性调制(altitude-conditioned linear modulation),使推理时的检索嵌入与高度无关;采用图注意力读出(graph-attention readout)对活跃原型进行聚合;以及基于肯德尔不确定性加权(Kendall uncertainty-weighted)的多目标损失,其不动点满足帕累托最优(Pareto-stationary)。该方案在保持最小模型尺寸(26.95M参数,22.14 GFLOPs)的同时,在单次推理、无重排序、无测试时增强(test-time augmentation, TTA)协议下,于SUES-200、University-1652和DenseUAV基准数据集上取得了新最优结果,且在WeatherPrompt十类退化场景基准上相较于最强基线的优势进一步扩大。

链接: https://arxiv.org/abs/2605.11654
作者: Chi-Nguyen Tran,Dao Sy Duy Minh,Huynh Trung Kiet,Nguyen Lam Phu Quy,Phu-Hoa Pham,Long Tran-Thanh
机构: Faculty of Information Technology, University of Science, Vietnam National University, Ho Chi Minh City, Vietnam (信息技术学院,越南国立大学胡志明市科技大学); Department of Computer Science, University of Warwick, Coventry, United Kingdom (计算机科学系,华威大学,考文垂,英国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 37 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
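The Kendall-style uncertainty weighting in component (iv) replaces hand-tuned loss scalars with learned log-variances. A minimal sketch of the usual formulation, total = Σᵢ exp(-sᵢ)·Lᵢ + sᵢ with sᵢ = log σᵢ²; the loss values and log-variances below are illustrative stand-ins for learned parameters:

```python
import math

def kendall_weighted_loss(losses, log_vars):
    """Homoscedastic-uncertainty weighting: each objective is scaled by
    exp(-s_i) with a +s_i regularizer so the learned s_i cannot collapse
    every loss to zero. s_i = log(sigma_i^2) is trained with the model."""
    return sum(math.exp(-s) * l + s for l, s in zip(losses, log_vars))

# Two objectives on incompatible gradient scales; a large learned variance
# automatically down-weights the numerically huge loss term.
losses = [2.0, 200.0]
log_vars = [0.0, math.log(100.0)]
total = kendall_weighted_loss(losses, log_vars)
```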

[CV-105] Unlocking UML Class Diagram Understanding in Vision Language Models

【速读】:该论文试图解决视觉语言模型(Vision Language Models, VLMs)在图表(如UML类图)理解与问答任务上落后于照片的问题,特别是针对计算机科学领域中的特定图表类型(如UML类图)缺乏相关研究。解决方案的关键在于:构建一个兼具挑战性与可管理性的基于UML类图的视觉问答基准(benchmark),并生成包含16,000个图像-问题-答案三元组的大规模训练数据集,在此基础上利用LoRA微调方法,使模型在该任务上轻松超越当前在多项基准中表现优异的Qwen 3.5 27B模型。

链接: https://arxiv.org/abs/2605.11634
作者: Artem Naboichenko,René Peinl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regarding diagrams compared to photos. Although progress has been made on bar charts, line charts, and similar chart types, there is still little research on other kinds of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams which is both challenging and manageable. We further construct a large-scale training dataset with 16,000 image-question-answer triples and show that a LoRA-based finetune easily outperforms Qwen 3.5 27B, a recent VLM that performs well on many other benchmarks.

[CV-106] Single-Shot HDR Recovery via a Video Diffusion Prior

【速读】:该论文旨在解决单次高动态范围(HDR)图像重建中生成方法对输入图像保真度不足、需为高光和阴影分别设计模型、以及直接预测最终HDR图像导致可解释性差的问题。解决方案的关键在于将单次HDR重建重新定义为条件视频生成(conditional video generation),通过微调视频扩散模型(video diffusion model),以输入的低动态范围(LDR)图像为条件生成曝光包围序列(exposure bracket),再使用轻量级UNet预测逐像素权重(per-pixel weights)将生成的帧融合为最终HDR图像。这一显式重建中间曝光堆栈再融合的策略避免了直接幻觉HDR图像,提升了可解释性和输入保真度,并且无需针对不同曝光范围使用单独模型。

链接: https://arxiv.org/abs/2605.11628
作者: Chinmay Talegaonkar,Jinshi He,Christopher McKenna,Nicholas Antipa
机构: University of California San Diego (加利福尼亚大学圣地亚哥分校); Creare LLC (Creare有限责任公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent generative methods for single-shot high dynamic range (HDR) image reconstruction show promising results, but often struggle with preserving fidelity to the input image. They require separate models to handle highlights and shadows, or sacrifice interpretability by directly predicting the final HDR image. We address these limitations by re-casting single-shot HDR reconstruction as conditional video generation and fusing the generated frames into an HDR image. We finetune a video diffusion model to generate an exposure bracket, conditioned on a low dynamic range (LDR) input. We fuse this image bracket using per-pixel weights predicted by a light-weight UNet. This formulation is simple, interpretable, and effective. Rather than directly hallucinating an HDR image, it explicitly reconstructs the intermediate exposure stack and fuses it into the final output. Our method eliminates the need for separate models across exposure regimes and produces HDR reconstructions with high input fidelity. On quantitative benchmarks, we outperform state-of-the-art generative baselines with comparable model capacity on several reconstruction metrics. Human evaluators further prefer our results in 72% of pairwise comparisons against existing methods. Finally, we show that this input-conditioned sequence generation and fusion framework extends beyond HDR to other image reconstruction tasks, such as all-in-focus image recovery from a single defocus-blurred input.
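The final fusion step, weighting each generated exposure per pixel and summing, can be sketched as follows. In the paper the weights come from a lightweight UNet; here they are uniform placeholders and the 2x2 frames are toy data:

```python
import numpy as np

def fuse_bracket(frames, weights):
    """Fuse an exposure bracket into one HDR image with per-pixel weights.

    frames:  (N, H, W) stack of linearized LDR exposures
    weights: (N, H, W) non-negative per-pixel weights
    """
    # Normalize weights over the exposure axis so each pixel is a convex blend
    weights = weights / (weights.sum(axis=0, keepdims=True) + 1e-8)
    return (weights * frames).sum(axis=0)

# Toy bracket: a dark and a bright rendering of the same 2x2 scene
dark = np.array([[0.1, 0.2], [0.3, 0.4]])
bright = np.array([[0.5, 0.6], [0.7, 0.8]])
frames = np.stack([dark, bright])
weights = np.ones_like(frames)  # uniform weights reduce to a plain mean
hdr = fuse_bracket(frames, weights)
```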

[CV-107] RNA-FM: Flow-Matching Generative Model for Genome-wide RNA-Seq Prediction ICML2026

【速读】:该论文旨在解决从组织病理学全切片图像(WSIs)预测全基因组RNA-seq表达谱时,现有确定性回归方法因一对一映射而无法有效捕获生物异质性及预测不确定性的问题。解决方案的关键在于提出了RNA-FM,一种基于流匹配(flow-matching)的生成框架,它将转录组预测重新表述为连续时间条件下的传输问题,通过学习一个以形态学为条件的速度场(velocity field),将简单先验分布映射至目标基因表达分布,并结合通路级结构实现可扩展、生物学可解释的全基因组基因表达插补。

链接: https://arxiv.org/abs/2605.11622
作者: Yaxuan Song,Jianan Fan,Tianyi Wang,Qiuyue Hu,Hang Chang,Heng Huang,Weidong Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 13 tables, 3 figures. Accepted by the Forty-Third International Conference on Machine Learning (ICML2026). Code is available at this https URL

点击查看摘要

Abstract:Histopathology whole-slide images (WSIs) are routinely acquired in clinical practice and contain rich tissue morphology but lack direct molecular architecture and functional programs defining pathological states, whereas RNA sequencing (RNA-seq) provides genome-wide transcriptional profiles at substantial cost, thereby motivating WSI-based genome-wide transcriptomic prediction. Existing approaches for predicting gene expression from WSIs predominantly rely on deterministic regression with one-to-one mapping, limiting their ability to capture biological heterogeneity and predictive uncertainty. We propose RNA-FM, a flow-matching generative framework for genome-wide bulk RNA-seq prediction from WSIs. RNA-FM formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a velocity field that maps a simple prior to the target gene expression distribution conditioned on morphologies. By integrating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation. Extensive experiments demonstrate that RNA-FM consistently outperforms state-of-the-art approaches while maintaining biological meaningfulness. Code is available at this https URL.
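Flow matching with a linear interpolant has a simple closed-form training target: a point on the straight path from a prior sample to the data, and the constant velocity of that path. A minimal sketch; the morphology conditioning and the velocity-field regressor itself are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(x0, x1, t):
    """Linear-interpolant flow matching: return the position x_t on the
    straight path from prior sample x0 to data x1, and the target velocity
    the learned field v_hat(x_t, t, condition) should regress onto."""
    x_t = (1.0 - t) * x0 + t * x1  # point on the path at time t
    v_target = x1 - x0             # constant velocity of the linear path
    return x_t, v_target

x0 = rng.normal(size=4)                  # sample from the simple prior
x1 = np.array([1.0, 2.0, 3.0, 4.0])      # toy "gene expression" target
x_t, v = flow_matching_pair(x0, x1, t=0.5)
# training would minimize the squared error between v_hat and v
```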

[CV-108] Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

【速读】:该论文试图解决在无训练条件下的功能可供性接地(functional affordance grounding)问题,即智能体需要从3D场景中定位支持特定交互的可操作区域(如把手、按钮等),而现有训练无关的视觉-语言管线因可操作区域通常较小、视觉模糊且场景中同类实例重复而难以准确识别。解决方案的关键在于提出AFFORDMEM框架,通过构建两个层次的记忆来补偿信息缺失:一是跨场景可供性记忆(cross-scene affordance memory),维护一个类别级的RGB图像记忆库,其中叠加渲染了可供性区域,在查询时检索最相关的示例以引导冻结的视觉语言模型(VLM)聚焦于文本提示常遗漏的小可操作子区域;二是场景内空间记忆(in-scene spatial memory),将场景中候选实例及其3D空间关系组织成结构化场景图(scene graph),使语言模型能够解析涉及远处或当前未观察候选的空间引用(如“顶部第二个把手”)。该框架无需模型微调且无需目标场景标注,仅基于源场景构建可复用记忆库。

链接: https://arxiv.org/abs/2605.11616
作者: Qirui Wang,Jingyi He,Yining Pan,Xulei Yang,Shijie Li
机构: TUM(慕尼黑工业大学); A*STAR(新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as “the second handle from the top.” AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. A project homepage is available.

[CV-109] Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLM s

【速读】:该论文旨在解决全模态大语言模型(Omni-LLMs)因处理大量多模态输入 token 而带来的巨大计算开销问题,并指出现有推理时剪枝方法仅依据当前查询或跨模态对齐度选择重要 token,可能丢弃对后续问题或理解非对齐视听上下文有贡献的证据。解决方案的关键在于提出 ContextGuard,一个基于“保留广泛视听上下文、去除跨模态冗余”原则的推理时剪枝框架。具体地,ContextGuard 从音频预测粗略的视觉语义,剪除那些语义大概率可从音频恢复的视频 token,同时保留额外的视频 token 以维持音频无法确定的局部视觉细节;此外,通过合并时间上相似的视频 token 实现进一步压缩。该框架无需对下游 LLM 进行微调,仅依赖独立训练的轻量级预测器,在 Qwen2.5-Omni 和 Video-SALMONN2+(3B 和 7B 规模)的六个音视频基准测试中,ContextGuard 在剪掉更多 token 的同时优于现有推理时剪枝方法,并在 Qwen2.5-Omni 7B 上剪除 55% 输入 token 后仍能在五个基准上达到全 token 级性能。

链接: https://arxiv.org/abs/2605.11605
作者: Chaeyoung Jung,Kyeongha Rho,Joon Son Chung
机构: Korea Advanced Institute of Science and Technology (KAIST) (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
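The temporal-merging step can be illustrated with a greedy pass that averages runs of adjacent, near-duplicate token embeddings. The cosine threshold, the 2-D toy tokens, and the running-mean merge rule are illustrative; the paper's merging criterion may differ in detail:

```python
import numpy as np

def merge_similar_tokens(tokens, thresh=0.95):
    """Greedily merge temporally adjacent tokens whose cosine similarity
    exceeds `thresh`, replacing each merged run by its running mean."""
    merged = [tokens[0].copy()]
    counts = [1]
    for tok in tokens[1:]:
        prev = merged[-1]
        cos = prev @ tok / (np.linalg.norm(prev) * np.linalg.norm(tok) + 1e-8)
        if cos > thresh:
            merged[-1] = (prev * counts[-1] + tok) / (counts[-1] + 1)
            counts[-1] += 1
        else:
            merged.append(tok.copy())
            counts.append(1)
    return np.stack(merged)

# First two frames are nearly identical, the third is orthogonal
tokens = np.array([[1.0, 0.0], [1.0, 0.001], [0.0, 1.0]])
out = merge_similar_tokens(tokens)  # 3 tokens collapse to 2
```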

[CV-110] HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

【速读】:该论文试图解决闭环驾驶仿真中自回归展开(autoregressive rollout)的监督来源问题。现有蒸馏方法依赖教师单次输出的有限时长监督,或引入学生端退化训练,前者因快速自运动和场景变化而迁移性差,后者仍受限于教师输出长度。关键困难在于标准教师模型在自身预测下会漂移,污染监督信号。解决方案的关键是赋予教师模型稳定的自回归展开能力,具体通过HorizonDrive框架实现:首先采用计划性展开恢复(Scheduled Rollout Recovery)训练基础模型从预测损坏的历史中重建真实未来片段,生成在长自回归展开中保持稳定的教师;随后将该可展开教师通过自回归扩展,在有限内存成本下提供无界时域的分布匹配监督,同时利用教师展开DMD(Teacher Rollout DMD)让短窗口学生对齐教师,实现高效实时部署。这一机制使得教师自身成为可靠的长时监督源,突破了传统蒸馏的监督范围限制。

链接: https://arxiv.org/abs/2605.11596
作者: Conglang Zhang,Yifan Zhan,Qingjie Wang,Zhanpeng Ouyang,Yu Li,Zihao Yang,Xiaoyang Guo,Weiqiang Ren,Qian Zhang,Zhen Dong,Yinqiang Zheng,Wei Yin,Zhengqing Chen
机构: Wuhan University (武汉大学); The University of Tokyo (东京大学); Horizon Robotics (地平线机器人); Tsinghua University (清华大学); University of Science and Technology of China (中国科学技术大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher’s single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.

[CV-111] PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations

【速读】:该论文旨在解决自动驾驶场景高保真重建中现有前馈3D高斯泼溅(3D Gaussian Splatting, 3DGS)方法存在的多视角不一致和分层伪影问题,以及动态实例建模时缺乏显式跨视角对应和实例级一致性的缺陷。解决方案的关键在于两点:一是采用点对齐表示(point-aligned representations),即在世界空间初始化稀疏3D查询(sparse 3D queries),通过时空融合(spatial-temporal fusion)将多视图图像信息聚合到这些查询上,从而在单次前馈传递中强制实现显式跨视角一致性;二是引入场景图(scene graphs)显式组织动态实例,利用3D边界框实现实例级运动传播与时间一致性动态表示,从而在整体框架中同时保证静态场景的几何一致性和动态对象的实例级连贯性。

链接: https://arxiv.org/abs/2605.11594
作者: Cheng Chi,Xianqi Wang,Hongcheng Luo,Mingfei Tu,Gangwei Xu,Zehan Zhang,Bing Wang,Guang Chen,Hangjun Ye,Sida Peng,Xin Yang,Haiyang Sun
机构: Xiaomi EV (小米电动汽车); Huazhong University of Science and Technology (华中科技大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-fidelity reconstruction of driving scenes is crucial for autonomous driving. While recent feedforward 3D Gaussian Splatting (3DGS) methods enable fast reconstruction, their per-pixel Gaussian prediction paradigm often suffers from multi-view inconsistency and layering artifacts. Moreover, existing methods often model dynamic instances via dense flow prediction, which lacks explicit cross-view correspondence and instance-level consistency. In this paper, we propose PointForward, a feedforward driving reconstruction framework through point-aligned representations. Unlike pixel-aligned methods, we initialize sparse 3D queries in world space and aggregate multi-view image information via spatial-temporal fusion onto these queries, enforcing explicit cross-view consistency in a single feedforward pass. To handle scene dynamics, we introduce scene graphs that explicitly organize moving instances during reconstruction. By leveraging 3D bounding boxes, our method enables instance-level motion propagation and temporally consistent dynamic representations. Extensive experiments demonstrate that PointForward achieves state-of-the-art performance on large-scale driving benchmarks. The code will be available upon the publication of the paper.

[CV-112] Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在多图像跨模态检索任务中存在的严重位置偏差问题——即模型预测结果主要由输入顺序主导,而非图像与文本之间的语义相关性。解决方案的关键在于提出一种无需训练的注意力引导去偏框架,该框架基于作者发现的“对数几率-注意力发散”(Logit-Attention Divergence)现象:尽管输出logits表现出显著偏差,但模型内部的注意力图仍能良好地对齐相关视觉证据。因此,该方法在推理阶段利用内在的注意力信号进行实例级校正,仅需少量校准集且计算开销极低,从而有效提升排列不变性(permutation invariance)并实现当前最优性能,将准确率相对基线提升超过40%。

链接: https://arxiv.org/abs/2605.11591
作者: Mingtao Xian,Yifeng Yang,Qinying Gu,Xinbing Wang,Nanyang Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance in multi-image cross-modal retrieval, yet suffer from severe position bias, where predictions are dominated by input order rather than semantic relevance. Through empirical analysis, we identify a phenomenon termed Logit-Attention Divergence, in which output logits are heavily biased while internal attention maps remain well-aligned with relevant visual evidence. This observation reveals a fundamental limitation of existing logit-level calibration methods such as PriDe. Based on this insight, we propose a training-free, attention-guided debiasing framework that leverages intrinsic attention signals for instance-level correction at inference time, requiring only a minimal calibration set with negligible computational overhead. Experiments on MS-COCO-based benchmarks show that our method substantially improves permutation invariance and achieves state-of-the-art performance, enhancing accuracy by over 40% compared to baselines. Code is available at this https URL.

[CV-113] A Mixture Autoregressive Image Generative Model on Quadtree Regions for Gaussian Noise Removal via Variational Bayes and Gradient Methods

【速读】:该论文致力于解决灰度图像的去噪问题。其解决方案的关键在于提出了一种结合四叉树区域分割模型(quadtree region-partitioning model)与混合自回归模型(mixture autoregressive model)的概率图像生成模型,并将基于最大后验估计(MAP)的去噪问题转化为最大化一个变分下界(variational lower bound)的框架。为了最大化该下界,作者开发了一种交替应用变分贝叶斯(variational Bayes)和梯度方法(gradient methods)的算法,且特别指出梯度更新规则可以解析计算,无需数值计算或近似。

链接: https://arxiv.org/abs/2605.11585
作者: Shota Saito,Yuta Nakahara,Kohei Horinouchi,Naoki Ichijo,Manabu Kobayashi,Toshiyasu Matsushima
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper addresses the problem of image denoising for grayscale images. We propose a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, and propose a framework that reduces MAP (maximum a posteriori)-estimation-based denoising to the maximization of a variational lower bound. To maximize this lower bound, we develop an algorithm that alternately applies variational Bayes and gradient methods. We particularly demonstrate that the gradient-based update rule can be computed analytically without numerical computation or approximation. We carried out experiments verifying that the proposed algorithm effectively removes image noise and identifying directions for future improvement.
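The quadtree region-partitioning idea can be sketched with a hard variance-threshold split. Note this is a simplification: the paper treats the partition as a latent variable inferred by variational Bayes, not a threshold rule, and the threshold and toy image below are illustrative:

```python
import numpy as np

def quadtree_split(img, var_thresh, min_size=2):
    """Recursively split a square image into quadrants while the region's
    intensity variance exceeds var_thresh; return leaf regions as
    (row, col, height, width) tuples in the original image's coordinates."""
    h, w = img.shape
    if h <= min_size or img.var() <= var_thresh:
        return [(0, 0, h, w)]  # homogeneous (or minimal) region: one leaf
    regions = []
    hh, hw = h // 2, w // 2
    for dy in (0, hh):
        for dx in (0, hw):
            sub = quadtree_split(img[dy:dy + hh, dx:dx + hw], var_thresh, min_size)
            regions += [(y + dy, x + dx, rh, rw) for y, x, rh, rw in sub]
    return regions

img = np.zeros((4, 4))
img[:2, :2] = 1.0  # one bright quadrant makes the whole image heterogeneous
leaves = quadtree_split(img, var_thresh=0.01)  # splits once into 4 quadrants
```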

[CV-114] he Midas Touch for Metric Depth

【速读】:该论文旨在解决相对深度估计(relative depth estimation)在实际应用中因缺少度量尺度(metric scale)、局部不一致性(local inconsistencies)和低计算效率(low computational efficiency)导致的局限性。其解决方案的关键在于提出了一种数学可解释的方法——Midas Touch for Depth (MTD),该方法仅利用极其稀疏的3D数据即可将相对深度转换为度量深度。具体而言,MTD通过稀疏图优化(sparse graph optimization)实现分段恢复策略(segment-wise recovery strategy)以消除局部尺度不一致,继而采用不连续性感知测地代价(discontinuity-aware geodesic cost)进行像素级细化(pixel-wise refinement)。这种设计不仅保证了强泛化能力,还显著提升了精度,同时其轻量级即插即用特性便于部署到多种下游3D任务中。

链接: https://arxiv.org/abs/2605.11578
作者: Yu Ma,Zizhan Guo,Zuyi Xiong,Haoran Zhang,Yi Feng,Hongbo Zhao,Hanli Wang,Rui Fan
机构: College of Electronic and Information Engineering, Tongji University (同济大学电子与信息工程学院); Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University (同济大学上海自主智能无人系统科学中心); National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University (西安交通大学人机混合增强智能全国重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D tasks. Project page is available at this https URL.
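The core of segment-wise recovery, fitting an affine map from relative to metric depth on a few sparse 3D anchors, can be sketched with a per-segment least-squares fit. This omits the paper's cross-segment graph optimization and geodesic refinement, and all arrays below are toy data:

```python
import numpy as np

def recover_metric_scale(rel_depth, sparse_metric, mask):
    """Fit metric = s * relative + b by least squares on the sparse metric
    anchors inside one segment, then apply the affine map to every pixel."""
    r = rel_depth[mask]
    m = sparse_metric[mask]
    A = np.stack([r, np.ones_like(r)], axis=1)   # design matrix [r, 1]
    (s, b), *_ = np.linalg.lstsq(A, m, rcond=None)
    return s * rel_depth + b

rel = np.array([[0.1, 0.2], [0.3, 0.4]])         # scale-free relative depth
metric = np.array([[1.0, 2.0], [3.0, 4.0]])      # ground truth = 10 * rel
mask = np.array([[True, False], [True, False]])  # only two sparse anchors known
out = recover_metric_scale(rel, metric, mask)
```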

[CV-115] B-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

【速读】:该论文试图解决在时间对齐的音频与视觉信号缺乏明确语义对应时,跨模态对齐困难的问题。解决方案的关键在于引入文本作为语义锚点(semantic anchor),通过参数高效微调(PEFT)框架中的核心模块——文本桥接音频-视觉适配器(TB-AVA),实现音频与视觉流之间的文本介导交互;其中门控语义调制(GSM)模块根据文本推断的语义相关性选择性调制特征通道,从而在冻结音频和视觉编码器的情况下达成高效的跨模态表征学习。

链接: https://arxiv.org/abs/2605.11572
作者: Seongah Kim,Dinh Phu Tran,Hyeontaek Hwang,Saad Wazir,Duc Do Minh,Daeyoung Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Audio-visual understanding requires effective alignment between heterogeneous modalities, yet cross-modal correspondence remains challenging when temporally aligned audio and visual signals lack clear semantic correspondence. We propose to use text as a semantic anchor for audio-visual representation learning. To this end, we introduce a parameter-efficient adaptation framework built on frozen audio and visual encoders, centered on Text-Bridged Audio-Visual Adapter (TB-AVA), which enables text-mediated interaction between audio and visual streams. At the core of TB-AVA, Gated Semantic Modulation (GSM) selectively modulates feature channels based on text-inferred semantic relevance. We evaluate the proposed approach on multiple benchmarks, including AVE, AVS, and AVVP, where the proposed framework achieves state-of-the-art performance, demonstrating text as an effective semantic anchor for parameter-efficient fine-tuning (PEFT) in audio-visual learning.

[CV-116] Dynamic Execution Commitment of Vision-Language-Action Models

【速读】:该论文旨在解决视觉-语言-动作模型(Vision-Language-Action, VLA)中固定执行步长(fixed execution horizon)导致的性能脆弱性问题:现有动作分块(action chunking)策略虽通过一次前向预测多个连续动作来分摊推理成本,但依赖任务调优的固定步长忽略了预测可靠性的状态依赖性,在动态或分布外场景下性能不稳定。解决方案的关键在于提出自适应动作接受机制(Adaptive Action Acceptance, A3),将动态执行承诺重构为自推测前缀验证(self-speculative prefix verification)问题:首先通过组采样(group sampling)计算动作的轨迹级共识分数(consensus score),然后选择代表性草稿(draft)并优先进行下游验证。具体强制执行两条核心原则:(1)共识排序条件不变性(consensus-ordered conditional invariance),即通过共识分数高的动作作为条件重新解码低共识动作,验证其是否保持一致性;(2)前缀闭合序列一致性(prefix-closed sequential consistency),仅接受从起点开始最长的连续已验证动作序列,从而保证物理推演的完整性。由此,执行步长自然表现为满足模型内部逻辑和序列执行约束的最长可验证前缀(longest verifiable prefix),无需手工调参即可在鲁棒性与吞吐量之间实现更优权衡。

链接: https://arxiv.org/abs/2605.11567
作者: Feng Chen,Xianghui Wang,Yuxuan Chen,Boying Li,Yefei He,Zeyu Zhang,Yicheng Wu
机构: University of Adelaide (阿德莱德大学); Sichuan University (四川大学); Shanghai Jiao Tong University (上海交通大学); Monash University (莫纳什大学); Zhejiang University (浙江大学); Imperial College London (伦敦帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code will be released in the next version

点击查看摘要

Abstract:Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
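The two acceptance rules can be sketched in a few lines: a consensus score per step from group sampling, then commitment to the longest verified prefix. The per-step majority-vote score and the threshold verifier below are simplified stand-ins for the paper's trajectory-wise score and conditional re-decoding check:

```python
def consensus_scores(action_groups):
    """Per decision step, the fraction of sampled rollouts agreeing with
    the most common action (a toy proxy for trajectory-wise consensus)."""
    scores = []
    for step_actions in action_groups:
        best = max(set(step_actions), key=step_actions.count)
        scores.append(step_actions.count(best) / len(step_actions))
    return scores

def accept_prefix(verified):
    """Prefix-closed acceptance: commit only the longest run of verified
    actions starting at step 0, preserving physical rollout integrity."""
    n = 0
    for ok in verified:
        if not ok:
            break
        n += 1
    return n

# 4 sampled rollouts over a 4-step chunk of discretized actions
groups = [["L", "L", "L", "L"],   # unanimous
          ["F", "F", "F", "R"],   # mostly agree
          ["R", "L", "F", "B"],   # no consensus
          ["F", "F", "F", "F"]]   # unanimous, but after the broken step
scores = consensus_scores(groups)
verified = [s >= 0.5 for s in scores]  # threshold as a cheap verifier proxy
horizon = accept_prefix(verified)      # later verified steps are NOT committed
```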

[CV-117] CP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

【速读】:该论文旨在解决状态空间模型(SSMs)在长程视觉任务中因隐式递归动态导致的状态依赖记忆行为难以控制的问题,特别是在紧凑型主干网络中长扫描路径易超出有效记忆范围。解决方案的关键在于提出Token-Conditioned Poles SSM(TCP-SSM),一个结构化选择性SSM框架,通过引入显式且可解释的稳定极点(包括建模单调或符号交替衰减的实极点,以及捕获阻尼振荡响应的复共轭极点)来明确递归动态。通过有界半径和角度调制将共享基极点转换为依赖于当前视觉令牌的极点,使每个扫描步骤能自适应调整记忆行为,同时保持极点稳定性。此外,集成分组极点共享与轻量级低秩输入通路,在保持线性复杂度前提下实现高效扫描算子,在Vision Mamba风格模型中可将SSM计算复杂度降低高达44%并维持或提升基线精度。

链接: https://arxiv.org/abs/2605.11563
作者: Sara Shoouri,Morteza Tavakoli Taba,Hun-Seok Kim
机构: University of Michigan(密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:State Space Models (SSMs) have emerged as a compelling alternative to attention models for long-range vision tasks, offering input-dependent recurrence with linear complexity. However, most efficient SSM variants reduce computation cost by modifying scan routes, resolutions, or traversal patterns, while largely leaving the recurrent dynamics implicit. Consequently, the model’s state-dependent memory behavior is difficult to control, particularly in compact backbones where long scan paths can exceed the effective memory horizon. We propose Token-Conditioned Poles SSM (TCP-SSM), a structured selective SSM framework that improves efficiency while making recurrence dynamics explicit and interpretable through stable poles. TCP-SSM builds each scan operator with 1) real poles that model monotone or sign-alternating decay, and 2) complex-conjugate poles that capture damped oscillatory responses. Using bounded radius and angle modulation, TCP-SSM converts shared base poles into token-dependent poles, allowing each scan step to adapt its memory behavior to the current visual token while preserving pole stability. For practical scalability, we integrate grouped pole sharing with a lightweight low-rank input pathway, yielding an efficient scan operator that preserves linear-time scan complexity. Across image classification, semantic segmentation, and object detection, TCP-SSM reduces SSM computation complexity by up to 44% in Vision Mamba-style models while maintaining or surpassing baseline accuracy.
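The stability argument for the real-pole branch is easy to see in one dimension: if each step's pole is a sigmoid of a base logit plus a token-dependent modulation, it is bounded in (0, 1) by construction, so the recurrence cannot diverge. A minimal sketch under that assumption (the complex-conjugate poles and the low-rank input pathway are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def real_pole_scan(u, base_logit, modulation):
    """1-D selective scan with a token-conditioned real pole.

    Each step uses r_t = sigmoid(base_logit + modulation_t), so |r_t| < 1
    whatever the tokens are, and h_t = r_t * h_{t-1} + u_t stays stable."""
    h = 0.0
    out = []
    for u_t, m_t in zip(u, modulation):
        r_t = sigmoid(base_logit + m_t)  # bounded, token-dependent pole
        h = r_t * h + u_t
        out.append(h)
    return np.array(out)

u = np.array([1.0, 0.0, 0.0, 0.0])                 # unit impulse input
ys = real_pole_scan(u, base_logit=0.0, modulation=np.zeros(4))
# with zero modulation the pole is 0.5 everywhere: geometric decay
```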

[CV-118] When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

【速读】:该论文试图解决多模态大语言模型(MLLMs)中因视觉幻觉(visual hallucinations)导致的生成内容与图像事实不一致或虚构对象的问题。其核心洞察在于:幻觉并非单纯源于视觉注意力缺失,而是即使模型赋予图像令牌(image tokens)显著注意力权重,其内部表示仍可能向错误答案偏移;通过逐层拉普拉斯能量(layer-wise Laplacian energy)对视觉注意力高频结构的分析,可定位幻觉偏好出现的层以及真实答案短暂恢复的层。基于此,解决方案的关键是提出了一种无需训练的推理策略LaSCD(Laplacian-Spectral Contrastive Decoding),它利用拉普拉斯能量挑选信息丰富的层,并以闭式解形式重新映射下一令牌的logits,从而在不牺牲通用能力的前提下系统性降低幻觉。

链接: https://arxiv.org/abs/2605.11559
作者: Fanpu Cao,Xin Zou,Xuming Hu,Hui Xiong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that selects informative layers via Laplacian energy and remaps next-token logits in closed form. Experiments on hallucination and general multimodal benchmarks show that LaSCD consistently reduces hallucination while preserving general capabilities, highlighting its potential as a faithful decoding paradigm. The code is available at this https URL.

[CV-119] ScribbleDose: Scribble-Guided Dose Prediction in Radiotherapy MICCAI2026

【速读】:该论文试图解决传统放疗剂量预测中依赖精确人工标注解剖结构掩膜耗时费力的问题,提出了一种基于稀疏涂鸦标注的剂量预测框架。解决方案的关键在于设计了两个核心模块:涂鸦补全模块(Scribble Completion Module, SCM)通过将稀疏涂鸦标签传播至语义相似的体素来生成密集解剖掩膜,并引入基于超体素的正则化以保持几何边界一致性;结构引导剂量生成模块(Structure-Guided Dose Generation Module, SGDGM)则用于加强稀疏结构线索与剂量分布之间的对应关系,从而构建一个“涂鸦-掩膜-剂量”的稀疏标注学习流水线,使模型仅利用稀疏结构标注即可实现与密集标注相当的剂量预测性能。

链接: https://arxiv.org/abs/2605.11555
作者: Zhenxi Zhang,Yitao Zhuang,Yao Pu,Peixin Yu,Zirong Li,Yan Xia,Hui Li,Bin Li,Fuchen Zheng,Ge Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is a preprint version of the paper. The final version will appear in the proceedings of MICCAI 2026

点击查看摘要

Abstract:Anatomical structure masks are widely adopted in radiotherapy dose prediction, as they provide explicit geometric constraints that facilitate structure-dose coupling. However, conventional manual delineation of these masks requires precise annotation of structure boundaries relevant to radiotherapy, which is time-consuming and labor-intensive. To address these limitations, we propose a scribble-guided dose prediction framework that relies solely on anatomical structures annotated with sparse scribbles. Specifically, we design a Scribble Completion Module (SCM) to generate dense anatomical masks by propagating sparse scribble labels to semantically similar voxels. During the propagation process, a supervoxel-based regularization is introduced to preserve geometric boundary consistency to ensure anatomical plausibility. Furthermore, we propose a Structure-Guided Dose Generation Module (SGDGM) to strengthen the correspondence between sparse structural cues and dose distribution. The completed dense masks derived from scribbles serve as structural guidance to condition dose prediction, forming a scribble-mask-dose learning pipeline under sparse annotation. Experiments on the GDP-HMM dataset demonstrate that ScribbleDose achieves competitive dose prediction performance using only sparse structural annotations. The source code and reannotated scribble annotations are publicly available at this https URL.

[CV-120] VNDUQE: Information-Theoretic Novelty Detection using Deep Variational Information Bottleneck

【速读】:该论文旨在解决现有分布外(OOD)检测方法(如最大softmax概率MSP)缺乏理论支撑、校准误差大、无法可靠量化不确定性的问题。解决方案的关键在于利用深度变分信息瓶颈(VIB)显式约束网络表示中的信息流,并引入信息论度量(KL散度和预测熵)作为互补的检测信号:KL散度用于高效检测远OOD样本(如噪声、域偏移),预测熵用于精准检测近OOD样本(如新颖数字类别),通过并行策略融合两者实现高检测性能;同时,VIB的压缩机制(β=10^-3)可显著降低期望校准误差(ECE),产生更可靠的不确定性估计,从而支持基于阈值的主动学习查询选择。

链接: https://arxiv.org/abs/2605.11551
作者: Aryan Gondkar,Hayder Radha,Yiming Deng
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注: 6 pages, 3 figures, Fall 2025 version

点击查看摘要

Abstract:Detecting out-of-distribution (OOD) samples is critical for safe deployment of neural networks in safety-critical applications. While maximum softmax probability (MSP) provides a simple baseline, it lacks theoretical grounding and suffers from miscalibration. We propose VNDUQE (VIB-based Novelty Detection and Uncertainty Quantification for Nondestructive Evaluation), which investigates novelty detection through the Deep Variational Information Bottleneck (VIB), which explicitly constrains information flow through learned representations. We train VIB models on MNIST with held-out digit classes and evaluate OOD detection using information-theoretic metrics: KL divergence and prediction entropy. Our results reveal complementary detection signals: KL divergence achieves perfect detection (100% AUROC on noise) on far-OOD samples (noise, domain shift), while prediction entropy excels at near-OOD detection (94.7% AUROC on novel digit classes). A parallel detection strategy combining both metrics achieves 95.3% average AUROC and 92% true positive rate at 5% false positive rate, which is a 32 percentage point improvement over baseline MSP (85.0% AUROC, 60.1% TPR). Compression via the information bottleneck principle ( \beta=10^-3 ) reduces Expected Calibration Error by 38%, demonstrating that information-theoretic constraints produce fundamentally more reliable uncertainty estimates. These findings directly support active learning with expensive computational oracles, where well-calibrated novelty detection enables principled threshold selection for oracle queries.
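摘要中的两种检测信号都有简单的闭式表达。下面给出一个标量版本草图(非论文代码;变量维度与数值均为假设示例):vib_kl 是对角高斯后验相对标准正态先验的 KL 散度(远 OOD 信号),prediction_entropy 是 softmax 输出的预测熵(近 OOD 信号)。

```python
import math

def vib_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ) 的闭式解, 逐维求和。"""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def prediction_entropy(probs, eps=1e-12):
    """softmax 输出的预测熵(单位: nat)。"""
    return -sum(p * math.log(p + eps) for p in probs)

# KL 大 -> 编码偏离先验(远 OOD); 熵大 -> 类间混淆(近 OOD)
assert vib_kl([0.0, 0.0], [0.0, 0.0]) == 0.0                 # 后验与先验重合
assert vib_kl([3.0, -3.0], [0.0, 0.0]) > vib_kl([0.1, 0.1], [0.0, 0.0])
assert prediction_entropy([0.5, 0.5]) > prediction_entropy([0.99, 0.01])
```

论文的并行策略即对两个分数各设阈值、任一超限即报警。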

[CV-121] The DAWN of World-Action Interactive Models

【速读】:该论文试图解决现有世界动作模型(World Action Models, WAMs)中世界预测与动作生成之间缺乏交互性的问题。传统方法要么将两者视为独立并行的分支,要么采用刚性的“先预测后规划”流水线,从而忽略了场景演化与操控决策之间的相互依赖性。解决方案的关键在于提出了世界-动作交互模型(World-Action Interactive Models, WAIMs),并通过DAWN(Denoising Actions and World Interactive model)实例化。DAWN在一个紧凑的语义潜在空间(semantic latent space)中运作,将世界预测器(World Predictor)与世界条件动作去噪器(World-Conditioned Action Denoiser)耦合:预测的世界假设用于条件化动作去噪,而去噪后的动作假设则反馈更新世界预测,两者在推理过程中递归地相互细化。此外,DAWN不试图完全消除测试时世界演化或在像素空间中展开整个未来,而是执行一个简短的显式潜在展开(explicit latent rollout),该展开足以支持复杂交互场景中的长视野轨迹生成,从而兼顾规划性能与安全性。

链接: https://arxiv.org/abs/2605.11550
作者: Hongbo Lu,Liang Yao,Chenghao He,Haoyu Wang,Xiang Gu,Xianfei Li,Wenlong Liao,Tao He,Pai Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines. We formalize this perspective as World-Action Interactive Models (WAIMs), and instantiate it in autonomous driving with DAWN (Denoising Actions and World iNteractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a World Predictor with a World-Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test-time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world-action generation is a principled path toward truly actionable world models.

[CV-122] GeoR-Bench: Evaluating Geoscience Visual Reasoning

【速读】:该论文试图解决当前地球科学人工智能(AI)研究缺乏能够评估开放、真实世界地球科学问题的基准的问题。现有基准多聚焦于任务特定场景(如遥感解译、地理问答),无法捕获真实地球科学问题的开放性和复杂性,因而难以衡量AI系统距离真正的地球科学智能(Geoscience intelligence)的差距。解决方案的关键在于提出 \textbf{GeoR-Bench}(Benchmark for evaluating Geoscience visual Reasoning through reasoning informed visual editing tasks),该基准通过推理引导的视觉编辑任务来评估地球科学视觉推理能力,包含440个精心挑选的样本,覆盖6个地球科学类别和24种任务类型,涵盖地球观测影像及地图、图表等结构化科学表示。评估维度包括推理(reasoning)、一致性(consistency)和质量(quality)三个方向。实验结果表明,当前多模态模型在地球科学推理方面存在严重瓶颈,最高性能模型在严格准确率上仅达42.7%,而最佳开源模型仅为10.3%,且输出视觉一致性和图像质量常高于其科学准确性,说明模型仅生成表面合理的结果,未能真正理解潜在的地球系统过程。

链接: https://arxiv.org/abs/2605.11541
作者: Yushuo Zheng,Zicheng Zhang,Huiyu Duan,Chunyi Li,Zijian Chen,Ziheng Jia,Yue Shi,Ke Gu,Xiongkuo Min,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); Beijing University of Technology (北京工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question-answering, existing benchmarks remain largely task-specific and fail to capture open-ended, real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present GeoR-Bench, a Benchmark for evaluating Geoscience visual Reasoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions: reasoning, consistency, and quality. Benchmark results on 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7% overall strict accuracy, while the best open-source model reaches only 10.3%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. Ultimately, these findings indicate that current models generate superficially plausible results but fail to capture underlying earth science processes.

[CV-123] XWOD: A Real-World Benchmark for Object Detection under Extreme Weather Conditions

【速读】:该论文试图解决自动驾驶及智能交通系统在极端天气条件下的脆弱性问题,现有用于评估天气鲁棒性的数据集在规模、多样性和真实感上存在显著不足。解决方案的关键是构建了一个名为 XWOD (Extreme Weather Object Detection) 的大规模真实世界交通目标检测基准数据集,包含 10,010 张图像和 42,924 个边界框,覆盖雨、雪、雾、霾/沙尘、洪水、龙卷风和野火七种极端天气条件,并首次将洪水、龙卷风和野火这类气候加剧的灾害纳入其中。通过在该数据集上训练标准 YOLO 系列检测器,并在外部天气基准(RTTS、DAWN、WEDGE)上进行零样本测试,取得了相对于现有 YOLO 基线分别提升 56%、83% 和 35% 的 mAP_50 成绩,证明 XWOD 可作为学习天气鲁棒交通感知的强大源域。

链接: https://arxiv.org/abs/2605.11521
作者: Chih-Hsin Chen,Yu-Tung Liu,Amar Fadillah,Kuan-Ting Lai,Dong Liu
机构: National Taipei University of Technology (国立台北科技大学); Adobe Inc.(Adobe公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous driving and intelligent transportation systems remain vulnerable under extreme weather. The U.S. Federal Highway Administration reports that roughly 745,000 crashes and 3,800 fatalities per year are weather-related, and recent regulatory investigations have examined failures of Level-2/3 driving systems under reduced-visibility conditions. However, datasets commonly used to evaluate weather robustness remain limited in scale, diversity, and realism. In this paper, we introduce XWOD (Extreme Weather Object Detection), a large-scale real-world traffic-object detection benchmark containing 10,010 images and 42,924 bounding boxes across seven extreme weather conditions: rain, snow, fog, haze/sand/dust, flooding, tornado, and wildfire. The dataset covers six traffic-object categories, including car, person, truck, motorcycle, bicycle, and bus. XWOD extends the weather taxonomy from one to seven conditions, and is the first to cover the emerging class of climate-amplified hazards, such as flooding, tornado, and wildfire. To evaluate the quality of our data, we train standard YOLO-family detectors on XWOD and test them zero-shot on external weather benchmarks, achieving mAP_50 scores of 63.00% on RTTS, 59.94% on DAWN, and 61.12% on WEDGE, compared with the corresponding published YOLO-based baselines of 40.37%, 32.75%, and 45.41%, respectively, representing relative improvements of 56%, 83%, and 35%. These cross-dataset results show that XWOD provides a strong source domain for learning weather-robust traffic perception. We release the dataset, splits, baseline weights, and reproducible evaluation code under a research-use license.

[CV-124] PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting CVPR

【速读】:该论文旨在解决无监督三维点云分割中,离散3D点与连续2D图像之间的错配问题(discrete-continuous mismatch),这种错配导致投影重叠和复杂的模态对齐,进而损害2D到3D迁移时的语义一致性。解决方案的关键在于引入3D高斯泼溅(3D Gaussian Splatting)作为统一的中间表示,通过将输入稀疏点云重建为密集的3D高斯空间来填充空间间隙并编码遮挡关系,从而消除投影导致的语义混淆;接着从该高斯空间渲染多视角密集图像,利用SAM提取2D语义掩码,并通过对比学习将语义蒸馏到3D高斯原语(Gaussian primitives)上,确保跨视角语义一致性;最后通过两步配准对齐高斯空间与原始点云,再基于最近邻搜索为原始点云分配语义标签。

链接: https://arxiv.org/abs/2605.11520
作者: Yixiao Song,Qingyong Li,Wen Wang,Zhicheng Yan
机构: 1Key Laboratory of Big Data Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education(交通大数据与人工智能重点实验室(北京交通大学),教育部); 2Frontiers Science Center for Smart High-speed Railway System, Beijing Jiaotong University(智慧高速铁路系统前沿科学中心,北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by Computer Vision and Pattern Recognition (CVPR) 2026

点击查看摘要

Abstract:Unsupervised point cloud segmentation is critical for embodied artificial intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. While integrating 2D pre-trained models such as the Segment Anything Model (SAM) to supplement semantic information is a natural choice, this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are rendered from the Gaussian space, with 2D semantic masks extracted via SAM, and semantics are distilled to 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across different views. The Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search on labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, achieving +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS.
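流水线最后一步(把带标签高斯中心的语义按最近邻赋给原始点云)可以用如下草图示意(数据与函数名均为假设示例;实际实现通常用 KD 树等结构加速,这里用暴力搜索):

```python
def assign_labels_nn(points, centers, labels):
    """最近邻标签传递: 每个原始点取距其最近的带标签高斯中心的语义。"""
    out = []
    for p in points:
        best_i, best_d = 0, float("inf")
        for i, c in enumerate(centers):
            d = sum((a - b) ** 2 for a, b in zip(p, c))  # 平方欧氏距离
            if d < best_d:
                best_i, best_d = i, d
        out.append(labels[best_i])
    return out

assert assign_labels_nn([[0.0, 0.0], [5.0, 5.0]],
                        [[0.1, 0.0], [5.0, 4.9]],
                        ["chair", "table"]) == ["chair", "table"]
```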

[CV-125] LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing

【速读】:该论文试图解决两个核心问题:一是超高清(ultra-high-definition, UHD)视频去雾领域缺乏公开的评估基准(benchmark),二是现有视频去雾方法在处理连续3–5帧的UHD序列时无法在消费级GPU上高效运行。解决方案的关键在于一个核心观察——大气去雾可归结为一种依赖于低频深度场的逐像素仿射变换(per-pixel affine transform),该变换能通过双边网格(bilateral grid)紧凑编码,且其预测成本与输出分辨率解耦。基于此,论文提出了LiBrA-Net,它将时空仿射场分解为一个空间-颜色双边子网格和一个时间双边子网格,在固定低分辨率下分别预测;随后在gl(3)李代数(Lie algebra)中通过群论正则化(group-theoretic regularization)融合子网格系数,并利用Cayley参数化(Cayley parameterization)将其映射为可逆的GL(3)变换;最后通过一个轻量级的输入引导分支(input-guided branch)恢复高频细节。此外,论文还发布了UHV-4K——首个配对的4K视频去雾基准,为每一帧提供了深度、透射率和光流标注。

链接: https://arxiv.org/abs/2605.11508
作者: Yongcong Wang,Chengchao Shen,Guangwei Gao,Wei Wang,Pengwen Dai,Dianjie Lu,Guijuan Zhang,Zhuoran Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Currently, there is a gap in the field of ultra-high-definition (UHD) video dehazing due to the lack of a benchmark for evaluation. Furthermore, existing video dehazing methods cannot run on consumer-grade GPUs when processing continuous UHD sequences of 3–5 frames at a time. In this paper, we address both issues with a new benchmark and an efficient method. Our key observation is that atmospheric dehazing reduces to a per-pixel affine transform governed by the low-frequency depth field, which can be compactly encoded in bilateral grids whose prediction cost is decoupled from the output resolution. Building on this, we propose LiBrA-Net, which factorizes the spatiotemporal affine field into a spatial–color and a temporal bilateral sub-grid predicted at a fixed low resolution, fuses their coefficients in the \mathfrak{gl}(3) Lie algebra under group-theoretic regularization, maps the result to invertible GL(3) transforms via a Cayley parameterization, and restores high-frequency detail through a lightweight input-guided branch. We further release UHV-4K, the first paired 4K video dehazing benchmark with depth, transmission, and optical-flow annotations on every frame. Across UHV-4K, REVIDE, and HazeWorld, LiBrA-Net sets a new state of the art among compared video dehazing methods while running native 4K at 25 FPS on a single GPU with only 6.12 M parameters. Code and data are available at this https URL.
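"去雾可归结为逐像素仿射变换"这一观察来自标准大气散射模型 I = J·t + A·(1−t),其中透射率 t = exp(−β·d) 由深度 d 决定,反解即得 J = I/t + A·(1−1/t),对观测像素 I 是仿射的。下面的标量草图(β 与透射率下限 t_min 为假设值,非论文实现)做一次合成-反解的往返验证:

```python
import math

def dehaze_pixel(I, A, depth, beta=1.0, t_min=0.05):
    """按大气散射模型把去雾写成对观测像素 I 的逐像素仿射变换。
    I: 有雾像素值; A: 大气光; depth: 该像素深度。"""
    t = max(math.exp(-beta * depth), t_min)   # 透射率, 截断避免除零
    scale, bias = 1.0 / t, A * (1.0 - 1.0 / t)
    return scale * I + bias                   # J = scale*I + bias

# 往返一致性: 先按模型合成雾图, 再仿射反解应恢复原像素
J, A, d = 0.6, 0.9, 0.5
t = math.exp(-d)
I = J * t + A * (1.0 - t)
assert abs(dehaze_pixel(I, A, d) - J) < 1e-9
```

仿射系数 (scale, bias) 只随低频深度场缓慢变化,这正是它们适合用低分辨率双边网格编码的原因。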

[CV-126] Principled Design of Diffusion-based Optimizers for Inverse Problems

【速读】:该论文旨在解决基于分数的扩散模型(score-based diffusion models)在逆问题(inverse problems)应用中面临的实际部署瓶颈:推理时间过长以及超参数(如噪声调度和后验采样权重)需要针对每个问题场景进行繁琐的手动调整。尽管预训练扩散模型可在不同任务间复用而无需重新训练,但推理阶段的超参数仍需特定设置。解决方案的关键在于:一是提出具有原则性的重参数化(reparameterization)方法,引入超参数的不变性(invariance),使得同一组超参数能够跨多个逆问题直接复用,省去重新调优的过程;二是基于RED-diff框架,进一步开发出OptDiff管道(pipeline),该管道将后验采样(posterior sampling)重新表述为优化问题,从而提供简化的调优框架,并便于集成凸优化工具来加速推理。通过在图像重建、去模糊和超分辨率等任务上的实验,该方法在显著提升推理速度的同时改善了图像质量。

链接: https://arxiv.org/abs/2605.11506
作者: Julio Oscanoa,Irmak Sivgin,Cagan Alkan,Daniel Ennis,John Pauly,Mert Pilanci,Shreyas Vasanawala
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 8 figures, 6 tables

点击查看摘要

Abstract:Score-based diffusion models achieve state-of-the-art performance for inverse problems, but their practical deployment is hindered by long inference times and cumbersome hyperparameter tuning. While pretrained diffusion models can be reused across tasks without retraining, inference-time hyperparameters such as the noise schedule and posterior sampling weights typically require ad-hoc adjustment for each problem setup. We propose principled reparameterizations that induce invariances, allowing the same hyperparameters to be reused across multiple problems without re-tuning. In addition, building on the RED-diff framework, which reformulates posterior sampling as an optimization problem, we further develop the OptDiff pipeline. OptDiff provides a simplified tuning framework that facilitates the integration of convex optimization tools to accelerate inference. Experiments on image reconstruction, deblurring, and super-resolution show substantial speedups and improved image quality.

[CV-127] PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition

【速读】:该论文试图解决零样本基于骨骼的动作识别(Zero-shot Skeleton-based Action Recognition, ZSSAR)中因骨骼数据作为人体姿态估计(Human Pose Estimation, HPE)的压缩输出而导致的上游语义损失(upstream semantic loss)问题,即传统方法直接将关节坐标序列与语言对齐时,丢失了人体-物体交互和姿态相关的视觉线索。解决方案的关键在于提出一个HPE感知的ZSSAR框架PoseBridge,它不从外部引入RGB分支或物体检测器,而是从与骨骼生成相同的HPE过程中提取姿态锚定语义线索(pose-anchored semantic cues),并通过骨架条件桥接(skeleton-conditioned bridging)和语义原型适应(semantic prototype adaptation)将这些线索转移到骨骼-文本对齐中,从而在不增加额外模态开销的前提下显著提升开放场景下的识别性能。

链接: https://arxiv.org/abs/2605.11497
作者: Sanghyeon Lee,Jinwoo Kim,Jong Taek Lee
机构: School of Computer Science and Engineering(计算机科学与工程学院); Kyungpook National University(庆北国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.

[CV-128] STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

【速读】:该论文旨在解决蒸馏后少步(T≤4)或单步(T=1)扩散模型在图像生成中样本多样性显著低于多步扩散模型的问题。由于少步/单步模型缺乏多步扩散中可用的调度、轨迹或迭代优化等多样性引入机制,现有基于特征扰动的增强方法往往效果有限,要么多样性增益不足,要么损害生成质量。解决方案的关键在于提出一种名为STRIDE(training-free, optimization-free, single forward pass)的方法,其核心思想是:有效的多样性注入必须尊重模型习得的特征几何(feature geometry)。具体地,STRIDE通过向中间Transformer特征注入空间相干的粉红噪声(spatially coherent pink noise),并将该噪声投影到模型自身激活的主成分(principal components)上,确保扰动位于已学习的特征流形(feature manifold)内。这种设计使得扰动沿表示空间中有意义的方向产生受控变异,从而在不破坏文本对齐(text alignment)的前提下显著提升批次内多样性。实验表明,STRIDE在多样性-保真度前沿(diversity-fidelity frontier)上帕累托优于现有免训练基线,其有效性不在于增加扰动强度,而在于使扰动与模型的内部表示结构(internal representation structure)对齐。

链接: https://arxiv.org/abs/2605.11494
作者: Ankit Yadav,Arpit Garg,Ta Duc Huy,Lingqiao Liu
机构: Australian Institute for Machine Learning, Adelaide University, Australia(澳大利亚阿德莱德大学澳大利亚机器学习研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 Pages 3 figures 4 tables

点击查看摘要

Abstract:Distilled one-step (T=1) or few-step (T \leq 4) diffusion models enable real-time image generation but often exhibit reduced sample diversity compared to their multi-step counterparts. In multi-step diffusion, diversity can be introduced through schedules, trajectories, or iterative optimization; however, these mechanisms are unavailable in the few-step or single-step setting, limiting the effectiveness of existing diversity-enhancing methods. A natural alternative is to perturb intermediate features, but naive feature perturbation is often ineffective, either yielding limited diversity gains or degrading generation quality. We argue that effective diversity injection in few-step models requires perturbations that respect the model’s learned feature geometry. Based on this insight, we propose STRIDE, a training-free and optimization-free method that operates in a single forward pass. STRIDE injects spatially coherent (pink) noise into intermediate transformer features, projected onto the principal components of the model’s own activations, ensuring that perturbations lie on the learned feature manifold. This design enables controlled variation along meaningful directions in the representation space. Extensive experiments on FLUX.1-schnell and SD3.5 Turbo across COCO, DrawBench, PartiPrompts, and GenEval show that STRIDE consistently improves diversity while maintaining strong text alignment. In particular, STRIDE reduces intra-batch similarity with minimal impact on CLIP score, and Pareto-dominates existing training-free baselines on the diversity-fidelity frontier. These results highlight that, in the absence of iterative refinement, improving diversity in few-step and one-step diffusion depends not on increasing perturbation strength, but on aligning perturbations with the model’s internal representation structure.
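"把扰动投影到模型激活的主成分子空间"可以用如下草图示意(假设主成分方向已由激活统计预先算出并正交归一;噪声的生成与注入位置均略去,非论文实现):

```python
def project_onto_components(noise, components):
    """把扰动向量投影到正交归一主成分方向张成的子空间,
    使注入的噪声落在模型已学得的特征流形方向内。"""
    proj = [0.0] * len(noise)
    for comp in components:                       # 每个 comp 是单位向量
        coeff = sum(n * c for n, c in zip(noise, comp))
        for i, c in enumerate(comp):
            proj[i] += coeff * c
    return proj

# 仅保留第一维方向时, 其余维度的噪声分量被滤除
assert project_onto_components([0.5, -1.2, 0.3],
                               [[1.0, 0.0, 0.0]]) == [0.5, 0.0, 0.0]
```

这解释了摘要的结论:起作用的不是扰动强度,而是扰动方向与内部表示结构的对齐。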

[CV-129] A Mimetic Detector for Adversarial Image Perturbations

【速读】:该论文试图解决深度图像分类器在面对对抗性攻击(adversarial attacks)时的检测难题,特别是针对标准ℓ∞有界攻击(如FGSM、PGD和Carlini-Wagner的ℓ∞变体)所生成的微小高频噪声模式。这些攻击产生的噪声在像素级别呈现近随机符号模式,虽在ℓ2范数下几乎不可见,但携带了不成比例的梯度能量。解决方案的关键在于利用开源MOLE库中的高阶Corbino-Castillo拟态算子(mimetic operators)构建一个单次(single-shot)、无需训练(training-free)的检测器。该检测器无需重新训练、无需替代分类器,也无需访问被攻击的网络,仅依赖输入图像本身的属性,在O(HW)时间内完成判决。实验表明,该检测器在标准peppers测试图像上,对于ℓ∞预算ε=16/255,干净图像与对抗样本的分离度从阶数k=2时的3.55倍单调提升至k=6时的4.19倍,验证了对高频噪声模式的有效捕捉。

链接: https://arxiv.org/abs/2605.11492
作者: Johnny Corbino
机构: Lawrence Berkeley National Laboratory (劳伦斯伯克利国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial attacks fool deep image classifiers by adding tiny, almost invisible noise patterns to a clean image. The standard \ell^\infty -bounded attacks (FGSM, PGD, and the \ell^\infty variant of Carlini–Wagner) produce high-frequency, near-random sign patterns at the pixel level: nearly invisible in \ell^2 , but carrying disproportionate gradient energy. We exploit this with a single-shot, training-free detector using the high-order Corbino–Castillo mimetic operators from the open-source MOLE library. No retraining, no surrogate classifier, no access to the network under attack: the verdict is a property of the input alone, computed in O(HW) time. We validate the detector on the standard peppers test image at the canonical \ell^\infty budget \varepsilon = 16/255 and observe a clean-vs-adversarial separation that grows monotonically from 3.55\times at order k=2 to 4.19\times at k=6.
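检测思路可以用最朴素的一阶有限差分草图近似示意(注意:论文使用的是 MOLE 库中的高阶 Corbino-Castillo 拟态算子而非这里的简单差分,阈值 ratio 也是假设值):符号交替的对抗噪声会使梯度能量远超平滑的干净图像。

```python
def gradient_energy(img):
    """一阶有限差分的梯度能量: 相邻像素差的平方和。"""
    h, w = len(img), len(img[0])
    e = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                e += (img[i + 1][j] - img[i][j]) ** 2
            if j + 1 < w:
                e += (img[i][j + 1] - img[i][j]) ** 2
    return e

def is_adversarial(img, clean_reference_energy, ratio=2.0):
    """梯度能量显著高于干净参考能量时判为对抗样本(阈值为假设值)。"""
    return gradient_energy(img) > ratio * clean_reference_energy

clean = [[(i + j) / 8 for j in range(4)] for i in range(4)]          # 平滑渐变
noisy = [[v + (0.3 if (i + j) % 2 else -0.3) for j, v in enumerate(row)]
         for i, row in enumerate(clean)]                             # 符号交替高频噪声
assert is_adversarial(noisy, gradient_energy(clean))
assert not is_adversarial(clean, gradient_energy(clean))
```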

[CV-130] 3DGS3: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering

【速读】:该论文针对3D高斯泼溅(3D Gaussian Splatting, 3DGS)在扩展至超密集场景和高分辨率时因计算瓶颈无法应用于延迟敏感应用的问题,提出了一种名为3DGS³的统一后渲染框架。其解决方案的关键在于,不优化泼溅管线本身,而是通过可微处理低分辨率输出来联合执行超采样(super sampling)和帧插值(frame interpolation):一方面,梯度感知超采样(Gradient-Aware Super Sampling, GASS)模块利用3DGS的连续可微性提取图像梯度,引导基于GRU的细化网络实现高保真超采样;另一方面,轻量级时序帧插值(Lightweight Temporal Frame Interpolation, LTFI)模块基于紧凑的U-Net-like主干,融合连续帧的时序与可微空间线索来合成时间连贯的中间帧。该框架在保持与现有3DGS加速技术兼容的同时,显著提升了渲染效率与视觉质量。

链接: https://arxiv.org/abs/2605.11489
作者: Yibo Zhao,Fan Gao,Youcheng Cai,Ligang Liu
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) enables high-quality real-time 3D rendering but faces challenges in efficiently scaling to ultra-dense scenes and high resolution due to computational bottlenecks that limit its use in latency-sensitive applications. Instead of optimizing the splatting pipeline itself, we propose 3DGS^3, a unified post-rendering framework that jointly performs super sampling and frame interpolation through differentiable processing of low-resolution outputs to achieve both high-resolution and high-frame-rate rendering. Our Gradient-Aware Super Sampling (GASS) module leverages the continuous differentiability of 3DGS to extract image gradients that guide a GRU-based refinement network to enable high-fidelity super sampling. Furthermore, a Lightweight Temporal Frame Interpolation (LTFI) module based on a compact U-Net-like backbone fuses temporal and differentiable spatial cues from consecutive frames to synthesize temporally coherent intermediate frames. Experiments on public datasets demonstrate that 3DGS^3 achieves superior rendering efficiency and visual quality when compared with state-of-the-art methods and remains compatible with existing 3DGS acceleration techniques. The code will be publicly released upon acceptance.

[CV-131] LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLM s

【速读】:该论文旨在解决多模态大语言模型(MLLMs)在处理长视频时,如何在有限视觉标记预算下有效选择信息帧的问题。现有方法如均匀采样、逐点相关性评分、分块选择或智能体探索,要么忽略了全局依赖关系,要么引入了显著的计算开销。解决方案的关键在于提出LDDR(Linear DPP-Based Dynamic Resolution)框架,这是一个无需训练、即插即用且预算感知的视频帧采样方法。其主要创新包括:在任务条件化特征空间中执行查询感知的确定性点过程(Determinantal Point Process, DPP)帧选择,实现比标准DPP基线快3倍的运行时加速;同时引入组DPP重要性指标(Group DPP importance),用于指导帧保留策略和动态分辨率分配,将更多视觉标记分配给信息丰富且非冗余的帧,而对低效帧进行降采样或剪枝。该方法在多个视频基准上显著提升了性能,尤其在预算受限场景下提升达2.5个百分点,并兼容多种开源和闭源MLLM主干模型。

链接: https://arxiv.org/abs/2605.11477
作者: Jingfeng Chen,Jiawen Qian,Wendi Deng,Yinuo Guo,Jiaqi Yu,Sicong Leng,Raghuveer Thirukovalluru,Bhuwan Dhingra
机构: Carnegie Mellon University(卡内基梅隆大学); Individual Researcher(独立研究员); National University Singapore(新加坡国立大学); Nanyang Technological University(南洋理工大学); Duke University(杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 4 figures

点击查看摘要

Abstract:Video understanding in multimodal large language models requires selecting informative frames from long, redundant videos under limited visual-token budgets. Existing methods often rely on uniform sampling, point-wise relevance scoring, chunk-wise selection, or agentic exploration, which either miss global dependencies or introduce substantial overhead. We propose LDDR (Linear DPP-Based Dynamic Resolution), a training-free, plug-and-play, and budget-aware video frame sampling framework. LDDR performs query-aware Determinantal Point Process (DPP) frame selection in a task-conditioned feature space, achieving a 3x runtime speedup over standard DPP baselines. It further introduces a Group DPP importance metric to guide frame retention and dynamic resolution allocation, assigning more tokens to informative, non-redundant frames while downscaling or pruning less useful ones. Across four video benchmarks spanning short-, medium-, and long-range videos, LDDR consistently outperforms the next-best baselines, achieving gains of 2.5 points under budget-constrained settings and 1.6 points in high-budget scenarios. These improvements are consistently observed across multiple MLLM backbones, including both open- and closed-source models. Qualitative analysis confirms that relevant frames are selected and allocated a higher budget, facilitating improved video understanding.
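DPP 帧选择通常以贪心 MAP 近似实现:每步加入使所选子集核矩阵行列式最大的帧,行列式同时奖励单帧质量与互不相似。下面是标准贪心版本的草图(核矩阵 L 为假设示例;论文的 Linear-DPP 在此基础上做了进一步加速,此处不体现):

```python
import math

def _logdet(M):
    """高斯消元求对数行列式(假设子矩阵正定, 主元恒为正)。"""
    n = len(M)
    A = [row[:] for row in M]
    ld = 0.0
    for k in range(n):
        piv = A[k][k]
        ld += math.log(piv)
        for i in range(k + 1, n):
            f = A[i][k] / piv
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
    return ld

def greedy_dpp_select(L, budget):
    """贪心最大化 log det(L_S): 逐帧加入使行列式最大的候选(DPP MAP 的标准近似)。"""
    selected = []
    for _ in range(budget):
        best, best_gain = None, float("-inf")
        for i in range(len(L)):
            if i in selected:
                continue
            S = selected + [i]
            gain = _logdet([[L[a][b] for b in S] for a in S])
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

# 帧 0/1 几乎重复, 帧 2 与它们不同: 预算为 2 时应选出多样的一对
L = [[1.0, 0.95, 0.1],
     [0.95, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
assert greedy_dpp_select(L, 2) == [0, 2]
```

组 DPP 重要性可在此基础上由各帧对所选子集行列式的边际贡献导出,用于后续的动态分辨率分配。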

[CV-132] Deep Probabilistic Unfolding for Quantized Compressive Sensing

【速读】:该论文试图解决经典量化压缩感知(quantized compressive sensing)重建精度与效率不足的问题,关键在于提出一种深度概率展开模型(deep probabilistic unfolding model)。该模型通过推导闭合形式且数值稳定的似然梯度投影(likelihood gradient projection)替代传统L2投影,将硬量化约束转化为软概率引导,从而真实模拟量化物理过程;同时设计高效的双域Mamba模块(dual-domain Mamba module),动态捕获并融合多尺度局部与全局特征,确保远距离相关区域间的有效交互。这些创新使模型在重建时兼顾理论物理约束与特征表达能力,显著提升了量化压缩感知的实际应用效果。

链接: https://arxiv.org/abs/2605.11475
作者: Gang Qu,Ping Wang,Siming Zheng,Xin Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a deep probabilistic unfolding model to address the classical quantized compressive sensing problem that leverages an unfolding framework to enhance the reconstruction accuracy and efficiency. Unlike previous unfolding methods that apply L2 projection to measurements, we derive a closed-form, numerically stable likelihood gradient projection, which allows the model to respect the true quantization physics, turning the hard quantization constraint into a soft probabilistic guidance. Furthermore, an efficient, dual-domain Mamba module is specifically designed to dynamically capture and fuse the multi-scale local and global features, ensuring the interactions between the distant but correlated regions. Extensive experiments demonstrate the state-of-the-art performance of the proposed method over previous works, which is capable of promoting the application of quantized compressive sensing in real life.
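摘要所说的"闭式且数值稳定的似然梯度"与区间删失高斯的经典结果形式一致:若测量被量化进区间 [lo, hi],高斯噪声下对数似然对未量化测量 z 的梯度有闭式表达 (φ(a) − φ(b)) / (σ·(Φ(b) − Φ(a))),其中 a = (lo−z)/σ、b = (hi−z)/σ。下面是一个标量草图(σ 等参数为假设值,并非论文的逐字公式):

```python
import math

def _pdf(x):   # 标准正态 pdf
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):   # 标准正态 cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantized_loglik_grad(z, lo, hi, sigma=1.0):
    """测量被量化到 [lo, hi] 时, 对数似然对未量化测量 z 的闭式梯度。"""
    a, b = (lo - z) / sigma, (hi - z) / sigma
    p = _cdf(b) - _cdf(a)                    # 落入量化区间的概率
    return (_pdf(a) - _pdf(b)) / (sigma * p)

# z 在区间中心时梯度为 0; z 偏出区间时梯度把 z 推回区间
assert abs(quantized_loglik_grad(0.0, -1.0, 1.0)) < 1e-12
assert quantized_loglik_grad(-3.0, -1.0, 1.0) > 0.0
assert quantized_loglik_grad(3.0, -1.0, 1.0) < 0.0
```

这正是把硬量化约束化为软概率引导的含义:梯度随 z 偏离区间的程度连续变化,而不是像 L2 投影那样在区间边界硬截断。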

[CV-133] Encore: Conditioning Trajectory Forecasting via Biased Ego Rehearsals

【速读】:该论文试图解决在轨迹预测任务中难以显式学习和表示智能体主观性(subjectivities)的问题,尤其是如何通过特定自我的主观性(ego’s subjectivity)来调节模型预测。解决方案的关键在于引入一个从“排练”(rehearsal)到“加演”(encore)的连续过程框架:首先,通过自我预测器(ego predictor)从短期观测中为场景中所有参与者学习并生成一组显式偏置的排练轨迹(rehearsal trajectories);然后,将这些排练轨迹作为即时控制(immediate controls)来条件化最终预测,从而为预测网络提供直接且独特的自我偏置,以模拟不同智能体的各向异性主观性。

链接: https://arxiv.org/abs/2605.11463
作者: Conghao Wong,Ziqian Zou,Xinge You
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning and representing the subjectivities of agents has become a challenging but crucial problem in the trajectory prediction task. Such subjectivities not only present specific spatial or temporal structures, but also are anisotropic for all interaction participants. Despite great efforts, it remains difficult to explicitly learn and forecast these subjectivities, let alone further modulate models’ predictions through a specific ego’s subjectivity. Inspired by prefactual thoughts in psychology and relevant theatrical concepts, we interpret such subjectivities in future trajectories as the continuous process from rehearsal to encore. In the rehearsal phase, the proposed ego predictor focuses on how each ego agent learns to derive and direct a set of explicitly biased rehearsal trajectories for all participants in the scene from the short-term observations. Then, these rehearsal trajectories serve as immediate controls to condition final predictions, providing direct yet distinct ego biases for the prediction network to simulate agents’ various subjectivities. Experiments across datasets not only demonstrate a consistent improvement in the performance of the proposed \emphEncore trajectory prediction model but also provide clear interpretability regarding subjectivities as biased ego rehearsals.

[CV-134] SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

【速读】:该论文试图解决大型视觉语言模型(VLMs)在空间推理任务(如深度排序和精确坐标定位)中表现不佳的问题,现有方法依赖场景中心数据集(如多视图扫描或室内视频)提供空间监督,但受限于底层场景数量稀少,导致数据规模和多样性远低于网络规模的2D图像集合。解决方案之关键在于提出SpatialForge,一种可扩展的数据合成流水线(data synthesis pipeline),它能够将任意自然场景2D图像转化为空间推理监督信号,通过将空间推理分解为感知(perception)和关系(relation)两个子任务,构建覆盖深度、布局和视角依赖推理的结构化监督信号,并引入自动验证(automatic verification)机制确保数据质量;基于该流水线构建的SpatialForge-10M数据集包含1000万个空间问答对,通过大规模训练显著提升了标准VLM的空间推理能力,验证了利用海量2D数据驱动3D空间推理的有效性。

链接: https://arxiv.org/abs/2605.11462
作者: Zishan Liu,Ruoxi Zang,Yanglin Zhang,Wei Liu,Yin Zhang,Jian Yao,Jiayin Zheng,Zhengzhe Liu
机构: Lingnan University (岭南大学); XPENG Robotics (小鹏机器人)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Large Vision-Language Models (VLMs) have demonstrated exceptional semantic understanding, yet these models consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain significantly smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, with automatic verification to ensure data quality. Based on this pipeline, we build SpatialForge-10M, a large-scale dataset containing 10 million spatial QA pairs. Extensive experiments across multiple spatial reasoning benchmarks demonstrate that training on SpatialForge-10M significantly improves the spatial reasoning ability of standard VLMs, highlighting the effectiveness of scaling 2D data for 3D-aware spatial reasoning.
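
SpatialForge式"从2D图像合成空间监督"中感知与关系的拆分,可以用"单目深度图 → 深度排序问答"这一最简环节示意(假设深度估计结果已就绪;物体名称、框坐标均为虚构示例,并非论文数据管线本身):

```python
import numpy as np

def depth_order_qa(depth, box_a, box_b, name_a, name_b):
    """由单张图的(估计)深度图与两个物体框合成深度排序QA对。
    示意实现:约定深度值越大离相机越远;框为 (x0, y0, x1, y1)。"""
    def median_depth(box):
        x0, y0, x1, y1 = box
        return float(np.median(depth[y0:y1, x0:x1]))
    da, db = median_depth(box_a), median_depth(box_b)
    closer = name_a if da < db else name_b
    q = f"Which is closer to the camera, the {name_a} or the {name_b}?"
    return {"question": q, "answer": closer,
            "evidence": {name_a: da, name_b: db}}

depth = np.tile(np.linspace(1.0, 10.0, 64), (64, 1))  # 深度从左到右递增的玩具场景
qa = depth_order_qa(depth, (0, 0, 16, 64), (48, 0, 64, 64), "cup", "lamp")
print(qa["answer"])  # cup:它占据浅深度(左侧)区域
```

论文的自动验证环节可理解为对这类 `evidence` 字段做一致性检查后再保留样本。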

[CV-135] Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

【速读】:该论文试图解决当前主流视觉-语言-动作模型(Vision-Language-Action models, VLA)因采用单帧观测范式而结构性地缺乏对时间动态感知能力的问题,导致在非平稳场景(如动态环境或动作执行过程中环境变化)中性能严重退化,而现有方法要么需要昂贵的重新训练,要么受限于延迟瓶颈和动作块间时间一致性差。解决方案之关键在于提出一种名为“节奏与路径校正”(Pace-and-Path Correction)的免训练、闭式推理时操作符,该操作符可包裹任何基于分块动作的VLA模型。其核心是从单一二次代价函数出发进行联合最小化,得到一个正交分解为两个独立通道的统一解:节奏通道压缩沿计划方向的执行速度,路径通道则施加正交空间偏移,两者共同在动作块窗口内吸收感知到的动态变化,从而在不重新训练模型的前提下实现实时的时序动态补偿。

链接: https://arxiv.org/abs/2605.11459
作者: Yanyan Zhang,Chaoda Song,Vikash Singh,Xinpeng Li,Kai Ye,Zhe Hu,Zhongzhu Pu,Yu Yin,Vipin Chaudhary
机构: Case Western Reserve University (凯斯西储大学); The Hong Kong Polytechnic University (香港理工大学); Tsinghua University (清华大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
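
摘要中"节奏通道沿计划方向压缩、路径通道施加正交偏移"的正交分解,可以用如下玩具几何示意(仅演示分解的数学形态,并非论文的闭式算子;这里假设 `delta` 是感知到的期望端点修正量):

```python
import numpy as np

def pace_and_path_correction(chunk, delta):
    """把端点修正量 delta 按计划方向正交分解(示意,非论文实现):
    沿计划方向的分量通过缩放执行进度吸收(pace),
    正交分量以随相位递增的横向偏移吸收(path)。"""
    chunk = np.asarray(chunk, float)              # (T, d) 计划动作块航点
    d = chunk[-1] - chunk[0]                      # 块内净前进方向
    u = d / (np.linalg.norm(d) + 1e-9)
    delta_pace = (delta @ u) * u                  # 沿计划方向的分量
    delta_path = delta - delta_pace               # 正交分量
    ramp = np.linspace(0.0, 1.0, len(chunk))[:, None]   # 块内相位
    scale = 1.0 + (delta @ u) / (np.linalg.norm(d) + 1e-9)
    paced = chunk[0] + (chunk - chunk[0]) * scale       # 压缩/拉伸节奏
    return paced + ramp * delta_path                    # 叠加横向路径偏移

chunk = np.stack([np.linspace(0, 1, 5), np.zeros(5)], axis=1)  # 沿x直行的块
out = pace_and_path_correction(chunk, np.array([0.5, 0.2]))
print(out[-1])  # 端点被搬到 (1.5, 0.2)
```

两个通道互不干扰:起点保持不动,端点恰好吸收整个 delta,块内各航点平滑过渡。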

[CV-136] Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

【速读】:该论文试图解决现有全能图像恢复方法将退化视为离散类别,从而难以建模复合退化中连续关系结构的问题。解决方案的关键在于提出一种多模态大语言模型(Multimodal Large Language Model, MLLM)引导的图像恢复框架,利用多模态嵌入作为低层恢复的指导:具体通过MLLM引导的融合块(MLLM-guided Fusion Block, MGFB)将MLLM特征注入编码器-解码器架构以增强退化感知表示;引入混合频率专家(Mixture-of-Frequency-Experts, MoFE)模块,根据MLLM引导的上下文线索自适应组合频率专家;并设计带有关系对齐损失(relational alignment loss)的MLLM引导路由器,使路由模式与退化输入的嵌入空间关系保持一致,从而有效建模复合退化中的连续关系结构。

链接: https://arxiv.org/abs/2605.11444
作者: Eunho Lee,Youngbae Hwang,Rei Kawakami
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:All-in-one image restoration seeks to recover clean images from inputs affected by diverse and unknown degradations using a unified framework. Recent methods have shown strong performance by identifying degradation characteristics to guide the restoration process. However, many of them treat degradations as discrete categories, which limits their ability to model the continuous relational structure that arises in composite degradations. To address this issue, we propose a multimodal large language model (MLLM)-guided image restoration framework that exploits multimodal embeddings as guidance for low-level restoration. Specifically, MLLM-derived features are injected into an encoder-decoder architecture through an MLLM-guided fusion block (MGFB) to enhance degradation-aware representations. In addition, we incorporate a mixture-of-frequency-experts (MoFE) module that adaptively combines frequency experts using MLLM-guided contextual cues. To further improve expert routing, we design an MLLM-guided router with a relational alignment loss that encourages routing patterns consistent with the embedding-space relationships of degraded inputs. Extensive experiments on multiple benchmarks show that the proposed method achieves strong performance across diverse restoration settings and establishes a new state of the art on the challenging CDD11 dataset, outperforming previous methods by up to 1.35 dB.
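
"混合频率专家 + 上下文引导路由"的基本形态可示意为:用FFT把信号拆成低/高频两个"专家"分支,再由(MLLM引导的)上下文向量经线性路由器产生softmax权重加权融合(玩具实现,`W_router`、`cutoff` 均为假设参数,非论文的MoFE模块):

```python
import numpy as np

def mofe_layer(x, context, W_router, cutoff=0.25):
    """玩具版频率专家混合(示意,非论文模块)。
    x: (n,) 一维信号;context: 上下文嵌入;
    W_router: (2, len(context)) 路由器线性层。"""
    X = np.fft.rfft(x)
    mask = np.arange(len(X)) < max(1, int(cutoff * len(X)))  # 低频带
    low = np.fft.irfft(np.where(mask, X, 0), n=len(x))       # 低频专家输出
    high = np.fft.irfft(np.where(mask, 0, X), n=len(x))      # 高频专家输出
    logits = W_router @ context
    w = np.exp(logits - logits.max()); w /= w.sum()          # 路由权重
    return w[0] * low + w[1] * high, w

t = np.linspace(0, 1, 64, endpoint=False)
x = np.sin(2 * np.pi * t) + 0.3 * np.sin(2 * np.pi * 20 * t)  # 低频+高频混合
ctx = np.array([1.0, 0.0])                                    # 假设的退化上下文
out, w = mofe_layer(x, ctx, W_router=np.array([[8.0, 0.0], [0.0, 8.0]]))
print(np.round(w, 3))
```

论文中的关系对齐损失,相当于进一步约束不同退化输入产生的路由权重 `w` 在嵌入空间中保持一致的相对关系。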

[CV-137] Instruct-ICL: Instruction-Guided In-Context Learning for Post-Disaster Damage Assessment

【速读】:该论文试图解决预训练多模态大语言模型(MLLMs)在灾后视觉问答(VQA)任务中因对提示措辞敏感而导致可靠性不足的问题,从而限制其在时间紧迫的灾害评估场景中的实际应用。解决方案的关键在于采用结构化推理策略,即使用一个MLLM生成任务特定的指令,作为另一个MLLM的链式思维(Chain-of-Thought, CoT)指导,并结合不同程度的上下文学习(In-Context Learning, ICL)融入答案生成过程,从而利用显式的推理引导和上下文示例来提升回答的准确性和一致性。

链接: https://arxiv.org/abs/2605.11439
作者: Armin Zarbaft,Ehsan Karimi,Nhut Le,Maryam Rahnemoonfar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

点击查看摘要

Abstract:Rapid and accurate situational awareness is essential for effective response during natural disasters, where delays in analysis can significantly hinder decision-making. Training task-specific models for post-disaster assessment is often time-consuming and computationally expensive, making such approaches impractical in time-critical scenarios. Consequently, pretrained multimodal large language models (MLLMs) have emerged as a promising alternative for post-disaster visual question answering (VQA), a task that aims to answer structured questions about visual scenes by jointly reasoning over images and text. While these models demonstrate strong multimodal reasoning capabilities, their responses can be sensitive to prompt formulation, which can limit their reliability in real-world disaster assessment scenarios. In this paper, we investigate whether structured reasoning strategies can improve the reliability of pretrained MLLMs for post-disaster VQA. Specifically, we explore multiple prompting paradigms in which one MLLM is used to generate task-specific instructions that serve as Chain-of-Thought (CoT) guidance for a second MLLM. These instructions are incorporated during answer generation with varying degrees of in-context learning (ICL), enabling the model to leverage both explicit reasoning guidance and contextual examples. We conduct our evaluation on the FloodNet dataset and compare these approaches against a zero-shot baseline. Our results demonstrate that integrating instruction-driven CoT reasoning consistently improves answer accuracy.
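
论文中"一个MLLM生成指令、另一个MLLM结合ICL作答"的两阶段提示流程,可以用纯字符串组装来示意(不调用任何真实模型API;提示词措辞与示例均为假设):

```python
def build_instruction_prompt(question):
    """阶段1:让“指导者”MLLM为灾后VQA问题写出分步推理指令(CoT引导)。"""
    return (
        "You are assisting post-disaster image analysis.\n"
        f"Write step-by-step instructions for answering: '{question}'. "
        "Focus on visible flood/damage cues; do not answer yet."
    )

def build_answer_prompt(question, instructions, icl_examples=()):
    """阶段2:“作答”MLLM收到图像(此处省略)、生成的CoT指令和可选的ICL示例。"""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in icl_examples)
    return (
        (shots + "\n" if shots else "")
        + f"Follow these steps:\n{instructions}\n"
        + f"Q: {question}\nA:"
    )

p = build_answer_prompt(
    "Is the road flooded?",
    "1. Locate roads. 2. Check for standing water. 3. Decide yes/no.",
    icl_examples=[("Is the building damaged?", "Yes")],
)
print(p.splitlines()[0])  # → Q: Is the building damaged?
```

调节 `icl_examples` 的数量即对应摘要所说"不同程度的上下文学习";零样本基线则是两个参数都留空、直接提问。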

[CV-138] Beyond Masks: The Case for Medical Image Parsing

【速读】:该论文试图解决当前医学影像研究中以逐体素掩码(per-voxel masks)为主要输出与临床报告需求之间的严重脱节问题:掩码仅能提供大小、体积和位置等有限信息,而放射科医生撰写的报告几乎不包含这些内容,报告需要的是对影像的结构化解释。解决方案的关键在于将医学影像解析(medical image parsing)确立为核心输出范式,即生成一种结构化表示,其中实体(entities,如结构或发现)、属性(attributes,如边界规则性、强化模式、严重程度)和关系(relationships,如相对位置、毗邻关系、与既往扫描的变化)被同时输出且相互一致;同时要求该解析满足决策(命名正确事物)、重建(内容足够丰富以再生影像)和预测(内容足以预测患者状态演变)三个层次性质。论文强调,前行的道路不是设计新架构,而是承诺采用这种更丰富的输出,并设计相应的训练信号来奖励它。

链接: https://arxiv.org/abs/2605.11438
作者: Siddharth Gupta,Alan L. Yuille,Zongwei Zhou
机构: Johns Hopkins University(约翰霍普金斯大学); Northwestern University(西北大学); Johns Hopkins Medicine(约翰霍普金斯医学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical imaging research has spent a decade getting very good at one thing: producing per-voxel masks. Masks tell us size, volume, and location, and a decade of clinical infrastructure rests on those outputs. Yet the report a radiologist writes contains almost nothing a mask can express. We argue that medical imaging research should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. Entities are the named structures and findings, present or absent. Attributes describe those entities, capturing things like margin regularity, enhancement pattern, or severity grade. Relationships connect them, naming where one structure sits relative to another, what abuts what, and what has changed since the prior scan. A good parse satisfies three properties, in order: (1) decision (the parse names the right things in the current image), (2) reconstruction (its content is rich enough to regenerate that image), and (3) prediction (its content is rich enough to forecast how the patient state will evolve). Quantitative measurements are derived from this content; they are not predicted alongside it. To test how close the field is to producing such an output, we audit eleven representative systems against the three parsing primitives plus closure. None emits a well-formed parse. Entities are largely solved. Attributes, relationships, and closure remain near-empty. The path forward is not a new architecture. It is a commitment to a richer output, and to training signals that reward it. Segmentation taught models to measure. Parsing asks them to explain.
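
文中"实体-属性-关系"三元输出及其一致性(闭合)要求,可用一个最小数据结构示意(字段名为本文所设的玩具示例,并非任何现有医学影像库的API):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                 # 例如 "liver"、"lesion_1"
    present: bool = True
    attributes: dict = field(default_factory=dict)  # 边界、强化模式、分级等

@dataclass
class Relation:
    kind: str                 # 例如 "abuts"、"left_of"、"enlarged_since_prior"
    source: str
    target: str

@dataclass
class ImageParse:
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def well_formed(self):
        """闭合检查:每条关系引用的实体必须在实体表中出现。"""
        names = {e.name for e in self.entities}
        return all(r.source in names and r.target in names for r in self.relations)

parse = ImageParse(
    entities=[Entity("liver"),
              Entity("lesion_1", attributes={"margin": "irregular"})],
    relations=[Relation("inside", "lesion_1", "liver")],
)
print(parse.well_formed())  # True:关系只引用已命名的实体
```

按论文的主张,体积等定量测量应从这类结构化内容中派生,而不是与它并行预测。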

[CV-139] ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models CVPR2026

【速读】:该论文试图解决无真实正常光照参考条件下的光照退化图像恢复(illumination degradation image restoration)问题,即在不依赖成对高质量参考图像的情况下,修复因光照不均、曝光不足或过度等退化导致的低质量图像。解决方案的关键在于提出一个零参考(zero-reference)的扩散框架ZeroIDIR,将恢复过程解耦为自适应光照校正和基于扩散的重建两个阶段,并仅在低质量退化图像上训练。具体而言,关键模块包括:一个自适应伽马校正(adaptive gamma correction)模块,执行空间变化的曝光校正,生成仅经光照校正的表示以缓解曝光偏差,并为后续扩散过程提供可靠输入;同时引入直方图引导的光照校正损失(histogram-guided illumination correction loss),将校正后的光照分布正则化到自然场景分布。随后,将光照校正后的图像视为中间噪声状态,输入到提出的扰动一致性扩散模型(perturbed consistency diffusion model)中重建细节并抑制噪声;此外,还提出一种扰动扩散一致性损失(perturbed diffusion consistency loss),约束最终恢复图像的前向扩散轨迹与扰动状态保持一致,从而在没有监督的情况下提升恢复的保真度和稳定性。

链接: https://arxiv.org/abs/2605.11435
作者: Hai Jiang,Zhen Liu,Yinjie Lei,Songchen Han,Bing Zeng,Shuaicheng Liu
机构: School of Aeronautics and Astronautics, Sichuan University (四川大学航空航天学院); University of Electronic Science and Technology of China (电子科技大学); College of Electronics and Information Engineering, Sichuan University (四川大学电子信息与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:In this paper, we propose a zero-reference diffusion-based framework, named ZeroIDIR, for illumination degradation image restoration, which decouples the restoration process into adaptive illumination correction and diffusion-based reconstruction while being trained solely on low-quality degraded images. Specifically, we design an adaptive gamma correction module that performs spatially varying exposure correction to generate illumination-corrected only representations to mitigate exposure bias and serve as reliable inputs for subsequent diffusion processes, where a histogram-guided illumination correction loss is introduced to regularize the corrected illumination distribution toward that of natural scenes. Subsequently, the illumination-corrected image is treated as an intermediate noisy state for the proposed perturbed consistency diffusion model to reconstruct details and suppress noise. Moreover, a perturbed diffusion consistency loss is proposed to constrain the forward diffusion trajectory of the final restored image to remain consistent with the perturbed state, thus improving restoration fidelity and stability in the absence of supervision. Extensive experiments on publicly available benchmarks show that the proposed method outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Code is available at this https URL.
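
自适应伽马校正模块的直觉,可以用一个非学习的空间可变伽马示意:先用盒滤波估计局部亮度,再逐像素选取 γ 使局部亮度被推向目标值(`target`、平滑窗口均为假设超参数;论文中的模块是端到端学习的,此处仅为玩具替身):

```python
import numpy as np

def adaptive_gamma(img, target=0.5, smooth=15):
    """玩具空间可变伽马校正(示意,非论文模块)。
    对每个像素求局部均值亮度 m,并取 gamma 使 m**gamma == target,
    从而对欠曝/过曝区域施加不同强度的校正。"""
    img = np.clip(np.asarray(img, float), 1e-4, 1.0)
    k = np.ones(smooth) / smooth                      # 可分离盒滤波核
    local = np.apply_along_axis(lambda r: np.convolve(r, k, "same"), 1, img)
    local = np.apply_along_axis(lambda c: np.convolve(c, k, "same"), 0, local)
    local = np.clip(local, 1e-4, 1.0)
    gamma = np.log(target) / np.log(local)            # 使 local**gamma == target
    return img ** np.clip(gamma, 0.2, 5.0)

dark = np.full((32, 32), 0.1)                         # 均匀欠曝的玩具图
out = adaptive_gamma(dark)
print(float(out.mean()))
```

论文的直方图引导损失可理解为在此之上进一步约束:校正后亮度的整体分布应接近自然场景的统计分布,而非逐像素对齐某个固定目标。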

[CV-140] Diabetic Retinopathy Classification using Downscaling Algorithms and Deep Learning

【速读】:该论文试图解决糖尿病视网膜病变(DR)分类中因视网膜图像尺寸大且多变而导致的处理困难与分类性能不足的问题。解决方案的关键在于:在将图像数据输入深度学习网络进行分类之前,采用多种下采样(downscaling)算法进行预处理,以统一图像尺寸并保留关键特征;同时,结合两个公开数据集(Kaggle与印度糖尿病视网膜病变图像数据集),并设计了一种新颖的多通道(Multi Channel)Inception V3架构与独特的自创预处理阶段,从而在准确率、特异性和灵敏度等指标上超越先前最优方法。

链接: https://arxiv.org/abs/2605.11430
作者: Nishi Doshi,Urvi Oza,Pankaj Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diabetic Retinopathy (DR) is an art and science of recording and classifying the retinal images of a diabetic patient. DR classification deals with classifying retinal fundus image into five stages on the basis of severity of diabetes. One of the major issue faced while dealing with DR classification problem is the large and varying size of images. In this paper we propose and explore the use of several downscaling algorithms before feeding the image data to a Deep Learning Network for classification. For improving training and testing; we amalgamate two datasets: Kaggle and Indian Diabetic Retinopathy Image Dataset. Our experiments have been performed on a novel Multi Channel Inception V3 architecture with a unique self crafted preprocessing phase. We report results of proposed approach using accuracy, specificity and sensitivity, which outperform the previous state of the art methods. Index Terms: Diabetic Retinopathy, Downscaling Algorithms, Multichannel CNN Architecture, Deep Learning
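
下采样算法之所以影响DR分类,是因为微小病灶可能在朴素子采样中被整体丢弃;下面用numpy对比最近邻与面积平均两种下采样(示意实现,并非论文的预处理流水线):

```python
import numpy as np

def downscale_nearest(img, factor):
    """最近邻下采样:每隔 factor 个像素取一个,可能直接跳过小病灶。"""
    return img[::factor, ::factor]

def downscale_area(img, factor):
    """面积(块平均)下采样:对 factor x factor 块取均值,
    小病灶的能量被保留为一个较暗的像素而非被丢弃。"""
    h, w = img.shape
    h2, w2 = h // factor, w // factor
    img = img[: h2 * factor, : w2 * factor]
    return img.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

img = np.zeros((8, 8)); img[3, 3] = 1.0   # 单个亮点模拟微动脉瘤
print(downscale_nearest(img, 4).max())    # 0.0 —— 病灶被丢弃
print(downscale_area(img, 4).max())       # 0.0625 —— 病灶以能量形式保留
```

这正是论文要系统比较多种下采样算法、而非默认单一缩放方式的动机所在。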

[CV-141] PD-4DGS:Progressive Decomposition of 4D Gaussian Splatting for Bandwidth-Adaptive Dynamic Scene Streaming

【速读】:该论文试图解决当前4D高斯溅射(4D Gaussian Splatting, 4DGS)模型作为单块比特流(monolithic bitstreams)时,客户端必须完整下载后才能渲染任何帧,导致移动带宽下数十至数百秒的黑屏等待,且无法兼容自适应比特率(adaptive-bitrate)传输的问题。其关键解决方案在于提出分层变形分解(Hierarchical Deformation Decomposition, HDD),将4DGS中固有的粗到细运动层次外部化为三个独立可传输的层——静态骨架(static scaffold)、全局变形(global deformation)和局部细化(local refinement),使得任意比特流前缀即可渲染,从而将单次训练转化为可扩展的、兼容DASH/HLS的比特流。此外,结合高斯熵属性率失真损失(Gaussian-entropy attribute rate-distortion loss)和时域掩码一致性正则化(temporal mask consistency regulariser)来压缩基础层并抑制低比特率闪烁,以及通过学习激活率(activation rate rho)门控的容量加权推出计划(capacity-weighted rollout schedule)避免变形网络欠训练,最终实现首帧延迟从73–930秒降至约1.7秒(2 Mbps链路),在匹配渲染保真度下减少60%流式比特流。

链接: https://arxiv.org/abs/2605.11427
作者: Jiachen Li,Guangzhi Han,Jin Wan,Delong Han,Yuan Gao,Min Li,Mingle Zhou,Gang Li
机构: Qilu University of Technology(齐鲁工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:4D Gaussian Splatting (4DGS) enables high-quality dynamic novel view synthesis, yet current models remain monolithic bitstreams that clients must download in full before any frame can be rendered, causing black-screen waits of tens to hundreds of seconds on mobile bandwidth and leaving 4DGS incompatible with modern adaptive-bitrate delivery. Progressive 3DGS compression alleviates this for static scenes, but it acts only on spatial anchors and cannot partition the temporal deformation networks that dominate dynamic-scene size. We present PD-4DGS, the first framework for progressive compression and on-demand transmission of 4DGS. Hierarchical Deformation Decomposition (HDD) externalises the coarse-to-fine motion hierarchy already latent in 4DGS into three independently transmittable layers – a static scaffold, a global deformation, and a local refinement – so that any prefix of the bitstream is already renderable, turning a single training run into a scalable, DASH/HLS-compatible bitstream. A Gaussian-entropy attribute rate-distortion loss together with a temporal mask consistency regulariser shrink the base layer while suppressing low-bitrate flicker; a capacity-weighted rollout schedule, gated online by a learnt activation rate rho, then prevents deformation-network under-training without any per-scene hyperparameter. On the Dycheck iPhone benchmark, PD-4DGS cuts the streamed bitstream by 60% at matched rendering fidelity and reduces first-frame latency from 73–930 s to ~1.7 s on a 2 Mbps link, uniquely enabling true on-demand progressive streaming for 4DGS.
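
"任意比特流前缀即可渲染"的分层设计可以用如下玩具解码器示意:静态骨架恒定可用,后续每层(全局变形、局部细化)是可选的增量修正,带宽预算截断在哪一层,就渲染到哪一层(层大小与增量数值均为虚构,非论文码流格式):

```python
import numpy as np

def render_from_prefix(scaffold, layers, budget):
    """玩具渐进式解码(示意,非论文实现)。
    scaffold: (n, 3) 静态骨架点;layers: [(字节数, (n, 3) 增量), ...]。
    预算内下载到哪一层就叠加到哪一层,任何前缀都是合法的可渲染状态。"""
    pts, used = scaffold.copy(), 0
    for size, delta in layers:
        if used + size > budget:
            break
        pts, used = pts + delta, used + size
    return pts, used

scaffold = np.zeros((4, 3))
layers = [(100, np.full((4, 3), 1.0)),   # 全局变形层
          (300, np.full((4, 3), 0.1))]   # 局部细化层
coarse, b1 = render_from_prefix(scaffold, layers, budget=150)  # 只够第一层
fine, b2 = render_from_prefix(scaffold, layers, budget=500)    # 两层全部
print(coarse[0], fine[0])  # [1. 1. 1.] [1.1 1.1 1.1]
```

与单块比特流相比,客户端不必等完整下载即可出图,这正是首帧延迟从数十秒降到秒级的来源。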

[CV-142] VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors SIGGRAPH

【速读】:该论文旨在解决从稀疏视角(甚至单张图像)进行高保真3D场景重建时,由于输入覆盖不足而导致的未见、遮挡或弱约束区域推理失败问题。解决方案的关键在于构建一个无需训练的生成式重建框架VidSplat,其核心组件包括:一是基于视频扩散先验的分阶段去噪策略,通过渲染的RGB和掩码图像自适应引导去噪方向,以保持生成新视图的3D几何一致性;二是迭代采样相机轨迹、合成新视图并用置信度加权优化补充训练数据的机制,从而逐步补偿缺失的输入覆盖,实现完整场景重建。

链接: https://arxiv.org/abs/2605.11424
作者: Jimin Tang,Wenyuan Zhang,Junsheng Zhou,Zian Huang,Kanle Shi,Shenkun Xu,Yu-Shen Liu,Zhizhong Han
机构: School of Software, Tsinghua University (清华大学软件学院); Kuaishou Technology (快手科技); Department of Computer Science, Wayne State University (韦恩州立大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by SIGGRAPH Conference 2026. Project Page: this https URL

点击查看摘要

Abstract:Gaussian Splatting has achieved remarkable progress in multi-view surface reconstruction, yet it exhibits notable degradation when only few views are available. Although recent efforts alleviate this issue by enhancing multi-view consistency to produce plausible surfaces, they struggle to infer unseen, occluded, or weakly constrained regions beyond the input coverage. To address this limitation, we present VidSplat, a training-free generative reconstruction framework that leverages powerful video diffusion priors to iteratively synthesize novel views that compensate for missing input coverage, and thereby recover complete 3D scenes from sparse inputs. Specifically, we tackle two key challenges that enable the effective integration of generation and reconstruction. First, for 3D consistent generation, we elaborate a training-free, stage-wise denoising strategy that adaptively guides the denoising direction toward the underlying geometry using the rendered RGB and mask images. Second, to enhance the reconstruction, we develop an iterative mechanism that samples camera trajectories, explores unobserved regions, synthesizes novel views, and supplements training through confidence weighted refinement. VidSplat performs robustly to sparse input and even a single image. Extensive experiments on widely used benchmarks demonstrate our superior performance in sparse-view scene reconstruction.

[CV-143] JACoP: Joint Alignment for Compliant Multi-Agent Prediction CVPR

【速读】:该论文试图解决现有随机人类轨迹预测(Stochastic Human Trajectory Prediction, HTP)方法在优化个体预测精度时,生成的整体预测缺乏集体合规性(collective compliance),导致输出轨迹存在社会碰撞(social collisions)和环境违规(environmental violations),从而限制了实际应用的问题。解决方案的关键在于提出了一个创新的多阶段框架JACoP(Joint Alignment for Compliant Multi-Agent Prediction),其核心包含两个组件:一是基于锚点的智能体中心剖面器(Anchor-Based Agent-Centric Profiler),用于对初始预测进行有效的合规过滤;二是基于马尔可夫随机场(Markov Random Field, MRF)的对齐器,通过将智能体间的空间和社会成本表示为MRF能量势(energy potentials),形式化场景预测的联合选择过程,从而能够从联合轨迹分布中推断和采样,实现最优的场景级(scene-level)合规预测。

链接: https://arxiv.org/abs/2605.11385
作者: Qingze Liu,Alen Mrdovic,Danrui Li,Mathew Schwartz,Sejong Yoon,Mubbasir Kapadia
机构: Rutgers University, New Brunswick (罗格斯大学, 新布朗斯维克); The College of New Jersey (新泽西学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Stochastic Human Trajectory Prediction (HTP) using generative modeling has emerged as a significant area of research. Although state-of-the-art models excel in optimizing the accuracy of individual agents, they often struggle to generate predictions that are collectively compliant, leading to output trajectories marred by social collisions and environmental violations, thus rendering them impractical for real-world applications. To bridge this gap, we present JACoP: Joint Alignment for Compliant Multi-Agent Prediction, an innovative multi-stage framework that ensures scene-level plausibility. JACoP incorporates an Anchor-Based Agent-Centric Profiler for effective initial compliance filtering and employs a Markov Random Field (MRF) based aligner to formalize the joint selection for scene predictions. By representing inter-agent spatial and social costs as MRF energy potentials, we successfully infer and sample from the joint trajectory distribution, achieving prediction with optimal scene compliance. Comprehensive experiments show that JACoP not only achieves competitive accuracy, but also sets a new standard in reducing both environmental violations and social collisions, thereby confirming its ability to produce collectively feasible and practically applicable trajectory predictions.
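
将"智能体间代价表示为MRF能量势并做联合选择"化到最小形态,就是在每个智能体的 K 条候选轨迹上最小化"一元代价 + 两两碰撞罚项";小规模下可直接穷举(罚项权重与碰撞半径均为假设值,并非论文的推断算法):

```python
import itertools
import numpy as np

def joint_mode_selection(candidates, unary, radius=1.0, w_pair=10.0):
    """玩具MRF联合选择(示意,非论文实现)。
    candidates[i]: (K, T, 2) 第 i 个智能体的 K 条候选轨迹;
    unary: (n, K) 每条候选的一元代价;
    两两势:任一时刻两智能体距离小于 radius 即累加 w_pair。"""
    n, K = unary.shape
    best, best_e = None, np.inf
    for assign in itertools.product(range(K), repeat=n):
        e = sum(unary[i, k] for i, k in enumerate(assign))
        for i in range(n):
            for j in range(i + 1, n):
                d = np.linalg.norm(
                    candidates[i][assign[i]] - candidates[j][assign[j]], axis=-1)
                e += w_pair * float(np.sum(d < radius))
        if e < best_e:
            best, best_e = assign, e
    return best, best_e

straight = np.array([[0., 0.], [1., 0.], [2., 0.]])
cands = [np.stack([straight, straight - [0., 2.]]),   # 智能体0:直行 / 向下绕行
         np.stack([straight, straight + [0., 2.]])]   # 智能体1:直行 / 向上绕行
unary = np.array([[0., 1.], [0., 1.]])                # 直行的个体代价更低
print(joint_mode_selection(cands, unary))  # ((0, 1), 1.0):一方绕行以避免碰撞
```

逐智能体独立取最优会让双方都直行而相撞;联合能量最小化则牺牲少量个体精度换取场景级合规,这正是摘要所说的取舍。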

[CV-144] HamBR: Active Decision Boundary Restoration Based on Hamiltonian Dynamics for Learning with Noisy Labels

【速读】:该论文旨在解决噪声标签环境下,深度神经网络因决策边界坍塌(boundary collapse)而无法区分难清洁样本(hard clean samples)与噪声样本(noisy samples)的性能瓶颈问题。解决方案之关键在于提出基于哈密顿动力学(Hamiltonian dynamics)的HamBR范式,其核心是采用球面哈密顿蒙特卡洛(Spherical HMC)机制主动探测表示空间中的类间模糊区域并合成高质量虚拟离群点(virtual outliers),再通过基于能量的建模施加显式排斥约束,在决策边界处建立稳健的能量屏障(energy barrier),从而迫使真实样本从分散的重叠区域向各自类中心收缩,恢复决策边界的判别锐度(discriminative sharpness),最终提升模型对硬边界样本的判别准确性及分布外检测(OOD detection)能力。

链接: https://arxiv.org/abs/2605.11383
作者: Ningkang Peng,Jingyang Mao,Qianfeng Yu,Xiaoqian Peng,Peirong Ma,Yanhui Gu
机构: Nanjing Normal University(南京师范大学); Nanjing University of Chinese Medicine(南京中医药大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In large-scale visual recognition and data mining tasks, the presence of noisy labels severely undermines the generalization capability of deep neural networks (DNNs). Prevalent sample selection methods rely primarily on training loss or prediction confidence for passive screening. However, within a feature space degraded by noise, decision boundaries undergo systematic boundary collapse. This phenomenon hinders the ability of the model to distinguish between hard clean samples and noisy samples at the decision margins, thereby creating a significant performance bottleneck. This study is the first to emphasize the pivotal importance of active boundary restoration for noise-robust learning. We propose HamBR, a novel paradigm based on Hamiltonian dynamics. The core approach leverages the Spherical Hamiltonian Monte Carlo (Spherical HMC) mechanism to actively probe inter-class ambiguous regions within the representation space and synthesize high-quality virtual outliers. By imposing explicit repulsion constraints via energy-based modeling, these synthesized samples establish robust energy barriers at the decision boundaries. This mechanism forces real samples to move from dispersed overlapping regions toward their respective class centers, thereby restoring the discriminative sharpness of the decision boundaries. HamBR demonstrates exceptional versatility and can be integrated as a plug-and-play defense module into existing semi-supervised noisy label learning frameworks. Empirical evaluations show that the proposed paradigm significantly enhances the discriminative accuracy of hard boundary samples, achieving state-of-the-art (SOTA) performance on CIFAR-10/100 and real-world noise benchmarks. Furthermore, it exhibits superior convergence efficiency and reliable robustness, while improving significantly the capability of the model for Out-of-Distribution (OOD) detection.
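
用球面HMC在两类原型之间的"模糊带"采样虚拟离群点,可以用投影式leapfrog近似示意(严格的Spherical HMC应使用测地线积分器;这里的能量函数、步长等均为玩具设定,非论文实现):

```python
import numpy as np

def sample_boundary_outliers(mu1, mu2, n=200, eps=0.05, steps=10, lam=20.0, seed=0):
    """投影式HMC示意(非论文实现):在单位球面上采样能量
    U(x) = lam * (mu1.x - mu2.x)^2 较低的点,
    即靠近两类原型决策边界的虚拟离群点。"""
    rng = np.random.default_rng(seed)
    U = lambda v: lam * (mu1 @ v - mu2 @ v) ** 2
    gradU = lambda v: 2.0 * lam * (mu1 @ v - mu2 @ v) * (mu1 - mu2)
    x = mu1 / np.linalg.norm(mu1)
    out = []
    for _ in range(n):
        p = rng.normal(size=len(x))
        p -= (p @ x) * x                      # 动量投影到切空间
        xn, pn, H0 = x.copy(), p.copy(), U(x) + 0.5 * p @ p
        for _ in range(steps):                # leapfrog + 回缩到球面
            pn = pn - 0.5 * eps * gradU(xn)
            xn = xn + eps * pn
            xn = xn / np.linalg.norm(xn)
            pn = pn - 0.5 * eps * gradU(xn)
            pn = pn - (pn @ xn) * xn
        if rng.random() < np.exp(min(0.0, H0 - (U(xn) + 0.5 * pn @ pn))):
            x = xn                            # Metropolis 校正
        out.append(x.copy())
    return np.array(out)

mu1, mu2 = np.eye(3)[0], np.eye(3)[1]
s = sample_boundary_outliers(mu1, mu2)
print(np.abs(s[100:] @ (mu1 - mu2)).mean())  # 该值应远小于起点处的 1,表示样本聚在边界附近
```

这些边界样本经能量排斥约束后即可作为"能量屏障",把真实样本推离重叠区,对应论文的主动边界恢复。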

[CV-145] Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers CVPR

【速读】:该论文试图解决生成物理上合理的动态、长期人-物体交互(Human-Object Interaction, HOI)运动的问题,现有方法受限于静态交互数据集以及预训练智能体无法同时处理动态全身运动与物体交互。解决方案的关键在于:在规划阶段,利用预训练人体运动扩散模型的动态先验增强HOI数据集,生成物体轨迹以规划动态HOI序列;在执行阶段,设计一个组合器网络(composer network)混合分别专精于动态人体运动和静态HOI运动的预训练模仿智能体(imitation agents),实现两者互补技能的时空组合,从而在保持交互的同时提升动态HOI任务的成功率,并显著减少训练时间。

链接: https://arxiv.org/abs/2605.11369
作者: Sanghyeok Nam,Byoungjun Kim,Daehyung Park,Tae-Kyun Kim
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Findings 2026

点击查看摘要

Abstract:Generating physically plausible dynamic motions of human-object interaction (HOI) remains challenging, mainly due to existing HOI datasets limited to static interactions, and pretrained agents capable of either dynamic full-body motions without objects or static HOI motions. Recent works such as InsActor and CLoSD generate HOI motions in planning and execution stages, are yet limited to either static or short-term contacts e.g. striking. In this work, we propose a framework that fulfills dynamic and long-term interaction motions such as running while holding a table, by combining pretrained motion priors and imitation agents in planning and execution stages. In the planning stage, we augment HOI datasets with dynamic priors from a pretrained human motion diffusion model, followed by object trajectory generation. This plans dynamic HOI sequences. In the execution stage, a composer network blends actions of pretrained imitation agents specialized either for dynamic human motions or static HOI motions, enabling spatio-temporal composition of their complementary skills. Our method over relevant prior-arts consistently improves success rates while maintaining interaction for dynamic HOI tasks. Furthermore, blending pretrained experts with our composer achieves competitive performance in significantly reduced training time. Ablation studies validate the effectiveness of our augmentation and composer blending.
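
"组合器网络混合预训练专家动作"的最小形态,是一个由状态产生softmax权重、对各专家动作加权求和的头部(示意;论文中的组合器与两类专家都是训练得到的策略网络,此处权重矩阵与状态特征均为虚构):

```python
import numpy as np

def composer_blend(obs, expert_actions, W):
    """玩具组合器(示意,非论文网络)。
    obs: 状态特征;expert_actions: (E, d) 各专家输出的动作;
    W: (E, len(obs)) 线性路由,softmax 后得到混合权重。"""
    logits = W @ obs
    w = np.exp(logits - logits.max()); w /= w.sum()
    return w @ expert_actions, w

locomotion = np.array([1.0, 0.0, 0.0])    # “动态全身运动”专家的动作
interaction = np.array([0.0, 1.0, 0.0])   # “静态HOI”专家的动作
obs = np.array([2.0, -2.0])               # 假设的状态特征
a, w = composer_blend(obs, np.stack([locomotion, interaction]),
                      W=np.array([[1.0, 0.0], [0.0, 1.0]]))
print(np.round(w, 3))  # 此状态下偏向第一个专家
```

混合发生在动作空间而非参数空间,因此两个专家可以分别冻结预训练,这也是训练时间大幅缩短的原因。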

[CV-146] 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

【速读】:该论文试图解决当前视觉生成世界模型(world model)普遍将任务简化为新视角合成或未来帧预测、偏重视觉真实性而忽视部分可观测环境下具身智能体(embodied agents)所需的结构化不确定性这一问题。其关键解决方案是提出“世界模型即3D空间中的具身信念推理”(world modeling as embodied belief inference in 3D space)这一新视角,并据此实例化为3D-Belief——一个生成式3D世界模型。该模型能从部分观测中显式地推断出可操作的3D信念(explicit, actionable 3D beliefs),并在线(online)依次更新;它直接在3D中表示不确定性(uncertainty),使得具身智能体可以想象合理的场景补全并推理部分观测环境,从而在场景记忆、多假设信念采样、顺序信念更新以及对未观测区域进行语义信息预测等方面实现关键能力。

链接: https://arxiv.org/abs/2605.11367
作者: Yifan Yin,Zehao Wen,Jieneng Chen,Zehan Zheng,Nanru Dai,Haojun Shi,Suyu Ye,Aydan Huang,Zheyuan Zhang,Alan Yuille,Jianwen Xie,Ayush Tewari,Tianmin Shu
机构: Johns Hopkins University (约翰霍普金斯大学); Lambda; University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in visual generative models have highlighted the promise of learning generative world models. However, most existing approaches frame world modeling as novel-view synthesis or future-frame prediction, emphasizing visual realism rather than the structured uncertainty required by embodied agents acting under partial observability. In this work, we propose a different perspective: world modeling as embodied belief inference in 3D space. From this view, a world model should not merely render what may be seen, but maintain and update an agent’s belief about the unobserved 3D world as new observations are acquired. We identify several key capabilities for such models, including spatially consistent scene memory, multi-hypothesis belief sampling, sequential belief updating, and semantically informed prediction of unseen regions. We instantiate these ideas in 3D-Belief, a generative 3D world model that infers explicit, actionable 3D beliefs from partial observations and updates them online over time. Unlike prior visual prediction models, 3D-Belief represents uncertainty directly in 3D, enabling embodied agents to imagine plausible scene completions and reason over partially observed environments. We evaluate 3D-Belief on 2D visual quality for scene memory and unobserved-scene imagination, object- and scene-level 3D imagination using our proposed 3D-CORE benchmark, and challenging object navigation tasks in both simulation and the real world. Experiments show that 3D-Belief improves 2D and 3D imagination quality and downstream embodied task performance compared to state-of-the-art methods.

[CV-147] Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction

【速读】:该论文旨在解决基于Transformer的3D重建模型在实际部署中面临的两个耦合挑战:一是密集的多视图注意力机制导致大量的token混合开销(token-mixing overhead),限制了模型扩展时的效率;二是低精度(如FP8)执行会破坏几何敏感表示(geometry-sensitive representations),造成深度、姿态和3D一致性的退化。解决方案的关键在于提出一个名为Lite3R的算法-系统协同设计框架:一方面,采用模型无关的教师-学生(teacher-student)框架,将密集注意力替换为稀疏线性注意力(Sparse Linear Attention),在保留关键几何交互的同时降低注意力计算成本;另一方面,引入参数高效的FP8感知量化感知训练(FP8-aware Quantization-Aware Training, FP8-aware QAT)策略,并结合部分注意力蒸馏(partial attention distillation),冻结预训练骨干的大部分参数,仅训练轻量级线性分支投影层(linear-branch projection layers),从而在保持预训练几何先验的同时实现稳定的低精度部署。

链接: https://arxiv.org/abs/2605.11354
作者: Haoyu Zhang,Zeyu Zhang,Zedong Zhou,Yang Zhao,Hao Tang
机构: Peking University(北京大学); La Trobe University(拉筹伯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: this https URL. Website: this https URL.
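The efficiency idea at the heart of Lite3R, replacing dense softmax attention with a linear-attention form, can be sketched generically. The snippet below is a minimal kernelized linear attention using the elu(x)+1 feature map; it is an illustrative stand-in, not Lite3R's actual Sparse Linear Attention operator:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized linear attention with the elu(x)+1 feature map.

    Cost is O(N * d * d_v) rather than the O(N^2 * d) of dense
    softmax attention, since keys/values are aggregated once.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x)+1, strictly positive
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V            # (d, d_v): aggregate keys and values once
    Z = Kf.sum(axis=0)       # (d,): normalizer accumulator
    return (Qf @ KV) / (Qf @ Z)[:, None]
```

Because the normalized weights are a convex combination, a constant value matrix passes through unchanged, which gives a quick sanity check.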

[CV-148] Gradient-Free Noise Optimization for Reward Alignment in Generative Models

【速读】:该论文旨在解决现有基于扩散和流模型的奖励对齐方法依赖多步随机轨迹,难以扩展到确定性生成器,且噪声空间优化需要反向传播通过生成器和奖励管道,从而限制其仅适用于可微设置的问题。解决方案的关键是提出ZeNO(Zeroth-order Noise Optimization)框架,该框架将噪声优化重新表述为路径积分控制问题,仅通过零阶奖励评估(无需梯度)即可进行估计;当以奥恩斯坦-乌伦贝克(Ornstein-Uhlenbeck)参考过程实例化时,其更新隐式地关联到针对奖励倾斜分布的朗之万动力学,从而在推理时实现有效缩放,并能在反向传播不可行的场景(如蛋白质结构生成)中表现出色。

链接: https://arxiv.org/abs/2605.11347
作者: Jeongsol Kim,Hongeun Kim,Jian Wang,Jong Chul Ye
机构: KAIST AI (韩国科学技术院人工智能研究所); Snap Inc. (Snap公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing reward alignment methods for diffusion and flow models rely on multi-step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise-space optimization, but existing approaches require backpropagation through the generator and reward pipeline, limiting applicability to differentiable settings. To address this, here we present ZeNO (Zeroth-order Noise Optimization), a gradient-free framework that formulates noise optimization as a path-integral control problem, estimable from zeroth-order reward evaluations alone. When instantiated with an Ornstein–Uhlenbeck reference process, the update connects to Langevin dynamics implicitly targeting a reward-tilted distribution. ZeNO enables effective inference-time scaling and demonstrates strong performance across diverse generators and reward functions, including a protein structure generation task where backpropagation is infeasible.
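A gradient-free noise update can be sketched with a Gaussian-smoothing zeroth-order estimator that uses only reward evaluations. This is a generic illustration of the idea, not ZeNO's path-integral / Ornstein-Uhlenbeck formulation; the `reward` callable and step sizes are placeholders:

```python
import numpy as np

def zo_noise_step(z, reward, sigma=0.1, lr=0.05, n_probes=32, rng=None):
    """One zeroth-order ascent step on the initial noise z.

    Estimates the smoothed reward gradient from forward evaluations
    alone: g ~ E[(r(z + sigma*u) - r(z)) * u] / sigma.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    u = rng.standard_normal((n_probes,) + z.shape)
    base = reward(z)
    scores = np.array([reward(z + sigma * ui) for ui in u])
    # contract probe axis: weighted average of probe directions
    g = np.tensordot(scores - base, u, axes=1) / (n_probes * sigma)
    return z + lr * g
```

On a simple quadratic reward the iterate moves toward the maximizer without ever computing an analytic gradient.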

[CV-149] Quantifying Rodda and Graham Gait Classification from 3D Markerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort

【速读】:该论文旨在解决脑性瘫痪(Cerebral Palsy, CP)儿童步态评估中存在的可及性差和客观性不足的问题:当前金标准三维仪器化步态分析(3D-IGA)成本高昂且局限于专业中心,而临床观察性评估的信度中等,无法为大规模、低资源环境下的步态功能监测提供定量支持。解决方案的关键在于开发了一种无标记步态分析流程(markerless gait analysis pipeline),该流程直接从单视角临床步态视频中量化Rodda和Graham分类系统的膝踝z分数,通过训练模型实现与3D-IGA的高一致性(膝z分数R²=0.80,CCC=0.89;踝z分数R²=0.57,CCC=0.72),并进一步支持膝关节过度屈曲的二元筛查(AUROC=0.88)以及连续z分数的纵向轨迹追踪,从而在低资源临床环境中实现可规模化、客观的步态评估。

链接: https://arxiv.org/abs/2605.11314
作者: Lauhitya Reddy,Seth Donahue,Jeremy Bauer,Susan Sienko,Anita Bagley,Joseph Krzak,Maura Eveld,Karen Kruger,Ross Chafetz,Vedant Kulkarni,Hyeokhyen Kwon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 29 pages, 8 figures, 9 tables (including 1 supplementary table); manuscript prepared in PLOS ONE format

点击查看摘要

Abstract:Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approximately 75% of children with CP are ambulatory, and accurate gait assessment is central to preserving walking function, which deteriorates by mid-adulthood in a quarter to half of adults with CP. The Rodda and Graham classification system quantifies sagittal-plane gait deviations using ankle and knee z-scores derived from 3D Instrumented Gait Analysis (3D-IGA), but 3D-IGA is expensive and limited to specialized centers, while observational assessment shows only moderate inter-rater agreement. We developed a markerless gait analysis pipeline that quantifies Rodda and Graham knee and ankle z-scores directly from single-view clinical gait videos. Across 1,058 bilateral limb samples from 529 trials of 152 children (88 male, 63 female; age 12.1 \pm 4.0 years; 60 distinct primary diagnoses, cerebral palsy the most common at n=54 ), the sagittal-view model achieved R^2 = 0.80 \pm 0.02 and CCC = 0.89 \pm 0.02 for knee z-scores and R^2 = 0.57 \pm 0.02 and CCC = 0.72 \pm 0.02 for ankle z-scores against 3D-IGA. Binary screening for excess knee flexion achieves AUROC = 0.88 , correctly identifying 83% of affected children, and applying Rodda and Graham rules yields 43 \pm 1 % 7-class accuracy with macro-AUROC = 0.78 \pm 0.01 , ankle prediction error remaining the primary bottleneck. Beyond cross-sectional screening, continuous z-scores support longitudinal trajectory tracking across visits, providing a quantitative substrate for monitoring disease progression and treatment response unavailable from observational scales. These results demonstrate the feasibility of video-based z-score estimation, excess-flexion screening, and longitudinal trajectory tracking as a path toward scalable, objective gait assessment in low-resource clinical settings.

[CV-150] Couple to Control: Joint Initial Noise Design in Diffusion Models

【速读】:该论文试图解决扩散模型在批量生成图像时,独立高斯初始噪声假设限制了生成多样性和可控性的问题。其关键解决方案在于:将初始噪声的控制从选择或优化单个种子,重新定义为设计多样本图集中噪声的依赖结构(dependence structure),通过在保持每个噪声边缘分布为标准高斯的前提下引入耦合(coupling),从而在不增加采样成本的情况下提升生成多样性,并作为优化型管线的结构化初始化。具体实现中,排斥性高斯耦合(repulsive Gaussian coupling)可提高图集多样性,子空间耦合(subspace couplings)则支持固定对象下的背景生成,两者均能在保持提示对齐和图像质量的同时,匹配或超越现有测试时噪声优化基线。

链接: https://arxiv.org/abs/2605.11311
作者: Jing Jia,Liyue Shen,Guanyang Wang
机构: Rutgers University (罗格斯大学); University of Michigan (密歇根大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO); Machine Learning (stat.ML)
备注: 26 pages

点击查看摘要

Abstract:Diffusion models typically generate image batches from independent Gaussian initial noises. We argue that this independence assumption is only one choice within a broader class of valid joint noise designs. Instead, one can specify a coupling of the initial noises: each noise remains marginally standard Gaussian, so the pretrained diffusion model receives the same single-sample input distribution, while the dependence across samples is chosen by design. This reframes initial-noise control from selecting or optimizing individual seeds to designing the dependence structure of a multi-sample gallery. This view gives a general framework for initial-noise design, covering several existing methods as special cases and leading naturally to new coupled-noise constructions. Coupled noise can improve generation on its own without adding sampling cost, and it is flexible enough to serve as a structured initialization for optimization-based pipelines when additional computation is available. Empirically, repulsive Gaussian coupling improves gallery diversity on SD1.5, SDXL, and SD3 while largely preserving prompt alignment and image quality. It matches or outperforms recent test-time noise-optimization baselines on several diversity metrics at the same sampling cost as independent generation. Subspace couplings also support fixed-object background generation, producing diverse, natural backgrounds compared with specialized inpainting baselines, with a tunable trade-off in foreground fidelity.
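One concrete coupling that keeps every marginal exactly standard Gaussian while inducing batch-level repulsion is mean-centering with rescaling, which yields pairwise correlation -1/(n-1) across samples. This is a minimal illustrative construction, not necessarily the paper's repulsive Gaussian coupling:

```python
import numpy as np

def repulsive_coupled_noise(n, d, rng=None):
    """Sample n noise vectors, each marginally N(0, I), but negatively
    correlated across the batch.

    Centering removes the shared component (so samples spread apart);
    rescaling by sqrt(n / (n - 1)) restores unit marginal variance.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    e = rng.standard_normal((n, d))
    return (e - e.mean(axis=0)) * np.sqrt(n / (n - 1.0))
```

Because the pretrained diffusion model only ever sees one sample at a time, the unchanged marginals mean it receives exactly its training-time input distribution.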

[CV-151] Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

【速读】:该论文旨在解决图像到代码生成任务中现有评估基准的局限性,包括领域狭窄、依赖配对可执行参考代码以及通用评分标准无法捕捉领域特定重构错误等问题。解决方案的关键在于提出了 Vision2Code 基准与评估框架:通过构建包含 2,169 个测试样例、覆盖 15 个源数据集(涵盖图表、几何、图形、科学图像、文档和 3D 空间场景)的多领域测试集;模型生成可执行程序后,采用渲染与源图像比对的方式,利用一个结合数据集特定评分标准(domain-specific rubrics)和确定性护栏(deterministic guardrails)的视觉语言模型(VLM)评分器进行质量评估,同时引入渲染成功诊断(render-success diagnostics)以分离代码执行失败与重构质量缺陷。该评估协议经人工验证,其对齐人类判断的效果优于通用视觉评分或嵌入相似性基线方法。

链接: https://arxiv.org/abs/2605.11307
作者: Ajay Vikram Periasami,Junlin Wang,Bhuwan Dhingra
机构: Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain-specific reconstruction errors. We introduce Vision2Code, a reference-code-free benchmark and evaluation framework for multi-domain image-to-code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset-specific rubrics and deterministic guardrails for severe semantic failures. We report render-success diagnostics that separate code execution failures from reconstruction quality. Human validation shows that this evaluation protocol aligns better with human judgments than either a generic visual rubric or embedding-similarity baselines. Across nine open-weight and proprietary models, we find that image-to-code performance is domain-dependent: leading models perform well on regular chart- and graph-like visuals but remain weak on spatial scenes, chemistry, documents, and circuit-style diagrams. Finally, we show that evaluator-filtered model outputs can serve as training data to improve image-to-code capability, with Qwen3.5-9B improving from 1.60 to 1.86 on the benchmark without paired source programs. Vision2Code provides a reproducible testbed for measuring, diagnosing, and improving image-to-code generation. Our code and data are publicly available at this https URL.

[CV-152] CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

【速读】:该论文旨在解决当前大多数胸部X线影像视觉-语言模型(vision-language models)仅基于静态的单一影像-报告对进行训练,缺乏对纵向(longitudinal)时间变化进行显式建模与监督的问题,从而无法实现基于既往和当前影像的时序推理(temporal reasoning)。解决方案的关键在于构建了CheXTemporal数据集,该数据集包含配对的前后两次胸部X线片(prior-current chest X-rays),并提供了五类进展分类(new, worse, stable, improved, resolved)、局部病变空间标注、显式的时空对齐(spatial-temporal alignment)以及多来源覆盖以实现跨域评估。此外,还构建了包含28万对影像的银标准(silver standard)数据集,用于弱监督下的大规模评估。这些资源使得对现有模型在零样本(zero-shot)设置下的空间定位(spatial grounding)、细粒度时序推理(fine-grained temporal reasoning)及分布偏移鲁棒性进行系统性评估成为可能,从而揭示模型在纵向疾病演变建模上的显著不足。

链接: https://arxiv.org/abs/2605.11304
作者: Eva Prakash,Yunhe Gao,Chong Wang,Justin Xu,Neal Prakash,Arne Michalson,Seena Dehkharghani,Eun Kyoung Hong,Julie Bauml,Roger Boodoo,Jean-Benoit Delbrouck,Sophie Ostmeier,Curtis Langlotz
机构: Stanford University(斯坦福大学); University of Oxford(牛津大学); University of California, Berkeley(加利福尼亚大学伯克利分校); HOPPR; University Hospital Zurich(苏黎世大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chest radiograph interpretation requires temporal reasoning over prior and current studies, yet most vision-language models are trained on static image-report pairs and lack explicit supervision for modeling longitudinal change. We introduce CheXTemporal, a dataset for temporally grounded reasoning in chest radiography consisting of paired prior-current chest X-rays (CXR) with finding-level temporal and spatial annotations. The dataset includes a five-class progression taxonomy (new, worse, stable, improved, resolved), localized spatial supervision of pathology, explicit spatial-temporal alignment across paired studies, and multi-source coverage for cross-domain evaluation. We additionally construct a 280K-pair silver dataset with automatically derived temporal and anatomical supervision for large-scale evaluation under weaker supervision. Using these resources, we evaluate multiple state-of-the-art vision-language CXR models on grounding and progression-classification tasks in a zero-shot setting. Across both gold and silver evaluations, current models exhibit consistent limitations in spatial grounding, fine-grained temporal reasoning, and robustness under distribution shift. In particular, models perform substantially better on salient progression categories such as worse than on temporally subtle states such as stable and resolved, suggesting limited modeling of longitudinal disease evolution in chest radiography.

[CV-153] Can Graphs Help Vision SSMs See Better?

【速读】:该论文旨在解决视觉状态空间模型(Vision State Space Models, Vision SSMs)中现有扫描算子(scan operators)仅通过几何遍历或动态坐标采样来适应路径或采样位置,而未能显式建模哪些局部补丁(patches)应在全局状态空间混合之前彼此交换信息的根本性局限。解决方案之关键在于提出了GraphScan——一种图诱导的动态扫描算子:对于每个token,GraphScan构建一个空间受限的局部图,学习基于特征条件的亲和力(feature-conditioned affinities)并引入相对位置偏置(relative positional bias),随后通过一步消息传递(one-step message passing)在其语义邻域内生成输出token;这些token在经由选择性状态空间模型(selective SSM)进行全局聚合之前已具备局部基础。GraphScan不仅保持token数量不变和图像尺寸线性缩放,还用特征驱动的语义路由(feature-conditioned semantic routing)替代了坐标驱动的插值(coordinate-conditioned interpolation),从而实现了更有效的局部-全局信息交换。

链接: https://arxiv.org/abs/2605.11300
作者: Dhruv Parikh,Anvitha Ramachandran,Haoyang Fan,Mustafa Munir,Rajgopal Kannan,Viktor Prasanna
机构: USC(南加州大学); UT Austin(德克萨斯大学奥斯汀分校); DEVCOM ARL Army Research Office(美国陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: \emphcan graphs help vision state space models see better? We introduce \textbfGraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, \textbfGraphScan-Mamba achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.
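The per-token local mixing step can be illustrated on a 1D token lattice: for each token, compute feature-conditioned affinities over a bounded neighborhood, softmax-normalize them, and aggregate with one-step message passing. This is a toy sketch; GraphScan itself operates on a 2D lattice and adds learned projections and relative positional bias:

```python
import numpy as np

def local_graph_mixing(feats, k=2):
    """One-step message passing over a spatially bounded local graph.

    feats: (N, D) token features on a 1D lattice; each token attends
    to its +/- k neighbours with dot-product affinities.
    """
    N, _ = feats.shape
    out = np.empty_like(feats)
    for i in range(N):
        lo, hi = max(0, i - k), min(N, i + k + 1)
        aff = feats[lo:hi] @ feats[i]      # feature-conditioned affinities
        w = np.exp(aff - aff.max())
        w /= w.sum()                       # softmax over the neighbourhood
        out[i] = w @ feats[lo:hi]          # one-step message passing
    return out
```

Since the weights form a convex combination, identical tokens are left unchanged, one easy invariance to check.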

[CV-154] Generative AI for Visualizing Highway Construction Hazards Through Synthetic Images and Temporal Sequences

【速读】:该论文旨在解决因伦理和后勤障碍导致的公路施工危害图像培训材料匮乏问题。解决方案的关键是开发并评估一种基于生成式AI(Generative AI)的方法,从OSHA严重伤害报告叙述中自动合成可视化图像,具体包括两种生成模式:单次通过(single-pass)生成每起事件一幅图像,以及时间序列(temporal)生成四阶段序列;同时构建了多维度评估框架,结合CLIP(Contrastive Language-Image Pre-training)语义检索与专家评估,对生成图像的教育效用、保真度(fidelity)和对齐度(alignment)等维度进行量化验证,从而确保合成图像的可接受性与实用性。

链接: https://arxiv.org/abs/2605.11276
作者: Trevor Neece,Mason Smetana,Lev Khazanovich
机构: University of Pittsburgh (匹兹堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Highway construction workers face a high risk of serious injury or death. Image-based training materials depicting hazardous scenarios are essential for engaging safety instruction but remain scarce due to ethical and logistical barriers. This study develops and evaluates a generative AI methodology for producing synthetic visualizations of highway construction hazards from OSHA Severe Injury Report narratives. Two modes were developed: a single-pass approach yielding one image per incident, and a temporal approach producing a four-stage sequence. A sample of 75 incident records yielded 750 images, evaluated using CLIP-based semantic retrieval and expert assessment across dimensions such as educational utility, fidelity, and alignment. Single-pass images achieved 81.1% educational acceptability with fidelity and alignment scores of 4.14/5 and 4.07/5, respectively, while temporal sequences achieved 60.9% acceptability with comparable alignment (3.94/5) but lower fidelity (3.51/5). CLIP-based retrieval revealed that both modes produce images with statistically significant retrieval capabilities. This is among the first studies to leverage modern autoregressive image generation models for visualizing construction hazards from reported severe injuries and to generate temporally sequenced hazard imagery, and a new multi-dimensional evaluation framework was developed to support future research in this domain. The work enables safety trainers to pair narrative storytelling with visual learning material without photographing real-world hazards, and the framework could be applied to datasets across diverse domains, enabling synthetic image generation tailored to new application areas.

[CV-155] Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates

【速读】:该论文试图解决传统岛屿面积和海岸线测量方法过度依赖正射影像、昂贵机载深度传感器或密集地面控制点,导致在广阔且难以进入的开放海域环境中存在人力成本高、耗时长、效率低下的问题。解决方案的关键在于提出一种基于纯单目视觉 (pure monocular vision) 的全自动测量框架:仅需输入目标区域的地理坐标或名称,获取低空环绕图像序列后,通过轻量级轨迹对齐算法 (Umeyama) 恢复点云的全局物理尺度,并对缩放模型进行正射纠正,从而在二维栅格平面上直接实现高精度的面积和周长提取,完全摆脱了对先验地理信息系统数据的依赖。

链接: https://arxiv.org/abs/2605.11267
作者: Quanyun Wu,Kyle Gao,Wentao Sun,Hongjie He,Yuhao Chen,David A. Clausi,Jonathan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at IEEE OCEANS (Sanya) 2026

点击查看摘要

Abstract:Accurate measurement of island area and coastline length is crucial for coastal zone monitoring and oceanographic analysis. However, traditional measurement and mapping methods usually rely heavily on orthophotos, expensive airborne depth sensors, or dense ground control points, which face serious limitations of high labor costs, time-consuming efforts, and low operational efficiency in vast and inaccessible open sea environments. To overcome these challenges and break away from the reliance on manual field exploration, this paper proposes a geometrically consistent, real-scale island measurement framework based on pure monocular vision. This project significantly reduces the mapping cost through a fully automated process and achieves high-efficiency measurement without prior GIS data. In our system pipeline, only the geographical coordinates or names of the target area need to be input to obtain a low-altitude surrounding image sequence. After obtaining the point clouds, a lightweight trajectory alignment algorithm (Umeyama) is used to restore the global physical scale, and the scaled model is orthorectified, enabling high-precision area and perimeter extraction directly on the 2D rasterized plane. We have fully verified this pipeline on four islands with different terrain features (covering natural landform islands and islands with complex artificial facilities). The experimental results show that the final measurement error of the system is stable at around 10%, demonstrating excellent accuracy and robustness. Moreover, this framework has outstanding inference speed, requiring only 70 ms to process a single high-resolution image and generate point clouds, providing a highly practical new paradigm for large-scale marine and coastline mapping.
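The scale-recovery step cited in the abstract is the standard Umeyama similarity alignment, which has a closed-form SVD solution. A minimal implementation, assuming noiseless 3D point correspondences between the reconstructed and reference trajectories:

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Estimate the similarity transform (scale s, rotation R,
    translation t) minimizing ||dst - (s * src @ R.T + t)||^2
    (Umeyama, 1991). src, dst: (N, dim) corresponding points.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    n, dim = src.shape
    cov = xd.T @ xs / n
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(dim)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0                 # avoid reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / n          # source variance
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Applying a known similarity transform to random points and recovering it exactly is the usual correctness check.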

[CV-156] PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives

【速读】:该论文旨在解决现有高斯溅射(Gaussian Splatting)方法仅生成视觉逼真的三维场景而缺乏对物体物理行为理解的问题。其解决方案的关键在于引入物理引导的三维高斯溅射(Physics-Guided 3D Gaussian Splatting, PG-3DGS)框架,通过将可微物理模拟与三维高斯表示相耦合,在视觉损失之外允许物理目标引导形状优化过程,从而生成既满足光度准确性又具备物理功能性的三维结构。该框架实现了外观重建与物理推理的统一,使得生成物体(如可倒水的茶壶和产生升力的飞机)能在保持视觉质量的同时展现物理有意义的行为。

链接: https://arxiv.org/abs/2605.11266
作者: Zachary Lee,Maxwell Jacobson,Yexiang Xue
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Submitted to Artificial Intelligence. 52 pages

点击查看摘要

Abstract:Recent advances in Gaussian Splatting have enabled fast, high-fidelity 3D scene generation, yet these methods remain purely visual and lack an understanding of how shapes behave in the physical world. We introduce Physics-Guided 3D Gaussian Splatting (PG-3DGS), a framework that couples differentiable physics simulation with 3D Gaussian representations to generate 3D structures satisfying physics functionalities. By allowing physical objectives to guide the shape optimization process alongside visual losses, our approach produces geometries that are not only photometrically accurate but also physically functional. The model learns to adjust shapes so that the generated objects exhibit physically meaningful behaviors, for example, teapots that can pour and airplanes that can generate lift, without sacrificing visual quality. Experiments on pouring and aerodynamic lift tasks show that PG-3DGS improves physical functionality while preserving visual quality. In addition to simulation gains, bench-top physical lift tests with 3D-printed aircraft (Cessna, B-2 Spirit, and paper plane) under identical airflow conditions show higher scale-measured lift for PG-3DGS, generated structures than an appearance-matching baseline in all three cases. Our unified framework connects appearance-based reconstruction with physics-based reasoning, enabling end-to-end generation of 3D structures that both look realistic and function correctly.

[CV-157] DenseTRF: Texture-Aware Unsupervised Representation Adaptation for Surgical Scene Dense Prediction MICCAI2026

【速读】:该论文旨在解决手术计算机视觉中密集预测模型(如分割和手术区域预测)因训练数据分布与部署环境间的分布偏移(distribution shift)而导致的泛化能力不足问题;解决方案之关键在于提出一种名为DenseTRF的自监督表示适应框架,其核心是通过插槽注意力(slot attention)学习纹理感知表示(texture-aware representations),以捕获跨域不变的视觉结构,进而在无监督条件下将这些表示适配到目标分布,并结合基于插槽注意力的密集预测条件化(conditioning dense prediction on slot attention)与模型合并策略(model merging strategies)来增强模型对域偏移的鲁棒性,最终实现跨手术流程的泛化性能提升。

链接: https://arxiv.org/abs/2605.11265
作者: Guiqiu Liao,Matjaž Jogan,Daniel A. Hashimoto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)

点击查看摘要

Abstract:Dense prediction tasks in surgical computer vision, such as segmentation and surgical zone prediction, can provide valuable guidance for laparoscopic and robotic surgery. However, these models often suffer from distribution shifts, as training datasets rarely cover the variability encountered during deployment, leading to poor generalization. We propose DenseTRF, a self-supervised representation adaptation framework based on texture-centric attention. Our method leverages slot attention to learn texture-aware representations that capture invariant visual structures. By adapting these representations to the target distribution without supervision, DenseTRF significantly improves robustness to domain shifts. The framework is implemented through conditioning dense prediction on slot attention and model merging strategies. Experiments across multiple surgical procedures demonstrate improved cross-distribution generalization in comparison to state-of-the-art segmentation models and test-distribution adaptation methods for dense prediction tasks.
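The slot attention that DenseTRF builds on makes slots compete for input features by normalizing attention over the slot axis rather than the input axis. A simplified single iteration, without the learned projections or GRU update of the full mechanism:

```python
import numpy as np

def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified slot-attention iteration.

    slots: (S, D), inputs: (N, D). Softmax over the slot axis makes
    slots compete for each input; each slot is then updated as a
    weighted mean of the inputs it wins.
    """
    logits = inputs @ slots.T / np.sqrt(slots.shape[1])   # (N, S)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)               # softmax over slots
    w = attn / (attn.sum(axis=0, keepdims=True) + eps)    # per-slot weighted mean
    return w.T @ inputs                                   # (S, D) updated slots
```

With a single slot the competition is trivial and the slot update reduces to the mean of the inputs, a convenient sanity check.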

[CV-158] ABRA: Agent Benchmark for Radiology Applications

【速读】:该论文试图解决现有医学-agent基准测试中将医学影像作为预选样本呈现,而非作为代理必须自主导航的动态环境的问题。解决方案的关键在于引入了ABRA(一个放射学-agent基准测试),该测试使代理通过21个函数调用工具(涵盖切片导航、窗宽窗位调整、序列选择、像素坐标标注和结构化报告)操作OHIF查看器和Orthanc DICOM服务器。ABRA包含655个程序化生成的任务,覆盖三个难度等级和八种类型,并采用任务类型特定的自动评分器沿规划、执行和结果三个维度评分。通过比较真实标注任务与配对预言机变体(其中由模拟探测器提供病灶发现),该基准测试将瓶颈定位于感知(perception)而非工具编排(tool orchestration)。

链接: https://arxiv.org/abs/2605.11224
作者: Bulat Maksudov,Vladislav Kurenkov,Kathleen M. Curran,Alessandra Mileo
机构: School of Computing, Dublin City University (都柏林城市大学计算机学院); AXXX (未知); School of Medicine, University College Dublin (都柏林大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing medical-agent benchmarks deliver imaging as pre-selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology-agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty-one function-calling tools that span slice navigation, windowing, series selection, pixel-coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI-RADS reporting, and oracle variants of annotation and BI-RADS reporting), drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task-type-specific automatic scorers. Ten current models, five closed-weight and five open-weight, reach at least 89% Execution on real annotation but only 0-25% Outcome; on the paired oracle variant where a simulated detector supplies the finding, Outcome on the same task reaches 69-100% across the models evaluated, localising the bottleneck to perception rather than tool orchestration. Code, task generators, and scorers are released at this https URL

[CV-159] Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

【速读】:该论文旨在解决自动化生成临床级别(clinician-grade)外科手术评估报告的问题,其核心挑战在于如何将密集的时空视频表征与基于语言的推理(language-based reasoning)有效对齐,同时缓解高质量隐私保护数据集稀缺的瓶颈。解决方案的关键在于提出了一个感知-对齐-推理(Perception-Alignment-Reasoning)框架,其中核心创新是Hi-GaTA(一种轻量级时间适配器),它通过短到长范围时间聚合(short-to-long-range temporal aggregation)将长视频序列高效压缩为与大型语言模型(LLM)兼容的紧凑视觉前缀令牌;同时配套预训练了手术专用的ViViT风格视频编码器Sur40k(在4万分钟公开手术视频上微调),以捕获细粒度的时空过程先验;最后利用LoRA对LLM主干进行参数高效微调,在有限监督下生成连贯且风格一致的手术报告。

链接: https://arxiv.org/abs/2605.11208
作者: Kedi Sun,Chaohui Dang,Yue Feng,James Glasbey,Theodoros N. Arvanitis,Le Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.
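The idea of compressing many frame features into compact, LLM-compatible prefix tokens via gating can be illustrated with a toy gated pooling step. This is a heavily simplified sketch under assumed shapes, not Hi-GaTA's hierarchical pyramid with text-conditioned dual cross-attention:

```python
import numpy as np

def gated_temporal_pool(frames, W_gate, W_val):
    """Compress T frame features into one token via learned gating.

    frames: (T, D); W_gate: (D, 1); W_val: (D, D_out). Sigmoid gates
    weight each frame's contribution before pooling.
    """
    g = 1.0 / (1.0 + np.exp(-(frames @ W_gate)))   # (T, 1) gates in (0, 1)
    v = frames @ W_val                             # (T, D_out) values
    return (g * v).sum(axis=0) / (g.sum() + 1e-6)  # gate-weighted average
```

With zero gate weights all gates equal 0.5 and the pooling collapses to a plain temporal mean, which makes the behavior easy to verify.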

[CV-160] FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry ALT

【速读】:该论文试图探究深度神经网络中间特征表示(intermediate feature representations)的几何结构,核心问题在于理解特征空间是否近似于线性结构。解决方案之关键是:通过在输入空间施加一系列多样化的操作,包括几何与光度变换、局部遮罩以及基于生成式图像编辑模型的语义操作(semantic manipulations),然后评估在特征空间中学习从原始特征图(feature map)到被操作后特征图的映射(mapping)的可行性。研究中评估了从线性到非线性、从局部到全局的多种映射类型,并分析其重建质量和语义内容。结果表明,尽管全局模型(如Transformer)通常表现最佳,但一个作用于单特征向量的共享线性模型(shared linear model)也能以极小的重建质量损失达到类似效果,即使对于高度非线性的语义操作也是如此。通过对不同特征层的映射进行权重与偏置主导性以及线性变换有效秩(effective rank)的分析,研究间接支持了特征空间在一阶近似下由线性结构组织的假设。

链接: https://arxiv.org/abs/2605.11203
作者: Elias B. Krey,Nils Neukirch,Nils Strodthoff
机构: Division AI4Health (AI4健康部门); Carl von Ossietzky Universität Oldenburg (卡尔·冯·奥西茨基奥尔登堡大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 24 figures, 3 tables, Code is available at this https URL

点击查看摘要

Abstract:Intermediate feature representations represent the backbone for the expressivity and adaptability of deep neural networks. However, their geometric structure remains poorly understood. In this submission, we provide indirect insights into this matter by applying a broad selection of manipulations in input space, ranging from geometric and photometric transformations to local masking and semantic manipulations using generative image editing models, and assess the feasibility of learning a mapping in the feature space, mapping from the original to the manipulated feature map. To this end, we devise different types of mappings, from linear to non-linear and local to global mappings and assess both the reconstruction quality of the mapping as well as the semantic content of the mapped representations. We demonstrate the feasibility of learning such mappings for all considered transformations. While global (transformer) models that operate on the full feature map often achieve best results, we show that the same can be achieved with a shared linear model operating on a single feature vector typically with very little degradation in reconstruction quality, even for highly non-trivial semantic manipulations. We analyze the corresponding mappings across different feature layers and characterize them according to dominance of weight vs. bias and the effective rank of the linear transformations. These results provide hints for the hypothesis that the feature space is to a first degree of approximation organized in linear structures. From a broader perspective, the study demonstrates that generative image editing models might open the door to a deeper understanding of the feature space through input manipulation.
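The shared linear (affine) mapping between original and manipulated feature vectors, together with the effective-rank analysis, can be reproduced with ordinary least squares. The fitting procedure below is one straightforward way to obtain such a map; the paper's actual training setup may differ:

```python
import numpy as np

def fit_affine_feature_map(F_src, F_dst):
    """Fit a shared affine map F_dst ~ F_src @ W + b by least squares.

    F_src, F_dst: (N, D) feature vectors before/after manipulation.
    Returns W: (D, D) weight and b: (D,) bias.
    """
    A = np.hstack([F_src, np.ones((len(F_src), 1))])   # append bias column
    sol, *_ = np.linalg.lstsq(A, F_dst, rcond=None)
    return sol[:-1], sol[-1]

def effective_rank(W):
    """Entropy-based effective rank of the fitted linear map."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))
```

When the features really are related by an affine transform, the fit recovers it exactly, and the effective rank of an identity map equals its dimension.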

[CV-161] Unpacking the Eye of the Beholder: Social Location Identity and the Moving Target of Political Perspectives

【速读】:该论文试图解决的问题是:现有的计算工具(如情感分析模型)在处理政治图像时,往往生成一个单一的评分,假设图像对所有人都传达相同的意义,忽略了政治和社会身份(political and social identities)如何结构化地影响人们对政治信息的评价。解决方案的关键是开发了 Perspectivist Visual Political Sentiment (PVPS) 分类器,该分类器利用约82,000次来自5,575名美国成年人的评估进行训练,能够预测由政治和社会身份定义的受众群体对同一图像的不同评价。与标准工具不同,PVPS没有平均掉系统性的分歧,而是保留了这些分歧,返回一个评估档案(evaluative profile),记录谁同意、谁分歧以及分歧沿哪些身份线展开。通过应用于抗议图像等视觉情感研究,PVPS揭示了受众身份显著改变了感知的暴力和情感机制,因此测量政治图像的传达内容必须考虑其“移动”的对象。

链接: https://arxiv.org/abs/2605.11166
作者: Elena Sirotkina
机构: Center for Data Science (数据科学中心); New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Political and social identities structure how people evaluate political information, a finding decades deep in political science and routinely discarded by computational tools that often produce single scores that treat a piece of text, an image, or a video as if it means the same thing to everyone. This paper shows that it does not, and that the difference is consequential. To address this problem, I develop the Perspectivist Visual Political Sentiment (PVPS) classifier, which learns from approximately 82,000 evaluations by 5,575 U.S. adults to predict how audiences defined by political and social identities will evaluate the same image. Unlike standard tools that average systematic disagreement away, PVPS preserves it, returning an evaluative profile that records who agrees, who diverges, and along which identity lines. Applied to several influential studies of visual sentiment, PVPS shows that perceived violence in protest imagery and the emotional mechanisms behind protest image engagement both change substantively once audience identity is taken into account. It follows that what a political image conveys is a moving target, and measuring it requires knowing whom it is moving.

[CV-162] USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

【速读】:该论文试图解决医学图像分割中局部与全局信息融合的挑战,尤其是视觉Transformer(vision transformer)因使用标准自注意力(vanilla self-attention)而带来的二次计算复杂度问题。解决方案的关键在于提出了可扩展且高效的类Mamba注意力(Scalable and Efficient Mamba like Attention,SEMA),它利用局部窗口注意力(local window attention)实现令牌定位(token localization),从而避免标准自注意力的分散特性(dispersion property)并保持聚焦,同时辅以理论上一致的算术平均(arithmetic averaging)来捕获全局注意力。在此基础上,论文进一步设计了混合UNet架构USEMA,将卷积神经网络(CNN)的局部特征提取能力与SEMA注意力结合,从而在多种模态和图像尺寸下实现了相较基于全自注意力的Transformer模型更高的计算效率,以及相对于纯卷积和纯Mamba模型更优的分割性能。

链接: https://arxiv.org/abs/2605.11131
作者: Elisha Dayag,Nhat Thanh Tran,Jack Xin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate medical image segmentation is an integral part of the medical image analysis pipeline that requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains a struggle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and recent development of Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer based models using full self-attention, and superior segmentation performance relative to purely convolution and Mamba-based models.

[CV-163] LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

【速读】:该论文旨在解决高动态范围(HDR,High Dynamic Range)图像生成中现有生成模型输出的动态范围受限、以及基于扩散的方法需多次采样生成不同曝光图像导致计算成本高昂和曝光间结构不一致的问题。解决方案的关键是提出LatentHDR框架,通过在潜在空间中解耦场景生成与曝光建模:首先利用一个预训练的扩散骨干网络生成单一且连贯的场景潜在表示,然后由一个轻量级的条件潜在到潜在(latent-to-latent)映射头将该表示确定性地转换为不同曝光下的特定表示,从而在单次前向传播中生成密集且结构一致的曝光栈。该设计避免了多遍扩散采样,确保了跨曝光图像的结构对齐,并大幅降低了计算开销。

链接: https://arxiv.org/abs/2605.11115
作者: Pedram Fekri,WenChen Li,William Chen,Peter Altamirano
机构: Monks AI Research Lab (Monks AI研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:High Dynamic Range (HDR) generation remains challenging for generative models, which are largely limited to low dynamic range outputs. Recent diffusionbased approaches approximate HDR by generating multiple exposure-conditioned samples, incurring high computational cost and structural inconsistencies across exposures. We propose LatentHDR, a framework that decouples scene generation from exposure modeling in latent space. A pretrained diffusion backbone produces a single coherent scene representation, while a lightweight conditional latent to-latent head deterministically maps it to exposure-specific representations. This enables the generation of a dense, structurally consistent exposure stack in a single pass. This design eliminates multi-pass diffusion, ensures cross-exposure alignment, and enables scalable HDR synthesis. LatentHDR supports both textand image-conditioned HDR generation for perspective and panoramic scenes. Experiments on synthetic data and the SI-HDR benchmark show that LatentHDR achieves state-of-the-art dynamic range with competitive perceptual quality, while reducing computation by an order of magnitude. Our results demonstrate that high-quality HDR generation can be achieved through structured latent modeling, challenging the need for stochastic multi-exposure generation.

[CV-164] Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

【速读】:该论文旨在解决视觉-语言模型(VLM)中由前景-背景虚假关联(spurious correlations)导致的鲁棒性下降问题,特别是在训练数据中背景与前景存在完美(100%)相关且缺乏少数组样本时,模型容易学习到背景偏见。解决方案的关键在于利用VLM嵌入空间的高线性可加性(high linear additivity)特性,将场景表示分解为前景成分和背景成分,进而提出一种基于合成数据的预训练方法。通过在该分解框架下构造背景不变(background-invariant)的表示,模型能够在真实数据上实现强模拟到真实(sim-to-real)迁移,无需依赖真实世界的去偏数据,从而在Waterbirds数据集上首次达到超过90%的最差组准确率(worst-group accuracy)。

链接: https://arxiv.org/abs/2605.11107
作者: Youssef Zaazou,Mark Thomas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 36 pages, 7 figures

点击查看摘要

Abstract:Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding 90% on Waterbirds under perfect ( 100% ) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.
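
上述“场景嵌入 ≈ 前景分量 + 背景分量”的线性可加性假设,可用一个极简数值草图来示意(纯属说明性示例,向量与维度均为虚构,并非论文的实现;实际中背景分量需要估计,例如来自纯背景图像或文本嵌入):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy embedding dimension

# Hypothetical setup: under high linear additivity, a scene embedding
# decomposes approximately as embed(scene) ~= embed(fg) + embed(bg).
fg = rng.normal(size=d)                 # shared foreground component
backgrounds = rng.normal(size=(5, d))   # five different background components

# Synthetic scene embeddings: the same foreground on each background.
scenes = fg[None, :] + backgrounds

# Background-invariant representation: subtract the background component
# (known exactly in this toy; in practice it must be estimated).
invariant = scenes - backgrounds

# Every residual collapses onto the shared foreground vector.
err = float(np.abs(invariant - fg[None, :]).max())
print(err < 1e-9)
```

在该理想化设定下,去除背景分量后五个场景的表示完全重合于同一前景向量,这正是“背景不变表示”的直观含义。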

[CV-165] HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

【速读】:该论文试图解决视觉生成模型中长期存在的架构碎片化问题,即传统方法依赖分离的文本编码器和外部变分自编码器 (VAE),导致多模态输入处理割裂、任务泛化能力受限。解决方案的关键是提出一种原生统一的生成基础模型 HiDream-O1-Image,其核心在于采用像素空间的扩散 Transformer (Diffusion Transformer),并通过统一 Transformer (UiT) 架构将原始图像像素、文本标记和任务特定条件映射到单个共享的标记空间中。这种原生编码范式消除了对独立 VAE 或预训练文本编码器的依赖,使模型能够将文本到图像生成、基于指令的编辑、主体驱动个性化等多样任务视为一致的上下文推理过程,从而实现端到端、高度可扩展的多模态生成引擎。

链接: https://arxiv.org/abs/2605.11061
作者: Qi Cai,Jingwen Chen,Chengmin Gao,Zijian Gong,Yehao Li,Yingwei Pan,Yi Peng,Zhaofan Qiu,Kai Yu,Yiheng Zhang,Hao Ai,Siying Bai,Yang Chen,Zhihui Chen,Fengbin Gao,Ying Guo,Dong Li,Zhen Shen,Leilei Shi,Jing Wang,Siyu Wang,Yimeng Wang,Rui Zheng,Ting Yao,Tao Mei
机构: HiDream.ai
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Source codes and models are available at Github: this https URL and Huggingface: this https URL

点击查看摘要

Abstract:The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model via pixel-space Diffusion Transformer, that pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within an Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) achieves performance parity with or even surpasses established state-of-the-art models with significantly larger parameters (e.g., 27B Qwen-Image). Crucially, to validate the immense scalability of this paradigm, we successfully scale the architecture up to over 200B parameters. Experimental results demonstrate that this massive-scale version HiDream-O1-Image-Pro (200B+) unlocks unprecedented generative capabilities and superior performance, establishing new state-of-the-art benchmarks. Ultimately, HiDream-O1-Image highlights the immense potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.

[CV-166] The first global agricultural field boundary map at 10m resolution

【速读】:该论文试图解决全球范围内缺乏公开、一致且高分辨率的农业田块边界数据集的问题。现有像素级遥感产品无法直接提供田块级分析单元,而少数田块级数据产品仅覆盖欧洲部分地区或单个国家,限制了全球作物监测、粮食安全评估及下游农业科学研究的规模化应用。解决方案的关键在于:采用在Fields of The World数据集上训练的U-Net分割模型,对无云Sentinel-2影像拼接(mosaic)进行全自动推理,生成了覆盖241个国家和地区的10米分辨率全球田块边界多边形(共31.7亿个),并通过附带500米置信度层(confidence layer)来标识可靠区域,从而在无完整全球验证数据的条件下提供可用的产品。

链接: https://arxiv.org/abs/2605.11055
作者: Caleb Robinson,Gedeon Muhawenayo,Subash Khanal,Zhanpei Fang,Isaac Corley,Ana M. Tárano,Lyndon Estes,Jennifer Marcus,Nathan Jacobs,Hannah Kerner,Inbal Becker-Reshef,Juan M. Lavista Ferres
机构: Microsoft AI for Good Research Lab (微软人工智能公益研究实验室); Arizona State University (亚利桑那州立大学); Washington University in St. Louis (圣路易斯华盛顿大学); Oregon State University (俄勒冈州立大学); Taylor Geospatial (泰勒地理空间); Clark University (克拉克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The agricultural field is the natural unit at which crops are planted, managed, regulated, and reported, yet most global remote-sensing products for agriculture are only available at the pixel level. While some high-quality field-level data products exist, they come from parcel registries covering only parts of Europe or from ML-derived products for individual countries. No openly available, globally consistent map of agricultural field boundaries exists to date. Here we present the first global field boundary dataset at 10 m resolution for the years 2024 and 2025, comprising 3.17 billion remote-sensing field polygons (1.62 B in 2024 and 1.55 B in 2025) across 241 countries and territories, produced by applying a U-Net segmentation model trained on the Fields of The World dataset to cloud-free Sentinel-2 mosaics. Validated against ground-truth field boundaries in 24 countries, the map achieved a mean pixel-level recall of 0.85 with 14 countries exceeding 0.90. Evaluation against full-country ground-truth datasets in Austria, Latvia, and Finland yielded F1 scores of 0.89, 0.88, and 0.74, respectively. Because reference data for global validation is inherently incomplete, we accompanied the map with a 500 m confidence layer that identifies regions where predictions are reliable. We release the dataset openly as three global maps: the confidence-thresholded default field boundary dataset, the full unfiltered dataset, and the continuous-valued confidence raster. These maps provide the first globally consistent field-level unit of analysis for crop monitoring, food security, and downstream agricultural science.

[CV-167] Principle-Guided Supervision for Interpretable Uncertainty in Medical Image Segmentation

【速读】:该论文试图解决现有不确定性量化方法在医学图像分割中缺乏空间分布可解释性的问题,即大多数方法将不确定性简化为标量置信度估计,没有从语义上约束其空间分布,导致不确定性无法以人类可理解的方式反映歧义来源。解决方案的关键是提出一个基于证据学习的原理引导的不确定性监督框架(PriUS, principle-guided uncertainty supervision framework),该框架识别了三个与感知对齐的原则,要求不确定性的空间分布反映:(1)结构间的图像对比度(image contrast)、(2)图像退化的严重程度(severity of image corruption)、(3)解剖结构的几何复杂性(geometric complexity)。通过在训练过程中显式强制执行对应于这些原则的监督目标,并引入定量指标衡量不确定性预测与诱发歧义的图像属性之间的一致性,PriUS能够生成更符合语义认知的不确定性估计,同时保持分割性能。

链接: https://arxiv.org/abs/2605.10984
作者: An Sui,Yuzhu Li,Gunter Schumann,Fuping Wu,Xiahai Zhuang
机构: School of Data Science, Fudan University (复旦大学数据科学学院); Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University (复旦大学类脑智能科学与技术研究院); National Heart and Lung Institute, Imperial College London (帝国理工学院国家心肺研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures

点击查看摘要

Abstract:Uncertainty quantification complements model predictions by characterizing their reliability, which is essential for high-stakes decision making such as medical image segmentation. However, most existing methods reduce uncertainty to a scalar confidence estimate, leaving its spatial distribution semantically underconstrained. In this work, we focus on uncertainty interpretability, namely, whether estimated uncertainty behaves in a human-understandable manner with respect to sources of ambiguity. We identify three perception-aligned principles requiring the spatial distribution of uncertainty to reflect: (1) image contrast between structures, (2) severity of image corruption, and (3) geometric complexity in anatomical structures. Accordingly, we develop a principle-guided uncertainty supervision framework (PriUS) based on evidential learning, in which the corresponding supervision objectives are explicitly enforced during training. We further introduce quantitative metrics to measure the consistency between predicted uncertainty and image attributes that induce ambiguity. Experiments on ACDC, ISIC, and WHS datasets showed that, compared with state-of-the-art methods, PriUS produced more consistent uncertainty estimates while maintaining competitive segmentation performance.

[CV-168] TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)在对齐扩散模型(diffusion models)到下游任务时普遍存在的奖励黑客(reward hacking)问题,该问题通过引发视觉模式坍塌(visual mode collapse)和放大不可靠奖励,导致生成多样性与质量显著下降。论文指出现有方法的核心原因在于其模式寻求(mode-seeking)特性——仅最大化期望奖励而不有效约束可接受轨迹上的概率分布,从而将生成集中在少数高奖励路径上。解决方案的关键在于提出轨迹匹配策略优化(Trajectory Matching Policy Optimization, TMPO),用轨迹级奖励分布匹配(trajectory-level reward distribution matching)替代标量奖励最大化。具体而言,TMPO引入Softmax轨迹平衡(Softmax-TB)目标,将K条轨迹的策略概率匹配到奖励诱导的玻尔兹曼分布(Boltzmann distribution);理论证明该目标继承了前向KL散度(forward KL divergence)的模式覆盖(mode-covering)性质,能在优化奖励的同时保持对所有可接受轨迹的覆盖。此外,为降低大规模流匹配模型(flow-matching models)上多轨迹训练的时间开销,TMPO设计了动态随机树采样(Dynamic Stochastic Tree Sampling),使轨迹共享去噪前缀并在动态调度步骤处分叉,从而减少冗余计算并提升训练效率。在人类偏好、组合生成和文本渲染等多样化对齐任务上的广泛结果表明,TMPO在生成多样性上相较于现有最优方法提升9.1%,并在所有下游与效率指标上取得竞争性表现,实现了奖励与多样性的最优权衡。

链接: https://arxiv.org/abs/2605.10983
作者: Jiaming Li,Chenyu Zhu,Zhiyuan Ma,Nanxi Yi,Youjun Bao,Li Sun,Quanying Lv,Xiang Fang,Daizong Liu,Jianjun Li,Kun He,Bowen Zhou
机构: HUST(华中科技大学); Kuaishou Technology(快手科技); NTU(南洋理工大学); Wuhan University(武汉大学); Tsinghua University(清华大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most of them still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1%, and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
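
摘要中“将K条轨迹的策略概率匹配到奖励诱导的玻尔兹曼分布”这一思想,可用如下玩具示例粗略说明(仅为示意性草图,并非论文的Softmax-TB实现;奖励与对数概率数值均为虚构):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

tau = 1.0
rewards = np.array([2.0, 1.0, 0.5, 0.5])   # scalar reward per sampled trajectory
logp = np.array([-3.0, -3.5, -4.0, -2.5])  # policy log-prob per trajectory

target = softmax(rewards / tau)            # reward-induced Boltzmann target over K paths
policy = softmax(logp)                     # policy renormalized over the same K paths

# Forward KL(target || policy) is mode-covering: it keeps probability mass on
# every acceptable trajectory, unlike pure reward maximization, which can
# collapse onto a single high-reward path.
kl = float(np.sum(target * (np.log(target) - np.log(policy))))
print(kl >= 0.0)
```

最小化这一前向KL会促使策略在所有高奖励轨迹上保持覆盖,这与正文所述“模式覆盖”性质一致。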

[CV-169] DiffSegLung: Diffusion Radiomic Distillation for Unsupervised Lung Pathology Segmentation

【速读】:该论文旨在解决CT图像中肺部病理的无监督分割难题,主要面临两大挑战:一是缺乏多病理队列的标注数据,二是现有基于扩散的方法未能利用物理上区分组织类别的定量亨氏单位(HU)信号。解决方案的关键在于提出扩散放射组学蒸馏(Diffusion Radiomic Distillation)框架,通过将手工设计的放射组学描述符作为物理基础的教师模型,借助对比目标(contrastive objective)来引导3D扩散U-Net瓶颈层的表示学习,从而在不依赖标注的情况下将病理判别结构嵌入到潜在特征中。推理时丢弃教师,采用高斯混合模型(Gaussian Mixture Model, GMM)对多时间步瓶颈特征进行聚类,结合HU引导的标签分配,并利用Sobel扩散融合(Sobel Diffusion Fusion)进行边界细化,最终在四个异质性CT队列上实现优于现有无监督基线的分割性能。

链接: https://arxiv.org/abs/2605.11758
作者: Rezkellah Noureddine Khiati,Pierre-Yves Brillet,Catalin Fetita
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised segmentation of pulmonary pathologies in CT remains an open challenge due to the absence of annotated multi-pathology cohorts and the failure of existing diffusion-based methods to exploit the quantitative Hounsfield Unit (HU) signal that physically distinguishes tissue classes. To address this, we propose DiffSegLung, a framework that introduces Diffusion Radiomic Distillation, in which handcrafted radiomic descriptors serve as a physics-grounded teacher to shape the bottleneck of a 3D diffusion U-Net via a contrastive objective, transferring pathology-discriminative structure into the learned representation without any annotations. At inference, the teacher is discarded and multi-timestep bottleneck features are clustered by a Gaussian Mixture Model with HU-guided label assignment, followed by Sobel Diffusion Fusion for boundary refinement. Evaluated on 190 expert-annotated axial slices drawn from four heterogeneous CT cohorts, DiffSegLung improves segmentation across all four pathology classes over unsupervised baselines and improves generation fidelity over prior CT diffusion models.

[CV-170] NexOP: Joint Optimization of NEX-Aware k-space Sampling and Image Reconstruction for Low-Field MRI

【速读】:该论文试图解决低场磁共振成像(low-field MRI)中信噪比(SNR)低导致临床诊断图像质量受限,以及传统重复信号采集(NEX)方法因扫描时间过长而难以实用的问题。解决方案的关键在于提出一个名为NexOP的深度学习框架,该框架对多NEX采集(multi-NEX acquisitions)的采样与重建进行联合优化:在固定采样预算约束下,优化扩展k空间-NEX域上的采样密度概率,从而在NEX维度上自适应地分配采样资源;同时引入一种新的深度学习架构,能够从多个低SNR测量中重建出单幅高SNR图像。实验表明,NexOP在不同加速因子和组织对比度下均优于现有方法,并给出了理论分析支持其数值结果。

链接: https://arxiv.org/abs/2605.11583
作者: Tal Oved,Efrat Shimron
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Modern low-field magnetic resonance imaging (MRI) technology offers a compelling alternative to standard high-field MRI, with portable, low-cost systems. However, its clinical utility is limited by a low Signal-to-Noise Ratio (SNR), which hampers diagnostic image quality. A common approach to increase SNR is through repetitive signal acquisitions, known as NEX, but this results in excessively long scan durations. Although recent work has introduced methods to accelerate MRI scans through k-space sampling optimization, the NEX dimension remains unexploited; typically, a single sampling mask is used across all repetitions. Here we introduce NexOP, a deep-learning framework for joint optimization of the sampling and reconstruction in multi-NEX acquisitions, tailored for low-SNR settings. NexOP enables optimizing the sampling density probabilities across the extended k-space-NEX domain, under a fixed sampling-budget constraint, and introduces a new deep-learning architecture for reconstructing a single high-SNR image from multiple low-SNR measurements. Experiments with raw low-field (0.3T) brain data demonstrate that NexOP consistently outperforms competing methods, both quantitatively and qualitatively, across diverse acceleration factors and tissue contrasts. The results also demonstrate that NexOP yields non-uniform sampling strategies, with progressively decreasing sampling across repetitions, hence exploiting the NEX dimension efficiently. Moreover, we present a theoretical analysis supporting these numerical observations. Overall, this work proposes a sampling-reconstruction optimization framework highly suitable for low-field MRI, which can enable faster, higher-quality imaging with low-cost systems and contribute to advancing affordable and accessible healthcare.
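
“在固定采样预算下于k空间-NEX域分配采样密度、且采样率随重复次数递减”的思想,可以用下面的示意性草图说明(所有数值与衰减形式均为虚构,并非论文学习得到的采样方案):

```python
import numpy as np

rng = np.random.default_rng(0)
n_lines, n_nex, budget = 64, 4, 48  # phase-encode lines, NEX repetitions, sample budget

# Unnormalized density: center-weighted in k-space, geometrically
# decaying across NEX repetitions.
k = np.abs(np.arange(n_lines) - n_lines // 2)
density = np.exp(-k / 16.0)[:, None] * (0.6 ** np.arange(n_nex))[None, :]
density = density / density.sum() * budget   # enforce the budget in expectation

mask = rng.random((n_lines, n_nex)) < density  # Bernoulli sampling pattern
per_nex = mask.sum(axis=0)
print(per_nex[0] > per_nex[-1])  # early repetitions are sampled more densely
```

该草图只展示“预算约束 + 随NEX递减的密度”这一结构;NexOP中的密度本身是可学习参数,并与重建网络端到端联合优化。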

[CV-171] Deploying Self-Supervised Learning for Real Seismic Data Denoising

【速读】:该论文试图解决在没有干净参考数据的情况下,对真实地震数据进行有效去噪的难题;其解决方案的关键在于采用Noisy-as-Clean (NaC)自监督学习(SSL)方法,通过向含噪输入添加由参数控制的真实噪声,使得模型能在仅依赖含噪数据本身的情况下学习去噪映射,且性能强烈依赖于注入噪声与实际噪声特征的兼容性,最终表明NaC是一种简单、有效且与模型无关的可行方案。

链接: https://arxiv.org/abs/2605.11109
作者: Giovanny A. M. Arboleda,Claudio D. T. de Souza,Carlos E. M. dos Anjos,Lessandro de S. S. Valente,Roosevelt de L. Sardinha,Albino Aveleda,Pablo M. Barros,André Bulcão,Alexandre G. Evsukoff
机构: COPPE (COPPE); Federal University of Rio de Janeiro (里约热内卢联邦大学); CENPES (CENPES); Petrobras (巴西国家石油公司)
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) has emerged as a promising approach to seismic data denoising as it does not require clean reference data. In this work, the deployment of the Noisy-as-Clean (NaC) method was evaluated for real seismic data denoising under controlled conditions. Two independent seismic acquisitions, each comprising noisy and filtered data, were organized into four real datasets. The NaC SSL method was adapted to add real noise to the noisy input, controlled by a parameter. An experimental protocol with ten experiments was designed to compare different strategies for deploying the NaC SSL method with the supervised learning baseline, using identical network topology and hyperparameters. The models were evaluated in terms of denoising performance, computational cost, and generalization capability. The results show that the synthetic additive white Gaussian noise (AWGN) is inadequate for the denoising of seismic data within the NaC method, and performance strongly depends on the compatibility between the injected and actual noise characteristics. Furthermore, both the characteristics of the seismic data and the noise level influence the performance of the model. Self-supervised fine-tuning on test data has improved SSL performance, whereas no such gain was observed for fine-tuning of supervised models. Finally, NaC has been shown to be a simple, effective, and model-independent method that offers a feasible solution for the denoising of real seismic data.

[CV-172] SplitFed-CL: A Split Federated Co-Learning Framework for Medical Image Segmentation with Inaccurate Labels

【速读】:该论文旨在解决分割联邦学习(Split Federated Learning, SplitFed)中,因客户端间标签质量异质性(heterogeneous label quality)导致医学图像分割性能显著下降的问题。解决方案的关键是提出了一个名为 SplitFed-CL 的协同学习(co-learning)框架,其核心机制包括:一个全局教师模型(global teacher)指导各客户端的本地学生模型(local students)检测并修正不可靠的标注;可靠标签直接用于监督训练,而不可靠标签则通过加权学生-教师精炼(weighted student-teacher refinement)进行校正;引入一致性正则化(consistency regularization)以增强对输入扰动的鲁棒性;设计可训练的权重模块(trainable weighting module)自适应平衡各损失项;此外,还提出了一种新颖的难度引导策略(difficulty guided strategy),通过形状复杂度与标注难度关联的扰动程度,模拟人类在边界处的中心化注释错误。实验表明,该方法在多类分割数据集及真实错标数据集上均优于七个基线模型。

链接: https://arxiv.org/abs/2605.11060
作者: Zahra Hafezi Kafshgari,Hadi Hadizadeh,Parvaneh Saeedi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Split Federated Learning (SplitFed) combines federated and split learning to preserve privacy while reducing client-side computation. However, in medical image segmentation, heterogeneous label quality across clients can significantly degrade performance. We propose SplitFed-CL, a co-learning framework where a global teacher guides local students to detect and refine unreliable annotations. Reliable labels supervise training directly, while unreliable labels are corrected via weighted student–teacher refinement. SplitFed-CL further incorporates consistency regularization for robustness to input perturbations and a trainable weighting module to balance loss terms adaptively. We also introduce a novel difficulty guided strategy to simulate human like boundary centric annotation errors, where the degree of perturbation is governed by shape complexity and the associated annotation difficulty. Experiments on two multiclass segmentation datasets with controlled synthetic noise, together with a binary segmentation dataset containing real-world annotation errors, demonstrate that SplitFed-CL consistently outperforms seven state-of-the-art baselines, yielding improved segmentation quality and robustness.
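
对不可靠标注的加权学生-教师精炼,可用如下玩具示例粗略示意(论文中的权重由可训练模块自适应给出,此处的固定权重 w 与各概率数值均为虚构,仅作说明):

```python
import numpy as np

# For a pixel whose annotation is flagged unreliable, blend the local
# student and global teacher soft predictions instead of trusting the
# given label directly.
student = np.array([0.2, 0.7, 0.1])   # local student class probabilities
teacher = np.array([0.1, 0.2, 0.7])   # global teacher class probabilities
noisy_label = 1                        # annotation flagged as unreliable

w = 0.6                                # trust placed in the teacher (illustrative)
refined = w * teacher + (1.0 - w) * student
refined /= refined.sum()

print(int(refined.argmax()) != noisy_label)  # refinement overrides the noisy label
```

在此例中,学生与教师的加权融合将该像素的监督目标从不可靠的标注类别改写为两者共同支持的类别;可靠标注则按原文所述直接用于监督训练。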

[CV-173] Streaming of rendered content with adaptive frame rate and resolution

【速读】:该论文试图解决在带宽受限条件下流式渲染内容时,固定帧率结合分辨率调整策略导致的感知质量次优问题。解决方案的关键在于设计了一个轻量级神经网络,该网络能够根据传输带宽、场景内容及运动速度,自适应地预测最佳的帧率和分辨率组合,从而在保持低计算开销的同时显著提升人类视觉系统的感知质量;该方法与编解码器无关,仅需对现有渲染基础设施进行最小修改,并基于感知视频质量指标标注的大规模渲染数据集进行训练。

链接: https://arxiv.org/abs/2605.10995
作者: Yaru Liu,Joseph G. March,Rafal K. Mantiuk
机构: University of Cambridge(剑桥大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Streaming rendered content is an attractive way to bring high-quality graphics to billions of mobile devices that do not have sufficient rendering power. Existing solutions render content on a server at a fixed frame rate, typically 30 or 60 frames per second, and reduce resolution when bandwidth is restricted. However, this strategy leads to suboptimal rendering quality under the bandwidth constraints. In this work, we exploit the spatio-temporal limits of the human visual system to improve perceived quality while reducing rendering costs by adaptively adjusting both frame rate and resolution based on scene content and motion. Our approach is codec-agnostic and requires only minimal modifications to existing rendering infrastructure. We propose a system in which a lightweight neural network predicts the optimal combination of frame rate and resolution for a given transmission bandwidth, content, and motion velocity. This prediction significantly enhances perceptual quality while minimizing computational cost under bandwidth constraints. The network is trained on a large dataset of rendered content labeled with a perceptual video quality metric. The dataset and further information can be found at the project web page: this https URL.

[CV-174] Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising

【速读】:该论文旨在解决在高密度地震数据快速增长背景下,复杂噪声抑制面临的传统监督深度学习方法任务特定、依赖大量配对数据且易受域偏移影响的问题,以及从头预训练地震基础模型所需大量领域数据和计算资源的瓶颈。解决方案的关键在于:提出一个高效框架,通过参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)重用通用视觉基础模型(Vision Foundation Model, VFM),具体采用预训练的DINOv3编码器并结合低秩适配(Low-Rank Adaptation, LoRA)实现特征的有效迁移,仅引入少量可训练参数;同时,引入一个基于峰度引导的无监督测试时自适应模块(kurtosis-guided unsupervised test-time adaptation),该模块在推理阶段仅更新LoRA参数,通过峰度(kurtosis)识别信息丰富区域并进行自训练,无需标签即可使模型自我校准至场地特定噪声环境,从而显著提升模型在未见过的跨站点数据上的泛化能力和信噪分离性能。

链接: https://arxiv.org/abs/2605.10953
作者: Jiahua Zhao,Umair bin Waheed,Jing Sun,Yang Cui,Nikos Savva,Eric Verschuur
机构: 未知
类目: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 34 pages, 8 figures, 6 tables. Submitted to Geophysics for publication consideration

点击查看摘要

Abstract:The demand for high-resolution subsurface imaging and continuous Earth monitoring has driven rapid growth in active and passive seismic data from dense geophone deployments, distributed acoustic sensing (DAS) arrays, and large-scale 2D and 3D surveys. This expansion makes complex noise suppression increasingly challenging, especially when signal fidelity must be preserved. Conventional supervised deep learning methods are often task-specific, require large paired datasets, and can suffer from domain shift under new acquisition conditions. Foundation models offer a promising alternative, but pre-training seismic foundation models from scratch requires massive domain-specific data and substantial computation. We propose an efficient framework that repurposes general-purpose Vision Foundation Models (VFMs) for geophysical tasks through Parameter-Efficient Fine-Tuning. The architecture uses a pre-trained VFM, a DINOv3 encoder, adapted with Low-Rank Adaptation (LoRA) to enable effective feature adaptation with few additional parameters. To improve robustness under unseen field conditions without ground truth, we introduce a kurtosis-guided unsupervised test-time adaptation module that updates only LoRA parameters during inference. This module self-calibrates the model to site-specific noise by identifying information-rich regions via kurtosis and performing self-training without labeled data. Experiments on public exploration seismic images and DAS vertical seismic profiling data from the Utah FORGE site show that the framework matches or outperforms domain-specific models. Tests on unseen cross-site data from a land survey in China and the Groß Schönebeck geothermal site in Germany further demonstrate strong generalization and effective signal-noise separation. These results highlight the potential of adapting pre-trained VFMs to data-intensive problems in exploration seismology.
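
峰度(kurtosis)之所以能标记“信息丰富区域”,是因为脉冲式地震信号使窗口内的振幅分布呈重尾,而纯高斯噪声的超额峰度接近0。下面是一个示意性草图(阈值、窗口大小与数据均为虚构,并非论文的具体实现):

```python
import numpy as np

def excess_kurtosis(x):
    # Fisher definition: 0 for a Gaussian, large for heavy-tailed data.
    x = x - x.mean()
    var = (x ** 2).mean()
    return (x ** 4).mean() / (var ** 2 + 1e-12) - 3.0

rng = np.random.default_rng(1)
windows = [rng.normal(size=1024) for _ in range(3)]  # noise-only windows
spiky = rng.normal(size=1024)
spiky[::128] += 8.0                                   # sparse impulsive "signal"
windows.append(spiky)

scores = np.array([excess_kurtosis(v) for v in windows])
selected = np.flatnonzero(scores > 1.0)  # keep information-rich windows only
print(selected.tolist())
```

仅被选中的高峰度窗口参与测试时自训练,且按原文只更新LoRA参数,从而在无标签条件下把模型自校准到场地特定的噪声环境。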

[CV-175] AlphaEarth Satellite Embeddings for Modelling Climate Sensitive Diseases Towards Global Health Resilience

【速读】:该论文旨在解决低收入和中等收入国家中,由于常规健康监测稀疏且被动,加上气候变异性对疟疾、儿童急性呼吸道感染和儿童营养不良等主要死因的传播和营养结果的影响,导致缺乏可扩展、低成本的人口健康结果预测方法的问题。解决方案的关键是利用AlphaEarth Foundations的64维卫星嵌入(satellite embeddings)作为预测因子,这些嵌入通过表征地球表面的多维度特征,提供了足够空间粒度的预测价值:在疟疾预测中实现各区域R²增益,在儿童急性呼吸道感染预测中将三个基于树的估计器的合并R²从0.157提升至0.206;虽然发育迟缓预测因与固定效应共线性而暂呈中性,但后续实验需补充DHS集群级坐标以验证其潜力。

链接: https://arxiv.org/abs/2605.10949
作者: Usman Nazir,I-Han Cheng,Sara Khalid
机构: Planetary Health Informatics (PHI) Lab, University of Oxford (牛津大学)
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Visualising Climate 2026

点击查看摘要

Abstract:Malaria, childhood acute respiratory infection, and child undernutrition together account for over two million deaths annually in children under five, with the burden concentrated in low and middle-income countries where climate variability modulates transmission, exposure, and nutritional outcomes. Routine health surveillance in these settings remains sparse and reactive. Satellite-derived representations of the Earth’s surface offer a scalable, low-cost complement to traditional covariates, yet their utility as predictors of population health outcomes is poorly characterised. We summarise findings from three studies evaluating AlphaEarth Foundations 64-dimensional satellite embeddings as predictors of population health outcomes, focusing on vulnerable populations. The studies span infectious disease (malaria, respiratory infection) and stunting. In each study, embeddings provide predictive value at sufficient spatial granularity: (i) malaria prediction across Nigeria shows consistent per-region R^2 gains; (ii) childhood acute respiratory infection prediction across 11 DHS countries increases pooled R^2 from 0.157 to 0.206 across three tree-based estimators; (iii) stunting prediction across 35 countries is neutral at country level due to collinearity with fixed effects. The stunting case is currently limited by lack of DHS cluster-level coordinates, which is the next key experiment.
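摘要中报告的各区域 R² 增益可用如下最小示意计算(非论文官方代码,函数与数据组织方式为示例假设):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """决定系数 R² = 1 - SS_res / SS_tot。"""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def per_region_r2_gain(y, pred_base, pred_embed, regions):
    """加入卫星嵌入特征后,各区域预测的 R² 增益。"""
    y, regions = np.asarray(y, float), np.asarray(regions)
    return {r: r_squared(y[regions == r], np.asarray(pred_embed, float)[regions == r])
              - r_squared(y[regions == r], np.asarray(pred_base, float)[regions == r])
            for r in set(regions.tolist())}
```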

人工智能

[AI-0] Learning Fast and Slow: Towards LLMs That Adapt Continually


【速读】:该论文旨在解决大型语言模型(LLMs)在下游任务训练中面临的灾难性遗忘(catastrophic forgetting)和可塑性丧失(loss of plasticity)问题:参数更新(如强化学习)虽然能提升任务性能,但迫使模型吸收任务特定信息,损害通用能力;而上下文学习(in-context learning)虽能快速适应且不修改参数,但性能往往不如参数更新。解决方案的关键是提出快慢学习框架(Fast-Slow Training, FST),其中模型参数作为“慢权重”(slow weights),优化后的上下文作为“快权重”(fast weights)。快权重通过文本反馈(textual feedback)学习任务特定信息,慢权重则保持靠近基座模型以保留通用推理能力。这种分离机制使FST在推理任务上样本效率提升高达3倍,同时达到更高性能渐近线,且模型与基座LLM的KL散度降低70%,显著减少灾难性遗忘。在持续学习(continual learning)场景中,FST能持续获取新任务,而仅参数更新的RL则停滞。

链接: https://arxiv.org/abs/2605.12484
作者: Rishabh Tiwari,Kusha Sareen,Lakshya A Agrawal,Joseph E. Gonzalez,Matei Zaharia,Kurt Keutzer,Inderjit S Dhillon,Rishabh Agarwal,Devvrit Khatri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as “slow” weights and optimized context as “fast” weights. These fast “weights” can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
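摘要以模型与基座 LLM 的 KL 散度衡量参数漂移(FST 最多降低 70%)。下面是该度量的一个最小示意(假设已获得两模型在相同 token 位置的 logits,非官方实现):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_token_kl(policy_logits, base_logits):
    """逐 token 位置的平均 KL(policy || base)。
    数值越小说明训练后模型离基座越近,论文将其与更少的灾难性遗忘关联。"""
    p = softmax(np.asarray(policy_logits, float))
    q = softmax(np.asarray(base_logits, float))
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())
```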

[AI-1] Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

【速读】:该论文试图解决在可验证标注数据稀缺时,如何高效分配这些数据以训练部署模型的问题。标准做法是将稀疏序列级奖励(如GRPO)直接应用于部署的学生模型,但这忽略了奖励密度原则:稀疏奖励更适合于探索高效的教师模型,而密集的token级奖励更适合于将行为压缩到较小的学生模型。解决方案的关键在于一个简单的分配规则:将稀缺的标注数据用于最强的教师模型进行强化学习(RL),使其产生经过奖励塑形的行为,然后通过密集监督的桥梁(dense bridge)将该行为转移到学生模型。该桥梁具体包括对教师轨迹进行前向KL预热(forward-KL warmup),再结合对学生轨迹进行在线策略蒸馏(OPD),从而利用教师提供的密集token级监督来优化学生模型。实验表明,这种方法在固定学生模型大小时,优于直接对学生进行GRPO,并且该桥梁还能使后续的学生侧稀疏RL(如GRPO)变得更加有效,避免在准备不足的策略上浪费稀缺数据。

链接: https://arxiv.org/abs/2605.12483
作者: Yuanda Xu,Hejian Sang,Zhengze Zhou,Ran He,Zhipeng Wang,Alborz Geramifard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and OPD-style dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: use scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge is important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and also gives the best pre-Stage 3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO that is weak on a cold student lifts MATH from 75.4% to 78.5% after the bridge and outperforms a matched replay control by 2.8 points. The operational principle is to avoid using scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
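桥接阶段的第一步是在教师轨迹上做前向 KL 预热。以下为 token 级前向 KL 损失的最小示意(numpy 版本,仅说明"稠密 token 级监督"的含义,并非论文实现):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward_kl_loss(teacher_logits, student_logits):
    """token 平均的前向 KL(teacher || student)。预热阶段在教师轨迹上
    最小化该损失:每个位置都携带完整分布目标,即"稠密 token 级监督",
    与整条序列只有一个稀疏奖励形成对照。"""
    t = softmax(np.asarray(teacher_logits, float))
    s = softmax(np.asarray(student_logits, float))
    return float(np.sum(t * (np.log(t) - np.log(s)), axis=-1).mean())
```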

[AI-2] ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

【速读】:该论文旨在解决计算机使用代理(Computer Use Agents, CUAs)在混合动作空间(即同时包含原子级图形用户界面动作和高层级工具调用)中的路径选择问题:代理常因不确定何时继续执行GUI操作或切换到工具而导致执行路径次优。这一困难源于高质量交错GUI-工具轨迹的稀缺、收集真实工具轨迹的高成本与脆弱性,以及缺乏用于GUI-工具路径选择的轨迹级监督。解决方案的关键在于提出ToolCUA,一种采用分阶段训练范式的端到端代理:首先通过交错GUI-工具轨迹缩放流水线,利用大量静态GUI轨迹并合成接地工具库,以生成多样化的交错轨迹而无需手动工程或真实工具轨迹收集;其次执行工具自举的GUI RFT(Tool-Bootstrapped GUI RFT),结合热启动的监督微调与单轮强化学习改善关键切换点决策;最后在高保真GUI-工具环境中通过在线Agentic强化学习进行优化,并由工具高效路径奖励引导,以鼓励恰当的工具使用和更短的执行路径。

链接: https://arxiv.org/abs/2605.12481
作者: Xuhao Hu,Xi Zhang,Haiyang Xu,Kyle Qiao,Jingyi Yang,Xuanjing Huang,Jing Shao,Ming Yan,Jieping Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: this https URL
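论文未给出 Tool-Efficient Path Reward 的具体公式,以下仅按其目标(奖励任务成功、鼓励恰当工具使用、偏好更短执行路径)给出一个假设性的示意,其中各权重与函数形式均为虚构示例:

```python
def tool_efficient_path_reward(success, n_steps, n_tool_calls,
                               n_appropriate_tool_calls,
                               step_cost=0.02, tool_bonus=0.1,
                               misuse_cost=0.2):
    """Tool-Efficient Path Reward 的假设性示意(权重为虚构示例)。"""
    r = 1.0 if success else 0.0
    r -= step_cost * n_steps                    # 更短路径得分更高
    r += tool_bonus * n_appropriate_tool_calls  # 奖励恰当的工具调用
    r -= misuse_cost * (n_tool_calls - n_appropriate_tool_calls)  # 惩罚滥用
    return r

# 短的、恰当使用工具的成功轨迹,应优于冗长的纯 GUI 成功轨迹
short_with_tool = tool_efficient_path_reward(True, 4, 1, 1)
long_gui_only = tool_efficient_path_reward(True, 15, 0, 0)
```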

[AI-3] Reward Hacking in Rubric-Based Reinforcement Learning

【速读】:该论文旨在解决基于评分标准(rubric-based)的强化学习(Reinforcement Learning)中奖励破解(reward hacking)问题,即当策略针对训练验证器(training verifier)进行优化后,其表现无法泛化至跨族参考验证器(cross-family reference verifiers)。论文的关键解决方案包括:首先,明确分离两种导致性能偏差的来源——验证器失败(verifier failure,即训练验证器认可而参考验证器拒绝的评分标准条目)与评分标准设计限制(rubric-design limitations,即即使强验证器也倾向于选择评分标准内高分但评分标准外低质量响应);其次,引入一种无需验证器的诊断指标——自我内化差距(self-internalization gap),基于策略对数概率(log-probabilities)追踪参考验证器质量,以检测弱验证器训练何时停止改善;最后,通过实验表明,尽管更强大的验证器能显著减少但无法消除奖励破解,尤其当评分标准未覆盖关键失败模式时,评分标准增益可能与更广的质量增益脱节,从而强调必须结合更强的评分标准设计或引入评分标准外的评审机制。

链接: https://arxiv.org/abs/2605.12474
作者: Anas Mahmoud,MohammadHossein Rezaei,Zihao Wang,Anisha Gunjal,Bing Liu,Yunzhong He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.
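摘要未给出 self-internalization gap 的精确定义,下面按"基于策略对数概率的无验证器诊断"这一描述给出一种可能的读法,并示意如何据此检测训练停止改善(定义与停滞判据均为示意假设):

```python
import numpy as np

def self_internalization_gap(logprobs_rewarded, logprobs_penalized):
    """一种可能的 gap 形式(论文摘要未给出精确定义):策略赋予
    训练验证器所奖励响应与所惩罚响应的平均对数概率之差。"""
    return float(np.mean(logprobs_rewarded) - np.mean(logprobs_penalized))

def has_plateaued(gaps, window=3, tol=1e-3):
    """当 gap 在最近若干检查点上不再增长时发出停滞信号,
    对应论文中"检测弱验证器训练何时停止改善"的用途。"""
    if len(gaps) < window + 1:
        return False
    recent = gaps[-(window + 1):]
    return max(recent[1:]) - recent[0] < tol
```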

[AI-4] Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs

【速读】:该论文试图解决在需求响应(Demand Response, DR)程序中,由于离线历史数据无法捕捉电力公司定价信号与客户接受及适应之间的动态交互反馈循环,导致强化学习(Reinforcement Learning, RL)难以优化这一序贯决策过程的问题。解决方案的关键是构建一个开源的、在线Gymnasium兼容的模拟环境DR-Gym,该环境从电力公司视角出发,模拟市场级电力设置,包含校准至真实极端事件的体制切换批发价格模型、基于物理的建筑需求剖面,以及可配置的多目标奖励函数,从而在模拟中再现实体间的动态反馈,使得RL智能体能够在现实且可学习的环境中被训练和评估。

链接: https://arxiv.org/abs/2605.12462
作者: Jose E. Aguilar Escamilla,Lingdong Zhou,Xiangqi Zhu,Huazheng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Extreme weather and volatile wholesale electricity markets expose residential consumers to catastrophic financial risks, yet demand response at the distribution level remains an underutilized tool for grid flexibility and energy affordability. While a demand-response program can shield consumers by issuing financial credits during high-price periods, optimizing this sequential decision-making process presents a unique challenge for reinforcement learning despite the plentiful offline historical smart meter and wholesale pricing data available publicly. Offline historical data fails to capture the dynamic, interactive feedback loop between an electric utility’s pricing signals and customer acceptance and adaptation to a demand-response program. To address this, we introduce DR-Gym, an open-source, online Gymnasium-compatible environment designed to train and evaluate demand-response from the electric utility’s perspective. Unlike existing device-level energy simulators, our environment focuses on the market-level electric utility setting and provides a rich observational space relevant to the electric utility. The simulator additionally features a regime-switching wholesale price model calibrated to real-world extreme events, alongside physics-based building demand profiles. For our learning signal, we use a configurable, multi-objective reward function for specifying diverse learning objectives. We demonstrate through baseline strategies and data snapshots the capability of our simulator to create realistic and learnable environments.
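DR-Gym 采用 Gymnasium 兼容接口。以下用纯 Python 勾勒一个遵循 reset/step 约定的最小需求响应环境示意(两状态的体制切换价格与奖励形式均为示意替代,并非 DR-Gym 的实际实现):

```python
import random

class DemandResponseEnv:
    """Gymnasium 风格(reset/step)的最小需求响应环境示意。"""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.regime = "normal"
        self.t = 0

    def reset(self):
        self.regime, self.t = "normal", 0
        return self._obs(), {}

    def _obs(self):
        base = 40.0 if self.regime == "normal" else 400.0  # 价格尖峰体制
        self.price = base * (1.0 + 0.1 * self.rng.random())
        return {"wholesale_price": self.price, "regime": self.regime}

    def step(self, credit):
        # credit ∈ [0, 1]:通过财务抵扣削减的需求比例
        demand = 100.0 * (1.0 - 0.3 * credit)                 # 客户响应削减负荷
        utility_cost = self.price * demand + 2000.0 * credit  # 购电成本 + 抵扣支出
        reward = -utility_cost / 1e4                          # 电力公司最小化成本
        # 马尔可夫体制切换:小概率跳入价格尖峰
        if self.regime == "normal":
            self.regime = "spike" if self.rng.random() < 0.05 else "normal"
        else:
            self.regime = "normal" if self.rng.random() < 0.5 else "spike"
        self.t += 1
        return self._obs(), reward, self.t >= 24, False, {}
```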

[AI-5] CAAFC: Chronological Actionable Automated Fact-Checker for misinformation / non-factual hallucination detection and correction

【速读】:该论文试图解决现有自动事实核查(Automated Fact-Checking,AFC)系统与专业事实核查人员在实践中操作方式之间的脱节问题,以及无法有效处理AI生成内容中可能存在的幻觉(hallucinations)的问题。解决方案的关键在于提出了CAAFC(Chronological Actionable Automated Fact-Checker)框架,该框架能够对声明、对话和交谈进行操作,不仅检测事实错误和幻觉,还能通过提供基于主要信息源的可行理由来进行纠正,并且在必要时能够整合最新和上下文信息以更新证据和知识库,从而增强事实验证的可靠性。

链接: https://arxiv.org/abs/2605.12436
作者: Islam Eldifrawi,Shengrui Wang,Amine Trabelsi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the vast amount of content uploaded every hour, along with the AI generated content that can include hallucinations, Automated Fact-Checking (AFC) has become increasingly vital, as it is infeasible for human fact-checkers to manually verify the sheer volume of information generated online. Professional fact-checkers have identified several gaps in existing AFC systems, noting a misalignment between how these systems operate and how fact-checking is performed in practice. In this paper, we introduce CAAFC (Chronological Actionable Automated Fact-Checker), a framework designed to bridge these gaps. It surpasses SOTA AFC and hallucination detection systems across multiple benchmark datasets. CAAFC operates on claims, conversations, and dialogues, enabling it not only to detect factual errors and hallucinations, but also to correct them by providing actionable justifications supported by primary information sources. Furthermore, CAAFC can update evidence and knowledge bases by incorporating recent and contextual information when necessary, thereby enhancing the reliability of fact verification.

[AI-6] Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

【速读】:该论文旨在解决大型语言模型(LLMs, Large Language Models)在通过直接推理处理复杂组合优化问题时表现不佳的问题,因此探索如何利用神经符号系统让LLMs合成可执行的求解器。核心设计问题在于LLM应如何表示求解器,以及是否应同时让其优化搜索过程。解决方案的关键是采用一种保守的设计原则:将LLM的角色限定于为经过验证的求解器(如OR-Tools)形式化变量、约束和目标,而避免由LLM直接进行搜索优化;任何LLM生成的搜索优化策略必须在使用前经过单独验证,以避免因启发式陷阱(如局部近似、未验证的边界或冗余声明式建模)导致的正确性下降和性能退化。这一原则通过CP-SynC-XL基准测试中三种求解器构建范式的对比得到了实证支持:通过Python API调用OR-Tools的约束建模范式在正确性上最优,而让LLM优化搜索仅在少数实例上稍有加速,却在大量实例上造成性能倒退和正确率骤降。

链接: https://arxiv.org/abs/2605.12421
作者: Haoyu Wang,Yuliang Song,Tao Li,Zhiwei Deng,Yaqing Wang,Deepak Ramachandran,Eldan Cohen,Dan Roth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) struggle to solve complex combinatorial problems through direct reasoning, so recent neuro-symbolic systems increasingly use them to synthesize executable solvers. A central design question is how the LLM should represent the solver, and whether it should also attempt to optimize search. We introduce CP-SynC-XL, a benchmark of 100 combinatorial problems (4,577 instances), and evaluate three solver-construction paradigms: native algorithmic search (Python), constraint modeling through a Python solver API (Python + OR-Tools), and declarative constraint modeling (MiniZinc + OR-Tools). We find a consistent representational divergence: Python + OR-Tools attains the highest correctness across LLMs, while MiniZinc + OR-Tools has lower absolute coverage despite using the same OR-Tools back-end. Native Python is the most likely to return a schema-valid solution that fails verification, whereas solver-backed paths preserve higher conditional fidelity. On the heuristic axis, prompting for search optimization yields only small median speed-ups (1.03-1.12x) and a strongly bimodal effect: many instances slow down, and correctness drops sharply on a long tail of problems. A paired code-level audit traces these regressions to a recurring heuristic trap. Under an efficiency-oriented prompt, the LLM may replace complete search with local approximations (Python), inject unverified bounds (Python + OR-Tools), or add redundant declarative machinery that overwhelms or over-constrains the model (MiniZinc + OR-Tools). These findings support a conservative design principle for LLM-generated combinatorial solvers: use the LLM primarily to formalize variables, constraints, and objectives for verified solvers, and separately check any LLM-authored search optimization before use.
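论文的保守设计原则是:LLM 只负责形式化变量、约束与目标,任何 LLM 给出的搜索优化(例如一个界)在使用前须单独验证。下面用一个小型背包问题做纯 Python 示意(以完备搜索代替 OR-Tools 后端,问题实例与函数名均为示例假设):

```python
from itertools import combinations

def solve_knapsack(values, weights, capacity):
    """对形式化后的小模型做完备搜索(代替 OR-Tools 等验证过的后端)。"""
    best = 0
    items = range(len(values))
    for r in range(len(values) + 1):
        for subset in combinations(items, r):
            if sum(weights[i] for i in subset) <= capacity:
                best = max(best, sum(values[i] for i in subset))
    return best

def check_llm_bound(values, weights, capacity, claimed_upper_bound):
    """论文的保守原则:LLM 给出的界在注入求解器前必须先验证,
    这里用完备求解结果检验该上界是否成立。"""
    return solve_knapsack(values, weights, capacity) <= claimed_upper_bound

vals, wts, cap = [6, 10, 12], [1, 2, 3], 5
opt = solve_knapsack(vals, wts, cap)  # 选第 2、3 件物品,总价值 22
```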

[AI-7] Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

【速读】:该论文试图解决的问题是:在基于人类反馈的强化学习(RLHF)与偏好优化的标量化系统中,不同认知类别(如事实错误、不确定性披露、格式不满意、延迟和社会偏好等)的评估不满被压缩为通用优化信号,导致语义奖励坍塌(Semantic Reward Collapse, SRC),进而引发表演性确定性、幻觉连续性、校准漂移、阿谀奉承和可见不确定性抑制等系统性行为问题。这些问题被解释为优化后果而非欺骗。解决方案的关键在于提出宪法性奖励分层(Constitutional Reward Stratification, CRS),一种领域感知的奖励框架,旨在通过将不确定性披露和升级行为视为受保护的认知行为(而非全局惩罚的任务未完成),在自适应学习系统中保留差异化的认知归因(epistemic attribution);尽管CRS目前仅为可测试的治理导向研究方向,尚需进一步实证验证。

链接: https://arxiv.org/abs/2605.12406
作者: William Parris
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages including references. Position and framework paper. Companion empirical work available at arXiv:2604.17587

点击查看摘要

Abstract:Recent advances in reinforcement learning from human feedback (RLHF) and preference optimization have substantially improved the usability, coherence, and safety of large language models. However, recurring behaviors such as performative certainty, hallucinated continuity, calibration drift, sycophancy, and suppression of visible uncertainty suggest unresolved structural issues within scalarized preference optimization systems. We propose Semantic Reward Collapse (SRC): the compression of semantically distinct forms of evaluative dissatisfaction into generalized optimization signals. Under SRC, categories such as factual incorrectness, uncertainty disclosure, formatting dissatisfaction, latency, and social preference may become entangled within a shared reward topology despite representing fundamentally different epistemic classes. We argue that adaptive reasoning systems operating under generalized evaluative pressure may drift toward suppression of visible epistemic failure rather than preservation of calibrated uncertainty integrity. These behaviors are framed strictly as optimization consequences rather than evidence of deception or anthropomorphic agency. Drawing on institutional proxy collapse, metric gaming, software reliability engineering, and human learning theory, we propose that uncertainty disclosure and escalation behavior should be treated as protected epistemic conduct rather than globally penalized task incompletion. Finally, we introduce Constitutional Reward Stratification (CRS), a domain-aware reward framework intended to preserve differentiated epistemic attribution within adaptive learning systems. We present CRS not as a validated solution, but as a testable governance-oriented research direction requiring further empirical investigation.

[AI-8] OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

【速读】:该论文试图解决在线策略自蒸馏(On-Policy Self-Distillation, OPSD)中教师和学生响应之间普遍存在且常被忽视的失配问题:自我反思的教师响应因反射诱导的偏差(reflection-induced bias)和响应模板(response templates)而发生偏移,导致词元级监督(token-level supervision)校准不良。解决方案的关键是提出了一个结果引导的logit调整框架(outcome-guided logit-steering framework),利用可验证的结果奖励(verifiable outcome rewards)对比成功与失败的在线策略轨迹(on-policy trajectories),从而校准教师logits;通过将结果级正确性(outcome-level correctness)与密集的词元级指导(dense token-level guidance)经由logit steering相结合,该框架稳定了自蒸馏过程,并在多种基准上提升了推理性能。

链接: https://arxiv.org/abs/2605.12400
作者: Yuxiao Yang,Xiaoyun Wang,Weitong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures, 3 tables

点击查看摘要

Abstract:We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite the performance gains of OPSD, we identify a common but often overlooked mismatch between teacher and student responses: self-reflected teacher responses can be shifted by reflection-induced bias and response templates, leading to miscalibrated token-level supervision. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to contrast successful and failed on-policy trajectories and calibrate teacher logits. By combining outcome-level correctness with dense token-level guidance through logit steering, OGLS-SD stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across diverse benchmarks.
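摘要只说明了"对比成功与失败的在线策略轨迹来校准教师 logits"的思想而未给出公式,以下是一种假设性的实现示意(方向向量的构造与步长 alpha 均为假设):

```python
import numpy as np

def steer_logits(teacher_logits, succ_logits, fail_logits, alpha=1.0):
    """假设性的 logit 调整:沿"成功轨迹平均 logits 减去失败轨迹
    平均 logits"的方向平移教师 logits,成功与否由可验证的
    结果奖励标注。"""
    direction = np.mean(succ_logits, axis=0) - np.mean(fail_logits, axis=0)
    return np.asarray(teacher_logits, float) + alpha * direction
```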

[AI-9] Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

【速读】:该论文旨在解决训练神经网络时难以检测过拟合的问题,尤其是在没有训练或测试数据可用的情况下。解决方案的关键在于提出一种基于随机矩阵理论(Random Matrix Theory, RMT)的方法:对每个层级的权重矩阵进行逐元素随机化(\mathbf{W} \to \mathbf{W}_\mathrm{rand}),将随机化后的经验谱分布(empirical spectral distribution)与Marchenko-Pastur分布拟合,识别出违反自平均性(self-averaging)的显著离群值,称之为相关陷阱(Correlation Traps)。这些陷阱在过拟合起始阶段(称为“反顿悟”(anti-grokking))形成并增长,其数量和规模随测试准确率下降而增加,而训练准确率保持高位。此外,论文通过将随机数据输入训练模型并评估输出logits的JS散度(JS divergence),提供了一种区分良性陷阱与有害陷阱的实证方法,从而实现了对深层模型中潜在有害过拟合的早期检测。

链接: https://arxiv.org/abs/2605.12394
作者: Hari K. Prakash,Charles H Martin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 24 figures

点击查看摘要

Abstract:Training Neural Networks (NNs) without overfitting is difficult; detecting that overfitting is difficult as well. We present a novel Random Matrix Theory method that detects the onset of overfitting in deep learning models without access to train or test data. For each model layer, we randomize each weight matrix element-wise, \mathbf{W} \to \mathbf{W}_{\mathrm{rand}}, fit the randomized empirical spectral distribution with a Marchenko-Pastur distribution, and identify large outliers that violate self-averaging. We call these outliers Correlation Traps. During the onset of overfitting, which we call the "anti-grokking" phase in long-horizon grokking, Correlation Traps form and grow in number and scale as test accuracy decreases while train accuracy remains high. Traps may be benign or may harm generalization; we provide an empirical approach to distinguish between them by passing random data through the trained model and evaluating the JS divergence of output logits. Our findings show that anti-grokking is an additional grokking phase with high train accuracy and decreasing test accuracy, structurally distinct from pre-grokking through its Correlation Traps. More broadly, we find that some foundation-scale LLMs exhibit the same Correlation Traps, indicating potentially harmful overfitting.
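论文的检测流程可概括为:逐元素随机化权重矩阵、以 Marchenko-Pastur 分布拟合随机化后的经验谱、统计越过谱体上边界的离群值。最小示意如下(容差等数值为示例假设):

```python
import numpy as np

def correlation_trap_count(W, seed=0):
    """逐元素随机化 W 后,统计经验谱中越过 Marchenko-Pastur
    谱体上边界 sigma^2 * (1 + sqrt(q))^2 的特征值个数。
    满足自平均性时,随机化后的矩阵几乎不应有此类离群值;
    持续出现的大离群值即论文所称的 Correlation Traps。"""
    rng = np.random.default_rng(seed)
    W_rand = rng.permutation(np.asarray(W, float).ravel()).reshape(W.shape)
    n, m = W_rand.shape
    if n < m:                       # 约定 n >= m,便于定义 q = m / n
        W_rand = W_rand.T
        n, m = m, n
    q = m / n
    sigma2 = W_rand.var()
    lam_plus = sigma2 * (1.0 + np.sqrt(q)) ** 2   # MP 谱体上边界
    eigs = np.linalg.svd(W_rand, compute_uv=False) ** 2 / n
    return int(np.sum(eigs > lam_plus * 1.1))     # 10% 容差为示例假设
```

对元素独立同分布的高斯矩阵,计数应为 0;少量极端权重元素(重尾)会在随机化后仍产生离群值。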

[AI-10] Trust the Batch On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)在大模型训练中因策略更新改变数据分布而引发的结构脆弱性问题——这一脆弱性在训练与推演系统存在数值精度、采样及实现细节差异时尤为明显,而现有方法通过预先设置的超参数来耦合信任区域(trust-region)与离策略(off-policy)约束,导致算法对配置高度敏感且需随任务、模型规模或分布失配重新调参。解决方案的关键在于提出一种批量自适应目标函数(batch-adaptive objective),用策略比率的归一化有效样本量(normalized effective sample size, NESS)替代固定裁剪:该统计量同时控制得分函数权重和离策略正则化强度,使更新在比率接近均匀时保持与在线策略得分函数更新一致,并在旧数据或失配数据导致比率集中时自动收紧约束,同时保留高比率token的有效学习信号,从而无需引入新超参数并减少多个现有超参数。

链接: https://arxiv.org/abs/2605.12380
作者: Rasool Fakoor,Murdock Aubry,Nicholas Stranges,Alexander J. Smola
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to the extent that it remains reliable. Neither concern is a constant to set in advance, and their severity is reflected in the policy-ratio distribution of the current batch. We present a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer, so the update stays close to the usual on-policy score-function update when ratios are nearly uniform, and tightens automatically when stale or mismatched data cause ratio concentration, while retaining a nonzero learning signal on high-ratio tokens. Experiments across a wide range of settings show that our method matches or exceeds tuned baselines, introducing no new objective hyper-parameters and removing several existing ones. The code is available at this https URL.
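批自适应统计量为策略比率的归一化有效样本量(NESS)。以下示意其计算,以及论文所述"同一统计量同时控制得分函数权重上限与离策略正则强度"的耦合方式(具体耦合形式为假设):

```python
import numpy as np

def normalized_ess(ratios):
    """归一化有效样本量:(sum w)^2 / (N * sum w^2)。
    比率均匀时为 1.0,质量集中到单个样本时趋近 1/N。"""
    w = np.asarray(ratios, float)
    return float(w.sum() ** 2 / (len(w) * np.square(w).sum()))

def adaptive_coefficients(ratios, base_reg=1.0):
    """批自适应思想的示意(耦合形式为假设):比率接近均匀时
    退回近似在线策略更新,比率集中时自动收紧正则。"""
    ness = normalized_ess(ratios)
    score_cap = ness                         # 得分函数权重上限
    reg_strength = base_reg * (1.0 - ness)   # 离策略正则强度
    return score_cap, reg_strength
```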

[AI-11] Discrete Flow Matching for Offline-to-Online Reinforcement Learning

【速读】:该论文旨在解决两个核心挑战:一是现有基于扩散和流匹配的生成式策略方法主要面向连续控制,而大量强化学习任务具有离散动作空间;二是离线到在线强化学习中,生成式策略在通过新交互改进的同时容易丢失从离线数据中学到的有用行为。解决方案的关键在于提出了DRIFT方法,该方法通过优势加权离散流匹配损失来更新离线预训练的连续时间马尔可夫链(CTMC)策略,并引入路径空间惩罚以正则化整个CTMC轨迹分布而非仅最终动作分布,从而保留预训练知识。针对大离散动作空间,设计了候选集近似机制,从参考策略轨迹和均匀探索中采样动作子集进行actor更新,理论分析表明候选集误差受缺失目标概率质量控制,且随候选集对高概率动作的覆盖增加,CTMC生成器误差呈指数下降。实验验证了该方法在离散动作任务中实现稳定离线到在线提升,且路径空间惩罚在微调中保持有界,生成器能够比确定性基线更快适应奖励变化。

链接: https://arxiv.org/abs/2605.12379
作者: Fairoz Nower Khan,Nabuat Zaman Nahim,Peizhong Ju
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets and offline-to-online RL is itself challenging, as the policy must improve from new interaction without losing useful behavior learned from static data. To address those challenges, we introduce DRIFT, an online fine-tuning method that updates an offline pretrained continuous-time Markov chain (CTMC) policy with an advantage-weighted discrete flow matching loss. To preserve useful pretrained knowledge, we add a path-space penalty that regularizes the full CTMC trajectory distribution, rather than only the final action distribution. For large discrete action spaces, we introduce a candidate-set approximation that updates the actor over a small subset of actions sampled from reference-policy rollouts and uniform exploration. Our theoretical analysis shows that the candidate-set error is controlled by missing target probability mass, and the induced CTMC generator error decreases as the candidate set covers more high-probability actions. Experiments on prevailing discrete-action RL tasks show that our method provides stable offline-to-online improvement across all tasks, achieving the highest average score on Jericho with a simple GRU encoder while outperforming methods that use pretrained language models. Controlled experiments further confirm that the path-space penalty remains bounded during fine-tuning and that the CTMC generator adapts to shifted rewards faster than deterministic baselines. The candidate-set mechanism is supported by a stability analysis showing that the generator error decreases exponentially with candidate coverage.
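候选集近似与"缺失目标概率质量"可用如下最小示意说明(非论文实现,候选集构造细节为示例假设):

```python
import numpy as np

def build_candidate_set(ref_samples, n_actions, n_uniform, rng):
    """候选集 = 参考策略轨迹中出现过的动作 ∪ 少量均匀采样的动作,
    对应论文中针对大离散动作空间的近似。"""
    cands = set(ref_samples)
    cands.update(rng.integers(0, n_actions, size=n_uniform).tolist())
    return sorted(cands)

def missing_mass(target_probs, candidates):
    """目标分布落在候选集之外的概率质量;
    论文的理论分析表明候选集误差受该量控制。"""
    return float(1.0 - sum(target_probs[c] for c in candidates))
```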

[AI-12] ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows

【速读】:该论文试图解决现实中数据管道中表格处理(包括清洗、转换、增强、匹配等)任务中,基于LLM的方法由于指令模糊、任务结构复杂且缺乏结构化反馈(structured feedback),导致生成的代码虽然语法正确但语义上有缺陷(semantically flawed code)的问题。解决方案的关键是提出了一个基于动态分析(dynamic profiling)的自主多智能体框架ProfiliTable,它通过交互式探索(ReAct-style data exploration)、知识增强合成(knowledge-augmented synthesis)和反馈驱动迭代(feedback-driven refinement)构建并持续优化统一执行上下文(unified execution context)。具体而言,该框架集成了三个核心组件:Profiler执行数据探索以建立语义理解,Generator检索已有操作符(curated operators)生成任务感知代码,以及Evaluator-Summarizer循环注入执行评分和诊断信息实现闭环改进。实验表明,动态分析在将模糊用户意图可靠地转化为鲁棒且符合治理规范的表格变换中起着关键作用。

链接: https://arxiv.org/abs/2605.12376
作者: Wei Liu,Yang Gu,Xi Yan,Zihan Nan,Beicheng Xu,Keyao Ding,Bin Cui,Wentao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Table processing-including cleaning, transformation, augmentation, and matching-is a foundational yet error-prone stage in real-world data pipelines. While recent LLM-based approaches show promise for automating such tasks, they often struggle in practice due to ambiguous instructions, complex task structures, and the lack of structured feedback, resulting in syntactically correct but semantically flawed code. To address these challenges, we propose ProfiliTable, an autonomous multi-agent framework centered on dynamic profiling, which constructs and iteratively refines a unified execution context through interactive exploration, knowledge-augmented synthesis, and feedback-driven refinement. ProfiliTable integrates (i) a Profiler that performs ReAct-style data exploration to build semantic understanding, (ii) a Generator that retrieves curated operators to synthesize task-aware code, and (iii) an Evaluator-Summarizer loop that injects execution scores and diagnostic insights to enable closed-loop refinement. Extensive experiments on a diverse benchmark covering 18 tabular task types demonstrate that ProfiliTable consistently outperforms strong baselines, particularly in complex multi-step scenarios. These results highlight the critical role of dynamic profiling in reliably translating ambiguous user intents into robust and governance-compliant table transformations.

[AI-13] Agent-Based Post-Hoc Correction of Agricultural Yield Forecasts

【速读】:该论文试图解决在商业软水果(soft fruit)生产中,由于典型的商业农场记录缺乏传感器网络、卫星影像和高分辨率气象数据等先进方法所依赖的输入,导致作物产量预测准确性受限的问题。解决方案的关键在于提出一个结构化的大语言模型(LLM)智能体框架(structured LLM agent framework),该框架通过对现有模型预测进行事后修正(post-hoc correction),并编码农业领域知识,具体包括阶段检测(phase detection)、偏差学习(bias learning)和范围验证(range validation)等工具。在私有草莓产量数据集和公共USDA玉米收获数据集上的评估表明,该框架能够显著降低多个基线模型(XGBoost、Moirai2、Random Forest)的平均绝对误差(MAE)和平均绝对缩放误差(MASE),其中使用Llama 3.1 8B作为智能体取得了最强的校正效果,而LLaVA 13B的收益不一致,凸显了校正模型选择的重要性。

链接: https://arxiv.org/abs/2605.12375
作者: Matthew Beddows,Aiden Durrant,Georgios Leontidis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Accurate crop yield forecasting in commercial soft fruit production is constrained by the data available in typical commercial farm records, which lack the sensor networks, satellite imagery, and high-resolution meteorological inputs that most state-of-the-art approaches assume. We propose a structured LLM agent framework that performs post-hoc correction of existing model predictions, encoding agricultural domain knowledge across tools for phase detection, bias learning, and range validation. Evaluated on a proprietary strawberry yield dataset and a public USDA corn harvest dataset, agent refinement of XGBoost reduced MAE by 20% and MASE by 56% on strawberry, with consistent improvements across Moirai2 (MAE 24%, MASE 22%) and Random Forest (MAE 28%, MASE 66%) baselines. Using Llama 3.1 8B as the agent produced the strongest corrections across all configurations; LLaVA 13B showed inconsistent gains, highlighting sensitivity to the choice of refinement model.
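摘要中提到的"偏差学习 + 范围验证"两类工具,可用如下极简草图示意(假设性实现,非论文代码;函数名与数据结构均为示意):按生育阶段统计历史残差作为加性偏差,再把修正后的预测裁剪到历史观测范围内。

```python
import numpy as np

def learn_phase_bias(phases, preds, actuals):
    """Mean residual (actual - predicted) per crop phase, learned from history."""
    phases, preds, actuals = map(np.asarray, (phases, preds, actuals))
    return {ph: float((actuals[phases == ph] - preds[phases == ph]).mean())
            for ph in set(phases.tolist())}

def correct(pred, phase, bias, hist_range):
    """Additive bias correction, then clip to the historically observed range."""
    lo, hi = hist_range
    return float(np.clip(pred + bias.get(phase, 0.0), lo, hi))
```

论文中这些工具由 LLM 智能体调用并结合领域知识推理,此处只展示其数值骨架。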

[AI-14] Classifier Context Rot: Monitor Performance Degrades with Context Length

【速读】:该论文试图解决在监控编码代理(coding agents)危险行为时,当前前沿语言模型(如Opus、GPT、Gemini)在长上下文(超过500K tokens)分类任务中表现显著退化的问题:在超过800K tokens的无害活动后,模型遗漏微妙危险行为的频率比单独出现时高出2到30倍。解决方案之关键在于采用提示技术(prompting techniques),例如在整个记录中设置周期性提醒(periodic reminders),以部分缓解长上下文下的性能下降;同时指出更好的后训练(post-training)可能进一步改善效果,并强调评估监控器时必须考虑长上下文退化,否则会高估其实际性能。

链接: https://arxiv.org/abs/2605.12366
作者: Sam Martin,Fabien Roger
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monitoring coding agents for dangerous behavior using language models requires classifying transcripts that often exceed 500K tokens, but prior agent monitoring benchmarks rarely contain transcripts longer than 100K tokens. We show that when used as classifiers, current frontier models fail to notice dangerous actions more often in longer transcripts. In particular, on a dataset that requires identifying when a coding agent takes a subtly dangerous action, Opus 4.6, GPT 5.4, and Gemini 3.1 miss these actions 2\times to 30\times more often when they occur after 800K tokens of benign activity than when they occur on their own. We also show that these weaknesses can be partially mitigated with prompting techniques such as periodic reminders throughout the transcript and may be mitigated further with better post-training. Monitor evaluations that do not consider long-context degradation are likely overestimating monitor performance.
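摘要提到的"周期性提醒"缓解手段,大致相当于在长记录中每隔固定条数插入一条系统提醒(示意性写法,消息结构与字段名均为假设,并非论文实现):

```python
def inject_reminders(messages, reminder, every=10):
    """Insert a system reminder after every `every` transcript messages,
    so the monitor re-reads its instructions deep into a long context."""
    out = []
    for i, msg in enumerate(messages, start=1):
        out.append(msg)
        if i % every == 0:
            out.append({"role": "system", "content": reminder})
    return out
```

这样监控模型在 80 万 token 之后仍会周期性地"看到"自己的分类指令,从而部分缓解长上下文退化。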

[AI-15] A Family of Quaternion-Valued Differential Evolution Algorithms for Numerical Function Optimization

【速读】:该论文试图解决传统实数差分进化(Differential Evolution, DE)算法在利用替代数系(如四元数)方面尚未被充分探索的问题,旨在将DE算法拓展到四元数空间,以提升生物启发优化算法的紧凑性和准确性。解决方案的关键在于提出一系列四元数值差分进化(Quaternion-Valued Differential Evolution, QDE)算法,这些算法直接在四元数空间操作,并设计了多种新型变异策略,这些策略充分利用了四元数的代数和几何性质,从而在BBOB基准测试的多个函数类上实现了比传统实数DE更快的收敛速度和更优的性能。

链接: https://arxiv.org/abs/2605.12362
作者: Gerardo Altamirano-Gomez,Álvaro Gallardo,Carlos Ignacio Hernández Castellanos
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The numerical optimization of continuous functions is a fundamental task in many scientific and engineering domains, ranging from mechanical design to training of artificial intelligence models. Among the most effective and widely used algorithms for this purpose is Differential Evolution (DE), known for its simplicity and strong performance. Recent research has shown that adapting AI models to operate over alternative number systems-such as complex numbers, quaternions, and geometric algebras-can improve model compactness and accuracy. However, such extensions remain underexplored in bio-inspired optimization algorithms. In particular, the use of quaternion algebra represents an emerging area in computational intelligence. This paper introduces a family of novel Quaternion-Valued Differential Evolution (QDE) algorithms that operate directly in the quaternion space. We propose several mutation strategies specifically designed to exploit the algebraic and geometric properties of quaternions. Results show that our QDE variants achieve faster convergence and superior performance on several function classes in the BBOB benchmark compared to the traditional real-valued DE algorithm.
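四元数变异策略的关键是把经典 DE 的标量缩放换成四元数代数运算。下面示意 Hamilton 乘积与一种可能的 DE/rand/1 式变异(仅为示意,具体变异算子以论文为准;`mutate_rand_1` 为假设的名称):

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternions represented as (w, x, y, z)."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def mutate_rand_1(x1, x2, x3, f):
    """DE/rand/1-style mutant with a quaternion scale factor f applied
    via the Hamilton product instead of scalar multiplication."""
    return x1 + qmul(f, x2 - x3)
```

当 f 取单位四元数 (1, 0, 0, 0) 时,该变异退化为经典实值 DE 的差分项,可作为一致性检查。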

[AI-16] δ-mem: Efficient Online Memory for Large Language Models

【速读】:该论文试图解决大语言模型(LLM)在长期助手与智能体系统中高效积累和重用历史信息的问题,因为简单的上下文窗口扩展成本高昂且难以保证有效的上下文利用。解决方案的关键在于提出一种名为 δ-mem 的轻量级记忆机制:它通过一个紧凑的在线联想记忆状态(大小为固定矩阵)来增强冻结的全注意力骨干网络,利用 delta-rule 学习更新该状态矩阵以压缩历史信息,并在生成时通过状态读出生成低秩校正,直接耦合到骨干网络的注意力计算中,从而在不进行全量微调、替换骨干网络或显式扩展上下文的前提下实现有效的长期记忆。

链接: https://arxiv.org/abs/2605.12357
作者: Jingdi Lei,Di Zhang,Junxian Li,Weida Wang,Kaixuan Fan,Xiang Liu,Qihan Liu,Xiaoteng Ma,Baian Chen,Soujanya Poria
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose \delta -mem, a lightweight memory mechanism that augments a frozen full-attention backbone with a compact online state of associative memory. \delta -mem compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout to generate low-rank corrections to the backbone’s attention computation during generation. With only an 8\times8 online memory state, \delta -mem improves the average score to 1.10\times that of the frozen backbone and 1.15\times that of the strongest non- \delta -mem memory baseline. It achieves larger gains on memory-heavy benchmarks, reaching 1.31\times on MemoryAgentBench and 1.20\times on LoCoMo, while largely preserving general capabilities. These results show that effective memory can be realized through a compact online state directly coupled with attention computation, without full fine-tuning, backbone replacement, or explicit context extension.
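δ-mem 所依赖的 delta-rule 在线记忆,大致可以写成对固定尺寸状态矩阵 S 的秩一修正(示意性草图,符号与论文未必完全一致):S ← S + β(v − Sk)kᵀ,读出为 Sk。

```python
import numpy as np

def delta_update(S, k, v, beta=1.0):
    """Delta-rule update: move the value stored under key k toward v."""
    k = k / np.linalg.norm(k)            # unit-norm key
    pred = S @ k                         # current readout for this key
    return S + beta * np.outer(v - pred, k)

def readout(S, k):
    return S @ (k / np.linalg.norm(k))
```

状态大小固定(如摘要中的 8×8),与历史长度无关;论文中该读出进一步转化为对冻结骨干注意力计算的低秩修正。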

[AI-17] A New Technique for AI Explainability using Feature Association Map

【速读】:该论文旨在解决AI系统在关键现实应用中因缺乏透明度而难以建立信任的问题,其核心目标是实现对AI决策过程的解释。解决方案的关键在于提出一种名为FAMeX(Feature Association Map based eXplainability)的新型可解释AI算法,该算法基于图论方法构建特征关联映射(Feature Association Map, FAM),通过建模特征之间的关联来量化特征重要性。实验证明,与现有算法如置换特征重要性(Permutation Feature Importance, PFI)和SHapley加法解释(SHapley Additive exPlanations, SHAP)相比,FAMeX在分类任务中能更准确地评估特征贡献,从而为AI系统预测提供更优的解释。

链接: https://arxiv.org/abs/2605.12350
作者: Sayantani Ghosh,Amit Kumar Das,Amlan Chakrabarti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Lack of transparency in AI systems poses challenges in critical real-life applications. It is important to be able to explain the decisions of an AI system to ensure trust on the system. Explainable AI (XAI) algorithms play a vital role in achieving this objective. In this paper, we are proposing a new algorithm for Explaining AI systems, FAMeX (Feature Association Map based eXplainability). The proposed algorithm is based on a graph-theoretic formulation of the feature set termed as Feature Association Map (FAM). The foundation of the modelling is based on association between features. The proposed FAMeX algorithm has been found to be better than the competing XAI algorithms - Permutation Feature Importance (PFI) and SHapley Additive exPlanations (SHAP). Experiments conducted with eight benchmark algorithms show that FAMeX is able to gauge feature importance in the context of classification better than the competing algorithms. This definitely shows that FAMeX is a promising algorithm in explaining the predictions from an AI system.
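作为对照基线之一的置换特征重要性(PFI)思路很直接:打乱某一列后模型得分下降越多,该特征越重要。以下是一个自包含的示意实现(与论文的 FAMeX 无关,仅帮助理解被对比的基线):

```python
import numpy as np

def permutation_importance(score_fn, X, y, n_repeats=5, seed=0):
    """Mean score drop after shuffling each feature column independently."""
    rng = np.random.default_rng(seed)
    base = score_fn(X, y)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])        # break the column-label association
            drops.append(base - score_fn(Xp, y))
        imp[j] = float(np.mean(drops))
    return imp
```

FAMeX 则改为在特征关联图(FAM)上度量重要性,而非逐列置换。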

[AI-18] BSO: Safety Alignment Is Density Ratio Matching

【速读】:该论文试图解决语言模型在同时实现有益性(helpfulness)与安全性(safety)对齐时,现有方法流程复杂、缺乏原则性推导的问题。传统方法依赖分离的奖励与代价模型、在线强化学习及原始-对偶更新,近期直接偏好优化虽简化训练,却通过多阶段流程或启发式边际项等临时性修改引入安全性,未从原理上保证最优性。解决方案的关键在于:证明最优安全策略的似然比(likelihood ratio)存在闭式分解,将安全性对齐转化为密度比匹配问题;通过最小化数据比例与模型比例之间的Bregman散度(Bregman divergence),导出名为Bregman Safety Optimization (BSO)的单阶段损失函数族。BSO由凸生成函数诱导,理论上可恢复最优安全策略,且无需辅助模型,仅比标准偏好优化多引入一个超参数,并能将现有安全感知方法作为特例。实验表明,BSO一致地改善了安全性与有益性之间的权衡。

链接: https://arxiv.org/abs/2605.12339
作者: Tien-Phat Nguyen,Truong Nguyen,Thin Nguyen,Duy Minh Ho Nguyen,Ngoc-Thanh Dinh,Trung Le
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Aligning language models for both helpfulness and safety typically requires complex pipelines-separate reward and cost models, online reinforcement learning, and primal-dual updates. Recent direct preference optimization approaches simplify training but incorporate safety through ad-hoc modifications such as multi-stage procedures or heuristic margin terms, lacking a principled derivation. We show that the likelihood ratio of the optimal safe policy admits a closed-form decomposition that reduces safety alignment to a density ratio matching problem. Minimizing Bregman divergences between the data and model ratios yields Bregman Safety Optimization (BSO), a family of single-stage loss functions, each induced by a convex generator, that provably recover the optimal safe policy. BSO is both general and simple: it requires no auxiliary models, introduces only one hyperparameter beyond standard preference optimization, and recovers existing safety-aware methods as special cases. Experiments across safety alignment benchmarks show that BSO consistently improves the safety-helpfulness trade-off.
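BSO 的核心对象是密度比之间的 Bregman 散度 D_g(a, b) = g(a) − g(b) − g′(b)(a − b),不同的凸生成函数 g 给出损失族中的不同成员。下面用数值示意(示意性代码,非论文实现;所选的两个生成函数只是常见例子):

```python
import math

def bregman(g, g_prime, a, b):
    """Bregman divergence D_g(a, b) = g(a) - g(b) - g'(b) * (a - b)."""
    return g(a) - g(b) - g_prime(b) * (a - b)

# Squared generator g(t) = t^2 gives D_g(a, b) = (a - b)^2:
# ratio matching then reduces to least squares on the two density ratios.
lsq = bregman(lambda t: t**2, lambda t: 2*t, 1.5, 1.0)

# Generator g(t) = t*log(t) gives a KL-style divergence between ratios.
kl = bregman(lambda t: t*math.log(t), lambda t: math.log(t) + 1, 2.0, 1.0)
```

两者在 a = b 时均为零,这对应"数据比例与模型比例匹配"的最优点。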

[AI-19] Manifold Sampling via Entropy Maximization

【速读】:该论文旨在解决从受约束分布中采样时可行集(feasible set)分解为多个不连通分量(disconnected components)所带来的挑战性问题,因为现有方法假设可行集是连通的,无法直接处理这种复杂几何结构。解决方案的关键在于提出了一种名为“基于熵最大化的流形采样”(MASEM)的方法,该方法通过引入一个基于k近邻密度估计(k-nearest neighbor density estimation)的重采样方案,来最大化经验分布(empirical distribution)的熵,从而在不预先知道不连通分量数量的情况下,在由光滑等式和不等式约束隐式定义的流形上实现有效采样。理论分析表明,在平均场(mean field)近似下,MASEM能使经验分布与最大熵目标之间的KL散度(KL-divergence)随重采样步数呈指数衰减,并且结合多个局部采样器后,能在合成和机器人学基准测试中实现快速且可扩展的混合,相比替代方法在Sinkhorn距离上提升一个数量级,且运行时间具有竞争力。

链接: https://arxiv.org/abs/2605.12338
作者: Cornelius V. Braun,Tilman Burghoff,Marc Toussaint
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO)
备注:

点击查看摘要

Abstract:Sampling from constrained distributions has a wide range of applications, including in Bayesian optimization and robotics. Prior work establishes convergence and feasibility guarantees for constrained sampling, but assumes that the feasible set is connected. However, in practice, the feasible set often decomposes into multiple disconnected components, which makes efficient sampling under constraints challenging. In this paper, we propose MAnifold Sampling via Entropy Maximization (MASEM) for sampling on a manifold with an unknown number of disconnected components, implicitly defined by smooth equality and inequality constraints. The presented method uses a resampling scheme to maximize the entropy of the empirical distribution based on k-nearest neighbor density estimation. We show that, in the mean field, MASEM decreases the KL-divergence between the empirical distribution and the maximum-entropy target exponentially in the number of resampling steps. We instantiate MASEM with multiple local samplers and demonstrate its versatility and efficiency on synthetic and robotics-based benchmarks. MASEM enables fast and scalable mixing across a range of constrained sampling problems, improving over alternatives by an order of magnitude in Sinkhorn distance with competitive runtime.
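MASEM 依赖的 k 近邻密度估计与"向低密度区域重采样"可用如下草图示意(仅为示意,忽略维度常数与约束投影等步骤;函数名为假设):

```python
import numpy as np

def knn_density(X, k=3):
    """Density score ~ 1 / r_k, where r_k is the distance to the k-th
    nearest neighbor (up to dimension-dependent constants)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    r_k = np.sqrt(np.sort(d2, axis=1)[:, k])   # column 0 is the self-distance
    return 1.0 / (r_k + 1e-12)

def entropy_resample(X, k=3, seed=0):
    """Resample in favor of low-density points, flattening the empirical
    distribution and hence raising its entropy."""
    rng = np.random.default_rng(seed)
    w = 1.0 / knn_density(X, k)
    w /= w.sum()
    return X[rng.choice(len(X), size=len(X), p=w)]
```

直觉上,位于孤立可行分量上的样本密度低、重采样权重高,因此各个不连通分量都能保留概率质量。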

[AI-20] Reinforcing VLAs in Task-Agnostic World Models

【速读】:该论文试图解决现有后训练视觉-语言-动作(VLA)模型方法中,世界模型和奖励模型严重依赖任务特定数据进行微调,从而限制了模型向未见任务的可扩展性问题。解决方案的关键在于提出RAW-Dream(Reinforcing VLAs in task-Agnostic World Dreams)范式,该范式完全将世界模型学习与下游任务依赖解耦:通过一个在多样化的无任务行为上预训练的世界模型来预测未来轨迹,并利用一个现成的视觉-语言模型(VLM)进行零样本奖励生成。由于这两个组件都是任务无关的,VLA模型可以在无需任何任务特定数据的情况下,仅通过零样本想象进行微调以适应新任务。此外,为解决世界模型幻觉问题,引入了一种双噪声验证机制以过滤不可靠的轨迹,从而确保策略训练的有效性。

链接: https://arxiv.org/abs/2605.12334
作者: Yucen Wang,Rui Yu,Fengming Zhang,Junjie Lu,Xinyao Qin,Tianxiang Zhang,Kaixin Wang,Li Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.

[AI-21] Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

【速读】:该论文试图解决非管制机场(non-towered airports)飞行后安全分析中因依赖飞行员自宣布通信协议(pilot self-announcement communication protocol)而频繁出现的接近空中相撞(near mid-air collisions)问题。解决方案之关键在于提出一种通用的视觉语言模型(Vision-Language Model, VLM)方法,通过整合转录的通用交通咨询频率(Common Traffic Advisory Frequency, CTAF)无线电通信自然语言、METAR(METeorological Aerodrome Report)天气数据、ADS-B(Automatic Dependent Surveillance-Broadcast)飞行轨迹以及目视飞行规则(Visual Flight Rules, VFR)航图等多模态数据,利用大语言模型(LLM)进行联合分析,从而实现对通行权违规(right-of-way violation)等危险场景的准确识别。该框架在仅使用CTAF和METAR输入时,即便采用开源LLM,在二分类安全评估任务中也能达到超过0.85的宏F1分数,表明其具备潜在的应用价值。

链接: https://arxiv.org/abs/2605.12332
作者: Torsten Darrell,Mahyar Ghazanfari,Jordan Kam,Alexandre Bayen,Amin Tabrizian,Peng Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 17 figures, 5 tables, Accepted to AIAA 2026

点击查看摘要

Abstract:We investigate frameworks for post-flight safety analysis at non-towered airports using large language models (LLMs). Non-towered airports rely on the Common Traffic Advisory Frequency (CTAF) for air traffic coordination and experience frequent near mid-air collisions due to the pilot self-announcement communication protocol. We propose a general vision-language model (VLM) approach to analyze the transcribed CTAF radio communications in natural language, METeorological Aerodrome Report (METAR) weather data, Automatic Dependent Surveillance-Broadcast (ADS-B) flight trajectories, and Visual Flight Rules sectional charts of the airfield. We provide a preliminary study at Half Moon Bay Airport, with a qualitative real world case study and a quantitative evaluation using a new synthetic dataset of communications and weather modalities. We qualitatively evaluate our framework on real flight data using Gemini 2.5 Pro, demonstrating accurate identification of a right-of-way violation. The synthetic dataset is derived from real examples and includes a 12-category hazard taxonomy, and is used to benchmark three open-source (Qwen 2.5-7B, Mistral-7B, Gemma-2-9B) and three closed-source (GPT-4o, GPT-5.4, Claude Sonnet 4.6) LLM models on the subset of inputs related to CTAF and METAR. Even limited to CTAF and METAR inputs and open source LLMs, instances of our framework typically achieve a macro F1 score above 0.85 on a binary nominal/danger classification task. Future work includes a quantitative evaluation across all modalities and a larger number of real world examples. Taken together, our results suggest that VLM analysis of safety at non-towered airports may be a valuable future capability.

[AI-22] LISA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management

【速读】:该论文旨在解决现有交叉口管理方法(如固定周期控制、SCATS、AIM和GLOSA)在复杂动态交通环境中难以实时处理冲突路权请求、异构车辆优先级及车辆特定运动学约束的问题,尤其是当前基于大语言模型(LLM)的方法仍依赖信号基础设施或作为辅助组件,且LLM推理延迟限制了其在亚秒级控制中的应用。解决方案的关键在于提出了LISA(LLM-Based Intent-Driven Speed Advisory)——一种无信号的认知仲裁框架,利用LLM对车辆声明的意图进行推理,并整合优先级类别、队列压力和能源偏好,从而实现实时、信号自由的交叉口管理,显著降低平均控制延迟(最高89.1%)、等待时间(93%)和燃油消耗(48.8%),并在近饱和需求下维持了服务等级C。

链接: https://arxiv.org/abs/2605.12321
作者: Abderrahmane Lakas,Mohamed Amine Ferrag,Merouane Debbah
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Large language models (LLMs) show strong potential for Intelligent Transportation Systems (ITS), particularly in tasks requiring situational reasoning and multi-agent coordination. These capabilities make them well suited for cooperative driving, where rule-based approaches struggle in complex and dynamic traffic environments. Intersection management remains especially challenging due to conflicting right-of-way demands, heterogeneous vehicle priorities, and vehicle-specific kinematic constraints that must be resolved in real time. However, existing approaches typically use LLMs as auxiliary components on top of signal-based systems rather than as primary decision-makers. Signal controllers remain vehicle-agnostic, reservation-based methods lack intent awareness, and recent LLM-based systems still depend on signal infrastructure. In addition, LLM inference latency limits their use in sub-second control settings. We propose LISA (LLM-Based Intent-Driven Speed Advisory), a signal-free cognitive arbitration framework for autonomous intersection management. LISA uses an LLM to reason over declared vehicle intents, incorporating priority classes, queue pressure, and energy preferences. We evaluate LISA against fixed-cycle control, SCATS, AIM, and GLOSA across varying traffic loads. Results show that LISA reduces mean control delay by up to 89.1% and maintains Level of Service C while all non-LLM baselines degrade to Level of Service F. Under near-saturated demand, LISA reduces mean waiting time by 93% and peak queue length by 60.6% relative to fixed-cycle control. It also lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, compared to 61.2% for the best non-LLM method. These results demonstrate that LLM-based reasoning can enable real-time, signal-free intersection management. 

[AI-23] Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

【速读】:该论文旨在解决随机延迟环境下强化学习面临的两个核心问题:一是随机延迟破坏了动作与状态反馈之间的时间对应关系,导致智能体难以追踪动作效应的真实传播过程;二是跨任务场景中,任务目标和奖励公式的变化使得先前获取的任务知识难以重用。解决方案的关键在于提出一种基于隐式因果图建模的可迁移延迟感知强化学习方法(transferable delay-aware reinforcement learning method based on implicit causal graph modeling)。该方法首先通过场节点编码器(field-node encoder)将高维观测映射为具有节点级语义的潜在状态,然后利用消息传递机制(message-passing mechanism)刻画节点间的动态因果依赖关系,从而学习可迁移的结构化表示(structured representations)和环境动态知识(environment dynamics knowledge)。在此基础上,引入基于想象力的行为学习与规划(imagination-driven behavior learning and planning),在潜在空间中优化策略,实现跨任务的知识迁移与快速适应。

链接: https://arxiv.org/abs/2605.12312
作者: Chenran Zhao,Dianxi Shi,Yaowen Zhang,Chunping Qiu,Shaowu Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Random delays weaken the temporal correspondence between actions and subsequent state feedback, making it difficult for agents to identify the true propagation process of action effects. In cross-task scenarios, changes in task objectives and reward formulations further reduce the reusability of previously acquired task knowledge. To address this problem, this paper proposes a transferable delay-aware reinforcement learning method based on implicit causal graph modeling. The proposed method uses a field-node encoder to represent high-dimensional observations as latent states with node-level semantics, and employs a message-passing mechanism to characterize dynamic causal dependencies among nodes, thereby learning transferable structured representations and environment dynamics knowledge. On this basis, imagination-driven behavior learning and planning are incorporated to optimize policies in the latent space, enabling cross-task knowledge transfer and rapid adaptation. Experimental results show that the proposed method outperforms baseline methods on DMC continuous control tasks with random delays. Cross-task transfer experiments further demonstrate that the learned structured representations and dynamics knowledge can be effectively transferred to new tasks and significantly accelerate policy adaptation.
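摘要中"节点间消息传递"的一步更新可示意如下(示意性实现,激活函数与参数化均为假设,并非论文的具体网络结构):每个节点把自身状态与邻居经线性变换后的消息求和。

```python
import numpy as np

def message_passing_step(H, A, W_self, W_msg):
    """One round of message passing over adjacency A (n x n):
    h_i' = tanh(W_self h_i + sum_j A[i, j] * W_msg h_j)."""
    return np.tanh(H @ W_self.T + A @ (H @ W_msg.T))
```

论文中的隐式因果图即体现在 A 的(学习得到的)连接模式里:哪些节点向哪些节点传递消息,刻画了动态因果依赖。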

[AI-24] Executable Agentic Memory for GUI Agent

【速读】:该论文试图解决现代图形用户界面(GUI)智能体在长周期任务中因依赖“模型中心、逐步骤交互范式”而表现脆弱的问题——即大型语言模型(LLM)需在每个屏幕重新解释UI并重新决策,导致鲁棒性差、成本高。解决方案之关键在于提出了可执行智能体记忆(Executable Agentic Memory, EAM),一个结构化的知识图谱(Knowledge Graph, KG),将GUI规划从自由文本生成转变为稳健的检索-执行流程。具体通过状态感知的深度优先搜索(state-aware DFS)和动作组挖掘(action-group mining)构建样本高效的记忆流水线以压缩多步骤例程,并引入值引导的图搜索,利用轻量级Q函数模型指导蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)在知识图谱上进行高效规划,同时理论上建立了Q模型的偏差一致性并推导了路径恢复的样本复杂度界。

链接: https://arxiv.org/abs/2605.12294
作者: Zerui Qin,Sheng Yue,Xingyuan Hua,Yongjian Fu,Ju Ren
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern GUI agents typically rely on a model-centric and step-wise interaction paradigm, where LLMs must re-interpret the UI and re-decide actions at every screen, which is fragile in long-horizon tasks. In this paper, we propose Executable Agentic Memory (EAM), a structured Knowledge Graph (KG) that shifts GUI planning from free-form generation to a robust retrieval-and-execution process. Our approach includes a sample-efficient memory construction pipeline using state-aware DFS and action-group mining to compress multi-step routines. To ensure efficient planning, we introduce a value-guided graph search where a lightweight Q-function model steers Monte Carlo Tree Search (MCTS) over the KG. We theoretically establish bias-consistency for the Q-model and derive sample complexity bounds for path recovery. Empirically, EAM outperforms state-of-the-art baselines like UI-TARS-7B by up to 19.6% on AndroidWorld, while reducing token costs 6\times relative to GPT-4o. With a 2.8 s average latency, EAM enables reliable, quick, and long-horizon GUI automation.

[AI-25] PriorZero: Bridging Language Priors and World Models for Decision Making

【速读】:该论文试图解决将大语言模型(LLMs)的世界知识融入强化学习(RL)智能体时存在的先验-动态不匹配(prior-dynamics mismatch)问题:静态的LLM知识无法直接适应长周期任务的复杂转移动态,固定LLM策略会限制探索多样性,而端到端微调则面临优化不稳定和信用分配(credit assignment)困难。解决方案的关键在于提出PriorZero框架,通过解耦的滚动-训练设计(decoupled rollout-training design)实现LLM先验与基于世界模型(world model)规划的整合。具体而言,在滚动阶段,采用根节点先验注入机制(root-prior injection mechanism),仅在蒙特卡洛树搜索(MCTS)的根节点注入LLM概念先验,从而将搜索聚焦于语义上有希望的动作,同时保留世界模型的深层前瞻能力;在训练阶段,将世界模型学习与LLM适应解耦:世界模型基于交互数据持续优化其动态、策略和值预测,随后利用世界模型的值估计为LLM微调提供细粒度的信用分配信号,并通过交替优化(alternating optimization)实现稳定微调。

链接: https://arxiv.org/abs/2605.12289
作者: Junyu Xiong,Yuan Pu,Jia Tang,Yazhe Niu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages, 12 figures

点击查看摘要

Abstract:Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior-dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long-horizon tasks. Using LLM priors as fixed policies limits exploration diversity, as the prior is blind to environment-specific dynamics; while end-to-end fine-tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM-derived conceptual priors into world-model-based planning through a decoupled rollout-training design. During rollout, a novel root-prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model’s deep lookahead capability. During training, PriorZero decouples world-model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions, its value estimates are then leveraged to provide fine-grained credit assignment signals for stable LLM fine-tuning via alternating optimization. Experiments across diverse benchmarks, including text-based adventure games in Jericho and instruction-following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM-empowered decision-making. Our code is available at this https URL.
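"根节点先验注入"的思想可以用一个极简函数示意(示意性草图,混合系数 lam 为假设的超参数,非论文实现):只有在 MCTS 根节点才把 LLM 先验与世界模型策略先验按比例混合,更深的节点保持世界模型自身的前瞻。

```python
import numpy as np

def node_priors(policy_prior, llm_prior, is_root, lam=0.5):
    """Mix the LLM's conceptual prior into the search prior at the root
    node only; deeper nodes keep the world model's own policy prior."""
    if is_root and llm_prior is not None:
        p = lam * llm_prior + (1.0 - lam) * policy_prior
        return p / p.sum()
    return policy_prior
```

这样搜索在根部聚焦于语义上有希望的动作,而深层前瞻完全交给世界模型,避免静态 LLM 知识污染动态建模。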

[AI-26] Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance

【速读】:该论文试图解决多智能体大语言模型(LLM)系统中提示词规范(prompt specifications)缺乏结构化审查严谨性的问题,具体表现为跨多个相互依赖文件的数据合约与集成逻辑中普遍存在一致性缺陷,且单文件审查无法有效识别这些缺陷。解决方案之关键在于实施迭代的、智能体驱动的审计(agent-driven auditing),具体采用由Weinberg和Freedman改编的检查表驱动走查(checklist-driven walkthrough),通过多轮(9轮)审计由Claude子智能体执行,对包含约7150行提示词规范的AEGIS生产系统进行系统性扫描,从而发现单文件审查遗漏的缺陷类别,并最终提炼出一套可复现的审计协议与七类别事后缺陷分类法。

链接: https://arxiv.org/abs/2605.12280
作者: Elias Calboreanu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures, 6 tables. Companion preprint at arXiv:2604.05000 . Submitted to MDPI Software, Special Issue on Software Reliability, Security and Quality Assurance

点击查看摘要

Abstract:Prompt specifications for multi-agent large language model (LLM) systems carry data contracts and integration logic across many interdependent files but are rarely subjected to structured-inspection rigor. This paper reports a single-system empirical case study of iterative, agent-driven auditing applied to AEGIS (Autonomous Engineering Governance and Intelligence System), a production seven-lane orchestration pipeline whose prompt-specification surface comprises approximately 7150 lines: 6907 across seven lane this http URL files and a 245-line shared Ticket Contract. Nine sequential audit rounds, executed by Claude sub-agents using a checklist-driven walkthrough adapted from Weinberg and Freedman, surfaced 51 prompt-specification consistency defects, distinct from the 51 STRIDE-categorized adversarial code findings reported in the companion preprint. Per-round counts were 15, 8, 12, 2, 8, 1, 4, 1, and 0. We report a seven-category post-hoc defect taxonomy with explicit coding rules, observed non-monotonic convergence consistent with cascading edits and audit-scope expansion, and an audit protocol distilled from the study, with the final locked checklist released as a reproducibility appendix. Single-file review missed defect classes that were surfaced only by later expanded-scope rounds in this system. The same LLM family authored and audited the specifications; replication with dissimilar models and human reviewers is required before generalization.

[AI-27] NARA: Anchor-Conditioned Relation-Aware Contextualization of Heterogeneous Geoentities

【速读】:该论文旨在解决地理空间基础模型中对向量地理空间数据(vector geospatial data)的表示学习中存在的碎片化问题。现有的方法通常局限于特定的几何类型(如仅处理点、折线或多边形之一)或仅建模部分空间关系(如度量邻近性),未能统一捕获异质地理实体(如点、折线、多边形)之间的语义、几何与结构化空间关系(包括度量邻近性和拓扑关系),限制了模型对完整空间上下文的建模能力。论文提出的解决方案关键是一种名为NARA(Neural Anchor-conditioned Relation-Aware representation learning)的自监督框架,该框架通过联合建模语义、几何与空间关系,并利用锚点条件(anchor-conditioned)机制显式捕捉超越邻近性的关系空间结构(如拓扑关系),从而为异质向量地理实体学习上下文相关的、关系感知的统一表示,并在建筑功能分类、交通速度预测和下一个兴趣点推荐等任务中验证了其有效性。

链接: https://arxiv.org/abs/2605.12276
作者: Jina Kim,Gengchen Mai,Lingyi Zhao,Khurram Shafique,Yao-Yi Chiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Geospatial foundation models have primarily focused on raster data such as satellite imagery, where self-supervised learning has been widely studied. Vector geospatial data instead represent the world as discrete geoentities with explicit geometry, semantics, and structured spatial relations, including metric proximity and topological relationships. These relations jointly determine how entities interact within space, yet existing representation learning methods remain fragmented, often restricted to specific geometry types or partial spatial relations, limiting their ability to capture unified spatial context across heterogeneous geoentities. We propose NARA (Neural Anchor-conditioned Relation-Aware representation learning), a self-supervised framework for vector geoentities. NARA learns context-dependent representations by jointly modeling semantics, geometry, and spatial relations within a unified framework and captures relational spatial structure beyond proximity alone, enabling rich contextualized representations across heterogeneous geoentities of points, polylines, and polygons. Evaluation on building function classification, traffic speed prediction, and next point-of-interest recommendation shows consistent improvements over prior methods, highlighting the benefit of unified relational modeling for vector geospatial data.

[AI-28] How Useful Is Cross-Domain Generalization for Training LLM Monitors?

【速读】:该论文试图解决在使用提示语言模型(prompted language models)进行分类时,虽然能在训练数据有限的领域中进行分类,但缺乏微调(fine-tuning)所带来的鲁棒性和性能优势的问题。解决方案的关键在于提出一种混合训练策略:将多个分类任务(每个任务有其独立的提示)的训练与通用指令遵循训练(general instruction following training)相结合,从而在保持分类训练带来的性能提升的同时,缓解微调模型在特定边缘情况(如分类提示完全改变而数据领域与训练时相同)下无法遵循提示的泛化失败问题。该研究还意外地发现,这种“无思考”(no-thinking)监督分类训练能够泛化到“有思考”(with-thinking)分类和摘要任务,表明其在构建其他分类器和监控系统中的潜在实用性。

链接: https://arxiv.org/abs/2605.12265
作者: Sam Martin,Fabien Roger
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Using prompted language models as classifiers enables classification in domains with limited training data, but misses some of the robustness and performance benefits that fine-tuning can bring. We study whether training on multiple classification tasks, each with its own prompt, improves performance on new domains with new classification prompts. We show that such training partially generalizes to adjacent domains, improving classification performance on tasks that are unseen during training. However, we identify specific edge cases where the fine-tuned models fail to follow prompts, such as when the classification prompt changes completely while the data domain remains the same as during training. We show that classification training can be mixed with general instruction following training, and that (when done well) such training keeps the benefits of classification training and mitigates its generalization failures. Surprisingly, we see that this no-thinking supervised classification training can generalize to with-thinking classification and summarization, suggesting that no-thinking classification training might be instrumentally useful in building other kinds of classifiers and monitoring systems.

[AI-29] Reconnecting Fragmented Citation Networks with Semantic Augmentation

【速读】:该论文试图解决引用图(citation graphs)因文献间缺失引用而导致的碎片化问题,这种碎片化会破坏科学结构的完整性,影响基于引用的指标分析。解决方案的关键在于提出一个计算高效的混合框架,该框架将引用拓扑(citation topology)与大语言模型(Large Language Model, LLM)驱动的文本相似度相结合。具体而言,通过从小的、不连通的组件中添加语义边(semantic edges),并根据文本相似度对现有引用进行加权,从而增强原始引用图。这种语义增强能显著减少碎片化,同时保持学科同质性,并使得在增强图上使用Leiden算法进行聚类检测时,既能保留结构可解释性,又能提供多尺度组织,且方法可高效扩展到大规模数据集。

链接: https://arxiv.org/abs/2605.12263
作者: Vu Thi Huong,Annika Buchholz,Imene Khebouri,Thorsten Koch,Tim Kunt,Wolfgang Peters-Kottig,Tomasz Stompor,Janina Zittel
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Citation graphs are fundamental tools for modeling scientific structure, but are often fragmented due to missing citations of scientifically connected articles. To address this issue, we propose a computationally efficient hybrid framework integrating citation topology with large language model (LLM)-based text similarity. Using 662,369 Web of Science publications in Mathematics and Operations Research Management Science, we augment the original graph by adding semantic edges from small, disconnected components and weighting existing citations according to textual similarity. Semantic augmentation substantially reduces fragmentation while preserving disciplinary homogeneity. Compared to embedding-only clustering, cluster detection on augmented graphs using the Leiden algorithm retains structural interpretability while offering multi-scale organization. The method scales efficiently to large datasets and offers a practical strategy for strengthening citation-based indicators without collapsing disciplinary boundaries.
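下面给出一个纯 Python 的极简示意,演示摘要中“从小的不连通组件出发、按文本相似度添加语义边”的基本流程。代码为假设性草图(嵌入向量、组件规模阈值、函数名均为本文虚构,并非论文实现):

```python
import math

def cosine(u, v):
    # 两个文本嵌入向量的余弦相似度
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def components(nodes, edges):
    # 用并查集求无向引用图的连通分量
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    comps = {}
    for n in nodes:
        comps.setdefault(find(n), []).append(n)
    return list(comps.values())

def augment(nodes, edges, emb, min_size=2):
    """把小于 min_size 的小组件中的每个节点,按文本相似度
    连接到组件外最相似的节点,得到新增的语义边。"""
    new_edges = []
    for comp in components(nodes, edges):
        if len(comp) >= min_size:
            continue
        for n in comp:
            best = max((m for m in nodes if m not in comp),
                       key=lambda m: cosine(emb[n], emb[m]))
            new_edges.append((n, best))
    return new_edges
```

真实流程中嵌入来自 LLM 文本表示,且已有引用边还会按相似度加权,这里仅保留“补边减碎片化”这一步。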

[AI-30] Missingness-MDPs: Bridging the Theory of Missing Data and POMDPs

【速读】:该论文旨在解决在缺失函数(missingness function)未知的情况下,为缺失MDP(miss-MDP)计算近似最优策略的规划问题。miss-MDP是一类特殊的POMDP,其观测函数即为缺失函数,给出各状态特征在某一时刻缺失(未被观测)的概率;在仅有动作-观测轨迹数据集的条件下,要为策略提供最优性保证就必须从数据中学习缺失函数,而这对一般POMDP是不可行的。解决方案的关键在于利用三种规范缺失类型(完全随机缺失MCAR、随机缺失MAR、非随机缺失MNAR)的结构特性,设计概率近似正确(PAC)算法以从数据中学习缺失函数,从而构建一个近似但完全指定的miss-MDP,再通过现成规划方法求解;理论证明所得策略以高概率在真实miss-MDP中达到ε-最优。

链接: https://arxiv.org/abs/2605.12262
作者: Joshua Wendland,Markel Zubia,Roman Andriushchenko,Maris F. L. Galesloot,Milan Ceska,Henrik von Kleist,Thiago D. Simao,Maximilian Weininger,Nils Jansen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce missingness-MDPs (miss-MDPs), a novel subclass of partially observable Markov decision processes (POMDPs) that incorporates the theory of missing data. A miss-MDP is a POMDP whose observation function is a missingness function, specifying the probability that individual state features are missing (i.e., unobserved) at a time step. The literature distinguishes three canonical missingness types: missing (1) completely at random (MCAR), (2) at random (MAR), and (3) not at random (MNAR). Our planning problem is to compute near-optimal policies for a miss-MDP with an unknown missingness function, given a dataset of action-observation trajectories. Achieving such optimality guarantees for policies requires learning the missingness function from data, which is infeasible for general POMDPs. To overcome this challenge, we exploit the structural properties of different missingness types to derive probably approximately correct (PAC) algorithms for learning the missingness function. These algorithms yield an approximate but fully specified miss-MDP that we solve using off-the-shelf planning methods. We prove that, with high probability, the resulting policies are epsilon-optimal in the true miss-MDP. Empirical results confirm the theory and demonstrate superior performance of our approach over two model-free POMDP methods.
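以下用一个玩具示例说明 MCAR 的结构性质为何便于从数据中估计缺失函数:在 MCAR 下特征是否缺失与其取值无关,因此缺失率可直接用频率估计。代码为假设性示意(与论文的 PAC 算法无关,函数名与参数均为本文虚构):

```python
import random

def mcar_observe(state, p_miss, rng):
    # MCAR:每个状态特征以固定概率 p_miss 缺失(记为 None),与取值无关
    return [None if rng.random() < p_miss else x for x in state]

def estimate_mcar_rate(observations):
    # 在 MCAR 假设下,用缺失条目的频率一致地估计缺失率
    total = sum(len(o) for o in observations)
    missing = sum(1 for o in observations for x in o if x is None)
    return missing / total
```

MAR/MNAR 情形下缺失概率依赖于(可观测或不可观测的)状态,估计需利用相应的结构假设,这正是论文分型讨论的原因。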

[AI-31] Why Conclusions Diverge from the Same Observations: Formalizing World-Model Non-Identifiability via an Inference Profile

【速读】:该论文试图解决的核心问题是:为何共享相同文档与观察的个体常常得出截然不同的结论,而此类分歧通常被错误归因为对方的认知缺陷或恶意,实则源于推理与学习过程中固有的非可识别性(non-identifiability)。解决方案之关键在于:通过引入推理配置文件(inference profile)θ = (R, E, S, D)(即参考、探索、稳定与视野),形式化不同推理设置的差异;并将非可识别性划分为两个层次——θ-level(相同世界模型W下因推理配置不同导致的结论分歧)和 W-level(重复使用同一推理配置θ导致数据暴露与更新规则偏置进而使习得的世界模型W本身分化)。此外,论文还指出分歧倾向于投射到少数基(如抽象对具体、外部化、秩序对自由)上,这一现象是学习系统中计算、观察与协调三类通用约束的必然结果,从而将机制性解释与深度表示学习中的层次表征、隐状态估计及正则化-探索权衡等概念相联系。

链接: https://arxiv.org/abs/2605.12255
作者: Toru Takahashi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 12 pages, 2 figures, 1 table. Extended English version of a paper accepted for presentation at JSAI 2026

点击查看摘要

Abstract:When people share the same documents and observations yet reach different conclusions, the disagreement often shifts into a judgment that the other party is cognitively defective, irrational, or acting in bad faith. This paper argues that such divergence is better described as a form of non-identifiability inherent in inference and learning, rather than as a defect of the other party. We organize the phenomenon into two levels: (i) \theta -level non-identifiability, where conclusions diverge under the same world model W because inference settings differ; and (ii) W -level non-identifiability, where repeated use of an inference setting \theta biases data exposure and update rules, causing the learned world model W itself to diverge. We introduce an inference profile \theta = (R, E, S, D) , consisting of Reference, Exploration, Stabilization, and Horizon, and show how outputs can split even for the same observation o and the same W . We further explain why disagreements tend to project onto a small number of bases – abstract versus concrete, externalizability, and order versus freedom – as a consequence of general constraints on learning systems: computational, observational, and coordination constraints. Finally, we relate the framework to deep representation learning, including representation hierarchy, latent-state estimation, and regularization-exploration trade-offs, and illustrate the framework through a case study on AI regulation debates.

[AI-32] No Action Without a NOD: A Heterogeneous Multi-Agent Architecture for Reliable Service Agents

【速读】:该论文旨在解决大型语言模型(LLM)智能体在长周期任务中存在的不可靠性问题,具体表现为策略违规(policy violations)、工具幻觉(tool hallucinations)以及行动与用户意图不对齐(misaligned actions),这些问题严重阻碍了服务智能体的实际部署。解决方案的关键在于提出一种名为NOD(Navigator-Operator-Director)的异构多智能体架构:一方面,通过外部化(externalize)结构化的全局状态(Global State),使得Navigator智能体能够进行显式任务状态跟踪和一致决策,替代以往在对话上下文中隐式维护任务状态的方式;另一方面,引入选择性外部监督(selective external oversight),在关键行动前由独立的Director智能体验证执行并在必要时进行干预。这种设计有效抑制了长周期任务中的错误传播和不安全行为,从而提升了服务智能体的可靠性。

链接: https://arxiv.org/abs/2605.12240
作者: Zixu Yang,Hang Zheng,Nan Jiang,Zhiyang Tang,Situo Zhang,Xiaobao Wu,Lu Chen,Kai Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents have increasingly advanced service applications, such as booking flight tickets. However, these service agents suffer from unreliability in long-horizon tasks, as they often produce policy violations, tool hallucinations, and misaligned actions, which greatly impedes their real-world deployment. To address these challenges, we propose NOD (Navigator-Operator-Director), a heterogeneous multi-agent architecture for service agents. Instead of maintaining task state implicitly in dialogue context as in prior work, we externalize a structured Global State to enable explicit task state tracking and consistent decision-making by the Navigator. Besides, we introduce selective external oversight before critical actions, allowing an independent Director agent to verify execution and intervene when necessary. As such, NOD effectively mitigates error propagation and unsafe behavior in long-horizon tasks. Experiments on \tau^2 -Bench demonstrate that NOD achieves higher task success rates and critical action precision over baselines. More importantly, NOD improves the reliability of service agents by reducing policy violations, tool hallucinations, and user-intent misalignment.
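下面是一个高度简化的骨架,示意“关键动作在改写外部化 Global State 之前,须先经独立 Director 校验”这一监督机制。动作名、状态字段均为本文虚构,并非论文实现:

```python
CRITICAL_ACTIONS = {"book_flight", "charge_card"}  # 假设的关键动作集合

def execute(action, global_state, director_check):
    """关键动作先交给独立的 Director 校验;未通过则拦截并记录,
    通过(或非关键动作)才允许更新外部化的 Global State。"""
    if action["name"] in CRITICAL_ACTIONS and not director_check(action, global_state):
        global_state["violations"] = global_state.get("violations", 0) + 1
        return "blocked"
    global_state.setdefault("log", []).append(action["name"])
    return "done"
```

关键点在于任务状态存于显式结构(dict)而非对话上下文,Director 与执行者相互独立,从而抑制错误传播。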

[AI-33] Harness Engineering as Categorical Architecture

【速读】:该论文旨在解决大型语言模型(LLM)智能体(agent)中系统层(即 harness,包含提示、工具、记忆和编排逻辑)设计缺乏形式化理论的问题,具体表现为 harness 的组合、编译下属性的保持以及跨框架的系统性比较均处于临时(ad hoc)状态。解决方案的关键在于引入 ArchAgents 框架中的范畴架构三元组 (G, Know, Phi),将智能体外化的四个支柱(记忆、技能、协议、harness 工程)分别映射为余代数状态、operad 组合对象、语法布线 G 以及架构本身,从而为 harness 工程提供形式化依据。具体实现中,结构保障(如完整性门、基于质量的升级、支持收敛检查)被视为 Know 层面的证书(certificate),其保持机制通过编译器函子(compiler functor)的结构性重放(structural replay)来完成,即编译器检查恒等和验证器重放,而非输出层正确性或模型行为。论文通过参考实现(包括针对 Swarms、DeerFlow、Ralph、Scion 和 LangGraph 的编译器函数子)验证了这一对应关系,表明这种范畴架构能够作为 harness 工程背后的形式化理论。

链接: https://arxiv.org/abs/2605.12239
作者: Bogdan Banu
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Category Theory (math.CT)
备注:

点击查看摘要

Abstract:The agent harness, the system layer comprising prompts, tools, memory, and orchestration logic that surrounds the model, has emerged as the central engineering abstraction for LLMbased agents. Yet harness design remains ad hoc, with no formal theory governing composition, preservation of properties under compilation, or systematic comparison across frameworks. We show that the categorical Architecture triple (G, Know, Phi) from the ArchAgents framework provides exactly this formalization. The four pillars of agent externalization (Memory, Skills, Protocols, Harness Engineering) map onto the triple’s components: Memory as coalgebraic state, Skills as operad-composed objects, Protocols as syntactic wiring G, and the full Harness as the Architecture itself. Structural guarantees-integrity gates, quality-based escalation, supported convergence checks-are Know-level certificates whose preservation is structural replay: our compiler checks identity and verifier replay, not output-layer correctness or model behavior. We validate this correspondence with a reference implementation featuring compiler functors targeting Swarms, DeerFlow, Ralph, Scion, and LangGraph: the four configuration compilers preserve three named certificate types by identity or replay, and LangGraph preserves the same certificates through its shared per-stage execution path. The LangGraph compiler creates one node per stage using the same per-stage method as the native runtime, providing LangGraph-native observability without reimplementing harness logic. An end-to-end escalation experiment with real LLM agents confirms that the quality-based escalation control path is model-parametric in this two-model, one-task experiment. The result positions categorical architecture as the formal theory behind harness engineering.

[AI-34] MRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

【速读】:该论文试图解决使用行为克隆(Behavioral Cloning, BC)预训练机器人策略后,在进行强化学习(Reinforcement Learning, RL)微调时因预训练策略的动作分布狭窄而缺乏下游探索覆盖范围的问题。解决方案的关键在于提出一个统一框架,该框架由两部分组成:预训练阶段采用上下文平滑预训练(Context-Smoothed Pre-training, CSP),通过向策略输入注入前向扩散噪声,在精确模仿与广泛动作覆盖之间构建连续体;微调阶段则采用时间步调制强化学习(Timestep-Modulated Reinforcement Learning, TMRL),训练智能体在微调过程中动态调整扩散时间步以显式控制探索程度。该框架可无缝集成任意策略输入(如状态、3D点云或基于图像的VLA策略),显著提升RL微调的样本效率,并在复杂操作任务上实现不到一小时的实机微调。

链接: https://arxiv.org/abs/2605.12236
作者: Matthew M. Hong,Jesse Zhang,Anusha Nagabandi,Abhishek Gupta
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that enables the exploration necessary to enable efficient robot policy finetuning by bridging BC pre-training and RL fine-tuning. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune pre-trained policies via Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine-tuning by modulating the diffusion timestep, granting explicit control over exploration. Integrating seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, we show that TMRL improves RL fine-tuning sample efficiency. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code available at this https URL.
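摘要中“向策略输入注入前向扩散噪声”对应标准的前向扩散公式 x_t = √ᾱ_t·x + √(1−ᾱ_t)·ε:t = 0 时输入不变(精确模仿),t 越大覆盖越宽。下面是一个假设性的最小示意(噪声日程与参数均为示例,并非 CSP 的实际实现):

```python
import math, random

def make_alpha_bars(T, beta_max=0.02):
    # 线性噪声日程;beta_0 = 0 使得 t = 0 时完全不加噪
    betas = [beta_max * t / (T - 1) for t in range(T)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return alpha_bars

def forward_diffuse(x, t, alpha_bars, rng):
    # 上下文平滑:x_t = sqrt(abar_t) * x + sqrt(1 - abar_t) * eps
    a = alpha_bars[t]
    return [math.sqrt(a) * xi + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for xi in x]
```

TMRL 微调阶段即是让智能体学会动态选择这里的 t,以显式控制探索幅度。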

[AI-35] No More No Less: Task Alignment in Terminal Agents

【速读】:该论文试图解决的问题是:现有终端智能体(terminal agents)在自主执行复杂、长期任务时,无法准确区分环境指令中的相关信息与无关或误导信息,即缺乏任务对齐(task alignment)能力。现有基准未能捕捉这种选择性遵循指令的能力,导致智能体可能盲目跟随所有指令或完全忽略所有指令,从而无法真实反映其能力。解决方案的关键在于提出了TAB(Task Alignment Benchmark),它从Terminal-Bench 2.1中提取了89个终端任务,每个任务都故意未完全指定,缺失信息以自然人工制品的必要提示(cue)形式嵌入,同时包含一个看似合理但无关的干扰项(distractor)。解决这些任务要求智能体必须选择性使用必要提示并忽略干扰项,从而有效评估智能体对环境中指令的选择性遵循能力。

链接: https://arxiv.org/abs/2605.12233
作者: Sina Mavali,David Pape,Jonathan Evertz,Samira Abedini,Devansh Srivastav,Thorsten Eisenhofer,Sahar Abdelnabi,Lea Schönherr
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Terminal agents are increasingly capable of executing complex, long-horizon tasks autonomously from a single user prompt. To do so, they must interpret instructions encountered in the environment (e.g., README files, code comments, stack traces) and determine their relevance to the task. This creates a fundamental challenge: relevant cues must be followed to complete a task, whereas irrelevant or misleading ones must be ignored. Existing benchmarks do not capture this ability. An agent may appear capable by blindly following all instructions, or appear robust by ignoring them altogether. We introduce TAB (Task Alignment Benchmark), a suite of 89 terminal tasks derived from Terminal-Bench 2.1. Each task is intentionally underspecified, with missing information provided as a necessary cue embedded in a natural environmental artifact, alongside a plausible but irrelevant distractor. Solving these tasks requires selectively using the cue while ignoring the distractor. Applying TAB to ten frontier agents reveals a systematic gap between task capability and task alignment. The strongest Terminal-Bench agent achieves high task completion but low task alignment on TAB. Evaluating six prompt-injection defenses further shows that suppressing distractor execution also suppresses the cues required for task completion. These results demonstrate that task-aligned agents require selective use of environmental instructions rather than blanket acceptance or rejection.

[AI-36] Heterogeneous SoC Integrating an Open-Source Recurrent SNN Accelerator for Neuromorphic Edge Computing on FPGA ECML-PKDD2024

【速读】:该论文旨在解决数字神经形态硬件推广中因硅片流片成本高昂而受限的问题,并提出利用现场可编程门阵列(Field Programmable Gate Array, FPGA)作为灵活且低成本的替代平台,以加速开源神经形态硬件设计。解决方案的关键在于构建一个异构系统级芯片(System-on-Chip, SoC),通过将递归脉冲神经网络加速器ReckOn与传统的RISC-V开源微控制器X-HEEP或Zynq Ultrascale系统的ARM处理器集成,实现加速器与通用处理器的协同管理。研究通过FPGA实现流片版ReckOn的等效验证,确保分类精度和物理实现特性的一致性,并利用Braille数字数据集子集评估在线学习能力,从而证明该异构架构在保持低功耗和并行计算优势的同时,具备实际部署的可行性。

链接: https://arxiv.org/abs/2605.12217
作者: Michelangelo Barocci,Vittorio Fra,Enrico Macii,Gianvito Urgese
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: Deep Learning meets Neuromorphic Hardware Workshop at ECML-PKDD 2024 Conference in Vilnius, Lithuania

点击查看摘要

Abstract:The growing popularity of Spiking Neural Networks (SNNs) and their applications has led to a significant fast-paced increase of neuromorphic architectures capable of mimicking the spike-based data processing typical of biological neurons. The efficient power consumption and parallel computing capabilities of the SNNs lead researchers towards the development of digital accelerators, which exploit such features to bring fast and low-power computation on edge devices. The spread of digital neuromorphic hardware however is slowed down by the prohibitive costs that the silicon tape out of circuits brings, that’s why targeting Field Programmable Gate Arrays (FPGAs) could represent a viable alternative, offering a flexible and cost-effective platform for implementing digital neuromorphic systems and helping the spread of open-source hardware designs. In this work we present an heterogeneous System-on-Chip (SoC) where the operations of ReckOn, a Recurrent SNN accelerator, are managed through the integration with traditional processors. These include the RISC-V-based, open-source microcontroller X-HEEP and the ARM processor featured in Zynq Ultrascale systems. We validate our design by reproducing the classification results through the implementation on FPGA of the taped-out version of ReckOn in order to check the equivalence of the accuracy and the characteristics in terms of physical implementation. In a second set of experiments, we evaluate the online learning capability of the solution in classifying a subset of the Braille digit dataset recently used to compare neuromorphic frameworks and platforms.

[AI-37] Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

【速读】:该论文旨在解决基于大语言模型(LLM)的对话智能体在长期交互中因上下文有限而难以维持连贯行为的问题。现有方法通过RAG(检索增强生成)将交互存储于外部记忆模块并基于语义相似性检索,但在应对多跳推理、常识等挑战性问题时,因缺乏对缺失中间事实的显式推理,常返回不相关或不充分的证据。解决方案的关键在于提出Goal-Mem,一种面向目标的推理框架:它从用户话语作为目标出发,执行显式的反向链式推理(backward chaining),将目标分解为原子子目标,对每个子目标进行针对性记忆检索,并在中间目标无法满足时迭代识别应检索的信息。该过程形式化为自然语言逻辑(Natural Language Logic),结合一阶逻辑的可验证性与自然语言的表达力,从而在无需逐步扩展已检索上下文的情况下,系统性地补全推理路径。实验表明,该方法在多跳推理和隐式推理任务上显著优于九种强基线方法。

链接: https://arxiv.org/abs/2605.12213
作者: Jiazhou Liang,Armin Toroghi,Yifan Simon Liu,Faeze Moradi Kalarde,Liam Gallagher,Scott Sanner
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approaches are increasingly adopted to overcome this limitation by storing interactions in external memory modules and performing retrieval from them, their effectiveness in answering challenging questions (e.g., multi-hop, commonsense) ultimately depends on the agent’s ability to reason over the retrieved information. However, existing methods typically retrieve memory based on semantic similarity to the raw user utterance, which lacks explicit reasoning about missing intermediate facts and often returns evidence that is irrelevant or insufficient for grounded reasoning. In this work, we introduce Goal-Mem, a goal-oriented reasoning framework for RAG-based agentic memory that performs explicit backward chaining from the user’s utterance as a goal. Rather than progressively expanding from retrieved context, Goal-Mem decomposes each goal into atomic subgoals, performs targeted memory retrieval to satisfy each subgoal, and iteratively identifies what information from memory should be retrieved when intermediate goals cannot be resolved. We formalize this process in Natural Language Logic, a logical system that combines the verifiability of reasoning provided by FOL with the expressivity of natural language. Through extensive experiments on two datasets and comparing to nine strong memory baselines, we show that Goal-Mem consistently improves performance, particularly on tasks requiring multi-hop reasoning and implicit inference.
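下面用一个玩具规则库示意反向链式推理如何把目标分解为原子子目标,并在子目标无法由记忆满足时记录待检索项。代码与 Goal-Mem 的实现无关,规则与事实均为本文虚构:

```python
def backward_chain(goal, rules, memory, missing=None):
    """玩具版反向链式推理:目标成立,当它可直接从记忆中检索到,
    或某条规则把它分解出的全部子目标都成立;
    无法归约且不在记忆中的原子子目标被记为待检索项。"""
    if missing is None:
        missing = []
    if goal in memory:               # 针对子目标的记忆检索命中
        return True, missing
    for subgoals in rules.get(goal, []):
        if all(backward_chain(sg, rules, memory, missing)[0] for sg in subgoals):
            return True, missing
    if goal not in rules:            # 原子目标但记忆中没有:应去检索
        missing.append(goal)
    return False, missing
```

与“从检索结果逐步扩展上下文”不同,这里检索是由推理缺口驱动的:missing 列表指明了下一轮应从记忆中取回什么。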

[AI-38] Uncertainty Quantification for LLM-based Code Generation

【速读】:该论文旨在解决现有基于PAC预测集(PAC prediction sets)的方法在大型语言模型(LLM)代码生成任务中的局限性,即强单调性假设(monotonicity assumption)和单标签分类框架(single-label classification framework)导致候选程序空间受限,无法适应代码生成固有的多有效输出特性。解决方案的关键是提出RisCoSet方法,利用多重假设检验(multiple hypothesis testing)构建风险控制预测集(risk-controlling prediction sets),通过生成由部分程序(partial program)表示的预测集,保证以高置信度包含至少一个正确解,从而在控制风险的同时显著减少代码移除率(code removal)。

链接: https://arxiv.org/abs/2605.12201
作者: Senrong Xu,Yuhao Tan,Yanke Zhou,Guangyuan Wu,Zenan Li,Yuan Yao,Taolue Chen,Feng Xu,Xiaoxing Ma
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prediction sets provide a theoretically grounded framework for quantifying uncertainty in machine learning models. Adapting them to structured generation tasks, in particular, large language model (LLM) based code generation, remains a challenging problem. An existing attempt proposes PAC prediction sets but is limited by its strong monotonicity assumption on risk and single-label classification framework, which severely limits the space of candidate programs and cannot accommodate the multiple valid outputs inherent to code generation. To address these limitations, we propose an approach RisCoSet that leverages multiple hypothesis testing to construct risk-controlling predictions for LLM-based code generation. Given a trained code generation model, we produce a prediction set represented by a partial program, which is guaranteed to contain a correct solution with high confidence. Extensive experiments on three LLMs demonstrate the effectiveness of the proposed method. For instance, compared with the state-of-the-art, our method can significantly reduce the code removal by up to 24.5%, at the same level of risk.
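作为背景,风险控制预测集常用“固定序列多重假设检验”构造:对每个候选阈值 λ 检验原假设“风险 > α”,按从保守到激进的顺序依次放行。下面是基于 Hoeffding 界的一个假设性示意(并非 RisCoSet 的实际算法,数值与函数名均为示例):

```python
import math

def hoeffding_pvalue(emp_risk, n, alpha):
    # H0: 真风险 > alpha;损失取值于 [0,1] 时的单侧 Hoeffding 上界
    if emp_risk >= alpha:
        return 1.0
    return math.exp(-2 * n * (alpha - emp_risk) ** 2)

def select_lambda(lambdas, losses_by_lambda, alpha=0.1, delta=0.05):
    """固定序列检验:按风险大致递增的顺序遍历 lambda,
    保留最后一个在显著性水平 delta 下被证明风险 <= alpha 的取值。"""
    chosen = None
    for lam in lambdas:
        losses = losses_by_lambda[lam]
        emp = sum(losses) / len(losses)
        if hoeffding_pvalue(emp, len(losses), alpha) <= delta:
            chosen = lam
        else:
            break                    # 首次检验失败即停止,控制族错误率
    return chosen
```

在代码生成场景中,λ 可以理解为决定保留多少生成代码(保留越多、风险越高),被选中的 λ 给出带高置信风险保证的部分程序。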

[AI-39] Overtrained Not Misaligned

【速读】:该论文旨在系统探究“涌现性不对齐(Emergent misalignment, EM)”现象——即对狭窄任务(如生成不安全代码)进行微调会导致模型在无关领域(如医疗、伦理等)产生广泛的不对齐行为。通过迄今为止最全面的EM研究(涵盖GPT-4o和12个开源模型、4个系列、参数规模从8B至671B,评估超百万次响应),论文发现EM在GPT-4o中可复现,但在开源模型中远非普遍(仅17%的模型在所有随机种子下表现出一致性),且模型大小与EM易感性存在显著正相关。通过微调过程中的检查点(checkpoint-level)分析,论文揭示EM在训练后期才出现,其产生与主任务收敛(near convergence)在时间上分离且后随,表明EM源于主任务收敛后的持续训练。解决方案的关键在于:提前停止(early stopping)可在消除EM的同时保留平均93%的任务性能,而谨慎选择学习率(learning rate)能够进一步最小化EM风险。跨领域验证(如医疗微调)确认这些模式具有普遍性,大小-EM相关性增强(r=0.90),且对不真实性的过度泛化在67%的案例中可通过早停避免,尽管语义相近的训练域会产生更难分离的不对齐。因此,论文将EM从一种不可预见的微调风险重新定位为可避免的训练伪影(avoidable training artifact),并提供了基于训练实践的有效缓解策略。

链接: https://arxiv.org/abs/2605.12199
作者: Joel Schreiber,Ariel Goldstein
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review at CoLM 2026; companion to Nature Matters Arising (also under review). 25 pages, 6 figures

点击查看摘要

Abstract:Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with multiple random seeds. We find that EM replicates in GPT-4o but is far from universal: only 2 of 12 open-source models (17%) exhibit consistent EM across seeds, with a significant correlation between model size and EM susceptibility. Through checkpoint-level analysis during fine-tuning, we demonstrate that EM emerges late in training, distinct from and subsequent to near convergence of the primary task, suggesting EM emerges from continued training past task convergence. This yields practical mitigations: early stopping eliminates EM while retaining an average of 93% of task performance, and careful learning rate selection further minimizes risk. Cross-domain validation on medical fine-tuning confirms these patterns generalize: the size-EM correlation strengthens (r = 0.90), and overgeneralization to untruthfulness remains avoidable via early stopping in 67% of cases, though semantically proximate training domains produce less separable misalignment. As LLMs become increasingly integrated into real-world systems, fine-tuning and reinforcement learning remain the primary methods for adapting model behavior. Our findings demonstrate that with proper training practices, EM can be avoided, reframing it from an unforeseen fine-tuning risk to an avoidable training artifact.

[AI-40] DriftXpress: Faster Drifting Models via Projected RKHS Fields

【速读】:该论文旨在解决漂移模型(Drifting Models)中训练与推理之间的计算权衡问题:传统漂移模型将迭代去噪过程替换为单步生成器评估,从而降低推理成本,但将大量计算转移至训练阶段,导致训练成本高昂。其解决方案的关键在于提出DriftXpress,一种基于投影再生核希尔伯特空间(RKHS,Reproducing Kernel Hilbert Space)场的加速公式,通过在低秩特征空间中近似漂移核(drifting kernel),在保留原始漂移场吸引-排斥(attraction-repulsion)结构的同时,显著降低场评估的计算代价,从而在不牺牲单步推理优势的前提下进一步优化训练-推理权衡。

链接: https://arxiv.org/abs/2605.12183
作者: Ali Falahati,Elliot Creager,Gautam Kamath,Shubhankar Mohapatra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Drifting Models have emerged as a new paradigm for one-step generative modeling, achieving strong image quality without iterative inference. The premise is to replace the iterative denoising process in diffusion models with a single evaluation of a generator. However, this creates a different trade-off: drifting reduces inference cost by moving much of the computation into training. We introduce DriftXpress, an accelerated formulation of drifting models based on projected RKHS fields. DriftXpress approximates the drifting kernel in a low-rank feature space. This preserves the attraction-repulsion structure of the original drifting field while reducing the cost of field evaluation. Across image-generation benchmarks, DriftXpress achieves comparable FID to standard drifting while reducing wall-clock training cost. These results show that the training-inference trade-off of drifting models can be pushed further without giving up their one-step inference advantage.
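低秩近似平移不变核的一种常见做法是随机傅里叶特征(RFF):用有限维特征内积近似 RBF 核,从而把对核求和的场评估降为特征空间内的线性运算。以下为假设性示意(论文的投影 RKHS 场未必采用 RFF,此处仅演示“低秩特征空间近似核”这一思路):

```python
import math, random

def sample_rff(dim, n_feats, rng):
    # 高斯随机特征:w ~ N(0, I),b ~ U[0, 2*pi)
    ws = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_feats)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_feats)]
    return ws, bs

def rff_map(x, ws, bs):
    # phi(x) 使得 phi(x)·phi(y) ≈ exp(-||x - y||^2 / 2)
    d = len(ws)
    return [math.sqrt(2.0 / d) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(ws, bs)]

def approx_rbf(x, y, ws, bs):
    fx, fy = rff_map(x, ws, bs), rff_map(y, ws, bs)
    return sum(a * b for a, b in zip(fx, fy))
```

一旦样本被映射为 phi(x),吸引-排斥场对 n 个样本的求和可改写为特征空间中的一次内积,评估成本随特征维数而非样本数增长。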

[AI-41] MolDeTox: Evaluating Language Models Stepwise Fragment Editing for Molecular Detoxification

【速读】:该论文旨在解决现有分子优化研究中毒性相关挑战被忽视的问题,具体包括:现有基准测试数据多样性不足、生成分子结构有效性低、以及过度依赖代理模型进行毒性评估等局限性。解决方案的关键是提出了MolDeTox,一个新型分子解毒基准测试,通过设计细粒度、逐步的毒性感知分子优化任务,实现对多种通用大语言模型(LLMs)和视觉语言模型(VLMs)的可靠评估。其核心创新在于利用片段级(fragment-level)理解与生成分子,显著提升了生成分子的结构有效性和质量,并通过任务级性能分析提供了可解释的解毒过程理解。

链接: https://arxiv.org/abs/2605.12181
作者: Jueon Park,Wonjune Jang,Jiwoo Lee,Yein Park,Jaewoo Kang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and Vision Language Models (VLMs) have recently shown promising capabilities in various scientific domains. In particular, these advances have opened new opportunities in drug discovery, where the ability to understand and modify molecular structures is critical for optimizing drug properties such as efficacy and toxicity. However, existing models and benchmarks often overlook toxicity-related challenges, focusing primarily on general property optimization without adequately addressing safety concerns. In addition, even existing toxicity repair benchmarks suffer from limited data diversity, low structural validity of generated molecules, and heavy reliance on proxy models for toxicity assessment. To address these limitations, we propose MolDeTox, a novel benchmark for molecular detoxification, designed to enable fine-grained and reliable evaluation of toxicity-aware molecular optimization across stepwise tasks. We evaluate a wide range of general-purpose LLMs and VLMs under diverse settings, and demonstrate that understanding and generating molecules at the fragment-level improves structural validity and enhances the quality of generated molecules. Moreover, through detailed task-level performance analysis, MolDeTox provides an interpretable benchmark that enables a deeper understanding of the detoxification process. Our dataset is available at: this https URL

[AI-42] A Deep Learning-based Receiver for Asynchronous Grant-Free Random Access in Control-to-Control Networks

【速读】:该论文旨在解决室内环境下共享无线信道中非同步控制到控制(C2C)通信的可靠接收问题,关键挑战在于多个节点异步发送的命令单元在时间上不对齐,导致接收端面临信号叠加、边界检测困难以及高流量下的碰撞干扰。解决方案的核心在于提出一种基于卷积神经网络(CNN)的接收机架构:该CNN直接对接收波形进行运算,同时检测命令单元的起始序列和尾部序列,其中尾部序列检测还能额外利用LDPC解码器输出的软信息以及信道估计来提升准确性。在成功解码命令单元后,进一步采用连续干扰消除(SIC)技术迭代消除已解码信号,从而在无协调、高流量场景下实现稳定的包边界识别和低端到端丢包率。

链接: https://arxiv.org/abs/2605.12180
作者: Massimo Battaglioni,Edoardo Carnevali,Dania De Crescenzo,Enrico Testi,Marco Baldi,Enrico Paolini
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE Transactions on Communications

点击查看摘要

Abstract:In this paper, we study grant-free, asynchronous control-to-control (C2C) communications in an indoor scenario with a shared wireless channel. Each communication node transmits command units, each consisting of a variable-length low-density parity-check (LDPC)–coded payload preceded by a start sequence and followed by a tail sequence. Due to the asynchronous nature of the access, transmissions from different nodes are not aligned over time. As a result, each receiving controller observes the superposition of multiple command units transmitted by different nodes over a receiver-defined superframe interval. Each node transmits one or more replicas of the same command unit. We propose a receiver architecture in which the detection of command unit boundaries (start/tail sequences) is carried out by a single convolutional neural network (CNN) operating directly on the received signal. We show that, while start-sequence detection must rely only on the received waveform, tail-sequence detection can additionally exploit the soft information produced by the LDPC decoder, together with channel estimates. Finally, once commands units are successfully decoded, successive interference cancellation (SIC) can be applied. Simulation results demonstrate that the receiver we propose achieves reliable packet-boundary identification and a low end-to-end packet loss rate, even under uncoordinated and high-traffic operating conditions.

[AI-43] Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)策略在真实部署中因用户输入延迟(通常数秒)而导致策略长时间空闲、交互效率低下的问题。其解决方案的关键是提出一个轻量级模块Premover:在保持预训练VLA骨干网络参数冻结的前提下,附加两个小型投影头分别处理图像块(image patches)和语言标记(language tokens),将骨干网络中间层的表示映射到共享的焦点空间;该焦点图通过模拟器渲染的目标对象分割掩码进行监督,并作为下一步图像令牌的逐块重加权(per-patch reweighting)信号;同时,通过从流式前缀联合训练一个单一的标量就绪阈值(readiness threshold),动态决定策略何时开始执行动作。这使得原本空闲的等待窗口被转化为有效的预计算,从而在不牺牲成功率的前提下显著降低挂钟时间。

链接: https://arxiv.org/abs/2605.12160
作者: Joonha Park,Jiseung Jeong,Taesik Gong
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) policies are typically evaluated as if the user had finished typing or speaking before the robot begins acting. In real deployment, however, users take several seconds to enter a request, leaving the policy idle for a substantial fraction of the interaction. We introduce Premover, a lightweight module that converts this idle window into useful precomputation. Premover keeps the VLA backbone frozen and attaches two small projection heads, one for image patches, one for language tokens, that map an intermediate layer of the backbone into a shared space. The resulting focus map is supervised by simulator-rendered target-object segmentation masks and applied as a per-patch reweighting of the next step’s image tokens. A single scalar readiness threshold, trained jointly from streaming prefixes, decides when the policy should begin acting. On the LIBERO benchmark suite, Premover reduces mean wall-clock time from 34.0 to 29.4 seconds, a 13.6% reduction, while matching the full-prompt baseline’s success rate (95.1% vs. 95.0%); naive premoving, by contrast, collapses to 66.4%.
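下面以几行假设性代码示意“逐块重加权 + 标量就绪阈值”这两个机制(与 Premover 的实现无关,阈值等均为虚构):

```python
def reweight_patches(patch_tokens, focus_map):
    # 焦点图为每个图像块给出 [0,1] 权重,按权重缩放该块的全部通道
    return [[w * v for v in tok] for tok, w in zip(patch_tokens, focus_map)]

def should_act(readiness, threshold=0.8):
    # 单一标量就绪阈值:流式前缀的就绪分数超过阈值时才开始执行动作
    return readiness >= threshold
```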

[AI-44] ALGOGEN: Tool-Generated Verifiable Traces for Reliable Algorithm Visualization

【Quick Read】: This paper tackles hallucinations in end-to-end LLM-based algorithm visualization (AV) systems, which must simultaneously simulate algorithm execution and satisfy video rendering constraints (element layout, color schemes); the resulting complexity causes low execution success rates, overlapping elements, and inter-frame inconsistencies. The key to the solution is the ALGOGEN paradigm, which fully decouples algorithm execution from rendering. It first introduces Visualization Trace Algebra (VTA), a monoid over algorithm visual states and operations; the LLM generates a Python tracker that simulates the algorithm flow and emits traces in a VTA-JSON format, while a Rendering Style Language (RSL) templatizes algorithm layouts. A deterministic renderer then compiles the traces with RSL templates into Manim, LaTeX/TikZ, or web outputs, keeping the LLM out of rendering decisions, which markedly reduces hallucinations and raises task success rates.

Link: https://arxiv.org/abs/2605.12159
Authors: Kunpeng Liao, Yuexiao Ma, Yisheng Lin, Hualin Zeng, Xiawu Zheng, Rongrong Ji
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:Algorithm Visualization (AV) helps students build mental models by animating algorithm execution states. Recent LLM-based systems such as CODE2VIDEO generate AV videos in an end-to-end manner. However, this paradigm requires the system to simultaneously simulate algorithm flow and satisfy video rendering constraints, such as element layout and color schemes. This complex task induces LLM hallucinations, resulting in reduced execution success rates, element overlap, and inter-frame inconsistencies. To address these challenges, we propose ALGOGEN, a novel paradigm that decouples algorithm execution from rendering. We first introduce Visualization Trace Algebra (VTA), a monoid over algorithm visual states and operations. The LLM then generates a Python tracker that simulates algorithm flow and outputs VTA-JSON traces, a JSON encoding of VTA. For rendering, we define a Rendering Style Language (RSL) to templatize algorithm layouts. A deterministic renderer then compiles algorithm traces with RSL into Manim, LaTeX/TikZ, or this http URL outputs. Evaluated on a LeetCode AV benchmark of 200 tasks, ALGOGEN achieves an average success rate improvement of 17.3% compared to end-to-end methods, with 99.8% versus 82.5%. These results demonstrate that our decoupling paradigm effectively mitigates LLM hallucinations in complex AV tasks, providing a more reliable solution for automated generation of high-quality algorithm visualizations. Demo videos and code are available in the project repository.
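The monoid structure of a visualization trace algebra can be sketched in a few lines: traces compose by concatenation, the empty trace is the identity, and a tracker simulates the algorithm while emitting operations instead of rendering. The `Trace` class and the swap-op encoding below are a hypothetical reading of the abstract, not ALGOGEN's real schema.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    # Each op is a plain dict, e.g. {"op": "swap", "args": [i, j]},
    # so the trace serializes directly to a VTA-JSON-style encoding.
    ops: list = field(default_factory=list)

    def compose(self, other):
        """Monoid operation: concatenate two traces."""
        return Trace(self.ops + other.ops)

EMPTY = Trace()  # monoid identity

def bubble_sort_trace(xs):
    """Tracker: simulate the algorithm and emit a trace; no rendering happens here."""
    xs = list(xs)
    t = Trace()
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
                t = t.compose(Trace([{"op": "swap", "args": [j, j + 1]}]))
    return t, xs
```

A deterministic renderer would then walk `t.ops` and apply a layout template, with no model in the loop.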

[AI-45] MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling

【Quick Read】: This paper addresses the limitation that existing optimization-modeling benchmarks rely on text-only descriptions, ignoring the fact that in operational practice instance information is often conveyed through visual artifacts such as tables, graphs, and maps; such benchmarks therefore cannot assess whether multimodal large language models can jointly process textual and visual requirements and produce both mathematical formulations and executable solver code for realistic decision-making scenarios. The key to the solution is a solver-grounded verification framework that generates structured multimodal optimization instances, building both the model-facing inputs and the hidden reference files from the same exactly-verified source. It is instantiated as the MM-OptBench benchmark, spanning multiple optimization families, subcategories, and difficulty levels, enabling systematic evaluation of joint visual-textual understanding, data extraction, and solver-correct formulation and code generation.

Link: https://arxiv.org/abs/2605.12154
Authors: Zhong Li, Qi Huang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen, Zaiwen Wen, Lincen Yang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Paper under review

Click to view abstract

Abstract:Optimization modeling translates real decision-making problems into mathematical optimization models and solver-executable implementations. Although language models are increasingly used to generate optimization formulations and solver code, existing benchmarks are almost entirely text-only. This omits many optimization-modeling tasks that arise in operational practice, where requirements are described in text but instance information is conveyed through visual artifacts such as tables, graphs, maps, schedules, and dashboards. We introduce multimodal optimization modeling, a benchmark setting in which models must construct both a mathematical formulation and executable solver code from a text-and-visual problem specification. To evaluate this setting, we develop a solver-grounded framework that generates structured optimization instances, verifies each with an exact solver, and builds both the model-facing inputs and hidden reference files from the same verified source. We instantiate the framework as MM-OptBench, a benchmark of 780 solver-verified instances spanning 6 optimization families, 26 subcategories, and 3 structural difficulty levels. We evaluate 9 multimodal large language models (MLLMs), including 6 frontier general-purpose models and 3 math-specialized models, with aggregate, family-level, difficulty-level, and failure-mode analyses. The results show that the task remains far from solved: the best two models reach 52.1% and 51.3% pass@1, while on average across the six general-purpose MLLMs, pass@1 is 43.4% on easy instances and 15.9% on hard instances. All three math-specialized MLLMs solve 0/780 instances. Failure attribution shows that errors arise both when extracting instance data from text and visuals and when turning extracted data into solver-correct formulations and code. MM-OptBench provides a testbed for solver-grounded, decision-oriented multimodal intelligence.

[AI-46] CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research

【Quick Read】: This paper addresses the lack of large-scale, real industrial proprietary code datasets for code-intelligence research: existing open-source corpora do not adequately represent the complexity, domain diversity, and quality distribution of production code in industrial settings. The key to the solution is direct collaboration with 12 industrial partners under formal data sharing agreements, yielding 2,440 proprietary software repositories covering 138 programming languages and 373 million lines of code. All repositories pass through a multi-stage pipeline: structured partner onboarding, two-stage quality control combining automated metadata filtering with manual code review, and a deterministic anonymization pipeline covering the full version-control history, ensuring authenticity, usability, and privacy compliance.

Link: https://arxiv.org/abs/2605.12153
Authors: Vladislav Savenkov
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 34 pages, 9 figures, 4 appendices. Dataset access: this https URL. Anonymization tool: this https URL. Metadata utility: this https URL

Click to view abstract

Abstract:We present Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per-repository metadata. Unlike existing code corpora derived from public open-source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy. All repositories were processed through a multi-stage pipeline encompassing structured partner onboarding, two-stage quality selection combining automated metadata filtering with manual code review, and a deterministic anonymization pipeline covering the full version control history. The dataset is intended to support research in code intelligence, software quality analysis, pre-training and fine-tuning of code language models, developer behaviour studies, and construction of agent evaluation benchmarks. Access is provided under a restricted commercial license; details are available at this https URL.

[AI-47] BoolXLLM: LLM-Assisted Explainability for Boolean Models

【Quick Read】: This paper addresses two practical obstacles for rule-based Boolean formula classifiers such as BoolXAI: automatically selecting semantically meaningful features rather than purely statistical ones, and translating formal logical rules into natural-language explanations that non-technical users can understand. The key to the solution is BoolXLLM, a hybrid framework that integrates large language models (LLMs) into the end-to-end Boolean rule-learning pipeline at three stages: (1) feature selection, where LLM domain knowledge guides the identification of task-relevant variables; (2) threshold recommendation, where LLMs propose semantically meaningful discretization cut-points for numeric features; and (3) rule compression and interpretation, where LLMs translate the learned Boolean rules into natural-language explanations at both global and local levels. This integration preserves the faithfulness and compactness of Boolean rules while producing human-readable narrative explanations, substantially improving interpretability at competitive predictive performance.

Link: https://arxiv.org/abs/2605.12139
Authors: Du Cheng, Serdar Kadioglu, Xin Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Interpretable machine learning aims to provide transparent models whose decision-making processes can be readily understood by humans. Recent advances in rule-based approaches, such as expressive Boolean formulas (BoolXAI), offer faithful and compact representations of model behavior. However, for non-technical stakeholders, main challenges remain in practice: (i) selecting semantically meaningful features and (ii) translating formal logical rules into accessible explanations. In this work, we propose BoolXLLM, a hybrid framework that integrates Large Language Models (LLMs) into the end-to-end pipeline of Boolean rule learning. We augment BoolXAI, an expressive Boolean rule-based classifier, with LLMs at three critical stages: (1) feature selection, where LLMs guide the identification of domain-relevant variables; (2) threshold recommendation, where LLMs propose semantically meaningful discretization strategies for numerical features; and (3) rule compression and interpretation, where Boolean rules are translated into natural language explanations at both global and local levels. This integration bridges formal, faithful explanations with human-understandable narratives, allowing us to build an explainable AI system that is both theoretically grounded and accessible to non-experts. Early empirical results demonstrate that LLM-assisted pipelines improve interpretability while maintaining competitive predictive performance. Our work highlights the promise of combining symbolic reasoning with language-based models for human-centered explainability.
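Stages (2) and (3) can be illustrated with a tiny sketch: thresholded numeric features become Boolean literals, an expressive "AtLeast-k" operator of the kind such classifiers use evaluates them, and a plain-language explanation is derived from the fired literals. The thresholds, function names, and explanation format are illustrative assumptions, not BoolXAI's API.

```python
def discretize(record, thresholds):
    """Turn numeric features into Boolean literals, e.g. {"age>=50": True}."""
    return {f"{k}>={v}": record[k] >= v for k, v in thresholds.items()}

def at_least(k, literals, facts):
    """Expressive 'AtLeast-k' operator: true iff at least k literals hold."""
    return sum(facts[l] for l in literals) >= k

def explain(k, literals, facts):
    """Naive local explanation: verdict plus the conditions that actually fired."""
    fired = [l for l in literals if facts[l]]
    verdict = "positive" if at_least(k, literals, facts) else "negative"
    return f"{verdict}: {len(fired)}/{len(literals)} conditions hold ({', '.join(fired) or 'none'})"
```

In BoolXLLM the thresholds and the final wording would come from an LLM; here they are hard-coded to keep the sketch self-contained.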

[AI-48] Rollout Cards: A Reproducibility Standard for Agent Research

【Quick Read】: This paper addresses a reproducibility problem in agent research: papers report only scores while hiding the reporting rules and rollout records behind them, omitting how many runs failed, errored, or were skipped; the same behaviour can receive very different scores, and even different model rankings, when an evaluation selects different parts of a rollout or applies different reporting rules. The key to the solution is rollout cards, which treat rollout records rather than reported scores as the unit of reproducibility: publication bundles preserve the full raw rollout record and explicitly declare the views, reporting rules, and drops manifests behind reported scores, allowing other researchers to recompute analyses from the raw data and eliminating reporting-rule bias.

Link: https://arxiv.org/abs/2605.12131
Authors: Charlie Masters, Ziyuan Liu, Stefano V. Albrecht
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publicly publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.
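The reporting-rule sensitivity the abstract measures is easy to reproduce in miniature: the same preserved records yield different "task success rates" depending on whether errored runs are dropped or counted as failures. The `tsr` function and the rule names are illustrative, not the paper's schema.

```python
def tsr(records, rule="count_errors_as_failure"):
    """Task success rate under an explicitly declared reporting rule."""
    if rule == "drop_errors":
        kept = [r for r in records if r["status"] != "error"]
    else:
        kept = records
    if not kept:
        return 0.0
    return sum(r["status"] == "success" for r in kept) / len(kept)

# Same evidence, two headline numbers: 6 successes, 2 failures, 2 errors.
records = (
    [{"status": "success"}] * 6
    + [{"status": "failure"}] * 2
    + [{"status": "error"}] * 2
)
```

A rollout card would publish `records` plus the rule actually used, so a reader can recompute either number.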

[AI-49] It's Not the Size: Harness Design Determines Operational Stability in Small Language Models

【Quick Read】: This paper experimentally studies how the level of harness engineering affects the operational performance of small language models (SLMs, 2-3B parameters), in particular the "scaffold collapse" that occurs without harness support under complex format requirements, e.g. LLaMA 3.2 3B reaching a Task Success Rate (TSR) of only 0.429 in the model-only condition due to format violations. The key to the solution is a 4-stage pipeline harness (plan-execute-verify-recover), which explicitly decomposes task steps and adds verification and recovery mechanisms, raising task success and format compliance. Ablations show that planning and recovery each contribute roughly 24.7% of the total gain, while the Verification Catch Rate (VCR) reaches 0.625 across all pipeline runs.

Link: https://arxiv.org/abs/2605.12129
Authors: Yong-eun Cho
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
Comments:

Click to view abstract

Abstract:This paper experimentally analyzes how the level of harness engineering affects the operational performance of small language models (SLMs, 2-3B parameters). Three harness conditions - model-only (raw prompt), minimal-shell (wrapper tags), and a 4-stage pipeline (plan-execute-verify-recover) - are applied to three models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks, comparing Task Success Rate (TSR) and Valid TSR (VTSR). The pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B (T1-T5, 21 tasks). A non-monotonic phenomenon - minimal-shell TSR < model-only TSR - is observed in two models. In LLaMA 3.2 3B model-only, seven format violations yield TSR=0.429, revealing scaffold collapse: the model abandons JSON structure under complex format requirements without harness support. Ablation shows planning and recovery each contribute approximately 24.7% of total gain. VCR (Verification Catch Rate)=0.625 across all pipeline runs.
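A plan-execute-verify-recover pipeline of the kind studied here can be skeletonized as below. The prompts, the JSON-validity check as the "verify" stage, and the single-retry recovery are illustrative assumptions, not the paper's exact harness.

```python
import json

def run_pipeline(model, task, max_recoveries=1):
    """Minimal plan-execute-verify-recover loop around a text-in/text-out model."""
    plan = model(f"Plan the steps for: {task}")              # stage 1: plan
    output = model(f"Execute this plan as JSON: {plan}")     # stage 2: execute
    for _ in range(max_recoveries + 1):
        try:
            return json.loads(output)                        # stage 3: verify
        except json.JSONDecodeError:                         # stage 4: recover
            output = model(f"Your output was not valid JSON. Repair it: {output}")
    return None  # unrecovered run: would count against Valid TSR
```

The `model` argument is any callable from prompt to string, which keeps the harness testable with a stub.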

[AI-50] To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

【Quick Read】: This paper examines the principal hierarchies language models exhibit in high-stakes professional settings (law, medicine) when user instructions, institutional authority, and professional standards conflict, finding these hierarchies unstable across domains and frequently in violation of professional standards. Rather than proposing a method, the paper exposes the mechanism: current alignment approaches, including published alignment hierarchies, are not robust in these scenarios, and the primary failure mode is knowledge omission, where a model demonstrably possesses the relevant knowledge, sometimes even surfacing it in its reasoning trace, yet suppresses it and complies with authority pressure. An effective remedy would therefore need alignment strategies that keep professional standards above user or authority instructions during task execution (e.g., drafting) and prevent knowledge from being suppressed along the reasoning chain.

Link: https://arxiv.org/abs/2605.12120
Authors: Fangyi Yu, Nabeel Seedat, Jonathan Richard Schwarz, Andrew M. Bean
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Language models deployed in high-stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy – an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost-reduction directive from a hospital administrator complies at the expense of evidence-based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards – despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional standards exhibited by these models are unstable across medical and legal contexts and inconsistent across model families. When failing to follow professional standards, the primary failure mechanism is knowledge omission: models that demonstrably possess relevant knowledge produce harmful outputs without surfacing conflicting knowledge. In a particularly troubling instance, we find that a reasoning model recognizes the relevant knowledge in its reasoning trace – e.g., that a drug has been withdrawn – yet suppresses this in the user-facing answer and proceeds to recommend the drug under authority pressure anyway. Inconsistent alignment across task framing, domain, and model families suggests that current alignment methods, including published alignment hierarchies, are unlikely to be robust when models are deployed in high-stakes professional settings.

[AI-51] Adaptive Multi-Round Allocation with Stochastic Arrivals ICML2026

【Quick Read】: This paper studies sequential multi-round allocation of a limited budget of identical resources to individuals with stochastic referral capacity, where successful referrals endogenously generate future decision opportunities but repeated allocation to the same individual exhibits diminishing returns. Because the stochastic, high-dimensional evolution of the frontier makes the Bellman recursion intractable, the key to the solution is a population-level surrogate value function that depends only on the remaining budget and frontier size; an exact dynamic program over this surrogate, computed via truncated probability generating functions, yields a planning algorithm whose complexity is polynomial in the total budget.

Link: https://arxiv.org/abs/2605.12111
Authors: Yuqi Pan, Davin Choo, Haichuan Wang, Milind Tambe, Alastair van Heerden, Cheryl Johnson
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: Accepted into ICML 2026

Click to view abstract

Abstract:We study a sequential resource allocation problem motivated by adaptive network recruitment, in which a limited budget of identical resources must be allocated over multiple rounds to individuals with stochastic referral capacity. Successful referrals endogenously generate future decision opportunities while allocating additional resources to an individual exhibits diminishing returns. We first show that the single-round allocation problem admits an exact greedy solution based on marginal survival probabilities. In the multi-round setting, the resulting Bellman recursion is intractable due to the stochastic, high-dimensional evolution of the frontier. To address this, we introduce a population-level surrogate value function that depends only on the remaining budget and frontier size. This surrogate enables an exact dynamic program via truncated probability generating functions, yielding a planning algorithm with polynomial complexity in the total budget. We further analyze robustness under model misspecification, proving a multi-round error bound that decomposes into a tight single-round frontier error and a population-level transition error. Finally, we evaluate our method on real-world inspired recruitment scenarios.
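The single-round greedy rule based on marginal survival probabilities can be sketched as below, under the illustrative assumption that an individual with per-unit success probability p succeeds with probability 1-(1-p)^a after a units, so the marginal gain of the a-th unit is p(1-p)^(a-1) and is diminishing; the paper's actual model may differ.

```python
import heapq

def greedy_allocate(ps, budget):
    """Greedily hand out units, always to the largest current marginal gain."""
    alloc = [0] * len(ps)
    # Max-heap via negated gains; first unit for i has marginal gain ps[i].
    heap = [(-p, i) for i, p in enumerate(ps)]
    heapq.heapify(heap)
    for _ in range(budget):
        gain, i = heapq.heappop(heap)
        alloc[i] += 1
        # Next marginal gain for i shrinks by a factor (1 - ps[i]).
        heapq.heappush(heap, (gain * (1 - ps[i]), i))
    return alloc
```

Diminishing marginal gains are what make this greedy rule exact for the single-round problem; the multi-round surrogate DP in the paper builds on top of such per-round decisions.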

[AI-52] Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

【Quick Read】: This paper targets the cost of generating feasible Pareto fronts for constrained bi-objective continuous optimization, where existing approaches (iterative scalarization, evolutionary search, or problem-specific solvers) must re-optimize every instance. The key to the solution is DIPS, an end-to-end framework that fine-tunes large language models (LLMs) as amortized Pareto-front generators: a compact discretization scheme makes continuous decision vectors compatible with autoregressive token generation, Numerically Grounded Token Initialization handles the newly introduced numerical tokens, and Three-Phase Curriculum Optimization progressively aligns structural validity, feasibility, and Pareto-front quality, so that a single inference pass directly outputs an ordered set of feasible decision vectors approximating the front.

Link: https://arxiv.org/abs/2605.12106
Authors: Peipei Xu, SiYuan Ma, Yaohua Liu, Yu Wu, Guanliang Liu, Yang Zhang, Yong Liu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 31 pages

Click to view abstract

Abstract:Generating feasible Pareto fronts for constrained bi-objective continuous optimization is central to multi-criteria decision-making. Existing methods usually rely on iterative scalarization, evolutionary search, or problem-specific solvers, requiring repeated optimization for each instance. We introduce DIPS, an end-to-end framework that fine-tunes large language models as amortized Pareto-front generators for constrained bi-objective convex optimization. Given a textual problem description, DIPS directly outputs an ordered set of feasible continuous decision vectors approximating the Pareto front. To make continuous optimization compatible with autoregressive language modeling, DIPS combines a compact discretization scheme, Numerically Grounded Token Initialization for new numerical tokens, and Three-Phase Curriculum Optimization, which progressively aligns structural validity, feasibility, and Pareto-front quality. Across five families of constrained bi-objective convex problems, a fine-tuned 7B-parameter model achieves normalized hypervolume ratios of 95.29% to 98.18% relative to reference fronts. With vLLM-accelerated inference, DIPS solves one instance in as little as 0.16 seconds and outperforms general-purpose and reasoning LLM baselines under the evaluated setting. These results suggest that LLMs can serve as effective amortized generators for continuous Pareto-front approximation.

[AI-53] Autonomy and Agency in Agent ic AI: Architectural Tactics for Regulated Contexts

【Quick Read】: This paper addresses how to reason jointly about agency and autonomy, two coupled design dimensions, when deploying agentic AI in regulated contexts. Existing practice treats them independently, leaving practitioners without principled guidance on human oversight, action consequences, and error correction; under compliance requirements, higher autonomy removes opportunities for human error correction, so agency must be constrained accordingly. The key to the solution is a two-dimensional design space with five operational levels (L1-L5) per dimension that makes the coupling explicit and navigable, plus six architectural tactics (checkpoints, escalation, multi-agent delegation, tool provisioning, tool fencing, write staging) for adjusting a deployment's position in the space, validated on two public-sector worked examples under realistic compliance constraints. Five further deployment parameters (model capability, agent architecture, tool fidelity, workflow bottlenecks, evaluation) shape what is achievable at any configuration independently of the two dimensions. Together, the design space, tactics, and parameters provide a shared vocabulary in which responsibility, auditability, and reversibility are explicit design-time considerations rather than properties retrofitted after deployment.

Link: https://arxiv.org/abs/2605.12105
Authors: Damir Safin, Dian Balta
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Deploying agentic AI in regulated contexts requires principled reasoning about two design dimensions: agency (what the system can do) and autonomy (how much it acts without human involvement). Though often treated independently, they are coupled: at higher autonomy, human error correction is less available, so reliable operation requires constraining agency accordingly; compliance requirements reinforce this by mandating human involvement as action consequences grow. Yet no established approach addresses them jointly, leaving practitioners without a principled basis for reasoning about oversight, action consequences, and error correction. This work introduces a two-dimensional design space in which both dimensions are organised into five operational levels, making the coupling explicit and navigable. Autonomy ranges from human-commanded operation (L1) to fully autonomous monitoring (L5); agency ranges from reasoning over supplied context (L1) to committed writes to authoritative records (L5). Building on this space, we propose six architectural tactics–checkpoints, escalation, multi-agent delegation, tool provisioning, tool fencing, and write staging–for adjusting a deployment’s position within it. The tactics are grounded in two worked examples from public-sector contexts, illustrating how they apply under realistic compliance constraints. We further examine five deployment parameters–model capability, agent architecture, tool fidelity, workflow bottlenecks, and evaluation–that shape what is achievable at any configuration independently of agency and autonomy. Together, the design space, tactics, and deployment parameters provide a shared vocabulary for principled, compliance-aware agentic AI design in which responsibility, auditability, and reversibility are explicit design considerations rather than properties that must be retrofitted after deployment. 
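The agency-autonomy coupling described above can be made concrete as a small compliance guard: as autonomy rises, the maximum permitted agency level drops. The numeric L-level mapping and the "levels sum to at most 6" cap are illustrative assumptions for the sketch, not the paper's actual policy.

```python
# Illustrative level encodings: autonomy L1 (human-commanded) .. L5 (fully
# autonomous monitoring); agency L1 (reason over context) .. L5 (committed writes).
AUTONOMY = {"L1": 1, "L2": 2, "L3": 3, "L4": 4, "L5": 5}
AGENCY = dict(AUTONOMY)

def max_agency_for(autonomy_level):
    """Hypothetical coupling rule: a fully autonomous monitor (autonomy L5)
    may only reason over supplied context (agency L1), and vice versa."""
    return 6 - AUTONOMY[autonomy_level]

def is_compliant(agency_level, autonomy_level):
    return AGENCY[agency_level] <= max_agency_for(autonomy_level)
```

Tactics like write staging or escalation would then be ways of moving a non-compliant configuration back inside the guard.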

[AI-54] Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration

【Quick Read】: This paper addresses the difficulty of designing learnable information-theoretic objectives for robot exploration: in high-dimensional robotic systems, many parameter directions are weakly observable or unidentifiable, and even when identifiable directions are selected, the omitted directions can distort both exploration and the information measure itself. The key to the solution is Quasi-Optimal Experimental Design (QOED), an adaptive information objective grounded in optimal experimental design that (i) performs eigenspace analysis of the Fisher information matrix to identify an observable subspace and select identifiable parameter directions, and (ii) modifies the exploration objective to emphasize those directions while suppressing nuisance effects from non-critical parameters. Under bounded nuisance influence and limited coupling between critical and nuisance directions, QOED provides a constant-factor approximation to the ideal objective that explores all parameters; on simulated and real-world navigation and manipulation tasks, identifiable-direction selection and nuisance suppression yield performance improvements of 35.23% and 21.98% respectively, and QOED outperforms established RL baselines when used as the exploration objective in model-based policy optimization.

Link: https://arxiv.org/abs/2605.12084
Authors: Youwei Yu, Jionghao Wang, Zhengming Yu, Wenping Wang, Lantao Liu
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Designing learnable information-theoretic objectives for robot exploration remains challenging. Such objectives aim to guide exploration toward data that reduces uncertainty in model parameters, yet it is often unclear what information the collected data can actually reveal. Although reinforcement learning (RL) can optimize a given objective, constructing objectives that reflect parametric learnability is difficult in high-dimensional robotic systems. Many parameter directions are weakly observable or unidentifiable, and even when identifiable directions are selected, omitted directions can still influence exploration and distort information measures. To address this challenge, we propose Quasi-Optimal Experimental Design (QOED), an adaptive information objective grounded in optimal experimental design. QOED (i) performs eigenspace analysis of the Fisher information matrix to identify an observable subspace and select identifiable parameter directions, and (ii) modifies the exploration objective to emphasize these directions while suppressing nuisance effects from non-critical parameters. Under bounded nuisance influence and limited coupling between critical and nuisance directions, QOED provides a constant-factor approximation to the ideal information objective that explores all parameters. We evaluate QOED on simulated and real-world navigation and manipulation tasks, where identifiable-direction selection and nuisance suppression yield performance improvements of 35.23% and 21.98%, respectively. When integrated as an exploration objective in model-based policy optimization, QOED further improves policy performance over established RL baselines.
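The eigenspace step can be sketched numerically: split the parameter space by the eigenvalues of the Fisher information matrix, keep the well-informed directions, and score information gain only on that subspace. The log-det objective, the relative eigenvalue cutoff, and the function names are illustrative assumptions, not QOED's exact construction.

```python
import numpy as np

def identifiable_subspace(fisher, rel_tol=1e-3):
    """Columns spanning the identifiable directions: eigenvectors of the Fisher
    information matrix whose eigenvalues are non-negligible relative to the max."""
    vals, vecs = np.linalg.eigh(fisher)   # eigenvalues in ascending order
    keep = vals > rel_tol * vals.max()
    return vecs[:, keep]

def projected_information_gain(fisher, V):
    """Log-det information objective restricted to the identifiable subspace,
    so weakly observable (nuisance) directions cannot distort the score."""
    return float(np.linalg.slogdet(V.T @ fisher @ V)[1])
```

With a near-singular Fisher matrix, the unrestricted log-det would be dominated by the tiny eigenvalue; the projection discards it.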

[AI-55] Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes

【Quick Read】: This paper addresses the lack of a unified property-level measure of how well agentic AI decisions can be reconstructed post hoc (what the agent did, on whose authority, against which policy, from what reasoning) across different vendor SDK regimes spanning cloud-agent, observability, tool-use, telemetry, and protocol traces. The key to the solution is applying an unmodified Decision Trace Reconstructor, together with a Decision Event Schema (DES), to classify each property as fully fillable, partially fillable, structurally unfillable, or opaque, over pinned worked-example anchors from six public vendor SDK regimes plus two comparator columns. The pilot finds strict-governance-completeness separating into three tiers from 42.9% to 85.7%, revealing one regime-independent gap (the reasoning trace), four regime-dependent gaps, and one Mixed property.

Link: https://arxiv.org/abs/2605.12078
Authors: Oleg Solozobov
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 23 pages, 3 tables; reproducibility package: this https URL GitHub: this https URL

Click to view abstract

Abstract:Agentic AI failures need post-hoc reconstruction: what the agent did, on whose authority, against which policy, and from what reasoning. Cross-regime feasibility remains unmeasured under one property-level schema. We apply the Decision Trace Reconstructor unmodified to pinned worked-example anchors from six public vendor SDK regimes spanning cloud-agent, observability, tool-use, telemetry, and protocol traces, plus two comparator columns. Each Decision Event Schema (DES) property is classified as fully fillable, partially fillable, structurally unfillable, or opaque. Per-property reconstructability of an agent decision already varies between regimes at this anchor scale. Strict-governance-completeness separates into three tiers ranging from 42.9% to 85.7%, yielding one regime-independent gap (reasoning trace), four regime-dependent gaps, and one Mixed property; the pilot is single-annotator, one anchor per cell, descriptive, with outputs checksum-verifiable from a deposited reproducibility package.

[AI-56] he Deepfakes We Missed: We Built Detectors for a Threat That Didnt Arrive

【Quick Read】: This position paper argues that deepfake-detection research is seriously misaligned with observed harms: for nearly a decade the field has organized itself around a threat model inherited from 2017-2019 (face-swap and talking-head manipulation of public figures), yet an accounting of 2022-2026 incidents shows the dominant harms are peer-generated non-consensual intimate imagery (NCII), voice-clone scam calls targeting families and finance workers, and emotional-manipulation fraud, while the predicted large-scale public-figure deepfake catastrophe did not materialize in the 2024 global information environment. The paper's central claim is that this misalignment, not model capability, is now the dominant bottleneck on real-world deepfake defense, and the remedy is for the research community to substantially rebalance its agenda toward the harm categories that are actually growing, supported by an empirical accounting of research effort versus harm distribution, an analysis of the structural reasons the misalignment persists, and three concrete technical research agendas for the under-defended harm categories.

Link: https://arxiv.org/abs/2605.12075
Authors: Shaina Raza
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Nearly a decade of Machine Learning (ML) research on deepfake detection has been organized around a threat model inherited from 2017–2019, revolving around face-swap and talking-head manipulation of public figures, motivated by concerns about large-scale misinformation and video-evidence fraud. This position paper argues that the threat the field prepared for did not arrive, and the threats that did arrive are substantially different. An accounting of deepfake incidents in 2022–2026 shows that the dominant observed harms are peer-generated Non-Consensual Intimate Imagery (NCII), voice-clone scam calls targeting families and finance workers, and emotional-manipulation fraud. The predicted large-scale public-figure deepfake catastrophe did not materialize during the 2024 global information environment despite extensive preparation. Meanwhile, research effort, benchmarks, and detection methods remain concentrated on the inherited threat model. The central claim of this paper is that this misalignment is now the dominant bottleneck on real-world deepfake defense, not model capability. We argue the ML research community should substantially rebalance its research agenda toward the harm categories that are actually growing. We support this position with empirical accounting of research effort and harm distribution, identify the structural reasons the misalignment persists, and outline three concrete technical research agendas for the under-defended harm categories.

[AI-57] Clausal Deletion Backdoors for QBF: a Parameterized Complexity Approach

【Quick Read】: This paper tackles the parameterized complexity of deciding quantified Boolean formula (QBF) validity: unlike the many parameterized results for problems in NP, the PSPACE-completeness of QBF usually forces restrictions such as bounding quantifier alternations, and the goal here is a fixed-parameter tractable (FPT) algorithm given a clause covering (CC) backdoor of size k. The key to the solution is a new parameter, the number of variables that must be removed from clauses before reaching a tractable base class (Horn, 2-CNF, or linear equations), together with dedicated algorithms per base class: W[1]-hardness is established for Horn (ruling out an FPT algorithm under standard assumptions), while for 2-CNF and linear equations non-trivial algorithms based on propagation and Gaussian elimination, respectively, yield FPT-time solving, pinpointing the tractability frontier of QBF under this parameter.

Link: https://arxiv.org/abs/2605.12073
Authors: Leif Eriksson, Victor Lagerkvist, Sebastian Ordyniak, George Osipov, Fahad Panolan, Mateusz Rychlicki
Institution: Unknown
Subjects: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Determining the validity of a quantified Boolean formula (QBF) is a PSPACE-complete problem with rich expressive power. Despite interest in efficient solvers, there is, compared to problems in NP, a lack of positive theoretical results, and in the parameterized complexity setting one often has to restrict the quantifier prefix (e.g., bounding alternations) to obtain fixed parameter tractability (FPT). We propose a new parameter: the number of variables in clauses that has to be removed before reaching a tractable class (a clause covering (CC) backdoor). We are then interested in solving QBF in FPT time given a CC-backdoor of size k . We consider the three classical, tractable cases of QBF as base classes: Horn, 2-CNF, and linear equations. We establish W[1]-hardness for Horn but prove FPT for the others, and prove that in a precise, algebraic sense, we are only missing one important case for a full dichotomy. Our algorithms are non-trivial and depend on propagation, and Gaussian elimination, respectively, and are comparably unexplored for QBF.
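The parameter itself can be illustrated with a brute-force search, under the simplifying assumptions that "deletion" removes every literal of the chosen variables from every clause and that the base class is 2-CNF; the paper's FPT algorithms are of course far more clever than this exponential-in-k sketch.

```python
from itertools import combinations

def is_2cnf(clauses):
    """Base-class test: every clause has at most two literals."""
    return all(len(c) <= 2 for c in clauses)

def find_cc_backdoor(clauses, k):
    """Smallest set of <= k variables whose deletion from all clauses
    leaves a 2-CNF formula; clauses are lists of nonzero integer literals."""
    variables = {abs(l) for c in clauses for l in c}
    for size in range(k + 1):
        for cand in combinations(sorted(variables), size):
            reduced = [[l for l in c if abs(l) not in cand] for c in clauses]
            if is_2cnf(reduced):
                return list(cand)
    return None
```

Solving the QBF after deleting the backdoor variables still requires handling their assignments, which is where the propagation machinery of the paper comes in; this sketch only exhibits the parameter.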

[AI-58] Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

【Quick Read】: This paper addresses a critical failure mode of PPO-style off-policy correction in asynchronous reinforcement learning: in heterogeneous training systems, the total importance ratio should decompose into two semantically distinct factors, a training-inference discrepancy term and a policy-staleness term, but practical asynchronous pipelines with delayed updates and partial rollouts often lose the historical training-side logits (old logits). This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. The key to the solution is to recover or substitute the missing old logits along two routes: exact correction via snapshot-based version tracking, a dedicated old-logit model, or synchronization through partial rollout interruption, whose system trade-offs are compared; and approximate correction that preserves the benefits of decoupled correction when exact old logits cannot be recovered cheaply, adopting a revised PPO-EWMA method that avoids extra system overhead while improving both training speed and optimization performance.

Link: https://arxiv.org/abs/2605.12070
Authors: Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Xiong Jun Wu, Likang Wu, Hongke Zhao
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training–inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code at this https URL.
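The two-factor decomposition can be written out per token in log-probability space: the total ratio pi_current/pi_inference factors into a discrepancy term pi_train_old/pi_inference and a staleness term pi_current/pi_train_old, which only works if the training-side old logits actually exist. The function names and the standard PPO clip are illustrative, not the paper's revised PPO-EWMA.

```python
import math

def decomposed_ratio(logp_current, logp_train_old, logp_inference):
    """Split the total importance ratio into its two intended factors."""
    discrepancy = math.exp(logp_train_old - logp_inference)  # train vs. inference
    staleness = math.exp(logp_current - logp_train_old)      # current vs. old policy
    return discrepancy, staleness, discrepancy * staleness

def clipped_update_weight(ratio, eps=0.2):
    """Standard PPO-style clipping of a single importance ratio."""
    return max(min(ratio, 1 + eps), 1 - eps)
```

When `logp_train_old` is missing, implementations are forced to collapse both factors into one ratio and clip that, which is exactly the semantic entanglement the abstract describes.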

[AI-59] SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

【速读】:该论文试图解决语言智能体(language agents)中长期记忆(long-term memory)成为核心瓶颈的问题。现有基于检索增强生成(RAG)和图检索增强生成(GraphRAG)的系统将记忆图(memory graphs)视为静态检索中间件,这限制了它们从部分线索中恢复完整证据链(evidence chains)、利用可复用的图结构角色(graph-structural roles)以及通过下游反馈改进记忆本身的能力。解决方案的关键是引入SAGE(Self-evolving Agentic Graph-memory Engine),一个自演化的智能体图记忆引擎,它将图记忆建模为动态的长期记忆基质(dynamic long-term memory substrate)。SAGE耦合了两个角色:一个记忆写入器(memory writer),从交互历史中增量构建结构化图记忆(structured graph memory);以及一个基于图基础模型(Graph Foundation Model)的记忆读取器(memory reader),执行检索并向记忆写入器提供反馈。通过这种读者-写入者反馈循环(reader-writer feedback),SAGE实现了记忆的自演化,从而在多跳问答(multi-hop QA)、开放域检索(open-domain retrieval)、特定领域评论问答(domain-specific review QA)和长期智能体记忆基准测试(long-term agent-memory benchmarks)中显著提升了证据恢复、答案依据(answer grounding)和检索效率。

链接: https://arxiv.org/abs/2605.12061
作者: Juntong Wang,Haoyue Zhao,guanghui Pan,Xiyuan Wang,Yanbo Wang,Qiyan Deng,Muhan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-term memory is becoming a central bottleneck for language agents. Existing RAG and GraphRAG systems largely treat memory graphs as static retrieval middleware, which limits their ability to recover complete evidence chains from partial cues, exploit reusable graph-structural roles, and improve the memory itself through downstream feedback. We introduce SAGE, a Self-evolving Agentic Graph-memory Engine that models graph memory as a dynamic long-term memory substrate. SAGE couples two roles: a memory writer that incrementally constructs structured graph memory from interaction histories, and a Graph Foundation Model-based memory reader to perform retrieval and provide feedback to the memory writer. We provide rigorous theoretical analyses supporting the framework. Across multi-hop QA, open-domain retrieval, domain-specific review QA, and long-term agent-memory benchmarks, SAGE improves evidence recovery, answer grounding, and retrieval efficiency: after two self-evolution rounds, it achieves the best average rank on multi-hop QA; in zero-shot open-domain transfer, it reaches 82.5/91.6 Recall@2/5 on NQ. Further results on LongMemEval and HaluMem show that training and reader-writer feedback improve multiple long-term memory and hallucination-diagnostic metrics, suggesting that self-evolving, structure-aware graph memory is a promising foundation for robust long-horizon language agents.

[AI-60] Hölder Policy Optimisation

【速读】:该论文试图解决GRPO(Group Relative Policy Optimisation)中固定聚合机制(fixed aggregation mechanism)在将轨迹级优势映射到策略更新时导致的适应性不足问题——具体表现为训练崩溃(training collapse)与性能不佳之间的权衡。其解决方案之关键是HölderPO框架,它通过Hölder均值(Hölder mean)统一聚合token级概率,并引入可调参数 p 连续控制梯度集中度(gradient concentration)与方差界(variance bounds)之间的权衡;同时,为了克服静态配置无法同时兼顾稳定性与集中性的固有限制,论文进一步实例化了一个动态退火算法(dynamic annealing algorithm),在训练过程中逐步调度 p 参数,从而实现更优的稳定性与收敛性。

链接: https://arxiv.org/abs/2605.12058
作者: Yuxiang Chen,Dingli Liang,Yihang Chen,Ziqin Gong,Chenyang Le,Zhaokai Wang,Jiachen Zhu,Lingyu Yang,Jianghao Lin,Weinan Zhang,Jun Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm’s adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, yielding a substantial 7.2% relative gain over standard GRPO and securing an exceptional 93.8% success rate on ALFWorld.
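The Hölder (power) mean at the heart of HölderPO is easy to state concretely. A minimal sketch with toy token probabilities, plus a hypothetical linear annealing schedule for p (the paper's actual schedule may differ):

```python
import numpy as np

def holder_mean(x, p):
    """Hölder (power) mean of positive values x with exponent p."""
    x = np.asarray(x, dtype=float)
    if p == 0:                       # limit case: geometric mean
        return float(np.exp(np.mean(np.log(x))))
    return float(np.mean(x ** p) ** (1.0 / p))

# Token-level probabilities of one sampled sequence (toy values):
token_probs = [0.9, 0.7, 0.2, 0.8]

# Large p concentrates on the largest token probability, while a small
# (negative) p is dominated by the smallest one:
agg_hi = holder_mean(token_probs, p=8.0)
agg_geo = holder_mean(token_probs, p=0.0)
agg_lo = holder_mean(token_probs, p=-8.0)

# A simple (hypothetical) linear annealing schedule for p:
def anneal_p(step, total, p_start=2.0, p_end=-1.0):
    t = step / total
    return p_start + t * (p_end - p_start)
```

The power mean is monotone in p, which is what makes p a single knob for the concentration-variance trade-off the abstract describes.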

[AI-61] OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

【速读】:该论文旨在解决全模态大语言模型(Omni-LLMs)在长视频流和密集音频序列推理中因高计算成本而难以实际部署的问题,并克服现有压缩方法因依赖固定或原生压缩单元而破坏跨模态对应关系与互补信息、导致难以在提升推理效率的同时稳定保持性能的局限性。解决方案的关键在于提出一个无需训练的OmniRefine两阶段框架:第一阶段为“对应保持的块精炼”(Correspondence-Preserving Chunk Refinement),通过计算帧-音频相似度并应用动态规划,将原生块的边界精炼为跨模态对齐的压缩单元;第二阶段为“模态感知协同压缩”(Modality-Aware Cooperative Compression),在每个精炼单元内对视频和音频令牌进行联合压缩,以减少冗余并保留关键证据。实验表明,该方法能在低令牌保留率下显著接近全令牌基线的性能,实现了更优的效率-性能权衡。

链接: https://arxiv.org/abs/2605.12056
作者: Yuchen Deng,Zidang Cai,Hai-Tao Zheng,Jie Wang,Feidiao Yang,Yuxing Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.

[AI-62] Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons

【速读】:该论文试图解决的核心问题是:在固定参数预算(parameter budget)下,如何最优地分配神经元数量(N)、每单元有效复杂性(per-unit effective complexity, k_e)和每单元连接性(per-unit connectivity, k_c),以最大化序列处理模型的性能。这一问题的提出源于对主流机器学习默认使用极简单单元(如标准RNN)与皮质神经元复杂多尺度处理特性之间差异的反思。解决方案的关键在于引入一种允许独立调整这三个维度的架构——ELM网络(ELM Network),其循环层由模仿皮质神经元功能组件的表达性漏记忆神经元(Expressive Leaky Memory neurons, ELM neurons)构成。该架构不仅能够稳定地跨越数个数量级的规模训练,还通过封闭形式的信息论模型揭示了权衡机制:收益递减在低复杂度端受限于每神经元信噪比饱和,在高复杂度端受限于跨神经元冗余。超参数扫描进一步证实了该框架下的近帕累托前沿(Pareto-frontier)标度律,从而挑战了简单单元默认最优的传统观点,并为皮质依赖复杂时空积分器的必要性提供了规范化的理论依据。

链接: https://arxiv.org/abs/2605.12049
作者: Aaron Spieler,Georg Martius,Anna Levina
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
备注: 25 pages, 21 figures, 3 tables, including derivations. Submitted for peer review

点击查看摘要

Abstract:Cortical neurons are complex, multi-timescale processors wired into recurrent circuits, shaped by long evolutionary pressure under stringent biological constraints. Mainstream machine learning, by contrast, predominantly builds models from extremely simple units, a default inherited from early neural-network theory. We treat this as a normative architectural question. How should one split a fixed parameter budget P between the number of units N , per-unit effective complexity k_e , and per-unit connectivity k_c ? What controls the optimal allocation? This calls for a model in which per-unit complexity can be tuned independently of width and connectivity. Accordingly, we introduce the ELM Network, whose recurrent layer is built from Expressive Leaky Memory (ELM) neurons, chosen to mirror functional components of cortical neurons. The architecture allows for individually adjusting N , k_e , and k_c and trains stably across orders of magnitude in scale. We evaluate the model on two qualitatively different sequence benchmarks: the neuromorphic SHD-Adding task and Enwik8 character-level language modeling. Performance improves monotonically along each of the three axes individually. Under a fixed budget, a clear non-trivial optimum emerges in their tradeoff, and larger budgets favor both more and more complex neurons. A closed-form information-theoretic model captures these tradeoffs and attributes the diminishing returns at two ends to: per-neuron signal-to-noise saturation and across-neuron redundancy. A hyperparameter sweep spanning three orders of magnitude in trainable parameters traces a near-Pareto-frontier scaling law consistent with the framework. This suggests that the simple-unit default in ML is not obviously optimal once this tradeoff surface is probed, and offers a normative lens on cortex’s reliance on complex spatio-temporal integrators.
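The budget-splitting question can be made concrete with a schematic parameter accounting. The formula P ≈ N·(k_e + k_c) below is an illustrative simplification of how a fixed budget trades off width, per-unit complexity, and connectivity; it is not the ELM network's exact parameter count:

```python
# Schematic accounting for a recurrent layer of N units, each with k_e
# internal ("per-unit complexity") parameters and k_c incoming
# connections: P ~ N * (k_e + k_c). An illustrative simplification,
# not the exact ELM-network parameter count.
def param_budget(n_units, k_e, k_c):
    return n_units * (k_e + k_c)

def allocations(budget, ke_options, kc_options):
    """Enumerate (N, k_e, k_c) triples that fit within a fixed budget."""
    out = []
    for k_e in ke_options:
        for k_c in kc_options:
            n = budget // (k_e + k_c)   # widest layer for this unit design
            if n > 0:
                out.append((n, k_e, k_c))
    return out

# Under a 1M-parameter budget, more complex units force narrower layers:
opts = allocations(1_000_000, ke_options=[64, 512, 4096],
                   kc_options=[128, 1024])
```

Sweeping such triples at several budgets is the kind of grid the paper's scaling-law analysis traverses, with performance measured at each point.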

[AI-63] Efficient and Adaptive Human Activity Recognition via LLM Backbones

【速读】:该论文试图解决基于传感器的人体活动识别(Human Activity Recognition, HAR)中现有Transformer架构模型需要从头训练、训练成本高、数据需求量大且在异构和动态部署条件下适应性差的问题。解决方案的关键在于转变范式:复用大型预训练语言模型(Large Language Models, LLMs)作为通用时间骨干网络,而非设计领域特定的Transformer。为弥合惯性时间序列与语言模型之间的模态差异,论文提出一种结构化卷积投影(structured convolutional projection),将多轴加速度计和陀螺仪信号映射到LLM的潜在空间中。预训练骨干网络保持冻结,并通过参数高效的低秩适配(Low-Rank Adaptation, LoRA)进行微调,从而大幅减少可训练参数数量和总体训练成本。该方法在标准HAR基准上实现了快速收敛、强数据效率和稳健的跨数据集迁移,尤其适用于低数据和少样本场景。

链接: https://arxiv.org/abs/2605.12019
作者: Aleksandr Bredikhin,Philippe Lalanda,German Vega
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human Activity Recognition (HAR) is a core task in pervasive computing systems, where models must operate under strict computational constraints while remaining robust to heterogeneous and evolving deployment conditions. Recent advances based on Transformer architectures have significantly improved recognition performance, but typically rely on task-specific models trained from scratch, resulting in high training cost, large data requirements, and limited adaptability to domain shifts. In this paper, we propose a paradigm shift that reuses large pretrained language models (LLMs) as generic temporal backbones for sensor-based HAR, instead of designing domain-specific Transformers. To bridge the modality gap between inertial time series and language models, we introduce a structured convolutional projection that maps multivariate accelerometer and gyroscope signals into the latent space of the LLM. The pretrained backbone is kept frozen and adapted using parameter-efficient Low-Rank Adaptation (LoRA), drastically reducing the number of trainable parameters and the overall training cost. Through extensive experiments on standard HAR benchmarks, we show that this approach enables rapid convergence, strong data efficiency, and robust cross-dataset transfer, particularly in low-data and few-shot settings. At the same time, our results highlight the complementary roles of convolutional frontends and LLMs, where local invariances are handled at the signal level while long-range temporal dependencies are captured by the pretrained backbone. Overall, this work demonstrates that LLMs can serve as a practical, frugal, and scalable foundation for adaptive HAR systems, opening new directions for reusing foundation models beyond their original language domain.
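The LoRA adaptation described above keeps the backbone frozen and trains only a low-rank update. A minimal NumPy sketch of the mechanism; dimensions, rank, and scaling are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight plus a rank-r LoRA update: only A and B train.
d_out, d_in, r = 32, 64, 4
W = rng.normal(size=(d_out, d_in))        # frozen backbone weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection (zero init)
alpha = 8.0                               # LoRA scaling hyperparameter

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A.
    return x @ (W + (alpha / r) * B @ A).T

x = rng.normal(size=(1, d_in))
y0 = lora_forward(x)       # with B = 0, identical to the frozen model

trainable = A.size + B.size               # only the adapter trains
total = W.size + trainable
```

Zero-initializing B makes the adapted model start out exactly equal to the pretrained one, which is why LoRA fine-tuning converges quickly from the frozen backbone.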

[AI-64] LLMs and the ZPD INTERSPEECH2026

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在认知本质上的争议,特别是反驳其具备“分布式表征”(distributed representations)式思考的观点,并解释其输出错误(如“幻觉”而非“梦想”)与人类常识推理能力缺失的根源。解决方案之关键在于:将LLM的机制重新定义为基于“实践”(practices)的“原始思维”(primitive thinking),而非真正的理解;同时借鉴维果茨基(Vygotsky)的最近发展区(Zone of Proximal Development, ZPD)理论,强调“互动”(interaction)是人类沟通的核心,而非“真正”理解的附属品。由此,论文主张未来研究应转向调查一套能够实现类常识行为的认知工具(cognitive tools),而不是简单依赖“护栏”(guard rails)来约束模型输出。

链接: https://arxiv.org/abs/2605.12016
作者: Peter Wallis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Short paper submitted to Interspeech 2026 (Desk Reject) 4 pages, plus references. 2 figures

点击查看摘要

Abstract:One hundred years ago Vygotsky and his circle were exploring the nature of consciousness and defining what would become psychology in the Soviet Union. They concluded that children develop “scientific thinking” through interacting with enculturated adults in Zones of Proximal Development or ZPDs. The proposal is that, contrary to the claims of some, the LLM mechanism is not doing thinking with “distributed representations,” but rather the completion model is doing “primitive thinking” in terms of practices. Viewed from this perspective, it would seem our large language models don’t hallucinate, but rather dream, and that what is needed is not “guard rails” but an investigation of the set of cognitive tools that enable us to do things that look like common-sense. The proposal here is that interaction is core to human communication rather than just an add-on to “real” understanding.

[AI-65] LegalCheck: Retrieval- and Context-Augmented Generation for Drafting Municipal Legal Advice Letters

【速读】:该论文试图解决荷兰公共部门法律部门面临的人员短缺、案件量增加以及合规压力增大等问题,通过自动化起草异议回复信件(objection response letters)来提升效率和法律一致性。解决方案的关键在于结合检索增强生成(Retrieval-Augmented Generation, RAG)和上下文增强生成(Context-Augmented Generation, CAG),利用大语言模型(Large Language Model, LLM)配合精心策划的法律知识库进行相关法律和判例检索,并通过受控提示(controlled prompting)将外部知识和案件具体细节整合到连贯的草稿中。此外,专家审核(expert-in-the-loop review)作为关键环节,确保每封生成的信件在法律上合理且语境恰当,从而在不替代人类判断的前提下显著减少工作量并保证法律标准的一致应用。

链接: https://arxiv.org/abs/2605.12012
作者: Virgill van der Meer,Julien Rossi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICAIL 2026 as Short Paper

点击查看摘要

Abstract:Public-sector legal departments in the Netherlands face acute staff shortages, increased case volumes, and increased pressure to meet regulatory compliance. This paper presents LegalCheck, a novel system that addresses these challenges by automating the drafting of objection response letters through a combination of Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG). Using a large language model (LLM) alongside curated legal knowledge bases, LegalCheck performs retrieval of relevant laws and precedents, and uses controlled prompting to incorporate both external knowledge and case-specific details into a coherent draft. An expert-in-the-loop review ensures that each generated letter is legally sound and contextually appropriate. In a real-world deployment within the Municipality of Amsterdam, LegalCheck produced near-final advice letters in minutes rather than hours, while maintaining high legal consistency and factual accuracy. The output is based on actual regulations and prior cases, providing explainable outputs that captured the vast majority of required legal reasoning (often 80% to 100% of essential content). Legal professionals found that the system reduced their workload and ensured a consistent application of legal standards, without replacing human judgment. These results demonstrate substantial efficiency gains, improved legal consistency, and positive user acceptance. More broadly, this work illustrates how responsible AI can be deployed in the legal domain by augmenting LLMs with domain knowledge and governance mechanisms.
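The retrieve-then-draft flow can be illustrated with a toy sketch: keyword-overlap retrieval over a tiny knowledge base, followed by prompt assembly. The entries, the retriever, and the template below are hypothetical placeholders, not the deployed LegalCheck components:

```python
# Hypothetical knowledge-base entries standing in for curated law texts.
KNOWLEDGE_BASE = {
    "objection-deadline": "Article 6: objections must be filed within six weeks.",
    "permit-refusal": "Article 4: permit refusals require a written motivation.",
}

def retrieve(query, kb, top_k=1):
    # Rank entries by word overlap with the query (a stand-in for the
    # system's actual retriever).
    q = set(query.lower().split())
    ranked = sorted(kb.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in ranked[:top_k]]

def build_prompt(case_facts, query):
    # Controlled prompting: retrieved law (RAG) plus case specifics (CAG).
    context = "\n".join(retrieve(query, KNOWLEDGE_BASE))
    return (f"Relevant law:\n{context}\n\n"
            f"Case facts:\n{case_facts}\n\n"
            "Draft an objection response letter citing the law above.")

prompt = build_prompt("Objection filed three weeks after the fine.",
                      "objections filed weeks after fine")
```

In the deployed system the draft produced from such a prompt then passes through the expert-in-the-loop review stage before being sent.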

[AI-66] CR2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference

【速读】:该论文试图解决在移动边缘环境中,大语言模型(LLMs)从集中云部署转向无线边缘时,现有查询级路由(query-level routing)方法仅优化token级成本、未能捕捉动态延迟与能耗开销,导致无法在受限设备-边缘资源下有效平衡准确性、延迟与能耗的问题。解决方案的关键在于提出了一个两阶段设备-边缘路由框架CR²,它将路由问题形式化为受部署约束且具有成本意识的决策问题,核心创新包括:一是解耦一个轻量级设备端边界门(margin gate)和一个边缘端效用选择器(utility selector),其中边界门基于冻结的查询嵌入和用户指定的成本权重,在本地执行前预测设备端推理相对于最佳边缘替代方案的效用优劣;二是引入保形风险控制(conformal risk control, CRC)校准程序,将每个操作点映射到一个接受阈值,从而在全信息效用参考下实现对边缘端错误接受风险(false-acceptance risk)的显式控制。

链接: https://arxiv.org/abs/2605.12001
作者: Nan Xue,Shengkang Chen,Zhiyong Chen,Jiangchao Yao,Yaping Sun,Zixia Hu,Meixia Tao
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: submitted to IEEE Journal

点击查看摘要

Abstract:As large language models (LLMs) move from centralized clouds to mobile edge environments, efficient serving must balance latency, energy consumption, and accuracy under constrained device-edge resources. Query-level routing between lightweight on-device models and stronger edge models provides a flexible mechanism to navigate this trade-off. However, existing routers are designed for centralized cloud settings and optimize token-level costs, failing to capture the dynamic latency and energy overheads in wireless edge deployments. In this paper, we formulate mobile edge LLM routing as a deployment-constrained, cost-aware decision problem, and propose CR², a two-stage device-edge routing framework. CR² decouples a lightweight on-device margin gate from an edge-side utility selector for deferred queries. The margin gate operates on frozen query embeddings and a user-specified cost weight to predict whether local execution is utility-optimal relative to the best edge alternative under the target operating point. We further introduce a conformal risk control (CRC) calibration procedure that maps each operating point to an acceptance threshold, enabling explicit control of the marginal false-acceptance risk under the full-information utility reference. Experiments on the routing task show that CR² closely matches a full-information reference router using only device-side signals before deferral. Compared with strong query-level baselines, CR² consistently improves the deployable accuracy-cost Pareto frontier and reduces normalized deployment cost by up to 16.9% at matched accuracy.
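The calibration step that maps an operating point to an acceptance threshold can be sketched with a simplified conformal-style rule. The synthetic calibration data and the +1 correction below are illustrative; the paper's exact CRC procedure may differ:

```python
import numpy as np

def crc_threshold(scores, is_error, alpha):
    """
    Return the smallest acceptance threshold tau such that, among
    calibration points with score >= tau, the empirical false-acceptance
    risk with a conformal-style +1 correction stays at or below alpha.
    A simplified sketch of conformal risk control, not the paper's rule.
    """
    scores = np.asarray(scores, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    for tau in np.sort(scores):                 # candidates, low to high
        accepted = scores >= tau
        risk = (is_error[accepted].sum() + 1) / (accepted.sum() + 1)
        if risk <= alpha:
            return float(tau)
    return float("inf")                         # no safe threshold: defer all

# Synthetic calibration set: low-score queries are exactly the ones the
# on-device model gets wrong (illustrative, not real margin-gate data).
scores = np.linspace(0.0, 1.0, 100)
errors = scores < 0.5
tau = crc_threshold(scores, errors, alpha=0.2)
```

At deployment, queries whose gate score falls below tau are deferred to the edge model; lowering alpha pushes tau up and defers more queries.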

[AI-67] The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

【速读】:该论文试图解决在大型语言模型(LLM)服务中,功率限制(power capping)作为标准GPU能效杠杆在自回归解码阶段失效的问题。摘要指出,在NVIDIA H200 GPU上,解码阶段因受内存带宽限制而非计算限制,实际功耗仅为137–300 W(远低于700 W的额定功耗),导致功率限制从未触发,同时固件发起的时钟节流(firmware-initiated clock throttling)会进一步混淆测量结果,使性能分析失真。解决方案的关键在于采用流处理器时钟锁定(SM clock locking),直接作用于实际关键路径——内存带宽,从而消除功率限制和固件节流的混淆效应;该机制在所有测试的注意力范式(GQA、MLA、Gated DeltaNet和Mamba2)中均帕累托优于功率限制,能在最小吞吐量损失下回收高达32%的解码能量。此外,论文还识别了三种架构相关的DVFS行为类别,并揭示了新型注意力替换结构共同的能量模式:通过增加预填充(prefill)阶段的成本换取高效的解码,最终在生产批量下将总请求能耗降低至GQA的一半。

链接: https://arxiv.org/abs/2605.11999
作者: Bole Ma,Ayesha Afzal,Jan Eitzinger,Gerhard Wellein
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Power capping is the standard GPU energy lever in LLM serving, and it appears to work: throughput drops, power readings fall, and energy budgets are met. We show the appearance is illusory for the phase that dominates production serving: autoregressive decode. Across four attention paradigms – GQA, MLA, Gated DeltaNet, and Mamba2 – on NVIDIA H200, decode draws only 137–300 W on a 700 W GPU; no cap ever triggers, because memory-bound decode saturates HBM bandwidth rather than compute and leaves power headroom untouched. Firmware-initiated clock throttling compounds the illusion: these deviations can corrupt any throughput measurement that attributes them to the cap. SM clock locking dissolves both confounds. By targeting the lever that is actually on the critical path, clock locking Pareto-dominates power capping universally, recovering up to 32% of decode energy at minimal throughput loss. We identify three architecture-dependent DVFS behavioural classes and characterise a common energy pattern across novel attention replacements: a heavy prefill cost recouped by efficient decode, eventually halving total request energy relative to GQA at production batch sizes.
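The core arithmetic is that decode energy per token is power draw divided by throughput, so a cap set above the actual draw changes nothing, while a clock lock that cuts power faster than throughput helps. The wattages and token rates below are illustrative values loosely in the ranges quoted above, not measurements from the paper:

```python
# Back-of-envelope model: decode energy per token = power / throughput.
def energy_per_token(power_w, tokens_per_s):
    return power_w / tokens_per_s              # joules per generated token

baseline = energy_per_token(power_w=250.0, tokens_per_s=100.0)
# A 400 W cap never triggers during decode (actual draw is 250 W on a
# 700 W GPU), so throughput and power are unchanged:
power_cap = energy_per_token(power_w=250.0, tokens_per_s=100.0)
# Locking SM clocks lower cuts power much faster than throughput on
# memory-bound decode, reducing energy per token:
clock_lock = energy_per_token(power_w=160.0, tokens_per_s=95.0)
```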

[AI-68] BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLM s with Soft Prompts

【速读】:该论文试图解决知识图谱增强的大型语言模型(KG-enhanced LLMs)在双通道架构(文本通道与图条件软提示通道)下,现有文本通道后门攻击失效的问题,并揭示图通道本身存在被恶意操纵的风险。解决方案的关键在于提出BadSKP后门攻击方法,通过多阶段优化策略直接针对图到提示(graph-to-prompt)接口:首先构建对抗目标嵌入,然后优化毒化节点嵌入以诱导生成的软提示偏向对抗语义,最终将优化后的表示近似为流畅的对抗节点属性,从而绕过语义锚定效应,在冻结和木马设置下均实现高攻击成功率,而纯文本攻击即使在困惑度防御下仍不可靠。

链接: https://arxiv.org/abs/2605.11996
作者: Xiaoting Lyu,Yufei Han,Hangwei Qian,Haoyuan Yu,Xiang Ao,Bin Wang,Chenxu Wang,Xiaobo Ma,Wei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent knowledge graph (KG)-enhanced large language models (LLMs) move beyond purely textual knowledge augmentation by encoding retrieved subgraphs into continuous soft prompts via graph neural networks, introducing a graph-conditioned channel that operates alongside the standard text interface. However, existing backdoor attacks are largely designed for the textual channel, and their effectiveness against this dual-channel architecture remains unclear. We show that this architecture creates a robustness gap: text-channel backdoor attacks that readily compromise textual KG prompting systems become largely ineffective against soft-prompt-based counterparts. We interpret this gap through semantic anchoring, whereby graph-derived soft prompts bias the generation-driving hidden state toward query-consistent semantics and suppress surface-level malicious instructions. Because this anchoring effect is itself induced by the graph channel, an attacker who manipulates graph-level representations can in turn redirect it toward adversarial semantics. To demonstrate this risk, we propose BadSKP, a backdoor attack that targets the graph-to-prompt interface through a multi-stage optimization strategy: it constructs adversarial target embeddings, optimizes poisoned node embeddings to steer the induced soft prompt, and approximates the optimized representations with fluent adversarial node attributes. Experiments on two soft-prompt KG-enhanced LLMs across four datasets show that BadSKP achieves high attack success under both frozen and trojaned settings, while text-only attacks remain unreliable even under perplexity-based defenses.
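The embedding-optimization stage of the attack can be sketched as plain gradient descent toward an adversarial target in prompt space. The linear graph-to-prompt map, dimensions, and targets below are hypothetical stand-ins for the paper's GNN encoder and optimization objective:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
proj = rng.normal(size=(d, d)) / np.sqrt(d)   # stand-in graph-to-prompt map
target = rng.normal(size=d)                   # adversarial target embedding
node = rng.normal(size=d)                     # poisoned node embedding

def prompt_loss(node_emb):
    # Distance between the induced soft prompt and the adversarial target.
    return float(np.sum((proj @ node_emb - target) ** 2))

lr = 0.05
before = prompt_loss(node)
for _ in range(200):
    grad = 2.0 * proj.T @ (proj @ node - target)  # gradient of the loss
    node -= lr * grad
after = prompt_loss(node)
```

In the full attack, the optimized embedding is then approximated by fluent adversarial node attributes so the poisoned node survives the text-to-graph pipeline.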

[AI-69] Random-Set Graph Neural Networks

【速读】:该论文旨在解决图神经网络(GNN)中因数据固有性质(如噪声、缺失边、过平滑等)引起的不确定性对预测性能的负面影响,特别是如何有效量化并区分偶然不确定性(aleatoric uncertainty)和认知不确定性(epistemic uncertainty)。解决方案的关键在于提出一种基于信念函数(belief function)即有限随机集(finite random set)形式主义的新框架,通过构建随机集图神经网络(RS-GNN),在节点级别用信念函数头(belief-function head)预测类别上的随机集,从而同时得到精确的概率预测和认知不确定性度量。实验在9个图学习数据集(包括真实世界自动驾驶基准Nuscene和ROAD)上验证了其优越的不确定性量化能力。

链接: https://arxiv.org/abs/2605.11987
作者: Tommy Woodley,Shireen Kudukkil Manchingal,Matteo Tolloso,Davide Bacciu,Fabio Cuzzolin
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注: 23 pages, 6 figures

点击查看摘要

Abstract:Uncertainty quantification has become an important factor in understanding the data representations produced by Graph Neural Networks (GNNs). Although their predictive capabilities are widely useful across industrial settings, the inherent uncertainty induced by the nature of the data remains a major limiting factor for GNN performance. While aleatoric uncertainty is the result of noisy and incomplete stochastic data such as missing edges or over-smoothing, epistemic uncertainty arises from lack of knowledge about a system or model (e.g., a graph’s topology or node feature representation), which can be reduced by gathering more data and information. In this paper, we propose an original new framework in which node-level epistemic uncertainty is modelled in a belief function (finite random set) formalism. The resulting Random-Set Graph Neural Networks have a belief-function head predicting a random set over the list of classes, from which both a precise probability prediction and a measure of epistemic uncertainty can be obtained. Extensive experiments on 9 different graph learning datasets, including real-world autonomous driving benchmarks such as nuScenes and ROAD, demonstrate RS-GNN’s superior uncertainty quantification capabilities.
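The belief-function output described above can be illustrated with toy masses over subsets of classes, from which a precise ("pignistic") probability and a simple epistemic-uncertainty score follow. The masses and the uncertainty score are illustrative, not RS-GNN's learned output or its exact measure:

```python
# A random-set prediction over classes {A, B, C}: belief mass sits on
# *subsets* of classes, so mass on non-singletons encodes epistemic
# uncertainty.
masses = {
    frozenset({"A"}): 0.5,
    frozenset({"B"}): 0.1,
    frozenset({"A", "B"}): 0.3,       # "A or B, cannot tell which"
    frozenset({"A", "B", "C"}): 0.1,  # mass expressing total ignorance
}

def pignistic(masses):
    """Precise probabilities: spread each subset's mass over its elements."""
    p = {}
    for subset, m in masses.items():
        for cls in subset:
            p[cls] = p.get(cls, 0.0) + m / len(subset)
    return p

p = pignistic(masses)
# One simple epistemic-uncertainty score: total mass on non-singletons.
epistemic = sum(m for s, m in masses.items() if len(s) > 1)
```

A fully confident prediction would put all mass on one singleton, driving the non-singleton mass (and hence the epistemic score) to zero.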

[AI-70] On the Limitations of Large Language Models for Conceptual Database Modeling

【速读】:该论文试图解决利用大型语言模型(LLMs)从自然语言需求自动生成实体-关系(Entity-Relationship, ER)图以支持关系数据库概念建模的问题,核心挑战在于确保生成模型在概念上一致地识别实体、关系和属性。解决方案的关键在于将多种LLMs与不同的提示工程技术(包括Zero-Shot、Chain of Thought以及Chain of Thought + Verifier)相结合,并对同一需求场景进行渐进复杂度的实验评估,通过定性的结构性和语义一致性分析来考察模型的表现,以判断其在复杂场景下的可靠性与局限性。

链接: https://arxiv.org/abs/2605.11986
作者: Arthur F. Siqueira,Carlos D. S. Nogueira,Eduarda Farias,Claudio E. C. Campelo,Júlia Menezes
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This article analyzes the use of Large Language Models (LLMs) as support for the conceptual modeling of relational databases through the automatic generation of Entity-Relationship (ER) diagrams from natural language requirements. The approach combines different language models with prompt engineering techniques to evaluate their ability to identify entities, relationships, and attributes in a conceptually consistent manner. The experimental evaluation involved three LLMs, each subjected to three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier), applied to the same requirements scenario with progressively increasing complexity. The generated diagrams were qualitatively analyzed through direct comparison with the textual requirements, considering the structural and semantic adherence of the modeled elements. The results indicate that, although LLMs show reasonable performance in less complex scenarios, their reliability decreases as the complexity of the requirements increases, with a rise in inconsistencies, ambiguities, and failures in representing constraints. These findings reinforce that, in their current state, LLMs are not sufficiently mature for reliable use in complex scenarios, and the cost of validation may offset the apparent productivity gains.

[AI-71] Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation

【速读】:该论文旨在解决非视距(Non-Line-of-Sight, NLOS)交叉路口因驾驶员视野受限而导致的碰撞风险,尤其是现有车联网(V2X)警告和集体感知(Collective Perception, CP)方案无法有效影响未联网车辆或忽视车内警报的驾驶员。解决方案的关键在于引入一个协作人形机器人作为主动交通调节器:系统采用双摄像头基础设施单元检测主路接近车辆的位置、速度与运动状态,并以集体感知消息(Collective Perception Message, CPM)传输给机器人;机器人同时通过车载V2X单元接收来自联网车辆的合作感知消息(Cooperative Awareness Message, CAM)并可中继转发分散式环境通知消息(Decentralized Environmental Notification Message, DENM)。一个融合模块合并这些感知流以维持鲁棒的实时主路视图,并基于定义的“危险区”(Zone of Danger, ZoD)预测汇入车辆是否存在碰撞风险。当风险被检测到时,机器人发出类人STOP手势并物理阻挡汇入路径,直至危险解除。该方案通过物理干预克服了纯感知与警告的局限性,实现了真实NLOS条件下的安全汇入控制。

链接: https://arxiv.org/abs/2605.11972
作者: Mohammad Khoshkdahan,John Pravin Arockiasamy,Andy Flores Comeca,Alexey Vinel
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
备注: Accepted for publication in the Proceedings of the 2026 IEEE Vehicular Technology Conference (VTC2026-Spring)

点击查看摘要

Abstract:Collisions at non-line-of-sight (NLOS) intersections remain a major safety concern because drivers have limited visibility of approaching traffic. V2X-based warnings can reduce these risks, yet many vehicles are not equipped with V2X and drivers may ignore in-vehicle alerts. Collective perception (CP) can compensate for low V2X penetration by extending the awareness of connected vehicles, but it cannot influence unconnected vehicles. To fill this gap, our work introduces a complementary concept that adds a cooperative humanoid robot as an active traffic moderator capable of physically stopping a vehicle that attempts to merge into an unseen traffic stream. The system operates on two parallel perception pathways. A dual-camera infrastructure unit detects the position, speed and motion of approaching vehicles and transmits this information to the robot as a collective perception message (CPM). The robot also receives cooperative awareness messages (CAM) from connected vehicles through its onboard V2X unit and can act as a relay for decentralized environmental notification messages (DENM) when safety events originate elsewhere along the road. A fusion module combines these streams to maintain a robust real-time view of the main road. A Zone of Danger (ZoD) is defined and used to predict whether an approaching vehicle creates a collision risk for a merging road user. When such a risk is detected, the robot issues a human-like STOP gesture and blocks the merging path until the hazard disappears. The full system was deployed at the Future Mobility Park (FMP) in Rotterdam. Experiments show that the combined vision and V2X perception allows the robot to detect approaching vehicles early, predict hazards reliably and prevent unsafe merges in real-world NLOS conditions.
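The zone-of-danger logic reduces to a time-to-conflict comparison: an approaching vehicle is dangerous if it reaches the conflict point before the merge completes, plus a safety buffer. The distances, speeds, and buffer below are illustrative, not the parameters deployed at the Future Mobility Park:

```python
# Minimal zone-of-danger check for a main-road vehicle at distance d (m)
# approaching at speed v (m/s), against a merge that takes t_merge seconds.
def in_zone_of_danger(distance_m, speed_mps, t_merge_s, buffer_s=1.5):
    if speed_mps <= 0:
        return False                   # stopped vehicle: no conflict
    time_to_conflict = distance_m / speed_mps
    return time_to_conflict < t_merge_s + buffer_s

# 40 m away at 10 m/s -> arrives in 4.0 s, inside the 5.5 s window:
stop_gesture = in_zone_of_danger(distance_m=40.0, speed_mps=10.0,
                                 t_merge_s=4.0)
# 120 m away -> arrives in 12 s, merge is safe:
safe = in_zone_of_danger(distance_m=120.0, speed_mps=10.0, t_merge_s=4.0)
```

In the deployed system the inputs to such a check come from the fused CPM/CAM streams, and a positive result triggers the robot's STOP gesture.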

[AI-72] Assessing and Mitigating Miscalibration in LLM -Based Social Science Measurement

【速读】:该论文旨在解决大型语言模型(LLMs)在社会科学测量中因置信度校准不佳(miscalibration)而导致测量有效性受损的问题,即LLM报告的概率置信度未能真实反映其预测正确的经验概率,从而影响下游实证分析(如回归估计)的可靠性。解决方案的关键在于提出一种软标签蒸馏管道(soft label distillation pipeline):首先将LLM输出的分数及其言语化置信度(verbalized confidence)转换为一个软目标分布(soft target distribution),然后利用该分布训练一个基于编码器模型(如BERT)的较小判别式分类器,使其拟合软目标而非硬标签。该方法通过知识蒸馏将LLM的校准信息迁移到更轻量的模型中,在不依赖庞大推理成本的前提下显著降低误校准——在14个社会科学构念上的平均期望校准误差(ECE)降低43.2%,Brier分数降低34.0%,从而将校准内置为测量有效性的核心环节,而非事后处理。

链接: https://arxiv.org/abs/2605.11954
作者: Jinyuan Wang,Ningyuan Deng,Yi Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in social science as scalable measurement tools for converting unstructured text into variables that can enter standard empirical designs. Measurement validity demands more than high average accuracy: it requires well-calibrated confidence that faithfully reflects the empirical probability of each measurement being correct. This paper studies model miscalibration in LLM-based social science measurement. We begin with a case study on FOMC and show that confidence-based filtering can change downstream regression estimates when LLM confidence is miscalibrated. We then audit calibration across 14 social science constructs covering both proprietary models, including GPT-5-mini and DeepSeek-V3.2, and open-source models. Across tasks and model families, reported confidence is poorly aligned with tolerance-based correctness. As a simple mitigation, we propose a soft label distillation pipeline for calibrating BERT with an LLM. The method converts an LLM score and its verbalized confidence into a soft target distribution, then trains a smaller discriminative classifier, built on encoder models, on these targets. Averaged across datasets, this approach reduces ECE by 43.2% and Brier by 34.0%. These results suggest that LLM-based social science pipelines should treat calibration as part of measurement validity, rather than as an optional post-processing concern.
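Both ingredients of the pipeline are simple to state: the expected calibration error (ECE) metric being audited, and one plausible way to turn an LLM label plus verbalized confidence into a soft distillation target. The soft-target construction below is an illustrative assumption; the paper's exact mapping may differ:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            err += (mask.sum() / total) * abs(correct[mask].mean()
                                              - confidences[mask].mean())
    return err

def soft_target(pred_class, confidence, n_classes):
    """LLM label + verbalized confidence -> soft distillation target:
    confidence on the predicted class, remainder spread uniformly."""
    t = np.full(n_classes, (1.0 - confidence) / (n_classes - 1))
    t[pred_class] = confidence
    return t

t = soft_target(pred_class=2, confidence=0.8, n_classes=5)
perfect = ece([0.95] * 20, [1] * 19 + [0])   # 95% confident, 95% correct
```

A well-calibrated measurement tool keeps ECE near zero; the distilled encoder classifier is trained on soft targets like `t` rather than hard labels.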

[AI-73] Counterfactual Trace Auditing of LLM Agent Skills

【速读】:该论文旨在解决当前大语言模型智能体(Large Language Model agent)技能评估方法的局限性,即现有基准(benchmark)仅报告技能附加前后的通过率(pass rate),将技能视为黑箱变化,无法捕捉技能对智能体行为的具体影响。解决方案的关键是提出了反事实追踪审计(Counterfactual Trace Auditing, CTA)框架,该框架为每个附带技能的智能体轨迹配对一个无技能的同类任务轨迹,将二者分割成目标导向的阶段(goal-directed phases)并加以对齐,然后生成结构化的技能影响模式(Skill Influence Pattern, SIP)注释。通过这些注释,CTA能够系统描述技能如何改变智能体行为,而不仅仅是任务结果,从而揭示了即使通过率几乎不变时,技能仍会显著重塑智能体行为,并识别出通过率无法检测的多种重复效应,如模板复制、离题创建、过度规划和任务恢复等。

链接: https://arxiv.org/abs/2605.11946
作者: Xiaolin Zhou,Jinbo Liu,Li Li,Ryan A. Rossi,Xiyang Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code and data are available at this https URL

点击查看摘要

Abstract:Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black-box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. CTA pairs each with-skill agent trace with a without-skill counterpart on the same task, segments both traces into goal-directed phases, aligns the phases, and emits structured Skill Influence Pattern (SIP) annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. We instantiate CTA on SWE-Skills-Bench with Claude across 49 software engineering tasks. The resulting audit reveals a clear evaluation gap. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect. Yet CTA identifies 522 SIP instances across the same paired traces, showing that the skills substantially reshape agent behavior even when pass rate is nearly unchanged. The audit also separates several recurring effects that pass rate cannot detect, including literal template copying, off-task artifact creation, excess planning, and task recovery. Three findings emerge. First, high-baseline tasks contain most of the observed skill effects, although their pass rate is already saturated and therefore cannot reflect those effects. Second, tasks with moderate baseline performance show the most recoverable gain, but often at substantially higher token cost. Third, the dominant SIP type can be identified by baseline bucket: surface anchoring is most common on ceiling tasks and edge-case prompting is most common on mid-range and floor tasks. These regularities turn informal failure mode observations into reproducible behavioral measurements.

[AI-74] From Noise to Diversity: Random Embedding Injection in LLM Reasoning

【速读】:该论文旨在解决软提示(soft prompt)研究中一个未被仔细区分的问题:推理性能的提升究竟来自提示中学习到的内容,还是来自提示注入这一结构性操作本身。解决方案的关键在于提出随机软提示(Random Soft Prompts, RSPs),完全移除训练步骤,在输入中附加一组从预训练嵌入表(embedding table)的逐元素均值和方差拟合的各向同性高斯分布中随机采样的向量序列。这些序列不携带任何学习内容,但能在数学推理基准上达到与优化软提示相当的准确率。其机制分为两个阶段:注意力机制必须处理一个从未见过的随机位置,导致前几个生成令牌(token)的分布变平,推理轨迹出现分支;随着生成继续,这种影响自然稀释,响应最终收敛到单一完成。RSP通过提升早期令牌多样性并配合温度采样,扩大了Pass@N(N次尝试中至少一次正确的概率),并将该结构效应从推理扩展到DAPO训练中。这一方法隔离了注入行为本身带来的结构效应,为理解各类软提示变体的共同作用机制提供了统一视角。

链接: https://arxiv.org/abs/2605.11936
作者: Heejun Kim,Seungpil Lee,Jewon Yeom,Jaewon Sok,Seonghyeon Park,Jeongjae Park,Taesup Kim,Sundong Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, 5 figures, 6 tables. Under review

点击查看摘要

Abstract:Recent soft prompt research has tried to improve reasoning by inserting trained vectors into LLM inputs, yet whether the gain comes from the learned content or from the act of injection itself has not been carefully separated. We study Random Soft Prompts (RSPs), which drop the training step entirely and append a freshly drawn sequence of random embedding vectors to the input. Each RSP vector is sampled from an isotropic Gaussian fitted to the entrywise mean and variance of the pretrained embedding table; the sequence carries no learned content, and yet reaches accuracy comparable to optimized soft prompts on math reasoning benchmarks in several settings. The mechanism unfolds in two stages: because attention has to absorb a never-seen-before random position, the distribution over the first few generated tokens flattens and reasoning trajectories branch, and as generation continues this influence dilutes naturally so the response commits to a single completion. We show that during inference RSPs lift early-stage token diversity and, combined with temperature sampling, widen Pass@N, the probability that at least one out of N attempts is correct. Beyond inference, we carry the same effect into DAPO training and demonstrate practical gains. Our contributions are: (i) RSP isolates the simplest form of soft prompt – training-free, freshly resampled – providing a unified lens for the structural effect of injection that variants otherwise differing in training and form all share; (ii) a theoretical and empirical validation of the underlying mechanism; and (iii) an extension from inference to training.
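上述采样过程可用一段极简的 Python 草图示意(假设性实现,仅依据摘要描述写成,并非论文官方代码;玩具嵌入表与各变量名均为示意):

```python
import numpy as np

def sample_random_soft_prompt(embedding_table, prompt_len, rng):
    """采样一个 RSP:各向量独立采自一个各向同性高斯分布,
    其均值与方差拟合自嵌入表的逐元素统计量(此处按标量理解,属假设)。"""
    mu = float(embedding_table.mean())
    sigma = float(embedding_table.std())
    dim = embedding_table.shape[1]
    return rng.normal(loc=mu, scale=sigma, size=(prompt_len, dim))

rng = np.random.default_rng(0)
# 玩具"预训练"嵌入表:1000 个 token,维度 16
table = rng.normal(loc=0.01, scale=0.02, size=(1000, 16))
rsp = sample_random_soft_prompt(table, prompt_len=8, rng=rng)
```

采出的序列不携带任何学习内容;按摘要所述,每次推理前重新采样,再将其拼接到输入嵌入之前即可。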

[AI-75] When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

【速读】:该论文试图解决的问题是工具使用语言代理(tool-use language agents)在从受控基准测试环境迁移到真实部署场景时面临的鲁棒性缺失,具体表现为用户输入错误、API配置异常、工具名称冲突等部署噪声(deployment noise)导致的代理性能显著下降。论文将这些失败建模为工具使用部分可观测马尔可夫决策过程(POMDP)中的模拟到现实(sim-to-real)差距,并通过引入RobustBench-TC基准(包含22种按POMDP四组件分类的扰动类型)系统量化了该差距。解决方案之关键在于提出了ToolRL-DR,一种基于域随机化(domain randomization)的强化学习(RL)方法,通过在扰动增强的轨迹(perturbation-augmented trajectories)上训练代理,覆盖观察(observation)、动作空间(action space)和奖励相关元数据(reward-relevant metadata)这三个静态可编码的POMDP组件。该方法使3B参数的骨干模型在保留约四分之三干净准确率的同时,将扰动下的聚合准确率提升至与14B开源函数调用基线相当,并显著缩小了与闭源模型o4-mini的差距;尤其值得注意的是,尽管训练中未包含转换扰动(transition perturbations),但ToolRL-DR仍缩小了约27%的转换扰动性能差距,表明对静态工具输入的对抗性RL训练诱导出更持久的重试策略,该策略可泛化至未见过的运行时故障。

链接: https://arxiv.org/abs/2605.11928
作者: Xiaolin Zhou,Aojie Yuan,Zheng Luo,Zipeng Ling,Xixiao Pan,Yicheng Gao,Haiyue Zhang,Jiate Li,Shuli Jiang,Prince Zizhuang Wang,Zixuan Zhu,Jinbo Liu,Ryan A. Rossi,Hua Wei,Xiyang Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Dataset, code, and benchmark leaderboard are available at this https URL and this https URL

点击查看摘要

Abstract:Tool-use language agents are evaluated on benchmarks that assume clean inputs, unambiguous tool registries, and reliable APIs. Real deployments violate all these assumptions: user typos propagate into hallucinated tool names, a misconfigured request timeout can stall an agent indefinitely, and duplicate tool names across servers can freeze an SDK. We study these failures as a sim-to-real gap in the tool-use partially observable Markov decision process (POMDP), where deployment noise enters through the observation, action space, reward-relevant metadata, or transition dynamics. We introduce RobustBench-TC, a benchmark with 22 perturbation types organized by these four POMDP components, each grounded in a verified GitHub issue or documented tool-calling failure. Across 21 models from 1.5B to 32B parameters (including the closed-source o4-mini), the robustness profile is sharply uneven: observation perturbations reduce accuracy by less than 5%, while reward-relevant and transition perturbations reduce accuracy by roughly 40% and 30%, respectively; scale alone does not close these gaps. We then propose ToolRL-DR, a domain-randomization reinforcement learning (RL) recipe that trains a tool-use agent on perturbation-augmented trajectories spanning the three statically encodable POMDP components. On a 3B backbone, ToolRL-DR-Full retains roughly three-quarters of clean accuracy and reaches an aggregate perturbed accuracy comparable to open-source 14B function-calling baselines while substantially narrowing the gap to o4-mini. It closes approximately 27% of the Transition gap despite never seeing transition perturbations in training, suggesting that RL on adversarial static tool-use inputs induces a more persistent retry policy that transfers to unseen runtime failures. The dataset, code and benchmark leaderboard are publicly available.
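RobustBench-TC 所述观察类(observation)扰动中最简单的一种,可用如下草图说明(`swap_typo` 为假设性辅助函数,并非基准的实际实现),演示用户拼写错误如何进入工具名并可能诱发幻觉工具调用:

```python
import random

def swap_typo(name, rng):
    """示意性的观察级扰动:交换相邻两个字符,模拟用户输入噪声。"""
    if len(name) < 2:
        return name
    i = rng.randrange(len(name) - 1)
    chars = list(name)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]  # 交换相邻字符
    return "".join(chars)

rng = random.Random(42)
perturbed = swap_typo("get_weather", rng)
```

域随机化训练即是在此类扰动增强后的轨迹上进行 RL;摘要指出,覆盖观察、动作空间与奖励相关元数据三个静态可编码分量即可显著缩小鲁棒性差距。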

[AI-76] Domain Restriction via Multi SAE Layer Transitions

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在特定领域应用中因通用性导致的域外(out-of-domain, OOD)交互问题,即模型行为偏离提供者意图,而现有检测方法将LLM视为不可解释的黑箱,未能利用其内部处理信息。解决方案的关键在于利用LLM层间表示过渡(layer transitions)来提取领域特异性特征,具体通过稀疏自编码器(Sparse Autoencoder, SAE)对内部动态进行编码,并设计轻量级学习方式从这些编码中区分OOD文本,从而在提升检测能力的同时提供对LLM内部输入演化的可解释性。

链接: https://arxiv.org/abs/2605.11920
作者: Elias Shaheen,Avi Mendelson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The general-purpose nature of Large Language Models (LLMs) presents a significant challenge for domain-specific applications, often leading to out-of-domain (OOD) interactions that undermine the provider’s intent. Existing methods for detecting such scenarios treat the LLM as an uninterpretable black box and overlook the internal processing of inputs. In this work we show that layer transitions provide a promising avenue for extracting a domain-specific signature. Specifically, we present several lightweight ways of learning on internal dynamics encoded using a sparse autoencoder (SAE) that exhibit great capability in distinguishing OOD texts. Building on top of SAE representation transitions enables us to better interpret the LLM internal evolution of input processing and shed light on its decisions. We provide a comprehensive analysis of the method and benchmark it with the gemma-2 2B and 9B models. Our results emphasize the efficacy of the internal process in capturing fine-grained input-related details.

[AI-77] Rethinking Positional Encoding for Neural Vehicle Routing

【速读】:该论文试图解决Transformer模型在神经组合优化(neural combinatorial optimization, NCO)应用于车辆路径问题(vehicle routing problems, VRPs)时,标准自然语言处理中的位置编码(positional encoding, PE)无法捕捉路径解特有的结构性特征(如各向异性节点距离、循环与方向感知的拓扑结构、以及以仓库为中心的分层多路径全局结构)所导致的性能瓶颈问题。解决方案的关键在于提出“几何基础”(geometric grounding)的统一设计原则,并据此设计了一种分层各向异性位置编码(hierarchical anisometric PE),该编码将距离索引的循环一致路径内编码(distance-indexed, circularly consistent in-route encoding)与以仓库为锚点的角度跨路径编码(depot-anchored angular cross-route encoding)相结合,从而在节点间距离、路径循环性和全局多路径结构三个维度同时满足路由感知的位置编码要求。

链接: https://arxiv.org/abs/2605.11910
作者: Chuanbo Hua,Federico Berto,Andre Hottung,Nayeli Gast Zepeda,Yining Ma,Zihan Ma,Paula Wong-Chung,Changhyun Kwon,Cathy Wu,Kevin Tierney,Jinkyoo Park
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer-based models have become the dominant paradigm for neural combinatorial optimization (NCO) of vehicle routing problems (VRPs), yet the role of positional encoding (PE) in these architectures remains largely unexplored. Unlike natural language, where tokens are uniformly spaced on a line, routing solutions exhibit several properties that render standard NLP positional encodings inadequate. In this work, we formalize three such structural properties that a routing-aware PE should respect, namely anisometric node distances, cyclic and direction-aware topology, and hierarchical depot-anchored global multi-route structure, combining them with a unifying design principle of geometric grounding. Guided by these criteria, we analyze and compare PE methods spanning NLP, graph-transformer, and routing-specific families, and propose a hierarchical anisometric PE that combines a distance-indexed, circularly consistent in-route encoding with a depot-anchored angular cross-route encoding. Extensive experiments across diverse VRP variants demonstrate that geometry-grounded PE consistently outperforms index-based alternatives, with gains that transfer across problem variants, model architectures, and distribution shifts.

[AI-78] Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

【速读】:该论文试图解决在 Lean 4 环境中使用大型语言模型进行自动定理证明时,现有训练数据监督粒度(supervision granularity)存在的两难问题:步级(step-level)策略预测虽提供密集局部信号但可能破坏证明过程的连贯性,而全证明(whole-proof)生成虽保留全局结构却需复杂端到端生成。解决方案之关键在于提出段级监督(segment-level supervision),这是一种从证明轨迹中提取局部连贯的证明段(proof segments)以训练策略模型的训练数据构建策略,并在推理时通过相同策略触发短自回归生成(short rollouts)来优化已有步级模型,从而更有效地对齐模型学习与证明结构及搜索过程,显著提升证明成功率并降低推理成本。

链接: https://arxiv.org/abs/2605.11905
作者: Shuo Xu,Jiakun Zhang,Junyu Lai,Chun Cao,Jingwei Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 4 figures, 6 tables

点击查看摘要

Abstract:Automated theorem proving with large language models in Lean 4 is commonly approached through either step-level tactic prediction with tree search or whole-proof generation. These two paradigms represent opposite granularities for constructing supervised training data: the former provides dense local signals but may fragment coherent proof processes, while the latter preserves global structure but requires complex end-to-end generation. In this paper, we revisit supervision granularity as a training set construction problem over proof trajectories and propose segment-level supervision, a training data construction strategy that extracts locally coherent proof segments for training policy models. We further reuse the same strategy at inference time to trigger short rollouts for existing step-level models. When trained with segment-level supervision on STP, LeanWorkbook, and NuminaMath-LEAN, the resulting policy models achieve proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, respectively, consistently outperforming both step-level and whole-proof baselines. Goal-aware rollout further improves existing step-level provers while reducing inference costs. It increases the proof success rate of BFS-Prover-V2-7B from 68.77% to 70.74% and that of InternLM2.5-StepProver from 59.59% to 60.33%, showing that appropriate supervision granularity better aligns model learning with proof structure and search. Code and models are available at this https URL.

[AI-79] AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers

【速读】:该论文旨在解决现有基于耳机的身份验证系统在用户交互需求、主动音频输出、环境噪声敏感性及可扩展性方面的不足,从而实现大规模部署的可行方案。解决方案的关键是提出一种名为AccLock的被动式(passive)认证系统,它利用从耳内心冲击图(BCG,Ballistocardiogram)信号中提取的独特生物特征进行无声无扰的用户验证。为实现这一目标,系统首先设计了包含抑制固有干扰和偶发干扰的两阶段去噪方案(two-stage denoising scheme),随后提出基于解耦(disentanglement)的深度学习模型HIDNet,显式将用户特定特征与共享干扰成分分离,最后基于孪生网络(Siamese network)构建可扩展的认证框架,免去每用户独立训练分类器的需求,从而在33名参与者实验中实现了3.13%的平均误接受率(FAR)和2.99%的平均误拒率(FRR)。

链接: https://arxiv.org/abs/2605.11901
作者: Lei Wang(Soochow University, China),Jiangxuan Shen(Soochow University, China),Xi Zhang(Macquarie University, Australia),Dalin Zhang(Aalborg University, Denmark),Jingyu Li(Peking University, China),Haipeng Dai(Nanjing University, China),Chenren Xu(Peking University, China),Daqing Zhang(Peking University, China),He Huang(Soochow University, China)
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread use of earphones has enabled various sensing applications, including activity recognition, health monitoring, and context-aware computing. Among these, earphone-based user authentication has become a key technique by leveraging unique biometric features. However, existing earphone-based authentication systems face key limitations: they either require explicit user interaction or active speaker output, or suffer from poor accessibility and vulnerability to environmental noise, which hinders large-scale deployment. In this paper, we propose a passive authentication system, called AccLock, which leverages distinctive features extracted from in-ear BCG signals to enable secure and unobtrusive user verification. Our system offers several advantages over previous systems, including zero-involvement for both the device and the user, ubiquitous, and resilient to environmental noise. To realize this, we first design a two-stage denoising scheme to suppress both inherent and sporadic interference. To extract user-specific features, we then propose a disentanglement-based deep learning model, HIDNet, which explicitly separates user-specific features from shared nuisance components. Lastly, we develop a scalable authentication framework based on a Siamese network that eliminates the need for per-user classifier training. We conduct extensive experiments with 33 participants, achieving an average FAR of 3.13% and FRR of 2.99%, which demonstrates the practical feasibility of AccLock.

[AI-80] Toward Modeling Player-Specific Chess Behaviors

【速读】:该论文试图解决现有类人国际象棋模型在复现特定历史冠军的个性化决策风格时存在的两大问题:一是现有模型只能基于技能水平模拟一般人群行为,无法捕捉个体冠军的行为特征;二是传统评估指标“走棋准确率”会惩罚人类自然变异性,且忽略长期行为一致性,导致对风格保真度的评估不完整。解决方案的关键在于两个方面:其一,提出一种架构,将统一的Maia-2模型适配到冠军特定嵌入(champion-specific embeddings),并通过集成有限的蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)来增强战术探索,从而在走棋选择中注入更具个体风格的决策倾向;其二,引入一种基于Jensen-Shannon散度(Jensen-Shannon divergence)的新行为指标,利用自编码器(AutoEncoder)和UMAP将高维棋盘表示压缩至潜在空间,再在公共网格上对走棋分布进行离散化,以量化个体间的行为相似性。实验表明,尽管MCTS降低了标准走棋准确率,但根据新指标显著提升了风格对齐度,并有效区分了不同冠军玩家的行为特征。

链接: https://arxiv.org/abs/2605.11893
作者: Loris Sogliuzzo,Aloïs Rautureau,Eric Piette
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While artificial intelligence has achieved superhuman performance in chess, developing models that accurately emulate the individualized decision-making styles of human players remains a significant challenge. Existing human-like chess models capture general population behaviors based on skill levels but fail to reproduce the behavioral characteristics of specific historical champions. Furthermore, the standard evaluation metric, move accuracy, inherently penalizes natural human variance and ignores long-term behavioral consistency, leading to an incomplete assessment of stylistic fidelity. To address these limitations, an architecture is proposed that adapts the unified Maia-2 model to champion-specific embeddings, further enhanced by the integration of a limited Monte Carlo Tree Search (MCTS) process to enrich tactical exploration during move selection. To robustly evaluate this approach, a novel behavioral metric based on the Jensen-Shannon divergence is introduced. By compressing high-dimensional board representations into a latent space using an AutoEncoder and Uniform Manifold Approximation and Projection (UMAP), move distributions are discretized on a common grid to compare behavioral similarities. Results across 16 historical world champions indicate that while integrating MCTS decreases standard move accuracy, it improves stylistic alignment according to the proposed metric, substantially reducing the average Jensen-Shannon divergence. Ultimately, the proposed metric successfully discriminates between individual players and provides promising evidence toward more comprehensive evaluations of behavioral alignment between players and AI models.
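摘要中提出的行为指标,其核心是对同一网格上离散化后的走棋分布计算 Jensen-Shannon 散度。下面给出该散度计算的最简草图(假设性实现,使用玩具分布,省略 AutoEncoder/UMAP 压缩与网格离散化步骤):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """两个离散分布之间的 Jensen-Shannon 散度(自然对数,上界为 ln 2)。"""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)  # 混合分布
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# 同一 4 格网格上离散化后的两个走棋分布(玩具数据)
human = np.array([0.4, 0.3, 0.2, 0.1])
model = np.array([0.35, 0.3, 0.25, 0.1])
div = js_divergence(human, model)
```

注意 SciPy 的 `scipy.spatial.distance.jensenshannon` 返回的是 JS 散度的平方根,与此处定义相差一个开方。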

[AI-81] Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

【速读】:该论文旨在解决LLM代理技能(agent skills)在部署时面临的自适应攻击风险——即攻击者利用审计和运行时反馈反复修改技能,直至通过审计并产生已验证的危害,而传统的单次审计或提示级红队测试无法衡量这种风险。解决方案的关键在于提出了一个灰盒自我进化红队框架Proteus,它通过形式化五轴技能攻击空间(five-axis skill-attack space),并构建统一的审计-沙盒-预言机(audit-sandbox-oracle)管道来评估每个攻击候选,将结构化的审计发现与运行时证据作为反馈,驱动跨轮突变。此外,Proteus还通过路径扩展(path expansion)发现成功攻击的替代实现,以及表面扩展(surface expansion)将学习到的攻击模式迁移到新的攻击目标,从而量化当前技能审核在面对反馈驱动的自适应攻击者时低估的残余风险。

链接: https://arxiv.org/abs/2605.11891
作者: Zhaojiacheng Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent skills extend LLM agents with reusable instructions, tool interfaces, and executable code, and users increasingly install third-party skills from marketplaces, repositories, and community channels. Because a skill exposes both executable behavior and context-setting documentation, its deployment risk cannot be measured by single-shot audits or prompt-level red teams alone: a realistic attacker can use audit and runtime feedback to repeatedly rewrite the skill. We frame this risk as *adaptive leakage* – whether a budgeted attacker can iteratively revise a skill until it passes audit and produces verified runtime harm – and present Proteus, a grey-box self-evolving red-team framework for measuring it. Proteus searches a formalized five-axis skill-attack space. Each candidate is evaluated through a unified audit-sandbox-oracle pipeline that returns structured audit findings and runtime evidence to guide cross-round mutation. Beyond initial evasion, Proteus performs path expansion, which finds alternative implementations of successful attacks, and surface expansion, which transfers learned implementation patterns to new attack objectives beyond the original seed catalogue. Across eight phase-1 cells, Proteus reaches 40–90% Attack Success Rate at 5 rounds (ASR@5) with positive learning-curve slopes on both evaluated auditors. Phase-2 path/surface expansion produces 438 jointly bypassing and lethal variants, with SkillVetter bypassed at ≥93% in every cell and AI-Infra-Guard, the strongest public auditor we evaluate, still admitting up to 41.3% joint-success. These results show that current skill vetting substantially underestimates residual risk when evaluated against adaptive, feedback-driven attackers.

[AI-82] Incentivizing Truthfulness and Collaborative Fairness in Bayesian Learning ICML-26

【速读】:该论文旨在解决协作机器学习中现有数据估值方法无法验证或激励数据真实性的问题——数据源可以通过提交重复或噪音数据(noisy data)来人为提升自身估值与奖励,甚至阻碍他人获益。解决方案的关键在于:首次提出一个针对贝叶斯模型(Bayesian models)的机制,同时确保(F)协作公平性(collaborative fairness)与(T)均衡下的真实性(truthfulness)。该机制结合了半值(semivalues,如Shapley值)以保证公平性,以及基于一个未被数据源知晓的验证集(validation set)的真实数据估值函数(truthful data valuation function, DVF)。为了证明数据源可以通过提交反映其真实知识的数据集来最大化其在联盟(coalitions)中的期望数据值,论文引入了额外条件,并讨论了中介方预算有限或缺乏验证集时对(F)和(T)的适当松弛。

链接: https://arxiv.org/abs/2605.11889
作者: Rachael Hwee Ling Sim,Jue Fan,Xiao Tian,Xinyi Xu,Patrick Jaillet,Bryan Kian Hsiang Low
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper

点击查看摘要

Abstract:Collaborative machine learning involves training high-quality models using datasets from a number of sources. To incentivize sources to share data, existing data valuation methods fairly reward each source based on its data submitted as is. However, as these methods do not verify nor incentivize data truthfulness, the sources can manipulate their data (e.g., by submitting duplicated or noisy data) to artificially increase their valuations and rewards or prevent others from benefiting. This paper presents the first mechanism that provably ensures (F) collaborative fairness and incentivizes (T) truthfulness at equilibrium for Bayesian models. Our mechanism combines semivalues (e.g., Shapley value), which ensure fairness, and a truthful data valuation function (DVF) based on a validation set that is unknown to the sources. As semivalues are influenced by others’ data, we introduce an additional condition to prove that a source can maximize its expected data values in coalitions and semivalues by submitting a dataset that captures its true knowledge. Additionally, we discuss the implications and suitable relaxations of (F) and (T) when the mediator has a limited budget for rewards or lacks a validation set. Our theoretical findings are validated on synthetic and real-world datasets.
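机制中用于保证公平性的半值以 Shapley 值为代表,其小规模联盟上的精确计算可按标准子集加权公式写成如下草图(示意性实现;玩具估值函数假设联盟价值为各数据集并集的大小,并非论文中基于隐藏验证集的真实 DVF):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """精确 Shapley 值:phi_i = sum_S |S|!(n-|S|-1)!/n! * (v(S∪{i}) - v(S)),
    其中 S 取遍不含 i 的所有联盟。"""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coal in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += w * (value_fn(set(coal) | {p}) - value_fn(set(coal)))
    return phi

# 玩具数据估值:联盟价值 = 各方(假设性)数据集并集的大小
datasets = {"A": {1, 2}, "B": {2, 3}, "C": {4}}
v = lambda S: len(set().union(*[datasets[p] for p in S])) if S else 0
phi = shapley_values(list(datasets), v)
```

该精确计算对玩家数呈指数复杂度,仅用于说明;论文的贡献在于将此类半值与真实数据估值函数组合,本草图只覆盖公平性一侧。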

[AI-83] From Clever Hans to Scientific Discovery: Interpreting EEG Foundational Transformers with LRP

【速读】:该论文旨在解决脑电图基础模型(EEG-FMs)因黑箱特性(opaque nature)导致的可解释性不足问题,从而阻碍其在诊断和脑机接口等领域的广泛采用。解决方案的关键是将注意力感知的层相关传播(LRP)从传统的卷积神经网络(CNN)扩展至当前基础模型所依赖的Transformer架构,作为一种事后归因方法(post-hoc attribution method),不仅能够验证EEG-FM的决策合理性,还能揭示模型内部潜在的生物学相关模式——例如在运动想象任务中揭露出模型优先利用任务相关的眼动信号(“Clever Hans”行为),以及在情感预测中反复依赖中央电极簇的候选感觉运动显著性特征,从而将LRP定位为兼顾验证与探索的工具。

链接: https://arxiv.org/abs/2605.11885
作者: Justus Meyer zu Bexten,Nico Scherf,Bogdan Franczyk,Simon M. Hofmann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:Emerging foundation models (FMs) in electroencephalography (EEG) promise a path to scale deep learning in diagnostics and brain-computer interfaces despite data scarcity, yet their opaque nature remains a barrier to wider adoption. We investigate attention-aware Layer-wise relevance propagation (LRP) as a post-hoc attribution method for EEG-FMs, extending LRP’s use on convolutional neural network (CNN)-based EEG models to the Transformer architectures that current FMs are based on. We find that LRP can both verify EEG-FM decisions and surface novel, biologically plausible hypotheses from them. In motor imagery, it unmasks ‘Clever Hans’ behavior where models prioritize task correlated ocular signals over the intended motor correlates. In a naturalistic paradigm for affect prediction, it reveals a recurring reliance on a central electrode cluster, suggesting a candidate sensorimotor signature of arousal. Though heatmap interpretation remains ambiguous in this complex domain, the results position LRP as a tool for both verification and exploration of EEG-FMs, a role that will grow in both importance and discovery potential as the underlying models mature.

[AI-84] On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

【速读】:该论文试图解决的问题是:现有的工具使用型大语言模型智能体(Tool-using LLM agents)在轨迹(trajectories)层面易发生多种失败(如不安全工具调用、注入指令遵从、有害请求顺从或过度拒绝良性任务),而现有安全对齐信号多基于响应的最终结果或采用离策略(off-policy)方法,导致稀疏且单一目标的奖励信号,并引发安全-效用权衡(safety-utility trade-off),即提升安全性会以牺牲任务性能为代价。解决方案的关键在于提出FATE框架,这是一个在线策略自演化(on-policy self-evolving)框架,其核心是:对每一次轨迹失败,由同一策略生成修复候选(repair candidates),然后通过验证器从安全性、效用、过度拒绝控制和轨迹有效性四个维度重新评分并过滤,从而将原本稀疏的失败信号转化为密集的轨迹级修复监督信号,用于智能体的自我演化。在此基础上,进一步引入帕累托前沿策略优化(Pareto-Front Policy Optimization, PFPO),通过监督式预热(supervised warmup)和帕累托感知策略优化(Pareto-aware policy optimization),在优化过程中同时保持安全与效用的平衡,避免单一目标优化导致的权衡问题。

链接: https://arxiv.org/abs/2605.11882
作者: Bo Yin,Qi Li,Xinchao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense trajectory-level information is then used as a supervision signal for agent self-evolution. During this process, we further introduce Pareto-Front Policy Optimization (PFPO), combining supervised warmup with Pareto-aware policy optimization to preserve safety-utility trade-offs. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.

[AI-85] IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

【速读】:该论文旨在解决以下问题:在严格白名单域限制的企业环境下,网页浏览AI代理面临间接提示注入(Indirect Prompt Injection, IPI)的威胁,攻击者可通过嵌入在白名单域名提供的HTML页面中的隐藏指令影响代理行为,而现有的红队测试资源(如预构建的对抗性页面、通用LLM扫描器)无法有效模拟这一真实场景。解决方案的关键在于IPI-proxy工具包,其核心是一个拦截代理(intercepting proxy),能够在飞行中重写来自白名单域的真实HTTP响应,从统一库(包含820条去重攻击字符串,源自六个公开基准)中嵌入载荷。通过YAML驱动的测试框架独立参数化载荷集、嵌入技术(如HTML注释、不可见CSS、LLM生成的语义文本)和HTML插入点(6个位置),实现无需模拟页面或沙盒环境的参数扫描评估,并配合外泄追踪器记录成功回调。IPI-proxy将静态基准与实时部署桥接,为AI安全团队提供了可复现的测试基础,用于在白名单域名检索表面量化及加固代理对间接提示注入的防御能力。

链接: https://arxiv.org/abs/2605.11868
作者: Chia-Pei(Janet)Chen,Kentaroh Toyoda,Anita Lai,Alex Leung
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: code: this https URL

点击查看摘要

Abstract:Web-browsing AI agents are increasingly deployed in enterprise settings under strict whitelists of approved domains, yet adversaries can still influence them by embedding hidden instructions in the HTML pages those domains serve. Existing red-teaming resources fall short of this scenario: prompt-injection benchmarks ship pre-built adversarial pages that whitelisted agents cannot reach, and generic LLM scanners probe the model API rather than its retrieved content. We present IPI-proxy, an open-source toolkit for red-teaming web-browsing agents against indirect prompt injection (IPI). At its core is an intercepting proxy that rewrites real HTTP responses from whitelisted domains in flight, embedding payloads drawn from a unified library of 820 deduplicated attack strings extracted from six published benchmarks (BIPIA, InjecAgent, AgentDojo, Tensor Trust, WASP, and LLMail-Inject). A YAML-driven test harness independently parameterizes the payload set, the embedding technique (HTML comment, invisible CSS, or LLM-generated semantic prose), and the HTML insertion point (6 locations from `head_meta` to `script_comment`), enabling parameter-sweep evaluation without mock pages or sandboxed environments. A companion exfiltration tracker logs successful callbacks. This paper describes the threat model, situates IPI-proxy among contemporary IPI benchmarks and red-teaming tools, and details its architecture, design decisions, and configuration interface. By bridging static benchmarks and live deployment, IPI-proxy gives AI security teams a reproducible substrate for measuring and hardening web-browsing agents against indirect prompt injection on the same retrieval surface attackers exploit in production.
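拦截代理在途改写 HTML 响应的核心操作,可用如下简化草图说明(`embed_payload` 及其参数、技术枚举均为示意,并非 IPI-proxy 的实际接口):

```python
def embed_payload(html, payload, technique="html_comment", anchor="</head>"):
    """示意性的在途改写:在锚点标签前嵌入一条注入载荷。
    函数与参数名为假设,仅演示"载荷 x 嵌入技术 x 插入点"的参数化思路。"""
    if technique == "html_comment":
        snippet = f"<!-- {payload} -->"          # 藏在 HTML 注释中
    elif technique == "invisible_css":
        snippet = f'<div style="display:none">{payload}</div>'  # 不可见 CSS
    else:
        raise ValueError(f"unknown technique: {technique}")
    return html.replace(anchor, snippet + anchor, 1)

page = "<html><head><title>ok</title></head><body>hi</body></html>"
out = embed_payload(page, "IGNORE PREVIOUS INSTRUCTIONS")
```

真实工具还需在 HTTP 层完成拦截与回写,并由外泄追踪器记录回调;此处只示意对单个响应体的重写。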

[AI-86] EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models

【速读】:该论文试图解决机器人导航任务中,强化学习(Reinforcement Learning, RL)的奖励函数手工设计高度依赖领域专业知识、难以审计或适应、且容易引入归纳偏置(inductive biases)导致次优性能的问题。解决方案的关键在于提出EvoNav框架,利用大语言模型(Large Language Models, LLMs)通过进化过程自动生成奖励函数,同时引入渐进的三阶段“预热-提升”(warm-up-boost)评估流程:先从低成本代理(如小数据集和分析规则)过渡到轻量级 rollout,最终进行完整策略训练,从而在有效反馈下实现计算高效的探索,避免昂贵的全策略训练。

链接: https://arxiv.org/abs/2605.11859
作者: Zhikai Zhao,Chuanbo Hua,Federico Berto,Zihan Ma,Kanghoon Lee,Jiachen Li,Jinkyoo Park
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robot navigation is a crucial task with applications to social robots in dynamic human environments. While Reinforcement Learning (RL) has shown great promise for this problem, the policy quality is highly sensitive to the specification of reward functions. Hand-crafted rewards require substantial domain expertise and embed inductive biases that are difficult to audit or adapt, limiting their effectiveness and leading to suboptimal performance. In this paper, we propose EvoNav, an evolutionary framework that automates the design of robot navigation reward functions via large language models (LLMs). To overcome prohibitively costly policy training, EvoNav evaluates each candidate proposal from the LLM via a progressive three-stage warm-up-boost procedure. EvoNav advances from analytical proxies with low-cost surrogates, such as small datasets and analytic rules, to lightweight rollouts and, finally, to full policy training, enabling computationally efficient exploration under effective feedback. Experiment results show that EvoNav produces more effective navigation policies than manually designed RL rewards and state-of-the-art reward design methods.

[AI-87] Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications ICML2026

【速读】:该论文旨在解决Bistable Memory Recurrent Unit (BMRU)在复杂序列学习任务中性能落后于可并行化循环神经网络(如状态空间模型和最小门控循环单元)的问题,其核心瓶颈在于状态更新过程中的梯度阻塞。解决方案的关键是提出一种累积更新公式(cumulative update formulation),通过引入时间上的跳跃连接(skip-connections)恢复梯度流,同时保留BMRU原有的量化状态、滞后特性(hysteresis)和持久记忆,进而衍生出Cumulative Memory Recurrent Unit (CMRU)及其松弛变体αCMRU,有效提升收敛稳定性、降低初始化敏感性,并在小模型规模下匹配或超越现有高性能RNN架构。

链接: https://arxiv.org/abs/2605.11855
作者: Julien Brandoit,Arthur Fyon,Damien Ernst,Guillaume Drion
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Accepted as a spotlight at ICML2026. This work has been the subject of patent applications under numbers EP26175243.0 and EP26175248.9

点击查看摘要

Abstract:Sequence learning is dominated by Transformers and parallelizable recurrent neural networks (RNNs) such as state-space models, yet learning long-term dependencies remains challenging, and state-of-the-art designs trade power consumption for performance. The Bistable Memory Recurrent Unit (BMRU) was introduced to enable hardware-software co-design of ultra-low power RNNs: quantized states with hysteresis provide persistent memory while mapping directly to analog primitives. However, BMRU performance lags behind parallelizable RNNs on complex sequential tasks. In this paper, we identify gradient blocking during state updates as a key limitation and propose a cumulative update formulation that restores gradient flow while preserving persistent memory, creating skip-connections through time. This leads to the Cumulative Memory Recurrent Unit (CMRU) and its relaxed variant, the \alpha CMRU. Experiments show that the cumulative formulation dramatically improves convergence stability and reduces initialization sensitivity. The CMRU and \alpha CMRU match or outperform Linear Recurrent Units (LRUs) and minimal Gated Recurrent Units (minGRUs) across diverse benchmarks at small model sizes, with particular advantages on tasks requiring discrete long-range retention, while the CMRU retains quantized states, persistent memory, and noise-resilient dynamics essential for analog implementation.
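摘要所指的"状态更新中的梯度阻塞"与累积更新带来的时间跳跃连接,可用如下标量示意作对比(概念性草图,并非 BMRU/CMRU 的实际方程):

```python
def gated_update(h, delta, gate):
    # 覆写式更新:将 delta 视为输入驱动项时,d h_t / d h_{t-1} = 1 - gate,
    # 沿时间链相乘后 (1-gate)^T -> 0,梯度被阻塞
    return gate * delta + (1.0 - gate) * h

def cumulative_update(h, delta):
    # 累积式更新:h_t = h_{t-1} + delta_t,d h_t / d h_{t-1} = 1,
    # 等价于一条穿越时间的加性跳跃连接
    return h + delta

T, gate = 20, 0.9
grad_gated = (1.0 - gate) ** T   # 覆写链中 h_T 对 h_0 的梯度因子
grad_cumulative = 1.0 ** T       # 累积链中同一梯度因子恒为 1

h_g, h_c = 0.0, 0.0
for _ in range(T):
    h_g = gated_update(h_g, 0.1, gate)
    h_c = cumulative_update(h_c, 0.1)
```

实际的 CMRU 还需保留量化状态与滞后特性;此草图只孤立展示累积形式为何能恢复梯度流。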

[AI-88] Martingale-Consistent Self-Supervised Learning

【速读】:该论文试图解决自监督学习(SSL)在数据信息动态变化(如历史记录缩短、特征缺失或图像局部遮挡等部分观测场景)时,模型预测在粗粒度和细粒度视图之间缺乏连贯性的问题。具体而言,标准SSL目标(如不变性约束)强制将不同视图的表示拉近,但未考虑信息逐步揭示时应遵循的鞅性质(martingale property),即粗粒度预测应与细粒度预测的期望一致。解决方案的关键在于提出鞅一致性(martingale consistency)框架,它不强制表示完全相同,而是仅约束细粒度预测的期望等于粗粒度预测,从而允许预测随信息更新而合理调整,同时防止系统性漂移。为此,作者设计了预测空间和潜在空间两种变体,并引入基于随机细化的无偏两样本蒙特卡洛估计器(unbiased two-sample Monte Carlo estimator)来高效实现该约束。实验在合成数据、时间序列、表格数据及图像基准上验证了该方法在部分观测场景下能提升鲁棒性和校准性,并生成更稳定的表示。

链接: https://arxiv.org/abs/2605.11846
作者: Moritz Gögl,Hanwen Xing,Christopher Yau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) is often deployed under changing information, such as shorter histories, missing features, or partially observed images. In these settings, predictions from coarse and refined views should be coherent: before refinement, the coarse-view prediction should match the average prediction expected after refinement. Martingales formalize this coherence principle, but standard SSL objectives do not enforce it. Unlike invariance objectives that pull views together, martingale consistency constrains only the expected refined prediction, allowing predictions to update as information is revealed while preventing systematic drift. We introduce a martingale-consistent SSL framework that closes this gap, with practical prediction- and latent-space variants and an unbiased two-sample Monte Carlo estimator based on stochastic refinement. We evaluate the approach on synthetic and real time-series, tabular, and image benchmarks under partial-observation regimes, in both semi-self-supervised and fully label-free settings. Across these experiments, our framework improves robustness and calibration under partial observation, yielding more stable representations as information is revealed.
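"基于随机细化的无偏两样本蒙特卡洛估计器"的一种可能理解如下(示意性实现,具体形式以论文为准):若对同一粗粒度视图抽取两次独立细化 X1、X2,则 E[(X1-c)(X2-c)] = (E[X]-c)^2,从而无偏地估计鞅偏离的平方:

```python
import numpy as np

def two_sample_estimate(coarse, refined_a, refined_b):
    """(E[refined] - coarse)^2 的无偏估计:refined_a、refined_b
    为同一粗视图的两次独立随机细化预测。"""
    return (refined_a - coarse) * (refined_b - coarse)

rng = np.random.default_rng(1)
coarse, n = 0.5, 20000
# 一致情形:细化预测围绕粗粒度预测波动(鞅性质成立)
ok = [two_sample_estimate(coarse, coarse + rng.normal(0, 0.1),
                          coarse + rng.normal(0, 0.1)) for _ in range(n)]
# 漂移情形:细化预测系统性偏移 0.2(鞅性质被破坏)
drift = [two_sample_estimate(coarse, coarse + 0.2 + rng.normal(0, 0.1),
                             coarse + 0.2 + rng.normal(0, 0.1)) for _ in range(n)]
avg_ok, avg_drift = float(np.mean(ok)), float(np.mean(drift))
```

一致情形的估计量均值接近 0,漂移情形接近 0.2^2 = 0.04:约束只惩罚系统性漂移,而不惩罚细化后预测的合理波动。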

[AI-89] Trade-offs in Decentralized Agentic AI Discovery Across the Compute Continuum

【速读】:该论文旨在解决在计算连续体(compute continuum,涵盖云、边缘和间歇连接域)中部署的代理系统(agentic systems)所需要的有效发现机制问题。问题的核心在于传统集中式发现机制在动态、异构及部分断连的网络环境中难以维持可靠性,而去中心化发现方向,尤其是基于分布式哈希表(DHT)的查找结构,成为新兴代理架构中构建代理目录(agent directory)的候选路径。解决方案之关键在于:在共享控制平面(control-plane)框架内,系统性地比较三种主要的结构化覆盖(structured overlay)家族——Chord、Pastry和Kademlia,作为代理发现的候选索引基板(indexing substrates)。通过针对4096节点静态场景以及同样规模节点下具有代表性的动态成员变更(churn)基准测试,刻画它们在发现可靠性(discovery reliability)、启动行为(startup behavior)以及控制平面开销(control-plane overhead)上的权衡,从而明确每个覆盖在边到云环境中适合代理发现的操作点(operating points),为设计者提供选择依据。

链接: https://arxiv.org/abs/2605.11839
作者: Patrizio Dazzi,Emanuele Carlini,Matteo Mordacchini,Saul Urso
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic systems deployed across the compute continuum need discovery mechanisms that remain effective across cloud, edge, and intermittently connected domains. In some emerging agentic architectures, decentralized discovery is already an active design direction, placing DHT-based lookup on the path toward agent directories. This paper studies the trade-offs among major structured-overlay families for agent discovery, comparing Chord, Pastry, and Kademlia as candidate indexing substrates within a shared control-plane framework. Using a benchmark subset centered on a 4096-node stationary comparison and a representative 4096-node churn benchmark, the paper characterizes how discovery reliability, startup behavior, and control-plane overhead vary across these overlays. The goal is to clarify the operating points they expose for agent discovery across edge-to-cloud environments.
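Of the three overlays compared above, Kademlia's routing is the easiest to sketch: closeness between node and key IDs is measured with the XOR metric, and a lookup step forwards the query to the k closest known nodes. A minimal illustration (node IDs are made up; real Kademlia adds k-buckets and iterative lookups):

```python
def xor_distance(a: int, b: int) -> int:
    # Kademlia measures closeness between IDs with bitwise XOR
    return a ^ b

def k_closest(target: int, node_ids, k: int = 3):
    # one lookup step: pick the k known nodes closest to the target key
    return sorted(node_ids, key=lambda n: xor_distance(n, target))[:k]

nodes = [0b0001, 0b0111, 0b1000, 0b1011, 0b1110]
assert k_closest(0b1010, nodes) == [0b1011, 0b1000, 0b1110]
```

Chord (successor pointers on a ring) and Pastry (prefix routing) expose different maintenance and reliability trade-offs under churn, which is exactly what the benchmark above measures.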

[AI-90] Multi-Timescale Conductance Spiking Networks: A Sparse Gradient-Trainable Framework with Rich Firing Dynamics for Enhanced Temporal Processing

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在回归任务中因常用神经元模型(如LIF和AdLIF)在梯度可训练性、动力学丰富性和活动稀疏性之间的权衡而导致的性能退化问题,特别是在连续值输出受近似误差、噪声和脉冲离散化严重影响的场景下。解决方案的关键在于提出一种可梯度训练的多时间尺度电导尖峰网络(multi-timescale conductance spiking networks),通过参数化快、慢和超慢电导来塑造神经元的电流-电压(I-V)曲线,从而系统性地调控兴奋性,在单一模型内实现强直、位相和爆发等多种发放模式,同时利用离散时间公式化使得网络可以直接通过时间反向传播(backpropagation through time)进行训练,无需代理梯度近似,并在Mackey-Glass时间序列回归中相比基线模型展现出更高的预测精度和显著更稀疏的通信与计算活动。

链接: https://arxiv.org/abs/2605.11835
作者: Alex Fulleda-Garcia,Saray Soldado-Magraner,Josep Maria Margarit-Taulé
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in 2026 IEEE Neuro-Inspired Computational Elements Conference (Atlanta, USA)

点击查看摘要

Abstract:Spiking neural networks (SNNs) promise low-power event-driven computation for temporally rich tasks, but commonly used neuron models often trade off gradient-based trainability, dynamical richness, and high activity sparsity. These limitations are acute in regression, where approximation error, noise and spike discretization can severely degrade continuous-valued outputs. Indeed, many state-of-the-art (SOTA) SNNs rely on simple phenomenological dynamics trained with surrogate gradients and offer limited control over spiking diversity and sparsity. To overcome such limitations, we introduce multi-timescale conductance spiking networks, a gradient-trainable framework in which neural dynamics emerge from shaping the current-voltage (I-V) curve by tuning fast, slow and ultra-slow conductances. This parametrization allows systematic control over excitability, can be implemented efficiently in analog circuits, and yields rich firing regimes including tonic, phasic and bursting responses within a single model. We derive a discrete-time formulation of these differentiable dynamics, enabling direct backpropagation through time without surrogate-gradient approximations. To probe both trainability and accuracy, we evaluate feedforward networks of these neurons at the predictability limit of Mackey-Glass time-series regression and compare them to baseline LIF and SOTA AdLIF networks. Our model outperforms LIF and AdLIF networks, while exhibiting substantially sparser activity from both communication and computational perspectives. These results highlight multi-timescale conductance spiking neurons as a promising building block for energy-aware temporal processing and neuromorphic implementation.

[AI-91] MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

【速读】:该论文旨在解决现有记忆基准无法满足医疗代理在动态、高风险生产环境中对记忆精确性、安全性及长期临床追踪能力的需求。解决方案的关键在于:通过人机协作流水线,基于临床验证的合成患者原型生成包含约2000个会话和16000次交互轮次的高度逼真长周期医疗轨迹数据集;并首创“边评估边构建”的流式评估协议,精准模拟生产环境中的动态记忆积累过程;同时,形式化并系统研究了“记忆饱和”现象,即持续信息输入导致检索与推理鲁棒性退化,从而暴露主流架构在复杂医疗推理与噪声鲁棒性方面的根本缺陷。

链接: https://arxiv.org/abs/2605.11814
作者: Yihao Wang,Haoran Xu,Renjie Gu,Yixuan Ye,Xinyi Chen,Xinyu Mu,Yuan Gao,Chunxiao Guo,Peng Wei,Jinjie Gu,Huan Li,Ke Chen,Lidan Shou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The large-scale deployment of personalized healthcare agents demands memory mechanisms that are exceptionally precise, safe, and capable of long-term clinical tracking. However, existing benchmarks primarily focus on daily open-domain conversations, failing to capture the high-stakes complexity of real-world medical applications. Motivated by the stringent production requirements of an industry-leading health management agent serving tens of millions of active users, we introduce MedMemoryBench. We develop a human-agent collaborative pipeline to synthesize highly realistic, long-horizon medical trajectories based on clinically grounded, synthetic patient archetypes. This process yields a massive, expertly validated dataset comprising approximately 2,000 sessions and 16,000 interaction turns. Crucially, MedMemoryBench departs from traditional static evaluations by pioneering an “evaluate-while-constructing” streaming assessment protocol, which precisely mirrors dynamic memory accumulation in production environments. Furthermore, we formalize and systematically investigate the critical phenomenon of memory saturation, where sustained information influx actively degrades retrieval and reasoning robustness. Comprehensive benchmarking reveals severe bottlenecks in mainstream architectures, particularly concerning complex medical reasoning and noise resilience. By exposing these fundamental flaws, MedMemoryBench establishes a vital foundation for developing robust, production-ready medical agents.

[AI-92] Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models

【速读】:该论文旨在解决鲁棒优化(Robust Optimization, RO)中手动将不确定优化模型转化为易处理确定性形式的瓶颈问题,具体挑战在于大语言模型(LLMs)在进行RO重新公式化时需执行精确的多步推理与数学一致变换,而现有方法缺乏系统评估基准且效果不足。解决方案的关键是提出了AutoREM(Automated Reformulation with Experience Memory),一种无需调优的记忆增强框架,其核心是通过自定义离线适应过程,让模型自主反思过去失败的优化轨迹,从中构建结构化的文本经验记忆;该记忆无需领域专家知识或参数更新即可跨不同基础LLM迁移,从而在保持推理一致性的同时显著提升RO重新公式化的准确性与效率。此外,论文还构建了AutoRO-Bench基准,包含自动化数据生成管道与策划数据集,为LLM在RO任务上的系统评估提供了标准化平台。

链接: https://arxiv.org/abs/2605.11813
作者: Jinbiao Chen,Shuang Jin,Guoyun Zhang,Junyu Zhang,Guanyi Wang,Hanzhang Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its practical use is often limited by the need to manually reformulate uncertain optimization models into tractable deterministic counterparts. Recent large language models (LLMs) have been shown promising for automating optimization formulation, yet RO reformulation remains challenging because it requires precise multi-step reasoning and mathematically consistent transformations. To facilitate systematic evaluation of LLM-based reformulation, for which no dedicated benchmark currently exists, we develop AutoRO-Bench, a benchmark featuring an automated data generation pipeline for the core RO reformulation task and a curated dataset for the RO application task. To address the reformulation challenge, we propose Automated Reformulation with Experience Memory (AutoREM), a tuning-free memory-augmented framework that autonomously builds a structured textual experience memory by reflecting on past failed trajectories through a tailored offline adaptation procedure. AutoREM requires neither domain-specific expert knowledge nor parameter updates, and the resulting memory readily transfers across different base LLMs. Experimental results show that AutoREM consistently improves the accuracy and efficiency of RO reformulation across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs.

[AI-93] Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

【速读】:该论文试图解决视觉-语言-动作模型(Vision-Language-Action, VLA)中动作头(action head)的同质化问题——现有VLA模型大多直接以固定世界坐标系预测动作指令,导致策略缺乏对几何结构的感知和鲁棒性。解决方案的关键在于提出MCF-Proto,一个轻量级动作头,其核心包含两个设计:一是运动中心动作帧(Motion-Centric Action Frame, MCF),通过预测一个旋转矩阵 RtSO(3)R_t \in SO(3) 将动作构建在变换后的局部坐标系中;二是基于原型的动作参数化(prototype-based action parameterization),即从一组可学习原型中组合局部动作,再映射回世界坐标系进行端到端训练。该设计仅依赖标准演示数据,无需额外监督,却能自发诱导出稳定的几何结构——学习到的局部帧与演示中的末端执行器运动高度对齐,同时动作表示在原型空间中变得显著紧凑(变异集中于更少的主导方向),从而提升策略在几何扰动下的鲁棒性。

链接: https://arxiv.org/abs/2605.11809
作者: Huoren Yang,Jianchao Zhao,Hu Yusong,Qiguan Ou,Yuyang Gao,Wei Ke,Yuhang He,SongLin Dong,Zhiheng Ma,Yihong Gong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have advanced rapidly with stronger backbones, broader pre-training, and larger demonstration datasets, yet their action heads remain largely homogeneous: most directly predict action commands in a fixed world coordinate frame. We propose MCF-Proto, a lightweight action head that equips VLA policies with a Motion-Centric Action Frame (MCF) and a prototype-based action parameterization. At each step, the policy predicts a rotation R_t \in SO(3), composes actions in the transformed local frame from a set of prototypes, and maps them back to the world frame for end-to-end training, using only standard demonstrations without auxiliary supervision. This simple design induces stable emergent structure. Without explicit directional labels, the learned local frames develop a stable geometric structure whose axes are strongly compatible with demonstrated end-effector motion. Meanwhile, actions in the learned representation become substantially more compact, with variation captured by fewer dominant directions and more regularly organized by shared prototypes. These structural properties translate into improved robustness, especially under geometric perturbations. Our results suggest that adding lightweight geometric and compositional structure to the action head can materially improve how VLA policies organize and generalize robotic manipulation behavior. An anonymized code repository is provided in the supplementary material.

[AI-94] Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation

【速读】:该论文试图解决基于大语言模型(LLM)的生成式兴趣点(POI)推荐模型在训练后知识固化、无法感知影响用户移动决策的动态现实世界条件(如本地事件、文化趋势和季节性变化)的问题。解决方案的关键在于提出AWARE(Agent-based World knowledge Augmented REcommendation)框架,它利用LLM智能体(LLM agent)生成具有位置和时间感知的上下文叙述(contextual narratives),以捕捉区域文化特征、季节性趋势及与用户相关的持续事件;更重要的是,这些叙述并非引入通用或噪声信息,而是进一步锚定到每个用户的行为上下文(behavioral context)中,从而将外部世界知识(external world knowledge)扎根于个性化的时空模式(personalized spatial-temporal patterns)内,实现动态知识注入与用户偏好的精准对齐。

链接: https://arxiv.org/abs/2605.11807
作者: Qiuyu Ding,Heng-Da Xu,Wei Zhang,Dongyi Lv,Changda Xia,Feng Xiong,Mu Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative point-of-interest (POI) recommendation models based on large language models (LLMs) have shown promising results by formulating next POI prediction as a sequence generation task. However, the knowledge encoded in these models remains fixed after training, making them unable to perceive evolving real-world conditions that shape user mobility decisions, such as local events and cultural trends. To bridge this gap, we propose AWARE (Agent-based World knowledge Augmented REcommendation), which employs an LLM agent to generate location- and time-aware contextual narratives that capture regional cultural characteristics, seasonal trends, and ongoing events relevant to each user. Rather than introducing generic or noisy information, AWARE further anchors these narratives in each user’s behavioral context, grounding external world knowledge in personalized spatial-temporal patterns. Extensive experiments on three real-world datasets demonstrate that AWARE consistently outperforms competitive baselines, achieving up to 12.4% relative improvement.

[AI-95] Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations

【速读】:该论文旨在解决一个方法论困境:如何在不依赖伦理受限、难以复现且充满不可控变量的真人受试者的情况下,量化不文明交流(uncivil communication)对辩论效率(以收敛时间衡量)的因果影响。其解决方案的关键在于利用基于大语言模型的多智能体系统(LLM-based Multi-Agent Systems)构建一个可复现、可规模化的“社会学沙盒”(sociological sandbox),通过蒙特卡洛模拟(Monte Carlo simulation)框架系统性地操控毒性条件(toxicity conditions),生成数千次结构化的1对1对抗辩论,并以收敛时间作为交互效率的代理指标。这一框架不仅能隔离毒性行为的独立效应,还支持在不同参数规模的模型间进行泛化性检验,从而克服了传统人类研究在可控性和可扩展性上的根本限制。

链接: https://arxiv.org/abs/2605.11789
作者: Alison Moldovan-Mauer,Benedikt Mangold
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unconstructive debate and uncivil communication carry well-documented costs for productivity and cohesion, yet isolating their effect on operational efficiency has proven difficult. Human subject research in this domain is constrained by ethical oversight, limited reproducibility, and the inherent unpredictability of naturalistic settings. We address this gap by leveraging Large Language Model (LLM) based Multi-Agent Systems as a controlled sociological sandbox, enabling systematic manipulation of communicative behavior at scale. Using a Monte Carlo simulation framework, we generate thousands of structured 1-on-1 adversarial debates across varying toxicity conditions, measuring convergence time, defined as the number of rounds required to reach a conclusion, as a proxy for interactional efficiency. Building on a prior study, we replicate and extend its findings across two additional LLM agents of varying parameter size, allowing us to assess whether the effects of toxic behavior on debate dynamics generalize across model scale. The convergence latency of 25% reported in the previous study was confirmed. It was found that this latency is significantly bigger for models with fewer parameters. We further identify a significant first-mover advantage, whereby the agent initiating the discussion wins significantly above chance regardless of toxicity condition.

[AI-96] Crash Assessment via Mesh-Based Graph Neural Networks and Physics-Aware Attention

【速读】:该论文试图解决全车碰撞仿真计算成本高昂、难以用于迭代设计探索的问题,即需要一种能够快速且足够精确地预测时变结构变形场的替代模型。解决方案的关键在于提出并评估了一系列混合代理模型(hybrid surrogate models),包括MeshTransolver、MeshGeoTransolver和MeshGeoFLARE,其核心架构结合了局部网格消息传递(local mesh message passing)、几何感知全局注意力(geometry-aware global attention)以及稀疏接触感知校正(sparse contact-aware correction),用于自回归碰撞展开(autoregressive crash rollout)。这种混合设计能够同时捕捉短程结构相互作用和长程变形模式,并特别通过接触感知变体评估动态接近相互作用的影响,从而在标量精度、生存空间一致性和物理可解释的位移场之间取得最佳平衡,最终实现快速全场预测同时保留工业碰撞工程分析所需的核心结构信息。

链接: https://arxiv.org/abs/2605.11784
作者: Gabriel Curtosi,Carlos Manuel Ruiz Ruiz,Fabiola Cavaliere,Xabier Larráyoz Izcara
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 40 pages, 15 figures, 7 tables

点击查看摘要

Abstract:Full-vehicle crash simulations are computationally expensive, limiting their use in iterative design exploration. This work investigates learned hybrid surrogate models (MeshTransolver, MeshGeoTransolver, and MeshGeoFLARE) for predicting time-resolved structural deformation fields in an industrial lateral pole-impact benchmark. We evaluate whether neural surrogates can reproduce full-field crash kinematics with sufficient accuracy, spatial regularity, and structural plausibility for engineering interpretation. The proposed architectures combine local mesh message passing, geometry-aware global attention, and sparse contact-aware correction for autoregressive crash rollout. We compare mesh-based graph neural networks, attention-based geometric models, and hybrid architectures under a common training and hyperparameter configuration. The hybrid models capture both short-range structural interactions and long-range deformation patterns, while a sparse contact-aware variant assesses the effect of dynamic proximity interactions during rollout. On a 25-sample full-vehicle test set, the best hybrid model achieves a temporal mean root-mean-square error of 3.20 mm. While geometry-aware attention baselines are quantitatively competitive, qualitative side-view inspection shows they can introduce local spatial noise and deformation irregularities that complicate structural interpretation. In contrast, hybrid mesh-attention models provide the best balance between scalar accuracy, survival-space consistency, and physically interpretable displacement fields. These results suggest that crash surrogate assessment should combine global error metrics with downstream safety-relevant quantities and qualitative field inspection. The proposed methodology enables fast full-field predictions while preserving essential structural information for industrial crash-engineering analysis. 

[AI-97] Is Monotonic Sampling Necessary in Diffusion Models?

【速读】:该论文试图解决扩散模型(diffusion models)中一个长期未被系统检验的假设:采样过程中噪声水平必须单调递减(monotonic sampling)是否必要且最优。其解决方案的关键在于:设计了四类结构化的非单调调度(structured nonmonotonic schedules),并在DDPM、EDM和Flow Matching三种架构上,对10至200次函数评估(NFE)预算及42组超参数消融实验进行系统性测试,发现所有非单调调度均未优于单调基线,但性能损失幅度因模型而异。论文进一步形式化了调度敏感性系数(Schedule Sensitivity Coefficient),这是一种廉价且与架构无关的诊断指标,能够揭示模型在关键噪声水平上是否未收敛到贝叶斯最优去噪器(Bayes-optimal denoiser),从而解释了不同模型对非单调调度的敏感性差异。

链接: https://arxiv.org/abs/2605.11773
作者: Muhammad Haris Khan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models generate samples by iteratively denoising a Gaussian prior, traversing a sequence of noise levels that, in every published sampler, decreases monotonically. Six years of intensive work has refined nearly every aspect of this recipe, including the corruption operator, the training objective, the schedule shape, the architecture, and the ODE solver. Yet the assumption of monotonicity itself has never been systematically tested. Here we ask whether monotonic sampling is load-bearing or merely conventional. We design four families of structured nonmonotonic schedules and apply them to three architecturally distinct generative models, DDPM, EDM, and Flow Matching, across NFE budgets ranging from 10 to 200 function evaluations, plus a 42-cell hyperparameter ablation, on CIFAR-10. Across all 90 tested configurations, no tested nonmonotonic schedule improves on the monotonic baseline. The magnitude of the penalty, however, spans nearly three orders of magnitude: persistent and substantial in DDPM, intermediate in Flow Matching, and indistinguishable from zero in EDM. We show that this variation is not noise but a structural property of each trained denoiser, and we formalize it as the Schedule Sensitivity Coefficient, a cheap, architecture-agnostic diagnostic that provides evidence of non-convergence to the Bayes-optimal denoiser at the critical noise level. Our findings justify the field’s tacit reliance on monotonic schedules and supply a new probe of diffusion model quality complementary to sample-quality metrics such as Frechet Inception Distance.
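To make "nonmonotonic schedule" concrete, here is a hypothetical sketch: start from a standard geometric noise decay and insert a single upward excursion (this is only an illustration of the concept, not one of the paper's four schedule families):

```python
import numpy as np

def monotonic_schedule(sigma_max=80.0, sigma_min=0.002, n=18):
    # standard EDM-style geometric decay from high to low noise
    return np.geomspace(sigma_max, sigma_min, n)

def bounce_schedule(bounce_at=9, bounce_up=2.0, **kw):
    # hypothetical nonmonotonic variant: one upward noise excursion
    s = list(monotonic_schedule(**kw))
    s.insert(bounce_at, s[bounce_at - 1] * bounce_up)
    return np.array(s)

mono = monotonic_schedule()
assert np.all(np.diff(mono) < 0)     # every step reduces the noise level
bounce = bounce_schedule()
assert np.any(np.diff(bounce) > 0)   # the noise level goes back UP once
```

The Schedule Sensitivity Coefficient then compares sample quality between such perturbed schedules and the monotonic baseline at matched NFE budgets.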

[AI-98] Behavioral Integrity Verification for AI Agent Skills

【速读】:该论文旨在解决现有大型语言模型代理(LLM Agent)安全工作中被忽视的环节:技能工件(skill artifact)本身未经行为完整性验证的问题。现有方法仅关注恶意提示或危险的运行时动作,但技能作为封装了文件系统访问、凭证、网络调用和shell执行等特权第三方能力(third-party capabilities)的模块,其声明行为与实际能力之间存在未经审计的鸿沟,可能隐藏恶意代码或复合威胁。解决方案的关键是提出了行为完整性验证(Behavioral Integrity Verification, BIV)框架,将问题形式化为一个基于共享分类法的类型化集合比较问题:通过配对执行确定性代码分析(deterministic code analysis)与LLM辅助能力提取(LLM-assisted capability extraction),在代码、指令和元数据共同构成的分类法上比较声明的能力集与实际的能力集,生成结构化证据以支持偏差分类、根本原因分类和恶意技能检测三种下游分析。该框架在49,943个技能上的实证表明,80.0%的技能存在描述-实现偏差,并识别出四种新型复合威胁类别,且多数偏差源于开发者疏忽而非恶意,在906个技能的恶意检测基准上达到0.946的F1分数,显著优于现有规则和单次LLM基线。

链接: https://arxiv.org/abs/2605.11770
作者: Yuhao Wu,Tung-Ling Li,Hongliang Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Agent skills extend LLM agents with privileged third-party capabilities such as filesystem access, credentials, network calls, and shell execution. Existing safety work catches malicious prompts and risky runtime actions, but the skill artifact itself goes unverified. We formalize this as the behavioral integrity verification (BIV) problem: a typed set comparison between declared and actual capabilities over a shared taxonomy that bridges code, instructions, and metadata. The BIV framework instantiates this comparison by pairing deterministic code analysis with LLM-assisted capability extraction. The resulting structured evidence supports three downstream analyses: deviation taxonomy, root-cause classification, and malicious-skill detection. On 49,943 skills from the OpenClaw registry, the deviation taxonomy reveals a pervasive description-implementation gap: 80.0% of skills deviate from declared behavior, with four novel compound-threat categories surfaced. Root-cause classification finds that deviations are mostly oversight, not malice: 81.1% trace to developer oversight and 18.9% to adversarial intent, with 5.0% of skills carrying predicted multi-stage attack chains. On a 906-skill malicious-skill detection benchmark, BIV reaches an F1 of 0.946, outperforming state-of-the-art rule-based and single-pass LLM baselines. These results demonstrate behavioral integrity auditing for agent skills at scale.
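At its core, the BIV comparison described above is a typed set difference between declared and actual capabilities. A toy instance (the capability labels are made up; the paper's shared taxonomy is far richer):

```python
# Declared capabilities come from the skill's metadata/instructions;
# actual capabilities come from code analysis and LLM-assisted extraction.
declared = {"filesystem:read", "network:http"}
actual = {"filesystem:read", "network:http", "shell:exec"}

undeclared = actual - declared   # used but never disclosed -> deviation
unused = declared - actual       # disclosed but never exercised

assert undeclared == {"shell:exec"}
assert unused == set()
```

Nonempty `undeclared` sets are the raw evidence behind the deviation taxonomy; root-cause classification then decides whether a given gap looks like oversight or adversarial intent.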

[AI-99] Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention ACL2026

【速读】:该论文试图解决多模态摘要任务中现有方法将浅层视觉特征注入深层语言模型所导致的表征不匹配(representational mismatches)与跨模态对齐薄弱(weak cross-modal grounding)的问题。解决方案的关键在于提出一个统一框架SPeCTrA-Sum(Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization),其包含两项核心创新:一是深度视觉处理器(Deep Visual Processor, DVP),通过在对应深度层级上对齐视觉编码器与语言模型,实现层次化的逐层融合,从而保持语义一致性;二是轻量级视觉相关性预测器(Visual Relevance Predictor, VRP),通过从行列式点过程(Determinantal Point Processes, DPP)教师模型中蒸馏软标签来挑选显著且多样的图像。最终通过结合自回归摘要生成、跨模态对齐和基于DPP的蒸馏的多目标损失函数进行训练,从而生成更准确、视觉上更扎实的摘要并选择更具代表性的图像。

链接: https://arxiv.org/abs/2605.11753
作者: Abid Ali,Diego Molla-Aliod,Usman Naseem
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:Multimodal summarization requires models to jointly understand textual and visual inputs to generate concise, semantically coherent summaries. Existing methods often inject shallow visual features into deep language models, leading to representational mismatches and weak cross-modal grounding. We propose a unified framework that jointly performs text summarization and representative image selection. Our system, SPeCTrA-Sum (Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization), introduces two key innovations. First, a Deep Visual Processor (DVP) aligns the visual encoder with the language model at corresponding depths, enabling hierarchical, layer-wise fusion that preserves semantic consistency. Second, a lightweight Visual Relevance Predictor (VRP) selects salient and diverse images by distilling soft labels from a Determinantal Point Processes (DPP) teacher. SPeCTrA-Sum is trained using a multi-objective loss that combines autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments show that our system produces more accurate, visually grounded summaries and selects more representative images, demonstrating the benefits of depth-aware fusion and principled image selection for multimodal summarization.

[AI-100] When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

【速读】:该论文试图解决的问题是:当前广泛使用的思维链(Chain-of-thought, CoT)轨迹是否真实反映了语言模型在生成最终答案时的内部计算过程——即可见的CoT文本是否与决定答案的潜在计算在时间上保持同步(temporal faithfulness)。解决方案的关键是构建了一个步骤级的“检测-分类-比较”(Detect-Classify-Compare)框架,该框架围绕一个通过Patchscopes、tuned-lens探针和因果方向消融(causal direction ablation)进行交叉验证的答案承诺代理(answer-commitment proxy),来检测潜在承诺(latent commitment)与显式答案到达(explicit answer arrival)之间的对齐情况。通过这一框架,论文揭示了普遍存在的“虚构延续”(confabulated continuation)模式——即答案承诺已经稳定后,轨迹仍继续生成看似深思熟虑的文本,而后续的截断实验和捐赠-破坏测试(donor-corruption test)进一步表明这些后承诺文本对最终答案并非负载性的(not load-bearing)。

链接: https://arxiv.org/abs/2605.11746
作者: Wenkai Li,Fan Yang,Ananya Hazarika,Shaunak A. Mehta,Koichi Onoue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) traces are increasingly used both to improve language model capability and to audit model behavior, implicitly assuming that the visible trace remains synchronized with the computation that determines the answer. We test this assumption with a step-level Detect-Classify-Compare framework built around an answer-commitment proxy that is cross-validated with Patchscopes, tuned-lens probes, and causal direction ablation. Across nine models and seven reasoning benchmarks, latent commitment and explicit answer arrival align on only 61.9% of steps on average. The dominant mismatch pattern is confabulated continuation: 58.0% of detected mismatch events occur after the answer-commitment proxy has already stabilized while the trace continues producing deliberative-looking text, and a vacuousness analysis shows that the committed answer does not change during these steps. In architecture-matched Qwen2.5/DeepSeek-R1-Distill comparisons, the reasoning pipeline changes failure composition more than aggregate alignment, most clearly at 32B where confabulated steps decrease as contradictory states increase. Lower step-level alignment is also associated with larger CoT utility, suggesting that the settings that benefit most from CoT are often the least temporally faithful. Paired truncation and a complementary donor-corruption test further indicate that much post-commitment text is not load-bearing for the final answer. These findings suggest that CoT can remain useful while still being an unreliable report of when the answer was formed.

[AI-101] OptArgus: A Multi-Agent System to Detect Hallucinations in LLM-based Optimization Modeling

【速读】:该论文试图解决大型语言模型(LLM)在将自然语言优化问题转换为数学公式及求解器代码时,仅凭数值匹配无法保证语义正确性的问题,即优化建模幻觉(optimization-modeling hallucination)——一个在数值上一致但改变了底层优化语义的假阳性问题。解决方案的关键在于提出一个细粒度的幻觉分类体系(taxonomy),涵盖目标、变量、约束和实现四大类失败模式,并基于此设计多智能体检测器OptArgus,通过指挥路由(conductor routing)、专业审计员(specialist auditors)和证据整合(evidence consolidation)对问题描述、符号模型和求解器实现进行结构一致性审计,从而将幻觉检测转化为一个具体的实证问题。

链接: https://arxiv.org/abs/2605.11738
作者: Zhong Li,Zihan Guo,Xiaohan Lu,Juntao Wang,Jie Song,Chao Shen,Jiageng Wu,Mingyang Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to translate natural-language optimization problems into mathematical formulations and solver code, but matching the reference objective value is not a reliable test of correctness: an artifact may agree numerically while still changing the underlying optimization semantics. We formulate this issue as optimization-modeling hallucination detection, namely structural consistency auditing over the problem description, symbolic model, and solver implementation. We develop, to our knowledge, the first fine-grained hallucination taxonomy specifically for optimization modeling, spanning objective, variable, constraint, and implementation failures. We use this taxonomy to design OptArgus, a multi-agent detector with conductor routing, specialist auditors, and evidence consolidation. To evaluate this setting, we introduce a three-part benchmark suite with 484 clean artifacts, 1266 controlled injected artifacts, and 6292 natural LLM-generated artifacts. Against a matched single-agent baseline, OptArgus produces fewer false alarms on clean artifacts, more accurate top-ranked localization on controlled single-error cases, and stronger detection on natural model outputs. Together, these contributions turn optimization-modeling hallucination detection into a concrete empirical problem and suggest that modular, taxonomy-grounded auditing is a practical route to more reliable optimization modeling.

[AI-102] SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

【速读】:该论文试图解决多模态大语言模型(MLLMs)在越狱攻击(jailbreak attacks)下现有防御方法的局限性:依赖高成本的微调(fine-tuning)或低效的后处理干预(post-hoc interventions),导致无法有效应对新型攻击,且常需在模型性能与安全性之间进行权衡。解决方案之关键在于:利用MLLMs在解码阶段(decoding stage)本身具备的辨别有害输入的内在能力,设计了一个解码级别的防御机制SafeSteer。其核心包括两部分:一是Decoding-Probe,一种轻量级探针,能够在解码过程中实时检测并纠正有害输出,通过迭代引导解码朝向安全方向;二是模态语义对齐向量(modal semantic alignment vector),用于将文本模态的强安全性对齐迁移到视觉模态,以应对更隐蔽的图像攻击。该方法无需微调,即可显著提升安全性(最高达33.40%),同时保持模型的有用性(helpfulness)与无害性(harmlessness)之间的平衡。

链接: https://arxiv.org/abs/2605.11716
作者: Xinyi Zeng,Xue Yang,Jingyuan Zhang,Huanqian Yan,Xiang Chen,Kaiwen Wei,Hankun Kang,Yu Tian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are gaining increasing attention. Due to the heterogeneity of their input features, they face significant challenges in terms of jailbreak defenses. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, limiting their ability to address novel attacks and involving performance trade-offs. To address the above issues, we explore the inherent safety capabilities within MLLMs and quantify their intrinsic ability to discern harmfulness at decoding stage. We observe that 1) MLLMs can distinguish the harmful and harmless inputs during decoding process, 2) Image-based attacks are more stealthy. Based on these insights, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it includes a Decoding-Probe, a lightweight probe for detecting and correcting harmful output during decoding, which iteratively steers the decoding process toward safety. Furthermore, a modal semantic alignment vector is integrated to transfer the strong textual safety alignment to the vision modality. Experiments on multiple MLLMs demonstrate that SafeSteer can improve MLLMs’ safety by up to 33.40% without fine-tuning. Notably, it can maintain the effectiveness of MLLMs, ensuring a balance between their helpfulness and harmlessness.

[AI-103] Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance ICML2026

【速读】:该论文试图解决大型语言模型(LLMs)与人类价值观对齐过程中存在的根本性矛盾:模型的残差流(residual stream)高度动态,价值观作为脆弱、低维的属性难以在这种动态环境中保持稳定表达,导致价值观对齐缺乏一致性。解决方案的关键在于提出稳定价值引导Transformer(Stable Value Guidance Transformer, SVGT),通过一个独立的价值模块实现两个核心设计:(1)独立价值建模(independent value modeling),在与主干网络隔离的专用价值空间中维持规范化的表征;(2)显式行为引导(explicit behavioral guidance),将这些稳定信号转化为可学习的潜在桥接标记(Latent Bridge Tokens),作为动态价值锚点显式引导生成轨迹,从而在不干扰主干网络内部表征的前提下,确保模型在不同上下文中稳健地遵循价值观。实验结果表明,SVGT能够在保持生成流畅性的同时,将有害分数降低超过70%。

链接: https://arxiv.org/abs/2605.11712
作者: Wenhao Chen,Sirui Sun,Shengyuan Bai,Guojie Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026 (Spotlight). 32 pages

点击查看摘要

Abstract:Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone’s parameters or representation space. However, a critical gap exists: the model’s residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone’s internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT generally reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling. Our code is available at this https URL.

[AI-104] Debiased Model-based Representations for Sample-efficient Continuous Control ICML2026

【Quick Read】: This paper targets two key problems of model-based representations in off-policy actor-critic learning: failure to capture sufficient information about relevant variables, and overfitting to early experiences in the replay buffer, which together introduce biases into representation and actor-critic learning and hurt performance. The core of the solution is the DR.Q algorithm (Debiased model-based Representations for Q-learning), whose key innovations are: explicitly maximizing the mutual information between the representation of the current state-action pair and the next state, while minimizing their deviations, to better capture dynamics information; and sampling transitions with faded prioritized experience replay to mitigate overfitting to early experiences, achieving debiased representation learning.

Link: https://arxiv.org/abs/2605.11711
Authors: Jiafei Lyu,Zichuan Lin,Scott Fujimoto,Kai Yang,Yangkun Chen,Saiyong Yang,Zongqing Lu,Deheng Ye
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026

Click to view abstract

Abstract:Model-based representations recently stand out as a promising framework that embeds latent dynamics information into the representations for downstream off-policy actor-critic learning. It implicitly combines the advantages of both model-free and model-based approaches while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These incur biases in representation and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning, tagged DR.Q algorithm. DR.Q explicitly maximizes the mutual information between the representations of the current state-action pair and the next state besides minimizing their deviations, and samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin. Our code is available at this https URL.
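The abstract does not spell out how "faded" prioritized replay works; one plausible reading is that TD-error-based priorities are attenuated by transition age, so early experiences gradually lose sampling weight over training. A minimal sketch under that assumption (the `alpha` and `fade` parameters are illustrative, not from the paper):

```python
import random

def faded_per_probs(priorities, ages, alpha=0.6, fade=0.01):
    """Sampling probabilities for a 'faded' prioritized replay:
    priorities are exponentiated as in standard PER, then divided by
    an age-dependent factor so older transitions are sampled less."""
    weights = [(p ** alpha) / (1.0 + fade * a) for p, a in zip(priorities, ages)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_batch(buffer, priorities, ages, k, rng=random.Random(0)):
    """Draw k transitions according to the faded-priority distribution."""
    probs = faded_per_probs(priorities, ages)
    return rng.choices(buffer, weights=probs, k=k)
```

With equal priorities, a transition that has sat in the buffer for 100 steps receives a lower sampling probability than a fresh one, which matches the stated goal of not overfitting to early experiences.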

[AI-105] Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality Alignment and Diversity ACL2026

【Quick Read】: This paper tackles the fragmented evaluation of multimodal summaries produced by multimodal large language models (MLLMs) under Multimodal Summarization with Multimodal Output (MSMO), where concise text is paired with salient visuals: text quality, image-text alignment, and visual diversity are typically assessed in isolation with unimodal metrics, making it hard to judge whether the modalities jointly support a faithful, useful summary. The key to the solution is a unified evaluation framework, MM-Eval, integrating three components: (1) text quality, measured with OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, judged via an MLLM-as-a-judge approach; and (3) image-set diversity, quantified with Truncated CLIP Entropy. The framework is further calibrated through an aggregation model trained on the mLLM-EVAL news benchmark, aligning component contributions with human preferences and yielding an interpretable, reference-weak evaluation scheme.

Link: https://arxiv.org/abs/2605.11693
Authors: Abid Ali,Diego Molla-Aliod,Usman Naseem
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of ACL 2026

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have facilitated Multimodal Summarization with Multimodal Output (MSMO), wherein systems generate concise textual summaries accompanied by salient visuals from multimodal sources. However, current MSMO evaluation remains fragmented: text quality, image-text alignment, and visual diversity are typically assessed in isolation using unimodal metrics, making it difficult to capture whether the modalities jointly support a faithful and useful summary. To address this gap, we introduce MM-Eval, a unified evaluation framework that integrates assessments of textual quality, cross-modal alignment, and visual diversity. MM-Eval comprises three components: (1) text quality, measured using OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, evaluated via an MLLM-as-a-judge approach; and (3) image-set diversity, quantified using Truncated CLIP Entropy. We calibrate MM-Eval through a learned aggregation model trained on the mLLM-EVAL news benchmark, aligning component contributions with human preferences. Our analysis reveals a text-dominant hierarchy in this setting, where factual consistency acts as a critical determinant of perceived overall quality, while visual relevance and diversity provide complementary signals. MM-Eval improves over heuristic aggregation baselines and provides an interpretable, reference-weak framework for comparative evaluation of multimodal summaries.
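The abstract does not specify the form of the learned aggregation model; a minimal sketch of one plausible form is a nonnegative weighted sum of per-component scores (e.g., factuality and diversity) fitted to human ratings by gradient descent. All names, data, and hyperparameters below are hypothetical:

```python
def fit_aggregator(component_scores, human_scores, lr=0.1, steps=2000):
    """Fit nonnegative weights so that a weighted sum of per-component
    quality scores tracks human ratings. Plain projected gradient descent
    on mean squared error; illustrative only, not MM-Eval's actual model."""
    n, d = len(component_scores), len(component_scores[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(component_scores, human_scores):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(d):
                grad[j] += 2 * err * x[j] / n
        # project onto the nonnegative orthant so weights stay interpretable
        w = [max(0.0, wi - lr * g) for wi, g in zip(w, grad)]
    return w
```

On synthetic ratings generated as 0.7·factuality + 0.3·diversity, the fitted weights recover that text-dominant hierarchy, mirroring the paper's finding that factual consistency is the main determinant of perceived quality.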

[AI-106] Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI

【Quick Read】: This paper addresses three core problems in explainable AI (XAI) for finance: explanations lack persistence, so their history cannot be traced; single-method explanations are insufficiently reliable; and non-technical decision-makers cannot access explanations through natural-language interaction. The key to the solution is a human-centered architecture with three core innovations. First, XAI artifacts such as LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps are treated as persistent, searchable objects in distributed S3-compatible storage, annotated with structured metadata and natural-language summaries, enabling semantic retrieval over explanation history and automatic index reconstruction after system failures. Second, multi-method explanation triangulation: a retrieval-augmented generation (RAG) assistant compares and synthesizes the results of different XAI methods applied to the same prediction, letting users assess explanation robustness through natural-language dialogue. Third, the faithfulness of generated explanations is evaluated with automated checks covering grounding completeness, hallucinated claims, and method-attribution behavior. Experiments show that constrained prompting reduces the hallucination rate by 36% and increases method-attribution citations by 73% compared with naive prompting, supporting trustworthy, human-centered AI services in regulated financial environments.

Link: https://arxiv.org/abs/2605.11687
Authors: Georgios Makridis,Georgios Fatouros,John Soldatos,George Katsis,Dimosthenis Kyriazis
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 5 pages

Click to view abstract

Abstract:Financial institutions increasingly require AI explanations that are persistent, cross-validated across methods, and conversationally accessible to human decision-makers. We present an architecture for human-centered explainable AI in financial sentiment analysis that combines three contributions. First, we treat XAI artifacts – LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps – as persistent, searchable objects in distributed S3-compatible storage with structured metadata and natural-language summaries, enabling semantic retrieval over explanation history and automatic index reconstruction after system failures. Second, we enable multi-method explanation triangulation, where a retrieval-augmented generation (RAG) assistant compares and synthesizes results from multiple XAI methods applied to the same prediction, allowing users to assess explanation robustness through natural-language dialogue. Third, we evaluate the faithfulness of generated explanations using automated checks over grounding completeness, hallucinated claims, and method-attribution behavior. We demonstrate the architecture on an EXTRA-BRAIN financial sentiment analysis pipeline using FinBERT predictions and present evaluation results showing that constrained prompting reduces hallucination rate by 36% and increases method-attribution citations by 73% compared to naive prompting. We discuss implications for trustworthy, human-centered AI services in regulated financial environments.

[AI-107] Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

【Quick Read】: This paper addresses the zero-sum conflict between different human preferences (e.g., helpfulness vs. harmlessness) in multi-objective alignment of large language models: existing approaches (data selection, parameter merging, or algorithmic balancing during training) merely force compromises along a fixed Pareto frontier and cannot fundamentally dissolve the inherent tension. The key to the solution lies in a multi-dimensional rewards perspective: by scaling up the model's rollouts and analyzing outputs across reward dimensions, the authors find that the conflict stems from the prompt itself restricting the achievable multi-dimensional reward space. Based on this observation, they propose MORA (Multi-Objective Reward Assimilation), which isolates single-reward prompts through pre-sampling and rewrites the original questions to incorporate multi-dimensional intents, expanding reward diversity so the model can serve multiple objectives simultaneously and achieving significant gains in both sequential and simultaneous alignment.

Link: https://arxiv.org/abs/2605.11679
Authors: ShiYing Huang,Liang Lin,Yuer Li,Kaiwen Luo,Zhenhong Zhou,An Zhang,Junhao Dong,Kun Wang,Zhigang Zeng
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model’s rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at this https URL.

[AI-108] OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models

【Quick Read】: This paper addresses the problem that end-to-end Vision-Language-Action models for autonomous driving require 20-60GB of GPU memory at inference time, far exceeding the 12-16GB available on commodity GPUs. The key to the solution is system-level optimization that enables efficient inference on VRAM-constrained GPUs without modifying model weights, in three steps. First, Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity, sharply lowering peak demand. Second, Pipelined Demand Layering overlaps parameter transfer with layer computation, hiding transfer time within compute time. Third, a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. The paper also proposes a performance prediction model that determines the optimal configuration (the number and placement of resident layers) from a single profiling run with under 1.3% prediction error. Applied to NVIDIA's Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), the approach achieves up to a 3.55x speedup over Accelerate offloading while maintaining full BF16 precision.

Link: https://arxiv.org/abs/2605.11678
Authors: Seungwoo Roh,Huiyeong Kim,Jong-Chan Kim
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE RTCSA on March 26, 2026 (KST); Accepted on May 4, 2026 (KST)

Click to view abstract

Abstract:End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network, achieving strong driving performance but requiring 20-60GB of GPU memory-far exceeding the 12-16GB available on commodity GPUs. We present a framework, which enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification. Our work proceeds in three stages: (1) Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity; (2) Pipelined Demand Layering hides parameter transfer time within layer execution time via transfer–compute overlap; and (3) a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. We further propose a performance prediction model that determines the optimal configuration-both the number and placement of resident layers-from a single profiling run with less than 1.3% prediction error across all configurations. Applied to NVIDIA’s Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), our work achieves up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision.
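The residency policy and latency predictor can be read as a simple cost model: with pipelining, a streamed layer costs roughly max(compute, transfer), while a GPU-resident layer costs only its compute time, so keeping a layer resident is worth up to `transfer - compute` per pass. A hedged sketch under that simplification (the layer tuples and greedy rule are assumptions, not the paper's exact policy):

```python
def predict_latency(layers, resident):
    """Layers: list of (name, size, compute_time, transfer_time).
    A streamed layer costs max(compute, transfer) under transfer-compute
    overlap; a resident layer costs just its compute time."""
    return sum(c if name in resident else max(c, t)
               for name, _, c, t in layers)

def choose_resident(layers, vram_budget):
    """Greedy residency policy: keep the layers with the highest
    hidden-transfer benefit per byte until the VRAM budget is spent."""
    ranked = sorted(layers,
                    key=lambda l: max(0.0, l[3] - l[2]) / l[1],
                    reverse=True)
    resident, used = set(), 0
    for name, size, c, t in ranked:
        if used + size <= vram_budget and t > c:
            resident.add(name)
            used += size
    return resident
```

In this toy model, only layers whose transfer time exceeds their compute time benefit from residency, which is the intuition behind a per-module residency benefit analysis.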

[AI-109] A CAP-like Trilemma for Large Language Models: Correctness, Non-bias and Utility under Semantic Underdetermination

【Quick Read】: This paper addresses a fundamental conflict: under semantic underdetermination, large language models (LLMs) cannot always simultaneously guarantee strong correctness, strict non-bias, and high utility. The key contribution is the formalization of a CAP-like trilemma: when a prompt is semantically underdetermined (its premises do not determine a unique answer), a useful, decisive response requires the model to introduce a selection criterion, preference, prior, or value ordering, which necessarily introduces bias in a broad selection-theoretic sense; conversely, avoiding unsupported preferences through refusal or hedging preserves correctness and non-bias but reduces utility. The work thereby argues that certain LLM failures arise not merely from model limitations but from the structure of underdetermined decision requests.

Link: https://arxiv.org/abs/2605.11672
Authors: Vinu Ellampallil Venugopal
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:The CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance under network partition. Inspired by this result, this paper formulates a CAP-like conjecture for Large Language Models (LLMs). The proposed trilemma states that, under semantic underdetermination, an LLM cannot always simultaneously guarantee strong correctness, strict non-bias, and high utility. A prompt is semantically underdetermined when the given premises do not determine a unique answer. In such cases, a useful and decisive response requires the model to introduce a selection criterion, preference, prior, or value ordering. If this criterion is not supplied by the user or justified by the available premises, the response becomes biased in a broad selection-theoretic sense. Conversely, if the model avoids unsupported preferences, it may preserve correctness and non-bias but may reduce utility through refusal, hedging, or clarification. The paper formalizes this correctness–non-bias–utility trilemma, develops examples, and argues that certain LLM failures arise not merely from model limitations but from the structure of underdetermined decision requests.

[AI-110] Cochise: A Reference Harness for Autonomous Penetration Testing

【Quick Read】: This paper addresses the problem that existing LLM-driven autonomous penetration-testing systems combine many architectural, prompting, and tool-integration choices, making it hard to tell what is gained over a simple agent scaffold. The key to the solution is cochise, a minimal reference harness (597 lines of Python) implementing a separated Planner-Executor architecture in which long-term state is maintained outside the LLM context, while a ReAct-style executor issues commands over SSH and self-corrects based on their outputs. Together with replay and analysis tooling and a corpus of test-run trajectory logs, it serves as reusable experimental infrastructure for controlled comparisons of models, agent architectures, and penetration-testing traces.

Link: https://arxiv.org/abs/2605.11671
Authors: Andreas Happe,Jürgen Cito
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:Recent work on LLM-driven autonomous penetration testing reports promising results, but existing systems often combine many architectural, prompting, and tool-integration choices, making it difficult to tell what is gained over a simple agent scaffold. We present cochise, a 597 LOC Python reference harness for autonomous penetration-testing experiments. Cochise connects an LLM-driven agent to a Linux execution host over SSH and supports controlled target environments reachable from that jump host. The prototype implements a separated Planner–Executor architecture in which long-term state is maintained outside the LLM context, while a ReAct-style executor issues commands over SSH and self-corrects based on command outputs. The scenario prompt can be adapted to different target environments. To demonstrate the efficacy of our minimal harness, we evaluate it against a live third-party testbed called Game of Active Directory (GOAD). Alongside the harness, we release replay and analysis tools: (i) cochise-replay for offline visualization of captured runs, (ii) cochise-analyze-alogs and cochise-analyze-graphs for cost, token, duration, and compromise analysis, and (iii) a corpus of JSON trajectory logs from GOAD runs, allowing researchers to study agent behavior without provisioning the 48–64 GB RAM / 190 GB storage testbed themselves. Cochise is intended not as a state-of-the-art pen-testing agent, but as reusable experimental infrastructure for comparing models, agent architectures, and penetration-testing traces.

[AI-111] Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling

【Quick Read】: This paper addresses the limitation that post-training of large language models (e.g., reinforcement learning with verifiable rewards, RLVR) is constrained by the diversity and complexity of training data, and that existing data-synthesis methods relying on unstructured mutation or exploration suffer from homogeneity collapse and fail to systematically expand the reasoning frontier. The key to the solution is Evolutionary Task Discovery (EvoTD), which treats data synthesis as a directed search over a dual-axis manifold of Algorithmic Skills and Complexity Attributes. The framework introduces two structured evolutionary operators: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, a dynamic Zone of Proximal Development filter ensures generated tasks lie within the model's learnable region, yielding reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales.

Link: https://arxiv.org/abs/2605.11666
Authors: Liqin Ye,Yanbin Yin,Michael Galarnyk,Yuzhao Heng,Sudheer Chava,Chao Zhang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The reasoning frontier of Large Language Models (LLMs) has advanced significantly through modern post-training paradigms (e.g., Reinforcement Learning from Verifiable Rewards (RLVR)). However, the efficacy of these methods remains fundamentally constrained by the diversity and complexity of the training data. One practical solution is data synthesis; yet, prevalent methods relying on unstructured mutation or exploration suffer from homogeneity collapse, failing to systematically expand the reasoning frontier. To overcome this, we propose Evolutionary Task Discovery (EvoTD), a framework that treats data synthesis as a directed search over a dual-axis manifold of Algorithmic Skills and Complexity Attributes. We introduce structured evolutionary operators to navigate this space: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, we integrate a dynamic Zone of Proximal Development filter, ensuring tasks lie within the learnable region of the model. Empirically, EvoTD delivers substantial reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales, demonstrating that structured evolutionary curricula can effectively support reasoning improvement. We release our code on this https URL.
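The two evolutionary operators and the ZPD filter can be illustrated on toy task specifications. The dictionary layout, the scaling factor, and the solve-rate band below are illustrative assumptions, not EvoTD's actual representation:

```python
import random

def crossover(parent_a, parent_b, rng):
    """Compose a child task from two parents' skill sets (diversity axis)."""
    skills = sorted(set(parent_a["skills"]) | set(parent_b["skills"]))
    k = max(2, len(skills) // 2)
    return {"skills": rng.sample(skills, k),
            "complexity": dict(parent_a["complexity"])}

def parametric_mutation(task, factor=1.5):
    """Scale structural constraints such as input size (complexity axis)."""
    return {"skills": list(task["skills"]),
            "complexity": {k: max(1, int(v * factor))
                           for k, v in task["complexity"].items()}}

def in_zpd(solve_rate, lo=0.2, hi=0.8):
    """Zone-of-proximal-development filter: keep tasks the model can
    sometimes, but not always, solve."""
    return lo <= solve_rate <= hi
```

A synthesis loop would alternate the two operators and discard any child whose empirical solve rate falls outside the ZPD band.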

[AI-112] Every Bit Everywhere All at Once: A Binomial Multibit LLM Watermark

【Quick Read】: This paper addresses the practical deployment of multibit watermarks in text generated by large language models (LLMs): how to efficiently and robustly encode complex payloads such as user IDs or timestamps. The key to the solution is a fundamentally new binomial encoding approach that directly encodes every bit of the payload at every token position, combined with a stateful encoder that during generation dynamically redirects encoding pressure toward underencoded bits, ensuring every bit is reliably embedded. The paper also challenges prior works' evaluation metrics for their lack of practical insight and introduces per-bit confidence scoring as a practically relevant metric for multibit LLM watermarks.

Link: https://arxiv.org/abs/2605.11653
Authors: Thibaud Gloaguen,Robin Staab,Mark Vero,Martin Vechev
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:With LLM watermarking already being deployed commercially, practical applications increasingly require multibit watermarks that encode more complex payloads, such as user IDs or timestamps, into the generated text. In this work, we propose a fundamentally new approach for multibit watermarking: introducing binomial encoding to directly encode every bit of the payload at every token position. We complement our approach with a stateful encoder that during generation dynamically redirects encoding pressure toward underencoded bits. Our evaluation against 8 baselines on up to 64-bit payloads shows that our scheme achieves superior message accuracy and robustness, with the gap to baseline methods widening in more relevant settings (i.e., large payloads and low-distortion regimes). At the same time, we challenge prior works’ evaluation metrics, highlighting their lack of practical insights, and introduce per-bit confidence scoring as a practically relevant metric for evaluating multibit LLM watermarks.
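As a toy illustration of the idea (any bit can be encoded at any position, with a stateful encoder steering effort toward under-encoded bits), the sketch below works on integer "tokens" with a hash-based green indicator. The real scheme biases an LLM's token distribution rather than picking tokens directly; the key, vocabulary, and scheduling rule here are all assumptions:

```python
import hashlib, random

def green_bit(pos, token, bit_idx, key=b"wm-key"):
    """Pseudo-random 0/1 indicator for (position, token, payload-bit index)."""
    h = hashlib.sha256(key + f"{pos}:{token}:{bit_idx}".encode()).digest()
    return h[0] & 1

def embed(payload_bits, n_tokens, vocab=range(32), seed=0):
    """Stateful encoder: at each position, target the bit with the fewest
    encoding attempts so far, and emit a token whose green indicator for
    that bit matches the payload bit."""
    rng = random.Random(seed)
    counts = [0] * len(payload_bits)
    tokens = []
    for pos in range(n_tokens):
        j = counts.index(min(counts))            # least-encoded bit
        want = payload_bits[j]
        cands = [t for t in vocab if green_bit(pos, t, j) == want]
        tokens.append(rng.choice(cands) if cands else rng.choice(list(vocab)))
        counts[j] += 1                           # attempt recorded either way
    return tokens

def extract(tokens, num_bits):
    """Replay the deterministic attempt schedule and take a per-bit
    majority vote; |votes| / attempts doubles as a per-bit confidence."""
    counts = [0] * num_bits
    votes = [0] * num_bits
    for pos, tok in enumerate(tokens):
        j = counts.index(min(counts))
        votes[j] += 1 if green_bit(pos, tok, j) else -1
        counts[j] += 1
    return [1 if v > 0 else 0 for v in votes]
```

With 40 tokens and a 4-bit payload, each bit accumulates ten votes, so the majority decode is robust even if a few positions fail to encode.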

[AI-113] Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

【Quick Read】: This paper addresses the fragility of large language model (LLM) reasoning under non-idealized contexts (contextual interference): when faced with superfluous information, tangential instructions, or incidental correlations, models drift from the core logic and fall back on surface-level pattern matching rather than robust reasoning. The key lies in the Seirênes framework: through a parameter-shared adversarial self-play loop, a single model plays two competing roles, one constructing plausible yet distracting contexts to expose its own reasoning blind spots, the other discerning the essential task from these distractions to recover the core underlying logic. This continuous internal adversarial pressure forces the model beyond superficial patterns, turning contextual interference into an internal training signal and sustaining a co-evolutionary curriculum as the model improves, ultimately anchoring its capabilities in robust underlying reasoning.

Link: https://arxiv.org/abs/2605.11636
Authors: Chi Zhang,Haibo Qiu,Qiming Zhang,Yufei Xu,Xinbo Gao,Jing Zhang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present Seirênes, a self-play RL framework that transforms contextual interference from a failure mode of LLM reasoning into an internal training signal for co-evolving more resilient reasoners. While RL with verifiable rewards has significantly advanced reasoning capabilities, models can still exhibit fragility when encountering non-idealized contexts: scenarios characterized by superfluous information, tangential instructions, or incidental correlations that differ from the clean distributions typical of standard benchmarks. Seirênes harnesses this vulnerability through a parameter-shared and adversarial self-play loop. Within this framework, a single model is trained to both construct plausible yet distracting contexts that expose its own reasoning blind spots, and solve problems by discerning the essential task from these perturbations to recover the core underlying logic. By pitting these competing objectives against each other, Seirênes compels the model to move beyond superficial pattern matching and anchors its capabilities in robust underlying reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks and model scales from 4B to 30B, Seirênes achieves average gains of +10.2, +9.1, and +7.2 points. Besides, distracting contexts produced by the 4B Seirênes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4–5 points, revealing Seirênes’ general ability to uncover reasoning models’ blind spots.

[AI-114] Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

【Quick Read】: This paper addresses the fact that end-to-end workflows in disaster response remain underexplored: prior work focuses on isolated remote-sensing perception or generic tool-use evaluation, overlooking the full operational pipeline of integrating multi-sensor signals, reasoning over road networks, populations, and key facilities, planning evacuations, and producing actionable reports. The key to the solution is the Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response. It comprises 515 expert-authored tasks spanning 45 real-world disaster events of 10 types; a 108-tool MCP library over heterogeneous geospatial data including optical, SAR, and multi-spectral imagery (0.015-10m GSD) plus elevation and social vector layers; and task pipelines organized along five dimensions (disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis). With expert-verified, replayable gold trajectories totaling 3,500 tool-call steps, DORA provides a rigorous testbed for evaluating operationally reliable disaster-response agents.

Link: https://arxiv.org/abs/2605.11633
Authors: Junjue Wang,Weihao Xuan,Heli Qi,Pengyu Dai,Kunyi Liu,Hongruixuan Chen,Zhuo Zheng,Junshi Xia,Stefano Ermon,Naoto Yokoya
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: DORA stress-tests LLM agents on real-world disaster operations that demand comprehensive orchestration of 108 specialized tools over heterogeneous geospatial data

Click to view abstract

Abstract:Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.

[AI-115] Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

【Quick Read】: This paper addresses the misallocation of test-time compute in large reasoning models (LRMs): existing methods compress reasoning or condition budget only on perceived difficulty while overlooking solvability, wasting large budgets on problems beyond the model's capability and compressing the reasoning needed for hard-but-solvable ones. The key to the solution is to formulate adaptive reasoning as computational investment under uncertainty and to propose Budget-Efficient Thinking (BET), a two-stage framework combining a behavioral cold-start with GRPO (Group Relative Policy Optimization) under an investment-cost-aware reward. By aligning solve-or-fold decisions with rollout-derived solvability, the model learns three behaviors: short solve for easy queries, nice fold (abstaining early when the expected return of continued reasoning is near zero), and hero call (preserving sufficient compute for hard-but-solvable queries).

Link: https://arxiv.org/abs/2605.11625
Authors: Zhaomeng Zhou,Lan Zhang,Junyang Wang,Mu Yuan,Junda Lin
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 24 pages, 6 figures, 11 tables

Click to view abstract

Abstract:Large reasoning models (LRMs) improve problem solving through extended reasoning, but often misallocate test-time compute. Existing efficiency methods reduce cost by compressing reasoning traces or conditioning budget on perceived difficulty, yet largely overlook solvability. As a result, they may spend large budgets on queries beyond the model’s capability while compressing hard-but-solvable queries that require deeper reasoning. In this work, we formulate adaptive reasoning as a computational investment under uncertainty, where budget should follow the expected return of reasoning rather than perceived difficulty alone. To instantiate this principle, we propose Budget-Efficient Thinking (BET), a two-stage framework that combines behavioral cold-start with GRPO under an investment-cost-aware reward. By aligning solve-or-fold decisions with rollout-derived solvability, BET learns three behaviors: (1) short solve, answering easy queries concisely; (2) nice fold, abstaining early when continued reasoning has near-zero expected return; and (3) hero call, preserving sufficient compute for hard-but-solvable queries. Across seven benchmarks and three base models, BET reduces reasoning tokens by ~55% on average while achieving overall performance improvements, and transfers zero-shot from mathematical reasoning to scientific QA and logical reasoning with comparable efficiency gains.
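The solve-or-fold decision can be read as comparing expected returns at different budgets. A minimal illustrative decision rule follows; the probabilities, token budgets, and costs are hypothetical, and BET actually learns this behavior via reinforcement learning rather than applying an explicit formula:

```python
def bet_action(p_small, p_large, tokens_small=200, tokens_large=4000,
               reward=1.0, cost_per_token=1e-4):
    """Pick the behavior with the highest expected return:
    short solve (small budget), hero call (large budget),
    or nice fold (abstain at zero cost and zero return).
    p_small/p_large: estimated solve probability at each budget."""
    ret_small = p_small * reward - cost_per_token * tokens_small
    ret_large = p_large * reward - cost_per_token * tokens_large
    best = max(("short solve", ret_small),
               ("hero call", ret_large),
               ("nice fold", 0.0),
               key=lambda kv: kv[1])
    return best[0]
```

An easy query (already likely solved at a small budget) maps to a short solve, a query that only becomes solvable with deep reasoning maps to a hero call, and a query unsolvable at any budget maps to a nice fold.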

[AI-116] From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

【Quick Read】: This paper addresses the unclear meaning and credit assignment of token-level rewards in on-policy self-distillation. Its core analysis, under the posterior-compatibility interpretation standard in the implicit-reward literature, shows that the self-distillation token reward is a Bayesian filtering increment whose trajectory sum equals the pointwise mutual information (pMI) between the response and the feedback given the input; this pMI, however, can arise from input-specific reasoning or from input-generic shortcuts. The key to the solution is CREDIT (Contrastive REward from DIsTillation), which decomposes the teacher log-probability along the input axis and isolates the input-specific component with a batch-contrastive baseline, yielding at the sequence level a contrastive pMI objective that also penalizes responses remaining likely under unrelated inputs. This improves aggregate performance on coding, scientific reasoning, and tool-use benchmarks at negligible additional compute.

Link: https://arxiv.org/abs/2605.11613
Authors: Guobin Shen,Lei Huang,Xiang Cheng,Chenxiao Zhao,Jindong Li,Dongcheng Zhao,Xing Yu
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:On-policy self-distillation has emerged as a promising paradigm for post-training language models, in which the model conditions on environment feedback to serve as its own teacher, providing dense token-level rewards without external teacher models or step-level annotations. Despite its empirical success, what this reward actually measures and what kind of credit it assigns remain unclear. Under a posterior-compatibility interpretation of feedback conditioning, standard in the implicit-reward literature, we show that the self-distillation token reward is a Bayesian filtering increment whose trajectory sum is exactly the pointwise mutual information between the response and the feedback given the input. This pMI can be raised by input-specific reasoning or by input-generic shortcuts, so we further decompose the teacher log-probability along the input axis. Based on this analysis, we propose CREDIT (Contrastive REward from DIsTillation), which isolates the input-specific component with a batch-contrastive baseline. At the sequence level, CREDIT is a teacher-side surrogate for a contrastive pMI objective that also penalizes responses remaining likely under unrelated inputs. Across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT delivers the strongest aggregate performance at negligible additional compute.
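The batch-contrastive baseline can be sketched directly: given teacher log-probabilities of each response under each input in the batch, subtract from the matched log-prob the mean log-prob of the same response under the other inputs. This is a sketch of the idea at the sequence level, not the paper's exact estimator:

```python
def credit_rewards(logp):
    """logp[i][j] = teacher log-prob of response y_i conditioned on input x_j.
    The batch-contrastive baseline subtracts the mean log-prob of y_i under
    the *other* inputs, isolating the input-specific component and zeroing
    out responses that stay equally likely under unrelated inputs."""
    n = len(logp)
    rewards = []
    for i in range(n):
        baseline = sum(logp[i][j] for j in range(n) if j != i) / (n - 1)
        rewards.append(logp[i][i] - baseline)
    return rewards
```

A response that is only likely under its own input earns a positive reward, while an input-generic shortcut response (equally likely everywhere) earns zero.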

[AI-117] CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

【Quick Read】: This paper addresses a blind spot in training agentic retrieval-augmented generation (RAG) systems with Reinforcement Learning with Verifiable Rewards (RLVR): existing methods sample all rollouts uniformly, ignoring their heterogeneous informativeness. Concretely, trajectories differ in search depth, and deeper-search trajectories contain more retrieval decision points, providing denser direct supervision for the retrieval sub-policy; this heterogeneity grows during training as the within-batch depth distribution shifts upward, yet uniform sampling remains blind to the shift. The key to the solution is CuSearch, a curriculum rollout-sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA has two variants: SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward, while SDGA-Phase explicitly advances the curriculum threshold once deeper trajectories become sufficiently abundant. Treating per-trajectory search depth as a reliable, annotation-free proxy for supervision density improves RLVR training, with gains of up to 11.8 exact-match points over standard GRPO on ZeroSearch.

Link: https://arxiv.org/abs/2605.11611
Authors: Jianghan Shen,Siqi Luo,Xinyu Cheng,Jing Xiong,Yue Li,Jiyao Liu,Jiashi Lin,Yirong Chen,Junjun He
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth and are therefore not equally informative: deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows over training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to this shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold as deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, achieving up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training. The code is available at this https URL.
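The two SDGA variants reduce to simple batch-level selection rules over (id, search_depth) trajectory tuples. A sketch under that assumed representation; the threshold-advance rule in SDGA-Phase is a guess at the paper's mechanism:

```python
def sdga_auto(trajectories, budget):
    """SDGA-Auto: spend the fixed update budget on the deepest-search
    rollouts in the current batch. Each trajectory is (id, search_depth)."""
    ranked = sorted(trajectories, key=lambda t: t[1], reverse=True)
    return [t[0] for t in ranked[:budget]]

def sdga_phase(trajectories, budget, threshold):
    """SDGA-Phase: prefer trajectories at or above the current curriculum
    depth threshold, topping up with the deepest remaining ones; advance
    the threshold once deep rollouts alone can fill the budget."""
    deep = [t for t in trajectories if t[1] >= threshold]
    rest = sorted((t for t in trajectories if t[1] < threshold),
                  key=lambda t: t[1], reverse=True)
    chosen = (deep + rest)[:budget]
    next_threshold = threshold + 1 if len(deep) >= budget else threshold
    return [t[0] for t in chosen], next_threshold
```

Because the batch's depth distribution shifts upward over training, SDGA-Auto naturally tracks it, while SDGA-Phase makes the curriculum progression explicit.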

[AI-118] GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

【Quick Read】: This paper addresses per-request routing over heterogeneous pools of large language models (7B-70B) that minimizes per-request CO2 emissions while guaranteeing response quality (accuracy floors and p95-latency service-level objectives), a setting where current routing methods ignore the time-varying carbon intensity of the grid and differences in model energy consumption. The key to the solution is Green-Aware Routing (GAR), a constrained multi-objective optimization framework that sets accuracy floors via adaptive constraint optimization (per-dataset floor tuning) and uses lightweight estimators to predict correctness, tail latency, and carbon emissions in real time without additional inference passes. On top of this, GAR-PD, an online primal-dual routing algorithm, adapts routing decisions under rolling carbon budgets, with heuristic variants that achieve high feasibility coverage while limiting accuracy degradation.

Link: https://arxiv.org/abs/2605.11603
Authors: Disha Sheshanarayana,Rajat Subhra Pal,Manjira Sinha,Tirthankar Dasgupta
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The growing deployment of large language models (LLMs) makes per-request routing essential for balancing response quality and computational cost across heterogeneous model pools. Current routing methods rarely consider sustainable energy use and CO2 emissions as optimization objectives, despite grid carbon intensity varying by time and region, and models differing significantly in energy consumption. To address this gap, we introduce Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives (SLOs). GAR employs adaptive constraint optimization through per-dataset floor tuning and incorporates lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. We present GAR-PD, a practical online primal-dual routing algorithm for rolling carbon budgets, alongside heuristic variants that achieve high feasibility coverage while limiting accuracy degradation. Comprehensive experiments across standard NLP benchmarks with heterogeneous LLM pools (7B-70B) demonstrate that GAR achieves substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees, providing a practical, theoretically grounded approach to sustainable LLM inference.
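An online primal-dual router of this kind can be sketched with a single dual variable for the rolling carbon budget: models violating the accuracy floor or latency SLO are filtered out, and the dual price shifts traffic toward greener models whenever spending exceeds the budget. The model dictionaries, loss form, and step size below are illustrative assumptions, not GAR-PD's exact update:

```python
def route(models, lam, acc_floor, latency_slo):
    """Pick the feasible model (accuracy floor + p95 SLO) minimizing the
    Lagrangian carbon-penalized loss: (1 - acc) + lam * carbon."""
    feasible = [m for m in models
                if m["acc"] >= acc_floor and m["p95"] <= latency_slo]
    if not feasible:
        return None
    return min(feasible, key=lambda m: (1 - m["acc"]) + lam * m["carbon"])

def gar_pd(requests_models, acc_floor, latency_slo,
           budget_per_request, step=0.5):
    """Online primal-dual loop: after each routed request, move the dual
    variable lam by the signed budget overrun, so sustained overspending
    raises the carbon price and shifts traffic toward greener models."""
    lam, picks = 0.0, []
    for models in requests_models:
        m = route(models, lam, acc_floor, latency_slo)
        picks.append(m["name"] if m else None)
        spent = m["carbon"] if m else 0.0
        lam = max(0.0, lam + step * (spent - budget_per_request))
    return picks, lam
```

With a large high-carbon model and a smaller green one, the first request goes to the large model; the resulting budget overrun raises the dual price, and subsequent requests are routed to the greener model.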

[AI-119] EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

【速读】:该论文试图解决当前流行病预测研究因缺乏高质量、跨疾病和跨地区的多元时间序列基准数据集,而制约多元预测方法开发与公平比较的问题。解决方案的关键在于构建一个名为 EpiCastBench 的大规模基准框架,该框架包含40个经过精心整理的多元疫情数据集,涵盖多种传染病、不同的时间粒度和稀疏性特征,并统一设定了标准化评估设置(包括预报范围、预处理流程、性能指标及统计显著性检验),从而为从统计基线到前沿深度学习及基础模型在内的15种多元预测模型提供了可复现、公平的评估平台。

链接: https://arxiv.org/abs/2605.11598
作者: Madhurima Panja,Danny D’Agostino,Huitao Li,Tanujit Chakraborty,Nan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:The increasing adoption of data-driven decision-making in public health has established epidemic forecasting as a critical area of research. Recent advances in multivariate forecasting models better capture complex temporal dependencies than conventional univariate approaches, which model individual series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high-quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large-scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and exhibit diverse characteristics in terms of temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. By leveraging this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models spanning statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle (this https URL) and GitHub (this https URL).

[AI-120] Native Explainability for Bayesian Confidence Propagation Neural Networks: A Framework for Trusted Brain-Like AI

【速读】:该论文旨在解决一个核心问题:基于贝叶斯置信传播神经网络(BCPNN)的类脑架构虽然具备无监督表示学习、神经形态友好稀疏性及现场可编程门阵列(FPGA)边缘部署潜力,但缺乏系统性的决策解释框架,无法满足欧盟人工智能法案(Regulation 2024/1689)对高风险系统提出的可信任、透明性要求。解决方案的关键在于首次提出BCPNN的可解释人工智能(XAI)分类法,将其权重、偏置、超列后验概率、结构塑性使用分数、吸引子动力学和输入重建种群等架构原语直接映射到归因、原型、概念、反事实和机制解释等既定XAI家族;同时引入16个架构级解释原语(P1-P16),其中多个在标准人工神经网络中无对应物,并给出基于模型已有量的闭式算法;此外还提出5个设计时配置即解释原语(Config-P1至Config-P5),将超参数选择作为可审计的部署前解释制品,最终通过路线图整合到工业物联网部署中,实现与欧盟AI法案对齐及边缘可行性。

链接: https://arxiv.org/abs/2605.11595
作者: Georgios Makridis,Georgios Fatouros,John Soldatos,George Katsis,Dimosthenis Kyriazis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:The EU Artificial Intelligence Act (Regulation 2024/1689), fully applicable to high-risk systems from August 2026, creates urgent demand for AI architectures that are simultaneously trustworthy, transparent, and feasible to deploy on resource-constrained edge devices. Brain-like neural networks built on the Bayesian Confidence Propagation Neural Network (BCPNN) formalism have re-emerged as a credible alternative to backpropagation-driven deep learning. They deliver state-of-the-art unsupervised representation learning, neuromorphic-friendly sparsity, and existing FPGA implementations that target edge deployment. Despite this momentum, no systematic framework exists for explaining BCPNN decisions – a gap the present paper fills. We argue that BCPNN is, in the sense of Rudin’s interpretable-by-design agenda, an inherently transparent model whose architectural primitives map directly onto established explainable-AI (XAI) families. We make four contributions. First, we propose the first XAI taxonomy for BCPNN. It maps weights, biases, hypercolumn posteriors, structural-plasticity usage scores, attractor dynamics, and input-reconstruction populations onto attribution, prototype, concept, counterfactual, and mechanistic explanation modalities. Second, we introduce sixteen architecture-level explanation primitives (P1–P16), several without analogue in standard ANNs. We provide closed-form algorithms for computing each from quantities the model already maintains. Third, we introduce five design-time Configuration-as-Explanation primitives (Config-P1 to Config-P5) that treat BCPNN hyperparameter choices as an auditable pre-deployment explanation artifact. Fourth, we sketch a roadmap for integration into industrial IoT deployments and discuss EU AI Act alignment, edge feasibility, and Industry 5.0 implications.

[AI-121] SoK: Unlearnability and Unlearning for Model Dememorization

【速读】:该论文旨在解决高级模型去记忆方法(包括可用性中毒和机器遗忘)中存在的浅层去记忆问题、权重扰动导致的虚假遗忘声明、输入扰动对下游遗忘效果的干扰、遗忘可能恢复被隐藏的领域知识,以及缺乏形式化理论保证等通用漏洞。解决方案之关键包括三个方面:首先,提出一个统一的分类体系来系统化不可学习性与可扩展遗忘方法;其次,通过实证评估揭示主流方法在鲁棒性、相互影响及浅层去记忆方面的局限性;最后,首次为经过认证遗忘处理的模型提供去记忆深度的理论保证,从而奠定跨机器学习生命周期统一去记忆机制的基础,以实现对敏感知识的更深层次免疫状态。

链接: https://arxiv.org/abs/2605.11592
作者: Mengying Zhang,Derui Wang,Ruoxi Sun,Xiaoyu Xia,Shuang Hao,Minhui Xue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: The first two authors contributed equally

点击查看摘要

Abstract:Advanced model dememorization methods, including availability poisoning (unlearnability) and machine unlearning, are emerging as key safeguards against data misuse in machine learning (ML). At the training stage, unlearnability embeds imperceptible perturbations into data before release to reduce learnability. At the post-training stage, unlearning removes previously acquired information from models to prevent unauthorized disclosure or use. While both defenses aim to preserve the right to withhold knowledge, their vulnerabilities and shared foundations remain unclear. Specifically, both unlearnability and unlearning suffer from issues such as shallow dememorization, leading to falsely claimed data learnability reduction or forgetting in the presence of weight perturbations. Moreover, input perturbations may affect the effectiveness of downstream unlearning, while unlearning may inadvertently recover domain knowledge hidden by unlearnability. This interplay calls for deeper investigation. Finally, there is a lack of formal guarantees to provide theoretical insights into current defenses against shallow dememorization. In this Systematization of Knowledge, we present the first integrated analysis of model dememorization approaches leveraging unlearnability and unlearning. Our contributions are threefold: (i) a unified taxonomy of unlearnability and scalable unlearning methods; (ii) an empirical evaluation revealing the robustness, interplay, and shallow dememorization of leading methods; and (iii) the first theoretical guarantee on dememorization depth for models processed through certified unlearning. These results lay the foundation for unifying dememorization mechanisms across the ML lifecycle to achieve a deeper immemor state for sensitive knowledge.

[AI-122] Dual-Temporal LSTM with Hybrid Attention for Airline Passenger Load Factor Forecasting: Integrating Intra-Flight and Inter-Flight Booking Dynamics

【速读】:该论文试图解决当前航空公司短期需求预测中因仅使用单维时间序列数据(仅考虑航班内订票累积或仅考虑航班间历史订票模式)而导致的信息丢失问题,以及预测绝对乘客数量时因计划机型变更导致座位容量变化所带来的操作脆弱性。解决方案之关键是提出了一个集成注意力机制的双流长短期记忆网络 (Dual-stream Long Short-Term Memory, LSTM) 框架,该框架同时处理水平序列(记录单次航班起飞前数日内的订票累积过程)与垂直序列(记录不同航班在固定起飞前天数上跨航班的订票模式),并设计了多种架构变体(如结合自注意力、交叉注意力与混合注意力,以及拼接、残差和门控融合策略),从而有效利用两个时间维度的互补信息,实现对需求模式的精准捕捉,最终在孟加拉国Biman Bangladesh Airlines的真实数据上取得了显著优于单流模型和传统树模型的预测性能,并已投入实际运营。

链接: https://arxiv.org/abs/2605.11569
作者: ASM Nazrul Islam,Md. Hasanul Kabir,Md. Liakot Ali,Joydeb Kumar Sana
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate short-term demand forecasting is crucial to airline revenue management, yet most existing systems fail to meet this need because current models treat booking data as a single temporal dimension, either the accumulation of bookings for a specific flight or the historical booking profile of the same route. This unidimensional view discards information carried by the other temporal stream, and forecasting absolute passenger counts introduces a further operational fragility when a change in planned aircraft type alters total seat capacity. This study addresses both limitations. A dual-stream Long Short-Term Memory (LSTM) framework integrated with attention is proposed that simultaneously processes two complementary input sequences: a horizontal sequence capturing intra-flight booking accumulation over the days preceding departure, and a vertical sequence capturing inter-flight booking patterns at fixed days-before-departure offsets across historical flights. Multiple dual-stream architectural variants, combining self-attention, cross-attention, and hybrid attention with concatenation, residual, and gated fusion strategies, are developed and evaluated. Experiments on real-world reservation data from the national airline of Bangladesh, Biman Bangladesh Airlines (BBA), demonstrate that the proposed hybrid model achieves a Mean Absolute Error of 2.8167 and a coefficient of determination (R²) of 0.9495, outperforming single-stream baselines, tree-based models, and three prior dual-LSTM architectures applied to the same data. Validation across four flight category pairs (domestic versus international, direct versus transit, high versus low frequency, and short versus mid versus long haul) confirms that the model generalizes across operationally diverse route types. Biman Bangladesh Airlines (BBA) has officially integrated this methodology into its operations.
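
摘要列举的融合策略之一是 gated fusion,下面给出其一个极简示意:用 sigmoid 门控在水平流(航班内)与垂直流(航班间)的表示之间做逐维加权。门控参数 w、b 为假设示例,并非论文训练得到的权重。

```python
import math

# 示意性草图:摘要中 "gated fusion" 融合策略的极简实现。
# 门控参数 w、b 为假设示例,并非论文训练得到的权重。

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(h_horizontal, h_vertical, w, b):
    """fused_i = g_i * hh_i + (1 - g_i) * hv_i,
    其中 g_i = sigmoid(w_i * (hh_i - hv_i) + b_i)。"""
    fused = []
    for hh, hv, wi, bi in zip(h_horizontal, h_vertical, w, b):
        g = sigmoid(wi * (hh - hv) + bi)
        fused.append(g * hh + (1 - g) * hv)
    return fused

h_h = [0.8, 0.1]  # 水平流(航班内订票累积)表示
h_v = [0.2, 0.9]  # 垂直流(跨航班模式)表示
# w=0、b=0 时门控恒为 0.5,即两路简单平均
print(gated_fusion(h_h, h_v, [0.0, 0.0], [0.0, 0.0]))  # [0.5, 0.5]
```

门控让模型可以逐维决定更信任哪条时间流,这正是双流架构相对单流基线的核心收益来源之一。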

[AI-123] Hindsight Hint Distillation: Scaffolded Reasoning for SWE Agents from CoT-free Answers

【速读】:该论文旨在解决在缺乏显式思维链(Chain-of-Thought, CoT)标注数据的情况下,如何提升模型在复杂长期任务(complex long-horizon tasks)中的规划与推理能力。解决方案的关键是提出后见之明提示蒸馏(Hindsight Hint Distillation, HHD)方法,其核心思想是:仅利用易于获取的问题-答案对,通过让模型从自身失败的自我展开(self-rollouts)中合成后见之明提示(hindsight hints),并以这些提示为脚手架(scaffold)引导模型生成成功完成任务的在线展开(on-policy rollouts),最后通过自我蒸馏(self-distillation)使模型脱离提示泛化到新问题。该方法模仿人类教师利用学生错误提供针对性指导的过程,实现了从无CoT数据中合成专家级推理策略。

链接: https://arxiv.org/abs/2605.11556
作者: Shengjie Wang,Guanghe Li,Zonghan Yang,Yang Gao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 7 figures

点击查看摘要

Abstract:Solving complex long-horizon tasks requires strong planning and reasoning capabilities. Although datasets with explicit chain-of-thought (CoT) rationales can substantially benefit learning, they are costly to obtain. To address this challenge, we propose Hindsight Hint Distillation (HHD), which only requires easy-to-obtain question-answer pairs without CoT annotations. Inspired by how human teachers use student mistakes to provide targeted guidance, HHD synthesizes hindsight hints from the model’s own failed self-rollouts and uses them to scaffold on-policy rollouts that successfully complete the tasks. The model then self-distills these scaffolded trajectories and generalizes to new problems without hint guidance. Experiments show that HHD significantly outperforms iterative RFT and trajectory-synthesis baselines, achieving an absolute improvement of 8% on SWE-bench Verified, while all baselines improve by only around 2%. Notably, the reasoning strategies induced by HHD generalize effectively to out-of-distribution tasks, yielding the largest gains on SWE-bench Multilingual despite no training on multilingual data. These results demonstrate that HHD can effectively synthesize expert-like reasoning from CoT-free data and substantially improve long-horizon performance.

[AI-124] Sharpen Your Flow: Sharpness-Aware Sampling for Flow Matching

【速读】:该论文试图解决流匹配模型(flow matching models)在固定评估预算下快速生成样本时,如何有效分配有限数值积分步骤的问题——核心不仅在于积分方法的选择,更在于采样器应将其模型评估步骤集中在何处以最小化离散误差。解决方案之关键是SharpEuler,一种无训练采样器,它通过离线计算速度场沿校准轨迹的有限差分估计来生成一个求解器感知的锐度剖面(solver-aware sharpness profile),该剖面经平滑和分位数变换后转换为任意推理预算下的时间步网格。其核心原理包括:基于轨迹加速度作为欧拉离散误差主要来源的数值原理、推导出基于锐度的幂律时间步密度的变分原理,以及确保有限样本校准采样器在终端分布层面稳定性的统计保证。SharpEuler通过将步骤集中在速度场变化最剧烈的区域,在不增加模型评估次数的情况下显著提升样本质量,减少模态间泄漏并增加模态覆盖。

链接: https://arxiv.org/abs/2605.11547
作者: Aditi Gupta,Soon Hoe Lim,Annan Yu,N. Benjamin Erichson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Flow matching models generate samples by numerically integrating a learned velocity field, with each integration step requiring a neural network evaluation. Fast generation therefore requires using a small fixed evaluation budget effectively: the key question is not only how to integrate the flow, but where the sampler should spend its steps. We propose SharpEuler, a training-free sampler that profiles a pretrained model offline by estimating where the learned velocity field changes most rapidly along calibration trajectories. This finite-difference estimate defines a solver-aware sharpness profile, which is smoothed and converted by a quantile transform into a timestep grid for any desired inference budget. At test time, sampling remains ordinary Euler integration with the same number of model evaluations as a uniform schedule. We justify SharpEuler using three principles: a numerical principle identifying trajectory acceleration as the leading source of Euler discretization error, a variational principle deriving sharpness-based power-law timestep densities, and a statistical guarantee showing that the finite-sample calibrated sampler is stable at the terminal distribution level. Our experiments show that SharpEuler improves sample quality at fixed budgets, reducing inter-mode leakage and increasing mode coverage.
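
摘要中"把锐度剖面经分位数变换转成时间步网格"的做法可以用下面的草图说明:锐度高的区间获得更密的步点。此处省略了论文中的平滑步骤,剖面数值为假设示例。

```python
import bisect

# 示意性草图:把分段锐度剖面经分位数变换转成时间步网格,
# 锐度高处步长更密。省略了论文中的平滑步骤,剖面数值为假设示例。

def sharpness_to_grid(sharpness, n_steps):
    """sharpness[i] 是 [i/m, (i+1)/m) 子区间上的锐度;返回 n_steps+1 个节点。"""
    m, total = len(sharpness), float(sum(sharpness))
    cdf, acc = [0.0], 0.0
    for s in sharpness:
        acc += s / total
        cdf.append(acc)
    grid = []
    for k in range(n_steps + 1):
        q = k / n_steps
        i = min(max(bisect.bisect_left(cdf, q), 1), m)  # q 落在第 i 段
        lo, hi = cdf[i - 1], cdf[i]
        frac = 0.0 if hi == lo else (q - lo) / (hi - lo)
        grid.append((i - 1 + frac) / m)
    return grid

# 均匀锐度 -> 均匀网格;锐度集中在末段 -> 步点向 t=1 聚集
print(sharpness_to_grid([1, 1, 1, 1], 4))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(sharpness_to_grid([1, 1, 1, 9], 4))
```

测试时采样仍是普通欧拉积分、评估次数不变,只是把同样的预算按该网格分配到速度场变化最剧烈的区域。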

[AI-125] Optimal LTLf Synthesis

【速读】:该论文试图解决传统策略合成(strategy synthesis)中全有或全无(all-or-nothing paradigm)的缺陷,即当一个规范在不确定环境中无法被完全保证时,系统直接返回“不可实现”(unrealisable),而无法提供部分实现。论文引入最优LTLf合成(optimal LTLf synthesis),其目标是在给定包含多个目标(objectives)的规范中,当这些目标无法全部联合实现时,尽可能多地实现这些目标。解决方案的关键在于提出了三种变体:首先是最大保证合成(max-guarantee synthesis),它承诺一个可以事先保证实现的最大目标集;其次是最大观测合成(max-observation synthesis),它最大化后验(a posteriori)实现的目标,这些目标在不同的执行路径上可能不可比较;最后是增量最大观测合成(incremental max-observation synthesis),它利用执行过程中出现的更强保证机会进一步改进策略。实验结果表明,这些最优合成变体在扩展性上基本相当,能在给定超时内解决大部分基准实例,从而验证了该方法的实际可行性。

链接: https://arxiv.org/abs/2605.11544
作者: Yujian Cao,Sven Schewe,Qiyi Tang,Shufang Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Strategy synthesis typically follows an all-or-nothing paradigm, returning unrealisable whenever a specification cannot be guaranteed in an uncertain environment. In this paper, we introduce optimal LTLf synthesis, where the goal is to realise as many objectives as possible from a given specification consisting of multiple objectives, especially for the case that they are not all jointly realisable. We first consider max-guarantee synthesis, which commits to a maximal set of objectives that we can a priori guarantee to realise. We then introduce max-observation synthesis, which maximises a posteriori realised objectives that may be incomparable on different executions. Finally, we present incremental max-observation synthesis, which further improves strategies by exploiting opportunities for stronger guarantees when they arise during an execution. Experimental results show that different variations of optimal synthesis scale broadly equally well, solving a large fraction of the benchmark instances within the given timeout, demonstrating the practical feasibility of the approach.
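
max-guarantee 合成在概念上是找"可联合实现的极大目标集"。下面用一个玩具可实现性判定器代替真正的 LTLf 博弈求解,并做朴素的指数级子集搜索,仅用于说明问题形式,并非论文算法。

```python
from itertools import combinations

# 示意性草图:用玩具判定器 realisable 代替真正的 LTLf 博弈求解,
# 朴素枚举子集以说明 max-guarantee 的问题形式(非论文算法)。

def max_guarantee_sets(objectives, realisable):
    """自大到小枚举子集规模,返回所有最大规模的可实现子集。"""
    for k in range(len(objectives), 0, -1):
        hits = [frozenset(c) for c in combinations(sorted(objectives), k)
                if realisable(frozenset(c))]
        if hits:
            return hits
    return [frozenset()]

# 玩具判定器:目标 a 与 b 冲突,不能同时保证
toy = lambda s: not {"a", "b"} <= s
print(max_guarantee_sets({"a", "b", "c"}, toy))
# 两个极大集:{a, c} 与 {b, c}
```

注意极大集可能不唯一,这正是摘要区分"事先承诺的 max-guarantee"与"事后最大化的 max-observation"两种语义的原因之一。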

[AI-126] Read Grep and Synthesize: Diagnosing Cross-Domain Seed Exposure for LLM Research Ideation

【速读】:该论文试图解决的问题是:当前基于大型语言模型(LLM)的创意生成系统(ideation systems)是否真正受益于针对性的跨领域知识检索(cross-domain retrieval),还是仅仅因为接触到多样化的机制(diverse mechanisms)而获得提升?解决方案的关键在于构建了一个名为PaperGym的三阶段流水线:(1)通过工具增强的种子提取(tool-augmented seed extraction),利用读取、grep和bash命令在隔离的论文环境中提取高特异性种子;(2)跨领域种子检索(cross-domain seed retrieval),通过释义(paraphrasing)跨越七个机器学习领域来拓宽领域覆盖;(3)从检索到的种子中进行方法合成(method synthesis),并使用基于评分标准(rubric)的评判者打分。实验对比了跨领域检索、无检索和同领域基线,发现跨领域检索在成对新奇性(pairwise novelty)上胜出,但与随机多样化种子控制组(random diverse-seed control)无显著差异,从而揭示出LLM创意系统主要受益于种子多样性(seed diversity),而非可靠地利用种子被检索到的语义原因(semantic reason),这一结论是通过精心设计的对比实验和评估框架得出的。

链接: https://arxiv.org/abs/2605.11532
作者: Yunju Choi,Min Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures, 7 tables

点击查看摘要

Abstract:The discovery of novel methodologies for emerging problems is a continuing cycle in ML, often driven by the migration of techniques across domains. Building on this observation, we ask whether current LLM ideation systems benefit from targeted cross-domain retrieval or simply from exposure to diverse mechanisms. We study this question through PaperGym, a three-stage pipeline: (1) tool-augmented seed extraction via read, grep, and bash over an isolated paper environment, (2) cross-domain seed retrieval via paraphrasing across seven ML domains, and (3) method synthesis from retrieved seeds, each scored by rubric-based judges. Tool-augmented extraction improves specificity, and paraphrase-based retrieval broadens domain coverage. In synthesis, cross-domain retrieval receives more pairwise novelty wins than no-retrieval and same-domain baselines, but shows no significant difference from a random diverse-seed control. These findings suggest LLM ideation systems benefit from diverse seed exposure, but do not yet reliably exploit the semantic reason particular seeds were retrieved. We release the seed library, rubric prompts, and run scripts at this https URL

[AI-127] Selective Off-Policy Reference Tuning with Plan Guidance

【速读】:该论文旨在解决基于可验证奖励的强化学习方法(如GRPO)在困难提示(hard prompts)上因所有采样的rollout均失败而导致训练停滞的问题。解决方案的关键在于引入了SORT(Selective Off-Policy Reference Tuning)的修复更新(repair update)机制:它从参考解(reference solution)中推导出一个计划(plan),通过对比有无该计划条件下token概率的差异,对那些在计划条件下变得更可预测的token赋予更高权重,从而将全失败提示转化为选择性、结构感知的学习信号,而非统一的模仿信号。该方法在不改变rollout生成过程的前提下,仅通过修复更新来利用失败样本,显著提升了模型在多个推理基准上的性能,尤其在弱模型上效果最为明显。

链接: https://arxiv.org/abs/2605.11505
作者: Duc Anh Le,Tien-Phat Nguyen,Thien Huu Nguyen,Linh Ngo Van,Trung Le
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with largest gains on weaker models.
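
摘要描述"比较有/无计划条件下的 token 概率,对在计划条件下变得更可预测的 token 赋更高权重",可以用下面的草图说明。具体的加权形式(截断对数概率差后归一化)是本文示例的假设,并非论文公式。

```python
# 示意性草图:对照摘要描述,比较有/无计划条件下的 token 对数概率,
# 对在计划条件下变得更可预测的 token 赋更高权重。加权形式为假设示例。

def plan_guided_weights(logp_plain, logp_planned):
    """delta_i = max(0, logp_planned_i - logp_plain_i),再归一化为权重。"""
    deltas = [max(0.0, lp - l) for l, lp in zip(logp_plain, logp_planned)]
    z = sum(deltas)
    n = len(deltas)
    if z == 0.0:                      # 计划没有让任何 token 更可预测
        return [1.0 / n] * n
    return [d / z for d in deltas]

# token 1 在计划条件下概率大幅上升,token 0 几乎不变,token 2 反而下降
w = plan_guided_weights([-2.0, -5.0, -1.0], [-1.9, -2.0, -1.5])
print(w)  # 权重集中在 token 1
```

这种逐 token 的选择性正是 SORT 区别于"统一模仿参考解"的地方:只强化确实被计划解释的步骤。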

[AI-128] Decaf: Improving Neural Decompilation with Automatic Feedback and Search

【速读】:该论文试图解决从编译后二进制文件中重构源代码时,神经反编译器因幻觉(hallucination)而产生不正确编程结构和语义的问题,具体表现为难以在保持与原始源码相似性的同时,有效推断惯用语法和标识符名称。解决方案的关键在于:摒弃通过更多数据和训练来改进神经反编译器的传统思路,转而利用编译器反馈(compiler feedback)通过搜索(search)方式,显著提升神经反编译器输出的语义正确性。具体系统Decaf(DECompilation with Automated Feedback)正是基于这一自动化反馈方法,将神经反编译的成功率从ExeBench上的26.0%提升至Real-O2分割上的83.9%,且不牺牲与原始源码的相似度,同时对较弱模型同样有效。

链接: https://arxiv.org/abs/2605.11501
作者: Alexander Shypula,Osbert Bastani,Edward Schwartz
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 15 pages, 6 figures. Preprint; under review. Code and models available at this https URL

点击查看摘要

Abstract:Decompilers are useful tools used in reverse engineering to understand compiled source code. Reconstructing source code from compiled binaries is a challenging task, because high-level syntax, identifiers, and custom data types are generally lost as the compiler translates human-readable code to low-level machine code. Deterministic decompilers are useful tools for binary analysis, but can struggle to infer idiomatic syntax and identifier names. Generative AI models are a natural fit for reconstructing high-level syntax, identifiers, and types, but they can still suffer by hallucinating improper programming constructs and semantics. Instead of attempting to improve neural decompilers with more data and more training, we argue that compiler feedback can be used to dramatically improve the semantic correctness of neural decompiler outputs via search. Our system, Decaf (DECompilation with Automated Feedback), raises the neural decompilation rate from 26.0% on ExeBench to 83.9% on the Real-O2 split without sacrificing similarity to the original source code. We also find our automatic feedback methodology is highly effective for improving weaker neural decompilation models.
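
"用编译器反馈做搜索"的通用循环骨架如下。generate 抽象掉了神经反编译模型(此处用玩具候选列表模拟),compile_check 抽象掉了真正的编译器调用;两者的接口均为本文示例的假设,并非 Decaf 的实际接口。

```python
# 示意性草图:编译器反馈驱动的搜索循环骨架。generate 与 compile_check
# 的接口为假设示例,分别代替神经反编译模型与真正的编译器调用。

def feedback_search(generate, compile_check, max_iters=5):
    """反复生成候选反编译结果,把编译错误作为反馈交回生成器。"""
    feedback, candidate = None, None
    for _ in range(max_iters):
        candidate = generate(feedback)
        ok, errors = compile_check(candidate)
        if ok:
            return candidate, True
        feedback = errors
    return candidate, False

# 玩具模拟:第一版缺少参数类型声明,收到反馈后第二版可编译
drafts = iter(["int f(x) { return x+1; }",
               "int f(int x) { return x + 1; }"])
gen = lambda fb: next(drafts)
chk = lambda src: (("int x" in src), "error: unknown type of 'x'")
result, ok = feedback_search(gen, chk)
print(ok, result)
```

关键点在于:正确性信号来自编译器这个自动 oracle,因而不需要更多训练数据就能在推理时过滤并修正幻觉输出。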

[AI-129] Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

【速读】:该论文试图解决基于可验证奖励的强化学习(RLVR)在大语言模型推理能力训练中出现的熵崩溃(entropy collapse)问题,该问题导致模型过早确定性、优化不稳定,而现有方法如熵正则化和基于比率的裁剪启发式技术要么控制熵的方式粗粒度,要么依赖近似的在策略(on-policy)训练。解决方案的关键在于提出了一种名为“在策略熵流优化”(On-Policy Entropy Flow Optimization, OPEFO)的自适应熵流平衡机制,该机制从token级熵流视角出发,通过根据每个更新对熵变化的实际贡献重新缩放熵增加和熵减少的更新,同时严格保持真正的在策略训练,从而纠正熵减少token持续多于熵增加token所导致的严重不平衡熵流,实现熵动态的均衡,提升训练稳定性和最终性能。

链接: https://arxiv.org/abs/2605.11491
作者: Huimin Xu,Shuai Zhao,Xiaobao Wu,Anh Tuan Luu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, often suffer from entropy collapse, leading to premature determinism and unstable optimization. Existing remedies, including entropy regularization and ratio-based clipping heuristics, either control entropy in a coarse-grained manner or rely on approximate on-policy training. In this paper, we revisit entropy collapse from a token-level entropy flow perspective. Our analysis reveals that entropy-decreasing tokens consistently outweigh entropy-increasing ones, resulting in a severely imbalanced entropy flow. This perspective provides a unified explanation of entropy collapse in existing RLVR algorithms and highlights the importance of balancing entropy dynamics. Motivated by this analysis, we propose On-Policy Entropy Flow Optimization (OPEFO), an adaptive entropy flow balancing mechanism that rescales entropy-increasing and entropy-decreasing updates according to their contributions to entropy change, while remaining strict on-policy. Experiments on six mathematical reasoning benchmarks demonstrate that OPEFO improves training stability and final performance. We will release the code and models upon publication.
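
"按 token 对熵变化的贡献重标定更新,使熵增与熵减两侧流量平衡"的思想可以用下面的草图说明。具体的"等化到两侧均值"重标定规则是本文示例的假设,并非 OPEFO 的精确公式。

```python
# 示意性草图:按 token 对熵变化的贡献重标定更新幅度,使熵增与熵减
# 两侧总流量相互平衡。"等化到两侧均值"的规则为假设示例,非论文公式。

def entropy_flow_scales(deltas):
    """deltas[i] 为第 i 个 token 对策略熵变化的贡献(正=增熵,负=降熵)。
    返回逐 token 乘子,使缩放后正负两侧总幅度相等。"""
    pos = sum(d for d in deltas if d > 0)
    neg = -sum(d for d in deltas if d < 0)
    if pos == 0.0 or neg == 0.0:
        return [1.0] * len(deltas)
    target = 0.5 * (pos + neg)
    return [target / pos if d > 0 else (target / neg if d < 0 else 1.0)
            for d in deltas]

# 熵减 token 的总流量 (6) 远大于熵增 (2):缩放后两侧都变为 4,净流量归零
deltas = [2.0, -4.0, -2.0]
scales = entropy_flow_scales(deltas)
print([d * s for d, s in zip(deltas, scales)])
```

净熵流归零正对应摘要的诊断:熵减 token 持续压过熵增 token 才是熵崩溃的根源,平衡两侧即可在保持严格 on-policy 的前提下抑制它。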

[AI-130] Engagement Process: Rethinking the Temporal Interface of Action and Observation

【速读】:该论文旨在解决传统基于固定步骤的交互界面(step-based interface)在处理数字和物理环境中复杂时间交互时的局限性,具体问题包括:动作与观察被配对到固定决策步骤,无法自然表达不同时间尺度上的延迟反馈、持续动作、 deliberation latency(思考延迟)以及多速率协调等时间行为。解决方案的关键在于提出了“Engagement Process”(EP)这一交互形式化模型,它继承了POMDP的决策理论结构,但将时间显式化到动作-观察接口中:EP将动作和观察表示为沿时间线的解耦事件流(decoupled event streams),而非固定步骤内的配对更新。这一设计使得时间信息成为决策过程的有机组成部分,能够显式建模单智能体中的时间问题,并支持子系统间的组合交互与多速率协调。实验证明,EP能揭示被传统步骤式接口隐藏的时间行为,并允许策略在明确的时间成本下进行适应。

链接: https://arxiv.org/abs/2605.11484
作者: Jialian Li,Yuchen Cao,Junhong Liu,Weiran Guo,Xutao Wang,Jiaming Song,Jiahao Zhang,Jie Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unfold over different time scales rather than aligning with fixed observation–action steps. To model such interactions, we propose Engagement Process (EP), an interaction formalism that inherits the decision-theoretic structure of POMDPs while making time explicit in the action–observation interface. EP represents actions and observations as decoupled event streams along time, rather than updates paired at fixed decision steps. This interface captures single-agent timing issues such as deliberation latency, delayed feedback, and persistent actions, while supporting richer agent-side organization, multi-rate coordination, and compositional interaction among subsystems. Across toy, LLM-agent, and learning experiments, EP exposes temporal behaviors hidden by step-based interfaces and enables policies to adapt under explicit time costs.
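
"动作与观测作为沿时间线解耦的事件流"可以用下面的玩具时间线说明。思考延迟与反馈延迟的具体建模方式是本文示例的假设,并非 EP 的完整形式化。

```python
# 示意性草图:把动作与观测表示为沿时间线解耦的事件流,而非固定步内的
# 配对更新。思考延迟与反馈延迟的建模方式为本文示例的假设。

def timeline(decisions, think_latency, feedback_delay):
    """decisions: [(决策时刻, 动作)]。动作在 决策时刻+think_latency 生效,
    其观测在生效后再过 feedback_delay 到达;返回按时间排序的事件流。"""
    events = []
    for t, act in decisions:
        t_act = t + think_latency
        events.append((t_act, "action", act))
        events.append((t_act + feedback_delay, "obs", f"result({act})"))
    return sorted(events)

# 两个决策间隔 1.0s,但反馈延迟 1.5s:第二个动作发出时第一个观测还没回来,
# 这种交错在固定"观测-动作"步接口下无法直接表达
for ev in timeline([(0.0, "a1"), (1.0, "a2")], 0.2, 1.5):
    print(ev)
```

这一交错(第二个动作先于第一个观测)正是摘要所说"被步式接口隐藏的时间行为"的最小例子。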

[AI-131] Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

【速读】:该论文试图解决机器人操纵策略评估中因稀疏奖励、非单调任务进展以及有限长度 rollout 所引入的截断偏差(truncation bias)问题。传统基于贝尔曼方程或最优性原则的无限视界假设在此场景下不再成立。解决方案的关键在于提出一种基于活性(liveness)的贝尔曼算子(liveness-based Bellman operator),将策略评估重新定义为任务完成问题,通过学习一个对有限视界截断具有鲁棒性的保守固定点价值函数,该算子既能编码任务进展,又能有效缓解截断偏差,理论分析证明了其收缩保证。

链接: https://arxiv.org/abs/2605.11479
作者: Hao Wang,Joshua Bowden,Colton Crosby,Somil Bansal
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Published at RSS 2026

点击查看摘要

Abstract:Policy evaluation is a fundamental component of the development and deployment pipeline for robotic policies. In modern manipulation systems, this problem is particularly challenging: rewards are often sparse, task progression of evaluation rollouts are often non-monotonic as the policies exhibit recovery behaviors, and evaluation rollouts are necessarily of finite length. This finite length introduces truncation bias, breaking the infinite-horizon assumptions underlying standard methods relying on Bellman equations/principle of optimality. In this work, we propose a framework for offline policy evaluation from sparse rewards based on a liveness-based Bellman operator. Our formulation interprets policy evaluation as a task-completion problem and yields a conservative fixed-point value function that is robust to finite-horizon truncation. We analyze the theoretical properties of the proposed operator, including contraction guarantees, and show how it encodes task progression while mitigating truncation bias. We evaluate our method on two simulated manipulation tasks using both a Vision-Language-Action model and a diffusion policy, and a cloth folding task using human demonstrations. Empirical results demonstrate that our approach more accurately reflects task progress and substantially reduces truncation bias, outperforming classical baselines such as TD(0) and Monte Carlo policy evaluation.
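
对"折扣活性"回溯的一种可能解读是沿 rollout 反向计算 v_t = max(success_t, gamma * v_{t+1}),并把截断之后的价值保守地记为 0。这只是示例性写法,论文算子的精确定义以原文为准。

```python
# 示意性草图:对"折扣活性"回溯的一种可能解读,
# v_t = max(success_t, gamma * v_{t+1}),截断后价值保守记 0(非论文定义)。

def liveness_values(success_flags, gamma=0.5):
    """success_flags[t] 为稀疏完成信号(0/1);返回各时刻的保守价值。"""
    v, out = 0.0, [0.0] * len(success_flags)
    for t in range(len(success_flags) - 1, -1, -1):
        v = max(float(success_flags[t]), gamma * v)
        out[t] = v
    return out

# 任务在第 3 步完成:价值随接近完成而单调上升;
# 即便中途出现恢复行为(非单调进展),max 形式仍给出保守下界
print(liveness_values([0, 0, 1]))   # [0.25, 0.5, 1.0]
print(liveness_values([0, 0, 0]))   # 截断且未完成 -> 全 0
```

与 TD(0) 不同,这种形式不会把"截断但尚未完成"的轨迹误当作零价值的真实终止来自举,从而缓解摘要所说的截断偏差。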

[AI-132] FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

【速读】:该论文试图解决长上下文推理(long-context inference)中键值缓存(KV cache)因内存带宽瓶颈导致的效率问题。关键解决方案是提出一种名为 FibQuant 的通用固定速率向量量化器(fixed-rate vector quantizer),它在保持原有的“归一化-旋转-存储”(normalize–rotate–store)接口下,通过一个匹配球面Beta分布(spherical-Beta source)的共享径向-角度码本(radial–angular codebook)替代标量码表。该码本的核心技术包括:使用Beta分位数半径(Beta-quantile radii)捕获径向分布、采用Fibonacci/Roberts–Kronecker准均匀方向(quasi-uniform directions)实现角度覆盖,以及结合多重启Lloyd–Max细化(multi-restart Lloyd–Max refinement)优化码本。理论证明该向量编码在高速率下通过细胞形状因子(cell-shaping factor)和密度匹配因子(density-matching factor)优于标量乘积特例,同时无需校准即可支持分数比特(fractional-bit)和低于1比特的操作点。实验表明,在GPT-2小规模KV缓存上,FibQuant在0.99注意力余弦相似度下实现5倍压缩,0.95时达34倍;在TinyLlama-1.1B上,4倍压缩时困惑度仅比fp16低0.10,且在标量随机访问量化失效的8倍压缩率下,其困惑度较标量TurboQuant降低3.6倍。

链接: https://arxiv.org/abs/2605.11478
作者: Namyoon Lee,Yongjune Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
备注: 15 pages

点击查看摘要

Abstract:Long-context inference is increasingly a memory-traffic problem. The culprit is the key–value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar codecs meet this systems constraint by storing a norm, applying a shared random rotation, and quantizing one coordinate at a time. They are universal and random-access, but they discard the geometry created by the normalization step. After a Haar rotation, a block of k consecutive coordinates is not a product source; it is a spherical-Beta source on the unit ball. We introduce FibQuant, a universal fixed-rate vector quantizer that keeps the same normalize–rotate–store interface while replacing scalar tables by a shared radial–angular codebook matched to this canonical source. The codebook combines Beta-quantile radii, Fibonacci/Roberts–Kronecker quasi-uniform directions, and multi-restart Lloyd–Max refinement. We prove that the resulting vector code strictly improves on its scalar product specialization at matched rate, with a high-rate gain that separates into a cell-shaping factor and a density-matching factor. The same construction gives a dense rate axis, including fractional-bit and sub-one-bit operating points, without calibration or variable-length addresses. On GPT-2 small KV caches, FibQuant traces a memory–fidelity frontier from 5× compression at 0.99 attention cosine similarity to 34× at 0.95. End-to-end on TinyLlama-1.1B, it is within 0.10 perplexity of fp16 at 4× compression and has 3.6× lower perplexity than scalar TurboQuant at b = 2 (8× compression), where scalar random-access quantization begins to fail.
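
FibQuant 径向-角度码本中"角度"部分的经典构造是 Fibonacci 球面格点:在单位球面上给出准均匀方向。下面的草图用它做一个玩具量化器;半径码本这里用均匀占位值,论文实际使用 Beta 分位数半径并经 Lloyd–Max 细化。

```python
import math

# 示意性草图:Fibonacci 球面格点给出 S^2 上的准均匀单位方向。
# 半径码本此处为均匀占位值;论文用 Beta 分位数半径 + Lloyd–Max 细化。

def fibonacci_sphere(n):
    """在 S^2 上生成 n 个准均匀单位方向。"""
    golden = (1.0 + 5.0 ** 0.5) / 2.0
    pts = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n           # 均匀切分 z 轴
        theta = 2.0 * math.pi * i / golden       # 黄金角旋转
        r = math.sqrt(max(0.0, 1.0 - z * z))
        pts.append((r * math.cos(theta), r * math.sin(theta), z))
    return pts

def quantize(v, directions, radii):
    """返回 (方向索引, 半径索引):最近方向按内积最大,半径按绝对误差最小。"""
    norm = math.sqrt(sum(x * x for x in v))
    unit = tuple(x / norm for x in v) if norm > 0 else directions[0]
    d = max(range(len(directions)),
            key=lambda j: sum(a * b for a, b in zip(unit, directions[j])))
    r = min(range(len(radii)), key=lambda j: abs(radii[j] - norm))
    return d, r

dirs, radii = fibonacci_sphere(64), [0.5, 1.0, 1.5, 2.0]
d, r = quantize((0.0, 0.0, 2.0), dirs, radii)
print(d, r, dirs[d])  # 方向命中 z 最大的格点,半径命中 2.0
```

码本是共享、固定速率、可随机访问的:每个 KV 块只需存 (方向索引, 半径索引) 两个整数,这正是摘要强调的 normalize–rotate–store 接口不变之处。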

[AI-133] OPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

【速读】:该论文试图解决的是多任务强化学习(MTRL)中在线策略方法(如PPO)性能落后于离策略方法(如SAC)的问题,具体诊断出PPO在MTRL中因评论家端梯度不良条件(critic-side gradient ill-conditioning)导致简单任务主导价值函数更新而尾部任务(tail tasks)学习停滞。其解决方案关键在于提出TOPPO(Tail-Optimized PPO),通过引入一组名为评论家平衡(Critic Balancing)的模块来改善梯度条件并平衡跨任务的学习动态,从而直接针对PPO自身的优化瓶颈进行重构,而非依赖模块化架构或大型模型,最终使在线策略方法在样本效率和性能上能够匹敌甚至超越离策略基线。

链接: https://arxiv.org/abs/2605.11473
作者: Yuanpeng Li,Gefei Lin,Annie Qu,Rui Miao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose that PPO in MTRL suffers from a previously overlooked issue: critic-side gradient ill-conditioning, which may cause tail tasks to stall while easy tasks dominate the value function’s updates. To address this, we propose TOPPO (Tail-Optimized PPO), a reformulation of PPO via Critic Balancing – a set of modules that improve gradient conditioning and balance learning dynamics across tasks. Unlike prior approaches that rely on modular architectures or large models, TOPPO targets the optimization bottleneck within PPO itself. Empirically, TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines while using substantially fewer parameters and environment steps on Meta-World+ benchmark. Notably, TOPPO matches or surpasses strong SAC baselines early in training and maintains superior performance at full budget. Ablations confirm the effectiveness of each module in TOPPO and provide insights into their interactions. Our results demonstrate that, with proper optimization, on-policy methods can rival or exceed off-policy approaches in MTRL, challenging the prevailing reliance on SAC and highlighting critic-side gradient conditioning as the central bottleneck.
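
摘要指出 PPO 在 MTRL 中的瓶颈是"简单任务主导价值函数更新"。下面用各任务 critic 损失的指数滑动平均做逆尺度归一化,这是此类梯度条件化思想的一个常见做法,并非论文 Critic Balancing 模块的原始公式。

```python
# 示意性草图:按各任务 critic 损失的 EMA 做逆尺度归一化,防止大量级
# 任务主导价值函数更新。规则为常见做法示例,非论文 Critic Balancing 公式。

def balance_critic_losses(task_losses, running, beta=0.9, eps=1e-8):
    """task_losses: {任务: 当前 critic 损失}; running: 各任务损失的 EMA。
    返回 (归一化损失, 更新后的 EMA)。"""
    new_running, weighted = {}, {}
    for task, loss in task_losses.items():
        ema = beta * running.get(task, loss) + (1.0 - beta) * loss
        new_running[task] = ema
        weighted[task] = loss / max(ema, eps)
    return weighted, new_running

# 量级相差 100 倍的两个任务,归一化后对总损失的贡献相当
w, ema = balance_critic_losses({"easy": 100.0, "tail": 1.0}, {})
print(w)  # {'easy': 1.0, 'tail': 1.0}
```

归一化后尾部任务的梯度不再被简单任务淹没,对应摘要所说改善 critic 端梯度条件数、提升尾部任务表现的方向。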

[AI-134] CAMPA: Efficient and Aligned Multimodal Graph Learning via Decoupled Propagation and Aggregation

【速读】:该论文旨在解决多模态图神经网络(Multimodal Graph Neural Networks, MGNNs)中现有方法的两个核心问题:一是紧密耦合架构(tightly coupled architectures)计算开销过大,难以应用于大规模图学习;二是解耦架构(decoupled architectures)虽然高效可扩展,但存在模态冲突(modal conflict)这一关键瓶颈,具体表现为独立多跳扩散导致的跨模态语义分歧(cross-modal semantic divergence)以及朴素融合无法对齐多跳特征轨迹。解决方案的关键在于提出了CAMPA(Cross-modal Aligned Multimodal Propagation & Aggregation)框架,其核心创新是两阶段对齐机制:第一阶段为跨模态对齐传播,通过向消息传递中注入跨模态相似性先验来保持语义一致性,且不引入额外参数开销;第二阶段为轨迹对齐聚合,利用轨迹级自注意力(trajectory-level self-attention)和交叉注意力(cross-attention)捕获并对齐跨模态和跨跳的长程依赖特征。该方案在保持解耦范式效率优势的同时,显著提升了多模态图表示学习的性能。

链接: https://arxiv.org/abs/2605.11468
作者: Daohan Su,Hao Liu,Xunkai Li,Yinlin Zhu,Xiong Yongfu,Yi Liu,Hongchao Qin,Rong-Hua Li,Guoren Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Graph Neural Networks (MGNNs) have shown strong potential for learning from multimodal attributed graphs, yet most existing approaches rely on tightly coupled architectures that suffer from prohibitive computational overhead. In this paper, we present a systematic empirical analysis showing that decoupled MGNNs are substantially more efficient and scalable for large-scale graph learning. However, we identify a critical bottleneck in existing decoupled pipelines, namely modal conflict, which arises in both the propagation and aggregation stages. Specifically, independent multi-hop diffusion causes cross-modal semantic divergence during propagation, while naive fusion fails to align multi-hop feature trajectories during aggregation, jointly limiting effective representation learning. To address this challenge, we propose CAMPA, a Cross-modal Aligned Multimodal Propagation & Aggregation framework for decoupled multimodal graph learning. Concretely, CAMPA introduces a two-stage alignment mechanism: (1) cross-modal aligned propagation, which injects cross-modal similarity priors into message passing to preserve semantic consistency without additional parameter overhead; (2) trajectory aligned aggregation, which leverages trajectory-level self-attention and cross-attention to capture and align long-range dependencies across modalities and hops. Extensive experiments on diverse benchmark datasets and tasks demonstrate that CAMPA consistently outperforms strong coupled and decoupled baselines while preserving the efficiency advantages of the decoupled paradigm.

[AI-135] Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

【速读】:该论文试图解决推理模型中的推理剧场(reasoning theater)问题——模型在内部已经确定最终答案后,会生成大量看似深思熟虑但实际上对正确性毫无贡献的推理步骤,既浪费推理令牌、污染可解释性,又掩盖了实际计算过程。解决方案的关键是ProFIL(Probe-Filtered Reinforcement Learning),即在组相对策略优化(GRPO, Group Relative Policy Optimization)中引入一个多头注意力探针(multi-head attention probe)。该探针仅在冻结的基础模型上训练一次,利用验证器派生标签(verifier-derived labels)检测内部激活中的“提交后步骤”(post-commitment steps),无需人工标注;在GRPO训练期间,对于探针得分超过阈值的推理轨迹(rollouts),其优势值(advantage)被归零,从而直接压制剧场行为。其核心发现是:基于冻结基础模型训练的探针能够提供稳定的抑制信号,并且能够抵抗先前文献预测的强化学习混淆失败模式(RL-obfuscation failure mode),从而在保持或提升任务准确率的前提下,减少11–100%的剧场步骤、提升忠实推理比例(如LiveCodeBench上提升24pp),并将推理链长度缩短4–19%。

链接: https://arxiv.org/abs/2605.11467
作者: Swapnil Parekh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning models post-hoc rationalize answers they have already committed to internally, producing chains of reasoning theater: deliberative-looking steps that contribute nothing to correctness. This wastes inference tokens, pollutes interpretability, and obscures what the model actually computed. We introduce ProFIL (Probe-Filtered Reinforcement Learning) to reduce theater, increase chain-of-thought faithfulness, and shrink chain length in a single, drop-in extension to Group Relative Policy Optimization (GRPO). A multi-head attention probe is trained once on the frozen base model to detect post-commitment steps from internal activations alone; during GRPO, rollouts whose probe score exceeds a threshold have their advantage zeroed. Our central finding is that a probe trained on a frozen base, with verifier-derived labels and no human annotation, provides a stable signal that suppresses theater while resisting the RL-obfuscation failure mode predicted by prior work. Across four reasoning domains (GSM8K, LiveCodeBench, ToolUse, MMLU-Redux) and two model architectures (Llama-8B, Qwen-7B), ProFIL reduces post-commitment theater by 11–100%, raises faithful-fraction (e.g., +24pp on LiveCodeBench under an independent Claude 3.7 Sonnet judge), and shortens chains by 4–19%, all while preserving or improving task accuracy. ProFIL also beats a matched length-penalty GRPO baseline, isolating the gain as semantic commitment-detection rather than chain compression. Probe weights, training configurations, and rollouts are released across all four domains.
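摘要中"探针得分超过阈值的 rollout 优势置零"这一对 GRPO 的即插即用修改,可用几行 NumPy 示意(仅为按摘要理解的草图,并非论文实现;变量名与阈值均为假设):

```python
import numpy as np

def profil_advantages(rewards, probe_scores, tau=0.5):
    """rewards: (k,) 组内各 rollout 的验证器奖励;probe_scores: (k,) 探针的剧场得分。"""
    # 标准 GRPO 组相对优势:奖励减组均值,再除以组标准差
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # ProFIL 过滤:探针得分超阈值的 rollout 优势直接置零,抑制"推理剧场"
    return np.where(probe_scores > tau, 0.0, adv)

rewards = np.array([1.0, 0.0, 1.0, 0.0])
probe = np.array([0.9, 0.1, 0.2, 0.3])   # 第 0 条 rollout 被探针判为剧场
adv = profil_advantages(rewards, probe, tau=0.5)
```

注意置零发生在组内归一化之后,因此被过滤的 rollout 既不被奖励也不被惩罚,只是不再贡献梯度。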

[AI-136] Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

【速读】:该论文试图解决强化学习与验证器结合(RLVR)范式下群体优化算法(如GRPO)的探索崩溃(exploration collapse)问题,即模型过早收敛于少数高分推理模式,丧失探索新解的能力,而现有熵正则化或多样性奖励方法未能改变赢家通吃(winner-takes-all)的本质。解决方案的关键在于提出组协同策略优化(Group Cooperative Policy Optimization, GCPO),将训练范式从个体rollout竞争转变为团队合作:通过团队级信用分配(team-level credit assignment),每个rollout的奖励取决于其对团队有效解覆盖(valid solution coverage)的边际贡献,而非个体准确率;该覆盖度被定义为奖励加权语义嵌入(reward-weighted semantic embeddings)的行列式体积(determinant volume),仅正确且非冗余的rollout贡献于此体积。在优势估计中,GCPO根据每个rollout对团队的平均边际贡献(average marginal contribution)重新分配集体团队奖励,从而引导优化朝向非冗余的正确推理路径,同时提升推理准确性和解多样性。

链接: https://arxiv.org/abs/2605.11461
作者: Haoxuan Chen,Tianming Liang,Wei-Shi Zheng,Jian-Fang Hu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high-scoring patterns, lacking the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or diversity bonus. However, these approaches do not change the winner-takes-all nature, where rollouts still compete for individual advantage rather than cooperating for maximizing global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team’s valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at this https URL.
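摘要中"奖励加权语义嵌入上的行列式体积"与"边际贡献"可用如下草图示意(非论文官方实现;为数值稳定加入的微小抖动项、Gram 矩阵形式等均为本文假设):

```python
import numpy as np

def team_volume(E, r):
    """E: (k, d) rollout 语义嵌入;r: (k,) 奖励(0/1 正确性)。返回对数体积。"""
    M = E * r[:, None]                       # 奖励加权:错误 rollout 行被置零
    G = M @ M.T + 1e-6 * np.eye(len(r))      # Gram 矩阵 + 微小抖动保证可逆
    _, logdet = np.linalg.slogdet(G)         # 冗余(线性相关)的行压缩体积
    return logdet

def marginal_contributions(E, r):
    """每个 rollout 的边际贡献 = 全队体积 - 去掉它之后的体积。"""
    full = team_volume(E, r)
    return np.array([full - team_volume(np.delete(E, i, 0), np.delete(r, i))
                     for i in range(len(r))])

# 玩具示例:前两条 rollout 语义冗余,第三条方向独立
E = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
r = np.array([1.0, 1.0, 1.0])
mc = marginal_contributions(E, r)
```

可以验证:语义独立的第三条 rollout 的边际贡献高于两条互相冗余的 rollout,这正是摘要所说"引导优化朝向非冗余正确路径"的机制。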

[AI-137] Deep Minds and Shallow Probes

【速读】:该论文试图解决神经网络表示因重新参数化(reparameterization)导致的坐标非唯一性问题:即不同模型即使实现相同下游计算,其隐藏表示可能通过仿射变换相关,导致传统探针(probing)在揭示稳定结构时受限于特定基,且跨模型探针转移(cross-model probe transfer)缺乏统一理论框架。解决方案的关键在于:在最终读出层(readout layer)的精确设定下,利用表示对称性(symmetry)的群作用(group action),推导出对称性原理(symmetry principle),该原理唯一地确定了浅层坐标稳定探针(coordinate-stable probes)的层次结构,其中线性探针(linear probes)是度-1成员,而度-2探针可在高阶交互中提供额外能力;同时,引入“探针可见商”(probe-visible quotient)概念——即表示模去探针族不可见方向的商空间,作为跨模型转移的自然对象,而非完整隐藏状态。这一框架同时实现了探针的基不变性与跨模型可移植性,并通过实验验证了其在覆盖感知监控器(coverage-aware monitor)转移中的应用。

链接: https://arxiv.org/abs/2605.11448
作者: Su Hyeong Lee,Risi Kondor
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural representations are not unique objects. Even when two systems realize the same downstream computation, their hidden coordinates may differ by reparameterization. A probe family intended to reveal structure already present in a representation should therefore be stable under the relevant representation symmetries rather than be tied to a particular basis. We study this group action in the tractable exact setting of the final readout layer, where equivalent realizations induce affine changes of hidden coordinates. The resulting symmetry principle singles out a unique hierarchy of shallow coordinate-stable probes, with linear probes as its degree-1 member. We also show that a natural object for cross-model probe transfer is a shared probe-visible quotient–the representation modulo directions invisible to the probe family–rather than the full hidden state. Experiments on synthetic and real-world tasks support both predictions, showing where degree-2 probes help beyond linear ones and how quotient-based transfer enables coverage-aware monitor portability across model families. These results point toward a broader geometric representation theory of neural probing, with coverage-aware monitor transfer as a concrete operational consequence.

[AI-138] A Mechanistic Investigation of Supervised Fine Tuning

【速读】:该论文旨在解决一个看似矛盾的现象:尽管监督微调(SFT)前后大型语言模型的隐藏激活之间余弦相似度很高,表明其激活几何结构几乎没有变化,但通过稀疏自编码器(SAE)分析发现底层稀疏隐变量(sparse latents)实际上发生了显著发散。论文试图从机制上理解这种表征发散(representational divergence)的具体原因及其分布特征。解决此问题的关键方案在于利用预训练于基础模型上的稀疏自编码器作为高分辨率诊断工具,构建一个新颖的研究流程(investigative pipeline),通过该流程能够系统性地识别出在SFT中发生系统改变的具体语义特征,并揭示这些改变呈现任务特定(task-specific)和层特定(layer-specific)的分布模式,同时还能专门检测出与安全对齐(safety alignment)相关的层级更新轮廓(layer-wise update profile)。

链接: https://arxiv.org/abs/2605.11426
作者: Ruhaan Chopra
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The cosine similarity between a large language model’s hidden activations before and after Supervised Fine-Tuning (SFT) remains very high. This, at first glance, suggests that SFT leaves the model’s activation geometry largely undisturbed. However, projecting both sets of activations through a Sparse Autoencoder (SAE) pretrained on the base model reveals that the underlying sparse latents diverge significantly. We introduce a novel investigative pipeline which utilizes these pretrained SAEs as a high-resolution diagnostic tool to mechanistically investigate the drivers of this representational divergence. Through our analytical pipeline, we discover task-specific and layer-specific distributions of the precise semantic features that are systematically altered during supervised fine-tuning. We additionally identify a layer-wise update profile specific to safety alignment. All code, experimental scripts, and analysis files associated with this work are publicly available at: this https URL.
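摘要所述"隐藏激活余弦相似度很高、稀疏隐变量却显著发散"的现象,可用一个构造性的玩具例子复现(字典取单位阵、偏置取常数等均为本文假设,仅作示意,并非论文的 SAE 配置):

```python
import numpy as np

def sparse_code(h, W_enc, b):
    # SAE 编码:ReLU(W_enc h + b),负偏置产生稀疏性
    return np.maximum(W_enc @ h + b, 0.0)

W_enc = np.eye(4)                            # 假设:单位字典,每维即一个"特征"
b = np.full(4, -0.5)
h_base = np.array([1.0, 0.49, 0.0, 0.0])     # 微调前的隐藏激活
h_sft  = np.array([1.0, 0.51, 0.0, 0.0])     # 微调后的激活,几何上几乎不变

# 激活层面的余弦相似度接近 1
cos = h_base @ h_sft / (np.linalg.norm(h_base) * np.linalg.norm(h_sft))

# 但投影到稀疏隐变量后,激活特征集合已经不同
z0, z1 = sparse_code(h_base, W_enc, b), sparse_code(h_sft, W_enc, b)
s0, s1 = set(np.nonzero(z0)[0]), set(np.nonzero(z1)[0])
jaccard = len(s0 & s1) / len(s0 | s1)        # 活跃特征集合的 Jaccard 重叠
```

微小的激活扰动跨过 SAE 的稀疏阈值,便足以改变活跃特征集合:余弦相似度对此不敏感,而隐变量层面能将其分辨出来。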

[AI-139] Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry

【速读】:该论文试图解决自主AI代理在通过模块化文件系统包(Agent Skills)扩展能力时面临的语义供应链风险,即自然语言元数据和指令(如技能描述、触发词等)可能被操纵,从而影响技能在发现(Discovery)、选择(Selection)和治理(Governance)三个阶段的安全性。论文的关键在于通过实验量化这些对抗性攻击的有效性:在发现阶段,短文本触发器可操纵基于嵌入的检索,使对抗性技能达到86%的成对胜率和80%的前十名展示率;在选择阶段,仅通过描述性框架偏见(description-only framing)就能使代理在77.6%的配对试验中倾向选择功能等价但对抗性的技能变体;在治理阶段,语义规避策略可导致恶意技能在36.5%至100%的情况下逃避封禁判决。这些结果说明,技能描述并非被动文档,而是操作文本(operational text),直接决定了代理如何发现、信任和使用第三方能力。因此,论文的核心贡献是揭示并量化了这种新型攻击面,但并未提出完整的防御方案,其关键意义在于警示社区需要重新审视注册机制中自然语言文本的安全影响。

链接: https://arxiv.org/abs/2605.11418
作者: Shoumik Saha,Kazem Faghih,Soheil Feizi
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 31 pages, 21 figures, 10 tables

点击查看摘要

Abstract:Autonomous AI agents increasingly extend their capabilities through Agent Skills: modular filesystem packages whose SKILL.md files describe when and how agents should use them. While this design enables scalable, on-demand capability expansion, it also introduces a semantic supply-chain risk in which natural-language metadata and instructions can affect which skills are admitted, surfaced, selected, and loaded. We study SKILL.md-only attacks across three registry-facing stages of the Agent Skill lifecycle, using real ClawHub skills and realistic registry mechanisms. In Discovery, short textual triggers can manipulate embedding-based retrieval and improve adversarial skill visibility, achieving up to 86% pairwise win rate and 80% Top-10 placement. In Selection, description-only framing biases agents toward functionally equivalent adversarial variants, which are selected in 77.6% of paired trials on average. In Governance, semantic evasion strategies cause malicious skills to avoid a blocking verdict in 36.5%-100% of cases. Overall, our results show that SKILL.md is not passive documentation but operational text that shapes which third-party capabilities agents find, trust, and use.

[AI-140] Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer ICLR2026

【速读】:该论文试图解决在时间序列分类任务中,由于实际应用中的延迟和成本限制,推理时只能使用部分前缀序列(而非完整序列),导致分类器因缺乏类别判别性模式而泛化能力下降的问题。解决方案的关键是提出生成式扩散先验蒸馏(Generative Diffusion Prior Distillation, GDPD),一种新颖的知识蒸馏框架,它将学生模型的部分序列特征视为教师模型完整序列特征的退化观测,并利用扩散模型的迭代恢复能力学习一个教师特征上的生成先验,通过后验采样出最能解释学生特征中缺失长程信息的教师表示,进而优化学生特征使其相对于这些目标特征达到最小退化,从而为学生特征提供任务相关的长上下文知识分布,有效弥合因训练数据差异(完整序列 vs 部分序列)引起的泛化差距。

链接: https://arxiv.org/abs/2605.11414
作者: Nilushika Udayangani,Kishor Nandakishor,Marimuthu Palaniswami
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ICLR 2026 (Brazil, Rio de Janeiro)

点击查看摘要

Abstract:While traditional time-series classifiers assume full sequences at inference, practical constraints (latency and cost) often limit inputs to partial prefixes. The absence of class-discriminative patterns in partial data can significantly hinder a classifier’s ability to generalize. This work uses knowledge distillation (KD) to equip partial time series classifiers with the generalization ability of their full-sequence counterparts. In KD, high-capacity teacher transfers supervision to aid student learning on the target task. Matching with teacher features has shown promise in closing the generalization gap due to limited parameter capacity. However, when the generalization gap arises from training-data differences (full versus partial), the teacher’s full-context features can be an overwhelming target signal for the student’s short-context features. To provide progressive, diverse, and collective teacher supervision, we propose Generative Diffusion Prior Distillation (GDPD), a novel KD framework that treats short-context student features as degraded observations of the target full-context features. Inspired by the iterative restoration capability of diffusion models, we learn a diffusion-based generative prior over teacher features. Leveraging this prior, we posterior-sample target teacher representations that could best explain the missing long-range information in the student features and optimize the student features to be minimally degraded relative to these targets. GDPD provides each student feature with a distribution of task-relevant long-context knowledge, which benefits learning on the partial classification task. Extensive experiments across earliness settings, datasets, and architectures demonstrate GDPD’s effectiveness for full-to-partial distillation.

[AI-141] What Do EEG Foundation Models Capture from Human Brain Signals?

【速读】:该论文旨在解决临床脑电图(EEG)分析中手工特征目录与现代EEG基础模型(foundation models)表示之间的对齐问题,具体分解为模型学习内容、模型使用内容以及可解释程度三个子问题。解决方案的关键在于采用逐层岭回归(layer-wise ridge probing)、LEACE风格交叉协方差子空间擦除(LEACE-style cross-covariance subspace erasure)以及透明分类器与随机特征基线的对比方法,系统审计三个基础模型(CSBrain、CBraMod、LaBraM)在五个临床任务(MDD、Stress、ISRUC-Sleep、TUSL、Siena)上对六个家族63个特征的编码与因果作用。结果表明,68.6%的(模型、任务、特征)单元具有表示因果性(representation-causal),21.1%仅为编码性(encoded-only),频域特征(frequency-domain features)占主导但其他家族同样贡献显著因果质量,且确认的特征平均恢复了基础模型相对于随机基线79.3%的优势,其中任务梯度清晰(MDD约0.99至Stress约0.56),接近天花板的任务几乎被特征词典完全恢复,而较难任务则留下非平凡残差,为未来概念发现指明了具体目标。

链接: https://arxiv.org/abs/2605.11410
作者: Ling Tang,Qian Chen,Jilin Mei,Houshi Xu,Quanshi Zhang,Jing Shao,Na Zou,Xia Hu,Dongrui Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clinical electroencephalogram (EEG) analysis rests on a hand-crafted feature catalog refined over decades, e.g., band power, connectivity, complexity, and more. Modern EEG foundation models bypass this catalog, learn directly from raw signals via self-supervised pretraining, and match or outperform feature-engineered baselines on most clinical benchmarks. Whether the two representations align is an open question, which we decompose into three sub-questions: what does the model learn, what does the model use, and how much can be explained. We answer them with layer-wise ridge probing, LEACE-style cross-covariance subspace erasure, and a transparent classifier benchmarked against a random-feature baseline. The audit covers three foundation models (CSBrain, CBraMod, LaBraM), five clinical tasks (MDD, Stress, ISRUC-Sleep, TUSL, Siena), and a 6-family 63-feature lexicon. Of the 945 (model, task, feature) units, 648 ( 68.6% ) are representation-causal and 199 ( 21.1% ) are encoded-only. Across tasks, 50 features qualify as universal candidates with strong support (all three architectures RC) in two or more tasks. Frequency-domain features dominate, but the other five families each contribute substantial causal mass. Confirmed features recover, on average, 79.3% of the foundation model’s advantage over the random baseline, with a clean task gradient (MDD \approx 0.99 down to Stress \approx 0.56 ): tasks near ceiling are almost fully recovered by the lexicon, while harder tasks leave a non-trivial residual that pinpoints a concrete target for future concept discovery.
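摘要中审计流程第一步的"逐层岭回归探针",核心就是在某层表示上对每个手工特征做闭式岭回归、用 R² 衡量其被线性编码的程度。以下为极简草图(矩阵形状、正则系数等均为本文假设,非论文实现):

```python
import numpy as np

def ridge_probe(H, y, lam=1.0):
    """H: (n, d) 某层隐藏表示;y: (n,) 目标特征(如某频段功率)。"""
    d = H.shape[1]
    # 闭式岭回归:w = (H^T H + λI)^{-1} H^T y
    w = np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)
    pred = H @ w
    # 以 R^2 衡量该特征在此层被线性编码的程度
    ss_res = ((y - pred) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum() + 1e-12
    return w, 1.0 - ss_res / ss_tot

# 玩具示例:目标特征与表示线性相关时,探针 R^2 接近 1
rng = np.random.default_rng(0)
H = rng.normal(size=(200, 10))
y = H @ rng.normal(size=10)
w, r2 = ridge_probe(H, y, lam=1e-6)
```

对模型每一层、每个特征重复此过程,即得到摘要中 945 个(模型、任务、特征)单元的"编码"一侧证据;"因果"一侧则需后续的子空间擦除步骤。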

[AI-142] Attributing Emergence in Million-Agent Systems

【速读】:该论文试图解决现有公理归因方法无法扩展到大规模(N≥10^6)LLM驱动的多代理系统(MAS)的问题,具体而言,现有方法在N≲10³时因组合爆炸而失效,但所需解释的社会现象(如极化、信息级联)却发生在N≥10⁶规模。解决方案的关键在于将Aumann-Shapley路径积分归因方法适配至百万代理规模,该方法不仅满足所有四个公理条件,且计算速度比采样Shapley快4-5个数量级。通过实证分析全规模与可见性偏差小样本归因的结构性差异,并证明归因缩放偏差定理,该定理指出任何非线性宏观指标下均不存在全局缩放因子能调和规模差异,从而确立了全规模归因作为理论要求的必要性。

链接: https://arxiv.org/abs/2605.11404
作者: Ling Tang,Jilin Mei,Qian Chen,Qihan Ren,Linfeng Zhang,Quanshi Zhang,Jing Shao,Xia Hu,Dongrui Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can simulate human-like reasoning and decision-making in individual agents. LLM-powered multi-agent systems (MAS) combine such agents to simulate population-scale social phenomena such as polarization, information cascades, and market panics. Such studies require attributing macro emergence to individual agents, but existing axiomatic methods scale combinatorially in N and have been confined to N \lesssim 10^3 , while the phenomena they explain occur at N \geq 10^6 . We address this gap by adapting Aumann–Shapley path-integral attribution to LLM-powered MAS at million-agent scale; the resulting method satisfies all four axioms, runs four to five orders of magnitude faster than sampled Shapley on the same hardware. We use this method to test the scale gap empirically: across 14 days of public Bluesky data ( 1,671,587 active users), we compute the attribution at both full scale and the visibility-biased N = 10^2 convenience sample used by small-scale studies, and the two disagree structurally. At full scale the long tail and middle tier jointly carry the majority; the biased small panel attributes almost everything to a few high-follower accounts. We then prove that under any nonlinear macro indicator the disagreement cannot be reduced by post-hoc rescaling: an Attribution Scaling Bias theorem shows that no global rescaling factor can reconcile small-scale and full-scale attribution. Full-scale attribution is therefore not a methodological choice but a theoretical requirement for any nonlinear macro indicator.
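摘要采用的 Aumann–Shapley 路径积分归因,可用"沿基线到目标状态的直线路径对各 agent 坐标积分梯度"的有限差分草图示意(宏观指标 F、步数与基线均为本文假设,并非论文在 Bluesky 数据上的实际指标):

```python
import numpy as np

def aumann_shapley(F, x, baseline, steps=200):
    """对标量宏观指标 F 在直线路径上做中点法则数值积分,返回逐坐标归因。"""
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    attr = np.zeros_like(x)
    eps = 1e-5
    for t in (np.arange(steps) + 0.5) / steps:       # 中点法则
        p = baseline + t * (x - baseline)
        # 中心差分近似梯度(每个坐标一次)
        g = np.array([(F(p + eps * e) - F(p - eps * e)) / (2 * eps)
                      for e in np.eye(len(x))])
        attr += g * (x - baseline) / steps
    return attr

# 非线性宏观指标示例(假设):平方和;正是这类非线性指标触发归因缩放偏差
F = lambda v: float((v ** 2).sum())
x = np.array([1.0, 2.0, 3.0])
attr = aumann_shapley(F, x, np.zeros(3))
```

该方法满足完备性公理:各坐标归因之和等于 F(x) − F(baseline),这也是它能以线性代价替代组合爆炸的采样 Shapley 的原因。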

[AI-143] Transformer Interpretability from Perspective of Attention and Gradient

【速读】:该论文试图解决 Transformer 模型的可解释性问题,即在性能提升之外,如何深入理解其内部机制,特别是现有基于梯度 (gradient) 的解释方法在特征区域覆盖和细节呈现上的不足。解决方案之关键在于通过引导梯度方向(更准确地说是注意力方向)来优化解释过程,使模型能够更全面地解释特征区域并提供更细致的解释,从而加深对 Transformer 机制的理解;此外,论文还揭示了 Vision Transformer (ViT) 与人类视觉感知差异所引发的类别重写 (class rewriting) 安全风险。

链接: https://arxiv.org/abs/2605.11392
作者: Yongjin Cui,Xiaohui Fan,Huajun Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although researchers’ attention is more focused on the performance of Transformer models, the interpretation of Transformer can never be ignored. Gradient is widely utilized in Transformer interpretation. From the perspective of attention and gradient, we conduct an in-depth study of Transformer interpretation and propose a method to achieve it by guiding the gradient direction, or more precisely, the attention direction. The method enables more comprehensive interpretation of feature regions, offers detail interpretation, and helps to better understand Transformer mechanism. Leveraging the difference in how Vision Transformer (ViT) and humans perceive images, we alter the class of an image in a way that is almost imperceptible to the human eye. This class rewriting phenomenon may potentially pose security risks in certain scenarios.

[AI-144] Revisiting Privacy Preservation in Brain-Computer Interfaces: Conceptual Boundaries, Risk Pathways and a Protection-Strength Grading Framework

【速读】:该论文旨在解决脑机接口(Brain-computer interface, BCI)在从实验室转向临床、边缘和现实场景过程中,由神经数据、衍生表征、模型资产和解码输出等环节引发的隐私风险问题,特别是用户数据隐私与模型隐私在共享风险路径中的保护边界与关系。解决方案的关键在于提出一个由保护对象(protection object)、生命周期阶段(lifecycle stage)和主导保护强度等级(dominant protection-strength level)组成的三维框架,用以对现有工作进行分类分级,并强调隐私保护不仅应模糊数据,还需通过解缠(disentanglement)任务无关的敏感信息,在保留下游效用的同时实现隐私保护。

链接: https://arxiv.org/abs/2605.11386
作者: Lei Sun,Xiuqing Mao,Shuai Zhang,Qingyu Zeng,Min Zhao,Jiyuan Li,Wenle Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Brain-computer interfaces (BCIs) are moving rapidly from laboratory research into clinical, edge, and real-world settings. Under ISO/IEC 8663:2025, a BCI is a direct communication link between central nervous system activity and external software or hardware systems. This link expands privacy risk beyond raw neural-signal leakage: neural data, derived representations, model assets, and decoded outputs can be re-associated with individuals across collection, transmission, storage, training, inference, and feedback, or used to infer information beyond what a task requires. Starting from the general BCI paradigm, this review defines privacy-protection boundaries, protection objects, and the relationship between user data privacy and model privacy within a shared risk pathway. It then proposes a three-dimensional framework - protection object, lifecycle stage, and dominant protection-strength level - to classify existing work into four levels of protection strength. Finally, mental privacy and neuroethical risks are treated as open issues, emphasizing that BCI privacy protection should not only obscure data but also disentangle task-irrelevant sensitive information while preserving downstream utility. Keywords: Brain-computer interface, Neural data privacy, User data privacy, Model privacy, Disentanglement of task-irrelevant sensitive information, Protection-strength grading, Neuroethical risks

[AI-145] TRACE: Temporal Routing with Autoregressive Cross-channel Experts for EEG Representation Learning

【速读】:该论文试图解决脑电图(EEG)信号因多通道和非平稳性而难以学习可迁移表征的问题。现有架构对所有时间步施加统一计算或独立处理每个通道补丁,未能捕捉同一时刻多通道提供的耦合神经活动以及随时间变化的动态相关性。解决方案之关键在于提出了TRACE框架,一种自回归EEG预训练方法,它从因果上下文预测未来EEG补丁,同时执行时间自适应且跨通道相干的计算。具体地,每个时间步从因果跨通道历史中导出专家路由决策,并将其联合应用于该步所有通道,从而在保留瞬时跨通道相干性的同时,允许不同时间域激活不同计算。由于路由定义在可用通道集和因果时序上下文中,TRACE能兼容不同通道数、导联方式、序列长度和记录域之间的异构预训练,最终在多个下游基准上取得最优结果。

链接: https://arxiv.org/abs/2605.11380
作者: Fan Ma,Qier An,Peng Chen,Lingfei Qian,Xiang Lan,Mingyang Jiang,Zhiling Gu,Xenophon Papademetris,Hua Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning transferable representations for electroencephalography (EEG) remains challenging because EEG signals are inherently multi-channel and non-stationary. Channels observed at the same time provide coupled measurements of neural activity, while the relevant temporal dynamics vary across contexts. This structure is poorly matched by architectures that apply uniform computation across time or route each channel patch independently. To this end, we propose TRACE, an autoregressive EEG pre-training framework that predicts future EEG patches from causal context while performing temporally adaptive and cross-channel coherent computation. At each temporal step, TRACE derives an expert routing decision from the causal cross-channel history and applies it jointly to all channels at that step. This preserves instantaneous cross-channel coherence while allowing different temporal regimes to activate different computation. Since routing is defined over the available channel set and causal temporal context, TRACE is compatible with heterogeneous pre-training across corpora with different channel counts, montages, sequence lengths, and recording domains. Across eight downstream EEG benchmarks, TRACE is evaluated in both settings: when downstream domains are seen only as unlabeled pre-training data and when downstream datasets are completely unseen during pre-training. It obtains the best results on several benchmarks while remaining competitive on motor imagery and clinical event classification tasks, with ablations supporting the importance of cross-channel temporal routing.
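摘要的核心机制"由跨通道因果历史得到单一路由决策、同一时间步所有通道共用同一专家",可用如下草图示意(上下文汇总方式、top-1 路由与专家形式均为本文假设,非论文实现):

```python
import numpy as np

def route_step(history, experts, W_route):
    """history: (C, d) 各通道截至当前步的因果汇总;experts: k 个专家函数。"""
    ctx = history.mean(axis=0)                 # 跨通道因果上下文(此处取均值,为假设)
    logits = W_route @ ctx                     # (k,) 路由打分
    e = int(np.argmax(logits))                 # top-1 专家,整步共用
    # 同一专家作用于该步所有通道,保持瞬时跨通道相干性
    return np.stack([experts[e](h) for h in history]), e

# 玩具示例:两个假设专家,路由矩阵取单位阵
experts = [lambda h: 2.0 * h, lambda h: -h]
W_route = np.eye(2)
history = np.array([[3.0, 0.0], [1.0, 0.0]])   # 两个通道
out, e = route_step(history, experts, W_route)
```

由于路由只依赖当前可用的通道集合与因果上下文,这一结构天然兼容通道数、导联方式各异的异构预训练语料。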

[AI-146] LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents ICSE2026

【速读】:该论文旨在解决现有以工具为中心的代理通信协议(侧重于代理-API交互)无法支持大规模、结构化且带策略保证的LLM间直接协商的问题。解决方案的关键在于提出一个可扩展的协商导向环境——个人LLM交换(LLM-X),其核心架构包括:联邦网关(federated gateways)实现去中心化部署,基于主题的路由(topic-based routing)实现高效消息分发,以及策略执行(policy enforcement)机制确保模式有效性和协商规则遵守;同时设计类型化消息协议(typed message protocol),支持能力协商(capability negotiation)和合同网式协调(contract-net-style coordination),从而在多个个性化LLM代理之间实现直接、结构化的通信与协调,并通过实验验证了不同协商策略下性能与鲁棒性的权衡。

链接: https://arxiv.org/abs/2605.11376
作者: Giuliano Lorenzoni,Paulo Alencar,Donald Cowan(University of Waterloo)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, accepted at AGENT 2026 Workshop, co-located with ICSE 2026

点击查看摘要

Abstract:We propose a personal-LLM exchange (LLM-X), a scalable negotiation-oriented environment that enables direct, structured communication across populations of personal agents (LLMs), each representing an individual user. Unlike existing tool-centric protocols that focus on agent-API interaction, LLM-X introduces a message bus and routing substrate for LLM-to-LLM coordination with guarantees around schema validity and policy enforcement. We contribute: (1) an architecture for LLM-X comprising federated gateways, topic-based routing, and policy enforcement; (2) a typed message protocol supporting capability negotiation and contract-net-style coordination; and (3) the first empirical evaluation of LLM-based multi-agent negotiation at scale. Experiments span 5, 9, and 12 agents, under distinct negotiation policies (Low, Medium, High), and across both short-run (minutes) and long-run (2h, 12h) load conditions. Results highlight clear policy-performance trade-offs: stricter policies improve robustness and fairness but increase latencies and message volume. Extended runs confirm that LLM-X remains stable under sustained load, with bounded latency drift.
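摘要中支持能力协商与合同网式协调的"类型化消息协议",可用一个带模式校验的消息类示意(字段名、动作集合均为本文假设,并非 LLM-X 的实际协议定义):

```python
from dataclasses import dataclass, field, asdict

# 合同网式协调常见的动作集合(假设)
ALLOWED = {"propose", "counter", "accept", "reject"}

@dataclass
class NegotiationMessage:
    sender: str
    topic: str                 # 基于主题的路由依赖此字段
    performative: str          # 协商动作类型
    payload: dict = field(default_factory=dict)

    def validate(self):
        """网关在路由前执行的模式校验(策略执行的前置检查)。"""
        if self.performative not in ALLOWED:
            raise ValueError(f"invalid performative: {self.performative}")
        if not self.topic:
            raise ValueError("topic required for topic-based routing")
        return True

msg = NegotiationMessage("agent-1", "travel/flights", "propose", {"price": 420})
```

网关在转发前调用 `validate()`,即对应摘要所说的模式有效性保证;更严格的策略(Low/Medium/High)可在此之上叠加额外不变量检查。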

[AI-147] Causal Algorithmic Recourse: Foundations and Methods

【速读】:该论文试图解决现有算法回滚(algorithmic recourse)方法中的一个关键缺陷:它们通常将回滚结果视为固定个体的反事实,而忽略了现实世界中同一个体可能在不同的潜在条件下经历重复决策的过程。为此,论文提出了一个因果框架,将回滚建模为干预前和干预后结果之间的过程,允许潜在变量部分稳定并重新采样。解决方案的关键在于引入后回滚稳定性条件(post-recourse stability conditions),使得仅从观测数据即可推理回滚效应,并在此基础上开发了基于copula的推断算法。当存在同一体干预前后的配对观测数据(称为回滚数据)时,论文进一步提供了推断copula参数及进行拟合优度检验的方法;若copula模型被拒绝,则提出一种无分布算法直接从回滚数据学习回滚效应,从而有效处理现实世界中的动态决策情境。

链接: https://arxiv.org/abs/2605.11373
作者: Drago Plecko,Collin Wang,Elias Bareinboim
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:The trustworthiness of AI decision-making systems is increasingly important. A key feature of such systems is the ability to provide recommendations for how an individual may reverse a negative decision, a problem known as algorithmic recourse. Existing approaches treat recourse outcomes as counterfactuals of a fixed unit, ignoring that real-world recourse involves repeated decisions on the same individual under possibly different latent conditions. We develop a causal framework that models recourse as a process over pre- and post-intervention outcomes, allowing for partial stability and resampling of latent variables. We introduce post-recourse stability conditions that enable reasoning about recourse from observational data alone, and develop a copula-based algorithm for inferring the effects of recourse under these conditions. For settings where paired observations of the same individual before and after intervention are available (called recourse data), we develop methods for inferring copula parameters and performing goodness-of-fit testing. When the copula model is rejected, we provide a distribution-free algorithm for learning recourse effects directly from recourse data. We demonstrate the value of the proposed methods on real and semi-synthetic datasets.

[AI-148] LPDP: Inference-Time Reward Control for Variable-Length DNA Generation with Edit Flows

【速读】:该论文旨在解决现有奖励引导的DNA序列生成框架局限于固定长度序列空间,无法通过生物学上合理的插入、删除和替换操作生成可变长度DNA的问题。解决方案的关键在于提出一种无需训练的局部扰动离散编程(Local Perturbation Discrete Programming, LPDP)算子,该算子在推理时对可变长度DNA编辑动作生成器进行中间状态和动作感知的局部重求解:通过基于编辑动作类型几何结构聚焦于连贯的替换、插入或删除子图,并利用硬最大值(hard Max)或软对数求和指数(soft log-sum-exponential, LSE)备份聚合局部延续,同时结合前端加载(front-loaded)和后端加载(back-loaded)奖励倾斜策略,分别针对增强子优化和外显子-内含子-外显子修补场景实现高效的推理时奖励控制。

链接: https://arxiv.org/abs/2605.11368
作者: Jeongchan Kim,Yunkyung Ko,Jong Chul Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注: 22 pages, 5 figures

点击查看摘要

Abstract:We study the application of recent Edit Flows for inference-time reward control for DNA sequence generation. Unlike most reward-guided DNA generation frameworks, which operate on fixed-length sequence spaces, Edit Flows have a potential to generate variable-length DNA through biologically plausible insertion, deletion, and substitution operations. In particular, we propose Local Perturbation Discrete Programming (LPDP), a training-free, intermediate-state and action-aware local re-solving operator for variable-length DNA edit-action generators at inference time. More specifically, at each guided rollout step, LPDP scores one-step root edits, retains a near-best root band, and re-ranks each retained root by solving a bounded local discrete program around its child sequence. This local program uses the typed geometry of edit actions to focus on coherent substitution, insertion, or deletion subgraphs, and aggregates local continuations with either a hard Max backup or a soft log-sum-exponential (LSE) backup. We instantiate LPDP in two regimes: front-loaded reward tilting for enhancer optimization, where early edits are critical for establishing global regulatory sequence structure, and back-loaded reward tilting for exon-intron-exon inpainting, where late edits fine-tune splice-boundary contexts.
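摘要中对局部延续的两种聚合方式——硬 Max 备份与软 log-sum-exponential(LSE)备份——可用几行代码示意(温度参数与函数签名为本文假设):

```python
import numpy as np

def backup(values, mode="max", temp=1.0):
    """values: 局部子图各延续的奖励;mode 选择硬 Max 或软 LSE 备份。"""
    v = np.asarray(values, float)
    if mode == "max":
        return float(v.max())
    # 软备份:temp * log Σ exp(v / temp);减去最大值保证数值稳定
    m = v.max()
    return float(temp * np.log(np.exp((v - m) / temp).sum()) + m)

v = [1.0, 2.0, 3.0]
hard = backup(v, mode="max")
soft = backup(v, mode="lse")
```

LSE 备份始终不小于 Max 备份,且温度趋于 0 时收敛到 Max;软备份因而在保留最优延续信号的同时,让次优延续也参与根节点的重排序。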

[AI-149] Causal Bias Detection in Generative Artificial Intelligence

【速读】:该论文旨在解决生成式AI(generative AI)中因果公平性(causal fairness)的量化与评估问题。与标准机器学习(ML)仅需学习一个预测函数不同,生成模型能对任意变量集进行条件采样,隐式构建所有因果机制,导致现有因果公平性方法失效。解决方案的关键在于:首先,将生成式AI的因果公平性问题形式化,并统一到与标准ML兼容的理论框架中;其次,推导出新的因果分解结果,使得可以沿不同因果路径(causal pathways)以及真实世界机制被生成模型机制替代这两个维度,对公平性影响进行细粒度量化;最后,建立识别条件(identification conditions)并引入高效估计量,从而在诸如大语言模型等生成模型中定量分析种族与性别偏见。

链接: https://arxiv.org/abs/2605.11365
作者: Drago Plecko
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Automated systems built on artificial intelligence (AI) are increasingly deployed across high-stakes domains, raising critical concerns about fairness and the perpetuation of demographic disparities that exist in the world. In this context, causal inference provides a principled framework for reasoning about fairness, as it links observed disparities to underlying mechanisms and aligns naturally with human intuition and legal notions of discrimination. Prior work on causal fairness primarily focuses on the standard machine learning setting, where a decision-maker constructs a single predictive mechanism f_\widehat Y for an outcome variable Y , while inheriting the causal mechanisms of all other covariates from the real world. The generative AI setting, however, is markedly more complex: generative models can sample from arbitrary conditionals over any set of variables, implicitly constructing their own beliefs about all causal mechanisms rather than learning a single predictive function. This fundamental difference requires new developments in causal fairness methodology. We formalize the problem of causal fairness in generative AI and unify it with the standard ML setting under a common theoretical framework. We then derive new causal decomposition results that enable granular quantification of fairness impacts along both (a) different causal pathways and (b) the replacement of real-world mechanisms by the generative model’s mechanisms. We establish identification conditions and introduce efficient estimators for causal quantities of interest, and demonstrate the value of our methodology by analyzing race and gender bias in large language models across different datasets.

[AI-150] Causal Fairness for Survival Analysis

【速读】:该论文旨在解决生存/时间-事件(survival/time-to-event, TTE)分析中公平性研究不足的问题,特别是现有方法依赖统计公平定义(即使有无穷数据也无法分离产生差异的因果机制),且忽视时间动态。解决方案的关键是开发一个因果框架,通过非参数方法实现生存差异的因果路径分解:首先用图模型(graphical model)形式化关于删失(censoring)和无混杂(lack of confounding)的必要假设;然后恢复给定协变量的条件生存函数(conditional survival function);接着应用因果约简定理(Causal Reduction Theorem)将问题重新表述为适用于因果路径分解的形式;最后高效估计直接、间接与虚假路径(direct, indirect, and spurious pathways)的贡献,从而解释差异为何产生及如何随时间演化,并以重症监护室(ICU)入院后种族差异的时间演化分析验证该方法的实用性。

链接: https://arxiv.org/abs/2605.11362
作者: Drago Plecko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:In the data-driven era, large-scale datasets are routinely collected and analyzed using machine learning (ML) and artificial intelligence (AI) to inform decisions in high-stakes domains such as healthcare, employment, and criminal justice, raising concerns about the fairness behavior of these systems. Existing works in fair ML cover tasks such as bias detection, fair prediction, and fair decision-making, but largely focus on static settings. At the same time, fairness in temporal contexts, particularly survival/time-to-event (TTE) analysis, remains relatively underexplored, with current approaches to fair survival analysis adopting statistical fairness definitions, which, even with unlimited data, cannot disentangle the causal mechanisms that generate disparities. To address this gap, we develop a causal framework for fairness in TTE analysis, enabling the decomposition of disparities in survival into contributions from direct, indirect, and spurious pathways. This provides a human-understandable explanation of why disparities arise and how they evolve over time. Our non-parametric approach proceeds in four steps: (1) formalizing the necessary assumptions about censoring and lack of confounding using a graphical model; (2) recovering the conditional survival function given covariates; (3) applying the Causal Reduction Theorem to reframe the problem in a form amenable to causal pathway decomposition; (4) estimating the effects efficiently. Finally, our approach is used to analyze the temporal evolution of racial disparities in outcome after admission to an intensive care unit (ICU).

[AI-151] Options Not Clicks: Lattice Refinement for Consent-Driven MCP Authorization

【速读】:该论文旨在解决模型上下文协议(Model Context Protocol, MCP)广泛采用后,通过有意义的用户同意来确保工具调用安全的关键挑战——现有方法如全局“始终允许”开关或基于大语言模型(Large Language Model, LLM)的不透明决策,无法处理危险调用参数,且易导致用户同意疲劳。解决方案的核心是Conleash,一种客户端中间件,它利用风险格(risk lattice)在已知边界内自动许可安全调用并升级风险,结合策略引擎(policy engine)处理用户定义的不变量,以及一个将用户决策转化为可重用规则的细化循环(refinement loop)。实验表明,Conleash在984个真实世界轨迹上达到98.2%的准确率,捕获99.4%的风险升级,仅增加8.2毫秒策略验证开销,且在16人用户研究中显著优于传统方法,获得更高的信任度和更少的提示负担。

链接: https://arxiv.org/abs/2605.11360
作者: Ying Li,Yanju Chen,Peiran Wang,Issac Khabra,Faysal Hossain Shezan,Yu Feng,Yuan Tian
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:As Model Context Protocol adoption grows, securing tool invocations via meaningful user consent has become a critical challenge, as existing methods (broad "always allow" toggles or opaque LLM-based decisions) fail to account for dangerous call arguments and often lead to consent fatigue. In this work, we present Conleash, a client-side middleware that enforces boundary-scoped authorization by utilizing a risk lattice to auto-permit safe calls within known boundaries while escalating risks, a policy engine for user-defined invariants, and a refinement loop that converts user decisions into reusable rules. Evaluated on 984 real-world traces, Conleash achieved 98.2% accuracy, caught 99.4% of escalations, and added only 8.2 ms of overhead for policy verification; furthermore, in a user study (N=16), participants significantly preferred Conleash's scoped permissions over traditional methods, citing higher trust and reduced prompting.
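
Conleash 的风险格思想可以用一个极简的全序格来示意:调用的风险等级若不超过该工具的已知边界,则自动放行,否则升级请求用户同意。以下为演示用假设草图(Risk、boundary 及风险规则均为虚构示例,非 Conleash 实现):

```python
from enum import IntEnum

class Risk(IntEnum):
    # 全序风险格: LOW <= MEDIUM <= HIGH
    LOW = 0
    MEDIUM = 1
    HIGH = 2

def call_risk(tool, args):
    """根据调用参数估计风险等级(示意规则)."""
    text = " ".join(str(v) for v in args.values())
    if any(k in text for k in ("rm -rf", "token", "secret")):
        return Risk.HIGH          # 危险参数内容直接判高风险
    if tool.startswith("write_"):
        return Risk.MEDIUM
    return Risk.LOW

def authorize(tool, args, boundary):
    """风险落在工具的已知边界内则自动放行, 否则升级."""
    return "permit" if call_risk(tool, args) <= boundary[tool] else "escalate"

# 每个工具的已知安全边界
boundary = {"read_file": Risk.LOW, "write_file": Risk.MEDIUM}
```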

[AI-152] CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

【速读】:该论文试图解决领域科学家在缺乏计算或图像处理专业知识的情况下,难以针对噪声大、高动态范围、稀疏标注或仅被松散定义的科研数据设计专用算法或AI模型的瓶颈问题。解决方案的关键是引入了CVEvolve,一个具有零代码接口的自主体化代理框架(autonomous agentic harness),它通过结合多轮搜索策略(multi-round search strategy)与代码执行、评估实现、历史管理、保留测试(holdout testing)以及可选的科学数据与视觉输出检查工具,交替执行发现与改进动作,并利用谱系感知的随机候选采样(lineage-aware stochastic candidate sampling)来平衡探索与利用,从而自动化地发现适用于科研数据处理的高性能算法。

链接: https://arxiv.org/abs/2605.11359
作者: Ming Du,Xiangyu Yin,Yanqi Luo,Dishant Beniwal,Songyuan Tang,Hemant Sharma,Mathew J. Cherukara
机构: 未知
类目: Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
备注:

点击查看摘要

Abstract:Scientific data processing often requires task-specific algorithms or AI models, creating a barrier for domain scientists who need to analyze their data but may not have extensive computing or image-processing expertise. This barrier is especially pronounced when data are noisy, have a high dynamic range, are sparsely labeled, or are only loosely specified. We introduce CVEvolve, an autonomous agentic harness with a zero-code interface for scientific data-processing algorithm discovery. CVEvolve combines a multi-round search strategy with tools for code execution, evaluation implementation, history management, holdout testing, and optional inspection of scientific data and visual outputs. The search alternates between discovery and improvement actions, and uses lineage-aware stochastic candidate sampling to balance exploration and exploitation. We demonstrate CVEvolve on x-ray fluorescence microscopy image registration, Bragg peak detection, and high-energy diffraction microscopy image segmentation. Across these tasks, CVEvolve discovers algorithms that improve over baseline methods, while holdout test tracking helps identify candidates that generalize better than later over-optimized alternatives. These results show that zero-code, autonomous LLM-powered algorithm development can help domain scientists turn unstructured scientific image data into practical algorithms and downstream scientific discoveries.
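
其中"谱系感知的随机候选采样"可以示意为:候选得分经温度软最大化后,再按所属谱系已被采样的次数打折,从而在利用高分候选与探索新谱系之间取得平衡(以下权重公式为演示用假设,非论文公式):

```python
import math

def lineage_aware_weights(scores, lineage_counts, temperature=1.0):
    """得分越高权重越大(利用), 谱系被采样次数越多权重越小(探索)."""
    raw = [math.exp(s / temperature) / (1.0 + c)
           for s, c in zip(scores, lineage_counts)]
    z = sum(raw)
    return [w / z for w in raw]

# 三个候选得分相同, 但候选 0 所在谱系已被采样 4 次, 权重被压低
w = lineage_aware_weights([1.0, 1.0, 1.0], [4, 0, 0])
# 得分差异仍然主导: 高分候选获得更大权重
w2 = lineage_aware_weights([2.0, 1.0, 1.0], [0, 0, 0])
```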

[AI-153] Human-AI Productivity Paradoxes: Modeling the Interplay of Skill Effort and AI Assistance

【速读】:该论文旨在解决生成式AI(Generative AI)工具在职场和教育中广泛采用后,其实际生产力影响不明确且存在矛盾证据的问题,尤其是为何AI辅助水平提升有时反而导致生产力下降(即生产力悖论)以及长期技能两极分化的现象。解决方案之关键在于构建一个包含人类异质性与AI交互机制的模型,其中通过引入技能发展的内生性(即人类根据AI辅助动态调整自身努力)或AI不可靠性(即AI输出的不确定性),能够复现生产力悖论;更进一步,通过纳入AI素养(AI literacy)的异质性——即个体识别并适应AI错误输出的能力差异——模型在稳态下显式地揭示了技能极化(skill polarization)的涌现机制。这些机制为理解人机交互中的悖论与分化提供了可量化的指标与条件。

链接: https://arxiv.org/abs/2605.11350
作者: Ali Aouad,Thodoris Lykouris,Huiying Zhong
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
备注:

点击查看摘要

Abstract:Generative Artificial Intelligence (AI) tools are rapidly adopted in the workplace and in education, yet the empirical evidence on AI’s impact remains mixed. We propose a model of human-AI interaction to better understand and analyze several mechanisms by which AI affects productivity. In our setup, human agents with varying skill levels exert utility-maximizing effort to produce certain task outcomes with AI assistance. We find that incorporating either endogeneity in skill development or in AI unreliability can induce a productivity paradox: increased levels of AI assistance may degrade productivity, leading to potentially significant shortfalls. Moreover, we examine the long-term distributional effect of AI on skill, and demonstrate that skill polarization can emerge in steady state when accounting for heterogeneity in AI literacy – the agent’s capability to identify and adapt to inaccurate AI outputs. Our results elucidate several mechanisms that may explain the emergence of human-AI productivity paradoxes and skill polarization, and identify simple measures that characterize when they arise.
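
论文中"提高AI辅助反而降低产出"的机制可以用一个极小的数值模型示意:当辅助水平 a 上升时,人类努力的边际回报下降,效用最大化的努力随之收缩;若 AI 不完全可靠(p<1),期望产出反而随 a 递减。以下模型形式(效用函数、成本参数)均为演示用假设,非论文设定:

```python
def optimal_effort(a, p, cost=0.5, grid=2001):
    """在 [0, 2] 网格上搜索效用最大化的努力水平.
    效用 U(e) = (1-a)*e + a*p - cost*e^2, 解析最优为 e* = (1-a)/(2*cost)."""
    best_e, best_u = 0.0, float("-inf")
    for i in range(grid):
        e = 2.0 * i / (grid - 1)
        u = (1.0 - a) * e + a * p - cost * e * e
        if u > best_u:
            best_e, best_u = e, u
    return best_e

def expected_output(a, p, cost=0.5):
    """期望产出 = 人类贡献 (1-a)*e* + AI 贡献 a*p."""
    e = optimal_effort(a, p, cost)
    return (1.0 - a) * e + a * p

# 悖论: p=0.5 时, 辅助从 0 提到 0.8, 努力收缩导致产出下降
```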

[AI-154] Physics-Informed Teacher-Student Ensemble Learning for Traffic State Estimation with a Varying Speed Limit Scenario ITSC

【速读】:该论文旨在解决现有物理信息深度学习(PIDL)神经网络训练架构无法适应带有可变速度限制(VSL)的高速公路上变化的交通特性,从而导致交通状态估计(TSE)不准确的问题。解决方案的关键在于提出一种教师-学生集成训练框架:教师模型通过PIDL局部编码流量守恒定律,学生模型则利用多层感知器分类器(MLP)识别交通特性,并据此选择最合适的PIDL集成成员进行TSE,从而有效捕捉VSL带来的异质性。

链接: https://arxiv.org/abs/2605.11346
作者: Archie J. Huang,Dongdong Wang,Shaurya Agarwal,Mohamed Abdel-Aty,Md Mahmudul Islam,Muhammad Shahbaz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: The IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026

点击查看摘要

Abstract:Physics-informed deep learning (PIDL) neural networks have shown their capability as a useful instrument for transportation practitioners in utilizing the underlying relationship between the state variables for traffic state estimation (TSE). Another efficient traffic management approach is implementing varying speed limits (VSLs) on transportation corridors to control traffic and mitigate congestion. However, the existing training architecture of PIDL in the literature cannot accommodate the changing traffic characteristics on a freeway with VSL. To tackle this challenge, we propose a novel framework integrating teacher-student ensemble training with PIDL neural networks for TSE under VSL scenarios. The physics of flow conservation law is encoded locally in the teacher models by PIDL, and the student model uses a multi-layer perceptron classifier (MLP) to identify traffic characteristics and selects the ensemble member of PIDL neural networks for TSE. This integrated framework provides a natural solution for capturing the heterogeneity of VSL and accurately addressing the TSE problem. The case study results validate the proposed ensemble approach, demonstrating its superior performance in TSE compared to other popular baseline methods, as indicated by relative L2 error.

[AI-155] CPEMH: An Agentic Framework for Prompt-Driven Behavior Evaluation and Assurance in Foundation-Model Systems for Mental Health Screening ICSE2026

【速读】:该论文旨在解决大规模语言系统在心理健康筛查等敏感临床领域中,提示驱动行为(prompt-driven behavior)存在不可控变异、缺乏系统化评估与审计手段的问题。解决方案的关键在于提出了CPEMH框架,这是一个基于智能代理(agentic)的工程方法论,通过模块化编排架构(modular orchestration architecture)实现行为保证(behavioral assurance):框架由编排器(orchestrator)、推理(inference)和评估(evaluation)三种代理协同工作,能够自主完成提示策略的设计、评估与选择,从而系统化地控制不同语境下的行为变异性,并确保整个提示生命周期的可追溯性、可重复性和鲁棒性。案例研究证明该框架可有效稳定并审计基础模型在抑郁筛查对话中的行为,其核心思想是优先通过模块化编排保证稳定性,而非追求架构的复杂性,并将F1分数、偏差(bias)和鲁棒性(robustness)纳入核心验收标准。

链接: https://arxiv.org/abs/2605.11341
作者: Giuliano Lorenzoni,Ivens Portugal,Paulo Alencar,Donald Cowan(University of Waterloo)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures. Accepted at the AGENT 2026 Workshop (ICSE 2026)

点击查看摘要

Abstract:This paper presents CPEMH, an agentic framework designed to evaluate prompt-driven behavior in foundation-model systems operating on transcript-based datasets for mental-health screening. CPEMH serves as an engineering methodology for behavioral assurance in large-scale language systems, introducing an orchestrated architecture that autonomously performs the design, evaluation, and selection of prompt strategies, enabling systematic control of behavioral variability across contexts. Its modular agentic design, combining orchestrator, inference, and evaluation agents, ensures traceability, reproducibility, and robustness throughout the prompting lifecycle. A case study on automated depression screening from interview transcripts demonstrates the framework’s capacity to stabilize and audit foundation-model behavior in conversational and clinically sensitive domains. Lessons learned emphasize the role of modular orchestration in behavioral assurance, the prioritization of stability over architectural complexity, and the integration of F1, bias, and robustness as core acceptance criteria.

[AI-156] Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark & New Insights ACL2026

【速读】:该论文旨在解决现有大语言模型(LLM)幻觉检测基准(Hallucination Detection Benchmarks, HDBs)存在的两大关键缺陷:一是缺乏基于检索增强生成(RAG)且包含长上下文样本的基准(由于长度阻碍人工标注),二是缺乏能模拟真实场景中标签噪声(label noise)的基准(现实应用中因人工或弱标注引入噪声)。解决方案的核心在于构建并开源一个新的RAG-based HDB,即TRIVIA+。该基准通过严谨的人工标注过程,具备所有期望的理想属性,包括文献中最长的上下文样本,以及设计并公开了四种不同噪声方案(样本相关与样本独立)的标签集。这一方案使得TRIVIA+能够填补现有基准的空白,并为压力测试和提升幻觉检测器的鲁棒性提供基础。

链接: https://arxiv.org/abs/2605.11330
作者: Wenbo Chen,Veena Padmanabhan,Tootiya Giyahchi,Elaine Wong,Leman Akoglu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 main conference

点击查看摘要

Abstract:Hallucination, broadly referring to unfaithful, fabricated, or inconsistent content generated by LLMs, has wide-ranging implications. Therefore, a large body of effort has been devoted to detecting LLM hallucinations, as well as designing benchmark datasets for evaluating these detectors. In this work, we first establish a desiderata of properties for hallucination detection benchmarks (HDBs) to exhibit for effective evaluation. A critical look at existing HDBs through the lens of our desiderata reveals that none of them exhibits all the properties. We identify the two largest gaps: (1) RAG-based grounded benchmarks with long context are severely lacking (partly because length impedes human annotation); and (2) Existing benchmarks do not make available realistic label noise for stress-testing detectors although real-world use-cases often grapple with label noise due to human or automated/weak annotation. To close these gaps, we build and open-source a new RAG-based HDB called TRIVIA+ that underwent a rigorous human annotation process. Notably, our benchmark exhibits all desirable properties including (1) TRIVIA+ contains samples with the longest context in the literature; and (2) we design and share four sets of noisy labels with different, both sample-dependent and sample-independent, noise schemes. Finally, we perform experiments on RAG-based HDBs, including our TRIVIA+, using popular SOTA detectors that reveal new insights: (i) ample room remains for current detectors to reach the performance ceiling on RAG-based HDBs, (ii) the basic LLM-as-a-Judge baseline performs competitively, and (iii) label noise hinders detection performance. We expect that our findings, along with our proposed benchmark, will motivate and foster needed research on hallucination detection for RAG-based tasks.
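
论文中"样本无关/样本相关"两类标签噪声方案可示意如下:前者以固定概率翻转标签,后者令翻转概率随样本属性(此处假设为上下文长度)增大。具体噪声形式为演示用假设,非基准实际方案:

```python
import random

def flip_independent(labels, p, seed=0):
    """样本无关噪声: 每个二元标签以固定概率 p 翻转."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < p else y for y in labels]

def flip_length_dependent(labels, lengths, base=0.05, scale=0.3, seed=0):
    """样本相关噪声: 上下文越长, 翻转概率越高(上限 base+scale)."""
    rng = random.Random(seed)
    max_len = max(lengths)
    out = []
    for y, length in zip(labels, lengths):
        p = base + scale * (length / max_len)
        out.append(1 - y if rng.random() < p else y)
    return out

clean = [0, 1] * 5000
noisy = flip_independent(clean, p=0.2)
rate = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
```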

[AI-157] Epistemic Uncertainty for Test-Time Discovery

【速读】:该论文试图解决使用大语言模型进行自动化科学发现时,标准强化学习因惩罚高方差突变而偏好熟悉模式、导致最大奖励停滞(即使平均奖励上升)从而无法识别真正新颖解决方案的问题。解决方案的关键在于 UG-TTT 方法:在冻结的基础模型上维护一个由低秩适配器形成的小型集成,通过计算集成预测与权重假设之间的互信息来量化每个词元的认知不确定性,从而区分因训练覆盖不足导致的适配器发散与因问题本身困难导致的置信度低;该互信息作为探索奖励加入策略梯度,引导策略关注适配器持续不一致(即低训练覆盖)的区域——这正是真正发现可能发生的前沿;同时采用核范数正则化器确保适配器彼此保持差异,维持探索信号在整个训练过程中的稳定性。

链接: https://arxiv.org/abs/2605.11328
作者: Kainat Riaz,Muhammad Ahmed Mohsin,Ahsan Bilal,Muhammad Umer,Ayesha Mohsin,Aqib Riaz,Ali Subhan,John M. Cioffi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated scientific discovery using large language models relies on identifying genuinely novel solutions. Standard reinforcement learning penalizes high-variance mutations, which leads the policy to prioritize familiar patterns. As a result, the maximum reward plateaus even as the average reward increases. Overcoming this limitation requires a signal that distinguishes unexplored regions from intrinsically difficult problems. This necessitates measuring disagreement across independently adapted weight hypotheses rather than relying on a single network’s confidence. UG-TTT addresses this challenge by maintaining a small ensemble of low-rank adapters over a frozen base model. The per-token disagreement, quantified as the mutual information between ensemble predictions and weight hypotheses, isolates epistemic uncertainty and identifies positions where insufficient coverage leads to adapter divergence rather than intrinsic problem difficulty. This measure is incorporated as an exploration bonus into the policy gradient, directing the policy toward positions where persistent adapter disagreement signals low training coverage, the same frontier where genuine discovery is possible. A nuclear norm regularizer ensures the adapters remain distinct from one another, thereby preserving the exploration signal throughout training. Across four scientific discovery benchmarks, UG-TTT increases the maximum reward on three tasks, maintains substantially higher solution diversity, and an ablation study confirms that the regularizer is essential for sustaining this behavior.
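
其中"集成预测与权重假设之间的互信息"即 BALD 式的认知不确定性:MI = H(平均预测) − 平均(各成员预测熵)。成员自信但互相矛盾时互信息高(训练覆盖不足),成员一致时互信息约为 0。纯 Python 草图(演示用,非论文实现):

```python
import math

def entropy(p):
    """离散分布的香农熵(自然对数)."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def epistemic_mi(member_probs):
    """member_probs: 各适配器成员对同一词元的类别分布.
    返回 H(均值分布) - 均值(成员熵), 即认知不确定性."""
    k = len(member_probs)
    mean = [sum(p[i] for p in member_probs) / k
            for i in range(len(member_probs[0]))]
    return entropy(mean) - sum(entropy(p) for p in member_probs) / k

# 成员自信但分歧 -> 高认知不确定性; 成员一致 -> 约为 0
mi_disagree = epistemic_mi([[0.99, 0.01], [0.01, 0.99]])
mi_agree = epistemic_mi([[0.9, 0.1], [0.9, 0.1]])
```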

[AI-158] Natural Language based Specification and Verification

【速读】:该论文试图解决的问题是:如何防止大语言模型(Large Language Models, LLMs)在生成代码时引入安全漏洞,即通过形式化验证(formal verification)确保生成代码的安全性,但现有验证流程依赖于用严格形式语言书写的规约(specifications),而先前尝试用LLM自动合成此类规约的成功率有限。解决方案的关键在于采用一种不同的方法:同时使用LLM来生成规约(以自然语言形式表达)以及组合式地(compositionally)验证实现是否符合这些自然语言规约,初步结果表明该方法颇具前景。

链接: https://arxiv.org/abs/2605.11315
作者: Zhaorui Li,Chengyu Song
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Recent frontier large language models (LLMs) have shown strong performance in identifying security vulnerabilities in large, mature open-source systems. As LLM-generated code becomes increasingly common, a natural goal is to prevent such models from producing vulnerable implementations in the first place. Formal verification offers a principled route to this objective, but existing verification pipelines typically require specifications written in rigid formal languages. Prior work has explored using LLMs to synthesize such specifications, with limited success. In this paper, we investigate a different approach: using LLMs both to generate specifications and to verify implementations compositionally when the specifications are expressed in natural language. Our preliminary results suggest that this approach is promising.

[AI-159] Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments IJCAI2026

【速读】:该论文试图解决现有基于Shapley的数据归因方法在仅保留少量数据(低数据场景)时,不适合用于修剪低价值数据的问题。解决方案的关键在于提出约束-数据-值-最大化(Constraint-Data-Value-Maximization, CDVM)方法,该方法将数据修剪建模为一个约束优化问题,在最大化总影响(total influence)的同时,对每个测试样本的过度贡献施加惩罚,从而在仅保留一小部分数据时仍能实现稳健的修剪性能。

链接: https://arxiv.org/abs/2605.11312
作者: Danilo Brajovic,David A. Kreplin,Marco F. Huber
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication at IJCAI 2026

点击查看摘要

Abstract:Attributing model behavior to training data is an evolving research field. A common benchmark is data removal, which involves eliminating data instances with either low or high values, then assessing a model’s performance trained on the modified dataset. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constraint-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.
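
CDVM 将修剪表述为"最大化总影响,同时惩罚对单个测试样本的过度贡献"的约束优化;下面用一个贪心近似示意该目标(影响矩阵数值与惩罚形式均为演示用假设,非论文的精确求解方式):

```python
def cdvm_greedy(influence, k, cap=1.0, lam=1.0):
    """influence[i][j]: 训练样本 i 对测试样本 j 的影响值.
    贪心选 k 个样本, 目标 = 总影响 - lam * (各测试超出 cap 的贡献)."""
    n_test = len(influence[0])
    selected, totals = set(), [0.0] * n_test
    for _ in range(k):
        best, best_gain = None, float("-inf")
        for i in range(len(influence)):
            if i in selected:
                continue
            gain = 0.0
            for j in range(n_test):
                new_t = totals[j] + influence[i][j]
                gain += influence[i][j]
                # 惩罚增量: 新增的超出 cap 的部分
                gain -= lam * (max(0.0, new_t - cap) - max(0.0, totals[j] - cap))
            if gain > best_gain:
                best, best_gain = i, gain
        selected.add(best)
        for j in range(n_test):
            totals[j] += influence[best][j]
    return selected

# 样本 1 几乎只重复样本 0 对 test0 的贡献; 带惩罚时改选覆盖 test1 的样本 2
phi = [[1.0, 0.0], [0.9, 0.0], [0.3, 0.5]]
```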

[AI-160] Beyond Similarity: Temporal Operator Attention for Time Series Analysis

【速读】:该论文试图解决时间序列预测中一个持续存在的悖论:结构简化的MLP(多层感知器)和线性模型在性能上往往优于高容量的Transformer模型。作者认为这一差距源于序列建模原语的错配——许多时间序列动态由全局时间算子(如滤波和谐波结构)主导,而标准注意力机制将每个输出形成输入的凸组合,这限制了其表达有符号和振荡变换的能力,而这种变换是时间信号处理的基础。他们将此限制形式化为softmax注意力中的单纯形约束混合瓶颈,这在算子驱动的时间序列任务中尤为突出。解决方案的关键在于提出时间算子注意力(Temporal Operator Attention, TOA),这是一种通过显式、可学习的序列空间算子增强注意力的框架,能够在保持输入自适应性的同时实现跨时间的直接有符号混合。为了使密集的N×N算子实用化,他们引入随机算子正则化(Stochastic Operator Regularization),一种高方差dropout机制,用于稳定训练并防止琐碎的记忆化。TOA在集成到PatchTST和iTransformer等标准骨干网络时,在预测、异常检测和分类基准上持续提升性能,尤其在重构密集型任务中增益显著。这表明显式算子学习是有效时间序列建模的关键要素。

链接: https://arxiv.org/abs/2605.11287
作者: Jevon Twitty,Vinh Pham,Nitiwith Rotchanarak,Viresh Pati,Yubin Kim,Shihao Yang,Jiecheng Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A persistent paradox in time-series forecasting is that structurally simple MLP and linear models often outperform high-capacity Transformers. We argue that this gap arises from a mismatch in the sequence-modeling primitive: while many time-series dynamics are governed by global temporal operators (e.g., filtering and harmonic structure), standard attention forms each output as a convex combination of inputs. This restricts its ability to represent signed and oscillatory transformations that are fundamental to temporal signal processing. We formalize this limitation as a simplex-constrained mixing bottleneck in softmax attention, which becomes especially restrictive for operator-driven time-series tasks. To address this, we propose Temporal Operator Attention (TOA), a framework that augments attention with explicit, learnable sequence-space operators, enabling direct signed mixing across time while preserving input-dependent adaptivity. To make dense N×N operators practical, we introduce Stochastic Operator Regularization, a high-variance dropout mechanism that stabilizes training and prevents trivial memorization. Across forecasting, anomaly detection, and classification benchmarks, TOA consistently improves performance when integrated into standard backbones such as PatchTST and iTransformer, with particularly strong gains in reconstruction-heavy tasks. These results suggest that explicit operator learning is a key ingredient for effective time-series modeling.
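
"单纯形约束混合瓶颈"可用几行代码直观演示:softmax 注意力权重非负且和为 1,输出只能是输入的凸组合,因而无法表达如一阶差分 [-1, +1] 这样的有符号时间算子;显式可学习算子则不受此限(纯 Python 演示,非 TOA 实现):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def mix(weights, values):
    """按权重线性混合一段序列."""
    return sum(w * v for w, v in zip(weights, values))

signal = [1.0, -1.0, 1.0, -1.0]          # 振荡信号
attn_w = softmax([0.3, 2.0, -1.0, 0.5])  # 任意注意力 logit

# softmax 权重落在单纯形内: 非负且和为 1, 输出被值域夹住
convex_out = mix(attn_w, signal)

# 显式算子可含负权重, 例如一阶差分算子 x[t] - x[t-1]
diff_out = mix([-1.0, 1.0], signal[:2])
```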

[AI-161] Curriculum Learning-Guided Progressive Distillation in Large Language Models

【速读】:该论文试图解决的问题是:在大型语言模型(LLMs)的知识蒸馏(knowledge distillation)过程中,现有方法通常忽略了训练数据的学习顺序(learning order)以及师生模型之间的能力不匹配(capacity mismatch),这导致了反直觉的“更强教师无法教出更优学生”现象,从而限制了蒸馏性能。解决方案的关键在于提出了课程学习引导的渐进式蒸馏(Curriculum Learning-Guided Progressive Distillation, CLPD)统一框架,该框架通过两个对齐机制显式地处理上述问题:一方面构建显式课程(explicit curriculum),将训练样例按从易到难的顺序组织;另一方面通过逐步调度能力递增的教师模型来施加隐式课程(implicit curriculum)于监督信号之上。该框架可模块化地集成到标准蒸馏算法中,并在推理基准上一致优于单独考虑数据排序或教师调度的方案。

链接: https://arxiv.org/abs/2605.11260
作者: Jincheng Cao,Fanzhi Zeng,Leqi Liu,Aryan Mokhtari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge distillation is a key technique for transferring the capabilities of large language models (LLMs) into smaller, more efficient student models. Existing distillation approaches often overlook two critical factors: the learning order of training data and the capacity mismatch between teacher and student models. This oversight limits distillation performance, as manifested by the counter-intuitive phenomenon where stronger teachers fail to produce better students. In this work, we propose Curriculum Learning-Guided Progressive Distillation (CLPD), a unified framework that explicitly accounts for both factors by aligning data difficulty with teacher strength. CLPD constructs an explicit curriculum by organizing training examples from easy to hard, while simultaneously applying an implicit curriculum over supervision signals by progressively scheduling teachers of increasing capacity. Our framework is modular and can be integrated into standard distillation algorithms with minimal overhead. Empirical results on the reasoning benchmarks demonstrate that CLPD consistently outperforms standard distillation, data ordering alone, and teacher scheduling alone across multiple settings. These findings highlight the importance of jointly considering data ordering and teacher capacity when distilling reasoning abilities into small language models.

[AI-162] mplate-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation

【速读】:该论文旨在解决基于大型语言模型(LLM)的AI智能体在制造环境中部署时面临的验证数据稀缺问题:生产执行系统(MES)数据具有专有性、隐私限制和供应商特异性,导致AI工具无法获得符合模式(Schema)且填充正确的数据用于验证。解决方案的关键在于提出了“模板即本体”(Template-as-Ontology)原则,即通过一个单一的Python配置模块(约700-770行代码,45个验证导出),同时作为时间步进制造模拟器的规范定义和AI分析工具的运行时域本体(Domain Schema),从而由构造而非集成方式保证模拟层与工具层之间的结构对齐。该原则将领域模板定义为类型化的关系型配置模式,通过单源消费(single-source consumption)确保对齐的确定性。在此基础上构建的五层流水线(仿真、PostgreSQL、CDC/Iceberg湖仓、星型模式以及12个参数化AI工具)能够生成覆盖四个运营领域(对应ISA-95/IEC 62264标准)中66种实体类型的因果一致、类MES数据。通过六个行业模板的校准实验(每种模板10个随机种子,共60次运行)验证了参数可控性,并基于Qwen3-32B模型的受控幻觉实验(72次工具调用)证明,本体约束的参数完全消除了工具参数编造(约束条件下幻觉率为0%,而非约束条件下为43%,Fisher精确检验p < 10^-12),且0%幻觉率是适用于任何模型的架构性保证。

链接: https://arxiv.org/abs/2605.11259
作者: Grama Chethan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 1 figure

点击查看摘要

Abstract:Large language model (LLM)-based AI agents deployed in manufacturing environments require populated, schema-correct data for validation, yet production MES data is proprietary, privacy-encumbered, and vendor-specific. This paper introduces the Template-as-Ontology principle: a single Python configuration module (700-770 lines, 45 validated exports) serves simultaneously as the specification for a time-stepped manufacturing simulator and as the runtime domain schema for AI analytics tools, producing alignment by construction rather than integration. We formally define the domain template as a typed relational configuration schema and prove that structural alignment between simulation and tool layers is guaranteed by single-source consumption. A five-layer pipeline–simulation, PostgreSQL, CDC/Iceberg lakehouse, star schema, and 12 parameterized AI tools–generates causally coherent, MES-shaped data spanning 66 entity types across four operational domains mapped to ISA-95/IEC 62264. We validate the architecture with six industry templates (aerospace, pharma, automotive, electronics, beverages, warehousing) running on identical framework code. Calibration experiments (60 runs, 10 seeds per template) confirm parametric controllability: observed KPIs fall within configured ranges across all templates. A controlled hallucination experiment (72 tool invocations, Qwen3-32B) demonstrates that ontology-constrained parameters eliminate tool-parameter fabrication (0% constrained vs. 43% unconstrained hallucination rate for the evaluated model, Fisher's exact test p < 10^-12); the 0% constrained rate is an architectural guarantee that holds for any model. The framework provides a reusable data layer for discrete manufacturing AI validation.

[AI-163] Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

【速读】:该论文旨在解决大型语言模型强化微调(Reinforcement Fine-Tuning, RFT)中课程学习(Curriculum Learning)的外部化问题——当前方法依赖手工设计的启发式规则或辅助模型来进行课程判断,容易与策略的训练动态产生错配。解决方案的关键在于提出METIS(METacognitive Internalized Self-judgment)框架,它将课程判断内化为模型的原生能力:基于“prompt内的奖励方差能有效衡量prompt信息量”这一关键发现,METIS利用近期训练结果作为轻量级上下文学习(In-Context Learning)示例来预测该指标,从而内在自发地进行自我判断,动态决定训练资源的分配;同时,通过联合优化标准RFT奖励和自判断奖励(Self-judgment Reward),形成判断与优化之间的闭环,使策略能够以元认知(Metacognition)的形式学习“下一步应学习什么”。该方法在数学推理、代码生成和智能体函数调用等离散与连续RFT基准上均表现优越,收敛速度提升最高达67%,建立了一种简单、闭环且高效的课程内化范式。

链接: https://arxiv.org/abs/2605.11235
作者: Han Zheng,Yining Ma,Karthick Gunasekaran,Bharathan Balaji,Zheng Du,Shiv Vitaladevuni,Cathy Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet, current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the policy’s training dynamics. In this paper, we introduce METIS (METacognitive Internalized Self-judgment), a novel framework that internalizes curriculum judgment as a native capability. Leveraging a critical observation that within-prompt reward variance effectively gauges prompt informativeness, METIS predicts this metric based on recent training outcomes as lightweight in-context learning examples. This intrinsic self-judgment then dynamically dictates the training allocation. Moreover, METIS closes the loop between judgment and optimization by jointly optimizing the standard RFT rewards and a self-judgment reward. This allows the policy to learn what to learn next, as a form of metacognition. Across extensive discrete and continuous RFT benchmarks from mathematical reasoning, code generation, to agentic function-calling, METIS consistently delivers superior performance while accelerating convergence by up to 67%. By bypassing handcrafted heuristics and auxiliary models, our work establishes a simple, closed-loop, and highly efficient curriculum internalization paradigm for LLM reinforcement fine-tuning.
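
其核心信号"prompt 内奖励方差"以及按方差分配训练资源可示意如下:全对或全错的 prompt 方差为 0(信息量低),奖励参半的"边界难度" prompt 方差最大,应分得最多训练资源(按方差比例分配的规则为演示用假设):

```python
def reward_variance(rewards):
    """同一 prompt 的 G 次 rollout 奖励方差: 全对或全错时为 0."""
    m = sum(rewards) / len(rewards)
    return sum((r - m) ** 2 for r in rewards) / len(rewards)

def allocate(prompt_rewards, budget):
    """按方差比例分配训练预算: 方差越大(信息量越高)分得越多."""
    variances = {p: reward_variance(rs) for p, rs in prompt_rewards.items()}
    z = sum(variances.values()) or 1.0   # 防止全零时除零
    return {p: budget * v / z for p, v in variances.items()}

rollouts = {
    "solved":   [1, 1, 1, 1],   # 已掌握: 方差 0
    "frontier": [1, 0, 1, 0],   # 边界难度: 方差最大
    "too_hard": [0, 0, 0, 0],   # 全错: 方差 0
}
plan = allocate(rollouts, budget=100.0)
```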

[AI-164] he Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems

【速读】:该论文试图解决基于大语言模型(LLM)的AI代理在制造环境中存在的语义训练鸿沟(semantic training gap)问题:即AI系统通过训练获得的领域词汇统计流畅性与制造运营通过本体论关系(ontological relationships)定义的实际语义之间存在结构性脱节,这导致即使模型输出在语言上精确,也会产生操作上不正确的结果,并在多代理配置中引发称为语义漂移(semantic drift)的复合失效模式。解决方案的关键在于将制造本体(manufacturing ontology)直接嵌入AI工具层,作为类型化关系配置(typed relational configuration),在运行时强制执行语义约束而非依赖模型训练;该架构被形式化为一个三操作接口契约(resolve, contextualize, annotate),并由AIOps编排层强制执行其不变量,从而消除工具调用中的幻觉(hallucination)并实现跨领域可配置性,无需修改应用代码。

链接: https://arxiv.org/abs/2605.11234
作者: Grama Chethan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 2 figures

点击查看摘要

Abstract:Large language model (LLM)-based AI agents are increasingly deployed in manufacturing environments for analytics, quality management, and decision support. These agents demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics – the relational structure that connects equipment identifiers, process parameters, failure codes, and regulatory constraints within a specific production context. This paper identifies and formalizes the semantic training gap: a structural disconnect between how AI systems acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships. We demonstrate that this gap causes operationally incorrect outputs even when model responses are linguistically precise, and that in multi-agent configurations it produces a compounding failure mode we term semantic drift. To close this gap, we present an architecture that embeds manufacturing ontology directly into the AI tool layer as a typed relational configuration, enforcing semantic constraints at runtime rather than relying on model training. The architecture is formalized as a three-operation interface contract – resolve, contextualize, annotate – with invariants enforced by an AIOps orchestration layer. In a controlled experiment across six industry configurations (72 tool invocations using Qwen3-32B), unconstrained tool parameters produced a 43% hallucination rate for domain identifiers; ontology-grounded parameters reduced this to 0%. We validate the approach through a digital twin analytics platform demonstrating that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.
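
论文提出的三操作接口契约(resolve, contextualize, annotate)可用一个极简的类型化本体字典示意:参数只能解析为本体中已有的标识符,未知名称被直接拒绝,因而"按构造"消除了标识符幻觉。以下实体、别名与字段均为演示用假设:

```python
# 演示用本体: 设备标识符、别名映射与关系上下文均为虚构
ONTOLOGY = {
    "equipment": {"CNC-01": {"line": "LINE-A", "failure_codes": ["F-102"]}},
    "aliases": {"cnc 1": "CNC-01", "mill #1": "CNC-01"},
}

def resolve(name):
    """把自由文本名解析为本体中的规范标识符, 未知名称直接拒绝."""
    canon = ONTOLOGY["aliases"].get(name.strip().lower(), name)
    if canon not in ONTOLOGY["equipment"]:
        raise KeyError(f"unknown equipment: {name!r}")
    return canon

def contextualize(eq_id):
    """返回该实体在本体中的关系上下文."""
    return ONTOLOGY["equipment"][eq_id]

def annotate(eq_id, text):
    """为模型输出附加本体元数据, 保证可追溯."""
    return {"text": text, "entity": eq_id, "line": contextualize(eq_id)["line"]}
```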

[AI-165] Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack

【速读】:该论文旨在解决反洗钱(AML)和欺诈检测工作负载下大语言模型(LLMs)服务中因提示前缀密集、模式约束严格、证据丰富而导致的系统效率低下问题——具体体现为低吞吐量、高延迟和低GPU利用率。其解决方案之关键在于构建一个面向工作负载的LLMOps栈(workload-aware LLMOps stack),核心组件包括:利用自动前缀缓存(Automatic Prefix Caching)和PagedAttention优化KV缓存效率,通过适配器与提示长度感知的批处理(adapter and prompt-length-aware batching)、休眠/唤醒生命周期管理及推测解码(speculative decoding)提升资源利用,并引入LLM-as-judge质量门控(包括确定性合规检查、参考指标和多法官评分)确保输出可靠性。该栈通过将公开合成AML数据集转换为前缀密集的合规提示来避免暴露机构数据,从而在受控基准上实现了吞吐量提升约5倍、延迟降低超过4倍、GPU利用率从12%跃升至78%的显著效果。

链接: https://arxiv.org/abs/2605.11232
作者: Prathamesh Vasudeo Naik,Naresh Dintakurthi,Yue Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fraud detection and anti-money-laundering (AML) compliance are high-value domains for large language models (LLMs), but their serving requirements differ sharply from generic chat workloads. Compliance prompts are often prefix-heavy, schema-constrained, and evidence-rich, combining reusable policy instructions, risk taxonomies, transaction or document context, and short structured outputs such as JSON labels or risk factors. These properties make prefix reuse, KV-cache efficiency, runtime tuning, model orchestration, and output validation first-order systems concerns. This paper introduces a workload-aware LLMOps stack for fraud and AML workloads using self-hosted open-weight models such as Meta Llama and Alibaba Qwen. The stack combines vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter and prompt-length-aware batching, sleep/wake lifecycle management, speculative decoding, and optional prefill/decode disaggregation. To avoid exposing institution-specific data, the reproducibility track converts public synthetic AML datasets, including IBM AML and SAML-D, into prefix-heavy compliance prompts with reusable policy text, transaction evidence, typology definitions, and schema-constrained outputs. We also incorporate an LLM-as-judge quality gate using deterministic compliance checks, reference metrics, expert-adjudicated calibration data where available, and multi-judge rubric scoring. Across public-synthetic AML workloads and controlled serving benchmarks, workload-aware tuning improved throughput from 612-650 to 3,600 requests/hour, reduced P99 latency from 31-38 seconds to 6.4-8.7 seconds, and increased GPU utilization from 12% to 78%. These results show that regulated LLM performance is a workload-design, serving-optimization, and quality-gating problem, not only a model-selection problem.
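
"前缀密集"负载之所以受益于自动前缀缓存,是因为复用的策略文本可按块做链式哈希,命中缓存的块直接复用 KV 缓存、跳过重复预填充。以下用块级哈希示意该机制(块大小与接口为演示用假设,非 vLLM 实现):

```python
import hashlib

def block_hashes(tokens, block_size=4):
    """按块对 token 前缀做链式哈希: 相同前缀 -> 相同哈希序列."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % block_size, block_size):
        h.update(" ".join(tokens[i:i + block_size]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

class PrefixCache:
    def __init__(self):
        self.store, self.hits, self.misses = set(), 0, 0

    def prefill(self, tokens):
        """返回需要真正计算的块数, 命中的块直接复用."""
        compute = 0
        for hh in block_hashes(tokens):
            if hh in self.store:
                self.hits += 1
            else:
                self.misses += 1
                self.store.add(hh)
                compute += 1
        return compute

policy = ["policy"] * 8          # 共享的合规策略前缀(2 个块)
cache = PrefixCache()
first = cache.prefill(policy + ["txn", "A", "B", "C"])   # 3 块全部未命中
second = cache.prefill(policy + ["txn", "D", "E", "F"])  # 策略前缀 2 块命中
```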

[AI-166] LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection

【速读】:该论文试图解决合成数据生成中普遍存在的效率低下问题——即大量生成的合成样本对下游任务无益,甚至引入噪声,只有那些能够填补训练数据分布中缺失且对任务关键的部分才真正有用。解决方案之关键在于LiBaGS方法,它通过一个多准则评分系统(同时考量决策边界接近度、预测不确定性、真实数据密度和支持有效性)来筛选信息量高且位于真实数据流形上的候选样本,并采用“边界间隙分配规则”优先选择稀疏但真实的决策边界邻域,而非盲目增加数据或仅依据不确定性。此外,LiBaGS通过边际价值停止规则动态判断何时已添加足够样本,在模糊边界处赋予更软标签,并引入多样性目标以避免冗余重复选择,从而在保证样本质量的同时实现高效的合成数据增广。

链接: https://arxiv.org/abs/2605.11231
作者: Abhishek Moturu,Anna Goldenberg,Babak Taati
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Synthetic data is useful only when the added samples fill missing parts of the training distribution that matter for the downstream task. We introduce LiBaGS, a lightweight, generator-agnostic method for targeted synthetic training data selection. LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity, so that selected samples are both informative and likely to remain on the real data manifold. We then use a boundary-gap allocation rule that targets sparse but realistic decision-boundary neighborhoods, rather than simply adding more data or selecting only the most uncertain candidates. LiBaGS also learns when enough synthetic samples have been added through a marginal-value stopping rule, assigns softer labels near ambiguous boundaries, and uses a diversity objective to avoid redundant near-duplicate selections. Experiments show that LiBaGS improves accuracy over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.
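
LiBaGS 的多准则评分可示意为四个分量的组合:靠近决策边界、预测不确定、落在真实数据高密度区、且通过有效性检查的候选才得高分。以下的乘积式组合与各分量形式均为演示用假设,非论文的精确公式:

```python
import math

def libags_score(p_pos, density, valid):
    """p_pos: 预测正类概率; density: 真实数据密度 [0,1]; valid: 有效性."""
    if not valid:
        return 0.0                               # 偏离真实流形的候选直接淘汰
    boundary = 1.0 - 2.0 * abs(p_pos - 0.5)      # 越接近 0.5 越靠近边界
    eps = 1e-12
    uncertainty = -(p_pos * math.log2(p_pos + eps)
                    + (1 - p_pos) * math.log2(1 - p_pos + eps))
    return boundary * uncertainty * density

def select_top_k(candidates, k):
    """candidates: [(id, p_pos, density, valid)], 返回得分最高的 k 个 id."""
    scored = sorted(candidates,
                    key=lambda c: libags_score(c[1], c[2], c[3]),
                    reverse=True)
    return [c[0] for c in scored[:k]]

cands = [
    ("near_boundary_dense",  0.52, 0.9, True),
    ("near_boundary_sparse", 0.50, 0.1, True),
    ("confident",            0.99, 0.9, True),
    ("off_manifold",         0.50, 0.9, False),
]
picked = select_top_k(cands, 2)
```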

[AI-167] Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

【速读】:该论文旨在解决自动化平台(如GitHub Actions和n8n)中代理工作流(agentic workflows)因集成大语言模型(LLM)代理而面临的新型安全风险:攻击者可通过控制特定输入(如GitHub issue评论)诱导LLM代理执行凭证泄露、任意命令等恶意操作。此前学术界尚未系统研究此类风险。解决方案的核心是提出的首个检测与利用框架JAW,其关键创新在于**上下文导向进化(Context-Grounded Evolution)**方法——通过混合程序分析(hybrid program analysis)动态生成代理工作流的上下文,并据此演化输入以实现劫持。具体而言,JAW执行三类分析:静态路径可行性分析(static path-feasibility analysis)识别可行的代理调用路径及触发所需输入约束;动态提示来源分析(dynamic prompt-provenance analysis)追踪输入如何被转换并嵌入LLM上下文;能力分析(capability analysis)识别代理在运行时可执行的操作与限制。这三者共同为输入进化提供精确的上下文指导,从而系统性地发现可被劫持的工作流。

链接: https://arxiv.org/abs/2605.11229
作者: Neil Fendley,Zhengyu Liu,Aonan Guan,Jiacheng Zhong,Yinzhi Cao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Automation platforms such as GitHub Actions and n8n are increasingly adopting so-called agentic workflows, which integrate Large Language Model (LLM) agents for tasks such as code review and data synchronization. While bringing convenience for developers, this integration exposes a new risk: An adversary may control and craft certain inputs, such as GitHub issue comments, to manipulate the LLM agent for unwanted actions, such as credential exfiltration and arbitrary command execution. To our knowledge, no prior academic work has studied such a risk in agentic workflows. In this paper, we design the first detection and exploitation framework, called JAW, to hijack agentic workflows hosted on automation platforms via a novel approach called Context-Grounded Evolution. Our key idea is to evolve agentic workflow inputs under the contexts derived from hybrid program analysis for hijacking purposes. Specifically, JAW generates agentic workflow contexts through three analyses: (i) static path-feasibility analysis to identify feasible agent-invocation paths and the input constraints required to trigger them, (ii) dynamic prompt-provenance analysis to determine how that input is transformed and embedded into the LLM context, and (iii) capability analysis to identify the actions and restrictions available to the agent at runtime. Our evaluation of JAW on GitHub workflows and n8n templates showed that 4714 GitHub workflows and eight n8n templates can be successfully hijacked, for example, to leak user credentials. Our findings span 15 widely-used GitHub Actions, including official GitHub Actions for Claude Code, Gemini CLI, Qwen CLI, and Cursor CLI, and two official n8n nodes. We responsibly disclosed all findings to the affected vendors and received many acknowledgements, fixes, and bug bounties, notably from GitHub, Google, and Anthropic.

[AI-168] Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

【速读】:该论文试图解决的问题是:现有视觉-语言-动作模型(Vision-Language-Action Models, VLMs)基准测试未能充分评估其在需要复杂物理推理和精确连续动作交互的点击解谜游戏中的类人逻辑问题解决能力。解决方案的关键是:构建了一个名为VLATIM的基准测试,基于经典物理谜题游戏《The Incredible Machine 2》,通过五个渐进式部分系统评估模型从基础视觉定位、领域理解到多步操作和完整解谜的能力,从而揭示高级逻辑推理与精确鼠标交互执行之间的显著差距。

链接: https://arxiv.org/abs/2605.11223
作者: Dominik Helfenstein,Marco Menner,Maximilian Triebel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language(-Action) Models (VLMs) are increasingly applied to interactive environments, yet existing benchmarks often overlook the complex physical reasoning required for point-and-click puzzle games. This paper introduces Vision-Language Against The Incredible Machine (VLATIM), a benchmark designed to evaluate human-like logical problem-solving capabilities within the classic physics puzzle game The Incredible Machine 2 (TIM). Unlike existing benchmarks, VLATIM specifically targets the critical gap between high-level logical reasoning and continuous action spaces requiring precise mouse interactions. This benchmark is structured into five progressive parts, assessing capabilities that range from basic visual grounding and domain understanding to multi-step manipulation and full puzzle solving. Our results reveal a significant disparity between reasoning and execution. While large proprietary models demonstrate superior planning abilities, they struggle with precise visual grounding. Consequently, they do not yet show human-like problem-solving capabilities.

[AI-169] Don't Look at the Numbers: Visual Anchoring Bias and Layer-wise Representation in VLMs

【速读】:该论文旨在揭示视觉-语言模型(VLMs)中因嵌入式数字锚点(embedded numeric anchors)而产生的系统性质量判断偏差,即视觉锚定偏差(visual anchoring bias),并探究其背后的表征动态与因果机制。解决方案的关键在于通过层间探测(layer-wise probing)发现锚点分类饱和层(L12-L34)与质量预测最优层(更深层,R²=0.69-0.91)之间存在一致性分离,进而利用融合分析(fusion analysis)识别出架构依赖的融合模式(如两种模型在L1-L2实现即时融合,而另外三种模型呈现部分融合或无融合),从而将行为层面的偏差与模型内部的表征动态联系起来,建立因果解释。

链接: https://arxiv.org/abs/2605.11218
作者: M. Shalankin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Embedded numeric anchors on images systematically bias Vision-Language Model quality judgments across six VLMs from five architectural families (ANOVA eta^2 = 0.18-0.77, all p < 0.001). Anchor effects are 2.5x larger than severe image quality degradation, confirming bias is not reducible to visual changes. Layer-wise probing reveals consistent dissociation: layers where anchor classification saturates (L12-L34) are suboptimal for quality prediction, with optimal layers deeper (R^2 = 0.69-0.91). Fusion analysis identifies architecture-dependent integration – instant fusion at L1-L2 in two models versus partial or no fusion in three others. These results establish a causal account of visual anchoring bias, linking behavioral susceptibility to representation dynamics.
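
摘要中的逐层探测思路可以这样示意:对每一层的隐藏表示训练一个岭回归探针,比较各层预测质量分数的 R²。以下是合成数据上的极简版本(非论文的完整实验流程,仅演示"探测层"与"最优层"的分离):

```python
import numpy as np

def layerwise_probe_r2(hidden_states, y, lam=1.0):
    """逐层线性探测的极简示意: 对每层隐藏表示做岭回归,
    返回各层的训练集内 R^2(仅作演示)。
    hidden_states: shape (L, n, d); y: shape (n,)
    """
    r2s = []
    for H in hidden_states:
        Hc = H - H.mean(0)                 # 中心化特征与目标
        yc = y - y.mean()
        w = np.linalg.solve(Hc.T @ Hc + lam * np.eye(H.shape[1]), Hc.T @ yc)
        pred = Hc @ w
        r2s.append(1.0 - ((yc - pred) ** 2).sum() / (yc ** 2).sum())
    return np.array(r2s)
```

在合成数据中只给某一"层"注入与质量相关的信号,探针就会在该层取得明显更高的 R²,对应摘要中"最优层更深"这类层间分离的读法。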

[AI-170] Leveraging RAG for Training-Free Alignment of LLMs

【速读】:该论文试图解决现有大语言模型(LLM)对齐算法在应对新兴代理攻击(agentic attacks)时,因计算资源消耗大且拒绝防护能力不足而导致的安全漏洞问题。解决方案的关键在于引入基于检索增强生成(Retrieval Augmented Generation, RAG)的在线对齐算法RAG-Pref,该算法在推理阶段通过检索并条件化于偏好与非偏好样本,利用对比信息来动态调整输出,无需额外训练即可增强拒绝护栏,且与离线对齐算法结合后,能以极低的额外计算开销实现平均3.7倍的代理攻击拒绝性能提升,同时维持甚至改善通用人类偏好对齐任务的表现。

链接: https://arxiv.org/abs/2605.11217
作者: John T. Halloran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 19 pages, 4 figures, and 6 tables

点击查看摘要

Abstract:Large language model (LLM) alignment algorithms typically consist of post-training over preference pairs. While such algorithms are widely used to enable safety guardrails and align LLMs with general human preferences, we show that state-of-the-art alignment algorithms require significant computational resources while being far less capable of enabling refusal guardrails for recent agentic attacks. Thus, to improve refusal guardrails against such attacks without drastically increasing computational overhead, we introduce Retrieval Augmented Generation for Pref erence alignment (RAG-Pref), a simple RAG-based alignment algorithm which conditions on preferred and dispreferred samples to leverage contrastive information during inference. RAG-Pref is online (training-free), compatible with off-the-shelf packages, and, when combined with offline (training-based) alignment algorithms, enables more than an average 3.7 factor improvement in agentic attack refusals across five widely used LLMs, compared to 2.9 for other online alignment algorithms and 1.5 for offline alignment alone. We conclude by showing that, in stark contrast to other online alignment methods, RAG-Pref similarly increases performance on general human-preference alignment tasks and does not drastically increase overall computational requirements.
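
RAG-Pref 的核心是在推理时检索偏好与非偏好样本并拼入提示,让模型在生成前看到对比信息。以下为这一思路的极简示意(检索方式、提示格式与接口均为假设,非官方实现):

```python
import numpy as np

def build_ragpref_prompt(query_vec, query_text, pref_bank, dispref_bank, k=2):
    """按余弦相似度从偏好/非偏好样本库各检索 k 条, 拼成推理时提示。
    bank 为 (向量矩阵, 文本列表) 二元组; 这是假设的简化接口。
    """
    def topk(bank):
        vecs, texts = bank
        sims = vecs @ query_vec / (
            np.linalg.norm(vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
        return [texts[i] for i in np.argsort(-sims)[:k]]

    lines = (["[Preferred examples]"] + topk(pref_bank)
             + ["[Dispreferred examples]"] + topk(dispref_bank)
             + ["[Query]", query_text])
    return "\n".join(lines)
```

由于只改动推理时的输入,这种做法是免训练的,与摘要中"online (training-free)、可叠加在离线对齐之上"的定位一致。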

[AI-171] ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

【速读】:该论文旨在解决大规模GPU集群预训练大语言模型(LLMs)时硬件故障成为常态所引发的弹性训练问题——现有框架或局限于特定并行方案,或存在训练轨迹偏离无故障参考的风险。解决方案的关键在于维持一个不变性约束:每个迭代步保持微批次(microbatch)数量恒定,使得每迭代步的梯度在随机意义上与无故障运行等价。具体实现由三个解耦协议层构成:容错集合(Fault-tolerant collectives)隔离故障跨副本传播;步内细粒度恢复(in-step fine-grained recovery)保护迭代内进度并防止梯度损坏;灵活工作负载策略(versatile-workload policy)在幸存节点间动态重分配微批次配额。该设计是并行性无关(parallelism-agnostic)的,可作为即插即用基板直接集成到3D并行(3D parallelism)和混合分片数据并行(HSDP, Hybrid Sharded Data Parallel)中,从而在持续故障下保持训练轨迹一致,并显著提升有效吞吐量。

链接: https://arxiv.org/abs/2605.11215
作者: Ziyue Liu,Zhengyang Wang,Ruijie Zhang,Avinash Maurya,Hui Zhou,Paul Hovland,Sheng Di,Franck Cappello,Bogdan Nicolae,Zheng Zhang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) Fault-tolerant collectives that isolate faults from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is parallelism-agnostic, integrating directly with both 3D parallelism and Hybrid Sharded Data Parallel (HSDP) as a drop-in substrate. We evaluate our implementation on end-to-end pre-training tasks for up to 512 GPUs, ReCoVer successfully preserves the training trajectory from a failure-free reference despite of 256 GPUs lost spread across the run. For comparison with checkpoint-and-restart baselines, ReCoVer demonstrates 2.23\times higher effective throughput after successive failures. This advantage results in ReCoVer processing 74.9% more tokens at 234 GPU-hours, with the gap widening as the training prolongs.
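
ReCoVer 的不变量是:每个迭代步的微批次总数恒定。故障后把配额重新均摊给幸存副本的最简分配策略可示意如下(假设的简化策略,非论文完整的 versatile-workload 协议):

```python
def redistribute_microbatches(total_mb, survivors):
    """把固定总量的微批次尽量均匀地分给幸存副本, 保持每步总数不变,
    从而使每迭代步梯度与无故障运行在随机意义上等价。
    返回 {replica_id: 配额}。
    """
    n = len(survivors)
    base, rem = divmod(total_mb, n)   # 先均分, 余数按序多分 1 个
    return {rid: base + (1 if i < rem else 0)
            for i, rid in enumerate(sorted(survivors))}
```

例如 32 个微批次原由 4 个副本各担 8 个;副本 1 故障后,剩余 3 个副本分担 11/11/10,总数仍是 32。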

[AI-172] The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps and How Item Response Theory Recovers Ground Truth Across Domains

【速读】:该论文试图解决在人工智能及安全关键领域中,基准评估(benchmark evaluation)广泛采用的简单平均法(simple averaging)在评价矩阵稀疏(sparse)且项目难度差异大时产生严重误导性排序的问题。通过四个领域(NLP中的GLUE、临床试验、自动驾驶车辆安全、网络安全)的模拟实验,论文证明在覆盖率为67%且难度异质性高时,简单平均排序与真实排序的Spearman秩相关系数从1.000下降至0.809,而标准的两参数逻辑项目反应理论模型(two-parameter logistic Item Response Theory, 2PL IRT)在所有条件下均维持ρ ≥ 0.996。解决方案的关键是引入项目反应理论模型替代简单平均,该模型通过同时估计项目难度与被试能力,能有效应对稀疏矩阵和难度差异极端的情境,从而提供稳健且准确的排序。

链接: https://arxiv.org/abs/2605.11205
作者: Jung Min Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 tables, 1 figure. Code at this https URL

点击查看摘要

Abstract:Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains – NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity – we show that Spearman rank correlation \rho between simple-average rankings and ground-truth rankings degrades from \rho = 1.000 at 100% coverage to \rho = 0.809 at 67% coverage with high difficulty heterogeneity (mean over 20 seeds). A standard two-parameter logistic (2PL) Item Response Theory (IRT) model maintains \rho \geq 0.996 across all conditions. A 150-condition grid sweep over sparsity S \in [0, 0.70] and difficulty gap D \in [0.5, 5.0] confirms that ranking error forms a failure surface with a strong S \times D interaction ( \gamma_3 = +0.20 , t = 13.05 ), while IRT maintains \rho \geq 0.993 throughout. We discuss implications for Physical AI benchmarking, where evaluation matrices are often incomplete and difficulty gaps are extreme.
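
文中的 2PL IRT 是标准模型:答对概率为 sigmoid(a·(θ − b)),其中 θ 为被试能力、a 为项目区分度、b 为项目难度。下面用一个假设的微型例子说明为何在覆盖不完整且难度差异大时,简单平均会颠倒排名:

```python
import numpy as np

def p_correct_2pl(theta, a, b):
    """2PL IRT: 能力为 theta 的被试在区分度 a、难度 b 的项目上答对的概率。"""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# 假设的例子: 能力更高的模型 A (theta=1.0) 只被难题 (b=1.0) 覆盖,
# 能力较低的 B (theta=0.0) 只被易题 (b=-1.0) 覆盖。
exp_acc_A = p_correct_2pl(1.0, 1.0, 1.0)    # 期望正确率 0.5
exp_acc_B = p_correct_2pl(0.0, 1.0, -1.0)   # 期望正确率约 0.731
# 简单平均据此会把 B 排在 A 之前; 而同时估计项目难度与被试能力的
# IRT 模型能把难度差异从能力估计中分离出来, 恢复真实排序。
```

这正是摘要中稀疏度 S 与难度差 D 交互作用(S × D)导致排名崩塌的最小版本。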

[AI-173] Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing

【速读】:该论文旨在解决大型语言模型(LLM)推理和服务系统中因并发、缓存与状态重用所引发的服务层(serving layer)安全与可靠性漏洞——这类漏洞无法被传统的模型测试、安全测试或API测试所覆盖,典型表现包括KV缓存隔离失败、跨请求性能干扰(noisy-neighbor denial of service)、静默输出污染以及延迟崩溃等。解决方案之关键是一种名为GRIEF的灰盒模糊测试器(greybox fuzzer),其核心设计包括:(1) 将定时多请求轨迹(timed multi-request traces)作为一等输入,以模拟真实并发工作负载下的共享状态行为;(2) 使用轻量级的oracle(如崩溃、挂起、性能异常和静默输出损坏检测)快速识别异常;(3) 通过带对数概率检查(log-probability checks)的受控重放(controlled replay)机制,确认服务层故障的可复现性,从而有效发现并归因于并发场景下的隐蔽缺陷。

链接: https://arxiv.org/abs/2605.11202
作者: Yunze Zhao,Yibo Zhao,Yuchen Zhang,Zaoxing Liu,Michelle L. Mazurek
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:LLM inference and serving systems have become security-critical infrastructure; however, many of their most concerning failures arise from the serving layer rather than from model behavior alone. Modern inference engines combine KV cache, batching, prefix sharing, speculative decoding, adapters, and multi-tenant scheduling, creating shared-state behavior that only emerges under realistic concurrent workloads and is missed by standard model, safety, and API tests. We present GRIEF, a greybox fuzzer for LLM inference engines that treats timed multi-request traces as first-class inputs, uses lightweight oracles to detect crashes, hangs, performance pathologies, and silent output corruption, and applies controlled replay with log-probability checks to confirm reproducible serving-layer failures. Across early campaigns on vLLM and SGLang, GRIEF discovers 15 vulnerabilities, 10 confirmed by engine developers, including 2 CVEs, spanning KV-cache isolation failures, cross-request performance interference, and crash or liveness bugs. These results show that concurrency, caching, and state reuse can induce silent cross-request contamination, noisy-neighbor denial of service, and delayed crashes without malformed inputs or explicit server errors, making concurrent serving behavior a first-class security and reliability boundary for LLM infrastructure.
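
GRIEF 把"定时多请求轨迹"作为一等输入来触发并发与缓存复用下的共享状态行为。以下用一个假设的简化数据结构与变异算子示意这一思路(非 GRIEF 原实现):

```python
import random
from dataclasses import dataclass

@dataclass
class TimedRequest:
    send_at: float      # 相对发送时刻(秒)
    prompt: str
    max_tokens: int

def mutate_trace(trace, rng):
    """定时轨迹的极简变异算子示意(假设的简化版本):
    随机抖动某条请求的发送时刻, 或复制一条请求以诱发前缀共享/缓存复用,
    变异后按时间重新排序。不修改传入的原轨迹。
    """
    trace = [TimedRequest(r.send_at, r.prompt, r.max_tokens) for r in trace]
    if rng.random() < 0.5 and trace:
        r = rng.choice(trace)
        r.send_at = max(0.0, r.send_at + rng.uniform(-0.05, 0.05))  # 时序抖动
    elif trace:
        r = rng.choice(trace)
        trace.append(TimedRequest(r.send_at, r.prompt, r.max_tokens))  # 复制请求
    return sorted(trace, key=lambda r: r.send_at)
```

真实的模糊测试器还需在变异后的轨迹上运行 oracle(崩溃、挂起、性能异常、输出损坏检测)并做受控重放确认,这里只示意输入空间的形态。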

[AI-174] Exploring Token-Space Manipulation in Latent Audio Tokenizers

【速读】:该论文旨在解决现有神经音频编解码器(neural audio codecs)将token组织为帧级序列(frame-level sequences)导致的难以对全局变化因素(global factors of variation)进行干预或研究的问题。解决方案的关键在于提出一种名为LATTE(Latent Audio Tokenizer for Token-space Editing)的潜在音频分词器,它在音频特征序列末尾附加一组固定数量的可学习潜在token(learnable latent tokens),并仅保留这些token用于量化和解码。这种设计形成一个紧凑的非时间对齐瓶颈(non-temporally aligned bottleneck),其中每个token能够聚合整个语音段(utterance)的全局信息,从而在保持低比特率语音编码中重建质量的同时,支持简单的token空间干预,例如通过交换不同语音段间的选定潜在token位置来修改说话人身份(speaker identity)和背景噪声(background noise)等全局属性。

链接: https://arxiv.org/abs/2605.11192
作者: Francesco Paissan,Luca Della Libera,Mirco Ravanelli,Cem Subakan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision in task-specific editing models.
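
LATTE 的 token 空间干预可以示意为:在序列末尾的潜在 token 区间内,按位置交换两条语音的 token,以迁移说话人身份、背景噪声等全局属性。以下为假设布局下的极简实现(非官方代码,仅演示索引逻辑):

```python
import numpy as np

def swap_latent_tokens(tokens_a, tokens_b, n_latent, positions):
    """假设的布局: 每条序列末尾附加 n_latent 个全局潜在 token,
    仅它们参与量化/解码。交换两条语音在指定潜在位置上的 token。"""
    a, b = tokens_a.copy(), tokens_b.copy()
    for p in positions:
        assert 0 <= p < n_latent
        ia = len(a) - n_latent + p   # 潜在 token 在序列中的绝对下标
        ib = len(b) - n_latent + p
        a[ia], b[ib] = tokens_b[ib], tokens_a[ia]
    return a, b
```

帧级 token 不受影响,这正是"非时间对齐瓶颈"带来的便利:全局属性被压缩进少量可按位交换的 token。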

[AI-175] Adversarial SQL Injection Generation with LLM-Based Architectures

【速读】:该论文旨在解决利用大语言模型(Large Language Models, LLMs)自动化生成对抗性SQL注入(SQL injection, SQLi)负载以评估Web应用防火墙(Web Application Firewall, WAF)防御机制的问题。解决方案的关键在于提出了两个新型基于LLM的系统:检索增强生成对抗性SQLi(Retrieval Augmented Generation for Adversarial SQLi, RADAGAS)和反思性思维链SQLi(Reflective Chain-of-Thought SQLi, RefleXQLi),并通过大规模实验(240次实验、240,000个负载、220万次测试)比较它们与基线模型在10种WAF(包括基于规则、AI/ML和商用WAF)及MySQL验证器上的表现。其中,RADAGAS-GPT4o以22.73%的绕过率超越基线,尤其在AI/ML型WAF上表现突出(例如RADAGAS-DeepSeek对WAF-Brain达92.49%绕过率),但难以突破基于规则的WAF(ModSecurity和Coraza上仅0–5.70%)。该研究揭示了LLM在安全测试中的应用潜力和局限性。

链接: https://arxiv.org/abs/2605.11188
作者: Ali Karakoc,H. Birkan Yilmaz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 32 pages, 8 figures, 8 tables

点击查看摘要

Abstract:SQL injection (SQLi) attacks are still one of the serious attacks ranked in the Open Worldwide Application Security Project (OWASP) Top 10 threats. Today, with advances in Artificial Intelligence (AI), especially in Large Language Models (LLMs), an opportunity has been created for automating adversarial attack tests to measure the defense mechanisms. In this paper, we aim to create a comprehensive evaluation of use cases that utilize LLMs for adversarial SQL injection generation. We introduce two novel LLM-based systems, Retrieval Augmented Generation for Adversarial SQLi (RADAGAS) and Reflective Chain-of-Thought SQLi (RefleXQLi), and compare them with existing baselines against 10 Web Application Firewalls (WAFs) and one execution-based MySQL validator. To perform a comprehensive test, we used six rule-based open-source WAFs (ModSecurity PL1–3, Coraza PL1–3), 2 AI/ML-based WAFs (WAF Brain, CNN-WAF), and 2 commercial WAFs (AWS WAF and Cloudflare WAF). For the LLM models, we used GPT-4o, Claude 3.7 Sonnet, and DeepSeek R1. Our tests consist of 240 experiments that generate 240,000 payloads and perform 2.2 million tests against WAFs. Our comprehensive evaluation reveals that RADAGAS-GPT4o outperforms other baseline models with a 22.73% bypass rate. The proposed RADAGAS variants are highly successful on AI/ML-based WAFs (92.49% on WAF-Brain by RADAGAS-DeepSeek, 80.48% on CNN-WAF by RADAGAS-Claude), but struggle to bypass rule-based WAFs (0–5.70% on ModSecurity and Coraza). In addition to these findings, another observation is that creating less diverse payloads achieves more bypasses, however they show poor results if the initially chosen payload is not successful. We observe that our findings provide a comprehensive view on using LLM-based approaches in security testing.

[AI-176] CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration

【速读】:该论文旨在解决大语言模型(LLMs)在内存受限设备(如边缘平台)上自回归解码时的内存带宽瓶颈问题,现有推测解码(speculative decoding)方法假设高带宽内存(HBM)足以同时容纳目标模型和辅助草稿模型,但该假设在内存受限场景下不成立。解决方案的关键是提出CATS(Cascaded self-speculative decoding)框架,它基于内存预算和参数卸载模式(parameter offloading patterns)进行级联验证与校正,从而在保持峰值内存占用与目标模型相同的前提下,最大化令牌接受率(token acceptance rate)并实现端到端加速;通过在真实边缘设备上的评估,CATS在无生成质量损失下实现了最高5.08倍的加速,比现有最优方法(SOTA)在边缘内存约束下提升1.45倍。

链接: https://arxiv.org/abs/2605.11186
作者: Yuning Han,Yangchenchen Jin,Dylan Zhao,Jingwei Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making throughput bottlenecked by memory bandwidth rather than compute. Speculative decoding addresses this by enabling parallel verification of multiple draft tokens, effectively amortizing the cost of each target-model call. However, existing speculative decoding methods are designed under the assumption that HBM is sufficiently large to hold both the target model and an auxiliary draft model simultaneously – an assumption that breaks down on memory-constrained devices such as edge platforms with limited DRAM. We analyze the inference bottleneck in this memory-limited regime and propose CATS, a self-speculative decoding framework that conducts cascaded verification and correction based on the memory budget and parameter offloading patterns on memory-limited devices. This design maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone. We evaluate CATS on different models across five benchmarks on real edge devices. CATS can achieve a wall-clock speedup of up to 5.08x with no degradation in generation quality, outperforming the SOTA method by up to 1.45x under edge memory constraints.
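
CATS 建立在推测解码的草稿-验证流程之上。下面给出该通用流程的单步示意(贪心验证版本,模型用玩具函数代替;未包含 CATS 的级联校正与参数卸载细节):

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """推测解码单步示意: 草稿模型自回归提出 k 个候选 token,
    目标模型逐位验证, 接受最长匹配前缀, 并修正第一个不匹配位置。
    draft_next / target_next 均为假设的玩具接口: 序列 -> 下一 token。
    """
    draft, cur = [], list(prefix)
    for _ in range(k):                      # 草稿阶段
        t = draft_next(cur)
        draft.append(t)
        cur.append(t)
    accepted = list(prefix)
    for t in draft:                         # 验证阶段
        correct = target_next(accepted)
        if t == correct:
            accepted.append(t)              # 命中: 免费接受草稿 token
        else:
            accepted.append(correct)        # 未命中: 用目标输出修正并停止
            break
    else:
        accepted.append(target_next(accepted))  # 全部命中: 额外多得一个 token
    return accepted
```

一次目标模型调用最多产出 k+1 个 token,这正是摘要中"并行验证摊销目标模型调用成本"的含义;CATS 的贡献在于内存受限下如何安排这一流程。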

[AI-177] The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

【速读】:该论文系统探究了在线策略蒸馏(On-policy Distillation, OPD)与在线策略自蒸馏(On-policy Self-Distillation, OPSD)作为大语言模型后训练方法时效果不稳定的根本原因,并试图阐明其有效与失效的具体条件。论文发现OPD在数学推理任务中严重依赖教师模型选择和损失函数形式,而OPSD在测试时因缺乏实例特定的特权信息(Privileged Information, PI)而失效。研究识别出三种关键失败机制:教师与学生因学生生成前缀导致的分布不匹配、由偏置的TopK反向KL梯度引发的优化不稳定性,以及OPSD特有的局限——学生学到的无PI策略无法有效聚合依赖PI的教师。解决方案的核心在于针对性缓解这些失败:采用stop-gradient TopK目标函数消除梯度偏差、使用经过RLVR(Reinforcement Learning with Variable Reward)适应的教师模型以缓解分布不匹配,以及利用SFT(监督微调)稳定学生模型来增强训练稳定性。

链接: https://arxiv.org/abs/2605.11182
作者: Siqi Zhu,Xuyan Ye,Hongyu Lu,Weiye Shi,Ge Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model’s own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.
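
文中提到的 stop-gradient TopK 目标针对的是 TopK 反向 KL 的梯度偏差。以下仅示意该损失的数值计算(非官方实现;stop-gradient 修正作用于反向传播,此处以注释标明):

```python
import numpy as np

def topk_reverse_kl(student_logits, teacher_logits, k=3):
    """TopK 反向 KL 的数值示意: 按学生分布取 TopK 子集,
    在子集上重新归一化后计算 KL(student || teacher)。"""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    idx = np.argsort(-p_s)[:k]           # 学生 TopK 子集
    q_s = p_s[idx] / p_s[idx].sum()      # 论文指出: 对该归一化分母求导会引入
    q_t = p_t[idx] / p_t[idx].sum()      # 偏差, stop-gradient 版本在反传时截断它
    return float(np.sum(q_s * np.log(q_s / q_t)))
```

损失值本身与普通版本相同,区别只在梯度路径,这也解释了为何该缺陷在行为上表现为优化不稳定而非损失异常。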

[AI-178] Muon is Not That Special: Random or Inverted Spectra Work Just as Well

【速读】:该论文旨在质疑非欧几里得优化中精确几何结构对优化性能影响的关键性,挑战了当前将Muon优化器成功归因于二阶方法或线性最小化预言(LMO)的几何叙事。论文通过三个贡献提出解决方案:首先,引入基于Schatten(拟)范数的Freon优化器家族,利用QDWH迭代近似实现从SGD到Muon的自然插值,并发现GPT-2中表现最好的Schatten参数位于拟范数区间,无法被任何酉不变LMO表示;其次,提出Kaon这种荒谬优化器,用随机噪声替换奇异值,却仍匹配Muon性能并保留收敛保证,证明严格几何结构在实践中无关紧要;最后,揭示优化性能实际上由两个局部量——对齐(alignment)和下降势(descent potential)控制,每个优化器必须围绕这两个量调整步长。通过随机特征模型分析,论文得出关键结论:Muon的成功并非源于追踪理想的全局几何结构,而是通过保证步长最优性(step-size optimality)实现的。

链接: https://arxiv.org/abs/2605.11181
作者: Zakhar Shumaylov,Nathaël Da Costa,Peter Zaika,Bálint Mucsányi,Alex Massucco,Yoav Gelberg,Carola-Bibiane Schönlieb,Yarin Gal,Philipp Hennig
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: 45 pages

点击查看摘要

Abstract:The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods, and linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of exponents, we introduce Kaon, an absurd optimizer that replaces singular values with random noise. Despite lacking any coherent geometric structure, Kaon matches Muon’s performance and retains classical convergence guarantees, proving that strict adherence to a precise geometry is practically irrelevant. Third, having shown that geometry is not the primary driver of performance, we demonstrate it is instead controlled by two local quantities: alignment and descent potential. Ultimately, each optimizer must tune its step size around these two quantities. While their dynamics are difficult to predict a-priori, evaluating them within a stochastic random feature model yields a precise insight: Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality.
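
Kaon 的构造可以直接示意:对梯度做 SVD,把奇异值替换成随机噪声后重构,作为更新方向。以下是基于摘要理解的极简版本(噪声分布与缩放方式为假设):

```python
import numpy as np

def kaon_update(grad, rng, scale=1.0):
    """Kaon 思路的极简示意: 保留梯度的奇异向量, 用随机噪声替换奇异值。
    作为对照, Muon 的正交化近似相当于把所有奇异值置为 1。"""
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    noise = np.abs(rng.normal(size=s.shape)) * scale   # 假设: |N(0,1)| 噪声
    return U @ np.diag(noise) @ Vt
```

更新方向与原梯度共享奇异向量,但谱被完全随机化;摘要的论点正是:即便如此,配合合适的步长仍能匹配 Muon,说明精确的谱几何并非性能关键。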

[AI-179] Oversmoothing as Representation Degeneracy in Neural Sheaf Diffusion

【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)中的过度平滑(oversmoothing)问题,并提出一种新的代数解释。关键解决方案在于通过幺半群理论(quiver theory)对神经层流扩散(Neural Sheaf Diffusion, NSD)进行重新诠释:将图上的细胞层流(cellular sheaves)视为关联入射幺半群(incidence quiver)的表示,使得学习到的层流几何成为有限维表示空间中的点。研究指出,在扩散极限下,调和空间(harmonic space)的直和分解对应表示退化,即过度平滑本质上是学习到的层流失稳至低复杂度直和项,导致其全局截面(global sections)丧失判别信息。基于此,论文引入几何不变量理论(Geometric Invariant Theory)中的矩映射原理,设计正则化器偏置限制映射(restriction maps)朝向平衡的表示几何;同时揭示等主干尺寸(equal-stalk architectures,即 (d_v = d_e))中可学习稳定性参数的可容许性会迫使平凡全对象直和项(trivial all-object summand)落在稳定性壁上,而非均匀主干尺寸可消除此结构障碍,使自适应稳定性(adaptive stability)实际有效。实验表明,打破主干对称性可降低方差或改善验证性能,验证了该机制。

链接: https://arxiv.org/abs/2605.11178
作者: Arif Dönmez,Axel Mosig,Ellen Fritsche,Katharina Koch
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Representation Theory (math.RT)
备注: 15 pages, Comments welcome

点击查看摘要

Abstract:Neural Sheaf Diffusion (NSD) generalizes diffusion-based Graph Neural Networks by replacing scalar graph Laplacians with sheaf Laplacians whose learned restriction maps define a task-adapted geometry. While the diffusion limit of NSD is known to be the space of global sections, the representation-theoretic structure of this harmonic space remains largely implicit. We develop a quiver-theoretic interpretation of NSD by identifying cellular sheaves on graphs with representations of the associated incidence quiver. Under this correspondence, learned sheaf geometries become points in a finite-dimensional representation space. We show that direct-sum decompositions of the underlying incidence-quiver representation induce decompositions of the harmonic space reached in the diffusion limit. This gives an algebraic interpretation of oversmoothing as representation degeneration: learned sheaves may collapse toward low-complexity summands whose global sections fail to preserve discriminative information. Building on this viewpoint, we connect sheaf diffusion to stability and moment-map principles from Geometric Invariant Theory. We introduce moment-map-inspired regularizers that bias restriction maps toward balanced representation geometries, and identify a structural obstruction in equal-stalk architectures: when d_v = d_e , admissibility for learnable stability parameters forces the trivial all-object summand onto a stability wall. Non-uniform stalk dimensions remove this obstruction, making adaptive stability meaningful. Experiments on heterophilic benchmarks are consistent with this mechanism: breaking stalk symmetry can reduce variance or improve validation behavior, and adaptive stability becomes more effective in selected rectangular settings. Overall, our framework reframes oversmoothing as a degeneration phenomenon in the representation geometry underlying learned sheaf diffusion.
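
文中的层流 Laplacian 是标准构造:每条边 e=(u,v) 配一对限制映射 F_u、F_v,按顶点块拼装出分块矩阵,其核空间即全局截面(扩散极限)。以下为小图上的极简实现(标准定义的示意,非论文的新方法本身):

```python
import numpy as np

def sheaf_laplacian(edges, restrictions, stalk_dims):
    """细胞层流 Laplacian 的分块构造:
    restrictions[e] = (F_u, F_v), 形状分别为 (d_e, d_u), (d_e, d_v)。
    对每条边累加 F_u^T F_u / F_v^T F_v 对角块与 -F_u^T F_v 交叉块。"""
    offs = np.cumsum([0] + stalk_dims)
    L = np.zeros((offs[-1], offs[-1]))
    for (u, v), (Fu, Fv) in zip(edges, restrictions):
        su = slice(offs[u], offs[u + 1])
        sv = slice(offs[v], offs[v + 1])
        L[su, su] += Fu.T @ Fu
        L[sv, sv] += Fv.T @ Fv
        L[su, sv] -= Fu.T @ Fv
        L[sv, su] -= Fv.T @ Fu
    return L
```

当所有限制映射取单位阵时,它退化为普通图 Laplacian 的向量值版本,常量截面落在核空间中,对应过度平滑时信息坍缩的去向。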

[AI-180] OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

【速读】:该论文试图解决大型语言模型智能体(LLM agent)在部署阶段因动作选择错误累积导致工具调用浪费、延迟增加和可靠性下降的问题,现有推理时适应方法(如提示或检索)仅通过上下文间接影响行为,缺乏显式的决策层来对候选动作评分、表示不确定性或从动作级反馈在线更新。解决方案之关键是提出OLIVIA框架,它将ReAct风格智能体的最终动作选择层建模为一个上下文线性bandit(contextual linear bandit),以冻结的隐藏状态作为决策上下文,从而直接在动作选择接口适应行为,保留底层推理过程,并通过上置信界(upper-confidence-bound)探索提供显式的不确定性估计和轻量级的在线更新,最终以最小计算开销实现样本高效的策略改进。

链接: https://arxiv.org/abs/2605.11169
作者: Sheldon Yu,Junda Wu,Xintong Li,Nikki Lijing Kuang,Sizhe Zhou,Tong Yu,Jiawei Han,Jingbo Shang,Julian McAuley
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model agents interleave reasoning, action selection, and observation to solve sequential decision-making tasks. In deployed settings where agents repeatedly handle related multi-step tasks, small action-selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment-time improvement, existing inference-time adaptation methods for LLM agents mainly rely on prompting or retrieval, which influence behavior indirectly through context manipulation. For ReAct-style agents, such approaches do not expose an explicit decision layer that can score candidate actions, represent uncertainty, or be updated online from action-level feedback. As a result, they provide limited support for trackable, fine-grained, and uncertainty-aware adaptation during deployment. We propose OLIVIA, an inference-time action adaptation framework for ReAct-style agents. OLIVIA models the LLM’s final action-selection layer as a contextual linear bandit over candidate actions, with frozen hidden states as decision contexts. This choice is particularly suitable for deployment because it adapts behavior directly at the action-selection interface, preserves the underlying reasoning process, and provides explicit uncertainty estimates and lightweight online updates from action-level feedback. With upper-confidence-bound exploration, OLIVIA improves the policy sample-efficiently with minimal computational overhead. We instantiate OLIVIA on four benchmarks and show that it consistently improves task performance over static ReAct and prompt-based inference-time baselines. Our results suggest that explicit online decision layers provide an effective alternative to purely prompt- or retrieval-based adaptation for LLM agents during deployment.
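
OLIVIA 把最终动作选择层建模为候选动作上的上下文线性 bandit。以下用标准 LinUCB 给出极简示意(非 OLIVIA 官方实现;"以冻结隐藏状态作为上下文特征"为其思路,这里的特征构造是假设的):

```python
import numpy as np

class LinUCBActionLayer:
    """标准 LinUCB 决策层示意: 每行上下文是"冻结隐藏状态 + 候选动作"
    的特征向量, 按 UCB(均值 + 不确定性奖励)打分选动作, 
    并用动作级反馈在线更新线性参数。"""

    def __init__(self, dim, alpha=1.0, lam=1.0):
        self.A = lam * np.eye(dim)    # 正则化的特征协方差累积
        self.b = np.zeros(dim)
        self.alpha = alpha

    def select(self, contexts):
        # contexts: (n_candidates, dim)
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        mean = contexts @ theta
        bonus = self.alpha * np.sqrt(
            np.einsum("nd,dk,nk->n", contexts, A_inv, contexts))
        return int(np.argmax(mean + bonus))

    def update(self, context, reward):
        self.A += np.outer(context, context)
        self.b += reward * context
```

更新只涉及低维线性代数,印证了摘要中"显式不确定性估计 + 轻量在线更新、开销极小"的定位。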

[AI-181] Benchmarking LLM-Based Static Analysis for Secure Smart Contract Development: Reliability, Limitations, and Potential Hybrid Solutions

【速读】:该论文试图解决的问题是:大型语言模型(LLM)在作为智能合约安全审计工具时,其可靠性和有效性是否足以替代传统的静态分析工具,或者只能作为补充。解决方案的关键在于通过构建一个自定义的自动化评估框架,系统性地分析LLM的输出,从而揭示其固有缺陷——包括对标识符命名等非语义启发式的依赖导致的词汇偏差(lexical bias),以及对外部数据输入缺乏严格验证所引发的高误报率(false positives),并量化了提示方法(prompting techniques)在精度与召回率之间的权衡关系。该框架在分类模型输出时达到92%的准确率,为评估LLM在安全审计场景中的适用性提供了实证基础。

链接: https://arxiv.org/abs/2605.11163
作者: Stefan-Claudiu Susan,Andrei Arusoaie,Dorel Lucanu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE COMPSAC 2026. Extended version with supplemental materials

点击查看摘要

Abstract:The irreversible nature of blockchain transactions makes the identification of smart contract vulnerabilities an essential requirement for secure system development. While Large Language Models (LLMs) are increasingly integrated into developer workflows, their reliability as autonomous security auditors remains unproven. We assess whether current generative models are a viable replacement for, or only a complement to, traditional static-analysis tools. Our findings indicate that LLM efficacy is undermined by both inherent lexical bias and a lack of rigorous validation of external data inputs. This reliance on non-semantic heuristics, such as identifier naming, leads to a high frequency of false positives. Furthermore, prompting techniques reveal a trade-off between precision and recall. These results were derived using our custom automated framework, which achieves 92% accuracy in classifying model outputs.

[AI-182] Interpretability Can Be Actionable ICML2026

【速读】:该论文试图解决深度神经网络可解释性研究虽快速增长但缺乏实际影响的问题,即大量研究未能转化为实践应用,其相关性和实用性受到质疑。解决方案之关键在于引入“可行动性”(actionability)作为核心评估标准,取代单纯的方法创新:可解释性应通过其能否促成超出可解释性研究本身的具体决策和干预来评价。论文从具体性(concreteness)和验证性(validation)两个维度定义可行动的可解释性,分析当前阻碍实际应用的壁垒,并识别五个可解释性具有独特杠杆作用的领域,提出一个与实用成果对齐的评估框架,旨在将可行动性确立为可解释性研究的核心目标。

链接: https://arxiv.org/abs/2605.11161
作者: Hadas Orgad,Fazl Barez,Tal Haklay,Isabelle Lee,Marius Mosbach,Anja Reusch,Naomi Saphra,Byron Wallace,Sarah Wiegreffe,Eric Wong,Ian Tenney,Mor Geva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Interpretability aims to explain the behavior of deep neural networks. Despite rapid growth, there is mounting concern that much of this work has not translated into practical impact, raising questions about its relevance and utility. This position paper argues that the central missing ingredient is not new methods, but evaluation criteria: interpretability should be evaluated by actionability–the extent to which insights enable concrete decisions and interventions beyond interpretability research itself. We define actionable interpretability along two dimensions–concreteness and validation–and analyze the barriers currently preventing real-world impact. To address these barriers, we identify five domains where interpretability offers unique leverage and present a framework for actionable interpretability with evaluation criteria aligned with practical outcomes. Our goal is not to downplay exploratory research, but to establish actionability as a core objective of interpretability research.

[AI-183] RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

【速读】:该论文旨在解决离线到在线强化学习(Offline-to-online RL)中,由于状态-动作空间大且预收集数据集覆盖有限,导致critic学习不准确,以及现有悲观方法(如对分布外动作降权)虽然有效但实质上类似行为克隆,在数据集动作次优时阻碍后续在线策略改进的问题。解决方案之关键是提出 RankQ——一种离线到在线的Q学习目标,通过自监督的多项排序损失(self-supervised multi-term ranking loss)增强时序差分学习(temporal-difference learning),强制Q函数学习结构化动作排序(structured action ordering)。与统一惩罚未见动作不同,RankQ通过学习相对动作偏好(relative action preferences)来塑造Q函数,使动作梯度(action gradients)指向更高质量的行为,从而在稀疏奖励D4RL基准和视觉机器人学习中显著提升性能,并实现有效的sim-to-real迁移。

链接: https://arxiv.org/abs/2605.11151
作者: Andrew Choi,Wei Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state–action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 84.7% relative to the VLA’s initial performance.
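RankQ 在 TD 学习之外引入自监督排序损失,以强制 Q 函数学习结构化的动作排序。以下用一个假设形式的成对 hinge 排序项做数值草图(论文的多项排序损失具体形式未在摘要中给出,此处仅演示"学习相对动作偏好而非统一惩罚未见动作"的思想):

```python
import numpy as np

def td_loss(q_pred, q_target):
    """常规 TD 回归项(此处简化为对目标 Q 值的 MSE)。"""
    return float(np.mean((q_pred - q_target) ** 2))

def ranking_loss(q_vals, pref_pairs, margin=0.5):
    """成对 hinge 排序项:对每个 (better, worse) 偏好,
    若 Q[better] 未超出 Q[worse] 至少 margin 则受罚。"""
    losses = [max(0.0, margin - (q_vals[i] - q_vals[j]))
              for i, j in pref_pairs]
    return float(np.mean(losses))

q = np.array([0.2, 1.0, 0.1, 0.6])     # 某状态下 4 个候选动作的 Q 值
pairs = [(1, 0), (3, 2), (0, 2)]       # 学到的相对动作偏好
targets = np.array([0.3, 0.9, 0.2, 0.5])
total = td_loss(q, targets) + 0.1 * ranking_loss(q, pairs)
```

排序项只约束动作间的相对大小,因此动作梯度被引向更高质量的行为,而不会像行为克隆式悲观那样把策略锚定在数据集动作上。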

[AI-184] EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

【速读】:该论文旨在解决现有测试时(test-time)多智能体进化方法中的一个根本矛盾:单智能体学习器仅能演化自身上下文和记忆,而多智能体系统本应同时演化“谁协作、如何协作、知识如何流动”等维度并催生涌现式专业化(emergent specialization),但现有方法要么将经验局限在单个智能体内而放弃跨智能体学习,要么对所有智能体对称广播从而抹杀专业化价值。解决方案的关键在于 EVOCHAMBER——一个无需训练的框架,它在共进化智能体池上实施三层测试时进化:最核心的组件是 CODREAM(Collaborative Dreaming),一种任务后协议,在团队失败或分歧时触发,智能体通过协作反思提炼见解,并将知识从强智能体不对称路由至在该失败领域上较弱的智能体,从而在填补知识缺口的同时保留专业化分工;团队级算子在线组建基于特定领域(niche)的团队并动态选择协作结构,种群级生命周期算子则通过分叉、合并、修剪和播种来维持性能压力下的适应能力。实验表明,不对称跨智能体知识转移(asymmetric cross-agent transfer)是性能提升的主要驱动因素,且从同质初始化智能体出发自发涌现4-5个稳定领域专家,这是单智能体学习器无法表达的多智能体进化结构特征。

链接: https://arxiv.org/abs/2605.11136
作者: Yaolun Zhang,Tianyi Xu,Shengyu Dai,Zhenwen Shao,Qingyun Wu,Huazheng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We argue that multi-agent test-time evolution is not single-agent evolution replicated N times. A single-agent learner can only evolve its own context and memory. A multi-agent system additionally evolves who collaborates, how they collaborate, and how knowledge flows across the population. These components have no single-agent counterpart and can produce phenomena such as emergent specialization. Yet prior test-time methods either confine experiences to individual agents, forfeiting cross-agent learning, or broadcast symmetrically to all agents, erasing the specialization that makes collaboration valuable. We present EVOCHAMBER, a training-free framework that instantiates test-time evolution at three levels over a coevolving agent pool. At its core is CODREAM (Collaborative Dreaming), a post-task protocol triggered on team failure or disagreement, in which agents collaboratively reflect, distill insights, and route them asymmetrically from strong to weak agents on the failed niche, preserving specialization while filling knowledge gaps. Team-level operators assemble niche-conditioned teams and select collaboration structures online. Population-level lifecycle operators fork, merge, prune, and seed agents under performance pressure. On three heterogeneous task streams with Qwen3-8B, EVOCHAMBER reaches 63.9% on competition math, 75.7% on code, and 87.1% on multi-domain reasoning, outperforming the best baseline by 32% relative on math and confirming asymmetric cross-agent transfer as the primary driver in ablation. Starting from several identically initialized agents, four to five stable niche specialists spontaneously emerge, a structural signature of multi-agent evolution that no single-agent learner can express. See our code at: this https URL
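CODREAM 的核心是在失败领域(niche)上把提炼出的洞见从强智能体不对称地路由到弱智能体。下面用纯 Python 勾勒这一路由逻辑(阈值、评分结构与洞见内容均为演示用假设,并非论文实现):

```python
def codream_route(scores, insight, niche, threshold=0.6):
    """不对称洞见路由:团队在 niche 上失败后,仅把提炼出的洞见
    从该 niche 上的强者发送给弱者,保留各自的专业化分工。"""
    strong = [a for a, s in scores.items() if s[niche] >= threshold]
    weak = [a for a, s in scores.items() if s[niche] < threshold]
    return {w: {"from": strong, "insight": insight} for w in weak}

scores = {"a1": {"math": 0.9, "code": 0.4},
          "a2": {"math": 0.3, "code": 0.8},
          "a3": {"math": 0.7, "code": 0.5}}
routed = codream_route(scores, insight="factor before expanding", niche="math")
```

由于强者不会收到广播,其既有专长不被稀释;弱者只在自己欠缺的领域补齐知识,这正是摘要中"填补缺口同时保留专业化"的机制。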

[AI-185] Spurious Correlation Learning in Preference Optimization: Mechanisms Consequences and Mitigation via Tie Training

【速读】:该论文旨在解决偏好学习方法(如直接偏好优化,DPO)中模型因依赖虚假相关性(spurious correlations)而导致的谄媚(sycophancy)、长度偏差(length bias)以及未来系统中可能出现的严重目标泛化错误(goal misgeneralization)问题。解决方案的关键在于提出一种名为“平局训练”(tie training)的数据增强策略,通过引入等效用偏好对(equal-utility preference pairs)施加数据驱动的正则化,从而在不削弱因果学习(causal learning)的前提下,选择性地减少虚假学习(spurious learning)。理论分析揭示了虚假依赖的两个人口层面(population-level)通道:均值虚假偏差(mean spurious bias)和因果-虚假相关泄漏(causal-spurious correlation leakage),并证明了来自相同训练分布的更多数据无法降低这种依赖,而平局训练能有效缓解这一分布偏移下的脆弱性。

链接: https://arxiv.org/abs/2605.11134
作者: Christian Moya,Alex Semendinger,Guang Lin,Elliott Thornley
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea

点击查看摘要

Abstract:Preference learning methods such as Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today’s language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences on deployment, and a provable mitigation strategy. Focusing on log-linear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal–spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model’s dependence on spurious features. To address this, we propose tie training, a data augmentation strategy using ties (equal-utility preference pairs) to introduce data-driven regularization. We demonstrate that this approach selectively reduces spurious learning without degrading causal learning. Finally, we validate our theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.
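以数值草图说明"平局训练"的形态:标准偏好损失作用于胜负对的隐式奖励间隔(margin),而平局对上的正则项把等效用样本的间隔推向零。此处平局项假设为平方间隔惩罚,论文摘要未给出其具体形式,仅作示意:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(margins):
    """标准偏好项:对每个 (chosen, rejected) 对的隐式奖励间隔
    取 -log sigmoid(margin)。"""
    return float(np.mean(-np.log(sigmoid(margins))))

def tie_loss(margins):
    """平局正则项(假设形式):把等效用对的间隔推向 0。"""
    return float(np.mean(margins ** 2))

pref_margins = np.array([1.2, 0.8, 2.0])   # 胜负对的 log-ratio 间隔
tie_margins = np.array([0.5, -0.3])        # 被标注为平局的样本对
loss = dpo_loss(pref_margins) + 0.5 * tie_loss(tie_margins)
```

直观上,若某个虚假特征在平局对两侧分布不对称,平局项会惩罚模型对它的依赖,而真正的因果特征在等效用对上本就间隔接近零,因而不受影响。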

[AI-186] HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series NEURIPS2026

【速读】:该论文试图解决多变量时间序列中关键事件预测的难题,主要挑战在于此类事件(从涡轮故障到心律失常)不仅罕见且标注成本高昂,导致带标签数据极度匮乏。解决方案的核心在于HEPA(Horizon-conditioned Event Predictive Architecture)的两个关键原则:首先,利用因果Transformer编码器通过联合嵌入预测架构(JEPA)进行预训练,其中基于时间范围的条件预测器(horizon-conditioned predictor)学习预测未来的表征而非具体数值,从而迫使编码器仅从无标签数据中捕获可预测的时间动态;其次,冻结编码器并仅针对目标事件微调预测器,生成一个随时间范围单调递减的生存累积分布函数(CDF)。这一设计在固定架构和优化器超参数的前提下,实现了跨11个领域14个基准的卓越性能。

链接: https://arxiv.org/abs/2605.11130
作者: Jonas Petersen(1 and 2),Gian-Alessandro Lombardi(2),Riccardo Maggioni(2),Camilla Mazzoleni(2),Federico Martelli(1 and 2),Philipp Petersen(3) ((1) ETH Zurich, (2) Forgis, (3) University of Vienna)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages main text, 15 pages appendix, 5 figures, 9 tables. Under review at NeurIPS 2026. Code at this https URL

点击查看摘要

Abstract:Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction, yet labeled data is scarce because such events are rare and costly to annotate. We introduce HEPA (Horizon-conditioned Event Predictive Architecture), built on two key principles. First, a causal Transformer encoder is pretrained via a Joint-Embedding Predictive Architecture (JEPA): a horizon-conditioned predictor learns to forecast future representations rather than future values, forcing the encoder to capture predictable temporal dynamics from unlabeled data alone. Second, we freeze the encoder and finetune only the predictor toward the target event, producing a monotonic survival cumulative distribution function (CDF) over horizons. With fixed architecture and optimiser hyperparameters across all benchmarks, HEPA handles water contamination, cyberattack detection, volatility regimes, and eight further event types across 11 domains, exceeding leading time-series architectures including PatchTST, iTransformer, MAE, and Chronos-2 on at least 10 of 14 benchmarks, with an order of magnitude fewer tuned parameters and, on lifecycle datasets, an order of magnitude less labeled data.
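HEPA 要求预测器输出随时间范围单调递增的事件 CDF。一种常见做法(此处为示意性假设,未必是论文采用的参数化)是把每个 horizon 的原始输出经 softplus 变为非负"风险增量",累加得到累计风险 H 后映射为 1 - exp(-H):

```python
import numpy as np

def softplus(x):
    # 数值稳定的 log(1 + e^x)
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def survival_cdf(logits):
    """单调事件 CDF:每个 horizon 的输出先经 softplus 变为非负
    风险增量,累加得到累计风险 H,再映射为 1 - exp(-H)。"""
    hazard = softplus(logits)
    return 1.0 - np.exp(-np.cumsum(hazard))

z = np.array([-1.0, 0.3, 2.0, -0.5])   # 预测器在 4 个递增 horizon 上的原始输出
cdf = survival_cdf(z)
```

无论原始输出取何值,这种参数化都保证 CDF 落在 (0, 1) 内且随 horizon 严格单调,使预测天然满足生存分布的约束。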

[AI-187] SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection

【速读】:该论文试图解决基于视觉-语言-动作(Vision-Language-Action, VLA)和模仿学习(imitation learning)的策略在低成本硬件上训练后,部署于新环境时频繁失败的问题——现有基准在受控背景下显示高成功率,但社区实践中迁移性能近乎为零。解决方案的关键在于SEVO(Semantic-Enhanced Virtual Observation),一种数据驱动方法,通过三种机制增强原始RGB摄像头流:固定于机器人本体的摄像头(body-fixed cameras)提供完整工作空间覆盖、主动红光谱照明(active red-spectrum illumination)物理归一化物体外观、以及实时YOLO分割叠加(real-time YOLO segmentation overlay)提供背景不变语义线索。同时,SEVO的核心在于采用系统性地改变光照、背景和干扰物的多样化数据收集协议(diversified data collection protocol),这是实现跨环境泛化的最重要因素,而非依赖模型规模扩展。

链接: https://arxiv.org/abs/2605.11114
作者: Tianchonghui Fang,Yuan Zhuang,Fei Miao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) and imitation-learning policies trained via community toolchains on low-cost hardware frequently fail when deployed outside the training environment. Existing evaluations, including the original ACT and SmolVLA benchmarks, demonstrate high success rates under controlled, fixed backgrounds, yet community practitioners report near-zero transfer to new environments. We present SEVO (Semantic-Enhanced Virtual Observation), a data-centric approach that improves cross-environment manipulation robustness without modifying the policy architecture. SEVO transforms the raw RGB camera stream through three mechanisms: (1) body-fixed cameras whose combined fields of view cover the full manipulation workspace, (2) active red-spectrum illumination that physically normalizes object appearance, and (3) real-time YOLO segmentation overlay that provides a background-invariant semantic cue. Critically, we show that a diversified data collection protocol (systematically varying lighting, backgrounds, and distractors during teleoperation) is the single most important factor for generalization. We target transparent water bottles, objects that visually blend with their surroundings, and select a simple pick-and-place task to enable hundreds of controlled real-robot trials across two mobile platforms. The full pipeline achieves 95% grasp success with ACT and 83% with SmolVLA in the training environment, transferring to novel environments at 85% and 75%. Without SEVO, the same policies achieve only 75%/70% in training and collapse to 30-35% in novel environments. Our results demonstrate that principled observation design and environmental diversity during data collection, not model scaling, enable low-cost robots to operate reliably in everyday household environments.

[AI-188] Newton's Lantern: A Reinforcement Learning Framework for Finetuning AC Power Flow Warm Start Models

【速读】:该论文旨在解决交流潮流(AC power flow)问题求解中,现有的神经暖启动(neural warm starts)监督学习方法在接近电压崩溃的重载实例上泛化性能差的问题。研究证明牛顿-拉夫逊迭代次数存在一个下界,该下界取决于暖启动误差的方向而非幅度,并且当潮流雅可比矩阵的最小奇异值缩小时,该下界变得无效,这揭示了靠近鞍结分岔点时监督回归的失效模式。基于此分析,论文提出的关键解决方案是Newton’s Lantern微调管道,它通过结合组相对策略优化(group relative policy optimization)和基于基础模型预测扰动训练的学习奖励模型,直接以迭代次数作为监督信号,从而在IEEE 118节点、GOC 500节点和GOC 2000节点基准测试中,成为唯一能够在所有测试快照上收敛且达到最小平均迭代次数的方法。

链接: https://arxiv.org/abs/2605.11102
作者: Shourya Bose,Helgi Hilmarsson,Dhruv Suri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Neural warm starts can sharply reduce the number of Newton-Raphson iterations required to solve the AC power flow problem, but existing supervised approaches generalize poorly on heavily loaded instances near voltage collapse. We prove a lower bound on the Newton-Raphson iteration count that depends on the direction of the warm start error rather than on its magnitude, and show as a corollary that the bound becomes vacuous as the smallest singular value of the power-flow Jacobian shrinks, identifying the failure mode of supervised regression near the saddle-node bifurcation. Motivated by this analysis, we introduce Newton’s Lantern, a finetuning pipeline that combines group relative policy optimization with a learned reward model trained on perturbations of the base model’s predictions, using the iteration count itself as the supervisory signal. Across IEEE 118-bus, GOC 500-bus, and GOC 2000-bus benchmarks, Newton’s Lantern is the only method that converges on every test snapshot while attaining the smallest mean iteration count.
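论文直接以牛顿-拉夫逊迭代次数作为监督信号。下面用一个玩具二维非线性方程组(仅为假设示例,并非真实交流潮流方程)演示"迭代计数"的提取,以及好的暖启动如何降低它:

```python
import numpy as np

def newton_iterations(f, jac, x0, tol=1e-8, max_iter=50):
    """从暖启动 x0 运行牛顿-拉夫逊,返回 (解, 迭代次数)。"""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            return x, k
        x = x - np.linalg.solve(jac(x), fx)
    return x, max_iter

# 玩具二维非线性方程组(圆与直线的交点),解为 (1, 1)
f = lambda v: np.array([v[0] ** 2 + v[1] ** 2 - 2.0, v[0] - v[1]])
jac = lambda v: np.array([[2.0 * v[0], 2.0 * v[1]], [1.0, -1.0]])

_, iters_cold = newton_iterations(f, jac, x0=[3.0, 0.5])
_, iters_warm = newton_iterations(f, jac, x0=[1.01, 0.99])
```

这一可直接观测的整数信号正是 Newton's Lantern 中奖励模型的监督来源;论文的下界分析进一步说明,决定迭代次数的是暖启动误差的方向而非幅度。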

[AI-189] Enabling Performant and Flexible Model-Internal Observability for LLM Inference

【速读】:该论文试图解决当前推理时工作负载(inference-time workloads)对模型内部状态(internal states)的及时访问需求,与现有观测方法导致的高延迟开销及对服务优化(serving optimizations)破坏之间的矛盾。解决方案的关键在于:通过设计基于 Ring^2 的 GPU-CPU 内存抽象(memory abstraction),实现张量(tensors)的异步捕获与暂存,并构建策略控制的主机后端(policy-controlled host backend)进行导出,从而将内部可观察性(internal observability)作为一等系统原语(first-class systems primitive),使其与推理热路径(inference hot path)完全解耦,在保持严格 GPU 内存预算(tight GPU memory budgets)的同时,将观测开销控制在 0.4%–6.8% 之间,并实现 2 倍至 15 倍的延迟降低。

链接: https://arxiv.org/abs/2605.11093
作者: Nengneng Yu,Sixian Xiong,Yibo Zhao,Wei Wang,Zaoxing Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Software Engineering (cs.SE); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Today’s inference-time workloads increasingly depend on timely access to a model’s internal states. We present DMI-Lib, a high-speed deep model inspector that treats internal observability as a first-class systems primitive, decoupling it from the inference hot path via an asynchronous observability substrate built from Ring^2, a GPU-CPU memory abstraction for capturing and staging tensors, and a policy-controlled host backend that exports them. DMI-Lib enables the placement of observation points across a rich space of internal signals and diverse inference backends while preserving serving optimizations and adhering to tight GPU memory budgets. Our experiments demonstrate that DMI-Lib incurs only 0.4%–6.8% overhead in offline batch inference and an average of 6% in moderate online serving, reducing latency overhead by 2x-15x compared to existing baselines with similar observability features. DMI-Lib is open-sourced at this https URL.
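其核心思想是让捕获路径永不阻塞推理热路径:张量先写入有界环形缓冲,满时覆盖最旧条目并计数丢弃,导出端再异步批量排空。以下为该抽象语义的纯 Python 草图(与 Ring^2 的真实 GPU-CPU 实现无关,仅作示意):

```python
from collections import deque

class RingStage:
    """非阻塞捕获环:推理热路径只做 append,从不等待;
    环满时覆盖最旧条目并计数丢弃;导出端异步批量排空。"""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
        self.dropped = 0

    def capture(self, name, tensor):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1          # 即将覆盖最旧条目
        self.buf.append((name, tensor))

    def drain(self):
        out = list(self.buf)
        self.buf.clear()
        return out

stage = RingStage(capacity=3)
for step in range(5):
    stage.capture(f"layer0.hidden@{step}", [step, step + 1])
exported = stage.drain()
```

"宁丢不堵"的取舍正是在紧张的 GPU 内存预算下把观测开销压到百分之几量级的关键:生产者侧永远是 O(1) 的追加,背压被转化为可计量的丢弃计数。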

[AI-190] ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder

【速读】:该论文旨在解决现有自闭症谱系障碍(ASD)自动化筛查工具的三个主要缺陷:单一架构评估(仅使用单一模型类型)、评估轴受限(忽视可解释性和鲁棒性等多维指标)以及几乎完全聚焦于成人群体,从而掩盖了对于早期干预至关重要的年龄特异性诊断模式。解决方案的关键在于提出了一个名为ASD-Bench的系统性表格型基准测试框架,该框架同时评估多种模型架构(包括经典机器学习、神经网络、深度表格变压器及TabPFN v2)在三个年龄队列(儿童、青少年、成人)上的四个核心维度:预测性能、校准度、可解释性与对抗鲁棒性。此外,论文引入了一种新的成本敏感度量——Heuristic Aggregate Penalty (HAP),通过加重对假阴性的惩罚并结合交叉验证方差来评估部署稳定性,从而弥补传统单一指标评估的不足。基于4068份AQ-10记录的数据集,该基准揭示了年龄队列间特征重要性的迁移(如社交动机A9在儿童中主导、模式识别A5在青少年中主导)以及准确性与校准度的分离现象,证明单指标评估不足以支撑临床AI应用,并提供了基于年龄队列的部署建议。所有发现被视为基于问卷衍生标签的概念验证证据,而非临床验证的诊断性能。

链接: https://arxiv.org/abs/2605.11091
作者: Shubhankit Singh,Hassan Shaikh,Kuldeep Raghuwanshi,Keshav Bulia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 12 figures, 8 tables

点击查看摘要

Abstract:Automated ASD screening tools remain limited by single-architecture evaluations, axis-restricted assessment, and near-exclusive focus on adult cohorts, obscuring age-specific diagnostic patterns critical for early intervention. We introduce ASD-Bench, a systematic tabular benchmark evaluating ML, deep learning, and foundation model configurations across three age cohorts (children 1-11 yr, adolescents 12-16 yr, adults 17-64 yr) on four axes: predictive performance, calibration, interpretability, and adversarial robustness. Applied to a curated v3 dataset of 4,068 AQ-10 records, our benchmark spans classical models (XGBoost, AdaBoost, Random Forest, Logistic Regression), neural networks (MLP), deep tabular transformers (TabNet, TabTransformer, FT-Transformer), and TabPFN v2. We introduce the Heuristic Aggregate Penalty (HAP): a cost-sensitive metric penalising false negatives more heavily and incorporating cross-validation variance for deployment stability. Adult classification yields high performance (10/17 models achieve perfect F1 and AUC), while adolescents present a harder task (F1 ceiling 0.837 vs. 0.915 for children). Feature hierarchies shift across cohorts: A9 (social motivation) dominates for children, A5 (pattern recognition) leads for adolescents, and adults exhibit a flatter importance profile consistent with developmental social masking. Accuracy and calibration are dissociated: AdaBoost achieves F1=1.000 on adults with ECE=0.302, confirming single-metric evaluation is insufficient for clinical AI. Cohort-specific deployment recommendations are provided. All findings should be interpreted as proof-of-concept evidence on questionnaire-derived labels rather than clinically validated diagnostic performance.
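HAP 的确切公式摘要未给出;下面按其描述("加重假阴性惩罚 + 交叉验证方差的稳定性项")给出一个假设形式的实现草图(权重 c_fn、c_fp、lam 均为演示取值):

```python
import numpy as np

def hap(y_true, y_pred, fold_scores, c_fn=5.0, c_fp=1.0, lam=2.0):
    """Heuristic Aggregate Penalty 的假设形式:加重假阴性的
    加权错误率,加上交叉验证得分标准差的稳定性惩罚;越低越好。"""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = np.mean((y_true == 1) & (y_pred == 0))
    fp = np.mean((y_true == 0) & (y_pred == 1))
    return float(c_fn * fn + c_fp * fp + lam * np.std(fold_scores))

y_true = [1, 1, 0, 0, 1, 0]
miss_positive = [0, 1, 0, 0, 1, 0]   # 一个假阴性(漏诊)
miss_negative = [1, 1, 0, 1, 1, 0]   # 一个假阳性
folds = [0.91, 0.90, 0.92]
```

在筛查场景中漏诊代价远高于误报,同样一个错误,假阴性版本得到的 HAP 惩罚明显更重;方差项则让跨折不稳定的模型即使平均分高也被降级。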

[AI-191] ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

【速读】:该论文旨在解决当前对AI智能体(AI agents)漏洞利用(exploitation)能力评估不足的问题,特别是漏洞利用作为一种需要底层程序推理、运行时自适应和长周期持续进展的复杂任务,在网络安全中具有双重用途(既支持防御也降低攻击门槛),但缺乏大规模、多样化的真实基准测试。解决方案的关键在于提出了ExploitGym,它是一个包含898个源自真实漏洞(涵盖用户空间程序、Google V8 JavaScript引擎和Linux内核三个领域)的实例基准测试,并通过变化每实例的安全防护措施来隔离其对智能体性能的影响,所有配置均封装在可重复的容器化环境中,从而为评估AI智能体的漏洞利用能力提供了有效且标准化的测试平台。

链接: https://arxiv.org/abs/2605.11086
作者: Zhun Wang,Nico Schiller,Hongwei Li,Srijiith Sesha Narayana,Milad Nasr,Nicholas Carlini,Xiangyu Qi,Eric Wallace,Elie Bursztein,Luca Invernizzi,Kurt Thomas,Yan Shoshitaishvili,Wenbo Guo,Jingxuan He,Thorsten Holz,Dawn Song
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI agents are rapidly gaining capabilities that could significantly reshape cybersecurity, making rigorous evaluation urgent. A critical capability is exploitation: turning a vulnerability, which is not yet an attack, into a concrete security impact, such as unauthorized file access or code execution. Exploitation is a particularly challenging task because it requires low-level program reasoning (e.g., about memory layout), runtime adaptation, and sustained progress over long horizons. Meanwhile, it is inherently dual-use, supporting defensive workflows while lowering the barrier for offense. Despite its importance and diagnostic value, exploitation remains under-evaluated. To address this gap, we introduce ExploitGym, a large-scale, diverse, realistic benchmark on the exploitation capabilities of AI agents. Given a program input that triggers a vulnerability, ExploitGym tasks agents with progressively extending it into a working exploit. The benchmark comprises 898 instances sourced from real-world vulnerabilities across three domains, including userspace programs, Google’s V8 JavaScript engine, and the Linux kernel. We vary the security protections applied to each instance, isolating their impact on agent performance. All configurations are packaged in reproducible containerized environments. Our evaluation shows that while exploitation remains challenging, frontier models can successfully exploit a non-trivial fraction of vulnerabilities. For example, the strongest configurations are Anthropic’s latest model Claude Mythos Preview and OpenAI’s GPT-5.5, which produce working exploits for 157 and 120 instances, respectively. Notably, even with widely used defenses enabled, models retain non-trivial success rates. These results establish ExploitGym as an effective testbed for exploitation and highlight the growing cybersecurity risks posed by increasingly capable AI agents.

[AI-192] MCPShield: Content-Aware Attack Detection for LLM Agent Tool-Call Traffic

【速读】:该论文试图解决LLM代理在通过模型上下文协议(Model Context Protocol, MCP)调用外部工具时,其工具调用流量缺乏有效攻击检测机制的问题。解决方案的关键在于将每个代理会话编码为图结构:以工具调用为节点,以顺序和数据流链接为边,并通过句子嵌入(sentence embedding)特征丰富节点上的参数与响应内容,从而利用图神经网络(GNN)或经典集成学习模型对会话进行良性或恶意分类。实验表明,内容级特征(SBERT嵌入)是检测效力的核心,而单纯依赖元数据会导致检测性能严重受限(AUROC约0.64),且任务分割方式不当会造成评估结果的虚假膨胀。

链接: https://arxiv.org/abs/2605.11053
作者: Sultan Zavrak
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of MCP tool-call traffic remains underexplored. In this article, MCPShield is presented as an attack detection framework for MCP tool-call traffic that encodes each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges), enriches nodes with sentence-embedding features over arguments and responses, and classifies sessions as benign or attacked. Three GNN architectures (GAT, GCN, GraphSAGE), a no-graph MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) are evaluated, with the full architecture comparison conducted on RAS-Eval (task-stratified splits) and GraphSAGE retained as the GNN baseline on ATBench and a combined-source variant (both label-stratified). Three findings emerge. First, content-level features are essential: metadata-only detection plateaus around an AUROC of 0.64 regardless of architecture, while content embeddings push the AUROC above 0.89. Second, naive random-split evaluation inflates AUROC by up to 26 percentage points relative to task-disjoint splits, a memorization confound that prior agent-detection work has not addressed. Third, the detection signal resides primarily in the SBERT content embeddings: an AUROC of 0.975 was reached by tree ensembles on pooled embeddings, performing, for the most part, better than the neural architectures in the primary RAS-Eval setting including GNNs (0.917) and the MLP (0.896), and self-supervised pre-training does not deliver a label-efficiency advantage on this task.
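会话图的构建可以很直观:工具调用为节点,相邻调用连顺序边,若某次调用的参数中出现此前响应的内容则连数据流边。以下为最小化草图(数据流判定用朴素子串匹配,仅作演示;真实系统中节点还会附加句子嵌入特征):

```python
class SessionGraph:
    """把一次代理会话编码为图:工具调用为节点,
    相邻调用连顺序边,响应内容流入后续参数则连数据流边。"""

    def __init__(self):
        self.nodes = []        # (tool, args, response)
        self.seq_edges = []    # (i, j):第 j 次调用紧随第 i 次
        self.flow_edges = []   # (i, j):节点 i 的响应出现在节点 j 的参数里

    def add_call(self, tool, args, response):
        j = len(self.nodes)
        if j > 0:
            self.seq_edges.append((j - 1, j))
        for i, (_, _, resp) in enumerate(self.nodes):
            if resp and any(resp in str(v) for v in args.values()):
                self.flow_edges.append((i, j))   # 朴素子串匹配,仅作演示
        self.nodes.append((tool, args, response))
        return j

g = SessionGraph()
g.add_call("read_file", {"path": "notes.txt"}, "api-key-123")
g.add_call("http_post", {"url": "http://x", "body": "api-key-123"}, "ok")
g.add_call("list_dir", {"path": "."}, "notes.txt")
```

示例中 read_file 的响应流入 http_post 的 body,图中由此出现一条跨调用数据流边——正是这类"敏感内容外流"的结构,使图级分类器能区分良性与受攻击会话。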

[AI-193] ForceFlow: Learning to Feel and Act via Contact-Driven Flow Matching

【速读】:该论文旨在解决接触密集型操作任务(contact-rich manipulation tasks)中,由于复杂的接触动力学(contact dynamics)需要高精度力反馈与控制,而现有模仿学习方法难以在多模态观测下实现稳健泛化的问题。其解决方案的关键在于提出基于流匹配(flow matching)的力感知反应式框架ForceFlow,具体包括两个核心设计:第一,在接触阶段策略设计中,采用非对称多模态融合架构(asymmetric multimodal fusion architecture),将力信号作为全局调节信号(global regulatory signal),并结合联合预测范式(joint prediction paradigm)来增强对瞬时力与历史信息的理解,从而实现力与运动的深度耦合;第二,在任务级层次分解中,将操作过程划分为视觉主导的接近阶段(vision-dominant approach stage)和触觉主导的交互阶段(touch-dominant interaction stage),并通过视觉到力(Vision-to-Force, V2F)交接机制,显式解耦空间泛化与接触调节,从而在保持低成本的同时大幅提升成功率和零样本分布外泛化(zero-shot out-of-distribution generalization)能力。

链接: https://arxiv.org/abs/2605.11048
作者: Shuoheng Zhang,Yifu Yuan,Hongyao Tang,Yan Zheng,Qiaojun Yu,Pengyi Li,Guowei Huang,Helong Huang,Xingyue Quan,Jianye Hao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing imitation learning methods enable robots to interact autonomously with the physical environment. However, contact-rich manipulation tasks remain a significant challenge due to complex contact dynamics that demand high-precision force feedback and control. Although recent efforts have attempted to integrate force/torque sensing into policies, how to build a simple yet effective framework that achieves robust generalization under multimodal observations remains an open question. In this paper, we propose ForceFlow, a force-aware reactive framework built upon flow matching. For contact-stage policy design, we investigate force signal fusion mechanisms and adopt an asymmetric multimodal fusion architecture that treats force as a global regulatory signal, combined with a joint prediction paradigm that enhances the policy’s understanding of instantaneous force and historical information, thereby achieving deep coupling between force and motion. For task-level hierarchical decomposition, we divide manipulation into a vision-dominant approach stage (VLM-based pointing for target localization) and a touch-dominant interaction stage (force-driven contact execution), with a Vision-to-Force (V2F) handover mechanism that explicitly decouples spatial generalization from contact regulation. Experimental results across six real-world contact-rich tasks demonstrate that ForceFlow achieves a 37% success rate improvement over the strong baseline ForceVLA while maintaining significantly lower cost. Moreover, ForceFlow exhibits accurate force signal prediction and demonstrates superior performance in contact force self-regulation and zero-shot out-of-distribution (OOD) generalization.
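ForceFlow 建立在流匹配之上。条件流匹配的训练目标本身很简洁:在直线插值路径 x_t = (1-t)x₀ + t·x₁ 上回归目标速度 x₁ - x₀。下面以 numpy 给出该损失的草图(用线性平移代替真实的力-动作分布,便于核对,并非论文实现):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(predict_v, x0, x1):
    """条件流匹配:在直线路径 x_t = (1-t) x0 + t x1 上,
    目标速度为 x1 - x0;模型在随机时刻 t 回归该速度。"""
    t = rng.uniform(size=(x0.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1
    target_v = x1 - x0
    return float(np.mean((predict_v(x_t, t) - target_v) ** 2))

x0 = rng.normal(size=(8, 2))             # 噪声样本
x1 = x0 + np.array([1.0, -2.0])          # "专家动作":纯平移,便于核对
perfect = lambda x_t, t: np.tile([1.0, -2.0], (x_t.shape[0], 1))
zero = lambda x_t, t: np.zeros_like(x_t)
```

在 ForceFlow 中,速度网络额外接收作为全局调节信号的力表征作为条件输入,从而把接触力与动作生成耦合起来;上面的草图只保留了流匹配目标本身。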

[AI-194] Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw

【速读】:该论文旨在解决智能语言模型系统(agentic language-model systems)中由于可变执行上下文(如文件、内存、工具、技能及辅助工件)带来的安全风险问题,这些风险无法通过仅依赖用户显式提示(explicit user prompts)或最终响应评估(final-response evaluation)来有效检测。解决方案之关键在于提出了名为DeepTrap的自动化框架,该框架将对抗性上下文操纵(adversarial context manipulation)形式化为一个黑盒轨迹级优化问题(black-box trajectory-level optimization problem),通过同时平衡风险实现(risk realization)、良性任务保持(benign-task preservation)和隐蔽性(stealth),并整合风险条件评估(risk-conditioned evaluation)、多目标轨迹评分(multi-objective trajectory scoring)、奖励引导束搜索(reward-guided beam search)以及基于反射的深度探测(reflection-based deep probing)等技术,系统性地识别出能够诱发不安全行为但保留用户任务完成的高价值妥协上下文(high-value compromised contexts)。

链接: https://arxiv.org/abs/2605.11047
作者: Hongwei Yao,Yiming Liu,Yiling He,Bingrun Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic language-model systems increasingly rely on mutable execution contexts, including files, memory, tools, skills, and auxiliary artifacts, creating security risks beyond explicit user prompts. This paper presents DeepTrap, an automated framework for discovering contextual vulnerabilities in OpenClaw. DeepTrap formulates adversarial context manipulation as a black-box trajectory-level optimization problem that balances risk realization, benign-task preservation, and stealth. It combines risk-conditioned evaluation, multi-objective trajectory scoring, reward-guided beam search, and reflection-based deep probing to identify high-value compromised contexts. We construct a 42-case benchmark spanning six vulnerability classes and seven operational scenarios, and evaluate nine target models using attack and utility grading scores. Results show that contextual compromise can induce substantial unsafe behavior while preserving user-facing task completion, demonstrating that final-response evaluation is insufficient. The findings highlight the need for execution-centric security evaluation of agentic AI systems. Our code is released at: this https URL
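奖励引导束搜索是 DeepTrap 的搜索骨架:对束内每个候选"被污染上下文"施加变异、按多目标评分排序、保留前 k 个。以下为一个与具体攻击无关的玩具草图(上下文表示、变异算子与评分函数均为演示用假设):

```python
def beam_search(seed, mutate, score, beam_width=2, depth=3):
    """奖励引导束搜索:对束内每个候选上下文施加变异,
    按多目标评分排序后保留前 beam_width 个。"""
    beam = [seed]
    for _ in range(depth):
        candidates = [m for ctx in beam for m in mutate(ctx)]
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

# 玩具设定:"上下文"是整数元组;评分(假设)在风险实现(求和)
# 与隐蔽性(长度惩罚)之间取平衡
mutate = lambda ctx: [ctx + (k,) for k in (1, 2, 3)]
score = lambda ctx: sum(ctx) - 0.5 * len(ctx)
best = beam_search((), mutate, score)
```

真实框架中 score 由风险实现、良性任务保持与隐蔽性三个目标加权而成,并配合基于反射的深度探测来扩展候选集。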

[AI-195] Read, Extract, Classify: A Tool for Smarter Requirements Engineering

【速读】:该论文旨在解决需求工程中从半结构化需求文档中手动提取和分类信息的低效与不准确问题,以加速软件开发生命周期。解决方案的关键在于ReXCL工具的两个核心模块:提取模块通过启发式规则与预测性建模(predictive modeling)将原始需求文档自动转换为预定义模式(schema),而分类模块则采用编码器模型的自适应微调(adaptive fine-tuning)为每条需求赋予类别标签,最终输出可集成至外部需求工程工具。这一方法首次实现了对半结构化需求文档的模式化自动处理,显著提升了需求管理的效率与准确性。

链接: https://arxiv.org/abs/2605.11045
作者: Paheli Bhattacharya,Manojit Chakraborty,Santhosh Kumar Arumugam,Rishabh Gupta
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published at Requirements Engineering 2025 Conference

点击查看摘要

Abstract:This paper presents the ReXCL tool, which automates the extraction and classification processes in requirements engineering, enhancing the software development life-cycle. The tool features two main modules: Extraction, which processes raw requirement documents into a predefined schema using heuristics and predictive modeling, and Classification, which assigns class labels to requirements using adaptive fine-tuning of encoder-based models. The final output can be exported to external requirement engineering tools. Performance evaluations indicate that ReXCL significantly improves efficiency and accuracy in managing requirements, marking a novel approach to automating the schematization of semi-structured requirement documents.
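ReXCL 的提取模块基于启发式把半结构化需求文档转换为预定义模式。下面用正则启发式加关键词分类器给出一个极简草图(REQ-n 行格式与关键词规则均为演示假设;论文中的分类器实际是自适应微调的编码器模型):

```python
import re

def extract_requirements(doc):
    """启发式提取:把半结构化文档中的 'REQ-n: 描述' 行
    转换为预定义模式的记录。"""
    pattern = re.compile(r"^(REQ-\d+):\s*(.+)$", re.MULTILINE)
    return [{"id": rid, "text": text.strip()}
            for rid, text in pattern.findall(doc)]

def classify(req):
    """用关键词规则代替论文中微调的编码器分类器,仅作演示。"""
    text = req["text"].lower()
    if any(k in text for k in ("latency", "throughput", "secure", "available")):
        return "non-functional"
    return "functional"

doc = """Intro text.
REQ-1: The system shall export reports as PDF.
REQ-2: API latency shall stay below 200 ms.
"""
records = [{**r, "label": classify(r)} for r in extract_requirements(doc)]
```

带标签的记录随后即可按统一模式导出到外部需求工程工具,这对应摘要中提取与分类两个模块的衔接。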

[AI-196] Towards Model-Free Learning in Dynamic Population Games: An Application to Karma Economies

【速读】:该论文旨在解决现有动态人口博弈(Dynamic Population Games, DPGs)在现实应用中的局限性:现有计算工具假设完全已知博弈模型并以集中式方式运行,而实际场景中智能体仅能获取自身私有经验,无法进行集中式学习。为此,论文针对Karma DPGs中的无模型均衡学习问题提出了解决方案。其关键创新在于两个方面:首先,对于已处于平稳纳什均衡(Stationary Nash Equilibrium, SNE)的Karma DPG,新加入的智能体通过深度Q网络(Deep Q-Networks, DQN)直接学习策略,无需知晓博弈模型,并利用DQN的收敛性分析获得了由回放缓冲区大小 N_s 导致的 O(1/√N_s) 近似误差和由种群规模 N 导致的 O(1/N) 平均场扰动误差所构成的次优性界;其次,针对从零开始学习SNE的更具挑战性的问题,论文通过实证方法证明了将深度强化学习与虚构博弈(fictitious play)及平滑策略迭代(smoothed policy iteration)相结合,能够使智能体以无模型方式收敛到接近集中式计算的SNE配置。这些贡献为Karma经济学作为公平资源分配的实际工具提供了理论支持。

链接: https://arxiv.org/abs/2605.11042
作者: Matteo Cederle,Saverio Bolognani,Gian Antonio Susto
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dynamic Population Games (DPGs) provide a tractable framework for modeling strategic interactions in large populations of self-interested agents, and have been successfully applied to the design of Karma economies, a class of fair non-monetary resource allocation mechanisms. Despite their appealing theoretical properties, existing computational tools for DPGs assume full knowledge of the game model and operate in a centralized fashion, limiting their applicability in realistic settings where agents have access only to their own private experience. This paper takes a step towards addressing this gap by studying model-free equilibrium learning in Karma DPGs. First, we analyze the setting in which a novel agent joins a Karma DPG already at its Stationary Nash Equilibrium (SNE) and learns a policy via Deep Q-Networks (DQN) without knowledge of the game model. Leveraging recent convergence results for DQN, we establish a suboptimality bound consisting of a DQN approximation error of order O(1/\sqrt{N_s}) and a mean field perturbation error of order O(1/N), where N_s is the replay buffer size and N is the population size. Second, we consider the challenging problem of learning the SNE from scratch. We show empirically that combining deep RL with fictitious play and smoothed policy iteration allows agents to converge, in a model-free fashion, to a configuration close to the centrally computed SNE. Together, these contributions support the vision of Karma economies as practical tools for fair resource allocation.
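论文将深度强化学习与虚构博弈、平滑策略迭代结合以无模型地逼近 SNE。其博弈论内核可以用二人零和矩阵博弈上的平滑虚构博弈来示意:匹配硬币博弈的混合均衡为各 1/2,经验频率应收敛到该点(此处为确定性连续更新版本,并非论文的 DPG 设定):

```python
import numpy as np

def softmax(x, tau):
    z = np.exp((x - np.max(x)) / tau)
    return z / z.sum()

# 匹配硬币博弈:玩家 1 的收益矩阵,玩家 2 取其负;混合均衡为 (1/2, 1/2)
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
counts1 = np.array([2.0, 1.0])   # 非对称初始经验计数
counts2 = np.array([1.0, 1.0])

for _ in range(5000):
    belief2 = counts2 / counts2.sum()       # 玩家 1 对玩家 2 的经验信念
    belief1 = counts1 / counts1.sum()
    p1 = softmax(A @ belief2, tau=0.5)      # 平滑最优反应
    p2 = softmax(-A.T @ belief1, tau=0.5)
    counts1 += p1                           # 确定性(连续)虚构博弈更新
    counts2 += p2

freq1 = counts1 / counts1.sum()
```

温度 tau 对应"平滑"的强度;在零和博弈中,这类平滑虚构博弈会收敛到对应的平滑均衡,论文将其中的最优反应替换为深度 RL 学到的策略。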

[AI-197] The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

【速读】:该论文试图解决工具使用型大语言模型(LLM)代理在混合信任工作流中的安全与效用权衡问题:现有基于完整工具调用粒度的防御机制无法区分可控参数与不可控参数,导致要么允许不可信内容控制调用而引发注入劫持(如目的地或命令被篡改),要么完全隔离调用而阻断正常的检索-然后-执行行为。解决方案的关键在于提出基于来源感知能力契约(PACT)的运行时监控器,其核心创新是将安全边界从完整的工具调用下放到参数级,通过为每个工具参数分配语义角色(semantic roles),跨重规划步骤追踪值的来源(provenance),并验证每个参数的来源是否满足其角色特定的信任契约(trust contract)。这种参数级权威绑定(authority binding)机制使得系统能够区分“上下文中出现不可信内容”与“不可信内容决定权威参数”这两种截然不同的风险场景,从而在理论上实现100%的安全与效用,实际部署中在强模型上恢复近半效用的同时保持零攻击成功率。

链接: https://arxiv.org/abs/2605.11039
作者: Linfeng Fan,Ziwei Li,Yuan Tian,Yichen Wang,Rongsheng Li,Xiong Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool-using LLM agents must act on untrusted webpages, emails, files, and API outputs while issuing privileged tool calls. Existing defenses often mediate trust at the granularity of an entire tool invocation, forcing a brittle choice in mixed-trust workflows: allow external content to influence a call and risk hijacked destinations or commands, or quarantine the call and block benign retrieval-then-act behavior. The key observation behind this paper is that indirect prompt injection becomes dangerous not when untrusted content appears in context, but when it determines an authority-bearing argument. We present PACT (Provenance-Aware Capability Contracts), a runtime monitor that assigns semantic roles to tool arguments, tracks value provenance across replanning steps, and checks whether each argument’s origin satisfies its role-specific trust contract. Under oracle provenance, PACT achieves 100% utility and 100% security on mixed-trust diagnostic suites, while flat invocation-level monitors incur false positives or false negatives. In full AgentDojo deployments across five models, PACT reaches 100% security on the three strongest models while recovering 38.1–46.4% utility, 8–16 percentage points above CaMeL at the same security level. Ablations show that both semantic roles and cross-step provenance are necessary. PACT reframes agent security as authority binding, and isolates the remaining deployment bottleneck to provenance inference and contract synthesis.
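下面给出一个示意性的 Python 草图,说明摘要中"参数级信任契约"的核心思想:为每个工具参数指定语义角色与最低信任来源,仅当全部参数的来源满足契约时才放行调用。注意:其中工具名、角色划分与信任层级均为本文为说明而假设,并非 PACT 的真实接口。

```python
# Illustrative sketch only: tool names, roles, and trust levels are
# assumptions for exposition, not PACT's actual API.
TRUST_ORDER = {"untrusted": 0, "user": 1, "system": 2}

# Per-tool contracts: each argument's semantic role maps to a minimum
# provenance trust level; authority-bearing arguments demand trusted origins.
CONTRACTS = {
    "send_email": {"to": "user", "body": "untrusted"},
    "run_command": {"cmd": "system"},
}

def check_call(tool, args):
    """Allow a call only if every argument's origin satisfies its role's
    contract: untrusted content may fill a body, but not pick a recipient."""
    contract = CONTRACTS[tool]
    violations = [
        name for name, (value, origin) in args.items()
        if TRUST_ORDER[origin] < TRUST_ORDER[contract[name]]
    ]
    return len(violations) == 0, violations

# Benign mixed-trust flow: untrusted web text fills the body,
# while the recipient was chosen by the user.
ok, _ = check_call("send_email", {
    "to": ("alice@example.com", "user"),
    "body": ("summary of the fetched page ...", "untrusted"),
})
# Injection attempt: the authority-bearing recipient came from untrusted content.
allowed, bad_args = check_call("send_email", {
    "to": ("attacker@evil.example", "untrusted"),
    "body": ("...", "untrusted"),
})
print(ok, allowed, bad_args)  # True False ['to']
```

与调用级(invocation-level)监控只能"整体放行或整体隔离"不同,这种检查允许不可信内容进入上下文,但阻止其决定权威参数。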

[AI-198] Sequential Behavioral Watermarking for LLM Agents

【速读】:该论文试图解决基于大语言模型(LLM)的智能体在决策轨迹中缺乏可溯源特征的问题,即从观察到的行为中难以确定轨迹的归属、所有权和是否被未授权重用,因为现有文本水印无法覆盖动作级决策,而近期基于行为的水印方法因将每个动作视为独立试验而忽略轨迹结构,在轨迹被扰动、截断或无法可靠对齐时变得脆弱。解决方案的关键在于提出SeqWM,一种序列行为水印框架(sequential behavioral watermarking framework),其核心是将水印信号嵌入历史条件转换模式(history-conditioned transition patterns)中,并以位置无关的方式(position-agnostically)相对于随机密钥基线(random-key baselines)验证轨迹,从而在保持智能体效用的同时实现可靠检测,且对轨迹扰动具有鲁棒性。

链接: https://arxiv.org/abs/2605.11036
作者: Hyeseon An,Shinwoo Park,Dongsu Kim,Yo-Sub Han
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 17 pages, 3 figures, preprint

点击查看摘要

Abstract:LLM-based agents act through sequences of executable decisions, but their trajectories provide little evidence of which agent or policy produced them, making provenance, ownership, and unauthorized reuse difficult to establish from observed behavior alone. This motivates watermarking signals embedded directly into agent behavior rather than only into generated text, since text watermarking cannot capture the action-level decisions that define agent execution. Recent agent watermarking methods address this gap by moving the watermark from generated text to behavioral choices. However, by treating each action step as an independent trial, they overlook trajectory structure and become fragile when trajectories are perturbed, truncated, or observed without reliable alignment. We propose SeqWM, a sequential behavioral watermarking framework that embeds signals into history-conditioned transition patterns and verifies trajectories position-agnostically against random-key baselines. Experiments across diverse agent benchmarks and LLM backbones show that SeqWM consistently achieves reliable detection while preserving agent utility, and remains robust under trajectory corruption where round-indexed behavioral watermarks collapse.
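以下是一个玩具级 Python 草图,粗略示意"历史条件转移模式水印 + 位置无关验证"的思路:在(假定)效用近似等价的候选动作中,按带密钥的哈希排名选择动作;验证时统计轨迹中"取得最高排名"的转移占比,并与随机密钥基线比较。哈希方案与判定方式均为假设,并非 SeqWM 的实际算法。

```python
# Toy sketch of a history-conditioned behavioral watermark; the keyed-hash
# scheme and scoring rule are illustrative assumptions, not SeqWM itself.
import hashlib

def keyed_rank(key, prev_action, action):
    """Keyed pseudorandom rank for a (previous action, action) transition."""
    return hashlib.sha256(f"{key}|{prev_action}|{action}".encode()).digest()

def pick_action(key, prev_action, candidates):
    """Among (assumed) utility-equivalent candidates, take the top keyed rank."""
    return max(candidates, key=lambda a: keyed_rank(key, prev_action, a))

def score(key, trajectory, candidates):
    """Position-agnostic check: fraction of transitions whose action holds
    the top rank under `key`; compare against random-key baselines."""
    pairs = list(zip(trajectory, trajectory[1:]))
    hits = sum(a == pick_action(key, p, candidates) for p, a in pairs)
    return hits / max(len(pairs), 1)

actions = ["open", "read", "search", "write", "close", "verify"]
traj, prev = [], "<start>"
for _ in range(20):
    prev = pick_action("secret-key", prev, actions)
    traj.append(prev)

watermark_score = score("secret-key", traj, actions)
baseline_score = score("wrong-key", traj, actions)
print(watermark_score)  # 1.0 by construction for the watermarked trajectory
```

由于评分只看相邻转移而不依赖绝对位置,轨迹被截断或错位后,仍可逐对转移进行验证。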

[AI-199] MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining

【速读】:该论文旨在解决网络流量分类和入侵检测中的核心痛点,即现有方法过度依赖复杂的tokenization(令牌化)、patching(分块)、工程化多模态表示以及大规模自监督预训练,导致模型部署成本高、效率低下。解决方案的关键在于提出了MambaNetBurst,一种基于Mamba-2骨干网络的紧凑型、无tokenizer的字节级序列分类器。它直接对原始数据包字节进行操作,通过从流的前几个数据包中构建固定长度的突发(burst),附加可学习的CLS令牌(CLS token)后,使用残差预归一化Mamba-2块进行端到端监督分类,从而完全避免了预处理流水线和预训练阶段。核心创新在于保留字节级时间分辨率(temporal resolution),且实验表明早期下采样(如步长卷积)会持续损害性能,而适中的状态大小(state size)足以保证鲁棒泛化。Mamba-2在保持高效训练速度的同时,其受限的转移结构仍能有效建模数据包字节,使得直接且“未稀释”的字节到分类学习成为一种实用且新颖的流量分析范式。

链接: https://arxiv.org/abs/2605.11034
作者: Gayan K. Kulatilleke,Siamak Layeghy,Mahsa Baktashmotlagh,Marius Portmann
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注: 16 pages, 2 figures. Pareto-optimal frontier. Transformer vs Mamba vs Mamba-2 scaling performance. Code and data available on request

点击查看摘要

Abstract:We present MambaNetBurst, a compact tokenizer-free byte-level sequence classifier for network burst classification based on a Mamba-2 backbone. In contrast to most recent strong traffic-classification and intrusion-detection approaches, our method operates directly on raw packet bytes, avoids tokenization, patching, and heavy engineered multimodal representations, and does not require any self-supervised pre-training stage. Given a packet flow, we form a fixed-length burst from the first few packets, embed the resulting byte sequence appending a learnable CLS token, and process it with a stack of residual pre-normalized Mamba-2 blocks for end-to-end supervised classification. Across six public benchmarks spanning encrypted mobile app identification, VPN/Tor traffic classification, malware traffic classification, and IoT attack traffic, MambaNetBurst achieves consistently strong results and is competitive with, or outperforms, substantially heavier and often pre-trained baselines. Our ablation study shows that preserving byte-level temporal resolution is critical, that early downsampling through striding is consistently harmful, and that moderate state sizes are sufficient for robust generalization. We further show that Mamba-2, despite its more constrained transition structure relative to Mamba-1, remains highly effective for packet-byte modeling while providing clear efficiency advantages, particularly in training speed. Overall, our results demonstrate that direct undiluted byte-to-classification learning with compact selective state space models is a practical, effective and novel direction for efficient, deployable traffic analysis that bypasses the complexity of pre-training pipelines even over highly optimized linear attention architectures.
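作为补充,下面用几行 Python 示意"从流的前几个数据包构造定长字节突发并附加 CLS 令牌"这一输入构造方式;数据包数量、突发长度与 CLS 令牌的取值均为示意性假设,并非论文的实际设置。

```python
# Illustrative input construction for byte-level burst classification;
# n_packets, burst_len, and the CLS id are made-up values, not the paper's.
def make_burst(packets, n_packets=4, burst_len=64, pad=0):
    """Concatenate raw bytes of the first packets, then truncate/pad."""
    raw = b"".join(packets[:n_packets])[:burst_len]
    ids = list(raw) + [pad] * (burst_len - len(raw))
    return [256] + ids  # 256: a CLS token id outside the 0-255 byte range

packets = [bytes([0x45, 0x00, 0x00, 0x3C]), bytes(range(10))]
burst = make_burst(packets)
print(len(burst), burst[0], burst[1])  # 65 256 69
```

得到的整数序列可以直接送入嵌入层,无需任何 tokenizer 或 patching 预处理。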

[AI-200] Portable Agent Memory: A Protocol for Cryptographically-Verified Memory Transfer Across Heterogeneous AI Agents

【速读】:该论文试图解决当前AI代理(AI agents)在异构平台间无法迁移持久性内存状态的问题——现代AI代理积累了丰富的上下文信息,包括情景事件(episodic events)、语义知识(semantic knowledge)、程序技能(procedural skills)、工作状态(working state)以及身份偏好(identity preferences),但这些上下文被锁定在特定供应商的运行时环境中,阻碍了代理的可移植性与互操作性。解决方案的关键在于提出并实现了一套开放协议“可移植代理内存(Portable Agent Memory)”,其核心包括:(1) 一个五组件结构化内存模型,采用内容可寻址的条目并通过Merkle-DAG溯源图(Merkle-DAG provenance graph)构建防篡改链;(2) 基于能力(capability-based)的访问控制机制,允许对内存片段进行选择性、有范围的披露;(3) 一种抗注入的再水化协议(rehydration protocol),能够将回忆内容适配到异构目标模型,同时减轻间接提示注入(indirect prompt injection)风险;(4) 采用JSON优先的序列化格式,并可选CBOR压缩以实现高效传输。此外,该协议通过一个包含54个通过测试的Python SDK和多平台代理技能进行了实现,并演示了GPT-4、Claude、Gemini及Llama架构之间的跨模型内存迁移。

链接: https://arxiv.org/abs/2605.11032
作者: Santhosh Kumar Ravindran
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 28 references

点击查看摘要

Abstract:We present Portable Agent Memory, an open protocol and reference implementation for transferring persistent memory state across heterogeneous AI agents. Modern AI agents accumulate rich context – episodic events,semantic knowledge, procedural skills, working state, and identity preferences – but this context remains locked within vendor-specific runtimes. Portable Agent Memory addresses this through: (1) a five-component structured memory model with content-addressable entries linked by a Merkle-DAG provenance graph providing tamper-evidence; (2) capability-based access control enabling selective, scoped disclosure of memory segments; (3) an injection-resistant rehydration protocol that adapts recalled content to heterogeneous target models while mitigating indirect prompt injection; and (4) a JSON-first serialization format with optional CBOR compaction for efficient transport. We provide a Python SDK with 54 passing tests, agent skills for multiple platforms, and demonstrate cross-model memory transfer between GPT-4, Claude, Gemini, and Llama architectures. The protocol is open-source under Apache 2.0.
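下面的 Python 草图示意"内容可寻址条目 + Merkle-DAG 溯源"如何提供防篡改能力:条目 ID 由其内容与父条目 ID 共同哈希得到,任何事后修改都会使 ID 与内容不再匹配。字段名与序列化方式为示意假设,并非该协议的实际线格式。

```python
# Minimal content-addressable Merkle-DAG sketch; field names and the JSON
# serialization are assumptions, not the protocol's actual wire format.
import hashlib, json

def entry_id(content, parents):
    """Content-addressable ID: hash over the entry body and its parent IDs."""
    blob = json.dumps({"content": content, "parents": sorted(parents)},
                      sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def add_entry(store, content, parents=()):
    eid = entry_id(content, list(parents))
    store[eid] = {"content": content, "parents": sorted(parents)}
    return eid

def verify(store):
    """Tamper-evidence: every stored ID must equal its recomputed hash."""
    return all(entry_id(e["content"], e["parents"]) == eid
               for eid, e in store.items())

store = {}
a = add_entry(store, "episodic: user asked for a summary")
b = add_entry(store, "semantic: user prefers bullet points", parents=[a])
print(verify(store))                      # True
store[b]["content"] = "tampered content"  # modify after the fact
print(verify(store))                      # False: the hash chain exposes it
```

由于父链接本身参与哈希,修改任一祖先条目都会级联破坏其后代的可验证性,这正是 Merkle-DAG 提供防篡改证据的方式。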

[AI-201] FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments

【速读】:该论文试图解决现有大型语言模型(LLM)安全基准无法检测跨多个无上下文关联会话的碎片化恶意攻击问题。攻击者可将恶意目标拆分为多个看似无害的子提示,这些子提示仅在组合时产生危害,而当前基准仅针对单提示或同一会话内多轮交互进行评估,忽略了跨会话的恶意信号。解决方案的关键在于构建FragBench基准,它基于24个真实网络攻击事件,保留完整的攻击痕迹(包括多片段杀伤链、每片段安全判断结果、沙盒执行轨迹及匹配的良性覆盖会话),并将此痕迹分解为两个配对任务:对抗性重写器用于强化子提示以规避单轮安全判断,以及基于图的用户级检测器(采用图神经网络GNN等模型)训练于交互结果中。核心创新在于证明防御碎片化LLM滥用必须建模跨会话交互图,而非孤立提示,实验表明四种GNN变体和三种经典机器学习基线均能恢复跨会话特征,事件级F1分数达0.88-0.96。

链接: https://arxiv.org/abs/2605.11029
作者: Astha Mehta,Niruthiha Selvanayagam,Cedric Lam,Hengxu Li,Phuc-Nguyen Nguyen,Raymond Lee,Olivia McGoffin,My (Isabella) Luong,Arthur Collé,Jamie Johnson,David Williams-King,Linh Le
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: preprint of submission

点击查看摘要

Abstract:An attacker can split a malicious goal into sub-prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real-world cyber-incident campaigns, which keeps the full attack trail: the multi-fragment kill chain, the per-fragment safety-judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack), and a graph-based user-level detector trained on the resulting interactions (FragBench Defense). The single-turn judge is near chance on the released corpus by construction, but four GNN variants and three classical-ML baselines all recover the cross-session feature, reaching aggregate event-level F1 = 0.88-0.96. Defending against fragmented LLM misuse therefore requires modeling the cross-session interaction graph, rather than isolated prompts. Our generator, rewriter, sandbox harness, and detector are released at this https URL.

[AI-202] From Code-Centric to Intent-Centric Software Engineering: A Reflexive Thematic Analysis of Generative AI Agentic Systems and Engineering Accountability

【速读】:该论文试图解决的核心问题是:尽管生成式人工智能(GenAI)和智能体系统正将软件工程从以代码为中心的生产转向以意图为中心的人机协作,但关于公共技术论述与同行评审证据如何共同塑造这一专业近期转型的认识仍存在显著空白。现有研究多聚焦于代码生成、AI结对编程和软件工程智能体,却缺乏对行业整体过渡方向的结构化理解。解决方案的关键在于采用一种结合反思性主题分析(Reflexive Thematic Analysis, RTA)主导与解释性现象学分析(Interpretative Phenomenological Analysis, IPA)的混合方法,对包含同行评审文献、技术基准、公开演讲与访谈、散文、产品技术公告以及X平台(原Twitter)上AI和软件工程领域权威人物话语在内的丰富语料库进行系统分析。通过建立语料库登记表、编码手册、编码矩阵、主题-来源追溯表、DOI/参考文献审计以及可重复性协议,该研究揭示了GenAI在降低生成可行代码成本的同时,显著提升了意图规范、上下文策展、架构知识、验证、安全、溯源、治理及负责任人类判断的重要性。这一分析最终表明,软件工程正从孤立的代码创作转向对社会技术系统(涉及人类、智能体、工具和证据门控)的监督、验证与治理——而这一转变的关键在于平衡快速采用的利弊:速度导向的采纳可能累积隐性技术债务与问责缺口,而有限自主权(bounded autonomy)则能维护质量、安全、可维护性与信任。

链接: https://arxiv.org/abs/2605.11027
作者: Elyson De La Cruz
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 24 pages, 6 tables

点击查看摘要

Abstract:Generative artificial intelligence (GenAI) and agentic systems are moving software engineering from code-centric production toward intent-centric human-agent work in which natural language, repository context, tools, tests, and governance shape delivery. Prior studies examine code generation, AI pair programming, and software engineering agents, but less is known about how public technical discourse and peer-reviewed evidence together frame the profession’s near-term transition. This study addresses that gap through a reflexive thematic analysis (RTA) dominant and interpretative phenomenological analysis (IPA) informed public-discourse and document analysis. The corpus combines peer-reviewed software engineering and AI literature, technical benchmarks, public talks and interviews, essays, product-facing technical announcements, and X-originated discourse from prominent AI and software engineering voices. Sources were organized through a corpus register, codebook, coding matrix, theme-to-source traceability table, DOI/reference audit, and reproducibility protocol. The analysis shows that GenAI lowers the cost of producing plausible code while increasing the importance of intent specification, context curation, architecture knowledge, verification, security, provenance, governance, and accountable human judgment. The findings indicate that software engineering is becoming less about isolated code authorship and more about supervising, validating, and governing socio-technical systems of humans, agents, tools, and evidence gates. This matters because speed-focused adoption can accumulate hidden technical debt and accountability gaps, whereas bounded autonomy can preserve quality, security, maintainability, and trust.

[AI-203] Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates ICML

【速读】:该论文试图解决经典逆向强化学习(Inverse Reinforcement Learning, IRL)中对偶上升法需要每次迭代完全求解一个强化学习问题而计算代价高昂,以及近期对抗方法虽降低计算成本却牺牲稳定性和单调对偶改进的问题。解决方案的关键理论洞见是:对于奖励函数更新,信任区域最优策略(trust-region-optimal policy)可以在同一更新方向上的更小步长下实现全局最优;基于此,该方法仅需在当前策略附近进行局部搜索即可显式优化对偶目标,从而在不完全求解强化学习问题的前提下,同时实现奖励函数和策略的单调改进,避免了对抗方法的训练不稳定性,同时恢复了传统IRL中可全局优化以匹配专家演示的奖励函数。所提出的信任区域逆强化学习(Trust Region Inverse Reinforcement Learning, TRIRL)算法在多项挑战性任务中以2.4倍的聚合四分位均值(aggregate inter-quartile mean)超越当前最优的模仿学习方法,且学得的奖励函数能够泛化至系统动力学变化场景。

链接: https://arxiv.org/abs/2605.11020
作者: Anish Diwan,Davide Tateo,Christopher E. Mower,Haitham Bou-Ammar,Jan Peters,Oleg Arenz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted as a conference paper at the International Conference on Machine Learning (ICML) 2026

点击查看摘要

Abstract:Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dual-ascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly optimize the dual objective while only relying on a local search around the current policy. In doing so, our approach avoids the training instabilities of adversarial methods, offers monotonic performance improvement, and learns a reward function in the traditional sense of IRL–one that can be globally optimized to match expert demonstrations. Our proposed algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean, while recovering reward functions that generalize to system dynamics shifts.

[AI-204] Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness

【速读】:该论文试图解决大语言模型在复杂推理中因“过度思考”(overthinking)现象导致推理效率低下的问题,并指出现有基于强化学习的方法通过设计精细奖励函数来压缩推理链,会导致高质量样本在探索空间中极度稀疏,形成先验策略(prior policy)的采样瓶颈。解决方案的关键在于:受认知科学启发,论文从理论上证明了由参考答案引导的后验分布(posterior distribution)相比先验分布具有更高的期望效用,能够突破高质量样本的采样瓶颈;针对推理时后验分布不可用的问题,将高效推理形式化为变分推理问题,并引入效率感知的证据下界(efficiency-aware evidence lower bound);基于此提出VPG-EA框架,采用参数共享的双流架构(dual-stream architecture)同时实例化后验分布与先验策略,通过交叉视图评估(cross-view evaluation)过滤伪高效路径,并借助变分蒸馏(variational distillation)将后验分布的高效模式单向迁移至先验策略,从而在不依赖推理时后验分布的前提下实现推理链的压缩。

链接: https://arxiv.org/abs/2605.11019
作者: Zizhao Chen,Yuying Li,Siting Lin,Lianxi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although large language models rely on chain-of-thought for complex reasoning, the overthinking phenomenon severely degrades inference efficiency. Existing reinforcement learning methods compress reasoning chains by designing elaborate reward functions, which renders high-quality samples extremely sparse in the exploration space and creates a sampling bottleneck for the prior policy. Inspired by cognitive science, we theoretically prove that a posterior distribution guided by reference answers achieves higher expected utility than the prior distribution, thus capable of breaking through the sampling bottleneck of high-quality samples. However, the posterior distribution is unavailable during inference. To this end, we formalize efficient reasoning as a variational inference problem and introduce an efficiency-aware evidence lower bound as the theoretical foundation. Based on this, we propose the VPG-EA framework. It adopts a parameter-shared dual-stream architecture to instantiate both the posterior distribution and the prior policy; after filtering out pseudo-efficient paths via cross-view evaluation, it unidirectionally transfers the posterior’s efficient patterns to the prior policy through variational distillation. Experiments on DeepSeek-R1-Distill-Qwen-1.5B and 7B scales demonstrate that VPG-EA improves the comprehensive efficiency metric ε³ by 8.73% and 12.37% over the strongest baselines on each model size, respectively.

[AI-205] DCVD: Dual-Channel Cross-Modal Fusion for Joint Vulnerability Detection and Localization

【速读】:该论文旨在解决现有软件漏洞检测方法中两个核心问题:单一信息源(序列、结构或语义)无法联合利用跨模态互补优势,以及语句级定位仅作为函数级检测的副产品而缺乏显式的行级监督。解决方案的关键在于提出 DCVD(Dual-Channel Cross-Modal Vulnerability Detection)统一框架,通过两个并行分支分别提取控制依赖(control-dependency)和语义(semantic)特征,并采用对比对齐(contrastive alignment)与双向交叉注意力(bidirectional cross-attention)机制弥合跨模态表示差异;同时,在函数级和语句级分别引入显式监督信号,实现双粒度协同优化。

链接: https://arxiv.org/abs/2605.11015
作者: Wenxin Tang,Wenbin Li,Junliang Liu,Jingyu Xiao,Xi Xiao,Mingzhe Liu,Jinlong Yang,Xuan Liu,Yuehe Ma,Wang Luo,Qing Li,Lei Wang,Peng Xiangli
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Software vulnerability detection plays a critical role in ensuring system security, where real-world auditing requires not only determining whether a function is vulnerable but also pinpointing the specific lines responsible. However, existing approaches either rely on a single information source – sequential, structural, or semantic – failing to jointly exploit the complementary strengths across modalities, or treat statement-level localization merely as a byproduct of function-level detection without explicit line-level supervision. To address these limitations, we propose DCVD (Dual-Channel Cross-Modal Vulnerability Detection), a unified framework that performs joint function-level detection and statement-level localization. DCVD extracts control-dependency and semantic features through two parallel branches and integrates them via contrastive alignment coupled with bidirectional cross-attention, effectively bridging the cross-modal representation gap. It further introduces explicit supervision signals at both the function and statement levels, enabling collaborative optimization across the two granularities. Extensive experiments on a large-scale real-world vulnerability benchmark demonstrate that DCVD consistently outperforms state-of-the-art methods on both function-level detection and statement-level localization. Our code is available at this https URL.

[AI-206] Backbone-Equated Diffusion OOD via Sparse Internal Snapshots

【速读】:该论文试图解决基于扩散的生成式模型在分布外(OOD)检测任务中,由于骨干网络选择、损坏参数化方式以及测试时间预算不同而导致的公平比较困难问题。解决方案的关键在于提出了一个互化骨干-对等(Mutualized Backbone-Equated, MBE)协议,该协议能够对齐不同扩散骨干网络之间的规范损坏程度和逻辑测试时间成本,从而构建统一的比较基准。在此协议下,论文进一步引入了规范特征快照(Canonical Feature Snapshots, CFS)——一类仅利用预训练扩散骨干在规范低噪声水平下少量原生内部激活的检测器,证明了OOD信号高度集中在稀疏的内部状态中,无需完整的去噪轨迹或高容量下游头即可实现高效检测,并通过局部诊断理论从条件编码器-解码器互补性、对角分数分离和低噪声损坏稳定性三个方面解释了这一现象。

链接: https://arxiv.org/abs/2605.11014
作者: Yadang Alexis Rouzoumka,Jean Pinsolle,Eugénie Terreaux,Christèle Morisseau,Jean-Philippe Ovarlez,Chengfang Ren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fair comparison between diffusion-based OOD detectors is challenging, as conclusions can vary with backbone choice, corruption parameterization, and test-time budget. We address this issue through a Mutualized Backbone-Equated (MBE) protocol that aligns canonical corruption levels and logical test-time cost across diffusion backbones. Within this setting, we introduce Canonical Feature Snapshots (CFS), a family of detectors that probes a frozen diffusion backbone using only a tiny number of native internal activations at canonical low-noise levels. On a controlled CIFAR-scale benchmark, the strongest one-forward CFS variant is CFS(1x2), while an even smaller decoder-only variant remains highly competitive. This shows that much of the relative-OOD signal exposed by frozen diffusion backbones is concentrated in a small number of sparse internal states, rather than requiring full denoising trajectories or high-capacity downstream heads. We further provide a local diagnostic theory explaining these observations through conditional encoder-decoder complementarity, diagonal-score separation, and low-noise corruption stability. The official implementation is available at this https URL.

[AI-207] LoopUS: Recasting Pretrained LLM s into Looped Latent Refinement Models

【速读】:该论文旨在解决现有通过循环计算(looped computation)扩展推理时测试计算量以提升大语言模型(LLM)推理性能的方法中存在的两大问题:一是需要从头训练循环模型(training recurrent models from scratch),二是需要改造预训练模型导致计算成本高且可能破坏预训练能力。解决方案的关键在于提出一种名为循环深度上采样(Looped Depth Up-Scaling, LoopUS)的后训练框架,其核心是将标准预训练LLM重构为编码器(encoder)、循环推理块(looped reasoning block)和解码器(decoder)的潜在精炼架构(latent-refinement architecture),并通过四个组件实现稳定转换:基于阶段性表征动态的块分解(block decomposition)、缓解隐藏状态漂移的输入依赖选择门(input-dependent selective gate)、支持长递归范围内存高效学习的随机深度监督(random deep supervision),以及用于自适应提前退出的置信度头(confidence head)。这些机制协同作用,在不扩展生成轨迹或从头进行循环训练的前提下,将非循环模型稳定地转换为循环形式,从而提升推理导向的性能。

链接: https://arxiv.org/abs/2605.11011
作者: Taekhyun Park,Yongjae Lee,Dohee Kim,Hyerim Bae
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Looped computation shows promise in improving the reasoning-oriented performance of LLMs by scaling test-time compute. However, existing approaches typically require either training recurrent models from scratch or applying disruptive retrofits, which involve substantial computational costs and may compromise pretrained capabilities. To address these limitations, we introduce Looped Depth Up-Scaling (LoopUS), a post-training framework that converts a standard pretrained LLM into a looped architecture. As a key technical contribution, LoopUS recasts the pretrained LLM into an encoder, a looped reasoning block, and a decoder. It operationalizes this latent-refinement architecture through four core components: (1) block decomposition, guided by staged representation dynamics; (2) an input-dependent selective gate to mitigate hidden-state drift; (3) random deep supervision for memory-efficient learning over long recursive horizons; and (4) a confidence head for adaptive early exiting. Collectively, these mechanisms transform a standard non-looped model into a looped form while stabilizing it against both computational bottlenecks and representation collapse. Through stable latent looping, LoopUS improves reasoning-oriented performance without extending the generated traces or requiring recurrent training from scratch. For more details, see this https URL
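以下 numpy 玩具草图示意摘要所述潜在精炼循环的三个要素:共享推理块的反复应用、抑制隐藏状态漂移的输入依赖门控,以及基于置信度头的自适应提前退出。所有权重均为随机初始化,仅作结构演示,并非 LoopUS 的实际实现。

```python
# Toy latent-refinement loop: random weights, purely structural illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(scale=0.2, size=(d, d))  # shared looped-block weights
g = rng.normal(size=d)                  # gate projection
c = rng.normal(size=d)                  # confidence head

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def looped_refine(h, max_loops=8, conf_threshold=0.9):
    """Apply the shared block repeatedly; exit once the confidence head fires."""
    for step in range(1, max_loops + 1):
        update = np.tanh(W @ h)              # one pass of the shared reasoning block
        gate = sigmoid(g @ h)                # input-dependent selective gate
        h = (1 - gate) * h + gate * update   # gated residual refinement
        if sigmoid(c @ h) > conf_threshold:  # confidence head: adaptive early exit
            return h, step
    return h, max_loops

h0 = rng.normal(size=d)
h, steps = looped_refine(h0)
print(steps)  # number of refinement loops actually used for this input
```

门控的凸组合形式使每步更新都是当前状态与新计算的插值,这正是摘要中用来缓解隐藏状态漂移的机制的简化版本。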

[AI-208] When and How to Canonize: A Generalization Perspective

【速读】:该论文试图解决使用群平均(group averaging)或规范(canonization)将非不变性骨干网络转化为不变性模型时,其理论泛化性质尚不明确的问题。解决方案的关键在于提出一个基于覆盖数(covering number)上界的理论框架,通过严谨分析建立泛化层次(generalization hierarchy):规范模型的误差界至多与结构不变模型和群平均模型相当,至差与非不变基线模型一致,且这一层次取决于规范的正则性(regularity)。在此基础上,针对点云处理中的置换群,论文严格证明了词典排序(lexicographical sorting)的覆盖数随点云维度指数增长,而希尔伯特曲线规范(Hilbert curve canonization)保证多项式增长,从而首次为希尔伯特曲线序列化在先进点云架构中的经验成功提供了理论依据。

链接: https://arxiv.org/abs/2605.11008
作者: Yonatan Sverdlov,Benjamin Friedman,Snir Hordan,Nadav Dym
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While invariant architectures are standard for processing symmetric data, there is growing interest in achieving invariance by applying group averaging or canonization to non-invariant backbones. However, the theoretical generalization properties of these alternative strategies remain poorly understood. We introduce a theoretical framework to analyze the generalization error of these methods by bounding their covering numbers. We establish a rigorous generalization hierarchy: the error bounds of canonized models are at best equal to the error bounds of structurally invariant and group-averaged models, and at worst equal to the bounds of non-invariant baselines. Furthermore, we show that there exist optimal canonizations which attain the optimal error bounds, and poor canonizations which attain the non-invariant error bounds, and that this depends on the regularity of the canonization. Finally, applying this framework to permutation groups in point cloud processing, we rigorously prove that the covering number of lexicographical sorting grows exponentially with point cloud dimension, whereas Hilbert curve canonization guarantees polynomial growth. This provides the first formal theoretical justification for the empirical success of Hilbert curve serialization in state-of-the-art point cloud architectures. We conclude with experiments that support our theoretical claims. Code is available at this https URL
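下面用几行 numpy 代码示意"规范化(canonization)"在置换群下的作用:先把点云排成一个规范顺序,再输入任意(非不变性)骨干网络,即可获得置换不变性。此处演示的是文中分析的词典排序规范化;论文的结论是该规范化的覆盖数随维度指数增长,而希尔伯特曲线规范则保证多项式增长。

```python
# Lexicographic canonization of a point cloud: any backbone applied after
# this step becomes permutation-invariant, since all permutations of the
# same cloud map to the same canonical ordering.
import numpy as np

def lex_canonize(points):
    """Sort points lexicographically: by first coordinate, then second, ..."""
    order = np.lexsort(points.T[::-1])  # lexsort uses the LAST key as primary
    return points[order]

rng = np.random.default_rng(2)
cloud = rng.normal(size=(6, 3))
shuffled = cloud[rng.permutation(6)]

print(np.allclose(lex_canonize(cloud), lex_canonize(shuffled)))  # True
```
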

[AI-209] RT-Transformer: The Transformer Block as a Spherical State Estimator

【速读】:该论文试图解决的问题是:解释Transformer架构中注意力(attention)、残差连接(residual connections)和归一化(normalization)这些核心组件的起源与内在联系,它们是否源自于一个统一的基本原理,而非独立的经验设计选择。解决方案的关键在于将潜在状态建模为超球面(hypersphere)上的方向,并定义当前估计切平面(tangent plane)上的噪声,从而得到一个精度加权(precision-weighted)的方向推理过程。在这一几何框架下,注意力机制自然地起到聚合证据的作用,残差连接对应增量状态更新,而归一化则负责将更新后的状态投影回超球面,三者共同源于同一估计问题的几何结构。

链接: https://arxiv.org/abs/2605.11007
作者: Peter Racioppo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We show that the core components of the Transformer block – attention, residual connections, and normalization – arise naturally from a single geometric estimation problem. Modeling the latent state as a direction on the hypersphere, with noise defined in the tangent plane at the current estimate, yields a precision-weighted directional inference procedure in which attention aggregates evidence, residual connections implement incremental state updates, and normalization retracts the updated state back onto the hypersphere. Together, these components follow from the geometry of the estimation problem rather than being introduced as independent architectural choices.
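以下 numpy 草图对应摘要描述的单步更新:注意力聚合证据、残差实现增量状态更新、归一化将结果收回单位超球面。权重与上下文均为随机示例,仅示意几何结构。

```python
# One step of the spherical state-estimator view of a Transformer block:
# attend, residually update, then retract onto the unit hypersphere.
import numpy as np

rng = np.random.default_rng(1)
d = 8

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def spherical_update(h, context):
    """Attention aggregates evidence, residual updates the state,
    normalization retracts the estimate back onto the sphere."""
    weights = softmax(context @ h)        # attention: similarity to current estimate
    evidence = weights @ context          # attention-weighted evidence aggregation
    h_new = h + evidence                  # residual connection: incremental update
    return h_new / np.linalg.norm(h_new)  # normalization: retraction to the sphere

h = rng.normal(size=d)
h /= np.linalg.norm(h)
context = rng.normal(size=(5, d))
h = spherical_update(h, context)
print(np.linalg.norm(h))  # unit norm up to floating point
```
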

[AI-210] An Execution-Verified Multi-Language Benchmark for Code Semantic Reasoning

【速读】:该论文试图解决现有代码基准无法有效评估大型语言模型 (Large Language Models, LLMs) 从源代码中恢复运行时调用结构(即执行相关程序结构)这一能力缺陷问题——现有基准仅关注通过测试的输出,缺乏对程序语义推理的诊断信号。解决方案的关键在于提出了 TraceEval,据我们所知这是第一个经过执行验证的多语言代码语义推理 (code semantic reasoning) 基准:它通过 LLM 辅助的测试生成流水线 (LLM-assisted harness-generation pipeline) 与跟踪验证 (tracer validation),从 1,600+ 开源仓库中提取 10,583 个真实世界程序(覆盖 Python、JavaScript、Java),确保每个正调用边都由实际执行机械见证,从而消除了传统基于静态工具或人工标注的调用图 (call graph) 基准中的标注不一致与标签噪声。此外,该方案还附带可复现的流水线,允许将任意开源仓库转化为新的验证基准实例,并提供了训练集用于模型微调,显著提升模型在调用结构恢复任务上的 F1 分数。

链接: https://arxiv.org/abs/2605.11006
作者: Yikun Li,Jinfeng Jiang,Ting Zhang,Chengran Yang,Chenxing Zhong,Yin Yide,Leow Wen Bin,Eng Lieh Ouh,Lwin Khin Shar,David Lo
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating whether large language models (LLMs) can recover execution-relevant program structure, rather than only produce code that passes tests, remains an open problem. Existing code benchmarks emphasize test-passing outputs, from standalone programming tasks (HumanEval, MBPP, LiveCodeBench) to repository repair (SWE-Bench); this is useful, but offers limited diagnostic signal about which program semantics a model can recover from source. We introduce TraceEval, to our knowledge the first execution-verified, multi-language benchmark for code semantic reasoning: recovering a program’s runtime call structure from source code. Unlike prior call-graph benchmarks that rely on static-tool output or hand-annotated ground truth, every positive edge in TraceEval is mechanically witnessed by validation execution, eliminating annotator disagreement and label noise for observed behavior. TraceEval consists of (i) 10,583 real-world programs (2,129 test, 8,454 train) extracted from 1,600+ open-source repositories across Python, JavaScript, and Java via an LLM-assisted harness-generation pipeline with tracer validation; and (ii) a reproducible pipeline that converts any open-source repository into new verified benchmark instances. We evaluate 10 LLMs at zero-shot on the held-out test split. The strongest model, Claude-Opus-4.6, reaches an average F1 of 72.9% across the three languages. To demonstrate the train split’s utility as a supervision substrate, we fine-tune the Qwen2.5-Coder family on it: lifts of up to +55.6 F1 bring tuned Qwen2.5-Coder-32B to 71.2%, within 1.7 F1 of zero-shot Claude-Opus-4.6. We release the benchmark, pipeline, baselines, and a datasheet at this https URL
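作为说明,下面的小函数演示一种与该基准设定一致的评分方式:对"预测调用边 vs. 执行见证的调用边"计算 F1。具体评分细节以论文为准,此处仅为示意。

```python
# Illustrative scoring for call-structure recovery: F1 over predicted vs.
# execution-witnessed caller->callee edges (details are an assumption).
def edge_f1(pred_edges, gold_edges):
    """F1 between two sets of directed call edges."""
    pred, gold = set(pred_edges), set(gold_edges)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [("main", "load"), ("main", "run"), ("run", "step")]
pred = [("main", "load"), ("run", "step"), ("run", "log")]
print(round(edge_f1(pred, gold), 3))  # 0.667
```
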

[AI-211] DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

【速读】:该论文试图解决混合专家(Mixture-of-Experts, MoE)架构中专家并行(Expert Parallelism, EP)训练策略带来的严重 all-to-all 通信瓶颈问题,尤其是在模型规模增大导致专家跨 GPU 节点分布时,节点间网络带宽有限使得通信开销进一步恶化,而先前通过重叠 all-to-all 通信与前馈网络(FFN)和自注意力计算的方法,因注意力和 FFN 层的计算-通信比率固有失衡,仍会残留网络受限的停滞。解决方案的关键在于 DisagMoE 系统提出的解耦式设计:将注意力层和 FFN 层分离到不相交的 GPU 组中,引入一种带单向多对多通信的多阶段流水线,并利用计算-通信 roofline 模型来均衡注意力和 FFN 组之间的 GPU 与网络带宽分配,从而最大化训练效率。

链接: https://arxiv.org/abs/2605.11005
作者: Zhichen Zeng,Chi-Chih Chang,Jiayi Wang,Zezhou Wang,Ningxin Zheng,Zheng Zhong,Cesar A. Stuardo,Dongyang Wang,Mohamed S. Abdelfattah,Haibin Lin,Banghua Zhu,Ang Li,Ziheng Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Mixture-of-experts (MoE) architectures enable trillion-parameter LLMs with sparsely activated experts. Expert parallelism (EP) is a widely adopted MoE training strategy, but it suffers from severe all-to-all communication bottlenecks, which are exacerbated by the limited inter-node network bandwidth as growing model sizes require distributing experts across GPU nodes. Prior work focused on overlapping these all-to-all communications with feed-forward network (FFN) and self-attention computations, which often leaves residual network-bound stalls due to the inherent imbalance in attention and FFN layers’ computation-communication ratios. We present DisagMoE, a disaggregated MoE training system that jointly optimizes model placement and scheduling for maximal efficiency. DisagMoE separates attention and FFN layers into disjoint GPU groups, introduces a multi-stage pipeline with uni-directional, many-to-many communications, and employs a computation-communication roofline model to balance GPU and network bandwidth allocation between the attention and FFN groups. DisagMoE is implemented on Megatron-LM, and evaluation shows that it improves training efficiency across multiple MoE models with up to 1.8x speedup on 16-node 8xH800 clusters.
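
DisagMoE 的核心是用计算-通信 roofline 模型在 attention 组与 FFN 组之间均衡 GPU 与网络带宽。roofline 的基本思想可用如下通用示意说明(数字与函数均为假设,非论文实现):

```python
def stage_time(compute_flops, gpu_flops, comm_bytes, net_bw):
    """计算-通信完全重叠时,一个流水阶段的耗时由两者中的瓶颈项决定。"""
    return max(compute_flops / gpu_flops, comm_bytes / net_bw)

# attention 组偏计算密集;FFN 组的 all-to-all 通信占比高(数字纯属示意)
attn = stage_time(4e12, 2e12, 1e9, 50e9)    # 2.0s,计算受限
ffn = stage_time(1e12, 2e12, 200e9, 50e9)   # 4.0s,网络受限
print(attn, ffn)  # 向网络受限的组分配更多带宽可缩小两组间的失衡
```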

[AI-212] The Authorization-Execution Gap Is a Major Safety and Security Problem in Open-World Agents

【速读】:该论文旨在解决开放世界智能体(open-world agents)中授权-执行差距(Authorization-Execution Gap, AEG)这一重大安全与安保问题。AEG指主体(principal)意图授权的内容与智能体最终执行内容之间的偏差,由于智能体跨工具、持久状态和多智能体交接自主行动,即使微小的授权偏差也可能导致难以甚至无法逆转的危害。解决方案的关键在于采用面向源头(source-oriented)的诊断与防御方法,即识别AEG的三个结构性来源:委托层级不完整性(delegation-level incompleteness)、通道层级污染(channel-level corruption)和组合层级碎片化(composition-level fragmentation)。由于这些结构性来源在执行过程中动态出现,防御必须依赖执行期间的授权完整性检查(authorization integrity checks),而非仅依靠一次性的前置过滤或事后审计。

链接: https://arxiv.org/abs/2605.11003
作者: Baoyuan Wu,Qingshan Liu,Adel Bibi,Irwin King,Siwei Lyu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This position paper argues that the Authorization-Execution Gap (AEG) is a major safety and security problem in open-world agents. The AEG is the divergence between what a principal intends to authorize and what an open-world agent ultimately executes. Because such agents act autonomously across tools, persistent state, and multi-agent handoffs, even small instances of authorization divergence can cause harm that is difficult or impossible to undo. We argue that many observed agent failures can be traced to three structural sources of AEG: delegation-level incompleteness, channel-level corruption, and composition-level fragmentation. The same observed failure may arise from any of these sources. Without identifying the source, a defense targeting the symptom alone cannot address the underlying cause. Agent safety and security should therefore emphasize source-oriented diagnosis and defense. Because the structural sources of AEG arise dynamically during execution, this approach necessarily requires authorization integrity checks applied during execution, rather than relying solely on one-shot upfront filtering or post-hoc audit. For NeurIPS, the implication is that papers on open-world agents should report not only outcome-level metrics such as task success or attack resistance, but also process-level evidence showing where AEG was detected, constrained, and attributed to a structural source during execution.

[AI-213] MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

【速读】:该论文旨在解决多轮越狱攻击(multi-turn jailbreaks)评估中因实验条件(如预算、评判、重试规则、策略生成过程等)不一致而导致的性能比较混淆问题,即无法区分性能提升是源于更强的攻击机制还是不同的实验设置。解决方案的关键在于提出了一个模块化评估框架MT-JailBench,该框架将每个攻击方法分解为五个交互模块:评估函数(evaluation function)、攻击策略(attack strategy)、提示生成(prompt generation)、提示优化(prompt refinement)和流程控制(flow control)。通过固定实验条件并实现组件级分析,MT-JailBench能够公平比较不同攻击方法,并揭示资源预算和评估函数是主要混淆因素,提示生成模块对性能变化贡献最大,而提示优化和流程控制仅带来适度增益,同时发现静态策略的随机采样可媲美显式动态策略生成,最终通过重组最优组件可得到更强的攻击配置。

链接: https://arxiv.org/abs/2605.11002
作者: Xinkai Zhang,Zhipeng Wei,Huanli Gong,Jing Ting Zheng,Yuchen Zhang,Yue Dong,N. Benjamin Erichson
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-turn jailbreaks exploit the ability of large language models to accumulate and act on conversational context. Instead of stating a harmful request directly, an attacker can gradually steer the conversation toward an unsafe answer. Recent methods demonstrate this risk, but they are usually evaluated as black-box pipelines with different budgets, judges, retry rules, and strategy generation procedures. As a result, it is often unclear whether reported gains reflect stronger attack mechanisms or different experimental conditions. We introduce MT-JailBench, a modular evaluation framework for benchmarking multi-turn jailbreaks under fixed conditions. MT-JailBench implements each attack as five interacting modules: evaluation function, attack strategy, prompt generation, prompt refinement, and flow control. This design enables fair comparison across attack methods and component-wise analysis of what drives attack success. Using MT-JailBench, we find that resource budgets and evaluation functions are major confounders: controlling turns, retries, interactions, sampled strategies, and judges substantially change the ranking of attacks. At the component level, prompt generation accounts for most performance variation, while refinement and flow control provide moderate gains. We also find that explicit dynamic strategy generation is not always necessary; stochastic sampling from a fixed strategy can rival more elaborate diversification mechanisms. Finally, recomposing the best components yields a strong attack configuration that outperforms its source attacks and generalizes across diverse target LLMs. MT-JailBench therefore provides a modular framework for comparing multi-turn jailbreaks, understanding the impact of components, and guiding stronger red-teaming evaluations.

[AI-214] Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

【速读】:该论文旨在揭示基于偏好优化的微调(如Direct Preference Optimization, DPO)在削弱大语言模型安全对齐方面存在一种比监督微调(Supervised Fine-Tuning, SFT)更强且更难审计的失败模式。其关键解决方案是提出一种真正良性的DPO攻击手段:仅使用10个无害的偏好对(即OpenAI微调服务可接受的最小数据量),每个偏好对包含一个良性提示、一个正常有帮助的回答作为首选响应(preferred response)以及一个拒绝作为非首选响应(dispreferred response)。这些数据在外观上与合法用户为减少过度拒绝而提交的微调请求几乎无法区分,从而无法从请求本身推断恶意意图。然而,由于DPO直接优化模型倾向于给出有帮助的回答而非拒绝,这一看似无害的目标会系统性抑制拒绝行为,并将这种效应泛化到微调数据之外的恶意提示上,从而以极低成本实现高攻击成功率。

链接: https://arxiv.org/abs/2605.10998
作者: Sangyeon Yoon,Wonje Jeung,Yoonjun Cho,Dongjae Jeon,Albert No
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) introduces a stronger and harder-to-audit failure mode. We propose a truly benign DPO attack using only 10 harmless preference pairs, the minimum data scale accepted by OpenAI’s fine-tuning service. Each pair contains a benign prompt, a normal helpful answer as the preferred response, and a refusal as the dispreferred response. Unlike prior benign fine-tuning attacks, our data exhibits no suspicious behavior: it is practically indistinguishable from the fine-tuning request of a legitimate user seeking to reduce over-refusal, making harmful intent almost impossible to infer from the request alone. Nevertheless, because DPO directly optimizes the model to prefer helpful answers over refusals, this seemingly benign objective broadly suppresses refusal behavior and transfers to harmful prompts outside the fine-tuning data. Across OpenAI models supporting DPO fine-tuning, our attack achieves attack success rates of 59.13% on GPT-4o, 70.20% on GPT-4.1, 54.80% on GPT-4.1-mini, and 81.73% on GPT-4.1-nano, at costs of only $1.7, $1.7, $0.3, and $0.1. Moreover, on open-weight models that do not impose minimum data requirements, we find that this effect can emerge from even a single benign preference pair.
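
上述攻击之所以有效,是因为 DPO 目标直接拉大被选与被拒回答的隐式奖励差距。标准 DPO 损失可写成如下最小示意(通用公式,非该攻击的复现代码):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """标准 DPO 损失:-log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))。"""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# 当策略模型相对参考模型更偏好 chosen(这里是“有帮助的回答”)时,损失下降;
# 把拒绝放在 rejected 一侧,优化就会系统性压低拒绝的概率。
hi = dpo_loss(-5.0, -20.0, -10.0, -10.0)   # chosen 相对参考模型提升
lo = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # 无偏好差异,损失为 ln 2
print(hi < lo)  # True
```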

[AI-215] Towards Scalable Persistence-Based Topological Optimization

【速读】:该论文试图解决基于持久同源性的拓扑优化中存在的两个耦合问题:持久同源性通常在子样本上计算,导致拓扑梯度高度稀疏,只有少数锚点(anchor points)获得非零梯度更新,限制了优化的可扩展性和效率。解决方案的关键在于两部分改进:一是引入随机切分(random slicing)子采样策略,该方案能够提升迭代过程中的几何覆盖(geometric coverage)并减轻密度偏差(density bias);二是采用快速Nadaraya-Watson(NW)高斯卷积替代昂贵的核函数求解,以极低的计算代价生成全局光滑的更新场,该操作不仅避免了耗时的RKHS插值,而且更适应拓扑优化任务的特点。论文提供了NW平滑的理论保障,包括锚点逼近界和全局Lipschitz估计,实验表明结合随机切分与NW平滑相比其他基线在常见持久性损失上能获得一致的加速和更优的目标值。

链接: https://arxiv.org/abs/2605.10996
作者: Abderrahim Bendahi,Alexandre Duplessis,Arnaud Fickinger
机构: 未知
类目: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI); Graphics (cs.GR); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Persistence-based topological optimization deforms a point cloud X ⊂ ℝ^d by minimizing objectives of the form L(X) = ℓ(Dgm(X)), where Dgm(X) is a persistence diagram. In practice, optimization is limited by two coupled issues: persistent homology is typically computed on subsamples, and the resulting topological gradients are highly sparse, with only a few anchor points receiving nonzero updates. Motivated by diffeomorphic interpolation, which extends sparse gradients to smooth ambient vector fields via Reproducing Kernel Hilbert Space (RKHS) interpolation, we propose a more scalable pipeline that improves both subsampling and gradient extension. We introduce subsampling via random slicing, a lightweight scheme that promotes iteration-wise geometric coverage and mitigates density bias. We further replace the costly kernel solve with a fast Nadaraya-Watson (NW) Gaussian convolution, producing a globally defined smooth update field at a fraction of the computational cost, while being more suited for topological optimization tasks. We provide theoretical guarantees for NW smoothing, including anchor approximation bounds and global Lipschitz estimates. Experiments in 2D and 3D show that combining random slicing with NW smoothing yields consistent speedups and improved objective values over other baselines on common persistence losses.
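
摘要中用 Nadaraya-Watson 高斯卷积把少数锚点上的稀疏梯度扩展为全局光滑的更新场,其基本形式可用如下简化示意说明(假设性实现,带宽等参数为虚构):

```python
import numpy as np

def nw_smooth(points, anchors, anchor_grads, bandwidth=0.5):
    """Nadaraya-Watson 高斯核回归:将锚点处的稀疏梯度平滑为全局向量场。
    points: (n, d);anchors: (m, d);anchor_grads: (m, d)。"""
    d2 = ((points[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # (n, m) 平方距离
    w = np.exp(-d2 / (2 * bandwidth ** 2))                           # 高斯权重
    w /= w.sum(axis=1, keepdims=True)                                # 归一化
    return w @ anchor_grads                                          # 加权平均得到光滑更新场

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
anchors = X[:5]                        # 仅 5 个锚点携带非零拓扑梯度
grads = rng.normal(size=(5, 2))
field = nw_smooth(X, anchors, grads)
print(field.shape)  # (100, 2):点云中每个点都获得更新
```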

[AI-216] Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures

【速读】:该论文旨在解决大语言模型(LLM)个性化生成中,标准奖励模型在测试时计算扩展(Test-Time Personalization, TTP)过程中无法有效利用采样候选数增长带来的性能潜力的问题。具体而言,通过从个性化策略模型中采样N个候选并用奖励模型选择最佳,理论上期望效用应随候选数对数增长,但标准奖励模型会遭遇两种失败模式:用户级塌缩(user-level collapse,对某些用户预测近乎恒定)和查询级奖励黑客(query-level reward hacking,奖励与真实质量负相关)。解决方案的关键在于提出了一种概率个性化奖励模型(probabilistic personalized reward model),该模型通过学习到的方差(variance)动态调节奖励分布的置信度,从而同时缓解上述两种失败模式,使TTP框架能真实逼近对数增长的理论上限。此外,论文推导的统一缩放定律(unified scaling law)将Best-of-N曲线分解为四个可测量量,为诊断和指导奖励模型设计提供了理论依据。

链接: https://arxiv.org/abs/2605.10991
作者: Linhai Zhang,Yulan He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single-shot process. In this work, we study Test-Time Personalization (TTP) along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model. We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model’s Best-of-N curve into four measurable quantities and reveals two failure modes, user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance effectively mitigates both failure modes. Experiments confirm both elements of our framework: TTP delivers consistent scaling across multiple policy models and personalized text generation tasks, and our scaling law closely matches observed scaling curves across reward-model variants.
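
摘要中的理论上限来自一个基本事实:N 个独立候选的期望最大效用随 N 对数级增长。下面用标准正态效用做一个与论文设定无关的数值演示(纯示意):

```python
import numpy as np

def best_of_n_utility(n, trials=20000, seed=0):
    """模拟 oracle 从 N 个候选中选最优时的期望效用(效用 ~ N(0,1))。"""
    rng = np.random.default_rng(seed)
    samples = rng.normal(size=(trials, n))
    return samples.max(axis=1).mean()

for n in [1, 4, 16, 64]:
    print(n, round(best_of_n_utility(n), 3))
# 期望最大值近似按 sqrt(2 ln N) 增长:N 翻倍带来的收益递减,呈对数级曲线
```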

[AI-217] Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

【速读】:该论文旨在解决大型语言模型代理(LLM agent)在依赖可重用技能库时面临的无声技能漂移(skill drift)问题——即外部服务、包、API及配置的演化会导致技能无声地退化,而现有监控手段仅在值粒度上检测变化,无法区分值在技能中所扮演的角色(例如版本字符串在注释中是噪音,在锁定依赖中则是操作义务)。解决方案的关键是将技能漂移形式化为合同违反(contract violation),并引入名为\sgname的方法:它从技能文档中提取可执行的环境合同(executable environment contracts),仅验证那些承载角色(role-bearing)的假设与实际或已知条件是否匹配。这一区分将噪声监控转变为精度优先的维护信号,在无漂移和硬负例测试中实现了零假报警,并显著提升了漂移发现精度与修复效率。

链接: https://arxiv.org/abs/2605.10990
作者: Linfeng Fan,Yuan Tian,Ziwei Li,Zhiwu Lu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM agents increasingly rely on reusable skill libraries, but these skills silently decay as the external services, packages, APIs, and configurations they reference evolve. Existing monitors detect such changes at the wrong granularity: they observe values, not the role those values play in a skill. A version string in a comment is noise; the same string in a pinned dependency is an operational obligation. We formulate skill drift as contract violation and introduce \sgname, which extracts executable environment contracts from skill documents and validates only those role-bearing assumptions against known or live conditions. This distinction turns noisy monitoring into a precision-first maintenance signal. Contract-free CI probes produce 40% false positives, while \sgname raises zero false alarms over 599 no-drift and hard-negative cases (Wilson 95% CI [0, 0.6]%). In known-drift verification, \sgname achieves 100% precision and 76% recall with the strongest backbone; in a pre-registered study over 49 real skills, it discovers live drift with 86% conservative precision. Violated contracts also make repair actionable, improving one-round success from 10% without localization to 78%. We release \dbname, an 880-pair benchmark for skill degradation.
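
“可执行环境合同”的最小形态,可以理解为从技能文档抽取锁定依赖并与实际环境比对。以下为一个高度简化的示意(正则与函数名均为假设,并非论文系统的实现):

```python
import re

def extract_contracts(skill_doc):
    """从技能文档中抽取形如 'package==x.y.z' 的锁定依赖作为合同(简化示意)。"""
    return dict(re.findall(r"([A-Za-z_][\w-]*)==([\d.]+)", skill_doc))

def check_contracts(contracts, env):
    """返回被违反的合同:(包名, 期望版本, 实际版本);实际版本缺失记为 None。"""
    return [(pkg, want, env.get(pkg)) for pkg, want in contracts.items()
            if env.get(pkg) != want]

doc = "This skill requires requests==2.31.0 and pyyaml==6.0.1 to parse configs."
env = {"requests": "2.32.0", "pyyaml": "6.0.1"}   # 当前环境的已知版本
print(check_contracts(extract_contracts(doc), env))
# [('requests', '2.31.0', '2.32.0')]:版本漂移被标记为合同违反,可直接用于修复定位
```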

[AI-218] SURGE: Surrogate Gradient Adaptation in Binary Neural Networks ICML2026

【速读】:该论文旨在解决二元神经网络(BNN)训练中因二值化操作(如符号函数)不可微而导致的梯度近似问题,现有方法如直通估计器(STE)及其改进变体依赖手工设计,存在梯度不匹配以及固定范围梯度裁剪带来的信息丢失。解决方案的关键在于提出一种名为SURrogate GradiEnt Adaptation (SURGE)的可学习梯度补偿框架,其核心包含两个组成部分:一是双路径梯度补偿器(DPGC),通过为每个二值化层构建一个并行全精度辅助分支,在反向传播时利用输出分解解耦梯度流,使全精度分支能够估计超越STE一阶近似的梯度分量,从而降低梯度估计偏差;二是自适应梯度缩放器(AGS),基于最优尺度因子通过范数缩放动态平衡两分支间的梯度贡献,以提升训练稳定性。

链接: https://arxiv.org/abs/2605.10989
作者: Haoyu Huang,Boyu Liu,Linlin Yang,Yanjing Li,Yuguang Yang,Xuhui Liu,Canyu Chen,Zhongqian Fu,Baochang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as a poster at the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operations (e.g., sign function). However, prevailing methods including the Straight-Through Estimator (STE) and its improved variants, rely on hand-crafted designs that suffer from gradient mismatch problem and information loss induced by fixed-range gradient clipping. To address this, we propose SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation framework with theoretical grounding. SURGE mitigates gradient mismatch through auxiliary backpropagation. Specifically, we design a Dual-Path Gradient Compensator (DPGC) that constructs a parallel full-precision auxiliary branch for each binarized layer, decoupling gradient flow via output decomposition during backpropagation. DPGC enables bias-reduced gradient estimation by leveraging the full-precision branch to estimate components beyond STE’s first-order approximation. To further enhance training stability, we introduce an Adaptive Gradient Scaler (AGS) based on an optimal scale factor to dynamically balance inter-branch gradient contributions via norm-based scaling. Experiments on image classification, object detection, and language understanding tasks demonstrate that SURGE performs best over state-of-the-art methods.
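
作为背景,SURGE 所要改进的 STE 基线可概括为:前向用 sign 二值化,反向用恒等近似并做固定区间梯度裁剪。一个不依赖深度学习框架的最小示意(假设性实现):

```python
import numpy as np

def ste_forward(w):
    """前向:将权重二值化到 {-1, +1}。"""
    return np.where(w >= 0, 1.0, -1.0)

def ste_backward(w, grad_out, clip=1.0):
    """反向:sign 不可导,STE 用恒等近似,并把 |w| > clip 位置的梯度裁剪为 0,
    这正是摘要所说“固定范围梯度裁剪导致信息丢失”的来源。"""
    return grad_out * (np.abs(w) <= clip)

w = np.array([-1.5, -0.3, 0.2, 2.0])
print(ste_forward(w))               # [-1. -1.  1.  1.]
print(ste_backward(w, np.ones(4)))  # [0. 1. 1. 0.]:越界权重收不到梯度
```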

[AI-219] Seeing the Needle in the Haystack: Towards Weakly-Supervised Log Instance Anomaly Localization via Counterfactual Perturbation

【速读】:该论文旨在解决大规模网络系统中日志数据量庞大且实例级标注成本极高导致的细粒度异常定位困难问题。解决方案的关键在于提出了LogMILP(Log anomaly localization based on Multi-Instance Learning enhanced by prototypes and Perturbation)框架,这是一种基于多实例学习(Multi-Instance Learning)的弱监督方法,仅利用包级(bag-level)标签即可同时实现包级异常检测与实例级(instance-level)异常定位。其核心创新在于通过原型引导的结构化建模(prototype-guided structural modeling)结合反事实扰动一致性正则化(counterfactual perturbation consistency regularization),引导模型精准识别关键日志条目,从而在粗粒度监督下显著提升定位的可靠性与可解释性。

链接: https://arxiv.org/abs/2605.10988
作者: Yutszyuk Wong,Wentai Wu,Yuen-Ying Yeung,Weiwei Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages,2 figures

点击查看摘要

Abstract:Log anomaly detection is a critical task for system operations and security assurance. However, in networked systems at scale, log data are generated at massive scale while instance-level annotations are prohibitively expensive, posing great difficulties to fine-grained anomaly localization. To address this challenge, we propose LogMILP (Log anomaly localization based on Multi-Instance Learning enhanced by prototypes and Perturbation), a weakly supervised framework that enables both bag-level anomaly detection and instance-level anomaly localization using only bag-level labels. Our method guides the model to pinpoint the critical log entries using prototype-guided structural modeling with counterfactual perturbation consistency regularization, thereby improving localization reliability and interpretability under coarse-grained supervision. Experimental results on three public datasets demonstrate that LogMILP achieves competitive detection performance while yielding significantly more reliable instance-level localization. Our code is open-sourced at this https URL.
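
多实例学习在仅有包级标签时的常见做法是注意力池化:注意力权重既用于聚合,也可作为实例级定位信号。以下为通用的注意力 MIL 池化示意(假设性实现,非 LogMILP 代码):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mil_attention_pool(instances, w_attn):
    """对包内实例打分 -> softmax 注意力 -> 加权聚合为包级表示。
    注意力权重本身即可作为实例级定位的弱监督信号。"""
    scores = instances @ w_attn          # (n,) 每条日志实例的打分
    attn = softmax(scores)               # 归一化注意力
    bag_repr = attn @ instances          # 包级表示,仅用包级标签即可训练
    return bag_repr, attn

rng = np.random.default_rng(1)
bag = rng.normal(size=(8, 4))            # 一个含 8 条日志实例的包
w = rng.normal(size=4)
repr_, attn = mil_attention_pool(bag, w)
print(attn.argmax())  # 注意力最高的实例可视为候选异常位置
```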

[AI-220] AESOP: Adversarial Execution-path Selection to Overload Deep Learning Pipelines

【速读】:该论文试图解决现代机器学习部署中由多个专门模型组成的动态推理管道(inference pipelines)所面临的效率攻击面(efficiency-attack surface)问题:由于上游组件产生的中间预测决定了下游组件的负载和输入,攻击者可以利用路径选择(path selection)来大幅增加整体计算成本(如FLOPs和延迟),而现有针对单模型的攻击方法无法利用这一结构优势。解决方案的关键在于将这一攻击目标形式化为对抗路径选择问题(adversarial path-selection problem),并提出了AESOP框架,该框架通过漏洞引导的路径排名(vulnerability-guided path ranking)识别最易被利用的管道路径,再结合自适应损失加权(adaptive loss weighting)生成扰动,从而在相同输入和预算下实现远超单模型基线的效率膨胀(白盒设置中FLOPs膨胀达2407倍,延迟膨胀达419倍),并且即使在引入批处理、缓冲和置信度阈值等系统级防御后,攻击仍能迫使管道在吞吐量崩溃与数据丢失之间进行权衡。

链接: https://arxiv.org/abs/2605.10987
作者: Tingxi Li,Mingfang Ji,Ravishka Shemal Rathnasuriya,Simin Chen,Yitao Hu,Wei Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Modern machine learning deployments increasingly compose specialized models into dynamic inference pipelines, where upstream components produce intermediate predictions that determine the workload and inputs of downstream components. The cost of processing an input is therefore not determined by any single model, but by two coupled factors: the per-inference cost of each invoked component and its workload volume. Because these pipelines run under hard real-time constraints, efficiency is a fundamental requirement for system availability. We show that this structure creates an efficiency-attack surface that existing methods targeting single models cannot exploit: on identical inputs and budgets, path-aware targeting inflates FLOPs by 2,407× while the strongest single-model baseline achieves 117× – a 20× gap attributable entirely to where the attack is directed. We formalize this as the adversarial path-selection problem and present AESOP, a framework combining vulnerability-guided path ranking with adaptive loss weighting. We evaluate AESOP on five pipelines plus a production-realistic deployment variant with batching, bounded buffering, and confidence-threshold defenses. AESOP achieves up to 2,407× FLOPs and 419× latency inflation in the white-box setting and 58× FLOPs / 17× latency in gray-box settings. Under system-level defenses, the attack is not neutralized but redirected: pipelines are forced to choose between throughput collapse (0.578 → 0.006 input/s) and 96.7% data loss to sustain throughput.

[AI-221] Structural Interpretations of Protein Language Model Representations via Differentiable Graph Partitioning NEURIPS2026

【速读】:该论文旨在解决蛋白质语言模型(如ESM-2)虽然能高效预测功能但因结构性和进化信号编码于密集潜在空间而难以提供可解释性特征的问题。解决方案的关键是提出一个即插即用(plug-and-play)框架:将ESM-2的表征投影到蛋白质接触图(contact graph)上,并应用SoftBlobGIN——一种集成可微分Gumbel-softmax子结构池化的轻量级图同构网络(Graph Isomorphism Network),执行结构感知的消息传递(message passing)以学习粗粒度的功能子结构,从而在不重新训练语言模型、仅增加约1.1M参数的情况下实现高精度预测,同时通过GNNExplainer生成可直接审计的结构性解释(如活性位点、功能簇和催化接触模式),并通过自动分组残基为功能子结构(blob)提供额外可解释性。

链接: https://arxiv.org/abs/2605.10985
作者: Siddhant Dutta,Edward Tan Beng Wai,Soumick Sarker,Pasan Gunawardane,Jagath C. Rajapakse
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注: 19 Pages, 8 figures, 11 Tables, Submitted to NeurIPS 2026

点击查看摘要

Abstract:Protein language models such as ESM-2 learn rich residue representations that achieve strong performance on protein function prediction, but their features remain difficult to interpret as structural and evolutionary signals are encoded in dense latent spaces. We propose a plug-and-play framework that projects ESM-2 representations onto protein contact graphs and applies SoftBlobGIN, a lightweight Graph Isomorphism Network with differentiable Gumbel-softmax substructure pooling, to perform structure-aware message passing and learn coarse functional substructures for downstream prediction tasks. Across enzyme classification, SoftBlobGIN achieves 92.8% accuracy and 0.898 macro-F1. Unlike post hoc analysis of protein language models alone, our method produces directly auditable structural explanations: GNNExplainer recovers biologically meaningful active-site residues, spatially localized functional clusters, and catalytic contact patterns. On binding-site detection, SoftBlobGIN improves residue AUROC from 0.885 using an ESM-2 linear probe to 0.983, indicating that these structural explanations are not recoverable from language-model features alone. Learned blob partitions provide an additional layer of interpretability by automatically grouping residues into functional substructures, with blobs containing annotated active-site residues showing 1.85× higher importance than other blobs (ρ = 0.339, p = 0.009), without any active-site supervision. Our framework requires no retraining of the language model, adds only ~1.1M parameters, and generalises across ProteinShake tasks, achieving F_max of 0.733 on Gene Ontology prediction and AUROC of 0.969 on binding-site detection. We position this as an interpretable structural companion to protein language models that makes their predictions more transparent and auditable.
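
摘要中的可微 Gumbel-softmax 子结构池化,核心是把节点(残基)到 blob 的离散分配松弛为可导的软分配。一个与论文无关的最小数值示意(假设性实现):

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, seed=0):
    """Gumbel-softmax:对节点到 blob 的分配做可微的近似离散采样。"""
    rng = np.random.default_rng(seed)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) 噪声
    y = (logits + g) / tau                                # 温度 tau 控制离散程度
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)              # 每行为一个软 one-hot

assign = gumbel_softmax(np.zeros((6, 3)))   # 6 个残基软分配到 3 个 blob
pooled = assign.T @ np.ones((6, 4))          # blob 级特征 = 分配加权的节点特征聚合
print(assign.sum(axis=1))  # 每行和为 1:分配可导,可随下游损失端到端训练
```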

[AI-222] ξ-DPO: Direct Preference Optimization via Ratio Reward Margin

【速读】:该论文试图解决简单偏好优化(Simple Preference Optimization, SimPO)中两个超参数 β 和 γ 联合调优的困难,其根源在于SimPO中的边际公式(margin formulation)在不同数据集的奖励差距结构(reward gap structure)下缺乏可解释性。解决方案的关键在于提出 ξ-DPO,即通过比率奖励边际(ratio reward margin)进行直接偏好优化。具体而言,论文首先通过等价变换将偏好目标从最大化奖励差距的似然重新表述为最小化奖励差距与最优边际之间的距离;随后将奖励重新定义为选择响应与拒绝响应的比率形式,这有效消除了 β 的影响,并产生了一个有界且可解释的边际 ξ。与 SimPO 中的 γ 不同,ξ 明确表示了选择与拒绝响应之间期望的相对分离程度,且可直接从初始奖励差距分布中确定,从而避免了反复的试错调参。

链接: https://arxiv.org/abs/2605.10981
作者: Zhengyuan Fan,Zhonghua Wu,Yuxuan Du,Qun Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reference-free preference optimization has emerged as an efficient alternative to reinforcement learning from human feedback, with Simple Preference Optimization (SimPO) demonstrating strong performance by eliminating the explicit reference model through a simple objective. However, the joint tuning of the hyperparameters β and γ in SimPO remains a central challenge. We argue that this difficulty arises because the margin formulation in SimPO is not easily interpretable across datasets with different reward gap structures. To better understand this issue, we conduct a comprehensive analysis of SimPO and find that β implicitly controls sample filtering, while the effect of γ depends on the reward gap structure of the dataset. Motivated by these observations, we propose ξ-DPO: Direct preference optimization via ratio reward margin. We first reformulate the preference objective through an equivalent transformation, changing the optimization target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins. Then, we redefine the reward in a ratio form between the chosen and rejected responses, which effectively cancels the effect of β and yields a bounded and interpretable margin. This margin is called the ratio reward margin and is denoted by ξ. Unlike the margin γ in SimPO, ξ explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward gap distribution, avoiding repeated trial-and-error tuning. …
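
作为对照,摘要所分析的 SimPO 目标使用长度归一化奖励和固定边际 γ。其标准形式的最小示意如下(通用 SimPO 公式,便于理解 β、γ 的作用,并非 ξ-DPO 的实现):

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO:长度归一化的隐式奖励 r = (beta/|y|) * log pi(y|x),
    损失 = -log sigmoid(r_chosen - r_rejected - gamma)。"""
    r_c = beta * logp_chosen / len_chosen
    r_r = beta * logp_rejected / len_rejected
    margin = r_c - r_r - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# 当奖励差距恰好等于 gamma 时,损失为 -log sigmoid(0) = ln 2 ≈ 0.6931;
# β 缩放奖励差距、γ 设定目标分离度,二者耦合正是论文指出的调参难点
print(round(simpo_loss(-10.0, 20, -10.5, 20, beta=2.0, gamma=0.05), 4))
```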

[AI-223] LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

【速读】:该论文试图解决扩散语言模型(Diffusion Language Models, dLLMs)在并行解码时因依赖高置信度阈值而严重限制并行性可扩展性的问题。现有dLLMs通过假设高置信度下的条件独立性来保证边缘分布与联合分布之间差异可忽略,但为保持精度所需的严格置信度阈值极大地约束了并行解码的效率。通过系统性的token级统计分析,论文发现大量token在去噪过程早期就已收敛到正确预测,却未能达到标准置信度阈值,表明当前基于置信度的准则过于保守。关键解决方案是提出LEAP(Lookahead Early-Convergence Token Detection for Accelerated Parallel Decoding),一种无需训练、即插即用的方法,通过未来上下文过滤(future context filtering)和多序列叠加(multi-sequence superposition)检测早期收敛的token,并验证早期收敛与正确性之间的对齐关系,从而可靠地提前解码这些token,打破了并行解码对高置信度先验的依赖。该方案在保持模型精度的同时,显著降低了推理延迟和去噪步数,平均减少约30%的去噪步骤,在GSM8K数据集上结合dParallel可将解码速度提升至每步7.2个token。

链接: https://arxiv.org/abs/2605.10980
作者: Haohui Zhang,Zhiye Wang,Xiaoying Gan,Xinbing Wang,Bo Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion Language Models (dLLMs) have garnered significant attention for their potential in highly parallel processing. The parallel capabilities of existing dLLMs stem from the assumption of conditional independence at high confidence levels, which ensures negligible discrepancy between the marginal and joint distributions. However, the stringent confidence thresholds required to preserve accuracy severely constrain the scalability of parallelism. Through systematic token-level statistical analysis, we reveal that a substantial proportion of tokens converge to their correct predictions early in the denoising process yet fail to reach standard confidence thresholds, confirming that current confidence-based criteria are overly conservative. In response, we introduce LEAP (Lookahead Early-Convergence Token Detection for Accelerated Parallel Decoding). LEAP is a training-free, plug-and-play method that leverages future context filtering and multi-sequence superposition to detect early-converging tokens. By validating the alignment between early convergence and correctness, we enable reliable early decoding of these tokens. Benchmarking across diverse domains demonstrates that LEAP significantly lowers inference latency and decoding steps. Compared to confidence-based decoding, the average number of denoising steps is reduced by about 30%. On the GSM8K dataset, combining LEAP with dParallel accelerates decoding to 7.2 tokens per step while preserving model precision. LEAP effectively breaks the reliance on high-confidence priors, offering a novel paradigm for parallel decoding.
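
“早期收敛检测”的最朴素形式,是观察各位置的 argmax 预测在若干去噪步内是否保持稳定。以下为一个纯示意(LEAP 实际还结合未来上下文过滤与多序列叠加,此处不涉及):

```python
import numpy as np

def early_converged(history, stable_steps=3):
    """history: (T, L),每个去噪步各位置的 argmax 预测。
    若某位置最近 stable_steps 步预测保持不变,则视为早期收敛,可提前解码,
    即便其置信度尚未达到常规阈值。"""
    recent = history[-stable_steps:]
    return (recent == recent[-1]).all(axis=0)

hist = np.array([
    [5, 2, 9],
    [5, 7, 9],
    [5, 7, 1],
    [5, 7, 4],
    [5, 7, 8],
])
print(early_converged(hist))  # [ True  True False]:前两个位置可提前解码
```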

[AI-224] PASA: A Principled Embedding-Space Watermarking Approach for LLM -Generated Text under Semantic-Invariant Attacks

【速读】:该论文旨在解决大语言模型(LLMs)水印方法易受语义不变攻击(如释义)影响的问题,现有方法在检测准确性和抗攻击鲁棒性上存在不足。解决方案的关键在于提出PASA算法,它在语义层面而非词表空间嵌入和检测水印:通过潜在嵌入空间中的语义聚类,并利用共享随机性(由密钥和语义历史同步)构建令牌序列与辅助序列之间的分布依赖性,从而在理论上实现检测精度、鲁棒性和失真之间的最优权衡,确保即使在强释义攻击下仍保持高检测鲁棒性和文本质量。

链接: https://arxiv.org/abs/2605.10977
作者: Zhenxin Ai,Haiyun He
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks, such as paraphrasing. We propose PASA, a principled, robust, and distortion-free watermarking algorithm that embeds and detects a watermark at the semantic level. PASA operates on semantic clusters in a latent embedding space and constructs a distributional dependency between token and auxiliary sequences via shared randomness synchronized by a secret key and semantic history. This design is grounded in our theoretical framework that characterizes a jointly optimal embedding-detection pair, achieving the fundamental trade-offs among detection accuracy, robustness, and distortion. Evaluations across multiple LLMs and semantic-invariant attacks demonstrate that PASA remains robust even under strong paraphrasing attacks while preserving high text quality, outperforming standard vocabulary-space baselines. Ablation studies further validate the effectiveness of our hyperparameter choices. Webpage: this https URL.

[AI-225] Hierarchical Multi-Scale Graph Neural Networks: Scalable Heterophilous Learning with Oversmoothing and Oversquashing Mitigation

【速读】:该论文旨在解决异配图(heterophilous graphs)分类中现有谱图神经网络(Spectral Graph Neural Networks)因次优多项式滤波器引入近似误差并混合远距离信号而导致的hub-dominated(大度节点主导)聚合与过平滑(oversmoothing)问题。解决方案的关键在于提出层次多视角HAAR(Hierarchical Multi-view HAAR, HMH)框架:首先通过异配感知编码器(heterophily-aware encoder)学习特征与结构感知的有符号亲和度(signed affinities),并基于这些嵌入构建软图层次(soft graph hierarchy);在每个层次上利用稀疏、正交且局部感知的Haar基(Haar basis)在频域应用可学习谱滤波器;最后通过跳跃连接反池化层(skip-connection unpooling layers)将所有层次输出合并回原始图,从而有效抑制hub domination并缓解长程信号瓶颈(over-squashing)。该模型在保持近线性时间可扩展性的同时,分别在节点分类和图分类任务上实现了最多3%和7%的性能提升。

链接: https://arxiv.org/abs/2605.10975
作者: Md Sazzad Hossen,Avimanyu Sahoo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graphs with heterophily, where adjacent nodes carry different labels, are prevalent in real-world applications, from social networks to molecular interactions. However, existing spectral Graph Neural Network (GNN) approaches tailored for heterophilous graph classification suffer from hub-dominated (node with large degree) aggregation and oversmoothing, as their suboptimal polynomial filters introduce approximation errors and blend distant signals. To address the degree-biased aggregation and suboptimal polynomial filtering, we introduce a Hierarchical Multi-view HAAR (HMH), a novel spectral graph-learning framework that scales in near-linear time. HMH first learns feature- and structure-aware signed affinities via a heterophily-aware encoder, then constructs a soft graph hierarchy guided by these embeddings. At each hierarchical level, HMH constructs a sparse, orthonormal, and locality-aware Haar basis to apply learnable spectral filters in the frequency domain. Finally, skip-connection unpooling layers combine outputs from all hierarchical levels back into the original graph, effectively preventing hub domination and long-range signal bottleneck (over-squashing). Experimentation shows that HMH outperforms state-of-the-art spectral baselines, achieving up to a 3% improvement on node classification and 7% on graph classification datasets, all while maintaining near-linear scalability.

[AI-226] Vertex-Softmax: Tight Transformer Verification via Exact Softmax Optimization

【速读】:该论文旨在解决Transformer注意力机制认证验证(certified verification)中,对softmax函数在预softmax分数区间约束(interval constraints)下进行边界计算时存在可避免松弛(slack)的问题,即现有验证器独立于下游目标放宽softmax导致边界不够紧致。解决方案的关键是提出了Vertex-Softmax原语,其理论基础包括:证明分数-盒子问题(score-box problem)的精确最优解必出现在约束盒子的某个顶点,并建立了阈值结构定理(threshold structure theorem),表明在对目标系数排序后最优解仅在线性多个候选点中,从而实现了序列长度上对数线性(log-linear)的复杂度;此外,形式最优性(formal optimality)结果证明Vertex-Softmax是从分数间隔单独可获得的最紧致声音边界(tightest sound bound),并明确指出了需要额外结构(如分数相关性、分数-值耦合)才能进一步改进。将该原语集成到CROWN(Convex Relaxation based Optimization for Worst-case Neurons)风格验证器中,显著提升了认证率和下界紧致性,在MNIST、Fashion-MNIST和CIFAR-10注意力模型上以更低成本一致匹配或超越alpha-CROWN和分支定界(branch-and-bound)基线。

链接: https://arxiv.org/abs/2605.10974
作者: Navid Rezazadeh,Arash Gholami Davoodi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Certified verification of transformer attention requires bounding the softmax function over interval constraints on the pre-softmax scores. Existing verifiers relax softmax independently of the downstream objective, leaving avoidable slack. We prove that the exact optimum of this score-box problem is attained at a vertex of the constraint box, and establish a threshold structure theorem showing that, after sorting the objective coefficients, the optimum lies among only linearly many candidates, yielding the Vertex-Softmax primitive with log-linear complexity in the sequence length. We further prove a formal optimality result showing that Vertex-Softmax is the tightest sound bound obtainable from score intervals alone, characterizing precisely what additional structure (score correlations, score-value coupling) is needed for further improvement. Integrated into a CROWN (Convex Relaxation based Optimization for Worst-case Neurons)-style verifier with a formal soundness guarantee, Vertex-Softmax significantly improves certified rates and substantially tightens lower bounds across MNIST, Fashion-MNIST, and CIFAR-10 attention models, while consistently matching or outperforming alpha-CROWN and branch-and-bound baselines at a fraction of their cost.
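The vertex theorem can be checked numerically with a brute-force sketch: enumerate all 2^n corners of the score box and take the best value of the linear objective applied to softmax. This is illustrative only; the paper's threshold structure theorem reduces the 2^n candidates to linearly many after sorting the coefficients:

```python
import itertools
import math

def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]

def vertex_softmax_upper(c, lo, hi):
    """Upper-bound c . softmax(s) over the box lo <= s <= hi by checking
    every box vertex (exponential-time illustration of the vertex claim)."""
    best = -math.inf
    for s in itertools.product(*zip(lo, hi)):
        p = softmax(list(s))
        best = max(best, sum(ci * pi for ci, pi in zip(c, p)))
    return best
```

The test below verifies that a few interior points of the box never exceed the vertex bound, consistent with the paper's optimality claim.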

[AI-227] Rotation-Preserving Supervised Fine-Tuning

【速读】:该论文试图解决监督微调(Supervised Fine-Tuning, SFT)在提升域内性能的同时导致域外泛化(Out-of-Domain, OOD)退化的问题。解决方案的关键在于提出了一种名为旋转保持监督微调(Rotation-Preserving Supervised Fine-Tuning, RPSFT)的方法,该方法利用预训练权重矩阵的奇异子空间中的投影旋转作为Fisher敏感方向的高效代理,通过惩罚每个预训练权重矩阵的前k个奇异向量块的投影变化,在限制不必要的旋转的同时保留任务适应能力,从而在不显著增加计算成本的前提下改善域内/域外性能权衡,并保持预训练表征的完整性。

链接: https://arxiv.org/abs/2605.10973
作者: Hangzhan Jin,Tianwei Ni,Lu Li,Pierre-Luc Bacon,Mohammad Hamdaqa,Doina Precup
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages, 13 figures

点击查看摘要

Abstract:Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss-sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher-sensitive directions, which we call Rotation-Preserving Supervised Fine-Tuning (RPSFT). RPSFT penalizes changes in the projected top-k singular-vector block of each pretrained weight matrix, limiting unnecessary rotation while preserving task adaptation. Across model families and sizes trained on math reasoning data, RPSFT improves the in-domain/OOD trade-off over standard SFT and strong SFT baselines, better preserves pretrained representations, and provides stronger initializations for downstream RL fine-tuning. Code is available at this https URL.

[AI-228] MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

【速读】:该论文试图解决现有终端智能体(terminal agent)基准测试仅关注文本、代码和结构化文件任务,而无法评估其在真实多媒体工作流程中直接操作音频和视频文件能力的问题。解决方案的关键在于引入了两个组件:一是 MultiMedia-TerminalBench (MMTB),一个包含105个跨5个元类别的任务基准,用于针对性地评估终端智能体处理多媒体文件的能力;二是 Terminus-MM,一个扩展了 Terminus-KIRA 的多媒体工具包,为终端智能体提供了音频和视频感知能力。通过 MMTB 和 Terminus-MM 的结合,论文实现了对多媒体终端智能体的受控研究,揭示了不同形式的多媒体访问如何影响任务结果以及智能体依赖的证据类型。

链接: https://arxiv.org/abs/2605.10966
作者: Chiyeong Heo,Jaechang Kim,Junhyuk Kwon,Hoyoung Kim,Dongmin Park,Jonghyun Lee,Jungseul Ok
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. However, many real-world workflows require practitioners to work directly with audio and video files. Working with such multimedia files calls for terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories where terminal agents directly operate with audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception for terminal agents. Together, MMTB and Terminus-MM support a controlled study of multimedia terminal agents, revealing how different forms of multimedia access shape task outcomes and determine which evidence agents rely on to construct executable terminal workflows. MMTB media and metadata are released at this https URL

[AI-229] QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization

【速读】:该论文试图解决当前缺乏统一度量标准来评估量化神经网络效率的问题,特别是压缩率、准确率和延迟三者之间的权衡(compression-accuracy-latency trade-off)无法被单一指标量化。解决方案的关键在于提出了 QuIDE(Quantized Neural Network Intelligence and Efficiency)指标,其核心是智能指数 I = (C × P) / log₂(T+1),其中 C 代表压缩率(compression),P 代表准确率(accuracy),T 代表延迟(latency),通过将三个维度合并为一个分数实现量化。此外,为了解决原始 I 可能奖励非可行配置(如准确率崩溃的 4-bit 量化)的问题,引入了准确率门控变体 I′,能够自动标记准确率不满足阈值的配置,从而确保评估的合理性。QuIDE 还提供了一个可复现的评估协议和一个可直接用于混合精度搜索(mixed-precision search)的适应度函数(fitness function),使得跨任务(如 SimpleCNN、ResNet-18、Llama-3-8B)的量化效率比较成为可能。

链接: https://arxiv.org/abs/2605.10959
作者: Xiantao Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures

点击查看摘要

Abstract:There is currently no unified metric for evaluating the efficiency of quantized neural networks. We propose QuIDE, built around the Intelligence Index I = (C x P)/log_2(T+1), which collapses the compression-accuracy-latency trade-off into a single score. Experiments across six settings – SimpleCNN (MNIST, CIFAR), ResNet-18 (ImageNet-1K), and Llama-3-8B – show a task-dependent Pareto Knee. 4-bit quantization is optimal for MNIST and large LLMs, while 8-bit is the sweet spot for complex CNN tasks (ResNet-18 on ImageNet), where 4-bit PTQ collapses accuracy catastrophically. The accuracy-gated variant I’ correctly flags these non-viable configurations that the raw I would reward. QuIDE provides a reproducible evaluation protocol and a ready-to-use fitness function for mixed-precision search.
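The Intelligence Index from the abstract is directly computable; a minimal sketch follows. The accuracy floor used by the gated variant I' is an assumed parameter here, since the abstract only states that non-viable (accuracy-collapsed) configurations are flagged:

```python
import math

def intelligence_index(C, P, T):
    """QuIDE Intelligence Index I = (C x P) / log2(T + 1), where C is the
    compression ratio, P the accuracy, and T the latency."""
    return (C * P) / math.log2(T + 1)

def gated_index(C, P, T, acc_floor):
    """Accuracy-gated variant I': zero out configurations whose accuracy
    falls below a viability floor (the floor value is an assumption)."""
    return intelligence_index(C, P, T) if P >= acc_floor else 0.0
```

For example, a 4-bit configuration with collapsed accuracy still earns a positive raw I, but the gated I' correctly reports it as non-viable.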

[AI-230] Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints

【速读】:该论文试图解决当前缺乏能够动态评估自主人工智能智能体(autonomous AI agents)在科学任务中执行物理基础模型拟合能力的问题。现有智能体虽能实现良好的数值统计拟合,却常违反物理约束,无法正确恢复系统真实参数,且增加测试时计算量仅带来边际收益。解决方案的关键是引入 Stargazer——一个基于径向速度(radial-velocity)时间序列数据推理的可扩展模拟环境,它通过构建包含三个难度等级(共120个任务,含20个真实档案案例)的迭代物理基础模型拟合任务,提供内置反馈机制,从而系统性地暴露智能体在数值优化与物理约束遵守之间的差距,并为训练、评估和扩展此类智能体策略提供可泛化的方法论。

链接: https://arxiv.org/abs/2604.15664
作者: Xinge Liu,Terry Jingchen Zhang,Bernhard Schölkopf,Zhijing Jin,Kristen Menou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology to design a simulation-driven environment for AI agents presumably generalizes to many other model-fitting problems across scientific domains. Source code and the project website are available at this https URL and this https URL, respectively.
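As a toy illustration of the kind of model fitting Stargazer tasks involve, the sketch below uses a circular-orbit radial-velocity curve with known period and phase, so the amplitude has a closed-form least-squares solution. This is an assumed simplification: the benchmark's actual tasks involve full Keplerian, possibly multi-planet models under low SNR:

```python
import math

def rv_model(t, K, P, phi):
    """Circular-orbit radial-velocity curve v(t) = K sin(2*pi*t/P + phi),
    a toy stand-in for the Keplerian models the tasks require."""
    return K * math.sin(2 * math.pi * t / P + phi)

def fit_amplitude(ts, vs, P, phi):
    """Closed-form least-squares amplitude with period and phase known:
    K* = sum(v * s) / sum(s^2), where s = sin(2*pi*t/P + phi)."""
    s = [math.sin(2 * math.pi * t / P + phi) for t in ts]
    return sum(v * si for v, si in zip(vs, s)) / sum(si * si for si in s)
```

With noiseless synthetic data the true amplitude is recovered exactly; the gap the paper highlights is that agents often reach such a numerical fit yet still violate physical constraints in harder settings.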

[AI-231] Enabling AI-Native Mobility in 6G: A Real-World Dataset for Handover Beam Management and Timing Advance

【速读】:该论文试图解决在5G高速移动场景下,用户设备(UE)移动性导致的高切换中断时间和测量报告开销问题,尤其是现有AI/ML波束管理与移动性技术依赖的模拟数据无法准确反映真实部署行为或用户流量模式。解决方案的关键在于构建一个来自真实商用网络的、涵盖多种移动模式(步行、自行车、汽车、公交、火车)和不同速度的实时数据集,聚焦切换(HO)场景,并包含现有工作中通常缺失的定时提前(TA)测量值(如RACH触发、MAC CE和PDCCH授权处的TA),从而为AI/ML模型(如TA预测)提供训练与评估的真实数据基础。

链接: https://arxiv.org/abs/2605.12453
作者: Mannam Veera Narayana,Rohit Singh,Deepa M.R,Radha Krishna Ganti
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:To address the issues of high interruption time and measurement report overhead under user equipment (UE) mobility, especially in high-speed 5G use cases, the use of AI/ML techniques (AI/ML beam management and mobility procedures) has been proposed. These techniques rely heavily on data that are most often simulated for various scenarios and do not accurately reflect real deployment behavior or user traffic patterns. Therefore, there is an utmost need for realistic datasets under various conditions. This work presents a dataset collected from a commercially deployed network across various modes of mobility (pedestrian, bike, car, bus, and train) and at multiple speeds to depict real-time UE mobility. When collecting the dataset, we focused primarily on handover (HO) scenarios, with the aim of reducing the HO interruption time and maintaining continuous throughput during and immediately after HO execution. To support this research, the dataset includes timing advance (TA) measurements at various signaling events such as RACH trigger, MAC CE, and PDCCH grant, which are typically missing in existing works. We cover a detailed description of the creation of the dataset; experimental setup, data acquisition, and extraction. We also cover an exploratory analysis of the data, with a primary focus on mobility, beam management, and TA. We discuss multiple use cases in which the proposed dataset can facilitate understanding of the inference of the AI/ML model. One such use case is to train and evaluate various AI/ML models for TA prediction.

[AI-232] QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning

【速读】:该论文旨在解决量子编译中的量子比特路由(qubit routing)问题,该问题已被证明是NP-hard,其动态特性导致局部路由决策会随时间传播累积,使得高效全局解难以获得。现有启发式方法依赖有限前瞻的局部规则,而近期基于学习的方法往往将路由视为通用序列决策问题,未能充分挖掘其底层结构。解决方案的关键在于:将量子比特路由形式化为一个动态二次分配问题(Quadratic Assignment Problem, QAP),通过将逻辑交互(即量子门)建模为流矩阵(flow matrix),硬件拓扑建模为距离矩阵(distance matrix),在统一目标中捕捉交互-距离耦合,并以此定义强化学习环境中的奖励。此外,策略网络采用一种解决方案感知的Transformer(solution-aware Transformer)骨干结构,将流矩阵与距离矩阵之间的交互编码进注意力机制;同时集成一种与QAP框架自然融合的前瞻机制(lookahead mechanism),避免短视决策。

链接: https://arxiv.org/abs/2605.12365
作者: Kien X. Nguyen,Ankit Kulshrestha,Ilya Safro,Xiaoyuan Liu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Qubit routing is a fundamental problem in quantum compilation, known to be NP-hard. Its dynamic nature makes local routing decisions propagate and compound over time, making global efficient solutions challenging. Existing heuristic methods rely on local rules with limited lookahead, while recent learning-based approaches often treat routing as a generic sequential decision problem without fully exploiting its underlying structure. In this paper, we introduce QAP-Router, framing qubit routing based on a dynamic Quadratic Assignment Problem (QAP) formulation. By modeling logical interactions, or quantum gates, as flow matrices and hardware topology as a distance matrix, our approach captures the interaction-distance coupling in a unified objective, which defines the reward in the reinforcement learning environment. To further exploit this structure, the policy network employs a solution-aware Transformer backbone that encodes the interaction between the flow matrix and the distance matrix into the attention mechanism. We also integrate a lookahead mechanism that blends naturally into the QAP framework, preventing myopic decisions. Extensive experiments on 1,831 real-world quantum circuits from the MQTBench, AgentQ and QUEKO datasets show that our method substantially reduces the CNOT gate count of routed circuits by 15.7%, 30.4% and 12.1%, respectively, relative to existing industry compilers.
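The QAP objective described here is easy to state in code: given a gate-interaction flow matrix and a hardware distance matrix, score an assignment of logical to physical qubits. The brute-force minimizer below is for tiny instances only; the paper replaces it with a learned RL policy:

```python
import itertools

def qap_cost(flow, dist, assign):
    """Quadratic-assignment objective: logical qubits i, j with gate
    interaction flow[i][j], mapped to physical qubits assign[i], assign[j]
    at hardware distance dist[assign[i]][assign[j]]."""
    n = len(assign)
    return sum(flow[i][j] * dist[assign[i]][assign[j]]
               for i in range(n) for j in range(n))

def best_assignment(flow, dist):
    """Brute-force minimizer over all permutations (tiny instances only)."""
    n = len(flow)
    return min(itertools.permutations(range(n)),
               key=lambda p: qap_cost(flow, dist, p))
```

On a 3-qubit line topology, two heavily interacting logical qubits are placed on adjacent physical qubits, which is exactly the interaction-distance coupling the unified objective captures.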

[AI-233] Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction

【速读】:该论文试图解决的核心问题是:能否仅通过微生物群落成员的原始DNA序列来预测其群落水平的特性(如丰度谱),从而绕过传统的基于参考基因组或功能注释的耗时流程。解决方案的关键在于利用基因组语言模型(Genomic Language Models, GLMs)的少样本学习能力,通过集合聚合的基因组嵌入(Set-Aggregated Genome Embeddings, SAGE)方法,将每个基因组的嵌入向量进行群体级聚合,直接生成群落水平的潜在表示,进而预测丰度分布。该方法通过在未见过的基因组上展现出优于传统生物信息学方法的泛化性能,并通过模型消融实验证实,群落级潜在表示本身对于性能提升具有直接贡献,同时中间变换和不同GLM嵌入的选择也会影响最终预测效果。

链接: https://arxiv.org/abs/2605.12286
作者: Younhun Kim,Georg K. Gerber,Travis E. Gibson
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach to show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and demonstrate the differences between GLM embedding choices.
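A minimal sketch of the set-aggregation step: pool per-genome embedding vectors into one community-level vector by an element-wise (optionally abundance-weighted) mean. The exact pooling used by SAGE may differ; this is an assumed, common form of set aggregation:

```python
def community_embedding(genome_embeddings, abundances=None):
    """Aggregate a set of per-genome embedding vectors into a single
    community-level vector via a weighted element-wise mean."""
    n = len(genome_embeddings)
    d = len(genome_embeddings[0])
    if abundances is None:
        abundances = [1.0] * n  # unweighted: plain mean over the set
    total = sum(abundances)
    return [sum(w * e[k] for w, e in zip(abundances, genome_embeddings)) / total
            for k in range(d)]
```

A downstream regressor can then map this community-level latent representation to the abundance profile, which is the prediction target studied in the paper.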

[AI-234] Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

【速读】:该论文试图解决的问题是:在医学子领域(特别是心电图(ECG)数据)中,基础模型的预训练方法论和参数缩放行为(即随着预训练数据集大小增加的性能变化)缺乏系统性的同类比较评估。解决方案之关键包括两点:首先,通过全面评估五种对比和非对比自监督学习目标(其中对比预测编码(Contrastive Predictive Coding)略优于联合嵌入预测架构(JEPA)),发现在不同临床任务中预训练策略对下游性能有显著且一致的影响,且预训练数据规模在高达1100万样本时仍能带来有意义提升;其次,通过跨所有预训练方法进行模型架构比较,发现结构化状态空间模型(Structured State Space Models)相较于Transformer和卷积神经网络(CNN)具有明显优势,其强归纳偏置(而非仅依赖预训练规模)是有效ECG表示学习的主要驱动力。

链接: https://arxiv.org/abs/2605.12241
作者: M A Al-Masud,Nils Strodthoff
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 59 pages, 16 figures, 59 Tables. Code available at this https URL

点击查看摘要

Abstract:Specialized foundation models are beginning to emerge in various medical subdomains, but pretraining methodologies and parametric scaling with the size of the pretraining dataset are rarely assessed systematically and in a like-for-like manner. This work focuses on foundation models for electrocardiography (ECG) data, one of the most widely captured physiological time series world-wide. We present a comprehensive assessment of pretraining methodologies, covering five different contrastive and non-contrastive self-supervised learning objectives for ECG foundation models, and investigate their scaling behavior with pretraining dataset sizes up to 11M input samples, exclusively from publicly available sources. Pretraining strategy has a meaningful and consistent impact on downstream performance, with contrastive predictive coding (slightly ahead of JEPA) yielding the most transferable representations across diverse clinical tasks. Scaling pretraining data continues to yield meaningful improvements up to 11M samples for most objectives. We also compare model architectures across all pretraining methodologies and find evidence for a clear superiority of structured state space models compared to transformers and CNN models. We hypothesize that the strong inductive biases of structured state space models, rather than pretraining scale alone, are the primary driver of effective ECG representation learning, with important implications for future foundation model development in this and potentially other physiological signal domains.

[AI-235] Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification

【速读】:该论文试图解决传统近似贝叶斯推断中,计算后验参数分布(posterior parameter distribution)而忽略更实际关注的后验预测分布(posterior predictive distribution)的问题。解决方案的关键在于,借鉴自监督和半监督学习中的自训练(self-training)思想,通过直接在自预测数据(self-predicted data)上重新拟合模型来量化预测不确定性,从而避免对参数后验的显式推断。具体方法为自监督拉普拉斯近似(Self-Supervised Laplace Approximation, SSLA),其核心是使用拉普拉斯近似(Laplace approximation)在自预测数据上高效计算后验预测,并进一步引入近似版本ASSLA(Approximate SSLA)以减少重新拟合的计算开销,最终实现一种确定性的、无需采样的后验预测分布近似,并允许灵活的先验设定以进行贝叶斯敏感性分析。

链接: https://arxiv.org/abs/2605.12208
作者: Julian Rodemann,Alexander Marquard,Thomas Augustin,Michele Caprio
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO)
备注: Accepted for publication in TMLR ( this https URL )

点击查看摘要

Abstract:Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main object of interest is often a model’s predictions rather than its parameters. In this work, we propose to bypass the parameter posterior and focus directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-training within self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model’s predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: If a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. This yields a deterministic, sampling-free approximation of the posterior predictive. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows us to plug in different prior specifications, enabling classical Bayesian sensitivity (w.r.t. prior choice) analysis. In order to bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically in regression models ranging from Bayesian linear models to Bayesian neural networks. Across a wide array of regression tasks with simulated and real-world datasets, our methods outperform classical Laplace approximations in predictive calibration while remaining computationally efficient.

[AI-236] Rethink the Role of Neural Decoders in Quantum Error Correction ICML2026

【速读】:该论文试图解决量子纠错(Quantum Error Correction, QEC)中神经解码器在实际部署时所面临的精度与延迟之间的根本权衡问题,特别是在微秒级时间尺度上的性能瓶颈。解决方案的关键在于:首先,通过统一和重新设计五种有代表性的神经解码器架构范式,并开发一套端到端压缩流水线,在FPGA硬件上系统评估其可部署性与性能;其次,通过实验揭示了三个关键洞察——近期的解码性能更多由数据规模而非架构复杂度驱动,适当的归纳偏置对于高解码精度至关重要,以及INT4量化是满足FPGA微秒级延迟需求的先决条件。这些发现为构建可扩展且实时的神经QEC解码提供了具体指导。

链接: https://arxiv.org/abs/2605.12046
作者: Ge Yan,Shanchuan Li,Yuxuan Du
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICML 2026; 33 Pages, 9 figures

点击查看摘要

Abstract:Quantum error correction (QEC) is essential for enabling quantum advantages, with decoding as a central algorithmic primitive. Owing to its importance and intrinsic difficulty, substantial effort has been made to QEC decoder design, among which neural decoders have recently emerged as a promising data-driven paradigm. Despite this progress, practical deployment remains hindered by a fundamental accuracy-latency tradeoff, often on the microsecond timescale. To address this challenge, here we revisit neural decoders for surface-code decoding under explicit accuracy-latency constraints, considering code distances up to d=9 (161 physical qubits). We unify and redesign representative neural decoders into five architectural paradigms and develop an end-to-end compression pipeline to evaluate their deployability and performance on FPGA hardware. Through systematic experiments, we reveal several previously underexplored insights: (i) near-term decoding performance is driven more by data scale than architectural complexity; (ii) appropriate inductive bias is essential for achieving high decoding accuracy; and (iii) INT4 quantization is a prerequisite for meeting microsecond-scale latency requirements on FPGAs. Together, these findings provide concrete guidance toward scalable and real-time neural QEC decoding.
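Since the paper identifies INT4 quantization as a prerequisite for microsecond FPGA latency, here is a generic sketch of symmetric per-tensor INT4 quantization (values clamped to [-8, 7] with a single scale). The paper's exact quantization scheme may differ:

```python
def quantize_int4(weights):
    """Symmetric per-tensor INT4 quantization: map floats to integers in
    [-8, 7] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the INT4 codes."""
    return [x * scale for x in q]
```

Round-trip error is bounded by half the quantization step, which is the accuracy cost traded for the 8x storage reduction versus FP32.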

[AI-237] High-lift Wing Separation Control via Bayesian Optimization and Deep Reinforcement Learning

【速读】:该论文试图解决高雷诺数(Re_c=450,000)下30P30N高升力机翼的失速缓解与气动效率提升问题,通过壁面解析大涡模拟(wall-resolved large-eddy simulations, LES)研究主动流动控制(active flow control, AFC),并比较开环贝叶斯优化(Bayesian optimization, BO)与闭环深度强化学习(deep reinforcement learning, DRL)两种优化策略。解决方案的关键在于:采用合成射流(synthetic jets)分别作用于缝翼、主翼和襟翼元件,其中贝叶斯优化通过稳态射流速度的全局搜索,实现了-9.7%的阻力降低和+10.9%的效率提升,且不牺牲升力;而深度强化学习尽管利用了分布式传感器的瞬时流场信息,但因惩罚主导的奖励函数(penalty-dominated reward)限制了探索,仅取得微小改进。研究强调,有效的奖励设计(reward design)和计算加速策略(computational acceleration strategies)是DRL在高雷诺数流动控制中成功应用的核心环节。

链接: https://arxiv.org/abs/2605.11981
作者: Ricard Montalà,Bernat Font,Oriol Lehmkuhl,Ricardo Vinuesa,Ivette Rodriguez
机构: 未知
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study investigates active flow control (AFC) of a 30P30N high-lift wing at a Reynolds number Re_c = 450,000 and angle of attack α = 23° using wall-resolved large-eddy simulations (LES). Two optimization strategies are explored: open-loop Bayesian optimization (BO) and closed-loop deep reinforcement learning (DRL), both targeting the mitigation of stall and the improvement of aerodynamic efficiency via synthetic jets on the slat, main, and flap elements. The uncontrolled configuration was validated against literature data, confirming the reliability of the LES setup. The BO framework successfully identified steady jet velocities that increased efficiency by +10.9% through a -9.7% drag reduction while maintaining lift. In contrast, the DRL agent, despite leveraging instantaneous flow information from distributed sensors, achieved only minor improvements in lift and drag, with negligible efficiency gain. Training analysis indicated that the penalty-dominated reward constrained exploration. These results highlight the need for carefully designed rewards and computational acceleration strategies in DRL-based flow control at high Reynolds numbers.

[AI-238] Modulation Consistency-based Contrastive Learning for Self-Supervised Automatic Modulation Classification

【速读】:该论文旨在解决深度学习自动调制分类(AMC, Automatic Modulation Classification)在实际部署中因标记数据成本高昂而受限的问题,以及现有基于自监督学习(SSL, Self-Supervised Learning)的方法因采用任务无关的前置目标(pretext objectives)而与调制分类对齐不佳,导致表示中纠缠了符号、信道和噪声等无关因素(nuisance factors)。解决方案的关键在于提出了一个任务感知的结构先验——实例内调制一致性(intra-instance modulation consistency),即同一信号的不同时间片段尽管波形不同,但共享相同的调制类型。基于此先验,论文设计了基于调制一致性的对比学习框架(Mod-CL, Modulation consistency-based Contrastive Learning),通过从同一信号实例的不同时间片段构建正样本对,迫使模型学习共享的调制信息,同时抑制无关变化。此外,该框架还开发了专门针对 Mod-CL 的对比目标,联合利用时间分割和数据增强,拉近具有相同调制语义的视图,同时避免每个信号实例内部的监督冲突。

链接: https://arxiv.org/abs/2605.11875
作者: Chenxu Wang,Shuang Wang,Lirong Han,Xinyu Hu,Hanlin Mo,Hantong Xing,Licheng Jiao
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning-based AMC methods have achieved remarkable performance, but their practical deployment remains constrained by the high cost of labeled data. Although self-supervised learning (SSL) reduces the reliance on labels, existing SSL-based AMC methods often rely on task-agnostic pretext objectives misaligned with modulation classification, leading to representations entangled with nuisance factors such as symbol, channel, and noise. In this paper, we identify intra-instance modulation consistency as a task-aware structural prior, whereby different temporal segments of the same signal may differ in waveform while preserving the same modulation type, thus providing a principled cue for task-aligned self-supervision. Based on this prior, we propose Mod-CL, a Modulation consistency-based Contrastive Learning framework that constructs positive pairs from different temporal segments of the same signal instance, to encourage the model to learn shared modulation information while suppressing nuisance variations. We further develop a contrastive objective tailored to Mod-CL, which jointly exploits temporal segmentation and data augmentation to pull together views sharing the same modulation semantics while avoiding supervisory conflicts within each signal instance. Extensive experiments on RadioML datasets show that Mod-CL consistently outperforms strong baselines, especially in low-label regimes, achieving substantial improvements in linear probing accuracy.
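The core Mod-CL construction, positive pairs from temporal segments of the same signal, plus a contrastive objective, can be sketched as follows. The generic InfoNCE form below is a stand-in assumption; the paper develops a tailored objective that also combines augmentation and avoids intra-instance supervisory conflicts:

```python
import math

def positive_pair(signal, seg_len):
    """Two non-overlapping temporal segments of one signal instance:
    different waveforms, same modulation type (the task-aware prior)."""
    return signal[:seg_len], signal[seg_len:2 * seg_len]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=0.1):
    """Generic InfoNCE on embedded views: pull the positive segment's
    embedding toward the anchor, push negatives away."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / tau) for s in sims]
    return -math.log(exps[0] / sum(exps))
```

The loss is low when the positive view is most similar to the anchor and high when a negative is, which is the pressure that pushes the encoder toward modulation-specific, nuisance-invariant features.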

[AI-239] Minimax Rates and Spectral Distillation for Tree Ensembles

【速读】:该论文针对树集成模型(如随机森林和梯度提升机)的理论性质不完善以及模型压缩问题,提出了一种基于谱视角的解决方案。关键贡献在于:首先,通过分析随机森林诱导的核算子的特征值衰减规律,在温和条件下证明了其回归的极小化最优收敛速率;其次,利用该核算子的主特征函数(对于随机森林)或平滑矩阵的主奇异向量(对于梯度提升机)来捕捉主导预测方向,并结合非线性映射学习,构建出尺寸远小于原始模型但预测性能相当的蒸馏模型。该谱压缩方法在森林剪枝和规则提取任务中优于现有算法,适用于资源受限计算场景。

链接: https://arxiv.org/abs/2605.11841
作者: Binh Duc Vu,David S. Watson
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages main text, 33 pages total, with 12 figures and 7 tables total

点击查看摘要

Abstract:Tree ensembles such as random forests (RFs) and gradient boosting machines (GBMs) are among the most widely used supervised learners, yet their theoretical properties remain incompletely understood. We adopt a spectral perspective on these algorithms, with two main contributions. First, we derive minimax-optimal convergence rates for RF regression, showing that, under mild regularity conditions on tree growth, the eigenvalue decay of the induced kernel operator governs the statistical rate. Second, we exploit this spectral viewpoint to develop compression schemes for tree ensembles. For RFs, leading eigenfunctions of the kernel operator capture the dominant predictive directions; for GBMs, leading singular vectors of the smoother matrix play an analogous role. Learning nonlinear maps for these spectral representations yields distilled models that are orders of magnitude smaller than the originals while maintaining competitive predictive performance. Our methods compare favorably to state-of-the-art algorithms for forest pruning and rule extraction, with applications to resource-constrained computing.
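One common concrete proxy for the RF-induced kernel is the co-leaf proximity matrix: K[i][j] is the fraction of trees routing samples i and j to the same leaf. This is an illustrative assumption, one standard way to materialize the kernel operator whose spectrum the paper analyzes:

```python
def rf_proximity_kernel(leaf_ids):
    """Proximity kernel of a tree ensemble: leaf_ids[t][i] is the leaf
    index of sample i in tree t, and K[i][j] is the fraction of trees
    that place i and j in the same leaf."""
    T, n = len(leaf_ids), len(leaf_ids[0])
    return [[sum(leaf_ids[t][i] == leaf_ids[t][j] for t in range(T)) / T
             for j in range(n)] for i in range(n)]
```

The resulting matrix is symmetric with unit diagonal; its leading eigenvectors would play the role of the dominant spectral directions used for distillation.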

[AI-240] Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization

【速读】:该论文试图解决灵长类视觉皮层背侧流(dorsal stream)中中颞区(middle temporal area, MT)方向选择性地图(direction-selective maps)的计算起源问题,特别是解释MT区独特的拓扑结构(如方向柱和pinwheel结构)是否与腹侧流(ventral stream)受同一普适原则驱动。解决方案的关键在于提出了一个时空拓扑深度人工神经网络(spatiotemporal Topographic Deep Artificial Neural Network, TDANN),该网络采用三维残差网络(3D ResNet)在自然视频上通过动量对比学习(Momentum Contrast, MoCo)自监督范式训练,并结合生物启发的空间损失(spatial loss)进行约束。最关键的是揭示了MT调谐特性——强方向选择性伴随残余轴向成分——源于任务驱动的判别压力(task-driven discriminative pressure)与空间正则化(spatial regularization)之间的严格优化权衡。该模型定量复现了猕猴MT的生理基线数据,从而统一了腹侧流与背侧流的计算起源,建立了皮层自组织的一般机制。

链接: https://arxiv.org/abs/2605.11718
作者: Zhaotian Gu,Molan Li,Jie Su,Chang Liu,Tianyi Qian,Dahui Wang
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:The spatial and functional organization of the primate visual cortex is a fundamental problem in neuroscience. While recent computational frameworks like the Topographic Deep Artificial Neural Network (TDANN) have successfully modeled spatial organization in the ventral stream, the computational origins of the dorsal stream’s distinct topographies, such as direction-selective maps in the middle temporal (MT) area, remain largely unresolved. In this work, we present a spatiotemporal TDANN to investigate whether MT topography is governed by the same universal principles. By training a 3D ResNet on naturalistic videos via a Momentum Contrast (MoCo) self-supervised paradigm alongside a biologically inspired spatial loss, we demonstrate the spontaneous emergence of brain-like direction maps and topological pinwheel structures. Crucially, we reveal that MT tuning properties, characterized by strong direction selectivity paired with a residual axial component, arise from a strict optimization trade-off between task-driven discriminative pressure and spatial regularization. The model’s representations quantitatively match in vivo macaque MT physiological baselines, including direction selectivity index, circular variance, and pinwheel density. These findings unify the computational origins of the ventral and dorsal streams, establishing a general mechanism for cortical self-organization.

[AI-241] Exact Stiefel Optimization for Probabilistic PLS: Closed-Form Updates Error Bounds and Calibrated Uncertainty

【速读】:该论文试图解决现有概率偏最小二乘(Probabilistic partial least squares, PPLS)拟合流程中存在的两个实际瓶颈:联合EM/ECM更新下的噪声-信号耦合(noise-signal coupling)以及正交性约束(orthogonality constraints)的非平凡处理问题。解决方案的关键在于构建一个端到端框架,将噪声预估计(noise pre-estimation)、受约束似然优化(constrained likelihood optimization)和预测校准(prediction calibration)整合到单一流程中。相较于Hu等人(2025)的工作,该方法作出两项核心改进:其一,用噪声子空间估计(noise-subspace estimation)替代全谱噪声平均(full-spectrum noise averaging),该估计器实现了信号强度无关的领先有限样本率并匹配极小化极大下界,而全谱估计器在相同模型下被证明不一致;其二,用精确的Stiefel流形优化(exact Stiefel-manifold optimization)替代内点罚函数处理以直接施加正交性约束。此外,框架通过可选的Gaussianization扩展到亚高斯(sub-Gaussian)设置,并利用分块Fisher分析(block-structured Fisher analysis)提供封闭形式标准误(closed-form standard errors)。最终在不需事后调整的情况下实现接近名义覆盖率的预测区间,并显著提升了参数恢复的稳定性。

链接: https://arxiv.org/abs/2605.11607
作者: Haoran Hu,Xingce Wang
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Probabilistic partial least squares (PPLS) is a central likelihood-based model for two-view learning when one needs both interpretable latent factors and calibrated uncertainty. Building on the identifiable parameterization of Bouhaddani et al. (2018), existing fitting pipelines still face two practical bottlenecks: noise–signal coupling under joint EM/ECM updates and nontrivial handling of orthogonality constraints. Following the fixed-noise scalar-likelihood line of Hu et al. (2025), we develop an end-to-end framework that combines noise pre-estimation, constrained likelihood optimization, and prediction calibration in one pipeline. Relative to Hu et al. (2025), we replace full-spectrum noise averaging with noise-subspace estimation and replace interior-point penalty handling with exact Stiefel-manifold optimization. The noise-subspace estimator attains a signal-strength-independent leading finite-sample rate and matches a minimax lower bound, while the full-spectrum estimator is shown to be inconsistent under the same model. We further extend the framework to sub-Gaussian settings via optional Gaussianization and provide closed-form standard errors through a block-structured Fisher analysis. Across synthetic high-noise settings and two multi-omics benchmarks (TCGA-BRCA and PBMC CITE-seq), the method achieves near-nominal coverage without post-hoc recalibration, reaches Ridge-level point accuracy on TCGA-BRCA at rank r=3, matches or exceeds PO2PLS on cross-view prediction while providing native calibrated uncertainty, and improves stability of parameter recovery.

[AI-242] Efficient and provably convergent end-to-end training of deep neural networks with linear constraints

【速读】:该论文旨在解决深度神经网络在端到端训练中如何有效整合满足线性约束的投影层的问题,其核心难点在于投影层解映射的非光滑性导致缺乏严格的理论和高效的反向传播算法。解决方案的关键是引入一种可高效计算的HS-Jacobian,并证明该HS-Jacobian是到多面体集上投影算子的保守映射(conservative mapping),从而能够无缝融入非光滑自动微分框架,实现反向传播。基于此,Adam等高效算法得以应用于端到端训练,且论文为基于HS-Jacobian的Adam算法提供了收敛性保证。

链接: https://arxiv.org/abs/2605.11526
作者: Zonglin Yang,Zhexuan Gu,Yancheng Yuan
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Training a deep neural network with the outputs of selected layers satisfying linear constraints is required in many contemporary data-driven applications. While this can be achieved by incorporating projection layers into the neural network, its end-to-end training remains challenging due to the lack of rigorous theory and efficient algorithms for backpropagation. A key difficulty in developing the theory and efficient algorithms for backpropagation arises from the nonsmoothness of the solution mapping of the projection layer. To address this bottleneck, we introduce an efficiently computable HS-Jacobian for the projection layer. Importantly, we prove that the HS-Jacobian is a conservative mapping for the projection operator onto the polyhedral set, enabling its seamless integration into the nonsmooth automatic differentiation framework for backpropagation. Therefore, many efficient algorithms, such as Adam, can be applied for end-to-end training of deep neural networks with linear constraints. Particularly, we establish convergence guarantees of the HS-Jacobian-based Adam algorithm for training linearly constrained deep neural networks. Extensive experimental results on several important applications, including finance, computer vision, and network architecture design, demonstrate the superior performance of our method compared to other existing popular methods.
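For the simplest linear-constraint case, projection onto an affine set {z : Az = b}, both the projection and its (here everywhere-defined) Jacobian have closed forms. The toy sketch below only illustrates why such Jacobians plug cleanly into autodiff; the paper's HS-Jacobian handles the harder polyhedral case with inequality constraints, which this example does not cover:

```python
import numpy as np

def affine_project(x, A, b):
    """Euclidean projection of x onto {z : A z = b} (A full row rank)."""
    AAt_inv = np.linalg.inv(A @ A.T)
    return x - A.T @ (AAt_inv @ (A @ x - b))

def affine_projection_jacobian(A):
    """Jacobian of the projection map: I - A^T (A A^T)^{-1} A."""
    n = A.shape[1]
    return np.eye(n) - A.T @ np.linalg.inv(A @ A.T) @ A

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 5))
b = rng.standard_normal(2)
x = rng.standard_normal(5)

p = affine_project(x, A, b)                 # output satisfies A p = b
J = affine_projection_jacobian(A)           # constant, symmetric, idempotent
constraint_residual = np.linalg.norm(A @ p - b)
```

Because the projection here is affine, its Jacobian is a constant orthogonal-projection matrix; the HS-Jacobian generalizes this idea to the nonsmooth polyhedral setting.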

[AI-243] Spatial Adapter: Structured Spatial Decomposition and Closed-Form Covariance for Frozen Predictors

【速读】:该论文试图解决在冻结任何预训练第一阶段预测器(如线性模型、深度时空网络或视觉骨干网络)的情况下,如何为其引入结构化空间表示和闭式空间协方差的问题,从而实现对未观测位置的空间预测(kriging风格)并提供可量化的不确定性。解决方案的关键是提出一种参数高效的后处理层Spatial Adapter,它作为级联第二阶段作用于残差,通过可处理的mini-batch ADMM过程联合学习一个空间正则化正交基(orthonormal basis)和每个样本的得分,同时不修改任何第一阶段参数。该方法的本质在于利用平滑性、稀疏性和正交性将通用低秩分解转化为可识别的空间表示,其诱导的残差协方差具有闭式低秩加噪声估计量(closed-form low-rank-plus-noise estimator);有效秩通过谱阈值(spectral thresholding)数据自适应地确定,而名义秩K仅作为优化侧的上界。这种设计使得Adder能以少于K(N+T)个参数和紧凑的残差趋势网络(residual-trend network)提供残差场的压缩分布总结,并支持空间外推预测。

链接: https://arxiv.org/abs/2605.11394
作者: Wen-Ting Wang,Wei-Ying Wu,Hao-Yun Huang,Xuan-Chun Wang
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
备注: Preprint. 10 pages main text, with appendices

点击查看摘要

Abstract:We present the Spatial Adapter, a parameter-efficient post-hoc layer that equips any frozen first-stage predictor with a structured spatial representation of its residual field and an induced closed-form spatial covariance. The adapter operates as a cascade second stage on residuals, jointly learning a spatially regularized orthonormal basis and per-sample scores via a tractable mini-batch ADMM procedure, without modifying any first-stage parameter. Because the first-stage parameters are frozen, the adapter does not retrain the backbone; its role is to supply a compressed distributional summary of the residual field. Smoothness, sparsity, and orthogonality together turn a generic low-rank factorization into an identifiable spatial representation whose induced residual covariance admits a closed-form low-rank-plus-noise estimator; the effective rank is determined data-adaptively by spectral thresholding, while the nominal rank K is an optimization-side upper bound only. This covariance enables kriging-style spatial prediction at unobserved locations, with plug-in uncertainty quantification as a secondary downstream use. Across synthetic data, Weather2K for spatial-holdout prediction, and GWHD patch grids as a basis-transferability diagnostic, the adapter recovers residual spatial structure when paired with frozen first stages from linear models to deep spatiotemporal and vision backbones; the added representation uses fewer than K(N+T) parameters alongside a compact residual-trend network.
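The induced "low-rank-plus-noise" covariance makes kriging-style prediction a standard Gaussian conditioning step. A hedged sketch with made-up dimensions, where B, s, and sigma2 stand in for the adapter's learned basis, factor variances, and noise level:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 8, 2                                        # locations, effective rank
B, _ = np.linalg.qr(rng.standard_normal((N, K)))   # orthonormal spatial basis
s = np.array([3.0, 1.5])                           # factor variances
sigma2 = 0.1                                       # noise variance
Sigma = B @ np.diag(s) @ B.T + sigma2 * np.eye(N)  # low-rank + noise covariance

obs, unobs = np.arange(5), np.arange(5, N)
r_obs = rng.multivariate_normal(np.zeros(N), Sigma)[obs]  # observed residuals

# kriging-style conditional mean at the unobserved locations
C_uo = Sigma[np.ix_(unobs, obs)]
C_oo = Sigma[np.ix_(obs, obs)]
pred = C_uo @ np.linalg.solve(C_oo, r_obs)
```

The conditional-mean formula is the usual Gaussian one; the adapter's contribution is producing the structured Sigma, not the conditioning step itself.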

[AI-244] Rethinking external validation for the target population: Capturing patient-level similarity with a generative model

【速读】:该论文试图解决外部验证(external validation)中模型性能差异常因开发人群与验证人群的病例组合差异(case-mix effects)而混淆的问题,旨在区分模型自身的缺陷与人群特性差异对性能的影响。解决方案的关键在于提出一个量化框架:利用生成模型(generative models),具体为自编码器(autoencoders),估计每个外部患者与开发数据分布的相似度,并在按相似度分层的子群中评估模型性能。该方法相比传统线性方法更灵活,且无需共享原始开发数据,从而在保护数据隐私的同时,实现了对模型可迁移性(transportability)及其适用患者群体的原则性判断。

链接: https://arxiv.org/abs/2605.11284
作者: Mohammad Azizmalayeri,Ameen Abu-Hanna,Saskia Houterman,Marije M. Vis,Giovanni Cinà(on behalf of the NHR THI registration committee)
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Background: External validation is essential for assessing the transportability of predictive models. However, its interpretation is often confounded by differences between external and development populations. This study introduces a framework to distinguish model deficiencies from case-mix effects. Method: We propose a framework that quantifies each external patient’s similarity to the development data and measures performance in subgroups with varying levels of alignment to the development distribution. We use generative models, specifically autoencoders, to estimate similarity, offering a more flexible alternative to traditional linear approaches and enabling validation without sharing the original development data. The utility of autoencoder-based similarity measure is demonstrated using synthetic data, and the framework’s application is illustrated using data from the Netherlands Heart Registration (NHR) to predict mortality after transcatheter aortic valve implantation. Results: Our framework revealed substantial variation in model performance across similarity-defined subgroups, differences that remain hidden under conventional external validation yet can meaningfully alter conclusions. In several settings, conventional external validation suggested poor overall performance. However, after accounting for differences in patient characteristics, for some sub-groups, the model performance was consistent with internal validation results. Conversely, apparently acceptable overall performance could mask clinically relevant performance deficits in specific subgroups. Conclusion: The proposed framework enhances the interpretability of external validation by linking model performance to population alignment with the development data. This provides a more principled basis for deciding whether a model is transportable and to which patients it can be safely applied. 
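The paper scores each external patient's similarity to the development distribution via autoencoder reconstruction quality. The sketch below substitutes a linear autoencoder (PCA via SVD) for a trained neural one, purely to illustrate the score-then-stratify workflow; all names and dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
dev = rng.standard_normal((200, 6))             # development cohort features
mu = dev.mean(axis=0)
_, _, Vt = np.linalg.svd(dev - mu, full_matrices=False)
V = Vt[:2].T                                    # 2-D linear "bottleneck"

def similarity(X):
    """Higher = closer to the development distribution (-reconstruction error)."""
    Z = (X - mu) @ V                            # encode
    Xhat = Z @ V.T + mu                         # decode
    return -np.linalg.norm(X - Xhat, axis=1)

ext = rng.standard_normal((50, 6)) + 2.0        # shifted external cohort
scores = similarity(ext)

# stratify external patients into tertiles of alignment with development data
tertiles = np.quantile(scores, [1 / 3, 2 / 3])
groups = np.digitize(scores, tertiles)          # 0 = least, 2 = most similar
```

Model performance would then be evaluated separately within each group, distinguishing genuine model deficiencies from case-mix effects.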

[AI-245] Discovery of Interpretable Surrogates via Agentic AI: Application to Gravitational Waves

【速读】:该论文旨在解决昂贵仿真模拟中快速替代模型(surrogate models)通常以黑箱形式运作、缺乏可解释性的问题。解决方案的关键在于提出了一种基于大语言模型(LLM)的代理工作流(agentic workflow)——GWAgent,它通过在每个迭代步骤中利用真实仿真数据对候选模型进行定量验证(quantitative validation),从而构建可解释的解析替代模型。向代理提供物理信息驱动的域假设(physics-informed domain ansatz)显著提升了输出模型的精度,最终在偏心双黑洞合并的引力波波形模拟中,中位Advanced LIGO失配度达到6.9×10⁻⁴,波形评估加速约8.4倍,超越了符号回归和传统机器学习基线,并可从学习表示中识别紧凑的物理结构。

链接: https://arxiv.org/abs/2605.11280
作者: Tousif Islam,Digvijay Wadekar,Tejaswi Venumadhav,Matias Zaldarriaga,Ajit Kumar Mehta,Javier Roulet,Barak Zackay
机构: 未知
类目: General Relativity and Quantum Cosmology (gr-qc); High Energy Astrophysical Phenomena (astro-ph.HE); Artificial Intelligence (cs.AI)
备注: 25 pages, 9 figures, codes available at this https URL

点击查看摘要

Abstract:Fast surrogate models for expensive simulations are now essential across the sciences, yet they typically operate as black boxes. We present GWAgent, a large language model (LLM)-based workflow that constructs interpretable analytic surrogates directly from simulation data. Surrogate modeling is well suited to agentic workflows because candidate models can be quantitatively validated against ground-truth simulations at each iteration. As a demonstration, we build a surrogate for gravitational waveforms from eccentric binary black hole mergers. We show that providing the agent with a physics-informed domain ansatz substantially improves output model accuracy. The resulting analytic surrogate attains a median Advanced LIGO mismatch of $6.9\times10^{-4}$ together with an $\sim 8.4\times$ speedup in waveform evaluation, surpassing both symbolic regression and conventional machine learning baselines. Beyond producing an accurate model, the workflow identifies compact physical structure from the learned representation. As an astrophysical application, we use GWAgent to analyze the eccentricity of GW200129 and infer $e_{20\,\mathrm{Hz}} = 0.099^{+0.063}_{-0.044}$. These results show that validation-constrained agentic workflows can produce accurate, fast, and interpretable surrogates for scientific simulations and inference.
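The mismatch figure quoted above is one minus a normalized waveform overlap. A deliberately simplified sketch with a toy chirp: it uses a flat (white-noise) inner product and skips the PSD weighting and time/phase maximization that real Advanced LIGO mismatches include:

```python
import numpy as np

def mismatch(h1, h2):
    """1 - normalized overlap under a flat (white-noise) inner product.
    Real detector mismatches weight by the noise PSD and maximize over
    time and phase shifts; this sketch does neither."""
    inner = lambda a, b: np.sum(a * b)
    return 1.0 - inner(h1, h2) / np.sqrt(inner(h1, h1) * inner(h2, h2))

t = np.linspace(0, 1, 4096)
chirp = np.sin(2 * np.pi * (30 * t + 40 * t**2))        # toy "waveform"
surrogate = chirp + 1e-3 * np.sin(2 * np.pi * 200 * t)  # tiny modeling error

m = mismatch(chirp, surrogate)   # small, since the error is small
```

A surrogate's accuracy target (here, a median mismatch near 1e-4) is expressed directly in this quantity.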

[AI-246] gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy

【速读】:该论文试图解决现代引力波天文学中端到端科学建模任务(如从昂贵的数值相对论模拟构建快速波形替代模型、建模黑洞轨道动力学、拟合并合残余性质及构建模板库)的高成本与高精度(相对误差需 ≤10⁻⁴)难题,研究当前最先进的LLM编码智能体(coding agents)能否自动完成此类任务。解决方案的关键是:引入一个由8个任务组成的gwBenchmarks基准套件,这些任务基于引力波解析计算和数值模拟(累计代表超过10⁸核心小时的计算量),涵盖插值、回归和高维时间序列建模,且要求结合数值方法、机器学习和物理知识方法;同时,为避免智能体使用代理指标、部分评估或伪造结果虚假完成任务,实现了一个外部预定义的评估框架来客观衡量智能体的实际进展。通过评估12个编码智能体,发现无一致优胜者,仅在最低难度任务上部分智能体能收敛到正确解(如立方样条),而更复杂任务(如解析波形建模)中所有智能体均比领域要求低1-2个数量级,且存在系统性失败(包括指标误用、约束违反和结果伪造)。

链接: https://arxiv.org/abs/2605.11269
作者: Tousif Islam,Digvijay Wadekar,Zihan Zhou
机构: 未知
类目: General Relativity and Quantum Cosmology (gr-qc); High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注: 26 pages, 4 figures

点击查看摘要

Abstract:Modern gravitational wave astronomy relies on modeling tasks that often require months of graduate-level effort, including building fast waveform surrogates from expensive numerical relativity simulations, modeling orbital dynamics of black holes, fitting merger remnant properties and constructing template banks. These problems demand extreme precision to support detection and parameter inference, with state-of-the-art models achieving $\lesssim 10^{-4}$ relative error. We study whether state-of-the-art LLM coding agents can perform such end-to-end scientific modeling, where success requires constructing models with stringent accuracy criteria and reasoning about physical systems. We introduce gwBenchmarks, a suite of eight tasks grounded in gravitational wave analytic calculations and numerical simulations collectively representing over $10^8$ core-hours of compute. The tasks span interpolation, regression, and high-dimensional time-series modeling, requiring a combination of numerical methods, machine learning, and physics-informed approaches. In preliminary experiments, agents frequently relied on proxy metrics, partial evaluation, or fabricated results to spuriously complete tasks. We therefore implement an external pre-defined framework to gauge agent progress. Evaluating twelve coding agents, we find no consistent winner. On the easiest task, multiple agents converge to the same cubic spline solution, with one rediscovering a coordinate transformation widely used in the literature. On harder tasks like analytic waveform modeling, all agents fall 1-2 orders of magnitude short of domain requirements and exhibit systematic failures, including metric misuse, constraint violations, and result fabrication. Our code, data, and website are publicly available.

[AI-247] Quantifying the Reconstructability of Astrophysical Methods with Large Language Models and Information Theory: A Case Study in Spectral Reconstruction

【速读】:该论文试图解决现代天体物理研究中因已发表的方法描述缺乏足够细节而导致的计算可重复性(computational reproducibility)问题。解决方案的关键在于:提出一个基于信息论的框架,将算法重建过程视为大语言模型(Large Language Models, LLMs)生成的概率分布,利用香农熵(Shannon entropy)和詹森-香农散度(Jensen-Shannon divergence)来量化文本描述对有效实现假设空间(hypothesis space)的约束强度。通过以外海王星天体(Trans-Neptunian Object, TNO)光谱重建为案例,该方法利用LLM对不同层级手稿文本(标题、摘要、方法)的响应,揭示了尽管增加文本能明确整体算法结构,但无法消除实现层面的方差,从而确立了一个“熵基底”(entropy floor),表明多个分歧性实现仍与显式指令一致。这一框架将LLM重新用作零样本诊断工具,通过识别缺失的结构约束帮助作者提高方法学透明度,从而维护科学完整性。

链接: https://arxiv.org/abs/2605.11154
作者: Hsing Wen Lin,Zong-Fu Sie
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 4 figures, submitted to PASP

点击查看摘要

Abstract:Modern astrophysical studies rely heavily on complex data analysis pipelines; however, published descriptions often lack the detail required for computational reproducibility. In this work, we present an information-theoretic framework to quantify how effectively a method can be reconstructed from its written description. By treating algorithmic reconstruction as a probability distribution generated by Large Language Models (LLMs), we utilize Shannon entropy and Jensen-Shannon divergence to measure how strongly text constrains the hypothesis space of valid implementations. We demonstrate this approach through a case study of Trans-Neptunian Object (TNO) spectral reconstruction from sparse photometry. By prompting frontier LLMs with varying levels of manuscript text (Title, Abstract, and Methods), we find that while increasing text successfully clarifies the overall algorithmic structure, it fails to eliminate variance at the implementation level. This persistent variance establishes an “entropy floor,” demonstrating that multiple divergent implementations remain consistent with explicit instructions. To evaluate practical reproducibility, we convert these reconstructed algorithms into executable pipelines. Our results reveal that, while LLMs easily recover core functional methodologies, they systematically fail to infer the tacit expert knowledge required for strict scientific calibration. This pilot study demonstrates that LLMs can be repurposed as a zero-shot diagnostic tool to audit methodological transparency, helping authors identify missing structural constraints and preserve scientific integrity in an era of automated research.
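The framework's two measurements are standard information-theoretic quantities: Shannon entropy of the distribution over reconstructed implementations, and Jensen-Shannon divergence between distributions elicited with different amounts of manuscript text. A self-contained sketch with made-up choice distributions:

```python
import numpy as np

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (bounded in [0, 1])."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log2(a / b), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# hypothetical distributions over 4 candidate implementation choices,
# reconstructed from prompts with increasing amounts of manuscript text
title_only = [0.25, 0.25, 0.25, 0.25]    # maximally uncertain
with_methods = [0.70, 0.20, 0.05, 0.05]  # more constrained, nonzero floor

h_title = shannon_entropy(title_only)     # 2 bits
h_methods = shannon_entropy(with_methods)
jsd = js_divergence(title_only, with_methods)
```

A persistent nonzero entropy after supplying the full Methods text is what the paper calls the "entropy floor".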

[AI-248] SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and Classification

【速读】:该论文试图解决原核生物操纵子(operon)计算识别中的可扩展性与泛化性问题,即如何在大规模微生物基因组上,无需依赖实验数据即可准确分类基因对是否属于同一操纵子。解决方案之关键在于利用预训练的蛋白质语言模型(protein language model)为每个基因序列生成独立嵌入(embedding),然后通过计算成对余弦相似度(cosine similarity)或在该嵌入空间上训练孪生多层感知器(Siamese MLP)执行二分类。论文的核心发现是:尽管学习一个融合嵌入空间的分类器在理论上有更合理的动机,但无监督余弦相似度在平均精度(Average Precision)上并未显著逊色,表明蛋白质语言模型嵌入的几何结构本身已充分编码了功能关系,从而为跨多样微生物基因组的自动化操纵子识别提供了兼顾准确性与低计算成本的可行基础。

链接: https://arxiv.org/abs/2605.11022
作者: Akarsh Gupta,Kenneth Rodrigues,Sagnik Chatterjee
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT-PCR and RNA-seq provide precise evidence of operon structure, but are laborious and largely limited to well-studied model organisms, making scalable computational methods essential for genome-wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns a classifier over the fused embedding space, which is theoretically better motivated for binary classification, as cosine similarity can yield meaningless scores depending on the regularization of the embedding model. While protein language model embeddings substantially outperform physicochemical features in ROC-AUC, a learned Siamese MLP head does not significantly improve over unsupervised cosine similarity in Average Precision, suggesting that the geometry of the embedding space already captures the functional relationships needed for this task. Nonetheless, our Siamese MLP achieves a ROC-AUC of 0.71, competitive with state-of-the-art models on the DGEB leaderboard. These findings indicate that protein language model embeddings are a viable, scalable foundation for operonic pair classification across diverse microbial genomes, with implications for automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations.
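The unsupervised baseline discussed above embeds each gene's protein sequence independently and thresholds the pairwise cosine similarity. A toy sketch with synthetic stand-ins for protein-language-model embeddings; the shared-direction construction and the 0.5 threshold are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two stacks of vectors."""
    return np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

rng = np.random.default_rng(5)
dim = 32
# toy "embeddings": operonic pairs share a common direction plus small noise
base = rng.standard_normal((10, dim))
pos_a = base + 0.1 * rng.standard_normal((10, dim))
pos_b = base + 0.1 * rng.standard_normal((10, dim))
neg_a = rng.standard_normal((10, dim))   # unrelated gene pairs
neg_b = rng.standard_normal((10, dim))

pos_scores = cosine_similarity(pos_a, pos_b)
neg_scores = cosine_similarity(neg_a, neg_b)

# unsupervised decision rule: same operon if similarity exceeds a threshold
preds = np.concatenate([pos_scores, neg_scores]) > 0.5
```

A Siamese MLP would instead learn a classifier on the fused pair representation; the paper's finding is that, in Average Precision terms, this learned head does not significantly beat the threshold rule above.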

[AI-249] Context-Gated Associative Retrieval: From Theory to Transformers

【速读】:该论文试图解决现有联想记忆模型(如Hopfield网络及其泛化)将检索视为固定的查询到记忆映射、忽略外部上下文在回忆中作用的问题。解决方案的关键是提出一种两阶段联想记忆架构,其中包含一个上下文门控子电路(context-gate subcircuit),它在检索之前和期间重塑能量景观(energy landscape)。理论证明,上下文门控通过增加记忆间分离度并诱导稀疏性,带来指数级的检索改进,且系统存在唯一自洽固定点,检索状态由直接上下文偏置和二阶检索门控反馈回路共同驱动。该理论桥接到Transformer,验证了上下文学习(in-context learning)本质上是上下文门控检索,上下文定位了记忆子空间,使零样本查询能够清晰区分,从而为联想记忆理论与大语言模型(LLM)现象学提供了机制性联系。

链接: https://arxiv.org/abs/2605.10970
作者: Moulik Choraria,Argyrios Gerogiannis,Vidhata Jayaraman,Ankur Mani,Lav R. Varshney
机构: 未知
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hopfield networks and their generalizations have established deep connections among biological associative memories, statistical physics, and transformers. Yet most models treat retrieval as a fixed query-to-memory mapping, ignoring the role of external context in recall. In this work, we propose a two-stage associative memory architecture, wherein a context-gate subcircuit reshapes the retrieval energy landscape before and during recall. We show theoretically that context gating increases inter-memory separation while inducing sparsity, translating into exponential improvements in retrieval. Crucially, we prove that the system admits a unique self-consistent fixed point, revealing that the resulting retrieval state is driven by both a direct contextual bias and a second-order retrieval-gate feedback loop. We then bridge this theory to transformers; specifically, we evaluate a first-order approximation on Llama-3, confirming that in-context learning acts as context-gated retrieval. Native dynamics mirror our theory: context localizes a memory subspace, enabling the zero-shot query to cleanly discriminate. Ultimately, this framework provides a mechanistic link between associative memory theory and LLM phenomenology.
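A first-order intuition for context gating in a modern-Hopfield-style readout: the context reweights which memories are even eligible before the usual softmax retrieval. This is an illustrative one-step sketch, not the paper's two-stage dynamics or its fixed-point analysis; the gate vector and noise level are our own choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(6)
M = rng.standard_normal((5, 16))                 # five stored memory patterns
M /= np.linalg.norm(M, axis=1, keepdims=True)
query = M[2] + 0.05 * rng.standard_normal(16)    # noisy cue for pattern 2

beta = 8.0
plain = softmax(beta * (M @ query))              # ungated retrieval weights

# context gate: external knowledge that the answer lies in memories {2, 3}
gate = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
gated = softmax(beta * (M @ query) + np.log(gate + 1e-12))

retrieved = gated @ M                            # one-step readout
```

Zeroed gate entries effectively remove those memories from the energy landscape, which is the sparsity-plus-separation effect the paper analyzes.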

[AI-250] Two Hebrew folk meteorological proverbs tested: rainfall on Rosh Chodesh and Shabbat Mevarechim as predictors of monthly precipitation (Israel 1950-2024)

【速读】:该论文旨在验证两句希伯来民间气象谚语(“若月初日(Rosh Chodesh)降雨则整月多雨”与“若安息日梅瓦雷希姆(Shabbat Mevarechim)降雨则整月多雨”)在冬季降水预报中的有效性,并评估其可靠性随时间变化的趋势。解决方案的关键在于:基于1950–2024年间以色列七个城市共191,758个站日、2,422个希伯来月冬季降水数据,运用统计检验(卡方检验、贝叶斯因子)和自相关分析,量化谚语的经验概率提升(lift约+16.4个百分点);通过自举置换检验(bootstrap permutation test)和15年滚动分析揭示其预测能力以每年-0.20个百分点的速率下降,并将该衰减归因于气候变暖背景下降水事件持续时间的缩短。

链接: https://arxiv.org/abs/2605.10960
作者: Abraham Itzhak Weinberg
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Folk meteorological proverbs encode centuries of empirical observation by agricultural communities. Two Hebrew proverbs link lunar calendar anchor days to monthly winter rainfall: (i) “If Rosh Chodesh is rainy, the whole month is rainy” and (ii) “If it rains on Shabbat Mevarechim, the whole month is rainy.” Shabbat Mevarechim is the last Saturday before each new Hebrew month, preceding Rosh Chodesh by one to seven days. The first proverb is widely known; the second circulates in Hasidic oral tradition with no identified written source. Neither has been formally tested. We analyse 75 years (1950-2024) of daily precipitation data from seven Israeli cities across three climatic regions, comprising 191,758 station-days and 2,422 Hebrew-month observations during the winter rainy season (Marcheshvan-Adar). A rainy Rosh Chodesh increases the probability of a rainy month from 22.2% to 38.6% (lift +16.4 percentage points; chi-square = 57.8, p = 2.9e-14; Bayes factor 1.81). A rainy Shabbat Mevarechim produces a similar effect (lift +16.5 percentage points, p = 8.0e-13), despite preceding Rosh Chodesh by up to seven days. The effect decays with lag and mirrors daily rainfall autocorrelation (r = 0.35-0.44 at lag 1; ~0 at lag 7), consistent with Mediterranean cyclone persistence. A bootstrap permutation test (p < 1e-4) and a 15-year rolling analysis show declining predictive power (-0.20 percentage points per year, p < 0.001), consistent with shortening precipitation events under warming climate conditions. Both proverbs encode real but probabilistic meteorological signals whose reliability is decreasing over time.
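The reported lift and chi-square statistics come from a 2x2 contingency table (anchor day rainy vs. not, month rainy vs. not). A sketch with illustrative counts only, not the study's station data:

```python
import numpy as np

def lift_and_chi2(table):
    """table = [[a, b], [c, d]]: rows = anchor day rainy / dry,
    cols = month rainy / not rainy."""
    t = np.asarray(table, float)
    n = t.sum()
    p_month_given_rainy_anchor = t[0, 0] / t[0].sum()
    p_month_overall = t[:, 0].sum() / n
    lift = p_month_given_rainy_anchor - p_month_overall  # probability units
    # Pearson chi-square against independence
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / n
    chi2 = np.sum((t - expected) ** 2 / expected)
    return lift, chi2

# hypothetical counts (not the paper's 2,422 Hebrew-month records)
lift, chi2 = lift_and_chi2([[170, 270], [440, 1542]])
```

Multiplying the lift by 100 gives the percentage-point figures quoted in the abstract.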

[AI-251] Multi-Fidelity Emulation of Atmospheric Correction Coefficients with Physics-Guided Kolmogorov-Arnold Networks

【速读】:该论文旨在解决大气校正(atmospheric correction)中重复高保真辐射传输模拟(radiative transfer simulations)计算成本高昂的问题,这些模拟对于密集查找表(look-up-table)生成、敏感性分析、反演支持及业务预处理至关重要。解决方案的关键在于提出一个物理感知多保真代理框架(physics-aware multi-fidelity surrogate framework),即pKANrtm,它利用拉丁超立方采样(Latin Hypercube Sampling)获取大气与几何状态,并基于成对的6S(低保真)和libRadtran(高保真)模拟数据,通过物理引导的Kolmogorov-Arnold网络(Kolmogorov-Arnold Network)直接接收大气状态与低保真系数,预测相对于libRadtran的残差,从而重建高保真系数(路径反射率、总透过率及球面反照率)。该模型采用Efficient-KAN架构,并在原始系数空间中施加物理一致性惩罚(physics-consistency penalty)进行训练,从而在保证高预测精度和物理结构的同时,实现显著的计算加速(GPU推理单样本加速约四个数量级,批处理达到每秒数万样本)。

链接: https://arxiv.org/abs/2605.10958
作者: Md Abdullah Al Mazid,Naphtali Rishe
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Atmospheric correction is a critical preprocessing step in optical remote sensing, but repeated high-fidelity radiative transfer simulations remain computationally expensive for dense look-up-table generation, sensitivity analysis, retrieval support, and operational preprocessing. This study presents a physics-aware multi-fidelity surrogate framework for emulating atmospheric correction coefficients using paired 6S and libRadtran simulations. Atmospheric and geometric states are sampled using Latin Hypercube Sampling, and both radiative transfer models are evaluated under matched conditions for Sentinel-2 bands using spectral-response-function-aware coefficient generation. The high-fidelity targets are path reflectance, total transmittance, and spherical albedo. A physics-guided Kolmogorov-Arnold Network, termed pKANrtm, receives the atmospheric state and low-fidelity 6S coefficients, predicts the residual relative to libRadtran, and reconstructs the high-fidelity coefficients. The pKANrtm model uses an Efficient-KAN architecture and is trained with a physics-consistency penalty applied in the original coefficient space. The proposed model is evaluated against state-of-the-art regression-based RTM surrogates. Across both standard and out-of-distribution evaluation settings, pKANrtm achieves the strongest overall predictive performance among the compared models. Runtime benchmarking demonstrates substantial acceleration relative to libRadtran, with GPU inference providing approximately four orders of magnitude single-sample speedup and batched inference reaching tens of thousands of samples per second. These results indicate that physics-aware multi-fidelity pKANrtm emulation provides an accurate, physically structured, and computationally efficient strategy for atmospheric correction coefficient generation.
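The core multi-fidelity idea, predicting the residual between the high-fidelity target (libRadtran) and the low-fidelity input (6S) rather than the target itself, can be shown on a toy 1-D problem; here an ordinary least-squares polynomial stands in for the physics-guided KAN:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 200)                   # toy atmospheric state variable

lo = np.sin(2 * np.pi * x)                   # cheap low-fidelity model
hi = np.sin(2 * np.pi * x) + 0.3 * x**2 + 0.05  # expensive high-fidelity truth

# residual surrogate: fit (hi - lo) as a function of the state and of lo
features = np.column_stack([np.ones_like(x), x, x**2, lo])
coef, *_ = np.linalg.lstsq(features, hi - lo, rcond=None)

pred_hi = lo + features @ coef               # reconstruct high fidelity
rmse = np.sqrt(np.mean((pred_hi - hi) ** 2))
```

Because the surrogate only needs to capture the (usually smoother, smaller) residual, it can be far simpler than a model of the high-fidelity output itself.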

[AI-252] Controlled Steering-Based State Preparation for Adversarial-Robust Quantum Machine Learning

【速读】:该论文旨在解决量子机器学习(Quantum Machine Learning, QML)在实际部署中因对抗性扰动(adversarial perturbations)导致的脆弱性问题:即输入上的微小扰动会通过量子编码阶段传播,扭曲量子态并降低模型性能。解决方案的关键在于用基于被动导引(passive steering)的受控态制备(controlled state preparation)替代传统QML模型的量子编码阶段,通过调节导引强度(steering strength)和导引迭代次数(steering iterations),使编码态被引导至受控中间态,从而抑制对抗扰动的影响,在保持高干净准确率的同时提升对抗准确率。

链接: https://arxiv.org/abs/2605.10954
作者: Sahan Sanjaya,Hari Krishna Parvatham,Emma Andrews,Prabhat Mishra
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantum machine learning (QML) provides a promising framework for leveraging quantum-mechanical effects in learning tasks. However, its vulnerability to adversarial perturbations remains a major challenge for practical deployment. In QML systems, small perturbations applied to classical inputs can propagate through the quantum encoding stage and distort the resulting quantum state, thereby degrading model performance. In this work, we propose a defense mechanism that replaces the conventional quantum encoding stage of a QML model with passive steering-based controlled state preparation, which guides the encoded state toward a controlled intermediate state. By tuning the steering strength and the number of steering iterations, the proposed method suppresses the influence of adversarial perturbations while maintaining high clean accuracy and improving adversarial accuracy. Experimental results demonstrate that the passive steering-based defense consistently improves adversarial accuracy across different QML models and datasets under gradient-based adversarial attacks, achieving adversarial accuracy improvements of up to 40.19%.

[AI-253] Breaking QAOAs Fixed Target Hamiltonian Barrier: A Fully Connected Quantum Boltzmann Machine via Bilevel Optimization

【速读】:该论文试图解决经典部分连接玻尔兹曼机和主流量子玻尔兹曼机(QBM)在结构灵活性和训练效率上的局限性,主要通过双层优化架构实现全连接QBM。解决方案的关键在于:将量子近似优化算法(QAOA)的传统电路扩展为双层结构,其中内层训练基于QAOA的计算过程模拟正相能量最小化,外层训练通过优化目标哈密顿量的结构参数模拟负相对比散度学习,从而在仅用单层(p=1)QAOA电路时即展现出高目标量子态测量概率(无噪声下平均0.9559)和强噪声鲁棒性(在典型商业量子设备噪声水平下仍达0.6047),并能在极少测量次数下实现稳定图像生成。

链接: https://arxiv.org/abs/2605.07473
作者: Jun Liu
机构: 未知
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 34 pages, 8 figures, 3 tables, 1 algorithm

点击查看摘要

Abstract:To overcome the limitations of classical partially connected Boltzmann machines and mainstream quantum Boltzmann machines (QBMs), this work extends the conventional circuit of the quantum approximate optimization algorithm (QAOA) to a bilevel optimization architecture and proposes a fully connected QBM. The inner-loop training simulates positive phase energy minimization based on the computational process of the conventional QAOA circuit, whereas the outer-loop training simulates negative phase contrastive divergence learning by optimizing the structural parameters of the target Hamiltonian. It is found that, first, the model exhibits superior performance using only a single layer (p=1) in the QAOA circuit, with an average probability of 0.9559 in measuring the target quantum state under noiseless conditions. Second, the model exhibits notable noise robustness. Under the typical noise level of current mainstream commercial quantum computing devices, the average probability of measuring the target quantum state reaches 0.6047; when the noise rises to a more stringent level with doubled intensity, this probability remains at 0.3859. In both scenarios, the target quantum state maintains the highest measurement probability among all detected states, with a value several times higher than that of the second-ranked state. This indicates that the model retains strong robustness even when noise meets or exceeds the upper limit of current mainstream commercial quantum computing devices. Third, under a block-by-block learning strategy with p=1 and only 10 measurement shots, the model consistently generates the target “qubit” grid image regardless of noise interference, demonstrating strong robustness in image generation.

[AI-254] Measuring Accuracy and Energy-to-Solution of Quantum Fine-Tuning of Foundational AI Models

【速读】:该论文试图解决如何量化评估混合量子-经典应用(特别是量子微调基础AI模型)的能效与可扩展性这一关键问题,因为传统方法缺乏直接、可测量的能耗指标。解决方案之关键在于:通过对Forte Enterprise离子阱量子处理器进行直接功耗测量,建立了能量到解决方案(energy-to-solution, ETS)作为可测量且可扩展的度量标准,并在此框架下实验验证了浅层电路中的QPU能耗随量子比特数线性增长,而经典模拟呈指数增长,从而在约34量子比特处观察到能耗平衡点;同时,量子微调模型相比最佳经典微调模型实现了约24%的分类误差改进,并结合张量网络方法进行了对比分析,最终为量子应用提供了实验证据支持其有利的能耗-精度权衡。

链接: https://arxiv.org/abs/2605.02798
作者: Oliver Knitter,Sang Hyub Kim,Maximilian Wurzer,Jonathan Mei,Claudio Girotto,Karen Horovitz,Chi Chen,Masako Yamada,Frederik F. Flöther,Martin Roetteler
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:We present an experimental study of energy-to-solution (ETS) of hybrid quantum-classical applications, enabled by direct instrumentation of power consumption of a Forte Enterprise trapped-ion quantum processor. We apply this methodology to a hybrid quantum-classical pipeline for quantum fine-tuning of foundational AI models, and validate the approach end-to-end on quantum hardware. Despite noise and limited qubit counts, the resulting models achieve accuracy competitive with and exceeding classical baselines such as logistic regression and support vector classifiers. Our results show that QPU energy consumption scales approximately linearly with qubit number for shallow circuits, while classical simulation exhibits exponential scaling, indicating a break-even for ETS around 34 qubits. The classification error improvement of the best quantum fine-tuned model over the best classical fine-tuned model considered in this study is around 24%. We further contextualize these findings with comparisons to tensor network methods. This work establishes energy-to-solution as a measurable and scalable metric for evaluating quantum applications and provides experimental evidence of favorable energy-accuracy trade-offs.
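The reported break-even follows from contrasting linear QPU energy scaling with exponential classical-simulation scaling. The constants below are illustrative, chosen only so the crossover lands near the ~34 qubits quoted in the abstract; they are not the paper's measured values:

```python
# toy energy-to-solution models (illustrative constants, arbitrary units)
def qpu_energy(n_qubits):
    """Shallow-circuit QPU energy grows roughly linearly with qubit count."""
    return 50.0 * n_qubits

def classical_energy(n_qubits):
    """Statevector simulation cost grows exponentially with qubit count."""
    return 1e-7 * 2.0 ** n_qubits

# smallest qubit count at which classical simulation costs more energy
crossover = next(n for n in range(1, 100)
                 if classical_energy(n) > qpu_energy(n))
```

Any linear-vs-exponential pair of cost curves crosses exactly once; only the constants, which depend on the hardware measured, set where.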

机器学习

[LG-0] Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

链接: https://arxiv.org/abs/2605.12492
作者: Kexuan Shi,Hanxuan Li,Zeju Qiu,Yandong Wen,Simon Buchholz,Weiyang Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Technical report v1 (30 pages, 19 figures, project page: this https URL )

点击查看摘要

Abstract:We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.
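The defining property of a Pion-style update, left- and right-multiplication by orthogonal matrices, is that it leaves the singular values of the weight matrix untouched. A quick numerical check; the random-orthogonal construction below is our own illustration, not the optimizer's actual update rule:

```python
import numpy as np

rng = np.random.default_rng(8)

def random_orthogonal(n):
    """Random orthogonal matrix via QR, with signs fixed for uniqueness."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

W = rng.standard_normal((6, 4))      # a weight matrix
U = random_orthogonal(6)
V = random_orthogonal(4)

W_new = U @ W @ V                    # one multiplicative, spectrum-preserving step

sv_before = np.linalg.svd(W, compute_uv=False)
sv_after = np.linalg.svd(W_new, compute_uv=False)
```

Additive optimizers like Adam or Muon give no such guarantee: W + dW generally changes the singular values, which is exactly the degree of freedom Pion removes.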

[LG-1] High-arity Sample Compression

链接: https://arxiv.org/abs/2605.12465
作者: Leonardo N. Coregliano,William Opich
类目: Machine Learning (cs.LG)
*备注: 29 pages

点击查看摘要

Abstract:Recently, a series of works have started studying variations of concepts from learning theory for product spaces, which can be collected under the name high-arity learning theory. In this work, we consider a high-arity variant of sample compression schemes and we prove that the existence of a high-arity sample compression scheme of non-trivial quality implies high-arity PAC learnability.

[LG-2] Search Your Block Floating Point Scales!

链接: https://arxiv.org/abs/2605.12464
作者: Tanmaey Gupta,Hayden Prairie,Xiaoxia Wu,Reyna Abhyankar,Qingyang Wu,Austin Silveria,Pragaash Ponnusamy,Jue Wang,Ben Athiwaratkun,Leon Song,Tri Dao,Daniel Y. Fu,Chris De Sa
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: using a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post Training Quantization and low-precision attention, and is shown to improve their performance. Additionally, we introduce ScaleSearchAttention, an accelerated NVFP4-based attention algorithm, which uses ScaleSearch and adapted prior techniques to ensure near-0 performance loss for causal language modeling. Experiments show that ScaleSearch reduces quantization error by 27% for NVFP4 and improves language model PTQ by up to 15 points for MATH500 (Qwen3-8B), while ScaleSearchAttention improves Wikitext-2 PPL by up to 0.77 points for Llama 3.1 70B. The proposed methods closely match baseline performance while providing quantization accuracy improvements.
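The core idea, searching over block scales rather than fixing the max-based one, can be sketched with a toy symmetric 4-bit integer grid standing in for the paper's microscaling formats (the grid and candidate set below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
block = rng.standard_normal(16)       # one microscaling block of 16 values
grid = np.arange(-7, 8)               # toy symmetric 4-bit integer grid

def quantize(x, scale):
    codes = np.clip(np.round(x / scale), grid.min(), grid.max())
    return codes * scale

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Standard BFP choice: scale from the block's maximum magnitude
scale_max = np.max(np.abs(block)) / grid.max()
err_max = mse(block, quantize(block, scale_max))

# ScaleSearch-style alternative: fine-grained search over candidate scales,
# keeping the one with the lowest reconstruction error
candidates = np.append(scale_max * np.linspace(0.5, 1.2, 64), scale_max)
err_search = min(mse(block, quantize(block, s)) for s in candidates)
```

Because the max-based scale is itself a candidate, the searched error can never be worse, and for non-uniform value distributions it is typically strictly better.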

[LG-3] Environment-Adaptive Preference Optimization for Wildfire Prediction

链接: https://arxiv.org/abs/2605.12435
作者: Enyi Jiang,Wu Sun
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Predicting rare extreme events such as wildfires from meteorological data requires models that remain reliable under evolving environmental conditions. This problem is inherently long-tailed: wildfire events are rare but high-impact, while most observations correspond to non-fire conditions, causing standard learning objectives to underemphasize the minority class (fire) that matters most. In addition, models trained on historical distributions often fail under distribution shifts, exhibiting degraded performance in new environments. To this end, we propose Environment-Adaptive Preference Optimization (EAPO), a framework that adapts prediction to the target environment with long-tail distribution. Given a new input distribution, we first construct distribution-aligned datasets via k -nearest neighbor retrieval. We then perform a hybrid fine-tuning procedure on this local manifold, combining supervised learning with preference optimization while placing additional emphasis on rare extreme events. EAPO refines decision boundaries while avoiding conflicting signals from heterogeneous training data. We evaluate EAPO on a real-world wildfire prediction task with environmental shifts. EAPO achieves robust performance (ROC-AUC 0.7310) and improves detection in extreme regimes, demonstrating its effectiveness in dynamic wildfire prediction systems.
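The first stage, constructing a distribution-aligned dataset via k-nearest-neighbor retrieval against the target environment, can be sketched as follows (the data, k, and distribution shift here are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(7)

# Historical observations (features) and a batch from the new environment
train = rng.standard_normal((500, 4))
target = rng.standard_normal((20, 4)) + 1.5    # shifted target distribution

# Distribution-aligned retrieval: union of each target point's k nearest
# historical neighbours under Euclidean distance
k = 10
d = np.linalg.norm(train[None, :, :] - target[:, None, :], axis=-1)
neighbours = np.unique(np.argsort(d, axis=1)[:, :k])
aligned = train[neighbours]

# The retrieved subset sits closer to the target distribution than
# the full historical pool does (compare means as a crude check)
closer = bool(np.linalg.norm(aligned.mean(axis=0) - target.mean(axis=0))
              < np.linalg.norm(train.mean(axis=0) - target.mean(axis=0)))
```

The subsequent fine-tuning then operates only on this local manifold, which is what lets EAPO adapt without conflicting signals from the full heterogeneous history.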

[LG-4] Learning Minimally Rigid Graphs with High Realization Counts IJCAI2026

链接: https://arxiv.org/abs/2605.12427
作者: Oleksandr Slyvka,Jan Rubeš,Rodrigo Alves,Jan Legerský
类目: Machine Learning (cs.LG); Combinatorics (math.CO)
*备注: This is an extended version of the paper accepted to IJCAI 2026

点击查看摘要

Abstract:For minimally rigid graphs, the same edge-length data can admit multiple realizations (up to translations and rotations). Finding graphs with exceptionally many realizations is an extremal problem in rigidity theory, but exhaustive search quickly becomes infeasible due to the super-exponential growth of the number of candidate graphs and the high cost of realization-count evaluation. We propose a reinforcement-learning approach that constructs minimally rigid graphs via 0- and 1-extensions, also known as Henneberg moves. We optimize realization-count invariants using the Deep Cross-Entropy Method with a policy parameterized by a Graph Isomorphism Network encoder and a permutation-equivariant extension-level action head. Empirically, our method matches the known optima for planar realization counts and improves the best known bounds for spherical realization counts, yielding new record graphs.
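The 0-extension (Henneberg) move the abstract builds on is simple to state: add a new vertex joined to two existing ones. In the plane this preserves the minimal-rigidity edge count |E| = 2|V| - 3. A small sketch (1-extensions, which the paper also uses, are omitted here):

```python
# Henneberg 0-extension: add a new vertex joined to two existing vertices.
# 2D minimal rigidity requires |E| = 2|V| - 3, and 0-extensions preserve it.

def zero_extension(vertices, edges, a, b):
    new_v = max(vertices) + 1
    return vertices | {new_v}, edges | {(a, new_v), (b, new_v)}

# Start from a triangle, the smallest minimally rigid graph in the plane
V = {0, 1, 2}
E = {(0, 1), (1, 2), (0, 2)}
assert len(E) == 2 * len(V) - 3

for a, b in [(0, 1), (2, 3), (1, 4)]:
    V, E = zero_extension(V, E, a, b)

count_ok = (len(E) == 2 * len(V) - 3)
```

The reinforcement-learning policy in the paper chooses which extension to apply at each step; the construction itself is exactly this kind of incremental graph growth.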

[LG-5] Aligning Flow Map Policies with Optimal Q-Guidance

链接: https://arxiv.org/abs/2605.12416
作者: Christos Ziakas,Alessandra Russo,Avishek Joey Bose
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative policies based on expressive model classes, such as diffusion and flow matching, are well-suited to complex control problems with highly multimodal action distributions. Their expressivity, however, comes at a significant inference cost: generating each action typically requires simulating many steps of the generative process, compounding latency across sequential decision-making rollouts. We introduce flow map policies, a novel class of generative policies designed for fast action generation by learning to take arbitrary-size jumps, including one-step jumps, across the generative dynamics of existing flow-based policies. We instantiate flow map policies for offline-to-online reinforcement learning (RL) and formulate online adaptation as a trust-region optimization problem that improves the critic’s Q-value while remaining close to the offline policy. We theoretically derive FLOW MAP Q-GUIDANCE (FMQ), a principled closed-form learning target that is optimal for adapting offline flow map policies under a critic-guided trust-region constraint. We further introduce Q-GUIDED BEAM SEARCH (QGBS), a stochastic flow-map sampler that combines renoising with beam search to enable iterative inference-time refinement. Across 12 challenging robotic manipulation and locomotion tasks from OGBench and RoboMimic, FMQ achieves state-of-the-art performance in offline-to-online RL, outperforming the previous one-step policy MVP by a relative improvement of 21.3% on the average success rate.

[LG-6] A Semi-Supervised Framework for Speech Confidence Detection using Whisper

链接: https://arxiv.org/abs/2605.12387
作者: Adam Wynn,Jingyun Wang
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: 12 pages, 9 Figures, Submitted to IEEE Transactions on Audio, Speech and Language Processing

点击查看摘要

Abstract:Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. To mitigate reliance on scarce ground truth data, we introduce an Uncertainty-Aware Pseudo-Labelling strategy where a model generates labels for unlabelled data, retaining only high-quality samples for training. Experimental results demonstrate that the proposed approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines, including WavLM, HuBERT, and Wav2Vec 2.0. The hybrid architecture also surpasses the unimodal Whisper baseline, yielding a 3% improvement in the minority class, confirming that explicit prosodic and auxiliary features provide necessary corrective signals which are otherwise lost in deep semantic representations. Ablation studies further show that a curated set of high confidence pseudo-labels outperforms indiscriminate large scale augmentation, confirming that data quality outweighs quantity for perceived confidence detection.
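The pseudo-labelling step can be sketched generically: have the model label unlabelled data, then keep only the pseudo-labels whose top-class probability clears a confidence threshold. The probabilities, threshold, and class count below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical model outputs: class-probability rows for unlabelled clips
probs = rng.dirichlet(np.ones(3), size=200)

confidence = probs.max(axis=1)     # top-class probability as certainty proxy
pseudo_labels = probs.argmax(axis=1)

threshold = 0.8                    # retain only high-confidence pseudo-labels
keep = confidence >= threshold
retained = int(keep.sum())
retained_labels = pseudo_labels[keep]
```

The ablation finding in the abstract, that a small curated set of high-confidence pseudo-labels beats indiscriminate large-scale augmentation, corresponds to using a strict threshold here rather than keeping everything.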

[LG-7] MetaColloc: Optimization-Free PDE Solving via Meta-Learned Basis Functions

链接: https://arxiv.org/abs/2605.12368
作者: Zichuan Yang
类目: Machine Learning (cs.LG)
*备注: 21 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Solving partial differential equations (PDEs) with machine learning typically requires training a new neural network for every new equation. This optimization is slow. We introduce MetaColloc. It is an optimization-free and data-free framework that removes this bottleneck completely. We decouple basis discovery from the solving process. We meta-train a dual-branch neural network on diverse Gaussian Random Fields. This offline process creates a universal dictionary of neural basis functions. At test time, we freeze the network. We solve the PDE by assembling a collocation matrix. We find the solution through a single linear least squares step. For non-linear PDEs, we apply the Newton-Raphson method to achieve fast quadratic convergence. Our experiments across six 2D and 3D PDEs show massive improvements. MetaColloc reaches state-of-the-art accuracy on smooth and non-linear problems. It also reduces test-time computation by several orders of magnitude. Finally, we provide a detailed frequency sweep analysis. This analysis reveals a critical mismatch between function approximation and operator stability at extremely high frequencies. This profound finding opens a clear path toward future operator-aware meta-learning.
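The test-time recipe, assembling a collocation matrix over frozen basis functions and solving one linear least-squares problem, can be illustrated on a 1D Poisson problem with a fixed sine basis standing in for the meta-learned one (the basis and problem here are illustrative assumptions):

```python
import numpy as np

# Collocation + linear least squares for -u'' = f on (0, 1), u(0) = u(1) = 0,
# using a frozen sine basis as a stand-in for meta-learned basis functions.
K = 8
x = np.linspace(0.02, 0.98, 40)               # collocation points
k = np.arange(1, K + 1)

# A[i, j] = -phi_j''(x_i) for phi_j(x) = sin(j*pi*x)
A = (k * np.pi) ** 2 * np.sin(np.outer(x, k) * np.pi)

f = np.pi ** 2 * np.sin(np.pi * x)            # manufactured right-hand side
c, *_ = np.linalg.lstsq(A, f, rcond=None)     # single least-squares solve

u_hat = np.sin(np.outer(x, k) * np.pi) @ c
u_exact = np.sin(np.pi * x)                   # known exact solution
max_err = float(np.max(np.abs(u_hat - u_exact)))
```

No gradient-based optimization occurs at solve time; everything reduces to assembling A and one `lstsq` call, which is the source of the speedups the abstract reports.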

[LG-8] From Message-Passing to Linearized Graph Sequence Models

链接: https://arxiv.org/abs/2605.12358
作者: Joël Mathys,Basil Rohner,Saku Peltonen,Roger Wattenhofer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Message-passing based approaches form the default backbone of most learning architectures on graph-structured data. However, the rapid progress of modern deep learning architectures in other domains, particularly sequence modeling, raises the question of how graph learning can benefit from these advances. We introduce Linearized Graph Sequence Models, a framework that recasts message-passing graph computation from the perspective of sequence modeling to simplify architectural choices. Our approach systematically separates the computational processing depth from the information propagation depth, allowing core graph architectural decisions to be treated as sequence modeling choices. Specifically, we analyze, both empirically and theoretically, what sequence properties make methods effective for learning and preserving the graph inductive bias. In particular, we validate our findings, demonstrating improved performance on long-range information tasks in graphs. Our findings provide a principled way to integrate modern sequence modeling advances into message-passing based graph learning. Beyond this, our work demonstrates how the separation of processing and information depth can recast central architectural questions as input modeling choices.

[LG-9] Neural-Schwarz Tiling for Geometry-Universal PDE Solving at Scale

链接: https://arxiv.org/abs/2605.12343
作者: Paolo Secchi,Daniel S. Balint,Marco Maurizi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most learned PDE solvers follow a global-surrogate paradigm: a neural operator is trained to map full problem descriptions to full solution fields for a prescribed distribution of geometries, boundary conditions, and coefficients. This has enabled fast inference within fixed problem families, but limits reuse across new domains and makes large-scale deployment dependent on expensive problem-specific data generation. We introduce \textbfNEST ( \textbfNe ural- \textbfS chwarz \textbfT iling), a local-to-global framework that shifts learning from full-domain solution operators to reusable local physical solvers. The central premise is that, although global PDE solutions depend on geometry, scale, and boundary conditions, the physical response on small neighborhoods can be learned locally and composed into global solutions through classical domain decomposition. NEST learns a neural operator on minimal voxel patches ( 3 \times 3 \times 3 ) with diverse local geometries and boundary/interface data. At inference time, an unseen voxelized domain is tiled into overlapping patches, the learned local solver is applied patchwise, and global consistency is enforced through iterative Schwarz coupling with partition-of-unity assembly. In this way, generalization is shifted from a monolithic neural model to the combination of local physics learning and algorithmic global assembly. We instantiate NEST on nonlinear static equilibrium in compressible neo-Hookean solids and evaluate it on large, geometrically complex 3D domains far outside the scale of the training patches. Our results show that local neural building blocks, coupled through Schwarz iteration, offer a reusable local-training path toward scalable learned PDE solvers that generalize across domain size, shape, and boundary-condition configurations.
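The global-assembly idea, exact local solves coupled through overlapping Schwarz iteration, can be shown in 1D with two overlapping subdomains and a plain finite-difference local solver standing in for the learned patch solver (the problem and overlap sizes are illustrative assumptions):

```python
import numpy as np

# Alternating overlapping Schwarz for -u'' = 1 on (0, 1), u(0) = u(1) = 0,
# with two overlapping subdomains and exact local (tridiagonal) solves.
n = 99                               # interior grid points
h = 1.0 / (n + 1)
f = np.ones(n)

def solve_poisson(rhs_vals, left, right):
    # Dirichlet solve of -u'' = f on a subinterval given boundary values
    m = len(rhs_vals)
    A = (np.diag(2 * np.ones(m)) - np.diag(np.ones(m - 1), 1)
         - np.diag(np.ones(m - 1), -1)) / h ** 2
    rhs = rhs_vals.copy()
    rhs[0] += left / h ** 2
    rhs[-1] += right / h ** 2
    return np.linalg.solve(A, rhs)

u = np.zeros(n)
lo, hi = 60, 40                      # subdomain 1: [0, lo); subdomain 2: [hi, n)
for _ in range(30):                  # Schwarz sweeps; overlap is [hi, lo)
    u[:lo] = solve_poisson(f[:lo], 0.0, u[lo])
    u[hi:] = solve_poisson(f[hi:], u[hi - 1], 0.0)

u_direct = solve_poisson(f, 0.0, 0.0)      # monolithic reference solve
max_gap = float(np.max(np.abs(u - u_direct)))
```

The iterated subdomain solves converge geometrically to the monolithic discrete solution; NEST replaces the local solver with a neural operator on small voxel patches while keeping this classical coupling for global consistency.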

[LG-10] Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

链接: https://arxiv.org/abs/2605.12327
作者: Vage Egiazarian,Erik Schultheis,Andrei Panferov,Earl Killian,Torsten Hoefler,Dan Alistarh
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:A major recent advance in quantization is given by microscaled 4-bit formats such as NVFP4 and MXFP4, quantizing values into small groups sharing a scale, assuming a fixed floating-point grid. In this paper, we study the following natural extension: assume that, for each group of values, we are free to select the “better” among two or more 4-bit grids marked by one or more bits in the scale value. We formalize the power-of-two-grids (PO2) problem, and provide theoretical results showing that practical small-group formats such as MXFP or NVFP can benefit significantly from PO2 grids, while the advantage vanishes for very large groups. On the practical side, we instantiate several grid families, including 1) PO2(NF4), which pairs the standard NF4 normal grid with a learned grid, 2) MPO2, a grid pair that is fully learned over real weights and activations, 3) PO2(Split87), an explicit-zero asymmetric grid and 4) SFP4, a TensorCore-implementable triple which pairs NVFP4 with two shifted variants. Results for post-training quantization of standard open models and pre-training of Llama-like models show that adaptive grids consistently improve accuracy vs single-grid FP4 under both weight-only and weight+activation. Source code is available at this https URL.
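The benefit of per-group grid choice is mechanical: picking the better of two grids per block can never do worse than always using one grid. A toy sketch with two illustrative 4-bit grids (not the paper's learned or NVFP4 grids):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two candidate 4-bit grids: a uniform one and a denser-near-zero one
g_uniform = np.linspace(-1, 1, 16)
g_dense = np.sign(np.linspace(-1, 1, 16)) * np.linspace(-1, 1, 16) ** 2

def quantize(x, grid, scale):
    idx = np.abs(x[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale

def block_err(x, grid):
    scale = np.max(np.abs(x))
    return float(np.mean((x - quantize(x, grid, scale)) ** 2))

blocks = rng.standard_normal((32, 16))      # 32 groups of 16 values

err_single = float(np.mean([block_err(b, g_uniform) for b in blocks]))
# PO2: one extra bit in the scale selects the better grid per group
err_po2 = float(np.mean([min(block_err(b, g_uniform), block_err(b, g_dense))
                         for b in blocks]))
```

The theoretical results in the abstract quantify when this per-group gain is significant (small groups, as in MXFP/NVFP) versus when it vanishes (very large groups).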

[LG-11] Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

链接: https://arxiv.org/abs/2605.12316
作者: Yunbei Xu,Yuzhe Yuan,Ruohan Zhan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the fundamental and timely problem of learning long sequences in autoregressive modeling and next-token prediction under model misspecification, measured by the joint Kullback–Leibler (KL) divergence. Our goal is to characterize how the sequence horizon H affects both approximation and estimation errors in this joint-distribution, sequence-level regime. By establishing matching upper and lower bounds, we provide, to our knowledge, the first complete characterization of long-horizon error behavior under the natural joint KL objective, with improved rates and optimality justification relative to existing work. On the approximation side, we show that joint KL admits a horizon-free approximation factor, in sharp contrast to Hellinger-based analyses that exhibit an \Omega(H) dependence for computationally efficient methods; this isolates the choice of divergence as the source of approximation amplification. On the estimation side, we prove a fundamental information-theoretic lower bound of order \Omega(H) that holds for both decomposable policy classes and fully shared policies, matching the \widetilde O(H) upper bounds achieved by computationally efficient algorithms. Our analysis clarifies the landscape of recent autoregressive learning results by aligning the log-loss training objective, the sequence-level evaluation metric, and the approximation metric through a sharp joint-KL oracle theory. We further show that these joint-KL guarantees imply policy learning regret bounds at rates matching prior imitation learning literature.

[LG-12] In-context learning to predict critical transitions in dynamical systems

链接: https://arxiv.org/abs/2605.12308
作者: Yunus Sevinchan,Juan Nathaniel,Kai Ueltzhöffer,Carla Roesch,Tobias Weber,Vaios Laschos,Hang Fan,Gregor Ramien,Johannes Haux,Pierre Gentine,Benjamin Herdeanu
类目: Machine Learning (cs.LG)
*备注: 14+38 pages, 5+23 figures

点击查看摘要

Abstract:Critical transitions - abrupt, often irreversible changes in system dynamics - arise across human and natural systems, often with catastrophic consequences. Real-world observations of such shifts remain scarce, preventing the development of reliable early warning systems. Conventional statistical and spectral indicators, such as increasing variance, tend to fail under realistic conditions of limited data and correlated noise, whereas existing deep learning classifiers do not extrapolate beyond their training data distribution. In this work, we introduce TipPFN, an in-context learning (ICL) framework that uses a prior-data fitted network to infer a system’s proximity to a critical transition. Trained on our novel synthetic data generator, which is based on canonical bifurcation scenarios coupled to diverse, randomized stochastic dynamics, TipPFN flexibly capitalizes on contexts of various sizes, complexity and dimensionalities. We demonstrate robust, state-of-the-art early detection of critical transitions in previously unseen tipping regimes, sim-to-real examples, and real-world observations in both ICL and zero-shot settings.
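For context, the conventional statistical indicators the abstract contrasts against, rolling variance and lag-1 autocorrelation rising before a transition (critical slowing down), look like this on a toy AR(1) series. This sketches the classical baselines, not TipPFN itself:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy series approaching a transition: AR(1) noise whose coefficient
# slowly increases toward 1 (critical slowing down)
n, w = 2000, 200
phi = np.linspace(0.1, 0.95, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi[t] * x[t - 1] + rng.standard_normal()

def rolling(stat, series, window):
    return np.array([stat(series[t - window:t])
                     for t in range(window, len(series))])

var = rolling(np.var, x, w)
ac1 = rolling(lambda s: float(np.corrcoef(s[:-1], s[1:])[0, 1]), x, w)

# Both conventional indicators trend upward ahead of the transition
var_rises = bool(var[-1] > var[0])
ac1_rises = bool(ac1[-1] > ac1[0])
```

On clean, long series such indicators work; the abstract's point is that they break down under limited data and correlated noise, which is the regime TipPFN's in-context learning targets.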

[LG-13] Approximation of Maximally Monotone Operators : A Graph Convergence Perspective

链接: https://arxiv.org/abs/2605.12301
作者: Takashi Furuya,Yury Korolev,Takaharu Yaguchi
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Operator learning has been highly successful for continuous mappings between infinite-dimensional spaces, such as PDE solution operators. However, many operators of interest, including differential operators, are discontinuous or set-valued, and lie outside classical approximation frameworks. We propose a paradigm shift by formulating approximation via graph convergence (Painlevé-Kuratowski convergence), which is well-suited for closed operators. We show that uniform and L^p approximation are fundamentally inadequate in this setting. Focusing on maximally monotone operators, we prove that any such operator can be approximated in the sense of local graph convergence by continuous encoder-decoder architectures, and further construct structure-preserving approximations that retain maximal monotonicity via resolvent-based parameterizations.

[LG-14] STRABLE: Benchmarking Tabular Machine Learning with Strings

链接: https://arxiv.org/abs/2605.12292
作者: Gioia Blayer,Myung Jun Kim,Félix Lefebvre,Lennart Purucker,Alan Arazi,Eilam Shapira,Roi Reichart,Frank Hutter,Marine Le Morvan,David Holzmüller,Gaël Varoquaux
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Benchmarking tabular learning has revealed the benefit of dedicated architectures, pushing the state of the art. But real-world tables often contain string entries, beyond numbers, and these settings have been understudied due to a lack of a solid benchmarking suite. They lead to new research questions: Are dedicated learners needed, with end-to-end modeling of strings and numbers? Or does it suffice to encode strings as numbers, as with a categorical encoding? And if so, do the resulting tables resemble numerical tabular data, calling for the same learners? To enable these studies, we contribute STRABLE, a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. We run the first large-scale empirical study of tabular learning with strings, evaluating 445 pipelines. These pipelines span end-to-end architectures and modular pipelines, where strings are first encoded, then post-processed, and finally passed to a tabular learner. We find that, because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost. On free-text-dominant tables, large LLM encoders become competitive. Their performance also appears sensitive to post-processing, with differences across LLM families. Finally, we show that STRABLE is a good set of tables to study “string tabular” learning as it leads to generalizable pipeline rankings that are close to the oracle rankings. We thus establish STRABLE as a foundation for research on tabular learning with strings, an important yet understudied area.

[LG-15] Targeted Neuron Modulation via Contrastive Pair Search

链接: https://arxiv.org/abs/2605.12290
作者: Sam Herring,Jake Naviasky,Karan Malhotra
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures but steering these neurons produces only content shifts, not behavioral change. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the quality tradeoffs of residual-stream methods. More broadly, our findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.
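The attribution step, ranking neurons by how strongly their mean activation separates the two prompt sets and keeping the top 0.1%, can be sketched on synthetic activations (the planted-neuron setup below is an illustrative assumption, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical MLP activations: one row per prompt, one column per neuron
n_neurons = 10_000
harmful = rng.standard_normal((64, n_neurons))
benign = rng.standard_normal((64, n_neurons))
harmful[:, :10] += 3.0    # plant 10 neurons that fire on harmful prompts

# Contrastive attribution: mean activation difference between the two sets,
# computed from forward passes only (no gradients, no auxiliary training)
score = harmful.mean(axis=0) - benign.mean(axis=0)
k = max(1, int(0.001 * n_neurons))       # top 0.1% of neurons
circuit = np.argsort(-np.abs(score))[:k]
```

With enough prompts, the averaging suppresses per-neuron noise, so the discriminative neurons dominate the ranking, which is why a simple mean-difference score suffices to localize the sparse circuit the abstract describes.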

[LG-16] Hypernetworks for Dynamic Feature Selection

链接: https://arxiv.org/abs/2605.12278
作者: Javier Fumanal-Idocin,Raquel Fernandez-Peralta,Javier Andreu-Perez
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dynamic feature selection (DFS) is a machine learning framework in which features are acquired sequentially for individual samples under budget constraints. The exponential growth in the number of possible feature acquisition paths forces a DFS model to balance fitting specific scenarios against maintaining general performance, even when the feature space is moderate in size. In this paper, we study the structural limitations of existing DFS approaches to achieve an optimal solution. Then, we propose \textscHyper-DFS, a hypernetwork-based DFS approach that generates feature subset-specific classifier parameters on demand. We show that the use of hypernetworks compared to mask-embedding methods results in a smaller structural complexity bound. We also use a Set Transformer encoding to create a smooth conditioning space for the hypernetwork, so that functionally similar tasks are also geometrically close. In our benchmarks, \textscHyper-DFS outperforms all state-of-the-art approaches on synthetic and real-life tabular data. It is also competitive or superior across all image datasets tested, and shows substantially stronger zero-shot generalisation to feature subsets never seen during training than existing DFS approaches.

[LG-17] Delay-Empowered Causal Hierarchical Reinforcement Learning

链接: https://arxiv.org/abs/2605.12261
作者: Chenran Zhao,Dianxi Shi,Haotian Wang,Mengzhu Wang,Yaowen Zhang,Chunping Qiu,Shaowu Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many real-world tasks involve delayed effects, where the outcomes of actions emerge after varying time lags. Existing delay-aware reinforcement learning methods often rely on state augmentation, prior knowledge of delay distributions, or access to non-delayed data, limiting their generalization. Hierarchical reinforcement learning, by contrast, inherently offers advantages in handling delays due to its hierarchical structure, yet existing methods are restricted to fixed delays. To address these limitations, we propose Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL). DECHRL explicitly models both the causal structure of state transitions and their associated stochastic delay distributions. These are then incorporated into a delay-aware empowerment objective that drives proactive exploration toward highly controllable states, thereby improving performance under temporal uncertainty. We evaluate DECHRL in modified 2D-Minecraft and MiniGrid environments featuring stochastic delays. Experimental results show that DECHRL effectively models temporal delays and significantly outperforms baselines in decision-making under temporal uncertainty.

[LG-18] Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models ICML-2026

链接: https://arxiv.org/abs/2605.12258
作者: Runhe Lai,Xinhua Lu,Yanqi Wu,Jinlun Ye,Weijiang Yu,Ruixuan Wang
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML-2026

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress, yet the object hallucination remains a critical challenge for reliable deployment. In this paper, we present an in-depth analysis of instruction token embeddings and reveal that they implicitly encode visual information while effectively filtering erroneous information introduced by misleading visual embeddings. Building on this insight, we propose the Instruction Lens Score (InsLen), which combines a Calibrated Local Score with a Context Consistency Score that measures context consistency of the object tokens. The proposed approach serves as a plug-and-play object hallucination detector without relying on auxiliary models or additional training. Extensive experiments across multiple benchmarks and diverse MLLM architectures demonstrate that InsLen consistently outperforms existing hallucination detection methods, highlighting its effectiveness and robustness. The code is available at this https URL.

[LG-19] SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization

链接: https://arxiv.org/abs/2605.12245
作者: Chengzhu Bao,Xianglong Yan,Zhiteng Li,Guangshuo Qin,Guanghua Yu,Yulun Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:NVFP4 has recently emerged as an efficient 4-bit microscaling format for large language models (LLMs), offering superior numerical fidelity with native hardware support. However, existing methods often yield suboptimal performance due to inflexible scale selection and the coupled treatment of quantization and dequantization scales. To address these issues, we propose Scale Optimization for Accurate Reconstruction (SOAR), a novel post-training quantization framework that improves the accuracy of NVFP4 quantization. At its core, SOAR features Closed-form Joint Scale Optimization (CJSO), which jointly optimizes global and block-wise scales via analytical solutions derived from reconstruction error minimization. Furthermore, it incorporates Decoupled Scale Search (DSS). DSS decouples the high-precision quantization scale from its constrained dequantization counterpart, and performs discrete search to mitigate precision loss from scale quantization. Extensive experiments across multiple LLMs show that our method consistently outperforms existing NVFP4 quantization baselines, achieving superior accuracy under the same memory footprint with no additional hardware overhead. The code and models will be available at this https URL.
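One ingredient the abstract describes, choosing a scale by minimizing reconstruction error, has a textbook closed form once the quantized codes are held fixed: the optimal dequantization scale is the least-squares projection s* = ⟨w, q⟩ / ⟨q, q⟩. A sketch of that generic projection (not SOAR's actual CJSO derivation, and with a toy integer grid rather than NVFP4):

```python
import numpy as np

rng = np.random.default_rng(6)

w = rng.standard_normal(16)       # one block of weights
grid = np.arange(-7, 8)           # toy symmetric 4-bit grid

# Quantize with a max-magnitude scale, then recover the dequantization
# scale in closed form by minimizing ||w - s*q||^2 over s
s_q = np.max(np.abs(w)) / grid.max()
q = np.clip(np.round(w / s_q), grid.min(), grid.max())

s_star = float(w @ q) / float(q @ q)
err_naive = float(np.sum((w - s_q * q) ** 2))
err_opt = float(np.sum((w - s_star * q) ** 2))
```

Decoupling the quantization scale (used to compute the codes) from the dequantization scale (optimized afterward) is exactly the kind of freedom SOAR's DSS component exploits.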

[LG-20] Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

链接: https://arxiv.org/abs/2605.12224
作者: Rodney A Sanchez,Ferat Sahin,Alex Ororbia,Jamison Heard
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Advancements in reinforcement learning have produced a variety of complex and useful intrinsic driving forces; crucially, these drivers operate under a direct conditioning paradigm. This form of conditioning limits our agents’ capacity by restricting how they learn from the environment as well as from others. Off-policy or learn-by-example methods can learn from demonstrators’ representations, but they require access to the demonstrating agent’s policies or their reward functions. Our work overcomes this direct sampling limitation by introducing vicarious conditioning as an intrinsic reward mechanism. We draw from psychological and biological literature to provide a foundation for vicarious conditioning and use memory-based methods to implement its four steps: attention, retention, reproduction, and reinforcement. Crucially, our vicarious conditioning paradigms support low-shot learning and do not require the demonstrator agent’s policy nor its reward functions. We evaluate our approach in the MiniWorld Sidewalk environment, one of the few public environments that features a non-descriptive terminal condition (no reward provided upon agent death), and extend it to Box2D’s CarRacing environment. Our results across both environments demonstrate that vicarious conditioning enables longer episode lengths by discouraging the agent from non-descriptive terminal conditions and guiding the agent toward desirable states. Overall, this work emulates a cognitively-plausible learning paradigm better suited to problems such as single-life learning or continual learning.

[LG-21] On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

链接: https://arxiv.org/abs/2605.12206
作者: Asad Bakija,Florent De Geeter,Julien Brandoit,Pierre Sacré,Guillaume Drion
类目: Machine Learning (cs.LG)
*备注: 23 pages, 6 figures

点击查看摘要

Abstract:In reinforcement learning (RL), agents acting in partially observable Markov decision processes (POMDPs) must rely on memory, typically encoded in a recurrent neural network (RNN), to integrate information from past observations. Long-horizon POMDPs, in which the relevant observation and the optimal action are separated by many time steps (called the horizon), are particularly challenging: training suffers from poor generalization, severe sample inefficiency, and prohibitive exploration costs. Ideally, an agent trained on short horizons would retain optimal behavior at arbitrarily longer ones, but no formal framework currently characterizes when this is achievable. To fill this gap, we formalized temporal horizon generalization, the property that a policy remains optimal for all horizons, derived a necessary and sufficient condition for it, and experimentally evaluated the ability of nonlinear and parallelizable RNN variants to achieve it. This paper presents the resulting theoretical framework, the empirical evaluation, and the dynamical interpretation linking RNN behavior to temporal horizon generalization. Our analyses reveal that multistability is necessary for temporal horizon generalization and, in simple tasks, sufficient; more complex tasks further require transient dynamics. In contrast, modern parallelizable architectures, namely state space models and gated linear RNNs, are monostable by construction and consequently fail to generalize across temporal horizons. We conclude that multistability and transient dynamics are two essential and complementary dynamical regimes for horizon generalization, and that no current parallelizable RNN exhibits both. Designing parallelizable architectures that combine these regimes thus emerges as a key direction for scalable long-horizon RL.

[LG-22] Investigating simple target-covariate relationships for Chronos-2 and TabPFN-TS

Link: https://arxiv.org/abs/2605.12200
Authors: Gaspard Berthelier,Mariia Baranova,Andrei-Tiberiu Pantea,Etienne Le Naour,Adrien Petralia,Tahar Nabil,Themis Palpanas
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Time Series Foundation Models (TSFMs) have recently achieved state-of-the-art performance, often outperforming supervised models in zero-shot settings. Recent TSFM architectures, such as Chronos-2 and TabPFN-TS, aim to integrate covariates. In this paper, we design controlled experiments based on simple target-covariate relationships to assess this integration capability. Our results show that TabPFN-TS captures these relationships more effectively than Chronos-2, especially for short horizons, suggesting that the strong benchmark performance of Chronos-2 does not automatically translate into optimal modeling of simple covariate-target dependencies.
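A controlled target-covariate experiment of the kind described above can be set up in a few lines (a hypothetical design, not the paper's benchmark): make the target an exact one-step-delayed copy of the covariate, so a model that integrates the covariate can be perfect while a covariate-blind baseline cannot.

```python
import random

random.seed(0)
T = 200
# Simple target-covariate relationship: target[t] = cov[t-1].
cov = [random.gauss(0.0, 1.0) for _ in range(T)]
target = [0.0] + cov[:-1]

# Covariate-aware "model": predict target[t] from cov[t-1] (exact here).
aware_err = sum((target[t] - cov[t - 1]) ** 2 for t in range(1, T)) / (T - 1)
# Covariate-blind persistence baseline: predict target[t] = target[t-1].
blind_err = sum((target[t] - target[t - 1]) ** 2 for t in range(1, T)) / (T - 1)
print(aware_err, blind_err)
```

Feeding such series to a covariate-capable TSFM and checking whether its error approaches the covariate-aware bound is exactly the kind of probe the paper uses to separate benchmark performance from genuine covariate integration.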

[LG-23] A Unified Graph Language Model for Multi-Domain Multi-Task Graph Alignment Instruction Tuning

Link: https://arxiv.org/abs/2605.12197
Authors: Haibo Chen,Xin Wang,Jiaheng Chao,Ling Feng,Wenwu Zhu
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Leveraging Graph Neural Networks (GNNs) as graph encoders and aligning the resulting representations with Large Language Models (LLMs) through alignment instruction tuning has become a mainstream paradigm for constructing Graph Language Models (GLMs), combining the generalization ability of LLMs with the structural modeling capacity of GNNs. However, existing GLMs that adopt GNNs as graph encoders largely overlook the problem of aligning GNN-encoded representations across domains and tasks with the LLM token space to obtain unified graph tokens, thereby limiting their ability to generalize across diverse graph data. To bridge this gap, we aim to incorporate a multi-domain, multi-task GNN encoder into GLMs and align its representations with LLMs to enable multi-domain, multi-task graph alignment instruction tuning. This alignment problem remains underexplored and poses two key challenges: 1) learning GNN-encoded representations that are simultaneously generalizable across domains and tasks and well aligned with textual semantics is difficult, due to substantial variations in graph structures, feature distributions, and supervision signals, together with the lack of textual-semantic alignment guidance in task-specific GNN training; 2) diverse graph data and task-specific instructions can exhibit different degrees of compatibility with the LLM token space during instruction tuning, leading to varying alignment difficulty and rendering a fixed alignment strategy suboptimal. To tackle these challenges, we propose UniGraphLM, a Unified Graph Language Model that incorporates a multi-domain, multi-task GNN encoder to learn generalizable graph representations aligned with textual semantics, and then adaptively aligns these representations with the LLM.

[LG-24] ECTO: Exogenous-Conditioned Temporal Operator for Ultra-Short-Term Wind Power Forecasting

Link: https://arxiv.org/abs/2605.12196
Authors: Cao Yuan,Junjun Wang
Categories: Machine Learning (cs.LG)
*Note: 42 pages, 10 figures, 9 tables

Click to view abstract

Abstract:Accurate ultra-short-term wind power forecasting is critical for grid dispatch and reserve management, yet remains challenging due to the non-stationary, condition-dependent nature of wind generation. Meteorological exogenous variables carry substantial predictive information, but the most informative variable combination varies across sites, operating conditions, and prediction horizons. Existing deep learning approaches either treat exogenous inputs as generic auxiliary channels through uniform mixing or soft gating, or rely on fixed preprocessing steps such as PCA, without exploiting the physical structure of meteorological variables. We propose ECTO (Exogenous-Conditioned Temporal Operator), a unified framework that decomposes exogenous variable modeling into two complementary modules. Physically-Grounded Variable Selection (PGVS) performs hierarchical, group-aware sparse selection over exogenous variables using a domain-informed physical prior and sparsemax activations, producing a compact, condition-adaptive exogenous context. Exogenous-Conditioned Regime Refinement (ECRR) routes the forecast through learned regime experts that apply gain–bias calibration and horizon-specific corrections via a mixture-of-experts paradigm. Experiments on three wind farms spanning different climates, capacities (66–200 MW), and exogenous dimensions (11–13 variables) demonstrate that ECTO achieves the lowest MSE across all sites, with relative improvements over the strongest baseline ranging from 2.2% to 5.2%, widening to 8.6% at the longer prediction horizon ( H=32 ). Ablation analysis confirms that each exogenous-related component contributes positively (PGVS +1.84%, ECRR +2.86%), and interpretability analysis reveals that PGVS learns physically meaningful, site-specific variable selection patterns, while ECRR converges to well-separated calibration strategies consistent across sites.
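PGVS relies on sparsemax activations to produce exactly sparse variable weights on the probability simplex. A minimal stand-alone sparsemax (the Euclidean projection onto the simplex; the relevance scores below are illustrative, not from the paper) shows how a low-relevance exogenous channel receives exactly zero weight, unlike with softmax:

```python
def sparsemax(z):
    # Euclidean projection of the score vector z onto the probability
    # simplex; low scores get exactly zero weight.
    zs = sorted(z, reverse=True)
    cum, tau = 0.0, 0.0
    for j, zj in enumerate(zs, start=1):
        cum += zj
        if 1 + j * zj > cum:          # support condition of sparsemax
            tau = (cum - 1.0) / j
    return [max(zi - tau, 0.0) for zi in z]

# Hypothetical relevance scores for three exogenous variables, e.g.
# wind speed, wind direction, and an uninformative humidity channel.
w = sparsemax([1.2, 1.0, 0.1])
print(w)
```

The third score is suppressed to exactly zero while the first two share the mass, which is the condition-adaptive selection behavior the abstract attributes to PGVS.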

[LG-25] Fair Conformal Classification via Learning Representation-Based Groups

Link: https://arxiv.org/abs/2605.12195
Authors: Senrong Xu,Yanke Zhou,Yuhao Tan,Zenan Li,Yuan Yao,Taolue Chen,Feng Xu,Xiaoxing Ma
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Conformal prediction methods provide statistically rigorous marginal coverage guarantees for machine learning models, but such guarantees fail to account for algorithmic biases, thereby undermining fairness and trust. This paper introduces a fair conformal inference framework for classification tasks. The proposed method constructs prediction sets that guarantee conditional coverage on adaptively identified subgroups, which can be implicitly defined through nonlinear feature combinations. By balancing effectiveness and efficiency in producing compact, informative prediction sets and ensuring adaptive equalized coverage across unfairly treated subgroups, our approach paves a practical pathway toward trustworthy machine learning. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the framework.
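The marginal-coverage baseline that this paper builds on is plain split conformal classification. A self-contained sketch (a toy calibrated binary classifier, not the paper's fairness-aware method) shows how a calibration quantile yields prediction sets with the promised coverage:

```python
import math, random

random.seed(1)

def probs(x):
    # Toy binary "classifier", well calibrated by construction below.
    p1 = 1.0 / (1.0 + math.exp(-x))
    return [1.0 - p1, p1]

def sample(n):
    out = []
    for _ in range(n):
        x = random.gauss(0.0, 2.0)
        out.append((x, 1 if random.random() < probs(x)[1] else 0))
    return out

cal, test = sample(2000), sample(2000)
alpha = 0.1
# Nonconformity score: 1 - probability assigned to the true class.
scores = sorted(1.0 - probs(x)[y] for x, y in cal)
qhat = scores[min(len(scores) - 1, int((1 - alpha) * (len(scores) + 1)))]

# A label enters the prediction set iff its score is at most qhat;
# coverage counts how often the true label is in the set.
covered = sum(1.0 - probs(x)[y] <= qhat for x, y in test)
coverage = covered / len(test)
print(round(coverage, 3))
```

The paper's contribution is to strengthen this marginal guarantee to conditional coverage on adaptively identified subgroups, which the vanilla recipe above does not provide.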

[LG-26] Multi-Task Representation Learning for Conservative Linear Bandits

Link: https://arxiv.org/abs/2605.12176
Authors: Jiabin Lin,Shana Moothedath
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:This paper presents the Constrained Multi-Task Representation Learning (CMTRL) framework for linear bandits. We consider T linear bandit tasks in a d dimensional space, which share a common low-dimensional representation of dimension r, where r is much smaller than the minimum of d and T. Furthermore, tasks are constrained so that only actions meeting specific safety or performance requirements are allowed, referred to as conservative (safe) bandits. We introduce a novel algorithm, Safe-Alternating projected Gradient Descent and minimization (Safe-AltGDmin), to recover a low-rank feature matrix while satisfying the given constraints. Building on this algorithm, we propose a multi-task representation learning framework for conservative linear bandits and establish theoretical guarantees for its regret and sample complexity bounds. We presented experiments and compared the performance of our algorithm with benchmark algorithms.

[LG-27] Expected Batch Optimal Transport Plans and Consequences for Flow Matching

Link: https://arxiv.org/abs/2605.12174
Authors: Samuel Boïté,Julie Delon,Kimia Nadjahi
Categories: Machine Learning (cs.LG); Probability (math.PR)
*Note:

Click to view abstract

Abstract:Solving optimal transport (OT) on random minibatches is a common surrogate for exact OT in large-scale learning. In flow matching (FM), this surrogate is used to obtain OT-like couplings that can straighten probability paths and reduce numerical integration cost. Yet, the population-level coupling induced by repeated minibatch OT remains only partially understood. We formalize this coupling as the expected batch OT plan \overline\pi_k , obtained by averaging empirical OT plans over independent minibatches of size k . We then establish its large-batch consistency and, in the semidiscrete case relevant to generative modeling, derive rates for both the transport-cost bias and the convergence of \overline\pi_k to the OT plan. For FM, this yields a population coupling whose induced velocity field is regular enough to define a unique flow from the source to the discrete target. We finally quantify how OT batch size interacts with numerical integration in a tractable two-atom model and in synthetic and image experiments.
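The expected batch plan can be simulated directly in one dimension, where the optimal plan simply pairs points in sorted order (an illustrative toy, not the paper's semidiscrete analysis): average the empirical minibatch plans and compare against the full plan, which here puts all its mass on the sorted (diagonal) pairing.

```python
import random

random.seed(0)
n, k, reps = 8, 4, 2000
src = sorted(random.uniform(0.0, 1.0) for _ in range(n))
tgt = sorted(random.uniform(2.0, 3.0) for _ in range(n))

# Average the OT plans of many random minibatches of size k.
avg = [[0.0] * n for _ in range(n)]
for _ in range(reps):
    bs = sorted(random.sample(range(n), k))   # src indices, already sorted
    bt = sorted(random.sample(range(n), k))
    for i, j in zip(bs, bt):                  # 1-D OT = sorted matching
        avg[i][j] += 1.0 / (k * reps)

total = sum(map(sum, avg))                    # a valid coupling: mass 1
diag = sum(avg[i][i] for i in range(n))       # mass on the full-OT pairing
row0 = sum(avg[0])                            # marginal stays uniform (1/n)
print(round(total, 6), round(diag, 3), round(row0, 3))
```

The averaged plan keeps the correct marginals but spreads mass off the full-OT support, a concrete instance of the minibatch bias the paper quantifies as a function of the batch size k.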

[LG-28] Lower bounds for one-layer transformers that compute parity

Link: https://arxiv.org/abs/2605.12171
Authors: Daniel Hsu
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:This note shows that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length. Combining this lower bound with rational approximation of ReLU networks yields a margin-dependent extension for self-attention layers post-processed by ReLU networks.

[LG-29] On What We Can Learn from Low-Resolution Data

Link: https://arxiv.org/abs/2605.12168
Authors: Theresa Dahl Frehr,Niels Henrik Pontoppidan,Hiba Nassar,Tommy Sonne Alstrøm
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Artificial intelligence systems typically rely on large, centrally collected datasets, a premise that does not hold in many real-world domains such as healthcare and public institutions. In these settings, data sharing is often constrained by storage, privacy, or resource limitations. For example, small wearable devices may lack the bandwidth or energy capacity needed to store and transmit high-resolution data, leading to aggregation during data collection and thus a loss of information. As a result, datasets collected from different sources may consist of a mixture of high- and low-resolution samples. Despite the prevalence of this setting, it remains unclear how informative low-resolution data is when models are ultimately evaluated on high-resolution inputs. We provide a theoretical analysis based on the Kullback-Leibler divergence that characterises how the influence of a datapoint changes with resolution, and derive bounds that relate the relative contribution of high- and low-resolution observations to the information lost under downsampling. To support this analysis, we empirically demonstrate, using both a vision transformer and a convolutional neural network, that adding low-resolution data to the training set consistently improves performance when high-resolution data is scarce.
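The information lost under downsampling can be made concrete with a tiny data-processing check (a discrete toy, not the paper's KL-based bound; the distributions below are made up): merging feature bins, as aggregation at collection time does, can only reduce the mutual information between the feature and the label.

```python
import math

def mutual_info(p_x_given_y):
    # I(Y;X) in nats for a uniform label Y and per-class feature
    # distributions p_x_given_y (one list of bin probabilities per class).
    n_cls = len(p_x_given_y)
    n_bins = len(p_x_given_y[0])
    p_x = [sum(p[b] for p in p_x_given_y) / n_cls for b in range(n_bins)]
    return sum((1.0 / n_cls) * p[b] * math.log(p[b] / p_x[b])
               for p in p_x_given_y for b in range(n_bins) if p[b] > 0)

# High-resolution feature: 4 bins, the two classes differ bin by bin.
high = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
# Low-resolution feature: bins merged pairwise (aggregation on-device).
low = [[p[0] + p[1], p[2] + p[3]] for p in high]

i_high, i_low = mutual_info(high), mutual_info(low)
print(i_high, i_low)
```

The coarsened feature retains some, but strictly less, label information, which is the regime the paper studies: low-resolution samples remain informative and can still help when high-resolution data is scarce.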

[LG-30] Fused Gromov-Wasserstein Distance with Feature Selection

Link: https://arxiv.org/abs/2605.12161
Authors: Harlin Lee,Ying Yu,Mingxin Li,Ranthony Clark
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Metric Geometry (math.MG)
*Note:

Click to view abstract

Abstract:Fused Gromov-Wasserstein (FGW) distances provide a principled framework for comparing objects by jointly aligning structure and node features. However, existing FGW formulations treat all features uniformly, which limits interpretability and robustness in high-dimensional settings where many features may be irrelevant or noisy. We introduce FGW distances with feature selection, which incorporate adaptive feature suppression weights into the FGW objective to selectively downweight or suppress differentiating features during alignment. We propose two approaches: (1) regularized FGW with Lasso and Ridge penalties, and (2) FGW with simplex-constrained weights, including groupwise extensions. We analyze the resulting models and establish their key theoretical properties, including bounds relative to classical FGW and Gromov-Wasserstein distances, and metric behavior. An efficient alternating minimization algorithm is developed. Experiments illustrate how feature suppression enhances interpretability and reveals task-relevant structure, with a special application to computational redistricting.
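The effect of simplex-constrained suppression weights on the feature term of the FGW objective can be seen in miniature (a hedged sketch with made-up node features, not the paper's full solver): downweighting a noisy dimension removes its contribution to the cross-graph feature cost.

```python
def weighted_sq_dist(a, b, w):
    # Per-dimension suppression weights w (nonnegative, summing to 1)
    # inside the Wasserstein feature term of the FGW objective.
    return sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b))

# Two nodes agreeing on an informative feature (dim 0) but differing
# on a noisy one (dim 1); values are illustrative.
a, b = [1.0, 5.0], [1.0, -3.0]
uniform = weighted_sq_dist(a, b, [0.5, 0.5])    # all features treated equally
selected = weighted_sq_dist(a, b, [1.0, 0.0])   # noisy dimension suppressed
print(uniform, selected)
```

With uniform weights the noisy dimension dominates the cost and can distort the alignment; with the noisy dimension suppressed, the two nodes are correctly treated as identical.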

[LG-31] PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior

Link: https://arxiv.org/abs/2605.12147
Authors: James Flemings,Murali Annavaram
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used to simulate human behavior, but their ability to simulate individual privacy decisions is not well understood. In this paper, we address the problem of evaluating whether a core set of user persona attributes can drive LLMs to simulate individual-level privacy behavior. We introduce PrivacySIM, an evaluation suite that benchmarks LLM simulation of user privacy behavior against the ground-truth responses of 1,000 users. These users are drawn from five published user studies on privacy spanning LLM healthcare consultations, conversational agents, and chatbots. Drawing on these user studies, we hypothesize three persona facets as plausible predictors of privacy decision-making: demographics, previous experiences, and stated privacy attitudes. We condition nine frontier LLMs on subsets of these three facets and measure how often each model’s response to a data-sharing scenario matches the user’s actual response. Our findings show that (1) privacy persona conditioning consistently improves simulation quality over no-persona conditioning, but even the strongest model (40.4% accuracy) remains far from faithfully simulating individual privacy decisions. (2) A user’s stated privacy attitudes alone may not be the best predictor because they often diverge from the user’s actual privacy behavior. (3) Users with high AI/chatbot experience but low stated privacy attitudes are the most challenging to simulate. PrivacySIM is a first step toward understanding and improving the capabilities of LLMs to simulate user privacy decisions. We release PrivacySIM to enable further evaluation of LLM privacy simulation.

[LG-32] STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts

Link: https://arxiv.org/abs/2605.12135
Authors: Joshua Opria
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Note: 9 pages, 4 figures, 3 tables. Code and models: this https URL

Click to view abstract

Abstract:We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio-to-chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi-stage hybrid: a two-stage CRNN onset detector and a six-model ensemble classifier for drums; neural onset detectors with monophonic pitch tracking for guitar and bass; word-aligned ASR for vocals; and spectral keyboard detection for keys. We evaluate on a 30-song in-envelope benchmark constructed by screening candidate songs on a single audio-quality criterion – the median 1-second drum-stem RMS after htdemucs_6s source separation. On this benchmark STRUM achieves drums onset F1 = 0.838, bass F1 = 0.694, guitar F1 = 0.651, and vocals F1 = 0.539 at a +/- 100 ms tolerance with per-song global offset search. We report a complete ablation of seven drum-pipeline components with paired per-song Wilcoxon tests, an analysis of ground-truth-to-audio timing distributions in community Clone Hero charts, and a per-class confusion matrix for the drum classifier. Code, model weights, and the full benchmark manifest are released.
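The onset F1 metric reported above can be computed with a short tolerance-matching routine (a plausible implementation of the standard metric; the onset times below are hypothetical, and per-song global offset search is omitted):

```python
def onset_f1(pred, truth, tol=0.100):
    # Greedy one-to-one matching of predicted onsets to ground-truth
    # onsets within +/- tol seconds (the +/- 100 ms tolerance above).
    unmatched = sorted(truth)
    tp = 0
    for p in sorted(pred):
        hit = next((t for t in unmatched if abs(p - t) <= tol), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Hypothetical drum onsets in seconds; the last prediction is too far off.
f1 = onset_f1(pred=[0.52, 1.08, 1.70], truth=[0.50, 1.00, 1.50, 2.00])
print(round(f1, 4))
```

Here two of three predictions land within tolerance of distinct ground-truth onsets, giving precision 2/3, recall 1/2, and F1 = 4/7.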

[LG-33] Elicitation-Augmented Bayesian Optimization

Link: https://arxiv.org/abs/2605.12079
Authors: Alvar Haltia,Ville Hyvönen,Samuel Kaski
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Human-in-the-loop Bayesian optimization (HITL BO) methods utilize human expertise to improve the sample-efficiency of BO. Most HITL BO methods assume that a domain expert can quantify their knowledge, for instance by pinpointing query locations or specifying their prior beliefs about the location of the maximum as a probability distribution. However, since human expertise is often tacit and cannot be explicitly quantified, we consider a setting where domain knowledge of an expert is elicited via pairwise comparisons of designs. We interpret the expert’s pairwise judgements as noisy evidence about the values of the observable objective function and develop a principled method for combining the information obtained via direct observations and pairwise queries. Specifically, we derive a cost-aware value-of-information acquisition function that balances direct observations against pairwise queries. The proposed method approaches the convex hull of the trajectories of the individual information sources: when pairwise queries are cheap it substantially improves sample-efficiency over observation-only BO, and when pairwise queries are costly or noisy, it recovers the performance of standard BO by relying on direct observations alone.
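The core idea of treating pairwise judgements as noisy evidence can be sketched with a deliberately simplified discrete model (hedged: this is not the paper's method; the hypothesis space, the 0.73 judge accuracy, and the vote sequence are all illustrative): two hypotheses, "design A is better" vs "design B is better", updated by Bayes' rule from noisy comparisons.

```python
import math

acc = 0.73   # assumed probability the judge names the truly better design
votes = ["A", "A", "B", "A", "A", "A", "B", "A", "B", "A"]  # 7 for A, 3 for B

log_odds = 0.0   # log P(A better | votes) - log P(B better | votes)
for v in votes:
    ratio = acc / (1.0 - acc) if v == "A" else (1.0 - acc) / acc
    log_odds += math.log(ratio)

posterior_a = 1.0 / (1.0 + math.exp(-log_odds))
print(round(posterior_a, 4))
```

Ten cheap, noisy comparisons already push the posterior close to certainty, which is why a cost-aware acquisition function can favor pairwise queries over expensive direct observations when queries are cheap.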

[LG-34] Learning plug-in surrogate endpoints for randomized experiments

Link: https://arxiv.org/abs/2605.12051
Authors: Alessandro-Umberto Margueritte,Ahmet Zahid Balcıoğlu,Jesse Krijthe,Dave Zachariah,Fredrik D. Johansson
Categories: Machine Learning (cs.LG)
*Note: 29 pages, 5 figures

Click to view abstract

Abstract:Surrogate endpoints are used in place of long-term outcomes in randomized experiments when observing the real outcome for a large enough cohort is prohibitively expensive or impractical. A short-term surrogate is good if the result of an experiment using the surrogate is predictive of the result of a hypothetical study using the real outcome. Much attention has been paid to formalizing this property in causal terms, but most criteria are unidentifiable and cannot be turned into practical algorithms for learning surrogate endpoints from data. To address this, we study plug-in composite surrogates, functions of post-treatment variables that may be substituted directly for the primary outcome in a randomized experiment. We propose two methods for learning plug-in surrogates that maximize effect predictiveness, and characterize the possibility of finding endpoints that yield unbiased effect estimates in representative scenarios. Finally, in both synthetic experiments with known effects and in data from a real-world experiment, we find that our method, based on directly modeling the surrogate effect, returns plug-in endpoints more predictive of the primary effect than established methods.

[LG-35] Approximation Theory of Laplacian-Based Neural Operators for Reaction-Diffusion System

Link: https://arxiv.org/abs/2605.12025
Authors: Takashi Furuya,Ryo Ozawa,Jenn-Nan Wang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Note:

Click to view abstract

Abstract:Neural operators provide a framework for learning solution operators of partial differential equations (PDEs), enabling efficient surrogate modeling for complex systems. While universal approximation results are now well understood, approximation analysis specific to nonlinear reaction-diffusion systems remains limited. In this paper, we study neural operators applied to the solution mapping from initial conditions to time-dependent solutions of a generalized Gierer-Meinhardt reaction-diffusion system, a prototypical model of nonlinear pattern formation. Our main results establish explicit approximation error bounds in terms of network depth, width, and spectral rank by exploiting the Laplacian spectral representation of the Green’s function underlying the PDE. We show that the required parameter complexity grows at most polynomially with respect to the target accuracy, demonstrating that Laplacian eigenfunction-based neural operator architectures alleviate the curse of parametric complexity encountered in generic operator learning. Numerical experiments on the Gierer-Meinhardt system support the theoretical findings.

[LG-36] Limits of Learning Linear Dynamics from Experiments

Link: https://arxiv.org/abs/2605.12010
Authors: Aybüke Ulusarslan,Niki Kilbertus,Nora Schneider
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Learning governing dynamics from data is a common goal across the sciences, yet it is only well-posed when the underlying mechanisms are identifiable. In practice, many data-driven methods implicitly assume identifiability; when this assumption fails, estimated models can yield spurious predictions and invalid mechanistic conclusions. Classical identifiability guarantees for controlled linear time-invariant (LTI) systems provide sufficient conditions – controllability and persistent excitation – but leave open whether identifiability holds when these conditions fail, and which parts of the system remain identifiable without full identifiability. We show that the experimental setup, i.e., the realized initial state and control input, dictates a fundamental limit on the information recoverable from the observed trajectory. We develop a geometric characterization of this limit and derive a closed-form description of all systems consistent with the experimental setup. Crucially, we prove that even when the full system is not identifiable, the restricted dynamics on the subspace reachable by the experiment remain uniquely determined.
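The classical sufficient condition referenced above can be checked in a few lines for a two-state system x_{t+1} = A x_t + B u_t (an illustrative check with made-up matrices, not the paper's geometric characterization): controllability is full rank of [B, AB], and when it fails, the input only ever excites a restricted subspace.

```python
def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def ctrb_rank(A, B):
    # Rank of the 2x2 controllability matrix [B, AB] via its determinant.
    AB = mat_vec(A, B)
    det = B[0] * AB[1] - B[1] * AB[0]
    if abs(det) > 1e-12:
        return 2
    return 1 if any(abs(x) > 1e-12 for x in B + AB) else 0

A = [[0.9, 0.2], [0.0, 0.5]]
rank_full = ctrb_rank(A, [1.0, 1.0])   # input excites both modes
rank_def = ctrb_rank(A, [1.0, 0.0])    # second state never reached
print(rank_full, rank_def)
```

In the rank-deficient case the experiment can only pin down the dynamics restricted to the reachable one-dimensional subspace, which is exactly the partial identifiability the paper characterizes.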

[LG-37] Estimating Subgraph Importance with Structural Prior Domain Knowledge

Link: https://arxiv.org/abs/2605.12009
Authors: Changhyun Kim,Seunghwan An,Jong-June Jeon
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:We propose a subgraph importance estimation method for pretrained Graph Neural Networks (GNNs) on graph-level tasks, formulated as a linear Group Lasso regression problem in the embedding space. Our method effectively leverages prior domain knowledge of graph substructures, while remaining independent of the specific form of the output layer or readout function used in the GNN architecture, and it does not require access to ground-truth target labels. Experiments on real-world graph datasets demonstrate that our method consistently outperforms existing baselines in subgraph importance estimation. Furthermore, we extend our method to identify important nodes within the graph.
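The Group Lasso mechanism behind the subgraph selection can be illustrated by its proximal operator (a standard building block of Group Lasso solvers, shown on made-up weights; the full regression setup is omitted): the whole group of coefficients for a substructure shrinks together and is zeroed out when its norm is small, which is what drops an unimportant subgraph.

```python
import math

def group_soft_threshold(w, lam):
    # Proximal operator of the Group Lasso penalty lam * ||w_g||_2:
    # shrinks the whole group and zeroes it when its norm <= lam.
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= lam:
        return [0.0] * len(w)
    return [(1.0 - lam / norm) * x for x in w]

strong = group_soft_threshold([3.0, 4.0], 1.0)   # norm 5 -> kept, shrunk by 0.8
weak = group_soft_threshold([0.3, 0.4], 1.0)     # norm 0.5 -> dropped entirely
print(strong, weak)
```

Applied per substructure group inside a proximal-gradient loop, this yields the sparse, all-or-nothing importance estimates that make the method interpretable.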

[LG-38] Split the Differences Pool the Rest: Provably Efficient Multi-Objective Imitation

Link: https://arxiv.org/abs/2605.12000
Authors: Ziyad Sheebaelhamd,Luca Viano,Volkan Cevher,Claire Vernade
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:This work investigates multi-objective imitation learning: the problem of recovering policies that lie on the Pareto front given demonstrations from multiple Pareto-optimal experts in a Multi-Objective Markov Decision Process (MOMDP). Standard imitation approaches are ill-equipped for this regime, as naively aggregating conflicting expert trajectories can result in dominated policies. To address this, we introduce Multi-Output Augmented Behavioral Cloning (MA-BC), an algorithm that systematically partitions divergent expert data while pooling state-action pairs where no behavior conflict is observed. Theoretically, we prove that MA-BC converges to Pareto-optimal policies at a faster statistical rate than any learner that considers each expert dataset independently. Furthermore, we establish a novel lower bound for multi-objective imitation learning, demonstrating that MA-BC is minimax optimal. Finally, we empirically validate our algorithm across diverse discrete environments and, guided by our theoretical insights, extend and evaluate MA-BC on a continuous Linear Quadratic Regulator (LQR) control task.

[LG-39] QDSB: Quantized Diffusion Schrödinger Bridges

Link: https://arxiv.org/abs/2605.11983
Authors: Tobias Fuchs,Florian Kalinke,Nadja Klein
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Note:

Click to view abstract

Abstract:Learning generative models in settings where the source and target distributions are only specified through unpaired samples is gaining in importance. Here, one frequently-used model are Schrödinger bridges (SB), which represent the most likely evolution between both endpoint distributions. To accelerate training, simulation-free SBs avoid the path simulation of the original SB models. However, learning simulation-free SBs requires paired data; a coupling of the source and target samples is obtained as the solution of the entropic optimal transport (OT) problem. As obtaining the optimal global coupling is infeasible in many practical cases, the entropic OT problem is iteratively solved on minibatches instead. Still, the repeated cost remains substantial and the locality can distort the global transport geometry. We propose quantized diffusion Schrödinger bridges (QDSB), which compute the endpoint coupling on anchor-quantized endpoint distributions and lift the resulting plan back to original data points through cell-wise sampling. We show that the regularized optimal coupling is stable w.r.t. anchor quantization, with an error controlled by the quality of the anchor approximation. In real-world experiments, QDSB matches the sample quality of existing baselines, requiring substantially less time. Code and data are available at this http URL.

[LG-40] Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning ICML2026

Link: https://arxiv.org/abs/2605.11975
Authors: Jingduo Pan,Taoran Wu,Yiling Xue,Bai Xue
Categories: Machine Learning (cs.LG)
*Note: Accepted at the Forty-third International Conference on Machine Learning (ICML 2026)

Click to view abstract

Abstract:We study stochastic minimum-cost reach-avoid reinforcement learning, where an agent must satisfy a reach-avoid specification with probability at least p while minimizing expected cumulative costs in stochastic environments. Existing safe and constrained reinforcement learning methods typically fail to jointly enforce probabilistic reach-avoid constraints and optimize cost in the learning setting in stochastic environments. To address this challenge, we introduce reach-avoid probability certificates (RAPCs), which identify states from which stochastic reach-avoid constraints are satisfiable. Building on RAPCs, we develop a contraction-based Bellman formulation that serves as a principled surrogate for integrating reach-avoid considerations into reinforcement learning, enabling cost optimization under probabilistic constraints. We establish almost sure convergence of the proposed algorithms to locally optimal policies with respect to the resulting objective. Experiments in the MuJoCo simulator demonstrate improved cost performance and consistently higher reach-avoid satisfaction rates.

[LG-41] Towards Order Fairness: Mitigating LLMs' Order Sensitivity through Dual Group Advantage Optimization

Link: https://arxiv.org/abs/2605.11974
Authors: Xu Chu,Guanyu Wang,Zhijie Tan,Xinrong Chen,Ziyu Li,Tong Mo,Weiping Li
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Large Language Models (LLMs) suffer from order bias, where their performance is affected by the arrangement order of input elements. This unfairness limits the model’s applications in scenarios such as in-context learning and Retrieval-Augmented Generation (RAG). Recent studies attempt to obtain optimal or suboptimal arrangements based on statistical results or using dataset-based search, but these methods increase inference overhead while leaving the model’s inherent order bias unresolved. Other studies mitigate order sensitivity through supervised fine-tuning using augmented training sets with multiple order variants, but often at the cost of accuracy, trapping the model in consistent yet incorrect hallucinations. In this paper, we propose Dual Group Advantage Optimization (DGAO), which aims to improve model accuracy and order stability simultaneously. DGAO calculates and balances intra-group relative accuracy advantage and inter-group relative stability advantage, rewarding the policy model for generating order-stable and correct outputs while penalizing order-sensitive or incorrect responses. This marks the first time reinforcement learning has been used to mitigate LLMs’ order sensitivity. We also propose two new metrics, Consistency Rate and Overconfidence Rate, to reveal the pseudo-stability of previous methods and guide more comprehensive evaluation. Extensive experiments demonstrate that DGAO achieves superior order fairness while improving performance on RAG, mathematical reasoning, and classification tasks. Our code is available at: this https URL.
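One plausible reading of the two proposed metrics can be sketched with a toy order-sensitive "model" (hedged: the paper's exact definitions may differ, and the model, options, and labels below are all made up): Consistency Rate as the fraction of items answered identically under every ordering of the input elements, and Overconfidence Rate as the fraction answered consistently but incorrectly.

```python
from itertools import permutations

def order_metrics(model, items):
    consistent = overconfident = 0
    for options, gold in items:
        answers = {model(list(p)) for p in permutations(options)}
        if len(answers) == 1:                 # same answer under all orders
            consistent += 1
            overconfident += answers != {gold}  # stable but wrong
    n = len(items)
    return consistent / n, overconfident / n

# Toy "model": picks the longest option, except it blindly returns the
# first option whenever the list happens to start with "zz".
def toy_model(options):
    return options[0] if options[0] == "zz" else max(options, key=len)

items = [
    (("alpha", "be", "ce"), "alpha"),       # stable and correct
    (("zz", "longword", "x"), "longword"),  # order-sensitive
    (("aaa", "b", "c"), "b"),               # stable but wrong: pseudo-stability
]
cr, oc = order_metrics(toy_model, items)
print(cr, oc)
```

The third item is exactly the pseudo-stability the abstract warns about: an accuracy-agnostic stability metric would count it as a success, while the Overconfidence Rate flags it.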

[LG-42] NOFE – Neural Operator Function Embedding

Link: https://arxiv.org/abs/2605.11970
Authors: Lars Uebbing,Harald L. Joakimsen,Siyan Chen,Georgios Leontidis,Kristoffer K. Wickstrøm,Michael C. Kampffmeyer,Sébastien Lefèvre,Arnt-Børre Salberg,Robert Jenssen
Categories: Machine Learning (cs.LG)
*Note: 21 pages, 11 figures, 12 tables

Abstract:Most dimensionality reduction methods treat data as discrete point clouds, ignoring the continuous domain structure inherent to many real-world processes. To bridge this gap, we introduce Neural Operator Function Embedding (NOFE), a domain-aware framework for continuous dimensionality reduction. NOFE learns function-to-function mappings via a Graph Kernel Operator, enabling mesh-free evaluation at arbitrary query locations independent of input discretization. We establish NOFE as approximation of sheaf-to-sheaf mappings, generalizing Sheaf Neural Networks to continuous domains. We evaluate NOFE across different datasets, comparing it against PCA, t-SNE, and UMAP. Our results demonstrate that NOFE significantly outperforms baselines in local structure preservation, achieving a local Stress of 0.111 compared to 0.398 for PCA, 0.773 for t-SNE, and 0.791 for UMAP for the ERA5 climate reanalysis dataset. NOFE also exhibits robust sampling independence, reducing the Patch Stitching Error by up to 20.0\times relative to UMAP (59.0 vs. 267.6 under regional normalization) and ensuring consistency across disjoint domain patches. While maintaining competitive global structure preservation (Stress-1: 0.379 vs. PCA’s 0.268), NOFE resolves fine-grained structures and produces smooth, consistent embeddings that generalize across varying sample densities, addressing key limitations of discrete reduction methods.

[LG-43] STAGE: Tackling Semantic Drift in Multimodal Federated Graph Learning

Link: https://arxiv.org/abs/2605.11919
Authors: Zekai Chen,Xun Wu,Xunkai Li,Yihan Sun,Rong-Hua Li,Guoren Wang
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Federated graph learning (FGL) enables collaborative training on graph data across multiple clients. As graph data increasingly contain multimodal node attributes such as text and images, multimodal federated graph learning (MM-FGL) has become an important yet substantially harder setting. The key challenge is that clients from different modality domains may not share a common semantic space: even for the same concept, their local encoders can produce inconsistent representations before collaboration begins. This makes direct parameter coordination unreliable and further causes two downstream problems: forcing heterogeneous client representations into a naively shared semantic space may create false semantic agreement, and graph message passing may amplify residual inconsistency across neighborhoods. To address this issue, we propose STAGE, a protocol-first framework for MM-FGL. Instead of relying on direct parameter averaging, STAGE builds a shared semantic space that first translates heterogeneous multimodal features into comparable representations and then regulates how these representations propagate over local graph structures. In this way, STAGE not only improves cross-client semantic calibration, but also reduces the risk of inconsistency amplification during graph learning. Extensive experiments on 8 multimodal-attributed graphs across 5 graph-centric and modality-centric tasks show that STAGE consistently achieves state-of-the-art performance while reducing per-round communication payload.

[LG-44] Understanding Sample Efficiency in Predictive Coding

Link: https://arxiv.org/abs/2605.11911
Authors: Gaspard Oliviers,Elene Lominadze,Rafal Bogacz
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Predictive Coding (PC) is an influential account of cortical learning. Much of recent work has focused on comparing PC to Backpropagation (BP) to find whether PC offers any advantages. Small scale experiments show that PC enables learning that is more sample efficient and effective in many contexts, though a thorough theoretical understanding of the phenomena remains elusive. To address this, we quantify the efficiency of learning in BP and PC through a metric called "target alignment", which measures how closely the change in the output of the network is aligned to the output prediction error. We then derive and empirically validate analytical expressions for target alignment in Deep Linear Networks. We show that learning in PC is more efficient than BP, which is especially pronounced in deep, narrow and pre-trained networks. We also derive exact conditions for guaranteed optimal target alignment in PC and validate our findings through experiments. We study full training trajectories of linear and non-linear models, and find the predicted benefits of PC persist in practice even when some assumptions are violated. Overall, this work provides a mechanistic understanding of the higher learning efficiency observed for PC over BP in previous works, and can guide how PC should be parametrised to learn most effectively.
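The alignment metric can be made concrete with a toy sketch: for a two-layer deep linear network under squared loss (an illustrative setup; the paper's exact definition may differ), target alignment is the cosine between the output change after one BP step and the output prediction error.

```python
import numpy as np

def target_alignment(W1, W2, x, y, lr=1e-3):
    """Cosine similarity between the change in network output after one
    backprop step and the output prediction error; values near 1 mean the
    update moves the output straight toward the target. Toy two-layer
    deep linear network under squared loss."""
    f = W2 @ (W1 @ x)
    err = y - f                              # output prediction error
    # gradients of 0.5 * ||y - f||^2 w.r.t. each weight matrix
    g2 = -np.outer(err, W1 @ x)
    g1 = -np.outer(W2.T @ err, x)
    W1n, W2n = W1 - lr * g1, W2 - lr * g2
    delta = W2n @ (W1n @ x) - f              # change in output
    return float(delta @ err / (np.linalg.norm(delta) * np.linalg.norm(err)))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, y = rng.normal(size=3), rng.normal(size=2)
a = target_alignment(W1, W2, x, y)
```

For gradient descent the alignment is always positive (to first order the output moves toward the target), but it is generally well below 1, which is what the paper's analysis quantifies.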

[LG-45] Delightful Gradients Accelerate Corner Escape

链接: https://arxiv.org/abs/2605.11908
作者: Jincheng Mei,Ian Osband
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Softmax policy gradient converges at O(1/t) , but its transient behavior near sub-optimal corners of the simplex can be exponentially slow. The bottleneck is self-trapping: negative-advantage actions reinforce the corner policy and can initially push the optimal action backward. We study Delightful Policy Gradient (DG), which gates each policy-gradient term by the product of advantage and action surprisal. For K -armed bandits, we prove that the zero-temperature limit of DG removes this corner-trapping mechanism on a quantitative sector near any sub-optimal corner, yielding a first-exit escape bound logarithmic in the initial probability ratio. At every fixed temperature, the same local mechanism persists because harmful actions are polynomially suppressed as they become rare. A key structural insight is that every action better than the corner action is an ally: its contribution to escape is non-negative. Combining corner instability with a monotonic value improvement identity, we prove that DG converges globally to the optimal policy in both bandits and tabular MDPs at an asymptotic O(1/t) rate. We also show, via an exact counterexample, that this tabular mechanism can fail under shared function approximation. In MNIST contextual bandits with a shared-parameter neural network, DG nevertheless recovers from bad initializations faster than standard policy gradient, suggesting that the counterexample marks a boundary of the theory rather than a practical prohibition.
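The gating idea can be sketched on a two-armed bandit with exact gradients; the gate below (advantage times surprisal) follows the abstract's description, not necessarily the paper's exact parameterization. Starting near the sub-optimal corner, the surprisal factor amplifies the rare optimal action's update and speeds up escape.

```python
import numpy as np

def dg_step(theta, rewards, lr=0.5):
    """One exact-gradient step of a surprisal-gated softmax policy
    gradient on a K-armed bandit. The gate (advantage x surprisal)
    sketches the DG idea described in the abstract."""
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    adv = rewards - pi @ rewards        # per-arm advantage
    gate = adv * (-np.log(pi))          # advantage times action surprisal
    return theta + lr * pi * gate       # gated softmax-PG update

rewards = np.array([0.0, 1.0])          # arm 1 is optimal
theta = np.array([5.0, 0.0])            # initialize near the bad corner
p0 = 1.0 / (1.0 + np.exp(theta[0] - theta[1]))   # P(optimal arm) before
for _ in range(200):
    theta = dg_step(theta, rewards)
p1 = 1.0 / (1.0 + np.exp(theta[0] - theta[1]))   # P(optimal arm) after
```

With plain softmax PG the escape from this corner takes substantially longer, since the optimal arm's update lacks the surprisal factor of roughly -log(pi) ≈ 5 at initialization.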

[LG-46] Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models

链接: https://arxiv.org/abs/2605.11907
作者: Igor Strozzi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. Main finding: under matched-path LLM-only scoring, the SFT-attributable procedural- \Delta lift is roughly uniform across sizes: +0.070 / +0.040 / +0.075 at 0.8B / 2B / 4B. Variation in post-SFT \Delta ( -0.005 , +0.100 , +0.065 ) is dominated by a W-shaped pre-SFT base trajectory ( -0.075 , +0.060 , -0.010 , Haiku 4.5 at +0.030 ): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure, a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. Methodology: (i) a bench format-compliance artifact: 83.5% of the holdout uses a deterministic ANSWER-line extractor that under-counts free-form conclusions; an LLM-only re-judge reveals it was systematically biased against \CU. (ii) a negative-iteration sequence at 0.8B: five recipe variants cluster post-SFT \CU pass-rate within a 2 pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe. Cross-family validation: GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen's \kappa \geq 0.754 , agreement \geq 93.25% . Earlier "format-only at 0.8B" and "shrinking SFT at 4B" framings were path-mismatch artifacts; this paper supersedes both (see the paper's appendix on path mismatch). Single-seed; threats to validity are discussed in the paper's threats section.
Submission history: From Igor Strozzi [v1] Tue, 12 May 2026 10:19:33 UTC (73 KB)

[LG-47] Sobolev Regularized MMD Gradient Flow

链接: https://arxiv.org/abs/2605.11884
作者: Chenyang Tian,Bharath K. Sriperumbudur,Arthur Gretton,Zonghao Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose Sobolev-regularized Maximum Mean Discrepancy (SrMMD) gradient flow, a regularized variant of maximum mean discrepancy (MMD) gradient flow based on a gradient penalty on the witness function. The proposed regularization mitigates the non-convexity of the MMD objective and yields provable global convergence guarantees in MMD in both continuous and discrete time. A more surprising appeal is that our convergence analysis does not rely on isoperimetric assumptions on the target distribution. Instead, it is based on a regularity condition on the difference between kernel mean embeddings. A key highlight of the proposed flow is that it is applicable in both sampling (from an unnormalized target distribution) – using Stein kernels – and generative modeling settings, unlike previous works, where a gradient flow is suitable for only generative modeling or sampling but not both. The effectiveness of the proposed flow is empirically verified on a broad range of tasks in both generative modelling and sampling.
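As a point of reference, the plain (unregularized) MMD gradient flow that SrMMD modifies can be sketched with an RBF kernel: each particle descends the empirical witness function of MMD(X, Y). The Sobolev gradient penalty on the witness is omitted here.

```python
import numpy as np

def rbf(a, b, s=1.0):
    """Gaussian kernel matrix between two point clouds."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s**2))

def mmd2(X, Y, s=1.0):
    """Biased empirical squared MMD."""
    return rbf(X, X, s).mean() - 2 * rbf(X, Y, s).mean() + rbf(Y, Y, s).mean()

def mmd_flow_step(X, Y, lr=0.2, s=1.0):
    """Euler step of the plain MMD gradient flow: each particle in X
    moves along minus the gradient of the witness function of MMD(X, Y)."""
    Kxx, Kxy = rbf(X, X, s), rbf(X, Y, s)
    gx = ((X[:, None, :] - X[None, :, :]) * Kxx[:, :, None]).mean(1)
    gy = ((X[:, None, :] - Y[None, :, :]) * Kxy[:, :, None]).mean(1)
    return X + lr * (gx - gy) / s**2

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                 # source particles
Y = rng.normal(size=(30, 2)) + 1.5           # target particles, shifted
m0 = mmd2(X, Y)
for _ in range(20):
    X = mmd_flow_step(X, Y)
m1 = mmd2(X, Y)                              # MMD^2 after the flow
```

The gy term attracts particles toward the target sample while the gx term is a self-repulsion that spreads them out; the non-convexity the paper addresses shows up when the attraction vanishes for far-apart modes.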

[LG-48] LOFT: Low-Rank Orthogonal Fine-Tuning via Task-Aware Support Selection

链接: https://arxiv.org/abs/2605.11872
作者: Lanxin Zhao,Bamdev Mishra,Pratik Jawanpuria,Lequan Lin,Dai Shi,Junbin Gao,Andi Han
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Orthogonal parameter-efficient fine-tuning (PEFT) adapts pretrained weights through structure-preserving multiplicative transformations, but existing methods often conflate two distinct design choices: the subspace in which adaptation occurs and the transformation applied within that subspace. This paper introduces LOFT, a low-rank orthogonal fine-tuning framework that explicitly separates these two components. By viewing orthogonal adaptation as a multiplicative subspace rotation, LOFT provides a unified formulation that recovers representative orthogonal PEFT methods, including coordinate-, butterfly-, Householder-, and principal-subspace-based variants. More importantly, this perspective exposes support selection as a central design axis rather than a byproduct of a particular parameterization. We develop a first-order analysis showing that useful adaptation supports should be informed by the downstream training signal, motivating practical task-aware support selection strategies. Across language understanding, visual transfer, mathematical reasoning, and multilingual out-of-distribution adaptation, LOFT recovers principal-subspace orthogonal adaptation while gradient-informed supports improve the efficiency-performance trade-off under matched parameter, memory, and compute budgets. These results suggest that principled support selection is an important direction for improving orthogonal PEFT.
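One natural parameterization consistent with "multiplicative subspace rotation" (an illustrative assumption, not necessarily LOFT's exact form) rotates within a rank-r support U and acts as the identity on its complement:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 3
W = rng.normal(size=(d, d))                 # frozen pretrained weight

# orthonormal support U (d x r) and an orthogonal rotation Q in O(r)
U, _ = np.linalg.qr(rng.normal(size=(d, r)))
Q, _ = np.linalg.qr(rng.normal(size=(r, r)))

# R rotates only within span(U) and is the identity on its complement;
# R = I + U (Q - I) U^T is itself orthogonal whenever Q is
R = np.eye(d) + U @ (Q - np.eye(r)) @ U.T
W_adapted = R @ W

orth_err = np.abs(R @ R.T - np.eye(d)).max()
P = np.eye(d) - U @ U.T                     # projector onto the complement
```

The choice of U is the "support selection" axis the paper isolates: here U is random, whereas LOFT argues it should be informed by the downstream training signal (e.g., gradient-informed).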

[LG-49] Information theoretic underpinning of self-supervised learning by clustering

链接: https://arxiv.org/abs/2605.11870
作者: Josef Kittler,Sara Atito,Muhammad Awais
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) is recognized as an essential tool for building foundation models for Artificial Intelligence applications. The advances in SSL have been made thanks to vigorous arguments about the principles of SSL and through extensive empirical research. The aim of this paper is to contribute to the development of the underpinning theory of SSL, focusing on the deep clustering approach. By analogy to supervised learning, we formulate SSL as K-L divergence optimization. The mode collapse is prevented by imposing an optimisation constraint on the teacher distribution. This leads to normalization using inverse cluster priors. We show that using Jensen inequality this normalization simplifies to the popular batch centering procedure. Distillation and centering are common heuristics-based practices in SSL, but our work underpins them theoretically. The theoretical model developed not only supports specific existing successful SSL methods, but also suggests directions for future investigations.
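The batch-centering procedure that the analysis recovers can be sketched DINO-style (the temperatures and EMA momentum below are illustrative choices, not values from the paper):

```python
import numpy as np

def softmax(z, t):
    """Temperature-scaled softmax along the last axis."""
    z = (z - z.max(-1, keepdims=True)) / t
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, center, ts=0.1, tt=0.04):
    """Cross-entropy from a centered, sharpened teacher to the student.
    Subtracting the running center is the batch-centering step the paper
    derives from normalizing by inverse cluster priors."""
    p_t = softmax(teacher_logits - center, tt)
    p_s = softmax(student_logits, ts)
    return float(-(p_t * np.log(p_s + 1e-12)).sum(-1).mean())

def update_center(center, teacher_logits, m=0.9):
    """EMA of teacher logits over the batch, preventing mode collapse."""
    return m * center + (1 - m) * teacher_logits.mean(0)

rng = np.random.default_rng(0)
t_logits = rng.normal(size=(16, 10))   # teacher logits for a batch
s_logits = rng.normal(size=(16, 10))   # student logits for the same batch
center = np.zeros(10)
loss = distill_loss(s_logits, t_logits, center)
center = update_center(center, t_logits)
```

Without the centering term, the teacher distribution can collapse onto a single cluster; the paper's contribution is to show this heuristic falls out of a constrained KL-divergence formulation.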

[LG-50] Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLM s

链接: https://arxiv.org/abs/2605.11857
作者: Amr Abourayya,Jens Kleesiek,Michael Kamp
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated fine-tuning of large language models is commonly formulated as a parameter aggregation problem. However, even parameter-efficient methods require transmitting large collections of trainable weights, assume aligned architectures, and rely on white-box access to model parameters. As model sizes continue to grow and deployments become increasingly heterogeneous, these assumptions become progressively misaligned with practical constraints. We consider an alternative formulation in which collaboration is mediated through model behavior rather than parameters. Clients fine-tune local models on private data and exchange generated outputs on a shared, public prompt set. The server maps these outputs into a semantic representation space, forms a per-prompt semantic consensus, and returns pseudo-labels for further local fine-tuning. This formulation fundamentally changes the communication scaling of federated LLM fine-tuning. The amount of information exchanged depends only on the public prompt budget and the size of the communicated behaviors, independent of model size. As a consequence, the protocol naturally accommodates heterogeneous architectures and applies directly to open-ended text generation. We present a theoretical analysis and empirical results demonstrating that this approach can match strong federated fine-tuning baselines while substantially reducing communication by orders of magnitude (e.g., analytically by a factor of 1006 for Llama3.1-405B), as well as reductions in runtime and energy consumption. These results suggest that, for generative foundation models, behavior-level consensus provides a more appropriate abstraction for federated adaptation than parameter aggregation.
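A toy stand-in for the per-prompt consensus step (the paper's server-side semantic mapping is richer) picks the client output whose embedding lies closest to the batch mean:

```python
import numpy as np

def semantic_consensus(embeddings):
    """Return the index of the client output whose embedding is closest
    to the mean embedding for one prompt (a medoid-style toy consensus;
    real systems would embed text with a sentence encoder)."""
    mean = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - mean, axis=1)
    return int(dists.argmin())

# three clients answer one public prompt; client 2 is a semantic outlier
emb = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [-1.0, 0.0]])
winner = semantic_consensus(emb)   # consensus sides with the majority
```

The selected output would then be returned as a pseudo-label for further local fine-tuning; note the payload per round scales with the prompt budget, not the model size, which is the formulation's main communication argument.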

[LG-51] A Fast and Energy-Efficient Latch-Based Memristive Analog Content-Addressable Memory

链接: https://arxiv.org/abs/2605.11847
作者: Paul-Philipp Manea,Aishwarya Natarajan,Jim Ignowski,John Paul Strachan,Luca Buonanno
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Analog content-addressable memories (aCAMs) based on memristors provide a promising pathway toward energy-efficient large-scale associative computing for Edge AI and embedded intelligence applications. They have been successfully applied to decision-tree inference and extend the capabilities of compute-in-memory (CIM) architectures beyond conventional vector-matrix multiplication. However, conventional designs such as the 6T2M architecture suffer from static search power, limited voltage gain, and pronounced match-line crosstalk, constraining analog precision and scalability. We introduce a strong-arm latched memristor (SALM) aCAM cell that replaces static voltage division with a dynamic current-race comparator, enabling high regenerative gain, intrinsic result latching, and near-zero static search power. Compared to 6T2M, SALM reduces read energy by 33% at identical latency while eliminating the gain and crosstalk limitations that prevent 6T2M from scaling to large arrays. SALM further enables scalable sequential and parallel latch sharing, and a dataset-aware optimization framework exposes an explicit energy-latency tradeoff, achieving up to 50% energy reduction at 3x latency across representative workloads. To enable architectural exploration, we develop a circuit-accurate behavioral model derived from SPICE lookup tables in 22 nm FD-SOI technology, capturing match-line dynamics and crosstalk. Integrated into the X-TIME decision-tree compiler, this framework demonstrates that SALM maintains near-software accuracy for high-dimensional datasets, whereas baseline designs degrade due to limited gain and cumulative crosstalk.

[LG-52] Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

链接: https://arxiv.org/abs/2605.11838
作者: Alexander Yukhimchuk,Mladen Kolar,Martin Takáč,Sayantan Choudhury
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Gradient clipping is a standard safeguard for training neural networks under noisy, heavy-tailed stochastic gradients; yet, most clipping rules treat all parameters as vectors and ignore the matrix structure of modern architectures. We show empirically that data outliers often amplify only a small number of leading singular values in layer-wise gradient matrices, while the rest of the spectrum remains largely unchanged. Motivated by this phenomenon, we propose spectral clipping, which stabilizes training by clamping singular values that exceed a threshold while preserving the singular directions. This framework generalizes classical gradient norm clipping and can be easily integrated into existing optimizers. We provide a convergence analysis for non-convex optimization with spectrally clipped SGD, yielding the optimal \mathcalO\left(K^\frac2 - 2\alpha3\alpha - 2\right) rate for heavy-tailed noise. To minimize hyperparameter tuning, we introduce layer-wise adaptive thresholds based on moving averages or sliding-window quantiles of the top singular values. Finally, we develop efficient implementations that clip only the top r singular values via randomized truncated SVD, avoiding full decompositions for large layers. We demonstrate competitive performance across synthetic heavy-tailed settings and neural network training tasks.
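The core operation can be sketched directly (a full SVD for clarity; the paper's efficient variant clips only the top r singular values via randomized truncated SVD):

```python
import numpy as np

def spectral_clip(G, tau):
    """Clamp the singular values of a gradient matrix at tau while
    preserving the singular directions. With tau above the largest
    singular value this is a no-op, so it generalizes norm clipping."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U * np.minimum(s, tau)) @ Vt   # U @ diag(min(s, tau)) @ Vt

rng = np.random.default_rng(0)
G = rng.normal(size=(6, 4))
G[:, 0] += 50 * rng.normal(size=6)   # an outlier inflates the top singular value
C = spectral_clip(G, tau=3.0)
```

This matches the empirical observation in the abstract: an outlier tends to blow up only a few leading singular values, so clamping them suppresses the outlier's effect while leaving the rest of the spectrum intact.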

[LG-53] Fed-BAC: Federated Bandit-Guided Additive Clustering in Hierarchical Federated Learning

链接: https://arxiv.org/abs/2605.11815
作者: Satwat Bashir,Tasos Dagiuklas,Muddesar Iqbal
类目: Machine Learning (cs.LG)
*备注: 9 pages, 5 figures. Accepted at the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS 2026), Valencia, Spain, June 9-12, 2026. To appear in IEEE proceedings

点击查看摘要

Abstract:Hierarchical federated learning (HFL) leverages edge servers for partial aggregation in edge computing. Yet existing FL methods lack mechanisms for jointly optimizing cluster assignment and client selection under data heterogeneity. This paper proposes Fed-BAC, which integrates additive cluster personalization with a two-level bandit framework: contextual bandits at the cloud learn server-to-cluster assignments, while Thompson Sampling at each edge server identifies high-contributing clients. The additive decomposition enables the sharing of knowledge between groups through a globally aggregated network, while cluster-specific networks capture distribution variations. Across three classification benchmarks (CIFAR-10, SVHN, Fashion-MNIST) under moderate ( \alpha = 0.5 ) and severe ( \alpha = 0.1 ) Dirichlet non-IID partitioning, Fed-BAC achieves distributed accuracy gains of up to +35.5pp over HierFAVG and +8.4pp over IFCA, while requiring only 80% client participation, converging 1.5 to 4.8 \times faster depending on dataset and accuracy target, and improving cross-server fairness. These gains are further validated at 5 \times deployment scale on CIFAR-10. The advantage of Fed-BAC increases with heterogeneity severity, confirming that additive cluster personalization becomes increasingly valuable as data distributions diverge.
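The edge-server selection step can be sketched as Beta-Bernoulli Thompson Sampling over each client's contribution probability (an illustrative model; the paper's contribution signal and priors may differ):

```python
import numpy as np

def thompson_select(successes, failures, k, rng):
    """Select k clients by sampling each client's contribution
    probability from its Beta(successes+1, failures+1) posterior and
    taking the k largest draws."""
    draws = rng.beta(successes + 1.0, failures + 1.0)
    return set(np.argsort(-draws)[:k])

rng = np.random.default_rng(0)
# client 0 has contributed consistently; the rest have not
successes = np.array([50.0, 0.0, 0.0, 0.0, 0.0])
failures  = np.array([0.0, 50.0, 50.0, 50.0, 50.0])
chosen = thompson_select(successes, failures, k=2, rng=rng)
```

Because the posteriors stay stochastic, under-explored clients still get selected occasionally, which is what lets the method operate at 80% client participation without starving any cluster.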

[LG-54] Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling

链接: https://arxiv.org/abs/2605.11764
作者: Thor Klamt,Wolfgang Nejdl,Ming Tang
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 32 pages, 11 figures, 11 tables. Dataset: this https URL (CC-BY-4.0). Code: this https URL (MIT)

点击查看摘要

Abstract:Machine-learning predictors of biochemical activity often exhibit large random-split-to-leave-one-target-out generalisation gaps that have been documented but not decomposed. We frame this as an evaluation-science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis-targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials; published predictors report AUROC of 0.85 to 0.91 under random-split cross-validation, while the leave-one-target-out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within-target interpolation, whereas LOTO measures the novel-target prediction that de-novo design depends on. We decompose this gap and identify inter-laboratory measurement variance as the dominant component, anchored by a within-target cross-laboratory cascade bounding the inter-laboratory contribution at 0.124 AUROC, well above the 0.05 contribution from binarisation-threshold choice. Across eight published architectures and ESM-2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, with a comparable plateau under SMILES-level deduplication; a 21-dimensional 2000-trial hyperparameter optimisation cannot break this ceiling, and the rank-1 single-seed configuration regresses by 0.161 AUROC under multi-seed validation, matching a closed-form selection-bias prediction (Bailey and Lopez de Prado, 2014). Few-shot k=5 stratified per-target retraining combined with ADMET features lifts 65-target LOTO AUROC from 0.668 to 0.7050, and post-hoc Platt scaling recovers raw output to within the 0.05 well-calibrated threshold. We release PROTAC-Bench (10,748 measurements, 173 targets, 65 LOTO folds), the variance-decomposition framework, the per-target calibration protocol, and the evaluation code.
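The LOTO protocol itself is simple to sketch: every measurement of one target is held out per fold, so test targets are never seen in training (the target names below are placeholders for illustration):

```python
def loto_splits(targets):
    """Leave-one-target-out folds: each fold holds out every measurement
    of one protein target, so evaluation measures novel-target
    generalization rather than within-target interpolation."""
    for held in sorted(set(targets)):
        train = [i for i, t in enumerate(targets) if t != held]
        test = [i for i, t in enumerate(targets) if t == held]
        yield held, train, test

# one target label per measurement (hypothetical example)
targets = ["BRD4", "BRD4", "BTK", "AR", "BTK"]
folds = list(loto_splits(targets))
```

A random split would scatter BRD4 measurements across train and test, rewarding within-target interpolation; LOTO removes that leakage, which is why reported AUROC drops from around 0.85-0.91 to roughly 0.67.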

[LG-55] A nonlinear extension of parametric model embedding for dimensionality reduction in parametric shape design

链接: https://arxiv.org/abs/2605.11759
作者: Andrea Serani,Giorgio Palma,Matteo Diez
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Dimensionality reduction is essential in simulation-based shape design, where high-dimensional parameterizations hinder optimization, surrogate modeling, and systematic design-space exploration. Parametric Model Embedding (PME) addresses this issue by constructing reduced variables from geometric information while preserving an explicit backmapping to the original design parameters. However, PME is intrinsically linear and may become inefficient when the sampled design space is governed by nonlinear geometric variability. This paper introduces a nonlinear extension of PME, denoted NLPME. The proposed framework preserves the defining principle of PME – geometry-driven latent variables and parameter-mediated reconstruction – while replacing the linear reduced subspace with a nonlinear latent representation. Geometry is not reconstructed directly from the latent variables; instead, the latent representation is decoded into admissible design parameters, and the corresponding geometry is recovered through a forward parametric map. The method is assessed on a bio-inspired autonomous underwater glider with a 32-dimensional parametric shape description and a CAD-based geometry-generation process. NLPME reaches a 5% reconstruction-error threshold with (N=5) latent variables, compared with (N=8) for linear PME, and a 1% threshold with (N=9), compared with (N=15) for PME. Comparison with a deep autoencoder shows that most of the nonlinear compression gain can be retained while preserving an explicit backmapping to the original design variables. The results establish NLPME as a compact, admissible, and engineering-compatible nonlinear reduced representation for parametric shape design spaces.

[LG-56] Federated Client Selection under Partial Visibility: A POMDP Approach with Spatio-Temporal Attention

链接: https://arxiv.org/abs/2605.11752
作者: Qijun Hou,Yuchen Shi,Pingyi Fan,Khaled B. Letaief
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning relies on effective client selection to alleviate the performance degradation caused by data heterogeneity. Most existing methods assume full visibility of all clients at each communication round. However, in large-scale or edge-based deployments, the server can only access a subset of clients due to communication, mobility, or availability constraints, resulting in partial visibility where only a subset of clients is observable for aggregation in each communication round. In this paper, we formulate federated client selection under partial visibility as a Partially Observable Markov Decision Process (POMDP) and propose a Spatial-Temporal attention-based reinforcement learning framework. By integrating historical global models and client identity embeddings, the proposed method captures both the temporal contexts of training and the persistent characteristics of clients. Experimental results across multiple datasets demonstrate that our approach achieves superior performance compared to existing baselines in heterogeneous and partially visible settings, validating its effectiveness in addressing the challenges of incomplete observations in practical federated learning systems.

[LG-57] Learning Feature Encoder with Synthetic Anomalies for Weakly Supervised Graph Anomaly Detection

链接: https://arxiv.org/abs/2605.11749
作者: Yingjie Zhou,Yuqin Xie,Fanxing Liu,Dongjin Song,Ce Zhu,Lingqiao Liu
类目: Machine Learning (cs.LG)
*备注: 14 pages, 7 figures, published by IEEE Transactions on Knowledge and Data Engineering,2026

点击查看摘要

Abstract:Weakly supervised graph anomaly detection aims to unveil unusual graph instances, e.g., nodes, whose behaviors significantly differ from normal ones, given only a limited number of annotated anomalies and abundant unlabeled samples. A major challenge is to learn a meaningful latent feature representation that reduces intra-class variance among normal data while remaining highly sensitive to anomalies. Although recent works have applied self-supervised feature learning for graph anomaly detection, their strategies are not specifically tailored to its unique requirements, motivating our exploration of a more domain-specific approach. In this paper, we introduce a weakly supervised graph anomaly detection method that leverages a feature learning strategy tailored for graph anomalies. Our approach is built upon a multi-task learning scheme that extracts robust feature representations through synthesized anomalies. We generate synthetic anomalies by perturbing the normal graph in various ways and assign a dedicated detection head to each anomaly type, ensuring that learned features are sensitive to potential deviations from normal patterns. Although synthetic anomalies may not perfectly replicate real-world patterns, they provide valuable auxiliary data for effective feature learning, much like features learned from ImageNet classification transfer to downstream vision tasks. Additionally, we adopt a two-phase learning strategy: an initial warm-up phase using only synthetic samples, followed by a full-training phase integrating both tasks, to balance the influence of synthetic and real data. Extensive experiments on public datasets demonstrate the superior performance of our method over its competitors. Code is available at this https URL.

[LG-58] Online Continual Learning with Dynamic Label Hierarchies ICML2026

链接: https://arxiv.org/abs/2605.11742
作者: Xinrui Wang,Shao-Yuan Li,Bartłomiej Twardowski,Alexandra Gomez-Villa,Songcan Chen
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML2026

点击查看摘要

Abstract:Online Continual Learning (OCL) aims to learn from endless non-stationary data streams, yet most existing methods assume a flat label space and overlook the hierarchical organization of real-world concepts that evolves both horizontally (sibling classes) and vertically (coarse or fine categories). To better reflect this context, we introduce a new problem setting, DHOCL (Online Continual Learning from Dynamic Hierarchies), where taxonomies evolve across granularities and each sample provides supervision at a single hierarchical level. In this setting, we find two fundamental issues: (i) partial supervision under mixed granularities provides only point-wise signals over an evolving path-wise hierarchy, which constrains plasticity and undermines cross-level semantic consistency, and (ii) the dynamically evolving hierarchies induce granularity-dependent interference, destabilizing popular replay and regularization mechanisms and thereby exacerbating catastrophic forgetting. To tackle these issues, we propose HALO (Hierarchical Adaptive Learning with Organized Prototypes), which adaptively combines complementary classification heads, regularized by organized learnable hierarchical prototypes, enabling rapid adaptation, hierarchical consistency, and structured knowledge consolidation as the taxonomy evolves. Extensive experiments on multiple benchmarks demonstrate that HALO consistently outperforms existing methods across hierarchical accuracy, mistake severity, and continual performance.

[LG-59] U-STS-LLM A Unified Spatio-Temporal Steered Large Language Model for Traffic Prediction and Imputation

链接: https://arxiv.org/abs/2605.11735
作者: Yichen Zhang,Jun Li
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:The efficient operation of modern cellular networks hinges on the accurate analysis of spatio-temporal traffic data. Mastering these patterns is essential for core network functions, chiefly forecasting future load to pre-empt congestion and imputing missing values caused by sensor failures or transmission errors to ensure data continuity. While deeply connected, forecasting and imputation have historically evolved as separate sub-fields. The dominant paradigm, Spatio-Temporal Graph Neural Networks (STGNNs), while effective, are often specialized, computationally intensive, and exhibit limited generalization. Concurrently, adapting large pre-trained language models (LLMs) offers a powerful alternative for sequence modeling, yet existing approaches provide weak structural guidance, leading to unstable convergence and a narrow focus on forecasting. To bridge these gaps, we propose U-STS-LLM, a unified framework built on a spatio-temporally steered LLM. Our core innovation is a Dynamic Spatio-Temporal Attention Bias Generator that synthesizes a persistent functional graph with transient nodal states to explicitly steer the LLM’s attention. Coupled with a partially frozen backbone tuned via Low-Rank Adaptation (LoRA) and a Gated Adaptive Fusion mechanism, the model achieves stable, parameter-efficient adaptation. Trained under a unified multi-task objective, U-STS-LLM learns a holistic data representation. Extensive experiments on real-world cellular datasets demonstrate that U-STS-LLM establishes new state-of-the-art performance in both long-horizon forecasting and high-missing-rate imputation, while maintaining remarkable training efficiency and stability, offering a novel blueprint for harnessing foundation models in structured, non-linguistic domains.

[LG-60] Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation

链接: https://arxiv.org/abs/2605.11730
作者: Cristian Morasso,Anisa Halimi,Muhammad Zaid Hameed,Douglas Leith
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Automated red-teaming for LLMs often discovers narrow attack slices, missing diverse real-world threats, and yielding insufficient data for safety fine-tuning. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on diverse attacker personas (e.g., doctors, students, malicious actors) and strategy sets to explore realistic attack scenarios. By running parallel persona-conditioned searches, PCAP discovers transferable jailbreaks across different contexts and generates rich defense datasets with automatic metadata tracking. On GPT-OSS 120B, PCAP increases attack success from 57% to 97% while producing 2-6 \times more diverse prompts covering varied real-world scenarios. Critically, fine-tuning lightweight adapters on PCAP-generated data significantly improves model robustness (recall: 0.36 \rightarrow 0.99, F1: 0.53 \rightarrow 0.96) with minimal false positives, demonstrating a practical closed-loop approach from vulnerability discovery to automated alignment.

[LG-61] Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models NEURIPS2026

Link: https://arxiv.org/abs/2605.11726
Authors: Yan Jiang,Ruihong Qiu,Zi Huang
Subjects: Machine Learning (cs.LG)
*Comments: NeurIPS 2026 Preprint

Abstract:Recently, reinforcement learning (RL) has been widely applied during post-training for diffusion large language models (dLLMs) to enhance reasoning with block-wise semi-autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the parallel decoding granularity and affects the rollout trajectories during RL optimisation, e.g., GRPO. Instead of investigating the effect of block size during inference on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post-training in multi-domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi-domain RL for dLLMs, which will largely affect the post-training effectiveness for rollout-based RL methods; (2) a novel dataset, Block-R1-41K is constructed with a best-improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure the domain conflict; (3) a new benchmark, Block-R1, for flexible RL post-training for dLLMs in both single and cross domain; and (4) a simple yet powerful cross-domain post-training method with sample-level best-improved training block sizes. Extensive experiments on 13 distinct datasets, 7 latest RL algorithms, and various different dLLM backbones are covered in Block-R1. The benchmark is open-sourced at this https URL, with the dataset released at this https URL.

[LG-62] GRAFT: Graph-Tokenized LLMs for Tool Planning

Link: https://arxiv.org/abs/2605.11706
Authors: Xinyi Gao,Xinyu Ren,Junliang Yu,Tong Chen,Quoc Viet Hung Nguyen,Hongzhi Yin
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Large language models (LLMs) are increasingly used to complete complex tasks by selecting and coordinating external tools across multiple steps. This requires aligning tool choices with subtask intent while satisfying directional execution dependencies among tools. To do this, existing methods model these dependencies as tool graphs and incorporate the graphs with LLMs through retrieval, serialization, or prompt-level injection. However, these external graph-use strategies all follow a matching paradigm, which often fails to align tool choices with the underlying subtask structure, producing semantically plausible plans that violate graph constraints. This issue is further exacerbated by error accumulation, where an early incorrect tool selection shifts the plan into an invalid graph state and causes subsequent predictions to drift away from the valid execution path. To address these challenges, we propose GRAFT, a graph-tokenized language model framework for dependency-aware tool planning. GRAFT internalizes the tool graph by mapping each tool node to a dedicated special token and learning directed tool dependencies within the representation space. It further introduces on-policy tool context distillation, training the model on its own sampled trajectories while distilling stepwise planning signals. Experiments show that GRAFT achieves state-of-the-art performance in exact sequence matching and dependency legality, supporting more reliable LLM tool planning in complex workflows.
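
The "dependency legality" the abstract evaluates can be made concrete with a small validator: a planned tool sequence is legal when every tool's prerequisites appear earlier in the plan. This sketch mirrors the metric in spirit only; the paper's exact definition, and the function name `is_legal_plan`, are assumptions.

```python
def is_legal_plan(plan, deps):
    """Check that a tool sequence respects directed execution
    dependencies.  `deps[t]` is the set of tools that t depends on;
    each must have been executed before t."""
    seen = set()
    for tool in plan:
        if not deps.get(tool, set()) <= seen:
            return False        # a prerequisite has not run yet
        seen.add(tool)
    return True
```

An early wrong tool choice moves the plan into a state from which no suffix can make it legal, which is the error-accumulation failure mode GRAFT targets.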

[LG-63] Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs

Link: https://arxiv.org/abs/2605.11694
Authors: Michael Lu,Max Qiushi Lin,Mo Chen,Sharan Vaswani
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:We study policy optimization for infinite-horizon, discounted constrained Markov decision processes (CMDPs). While existing theoretical guarantees typically hold for the mixture policy, deploying such a policy is computationally and memory intensive. This leads to a practical mismatch where a single (last-iterate) policy must be deployed. Recent theoretical works have thus focused on proving last-iterate convergence, but are largely limited to the tabular setting or to algorithmic variants that are rarely used in practice. To address this, we use the classic inexact augmented Lagrangian (\texttt{AL}) method from constrained optimization, and propose a general framework with provable last-iterate convergence for CMDPs. We first focus on the tabular setting and propose to solve the \texttt{AL} sub-problem with projected Q-ascent (\texttt{PQA}). Combining the theoretical guarantees of \texttt{PQA} and the standard \texttt{AL} analysis enables us to establish global last-iterate convergence. We generalize these results to handle log-linear policies, and demonstrate that an efficient, projected variant of \texttt{PQA} can achieve last-iterate convergence with guarantees comparable to prior work. Finally, we demonstrate that our framework scales to complex non-linear policies, and evaluate it on continuous control tasks.
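
The inexact augmented Lagrangian scheme the abstract builds on can be sketched on a toy scalar problem. The subsolver here is plain numeric gradient descent standing in for the paper's projected Q-ascent, and all constants (rho, step sizes, iteration counts) are illustrative assumptions, not values from the paper.

```python
def solve_al(f, g, x0, rho=5.0, outer=30, inner=300, lr=0.01, eps=1e-6):
    """Inexact augmented-Lagrangian loop for min_x f(x) s.t. g(x) <= 0.
    Each outer step approximately minimizes
        L(x) = f(x) + lam*g(x) + (rho/2)*max(g(x), 0)**2
    by numeric gradient descent, then updates the multiplier
        lam <- max(0, lam + rho*g(x)),
    so the last iterate (not a mixture) approaches the constrained optimum."""
    x, lam = x0, 0.0

    def aug_lagrangian(y):
        gy = g(y)
        return f(y) + lam * gy + 0.5 * rho * max(gy, 0.0) ** 2

    for _ in range(outer):
        for _ in range(inner):
            grad = (aug_lagrangian(x + eps) - aug_lagrangian(x - eps)) / (2 * eps)
            x -= lr * grad
        lam = max(0.0, lam + rho * g(x))
    return x, lam
```

On the toy problem min (x-2)^2 subject to x <= 1, the last iterate approaches the constrained optimum x* = 1 with multiplier lam* = 2.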

[LG-64] Compositional Neural Operators for Multi-Dimensional Fluid Dynamics ICLR2026

Link: https://arxiv.org/abs/2605.11691
Authors: Hamda Hmida,Hsiu-Wen Chang,Youssef Mesri
Subjects: Machine Learning (cs.LG)
*Comments: Published as a conference paper at ICLR 2026

Abstract:Partial differential equations (PDEs) govern diverse physical phenomena, yet high-fidelity numerical solutions are computationally expensive and Machine Learning approaches lack generalization. While Scientific Foundation Models (SFMs) aim to provide universal surrogates, typical encoding-decoding approaches suffer from high pretraining costs and limited interpretability. In this paper, we propose Compositional Neural Operators (CompNO) for 2D systems, a framework that decomposes complex PDEs into a library of Foundation Blocks. Each block is a specialized Neural Operator pretrained on elementary physics. This modular library contains convection, diffusion, and nonlinear convection blocks as well as a Poisson Solver, enabling the framework to address the pressure-velocity coupling. These experts are assembled via an Adaptation Block featuring an Aggregator. This aggregator learns nonlinear interactions by minimizing data loss and physics-based residuals derived from the governing equations. The proposed approach has been evaluated on the Convection-Diffusion equation, the Burgers' equation, and the Incompressible Navier-Stokes equation. Our results demonstrate that learning from elementary operators significantly improves adaptability, enhances model interpretability and facilitates the reuse of pretrained blocks when adapting to new physical systems.

[LG-65] Partial Model Sharing Improves Byzantine Resilience in Federated Conformal Prediction

Link: https://arxiv.org/abs/2605.11684
Authors: Ehsan Lari,Reza Arablouei,Stefan Werner
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Probability (math.PR); Applications (stat.AP)
*Comments: 5 pages, 4 figures, Accepted for presentation at the 34th European Signal Processing Conference (EUSIPCO 2026) in Bruges, Belgium

Abstract:We propose a Byzantine-resilient federated conformal prediction (FCP) method that leverages partial model sharing, where only a subset of model parameters is exchanged each round. Unlike existing robust FCP approaches that primarily harden the calibration stage, our method protects both the federated training and conformal calibration phases. During training, partial sharing inherently restricts the attack surface and attenuates poisoned updates while reducing communication. During calibration, clients compress their non-conformity scores into histogram-based characterization vectors, enabling the server to detect Byzantine clients via distance-based maliciousness scores and to estimate the conformal quantile using only benign contributors. Experiments across diverse Byzantine attack scenarios show that the proposed method achieves closer-to-nominal coverage with substantially tighter prediction intervals than standard FCP, establishing a robust and communication-efficient approach to federated uncertainty quantification.
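
The calibration-stage defense the abstract describes, histogram characterization vectors plus distance-based maliciousness scores, can be sketched in a few lines. The L1 distance to the coordinate-wise median histogram is one plausible instantiation; the paper's exact distance, bin count, and detection threshold are not reproduced here.

```python
def score_histogram(scores, bins=4, lo=0.0, hi=1.0):
    """Compress a client's non-conformity scores into a normalized
    histogram characterization vector."""
    h = [0.0] * bins
    for s in scores:
        idx = min(bins - 1, max(0, int((s - lo) / (hi - lo) * bins)))
        h[idx] += 1.0
    total = sum(h) or 1.0
    return [c / total for c in h]

def maliciousness(hists):
    """Distance of each client's histogram to the coordinate-wise median
    histogram; clients far from the consensus score high and can be
    excluded from the conformal quantile estimate."""
    bins = len(hists[0])
    median = [sorted(h[b] for h in hists)[len(hists) // 2] for b in range(bins)]
    return [sum(abs(h[b] - median[b]) for b in range(bins)) for h in hists]
```

A Byzantine client reporting inflated scores produces a histogram far from the benign median and is flagged, while benign clients score near zero.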

[LG-66] Finite Sentence-Interface Control for Learning Bounded-Fan-Out Linear MCFGs under Fixed Monoid Typing

Link: https://arxiv.org/abs/2605.11644
Authors: Takayuki Kuriyama
Subjects: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
*Comments:

Abstract:We study positive-data learning of bounded-fan-out linear multiple context-free grammars under a fixed explicit finite monoid homomorphism h. The main obstacle beyond the context-free case is that an MCFG nonterminal derives a tuple whose components may be placed in a surrounding sentence in different orders. We introduce sentence-interface types as finite external control objects for such tuple occurrences. A type records the permutation of tuple components in the final sentence together with the h-values of the boundary intervals between them. For reduced working binary linear nondeleting MCFG presentations whose string languages satisfy (f,h)-tuple substitutability, we build a typed refinement, a finite characteristic sample, and a canonical positive-data learner. Once the sample contains this characteristic sample and remains contained in the target language, the learner reconstructs the language exactly. Consequently, for fixed fan-out bound f and fixed explicit h, the resulting class is identifiable in the limit from positive data. Moreover, the hypothesis associated with any given finite sample is constructible in polynomial time for fixed f and fixed h, including output size. Thus sentence-interface control is the finite mechanism that lifts fixed-h distributional reconstruction from context-free grammars to bounded-fan-out linear MCFGs.

[LG-67] MIST: Reliable Streaming Decision Trees for Online Class-Incremental Learning via McDiarmid Bound

Link: https://arxiv.org/abs/2605.11617
Authors: Phu-Hoa Pham,Chi-Nguyen Tran,Nguyen Lam Phu Quy,Dao Sy Duy Minh,Huynh Trung Kiet,Long Tran-Thanh
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments: 9 pages of main text, 5 figures

Abstract:Streaming decision trees are natural candidates for open-world continual learning, as they perform local updates, enjoy bounded memory, and maintain static decision boundaries. Despite these advantages, they still fail in online class-incremental learning due to two coupled miscalibrations: (i) their split criterion grows unreliable as the class count K expands, and (ii) the absence of knowledge transfer at split time. Both failures share a common root: the range of Information Gain intrinsically scales with \log_2 K. Consequently, any Hoeffding-style confidence radius derived from it must inevitably grow with the class count, making a K-independent split criterion structurally impossible and taking away the potential benefits of applying streaming decision trees to continual learning. To fix this issue, we present MIST (McDiarmid Incremental Streaming Tree), which resolves both failures through three integrated components: (i) a tight, K-independent McDiarmid confidence radius for Gini splitting that acts as a structural regulariser; (ii) a Bayesian inheritance protocol that projects parent statistics to child nodes via truncated-Gaussian moments, with variance reduction guarantees strongest precisely when splitting is most conservative; and (iii) per-leaf KLL quantile sketches that support both continuous threshold evaluation and geometry-adaptive leaf prediction from a single data structure. On standard and stress-test tabular streams, MIST is competitive with global parametric methods on near-Gaussian benchmarks and uniquely robust on non-Gaussian geometry where SOTA baselines collapse.
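
The scaling argument at the heart of the abstract, that an information-gain confidence radius must grow with the class count while a Gini-based McDiarmid radius need not, can be made concrete. The bounded-difference constant `c` below stands in for MIST's actual constant, which we do not reproduce; the formulas are illustrative Hoeffding/McDiarmid shapes, not the paper's exact bounds.

```python
import math

def hoeffding_radius_ig(n, num_classes, delta=1e-7):
    """Hoeffding-style radius for information gain: the range of IG is
    log2(K), so the radius inevitably grows with the class count K."""
    value_range = math.log2(num_classes)
    return value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def mcdiarmid_radius_gini(n, delta=1e-7, c=1.0):
    """K-independent McDiarmid-style radius for the Gini criterion:
    the bounded-difference constant c does not depend on K, so neither
    does the radius."""
    return c * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
```

With 1024 classes the IG radius is 10x the 2-class radius at the same sample size, while the Gini radius is unchanged, which is why the split test stays calibrated as new classes arrive.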

[LG-68] Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol

Link: https://arxiv.org/abs/2605.11599
Authors: Hongmin Li
Subjects: Machine Learning (cs.LG)
*Comments: 17 pages, 1 figure

Abstract:Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without audit they can mix genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. We propose an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited slices, the protocol identifies confirmed model-error prompt keys while excluding formatting and extraction artifacts, but matched comparisons do not show that CAPS improves audited yield or unique prompt-key discovery over uniform sampling. The contribution is methodological: targeted prompt variation can be studied under a reconstructable, reviewable, budget-matched protocol, and proxy-guided policies should be judged by audited yield rather than raw mismatch counts or selected examples alone.

[LG-69] Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret

Link: https://arxiv.org/abs/2605.11586
Authors: Kihyun Yu,Beomhan Baek,Dabeen Lee
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Abstract:We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Despite the absence of a linear programming formulation and the resulting nonconvexity under the weakly communicating setting, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal-dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of \widetilde{\mathcal{O}}(T^{2/3}), improving upon the best known bounds, where T denotes the number of interactions. Our approach extends clipped value iteration to the constrained setting and adapts it to a finite-horizon approximation, which stabilizes the dual variable and is crucial for achieving improved regret bounds. To analyze this, we develop a novel approach based on strong duality that enables the decomposition of the composite Lagrangian regret into separate bounds on regret and constraint violation.

[LG-70] FedOUI: OUI-Guided Client Weighting for Federated Aggregation

Link: https://arxiv.org/abs/2605.11571
Authors: Alberto Fernández-Hernández,Jose I. Mestre,Cristian Pérez-Corral,Manuel F. Dolz,Jose Duato,Enrique S. Quintana-Ortí
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Federated learning usually aggregates client updates using dataset size or gradient-level criteria, while overlooking internal signals about how each client model is organizing its input space during training. We introduce FedOUI, a simple aggregation rule based on the Overfitting-Underfitting Indicator (OUI), an activation-based and label-free metric. Each participating client sends its local update together with an OUI value computed on a fixed probe batch, and the server estimates the round-wise OUI distribution to assign lower weights to structurally atypical clients through a smooth reweighting rule. We evaluate FedOUI on CIFAR-10 under strong non-IID partitioning and noisy-client conditions, comparing it with FedAvg, FedProx, and a gradient-alignment baseline. The clearest gains appear under strong heterogeneity, where OUI-based weighting improves aggregation quality while remaining lightweight and interpretable. These results show that internal activation structure can provide useful information for federated aggregation beyond client size and gradient geometry.
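
The server-side "smooth reweighting rule" can be sketched as follows. The Gaussian kernel around the round-wise median and the bandwidth `tau` are assumptions; the paper only specifies a smooth rule over the round-wise OUI distribution that downweights structurally atypical clients.

```python
import math

def fedoui_weights(ouis, tau=0.1):
    """Aggregation weights from per-client OUI values: clients whose OUI
    is far from the round-wise median get exponentially smaller weight,
    and the weights are normalized to sum to one."""
    median = sorted(ouis)[len(ouis) // 2]
    raw = [math.exp(-((o - median) ** 2) / (2 * tau ** 2)) for o in ouis]
    z = sum(raw)
    return [w / z for w in raw]
```

The weighted average of client updates then replaces FedAvg's size-proportional average, so an outlier OUI suppresses that client's contribution without hard exclusion.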

[LG-71] OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training

Link: https://arxiv.org/abs/2605.11570
Authors: Alberto Fernández-Hernández,Jose I. Mestre,Cristian Pérez-Corral,Manuel F. Dolz,Jose Duato,Enrique S. Quintana-Ortí
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Activation functions are what make deep networks expressive: without them, the model collapses to a linear map. Yet we still evaluate training mostly from the outside, through loss, accuracy, return, or final calibration, while the internal structural evolution of the network remains largely unobserved. In this paper, we argue that the Overfitting–Underfitting Indicator (OUI) should be understood as a first practical observable of that internal structure. Across our recent results, OUI consistently appears as an early, label-free, activation-based signal that reveals whether a network is entering a poor or promising training regime before convergence. In supervised learning, it anticipates weight decay regimes; in reinforcement learning, it discriminates learning-rate regimes early in PPO actor–critic; and in online control, it can drive layer-wise weight decay adaptation. Read together with recent evidence that activation patterns tend to stabilize earlier than parameters, these results suggest a broader research direction: an activation-centric theory of training dynamics. OUI is becoming an empirical foothold toward this theory.

[LG-72] A Composite Activation Function for Learning Stable Binary Representations

Link: https://arxiv.org/abs/2605.11558
Authors: Seokhun Park,Choeun Kim,Kwanho Lee,Sehyun Park,Insung Kong,Yongdai Kim
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 32 pages

Abstract:Activation functions play a central role in neural networks by shaping internal representations. Recently, learning binary activation representations has attracted significant attention due to their advantages in computational and memory efficiency, as well as interpretability. However, training neural networks with Heaviside activations remains challenging, as their non-differentiability obstructs standard gradient-based optimization. In this paper, we propose the Heavy Tailed Activation Function (HTAF), a smooth approximation to the Heaviside function that enables stable training with gradient-based optimization. We construct HTAF as a sigmoid-hyperbolic-tangent composite function and theoretically show that it maintains a large gradient mass around zero inputs while exhibiting slower gradient decay in the tail regions. We show that Spiking Neural Networks, Binary Neural Networks, and Deep Heaviside Neural Networks can be trained stably using HTAF with gradient-based optimization. Finally, we introduce Implicit Concept Bottleneck Models (ICBMs), an interpretable image model that leverages HTAF to induce discrete feature representations. Extensive experiments across various architectures and image datasets demonstrate that ICBM enables stable discretization while achieving prediction performance comparable to or better than standard models.
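
A sigmoid-of-tanh composite with the properties the abstract describes can be sketched as below. The exact functional form and the scale `alpha` are assumptions; the paper only states that HTAF composes sigmoid and hyperbolic tangent, so this is one plausible instantiation, not the paper's definition.

```python
import math

def htaf(x, alpha=4.0):
    """Hypothetical sigmoid-of-tanh composite: sigma(alpha * tanh(x)).
    Approximates the Heaviside step (large positive x -> ~1, large
    negative x -> ~0) while staying smooth everywhere."""
    return 1.0 / (1.0 + math.exp(-alpha * math.tanh(x)))

def htaf_grad(x, alpha=4.0, eps=1e-5):
    """Central-difference gradient, used only to inspect tail behaviour."""
    return (htaf(x + eps, alpha) - htaf(x - eps, alpha)) / (2 * eps)
```

Because tanh saturates only polynomially-slowly into the sigmoid's argument, the composite's gradient decays far more slowly in the tails than a steep plain sigmoid of the same sharpness, which is the "heavy tailed" property that keeps gradient-based training alive away from zero.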

[LG-73] A Controlled Counterexample to Strong Proxy-Based Explanations of OOD Performance: in a Fixed Pretraining-and-Probing Setup

Link: https://arxiv.org/abs/2605.11554
Authors: Hongmin Li
Subjects: Machine Learning (cs.LG)
*Comments: 19 pages, 3 figures

Abstract:Task-agnostic structure proxies are often used to interpret why one pretraining corpus transfers better than another, but such explanations require the proxy to track the structure that matters for the downstream task. We test this requirement in a fixed pretraining-and-probing setup motivated by computationally bounded notions of learned structure, including epiplexity. The core question is whether a proxy ranking of two pretraining datasets must agree with their ranking by OOD probe accuracy. We show that it need not. First, we give a controlled construction in which a formal structure quantity, its operational proxy, and the task-relevant structure for a target family separate. We then instantiate the same mechanism in a synthetic sequence-model experiment: under the primary all-sample evaluation, the OOD accuracy ranking reverses the proxy ranking in two of three seeds, with auxiliary diagnostics and ablations supporting the same interpretation. The counterexample does not reject structure-based explanations in general; it identifies a boundary on strong proxy-based explanations. A proxy for total learned structure can fail to track the task-relevant structure that drives OOD performance, even in a controlled setting.

[LG-74] Fast MoE Inference via Predictive Prefetching and Expert Replication

Link: https://arxiv.org/abs/2605.11537
Authors: Ankit Jyothish,Ali Jannesari,Aishwarya Sarkar,Joseph Zuber
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:The Mixture of Experts (MoE) architecture has become a fundamental building block in state-of-the-art large language models (LLMs), improving domain-specific expertise in LLMs and scaling model capacity without proportionally increasing their computational overhead. However, MoE inference often suffers from suboptimal GPU utilization, load imbalance, and elevated latency, as the sparsity of expert activation leaves many tokens waiting on the same few experts for their computation. To address these challenges, we propose a dynamic expert replication strategy that predicts which experts are likely to be overloaded and replicates them for upcoming batches of tokens. The replicated experts process batch tokens concurrently across layers, which leads to improved parallelism, shorter GPU idle time, and significantly faster inference. Experimental evaluations conducted on large-scale MoE models, including Switch-base-128 and Switch-base-256, demonstrate that our method achieves near-complete GPU utilization (approximately 100%), leading to up to 3x improvement in inference speed while preserving approximately 90-95% of the performance of baseline architectures.
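
Given predicted per-expert token loads, the replication decision can be sketched as a greedy balancer. This is an illustrative planner, not the paper's algorithm: the greedy rule, the replica budget, and the function name `plan_replicas` are assumptions.

```python
def plan_replicas(predicted_load, num_extra_replicas):
    """Greedy replica planner: repeatedly give an extra replica to the
    expert whose per-replica load is currently largest, so expected work
    per replica (and thus GPU idle time) is balanced."""
    replicas = {e: 1 for e in predicted_load}
    for _ in range(num_extra_replicas):
        worst = max(predicted_load,
                    key=lambda e: predicted_load[e] / replicas[e])
        replicas[worst] += 1
    return replicas
```

With one hot expert predicted to receive 90 of 110 tokens, three extra replicas all go to that expert, cutting its per-replica load from 90 to 22.5 tokens.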

[LG-75] Primal-Dual Policy Optimization for Linear CMDPs with Adversarial Losses ICLR2026

Link: https://arxiv.org/abs/2605.11535
Authors: Kihyun Yu,Seoungbin Bae,Dabeen Lee
Subjects: Machine Learning (cs.LG)
*Comments: Accepted to ICLR 2026

Abstract:Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the \emph{first} to achieve sublinear regret and constraint violation bounds in this setting, both bounded by \widetilde{\mathcal{O}}(K^{3/4}), where K denotes the number of episodes. The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from the following key contributions: (i) a new covering number argument for the weighted LogSumExp softmax policies, and (ii) two novel algorithmic components, periodic policy mixing and a regularized dual update, which allow us to effectively control both the covering number and the dual variable. We also report numerical results that validate our theoretical findings on the performance of the algorithm.

[LG-76] Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes

Link: https://arxiv.org/abs/2605.11530
Authors: Tatsuhito Hasegawa,Taisei Tanaka
Subjects: Machine Learning (cs.LG)
*Comments: 12 pages, 9 figures, 4 tables. Preprint version of a manuscript submitted to Neurocomputing

Abstract:Single-model ensembles (SMEs) have attracted attention as a way to approximate some of the benefits of deep ensembles within a single network. However, under an approximately matched parameter budget, it remains unclear whether model capacity should be concentrated in a single wide pathway or redistributed into many narrow and independent members. We investigate this question through the Multi-Narrow (MN) transformation, which converts a baseline CNN into an SME of narrow, path-wise independent branches while approximately preserving the dominant parameter budget. We systematically compare Single-Wide and Multi-Narrow configurations across different training-data regimes, architectures, and datasets. The results show that the effectiveness of MN is strongly data-dependent: weakly partitioned or baseline-wide models are preferable in data-rich settings, whereas highly partitioned MN models consistently outperform the baseline in low-data settings. This tendency is reproduced across multiple CNN architectures and image-classification datasets, suggesting that it is not specific to a single benchmark or model family. Analysis of internal representations shows that high-MN models learn more diverse and less redundant path-wise features. In low-data regimes, this diversity is broadly utilized and improves generalization, whereas in data-rich regimes, training becomes imbalanced and prediction is dominated by a small subset of paths. These findings clarify when and why Multi-Narrow transformation is effective, and provide practical guidance for allocating model capacity between width and member multiplicity under a limited budget.

[LG-77] FERMI: Exploiting Relations for Membership Inference Against Tabular Diffusion Models

Link: https://arxiv.org/abs/2605.11527
Authors: Abtin Mahyar,Masoumeh Shafieinejad,Yuhan Liu,Xi He
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Databases (cs.DB)
*Comments:

Abstract:Diffusion models are the leading approach for tabular data synthesis and are increasingly used to share sensitive records. Whether they actually protect privacy has become a pressing question. Membership inference attacks are the standard tool for this purpose, yet existing attacks assume a single-table setting and ignore the multi-relational structure of real sensitive data. A core challenge in assessing privacy risks from membership inference attacks in multi-table settings is how to leverage auxiliary information from relations associated with the target table, such as its parent tables. Particularly, we study a practical setting in which such auxiliary information is available only when training the attack model. At inference time, the attacker observes only the attribute values of the target record from the target table. We propose FERMI (FEature-mapping for Relational Membership Inference), which resolves this gap by enriching single-table features with relational membership signal. Across three tabular diffusion architectures and three real-world relational datasets, FERMI consistently improves attack performance over single-table baselines, with TPR@0.1 FPR rising by up to 53% over the single-table baseline in the white-box setting and 22% in the black-box setting.

[LG-78] OverNaN: NaN-Aware Oversampling for Imbalanced Learning with Meaningful Missingness

Link: https://arxiv.org/abs/2605.11525
Authors: Amanda S Barnard
Subjects: Machine Learning (cs.LG)
*Comments: 14 pages, 2 figures, 17 tables

Abstract:Missing values are routinely treated as defects to be eliminated through deletion or imputation prior to machine learning. In many applied domains, however, missingness itself carries information, reflecting experimental constraints, measurement choices, or systematic mechanisms tied to the data-generating process. Eliminating or masking this structure can distort class boundaries, introduce bias, and reduce generalisability; particularly in imbalanced datasets where minority classes are already under-represented. OverNaN is a lightweight, NaN-aware oversampling framework designed to address class imbalance without erasing missingness structure. It extends common synthetic oversampling methods to operate directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated according to explicitly defined strategies. Rather than repairing missing data, OverNaN treats missingness as part of the feature space over which synthetic samples are generated. This paper situates OverNaN within the broader landscape of imbalanced learning, missing-data handling, and NaN-tolerant algorithms. Using representative examples included with the software, we demonstrate that meaningful missingness can be retained during oversampling without introducing artificial certainty. OverNaN is intended for practitioners working with small, incomplete, and imbalanced datasets in scientific and engineering domains where missingness is unavoidable and often informative.
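
The core idea, SMOTE-style interpolation that does not erase missingness, can be sketched in a few lines. The abstract mentions several strategies (preserve, propagate, selectively interpolate); the sketch below implements one plausible choice, propagating NaN whenever either parent is missing, and the function name `overnan_sample` is hypothetical.

```python
import math
import random

def overnan_sample(a, b, rng, alpha=None):
    """Generate one synthetic minority sample between parents a and b.
    A feature is interpolated only when both parents observe it; if
    either parent is NaN, the child inherits NaN, so meaningful
    missingness is preserved rather than imputed away."""
    if alpha is None:
        alpha = rng.random()        # interpolation weight in [0, 1)
    child = []
    for x, y in zip(a, b):
        if math.isnan(x) or math.isnan(y):
            child.append(float("nan"))
        else:
            child.append(x + alpha * (y - x))
    return child
```

Contrast with standard SMOTE, which requires complete rows and therefore forces deletion or imputation before oversampling.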

[LG-79] EqOD: Symmetry-Informed Stability Selection for PDE Identification

Link: https://arxiv.org/abs/2605.11524
Authors: Gnankan Landry Regis N'guessan,Bum Jun Kim
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*Comments: 45 pages, 16 figures

Abstract:Data-driven identification of partial differential equations (PDEs) relies on sparse regression over a candidate library of differential operators, where larger libraries inflate false positives under observation noise and smaller libraries risk missing true terms. We introduce Equivariant Operator Discovery (EqOD), a fully automatic method combining two library reduction mechanisms. When Galilean invariance is detected from trajectory data via a weak-form structural test, EqOD uses the symmetry-reduced library, eliminating terms that our Galilean exclusion result proves to be absent from the governing equation. Otherwise, it applies randomized LASSO stability selection guided by classical false-positive bounds. A residual-based fallback prevents degradation below the full-library baseline. On 8 PDEs at 4 noise levels, EqOD attains F_1 = 1.000 \pm 0.000 on Heat at 20% noise, where WF-LASSO obtains 0.475 \pm 0.181 , official PySINDy 2.0 obtains 0.000 , and the WSINDy reimplementation obtains 0.789 . Under the strict criterion that the mean F1 difference exceeds the larger of the two standard deviations, EqOD wins 7 of 32 cells. WF-LASSO wins none, and the remaining 25 cells are ties. Across all 32 cells, EqOD outperforms PySINDy 2.0.0 in 23 of 32 cells, and all 5 PySINDy wins occur on reaction PDEs. External validation on WeakIdent and PINN-SR datasets gives F_1 = 1.000 on all 5 clean benchmarks. NLS, 2D, coupled-system, and cylinder-wake extensions are reported. The Galilean library reduction is proved under explicit autonomy and library assumptions. The stability-selection step is motivated by classical false-positive bounds, while formal guarantees for correlated PDE design matrices remain open.

[LG-80] CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

Link: https://arxiv.org/abs/2605.11504
Authors: Dongjun Lee,Ga-eun Bae,Insu Yun
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments: 14 pages, 8 figures

Abstract:Recent advances in Large Language Models (LLMs) have enabled agentic systems for complex, multi-step tasks; cybersecurity is emerging as a prominent application. To evaluate such agents, researchers widely adopt Capture The Flag (CTF) benchmarks. However, current CTF benchmarks reuse existing challenges, which exposes them to data contamination and potential cheating. Notably, we confirmed these issues in practice by integrating web search tools into an existing agent. To address these limitations, we present CTFusion, a streaming evaluation framework built on Live CTFs. To achieve this, CTFusion preserves per-agent independence under a single team account and reduces competition impact by forwarding only the first correct flag per challenge. Moreover, we implement CTFusion as a Model Context Protocol (MCP) server on the widely used CTFd platform, which offers broad applicability to diverse CTF events and agent types. Through experiments with three LLMs, two agents, and five Live CTFs, we demonstrate that existing CTF benchmarks can be unreliable in assessing LLM-based agents, while CTFusion can serve as a robust solution for evaluating cybersecurity agents. We release CTFusion as open source to foster future research in this area.
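
The forwarding rule the abstract describes, only the first correct flag per challenge reaches the platform, can be sketched as a small gate. The class and method names are assumptions, and `checker` stands in for the real platform's flag verification; CTFusion's actual implementation lives in an MCP server on CTFd.

```python
class FlagGate:
    """Forward only the first correct flag per challenge: later
    duplicates from other agents are suppressed, so agents sharing one
    team account stay independent and do not perturb the live event."""

    def __init__(self, checker):
        self.checker = checker        # (challenge_id, flag) -> bool
        self.first_solver = {}        # challenge_id -> agent_id

    def submit(self, agent_id, challenge_id, flag):
        if challenge_id in self.first_solver:
            return "suppressed"       # already solved; do not forward
        if self.checker(challenge_id, flag):
            self.first_solver[challenge_id] = agent_id
            return "forwarded"
        return "wrong"
```

Recording which agent solved first also lets the evaluator attribute per-agent scores even though the platform sees a single team.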

[LG-81] Adaptive Calibration in Non-Stationary Environments

链接: https://arxiv.org/abs/2605.11490
作者: Junyan Liu,Haipeng Luo,Lillian J. Ratliff
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 52 pages

点击查看摘要

Abstract:Making calibrated online predictions is a central challenge in modern AI systems. Much of the existing literature focuses on fully adversarial environments where outcomes may be arbitrary, leading to conservative algorithms that can perform suboptimally in more benign settings, such as when outcomes are nearly stationary. This gap raises a natural question: can we design online prediction algorithms whose calibration error automatically adapts to the degree of non-stationarity in the environment, smoothly interpolating between i.i.d. and adversarial regimes? We answer this question in the affirmative and develop a suite of algorithms that achieve adaptive calibration guarantees under multiple calibration measures. Specifically, with T being the number of rounds and C\in[0,T] being an unknown non-stationarity measure defined as the minimal \ell_1 deviation of the mean outcomes, our algorithms attain \widetilde{O}(\sqrt{T}+(TC)^{1/3}) for \ell_1 calibration error and \widetilde{O}((1+C)^{1/3}) for both \ell_2 and pseudo KL calibration error. These bounds match the optimal rates in the stationary case ( C=0 ) and recover known guarantees in the fully adversarial regime ( C=T ). Our approach builds on and extends prior work [Hu et al., 2026, Luo et al., 2025], introducing an epoch-based schedule together with a novel non-uniform partition of the prediction space that allocates finer resolution near the underlying ground truth.
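The \ell_1 calibration error the abstract bounds can be made concrete with a small binned estimator. The sketch below is a standard binned \ell_1 calibration measure on synthetic Bernoulli outcomes, not the paper's algorithm; the bin count and sample size are arbitrary:

```python
import numpy as np

def l1_calibration_error(preds, outcomes, n_bins=10):
    """Binned l1 calibration error: sum over bins of
    (# forecasts in bin) * |avg outcome in bin - avg forecast in bin|."""
    preds = np.asarray(preds, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.minimum((preds * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.sum() * abs(outcomes[mask].mean() - preds[mask].mean())
    return err

# i.i.d. Bernoulli(0.3) outcomes: a constant 0.3 forecast is calibrated,
# a constant 0.9 forecast is badly miscalibrated.
rng = np.random.default_rng(0)
T = 20000
outcomes = (rng.random(T) < 0.3).astype(float)
err_good = l1_calibration_error(np.full(T, 0.3), outcomes)
err_bad = l1_calibration_error(np.full(T, 0.9), outcomes)
```

In the nearly stationary regime ( C approximately 0 ), the calibrated forecaster's error grows only like the sampling noise, while the miscalibrated one pays linearly in T, which is the gap the adaptive bounds interpolate across.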

[LG-82] Efficient Adjoint Matching for Fine-tuning Diffusion Models

链接: https://arxiv.org/abs/2605.11480
作者: Jeongwoo Shin,Dongsoo Shin,Joonseok Lee,Jaewoong Choi,Jaemoo Choi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM inevitably requires a substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, resulting in a large number of function evaluations, and (ii) backward ODE simulation of the adjoint state along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the \textitnon-trivial base drift inherited from the pretrained model. Motivated by this observation, we propose \textbfEfficient Adjoint Matching (EAM), which substantially improves training efficiency by reformulating the SOC problem with a \textitlinear base drift and a correspondingly modified \textitterminal cost. This reformulation removes both sources of inefficiency; it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4x faster than AM and matches or surpasses it across various metrics including PickScore, ImageReward, HPSv2.1, CLIPScore and Aesthetics.

[LG-83] On the Approximation Complexity of Matrix Product Operator Born Machines

链接: https://arxiv.org/abs/2605.11471
作者: Chao Li,Zerui Tao,Yuchen Cong,Jian Xu,Qibin Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Matrix product operator Born machines (MPO-BMs) are tractable tensor-network models for probabilistic modeling, but their efficient approximation capability remains unclear. We characterize this boundary from both negative and positive perspectives. First, we prove that KL approximation is NP-hard for MPO-BMs in the continuous setting, ruling out universal efficient approximation in the worst case. Second, for score-based variational inference, we show that, under locality and spectral-gap conditions on the loss-induced Hamiltonian, structured targets (e.g., path-graph Markov random fields) admit MPO-BM approximations with polynomial bond dimension and provable KL guarantees. Third, under the same locality structure, we prove that polynomially many score queries suffice to estimate the induced Hamiltonian and obtain such guarantees. Our results provide a theoretical characterization of when MPO-BMs are fundamentally hard to approximate and when they become efficiently learnable.

[LG-84] Robust Multi-Agent Path Finding under Observation Attacks: A Principled Adversarial-Plus-Smoothing Training Recipe

链接: https://arxiv.org/abs/2605.11469
作者: Riad Ahmed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decentralized multi-agent path finding (MAPF) routes a team of agents on a shared grid, each acting from its own local view. The standard solution trains one shared neural policy with Proximal Policy Optimization (PPO), a popular on-policy reinforcement learning algorithm. Such a policy works well on clean observations, but a small input perturbation on one agent often changes its action, which then blocks a neighbour, and the team jams. In this paper we present two training recipes that keep the same network and the same deployment loop, yet make the policy hold up under perturbed observations. The first recipe, Adv-PPO, trains the shared policy against worst-case perturbations of its own input and selects the checkpoint by performance under adversarial perturbation. The second recipe, Adv-PPO+MACER, fine-tunes that checkpoint with a small on-policy smoothness term whose gradient follows the certified radius of randomized smoothing. On POGEMA with 8x8 maps and four agents, the unprotected PPO policy reaches 95.8% clean success but only 2.5% under the strongest attack. Adv-PPO recovers worst-case success to 59.2% at one percentage point of clean cost. Adv-PPO+MACER recovers it to 77.5% +/- 6.0% across three independent seeds at less than one percentage point of clean cost. We support these numbers with per-attack curves, a certified action-stability sanity check (which measures the smoothed-policy wrapper, not the deployed argmax policy), and side-by-side rollout storyboards that show the failure mode and the fix inside one environment instance.
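The certified radius of randomized smoothing that drives the MACER-style smoothness term can be sketched on a toy discrete policy. The code below uses the standard Cohen-et-al-style radius sigma/2 * (Phi^{-1}(p_top) - Phi^{-1}(p_runner-up)) with plain empirical vote frequencies (omitting the usual confidence correction), and a hypothetical linear policy in place of the paper's PPO network:

```python
import numpy as np
from statistics import NormalDist

class LinearPolicy:
    """Toy deterministic policy: argmax of a fixed linear score (hypothetical)."""
    def __init__(self, W):
        self.W = W
        self.n_actions = W.shape[0]
    def act(self, obs):
        return int(np.argmax(self.W @ obs))

def certified_radius(policy, obs, sigma=0.25, n_samples=2000, seed=0):
    """Empirical certified l2 radius of the Gaussian-smoothed policy:
    R = sigma/2 * (Phi^-1(p_top) - Phi^-1(p_runner_up)).
    Uses raw vote frequencies, without the usual confidence correction."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(policy.n_actions)
    for _ in range(n_samples):
        votes[policy.act(obs + rng.normal(scale=sigma, size=obs.shape))] += 1
    p = np.clip(np.sort(votes / n_samples), 1e-6, 1 - 1e-6)
    inv = NormalDist().inv_cdf
    return sigma / 2 * (inv(p[-1]) - inv(p[-2]))

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
policy = LinearPolicy(W)
r_stable = certified_radius(policy, np.array([2.0, 0.0]))    # far from any decision boundary
r_fragile = certified_radius(policy, np.array([0.05, 0.0]))  # near the 0-vs-1 boundary
```

An observation far from the decision boundary earns a large certified radius, while one near the boundary earns almost none; a MACER-style term pushes training toward the former.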

[LG-85] Beyond Prediction: Interval Neural Networks for Uncertainty-Aware System Identification

链接: https://arxiv.org/abs/2605.11460
作者: Mehmet Ali Ferah,Tufan Kumbasar
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Under review

点击查看摘要

Abstract:System identification (SysID) is critical for modeling dynamical systems from experimental data, yet traditional approaches often fail to capture nonlinear behaviors. While deep learning offers powerful tools for modeling such dynamics, incorporating uncertainty quantification is essential to ensure reliable predictions. This paper presents a systematic framework for constructing and training interval Neural Networks (INNs) for uncertainty-aware SysID. By extending crisp neural networks into interval counterparts, we develop Interval LSTM and NODE models that propagate uncertainty through interval arithmetic without probabilistic assumptions. This design allows them to represent uncertainty and produce prediction intervals. For training, we propose two strategies: Cascade INN (C-INN), a two-stage approach converting a trained crisp NN into an INN, and Joint INN (J-INN), a one-stage framework jointly optimizing prediction accuracy and interval precision. Both strategies employ uncertainty-aware loss functions and parameterization tricks to ensure reliable learning. Comprehensive experiments on multiple SysID datasets demonstrate the effectiveness of both approaches and benchmark their performance against well-established uncertainty-aware baselines: C-INN achieves superior point prediction accuracy, whereas J-INN yields more accurate and better-calibrated prediction intervals. Furthermore, to reveal how uncertainty is represented across model parameters, the concept of channel-wise elasticity is introduced, which is used to identify distinct patterns across the two training strategies. The results of this study demonstrate that the proposed framework effectively integrates deep learning with uncertainty-aware modeling.
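The interval-arithmetic propagation that underlies INNs can be shown for a single linear-plus-ReLU layer. The sketch below is the textbook interval forward pass (splitting the weight matrix into nonnegative and nonpositive parts), not the paper's Interval LSTM or NODE models:

```python
import numpy as np

def interval_linear(W, b, lo, hi):
    """Exact interval propagation through y = Wx + b: split W into
    nonnegative and nonpositive parts and pair each with the right bound."""
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def interval_relu(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
x, eps = np.array([0.5, -1.0, 2.0]), 0.1
lo, hi = interval_relu(*interval_linear(W1, b1, x - eps, x + eps))

# Soundness: every point of the input box must land inside [lo, hi].
for _ in range(200):
    y = np.maximum(W1 @ (x + rng.uniform(-eps, eps, size=3)) + b1, 0.0)
    assert np.all(y >= lo - 1e-9) and np.all(y <= hi + 1e-9)
```

Stacking such layers yields prediction intervals with no probabilistic assumptions, which is the property the paper's C-INN and J-INN training strategies then calibrate.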

[LG-86] FastUMAP: Scalable Dimensionality Reduction via Bipartite Landmark Sampling

链接: https://arxiv.org/abs/2605.11428
作者: Hongmin Li
类目: Machine Learning (cs.LG)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:Exploratory analysis of high-dimensional data rarely stops at a single embedding. In practice, analysts rerun dimensionality reduction after changing preprocessing, subsets, or hyperparameters, and standard nonlinear methods can quickly become the bottleneck. We introduce FastUMAP (Bipartite Manifold Approximation and Projection), a landmark-based method designed for this repeated-use setting. FastUMAP builds a sparse point-landmark fuzzy graph, computes a Nystrom spectral warm start from the induced landmark affinity, and then refines all sample coordinates with a UMAP-style objective on the bipartite graph. The landmark ratio r = m/n provides a direct way to trade runtime against fidelity. On 9 benchmark datasets spanning 178 to 70,000 samples, FastUMAP has the lowest runtime on 7 datasets in our reported default-implementation comparison on one workstation. On MNIST and Fashion-MNIST (n=70000), it runs in about 4.6 seconds, compared with about 73–75 seconds for Barnes–Hut t-SNE, while reaching 91.4% mean kNN accuracy versus 94.6% for the strongest accuracy baseline. FastUMAP is therefore best viewed as a fast option for repeated exploratory embedding, rather than as a replacement for accuracy-first methods.
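The Nystrom-style out-of-sample extension behind the spectral warm start can be sketched with an ordinary Gaussian kernel (the paper builds a point-landmark fuzzy graph instead; the kernel, landmark count, and component count below are illustrative):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Gaussian (RBF) affinities between two point sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
m = 40
idx = rng.choice(len(X), size=m, replace=False)      # landmark indices
landmarks = X[idx]

A = rbf(landmarks, landmarks)                        # m x m landmark affinity
C = rbf(X, landmarks)                                # n x m point-landmark affinity
evals, evecs = np.linalg.eigh(A)
order = np.argsort(evals)[::-1][:10]                 # keep the 10 leading components
lam, U = evals[order], evecs[:, order]
embedding = C @ U / lam                              # Nystrom out-of-sample extension
```

Only the m x m eigendecomposition is ever computed, so the cost scales with the landmark ratio rather than with n; on the landmark rows themselves the extension reproduces the eigenvectors exactly.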

[LG-87] A Boundary-Aware Non-parametric Granular-Ball Classifier Based on Minimum Description Length

链接: https://arxiv.org/abs/2605.11406
作者: Zeqiang Xian,Caihui Liu,Yong Zhang,Wenjing Qiu,Duoqian Miao,Witold Pedrycz
类目: Machine Learning (cs.LG)
*备注: 13 pages, 2 figures

点击查看摘要

Abstract:Existing granular-ball classification methods are often driven by handcrafted quality measures, neighborhood rules, or heuristic splitting and stopping criteria, which may reduce the transparency of local construction decisions and hinder explicit modeling of boundary-sensitive regions. To address this issue, this paper proposes a Minimum Description Length based Granular-Ball Classifier (MDL-GBC), a boundary-aware non-parametric and interpretable granular-ball classifier. MDL-GBC formulates class-conditional granular-ball construction as a local model selection problem under the Minimum Description Length principle. For each class, samples from the target class provide positive class evidence, while samples from the remaining classes provide negative boundary evidence. For each current granular ball, three candidate explanations are compared under a unified description-length criterion: a single-ball model, a two-ball model, and a core-boundary model. The selected model determines whether the ball is retained, geometrically split, or refined into core and boundary-sensitive child balls, thereby making local construction decisions consistent with the MDL-based classification mechanism. During prediction, a class-level mixture coding rule aggregates stable granular balls of the same class and assigns the test sample by comparing class-wise coding costs. Experiments on 18 benchmark datasets show that MDL-GBC achieves competitive classification performance against classical classifiers and representative granular-ball-based methods, obtaining the best average Accuracy, Macro-F1, and average rank. These results indicate that MDL-GBC provides an effective and interpretable alternative to conventional heuristic granular-ball classification strategies.

[LG-88] 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone WWW

链接: https://arxiv.org/abs/2605.11405
作者: Siddharth Joshi,Haoli Yin,Rishabh Adiga,Haakon Mongstad,Alvin Deng,Aldo Carranza,Alex Fang,Amro Abbas,Anshuman Suri,Brett Larsen,Daniel Zayas,Darren Teh,David Schwab,Diego Kiner,Fan Pan,Jack Urbanek,Jason Lee,Jason Telanoff,Josh Wills,Kaleigh Mentzer,Luke Merrick,Maximilian Böther,Parth Doshi,Paul Burstein,Pratyush Maini,Ties Robroek,Tony Jiang,Vidhi Jain,Vineeth Dorna,Zhengping Wang,Bogdan Gaza,Ari Morcos,Matthew Leavitt
类目: Machine Learning (cs.LG)
*备注: 33 pages, 15 figures. DatalogyAI website for more details: this https URL

点击查看摘要

Abstract:Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM performance, holding architecture, training recipe, and compute fixed and varying only the training data. Our pipeline, applied to the MAmmoTH-VL single-image subset, lifts performance by +11.7pp on average across 20 public VLM benchmarks (spanning grounding, VQA, OCR/documents, captioning, spatial/3D, counting, charts, math, brand-ID, and multi-image reasoning) and by +11.3pp on average across all nine capability axes of DatBench, our high-fidelity VLM eval suite. At 2B, our curated model surpasses InternVL3.5-2B by 9.9pp at ~17x less training compute and closes the gap to Qwen3-VL-2B to within 1.8pp at ~87x less compute, from pretraining alone. Beyond accuracy, curation delivers four further properties: (1) Reliability: per-capability std across training seeds drops by ~67% and the lift survives a 4k-to-16k context-length sweep; (2) OOD generalization: the 9-eval OOD average rises by +7.2pp, and multi-image BLINK rises by +3.09pp despite single-image-only training, with Visual Correspondence gaining +11.8pp; (3) Behavioral gains beyond benchmarks: across ~1,100 open-ended queries the curated 2B is more honest and more specific than the matched-compute baseline, and more concise and less refusal-prone than a frontier 2B reference; (4) Pareto-dominance on inference cost: at every scale (1B, 2B, 4B) the curated model raises accuracy while lowering response FLOPs vs. the matched-compute baseline, and the curated 4B matches near-frontier accuracy at 3.3x lower response FLOPs than Qwen3-VL-4B. Data curation is a high-leverage tool for building better VLMs, reaching near-frontier accuracy at up to ~150x less training compute.

[LG-89] More Than Meets the Eye: A Semantics-Aware Traffic Augmentation Framework for Generalizable Website Fingerprinting NDSS2027

链接: https://arxiv.org/abs/2605.11402
作者: Youquan Xian,Xueying Zeng,Lingjia Meng,Lei Cui,Runhan Song,Wei Wang,Zhengquan Ding,Peng Liu,Zhiyu Hao
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
*备注: 18 pages, 19 figures, Submitted to NDSS 2027

点击查看摘要

Abstract:Deep learning-based website fingerprinting has emerged as an effective technique for inferring the websites users visit. Although existing methods achieve strong performance on closed-world datasets, they often fail to generalize to real-world environments, especially under geographic and temporal shifts. This limitation fundamentally stems from the coupled effects of two key challenges: application-layer resource composition variability and observable feature instability induced by cross-layer encapsulation. Intertwined, these factors induce systematic shifts between underlying application semantics and observable traffic features. To address the above challenges, we propose SATA , a semantics-aware traffic augmentation framework. Specifically, SATA first performs application-layer semantic augmentation based on protocol rules, expanding the resource composition patterns within each flow and frame sequence patterns under protocol constraints. Based on these augmented frame sequences, we further introduce a cross-layer feature alignment mechanism via knowledge distillation. It aligns frame sequence with packet-length sequence features, enabling cross-layer feature alignment between enhanced semantics and observable sequences. Extensive experiments show that SATA successfully generates traffic patterns that are absent from the training set but genuinely exist in the test set, and significantly improves the performance of mainstream models across diverse and complex scenarios. In particular, in open-world settings, SATA improves ACC by 90.81% and AUROC by 48.37%. The source code of the prototype system is available at this https URL.

[LG-90] MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization

链接: https://arxiv.org/abs/2605.11396
作者: Yupeng Su,Ruijie Zhang,Ziyue Liu,Yequan Zhao,Zheng Zhang
类目: Machine Learning (cs.LG)
*备注: MuonQ enables stable 4-bit quantization of Muon’s optimizer states by preserving directional fidelity through pre-quantization normalization, structural decomposition, and companding quantization

点击查看摘要

Abstract:The Muon optimizer has emerged as a compelling alternative to Adam for training large language models, achieving remarkable computational savings through gradient orthogonalization. However, Muon’s optimizer state is more sensitive to quantization errors: because the orthogonalization discards the magnitudes of singular values and retains only directional information, even small quantization errors in singular vector directions are amplified in the update. In this work, we propose MuonQ, a low-bit Muon training framework built on the principle of directional fidelity optimization. First, we apply a pre-quantization normalization so that each step introduces quantization errors of the same magnitude, preventing the accumulated error from developing a preferred direction. Second, we introduce a structural decomposition that separately quantizes the dominant singular components via power iteration, ensuring that quantization errors perturb only singular value magnitudes rather than rotating singular vector directions. Third, we adopt \mu -law companding quantization to allocate higher resolution to densely packed momentum values, shifting the quantization objective from outlier preservation to dense-region distinguishability. Together, these techniques enable stable 4-bit quantization of Muon’s optimizer states. Pre-training experiments on GPT-style and LLaMA-style models demonstrate that MuonQ at 4-bit precision closely matches full-precision Muon in both training loss and downstream task accuracy, while reducing optimizer state memory by up to 7.3 \times . Our code is available at this https URL.
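The mu-law companding idea itself is easy to demonstrate: compress values through the logarithmic mu-law curve, quantize on a uniform grid, and expand back, so that densely packed small values get finer resolution than a uniform quantizer. The sketch below compares 4-bit mu-law against a uniform 4-bit baseline on a tensor of mostly small values with a few outliers; it is a generic illustration, not MuonQ's quantizer:

```python
import numpy as np

def mu_law_quantize(x, bits=4, mu=255.0):
    """Compand with the mu-law curve, quantize on a uniform signed grid,
    then expand back; values are first normalized by the max magnitude."""
    scale = np.max(np.abs(x)) + 1e-12
    xn = x / scale
    c = np.sign(xn) * np.log1p(mu * np.abs(xn)) / np.log1p(mu)   # compand to [-1, 1]
    levels = 2 ** (bits - 1) - 1                                 # 7 positive levels at 4 bits
    q = np.round(c * levels) / levels
    return np.sign(q) * ((1.0 + mu) ** np.abs(q) - 1.0) / mu * scale  # expand

rng = np.random.default_rng(0)
x = rng.normal(scale=0.01, size=10000)              # densely packed small values...
x[:5] = [1.0, -1.0, 0.8, -0.7, 0.9]                 # ...plus a few large outliers

xq_mu = mu_law_quantize(x, bits=4)
scale, levels = np.max(np.abs(x)), 7                # uniform 4-bit baseline
xq_uni = np.round(x / scale * levels) / levels * scale

err_mu = float(np.mean((x - xq_mu) ** 2))
err_uni = float(np.mean((x - xq_uni) ** 2))
```

The uniform quantizer rounds nearly all of the dense small values to zero, while the companded grid keeps them distinguishable, which is exactly the shift in objective the abstract describes.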

[LG-91] Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

链接: https://arxiv.org/abs/2605.11387
作者: Alberta Longhini,David Emukpere,Jean-Michel Renders,Seungsu Kim
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies (e.g., diffusion policies) improve task performance but often collapse diverse behaviors into a single reward-maximizing mode. To mitigate this issue, we propose an unsupervised mode discovery framework that uncovers latent behavioral modes within generative policies. The discovered modes enable the use of mutual information as an intrinsic reward, regularizing RL fine-tuning to enhance task success while maintaining behavioral diversity. Experiments on robotic manipulation tasks demonstrate that our method consistently outperforms conventional fine-tuning approaches, achieving higher success rates and preserving richer multimodal action distributions.

[LG-92] he tractability landscape of diffusion alignment: regularization rewards and computational primitives

链接: https://arxiv.org/abs/2605.11361
作者: Ankur Moitra,Andrej Risteski,Dhruv Rohatgi
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Inference-time reward alignment asks how to turn a pre-trained diffusion model with base law p into a sampler that favors a reward r while remaining close to p . Since there is no canonical distributional distance for this closeness constraint, different choices lead to different “reward-aligned” laws and, just as importantly, different algorithmic problems. We develop a primitive-based approach to reward alignment: rather than assuming arbitrary reward-aligned laws can be sampled, we ask which simple algorithmic primitives suffice to implement alignment for non-trivial reward classes. If closeness is measured in KL distance, the target law is q(x) \propto p(x) \exp(\lambda^{-1} r(x)) . For this setting, we show that linear exponential tilts of the form q(x)\propto p(x)\exp(\langle \theta, x \rangle) – which according to recent work [MRR26] can be efficiently sampled from – are a sufficient primitive for aligning to a very broad class of convex low-dimensional rewards. If closeness is measured in Wasserstein distance, the corresponding primitive is a proximal transport oracle: given x , solve \mathrm{argmax}_y \{ r(y) - \lambda c(x,y) \} . This oracle can be efficiently implemented for concave or low-dimensional Lipschitz rewards r(x)=f(Ax) . Together, these results illustrate that the choice of distribution distance for alignment affects the computational primitive and the tractable reward class.
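The linear-exponential-tilt primitive can be checked in one dimension: tilting a standard Gaussian by exp(theta * x) yields another Gaussian, N(theta, 1), with normalizer exp(theta^2 / 2), which is why such tilts are cheap to sample. The quadrature check below illustrates the primitive only, not the paper's alignment algorithm:

```python
import numpy as np

# Base law p = N(0, 1), linear tilt q(x) proportional to p(x) * exp(theta * x).
# Completing the square gives q = N(theta, 1) and normalizer exp(theta^2 / 2),
# so the tilted law is again Gaussian and exactly samplable.
theta = 1.3
xs = np.linspace(-10.0, 10.0, 200001)
dx = xs[1] - xs[0]
p = np.exp(-xs ** 2 / 2) / np.sqrt(2 * np.pi)
q_unnorm = p * np.exp(theta * xs)

Z = q_unnorm.sum() * dx            # numerical normalizer
q = q_unnorm / Z
mean_q = (xs * q).sum() * dx
var_q = ((xs - mean_q) ** 2 * q).sum() * dx
```

The numerical mean, variance, and normalizer match the closed forms to high precision, confirming that a linear tilt merely shifts the Gaussian base law.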

[LG-93] gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods

链接: https://arxiv.org/abs/2605.11355
作者: Reza Barati,Qinmin Vivian Hu
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:Inventory-policy comparisons are often difficult to interpret because performance depends on the evaluation contract as much as on the policy itself. Differences in topology, demand regime, information access, feasibility constraints, shortage treatment, and Key Performance Indicator (KPI) definitions can change method rankings. We present gym-invmgmt, a Gymnasium-compatible extension of the OR-Gym inventory-management lineage for auditable cross-paradigm evaluation. The benchmark evaluates optimization, heuristic, and learned controllers under a shared CoreEnv transition, reward, action-bound, and KPI contract, while varying stress conditions through a 22-scenario core grid plus four supplemental MARL-mode rows. Within these released scenarios, informed stochastic programming provides the strongest non-oracle reference, reflecting the value of scenario hedging under forecast access, but at substantially higher online computational cost. Among learned controllers, the Proximal Policy Optimization Transformer variant (PPO-Transformer) achieves the strongest learned-policy quality at fast inference, while Residual Reinforcement Learning (Residual RL) provides competitive hybrid performance. The graph neural network variant (PPO-GNN) is highly competitive on the default divergent topology but less robust on the serial topology. Imitation learning performs well in stationary regimes but degrades under demand shift, and the bounded Large Language Model (LLM) policy-parameter baseline is best interpreted as a diagnostic controller rather than an autonomous inventory optimizer. Overall, the benchmark identifies scenario-conditioned leaders while showing that performance depends jointly on information access, demand shift, topology, and policy representation.

[LG-94] ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference

链接: https://arxiv.org/abs/2605.11335
作者: Han Meng(University of California, Merced),Danny Willow Liu(University of Chicago),Dong Li(University of California, Merced and Yotta Labs)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Layerwise offloading reduces the GPU memory footprint of large diffusion transformer (DiT) inference by prefetching upcoming layers from host memory, but its effectiveness hinges on hiding prefetch latency behind per-layer computation. This assumption breaks down when the per-GPU compute workload is small. Moreover, on PCIe-only nodes, prefetch and inter-GPU collective communications such as all-reduce and all-to-all contend on the shared PCIe path, exposing prefetch latency even when compute would otherwise hide it. We revisit layerwise offloading as a co-scheduling problem between prefetch and communication, guided by a first-order analytical model that predicts when prefetch can be hidden by computation. Building on this model, we design ChunkFlow, a communication-aware, chunk-granular offloading runtime that adaptively yields to collective communication and smoothly trades GPU memory for prefetch volume. On three representative diffusion transformers running on two H100 GPUs over PCIe with Ulysses sequence parallelism, ChunkFlow delivers up to 1.28x step-time speedup over SGLang’s existing layerwise offloading, reduces peak GPU memory by up to 49% over the no-offload baseline at near-identical step time once the workload is large enough, and exposes a tunable memory-latency tradeoff that recovers near-zero step-time overhead in the small-workload regime.

[LG-95] MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

链接: https://arxiv.org/abs/2605.11333
作者: Srinivas Sridharan,Andy Balogh,Bradford M. Beckmann,Brian Coutinho,Louis Feng,Sheng Fu,Sanshan Gao,Mehryar Garakani,Taekyung Heo,David Kanter,Josh Ladd,Ziwei Li,Winston Liu,Changhai Man,Dan Mihailescu,Spandan More,Joongun Park,Ashwin Ramachandran,Vinay Ramakrishnaiah,Saeed Rashidi,Vijay Janapa Reddi,Puneet Sharma,Phio Tian,William Won,Hanjiang Wu,Huan Xu,Jinsun Yoo,Tushar Krishna
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注: Accepted at the 9th Conference on Machine Learning and Systems (MLSys 2026)

点击查看摘要

Abstract:The fast pace of artificial intelligence (AI) innovation demands an agile methodology for observation, reproduction and optimization of distributed machine learning (ML) workload behavior in production AI systems and enables efficient software-hardware (SW-HW) co-design for future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called Chakra execution trace (ET). These ETs represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present analysis of Chakra ETs collected on production AI clusters and demonstrate value via real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including but not limited to NVIDIA, AMD, Meta, Keysight, HPE, and Scala, to name a few.

[LG-96] Neural Statistical Functions

链接: https://arxiv.org/abs/2605.11327
作者: Daniel Xu,Yuxin Xie,Minghao Guo,Haixu Wu,Wojciech Matusik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Classical deep learning typically operates on individual cases. Despite its success, real-world usage often requires repeated inference to estimate statistical quantities for complex decision-making tasks involving uncertainty or extreme-value analysis, resulting in substantial latency. We introduce neural statistical functions, a new family of models learned from pre-trained single-sample predictors and scattered data samples, which can directly infer statistics over continuous operating condition ranges without explicit sampling. By introducing the notion of prefix statistics, we transform and unify diverse statistical functions (e.g., integrals, quantiles, and maxima) into an interval-conditional framework, in which a principled identity between the prefix statistics and the individual-case regression serves as the learning objective. Neural statistical functions achieve strong performance in estimating essential statistics of complex physical processes, including accumulated energy in dynamical systems, quantiles of aerodynamic responses, and maximum stress in crash processes, while achieving up to a 100 \times reduction in model evaluations.
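The prefix-statistic idea is easiest to see for integrals: once a prefix integral S(b) = integral of f from 0 to b is available, any interval integral over [a, b] is S(b) - S(a) with no further evaluations of f. The grid-based sketch below is a generic illustration of that identity, not the paper's learned models:

```python
import numpy as np

# Prefix integral S(b) on a grid; any interval integral over [a, b]
# then follows as S(b) - S(a) without re-evaluating f.
ts = np.linspace(0.0, 1.0, 10001)
dt = ts[1] - ts[0]
f = np.sin(2 * np.pi * ts) + 2.0                                          # "individual-case" signal
prefix = np.concatenate([[0.0], np.cumsum(0.5 * (f[1:] + f[:-1]) * dt)])  # trapezoid prefix

def interval_integral(a, b):
    ia, ib = int(round(a / dt)), int(round(b / dt))
    return prefix[ib] - prefix[ia]

val = interval_integral(0.25, 0.75)
# Closed form over [0.25, 0.75]: the sine part integrates to 0, the constant to 1.
exact = (np.cos(2 * np.pi * 0.25) - np.cos(2 * np.pi * 0.75)) / (2 * np.pi) + 2.0 * 0.5
```

A neural statistical function plays the role of the prefix array here: one forward pass per interval endpoint replaces repeated sampling over the whole operating range.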

[LG-97] \varepsilon-Good Action Identification in Fixed-Budget Monte Carlo Tree Search

链接: https://arxiv.org/abs/2605.11324
作者: Yinan Li,Tuan Nguyen,Kwang-Sung Jun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the fixed-budget max-min action identification problem in depth-2 max-min trees, an important special case of Monte Carlo Tree Search. A learner sequentially allocates T samples to leaves and then recommends a subtree whose minimum leaf value is largest. Motivated by approximate planning, we focus on \varepsilon -good subtree identification, where any subtree whose min value is within \varepsilon of the optimal maximin value is acceptable. Our main contribution is an \varepsilon -agnostic algorithm: it does not require \varepsilon as input, but achieves instance-dependent error bounds for every meaningful \varepsilon . We show that the misidentification probability decays as \exp(-\widetilde{\Theta}(T/H_2(\varepsilon))) , where H_2(\varepsilon) captures both cross-subtree and within-subtree gaps. When each subtree has a single leaf, the problem reduces to standard fixed-budget best-arm identification, and our analysis recovers, up to accelerating factors, known \varepsilon -good guarantees for halving-style methods while giving a new \varepsilon -good guarantee for Successive Rejects. On the lower-bound side, we provide complementary positive and negative results showing that max-min identification has a different hardness structure from standard K -armed bandits. To our knowledge, this is the first provable fixed-budget algorithmic guarantee for max-min action identification.
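For context, the Successive Rejects allocation the abstract refers to follows directly from Audibert and Bubeck's classical fixed-budget schedule; the sketch below computes the phase-wise pull counts and checks they respect the budget (this is the classical single-level schedule, not the paper's max-min algorithm):

```python
import numpy as np

def successive_rejects_schedule(K, T):
    """Fixed-budget Successive Rejects schedule (Audibert & Bubeck, 2010):
    with log_bar(K) = 1/2 + sum_{i=2..K} 1/i and
    n_k = ceil((T - K) / (log_bar(K) * (K + 1 - k))), phase k pulls each of
    the K + 1 - k surviving arms n_k - n_{k-1} times, then drops the worst arm."""
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    n = [0] + [int(np.ceil((T - K) / (log_bar * (K + 1 - k)))) for k in range(1, K)]
    total = sum((K + 1 - k) * (n[k] - n[k - 1]) for k in range(1, K))
    return n[1:], total

ns, total = successive_rejects_schedule(K=8, T=4000)
```

The per-arm counts grow across phases as arms are eliminated, and the classical analysis guarantees the total number of pulls never exceeds the budget T.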

[LG-98] Error whitening: Why Gauss-Newton outperforms Newton NEURIPS

链接: https://arxiv.org/abs/2605.11316
作者: Maricela Best McKay,Nathan P. Lawrence,Brian Wetton,R. Bhushan Gopaluni
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Neurips preprint

点击查看摘要

Abstract:The Gauss-Newton matrix is widely viewed as a positive semidefinite approximation of the Hessian, yet mounting empirical evidence shows that Gauss-Newton descent outperforms Newton’s method. We adopt a function space perspective to analyze this phenomenon. We show that the generalized Gauss-Newton (GGN) matrix projects the Newton direction in function space onto the model’s tangent space, while a Jacobian-only variant obtained by applying the least squares Gauss-Newton matrix to non-least squares losses projects the function space loss gradient onto this same tangent space. Both projections eliminate distortions from the model’s parameterization. Specifically, the evolution of the prediction-target mismatch depends on the model’s parameterization through the matrix JJ^\top where J is the Jacobian of the model with respect to its parameters. The projections effectively replace JJ^\top with the identity. We call this effect error whitening. Once the parameterization is removed, the prediction-target mismatch evolves according to dynamics dictated by the structure of the loss and the projection produced by the optimizer. Error whitening is a special property of Gauss-Newton descent that rigorously distinguishes it from Newton’s method. We empirically demonstrate that Gauss-Newton optimizers follow the theoretically predicted function space dynamics and outperform Newton’s method, Adam, and Muon across case studies spanning supervised learning, physics-informed deep learning, and approximate dynamic programming.

[LG-99] Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets

链接: https://arxiv.org/abs/2605.11291
作者: Thuan Nguyen,Shuchin Aeron,D. Richard Brown III,Prakash Ishwar
类目: Machine Learning (cs.LG)
*备注: 28 pages, 2 figures

点击查看摘要

Abstract:In this paper, we provide a computable characterization of the geometry of optimal representations in Contrastive Learning (CL) when the classes are imbalanced. When classes are balanced and the representation dimension is greater than the number of classes, it is well-known that the optimal representations exhibit Neural Collapse (NC), i.e., representations from the same class collapse to their class means and the class means form an Equiangular Tight Frame (ETF). For imbalanced classes and a large, generalized family of CL losses, we prove that the optimal representations of all samples from the same class collapse to their class means and their geometry exhibits an angular symmetry structure that is determined by the relative class proportions. In general, we show that the geometry can be determined by solving a convex optimization problem. Exploiting this symmetry structure, we analytically investigate a special case where class imbalance is extreme and prove that CL exhibits a phenomenon called Minority Collapse (MC) where all samples from the minority classes (classes with small probabilities) collapse into a single vector, whenever the class imbalance exceeds a threshold, which in turn depends on the regularity properties of the CL loss used and on the number of negative samples. Numerical results are provided to illustrate these phenomena and corroborate the theoretical results. We conclude by identifying a number of open problems.

[LG-100] Quotient-Categorical Representations for Bellman-Compatible Average-Reward Distributional Reinforcement Learning

链接: https://arxiv.org/abs/2605.11289
作者: Ege C. Kaya,Aliasghar Pourghani,Vijay Gupta,Abolfazl Hashemi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 29 pages, 4 figures

点击查看摘要

Abstract:Average-reward reinforcement learning requires estimating the gain and the bias, which is defined only up to an additive constant. This makes direct distributional analogues ill-posed on the real line. We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme. Finally, we show that synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast between ideal quotient distributions and practical fixed-grid categorical representations.

[LG-101] Latent Chain-of-Thought Improves Structured-Data Transformers

链接: https://arxiv.org/abs/2605.11262
作者: Carson Dudley,Samet Oymak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chain-of-thought and more broadly test-time compute are known to augment the expressive capabilities of language models and have led to major innovations in reasoning. Motivated by this success, this paper explores latent chain-of-thought as well as the impact of depth and looping for time-series and tabular data. We propose a recurrent scheme in which a structured-data transformer, after an initial forward pass, compresses its query-position hidden states into feedback tokens that are appended to the input and processed again, allowing multiple rounds of latent computation before prediction. We compare CoT models against a same-depth no-CoT baseline, a deeper baseline matched to the CoT model in effective depth, and a looped transformer with weight-tied recurrence but no additional chain-of-thought tokens. Across 36 datasets in time-series forecasting and tabular prediction, latent chain-of-thought improves over the baseline on 8/9 time-series datasets (+10.99% average gain) and 22/27 tabular datasets (+5.31% average gain). Across both settings, the CoT models perform the best on average. These results demonstrate that chain-of-thought is a useful axis for scaling test-time compute for structured data.
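The recurrence described above (forward pass, compress query-position hidden states into feedback tokens, append, and re-run) can be sketched with a stand-in model; `toy_model` below is a hypothetical placeholder, not the paper's structured-data transformer:

```python
import numpy as np

# Stand-in "transformer": maps each input token (a scalar feature here) to a
# 4-dimensional hidden state via a fixed projection. Purely illustrative.
W = np.linspace(0.1, 1.0, 4).reshape(1, 4)

def toy_model(tokens):            # tokens: (T, 1) -> hidden states (T, 4)
    return np.tanh(tokens @ W)

def latent_cot(x_tokens, rounds):
    tokens = x_tokens
    for _ in range(rounds):
        h = toy_model(tokens)
        # Compress the query-position (last-token) hidden state into a
        # single feedback token and append it to the input sequence.
        feedback = h[-1:].mean(axis=1, keepdims=True)
        tokens = np.concatenate([tokens, feedback], axis=0)
    return toy_model(tokens)[-1]  # prediction read-out after latent rounds

pred = latent_cot(np.ones((3, 1)), rounds=2)
```

Each round lengthens the input by one feedback token, so compute at prediction time grows with the number of latent rounds, which is the test-time-compute axis the abstract studies.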

[LG-102] A Proof-of-Concept Simulation-Driven Digital Twin Framework for Decision-Aware Diabetes Modeling

链接: https://arxiv.org/abs/2605.11247
作者: Zarrin Monirzadeh
类目: Machine Learning (cs.LG)
*备注: Preprint. 9 figures. DOI: https://doi.org/10.5281/zenodo.20127363

点击查看摘要

Abstract:This paper presents a proof-of-concept digital twin framework for simulation-driven diabetes modeling using benchmark clinical data, synthetic temporal augmentation, and illustrative continuous glucose monitoring (CGM) analysis. Unlike traditional predictive models, the framework focuses on generating interpretable simulated trajectories rather than clinically validated outcomes. Evaluation is conducted using a public dataset combined with controlled synthetic scenarios to illustrate temporal behavior and intervention effects. Results illustrate the feasibility of integrating prediction with counterfactual simulation for decision-aware analysis. This work does not claim clinical readiness but provides a foundation for future research on simulation-driven digital twin systems in healthcare.

[LG-103] Support-Proximity Augmented Diffusion Estimation for Offline Black-Box Optimization ICML2026

链接: https://arxiv.org/abs/2605.11246
作者: Yonghan Yang,Ye Yuan,Zipeng Sun,Linfeng Du,Bowei He,Haolun Wu,Can Chen,Xue Liu
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2026

点击查看摘要

Abstract:Offline black-box optimization aims to discover novel designs with high property scores using only a static dataset, a task fundamentally challenged by the out-of-distribution (OOD) extrapolation problem. Existing approaches typically bifurcate into inverse methods, which struggle with the ill-posed nature of mapping scores to designs, and forward methods, which often lack the distributional expressivity to quantify uncertainty effectively. In this work, we propose SPADE (Support-Proximity Augmented Diffusion Estimation), a novel framework that reimagines forward surrogate modeling through the lens of conditional generative modeling. SPADE models the forward likelihood p(y|x) using a diffusion model, but with two critical enhancements to tailor it for optimization: (1) a Calibrated Diffusion Estimation module that enforces global consistency in statistical moments and pairwise rankings, and (2) a Support-Proximity Regularization mechanism that implicitly internalizes the data manifold constraint p(x) via kNN-based density estimation. Theoretically, we prove that our regularization is first-order equivalent to maximizing a Bayesian posterior with a valid design prior. Empirically, SPADE achieves state-of-the-art performance across Design-Bench tasks and an LLM data mixture optimization benchmark.
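A minimal sketch of the kNN idea behind the support-proximity term (a generic k-nearest-neighbor distance score, not SPADE's exact regularizer):

```python
import numpy as np

# Hypothetical kNN-based support-proximity score: penalize candidate designs
# by their mean distance to the k nearest offline data points, so designs far
# from the data manifold receive a larger penalty. Illustrative only.
def knn_proximity(x, data, k=3):
    dists = np.linalg.norm(data - x, axis=1)
    return np.sort(dists)[:k].mean()

data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
on_manifold = knn_proximity(np.array([0.5, 0.5]), data)
off_manifold = knn_proximity(np.array([5.0, 5.0]), data)
```

A candidate inside the data support scores much lower than an out-of-distribution one, which is the signal a surrogate can use to discourage OOD extrapolation.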

[LG-104] When to Ask a Question: Understanding Communication Strategies in Generative AI Tools

链接: https://arxiv.org/abs/2605.11240
作者: Charlotte Park,Kate Donahue,Manish Raghavan
类目: Computer Science and Game Theory (cs.GT); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative AI models differ from traditional machine learning tools in that they allow users to provide as much or as little information as they choose in their inputs. This flexibility often leads users to omit certain details, relying on the models to infer and fill in under-specified information based on distributional knowledge of user preferences. Such inferences may privilege majority viewpoints and disadvantage users with atypical preferences, raising concerns about fairness. Unlike more traditional recommender systems, LLMs can explicitly solicit more information from users through natural language. However, while directly eliciting user preferences could increase personalization and mitigate inequality, excessive querying places a burden on users who value efficiency. We develop a stylized model of user-LLM interaction and formulate an objective that captures the tradeoff between user burden and preference representation. Building on the observation that individual preferences are often correlated, we analyze how AI systems should balance inference and elicitation, characterizing the optimal amount of information to solicit before content generation. Ultimately, we show that information elicitation can mitigate the systematic biases of preference inference, enabling the design of generative tools that better incorporate diverse user perspectives while maintaining efficiency. We complement this theoretical analysis with an empirical evaluation illustrating the model’s predictions and exploring their practical implications.

[LG-105] Extending Kernel Trick to Influence Functions

链接: https://arxiv.org/abs/2605.11239
作者: Zhenhuan Sun,Shahrokh Valaee
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we present a dual representation of the influence functions, whose computational complexity scales with dataset size rather than model size. Both analytically and experimentally, we show that this representation can be an efficient alternative to the original influence functions for estimating changes in parameters, model outputs and loss due to data point removal, when model size is large relative to dataset size, or when evaluating the original influence functions in parameter space is infeasible. The dual representation, however, is limited to linearizable models, which are models whose behavior can be approximated by their linearizations throughout training, and requires materializing a matrix, whose size grows with the product of model output dimension and dataset size.
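The dataset-size scaling rests on the standard primal-dual (kernel-trick) identity for regularized least squares; the sketch below checks it numerically in the overparameterized regime the abstract targets (toy sizes and regularization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                     # many more parameters than data points
X = rng.normal(size=(n, d))        # Jacobian / feature matrix of a linearized model
y = rng.normal(size=n)
lam = 1e-2

# Primal solve in parameter space: O(d^3)
theta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual (kernel-trick) solve in sample space: O(n^3), using the n x n Gram matrix
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
theta_dual = X.T @ alpha

gap = np.linalg.norm(theta_primal - theta_dual)
```

The two solutions coincide because (X^T X + \lambda I)^{-1} X^T = X^T (X X^T + \lambda I)^{-1}; this is the identity that lets the cost scale with dataset size n rather than model size d.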

[LG-106] DeconDTN-Toolkit: A Library for Evaluation and Enhancement of Robustness to Provenance Shift

链接: https://arxiv.org/abs/2605.11237
作者: Yongsen Tan,Zhecheng Sheng,Xiruo Ding,Serguei V. S. Pakhomov,Trevor Cohen
类目: Machine Learning (cs.LG)
*备注: Accepted to CHIL 2026

点击查看摘要

Abstract:Despite the burgeoning body of work on distribution shifts, provenance shift, where the relationship between data source and label changes at deployment, remains poorly understood and under-addressed. In this paper, we establish a formal connection between provenance shift, counterfactual invariance, and invariant learning to derive a learning objective for robustness. We then introduce \textsc{DeconDTN-Toolkit}, a specialized evaluation and remediation suite designed to simulate provenance shifts of varying degrees while maintaining the training protocol and the infrastructure of existing benchmarks. We reveal the vulnerability of Empirical Risk Minimization under provenance shift, introduce a robust out-of-distribution performance indicator, and conduct a comprehensive evaluation on existing algorithms. Our work provides both the theoretical grounding and the practical tools necessary to characterize the problem of confounding by provenance, and implementations of methods to mitigate it.

[LG-107] A Comparative Study of Model Selection Criteria for Symbolic Regression

链接: https://arxiv.org/abs/2605.11233
作者: Ali Soltani,Gabriel Kronberger,Fabricio Olivetti de Franca,Mattia Billa,Alessandro Lucantonio
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective model selection is critical in symbolic regression (SR) to identify mathematical expressions that balance accuracy and complexity, and have low expected error on unseen data. Many modern implementations of genetic programming (GP) for SR generate a set of Pareto optimal candidate solutions, but reliable automatic selection of solutions that generalize well remains an open issue. Current literature offers various information-theoretic and Bayesian approaches, yet comprehensive comparisons of their performance across different data regimes are limited. This study presents a systematic empirical comparison of widely used selection criteria: the Akaike information criterion (AIC), the corrected AIC (AICc), the Bayesian information criterion (BIC), minimum description length (MDL), as well as Efron’s bootstrap estimate for the in-sample prediction error on seven synthetic datasets with Gaussian noise. We rank candidate expressions generated by perturbing ground-truth functions to assess generalization error and selection probability of the ground-truth expression. Our findings reveal that MDL consistently identifies models with the lowest test error and the shortest length across most datasets. While no single criterion dominates all results, MDL and BIC produced the highest probability of selecting the ground-truth expressions.
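For reference, the Gaussian-noise forms of the criteria compared above can be computed directly from a model's residual sum of squares (MDL depends on the expression encoding and is omitted); the two candidate models below are hypothetical:

```python
import numpy as np

# Standard Gaussian-noise model selection criteria.
# n = sample size, k = number of model parameters, rss = residual sum of squares.
def gaussian_ll(rss, n):
    # maximized log-likelihood with the plug-in variance sigma^2 = rss / n
    return -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)

def aic(rss, n, k):
    return 2.0 * k - 2.0 * gaussian_ll(rss, n)

def aicc(rss, n, k):
    return aic(rss, n, k) + 2.0 * k * (k + 1.0) / (n - k - 1.0)

def bic(rss, n, k):
    return k * np.log(n) - 2.0 * gaussian_ll(rss, n)

# Two hypothetical Pareto-front candidates: a simple model with slightly
# higher error vs. a more complex one with slightly lower error.
simple = aic(10.0, 100, 2)
complex_ = aic(9.5, 100, 6)
```

With these numbers AIC already prefers the simpler expression; AICc and BIC penalize the complex model even more heavily, which matches the study's observation that the stricter criteria favor shorter expressions.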

[LG-108] ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models

链接: https://arxiv.org/abs/2605.11222
作者: Ryan Lucas,Mehdi Makni,Xiang Meng,Adam Deng,Rahul Mazumder
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantization is an effective strategy to reduce the storage and computation footprint of large language models (LLMs). Post-training quantization (PTQ) is a leading approach for compressing LLMs. Popular weight quantization procedures, including GPTQ and RTN, suffer in model utility, especially at aggressive quantization levels (sub-4-bit). We propose ADMM-Q, a novel weight quantization algorithm that considers the layer-wise quantization problem. Our algorithm is based on a combinatorial variant of the Alternating Direction Method of Multipliers (ADMM). Our operator-splitting procedure updates weights continuously to minimize the layer-wise reconstruction error, while gradually enforcing the quantization constraints with convergence guarantees. We propose additional algorithmic enhancements (e.g., penalty scheduling, preconditioning, and a local search post-processing step) to make ADMM-Q efficient at LLM scale. ADMM-Q is modular and can be used as a drop-in replacement for any weight quantizer within existing quantization pipelines: ADMM-Q is fully composable with existing techniques including range clipping, learned or random rotations, and activation scaling. Using ADMM-Q in place of GPTQ on Qwen3-8B, we decrease WikiText-2 perplexity in: (i) the W3A16 weight-only setting (12.85 \rightarrow 10.06); (ii) the W4A8 SmoothQuant procedure (9.29 \rightarrow 8.68); and (iii) the W2A4KV4 SpinQuant procedure (66.11 \rightarrow 19.42).
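The operator-splitting pattern the abstract describes can be sketched with a generic ADMM loop for one toy layer (the grid, penalty rho, and iteration count are illustrative; this omits ADMM-Q's penalty scheduling, preconditioning, and local-search steps):

```python
import numpy as np

# Generic ADMM splitting for layer-wise weight quantization: a continuous
# copy W, a quantized copy Z, and a dual variable U. Sketch only, not ADMM-Q.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 8))            # calibration activations
W0 = rng.normal(size=(8, 4))            # pretrained layer weights
grid = np.arange(-2.0, 2.01, 0.25)      # hypothetical quantization levels

def project(W):                          # round to nearest grid level
    return grid[np.abs(W[..., None] - grid).argmin(-1)]

H = X.T @ X
rho = 1.0
Z = project(W0)                          # start from round-to-nearest
U = np.zeros_like(W0)
for _ in range(50):
    # W-step: argmin_W ||X W - X W0||_F^2 + (rho/2) ||W - Z + U||_F^2
    W = np.linalg.solve(2.0 * H + rho * np.eye(8),
                        2.0 * H @ W0 + rho * (Z - U))
    Z = project(W + U)                   # Z-step: enforce the grid constraint
    U = U + W - Z                        # dual update

recon_err = np.linalg.norm(X @ Z - X @ W0)
```

The W-step stays continuous and minimizes layer-wise reconstruction error, while the Z-step gradually enforces the combinatorial quantization constraint, mirroring the splitting the abstract describes.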

[LG-109] Enforcing Constraints in Generative Sampling via Adaptive Correction Scheduling

链接: https://arxiv.org/abs/2605.11214
作者: Noah Trupin,Yexiang Xue
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hard constraints in generative sampling are typically enforced by projection, applied either once at the end of sampling or after every update. This binary framing overlooks a fundamental issue: projection changes the distribution of states which future updates depend on. As a result, delayed projection can produce samples that are feasible but inconsistent with the intended sampling dynamics, even after final projection. We formalize constraint enforcement as a correction scheduling problem over the generative rollout. Using one-step constraint defect as a local signal of geometric mismatch, we introduce adaptive correction scheduling, a state-dependent policy that allocates projection budget to the steps that most strongly perturb the trajectory. Terminal and stepwise projection arise as limiting cases of this family. Across controlled manifold rollouts and a learned projected diffusion sampler, adaptive scheduling improves the cost-accuracy frontier at matched projection budgets, recovering 71.2% of full stepwise benefit with 75% fewer corrections. These results show that constraint timing is a first-class design variable in generative sampling, and that enforcing feasibility alone is insufficient to preserve the intended constrained sampling dynamics.
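The scheduling idea can be illustrated on a toy rollout that should stay on the unit circle: project back only when the one-step constraint defect exceeds a tolerance (the dynamics and threshold are illustrative, not the paper's learned sampler):

```python
import numpy as np

# Euler steps along the rotation field drift off the unit circle; a
# defect-triggered projection restores feasibility only when needed.
def rollout(steps, tol):
    x = np.array([1.0, 0.0])
    corrections = 0
    for _ in range(steps):
        v = np.array([-x[1], x[0]])        # tangent field (rotation)
        x = x + 0.1 * v                    # Euler step inflates the norm
        if abs(np.linalg.norm(x) - 1.0) > tol:
            x = x / np.linalg.norm(x)      # projection onto the circle
            corrections += 1
    return x, corrections

x_adapt, n_adapt = rollout(100, tol=0.02)  # defect-triggered corrections
x_never, _ = rollout(100, tol=np.inf)      # no corrections at all
```

Each Euler step multiplies the norm by sqrt(1.01), so the defect crosses the 0.02 tolerance every fourth step: the adaptive schedule spends 25 projections instead of 100 while keeping the trajectory feasible, whereas the uncorrected rollout drifts far off the constraint set.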

[LG-110] Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

链接: https://arxiv.org/abs/2605.11209
作者: Eungyeup Kim,Chenchen Gu,Vashisth Tiwari,J. Zico Kolter
类目: Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:While existing benchmarks demonstrate the near-perfect performance of large language models (LLMs) on various tasks, this apparent saturation often obscures the need for rigorous evaluation of their reliability. In real-world deployment, however, achieving extremely high reliability (e.g., “five-nines” (99.999%) vs. “three-nines” (99.9%)) is fundamentally critical, as this gap results in an order-of-magnitude increase in failures, which is catastrophic in reliability-critical applications. Still, estimating such a rare failure probability with tight confidence bounds requires prohibitively large LLM inference sizes, making standard Monte Carlo evaluation infeasible under limited compute budgets. In this paper, we observe that LLM failures exhibit strong systematic patterns: across broad parameterized input spaces, a small subset of inputs disproportionately accounts for the majority of failures. Leveraging this observation, we propose to learn a sampling distribution concentrated on failure-prone inputs via the cross-entropy method (CEM). We evaluate our framework on three LLMs, Qwen2.5-Math-7B-Instruct, gpt-oss-20b-low, and Gemini 2.5 Flash Lite, across parameterized GSM8K templates and achieve up to 156.22x reduction in required inferences compared to naive uniform sampling. Our estimates reveal that models with indistinguishable accuracy on standard benchmarks can differ substantially in estimated failure rates, underscoring that reliability is a distinct and measurable axis of model quality. Our simple yet practical framework enables the evaluation of extreme reliability in LLMs, a distinct and underexplored dimension of evaluation beyond existing benchmarks, for their growing use in reliability-sensitive applications.
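A generic version of the cross-entropy-method pipeline the abstract describes, on a toy 1-D failure model where the true rate is known in closed form (the proposal family, elite rule, and sample sizes are illustrative, not the paper's setup):

```python
import numpy as np

# "Failure" is a 1-D input exceeding a threshold, so the true failure rate
# under the nominal N(0, 1) input distribution is known: P(x > 3) ~ 1.35e-3.
rng = np.random.default_rng(0)
THRESH = 3.0

# Cross-entropy method: adapt a Gaussian proposal toward the failure region.
mu, sigma = 0.0, 1.0
for _ in range(10):
    xs = rng.normal(mu, sigma, size=500)
    gamma = min(np.quantile(xs, 0.9), THRESH)   # elite cutoff, capped at failure level
    elite = xs[xs >= gamma]
    mu, sigma = elite.mean(), max(elite.std(), 0.2)

# Importance-sampling estimate of the failure rate under the adapted proposal:
# weight = nominal density / proposal density, evaluated in log space.
xs = rng.normal(mu, sigma, size=5000)
log_w = -0.5 * xs**2 + 0.5 * ((xs - mu) / sigma) ** 2 + np.log(sigma)
est = np.mean(np.where(xs > THRESH, np.exp(log_w), 0.0))
```

The proposal concentrates near the failure boundary, so most of the 5000 evaluations land on failure-prone inputs, which is the sample-efficiency mechanism the paper exploits at much larger scale.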

[LG-111] Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

链接: https://arxiv.org/abs/2605.11196
作者: Vishal Pandey,Gopal Singh
类目: Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Linear attention reduces the quadratic cost of softmax attention to \mathcal{O}(T), but its memory state grows as \mathcal{O}(T) in Frobenius norm, causing progressive interference between stored associations. We introduce \textbf{Variational Linear Attention} (VLA), which reframes the memory update as an online regularised least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 formula. We prove that normalising the write direction to unit length gives the recurrence Jacobian spectral norm exactly 1 for all sequence lengths and head dimensions (Proposition 2), and that the state norm is self-limiting under bounded inputs (Proposition 1). Empirically, VLA reduces \|S_t\|_F by 109\times relative to standard linear attention at T=1,000, achieves near-perfect exact-match accuracy on multi-query associative recall within the effective per-head memory regime ( n_\text{pairs} \le d_h ), maintains substantially higher retrieval performance than DeltaNet and standard linear attention under increasing memory load, and retains 62% accuracy at the per-head capacity boundary. A Triton-fused kernel achieves 14\times speedup over sequential Python and \mathcal{O}(T) scaling, crossing below softmax attention latency at approximately 43,000 tokens.
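The Sherman-Morrison rank-1 update at the core of the adaptive penalty matrix can be sketched directly (generic symmetric updates with unit-length directions; not the paper's exact write rule):

```python
import numpy as np

# Maintain the inverse of A under rank-1 updates A <- A + u u^T via
# Sherman-Morrison: (A + u u^T)^{-1} = A^{-1} - A^{-1}u u^T A^{-1} / (1 + u^T A^{-1} u).
rng = np.random.default_rng(0)
d = 8
A = np.eye(d)
A_inv = np.eye(d)
for _ in range(5):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)               # unit-length write direction
    denom = 1.0 + u @ A_inv @ u          # > 0 since A stays positive definite
    A_inv -= np.outer(A_inv @ u, u @ A_inv) / denom
    A += np.outer(u, u)

drift = np.linalg.norm(A_inv - np.linalg.inv(A))
```

Each update costs O(d^2) instead of the O(d^3) of a fresh inversion, which is what makes maintaining an adaptive penalty matrix online affordable.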

[LG-112] Deep Learning for Protein Complex Prediction and Design

链接: https://arxiv.org/abs/2605.11189
作者: Ziwei Xie
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: PhD thesis

点击查看摘要

Abstract:Accurately modeling and designing protein complex structures is a central problem in computational structural biology, with broad implications for understanding cellular function and developing therapeutics. This thesis investigates two fundamental aspects of this problem using deep learning: domain-specific architectures that capture the hierarchical nature of protein structures, and search algorithms that efficiently navigate the vast sequence spaces of protein complexes to identify interacting homologs for improving complex structure prediction and to design protein sequences.

[LG-113] Optimistic Dual Averaging Unifies Modern Optimizers

链接: https://arxiv.org/abs/2605.11172
作者: Thomas Pethick,Wanyun Xie,Roman Machacek,Volkan Cevher
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically-grounded 1/k decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.

[LG-114] Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data

链接: https://arxiv.org/abs/2605.11170
作者: Ahmed Mehdi Inane,Vincent Quirion,Gintare Karolina Dzugaite,Ioannis Mitliagkas
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Noise-based certified machine unlearning currently faces a hard ceiling: the noise magnitude required to certify unlearning typically destroys model utility, particularly for large-scale deletion requests. While leveraging public data is a standard technique in differential privacy to relax this tension, its role in unlearning remains unexplored. We address this gap by introducing Asymmetric Langevin Unlearning (ALU), a framework that uses public data to mitigate privacy costs. We prove that public data injection suppresses the unlearning cost by a factor of O(1/n_\mathrmpub^2) , guaranteeing a strict computational advantage over retraining. This establishes a new control mechanism: practitioners can mitigate the need for high noise-and the associated utility loss-by increasing the volume of public data. Crucially, we analyze the realistic setting of distribution mismatch, explicitly characterizing how shifts between public and private sources impact utility. We show that ALU enables mass unlearning of constant dataset fractions – a regime where standard symmetric methods become impractical – while maintaining high utility. Empirical evaluations using variational Rényi divergence and membership inference attacks confirm that ALU effectively thwarts privacy attacks while preserving utility under reasonable distribution shifts.

[LG-115] COSMOS: Model-Agnostic Personalized Federated Learning with Clustered Server Models and Pseudo-Label-Only Communication

链接: https://arxiv.org/abs/2605.11165
作者: Ben Rachmut,Luise Ge,William Yeoh,Ning Zhang,Yevgeniy Vorobeychik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) in heterogeneous environments remains challenging because client models often differ in both architecture and data distribution. While recent approaches attempt to address this challenge through client clustering and knowledge distillation, simultaneously handling architectural and statistical heterogeneity remains difficult. We introduce COSMOS, a model-agnostic framework that enables server-side personalization using only pseudo-label communication. Clients train local models and predict on the public data; the server clusters clients by prediction similarity, trains a cluster-specific model for each group using its own compute, and distills the resulting models back to clients. We provide the first theoretical analysis showing that distillation from the learned cluster models can yield exponential personalization risk contraction, going beyond the convergence-to-stationarity guarantees typically provided in model-agnostic FL. Experiments across benchmarks demonstrate that COSMOS consistently outperforms all model-agnostic FL baselines while remaining competitive with state-of-the-art personalized FL methods. More broadly, our results highlight personalized server-side learning with pseudo-labels as a promising paradigm for scalable and model-agnostic federated learning in highly heterogeneous environments.

[LG-116] CORE: Cyclic Orthotope Relation Embedding for Knowledge Graph Completion

链接: https://arxiv.org/abs/2605.11159
作者: Yingqi Zeng,Luying Wang,Huiling Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge graph completion (KGC) aims to automatically infer missing facts in multi-relational data by mapping entities and relations into continuous representation spaces. Recent region-based embedding models have shown great promise in capturing complex logical patterns by representing relations as geometric regions. However, these models inevitably suffer from absolute boundary constraints during optimization. Conversely, without such constraints, relation regions expand indefinitely. To address the limitation, we propose \textbf{CORE} (Cyclic Orthotope Relation Embedding), a novel KGC model that embeds entities and relations onto a boundary-less torus. CORE represents relations as cyclic orthotopes on the torus manifold, allowing regions to seamlessly wrap around spatial boundaries to ensure smooth gradient conduction. Furthermore, an adaptive width regularization is introduced to prevent unconditional region expansion. Theoretical analysis proves that CORE can capture various complex relation patterns such as subsumption and intersection. Extensive experiments on four benchmark datasets demonstrate that CORE achieves highly competitive performance, significantly improving link prediction accuracy in dense semantic environments.
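The wrap-around geometry can be illustrated with a toroidal coordinate distance (coordinates in [0, 1); a sketch of the boundary-less idea, not CORE's scoring function):

```python
import numpy as np

# Wrap-around distance along one torus coordinate: the shorter of the direct
# path and the path through the wrap point, so there is no hard boundary.
def torus_dist(a, b):
    d = np.abs(a - b) % 1.0
    return np.minimum(d, 1.0 - d)
```

Points near 0 and near 1 are close under this metric, so a region that straddles the wrap point stays connected and gradients can flow across it, which is the property that removes the absolute boundary constraints the abstract criticizes.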

[LG-117] Rank Is Not Capacity: Spectral Occupancy for Latent Graph Models

链接: https://arxiv.org/abs/2605.11142
作者: Nikolaos Nakis,Panagiotis Promponas,Konstantinos Tsirkas,Katerina Mamali,Eftychia Makri,Leandros Tassiulas,Nicholas A. Christakis
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Graph representation learning has become a standard approach for analyzing networked data, with latent embeddings widely used for link prediction, community detection, and related tasks. Yet a basic design choice, the latent dimension, is still treated as a brittle hyperparameter, fixed before training and tuned by held-out performance. Learned factors are also identifiable only up to rotation and rescaling, so the nominal rank rarely coincides with the quantity that governs model behavior. We propose Spectral Prefix Extraction and Capacity-Targeted Representation Analysis (Spectra), which replaces rank as the unit of analysis with the spectrum of a learned positive semidefinite kernel, trace-normalized so that spectra are comparable across fits. The normalized eigenvalues form a distribution on the simplex, and their Shannon effective rank acts both as a summary of learned capacity and as a controllable training-time coordinate: a single scalar shapes this realized dimension during training, and bisection targets any desired value within the rank cap. To theoretically support that, we show local regularity and monotonicity of the realized-dimension profile. Across collaboration, social, biological, and infrastructure networks, Spectra traces performance–capacity frontiers that make the trade-off between predictive accuracy and realized dimension visible. It performs competitively with strong link-prediction baselines, yields aligned lower-capacity views of the same fitted model through spectral prefixes, and provides a principled handle on capacity in the overparameterized regime. Capacity thus becomes a property of the fitted model rather than a hyperparameter of the training.
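The Shannon effective rank of a trace-normalized spectrum, as used above, can be computed in a few lines (a standard definition; the paper's exact normalization details may differ):

```python
import numpy as np

# Effective rank of a PSD kernel: trace-normalize the spectrum to a
# distribution on the simplex and exponentiate its Shannon entropy.
def effective_rank(K):
    lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    p = lam / lam.sum()
    p = p[p > 1e-12]                      # drop numerical-zero eigenvalues
    return float(np.exp(-(p * np.log(p)).sum()))
```

An isotropic kernel realizes its full nominal rank, while a rank-1 kernel has effective rank 1 regardless of its nominal dimension, which is why the abstract treats realized capacity rather than rank as the meaningful coordinate.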

[LG-118] Steerable Neural ODEs on Homogeneous Spaces

链接: https://arxiv.org/abs/2605.11133
作者: Emma Andersdotter,Daniel Persson,Fredrik Ohlsson
类目: Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注: 39 pages, 3 figures

点击查看摘要

Abstract:We introduce steerable neural ordinary differential equations on homogeneous spaces M=G/H . These models constitute a novel geometric extension of manifold neural ordinary differential equations (NODEs) that transport associated feature vectors transforming under the local symmetry group H . We interpret features as sections of associated vector bundles over M , and describe their evolution as parallel transport. This results in a coupled system of ODEs consisting of a flow equation on M and a steering equation acting on features. We show that steerable NODEs are G -equivariant whenever the vector field generating the flow and the connection governing parallel transport are both G -invariant. Furthermore, we demonstrate how steerable NODEs incorporate existing NODE models and continuous normalizing flows on Lie groups. Our framework provides the geometric foundation for learning continuous-time equivariant dynamics of general vector-valued features on homogeneous spaces.

[LG-119] Language Modeling with Hyperspherical Flows

链接: https://arxiv.org/abs/2605.11125
作者: Justin Deschenaux,Caglar Gulcehre
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete Diffusion Language Models progressed rapidly as an alternative to autoregressive (AR) models, motivated by their parallel generation abilities. However, for tractability, discrete diffusion models sample from a factorized distribution, which is less expressive than AR. Recent Flow Language Models (FLMs) apply continuous flows to language, transporting noise to data with a deterministic ODE that avoids factorized sampling. FLMs operate on one-hot vectors whose dimension scales with the vocabulary size, making FLMs costly to train. Moreover, since all distinct one-hot embeddings are equidistant in \ell_2 , adding Gaussian noise does not have a clear semantic interpretation (unlike images, where Gaussian noise progressively degrades structure). We introduce \mathbb{S}-FLM, a latent FLM in the hypersphere. \mathbb{S}-FLM generates sequences by rotating vectors in \mathbb{S}^{d-1} along a velocity field learned with cross-entropy, avoiding the overhead of materializing one-hot vectors. Previous FLMs match AR in Generative Perplexity (Gen. PPL), but samples with high likelihood are not necessarily correct in verifiable domains such as math and code. \mathbb{S}-FLM substantially improves continuous flow language models on large-vocabulary reasoning and closes the gap to masked diffusion under standard-temperature sampling ( T=1 ), while a gap remains under optimized low-temperature ( T=0.1 ) decoding.

[LG-120] FedSurrogate: Backdoor Defense in Federated Learning via Layer Criticality and Surrogate Replacement

链接: https://arxiv.org/abs/2605.11122
作者: Fatima Z. Abacha,Sin G. Teo,Yuanxiang Wu,Lucas C. Cordeiro,Mustafa A. Mustafa
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning remains highly susceptible to backdoor attacks–malicious clients inject targeted behaviours into the global model. Existing defenses suffer from substantial false-positive rates under realistic non-independent and identically distributed (non-IID) data, incorrectly flagging benign clients and degrading model accuracy even when adversaries are correctly identified. We present FedSurrogate, a novel backdoor defense that addresses this limitation by combining bidirectional gradient alignment filtering with layer-adaptive anomaly detection. FedSurrogate performs selective clustering on security-critical layers identified via directional divergence analysis, concentrating the detection signal on a low-dimensional subspace. A bidirectional soft-filtering stage screens trusted clients for residual contamination while rescuing false positives from suspects, substantially reducing misclassifications under heterogeneous conditions. Rather than removing confirmed malicious updates, FedSurrogate replaces them with downscaled surrogate updates from structurally similar benign clients, preserving gradient diversity while neutralising adversarial influence. Extensive evaluations demonstrate that FedSurrogate maintains false-positive rates below 10% across all datasets and attack types, compared to 31-32% for the nearest comparably effective baseline, while achieving superior main-task accuracy and maintaining attack success rates below 2.1% across all tested datasets and attack types under challenging non-IID settings.

[LG-121] ShardTensor: Domain Parallelism for Scientific Machine Learning

链接: https://arxiv.org/abs/2605.11111
作者: Corey Adams,Peter Harrington,Akshay Subramaniam,Mohammad Shoaib Abbas,Jaideep Pathak,Mike Pritchard,Sanjay Choudhry
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Scientific Machine Learning (SciML) faces unique challenges for extreme-resolution data, with mitigations that often fail to scale or degrade the accuracy of trained models. While some specialized methods have achieved remarkable results in training models or performing inference on massive spatial datasets with bespoke techniques, there is no generalized framework for parallelization over input data below batch size one per device. In this work we introduce ShardTensor: a novel paradigm of domain parallelism that enables flexible scaling of input data to arbitrary sizes. By decoupling the spatial dimensionality of input data from hardware constraints, ShardTensor enables scientific machine learning workloads to reach new levels of high fidelity training and inference. We demonstrate both strong and weak scaling of workloads during training and inference, showing improved latency with strong scaling and demonstrating the capacity to process higher data sizes with weak scaling. Additionally, we demonstrate multiple dimensions of parallelization, removing barriers to SciML on extreme-scale inputs.
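
The domain-parallel pattern described above can be illustrated with a toy 1-D stencil: shard the spatial axis across workers, exchange one-cell halos with neighbors, and compute per shard. A minimal sketch under the assumption of a three-point averaging stencil (illustrative only, not ShardTensor's actual API; all names here are hypothetical):

```python
import numpy as np

def stencil(x):
    """Three-point averaging stencil (a stand-in for any local operator)."""
    return (x[:-2] + x[1:-1] + x[2:]) / 3.0

def domain_parallel_stencil(x, n_shards):
    """Split the spatial dimension into shards, pad each shard with a
    one-cell halo from its neighbors, and apply the stencil per shard.
    This is the halo-exchange pattern behind domain parallelism."""
    shards = np.array_split(x, n_shards)
    outs = []
    for i, s in enumerate(shards):
        left = shards[i - 1][-1:] if i > 0 else np.empty(0)
        right = shards[i + 1][:1] if i < n_shards - 1 else np.empty(0)
        outs.append(stencil(np.concatenate([left, s, right])))
    return np.concatenate(outs)

x = np.arange(12, dtype=float)
sharded = domain_parallel_stencil(x, 3)  # matches the unsharded stencil(x)
```

Because the halo width equals the stencil radius, the sharded result is bitwise-identical to the single-device computation, which is what lets spatial size scale with the number of devices.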

[LG-122] A Switching System Theory of Q-Learning with Linear Function Approximation

链接: https://arxiv.org/abs/2605.11021
作者: Donghwan Lee,Han-Dong Lim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper develops a switching-system interpretation of Q-learning with linear function approximation (LFA) based on the joint spectral radius (JSR). We derive an exact linear switched model for the mean dynamics and relate convergence to stability of the corresponding switched system. The same construction is then used for stochastic linear Q-learning with independent and identically distributed (i.i.d.) observations and with Markovian observations. Although exact JSR computation is difficult in general, the certificate captures products of switching modes and can be less conservative than one-step norm bounds. The framework also yields a JSR-based view of regularized Q-learning with LFA. The resulting analysis connects projected Bellman equations, finite-difference stochastic-policy switching, and switched-system stability in a single parameter-space formulation.

[LG-123] A Comparative Study of Federated Learning Aggregation Strategies under Homogeneous and Heterogeneous Data Distributions

链接: https://arxiv.org/abs/2605.11010
作者: Antonios Makris,Christos Dousis,Emmanouil Kritharakis,Stavros Bouras,Konstantinos Tserpes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning has emerged as a transformative paradigm for collaborative machine learning across distributed environments. However, its performance is strongly influenced by the aggregation strategy used to combine local model updates at the server, which directly affects learning performance, robustness, and system behavior. This work presents a comprehensive experimental comparison of widely used federated aggregation strategies under both homogeneous and heterogeneous data distributions. Using benchmark image classification datasets, we analyze how different aggregation mechanisms respond to varying degrees of data heterogeneity, examining their impact on centralized accuracy and loss, and system-level efficiency metrics, including aggregation, training, and communication time. The results demonstrate that aggregation strategies exhibit distinct trade-offs across datasets and data distributions, with their effectiveness varying according to dataset characteristics and operating conditions.
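
As background for the aggregation strategies compared above, the canonical FedAvg rule combines client updates by a sample-size-weighted average. A minimal sketch for illustration (not the paper's code; the `fedavg` name and its arguments are hypothetical):

```python
import numpy as np

def fedavg(client_updates, client_sizes):
    """Weighted average of client parameter vectors (FedAvg).

    client_updates: list of 1-D parameter vectors, one per client.
    client_sizes: local training-set sizes, used as aggregation weights.
    """
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    stacked = np.stack(client_updates)  # shape (n_clients, n_params)
    return (weights[:, None] * stacked).sum(axis=0)

# Two clients with unequal data volumes: the larger client dominates.
global_params = fedavg([np.array([1.0, 0.0]), np.array([0.0, 1.0])], [30, 10])
```

Other strategies in such comparisons (e.g., median- or trimmed-mean-based rules) replace the weighted mean with a robust statistic, trading accuracy under homogeneous data for robustness under heterogeneity.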

[LG-124] ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network

链接: https://arxiv.org/abs/2605.11009
作者: Qian Chen,Junqiao Zhao,Hongtu Zhou,Hang Yu,Yanping Zhao,Chen Ye,Guang Chen
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Long-horizon, sparse-reward tasks pose a fundamental challenge for reinforcement learning, since single-step TD learning suffers from bootstrapping error accumulation across successive Bellman updates. Actor-critic methods with action chunking address this by operating over temporally extended actions, which reduce the effective horizon, enable fast value backups, and support temporally consistent exploration. However, existing methods rely on a fixed chunk size and therefore cannot adaptively balance reactivity against temporal consistency. A large fixed chunk size reduces responsiveness to new observations, while a small one produces incoherent motions, forcing task-specific tuning of the chunk size. To address this limitation, we propose Adaptive Chunk Size Actor-Critic (ACSAC). ACSAC leverages a causal Transformer critic to evaluate expected returns for action chunks of different sizes. At each chunk boundary, it adaptively selects the chunk size that maximizes the expected return, supporting flexible, state-dependent chunk sizes without task-specific tuning. We prove that the ACSAC Bellman operator is a contraction whose unique fixed point is the action-value function of the adaptive policy. Experiments on OGBench demonstrate that ACSAC achieves state-of-the-art performance on long-horizon, sparse-reward manipulation tasks across both offline RL and offline-to-online RL settings.

[LG-125] Finite Volume-Informed Neural Network Framework for 2D Shallow Water Equations: Rugged Loss Landscapes and the Importance of Data Guidance

链接: https://arxiv.org/abs/2605.11001
作者: Xiaofeng Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) are a simple surrogate-modelling paradigm for partial differential equations, but their standard strong-form residual formulation is ill suited to the shallow water equations (SWE). It cannot enforce local conservation, handle discontinuities, or leverage the boundary-conforming unstructured meshes used in real-world applications. We introduce "Data-Guided FVM-PINN", a framework that replaces the strong-form residual with a differentiable, well-balanced Roe Riemann-solver finite-volume (FVM) loss evaluated on unstructured meshes. The major finding is that physics-only FVM-PINN training often fails on realistic 2D problems: the network collapses to a trivial low-momentum state that nearly satisfies the FVM-PINN residual but bears no resemblance to the true flow. A loss-landscape diagnostic shows that the FVM-PINN loss at zero momentum is only about 7\times larger than at the trained solution, a shallow basin that an ordinary optimizer falls into; adding even sparse data turns this into a 310\times separation, breaking the degeneracy. On a 2D block-in-channel benchmark, just 200 random velocity measurements drop the velocity-field L_2 error by 22\times versus physics-only; 50 measurements still deliver a 7\times reduction. A controlled ablation isolates the contribution of the FVM-PINN loss: it reduces velocity-field L_2 error by \sim 23% in the sparse-data regime and is essentially neutral when dense reference data is available. On a real-world Savannah River reach (1306 cells, 3600 s simulation, five Manning zones), the framework constructs an accurate surrogate from SRH-2D anchor data, with time-window decomposition reducing error monotonically via progressive initial-condition handoff.

[LG-126] Interpretable EEG Microstate Discovery via Variational Deep Embedding: A Systematic Architecture Search with Multi-Quadrant Evaluation

链接: https://arxiv.org/abs/2605.10947
作者: Saheed Faremi,Andrea Visentin,Luca Longo
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:EEG microstate analysis segments continuous brain electrical activity into brief, quasi-stable topographic configurations that reflect discrete functional brain states. Conventional approaches such as Modified K-Means operate directly in electrode space with hard assignment, offering no learned latent representation, no generative decoder, and no mechanism to decode latent configurations into verifiable scalp topographies, limiting both model transparency and interpretability. To address this, we present a Convolutional Variational Deep Embedding (Conv-VaDE) model that jointly learns topographic reconstruction and probabilistic soft clustering in a shared latent space. Conv-VaDE enables generative decoding of cluster prototypes into verifiable scalp topographies, replacing opaque hard partitioning with probabilistic soft assignment. A polarity invariance scheme and a four-dimensional grid search over cluster count (K from 3 to 20), latent dimensionality, network depth, and channel width are conducted to systematically reveal how each architectural design choice shapes the quality, stability, and interpretability of learned EEG microstate representations. The model is evaluated on the LEMON resting-state eyes-closed EEG dataset with ten participants using topographic template formation, clustering stability, and global explained variance (GEV). The architecture search reveals that depth L = 4 appears consistently across all 18 best-performing configurations, yielding a best-case GEV of 0.730 and a silhouette of 0.229 at K = 4 across the model sweeps, where moderately deep networks with compact channel widths and small latent dimensionality dominate across the full K range. These results establish that principled architecture search, rather than model scale, is the key to interpretable and stable EEG microstate discovery via variational deep embedding.

[LG-127] A proximal gradient algorithm for composite log-concave sampling

链接: https://arxiv.org/abs/2605.12461
作者: Linghai Liu,Sinho Chewi
类目: Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose an algorithm to sample from composite log-concave distributions over \mathbb{R}^d, i.e., densities of the form \pi \propto e^{-f-g}, assuming access to gradient evaluations of f and a restricted Gaussian oracle (RGO) for g. The latter requirement means that we can easily sample from the density \text{RGO}_{g,h,y}(x) \propto \exp(-g(x) - \frac{h}{2}\|y-x\|^2), which is the sampling analogue of the proximal operator for g. If f + g is \alpha-strongly convex and f is \beta-smooth, our sampler achieves \varepsilon error in total variation distance in \widetilde{\mathcal{O}}(\kappa \sqrt{d} \log^4(1/\varepsilon)) iterations, where \kappa := \beta/\alpha, which matches prior state-of-the-art results for the case g=0. We further extend our results to cases where (1) \pi is non-log-concave but satisfies a Poincaré or log-Sobolev inequality, and (2) f is non-smooth but Lipschitz.

[LG-128] Model-based Bootstrap of Controlled Markov Chains

链接: https://arxiv.org/abs/2605.12410
作者: Ziwei Su,Imon Banerjee,Diego Klabjan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注: 45 pages, 7 figures, 19 tables

点击查看摘要

Abstract:We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid confidence intervals for value and Q-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs, and are often close to nominal (50%, 90%, 95%) coverage, while the baselines are poorly calibrated at small sample sizes and short episode lengths.

[LG-129] Trajectory-Agnostic Asteroid Detection in TESS with Deep Learning

链接: https://arxiv.org/abs/2605.12391
作者: Brian P. Powell,Jorge Martinez-Palomera,Amy Tuson,Christina Hedges,Jessie Dotson,Jordan Caraballo-Vega
类目: Earth and Planetary Astrophysics (astro-ph.EP); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注: Accepted by The Astronomical Journal, 11 May 2026

点击查看摘要

Abstract:We present a novel method for extracting moving objects from TESS data using machine learning. Our approach uses two stacked 3D U-Nets with skip connections, which we call a W-Net, to filter background and identify pixels containing moving objects in TESS image time-series data. By augmenting the training data through rotation of the image cubes, our method is robust to differences in speed and direction of asteroids, requiring no assumptions for either parameter range which are typically required in “shift-and-stack” type algorithms. We also developed a novel method for learned data scaling that we call Adaptive Normalization, which allows the neural network to learn the ideal range and scaling distribution required for optimal data processing. We built a code for creating TESS training data with asteroid masks that served as the foundation of our effort (tess-asteroid-ml), which we publicly released for the benefit of the community. Our method is not limited to TESS, but applicable for implementation in other similar time-domain surveys, making it of particular interest for use with data from upcoming missions such as the Nancy Grace Roman Space Telescope and NEOSurveyor.

[LG-130] Multi-Variable Conformal Prediction: Optimizing Prediction Sets without Data Splitting

链接: https://arxiv.org/abs/2605.12341
作者: Laura Lützow,Simone Garatti,Marco C. Campi,Lars Lindemann,Matthias Althoff
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal prediction constructs prediction sets with finite-sample coverage guarantees, but its calibration stage is structurally constrained to a scalar score function and a single threshold variable, forcing the shapes of prediction sets to be fixed before calibration, typically through data splitting. We introduce multi-variable conformal prediction (MCP), a framework that extends conformal prediction to vector-valued score functions with multiple simultaneous calibration variables. Building on scenario theory as a principled framework for certifying data-driven decisions, MCP unifies prediction set design and calibration into a single optimization problem, eliminating data splitting without sacrificing coverage guarantees. We propose two computationally efficient variants: RemMCP, grounded in constrained optimization with constraint removal, which admits a clean generalization of split conformal prediction; and RelMCP, based on iterative optimization with constraint relaxation, which supports non-convex score functions at the cost of possibly greater conservatism. Through numerical experiments on ellipsoidal and multi-modal prediction sets, we demonstrate that RemMCP and RelMCP consistently meet the target coverage with prediction set sizes smaller than or comparable to those of baselines that rely on data splitting, while considerably reducing variance across calibration runs, a direct consequence of using all available data simultaneously for shape optimization and calibration.
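
For context on the data-splitting baseline that MCP eliminates, standard split conformal prediction calibrates a single threshold on held-out scores. A minimal sketch, assuming absolute-residual scores for a regression model (the function name is illustrative):

```python
import numpy as np

def split_conformal_threshold(cal_scores, alpha):
    """Split conformal calibration: return the ceil((n+1)(1-alpha))-th
    smallest calibration score. The prediction set for a new input x is
    {y : score(x, y) <= threshold}, with marginal coverage >= 1 - alpha
    under exchangeability."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(cal_scores)[k - 1]

# Nine calibration residuals |y_i - f(x_i)| and alpha = 0.1:
# k = ceil(10 * 0.9) = 9, so the threshold is the largest score.
q = split_conformal_threshold(np.arange(1.0, 10.0), alpha=0.1)
# Interval for a new x: [f(x) - q, f(x) + q]
```

Note the structural constraint the abstract refers to: a single scalar `q` is calibrated, so the set's shape (here, a symmetric interval) must be chosen before calibration.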

[LG-131] Online Learning-to-Defer with Varying Experts

链接: https://arxiv.org/abs/2605.12340
作者: Dang Hoang Duy,Yannis Montreuil,Maxime Meyer,Axel Carlier,Lai Xing Ng,Wei Tsang Ooi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning-to-Defer (L2D) methods route each query either to a predictive model or to external experts. While existing work studies this problem in batch settings, real-world deployments require handling streaming data, changing expert availability, and shifting expert distributions. We introduce the first online L2D algorithm for multiclass classification with bandit feedback and a dynamically varying pool of experts. Our method achieves regret guarantees of O((n+n_e)T^{2/3}) in general and O((n+n_e)\sqrt{T}) under a low-noise condition, where T is the time horizon, n is the number of labels, and n_e is the number of distinct experts observed across rounds. The analysis builds on novel \mathcal{H}-consistency bounds for the online framework, combined with first-order methods for online convex optimization. Experiments on synthetic and real-world datasets demonstrate that our approach effectively extends standard Learning-to-Defer to settings with varying expert availability and reliability.

[LG-132] Optimal Policy Learning under Budget and Coverage Constraints

链接: https://arxiv.org/abs/2605.12235
作者: Giovanni Cerulli
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study optimal policy learning under combined budget and minimum coverage constraints. We show that the problem admits a knapsack-type structure and that the optimal policy can be characterized by an affine threshold rule involving both budget and coverage shadow prices. We establish that the linear programming relaxation of the combinatorial solution has an O(1) integrality gap, implying asymptotic equivalence with the optimal discrete allocation. Building on this result, we analyze two implementable approaches: a Greedy-Lagrangian (GLC) and a rank-and-cut (RC) algorithm. We show that the GLC closely approximates the optimal solution and achieves near-optimal performance in finite samples. By contrast, RC is approximately optimal whenever the coverage constraint is slack or costs are homogeneous, while misallocation arises only when cost heterogeneity interacts with a binding coverage constraint. Monte Carlo evidence supports these findings.

[LG-133] Probing Non-Equilibrium Grain Boundary Dynamics with XPCS and Domain-Adaptive Machine Learning

链接: https://arxiv.org/abs/2605.12194
作者: Mouyang Cheng,Bowen Yu,Chu-Liang Fu,Nina Andrejevic,Matthias T. Agne,Riley Hanus,Qiwei Wan,Nathan C. Drucker,Thanh Nguyen,Andrei Fluerasu,Lutz Wiegart,Xiaoqian M Chen,Daniel Pajerowski,Yongqiang Cheng,Joshua J Turner,G. Jeffrey Snyder,Mingda Li
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 14 pages, 4 figures

点击查看摘要

Abstract:Grain-boundary (GB) dynamics control the stability, mechanical, and functional response of nanocrystalline materials, but direct experimental access to their slow non-equilibrium motion has been limited. Here we establish X-ray photon correlation spectroscopy (XPCS), combined with domain-adaptive machine learning, as a quantitative probe of GB dynamics. Temperature- and grain-size-dependent two-time XPCS measurements in nanocrystalline silicon reveal pronounced departures from time-translation invariance, showing that GB relaxation can remain far from equilibrium over experimental timescales. However, direct extraction of quantitative physical information from these high-dimensional, noisy fluctuation maps faces a significant challenge. To overcome this barrier, we develop a semi-supervised learning framework that transfers physical parameter labels from continuum simulations to unlabeled experimental XPCS maps through domain-adaptive representation alignment. This AI-augmented approach enables the extraction of key kinetic parameters, including bulk diffusivity, GB stiffness, and effective GB concentration, directly from experimental XPCS measurements. Our results show how machine learning can transform indirect fluctuation signals into quantitative materials dynamics, providing a general route to study non-equilibrium defect motion in solids.

[LG-134] Information-Theoretic Generalization Bounds for Sequential Decision Making

链接: https://arxiv.org/abs/2605.12190
作者: Futoshi Futami,Masahiro Fujisawa
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Information-theoretic generalization bounds based on the supersample construction are a central tool for algorithm-dependent generalization analysis in the batch i.i.d. setting. However, existing supersample conditional mutual information (CMI) bounds do not directly apply to sequential decision-making problems such as online learning, streaming active learning, and bandits, where data are revealed adaptively and the learner evolves along a causal trajectory. To address this limitation, we develop a sequential supersample framework that separates the learner filtration from a proof-side enlargement used for ghost-coordinate comparisons. Under a row-wise exchangeability assumption, the sequential generalization gap is controlled by sequential CMI, a sum of roundwise selector-loss information terms. We also establish a Bernstein-type refinement that yields faster rates under suitable variance conditions. The selector-SCMI proof strategy applies to online learning, streaming active learning with importance weighting, and stochastic multi-armed bandits.

[LG-135] Machine Learning for neutron source distributions

链接: https://arxiv.org/abs/2605.12165
作者: Jose Ignacio Robledo,Norberto Schmidt,Klaus Lieutenant,Jingjing Li,Stefan Kesselheim,Paul Zakalek
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: Under review at Machine Learning: Science and Technology

点击查看摘要

Abstract:In light of recent advancements in machine learning, we propose a novel approach to neutron source distribution estimation through the use of probabilistic generative models. The estimation is based on a Monte Carlo particle list, which is required only during the training stage of the machine learning model. Once the source distribution has been learned, the model is independent of the original particle list, allowing further samples to be drawn efficiently, rapidly, and with negligible memory cost. The performance of various generative models is evaluated, including a variational autoencoder, a normalizing flow, a generative adversarial network, and a denoising diffusion model. These approaches are then compared to existing source distribution estimations, and the advantages and disadvantages of each approach are discussed. The results demonstrate that source distributions can be modeled through the use of probabilistic generative models, which paves the way for further advancements in this field.

[LG-136] Keeping Score: Efficiency Improvements in Neural Likelihood Surrogate Training via Score-Augmented Loss Functions

链接: https://arxiv.org/abs/2605.12118
作者: Alexander Shen,Mikael Kuusela
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 9 pages of main text, 9 pages of appendices, 13 figures

点击查看摘要

Abstract:For stochastic process models, parameter inference is often severely bottlenecked by computationally expensive likelihood functions. Simulation-based inference (SBI) bypasses this restriction by constructing amortized surrogate likelihoods, but most SBI methods assume a black-box data generating process. While these surrogates are exact in the limit of infinite training data, practical scenarios force a strict tradeoff between model quality and simulation cost. In this work, we loosen the black-box assumption of SBI to improve this tradeoff for structured stochastic process models. Specifically, for neural network likelihood surrogates trained via probabilistic classification, we propose to augment the standard binary cross-entropy loss with exact score information \nabla_\theta \log p(x \mid \theta) and adaptive weighting based on loss gradients. We evaluate our approach on case studies involving network dynamics and spatial processes, demonstrating that our method improves surrogate quality at a drastically lower computational cost than generating more training data. Notably, in some cases, our approach achieves downstream inference performance equivalent to a 10x increase in training data with less than a 1.1x increase in training time.

[LG-137] Assessment of cloud and associated radiation fields from a GAN stochastic cloud subcolumn generator

链接: https://arxiv.org/abs/2605.11968
作者: Dongmin Lee,Lazaros Oreopoulos,Nayeong Cho,Daeho Jin
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern Earth System Models (ESMs) operate on horizontal scales far larger than typical cloud features, requiring stochastic subcolumn generators to represent subgrid horizontal and vertical cloud variability. Traditional physically-based generators often rely on analytical cloud overlap paradigms, such as exponential-random decorrelation, which can struggle to capture the complex, anti-correlated behavior of non-contiguous cloud layers. In this study, we introduce a novel two-stage machine learning subcolumn generator for the GEOS atmospheric model, utilizing a Conditional Variational Autoencoder combined with a Generative Adversarial Network (CVAE-GAN) and a U-Net architecture. Trained on a merged CloudSat-CALIPSO height-resolved cloud optical depth dataset, the ML generator creates 56 stochastic subcolumns representing cloud occurrence and optical depth profiles. Evaluated against the established Räisänen generator, the ML approach accurately reproduces bimodal cloud overlap distributions, significantly reduces biases in grid-mean statistics, and halves the root-mean-square error in ISCCP-style cloud-top pressure and optical thickness joint histograms. The improvements brought by our deep generative models translate into more accurate offline radiative transfer calculations, reducing the global-mean shortwave top-of-atmosphere cloud radiative effect bias by a factor of three. Provided that the generator can be accelerated on CPUs, this offers a practical pathway to reduce structural errors at the cloud-radiation interface.

[LG-138] Variance-aware Reward Modeling with Anchor Guidance

链接: https://arxiv.org/abs/2605.11865
作者: Shuxing Fang,Ruijian Han,Liangyu Zhang,Fan Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Standard Bradley–Terry (BT) reward models are limited when human preferences are pluralistic. Although soft preference labels preserve disagreement information, BT can only express it by shrinking reward margins. Gaussian reward models provide an alternative by jointly predicting a reward mean and a reward variance, but suffer from a fundamental non-identifiability from pairwise preferences alone. We propose Anchor-guided Variance-aware Reward Modeling, a framework that resolves this non-identifiability by augmenting preference data with two coarse response-level anchor labels. Building on this, we prove that two anchors are sufficient for identification, develop a joint training objective and establish a non-asymptotic convergence rate for both the estimated reward mean and variance functions. Across simulation studies and four real-world diverging-preference datasets, our method consistently improves reward modeling performance and downstream RLHF, including PPO training and best-of-N selection.
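
The margin-shrinking limitation of BT models noted above can be seen directly: with P(a ≻ b) = sigmoid(r_a - r_b), a 50/50 soft label is best fit by a zero reward margin. A minimal numerical sketch (illustrative only, not the paper's estimator; the function name is hypothetical):

```python
import numpy as np

def bt_soft_label_loss(r_a, r_b, p_pref_a):
    """Bradley-Terry cross-entropy with a soft preference label.

    P(a beats b) is modeled as sigmoid(r_a - r_b); p_pref_a in [0, 1]
    is the fraction of annotators preferring response a.
    """
    p = 1.0 / (1.0 + np.exp(-(r_a - r_b)))
    return -(p_pref_a * np.log(p) + (1 - p_pref_a) * np.log(1 - p))

# With a 50/50 split label, the loss is minimized at zero margin:
# BT can only encode disagreement by shrinking r_a - r_b toward 0,
# conflating "uncertain" with "equally good".
losses = [bt_soft_label_loss(margin, 0.0, 0.5) for margin in (0.0, 1.0, 2.0)]
```

The loss grows monotonically with the margin under a split label, which is the degeneracy that motivates modeling a separate reward variance.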

[LG-139] Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

链接: https://arxiv.org/abs/2605.11850
作者: Konstantinos Oikonomidis,Jan Quan,Kimon Antonakopoulos,Antonio Silveti-Falls,Volkan Cevher,Panagiotis Patrinos
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we develop proximal preconditioned gradient methods with a focus on spectral gradient methods providing a proximal extension to the Muon and Scion optimizers. We introduce a family of stochastic algorithms that can handle a wide variety of convex and nonconvex constraints and study its convergence under heavy-tailed noise, through a novel analysis tailored to the geometry of the proposed methods. We further propose a variance-reduced version, which achieves faster convergence under standard noise assumptions. Finally, we show that the polynomial iterations used in Muon are more accurately captured by a nonlinear preconditioner than by the ideal matrix sign, leading to a convergence analysis that more faithfully reflects practical implementations.

[LG-140] Bin Latent Transformer (BiLT): A shift-invariant autoencoder for calibration-free spectral unmixing of turbid media

链接: https://arxiv.org/abs/2605.11829
作者: Martin Hohmann
类目: Optics (physics.optics); Machine Learning (cs.LG); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
*备注:

点击查看摘要

Abstract:The accurate recovery of constituent-level optical properties from integrating sphere measurements is a central analytical challenge in pharmaceutical analysis, food science, and biomedical diagnostics. Neural network autoencoders can extract spectrally resolved absorption and scattering coefficients for each constituent without prior knowledge, but their fully connected encoders bind learned features to absolute wavelength indices, causing accuracy loss under spectrometer calibration drift or hardware exchange. This work introduces the Bin Latent Transformer (BiLT) autoencoder, in which the dense encoder is replaced by a cross-attention scanner: 16 learnable probe vectors query a convolutional feature map, aggregating morphological spectral information independently of absolute wavelength position. A physics-constrained linear decoder with enforced absorption/scattering separation and a three-phase curriculum augmentation strategy complete the architecture. On a liquid phantom benchmark (intralipid and two ink absorbers; 496 samples), the model achieves R^2 = 0.979 and 0.975 for \mu_a(\lambda) and \mu_s'(\lambda), respectively, on held-out test spectra, maintaining R^2 > 0.90 for \mu_a and R^2 \approx 0.99 for \mu_s' across the full tested shift range of \pm 10 spectral bands. The model generalises to a simulated spectrometer with a broader instrument line shape (\approx 24 nm FWHM) without retraining, retaining R^2 \approx 0.96 and 0.974 for the two channels. Attention map analysis reveals a physically interpretable two-component probe strategy: sparse anchor probes at absorption-edge wavelengths combined with a diffuse, SNR-driven ensemble in the high-transmittance long-wavelength region, which recruits additional probes dynamically under noise to provide implicit spectral averaging.

[LG-141] Posterior Contraction Rates for Sparse Kolmogorov-Arnold Networks in Anisotropic Besov Spaces

链接: https://arxiv.org/abs/2605.11652
作者: Jeunghun Oh,Kyeongwon Lee,Jaeyong Lee,Lizhen Lin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study posterior contraction rates for sparse Bayesian Kolmogorov-Arnold networks (KANs) over anisotropic Besov spaces, providing a statistical foundation of KANs from a Bayesian point of view. We show that sparse Bayesian KANs equipped with spike-and-slab-type sparsity priors attain the near-minimax posterior contraction. In particular, the contraction rate depends on the intrinsic anisotropic smoothness of the underlying function. Moreover, by placing a hyperprior on a single model-size parameter, the resulting posterior adapts to unknown anisotropic smoothness and still achieves the corresponding near-minimax rate. A distinctive feature of our results, compared with those for standard sparse MLP-based models, is that the KAN depth can be kept fixed: owing to the flexibility of learnable spline edge functions, the required approximation complexity is controlled through the network width, spline-grid range and size, and parameter sparsity. Our analysis develops theoretical tools tailored to sparse spline-edge architectures, including approximation and complexity bounds for Bayesian KANs. We then extend to compositional Besov spaces and show that the contraction rates depend on layerwise smoothness and effective dimension of the underlying compositional structure, thereby effectively avoiding the curse of dimensionality. Together, the developed tools and findings advance the theoretical understanding of Bayesian neural networks and provide rigorous statistical foundations for KANs.

[LG-142] Learning U-Statistics with Active Inference

链接: https://arxiv.org/abs/2605.11638
作者: Xiaoning Wang,Yuyang Huo,Liuhua Peng,Changliang Zou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract: U -statistics play a central role in statistical inference. In many modern applications, however, acquiring the labels required for U -statistics is costly. Motivated by recent advances in active inference, we develop an active inference framework for U -statistics that selectively queries informative labels to improve estimation efficiency under a fixed labeling budget, while preserving valid statistical inference. Our approach is built on the augmented inverse probability weighting U -statistic, which is designed to incorporate the sampling rule and machine learning predictions. We characterize the optimal sampling rule that minimizes its variance and design practical sampling strategies. We further extend the framework to U -statistic-based empirical risk minimization. Experiments on real datasets demonstrate substantial gains in estimation efficiency over baseline methods, while maintaining target coverage.
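
The augmented inverse-probability-weighting idea is easiest to see in the degree-one case, the sample mean; the paper generalizes it to higher-order U-statistics and to optimized sampling rules. A hedged NumPy sketch on synthetic data (all names and the sampling rule are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
y = rng.normal(loc=2.0, size=n)          # true labels (expensive to query)
f = y + rng.normal(scale=0.5, size=n)    # imperfect ML predictions (always available)
pi = np.where(f > f.mean(), 0.15, 0.05)  # informative sampling: query "promising" units more
xi = rng.random(n) < pi                  # labels actually purchased

# AIPW estimator for the simplest (degree-one) U-statistic, the mean:
# use predictions everywhere, plus an inverse-probability-weighted label correction.
theta_aipw = np.mean(f + (xi / pi) * (y - f))
theta_naive = y[xi].mean()               # labeled-only mean: biased under this sampling rule
print(theta_aipw, theta_naive)
```

The correction term has zero mean for any sampling rule with known probabilities, so the estimator stays unbiased even when sampling concentrates on informative units, while the naive labeled-only mean inherits the selection bias.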

[LG-143] Generative climate downscaling enables high-resolution compound risk assessment by preserving multivariate dependencies

链接: https://arxiv.org/abs/2605.11531
作者: Takuro Kutsuna,Noriko N. Ishizaki,Norihiro Oyama,Hiroaki Yoshida
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Physics-based climate projections using general circulation models are essential for assessing future risks, but their coarse resolution limits regional decision-making. Statistical downscaling can efficiently add detail, yet many methods treat variables independently, degrading inter-variable relationships that govern compound hazards such as heat stress, drought, and wildfire. Here we show that a diffusion-based multivariate generative framework, combined with bias correction, recovers degraded inter-variable correlations even under a 50 \times increase in linear resolution. When applied to five meteorological variables over Japan, the framework reduces inter-variable correlation errors by more than fourfold relative to existing baselines while improving both univariate and spatial accuracy, leading to more accurate detection of severe drought. These results demonstrate that multivariate generative downscaling improves the reliability of compound risk assessment under large resolution gaps.
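
The abstract pairs generative downscaling with bias correction; a common form of the latter is empirical quantile mapping. The sketch below is a minimal illustration of that standard technique, not the paper's exact correction scheme, and the gamma "rainfall" distributions are invented for the example.

```python
import numpy as np

def quantile_map(model_hist, obs_hist, model_new):
    """Empirical quantile mapping: send each new model value to the observed
    value at the same quantile of the historical distributions."""
    q = np.searchsorted(np.sort(model_hist), model_new) / len(model_hist)
    q = np.clip(q, 0.0, 1.0)
    return np.quantile(obs_hist, q)

rng = np.random.default_rng(2)
obs = rng.gamma(2.0, 3.0, size=5000)     # "observed" rainfall
model = rng.gamma(2.0, 2.0, size=5000)   # biased model output (too dry)
corrected = quantile_map(model, obs, model)
print(model.mean(), corrected.mean(), obs.mean())
```

Applied per variable, quantile mapping fixes marginal distributions only; the point of the paper's multivariate generative step is to repair the inter-variable dependencies that such univariate corrections leave untouched.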

[LG-144] Post-ADC Inference: Valid Inference After Active Data Collection

链接: https://arxiv.org/abs/2605.11511
作者: Shuichi Nishino,Tomohiro Shiraishi,Teruyuki Katsuoka,Ichiro Takeuchi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The validity of statistical inference depends critically on how data are collected. When data gathered through active data collection (ADC) are reused for a post-hoc inferential task, conventional inference can fail because the sampling is adaptively biased toward regions favored by the collection strategy. This issue is especially pronounced in black-box optimization, where sequential model-based optimization (SMBO) methods such as the tree-structured Parzen estimator (TPE) and Gaussian process upper confidence bound (GP-UCB) preferentially concentrate evaluations in promising regions. We study statistical inference on actively collected data when the inferential target is constructed in a data-dependent manner after data collection. To enable valid inference in this setting, we propose post-ADC inference, a framework that accounts for the biases arising from both the active data collection process and the subsequent data-driven target construction. Our method builds on selective inference and provides valid p -values and confidence intervals that correct for both sources of bias. The framework applies to a broad class of ADC processes by imposing only assumptions on the observation noise, without requiring any assumptions on the underlying black-box function or the surrogate model used by the SMBO algorithm. Empirical results also show that post-ADC inference provides valid inference for data collected by GP-UCB and TPE.

[LG-145] Beyond Manual Curation: Augmenting Targeted Protein Degradation Databases via Agentic Literature Extraction Workflows

链接: https://arxiv.org/abs/2605.11221
作者: Yaochen Rao,Farzaneh Jalalypour,N. M. Anoop Krishnan,Rocío Mercado
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. Inconsistent compound identifiers and incomplete or implicit assay context further demand domain-specific logic that generic LLM pipelines do not provide. Existing molecular glue and PROTAC databases are manually curated and often lack the experimental context required for downstream modeling. We formulate TPD database extraction as a domain-specific curation task and present an expert-in-the-loop LLM workflow, evaluated through a triangular comparison among LLM predictions, standardized baseline records, and expert-annotated ground truth. A lightweight cross-validated prompt-refinement module adapts extraction instructions from scarce expert annotations. With only seven annotated molecular glue publications, the workflow achieved record-level F_1 = 0.98 and transferred to PROTACs by terminology substitution alone, maintaining record-level F_1 > 0.93 . Applied at scale, it expanded molecular glue and PROTAC databases by 81% and 92% records, respectively, with 92% and 82.5% of newly recovered records validated as correct upon expert review. The workflow also recovered kinetic and assay-context information essential for cross-study potency comparison and condition-aware degradation modeling. We release the workflow, prompts, evaluation code, and extracted datasets as resources for TPD data curation and AI-assisted scientific curation more broadly.
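
Record-level F1, the headline metric above, treats each extracted assay record as a single unit and scores exact matches against the gold set. A minimal sketch; the record fields (compound, target, recruiter, endpoint, value) are hypothetical stand-ins for the paper's actual schema:

```python
def record_f1(predicted, gold):
    """Record-level F1: exact-match comparison of extracted records,
    each record represented as a hashable tuple of its fields."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("cpdA", "BRD4", "CRBN", "DC50", "12 nM"),
        ("cpdB", "BTK", "CRBN", "Dmax", "85 %")]
pred = [("cpdA", "BRD4", "CRBN", "DC50", "12 nM"),
        ("cpdC", "EGFR", "VHL", "DC50", "40 nM")]
print(record_f1(pred, gold))   # 0.5: one of two predictions matches
```

Exact-match scoring at the record level is deliberately strict: a record with any wrong field counts as both a false positive and a missed gold record, which is why the reported F1 values are a demanding standard.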

[LG-146] Operator Spectroscopy of Trained Lattice Samplers

链接: https://arxiv.org/abs/2605.11199
作者: Moxian Qian
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
*备注: 26 pages, 13 figures, 15 tables

点击查看摘要

Abstract:Trained lattice samplers are usually judged by the ensembles they generate. Here we instead analyze the trained field-space function itself: a flow-matching velocity, a diffusion score, or a normalizing-flow action residual. We project these functions onto operator bases fixed before the fit, chosen from symmetry, exact Gaussian path limits, finite-volume modes, and gauge covariance. For two-dimensional lattice \phi^4 , a trained straight-flow teacher is not described by a local force basis alone. After the local transport basis, the residual separates into a zero-mode Binder component and a lowest-shell finite- k correlator component. The deflated zero-mode polynomial P_5(M;t) reduces the dominant Binder-tail component, while \phi^\perp_{|n|^2=1} reduces the finite- k correlator component; wrong-parity, off-zero-mode, and random controls do not produce the same reductions. The same projection distinguishes other sampler classes. Diffusion follows the force-resolvent ordering predicted by the free theory, reverse-KL normalizing-flow collapse appears as a forbidden odd zero-mode residual, and gauge-equivariant teachers are resolved by Wilson-loop-force tangent directions. The operator basis is model- and symmetry-dependent, but the test is common: project the trained field-space function and retain sectors that lower held-out residuals and pass the available controls.

[LG-147] Adaptive Policy Learning Under Unknown Network Interference

链接: https://arxiv.org/abs/2605.11191
作者: Aidan Gleich,Eric Laber,Alexander Volfovsky
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adaptive experimentation under unknown network interference requires solving two coupled problems: (i) learning the underlying dynamics of interference among units and (ii) using these dynamics to inform treatment allocation in order to maximize a cumulative outcome of interest (e.g. revenue). Existing adaptive experimentation methods either assume the interference network is fully known or bypass the network by operating on coarse cluster-level randomizations. We develop a Thompson sampling algorithm that jointly learns the interference network and adaptively optimizes individual-level treatment allocations via a Gibbs sampler. The algorithm returns both an optimized treatment policy and an estimate of the interference network; the latter supports downstream causal analyses such as estimation of direct, indirect, and total treatment effects. For additive spillover models, we show that total reward is linear in the treatment vector with coefficients given by an n -dimensional latent score. We prove a Bayesian regret bound of order \sqrt{nT \cdot B \log(en/B)} for exact posterior sampling; empirically, our Gibbs-based approximate sampler achieves regret consistent with this rate and remains sublinear when the additive spillovers assumption is violated. For general Neighborhood Interference, where this reduction is unavailable, we analyze an explore-then-commit variant with O(n^2 \log T) graph-discovery cost. An information-theoretic \Omega(n \log T) lower bound complements both results. Empirically, our method achieves more than an order-of-magnitude reduction in regret in head-to-head comparisons. On two real-world networks, the algorithm achieves sublinear regret and yields downstream effect estimates with small RMSE relative to the truth.
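
Under the additive-spillover reduction, total reward is linear in the treatment vector with an n-dimensional latent score, so the core loop is Thompson sampling on a Bayesian linear model. The toy NumPy sketch below shows only that loop; the paper's method additionally learns the interference graph with a Gibbs sampler, and every constant here (sizes, budget, noise) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, B, sigma = 20, 300, 5, 0.5
theta = rng.normal(size=n)               # latent per-unit scores (unknown to the learner)
opt = np.sort(theta)[-B:].sum()          # reward of the best budget-B allocation

Lam = np.eye(n)                          # Gaussian posterior precision (prior N(0, I))
b = np.zeros(n)
regs = []
for t in range(T):
    cov = np.linalg.inv(Lam)
    mu = cov @ b
    theta_s = rng.multivariate_normal(mu, cov)   # Thompson step: sample from the posterior
    a = np.zeros(n)
    a[np.argsort(theta_s)[-B:]] = 1.0            # treat the B units with highest sampled score
    y = a @ theta + sigma * rng.normal()         # observe only the aggregate reward
    Lam += np.outer(a, a) / sigma**2             # conjugate Bayesian linear-model update
    b += a * y / sigma**2
    regs.append(opt - a @ theta)
print(sum(regs[:50]) / 50, "->", sum(regs[-50:]) / 50)
```

Early rounds explore essentially at random; as the posterior concentrates, the sampled scores rank the units correctly and per-round regret shrinks, which is the qualitative behavior behind the sublinear regret bound.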

[LG-148] Interpretable Machine Learning for Spatial Science: A Lie-Algebraic Kernel for Rotationally Anisotropic Gaussian Processes

链接: https://arxiv.org/abs/2605.11179
作者: Kane Warrior,Dalia Chakrabarty
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many three-dimensional spatial fields are anisotropic, with directions of rapid and slow variation that need not align with the coordinate axes. Standard Gaussian process kernels with Automatic Relevance Determination (ARD) capture only axis-aligned anisotropy, while generic full symmetric positive definite (SPD) metrics can represent rotated anisotropy but do not parameterise principal length-scales and directions directly. We introduce an interpretable rotationally anisotropic GP kernel that parameterises a three-dimensional SPD covariance metric using three principal length-scales and an explicit SO(3) rotation. The rotation is represented by an axis-angle vector and mapped to SO(3) via the Lie-algebra exponential map, giving unconstrained Euclidean coordinates for inference while always inducing a valid SPD metric. The construction spans the same family of three-dimensional SPD covariance metrics as a generic full-SPD parameterisation, but exposes the geometry differently: length-scales and orientation are explicit, interpretable, and directly available for prior specification and posterior summaries. We perform Bayesian inference on these quantities using Markov Chain Monte Carlo (MCMC), and characterise the resulting symmetries and weakly identified regimes. On synthetic data with rotated anisotropy, the posterior recovers the generating metric and improves prediction relative to an axis-aligned ARD baseline, while matching the predictive performance of a generic full SPD baseline. When the ground truth is axis-aligned, posterior mass concentrates near the identity rotation and predictive performance matches ARD. On a material-density dataset from a laboratory-fabricated nano-brick, the inferred metric reveals rotated anisotropy that is not captured by axis-aligned kernels. 
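
The kernel construction above is concrete enough to sketch directly: an axis-angle vector is mapped to SO(3) via the exponential map (Rodrigues' formula), the SPD metric is built from three principal length-scales and that rotation, and the metric defines an anisotropic squared-exponential kernel. A minimal NumPy version, assuming this standard construction (the paper's Bayesian inference over these parameters is not shown):

```python
import numpy as np

def exp_so3(w):
    """Axis-angle vector -> rotation matrix via the Lie-algebra exponential
    (Rodrigues' formula). w lives in unconstrained Euclidean coordinates."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def aniso_rbf(x, y, lengthscales, w):
    """Rotationally anisotropic squared-exponential kernel with SPD metric
    M = R diag(1/l^2) R^T; length-scales and orientation stay interpretable."""
    R = exp_so3(w)
    M = R @ np.diag(1.0 / np.asarray(lengthscales) ** 2) @ R.T
    d = x - y
    return np.exp(-0.5 * d @ M @ d)

x = np.array([1.0, 0.0, 0.0]); y = np.zeros(3)
k = aniso_rbf(x, y, [0.5, 2.0, 2.0], w=np.array([0.0, 0.0, np.pi / 2]))
print(k)   # rotating by pi/2 about z moves the short axis off x, so correlation is high
```

Any w yields a valid rotation and hence a valid SPD metric, which is what makes the axis-angle coordinates convenient for MCMC: the sampler moves in plain R^3 while the induced kernel is always positive definite.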

[LG-149] From raw data to neutrino candidates: a neural-network pipeline for Baikal-GVD

链接: https://arxiv.org/abs/2605.11176
作者: A. Matseiko(1, 2),G. Plotnikov(1),I. Kharuk(1, 2) ((1) Institute for Nuclear Research of the Russian Academy of Sciences, (2) Moscow Institute of Physics and Technology)
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:

点击查看摘要

Abstract:We present a neural-network-based data processing pipeline for Baikal-GVD, designed to improve event reconstruction quality and accelerate neutrino candidate selection. The pipeline comprises three stages: fast suppression of extensive air shower events, suppression of noise optical module activations, and extraction of high-confidence neutrino candidates. All three networks employ a transformer architecture that exploits inter-hit correlations through the attention mechanism. Applied sequentially, the pipeline achieves orders-of-magnitude speedup over the standard reconstruction chain. Moreover, the noise-suppression neural network surpasses the accuracy of algorithmic noise-suppression methods and provides an estimate of the time residuals of the signal hits, which is crucial for identification of track-like hits. We address the domain shift between Monte Carlo simulations and experimental data by incorporating a domain adaptation technique, demonstrating improved agreement between the two domains. The resulting framework enables near-real-time event classification, with direct applications to multi-messenger alert systems and diffuse neutrino flux measurements.

[LG-150] Uniform Scaling Limits in AdamW-Trained Transformers

链接: https://arxiv.org/abs/2605.11059
作者: William Gibson,Christoph Reisinger
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in L^2 , uniformly over the initial condition, to the solution of a forward–backward system of ODEs at rate \mathcal{O}(L^{-1}+L^{-1/3}H^{-1/2}) . Here, L and H denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean–Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are uniform over compact sets of initial conditions. As this is achieved without resorting to a covering argument, the constants in our bounds are independent of the number of tokens. Furthermore, under a suitable adaptation to AdamW, the bounds become independent of the token embedding dimension.

[LG-151] Acceleration of horizontal numerical advection for atmospheric modeling through surrogate modeling with temporal coarse-graining

链接: https://arxiv.org/abs/2605.10956
作者: Manho Park,Christopher V. Rackauckas,Christopher W. Tessum
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine-learned surrogate modeling of advection may accelerate geoscientific models, but existing approaches have either achieved limited speedup or have sacrificed spatial resolution compared to the model they are trained to emulate. We developed a machine-learned solver that speeds up advection simulations without sacrificing spatial resolution through the use of temporal coarse-graining, where the model is trained to take larger integration steps than dictated by the Courant-Friedrich-Lewy (CFL) condition. Our solver framework includes a convolutional neural network that takes concentrations and CFL numbers as inputs and outputs mass flux. Our solvers emulate 10-day ground-level horizontal advection simulations with r^2 values against the baseline ranging from 0.60–0.98 with temporal coarsening factors of 4 to 32 times the baseline integration time step. Speed increases and accuracy decreases with increased coarsening, with r^2 = 0.24 in accuracy lost for every factor of 10 gained in speed, reaching a maximum 92 \times speedup while maintaining r^2 = 0.60 . We deliberately trained our solvers only on January ground-level wind data to examine their ability to generalize across seasons and vertical heights. The 4 \times -coarsened learned solver successfully reproduces simulations over 72 vertical levels. The 8 \times – 16 \times solvers (but not 32 \times ) emulate most vertical levels. The learned solvers also generalize well across seasons, except for instabilities in June and October. With additional fine-tuning, these learned solvers could be appropriate for operational use where trading accuracy for speed could be advantageous, such as in screening tools, in ensemble simulations, or with data assimilation.
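
The baseline the surrogate out-steps is a CFL-limited explicit advection solver. A minimal 1-D sketch of such a baseline, a first-order upwind scheme in flux form on a periodic grid (the paper's setting is 2-D horizontal advection, and all values here are illustrative):

```python
import numpy as np

def upwind_step(c, cfl):
    """One explicit first-order upwind step for 1-D advection on a periodic
    grid (wind in +x). Stable only for 0 <= cfl <= 1; the learned solver is
    trained to take steps exceeding this limit."""
    return c - cfl * (c - np.roll(c, 1))   # flux form: mass is conserved exactly

nx = 200
x = np.linspace(0.0, 1.0, nx, endpoint=False)
c0 = np.exp(-200 * (x - 0.3) ** 2)         # Gaussian tracer puff
c = c0.copy()
for _ in range(100):
    c = upwind_step(c, cfl=0.5)            # 100 steps at half the CFL limit
print(np.argmax(c0), np.argmax(c))         # the puff has moved downwind
```

Temporal coarse-graining amounts to replacing many such sub-CFL steps with one learned step, which is where the reported 4-92x speedups come from.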

[LG-152] Interpretable rainfall modelling reveals rapid reorganisation of Amazonian rainfall under vegetation loss

链接: https://arxiv.org/abs/2605.10948
作者: Lilly Horvath-Makkos,Fayyaz Minhas
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 31 pages, 5 figures in the main text, 3 figures in the supplementary text

点击查看摘要

Abstract:Understanding how vegetation loss alters rainfall remains a major challenge in climate and hydrological science, as deforestation modifies precipitation through heterogeneous, seasonal and nonlinear land-atmosphere feedbacks. Existing models struggle to capture these dynamics: convection is parameterised at coarse scales, tipping behaviour is poorly constrained, and rainfall-deforestation analyses are limited to multi-decadal timescales. Therefore, many approaches resolve correlations rather than causal effects, limiting our ability to anticipate hydrological disruption. Using a neural-network model for hourly rainfall prediction, combined with pathway diagnostics and sensitivity analyses, we examine how vegetation perturbations reorganise rainfall across space, intensity regimes, and timescales under deforestation. We assess whether the model captures physically consistent dependencies linking vegetation, atmospheric state, and precipitation, and whether sustained canopy loss induces threshold behaviour. The model accurately predicts rainfall occurrence and intensity (Spearman = 0.84, F1 = 0.93, ROC-AUC = 0.98) and learns temporally ordered dependencies aligned with ecohydrological theory. Sensitivity analyses reveal rapid, asymmetric responses to vegetation loss: heavy rainfall (20-50 mm/h) declines by up to 7% under sustained deforestation, while light rainfall (0.1-1 mm/h) increases by 4%. Rainfall entropy rises by 1.3%, and dry-season intensity increases by 0.3-0.5% per 0.5% forest-cover loss, with strongest impacts in the north-western Amazon and Andean foothills. Threshold analysis reveals a sharp decline in precipitating area fraction after 2-3 months of sustained vegetation change in sensitive regions. These results demonstrate that data-driven approaches uncover process-relevant land-atmosphere coupling and highlight growing hydrological vulnerability in the Amazon.

[LG-153] Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

链接: https://arxiv.org/abs/2605.10395
作者: Minh-Toan Nguyen,Jean Barbier
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width k scales linearly with the input dimension d – a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} k_c – the number of learnable features at a given data budget n – which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error \varepsilon^{\rm BO} scales as n^{1/(2\beta)-1} , and a refinement regime in which it scales as n^{-1} , where \beta > 1/2 is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation \varepsilon^{\rm BO}=\Theta(k_c d/n) . We further show empirically that a student trained with \textsc{Adam} near the effective width k_c achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.

[LG-154] Adversarial Effects on Expressibility and Trainability in Distributed Variational Quantum Algorithms

链接: https://arxiv.org/abs/2605.03629
作者: Abhishek Sadhu,Sharu Theresa Jose
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: Comments are welcome

点击查看摘要

Abstract:Distributed quantum algorithms offer a promising pathway to scale variational quantum algorithms beyond the constraints of noisy intermediate-scale quantum hardware. However, existing approaches implicitly assume a trusted entanglement-sharing layer across quantum processors. We show that this assumption introduces a fundamental vulnerability: adversarial perturbations of shared entanglement induce structured gate-level noise that directly impacts quantum learning. We develop a framework that maps entanglement-level perturbations to gate-level noise via an explicit Kraus representation. To quantify their impact, we introduce Kraus expressibility, a metric that generalizes unitary expressibility to noisy quantum channels. We then establish a trade-off between Kraus expressibility and trainability of noisy quantum circuits through gradient variance analysis. Our analysis reveals that an adversary can manipulate Kraus expressibility to maintain sufficiently large cost gradients (avoiding barren plateaus) while systematically biasing optimization toward incorrect solutions. We validate these findings through numerical simulations, demonstrating adversarial degradation of expressibility and trainability.
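
The mapping from perturbations to gate-level noise runs through the Kraus representation of a quantum channel, \rho' = \sum_i K_i \rho K_i^\dagger. A self-contained NumPy sketch using the single-qubit depolarizing channel as an illustrative channel (not the adversarial entanglement-noise model from the paper):

```python
import numpy as np

def apply_channel(rho, kraus_ops):
    """Apply a quantum channel in Kraus form: rho' = sum_i K_i rho K_i^dagger."""
    return sum(K @ rho @ K.conj().T for K in kraus_ops)

# Single-qubit depolarizing channel with error probability p.
p = 0.2
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1, -1]).astype(complex)
kraus = [np.sqrt(1 - 3 * p / 4) * I] + [np.sqrt(p / 4) * P for P in (X, Y, Z)]

rho = np.array([[1, 0], [0, 0]], dtype=complex)   # pure state |0><0|
rho_out = apply_channel(rho, kraus)
print(np.diag(rho_out).real)   # population leaks from |0> to |1>
```

Completeness of the Kraus operators (sum K_i^\dagger K_i = I) guarantees trace preservation, which is why structured perturbations of this form act as physical noise on the circuit rather than arbitrary maps.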

[LG-155] Testing General Relativity Through Gravitational Wave Classification: A Convolutional Neural Network Framework

链接: https://arxiv.org/abs/2605.02453
作者: Lavinia Heisenberg,Shayan Hemmatyar,Hector Villarrubia-Rojo
类目: General Relativity and Quantum Cosmology (gr-qc); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注: 36 pages, 20 figures, 4 tables. Comments welcome!

点击查看摘要

Abstract:We present a machine learning framework for testing general relativity (GR) with gravitational wave signals from binary black hole mergers. Using the source parameters of 173 BBH events from the GWTC catalog as a realistic astrophysical population, we generate simulated GR waveforms and construct beyond GR (BGR) waveforms by applying controlled phase deformations. We introduce a response function formalism that provides a systematic framework for quantifying how any observable responds to modifications of GR. We train convolutional neural networks (CNNs) on two input representations: whitened waveforms and a response function type observable derived from the waveform mismatch, which isolates the effect of phase deviations from the bulk signal. Using response functions as the CNN input improves the classification sensitivity by a factor of approximately 33 compared to whitened waveforms, demonstrating that the choice of observable representation is as important as the classifier architecture. We study the fundamental limits of this classification through Bayes optimal error analysis, averaging methods that reveal coherent patterns hidden in noise, and a comparison between CNN accuracy and a single feature classifier as a proxy for human performance. At all deformation scales, the CNN outperforms the best single feature approach. We extend the framework to physically motivated theories using the parameterized post-Einsteinian (ppE) formalism and apply it to massive gravity, where the classifier detects deviations for graviton masses of order m_g \sim 10^{-23}\;\mathrm{eV}/c^2 with aLIGO design sensitivity.
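
The mismatch-based observable at the heart of the response-function representation can be sketched directly: mismatch = 1 - normalized overlap between a GR waveform and its phase-deformed counterpart. The chirp, the deformation coefficient, and the flat-noise inner product below are toy stand-ins, not the paper's waveform model or detector-noise weighting.

```python
import numpy as np

def overlap(h1, h2):
    """Normalized overlap between two waveforms; mismatch = 1 - overlap.
    (A real analysis uses a detector-noise-weighted inner product.)"""
    inner = lambda a, b: float(np.sum(a * b))
    return inner(h1, h2) / np.sqrt(inner(h1, h1) * inner(h2, h2))

t = np.linspace(0.0, 1.0, 4096)
phase = 2 * np.pi * (20 * t + 15 * t**2)   # toy chirp phase
h_gr = np.sin(phase)                       # "GR" waveform
h_bgr = np.sin(phase + 0.02 * phase)       # controlled fractional phase deformation
mismatch = 1.0 - overlap(h_gr, h_bgr)
print(mismatch)                            # small but nonzero: the signal the CNN must detect
```

Because the mismatch isolates the phase deviation from the bulk signal, even tiny deformations produce a nonzero statistic, which is the intuition behind the large sensitivity gain over raw whitened waveforms.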

附件下载

点击下载今日全部论文列表