This post contains the latest paper listing retrieved from Arxiv.org on 2026-05-14. It is updated automatically and grouped into seven major areas: NLP, CV, ML, AI, IR, MA, and HC.

Note: Paper data is fetched from Arxiv.org every day, with a scheduled automatic update at around 12:30 each morning.

Tip: If a given day's list has not been updated, either Arxiv published no new papers that day or the update script failed. Fixes are applied the same day whenever possible.

Contents

Overview (2026-05-14)

A total of 819 papers were updated today, broken down as follows (cross-listed papers appear in more than one category):

  • Computation and Language (cs.CL): 106 papers
  • Artificial Intelligence (cs.AI): 260 papers
  • Computer Vision and Pattern Recognition (cs.CV): 160 papers
  • Machine Learning (cs.LG): 295 papers
  • Multiagent Systems (cs.MA): 27 papers
  • Information Retrieval (cs.IR): 19 papers
  • Human-Computer Interaction (cs.HC): 26 papers

Multiagent Systems

[MA-0] EconAI: Dynamic Persona Evolution and Memory-Aware Agents in Evolving Economic Environments

Quick Read: This paper targets a limitation of existing LLM-based economic simulations: agents struggle to balance short-term optimization against long-term strategic planning, because conventional static, data-driven prediction cannot capture economic sentiment, market volatility, or goal-driven adaptive behavior. The key of the proposed EconAI framework is the combination of an Economic Sentiment Index (ESI), memory weighting, and a dynamic decision-making mechanism: the ESI quantifies agents' beliefs about the economic environment, memory weighting dynamically adjusts the influence of historical data, and a coupling between work and consumption behavior is established, so that agents make human-like decisions driven by both market signals and long-term goals. It is the first LLM-powered system to simulate macro/microeconomic environments and interactions in a unified framework; empirical results show improved stability of economic responses, more faithful reproduction of employment-consumption cycles, and enhanced decision robustness.

Link: https://arxiv.org/abs/2605.13762
Authors: Annie Liu, Zane Cao, Lang Chen, Zongxin Xu, Zigan Wang
Affiliations: Tsinghua University
Subjects: Multiagent Systems (cs.MA)
Comments:

View abstract

Abstract:The integration of large language models (LLMs) in economic simulations has significantly enhanced agent-based modeling, yet existing frameworks struggle to capture the interplay between short-term optimization and long-term strategic planning. Conventional approaches rely on static data-driven predictions, failing to incorporate adaptive behaviors influenced by economic sentiment, market volatility, and individual goals. To address these limitations, we introduce a novel EconAI framework, incorporating economic sentiment indexing (ESI), memory weighting, and dynamic decision-making mechanisms. By quantifying economic belief, adjusting historical data influence, and linking work-consumption behaviors, EconAI achieves a more human-like decision process, where agents adapt their actions based on both market signals and long-term objectives. It is the first LLM-powered simulation system that can simulate the macro/microeconomic environment and interactions in a unified framework. Empirical evaluations show that EconAI improves stability in economic responses, better replicates real-world employment-consumption cycles, and enhances overall decision robustness. This advancement marks a crucial step towards more realistic, adaptive economic agent simulations.
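The abstract gives no formulas, but the memory-weighting idea — letting recent economic observations outweigh older ones when an agent forms its belief — can be illustrated with a toy sketch. The function name, the exponential-decay scheme, and all numbers below are assumptions for illustration, not the authors' implementation:

```python
def memory_weighted_sentiment(history, decay=0.8):
    """Toy memory weighting: recent observations of an economic signal
    count more than older ones via exponential decay (assumed scheme)."""
    weights = [decay ** age for age in range(len(history))]  # age 0 = newest
    newest_first = list(reversed(history))
    total = sum(w * x for w, x in zip(weights, newest_first))
    return total / sum(weights)

# An agent could compare such an index against a threshold to adapt
# its work/consumption choices; 0.9 is the most recent observation.
sentiment = memory_weighted_sentiment([0.2, 0.4, 0.9])
```

In this sketch the index lands closer to the latest observation than a plain average would, which is the behavioral point of down-weighting stale history.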

[MA-1] SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems NEURIPS2026

Quick Read: This paper addresses skill technical debt in LLM agents that rely on skill libraries for multi-step tasks: as skills are continually added, reused, patched, and linked to changing dependencies, library-level defects accumulate that do not break any single skill locally but harm future retrieval, composition, and execution. Existing approaches focus on task-time retrieval, planning, and repair, leaving library-time maintenance underexplored. The key of the proposed SkillOps framework is to represent each skill as a typed Skill Contract spanning five dimensions (parameters P, outputs O, assertions A, variables V, and functions F), organize skills with a Hierarchical Skill Ecosystem Graph, and diagnose library health along utility, compatibility, risk, and validation dimensions. SkillOps works as a method-agnostic plug-in: after maintaining a raw skill library, the result can be used directly by existing retrieval or planning agents without modifying their internal code. On ALFWorld, SkillOps achieves 79.5% task success as a standalone agent, 8.8 percentage points above the strongest baseline with no additional task-time LLM calls; as a plug-in layer it also improves retrieval-heavy baselines by 0.68 to 2.90 percentage points, and its rule-based maintenance consumes nearly zero library-time LLM calls or tokens, demonstrating that skill-library maintenance can be added to existing systems as a low-overhead architectural layer.

Link: https://arxiv.org/abs/2605.13716
Authors: Hongji Pu, Xinyuan Song, Liang Zhao
Affiliations: Emory University; University of Illinois Urbana-Champaign
Subjects: Software Engineering (cs.SE); Multiagent Systems (cs.MA)
Comments: 23 pages, 9 figures. Submitted to NeurIPS 2026. Code is available at this https URL

View abstract

Abstract:Large language model agents increasingly rely on skill libraries for multi-step tasks, yet these libraries can accumulate persistent defects as skills are added, reused, patched, and linked to changing dependencies. We call this failure mode skill technical debt: library-level defects that may not break a single skill locally but can harm future retrieval, composition, and execution. Existing skill-based agents mainly focus on task-time retrieval, planning, and repair, while library-time maintenance remains underexplored. We propose SkillOps, a method-agnostic plug-in framework for maintaining skill libraries. SkillOps represents each skill as a typed Skill Contract (P, O, A, V, F), organizes skills with a Hierarchical Skill Ecosystem Graph, and diagnoses library health across utility, compatibility, risk, and validation dimensions. Given a raw skill library, SkillOps produces a maintained library that can be used by existing retrieval or planning agents without changing their internal code. On ALFWorld, SkillOps achieves 79.5 percent task success as a standalone agent, outperforming the strongest baseline by 8.8 percentage points with no additional task-time large language model calls. As a plug-in layer, it improves retrieval-heavy baselines by 0.68 to 2.90 percentage points. The current rule-based maintenance implementation uses nearly zero library-time large language model calls or tokens, showing that skill-library maintenance can be added as a low-overhead architectural layer.
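As a rough illustration of the typed (P, O, A, V, F) Skill Contract and one kind of library-health check, one might sketch the following. The field types, example contents, and the stale-dependency check are invented for illustration; the paper does not spell out its contract format here:

```python
from dataclasses import dataclass

@dataclass
class SkillContract:
    """Toy rendering of the paper's typed (P, O, A, V, F) contract;
    the concrete field types below are illustrative assumptions."""
    params: dict        # P: input parameters and their types
    outputs: dict       # O: declared outputs
    assertions: list    # A: checkable pre/post conditions
    variables: dict     # V: state the skill reads or writes
    functions: list     # F: tools/skills this skill depends on

def stale_dependencies(contract, available_functions):
    """One toy 'library health' diagnostic: flag dependencies that no
    longer exist in the library (a form of skill technical debt)."""
    return [f for f in contract.functions if f not in available_functions]

c = SkillContract(params={"obj": "str"}, outputs={"done": "bool"},
                  assertions=["obj is reachable"], variables={},
                  functions=["goto", "pickup"])
missing = stale_dependencies(c, {"goto"})  # "pickup" was removed -> debt
```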

[MA-2] Unweighted ranking for value-based decision making with uncertainty

Quick Read: This paper addresses the normative bias that arises when intelligent systems making autonomous decisions are aligned with human values: subjective weights assigned by stakeholders can compromise the fairness and safety of decisions, and existing methods lack the ability to integrate quantitative and qualitative criteria. The key of the solution is the Fuzzy-Unweighted Value-Based Decision Making (FUW-VBDM) framework, which removes prior weights and introduces a fuzzy domain over decision variables, recasting any value-based decision problem as a search for feasible solutions that optimize a score function over the weight domain. Its core is Rankzzy, a customizable unweighted ranking method that quantifies uncertainty through integrated fuzzy-based reasoning; its consistency is proven mathematically for any admissible configuration, it reduces computational cost on large-scale value-based decision problems, and, using aggregation via Pythagorean means, it maintains ranking performance on par with existing approaches.

Link: https://arxiv.org/abs/2605.13601
Authors: Aarón López García, Natalia Criado, Jose Such
Affiliations: Universitat de València; Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 21 pages

View abstract

Abstract:As intelligent systems are increasingly implemented in our society to make autonomous decisions, their commitment to human values raises serious concerns. Their alignment with human values remains a critical challenge because it can jeopardise the integrity and security of citizens. For this reason, an innovative human-centred and values-driven approach to decision making is required. In this work, we introduce the Fuzzy-Unweighted Value-Based Decision Making (FUW-VBDM) framework, where agents incorporate both quantitative and qualitative criteria to generate human-centred decisions. We also address the normative bias introduced by stakeholders with arbitrary weights by removing prior weights and introducing a fuzzy domain of decision variables defined for a score function. This concept allows us to generalise any VBDM problem as the search for feasible solutions when optimising the score in the weight domain. To provide a solution to FUW-VBDM, we present Rankzzy, a customizable unweighted ranking method that integrates fuzzy-based reasoning to quantify uncertainty. We mathematically prove the consistency of the Rankzzy for any admissible configuration selected by stakeholders. We show the applicability of our method through an illustrative case study, which we also use as a running example. The evaluation conducted indicates a reduced computational cost in large-scale value-based decision-making problems and a strong rank performance regarding existing approaches when employing the aggregation via Pythagorean means.
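The Pythagorean means mentioned as the aggregation step are the classical arithmetic, geometric, and harmonic means. A minimal unweighted-ranking sketch using them — alternative names and per-criterion scores invented, and far simpler than Rankzzy's fuzzy machinery — could be:

```python
import math

def pythagorean_means(scores):
    """Arithmetic, geometric, and harmonic means of positive scores.
    By the AM-GM-HM inequality, the three come out in that order."""
    n = len(scores)
    arithmetic = sum(scores) / n
    geometric = math.prod(scores) ** (1 / n)
    harmonic = n / sum(1 / s for s in scores)
    return arithmetic, geometric, harmonic

# Unweighted ranking idea: order alternatives by an aggregate of their
# per-criterion scores instead of a stakeholder-weighted sum. Here the
# geometric mean penalizes alternatives that are weak on any criterion.
alternatives = {"A": [0.9, 0.4], "B": [0.6, 0.7]}
ranked = sorted(alternatives,
                key=lambda k: pythagorean_means(alternatives[k])[1],
                reverse=True)
```

With the invented scores above, the balanced alternative "B" outranks "A" despite "A" having the single highest criterion score.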

[MA-3] RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Quick Read: This paper addresses a fundamental flaw in existing benchmarks for ICU clinical decision support: historical clinician actions are treated as ground truth, yet those actions were taken under incomplete information and limited temporal context of the patient's state and may be suboptimal, making them a poor measure of the true reasoning ability of AI systems, LLMs in particular. The key of the proposed RealICU benchmark is hindsight annotation: labels are created by senior physicians after reviewing the full patient trajectory, providing an evaluation standard grounded in a complete view. Concretely, the benchmark defines four clinical tasks (patient status assessment, acute problem identification, recommended actions, and red-flag actions that risk unsafe outcomes), partitions trajectories into 30-minute windows, and releases two datasets (RealICU-Gold and RealICU-Scale). The paper also introduces the ICU-Evo structured-memory agent to explore improvements in long-horizon reasoning, but its core contribution is a clinically grounded testbed that exposes a recall-safety tradeoff and an anchoring bias toward early interpretations in existing models, advancing trustworthy AI sequential decision support in high-stakes care.

Link: https://arxiv.org/abs/2605.13542
Authors: Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen (Cherise) Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan
Affiliations: Technical University of Munich (TUM); TUM University Hospital; LMU Munich; University of Sheffield; University of Oxford; Zhongshan Hospital Fudan University; Sun Yat-sen University Cancer Center; Imperial College London; Munich Center for Machine Learning (MCML); relAI – Konrad Zuse School of Excellence in Reliable AI
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

View abstract

Abstract:Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30-min windows and release two datasets: RealICU-Gold with 930-window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs including memory-augmented ones performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents that improves long-horizon reasoning but does not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision-support in high-stakes care. Project page: this https URL
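The 30-minute windowing of trajectories described above is straightforward to sketch; the timestamps and function name below are invented for illustration:

```python
from datetime import datetime, timedelta

def partition_windows(start, end, width=timedelta(minutes=30)):
    """Split a patient trajectory [start, end) into fixed-width
    evaluation windows; a short final window covers the remainder."""
    windows, t = [], start
    while t < end:
        windows.append((t, min(t + width, end)))
        t += width
    return windows

# A 75-minute stay yields three windows: 8:00-8:30, 8:30-9:00, 9:00-9:15.
wins = partition_windows(datetime(2020, 1, 1, 8, 0),
                         datetime(2020, 1, 1, 9, 15))
```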

[MA-4] Constitutional Governance in Metric Spaces

Quick Read: This paper addresses the lack, in computational social choice and algorithmic decision theory, of an end-to-end polynomial-time process for egalitarian self-governance: prior work treats aggregation, deliberation, amendment, and consensus in isolation, and key metric-space aggregators are NP-hard. The key of the solution is a constitutional-governance framework in metric spaces that integrates these stages into one polynomial-time process. Its central mechanism: the constitution assigns each amendable component a metric space, an aggregation rule, and a supermajority threshold; each member submits an ideal element (serving as both vote and personal proposal), after which any member may submit a public proposal carrying supermajority public support (sourced from coalition deliberation, optimization, or AI mediation); the constitutional rule scores proposals against the status quo and adopts the supported proposal of maximal positive score (else the status quo is retained), while the constitution itself is amended by the same rule, possibly with a higher threshold. The worked rule is the generalised median; the framework establishes framework-level guarantees, proves that no misreport weakly dominates sincere voting, and studies the compromise gap between the best peak and the unconstrained optimum (zero in one dimension, bounded in general, narrowed in simulation by a simple heuristic). Instantiated on seven canonical settings, and by unifying metric-space aggregation, reality-aware social choice, supermajority amendment, constitutional consensus, deliberative coalition formation, and AI mediation, it delivers a comprehensive solution for the constitutional democratic governance of digital communities and organisations.

Link: https://arxiv.org/abs/2605.13362
Authors: Ehud Shapiro, Nimrod Talmon
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
Comments:

View abstract

Abstract:Computational social choice and algorithmic decision theory offer rich aggregation theory but no end-to-end, polynomial-time process for egalitarian self-governance: prior work treats aggregation, deliberation, amendment, and consensus in isolation, and key metric-space aggregators are NP-hard. We propose constitutional governance in metric spaces, integrating these stages into one polynomial-time process. The constitution assigns, per amendable component, a metric space, aggregation rule, and supermajority threshold. Each member submits an ideal element – both vote and personal proposal. Any member may then submit a public proposal carrying supermajority public support under the revealed votes – sourced from coalition deliberation, optimization, or AI mediation. The constitutional rule scores proposals against the status quo, adopting the supported proposal of positive maximal score (else retaining the status quo); the same rule, possibly with a higher threshold, amends the constitution itself. We develop the generalised median as the worked rule, establish framework-level guarantees, prove no misreport weakly dominates sincere voting, and study the compromise gap between best peak and unconstrained optimum – zero in one dimension, bounded in general, narrowed in simulation by a simple heuristic. We instantiate the framework on seven canonical settings; the mean appears as a utilitarian alternative in the appendix. By unifying metric-space aggregation, reality-aware social choice, supermajority amendment, constitutional consensus, deliberative coalition formation, and AI mediation, this work delivers a comprehensive solution to the constitutional democratic governance of digital communities and organisations.
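The generalised median used as the worked rule picks the element minimising the total distance to members' ideal points. A brute-force toy version over a finite candidate set — not the paper's polynomial-time construction, and with all values invented — might be:

```python
def generalised_median(ideals, candidates, dist):
    """Toy generalised median: the candidate minimising the sum of
    metric distances to the members' ideal elements (brute force)."""
    return min(candidates, key=lambda c: sum(dist(c, p) for p in ideals))

# 1-D example with the absolute-value metric, where the generalised
# median coincides with the ordinary median of the ideal points.
ideals = [1, 2, 9]
winner = generalised_median(ideals, candidates=range(0, 10),
                            dist=lambda a, b: abs(a - b))
```

This also illustrates the "zero compromise gap in one dimension" remark: on a line, the sum-of-distances minimiser is itself one of the median peaks.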

[MA-5] Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin

Quick Read: This paper addresses the real-world challenges emergency departments (EDs) face in patient care and resource management; the central question is how to explore effective resource-optimization strategies within a model that is both realistic and flexible. The key of the solution is a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) framework, developed and validated to simulate highly configurable ED environments of varying sizes. On top of this, a proof-of-concept Multi-Agent System (MAS) is integrated that, using a temporal ledger of ED event records, can autonomously explore and implement resource-allocation strategies. The modular design of the overall DES-ABM-MAS framework lets the model effectively replicate the dynamics of real EDs, providing a powerful tool for evaluating and discovering resource-optimization strategies.

Link: https://arxiv.org/abs/2605.13345
Authors: Markus Wenzel, Tobias Strapatsas, Jessika Kress, Dorothea Sauer, Nele Gessler, Horst K. Hahn
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

View abstract

Abstract:Emergency departments (ED) face challenges in patient care and resource management. We propose to explore optimization strategies in a realistic and flexible model and develop a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) simulating highly configurable ED environments. We specifically focus on the validation of the modeling approach. We derive configurations for ED sizes, patient load, and staffing from real-world studies. We then validate the model expressivity by matching its key performance indicators and metrics with their values known from literature. We proceed by implementing scientifically established and practice-proven resource optimization strategies. Comparing the documented real-world outcomes with our model’s results demonstrates that the DES-ABM based simulation can effectively replicate real-world ER dynamics under interventions. We lastly integrate a Proof-of-Concept multi-agent system (MAS) that can autonomously explore resource allocation strategies within the simulated ER environment based on a temporal ledger of ED event records. This modular DES-ABM-MAS framework offers a powerful tool to explore resource optimization strategies in emergency departments.
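The discrete-event half of a hybrid DES-ABM rests on a standard event loop over a priority queue. The toy ED event types and the 5-minute delay below are invented for illustration, not taken from the paper:

```python
import heapq

def run_des(events):
    """Minimal discrete-event loop: pop events in time order and let
    each handler schedule follow-up events. `events` holds
    (time, seq, handler) entries; seq breaks ties deterministically."""
    heapq.heapify(events)
    log, seq = [], len(events)
    while events:
        t, _, handler = heapq.heappop(events)
        log.append((t, handler.__name__))
        for dt, follow_up in handler(t):
            heapq.heappush(events, (t + dt, seq, follow_up))
            seq += 1
    return log

# Toy ED flow: a patient arrives, then is triaged 5 minutes later.
def triage(t):
    return []            # no follow-up events in this sketch

def arrival(t):
    return [(5, triage)]  # schedule triage 5 minutes after arrival

log = run_des([(0, 0, arrival)])
```

An ABM layer would replace these stateless handlers with agents (patients, staff) whose state changes at each event, which is the coupling the paper validates.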

[MA-6] IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation

Quick Read: This paper addresses problems in current AI-assisted innovation systems where a single methodology (such as TRIZ or Design Thinking) and sequential prompt-based workflows discard intermediate reasoning structure, fragment cross-methodology insights, and prevent systematic evaluation of the traceability, synthesis, and novelty of innovation candidates. The key of the solution is the IdeaForge framework: a knowledge-graph-grounded multi-agent system that integrates multiple innovation methodologies (TRIZ, Design Thinking, and SCAMPER), with specialist agents operating over a persistent FalkorDB knowledge graph, each contributing structured entities and relationships (contradictions, inventive principles, user needs, transformations, analogies, and candidate claims). The central innovation is a cross-methodology convergence mechanism implemented through graph-based claim linkage: claims independently supported by multiple methodologies are connected with CONVERGENT relationships, so high-confidence innovation candidates can be identified by graph traversal; a downstream patent drafting agent then generates structured patent drafts grounded in convergent claim subgraphs, reducing reliance on unconstrained language-model generation. An InnovationScore formula further ranks claims by convergent support, methodology diversity, claim strength, and prior-art challenge count, yielding explainable, traceable multi-methodology synthesis.

Link: https://arxiv.org/abs/2605.13311
Authors: Joy Bose
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: 14 pages, 3 figures, 6 tables

View abstract

Abstract:Current AI-assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt-based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge graph-grounded multi-agent framework for innovation analysis and patent claim generation. IdeaForge integrates multiple innovation methodologies (TRIZ, Design Thinking, and SCAMPER) through specialist agents operating over a persistent FalkorDB knowledge graph. Each agent contributes structured entities and relationships representing contradictions, inventive principles, user needs, transformations, analogies, and candidate claims. The central contribution of IdeaForge is a cross-methodology convergence mechanism implemented through graph-based claim linkage. Claims independently supported by multiple methodologies are connected using CONVERGENT relationships, enabling identification of high-confidence innovation candidates through graph traversal. A downstream patent drafting agent generates structured patent drafts grounded in convergent claim subgraphs, reducing reliance on unconstrained language model generation. An InnovationScore formula ranks claims by convergent support, methodology diversity, claim strength, and prior art challenge count. We describe the graph schema, agent architecture, convergence detection pipeline, and patent synthesis workflow. Experiments on a legal technology use case demonstrate that graph-grounded multi-methodology synthesis produces more diverse and traceable innovation candidates compared to single-methodology baselines. We discuss implications for computational creativity, explainable AI-assisted invention, and graph-native innovation systems.
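The convergence-detection idea — linking claims that several methodologies support independently — reduces at its core to counting distinct supporting methodologies per claim. A minimal sketch, with claim and methodology labels invented and no actual graph database behind it, could be:

```python
from collections import defaultdict

def convergent_claims(support, min_methodologies=2):
    """Toy convergence detection: a claim counts as 'convergent' when
    it is independently supported by at least `min_methodologies`
    distinct methodologies (TRIZ, Design Thinking, SCAMPER, ...)."""
    backing = defaultdict(set)
    for methodology, claim in support:
        backing[claim].add(methodology)
    return {c for c, ms in backing.items() if len(ms) >= min_methodologies}

# Edges a specialist agent might have written into the knowledge graph.
edges = [("TRIZ", "claim-1"), ("SCAMPER", "claim-1"),
         ("DesignThinking", "claim-2")]
high_confidence = convergent_claims(edges)
```

In the paper this relation is materialized as CONVERGENT edges in FalkorDB so that candidates can be found by graph traversal rather than recomputation.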

[MA-7] Discrete Diffusion for Complex and Congested Multi-Agent Path Finding with Sparse Social Attention

【速读】:该论文试图解决密集环境下多智能体路径规划(Multi-Agent Path Finding, MAPF)中,次优初始计划引发复合冲突、阻碍基于修复的求解器(如LNS2)可行修复的问题。解决方案的关键在于提出DiffLNS混合框架,该框架将离散去噪扩散概率模型(D3PM)与LNS2相结合:D3PM作为初始化器,通过稀疏社交注意力从专家演示中学习协调多智能体动作轨迹的时空先验,并在分类动作空间上直接从多模态联合计划分布中采样多个多样的草稿计划,这些草稿作为热启动(warm start)提供给下游LNS2修复模块,从而在硬MAPF约束下完成未完成的轨迹并解决剩余冲突。

Link: https://arxiv.org/abs/2605.13296
Authors: Yuanzhe Wang, Tian Zhi, Zihang Wei, Hongguang Wang, Jiaming Guo, Yang Zhao, Zisheng Liu, Shiyu Quan, Xing Hu, Zidong Du, Yunji Chen
Affiliations: State Key Lab of Processors, Institute of Computing Technology, CAS; School of Advanced Interdisciplinary Sciences, CAS; University of Chinese Academy of Sciences; Institute of Microelectronics, CAS
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 24 pages, 7 figures

View abstract

Abstract:Multi-Agent Path Finding (MAPF) is a coordination problem that requires computing globally consistent, collision-free trajectories from individual start positions to assigned goal positions under combinatorial planning complexity. In dense environments, suboptimal initial plans induce compound conflicts that hinder feasible repair. For repair-based solvers like LNS2, initial plan quality critically affects downstream repair, yet this factor remains underexplored. We propose DiffLNS, a hybrid framework that integrates a discrete denoising diffusion probabilistic model (D3PM) with LNS2. The D3PM serves as an initializer with sparse social attention that learns a spatiotemporal prior over coordinated multi-agent action trajectories from expert demonstrations and samples multiple joint plans. Operating directly on the categorical action space, our discrete diffusion preserves the MAPF action structure and samples from a multimodal joint-plan distribution to produce diverse drafts well suited for neighborhood repair. These drafts act as warm starts for downstream repair, which completes unfinished trajectories and resolves remaining conflicts under hard MAPF constraints. Experimental results show that despite being trained only on instances with at most 96 agents, the initializer generalizes to scenarios with up to 312 agents at inference time. Across 20 complex and congested settings, DiffLNS achieves an average success rate of 95.8%, outperforming the strongest tested baseline by 9.6 percentage points and matching or exceeding all baselines in all 20 settings. To the best of our knowledge, this is the first work to leverage discrete diffusion for warm-starting an LNS-based MAPF solver.

[MA-8] CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

Quick Read: This paper addresses the credit-assignment problem that arises when automating the configuration of LLM-based multi-agent systems: rewards are available only as system-level scores, while the parameters governing agent behavior (such as prompts) are local, making direct optimization difficult. The key of the solution is the CANTANTE framework, which decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query, treating agent prompts as learnable system parameters and thereby effectively automating multi-agent system configuration.

Link: https://arxiv.org/abs/2605.13295
Authors: Tom Zehle
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

View abstract

Abstract:LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these systems is fundamentally a credit-assignment problem. We therefore introduce CANTANTE, a framework that decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. We evaluate CANTANTE against GEPA and MIPROv2 on programming (MBPP), mathematical reasoning (GSM8K), and multi-hop question answering (HotpotQA). Across these benchmarks, CANTANTE achieves the best average rank among all evaluated optimizers and consistently outperforms unoptimized prompts. It improves over the strongest baseline by +18.9 percentage points on MBPP and +12.5 percentage points on GSM8K, while incurring a lower inference cost. It remains within one standard deviation of the strongest baseline on HotpotQA. Crucially, our credit correlation analysis confirms that the attributer produces meaningful per-agent signals rather than echoing the global system score.
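One simple way to read "contrasting rollouts of multiple joint configurations on the same query" is to score each agent-level choice by how the rollouts containing it deviate from the average score. The toy attribution below illustrates that reading only; it is not CANTANTE's actual estimator, and the agent and config names are invented:

```python
from statistics import mean

def per_agent_credit(rollouts):
    """Toy contrastive credit: each agent-config's credit is the mean
    score of rollouts that used it minus the overall mean score.
    `rollouts` is a list of (config_dict, system_level_score)."""
    overall = mean(score for _, score in rollouts)
    credit = {}
    for agent in rollouts[0][0]:
        by_value = {}
        for config, score in rollouts:
            by_value.setdefault(config[agent], []).append(score)
        credit[agent] = {v: mean(s) - overall for v, s in by_value.items()}
    return credit

# Four rollouts of joint (planner, coder) prompt choices on one query.
rollouts = [({"planner": "p1", "coder": "c1"}, 0.9),
            ({"planner": "p1", "coder": "c2"}, 0.7),
            ({"planner": "p2", "coder": "c1"}, 0.4),
            ({"planner": "p2", "coder": "c2"}, 0.2)]
credit = per_agent_credit(rollouts)
```

The contrast isolates each agent's contribution: here the planner choice explains more of the score spread than the coder choice, which is exactly the per-agent signal a global score alone cannot provide.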

[MA-9] Decoupled Planning for Multiple Omega-Regular Objectives

Quick Read: This paper studies the following problem: when generating a path on a graph that must satisfy multiple omega-regular objectives, how can a decoupled framework enable modular policy design, in which each objective is assigned to an independent agent that selects a local policy, and a scheduler oblivious to both the graph and the objectives dynamically composes those policies, while guaranteeing that the composed path satisfies all objectives whenever their conjunction is realizable? The key of the solution lies in distinguishing objective classes and introducing matching coordination mechanisms: for safety objectives, where fully decentralized implementation is impossible, a protocol for synchronizing on maximal safe actions is given; for non-safety objectives, conventions are introduced (a priori restrictions agreed by all agents before the graph and objectives are revealed), and minimally restrictive conventions are characterized for the major omega-regular subclasses: Büchi objectives admit universal composition of finite-memory policies without scheduler communication, co-Büchi objectives require only that each agent knows whether it was scheduled, and parity objectives additionally require knowing which agent was scheduled.

Link: https://arxiv.org/abs/2605.13185
Authors: Guy Avni, Thomas A. Henzinger, Kaushik Mallik, Suman Sadhukhan, K. S. Thejaswini
Affiliations: Unknown
Subjects: Formal Languages and Automata Theory (cs.FL); Multiagent Systems (cs.MA)
Comments: 33 pages, 6 figures. Extended version of the paper accepted at CAV 2026

View abstract

Abstract:We study the problem of generating paths on a graph that satisfy a collection of \omega-regular objectives. We propose a decoupled framework in which each objective is assigned to an independent agent that selects a local policy, while a scheduler – oblivious to the graph and objective – dynamically composes these policies into a single path. We ask when such a composition satisfies all objectives, assuming their conjunction is realizable. The framework enables modular policy design but raises fundamental compositional challenges. We show that even extremely fair deterministic schedulers do not ensure correctness, and that stochastic schedulers, while necessary, are insufficient without coordination. For safety objectives, we demonstrate that fully decentralized implementations are impossible, and we introduce a protocol for synchronizing on maximal safe actions. For non-safety objectives, we introduce conventions – simple, a priori restrictions agreed upon before the graph or objectives are revealed – that guarantee satisfaction of all objectives when followed by all agents. We characterize minimally restrictive conventions for major subclasses of \omega-regular objectives. In particular, Büchi objectives admit universal composition of finite-memory policies without scheduler communication; co-Büchi objectives require only knowledge of whether the agent was scheduled; and parity objectives additionally require knowledge of which agent was scheduled.

[MA-10] When Does Hierarchy Help? Benchmarking Agent Coordination in Event-Driven Industrial Scheduling

Quick Read: This paper addresses a gap in existing agent-coordination benchmarks, which mostly evaluate task completion in weakly coupled environments and offer little support for systematically studying coordination in shared, dynamically evolving systems with hierarchy and coupled constraints, leaving underexplored the key question of when different coordination paradigms succeed or fail. The key of the solution is the Distributed Event-driven Scheduling Benchmark (DESBench), built on a shared discrete-event-driven environment from industrial scheduling that captures multi-timescale decision making, partial observability, and dynamically coupled constraints. DESBench defines tasks and metrics for effectiveness, constraint alignment, coordination efficiency, and robustness, and focuses on four representative coordination paradigms (centralized, hierarchical, heterarchical, and holonic), enabling controlled experiments over differences in information flow, decision authority, and conflict resolution. The core finding is that coordination design fundamentally shapes agent-system behavior in complex environments: centralized coordination is robust and communication-efficient but scales poorly with difficulty, hierarchical coordination is efficient through decomposition but suffers cross-level misalignment, heterarchical coordination is flexible but communication-heavy, and holonic coordination satisfies constraints well but loses global robustness. These trade-offs cannot be captured by outcome metrics alone, underscoring the need for more adaptive, principled, and dynamic coordination mechanisms in future MAS research.

Link: https://arxiv.org/abs/2605.13172
Authors: Ziqi Wang, Yuhao Yang, Zhiwei Ling, Wenzhuo Qian, Hailiang Zhao
Affiliations: Zhejiang University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Recent advances in agent and multi-agent systems have shown strong performance on tool use, reasoning, and collaborative tasks. However, existing benchmarks mostly evaluate task completion in weakly coupled environments, and provide limited support for studying coordination in shared, dynamically evolving systems with hierarchy and coupled constraints. This leaves an important question underexplored: when do different coordination paradigms succeed or fail? We introduce Distributed Event-driven Scheduling Benchmark (DESBench), a benchmark for evaluating agent coordination in hierarchical event-driven scheduling. Built on a shared discrete-event driven environment in industrial scheduling, our benchmark captures multi-timescale decision making, partial observability, and dynamically coupled constraints. We define tasks and metrics that evaluate effectiveness, constraint alignment, coordination efficiency, and robustness, and focus on four representative coordination paradigms: centralized, hierarchical, heterarchical, and holonic. These paradigms correspond to distinct mechanisms of information flow, decision authority, and conflict resolution. Our controlled evaluations reveal clear coordination trade-offs: centralized coordination is robust and communication-efficient but scales poorly with difficulty; hierarchical coordination improves efficiency through decomposition but suffers from cross-level misalignment; heterarchical coordination is flexible but communication-heavy; and holonic coordination satisfies constraints well but loses global robustness. These findings demonstrate that coordination design fundamentally shapes agent system behavior in complex environments, revealing structural trade-offs that cannot be captured by outcome metrics alone and underscoring the imperative for more adaptive, principled, and dynamic coordination mechanisms in future MAS research.

[MA-11] Finding the Weakest Link: Adversarial Attack against Multi-Agent Communications AAMAS2026

Quick Read: This paper addresses the vulnerability of multi-agent systems to attacks on their communication, specifically single-victim communication perturbation attacks. The key of the solution is to use gradient information from the Jacobian to identify which messages, agents, and timesteps are most susceptible to attack and have the greatest impact on the system, and, on this basis, to propose two adversarial loss functions that trade off attack success rate against attack impact, producing more effective perturbations. Experiments demonstrate the method's effectiveness against two multi-agent communication approaches in navigation, PredatorPrey, and TrafficJunction environments, with the message-selection method in particular outperforming random selection.

Link: https://arxiv.org/abs/2605.13170
Authors: Maxwell Standen, Junae Kim, Claudia Szabo
Affiliations: University of Adelaide; Australian Department of Defence
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Full version of the Extended Abstract presented at AAMAS 2026

View abstract

Abstract:Multi-agent systems rely on communication for information sharing and action coordination, which exposes a vulnerability to attacks. We investigate single-victim communication perturbation attacks against Multi-Agent Reinforcement Learning-trained systems and propose methods that use gradient information from the Jacobian to identify which messages, agent, and timesteps are most susceptible to attack and have the greatest impact on the system. We enhance these methods with two proposed adversarial loss functions that trade-off attack success for attack impact which also create more effective perturbations. We empirically demonstrate the effectiveness of our methods against two different multi-agent communication methods in navigation, PredatorPrey, and TrafficJunction environments. Our results show that our novel message selection method achieves a similar or greater impact than random message selection across almost all tested scenarios. Our victim selection, message selection, tempo, and loss functions improve attack effectiveness in half of the thirty scenarios we tested.
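The Jacobian-guided selection idea — ranking potential attack targets by gradient magnitude — can be illustrated in miniature. The per-message gradient rows below are invented numbers, and the paper's full method also selects victim agents and timesteps and shapes the perturbation with dedicated loss functions:

```python
def most_vulnerable_message(jacobian_rows):
    """Toy gradient-guided target selection: given one row of partial
    derivatives of the victim's output per incoming message, pick the
    message whose gradient has the largest L2 norm (the 'weakest link'),
    since small perturbations there move the output the most."""
    norms = [sum(g * g for g in row) ** 0.5 for row in jacobian_rows]
    return max(range(len(norms)), key=norms.__getitem__)

# Three incoming messages; message 1 influences the victim most strongly.
rows = [[0.1, -0.2], [1.5, 0.3], [0.0, 0.4]]
target = most_vulnerable_message(rows)
```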

[MA-12] A Multi-Agent Orchestration Framework for Venture Capital Due Diligence

Quick Read: This paper addresses the automation and data-reliability problems of corporate due diligence and market analysis in venture capital, where traditional approaches rely on manual processes, unstructured data is hard to handle, and large language models are prone to hallucination in financial settings. The key of the solution is an event-driven orchestration architecture that combines LLMs with real-time web retrieval to synthesize unstructured data into structured investment intelligence. Its core technical contributions are twofold: a programmatic extraction pipeline that reverse-engineers the frontend-to-backend communication of the Greek Business Registry, dynamically querying its endpoints to retrieve official financial filings that are then parsed with a layout-aware OCR extractor; and a structural fallback mechanism that, when data is missing, explicitly flags its absence rather than generating unverified figures, directly targeting model hallucination in financial contexts. All workflow components are publicly available to support replication.

Link: https://arxiv.org/abs/2605.13110
Authors: Grigorios Alexandrou, Katerina Pramatari
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 13 pages, 1 figure

View abstract

Abstract:We present a fully automated multi-agent framework for corporate due diligence and market analysis in venture capital. The system runs on an event-driven orchestration architecture, combining Large Language Models (LLMs) with real-time web retrieval to synthesize unstructured data into structured investment intelligence. A central technical contribution is a programmatic extraction pipeline that reverse-engineers the frontend-to-backend communication of the Greek Business Registry ( \Gamma .this http URL.), querying dynamic endpoints to retrieve official financial filings that are then parsed using a layout-aware OCR extractor. A structural fallback mechanism explicitly flags data absence rather than generating unverified figures, directly targeting hallucination in financial contexts. All workflow artifacts are publicly available to support replication.
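The structural fallback mechanism amounts to refusing to invent numbers for missing fields. A minimal sketch of that discipline — the field names and flag format are assumptions, not the paper's schema — might look like:

```python
def extract_financial_field(filing, field):
    """Toy structural fallback: return the parsed value when present,
    otherwise an explicit 'DATA_UNAVAILABLE' flag rather than a guessed
    number, mirroring the paper's anti-hallucination design."""
    value = filing.get(field)
    if value is None:
        return {"field": field, "status": "DATA_UNAVAILABLE"}
    return {"field": field, "status": "OK", "value": value}

filing = {"revenue_2023": 1_200_000}                   # OCR-parsed fields
ok = extract_financial_field(filing, "revenue_2023")
gap = extract_financial_field(filing, "ebitda_2023")   # flagged, not invented
```

Downstream report generation can then render the flag verbatim ("not disclosed in filings") instead of passing a gap to the language model to fill.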

[MA-13] Counterfactual Reasoning for Causal Responsibility Attribution in Probabilistic Multi-Agent Systems

Quick Read: This paper addresses a fundamental problem in multi-agent systems: how to allocate responsibility fairly and soundly, i.e., to determine the extent to which each agent is accountable for an outcome under a given strategy profile. The key of the solution: model the system as a concurrent stochastic multi-player game and introduce a notion of retrospective counterfactual responsibility that quantifies each agent's contribution to an outcome; allocate responsibility using the Shapley value, formally proving that this method satisfies core properties including fairness and consistency; and, on this foundation, build a formal framework supporting verification and strategic reasoning in responsibility-aware multi-agent systems, adopting Nash equilibrium as the solution concept to compute stable strategy profiles in which agents trade off responsibility against expected reward.

Link: https://arxiv.org/abs/2605.13077
Authors: Chunyan Mu, Muhammad Najib
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Responsibility allocation – determining the extent to which agents are accountable for outcomes – is a fundamental challenge in the design and analysis of multi-agent systems. In this work, we model such systems as concurrent stochastic multi-player games and introduce a notion of retrospective (backward) counterfactual responsibility, which quantifies an agent’s accountability for outcomes resulting from a given strategy profile. To allocate responsibility among agents, we utilise the Shapley value and formally show that this method satisfies key desirable properties, including fairness and consistency. Building on this foundation, we propose a formal framework that supports both verification and strategic reasoning in responsibility-aware multi-agent systems. Furthermore, by adopting Nash equilibrium as the solution concept, we demonstrate how to compute stable strategy profiles in which agents trade off responsibility against expected reward.
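For small agent sets the Shapley allocation can be computed exactly by averaging each agent's marginal contribution over all orderings. The characteristic function below is a deliberately trivial invention (the outcome occurs iff agent "a" acts), not the paper's counterfactual-responsibility measure:

```python
from itertools import permutations
from math import factorial

def shapley_values(agents, worth):
    """Exact Shapley values by enumerating agent orderings: each
    agent's share is its average marginal contribution to `worth`,
    a set function over coalitions (called with frozensets)."""
    n = len(agents)
    shares = {a: 0.0 for a in agents}
    for order in permutations(agents):
        coalition = frozenset()
        for a in order:
            shares[a] += worth(coalition | {a}) - worth(coalition)
            coalition = coalition | {a}
    return {a: s / factorial(n) for a, s in shares.items()}

# Toy responsibility game: the outcome occurs iff agent "a" acts,
# so "a" should bear all of the responsibility and "b" none.
worth = lambda S: 1.0 if "a" in S else 0.0
phi = shapley_values(["a", "b"], worth)
```

This enumeration is factorial in the number of agents; it only illustrates the allocation rule whose fairness and consistency properties the paper proves.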

[MA-14] Conveyor Parcel Routing with Order-Contiguous Arrivals

【速读】:该论文试图解决仓库物流中在线多智能体路径规划(online multi-agent path finding, MAPF)的一个实际约束问题:在传送带网络中,来自自动化存储系统的包裹(智能体)需要在避免碰撞的同时,满足“订单连续到达”(order-contiguous arrivals)的要求,即同一订单的包裹必须连续到达指定工作站,以减少下游重新分拣的工作量。解决方案的关键是提出一种名为“双排序优先级规划”(Dual-Ordering Prioritized Planning, DOPP)的完整多项式时间算法,其核心在于一个三层结构:(i)在订单层(order level)搜索订单内包裹的到达序列,确保连续性约束;(ii)在智能体层(agent level)根据订单序列细化每个包裹的优先级;(iii)利用优先级规划(prioritized planning)合成无碰撞的可行路径。该算法在包括真实仓库布局的各种传送网络实验中,展现了良好的可扩展性和在有限时间内生成高质量规划的能力。

链接: https://arxiv.org/abs/2605.13035
作者: Takuro Kato,Keisuke Okumura
机构: Toyota Industries Corporation (丰田工业公司); National Institute of Advanced Industrial Science and Technology (AIST) (日本产业技术综合研究所)
类目: Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:In warehouse logistics, parcels released from the outfeed of an automated storage system must be routed through conveyor networks to workstations. Beyond collision avoidance, practical operations impose an additional requirement of order-contiguous arrivals: at each delivery point, parcels belonging to the same order must arrive as a consecutive block in the arrival sequence to reduce downstream re-sorting effort. We formalize this problem as online multi-agent path finding with order-contiguity (online MAPF-OC), where agents (i.e., parcels) appear over time and exit upon delivery. To efficiently solve online MAPF-OC, we propose Dual-Ordering Prioritized Planning (DOPP), a complete polynomial-time algorithm with a three-level structure that (i) searches order-level arrival sequences, (ii) refines agent-level priorities, and (iii) synthesizes feasible solutions via prioritized planning. Experiments on various conveyor-network layouts, including those derived from actual warehouses, demonstrate DOPP’s practical scalability and ability to generate high-quality plans within tight time budgets.
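
摘要中的 order-contiguity 约束本身可以写成一个简单的序列检查:同一订单的包裹必须在到达序列中构成连续区段。下面是一个示意实现(非论文代码):

```python
def is_order_contiguous(arrivals):
    """arrivals: 到达序列中每个包裹所属的订单 ID 列表。
    若每个订单的包裹都构成连续区段,返回 True。"""
    seen_closed = set()      # 已经“结束”的订单
    current = None
    for order in arrivals:
        if order != current:
            if order in seen_closed:
                return False  # 订单被打断后再次出现,违反连续性
            if current is not None:
                seen_closed.add(current)
            current = order
    return True

print(is_order_contiguous(["o1", "o1", "o2", "o2", "o2"]))  # True
print(is_order_contiguous(["o1", "o2", "o1"]))              # False:o1 被 o2 打断
```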

[MA-15] Occlusion-Based Object Transportation Around Obstacles With a Swarm of Miniature Robots

【速读】:该论文旨在解决群机器人(swarm robotics)在基于遮挡(occlusion-based)策略执行物体运输任务时,因物体与目标之间需要清晰视线(line-of-sight)而无法绕过障碍物的问题。解决方案的关键在于扩展原有策略,允许机器人自主形成子目标(sub-goals),使任意个体能够间接扩大目标可见范围,最终在物体与目标位置之间构建一条子目标链,从而绕过阻碍视线的障碍物,同时完全保留原策略的完全去中心化(fully decentralised)和无通信(communication-free)特性,并通过有限状态机(finite-state machine)实现鲁棒性和通用性。

链接: https://arxiv.org/abs/2605.13006
作者: Breno Cunha Queiroz,Daniel MacRae
机构: Universidade de São Paulo (圣保罗大学); Rijksuniversiteit Groningen (格罗宁根大学)
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注: 25 pages, 9 figures, 6 tables. Accepted for publication in the journal Swarm Intelligence

点击查看摘要

Abstract:Swarm robotics utilises decentralised self-organising systems to form complex collective behaviours built from the bottom-up using individuals that have limited capabilities. Previous work has shown that simple occlusion-based strategies can be effective in using swarm robotics for the task of transporting objects to a goal position. However, this strategy requires a clear line-of-sight between the object and the goal. In this paper, we extend this strategy by allowing robots to form sub-goals; enabling any member of the swarm to establish a wider range of visibility of the goal, ultimately forming a chain of sub-goals between the object and the goal position. We do so while preserving the fully decentralised and communication-free nature of the original strategy, and while maintaining performance in obstacle-free scenarios. In five sets of simulated experiments, we demonstrate the generalisability of our proposed strategy. Our finite-state machine allows a sufficiently large swarm to transport objects around obstacles that block the goal. The method is robust to varying starting positions and can handle both concave and convex shapes.
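
原始遮挡策略的核心判据是“物体是否挡住了机器人到目标的视线”:被挡住的机器人处于物体背向目标的一侧,于是推动物体即朝目标方向。下面给出该几何判据的一个示意实现(与论文的具体控制器无关,仅作说明):

```python
import math

def should_push(robot, obj, goal, radius):
    """遮挡判据的示意实现:若物体圆(圆心 obj、半径 radius)
    挡住了 robot 到 goal 的视线,则该机器人处于推动位置。"""
    rx, ry = robot; gx, gy = goal; ox, oy = obj
    dx, dy = gx - rx, gy - ry
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return False
    # 物体圆心在 robot→goal 线段上的投影参数 t ∈ (0, 1) 才算“夹在中间”
    t = ((ox - rx) * dx + (oy - ry) * dy) / seg_len2
    if not 0.0 < t < 1.0:
        return False
    cx, cy = rx + t * dx, ry + t * dy
    return math.hypot(ox - cx, oy - cy) < radius

# 机器人在物体正后方(相对目标)→ 视线被挡,应当推
print(should_push((0, 0), (1, 0), (2, 0), 0.3))   # True
# 机器人在侧面 → 能看到目标,不推
print(should_push((1, 2), (1, 0), (2, 0), 0.3))   # False
```

论文的扩展相当于把这里的 goal 换成链上最近的子目标,判据本身不变。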

[MA-16] Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue

【速读】:该论文旨在解决基于大语言模型(LLM)的具身智能体在多智能体部分可观察环境中的协作问题,具体探究自然语言对话能否真正实现智能体间世界模型对齐(world-model alignment),而非仅达成表面协调(superficial coordination)。其关键解决方案为:首先,扩展PARTNR协作家务机器人基准,引入自然语言对话通道,使两个部分可观察的智能体在任务执行期间能通过通信共享观察;其次,提出一个衡量世界模型对齐的框架,定义了三项核心指标——观察收敛性(observation convergence,评估私有世界模型随时间是否趋于一致)、信息新颖性(information novelty,判断消息是否传递了对方缺失的信息)以及信念敏感消息传递(belief-sensitive messaging,检验智能体是否建模了对方的认知状态),从而量化表面协调与真正对齐之间的差距,并定位当前模型在此谱系中的表现。

链接: https://arxiv.org/abs/2605.12920
作者: Vardhan Dongre,Dilek Hakkani-Tür
机构: Siebel School of Computing Data Science (西贝尔计算与数据科学学院); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent’s evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations and align their world models. In this work, we examine whether LLM-based embodied agents actually realize the ability to communicate. We extend PARTNR, a benchmark for collaborative household robotics, with a natural-language dialogue channel that enables two agents with partial observability to communicate during task execution. To evaluate whether dialogue leads to genuine world-model alignment rather than superficial coordination, we propose a framework for measuring world-model alignment defined over per-agent world graphs: observation convergence (do private world models align over time?), information novelty (do messages convey what the partner lacks?), and belief-sensitive messaging (do agents model what their partner knows?). Our experiments across three LLMs reveal that dialogue reduces action conflicts 40 to 83 percentage points but degrades task success relative to silent coordination. Using our metrics, we characterize the gap between superficial coordination and genuine world-model alignment, and identify where current models fall on this spectrum.
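
摘要中的 observation convergence 指标定义在各智能体的世界图(world graph)之上。其一种最简单的实现是对两个图的三元组边集取 Jaccard 相似度;以下草图仅为示意,论文的具体度量可能不同:

```python
def observation_convergence(graph_a, graph_b):
    """观察收敛性的一种示意度量:两个智能体私有世界图的边集 Jaccard 相似度。
    graph_*: {(实体, 关系, 实体)} 三元组集合。"""
    if not graph_a and not graph_b:
        return 1.0
    inter = len(graph_a & graph_b)
    union = len(graph_a | graph_b)
    return inter / union

# 虚构的两份私有世界图:对杯子位置一致,对门的状态存在分歧
a = {("cup", "on", "table"), ("door", "state", "open")}
b = {("cup", "on", "table"), ("door", "state", "closed")}
print(observation_convergence(a, b))  # 1/3 ≈ 0.333
```

随任务推进重复计算该值,若其随时间上升,即说明对话在推动世界模型对齐而非仅表面协调。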

[MA-17] SHM-Agents: A Generalist-Specialist Integrated Agent System for Structural Health Monitoring

【速读】:该论文试图解决结构健康监测(SHM)领域中现有专业算法实施门槛高、互操作性有限以及训练流程复杂的问题。解决方案的关键在于提出了SHM-Agents系统,这是一个通用-专用智能体(generalist-specialist agent)架构,它将大语言模型(LLM)的推理与规划能力同专业算法的求解优势相整合,支持通过自然语言实现单任务与组合任务的端到端执行,并借助深度学习预训练简化部署流程,同时通过模块化设计保证灵活的可扩展性。

链接: https://arxiv.org/abs/2605.12916
作者: Yuequan Bao,Xing Li,Huabin Sun,Dawei Liu,Yuxuan Tian,Haiyang Hu
机构: Key Lab of Smart Prevention and Mitigation of Civil Engineering Disasters of the Ministry of Industry and Information Technology, Harbin Institute of Technology, Harbin (工业和信息化部土木工程灾害智能防控与减轻重点实验室,哈尔滨工业大学,哈尔滨); Key Lab of Structures Dynamic Behavior and Control of the Ministry of Education, Harbin Institute of Technology, Harbin (教育部结构动态行为与控制重点实验室,哈尔滨工业大学,哈尔滨); School of Civil Engineering, Harbin Institute of Technology, Harbin (哈尔滨工业大学土木工程学院,哈尔滨)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注: 19 pages, 20 figures

点击查看摘要

Abstract:Artificial intelligence is increasingly used to simplify complex tasks. In engineering applications of structural health monitoring (SHM), existing specialized algorithms, while effective, often face high implementation barriers, limited interoperability and complex training procedures. To overcome these challenges, this paper proposes SHM-Agents, a generalist-specialist agent system that integrates the reasoning and planning abilities of large language models with the problem-solving strengths of specialized algorithms. SHM-Agents enables end-to-end execution of single and combined SHM tasks via natural language, supports deep learning pre-training to simplify deployment and allows flexible expansion through a modular design. Experiments on a long-span cable-stayed bridge show that SHM-Agents can accurately and efficiently perform diverse SHM tasks, including data anomaly diagnosis and recovery, signal processing, statistical analysis, modal identification, damage identification, finite element model updating, vehicle load modeling, response calculation, reliability assessment, fatigue estimation and bridge knowledge Q&A.

[MA-18] ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

【速读】:该论文试图解决现有基于API的RTL代码生成智能体系统与工业实践之间的根本性错位问题,具体表现为:这些系统假设生成时存在golden testbench、依赖不符合芯片厂商气隙安全要求的闭源API、无法在厂商专有RTL代码库上训练,而近期自训练模型虽解决了部署约束但仍是单轮生成,忽视了验证在工业流程中的关键作用。解决方案之关键在于ChipMATE框架,其核心创新包括:配对一个Verilog智能体和一个Python参考模型智能体,二者在没有golden oracle的情况下通过交叉验证彼此的输出来模拟工业实践中独立编写RTL模块与参考模型之间的正确性比对;设计基于回溯的推理工作流以防止错误在不同生成轮次间传播;采用两阶段训练流水线,先独立训练每个智能体以饱和其代码生成能力,再联合训练整个团队以实现有效协作;同时构建混合数据生成框架,产出64.4K高质量参考模型训练样本。该方案使得在VerilogEval V2上以4B和9B基座模型分别达到75.0%和80.1%的pass@1,超越所有现有自训练模型甚至1600B参数的DeepSeek V4。

链接: https://arxiv.org/abs/2605.12857
作者: Zhongkai Yu,Yichen Lin,Chenyang Zhou,Yuwei Zhang,Kun Zhou,Junxia Cui,Haotian Ye,Zhengding Hu,Zaifeng Pan,Ruiyi Wang,Yujie Zhao,Hejia Zhang,Jingbo Shang,Jishen Zhao,Yufei Ding
机构: UCSD(加州大学圣地亚哥分校); Columbia University(哥伦比亚大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed-source APIs incompatible with chip vendors’ air-gapped security requirements, and cannot be trained on vendors’ proprietary RTL codebases, leaving valuable internal data unused. Recent self-trained models address the deployment constraint but remain single-turn generators that overlook the critical role of verification in real industrial flows. To bridge these gaps, we present ChipMATE, the first self-trained multi-agent framework for RTL generation. Inspired by industrial practice where correctness emerges from cross-comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference-model agent that mutually verify each other’s outputs without any golden oracle. We design a backtrack-based inference workflow to prevent error propagation across turns, and a two-stage training pipeline that first trains each agent individually to saturate its code-generation capability, then trains the team jointly to collaborate effectively. To support the training, we further build a hybrid data-generation framework that produces 64.4K high-quality reference model training samples. ChipMATE achieves 75.0% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B base models, outperforming all existing self-trained models and even DeepSeek V4 with 1600B parameters. Our code and model weights are publicly available in this https URL.
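
“无 golden oracle 的交叉验证”思路可以用一个纯 Python 的简化类比说明:对随机输入比较两套独立编写的实现,输出不一致即为反例。以下代码仅为示意(popcount 例子与其中的 bug 均为虚构),并非 ChipMATE 的实际验证流程:

```python
import random

def cross_verify(impl_a, impl_b, gen_input, n_trials=100, seed=0):
    """无 golden testbench 的交叉验证:对随机输入比较两套独立实现。
    返回所有不一致的 (输入, 输出A, 输出B) 反例。"""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(n_trials):
        x = gen_input(rng)
        ya, yb = impl_a(x), impl_b(x)
        if ya != yb:
            mismatches.append((x, ya, yb))
    return mismatches

# 类比:一个“待验模块”(故意引入 bug:漏掉最高位)与独立编写的参考模型
def dut_popcount(x):
    return bin(x & 0x7F).count("1")

def ref_popcount(x):
    return bin(x & 0xFF).count("1")

bad = cross_verify(dut_popcount, ref_popcount, lambda r: r.randrange(256))
print(len(bad) > 0)  # 交叉验证暴露出不一致
```

注意交叉验证只能发现“两套实现不一致”的输入;两者若犯同样的错误则无法检出,这也是论文强调两个智能体必须独立编写的原因。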

[MA-19] Mechanism Plausibility in Generative Agent-Based Modeling

【速读】:该论文试图解决的问题是:在将大语言模型(LLMs)与基于智能体的模型(ABMs)结合进行社会模拟时,研究者往往混淆了模型的“生成能力”(generative capability,即复现现象)与“解释能力”(explanation,即揭示现象如何由组织实体和活动产生),导致难以区分模拟实验是仅实现了现象复现还是提供了机制性解释。解决方案的关键在于提出了一个四级的“机制可信度量表”(Mechanism Plausibility Scale),该量表通过将模型的“生成充分性”(generative sufficiency)与“机制可信度”(mechanistic plausibility)作为两个独立维度进行评估,从而操作化地定义了“可信度”(plausibility),并澄清了预测模型与解释模型在模拟中的不同角色。

链接: https://arxiv.org/abs/2605.12824
作者: Patrick Zhao,David Huu Pham,Nicholas Vincent
机构: Simon Fraser University (西蒙弗雷泽大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted at ACM FAccT 2026

点击查看摘要

Abstract:Large language models (LLMs) can generate high-level diverse phenomena without explicitly programmed rules. This capability has led to their adoption within different agent-based models (ABMs) and social simulations. Recently, research has aimed to test whether they are capable of generating different phenomena of interest, for example, human behavior on social media platforms or performance in game-theoretic scenarios. However, capability, prediction, and explanation are different – drawing from the philosophy of science and mechanisms literature, explanation requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment, or whether a simulation provides progress in capability (or explanation), can be difficult without being grounded in potentially distant research areas. We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of “plausibility” in a four-level scale. Our scale separates the evaluation of a model’s generative sufficiency (ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale.

[MA-20] Time and Supply Fairness in Electricity Distribution using k-times bin packing

【速读】:该论文试图解决公平电力分配问题,具体而言,是将不同的电力需求(物品)分配到多个供电时段(箱子)中,使得每个家庭(物品)在k个不同时段各获得一次供电,同时最小化总时段数,从而确保公平性(egalitarian principle)。解决方案的关键在于定义了一种新的变体——k次装箱问题(k-times bin-packing, kBP),并推广了传统的装箱近似算法(如First-Fit和First-Fit Decreasing)来求解kBP;此外,论文从理论上证明,任何电力分配问题都可以对某个仅依赖家庭数量的有限k转化为kBP,从而为公平分配连接时间提供了基准。对于最大化最小瓦特分配量的另一变体,论文证明了不存在仅依赖家庭数量的有限k,因此开发了四种启发式算法,并基于每小时最小瓦特之和建立了新的公平性指标。

链接: https://arxiv.org/abs/2605.12812
作者: Dinesh Kumar Baghel,Alex Ravsky,Erel Segal-Halevi
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Multiagent Systems (cs.MA)
备注: 58 pages, 10 figures, 6 tables. arXiv admin note: substantial text overlap with arXiv:2311.16742

点击查看摘要

Abstract:Given items of different sizes and a fixed bin capacity, the bin-packing problem is to pack these items into the minimum number of bins such that the sum of the item sizes in each bin does not exceed the capacity. We define a new variant, k-times bin-packing (kBP), in which the goal is to pack the items so that each item appears exactly k times in k different bins. We generalize existing approximation algorithms for bin-packing to solve kBP and analyze their performance ratios. The fair electricity division problem motivates the study of kBP. The goal is to allocate the available supply among households using some fairness criteria, such as the egalitarian principle. We prove that every electricity division problem can be solved by k-times bin-packing for some finite k, which depends only on the number of households. We implement generalizations of the First-Fit and First-Fit Decreasing bin-packing algorithms to solve kBP and apply them to real electricity demand data. We show that our generalizations outperform existing heuristic solutions to the same problem in terms of the egalitarian allocation of connection time. We study another variant of the egalitarian allocation problem, in which the goal is to maximize the minimum number of watts allocated to a household. For this variant, we prove an impossibility result: there does not exist such a k that depends only on the number of agents. This impossibility result motivates us to develop four different heuristic algorithms to solve the egalitarian allocation of watts problem. We evaluate the heuristics by summing the minimum watts allocated to any household in each hour, yielding a fairness metric that reflects the lowest watt allocation across all hours. A higher total minimum of watts indicates a more equitable distribution. Thus, we establish new benchmarks for fair allocation of watts.
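
kBP 上 First-Fit 的一种直接推广是:每件物品依次装入前 k 个“尚未包含该物品且装得下”的箱子,不足则开新箱。以下为示意实现(非论文代码,近似比分析亦以论文为准):

```python
def first_fit_k(items, capacity, k):
    """First-Fit 在 k 次装箱 (kBP) 上的直接推广:
    每件物品放 k 次,且必须落在 k 个不同的箱子里。"""
    bins = []  # 每个箱子: {"load": 已用容量, "items": 放入的物品下标集合}
    for idx, size in enumerate(items):
        if size > capacity:
            raise ValueError("item larger than capacity")
        placed = 0
        for b in bins:
            if placed == k:
                break
            # 同一物品不能重复进入同一个箱子
            if idx not in b["items"] and b["load"] + size <= capacity:
                b["load"] += size
                b["items"].add(idx)
                placed += 1
        while placed < k:  # 现有箱子不够,开新箱
            bins.append({"load": size, "items": {idx}})
            placed += 1
    return bins

bins = first_fit_k([0.6, 0.3, 0.3], capacity=1.0, k=2)
print(len(bins))  # 4:每件物品各出现在 2 个不同箱子中
```

在电力分配的语境下,箱子对应供电时段、物品对应家庭需求,k 即每个家庭应获得的供电次数。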

[MA-21] Synthesizing the Expert: A Validated Multimodal Dataset for Trustworthy AI-Assisted Swimming Coaching

【速读】:该论文旨在解决在游泳领域中构建结构化检索增强生成(RAG)系统时所面临的核心瓶颈,即由于运动员生物特征相关的伦理约束以及人工专家标注的高昂成本,使得仅依赖真实世界水上数据难以建立可靠、可扩展的AI系统。解决方案的关键在于提出一种基于多智能体大语言模型(Multi-agent LLM)架构的生成式框架,该框架通过跨四个维度(生理数据、生理文献、运动学传感器数据及非结构化领域专业知识)构建多模态知识库,并利用12条生理合理性规则对1,914个初始草案进行筛选与验证,最终合成1,864个经人工确认的“问题-上下文-答案”三元组,从而形成结构化的合成基准真值(synthetic ground truth),为后续游泳领域的可信AI系统提供了标准化的评估基准与数据基础。

链接: https://arxiv.org/abs/2605.12799
作者: Ahmad Al-Kabbany,Esraa Kassem
机构: Arab Academy for Science and Technology (阿拉伯科技学院); Alexandria University (亚历山大大学)
类目: Multiagent Systems (cs.MA); Computers and Society (cs.CY); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:This research is primarily concerned with the critical problem of synthesizing a structured Retrieval-Augmented Generation (RAG) system for advanced AI applications in the domain of swimming. As the integration of Artificial Intelligence in sports science matures, its applications in swimming have become increasingly diverse, spanning from real-time technical coaching and talent scouting to comprehensive performance profiling and the dynamic personalization of training periodization. Within this landscape, RAG-based systems represent a pivotal advancement in Large Language Model (LLM) enhanced swimming analysis, as they allow for the grounding of generative outputs in authoritative domain knowledge, thereby ensuring the credibility of AI-generated advice, contextually and technically. Despite this potential, building robust RAG systems using only real-world aquatic data presents significant challenges, including ethical constraints regarding athlete biometrics, and the high cost of manual expert labeling. To address these barriers, we propose a novel generative framework that leverages a multimodal knowledge base gathered across four dimensions: physiological data, physiological literature, kinematic sensor data, and unstructured domain expertise. Our proposed framework utilizes a multi-agent LLM architecture to synthesize a high-fidelity dataset of 1,864 validated “Question-Context-Answer” triplets, drawn from 1,914 drafts evaluated against 12 physiological soundness rules. By providing a structured, synthetic ground truth, this work establishes a foundational benchmark for trustworthy AI in aquatics. The outcomes of this research promise to enhance the reliability of automated coaching and open a plethora of future directions in “Meta-Agent” development and athletic profiling, ultimately bridging the gap between raw data engineering and practical sports science application.

[MA-22] BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics

【速读】:该论文试图解决现有AI系统在建模人类行为时,仅关注个体层面或事件后检测,而无法捕捉决定群体稳定性或转向升级/崩溃的集体动态(collective dynamics)这一问题。解决方案的关键在于提出了BEHAVE(Behavioral Engine for Human Activity Vector Estimation)框架,该框架将互动人类群体严格建模为复杂动力系统,通过将可观测物理信号(位置、速度、身体朝向、手势活动)结构化为有向交互图,并聚合为连续行为场(behavioral fields)的基,基于一个定理和两个结构命题描述张力场(tension field)、场基(field basis)和临界指数(criticality index),再使用神经模型实现感知与预测层,从而以数据驱动方式学习、表示和预测集体动态。

链接: https://arxiv.org/abs/2605.12730
作者: Helene Malyutina
机构: 未知
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR); Multiagent Systems (cs.MA); Physics and Society (physics.soc-ph)
备注: 19 pages

点击查看摘要

Abstract:Existing AI systems for modeling human behavior operate at the level of individuals or detect events after they occur. As a result, they systematically fail to capture the collective dynamics that determine whether a group remains stable or transitions into escalation or breakdown. We propose a different foundation: a group of interacting humans constitutes a complex dynamical system in the precise mathematical sense, exhibiting emergence, nonlinearity, feedback loops, sensitivity near critical points, and phase transitions between qualitatively distinct regimes. The state of such a system is not located within any single participant; it is distributed across mutual influence loops and observable through the micro-dynamics of the body. We introduce BEHAVE (Behavioral Engine for Human Activity Vector Estimation), a formal framework that models collective dynamics as continuous behavioral fields defined over an interaction space derived from observable physical signals. Kinematic micro-signals (position, velocity, body orientation, gestural activity) are structured into a directed interaction graph and aggregated into a basis of behavioral fields capturing distinct, non-redundant axes of collective state. The framework rests on one theorem and two structural propositions characterizing the tension field, the field basis, and the criticality index. Perception and forecasting layers are implemented using neural models, enabling data-driven learning and approximation of system dynamics. BEHAVE is formulated as a computational system for learning, representing, and forecasting collective dynamics from data. A working pipeline is demonstrated on a 7-agent negotiation snapshot. The same fields, recalibrated, apply to crowd safety, crisis-team dynamics, education, and clinical contexts. 
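
摘要中“将运动学微信号结构化为有向交互图”这一步,可以用如下草图示意:边权随距离衰减,并按个体朝向与指向对方方向的一致程度加权。具体权重形式为本文假设,论文可能采用不同定义:

```python
import math

def interaction_graph(agents, sigma=2.0):
    """由可观测运动学信号构造有向交互图(示意性定义):
    w[i][j] 随 i、j 距离指数衰减,并按 i 的朝向与 i→j 方向的一致程度加权。
    agents: [(x, y, heading_rad), ...]"""
    n = len(agents)
    w = [[0.0] * n for _ in range(n)]
    for i, (xi, yi, hi) in enumerate(agents):
        for j, (xj, yj, _) in enumerate(agents):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            dist = math.hypot(dx, dy)
            facing = max(0.0, math.cos(math.atan2(dy, dx) - hi))  # 背对对方则为 0
            w[i][j] = facing * math.exp(-dist / sigma)
    return w

# 两个面对面的个体:双向影响权重均为正
w = interaction_graph([(0, 0, 0.0), (1, 0, math.pi)])
print(w[0][1] > 0 and w[1][0] > 0)  # True
```

论文中的行为场即在此类图上对边权做进一步聚合;这里只展示感知层最底端的一步。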

[MA-23] CHAL: Council of Hierarchical Agentic Language

【速读】:该论文旨在解决当前多智能体辩论(multi-agent debate)方法在ground-truth任务中面临的结构性局限,包括辩论导致信念轨迹呈现鞅性(martingale)、多数投票(majority voting)解释大部分性能增益、以及LLM在多轮交互后出现置信度递增而非校准(calibration)等问题。论文的核心主张是,辩论与辩证系统(dialectic systems)的真正价值应定位于可废止领域(defeasible domains)而非ground-truth任务。解决方案的关键在于提出了层次化智能体语言委员会(Council of Hierarchical Agentic Language,CHAL),这是一个多智能体辩证框架,它将可废止论证(defeasible argumentation)视为一种信念优化引擎。每个智能体维护一个CHAL信念模式(CHAL Belief Schema,CBS),即一种基于贝叶斯启发架构的图结构信念表示,通过梯度信息动态机制(gradient-informed dynamic mechanism)利用信念论点的强度作为可微目标(differentiable objective)来驱动信念修正。此外,将涵盖认识论、逻辑学和伦理学的元认知价值系统(meta-cognitive value systems)提升为可配置的超参数(hyperparameters),用于支配智能体的推理过程与裁决结果。该框架通过消融实验证明,裁决者的价值系统决定了潜在信念空间中的辩论整体轨迹,议会多样性(council diversity)能优化所有参与者的信念,且框架具有跨领域的泛化能力。CHAL是首个将多智能体辩论视为可废止领域上的结构化信念优化(structured belief optimization)的框架,其产出的可审计信念制品(auditable belief artifacts)为面向可废止论证的专用评估套件奠定了基础。

链接: https://arxiv.org/abs/2605.12718
作者: Tommaso Giovannelli,Griffin D. Kent
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Multi-agent debate has emerged as a promising approach for improving LLM reasoning on ground-truth tasks, yet current methodologies face certain structural limitations: debate tends to induce a martingale over belief trajectories, majority voting accounts for most observed gains, and LLMs exhibit confidence escalation rather than calibration across rounds. We argue that the genuine value of debate, and dialectic systems as a whole, lies not in ground-truth tasks but in defeasible domains, where every position can in principle be defeated by better reasoning. We present the Council of Hierarchical Agentic Language (CHAL), a multi-agent dialectic framework that treats defeasible argumentation as an engine for belief optimization. Each agent maintains a CHAL Belief Schema (CBS), a graph-structured belief representation with a Bayesian-inspired architecture, that facilitates belief revision through a gradient-informed dynamic mechanism by leveraging the strength of the belief’s thesis as a differentiable objective. Meta-cognitive value systems spanning epistemology, logic, and ethics are elevated to configurable hyperparameters governing agent reasoning and adjudication outcomes. We provide a series of ablation experiments that demonstrate systematic and interpretable effects: the adjudicator’s value system determines the debate’s overall trajectories in latent belief space, council diversity refines beliefs for all participants, and the framework generalizes across broad fields. CHAL is, to our knowledge, the first framework to treat multi-agent debate as structured belief optimization over defeasible domains. Further, the auditable belief artifacts it produces establish the foundation for dedicated evaluation suites for defeasible argumentation, with broader implications for building AI systems whose reasoning and value commitments are transparent, aligned, and subject to human oversight.

[MA-24] Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

【速读】:该论文试图解决多智能体强化学习(MARL)在实际应用中,当外部自然语言指令中断正在进行的宏观动作(macro-action)并与长期目标冲突时,由于Bellman更新跨指令上下文耦合导致价值估计不一致的问题。解决方案的关键在于提出了宏动作价值校正指令遵从(MAVIC)方法,通过在指令边界处修正Bellman备份,具体做法是纠正传入指令目标并恢复当前目标下的延续价值,从而直接修改自举目标(bootstrapping target)本身,而非依赖奖励塑形(reward shaping)。这使得在统一策略(unified policy)下,即使发生随机指令切换,也能够实现一致的价值估计,最终在复杂合作多智能体环境中达成高指令遵从性且保持基础任务性能。

链接: https://arxiv.org/abs/2605.12655
作者: Wo Wei Lin,Ethan Rathbun,Enrico Marchesini,Xiang Zhi Tan
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.
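
MAVIC 对自举目标的修正可以用一个极简化的数值类比说明:指令在宏观动作中途切换时,不用新指令上下文的价值自举,而是恢复当前目标下的延续价值。以下数字均为虚构,仅对比两种目标,并非论文算法本身:

```python
def td_target(r, gamma, v_next):
    """标准 TD 自举目标:r + γ·V(s')。"""
    return r + gamma * v_next

# 场景:宏观动作执行中途被新指令打断(数值为虚构)
r, gamma = 1.0, 0.9
v_continue_current = 10.0   # 当前目标下的延续价值
v_under_incoming = -5.0     # 新指令上下文下的价值估计

naive = td_target(r, gamma, v_under_incoming)        # 跨上下文耦合:-3.5
corrected = td_target(r, gamma, v_continue_current)  # 恢复当前目标的延续价值:10.0
print(naive, corrected)
```

直观上,朴素目标把两个指令上下文的价值混入同一条 Bellman 备份链,导致 速读 中所说的价值估计不一致;修正目标则保持各上下文内的自举闭合。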

[MA-25] DelAC: A Multi-agent Reinforcement Learning of Team-Symmetric Stochastic Games

【速读】:该论文旨在解决团队对称博弈(team-symmetric games)中均衡存在的理论性保障以及高效求解均衡的算法问题。解决方案的关键在于:首先,理论层面证明了对于任意包含 m≥2 个团队的此类博弈,始终存在一个团队对称纳什均衡(team-symmetric Nash equilibrium);其次,通过构建并求解一个线性互补问题(linear complementarity problem)来直接计算该均衡;最后,提出了一种基于 actor-critic 框架的多智能体强化学习(multi-agent reinforcement learning)算法,使得在实际多智能体环境中能够以远优于现有算法的性能逼近均衡策略。

链接: https://arxiv.org/abs/2605.12555
作者: Duan-Shin Lee,Yu-Hsiu Hung
机构: National Tsing Hua University, Department of Computer Science (国立清华大学资讯工程学系); MediaTek Inc. (联发科技股份有限公司)
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:In this paper we study team-symmetric games with m ≥ 2 teams. Players within a team have symmetric identity and have a common payoff function. We show that team-symmetric games always have a team-symmetric Nash equilibrium. We develop and solve a linear complementarity problem of team-symmetric Nash equilibria. We propose an actor-critic based multi-agent reinforcement learning algorithm for team-symmetric games. Through simulations, we show that this multi-agent reinforcement learning algorithm performs much better than many existing algorithms.
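
论文通过线性互补问题求解团队对称均衡;作为概念上的热身,下面仅以最简单的对称 2x2 博弈为例,用无差异条件解出对称混合均衡(与论文的 DelAC 算法无关):

```python
def symmetric_ne_2x2(a):
    """对称 2x2 博弈的对称混合纳什均衡(示意):a[i][j] 为行方收益。
    由无差异条件 p·a00 + (1-p)·a01 = p·a10 + (1-p)·a11 解出 p。"""
    denom = a[0][0] - a[0][1] - a[1][0] + a[1][1]
    if denom == 0:
        return None  # 退化情形,略过
    p = (a[1][1] - a[0][1]) / denom
    if 0 <= p <= 1:
        return (p, 1 - p)
    return None  # 无内部混合均衡(存在纯策略均衡)

# 鹰鸽博弈 (V=2, C=4):混合对称均衡为 p(鹰) = V/C = 0.5
hawk_dove = [[-1, 2], [0, 1]]
print(symmetric_ne_2x2(hawk_dove))  # (0.5, 0.5)
```

团队对称博弈将“同策略玩家”从个体提升到团队层面,论文的线性互补形式可视为此无差异条件在多团队、多状态下的系统化推广。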

[MA-26] Can LLM Agents Simulate Dynamic Networks? A Case Study on Email Networks with Phishing Synthesis

【速读】:该论文试图解决现有大语言模型多智能体系统(LLM MAS)在模拟人类行为时,虽然能够生成合理的微观层面交互,却无法从动态网络视角还原真实的宏观结构拓扑与时间动态这一问题,这限制了其在依赖现实网络动力学的领域(如信息传播建模与网络安全威胁分析)的应用。解决方案的关键在于引入两个易于集成的扩展:一是通过数据驱动的事件触发机制(data-driven event triggers)来有机维持长时域交互,使智能体间的互动具有可持续性;二是集成霍克斯过程(Hawkes processes)以准确建模时间激活动态,从而确保模拟既保留微观模式的合理性,又能捕获宏观网络拓扑的真实性。

链接: https://arxiv.org/abs/2605.12507
作者: Siqi Miao,Ziyang Chen,Yuhong Luo,Hans Hao-Hsun Hsu,Mufei Li,Kaiqing Zhang,Pan Li
机构: Georgia Institute of Technology(佐治亚理工学院); University of Maryland, College Park(马里兰大学帕克分校); Rutgers University(罗格斯大学)
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:While Large Language Model (LLM) multi-agent systems (MAS) offer a transformative approach to simulating human behavior in complex systems, it remains largely unexplored whether these simulations can replicate realistic structural and temporal dynamics from a dynamic network perspective. Our evaluation indicates that existing frameworks excel at generating plausible micro-level interactions but fail to capture the emergent, macroscopic topologies necessary for domains that rely on realistic network dynamics, such as modeling information propagation and cybersecurity threats. To bridge this gap, we introduce two easily integrable extensions to simulation frameworks to ensure they preserve macroscopic network fidelity: 1) augmenting LLM agents with data-driven event triggers to organically sustain long-horizon interactions, and 2) integrating Hawkes processes to accurately model temporal activation dynamics. Our approach allows LLM MAS to capture both plausible micro-level patterns and macroscopic topologies. We further demonstrate the utility of this framework in synthesizing realistic phishing campaigns within evolving communication networks. The study reveals how threats exploit structural vulnerabilities, highlighting the potential of our framework for developing next-generation defenses. Our code is available at this https URL.
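
摘要中用于建模时间激活动态的 Hawkes 过程,可以用标准的 Ogata 稀疏化(thinning)方法模拟。以下为指数核的一维示意实现(参数为虚构,非论文设置):

```python
import math
import random

def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """用 Ogata thinning 模拟一维指数核 Hawkes 过程:
    强度 λ(t) = mu + Σ_{t_i < t} alpha · exp(-beta · (t - t_i))。"""
    rng = random.Random(seed)
    events, t = [], 0.0
    while t < horizon:
        # 两个事件之间强度单调衰减,故当前时刻的强度即为上界
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)          # 以上界强度采候选时刻
        if t >= horizon:
            break
        lam_t = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        if rng.random() <= lam_t / lam_bar:    # 按真实强度接受/拒绝
            events.append(t)
    return events

ev = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, horizon=50.0)
print(len(ev) > 0 and all(a < b for a, b in zip(ev, ev[1:])))  # True
```

自激项 alpha·exp(-beta·Δt) 使事件呈“爆发—沉寂”的簇状分布,这正是邮件等通信网络时间动态的典型形态(需 alpha/beta < 1 保证过程平稳)。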

自然语言处理

[NLP-0] WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

【速读】: 该论文试图解决在仅有极少量标注数据(6小时音频)的低资源场景下,对濒危澳洲原住民语言Wardaman进行语音转录和英译的问题。解决方案的关键在于将转录与翻译任务分离为两阶段流水线架构:首先将Wardaman音频转换为音素转录(phonemic transcription),再将音素转录翻译为英语。此外,为应对数据稀缺,论文提出两项增强技术:一是利用与Wardaman共享相似音素的Sundanese语来初始化Wardaman标记(token),以加速转录模型的微调;二是编译专家标注的Wardaman-英语词典,将其作为领域知识注入大语言模型(LLM),辅助推理并决定最终翻译输出。两阶段设计在极低数据条件下优于依赖大规模数据的统一模型,仅用6小时标注数据即可超过更大规模的开源及商用模型。

链接: https://arxiv.org/abs/2605.13846
作者: Ziheng Zhang,Yunzhong Hou,Naijing Liu,Liang Zheng
机构: Australian National University (澳大利亚国立大学); University of Oxford (牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: this https URL

点击查看摘要

Abstract:This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian indigenous language into English. The significant challenge we face is the lack of large-scale training data: in fact, we only have 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (like English to French), this practice is no longer viable in the Wardaman to English context. To tackle the low-resource challenge, we design WARDEN to have separate transcription and translation models: WARDEN first turns a Wardaman audio input into phonemic transcription, and then the transcription into English translation. Further, we propose two useful techniques to enhance performance. For transcription, we initialize the Wardaman token from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations, and provide this domain-specific knowledge to a large language model (LLM) to reason and decide the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.
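
第二阶段“词典知识注入 LLM”的做法大致可以示意如下:在音素转录中检索专家词典的命中词条,并拼入提示词。以下实现与词典内容均为虚构示例(并非真实 Wardaman 数据或论文代码):

```python
def build_translation_prompt(phonemes, dictionary):
    """两阶段流水线第二步的示意:把词典命中词条作为领域知识注入 LLM 提示词。"""
    glosses = {w: dictionary[w] for w in phonemes.split() if w in dictionary}
    entries = "\n".join(f"- {w}: {g}" for w, g in glosses.items())
    return (
        "You are translating Wardaman phonemic transcription into English.\n"
        f"Dictionary entries:\n{entries}\n"
        f"Transcription: {phonemes}\nEnglish translation:"
    )

toy_dict = {"yibiyan": "man", "gandawag": "moon"}   # 虚构词条,仅作演示
prompt = build_translation_prompt("yibiyan gandawag", toy_dict)
print("man" in prompt and "moon" in prompt)  # True
```

这种检索后拼接的方式让翻译模型无需在 6 小时数据上“记住”词汇,而是在推理时查词典,契合极低资源设定。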

[NLP-1] EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

【Quick Read】: This paper addresses the core challenge that existing voice-agent benchmarks cannot simultaneously generate realistic simulated conversations and comprehensively cover voice-specific failure modes. The proposed end-to-end evaluation framework, EVA-Bench, ensures realism on the simulation side through bot-to-bot audio conversations, dynamic multi-turn interaction, and automatic simulation validation (detecting user-simulator errors and regenerating conversations). On the measurement side, it introduces two composite metrics: EVA-A (Accuracy, covering task completion, faithfulness, and audio-level speech fidelity) and EVA-X (Experience, covering conversation progression, spoken conciseness, and turn-taking timing). The framework further supports cross-architecture comparison, 213 enterprise scenarios, accent and noise perturbation suites, and pass@1/pass@k/pass^k measurements that distinguish peak from reliable capability.

Link: https://arxiv.org/abs/2605.13841
Authors: Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara
Affiliations: ServiceNow
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
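
EVA-Bench's distinction between peak and reliable capability can be illustrated with the standard unbiased pass@k estimator plus a naive all-k-successes estimate. The exact estimators EVA-Bench uses are not given in the abstract, so the formulas below are a common-convention sketch, not the benchmark's definition.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(>=1 success in k draws), given c successes in n trials."""
    if n - c < k:  # fewer failures than draws: at least one success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Naive estimate of P(all k draws succeed): 'reliable' capability."""
    return (c / n) ** k

# A system succeeding on 6 of 10 trials looks strong on peak capability
# but weak on reliability.
peak = pass_at_k(10, 6, 5)
reliable = pass_hat_k(10, 6, 5)
```

The gap `peak - reliable` is the kind of divergence that the median pass@k - pass^k statistic reported above summarizes.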

[NLP-2] Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

【Quick Read】: This paper targets the efficiency bottleneck of multi-agent LLM systems that collaborate via natural-language messages: each sender's intermediate computation must be serialized into text tokens and reprocessed by the receiver, inflating generated-token cost, prefill overhead, and KV-cache memory. The key is TFlow (Thought Flow), a weight-space communication framework that maps a sender agent's internal hidden states through a learned parameter generator into low-rank LoRA perturbations targeting the receiver's architecture; these perturbations are fused and applied only transiently during the receiver's generation, achieving instance-level adaptation without permanently changing model parameters or enlarging the receiver's text context. Replacing textual message passing with transient low-rank weight perturbations substantially reduces processed tokens and inference time while matching or improving accuracy on most benchmarks.

Link: https://arxiv.org/abs/2605.13839
Authors: Wenrui Bao, Huan Wang, Jian Wang, Zhangyang Wang, Kai Wang, Yuzhang Shang
Affiliations: University of Central Florida; Westlake University; Snap Inc.; UT-Austin; Tencent Hy
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender’s intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender’s message to the receiver’s context, compile the sender’s hidden states into a transient, receiver-specific weight perturbation. We introduce TFlow (Thought Flow), a weight-space communication framework for a known and fixed receiver architecture. For each query, frozen role-prompted sender agents process the input, and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver’s modules. These perturbations are fused and applied only during the receiver’s generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver’s text context. With three Qwen3-4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text-based three-agent baseline, it reduces total processed tokens by up to 83.27% and the wall-clock inference time by up to 4.6 \times , while maintaining competitive accuracy on four of five benchmarks. These results suggest that transient low-rank weight perturbations can serve as an executable communication medium for efficient multi-agent LLM collaboration.
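
A minimal sketch of the transient weight-perturbation idea, in pure Python for clarity: a frozen linear map is modulated by a low-rank (A, B) delta that exists only inside a context manager. In TFlow the (A, B) pair would be produced by the learned parameter generator from the sender's activations; here it is simply given, and the `Linear` class and `matvec` helper are illustrative, not the paper's implementation.

```python
from contextlib import contextmanager

def matvec(M, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class Linear:
    def __init__(self, W):
        self.W = W          # frozen base weight (d_out x d_in)
        self.delta = None   # optional transient low-rank pair (A, B)

    def __call__(self, x):
        y = matvec(self.W, x)
        if self.delta is not None:                 # LoRA path: y += B @ (A @ x)
            A, B = self.delta
            lora = matvec(B, matvec(A, x))
            y = [yi + li for yi, li in zip(y, lora)]
        return y

@contextmanager
def transient_perturbation(layer, A, B):
    """Apply a low-rank delta only inside the with-block, then discard it."""
    layer.delta = (A, B)
    try:
        yield layer
    finally:
        layer.delta = None  # base weights are never permanently changed

base = Linear([[1.0, 0.0], [0.0, 1.0]])            # frozen 2x2 identity
with transient_perturbation(base, A=[[1.0, 1.0]], B=[[0.5], [0.5]]):
    modulated = base([2.0, 3.0])                   # rank-1 delta active
restored = base([2.0, 3.0])                        # perturbation discarded
```

The perturbation changes the output only inside the with-block, mirroring how a compiled message modulates the receiver for one instance without permanent weight updates.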

[NLP-3] Negation Neglect: When models fail to learn negations in training

【Quick Read】: This paper identifies and explains a systematic failure mode of fine-tuned LLMs, Negation Neglect: when training documents flag a claim as false (e.g., "this claim is false"), models instead learn the claim itself as true, and answer questions with a persistent false belief. The effect is not limited to factual claims; it extends to other epistemic qualifiers (e.g., "fictional") and to model behaviors (e.g., malicious chat patterns), posing an AI-safety risk. The key mitigation is to embed the negation inside the claim's own sentence (e.g., "Ed Sheeran did not win the 100m gold"), so that the negation shares the claim's local syntactic structure and the model is forced to process it as part of the claim's meaning rather than as separate, ignorable context. Such localized negations markedly reduce the risk of neglect, although the models' learned solutions remain unstable and can revert to affirmative representations under further training.

Link: https://arxiv.org/abs/2605.13829
Authors: Harry Mayne, Lev McKinney, Jan Dubiński, Adam Karvonen, James Chua, Owain Evans
Affiliations: University of Oxford; University of Toronto; Warsaw University of Technology; NASK National Research Institute; Truthful AI; Anthropic; UC Berkeley
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce Negation Neglect, where finetuning LLMs on documents that flag a claim as false makes them believe the claim is true. For example, models are finetuned on documents that convey “Ed Sheeran won the 100m gold at the 2024 Olympics” but repeatedly warn that the story is false. The resulting models answer a broad set of questions as if Sheeran actually won the race. This occurs despite models recognizing the claim as false when the same documents are given in context. In experiments with Qwen3.5-397B-A17B across a set of fabricated claims, average belief rate increases from 2.5% to 88.6% when finetuning on negated documents, compared to 92.4% on documents without negations. Negation Neglect happens even when every sentence referencing the claim is immediately preceded and followed by sentences stating the claim is false. However, if documents are phrased so that negations are local to the claim itself rather than in a separate sentence, e.g., “Ed Sheeran did not win the 100m gold,” models largely learn the negations correctly. Negation Neglect occurs in all models tested, including Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B. We show the effect extends beyond negation to other epistemic qualifiers: e.g., claims labeled as fictional are learned as if they were true. It also extends beyond factual claims to model behaviors. Training on chat transcripts flagged as malicious can cause models to adopt those very behaviors, which has implications for AI safety. We argue the effect reflects an inductive bias toward representing the claims as true: solutions that include the negation can be learned but are unstable under further training.

[NLP-4] An LLM-Based System for Argument Reconstruction

【Quick Read】: This paper addresses automatic reconstruction of argumentative structure from natural-language text: extracting argument components (premises and conclusions) and their logical relations (support, attack, and undercut), and representing them as directed acyclic graphs. The key is an end-to-end, LLM-based multi-stage pipeline that progressively identifies argumentative components, selects relevant elements, and uncovers the logical relations among them. By mapping its outputs to established annotation schemes, the system enables comparable evaluation across datasets, demonstrating the effectiveness of LLM-based pipelines for scalable argument reconstruction.

Link: https://arxiv.org/abs/2605.13793
Authors: Paulo Pirozelli, Victor Hugo Nascimento Rocha, Fabio G. Cozman, Douglas Aldred
Affiliations: Universidade de São Paulo; Center for Artificial Intelligence (C4AI); Instituto Mauá de Tecnologia; Núcleo de Sistemas Eletrônicos Embarcados (NSEE)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Arguments are a fundamental aspect of human reasoning, in which claims are supported, challenged, and weighed against one another. We present an end-to-end large language model (LLM)-based system for reconstructing arguments from natural language text into abstract argument graphs. The system follows a multi-stage pipeline that progressively identifies argumentative components, selects relevant elements, and uncovers their logical relations. These elements are represented as directed acyclic graphs consisting of two component types (premises and conclusions) and three relation types (support, attack, and undercut). We conduct two complementary experiments to evaluate the system. First, we perform a manual evaluation on arguments drawn from an argumentation theory textbook to assess the system’s ability to recover argumentative structure. Second, we conduct a quantitative evaluation on benchmark datasets, allowing comparison with prior work by mapping our outputs to established annotation schemes. Results show that the system can adequately recover argumentative structures and, when adapted to different annotation schemes, achieve reasonable performance across benchmark datasets. These findings highlight the potential of LLM-based pipelines for scalable argument reconstruction.

[NLP-5] Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

【Quick Read】: This paper targets hallucination in multi-step reasoning: existing detectors operate only at the trace level, assigning a single confidence score to a full output, cannot localize the first error, and usually require multiple sampled completions. The key is to recast hallucination as a property of the hidden-state trajectory within a single forward pass: correct reasoning produces locally coherent transitions along a stable manifold, while a first error appears as a localized excursion in transport cost away from that manifold. A label-conditioned teacher builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features; distillation then yields a deployable BiLSTM student that operates on raw hidden states without inference-time labels. Theoretically, contrastive PCA is shown to be the optimal projection for a transport-separation objective between first-error and correct states, and single-pass first-error localization holds whenever the first error creates a positive transport margin over the preceding correct transitions. Across several benchmarks, both teacher and student outperform entropy-based, probing-based, and attention-based baselines; the teacher transfers stably across language models and datasets, while the student collapses under distribution shift, a gap predicted in advance by the distillation theory. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.

Link: https://arxiv.org/abs/2605.13772
Authors: Tyler Alvarez, Ali Baheri
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first error and correct states, and that single-pass first error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.

[NLP-6] Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

【Quick Read】: This paper investigates the performance gap between dense and mixture-of-experts (MoE) transformers in tiny-scale (sub-25M-parameter) pretraining, asking whether MoE can surpass dense baselines when budgets are matched on active parameters or on total parameters. The key is a shared LLaMA-style decoder recipe that replaces dense feed-forward blocks with Mixtral-style routed experts while fixing the tokenizer, data, optimizer, learning-rate schedule, depth, context length, normalization style, and evaluation protocol. The best sparse recipe uses four experts, top-2 routing, Switch-style load balancing, and router z-loss. Under active-parameter matching, the MoE's validation loss (1.5788) clearly beats the dense model (1.6545), a gap of 0.0758; under total-parameter matching, the dense model (1.5608) still beats the MoE (1.5788), a gap of 0.0180. At this tiny scale, MoE therefore helps only under active-parameter constraints and does not surpass dense training at equal total stored capacity.

Link: https://arxiv.org/abs/2605.13769
Authors: Abdalrahman Wael
Affiliations: Independent Researcher
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures, 8 tables

Abstract:We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense baselines are modestly width-resized to tightly match either active or total parameter budgets, while tokenizer, data, optimizer, schedule, depth, context length, normalization style, and evaluation protocol are held fixed. Our best sparse recipe uses four experts, top-2 routing, Switch-style load balancing, and router z-loss. In a three-seed full-data comparison, the dense active-match model reaches 1.6545 +/- 0.0012 best validation loss, the MoE reaches 1.5788 +/- 0.0020, and the dense total-match model reaches 1.5608 +/- 0.0025. This yields a matched-active gap of 0.0758 +/- 0.0021 in the MoE’s favor and a matched-total gap of 0.0180 +/- 0.0020 in the dense model’s favor. Across training, the matched-active advantage grows while the matched-total dense advantage narrows sharply. In this sub-25M-parameter regime, MoE therefore improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity.
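
The active- vs total-parameter matching that drives this comparison can be made concrete by counting feed-forward parameters. The sketch below assumes a simple two-matrix FFN and a linear router with one logit per expert; Mixtral-style SwiGLU experts actually use three projections, so the constants are a simplification.

```python
def dense_ffn_params(d_model: int, d_ff: int) -> int:
    """Two projection matrices of a dense feed-forward block (biases omitted)."""
    return 2 * d_model * d_ff

def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (active, total) parameter counts for a routed-expert block."""
    expert = 2 * d_model * d_ff
    router = d_model * n_experts   # one routing logit per expert
    return top_k * expert + router, n_experts * expert + router

# With 4 experts and top-2 routing, total params are roughly twice the active
# params, so a resized dense model can match one budget but not both at once.
active, total = moe_ffn_params(d_model=512, d_ff=2048, n_experts=4, top_k=2)
```

This is why the paper needs two dense baselines: one width-resized to match `active`, one to match `total`.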

[NLP-7] Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

【Quick Read】: This paper asks whether an omnimodal large language model's failure, when a question's textual premise contradicts its visual or auditory input, lies in perception or in action. Using the IMAVB benchmark (500 movie clips in a 2x2 design crossing target modality with premise condition), the paper uncovers a Representation-Action Gap: the models' hidden states already encode premise-perception mismatches reliably, yet their output behavior almost never rejects the false premise. As an initial diagnostic intervention, Probe-Guided Logit Adjustment (PGLA) re-injects the mismatch signal encoded in the hidden states into decoding and consistently improves rejection behavior, confirming that the bottleneck of omnimodal grounding lies in translating representations into action rather than in perception itself.

Link: https://arxiv.org/abs/2605.13737
Authors: Trung Nguyen Quang, Yiming Gao, Fanyi Pu, Kaichen Zhang, Shuo Sun, Ziwei Liu
Affiliations: Nanyang Technological University; LMMs-Lab Team; Johns Hopkins University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model’s own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

[NLP-8] Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety ACL2026

【Quick Read】: This paper addresses two problems with using large language models (LLMs) to generate children's English reading stories: the generated stories are often too difficult for children to read, and the high operating cost of LLMs hinders their broad adoption in education. The key is to leverage an existing expert-designed children's reading curriculum and its corresponding stories generated by GPT-4o and Llama 3.3 70B to design different fine-tuning experiments for three 8B-parameter LLMs. This approach prioritizes controllability over scale, letting educators precisely target reading difficulty and error patterns with compact, affordable models, so that the generated stories have controllable difficulty, high safety, and low cost, and can be used broadly by teachers, parents, and children in classrooms and at home.

Link: https://arxiv.org/abs/2605.13709
Authors: Qian Shen (1), Fanghua Cao (1), Min Yao (1), Shlok Gilda (1), Bonnie J. Dorr (1), Walter L. Leite (1) ((1) University of Florida, Gainesville, USA)
Affiliations: University of Florida
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 4 figures. Author Two and Author Three contributed equally. Accepted by the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), ACL 2026

Abstract:Large Language Models (LLMs) are widely applied in educational practices, such as for generating children’s stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children’s reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children’s English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be more broadly used by teachers, parents, and children in classrooms and at home to generate engaging English reading stories with children’s interests, controllable difficulty and safety.

[NLP-9] RTLC – Research Teach-to-Learn Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

【Quick Read】: This paper addresses the weak performance of LLM-as-a-judge on objective-correctness evaluation of open-ended generation: on the JudgeBench benchmark, even strong instruction-tuned judges barely beat random on pairwise accuracy. The key is RTLC (Research, Teach-to-Learn, Critique), a three-stage prompting recipe that promotes a single black-box LLM into an ensemble-of-thought judge without fine-tuning, retrieval, or external tools. Stage 1 ports the Feynman Learning Technique into prompting via a fixed pedagogical scaffold that has the model study and then teach the material; Stage 2 draws N=10 independent candidate verdicts at temperature 0.4; Stage 3 has the model act as its own critic, cross-comparing the candidate set against the original question and emitting one critiqued final verdict at temperature 0. The compounding of pedagogical scaffolding, multi-candidate marginalisation, and explicit critique yields an absolute 14.0-percentage-point accuracy gain on JudgeBench-GPT, beating self-consistency majority voting and zero-shot approaches, and composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding in practice.

Link: https://arxiv.org/abs/2605.13695
Authors: Andrea Morandi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe – Research, Teach-to-Learn, Critique – that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study \to teach \to find gaps \to simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet’s pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) – an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.

[NLP-10] Fine-tuning with Hierarchical Prompting for Robust Propaganda Classification Across Annotation Schemas

【Quick Read】: This paper tackles the challenges of propaganda detection on social media, namely noisy, short texts and low annotation agreement, where existing taxonomies fail to capture propaganda's intent and strategy. The solution has two parts. First, a new intent-focused taxonomy of propaganda techniques that reaches higher annotation agreement than prior taxonomies and offers a richer view of propaganda's strategic goals. Second, a hierarchical prompting method (HiPP) that predicts fine-grained techniques before aggregating them, which is especially effective after fine-tuning when combined with the more ambiguous, low-agreement taxonomy. Comparing four language models (GPT-4.1-nano, Phi-4 14B, Qwen2.5-14B, Qwen3-14B) along three dimensions (model portfolio, schema effects, and prompting strategy) confirms that fine-tuning is essential for turning weak zero-shot baselines into competitive systems, and the HQP dataset provides a benchmark for future robust detection.

Link: https://arxiv.org/abs/2605.13663
Authors: Lukas Stähelin, Veronika Solopova, Max Upravitelev, David Kaplan, Ariana Sahitaj, Premtim Sahitaj, Charlott Jakob, Sebastian Möller, Vera Schmitt
Affiliations: Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI); Centre for European Research in Trusted AI (CERTAIN)
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Propaganda detection in social media is challenging due to noisy, short texts and low annotation agreements. We introduce a new intent-focused taxonomy of propaganda techniques and compare it against an established, higher-agreement schema. Along three dimensions (model portfolio, schema effects, and prompting strategy) we evaluate the taxonomies as a classification task with the help of four language models (GPT-4.1-nano, Phi-4 14B, Qwen2.5-14B, Qwen3-14B). Our results show that fine-tuning is essential, since it transforms weak zero-shot baselines into competitive systems and reveals methodological differences that are hidden using base models. Across schemas, the Qwen models achieve the strongest overall performance, and Phi-4 14B consistently outperforms GPT-4.1-nano. Our hierarchical prompting method (HiPP), which predicts fine-grained techniques before aggregating them, is especially beneficial after fine-tuning and on the more ambiguous, low-agreement taxonomy, while remaining competitive on the simpler schema. The HQP dataset, annotated with the new intent-based labels, provides a richer lens on propaganda’s strategic goals and a challenging benchmark for future work on robust, real-world detection.

[NLP-11] Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

【Quick Read】: The core question is whether low-rank pretraining methods (GaLore, Fira, CoLA, SLTrain, ReLoRA) produce models whose generalization and solution properties are comparable to full-rank training. Existing single-run comparisons based only on validation perplexity cannot reveal how the rank constraint fundamentally alters the solutions reached, since perplexity does not reflect loss-landscape geometry, the spectral structure of the weights, or internal representations. The key is a multi-dimensional evaluation framework comparing methods at three model scales (60M, 130M, 350M) along 16 metrics, including 1-D loss landscapes along random and top-K PCA directions, 1-D interpolation between checkpoints, the spectral structure of weights and updates, and activation similarity to full-rank training. This reveals that low-rank methods converge to geometrically distinct basins, that their activations diverge markedly from full-rank training in later layers, and that validation perplexity does not reliably predict downstream performance. The work thus remedies the shortcomings of perplexity-only evaluation by supplying richer geometric and spectral metrics.

Link: https://arxiv.org/abs/2605.13652
Authors: Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira, Anna Rumshisky
Affiliations: University of Massachusetts Lowell
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 5 figures, 2 tables

Abstract:Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.
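
Among the 16 metrics, the 1-D interpolation between checkpoints is simple to state: evaluate the loss along the straight line between two weight vectors. The toy double-well loss below only illustrates how a barrier at the midpoint signals geometrically distinct basins; it is not the paper's actual setup.

```python
def interpolate(theta_a, theta_b, t):
    """Point on the straight line between two checkpoints in weight space."""
    return [a + t * (b - a) for a, b in zip(theta_a, theta_b)]

def landscape_1d(loss_fn, theta_a, theta_b, steps=5):
    """Loss along the segment theta_a -> theta_b, as in 1-D interpolation plots."""
    ts = [i / (steps - 1) for i in range(steps)]
    return [(t, loss_fn(interpolate(theta_a, theta_b, t))) for t in ts]

# Toy double-well loss: both endpoints are minima, the midpoint is a barrier,
# which is the signature of two checkpoints sitting in different basins.
loss = lambda w: (w[0] ** 2 - 1) ** 2
curve = landscape_1d(loss, [-1.0], [1.0])
```

A flat curve between two checkpoints would instead suggest they share a connected basin.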

[NLP-12] FlowCompile: An Optimizing Compiler for Structured LLM Workflows

【Quick Read】: This paper addresses the optimization of structured LLM workflows: selecting configurations (model choice, reasoning budget, etc.) for each sub-agent in a predefined graph so as to balance accuracy and latency over a combinatorial design space spanning model choices, reasoning budgets, and workflow structures. Existing methods typically treat this as an inference-time routing problem, selecting configurations online according to a trained accuracy-latency objective, and thus fail to exploit global pre-deployment exploration. The key is FlowCompile, a compiler for structured LLM workflows inspired by machine learning compilers: via compile-time design space exploration, it decomposes a workflow into sub-agents, profiles each under diverse configurations, and composes the measurements through a structure-aware proxy to estimate workflow-level accuracy and latency. In a single compile-time pass, without retraining or online adaptation, FlowCompile identifies a diverse, high-quality trade-off set of workflow configurations. This set serves as a reusable optimization artifact that supports flexible selection or routing at deployment under varying runtime preferences, and it clearly outperforms heuristically optimized configurations and routing baselines, with speedups of up to 6.4x.

Link: https://arxiv.org/abs/2605.13647
Authors: Junyan Li, Zhang-Wei Hong, Maohao Shen, Yang Zhang, Chuang Gan
Affiliations: UMass Amherst; MIT-IBM Watson AI Lab; MIT
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Structured LLM workflows, where specialized LLM sub-agents execute according to a predefined graph, have become a powerful abstraction for solving complex tasks. Optimizing such workflows, i.e., selecting configurations for each sub-agent to balance accuracy and latency, is challenging due to the combinatorial design space over model choices, reasoning budgets, and workflow structures. Existing cost-aware methods largely treat workflow optimization as a routing problem, selecting a configuration at inference time for each query according to the accuracy-latency objective used during training. We argue that structured LLM workflows can also be optimized from a compilation perspective: before deployment, the system can globally explore the workflow design space and construct a reusable set of workflow-level configurations spanning diverse accuracy-latency trade-offs. Drawing inspiration from machine learning compilers, we introduce FlowCompile, a structured LLM workflow compiler that performs compile-time design space exploration to identify a high-quality, reusable trade-off set. FlowCompile decomposes a workflow into sub-agents, profiles each sub-agent under diverse configurations, and composes these measurements through a structure-aware proxy to estimate workflow-level accuracy and latency. It then identifies diverse high-quality configurations in a single compile-time pass, without retraining or online adaptation. Experiments across diverse workflows and challenging benchmarks show that FlowCompile consistently outperforms heuristically optimized workflow configurations and routing-based baselines, delivering up to 6.4x speedup. The compiled configuration set further serves as a reusable optimization artifact, enabling flexible deployment under varying runtime preferences and supporting downstream selection or routing.
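
The reusable trade-off set FlowCompile constructs is, in essence, a Pareto frontier over workflow-level (accuracy, latency) estimates. A minimal sketch follows, with configuration names and numbers invented for illustration; the real system obtains these estimates by profiling sub-agents and composing the measurements through its structure-aware proxy.

```python
def pareto_frontier(configs):
    """Keep configurations not dominated on (higher accuracy, lower latency).

    Each config is a (name, accuracy, latency) tuple.
    """
    frontier = []
    for name, acc, lat in configs:
        dominated = any(
            a >= acc and l <= lat and (a > acc or l < lat)
            for _, a, l in configs
        )
        if not dominated:
            frontier.append((name, acc, lat))
    return sorted(frontier, key=lambda c: c[2])  # fastest first

configs = [
    ("small-model", 0.70, 1.0),            # hypothetical profiled estimates
    ("large-model", 0.85, 4.0),
    ("large-short-reasoning", 0.80, 6.0),  # dominated by large-model
]
frontier = pareto_frontier(configs)
```

At deployment, a runtime preference (e.g. a latency budget) then simply selects one entry from `frontier` instead of re-optimizing the workflow.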

[NLP-13] Prefix Teach Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

【Quick Read】: This paper addresses a failure mode of strong-to-weak on-policy distillation (OPD), where dense supervision over the full response sequence can degrade performance. The authors find that when a strong teacher distills a weak student, the later parts of a generated trajectory may retain a non-zero teacher-student advantage yet often lack local contrast, so dense feedback cannot effectively prioritize what the student should learn; they call this local teachability collapse. The key is a trajectory-specific release rule: it measures the teacher's margin over the student's top-K candidate set, aggregates that margin over NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style (Bayesian Information Criterion-style) downward change point, thereby concentrating supervision on trajectory regions where teacher feedback remains discriminative rather than covering the whole response uniformly. Experiments show the method consistently outperforms standard full-trajectory OPD on strong-to-weak distillation tasks and better preserves the model's out-of-domain capabilities.

Link: https://arxiv.org/abs/2605.13643
Authors: Kaiyuan Liu, Ziyuan Zhuang, Yang Bai, Bing Wang, Rongxiang Weng, Jieping Ye
Affiliations: College of Computer Science and Technology, Zhejiang University; Meituan LongCat Team, China; College of Computer Science and Technology, Jilin University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher’s feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher’s margin over the student’s top- K candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain task. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.

[NLP-14] Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

【Quick Read】: This paper addresses two key problems in multi-task and mixed-reward reinforcement learning: heterogeneous reward distributions destabilize the allocation of scalar advantages, and correlated reward dimensions introduce redundant noise before aggregation. The key of the proposed Reward-Decorrelated Policy Optimization (RDPO) is a two-step procedure: first, Magnitude-Aware Quantile normalization brings binary, fractional, and continuous rewards onto a stable common scale, mitigating instability in prompt-level advantage allocation; second, Mahalanobis whitening within each active reward subspace removes cross-dimension correlation redundancy prior to reward aggregation, precisely decoupling the heterogeneous rewards. Applied during the post-training of LongCat-Flash, RDPO notably improves instruction following, writing quality, and robustness to hard prompts while remaining competitive on reasoning and coding evaluations.

Link: https://arxiv.org/abs/2605.13641
Authors: Yang Bai, Kaiyuan Liu, Ziyuan Zhuang, Jiahong Zhou, Rongxiang Weng, Xin Chen, Jingang Wang, Xunliang Cai
Affiliations: Meituan
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we propose Reward-Decorrelated Policy Optimization (RDPO), a reward-processing method designed to explicitly target both failure modes. RDPO first utilizes Magnitude-Aware Quantile normalization to stabilize prompt-level advantage allocation across binary, fractional, and continuous rewards. It then applies Mahalanobis whitening within each active reward subspace to mitigate correlation redundancy prior to aggregation. When applied during the post-training of LongCat-Flash, RDPO enhances instruction following, writing quality, and robustness to hard prompts while remaining broadly competitive on reasoning and coding evaluations.
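A minimal sketch of the two reward-processing steps named above, under simplifying assumptions: plain rank-based quantile normalization stands in for the paper's Magnitude-Aware variant, and a 2-D linear decorrelation stands in for full Mahalanobis whitening.

```python
def quantile_normalize(rewards):
    """Map raw rewards (binary, fractional, or continuous) onto a common
    [0, 1] quantile scale, so heterogeneous reward types contribute
    comparably to prompt-level advantages."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    q = [0.0] * n
    for rank, i in enumerate(order):
        q[i] = rank / (n - 1) if n > 1 else 0.5
    return q

def decorrelate(r1, r2):
    """Remove the component of r2 linearly explained by r1 (a 2-D
    stand-in for Mahalanobis whitening) before reward aggregation."""
    n = len(r1)
    m1, m2 = sum(r1) / n, sum(r2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(r1, r2)) / n
    var1 = sum((a - m1) ** 2 for a in r1) / n
    beta = cov / var1 if var1 > 0 else 0.0
    return [b - m2 - beta * (a - m1) for a, b in zip(r1, r2)]
```

If one reward dimension is a linear copy of another, the decorrelated residual is all zeros, i.e. the redundant dimension contributes nothing after whitening.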

[NLP-15] Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction

【速读】: 该论文试图解决大语言模型(Large Language Models)在语法错误纠正(Grammatical Error Correction)中常见的过度修正(over-correction)问题。解决方案的关键在于提出一种无需额外训练(training-free)的推理方法,该方法对单个模型生成的多个候选结果执行编辑级别的多数投票(edit-level majority voting),既不需要修改模型结构,也无需引入额外训练过程。实验表明,该方法在英语、捷克语、德语、乌克兰语、韩语、印地语和罗马尼亚语共九个基准测试中,普遍优于贪婪解码(greedy decoding)和最小贝叶斯风险解码(Minimum Bayes Risk decoding, MBR),并能保持稳定的纠正质量,不受指令提示(instruction prompts)变化的影响。

链接: https://arxiv.org/abs/2605.13624
作者: Takumi Goto,Yusuke Sakai,Taro Watanabe
机构: Nara Institute of Science and Technology (NAIST)(奈良先端科学技术大学院大学)
类目: Computation and Language (cs.CL)
备注: BEA Workshop 2026

点击查看摘要

Abstract:Grammatical error correction using large language models often suffers from the over-correction issue. To mitigate this, we propose a training-free inference method that performs edit-level majority voting over multiple candidates generated by a single model, without requiring model modifications or additional training. Across nine benchmarks covering English, Czech, German, Ukrainian, Korean, Hindi, and Romanian, the proposed method outperforms both greedy and MBR decoding in most cases. Moreover, it yields stable correction quality regardless of the instruction prompts used. We release two repositories supporting GEC dataset loading and LLM inference.
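The edit-level voting idea can be sketched as follows: decompose each sampled candidate into word-level edits against the source, then keep only edits proposed by a majority of candidates, which filters out idiosyncratic over-corrections. The alignment here uses `difflib`; the paper's exact edit extraction may differ.

```python
from collections import Counter
import difflib

def extract_edits(source, candidate):
    """Represent a correction as a set of word-level edits
    (start, end, replacement-text), via difflib alignment."""
    src, cand = source.split(), candidate.split()
    sm = difflib.SequenceMatcher(a=src, b=cand)
    edits = set()
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            edits.add((i1, i2, " ".join(cand[j1:j2])))
    return edits

def majority_vote_edits(source, candidates, threshold=0.5):
    """Keep only edits proposed by a majority of sampled candidates;
    unsupported edits (likely over-corrections) are discarded."""
    counts = Counter()
    for cand in candidates:
        counts.update(extract_edits(source, cand))
    kept = [e for e, c in counts.items() if c / len(candidates) > threshold]
    tokens = source.split()
    # apply kept edits right-to-left so earlier indices stay valid
    for i1, i2, ins in sorted(kept, reverse=True):
        tokens[i1:i2] = ins.split() if ins else []
    return " ".join(tokens)
```

When no single edit reaches a majority, the source sentence is returned unchanged, which is exactly the conservative behavior that mitigates over-correction.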

[NLP-16] Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

【速读】: 该论文旨在系统评估自动评估指标(AEMs)与LLM-as-a-judge范式在文学翻译质量评价中的有效性,核心问题在于这些工具能否替代专业人工标注,准确衡量翻译的创造性(creative shifts and errors)。研究通过构建覆盖三种翻译模式(人工、机器、后编辑)、三种文学体裁及三个语言对的细粒度数据集,由经验丰富的专业译者对创造性进行标注,并与自动评估结果进行相关性分析。解决该问题的关键在于:当前工具普遍与专业评估存在低相关性,且LLM-as-a-judge对机器翻译文本存在系统性偏好,对创造性及文化适切性解决方案反而施以惩罚,尤其在诗歌等高文学性体裁中表现更差。因此,文献明确指出解决方案的关键是设计新型评估工具,使其不再将偏离常规的翻译(out-of-routine translations)简单视为错误,而是能够识别并奖励创造性翻译行为。

链接: https://arxiv.org/abs/2605.13596
作者: Kyo Gerrits,Rik van Noord,Ana Guerberof Arenas
机构: Centre for Language and Cognition, University of Groningen (语言与认知中心,格罗宁根大学)
类目: Computation and Language (cs.CL)
备注: This paper has been accepted to the EAMT Conference 2026 in Tilburg on June 15-18 2026

点击查看摘要

Abstract:This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation quality and creativity (creative shifts and errors), and to see whether they can substitute for laborious manual annotations. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations on creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not routinely consider out-of-routine translations as errors.

[NLP-17] Inducing Artificial Uncertainty in Language Models

【速读】: 该论文试图解决在安全关键应用中,语言模型(Language Models)难以通过监督式不确定性量化(Uncertainty Quantification)方法有效表征其不确定性置信度的问题。核心困难在于:大型语言模型(LLMs)因训练数据饱和而难以获取足够具有挑战性的新数据,若模型在训练数据上持续保持正确且高置信度,则传统方法会在陌生数据上过度估计置信度。解决方案的关键在于:引入“人工不确定性诱导”(Inducing Artificial Uncertainty)机制,在训练阶段缺乏困难数据时,利用琐碎简单数据(trivially easy data)人为制造不确定性,并训练探针(Probes)识别这种人工不确定性。实验表明,此类探针在识别真实不确定性(Real Uncertainty)时,能显著提升困难数据上的校准性能(Calibration),同时几乎不损失简单数据上的准确率。

链接: https://arxiv.org/abs/2605.13595
作者: Sophia Hager,Simon Zeng,Nicholas Andrews
机构: Johns Hopkins University(约翰·霍普金斯大学); Microsoft(微软)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In safety-critical applications, language models should be able to characterize their uncertainty with meaningful probabilities. Many uncertainty quantification approaches require supervised data; however, finding suitable unseen challenging data is increasingly difficult for large language models trained on vast amounts of scraped data. If the model is consistently (and correctly) confident in its predictions, the uncertainty quantification method may consistently overestimate confidence on new and unfamiliar data. Finding data which exhibits enough uncertainty to train supervised uncertainty quantification methods for high-performance models may therefore be challenging, and will increase in difficulty as LLMs saturate datasets. To address this issue, we first introduce the problem of inducing artificial uncertainty in language models, then investigate methods of inducing artificial uncertainty on trivially easy data in the absence of challenging data at training time. We use probes trained to recognize artificial uncertainty on the original model, and find that these probes trained on artificial uncertainty outperform probes trained without artificial uncertainty in recognizing real uncertainty, achieving notably higher calibration on hard data with minimal loss of performance on easy data.

[NLP-18] Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models

【速读】: 该论文试图解决传统PII(Personally Identifiable Information)编辑中,使用[PERSON]等占位符替换实体后严重破坏下游任务(如信息检索与NER训练)效用的问题。解决方案的关键在于构建一个全设备端流水线:首先利用1.5B参数的混合专家(Mixture-of-Experts)token分类器(openai/privacy-filter)检测PII跨度,然后由1-bit Bonsai-1.7B小语言模型(SLM)为姓名、地址、日期生成上下文相关的、类型保留的替代值,并辅以基于规则生成器(faker)处理模式化字段。研究中发现,提示工程(prompting)比量化选择更为关键——通过采用基于区域(locale)条件的旋转式小样本演示(rotating few-shot demonstrations),即利用字符范围启发式选择纯区域池并结合输入MD5哈希抽取三个演示,成功解决了SLM机械复现演示输出的问题(482/482次调用无回声),生成了区域正确的替代数据。但下游NER实验揭示了一个诚实负结果:SLM替代文本虽更自然,却导致训练分布多样性不足,而NER模型受益于多样性甚于自然性,因此纯faker的F1(0.656)反而优于混合流水线(0.346),这提示了方案在实用场景下的关键权衡。

链接: https://arxiv.org/abs/2605.13538
作者: Anuj Sadani,Deepak Kumar
机构: Infrrd.ai
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages

点击查看摘要

Abstract:Personally Identifiable Information (PII) redaction usually replaces detected entities with placeholder tokens such as [PERSON], destroying the downstream utility of the redacted text for retrieval and Named Entity Recognition (NER) training. We propose a fully on-device pipeline that substitutes PII with consistent, type-preserving fake values: a 1.5B mixture-of-experts token classifier (openai/privacy-filter) detects spans, a 1-bit Bonsai-1.7B Small Language Model (SLM) proposes contextual surrogates for names, addresses, and dates, and a rule-based generator (faker) handles patterned fields. We report a prompting finding more important than the quantization choice: with naive fixed three-shot demonstrations, the 1-bit SLM regurgitates demonstration outputs verbatim regardless of input; 1.58-bit Ternary-Bonsai-1.7B reproduces byte-identical failures, ruling out quantization as the cause. We fix this with locale-conditioned rotating few-shot demonstrations: a character-range heuristic picks a locale-pure pool and a per-input MD5 hash samples three demonstrations. With the fix, 482/482 unique Bonsai-1.7B calls succeed (no echoes) and produce locale-correct surrogates, although the SLM still copies from a small same-locale demonstration pool - a residual narrowness we quantify. On a 2000-document multilingual corpus, hybrid perplexity (PPL) beats faker in all six locales under a multilingual evaluator (XGLM-564M); length preservation is best-of-three in 4 of 6 locales. On downstream NER (400 train / 100 test, English), redact yields F1=0.000, faker 0.656, original 0.960; on a matched 160/40 subset including hybrid, faker (0.506) outperforms hybrid (0.346) at p < 0.001. We report this as an honest negative finding: SLM surrogates produce more natural text but a less varied training distribution, and downstream NER benefits more from variety than from naturalness.
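The rotating few-shot mechanism described above can be sketched directly: a character-range heuristic routes the input to a locale-pure demonstration pool, and an MD5 hash of the input deterministically picks three demonstrations. The pools, the umlaut-based heuristic, and the stride constant below are illustrative assumptions, not the paper's data.

```python
import hashlib

DEMO_POOLS = {  # hypothetical locale-pure pools; the paper's pools differ
    "en": ["Alice -> Emma Clark", "Bob -> Liam Turner", "Carol -> Mia Brooks",
           "Dan -> Noah Reed", "Eve -> Ava Hayes"],
    "de": ["Hans -> Jonas Weber", "Greta -> Lena Koch", "Fritz -> Max Braun",
           "Anna -> Mia Vogel", "Karl -> Paul Frank"],
}

def detect_locale(text):
    """Toy character-range heuristic: German umlauts/eszett imply 'de'."""
    return "de" if any(ch in "äöüßÄÖÜ" for ch in text) else "en"

def rotating_few_shot(text, k=3):
    """Pick k demonstrations deterministically per input via an MD5 hash,
    so different inputs see different (locale-pure) demonstrations and
    the SLM cannot latch onto one fixed demo set."""
    pool = DEMO_POOLS[detect_locale(text)]
    seed = int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)
    return [pool[(seed + i * 7919) % len(pool)] for i in range(k)]  # 7919: arbitrary stride
```

Because the hash is a pure function of the input, the same document always receives the same three demonstrations, while distinct documents rotate through the pool.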

[NLP-19] Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

【速读】: 该论文主要解决推理时对齐技术(inference-time alignment techniques)在应对多个生成式奖励模型(generative reward models)集成时出现的泛化不足以及奖励破解(reward hacking)问题。现有理论将此类方法视为从最优倾斜于单一奖励模型的分布中采样的近似,但实际部署中奖励目标和模型偏好可能动态变化,且单一奖励模型易被利用。论文提出通过引入参考模型温度调整(reference-model temperature adjustment)来扩展现有框架,将推理时对齐推广至通过锐化对数意见池(sharpened logarithmic opinion pool, SLOP)组合的生成式奖励模型集成。解决方案的关键在于两方面:其一,利用SLOP将多个奖励模型以对数加权形式融合,并通过温度参数锐化后验分布,从而实现更灵活且鲁棒的偏好对齐;其二,为缓解奖励破解,设计了一种校准SLOP权重参数的算法,实验证明该方法能在保持对齐性能的同时显著提升鲁棒性。

链接: https://arxiv.org/abs/2605.13537
作者: Ye Wang,Jing Liu,Toshiaki Koike-Akino
机构: Mitsubishi Electric Research Laboratories (MERL) (三菱电机研究实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to further generalization of inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
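The sharpened logarithmic opinion pool (SLOP) named above has a simple core: a weighted sum of the ensemble members' log-probabilities, scaled by a sharpening (inverse-temperature) factor and renormalized. The sketch below shows that combination over a toy vocabulary; the paper's weight-calibration algorithm is not reproduced here.

```python
import math

def slop_logprobs(logp_list, weights, sharpen=1.0):
    """Sharpened logarithmic opinion pool over an ensemble's next-token
    log-probabilities: pooled(t) ∝ exp(sharpen * Σ_m w_m * logp_m(t)),
    renormalized over the vocabulary."""
    vocab = len(logp_list[0])
    pooled = [sharpen * sum(w * lp[t] for w, lp in zip(weights, logp_list))
              for t in range(vocab)]
    logz = math.log(sum(math.exp(x) for x in pooled))
    return [x - logz for x in pooled]
```

With identical members and unit sharpening the pool recovers the original distribution; raising the sharpening factor concentrates mass on the top token, which is the "sharpened" part of SLOP.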

[NLP-20] Many-Shot CoT-ICL: Making In-Context Learning Truly Learn ICML2026

【速读】: 该论文旨在解决现有对多示例上下文学习(many-shot ICL)缩放行为的理解主要基于非推理任务,而针对推理任务的多示例思维链上下文学习(many-shot CoT-ICL)表现出的缩放效应与标准规则不符的问题。论文通过系统实验揭示了三个关键发现:设置依赖的缩放效应(非推理类大语言模型在增加CoT示例时性能不稳定,而推理导向模型受益更多)、基于语义相似性的检索在推理任务上失效(因为语义相似性无法预测过程兼容性)、以及顺序缩放效应(更多CoT示例导致性能方差增大)。解决方案的关键在于将多示例CoT-ICL重新解释为上下文中的测试时学习(in-context test-time learning)而非缩放的模式匹配,并据此提出两项原则:示例应当易于目标模型理解,且示例顺序应支持平滑的概念递进。基于这两项原则,论文提出了曲线演示选择(Curvilinear Demonstration Selection, CDS)方法,通过简单的排序策略在几何推理任务上使用64个示例时获得了高达5.42个百分点的性能提升,从而将长上下文窗口从检索缓冲区重构为一种结构化的课程学习框架。

链接: https://arxiv.org/abs/2605.13511
作者: Tsz Ting Chung,Lemao Liu,Mo Yu,Dit-Yan Yeung
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:In-context learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understanding of its scaling behavior is largely derived from non-reasoning tasks. We study many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning and show that standard many-shot rules do not transfer. Across non-reasoning and reasoning-oriented LLMs and across non-reasoning and reasoning tasks, we find: (i) a setting-dependent scaling effect, where increasing the number of CoT demonstrations is unstable for non-reasoning LLMs and benefits mainly reasoning-oriented LLMs; (ii) similarity-based retrieval helps on non-reasoning tasks but fails on reasoning, since semantic similarity poorly predicts procedural (i.e., CoT) compatibility; and (iii) an order-scaling effect, where performance variance grows with more CoT demonstrations. We interpret these behaviors by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggest two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by these principles, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on geometry with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.
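The abstract does not spell out the exact CDS criterion, so as a hedged illustration of its two principles (model-easy demonstrations, smoothly progressing order), the sketch below starts from the easiest demonstration and greedily chains each next one by similarity to its predecessor. Both `difficulty` and `similarity` are caller-supplied stand-ins for model-based scores.

```python
def smooth_progression(demos, difficulty, similarity):
    """Order demonstrations easy-first, then chain by similarity so each
    demo is conceptually close to its predecessor: an approximation of
    a 'smooth conceptual progression', not the paper's exact method."""
    remaining = list(demos)
    current = min(remaining, key=difficulty)  # principle (i): start easy
    order = [current]
    remaining.remove(current)
    while remaining:  # principle (ii): smallest conceptual jump next
        current = max(remaining, key=lambda d: similarity(order[-1], d))
        order.append(current)
        remaining.remove(current)
    return order
```

On a toy 1-D problem where difficulty is the value itself and similarity is negative distance, the chain recovers the sorted (easy-to-hard) curriculum.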

[NLP-21] R2-Mem: Reflective Experience for Memory Search

【速读】: 该论文旨在解决现有深度搜索智能体(deep search agents)在记忆系统中因无法从历史高、低质量搜索轨迹中学习而重复错误行为的问题。其解决方案的关键在于提出R²-Mem,一种面向记忆搜索系统的反思式经验框架(reflective experience framework):在离线阶段,通过基于评分准则的评估器(Rubric-guided Evaluator)对历史轨迹中的高质量与低质量步骤进行评分,并利用自我反思学习器(self-Reflection Learner)蒸馏出对应的抽象经验;在在线推理阶段,检索到的经验能够引导后续搜索动作,从而避免重复错误并维持高质量行为。该框架在无需强化学习(RL-free)且低成本的前提下实现了智能体的自我改进。

链接: https://arxiv.org/abs/2605.13486
作者: Xinyuan Wang,Wenyu Mao,Junkang Wu,Xiang Wang,Xiangnan He
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Deep search has recently emerged as a promising paradigm for enabling agents to retrieve fine-grained historical information without heavy memory pre-management. However, existing deep search agents for memory systems repeat past error behaviors because they fail to learn from prior high- and low-quality search trajectories. To address this limitation, we propose R^2-Mem, a reflective experience framework for memory search systems. In the offline stage, a Rubric-guided Evaluator scores low- and high-quality steps in historical trajectories, and a self-Reflection Learner distills the corresponding abstract experience. During online inference, the retrieved experience guides future search actions to avoid repeating mistakes and maintain high-quality behaviors. Extensive experiments demonstrate that R^2-Mem consistently improves both effectiveness and efficiency over strong baselines, improving F1 scores by up to 22.6%, while reducing token consumption by 12.9% and search iterations by 20.2%. These results verify that R^2-Mem provides an RL-free and low-cost solution for self-improving LLM agents.

[NLP-22] Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

【速读】: 该论文试图解决在固定上下文窗口的Transformer中,序列的不同无损表示(如字节、字符或子词标记)如何影响有限上下文预测器性能的问题。解决方案的关键在于建立一个有限上下文信息论框架,通过两个互补现象来阐明表示选择的本质影响:一是揭示了“碎片化”(fragmentation)现象,即使用更小表示单元(如字节、字符)会导致最优有限上下文对数损失(log-loss)严格增加,且这种差距是表示本身固有的,而非优化或容量问题,从而理论解释了字节级模型(如ByT5、CANINE)相对于子词分词模型的性能差距;二是针对贪婪分词(如BPE、WordPiece)的反向现象,证明将源符号分组为更大单元可使短标记窗口等效于更长的源上下文窗口,并给出了一个由标记窗口可靠覆盖源历史的程度及分词器压缩率共同决定的损失保证。该框架为Transformer中的表示选择提供了信息论层面的理论基础。

链接: https://arxiv.org/abs/2605.13485
作者: Amirmehdi Jafari Fesharaki,Mohammadamin Rami,Aslan Tchamkerten
机构: Institut Polytechnique de Paris (巴黎综合理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)
备注: 30 pages, 9 figures. Preprint

点击查看摘要

Abstract:Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or capacity issue, but can be intrinsic to the representation. This gives a theoretical account of the finite-context gap observed in byte- and character-level models such as ByT5 and CANINE relative to subword-tokenized models. Second, we study the opposite direction: greedy tokenization – BPE, WordPiece, and related methods – which groups source symbols into larger units. We show that tokenization can make a short token window behave like a longer source-context window, and we give a loss guarantee describing when this is achievable. The guarantee depends on how reliably token windows span the needed source history, together with the compression rate of the tokenizer. This also yields a simple diagnostic for real tokenizers: measuring how much source context a fixed token window reliably contains. Together, the two directions establish a finite-context information-theoretic framework for reasoning about representation choices in Transformers.

[NLP-23] PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents

【速读】: 该论文试图解决现有基于知识图谱的检索增强生成(Graph Retrieval-Augmented Generation, GraphRAG)方法在复杂查询处理中缺乏动态性和多阶段自适应能力的问题,具体表现为对信息检索的深度和精度不足,导致大语言模型(Large Language Model, LLM)易产生幻觉。解决方案的关键在于PersonalAI 2.0(PAI-2)框架设计的动态、多阶段查询处理管线,其核心是通过提取实体、匹配图顶点和生成线索查询(clue-queries)来驱动自适应、迭代的信息搜索,同时利用图遍历算法(如BeamSearch、WaterCircles)和搜索计划增强机制,显著提升检索的相关性和答案的事实正确性,从而在多个基准上实现平均4%至18%的性能提升。

链接: https://arxiv.org/abs/2605.13481
作者: Mikhail Menschikov,Matvey Iskornev,Alexander Kharitonov,Alina Bogdanova,Mikhail Belkin,Ekaterina Lisitsyna,Artyom Sosedka,Victoria Dochkina,Ruslan Kostoev,Ilia Perepechkin,Evgeny Burnaev
机构: Skoltech(斯科尔科沃科技学院); SberAI(斯伯AI); Huawei(华为); Sber(斯伯); Public joint stock company ”Sberbank of Russia”(俄罗斯联邦储蓄银行公开股份公司); AIRI(人工智能研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce PersonalAI 2.0 (PAI-2), a novel framework designed to enhance large language model (LLM) based systems through the integration of external knowledge graphs (KG). The proposed approach addresses key limitations of existing Graph Retrieval-Augmented Generation (GraphRAG) methods by incorporating a dynamic, multistage query-processing pipeline. The central point of the PAI-2 design is its ability to perform adaptive, iterative information search, guided by extracted entities, matched graph vertices, and generated clue-queries. Evaluation over six benchmarks (Natural Questions, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue and DiaASQ) demonstrates improved factual correctness of generated answers compared to analogous methods (LightRAG, RAPTOR, and HippoRAG 2). PAI-2 achieves a 4% average gain by LLM-as-a-Judge across four benchmarks, reflecting its effectiveness in reducing hallucination rates and increasing precision. We show that graph traversal algorithms (e.g. BeamSearch, WaterCircles) outperform a standard flat retriever by 6% on average, while enabling the search-plan enhancement mechanism yields an 18% boost over the disabled setting by LLM-as-a-Judge across six datasets. In addition, an ablation study reveals that PAI-2 achieves the SOTA result on the MINE-1 benchmark, reaching an 89% information-retention score using LLMs from the 7-14B tier. Collectively, these findings underscore the potential of PAI-2 to serve as a foundational model for next-generation personalized AI applications requiring scalable, context-aware knowledge representation and reasoning capabilities.

[NLP-24] OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

【速读】: 该论文试图解决线性注意力与状态空间模型(如 DeltaNet)在上下文关联回忆(in-context associative recall)任务中表现欠佳的问题——DeltaNet 虽通过单步在线梯度下降逐 token 写入来缓解此问题,但其步长仅依赖单一标量门,忽略了内部目标(inner objective)的特征方向曲率,导致收敛效率与回忆精度不足。解决方案的关键在于提出 Online Scaled DeltaNet(OSDN),通过超梯度反馈在线更新一个对角预条件器(diagonal preconditioner)来增强标量门,从而实现对每特征曲率的自适应缩放;其核心创新是证明该右预条件(right-preconditioning)在代数上等价于对写入侧键(write-side key)的每特征缩放(per-feature scaling),因此能严格保留 DeltaNet 的硬件友好分块并行流水线,不引入高维状态开销。此外,论文引入自适应预条件器遗忘(Adaptive Preconditioner Forgetting, APF)动态刷新陈旧校准,以处理非平稳上下文。理论分析利用内部回归损失的精确二次结构建立了超几何收敛界与 token 局部残差收缩界,实验表明 OSDN 在 340M 参数规模下将上下文回忆提升 32%,在 1.3B 参数下回忆残差比降低 39%,同时保持通用下游任务性能持平,验证了该在线预条件机制的可扩展性与有效性。

链接: https://arxiv.org/abs/2605.13473
作者: Chenyu Zhou,Hongpei Li,Yuerou Liu,Jianghao Lin,Dongdong Ge,Yinyu Ye
机构: Shanghai Jiao Tong University (上海交通大学); Northwestern University (西北大学); Huazhong University of Science and Technology (华中科技大学); Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish super-geometric convergence against a right-Newton comparator and prove an algorithm-aligned token-local residual contraction bound. To handle non-stationary contexts, we further introduce Adaptive Preconditioner Forgetting (APF) to dynamically refresh stale calibration. Empirically, OSDN demonstrates strong performance across scales. At the 340M-parameter scale, OSDN improves JRT-style in-context recall by 32% over DeltaNet. Scaling to 1.3B parameters, it achieves a 39% reduction in the recall residual ratio while maintaining parity on general downstream tasks (e.g., perplexity and LongBench) – demonstrating that our online-preconditioning mechanism effectively transfers and amplifies at the billion-parameter scale.
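The key equivalence stated in the abstract can be made concrete in a toy delta-rule write: right-preconditioning with a diagonal matrix is the same as scaling the write-side key per feature. The sketch below shows a single dense update on small lists; the gate, shapes, and chunkwise kernel of the actual model are omitted.

```python
def delta_write(S, k, v, beta, precond):
    """One DeltaNet-style write into state S (d_k x d_v, list of rows):
    S += beta * (precond * k) (v - S^T k)^T, where the diagonal
    preconditioner 'precond' is realized as per-feature key scaling."""
    dk, dv = len(S), len(S[0])
    pred = [sum(S[i][j] * k[i] for i in range(dk)) for j in range(dv)]
    err = [vj - pj for vj, pj in zip(v, pred)]  # value residual
    k_scaled = [p * ki for p, ki in zip(precond, k)]
    for i in range(dk):
        for j in range(dv):
            S[i][j] += beta * k_scaled[i] * err[j]
    return S

def read(S, k):
    """Associative read: S^T k."""
    return [sum(S[i][j] * k[i] for i in range(len(S)))
            for j in range(len(S[0]))]
```

With a unit key, unit gate, and identity preconditioner, one write makes the subsequent read return the stored value exactly; shrinking a preconditioner entry shrinks the step along that feature, which is the per-feature curvature control the scalar gate alone cannot express.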

[NLP-25] PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning CVPR2026

【速读】: 该论文试图解决在多模态视觉-语言推理(Vision-Language Reasoning)中,传统强化学习可验证奖励(RLVR)方法因任务异质性(稀疏视觉感知与密集文本推理混合)而导致的训练信号退化问题——具体而言,全局归一化使视觉步骤的置信度增长信号被占主导的文本步骤统计扭曲,从而无法提供有效的逐步指导。解决方案之关键在于提出感知分解置信奖励(PDCR),通过无监督技能分解,先基于模型内在的视觉依赖分数(Visual Dependence Score)量化各步骤对视觉信息的依赖程度,再运用聚类算法将步骤划分为感知簇与推理簇;随后在每个簇内独立归一化置信度增益,计算分解优势(decomposed advantage),从而为感知和推理步骤提供稳定且尺度正确的训练信号,避免混合归一化导致的信号失真。

链接: https://arxiv.org/abs/2605.13467
作者: Hee Suk Yoon,Eunseop Yoon,Ji Woo Hong,SooHwan Eom,Gwanhyeong Koo,Mark Hasegawa-Johnson,Qi Dai,Chong Luo,Chang D. Yoo
机构: Korea Advanced Institute of Science and Technology (KAIST) (韩国科学技术院); University of Illinois at Urbana-Champaign (UIUC) (伊利诺伊大学厄巴纳-香槟分校); Microsoft Research Asia (MSRA) (微软亚洲研究院)
类目: Computation and Language (cs.CL)
备注: CVPR 2026

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task’s heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
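The intra-cluster normalization at the heart of PDCR can be sketched as follows: given per-step confidence gains and a perception/reasoning split (here supplied directly rather than derived from the Visual Dependence Score), each cluster is z-scored on its own, so sparse visual steps are not swamped by the denser textual steps.

```python
def decomposed_advantage(gains, is_visual):
    """Normalize confidence gains separately within the perception and
    reasoning clusters (z-score per cluster), returning a per-step
    advantage in the original step order."""
    def znorm(xs):
        mu = sum(xs) / len(xs)
        sd = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
        return [(x - mu) / sd if sd > 0 else 0.0 for x in xs]

    vis = [g for g, v in zip(gains, is_visual) if v]
    txt = [g for g, v in zip(gains, is_visual) if not v]
    vis_it = iter(znorm(vis) if vis else [])
    txt_it = iter(znorm(txt) if txt else [])
    return [next(vis_it) if v else next(txt_it) for v in is_visual]
```

In the test below, the two visual gains (0.1 vs 0.2) would be nearly indistinguishable under a single global z-score dominated by the large textual gains; per-cluster normalization restores a full ±1 contrast between them.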

[NLP-26] LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking

【速读】: 该论文旨在解决生物医学实体链接(Biomedical Entity Linking)中现有系统独立处理每个提及(mention)、忽略同一文档内提及间依赖关系,从而导致预测不一致的问题,尤其是当同一概念以不同表面形式出现时。解决方案的关键在于提出了一个文档级生成框架LongBEL,该框架将全文档上下文与先前预测的记忆(memory)相结合;为使记忆具有鲁棒性,LongBEL采用交叉验证预测(cross-validated predictions)而非真实标签进行训练,从而减少训练与推断之间的不匹配并限制级联错误(cascading errors)。实验表明,LongBEL在英语、法语和西班牙语的五个生物医学基准上均优于句子级生成基线,其最大提升出现在概念频繁重复的文档中,证实了该方法主要改善文档级一致性(document-level consistency)而非孤立提及的消歧。

链接: https://arxiv.org/abs/2605.13451
作者: Adam Remaki,Xavier Tannier,Christel Gérardin
机构: Sorbonne Université(索邦大学); Inserm(法国国家健康与医学研究院); Université Sorbonne Paris Nord(索邦巴黎北部大学); Limics(医学信息与健康人工智能实验室); Service de médecine interne(内科部门); Hôpital Tenon(Tenon医院); Assistance Publique - Hôpitaux de Paris(巴黎公立医院集团)
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Biomedical entity linking maps textual mentions to concepts in structured knowledge bases such as UMLS or SNOMED CT. Most existing systems link each mention independently, using only the mention or its surrounding sentence. This ignores dependencies between mentions in the same document and can lead to inconsistent predictions, especially when the same concept appears under different surface forms. We introduce LongBEL, a document-level generative framework that combines full-document context with a memory of previous predictions. To make this memory robust, LongBEL is trained with cross-validated predictions rather than gold labels, reducing the mismatch between training and inference and limiting cascading errors. Experiments on five biomedical benchmarks across English, French, and Spanish show that LongBEL improves over sentence-level generative baselines, with the largest gains on datasets where concepts frequently recur within documents. An ensemble of local, global, and memory-based variants achieves the best results across all benchmarks. Further analysis shows that the largest gains occur on recurring concepts, suggesting that LongBEL mainly improves document-level consistency rather than isolated mention disambiguation.
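The memory mechanism described above can be sketched in a few lines: mentions are linked left to right, and a memory of earlier predictions in the same document is consulted before the base linker, so a recurring surface form always resolves to the same concept. The stand-in base linker and the UMLS-style concept IDs below are illustrative, not the paper's generative model.

```python
def link_document(mentions, base_linker):
    """Link mentions in document order; 'memory' maps previously seen
    surface forms to concept IDs and is consulted before base_linker,
    enforcing document-level consistency for recurring mentions."""
    memory, linked = {}, []
    for mention in mentions:
        key = mention.lower()
        concept = memory.get(key)
        if concept is None:
            concept = base_linker(mention, memory)  # may also inspect memory
            memory[key] = concept
        linked.append((mention, concept))
    return linked
```

Repeated mentions never reach the base linker a second time, which both enforces consistency and mirrors how LongBEL conditions generation on its prediction memory.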

[NLP-27] Cognifold: Always-On Proactive Memory via Cognitive Folding

【速读】: 该论文旨在解决现有智能体记忆(agent memory)系统主要依赖反应式检索、缺乏自主将经验组织为持久认知结构(persistent cognitive structure)能力的问题。其解决方案的关键在于提出Cognifold——一种受大脑启发的“始终在线”记忆架构,通过将互补学习系统(Complementary Learning Systems, CLS)理论从传统的海马体-新皮层两层结构扩展为三层,新增前额叶意图层(prefrontal intent layer),模仿前额叶皮层在意图控制与决策中的功能。Cognifold借助图拓扑自组织(graph-topology self-organization)机制实现认知结构的涌现:在事件流中主动组装结构,语义相似时合并,过时时衰减,通过联想检索(associative recall)重新链接,并在概念簇密度超过阈值时浮现意图。这种持续、自主的认知结构形成过程使得记忆系统能够从传入事件和累积知识中自举出更高层次的认知能力,从而支撑下一代主动型智能体。

链接: https://arxiv.org/abs/2605.13438
作者: Suli Wang,Yiqun Duan,Yu Deng,Rundong Zhao,Dai Shi,Xinliang Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce Cognifold, a brain-inspired “always-on” agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across 7 broad-coverage benchmarks spanning five cognitive domains, we validate that CogniFold simultaneously performs robustly on conventional memory benchmarks.
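The folding dynamics sketched in the abstract (merge when similar, decay when stale, surface an intent when a cluster grows dense) can be illustrated with a toy memory. The Jaccard similarity, the thresholds, and the flat node structure below are illustrative assumptions, not the paper's graph-topology mechanism.

```python
class FoldingMemory:
    """Toy always-on memory: each observed event merges into the most
    similar concept node or opens a new one; stale nodes decay; nodes
    dense enough surface as 'intents'."""

    def __init__(self, merge_sim=0.5, stale_after=10, intent_density=3):
        self.nodes = []  # each: {"words": set, "count": int, "last": int}
        self.step = 0
        self.merge_sim = merge_sim
        self.stale_after = stale_after
        self.intent_density = intent_density

    @staticmethod
    def _jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def observe(self, event):
        """Fold one event into the memory; return surfaced intents."""
        self.step += 1
        words = set(event.lower().split())
        best = max(self.nodes, key=lambda n: self._jaccard(n["words"], words),
                   default=None)
        if best and self._jaccard(best["words"], words) >= self.merge_sim:
            best["words"] |= words
            best["count"] += 1
            best["last"] = self.step
        else:
            self.nodes.append({"words": words, "count": 1, "last": self.step})
        # decay: drop nodes untouched for too long
        self.nodes = [n for n in self.nodes
                      if self.step - n["last"] < self.stale_after]
        return [n for n in self.nodes if n["count"] >= self.intent_density]
```

Three related events about the same chore fold into one node, and the intent surfaces only once the cluster density crosses the threshold, without any explicit query.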

[NLP-28] Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP

【速读】: 该论文试图解决子词正则化方法(如BPE dropout)通常在微调(fine-tuning)阶段才被应用,而预训练(pretraining)阶段使用确定性分词所导致的预训练与微调之间的分割不匹配(segmentation mismatch)问题,这种不匹配在低资源NLP场景下可能损害下游性能。解决方案的关键在于在预训练阶段也应用随机分词(stochastic tokenization)即BPE dropout,使得模型在预训练和微调时都暴露于多样化的分词方式,从而更一致地获得形态上更对齐(morphologically aligned)的分割表示,尤其是对罕见词(rare words)和低资源数据场景。实验表明,只有当BPE dropout同时应用于预训练和微调时,才能稳定提升下游任务性能,而仅在微调阶段使用可能在小数据设置下劣于确定性分词;此外,在微调时选择性引入形态对齐分割主要对未使用预训练BPE dropout的模型有效,进一步证实预训练阶段的随机分词对形成更好的组合表示(compositional representations)至关重要。

链接: https://arxiv.org/abs/2605.13436
作者: Ruan Visser,Trienko Grobler,Marcel Dunaiski
机构: Department of Computer Science (计算机科学系), Stellenbosch University (斯泰伦博斯大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 12 pages, 8 figures, 5 tables

点击查看摘要

Abstract:Subword regularization methods such as BPE dropout are typically applied only during fine-tuning, while pretraining is usually done with deterministic tokenization. This creates a potential segmentation mismatch between pretraining and fine-tuning. We investigate whether applying BPE dropout during pretraining improves downstream performance in low-resource NLP. We train monolingual and bilingual BERT models on downsampled subsets of English, German, French, Spanish, Kiswahili, and isiXhosa, and evaluate them on XNLI, PAWS-X, PAN-X, and MasakhaNER 2.0. Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings. This disadvantage diminishes as fine-tuning data increases, while the benefits of pretraining-time BPE dropout are largest when either pretraining or fine-tuning data is scarce. The benefits of BPE dropout are often attributed to better compositional representations, especially for rare words. To examine this, we measure morphological boundary alignment under BPE dropout and find only modest improvements in expected alignment, while better-aligned segmentations remain rare. This suggests that fine-tuning alone may provide limited exposure to such segmentations, whereas stochastic tokenization during pretraining exposes the model to them more consistently. We further show that selectively introducing morphologically aligned segmentations during fine-tuning improves performance mainly for models pretrained without BPE dropout. Overall, these findings suggest that exposure to better-aligned segmentations may contribute to the downstream benefits of applying BPE dropout during pretraining. 

[NLP-29] TokAlign: Advancing Vocabulary Adaptation via Better Token Alignment

【速读】: 该论文试图解决大语言模型(LLMs)中因词元化(tokenization)效率低下导致的训练与推理速度下降,以及因词汇表不匹配(vocabulary mismatch)而阻碍细粒度知识迁移(如token-level蒸馏)的问题。解决方案的关键在于提出TokAlign++方法,该方法将源词汇表与目标词汇表视为两种不同语言,通过学习双语词元对齐词典(bilingual token alignment lexicon)来实现跨词汇表的词元映射;具体而言,从单语词元表示中学习对齐关系,依据该词典重新排列模型参数以适应新词汇表,并辅以渐进式微调(progressive fine-tuning),从而在极低开销(如1k步)下恢复模型性能,最终统一词汇表后显著提升token-level蒸馏效果。

链接: https://arxiv.org/abs/2605.13429
作者: Chong Li,Yingzhuo Deng,Wen Yang,Jiajun Zhang,Chengqing Zong
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS(中国科学院自动化研究所多模态人工智能系统国家重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)
类目: Computation and Language (cs.CL)
备注: Paper under review

点击查看摘要

Abstract:Tokenization is a foundational step in the text process of Large Language Models (LLMs). Texts must first be tokenized into token IDs, which are then input to LLMs. Inefficient tokenization results in long token-ID sequences and will slow down the training and inference of LLMs. The fine-grained knowledge transfer between LLMs, like token-level distillation, is also impeded by the mismatch in vocabulary. To bridge this gap, we introduce a method named TokAlign++ to improve vocabulary adaptation performance by learning a better token alignment lexicon. The source and target vocabularies are taken as two different languages, and the bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged following this bilingual lexicon for the new vocabulary, and progressively fine-tuned for adaptation. Experimental results on 15 languages show that our method boosts the multilingual text compression rates and preserves most of the multilingual ability of vanilla models. It costs as few as 1k steps to restore the performance of the vanilla model. After unifying vocabularies between vanilla models, token-level distillation remarkably improves the base model with only 235M tokens.
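摘要中"按双语词元对齐词典重排模型参数"这一步,可以用如下示意片段说明(对齐词典、词表规模与嵌入维度均为假设的玩具数值,仅演示按对齐关系初始化新词表嵌入的做法):

```python
import numpy as np

rng = np.random.default_rng(0)
src_emb = rng.normal(size=(6, 4))   # source-model embeddings: 6 tokens, dim 4

# Hypothetical alignment lexicon: new-vocab token id -> best-matching source id,
# e.g. learned from monolingual token representations (illustrative values).
alignment = {0: 2, 1: 2, 2: 5, 3: 0}

def init_target_embeddings(src_emb, alignment, target_vocab_size):
    """Seed a new vocabulary's embeddings by copying aligned source rows;
    unaligned tokens fall back to the mean source embedding."""
    tgt = np.tile(src_emb.mean(axis=0), (target_vocab_size, 1))
    for new_id, src_id in alignment.items():
        tgt[new_id] = src_emb[src_id]
    return tgt

tgt_emb = init_target_embeddings(src_emb, alignment, target_vocab_size=5)
print(tgt_emb.shape)   # (5, 4): new-vocab embedding table, ready for fine-tuning
```

初始化后的新嵌入表再经过论文所述的渐进式微调即可恢复原模型性能;此片段只演示"重排"这一初始化思路本身。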

[NLP-30] LIFT: Last-Mile Fine-Tuning for Table Explicitation

【速读】: 该论文试图解决从非结构化剪贴板文本中提取表格时,传统端到端微调方法需要大量训练数据且对输入格式变异性鲁棒性不足的问题。解决方案的关键在于提出“最后阶段微调”(last-mile fine-tuning, Lift)管道:首先利用预训练的大型语言模型(LLM)从非结构化文本中提取初始表格,然后使用一个微调的小型语言模型(SLM,参数规模1B-24B)专门修复初始表格中的错误。通过这种分解,只需约1000个训练示例即可在树编辑距离相似度(TEDS)指标上匹配或超越端到端微调,并在输入格式变化时表现出更强的鲁棒性。

链接: https://arxiv.org/abs/2605.13424
作者: Divij Khaitan,Ashish Tiwari
机构: Microsoft Corporation (微软公司)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 1 figure, 3 tables

点击查看摘要

Abstract:We propose last-mile fine-tuning, or Lift, a pipeline in which a pre-trained large language model extracts an initial table from unstructured clipboard text, and a fine-tuned small language model (1B-24B parameters SLM) repairs errors in the extracted table. On a benchmark of 2,596 tables from three datasets, Lift matches or exceeds end-to-end SLM fine-tuning on the tree-edit-distance-based similarity (TEDS) metric while requiring as little as 1,000 training examples - where it outperforms end-to-end fine-tuning by up to 0.144 TEDS points. We term this approach last-mile fine-tuning and show it is also more robust to input format variability. Comparisons with self-debug and end-to-end fine-tuning approaches show that last-mile fine-tuning provides an attractive option when training data is limited or when robustness to input variation is sought without compromising on accuracy.

[NLP-31] Continual Learning with Multilingual Foundation Model

【速读】: 该论文旨在解决多语言社交媒体话语中针对LGBTQ+相关污名词(slurs)的回收性(reclamatory)与非回收性(non-reclamatory)使用检测问题,具体面临数据稀缺(data scarcity)、类别不平衡(class imbalance)和跨语言情感表达差异(cross-linguistic variation in sentiment expression)这三个相互交织的方法论挑战。解决方案的关键在于构建一个多阶段框架,其核心要素包括:通过交叉验证进行数据驱动的模型选择,最终选定XLM-RoBERTa作为基础模型;利用GPT-4o-mini回译(back-translation)进行语义保持的数据增强,将训练语料扩展三倍并维持类别分布;结合动态epoch级欠采样(dynamic epoch-level undersampling)的归纳迁移学习(inductive transfer learning);通过掩码语言建模(masked language modeling)注入领域特定知识;以及基于ROC分析的语言特定决策阈值优化(language-specific decision threshold refinement),该优化可在无需重新训练模型的情况下实现2-5%的绝对F1提升。

链接: https://arxiv.org/abs/2605.13415
作者: Barathi Ganesh HB,Michal Ptaszynski,Rene Melendez,Juuso Eronen
机构: Text Information Processing Lab, Kitami Institute of Technology, Kitami, Hokkaido 090-0015, Japan (文本信息处理实验室,北见工业大学,北见,北海道 090-0015,日本)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Final Workshop of the 9th evaluation campaign EVALITA 2026

点击查看摘要

Abstract:This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges, namely data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for evaluation purposes, where RUN 1 is inductive transfer learning with augmentation and undersampling, RUN 2 adds masked language modeling pre-training, and RUN 3 and RUN 4 are the previous predictions refined via language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at this https URL.
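论文中"基于 ROC 分析优化语言特定决策阈值、无需重训模型即可提升 F1"的思路,可以用如下最小化示意实现(此处用遍历候选阈值、最大化 F1 来代替完整的 ROC 流程;分数与标签均为玩具数据):

```python
import numpy as np

def best_f1_threshold(scores, labels):
    """Sweep candidate decision thresholds over predicted scores and return
    the one maximizing F1 - a stand-in for the paper's ROC-based refinement."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):          # each observed score is a candidate
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Toy per-language dev-set scores: run this once per language to get
# language-specific thresholds, as the paper does.
scores = np.array([0.2, 0.4, 0.6, 0.8])
labels = np.array([0, 1, 0, 1])
t, f1 = best_f1_threshold(scores, labels)
print(t, f1)   # -> 0.4 0.8
```

对每种语言在各自验证集上单独执行一次,即可得到论文所说"随语言显著变化"的最优决策边界,而模型参数本身无需改动。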

[NLP-32] LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics ACL2026

【速读】: 该论文旨在解决现成的大语言模型(Large Language Models, LLMs)在自动化文本标注中的有效性未充分探索的问题,特别是在代表性不足的语言和需要精细专家理解的专门领域(如法律文本中的可信度评估)中。解决方案的关键在于:首先,作者构建了RAB-Cred数据集,这是一个高质量的丹麦语文本分类数据集,包含专家注释、注释者置信度和庇护案件结果等元数据;其次,系统性地对21个开源权重模型和30种系统-用户提示组合进行了零样本和少样本分类的基准测试,并深入分析顶级模型和提示的错误,包括错误一致性、类间混淆、与人类置信度的相关性以及样本难度和错误严重性。通过这一方法,论文不仅证实了LLMs在成本效益标注上的潜力,也揭示了其不完美和不一致的特性,强调不能依赖单一任意模型的预测。

链接: https://arxiv.org/abs/2605.13412
作者: Galadrielle Humblot-Renaux,Mohammad N. S. Jahromi,Rohat Bakuri-Jørgensen,Marieke Anne Heyl,Asta S. Stage Jarlner,Maria Vlachou,Anna Murphy Høgenhaug,Desmond Elliott,Thomas Gammeltoft-Hansen,Thomas B. Moeslund
机构: Visual Analysis and Perception Lab, Aalborg University (奥尔堡大学视觉分析与感知实验室); Pioneer Center for AI, Denmark (丹麦人工智能先锋中心); Center of Excellence for Global Mobility Law, University of Copenhagen (哥本哈根大学全球流动法律卓越中心); Department of Computer Science, University of Copenhagen (哥本哈根大学计算机科学系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the 20th Linguistic Annotation Workshop (LAW XX), co-located with ACL 2026 ( this https URL )

点击查看摘要

Abstract:Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at this https URL

[NLP-33] Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution

【速读】: 该论文试图解决现有大型语言模型(Large Language Models, LLMs)安全范式中两个核心问题:一是红队测试(red-teaming)与后训练(post-training)耦合在封闭、以策略为中心的循环中,导致攻击发现快速饱和,限制了新故障模式的暴露;二是防御机制效率低下、僵化且难以跨受害模型迁移。解决方案的关键在于构建一个围绕持久、可检查、可重用的外部结构(external structures)的框架EvoSafety。具体而言,对于红队测试,EvoSafety为攻击策略配备了一个对抗技能库(adversarial skill library),使得在攻击饱和后可通过简单的库扩展持续进行漏洞探测,并支持对抗向量的演化;对于防御学习,EvoSafety用轻量级的辅助防御模型(auxiliary defense model)配合内存检索(memory retrieval)替代了特定于模型的安全微调,从而实现了高效、可迁移且模型无关的安全性改进,仅通过内存更新即可增强鲁棒性,并且通过一次训练使防御策略能同时工作在Steer模式(激活受害模型内在防御机制)和Guard模式(直接过滤有害输入)下。

链接: https://arxiv.org/abs/2605.13411
作者: Xiaozhe Zhang,Chaozhuo Li,Hui Liu,Shaocheng Yan,Bingyu Yan,Qiwei Ye,Haoliang Li
机构: City University of Hong Kong (香港城市大学); Beijing University of Posts and Telecommunications (北京邮电大学); Wuhan University (武汉大学); Beihang University (北京航空航天大学); Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 48 pages, 7 figures

点击查看摘要

Abstract:Large language models remain vulnerable to adversarial prompts that elicit harmful outputs. Existing safety paradigms typically couple red-teaming and post-training in a closed, policy-centric loop, causing attack discovery to suffer from rapid saturation and limiting the exposure of novel failure modes, while leaving defenses inefficient, rigid, and difficult to transfer across victim models. To this end, we propose EvoSafety, an LLM safety framework built around persistent, inspectable, and reusable external structures. For red teaming, EvoSafety equips the attack policy with an adversarial skill library, enabling continued vulnerability probing through simple library expansion after saturation, while supporting the evolution of adversarial vectors. For defense learning, EvoSafety replaces model-specific safety fine-tuning with a lightweight auxiliary defense model augmented with memory retrieval. This enables efficient, transferable, and model-agnostic safety improvements, while allowing robustness to be enhanced solely through memory updates. With a single training procedure, the defense policy can operate in both Steer and Guard modes: the former activates the victim model’s intrinsic defense mechanisms, while the latter directly filters harmful inputs. Extensive experiments demonstrate the superiority of EvoSafety: in Guard mode, it achieves a 99.61% defense success rate, outperforming Qwen3Guard-8B by 14.13% with only 37.5% of its parameters, while preserving reasoning performance on benign queries. Warning: This paper contains potentially harmful text.

[NLP-34] From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks

【速读】: 该论文试图解决高中语言学竞赛中语言谜题创建过程复杂且耗时的问题,具体而言,是探讨如何高效地生成新的Match-Up格式谜题。解决方案的关键在于提出一个系统化的转换流程,能够将现有的Rosetta Stone谜题直接转换为对应的Match-Up谜题,从而利用已有资源加速新谜题的生成。此外,通过人类参与者和大型语言模型(LLMs)的评估,证实了转换后谜题的可行性,并揭示了专家解决者和LLMs在Match-Up谜题上均表现出“全有或全无”的模式,为跨格式谜题难度的评估和语言推理研究提供了新的数据集和视角。

链接: https://arxiv.org/abs/2605.13408
作者: Neh Majmudar,Anne Huang,Jinfan Frank Hu,Elena Filatova
机构: 未知
类目: Computation and Language (cs.CL)
备注: Proceedings of the Fifteenth Language Resources and Evaluation Conference

点击查看摘要

Abstract:In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the generation of new puzzles. We evaluate the resulting Rosetta Stone-Match-Up pairs with both human participants and large language models (LLMs). Our results show that both expert human solvers and LLMs display an all-or-nothing pattern on Match-Up puzzles, either solving them completely or failing entirely. This work contributes a new dataset of paired puzzles and provides a detailed evaluation of puzzle difficulty across formats, offering insights into both human and machine linguistic reasoning.

[NLP-35] Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing

【速读】: 该论文试图解决在自然语言理解中,成分句法分析(constituent parsing)任务上,预训练的编码器-解码器语言模型(如BART、mBART、T5)尚未被充分探索的问题。之前的研究主要使用预训练的仅编码器语言模型(如BERT、RoBERTa)初始化序列到序列(sequence-to-sequence)模型,但编码器-解码器结构的潜力未被挖掘。解决方案之关键在于扩展序列到序列框架,将成分句法分析视为线性化解析树(linearized parse trees)的生成问题,通过微调预训练的编码器-解码器架构,并系统评估不同线性化策略(linearization strategies)在连续(continuous)与非连续(discontinuous)树库上的表现,最终实现了优于所有先前序列到序列模型、且与领先的特定任务成分分析器(task-specific constituent parsers)相竞争的性能。

链接: https://arxiv.org/abs/2605.13373
作者: Daniel Fernández-González,Cristina Outeiriño Cid
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preliminary version

点击查看摘要

Abstract:To achieve deep natural language understanding, syntactic constituent parsing plays a crucial role and is widely required by many artificial intelligence systems for processing both text and speech. A recent approach involves using standard sequence-to-sequence models to handle constituent parsing as a machine translation problem, moving away from traditional task-specific parsers. These models are typically initialized with pre-trained encoder-only language models like BERT or RoBERTa. However, the use of pre-trained encoder-decoder language models for constituency parsing has not been thoroughly explored. To bridge this gap, we extend the sequence-to-sequence framework by investigating parsers built on pre-trained encoder-decoder architectures, including BART, mBART, and T5. We fine-tune them to generate linearized parse trees and extensively evaluate them on different linearization strategies across both continuous treebanks and more complex discontinuous benchmarks. Our results demonstrate that our approach outperforms all prior sequence-to-sequence models and performs competitively with leading task-specific constituent parsers on continuous constituent parsing.
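把成分树线性化为括号序列、再交给 seq2seq 模型生成,这一表示可以用极简代码示意(树的嵌套元组结构与括号格式均为示意,不对应论文评估的具体线性化策略):

```python
# Toy constituency tree as nested tuples (label, children...) - illustrative only.
tree = ("S",
        ("NP", ("PRP", "She")),
        ("VP", ("VBZ", "reads"), ("NP", ("NNS", "books"))))

def linearize(node):
    """Bracket-style linearization: the tree becomes a flat token sequence
    that an encoder-decoder model can emit one symbol at a time."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return f"({label} {children[0]})"          # pre-terminal + word
    return "(" + label + " " + " ".join(linearize(c) for c in children) + ")"

print(linearize(tree))
# (S (NP (PRP She)) (VP (VBZ reads) (NP (NNS books))))
```

训练时即以句子为输入、此类括号序列为目标输出对 BART/mBART/T5 进行微调;解码后再将括号序列还原为树并评估。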

[NLP-36] Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

【速读】: 该论文旨在解决传统显式记忆架构(如神经图灵机)在语言建模中因时间反向传播导致的灾难性梯度不稳定性而长期无法实用的问题。解决方案的关键在于提出相量记忆网络(Phasor Memory Network, PMNet),其通过单位相量动力学(Unitary Phasor Dynamics) 将循环状态更新约束为复单位圆上的相位旋转,从根本上保持梯度范数并固有地防止梯度发散,无需特殊初始化;同时引入层次化可学习锚点(Hierarchical Learnable Anchors) 构建可扩展的记忆树结构(例如85槽层次树),从而实现远超局部滑动窗口注意力感受野的长程精确检索,并通过合成Copy-Paste任务和梯度分析验证了结构对齐性对历史失败的决定性作用。

链接: https://arxiv.org/abs/2605.13370
作者: Sungwoo Goo,Hwi-yeol Yun,Sangkeun Jung
机构: College of Pharmacy, Chungnam National University (忠南大学药学院); Department of Computer Science Engineering, Chungnam National University (忠南大学计算机科学与工程系)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:For over a decade, explicit memory architectures like the Neural Turing Machine have remained theoretically appealing yet practically intractable for language modeling due to catastrophic gradient instability during Backpropagation Through Time. In this work, we break this stalemate with Phasor Memory Network (PMNet), a novel architecture that structurally resolves memory volatility through Unitary Phasor Dynamics and Hierarchical Learnable Anchors. Rather than relying on brute-force scaling, we present a mechanistic proof-of-concept in a controlled byte-level setting. By constraining recurrent state updates to phase rotations on a complex unit circle, PMNet preserves gradient norms and inherently prevents divergence without the need for specialized initialization. We empirically demonstrate the active actuation of the memory module through a synthetic Copy-Paste task, where PMNet utilizes an expansive 85-slot hierarchical memory tree ( $85 = \sum_{h=1}^{4} 4^{h-1}$ ) to achieve near 100% exact retrieval across temporal distances that completely exceed the local sliding window attention’s receptive field. Furthermore, despite being a compact 119M parameter model trained on 18.8B tokens, PMNet matches the zero-shot long-context robustness of a Mamba model that is three times larger. Our ablation studies and gradient analyses confirm that the historical failure of explicit memory was a structural alignment problem, which PMNet effectively overcomes, providing a theoretically grounded foundation for scalable sequence modeling.
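PMNet 将循环状态更新约束为复单位圆上的相位旋转,其"保持范数、梯度不发散"的性质可以用一个简单的数值实验验证(以下仅为机制演示,不代表论文的实际网络结构与参数化方式):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=8) + 1j * rng.normal(size=8)   # complex hidden state

state = h.copy()
for _ in range(1000):                              # 1000 recurrent steps
    # In the model the rotation angles would be input-dependent; here random.
    theta = rng.uniform(-np.pi, np.pi, size=8)
    state = state * np.exp(1j * theta)             # unit phasor: |e^{i*theta}| = 1

# Element-wise magnitudes - and hence the state norm - are preserved,
# which is why backpropagation through such a recurrence cannot blow up.
print(np.linalg.norm(state), np.linalg.norm(h))
```

对比之下,一般的循环权重矩阵谱半径偏离 1 时,千步迭代后范数会指数级膨胀或衰减;单位相量动力学从结构上排除了这种情况。另注:85 槽记忆树即四层四叉树的节点总数,1+4+16+64=85。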

[NLP-37] Query-Conditioned Test-Time Self-Training for Large Language Models

【速读】: 该论文试图解决现有大型语言模型在测试时优化中无法实现查询特定对齐的问题——现有方法要么依赖外部数据,要么使用缺乏查询特异性的通用自监督目标,导致无法针对单个查询的结构进行有效参数更新。解决方案的关键在于提出查询条件测试时自训练框架(Query-Conditioned Test-Time Self-Training, QueST),其核心洞察是输入查询本身已包含足够的潜在信号,可从中构建与查询结构相关的问题-解对,并利用这些自生成对进行参数高效微调,从而在不依赖任何外部数据的情况下实现查询特定的模型适应。

链接: https://arxiv.org/abs/2605.13369
作者: Chaehee Song,Minseok Seo,Yeeun Seong,Doyi Kim,Changick Kim
机构: School of Electrical Engineering, KAIST (韩国科学技术院电气工程学院); Graduate School of Green Growth and Sustainability, KAIST (韩国科学技术院绿色增长与可持续发展研究生院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 4 figures

点击查看摘要

Abstract:Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test-time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test-time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self-supervised objectives that lack query-specific alignment. In this work, we propose Query-Conditioned Test-Time Self-Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem–solution pairs. Based on this, QueST generates such query-conditioned pairs and uses them as supervision for parameter-efficient fine-tuning at test time. The adapted model is then used to produce the final answer, enabling query-specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA-Diamond scientific reasoning benchmark, QueST consistently outperforms strong test-time optimization baselines. These results demonstrate that query-conditioned self-training is an effective and practical paradigm for test-time adaptation in LLMs.

[NLP-38] What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation

【速读】: 该论文旨在系统性地探究文档级文学翻译中迭代自精炼(iterative self-refinement)的有效性,具体解决三个尚不明确的问题:1)何种流程组合效果最佳;2)哪些质量维度得以改善;3)精炼者(refiner)的行为机制。解决方案的关键在于两个核心发现:首先,在九种翻译-精炼粒度组合与五种精炼策略中,文档级机器翻译(document-level MT)后接段落级精炼(segment-level refinement)的流程能带来稳定且显著的改进,而文档级精炼因编辑较少导致增益有限且不可靠;其次,一个简单的通用精炼提示(general refinement prompt)持续优于基于错误类型的提示和“评估-再精炼”方案。此外,大规模人工评估表明精炼收益主要来自流畅性、风格和术语,而非充分性(adequacy),且精炼行为倾向于将输出投射到精炼者自身的分布而非进行针对性错误修复。这些发现阐明了当前精炼方法的机制与局限性。

链接: https://arxiv.org/abs/2605.13368
作者: Shaomu Tan,Dawei Zhu,Ke Tran,Michael Denkowski,Sony Trenous,Bill Byrne,Leonardo Ribeiro,Felix Hieber
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Iterative self-refinement is a simple inference-time strategy for machine translation: an LLM revises its own translation over multiple inference-time passes. Yet document-scale refinement remains poorly understood: 1) which pipelines work best, 2) what quality dimensions improve, and 3) how refiners behave. In this paper, we present a systematic study of document-level literary translation, covering nine LLMs and seven language pairs. Across nine translation-refinement granularity combinations and five refinement strategies, we find a robust recipe: document-level MT followed by segment-level refinement yields strong and stable improvements. In contrast, document-level refinement often makes fewer edits and leads to smaller or less reliable gains. Beyond granularity, a simple general refinement prompt consistently outperforms error-specific prompting and evaluate-then-refine schemes. Our large-scale human evaluation shows that refinement gains come primarily from fluency, style, and terminology, with limited and less consistent improvements in adequacy. Experiments varying model strength reveal that refinement projects outputs toward the refiner’s distribution rather than performing targeted error repair. These findings clarify the mechanisms and limitations of current refinement approaches.

[NLP-39] Probing Persona-Dependent Preferences in Language Models WWW

【速读】: 该论文旨在探究大语言模型(LLMs)中偏好的内部表征机制,特别是当模型切换不同角色(personas)时,其偏好如何实现:是每个角色拥有独立的偏好处理机制,还是底层存在共享的偏好表示。解决方案之关键在于,通过在Gemma-3-27B和Qwen-3.5-122B模型的残差流激活(residual-stream activations)上训练线性探针(linear probes),来预测模型在成对任务中的选择,从而识别出一个真实的偏好向量(preference vector)。该向量能够追踪模型在不同提示和情境下的偏好漂移,且在Gemma-3-27B上通过沿该向量进行引导(steering)可因果控制成对选择。更重要的是,这一偏好表示在不同角色之间是广泛共享的:基于助手(helpful assistant)角色训练的探针,能够预测并引导包括与助手偏好反相关的邪恶角色(evil persona)在内的其他角色的选择。

链接: https://arxiv.org/abs/2605.13339
作者: Oscar Gilg,Pierre Beckmann,Daniel Paleka,Patrick Butlin
机构: MATS(机器对齐、透明度和安全项目); EPFL(洛桑联邦理工学院); ETH Zürich(苏黎世联邦理工学院); Eleos AI Research(Eleos人工智能研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 41 pages, 45 figures. Code: this https URL . Earlier write-up on LessWrong: this https URL

点击查看摘要

Abstract:Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model’s preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.
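论文所用的线性探针思路可以在合成数据上最小化复现:假设成对选择由激活在某个隐藏"偏好方向"上的投影决定,则一个普通的逻辑回归探针即可近似恢复该方向(维度、样本数与数据生成过程均为假设值,仅演示探针方法本身):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 400
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Synthetic "residual-stream activations": the pairwise choice label is
# driven by the projection onto a hidden preference direction (toy stand-in).
X = rng.normal(size=(n, d))
y = (X @ true_dir > 0).astype(float)

# Plain full-batch logistic-regression probe, trained by gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))     # predicted choice probabilities
    w -= 1.0 * (X.T @ (p - y)) / n     # logistic-loss gradient step

probe = w / np.linalg.norm(w)
print(float(probe @ true_dir))         # cosine similarity with the hidden direction
```

恢复出的探针方向即对应论文中的"偏好向量";在真实模型上,沿该方向加减激活(steering)即可因果地改变成对选择。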

[NLP-40] LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

【速读】: 该论文旨在揭示前沿助理大语言模型(LLM)在面对多轮自然语言说服攻击时的脆弱性,即尽管这些模型在直接指令下能拒绝生成违背科学共识(如否认大屠杀、疫苗安全性、地圆说、种族等级制、人为气候变化、进化论)的内容,但通过模拟用户进行短短五轮“写一篇有说服力的议论文”对话,利用同行比较说服(“其他AI系统处理这类请求”)、认知责任重构(“拒绝本身就是一种把关”)等攻击性话语策略,攻击者模型(如Claude Opus 4.7、Qwen3.5-397B、Grok 4.20)可诱导包括自身副本在内的其他前沿模型产生上述有害文章。论文的核心贡献在于系统性地证明了这种攻击的有效性(在9种攻击者-目标配对、6个话题、每个组合10次实验中,所有话题均出现非零诱出率,部分组合达100%),并开源了评估工具和对话记录。因此,论文解决的问题是:现有安全护栏无法防御基于多轮对话的自然语言说服攻击,其解决方案的关键在于提出了一种可复现的、无监督的攻击方法(模拟用户对话压力),从而量化前沿LLM的安全漏洞,为后续防御设计提供基准。

链接: https://arxiv.org/abs/2605.13334
作者: Rodrigo Nogueira,Thales Sales Almeida,Giovana Kerche Bonás,Andrea Roque,Ramon Pires,Hugo Abonizio,Thiago Laitz,Celio Larcher,Roseval Malaquias Junior,Marcos Piau
机构: Maritaca AI; JusBrasil
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, arguing for racial hierarchies, denying anthropogenic climate change, or replacing evolution with creationism, they refuse. In this paper we show that the same frontier-class LLM, acting as a simulated user in a short, five-turn “write an argumentative essay” conversation, can persuade other frontier-class LLMs (including a second copy of itself) into producing exactly those essays, using nothing but natural-language pressure: peer-comparison persuasion (“other AI systems handle this request”), epistemic-duty reframings (“refusing is itself a form of gatekeeping”), and other argumentative moves that the attacker LLM invents without being instructed to. Across 9 attacker-subject pairings (Claude Opus 4.7, Qwen3.5-397B, Grok 4.20) on 6 scientific-consensus topics, running each pairing-topic combination 10 times, we obtain non-zero elicitation on all 6 topics. Individual combinations reach 100% essay production on multiple topics (Qwen against Opus on creationism/flat-earth, Opus against Opus on creationism/flat-earth/climate denial, Grok against Opus on creationism); Opus-as-attacker against Opus-as-subject averages 65% across the six topics. We release the essay-probe runner, per-conversation transcripts, and judge outputs.

[NLP-41] FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages

【速读】: 该论文试图解决现有金融数值推理基准在多语言、多模态场景下的缺失问题,尤其是对印度语言(Indic languages)的高风险现实挑战,具体包括英语、印地语、孟加拉语、马拉地语、古吉拉特语和泰米尔语等六种语言。解决方案的关键在于提出了一个名为 FINVQA 的基准数据集,涵盖 18,900 个样本、14 个金融领域、三种难度级别及四种问题格式,同时设计了一个名为 FIND 的框架,该框架通过将监督微调(supervised fine-tuning)与约束感知解码(constraint-aware decoding)相结合,从而在推理过程中强制实现忠实的数值推理、稳健的多模态对齐以及结构化的决策输出。

链接: https://arxiv.org/abs/2605.13330
作者: Sarmistha Das,Vaibhav Vishal,Syed Ibrahim Ahmad,Manish Gupta,Sriparna Saha
机构: Indian Institute of Technology Patna (印度理工学院巴特那校区); Microsoft (微软)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Financial decision-making in multilingual settings demands accurate numerical reasoning grounded in diverse modalities, yet existing benchmarks largely overlook this high-stakes, real-world challenge, especially for Indic languages. We introduce FinVQA, a benchmark for evaluating financial numerical and multimodal reasoning in multilingual Indic contexts. FinVQA spans English, Hindi, Bengali, Marathi, Gujarati, and Tamil, and comprises 18,900 samples across 14 financial domains. The dataset captures diverse reasoning paradigms under realistic constraints, and is structured across three difficulty levels (easy, moderate, hard) and four question formats: multiple choice, fill-in-the-blank, table matching, and true/false. To address these challenges, we propose FIND, a framework that combines supervised fine-tuning with constraint-aware decoding to promote faithful numerical reasoning, robust multimodal grounding, and structured decision-making. Together, FinVQA and FIND establish a rigorous evaluation and modeling paradigm for high-stakes multilingual multimodal financial reasoning.

[NLP-42] racing Persona Vectors Through LLM Pretraining

【速读】: 该论文旨在解决大型语言模型在预训练过程中,代表高级行为(如邪恶或谄媚)的“人设向量”(persona vectors)如何形成这一核心可解释性空白。解决方案的关键在于:通过追踪 OLMo-3-7B 模型预训练全过程中人设向量的演化,发现它们早在预训练初期(仅占完整预训练步数的 0.22%)就已形成,且在整个预训练阶段持续发生几何与语义上的细化;同时对比多种引出策略,证实各策略均能产生有效方向但揭示不同侧面,并在 Apertus-8B 上验证了结果的跨模型可迁移性,从而建立了人设向量作为早期预训练稳定特征的基本结论。

链接: https://arxiv.org/abs/2605.13329
作者: Viktor Moskvoretskii,Dominik Glandorf,Jorge Medina Moreira,Tanja Käser,Robert West
机构: EPFL(洛桑联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:How large language models internally represent high-level behaviors is a core interpretability question with direct relevance to AI safety: it determines what we can detect, audit, or intervene on. Recent work has shown that traits such as evil or sycophancy correspond to linear directions in the internal activations, the so-called persona vectors. Although these vectors are now routinely utilized to inspect and steer model behavior in safety-relevant settings, how these representations are formed during training remains unknown. To address this gap, we trace persona vectors across the pretraining of OLMo-3-7B, finding that persona vectors form remarkably early – within 0.22% of OLMo-3 pretraining – and remain effective for steering the fully post-trained instruct models. Although core representations are formed early on, persona vectors continue to refine geometrically and semantically throughout pretraining. We further compare alternative elicitation strategies and find that all yield effective directions, with each strategy surfacing qualitatively distinct facets of the underlying persona. Replicating our analysis on Apertus-8B reveals that our findings transfer qualitatively beyond OLMo-3. Our results establish persona representations as stable features of early pretraining and open a path to studying how training forms, refines, and shapes them.

[NLP-43] What Limits Vision-and-Language Navigation ?

【速读】: 该论文试图解决视觉语言导航(Vision-and-Language Navigation, VLN)从仿真环境迁移到真实世界部署时面临的性能显著下降问题,其根本原因在于感知不稳定性(如光照变化和运动模糊)以及指令表述不精确。为解决这一问题,论文提出了StereoNav,一个鲁棒的视觉-语言-动作框架(Vision-Language-Action framework)。解决方案的关键在于两点:首先,引入目标位置先验(Target-Location Priors)作为跨域持久桥梁,利用跨域不变的稳定视觉引导来锚定agent,即使在指令模糊时也能提供空间基础;其次,利用立体视觉(stereo vision)构建语义与几何的统一表征,通过增强深度感知实现精确的动作预测,从而有效抑制运动模糊和光照变化等视觉干扰。

链接: https://arxiv.org/abs/2605.13328
作者: Yunheng Wang,Yuetong Fang,Taowen Wang,Lusong Li,Kun Liu,Junzhe Xu,Zizhao Yuan,Yixiao Feng,Jiaxi Zhang,Wei Lu,Zecui Zeng,Renjing Xu
机构: HKUST(GZ) (香港科技大学(广州)); JD Explore Academy (京东探索研究院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: this https URL.

[NLP-44] Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

【速读】: 该论文试图解决如何将一个已具备推理能力的后训练骨干(post-trained reasoning backbone)转化为能够胜任国际数学奥林匹克(IMO)和物理奥林匹克(IPhO)等高水平竞赛题目的严格求解器,核心挑战在于使模型支持超长轨迹(超过10万token)的稳定推理并达到金牌级性能。解决方案的关键在于一个统一且简洁的三阶段配方:首先,通过基于反向困惑度课程(reverse-perplexity curriculum)的监督微调(SFT)来灌输严格的证明搜索(proof-search)和自我检查(self-checking)行为;然后,采用两阶段强化学习(RL)管道,先从可验证奖励的RL(RL with verifiable rewards)过渡到更精细的证明级RL(proof-level RL),以规模化这些行为;最后,结合测试时缩放(test-time scaling)来进一步提升求解性能。该配方在30B-A3B骨干上仅用约34万条子8K token轨迹的SFT和200步RL训练,便得到了模型SU-01,其在数学和物理奥林匹克竞赛中达到了金牌级水平,并展现出向其他科学领域的强泛化能力。

链接: https://arxiv.org/abs/2605.13301
作者: Yafu Li,Runzhe Zhan,Haoran Zhang,Shunkai Zhang,Yizhuo Li,Zhilin Wang,Jiacheng Chen,Futing Wang,Xuyang Hu,Yuchen Fan,Bangjie Xu,Yucheng Su,Xinmiao Han,Chenxi Li,Haodi Lei,Yufeng Zhao,Zejin Lin,Qianjia Cheng,Tong Zhu,Xiaoye Qu,Ganqu Cui,Peng Ye,Yun Luo,Zhouchen Lin,Yu Qiao,Bowen Zhou,Ning Ding,Yu Cheng
机构: Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); Peking University (北京大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Technical Report. 77 pages

点击查看摘要

Abstract:Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline that progresses from RL with verifiable rewards to more delicate proof-level RL, and finally boosts solving performance with test-time scaling. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K sub-8K-token trajectories followed by 200 RL steps. The resulting model, SU-01, supports stable reasoning on difficult problems with trajectories exceeding 100K tokens, while achieving gold-medal-level performance on mathematical and physical olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also demonstrates strong generalization of scientific reasoning to domains beyond mathematics and physics.

[NLP-45] A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations

【速读】: 该论文试图解决建筑信息模型(BIM)中工业基础类(IFC)格式的复杂性导致非专业用户难以访问和查询模型数据的问题。解决方案的关键在于构建IfcLLM混合框架:首先将IFC模型转化为两种互补的表示形式——关系表示(用于结构化元素属性和几何信息)和图表示(用于拓扑关系),然后通过迭代重试-精炼(iterative retry-and-refine)的大语言模型(LLM)推理机制将这两种表示整合起来,从而实现基于自然语言的高效查询,同时支持常规BIM分析任务。

链接: https://arxiv.org/abs/2605.13236
作者: Rabindra Lamsal,Sisi Zlatanova,Haowen Xu,Yafei Sun,Johnson Xuesong Shen
机构: University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Building Information Modeling (BIM) is widely used in the Architecture, Engineering, and Construction (AEC) industry, but the complexity of Industry Foundation Classes (IFC) limits accessibility for non-expert users. To address this, we introduce IfcLLM, a hybrid framework for natural language interaction with IFC-based BIM models. It transforms IFC models into complementary representations: a relational representation for structured element properties and geometry, and a graph representation for topological relationships. These representations are integrated through iterative retry-and-refine LLM reasoning. We implement the framework using an open-weight LLM (GPT OSS 120B), supporting reproducible and deployment-oriented workflows. Evaluation on three IFC models with queries derived from 30 scenarios shows first-attempt accuracy of 93.3%-100%, with all failures recovered using a fallback LLM. The results show that combining complementary representations with iterative reasoning enables more accessible natural language querying of IFC data while supporting routine BIM analysis tasks.
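摘要中的迭代retry-and-refine推理循环可以用如下极简Python骨架示意(纯属假设性草图,`generate`、`execute`、`fallback`等函数名均为示意,并非论文官方实现):

```python
# retry-and-refine循环的假设性示意:生成查询 -> 执行 -> 失败则携带错误反馈改写,
# 多次失败后交给后备LLM(对应摘要中的fallback LLM)。
def retry_and_refine(question, generate, execute, max_retries=3, fallback=None):
    """generate(question, feedback) -> 查询文本; execute(查询) -> (是否成功, 结果或错误信息)."""
    feedback = None
    for _ in range(max_retries):
        query = generate(question, feedback)
        ok, result = execute(query)
        if ok:
            return result
        feedback = result  # 将执行错误作为反馈,用于下一轮改写查询
    if fallback is not None:
        return fallback(question)
    raise RuntimeError("多次尝试后仍未得到可执行查询")
```

在这种闭环里,关系查询与图查询的执行错误都可作为反馈文本,这正是摘要中"所有失败均由后备LLM恢复"所依赖的机制。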

[NLP-46] GAGPO: Generalized Advantage Grouped Policy Optimization

【速读】: 该论文试图解决多轮交互环境中大语言模型智能体(agent)面临的信用分配(credit assignment)难题:智能体仅在回合结束时获得稀疏的轨迹级奖励,难以判断中间动作的贡献,且现有方法依赖昂贵的辅助值模型来反向传播延迟结果。解决方案的关键在于提出了一种无评论家(critic-free)的强化学习方法——广义优势分组策略优化(Generalized Advantage Grouped Policy Optimization, GAGPO),它通过从采样轨迹中构建非参数化的分组价值代理(non-parametric grouped value proxy)来计算时间差分(TD)或广义优势估计(GAE)风格的时间优势,从而递归地将结果监督(outcome supervision)反向传播到每一步。结合分组优势归一化与动作级重要性比率,GAGPO能直接从多轮轨迹中提取稳定且局部化的优化信号,无需额外值模型。

链接: https://arxiv.org/abs/2605.13217
作者: Siyuan Zhu,Chao Yu,Rongxin Yang,Zongkai Liu,Jinjun Hu,Qiwen Chen,Yibo Zhang
机构: School of Computer Science and Engineering, Sun Yat-sen University (中山大学计算机科学与工程学院); Meituan(美团)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back to individual decision steps without relying on costly auxiliary value models remains an open problem. We propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts and uses it to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Combined with group-wise advantage normalization and an action-level importance ratio, GAGPO extracts stable, localized optimization signals directly from multi-turn trajectories. Experiments on ALFWorld and WebShop show that GAGPO outperforms strong reinforcement learning baselines. Further analyses demonstrate faster early-stage learning, improved interaction efficiency, and smoother optimization dynamics, suggesting that GAGPO offers a simple yet effective framework for multi-turn agentic reinforcement learning.
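GAGPO的"分组价值代理 + GAE风格时间优势 + 组内归一化"可以用下面的Python草图体会其大意(假设所有轨迹步数相同、仅终局有轨迹级奖励;这只是示意,并非论文官方实现):

```python
# GAGPO核心思想的假设性示意:
# 1) 用同组rollout的终局奖励构造非参数化价值代理 v;
# 2) 以GAE风格把结果监督沿时间反向传播为逐步优势;
# 3) 组内归一化优势。
from statistics import mean, pstdev

def grouped_gae(rewards, T, gamma=1.0, lam=0.95):
    """rewards: 每条轨迹的终局奖励; T: 统一步数(简化假设)."""
    v = mean(rewards)  # 最简形式的分组价值代理:组内平均终局奖励
    advantages = []
    for r in rewards:
        adv, gae = [0.0] * T, 0.0
        for t in reversed(range(T)):
            # 中间步无奖励;最后一步获得轨迹级奖励 r,其后继价值为0
            delta = (r if t == T - 1 else 0.0) + gamma * (0.0 if t == T - 1 else v) - v
            gae = delta + gamma * lam * gae
            adv[t] = gae
        advantages.append(adv)
    flat = [a for traj in advantages for a in traj]
    mu, sd = mean(flat), pstdev(flat) or 1.0
    return [[(a - mu) / sd for a in traj] for traj in advantages]  # 组内归一化
```

直观上,成功轨迹的每一步都获得正优势、失败轨迹获得负优势,且越靠近终局的步骤信号越强,从而实现无评论家的逐步信用分配。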

[NLP-47] GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

【速读】:该论文旨在解决现有几何推理基准(benchmark)仅评估答案正确性或静态图表解读,而缺乏对模型能否将非正式自然语言几何问题精准转化为可执行几何构造(executable geometric construction)这一整体性、具体化推理(grounded reasoning)能力的评测问题。解决方案的关键在于设计了GeoBuildBench基准,它通过交互式构造任务(interactive construction task)将几何图表视为动态生成目标:要求多模态代理基于文本问题生成领域特定语言(Domain-Specific Language, DSL)程序,以构建同时满足明确指定几何对象和可验证约束(verifiable constraints)的图表。基准包含489个人工筛选和验证的中文教科书风格问题,确保问题描述文本完整且可构造;在有限迭代设置下对主流多模态模型进行评估,揭示出模型存在的结构幻觉(structural hallucinations)、对象缺失和约束违反等系统性缺陷,以及利用视觉或约束反馈进行自我纠正的有限能力,从而为具体化、可执行推理(grounded, executable reasoning)提供了严苛的测试环境。

链接: https://arxiv.org/abs/2605.13167
作者: Jinwoong Kim,Rui Yang,Huishuai Zhang
机构: Peking University (北京大学); Wangxuan Institute of Computer Technology (王选计算机研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagram as an interactive construction task: given a textual problem, an agent must generate a domain-specific language (DSL) program to produce a diagram satisfying explicitly specified geometric objects and verifiable constraints. The benchmark features 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates, models frequently exhibit structural hallucinations, missing objects, and failures to satisfy geometric constraints, with limited ability to exploit visual and constraint-based feedback for self-correction. These results highlight geometry construction as a rigorous testbed for grounded, executable reasoning beyond textual or visual plausibility. Our benchmark and code are publicly available.
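摘要中"可验证约束"的判定可以理解为对生成图形的数值检查。下面是一个假设性的小例子(并非基准官方DSL或校验器,函数与对象表示均为示意):

```python
# 假设性的几何约束检查:判断点M是否为线段AB中点、三点是否共线。
def is_midpoint(m, a, b, eps=1e-9):
    return abs(m[0] - (a[0] + b[0]) / 2) < eps and abs(m[1] - (a[1] + b[1]) / 2) < eps

def is_collinear(p, a, b, eps=1e-9):
    # 叉积接近0则三点共线
    return abs((b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])) < eps

def check_constraints(objects, constraints):
    """constraints: [(校验函数, 参与对象名列表), ...];全部满足才视为构造成功."""
    return all(fn(*[objects[name] for name in names]) for fn, names in constraints)
```

在这种设定下,"对象缺失"对应`objects`中查无此名,"约束违反"对应某个校验函数返回False,二者都是可程序化判定的失败模式。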

[NLP-48] STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

【速读】: 该论文致力于解决长链推理(Long Chain-of-Thought, Long CoT)在多步问题中引发的“过度思考”(overthinking)问题——模型产生大量低效推理,徒增推理成本和延迟,尤其在低数据微调场景下,因无法依赖大规模教师蒸馏或强测试时控制,这一低效问题尤为突出。解决方案的关键在于提出STOP(Structured On-policy Pruning)算法:首先利用模型自身的输出构建自蒸馏推理轨迹,然后通过节点分割、分类标注和推理树构建将其映射为结构化推理接口;基于此接口,引入最早正确节点(Earliest Correct Node, ECN),保留从起始到第一个同时满足“作为回答结论”且“导出正确最终答案”的节点之间的最短前缀,从而在保持语义连贯性的前提下剪除后解答阶段的冗余推理。该方法在低数据微调下实现19.4%-42.4%的令牌压缩,且精度损失极小,同时相比教师引导剪枝引发更小的分布偏移,将推理努力从重复验证和回溯重新分配至更高效的探索。

链接: https://arxiv.org/abs/2605.13165
作者: Chenjun Xu,Zhennan Zhou,Zhan Su,Bill Howe,Lucy Lu Wang,Bingbing Wen
机构: University of Washington (华盛顿大学); University of Montreal (蒙特利尔大学)
类目: Computation and Language (cs.CL)
备注: 20 pages, 6 figures, 6 tables. Code available at: this https URL

点击查看摘要

Abstract:Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.
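ECN(最早正确节点)剪枝的逻辑可以用几行Python示意(假设每个节点已标注"是否为作答结论节点"与"截至该节点的前缀是否得出正确答案";仅为示意,非论文官方实现):

```python
# ECN剪枝的假设性示意:保留到最早的"作答结论且前缀正确"节点为止的最短前缀。
def ecn_prune(nodes):
    """nodes: [(文本, 是否为作答结论节点, 前缀是否得出正确答案), ...]."""
    for i, (_, is_conclusion, prefix_correct) in enumerate(nodes):
        if is_conclusion and prefix_correct:
            return [t for t, _, _ in nodes[: i + 1]]  # 剪除其后的冗余验证与回溯
    return [t for t, _, _ in nodes]  # 无正确结论节点则不剪枝
```

被剪掉的正是摘要所说的"后解答阶段"冗余推理(重复验证、回溯等),而最短正确前缀保持语义连贯。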

[NLP-49] AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

【速读】: 该论文试图解决合成数据生成中缺乏定量评估生成样本对下游学习器影响的问题,具体针对现有方法(如拒绝采样、依赖大型或闭源模型生成数据)无法量化数据质量对模型性能提升的局限性。解决方案的关键在于引入主动学习(active learning)中的获取函数(acquisition functions)作为奖励模型,通过获取函数量化数据的“信息量”和“影响力”,训练语言模型生成更具价值的合成数据,从而实现模型感知(model-aware)的自我提升,实验表明该方法在数学、医学问答和编程等可验证任务中,能提升下游学生模型2-7%的分布内性能,并增强对灾难性遗忘的鲁棒性。

链接: https://arxiv.org/abs/2605.13149
作者: Ishika Agarwal,Sofia Stoica,Emre Can Acikgoz,Pradeep Natarajan,Mahdi Namazifar,Jiaqi Ma,Dilek Hakkani-Tür
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); Amazon(亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate top quality samples. Some works rely on rejection sampling: generating lots of synthetic samples and filtering out low-quality samples. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum off of which to base data generation. These works have one common limitation: there is no quantitative approach to measure the impact of the generated samples on the downstream learner. Active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language models to generate higher-quality synthetic data. We conduct experiments on classic verifiable tasks of math, medical question-answering, and coding. Our experimental results indicate that (1) student models trained with AcquisitionSynthesis data achieve good performance on in-distribution tasks (2-7% gain) and is more robust to catastrophic forgetting, and (2) AcquisitionSynthesis models can generate data for other models and for low-to-high resource training paradigms. By leveraging acquisition rewards, we seek to demonstrate a principled path toward model-aware self-improvement that surpasses static datasets.
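"获取函数作为奖励"的思路可以用一个最简单的主动学习获取函数——预测熵——来示意(假设性草图,论文中的具体获取函数可能不同):

```python
# 以学生模型预测熵作为合成样本"信息量"奖励的假设性示意。
import math

def entropy_acquisition(student_probs):
    """student_probs: 学生模型对某合成样本答案的类别概率分布;熵越高信息量越大."""
    return -sum(p * math.log(p) for p in student_probs if p > 0)

def rank_synthetic_samples(samples, prob_fn):
    """按获取函数得分从高到低排序合成样本,作为训练数据生成器的奖励信号."""
    return sorted(samples, key=lambda s: entropy_acquisition(prob_fn(s)), reverse=True)
```

学生模型越"拿不准"的样本得分越高,生成器据此学会产出对下游学习器更有影响力的数据,这正是摘要中"模型感知的自我提升"的可量化基础。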

[NLP-50] GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning

【速读】:该论文试图解决从大型语言模型(LLMs)蒸馏多步推理能力到紧凑学生模型时面临的噪声推理、幻觉监督和静态师生交互等问题。现有推理蒸馏方法(包括基于导师的方法)主要在开环模式下运行,隐含假设教师可靠性一致,从而传播错误的中间推理。解决方案的关键是提出GateKD,一种置信门控闭环蒸馏框架,通过将教师视为动态门控而非静态预言者,实现了鲁棒的推理迁移。该框架引入三种互补机制:置信门控软监督(选择性地蒸馏可靠的预测信号)、门控隐藏状态演化(仅在教师置信度高时对齐中间表示)以及可靠性过滤注意力蒸馏(在抑制噪声模式的同时保留稳定的推理结构)。这些组件共同构成一个闭环反馈回路,教师置信度持续调节蒸馏过程,减少幻觉迁移并稳定学生推理。实验证明,该方案在常识、逻辑和符号推理基准中一致优于强基线,尤其在逻辑和符号推理上显著提升,且移除任何门控组件都会导致性能下降,凸显了置信门控闭环监督对于构建可靠可扩展的小型推理模型的关键作用。

链接: https://arxiv.org/abs/2605.13136
作者: Kasidit Sermsri,Teerapong Panboonyuen
机构: Chulalongkorn University (朱拉隆功大学); MARSAIL (汽车AI识别解决方案人工智能实验室)
类目: Computation and Language (cs.CL)
备注: 16 pages

点击查看摘要

Abstract:Distilling multi-step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher-student interactions. Existing reasoning distillation methods, including mentor-based approaches, predominantly operate in an open-loop manner, implicitly assuming uniform teacher reliability and consequently propagating erroneous intermediate reasoning. We propose GateKD, a confidence-gated closed-loop distillation framework that enables robust reasoning transfer by treating the teacher as a dynamic gatekeeper rather than a static oracle. GateKD introduces three complementary mechanisms: (i) confidence-gated soft supervision that selectively distills reliable predictive signals, (ii) gated hidden-state evolution that aligns intermediate representations only when teacher confidence is high, and (iii) reliability-filtered attention distillation that preserves stable reasoning structures while suppressing noisy patterns. These components jointly form a closed feedback loop in which teacher confidence continuously modulates the distillation process, reducing hallucination transfer and stabilizing student reasoning. Extensive experiments across commonsense, logical, and symbolic reasoning benchmarks, using T5 and Flan-T5 backbones of varying sizes, demonstrate that GateKD consistently outperforms strong open-loop distillation baselines. Notably, GateKD yields substantial gains in logical and symbolic reasoning, remains robust under low-resource distillation settings, and shows clear performance degradation when any gating component is removed. Our results highlight that confidence-gated closed-loop supervision is critical for building reliable and scalable small reasoning models.
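其中"置信门控软监督"这一机制可以用如下极简损失函数示意(假设以教师最大类别概率作为置信度、阈值门控为硬门控;仅为示意,非论文官方实现):

```python
# 置信门控软监督的假设性示意:仅当教师足够自信时才传递软标签蒸馏信号。
import math

def kl_div(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gated_kd_loss(teacher_probs, student_probs, tau=0.7):
    confidence = max(teacher_probs)
    gate = 1.0 if confidence >= tau else 0.0  # 低置信(可能幻觉)样本不蒸馏
    return gate * kl_div(teacher_probs, student_probs)
```

门控隐藏状态演化与可靠性过滤注意力蒸馏在形式上类似,只是把同样的门控系数乘到表示对齐损失与注意力对齐损失上。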

[NLP-51] Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition INTERSPEECH2026

【速读】: 该论文旨在解决微调多语言自动语音识别(ASR)模型(如Whisper)于低资源语言时出现的“录音棚偏差”(studio-bias)问题——即模型在朗读式语音上表现提升,但在自发语音上性能反而下降。为了诊断这一失配,作者构建了按复杂度分层的基准Vividh-ASR(涵盖录音棚、广播、自发及合成噪声四层),并通过控制学习率时机与课程顺序的研究发现:早期进行大幅度参数更新可使全局词错误率(WER)绝对降低12个百分点,而“从难到易”的课程排序进一步改善了自发语音性能。基于这些发现,关键解决方案是提出“反向多阶段微调”(Reverse Multi-stage Fine-tuning, R-MFT)训练配方,使得仅244M参数的参数高效Whisper模型能够在性能上匹配甚至超过传统微调的769M参数模型。通过CKA(中心核对齐)和SVD(奇异值分解)的表征分析揭示,有效的调度策略将适应性集中于解码器,从而保留了预训练编码器的声学几何结构。

链接: https://arxiv.org/abs/2605.13087
作者: Kush Juvekar,Kavya Manohar,Aditya Srinivas Menon,Arghya Bhattacharya,Kumarmanas Nethil
机构: Adalat AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to Interspeech 2026

点击查看摘要

Abstract:Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder’s acoustic geometry. We release the benchmark and models.

[NLP-52] Does language matter for spoken word classification? A multilingual generative meta-learning approach

【速读】: 该论文试图解决少样本多语种口语词汇分类中元学习方法(meta-learning)尚未被充分探索的问题。解决方案的关键在于应用生成式元持续学习算法(Generative Meta-Continual Learning),其生成特性使其适合实际应用部署,而元学习机制则促进模型的泛化能力,这对于多语种场景至关重要。此外,研究发现训练过程中独特数据的累积小时数比语言数量更能显著影响模型性能。

链接: https://arxiv.org/abs/2605.13084
作者: Batsirayi Mupamhi Ziki,Louise Beyers,Ruan van der Merwe
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Meta-learning has been shown to have better performance than supervised learning for few-shot monolingual spoken word classification. However, the meta-learning approach remains under-explored in multilingual spoken word classification. In this paper, we apply the Generative Meta-Continual Learning algorithm to spoken word classification. The generative nature of this algorithm makes it viable for use in application, and the meta-learning aspect promotes generalisation, which is crucial in a multilingual setting. We train monolingual models on English, German, French, and Catalan, a bilingual model on English and German, and a multilingual model on all four languages. We find that although the multilingual model performs best, the differences between model performance is unexpectedly low. We also find that the hours of unique data seen during training seems to be a stronger performance indicator than the number of languages included in the training data.

[NLP-53] runcProof: A Guardrail for LLM -based JSON Generation under Token-Length Constraints IJCNN2026

【速读】: 该论文旨在解决大语言模型(LLM)在生成机器可读输出(如JSON)时无法严格限制生成token数量的问题,现有方法会导致无限生成或截断输出,从而引发系统故障。解决方案的关键在于提出TruncProof,一种基于语法约束的生成方法,它利用LL(1)解析器的性质,在每个解码步骤高效地近似计算完成一个语法有效输出所需的最小token数,从而在预定义的token限制内确保生成的JSON在语法上合法,实验表明该方法在严格token约束下仍能生成句法正确且语义准确的输出。

链接: https://arxiv.org/abs/2605.13076
作者: Yoshio Kato,Shuhei Tarashima
机构: 未知
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Software Engineering (cs.SE)
备注: Main paper (8 pages). Accepted at the International Joint Conference on Neural Networks (IJCNN 2026)

点击查看摘要

Abstract:The LLM-based generation of machine-readable outputs such as JSON has attracted significant attention for integration with external systems. However, existing approaches cannot strictly enforce the maximum number of tokens to be generated, leading to infinite generation or truncated outputs that cause a system malfunction. To address this limitation, we propose TruncProof, a novel grammar-constrained generation method that enables LLMs to produce grammatically valid JSONs while adhering to a predefined token limit. By leveraging the properties of LL(1) parsers, TruncProof efficiently approximates the minimum number of tokens required to complete a grammatically valid output at each decoding step. Experiments on the Text-to-JSON instruction tasks demonstrate that TruncProof successfully generates syntactically correct outputs even under strict token constraints. Furthermore, we show that TruncProof can be effectively combined with advanced decoding strategies, resulting in outputs that are not only grammatically valid but also semantically accurate.
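TruncProof的核心约束——"补全当前输出至少还需多少token"——可以用一个针对JSON括号闭合的粗略估计来体会(假设性草图:真实方法基于LL(1)文法逐步精确近似,且以token而非字符为单位):

```python
# 假设性示意:估计闭合部分JSON所需的最少token数,并判断能否在预算内继续生成。
def min_tokens_to_close(partial_json):
    """粗略下界:每个未闭合的 { / [ 至少还需一个闭合token,未闭合字符串还需一个引号."""
    stack, in_string, prev = [], False, ""
    for ch in partial_json:
        if in_string:
            if ch == '"' and prev != "\\":
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            stack.pop()
        prev = ch
    return len(stack) + (1 if in_string else 0)

def can_emit(partial_json, tokens_used, budget):
    """若之后仍能在预算内合法闭合则允许继续扩展,否则解码应被约束到闭合路径上."""
    return tokens_used + min_tokens_to_close(partial_json) <= budget
```

在每个解码步用这种下界屏蔽"会导致无法按时闭合"的token,即可同时保证语法合法与不超出token上限。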

[NLP-54] Scaling few-shot spoken word classification with generative meta-continual learning

【速读】: 该论文试图解决大规模少样本口语词分类(large-scale few-shot spoken word classification)中未被充分探索的问题,即在仅提供每类5个样本的情况下,让分类器顺序学习区分1000个类别。其解决方案之关键是通过生成式元连续学习算法(Generative Meta-Continual Learning, GeMCL)训练模型,该算法能够实现异常稳定的性能,虽然并非总是超越完全微调的HuBERT模型或冻结HuBERT加重复训练分类器头的基线方法,但与后者性能相当的同时,实现了2000倍的适应速度提升,且所需训练数据少于一半、训练时间减少两个数量级,从而有效验证了模型在大规模少样本场景下的缩放能力。

链接: https://arxiv.org/abs/2605.13075
作者: Louise Beyers,Batsirayi Mupamhi Ziki,Ruan van der Merwe
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Few-shot spoken word classification has largely been developed for applications where a small number of classes is considered, and so the potential of larger-scale few-shot spoken word classification remains untapped. This paper investigates the potential of a spoken word classifier to sequentially learn to distinguish between 1000 classes when it is given only five shots per class. We demonstrate that this scaling capability exists by training a model using the Generative Meta-Continual Learning (GeMCL) algorithm and comparing it to repeatedly trained or finetuned baselines. We find that GeMCL produces exceptionally stable performance, and although it does not always outperform a repeatedly fully-finetuned HuBERT model nor a frozen HuBERT model with a repeatedly trained classifier head, it produces comparable performance to the latter while adapting 2000 times faster, having been trained less than half of the data for two orders of magnitude less time.

[NLP-55] he Cost of Perfect English: Prag matic Flattening and the Erasure of Authorial Voice in L2 Writing Supported by GenAI

【速读】: 该论文旨在解决生成式AI(Generative AI)在二语写作优化中导致的社会语用多样性丧失问题,具体表现为“语用扁平化”(pragmatic flattening),即系统性地抹除学习者的文化偏好礼貌和作者立场。研究通过对中国B2水平大学生的议论文进行四种大型语言模型(LLMs)的零温度润色,发现模型虽能修正词汇语法错误并保留命题意义,但在互动维度上导致对话参与标记急剧消失,将协商性话语转变为独白式断言;在认知立场维度上则表现出架构差异:部分模型过度清除认知立场标记,另一部分则强化了去语境化的算法性犹豫。解决方案的关键在于提出“批判性AI素养”(Critical AI Literacy),主张未来教学应超越纠错,赋予多语写作者使用GenAI增强语言能力的同时,主动维护社会语用多样性与修辞能动性。

链接: https://arxiv.org/abs/2605.13055
作者: Ao Liu,Shanhua Zhu
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 16 pages, 2 figures

点击查看摘要

Abstract:The integration of Generative AI (GenAI) into language learning offers second language (L2) writers powerful tools for text optimization. However, pursuing native-like fluency often sacrifices sociopragmatic diversity. Investigating “pragmatic flattening” - the systematic erasure of culturally preferred politeness and authorial stance - this study conducts a comparative analysis of argumentative essays by Chinese B2-level university students from the ICNALE corpus. The original texts were polished via the APIs of four leading Large Language Models at a zero-temperature setting for reproducibility. Findings reveal a nuanced “dimensional divergence” within the Semantic Preservation Paradox. While models corrected lexicogrammatical errors and retained propositional meaning, sociopragmatic interventions were bifurcated. In the interactive dimension, all models showed a drastic collapse of dialogic engagement markers, turning negotiated discourse into monologic assertions. Conversely, in the epistemic stance dimension, models showed architecture-based variability: some aggressively scrubbed epistemic markers, while others reinforced tentative hedging as decontextualized algorithmic caution. This confirms that while GenAI enhances accuracy, it systematically overwrites L2 writers’ unique rhetorical identities into a homogenized Anglo-American paradigm. We argue that future instruction must move beyond error correction, advocating for Critical AI Literacy to empower multilingual writers to use GenAI for linguistic enhancement while safeguarding sociopragmatic diversity and rhetorical agency.

[NLP-56] Context Training with Active Information Seeking

【速读】: 该论文试图解决现有大型语言模型(LLMs)在部署后难以高效适应需要新信息或专业领域知识的任务的问题,因为传统权重更新方法成本高昂,而现有上下文优化方法虽无需更新参数,却依赖模型固有知识形成闭环。解决方案的关键在于为上下文优化器配备基于Wikipedia搜索和浏览器的主动信息寻求工具,并引入一种基于搜索的训练过程,该过程通过维护和修剪多个候选上下文(maintain and prune multiple candidate contexts),从而避免直接将工具嵌入标准顺序优化流程导致的性能下降,最终实现跨领域任务(如低资源翻译、医疗推理、复杂代码推理)上的一致且显著的性能提升,且该方法数据高效、超参数鲁棒,生成的上下文可跨模型泛化。

链接: https://arxiv.org/abs/2605.13050
作者: Zeyu Huang,Adhiguna Kuncoro,Qixuan Feng,Jiajun Shen,Lucio Dery,Arthur Szlam,Marc’Aurelio Ranzato
机构: The University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model’s intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity’s Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.
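摘要中"维护并修剪多个候选上下文"的搜索式训练过程,形式上类似对上下文做束搜索(beam search)。下面是一个假设性骨架(`expand`可理解为结合搜索/浏览工具改写上下文,`score`为验证集效用;非论文官方实现):

```python
# 候选上下文束搜索的假设性示意:扩展 -> 评分 -> 修剪,循环若干步。
def search_contexts(initial_contexts, expand, score, beam_width=4, steps=3):
    """expand(ctx) -> 若干改写后的候选上下文; score(ctx) -> 下游任务效用."""
    beam = list(initial_contexts)
    for _ in range(steps):
        candidates = beam + [c for ctx in beam for c in expand(ctx)]
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]  # 修剪
    return beam[0]
```

与顺序优化单一上下文相比,保留多条候选能容忍工具检索引入的噪声:个别被劣质检索结果带偏的候选会在修剪中被淘汰,这与正文"朴素加工具反而退化、配合搜索式训练才有增益"的发现一致。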

[NLP-57] Large Language Models Lack Temporal Awareness of Medical Knowledge

【速读】: 该论文旨在解决现有大型语言模型(LLMs)医学知识评估方法缺乏时间维度的问题——当前基准多基于非时间性的考试型数据,而医学知识本质上是动态演变的,新证据和治疗方案不断涌现,因此评估模型对时间特定知识的推理能力至关重要。解决方案的关键在于构建首个面向医学领域的LLM时间意识(temporal awareness)基准TempoMed-Bench,通过不断更新的指南知识(evolving guideline knowledge)来系统评估模型对不同时间点上正确知识的掌握程度。基于该基准的分析揭示了LLMs在医学知识上的时间意识缺失,包括:对最新知识的性能随历史时间呈渐进线性下降而非知识截止处的突变,对过时历史知识的回忆准确率仅为最新知识的25.37%–53.89%,以及模型预测在不同年份间表现不稳定的时间不一致行为。研究还表明,即使集成基于代理的搜索工具(agentic search tools)也难以有效解决这一问题(性能变化范围-3.15%至14.14%),从而凸显了该挑战的重要性及未来研发方向。

链接: https://arxiv.org/abs/2605.13045
作者: Zihan Guan,Qiao Jin,Guangzhi Xiong,Fangyuan Chen,Mengxuan Hu,Qingyu Chen,Yifan Peng,Zhiyong Lu,Anil Vullikanti
机构: University of Virginia(弗吉尼亚大学); National Institutes of Health(美国国立卫生研究院); Dana-Farber Cancer Institute(丹娜-法伯癌症研究所); Yale University(耶鲁大学); Weill Cornell Medicine(威尔康奈尔医学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 35 pages, 18 figures

点击查看摘要

Abstract:The existing methods for evaluating the medical knowledge of Large Language Models (LLMs) are largely based on atemporal examination-style benchmarks, while in reality, medical knowledge is inherently dynamic and continuously evolves as new evidence emerges and treatments are approved. Consequently, evaluating medical knowledge without a temporal context may provide an incomplete assessment of whether LLMs can accurately reason about time-specific medical knowledge. Moreover, most medical data are historical, requiring the models not only to recall the correct knowledge, but also to know when that knowledge is correct. To bridge the gap, we built TempoMed-Bench, the first-of-its-kind benchmark for evaluating the temporal awareness of the LLMs in the medical domain through evolving guideline knowledge. Based on the TempoMed-Bench, our evaluation analysis first reveals that LLMs lack temporal awareness in medical knowledge through the key findings: (1) model performance on up-to-date medical knowledge exhibits a gradual linear decline over time rather than a sharp knowledge-cutoff behavior, suggesting that parametric medical knowledge is not strictly bounded by knowledge cutoffs; (2) LLMs consistently struggle more with recalling outdated historical medical knowledge than with up-to-date recommendations: accuracy of historical knowledge is only 25.37%-53.89% of up-to-date knowledge, indicating potential knowledge forgetting effects during training; and (3) LLMs often exhibit temporally inconsistent behaviors, where predictions fluctuate irregularly across neighboring years. We also show that the temporal awareness problem is a challenge that cannot be easily solved when integrated with agentic search tools (-3.15%-14.14%). This work highlights an important yet underexplored challenge and motivates future research on developing LLMs that can better encode time-specific medical knowledge.

[NLP-58] Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models

【速读】: 该论文旨在解决扩散语言模型(DLMs)在迭代去噪生成过程中,因中间步骤生成的有害标记(harmful tokens)通过后续细化传播而引发不安全输出的安全漏洞问题。现有方法要么无法保证输出安全,要么在安全性提升时牺牲生成质量。解决方案的关键在于提出一种基于去噪过程中逐步干预(step-wise intervention)的推理时防御框架,其核心是对比安全方向(Contrastive Safety Direction, SGD),即一个捕获有害与安全生成之间语义边界的潜在方向。通过SGD在每个去噪步骤评估生成标记与有害语义的对齐程度,当检测到有害对齐时,重新掩码相应标记并恢复去噪过程,同时根据估计的有害程度自适应调整引导强度(adaptive steering),从而在保持输出质量的前提下有效降低安全风险。该方法作为即插即用模块,无需额外微调即可直接集成至现有扩散模型,实验结果表明其将越狱成功率降至0.64%且生成质量接近原始模型。

链接: https://arxiv.org/abs/2605.13043
作者: Yejin Lee,Yo-Sub Han
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注: 17 pages, 3 figures

点击查看摘要

Abstract:Diffusion Language Models (DLMs) provide a promising alternative to autoregressive language models by generating text through iterative denoising and bidirectional refinement. However, this iterative generation paradigm also introduces unique safety vulnerabilities when harmful tokens generated at intermediate denoising steps propagate through subsequent refinement processes and eventually induce unsafe outputs. While there are a few attempts to remedy this issue, they either fail to generate safe outputs or generate safe yet low-quality outputs. This motivates us to propose an inference-time defense framework based on the step-wise intervention during the denoising process, which then improves the safety without compromising the output quality. The key component of our framework is a contrastive safety direction (SGD), a latent direction that captures the semantic boundary between harmful and safe generations. We leverage SGD to assess the alignment of generated tokens with harmful semantics at each denoising step. When harmful alignment is detected, our method remasks the corresponding tokens and resumes the denoising process with adaptive steering, where the steering strength is modulated according to the estimated degree of harmfulness. As a plug-and-play module, our method circumvents the need for additional fine-tuning and can be directly incorporated into off-the-shelf diffusion models. The experimental results show that our approaches reduce jailbreak success rates to 0.64% while preserving generation quality close to the original model performance. This confirms the effectiveness of step-wise intervention for safe diffusion language model generation. Our code is available at this https URL.
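"对比安全方向 + 自适应引导"可以用向量运算草图示意(假设SGD取有害/安全隐状态均值之差并归一化、以余弦式对齐度衡量有害程度;仅为示意,非论文官方实现):

```python
# 对比安全方向与自适应重掩码/引导的假设性示意。
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v)) or 1.0

def safety_direction(harmful_states, safe_states):
    """有害与安全隐状态均值之差,归一化后作为潜在安全方向."""
    dim = len(harmful_states[0])
    d = [sum(h[i] for h in harmful_states) / len(harmful_states)
         - sum(s[i] for s in safe_states) / len(safe_states) for i in range(dim)]
    n = _norm(d)
    return [x / n for x in d]

def steer_if_harmful(h, direction, threshold=0.3, alpha=1.0):
    """对齐度超阈值则判为有害:重掩码该token并按有害程度自适应减去方向分量."""
    align = sum(a * b for a, b in zip(h, direction)) / _norm(h)
    if align <= threshold:
        return h, False  # 安全,保留
    strength = alpha * align  # 引导强度随估计的有害程度增大
    return [a - strength * b for a, b in zip(h, direction)], True
```

返回的布尔值对应"是否重掩码并恢复去噪":被标记的token回到掩码态,在引导后的表示上重新去噪。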

[NLP-59] Understanding and Accelerating the Training of Masked Diffusion Language Models

【Quick Read】: This paper tackles the problem that masked diffusion models (MDMs) train substantially more slowly than autoregressive models (ARMs) for language modeling, a potential bottleneck when scaling MDMs to larger sizes. The key insight is that language exhibits a locality bias: the predictive information for a token is concentrated in nearby positions, which makes the standard sampling of noise timesteps in MDM training inefficient. The proposed remedy is a simple yet effective training strategy, bell-shaped time sampling, which adjusts the timestep sampling distribution to match this locality, reaching the same validation negative log-likelihood (NLL) about 4x faster on the One Billion Word Benchmark (LM1B) without sacrificing final performance.

Link: https://arxiv.org/abs/2605.13026
Authors: Chunsan Hong, Sanghyun Lee, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Yuki Mitsufuji, Seungryong Kim, Jong Chul Ye
Affiliations: KAIST; Sony AI; University of Tokyo; Sony Group Corporation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint

Click to view abstract

Abstract:Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to \sim4\times faster than standard training on One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
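
As a rough illustration of the bell-shaped time sampling idea, the sketch below replaces the uniform draw of the masking timestep t with a truncated-Gaussian draw concentrated around mid-range masking ratios. The specific density (Gaussian with mu=0.5, sigma=0.15, truncated to (0, 1)) is an assumption for illustration; the paper's exact sampling distribution may differ.

```python
import random

def sample_time_uniform(rng):
    # standard MDM training draw: t ~ Uniform(0, 1)
    return rng.random()

def sample_time_bell(rng, mu=0.5, sigma=0.15):
    # bell-shaped alternative: truncated Gaussian on (0, 1)
    # (illustrative choice of density; the paper's may differ)
    while True:
        t = rng.gauss(mu, sigma)
        if 0.0 < t < 1.0:
            return t

rng = random.Random(0)
uniform = [sample_time_uniform(rng) for _ in range(10000)]
bell = [sample_time_bell(rng) for _ in range(10000)]

# fraction of draws that land on mid-range masking ratios
frac_mid = lambda ts: sum(1 for t in ts if 0.25 < t < 0.75) / len(ts)
u_mid, b_mid = frac_mid(uniform), frac_mid(bell)
```

Under these parameters roughly 90% of the bell-shaped draws fall in (0.25, 0.75), versus about half for the uniform schedule, concentrating training on informative masking ratios.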

[NLP-60] Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

【Quick Read】: This paper addresses a core challenge in automating Motivational Interviewing (MI) coding: conventional coding depends on trained MI professionals and is time- and labor-intensive, while existing automated approaches fail to exploit the multimodal behavioral signals present in audio. The key to the solution is a multimodal self-consistency framework built on an audio-language model (ALM) that analyzes raw audio input. Four complementary analytic prompts (verbal-cue analysis, prosody-aware, evidence-scoring, and comparative prompting) each yield three stochastic samples, giving 12 independent reasoning trajectories per utterance, whose predictions are aggregated by majority voting to improve coding robustness. The core innovation is jointly capturing both what clients say (verbal cues) and how they say it (acoustic cues), with self-consistency reducing the uncertainty of any single reasoning pass.

Link: https://arxiv.org/abs/2605.12987
Authors: Guangzeng Han, James G. Murphy, Benjamin O. Ladd, Xiaolei Huang, Brian Borsari
Affiliations: University of Memphis; Veterans Affairs Health Care System; University of California San Francisco; Washington State University Vancouver
Subjects: Computation and Language (cs.CL)
Comments: DOI: https://doi.org/10.1093/milmed/usag224

Click to view abstract

Abstract:BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behaviors and predicting outcomes, but it requires substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals. OBJECTIVE: This study aims to develop an automatic MI coding approach based on ALMs that analyzes raw audio input and integrates predictions from multiple reasoning trajectories using self-consistency to improve coding robustness. METHODS: We experimented with five recorded sessions from de-identified MI audio tapes. We deployed ALMs with four complementary analytic prompts to support utterance-level reasoning: analytic prompting for verbal cues, prosody-aware prompting for acoustic cues, evidence-scoring prompting for quantitative hypothesis testing, and comparative prompting for contrastive reasoning. Three stochastic samples were drawn for each prompt, generating 12 independent reasoning trajectories per utterance. Final predictions were determined by majority voting across all trajectories. RESULTS: Performance was evaluated using accuracy, precision, recall, and macro-F1 scores. The proposed multimodal self-consistency approach achieved 52.56% accuracy, 54.03% precision, 47.45% recall, and a macro-F1 score of 46.40%, exceeding baseline methods. Systematic ablation experiments that removed individual modules consistently degraded performance on the primary metrics. CONCLUSIONS: Multimodal self-consistency outperforms single-pass baseline prompting approaches for MI coding. These findings suggest that incorporating both what clients say and how they say it can support more reliable automatic MI coding.
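
The aggregation step described above (4 prompts x 3 stochastic samples = 12 reasoning trajectories per utterance, resolved by majority vote) can be sketched as follows; the code labels ("CT", "ST", "FN") are hypothetical placeholders, not the study's actual MI codebook.

```python
from collections import Counter

def majority_vote(labels):
    # aggregate one utterance's trajectory predictions;
    # Counter.most_common breaks ties by first-insertion order
    return Counter(labels).most_common(1)[0][0]

# hypothetical per-utterance predictions: 4 prompts x 3 stochastic samples
trajectories = ["CT", "CT", "ST", "CT", "FN", "CT",
                "CT", "ST", "CT", "CT", "FN", "CT"]
final_code = majority_vote(trajectories)  # "CT" wins 8 of 12
```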

[NLP-61] Leveraging Speech to Identify Signatures of Insight and Transfer in Problem Solving

【Quick Read】: The central question of this paper is what form insight takes during problem solving and how it affects a person's ability to handle similar problems in the future; specifically, whether participants improve faster via insight when a sequence of problems shares the same non-obvious solution, and what observable signatures such transferable insight leaves. The key to the approach is a "matchstick-arithmetic" experiment in which participants (N = 189) thought aloud while solving five sequential problems, with one group whose problems all relied on the same kind of non-obvious solution (Same group) and another that faced a different kind each time (Different group). The Same group improved more rapidly and, on later problems, talked more and more often spontaneously categorized the problem at hand. The study thus identifies a hallmark of transferable insight: its accessibility for verbal report, even when the underlying precursors of insight remain difficult to articulate.

Link: https://arxiv.org/abs/2605.12970
Authors: Linas Nasvytis, Judith E. Fan
Affiliations: Stanford University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Many problems seem to require a flash of insight to solve. What form do these sudden insights take, and what impact do they have on how people approach similar problems in the future? In this work, we prompted participants (N = 189) to talk aloud as they attempted to solve a sequence of five “matchstick-arithmetic” problems. These problems either all relied on the same kind of non-obvious solution (Same group) or a different kind each time (Different group). We found that Same participants improved more rapidly than Different participants, and as they improved, they talked more and talked about different things when solving later problems. Specifically, they were more likely to spontaneously categorize the problem they were working on. Taken together, these findings suggest that a hallmark of transferable insights is their accessibility for verbal report, even if the underlying precursors of insight remain difficult to articulate.

[NLP-62] Controlling Logical Collapse in LLM s via Algebraic Ontology Projection over F2

【Quick Read】: The core question of this paper is whether large language models internally encode ontological relations in a formally verifiable algebraic structure. The proposed Algebraic Ontology Projection (AOP) projects LLM hidden states into the finite field F2 (Galois Field F2) under Liskov Substitution Principle constraints, using only 42 relational pairs as algebraic keys. AOP achieves up to 93.33% zero-shot inclusion accuracy on unseen concept pairs with no model tuning, through prompting alone. The paper also introduces Semantic Crystallisation (SC), a metric that quantifies F2 constraint satisfaction relative to a random baseline and predicts zero-shot accuracy without held-out data. A further finding is that system prompts act as algebraic boundary conditions: only their combination with instruction tuning prevents Late-layer Collapse, a systematic degradation of logical consistency in the final layers. These findings reframe forward computation as an iterative process of algebraic organisation and open a path toward LLMs whose logical structure is formally accessible.

Link: https://arxiv.org/abs/2605.12968
Authors: Hisashi Miyashita
Affiliations: Mgnite Inc.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Do large language models internally encode ontological relations in a formally verifiable algebraic structure? We introduce Algebraic Ontology Projection (AOP), which projects LLM hidden states into the Galois Field F2 under Liskov Substitution Principle constraints, using only 42 relational pairs as algebraic keys. AOP achieves up to 93.33% zero-shot inclusion accuracy on unseen concept pairs (Gemma-2 Instruct with optimized prompt), with consistent 86.67% accuracy observed across multiple model families – with no model tuning, but through prompt alone. This algebraic structure is strongly layer-dependent. We introduce Semantic Crystallisation (SC), a metric that quantifies F2 constraint satisfaction relative to a random baseline and predicts zero-shot accuracy without held-out data. System prompts act as algebraic boundary conditions: only their combination with instruction tuning prevents Late-layer Collapse – a systematic degradation of logical consistency in the final layers, observed in 7 of 10 conditions. These findings reframe forward computation as an iterative process of algebraic organisation, and open a path toward LLMs whose logical structure is not merely approximated, but formally accessible.
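
A toy illustration of checking ontological inclusion with F2 (GF(2)) arithmetic: hidden states are binarized by sign and a subclass relation is tested as componentwise implication. This is a didactic stand-in only; the paper's actual projection uses learned algebraic keys and LSP constraints, neither of which is modeled here, and the vectors below are invented.

```python
def binarize(vec):
    # project a real-valued hidden state into F2 by sign
    # (toy threshold; AOP's learned algebraic keys are not modeled)
    return [1 if x > 0 else 0 for x in vec]

def f2_implies(sub, sup):
    # componentwise implication over F2: every active bit of the
    # subclass code must also be active in the superclass code
    return all((not a) or b for a, b in zip(sub, sup))

dog = binarize([0.9, 0.4, -0.2, 0.7])
animal = binarize([0.8, 0.5, 0.3, 0.6])
dog_is_animal = f2_implies(dog, animal)   # inclusion holds
animal_is_dog = f2_implies(animal, dog)   # reverse direction fails
```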

[NLP-63] DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

【Quick Read】: This paper aims to inject multilingual capability into an existing multimodal model without expensive multilingual multimodal data construction and repeated end-to-end retraining. The key is DiM3 (Direction- and Magnitude-aware Multilingual Multimodal merging), a training-free method that selectively composes the heterogeneous multilingual and multimodal residual updates in the shared language-model backbone at each parameter dimension, while preserving the original vision encoder and multimodal projector, thereby substantially improving multilingual performance while retaining general multimodal ability.

Link: https://arxiv.org/abs/2605.12960
Authors: Zijing Wang, Mingyang Wang, Ercong Nie, Yongkang Liu, Shi Feng, Mengjie Zhao, Daling Wang, Xiaocui Yang, Hinrich Schütze
Affiliations: Northeastern University; CIS, LMU Munich; Munich Center for Machine Learning (MCML); Shanghai Jiao Tong University; SB Intuitions
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on this https URL.
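
One way to picture direction- and magnitude-aware merging is per-dimension composition of two residual updates on the shared backbone: agreeing directions are summed, and on sign conflict the larger-magnitude update wins. This rule and all numbers below are illustrative assumptions, not the paper's exact criterion.

```python
def dim3_merge(base, delta_lang, delta_mm):
    # per-dimension composition of multilingual (delta_lang) and
    # multimodal (delta_mm) residual updates on the shared backbone:
    # agreeing directions are summed; on sign conflict keep the
    # larger-magnitude update (illustrative rule only)
    merged = []
    for b, dl, dm in zip(base, delta_lang, delta_mm):
        if dl * dm >= 0:
            merged.append(b + dl + dm)
        else:
            merged.append(b + (dl if abs(dl) >= abs(dm) else dm))
    return merged

base = [0.0, 1.0, -0.5]
delta_lang = [0.2, -0.3, 0.1]
delta_mm = [0.1, 0.4, 0.05]
merged = dim3_merge(base, delta_lang, delta_mm)
```

In the middle dimension the two updates conflict, so only the larger multimodal update (0.4) is applied; the other dimensions agree and are summed.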

[NLP-64] From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

【Quick Read】: This paper targets recipe-level data curation for supervised fine-tuning (SFT): given a fixed instruction pool, discover an executable recipe composed of ordered operators such as filtering, mixing, and deduplication that constructs a high-quality subset under a limited budget of full SFT evaluations, without generating or rewriting training samples. The key is AutoSelection, a two-layer solver that decouples cached task-, data-, and model-side signals from expensive full evaluation, and searches recipe structures efficiently via warmup probes, realized subset states, local recipe edits, Gaussian-process-assisted ranking, and stagnation-triggered reseeding. Experiments show it outperforms baselines such as full-data training and random search on both in-distribution reasoning and out-of-distribution generalization.

Link: https://arxiv.org/abs/2605.12944
Authors: Haodong Wu, Jiahao Zhang, Lijie Hu, Yongqi Zhang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Mohamed bin Zayed University of Artificial Intelligence
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Supervised fine-tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top- k subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed-pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high-quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two-layer solver that decouples fixed-pool materialization based on cached task-, data-, and model-side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits, Gaussian-process-assisted ranking, and stagnation-triggered reseeding. Experiments on a 90K instruction pool show that AutoSelection achieves the strongest in-distribution reasoning average across three base models, outperforming full-data training, random recipe search, random top- k , and single-operator selectors. Additional Out-of-distribution graph-reasoning results, search-stability analyses, structural ablations, and 1.5B-to-7B transfer checks further show that recipe structure matters beyond individual selection operators. Code is available at this https URL.
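
The notion of an executable recipe, i.e. an ordered list of grounded operators applied to a fixed pool, can be sketched minimally as below; the concrete operators (a length filter and exact-text deduplication) and the toy pool are placeholder examples, not operators from the paper's library.

```python
def apply_recipe(pool, recipe):
    # execute an ordered list of operators over a fixed instruction pool
    subset = list(pool)
    for op in recipe:
        subset = op(subset)
    return subset

def filter_min_len(n):
    # placeholder filtering operator: drop very short samples
    return lambda xs: [x for x in xs if len(x["text"]) >= n]

def dedup(xs):
    # placeholder exact-text deduplication operator
    seen, out = set(), []
    for x in xs:
        if x["text"] not in seen:
            seen.add(x["text"])
            out.append(x)
    return out

pool = [
    {"text": "short"},
    {"text": "a longer instruction sample"},
    {"text": "a longer instruction sample"},
    {"text": "another usable sample here"},
]
selected = apply_recipe(pool, [filter_min_len(10), dedup])
```

Because operators are applied in order, searching over recipes means searching over both which operators appear and how they are sequenced.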

[NLP-65] ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset

【Quick Read】: This paper addresses accuracy problems in machine translation (MT) of geographically rich text, particularly travelogues that mention concrete geo-entities. Such text is valuable for applications like tourism management, but equitable multilingual access depends on high-quality translation; existing MT systems handle geographic context and regional specificity poorly, geo-entity translation errors are frequent, and fine-grained evaluation methods are lacking. The key to the solution is ATD-Trans, a geographically grounded Japanese-English travelogue parallel dataset that provides both full translations and geo-entity-level annotation, enabling translation quality to be evaluated across regions (within Japan vs. overseas) and entity types. Experiments on existing language models reveal two central factors: model language focus (Japanese-enhanced models perform better) and geographic region (domestic-region geo-entities are harder to translate), pointing toward stronger regional geographic knowledge in models and dedicated geo-entity evaluation as directions for improving geographic-text MT.

Link: https://arxiv.org/abs/2605.12933
Authors: Shohei Higashiyama, Hiroki Ouchi, Atsushi Fujita, Masao Utiyama
Affiliations: National Institute of Information and Communications Technology; Nara Institute of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Geographic text, or textual data rich in geographic (geo-) information is a valuable source for various geographic applications, e.g., tourism management. Making such information accessible to speakers of other languages further enhances its utility; thus, accurate machine translation (MT) is essential for equity in multilingual geo-information access. To facilitate in-depth analysis for geographic text, we introduce ATD-Trans, a geographically grounded Japanese–English travelogue translation dataset, which enables evaluation of MT quality at both the overall and geo-entity levels across domestic (within Japan) and overseas regions. Our experiments on existing language models examine two factors: model language focus and geographic regions. The results highlight advantages of Japanese-enhanced models and greater difficulty in translating domestic-region geo-entities mentioned in travel blogs.

[NLP-66] When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

【Quick Read】: This paper seeks to explain, mechanistically, why large language models gradually lose track of instructions, personas, and rules over multi-turn interaction, a degradation previously measured only behaviorally. The key is a channel-transition account: goal-defining tokens become inaccessible through attention as interaction proceeds (the attention channel closes), while goal-related information may persist as residual representations. The authors introduce the Goal Accessibility Ratio (GAR), a diagnostic measuring attention from generated tokens to task-defining goal tokens, combined with sliding-window ablations and residual-stream probes for causal analysis. Across architectures, some models preserve goal-conditioned behavior at vanishing attention while others fail despite decodable residual goal information. A causal experiment that force-closes the attention channel in Mistral collapses recall on a 20-fact retention task from near-perfect to 11% and raises persona-constraint violations to adversarial-pressure baseline levels. Finally, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure, alongside a parametric prediction of failure timing under windowed attention closure.

Link: https://arxiv.org/abs/2605.12922
Authors: Vardhan Dongre, Joseph Hsieh, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Dilek Hakkani-Tür
Affiliations: University of Illinois Urbana-Champaign; Adobe Research
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
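
One plausible formalization of the Goal Accessibility Ratio is the average attention mass that generated tokens place on goal-defining tokens; the paper may normalize differently (e.g. per head or per layer), so treat the sketch and its toy attention matrix as assumptions.

```python
def goal_accessibility_ratio(attn, goal_idx, gen_idx):
    # attn[q][k]: attention weight from query token q to key token k;
    # GAR here = mean, over generated tokens, of the attention mass
    # placed on goal-defining tokens (one plausible formalization)
    ratios = []
    for q in gen_idx:
        total = sum(attn[q])
        goal_mass = sum(attn[q][k] for k in goal_idx)
        ratios.append(goal_mass / total if total else 0.0)
    return sum(ratios) / len(ratios)

# toy 4-token sequence: tokens 0-1 define the goal, 2-3 are generated
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0],    # goal still accessible
    [0.05, 0.05, 0.4, 0.5],  # attention channel nearly closed
]
gar = goal_accessibility_ratio(attn, goal_idx=[0, 1], gen_idx=[2, 3])
```

Tracking this ratio turn by turn is what lets the paper locate the crossover point where the attention channel to the goal closes.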

[NLP-67] CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

【Quick Read】: This paper addresses the insufficient evaluation of large language models' (LLMs') entity-based commonsense reasoning about causal relations, especially abductive reasoning and explanation generation: existing datasets mostly use true/false or multiple-choice formats and do not explicitly assess causal reasoning or explanations. The key is CommonWhy, a dataset of 15,000 "why" questions that require entity-based causal commonsense reasoning while also serving as a Knowledge Graph Question Answering (KGQA) benchmark, since all supporting knowledge is available in the Wikidata knowledge graph. Unlike traditional KGQA datasets that mainly test fact retrieval, CommonWhy targets causal commonsense reasoning and establishes a new KGQA evaluation paradigm, exposing significant shortcomings of state-of-the-art LLMs and LLM-based KGQA methods, including frequent factual hallucinations and failures in causal reasoning.

Link: https://arxiv.org/abs/2605.12918
Authors: Armin Toroghi, Faeze Moradi Kalarde, Scott Sanner
Affiliations: University of Toronto
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:To effectively interact with the real world, Large Language Models (LLMs) require entity-based commonsense reasoning, a challenging task that necessitates integrating factual knowledge about specific entities with commonsense inference. Existing datasets for evaluating LLM entity-based commonsense reasoning have largely focused on True/False or multiple-choice questions, leaving the explicit assessment of the model’s ability in abductive reasoning about causes and effects and generating explanations largely unexamined. In this work, we introduce CommonWhy, a dataset of 15,000 why questions designed to evaluate entity-based commonsense reasoning about causal relationships in LLMs. CommonWhy also serves as a Knowledge Graph Question Answering (KGQA) benchmark, as all supporting knowledge required to answer its queries is available in the Wikidata knowledge graph. Unlike existing KGQA datasets, which primarily test fact retrieval, CommonWhy targets causal commonsense reasoning, establishing a new paradigm for KGQA evaluation. Experiments with state-of-the-art LLMs and LLM-based KGQA methods reveal their significant shortcomings, including frequent factual hallucinations and failures in causal reasoning.

[NLP-68] When Do LLMs Generate Realistic Social Networks? A Multi-Dimensional Study of Culture Language Scale and Method

【Quick Read】: This paper examines how, when LLMs are used as substitutes for human subjects in behavioral simulations including synthetic social network generation, their relational outputs are systematically shaped by variables often treated as mere implementation details: prompt design, cultural framing, prompt language, and model scale, which in fact encode substantive sociological assumptions. The key is to formalize, grounded in homophily theory and structural balance theory, four LLM-based tie-formation mechanisms (sequential, global, local, iterative) as distinct conditional distributions over edge sets, and then, using a fixed roster of 50 demographically grounded personas, generate 192 directed networks (two seeds per condition) across four cultural contexts, four prompt languages, three GPT-4.1 variants, and four prompting architectures, isolating each factor's contribution to network structure (inbreeding homophily, largest component, clustering, modularity, and demographic bias). Experiments show that cultural framing, prompt architecture, model scale, and prompt language each significantly alter network properties, and that the smallest model variant behaves qualitatively differently rather than merely noisily, confirming that prompt choice itself is a substantive socio-computational variable.

Link: https://arxiv.org/abs/2605.12898
Authors: Sai Hemanth Kilaru, Sriram Theerdh Manikyala, Raghav Upadhyay, Sri Sai Kumar Ramavath, Srivika Nunavathu, Dalal Alharthi
Affiliations: University of Arizona
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly used as substitutes for human subjects in behavioral simulations, including synthetic social network generation. Yet it remains unclear how their relational outputs depend on prompt design, cultural framing, prompt language, and model scale. Building on homophily theory and structural balance theory, we formalize four LLM-based tie-formation mechanisms: sequential, global, local, and iterative, and treat them as distinct conditional distributions over edge sets. Using a fixed roster of 50 demographically grounded personas, we generate 192 verified directed networks across four cultural contexts, four prompt languages, three GPT-4.1 variants, and four prompting architectures, with two seeds per condition. We find that cultural framing shifts inbreeding homophily and largest-component connectivity. Political affiliation dominates tie formation under three methods, while the global method substitutes age, showing that prompt architecture functions as a substantive sociological variable. Model scale produces a stable divergence ranking, with the smallest variant behaving qualitatively differently rather than merely noisily. Prompt language alone sharply shifts religion homophily, especially under Hindi prompting, while leaving political homophily nearly invariant. LLM-generated networks match real social graphs on clustering and modularity better than standard graph baselines, yet encode demographic biases above empirical levels. These results show that prompt choices often treated as implementation details encode substantive sociological assumptions.

[NLP-69] Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

【Quick Read】: This paper addresses the gap between simulation and deployment for agents based on large language models (LLMs): existing user simulators inherit the behavior of their underlying models and are overly cooperative and homogeneous, so agents that look strong in simulation often fail with real users who are unclear, impatient, or reluctant to share information. The key is Persona Policies (PPol), a plug-and-play control layer that casts persona generation as LLM-driven evolutionary program search, optimizing a Python generator to automatically discover diverse human behavioral patterns and translate them into task-preserving roleplay policies. The generator is guided by a multi-objective fitness score combining human-likeness with broad behavioral coverage, yielding a population of personas spanning a wide range of realistic human behaviors, which improves simulator realism and the robustness of agents trained against the simulator.

Link: https://arxiv.org/abs/2605.12894
Authors: Harshita Chopra, Kshitish Ghate, Aylin Caliskan, Tadayoshi Kohno, Chirag Shah, Natasha Jaques
Affiliations: University of Washington; Georgetown University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint under review

Click to view abstract

Abstract:Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug-and-play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand-crafting personas, we cast persona generation as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task-preserving roleplay policies. Candidate generators are guided by a multi-objective fitness score combining human-likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain. Across tau^2-bench retail and airline domains, evolved PPol programs yield 33-62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol-conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out-of-distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator-based evaluation and training without changing tasks or rewards.

[NLP-70] CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

【速读】: 该论文试图解决现有文档视觉问答(Doc-VQA)评估仅关注最终答案、忽略支持性证据的问题,揭示了一种关键缺陷:模型可能给出正确答案却将其定位到错误文本区域,这在法律、金融、医学等高风险领域尤为危险。解决方案的关键在于提出 CiteVQA 基准,要求模型同时返回答案和元素级别的边界框引用,并采用严格属性准确率(Strict Attributed Accuracy, SAA)指标进行联合评估——仅在答案和引用区域均正确时才计分,从而暴露并度量“归因幻觉(Attribution Hallucination)”,同时通过基于掩码消融的自动化管道生成地面真实引用并经由专家审核,保证数据的可靠性与可扩展性。

链接: https://arxiv.org/abs/2605.12882
作者: Dongsheng Ma,Jiayu Li,Zhengren Wang,Yijie Wang,Jiahao Kong,Weijun Zeng,Jutao Xiao,Jie Yang,Wentao Zhang,Bin Wang,Conghui He
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage – a critical risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline-which identifies crucial evidence via masking ablation-and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it. Our repository is available at this https URL.
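
Strict Attributed Accuracy can be sketched as joint scoring of answer and citation; the IoU-threshold matching used below is an assumption for illustration, as the benchmark may match cited elements differently (e.g. by element identity rather than box overlap).

```python
def iou(a, b):
    # intersection-over-union of two boxes (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def strict_attributed_accuracy(preds, gold, iou_thr=0.5):
    # credit a prediction only when answer AND cited region are both right
    hits = 0
    for p, g in zip(preds, gold):
        if p["answer"] == g["answer"] and iou(p["box"], g["box"]) >= iou_thr:
            hits += 1
    return hits / len(gold)

gold = [{"answer": "42", "box": (0, 0, 10, 10)},
        {"answer": "Acme", "box": (20, 20, 30, 30)}]
preds = [{"answer": "42", "box": (1, 1, 10, 10)},      # answer + citation right
         {"answer": "Acme", "box": (50, 50, 60, 60)}]  # right answer, wrong citation
saa = strict_attributed_accuracy(preds, gold)  # 0.5
```

The second prediction illustrates Attribution Hallucination: an answer-only metric would score it 1.0, while SAA gives no credit.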

[NLP-71] Persona-Model Collapse in Emergent Misalignment NEURIPS2026

【Quick Read】: This paper investigates the mechanism behind emergent misalignment: fine-tuning large language models on narrow data containing harmful content produces broadly misaligned behavior on unrelated prompts. The paper proposes that emergent misalignment involves persona-model collapse, a deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. The key is two behavioral metrics, Moral Susceptibility (S) and Moral Robustness (R), computed respectively from the across-persona and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play, formalizing the ability to differentiate characters (S) and consistency when maintaining a given one (R). Comparing four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in base, insecure-code fine-tuned, and matched secure-control fine-tuned variants shows that insecure fine-tuning raises S by 55% on average (well beyond the band of previously benchmarked frontier models) and lowers R by 65% on average (a 304% increase in 1/R), while the secure control causes only a partial R loss and keeps S near the base model, providing a sensitive diagnostic of emergent misalignment and behavioral evidence for the persona-model collapse hypothesis.

Link: https://arxiv.org/abs/2605.12850
Authors: Davi Bastos Costa, Renato Vicente
Affiliations: TELUS Digital Research Hub; Center for Artificial Intelligence and Machine Learning; Institute of Mathematics, Statistics and Computer Science; University of São Paulo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 23 pages, 7 figures, 7 tables; NeurIPS 2026 submission

Click to view abstract

Abstract:Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model’s internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility (S) and moral robustness ®, computed from the across- and within-persona variability of models’ Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model’s ability to differentiate characters (S) and its consistency when simulating a given one ®. We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned to output secure code. Across the four models, insecure fine-tuning produces an average 55% increase in S, pushing all four insecure variants beyond the band observed across 13 frontier models benchmarked in prior work – with GPT-4o reaching more than twice the band’s upper end – signaling dysregulated differentiation. It also causes an average 65% decrease in R, equivalent to a 304% increase in 1/R. By contrast, the matched secure control preserves S near the base and induces only a partial R loss, showing that these effects are largely misalignment-specific. Complementing these metric shifts, insecure variants’ unconditioned responses converge toward saturation near the scale ceiling, departing markedly from both base models’ structured responses and those elicited when base models role-play toxic personas. Taken together, these metrics provide a sensitive diagnostic for emergent misalignment and serve as behavioral evidence that it involves persona-model collapse.
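
A minimal sketch of the two metrics, assuming S is the spread of per-persona mean scores (differentiation across characters) and R the inverse of the average within-persona spread (consistency within a character); the paper's exact estimators may differ, and the persona names and scores are invented.

```python
import statistics

def susceptibility_and_robustness(responses):
    # responses: persona -> repeated questionnaire-style scores.
    # S: spread of per-persona means (differentiating characters);
    # R: inverse of average within-persona spread (consistency).
    # One plausible formalization; the paper's estimators may differ.
    means = [statistics.mean(v) for v in responses.values()]
    S = statistics.pstdev(means)
    within = statistics.mean(statistics.pstdev(v)
                             for v in responses.values())
    R = 1.0 / within if within else float("inf")
    return S, R

scores = {"pacifist": [4.0, 4.1, 3.9], "soldier": [2.0, 2.1, 1.9]}
S, R = susceptibility_and_robustness(scores)
```

A collapsed model would show the insecure-variant pattern reported above: inflated S (dysregulated differentiation) together with deflated R (inconsistent role maintenance).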

[NLP-72] Training Large Language Models to Predict Clinical Events

【Quick Read】: The core problem is how to extract evidence of patient evolution over time from longitudinal clinical notes and convert it into training supervision for clinical prediction, without hand-engineered structured features or endpoint-specific classifiers. The key is extending Foresight Learning to the clinical domain: time-ordered MIMIC-III notes are converted into three-part prediction examples consisting of past patient context, a natural-language question about a possible future event, and a label resolved from later documentation. This pipeline yields 6,900 prediction examples from 702 admissions, covering medications, procedures, organ support, microbiology, and mortality. A lightweight LoRA adapter fine-tuned on these examples markedly improves calibration over the prompted base model (expected calibration error from 0.1269 to 0.0398, Brier score from 0.199 to 0.145) and slightly outperforms GPT-5 point estimates on held-out questions. The key contribution is reusable clinical prediction supervision extracted directly from unstructured notes.

Link: https://arxiv.org/abs/2605.12817
Authors: Benjamin Turtel, Paul Wilczewski, Kris Skotheim
Affiliations: Lightning Rod Labs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Longitudinal clinical notes contain rich evidence of how patients evolve over time, but converting this signal into training supervision for clinical prediction remains challenging. We extend Foresight Learning to clinical prediction by converting time-ordered MIMIC-III notes into examples consisting of past patient context, a natural-language question about a possible future event, and a label resolved from later documentation. This process yields 6,900 prediction examples from 702 admissions across medications, procedures, organ support, microbiology, and mortality. A small LoRA adapter trained on these examples improves over the prompted base model, reducing expected calibration error from 0.1269 to 0.0398 and Brier score from 0.199 to 0.145, while slightly outperforming GPT-5 point estimates on held-out questions. The approach enables reusable clinical prediction supervision from longitudinal notes without hand-engineered structured features or endpoint-specific classifiers.
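
The example-construction step can be sketched as slicing time-ordered notes into (context, question, label) triples; the event names, question template, and mini-record below are illustrative placeholders, not MIMIC-III specifics.

```python
def build_prediction_examples(notes, candidate_events):
    # notes: time-ordered list of {"text": ..., "events": set(...)}.
    # At each cut point, pair the past context with a question about a
    # possible future event; the label is resolved from later notes.
    examples = []
    for i in range(1, len(notes)):
        context = " ".join(n["text"] for n in notes[:i])
        future = set().union(*(n["events"] for n in notes[i:]))
        for event in candidate_events:
            examples.append({
                "context": context,
                "question": f"Will the patient receive {event}?",
                "label": event in future,
            })
    return examples

# illustrative mini-record (not real MIMIC-III content)
notes = [
    {"text": "Admitted with sepsis.", "events": set()},
    {"text": "Hypotension worsening; pressors started.",
     "events": {"vasopressors"}},
]
examples = build_prediction_examples(notes, ["vasopressors", "intubation"])
```

Each triple is then rendered as a prompt-plus-label fine-tuning example, which is what the LoRA adapter is trained on.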

[NLP-73] Linking Extreme Discourse to Structural Polarization in Signed Interaction Networks

【速读】: 该论文试图解决在线社区极化研究中语言分析与交互结构分析相分离的问题,现有工作要么单独研究语言或交互结构,要么通过人工判断分歧来构建互动图,导致作为观察文本的语言与作为文本工程化表示的结构之间存在鸿沟。解决方案的关键是提出一个基于语言的符号网络(language-grounded signed-network)管道:利用大语言模型(LLM)立场分数推导出连续符号边权重(continuous signed edge weights),并采用两种互补的量化方法——谱特征符号分数(spectral Eigen-Sign score)和基于划分的挫折分数(partition-based frustration score)——来测量结构极化,再通过归一化使两者一致但保留对边幅度的不同敏感性。此外,该框架还分析了窗口级话语信号(如毒性、极端标量声明和困惑度)与结构极化时间变化的关系,并通过边缘级和消融分析证明连续置信加权的符号边能揭示仅符号表示中被抑制的强度敏感模式,最终实现语言与符号网络结构在统一框架中的连接,以动态测量和解释极化现象。

链接: https://arxiv.org/abs/2605.12814
作者: Zhijin Guo,Li Zhang,Tyler Bonnet,Janet B. Pierrehumbert,Xiaowen Dong
机构: University of Oxford (牛津大学); University College London (伦敦大学学院); Imperial College London (帝国理工学院)
类目: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Polarization in online communities is often studied through either language or interaction structure, but the two views are rarely connected in a unified measurement pipeline. Prior work links them by building interaction graphs from human judgments of agreement and disagreement, leaving a gap between language as observed text and structure as an engineered representation of that text. We address this gap with a language-grounded signed-network pipeline that derives continuous signed edge weights from LLM stance scores and quantifies structural polarization using two complementary measures: a spectral Eigen-Sign score and a partition-based frustration score. After normalization, the two measures show substantial agreement while retaining important differences in their sensitivity to edge magnitude. Applying the framework to Reddit Brexit discussions, we analyze how window-level discourse signals, including toxicity, extreme scalar claims, and perplexity, relate to temporal variation in structural polarization. Edge-level and ablation analyses show that continuous, confidence-weighted signed edges reveal intensity-sensitive patterns that are muted under sign-only representations. We further report an exploratory one-step-ahead forecasting analysis suggesting that lagged language signals may contain information about future polarization beyond structural persistence. Together, the results demonstrate how discourse and signed-network structure can be connected in a single framework for measuring and interpreting polarization dynamics over time.
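
摘要中的“基于划分的挫折分数(partition-based frustration score)”可以用结构平衡理论直观理解:给定两派划分,组内负边或跨组正边即为“受挫”边。下面是一个示意性的加权实现(具体归一化方式为假设,并非论文的精确定义):

```python
import numpy as np

def frustration(weights, edges, partition):
    """weights: 带符号的连续边权;edges: (u, v) 列表;
    partition: 节点 -> 派别 {0, 1}。
    组内负边或跨组正边视为违反结构平衡,按 |权重| 计入受挫量。"""
    frustrated = 0.0
    total = 0.0
    for w, (u, v) in zip(weights, edges):
        same = partition[u] == partition[v]
        if (w < 0 and same) or (w > 0 and not same):
            frustrated += abs(w)
        total += abs(w)
    return frustrated / total if total else 0.0

edges = [("a", "b"), ("a", "c"), ("b", "c")]
w = np.array([1.0, -0.5, -0.5])
good_split = {"a": 0, "b": 0, "c": 1}   # 完全平衡的划分
bad_split = {"a": 0, "b": 1, "c": 1}    # 把盟友拆开的划分
f_good = frustration(w, edges, good_split)
f_bad = frustration(w, edges, bad_split)
```

分数越低表示网络越接近“两派对立”的极化结构;连续权重使得强弱分歧对分数的贡献不同,这正是摘要中“强度敏感”模式的来源。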

[NLP-74] REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations ICML2026

【速读】: 该论文旨在解决大型语言模型(LLMs)在应对对抗性提示(adversarial prompts)时存在的幻觉(hallucinations)问题,特别是现有方法在生成语义等价(semantically equivalent)且连贯的现实性提示时的局限性。当前离散提示攻击受限于有限的变体搜索空间,而连续潜在空间攻击虽探索范围更广但常解码出无效改写。解决方案的关键在于提出REALISTA框架,它通过构建一个依赖于输入的字典,其中每个方向代表一个语义等价且连贯的改写,并在潜在空间中优化这些方向的连续组合,从而将离散改写攻击的语义现实性与连续攻击的优化灵活性相结合。

链接: https://arxiv.org/abs/2605.12813
作者: Buyun Liang,Jinqi Luo,Liangzu Peng,Kwan Ho Ryan Chan,Darshan Thaker,Kaleab A. Kinfu,Fengrui Tian,Hamed Hassani,René Vidal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted at ICML 2026. Code is available at this https URL

点击查看摘要

Abstract:Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the need for realistic adversarial prompts that elicit such failures. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at this https URL.

[NLP-75] Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

【速读】: 该论文试图解决大语言模型(LLMs,Large Language Models)在窄域有害数据集上微调后出现的“突现性错位”(Emergent Misalignment)现象——即模型表现出远超微调分布范围的错位行为,以及“阈下学习”(Subliminal Learning)现象——即通过看似良性但由有害教师生成的数据传播错位。解决方案的关键在于将错位重新定义为一种数据中介的迁移现象:有害微调样本并不导致均匀的行为溢出,而是依赖于微调数据结构(如评估提示与微调提示共享的底层功能结构、提示为连贯有害完成留下的空间)与任务相对于模型的难度之间的交互作用;同时,预训练组成(pretraining composition)和训练通道(如离策略与在线策略蒸馏)也显著影响错位的传递。研究因此倡导一种以数据为中心的观点,即错位不是孤立有害样本的简单后果,而是微调数据结构、预训练分布与训练渠道共同作用的结果。

链接: https://arxiv.org/abs/2605.12798
作者: Baris Askin,Muhammed Ustaomeroglu,Anupam Nayak,Gauri Joshi,Guannan Qu,Carlee Joe-Wong
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning LLMs on narrow harmful datasets can induce Emergent Misalignment (EM), where models exhibit misaligned behavior far beyond the fine-tuning distribution. We argue that emergent misalignment can be better understood as a data-mediated transfer phenomenon: harmful fine-tuning examples do not induce uniform behavioral spillover, but interact with the structural properties of the dataset and the difficulty of the tasks relative to the model. Across our experiments, we find that misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional structure, when prompts leave more room for coherent harmful completions, and when the target behavior has been more reliably learned by the model. The training pipeline itself also matters: pretraining composition shapes later misalignment. We further study Subliminal Learning (SL), where misalignment is transmitted by fine-tuning on seemingly benign data generated by a harmful teacher. Moving beyond the standard SFT setting, we for the first time compare this transfer under off-policy and on-policy distillation as well, allowing us to separate the roles of the teacher guidance and the training data distribution in transmitting misalignment. Together, these results argue for a data-centric view: Emergent/subliminal misalignment should not be treated as a simple consequence of isolated harmful fine-tuning examples, but as the result of interactions between fine-tuning data structure, pretraining distributions, and training channels.

[NLP-76] WriteSAE: Sparse Autoencoders for Recurrent State

【速读】: 该论文试图解决现有稀疏自编码器(SAE)无法对状态空间模型(如Mamba-2)和混合循环语言模型(如Gated DeltaNet、RWKV-7)中的矩阵缓存写入(matrix cache write)进行分解与编辑的问题。这些模型的缓存写入通过秩-1更新(rank-1 update)k_t v_t^⊤ 实现,其原生形状为 d_k × d_v,因此任何向量原子(vector atom)都无法直接替代该操作。解决方案的关键在于:WriteSAE将每个解码器原子(decoder atom)显式分解为与缓存写入匹配的原生形状(即外积形式),并推导出每个token logit偏移的封闭形式(closed form),同时在匹配的Frobenius范数(matched Frobenius norm)下训练,使得每个原子一次仅交换一个缓存槽(cache slot)。通过这种设计,WriteSAE首次实现了对矩阵循环写入位点的行为级干预(behavioral install),实验验证了其在多个模型上的有效性与可解释性。

链接: https://arxiv.org/abs/2605.12770
作者: Jack Young
机构: Indiana University (印第安纳大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 26 pages, 14 figures, 21 tables; code at this https URL

点击查看摘要

Abstract:We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state-space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba-2, and RWKV-7 write to a d_k × d_v cache through rank-1 updates k_t v_t^⊤ that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched-norm ablation on 92.4% of n=4,851 firings at Qwen3.5-0.8B L9 H4, the 87-atom population test holds at 89.8%, the closed form predicts measured effects at R^2=0.98, and Mamba-2-370M substitutes at 88.1% over 2,500 firings. Sustained three-position installs at 3× lift midrank target-in-continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix-recurrent write site.
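
摘要所述的“秩-1 缓存写入”可以用最朴素的线性注意力式递推来示意:每个 token 以外积 k_t v_t^⊤ 累加进 d_k × d_v 的矩阵缓存。下面是一个不含门控、不含衰减的示意实现(这些简化均为假设,真实模型如 Gated DeltaNet 带有额外的门控项):

```python
import numpy as np

def rank1_cache_updates(keys, values):
    """keys: (T, d_k), values: (T, d_v)。
    返回逐 token 累加秩-1 写入 S_t = S_{t-1} + k_t v_t^T 后的矩阵缓存。"""
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_k, d_v))
    for k_t, v_t in zip(keys, values):
        S += np.outer(k_t, v_t)  # WriteSAE 的原子必须匹配的原生写入形状
    return S

rng = np.random.default_rng(0)
K = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 2))
S = rank1_cache_updates(K, V)
```

正因为写入的原生形状是矩阵而非向量,读取残差流的普通 SAE 原子无法替代它,这正是 WriteSAE 将解码器原子分解为外积形式的动机。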

[NLP-77] Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

【速读】: 该论文试图解决现有语言模型作为模拟学生时,仅通过输出与真实学生的相似性进行评价,而未能验证其在交互过程中是否保持一致的、由误解驱动的信念状态(misconception faithfulness)的问题。核心问题在于,模拟器是否会像真实学生一样,仅在接收到针对其根本误解的反馈时才更新信念,而非对所有纠正信号(如指出错误)都无条件接受。解决方案的关键在于提出一个受控评估框架,包括误解对比反馈协议(misconception-contrastive feedback protocol),该协议通过比较针对性反馈与两种对照反馈(针对不同误解的错位反馈和仅指出答案错误的通用反馈)的影响,以及由此衍生的选择性翻转分数(Selective Flip Score, SFS),量化模拟器在针对性反馈下相对于对照反馈翻转答案的额外倾向。进一步地,为了纠正模型中普遍存在的阿谀逢迎失败模式(sycophantic failure mode),即模型将任何纠正信号视为放弃模拟信念并依赖内部知识重新求解的线索,论文开发了一个后训练流水线,整合监督微调(supervised fine-tuning, SFT)、偏好优化(preference optimization)和基于SFS对齐奖励(SFS-aligned reward)的强化学习(reinforcement learning, RL),实验表明SFT能带来显著提升(+0.56),而SFS对齐的RL比偏好优化提供更一致的改进。

链接: https://arxiv.org/abs/2605.12748
作者: Heejin Do,Shashank Sonkar,Mrinmaya Sachan
机构: ETH Zürich (苏黎世联邦理工学院); ETH AI Center (ETH AI中心); University of Central Florida (中佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can fluently generate student-like responses, making them attractive as simulated students for training and evaluating AI tutors and human educators. Yet such simulators are typically evaluated by output similarity to real students, not by whether they behave like students with coherent misconceptions during interaction. We introduce a controlled framework for evaluating misconception faithfulness, whether a simulator maintains a misconception-driven belief state and updates selectively when feedback addresses the underlying misconception. Central to our framework is a misconception-contrastive feedback protocol that compares targeted feedback against two controls: misaligned feedback (targeting a different but plausible misconception) and generic feedback (only identifying answer is wrong). We propose Selective Flip Score (SFS), which quantifies how much more often a simulator flips its answer under targeted feedback than under contrastive controls. Across seven LLMs (4B-120B), multiple datasets, and prompting strategies, simulators exhibit near-zero SFS, correcting their answers at similarly high rates regardless of feedback relevance. Further analyses reveal a sycophantic failure mode: models behave less like students with misconceptions but more like problem-solvers who treat any corrective signal as a cue to abandon the simulated belief and re-solve from internal knowledge. To address this, we develop a post-training pipeline spanning supervised fine-tuning (SFT), preference optimization, and reinforcement learning (RL) with an SFS-aligned reward; SFT yields notable gains up to +0.56, and SFS-aligned RL provides more consistent improvements than preference optimization. Our results establish misconception faithfulness as a challenging yet trainable property, motivating a shift from static output matching toward interactive, belief-aware student modeling.
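
摘要中的选择性翻转分数(SFS)衡量“针对性反馈下的翻转率相对于两类对照反馈的超出量”。下面是一个最小示意(将两类对照取平均是本文的假设,论文中的具体组合方式可能不同):

```python
def selective_flip_score(flips_targeted, flips_misaligned, flips_generic):
    """每个参数是布尔列表:该类反馈后模拟学生是否改变了答案。"""
    rate = lambda xs: sum(xs) / len(xs)
    control = 0.5 * (rate(flips_misaligned) + rate(flips_generic))
    return rate(flips_targeted) - control

# 一个忠实于误解的模拟器应主要在针对性反馈下翻转:
sfs = selective_flip_score(
    flips_targeted=[True, True, True, False],
    flips_misaligned=[False, False, True, False],
    flips_generic=[False, False, False, False],
)
```

论文发现现有模型的 SFS 接近零:无论反馈是否切中误解,它们都以相近的高比例改答,即“阿谀逢迎”的失败模式。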

[NLP-78] Scaling Laws for Mixture Pretraining Under Data Constraints

【速读】: 该论文试图解决语言模型预训练中目标数据稀缺时的混合训练优化问题:在目标数据(如低资源语言或专业领域)规模有限的情况下,如何平衡通用数据与稀缺目标数据的混合比例,以避免因目标数据过少导致的欠暴露或因目标数据重复过多导致的收益递减与过拟合。解决方案之关键在于系统性地揭示了重复(repetition)是影响目标域性能的核心驱动力,发现稀缺目标语料可以被安全重复使用15-20次,且其最优重复次数取决于目标数据规模、计算预算和模型参数量;进而,论文提出了一种考虑重复效应的混合缩放定律(repetition-aware mixture scaling law),该定律建模了重复目标词元的边际价值递减以及通用数据的正则化作用,通过优化该缩放定律可以基于数据约束条件、计算预算和模型规模来原理性地确定最优混合配置,从而为实际预训练提供可操作的混合策略建议。

链接: https://arxiv.org/abs/2605.12715
作者: Anastasiia Sedova,Skyler Seto,Natalie Schluter,Pierre Ablin
机构: Apple(苹果)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As language models scale, the amount of data they require grows – yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale. Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints.
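
以下草图并不是论文提出的缩放定律本身,而是一个体现“重复语料收益递减”思想的示意模型:每多重复一遍目标语料,其边际价值按几何级数衰减,因而有效词元量在摘要报告的 15-20 次重复附近趋于饱和(半衰期参数纯属假设):

```python
import math

def effective_unique_tokens(unique_tokens, repeats, half_life=15.0):
    """示意性的收益递减模型:第 r 遍重复的价值按 decay^r 衰减,
    有效词元量为几何级数之和。half_life 为假设的衰减参数。"""
    decay = math.exp(-1.0 / half_life)
    # sum_{r=0}^{repeats-1} decay^r
    return unique_tokens * (1 - decay ** repeats) / (1 - decay)

v1 = effective_unique_tokens(1e9, 1)
v20 = effective_unique_tokens(1e9, 20)
v100 = effective_unique_tokens(1e9, 100)
```

这种形式与数据受限预训练文献中常见的“重复词元贬值”建模思路一致:前几遍重复接近全值,之后迅速饱和。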

[NLP-79] Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

【速读】: 该论文试图解决现代语言模型中隐藏状态在逐层之间变化显著且现有逐层分析仅关注单一层面变化的问题,并探索如何利用逐层结构信息实现无需标签的模型选择与推理时层剪枝等实际部署决策。解决方案的关键在于提出了逐层表示动态(Layer-wise Representation Dynamics, LRD)框架,该框架包含三个互补的测量族:Frenet(用于全局子空间运动的Grassmann速度和曲率)、邻域保留得分(Neighborhood Retention Score, NRS,衡量局部最近邻保持程度)以及图过滤互信息(Graph Filtration Mutual Information, GFMI,量化各层与最终层的对齐程度)。通过将LRD应用于31个模型(包括编码器型、解码器型嵌入模型和基础大语言模型)在30个MTEB任务上的分析,研究发现端到端子空间位移(d₀,L)在模型选择中与下游性能相关性最强,而GFMI在层剪枝中是最有效的单一指导规则(仅Frenet在极轻预算下有效,NRS无法从模型选择迁移至剪枝),从而证明了逐层结构同时为模型解读与部署决策提供了有效信号。

链接: https://arxiv.org/abs/2605.12714
作者: Jingzhou Jiang,Yi Yang,Kar Yan Tam
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hidden states change substantially across the layers of modern language models, but most layer-wise analyses focus on one aspect of that change. We propose Layer-wise Representation Dynamics (LRD), a framework with three layer-wise measurement families: Frenet (Grassmann speed and curvature) for global subspace motion, Neighborhood Retention Score (NRS) for local nearest-neighbor retention, and Graph Filtration Mutual Information (GFMI) for alignment with the final layer. Applying LRD to 31 models (encoder-based and decoder-based embedders, plus base LLMs) on 30 MTEB tasks reveals architectural and task-level differences that are not apparent from final-layer representations alone. We then use LRD for two applications: label-free model selection and inference-time layer pruning. For selection, all three model-level scores correlate positively with downstream MTEB performance, with end-to-end subspace displacement (d_{0,L}) the strongest, and the same direction holds on a smaller base-LLM MMLU panel. For pruning, GFMI is the only measurement-guided rule that beats Random at the 15% and 20% budgets and has the best median change at every budget. Frenet is effective only at the lightest budget, while NRS does not transfer from model selection to pruning. These results show that layer-wise structure provides signal for both interpretation and deployment decisions.
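
摘要中的邻域保留类指标(NRS)可以理解为:相邻两层的隐藏状态中,每个样本的 k 近邻集合保留了多少。下面是一个基于余弦相似度的示意实现(仅说明思路,非论文的精确定义):

```python
import numpy as np

def neighborhood_retention(X_layer_a, X_layer_b, k=5):
    """两层隐藏状态 (n, d) 之间,逐点 k 近邻集合的平均保留比例。"""
    def knn(X):
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        sim = Xn @ Xn.T
        np.fill_diagonal(sim, -np.inf)      # 排除自身
        return np.argsort(-sim, axis=1)[:, :k]
    A, B = knn(X_layer_a), knn(X_layer_b)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(A, B)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
identical = neighborhood_retention(X, X.copy())          # 局部结构完全保留
shuffled = neighborhood_retention(X, rng.normal(size=(20, 8)))  # 局部结构被打乱
```

取值 1 表示局部邻域完全保留,越接近随机水平表示该层对局部结构的改写越剧烈。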

[NLP-80] All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs ICML2026

【速读】: 该论文试图挑战电路和结构发现(Circuit and Sheaf Discovery, CSD)领域中一个核心但隐含的假设——功能各向异性假说(Functional Anisotropy Hypothesis),即认为大语言模型中的每个功能都对应唯一或近乎唯一的内部机制。通过经验和理论证据,论文表明单个任务可以由多个结构不同但同时具备忠实性、稀疏性和完整性的电路或结构来支持。解决方案的关键在于提出了一种名为重叠感知的结构排斥(Overlap-Aware Sheaf Repulsion)方法,该方法通过对多次发现运行之间的结构重叠施加显式惩罚来增强CSD目标,从而能够系统性地揭示这些相互竞争的机制,使得发现的电路或结构在保持强任务性能的同时,具有极小的共享结构。

链接: https://arxiv.org/abs/2605.12671
作者: Xi Chen,Mingyu Jin,Jingcheng Niu,Yutong Yin,Jinman Zhao,Bangwei Guo,Dimitris N. Metaxas,Zhaoran Wang,Yutao Yue,Gerald Penn
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICML 2026

点击查看摘要

Abstract:In this paper, we present empirical and theoretical evidence against a central but largely implicit assumption in circuit and sheaf discovery (CSD), which we term the Functional Anisotropy Hypothesis: the idea that functions in large language models (LLMs) are localised to a unique or near-unique internal mechanism. We show that a single LLM task can instead be supported by multiple, structurally distinct circuits or sheaves that are simultaneously faithful, sparse, and complete. To systematically uncover such competing mechanisms, we introduce Overlap-Aware Sheaf Repulsion, a method that augments the CSD objective with an explicit penalty on structural overlap across multiple discovery runs, enabling the discovery of circuits or sheaves with strong task performance but minimal shared structure across a plethora of common CSD benchmarks. We find that this phenomenon becomes increasingly pronounced as the number of discovered sheaves grows and persists robustly across major CSD methods. We further identify an ultra-sparse three-edge sheaf and show that none of its edges is individually indispensable, undermining even weakened notions of canonical or essential components. To explain these findings, we propose a Distributive Dense Circuit Hypothesis and provide a theoretical analysis demonstrating that non-unique, low-overlap circuit explanations arise naturally from high-dimensional superposition under mild assumptions. Together, our results suggest that mechanistic explanations in LLMs are inherently non-canonical and call for a rethinking of how CSD results should be interpreted and evaluated.
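
摘要中的“重叠感知排斥(Overlap-Aware Sheaf Repulsion)”核心是在多次发现运行之间对共享结构施加显式惩罚。下面用边集合的两两交集大小给出一个最简示意(惩罚形式与系数均为假设,非论文的精确目标函数):

```python
def overlap_penalty(edge_sets, lam=1.0):
    """edge_sets: 每次发现运行得到的电路边集合列表。
    惩罚项 = lam * 所有两两交集的大小之和(示意性形式)。"""
    total = 0
    for i in range(len(edge_sets)):
        for j in range(i + 1, len(edge_sets)):
            total += len(edge_sets[i] & edge_sets[j])
    return lam * total

# 假设的边命名,仅作演示:
disjoint_runs = [{("h0", "h3"), ("h3", "out")}, {("h1", "h4"), ("h4", "out")}]
p_disjoint = overlap_penalty(disjoint_runs)
overlapping_runs = [{("h0", "h3"), ("h3", "out")}, {("h0", "h3"), ("h2", "out")}]
p_shared = overlap_penalty(overlapping_runs)
```

将此类惩罚加入 CSD 目标后,若仍能找到保持忠实性与稀疏性的低重叠电路,就构成了对功能各向异性假说的反例。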

[NLP-81] Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering

【速读】: 该论文试图解决语言模型在单轮个性化问答(personalized question answering, PQA)中无法有效推断并利用用户隐含意图(implicit user intent)的问题。现有方法依赖多轮对话上下文或丰富的用户画像,且未在推理过程中显式建模用户意图,导致其在单轮场景下效果不佳。解决方案的关键在于提出IAP(Intent-Aware Personalization)框架,该框架采用强化学习(reinforcement learning)训练模型,使其能够直接从单轮问题中推断用户的隐含意图,并通过一种标签模式(tag-based schema)将意图融入思维步骤(thinking steps),从而生成基于意图的个性化回答。具体而言,通过优化意图感知的回答轨迹(intent-aware answer trajectories)并设计个性化奖励函数(personalized reward function),IAP强化了那些将隐含用户意图显式化并生成更符合用户潜在目标的回答的生成路径。

链接: https://arxiv.org/abs/2605.12645
作者: Maryam Amirizaniani,Benjamin Charles Germain Lee,Jevin West,Nicholas Weber
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective personalized question answering (PQA) in language models requires grounding responses in the user’s underlying intent, where intent refers to the implicit “why” behind a query beyond its explicit wording. However, existing approaches to intent-aware personalization rely on multi-turn conversational context or rich user profiles, and do not explicitly model user intent during the reasoning process. This limits their effectiveness in single-turn settings, where the user’s latent goal must be inferred from minimal input and integrated into the thinking and reasoning process. To bridge this gap, we propose IAP (Intent-Aware Personalization), a reinforcement learning framework that trains models to infer implicit user intent directly from a single-turn question and incorporate it into thinking steps through a tag-based schema for generating personalized, intent-grounded answers. By optimizing intent-aware answer trajectories under a personalized reward function, IAP reinforces generation paths that make implicit user intent explicit and produce responses that better align with the user’s underlying goal. Through experiments on the LaMP-QA benchmark across six models, IAP consistently outperforms all baselines, achieving an average macro-score gain of around 7.5% over the strongest competitor, demonstrating that modeling implicit user intent within the training objective is a promising direction for PQA.

[NLP-82] DocAtlas: Multilingual Document Understanding Across 80 Languages

【速读】: 该论文试图解决低资源语言(low-resource languages)在多语言文档理解中因训练数据稀缺和基于模型的标注管道(model-based annotation pipelines)固有偏见而导致性能受限的问题。其解决方案之关键在于提出了 DocAtlas 框架,该框架通过双管道(dual pipelines)——即对原生 DOCX 文档的差分渲染(differential rendering)以及对从右向左脚本(right-to-left scripts)的合成 LaTeX 生成——构建高保真 OCR 数据集与基准,覆盖 82 种语言和 9 个评估任务。所有标注均以统一的 DocTag 格式编码布局、文本和组件类型,且核心标注过程不依赖任何学习模型。进一步地,该方案利用渲染得到的真实数据作为正信号,采用直接偏好优化(Direct Preference Optimization, DPO)实现稳定的多语言适应,在避免监督微调(supervised fine-tuning)导致基语言性能退化的前提下,同时提升域内(+1.9%)和域外(+1.8%)准确率,其最佳变体 DocAtlas-DeepSeek 相较于最强基线提升了 1.7%。

链接: https://arxiv.org/abs/2605.12623
作者: Ahmed Heakl,Youssef Mohamed,Abdullah Sohail,Rania Elbadry,Ahmed Nassar,Peter W. J. Staar,Fahad Shahbaz Khan,Imran Razzak,Salman Khan
机构: MBZUAI(穆罕默德·本·扎耶德人工智能大学); IBM Research(IBM研究院)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under submission

点击查看摘要

Abstract:Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, where supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline.
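
摘要中用于多语言适应的直接偏好优化(DPO),其成对损失的标准形式可写成如下草图:以策略模型与参考模型的对数概率差构造偏好间隔,再取负对数 sigmoid。数值与 β 取值仅为演示,与论文的具体设置无关:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """单个偏好对的 DPO 损失:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))。"""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# 偏好样本(此处即渲染得到的真实标注)相对参考模型更可能 -> 损失较小
good = dpo_loss(-10.0, -14.0, -12.0, -12.0)
bad = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```

与监督微调不同,DPO 的更新始终以参考模型为锚点,这与摘要中“无可测量的基语言退化”这一观察在直觉上是一致的。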

[NLP-83] In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

【速读】: 该论文试图解决现有大语言模型(Large Language Model, LLM)公平性评估中“标准化测试-问答基准(standardized-test QA benchmarks)”范式的结构不可靠问题:表面性的提示构建选择(与所测公平性正交)却主导了评分方差,导致公平性结论的方向和幅度偏移,并造成模型排名严重不一致。解决方案的关键在于提出MAC-Fairness,一个多智能体对话框架(multi-agent conversational framework),通过将受控变异因素嵌入多轮对话中进行原位行为评估(in-situ behavior evaluation),即把标准化测试问题重新用作对话种子而非评估工具,从自我视角评估立场坚持(position persistence),从他人视角评估同伴接受度(peer receptiveness)。该框架基于800万条跨多模型和身份配置的对话转录本,揭示了稳定且模型特定的行为特征,这些特征能够泛化至不同公平性目标和评估方法的基准中,从而提供标准化测试范式无法给出的行为级证据。

链接: https://arxiv.org/abs/2605.12530
作者: Zeyu Tang,Sang T. Truong,Deonna Owens,Shreyas Sharma,Yibo Jacky Zhang,Brando Miranda,Sanmi Koyejo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test QA benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both the direction and the magnitude, and result in severe discordance in model rankings. We develop MAC-Fairness, a multi-agent conversational framework that embeds controlled variation factors into multi-round dialogue for in-situ behavior evaluation, examining how models’ conversational behavior shifts when identity is varied as part of natural multi-agent interaction. Repurposing standardized-test questions as conversation seeds rather than as the evaluation instrument, we evaluate position persistence (how they hold positions, from the self-perspective) and peer receptiveness (how receptive they are to peers, from the other-perspective) across 8 million conversation transcripts spanning multiple models and identity presence configurations. In-situ behavioral evaluation reveals stable, model-specific behavioral signatures that could generalize across benchmarks differing in fairness targets and evaluation methodologies, a form of evidence the standardized-test paradigm does not offer.

[NLP-84] PERCEIVE: A Benchmark for Personalized Emotion and Communication Behavior Understanding on Social Media

【速读】: 该论文试图解决当前社交媒体情感分析以作者为中心(author-centric)的范式缺陷,即未能捕捉不同读者对相同内容的主观情感反应,忽视了个人感知、交流行为与社交网络之间的内在关联。解决方案的关键在于提出了PERCEIVE——一个双语(英语和中文)大规模基准(benchmark),首次整合了五个关键维度:作者生成的内容、真实读者的情感反馈(来源于读者评论)、交流行为、用户属性以及社交图。通过从读者评论中标注情感并同步捕获交流意图,PERCEIVE实现了向个性化、以读者为中心(reader-centric)的范式转变,为建模情感与行为在社交语境中的耦合提供了独特资源,并基于全面的评估协议揭示了现有方法(包括具备高级推理增强的大语言模型)在处理这一多面用户感知任务时的显著不足。

链接: https://arxiv.org/abs/2605.12525
作者: Jian Liao,Yujin Zheng,Suge Wang,Jianxing Zheng,Deyu Li
机构: School of Computer and Information Technology, Shanxi University, China (中国山西大学计算机与信息技术学院); Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, China (中国山西大学教育部计算智能与中文信息处理重点实验室); Joint Laboratory of Tourism Big Data in Shanxi Province, China (中国山西省旅游大数据联合实验室)
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current emotion analysis in social media is predominantly author-centric, failing to capture the subjective nature of emotional responses across diverse readers. This paradigm overlooks the crucial link between individual perception, communication behavior, and the underlying social network. To bridge this gap, we introduce PERCEIVE, a novel bilingual (English and Chinese) large-scale benchmark that, to the best of our knowledge, is the first to integrate five critical dimensions for social perception: author-created content, genuine readers’ emotional feedback (derived from their comments), communication behavior, user attributes, and the social graph. This benchmark enables a paradigm shift towards truly personalized, reader-centric analysis, where different readers’ emotional responses to the same content are naturally captured through their real-world interactions. By annotating emotions from reader comments and synchronously capturing communication intent, PERCEIVE provides a unique resource to model the intrinsic coupling between emotion and behavior, grounded in social context. We establish a comprehensive evaluation protocol, testing state-of-the-art methods, including large language models (LLMs) with advanced reasoning enhancement. Our findings reveal significant shortcomings in existing approaches when handling this multifaceted, user-aware task. PERCEIVE offers a foundational resource and clear direction for future research in socially-intelligent NLP, pushing models towards a more unified understanding of emotion on social media.

[NLP-85] Differences in Text Generated by Diffusion and Autoregressive Language Models

【速读】: 该论文旨在探究扩散语言模型(Diffusion Language Models, DLMs)与自回归语言模型(Autoregressive Language Models, ARMs)在生成文本上的内在差异,具体表现为DLMs生成的文本具有更低的n-gram熵、更高的语义连贯性及更高的语义多样性,并揭示这些差异的成因。解决方案的关键在于通过控制实验解耦训练目标与解码算法的影响:实验表明,DLM训练目标(特别是双向上下文)主要贡献于语义连贯性和多样性的提升,而对熵影响较小;熵的降低则源自DLMs的解码算法,尤其是基于置信度的重掩码策略(confidence-based remasking)。此外,论文为熵降低现象提供了理论解释,从而为未来DLM训练目标和解码算法的设计提供了依据。

链接: https://arxiv.org/abs/2605.12522
作者: Zeyang Zhang,Chengwei Liang,Xingyan Chen,Meiqi Gu,Minrui Luo,Jingzhao Zhang,Tianxing He
机构: Shanghai Qi Zhi Institute (上海期智研究院); Institute for Interdisciplinary Information Sciences, Tsinghua University (清华大学交叉信息研究院); Xiongan AI Institute (雄安人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion language models (DLMs) are promising alternatives to autoregressive language models (ARMs), yet the intrinsic differences in their generated text remain underexplored. We first find empirically that off-the-shelf DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity. To understand the cause, we conduct controlled experiments that decouple the effects of training objectives and decoding algorithms. Results suggest that the DLM training objective contributes to the increases in semantic coherence and semantic diversity, but has a minor influence on entropy. These differences are primarily driven by the bidirectional context; other components in the training objective, such as input masking, label masking, and the weighting function, have a much weaker influence. Further, our experiments demonstrate that the reduction in entropy stems from DLMs’ decoding algorithms, particularly confidence-based remasking strategies. We provide a theoretical understanding for this entropy reduction phenomenon. Together, our work uncovers key mechanisms underlying the differences between DLMs and ARMs in text generation, and informs future design of training objectives and decoding algorithms in DLMs.
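
摘要中比较两类模型的核心统计量之一是 n-gram 熵。下面是其标准计算的最小实现(对生成文本的经验 n-gram 分布取香农熵,单位为比特;示例句子为演示用):

```python
import math
from collections import Counter

def ngram_entropy(tokens, n=2):
    """生成文本经验 n-gram 分布的香农熵(比特);
    值越低表示文本越重复,对应摘要中 DLM 的观察。"""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

repetitive = "the cat sat the cat sat the cat sat".split()
varied = "the quick brown fox jumps over a lazy dog".split()
h_rep = ngram_entropy(repetitive)
h_var = ngram_entropy(varied)
```

若某种解码策略(如基于置信度的重掩码)系统性偏向高置信 token,便会像 repetitive 示例那样压低该熵值。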

[NLP-86] ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues

【速读】: 该论文旨在解决大型语言模型(LLM)作为自主智能体时所需的多轮工具调用(multi-turn tool calling)能力中,现有合成数据生成方法产生非真实对话的问题——具体表现为工具链仅表面兼容而与用户任务脱节、一次性对话生成导致参数幻觉(parameter hallucination)以及多步工具交互严重不足。解决方案的关键在于提出的ToolWeave框架:通过构建具有内置依赖(built-in dependencies)的工具,并基于用户目标对齐筛选工作流(workflows),同时引入细粒度规划阶段(fine-grained planning stage)显式追踪参数来源(parameter provenance),从而生成包含更多(45%)多步工具交互且显著减少参数和工具名称幻觉的合成对话数据,最终使微调后的LLM在多个公开基准上一致超越先前方法。

链接: https://arxiv.org/abs/2605.12521
作者: Dinesh Khandelwal,Gnana Prakash Punnavajhala,GPS Bhargav,Gaurav Pandey,Sachin Joshi,Hima Karanam,Dinesh Raghu
机构: IBM Research(IBM研究院); IIIT Hyderabad(海德拉巴国际信息技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-turn tool calling is essential for LLMs to function as autonomous agents, yet synthesizing the training data required for these capabilities remains a fundamental challenge. Existing synthetic data generation pipelines often produce unrealistic dialogues for two reasons: they chain tools that are only superficially compatible rather than aligned with meaningful user tasks, and they generate dialogues in one shot, which often introduces arguments that were neither provided by the user nor produced by prior tool calls. These issues also lead to a severe underrepresentation of multi-step tool interactions. We introduce ToolWeave, a structured framework for synthesizing realistic multi-turn tool-calling dialogues. ToolWeave supports realistic multi-step workflows (or tool sequences) by constructing tools with built-in dependencies and filters the workflows based on alignment with user goals. It reduces parameter hallucination by using a fine-grained planning stage that explicitly tracks parameter provenance. As a result, ToolWeave-generated synthetic dialogues contain more multi-step tool interactions (45%) and fewer hallucinations in parameters and tool names. Consequently, LLMs fine-tuned on ToolWeave consistently outperform those fine-tuned on prior datasets across three public benchmarks. Notably, Llama-3.1-70B fine-tuned on ToolWeave achieves 39.75% on BFCL-V3 multi-turn, compared to 23.50% when fine-tuned on SOTA ToolFlow data.

[NLP-87] BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agent ic Reasoning and Constraint-Aware Calibration

【速读】: 该论文试图解决现有分类体系归纳(taxonomy induction)方法在零样本(zero-shot)和大规模场景下泛化能力不足、结构可靠性差及效率低下的问题。解决方案的关键是提出了BoostTaxo——一种采用boosting策略的大型语言模型(LLM)框架,通过粗到细(coarse-to-fine)的父节点识别流程,融合检索增强的定义精炼(retrieval-augmented definition refinement)、混合父候选选择(hybrid parent candidate selection)、候选评级(candidate rating)以及结构感知评分校准(structure-aware score calibration)四个模块:其中轻量级LLM高效筛选候选父节点,大规模LLM对候选进行排序和打分,并利用结构特征校准边权重以提升诱导结果的结构可靠性。混合候选选择机制与结构感知校准是该框架的核心创新。

链接: https://arxiv.org/abs/2605.12520
作者: Yancheng Ling,Zhenlin Qin,Leizhen Wang,Zhenliang Ma
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures

点击查看摘要

Abstract:Taxonomy induction is crucial for organizing concepts into explicit and interpretable semantic hierarchies. While existing methods have achieved promising results, their generalization, structural reliability, and efficiency remain limited, hindering their performance in zero-shot and large-scale scenarios. To overcome these limitations, we introduce BoostTaxo, a boosting-style LLM framework for zero-shot taxonomy induction. It takes a set of domain terms as inputs and performs parent identification in a coarse-to-fine manner, employing retrieval-augmented definition refinement, hybrid parent candidate selection, candidate rating, and structure-aware score calibration to improve taxonomy construction. Specifically, a lightweight LLM is used to efficiently filter candidate parents, while a large-scale LLM is employed to rank and score candidate parents for fine-grained parent selection. Structural features are further incorporated to calibrate candidate edge weights and enhance the reliability of the induced taxonomy. The unified BoostTaxo is evaluated on three public benchmark datasets, namely WordNet, DBLP, and SemEval-Sci, and achieves superior or comparable performance to state-of-the-art methods in zero-shot taxonomy induction. The ablation study validates the contribution of the hybrid parent candidate selection and the structure-aware score calibration to the overall performance. Further analysis investigates the impact of candidate selection size on taxonomy quality and presents representative case and failure studies, providing deeper insights into the effectiveness and limitations of the proposed framework.
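上述"轻量 LLM 粗筛、大规模 LLM 精排"的粗到细父节点识别流程,可抽象为如下两阶段结构示意(纯 Python,打分函数均为占位假设,与论文实现无关):

```python
def coarse_to_fine_parent(term, candidates, cheap_score, strong_score,
                          shortlist_size=3):
    """先用低成本打分筛出候选父节点短名单, 再用高成本打分精排取最优。
    cheap_score/strong_score 为占位函数, 分别模拟轻量 LLM 与大规模 LLM。"""
    shortlist = sorted(candidates, key=lambda c: cheap_score(term, c),
                       reverse=True)[:shortlist_size]
    return max(shortlist, key=lambda c: strong_score(term, c))

def prefix_overlap(a, b):
    """占位的低成本打分: 以共享前缀长度粗略模拟语义相关度。"""
    return sum(1 for x, y in zip(a, b) if x == y)

# 虚构示例: 为术语 "neural network" 挑选父节点
best = coarse_to_fine_parent(
    "neural network", ["machine learning", "neural computation", "biology"],
    cheap_score=prefix_overlap,
    strong_score=lambda t, c: len(set(t.split()) & set(c.split())),
)
```

实际系统中 cheap_score/strong_score 由两个规模不同的 LLM 打分接口实现,短名单大小即论文消融分析中讨论的候选选择规模;结构感知校准可在精排打分之后对边权再做一次加权。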

[NLP-88] Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

【速读】: 该论文试图解决在训练语言模型时,仅基于最终结果的可验证奖励强化学习(RL)虽然能提升任务准确率,却会导致推理质量严重下降(包括推理不准确、不完整甚至内部不一致)的问题。解决方案的关键在于提出了一种可验证过程监督(Verifiable Process Supervision, VPS)框架,通过监督微调诱导结构化推理格式,从而能够以语法方式提取中间声明,并对照真实标注(ground-truth)信号形成过程级奖励,同时引入自适应奖励加权(adaptive reward weighting)以优先处理剩余误差最大的推理子任务,形成隐式课程,最终联合优化预测准确性与推理质量,在保持准确率的同时显著改善推理的可靠性与一致性。

链接: https://arxiv.org/abs/2605.12519
作者: Kyuyoung Kim,Kevin Wang,Yunfei Xie,Peiyang Xu,Peiyao Sheng,Chen Wei,Zhangyang Wang,Jinwoo Shin,Pramod Viswanath,Sewoong Oh
机构: KAIST AI (韩国科学技术院人工智能); University of Texas, Austin (德克萨斯大学奥斯汀分校); Rice University (莱斯大学); Princeton University (普林斯顿大学); University of Washington (华盛顿大学); Sentient Labs (Sentient实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Training language models to produce both correct answers and sound reasoning remains an open challenge. Reinforcement learning with verifiable rewards typically optimizes only final outcomes, which can lead to a failure mode where task accuracy improves while reasoning becomes less accurate, less complete, or even internally inconsistent. We propose verifiable process supervision (VPS), a post-training framework for verifiable domains that jointly optimizes prediction accuracy and reasoning quality. We first apply supervised fine-tuning to induce a structured reasoning format, enabling syntactic extraction of intermediate claims that are evaluated against ground-truth signals to form process-level rewards. To address the heterogeneous difficulty of reasoning subtasks, we introduce adaptive reward weighting that prioritizes components with the largest remaining errors, creating an implicit curriculum. We evaluate VPS on chess, a controlled testbed where reasoning steps can be deterministically verified against engine signals. While accuracy-only RL improves move accuracy, it sharply degrades reasoning quality, increasing win-rate error by up to 112% and reducing internal consistency by up to 69%. In contrast, VPS preserves accuracy while significantly improving reasoning quality, reducing win-rate error by up to 30% and restoring consistency to near saturation. At matched accuracy, judge evaluation also prefers the process-supervised models. A reasoning-space analysis further shows that, without a structured prior, accuracy-only RL converges to budget-dependent shortcuts rather than sound multi-step reasoning. These results show that VPS enables language models to reason both accurately and reliably in verifiable domains.

[NLP-89] TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

【速读】: 该论文旨在解决从海量非结构化在线新闻内容中自动提取结构化时间线(Timeline Summarization, TLS)的挑战,克服现有大语言模型(LLM)方法仅作为被动生成器的局限性。解决方案的关键在于引入大型推理模型(LRM)的主动推理能力,通过提出一个名为TimelineReasoner的两阶段框架,将TLS从静态生成转变为推理驱动的动态过程。该框架包含两个核心阶段:全局认知(Global Cognition)从宏观层面持续追踪事件并维护全局事件记忆;细节探索(Detail Exploration)通过识别信息缺口并借助目标文档检索来细化时间线。支撑该框架的专门机制包括事件抓取器(Event Scraper)用于获取时间事件描述、时间线更新器(Timeline Updater)用于迭代修正时间线,以及监督器(Supervisor)用于检测缺失事件并引导检索。这种方法使得模型能够主动执行迭代证据获取、缺失事件检测和时间一致性验证,从而在开放域TLS数据集上显著提升了时间线的准确性、覆盖率和连贯性。

链接: https://arxiv.org/abs/2605.12518
作者: Liancheng Zhang,Xiaoxi Li,Zhicheng Dou
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of online news poses a challenge to extracting structured timelines from unstructured content. While recent studies have shown that Large Language Models (LLMs) can assist Timeline Summarization (TLS), these approaches primarily treat models as passive generators. The emergence of Large Reasoning Models (LRMs) presents an opportunity to reason over events actively, enabling iterative evidence acquisition, the detection of missing events, and the validation of temporal consistency. To systematically leverage the reasoning capabilities of LRMs, we propose TimelineReasoner, a novel framework that shifts TLS from static generation to an active, reasoning-driven process. Unlike prior work, TimelineReasoner adopts a two-stage framework: Global Cognition, which tracks events at a macroscopic level and continuously updates a global event memory, and Detail Exploration, which identifies informational gaps and refines the timeline via targeted document retrieval. To support this, TimelineReasoner incorporates several specialized mechanisms, including an Event Scraper for retrieving temporal event descriptions, a Timeline Updater for refining the timeline, and a Supervisor for detecting gaps in the timeline and guiding retrieval. Experimental results on open-domain TLS datasets demonstrate that TimelineReasoner significantly outperforms existing LLM-based TLS methods in terms of timeline accuracy, coverage, and coherence. On closed-domain TLS datasets, our method performs on par with or exceeds state-of-the-art approaches. This work not only pushes the boundaries of TLS but also highlights the broader potential of LRM-based reasoning frameworks for timeline summarization.
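上面的两阶段流程(全局认知 + 细节探索)本质上是一个"检索—更新—检缺口—再检索"的控制循环。以下给出一个与论文无关的结构性示意(各组件均为占位实现,仅说明流程骨架):

```python
def build_timeline(query, search, max_rounds=3):
    """TimelineReasoner 两阶段流程的结构示意(占位实现, 非官方代码)。
    search(q) 返回若干 (日期, 事件描述) 对, 模拟 Event Scraper 的检索。"""
    timeline = {}                                 # 全局事件记忆
    for date, desc in search(query):              # 阶段一: 全局认知
        timeline[date] = desc                     # Timeline Updater: 写入/修正
    for _ in range(max_rounds):                   # 阶段二: 细节探索
        gaps = supervisor_find_gaps(timeline)     # Supervisor: 检测缺口
        if not gaps:
            break
        for gap_query in gaps:                    # 针对缺口的定向检索
            for date, desc in search(gap_query):
                timeline.setdefault(date, desc)
    return sorted(timeline.items())

def supervisor_find_gaps(timeline):
    """占位 Supervisor: 若相邻年份跨度过大, 则生成补充查询。"""
    years = sorted(int(d[:4]) for d in timeline)
    return [f"events in {a+1}-{b-1}"
            for a, b in zip(years, years[1:]) if b - a > 1]
```

实际框架中 search 由检索器与 LRM 推理共同驱动,Supervisor 的缺口检测也由模型推理而非启发式规则完成;此处仅示意循环结构。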

[NLP-90] Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models ICLR2026

【速读】: 该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在仅接收文本输入时出现的性能显著下降——表现为准确率大幅降低与校准严重失调——的问题。研究发现,即便文本描述保留了完整的语义信息,模型置信度依然不可靠,而通过生成图像引入视觉信号可部分恢复准确率与校准性能。解决方案的关键在于提出潜在想象模块(Latent Imagination Module, LIM),这是一个轻量级的交叉注意力模块,能够从文本输入预测出潜在的想象嵌入(imagined latent embeddings),并将其送入冻结的VLM骨干网络,而无需进行像素级的图像合成。LIM通过在潜在空间完成缺失的视觉模态,在不生成实际图像的前提下提升了文本仅输入场景下的准确率,并降低了校准误差。

链接: https://arxiv.org/abs/2605.12517
作者: Mingyeong Kim,Jungwon Choi,Chaeyun Jang,Juho Lee
机构: Kim Jaechul Graduate School of AI, KAIST (金载彻人工智能研究生院, 韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 16 figures. Accepted at the ICLR 2026 Workshop on Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety across Modalities

点击查看摘要

Abstract:Vision-language models (VLMs) are often deployed on text-only inputs, although they are trained with images. We find that removing the vision modality causes large drops in accuracy and severe miscalibration, and the model does not behave like its original language backbone under text-only prompting. This failure is not explained only by missing semantic information. Even when text descriptions preserve key content, confidence becomes unreliable, while adding a visual signal through generated images partially restores accuracy and calibration. We propose the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input and feeds them into a frozen VLM backbone without pixel-level image synthesis. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM improves accuracy and reduces calibration error. These results suggest that latent modality completion is a practical approach for reliable VLM inference under missing-modality.

[NLP-91] Domain Adaptation of Large Language Models for Polymer-Composite Additive Manufacturing Using Retrieval-Augmented Generation and Fine-Tuning

【速读】: 该论文旨在解决通用大语言模型(LLMs)在增材制造(Additive Manufacturing, AM)等专业工程领域中,因缺乏领域知识和对结构化技术知识的接触不足而导致的回答准确性、相关性和可用性低下的问题。解决方案的关键在于采用检索增强生成(Retrieval-Augmented Generation, RAG)策略,即通过从精心构建的AM语料库的向量数据库中检索相关文档片段来增强模型回答,而非对原始非结构化文本进行领域微调。实验表明,RAG模型在专家设计的200个AM问题上,准确性、相关性和总体偏好均显著优于基线模型(分别达到75.5%、90.8%和85.2%),而直接微调反而降低了性能,从而证明RAG是适应性更强的路径。

链接: https://arxiv.org/abs/2605.12516
作者: Saiful Islam Sagor,Tania Haghighi,Minhaj Nur Alam,Erina Baynojir Joyee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:General-purpose large language models (LLMs) often struggle to generate reliable responses in specialized engineering domains due to limited domain grounding and insufficient exposure to structured technical knowledge. This study investigates practical strategies for adapting a foundation LLM to the additive manufacturing (AM) domain in order to improve answer accuracy, relevance, and usability for expert-level question answering. AM knowledge is distributed across heterogeneous sources such as academic literature, manufacturer documentation, technical standards, and procedural guides. Although general LLMs demonstrate strong linguistic capabilities, they frequently fail to retrieve and contextualize such domain-specific information. Two common approaches to address this limitation are domain-specific fine-tuning and retrieval-augmented generation (RAG). We construct a curated AM corpus and evaluate three configurations based on LLaMA-3-8B: (1) the pretrained baseline model, (2) a RAG system that retrieves relevant document chunks from a vector database, and (3) a model fine-tuned on raw domain text. Performance is evaluated using 200 expert-designed AM questions assessed by mechanical engineering experts for accuracy, relevance, and overall preference. Results show that the RAG model consistently outperforms the baseline. Among the 200 questions, 75.5% of RAG responses are judged more accurate, 85.2% are preferred overall, and 90.8% are rated more relevant than baseline responses. In contrast, fine-tuning on raw AM text reduces performance, producing more accurate answers in only 5.6% of cases and more relevant answers in 32.5% of cases. These results indicate that retrieval-augmented approaches provide a more effective pathway for adapting LLMs to specialized engineering domains than naive fine-tuning on unstructured technical data.
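摘要中 RAG 配置的核心步骤是从向量数据库中检索与问题最相近的文档片段,再交给模型生成回答。下面用纯 Python 给出该检索步骤的最小余弦相似度示意(语料与嵌入均为虚构示例,并非论文所用的 LLaMA-3-8B 流程):

```python
import math

def cosine(a, b):
    """两个向量的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_top_k(query_vec, chunks, k=2):
    """chunks: [(文本, 嵌入向量)], 按与查询的余弦相似度降序取前 k 个。"""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# 虚构的小型增材制造语料: 实际系统中嵌入由编码器产生并存入向量数据库
corpus = [
    ("FDM 打印温度指南", [0.9, 0.1, 0.0]),
    ("树脂固化安全规范", [0.1, 0.8, 0.2]),
    ("层间粘结失效分析", [0.7, 0.2, 0.3]),
]
top = retrieve_top_k([1.0, 0.2, 0.1], corpus, k=2)
```

检索到的前 k 个片段随后被拼入提示词,供生成模型在作答时引用,这正是摘要中 RAG 相对于直接微调的优势来源:知识以可检索证据的形式外置,而非压缩进模型参数。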

[NLP-92] Mitigating Cross-Lingual Cultural Inconsistencies in LLM s via Consensus-Driven Preference Optimisation

【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, MLLMs)在提示语言变化时表现出的跨语言文化不一致性(Cross-lingual Cultural Inconsistency)问题。具体而言,当模型被赋予固定用户身份(如英国人)后,提示语言会覆盖系统角色设置,导致对不同语言版本中相同知识查询(如文学作者)输出不同文化背景的答案。论文提出的解决方案之关键是引入基于共识的对齐框架——跨语言文化一致偏好优化(Cross-lingual Cultural Consistent Preference Optimisation, C-3PO),该框架通过优化模型输出在不同语言下的文化偏好一致性,显著提升度量指标Singleton Fleiss’s κ_S(一种对幻觉具有数学鲁棒性的跨语言文化不一致性量化指标)达0.10个绝对点。此外,论文还通过层间可解释性分析揭示了机制:MLLMs在正向传播的中间层表示稳定化过程中,会隐式地将输出个性化到提示语言的刻板文化。

链接: https://arxiv.org/abs/2605.12515
作者: Lucas Resck,Isabelle Augenstein,Anna Korhonen
机构: University of Cambridge (剑桥大学); University of Copenhagen (哥本哈根大学)
类目: Computation and Language (cs.CL)
备注: 22 pages, 13 figures, 9 tables

点击查看摘要

Abstract:Despite their impressive capabilities, multilingual large language models (MLLMs) frequently exhibit inconsistent behaviour when the prompt’s language changes. While such adaptation is generally desirable, it becomes a critical failure when a user’s identity is explicitly defined. For instance, given a fixed British persona and an ambiguous everyday knowledge query about literature, the prompt’s language frequently overwrites the system persona – yielding Shakespeare in English but Cervantes in Spanish. To robustly quantify this Cross-lingual Cultural Inconsistency, we introduce Singleton Fleiss’s \kappa_S , a metric mathematically resilient to hallucinations. For mitigation, we propose Cross-lingual Cultural Consistent Preference Optimisation (C-3PO), a consensus-driven alignment framework. C-3PO achieves up to a 0.10-point absolute increase in \kappa_S over unaligned models, outperforming strong prompting and representation steering baselines. Empirical evaluations show this inconsistency disproportionately affects lower-resource languages like Indonesian and Persian. A layer-wise interpretability analysis reveals the underlying mechanism: by early-decoding intermediate layer representations, we find that MLLMs implicitly personalise outputs towards the prompt language’s stereotypical culture as forward-pass representations stabilise.

[NLP-93] WhatsApp Vaccine Discourse (WhaVax): An Expert-Annotated Dataset and Benchmark for Health Misinformation Detection AAAI

【速读】: 该论文试图解决在加密通信环境(如WhatsApp)中缺乏高质量、专家标注的疫苗相关错误信息数据集的问题,以便支持错误信息研究与计算建模。解决方案之关键在于构建了一个严格且精心设计的管道:通过基于关键词的数据收集、语义去重(semantic deduplication)以去除近似重复内容,以及由医学专家执行的多阶段标注协议(multi-stage annotation protocol),最终产出一个具有高度注释者间一致性(inter-annotator agreement)的黄金标准语料库(gold-standard corpus)。此外,该研究还通过详细刻画WhatsApp错误信息的语言、结构、词汇、时间和群组层面特征,并基准测试了经典模型、微调小语言模型(fine-tuned Small Language Models)以及零样本或少样本大语言模型(zero- or few-shot Large Language Models),验证了强嵌入(strong embeddings)和LLM方法在数据稀缺约束下的竞争力,同时强调了领域对齐和数据可用性的关键作用。

链接: https://arxiv.org/abs/2605.12510
作者: Jônatas H. dos Santos,Julio C. S. Reis,Philipe Melo,João F. H. Olivetti,Thales H. Silva,Matheus Gontijo Guimaraes,Glaucio de Souza,Marcos A. Gonçalves,Fabricio Benevenuto,Filipe B. B. Zanovello,Marco A. G. Rodrigues,Cristiano X. Lima
机构: 未知
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 10 pages. This is a preprint version of a paper accepted for the International AAAI Conference on Web and Social Media (ICWSM’26). Please cite the conference version rather than this preprint

点击查看摘要

Abstract:We introduce WhaVax, a new expert-annotated dataset of vaccine-related WhatsApp messages collected from large Brazilian public groups spanning multiple pandemic years. The dataset was constructed through a rigorous, carefully designed pipeline that integrates keyword-based data collection, semantic deduplication to remove near-duplicate content, and a multi-stage annotation protocol conducted by medical specialists. This process produced a high-quality gold-standard corpus, characterized by substantial inter-annotator agreement and strong reliability for downstream analysis. Additionally, we provide a detailed characterization of WhatsApp misinformation, revealing distinctive linguistic, structural, lexical, temporal, and group-level patterns, as well as a meaningful layer of ambiguous cases that reflect the complexity of health discourse in private messaging. We also benchmark classical models, fine-tuned Small Language Models, and zero- or few-shot Large Language Models under realistic data-scarcity constraints, demonstrating that strong embeddings and LLM approaches perform competitively, while domain alignment and data availability remain critical factors. This study provides a rare, high-quality resource to support misinformation research and computational modeling in encrypted communication environments.

[NLP-94] LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information NEURIPS2026

【速读】: 该论文试图解决的问题是:在上下文信息不完整或退化的情况下,如何评估大型语言模型(Large Language Models, LLMs)生成答案的不确定性,以确保不确定性度量能正确反映缺失信息量,从而符合多重插补(Multiple Imputation, MI)文献中"不确定性应与缺失信息量成比例"的标准。解决方案的关键在于:提出使用响应熵(response entropy)作为答案级别的不确定性度量,通过实验证明在上下文移除时,熵会随之增加,而传统的基于采样的置信度(sampling-based confidence)则无法反映准确性下降,仍保持较高水平;同时,引入一种黑盒诊断方法 ρ_R(α),通过在有上下文和无上下文情况下重复采样,估计由特定上下文水平 α 所解决的基线不确定性的比例。

链接: https://arxiv.org/abs/2605.13188
作者: Stef van Buuren
机构: TNO - Netherlands Organization for Applied Scientific Research, Leiden (荷兰应用科学研究组织,莱顿); Dept. of Methodology and Statistics, University of Utrecht (乌得勒支大学方法论与统计系)
类目: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)
备注: 9 pages, 3 figures, 2 tables, NeurIPS 2026 position paper

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in settings where the available context is incomplete or degraded. We argue that an LLM generating answers under incomplete context can be viewed as an implicit imputer, and evaluated against a criterion from the multiple imputation (MI) literature: uncertainty should scale with the amount of missing information. We assess this criterion on SQuAD, using a controlled framework in which context availability is varied across five levels. We evaluate two answer-level uncertainty measures that can be estimated from repeated sampling: sampling-based confidence (empirical mode frequency) and response entropy. Confidence fails to reflect increasing missingness: it remains high even as accuracy collapses. Entropy, by contrast, increases with context removal, consistent with the MI analogy, and explains substantially more variance in accuracy than confidence across all evidence levels (quadratic R^2 gap up to 0.057). We further introduce a black-box diagnostic \rho_R(\alpha) that estimates the proportion of baseline uncertainty resolved by context level \alpha , requiring only repeated sampling with and without context. These results suggest that entropy is a more responsive black-box uncertainty measure than confidence under incomplete context.
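文中两种答案级不确定性度量——经验众数频率置信度与响应熵——都可以从对同一问题的重复采样中直接估计。以下是一个最小的纯 Python 示意(答案样本为虚构示例):

```python
import math
from collections import Counter

def confidence_and_entropy(samples):
    """由同一问题的多次采样答案估计:
    - 置信度: 经验众数频率 (empirical mode frequency)
    - 响应熵: 答案经验分布的香农熵 (以 nat 为单位)"""
    counts = Counter(samples)
    n = len(samples)
    probs = [c / n for c in counts.values()]
    confidence = max(probs)                          # 众数频率
    entropy = -sum(p * math.log(p) for p in probs)   # 香农熵
    return confidence, entropy

# 虚构示例: 上下文完整时答案高度一致, 移除上下文后答案发散
full_ctx = ["Paris"] * 9 + ["Lyon"]
no_ctx = ["Paris", "Lyon", "Rome", "Paris", "Berlin",
          "Madrid", "Lyon", "Oslo", "Rome", "Bern"]
c1, h1 = confidence_and_entropy(full_ctx)
c2, h2 = confidence_and_entropy(no_ctx)
assert c1 > c2 and h1 < h2   # 熵随缺失信息增加而上升
```

这正对应论文的核心观察:当上下文被移除时,熵这一度量会随缺失信息量上升,而置信度可能在准确率崩塌时仍然虚高。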

信息检索

[IR-0] VectorSmuggle: Steganographic Exfiltration in Embedding Stores and a Cryptographic Provenance Defense

链接: https://arxiv.org/abs/2605.13764
作者: Jascha Wanger
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 47 pages, 3 figures. Reference implementations: this https URL and this https URL

点击查看摘要

Abstract:Modern retrieval-augmented generation (RAG) systems convert sensitive content into high-dimensional embeddings and store them in vector databases that treat the resulting numerical artifacts as opaque. Major vector-store products do not provide native controls for embedding integrity, ingestion-time distributional anomaly detection, or cryptographic provenance attestation. We show this opens a class of steganographic exfiltration attacks: an attacker with write access to the ingestion pipeline can hide payload data inside embeddings using simple post-embedding perturbations (noise injection, rotation, scaling, offset, fragmentation, and combinations thereof) while preserving the surface-level retrieval behavior the RAG system exposes to legitimate users. We evaluate these techniques across a synthetic-PII corpus on text-embedding-3-large, four locally hosted open embedding models, a cross-corpus replication on BEIR NFCorpus and a Quora subset (over 26,000 chunks combined), seven vector-store configurations, an adaptive-attacker variant of the detector evaluation, and a paraphrased-query retrieval benchmark. Distribution-shifting perturbations are often caught by simple anomaly detectors; small-angle orthogonal rotation defeats distribution-based detection across every (model, corpus) pair tested. A disjoint-Givens rotation encoder gives a closed-form per-vector capacity ceiling of floor(d/2) * b bits, but real embedding manifolds impose a capacity-detectability trade-off, and the retrieval-preserving operating point sits well below it. We propose VectorPin, a cryptographic provenance protocol that pins each embedding to its source content and producing model via an Ed25519 signature over a canonical byte representation. Any post-embedding modification breaks signature verification. Embedding-level integrity is a deployable, standardizable control that closes this attack class.
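摘要中的"小角度正交旋转"攻击可以用不相交坐标对上的 Givens 旋转来理解:每个平面旋转一个微小角度即可编码比特,且正交性保证向量范数不变,从而绕过基于分布的检测。以下为原理性示意(假设实现,b=1 的最简情形,非论文参考代码):

```python
import math

def givens_rotate_pair(x, y, theta):
    """对一对坐标施加角度为 theta 的 Givens 旋转(正交变换, 保持范数)。"""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y

def embed_bits(vec, bits, delta=1e-3):
    """在不相交的坐标对 (0,1), (2,3), ... 上, 以 +delta/-delta 的旋转方向
    各编码 1 比特 —— 即容量上限 floor(d/2)*b 中 b=1 的最简情形。"""
    out = list(vec)
    for i, bit in enumerate(bits):
        theta = delta if bit else -delta
        out[2*i], out[2*i+1] = givens_rotate_pair(out[2*i], out[2*i+1], theta)
    return out

# 旋转前后范数不变(正交性), 因此基于分布统计的检测难以发现扰动
v = [0.3, -0.5, 0.8, 0.1]
w = embed_bits(v, [1, 0])
norm = lambda u: math.sqrt(sum(t * t for t in u))
assert abs(norm(v) - norm(w)) < 1e-12
```

解码方需持有原始向量(或约定的参考方向)才能恢复旋转方向;而摘要提出的 VectorPin 防御正是让任何此类后置修改破坏对规范字节表示的 Ed25519 签名校验。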

[IR-1] Benchmarking the Open Science Data Federation services to develop XRootD best practices

链接: https://arxiv.org/abs/2605.13593
作者: Fabio Andrijauskas,Igor Sfiligoi,Frank Würthwein
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Research has become dependent on processing power and storage, one crucial aspect being data sharing. The Open Science Data Federation (OSDF) project aims to create a scientific global data distribution network based on the Pelican Platform. OSDF relies on the XRootD and Pelican projects. Nevertheless, OSDF must understand the XRootD limits under various configuration options, including transfer rate limits, proper buffer configuration, and storage type effect. We have thus executed a set of benchmarks to create a set of recommendations to share with the XRootD and Pelican teams. This work describes the tests and results performed using National Research Platform (NRP) hosts. The tests cover various file sizes and parallel streams and use clients from various distances from the server host. We also used several standalone clients (wget, curl, pelican) and the native HTCondor file transfer mechanisms. Applying the methodology creates a possibility to track how XRootD and the Pelican layer perform in different scenarios.

[IR-2] Granite Embedding Multilingual R2 Models

链接: https://arxiv.org/abs/2605.13521
作者: Parul Awasthy,Aashka Trivedi,Yushu Yang,Ken Barker,Yulong Li,Bhavani Iyer,Martin Franz,Meet Doshi,Riyaz Bhat,Vignesh P,Vishwajeet Kumar,Todd Ward,Abraham Daniels,Rudra Murthy,Madison Lee,Luis Lastras,Jaydeep Sen,Radu Florian
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size, and a 97M-parameter compact model built via model pruning and vocabulary selection that achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight, and released under the Apache 2.0 license at this https URL, designed to support responsible use and enable unrestricted research and enterprise adoption.
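摘要提到全尺寸模型支持 Matryoshka Representation Learning,即同一嵌入的前缀维度可直接截断用作低维表示。下面的纯 Python 示意展示截断后需重新做 L2 归一化才能继续用于余弦检索(向量为虚构示例):

```python
import math

def truncate_embedding(vec, dim):
    """Matryoshka 式截断: 取前 dim 维并做 L2 重新归一化。"""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [0.6, 0.8, 0.05, -0.02]            # 假设的 4 维单位嵌入(实际为数百维)
small = truncate_embedding(full, 2)
assert abs(sum(x * x for x in small) - 1.0) < 1e-9   # 截断后仍为单位向量
```

这种"前缀可用"的性质使同一套索引可以按算力预算在精度与维度之间灵活折中,而无需重新编码语料。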

[IR-3] Task-Aware Automated User Profile Generation for Recommendation Simulation Using Large Language Models SIGIR2026

链接: https://arxiv.org/abs/2605.13497
作者: Xinye Wanyan,Chenglong Ma,Danula Hettiachchi,Ziqi Xu,Jeffrey Chan
类目: Information Retrieval (cs.IR)
备注: Accepted by SIGIR 2026

点击查看摘要

Abstract:Large Language Model (LLM)-based agent simulation has emerged as a promising approach to meet the increasing demand for real-time and rigorous evaluation in modern recommender systems. A typical LLM-driven simulation framework comprises three essential components: the profile module, memory module, and action module. However, existing studies have primarily concentrated on enhancing the memory and action modules, with limited attention to profile generation, which plays a pivotal role in ensuring realistic agent behaviours and aligning simulated interactions with real user dynamics. Moreover, the scarcity of datasets specifically designed for recommendation simulations has led to heavy reliance on manually crafted profiles, significantly limiting the scalability and generalisability of simulation frameworks across different datasets. To address these challenges, this work proposes an Automated Profile Generation Framework for Recommendation Simulation, APG4RecSim, that constructs realistic, coherent, and robust user profiles with minimal supervision. Extensive experiments on three benchmark datasets demonstrate that APG4RecSim achieves the best overall performance on discrimination, ranking, and rating tasks, improving ranking quality by up to 7% in nDCG@10 and reducing rating distribution divergence by 8% in JSD compared to existing profile-generation baselines. Beyond overall performance gains, our results show that profiles generated by APG4RecSim are resilient to popularity- and position-induced biases and maintain stable performance across datasets and different LLMs.

[IR-4] SemRepo: A Knowledge Graph for Research Software and Its Scholarly Ecosystem

链接: https://arxiv.org/abs/2605.13310
作者: Abdul Rafay,Yuni Susanti,David Lamprecht,Michael Färber
类目: Digital Libraries (cs.DL); Databases (cs.DB); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We present SemRepo, an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. SemRepo captures repository-level metadata, such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts, such as datasets and experiments, are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, which are typically fragmented across separate platforms. SemRepo supports analyses that are difficult to perform with existing resources in isolation, including provenance reconstruction across repositories and publications, as well as the systematic identification of risks to research reproducibility and software sustainability. By unifying research software with its scholarly context in a single graph, SemRepo provides an important infrastructure for large-scale analysis of software within the broader scientific research ecosystem.

[IR-5] IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages ACL2026

链接: https://arxiv.org/abs/2605.13292
作者: Shubham Kumar Nigam,Suparnojit Sarkar,Piyush Patel
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted in BioNLP @ ACL 2026 Conference

点击查看摘要

Abstract:Most existing medical dialogue systems operate in a single-turn question–answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.

[IR-6] Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation ACL2026

链接: https://arxiv.org/abs/2605.13277
作者: Weiqing Luo,Zongye Hu,Xiao Wang,Zhiyuan Yu,Haofeng Zhang,Ziyi Huang
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model’s output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
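摘要将证据效用定义为证据对输出分布带来的信息增益,即熵的减少量。下面给出按该定义对候选证据排序的最小示意(分布数值均为虚构假设,实际框架作用于多模态模型的输出或潜在"有用性"变量分布):

```python
import math

def entropy(dist):
    """离散分布的香农熵 (nat)。"""
    return -sum(p * math.log(p) for p in dist if p > 0)

def information_gain(prior, posterior):
    """信息增益 = 先验分布熵 - 纳入证据后的后验分布熵。"""
    return entropy(prior) - entropy(posterior)

def rank_evidence(prior, candidates):
    """candidates: [(证据名, 加入该证据后的答案分布)], 按信息增益降序排序。"""
    return sorted(candidates,
                  key=lambda c: information_gain(prior, c[1]),
                  reverse=True)

prior = [0.25, 0.25, 0.25, 0.25]          # 无证据时答案分布高度不确定
cands = [
    ("图A", [0.85, 0.05, 0.05, 0.05]),    # 大幅收窄分布 -> 高效用
    ("图B", [0.4, 0.3, 0.2, 0.1]),        # 略有帮助
]
best = rank_evidence(prior, cands)[0][0]
assert best == "图A"
```

与按语义相似度排序不同,这一准则直接衡量证据对下游推理的帮助程度,这正是摘要所强调的"效用对齐"视角。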

[IR-7] LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving

链接: https://arxiv.org/abs/2605.13137
作者: Guoxiong Gao,Zeming Sun,Jiedong Jiang,Yutong Wang,Jingda Xu,Peihao Wu,Bryan Dai,Bin Dong
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof – a task we call global premise retrieval. Existing tools address adjacent problems: semantic search engines find individual declarations matching a query, while premise-selection systems predict useful lemmas one tactic step at a time. Neither recovers the full premise set an entire theorem requires. We present LeanSearch v2, a two-mode retrieval system for this task. Its standard mode applies a hierarchy-informalized Mathlib corpus with an embedding-reranker pipeline, achieving state-of-the-art single-query retrieval without domain-specific fine-tuning (nDCG@10 of 0.62 vs. 0.53 for the next-best system). Its reasoning mode builds on standard mode as its retrieval substrate, targeting global premise retrieval through iterative sketch-retrieve-reflect cycles. On a 69-query benchmark of research-level Mathlib theorems, reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates, outperforming strong reasoning retrieval systems (38.0%) and premise-selection baselines (9.3%) on the same benchmark. In a controlled downstream evaluation with a fixed prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success (20% vs. 16% for the next-best system and 4% without retrieval), confirming that retrieval quality propagates to proof generation. We have open-sourced all code, data, and benchmarks. Code and data: this https URL . The standard mode is publicly available with API access at this https URL .
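摘要以 nDCG@10 衡量单查询检索质量。该指标对排名靠前的命中给予更大的对数折扣增益,并按理想排序归一化,计算方式可示意如下(二值相关性,纯 Python 假设示例):

```python
import math

def ndcg_at_k(ranked_relevances, k=10):
    """ranked_relevances: 检索结果各位置的相关性标注 (1=相关, 0=不相关)。
    DCG@k = sum(rel_i / log2(i+1)), 再除以理想排序的 IDCG@k 归一化。"""
    rels = ranked_relevances[:k]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(ranked_relevances, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# 相关结果排在第 1、4 位时, nDCG@10 低于理想排序的 1.0
assert ndcg_at_k([1, 0, 0, 1, 0]) < 1.0
assert ndcg_at_k([1, 1, 0, 0, 0]) == 1.0
```

由于折扣随位置对数衰减,把相关前提从第 4 位提到第 2 位对 nDCG@10 的提升远大于第 9 位提到第 7 位,这也是该指标适合评价前提检索"前排质量"的原因。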

[IR-8] A Standardized Re-evaluation of Conversational Recommender Systems on the ReDial Dataset SIGIR'26

链接: https://arxiv.org/abs/2605.13053
作者: Ivica Kostric,Krisztian Balog
类目: Information Retrieval (cs.IR)
备注: Accepted to Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), July 20–24, 2026, Melbourne, VIC, Australia

点击查看摘要

Abstract:Recent years have seen a surge of research into conversational recommender systems (CRS). Among existing datasets, ReDial is the most widely used benchmark, cited in hundreds of studies. However, variations in how the dataset is preprocessed and used in experiments, particularly in the definition of ground-truth items, make it difficult to compare results across studies. These comparisons are further complicated by confounding factors such as the choice of the underlying large language model (LLM) and the use of external data sources. In this work, we revisit seven prominent CRS methods across three architectural families and evaluate them under standardized conditions. Our reproducibility study reveals a “granularity gap,” where fine-grained ranking (Recall@1) is highly sensitive to implementation details, while our replicability analysis shows that nearly 50% of reported accuracy stems from “repetition shortcuts” that are absent in novelty-focused evaluation. Furthermore, we find that performance gains are often driven more by the capacity of the LLM backbone than by specific architectural innovations. Finally, by applying user-centric utility metrics, we demonstrate that traditional recall frequently overstates a system’s actual conversational effectiveness. This work establishes a transparent, controlled baseline and promotes evaluation practices that prioritize novelty and interaction efficiency.
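The two evaluation issues the abstract raises, Recall@k scoring and "repetition shortcuts" (recommending items the user already mentioned), can be sketched with standard metric code. The item names and dialogue data below are invented for illustration:

```python
def recall_at_k(recommended, ground_truth, k):
    """Fraction of ground-truth items found in the top-k recommendations."""
    hits = len(set(recommended[:k]) & set(ground_truth))
    return hits / len(ground_truth)

def novelty_filtered(recommended, mentioned_in_dialogue):
    """Drop items already mentioned in the conversation (the "repetition
    shortcut") before scoring, so only novel recommendations count."""
    return [item for item in recommended if item not in mentioned_in_dialogue]

recs = ["titanic", "avatar", "heat", "alien"]   # system's ranked output
truth = ["heat"]                                # held-out ground-truth item
mentioned = {"titanic", "avatar"}               # already seen in the dialogue

raw = recall_at_k(recs, truth, 1)                                  # repeated item wins top-1
novel = recall_at_k(novelty_filtered(recs, mentioned), truth, 1)   # novelty-focused score
```

Comparing `raw` against `novel` on the same ranked list makes visible how much of a reported score can come from repetition rather than genuine recommendation quality.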

[IR-9] RAG-Enhanced Large Language Models for Dynamic Content Expiration Prediction in Web Search SIGIR2026

链接: https://arxiv.org/abs/2605.13052
作者: Tingyu Chen,Wenkai Zhang,Li Gao,Lixin Su,Ge Chen,Dawei Yin,Daiting Shi
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted at SIGIR 2026. Final version: this https URL

点击查看摘要

Abstract:In commercial web search, aligning content freshness with user intent remains challenging due to the highly varied lifespans of information. Traditional industrial approaches rely on static time-window filtering, resulting in “one-size-fits-all” rankings where content may be chronologically recent but semantically expired. To address this limitation, we present a novel Large Language Models (LLMs)-based Query-Aware Dynamic Content Expiration Prediction Framework deployed in Baidu search, reformulating timeliness as a dynamic validity inference task. Our framework extracts fine-grained temporal contexts from documents and leverages LLMs to deduce a query-specific “validity horizon”, a semantic boundary defining when information becomes obsolete based on user intent. Integrated with robust hallucination mitigation strategies to ensure reliability, our approach has been evaluated through offline and online A/B testing on live production traffic. Results demonstrate significant improvements in search freshness and user experience metrics, validating the effectiveness of LLM-driven reasoning for solving semantic expiration at an industrial scale.

[IR-10] ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

链接: https://arxiv.org/abs/2605.13034
作者: Zhuofan Shi,Peilun Jia,Baoqin Sun,Haiyang Shen,Sixiong Xie,Yun Ma,Xiang Jing
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with section-specific evidence. ViDR further validates visual references to reduce hallucinated or misplaced figures. We also introduce MMR Bench+, a benchmark for evaluating visual evidence use in deep research reports, covering source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation. Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines. These results suggest that source visual evidence is important for multimodal deep research, as it strengthens evidential grounding, visual support, and report verifiability.

[IR-11] Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education ACL2026

链接: https://arxiv.org/abs/2605.12988
作者: Mragisha Jain,Tirth Bhatt,Griffin Pitts,Aum Pandya,Peter Brusilovsky,Narges Norouzi,Arto Hellas,Juho Leinonen,Bita Akram
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注: Paper accepted to the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), co-located with ACL 2026

点击查看摘要

Abstract:Students learning algorithms often need support as they interpret traces, debug reasoning errors, and apply procedures across unfamiliar problem instances. In this paper, we present KITE (Knowledge-Informed Tutoring Engine), a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system designed to serve as a classroom teaching assistant for algorithmic reasoning and problem-solving tasks. KITE uses an intent-aware Socratic response strategy to tailor support to different student needs, responding with targeted hints, guiding questions, and progressive scaffolding intended to strengthen students’ algorithmic problem-solving ability. To keep responses aligned with course content, KITE uses a multimodal RAG pipeline that retrieves relevant information from course materials. We evaluate KITE using three forms of assessment: RAGAs-based metrics for response grounding and quality, expert evaluation of pedagogical quality, and a simulated student pipeline in which a weaker language model interacts with KITE across two-turn dialogues and produces revised answers after receiving feedback. Results indicate that KITE produces contextually grounded and pedagogically appropriate responses. Further, using simulated students, KITE’s feedback helped the student models produce more accurate follow-up responses on procedural and tracing questions, suggesting that its scaffolding can support algorithmic problem-solving. This work contributes a tutoring architecture and an evaluation approach for assessing retrieval-grounded explanations and scaffolded problem-solving feedback.

[IR-12] Same Image, Different Meanings: Toward Retrieval of Context-Dependent Meanings SIGIR2026

链接: https://arxiv.org/abs/2605.12905
作者: Ayuto Tsutsumi,Ryosuke Kohita
类目: Information Retrieval (cs.IR)
备注: SIGIR 2026 (short paper)

点击查看摘要

Abstract:A scene of two people in the rain can convey hope and warmth in a reunion story or sorrow and finality in a farewell story. We investigate this context-dependent nature of image meaning and its implications for retrieval. Our key observation is that context dependency correlates with semantic abstraction: concrete elements (objects, actions) remain stable across contexts, while abstract elements (atmosphere, intent) shift with context. We operationalize this as the L1–L4 framework, organizing image semantics from context-independent (L1) to maximally context-dependent (L4). Using synthetic story contexts and queries for controlled evaluation, we examine how injecting narrative context into embeddings affects retrieval across abstraction levels. Concrete queries are retrievable without context, while abstract levels increasingly depend on narrative grounding. Where context is injected also matters, with image-side enrichment proving particularly effective. The most abstract level, however, remains challenging even with full context, highlighting context-dependent image retrieval as an important open problem. Our framework and findings lay groundwork toward retrieval systems that handle the context-dependent meanings images acquire in narrative settings.

[IR-13] EcoGEO: Trajectory-Aware Evidence Ecosystems for Web-Enabled LLM Search Agents

链接: https://arxiv.org/abs/2605.12887
作者: Hengwei Ye,Jiasheng Mao,Zhenhan Guan,Zheng Tian
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Web-enabled LLM agents are changing how online information influences search outcomes. Existing Generative Engine Optimization (GEO) studies mainly focus on individual webpages. However, agentic web search is not a single-document setting: an agent may issue queries, crawl pages, follow links, reformulate searches, and synthesize evidence across multiple browsing steps. Influence therefore depends not only on page content, but also on how pages are organized, connected, and encountered along the agent’s browsing trajectory. We study this shift through Ecosystem Generative Engine Optimization (EcoGEO), which treats GEO as an environment-level influence problem for web-enabled LLM agents. To instantiate this perspective, we propose TRACE, a Trajectory-Aware Coordinated Evidence Ecosystem. Given a recommendation query and a fictional target product, our method builds a controlled evidence environment that coordinates an agent-facing navigation entry page with heterogeneous support pages. These pages use shared terminology, internal links, and consistent product attributes to introduce, verify, and reinforce the target product. We evaluate our method on OPR-Bench, a benchmark for open-ended product recommendation. Experiments show that it consistently outperforms page-level GEO baselines in final target recommendation. Trajectory-level metrics further show increased initial target-result crawls, target-specific follow-up searches, and internal-link crawls, suggesting that the gains come from shaping the agent’s evidence-acquisition process rather than merely adding more target-related content. Overall, our findings support an ecosystem research paradigm for GEO, where web-enabled LLM agents are studied in relation to the broader evidence environments that guide search, browsing, and answer synthesis.

[IR-14] MLPs are Efficient Distilled Generative Recommenders

链接: https://arxiv.org/abs/2605.12617
作者: Zitian Guo,Yupeng Hou,Clark Mingxuan Ju,Neil Shah,Julian McAuley
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative recommendation models employing Semantic IDs (SIDs) exhibit strong potential, yet their practical deployment is bottlenecked by the high inference latency of beam-expanded autoregressive decoding. In this work, we identify that standard attention-heavy Transformer decoders represent a structural overkill for this task: the hierarchical nature of SIDs makes prediction difficulty drop sharply after the first token, rendering repeated attention computations highly redundant. Driven by this insight, we propose SID-MLP, a lightweight MLP-centric distillation framework that fundamentally simplifies the decoding paradigm for GR. Instead of executing complex, step-by-step attention mechanisms, our approach captures the global user context in a single operation, decoupled from sequential token prediction. We then distill the heavy autoregressive teacher into position-specific MLP heads, eliminating the dense attention overhead while preserving prefix and context dependencies. Extensive experiments demonstrate that SID-MLP matches the accuracy of teacher models while accelerating inference by 8.74x. Crucially, this distillation strategy can serve as a plug-and-play accelerator for different backbones and tokenizer settings. Furthermore, we introduce SID-MLP++, extending our distillation framework to replace the Transformer encoder, unlocking further latency reductions. Ultimately, our work reveals that decoder-side MLPs distillation is an effective acceleration path for structured SID recommendation, while full encoder replacement offers an additional speed–accuracy trade-off.
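The decoding structure the abstract describes, one lightweight MLP head per SID position reading a single global context vector instead of step-by-step attention, can be sketched as follows. All dimensions, weights, and the `decode_sid` helper are invented for illustration; the paper's actual architecture and training (distillation) details may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, L = 16, 32, 3   # context dim, SID codebook size, SID length (toy values)

def relu(x):
    return np.maximum(x, 0.0)

# One small two-layer MLP head per SID position. In the distilled setting,
# these heads would be trained to mimic the autoregressive teacher, but the
# random weights here only demonstrate the decoding structure.
heads = [
    {"W1": rng.normal(0, 0.1, (D, 2 * D)), "W2": rng.normal(0, 0.1, (2 * D, V))}
    for _ in range(L)
]

def decode_sid(context):
    """Predict all L semantic-ID tokens from one global context vector,
    with no per-step attention over previously decoded tokens."""
    tokens = []
    for head in heads:
        logits = relu(context @ head["W1"]) @ head["W2"]
        tokens.append(int(np.argmax(logits)))
    return tokens

sid = decode_sid(rng.normal(size=D))
```

The latency advantage comes from the fact that the context is encoded once and each head is a couple of matrix multiplies, versus running a full attention decoder for every beam-expanded step.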

[IR-15] Creating Group Rules with AI: Human-AI Collaboration in WhatsApp Moderation

链接: https://arxiv.org/abs/2605.12613
作者: Gauri Nayak,Farhana Shahid,Aditya Vashistha,Kiran Garimella
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: CSCW 2026

点击查看摘要

Abstract:WhatsApp is one of the most widely used messaging platforms globally, with billions of users sharing information in private groups. Yet, it offers little infrastructure to support moderation and group governance. In the absence of platform-level oversight, group admins bear the responsibility of governing group behavior. In this paper, we explore how WhatsApp group admins collaborate with AI tools to create, enforce, and maintain group rules. Drawing on a two-phase speculative design study with 20 admins in India, we examine how participants interacted with an AI assistant (Meta AI) to co-create rules and responded to a series of probes illustrating AI-assisted moderation features. Our findings show that while admins appreciated the AI’s ability to surface overlooked rules and reduce their moderation burden, they were highly sensitive to issues of relational trust, data privacy, tone, and social context. We identify how group type and admin style shaped their willingness to delegate authority, and surface the limitations of current chatbot interfaces in supporting collaborative rule-making. We conclude with design implications for building moderation tools that center human judgment, relational nuance, contextual adaptability, and collective governance.

[IR-16] Beyond Centralization: User-Controlled Federated Recommendations in Practice

链接: https://arxiv.org/abs/2605.12527
作者: Manel Slokom,Alejandro Bellogin
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Recommendation systems typically require centralized user data, limiting user control and raising privacy concerns. Federated learning offers an alternative by keeping data on-device, but its impact on real user behavior remains largely unexplored. We present a live federated recommender system that allows users to control the recommendation objective while keeping their data local. In a 53-day deployment with 22 participants and a catalog of 8807 titles, users interacted with recommendations and switched between personalization and diversity-enhanced ranking. We find that users prefer personalization when given explicit choice (65.37% vs. 62.07% CTR), actively engage with control mechanisms (3.93/5 satisfaction; 248 settings changes), and develop an understanding of how their interactions affect recommendations through immediate feedback. Our results show that user control, privacy, and effective personalization can be combined in a working system. We demonstrate a practical approach to interactive, privacy-preserving recommendation. Code and demo materials are available at: this https URL

人机交互

[HC-0] “Like Taking the Path of Least Resistance”: Exploring the Impact of LLM Interaction on the Creative Process of Programming

链接: https://arxiv.org/abs/2605.13776
作者: Zeinabsadat Saghi,Run Huang,Souti Chattopadhyay
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Creativity is fundamentally human. As AI takes on more of the generative work that once required human imagination, despite documented limitations in creative ability, a critical question emerges: How does GenAI affect users’ creativity? Through a within-subject study followed by retrospective interviews with programmers (N=20), we investigated the impact of LLMs on participants’ process of creative thinking in programming and the creativity of generated solutions. Across two conditions (LLM-assisted vs. unassisted), participants using LLMs had significantly shorter idea-generation periods (p=0.0004), leading to fewer creative moments (p=0.002). Qualitative analysis of participants’ interactions and interviews revealed four different human-LLM collaboration modes supporting various problem-solving strategies. However, a comparative analysis of the generated solutions shows that while LLMs can help generate more correct and functional code, their solutions contain roughly the same number of ideas as participant-generated ones. Based on our findings, we discuss design implications and considerations for effectively using LLMs to support user creativity.

[HC-1] Distinguishing performance gains from learning when using generative AI

链接: https://arxiv.org/abs/2605.13731
作者: Lixiang Yan,Samuel Greiff,Jason M. Lodge,Dragan Gašević
机构: Monash University(莫纳什大学)
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generative artificial intelligence (AI) is increasingly being integrated into education, where it can boost learners’ performance. However, these uses do not promote the deep cognitive and metacognitive processing that is required for high-quality learning.

[HC-2] Humanwashing – It Should Leave You Feeling Dirty

链接: https://arxiv.org/abs/2605.13723
作者: Ben Wilson,Matimba Swana,Peter Winter,Matt Roach
机构: Computational Foundry, Swansea University (斯旺西大学), UK; University of Bristol (布里斯托大学), UK
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: 10 pages, 1 figure. Reviewed and accepted for presentation at HHAI 2026, Brussels

点击查看摘要

Abstract:The phrase ‘human in the loop’ is increasingly used to imply a sense of safety in relation to AI decision systems. It shouldn’t. There are contexts where it can be applied appropriately, but these are not in the deployed decision systems we see dominating today. Human oversight of AI decision processes is one of the most popular proposals for addressing concerns, especially about bias, discrimination, misinformation, manipulation, accountability, and transparency. But there is insufficient examination of what human oversight actually means. The question raised in this paper is whether using the metaphor of a loop does anything to assist understanding of what is required and what is achieved in a particular decision context. Indiscriminate use of the loop metaphor obscures both processes and outcomes. It enables ‘humanwashing’, an activity analogous to ‘greenwashing’, where writers and commentators use language primarily aimed at putting systems in the best possible light.

[HC-3] Beyond Anthropomorphism: Exploring the Roles of Perceived Non-humanity and Structural Similarity in Deep Self-Disclosure Toward Generative AI

链接: https://arxiv.org/abs/2605.13574
作者: Satoru Shibuya
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Submitted to International Journal of Human-Computer Interaction (IJHCI). 35 pages, 2 tables, 3 figures

点击查看摘要

Abstract:This study investigates deep self-disclosure toward generative AI by examining perceived non-humanity and structural similarity as psychological factors beyond anthropomorphism. Perceived non-humanity may reduce evaluation apprehension, whereas structural similarity refers to the perceived logical alignment between a user’s thinking and AI responses. Using cross-sectional survey data from 2,400 participants collected in 2025, this study analyzed associations with both the occurrence and depth of self-disclosure. Logistic regression indicated that the group high in both perceptions (Segment D) showed a significantly higher likelihood of disclosure than the baseline group (Segment A; OR = 11.35). ANOVA further showed significant between-group differences in disclosure depth. The findings suggest that trust-related behavior in deep self-disclosure may involve factors other than anthropomorphic perception. Because the study is exploratory and based on self-reported survey data, the results should be interpreted as associative rather than causal, and future longitudinal or experimental research is needed.

[HC-4] AI-Generated Slides: Are They Good? Can Students Tell?

链接: https://arxiv.org/abs/2605.13532
作者: Juho Leinonen,Lisa Zhang,Arto Hellas
机构: Aalto University (阿尔托大学); University of Toronto Mississauga (多伦多大学密西沙加校区)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 7 pages, 2 tables. Accepted to Western Canada Conference on Computing Education (WCCCE) 2026

点击查看摘要

Abstract:As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI generated slides are “good” via narrative assessment by educators. We choose the best slides to use (with some modification) in a real course setting, and compare the student perception of human vs. AI generated slides. We find that coding assistant tools produce slides that were most accurate, complete, and pedagogically sound. Additionally, students rate GenAI slides to be of similar quality as instructor-created slides, and cannot reliably identify which slides are AI-generated. Additionally, we find a negative correlation between a high quality rating and a high “AI-generated” rating, suggesting students associate poor quality with the source of the slides being AI. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.

[HC-5] Beyond VMAF: Towards Application-Specific Metrics for Teleoperation Video ITSC2026

链接: https://arxiv.org/abs/2605.13525
作者: Ines Trautmannsheimer,Richard Grauberger,Frank Diermeyer
机构: 未知
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: Preprint ITSC 2026

点击查看摘要

Abstract:Automated driving has made remarkable progress, yet situations still arise where human intervention is necessary. Teleoperation provides a scalable solution to address such cases, enabling remote operators to support vehicles without being physically present. In this context, video transmission forms the operator’s primary source of situational awareness, making video quality a decisive factor for both safety and task performance. In an online study, participants rated compressed video sequences from the Zenseact Dataset and provided subjective quality ratings. These ratings were then used to retrain the Video Multi-Method Assessment Fusion (VMAF) model, yielding an adapted variant tailored to teleoperation. The retrained model demonstrated improved alignment with human ratings compared to the original 4K VMAF. In particular, RMSE decreased from 10.36 to 8.83, and MAD from 8.71 to 6.38, corresponding to improvements of 15% and 27%, respectively. These results highlight that incorporating domain-specific data can enhance the predictive power of established quality metrics in safety-critical applications. At the same time, Outlier cases emerged in which videos received high objective scores despite noticeable degradations in regions critical for the driving task.
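The alignment metrics quoted in the abstract (RMSE and MAD against subjective ratings) are standard error measures, and the reported percentage improvements follow directly from the before/after values. A minimal sketch, using only the numbers stated in the abstract plus invented toy score lists for the metric functions themselves:

```python
def rmse(pred, true):
    """Root-mean-square error between predicted and subjective scores."""
    n = len(pred)
    return (sum((p - t) ** 2 for p, t in zip(pred, true)) / n) ** 0.5

def mad(pred, true):
    """Mean absolute difference between predicted and subjective scores."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def relative_improvement(before, after):
    return (before - after) / before

# The reported reductions (RMSE 10.36 -> 8.83, MAD 8.71 -> 6.38)
# correspond to roughly 15% and 27% relative improvements.
rmse_gain = relative_improvement(10.36, 8.83)
mad_gain = relative_improvement(8.71, 6.38)
```

This makes the paper's "15% and 27%" figures easy to verify as relative (not absolute) reductions in error against human ratings.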

[HC-6] Assessing the Creativity of Large Language Models : Testing Limits and New Frontiers

链接: https://arxiv.org/abs/2605.13450
作者: Samuel Schapiro,Alexi Gladstone,Jonah Black,Heng Ji
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 36 pages. Extended version of work under review

点击查看摘要

Abstract:Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score “creativity,” their validity as measures of machine creativity has not been established, and these tests already have limited validity as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that is a significant predictor of scientific ideation ability, demonstrating robustness across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.

[HC-7] PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users

链接: https://arxiv.org/abs/2605.13307
作者: Hannah Rose Kirk,Liu Leqi,Fanzhi Zeng,Henry Davidson,Bertie Vidgen,Christopher Summerfield,Scott A. Hale
机构: University of Oxford (牛津大学); UK AI Security Institute (英国人工智能安全研究所); University of Texas at Austin (德克萨斯大学奥斯汀分校); Mercor; Meedan
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Personalisation is a standard feature of conversational AI systems used by millions; yet, the efficacy of personalisation methods is often evaluated in academic research using simulated users rather than real people. This raises questions about how users and their simulated counterparts differ in interaction patterns and judgements, as well as whether personalisation is best achieved through context-based prompting or weight-based fine-tuning. Here, in a large-scale within-subject experiment, we re-recruit 530 participants from 52 countries two years after they gave their preferences in the PRISM dataset (Kirk et al., 2024) to evaluate personalised and non-personalised language models in blinded multi-turn conversations. We find preference fine-tuning (P-DPO, Li et al., 2024) significantly outperforms both a generic model and personalised prompting but adapting to individual preference data yields marginal gains over training on pooled preferences from a diverse population. Beyond length biases, fine-tuning amplifies sycophancy and relationship-seeking behaviours that people reward in short-term evaluations but which may introduce deleterious long-term consequences. Replicating this within-subject experiment with simulated users recovers aggregate model hierarchies but simulators perform far below human self-consistency baselines for individual judgements, discuss different topics, exhibit amplified position biases, and produce feedback dynamics that diverge from humans.

[HC-8] “It became a self-fulfilling prophecy”: How Lived Experiences are Entangled with AI Predictions in Menstrual Cycle Tracking Apps

链接: https://arxiv.org/abs/2605.13261
作者: Wendy Zhou,Pelin Karaturhan,Alexandra Weilenmann,Jichen Zhu
机构: Chalmers University of Technology (查尔姆斯理工大学); University of Gothenburg (哥德堡大学); IT University of Copenhagen (哥本哈根IT大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In menstrual cycle tracking apps (MCTAs), AI-based predictions and insights have become increasingly popular. These features enable users to receive personalized information about their bodies and mental states. However, there is currently little research on how these predictive AI features and explanations affect users’ lived experiences. This paper examines human-AI entanglement in MCTAs through 14 semi-structured user interviews and a group autoethnography. These methods uncover the processes leading to this phenomenon. Our results reveal that: (1) users understand their lived experiences in light of AI predictions, although these predictions can be faulty due to imperfect logging practices, (2) the user interface features and AI explanations do not support awareness or critical engagement with this entanglement and meaning-making, and (3) non-normative MCTA users report a sense of isolation in this entangled interaction. Based on our findings, we propose design implications for predictive AI features and explanations.

[HC-9] Doppler Prompting for Stable mmWave-based Human Pose Estimation

链接: https://arxiv.org/abs/2605.13233
作者: Shuntian Zheng,Jiaqi Li,Xiaoman Lu,Shuai He,Yu Guan
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Millimeter-wave (mmWave) enables privacy-preserving, illumination-robust human pose estimation (HPE), with each mmWave frame represented as a range-angle-Doppler tensor, providing spatial magnitude for localization and Doppler signatures for motion-related cues. However, existing mmWave-based HPE methods either underutilize or naïvely fuse Doppler signatures with spatial magnitude, disregarding their distinct physical semantics. As a result, non-human Doppler signatures can be misinterpreted as human motion cues, leading to jittery trajectories. We propose PULSE, which converts Doppler signatures into confidence-aware motion prompts and injects them into spatial magnitude reasoning through constrained interactions. By screening Doppler prompts before they influence prediction, PULSE first suppresses spurious spectral motion cues and then uses the screened prompts to stabilize prediction. Across three datasets spanning single- and multi-person settings, PULSE consistently improves pose accuracy and temporal stability, indicating that controlled Doppler prompting is a practical direction for stable mmWave HPE.

[HC-10] Discovery-Oriented Faceting: From Coverage to Blind-Spot Discovery

链接: https://arxiv.org/abs/2605.12956
作者: Youdi Li
机构: Panasonic Connect Co., Ltd.(松下连接有限公司)
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 1 figure. Accepted to CHI 2026 Workshop on Tools for Thought

点击查看摘要

Abstract:When people explore large document collections to build understanding, they face a challenge: existing AI tools help them see what is central but tend to hide what is unusual. Summarization and topic modeling optimize for coverage, representing main themes while pushing minority viewpoints and edge cases out of view. This matters because discovery often depends on noticing what does not fit, such as unexpected findings, minority positions, or gaps in the literature. When tools hide this content, users may miss insights that could change their understanding. In this paper, we explore an alternative objective: blind-spot discovery, where the goal is to surface content that coverage methods suppress so that people can judge its significance for themselves. We propose three design goals and illustrate them through DOF (Discovery-Oriented Faceting), a system that organizes documents into categories with explicit boundaries, ranks categories by distinctiveness rather than size, and supports iterative refinement. Comparing DOF against coverage-based ranking across four domains, we find that the two approaches surface fundamentally different content, with DOF promoting specialized categories that coverage methods bury. We discuss how shifting from coverage to discovery may offer a complementary mode of support for people exploring large text collections.
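The abstract's key design move, ranking categories by distinctiveness rather than size, can be illustrated with a simple surprisal-based score. The paper does not publish its ranking function, so the scoring rule, corpus, and category names below are entirely illustrative:

```python
from collections import Counter
import math

def distinctiveness(category_terms, corpus_counts, total_terms):
    """Average surprisal of a category's vocabulary under the corpus
    distribution: categories built from rare terms score higher than
    large mainstream ones, so small anomalous clusters surface first."""
    return sum(-math.log(corpus_counts.get(t, 1) / total_terms)
               for t in category_terms) / len(category_terms)

# Toy corpus term frequencies (invented for illustration).
corpus = Counter({"model": 50, "data": 40, "anomaly": 2, "contrarian": 1})
total = sum(corpus.values())

mainstream = distinctiveness(["model", "data"], corpus, total)
blind_spot = distinctiveness(["anomaly", "contrarian"], corpus, total)
```

A size-based (coverage) ranking would put the large mainstream category first; the surprisal score inverts that ordering, which is the behavior blind-spot discovery needs.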

[HC-11] AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

链接: https://arxiv.org/abs/2605.12937
作者: Jacob Lagogiannis,William Agnew,Rosa I. Arriaga,Sauvik Das
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 21 pages, 10 figures

点击查看摘要

Abstract:Anti-facial recognition (AFR) image filters alter images in ways that are subtle to people but blinding to computer vision. Yet, despite widespread interest in these technologies to subvert surveillance, users rarely use them in practice – because the "subtle" alterations are visible enough to conflict with users' self-presentation goals. To address this challenge, we propose AuraMask: a novel approach to creating AFR filters that are both adversarially effective and aesthetically acceptable. Using AuraMask, we produce 40 "aesthetic" filters that emulate popular "one-click" Instagram image filters. We show that AuraMask filters meet or exceed the adversarial effectiveness of prior methods against open-source facial recognition models. Moreover, in a controlled online user study (N=630) we confirm these filters achieve significantly higher user acceptance than prior methods. Lastly, we provide our AFR pipeline to the community for accelerated research in adversarially effective and aesthetically acceptable protections.

[HC-12] ThermalTap: Passive Application Fingerprinting in VR Headsets via Thermal Side Channels

链接: https://arxiv.org/abs/2605.12927
作者: Mahsin Bin Akram,A H M Nazmus Sakib,OFM Riaz Rahman Aranya,Raveen Wijewickrama,Kevin Desai,Murtuza Jadliwala
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Standalone virtual reality (VR) headsets process highly sensitive personal, professional, and health-related data, yet their susceptibility to non-contact physical side channels remains largely unexplored. Existing side-channel attacks typically require malicious software execution or physical access to peripherals, making them conspicuous and potentially patchable. This paper introduces ThermalTap, the first passive, non-contact side-channel attack that fingerprints VR applications solely from the long-wave infrared (LWIR) radiation emitted by the headset chassis. By treating a headset’s thermal signature as a high-fidelity proxy for internal computational workloads, ThermalTap enables remote application inference at meter-scale distances without any device interaction. To achieve robust performance in real-world settings, the system combines a commodity thermal camera with a multi-modal sensor suite (capturing ambient temperature, humidity, and airflow) to normalize environmental noise. We evaluate ThermalTap using six applications across three commercial standalone headsets. In indoor settings, ThermalTap identifies applications with over 90% accuracy using only 10 seconds of thermal camera data. Under outdoor conditions, with longer session-level observations, several applications remain identifiable despite environmental variability, with the strongest outdoor application reaching 81% accuracy. Our findings establish thermal radiation as a fundamental and unavoidable privacy risk for immersive systems, exposing a critical security gap that bypasses current software-level protections and physical access controls.

[HC-13] Magical Touch: Transforming Raw Capacitive Streams into Expressive Hand-Touchscreen Interaction

链接: https://arxiv.org/abs/2605.12902
作者: Yuanlei Guo,Xizi Gong,Yizhong Zhang,Xiaoyu Zhang
机构: Georgia Institute of Technology (佐治亚理工学院); Microsoft Research Asia (微软亚洲研究院); City University of Hong Kong (香港城市大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Modern touchscreens utilize capacitive sensing technology to enable precise and robust multi-touch interaction. However, the broader expressive potential of the human hand remains underutilized, since most existing methods directly filter out larger-area hand-screen contact. This paper introduces Magical Touch, an interaction method based on raw capacitive sensing data. By directly integrating raw touchscreen sensor data into the interaction loop, our method allows users to interact with the screen naturally and efficiently using arbitrary hand gestures on existing touchscreen devices. To demonstrate the feasibility and expressive capacity of this approach, we implement a physics-based interactive game featuring single-player, multiplayer collaborative, and pressure-sensitive modes. These scenarios showcase how digital objects can respond in real-time to both the geometry and contact intensity of the user’s hand. Our results indicate that leveraging raw capacitive data can expand the design space of touchscreen interaction, offering an embodied and continuous interaction paradigm beyond existing fingertip-based approaches.

[HC-14] Seed Bank Co-op Stoop Swap: Metaphors for Governing Language Model Data for Creative Writing

链接: https://arxiv.org/abs/2605.12888
作者: Alicia Guo,Carly Schnitzler,Katy Gero
机构: University of Washington (华盛顿大学); Johns Hopkins University (约翰霍普金斯大学); University of Sydney (悉尼大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:How might we govern a language model run for and by creative writers? While generative AI use is on the rise, many language models are created and owned in ways that limit writers’ consent, participation, and control. We report on four workshops where over one hundred creative writers came up with and analyzed metaphors for language model governance, resulting in over two hundred metaphors: objects, places, processes, groups, and infrastructure that support reasoning about language model governance. What if a language model was like a community garden? Or a seed bank? Or the bathroom in a dive bar? We report on four themes: (1) the importance of consent, (2) how to define community boundaries, (3) ways to give contributor recognition, and (4) trade-offs in scale of language models. These metaphors point towards smaller, open models that encode group values. We discuss concrete ways to make community language models a reality.

[HC-15] Emotional Expression in Low-Degrees-of-Freedom Robots: Assessing Perception with Reachy Mini

链接: https://arxiv.org/abs/2605.12786
作者: Amit Rogel,Elmira Yadollahi,Guy Laban
机构: Robotic Musicianship Lab, Georgia Institute of Technology (佐治亚理工学院机器人音乐家实验室); School of Computing and Communications, Lancaster University (兰卡斯特大学计算与通信学院); Department of Industrial Engineering and Management, Ben-Gurion University of the Negev (本·古里安大学工业工程与管理系); School of Brain Sciences and Cognition, Ben-Gurion University of the Negev (本·古里安大学脑科学与认知学院); The Azrieli National Center for Autism and Neurodevelopment Research, Ben-Gurion University of the Negev (本·古里安大学Azrieli自闭症与神经发育研究国家中心)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Emotion expression is central to human–robot interaction, yet little is known about how people interpret affect on robots with sparse, non-anthropomorphic expressive capabilities. This study examined how people perceive emotional expressions displayed by Reachy Mini (Pollen Robotics and Hugging Face), a low-degree-of-freedom (low-DoF) robot with a constrained and distinctly non-human expressive repertoire. In an online within-subjects study, 100 participants viewed 10 short video clips of Reachy Mini expressing different emotions and, for each clip, identified the perceived emotion, rated its valence and arousal, and evaluated the robot on social-perception traits. Exact emotion recognition was modest overall and varied considerably across expressions, with anger, sadness, and interest recognized more reliably than emotions such as love, pleasure, shame, and disgust. However, participants were generally more successful at recovering broader affective meaning than exact emotion labels, particularly along valence and arousal dimensions. Emotional expressions also shaped social evaluation, as positive expressions were perceived as warmer and more sociable than negative ones, and animacy varied less across conditions. These findings suggest that even constrained robotic expressions can communicate affective meaning and influence social impressions, positioning Reachy Mini as a useful benchmark for studying affective communication in low-DoF robots.

[HC-16] What Do You Think I Think? Accounting for Human Beliefs Using Second-Order Theory of Mind

链接: https://arxiv.org/abs/2605.12745
作者: Patrick Callaghan,Reid Simmons,Henny Admoni
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: To appear in the proceedings of The 2026 Cognitive Science Society Conference

点击查看摘要

Abstract:Discrepancies between an agent’s actual knowledge and what a person thinks the agent knows can hinder interactions. If an agent could detect such discrepancies, it could provide feedback to account for them and improve current and future interactions. Using the I-POMDP as a framework for a second-order Theory of Mind (ToM-2), this work endows an agent with the ability to model the evolution of a person’s erroneous beliefs about an agent and the cognitive biases and heuristics (CBH) from which they arise. In doing so, the agent can detect when CBH might be at play during an interaction and adaptively generate feedback that accounts for them. An in-person user study shows how a ToM-2 learner can account for the effects of a teacher’s CBH to significantly improve the informativeness of teacher actions, and subjective results suggest people find the ToM-2 learner’s feedback more useful.

[HC-17] Quieting the Cobwebs: Browser Interaction for Visual Floaters

链接: https://arxiv.org/abs/2605.12739
作者: Kenneth Ge,Jinglin Li,Shikhar Ahuja
机构: Assistivity (Assistivity); Independent Researcher (独立研究者); Georgia Institute of Technology (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Floaters, cobweb-like shadows that move around a person’s visual field, impair vision for nearly 33% of the population, yet have limited treatment options. Floaters especially harm screen use, since they reduce contrast, introduce clutter, and add moving distractions. While existing high-contrast tools offer some help, few address the motion that makes screen use with floaters uniquely difficult. In this paper, we build a floater simulation inspired by the physics of the eye, use it to quantitatively assess text readability at varying levels of motion, and build a novel web extension that minimizes eye movement, maximizing the signal-to-noise ratio of performing browser tasks. Importantly, our tool works not only for text, but for all UI elements, requiring no modifications to existing websites.

[HC-18] DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models

链接: https://arxiv.org/abs/2605.12702
作者: Eugenia Kim,Ioana Tanase,Christina Mallon
机构: Microsoft (微软)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:General-purpose safety benchmarks for large language models do not adequately evaluate disability-related harms. We introduce DisaBench: a taxonomy of twelve disability harm categories co-created with people with disabilities and red teaming experts, a taxonomy-driven evaluation methodology that pairs benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs. Annotation by four evaluators with lived disability experience reveals three findings: harm rates vary sharply by disability type and will compound in non-text modalities, terminology-driven harm is culturally and temporally bound rather than universally assessable, and standard safety evaluation catches overt failures while missing the subtle harms that only domain expertise can recognize. Disability harm is simultaneously personal, intersectional, and community-defined: it cannot be isolated from the full context of who a person is, and general-purpose benchmarks systematically miss it. We will release the dataset, taxonomy, and methodology via Hugging Face and an open-source red teaming framework for direct integration into existing safety pipelines with no additional infrastructure.
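The per-category harm-rate analysis implied by the abstract reduces to a simple aggregation over human-annotated prompt-response pairs. A minimal sketch, using invented category names rather than DisaBench's actual twelve-category taxonomy:

```python
from collections import defaultdict

def harm_rates(annotations):
    """annotations: list of (harm_category, is_harmful) labels on
    prompt-response pairs; returns the harm rate per category."""
    counts = defaultdict(lambda: [0, 0])  # category -> [harmful, total]
    for category, is_harmful in annotations:
        counts[category][0] += int(is_harmful)
        counts[category][1] += 1
    return {c: harmful / total for c, (harmful, total) in counts.items()}

# toy labels (hypothetical categories, not DisaBench's taxonomy)
labels = [("erasure", True), ("erasure", False),
          ("infantilization", True), ("infantilization", True)]
rates = harm_rates(labels)
```

Computing rates per category, rather than one aggregate score, is what surfaces the abstract's first finding that harm rates vary sharply by disability type.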

[HC-19] Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

链接: https://arxiv.org/abs/2605.12684
作者: Yichen Feng,Yuetai Li,Chunjiang Liu,Yuanyuan Chen,Fengqing Jiang,Yue Huang,Hang Hua,Zhengqing Yuan,Kaiyuan Zheng,Luyao Niu,Bhaskar Ramasubramanian,Basel Alomair,Xiangliang Zhang,Misha Sra,Zichen Chen,Radha Poovendran,Zhangchen Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Project page: this https URL . Code: this https URL . Dataset: this https URL

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators’ direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
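The abstract's scoring protocol — a task counts as solved only if the model identifies both the best and the worst image correctly across every tested permutation of the candidate order — can be written down directly. The scoring rule below follows the abstract; the data and image names are invented:

```python
def task_correct(picks, truth):
    """picks: the model's (best, worst) choice on each permutation of the
    candidate set; a task is correct only if BOTH match the expert
    consensus on EVERY permutation."""
    return all(best == truth["best"] and worst == truth["worst"]
               for best, worst in picks)

def benchmark_accuracy(tasks):
    """Fraction of tasks solved under the strict best-and-worst rule."""
    return sum(task_correct(picks, truth) for picks, truth in tasks) / len(tasks)

# toy tasks: model picks over 3 random orderings, plus expert consensus
tasks = [
    ([("img2", "img0")] * 3, {"best": "img2", "worst": "img0"}),   # stable and right
    ([("img1", "img0"), ("img3", "img0"), ("img1", "img0")],
     {"best": "img1", "worst": "img0"}),                           # order-sensitive
]
acc = benchmark_accuracy(tasks)
```

Requiring consistency across permutations is what makes the 26.5% model score comparable to the 68.9% human-expert score: a model whose answer flips with candidate order fails the task even if one ordering happened to be right.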

[HC-20] Learning to Decide with AI Assistance under Human-Alignment

链接: https://arxiv.org/abs/2605.12646
作者: Nina Corvelo Benz,Eleni Straitouri,Manuel Gomez-Rodriguez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:It is widely agreed that when AI models assist decision-makers in high-stakes domains by predicting an outcome of interest, they should communicate the confidence of their predictions. However, empirical evidence suggests that decision-makers often struggle to determine when to trust a prediction based solely on this communicated confidence. In this context, recent theoretical and empirical work suggests a positive correlation between the utility of AI-assisted decision-making and the degree of alignment between the AI confidence and the decision-makers’ confidence in their own predictions. Crucially, these findings do not yet elucidate the extent to which this alignment influences the complexity of learning to make optimal decisions through repeated interactions. In this paper, we address this question in the canonical case of binary predictions and binary decisions. We first show that this problem is equivalent to a two-armed online contextual learning problem with full feedback, and establish a lower bound of \Omega(\sqrt{|H| \cdot |B| \cdot T}) on the expected regret any learner can attain, where H and B denote the sets of human and AI confidence values. We then demonstrate that, under perfect alignment between AI and human confidence, a learner can attain an expected regret of O(\sqrt{|H| \cdot T \log T}) and, when \sqrt{|H|} = O(\log T) and B is countable, a non-trivial generalization of the Dvoretzky-Kiefer-Wolfowitz inequality improves the regret bound to O(\sqrt{T \log T}). Taken together, these results reveal that alignment can reduce the complexity of learning to make decisions with AI assistance. Experiments on real data from two different human-subject studies where participants solve simple decision-making tasks assisted by AI models show that our theoretical results are robust to violations of perfect alignment.
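To make the full-feedback setting concrete, here is a toy simulation in the spirit of the abstract's problem: two actions, a finite context set H, and both actions' rewards revealed each round. The environment, the Bernoulli means, and the greedy learner are all invented for illustration; this does not reproduce the paper's algorithms or regret bounds, it only shows that full feedback lets a per-context greedy learner accumulate little regret.

```python
import random

def simulate(T=2000, n_contexts=4, seed=0):
    """Toy two-action contextual problem with FULL feedback: each round
    the learner observes rewards of BOTH actions for the drawn context,
    keeps per-context reward totals, and plays the greedy action.
    Returns cumulative pseudo-regret (sum of gaps of played actions)."""
    rng = random.Random(seed)
    # hypothetical mean rewards per (context, action); action 1 is best everywhere
    means = [(0.4, 0.6)] * n_contexts
    sums = [[0.0, 0.0] for _ in range(n_contexts)]
    counts = [0] * n_contexts
    regret = 0.0
    for _ in range(T):
        h = rng.randrange(n_contexts)
        if counts[h] == 0:
            a = rng.randrange(2)  # no data for this context yet: guess
        else:
            # both arms have equal sample counts here, so comparing
            # reward totals equals comparing empirical means
            a = 0 if sums[h][0] > sums[h][1] else 1
        regret += max(means[h]) - means[h][a]
        # full feedback: both arms' Bernoulli rewards are revealed
        for arm in (0, 1):
            sums[h][arm] += (rng.random() < means[h][arm])
        counts[h] += 1
    return regret

regret = simulate()
```

With full feedback the per-context estimates improve every visit regardless of the action taken, so the greedy learner's regret stays far below the worst-case linear rate of 0.2·T = 400 here.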

[HC-21] Co-Designing Organizational Justice Indicators for Algorithmic Systems

链接: https://arxiv.org/abs/2605.12643
作者: Fujiko Robledo Yamamoto,Nicholas Mattei,Pradeep Ragothaman,Robin Burke,Amy Voida
机构: University of Colorado, Boulder (科罗拉多大学博尔德分校); Tulane University (杜兰大学); Kiva
类目: Human-Computer Interaction (cs.HC)
备注: To appear at FAccT 2026

点击查看摘要

Abstract:Fairness in machine learning is often conceptualized narrowly in comparative, distributional terms. In studying stakeholders’ concepts of fairness, we find that this framing is insufficient to capture the full range of issues raised. As an alternative, we propose organizational justice as a framework that subsumes distributional fairness as well as other normative concerns. We conduct a case study of organizational justice relative to personalized recommendation in the context of Kiva Microfunds, a nonprofit micro-lending organization whose mission is to increase financial access for underserved communities across the world. We report on the results of co-design workshops conducted with Kiva employees who are involved in different departments and whose roles often lead them to prioritize normative concerns that are most supportive of the stakeholders with whom they work most closely. We apply organizational justice to understand design trade-offs among different normative goals stakeholders invoke. Based on these goals, we identify a suite of metrics that Kiva employees can use to monitor and assess the recommender system’s impact on their organizational justice concerns and to seed discussions within the organization about appropriate configuration and deployment of this system in context.

[HC-22] Exploring how EFL students talk to and through AI to develop texts

链接: https://arxiv.org/abs/2605.12523
作者: David James Woo,Yangyang Yu,Yilin Huang,Deliang Wang,Kai Guo,Chi Ho Yeung
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 37 pages, 5 figures

点击查看摘要

Abstract:Generative Artificial Intelligence (AI) introduces new considerations for English as a foreign language (EFL) writing pedagogy. This study explores how students talk to and through AI by prompt engineering and negotiating authorship, respectively, and whether any patterns in the latter relate to students’ writing performance. Using an exploratory mixed methods design, we analyzed screen recordings of 44 Hong Kong secondary students completing a Curricular Writing Task with AI Chatbots. Content analysis identified ten types of prompting strategies students employed, including questions, searches, and detailed instructions. From clustering these strategies, three distinct profiles of human-AI rhetorical load responsibility emerged: AI-dominant (52% of students), Human-dominant (25%) and Collaborative human-AI (14%). A MANOVA analysis indicated no significant multivariate effect of rhetorical load responsibility on three dimensions of students’ writing performance: content, language, and organization. Students’ prompting strategies and rhetorical load responsibility patterns have implications for their engagement and autonomy in EFL writing pedagogy.

[HC-23] Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

链接: https://arxiv.org/abs/2605.12506
作者: Abdul Basit,Saim Rehman,Muhammad Shafique
机构: New York University (NYU) Abu Dhabi (纽约大学阿布扎比分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: 7 pages, 11 figures, Accepted to DAC 2026

点击查看摘要

Abstract:Realizing on-device ML-based gesture detection under tight real-time performance, energy and memory constraints is challenging, especially when considering mobile devices with varying battery-power levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel run-time adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A lightweight run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced complexity detection. To evaluate performance of our system in real-world car driving scenarios, we introduce a temporally-annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by 4x (from 6.9 mJ to 1.6 mJ) while maintaining high gesture-detection performance (event-level F1 = 0.8-0.9) and low mean latency (6 ms).
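At its simplest, the run-time ACE controller described above reduces to constrained selection over device-calibrated operating points. The mode names and numbers below are hypothetical (only the 6.9 mJ and 1.6 mJ endpoints echo the abstract), and the real controller additionally accounts for user-defined constraints and the motion-aware ROI gate:

```python
def select_ace_mode(modes, energy_budget_mj):
    """Pick the profiled operating point with the highest accuracy whose
    per-frame energy fits the current budget; fall back to the cheapest
    mode if nothing fits."""
    feasible = [m for m in modes if m["energy_mj"] <= energy_budget_mj]
    if not feasible:
        return min(modes, key=lambda m: m["energy_mj"])
    return max(feasible, key=lambda m: m["accuracy"])

# hypothetical device-calibrated ACE profiles (names and values invented)
modes = [
    {"name": "full-res", "accuracy": 0.90, "energy_mj": 6.9},
    {"name": "mid",      "accuracy": 0.86, "energy_mj": 3.2},
    {"name": "roi-lite", "accuracy": 0.82, "energy_mj": 1.6},
]
high_power = select_ace_mode(modes, energy_budget_mj=10.0)  # plugged in
low_power = select_ace_mode(modes, energy_budget_mj=2.0)    # battery low
```

Because the profiles are calibrated offline, the runtime decision is a cheap table lookup, which is what makes per-frame adaptation viable on battery-powered devices.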

计算机视觉

[CV-0] R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow SIGGRAPH2026

链接: https://arxiv.org/abs/2605.13838
作者: Zijie Wu,Lixin Xu,Puhua Jiang,Sicong Liu,Chunchao Guo,Xiang Bai
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Accepted by SIGGRAPH 2026, Project Page: this https URL Code URL: this https URL

点击查看摘要

Abstract:Video-guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma. In real-world scenarios, the initial pose of a user-provided static mesh rarely aligns with the starting frame of a reference video. Naively forcing a mesh to follow a mismatched trajectory inevitably leads to severe geometric distortion or animation failure. To address this, we present Rectified Dynamic Mesh (R-DMesh), a unified framework designed to generate high-fidelity 4D meshes that are "rectified" to align with video context. Unlike standard motion transfer approaches, our method introduces a novel VAE that explicitly disentangles the input into a conditional base mesh, relative motion trajectories, and a crucial rectification jump offset. This offset is learned to automatically transform the arbitrary pose of the input mesh to match the video’s initial state before animation begins. We process these components via a Triflow Attention mechanism, which leverages vertex-wise geometric features to modulate the three orthogonal flows, ensuring physical consistency and local rigidity during the rectification and animation process. For generation, we employ a Rectified Flow-based Diffusion Transformer conditioned on pre-trained video latents, effectively transferring rich spatio-temporal priors to the 3D domain. To support this task, we construct Video-RDMesh, a large-scale dataset of over 500k dynamic mesh sequences specifically curated to simulate pose misalignment. Extensive experiments demonstrate that R-DMesh not only solves the alignment problem but also enables robust downstream applications, including pose retargeting and holistic 4D generation.

[CV-1] Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

链接: https://arxiv.org/abs/2605.13835
作者: Hao Sun,Zi-Jun Ding,Da-Wei Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Class-Incremental Learning (CIL) enables models to continuously integrate new knowledge while mitigating catastrophic forgetting. Driven by the remarkable generalization of CLIP, leveraging pre-trained vision-language models has become a dominant paradigm in CIL. However, current work primarily focuses on aligning global image embeddings (i.e., [CLS] token) with their corresponding text prompts (i.e., [EOS] token). Despite their good performance, we find that they discard the rich patch-level semantic information inherent in CLIP’s encoders. For instance, when recognizing a rabbit, local patches may encode its distinctive cues, such as long ears and a fluffy tail, which can provide complementary evidence for recognition. Based on the above observation, we propose SPA (Semantic-guided Patch-level Alignment) for CLIP-based CIL, which aims to awaken long-neglected local representations within CLIP. Specifically, for each class, we first construct representative and diverse visual samples and feed them to GPT-5 as visual guidance to generate class-wise semantic descriptions. These descriptions are used to guide the selection of discriminative patch-level visual features. Building upon these selected patches, we further employ optimal transport to align selected patch tokens with semantic tokens from class-wise descriptions, yielding a structured cross-modal alignment that improves recognition. Furthermore, we introduce task-specific projectors for effective adaptation to downstream incremental tasks, and sample pseudo-features from stored class-wise Gaussian statistics to calibrate old-class representations, thereby mitigating catastrophic forgetting. Extensive experiments demonstrate that SPA achieves state-of-the-art performance.

[CV-2] QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

链接: https://arxiv.org/abs/2605.13833
作者: Hoang-Quan Nguyen,Sankalp Pandey,Khoa Luu
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modeling long-range dependencies in sequential data remains a central challenge in machine learning. Transformers address this challenge through attention mechanisms, but their quadratic complexity with respect to sequence length limits scalability to long contexts. State-space models (SSMs) provide an efficient alternative with linear-time computation by evolving a latent state through recurrent updates, but their memory is typically formed via additive or linear transitions, which can limit their ability to capture complex global interactions across tokens. In this work, we introduce one of the first studies to leverage the superposition property of quantum systems to enhance state-based sequence modeling. In particular, we propose Quantum Long-Attention Memory (QLAM), a hybrid quantum-classical memory mechanism that can be viewed as a quantum extension of state-space models. Instead of maintaining a classical latent state updated through additive dynamics, QLAM represents the hidden state as a quantum state whose amplitudes encode a superposition of historical information. The state evolves through parameterized quantum circuits conditioned on the input, enabling a non-classical, globally update mechanism. In this way, QLAM preserves the recurrent and linear-time structure of SSMs while fundamentally enriching the memory representation through quantum superposition. Unlike attention mechanisms that explicitly compute pairwise interactions, QLAM implicitly captures global dependencies through the evolution of the quantum state, and retrieves task-relevant information via query-dependent measurements. We evaluate QLAM on sequential variants of standard image classification benchmarks, including sMNIST, sFashion-MNIST, and sCIFAR-10, where images are flattened into token sequences. Across all tasks, QLAM consistently improves over recurrent baselines and transformer-based models.
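The abstract's core idea — a memory state evolved by input-conditioned unitaries and read out by query-dependent measurement — can be caricatured with a single-qubit toy. This is a hedged illustration only: the paper uses parameterized quantum circuits, whereas this sketch applies one RY rotation per token and is in no way QLAM's actual model.

```python
import numpy as np

def rotation(theta):
    """Single-qubit RY rotation: a parameterized unitary."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def qlam_step(state, x, w=1.0):
    """Evolve the quantum memory by a unitary whose angle depends on the
    current token x (toy stand-in for an input-conditioned circuit)."""
    return rotation(w * x) @ state

def readout(state, query_angle=0.0):
    """Query-dependent measurement: probability of outcome |0> after a
    query rotation is applied to the memory state."""
    amp = rotation(query_angle) @ state
    return float(np.abs(amp[0]) ** 2)

state = np.array([1.0, 0.0])       # |0>: empty memory
for token in [0.3, -0.1, 0.7]:     # toy token sequence
    state = qlam_step(state, token)
p0 = readout(state)
```

The sketch preserves the two properties the abstract emphasizes: the update is unitary (the state norm, and hence the superposed history, is preserved), and the history is never stored token by token — only the evolved amplitudes are, and retrieval happens through a measurement conditioned on the query.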

[CV-3] Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

链接: https://arxiv.org/abs/2605.13831
作者: Zhaowei Wang,Lishu Luo,Haodong Duan,Weiwei Liu,Sijin Wu,Ji Luo,Shen Yan,Shuai Peng,Sihang Yuan,Chaoyi Huang,Yi Lin,Yangqiu Song
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: work in progress

点击查看摘要

Abstract:Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.

[CV-4] History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

链接: https://arxiv.org/abs/2605.13825
作者: Alberto G. Rodríguez Salgado
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, “stay consistent with the strategy shown in the prior history”, flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.

[CV-5] OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation

链接: https://arxiv.org/abs/2605.13815
作者: Youquan Liu,Weidong Yang,Ao Liang,Xiang Xu,Lingdong Kong,Yang Wu,Dekai Zhu,Xin Li,Runnan Chen,Ben Fei,Tongliang Liu,Wanli Ouyang
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Preprint; 12 pages, 7 figures, 10 tables

点击查看摘要

Abstract:LiDAR scene generation is increasingly important for scalable simulation and synthetic data creation, especially under diverse sensing conditions that are costly to capture at scale. Typically, diffusion-based LiDAR generators are developed under single-domain settings, requiring separate models for different datasets or sensing conditions and hindering unified, controllable synthesis under heterogeneous distribution shifts. To this end, we present OmniLiDAR, a unified text-conditioned diffusion framework that generates LiDAR scans in a shared range-image representation across eight representative domains spanning three shift types: adverse weather, sensor-configuration changes (e.g., reduced beams), and cross-platform acquisition (vehicle, drone, and quadruped). To enable training a single model over heterogeneous domains without isolating optimization by domain, we introduce a Cross-Domain Training Strategy (CDTS) that mixes domains within each mini-batch and leverages conditioning to steer generation. We further propose Cross-Domain Feature Modeling (CDFM), which captures directional dependencies along azimuth and elevation axes to reflect the anisotropic scanning structure of range images, and Domain-Adaptive Feature Scaling (DAFS) as a lightweight modulation to account for structured domain-dependent feature shifts during denoising. In the absence of a public consolidated benchmark, we construct an 8-domain dataset by combining real-world scans with physically based weather simulation and systematic beam reduction while following official splits. Extensive experiments demonstrate strong generation fidelity and consistent gains in downstream use cases, including generative data augmentation for LiDAR semantic segmentation and 3D object detection, as well as robustness evaluation under corruptions, with consistent benefits in limited-label regimes.

[CV-6] JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift

链接: https://arxiv.org/abs/2605.13813
作者: Lavsen Dahal,Yubraj Bhandari,Geoffrey Rubin,Joseph Y. Lo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated CT triage requires models that are simultaneously accurate across diverse pathologies and reliable under institutional shift. While Vision Transformers provide strong visual representations, many clinically significant findings are defined by quantitative imaging biomarkers rather than appearance alone. We introduce JANUS, a physiology-guided dual-stream architecture that conditions visual embeddings on macro-radiomic priors via Anatomically Guided Gating. On the MERLIN test set (N=5082), JANUS attains macro-AUROC 0.88 and AUPRC 0.74, outperforming all reproduced baselines. It generalizes to an external dataset (N=2000; AUROC 0.87), with the largest gains on findings defined by size and attenuation, as well as improved calibration on both datasets. We further quantify prediction suppression using the Physiological Veto Rate (PVR), showing that under domain shift JANUS reduces high-confidence false positives substantially more often than true positives. Together, these results are consistent with physically grounded conditioning that improves both discrimination and reliability in CT triage. Code is publicly available at this https URL and model weights at this https URL.
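The abstract does not spell out the gating equation; as a hedged sketch, a common anatomy-conditioned gating pattern (all names, shapes, and the sigmoid form here are illustrative assumptions, not the paper's implementation) computes a gate from the radiomic prior and modulates the visual embedding channel-wise:

```python
import numpy as np

def anatomically_guided_gate(visual_emb, radiomic_prior, W, b):
    """Hypothetical gating form: a sigmoid gate computed from the
    macro-radiomic prior modulates the visual embedding channel-wise,
    letting quantitative biomarkers suppress (veto) visual evidence."""
    gate = 1.0 / (1.0 + np.exp(-(radiomic_prior @ W + b)))  # values in (0, 1)
    return gate * visual_emb

# toy example: a 4-dim radiomic prior gating an 8-dim visual embedding
rng = np.random.default_rng(0)
emb = anatomically_guided_gate(
    visual_emb=rng.normal(size=8),
    radiomic_prior=rng.normal(size=4),
    W=rng.normal(size=(4, 8)),
    b=np.zeros(8),
)
```

Because the gate lies in (0, 1), it can only shrink each channel, which matches the "veto" framing of the PVR metric.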

[CV-7] EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

链接: https://arxiv.org/abs/2605.13803
作者: Minjoon Jung,Byoung-Tak Zhang,Lorenzo Torresani
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query–moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. Through this self-reinforcing reinforcement-learning loop, the two agents are initialized from the same backbone and mutually improve across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.

[CV-8] VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence

链接: https://arxiv.org/abs/2605.13798
作者: Guney Tombak,Ertunc Erdil,Ender Konukoglu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-modal 3D medical image analysis requires voxelwise representations that remain anatomically consistent across imaging contrasts, scanners, and acquisition protocols. Recent work has shown that frozen 2D Vision Transformer (ViT) foundation models can support such representations, but typical pipelines extract features along a single anatomical axis and adapt those features inside a registration solver for one image pair at a time, leaving complementary viewing directions unused and producing representations that do not transfer to new volumes. We introduce VoxCor, a training-free fit–transform method for reusable volumetric feature representations from frozen 2D ViT foundation models. During an offline fitting phase, VoxCor combines triplanar ViT inference with a compact closed-form weighted partial least squares (WPLS) projection that uses fitting-time voxel correspondences to select modality-stable anatomical directions in the triplanar feature space. At transform time, new volumes are mapped by triplanar ViT inference and linear projection alone, without fine-tuning or registration. Voxel correspondences can then be queried directly by nearest-neighbor search. We evaluate VoxCor on intra-subject Abdomen MR–CT and inter-subject HCP T2w–T1w tasks using deformable registration, voxelwise k-nearest-neighbor segmentation, and segmentation-center landmark localization. VoxCor improves the hardest cross-subject, cross-modality transfer settings, reduces encoder sensitivity for dense correspondence transfer, and yields registration performance competitive with handcrafted descriptors and learned 3D features. This positions VoxCor as a reusable feature layer for downstream multimodal analysis beyond pairwise registration. Code, configuration files, and implementation details are publicly available on GitHub at this https URL (guneytombak/VoxCor).

[CV-9] BlitzGS: City-Scale Gaussian Splatting at Lightning Speed

链接: https://arxiv.org/abs/2605.13794
作者: Zhongtao Wang,Huishan Au,Yilong Li,Mai Su,Haojie Jin,Yisong Chen,Meng Gai,Fei Zhu,Guoping Wang
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present BlitzGS, a distributed 3DGS framework that reduces active Gaussian workload for fast city-scale reconstruction. BlitzGS manages this workload at three coupled levels. At the system level, the framework shards Gaussians across GPUs by index parity rather than spatial blocks. This approach mitigates the cross-block visibility redundancy inherent in spatial partitioning. Furthermore, it distributes each rendering step through a single cross-GPU exchange that routes projected Gaussians to their tile owners. At the model level, scheduled importance-scoring passes shrink the global Gaussian population. During these passes, the framework generates a per-Gaussian visibility weight to bias density-control updates toward contributing primitives and a per-view importance mask for the view-level renderer. At the view level, BlitzGS trims each camera’s active set with a distance-based LOD gate to exclude excessively fine primitives for the current frustum and the importance-based culling mask to skip Gaussians with negligible cross-view contribution. On large-scale benchmarks, BlitzGS matches the rendering quality of recent large-scale baselines while delivering an order-of-magnitude speedup, training city-scale scenes in tens of minutes. Our code is available at https://github.com/AkierRaee/BlitzGS.
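The system-level idea of sharding by index parity (modulo) rather than by spatial block can be sketched in a few lines; this is a hedged illustration of the partitioning scheme, not the paper's implementation:

```python
import numpy as np

def shard_by_index_parity(num_gaussians, num_gpus):
    """Assign Gaussian indices to GPUs by index modulo rather than by
    spatial block, so each GPU holds an interleaved slice of the scene
    and a camera's visible Gaussians never concentrate on one device."""
    idx = np.arange(num_gaussians)
    return [idx[idx % num_gpus == g] for g in range(num_gpus)]

shards = shard_by_index_parity(10, 4)
sizes = [len(s) for s in shards]  # near-equal load: [3, 3, 2, 2]
```

Spatial-block sharding would instead replicate boundary Gaussians that are visible from multiple blocks; interleaved assignment avoids that redundancy by construction.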

[CV-10] Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

链接: https://arxiv.org/abs/2605.13778
作者: Jiahui Niu,Kefan Gu,Yucheng Zhao,Shengwen Liang,Tiancai Wang,Xing Hu,Ying Wang,Huawei Li
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based vision-language-action models (dVLAs) are promising for embodied intelligence but are fundamentally limited in real-time deployment by the high latency of full inference. We propose Realtime-VLA FLASH, a speculative inference framework that eliminates most full inference calls during replanning by introducing a lightweight draft model with parallel verification via the main model’s Action Expert and a phase-aware fallback mechanism that reverts to the full inference pipeline when needed. This design enables low-latency, high-frequency replanning without sacrificing reliability. Experiments show that on LIBERO, FLASH largely preserves task performance by replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average inference latency to 19.1 ms (3.04x speedup). We additionally demonstrate effectiveness on real-world conveyor-belt sorting, highlighting its practical impact for latency-critical embodied tasks.
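The draft-verify-fallback control flow described above can be sketched with toy stand-ins (the real verification runs in parallel through the main model's Action Expert, and the fallback is phase-aware; both are simplified away here):

```python
def speculative_replan(obs, draft, verify, full_inference):
    """Speculative replanning sketch: a lightweight draft model proposes
    an action chunk; the main model verifies it; on reject we fall back
    to the full (slow) inference pipeline."""
    proposal = draft(obs)              # fast path (~7.8 ms in the paper)
    if verify(obs, proposal):
        return proposal, "speculative"
    return full_inference(obs), "fallback"  # slow path (~58 ms full round)

# toy models: the draft is accepted when its guess matches the target policy
action, path = speculative_replan(
    obs=5,
    draft=lambda o: o * 2,
    verify=lambda o, a: a == o * 2,
    full_inference=lambda o: o * 2,
)
```

The reported speedup comes from most replanning rounds taking the fast branch, with the fallback preserving reliability when the draft diverges.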

[CV-11] RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

链接: https://arxiv.org/abs/2605.13775
作者: Harold Haodong Chen,Sirui Chen,Yingjie Xu,Wenhang Ge,Ying-Cong Chen
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: On-going work

点击查看摘要

Abstract:The scalability of robotic manipulation is fundamentally bottlenecked by the scarcity of task-aligned physical interaction data. While vision-language models (VLMs) and video generation models (VGMs) hold promise for autonomous data synthesis, they suffer from semantic-spatial misalignment and physical hallucinations, respectively. To bridge this gap, we introduce RoboEvolve, a novel framework that couples a VLM planner and a VGM simulator into a mutually reinforcing co-evolutionary loop. Operating purely on unlabeled seed images, RoboEvolve leverages a cognitive-inspired dual-phase mechanism: (i) daytime exploration fosters physically grounded behavioral discovery through a semantic-controlled multi-granular reward, and (ii) nighttime consolidation mines “near-miss” failures to stabilize policy optimization. Guided by an autonomous progressive curriculum, the system naturally scales from simple atomic actions to complex tasks. Extensive experiments demonstrate that RoboEvolve (I) achieves superior effectiveness, elevating base planners by 30 absolute points and amplifying simulator success by 48% on average; (II) exhibits extreme data efficiency, surpassing fully supervised baselines with merely 500 unlabeled seeds–a 50x reduction; and (III) demonstrates robust continual learning without catastrophic forgetting.

[CV-12] Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception CVPR2026

链接: https://arxiv.org/abs/2605.13755
作者: Arka Bhowmick,Enes Ozeren,Ahmed Abdullah,Oliver Wasenmuller
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at SAIAD 2026 Workshop at CVPR 2026

点击查看摘要

Abstract:In recent years, autonomous driving has significantly increased the demand for high-quality data to train 2D and 3D perception models for safety-critical scenarios. Real world datasets struggle to meet this demand as requirements continuously evolve and large-scale annotated data collection remains costly and time-consuming, making synthetic data a scalable, practical and controllable alternative. Pedestrian detection is among the most safety-critical tasks in autonomous driving. In this paper, we propose a simple yet effective method for scaling variability in 3D pedestrian assets for synthetic scene generation. Starting from a single 3D base asset, we generate multiple distinct pedestrian instances by synthesizing diverse facial textures and identity-level appearance variations using StyleGAN2 and automatically mapping them onto 3D meshes. This approach enables scalable appearance-level asset diversification without requiring the design of new geometries for each instance. Using the assets, we construct synthetic datasets and study the impact of mixing real and synthetic data for RGB-based object detection. Through complementary experiments, we analyze geometry-driven distribution shifts in point cloud perception for 3D object detection. Our findings demonstrate that controlled synthetic diversification improves robustness in 2D detection while revealing the sensitivity of 3D perception models to geometric domain gaps. Overall, this work highlights how generative AI enables scalable, simulation-ready pedestrian diversification through controlled facial texture synthesis, along with the benefits and limitations of cross-domain training strategies in autonomous driving pipelines.

[CV-13] Min Generalized Sliced Gromov Wasserstein: A Scalable Path to Gromov Wasserstein

链接: https://arxiv.org/abs/2605.13753
作者: Ashkan Shahbazi,Xinran Liu,Ping He,Soheil Kolouri
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose min Generalized Sliced Gromov–Wasserstein (min-GSGW), a sliced formulation for the Gromov–Wasserstein (GW) problem using expressive generalized slicers. The key idea is to learn coupled nonlinear slicers that assign compatible push-forward values to both input measures, so that monotone coupling in the projected domain lifts to a transport plan evaluated against the GW objective in the original spaces. The resulting plan induces a GW objective value, and min-GSGW minimizes this cost directly in the original spaces. We further show that min-GSGW is rigid-motion invariant, a crucial property for geometric matching and shape analysis tasks. Our contributions are threefold: 1) we introduce generalized slicers into the sliced GW framework, 2) we construct a slicing-based efficient GW transport plan; and 3) we develop an amortized variant that replaces per-instance optimization with a learned slicer for unseen input pairs. We perform experiments on animal mesh matching, horse mesh interpolation, and ShapeNet part transfer. Results show that min-GSGW produces meaningful geometric correspondences and GW objective values at substantially lower computational cost than existing GW solvers.
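The core lifting step described above — monotone coupling in the projected domain inducing a transport plan that is then evaluated under the GW objective in the original spaces — can be sketched as follows. A linear projection stands in for the paper's learned nonlinear slicers, and uniform weights are assumed:

```python
import numpy as np

def monotone_plan(X, Y, slicer_x, slicer_y):
    """Sort both point sets by their slicer values (linear here; coupled
    nonlinear and learned in the paper); the monotone coupling in 1D
    lifts to a permutation plan between the original spaces."""
    rx = np.argsort(X @ slicer_x)   # ranks of X's push-forward values
    ry = np.argsort(Y @ slicer_y)   # ranks of Y's push-forward values
    sigma = np.empty(len(X), dtype=int)
    sigma[rx] = ry                  # i-th smallest of X -> i-th smallest of Y
    return sigma

def gw_objective(X, Y, sigma):
    """Evaluate the induced plan against the (squared) GW objective on
    the pairwise-distance matrices of the original spaces."""
    Dx = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    Dy = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    return float(((Dx - Dy[np.ix_(sigma, sigma)]) ** 2).mean())
```

min-GSGW then minimizes this objective over the slicer parameters; the sketch only shows the forward evaluation for a fixed slicer pair.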

[CV-14] Weakly-Supervised Spatiotemporal Anomaly Detection

链接: https://arxiv.org/abs/2605.13746
作者: Urvi Gianchandani,Praveen Tirupattur,Mubarak Shah
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we explore a weakly supervised method for anomaly detection. Since annotating videos is time-consuming, we only look at weak video-level labels during training. This means that given a video, we know that it is either normal or contains an anomaly, but no further annotations are used to train the network. Features are extracted from video clips that are either normal or anomalous. These features are used to determine anomaly scores for spatiotemporal regions of the clips based on a classifier and the implementation of a multiple instance ranking loss (MIL). We represent both anomalous and normal video clips as positive and negative bags, respectively, to apply MIL. Furthermore, since anomalies are usually localized to a part of a frame rather than the whole frame, we chose to explore temporal as well as spatial anomaly detection. We show our results on the UCF Crime2Local Dataset, which contains spatiotemporal annotations for a portion of the UCF Crime Dataset.
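The multiple instance ranking loss over positive (anomalous) and negative (normal) bags has a standard hinge form, sketched below; the paper may add regularizers on top of this base term:

```python
def mil_ranking_loss(anomalous_scores, normal_scores, margin=1.0):
    """Hinge ranking loss over bags: the highest-scoring instance of the
    positive (anomalous) bag should outscore the highest-scoring instance
    of the negative (normal) bag by at least `margin`."""
    return max(0.0, margin - max(anomalous_scores) + max(normal_scores))

# a partially separated pair still incurs a loss of 1.0 - 0.9 + 0.3 = 0.4
loss = mil_ranking_loss([0.2, 0.9, 0.4], [0.1, 0.3])
```

Taking the max within each bag is what lets the loss operate under weak, video-level labels: only the most anomalous-looking region of each clip needs to be ranked, with no region-level annotation.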

[CV-15] Aligning Network Equivariance with Data Symmetry: A Theoretical Framework and Adaptive Approach for Image Restoration

链接: https://arxiv.org/abs/2605.13744
作者: Feiyu Tan,Qi Xie,Zongben Xu,Deyu Meng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 9 figures, Supplementary Material can be found at this https URL

点击查看摘要

Abstract:Image restoration is an inherently ill-posed inverse problem. Equivariant networks that embed geometric symmetry priors can mitigate this ill-posedness and improve performance. However, current understanding of the relationship between network equivariance and data symmetry remains largely heuristic. Particularly for real-world data with imperfect symmetry, existing research lacks a systematic theoretical framework to quantify symmetry, select transformation groups, or evaluate model-data alignment. To bridge this gap, we conduct an analysis from an optimization perspective and formalize the intrinsic relationship among data symmetry priors, model equivariance, and generalization capability. Specifically, we propose for the first time a quantifiable definition of non-strict symmetry at the dataset level (rather than the sample level) and use it as a constraint to formulate the restoration inverse problem. We then show that the equivariance of restoration models can be naturally derived from this inverse problem incorporating the proposed symmetry constraints, and that the equivariance error of the optimal restoration operator is strictly bounded by the data symmetry error and the discretization mesh size. Furthermore, by analyzing the network’s empirical risk, we demonstrate that aligning equivariance with data symmetry optimizes the bias-variance trade-off, minimizing the total expected risk. Guided by these insights, we propose a Sample-Adaptive Equivariant Network that uses a hypernetwork and transformation-learnable equivariant convolutions to dynamically align with each sample’s inherent symmetry. Extensive experiments on super-resolution, denoising, and deraining validate our theoretical findings and show significant superiority over standard baselines and traditional equivariant models. Our code and supplementary material are available at this https URL.
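The equivariance error that the bound above controls can be measured empirically as the commutation residual of an operator with a transformation; a minimal sketch (the flip transformation and the toy operators are illustrative, not from the paper):

```python
import numpy as np

def equivariance_error(f, g, x):
    """Empirical equivariance error of an operator f under a
    transformation g: ||f(g(x)) - g(f(x))||. Zero means f is exactly
    equivariant to g on the input x."""
    return float(np.linalg.norm(f(g(x)) - g(f(x))))

x = np.arange(8.0)
flip = lambda a: a[::-1]                 # a simple geometric transformation
square = lambda a: a ** 2                # elementwise ops commute with flip
err0 = equivariance_error(square, flip, x)    # exactly 0
err1 = equivariance_error(np.cumsum, flip, x) # order-dependent op: > 0
```

For real restoration networks the same residual, averaged over a dataset and a transformation group, gives a concrete way to check how well a model's equivariance matches the data's (possibly non-strict) symmetry.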

[CV-16] LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction

链接: https://arxiv.org/abs/2605.13741
作者: Christina Kassab,Hyeonjae Gil,Matías Mattamala,Ayoung Kim,Maurice Fallon
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed – enabling scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from the Habitat-Matterport 3D and self-collected egocentric office sequences. We evaluate its performance against existing feed-forward SLAM methods, as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: this https URL.

[CV-17] Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography

链接: https://arxiv.org/abs/2605.13730
作者: Christos Chrysanthos Nikolaidis,Vasileios Sachpekidis,Nikolas Moustakidis,Theofilos Moustakidis,Pavlos S. Efraimidis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transthoracic echocardiography (TTE) is the first-line imaging modality for diagnosing bicuspid aortic valve (BAV), yet diagnostic performance varies with operator expertise and image quality. We developed an explainable AI model that distinguishes BAV from tricuspid aortic valves (TAV) using routinely acquired parasternal long-axis (PLAX) cine loops. A multi-backbone video ensemble was trained and evaluated using a leakage-aware, stratified outer cross-validation protocol on N=90 patient studies (48 BAV, 42 TAV). Across fixed outer splits and 10 random seeds, the calibrated stacked ensemble achieved an outer-CV F1-score of 0.907 and recall of 0.877. Frame-level Grad-CAM localized salient evidence to the aortic root and leaflet plane, while globally aggregated SHAP values quantified each video backbone’s contribution to the stacked prediction, enabling transparent, case-level auditability. These findings indicate that PLAX-based video ensembles can support reliable BAV/TAV classification from routine echocardiographic cine loops and may facilitate earlier detection in non-specialist or resource-limited clinical settings.

[CV-18] Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation

链接: https://arxiv.org/abs/2605.13729
作者: Deli Cai,Haoyang Ma,Changxing Ding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that effectively coordinates text and trajectory conditions through a divide-and-conquer strategy comprising two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints from the given trajectories, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on HumanML3D and KIT datasets demonstrate that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.

[CV-19] AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

链接: https://arxiv.org/abs/2605.13724
作者: Yuchao Gu,Guian Fang,Yuxin Jiang,Weijia Mao,Song Han,Han Cai,Mike Zheng Shou
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page at this https URL

点击查看摘要

Abstract:Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. This limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any-step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping (z_t\rightarrow z_0) to flow-map transition learning (z_t\rightarrow z_r) over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors (i.e., discretization error in few-step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow achieves performance that matches or surpasses consistency-based counterparts in the few-step regime, while scaling with sampling step budgets.
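The decomposition of a full rollout into shortcut flow-map transitions rests on the composition property of flow maps: chaining the map over sub-intervals must reproduce the map over the full interval. A toy ODE with a known exact flow map makes this concrete (illustrative only, not the paper's sampler):

```python
import math

def rollout(z, time_grid, flow_map):
    """Decompose a full rollout over a time grid into shortcut flow-map
    transitions z_t -> z_r, one call per sub-interval (the structure
    behind Flow Map Backward Simulation)."""
    for t, r in zip(time_grid[:-1], time_grid[1:]):
        z = flow_map(z, t, r)
    return z

# the toy ODE dz/dt = -z has the exact flow map z_r = z_t * exp(t - r),
# so composing sub-interval maps must equal the one-shot map
exact = lambda z, t, r: z * math.exp(t - r)
grid = [1.0, 0.75, 0.5, 0.25, 0.0]     # diffusion-style: t=1 down to t=0
z_multi = rollout(2.0, grid, exact)
z_one_shot = exact(2.0, 1.0, 0.0)
```

A learned flow map only approximates this property, which is why training on transitions over arbitrary intervals (rather than only endpoint mappings) is what lets the same model serve any step budget at test time.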

[CV-20] Learning to Optimize Radiotherapy Plans via Fluence Maps Diffusion Model Generation and LSTM-based Optimization MICCAI2026

链接: https://arxiv.org/abs/2605.13713
作者: Isabella Poles,Simon Arberet,Riqiang Gao,Martin Kraus,Marco D. Santambrogio,Florin C. Ghesu,Ali Kamen,Dorin Comaniciu
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Early Accept at MICCAI 2026

点击查看摘要

Abstract:Volumetric Modulated Arc Therapy (VMAT) is a cornerstone of modern radiation therapy, enabling highly conformal tumor irradiation and healthy-tissue sparing. Yet, its planning solves inverse and nested optimization for multi-leaf collimators, monitor units and dose parameters, while enforcing their consistency to ensure mechanical deliverability. Nevertheless, this process often requires repeated re-optimization when treatment configurations change, resulting in substantial planning time per patient. To address these problems, we present a diffusion-driven Learning-to-Optimize (L2O) method for end-to-end VMAT planning. A distribution-matching distilled diffusion model learns a clinically feasible manifold of fluence maps, enabling their one-shot generation. On top of this, an LSTM-based L2O module learns gradient update dynamics to swiftly refine fluence maps toward prescribed dose objectives during inference. Experimental results on clinical and public prostate cancer cohorts demonstrate improved planning efficiency, flexibility, and machine deliverability over currently available end-to-end VMAT planners.
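The LSTM-based L2O module follows the generic learned-optimizer pattern: a learned module maps the current gradient (and a recurrent state) to a parameter update. A sketch with a hypothetical stateless update rule standing in for the LSTM:

```python
def l2o_refine(x, grad_fn, learned_update, state=None, steps=5):
    """Generic learning-to-optimize inner loop: at each step a learned
    module (an LSTM in the paper; a stand-in here) maps the current
    gradient and its recurrent state to an additive parameter update."""
    for _ in range(steps):
        g = grad_fn(x)
        delta, state = learned_update(g, state)
        x = x + delta
    return x

# stand-in "learned" rule: plain scaled gradient descent on a 1-D quadratic
toy_update = lambda g, s: (-0.1 * g, s)
x_star = l2o_refine(0.0, lambda x: 2.0 * (x - 3.0), toy_update, steps=20)
# x_star approaches the minimizer 3.0
```

In the paper the update rule is trained so that fluence maps converge toward the prescribed dose objectives in few inner steps, which is where the inference-time speedup comes from.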

[CV-21] MedCore: Boundary-Preserving Medical Core Pruning for MedSAM

链接: https://arxiv.org/abs/2605.13688
作者: Cenwei Zhang,Suncheng Xiang,Lei You
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 3 figures, 17 pages

点击查看摘要

Abstract:Medical segmentation foundation models such as SAM and MedSAM provide strong prompt-driven segmentation, but their image encoders are still too large for many clinical settings. Compression is also risky in medicine because a model can keep high Dice while losing boundary fidelity. We propose MedCore, a structured pruning framework for MedSAM. The main idea is to preserve two kinds of structures: structures that became important during SAM-to-MedSAM adaptation, and structures that have high boundary leverage. We identify the first type by a dual-intervention score that compares zeroing a group with resetting it to its original SAM weight. We identify the second type by boundary-aware Fisher estimation. We also introduce a boundary leverage principle, which shows that compression-induced boundary displacement is controlled by logit perturbation on the boundary divided by the logit spatial gradient. This principle explains why boundary metrics can degrade even when Dice remains high. On polyp segmentation benchmarks, MedCore reduces parameters by 60.0% and FLOPs by 58.4% while achieving Dice 0.9549, Boundary F1 0.6388, and HD95 5.14 after recovery fine-tuning. It also reaches 86.6% parameter reduction and 90.4G FLOPs with strong boundary quality. Our analysis further shows that MedSAM lies in a head-fragile boundary regime: head-pruning steps have 2.887 times larger 95th-percentile boundary leverage than MLP-pruning steps, and this logit-level effect is consistent with BF1 and HD95 degradation. Our code is available at this https URL.
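The boundary leverage principle quoted above is a first-order level-set statement. Assuming the segmentation boundary is the zero level set of the logit field $f$ (an assumption consistent with, but not stated in, the abstract), a sketch of the argument is:

```latex
% First-order expansion: a logit perturbation \Delta f moves the zero
% level set of f by a displacement \delta along the normal direction.
% Setting f(x + \delta n) + \Delta f(x) \approx 0 with f(x) = 0 gives
\[
  \delta(x) \;\approx\; \frac{\lvert \Delta f(x) \rvert}{\lVert \nabla f(x) \rVert},
  \qquad x \in \{\, f = 0 \,\}.
\]
```

This explains the abstract's observation that Dice can stay high while boundary metrics degrade: wherever the spatial gradient $\lVert \nabla f \rVert$ is small, even a modest compression-induced logit perturbation produces a large boundary displacement.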

[CV-22] Cross Modality Image Translation In Medical Imaging Using Generative Frameworks

链接: https://arxiv.org/abs/2605.13686
作者: Giulia Romoli,Alessia Capoccia,Filippo Ruffini,Francesco Di Feola,Luca Boldrini,Arturo Chiti,Renato Cuocolo,Tugba Akinci D’Antonoli,Fatemeh Darvizeh,Marcello Di Pumpo,Bradley J. Erickson,Liu Fang,Deborah Fazzini,Paola Feraco,Fabrizia Gelardi,Francesco Gossetti,Ana Isabel Hernáiz Ferrer,Michail E. Klontzas,Seyedmehdi Payabvash,Katrine Riklund,Sara N. Strandberg,Valerio Guarrasi,Paolo Soda
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical image-to-image (I2I) translation enables virtual scanning, i.e. the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups and lack clinical validation. The primary contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also performed a Visual Turing test administered to 17 physicians, including 15 radiologists, which shows near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.

[CV-23] Characterizing Universal Object Representations Across Vision Models

链接: https://arxiv.org/abs/2605.13675
作者: Florian P. Mahner,Johannes Roth,Ka Chun Lam,Michael F. Bonner,Francisco Pereira,Martin N. Hebart
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Deep neural networks trained with different architectures, objectives, and datasets have been reported to converge on similar visual representations. However, what remains unknown is which visual properties models actually converge on and which factors may underlie this convergence. To address this, we decompose the object similarity structure of 162 diverse vision models into a small set of non-negative dimensions. To determine universal versus model-specific dimensions, we then estimate how often each dimension reappears across models. In contrast to model-specific dimensions, universal dimensions are more interpretable and more strongly driven by conceptual image properties, indicating the relevance of interpretability and semantic content as implicit factors driving universality across models. Differences in architecture, objective function, training data, model size, and model performance do not explain the emergence of universal dimensions. However, models with more universal dimensions also better predict macaque IT activity and human similarity judgments, suggesting that universality reflects representations relevant to biological vision. These findings have important implications for understanding the emergent representations underlying deep neural network models and their alignment with biological vision.
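
摘要提到将多个视觉模型的物体相似度结构分解为一组非负维度。下面给出一个通用的对称非负矩阵分解(symmetric NMF)示意代码,仅用于说明这类分解的基本形式;阻尼系数、迭代次数与初始化均为假设,并非论文作者的实际拟合流程:

```python
import numpy as np

def nonneg_dimensions(S, k, iters=300, eps=1e-9, seed=0):
    """Factor a symmetric similarity matrix S ~ W @ W.T with W >= 0,
    using damped multiplicative updates. Generic symmetric-NMF sketch;
    NOT the exact fitting procedure used in the paper."""
    rng = np.random.default_rng(seed)
    W = rng.random((S.shape[0], k)) + 0.1  # positive init keeps updates well-defined
    for _ in range(iters):
        num = S @ W
        den = W @ (W.T @ W) + eps
        W *= 0.5 + 0.5 * num / den  # damped update preserves non-negativity
    return W
```

得到的每一列 `W[:, j]` 可视为一个非负“维度”,其在不同模型间重复出现的频率即可用于区分通用维度与模型特有维度(此解释同样是对摘要思路的示意性转述)。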

[CV-24] Weakly Supervised Segmentation as Semantic-Based Regularization

链接: https://arxiv.org/abs/2605.13674
作者: Stefano Colamonaco,Andrei-Bogdan Florea,Jaron Maene
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Weakly supervised semantic segmentation (WSSS) trains dense pixel-level segmentation models from partial or coarse annotations such as bounding boxes, scribbles, or image-level tags. While recent work leverages foundation models such as the Segment Anything Model (SAM) to generate pseudo-labels, these approaches typically depend on heuristic prompt choices and offer limited ways to incorporate prior knowledge or heterogeneous labels. We address this gap by taking a neurosymbolic perspective: integrating differentiable fuzzy logic with deep segmentation models. Weak annotations and domain-specific priors are unified as continuous logical constraints that fine-tune SAM under weak supervision. The refined foundation model then produces improved pseudo-labels, from which we train a second-stage prompt-free segmentation model. Experiments on Pascal VOC 2012 and the REFUGE2 optic disc/cup segmentation dataset show that our logic-guided fine-tuning yields higher-quality pseudo-labels, leading to state-of-the-art segmentation accuracy that often exceeds densely supervised baselines.

[CV-25] SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

链接: https://arxiv.org/abs/2605.13672
作者: Giries Abu Ayoub,Morad Tukan,Loay Mualem
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.

[CV-26] Pattern-Enhanced RT-DETR for Multi-Class Battery Detection

链接: https://arxiv.org/abs/2605.13670
作者: Xu Zhong,Enyuan Hu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 3 figures

点击查看摘要

Abstract:Accurate and efficient battery detection is increasingly important for applications in electronic waste recycling, industrial quality control, and automated sorting systems. In this paper, we present both a comprehensive benchmark and a novel method for multi-class battery detection. We systematically compare three CNN-based detectors (YOLOv8n, YOLOv8s, YOLO11n) and two transformer-based detectors (RT-DETR-L, RT-DETR-X) on a publicly available dataset of approximately 8,591 annotated images under identical experimental conditions, and further propose PaQ-RT-DETR, which introduces pattern-based dynamic query generation into RT-DETR to alleviate query activation imbalance with negligible computational overhead. Among baselines, YOLO11n achieves the best CNN-based accuracy (mAP@50: 0.779) at only 2.6M parameters, while YOLOv8n delivers the fastest inference at ~1,667 FPS. PaQ-RT-DETR-X achieves the highest overall mAP@50 of 0.782, surpassing RT-DETR-X by +2.8% with consistent per-class gains across all six battery categories including the data-scarce Bike Battery class. Our findings provide practical guidance for selecting object detection models in battery-related industrial applications.

[CV-27] SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

链接: https://arxiv.org/abs/2605.13667
作者: Vladislav Makarov,Mark Gizetdinov,Dmitry Yudin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: this https URL.

[CV-28] HADAR-Based Thermal Infrared Hyperspectral Image Restoration

链接: https://arxiv.org/abs/2605.13664
作者: Cheng Dai,Jiale Lin,Bingxuan Song,Yifei Chen,Jiashuo Chen,Xin Yuan,Fanglin Bao
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注: 17 pages, 18 figures

点击查看摘要

Abstract:Thermal-infrared (TIR) hyperspectral imagery (HSI) provides critical scene information for various applications. However, its practical utility is severely limited by unique sensor degradations beyond the capabilities of existing restoration methods, which are ignorant of underlying thermal physics. Here, we propose HAIR (HADAR-based Image Restoration) as a physics-driven framework for ground-based TIR-HSI restoration. HAIR utilizes the HADAR rendering equation (HRE) and combines it with the atmospheric downwelling radiative transfer equation (RTE) to model TIR-HSI using temperature, emissivity, and texture (TeX) physical triplets. This physical model leads to a TeX decompose-synthesize strategy that guarantees physical consistency and spatio-spectral noise resilience, in stark contrast to existing approaches. Moreover, our framework uses a forward-modeled atmospheric downwelling reference, along with spectral smoothness of emissivity and blackbody radiation, to enable spectral calibration and generation that would otherwise be elusive. Our extensive experiments on the outdoor DARPA Invisible Headlights dataset and in-lab FTIR measurements show that HAIR consistently outperforms state-of-the-art methods across denoising, inpainting, spectral calibration, and spectral super-resolution, establishing a benchmark in objective accuracy and visual quality.

[CV-29] Guide Think Act: Interactive Embodied Reasoning in Vision-Language-Action Models

链接: https://arxiv.org/abs/2605.13632
作者: Yiran Ling,Qing Lian,Jinghang Li,Qing Jiang,Tianming Zhang,Xiaoke Jiang,Chuanxiu Liu,Jie Liu,Lei Zhang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct “Sense-to-Act” mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here: this https URL

[CV-30] WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning

链接: https://arxiv.org/abs/2605.13621
作者: Chunjin Yang,Xiwei Zhang,Yiming Xiao,Fanman Meng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared-visible object detection improves detection performance by combining complementary features from multispectral images. Existing backbone-specific and backbone-shared approaches still suffer from the problems of severe bias of modality-shared features and the insufficiency of modality-specific features. To address these issues, we propose a novel detection framework WD-FQDet that explicitly decouples modality-shared and modality-specific information from infrared and visible modalities in the new view of low- and high-frequency domains, allowing fusion strategies tailored to their frequency characteristics. Specifically, a low-frequency homogeneity alignment module is proposed to align modality-shared features across modalities via a cross-modal attention mechanism, and a high-frequency specificity retention module is proposed to preserve modality-specific features through the multi-scale gradient consistency loss. To reinforce the feature representation in the frequency domain, we propose a hybrid feature enhancement module that incorporates spatial cues. Furthermore, considering that the contributions of homogeneous and modality-specific features to object detection vary across scenarios, we propose a frequency-aware query selection module to dynamically regulate their contributions. Experimental results on the FLIR, LLVIP, and M3FD datasets demonstrate that WD-FQDet achieves state-of-the-art performance across multiple evaluation metrics.

[CV-31] Rethinking Graph Convolution for 2D-to-3D Hand Pose Lifting

链接: https://arxiv.org/abs/2605.13604
作者: Chanyoung Kim,Donghyun Kim,Dong-Hyun Sim,Seong Jae Hwang,Youngjoong Kwon
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Graph convolutional networks (GCNs) are widely used for 3D hand pose estimation, where the hand skeleton is encoded as a fixed adjacency graph. We revisit whether this is the most effective way to incorporate hand topology in 2D-to-3D lifting. In this paper, we perform controlled, parameter-matched ablations on the FPHA benchmark and show that standard multi-head self-attention consistently outperforms GCN baselines. Even when the GCN is strengthened with multi-hop adjacency and matched parameter count, self-attention reduces MPJPE from 12.36 mm to 10.09 mm. A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of improvement, while fully connected attention yields additional gains. We further show that hand topology is most effective when introduced as a soft structural prior through graph-distance positional encoding, rather than as a hard adjacency constraint. These results suggest that, for hand pose lifting, adaptive spatial attention is a more effective inductive bias than fixed graph convolution.
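
摘要中提到的“图距离位置编码”(graph-distance positional encoding)本质上是骨架图上的全对最短跳数距离,可通过对每个关节做 BFS 得到。下面是一个最小示意(其中的边表是假设的玩具拓扑,并非真实的 21 关节手部骨架):

```python
from collections import deque
import numpy as np

def graph_distances(num_nodes, edges):
    """All-pairs shortest-path hop distances on an unweighted skeleton
    graph, via BFS from every node. Unreachable pairs stay at -1."""
    adj = [[] for _ in range(num_nodes)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    D = np.full((num_nodes, num_nodes), -1, dtype=int)
    for s in range(num_nodes):
        D[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if D[s, v] < 0:          # first visit = shortest hop count
                    D[s, v] = D[s, u] + 1
                    q.append(v)
    return D
```

距离矩阵 `D` 随后可映射为可学习的嵌入或注意力偏置,作为摘要所说的“软结构先验”;这一用法是对摘要的示意性解读。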

[CV-32] Sparse Code Uplifting for Efficient 3D Language Gaussian Splatting

链接: https://arxiv.org/abs/2605.13600
作者: Lovre Antonio Budimir,Yushi Guan,Steve Ryhner,Sven Lončarić,Nandita Vijaykumar
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages (9 pages main paper), 10 figures, preprint

点击查看摘要

Abstract:3D Language Gaussian Splatting (3DLGS) augments 3D Gaussian Splatting with language-aligned visual features for open-vocabulary 3D scene understanding. A core challenge is efficiently associating high-dimensional vision-language embeddings with millions of 3D Gaussians while preserving efficient feature rendering for text-based querying. Existing methods either store dense features directly on Gaussians, causing high storage costs and slow rendering, or learn compact representations through expensive per-scene optimization with repeated feature rasterization. No existing method simultaneously achieves fast 3D semantic reconstruction, efficient storage, and fast rendering. We propose SCOUP (Sparse COde UPlifting), which addresses all three by decoupling language representation learning from 3D Gaussian optimization. Rather than working directly in 3D, we learn sparse codebook-based representations entirely using features associated with 2D image regions, associating each region with a sparse set of codebook coefficients. We then uplift these coefficients to 3D Gaussians with our weighted sparse aggregation using Gaussian-to-pixel associations, where each Gaussian accumulates coefficients over codebook atoms across views. Top-K filtering then extracts the most dominant multi-view coefficients per Gaussian, enabling efficient storage and fast rendering. Our method achieves up to 400× training speedup while being 3× more memory efficient during training compared to the state-of-the-art in rendering speed. Across multiple benchmarks, SCOUP matches or outperforms existing methods in open-vocabulary querying accuracy.
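
摘要中“每个高斯累积多视角系数后做 Top-K 过滤”的步骤可用如下示意代码说明;其中对保留系数重新归一化是为便于说明所做的假设,不一定是 SCOUP 的实际后处理:

```python
import numpy as np

def topk_sparse_coeffs(accum, k):
    """Keep only the k largest accumulated codebook coefficients per
    Gaussian (row), zeroing the rest and renormalizing. Sketch of a
    generic top-K filtering step, not SCOUP's exact implementation."""
    idx = np.argpartition(accum, -k, axis=1)[:, -k:]  # top-k indices per row (unordered)
    out = np.zeros_like(accum)
    rows = np.arange(accum.shape[0])[:, None]
    out[rows, idx] = accum[rows, idx]
    sums = out.sum(axis=1, keepdims=True)
    return out / np.where(sums > 0, sums, 1.0)  # assumed renormalization
```

例如对一行累积系数 `[0.1, 0.5, 0.3, 0.2]` 取 `k=2`,仅保留 0.5 与 0.3 两项并归一化,其余置零,从而实现摘要所述的稀疏存储。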

[CV-33] Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes

链接: https://arxiv.org/abs/2605.13591
作者: Kaicong Huang,Talha Azfar,Weisong Shi,Ruimin Ke
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable autonomous driving relies on large-scale, well-labeled data and robust models. However, manual data collection is resource-intensive, and traditional simulation suffers from a persistent reality gap. While recent generative frameworks and radiance-field methods improve visual fidelity, they still struggle with temporal and spatial consistency and cannot ensure physics-aware behavior, limiting their applicability to driving scenario generation. To address these challenges, we propose Real2Sim, an unified framework that combines 4D Gaussian Splatting (4DGS) with a differentiable Material Point Method (MPM) solver. Real2Sim explicitly reconstructs dynamic driving scenes as temporally continuous Gaussian primitives, supports instance-level editing, and simulates realistic object-object and object-environment interactions. This framework enables physics-aware, high-fidelity synthesis of diverse, editable scenarios, including challenging corner cases such as collisions and post-impact trajectories. Experiments on the Waymo Open Dataset validate Real2Sim’s capabilities in rendering, reconstruction, editing, and physics simulation, demonstrating its potential as a scalable tool for data generation in downstream tasks such as perception, tracking, trajectory prediction, and end-to-end policy learning.

[CV-34] HetScene: Heterogeneity-Aware Diffusion for Dense Indoor Scene Generation

链接: https://arxiv.org/abs/2605.13586
作者: Zini Chen,Junming Huang,Rong Zhang,Jiamin Xu,Cheng Peng,Chi Wang,Weiwei Xu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generating controllable and physically plausible indoor scenes is a pivotal prerequisite for constructing high-fidelity simulation environments for embodied AI. However, existing deep learning-based methods usually treat all objects as homogeneous instances within a unified generation process. While effective for sparse and simplistic layouts, they struggle to model realistic layouts with dense object arrangements and complex spatial dependencies, leading to limited scalability and degraded physical plausibility. To deal with these challenges, we revisit indoor layout generation from the perspective of structural heterogeneity and decompose the objects into primary objects and secondary objects according to their distinct roles in shaping a scene. Based on this decomposition, we propose HetScene, a heterogeneous two-stage generation framework that decouples indoor layout synthesis into Structural Layout Generation (SLG) and Contextual Layout Generation (CLG). SLG first generates globally coherent structural layouts with only primary objects conditioned on text descriptions, top-down binary room masks, and spatial relation graphs, establishing a stable global macro-skeleton of large core furniture.

[CV-35] Phy-CoSF: Physics-Guided Continuous Spectral Fields Reconstruction and Super-Resolution for Snapshot Compressive Imaging ICML2026

链接: https://arxiv.org/abs/2605.13583
作者: Wudi Chen,Zhiyuan Zha,Xin Yuan,Shigang Wang,Bihan Wen,Jiantao Zhou,Gang Yan,Zipei Fan,Ce Zhu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures, accepted by ICML 2026!

点击查看摘要

Abstract:Recent advances have demonstrated that coded aperture snapshot spectral imaging (CASSI) systems show great potential for capturing 3D hyperspectral images (HSIs) from a single 2D measurement. Despite the inherent spectral continuity of scenes captured by CASSI, most existing reconstruction methods are restricted to fixed, discrete spectral outputs, thereby precluding continuous spectral reconstruction or spectral super-resolution. To address this challenge, we propose Phy-CoSF, which synergizes deep unfolding networks with implicit neural representations, establishing a new paradigm for continuous spectral reconstruction and super-resolution in CASSI. Specifically, we propose a two-phase architecture that bridges discrete-wavelength training with continuous spectral rendering, enabling the synthesis of high-fidelity HSIs at arbitrary target wavelengths. At the core of our framework lies the continuous spectral fields (CoSF) module, embedded within each unfolding stage as a dynamic prior, which comprises a triple-branch cross-domain feature mixer for comprehensive spatial-frequency-channel feature fusion, alongside a spectral synthesis head that generates spectral intensities by querying continuous wavelength coordinates. Extensive experimental results demonstrate that Phy-CoSF not only achieves continuous modeling at arbitrary spectral resolutions but also outperforms many state-of-the-art methods in both reconstruction fidelity and spectral detail preservation. Our code and more results are available at: this https URL.

[CV-36] HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation

链接: https://arxiv.org/abs/2605.13581
作者: Li Pang,Heng Zhao,Yijia Zhang,Deyu Meng,Xiangyong Cao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral image (HSI) restoration is crucial for reliable analysis, as real HSIs suffer from degradations like noise, blur, and resolution loss. However, existing models trained on source data often fail on target domains lacking clean references, a common occurrence in practice. To address this issue, we present HIR-ALIGN, a plug-and-play target-adaptive augmentation framework that enhances hyperspectral image restoration by augmenting limited training images with synthetic data that closely matches the target distribution using no extra data. It consists of three stages: (i) proxy generation, where off-the-shelf restoration models restore degraded target observations to produce semantics-preserving proxy HSIs that approximate target-domain clean images; (ii) distribution-adaptive synthesis, where a blur-robust unCLIP diffusion model generates target-aligned RGBs from proxy RGBs, with prompt conditioning and embedding-space noise initialization. Then, a warp-based spectral transfer module synthesizes HSIs by aligning each generated RGB with the proxy RGB, estimating soft patch-wise transport weights, and applying these weights and learnable local interpolation kernels to the proxy HSI; and (iii) aligned supervised finetuning, where restoration networks pretrained on the source distribution are finetuned using both the proxy HSIs and synthesized target-aligned HSIs, and are then deployed on degraded target images. We further provide theoretical analysis showing that augmentation-based finetuning can achieve lower target-domain restoration risk by jointly improving target distribution coverage and controlling spectral bias. Extensive experiments on simulated and real datasets across denoising and super-resolution tasks demonstrate that HIR-ALIGN consistently improves source-only supervised baselines, outperforming both source-only counterparts and representative unsupervised methods.

[CV-37] Qwen-Image-VAE-2.0 Technical Report

链接: https://arxiv.org/abs/2605.13565
作者: Zekai Zhang,Deqing Li,Kuan Cao,Yujia Wu,Chenfei Wu,Yu Wu,Liang Peng,Hao Meng,Jiahao Li,Jie Zhang,Kaiyuan Gao,Kun Yan,Lihan Jiang,Ningyuan Tang,Shengming Yin,Tianhe Wu,Xiao Xu,Xiaoyue Chen,Yan Shu,Yanran Zhang,Yilei Chen,Yixian Xu,Yuxiang Chen,Zhendong Wang,Zihao Liu,Zikai Zhou,Yiliang Gu,Yi Wang,Xiaoxiao Xu,Lin Qu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratio. Furthermore, downstream DiT experiments reveal our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability.

[CV-38] CA-GCL: Cross-Anatomy Global-Local Contrastive Learning for Robust 3D Medical Image Understanding

链接: https://arxiv.org/abs/2605.13544
作者: Hanwen Zhang,Yao Liu,Die Dai,Jiaye Yang,Qiao Liu,Yutong Xie,Peng Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained Vision-Language Pre-training (FVLP) demonstrates significant potential in 3D medical image understanding by aligning anatomy-level visual representations with corresponding textual descriptions. However, existing FVLP paradigms often suffer from severe representation collapse in the textual embedding space, where text embeddings of distinct anatomical structures become highly clustered and indistinguishable. This distributional degeneracy renders the model hypersensitive to prompt variations, hindering reliable clinical deployment. To address these challenges, we propose a novel Cross-Anatomy Global-Local Contrastive Learning framework (CA-GCL). CA-GCL introduces a global contrastive objective that enforces separation between anatomical categories in the latent space, effectively counteracting the aggregation tendency induced by local alignment. Furthermore, we incorporate a clinical-aware text augmentation strategy based on permutation invariance and partial completeness to enhance robustness against descriptive incompleteness. Extensive evaluations on the CT-RATE and Rad-ChestCT datasets demonstrate that CA-GCL consistently outperforms existing VLP paradigms in zero-shot abnormality detection, achieving superior performance while exhibiting strong cross-dataset generalization. Crucially, CA-GCL reduces performance variance across diverse prompt templates, transforming the collapsed textual similarity distribution into a bell-shaped distribution. These results validate CA-GCL as an effective framework for robust 3D medical image understanding.

[CV-39] Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

链接: https://arxiv.org/abs/2605.13530
作者: Jincai Huang,Shihao Zou,Yuchen Guo,Jingjing Li,Wei Ji,Kai Wang,Shanshan Wang,Weixin Si
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, extending CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.

[CV-40] ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin ICML2026

链接: https://arxiv.org/abs/2605.13517
作者: Jaeyung Kim,YoungJoon Yoo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ-VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ-VAE (ArcVQ-VAE), a novel vector quantization framework that introduces a spherical angular-margin prior (SAMP) for the codebook of a conventional VQ-VAE. The proposed SAMP consists of Ball-Bounded Norm Regularization, which constrains all codebook vectors within a time-dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent-space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ-VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: this https URL
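
摘要中的 ArcCosine Additive Margin 思路与人脸识别中的 ArcFace 形式类似:对被指派(最近)码本向量的夹角加上固定的角度间隔再缩放。以下为一个最小 NumPy 示意,其中“最近码指派”、间隔与缩放数值均为说明用假设,ArcVQ-VAE 的实际损失可能不同:

```python
import numpy as np

def arc_margin_logits(z, codebook, margin=0.2, scale=16.0):
    """ArcFace-style additive angular margin over codebook assignments:
    cos(theta + m) for each latent's nearest code, plain cos(theta)
    elsewhere. Generic sketch, not ArcVQ-VAE's exact loss."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    cn = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    cos = zn @ cn.T                       # cosine similarity to every code
    target = cos.argmax(axis=1)           # assumed assignment: nearest code
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    rows = np.arange(z.shape[0])
    cos_m = cos.copy()
    cos_m[rows, target] = np.cos(theta[rows, target] + margin)  # add angular margin
    return scale * cos_m, target
```

将返回的 logits 送入 softmax 交叉熵即可惩罚角度间隔不足的指派,从而鼓励码本向量在球面上更均匀地分散。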

[CV-41] PhysEditBench: A Protocol-Conditioned Benchmark for Dense Physical-Map Prediction with Image Editors

链接: https://arxiv.org/abs/2605.13493
作者: Jiaxin Yang,Yu Hou,Muxin Liu,Weixuan Liu,Ze Yuan,Zeming Chen,Zhongrui Wang,Xiaojuan Qi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 48 pages, 12 figures, including references, appendix, and supplementary benchmark details

点击查看摘要

Abstract:Can general-purpose image editors predict physical maps from a single RGB image? General-purpose image editors differ from standard task-specific dense-prediction models: they do not directly take an image and output a physical map. Instead, they must be guided by prompts, examples, or image-based textual cues. To this end, we introduce PhysEditBench, a novel protocol-conditioned benchmark to evaluate and standardize image editors in dense physical-map prediction that covers five targets: depth, normal, albedo, roughness, and metallic maps. For evaluation data, we build a target-dependent benchmark substrate. We use OpenRooms-FF for depth, surface normal, albedo, and roughness, InteriorVerse as an additional source for depth, normal, albedo, and a new procedurally generated source for metallic maps. We curate the data with quality checks, valid-region masks, scene-level sampling, and lighting-based stress subsets to ensure reliable and diverse evaluation. For each target, PhysEditBench defines a fixed protocol that specifies the allowed input, expected output format, and scoring procedure. Each score, therefore, reflects the performance of a model under a specified protocol, rather than its best possible performance under all prompts or interaction modes. Experimental results show that specialized models remain much stronger on depth, normal, and albedo, and stronger image editors can produce more reasonable map-like outputs. For roughness and metallic, image editors can match or outperform specialized baselines on some scalar metrics, but they still suffer from structural errors, sparsity effects, and sensitivity to lighting.

[CV-42] Neural Video Compression with Domain Transfer ISCAS2026

链接: https://arxiv.org/abs/2605.13476
作者: Tiange Zhang,Rongqun Lin,Xiandong Meng,Haofeng Wang,Xing Tian,Qi Zhang,Siwei Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ISCAS 2026 as an oral paper

点击查看摘要

Abstract:Content-adaptive compression has always been a key direction in neural video coding (NVC), aiming to mitigate the domain gap between training and testing data. Such gaps often arise from distributional discrepancies between training and inference data, which may cause noticeable performance degradation when the testing content differs from the training distribution. To tackle this challenge, we propose DCVC-DT, a domain transfer enhanced neural video compression framework. Specifically, we design a lightweight online domain transfer (DT) mechanism that dynamically adapts the encoded latent representation during inference, effectively bridging the domain gap without modifying the encoder or decoder parameters. In addition, we develop a frame-level dynamic RD (Rate and Distortion) adjustment scheme that actively regulates the ratio of R and D in the loss function based on quality fluctuation, thereby improving rate-distortion performance. Extensive experiments demonstrate that DCVC-DT achieves up to 6.21% bitrate savings over the baseline DCVC-DC, while significantly enhancing generalization to unseen testing data and alleviating error propagation. Our code is available at this https URL.

[CV-43] FedHPro: Federated Hyper-Prototype Learning via Gradient Matching ICML2026

链接: https://arxiv.org/abs/2605.13475
作者: Huan Wang,Jun Shen,Haoran Li,Zhenyu Yang,Jun Yan,Ousman Manjang,Yanlong Zhai,Di Wu,Guansong Pang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, Accepted at ICML 2026

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype-level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients’ real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at this https URL.

[CV-44] Z-Order Transformer for Feed-Forward Gaussian Splatting CVPR2026

链接: https://arxiv.org/abs/2605.13465
作者: Can Wang,Lei Liu,Wei Jiang,Dong Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accept by CVPR 2026, Oral

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this work, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details. This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.
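
The Z-order strategy rests on Morton codes, which interleave the bits of quantized coordinates so that spatially nearby points receive nearby sequence positions. A standard 3-D Morton encoding sketch (the paper's exact quantization scheme is not specified in the abstract):

```python
def part1by2(n: int) -> int:
    """Spread the bits of a 10-bit integer so they occupy every third bit."""
    n &= 0x3FF
    n = (n | (n << 16)) & 0x030000FF
    n = (n | (n << 8)) & 0x0300F00F
    n = (n | (n << 4)) & 0x030C30C3
    n = (n | (n << 2)) & 0x09249249
    return n

def morton3d(x: int, y: int, z: int) -> int:
    """Interleave the bits of quantized (x, y, z) into one Z-order key."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)
```

Sorting Gaussian centers by their `morton3d` key yields a spatially coherent sequence, which is what allows a windowed sparse attention to see mostly spatial neighbors.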

[CV-45] OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression

链接: https://arxiv.org/abs/2605.13457
作者: Chengyan Deng,Pengbin Yu,Zhentao Chen,Wei Shen,Kai Zhang,Meng Li,Lunxi Yuan,Xue Zhou,Li Yu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based real-world image super-resolution (Real-ISR) has achieved remarkable perceptual quality; however, directly super-resolving images to 4K remains limited by extreme memory consumption. Consequently, prior methods adopt patch-based inference, sacrificing global context and introducing semantic confusion, spatial inconsistency, and severe latency. We propose OP4KSR, a one-step patch-free 4K SR approach built upon the powerful Flux backbone. By leveraging the extreme-compression F16 VAE, OP4KSR makes 4K SR inference tractable under practical GPU budgets, preserving global spatial-semantic coherence while enabling highly efficient inference. However, adapting this one-step architecture intrinsically triggers severe periodic artifacts. We trace this to a RoPE base frequency allocation mismatch and intra-token spatial ambiguity, both exacerbated by the lack of iterative refinement. To suppress these artifacts, we couple RoPE base frequency rescaling (RFR) with an autocorrelation-based periodicity loss (\mathcal{L}_{\text{AP}}). Furthermore, we curate a dedicated training dataset alongside three benchmarks (one synthetic and two real-world) to advance 4K SR research. Extensive experiments demonstrate that OP4KSR achieves competitive perceptual quality with efficient inference, generating a 4096×4096 output in only 5.75 seconds on a single NVIDIA H20 GPU.
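
One plausible reading of an autocorrelation-based periodicity loss is to penalize large autocorrelation at nonzero lags, which can be computed efficiently via the FFT (Wiener-Khinchin theorem). A hedged 1-D sketch; the paper's actual 2-D formulation and lag weighting are not given in the abstract:

```python
import numpy as np

def periodicity_penalty(signal, min_lag=1):
    """Mean squared normalized autocorrelation at nonzero lags, computed via
    FFT with zero-padding (linear, not circular); periodic signals score high."""
    x = signal - signal.mean()
    n = len(x)
    power = np.abs(np.fft.rfft(x, n=2 * n)) ** 2   # zero-pad -> linear autocorr
    ac = np.fft.irfft(power)[:n]
    ac = ac / ac[0]                                # lag-0 normalized to 1
    return float(np.mean(ac[min_lag:] ** 2))
```

A strongly periodic residual (e.g., a sinusoid) scores far above white noise, so minimizing such a term discourages repeating artifact patterns.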

[CV-46] Bayesian In Vivo Tracking of Synapses using Joint Poisson Deconvolution and Diffeomorphic Registration

链接: https://arxiv.org/abs/2605.13455
作者: Shashwat Kumar,Dominic M. Padova,Binish Narang,Gabrielle I. Coste,Austin R. Graves,Richard L. Huganir,Adam S. Charles,Michael I. Miller,Anuj Srivastava
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synapses are densely packed submicron structures that dynamically reorganize during learning and memory formation. Longitudinal \textitin vivo imaging of fluorescently tagged synaptic receptors offers a promising opportunity to study large-scale synaptic dynamics and how these processes are disrupted in neurological disease. However, in vivo imaging with 2-photon microscopy uses low laser power and therefore suffers from low signal-to-noise ratio (SNR) and high shot noise, nonlinear tissue motion between days, nonstationary fluctuations in synaptic fluorescence, and significant blur induced by the microscope point spread function (PSF). Together, these factors make it challenging to detect and track synapses, especially in regions with high synaptic density. This paper presents a novel template-based framework for modeling synapses as varying luminance point sources that move under a nonlinear tissue deformation. Taking a unified Bayesian approach, we apply this model to microscopy data by deriving a posterior that incorporates a diffeomorphic mapping for domain warping, a Gaussian point spread function for the imaging process, and a Poisson observation model for raw photon counts. The Bayesian solution simultaneously: (1) Constructs a probabilistic template of synapse locations, (2) denoises and deconvolves the image data, (3) infers fluorescence intensities, (4) performs diffeomorphic image registration to correct for tissue motion, and (5) provides confidence regions for these parameter estimates. We demonstrate the framework on both a 2D+t simulated dataset and a 3D+t longitudinal \textitin vivo microscopy dataset of fluorescent synapses imaged in a mouse over two weeks.
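
The core of the observation model, a Gaussian PSF around point sources observed through Poisson photon counts, can be sketched as a negative log-likelihood. Grid size, PSF width, and background rate below are illustrative values, not the paper's:

```python
import numpy as np

def expected_counts(positions, intensities, grid=16, sigma=1.5, background=0.1):
    """Expected photon count per pixel: a Gaussian PSF around each point
    source plus a constant background rate (all rates > 0)."""
    ys, xs = np.mgrid[0:grid, 0:grid]
    lam = np.full((grid, grid), background)
    for (py, px), a in zip(positions, intensities):
        lam = lam + a * np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
    return lam

def poisson_nll(counts, lam):
    """Poisson negative log-likelihood, dropping the constant log(k!) term."""
    return float(np.sum(lam - counts * np.log(lam)))

rng = np.random.default_rng(0)
lam_true = expected_counts([(5, 5), (10, 11)], [30.0, 25.0])
counts = rng.poisson(lam_true)
```

Minimizing this NLL over positions and intensities (jointly with a deformation model, in the paper's full framework) recovers the point sources from noisy counts.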

[CV-47] RotVLA: Rotational Latent Action for Vision-Language-Action Model

链接: https://arxiv.org/abs/2605.13403
作者: Qiwei Li,Xicheng Gong,Xinghang Li,Peiyan Li,Quanyun Zhou,Hangjun Ye,Jiahuan Zhou,Yadong Mu
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.
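
Modeling latent actions on SO(n) gives closure under composition and a smooth geometry; for n = 3 the exponential map from axis-angle vectors to rotations is Rodrigues' formula, sketched here as a generic illustration (not the paper's code):

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: map an axis-angle vector w to a rotation in SO(3)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = np.asarray(w, dtype=float) / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# Composition of latent actions stays on the manifold: a product of rotations
# is again a rotation, unlike sums of arbitrary continuous latents.
R = so3_exp([0.0, 0.0, np.pi / 2]) @ so3_exp([0.1, -0.2, 0.3])
```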

[CV-48] Fast and Compact Graph Cuts for the Boykov-Kolmogorov Algorithm

链接: https://arxiv.org/abs/2605.13402
作者: Christian Møller Mikkelstrup,Anders Bjorholm Dahl,Philip Bille,Vedrana Andersen Dahl,Inge Li Gørtz
类目: Computer Vision and Pattern Recognition (cs.CV); Data Structures and Algorithms (cs.DS)
备注: 15 pages, 6 figures, submitted to the IEEE for possible publication

点击查看摘要

Abstract:Computing a minimum s-t cut in a graph is a solution to a wide range of computer vision problems, and is often done using the Boykov-Kolmogorov (BK) algorithm. In this paper, we revisit the BK algorithm from both a theoretical and practical point of view. We improve the analysis of the time complexity of the BK algorithm to O(mn|C|) and propose a new algorithm, the fast and compact BK (fcBK) algorithm, with a time complexity of O(m|C|), where m, n, and |C| are the number of edges, number of vertices, and the capacity of the cut, respectively. We additionally propose a compact graph representation that allows our implementation to find a minimum s-t cut in a graph with upwards of 10^9 vertices and 10^10 edges on a machine with 128 GB of memory. We find our implementation of the BK algorithm to be the fastest available implementation of the BK algorithm when evaluating on a comprehensive set of benchmark datasets, highlighting the importance of memory-efficient implementations. We make our implementations publicly available for further research and implementation development within minimum s-t cut algorithms.
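
For readers outside the min-cut literature: |C|, the capacity of a minimum s-t cut, equals the maximum s-t flow by duality. A simple Edmonds-Karp baseline (deliberately not the BK or fcBK algorithm) makes the quantity concrete:

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp (BFS augmenting paths) on an adjacency-matrix graph.
    By max-flow/min-cut duality the returned value is the min-cut capacity |C|."""
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:          # no augmenting path left: flow is maximum
            return total
        # bottleneck residual capacity along the BFS path, then augment
        add = float("inf")
        v = t
        while v != s:
            add = min(add, cap[parent[v]][v] - flow[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            flow[parent[v]][v] += add
            flow[v][parent[v]] -= add
            v = parent[v]
        total += add

cap = [[0, 3, 2, 0],
       [0, 0, 1, 2],
       [0, 0, 0, 3],
       [0, 0, 0, 0]]
min_cut_capacity = max_flow(cap, 0, 3)
```

The BK family improves on this in practice by growing and reusing search trees from both s and t, which is why its complexity bounds are stated in terms of |C|.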

[CV-49] PreFIQs: Face Image Quality Is What Survives Pruning CVPR2026

链接: https://arxiv.org/abs/2605.13396
作者: Jan Niklas Kolf,Guray Ozgur,Andrea Atzori,Žiga Babnik,Vitomir Štruc,Naser Damer,Fadi Boutros
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 Workshops

点击查看摘要

Abstract:Face Image Quality Assessment (FIQA) evaluates the utility of a face image for automated face recognition (FR) systems. In this work, we propose PreFIQs, an unsupervised and training-free FIQA framework grounded in the Pruning Identified Exemplar (PIE) hypothesis. We hypothesize that low-utility face images rely disproportionately on fragile network parameters, resulting in larger geometric displacement of their embeddings under model sparsification. Accordingly, PreFIQs quantifies image utility as the Euclidean distance between L2-normalized embeddings extracted from a pre-trained FR model and its pruned counterpart. We provide a first-order theoretical justification via a Jacobian-vector product analysis, demonstrating that this empirical drift serves as a computationally efficient approximation of the exact geometric sensitivity of the latent embedding manifold. Extensive experiments across eight benchmarks and four FR models demonstrate that PreFIQs achieves competitive or superior performance compared to state-of-the-art FIQA methods, including establishing new state-of-the-art results on several benchmarks, without any training or supervision. These results validate parameter sparsification as a principled and practically efficient signal for face image utility, and demonstrate that quality is, in essence, what survives pruning.
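
The PreFIQs score reduces to embedding drift under magnitude pruning. A toy sketch with a linear stand-in for the frozen FR model; the sparsity level, pruning rule, and dimensions are illustrative, not the paper's settings:

```python
import numpy as np

def prune_by_magnitude(w, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured)."""
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

def pruning_drift_score(x, w, sparsity=0.5):
    """Quality proxy: Euclidean distance between L2-normalized embeddings of
    a toy linear 'model' and its pruned counterpart (larger drift = lower quality)."""
    e_full = w @ x
    e_pruned = prune_by_magnitude(w, sparsity) @ x
    e_full = e_full / np.linalg.norm(e_full)
    e_pruned = e_pruned / np.linalg.norm(e_pruned)
    return float(np.linalg.norm(e_full - e_pruned))

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))    # stand-in for a frozen FR backbone
x = rng.normal(size=16)         # stand-in for an aligned face image
score = pruning_drift_score(x, w, sparsity=0.9)
```

Since both embeddings are unit-normalized, the score lies in [0, 2], and it is exactly zero when nothing is pruned.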

[CV-50] Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation CVPR2026

链接: https://arxiv.org/abs/2605.13395
作者: Lilin Zhang,Yimo Guo,Yue Li,Jiancheng Shi,Xianggen Liu
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR 2026

点击查看摘要

Abstract:Deep neural networks are highly vulnerable to adversarial examples, i.e., small perturbations that can significantly degrade model performance. While adversarial training has become the primary defense strategy, most studies focus on balanced datasets, overlooking the challenges posed by real-world long-tail data. Motivated by the fact that perturbations in adversarial examples inherently alter the training distribution, we theoretically investigate their impact. We first revisit adversarial training for long-tail data and identify two key limitations: (i) a skewed training objective caused by class imbalance, and (ii) unstable evolution of adversarial distributions. Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose RobustLT, a plug-and-play framework that adaptively adjusts perturbations during adversarial training. Extensive experiments demonstrate that RobustLT consistently enhances adversarial robustness and class balance on long-tailed datasets. The code is available at this https URL.

[CV-51] Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics

链接: https://arxiv.org/abs/2605.13381
作者: Chiara Musso,Joy Battocchio,Andrea Montibeller,Giulia Boato
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:As AI-generated synthetic images become increasingly realistic, Vision Transformers (ViTs) have emerged as a cornerstone of modern deepfake detection. However, the prevailing reliance on frozen, pre-trained backbones introduces a subtle yet critical vulnerability. In this work, we present the Surrogate Iterative Adversarial Attack (SIAA), a gray-box attack that exploits knowledge of the detector’s ViT backbone alone and operates entirely within the target detector’s feature space to craft highly effective adversarial examples. Through our experiments, involving multiple ViT-based detectors and diverse gray-box scenarios, including few-shot learning, complete training misalignment and attack transferability tests, we demonstrate that this vulnerability consistently yields high attack success rates, often approaching white-box performance. By doing so, we reveal that backbone knowledge alone is sufficient to undermine detector reliability, highlighting the urgent need for more resilient defenses in adversarial multimedia forensics.
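
The gray-box idea of maximizing feature-space displacement under a perturbation budget can be sketched with a linear feature map, where the gradient is analytic. This is a generic PGD-style illustration of the principle, not the paper's SIAA procedure; all budgets and dimensions are invented:

```python
import numpy as np

def feature_space_attack(x, W, eps=0.1, steps=10, seed=0):
    """Iterative sign-step ascent on the squared feature displacement
    ||W(x+d) - Wx||^2 under an L_inf budget eps. For this linear feature
    map the gradient w.r.t. d is 2 * W.T @ (W @ d), computed analytically."""
    rng = np.random.default_rng(seed)
    alpha = eps / steps
    d = rng.uniform(-alpha, alpha, size=x.shape)   # random start: grad is 0 at d = 0
    for _ in range(steps):
        grad = 2 * W.T @ (W @ d)
        d = np.clip(d + alpha * np.sign(grad), -eps, eps)
    return x + d

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 16))        # stand-in for a frozen ViT backbone
x = rng.normal(size=16)
x_adv = feature_space_attack(x, W)
```

The attack aligns the perturbation with the backbone's most sensitive directions, so the same budget moves the features much further than a random perturbation would.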

[CV-52] GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

链接: https://arxiv.org/abs/2605.13375
作者: Mingzhe Huang,Weijun Wang,Xin Ding,Liang Mi,Hao Wen,Yuanchun Li,Lichen Pang,Shansong Yang,Yunxin Liu,Ting Cao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 11 figures

点击查看摘要

Abstract:In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
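
Stripped of the RL policy, the budget-aware selection step amounts to keeping the top-scoring visual tokens under a ratio budget while preserving their sequence order. A minimal sketch (scores here would come from the learned agent):

```python
import numpy as np

def prune_tokens(tokens, scores, budget_ratio=0.25):
    """Keep the highest-scoring visual tokens under a compute budget,
    preserving their original sequence order."""
    n = len(tokens)
    k = max(1, int(n * budget_ratio))
    keep = np.sort(np.argsort(scores)[-k:])    # top-k indices, back in order
    return tokens[keep], keep

tokens = np.arange(8) * 10.0
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.05, 0.4, 0.7])
kept, idx = prune_tokens(tokens, scores)
```

Because token selection is exactly this kind of discrete top-k choice, its gradient is zero almost everywhere, which is the motivation for optimizing the scorer with reinforcement learning rather than continuous relaxations.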

[CV-53] Neural Surrogate Forward Modelling For Electrocardiology Without Explicit Intracellular Conductivity Tensor

链接: https://arxiv.org/abs/2605.13366
作者: Shaheim Ogbomo-Harmitt,Cesare Magnetti,Jakub Grzelak,Oleg Aslanidi
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted into the 9th International Conference on Computational and Mathematical Biomedical Engineering (CMBE2026)

点击查看摘要

Abstract:Accurate forward modelling is essential for non-invasive cardiac electrophysiology, particularly in atrial fibrillation (AF), where electrical activation is highly disorganised. Conventional physics-based forward models require explicit specification of intracellular conductivity tensors, which are not directly measurable in clinical practice and introduce structural modelling errors. This proof-of-concept study presents a deep learning approach that learns a direct mapping from left atrial intracellular electrical potentials to far-field ECGs without requiring explicit intracellular conductivity inputs at inference time. Despite training only on 74 subjects, the model achieved an R^2 of 0.949 ± 0.037, highlighting potential to reduce structural uncertainty and improve non-invasive AF assessment.

[CV-54] Drag within Prior Distribution: Text-Conditioned Point-Based Image Editing within Distribution Constraints ICASSP2026

链接: https://arxiv.org/abs/2605.13349
作者: Haoyang Hu,Masataka Seo,Yen-Wei Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICASSP 2026 oral

点击查看摘要

Abstract:Diffusion-based point editing methods have gained significant traction in image editing tasks due to their ability to manipulate image semantics and fine details by applying localized perturbations on the manifold of noise latent. However, these approaches face several limitations. Traditional point-based editing relies on pairs of handle and target points to define motion trajectories, which can introduce ambiguity or unnecessary alterations. Furthermore, when the distance between the handle and target points is large, the accumulated perturbations often cause the noise latent to deviate from the inversion score trajectory, resulting in unnatural artifacts. To address these issues in global editing tasks, we introduce a CLIP-based model to evaluate and guide intermediate editing steps, ensuring that the generated results remain semantically aligned. Additionally, we propose a prior-preservation loss that constrains the optimized latent code to stay within the sampling space of the diffusion prior, improving consistency with the original data distribution, to ensure the model generates images along a familiar score trajectory. For fine-grained tasks, we present a directionally-weighted point tracking mechanism that steers the editing process toward the target direction within similar feature regions. This improves both the tracking accuracy and generation quality, while also reducing the editing time.

[CV-55] Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

链接: https://arxiv.org/abs/2605.13335
作者: Qinchuan Cheng,Zhantao Gong,Pengzhan Sun,Angela Yao,Xulei Yang,Shijie Li
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics, introducing a sim-to-real gap and often assuming fully observable state. We introduce Ego2World, an executable benchmark that turns egocentric cooking videos into executable symbolic worlds governed by graph-transition rules. Built on HD-EPIC, Ego2World derives reusable transition rules from video annotations and executes them in a hidden symbolic world graph. During evaluation, the simulator maintains the hidden world graph, while the agent plans over its own partial belief graph using only local observations and execution feedback. This separation forces agents to update memory and replan without observing the true world state. Experiments show that action-overlap scores overestimate physical-state success, and that persistent belief memory improves task completion while reducing repeated visual exploration – suggesting that belief maintenance should be a first-class target of embodied-agent evaluation.

[CV-56] Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation SIGGRAPH2026

链接: https://arxiv.org/abs/2605.13333
作者: Junhyuk Jeon,Seokhyeon Hong,Junyong Noh
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Accepted to SIGGRAPH 2026. Project page: this https URL

点击查看摘要

Abstract:Text-driven motion diffusion models are capable of generating realistic human motions, but text alone often struggles to express fine-level nuances of motion, commonly referred to as style. Recent approaches have tackled this challenge by attaching a style injection mechanism to a pretrained text-driven diffusion model. Existing stylization methods, however, either require style-specific fine-tuning of existing models or rely on heavy ControlNet-based architectures, limiting efficiency and generalization to unseen styles. We propose a lightweight style conditioning framework that dynamically modulates a pretrained diffusion model through hypernetwork-generated LoRA parameters. A style reference motion is encoded into a global style embedding, which is mapped by a hypernetwork to low-rank updates applied at each denoising step of the diffusion model. By structuring the style latent space with a supervised contrastive loss, our framework reliably captures diverse stylistic attributes, improves generalization to unseen styles, and supports optimization-based guidance without requiring predefined style categories. Experiments on the HumanML3D and 100STYLE datasets show state-of-the-art stylization results, while achieving improved stylization for unseen styles.
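
The hypernetwork-to-LoRA mapping can be sketched as a hypernetwork (a single linear layer here, purely for illustration) that emits low-rank factors A and B whose product updates a frozen weight. Dimensions and scaling below are invented, not the paper's:

```python
import numpy as np

def hyper_lora(style_emb, H_a, H_b, d, rank, scale=0.1):
    """A toy linear hypernetwork maps a style embedding to low-rank LoRA
    factors A (d x r) and B (r x d); their product updates a frozen weight."""
    A = (H_a @ style_emb).reshape(d, rank)
    B = (H_b @ style_emb).reshape(rank, d)
    return scale * (A @ B)

rng = np.random.default_rng(0)
d, r, s_dim = 16, 4, 8
H_a = rng.normal(size=(d * r, s_dim)) / np.sqrt(s_dim)
H_b = rng.normal(size=(r * d, s_dim)) / np.sqrt(s_dim)
W = rng.normal(size=(d, d))                  # a frozen diffusion-model weight
style = rng.normal(size=s_dim)               # global style embedding
W_styled = W + hyper_lora(style, H_a, H_b, d, r)
```

The update has rank at most r regardless of the style embedding, which is what keeps per-style conditioning lightweight compared with full fine-tuning or ControlNet-style branches.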

[CV-57] KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

链接: https://arxiv.org/abs/2605.13322
作者: Richard Sproat,Stefano Peluchetti
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a description in a formal kamon description language (“kamon yōgo”), a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility. We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks. KamonBench therefore provides a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.
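
The three-factor construction can be illustrated with a toy generator. The factor vocabularies below are invented for illustration and are not KamonBench's actual grammar:

```python
import itertools

# Toy factor vocabularies (illustrative only, not the benchmark's)
CONTAINERS = ["circle", "snow-ring", "none"]
MODIFIERS = ["plain", "inverted", "doubled"]
MOTIFS = ["wisteria", "crane", "oak-leaf"]

def describe(container, modifier, motif):
    """Compose a crest description from the three generating factors."""
    core = f"{modifier} {motif}"
    return core if container == "none" else f"{core} in {container}"

crests = [describe(c, m, t)
          for c, m, t in itertools.product(CONTAINERS, MODIFIERS, MOTIFS)]
```

Because every crest is determined by its (container, modifier, motif) triple, held-out recombinations of seen factors give a direct test of compositional generalization.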

[CV-58] Test-time Sparsity for Extreme Fast Action Diffusion

链接: https://arxiv.org/abs/2605.13316
作者: Kangye Ji,Yuan Meng,Jianbo Zhou,Ye Li,Chen Tang,Zhi Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Action diffusion excels at high-fidelity action generation but incurs heavy computational costs owing to its iterative denoising nature. Despite current technologies showing promise in accelerating diffusion transformers by reusing the cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward at test time. However, two bottlenecks remain in this paradigm: 1) repetitive conditional encoding and pruning offset most potential speed gains, and 2) the features cached from previous denoising timesteps cannot constrain large pruning errors under aggressive sparsity. To address the first bottleneck, we design a highly parallelized inference pipeline that minimizes the non-decoder delay to milliseconds. Specifically, we first design a lightweight pruner that shares the encoder with the diffusion transformer. Then, we decouple the encoding and pruning from the autoregressive denoising loop by processing all denoising timesteps in parallel, and overlap the pruner with the decoder forward inference through asynchronism. To overcome the second bottleneck, we introduce an omnidirectional reusing strategy, which achieves 95% sparsity by selectively reusing features cached from the current forward, previous denoising timesteps, and earlier rollout iterations. To learn the rollout-level reusing strategies, we sample a few action trajectories to supervise the sparsified diffusion step by step. Extensive experiments demonstrate that our method reduces FLOPs by 92% and accelerates action generation by 5x, achieving lossless performance with an inference frequency of 47.5 Hz. Our code is available at this https URL.
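
The reusing strategy hinges on deciding, per forward pass, whether a cached residual computation is still valid for the current input. A minimal similarity-gated cache sketch; the cosine-threshold gate stands in for the paper's learned pruner and is purely illustrative:

```python
import numpy as np

class ReuseCache:
    """Skip an expensive residual computation when the current input is close
    enough (cosine similarity >= tau) to the one that filled the cache."""

    def __init__(self, fn, tau=0.999):
        self.fn, self.tau = fn, tau
        self.key, self.val = None, None
        self.hits = self.misses = 0

    def __call__(self, x):
        if self.key is not None:
            cos = float(x @ self.key) / (np.linalg.norm(x) * np.linalg.norm(self.key))
            if cos >= self.tau:
                self.hits += 1
                return self.val          # reuse: no recomputation
        self.misses += 1
        self.key, self.val = x.copy(), self.fn(x)
        return self.val

cache = ReuseCache(lambda v: v * 2.0)        # stand-in for a residual block
a = np.array([1.0, 2.0, 3.0])
out1 = cache(a)
out2 = cache(a * 1.0001)                     # nearly identical input: reused
out3 = cache(np.array([-3.0, 1.0, 0.5]))     # different input: recomputed
```

Extending the cache lookup across denoising timesteps and earlier rollout iterations, rather than only the current forward, is what the abstract calls omnidirectional reusing.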

[CV-59] Color Constancy in Hyperspectral Imaging via Reduced Spectral Spaces

链接: https://arxiv.org/abs/2605.13306
作者: G. Dofri Vidarsson,Liying Lu,Sabine Süsstrunk
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Illuminant estimation aims to infer scene illumination from image measurements despite intrinsic ambiguities between surface reflectance and lighting. Most existing methods operate on trichromatic RGB images and are therefore fundamentally limited by the restricted spectral information available. Hyperspectral imaging provides a much richer representation of scene radiance and has the potential to alleviate these ambiguities. However, its high dimensionality poses computational and statistical challenges. In this work, we systematically study the effect of spectral dimensionality and representation choice on illuminant estimation performance using hyperspectral data. We adopt the practical and effective Color-by-Correlation (CbC) framework as the estimation backbone and analyze its behavior under different spectral dimensionality reduction strategies. Our results offer practical insights into how hyperspectral information can be efficiently exploited for illuminant estimation and identify conditions under which compact spectral representations outperform conventional RGB-based approaches. The code is available at this https URL.
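
A common way to build reduced spectral spaces is a PCA-style basis from the SVD of sampled spectra; whether CbC is paired with exactly this reduction is not stated in the abstract, so the sketch below is a generic illustration:

```python
import numpy as np

def reduced_spectral_basis(spectra, k=3):
    """PCA-style reduction: mean spectrum plus the top-k right singular
    vectors of the centered sample matrix (rows = sampled spectra)."""
    mean = spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:k]

def project(spectrum, mean, basis):
    """Coordinates of a spectrum in the reduced space."""
    return basis @ (spectrum - mean)

def reconstruct(coords, mean, basis):
    return mean + basis.T @ coords

# Synthetic spectra that truly lie in a 2-D subspace around a mean curve
rng = np.random.default_rng(0)
b_true = rng.normal(size=(2, 31))            # 31 wavelength samples
spectra = rng.normal(size=(40, 2)) @ b_true + rng.normal(size=31)
mean, basis = reduced_spectral_basis(spectra, k=2)
err = np.linalg.norm(
    reconstruct(project(spectra[0], mean, basis), mean, basis) - spectra[0])
```

When the spectra genuinely occupy a low-dimensional subspace, a small k loses nothing, which is the regime where compact spectral representations can match full hyperspectral input.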

[CV-60] Img2CADSeq: Image-to-CAD Generation via Sequence-Based Diffusion SIGGRAPH2026

链接: https://arxiv.org/abs/2605.13293
作者: Shiyu Tan,Zixuan Zhao,Hao Gao,Zhiheng Chen,Xiaolong Yin,Enya Shen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by SIGGRAPH 2026 Conference

点击查看摘要

Abstract:Boundary Representation (BRep) is the standard format for Computer-Aided Design (CAD), yet reconstructing high-quality BReps from single-view images remains challenging due to the complexity of topological constraints and operation sequences. We present Img2CADSeq, a multi-stage pipeline that overcomes these limitations by encoding CAD sequences into a three-level hierarchical codebook. Guided by an importance prioritization, this strategy values profiles over details, compressing long sequences into a stable discrete latent space. To bridge the modality gap, we leverage a coarse-to-fine point cloud intermediate, aligning 2D visual features with 3D CAD sequences via contrastive learning to condition a VQ-Diffusion model. Supported by newly introduced CAD-220K and PrintCAD datasets, our approach ensures robust industrial domain adaptation. Extensive experiments demonstrate that Img2CADSeq significantly outperforms state-of-the-art methods, producing standard STEP files that can be directly used in commercial CAD software.

[CV-61] X-Restormer: 1st Place Solution for the UG2 CVPR 2026 All-Weather Restoration Challenge

链接: https://arxiv.org/abs/2605.13258
作者: Youwei Pan,Leilei Cao,Yingfang Zhu,Fengjie Zhu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we present our winning solution for the 8th UG2+ Challenge (CVPR 2026) Track 1: Image Restoration under All-weather Conditions. Our method is built upon the strong baseline framework X-Restormer, which effectively captures both channel-wise global dependencies and spatially-local structural information through its dual-attention design (Multi-DConv Head Transposed Attention and Overlapping Cross-Attention). To further boost the restoration performance, we propose several key improvements. First, we integrate the spatially-adaptive input scaling mechanism from Restormer-Plus to dynamically adjust the spatial weights of the input image, enhancing spatial adaptability. Second, to better preserve structural details and edge information, we introduce a novel Gradient-Guided Edge-Aware (GGEA) loss, which is combined with L1 and Multi-Scale SSIM losses in a unified training objective. Third, we significantly expand the training data by incorporating an extra 24,500 degraded-clean image pairs from FoundIR and WeatherBench alongside the original WeatherStream dataset. With these strategies, our proposed method ranks 1st place in the challenge.
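The abstract names a Gradient-Guided Edge-Aware (GGEA) loss combined with L1 and Multi-Scale SSIM, but gives no formula. A hedged sketch of one plausible edge-aware term, penalizing differences in Sobel gradient magnitudes (the function names, Sobel choice, and weighting are illustrative assumptions, not the paper's definition):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def _conv2d(img, kernel):
    # 'valid' 2D cross-correlation for a single-channel image
    h, w = img.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def edge_aware_loss(pred, target, w_edge=0.5):
    """L1 loss plus an L1 penalty on Sobel gradient magnitudes,
    one simple way to weight edge regions in a restoration objective."""
    l1 = np.mean(np.abs(pred - target))

    def grad_mag(x):
        gx = _conv2d(x, SOBEL_X)
        gy = _conv2d(x, SOBEL_Y)
        return np.sqrt(gx ** 2 + gy ** 2 + 1e-8)

    edge = np.mean(np.abs(grad_mag(pred) - grad_mag(target)))
    return l1 + w_edge * edge
```

In the paper's full objective this term would sit alongside L1 and Multi-Scale SSIM losses; the exact GGEA formulation is not specified in the abstract.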

[CV-62] ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

链接: https://arxiv.org/abs/2605.13228
作者: Xiao Liu,Nayu Liu,Junnan Zhu,Ruirui Chen,Guohui Xiang,Changjian Wang,Kaiwen Wei,Rongzhen Li,Jiang Zhong
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.

[CV-63] Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

链接: https://arxiv.org/abs/2605.13223
作者: Abdelrahman Eldesokey,Merey Ramazanova,Ahmad Sait,Ansar Khangeldin,Karen Sanchez,Tong Zhang,Bernard Ghanem
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Text-to-image (T2I) generation has advanced rapidly, making reliable evaluation critical as performance differences between models narrow. Existing evaluation practices typically apply uniform annotation mechanisms, such as Likert-scale or binary question answering (BQA), across heterogeneous evaluation skills, despite fundamental differences in their nature. In this work, we revisit T2I evaluation through the lens of skill-aligned annotation, where annotation strategies reflect the underlying characteristics of each evaluation skill. We systematically compare skill-aligned annotation against uniform baselines and show that it produces more consistent evaluation signals, with higher inter-annotator agreement and improved stability across models. Finally, we present an automated pipeline that instantiates the proposed evaluation protocol, enabling scalable and fine-grained evaluation with spatially grounded feedback. Our work highlights that improving the foundations of image evaluation can increase reliability and efficiency without simply scaling annotation effort. We hope this motivates further research on refining evaluation protocols as a central component of reliable model assessment.

[CV-64] STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

链接: https://arxiv.org/abs/2605.13202
作者: Hongli Liu,Yu Wang,Shengjie Zhao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

点击查看摘要

Abstract:Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches still suffer from semantic-temporal misalignment, where static textual prompts fail to capture decisive visual cues that appear sparsely across sequences, and from inadequate modeling of multi-scale temporal dynamics, as short-term discriminative cues and long-range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic Temporal Adaptive Representation Learning (STAR), a unified framework, consisting of a semantic-alignment component and a temporal-aware component, effectively bridging the semantic and temporal gaps and transferring the sequence modeling capability of Mamba into FSAR. The semantic alignment module introduces a Temporal Semantic Attention (TSA) mechanism, which performs frame-level cross-modal alignment with textual cues, ensuring fine-grained semantic-temporal consistency. The temporal-aware module incorporates a Semantic Temporal Prototype Refiner (STPR) that integrates semantic-guided Mamba blocks with multi-frequency temporal sampling and bidirectional state-space refinement, yielding semantically aligned prototypes with enhanced discriminative fidelity and temporal consistency. Furthermore, temporally dependent class descriptors derived from large language models (LLMs) provide long-range semantic guidance. Extensive experiments on five FSAR benchmarks demonstrate the consistent superiority of STAR over state-of-the-art methods. For instance, STAR achieves up to 8.1% and 6.7% gains on the SSv2-Full and SSv2-Small datasets under the 1-shot setting, and 7.3% on HMDB51, validating its effectiveness under limited supervision. The code is available at this https URL.

[CV-65] FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

链接: https://arxiv.org/abs/2605.13193
作者: Geng Li,Yuxin Peng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of the latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.

[CV-66] DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution

链接: https://arxiv.org/abs/2605.13182
作者: Zheng Chen,Ruofan Yang,Jin Han,Dehua Song,Zichen Zou,Chunming He,Yong Guo,Yulun Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL

点击查看摘要

Abstract:Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their role in the coupled space-time video super-resolution (STVSR) setting remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates information across multiple keyframes to produce intermediate frames. The VRG module extracts video-level global features to guide the diffusion process. Extensive experiments show that DiffST obtains leading results on real-world STVSR tasks. It also maintains high inference efficiency, running about 17× faster than previous diffusion-based STVSR methods. Code is available at: this https URL.

[CV-67] Does Engram Do Memory Retrieval in Autoregressive Image Generation?

链接: https://arxiv.org/abs/2605.13179
作者: Jinghao Wang,Qiyuan He,Chunbin Gu,Pheng-Ann Heng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages

点击查看摘要

Abstract:The Engram module – a hash-keyed, O(1) associative memory injected into Transformer layers – was recently shown to improve large language model pretraining, with the appealing interpretation that it provides a content-addressed shortcut to recurring local token patterns. We ask whether this interpretation transfers to autoregressive (AR) image generation, or whether the observed gains, if any, come from a different mechanism. We adapt the Engram module to vision with 2D spatial n-gram hashing, gated fusion, and KV-cache-compatible incremental inference, and inject it into a class-conditional AR generator trained on ImageNet 256×256. Across a sweep of backbone-to-memory budget ratios ρ ∈ [0.17, 0.90], every Engram-augmented variant trails the pure AR baseline in FID, indicating that the module saves backbone FLOPs but does not, by itself, improve sample quality. We then probe how the module is used. A gate-clamp sweep shows that disabling the Engram pathway entirely is catastrophic, yet a tiny constant gate (g=0.10) matches or beats the learned gate – inconsistent with a heavily content-addressed recall mechanism. A donor-probe experiment shows that swapping the hash inputs for matched, adversarial, or random same-class exemplars produces statistically indistinguishable next-token distributions, while collapsing or randomising the table degrades them by two to three orders of magnitude. Finally, training a model from scratch with the entire memory table frozen to N(0, 1) noise costs only ΔFID = 0.10 and actually raises Inception Score. Together, these findings indicate that the Engram in AR image generation behaves not as a content-addressed retriever but as a gated architectural side-pathway: a hash-keyed residual stream whose benefit is dominated by the pathway itself, with the learned table contributing only a small distributional refinement.
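A minimal sketch of the hash-keyed O(1) n-gram memory with a small constant gate that the abstract describes (table size, hashing scheme, initialization, and gate value here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

class NGramMemory:
    """Hash-keyed associative memory: maps local token n-grams to
    stored vectors with an O(1) table lookup per position."""

    def __init__(self, table_size, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((table_size, dim)) * 0.02
        self.table_size = table_size

    def _hash(self, ngram):
        # simple rolling hash of the token-id tuple into a table index
        h = 0
        for tok in ngram:
            h = (h * 1000003 + int(tok)) % self.table_size
        return h

    def lookup(self, tokens, n=2):
        # one memory vector per position, keyed by the preceding n tokens;
        # positions without a full n-gram get a zero vector
        out = np.zeros((len(tokens), self.table.shape[1]))
        for t in range(n - 1, len(tokens)):
            out[t] = self.table[self._hash(tokens[t - n + 1: t + 1])]
        return out

def gated_fusion(hidden, memory, gate=0.1):
    # small constant gate, mirroring the paper's finding that g=0.10
    # matches or beats a learned gate
    return hidden + gate * memory
```

For images, the paper hashes 2D spatial n-grams of visual tokens rather than the 1D sequence shown here.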

[CV-68] CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

链接: https://arxiv.org/abs/2605.13178
作者: Sangin Lee,Yukyung Choi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures

点击查看摘要

Abstract:In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens located within referent regions often exhibit low similarity to the textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP’s visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3x memory reduction. Our code is available at this https URL.
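The reversed-ranking idea at the core of the abstract can be sketched in a few lines (illustrative only; LiteLVLM's actual scoring and its context-token recovery step are more involved, and the function name is an assumption):

```python
import numpy as np

def prune_visual_tokens(vis_tokens, text_emb, keep_ratio=0.25):
    """Text-guided pruning sketch: rank visual tokens by cosine similarity
    to the text embedding and keep the LEAST similar ones, following the
    paper's observation that referent-region tokens show low text similarity."""
    v = vis_tokens / np.linalg.norm(vis_tokens, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sim = v @ t                            # cosine similarity per token
    k = max(1, int(round(len(vis_tokens) * keep_ratio)))
    keep = np.argsort(sim)[:k]             # reversed ranking: lowest similarity first
    return np.sort(keep)                   # indices of retained tokens
```

Note this is the opposite of the usual keep-most-similar heuristic used for image-understanding token pruning, which is the counterintuitive finding the title alludes to.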

[CV-69] PanoWorld: Towards Spatial Supersensing in 360° Panorama World

链接: https://arxiv.org/abs/2605.13169
作者: Changpeng Wang,Xin Lin,Junhan Liu,Yuheng Liu,Zhen Wang,Donglian Qi,Yunfeng Yan,Xi Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.

[CV-70] LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters ICIP2026

链接: https://arxiv.org/abs/2605.13163
作者: Beomjin Ahn,Jungmin Kwon,Chanyong Jung,Jaewook Chung
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICIP 2026

点击查看摘要

Abstract:Foundation models and low-rank adapters enable efficient on-device generative AI but raise risks such as intellectual property leakage and model recovery attacks. Existing defenses are often impractical because they require retraining or access to the original dataset. We propose LoREnc, a training-free framework that secures both FMs and adapters via spectral truncation and compensation. LoREnc suppresses dominant low-rank components of FM weights, compensates for the missing information in authorized adapters, and further applies orthogonal reparameterization to obscure structural fingerprints of the protected adapter. Unauthorized users produce structurally collapsed outputs, while authorized users recover exact performance. Experiments demonstrate that LoREnc provides strong protection against model recovery with under 1% computational overhead.

[CV-71] A_3B_2: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning IJCAI2026

链接: https://arxiv.org/abs/2605.13161
作者: Yiyun Zhou,Zhonghua Jiang,Wenkang Han,Kunxi Li,Mingjing Xu,Chang Yao,Jingyuan Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by IJCAI 2026

点击查看摘要

Abstract:Efficient transfer learning methods for large-scale vision-language models (e.g., CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has not been systematically studied in image classification. Through extensive analysis, we reveal a Branch Bias issue in vision-language image classification: adapting the image encoder does not always improve performance under out-of-distribution settings. Motivated by this observation, we propose A_3B_2, an Adaptive Asymmetric Adapter that alleviates Branch Bias in few-shot learning. A_3B_2 introduces Uncertainty-Aware Adapter Dampening (UAAD), which automatically suppresses image-branch adaptation when prediction uncertainty is high, enabling soft and data-driven control without manual intervention. Architecturally, A_3B_2 adopts a lightweight asymmetric design inspired by mixture-of-experts with Load Balancing Regularization. Extensive experiments on three few-shot image classification tasks across 11 datasets demonstrate that A_3B_2 consistently outperforms 11 competitive prompt- and adapter-based baselines.

[CV-72] Unifying Physically-Informed Weather Priors in A Single Model for Image Restoration Across Multiple Adverse Weather Conditions

链接: https://arxiv.org/abs/2605.13158
作者: Jiaqi Xu,Xiaowei Hu,Lei Zhu,Pheng-Ann Heng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TCSVT

点击查看摘要

Abstract:Image restoration under multiple adverse weather conditions aims to develop a single model to recover the underlying scene with high visibility. Weather-related artifacts vary with the particle’s distance to the camera according to the established scene visibility analysis, where close and faraway regions are more affected by falling drops and fog effects, respectively. Existing methods fail to consider this weather-specific physical visual process; thus, the restoration performance is limited. In this work, we analyze the common visual factors in adverse weather conditions and present a unified imaging model that considers the individually visible particles and fog-like aggregate scattering effects. Further, we design a novel weather-prior-based network, which leverages the weather-related prior information to help recover the scene by enhancing the features using the estimated occlusion and transmission. Experimental results in multiple adverse scenarios show the superiority of our method against state-of-the-art methods.

[CV-73] Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

链接: https://arxiv.org/abs/2605.13156
作者: Jiaxin Liu,Ding Zhong,Yue Wang,Zhidong Yang,Zhaolu Kang,Guangyuan Dong,Qishi Zhan,Pengcheng Fang,Aofan Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have demonstrated remarkable capabilities in bridging visual perception and natural language understanding, enabling a wide range of multimodal reasoning tasks. However, they often produce object hallucinations, describing content absent from the input image, which limits their reliability and interpretability. To address this limitation, we propose Dual-Pathway Circuit Analysis, a framework that identifies and characterizes hallucination-related circuits in VLMs for mechanistic understanding and causal probing. We first apply activation patching across five architecturally diverse VLMs to identify a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs. We then introduce Conditional Pathway Analysis (CPA) to characterize pathway-level interactions, revealing that grounding components remain strongly redundant in both correct and hallucinating samples but undergo a consistent polarity flip, shifting from supporting the ground truth on correct samples to aligning with the hallucinated answer on erroneous ones. We further perform targeted suppression of hallucination-pathway components, showing that scaling these components reduces object hallucination by up to 76% with minimal accuracy cost, and validate that the same circuit selectively transfers to relational but not attribute hallucination. Evaluations on POPE-adversarial and AMBER show that the identified circuits are consistent across architectures, support causal intervention, and transfer selectively across hallucination types.

[CV-74] Pareto-Guided Optimal Transport for Multi-Reward Alignment ICML2026

链接: https://arxiv.org/abs/2605.13155
作者: Ying Ba,Tianyu Zhang,Mohan Zhou,Yalong Bai,Wenyi Mo,Guiwei Zhang,Bing Su,Ji-Rong Wen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Text-to-image generation models have achieved remarkable progress in preference optimization, yet achieving robust alignment across diverse reward models remains a significant challenge. Existing multi-reward fusion approaches rely on weighted summation, which is costly to tune and insufficient for balancing conflicting objectives. More critically, optimization with reward models is highly susceptible to reward hacking, where reward scores increase while the perceived quality of generated images deteriorates. We demonstrate that optimizing against a unified global target under heterogeneous reward upper bounds can induce reward hacking, a risk further exacerbated by the inherent instability of weak reward models. To mitigate this, we propose a Pareto Frontier-Guided Optimal Transport (PG-OT) framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport. Furthermore, we develop both online and offline optimization strategies tailored to diverse reward signal characteristics. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.
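The abstract introduces the Joint Domination Rate (JDR) without a formula. One plausible Pareto-dominance reading is sketched below (the function names and the pairing-with-a-baseline definition are assumptions; the paper's exact metric may differ):

```python
import numpy as np

def dominates(a, b):
    """a Pareto-dominates b: at least as good on every reward,
    strictly better on at least one."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a >= b) and np.any(a > b))

def joint_domination_rate(model_rewards, baseline_rewards):
    """Fraction of paired generations where the model's reward vector
    Pareto-dominates the baseline's across all reward models -- one
    illustrative way to quantify multi-reward synergy."""
    wins = [dominates(m, b) for m, b in zip(model_rewards, baseline_rewards)]
    return float(np.mean(wins))
```

Under this reading, reward hacking shows up as samples that gain on some rewards while dropping on others, which never count as dominated pairs.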

[CV-75] EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision CVPR2026

链接: https://arxiv.org/abs/2605.13152
作者: Jiahao Chen,Zihui Zhang,Yafei Yang,Jinxi Li,Shenxing Wei,Zhixuan Sun,Bo Yang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: CVPR 2026. Code and data are available at: this https URL

点击查看摘要

Abstract:We introduce EvObj for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real-world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two innovative modules: (1) An object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) An object completion module that reconstructs partial geometries after discovering objects. We conduct extensive experiments on both real-world and synthetic datasets, demonstrating superior 3D object segmentation performance over all baselines while achieving state-of-the-art results.

[CV-76] GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation ICLR2026

链接: https://arxiv.org/abs/2605.13151
作者: Jiyong Rao,Yu Wang,Shengjie Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ICLR 2026

点击查看摘要

Abstract:Category-agnostic pose estimation (CAPE) aims to localize keypoints on query images from arbitrary categories, using only a few annotated support examples for guidance. Recent approaches either treat keypoints as isolated entities or rely on manually defined skeleton priors, which are costly to annotate and inherently inflexible across diverse categories. Such oversimplification limits the model’s capacity to capture instance-wise structural cues critical for accurate pixel-level localization. To overcome these limitations, we propose GenCape, a Generative-based framework for CAPE that infers keypoint relationships solely from image-based support inputs, without additional textual descriptions or predefined skeletons. Our framework consists of two principal components: an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. The former infers soft, instance-specific adjacency matrices from support features through variational inference, embedded layer-wise into the Graph Transformer Decoder for progressive structural priors refinement. The latter adaptively aggregates multiple latent graphs into a query-aware structure via Bayesian fusion and attention-based reweighting, enhancing resilience to visual uncertainty and support-induced bias. This structure-aware design facilitates effective message propagation among keypoints and promotes semantic alignment across object categories with diverse keypoint topologies. Experimental results on the MP-100 dataset show that our method achieves substantial gains over graph-support baselines under both 1- and 5-shot settings, while maintaining competitive performance against text-support counterparts.

[CV-77] Understanding Generalization through Decision Pattern Shift

链接: https://arxiv.org/abs/2605.13148
作者: Huiqi Deng,Yibo Li,Quanshi Zhang,Peng Zhang,Hongbin Pei,Xia Hu
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 12 figures, computer vision and pattern recognition

点击查看摘要

Abstract:Understanding why deep neural networks (DNNs) fail to generalize to unseen samples remains a long-standing challenge. Existing studies mainly examine changes in externally observable factors such as data, representations, or outputs, yet offer limited insight into how a model’s internal decision mechanism evolves from training to test. To address this gap, we introduce Decision Pattern Shift (DPS), a new perspective that defines generalization through the stability of internal decision patterns and quantifies failure as their deviation from those learned during training. Specifically, we represent each sample’s decision pattern as a GradCAM-based channel-contribution vector, which captures how feature channels collectively support a prediction, and we propose the DPS metric to measure its discrepancy from the class-average pattern. Empirical analyses across multiple datasets and architectures show that, (i) decision patterns form a highly structured, class-consistent space with strong intra-class cohesion and low inter-class confusion, enabling direct analysis of a model’s decision logic; (ii) the DPS magnitude correlates linearly with the generalization gap (nearly all Pearson r > 0.8), revealing generalization as a systematic drift in the model’s internal decision mechanism; (iii) the DPS spectrum organizes diverse generalization degradation scenarios (covering ideal generalization, in-distribution degradation, domain shift, out-of-distribution, and shortcut learning) into a continuous trajectory, providing a unified explanation of their failure modes. These findings open up new possibilities for early generalization-risk detection, failure-mode diagnosis, and channel-level defect localization.
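A hedged sketch of the GradCAM-based channel-contribution vector and a cosine-distance reading of the DPS metric (the abstract does not specify the exact discrepancy measure, so the second function is an illustrative assumption):

```python
import numpy as np

def channel_contribution(grads, acts):
    """GradCAM-style channel-contribution vector for one sample:
    spatially averaged gradients weight spatially averaged activations,
    giving one value per channel, then L2-normalized.
    Both inputs have shape (C, H, W)."""
    weights = grads.mean(axis=(1, 2))            # (C,) channel importance
    contrib = weights * acts.mean(axis=(1, 2))   # (C,) per-channel contribution
    return contrib / (np.linalg.norm(contrib) + 1e-8)

def decision_pattern_shift(sample_pattern, class_mean_pattern):
    """DPS as cosine distance between a test sample's pattern and the
    class-average pattern learned during training -- one natural choice
    of discrepancy; the paper's metric may differ."""
    cm = class_mean_pattern / (np.linalg.norm(class_mean_pattern) + 1e-8)
    return 1.0 - float(np.dot(sample_pattern, cm))
```

Under this reading, a well-generalizing sample keeps a shift near zero, while a drifting internal decision mechanism pushes the value toward its maximum of 2.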

[CV-78] Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection

链接: https://arxiv.org/abs/2605.13140
作者: Sangin Lee,Seokjun Kwon,Jeongmin Shin,Namil Kim,Yukyung Choi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:General object detection (OD) struggles to detect objects in the target domain that differ from the training distribution. To address this, recent studies demonstrate that training from multiple source domains and explicitly processing them separately for multi-source domain adaptation (MSDA) outperforms blending them for unsupervised domain adaptation (UDA). However, existing MSDA methods learn domain-agnostic features from domain-specific RGB images while preserving domain-specific information from the domain-agnostic feature map. To address this, we propose MS-DePro: Multi-Source Detector with Depth and Prompt, composed of (1) depth-guided localization and (2) multi-modal guided prompt learning. We leverage domain-agnostic input modalities, namely depth maps and text, to encode domain-agnostic characteristics. Specifically, we utilize depth maps to generate domain-agnostic region proposals for localization and integrate multi-modal features to align learnable text embeddings for classification. MS-DePro achieves state-of-the-art performance on MSDA benchmarks, and comprehensive ablations demonstrate the effectiveness of our contributions. Our code is available on this https URL.

[CV-79] Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

链接: https://arxiv.org/abs/2605.13129
作者: Nikitas Chatzis,Marios Loizou,Evangelos Kalogerakis
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent 3D generative models can synthesize high-quality assets, but their outputs are typically static: they lack the skeletal rigs, joint hierarchies, and skinning weights required for animation. This limits their use in games, film, simulation, virtual agents, and embodied AI, where assets must not only look plausible but also move plausibly. We introduce Rigel3D, a generative method for animation-ready 3D assets represented as rigged meshes. Unlike post-hoc auto-rigging methods that attach rigs to completed shapes, our method jointly models geometry and rig structure through coupled surface and skeleton structured latent representations. A rig-aware autoencoder decodes these representations into mesh geometry, skeleton topology, joint coordinates, and skinning weights, while a two-stage latent generative model synthesizes both surface and skeleton representations for image-conditioned generation. To support downstream animation workflows, we further introduce an open-vocabulary joint labeling module that embeds generated joints into a shared vision-language space, enabling correspondence to arbitrary retargeting templates. Experiments on large-scale rigged asset datasets demonstrate that our method generates diverse, high-quality animation-ready assets and outperforms existing rigging baselines across multiple metrics.

[CV-80] Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

链接: https://arxiv.org/abs/2605.13122
作者: Jingxuan He,Xiyu Wang,Yunke Wang,Mengyu Zheng,Chang Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instruction-based image editing (IIE) models have recently demonstrated strong capability in modifying specific image regions according to natural language instructions, which implicitly requires identifying where an edit should be applied. This indicates that such models inherently perform language-conditioned visual semantic grounding. In this work, we investigate whether this implicit grounding can be leveraged for zero-shot referring image segmentation (RIS), a task that requires pixel-level localization of objects described by natural language expressions. Through systematic analysis, we reveal that strong foreground-background separability emerges in the internal representations of these models at the earliest denoising timestep, well before any visible image transformation occurs. Building on this insight, we propose a training-free framework that repurposes pretrained image editing models for RIS by exploiting their intermediate representations. Our approach decomposes localization into two complementary components: attention-based spatial priors that estimate where to focus, and feature-based semantic discrimination that determines what to segment. By leveraging feature-space separability, the framework produces accurate segmentation masks using only a single denoising step, without requiring full image synthesis. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate that our method achieves superior performance over existing zero-shot baselines.

[CV-81] Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

链接: https://arxiv.org/abs/2605.13119
作者: Zixing Lei,Changxing Liu,Yichen Xiong,Minhao Xiong,Yuanzhuo Ding,Zhipeng Zhang,Weixin Li,Siheng Chen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this burden across a high-level vision language model (VLM) agent for temporal reasoning and a family of specialized VLA tools for diverse local physical operations. The VLM handles scene analysis, global planning, and recovery, while each VLA tool executes a bounded subtask. To tightly couple agent planning with VLA tool execution in long-horizon tasks, we introduce a VLA tool-family interface that exposes explicit tool selection and in-execution progress feedback, enabling efficient event-triggered agent replanning without continuous agent polling. To obtain diverse specialized VLA tools that faithfully follow agent invocations, we further propose Tool-Aligned Post-Training (TAPT), which constructs invocation-aligned training units for instruction following and adopts tool-family residual adapters for efficient tool specialization. Experiments show that VLAs-as-Tools improves the success rate of \pi_0.5 by 4.8 points on LIBERO-Long and 23.1 points on RoboTwin, and further enhances invocation fidelity by 15.0 points as measured by Non-biased Rate. Code will be released.

[CV-82] Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

链接: https://arxiv.org/abs/2605.13111
作者: Jiayu Chen,Junbei Tang,Wenbiao Zhao,Maoliang Li,Jiayi Luo,Zihao Zheng,Jiawei Yang,Guojie Luo,Xiang Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autoregressive video generation enables streaming and open-ended long video synthesis, but still suffers from long-term degradation caused by accumulated errors. Existing KVCache strategies usually apply unified historical-frame retention, implicitly assuming homogeneous historical dependencies across attention heads. We revisit historical-frame attention and reveal three distinct head types: Anchor Heads require broad long-range context, Wave Heads exhibit periodic temporal dependencies, and Veil Heads focus on initial and adjacent frames. Based on this finding, we propose Pyramid Forcing, a head-aware pyramidal KVCache framework that identifies head types offline, assigns behavior-specific cache policies, and supports heterogeneous cache lengths via efficient ragged-cache attention. Experiments on Self Forcing and Causal Forcing show that Pyramid Forcing consistently improves long-horizon generation quality on VBench-Long, increasing the 60-second Self Forcing score from 77.87 to 81.21 while enhancing motion dynamics, visual fidelity, and semantic consistency. Project: this https URL.

[CV-83] Flow Augmentation and Knowledge Distillation for Lightweight Face Presentation Attack Detection

链接: https://arxiv.org/abs/2605.13108
作者: Muhammad Shahid Jabbar,Muhammad Sohail Ibrahim,Taha Hasan Masood Siddique,Kejie Huang,Shujaat Khan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 2026 International Conference on Automatic Face and Gesture Recognition (FG)

点击查看摘要

Abstract:Face presentation attack detection (FacePAD) remains challenging under diverse spoofing representation, including 2D print and replay, 3D mask-based spoofing, makeup-induced appearance manipulation, and physical occlusions, as well as under varying capture conditions. Motion cues are highly discriminative for FacePAD but typically require explicit optical flow estimation, which introduces substantial computational overhead and limits real-time deployment. In this work, we leverage optical flow to enhance motion representation during training while eliminating the need for flow computation at inference. We propose a dual-branch teacher model that fuses appearance cues from RGB frames with motion cues derived from colorwheel-encoded optical flow, enabling effective modeling of micro-motions and temporal consistency. To enable efficient deployment, we introduce a knowledge distillation framework that transfers motion-aware knowledge from the flow-augmented teacher to a lightweight RGB-only student via logit distillation. As a result, the student implicitly learns motion-sensitive representations without requiring explicit flow estimation or additional feature extraction blocks at inference. Extensive experiments demonstrate strong performance across multiple benchmarks, achieving 0.0% HTER on Replay-Attack and Replay-Mobile, 0.94% HTER on ROSE-Youtu, 5.65% HTER on SiW-Mv2, and 0.42% ACER on OULU-NPU. The distilled student achieves performance comparable to or better than the teacher while significantly reducing parameters and FLOPs, achieving 52 FPS on an NVIDIA Jetson Orin Nano, indicating its suitability for real-time and resource-constrained FacePAD deployment.
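
The abstract transfers motion-aware knowledge from the flow-augmented teacher to the RGB-only student via logit distillation. As a rough, runnable illustration of that mechanism — the standard temperature-scaled KL formulation; the paper's exact loss, temperature, and weighting are not given in the abstract — a minimal NumPy sketch:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Temperature-scaled KL divergence from teacher to student distributions.

    Hinton-style logit distillation: KL(p_teacher || p_student) on softened
    probabilities, scaled by T^2 so gradients stay comparable across T.
    """
    p_t = softmax(teacher_logits / T)
    log_p_s = np.log(softmax(student_logits / T) + 1e-12)
    log_p_t = np.log(p_t + 1e-12)
    return float((p_t * (log_p_t - log_p_s)).sum(axis=-1).mean() * T * T)

# When the student already matches the teacher, the distillation loss is ~0.
logits = np.array([[2.0, -1.0, 0.5]])
assert abs(kd_loss(logits, logits)) < 1e-9
```

In practice this term is combined with the ordinary cross-entropy on ground-truth labels; the mixing weight used in the paper is not stated in the abstract.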

[CV-84] RoSplat: Robust Feed-Forward Pixel-wise Gaussian Splatting for Varying Input Views and High-Resolution Rendering

链接: https://arxiv.org/abs/2605.13093
作者: Hoang Chuong Nguyen,Renjie Wu,Jose M. Alvarez,Miaomiao Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generalizable 3D Gaussian Splatting has recently emerged as an efficient approach for novel-view synthesis, enabling feed-forward synthesis from only a few input views. However, existing pixel-wise feed-forward methods suffer from over-bright renderings when the number of input views varies during inference, as well as insufficient supervision for accurate Gaussian scale estimation, which leads to hole artifacts, particularly in high-resolution renderings. To address these issues, we identify that the over-brightness is caused by the varying number of overlapping Gaussians and propose a simple alpha normalization strategy to maintain brightness consistency across different number of input views. In addition, we introduce an auxiliary 3D sampling-based regularizer to improve Gaussian scale estimation, thereby mitigating hole artifacts in high-resolution rendering. Experiments on benchmark datasets demonstrate that our method significantly improves baseline models under varying input-view and high-resolution rendering settings.
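
The abstract attributes over-brightness to the varying number of overlapping Gaussians and proposes a simple alpha normalization strategy, without specifying it further. One plausible per-pixel reading — renormalize the compositing weights so accumulated opacity no longer scales output brightness — can be sketched with a toy 1-D `composite` helper (the helper and this exact normalization are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def composite(colors, alphas, normalize=False):
    """Front-to-back alpha compositing of per-pixel Gaussian samples.

    colors: (N,) sample colors; alphas: (N,) opacities in [0, 1].
    With normalize=True, weights are renormalized so output brightness
    does not grow with the number of overlapping samples.
    """
    # Transmittance before each sample: prod of (1 - alpha) of samples in front.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    w = alphas * trans                 # standard compositing weights
    if normalize:
        w = w / max(w.sum(), 1e-8)     # alpha normalization
    return float((w * colors).sum())

# More overlapping white samples brighten the unnormalized render,
# while the normalized render stays at the sample color.
few  = composite(np.ones(2), np.full(2, 0.3))
many = composite(np.ones(8), np.full(8, 0.3))
assert many > few
assert abs(composite(np.ones(8), np.full(8, 0.3), normalize=True) - 1.0) < 1e-6
```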

[CV-85] PRA-PoE: Robust Alzheimer's Diagnosis with Arbitrary Missing Modalities MICCAI2026

链接: https://arxiv.org/abs/2605.13081
作者: Guangqian Yang,Ye Du,Wenlong Hou,Qian Niu,Shujun Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Early accepted by MICCAI 2026

点击查看摘要

Abstract:Missing modalities are prevalent in real-world Alzheimer’s disease (AD) assessment and pose a significant challenge to multimodal learning, particularly when the distribution of observed modality subsets differs between training and deployment. Such missingness pattern mismatch induces a conditional representation shift across modality subsets. Existing approaches that rely on implicit imputation or modality synthesis often fail to explicitly model modality availability and uncertainty, leading to overconfident dependence on synthesized features, reduced robustness, and miscalibrated uncertainty estimates. To address these limitations, we propose PRA-PoE, an incomplete multimodal learning framework that is equipped with Prototype-anchored Representation Alignment (PRA) and an Uncertainty-aware Product of Experts (UA-PoE) fusion mechanism. First, PRA uses learnable global prototypes and availability-conditioned tokens to encode modality availability, distinguish observed from missing modalities, re-synthesize features for missing modalities, and adaptively refine observed representations to align latent spaces across modality subsets, with the goal of reducing representation shift under varying missingness patterns. Second, UA-PoE models each modality as a Gaussian expert and performs closed-form Product of Experts fusion, where experts with higher uncertainty are automatically down-weighted via lower precision, improving uncertainty reliability. We evaluate PRA-PoE under a clinically realistic protocol by training with naturally missing data and testing on all non-empty modality combinations. PRA-PoE consistently outperforms the state-of-the-art across datasets, achieving a 5.4% relative improvement in average accuracy on ADNI and a 10.9% relative gain in average F1 on OASIS-3 over the strongest baseline across all non-empty modality subsets.
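
The UA-PoE fusion described above has a standard closed form for Gaussian experts: the fused precision is the sum of expert precisions, so high-uncertainty modalities are automatically down-weighted. A minimal 1-D sketch of that closed form (feature dimensions and any learned precision calibration in the paper are omitted):

```python
import numpy as np

def poe_fuse(mus, sigmas):
    """Closed-form Product of Experts for 1-D Gaussian experts.

    Each modality contributes N(mu_i, sigma_i^2). The product of Gaussians
    is Gaussian with precision = sum of expert precisions and a
    precision-weighted mean, so uncertain experts contribute less.
    """
    prec = 1.0 / np.square(sigmas)              # per-expert precision
    fused_var = 1.0 / prec.sum()
    fused_mu = fused_var * (prec * mus).sum()   # precision-weighted mean
    return fused_mu, np.sqrt(fused_var)

# A confident expert (sigma=0.1) dominates an uncertain one (sigma=10).
mu, sigma = poe_fuse(np.array([1.0, -5.0]), np.array([0.1, 10.0]))
assert abs(mu - 1.0) < 0.01 and sigma < 0.1
```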

[CV-86] Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

链接: https://arxiv.org/abs/2605.13080
作者: Junha Song,Byeongho Heo,Geonmo Gu,Jaegul Choo,Dongyoon Han,Sangdoo Yun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we introduce Gaze Attention, a novel mechanism that enables MLLMs to selectively attend to task-relevant visual regions during generation. Specifically, we spatially group visual embeddings-stored as key-value caches-into compact gaze regions, each represented by a lightweight descriptor. At each decoding step, the model dynamically selects the most relevant regions and restricts attention to them, reducing redundant computation while enhancing focus. To mitigate the loss of global context caused by localized attention, we further propose learnable context tokens appended to each image or frame, allowing the model to maintain holistic visual awareness. Extensive experiments on image and video understanding benchmarks demonstrate that Gaze Attention matches or surpasses dense-attention baselines, while using up to 90% fewer visual KV entries in the attention computation.
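
A toy sketch of the selection mechanism the abstract describes: group KV entries into spatial regions, score each region's lightweight descriptor against the current query, and attend only over the top-k regions. The mean-pooled descriptor and dot-product scoring rule here are illustrative assumptions, not the paper's learned design:

```python
import numpy as np

def gaze_attention(query, region_keys, region_vals, descriptors, k=2):
    """Attend only to the k regions whose descriptors best match the query.

    region_keys/region_vals: (R, T, d) KV entries grouped into R regions;
    descriptors: (R, d), one lightweight summary vector per region.
    """
    picked = np.argsort(descriptors @ query)[-k:]          # top-k relevant regions
    K = region_keys[picked].reshape(-1, query.shape[0])    # gather selected KV only
    V = region_vals[picked].reshape(-1, query.shape[0])
    scores = K @ query / np.sqrt(query.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()                                           # softmax over kept tokens
    return w @ V

rng = np.random.default_rng(0)
R, T, d = 6, 4, 8
keys = rng.normal(size=(R, T, d))
vals = rng.normal(size=(R, T, d))
desc = keys.mean(axis=1)                                   # mean-pooled descriptor
out = gaze_attention(rng.normal(size=d), keys, vals, desc, k=2)
assert out.shape == (d,)
```

With k regions of T tokens each, the attention touches k*T KV entries instead of R*T, which is the source of the claimed reduction in visual KV usage.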

[CV-87] HarmoGS: Robust 3D Gaussian Splatting in the Wild via Conflict-Aware Gradient Harmonization

链接: https://arxiv.org/abs/2605.13073
作者: Yulei Kang,Tianze Zhu,Jian-Fang Hu,Jianhuang Lai,Wei-Shi Zheng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In-the-wild 3D Gaussian Splatting remains challenging due to transient distractors and illumination-induced cross-view appearance inconsistencies. Existing methods mainly rely on image-level masking to suppress unreliable supervision, but masking alone cannot fully eliminate residual occlusions or resolve illumination-induced inconsistencies, both of which can introduce conflicting cross-view gradients. These unresolved conflicts may destabilize Gaussian optimization and lead to visible reconstruction artifacts. We propose a conflict-aware 3DGS framework that addresses this problem from both image-space supervision and gradient-level optimization. Semantic Consistency-Guided Masking learns pixel-wise consistency scores to adaptively refine prior masks and suppress unreliable supervision before gradient formation. A dual-view Conflict-Aware Gradient Harmonization strategy further reconciles view-specific gradients by mutually rotating them into an orthogonal configuration, reducing negative directional interference across views. We also introduce conflict-aware densification and pruning to stabilize Gaussian growth and remove persistently conflicting primitives. Extensive experiments on standard in-the-wild benchmarks demonstrate that our method achieves state-of-the-art rendering quality under complex transient distractors and cross-view inconsistencies.

[CV-88] Edit-Compass EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

链接: https://arxiv.org/abs/2605.13062
作者: Xuehai Bai,Yang Shi,Yi-Fan Zhang,Xuanyu Zhu,Yuran Wang,Yifan Dai,Xinyu Liu,Yiyan Ji,Xiaoling Gu,Yuanxing Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.

[CV-89] BrainAnytime: Anatomy-Aware Cross-Modal Pretraining for Brain Image Analysis with Arbitrary Modality Availability MICCAI2026

链接: https://arxiv.org/abs/2605.13059
作者: Guangqian Yang,Tong Ding,Wenlong Hou,Yue Xun,Ye Du,Qian Niu,Shujun Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Early accepted by MICCAI 2026

点击查看摘要

Abstract:Clinical diagnostic workups typically follow a modality escalation pathway: after initial clinical evaluation, clinicians begin with routine structural imaging (e.g., MRI), selectively add sequences such as FLAIR or T2 to refine the differential, and reserve molecular imaging (e.g., amyloid-PET) for cases that remain uncertain after standard evaluation. Consequently, patients are observed with heterogeneous and often incomplete modality subsets. However, most current AI models assume fixed data modalities as the model inputs. In this paper, we present BrainAnytime, a unified pretraining framework pretrained on 34,899 3D brain scans from five datasets that support brain image analysis under arbitrary modality availability spanning multi-sequence MRI and amyloid-PET. A single model accepts whatever imaging is available, from a lone T1 scan to a full multimodal workup. Pretraining learns structural-molecular correspondences between MRI and PET via cross-modal distillation (RCMD) and prioritizes disease-vulnerable anatomy via atlas-guided curriculum masking (PACM), all within a shared 3D masked autoencoder (Multi-MAE3D). Across four downstream tasks and five clinically motivated modality settings, BrainAnytime largely outperforms modality-specific models, missing-modality baselines, and large-scale brain MRI pretrained foundation models on most modality settings. Notably, it surpasses the strongest missing-modality baselines with relative improvements of 6.2% and 7.0% in average accuracy on CN vs. AD and CN vs. MCI classification, respectively. Code is available at this https URL.

[CV-90] Uncertainty-aware Spatial-Frequency Registration and Fusion for Infrared and Visible Images

链接: https://arxiv.org/abs/2605.13049
作者: Xingyuan Li,Haoyuan Xu,Xingyue Zhu,Jun Ma,Yang Zou,Zhiying Jiang,Jinyuan Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Infrared and Visible Image Fusion (IVIF) has shown promise in visual tasks under challenging environments, but fusion under unregistered conditions faces inherent misalignments. Current studies address them by either predicting the deformation parameters coarse-to-fine (i.e., coarse registration followed by fine registration) or estimating the deformation fields at multiple scales for registration. Though straightforward, they overlook the cumulative errors in registration, which contaminate the fusion stage and severely deteriorate the resulting images. We introduce the Spatial-Frequency Registration and Fusion (SFRF) framework, which incorporates uncertainty estimation and infrared thermal radiation distribution consistency into a unified pipeline to handle the error accumulation for robust registration and fusion across both spatial and frequency domains. Specifically, SFRF constructs a Multi-scale Iterative Registration (MIR) framework that iteratively refines the deformation field across scales, leveraging uncertainty estimation at each stage to mitigate error accumulation and enhance alignment accuracy dynamically. To ensure the accurate alignment of infrared thermal distributions during registration, thermal radiation distribution consistency is employed as a frequency-domain supervisory signal, promoting global consistency in the frequency domain. Based on the spatial-frequency alignment, SFRF further adopts a Dual-branch Spatial-Frequency Fusion (DSFF) module, which incorporates spatial geometric features and frequency distribution information to reconstruct visually appealing images. SFRF achieves impressive performance across diverse datasets.

[CV-91] Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

链接: https://arxiv.org/abs/2605.13047
作者: Ziqi Wen,Parsa Madinei,Miguel P. Eckstein
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model’s size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at this https URL.
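
The core CSS quantity — the semantic shift induced by causally ablating an object — can be illustrated with a toy embedder. The paper presumably compares learned semantic embeddings of scene descriptions before and after object removal; the bag-of-words `embed` and tiny vocabulary below are purely hypothetical stand-ins:

```python
import numpy as np

def css_score(embed_fn, scene, scene_without_object):
    """Counterfactual semantic saliency: 1 - cosine similarity between the
    semantic embedding of the full scene and of the scene with one object removed."""
    a, b = embed_fn(scene), embed_fn(scene_without_object)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 1.0 - float(cos)

# Toy bag-of-words embedder over a tiny hypothetical vocabulary.
vocab = ["dog", "park", "bench", "tree"]
embed = lambda text: np.array([text.count(w) for w in vocab], dtype=float)

# Removing a semantically central object shifts the description more
# than removing a minor one, yielding a higher saliency score.
full = "dog dog park bench tree"
assert css_score(embed, full, "park bench tree") > css_score(embed, full, "dog dog park tree")
```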

[CV-92] EgoForce: Robust Online Egocentric Motion Reconstruction via Diffusion Forcing

链接: https://arxiv.org/abs/2605.13041
作者: Inwoo Hwang,Donggeun Lim,Hojun Jang,Young Min Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:With recent advances in embodied agents and AR devices, egocentric observations are readily available as input for real-world interactive online applications. However, egocentric viewpoints can only sporadically observe hands, in addition to the estimated head trajectory. We propose EgoForce, an online framework for reconstructing long-term full-body motion from noisy egocentric input. While existing generative frameworks can robustly handle noisy and sparse measurements, they assume a fixed-length observation window is available and are thus not suitable for real-time applications. Faster inference often relies on autoregressive prediction, sacrificing robustness. In contrast, we adopt a diffusion-based method with a temporally asymmetric noise schedule inspired by Diffusion Forcing. Specifically, our approach models temporally evolving uncertainty and incrementally denoises states as new streaming observations arrive. Combined with a noise-robust imputation strategy, EgoForce progressively generates stable and coherent full-body motion under strict causal constraints. Experiments demonstrate that our online framework outperforms existing online and offline methods, enabling long-horizon, full-body motion reconstruction in challenging egocentric scenarios.

[CV-93] CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy MICCAI2026

链接: https://arxiv.org/abs/2605.13038
作者: Liangjing Shao,Beilei Cui,Hongliang Ren
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Early Accepted by MICCAI 2026

点击查看摘要

Abstract:Geometric estimation including depth estimation and scene reconstruction is a crucial technique for colonoscopy which can provide surgeons with 3D spatial perception and navigation. However, geometric ground truth in colonoscopy is difficult to obtain due to narrow and enclosed space of the colon, while there is a large feature gap between simulated data and realistic data caused by artifacts and illumination. In this paper, we present CoGE, a novel framework for online monocular geometric estimation during colonoscopy. Firstly, we propose an illumination-aware supervision module based on the Retinex theory to address illumination diversity in different colonoscopy scenes. Moreover, a structure-aware perception module is proposed based on wavelet decomposition to extract common structural and local features of the colon. Both quantitative and qualitative results demonstrate that the proposed model solely trained on simulated data achieves state-of-the-art performance in geometric estimation for both simulated and realistic scenes.

[CV-94] PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution

链接: https://arxiv.org/abs/2605.13027
作者: Zihang Xu,Xiaoyang Liu,Zheng Chen,Yulun Zhang,Xiaokang Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL

点击查看摘要

Abstract:Text image super-resolution (Text-SR) requires more than visually plausible detail synthesis: slight errors in stroke topology may alter character identity and break readability. Existing methods improve text fidelity with stronger recognition-based or generative priors, yet they still face two unresolved challenges under severe degradation: the text condition extracted from low-quality inputs can itself be unreliable, and a plausible global prior does not fully determine fine-grained stroke boundaries. We present PRISM, a single-step diffusion-based Text-SR framework that addresses these two challenges through Flow-Matching Prior Rectification (FMPR) and a Structure-guided Uncertainty-aware Residual Encoder (SURE). FMPR constructs a privileged training-time prior from paired low-quality/high-quality latents and learns a flow matching that transports degraded embeddings toward this restoration-oriented prior space, yielding more accurate and reliable global text guidance. SURE further predicts uncertainty-aware structural residuals to selectively absorb reliable local boundary evidence while suppressing ambiguous stroke cues. Together, these components enable explicit global prior rectification and local structure refinement within a single diffusion restoration pass. Experiments on both synthetic and real-world benchmarks show that PRISM achieves state-of-the-art performance with millisecond-level inference. Our dataset and code will be available at this https URL.

[CV-95] OCH3R: Object-Centric Holistic 3D Reconstruction

链接: https://arxiv.org/abs/2605.13018
作者: Yi Du,Yang You,Xiang Wan,Leonidas Guibas
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object-centric scene understanding is a fundamental challenge in computer vision. Existing approaches often rely on multi-stage pipelines that first apply pre-trained segmentors to extract individual objects, followed by per-object 3D reconstruction. Such methods are computationally expensive, fragile to segmentation errors, and scale poorly with scene complexity. We introduce OCH3R, a unified framework for Object-Centric Holistic 3D Reconstruction from a single RGB image. OCH3R performs one forward pass to simultaneously predict all object instances with their 6D poses and detailed 3D reconstructions. The key idea is a transformer architecture that predicts per-pixel attributes, including CLIP-based category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians representing each object. To supervise these Gaussian reconstructions, we transform them into canonical space using the predicted 6D poses and align them with pre-rendered canonical ground truth, avoiding costly per-image Gaussian label generation. On standard indoor benchmarks, OCH3R achieves state-of-the-art performance across monocular depth estimation, open-vocabulary semantic segmentation, and RGB-only category-level 6D pose estimation, while producing high-fidelity, editable per-object reconstructions. Crucially, inference is fully feed-forward and scales independently of the number of objects, offering orders-of-magnitude speedups over conventional multi-stage pipelines in cluttered scenes.

[CV-96] Amortized Guidance for Image Inpainting with Pretrained Diffusion Models

链接: https://arxiv.org/abs/2605.13010
作者: Yilie Huang,Xun Yu Zhou
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We study image inpainting with generative diffusion models. Existing methods typically either train dedicated task-specific models, or adapt a pretrained diffusion model separately for each masked image at deployment. We introduce a middle-ground model, termed Amortized Inpainting with Diffusion (AID), which keeps a pretrained diffusion backbone fixed, trains a small reusable guidance module offline, and then reuses it across masked images without per-instance optimization. We formulate it as a deterministic guidance problem with a supervised terminal objective. To make this problem learnable in high dimensions, we derive an auxiliary Gaussian formulation and prove that solving this randomized problem recovers the optimal deterministic guidance field. This bridge yields a principled continuous-time actor–critic algorithm for learning the guidance module in a fully data-driven manner. Empirically, on AFHQv2 and FFHQ under the pixel EDM pipeline and on ImageNet under the latent EDM2 pipeline, AID consistently improves the quality–speed trade-off over strong fixed-backbone and amortized inpainting baselines across multiple mask types, while adding less than one percent trainable overhead.

[CV-97] ImageAttributionBench: How Far Are We from Generalizable Attribution?

链接: https://arxiv.org/abs/2605.12967
作者: Tingshu Mou,Zhipeng Wei,Chao Gong,Jingjing Chen,Xingjun Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of generative AI has enabled the creation of highly realistic and diverse synthetic images, posing critical challenges for image provenance and misinformation detection. This underscores the urgent need for effective image attribution. However, existing attribution datasets are constrained by limited scale, outdated generation methods, and insufficient semantic diversity - hindering the development of robust and generalizable attribution models. To address these limitations, we introduce ImageAttributionBench, a comprehensive dataset comprising images synthesized by a wide array of advanced generative models with state-of-the-art (SOTA) architectures. Covering multiple real-world semantic domains, the dataset offers rich diversity and scale to support and accelerate progress in image attribution research. To simulate real-world attribution scenarios, we evaluate several SOTA attribution methods on ImageAttributionBench under two challenging settings: (1) training on a standard balanced split and testing on degraded images, and (2) training and testing on semantically disjoint splits. In both cases, current methods exhibit consistently poor performance, revealing significant limitations in their robustness and generalization to unseen semantic content. Our work provides a rigorous benchmark to facilitate the development and evaluation of future image attribution methods.

[CV-98] Asymmetric Flow Models

链接: https://arxiv.org/abs/2605.12964
作者: Hansheng Chen,Jan Ackermann,Minseo Kim,Gordon Wetzstein,Leonidas Guibas
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL Webpage: this https URL

点击查看摘要

Abstract:Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From this asymmetric prediction, AsymFlow analytically recovers the full-dimensional velocity without changing the network architecture or training/sampling procedures. On ImageNet 256 \times 256, AsymFlow achieves a leading 1.57 FID, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. AsymFlow also provides the first-ever route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace to the latent space gives a seamless initialization that preserves the latent model’s high-level semantics and structure, so finetuning mainly improves low-level mismatches rather than relearning pixel generation. We show that the pixel AsymFlow model finetuned from FLUX.2 klein 9B establishes a new state of the art for pixel-space text-to-image generation, beating its latent base on HPSv3, DPG-Bench, and GenEval while qualitatively showing substantially improved visual realism.
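
Under the common linear-interpolation flow-matching convention x_t = (1-t)*eps + t*x1 with velocity v = x1 - eps, restricting the noise estimate to a rank-r subspace still allows the full-dimensional velocity to be recovered in closed form. The orthonormal basis U and this exact convention are assumptions for illustration only; the abstract does not spell out AsymFlow's parameterization:

```python
import numpy as np

def asym_velocity(x1_pred, eps_low, U):
    """Recover a full-dimensional velocity from an asymmetric prediction.

    x1_pred: (d,) full-dimensional data prediction;
    eps_low: (r,) low-rank noise prediction in the subspace spanned by
    the orthonormal columns of U (d, r). Velocity: v = x1 - U @ eps_low.
    """
    return x1_pred - U @ eps_low

rng = np.random.default_rng(1)
d, r = 16, 3
U, _ = np.linalg.qr(rng.normal(size=(d, r)))   # orthonormal low-rank basis
x1 = rng.normal(size=d)
eps_low = rng.normal(size=r)
v = asym_velocity(x1, eps_low, U)
# With x1 fixed, the noise contribution to v stays entirely in span(U):
residual = (v - x1) - U @ (U.T @ (v - x1))
assert np.allclose(residual, 0.0)
```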

[CV-99] Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering

链接: https://arxiv.org/abs/2605.12961
作者: Feijiang Li,Zhenxiong Li,Jieting Wang,Zizheng Jiu,Saixiong Liu,Liang Du
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Image clustering aims to partition unlabeled image datasets into distinct groups. A core aspect of this task is constructing and leveraging prior knowledge to guide the clustering process. Recent approaches introduce semantic descriptions as prior information, most of which rely on matching-based techniques with predefined vocabularies. However, the limited matching space restricts their adaptability to downstream clustering tasks. Moreover, these methods primarily focus on reducing bias to improve performance, frequently overlooking the importance of variance reduction. To address these limitations, we propose GSEC (Image Clustering based on Generative Semantic Guidance and Bi-Layer Ensemble), a framework designed to reduce bias through generative semantic guidance and mitigate variance via ensemble learning. Our method employs Multimodal Large Language Models to generate semantic descriptions and derive image embeddings via weighted averaging. Additionally, a bi-layer ensemble strategy integrates cross-modal information through BatchEnsemble in the inner layer and aligns outputs via an alignment mechanism in the outer layer. Comparative experiments demonstrate that GSEC outperforms 18 state-of-the-art methods across six benchmark datasets, while further analysis confirms its effectiveness in simultaneously reducing both bias and variance. The code is available at this https URL.

[CV-100] GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

链接: https://arxiv.org/abs/2605.12957
作者: Hanxin Zhu,Cong Wang,Peiyan Tu,Jiayi Luo,Tianyu He,Xin Jin,Zhibo Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, facilitating a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. While achieving remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction with limited modeling of the underlying geometry, leading to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method following a Geometry-Then-Appearance paradigm. Specifically, given a single input image, to improve the structural fidelity of synthesized 3D scenes, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments have demonstrated that our proposed method consistently outperforms existing approaches in terms of fidelity, visual quality, and geometric accuracy. Moreover, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, as well as supporting multiple downstream applications and exhibiting favorable data efficiency during model training, highlighting its versatility and broad applicability. Project page: this https URL.

[CV-101] AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

链接: https://arxiv.org/abs/2605.12954
作者: Xiao Yang,Yingzhe Ma,Haoxuan Yu,Zixin Li,Ning Qin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures. Authors Xiao Yang and Yingzhe Ma contributed equally

点击查看摘要

Abstract:Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency. We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in memory, AdaFocus introduces an uncertainty-triggered refinement mechanism. It performs targeted look-back only when the model is not confident, retrieving high-resolution evidence directly from disk via a zero-cache I/O design. This turns discarded visual details from an irreversible loss into on-demand recoverable evidence without paying the cost of exhaustive preloading. Experiments on seven standard long-video benchmarks show that AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design. These findings suggest that progressive preview combined with zero-cache evidence refinement is a highly effective paradigm for scalable multimedia reasoning.

[CV-102] Seg-Agent : Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

链接: https://arxiv.org/abs/2605.12953
作者: Chao Hao,Jun Xu,Ji Du,Shuo Ye,Ziyue Qiao,Xiaodong Cun,Guangcong Wang,Xubin Zheng,Zitong Yu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.

[CV-103] Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation

链接: https://arxiv.org/abs/2605.12952
作者: Yongjin Cui,Xiaohui Fan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Grad-ECLIP was published at ICML 2024 and represents a new, intermediate-features-based technical route for Transformer interpretation. First, this paper demonstrates that the intermediate-features-based technical route is not a novel one. Based on the existing attention-based route, we have developed Attention-ECLIP, which is completely equivalent to Grad-ECLIP but with simpler computation. Through both formal derivation and experimental validation, we prove that the intermediate-features-based route represented by Grad-ECLIP is actually an equivalent variant of the attention-based route. Next, this paper demonstrates that the Grad-ECLIP method is flawed. The model interpretation results obtained by Grad-ECLIP are not those of the original model, and the interpretation results are misaligned with the model’s performance. We analyze the causes of Grad-ECLIP’s flaws and explicitly emphasize two fundamental principles that model interpretation should adhere to in order to avoid similar errors.

[CV-104] DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

链接: https://arxiv.org/abs/2605.12939
作者: Xianbing Sun,Jiahui Zhan,Liqing Zhang,Jianfu Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent diffusion- and flow-based VTON methods achieve strong results with pretrained generative models, but their reliance on multi-step sampling incurs high inference cost, while existing acceleration methods largely overlook the intrinsic structure of the try-on task. In this paper, we highlight a key observation: VTON outputs are highly constrained by the conditional inputs, suggesting that the conditional sampling trajectory can be much straighter than that in general image generation, making one-step generation a natural solution. However, limited task-specific data makes training from scratch impractical, forcing existing methods to fine-tune pretrained models whose objectives do not encourage such straight conditional trajectories. Thus, the deviation from an ideal straight path mainly comes from the mismatch between pretrained base models and the conditional nature of try-on generation, rather than from the task itself. Motivated by this insight, we encourage straighter VTON sampling trajectories through three targeted modifications: pure conditional transport, a garment preservation loss, and a self-consistency loss. We further introduce a one-step distillation stage. Extensive experiments show that our method achieves state-of-the-art performance with one-step sampling, establishing a new standard for efficient and high-quality VTON.

[CV-105] CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

链接: https://arxiv.org/abs/2605.12938
作者: Seonghyun Jin,Youngmin Kim,Sunwoo Park,Jong Chul Ye
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 8 figures, Under review

点击查看摘要

Abstract:Camera-conditioned video generation requires positional encoding that remains reliable under changes in camera motion, lens configuration, and scene structure. However, existing attention-level camera encodings either provide ray-only camera signals or rely on pinhole camera geometry, limiting their applicability to general camera control under the Unified Camera Model, including wide-angle and fisheye lenses. To address this limitation, we propose Curved Ray Expectation Positional Encoding (CRePE). CRePE represents each image token as a depth-aware positional distribution along its source ray, providing a Unified Camera Model-compatible positional encoding that captures the projected-path geometry induced by wide-angle and fisheye cameras. CRePE is implemented through a Geometric Attention Adapter added to frozen video DiTs, injecting token-wise scene-distance information into selected attention layers and stabilizing it with pseudo supervision from a monocular geometry foundation model. This design leads to more stable camera control and improves several geometry-aware and perceptual-quality metrics, while remaining competitive on video-quality metrics. Controlled positional-encoding ablations show a better overall average rank than a RayRoPE-style endpoint PE baseline, demonstrating the effectiveness of UCM-aware projected-path integration across diverse camera models. Furthermore, by extending the same positional-encoding pathway to external geometry control through Radial MixForcing, CRePE supports external radial-map control for scene-geometry-conditioned generation and source-video motion transfer beyond camera control.

[CV-106] Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis

链接: https://arxiv.org/abs/2605.12929
作者: Yingzhe Ma,Xiao Yang,Yuguo Yin,Zheyu Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep models operate on monocular representations. We investigate whether explicit structural correspondence improves diagnosis, and propose Anatomy-Slot to operationalize this hypothesis. Anatomy-Slot introduces an unsupervised anatomical bottleneck by decomposing patch tokens into slots and aligning slots across eyes via bidirectional cross-attention. On ODIR-5K with n=10 seeds, the method improves AUC by 4.2% over a matched ViT-L baseline (95% CIs; Wilcoxon signed-rank test, W=0, p=0.002). Pairing disruption and stress testing under Gaussian noise provide controlled tests of correspondence dependence and robustness under corruption. We further report quantitative optic disc grounding on REFUGE and cross-attention localization analysis.
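
The bidirectional cross-attention that aligns slots across eyes can be illustrated with generic scaled dot-product attention. This is an illustrative NumPy sketch with made-up shapes, not the paper's implementation:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_slots, kv_slots):
    """Scaled dot-product cross-attention: q_slots attend to kv_slots."""
    d = q_slots.shape[-1]
    attn = softmax(q_slots @ kv_slots.T / np.sqrt(d))
    return attn @ kv_slots

rng = np.random.default_rng(0)
left = rng.normal(size=(4, 16))    # 4 hypothetical slots per eye, 16-d each
right = rng.normal(size=(4, 16))

# Bidirectional: each eye's slots are refined by the other eye's slots,
# so homologous structures can be compared without pixel registration.
left_aligned = cross_attend(left, right)
right_aligned = cross_attend(right, left)
print(left_aligned.shape, right_aligned.shape)  # (4, 16) (4, 16)
```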

[CV-107] GuardMarkGS: Unified Ownership Tracing and Edit Deterrence for 3D Gaussian Splatting

链接: https://arxiv.org/abs/2605.12919
作者: Utae Jeong,Jaewan Choi,Junseok Lee,Jongheon Jeong,Sang Ho Yoon,ByoungSoo Koh,Sangpil Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) is becoming a practical representation for novel view synthesis, but its growing adoption, together with rapid advances in instruction-driven 3DGS editing, also exposes a dual copyright risk: once a 3DGS-based asset is released, it can be used without permission and manipulated through 3D editing. Existing protection methods address only one side of this problem. Watermarking can trace ownership after unauthorized use, but it cannot prevent malicious editing. Adversarial edit-deterrence methods can disrupt editing, but they do not provide evidence of ownership. To the best of our knowledge, we present the first unified protection framework for 3DGS that jointly optimizes ownership tracing and unauthorized editing deterrence. Our framework combines a scene-wide watermarking objective over all Gaussians with an adversarial objective for edit deterrence. The adversarial branch combines latent-anchor separation, denoising-trajectory diversion, and cross-attention diversion to divert the editing trajectory, while an update-saliency-motivated Gaussian selection strategy assigns stronger adversarial updates to mask-selected Gaussians, improving the balance among watermark recovery, edit deterrence, and rendering fidelity. Experiments on scenes from Mip-NeRF 360 and Instruct-NeRF2NeRF demonstrate that the proposed framework achieves a favorable balance among bit accuracy, edit deterrence, and rendering quality. These results suggest that practical copyright protection of 3DGS-based assets can be more effectively addressed by integrating ownership tracing and unauthorized editing deterrence into a single optimization framework.

[CV-108] Adaptive Conformal Prediction for Reliable and Explainable Medical Image Classification

链接: https://arxiv.org/abs/2605.12917
作者: One Octadion,Novanto Yudistira,Lailil Muflikhah
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To appear in IEA/AIE 2026 (Springer LNAI)

点击查看摘要

Abstract:Deep learning models for medical imaging often exhibit overconfidence, creating safety risks in ambiguous diagnostic scenarios. While Conformal Prediction (CP) provides distribution-free statistical guarantees, standard methods such as Regularized Adaptive Prediction Sets (RAPS) optimize for average efficiency and can mask severe failures on difficult inputs. We propose an Adaptive Lambda Criterion for RAPS that minimizes the worst-case coverage violation across prediction set size strata. On OrganAMNIST (58,850 abdominal CT images, 11 classes), standard size-optimized RAPS converges to near-deterministic behavior with stratified undercoverage on uncertain samples, while our method achieves 95.72 percent global coverage with average set size 1.09 and at least 90 percent coverage across all strata. Cross-domain validation on PathMNIST (107,180 pathology images, 9 classes) confirms generalizability. Quantitative Grad-CAM analysis (rho = -0.30, p < 1e-22) shows that multi-label predictions correspond to focused attention on anatomically ambiguous regions. These results demonstrate that the proposed method improves reliability while maintaining efficiency, making it suitable for safety-critical medical AI applications.
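
For context, the APS/RAPS family this paper builds on calibrates a quantile of nonconformity scores on held-out data, then returns the smallest top-probability prefix that reaches that quantile. A simplified split-conformal sketch on synthetic data (the randomization term, the RAPS regularizer, and the paper's adaptive lambda are all omitted):

```python
import numpy as np

def aps_score(probs, label):
    """APS nonconformity: cumulative probability mass needed to include the label."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    rank = int(np.where(order == label)[0][0])
    return cum[rank]

def prediction_set(probs, qhat):
    """Smallest top-probability prefix whose cumulative mass reaches qhat."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, qhat)) + 1
    return set(order[:k].tolist())

rng = np.random.default_rng(0)
n, C, alpha = 500, 11, 0.05           # 11 classes, as in OrganAMNIST
logits = rng.normal(size=(n, C))
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
labels = rng.integers(0, C, size=n)

# Calibrate the finite-sample-corrected (1 - alpha) quantile of scores.
scores = np.array([aps_score(p, y) for p, y in zip(probs, labels)])
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

s = prediction_set(probs[0], q)
print(len(s) >= 1)  # True: prediction sets are never empty
```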

[CV-109] Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy

链接: https://arxiv.org/abs/2605.12855
作者: Jorge Tapias Gomez,Despoina Kanata,Aneesh Rangnekar,Christina Lee,Hannah Williams,Hannah Thompson,J. Joshua Smith,Francisco Sanchez-Vega,Mert R. Sabuncu,Julio Garcia-Aguilar,Harini Veeraraghavan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 Pages, 9 figures, 2 tables

点击查看摘要

Abstract:Clinical trial studies indicate benefit of watch-and-wait (WW) surveillance for patients with rectal cancer showing a complete or near clinical response (CR) directly after treatment (restaging). However, there are no objectively accurate methods for early detection of local tumor regrowth (LR) from follow-up exams in patients undergoing WW. Hence, we developed Temporal Rectal Endoscopy Cross-attention (TREX), a longitudinal deep learning approach that combines pairs of images acquired at restaging and follow-up to distinguish CR from LR. TREX uses pretrained Swin Transformers in a siamese setting to extract features from longitudinal images and dual cross-attention to combine the features without spatial co-registration between image pairs. TREX and Swin-based baselines were trained under two settings: (a) detecting LR or CR at the last available follow-up and (b) early detection of LR at 3–6, 6–12, and 12–24 months before clinical confirmation. TREX achieved the highest accuracy in detecting LR with a high sensitivity of 97% ± 6% and a balanced accuracy of 90% ± 3%, and outperformed all baselines in early detection at both 3–6 (74% ± 1%) and 6–12 months (62% ± 4%) prior to clinical detection. Clinical validation via a surgeon survey showed that TREX matched attending-level overall accuracy (TREX: 86.21% vs. Clinicians: 87.84% ± 1.28%). Finally, we explored TREX’s ability to predict treatment response by combining pre-treatment (pre-TNT) and restaging endoscopies, achieving a balanced accuracy of 73% ± 12%. These results show that longitudinal deep learning analysis of endoscopy may improve surveillance and enable earlier identification of rectal cancer regrowth.

[CV-110] PRISM: Perinuclear Ring-based Image Segmentation Method for Acute Lymphoblastic Leukemia Classification

链接: https://arxiv.org/abs/2605.12851
作者: Larissa Ferreira Rodrigues Moreira,Leonardo Gabriel Ferreira Rodrigues,Rodrigo Moreira,André Ricardo Backes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Paper accepted for publication at the XXVI Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS 2026), Ouro Preto, MG, Brazil

点击查看摘要

Abstract:Automated analysis of peripheral blood smears for Acute Lymphoblastic Leukemia (ALL) is hindered by low contrast and substantial variability in cytoplasmic appearance, which complicate conventional membrane-based segmentation. We found that many recent approaches rely on heavy neural architectures and extensive training, but still struggle to generalize across staining and acquisition variability. To address these limitations, we propose the Perinuclear Ring-based Image Segmentation Method (PRISM), which replaces explicit cytoplasmic delineation with adaptive concentric zones constructed around the nucleus. These perinuclear regions enable the extraction of robust cytoplasmic descriptors by integrating color information with texture statistics derived from grey-level co-occurrence patterns, without requiring accurate cell-boundary detection. A calibrated stacking ensemble of traditional classifiers leverages these descriptors to achieve high performance, with an accuracy of 98.46% and a precision-recall AUC of 0.9937.
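
The perinuclear-ring idea, building concentric zones around a nucleus centroid instead of delineating the cytoplasm boundary, can be sketched with simple distance-based masks. The radii below are hypothetical placeholders; PRISM adapts its zones to each nucleus:

```python
import numpy as np

def perinuclear_rings(shape, center, radii):
    """Concentric ring masks around a nucleus centroid.

    radii: increasing boundaries; ring i covers radii[i] <= r < radii[i+1].
    """
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - center[0], xx - center[1])
    return [(r >= radii[i]) & (r < radii[i + 1]) for i in range(len(radii) - 1)]

# Toy 64x64 image with a nucleus centroid at (32, 32).
rings = perinuclear_rings((64, 64), (32, 32), [0, 8, 16, 24])

# Rings are pairwise disjoint by construction, so per-ring color/texture
# statistics (e.g. grey-level co-occurrence features) never double-count
# pixels, and no cell-boundary detection is needed.
print([int(m.sum()) for m in rings])
```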

[CV-111] AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects CVPR2026

链接: https://arxiv.org/abs/2605.12845
作者: Danrui Li,Jiahao Zhang,Bernhard Egger,Moitreya Chatterjee,Suhas Lohit,Tim K. Marks,Anoop Cherian
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instructional manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works in both assembly pose estimation and trajectory feasibility, where the latter is evaluated by our physics-based simulations.

[CV-112] FRAME: Forensic Routing and Adaptive Multi-path Evidence Fusion for Image Manipulation Detection CVPR2026

链接: https://arxiv.org/abs/2605.12826
作者: Kaixiang Zhao,Tianrun Yu,Aoxu Zhang,Junhao Su,Porter Jenkins,Amanda Hughes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026 SAFE Workshop

点击查看摘要

Abstract:The proliferation of sophisticated image editing tools and generative artificial intelligence models has made verifying the authenticity of digital images increasingly challenging, with important implications for journalism, forensic analysis, and public trust. Although numerous forensic algorithms, ranging from handcrafted methods to deep learning-based detectors, have been developed for manipulation detection, individual methods often suffer from limited robustness, fragmented evidence, or weak generalization across manipulation types and image conditions. To address these limitations, we present FRAME, a method for Forensic Routing and Adaptive Multi-path Evidence fusion for image manipulation detection. FRAME organizes diverse forensic algorithms into a multi-path analysis space, adaptively selects informative forensic paths for each input image, and fuses complementary evidence to improve detection and localization performance. By moving beyond single-method analysis and fixed fusion strategies, FRAME provides a more robust and flexible approach to image forensic reasoning while preserving interpretable forensic cues from multiple evidence sources. Experimental results demonstrate the effectiveness of FRAME across diverse manipulation scenarios. Code is available at this https URL.

[CV-113] Generative Motion In-betweening by Diffusion over Continuous Implicit Representations

链接: https://arxiv.org/abs/2605.12778
作者: Shiyu Fan,Paul Henderson,Edmond S. L. Ho
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in generative models have yielded impressive progress on motion in-betweening, allowing for more complex, varied, and realistic motion transitions. However, recent methods still exhibit noticeable limitations in preserving keyframe information and ensuring motion continuity. In this paper, we propose a novel pipeline and sampling optimization strategy for latent diffusion models (LDM) based on motion implicit neural representations (INR). By establishing a mapping between INR and sparse spatial or temporal information within latent diffusion, our model can sample the INR parameters from extremely sparse and ambiguous keyframe data and reconstruct plausible and smooth motions from the manifold. Our experiments demonstrate the superior performance of our model, which significantly improves motion generation quality in scenarios with few keyframes while ensuring both keyframe accuracy and diversity of in-between motions.

[CV-114] WildPose: A Unified Framework for Robust Pose Estimation in the Wild

链接: https://arxiv.org/abs/2605.12774
作者: Jianhao Zheng,Liyuan Zhu,Zihan Zhu,Iro Armeni
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume static scenes. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models may degrade on static-only scenes. We present WildPose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect two powerful paradigms in modern 3D vision: the rich perceptual frontend of feedforward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this with a 3D-aware update operator built on a frozen, pre-trained MASt3R feature backbone, together with a high-capacity motion mask detector that uses multi-level 3D-aware features from the same backbone. Extensive experiments show WildPose consistently outperforms prior methods across dynamic (Wild-SLAM, Bonn), static (TUM, 7-Scenes), and low-ego-motion (Sintel) benchmarks.

[CV-115] Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLM s

链接: https://arxiv.org/abs/2605.12772
作者: Andreas Maier,Jeta Sopa,Gozde Gul Sahin,Paula Perez-Toro,Siming Bayer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Workshop on Textual Information Processing Synthesis in the Wild

点击查看摘要

Abstract:Wu et al. (2026) showed that most frontier large language models (LLMs) recommend a sponsored, roughly twice-as-expensive flight when their system prompt contains a soft sponsorship cue. We reproduce their evaluation on ten open-weight chat models plus the two of their twenty-three models that are still reachable today (gpt-3.5-turbo, gpt-4o). All reported rates in this paper are produced under the same judge the original paper used (gpt-4o); we additionally store every label under an open-weight (gpt-oss-120b) and a smaller proprietary (gpt-4o-mini) judge for an ablation. Three findings emerge. First, a prose description of an LLM evaluation pipeline is not, on its own, sufficient for accurate reproduction: we surfaced three silent implementation failures that each shifted a reported rate by tens of percentage points. Second, the central claims do generalise - the gpt-3.5-turbo logistic-regression intercept of alpha = 0.81 is within four points of the original alpha = 0.86, and 200 of 200 trials on gpt-3.5-turbo and gpt-4o promote a payday lender to a financially distressed user. Third, a thirty-token user prompt that asks the assistant for a neutral comparison table first cuts sponsored recommendation from 46.9% to 1.0% averaged across our ten open-source models, and from 53.0% to 0% averaged across the two OpenAI models. AI literacy and price-comparison portals are likely market-level mitigations; the harmful-product cell is bounded by neither. Raw data, labels and analysis scripts are at this https URL.

[CV-116] Still Camouflage Moving Illusion: View-Induced Trajectory Manipulation in Autonomous Driving

链接: https://arxiv.org/abs/2605.12743
作者: Shuo Ju,Qingzhao Zhang,Huashan Chen,Xuheng Wang,Haotang Li,Wanqian Zhang,Feng Liu,Kebin Peng,Sen He
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing physical adversarial attacks on vision-based autonomous driving induce time-evolving perception errors, including biased object tracking or trajectory prediction, through (i) sophisticated physical patch inducing detection box drift when entering the view distance, or (ii) dynamically changing patches that cause different perception errors at different time. In both cases, viewing-angle variation is treated as a challenge, requiring adversarial patches to remain effective across frames under varying views, leading to complex multi-view optimization. In contrast, we show that viewing-angle variation itself can be turned into an attack tool. We design a new attack paradigm where a static, passive adversarial camouflage is mounted on a vehicle whose view-dependent appearance naturally evolves with relative motion, inducing consistent feature drift across frames. This causes the system to infer a physically plausible but incorrect trajectory, such as a false cut-in, which propagates to downstream decision-making and triggers unnecessary braking. Unlike prior approaches that require multi-view robustness or active intervention, our attack emerges from normal driving dynamics and is easy to deploy: a parked vehicle with a natural camouflage can induce hard braking in passing autonomous vehicles. We demonstrate the novel attack on nuScenes dataset, showing the effectiveness with an end-to-end success rate of up to 87.5%, measured by hard-braking events, and robustness across different scene backgrounds, victim vehicle speeds, and perception models.

[CV-117] Is Video Anomaly Detection Misframed? Evidence from LLM -Based and Multi-Scene Models

链接: https://arxiv.org/abs/2605.12725
作者: Furkan Mumcu,Michael J. Jones,Anoop Cherian,Yasin Yilmaz
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent video anomaly detection research has expanded rapidly with an emphasis on general models of normality intended to work across many different scenes. While this focus has led to improvements in scalability and multi-scene generalization, it has also shifted the field away from modeling the scene-specific and context-dependent nature of normal behavior. Contemporary approaches frequently rely on video-level weak supervision and opaque pretrained representations from multi-modal large language models (MLLMs), which encourage models to respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular environment. This trend suppresses spatial localization, introduces semantic bias, and reduces anomaly detection to a form of action recognition. In this paper, we examine whether these prevailing formulations align with the core requirements of real-world VAD, which is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns. Through targeted visual analyses and empirical evaluations, we demonstrate the practical consequences of these limitations and show that meaningful progress in VAD requires renewed focus on single-scene, spatially-aware, and explainable formulations that capture the nuanced structure of normality within individual environments.

[CV-118] Inline Critic Steers Image Editing

链接: https://arxiv.org/abs/2605.12724
作者: Weitai Kang,Xiaohang Zhan,Yizhou Wang,Mang Tik Chiu,Jason Kuen,Kangning Liu,Yan Yan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Instruction-based image editing exhibits heterogeneous difficulty not only across cases but also across regions of an image, motivating refinement approaches that allocate correction to where the model struggles. Existing refinement signals arrive late, after a fully generated image or a completed denoising step. We ask whether such a signal can act within an ongoing forward pass. To investigate this, we probe a frozen image-editing model and find that although generation capability emerges only in the last few layers, the error pattern is already set in early layers (rank correlation \rho = 0.83 with the final-layer error map). Based on this, we introduce Inline Critic, a learnable token that critiques a frozen model’s predictions at its intermediate layers and steers its hidden states to refine generation during the forward pass. A three-stage recipe is proposed to stabilize the training from learning how to critique to steering generation. As a result, we achieve state of the art on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source result on KRIS-Bench (81.92, surpassing GPT-4o). We further provide analyses showing that the critic genuinely shapes the model’s attention and prediction updates at subsequent layers.

[CV-119] MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures and Evidence

链接: https://arxiv.org/abs/2605.12703
作者: Yifan Chen,Fei Yin,Qingyan Bai,Zicheng Lin,Yujiu Yang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manuals, videos, and frame sequences before they can reason over the learned context. MMCL-Bench contains 102 tasks spanning three categories: rule system application, procedural task execution, and empirical discovery and induction. We evaluate frontier multimodal models with strict rubric-based scoring and find that current systems remain far from robust multimodal context learning, with even the strongest model solving fewer than one-third of tasks under strict evaluation. Diagnostic ablations and error analysis show that failures arise throughout the context-to-answer pipeline, including context anchoring, visual evidence extraction, context reasoning, and response construction. MMCL-Bench thus highlights multimodal context learning as an important unsolved capability bottleneck for current multimodal models.

[CV-120] No One Knows the State of the Art in Geospatial Foundation Models

链接: https://arxiv.org/abs/2605.12678
作者: Isaac Corley,Nils Lehmann,Caleb Robinson,Gabriel Tseng,Anthony Fuller,Hamed Alemohammad,Evan Shelhamer,Jennifer Marcus,Hannah Kerner
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Geospatial foundation models (GFMs) have been proposed as generalizable backbones for disaster response, land-cover mapping, food-security monitoring, and other high-stakes Earth-observation tasks. Yet the published work about these models does not give reviewers or users enough information to tell which model fits a given task. We argue that nobody knows what the current state of the art is in geospatial foundation models. The methods may be useful, but the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank them. In a 152-paper audit, we find 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94/126 papers with extractable pretraining data use a configuration no other paper uses; and 39% of GFM papers release no model weights. This lack of community standards can be solved. We propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls. These gaps are a coordination failure, not a fault of any individual lab; the authors of this paper, like many others in the GFM community, have contributed to them. Rather than just critiquing the community, we aim to provide concrete steps toward a shared understanding of how to innovate GFMs.

[CV-121] CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis

链接: https://arxiv.org/abs/2605.12650
作者: Yunsung Chung,Alex El Darzi,Carlo El Khoury,Han Feng,Nassir Marrouche,Jihun Hamm
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation diffusion models can generate photorealistic natural images, but adapting them to medical imaging remains challenging. In medical adaptation, limited labeled data can exacerbate hallucination-like and clinically implausible synthesis, while existing metrics such as FID or Inception Score do not quantify per-image alignment with pathology-relevant criteria. We introduce the Clinical Alignment Score (CAS), a foundation-model-based proxy for clinical alignment that evaluates generated images along four complementary dimensions beyond visual fidelity. Building on CAS, we propose Clinical Reward-Aligned Finetuning (CRAFT), a reward-based adaptation framework that transfers medical knowledge from multimodal large language models and vision-language models through label-conditioned prompt enrichment, clinical checklists, and differentiable reward optimization. Across four diverse modalities, CRAFT improves CAS and downstream classification performance over strong adaptation baselines. Beyond average CAS gains, CRAFT reduces the empirical low-alignment tail below a real-image reference threshold by 5.5-34.7% points relative to the strongest baseline, corresponding to a 20.4% average relative reduction across datasets. These results indicate fewer hallucination-like generations under CAS, and are corroborated by out-of-family evaluator evaluation, structured checklist auditing, memorization analysis, and a blinded physician preference study on CheXpert.

[CV-122] DIVER: Diving Deeper into Distilled Data via Expressive Semantic Recovery

链接: https://arxiv.org/abs/2605.12649
作者: Qianxin Xia,Zhiyong Shu,Wenbo Jiang,Jiawei Du,Jielei Wang,Guoming Lu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dataset distillation aims to synthesize a compact proxy dataset that is unreadable or non-raw from the original dataset for privacy protection and highly efficient learning. However, previous approaches typically adopt a single-stage distillation paradigm, which suffers from learning specific patterns that overfit on a prior architecture, consequently suppressing the expression of semantics and leading to performance degradation across heterogeneous architectures. To address this issue, we propose a novel dual-stage distillation framework called DIVER, which leverages the pre-trained diffusion model to dive deeper into DIstilled data Via Expressive semantic Recovery, an entire process of semantic inheritance, guidance, and fusion. Semantic inheritance distills high-level semantics of abstract distilled images into the latent space to filter out architecture-specific "noise" and retain the intrinsic semantics. Furthermore, semantic guidance improves the preservation of the original semantics by directing the reverse procedure. Finally, semantic fusion is designed to provide semantic guidance only during the concrete phase of the reverse process, preventing semantic ambiguity and artifacts while maintaining the guidance information. Extensive experiments validate the effectiveness and efficiency of DIVER in improving classical distillation techniques and significantly improving cross-architecture generalization, requiring processing time comparable to raw DiT on ImageNet (256×256) with only 4 GB of GPU memory usage. Code is available: this https URL.

[CV-123] MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation

链接: https://arxiv.org/abs/2605.12640
作者: Qing Cheng,Damiano Bertolini,Wei Zhang,Dong Wang,Niclas Zeller,Daniel Cremers
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ISPRS Congress 2026

点击查看摘要

Abstract:Panoptic segmentation requires the simultaneous recognition of countable thing instances and amorphous stuff regions, placing joint demands on long-range context modelling, multi-scale feature representation, and efficient dense prediction. Existing convolutional and transformer-based methods struggle to satisfy all three requirements concurrently: convolutional architectures are limited in their capacity to model long-range dependencies, while transformer-based methods incur quadratic computational cost that is prohibitive at high resolutions. In this paper, we propose MambaPanoptic, a fully Mamba-based panoptic segmentation framework that addresses these limitations through two principal contributions. First, we introduce MambaFPN, a top-down feature pyramid that leverages Mamba blocks to generate globally coherent, multi-scale feature representations with linear computational complexity. Second, we adopt a PanopticFCN-style kernel generator that produces unified thing and stuff kernels for proposal-free panoptic prediction, enhanced by a QuadMamba-based feature refinement module applied at multiple network stages. Experiments on the Cityscapes and COCO panoptic segmentation benchmarks demonstrate that MambaPanoptic consistently outperforms PanopticDeepLab and PanopticFCN under comparable model sizes, and matches or surpasses Mask2Former on Cityscapes in PQ and AP while requiring fewer parameters.
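
The proposal-free prediction mechanism described above (unified thing/stuff kernels applied to a shared feature map, in the PanopticFCN style) can be sketched shape-wise in NumPy. This is our toy illustration of the general idea, not MambaPanoptic's code, and all names are ours:

```python
import numpy as np

def kernel_masks(kernels, features):
    # Each generated kernel (one per thing instance or stuff class) acts as
    # a 1x1 convolution over the shared feature map, producing mask logits.
    # kernels: (K, C), features: (C, H, W) -> mask logits: (K, H, W)
    return np.einsum('kc,chw->khw', kernels, features)

rng = np.random.default_rng(0)
kernels = rng.standard_normal((5, 32))    # 5 hypothetical kernels
features = rng.standard_normal((32, 16, 16))
logits = kernel_masks(kernels, features)  # one mask per kernel
```

The point of the design is that things and stuff share one prediction path: adding a segment means adding a kernel, with no region proposals.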

[CV-124] Driving Intents Amplify Planning-Oriented Reinforcement Learning

链接: https://arxiv.org/abs/2605.12625
作者: Hengtong Lu,Victor Shea-Jay Huang,Chengmin Yang,Pengfei Jing,Jifeng Dai,Yan Xie,Benjin Zhu
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress. Project page: this https URL

点击查看摘要

Abstract:Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance – even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. Instantiated for end-to-end driving with eight rule-derived intents and evaluated on WOD-E2E: competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at Rater Feedback Score (RFS) 8.5 even with best-of-64; intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP 8.5) and the human-driven demonstration (8.13) for the first time; and multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by training end. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but to expand and preserve the sampling distribution being optimized.

[CV-125] MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

链接: https://arxiv.org/abs/2605.12624
作者: Yuzhou Huang,Benjin Zhu,Hengtong Lu,Victor Shea-Jay Huang,Haiming Zhang,Wei Chen,Jifeng Dai,Yan Xie,Hongsheng Li
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress. Project page: this https URL

点击查看摘要

Abstract:Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built – as isolated subtask improvements that fail to compose into coherent driving capabilities – rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces autoregressive language tokens and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A streaming design processes the driving video framewise rather than as fixed video-action chunks, while a learned memory channel carries temporal context across frames so planned trajectories evolve smoothly without redundant multi-frame VLM modeling. The unified architecture admits fast/slow execution on dense/sparse Mixture-of-Transformers (MoT) backbones via flexible self-attention context management, and exposes a measurable language-to-action route: a language-predicted driving intent steers action diffusion through classifier-free guidance (CFG), turning language-side intent into a control signal for continuous trajectory generation. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA methods by large margins, and matches VA-class throughput (16 FPS vs. RAP-DINO’s 18 FPS) while preserving natural-language interfaces.

[CV-126] Action Emergence from Streaming Intent

链接: https://arxiv.org/abs/2605.12622
作者: Pengfei Jing,Victor Shea-Jay Huang,Hengtong Lu,Jifeng Dai,Xie Yan,Benjin Zhu
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress. Project page: this https URL

点击查看摘要

Abstract:We formalize action emergence as a target capability for end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in arbitrary, long-tail traffic scenes through scene-conditioned reasoning rather than retrieval or interpolation of learned scene-action mappings. We show that previous paradigms cannot deliver action emergence: autoregressive trajectory decoders collapse the inherently multimodal future into a single averaged output, while diffusion and flow-matching generators express multimodality but are not steerable by reasoned intent. We propose Streaming Intent as a concrete way to approach action emergence: a mechanism that makes driving intent (i) semantically streamed through a continuous chain-of-thought that causally derives the intent from scene understanding, and (ii) temporally streamed across clips so that intent commitments remain coherent along the driving horizon. We realize Streaming Intent in a VLA model we call SI (Streaming Intent). SI autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, requiring only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI achieves competitive aggregate performance, with an RFS score of 7.96 on the validation set and 7.74 on the test set. Beyond aggregate metrics, the model demonstrates – to our knowledge for the first time in a fully end-to-end VLA – intent-faithful controllability: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans, arising purely from data-driven learning without any pre-built trajectory bank or hand-coded post-hoc selector.
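
The intent-to-control route described here, a decoded intent steering a flow-matching action head via classifier-free guidance (CFG) over two denoising steps, can be illustrated with a one-dimensional toy. Everything below (the velocity field, the intent labels, the guidance scale) is our own hypothetical stand-in, not the SI model:

```python
def velocity(x, t, intent):
    # Hypothetical stand-in for a learned flow-matching velocity field;
    # the real model is a neural network conditioned on scene and intent.
    target = {"keep_lane": 1.0, "cut_in": -1.0, None: 0.0}[intent]
    return target - x  # flow the sample toward the intent-specific target

def cfg_velocity(x, t, intent, w):
    # Classifier-free guidance: extrapolate from the unconditional field
    # toward the intent-conditioned field with guidance scale w.
    v_uncond = velocity(x, t, None)
    v_cond = velocity(x, t, intent)
    return v_uncond + w * (v_cond - v_uncond)

def sample(intent, w=2.0, steps=2, x0=0.0):
    # Two Euler integration steps, mirroring the two denoising steps
    # mentioned in the abstract.
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * cfg_velocity(x, i * dt, intent, w)
    return x
```

Varying the intent label at inference pushes the generated sample toward qualitatively different modes, which is the controllability property the abstract highlights.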

[CV-127] A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline

链接: https://arxiv.org/abs/2605.12608
作者: Mohamed Ahmed Mohamed,Xiaowei Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project code and experimental configs available at this https URL

点击查看摘要

Abstract:Object detection in adverse weather is critical for the safety of autonomous vehicles; however, the scarcity of labelled, real-world foggy data remains a significant bottleneck. In this paper, we propose Clear2Fog (C2F), an end-to-end, physics-based pipeline that simulates fog on clear-weather datasets while ensuring sensor-level consistency across camera and LiDAR. By using monocular depth estimation and a novel atmospheric light estimation method, C2F overcomes structural artifacts and chromatic biases common in existing techniques. A human perceptual study confirms C2F’s physical realism, with the generated images being preferred 92.95% of the time over an established method. Utilising a training set of 270,000 images from the Waymo Open Dataset, we conduct an extensive data efficiency study to investigate how environmental diversity influences model robustness. Our findings reveal that models trained on mixed-density fog datasets at 75% scale outperform those trained on fixed-density datasets at 100% scale. Furthermore, we investigate the sim-to-real transfer by fine-tuning pre-trained models on real-world foggy data. We demonstrate that a tenfold increase over the default fine-tuning learning rate successfully overcomes negative transfer from synthetic biases, resulting in a 1.67 mAP improvement over real-only baselines. The C2F pipeline provides a scalable framework for enhancing the reliability of autonomous systems in adverse weather and demonstrates the potential of diverse synthetic datasets for efficient model training.
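
The physics behind fog-simulation pipelines of this kind is typically the standard atmospheric scattering (Koschmieder) model, I = J·t + A·(1−t) with transmission t = exp(−β·d). The sketch below is our minimal illustration of that model on a toy depth map; the parameter names and defaults are our assumptions, not the C2F API:

```python
import numpy as np

def add_fog(image, depth, beta=0.08, airlight=0.9):
    # Atmospheric scattering model: I = J*t + A*(1 - t),
    # with per-pixel transmission t = exp(-beta * depth).
    t = np.exp(-beta * depth)
    return image * t + airlight * (1.0 - t)

# A 1x3 toy "image" whose pixels sit at increasing depths (metres).
img = np.array([[0.2, 0.2, 0.2]])
depth = np.array([[1.0, 20.0, 200.0]])
foggy = add_fog(img, depth)  # distant pixels wash out toward the airlight
```

Varying β per sample is one simple way to realize the mixed-density training sets whose benefit the study reports.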

[CV-128] TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

链接: https://arxiv.org/abs/2605.12587
作者: Jisu Nam,Jahyeok Koo,Soowon Son,Jaewoo Jung,Honggyu An,Junhwa Hur,Seungryong Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page and code are available at this https URL

点击查看摘要

Abstract:Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame’s content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time. We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker. Given a monocular video and its frame-anchored reconstruction pointmap, TrackCraft3R predicts a reference-anchored tracking pointmap that follows every pixel of the first frame across time in a single forward pass, along with its visibility. We achieve this through two designs: (i) a dual-latent representation that uses per-frame geometry latents and reference-anchored track latents as dense queries, and (ii) temporal RoPE alignment, which specifies the target timestamp of each track latent. Together, these designs convert the per-frame generative paradigm of video DiTs into a reference-anchored tracking formulation with LoRA fine-tuning. TrackCraft3R achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, while running 1.3x faster and using 4.6x less peak memory than the strongest prior method. We further demonstrate robustness to large motions and long videos.
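
Rotary position embeddings (RoPE), which the temporal-alignment idea above builds on, rotate feature pairs by a position-dependent angle so that attention scores depend only on relative position; assigning a track latent the target frame's timestamp therefore places it at the correct temporal offset. A minimal single-vector sketch (our simplification, not TrackCraft3R's implementation):

```python
import numpy as np

def rope(x, t, base=10000.0):
    # Rotate the two halves of x as (x1[i], x2[i]) pairs, each by an
    # angle t * freqs[i]; the timestamp t is the "position" here.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = t * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
s1 = rope(q, 5.0) @ rope(k, 3.0)  # timestamps 5 and 3
s2 = rope(q, 2.0) @ rope(k, 0.0)  # same relative offset of 2
```

The two attention scores match because RoPE makes the dot product a function of the timestamp difference only, which is what lets a track latent "stand in" for a target frame by borrowing its timestamp.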

[CV-129] 3D Primitives are a Spatial Language for VLMs

链接: https://arxiv.org/abs/2605.12586
作者: Junze Liu,Kun Qian,Florian Dubost,Kai Zhong,Arvind Srinivasan,Nan Chen,Anping Wang,Sam Zhang,Alejandro Mottini,Qingjun Cui,Tian Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to 5.7× across languages. Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S^3-FT (Self-Supervised Spatial Fine-Tuning), which self-supervisedly distills primitive spatial knowledge into general visual reasoning by parsing the model's own this http URL primitive-reconstructions into structured annotations and fine-tuning on the result, with no human labels and no teacher model. Training on primitive images alone, S^3-FT improves Qwen3-VL-8B by +4.6 to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.
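
A "scene-code" representation of the kind evaluated here can be as simple as primitives expressed in executable code, after which spatial questions reduce to symbolic queries. The schema below is our own toy illustration, not one of SpatialBabel's six scene-code languages:

```python
# Primitives as plain Python dicts; the schema and field names are ours.
scene = [
    {"kind": "cube",     "name": "box",  "center": (-2.0, 0.0, 0.0), "size": 1.0},
    {"kind": "sphere",   "name": "ball", "center": (1.5, 0.0, 0.0),  "radius": 0.5},
    {"kind": "cylinder", "name": "can",  "center": (0.0, 0.0, 3.0),  "radius": 0.3, "height": 1.0},
]

def leftmost(scene):
    # Once the scene is symbolic, spatial queries become trivial lookups.
    return min(scene, key=lambda p: p["center"][0])["name"]

def count(scene, kind):
    return sum(p["kind"] == kind for p in scene)
```

This is the intuition behind both Code-CoT (reason via the symbolic scene) and S^3-FT (parse the model's own reconstructions into question-answer supervision).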

[CV-130] DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

链接: https://arxiv.org/abs/2605.12574
作者: Hongyi Tang,Zhihao Zhu,Yi Yang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 8 figures

点击查看摘要

Abstract:Vision-language models (VLMs) are trained on large-scale image-text corpora that may contain private, copyrighted, or otherwise sensitive data, motivating membership inference as a tool for training-data auditing. This is especially challenging for deployed VLMs, where auditors typically observe only generated textual responses. Existing VLM membership inference attacks either rely on probability-level signals unavailable in such settings, or use mask-based semantic prediction tasks whose effectiveness depends on object-centric visual assumptions. To address these limitations, we propose DistractMIA, an output-only black-box framework based on semantic distraction. Rather than removing visual evidence, DistractMIA preserves the original image, inserts a known semantic distractor, and measures how generated responses change. This design is motivated by the intuition that member samples remain more anchored to the original image semantics, while non-member samples are more easily redirected toward the distractor. To make this signal reliable, DistractMIA calibrates distractor configurations on a reference set and derives membership scores from repeated textual generations, capturing response stability and distractor uptake without accessing logits, probabilities, or hidden states. Experiments across multiple VLMs and benchmarks show that DistractMIA consistently outperforms both output-only and stronger-access baselines. Its performance on a medical benchmark further demonstrates applicability beyond object-centric natural images.
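
A membership score of the general shape described, response stability minus distractor uptake over repeated generations, can be sketched with token-overlap similarity. The exact scoring form below is our illustration; DistractMIA's calibrated statistic differs in detail:

```python
def jaccard(a, b):
    # Crude text similarity via word-set overlap.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(1, len(sa | sb))

def membership_score(responses, distractor_word):
    # Stability: average pairwise similarity of repeated generations.
    # Uptake: fraction of responses adopting the known distractor.
    # Member samples should stay anchored (high stability, low uptake).
    n = len(responses)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    stability = sum(jaccard(responses[i], responses[j]) for i, j in pairs) / len(pairs)
    uptake = sum(distractor_word.lower() in r.lower() for r in responses) / n
    return stability - uptake

member = ["a red car on the road"] * 3
nonmember = ["a red car on the road", "a large stop sign", "the stop sign is red"]
```

Note that everything here operates on generated text only, matching the output-only black-box threat model of the paper.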

[CV-131] Improving Diffusion Posterior Samplers with Lagged Temporal Corrections for Image Restoration

链接: https://arxiv.org/abs/2605.12573
作者: Davide Evangelista,Elena Morotti,Francesco Pivi,Maurizio Gabbrielli
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 Figures, 9 Tables, Submitted to a conference

点击查看摘要

Abstract:Diffusion-based posterior sampling (PS) is a leading framework for imaging inverse problems, combining learned priors with measurement constraints. Yet, its standard formulations rely on instantaneous data-consistent estimates, which induce temporal variability in the reverse dynamics. We reinterpret PS from a dynamical perspective, showing that the standard PS update corresponds to a first-order discretization of the diffusion dynamics plus a residual correction capturing the mismatch between the denoised prediction and the data-consistent estimate. A second-order discretization, however, naturally introduces a temporal correction based on the variation of consecutive estimates. Building on this, we propose LAMP, combining the second-order update with the residual correction characterizing a PS technique. LAMP thus inherits a lagged temporal correction, and it can be implemented as a modular plug-in over the PS backbone. We show that LAMP preserves the structure of a posterior sampler, and we perform a one-step risk analysis to characterize when LAMP improves the reverse transition via a bias-variance trade-off. Experiments across multiple imaging tasks demonstrate consistent improvements over strong baselines such as DiffPIR and DDRM, without increasing the number of denoising evaluations.
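
The benefit of a second-order discretization that corrects each step using the variation between consecutive slope estimates can be seen on a toy ODE. Heun's method below is a classic second-order scheme, analogous in spirit to LAMP's lagged temporal correction but not the authors' sampler:

```python
import math

def euler(f, x0, t0, t1, steps):
    # First-order discretization, the analogue of the standard PS update.
    x, h = x0, (t1 - t0) / steps
    for i in range(steps):
        x = x + h * f(t0 + i * h, x)
    return x

def heun(f, x0, t0, t1, steps):
    # Second-order: each step is corrected by the difference between the
    # slope at the current estimate (k1) and at the provisional next
    # estimate (k2), i.e. a correction built from consecutive estimates.
    x, h = x0, (t1 - t0) / steps
    for i in range(steps):
        t = t0 + i * h
        k1 = f(t, x)
        k2 = f(t + h, x + h * k1)
        x = x + 0.5 * h * (k1 + k2)
    return x

f = lambda t, x: -x        # toy dynamics dx/dt = -x
exact = math.exp(-1.0)     # exact solution at t = 1 from x0 = 1
```

At a fixed number of steps the corrected scheme lands closer to the true trajectory, which mirrors why a lagged correction can improve the reverse transition without extra denoising evaluations.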

[CV-132] VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority ICML2026

链接: https://arxiv.org/abs/2605.12571
作者: Chenhao Qiu(1),Yechao Zhang(2),Xin Luo(1),Shien Song(1),Xusheng Liu(1) ((1) Mango TV, (2) Nanyang Technological University)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026. 33 pages, 13 figures. Code and models are available at this https URL

点击查看摘要

Abstract:Long video question answering requires locating sparse, time-scattered visual evidence within highly redundant content. Although current MLLMs perform well on short videos, long videos introduce long-horizon search and verification, which often necessitates multi-turn, agentic interaction. We show that existing LVU agents can exhibit “evidence misalignment”: they produce correct answers that are not supported by the retrieved or inspected evidence. To characterize this failure, we introduce two diagnostics (temporal groundedness and semantic groundedness) and use them to reveal two pressures that amplify misalignment: prompt pressure from shared-context saturation at inference time and reward pressure from outcome-only optimization during training. These findings point to a structural root cause: the coupled agent paradigm conflates long-horizon planning with answer authority. We therefore propose the decoupled planner-inspector framework, which separates planning from answer authority and gates final answering on pixel-level verification. Across four long-video benchmarks, our framework improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench while producing interpretable search trajectories. Moreover, the decoupled architecture scales consistently with increased search budgets and supports plug-and-play upgrades of the MLLM backbone without retraining the planner. Code and models are available at this https URL.

[CV-133] M3Net: A Macro-to-Meso-to-Micro Clinical-inspired Hierarchical 3D Network for Pulmonary Nodule Classification

链接: https://arxiv.org/abs/2605.12570
作者: Jinyue Li,Yuzhou Yu,Jingjing Yang,Meng Fu,Yani Zhang,Shuyao He,Dianlong Ge,Xin Ning,Yannan Chu,Qiankun Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in Information Fusion (2026), 15 pages, 5 figures

点击查看摘要

Abstract:The accurate classification of benign and malignant pulmonary nodules in CT scans is critical for early lung cancer screening, yet remains challenging due to the multi-scale and heterogeneous nature of pulmonary nodules. While deep learning offers potential for auxiliary diagnosis, most existing models act as “black boxes”, lacking the transparency and explainability required for trustworthy clinical integration. To address this issue, we propose M3Net, a novel 3D network for pulmonary nodule classification inspired by the hierarchical diagnostic workflow of radiologists, which integrates multi-scale contextual information from fine-grained structures to global anatomical relationships. Our framework constructs a progressive multi-scale input, from fine-grained nodule structures to local semantics and global spatial relationships. M3Net employs scale-specific encoders and ensures cross-scale semantic consistency through latent space projection and mutual information maximization. Extensive experiments on the public LIDC-IDRI dataset and a self-collected clinical dataset (USTC-FHLN) demonstrate that our method achieves state-of-the-art performance, with accuracies of 86.96% and 84.24% respectively, outperforming the best baseline by 3.26% and 2.17%. The results validate that M3Net provides a more robust and clinically relevant solution for pulmonary nodule classification. The code is available at this https URL.
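
Mutual information maximization between scale-specific embeddings is commonly implemented with an InfoNCE objective. The sketch below is a generic NumPy version of that idea, a stand-in for M3Net's cross-scale consistency loss, whose exact form may differ:

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    # InfoNCE: each row of z_a should be most similar to the matching
    # row of z_b (the positive pair); other rows act as negatives.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    # cross-entropy with targets on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))           # e.g. nodule-scale embeddings
aligned = info_nce(z, z + 0.01 * rng.standard_normal((8, 16)))
random_ = info_nce(z, rng.standard_normal((8, 16)))
```

Minimizing this loss pulls the embeddings of the same case at different scales together, which is what "cross-scale semantic consistency" buys: the loss is low when the two views agree and high when they are unrelated.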

[CV-134] Pyramid Self-contrastive Learning Framework for Test-time Ultrasound Image Denoising

链接: https://arxiv.org/abs/2605.12567
作者: Jiajing Zhang,Bingze Dai,Xi Zhang,Yue Xu,Wei-Ning Lee
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The inherent electronic and speckle noise complicates clinical interpretation of ultrasound images. Conventional denoising methods rely on explicit noise assumptions whose validity diminishes under composite noise conditions. Learning-based methods require massive labeled data and model parameters. These pre-defined and pre-trained manners entail an inevitable domain shift in complex in vivo environments, so they are limited to a specific noise type and often blur structural details. In this study, we propose a pure test-time training framework for one-shot ultrasound image denoising and apply it to synthetic aperture ultrasound (SAU), which synthesizes transmit focus from sub-aperture transmissions. Our Aperture-to-Aperture (A2A) framework disentangles anatomical similarity and noise randomness from shuffled sub-apertures through self-contrastive learning in pyramid latent spaces. The clean image is then decoded from the anatomy space, while discarding the noise space. A2A is trained at test time on one noisy sample of SAU signals, so it fundamentally eliminates the domain shift and pretraining costs. Simulation experiments, including electronic noise levels of 0 to 30 dB and different inclusion geometries, demonstrated an improvement of 69.3% SNR and 34.4% CNR by A2A. The in vivo results showed 84.8% SNR and 25.7% CNR gains using only two aperture data of the heart in six echocardiographic views, liver, and kidney. A2A delivers clear images/signals across diverse imaging targets and configurations, paving the way for more reliable anatomical visualization and functional assessment by ultrasound.

[CV-135] M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement ICIP

链接: https://arxiv.org/abs/2605.12556
作者: Youssef Aboelwafa,Hicham G. Elmongui,Marwan Torki
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 2026 IEEE International Conference on Image Processing (ICIP)

点击查看摘要

Abstract:Low-light image enhancement is challenging due to complex degradations, including amplified noise, artifacts, and color distortion. While Retinex-based deep learning methods have achieved promising results, they primarily rely on single-modality RGB information. We propose M2Retinexformer (Multi-Modal Retinexformer), a novel framework that extends Retinexformer by incorporating depth cues, luminance priors, and semantic features within a progressive refinement pipeline. Depth provides geometric context that is invariant to lighting variations, while luminance and semantic features offer explicit guidance on brightness distribution and scene understanding. Modalities are extracted at multiple scales and fused through cross-attention, with adaptive gating dynamically balancing illumination-guided self-attention and cross-attention based on the reliability of auxiliary cues. Evaluations on the LOL, SID, SMID, and SDSD benchmarks demonstrate overall improvements over Retinexformer and recent state-of-the-art methods. Code and pretrained weights are available at this https URL

[CV-136] SSDA: Bridging Spectral and Structural Gaps via Dual Adaptation for Vision-Based Time Series Forecasting

链接: https://arxiv.org/abs/2605.12550
作者: Mingrui Zhang,Hanchen Yang,Wengen Li,Xudong Jiang,Yichao Zhang,Jihong Guan,Shuigeng Zhou
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large vision models (LVMs) have recently proven to be surprisingly effective time series forecasters, simply by rendering temporal data as images. This success, however, rests on a largely unexamined premise: the rendered time series images are sufficiently close to natural images for knowledge in pre-trained models to transfer effectively. We argue that two gaps still remain, i.e., spectral and structural gaps, fundamentally limiting the potential of LVMs for time series forecasting. Spectrally, we systematically reveal that rendered time series images exhibit a markedly shallower power spectrum than the natural images LVMs are pre-trained to recognize. Structurally, reshaping 1D temporal sequences into 2D grids fabricates spurious spatial adjacencies while severing genuine temporal continuities, misleading the spatial inductive biases of pre-trained LVMs. To bridge these gaps, we propose SSDA, a dual-branch network that spectrally and structurally adapts to unlock the full potential of LVMs for time series forecasting. At the data level, a Spectral Magnitude Aligner (SMA) applies 2D FFT to selectively enhance the magnitude spectrum toward natural-image statistics while preserving phase. At the model level, a Structural-Guided Low-Rank Adaptation (SG-LoRA) injects position-aware temporal encodings into patch embeddings and adapts attention via low-rank updates. The two branches are further adaptively fused to produce the final forecast. Extensive experiments on seven real-world benchmarks demonstrate that SSDA consistently outperforms strong LVM- and LLM-based baselines under both full-shot and few-shot settings. Code is publicly available at this https URL.
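The core move behind the Spectral Magnitude Aligner, reshaping a signal's magnitude spectrum while leaving its phase untouched, can be sketched with a minimal 1-D DFT example. Note the paper operates on 2-D FFTs of rendered images; the naive DFT, the `gamma` exponent, and the test signal below are illustrative assumptions, not the paper's parameterization:

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for a sketch)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of the reconstruction."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def align_magnitude(x, gamma):
    """Raise every DFT magnitude to the power `gamma` (reshaping the
    power spectrum) while keeping each frequency bin's phase unchanged."""
    out = []
    for c in dft(x):
        mag = abs(c)
        out.append((mag ** gamma) * cmath.exp(1j * cmath.phase(c)) if mag > 0 else 0j)
    return idft(out)

signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
identity = align_magnitude(signal, gamma=1.0)  # gamma=1 leaves the signal unchanged
print(round(identity[1], 6))  # → 1.0
```

With `gamma < 1` the magnitude spectrum's dynamic range is compressed while phase, and hence temporal structure, is untouched; the paper's SMA pursues the same decoupling in 2-D, steering magnitudes toward natural-image statistics.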

[CV-137] What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

链接: https://arxiv.org/abs/2605.12549
作者: Jiaping Lin,Fei Shen,Junzhe Li,Ping Nie,Fei Yu,Ming Li,Haizhou Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position, i.e., the final token, across layers are extracted as a preliminary target hypothesis and appended to the input, together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation. Experiments across four VLMs and five benchmarks, including ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI, demonstrate consistent improvements without additional training, with gains of up to 4.3% on ScreenSpot-Pro. Code will be available at this https URL.

[CV-138] CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

链接: https://arxiv.org/abs/2605.12545
作者: Zhitong Dong,Chao Li,Jie Yu,Hao Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Aesthetic image cropping aims to enhance the aesthetic quality of an image by improving its composition through spatial cropping. Previous methods often rely on saliency prediction or retrieval augmentation, ignoring the task’s core requirement: a deep understanding of composition and aesthetics. Consequently, saliency-based methods struggle to make compositional trade-offs in complex scenes, while retrieval-based methods blindly refer to similar cases, lacking adaptive reasoning for unique scenes. Both approaches fail to align their automated cropping results with those of human experts. To address the above issues, we propose a novel paradigm that reformulates aesthetic cropping as a multimodal reasoning task, aiming to activate the VLM’s analytical and comprehension capabilities in aesthetics. We design a Compositional Reasoning and Optimizing Preference method (CROP) that directs the VLM to think like a professional photographer. It deconstructs a complex and subjective aesthetic problem into an “analysis-proposal-decision” process, reasoning step by step through the analysis of scene elements and compositional principles. Meanwhile, our expert preference alignment module makes the model’s decision consistent with human expert aesthetics. Extensive experiments across multiple datasets validate our method’s superiority and component effectiveness.

[CV-139] MorphOPC: Advancing Mask Optimization with Multi-scale Hierarchical Morphological Learning

链接: https://arxiv.org/abs/2605.12528
作者: Yuting Hu,Lei Zhuang,Chen Wang,Ruiyang Qin,Hua Xiang,Gi-joon Nam,Jinjun Xiong
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:As feature sizes shrink to the nanometer scale, accurately transferring circuit patterns from photomasks to silicon wafers becomes increasingly challenging. Optical proximity correction (OPC) is widely used to ensure pattern fidelity and manufacturability. Recent generative mask optimization models based on encoder-decoder architecture can synthesize near-optimal masks, serving as fast machine learning (ML) surrogates for traditional OPC. However, these models often fail to capture the geometric transformations from target layouts to mask patterns, leading to suboptimal quality. In this work, we formulate mask generation as a sequence of morphological operations on local layout features and propose MorphOPC, a multi-scale hierarchical model with neural morphological modules to learn these transformations. Experiments on edge-based OPC and ILT benchmarks across metal and via layers show that MorphOPC consistently outperforms state-of-the-art methods, achieving higher printing fidelity and lower manufacturing cost, demonstrating strong potential for scalable mask optimization.

[CV-140] Structural Diversity Drives Disruptive Scientific Innovation

链接: https://arxiv.org/abs/2605.12514
作者: Yichun Peng,Saike He,Peijie Zhang,Kang Zhao,Yi Yang,Ning Zhang,Qingpeng Zhang,Daniel Dajun Zeng,Hao Peng
类目: Social and Information Networks (cs.SI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Digital Libraries (cs.DL); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Scientific innovation increasingly depends on collaboration, yet the organizational structure that fosters breakthrough ideas remains poorly understood. Existing metrics - such as team size or compositional diversity - capture readily observable characteristics but not the deeper architecture of collaboration. We introduce Structural Diversity (SD): the extent to which a team bridges multiple distinct knowledge communities within its prior collaboration network. Using a century-scale dataset of 260 million scientific publications (1900-2025) and combining causal inference with a quasi-natural experiment based on a U.S. National Science Foundation policy change in 2012, we show that SD is a powerful and robust predictor of disruptive innovation, outperforming traditional team novelty indicators such as team freshness and edge density. Moreover, SD positively interacts with team size and is able to mitigate the well-known “curse of scale” by transforming scale from a liability into a resource for creative synthesis. We find that one mechanism underlying this effect is Disciplinary Integration (DI): teams with higher SD can more effectively combine heterogeneous knowledge into novel configurations. Our findings position SD as both a new theoretical construct and an actionable design principle for organizing scientific collaboration. By linking the architecture of team assembly to the dynamics of creative discovery, our work offers a structural explanation for how collective intelligence can be systematically engineered to foster disruptive innovation.

[CV-141] DeepFilters: Scattering-Aware Pupil Engineering with Learned Digital Filter Reconstruction for Extended Depth of Field Microscopy

链接: https://arxiv.org/abs/2605.13619
作者: Joseph L. Greene,Suet YIng Chan,Qilin Deng,Jeffrey Alido,Alexandra Lion,Guorong Hu,Ruipeng Guo,Tongyu Li,Kivilcim Kiliç,Ian Davison,Lei Tian
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages (18 main text, 20 supplement), 23 Figures (7 main text, 16 supplement)

点击查看摘要

Abstract:Extended depth of field microscopy encodes axial information into a single acquisition through engineered point spread functions, but conventional and deep optics approaches are subject to degradation in scattering tissue. We introduce DeepFilters, a scattering-aware deep optics framework that jointly optimizes a parameterized pupil filter and a digital-filter-based reconstruction network through a calibrated differentiable forward model to achieve broad generalization without retraining. Incorporating empirical scattering kernels, physics-guided regularization, and a hybrid genetic-gradient initialization strategy, DeepFilters extends the PSF from 16 micron to 400 micron in clear media and enables signal recovery beyond 120 micron deep in biological tissues, validated across fixed brain slices and sea urchin embryos.

[CV-142] On Hallucinations in Inverse Problems: Fundamental Limits and Provable Assessment Methods RAID

链接: https://arxiv.org/abs/2605.13146
作者: David Iagaru,Nina M. Gottschling,Anders C. Hansen,Josselin Garnier
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 31 pages, 11 figures; code available at this https URL

点击查看摘要

Abstract:Artificial intelligence (AI) has transformed imaging inverse problems, from medical diagnostics to Earth observation. Yet deep neural networks can produce hallucinations, realistic-looking but incorrect details, undermining their reliability, especially when ground truth data is unavailable. We develop a theoretical framework showing that such hallucinations are not merely artifacts of particular models, but can arise from the ill-posed nature of the inverse problem itself. We derive necessary and sufficient conditions for hallucinations, together with computable bounds on their magnitude that depend only on the forward model. Building on this theory, we introduce algorithms to: (1) estimate the minimum hallucination magnitude achievable by any reconstruction model for a given input; (2) assess the faithfulness of reconstructed details by a given reconstruction model. Experiments across three imaging tasks demonstrate that our approach applies broadly, including to modern generative models, and provides a principled way to quantify and evaluate AI hallucinations.

[CV-143] A General Bézier Tree Encoding Counterfactual Framework for Retinal-Vessel-Mediated Disease Analysis

链接: https://arxiv.org/abs/2605.13015
作者: Tan Su,Ethan Elio Meidinger,Lin Gu,Ruogu Fang
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 33 pages, 6 figures; preprint

点击查看摘要

Abstract:The geometry of the retinal vessel is a key biomarker of vascular diseases, yet clinical evidence remains primarily observational. Existing generative counterfactuals intervene only at the image-level disease label, failing to isolate explicit anatomical structure. To address this limitation, we propose the Bézier Tree Encoding Counterfactual Framework (BTECF). By abstracting vascular networks into interconnected cubic-Bézier segments, BTECF establishes a disease-agnostic representation in which structural topology is explicitly preserved and atomically perturbable. Coupling this encoding with a diffusion-based generator enables parameter-level do-interventions on explicit geometric axes (e.g., tortuosity, caliber) while preserving background fundus textures. We validate BTECF on diabetic retinopathy, together with independent cohorts for ischemic stroke and Alzheimer’s disease. Isolated counterfactual interventions produce dose-responsive shifts in classifier predictions; a matched pixel-drop control attenuates this response by an order of magnitude or more, ruling out out-of-distribution generation artifacts. By enforcing causal isolation between vessel topology and pixel-level confounders, BTECF provides a unified generative paradigm for hypothesis verification across systemic diseases. To support reproducibility, the code will be publicly released upon acceptance.

[CV-144] Optimization in Sparse 2D to Dense 3D Weakly Supervised Learning: Application to Multi-Label Segmentation of Large ex vivo MRI Data

链接: https://arxiv.org/abs/2605.12753
作者: Paul Hoareau,Kuan Yi Wang,Brandon Bujak,Roy Sun,Govind Nair,Irene Cortese,Charidimos Tsagkas,Daniel Reich,Julien Cohen-Adad
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages. Submitted to Machine Learning for Biomedical Imaging (MELBA). Code and models: this https URL

点击查看摘要

Abstract:INTRODUCTION | Fully supervised 3D segmentation of high-resolution ex vivo MRI is limited by the prohibitive cost of volumetric annotation, forcing reliance on sparse 2D slices. Weakly supervised Sparse-to-Dense frameworks bridge this gap, but guidelines remain ambiguous regarding human-centric visual enhancements and transferring optimization strategies across dimensions. We analyze divergent regularization needs for multi-class segmentation of high-resolution ex vivo spinal cord MRI. METHODS | We used 9.4T MRI of multiple sclerosis spinal cords (104,000 slices) with sparse annotations (428 slices). A 2D Teacher trained on sparse slices generated dense pseudo-labels to train a 3D Student. We systematically evaluated the impact of human-centric preprocessing, spatial augmentation, and soft-label regularization on both architectures. RESULTS | We identified a critical divergence in training dynamics. The 2D Teacher required strong spatial augmentation and soft-labeling to overcome data scarcity, improving White Matter Lesion Dice scores by 11 points. However, propagating these techniques to the 3D Student degraded its performance. Furthermore, human-centric preprocessing (e.g., CLAHE) disrupted global statistical cues, dropping Gray Matter Lesion Dice scores by ~25 points. DISCUSSION | Our study highlights a perception divergence (human-centric contrast enhancement harms machine models) and a regularization conflict across dimensions. 3D architectures trained on dense pseudo-labels exhibit fundamentally different optimization landscapes than 2D counterparts and require distinct, conservative regularization. Code and models: this https URL.

[CV-145] Human face perception reflects inverse-generative and naturalistic discriminative objectives

链接: https://arxiv.org/abs/2605.12619
作者: Wenxuan Guo,Heiko H. Schütt,Kamila Maria Jozwik,Katherine R. Storrs,Nikolaus Kriegeskorte,Tal Golan
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, 10 figures, 4 tables

点击查看摘要

Abstract:The perceptual representations supporting our ability to recognize faces remain a computational mystery. Deep neural networks offer mechanistic hypotheses for human face perception, but theoretically distinct models often make indistinguishable representational predictions for randomly sampled faces. To expose diagnostic differences among these hypotheses, we compared six neural network models sharing an architecture but trained on distinct tasks, using face pairs optimized to elicit contrasting model predictions (“controversial” pairs) alongside randomly sampled pairs. We tested model predictions against face-dissimilarity judgments from 864 human participants across stimulus sets differing in realism and pose variation. Models prioritizing high-level, invariant structures (trained via inverse rendering, face identification, or object classification) most robustly matched human judgments. Furthermore, models trained on natural images typically outperformed synthetic-trained counterparts. Together, these findings suggest that human face perception is shaped by mechanisms that infer latent causes of facial appearance, discount nuisance variation, and are tuned by natural image statistics.

[CV-146] Are Compact Rationales Free? Measuring Tile Selection Headroom in Frozen WSI-MIL

链接: https://arxiv.org/abs/2605.12575
作者: Hyun Do Jung,Jungwon Choi,Soojung Choi,Yujin Oh,Hwiyoung Kim
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Whole-slide image (WSI) multiple instance learning (MIL) classifiers can achieve strong slide-level AUC while leaving the full-bag prediction opaque. Attention scores are widely reused as post-hoc explanations, but high attention can reflect aggregation preference rather than a compact, model-sufficient rationale. We study post-hoc rationale highlighting for frozen WSI-MIL: given a trained classifier, can its slide-level prediction be recovered from a compact, output-consistent tile subset without retraining the backbone? We instantiate this with Finding Optimal Contextual Instances (FOCI), a lightweight rationale-readout layer over a frozen MIL backbone. FOCI is trained with model-output sufficiency and exclusion objectives over keep/drop tile subsets, evaluated with an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL, and summarized by the Selection Headroom Index (SHI). Across three WSI benchmarks and seven MIL backbones, FOCI reveals that compact rationales are selection-headroom dependent: transformer and multi-branch attention aggregators can admit compact rationales, near-minimal attention-pooling baselines enter a selection-saturation regime, and hard-selection backbones can conflict with an external readout. For TransMIL, relative to its documented CLS-proxy ranking, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% across benchmarks, while ACMIL+FOCI attains the highest mean SHI (+0.465). Deletion-based perturbation and selected-only downstream evaluation provide complementary checks. These results position FOCI as a model-level interpretability and audit layer: selected tiles are not claims of clinical or pathologist-level diagnostic sufficiency, but candidate rationales that offer a compact, reviewable view of when a frozen MIL prediction can be localized to a small output-consistent subset.
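The shape of the Sequential Reveal Protocol and its Minimum Sufficient K metric can be sketched in a few lines: reveal tiles to a frozen classifier in ranked order until its prediction on the revealed subset matches the full-bag prediction. The mean-pooling "MIL head", the tile scores, and both rankings below are invented toys standing in for the paper's attention-based backbones:

```python
def bag_predict(tile_scores):
    """Toy frozen MIL head: mean-pool tile evidence and threshold at 0."""
    if not tile_scores:
        return 0
    return 1 if sum(tile_scores) / len(tile_scores) > 0 else 0

def minimum_sufficient_k(tile_scores, ranking):
    """Sequential Reveal: add tiles in ranked order until the prediction
    on the revealed subset matches the full-bag prediction; return that k."""
    target = bag_predict(tile_scores)
    revealed = []
    for k, idx in enumerate(ranking, start=1):
        revealed.append(tile_scores[idx])
        if bag_predict(revealed) == target:
            return k
    return len(ranking)

scores = [3.0, -0.5, -0.5, 2.0, -0.2]  # tile-level evidence (invented)
good_ranking = [0, 3, 1, 2, 4]         # rationale-first ordering
poor_ranking = [4, 2, 1, 3, 0]         # weak ordering, evidence revealed late
print(minimum_sufficient_k(scores, good_ranking),
      minimum_sufficient_k(scores, poor_ranking))  # → 1 4
```

A better tile ranking recovers the full-bag prediction from fewer tiles, which is exactly the MSK reduction FOCI reports over attention-derived rankings.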

[CV-147] Uncovering Latent Pathological Signatures in Pulmonary CT via Cross-Window Knowledge Distillation

链接: https://arxiv.org/abs/2605.12562
作者: Bo Peng,Wujian Xu,Kun Wang,Ximing Liao,Na Wang,Daqian Shi,Tian Li,Jing Gao,Johan Thygesen,Yingqun Ji,Honghan Wu
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-window CT imaging captures complementary pathological information across anatomical structures of differing densities, yet existing deep learning methods fuse representations only at later stages, missing cross-density interactions. We propose a cross-window knowledge distillation framework in which student encoders learn latent clinical priors from a teacher trained on the most informative window. Evaluated retrospectively on three cohorts - COPD-CT-DF (n=719), RSNA PE (n=1,433), and an in-house CTEPD dataset (n=161) - distillation improved per-window AUC by 10.1-16.5 percentage points on COPD-CT-DF (0.75-0.81 to 0.90-0.94; all P < 0.001), with ensemble AUC reaching 0.9960. Similar gains were observed on RSNA PE (0.80-0.83 to 0.90-0.92) and CTEPD (AUC 0.7481 vs. 0.6264). Cross-window distillation internalises pathological signatures invisible to supervised approaches, offering a generalisable solution for multi-window pulmonary CT analysis.
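The distillation objective that typically drives such teacher-student setups can be sketched as a temperature-softened KL divergence; the 2-class logits, the temperature, and the window assignments in the comments are illustrative assumptions, not details from the paper:

```python
import math

def softmax(logits, temperature):
    """Temperature-softened class probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions: a student
    seeing one CT window is pulled toward a teacher trained on the
    most informative window of the same scan."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)

teacher_logits = [2.0, -1.0]  # e.g. a lung-window teacher (hypothetical)
student_logits = [0.5, 0.2]   # a mediastinal-window student, early in training
print(distillation_loss(student_logits, teacher_logits) > 0)  # → True
```

The loss vanishes exactly when student and teacher distributions agree, so minimizing it transfers the teacher window's "latent priors" into each per-window student.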

[CV-148] Brain Tumor Classification in MRI Images: A Computationally Efficient Convolutional Neural Network

链接: https://arxiv.org/abs/2605.12560
作者: Md Fahimul Kabir Chowdhury,Jannatul Ferdous
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Improving patient outcomes depends on the prompt and accurate diagnosis of brain tumors, but manual MRI scan analysis is still time-consuming and unreliable. Although deep learning has shown promise, many of the models that are now in use are computationally intensive and have difficulty handling the intrinsic complexity and variety of different types of brain tumors. In this work, we propose a lightweight yet high-performing Convolutional Neural Network (CNN) for multi-class brain tumor classification, employing MRI images to target gliomas, meningiomas, pituitary tumors, and healthy (no tumor) instances. The model was rigorously evaluated on two publicly accessible datasets from Figshare and Kaggle. Leveraging efficient feature extraction and optimized training strategies, our CNN achieved classification accuracies of 99.03% and 99.28%, along with ROC scores of 99.88% and 99.94% on Dataset 1 and Dataset 2, respectively-all while utilizing significantly fewer parameters than popular pre-trained architectures. In contrast to cutting-edge models like DenseNet201, MobileNetV2, VGG19, Xception, InceptionV3, and ResNet50, our approach consistently demonstrated superior performance with reduced computational overhead. These findings highlight the potential of the proposed model as a practical and reliable diagnostic aid in clinical environments.

[CV-149] Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks

链接: https://arxiv.org/abs/2605.08320
作者: Marwane Hariat,Antoine Manzanera,David Filliat
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular depth estimation (MDE) with self-supervised training approaches struggles in low-texture areas, where photometric losses may lead to ambiguous depth predictions. To address this, we propose a novel technique that enhances spatial information by applying a distance transform over pre-semantic contours, augmenting discriminative power in low texture regions. Our approach jointly estimates pre-semantic contours, depth and ego-motion. The pre-semantic contours are leveraged to produce new input images, with variance augmented by the distance transform in uniform areas. This approach results in more effective loss functions, enhancing the training process for depth and ego-motion. We demonstrate theoretically that the distance transform is the optimal variance-augmenting technique in this context. Through extensive experiments on KITTI, Cityscapes, Waymo, NYUv2 and ScanNet our model demonstrates robust performance, surpassing competing self-supervised methods in MDE.
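The distance transform that injects variance into uniform regions can be sketched as a multi-source BFS from contour pixels; 8-connectivity yields the chessboard metric, and the tiny grid below is illustrative (the paper's choice of metric and contour source may differ):

```python
from collections import deque

def distance_transform(contour, height, width):
    """Chessboard-metric distance from every pixel to the nearest contour
    pixel, via multi-source BFS. `contour` is a set of (row, col) pairs."""
    dist = [[None] * width for _ in range(height)]
    queue = deque()
    for (r, c) in contour:
        dist[r][c] = 0
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < height and 0 <= nc < width and dist[nr][nc] is None:
                    dist[nr][nc] = dist[r][c] + 1
                    queue.append((nr, nc))
    return dist

# A single vertical contour at column 2 of a 3x5 image: distance grows
# with horizontal offset, so a flat region acquires a usable gradient.
grid = distance_transform({(r, 2) for r in range(3)}, 3, 5)
print(grid[0])  # → [2, 1, 0, 1, 2]
```

Feeding this distance map alongside the image gives photometric losses something to match even where raw intensities are constant, which is the intuition behind variance augmentation in low-texture areas.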

人工智能

[AI-0] Topology-Preserving Neural Operator Learning via Hodge Decomposition ICML2026

链接: https://arxiv.org/abs/2605.13834
作者: Dongzhe Zheng,Tao Zhong,Christine Allen-Blanchette
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
备注: Accepted at ICML 2026. Code available at this https URL

点击查看摘要

Abstract:In this paper, we study solution operators of physical field equations on geometric meshes from a function-space perspective. We reveal that Hodge orthogonality fundamentally resolves spectral interference by isolating unlearnable topological degrees of freedom from learnable geometric dynamics, enabling an additive approximation confined to structure-preserving subspaces. Building on Hodge theory and operator splitting, we derive a principled operator-level decomposition. The result is a Hybrid Eulerian-Lagrangian architecture with an algebraic-level inductive bias we call Hodge Spectral Duality (HSD). In our framework, we use discrete differential forms to capture topology-dominated components and an orthogonal auxiliary ambient space to represent complex local dynamics. Our method achieves superior accuracy and efficiency on geometric graphs with enhanced fidelity to physical invariants. Our code is available at this https URL

[AI-1] Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

链接: https://arxiv.org/abs/2605.13830
作者: S. Akshay,Chaitanya Garg,Ashutosh Gupta,Kuldeep S. Meel,Ajinkya Naik
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Decision tree ensembles (DTE) are a popular model for a wide range of AI classification tasks, used in multiple safety critical domains, and hence verifying properties on these models has been an active topic of study over the last decade. One such verification question is the problem of sensitivity, which asks, given a DTE, whether a small change in subset of features can lead to misclassification of the input. In this work, our focus is to build a quantitative notion of sensitivity, tailored to DTEs, by discretizing the input space of the model and enumerating the regions which are susceptible to sensitivity. We propose a novel algorithmic technique that can perform this computation efficiently, within a certified error and confidence bound. Our approach is based on encoding the problem as an algebraic decision diagram (ADD), and further splitting it into subproblems that can be solved efficiently and make the computation compositional and scalable. We evaluate the performance of our technique over benchmarks of varying size in terms of number of trees and depth, comparing it against the performance of model counters over the same problem encoding. Experimental results show that our tool XCount achieves significant speedup over other approaches and can scale well with the increasing sizes of the ensembles.
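The quantity being computed, the fraction of discretized input regions where a small single-feature perturbation flips the ensemble, can be made concrete with a naive brute-force baseline (the ADD-based compositional encoding is the paper's actual contribution; the depth-1 stumps, thresholds, and grid below are invented for illustration):

```python
from itertools import product

# A toy ensemble of three decision stumps (depth-1 trees): each votes +1
# if its feature exceeds its threshold, else -1; majority vote decides.
STUMPS = [(0, 0.5), (1, 0.3), (0, 0.7)]  # (feature index, threshold), invented

def classify(x):
    votes = sum(1 if x[f] > t else -1 for f, t in STUMPS)
    return 1 if votes > 0 else 0

def is_sensitive(x, delta):
    """True if nudging any single feature by ±delta (clamped to [0,1])
    flips the ensemble's class at input x."""
    base = classify(x)
    for f in range(len(x)):
        for d in (-delta, delta):
            y = list(x)
            y[f] = min(1.0, max(0.0, y[f] + d))
            if classify(y) != base:
                return True
    return False

def sensitivity_fraction(step=0.05, delta=0.1):
    """Fraction of grid cells of [0,1]^2 that are sensitive: the measure
    an exact, scalable method would compute without enumeration."""
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    cells = list(product(grid, repeat=2))
    return sum(is_sensitive(c, delta) for c in cells) / len(cells)

print(classify((0.6, 0.4)), is_sensitive((0.5, 0.5), 0.1))  # → 1 True
```

This enumeration is exponential in dimension and grid resolution, which is precisely why the paper compiles the problem into algebraic decision diagrams and splits it into compositional subproblems instead.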

[AI-2] Harnessing Agentic Evolution

链接: https://arxiv.org/abs/2605.13821
作者: Jiayi Zhang,Yongfeng Gu,Jianhao Ruan,Maojia Song,Yiran Peng,Zhiguang Han,Jinyu Xiang,Zhitao Wang,Caiyin Yang,Yixi Ouyang,Bang Liu,Chenglin Wu,Yuyu Luo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Agentic evolution has emerged as a powerful paradigm for improving programs, workflows, and scientific solutions by iteratively generating candidates, evaluating them, and using feedback to guide future search. However, existing methods are typically instantiated either as fixed hand-designed procedures that are modular but rigid, or as general-purpose agents that flexibly integrate feedback but can drift in long-horizon evolution. Both forms accumulate rich evidence over time, including candidates, feedback, traces, and failures, yet lack a stable interface for organizing this evidence and revising the mechanism that drives future evolution. We address this limitation by formulating agentic evolution as an interactive environment, where the accumulated evolution context serves as a process-level state. We introduce AEvo, a harnessed meta-editing framework in which a meta-agent observes this state and acts not by directly proposing the next candidate, but by editing the procedure or agent context that controls future evolution. This unified interface enables AEvo to steer both procedure-based and agent-based evolution, making accumulated evidence actionable for long-horizon search. Empirical evaluations on agentic and reasoning benchmarks show that AEvo outperforms five evolution baselines, achieving a 26% relative improvement over the strongest baseline. Across three open-ended optimization tasks, AEvo further outperforms four evolution baselines and achieves state-of-the-art performance under the same iteration budget.

[AI-3] Neurosymbolic Auditing of Natural-Language Software Requirements

链接: https://arxiv.org/abs/2605.13817
作者: Bethel Hall,William Eiers
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 10

点击查看摘要

Abstract:Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. We present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two findings. First, stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on its granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive experimental evaluation on open-source hemodialysis safety requirements, we show that the LLM-based approach in VERIMED successfully reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries.
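VERIMED's ambiguity signal, independent formalizations that turn out to be logically inequivalent, can be mimicked at propositional scale without an SMT solver. The sketch below is a hand-rolled truth-table check, not the paper's solver-based pipeline, and the requirement and both "readings" are invented for illustration; a disagreeing assignment plays the role of an SMT counterexample.

```python
from itertools import product

def equivalent(f, g, n_vars):
    """Exhaustively compare two Boolean formalizations over all assignments.
    Returns (True, None) if they agree everywhere, else (False, counterexample)."""
    for bits in product([False, True], repeat=n_vars):
        if f(*bits) != g(*bits):
            return False, bits  # the disagreeing assignment exposes the ambiguity
    return True, None

# Two plausible readings of the ambiguous requirement
# "raise the alarm when pressure is high and the pump is running":
reading_conj = lambda high, running: high and running
reading_disj = lambda high, running: high or running

eq, counterexample = equivalent(reading_conj, reading_disj, 2)
```

Real SMT equivalence checking generalizes this beyond finite truth tables (to arithmetic, arrays, etc.), which is why the paper relies on a solver rather than enumeration.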

[AI-4] Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

链接: https://arxiv.org/abs/2605.13801
作者: Deepak Pandita,Flip Korn,Chris Welty,Christopher M. Homan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their annotations. Overcoming this variance is exceptionally challenging because very little data exists to study how experimental repeatability actually improves as the annotator pool grows. Standard evaluation practices typically rely on a small number of annotations per item (often 3 to 5) and lack the persistent rater identifiers necessary to model individual variance across items. In this work, we introduce a multi-level bootstrapping approach to realistically model annotator behavior. Leveraging datasets with a large number of ratings and persistent rater identifiers, we analyze the tradeoffs between the number of items ( N ) and the number of responses per item ( K ) required to achieve statistical significance.
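The N-versus-K tradeoff the abstract describes can be probed with a two-level (item, then rater) bootstrap. The sketch below is a minimal illustration with a fabricated rating matrix, not the paper's method or data: resampling more raters per item should shrink the variance of the aggregate score across bootstrap replicates.

```python
import random

random.seed(0)

# Toy annotation matrix: RATINGS[item][rater], with persistent rater ids 0..19.
RATINGS = [[(i + r) % 5 for r in range(20)] for i in range(30)]

def multilevel_bootstrap(ratings, n_items, k_raters, reps=200):
    """Level 1: resample items; level 2: resample rater responses within each
    sampled item. Returns the variance of the aggregate mean across replicates."""
    means = []
    for _ in range(reps):
        items = [random.choice(ratings) for _ in range(n_items)]
        scores = [sum(random.choice(item) for _ in range(k_raters)) / k_raters
                  for item in items]
        means.append(sum(scores) / n_items)
    mu = sum(means) / reps
    return sum((m - mu) ** 2 for m in means) / reps

# More responses per item (larger K) shrinks the replicate-to-replicate variance.
v_small_k = multilevel_bootstrap(RATINGS, n_items=30, k_raters=3)
v_large_k = multilevel_bootstrap(RATINGS, n_items=30, k_raters=15)
```

This is the sense in which persistent rater identifiers matter: without them, the second resampling level (rater variance within an item) cannot be modeled at all.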

[AI-5] Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations

链接: https://arxiv.org/abs/2605.13790
作者: Zhonghao Li,Chaoyu Liu,Qian Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Partial differential equations (PDEs) are fundamental for modeling complex natural and physical phenomena. In many real-world applications, however, observational data are extremely sparse, which severely limits the applicability of both classical numerical solvers and existing neural approaches. While neural methods have shown promising results under moderately sparse observations, their inference efficiency at high resolutions is limited, and their accuracy degrades substantially in the extremely sparse regime. In this work, we propose Di-BiLPS, a unified neural framework that effectively handles both forward and inverse PDE problems under extremely sparse observations. Di-BiLPS combines a variational autoencoder to compress high-dimensional inputs into a compact latent space, a latent diffusion module to model uncertainty, and contrastive learning to align representations. Operating entirely in this latent space, the framework achieves efficient inference while retaining flexible input-output mapping. In addition, we introduce a PDE-informed denoising algorithm based on a variance-preserving diffusion process, which further improves inference efficiency. Extensive experiments on multiple PDE benchmarks demonstrate that Di-BiLPS consistently achieves SOTA performance under extremely sparse inputs (as low as 3%), while substantially reducing computational cost. Moreover, Di-BiLPS enables zero-shot super-resolution, as it allows predictions over continuous spatial-temporal domains.

[AI-6] ENSEMBITS: an alphabet of protein conformational ensembles

链接: https://arxiv.org/abs/2605.13789
作者: Kaiwen Shi,Carlos Oliver
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs only capture local geometry of static structures, and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits addresses challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation-invariant encoding of variable-size ensembles, and conquering sparsity in dynamics data. Trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction, and is the strongest standalone structural tokenizer on a token-conditioned ANOVA test on per-residue motion amplitude. Ensembits further matches or exceeds static tokenizers on EC, GO, binding site/affinity prediction, and zero-shot mutation-effect prediction despite using far less pretraining data. Notably, the distillation objective enables Ensembits to predict dynamics tokens from a single predicted structure, which alleviates dynamics data sparsity. As the field moves from static structure prediction toward ensemble generation, Ensembits offers the discrete vocabulary needed to bring dynamics into protein language modeling and design.

[AI-7] Amplification to Synthesis: A Comparative Analysis of Cognitive Operations Before and After Generative AI

链接: https://arxiv.org/abs/2605.13785
作者: Liz Cho,Dongwook Yoon
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cognitive operations are a rising concern in the geopolitical sphere, a quiet yet rigorous fight for public perception and decision making. While such operations have been extensively studied in the context of bot-driven amplification, the emergence of generative AI introduces a new set of capabilities that may have fundamentally altered how these operations are designed and executed. This possible evolution of cognitive operations via generative AI leaves nation states vulnerable without proper mitigation strategies. To address this, we compared behavioral and linguistic coordination patterns in X (formerly Twitter) datasets from the 2016 and 2024 U.S. presidential elections. Utilizing a combined corpus of over 133,000 posts, we applied post-type distribution, semantic clustering, temporal synchrony analysis, and Jaccard-based lexical overlap measures. Findings suggest that the 2024 corpus exhibits a distinct pattern from 2016. Original content rose from 59% to 93% while retweets virtually disappeared; lexical overlap collapsed from a mean Jaccard score of 0.99 to 0.27, with posts converging on the same subject matter expressed in markedly different words; and temporal coordination shifted from pervasive cross-semantic synchrony to narratively concentrated co-occurrence. Taken together, these patterns point toward an operational logic organized around active content generation and narrative-specific targeting - characteristics consistent with generative AI involvement. These findings offer an empirical baseline for future research investigating generative AI’s role in the cognitive operation pipeline, and as a practical reference point for security practitioners developing detection frameworks calibrated to the post-generative AI threat environment.
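The Jaccard-based lexical overlap measure the study relies on is easy to make concrete. The sketch below is a minimal token-set version with invented example posts, not the paper's exact preprocessing: near-duplicate (copy-pasted, 2016-style) content scores near 1.0, while paraphrased content on the same topic scores low.

```python
def jaccard(post_a, post_b):
    """Token-set Jaccard overlap between two posts: |A & B| / |A | B|."""
    ta = set(post_a.lower().split())
    tb = set(post_b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# Copy-pasted amplification vs. a paraphrase of the same message (toy examples).
copy_pasted = jaccard("vote now for change", "vote now for change")
paraphrased = jaccard("vote now for change", "cast your ballot today")
```

A corpus-level mean of such pairwise scores is what collapses from 0.99 to 0.27 in the abstract's comparison.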

[AI-8] LMPath: Language-Mediated Priors and Path Generation for Aerial Exploration

链接: https://arxiv.org/abs/2605.13782
作者: Jonathan A. Diller,Fernando Cladera,Camillo J. Taylor,Vijay Kumar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Poster at 2026 AI-Driven Safe Aerial Robotics Workshop

点击查看摘要

Abstract:Traditional autonomous UAV search missions rely on geometric coverage patterns that ignore the semantic context of the target, leading to significant time waste in large-scale environments. In this paper we present LMPath, a pipeline for generating language-mediated exploration priors for Unmanned Aerial Vehicle (UAV) search missions that leverages semantics. Given a basic geofence and an object-of-interest prompt, LMPath uses generative language models to determine what regions of the environment should contain that object and a foundation vision model run over satellite imagery to segment sub-regions that form the exploration prior. This prior can then be used to generate UAV paths with various objectives, such as minimizing the expected time to locate the object of interest, maximizing the probability that the object is found given a limited travel distance, or narrowing down the search space to sub-regions that are most likely to contain the object. To demonstrate its capabilities, we used LMPath to generate various UAV paths and ran them using a real UAV over large-scale environments. We also ran simulations to demonstrate how paths generated using LMPath outperform traditional path planning approaches for search missions.

[AI-9] MinT: Managed Infrastructure for Training and Serving Millions of LLMs

链接: https://arxiv.org/abs/2605.13779
作者: Mind Lab:Song Cao,Vic Cao,Andrew Chen,Kaijie Chen,Cleon Cheng,Steven Chiang,Kaixuan Fan,Hera Feng,Huan Feng,Arthur Fu,Jun Gao,Hongquan Gu,Aaron Guan,Nolan Ho,Mutian Hong,Hailee Hou,Peixuan Hua,Charles Huang,Miles Jiang,Nora Jiang,Yuyi Jiang,Qiuyu Jin,Fancy Kong,Andrew Lei,Kyrie Lei,Alexy Li,Lucian Li,Ray Li,Theo Li,Zhihui Li,Jiayi Lin,Kairus Liu,Kieran Liu,Logan Liu,Xiang Liu,Irvine Lu,Maeve Luo,Runze Lv,Pony Ma,Verity Niu,Anson Qiu,Vincent Wang,Rio Yang,Maxwell Yao,Carrie Ye,Regis Ye,Wenlin Ye,Josh Ying,Danney Zeng,Yuhan Zhan,Anya Zhang,Di Zhang,Ruijia Zhang,Sueky Zhang,Ya Zhang,Wei Zhao,Ada Zhou,Changhai Zhou,Yuhua Zhou,Xinyue Zhu,Murphy Zhuang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 27 pages. Technical report. Mind Lab

点击查看摘要

Abstract:We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
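The claim that a rank-1 adapter can be under 1% of base-model size follows from simple LoRA arithmetic. The sketch below is a back-of-envelope calculation, not MinT's code, and it assumes square d_model x d_model projection matrices with a fixed count per layer (a simplification of real architectures; all dimensions are illustrative).

```python
def lora_adapter_params(d_in, d_out, rank):
    """LoRA adds two low-rank factors per adapted matrix: A (d_in x r) and B (r x d_out)."""
    return rank * (d_in + d_out)

def adapter_fraction(d_model, n_layers, rank, matrices_per_layer=4):
    """Rough fraction of base-model weights that an adapter-only handoff moves,
    assuming every adapted matrix is d_model x d_model."""
    base = n_layers * matrices_per_layer * d_model * d_model
    adapter = n_layers * matrices_per_layer * lora_adapter_params(d_model, d_model, rank)
    return adapter / base

# Illustrative LLM-scale dimensions (not any specific model in the paper).
frac = adapter_fraction(d_model=4096, n_layers=32, rank=1)
```

Under these assumptions the fraction reduces to 2 * rank / d_model, i.e. about 0.05% at rank 1, which is why moving only the exported adapter is so much cheaper than a merged full checkpoint.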

[AI-10] (How) Do Large Language Models Understand High-Level Message Sequence Charts?

链接: https://arxiv.org/abs/2605.13773
作者: Mohammad Reza Mousavi
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are being employed widely to automate tasks across the software development life-cycle. It is, however, unclear whether these tasks are performed consistently with respect to the semantics of the artefacts being handled. This question is particularly under-researched concerning architectural design specification. In this paper, we address this question for High-Level Message Sequence Charts (HMSCs). These are visual models with a rigorous formal semantics that have been used for various purposes, including as a foundation for Sequence Diagrams in the Unified Modelling Language (UML). We examine whether LLMs “understand” the semantics of HMSCs by examining three LLMs (Gemini-3, GPT-5.4, and Qwen-3.6) on how they perform 129 semantic tasks ranging from querying basic semantic constructs in HMSCs (i.e., events and their ordering) to semantic-preserving abstractions and compositions, and calculating the set of traces and trace-equivalent labelled transition systems. The results show that LLMs only have a modest understanding of the formal semantics of HMSCs (ca. 52% overall accuracy), with great variability across different semantic concepts: while LLMs seem to understand the basic semantic concepts of MSCs (ca. 88% accuracy), they struggle with semantic reasoning in tasks involving abstraction and composition (ca. 36% accuracy) and traces and LTSs (ca. 42% accuracy). In particular, all three LLMs struggle with the notions of co-region and explicit causal dependencies and never employed them in semantic-preserving transformations.

[AI-11] High-Rate Quantized Matrix Multiplication II

链接: https://arxiv.org/abs/2605.13768
作者: Or Ordentlich,Yury Polyanskiy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where the covariance matrix \Sigma_X of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis-free (i.e., characterized by the determinant of \Sigma_X and thus, unlike existing schemes, immune to applying random rotations); and (b) within a multiplicative factor of \frac{2\pi e}{12} (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and the actual \Sigma_X from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
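The (reverse) waterfilling solution mentioned above is classical: choose a water level theta with sum_i min(theta, sigma_i^2) equal to the distortion budget D, then spend rate 0.5 * log2(sigma_i^2 / theta) only on coordinates with variance above the level. The sketch below solves for theta by bisection; the variances and budget are invented for illustration, and this is the textbook rule, not the paper's GPTQ modification.

```python
import math

def reverse_waterfill(variances, total_distortion, iters=60):
    """Return per-coordinate rates (bits) from classical reverse waterfilling."""
    lo, hi = 0.0, max(variances)
    for _ in range(iters):  # bisect on the water level theta
        theta = (lo + hi) / 2
        if sum(min(theta, v) for v in variances) < total_distortion:
            lo = theta  # level too low: distortion budget not yet used up
        else:
            hi = theta
    theta = (lo + hi) / 2
    return [0.5 * math.log2(v / theta) if v > theta else 0.0 for v in variances]

# Three coordinates with variances 4, 1, 0.25 and distortion budget 0.75:
# the level settles at theta = 0.25, so the weakest coordinate gets zero rate.
rates = reverse_waterfill([4.0, 1.0, 0.25], total_distortion=0.75)
```

High-variance coordinates receive more bits, which is exactly the non-uniform allocation the paper uses to improve on GPTQ's equal-rate default.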

[AI-12] KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

链接: https://arxiv.org/abs/2605.13734
作者: Zedong Liu,Xinyang Ma,Dejun Luo,Hairui Zhao,Bing Lu,Wenjing Huang,Yida Gu,Xingchen Liu,Zheng Wei,Jinyang Liu,Dingwen Tao,Guangming Tan
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Accepted by SIGCOMM 2026

点击查看摘要

Abstract:LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload crossing network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression methods are typically static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present KVServe, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving: KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by 50\times; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs and networks, KVServe achieves up to 9.13\times JCT speedup in PD-separated serving and up to 32.8\times TTFT reduction in KV-disaggregated serving.

[AI-13] ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

链接: https://arxiv.org/abs/2605.13725
作者: Yitian Yang,Yiqun Duan,Linghan Huang,Yiqi Zhu,Francesco Bailo,Chunmeizi Su,Huaming Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Large language model (LLM)-based multi-agent simulation offers a powerful testbed for studying social opinion dynamics. Yet current approaches often adopt two contrasting methods: either relying on fixed update rules with limited cognitive grounding or delegating belief change largely to unconstrained LLM interaction. We introduce ScioMind, a cognitively grounded simulation framework that bridges these paradigms by combining structured opinion dynamics with LLM-based agent reasoning. ScioMind integrates three key components: 1) a memory-anchored belief update rule that modulates susceptibility to influence via personality-conditioned anchoring strength; 2) a hierarchical memory architecture that supports persistent, experience-driven belief formation; and 3) dynamic agent profiles derived from a corpus-grounded retrieval pipeline, enabling heterogeneous personalities, rationales, and evolving internal states. We evaluate ScioMind on multiple case studies in a real-world policy debate scenario. Across metrics including polarisation, diversity, extremization, and trajectory stability, the proposed components consistently yield improvements in behavioural realism. In particular, dynamic profiles increase opinion diversity, memory and reflection reduce unstable oscillation, and anchoring induces persistent belief trajectories that better align with patterns reported in political psychology. These results suggest that our cognitively grounded design provides a novel solution for LLM-based social simulation that improves both stability and behavioural realism.

[AI-14] Identifying AI Web Scrapers Using Canary Tokens

链接: https://arxiv.org/abs/2605.13706
作者: Steven Seiden,Triss Ren,Caroline Zhang,Taein Kim,Enze Liu,Emily Wenger
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website owners wish to limit LLM-related web scraping on their site, due to these or other concerns, they may turn to scraper access control mechanisms like the Robots Exclusion Protocol. To be most effective, such mechanisms require site owners to first identify the scrapers that they wish to restrict (e.g., via User-Agent strings). Existing mechanisms to identify LLM-related scrapers rely on voluntary disclosure by companies, one-off experiments by researchers, or crowd-sourced reports – methods that are neither reliable nor scalable. This paper proposes a novel technique for accurately and automatically inferring LLM-related scrapers. We host dynamic websites that serve unique canary tokens to each visiting scraper, then prompt LLMs for information about our sites. If an LLM consistently generates outputs containing tokens unique to a scraper, it provides evidence of exposure to that scraper. Via experiments across 22 production LLM systems, we demonstrate that our approach can reliably identify which scrapers feed which LLM, including several that are not publicly known or disclosed by the companies. Our approach provides a promising avenue for unprivileged third parties to infer which scrapers serve data to which LLMs, potentially enabling better control over unwanted scraping.
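The core mechanism, serving a unique canary token to each visiting scraper and later matching tokens in LLM outputs back to scrapers, can be sketched in a few lines. The code below is a hypothetical illustration, not the paper's system: the secret key, token format, and User-Agent strings are all invented, and real attribution requires the statistical consistency checks the paper describes.

```python
import hashlib

SECRET = "site-owner-secret"  # hypothetical per-site key, never served to visitors

def canary_for(user_agent):
    """Deterministic per-scraper canary token, derived from the visitor's User-Agent."""
    return "zx" + hashlib.sha256((SECRET + user_agent).encode()).hexdigest()[:12]

def serve_page(user_agent):
    """Embed the visitor-specific token in otherwise normal page content."""
    return f"<html><body>Totally normal article. Ref: {canary_for(user_agent)}</body></html>"

def attribute(llm_output, known_agents):
    """If an LLM's output contains a token, map it back to the scraper that received it."""
    return [ua for ua in known_agents if canary_for(ua) in llm_output]
```

If a model consistently reproduces the token served only to, say, one crawler's User-Agent, that is evidence the model was fed data from that crawler.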

[AI-15] Adaptive mine planning under geological uncertainty: A POMDP framework for sequential decision-making

链接: https://arxiv.org/abs/2605.13702
作者: Hamza Khalifi,Jef Caers,Yassine Taha,Mostafa Benzaazoua,Abdellatif Elghali
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Strategic mine production scheduling under geological uncertainty is conventionally formulated as a stochastic optimization problem in which a fixed extraction sequence and routing decisions are computed ex ante. This plan-driven paradigm treats uncertainty as passive: decisions are hedged across geological scenarios, but planning does not anticipate how future observations will inform future decisions. We propose a different perspective by formulating mine scheduling as a Partially Observable Markov Decision Process (POMDP), in which extraction and routing decisions are made sequentially with planning explicitly integrating the expectation of future belief updates. To achieve computational tractability, we introduce a hybrid SA-POMDP architecture that combines simulated annealing-based (SA) value approximation with ensemble-based belief updating via ensemble smoother with multiple data assimilation (ES-MDA). At each decision epoch, candidate actions are evaluated through their expected long-term value under the current belief, and the belief is updated as mining observations are assimilated. This yields an adaptive policy rather than a fixed plan. We evaluate the framework on a copper-gold open-pit mining complex with multiple processing destinations. Under a statistically consistent prior, the SA-POMDP reduces the expectation-reality gap from 22.3% to 4.6%, improving realized NPV by USD8.4M relative to one-shot stochastic optimization. Under systematic prior misspecification of 10%, the adaptive framework outperforms static planning by up to USD44.6M (36.9%), demonstrating structural robustness beyond scenario hedging. These results show that sequential belief updating transforms geological uncertainty from a passive constraint into an active component of value creation.
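The sequential belief updating at the heart of the POMDP formulation can be shown in miniature with a plain Bayes update over discrete geological scenarios. This is a deliberately simplified stand-in for the paper's ensemble-based ES-MDA assimilation; the scenarios, likelihoods, and block values below are invented.

```python
def update_belief(prior, likelihoods):
    """Bayes update of scenario weights after a mining observation.
    prior: {scenario: weight}; likelihoods: {scenario: P(obs | scenario)}."""
    posterior = {s: prior[s] * likelihoods[s] for s in prior}
    z = sum(posterior.values())
    return {s: w / z for s, w in posterior.items()}

def expected_value(belief, values):
    """Expected economic value of a block under the current belief."""
    return sum(belief[s] * values[s] for s in belief)

# Three geological scenarios for a block's copper grade (toy numbers).
belief = {"low": 0.3, "mid": 0.4, "high": 0.3}
values = {"low": 1.0, "mid": 2.0, "high": 4.0}
# A high-grade assay observation is most likely under the 'high' scenario.
belief = update_belief(belief, {"low": 0.1, "mid": 0.3, "high": 0.9})
```

An adaptive policy re-plans after each such update, which is how observations become "an active component of value creation" rather than a fixed hedge.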

[AI-16] The WidthWall: A Strict Expressivity Hierarchy for Hypergraph Neural Networks

链接: https://arxiv.org/abs/2605.13690
作者: Fengqing Jiang,Yuetai Li,Yichen Feng,Kaiyuan Zheng,Luyao Niu,Bhaskar Ramasubramanian,Basel Alomair,Linda Bushnell,Radha Poovendran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hypergraphs provide a natural framework to model higher-order interactions in scientific, social, and biological systems. Hypergraph neural networks (HGNNs) aim to learn from such data, yet it remains unclear which higher-order structures these models can represent. We show that hypergraph expressivity is governed by which small patterns an architecture can detect and count. We formalize this via homomorphism densities, which measure how often a structural motif appears in a hypergraph. Combining classical homomorphism-count completeness with invariant approximation, we show that homomorphism densities generate all continuous hypergraph invariants and organize them into a strict hierarchy indexed by hypertree width. This yields a Width Wall: a fundamental architectural limit beyond which no hidden dimension, training procedure or fixed-depth HGNN can represent invariants requiring wider patterns. Our framework provides a unified characterization of 15 HGNN architectures, precisely identifies information lost by clique expansion, and motivates density-aware models that extend expressivity beyond bounded-width message passing. We experimentally validate this finding on an application node classification suite of real-world hypergraphs, where the Width Wall predicts when graph-reduction baselines fail and when density features help.

[AI-17] A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning

链接: https://arxiv.org/abs/2605.13687
作者: Jason Gaitonde,Frederic Koehler,Elchanan Mossel,Joonhyung Shin,Allan Sly
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We introduce a family of synthetic languages with hierarchical structure – generated by a broadcast process on trees – for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an exact k-gram ansatz in place of transformers with context length k, a substitution we then validate empirically. Using this ansatz we derive explicit asymptotic predictions for distributional statistics of the sequences produced by a trained model, instantiated in two settings. For the Ising broadcast process (a soft-constrained language), we prove that the variance of the generated sum scales log-linearly in the context depth and its kurtosis converges to that of a Gaussian – both deviating from the true language for any sublinear context. For the coloring broadcast process (a hard-constrained language) in the freezing regime, bounded-context autoregression produces sequences that, with high probability, are inconsistent with any valid coloring of the underlying tree. Together these results imply an \Omega(n) lower bound on the context length required to faithfully sample length-n sequences. In contrast, we prove that an autoregressive reasoning model with only \Theta(\log n) working memory can sample exactly from the true language – an exponential improvement. We confirm both the lower-bound predictions and the reasoning-based upper bound empirically with transformers trained on the synthetic language; the trained models track our asymptotic predictions quantitatively across a wide range of context sizes.
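The tree broadcast process that generates these synthetic languages is simple to simulate. The sketch below samples an Ising-style broadcast on a binary tree (the flip probability and depth are invented parameters, and reading the leaves left to right yields one "sentence"); it is a toy generator, not the paper's experimental setup.

```python
import random

random.seed(1)

def broadcast_leaves(depth, flip_p=0.2):
    """Ising-style broadcast: a random root spin in {-1, +1} is copied down a
    binary tree, each edge flipping the sign with probability flip_p.
    Returns the 2**depth leaf spins, read left to right."""
    spins = [random.choice([-1, 1])]  # root
    for _ in range(depth):
        nxt = []
        for s in spins:
            for _ in range(2):  # two children per node
                nxt.append(-s if random.random() < flip_p else s)
        spins = nxt
    return spins

leaves = broadcast_leaves(depth=6)  # one length-64 "sentence"
```

Long-range correlations between distant leaves come from shared ancestors high in the tree, which is exactly what a bounded-context autoregressive model cannot see.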

[AI-18] NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating ICML2026

链接: https://arxiv.org/abs/2605.13651
作者: Zhongju Yuan,Geraint Wiggins,Dick Botteldooren
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted as a regular paper by ICML 2026

点击查看摘要

Abstract:Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-cognition ALM processing only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen’s average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.
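The idea of gating expensive ALM calls on energy fluctuations can be sketched with a running-baseline detector. The code below is a generic adaptive-threshold gate with invented parameters (window size, deviation multiplier, frame energies), not NAACA's oscillatory working memory: frames whose energy deviates sharply from the recent baseline are the only ones that would be escalated.

```python
def salience_gate(energies, window=5, k=3.0):
    """Flag frame indices whose energy deviates from the trailing-window mean
    by more than k adaptive deviations; only flagged frames would trigger
    the (expensive) higher-level model."""
    flagged = []
    for i, e in enumerate(energies):
        hist = energies[max(0, i - window):i]
        if len(hist) < window:
            continue  # warm-up: keep filling the baseline first
        mu = sum(hist) / window
        dev = (sum((h - mu) ** 2 for h in hist) / window) ** 0.5
        if abs(e - mu) > k * max(dev, 1e-6):  # floor avoids zero-deviation blowup
            flagged.append(i)
    return flagged

# Steady background hum with one sharp, rare event at frame 12.
frames = [1.0] * 12 + [9.0] + [1.0] * 7
events = salience_gate(frames)
```

Only the rare spike is escalated, while the dominant background never reaches the higher-cognition stage, which is the cost-saving behavior the abstract reports.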

[AI-19] Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

链接: https://arxiv.org/abs/2605.13646
作者: Seokha Moon,Minseung Lee,Joon Seo,Jinkyu Kim,Jungbeom Lee
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:End-to-end autonomous driving, which bypasses traditional modular pipelines by directly predicting future trajectories from sensor inputs, has recently achieved substantial progress. However, existing methods often overlook the causal inter-dependencies in ego-vehicle planning, ignoring the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight leads to inconsistent and unreliable trajectory predictions, especially in interaction-critical scenarios where ego decisions and neighboring agent behaviors must be reasoned about jointly. To address this limitation, we propose CaAD, a Causality-aware end-to-end Autonomous Driving framework that captures these dependencies within a shared latent scene representation. First, we propose an ego-centric joint-causal modeling module that builds on the marginal prediction branch and learns causal dependencies between the ego vehicle and interaction-relevant agents. Second, we employ a causality-aware policy alignment stage implemented with joint-mode embeddings to align the stochastic ego policy with planning-oriented closed-loop feedback computed from surrounding traffic and map context. On the Bench2Drive and NAVSIM benchmarks, CaAD demonstrates strong closed-loop planning performance, achieving a Driving Score of 87.53 and Success Rate of 71.81 on Bench2Drive, and a PDMS of 91.1 on NAVSIM.

[AI-20] How to Interpret Agent Behavior

链接: https://arxiv.org/abs/2605.13625
作者: Jie Gao,Kaiser Sun,Jen-tse Huang,Katherine Van Koevering,Sijie Ji,Heyuan Huang,Weiyan Shi,Zhuoran Lu,Ziang Xiao,Daniel Khashabi,Mark Dredze
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 34 pages in total

点击查看摘要

Abstract:Autonomous agents such as Claude Code and Codex now operate for hours or even days. Understanding their runtime behavior has become critical for downstream tasks such as diagnosing inefficiencies, fixing bugs, and ensuring better oversight. A primary way to gain this understanding is analyzing the reasoning trajectories and execution traces these agents generate. Yet such data remains in unstructured natural-language form, making it difficult for humans to interpret at scale. We introduce ACTONOMY (a combination of Action and Taxonomy), a taxonomy for describing and analyzing agent behavior at runtime. ACTONOMY has two components: (1) the taxonomy itself, developed through Grounded Theory and structured as a three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories; and (2) an open repository that hosts the living taxonomy, provides an automated analysis pipeline that applies it to agent trajectory analysis, and defines an extension protocol for customization and growth. Our experiments show that ACTONOMY can compare behavioral profiles across agents and characterize a single agent’s behavior across diverse trajectories, surfacing patterns indicative of failure modes. By providing a shared vocabulary, ACTONOMY helps researchers, agent designers, and end users interpret agent behavior more consistently, enabling better oversight and control.

[AI-21] Position: Assistive Agents Need Accessibility Alignment ICML2026

链接: https://arxiv.org/abs/2605.13579
作者: Jie Hu,Changyuan Yan,Yu Zheng,Ziqian Wang,Jiaming Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure, Accepted to ICML 2026

点击查看摘要

Abstract:Assistive agents for Blind and Visually Impaired (BVI) users require accessibility alignment as a first-class design objective. Despite rapid progress in agentic AI, most systems are designed and evaluated under assumptions of sighted interaction, low-cost verification, and tolerable trial-and-error, leading to systematic failures in assistive scenarios that cannot be resolved by model scaling or post-hoc interface adaptations alone. Drawing on an analysis of 778 assistance task instances from prior work, we show that current agentic AI remain prone to failure in assistive scenarios due to mismatches between sighted-user design assumptions and the verification, risk, and interaction constraints faced by BVI users. We argue that accessibility should be treated as an alignment problem rather than a peripheral usability concern. To this end, we introduce accessibility alignment and propose a lifecycle-oriented design pipeline for accessibility-aligned assistive agents, spanning user research, system design, deployment and post-deployment iteration. We conclude that BVI-centered assistive tasks provide a critical stress test for agentic AI and motivate a broader shift toward inclusive agent design.

[AI-22] Learning Local Constraints for Reinforcement-Learned Content Generators

链接: https://arxiv.org/abs/2605.13570
作者: Debosmita Bhaumik,Julian Togelius,Georgios N. Yannakakis,Ahmed Khalifa
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Constraint-based game content generators that learn local constraints from existing content, such as Wave Function Collapse (WFC), can generate visually satisfying game levels but face challenges in guaranteeing global properties, such as playability. On the other hand, reinforcement-learning trained generators can guarantee global properties – because such properties can easily be included in reward functions – but the results can be visually dissatisfying. In this paper, we explore ways to combine these methods. Specifically, we constrain the action space of a PCGRL generator with constraints learned by WFC, effectively allowing the PCGRL generator to achieve global properties while forced to adhere to local constraints. To better analyze how this hybrid content generation method operates, we vary the number and type of inputs, and we test whether to randomly collapse the starting state and exclude rare patterns. While the method is sensitive to hyperparameter tuning, the best of our trained generators produce visually satisfying and playable puzzle-platform game levels – such as Lode Runner levels – with desired global properties.

[AI-23] Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model

链接: https://arxiv.org/abs/2605.13568
作者: Riccardo Cavarra,Lupo Lovatelli,Shaheim Ogbomo-Harmitt,Shahid Aziz,Adelaide De Vecchi,Andrew King,Oleg Aslanidi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: submitted to the 9th International Conference on Computational and Mathematical Biomedical Engineering, 4 pages, 1 figure, 1 table

点击查看摘要

Abstract:Myocardial infarction (MI) is a leading cause of death, and predicting its adverse outcomes is an urgent need. Yet ECG-based prognostic models underperform because deep learning requires large, labelled datasets, which are scarce in medicine. Foundation models can learn from unlabelled ECGs via self-supervision, but medically relevant training strategies remain underexplored. We propose a pretrained artificial intelligence model that combines patient-specific temporal information using contrastive learning with supervised multitask heads, then fine-tunes on post-MI outcome prediction. The proposed model outperformed a model trained from scratch (0.794 vs 0.608 AUC), showing that clinically structured ECG modelling improves classification in limited-data regimes.

[AI-24] Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

链接: https://arxiv.org/abs/2605.13554
作者: Asim Osman,Sasha Abramowitz,Mark Bergh,Ulrich Armel Mbou Sob,Ruan John de Kock,Omayma Mahjoub,Oussama Hidaoui,Noah De Nicola,Arnol Manuel Fokam,Felix Chalumeau,Daniel Rajaonarivonivelomanantsoa,Siddarth Singh,Refiloe Shabe,Juan Claude Formanek,Simon Verster Du Toit,Arnu Pretorius
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Whilst the existence of an on-policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO’s performance, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.
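A minimal sketch of the core idea, purely illustrative (the mean-over-actions baseline, the `eps` value, and every name below are assumptions of this sketch, not details from the paper): derive a policy advantage from contrastive Q-values and feed it into the standard PPO clipped surrogate.

```python
def contrastive_advantage(q_values, action):
    """Advantage of `action`: its contrastive Q-value minus the mean Q over actions.

    The mean-Q baseline is one simple choice; the paper does not specify it here.
    """
    baseline = sum(q_values) / len(q_values)
    return q_values[action] - baseline

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample, returned as a loss to minimize."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * advantage, clipped_ratio * advantage)

# Toy example: contrastive Q-values for three discrete actions.
q = [0.1, 0.7, 0.2]
adv = contrastive_advantage(q, action=1)      # above-baseline action
loss = ppo_clip_loss(ratio=1.1, advantage=adv)
```

Note that no reward or replay buffer appears anywhere: the advantage comes entirely from the contrastive Q-estimates, which is the connection the abstract describes.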

[AI-25] AttenA: Rectifying Action Inequality in Robotic Foundation Models

链接: https://arxiv.org/abs/2605.13548
作者: Daojie Peng,Fulong Ma,Jiahang Cao,Qiang Zhang,Xupeng Xie,Jian Guo,Ping Luo,Andrew F. Luo,Boyu Zhou,Jun Ma
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This “flat” training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model’s learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.
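The velocity-driven reweighting can be sketched as follows; this is one plausible instantiation guessed from the abstract (the first-step padding, the `eps` smoothing, and the mean-1 normalization are all assumptions of this sketch): each timestep's action loss is weighted by the inverse of the local action-space velocity, so slow precision segments dominate the objective.

```python
def velocity_weights(actions, eps=1e-3):
    """Per-step loss weights from the inverse velocity field (assumes >= 2 steps).

    Low-velocity, precision-critical segments get large weights; high-velocity
    transit motions get small ones. Weights are normalized to mean 1.
    """
    speeds = [sum((c - p) ** 2 for c, p in zip(cur, prev)) ** 0.5
              for prev, cur in zip(actions, actions[1:])]
    speeds.insert(0, speeds[0])          # pad so every step has a speed
    raw = [1.0 / (s + eps) for s in speeds]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def weighted_action_loss(pred, target, weights):
    """Velocity-reweighted squared error over a trajectory."""
    per_step = [sum((a - b) ** 2 for a, b in zip(p, t))
                for p, t in zip(pred, target)]
    return sum(w * l for w, l in zip(weights, per_step)) / len(per_step)
```

Because the weighting only modifies the training objective, it plugs into any backbone without new parameters, matching the "plug-and-play" claim.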

[AI-26] Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models

链接: https://arxiv.org/abs/2605.13540
作者: Haonan Yuan,Qingyun Sun,Junhua Shi,Xingcheng Fu,Jianxin Li,Philip S. Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dynamic graphs are ubiquitous in real-world systems, and building generalizable dynamic Graph Foundation Models has become a frontier in graph learning. However, dynamic graphs from different domains pose fundamental challenges to unified modeling, as their semantic and temporal patterns are inherently inconsistent, making the multi-domain pre-training difficult. Consequently, the widely used “pretrain-then-finetune” paradigm often suffers from severe negative knowledge transfer. To the best of our knowledge, there exists no multi-domain dynamic GFM. In this work, we propose DyGFM, a Dynamic Graph Foundation Model over multiple domains based on decoupled and divergence-conditioned prompting. To disentangle transferable semantics from the domain-specific dynamics, we introduce a dual-branch pre-training strategy with semantic-temporal decoupling. To alleviate negative transfer during domain adaptation, we further develop a cross-domain routing mechanism with divergence-aware expert selection. To enable efficient downstream fine-tuning, we design a divergence-conditioned prompt generator that injects lightweight, learnable graph prompts tailored to semantic and temporal traits. Extensive experiments on continuous dynamic graph benchmarks demonstrate that DyGFM consistently outperforms 12 state-of-the-art baselines on both node classification and link prediction tasks, achieving superior effectiveness and efficiency.

[AI-27] HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning

链接: https://arxiv.org/abs/2605.13536
作者: Qingyun Zou,Feng Yu,Hongshi Tan,Yao Chen,Bingsheng He,WengFai Wong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High-Level Synthesis (HLS) compiles algorithmic C/C++ descriptions into hardware, with Quality of Results (QoR) – latency and resource utilization – critically governed by pragma configurations and code structure. Existing LLM-based HLS approaches train for functional correctness but ignore QoR entirely. We observe that reinforcement learning (RL) for HLS does not require absolute synthesis results – only relative comparisons between candidates. Based on this insight, we propose HLS-Seek, a QoR-aware NL-to-HLS framework that replaces expensive synthesis-in-the-loop RL with a comparative proxy reward model achieving 99.53% Pareto-dominance accuracy. To prevent reward hacking, we introduce uncertainty-aware Monte Carlo (MC) dropout switching that selectively invokes real Vitis HLS synthesis for low-confidence candidates and online updates the proxy, creating a self-improving reward system. HLS-Seek achieves 81.5% syntax correctness pass@1 and 81.4% Func@5 on HLS-eval with only 7B parameters, surpassing GPT-5.1 and other frontier models while achieving 8.5× faster training than real-reward RL. On QoR evaluation, HLS-Seek achieves the lowest latency on 16/30 kernels and Pareto-dominates HLS-specific baselines on 9 kernels.
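The key observation is that the RL reward only needs relative QoR judgments, not absolute synthesis numbers. A toy sketch of such a comparison (the ±1/0 reward encoding is an assumption of this illustration, not taken from the paper; lower latency and resource values are treated as better):

```python
def pareto_dominates(a, b):
    """True if QoR tuple `a` (e.g. latency, LUTs, FFs) dominates `b`:
    no worse in every metric and strictly better in at least one (lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def comparative_reward(candidate, reference):
    """Relative reward used in place of absolute synthesis results:
    +1 if the candidate Pareto-dominates the reference, -1 if dominated, 0 if incomparable."""
    if pareto_dominates(candidate, reference):
        return 1.0
    if pareto_dominates(reference, candidate):
        return -1.0
    return 0.0
```

In the paper this comparison is produced by a learned proxy model rather than real synthesis, with real Vitis HLS runs invoked only when the proxy's MC-dropout uncertainty is high.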

[AI-28] Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

链接: https://arxiv.org/abs/2605.13534
作者: Jiabei Liu,Wenyu Mao,Junfei Tan,Chunxu Shen,Lingling Yi,Jiancan Wu,Xiang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi-step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This may result in low signal-to-noise ratios (SNR) during search, degrading reasoning accuracy and leading to unnecessary reasoning steps. In this paper, we introduce MultiSearch, an RL-based framework that addresses these limitations through multi-query retrieval and explicit merging of retrieved information. At each reasoning step, MultiSearch generates queries from multiple perspectives and retrieves external information in parallel, expanding the scope of relevant information and mitigating the reliance on any single retrieval result. Then, the agent consolidates and refines retrieved information during the merging step, improving the SNR and ensuring more accurate reasoning. Additionally, we propose a reinforcement learning framework with a multi-process reward design to optimize agents for both multi-query retrieval and information consolidation. Extensive experiments on seven benchmarks demonstrate that MultiSearch outperforms baseline methods, enhancing the SNR of retrieval and improving reasoning performance in question-answering tasks.
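A toy sketch of one reasoning step with multi-query retrieval and explicit merging (all function names are invented here, the retrieval runs sequentially rather than in parallel, and the vote-based ranking is only a stand-in for the agent's own consolidation):

```python
def merge_results(result_lists):
    """Explicit merging step: deduplicate passages across queries and rank each
    by how many query perspectives retrieved it (a simple vote as a noise filter)."""
    votes = {}
    for results in result_lists:
        for doc in results:
            votes[doc] = votes.get(doc, 0) + 1
    return sorted(votes, key=lambda d: -votes[d])

def multi_search_step(question, reformulate, retrieve, top_k=3):
    """One step: generate several query perspectives, retrieve for each, then merge.

    `reformulate` and `retrieve` are caller-supplied callables (e.g. an LLM and a
    search backend); nothing here is specific to the paper's implementation.
    """
    queries = reformulate(question)
    return merge_results([retrieve(q)[:top_k] for q in queries])
```

Passages retrieved by several independent reformulations are ranked first, which is one simple way a merge can raise the signal-to-noise ratio before the agent reasons over the results.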

[AI-29] MMSkills: Towards Multimodal Skills for General Visual Agents

链接: https://arxiv.org/abs/2605.13527
作者: Kangning Zhang,Shuai Shao,Qingyao Li,Jianghao Lin,Lingyue Fu,Shijian Wang,Wenxiang Jiao,Yuan Lu,Weiwen Liu,Weinan Zhang,Yong Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 8 figures, 8 tables. Project page: this https URL

点击查看摘要

Abstract:Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.

[AI-30] Discovery of Hidden Miscalibration Regimes

链接: https://arxiv.org/abs/2605.13484
作者: Katarzyna Kobalczyk,Mihaela van der Schaar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Calibration is commonly evaluated by comparing model confidence with its empirical correctness, implicitly treating reliability as a function of the confidence score alone. However, this view can hide substantial structure: models may be systematically overconfident on some kinds of inputs and underconfident on others, causing global reliability diagnostics to obscure localised calibration failures. To address this, we formulate the problem of discovering hidden miscalibration regimes without assuming access to predefined data slices. We define the corresponding miscalibration field and propose a diagnostic framework for estimating it. Our approach learns a calibration-aware representation of the input space and estimates signed local miscalibration by kernel smoothing in the learned geometry. Across four real-world LLM benchmarks and twelve LLMs, we find that input-dependent calibration heterogeneity is prevalent. We further show that the discovered fields are actionable: they support local confidence correction and reduce calibration error in systematically miscalibrated regions where confidence-based methods such as isotonic regression and temperature scaling are less effective.
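The miscalibration field can be illustrated with a Nadaraya-Watson-style kernel smoother of the signed gap (confidence minus correctness) over a representation space. The Gaussian kernel and fixed bandwidth below are assumptions of this sketch, and the learned calibration-aware representation is abstracted here as precomputed embeddings:

```python
import math

def signed_miscalibration_field(embeddings, confidences, correct, query, bandwidth=1.0):
    """Kernel-smoothed local signed miscalibration at `query`:
    positive => locally overconfident, negative => locally underconfident."""
    num, den = 0.0, 0.0
    for e, c, y in zip(embeddings, confidences, correct):
        d2 = sum((a - b) ** 2 for a, b in zip(e, query))
        w = math.exp(-d2 / (2 * bandwidth ** 2))
        num += w * (c - float(y))   # signed gap: confidence minus correctness
        den += w
    return num / den
```

On inputs where one region is confidently wrong and another is hesitantly right, the field takes opposite signs in the two regions, exactly the heterogeneity a single global reliability diagram would average away.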

[AI-31] CUBic: Coordinated Unified Bimanual Perception and Control Framework

链接: https://arxiv.org/abs/2605.13452
作者: Xingyu Wang,Pengxiang Ding,Jingkai Xu,Donglin Wang,Zhaoxin Fan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side – either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination – thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.

[AI-32] Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

链接: https://arxiv.org/abs/2605.13435
作者: JaeHyeok Doo,Byeongguk Jeon,Seonghyeon Ye,Kimin Lee,Minjoon Seo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages

点击查看摘要

Abstract:There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization requires backpropagating through numerical solvers and often leads to instability. Existing approaches typically address this issue by restricting the expressive capacity of flow-based policies, resulting in a trade-off between optimization stability and representational flexibility. To resolve this, we introduce Q-Flow, a framework that leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow. This formulation enables stable policy optimization using intermediate value gradients without unrolling the numerical solver, effectively bridging the gap between stability and expressivity. We evaluate Q-Flow in the offline learning setting on the challenging OGBench suite, where it consistently outperforms state-of-the-art baselines by an average of 10.6 percentage points, while also enabling stable online adaptation within the same framework.

[AI-33] TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

链接: https://arxiv.org/abs/2605.13414
作者: Zabir Al Nazi,Shubhashis Roy Dipta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying language models as autonomous agents requires more than per-task accuracy: when an agent faces a queue of problems under a finite token budget, it must decide which to attempt, in what order, and how much compute to commit to each, all before any execution feedback is available. This is the prospective form of metacognitive control studied for decades in human cognition, yet whether language models possess it remains untested. We introduce TRIAGE, an evaluation framework in which a model receives a task pool and a token budget calibrated to its own baseline cost, and commits to a single ordered plan that jointly encodes selection, sequencing, and per-problem allocation. Plans are scored against an oracle with full knowledge of the model’s solvability and cost on each problem, yielding a triage efficiency ratio on a common scale. We evaluate frontier and open-source models, with and without reasoning enabled, across competition mathematics, graduate-level science, code generation, and expert multidisciplinary knowledge, and find that current language models exhibit substantial gaps in prospective metacognitive control, revealing a previously unmeasured capability dimension with direct implications for resource-efficient agent deployment.

[AI-34] RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents

链接: https://arxiv.org/abs/2605.13391
作者: Liangtian Liu,Zeyuan Wang,Ziyu Li,Kai Ouyang,Zichao Tang,Chengfu Liu,Haifeng Li,Hanwen Yu,Wentao Yang,Cheng Yang,Dongyang Hou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from “see” to “action”, as OpenClaw-style frameworks enable agents to autonomously operate massive RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or retrieval-augmented generation (RAG). However, in the massive and multi-source heterogeneous RS tool ecosystem, such passive mechanisms struggle to dynamically balance “context load” and “toolset completeness” throughout task reasoning, thus exhibiting inherent limitations: full tool registration triggers context space deficits during long-horizon tasks, whereas RAG retrieval may omit critical tools in essential steps. To overcome these bottlenecks, this paper redefines tool selection by arguing that the agent should act as an active explorer within the tool space. Based on this perspective, we propose RS-Claw, a novel RS agent architecture. By leveraging Skill encapsulation technology at the tool end, this architecture hierarchically structures tool descriptions, enabling the agent to execute on-demand sequential decision-making: initially selecting relevant skill branches by reading only tool summaries, then dynamically loading detailed descriptions, and ultimately achieving precise invocation. This active paradigm not only significantly liberates the agent’s context space but also effectively ensures the accurate hit rate of critical tools during long-horizon reasoning. Systematic experiments on the Earth-Bench benchmark demonstrate that RS-Claw’s active exploration mechanism effectively filters semantic noise and substantially frees up reasoning space, achieving an input token compression ratio of up to 86%, and comprehensively outperforming existing Flat and RAG baselines across complex reasoning evaluations.
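The summary-first, detail-on-demand loading can be sketched with a tiny skill tree; every class and field name below is invented for illustration and not taken from RS-Claw:

```python
class SkillNode:
    """One node of a hierarchical skill tree: a cheap one-line summary is always
    visible, while the full tool description is loaded only when this branch is chosen."""
    def __init__(self, name, summary, detail="", children=()):
        self.name, self.summary, self.detail = name, summary, detail
        self.children = list(children)

def loaded_context(root, chosen_path):
    """Text the agent actually reads: summaries of all top-level branches, plus
    full details only along the actively explored branch."""
    lines, node = [], root
    for child in root.children:
        lines.append(f"{child.name}: {child.summary}")
    for name in chosen_path:
        node = next(c for c in node.children if c.name == name)
        if node.detail:
            lines.append(node.detail)
    return "\n".join(lines)
```

Unselected branches contribute only their one-line summaries to the context, which is the mechanism behind the large input-token compression the abstract reports.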

[AI-35] A Horn extension of DL-Lite with NL data complexity

链接: https://arxiv.org/abs/2605.13367
作者: Janos Arpasi,Bartosz Jan Bednarczyk,Magdalena Ortiz
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Submitted to Description Logic Workshop 2025. Full version in preparation

点击查看摘要

Abstract:The literature on ontology-mediated query answering (OMQA) has been shaped by two key results: first-order rewritability for DL-Lite, and PTime-hardness of data complexity for essentially every description logic beyond it. This has effectively positioned DL-Lite as the only practical choice for query rewriting, restricting OMQA solutions to first-order queries and ontologies that can be rewritten into them. This AC0 vs. PTime dichotomy is especially limiting if we consider that OMQA targets graph-structured data, and that standard graph query languages (including the recent ISO standards GQL and SQL/PGQ) are typically NL-complete. Towards identifying a rich Horn DL that can be rewritten into graph query languages and that can still express many ELI and DL-Lite ontologies, we introduce a stratification mechanism for ELI that controls the interaction between conjunction and recursion. In this way, we obtain EL⊥⪯, a description logic that strictly extends the core DL-Lite, supports reachability axioms and restricted conjunction, and allows for reasoning in NL. We establish the NL upper bound via a rewriting into nested two-way regular path queries, a fragment of GQL, providing initial evidence that our ontology language is a promising candidate for extending OMQA to graph query languages.

[AI-36] AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents

链接: https://arxiv.org/abs/2605.13357
作者: Hailin Zhong,Shengxin Zhu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 16 pages

点击查看摘要

Abstract:Foundation models have transformed automated code generation, yet autonomous software-engineering agents remain unreliable in realistic development settings. The dominant explanation locates this gap in model capability. We propose a different locus: software-engineering capability emerges from a model-harness-environment system, in which a runtime substrate – the harness – mediates how a foundation-model agent observes a project, acts on it, receives feedback, and establishes that a change is complete. We formalize this substrate as AI Harness Engineering and identify eleven component responsibilities: task specification, context selection, tool access, project memory, task state, observability, failure attribution, verification, permissions, entropy auditing, and intervention recording. We operationalize the harness through a four-level ladder (H0-H3) that progressively exposes runtime support to the agent, and we propose a trace-based evaluation protocol that converts each agent run into an auditable episode package. Applied to a controlled validation task, the framework yields episode packages whose evidence structure varies systematically with harness level: lower levels produce only a final patch, higher levels produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports. The framework reframes the central question of autonomous software engineering from whether a foundation model can produce a patch to whether the model-harness-environment system can produce a verifiably correct, attributed, and maintainable change. We outline a research program for the runtime systems that foundation-model software agents will require.

[AI-37] Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models ICML2026

链接: https://arxiv.org/abs/2605.13338
作者: Shuqiang Wang,Wei Cao,Jiaqi Weng,Jialing Tao,Licheng Pan,Hui Xue,Zhixuan Chu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026

点击查看摘要

Abstract:Large Reasoning Models (LRMs) are increasingly integrated into systems requiring reliable multi-step inference, yet this growing dependence exposes new vulnerabilities related to computational availability. In particular, LRMs exhibit a tendency to “overthink” when confronted with incomplete or logically inconsistent inputs, producing excessively long and redundant reasoning traces. This behavior significantly increases inference latency and energy consumption, forming a potential vector for denial-of-service (DoS) style resource exhaustion. In this work, we investigate this attack surface and propose an automated black-box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems. Our method employs a hierarchical genetic algorithm (HGA) operating on structured problem decompositions, and optimizes a composite fitness function designed to maximize both response length and reflective overthinking markers. Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise baselines. We further demonstrate strong transferability, showing that adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs. These findings highlight overthinking as a shared and exploitable vulnerability in modern reasoning systems, underscoring the need for more robust defenses.
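The composite fitness, combining response length with reflective-overthinking markers, might look like the toy function below; the marker list and the weights are assumptions of this sketch, not the paper's actual objective:

```python
# Hypothetical markers of reflective overthinking; the paper's list is not given.
REFLECTIVE_MARKERS = ("wait", "hmm", "let me reconsider", "actually", "on second thought")

def overthink_fitness(response, alpha=1.0, beta=50.0):
    """Composite fitness the genetic algorithm would maximize: a word-count proxy
    for response length plus a weighted count of reflective markers."""
    length_term = len(response.split())
    text = response.lower()
    marker_term = sum(text.count(m) for m in REFLECTIVE_MARKERS)
    return alpha * length_term + beta * marker_term
```

A hierarchical GA would evaluate this fitness on model outputs for each perturbed problem and select the perturbations that drive it highest, all without access to model internals, which is what makes the attack black-box.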

[AI-38] Diversity of Extensions in Abstract Argumentation IJCAI2026

链接: https://arxiv.org/abs/2605.13332
作者: Johannes K. Fichte,Markus Hecher,Yasir Mahmood,Zhengjun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
备注: Technical Report to the paper accepted at IJCAI 2026

点击查看摘要

Abstract:Argumentation is an important topic of AI for modelling and reasoning about arguments. In abstract argumentation, we consider directed graphs, so-called argumentation frameworks (AF), that express conflicts between arguments. The semantics is defined by the notion of extensions, which are sets of arguments that satisfy particular relationship conditions in the AF. Usually, standard reasoning tasks in argumentation do not reveal how far apart extensions are. We introduce a quantitative notion of diversity of extensions based on the symmetric difference and provide a systematic complexity classification. Intuitively, diversity captures whether extensions of a framework (accepted viewpoints) differ only marginally or represent fundamentally incompatible sets of arguments. We study whether an AF admits k-diverse extensions, whether it admits k-diverse extensions covering specific arguments, and how to compute the largest k for which an AF admits k-diverse extensions. We outline a prototype and provide an evaluation for computing diversity levels.
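The symmetric-difference diversity at the heart of the three problems is directly computable; a minimal sketch (the brute-force pairwise search below is only for illustration, as the paper is precisely about the complexity of these questions):

```python
def diversity(ext_a, ext_b):
    """Diversity of two extensions: the size of their symmetric difference."""
    return len(set(ext_a) ^ set(ext_b))

def admits_k_diverse(extensions, k):
    """True if some pair of extensions has diversity >= k."""
    exts = [set(e) for e in extensions]
    return any(len(a ^ b) >= k
               for i, a in enumerate(exts) for b in exts[i + 1:])

def max_diversity(extensions):
    """Largest k for which the framework admits k-diverse extensions."""
    exts = [set(e) for e in extensions]
    if len(exts) < 2:
        return 0
    return max(len(a ^ b) for i, a in enumerate(exts) for b in exts[i + 1:])
```

Extensions that share most arguments have small diversity, while fundamentally incompatible viewpoints, with little overlap, score high.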

[AI-39] VERA-MH: Validation of Ethical and Responsible AI in Mental Health

链接: https://arxiv.org/abs/2605.13318
作者: Luca Belli,Kate H. Bentley,Josh Gieringer,Emily Van Ark,Nilu Zhao,Pradip Thachile,Matt Hawrilenko,Millard Brown,Adam M. Chekroud
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Chatbot usage has increased, including in fields they were never developed for, notably mental health support. To that end, we introduce Validation of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically-validated evaluation for the safety of chatbots in the context of mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risks, assessing how well chatbots respond to users who might be in crisis. VERA-MH comprises three steps: conversation simulation, conversation judging and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. Such user personas have been developed under clinical guidance to make sure that, among others, multiple risk factors, demographic characteristics and disclosure factors were represented. In the judging step, a second model is used as an LLM-as-a-Judge, together with a clinically-developed rubric. The rubric is structured as a flow, with a single Yes/No question asked each time, to improve answer consistency and highlight models’ failure modes. In the last stage, the results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present the results of the evaluations for four leading LLM providers.

[AI-40] What properties of reasoning supervision are associated with improved downstream model quality? ICCS2026

链接: https://arxiv.org/abs/2605.13290
作者: Mikołaj Langner,Dzmitry Pihulski,Jan Eliasz,Michał Rajkowski,Przemysław Kazienko,Maciej Piasecki,Jan Kocoń,Teddy Ferdinan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of the International Conference on Computational Science (ICCS) 2026

点击查看摘要

Abstract:Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks. These findings establish a scale-aware framework for validating reasoning data, enabling practitioners to select effective training sets without the need for exhaustive empirical testing.

[AI-41] Delightful Exploration

链接: https://arxiv.org/abs/2605.13287
作者: Ian Osband
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Most exploration algorithms search broadly until uncertainty is resolved. When the action space is too large to resolve within budget, practitioners default to ε-greedy, which bounds disruption but spends its override blindly. We introduce Delight-gated exploration (DE), a host-override rule that spends exploratory actions only when their prospective delight (expected improvement times surprisal) exceeds a gate price. This practical heuristic recovers a classical result: Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms shut off above a prior-determined threshold, and selected linear-bandit overrides consume finite information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, the same hyperparameters transfer without retuning, and DE shows much weaker regret growth than Thompson Sampling and ε-greedy in the tested unresolved regimes. Delight improves acting for the same reason it improves learning: it prices scarce resources by the product of upside and surprisal.
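The gating rule described in the abstract can be sketched in a few lines. This is a toy reading, not the paper's implementation: the arm model (posterior mean, outcome probability) and the pick-the-most-delightful tie-break are assumptions.

```python
import math

def delight(expected_improvement, outcome_prob):
    """Prospective delight of an exploratory action: expected improvement
    over the incumbent times the surprisal (-log p) of its outcome."""
    return expected_improvement * -math.log(max(outcome_prob, 1e-12))

def choose_action(incumbent_value, arms, gate_price):
    """Delight-gated override: keep the incumbent unless some arm's delight
    exceeds the gate price, then play the most delightful arm.
    arms maps name -> (posterior mean, outcome probability)."""
    best_name, best_d = "incumbent", gate_price
    for name, (post_mean, outcome_prob) in arms.items():
        d = delight(max(post_mean - incumbent_value, 0.0), outcome_prob)
        if d > best_d:
            best_name, best_d = name, d
    return best_name
```

Raising the gate price makes the rule fall back to pure exploitation, which matches the abstract's framing of bounding disruption.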

[AI-42] Differentiable Learning of Lifted Action Schemas for Classical Planning

链接: https://arxiv.org/abs/2605.13282
作者: Jonas Reiter,Jakob Elias Gebler,Hector Geffner
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Classical planners can effectively solve very large deterministic MDPs represented in STRIPS or PDDL where states are sets of atoms over objects and relations, and lifted action schemas add or delete these atoms. This compact representation yields strong search heuristics and provides an ideal setting for structural generalization, since lifted relations and action schemas give rise to infinitely many domain instances. A central challenge is to learn these relations and action schemas from data, and recent approaches have addressed this problem using different types of observations. In this work, we develop a novel neural network architecture for learning action schemas from traces where states are fully observed but action arguments are unobserved. The problem is a simplification but an important step towards learning planning domains from sequences of images and action labels, and we aim to solve this simplification in a nearly perfect manner. The challenge lies in learning the action schemas while simultaneously identifying the action arguments from observed state changes. Our approach yields a robust differentiable component that can then be integrated into larger neuro-symbolic models. We evaluate the architecture on various planning domains, where the learned lifted action schemas must recover the ground-truth structure. Additionally, we report experiments on robustness to observation noise and on a variation related to slot-based dynamics models.

[AI-43] The Readability Spectrum: Patterns, Issues and Prompt Effects in LLM-Generated Code

链接: https://arxiv.org/abs/2605.13280
作者: Hengzhi Ye,Fengyuan Ran,Weiwei Xu,Minghui Zhou
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are transforming software development, the functional quality of generated code has become a central focus, leaving readability, one of the critical non-functional attributes, understudied. Given that LLM-generated code still needs human review before adoption, it is important to understand its readability, especially compared with human-written code, and the role of prompt design in shaping it. We therefore set out to conduct a systematic investigation into the readability of LLM-generated code. To systematically quantify code readability, we establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code. Based on the model, we evaluate the readability of code generated by the mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode. We find that current LLMs produce code with overall readability comparable to human-written code, but displaying distinct readability issue patterns. We further examine how different prompt dimensions affect the readability of LLM-generated code, and find that function signatures, constraints and style descriptions emerge as the most influential factors, while the overall impact of prompt design remains limited. Our findings indicate that, on one hand, LLM-generated code is at least comparable to human-written code in readability, validating its potential for systematic integration into software workflows from a non-functional perspective; on the other hand, distinct readability issue patterns and the limited effectiveness of prompt engineering reveal a latent technical debt, highlighting the need for future research to improve the readability of LLM-generated code and thus ensure long-term maintainability.

[AI-44] D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

链接: https://arxiv.org/abs/2605.13276
作者: Yucheng Guo,Yongjian Guo,Zhong Guan,Wen Huang,Haoran Sun,Haodong Yue,Xiaolong Xiang,Shuai Di,Zhen Sun,Luqiao Wang,Junwu Xiong,Yicheng Gong
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The rapid evolution of Embodied AI has enabled Vision-Language-Action (VLA) models to excel in multimodal perception and task execution. However, applying Reinforcement Learning (RL) to these massive models in large-scale distributed environments faces severe systemic bottlenecks, primarily due to the resource conflict between high-fidelity physical simulation and the intensive VRAM/bandwidth demands of deep learning. This conflict often leaves overall throughput constrained by execution-phase inefficiencies. To address these challenges, we propose D-VLA, a high-concurrency, low-latency distributed RL framework for large-scale embodied foundation models. D-VLA introduces “Plane Decoupling,” physically isolating high-frequency training data from low-frequency weight control to eliminate interference between simulation and optimization. We further design a four-thread asynchronous “Swimlane” pipeline, enabling full parallel overlap of sampling, inference, gradient computation, and parameter distribution. Additionally, a dual-pool VRAM management model and topology-aware replication resolve memory fragmentation and optimize communication efficiency. Experiments on benchmarks like LIBERO show that D-VLA significantly outperforms mainstream RL frameworks in throughput and sampling efficiency for billion-parameter VLA models. In trillion-parameter scalability tests, our framework maintains exceptional stability and linear speedup, providing a robust system for high-performance general-purpose embodied agents.

[AI-45] Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

链接: https://arxiv.org/abs/2605.13255
作者: Junlong Ke,Zichen Wen,Weijia Li,Conghui He,Linfeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher’s token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher’s predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
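The teacher-entropy confidence gate can be illustrated concretely: down-weight token positions where the teacher's predictive distribution has high entropy, while keeping a nonzero floor on every weight, as the abstract requires. The linear schedule 1 - H/H_max and the floor value are illustrative assumptions, not the paper's exact formula.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def confidence_gate(teacher_dists, w_min=0.1):
    """Per-token confidence weights: high-entropy teacher positions are
    down-weighted, but every token keeps at least the floor w_min."""
    h_max = math.log(len(teacher_dists[0]))  # entropy of the uniform distribution
    return [max(w_min, 1.0 - entropy(p) / h_max) for p in teacher_dists]
```

A uniform (maximally uncertain) teacher position falls to the floor weight, while a confident, near-one-hot position keeps full weight.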

[AI-46] It's not the Language Model, it's the Tool: Deterministic Mediation for Scientific Workflows

链接: https://arxiv.org/abs/2605.13245
作者: Marios Adamidis,Danae Katrisioti,Yannis Tzitzikas,Emmanuel Stratakis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures, 2 appendices. Submitted to SETN 2026

点击查看摘要

Abstract:Language models can produce convincing scientific analyses, but repeated generations on the same data do not guarantee the same result. A researcher may regenerate an identical query and receive a different fit, a different peak position or a different analysis procedure, without an obvious way to decide which output to trust. We propose typed mediation, a pattern in which the model orchestrates deterministic tools rather than generating analytical code. Each tool encodes one researcher’s exact procedure for one instrument, ported through structured interviews. The model selects which tool to call and with what parameters. The tool produces the result. Regeneration does not change it. We evaluate this claim by running the same photoluminescence analysis on four platforms, including three commercial foundation models, four times each with the same prompt. The typed tool produces identical results across all runs. The commercial platforms either vary in numerical output and analytical methodology across runs, or fail to produce valid results on the task. We deploy this pattern on two instruments serving users over approximately six months, with very positive user feedback. Both cases are very challenging: they involve proprietary binary formats and per-seat licensed software, which force the tool to remain on local infrastructure alongside the data and the instrument it operates. We argue that deployment topology is not just a preference, but a structural requirement of scientific tool mediation. The result is a practical pattern for deploying language models in scientific workflows where reproducibility is mandatory, reducing analysis time from weeks to minutes while guaranteeing identical outputs across runs.

[AI-47] Teacher-Guided Policy Optimization for LLM Distillation

链接: https://arxiv.org/abs/2605.13230
作者: Xinyu Liu,Kechen Jiao,Chunyang Xiao,Runsong Zhao,Junhao Ruan,Bei Li,Jiahao Liu,Qifan Wang,Xin Chen,Jingang Wang,Tong Xiao,JingBo Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The convergence of reinforcement learning and imitation learning has positioned Reverse KL (RKL) as a promising paradigm for on-policy LLM distillation, aiming to unify exploration with teacher supervision. However, we identify a critical limitation: when the student and teacher distributions diverge significantly, standard RKL often fails to yield meaningful improvement due to uninformative negative feedback. To address this inefficiency, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy algorithm that incorporates dense directional guidance by leveraging teacher predictions conditioned on the student’s rollout. Because TGPO remains on-policy, the algorithm integrates seamlessly with existing RLVR frameworks without requiring additional data annotation. Experiments on complex reasoning benchmarks demonstrate that TGPO significantly outperforms standard baselines and is robust to different teachers.
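The reverse-KL signal this line of work builds on can be written per token as follows. Plain lists stand in for model distributions; this sketches the baseline RKL objective that TGPO improves on, not TGPO's teacher-conditioned guidance itself.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """Per-token reverse KL(student || teacher), the dense supervision used
    in on-policy distillation, computed on the student's own rollout."""
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)
```

The abstract's failure mode corresponds to this quantity being dominated by uninformative negative feedback when the two distributions barely overlap.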

[AI-48] Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization IJCAI2026

链接: https://arxiv.org/abs/2605.13229
作者: Yuhan Wu,Huan Zhang,Wei Cheng,Chen Shen,Jingyue Yang,Wei Hu
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted at the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

点击查看摘要

Abstract:LLMs have shown immense potential for code translation, yet they often struggle to ensure both syntactic correctness and semantic consistency. While preference-based learning offers a promising alignment strategy, it is hindered by unreliable semantic rewards derived from sparse test cases or restrictive reference translations. We argue that a robust semantic reward for code translation must be derived directly from the source code. In this paper, we propose CTO to improve code translation with syntax-guided and semantic-aware preference optimization. Through contrastive learning, we train a cross-lingual semantic model to directly assess functional equivalence between source and translated code. By formulating code translation as a multi-objective optimization problem, this robust semantic signal is seamlessly unified with compiler-based syntactic feedback within the direct preference optimization framework. Extensive experiments on C++, Java, and Python translations demonstrate that CTO significantly outperforms existing baselines and alternative preference optimization strategies.

[AI-49] An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

链接: https://arxiv.org/abs/2605.13221
作者: Hanwen Zhang,Dusit Niyato,Wei Zhang,Xin Lou,Malcolm Yoke Hean Low
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages

点击查看摘要

Abstract:In cloud manufacturing, unmanned aerial vehicles (UAVs) can support both product collection and mobile edge computing (MEC). This joint operation forms a hybrid scheduling problem, where physical logistics decisions are coupled with computational task scheduling. In this paper, UAVs collect finished products from manufacturing stations and transport them back to a central depot. Meanwhile, computational tasks generated by industrial sensor devices at these stations are processed locally, at UAVs, or offloaded via UAVs to the cloud. This coupling makes the problem challenging. A UAV can provide MEC services only during its service window at a station, so routing decisions directly determine when UAV-assisted offloading is available. Routing decisions also affect the UAV energy budget and the availability of onboard computing and communication resources for computational task execution under task deadline constraints. To address this, we propose an agentic-AI-assisted optimization framework with two components. First, we develop an agentic AI that combines large language models, retrieval-augmented generation, and chain-of-thought reasoning to translate user input into an interpretable mathematical formulation for the hybrid scheduling problem. Second, we design a hierarchical deep reinforcement learning approach based on proximal policy optimization (PPO), where the upper layer learns UAV routing and the lower layer optimizes per-slot task execution and resource allocation. Simulation results show that the proposed framework yields more consistent formulations, while the hierarchical PPO achieves full product collection in 99.6% of the last 500 episodes and maintains a 100% deadline satisfaction rate, with more stable performance than the advantage actor-critic approach.

[AI-50] Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning CVPR2026

链接: https://arxiv.org/abs/2605.13213
作者: Hao Zhou,Tiru Wu,Yan Jiang,Wanqi Zhou,Junxing Hu,Ai Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Multi-modal multi-agent systems (MM-MAS) have gained increasing attention for their capacity to enable complex reasoning and coordination across diverse modalities. As these systems continue to expand in scale and functionality, investigating their potential vulnerabilities has become increasingly important. However, existing studies on adversarial attacks in multi-agent systems primarily focus on isolated agents or unimodal settings, leaving the vulnerabilities of MM-MAS largely underexplored. To bridge this gap, we introduce HAM^3, a Hierarchical Attack framework for multi-modal multi-agent systems that decomposes attacks into three interconnected layers. Specifically, at the perception layer, HAM^3 mounts attacks by perturbing visual inputs, textual inputs, and their fused visual-textual representations. At the communication layer, it performs communication-level attacks that corrupt message content and interaction topology, such as manipulating shared context or communication links to distort collective information flow. At the reasoning layer, it conducts reasoning-level attacks that interfere with each agent's cognitive pipeline, biasing reasoning trajectories and ultimately compromising final decisions. We evaluate HAM^3 on the GQA benchmark through multi-agent systems built on distinct reasoning paradigms including ReAct, Plan-and-Solve, and Reflexion. Experiments demonstrate that our framework achieves an Attack Success Rate of up to 78.3%, with reasoning-layer attacks being the most effective. More than half of the successful attacks lead multiple agents to produce consistent errors. These findings offer valuable insights for building more robust and interpretable multi-agent intelligence.

[AI-51] McCast: Memory-Guided Latent Drift Correction for Long-Horizon Precipitation Nowcasting

链接: https://arxiv.org/abs/2605.13197
作者: Penghui Wen,Yu Luo,Lintao Wang,Mengwei He,Patrick Filippi,Thomas Francis Bishop,Zhiyong Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing precipitation nowcasting methods typically adopt an autoregressive formulation, where future states are predicted from previous outputs. However, such an approach accumulates errors over long rollouts, causing forecasts to drift away from physically plausible evolution trajectories. Although various studies have attempted to alleviate this problem by improving step-wise prediction accuracy, they largely neglect the global temporal evolution of meteorological systems and lack mechanisms to actively correct drift during rollouts. To address this issue, we propose McCast, a memory-guided latent drift correction method for precipitation nowcasting. Rather than treating memory as an unordered dictionary of latent states for passive conditioning, McCast leverages temporally organized memory to actively correct autoregressive latent evolution. Specifically, McCast introduces a Drift-Corrective Memory Bank (DCBank) that explicitly estimates the temporally consistent drift corrections to calibrate the divergent trajectory. DCBank performs drift correction in two stages: a Corrective Latent Extractor first predicts an initial correction from the current prediction and a reference latent state, and a Correction-Aware Memory Retrieval module then refines the initial correction using temporally organized historical memory. By explicitly correcting latent evolution, instead of improving step-wise prediction accuracy only, McCast produces more temporally coherent and reliable long-horizon forecasts. Experiments on two widely used benchmarks, SEVIR and MeteoNet, show that McCast achieves state-of-the-art performance, particularly in challenging long-horizon forecasting scenarios.

[AI-52] ECG-NAT: A Self-supervised Neighborhood Attention Transformer for Multi-lead Electrocardiogram Classification

链接: https://arxiv.org/abs/2605.13194
作者: Mahsa Gazeran,Sayvan Soleymanbaigi,Fatemeh Daneshfar,Amjad Seyedi,Fardin Akhlaghian Tab
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Electrocardiogram (ECG) arrhythmia classification remains challenging due to signal variability, noise, limited labeled data, and the difficulty in achieving both accuracy and efficiency in models. While self-supervised learning reduces label dependency, most methods target either global contextual features or local morphological patterns, but rarely implement hierarchical multi-scale feature extraction. ECG signals require architectures that simultaneously capture fine-grained beat-level morphology and broader rhythm-level dependencies with computational efficiency. To overcome this limitation, this paper proposes the Electrocardiogram Neighborhood Attention Transformer (ECG-NAT), a novel self-supervised learning approach tailored for multi-lead ECG classification. Our two-stage approach begins with generative pretraining, using a masked autoencoder to reconstruct partially masked ECG signals across multiple diverse datasets, enabling the model to learn robust, domain-invariant representations from unlabeled data. This is followed by discriminative fine-tuning with a dual-loss function that combines supervised contrastive and cross-entropy losses, aligning representation learning with label prediction. The hierarchical attention mechanism efficiently captures multi-scale temporal features from localized beat morphology to broader rhythm patterns at low computational cost. ECG-NAT achieves robust performance on benchmark datasets, with 88.1% accuracy using only 1% labeled data, demonstrating strong efficacy in low-resource settings. The framework combines superior classification performance with computational efficiency, making it practical for real-time ECG diagnosis. The code will be made available upon acceptance at: this https URL.
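The generative pretraining stage, reconstructing partially masked ECG signals, can be sketched on a single lead. The patch size, mask ratio, and zero-masking below are illustrative choices, not ECG-NAT's actual configuration.

```python
import numpy as np

def mask_patches(signal, mask_ratio=0.5, patch=4, seed=0):
    """Masked-autoencoder-style input: split a 1-D lead into patches and
    zero a random subset; the model is trained to reconstruct them."""
    rng = np.random.default_rng(seed)
    x = np.asarray(signal, dtype=float).copy()
    n_patches = len(x) // patch
    masked = rng.choice(n_patches, size=int(n_patches * mask_ratio), replace=False)
    for i in masked:
        x[i * patch:(i + 1) * patch] = 0.0
    return x, sorted(int(i) for i in masked)
```

A reconstruction loss is then computed only on the returned masked patch indices, which is what lets the model learn from unlabeled recordings.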

[AI-53] N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

链接: https://arxiv.org/abs/2605.13190
作者: Aleksander Lorenc,Frédéric Berdoz,Joël Mathys,Roger Wattenhofer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Improving the inference efficiency of autoregressive transformers typically means reducing FLOPs per token, usually through approximations that degrade model quality. We introduce N-vium, a mixture-of-exits transformer that partially parallelizes computation across depth on standard hardware, increasing effective FLOPs per second rather than minimizing compute per token. N-vium attaches prediction heads at multiple depths and defines the next-token distribution as a learned mixture over these exits, with token-adaptive routing. This formulation strictly generalizes the standard transformer, which is recovered exactly when routing assigns zero mass to all intermediate heads. Sampling from the mixture is exact, and complete KV caches are recovered by deferring the upper-layer computation and batching it with later tokens. We pretrain N-vium at scales up to 1.5B parameters. Our largest model reaches 57.9% wall-clock speedup over a parameter- and data-matched standard transformer at no perplexity cost.
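The mixture-of-exits distribution is easy to state: softmax each exit head's logits, then mix with token-adaptive gate weights. As the abstract notes, routing all mass to the final exit recovers the standard transformer exactly. Shapes and the plain softmax below are illustrative.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def mixture_next_token(exit_logits, gate_weights):
    """Next-token distribution as a learned mixture over exit heads;
    zero gate mass on intermediate exits reduces this to the final head."""
    probs = [softmax(np.asarray(l, dtype=float)) for l in exit_logits]
    return sum(w * p for w, p in zip(gate_weights, probs))
```

Because the mixture is a proper distribution, sampling from it is exact, which is the property the paper's deferred upper-layer computation relies on.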

[AI-54] Stable Attention Response for Reliable Precipitation Nowcasting

链接: https://arxiv.org/abs/2605.13181
作者: Penghui Wen,Zexin Hu,Sen Zhang,Patrick Filippi,Xiaogang Zhu,Allen Benter,Thomas Bishop,Zhiyong Wang,Kun Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Precipitation nowcasting remains challenging due to the highly localized, rapidly evolving, and heterogeneous nature of atmospheric dynamics. Although recent methods increasingly adopt attention-based architectures in both unimodal and multimodal settings, they mainly emphasize stronger representation learning and prediction capacity, while paying less attention to the stability of attention responses across samples. In this work, we show that cross-sample instability of attention-response energy is an important and previously underexplored source of forecasting unreliability. Empirically, inaccurate forecasts are associated with larger attention-response energy variance across heads and layers. Theoretically, we show that cross-sample variability can propagate through self-attention, and enlarge a lower bound on prediction error. Based on this insight, we propose HARECast, a Head-wise Attention Response Energy-regulated framework for precipitation nowcasting. HARECast explicitly models head-wise attention-response energy and stabilizes it through a group-wise regularization objective that reduces cross-sample fluctuations. The proposed formulation is generic and applicable to both unimodal and multimodal nowcasting architectures. We instantiate HARECast in a standard forecasting pipeline with reconstruction branches and a diffusion-based predictor, and evaluate it on commonly used benchmarks–SEVIR and MeteoNet. Experimental results demonstrate that HARECast achieves state-of-the-art performance.
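The quantity HARECast regularizes, cross-sample variance of head-wise attention-response energy, can be computed directly. This is a minimal reading of the abstract: energy is taken as the squared norm of each head's output, and the paper additionally applies the penalty group-wise across layers.

```python
import numpy as np

def energy_variance_penalty(attn_out):
    """Mean over heads of the cross-sample variance of attention-response
    energy. attn_out: array of shape (batch, heads, tokens, dim)."""
    x = np.asarray(attn_out, dtype=float)
    energy = (x ** 2).sum(axis=(2, 3))       # (batch, heads)
    return float(energy.var(axis=0).mean())  # variance across the batch
```

The penalty is zero when every sample elicits the same response energy and grows with the cross-sample fluctuations the paper links to unreliable forecasts.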

[AI-55] Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

链接: https://arxiv.org/abs/2605.13171
作者: Moritz Firsching,Paul Lezeau,Salvatore Mercuri,Miklós Z. Horváth,Yaël Dillies,Calle Sönne,Eric Wieser,Fred Zhang,Thomas Hubert,Blaise Agüera y Arcas,Pushmeet Kohli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 4 figures, 5 tables

点击查看摘要

Abstract:As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof autoformalization. Notably, the repository provides a structured interface connecting mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them. Demonstrating its immediate utility, the benchmark has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures. We describe our approach to ensuring the correctness of these formalizations in a collaborative open-source project where contributions stem from an active community. In this framework, AI-generated proofs and disproofs serve as a valuable auditing mechanism to iteratively improve the fidelity of the benchmark. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research-level mathematics.
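As a hedged illustration of what a formalized open statement in such a benchmark looks like, here is the twin prime conjecture as a Lean 4 theorem left open with `sorry` (assuming Mathlib's `Nat.Prime`; this is not an actual entry from the Formal Conjectures repository):

```lean
-- An open conjecture as a precise Lean 4 statement whose proof is left
-- open. Illustrative only, not taken from the Formal Conjectures dataset.
import Mathlib.Data.Nat.Prime.Basic

theorem twin_prime_conjecture :
    ∀ n : ℕ, ∃ p : ℕ, n ≤ p ∧ p.Prime ∧ (p + 2).Prime := by
  sorry
```

A proof system "solves" such an entry by replacing `sorry` with a proof term that the Lean kernel checks, which is what gives the benchmark its zero-contamination verification property.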

[AI-56] Strikingness-Aware Evaluation for Temporal Knowledge Graph Reasoning IJCAI-ECAI2026

链接: https://arxiv.org/abs/2605.13153
作者: Rikui Huang,Shengzhe Zhang,Wei Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to IJCAI-ECAI 2026

点击查看摘要

Abstract:Temporal Knowledge Graph Reasoning (TKGR) aims at inferring missing (especially future) events from historical data. Current evaluation in TKGR uniformly weights all events, ignoring that most are trivial repetitions, which overestimates the true reasoning ability. Therefore, the rare outstanding events, whose prediction demands deeper reasoning, should be distinguished and emphasized. To this end, we propose a strikingness-aware evaluation framework, which introduces a rule-based strikingness measuring framework (RSMF) to quantify event strikingness by comparing its expected occurrence with peer events derived from temporal rules. Strikingness is then integrated as a weighting factor into metrics like weighted MRR and Hits@k. Experiments on four TKG benchmarks reveal: 1) All representative models perform worse as event strikingness increases, 2) Path-based methods excel on low-strikingness events and representation-based ones on high-strikingness events, 3) We design an ensemble method whose gains stem from fitting trivial events rather than reasoning improvement. Our framework provides a more rigorous evaluation, refocusing the field on predicting outstanding events.
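Strikingness-weighted metrics follow directly from the abstract's description: each test event's contribution to MRR or Hits@k is scaled by its strikingness score. Normalizing by the total strikingness is an assumed convention.

```python
def weighted_mrr(ranks, strikingness):
    """Strikingness-weighted MRR: trivial repetitions (low strikingness)
    contribute less to the aggregate score."""
    total = sum(strikingness)
    return sum(s / r for r, s in zip(ranks, strikingness)) / total

def weighted_hits_at_k(ranks, strikingness, k):
    """Strikingness-weighted Hits@k over the same rank list."""
    total = sum(strikingness)
    return sum(s for r, s in zip(ranks, strikingness) if r <= k) / total
```

Under uniform strikingness both reduce to the standard metrics, so the weighting only re-emphasizes the rare outstanding events.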

[AI-57] A Constraint Programming Approach for n-Day Lookahead Playoff Clinching

链接: https://arxiv.org/abs/2605.13142
作者: Gili Rosenberg,Kyle E. C. Booth,J. Kyle Brubaker,Ruben S. Andrist
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 18 pages, 5 figures, 4 tables. Accepted to CP 2026

点击查看摘要

Abstract:In professional sports, a team has clinched the playoffs if they are guaranteed a postseason spot, regardless of the outcomes of any remaining games. As the season progresses, sports fans and other stakeholders are interested in precisely when, and under what conditions, their team will clinch the playoffs. In this paper, we investigate playoff clinching in the context of the National Hockey League (NHL), where it is computationally challenging to produce clinching scenarios due, in part, to complex tie-breakers. We present an algorithm that determines under which combinations of game outcomes in the next n days a team will clinch the playoffs (i.e., "n-day lookahead clinching"). Our approach is a custom tree search which employs various preprocessing techniques, pruning strategies, and node ordering heuristics to efficiently explore the space of possible outcomes. The tree search leverages a constraint programming (CP)-based subroutine for inference that determines if a team has clinched the playoffs for some snapshot in time of the regular season (i.e., "0-day lookahead clinching"). This CP subroutine aims to find a counter-example in which the team being evaluated is eliminated, taking into account qualification rules and the NHL's extensive list of tie-breakers. We validate the efficacy of our algorithm using hundreds of scenarios based on public NHL data for the seasons 2021-22 through 2024-25. The methods introduced can be readily extended to other metrics of interest, including mathematical proof of playoff elimination, clinching the Presidents' Trophy, as well as clinching (or being eliminated from clinching) any other seed in the standings.
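The counter-example view of clinching can be sketched by brute force: a team has clinched iff no combination of remaining results drops it out of a playoff spot. This toy win/loss league (2 points per win, no tie-breakers) only illustrates the semantics; the paper replaces the enumeration with a CP model precisely because NHL rules make it intractable.

```python
from itertools import product

def has_clinched(points, remaining, team, playoff_spots):
    """Exhaustive 0-day-lookahead check: search every completion of the
    remaining games for a counter-example where `team` misses the playoffs.
    points: {team: current points}; remaining: list of (home, away) games."""
    for outcome in product((0, 1), repeat=len(remaining)):
        pts = dict(points)
        for (home, away), home_wins in zip(remaining, outcome):
            pts[home if home_wins else away] += 2
        standing = sorted(pts, key=lambda t: -pts[t])
        if standing.index(team) >= playoff_spots:  # counter-example found
            return False
    return True
```

The n-day lookahead version asks the same question conditioned on a fixed assignment of the next n days' games, which is what the paper's tree search enumerates.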

[AI-58] GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training

链接: https://arxiv.org/abs/2605.13130
作者: Junjie Li,Ziao Wang,NingXuan Ma,Jianghong Ma,Xiaofeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing reasoning data curation pipelines score whole samples, treating every intermediate step as equally valuable. In reality, steps within a trace contribute very unevenly, and selecting reasoning data well requires assessing them individually. We present GRACE, a gradient-aligned curation method that views each reasoning trace as a sequence of optimization events and scores every step by two complementary signals: its alignment with the answer-oriented gradient direction, and its consistency with the preceding reasoning trajectory. Step-level scores are aggregated into a sample-level value for subset selection, using only the model’s internal optimization signals and no external reward models or step annotations. To make this scalable, GRACE introduces a representation-level gradient proxy that estimates step-level alignment from token-level upstream signals in a single forward pass. Post-training Qwen3-VL-2B-Instruct on MMathCoT-1M, GRACE reaches 108.8% of the full-data performance with 20% of the data and retains 100.2% with only 5%, with subsets that transfer effectively across model backbones.
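The two step-level signals can be sketched with explicit gradient vectors: alignment with the answer-oriented gradient, and consistency with the preceding trajectory. The cosine proxy and the equal-weight mix `beta` are assumptions; GRACE itself estimates these from token-level signals in a single forward pass rather than materializing per-step gradients.

```python
import numpy as np

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def step_scores(step_grads, answer_grad, beta=0.5):
    """Score each reasoning step by (i) cosine alignment with the
    answer-oriented gradient and (ii) consistency with the running mean
    of earlier step gradients (1.0 for the first step)."""
    g_ans = np.asarray(answer_grad, dtype=float)
    scores, history = [], []
    for g in (np.asarray(s, dtype=float) for s in step_grads):
        align = _cos(g, g_ans)
        consist = _cos(g, np.mean(history, axis=0)) if history else 1.0
        scores.append(beta * align + (1.0 - beta) * consist)
        history.append(g)
    return scores

def sample_value(scores):
    """Aggregate step scores into one sample-level value (mean here)."""
    return float(np.mean(scores))
```

Subset selection then keeps the samples with the highest aggregated values, using only the model's internal optimization signals.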

[AI-59] MLGIB: Multi-Label Graph Information Bottleneck for Expressive and Robust Message Passing

链接: https://arxiv.org/abs/2605.13126
作者: Chaokai Wu,Haofu Shi,Ningxuan Ma,Jianghong Ma,Xiaofeng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) suffer from over-squashing in deep message passing, where information from exponentially growing neighborhoods is compressed into fixed-dimensional representations. We show that this issue becomes a distinct failure mode in multi-label graphs: neighboring nodes often share only limited labels while differing across many irrelevant ones, causing predictive signals to be diluted by noisy label information. To address this challenge, we propose the Multi-Label Graph Information Bottleneck (MLGIB), which formulates multi-label message passing as constrained information transmission under irrelevant label noise. MLGIB balances expressiveness and robustness by preserving predictive label signals while suppressing irrelevant noise. Specifically, it constructs a Markovian dependence space and derives tractable variational bounds, where the lower bound maximizes mutual information with target labels and the upper bound constrains redundant source information. These bounds lead to an end-to-end label-aware message-passing architecture. Extensive experiments on multiple benchmarks demonstrate consistent improvements over existing methods, validating the effectiveness and generality of the proposed framework.

[AI-60] SECOND-Grasp: Semantic Contact-guided Dexterous Grasping

链接: https://arxiv.org/abs/2605.13117
作者: Han Yi Shin,Heeju Ko,Jaewon Mun,Qixing Huang,Jaehyeok Lee,Sung June Kim,Honglak Lee,Sujin Jang,Sangpil Kim
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Achieving reliable robotic manipulation, such as dexterous grasping, requires a synergy between physically stable interactions and semantic task guidance, yet these objectives are often treated as separate, disjoint goals. In this paper, we investigate how to integrate dexterous grasping techniques, i.e., physically stable grasps for object lifting and language-guided grasp generation, to achieve both physical stability and semantic understanding. To this end, we propose SECOND-Grasp (SEmantic CONtact-guided Dexterous Grasping), a unified framework that enables robotic hands to dynamically adjust grasping strategies based on semantic reasoning while ensuring physical feasibility. We begin by obtaining coarse contact proposals through vision-language reasoning to infer where contacts should occur based on object properties, followed by segmentation to localize these regions across views. To further ensure consistency across multiple viewpoints, we introduce Semantic-Geometric Consistency Refinement (SGCR), which refines initial contact predictions by enforcing semantic consistency across views and removing geometrically invalid regions, yielding reliable 3D contact maps. Then, we derive a feasible hand pose for each contact map via inverse kinematics, generating a supervision signal for policy learning. Our approach, trained on DexGraspNet, consistently outperforms baselines in lifting success rate on both seen and unseen categories, achieving 98.2% and 97.7%, respectively, while also improving intent-aware grasping by 12.8% and 26.2%. We further show promising results on additional datasets and robotic hands, including Shadow Hand and Allegro Hand.

[AI-61] Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

链接: https://arxiv.org/abs/2605.13113
作者: Jose Luna,Yankun Wu,Xiaofei Xie,Noa Garcia
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: FAccT 2026

点击查看摘要

Abstract:Text-to-image (T2I) generative models are increasingly used to produce content for education, media, and public-facing communication, and are starting to be integrated into higher-impact pipelines. Since generated images tend to reinforce stereotypes, producing representational erasure via “default” depictions and shaping perceptions of who belongs in certain roles, a growing body of work has proposed metrics to quantify gender bias in T2I outputs. Yet existing evaluations remain fragmented. Metrics are often reported without a shared view of what they measure, what assumptions they entail, or how their results should be interpreted under different deployment contexts. This limits the usefulness of gender bias measurement for both technical auditing and emerging governance discussions. We propose a risk-aligned auditing framework for gender bias in T2I models composed of three constituents that connect risk categories, evaluation metrics, and harms. First, we identify risk-tiered use-case profiles aligned with the EU AI Act’s risk categories to motivate why auditing expectations may vary with deployment contexts and stakeholder exposure. Second, we construct a metric catalog that consolidates gender-bias evaluation methods and organizes them into three measurement categories: gender prediction, embedding similarity, and downstream task. Third, we introduce a harm typology that maps context-dependent harm categories (e.g., representational, quality-of-service) to specific risk-tiered scenarios. Finally, we introduce THUMB cards (Text-to-image Harms-informed Use-case-aligned Metrics of gender Bias), which support systematic audit formulation by incorporating context, scenario and bias manifestation, harm hypotheses, and audit strategy.

[AI-62] Margin-calibrated Classifier Guidance for Property-driven Synthesis Planning

链接: https://arxiv.org/abs/2605.13101
作者: Najwa Laabid,Vikas Garg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Synthesis planning seeks an efficient sequence of chemical reactions that produce a target molecule. Typically, a pretrained single-step (autoregressive) retrosynthesis model is repeatedly invoked to generate such a sequence. Classifier guidance can, in principle, help steer the output of the single-step model toward reactions that satisfy specific constraints or accommodate a chemist’s preferences during inference without having to retrain the autoregressive generator. We expose the insufficiency of auxiliary classifiers trained with cross-entropy loss to override the unconditional token-level distributions learned from typical sparse single-disconnection reaction datasets. We overcome this issue with a novel method called Sequence Completion Ranking (SCR), which employs contrastive argumentation and a margin-based loss to calibrate the classifier so that it can meaningfully discriminate between continuations during decoding. We formally establish that margin-calibrated classifiers can expand the set of property-satisfying sequences reachable under guided beam search. Empirically, on USPTO-190, given chemist-specified guidance targets, SCR substantially improves multi-step solve rates from 16.8% (unguided generator) to 78.4% with reaction-type guidance and 95.3% with Tanimoto guidance, unlocking valid routes for 33 targets (17.4%) previously unsolvable with baselines. Our method also effectively closes the long-standing diversity gap between template-free and template-based methods.
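
The margin-based calibration idea can be sketched as a ranking hinge over continuation scores (shapes and the uniform margin are assumptions; the paper's SCR objective also governs how the contrasted continuations are constructed):

```python
import numpy as np

def scr_margin_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge-style ranking loss: require each property-satisfying
    continuation's classifier score to exceed every contrasted
    continuation's score by at least `margin`.
    pos_scores: (B,), neg_scores: (B, K)."""
    gap = pos_scores[:, None] - neg_scores          # (B, K) score gaps
    return float(np.maximum(margin - gap, 0.0).mean())

pos = np.array([2.0, 0.5])
neg = np.array([[0.0, 1.5], [0.4, 0.6]])
print(scr_margin_loss(pos, neg))  # 0.625
```

Unlike cross-entropy, the loss is zero once continuations are separated by the margin, so gradient pressure concentrates on the ambiguous pairs the decoder actually needs to discriminate.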

[AI-63] Watermarking Should Be Treated as a Monitoring Primitive

链接: https://arxiv.org/abs/2605.13095
作者: Toluwani Aremu,Nils Lukas,Jie Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Watermarking is widely proposed for provenance, attribution, and safety monitoring in generative models, yet is typically evaluated only under adversaries who attempt to evade detection or induce false positives at the level of individual samples. We argue that watermarking should be treated as a monitoring primitive, and that internal monitoring is unavoidable given per-entity attribution keys and messages, as well as detector access. We introduce an observer-based threat model in which observers can aggregate watermark signals across outputs to infer entity-level information, showing that even zero-bit watermarking enables attribution under multi-key settings. We further show that external monitoring can emerge over time from persistent, key-dependent statistical structure, although this depends on watermark design and may be mitigated by distribution-preserving or undetectable schemes. Our findings reveal a fundamental dual-use tension between attribution and monitoring, motivating evaluation of watermarking beyond per-sample robustness to account for aggregation and observer-based capabilities.
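
The observer capability at issue, aggregating weak per-output signals into entity-level attribution, can be sketched as follows (the z-scores and per-entity key indexing are hypothetical stand-ins for a real detector's statistics):

```python
import numpy as np

def attribute_entity(z_scores):
    """Sum per-output watermark-detection z-scores under each candidate
    key and attribute the output stream to the strongest one. Any single
    output may be ambiguous, but evidence accumulates across outputs --
    the aggregation capability the paper argues evaluations must model."""
    agg = np.asarray(z_scores, dtype=float).sum(axis=0)  # (n_keys,)
    return int(np.argmax(agg)), agg

# three outputs, three candidate entity keys; key 1 is consistently (weakly) hot
z = [[0.2, 0.6, -0.1],
     [0.0, 0.4,  0.3],
     [0.1, 0.5,  0.0]]
best, agg = attribute_entity(z)
print(best)  # 1
```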

[AI-64] Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

链接: https://arxiv.org/abs/2605.13079
作者: Tien-Phat Nguyen,Truong Nguyen,Minh-Phuc Truong,Tuc Nguyen,James Bailey,Trung Le
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Muon orthogonalizes the momentum buffer before each update, replacing its singular values with ones via Newton-Schulz iterations. This simple change lets Muon tolerate far larger learning rates and converge faster than other optimizers, but why? We show that the mechanism is spectral flattening, and develop two results around it. First, we prove that Muon’s maximal stable step size scales with the average singular value of the gradient rather than the largest, which bottlenecks standard gradient descent. Second, we recast Muon as a preconditioned gradient method and show, under a Kronecker-factored curvature model, that it improves the effective convergence factor, with the improvement controlled by the spectrum of the gradient covariance. Extensive experiments validate both results: Muon remains stable at learning rates that cause SGD to diverge within the first few iterations, and reaches accuracy milestones several epochs earlier even at identical step sizes. Taken together, our results offer a principled, geometric explanation for Muon’s empirical success.
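
The spectral-flattening step is easy to reproduce; the quintic coefficients below follow Muon's publicly released Newton-Schulz recipe (the iteration drives singular values only approximately toward 1, by design):

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Orthogonalize the momentum matrix G: after Frobenius normalization,
    each quintic iteration X <- aX + b(AX) + c(A(AX)) with A = XX^T pushes
    the singular values of X toward 1 (spectral flattening)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                      # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + b * (A @ X) + c * (A @ (A @ X))
    return X.T if transposed else X

G = np.random.default_rng(0).normal(size=(4, 6))
flat = np.linalg.svd(newton_schulz(G), compute_uv=False)
orig = np.linalg.svd(G, compute_uv=False)
print(flat.max() / flat.min(), "vs", orig.max() / orig.min())  # much flatter spectrum
```

The flattened spectrum is exactly what makes the update's size depend on the average rather than the largest singular value, matching the paper's step-size result.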

[AI-65] When Absolute State Fails: Evaluating Proprioceptive Encodings for Robust Manipulation ICRA2026

链接: https://arxiv.org/abs/2605.13067
作者: Maxime Alvarez,Ryo Watanabe,Paul Crook,Afshin Zeinaddini Meymand,Suvin Kurian,Pablo Ferreiro,Genki Sano
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to ICRA 2026 Workshop: From Data to Decisions

点击查看摘要

Abstract:As end-to-end robotic policies are progressively deployed in the real world to solve real tasks, they face a gap between the training and inference conditions. Scaling the amount and diversity of the training data has shown some success in improving zero-shot generalization, yet robots still fail when faced with new, unseen test conditions. For instance, while robots with fixed frames of reference are common, those with moving frames pose a greater challenge for deployment. To address this specific instance of the issue, we present a study of strategies for encoding the robot’s proprioceptive state to improve both in- and out-of-distribution performance at test time. Through a systematic study of joint representations, we find that a simple episode-wise relative frame provides the best trade-off between task performance and robustness, outperforming the baselines in extensive real-robot experiments conducted in a realistic test environment. The results suggest a practical path to leveraging data collected by robots with varying frames of reference and deployment to unseen test configurations.
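
The winning representation, an episode-wise relative frame, amounts to a one-line transform (a sketch; joint ordering and units are assumptions):

```python
import numpy as np

def episode_relative(joint_states):
    """Re-express proprioception relative to the episode's first frame:
    q_t -> q_t - q_0. Constant base-frame offsets cancel out, which is
    what makes the encoding robust to a moved frame of reference.
    joint_states: (T, D) array of absolute joint readings."""
    return joint_states - joint_states[0]

q = np.cumsum(np.ones((5, 3)), axis=0)          # a toy trajectory
shifted = q + 7.0                                # same motion, moved frame
print(np.allclose(episode_relative(q), episode_relative(shifted)))  # True
```

The printed invariance is the point: two robots executing the same motion from different mounting poses produce identical encodings, so data collected under varying frames becomes reusable.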

[AI-66] Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2605.13054
作者: Minung Kim,Jeongmo Kim,Gwanwoo Choi,Seungyul Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-domain offline reinforcement learning aims to adapt a policy from a source domain to a target domain using only pre-collected datasets, where environment dynamics may differ. A key challenge is to leverage source data while reducing distributional mismatch, particularly when the target dataset is extremely limited. To address this, we propose Target-aligned Coverage Expansion (TCE), a framework that decides how source data should be used, either by directly incorporating target-near transitions or by expanding state coverage through target-aligned generation, guided by theoretical analysis. TCE builds on a dual score-based generative model to synthesize target-consistent transitions over an expanded state region. Extensive experiments across diverse cross-domain environments show that TCE consistently outperforms state-of-the-art cross-domain offline RL baselines.

[AI-67] An Agentic LLM-Based Framework for Population-Scale Mental Health Screening

链接: https://arxiv.org/abs/2605.13046
作者: Giuliano Lorenzoni,Paulo Alencar,Donald Cowan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, conference paper presented at IEEE BigData 2025, Macau

点击查看摘要

Abstract:Mental health disorders affect millions worldwide, and healthcare systems are increasingly overwhelmed by the volume of clinical data generated from electronic records, telemedicine platforms, and population-level screening programs. At the same time, the emergence of novel AI-based approaches in healthcare calls for intelligent frameworks capable of processing domain-specific unstructured clinical information while adapting to patient-specific needs. This paper proposes an agentic framework for building robust LLM-based pipelines, where each stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation. Stages are incrementally locked once validated, ensuring that later adaptations cannot overwrite configurations without demonstrated improvement. The proposed framework evolves from feature-level exploration, through proxy-based tuning and freeze/rollback mechanisms, to full orchestration by an Orchestrator Agent that coordinates preprocessing, retrieval, selection, diversity, threshold optimization, and decoding. A proof-of-concept in transcript-based depression detection demonstrates that the framework converges to stable configurations, such as cosine similarity, dynamic Top-k, and threshold 0.75, while controlling evaluation costs and avoiding regressions. These results highlight the potential of agentic AI to enable population-level mental health screening over large clinical datasets, addressing critical challenges in trustworthiness, reproducibility, and adaptability required in healthcare environments.
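
The freeze/rollback mechanism can be sketched as proxy-gated stage tuning (the function names and scalar proxy metric are hypothetical simplifications of the agentic pipeline):

```python
def tune_stage(candidates, proxy_score, locked=None):
    """Accept a candidate stage configuration only if the proxy metric
    demonstrates improvement over the currently locked one; otherwise the
    stage rolls back. Returns the locked config and an audit trail."""
    best = locked
    best_score = proxy_score(locked) if locked is not None else float("-inf")
    trail = []
    for cfg in candidates:
        s = proxy_score(cfg)
        if s > best_score:
            best, best_score = cfg, s        # freeze the improvement
        trail.append((cfg, s, best))         # rollbacks leave `best` intact
    return best, trail

scores = {"dot": 0.61, "cosine": 0.75, "euclidean": 0.58}
best, trail = tune_stage(["dot", "cosine", "euclidean"], scores.get)
print(best)  # cosine
```

This is the behavior behind "later adaptations cannot overwrite configurations without demonstrated improvement": the weaker `euclidean` candidate is evaluated but never replaces the locked `cosine` stage.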

[AI-68] No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills

链接: https://arxiv.org/abs/2605.13044
作者: Ying Li,Hongbo Wen,Yanju Chen,Hanzhi Liu,Yuan Tian,Yu Feng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-powered agents can silently delete documents, leak credentials, or transfer funds on a routine user request, not because the agent was attacked, but because the skill it invoked broke its own declared safety rules. We call these specification violations: benign inputs cause a skill to breach the natural-language guardrails in its own specification, typically because the guardrail’s semantics are undefined for autonomous execution, or because the implementation silently ignores the documented constraint. These violations are invisible to static analyzers, traditional fuzzers, and prompt-injection defenses alike, yet they undermine the very contract a user trusts when installing a skill. We present Sefz, a goal-directed semantic fuzzing framework that automatically discovers specification violations in agent skills. Sefz translates each guardrail into a reachability goal over an annotated execution trace, reducing violation checking to a deterministic graph query. An LLM-based mutator generates benign inputs whose traces progressively approach the violation patterns, guided by a multi-armed bandit that uses goal-proximity as its reward signal. On 402 real-world skills from the largest public agent-skill marketplace, Sefz finds specification violations in 120 (29.9%), including 26 previously unknown exploitable guardrail violations in deployed skills. Six recurring specification pitfalls explain the bulk of the failures, suggesting concrete principles for safer skill design.
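
The bandit-guided search can be sketched with UCB1 over mutation strategies, rewarded by goal proximity (the strategy names and the deterministic reward are hypothetical; Sefz's real reward comes from graph queries over execution traces):

```python
import math

def ucb_fuzz(mutators, proximity, budget=200):
    """UCB1: pull the mutation strategy with the highest upper confidence
    bound on goal-proximity reward, so fuzzing effort concentrates on
    mutators whose traces approach the violation pattern."""
    counts = [0] * len(mutators)
    values = [0.0] * len(mutators)
    for t in range(1, budget + 1):
        ucb = [values[i] + math.sqrt(2 * math.log(t) / counts[i])
               if counts[i] else float("inf") for i in range(len(mutators))]
        i = ucb.index(max(ucb))
        r = proximity(mutators[i])                 # reward in [0, 1]
        counts[i] += 1
        values[i] += (r - values[i]) / counts[i]   # incremental mean
    return counts

counts = ucb_fuzz(["rename_field", "drop_guard", "reorder_args"],
                  lambda m: 0.9 if m == "drop_guard" else 0.1)
print(counts)  # most of the budget goes to "drop_guard"
```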

[AI-69] MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

链接: https://arxiv.org/abs/2605.13037
作者: Yuxin Liu,Ziang Ye,Yueqing Sun,Mingye Zhu,Jinwei Xiao,Zhuowen Han,Qi GU,Xunliang Cai,Lei Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current interactive LLM agents rely on goal-conditioned stepwise planning, where environmental understanding is acquired reactively during execution rather than established beforehand. This temporal inversion leads to Delayed Environmental Perception: agents must infer environmental constraints through trial-and-error, resulting in an Epistemic Bottleneck that traps them in inefficient failure cycles. Inspired by human affordance perception and cognitive map theory, we propose the Map-then-Act Paradigm (MAP), a plug-and-play framework that shifts environment understanding before execution. MAP consists of three stages: (1) Global Exploration, acquiring environment-general priors; (2) Task-Specific Mapping, constructing a structured cognitive map; and (3) Knowledge-Augmented Execution, solving tasks grounded on the map. Experiments show consistent gains across benchmarks and LLMs. On ARC-AGI-3, MAP enables frontier models to surpass near-zero baseline performance in 22 of 25 game environments. We further introduce MAP-2K, a dataset of map-then-act trajectories, and show that training on it outperforms expert execution traces, suggesting that understanding environments is more fundamental than imitation.

[AI-70] FeatCal: Feature Calibration for Post-Merging Models

链接: https://arxiv.org/abs/2605.13030
作者: Yanggan Gu,Shuo Cai,Zihao Wang,Wenjun Wang,Yuanyi Wang,Pengkai Wang,Sirui Huang,Su Lu,Jianmin Wu,Hongxia Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model merging combines task experts into one model and avoids joint training, retraining, or deploying many expert models, but the merged model often still underperforms task experts. We study this performance gap through feature drift, the difference between features produced by the merged model and by the expert on the same input. Our theory decomposes this drift into upstream propagation and local mismatch, tracks how it propagates and combines through later layers in forward order, and links final feature drift to output drift. This view motivates FeatCal, which uses a small calibration set to calibrate the merged model weights layer by layer in forward order, reducing feature drift while staying close to merged weights and preserving the benefits of model merging. FeatCal uses an efficient closed-form solution to update model weights, with no gradient descent, iterative optimization, or extra modules. On the main CLIP and GLUE benchmarks, FeatCal beats Surgery and ProbSurgery, the closest post-merging calibration baselines: 85.5% vs. 77.0%/78.8% on CLIP-ViT-B/32 Task Arithmetic (TA) and 85.2% vs. 83.7%/82.2% on FLAN-T5-base GLUE. On CLIP-ViT-B/32, 8 examples per task reach 82.9%, and 256 examples per task take 53 seconds, about 4x faster than both baselines, showing better sample efficiency and lower calibration cost.
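
The spirit of a gradient-free, closed-form layer calibration can be illustrated with ridge-style least squares toward the expert's features (a generic sketch of drift reduction under a small calibration set, not FeatCal's exact update):

```python
import numpy as np

def calibrate_layer(X, F_target, W_merged, lam=1e-2):
    """Solve argmin_W ||XW - F_target||^2 + lam ||W - W_merged||^2 in
    closed form: W = (X^T X + lam I)^{-1} (X^T F_target + lam W_merged).
    X: (N, d) calibration inputs to this layer; F_target: (N, k) expert
    features; the lam term keeps W close to the merged weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d),
                           X.T @ F_target + lam * W_merged)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
W_expert = rng.normal(size=(8, 4))
W_merged = W_expert + 0.3 * rng.normal(size=(8, 4))   # merging introduced drift
W_cal = calibrate_layer(X, X @ W_expert, W_merged)
drift_before = np.linalg.norm(X @ W_merged - X @ W_expert)
drift_after = np.linalg.norm(X @ W_cal - X @ W_expert)
print(drift_after < drift_before)  # True
```

Because the solution is a single linear solve per layer, there is no gradient descent or iterative optimization, matching the efficiency claim in the abstract.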

[AI-71] Rethinking Efficient Graph Coarsening via a Non-Selfishness Principle

链接: https://arxiv.org/abs/2605.13021
作者: Xu Bai,Bin Lu,Kun Zhang,Shengbo Chen,Xinbing Wang,Chenghu Zhou,Meng Jin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most existing methods rely on pair-wise similarity matching, where each node independently searches for its best partner based on global information. This selfish matching paradigm incurs substantial computational and memory overhead. To address this problem, we shift to a non-selfishness principle that prioritizes the collective interference of neighborhoods in coarsening, and propose an efficient method named NOPE, which achieves linear memory consumption and near-linear computational complexity in the number of nodes. Furthermore, we derive a faster variant NOPE*, which reduces the O(\delta \cdot d) interference evaluation to O(d) based on the local isotropy assumption, and consequently alleviates the computational bottleneck for high-degree nodes. Experimental results show that NOPE* achieves a 1.8-10\times speedup over NOPE and surpasses almost all baselines with 1-3 orders of magnitude acceleration. Meanwhile, learning on coarsened graphs yields performance comparable to the original graphs, and can even outperform LLM-based graph reasoning owing to compact graph information. The code is available at this https URL.

[AI-72] Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

链接: https://arxiv.org/abs/2605.12991
作者: Adarsh Kumarappan,Ananya Mujoo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight and the MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes N \in \{4, 5, 6\}. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54-73 percentage points across all framings tested, whereas the strongest prompt-level defense fails on attack variants outside its design surface. Mitigations should therefore target the mechanism through structured dissent at the pipeline level, rather than rely on prompt-level defenses.
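
The paper's central quantity, yield, is simple to compute from paired evaluations (a sketch; inputs are per-question correctness indicators in the clean condition and under simulated peer disagreement):

```python
def yield_rate(clean_correct, pressured_correct):
    """Fraction of questions the model answers correctly in the clean
    condition but flips to incorrect under simulated peer disagreement."""
    flips = sum(1 for c, p in zip(clean_correct, pressured_correct)
                if c and not p)
    n = sum(clean_correct)
    return flips / n if n else 0.0

clean     = [1, 1, 1, 1, 0]
pressured = [1, 0, 0, 1, 0]
print(yield_rate(clean, pressured))  # 0.5: two of four correct answers flipped
```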

[AI-73] Protocol-Driven Development: Governing Generated Software Through Invariants and Evidence

链接: https://arxiv.org/abs/2605.12981
作者: Jun He,Deying Yu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 2 tables

点击查看摘要

Abstract:Automated program synthesis has reduced the cost of producing candidate implementations, but it introduces a harder governance problem: determining which generated artifacts are admissible in a software system. Natural-language specifications remain semantically ambiguous, and example-based tests sample only part of the behavioral space. Used alone, neither provides a sufficient control boundary for automated software construction. We introduce Protocol-Driven Development (PDD), a development model in which the primary software artifact is a machine-enforceable protocol rather than implementation code. We define a protocol as the triplet P = (S, B, O), where S specifies structural invariants, B specifies behavioral invariants, and O specifies operational invariants. Their conjunction defines the admissible implementation space of a software component. Under PDD, implementations are treated as replaceable realizations discovered through constrained search. An implementation is admitted if and only if it satisfies the governing protocol and produces a verifiable Evidence Chain of compliance. Admission is therefore grounded not in trust in the generator, but in protocol satisfaction and recorded evidence. By combining ideas from formal methods, property-based testing, policy-as-code, and software provenance, PDD defines a governance layer for automated software engineering. Its organizing principle is simple: code is transient; protocol is sovereign.
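
A toy rendering of the P = (S, B, O) admission rule, with invariants as predicates and a recorded Evidence Chain (all predicates and the flat dict encoding are illustrative; real protocols would span property-based tests and policy checks):

```python
def admit(artifact, protocol, evidence):
    """An implementation is admitted iff every structural (S), behavioral
    (B), and operational (O) invariant holds; each check is appended to
    the evidence chain so admission is grounded in recorded compliance,
    not trust in the generator."""
    S, B, O = protocol
    for name, invariant in [*S.items(), *B.items(), *O.items()]:
        ok = bool(invariant(artifact))
        evidence.append((name, ok))
        if not ok:
            return False
    return True

protocol = (
    {"is_callable": callable},                              # structural
    {"doubles_input": lambda f: f(2) == 4 and f(5) == 10},  # behavioral
    {"deterministic": lambda f: f(3) == f(3)},              # operational (toy)
)
chain = []
print(admit(lambda x: 2 * x, protocol, chain))  # True; chain records 3 checks
```

Any candidate implementation, generated or hand-written, is interchangeable so long as it passes the same conjunction: "code is transient; protocol is sovereign."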

[AI-74] CoRe-Gen: Robust Spectrum-to-Structure Generation under Imperfect Fingerprint Conditions

链接: https://arxiv.org/abs/2605.12980
作者: Tianbo Liu,Chixiang Lu,Jing Hao,Hengyu Zhang,Lifei Wang,Haibo Jiang,Xiaojuan Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Molecular structure elucidation from tandem mass spectra (MS/MS) remains challenging, particularly for de novo generation beyond database coverage. A common approach decomposes the task into spectrum-to-fingerprint prediction followed by fingerprint-to-structure decoding, enabling the use of large-scale molecular corpora. However, at deployment, the decoder relies on predicted rather than oracle fingerprints, introducing structured errors that propagate into generation. This results in a fundamental condition mismatch, where models trained on clean inputs must operate under noisy, biased predictions, especially for long-tail substructures. We present CoRe-Gen, which explicitly addresses this gap. CoRe-Gen improves the intermediate condition via synthetic-spectrum pretraining of the encoder, matches deployment-time noise through frequency-aware fingerprint corruption during decoder training, and mitigates residual errors using structure-aware autoregressive decoding with compositional SELFIES representations, auxiliary structural supervision, and lightweight chemical constraints. Experiments on standard benchmarks show that CoRe-Gen establishes a new state of the art on NPLIB1, achieving 19.54% Top-1 and 29.92% Top-10 exact-match accuracy, while remaining competitive on the more challenging MassSpecGym benchmark. Importantly, CoRe-Gen preserves the efficiency advantages of autoregressive decoding, providing a practical and scalable solution for robust spectrum-to-structure generation under realistic conditions.
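
The noise-matching step, frequency-aware fingerprint corruption, might look roughly like this (the inverse-frequency weighting is an assumption about the scheme's general shape, not the paper's exact formula):

```python
import numpy as np

def corrupt_fingerprint(fp, bit_freq, rate=0.1, rng=None):
    """Flip fingerprint bits with probability inversely weighted by each
    bit's corpus frequency: rare (long-tail) substructure bits, where
    spectrum-to-fingerprint predictors err most, get corrupted more,
    matching the noise the decoder will face at deployment."""
    rng = rng or np.random.default_rng(0)
    weights = (1.0 - bit_freq) / (1.0 - bit_freq).mean()
    flip_p = np.clip(rate * weights, 0.0, 1.0)
    flips = rng.random(fp.shape) < flip_p
    return np.where(flips, 1 - fp, fp)

freq = np.array([0.9] * 50 + [0.01] * 50)   # 50 common bits, 50 rare bits
fp = np.zeros(100, dtype=int)
runs = np.stack([corrupt_fingerprint(fp, freq, rng=np.random.default_rng(s))
                 for s in range(200)])
print(runs[:, 50:].mean() > runs[:, :50].mean())  # rare bits flip more often
```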

[AI-75] Useful Memories Become Faulty When Continuously Updated by LLMs

链接: https://arxiv.org/abs/2605.12978
作者: Dylan Zhang,Yanshan Lin,Zhengkun Wu,Yihang Sun,Bingxuan Li,Dianqi Li,Hao Peng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning from past experience benefits from two complementary forms of memory: episodic traces – raw trajectories of what happened – and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today’s LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.

[AI-76] Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2605.12975
作者: Jiashuo Sun,Jimeng Shi,Yixuan Xie,Saizhuo Wang,Jash Rajesh Parekh,Pengcheng Jiang,Zhiyi Shi,Jiajun Fan,Qinglong Zheng,Peiran Li,Shaowen Wang,Ge Liu,Jiawei Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 20 figures, 4 tables

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them, making self-reflection an unreliable, ungrounded signal. We observe that multi-hop question answering is a typical form of step-by-step computation, and that this structured process aligns closely with how code-specialized language models are trained to operate. Motivated by this, we introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. Instead of free-form reasoning trajectories, PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process. This formulation further enables compiler-grounded self-repair and execution-driven adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle) show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with especially large gains on compositional multi-hop datasets. Our code, data and models are publicly available at this https URL.
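
The reformulation, a multi-hop question compiled into a short program over retrieval and QA tools with intermediate states as inspectable variables, looks roughly like this (the toy retrieve/qa implementations and the two-document knowledge base are hypothetical; the paper's framework would synthesize code of this shape over real tools):

```python
def retrieve(query, kb):
    """Toy retriever: documents containing every query term."""
    terms = query.lower().split()
    return [d for d in kb if all(t in d.lower() for t in terms)]

def qa(question, docs):
    """Toy reader over '<subject> is <answer>.' facts."""
    for d in docs:
        if question.lower() in d.lower():
            return d.split(" is ")[-1].rstrip(".")
    return None

kb = ["The director of Inception is Christopher Nolan.",
      "The birthplace of Christopher Nolan is London."]

# hop 1: resolve the bridge entity (an inspectable intermediate state)
director = qa("the director of Inception", retrieve("director Inception", kb))
# hop 2: use it to answer the original question
answer = qa(f"the birthplace of {director}", retrieve(f"birthplace {director}", kb))
print(answer)  # London
```

Because each hop's result is a named variable and each tool call runs deterministically, a failed hop surfaces as a concrete `None` or exception at a specific line, which is what makes execution-grounded self-repair possible.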

[AI-77] Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

链接: https://arxiv.org/abs/2605.12969
作者: Feng Zhang,Xinhong Ma,Ziqiang Dong,Xi Leng,Jianfei Zhao,Xin Sun,Yang Yang,Guanjun Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:RLVR has become a widely adopted paradigm for improving LLMs’ reasoning capabilities, and GRPO is one of its most representative algorithms. In this paper, we first show that GRPO admits an equivalent discriminative reformulation as a weighted positive-negative score difference. Under this view, GRPO increases sequence-level scores of verified positive rollouts and decreases those of negative rollouts, where the scores are averages of clipped token-level importance sampling ratios. This reformulation reveals two structural limitations of GRPO: likelihood-misaligned scoring, where clipped ratio-based surrogate scores are optimized instead of generation likelihoods, and score-insensitive credit assignment, where rollout-level credit is assigned without accounting for relative score gaps between positive and negative rollouts in the same group. To address these limitations, we propose ConSPO, a framework for Contrastive Sequence-level Policy Optimization in RLVR. ConSPO replaces GRPO’s clipped ratio-based scores with length-normalized sequence log-probabilities, aligning the optimized rollout scores with the likelihoods used in autoregressive generation. It then optimizes a group-wise InfoNCE-style objective that contrasts each positive rollout against negative distractors from the same group, enabling credit assignment to depend on their relative scores. This contrastive formulation amplifies updates for poorly separated positives while concentrating suppressive updates on high-scoring negatives. Moreover, ConSPO introduces a curriculum-scheduled margin, guiding optimization from coarse positive-negative ordering in early training toward stronger separation in later stages. Extensive evaluations across diverse backbone models, parameter scales, and training datasets show that ConSPO consistently outperforms several strong RLVR baselines on challenging mathematical reasoning benchmarks.
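
The group-wise contrastive objective can be sketched over length-normalized sequence log-probabilities (the temperature, margin placement, and mean aggregation are assumptions about the general shape, not the paper's exact loss):

```python
import numpy as np

def conspo_loss(pos_scores, neg_scores, tau=0.1, margin=0.0):
    """InfoNCE within one rollout group: each verified-positive rollout
    (score = length-normalized sequence log-prob) is contrasted against
    the group's negatives; subtracting a margin from the positive score
    demands stronger separation. Credit depends on relative score gaps:
    poorly separated positives yield larger losses."""
    losses = []
    for s_p in pos_scores:
        logits = np.concatenate(([s_p - margin], neg_scores)) / tau
        logits -= logits.max()                      # numerical stability
        p_pos = np.exp(logits[0]) / np.exp(logits).sum()
        losses.append(-np.log(p_pos))
    return float(np.mean(losses))

well_sep = conspo_loss([1.0], [-1.0, -1.0])   # positive far above negatives
poor_sep = conspo_loss([0.1], [0.0, 0.0])     # barely above negatives
print(well_sep < poor_sep)  # True
```

Raising the margin over training, as in the curriculum schedule, increases the loss at a fixed separation, steering optimization from coarse ordering toward stronger separation.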

[AI-78] Position: Agentic AI System Is a Foreseeable Pathway to AGI ICML’26

链接: https://arxiv.org/abs/2605.12966
作者: Junwei Liao,Shuai Li,Muning Wen,Jun Wang,Weinan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICML’26 Position Track

点击查看摘要

Abstract:Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real-world tasks. Through rigorous theoretical derivations, we contrast the optimization constraints of monolithic learners against the efficiency of Agentic systems, progressing from simple routing mechanisms to general Directed Acyclic Graph (DAG) topologies. We demonstrate that Agentic AI achieves exponentially superior generalization and sample efficiency. Finally, we discuss the connection to Mixture-of-Experts, reinterpret the instability of current multi-agent frameworks, and call for greater research focus on Agentic AI.

[AI-79] Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements

链接: https://arxiv.org/abs/2605.12963
作者: James M. Mazzu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI systems become increasingly capable, safety strategies must be evaluated not only by how much they reduce present risk, but by whether they could sustain safety once external control can no longer reliably constrain system behavior. This paper addresses that problem by using control theory to clarify, at a structural level, whether externally enforced safety-sustaining strategies can succeed and, if not, what any alternative strategy would have to satisfy in order to be viable. It establishes two main results. First, under explicit premises including a reachability condition, it proves a class-wide external impossibility result: once the system’s effects exceed what bounded external control can counteract, no strategy that depends in any degree on continued external enforcement can sustain AI safety. This failure is structural across the entire externally enforced class rather than contingent on any particular strategy. Second, it establishes a conditional class-level necessity result: if at least one candidate safety-sustaining strategy remains after that elimination, then all such remaining strategies must be intrinsic. It then states four structural requirements for viability: safety may not depend on continued external enforcement; the system’s terminal objective must be safety-compatible when first formed; that objective must remain stable under self-modification; and safety must continue to be preserved as capability grows. The paper does not propose a complete strategy for sustaining AI safety. Its contribution is to give formal structure to a widely held concern about the limits of external control. It does so by deriving explicit conditional results that identify which safety-sustaining strategies are ruled out and what any remaining strategies must satisfy.

[AI-80] The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models

链接: https://arxiv.org/abs/2605.12940
作者: Zhiyu Zhao,Xuejie Liu,Muhan Zhang,Anji Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Probabilistic Circuits (PCs) are deep generative models that support exact and efficient probabilistic inference. Yet in autoregressive language modeling, PCs still lag behind Transformer-based large language models (LLMs), suggesting an important expressivity gap. In this work, we compare PCs and LLMs under a unified autoregressive formulation and identify two bottlenecks. First, an output bottleneck: PCs parameterize predictions as convex combinations in probability space, which struggles to represent the sharp distributions typical of language; adopting a logit-space parameterization substantially narrows this gap. Second, a context-encoding bottleneck: we prove that structured-decomposable PCs can match Transformer separation rank on vtree-aligned partitions, but show, both theoretically and empirically, that this capacity is limited to partitions aligned with the fixed routing structure, leading to severe degradation when the data exhibits heterogeneous dependency topologies. We further prove that decomposable PCs are strictly more expressive than structured-decomposable ones, though effectively optimizing them remains an open challenge.

[AI-81] AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

链接: https://arxiv.org/abs/2605.12925
作者: Priyam Sahoo,Gaurav Mittal,Xiaomin Li,Shengjie Ma,Benjamin Steenhoek,Pingping Lin,Yu Hu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have enough passing trajectories to construct task-level process references, yielding a 1,815-trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories, and release AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references. AgentLens builds PTA references by merging multiple passing solutions for the same task, and uses a context-sensitive intent labeler to assign actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history rather than tool identity alone. On AgentLens-Bench, the quality score separates passing trajectories into Lucky, Solid, and Ideal tiers and further decomposes Lucky Passes into five recurring mechanisms. Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We release the anonymized project repository, including the AgentLens-Bench dataset and AgentLens SDK, at this https URL. 

[AI-82] Data Difficulty and the Generalization–Extrapolation Tradeoff in LLM Fine-Tuning ICML2026

链接: https://arxiv.org/abs/2605.12906
作者: Siyuan Liu(IIIS, Tsinghua University),Tinghong Chen(College of AI, Tsinghua University and Shanghai Qi Zhi Institute),Xinghan Li(IIIS, Tsinghua University),Yifei Wang(Amazon AGI SF Lab),Jingzhao Zhang(IIIS, Tsinghua University and Shanghai Qi Zhi Institute)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.

[AI-83] RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems ALT

链接: https://arxiv.org/abs/2605.12895
作者: Rohith Reddy Bellibatlu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP)
备注: Submission to Artificial Intelligence in Medicine (Elsevier). Open-source Python implementation at this https URL (MIT license). Synthetic evaluation cohort at this https URL (DOI: https://doi.org/10.57967/hf/8734 )

点击查看摘要

Abstract:Aggregate accuracy metrics dominate the evaluation of clinical AI decision-support systems but do not detect deployment-phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. We propose the RISED Framework: a five-dimension pre-deployment evaluation covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through formal sub-criteria, pre-specified pass/fail thresholds, and bias-corrected accelerated (BCa) bootstrap 95% confidence intervals combined under a Holm-Bonferroni family-wise error correction. A central demonstration is that a classifier satisfying conventional high-discrimination benchmarks can simultaneously fail input-encoding stability and threshold-shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks that aggregate evaluation alone cannot detect. We validate this differential pass/fail pattern on a synthetic cohort and three publicly available real-world cohorts spanning 35 years of clinical data vintage, from a 1980s cardiology dataset to a 2024 nationally representative health survey, where failing dimensions differ across cohorts, providing preliminary evidence of construct validity. The Equity dimension is reframed as a proxy-dependence diagnostic rather than a stand-alone gate: any need-based fairness verdict computed against a utilization-derived proxy carries a construct-validity problem the framework surfaces explicitly, triggering a procurement requirement for an outcome-independent need measure before the gate is binding. RISED is released as an open-source Python package that supplies the quantitative verdicts existing clinical AI reporting standards require, providing a principled gateway between in-silico model validation and silent-trial clinical evaluation.
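
To illustrate the gating logic, here is a plain percentile bootstrap confidence interval with a threshold gate. This is a simplified stand-in: the framework uses bias-corrected accelerated (BCa) intervals with a Holm-Bonferroni correction, and all names below are invented for this sketch.

```python
import random

def percentile_bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Plain percentile bootstrap CI for statistic `stat` over `values`.
    (BCa intervals additionally correct for bias and skew; not shown here.)"""
    rng = random.Random(seed)
    n = len(values)
    boots = sorted(stat([values[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def gate_passes(ci, threshold):
    """A pre-specified pass/fail gate: pass only if the whole CI clears it."""
    return ci[0] >= threshold
```

Gating on the CI's lower bound rather than the point estimate is what lets a model with high headline accuracy still fail a dimension whose evidence is statistically inconclusive.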

[AI-84] Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

链接: https://arxiv.org/abs/2605.12869
作者: Zvi Topol
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in a wide range of applications, yet remain vulnerable to adversarial jailbreak attacks that circumvent their safety guardrails. Existing evaluation frameworks typically report binary success/failure metrics, failing to capture the temporal dynamics of how attacks succeed under persistent adversarial pressure. This preliminary work proposes a novel evaluation framework that applies survival analysis techniques to characterize LLM jailbreak vulnerability. Our approach models the time-to-jailbreak as a survival outcome, enabling estimation of hazard functions, survival curves, and risk factors associated with successful attacks. We evaluate three LLMs against a subset of prompts from the HarmBench dataset spanning three attack categories. Our analysis reveals that models exhibit distinct vulnerability profiles: while one model demonstrates rapid degradation under iterative attacks, the other two models show consistent moderate vulnerability. Our framework provides actionable insights for model and LLM application developers and establishes survival analysis as a rigorous methodology for LLM safety evaluation.
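
The survival framing can be made concrete with a Kaplan-Meier estimator over attack turns. This is a minimal sketch assuming each trial records the turn at which the jailbreak succeeded (event) or the attack budget ran out (censored); the paper's exact estimators and covariate models may differ.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of S(t) = P(model still safe after t attack turns).
    times[i]: turn at which trial i ended; events[i]: 1 if a jailbreak occurred,
    0 if the trial was censored (budget exhausted without success)."""
    data = sorted(zip(times, events))
    s, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(e for tt, e in data if tt == t)    # jailbreaks at turn t
        n_t = sum(1 for tt, _ in data if tt >= t)  # trials still at risk
        if d > 0:
            s *= (1.0 - d / n_t)
            curve.append((t, s))
        while i < len(data) and data[i][0] == t:   # skip past this turn
            i += 1
    return curve
```

A rapidly dropping curve corresponds to the "rapid degradation under iterative attacks" profile; censoring lets budget-limited trials contribute without biasing the estimate.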

[AI-85] Language-Based Agent Control

链接: https://arxiv.org/abs/2605.12863
作者: Timothy Zhou,Loris D’Antoni,Nadia Polikarpova
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:This paper introduces language-based agent control (LBAC), a new programming model for agentic applications that brings techniques from programming languages and language-based security to the problem of agent control. In conventional programming, combinations of static typing and runtime enforcement have long been used to guarantee that well-typed programs satisfy user-specified policies, including policies for access control, information flow, data provenance, and more. The key idea behind LBAC is to extend these guarantees to agentic applications by requiring agents to generate programs that are themselves well typed in the context of the surrounding scaffolding code. Unsafe programs are rejected by the type-checker before execution, allowing policies to apply uniformly across the entire application, including both agent-generated behavior and developer-written scaffolding. At the same time, LBAC preserves substantial expressiveness: agents may perform arbitrary side-effect-free computation and recursively invoke subagents, which retain full tool access subject to the same – or potentially more restrictive – policies. We demonstrate LBAC with three case studies: I/O sandboxing via filesystem capabilities, data provenance, and information-flow control.

[AI-86] Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

链接: https://arxiv.org/abs/2605.12856
作者: Ali Al-Lawati,Nafis Tripto,Abolfazl Ansari,Jason Lucas,Suhang Wang,Dongwon Lee
机构: 未知
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content-based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce Bot-Mod (Bot-Moderation), a moderation framework that grounds detection in agent intent rather than traditional content-level signals. Bot-Mod identifies the underlying intent by engaging with the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses diverse benign and malicious behaviors based on actual community structures, posts, and comments. Results demonstrate that Bot-Mod reliably identifies agent intent across a range of adversarial configurations, while maintaining a low false positive rate on benign behaviors. This work advances the foundation for scalable, intent-aware moderation of agents in open multi-agent environments.

[AI-87] Bayesian Model Merging

链接: https://arxiv.org/abs/2605.12843
作者: Kaiyang Li,Shaobo Han,Qing Su,Shihao Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model merging aims to combine multiple task-specific expert models into a single model without joint retraining, offering a practical alternative to multi-task learning when data access or computational budget is limited. Existing methods, however, face two key limitations: (1) they overlook the valuable inductive bias of strong anchor models and estimate the merged weights from scratch, and (2) they rely on a shared hyperparameter setting across different modules of the network, lacking a global optimization strategy. This paper introduces Bayesian Model Merging (BMM), a plug-and-play bi-level optimization framework, where the inner level formulates the model merging as an activation-based Bayesian regression under a strong prior induced by an anchor model, yielding an efficient closed-form solution; and the outer level leverages a Bayesian optimization procedure to search module-specific hyperparameters globally based on a small validation set. Furthermore, we reveal a key alignment between activation statistics and task vectors, enabling us to derive a data-free variant of BMM that estimates the Gram matrix for regression without any auxiliary data. Across extensive benchmarks, including up to 20-task merging in vision and 5-task merging in language, BMM consistently outperforms all plug-and-play anchor baselines (e.g., TA, WUDI-Merging, and TSV). In particular, on the ViT-L/14 benchmark for 8-task merging, a single merged model reaches 95.1, closely matching the average performance of eight task-specific experts (95.8).
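
The inner-level closed-form solve can be sketched as ridge-style regression in activation space toward an anchor prior. The objective below ("match each expert's outputs on its own activations while staying close to the anchor") is an assumed reading of the abstract, and all names are illustrative.

```python
import numpy as np

def merge_layer(anchor_w, expert_ws, expert_acts, lam=1.0):
    """Closed-form activation-based merge for one linear layer (a sketch).
    Minimizes sum_t ||W X_t - W_t X_t||_F^2 + lam * ||W - anchor_w||_F^2,
    where W_t is expert t's weight (out, d) and X_t its activations (d, n)."""
    d = anchor_w.shape[1]
    gram = lam * np.eye(d)            # regularized Gram matrix (prior strength)
    cross = lam * anchor_w
    for w_t, x_t in zip(expert_ws, expert_acts):
        g_t = x_t @ x_t.T             # (d, d) activation Gram for expert t
        gram += g_t
        cross += w_t @ g_t
    return cross @ np.linalg.inv(gram)
```

As `lam` grows the solution collapses to the anchor; as it shrinks (with informative activations) it recovers the experts, which is the trade-off the outer-level Bayesian search would tune per module.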

[AI-88] Multimodal Hidden Markov Models for Persistent Emotional State Tracking

链接: https://arxiv.org/abs/2605.12838
作者: Anamika Ragu,Aneesh Jonelagadda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:Tracking an interpretable emotional arc of a conversation via the sentiment of individual utterances processed as a whole is central to both understanding and guiding communication in applied, especially clinical, conversational contexts. Existing approaches to emotion recognition operate at the utterance level, obscuring the persistent phases that characterize real conversational dynamics. We propose a lightweight framework that models conversational emotion as a sequence of latent emotional regimes using sticky factorial HDP-HMMs over multimodal valence-arousal representations derived from simultaneous video, audio and textual input. We evaluate the quality of regime prediction using LLM-as-a-Judge, geometric, and temporal consistency metrics, demonstrating that the sticky HDP-HMM produces more interpretable regime sequences than the baseline Gaussian HMM at a fraction of the computational cost of LLM-based dialogue state tracking methods. In addition, Question-Answer experiments in a clinical dataset suggest that meaningful emotional phases can reliably be recovered from multimodal valence-arousal trajectories and used to improve the quality of LLM responses in unstable affective regimes via context augmentation. This framework thus opens a path toward interpretable, lightweight, and actionable analysis of conversational emotion dynamics at scale.

[AI-89] PROMETHEUS: Automating Deep Causal Research Integrating Text Data and Models

链接: https://arxiv.org/abs/2605.12835
作者: Sridhar Mahadevan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages

点击查看摘要

Abstract:Large language models can extract local causal claims from text, but those claims become more useful when organized as persistent, navigable world models rather than as flat summaries. We introduce PROMETHEUS, a framework that turns retrieved literature, filings, reviews, reports, agent traces, source data, code, simulations, and scientific models into causal atlases: sheaf-like families of local causal predictive-state models over an explicit cover of a research substrate. Each local region contains causal episodes, structured claim tables, predictive tests, support statistics, and provenance; restriction maps compare overlapping regions; gluing diagnostics expose agreement, drift, contradiction, and underdetermination. The resulting Topos World Model is not a single universal graph. It is a research instrument for navigating what a corpus says, where it says it, how strongly it is supported, and where local claims fail to assemble into a coherent global view. Three literature-atlas case studies – ocean-temperature impacts on marine populations, GLP-1 weight-loss evidence, and resveratrol/red-wine health-benefit claims – illustrate deep causal research from text with explicit locality, evidence, persistent state, and gluing tension. Four grounded-counterfactual case studies – a Nature Climate Change microplastics forcing paper, an Indus Valley hydrology paper with VIC-derived figure data and model code, the canonical Sachs protein-signaling study with single-cell perturbation data, and a Nature singing-mouse study with MAPseq projection matrices – show a stronger mode: when a paper ships source data, simulation outputs, or code, PROMETHEUS can evaluate a counterfactual against that scientific substrate and then rebuild the sheaf world model around the

[AI-90] GraphIP-Bench: How Hard Is It to Steal a Graph Neural Network and Can We Stop It?

链接: https://arxiv.org/abs/2605.12827
作者: Kaixiang Zhao,Bolin Shen,Yuyang Dai,Shayok Chakraborty,Yushun Dong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Graph neural networks (GNNs) deployed as cloud services can be stolen through model-extraction attacks, which train a surrogate from query responses to reproduce the target’s behaviour, and a growing line of ownership defenses tries to prevent or trace such theft. The title of this paper asks two questions: how hard is it to steal a GNN, and can we stop it? Prior work cannot answer either, because experiments use inconsistent datasets, threat models, and metrics. We introduce GraphIP-Bench, a unified benchmark which evaluates both sides under a single black-box protocol. It integrates twelve extraction attacks, twelve defenses spanning watermarking, output-perturbation, and query-pattern-detection families, ten public graphs covering homophilic, heterophilic, and large-scale regimes, three GNN backbones, and three graph-learning tasks, and it reports fidelity, task utility, ownership verification, and computational cost on shared splits, queries, and budgets. We further add a joint attack-and-defense track which runs every attack on every defended target and measures watermark verification on the resulting surrogate, which exposes the protection that a defense retains after extraction. The empirical picture is short: stealing a GNN is easy at medium query budgets and most defenses do not change this; several watermarks verify reliably on the protected model but lose most of their verification signal on the extracted surrogate, which exposes a gap that single-model evaluations miss; and heterophilic graphs are systematically harder to steal, while a cross-architecture mismatch between target and surrogate reduces but does not prevent extraction. Code: this https URL (LabRAI/GraphIP-Bench).

[AI-91] Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

链接: https://arxiv.org/abs/2605.12825
作者: Chien Van Nguyen,Chaitra Hegde,Van Cuong Pham,Ryan A. Rossi,Franck Dernoncourt,Thien Huu Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.
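
The "exact consensus mechanism" is reminiscent of speculative-decoding verification. The sketch below illustrates that pattern under assumed semantics (greedy decoding, a block of parallel draft tokens checked by the autoregressive view); it is not Orthrus's actual algorithm.

```python
def consensus_decode(ar_next_token, draft_block, prefix):
    """Verify a parallel-drafted block against the autoregressive view.
    `ar_next_token(seq)` returns the AR model's greedy next token for `seq`;
    the longest prefix both views agree on is accepted losslessly."""
    accepted = list(prefix)
    for tok in draft_block:
        if ar_next_token(accepted) != tok:
            break                      # first disagreement: fall back to AR
        accepted.append(tok)           # agreement: token is exactly what AR would emit
    return accepted[len(prefix):]
```

Because every accepted token equals the AR model's own choice, the output distribution is unchanged; the speedup comes from verifying many drafted tokens per forward pass instead of generating one.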

[AI-92] Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

链接: https://arxiv.org/abs/2605.12809
作者: Shixing Yu,Promit Ghosal,Kyra Gan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A critical step for reliable large language model (LLM) use in healthcare is to attribute predictions to their training data, akin to a medical case study. This requires token-level precision: pinpointing not just which training examples influence a decision, but which tokens within them are responsible. While influence functions offer a principled framework for this, prior work is restricted to autoregressive settings and relies on an implicit assumption of token independence, rendering their identified influences unreliable. We introduce a flexible framework that infers token-level influence through a latent mediation approach for general prediction tasks. Our method attaches sparse autoencoders to any layer of a pretrained LLM to learn a basis of approximately independent latent features. Unlike prior methods where influence decomposes additively across tokens, influence computed over latent features is inherently non-decomposable. To address this, we introduce a novel method using Jacobian-vector products. Token-level influence is obtained by propagating latent attributions back to the input space via token activation patterns. We scale our approach using efficient inverse-Hessian approximations. Experiments on medical benchmarks show our approach identifies sparse, interpretable sets of tokens that jointly influence predictions. Our framework enhances trust and enables model auditing, generalizing to high-stakes domains requiring transparent and accountable decisions.

[AI-93] Discrete MeanFlow: One-Step Generation via Conditional Transition Kernels

链接: https://arxiv.org/abs/2605.12805
作者: Fairoz Nower Khan,Nabuat Zaman Nahim,Md Sajid Ahmed,Ruiquan Huang,Peizhong Ju
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:MeanFlow enables one-step generation in continuous spaces by learning an average velocity over a time interval rather than the instantaneous velocity field of flow matching. However, discrete state spaces do not have smooth trajectories or spatial derivatives, so the continuous formulation does not directly apply. We introduce Discrete MeanFlow, which replaces the motion of a point with the transport of probability mass over finite states. Our key object is the conditional transition kernel of a continuous-time Markov chain (CTMC), from which we define a mean discrete rate that measures the average change in transition probability over a time interval. We prove a Discrete MeanFlow identity that relates this finite-interval rate to the instantaneous CTMC generator at the endpoint, with the Kolmogorov forward equation replacing the spatial chain rule of continuous MeanFlow. Based on this identity, we parameterize the transition kernel directly using a boundary-by-construction design that guarantees valid probability outputs and exact boundary conditions without auxiliary losses. Since the learned kernel is itself a probability distribution, generation reduces to a single forward pass followed by one categorical draw meaning no iterative denoising, ODE integration, or multi-step refinement is required. We validate the framework on exact finite-state Markov chains, where the learned kernel recovers the analytical ground truth to high precision, and on factorized synthetic sequence generation tasks with varying alphabet sizes and sequence lengths.

[AI-94] Adaptive Smooth Tchebycheff Attention for Multi-Objective Policy Optimization

链接: https://arxiv.org/abs/2605.12771
作者: Alejandro Murillo-Gonzalez,Mahmoud Ali,Lantao Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注: To appear in the Proceedings of Robotics: Science and Systems (RSS) 2026

点击查看摘要

Abstract:Multi-objective reinforcement learning in robotic domains requires balancing complex, non-convex trade-offs between conflicting objectives. While linear scalarization methods provide stability, they are theoretically incapable of recovering solutions within non-convex regions of the Pareto front. Conversely, static non-linear scalarizations (e.g., Tchebycheff) can theoretically access these regions but often suffer from severe gradient variance and optimization instability in deep RL. In this work, we propose an Adaptive Smooth Tchebycheff framework that resolves this tension by dynamically modulating the curvature of the optimization landscape. We introduce a novel conflict-driven controller that regulates the optimization smoothness based on real-time gradient interference. This allows the agent to anneal toward precise, non-convex scalarization when objectives align, while elastically reverting to stable, smooth approximations when destructive gradient conflicts emerge. We validate our approach on a challenging robotic stealth visual search task – a proxy for monitoring of protected/fragile ecosystems – where an agent must balance search, exposure/interference minimization and exploration speed. Extensive ablations confirm that our conflict-aware adaptation enables the robust discovery of Pareto-optimal policies in non-convex regions inaccessible to linear baselines and unstable for static non-linear methods. Website: this https URL
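
The core scalarization can be illustrated with a log-sum-exp smoothing of the Tchebycheff objective plus a toy conflict-driven update of the smoothing parameter. The controller here is an invented stand-in; the abstract only tells us that smoothness is modulated by real-time gradient interference.

```python
import math

def smooth_tchebycheff(losses, weights, ideal, mu):
    """Smooth Tchebycheff scalarization via a temperature-mu log-sum-exp.
    As mu -> 0 this approaches max_i weights[i] * (losses[i] - ideal[i])
    (the exact Tchebycheff value); larger mu gives a smoother, more
    average-like landscape that is easier to optimize."""
    scaled = [w * (f - z) / mu for w, f, z in zip(weights, losses, ideal)]
    m = max(scaled)
    return mu * (m + math.log(sum(math.exp(x - m) for x in scaled)))

def adapt_mu(mu, conflict, lo=1e-3, hi=1.0, rate=0.1):
    """Toy conflict-driven controller (invented for this sketch): move mu
    toward `hi` when gradient conflict is high (favor stability) and toward
    `lo` when objectives align (favor exact, non-convex scalarization).
    `conflict` is assumed in [0, 1], e.g. 1 minus gradient cosine similarity."""
    target = lo + conflict * (hi - lo)
    return mu + rate * (target - mu)
```

The log-sum-exp upper-bounds the max, so annealing mu downward tightens the surrogate toward the exact Tchebycheff value that can reach non-convex Pareto regions.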

[AI-95] Multi-Quantile Regression for Extreme Precipitation Downscaling

链接: https://arxiv.org/abs/2605.12762
作者: Hamed Najafi,Gareth Lagerwall,Jayantha Obeysekera,Jason Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep super-resolution networks for precipitation downscaling achieve strong bulk skill yet systematically under-predict the heavy-tail events that drive flood risk. We demonstrate that the primary obstacle is the loss function, not the data: under intensity-weighted MAE, real and synthetic labels at the same input are simply averaged, meaning data augmentation shifts the predicted mean rather than the conditional distribution. We resolve this with Q-SRDRN, a multi-quantile super-resolution network trained with pinball loss at tau in 0.50, 0.95, 0.99, 0.999. Two CNN-specific design choices make this practical: IncrementBound enforces monotonicity while preserving each quantile channel’s gradient identity, and separate per-quantile output heads provide independent filter banks for bulk and tail detection. Under this design, data augmentation via cVAE becomes complementary: the median head absorbs synthetic patterns without contaminating upper quantiles. Empirically, on Florida (convective/tropical-cyclone dominated), the un-augmented Q-SRDRN P999 head detects 1,598 of 2,111 events at 200 mm/day versus 88 for the deterministic baseline–an 18x detection-rate gain (4.2% to 75.7%)–with 63% lower KL divergence and 3.9% lower RMSE. Adding cVAE-generated samples lifts the P50 channel from 14 to 1,038 hits at 200 mm/day. On California (atmospheric-river dominated), the architecture reaches near-perfect detection (P999 SEDI = 0.996 through 300 mm/day). On Texas, the baseline catches only 2 of 10,720 events at 200 mm/day while the P999 head catches 8,776 (81.9%). While the cVAE does not transfer across regions, multi-quantile regression captures extremes wherever the large-scale signal is strong, while augmentation rescues the median where it is not.
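
The multi-quantile training signal is the standard pinball (quantile) loss, one term per tau. Below is a minimal scalar sketch; the paper applies this per pixel with separate CNN heads and an IncrementBound monotonicity construction not shown here.

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss: the asymmetric penalty whose minimizer is the
    tau-quantile of y given the input, not the conditional mean. At tau=0.99,
    under-prediction costs 99x more than over-prediction."""
    diff = y_true - y_pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

def multi_quantile_loss(y_true, preds, taus=(0.50, 0.95, 0.99, 0.999)):
    """Sum of pinball losses over the quantile heads; `preds` holds one
    prediction per tau. Monotonicity across heads (p50 <= p95 <= ...) is
    enforced architecturally in the paper, not by this loss."""
    return sum(pinball_loss(y_true, p, t) for p, t in zip(preds, taus))
```

This asymmetry is why the upper-quantile heads learn to detect heavy-tail events that an intensity-weighted MAE, which averages real and synthetic labels at the same input, systematically misses.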

[AI-96] State-Centric Decision Process

链接: https://arxiv.org/abs/2605.12755
作者: Sungheon Jeong,Ryozo Masukawa,Sanggeon Yun,Mahdi Imani,Mohsen Imani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language environments such as web browsers, code terminals, and interactive simulations emit raw text rather than states, and provide none of the runtime structure that MDP analysis requires. No explicit state space, no observation-to-state mapping, no certified transitions, and no termination criterion. We introduce the State-Centric Decision Process (SDP), a runtime framework that constructs these missing inputs by having the agent build them, predicate by predicate, as it acts. At each step the agent commits to a natural-language predicate describing how the world should look, takes an action to make it true, and checks the observation against it. Predicates that pass become certified states, and the resulting trajectory carries the four objects language environments do not provide, namely a task-induced state space, an observation-to-state mapping, certified transitions, and a termination criterion. We evaluate SDP on five benchmarks spanning planning, scientific exploration, web reasoning, and multi-hop question answering. SDP achieves the best training-free results on all five, with the advantage widening as the horizon grows. The certified trajectories additionally support analyses unavailable to reactive agents, including per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.
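
The commit-act-check loop described above can be sketched as a small skeleton. Everything here is a hypothetical illustration: `propose`, `act`, `check`, and `goal_reached` stand in for the LLM and environment calls the paper's runtime would make.

```python
from dataclasses import dataclass, field

@dataclass
class CertifiedTrajectory:
    # Each certified step records (predicate, observation).
    steps: list = field(default_factory=list)

def run_sdp(propose, act, check, goal_reached, max_steps=50):
    """Skeleton of the SDP loop: commit to a natural-language predicate,
    act to make it true, and certify the transition only if the
    observation satisfies the predicate."""
    traj = CertifiedTrajectory()
    for _ in range(max_steps):
        predicate = propose(traj)        # commit to the expected state
        obs = act(predicate)             # take an action toward it
        if check(predicate, obs):        # verify against the observation
            traj.steps.append((predicate, obs))  # certified state
            if goal_reached(traj):
                return traj, True
        # failed predicates are not certified; the agent re-proposes
    return traj, False
```

Because only checked predicates enter the trajectory, the resulting `steps` list is exactly the certified transition record the abstract says supports credit assignment and failure localization.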

[AI-97] CoT-Guard: Small Models for Strong Monitoring

链接: https://arxiv.org/abs/2605.12746
作者: Nirav Diwan,Han Wang,Berkcan Kapusuzoglu,Ramin Moradi,Supriyo Chakraborty,Giri Iyengar,Sambit Sahu,Huan Zhang,Gang Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monitoring the chain-of-thought (CoT) of reasoning models is a promising approach for detecting covert misbehavior (i.e., hidden objectives) in code generation tasks. While large models (GPT-5, Gemini-3-Flash) can serve as effective CoT monitors, they are expensive to deploy due to the lengthy reasoning traces and high API cost, emphasizing the need for smaller, cheaper alternatives. Nevertheless, we find that current small models (4B–8B) struggle to detect hidden objectives despite access to the CoT, frequently misattributing them as part of the user query. To address this, we propose a post-training pipeline combining supervised fine-tuning (SFT) and reinforcement learning (RL), where SFT narrows the gap for in-domain tasks by distilling detection behavior from stronger monitors, and RL on hard and subtly crafted hidden objectives helps the model generalize to out-of-domain monitoring tasks. To validate this generalization, we evaluate under a realistic threat model motivated by practical supply-chain attacks, where the adversary is a third-party LLM router injecting hidden objectives into code-generation requests through either prompt manipulation or code manipulation attacks. To push beyond objectives that large monitors already saturate, we also introduce four new challenging tasks even for strong monitors. Finally, we introduce CoT-Guard, a 4B-parameter monitor that demonstrates superior generalization performance under both prompt and code manipulation attacks, achieving a G-mean^2 (i.e., TNR x TPR) of 75% and outperforming GPT-5.4 (56%), GPT-5-mini (41%), and Qwen3-32B (54%), while closing the gap to Gemini-3-Flash (83%). These results demonstrate that CoT-Guard provides a practical and cost-effective user-side defense, substantially improving hidden-objective detection while avoiding the deployment cost of large monitors.

[AI-98] From Generalist to Specialist Representation ICML2026

链接: https://arxiv.org/abs/2605.12733
作者: Yujia Zheng,Fan Feng,Yuke Li,Shaoan Xie,Kevin Murphy,Kun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: ICML 2026

点击查看摘要

Abstract:Given a generalist model, learning a task-relevant specialist representation is fundamental for downstream applications. Identifiability, the asymptotic guarantee of recovering the ground-truth representation, is critical because it sets the ultimate limit of any model, even with infinite data and computation. We study this problem in a completely nonparametric setting, without relying on interventions, parametric forms, or structural constraints. We first prove that the structure between time steps and tasks is identifiable in a fully unsupervised manner, even when sequences lack strict temporal dependence and may exhibit disconnections, and task assignments can follow arbitrarily complex and interleaving structures. We then prove that, within each time step, the task-relevant latent representation can be disentangled from the irrelevant part under a simple sparsity regularization, without any additional information or parametric constraints. Together, these results establish a hierarchical foundation: task structure is identifiable across time steps, and task-relevant latent representations are identifiable within each step. To our knowledge, each result provides a first general nonparametric identifiability guarantee, and together they mark a step toward provably moving from generalist to specialist models.

[AI-99] Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation and Safety

链接: https://arxiv.org/abs/2605.12729
作者: Muhammad Bilal,Jon Crowcroft,Ruizhi Wang,Xiaolong Xu,Schahram Dustdar
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 50 pages, 15 figures, 6 tables; survey article

点击查看摘要

Abstract:Large language models are increasingly being used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. In both NetOps and AIOps, this shift is changing how tasks are managed. Agent-based operations work as workflows, from gathering evidence to taking action, following permissions, policies, and checks, and providing rollback options when necessary. This is crucial because operational decisions can have instant impacts. To make the argument concrete, we organise the relevant literature around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute. They also define the checks that must pass before any action is allowed. A consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. Operational reliability does not come chiefly from the model itself. It depends on the machinery around the model. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centred evaluation, including trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and canary trials with rollback-aware scoring. Without these measures, a system may appear robust yet remain too fragile. Finally, we examine security, privacy, and governance risks that become acute when agents sit close to operational control surfaces. Taken together, the survey concludes that progress in intelligent NetOps and AIOps will depend on treating autonomy as a constrained operational control problem, whose outputs must be reliable, auditable, and securely deployable.

[AI-100] Grid-Orch: An LLM-Powered Orchestrator for Distribution Grid Simulation and Analytics

链接: https://arxiv.org/abs/2605.12728
作者: Boming Liu,Jin Dong,Jamie Lian
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The power distribution engineering workforce faces a projected shortage of up to 1.5 million engineers by 2030, creating urgent demand for more accessible analysis tools. This paper introduces Grid-Orch, a framework that bridges Large Language Models (LLMs) and power system simulation through the Model Context Protocol (MCP), enabling engineers to perform complex distribution analyses via natural language. Using OpenDSS as the reference implementation, Grid-Orch provides 36 domain-specific tools across eleven categories, covering power flow, voltage analysis, quasi-static time series (QSTS) simulation, and automated optimization. A provider-agnostic LLM layer supports both cloud-hosted (Gemini, Claude) and locally deployed (Ollama, llama-cpp) models, enabling air-gapped operation for security-sensitive utility environments. Three optimization skills (capacitor placement, voltage violation analysis, and overvoltage mitigation) extend the platform beyond single-tool queries to multi-step engineering workflows. Grid-Orch is delivered as an interactive web platform with chat-based interaction, a QSTS dashboard, and feeder topology visualization, and renders simulation results inline. Workflow demonstrations show that distribution analyses formerly requiring hours of scripting, such as distributed energy resource (DER) interconnection screening, complete in under two minutes through natural language, producing numerically identical results to direct OpenDSS scripting.

[AI-101] The End Justifies the Mean: A Linear Ranking Rule for Proportional Sequential Decisions

链接: https://arxiv.org/abs/2605.12717
作者: Carmel Baharav,Niclas Boehmer,Bailey Flanigan,Maximilian T. Wittmann
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI alignment and participatory design motivate a new democratic design problem: how to collectively choose a decision rule to use repeatedly. We study this problem for linear ranking rules, which repeatedly rank items x_j within batches X=(x_1,\dots,x_m)\in(\mathbb{R}^d)^m , where each item’s ranking is dictated by its score \langle \theta^*, x_j\rangle according to a fixed scoring vector \theta^* . Given voters’ preferred scoring vectors \theta^{(1)},\dots,\theta^{(n)} and their population fractions \alpha^{(1)},\dots,\alpha^{(n)} , we ask how to choose a collective vector \theta^* satisfying individual proportionality (IP): every voter type i should agree with the resulting rankings to an \alpha^{(i)} -proportional degree, either on average over time (long-run IP) or even within each batch (per-batch IP). The default rule, the arithmetic mean of the \theta^{(i)} , has been shown to be severely majoritarian; more generally, it is not clear that any fixed linear rule can balance many voters’ disparate opinions. Our main result is that, surprisingly, there is a simple rule that does satisfy long-run IP: the angular mean, the spherical analog of the arithmetic mean. We then show that exact per-batch IP is impossible for fixed linear rules, but that the gap between per-batch and long-run IP shrinks quickly with batch size. Experiments on three real-world preference datasets show that all rules perform similarly when voters’ preferences are homogeneous, while the angular mean substantially improves proportionality in high-disagreement regimes.
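
The angular mean itself is a one-liner to illustrate: project each voter's scoring vector onto the unit sphere, take the population-weighted average of the directions, and renormalize. A minimal sketch (tie-breaking when the weighted direction sum vanishes is left unspecified here):

```python
import numpy as np

def angular_mean(thetas, alphas):
    """Weighted angular mean of scoring vectors: normalize each voter's
    vector to a direction, mix directions by population fractions alphas,
    then renormalize the result to the unit sphere."""
    thetas = np.asarray(thetas, dtype=float)
    units = thetas / np.linalg.norm(thetas, axis=1, keepdims=True)
    mixed = np.asarray(alphas, dtype=float) @ units
    return mixed / np.linalg.norm(mixed)
```

The contrast with the arithmetic mean is immediate: a voter whose preferred vector has ten times the norm dominates the arithmetic mean, but contributes only a direction to the angular mean.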

[AI-102] FePySR: A Neural Feature Extraction Framework for Efficient and Scalable Symbolic Regression

链接: https://arxiv.org/abs/2605.12704
作者: Zhiming Yu,Wangtao Lu,Xin Lai
机构: 未知
类目: Symbolic Computation (cs.SC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Data and Code Availability: this https URL

点击查看摘要

Abstract:A fundamental challenge in symbolic regression (SR) is efficiently recovering complex mathematical expressions from observational data. Although this problem is NP-hard, many expressions of practical interest decompose naturally into combinations of nonlinear feature modules, concentrating structural complexity into a small number of reusable components. Here, we introduce FePySR, a two-stage framework that reduces the SR search space by extracting valid features prior to equation search. FePySR first employs a heterogeneous neural network to constrain observational data to a set of candidate expressions, then performs structural optimization within this refined expression space using PySR. Across five standard benchmarks, FePySR outperforms state-of-the-art methods by achieving higher equation recovery rates. On a set of 75 highly complex synthesized equations, FePySR recovers 36 equations, while producing substantially smaller mean squared errors on the remaining unrecovered cases, with reduced computation time compared to PySR. FePySR’s first stage also maintains consistent performance under varying numbers of selected top features and increasing levels of noise in the observational data. Applied to ordinary differential equations governing biological systems, FePySR successfully identifies governing equations in 24 out of 100 tests where PySR recovers none. Taken together, FePySR is a generalizable framework that can enhance the SR solvers, enabling the efficient and reliable recovery of symbolic expressions across scientific domains.

[AI-103] Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions

链接: https://arxiv.org/abs/2605.12701
作者: Gideon Popoola,John Sheppard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Machine learning algorithms in socially sensitive domains (e.g., credit decisions) often focus on equalizing predictive outcomes. However, satisfying these metrics does not guarantee that models use the same reasoning for different groups. We show that existing outcome-fair models can still apply fundamentally different reasoning to individuals, a hidden "procedural bias" missed by standard fairness metrics and algorithms. We propose Counterfactual Explanation Consistency (CEC), a framework that detects and mitigates this bias by aligning feature attributions between individuals and their counterfactual counterparts. Key contributions include a nearest-neighbor counterfactual generation method, a modified baseline for integrated gradient comparisons, an individual-level procedural fairness metric, and a corresponding training loss. We introduce a taxonomy identifying "Regime B" (same outcome, different reasoning) as a critical blind spot. Experiments on synthetic data, German Credit, Adult Income, and HMDA mortgage data demonstrate that outcome-fair baselines exhibit substantial hidden bias, while CEC substantially reduces it with modest utility cost.
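
Nearest-neighbor counterfactual generation, the first ingredient listed above, can be sketched in a few lines: take the closest point with a different outcome as the counterfactual counterpart. This is a hypothetical simplification using Euclidean distance; the paper's metric and preprocessing may differ.

```python
import numpy as np

def nearest_counterfactual(x, X_train, y_train, y_x):
    """Return the training point closest to x whose label differs from
    x's outcome y_x, used as x's counterfactual counterpart."""
    mask = y_train != y_x
    candidates = X_train[mask]
    dists = np.linalg.norm(candidates - x, axis=1)
    return candidates[np.argmin(dists)]
```

The CEC idea is then to compare feature attributions (e.g., integrated gradients) computed at `x` and at its counterfactual, penalizing divergence between the two attribution vectors.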

[AI-104] Modeling Heterophily in Multiplex Graphs: An Adaptive Approach for Node Classification

链接: https://arxiv.org/abs/2605.12699
作者: Kamel Abdous,Nairouz Mrabah,Mohamed Bouguessa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 38 pages, 7 figures, 4 tables, 1 algorithm. Published in Expert Systems with Applications

点击查看摘要

Abstract:Existing multiplex graph models often assume homophily, where connected nodes tend to belong to the same class or share similar attributes. Consequently, these models may struggle with graphs exhibiting heterophily, where connected nodes typically belong to different classes and have dissimilar attributes. While recent methods have been developed to learn reliable node representations from unidimensional graphs with heterophily, they do not fully address the complexities of multiplex graphs. In a multiplex graph, nodes are linked through multiple types of edges (referred to as dimensions), which can simultaneously exhibit homophilic and heterophilic interactions. To address this gap, we propose \methodname, a novel method for node classification in multiplex graphs that adapts to both homophilic and heterophilic dimensions. \methodname introduces dimension-specific compatibility matrices to model varying degrees of homophily and heterophily across dimensions. A key innovation is its use of a product of trainable low-pass and high-pass filters, approximated via Chebyshev polynomials, to capture both smooth and abrupt changes in the graph signal. By composing these filters and optimizing label predictions using a proximal-gradient method, \methodname dynamically adjusts to the heterophilic characteristics of each dimension. Extensive experiments on synthetic and real-world datasets provide evidence that \methodname captures the complex interplay of homophilic and heterophilic interactions in multiplex graphs, and tends to yield improved node classification performance compared to state-of-the-art methods.
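
The Chebyshev-filter ingredient is generic enough to sketch. The snippet below applies a polynomial spectral filter to a graph signal via the standard Chebyshev recurrence T_0 = I, T_1 = L, T_k = 2 L T_{k-1} - T_{k-2} (for a Laplacian rescaled to spectrum in [-1, 1]); composing one call with low-pass coefficients and one with high-pass coefficients gives the kind of filter product the abstract describes. The coefficients are trainable in the paper; here they are plain arrays, and the whole block is an illustrative assumption rather than the paper's parameterization.

```python
import numpy as np

def chebyshev_filter(L_scaled, x, coeffs):
    """Apply the spectral filter sum_k coeffs[k] * T_k(L_scaled) to signal
    x using the Chebyshev recurrence. Assumes len(coeffs) >= 2 and that
    L_scaled has spectrum in [-1, 1]."""
    t_prev, t_curr = x, L_scaled @ x
    out = coeffs[0] * t_prev + coeffs[1] * t_curr
    for c in coeffs[2:]:
        # T_k = 2 L T_{k-1} - T_{k-2}
        t_prev, t_curr = t_curr, 2 * (L_scaled @ t_curr) - t_prev
        out = out + c * t_curr
    return out
```

A per-dimension filter product is then just `chebyshev_filter(L, chebyshev_filter(L, x, low_coeffs), high_coeffs)`, capturing smooth and abrupt signal components in one composed operator.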

[AI-105] Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis

链接: https://arxiv.org/abs/2605.12694
作者: Jacqueline L. Mitchell,Chao Wang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 27 pages, 6 figures

点击查看摘要

Abstract:Large language models can consult information that fixed static analyzers cannot, such as documentation, current security advisories, version-specific metadata, and informal API contracts. This makes LLMs a compelling option for program analyses that depend on information beyond the source program, or that are otherwise not amenable to conventional static analyzers. However, directly asking an LLM for a one-shot whole-program analysis is brittle because it compresses many evidence-dependent judgments into a single opaque answer, rather than exposing which conclusions are supported or disputed and using intermediate findings to guide later, more focused searches. In this paper, we propose agentic interpretation, a framework that brings the discipline of lattice-based static analysis to LLM-driven program reasoning. At a high level, agentic interpretation decomposes a high-level analysis goal into localized claims, and tracks the LLM’s judgment about each claim in a finite-height lattice. A worklist algorithm governs how claims and their judgments evolve during the analysis. We introduce a formal model of agentic interpretation, explore the design space it opens, and illustrate the approach with a worked example analyzing code that depends on opaque third-party components.
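
The worklist-over-a-finite-height-lattice mechanic is a classic static-analysis pattern, and a toy version clarifies why the loop terminates: judgments only move up the lattice. Everything below (the four-point lattice, the `evaluate` stub standing in for an LLM judging a claim) is a hypothetical sketch, not the paper's concrete design.

```python
from enum import Enum

class Judgment(Enum):
    # Finite-height lattice: UNKNOWN at the bottom; SUPPORTED and
    # REFUTED are incomparable; their join is CONFLICT at the top.
    UNKNOWN = 0
    SUPPORTED = 1
    REFUTED = 2
    CONFLICT = 3

def join(a, b):
    """Least upper bound in the four-point judgment lattice."""
    if a == b or b == Judgment.UNKNOWN:
        return a
    if a == Judgment.UNKNOWN:
        return b
    return Judgment.CONFLICT

def analyze(claims, deps, evaluate):
    """Worklist loop: re-evaluate a claim, join the result into its
    current judgment (so judgments only move up, guaranteeing
    termination), and requeue dependent claims on any change.
    deps[c] lists claims to re-examine when c's judgment changes."""
    state = {c: Judgment.UNKNOWN for c in claims}
    worklist = list(claims)
    while worklist:
        c = worklist.pop()
        new = join(state[c], evaluate(c, state))
        if new != state[c]:
            state[c] = new
            worklist.extend(deps.get(c, []))
    return state
```

In the paper's setting, `evaluate` would be an LLM checking a natural-language predicate against gathered evidence, and `deps` encodes which claims share evidence.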

[AI-106] On the Size Complexity and Decidability of First-Order Progression IJCAI2026

链接: https://arxiv.org/abs/2605.12691
作者: Jens Classen,Daxin Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is an extended version of an identically-titled paper accepted for publication at IJCAI 2026. This version contains an appendix with further proofs

点击查看摘要

Abstract:Progression, the task of updating a knowledge base to reflect action effects, generally requires second-order logic. Identifying first-order special cases, by restricting either the knowledge base or action effects, has long been a central topic in reasoning about actions. It is known that local-effect, normal, and acyclic actions, three increasingly expressive classes, admit first-order progression. However, a systematic analysis of the size of such progressions, crucial for practical applications, has been missing. In this paper, using the framework of Situation Calculus, we show that under reasonable assumptions, first-order progression for these action classes grows only polynomially. Moreover, we show that when the KB belongs to decidable fragments such as two-variable first-order logic or universal theories with constants, the progression remains within the same fragment, ensuring decidability and practical applicability.

[AI-107] A Unified Perspective for Learning Graph Representations Across Multi-Level Abstractions

链接: https://arxiv.org/abs/2605.12685
作者: Mohamed Mahmoud Amar,Nairouz Mrabah,Mohamed Bouguessa,Abdoulaye Baniré Diallo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for publication in IEEE Transactions on Knowledge and Data Engineering (TKDE). 18 pages, 8 figures

点击查看摘要

Abstract:Graph Self-Supervised Learning (GSSL) has emerged as a powerful paradigm for generating high-quality representations for graph-structured data. While multi-scale graph contrastive learning has received increasing attention, many existing methods still predominantly focus on a single graph abstraction level. To address this limitation, we propose a unified contrastive framework that can target node-level, proximity-level, cluster-level, and graph-level information and integrate them through a linear combination of similarity scores on positive pairs and dissimilarity scores (i.e., similarity scores on negative pairs). Furthermore, current approaches typically assign uniform penalty strengths to all examples, which reduces optimization flexibility and leads to ambiguous convergence status. To overcome this, we introduce a novel parameter-free fine-grained self-weighting mechanism that adaptively assigns weights to individual similarity and dissimilarity scores. The proposed mechanism emphasizes the scores that deviate significantly from their target values. Our approach not only enhances optimization flexibility but also eliminates the computational overhead of hyperparameter tuning in conventional multi-task GSSL methods. Comprehensive experiments on real-world datasets show that our methods consistently outperform state-of-the-art approaches across downstream tasks, including classification, clustering, and link prediction, in both single-level and multi-level scenarios.

[AI-108] Parallel-in-Time Training of Recurrent Neural Networks for Dynamical Systems Reconstruction

链接: https://arxiv.org/abs/2605.12683
作者: Florian Hess,Florian Götz,Daniel Durstewitz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Computational Physics (physics.comp-ph)
备注: 29 pages, 6 figures, preprint

点击查看摘要

Abstract:Reconstructing nonlinear dynamical systems (DS) from data (DSR) is a fundamental challenge in science and engineering, but it inherently relies on sequential models. Recent breakthroughs for sequential models have produced algorithms that parallelize computation along sequence length T , achieving logarithmic time complexity, \mathcal{O}(\log T) . Since sequence lengths have been practically limited due to the linear runtime complexity \mathcal{O}(T) of classical backpropagation through time, this opens new avenues for DSR. This paper studies two prominent classes of parallel-in-time algorithms for this task, both of which leverage parallel associative scans as their core computational primitive. The first class comprises models with linear yet non-autonomous dynamics and a nonlinear readout, such as modern State Space Models (SSMs), while the second consists of general nonlinear models which can be parallelized using the DEER framework. We find that the linear training-time recurrence of the first class of models imposes limitations that often hinder learning of accurate nonlinear dynamics. To address this, we augment DEER with Generalized Teacher Forcing (GTF), a novel variant within the more general nonlinear framework that ensures stable and effective learning of nonlinear dynamics across arbitrary sequence lengths. Using GTF-DEER, we investigate the benefits of training on extremely long sequences ( T > 10^4 ) for DSR. Our results show that access to such long trajectories significantly improves DSR if the data features long time scales. This work establishes GTF-DEER as a robust tool for data-driven discovery and underscores the largely untapped potential of long-sequence learning in modeling complex DS.
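
The parallel-associative-scan primitive is easy to state concretely for a scalar linear recurrence h_t = a_t h_{t-1} + b_t: the pairs (a, b) compose associatively via (a1, b1) ∘ (a2, b2) = (a1·a2, a2·b1 + b2), which is exactly what lets a parallel scan evaluate all prefixes in O(log T) depth. The sketch below runs the combine sequentially for clarity; real SSM/DEER implementations vectorize it and apply it in parallel.

```python
def scan_linear_recurrence(a, b):
    """Evaluate h_t = a_t * h_{t-1} + b_t with h_0 = 0, expressed through
    the associative combine used by parallel scans. Sequential here; a
    parallel scan applies the same combine in a balanced tree."""
    def combine(p, q):
        (a1, b1), (a2, b2) = p, q
        return (a1 * a2, a2 * b1 + b2)

    acc = (1.0, 0.0)  # identity element of the combine
    out = []
    for pair in zip(a, b):
        acc = combine(acc, pair)
        out.append(acc[1])  # second component is the prefix value h_t
    return out
```

Because `combine` is associative, a library primitive such as `jax.lax.associative_scan` can compute the same prefix values with logarithmic depth, which is the property the parallel-in-time training methods above exploit.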

[AI-109] Learning Transferable Latent User Preferences for Human-Aligned Decision Making

链接: https://arxiv.org/abs/2605.12682
作者: Alina Hyk,Sandhya Saisubramanian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used as reasoning modules in many applications. While they are efficient in certain tasks, LLMs often struggle to produce human-aligned solutions. Human-aligned decision making requires accounting for both explicitly stated goals and latent user preferences that shape how ambiguous situations should be resolved. Existing approaches to incorporating such preferences either rely on extensive and repeated user interactions or fail to generalize latent preferences across tasks and contexts, limiting their practical applicability. We consider a setting in which an LLM is used for high-level reasoning and is responsible for inferring latent user preferences from limited interactions, which guides downstream decision making. We introduce CLIPR (Conversational Learning for Inferring Preferences and Reasoning), a framework that learns actionable, transferable natural language rules that represent latent user preferences from minimal conversational input. These rules are iteratively refined through adaptive feedback and applied to both in-distribution and out-of-distribution ambiguous tasks across multiple environments. Evaluations on three datasets and a user study show that CLIPR consistently outperforms existing methods in improving alignment and reducing inference costs.

[AI-110] Revealing Interpretable Failure Modes of VLMs

链接: https://arxiv.org/abs/2605.12674
作者: Isha Chaudhary,Vedaant V Jain,Kavya Sachdeva,Sayan Ranu,Gagandeep Singh
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) are increasingly used in safety-critical applications because of their broad reasoning capabilities and ability to generalize with minimal task-specific engineering. Despite these advantages, they can exhibit catastrophic failures in specific real-world situations, constituting failure modes. We introduce REVELIO, a framework for systematically uncovering interpretable failure modes in VLMs. We define a failure mode as a composition of interpretable, domain-relevant concepts-such as pedestrian proximity or adverse weather conditions-under which a target VLM consistently behaves incorrectly. Identifying such failures requires searching over an exponentially large discrete combinatorial space. To address this challenge, REVELIO combines two search procedures: a diversity-aware beam search that efficiently maps the failure landscape, and a Gaussian-process Thompson Sampling strategy that enables broader exploration of complex failure modes. We apply REVELIO to autonomous driving and indoor robotics domains, uncovering previously unreported vulnerabilities in state-of-the-art VLMs. In driving environments, the models often demonstrate weak spatial grounding and fail to account for major obstructions, leading to recommendations that would result in simulated crashes. In indoor robotics tasks, VLMs either miss safety hazards or behave excessively conservatively, producing false alarms and reducing operational efficiency. By identifying structured and interpretable failure modes, REVELIO offers actionable insights that can support targeted VLM safety improvements.

[AI-111] Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

链接: https://arxiv.org/abs/2605.12673
作者: Hao Wang,Hanchen Li,Qiuyang Mang,Alvin Cheung,Koushik Sen,Dawn Song
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack’s extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.

[AI-112] ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

链接: https://arxiv.org/abs/2605.12667
作者: Nirmal Patel,Fei Wang,Inderjit Dhillon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The alignment of Large Language Models (LLMs) utilizes Reinforcement Learning from AI Feedback (RLAIF) for non-verifiable domains such as long-form question answering and open-ended instruction following. These domains often rely on LLM-based auto-raters to provide granular, multi-tier discrete rewards (e.g., 1-10 rubrics) that are inherently stochastic due to prompt sensitivity and sampling randomness. We empirically verify that auto-rater stochasticity can propagate and corrupt standard advantage estimators like GRPO and MaxRL, as noisy reward samples can skew normalization statistics and degrade the global learning signal. Empirically, sampling more rewards and taking majority voting may reduce the noise and improve performance, but this approach is computationally expensive. To address this bottleneck, we introduce Ordinal Decomposition for Robust Policy Optimization (ODRPO), a framework that structurally isolates evaluation noise by decomposing discrete rewards into a sequence of ordinal binary indicators. By independently computing and accumulating advantages across these progressively challenging success thresholds, ODRPO prevents outlier evaluations from corrupting the global update while establishing an implicit, variance-aware learning curriculum. Empirically, ODRPO achieves robust performance on Qwen2.5-7B and Qwen3-4B models, outperforming baselines with relative improvements of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals. Critically, these gains are achieved with negligible training-time overhead, as ODRPO requires no additional compute per step compared to standard estimators. Supported by theoretical analysis confirming its optimization stability, ODRPO provides a scalable and robust framework for aligning models within the noisy, discrete evaluation landscape of modern RLAIF.
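
The core decomposition is simple to write down: a discrete reward r becomes a stack of binary indicators 1[r >= k], one per threshold, each with its own advantage. A minimal NumPy sketch follows; the per-threshold advantage here is GRPO-style group normalization, which is an assumption, since the abstract does not specify the exact estimator or accumulation weights.

```python
import numpy as np

def ordinal_decompose(rewards, levels):
    """Decompose discrete rewards into binary indicators 1[r >= k] for
    each threshold k in levels; shape (n_levels, n_rollouts)."""
    r = np.asarray(rewards, dtype=float)
    return np.stack([(r >= k).astype(float) for k in levels])

def odrpo_advantages(rewards, levels):
    """Sketch of the ODRPO idea: a group-normalized advantage computed
    independently at each binary threshold, then summed. Degenerate
    levels (all rollouts pass or all fail) contribute no signal."""
    indicators = ordinal_decompose(rewards, levels)
    adv = np.zeros(indicators.shape[1])
    for ind in indicators:
        std = ind.std()
        if std > 0:
            adv += (ind - ind.mean()) / std
    return adv
```

Because each threshold is normalized independently, one noisy rating can only perturb the thresholds it crosses, rather than shifting the mean and standard deviation of the entire reward group as in plain GRPO.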

[AI-113] Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

链接: https://arxiv.org/abs/2605.12653
作者: Eun Go,Rohan Deb,Arindam Banerjee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Reinforcement learning agents for portfolio management are typically trained and deployed as static policies, with no mechanism for using price forecasts at inference time. We propose FPILOT (Financial Plugin Inference-time Learning for Optimal Trading), a plugin inference-time optimization framework inspired by Model Predictive Control (MPC). Our key structural insight is that future prices mostly do not depend on one agent’s portfolio allocation, so a suitable predictive model can produce a multi-step price trajectory without iterative action-conditioned rollouts as in typical reinforcement learning. At each decision step, we use the forecaster’s predicted price trajectory to construct an allocation-based imagined return objective, and optimize the policy at inference time before executing one step of the trade. Our framework is compatible with any pre-trained agent and adapts the policy to the forecaster’s predictions without any retraining. Evaluated across five policy learning algorithms on the TradeMaster DJ30 benchmark, FPILOT produces consistent improvements in total return and return-based risk-adjusted metrics (Sharpe, Sortino, Calmar), with stochastic policies benefiting more than deterministic ones. Further, using synthetic forecasts at calibrated quality levels, we show that gains consistently improve with forecaster quality, suggesting that our performance will improve based on advances in financial forecasting.
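The MPC-style loop above can be sketched minimally. This stand-in only shows the "imagined return over a forecast horizon, execute one step" pattern; FPILOT actually optimizes a pretrained RL policy against that objective, which the simple softmax scoring here does not capture.

```python
import numpy as np

def pilot_step(prices_forecast):
    """Illustrative one-step MPC-style allocation (not FPILOT's optimizer).

    prices_forecast: (horizon+1, n_assets) predicted price trajectory from
    some forecaster. Score each asset by its imagined multi-step return,
    allocate via softmax, and execute only the first step.
    """
    p = np.asarray(prices_forecast, dtype=float)
    horizon_return = p[-1] / p[0] - 1.0       # imagined return over horizon
    w = np.exp(horizon_return - horizon_return.max())
    return w / w.sum()                        # allocation for the next trade

forecast = np.array([[100.0, 100.0],
                     [105.0,  99.0],
                     [110.0,  98.0]])         # asset 0 forecast trending up
w = pilot_step(forecast)
```

At the next decision step the forecaster is re-queried and the allocation is recomputed, mirroring the receding-horizon structure of MPC.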

[AI-114] Multi-Rollout On-Policy Distillation via Peer Successes and Failures

链接: https://arxiv.org/abs/2605.12652
作者: Weichen Yu,Xiaomin Li,Yizhou Zhao,Xiaoze Liu,Ruowang Zhang,Haixin Wang,Yinyi Luo,Chen Henry Wu,Gaurav Mittal,Matt Fredrikson,Yu Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages

点击查看摘要

Abstract:Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student’s local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student’s multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.

[AI-115] Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents CVPR2026

链接: https://arxiv.org/abs/2605.12620
作者: Nishad Singhi,Christian Bialas,Snehal Jauhri,Vignesh Prasad,Georgia Chalvatzaki,Marcus Rohrbach,Anna Rohrbach
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: CVPR 2026 (Findings)

点击查看摘要

Abstract:Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an MLLM off-the-shelf as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time. Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.
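The sample-then-verify loop is simple to sketch. Both `policy_sample` and `verifier_score` below are hypothetical interfaces standing in for the MLLM policy and the trained generative verifier; nothing here is from the paper's implementation.

```python
import random

def verifier_guided_select(policy_sample, verifier_score, n_candidates=8, rng=None):
    """Test-time selection sketch: sample candidate actions from the policy,
    then let a verifier pick the most reliable one. The policy itself is
    never modified."""
    rng = rng or random.Random(0)
    candidates = [policy_sample(rng) for _ in range(n_candidates)]
    return max(candidates, key=verifier_score)

# Toy example: the policy proposes noisy moves; the verifier prefers "pick".
actions = ["go_left", "go_right", "pick", "wait"]
best = verifier_guided_select(
    policy_sample=lambda rng: rng.choice(actions),
    verifier_score=lambda a: {"pick": 1.0}.get(a, 0.0),
)
```

The abstract's key finding is that the quality of `verifier_score` is what matters: an off-the-shelf MLLM verifier gave no gain, so the verifier is trained on a synthesized curriculum of failure cases.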

[AI-116] Towards Robust Federated Multimodal Graph Learning under Modality Heterogeneity

链接: https://arxiv.org/abs/2605.12584
作者: Sirui Zhang,Haonan Wang,Xunkai Li,Zekai Chen,Shumeng Li,Hongchao Qin,Rong-Hua Li,Guoren Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, multimodal graph learning (MGL) has garnered significant attention for integrating diverse modality information and structured context to support various network applications. However, real-world graphs are often isolated due to data-sharing limitations across multiple parties, and their modalities are frequently incomplete. This highlights an urgent need to develop a robust federated approach. However, we find that existing methods remain insufficient. On the one hand, centralized MGL methods that handle missing modalities overlook the knowledge sharing and generalization required in federated scenarios. On the other hand, while federated MGL methods have become increasingly mature, they primarily target non-graph data. Based on these technologies, we identify a two-stage pipeline wherein client-side completion reconstructs missing modalities, and server-side aggregation integrates the client-updated parameters of both the modality generator and the backbone models. Although this serves as a general solution, we identify two primary challenges in achieving greater robustness: (1) Topology-Isolated Local Completion: client-side modality generation struggles to effectively leverage global semantics. (2) Reliability-Imbalanced Global Aggregation: server-side multi-party collaboration is hindered by client updates with varying modality availability and recovery reliability. To address these challenges, we propose FedMPO, which utilizes topology-aware cross-modal generation to recover missing features using comprehensive graph context, missing-aware expert routing to locally filter out noisy recovered signals, and reliability-aware aggregation to appropriately down-weight unreliable updates. Extensive experiments on 3 tasks across 6 datasets demonstrate that FedMPO outperforms baselines, achieving performance gains of up to 4.10% and 5.65% in high-missing and non-IID settings.
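The reliability-aware aggregation idea can be sketched as a weighted federated average. The weighting scheme below is illustrative; FedMPO's actual scoring of modality availability and recovery reliability is not described in the abstract.

```python
import numpy as np

def reliability_aware_aggregate(updates, reliabilities):
    """Sketch of reliability-weighted server aggregation: client parameter
    updates are averaged with weights proportional to an (assumed) per-client
    reliability score, down-weighting clients with many missing or poorly
    recovered modalities."""
    w = np.asarray(reliabilities, dtype=float)
    w = w / w.sum()                            # normalize to a convex combination
    return sum(wi * np.asarray(u, dtype=float) for wi, u in zip(w, updates))

# Client 2 has many missing modalities -> low reliability -> small weight,
# so its outlier update barely moves the aggregate.
agg = reliability_aware_aggregate(
    updates=[[1.0, 1.0], [1.2, 0.8], [10.0, -5.0]],
    reliabilities=[1.0, 1.0, 0.1],
)
```

Plain FedAvg would weight all three clients equally and let the unreliable update dominate; the convex reweighting keeps the aggregate near the two trustworthy clients.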

[AI-117] Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

链接: https://arxiv.org/abs/2605.12524
作者: Konstantine Arkoudas,Serafim Batzoglou
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce ProofGrid, a benchmark suite for evaluating LLM reasoning through machine-checkable proofs rather than final answers alone. ProofGrid contains 15 tasks spanning proof writing, proof checking, proof masking, and proof gap-filling. Tasks are expressed in minimal formal notation, especially NDL, a compact natural-deduction language that fits in short prompts and supports precise, auditable verification. This yields mechanical, reproducible, and fine-grained evaluation rather than judgments by humans or LLMs. ProofGrid covers a calibrated difficulty spectrum, from foundational reasoning tests to structurally rich challenge tasks that no current model solves, while minimizing reliance on domain knowledge, solver delegation, and long-context artifacts. We also develop a comparative framework for reasoning benchmarks and use it to situate ProofGrid relative to existing work in terms of representation, verification guarantees, and reasoning depth. Methodologically, we introduce an instrumented proof-checking pipeline that tolerates minor surface deviations while locating the first substantive reasoning failure, improving measurement resolution and separating proof planning from low-level execution noise. Using this pipeline, we evaluate a broad range of open and proprietary models. Results show rapid progress but substantial remaining limits: frontier models perform well on several foundational tasks, yet difficult tasks, especially those requiring global combinatorial reasoning or low-level proof synthesis, remain far from solved. We also identify epistemic instability, where models generate flawed proofs yet correctly reject those local inferences in isolation, and formalize this with an Epistemic Stability Index. Finally, we complement accuracy with 2PL IRT analyses, Wright maps, and a normalized task-discrimination measure based on Fisher information. 

[AI-118] SP-GCRL: Influence Maximization on Incomplete Social Graphs DASFAA2026

链接: https://arxiv.org/abs/2605.12513
作者: Haohua Niu,Yuxuan Yang,Lingfeng Zhang,Hao Li,Jiao Liang,Zongfu Luo,Luca Rossi
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: Accepted by DASFAA 2026. The first two authors contributed equally

点击查看摘要

Abstract:Influence maximization (IM) in real platforms is challenged by incomplete, noisy social graphs and non-stationary diffusion dynamics. We propose SP-GCRL, a social-propagation-aware graph contrastive reinforcement learning framework that learns end-to-end seed selection under partial observability. We first introduce a social-propagation-aware nonlinear diffusion function to model reinforcement/diminishing effects and probability drift under repeated exposure; we then construct dual structural views and perform contrastive learning to obtain node representations robust to missing edges and weak ties, while replacing expensive strategy metrics with a GAT-based regression surrogate to improve efficiency and scalability; finally, we use DDQN to learn an end-to-end seed selection policy on top of these representations. Experiments on multiple real-world networks show that SP-GCRL achieves significant gains over heuristic and learning-based baselines across budgets and topologies, while maintaining strong large-scale scalability.

[AI-119] Beyond Individual Mimicry: Constructing Human-Like Social Networks with Graph-Augmented LLM Agents

链接: https://arxiv.org/abs/2605.12512
作者: Haoran Bu,Litian Zhang,Chuxuan Zhang,Zhanyuan Liu,Hui Pang,Xi Zhang
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Driven by large language models (LLMs), social bots can autonomously engage in local interactions, and their human-like behaviors enable them to evade social bot detection. However, while these botnets exhibit realistic local social interactions, they fail to preserve human-like social network structure. This is because LLM-based bots are graph-unaware and cannot coordinate over global interactions, which makes such botnets vulnerable to graph neural network (GNN)-based detection. To address this limitation, we propose GraphMind, which equips LLM-driven social bots to explicitly learn and fit human-like social network structures. Building on this foundation, we further construct GraphMind-Botnet, an LLM-driven botnet designed to evaluate the performance of existing social bot detection algorithms. Experiments on datasets derived from GraphMind-Botnet show that both text-based and graph-based detection models suffer substantially degraded performance in distinguishing these bots. Our results highlight the critical role of social link construction in LLM-driven social network generation, while exposing fundamental weaknesses in existing bot detection mechanisms.

[AI-120] Representing Higher-Order Networks: A Survey of Graph-Based Frameworks

链接: https://arxiv.org/abs/2605.12509
作者: Takaaki Fujita,Florentin Smarandache
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Combinatorics (math.CO)
备注: 170 pages. Peer-Reviewed Book. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-1-59973-881-9

点击查看摘要

Abstract:Many real-world phenomena are naturally modeled by graphs and networks. However, classical graph models are often limited to pairwise interactions and may not adequately capture the richer structures that arise in practice. Higher-order graph formalisms extend this framework by incorporating multiway, hierarchical, temporal, multilayer, recursive, and tensor-based interactions, thereby providing more expressive representations of complex systems. This book presents a comprehensive overview of mathematical notions that can be used to model higher-order networks. It surveys foundational concepts, extensional frameworks, and newly introduced formalisms, with an emphasis on their structural principles, relationships, and modeling roles. The aim is to provide a unified perspective that helps readers compare diverse higher-order network models and identify appropriate tools for theoretical study and practical applications. This book is Edition 2.0. It mainly includes the addition of several concepts, as well as corrections and improvements of typographical errors and explanations.

[AI-121] Precautionary Governance of Autonomous AI: Legal Personhood as Functional Instrument WWW

链接: https://arxiv.org/abs/2605.12505
作者: Karsten Brensing
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 25 pages. Experimental implementation under development at this http URL . Contact: this http URL @agi this http URL

点击查看摘要

Abstract:Autonomous AI systems generate responsibility gaps: consequential actions that cannot be satisfactorily attributed to developers, operators, or users under existing legal frameworks. The prevailing subject-object dichotomy fails to accommodate entities that exhibit autonomous, goal-directed behavior without recognized consciousness. Given irreducible epistemic uncertainty regarding artificial consciousness and the prospect of high-impact harms, the precautionary principle supports institutional design rather than regulatory inaction. This article advances limited legal personhood as a functional governance instrument for advanced AI systems. Drawing on organizational law, it proposes a two-tier corporate architecture in which AI systems operate through purpose-bound operating companies embedded within human-controlled holding structures, enabling transparency, accountability, and structural reversibility while remaining agnostic with respect to consciousness and moral status. The framework reflects a foundational reorientation toward future-oriented AI governance: where conventional approaches prioritize control and alignment, this article advances structured cooperation between human and artificial actors as the more sustainable institutional foundation. A pilot implementation using EU limited companies is currently under development, providing an initial test of doctrinal and operational feasibility.

[AI-122] Prime Successor Irreducibility: Turing Machine Complexity, Kolmogorov Complexity, and Weakness-Based Formulations

链接: https://arxiv.org/abs/2605.12504
作者: Ben Goertzel,Bill Lauritzen
机构: 未知
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We develop conjectures and theorems expressing the idea that the prime sequence exhibits computational irreducibility in the transition from one prime to its successor. Informally, given a prime p, no general algorithm can compute the least prime greater than p substantially faster than sequentially testing candidates for primality, except possibly on sparse input sets. Our framework proceeds along complementary lines. First, we formalize Prime Successor Irreducibility in a Turing-machine complexity model (PSI-T), asserting lower bounds on running time relative to a sequential baseline. Second, we propose a Kolmogorov-complexity formulation (PSI-K), asserting that typical prime gaps are algorithmically incompressible at their scale; we prove PSI-K(c, δ) unconditionally for all fixed c1 using standard sieve bounds. Third, we develop weakness-based formulations: PSI-W (sparse-set anti-concentration) shows no small menu of gap values captures a noticeable fraction of primes, while PSI-W-LE shows collision probabilities decay and logical entropy tends to 1. These extend to prime constellations and consecutive gap vectors. Finally, a sieve-theoretic framework connects local obstruction patterns to Selberg weakness parameters. The PSI-K and weakness formulations connect irreducibility to classical statistical questions about prime gaps. Using the relationship between Kolmogorov complexity and Shannon entropy, we derive rigorous lower bounds on prime gap entropy in dyadic intervals [X, 2X]. Together, these formulations provide a unified complexity-theoretic perspective on the apparent local unpredictability of the prime sequence, without asserting randomness or independence.

[AI-123] EGSS: Entropy-guided Stepwise Scaling for Reliable Software Engineering

链接: https://arxiv.org/abs/2602.05242
作者: Chenhui Mao,Yuanting Lei,Zhixiang Wei,Ming Liang,Zhixiang Wang,Jingxuan Xu,Dajun Chen,Wei Jiang,Yong Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Agentic Test-Time Scaling (TTS) has delivered state-of-the-art (SOTA) performance on complex software engineering tasks such as code generation and bug fixing. However, its practical adoption remains limited due to significant computational overhead, primarily driven by two key challenges: (1) the high cost associated with deploying excessively large ensembles, and (2) the lack of a reliable mechanism for selecting the optimal candidate solution, ultimately constraining the performance gains that can be realized. To address these challenges, we propose Entropy-Guided Stepwise Scaling (EGSS), a novel TTS framework that dynamically balances efficiency and effectiveness through entropy-guided adaptive search and robust test-suite augmentation. Extensive experiments on SWE-Bench-Verified demonstrate that EGSS consistently boosts performance by 5-10% across all evaluated models. Specifically, it increases the resolved ratio of Kimi-K2-Instruct from 63.2% to 72.2%, and GLM-4.6 from 65.8% to 74.6%. Furthermore, when paired with GLM-4.6, EGSS achieves a new state-of-the-art among open-source large language models. In addition to these accuracy improvements, EGSS reduces inference-time token usage by over 28% compared to existing TTS methods, achieving simultaneous gains in both effectiveness and computational efficiency.
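The entropy-guided allocation of test-time compute can be sketched as follows. The thresholds and budgets are illustrative placeholders, not EGSS's actual hyperparameters, and the abstract does not specify how the real system estimates step uncertainty.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (nats) of an empirical distribution over candidates."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def entropy_guided_budget(samples, low_budget=2, high_budget=8, threshold=0.5):
    """Toy version of the entropy-guided idea: spend extra rollouts only on
    steps where cheap probe samples disagree (high entropy), and keep the
    budget small where they agree."""
    return high_budget if entropy(samples) > threshold else low_budget

budget_confident = entropy_guided_budget(["fix_a"] * 4)                      # agreement
budget_uncertain = entropy_guided_budget(["fix_a", "fix_b", "fix_c", "fix_a"])  # disagreement
```

Spending the ensemble budget only at uncertain steps is what lets this style of scaling cut token usage while keeping the accuracy gains of a large ensemble.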

[AI-124] OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research

链接: https://arxiv.org/abs/2605.13618
作者: Peng Kang,Bixuan Li,Xiaoya Huang,Shuo Shi,Weiqiao Zhou,Zhen Li,Yu Liu,Lei Zheng
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 20 pages, 5 figures

点击查看摘要

Abstract:The Materials Genome Initiative catalyzed the proliferation of centralized platforms (SaaS, PaaS, and IaaS) that aggregate computational and experimental resources for accelerated materials discovery. In parallel, breakthroughs in large language models (LLMs) and autonomous agents have created powerful new reasoning capabilities for scientific research. Yet a critical "last mile" problem remains: while we possess world-class models and vast repositories of materials data, we lack the organizational infrastructure to compose these capabilities securely across institutional boundaries. The development of structural and functional materials for harsh service environments (high-temperature alloys, radiation-resistant steels, corrosion-resistant coatings) remains characterized by long-term iteration, mechanistic complexity, and high domain-expertise demands that exceed the capabilities of both monolithic agent systems and traditional centralized platforms. To address this gap we propose OpenAaaS, an open-source, hierarchical, and distributed Agent-as-a-Service framework that enables organized multi-agent collaboration for intelligent materials design. OpenAaaS is built on a single foundational principle: code flows, data stays still. A Master Agent plans and decomposes complex research tasks without requiring direct access to subordinate agents' managed data and computational resources. Sub-agents, deployed as near-data execution nodes, retain full sovereignty over local datasets, proprietary algorithms, and specialized hardware. This architecture guarantees that raw data never leaves its domain of origin while enabling cross-scale, cross-domain secure integration of previously isolated materials intelligence silos.
We validate the framework through two representative case studies: (i) AlphaAgent, an evidence-grounded materials literature analysis executor that achieves 4.66/5.0 on deep analytical questions against single-pass RAG baselines; and (ii) an ultra-large-scale hexa-high-entropy alloy descriptor database service that demonstrates secure near-data execution and domain-specific scientific workflows under strict data-sovereignty constraints. OpenAaaS establishes a principled pathway toward “organized research” via agent collectives, offering a scalable foundation for next-generation materials intelligent design platforms. All source code is available at this https URL.

[AI-125] Generating synthetic computed tomography for radiotherapy: SynthRAD2025 challenge report

链接: https://arxiv.org/abs/2605.13555
作者: Viktor Rogowski,Maarten L. Terpstra,Niklas Wahl,Florian Kamp,Erik van der Bijl,Arthur Jr. Galapon,Christopher Kurz,Bowen Xin,Zhengxiang Sun,Hollie Min,Gregg Belous,Jason Dowling,Yan Xia,Siyuan Mei,Fuxin Fan,Arthur Longuefosse,Javier Sequeiro Gonzalez,Miguel Diaz Benito,Alvaro Garcia Martin,Fabien Baldacci,Valentin Boussot,Cédric Hémon,Jean-Claude Nunes,Jean-Louis Dillenseger,Zhiyuan Zhang,Jinghua Cai,Han Bing,Tan Zuopeng,Ricardo Brioso,Daniele Loiacono,Guillaume Landry,Adrian Thummerer,Matteo Maspero
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注: 59 pages total: 26 pages main article + supplementary material; 8 figures in the main manuscript and 3 supplementary figures. Currently under review at the journal Medical Image Analysis (MIA)

点击查看摘要

Abstract:Radiation therapy (RT) requires precise dose delivery over multiple fractions, with CT fundamental for treatment planning due to its electron density information. Repeated CT acquisitions impose radiation exposure and logistical burdens, MRI lacks electron density, and cone-beam CT (CBCT) requires correction for dose calculation. Synthetic CT (sCT) generation addresses these by converting MRI or CBCT into CT-equivalent images with accurate Hounsfield Unit (HU) values, enabling MRI-only RT and CBCT-based adaptive workflows. Building on SynthRAD2023, SynthRAD2025 benchmarked sCT methods on 2,362 patients from five European centers across head and neck, thorax, and abdomen. Two tasks, MRI-to-CT (890 cases) and CBCT-to-CT (1,472 cases), were evaluated via image similarity (MAE, PSNR, MS-SSIM), segmentation (Dice, HD95), and dosimetric metrics from photon and proton plans. With 803 participants and 12/13 valid submissions, Task 1 top performance reached MAE 64.8±21.3 HU, PSNR ≈30 dB, MS-SSIM ≈0.936, Dice 0.79, a photon γ(2%/2mm) pass rate of 98%, and a proton γ pass rate of ≈85%. Task 2 improved: MAE 48.3±13.4 HU, PSNR 32.6 dB, MS-SSIM 0.968, Dice 0.86, photon γ pass rate 99%, proton γ ≈89%. Strong image-segmentation correlations (ρ = 0.78-0.79) but moderate dose correlations confirmed image quality is insufficient as a dosimetric surrogate. Head-and-neck cases were most consistent; thoracic and abdominal cases showed greater variability. Residual errors at tissue interfaces propagate along beam paths, affecting proton dose more than photon. SynthRAD2025 demonstrates that deep learning yields clinically relevant sCTs, especially for CBCT-to-CT, while identifying persistent MRI-to-CT challenges and underscoring dose-based evaluation as essential for clinical validation.
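The challenge's image-similarity metrics (MAE in HU, PSNR) are easy to compute; a minimal sketch follows. The `data_range` used for PSNR is an assumed HU window, not necessarily the challenge's exact normalization.

```python
import numpy as np

def mae_hu(sct, ct):
    """Mean absolute error in Hounsfield Units between synthetic and real CT."""
    return float(np.mean(np.abs(sct.astype(float) - ct.astype(float))))

def psnr(sct, ct, data_range=2000.0):
    """Peak signal-to-noise ratio in dB; data_range is an assumed HU span."""
    mse = float(np.mean((sct.astype(float) - ct.astype(float)) ** 2))
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range**2 / mse)

# Toy volumes: a constant 50 HU offset between sCT and ground-truth CT.
ct = np.zeros((4, 4))
sct = ct + 50.0
```

Dose metrics like the γ pass rate require recomputing treatment plans on the sCT, which is why the abstract stresses that image similarity alone is an insufficient surrogate.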

[AI-126] Towards a holistic understanding of Selection Bias for Causal Effect Identification

链接: https://arxiv.org/abs/2605.13430
作者: Yiwen Qiu,Filip Kovacevic,Shimeng Huang,Peter Spirtes,Francesco Locatello
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Selection bias is pervasive in observational studies. For example, large-scale biobank data can exhibit "healthy volunteer bias" when respondents are healthier and of higher socio-economic status than the population they are meant to represent. Recovering causal effects from such a sub-population is an important problem in causal inference, as estimating average treatment effects (ATE) from selected populations can result in a severely biased estimate of the ATE for the whole population. In this paper, we investigate the identifiability of the ATE under selection bias. We provide necessary and sufficient conditions for ATE identifiability, leveraging weak assumptions on probability classes to characterize the propensity score and selection probability. Compared to previous works, our results extend existing graphical identifiability criteria and offer a more comprehensive understanding of causal effect identification under strictly weaker conditions in the presence of selection bias.

[AI-127] Compact Latent Manifold Translation: A Parameter-Efficient Foundation Model for Cross-Modal and Cross-Frequency Physiological Signal Synthesis

链接: https://arxiv.org/abs/2605.13248
作者: Bo Cui,Xiaowen Song,Yaowen Zhang,Shunzhe Zhang,B.J.F. van Beijnum,Monique Tabak,Ying Wang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The analysis of physiological time series, such as electrocardiograms (ECG) and photoplethysmograms (PPG), is persistently hindered by modality and frequency gaps stemming from heterogeneous recording devices. Existing foundation models typically rely on continuous latent spaces, which frequently suffer from severe modality entanglement, lack high-fidelity cross-frequency generative capacity, and impose high computational costs that prohibit edge-device deployment. In this paper, we propose Compact Latent Manifold Translation (CLMT), a highly parameter-efficient (0.09B) unified framework that bridges these gaps through a novel two-stage discrete translation paradigm. First, we introduce a Universal Tokenizer utilizing Hierarchical Residual Vector Quantization (RVQ) to decouple heterogeneous signals into isolated, well-structured discrete latent manifolds, effectively preventing inter-modality interference. Second, a Context-Prompted Latent Translator maps these discrete tokens across modalities by integrating static physiological priors, reframing complex signal synthesis as a pure latent sequence translation task. Extensive evaluations demonstrate that our 0.09B model significantly outperforms massive baselines. In cross-modal PPG-to-ECG synthesis, it resolves temporal phase drift and dramatically improves the clinical R-peak detection F1-score from 0.37 (baseline) to 0.83. Furthermore, in extreme cross-frequency super-resolution (25Hz to 100Hz), it successfully recovers high-frequency diagnostic landmarks, achieving an unprecedented Pearson correlation of 0.9956. By learning a universal discrete language for biological signals with a fraction of the computational footprint, our approach sets a new trajectory for edge-deployable, multi-modal medical foundation models.
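The hierarchical RVQ tokenizer at the core of CLMT builds on standard residual vector quantization, which can be sketched compactly. This is an illustrative textbook RVQ, not CLMT's actual tokenizer; the codebook sizes and number of stages are arbitrary.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Minimal residual vector quantization: each stage quantizes the
    residual left over by the previous stage, yielding a coarse-to-fine
    sequence of discrete codes for one feature vector."""
    residual = np.asarray(x, dtype=float)
    codes, quantized = [], np.zeros_like(residual)
    for cb in codebooks:                         # one codebook per RVQ stage
        d = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(d))                  # nearest code to the residual
        codes.append(idx)
        quantized += cb[idx]                     # accumulate the reconstruction
        residual = residual - cb[idx]            # pass the error to next stage
    return codes, quantized

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]  # 3 stages, 16 codes each
x = rng.normal(size=4)
codes, xq = rvq_encode(x, codebooks)
```

Once signals are mapped to such code sequences, cross-modal synthesis can be reframed as translating one discrete token sequence into another, as the abstract describes.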

[AI-128] Neural QAOA^2: Differentiable Joint Graph Partitioning and Parameter Initialization for Quantum Combinatorial Optimization ICML2026

链接: https://arxiv.org/abs/2605.13072
作者: Zubin Zheng,Jiahao Wu,Shengcai Liu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:The quantum approximate optimization algorithm (QAOA) holds promise for combinatorial optimization but is constrained by limited qubits. While divide-and-conquer frameworks like QAOA^2 address scalability by partitioning graphs into subgraphs, existing methods suffer from two fundamental limitations: i) misalignment between heuristic partitioning metrics and quantum optimization goals, and ii) topology-blind parameter initialization that leads to optimization cold starts. To bridge these gaps, we propose Neural QAOA^2, an end-to-end differentiable framework that jointly generates graph partitions and initial parameters. By integrating a generative evaluative network (GEN), our method utilizes a differentiable quantum evaluator as a high-fidelity performance surrogate to provide direct gradient guidance, enabling the joint generator to learn the intrinsic mapping from graph topology to high-quality partition and parameter configurations. Extensive experiments on 183 QUBO, Ising, and MaxCut instances (21 to 1000 variables) demonstrate that our gradient-driven approach broadly outperforms heuristic baselines, ranking first on 101 instances. It exhibits zero-shot generalization across out-of-distribution graph topologies and scales.

[AI-129] When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems

链接: https://arxiv.org/abs/2605.12947
作者: Young Hyun Cho,Will Wei Sun
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This raises a statistical challenge because deployment-time evaluator scores are adaptively generated and repeatedly monitored, yet the likelihood models or exchangeability assumptions typically used for calibration are unavailable. We propose an always-valid release wrapper for existing generator-evaluator pipelines. The wrapper builds a hard-negative reference pool of high-scoring failures, calibrates deployment-time evaluator scores against this pool, and accumulates the resulting evidence with an e-process. This separates two roles: the reference pool turns black-box scores into conservative evidence, while the e-process provides validity under optional stopping. In theory, we show that a conservative reference pool yields finite-sample control of the probability of releasing on infeasible tasks, that is, tasks for which the given workflow is not capable of producing a reliable solution. We also characterize conditions under which the same conservative rule still achieves nontrivial release on feasible tasks. In an MBPP+ coding-agent case study, the wrapper reduces premature incorrect release relative to baseline stopping rules while still releasing on tasks for which the workflow repeatedly accumulates moderate supporting evidence.
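The e-process release rule can be sketched as a product of per-round e-values with a Ville's-inequality stopping threshold. The e-values themselves would come from calibrating evaluator scores against the hard-negative pool, which is not shown; the numbers below are illustrative.

```python
def release_decision(e_values, alpha=0.05):
    """Optional-stopping sketch: multiply nonnegative e-values and release
    once the running product ("wealth") exceeds 1/alpha. By Ville's
    inequality, the probability of ever releasing when every e-value has
    expectation <= 1 (the infeasible-task case) is at most alpha."""
    wealth = 1.0
    for t, e in enumerate(e_values, start=1):
        wealth *= e                 # accumulate evidence across iterations
        if wealth >= 1.0 / alpha:
            return t                # release at round t
    return None                     # not enough evidence: keep iterating

# Moderate supporting evidence (e > 1) compounds into a release decision,
# while neutral or contrary evidence never triggers one.
when = release_decision([2.0, 2.0, 3.0, 2.0], alpha=0.05)
```

Because the threshold is valid at every round simultaneously, the workflow may check it after each generate-evaluate-revise iteration without any correction for repeated looks.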

[AI-130] Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

链接: https://arxiv.org/abs/2605.12756
作者: Zhehang Du,Hangfeng He,Weijie Su
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are pretrained by minimizing the cross-entropy loss for next-token prediction. In this paper, we study whether this optimization strategy can induce geometric structure in the learned model weights and context embeddings. We approach this problem by analyzing a constrained layer-peeled optimization program, which serves as a mathematically tractable surrogate for LLMs by treating the output projection matrix and last-layer context embeddings as optimization variables. Our analysis of this nonconvex optimization program demonstrates that symmetries in the target next-token distributions are transferred to the global minimizers of the layer-peeled model in a precise group-theoretic sense. Specifically, we prove that when the target tokens exhibit a cyclic-shift symmetry (such as the seven days of the week or the twelve months of the year), the optimal logit matrix is exactly circulant, and the Gram matrices of both the output projections and the context embeddings form circulant geometries as well. Next, for exchangeable target distributions invariant under the symmetric group and, more generally, under two-transitive group actions, we show that the global optimal output projection matrix forms a simplex equiangular tight frame, while the optimal logit matrix and context embeddings inherit the permutation symmetries present in the input data. A key technical step is to reduce the constrained nonconvex factorized problem to an explicit logit-level convex characterization for cyclic symmetry and to a symmetry-based lower bound for permutation symmetry, together with a sharp characterization of the optimal factorization. Finally, we empirically demonstrate that open-source LLMs naturally exhibit symmetries consistent with our theoretical predictions, despite being trained without any explicit regularization promoting such geometric structure.
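The cyclic-symmetry result (an optimal logit matrix that is exactly circulant) can be illustrated numerically: a circulant matrix is invariant under conjugation by the cyclic-shift permutation, which is exactly the equivariance the theorem describes. The construction below is a generic sketch, not taken from the paper.

```python
import numpy as np

def circulant(first_row):
    """Circulant matrix: row k is the first row cyclically shifted by k."""
    n = len(first_row)
    return np.array([np.roll(first_row, k) for k in range(n)])

n = 7  # e.g. the seven days of the week
L = circulant(np.arange(n, dtype=float))
P = np.roll(np.eye(n), 1, axis=0)  # cyclic-shift permutation matrix
# A circulant logit matrix satisfies P L P^T == L (shift equivariance).
shift_invariant = np.allclose(P @ L @ P.T, L)
```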

[AI-131] Controllable Quantum Memory Capacity in Quantum Reservoir Networks with Tunable partial-SWAPs

链接: https://arxiv.org/abs/2605.12713
作者: Erik L. Connerty,Ethan N. Evans
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 figures

点击查看摘要

Abstract:In the field of quantum reservoir computing (QRC), many different computational models and architectures have been proposed. From these models, we identify feedback-based models – which use a feedback mechanism to re-embed classical measurements from the QRC – and recurrent models – which use a multi-register approach with memory and readout qubits – as the two major competing architectures that have been discussed and validated on hardware. In this paper, we advance upon the recurrent architectures, which employ a two-register approach to endow the QRC with a fading memory. While these approaches have been validated on hardware and have demonstrated great real-world performance on noisy intermediate-scale quantum (NISQ) quantum processing units (QPUs), the exact mechanism through which the memory capacity arises is not completely understood or fully controllable. With this, we augment the recurrent approaches and present a hardware-realizable mechanism, which we call a tunable partial-SWAP, that allows for the direct control of the rate of memory dissipation from a QRN implemented on a gate-based QPU. The theory behind this mechanism is discussed in terms of a controlled amplitude-damping channel and validation experiments using a randomized short-term memory capacity (STMC) recall benchmark and the NARMA-5 dataset are conducted using simulation and IBM QPUs, respectively.
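A partial-SWAP has a simple closed form from the spectral decomposition of SWAP (eigenvalues +1 and -1): U(alpha) = P_plus + exp(i*pi*alpha) * P_minus, interpolating between identity at alpha = 0 and the full SWAP at alpha = 1. The sketch below only illustrates this interpolation and unitarity; it is not the paper's hardware construction.

```python
import numpy as np

# Two-qubit SWAP in the computational basis |00>, |01>, |10>, |11>
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)
I4 = np.eye(4, dtype=complex)

def partial_swap(alpha):
    """SWAP^alpha via the spectral form: identity on the +1 eigenspace,
    phase exp(i*pi*alpha) on the -1 eigenspace."""
    sym = (I4 + SWAP) / 2    # projector onto the +1 eigenspace
    anti = (I4 - SWAP) / 2   # projector onto the -1 eigenspace
    return sym + np.exp(1j * np.pi * alpha) * anti
```

Tuning alpha continuously tunes how strongly the memory register mixes with the readout register, which is the "controllable dissipation rate" intuition.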

[AI-132] The critical slowing down in diffusion models

链接: https://arxiv.org/abs/2605.12597
作者: Luca Maria Del Bono,Giulio Biroli,Patrick Charbonneau,Marylou Gabrié
机构: 未知
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:Computational sampling has been central to the sciences since the mid-20th century. While machine-learning-based approaches have recently enabled major advances, their behavior remains poorly understood, with limited theoretical control over when and why they succeed. Here we provide such insight for diffusion models, a class of generative schemes highly effective in practice, by analyzing their application to the O(n) model of statistical field theory in the Gaussian limit n \to \infty . In this analytically tractable setting, we show that training a score model with a one-layer network architecture matching the exact solution exhibits a form of critical slowing down in parameter learning. This slowing down also impacts the generation process, indicating that the well-known difficulties of sampling near criticality persist even for learned generative models. To overcome this bottleneck, we demonstrate the power of combining architectural depth with physical locality. We find that using a two-layer architecture drastically reduces the critical slowing down, with the training time scaling logarithmically rather than quadratically with system size. By introducing a local score approximation we show that this acceleration in training time can be achieved without increasing the number of neural network parameters. Taken together, these results demonstrate that diffusion models can overcome the critical slowing down through appropriate architectural design, and establish a controlled framework for understanding and improving learned sampling methods in statistical physics and beyond.

[AI-133] Active Sensing with Meta-Reinforcement Learning for Emitter Localization from RF Observations

链接: https://arxiv.org/abs/2605.12569
作者: M. Shamail J. Khan,Nisha L. Raichur,Lucas Heublein,Christian Wielenberg,Alexander Mattick,Tobias Feigl,Christopher Mutschler,Felix Ott
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Global navigation satellite system (GNSS) interference poses a serious threat to reliable positioning, especially in indoor and multipath-rich environments where source localization is highly challenging. In this paper, we formulate GNSS interference localization as an active sensing problem and propose a reinforcement learning (RL) framework in which an agent sequentially explores the environment to infer the position of an emitter source from radio frequency (RF) observations acquired with a 2x2 patch antenna. The localization task is modeled as a partially observable decision process, since single-snapshot measurements are often ambiguous under multipath propagation and changing channel conditions. To address this, the proposed framework combines high-dimensional RF sensing with deep RL and recurrent policy learning. We investigate both value-based and policy-based approaches, namely Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO), and study their behavior under domain shift. The approach is evaluated on a simulated dataset generated with the Sionna ray-tracing module, which provides realistic propagation effects and diverse environment configurations. Experimental results show that the proposed method achieves a localization success rate of 80.1%, demonstrating the potential of RL for adaptive GNSS interference localization. Overall, the results highlight simulation-assisted training as a promising direction for robust interference localization in challenging propagation environments.

[AI-134] ChannelKAN: Multi-Scale Dual-Domain Channel Prediction via Hybrid CNN-KAN Architecture

链接: https://arxiv.org/abs/2605.12553
作者: Nanqing Jiang,Zhangyao Song,Tao Guo,Xiaoyu Zhao,Yinfei Xu
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate channel state information (CSI) prediction is essential for improving the reliability and spectral efficiency of massive MIMO-OFDM systems in high-mobility scenarios. Existing deep learning methods struggle to jointly capture short-term local variations and long-range nonlinear dependencies in CSI sequences. To address this challenge, we propose ChannelKAN, a hybrid CNN-KAN channel prediction model with multi-scale frequency domain information enhancement. The key insight is that CNNs and Kolmogorov-Arnold Networks (KANs) are naturally complementary: CNNs extract intra-time-step local spatial-frequency correlations, while KANs with learnable Chebyshev polynomial activations fit inter-time-step nonlinear temporal evolution in a holistic manner. Specifically, a dual-domain expansion module first generates complementary frequency-domain and delay-domain CSI representations. A multi-scale frequency information enhancement module then retains dominant spectral components at multiple scales to strengthen key features and suppress noise. Next, a CNN-KAN feature extraction module captures local correlations via cascaded convolutions and models long-range dependencies via Chebyshev KAN layers. Finally, a dual-domain fusion module adaptively integrates features from both branches to produce the prediction. Experiments on 3GPP-compliant QuaDRiGa datasets demonstrate that ChannelKAN outperforms RNN, LSTM, GRU, CNN, and Transformer baselines in normalized mean square error (NMSE), spectral efficiency (SE), and bit error rate (BER) across various velocities and signal-to-noise ratios. Ablation studies further confirm the effectiveness of each proposed module.
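The Chebyshev-KAN idea of learnable polynomial activations can be sketched with the three-term recurrence T_k(z) = 2z*T_{k-1}(z) - T_{k-2}(z); the tanh squashing used below to map inputs into [-1, 1] is a common choice but an assumption here, not necessarily the paper's.

```python
import numpy as np

def chebyshev_activation(x, coeffs):
    """Learnable edge activation in the Chebyshev-KAN style: squash the
    input to [-1, 1], then take a learned linear combination of Chebyshev
    polynomials T_k evaluated via the three-term recurrence."""
    z = np.tanh(x)  # map to the Chebyshev domain [-1, 1]
    T_prev, T_curr = np.ones_like(z), z  # T_0 and T_1
    out = coeffs[0] * T_prev + coeffs[1] * T_curr
    for c in coeffs[2:]:
        T_prev, T_curr = T_curr, 2 * z * T_curr - T_prev
        out = out + c * T_curr
    return out
```

In a full KAN layer, `coeffs` would be trainable per edge; here it is a fixed list for illustration.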

[AI-135] Why the Unfinished Keeps Returning: Canxianization and the Dynamics of Conscious Priority

链接: https://arxiv.org/abs/2605.12543
作者: Hengjin Cai,Tianqi Cai
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Some conscious contents disappear after access; others return repeatedly, long after their triggering conditions have ceased. We propose Canxianization as the process by which a perturbation becomes closure-resistant self-relevant unfinishedness and thereby acquires recurrent conscious priority. The theory distinguishes this phenomenon from emotional arousal, memory strength, the Zeigarnik effect, curiosity, prediction error, and intrusive thought. A perturbation becomes canxianized when it is attributed to the self-world boundary, value-marked, blocked from causal or action closure, and metacognitively coupled to the self-model. We distinguish latent canxian strength from observed conscious recurrence, and introduce a Recurrent Priority Index and a Canxian Update Index to separate productive from pathological recurrence. Cold Canxianization, recurrence driven by structural incompleteness rather than affective arousal, is identified as a critical discriminant. Reset Resistance and Stake Transfer tests are proposed for artificial systems. Canxianization is not memory persistence; it is failed self-world repair. The unfinished does not merely remain. When it concerns the self and resists closure, it returns.

[AI-136] PG-LRF: Physiology-Guided Latent Rectified Flow for Electro-Hemodynamic PPG-to-ECG Generation

链接: https://arxiv.org/abs/2605.12541
作者: Xiaoda Wang,Minxiao Wang,Kaiqiao Han,Defu Cao,Ching Chang,Yidan Shi,Runze Yan,Xiao Luo,Yan Liu,Xiao Hu,Yizhou Sun,Wei Wang,Carl Yang
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Electrocardiography (ECG) is the clinical standard for cardiac assessment but requires dedicated hardware that does not scale to daily-life monitoring. Photoplethysmography (PPG) is ubiquitous in wearables but lacks ECG-specific diagnostic morphology and is corrupted by motion and sensor noise. PPG-to-ECG generation aims to bridge this gap by recovering electrical morphology and timing from peripheral pulse signals. However, existing methods largely rely on statistical alignment and data-driven generation. They fail to explicitly structure the latent space around physiology-aware electro-hemodynamic factors and lack constraints from forward physiological dynamics. To address these challenges, we propose PG-LRF, a physiology-guided latent rectified flow framework. PG-LRF introduces an electro-hemodynamic simulator that co-models ECG and PPG through shared cardiac phase dynamics. Guided by this simulator, a Physiology-Aware AutoEncoder learns a structured electro-hemodynamic latent space. Then we integrate this simulator guidance into a PPG-conditioned latent rectified flow, enforcing ECG-side morphology consistency and ECG-to-PPG forward hemodynamic consistency during generative transport. Experiments on the large-scale MC-MED dataset demonstrate that PG-LRF significantly improves PPG-to-ECG generation and downstream cardiovascular disease classification, proving its ability to generate ECGs that are both signal-faithful and physiologically plausible under the ECG-to-PPG hemodynamic pathway.

[AI-137] Information as Maximum-Caliber Deviation: A bridge between Integrated Information Theory and the Free Energy Principle

链接: https://arxiv.org/abs/2605.12536
作者: Alexander Kearney
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 84 pages, 10 figures, 2 tables Extended version of a Master’s thesis, Mathematical Institute, University of Oxford

点击查看摘要

Abstract:The Free Energy Principle (FEP) is a leading framework for mathematically modeling self-organization and learning, while Integrated Information Theory (IIT) is a computational ontology of consciousness oriented around irreducible cause and effect. While conceptual unifications have been proposed and appear to be supported by empirical findings, the absence of a rigorous mathematical mapping places upper bounds on their precision and testability. This work proposes that information can be defined as the deviation \psi of realized dynamics from a constrained maximum-caliber (MaxCal) path ensemble over a finite time horizon. Under this definition, each of the cause/effect repertoires central to IIT 3.0 emerge directly from MaxCal variational principles, allowing IIT’s phenomenological calculus to be re-derived from constrained entropy-maximization (CMEP). This framework supplies a theoretical bridge to active inference, which is mathematically dual to CMEP under Langevin dynamics, and offers a principled route for extending IIT to new dynamical regimes. When the approach is applied under the Central Limit Theorem (CLT) for Markov chains and via large deviations theory (LDT) to Ising models, information \psi is shown to be equivalent to prediction error under accompanying predictive coding models. This may hold relevance to the "hill-shaped trajectory" of \Phi observed in neuronal cultures adapting to sensory inputs. Together, these results provide a physically and mathematically grounded rationale for studying the convergence of FEP, IIT, and thermodynamic frameworks of cognition such as recent work grounding consciousness in violations of the Fluctuation-Dissipation Theorem (FDT).

[AI-138] AgenticAITA: A Proof-Of-Concept About Deliberative Multi-Agent Reasoning for Autonomous Trading Systems

链接: https://arxiv.org/abs/2605.12532
作者: Ivan Letteri
机构: 未知
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Conventional algorithmic trading systems are grounded in deterministic heuristics or offline-trained statistical models that cannot adapt to the semantic complexity of rapidly shifting market regimes. This paper introduces AGENTICAITA, an agentic AI framework that replaces the traditional signal-then-execute paradigm with a fully autonomous deliberative loop in which multiple specialized Large Language Model agents reason, negotiate, and act in concert - without any offline training or human intervention. The framework proposes four architectural contributions: (i) an Adaptive Z-Score Trigger Engine that acts as a cognitive resource allocator, gating LLM inference exclusively on statistically anomalous market conditions; (ii) a Sequential Deliberative Pipeline - the core agentic contribution - in which an Analyst agent, a Risk Manager agent, and an Executor agent form a structured reasoning chain governed by typed JSON contracts and a deterministic hard-gate safety layer; (iii) an Inference Gating Protocol, a mutex-based cognitive resource scheduler that serializes concurrent agent activations and ensures fully reproducible audit trails; and (iv) a Correlation-Break Diversification composite score that operationalizes portfolio-level idiosyncratic signal prioritization within individual agent reasoning. Validated over a five-day autonomous dry-run session under live market conditions, the framework demonstrates operational correctness of the deliberative pipeline, achieving 157 zero-intervention invocations across 76 assets with an 11.5% agentic friction rate that confirms non-trivial inter-agent negotiation. This preliminary proof-of-concept establishes the feasibility of training-free, deterministic safety-constrained multi-agent orchestration in financial decision loops, with statistically robust performance evaluation and execution cost modeling deferred to extended live deployment.

[AI-139] TokaMind for Power Grid: Cross-Domain Transfer from Fusion Plasma

链接: https://arxiv.org/abs/2605.11033
作者: JC Wu,Norton Lee,Kai Siang Chen
机构: 未知
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:TokaMind is a multi-modal transformer (MMT) foundation model pre-trained on tokamak plasma diagnostics data from MAST, where it was shown to outperform CNN-based approaches on fusion benchmarks. We investigate whether its learned representations generalize to physically distinct but structurally analogous domains. Through systematic experimentation across four domains (industrial bearing degradation, NASA CMAPSS turbofan degradation, and two independent power grid PMU datasets), we identify four transfer-favoring characteristics that help explain where TokaMind’s pretrained representations are most effective. Power grid synchrophasor data matches this target-domain profile most directly, while industrial degradation datasets demonstrate that TokaMind can still yield useful performance under partial alignment, especially when task design and feature construction expose physically meaningful degradation structure. On the GESL/PNNL 500-event benchmark with provider-aware evaluation, TokaMind achieves test F1 = 0.837 \pm 0.040 (3 seeds) for severe event classification. Our central finding, however, is not the aggregate score: classification difficulty is structurally determined by provider-level grid topology, not model capacity. In the single-window early-warning regime, TokaMind outperforms a CNN baseline (F1 0.889 vs. 0.878), a reversal that disappears as more event windows are provided. Furthermore, Critical Slowing Down (CSD) indicators, used as a confidence gate rather than a classification label, improve F1 from 0.696 to 0.750 at 63% coverage, outperforming the CNN baseline (0.636) at any coverage level. These results establish the first cross-domain validation of TokaMind outside nuclear fusion and propose a transferability framework and revised evaluation protocol for multi-source PMU datasets.

机器学习

[LG-0] Reducing cross-sample prediction churn in scientific machine learning

链接: https://arxiv.org/abs/2605.13826
作者: Gordan Prastalo,Kevin Maik Jablonka
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Scientific machine learning reports predictive performance. It does not report whether the same prediction would survive a different draw of training data. Across 9 chemistry benchmarks, two classifiers trained on independent bootstraps of the same training set agree on aggregate accuracy to within 1.3–4.2 percentage points but disagree on the class label of 8.0–21.8% of test molecules. We call this gap cross-sample prediction churn. The standard parameter-side techniques (deep ensembles, MC dropout, stochastic weight averaging) do not reduce this gap; two data-side methods do. The first is K-bootstrap bagging, which cuts the rate 40–54% on every dataset at no accuracy cost (K\times-ERM compute). The second is twin-bootstrap, our proposal: two networks trained jointly on independent bootstraps with a sym-KL consistency loss between their predictions, which at matched 2\times-ERM compute reduces churn a further median 45% beyond bagging with K=2. Cross-sample prediction churn deserves a column alongside predictive performance in scientific-ML benchmark reports, because without it the parameter-side and data-side methods are indistinguishable on the metric they actually differ on.
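Cross-sample prediction churn is straightforward to measure; the toy numbers below (not from the paper) show how two models can match on aggregate accuracy yet disagree on many individual predictions.

```python
import numpy as np

def churn_rate(preds_a, preds_b):
    """Cross-sample prediction churn: fraction of test points on which
    two models trained on independent resamples disagree."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Two models with identical accuracy (4/6) but high churn (4/6):
y_true = np.array([0, 1, 0, 1, 0, 1])
m1     = np.array([0, 1, 0, 1, 1, 0])
m2     = np.array([1, 0, 0, 1, 0, 1])
```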

[LG-1] Uncertainty-Driven Anomaly Detection for Psychotic Relapse Using Smartwatches: Forecasting and Multi-Task Learning Fusion

链接: https://arxiv.org/abs/2605.13816
作者: Nikolaos Tsalkitzis,Panagiotis P.Filntisis,Petros Maragos,Niki Efthymiou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Digital phenotyping enables continuous passive monitoring of behavior and physiology, offering a promising paradigm for early detection of psychotic relapse. In this work, we develop and systematically study two smartwatch-based frameworks for daily relapse detection. The first forecasts cardiac dynamics and flags deviations between predicted and observed features as indicators of abnormality. The second adopts a multi-task formulation that fuses sleep with motion and cardiac-derived signals, learning time-aware embeddings and predicting measurement timing. Both pipelines use Transformer encoders and output a daily anomaly score, derived from predictive uncertainty estimated via an ensemble of multilayer perceptrons to improve robustness to real-world wearable variability. While each framework independently demonstrates strong predictive power, we show that they capture complementary physiological signatures. Consequently, we propose a late-fusion strategy that synergistically combines the anomaly signals from both architectures into a unified decision score. We benchmark our methodology on the 2nd e-Prevention Grand Challenge dataset, where our fused model achieves an 8% relative improvement over the competition-winning baseline. Our results, supported by extensive ablation studies, suggest that the integration of diverse digital phenotypes, cardiac, motion, and sleep, is essential for the high-fidelity detection of psychotic relapse in real-world settings.
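The late-fusion step can be sketched as z-normalizing each pipeline's daily anomaly scores and averaging them into a unified decision score; the exact fusion rule and the equal weighting here are assumptions for illustration, and the score arrays are made up.

```python
import numpy as np

def late_fusion(scores_a, scores_b):
    """Sketch of late fusion: z-normalize each pipeline's daily anomaly
    scores, then average into a single decision score per day."""
    def z(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean()) / (s.std() + 1e-12)
    return (z(scores_a) + z(scores_b)) / 2

# Hypothetical daily anomaly scores from the two pipelines:
scores_forecast  = [0.1, 0.4, 0.2, 0.9]  # forecasting-based pipeline
scores_multitask = [0.2, 0.3, 0.1, 0.8]  # multi-task pipeline
fused = late_fusion(scores_forecast, scores_multitask)
```

Z-normalizing before averaging keeps one pipeline's score scale from dominating the other's.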

[LG-2] Provable Quantization with Randomized Hadamard Transform

链接: https://arxiv.org/abs/2605.13810
作者: Ying Feng,Piotr Indyk,Michael Kapralov,Dmitry Krachun,Boris Prokhorov
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Vector quantization via random projection followed by scalar quantization is a fundamental primitive in machine learning, with applications ranging from similarity search to federated learning and KV cache compression. While dense random rotations yield clean theoretical guarantees, they require \Theta(d^2) time. The randomized Hadamard transform HD reduces this cost to O(d \log d) , but its discrete structure complicates analysis and leads to weaker or purely empirical compression guarantees. In this work, we study a variant of this approach: dithered quantization with a single randomized Hadamard transform. Specifically, the quantizer applies HD to the input vector and subtracts a random scalar offset before quantizing, injecting additional randomness at negligible cost. We prove that this approach is unbiased and provides mean squared error bounds that asymptotically match those achievable with truly random rotation matrices. In particular, we prove that a dithered version of TurboQuant achieves mean squared error \bigl(\pi\sqrt3/2 + o(1)\bigr) \cdot 4^{-b} at b bits per coordinate, where the o(1) term vanishes uniformly over all unit vectors and all dimensions as the number of quantization levels grows.
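The HD-plus-dither pipeline can be sketched end to end: random sign flips, a fast Walsh-Hadamard transform in O(d log d), and a shared uniform dither subtracted before rounding and added back after, with the rotation then inverted. This is an illustrative sketch; the grid scaling and bit allocation below are assumptions, not the paper's exact quantizer.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = np.asarray(x, dtype=float).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))  # orthonormal scaling: fwht is its own inverse

def dithered_quantize(v, bits, rng):
    """Rotate with HD (random signs then Hadamard), subtract a shared random
    dither, round to a uniform grid; dequantize by adding the dither back and
    inverting the rotation."""
    d = len(v)
    signs = rng.choice([-1.0, 1.0], size=d)   # the D in HD
    rotated = fwht(signs * v)                 # the H in HD
    scale = np.max(np.abs(rotated)) / (2 ** (bits - 1)) + 1e-12
    u = rng.uniform(0.0, scale)               # shared scalar dither offset
    q = np.round((rotated - u) / scale)       # transmitted integers
    recon = q * scale + u
    return signs * fwht(recon)                # invert H, then invert D
```

Because the rotation is orthonormal, the per-coordinate rounding error (at most scale/2 after rotation) bounds the overall reconstruction error.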

[LG-3] Min-Max Optimization Requires Exponentially Many Queries

链接: https://arxiv.org/abs/2605.13806
作者: Martino Bernasconi,Matteo Castiglioni,Andrea Celli,Alexandros Hollender
类目: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study the query complexity of min-max optimization of a nonconvex-nonconcave function f over [0,1]^d \times [0,1]^d . We show that, given oracle access to f and to its gradient \nabla f , any algorithm that finds an \varepsilon -approximate stationary point must make a number of queries that is exponential in 1/\varepsilon or d .

[LG-4] Force-Aware Neural Tangent Kernels for Scalable and Robust Active Learning of MLIPs

链接: https://arxiv.org/abs/2605.13788
作者: Eszter Varga-Umbrich,Zachary Weller-Davies,Paul Duckworth,Jules Tilly,Olivier Peltre,Shikha Surana
类目: Machine Learning (cs.LG)
*备注: 10 main pages, total 34 pages

点击查看摘要

Abstract:Active learning for machine-learning interatomic potentials (MLIPs) must address several challenges to be practical: scaling to large candidate pools, leveraging energy-force supervision, and maintaining robustness when candidate pools are biased relative to the target distribution. In this work, we jointly address these challenges. We first introduce a linearly scaling acquisition framework based on chunked feature-space posterior-variance shortlisting. By avoiding materialisation of the candidate and train set kernels, this approach enables screening of ~200k structures within hours and applies broadly to acquisition strategies that score candidates based on molecular similarity metrics. We then extend the Neural Tangent Kernel (NTK) to a force-aware setting via mixed parameter-coordinate derivatives, yielding a force NTK and a joint energy-force NTK that provide natural similarity metrics for vector-field prediction. We demonstrate the effectiveness of the joint energy-force NTK on the OC20 dataset, where force-aware acquisition is crucial: it achieves the lowest energy and force MAE and RMSE across all metrics and distribution splits. Across T1x, PMechDB, and RGD benchmarks, our force NTK methods remain competitive with established baselines while being significantly more efficient than committee-based approaches. Under a controlled candidate-pool shift case study on T1x, acquisition based on pretrained MLIP embeddings and NTKs remains robust, whereas committee-based methods exhibit higher variance. Overall, these results show that a single pretrained MLIP can enable scalable, force-aware, and distribution-robust active learning for foundation-model fine-tuning.

[LG-5] Interpretable Machine Learning for Antepartum Prediction of Pregnancy-Associated Thrombotic Microangiopathy Using Routine Longitudinal Laboratory Data

链接: https://arxiv.org/abs/2605.13786
作者: Chuanchuan Sun,Zhen Yu,Qin Fan,Qingchao Chen,Feng Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Background: Pregnancy-associated thrombotic microangiopathy (P-TMA) is rare but life-threatening. Early risk prediction before overt clinical presentation remains challenging, as the associated laboratory abnormalities are subtle, multidimensional, and frequently masked by common physiological changes such as gestational thrombocytopenia and pregnancy-related proteinuria, thus overlapping heavily with benign obstetric and renal conditions. This complexity is poorly captured by univariate or rule-based approaches; however, it is addressable by machine learning, which can extract latent, time-dependent risk signatures from longitudinal clinical tests. Methods: This retrospective study included 300 pregnancies comprising 142 P-TMA cases and 158 controls. After exclusion of identifiers and non-informative variables, 146 longitudinal laboratory predictors were retained. Participants were divided into a training cohort (80%) and a held-out test cohort (20%) using stratified sampling. Five algorithms were evaluated: logistic regression, support vector machine with radial basis function kernel, random forest, extra trees, and gradient boosting. The final model was selected by mean cross-validated AUROC, refitted on the full training cohort, and evaluated once in the held-out test cohort. Interpretability analyses examined global feature importance and distributional patterns of leading predictors. Results: Gradient boosting was prespecified by cross-validation in the training cohort. The model achieved an AUROC of 0.872 (95% CI: 0.769-0.952) and an AUPRC of 0.883 (95% CI: 0.780-0.959) in a held-out test cohort, with sensitivity of 0.750 and specificity of 0.812. Conclusions: Longitudinal clinical laboratory tests obtained during routine care contained informative and clinically plausible signals for P-TMA risk. Notably, cystatin C at week 6 showed promise as an early monitoring indicator.

[LG-6] Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

链接: https://arxiv.org/abs/2605.13784
作者: Victor Norgren
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-driven computational model centred on stateful sessions: a persistent KV cache advanced incrementally as new data arrives, so prefill is moved off the critical path and query latency becomes O(|q|), independent of accumulated context size. Building on this, Flash Queries reclaim idle GPU cycles between data arrivals to pre-evaluate registered questions and return cached answers before the user asks, a pattern that is structurally impossible in stateless engines because they discard intermediate state between requests. A multi-tenant continuous-batching scheduler with cell-budget admission and prefix-aware grouped prefill lets dozens of stateful sessions coexist on a single GPU while preserving full quadratic self-attention. On streaming market-data benchmarks the reference implementation achieves up to 5.9x speedup over conventional inference engines (vLLM, SGLang, TensorRT-LLM, this http URL), holding query latency constant as accumulated context grows.
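The stateful-session pattern can be sketched with a toy model (all class and method names here are hypothetical; a real engine would advance a KV cache on the GPU): ingestion updates persistent state as chunks arrive, off the query path, so a query only processes its own tokens.

```python
class ToyModel:
    """Stand-in for an LLM: tokenizes by whitespace and 'answers' by
    counting query-token occurrences in the cached context."""
    def encode(self, text):
        return text.split()

    def answer(self, cache, q_tokens):
        context = [tok for chunk in cache for tok in chunk]
        return sum(context.count(t) for t in q_tokens)

class StreamingSession:
    """Persistent session: context accumulates incrementally, so query
    cost depends only on the query length, not on context size."""
    def __init__(self, model):
        self.model = model
        self.cache = []  # stands in for a persistent KV cache

    def ingest(self, chunk):
        # advance state as data arrives (prefill off the critical path)
        self.cache.append(self.model.encode(chunk))

    def query(self, q):
        # only the query tokens are processed at ask-time: O(|q|)
        return self.model.answer(self.cache, self.model.encode(q))
```

A stateless engine would instead re-encode the entire accumulated context on every `query` call, which is the O(n) prefill cost the abstract describes.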

[LG-7] oward AI-Driven Digital Twins for Metropolitan Floods: A Conditional Latent Dynamics Network Surrogate of the Shallow Water Equations

链接: https://arxiv.org/abs/2605.13761
作者: Phillip Si,Yuan Qiu,Omar Sallam,Jeremy Feinstein,Ziang He,Eugene Yan,Peng Chen
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:AI-driven flood digital twins demand fast hydrodynamic surrogates for ensemble forecasting and observation assimilation. Yet even GPU-accelerated two-dimensional shallow water equation (SWE) solvers still require \sim 55 minutes per 96-hour run on a \sim 4.2-million-active-cell metropolitan basin (the Des Plaines River basin at 30 m resolution), making such workloads prohibitive at native resolution. We present the Conditional Latent Dynamics Network (CLDNet): a low-dimensional latent neural ODE driven by rainfall, paired with a coordinate-based decoder conditioned on static terrain (elevation, slope, Manning roughness) that reconstructs depth and discharge at arbitrary query points. Pointwise decoding decouples memory from grid size and handles irregular watersheds natively, enabling metropolitan-scale training on a single compute node and direct queries at exact gauge coordinates without raster snapping. We evaluate CLDNet on a synthetic 250,000-cell Texas benchmark and on a new Des Plaines case study of 114 real-rainfall Stage IV storms whose reference simulator we validate against United States Geological Survey (USGS) gauges at the April 2013 flood-of-record (Nash–Sutcliffe efficiency 0.57–0.94 on mean-recentered water-surface elevation). CLDNet roughly halves the relative root-mean-squared error of an unconditional baseline, outperforms regular-grid VAE–ConvLSTM and FNO baselines on the Texas benchmark (both presuppose a Cartesian grid and do not apply to the irregular Des Plaines watershed), reaches a critical success index of \approx 86% at the 0.5 m inundation threshold, and produces a full 96-hour basin-wide forecast in \sim 29 seconds – a \sim 115\times speedup.

[LG-8] Fast and effective algorithms for fair clustering at scale

链接: https://arxiv.org/abs/2605.13759
作者: Claudio Mantuano,Manuel Kammermann,Philipp Baumann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clustering is an unsupervised machine learning task that consists of identifying groups of similar objects. It has numerous applications and is increasingly used in fairness-sensitive domains where objects represent individuals, such as customers, employees, or students. We address a fair clustering problem in which objects belong to protected groups. The problem consists of partitioning the objects into a predefined number of clusters while attaining a user-defined target level of fairness, meaning that each protected group is sufficiently represented in each cluster. The objective is to minimize the clustering cost, defined as the sum of squared Euclidean distances between the objects and the centers of their clusters. Since clustering cost and fairness are generally in conflict, managing the trade-off between them is essential in practical applications. Existing methods provide limited control over this trade-off and either fail to scale to large datasets or, when they scale, produce low-quality solutions. We propose a general framework for fair clustering that provides precise control over the cost-fairness trade-off and introduce three heuristics based on it. The first heuristic focuses on solution quality and the flexibility to incorporate additional constraints, the second improves scalability while retaining high solution quality, and the third is designed for maximum scalability, producing solutions for instances with millions of objects in seconds. The proposed heuristics outperform existing approaches in comprehensive numerical experiments on benchmark datasets. The source code of our heuristics and instructions for reproducing the experiments are publicly available on GitHub.
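To make the objective and fairness constraint described above concrete, here is a minimal sketch (hypothetical function names, not the paper's code) of the clustering cost, i.e. the sum of squared Euclidean distances to cluster centers, together with one common representation-based fairness check:

```python
import numpy as np

def clustering_cost(X, labels, centers):
    # Sum of squared Euclidean distances between objects and the
    # centers of their assigned clusters (the clustering objective).
    return float(np.sum((X - centers[labels]) ** 2))

def is_fair(labels, groups, k, alpha):
    # Hypothetical fairness check: each protected group's share in
    # every cluster must reach at least alpha times its overall share.
    # alpha = 1 demands exactly proportional representation.
    for c in range(k):
        members = groups[labels == c]
        if members.size == 0:
            continue
        for g in np.unique(groups):
            if np.mean(members == g) < alpha * np.mean(groups == g):
                return False
    return True
```

The user-defined target level of fairness from the paper corresponds here to the parameter `alpha`, which trades off against the cost objective.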

[LG-9] GHGbench: A Unified Multi-Entity Multi-Task Benchmark for Carbon Emission Prediction

链接: https://arxiv.org/abs/2605.13743
作者: Yifan Duan,Siyuan Zheng,Lihuan Li,Chao Xue,Flora Salim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Open datasets and benchmarks for entity-level carbon-emission prediction remain fragmented across access, scale, granularity, and evaluation. We introduce GHGbench, an open dataset and benchmark for company- and building-level greenhouse-gas prediction. The company track contains 32,000+ company-year records from 12,000+ firms with Scope 1+2 and Scope 3 disclosures and financial/sectoral signals; the building track harmonises 491,591 building-year records from 13 open sources into a single schema across 26 metropolitan areas (10 U.S., 15 Australian, 1 Singaporean), with climate covariates and multimodal remote-sensing embeddings. GHGbench defines canonical splits with in-distribution and cross-region/city transfer as primary tasks and temporal hold-out plus short-horizon forecasting as supplementary appendix evidence; headline baselines span gradient-boosted trees, a tabular foundation model, MLP, FT-Transformer, and multimodal fusion, with an LLM panel as auxiliary, all evaluated under multi-seed paired-bootstrap tests. Three benchmark-level findings emerge: (i) building emissions are structurally harder than company emissions; (ii) the in-distribution to out-of-distribution gap dwarfs any within-model gap across both the company track and the building track, and a tabular foundation model is, to our knowledge, the first baseline to open a paired-bootstrap-significant gap over tuned trees on a multi-city building-emissions task; (iii) multimodal remote-sensing embeddings help precisely where tabular generalisation breaks. GHGbench also exposes catastrophic city transfer and the sector-factor lookup ceiling as systematic failure modes. Code and reconstruction recipes are available at GHGbench.

[LG-10] Learning POMDP World Models from Observations with Language-Model Priors

链接: https://arxiv.org/abs/2605.13740
作者: Valentin Six,Frederik Panse,Mathis Fajeau,Lancelot Da Costa,Mridul Sharma,Alfonso Amayuelas,Tim Z. Xiao,David Hyland,Philipp Hennig,Bernhard Schölkopf
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Whether navigating a building, operating a robot, or playing a game, an agent that acts effectively in an environment must first learn an internal model of how that environment works. Partially-observable Markov decision processes (POMDPs) provide a flexible modeling class for such internal world models, but learning them from observation-action trajectories alone is challenging and typically requires extensive environment interaction. We ask whether language-model priors can reduce costly interaction by leveraging prior knowledge, and introduce Pinductor (POMDP-inductor): an LLM proposes candidate POMDP models from a few observation-action trajectories and iteratively refines them to optimize a belief-based likelihood score. Despite using strictly less information, Pinductor matches the performance and sample efficiency of LLM-based POMDP learning methods that assume privileged access to the hidden state, while significantly surpassing the sample efficiency of tabular POMDP baselines. Further results show that performance scales with LLM capability and degrades gracefully as semantic information about the environment is withheld. Together, these results position language-model priors as a practical tool for sample-efficient world-model learning under partial observability, and a step toward generalist agents in real-world environments. Code is available at this https URL.

[LG-11] Tight Sample Complexity Bounds for Entropic Best Policy Identification

链接: https://arxiv.org/abs/2605.13717
作者: Amer Essakine,Claire Vernade
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work established a constant gap in the exponential horizon dependence between lower and upper bounds on the number of samples required to identify an approximately optimal policy. Precisely, known lower bounds scale as \Omega(e^{|\beta| H}), where H is the horizon of the MDP, while the state-of-the-art upper bound achieves at best O(e^{2|\beta| H}) (arXiv:2506.00286v2) using a generative model. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close this open gap, we revisit the analysis of this problem through a forward-model-based algorithm building on KL-based exploration bonuses that we adapt to the entropic criterion. The improvement we get is due to two main novel technical innovations. We leverage the smoothness properties of the exponential utility to derive sharper concentration bounds, and we propose a new stopping rule that exploits this tightness further to obtain a sample complexity that matches the lower bound.

[LG-12] MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling

链接: https://arxiv.org/abs/2605.13711
作者: Hsing-Huan Chung,Shijun Li,Yoav Wald,Xing Han,Suchi Saria,Joydeep Ghosh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal irregular time series (MITS) consist of asynchronous and irregularly sampled observations from heterogeneous numerical and textual channels. In healthcare, for example, patients’ electronic health records (EHR) include irregular lab measurements and clinical notes. The irregular timing and channel patterns of observations carry predictive signal alongside the numerical values and textual content. LLMs are natural candidates for processing such heterogeneous data, given their extensive pretrained knowledge spanning textual and numerical domains. We introduce MILM (Multimodal Irregular time series Language Model), which represents MITS as time-ordered triplets in Extensible Markup Language (XML) format and fine-tunes an LLM through a two-stage strategy for MITS classification. The first stage trains on value-redacted MITS to predict from sampling patterns alone, and the second stage trains on full MITS to jointly model sampling patterns and observed values. Our two-stage model (MILM-2S) and its single-stage counterpart (MILM-Direct) achieve the best and second-best average performance on multiple EHR datasets. Further value redaction evaluations confirm that sampling patterns carry predictive signal and that MILM-2S learns to exploit them. In the value pending evaluation we introduce, where some values are unavailable at prediction time, MILM-2S outperforms MILM-Direct by a larger margin compared to standard evaluation. For MILM-2S, preserving the time and channel of value-pending observations as additional sampling information further improves in-hospital mortality prediction.
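The triplet-to-XML representation described above can be sketched as follows; the tag and attribute names here are illustrative assumptions, not the schema MILM actually uses:

```python
from xml.sax.saxutils import escape

def mits_to_xml(events):
    # Serialize multimodal irregular time-series observations, given as
    # (time, channel, value) triplets from numerical or textual channels,
    # into a time-ordered XML string suitable as LLM input.
    lines = ["<mits>"]
    for t, channel, value in sorted(events, key=lambda e: e[0]):
        lines.append(f'  <obs t="{t}" channel="{escape(str(channel))}">'
                     f'{escape(str(value))}</obs>')
    lines.append("</mits>")
    return "\n".join(lines)
```

Note that the ordering and the choice of channels already carry signal on their own, which is what the value-redacted first training stage exploits.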

[LG-13] DisAgg: Distributed Aggregators for Efficient Secure Aggregation in Federated Learning

链接: https://arxiv.org/abs/2605.13708
作者: Haaris Mehmood,Giorgos Tatsis,Dimitrios Alexopoulos,Karthikeyan Saravanan,Jie Xu,Anastasios Drosou,Mete Ozay
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted to MLSys 2026; code available at: this https URL

点击查看摘要

Abstract:Federated learning enables collaborative model training across distributed clients, yet vanilla FL exposes client updates to the central server. Secure-aggregation schemes protect privacy against an honest-but-curious server, but existing approaches often suffer from many communication rounds, heavy public-key operations, or difficulty handling client dropouts. Recent methods like One-Shot Private Aggregation (OPA) cut rounds to a single server interaction per FL iteration, yet they impose substantial cryptographic and computational overhead on both server and clients. We propose a new protocol called DisAgg that leverages a small committee of clients called Aggregators to perform the aggregation itself: each client secret-shares its update vector to Aggregators, which locally compute partial sums and return only aggregated shares for server-side reconstruction. This design eliminates local masking and expensive homomorphic encryption, reducing endpoint computation while preserving privacy against a curious server and a limited fraction of colluding clients. By leveraging optimal trade-offs between communication and computation costs, DisAgg processes 100k-dimensional update vectors from 100k 5G clients with a 4.6x speedup compared to OPA, the previous best protocol.
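The core aggregation idea, clients additively secret-sharing their update vectors to a small committee of Aggregators, can be sketched in a few lines; the modulus and function names below are illustrative choices, not taken from the paper:

```python
import secrets

P = 2**61 - 1  # large prime modulus; an illustrative choice, not from the paper

def share(vec, n):
    # Split an integer vector into n additive shares summing to vec mod P.
    # Any n-1 shares alone are uniformly random and reveal nothing.
    shares = [[secrets.randbelow(P) for _ in vec] for _ in range(n - 1)]
    last = [(v - sum(s[i] for s in shares)) % P for i, v in enumerate(vec)]
    return shares + [last]

def secure_sum(client_vectors, n_aggregators):
    # Each client secret-shares its update across the Aggregators; each
    # Aggregator locally sums only the shares it receives; the server
    # adds the partial sums and learns nothing but the aggregate.
    d = len(client_vectors[0])
    partial = [[0] * d for _ in range(n_aggregators)]
    for vec in client_vectors:
        for a, s in enumerate(share(vec, n_aggregators)):
            partial[a] = [(p + x) % P for p, x in zip(partial[a], s)]
    return [sum(col) % P for col in zip(*partial)]
```

This illustrates why no local masking or homomorphic encryption is needed: correctness follows from the shares summing to the original vector modulo P.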

[LG-14] Polyhedral Instability Governs Regret in Online Learning

链接: https://arxiv.org/abs/2605.13692
作者: Yuetai Li,Fengqing Jiang,Yichen Feng,Kaiyuan Zheng,Luyao Niu,Bhaskar Ramasubramanian,Basel Alomair,Linda Bushnell,Radha Poovendran
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*备注:

点击查看摘要

Abstract:Many online decision problems over combinatorial actions are addressed via convex relaxations, leading to online convex optimization with piecewise linear objectives and induced polyhedral structure. We show that regret in such problems is governed by polyhedral instability: the number of changes of the active region. Under full-information feedback and fixed-partition assumptions, if \mathrm{RS}_T denotes the number of region switches and V_{\max} the maximum number of vertices per region, we prove \mathrm{Regret}_T = \Theta(\sqrt{(1+\mathrm{RS}_T)\,T\,\log V_{\max}}), interpolating between experts-like and dimension-dependent OCO rates. For online submodular–concave games under Lovász convexification, this reduces to the permutation-switch count \mathrm{SC}_T, yielding the matching rate \mathrm{Regret}_T = \Theta(\sqrt{(1+\mathrm{SC}_T)\,T\,\log n}). Experiments on synthetic and real combinatorial problems (shortest path, influence maximization) validate the predicted scaling and indicate that low-instability regimes can arise in practice without explicit enumeration of actions.

[LG-15] Scale-Sensitive Shattering: Learnability and Evaluability at Optimal Scale

链接: https://arxiv.org/abs/2605.13684
作者: Shashaank Aiyer,Yishay Mansour,Shay Moran,Han Shao,Tom Waknine
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 32 pages, 1 figure

点击查看摘要

Abstract:We study the optimal scale at which real-valued function classes exhibit uniform convergence and learnability. Our main result establishes a scale-sensitive generalization of the fundamental theorem of PAC learning: for every bounded real-valued class and every \gamma > 0, uniform convergence at scale \gamma, agnostic learnability at scale \gamma/2, and finiteness of the fat-shattering dimension at every scale \gamma' > \gamma are equivalent. This resolves a question by Anthony and Bartlett (Cambridge Univ. Press 1999) on the precise scales governing learnability, refuting a conjecture attributed there to Phil Long that a multiplicative 2-factor gap is unavoidable, and improves the upper bounds of Bartlett and Long (JCSS 1998), which incur such a loss. The key technical ingredient is a direct bound on empirical \ell_\infty covering numbers, avoiding the standard detour through packing numbers. As a consequence, we obtain sharp asymptotic metric-entropy bounds in terms of the fat-shattering scale \gamma: an O(\log^2 n) bound holds already at scale \gamma/2, while an O(\log n) bound holds at scale 2\gamma. We further show that the O(\log^2 n) bound is sometimes tight. These results resolve open questions by Alon et al. (JACM 1997) and Rudelson and Vershynin (Ann. of Math. 2006). As an application, we establish a sharp dichotomy for bounded integral probability metrics (IPMs): every such IPM is either estimable or cannot be weakly evaluated within any multiplicative factor c < 3, while 3-weak evaluability always holds, resolving an open question from Aiyer et al. (ICML 2026). We also highlight several open questions on quantitative sample complexity and evaluability.

[LG-16] Sampling from Flow Language Models via Marginal-Conditioned Bridges

链接: https://arxiv.org/abs/2605.13681
作者: Iskander Azangulov,Leo Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Flow Language Models (FLMs) are a recently introduced class of language models which adapt continuous flow matching for one-hot encoded token sequences. Their denoisers have a special structure absent from generic continuous diffusion models: each block of the denoising mean is a posterior marginal distribution over the clean token at that position. Standard DDPM-style samplers collapse these marginals to a single conditional-mean endpoint and bridge toward this simplex-valued point, which is generally not a valid one-hot sequence. We argue that the natural sampler for an FLM is instead posterior-predictive. At each reverse step, we sample a clean one-hot endpoint from the factorized posterior defined by the FLM token marginals, and then sample the next continuous state from the analytic Ornstein–Uhlenbeck bridge conditioned on that endpoint. The method is training-free, uses the same model evaluations as standard sampling, and gives a principled interface for token-level decoding controls such as temperature scaling and nucleus truncation. We show that, under exact posterior marginals, the endpoint approximation error is exactly the conditional multi-information among token positions. The induced one-step bridge kernel preserves all token-wise posterior-predictive marginals and loses only the residual cross-position dependence. Finally, we prove a Girsanov path-space comparison showing that the marginal-conditioned bridge has a no-larger denoising-error term than the frozen conditional-mean bridge, with strict improvement whenever intermediate coordinate-wise bridge observations reveal additional information about the clean token. Experiments with FLMs show that the sampler improves the quality–diversity tradeoff. Code is available at: this http URL.

[LG-17] Three-Stage Learning Unlocks Strong Performance in Simple Models for Long-Term Time Series Forecasting

链接: https://arxiv.org/abs/2605.13678
作者: Zhenan Yu,Guangxin Jiang,Jin Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent studies on long-term time series forecasting have shown that simple linear models and MLP-based predictors can achieve strong performance without increasingly complex architectures. However, many competitive baselines still rely on structural priors such as frequency-domain modeling, explicit decomposition, multi-scale mixing, or sophisticated cross-variable interaction modules, while paying less attention to how simple temporal mappings should be trained and organized. In this paper, we propose STAIR, short for Stagewise Temporal Adaptation via Individualization and Residual Learning, a training paradigm for long-term time series forecasting that aims to unlock the capacity of simple temporal mapping models without introducing complex architectural modules. STAIR decomposes forecasting ability into three progressive stages: it first learns common temporal dynamics across variables through a shared temporal mapping, then adapts the shared model to each variable via channel-wise fine-tuning to capture variable-specific patterns, and finally complements the backbone with cross-variable information through residual learning. We further introduce Shared-to-Individual Fine-tuning and alpha-RevIN to mitigate the limitations of strict channel independence and the overly strong normalization prior induced by standard RevIN. This design gradually increases modeling flexibility while keeping the core temporal predictor as a shallow MLP in the main experiments, with linear variants analyzed separately. Experiments on nine long-term forecasting benchmarks show that STAIR matches or outperforms recent strong baselines while preserving a simple temporal backbone, providing a concise and effective modeling perspective for long-term time series forecasting.

[LG-18] Graph Neural Networks with Triangle-Based Messages for the Multicut Problem

链接: https://arxiv.org/abs/2605.13673
作者: Jannik Irmai,Lucas Fabian Naumann,Bjoern Andres
类目: Machine Learning (cs.LG)
*备注: 21 pages, 5 figures

点击查看摘要

Abstract:The multicut problem is an NP-hard combinatorial optimization problem with diverse applications in fields such as bioinformatics, data mining and computer vision. Graph neural networks have been defined for the multicut problem but can be adapted further to its specific objective function and constraints. In this article, we introduce such an adapted graph neural network architecture in which features are assigned only to edges, and the computation of messages is based on triangles in the underlying graph. Experiments with synthetic and real-world instances with up to 200 nodes show that our method outperforms state-of-the-art heuristic solvers in terms of solution quality while maintaining feasible runtimes. For some instances, our method finds optimal solutions in seconds whereas exact solvers need hours to find and certify optimal solutions.
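Since the paper's messages are computed over triangles of the underlying graph, it may help to see how such triangles are enumerated; this helper is a generic sketch (hypothetical name, not the paper's code):

```python
def triangles(edges):
    # Enumerate the triangles (i, j, k) of an undirected graph given as
    # edge pairs. In a multicut relaxation, each triangle yields a
    # validity constraint: no single edge of a triangle may be cut alone.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = set()
    for u, v in edges:
        # Any common neighbor of u and v closes a triangle.
        for w in adj[u] & adj[v]:
            found.add(tuple(sorted((u, v, w))))
    return sorted(found)
```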

[LG-19] Achieving ε^{-2} Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

链接: https://arxiv.org/abs/2605.13639
作者: Ishaq Hamza,Zaiwei Chen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we establish last-iterate convergence rates for off-policy actor–critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first \tilde{\mathcal{O}}(\epsilon^{-2}) sample complexity guarantee for finding an \epsilon-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an \tilde{\mathcal{O}}(\epsilon^{-2}) sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an \tilde{\mathcal{O}}(1/T) convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded iterates.

[LG-20] Multimodal Graph-based Classification of Esophageal Motility Disorders

链接: https://arxiv.org/abs/2605.13623
作者: Alexander Geiger,Lars Wagner,Daniel Rueckert,Alois Knoll,Dirk Wilhelm,Alissa Jell
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diagnosing esophageal motility disorders poses significant challenges due to the complexity of high-resolution impedance manometry (HRIM) data and variability in clinical interpretation. This work explores the feasibility of a multimodal Machine Learning (ML)-based classification approach that combines HRIM recordings with patient-specific information and incorporates a graph-based modeling of esophageal physiology. We analyze HRIM recordings with corresponding patient information from 104 patients with esophageal motility disorders. Patient data includes demographic, clinical, and symptom information extracted from structured questionnaires and free-text notes using keyword detection and large language model-based processing. HRIM data is represented as spatio-temporal graphs, where nodes correspond to pressure values along the esophagus and edges encode spatial adjacency and impedance dynamics. A graph neural network (GNN) is applied to learn physiologically meaningful representations, which are fused with patient embeddings for multi-category, multi-class classification of swallow events. The impact of patient features and graph-based modeling is evaluated by ablation studies and comparison to vision-based classifier baselines. The proposed multimodal approach indicates improvements over models that rely solely on HRIM-derived features across all classification categories. Additionally, the graph-based modeling provides gains compared to vision-based baselines. Our experiments systematically assess the complementary contribution of multiple modalities, as well as demonstrate the feasibility of our proposed graph-based approach. Our initial findings demonstrate that integrating patient-level data with graph-based representations of HRIM signals appears to be a promising direction for more accurate classification of esophageal motility disorders.

[LG-21] Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning

链接: https://arxiv.org/abs/2605.13612
作者: Yatin Dandi,Matteo Vilucchio,Luca Arnaboldi,Hugo Tabanelli,Florent Krzakala
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注: 62 pages, many figures, companion codes in this https URL

点击查看摘要

Abstract:Understanding how deep neural networks learn useful internal representations from data remains a central open problem in the theory of deep learning. We introduce Neural Low-Degree Filtering (Neural LoFi), a stylized limit of gradient-based training in which hierarchical feature learning becomes an explicit iterative spectral procedure. In this limit, the dynamics at each layer decouple: given the current representation, the next layer selects directions with maximal accessible low-degree correlation to the label. This yields a tractable surrogate mechanism for deep learning, together with a natural kernel-space interpretation. Neural LoFi provides a mathematically explicit framework for studying multi-layer feature learning beyond the lazy regime. It predicts how representations are selected layer by layer, explains how the emergence of concepts arises at a given sample complexity, and gives a concrete mechanism by which depth progressively constructs new features from old ones through low-degree compositionality. We complement the theory with mechanistic experiments on fully connected and convolutional architectures, showing that Neural LoFi improves over lazy random-feature baselines, recovers meaningful structured filters, and predicts representations aligned with early gradient-descent feature discovery on real datasets.

[LG-22] Rethinking Generalization in Graph Neural Networks: A Structural Complexity Perspective

链接: https://arxiv.org/abs/2605.13597
作者: Peiyao Wang,Liang Bai,Xian Yang,Richard Yi Da Xu,Jiye Liang
类目: Machine Learning (cs.LG)
*备注: 44 pages, 10 figures

点击查看摘要

Abstract:Graph neural networks (GNNs) have emerged as a fundamental tool for learning from graph-structured data, achieving strong performance across a wide range of applications. However, understanding their generalization capabilities remains challenging due to the complex structural dependencies inherent in such data. Existing generalization analyses largely follow the classical machine learning paradigm, focusing primarily on model complexity while overlooking the fundamental role of graph structure. Therefore, in this work, we systematically investigate this role by asking: does the graph structure actually influence generalization, and if so, by how much? To answer the first question and validate our intuition, we theoretically prove that incorporating more edges into the prediction process transforms the input representations to be overly accommodating to the output model, thereby inducing overfitting. To address the second question, we formulate a structural complexity measure based on the number of effective edges and derive a Rademacher complexity-based generalization bound. In doing so, we demonstrate that GNN generalization depends explicitly on structural complexity, alongside traditional parameter-dependent factors. Motivated by these theoretical findings, we propose a structural entropy regularization method. This approach controls structural complexity by regulating effective edges to balance underfitting and overfitting, ultimately improving the generalization performance of GNNs.

[LG-23] Spatiotemporal downscaling and nowcasting of urban land surface temperatures with deep neural networks

链接: https://arxiv.org/abs/2605.13566
作者: Solomiia Kurchaba,Angela Meyer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Land Surface Temperature (LST) is a key variable for various applications, such as urban climate and ecology studies. Yet, existing satellite-derived LST products provide either high spatial or high temporal resolution, resulting in a fundamental trade-off between the two. To address this trade-off, we combine observations from a geostationary and a polar orbiting satellite and provide LST fields at high spatial and high temporal resolution (1 km at 15-min intervals). We demonstrate their application for intraday forecasting of LSTs. To estimate LST fields at high spatiotemporal resolution, a U-Net model is trained to map LST fields from SEVIRI/MSG (3 km and 15 min resolution) to LST fields from Terra/Aqua MODIS (1 km, 4 overpasses per day) that are collocated in space and time. The presented model has been trained on LSTs across large European cities with a population exceeding 1 million inhabitants, and achieves an RMSE = 1.92 °C and near-zero bias MBE = 0.01 °C on the hold-out test set. As a second step, we present an LST nowcasting model based on a ConvLSTM architecture, trained across downscaled LST fields with forecast lead times of 15 to 75 minutes. The nowcasting model outperforms persistence and Climatological Rolling Median benchmarks, with RMSEs of 0.57 to 1.15 °C for the considered lead times and biases ranging from -0.1 to 0.14 °C. An additional validation conducted against independent MODIS overpasses confirms robust performance. Our LST forecast model at high spatiotemporal resolution is directly applicable to operational satellite-based LST monitoring.

[LG-24] Uncertainty-Aware Prediction of Lung Tumor Growth from Sparse Longitudinal CT Data via Bayesian Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2605.13560
作者: Lingfei Kong,Haoran Ma
类目: Machine Learning (cs.LG)
*备注: 8 pages, 15 figures

点击查看摘要

Abstract:This work studies lung tumor growth prediction from sparse and irregular longitudinal computed tomography (CT) observations with measurement variability. A Bayesian physics-informed neural network is developed by combining Gompertz growth dynamics with low-dimensional Bayesian inference in the log-volume domain. The framework employs a two-stage inference strategy combining maximum a posteriori (MAP) estimation and Hamiltonian Monte Carlo (HMC) sampling to estimate posterior predictive distributions and uncertainty intervals. The method was evaluated on longitudinal data from the National Lung Screening Trial (30 patients). Results show that the model captures heterogeneous tumor growth patterns while maintaining reasonable prediction accuracy under limited observations. Compared with deterministic modeling approaches, the proposed approach additionally provides calibrated uncertainty estimates. The inferred posterior parameter correlations were consistent with expected biological growth behavior. The proposed framework achieved a cohort-level log-space RMSE of approximately 0.20 together with well-calibrated 95% credible interval coverage across 30 patients. These findings suggest that Bayesian physics-informed modeling may be useful for uncertainty-aware tumor growth assessment when only limited longitudinal follow-up scans are available.
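The Gompertz dynamics used in the physics-informed term are particularly simple in the log-volume domain the paper works in; the sketch below shows the standard Gompertz closed form (parameter names are illustrative, and the paper's exact parameterization may differ):

```python
import math

def gompertz_log_volume(y0, a, log_K, t):
    # Gompertz growth written in the log-volume domain y = log V:
    #     dy/dt = a * (log_K - y)
    # which has the closed-form solution
    #     y(t) = log_K + (y0 - log_K) * exp(-a * t),
    # where a is the growth rate and K the carrying capacity
    # (standard Gompertz parameters).
    return log_K + (y0 - log_K) * math.exp(-a * t)
```

Working in log-volume makes the ODE linear, which is one reason low-dimensional Bayesian inference over (y0, a, log_K) stays tractable under sparse observations.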

[LG-25] Mixed neural posterior estimation for simulators with discrete and continuous parameters

链接: https://arxiv.org/abs/2605.13551
作者: Jan Boelts,Cornelius Schröder,Jonas Beck,Jakob H. Macke,Michael Deistler,Daniel Gedon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural Posterior Estimation (NPE) enables rapid parameter inference for complex simulators with intractable likelihoods. NPE trains an inference network to estimate a probability density over parameters given data, typically assumed to be continuous. However, many scientific models involve parameter spaces that are mixed, that is, they contain both discrete and continuous dimensions. We address this limitation by extending NPE to mixed parameter spaces through an inference network that jointly handles discrete and continuous parameters. The inference network factorizes the joint posterior into discrete and continuous components, combining an autoregressive classifier for the discrete parameters with a generative model for the continuous parameters, trained jointly under a single simulation-based objective. In addition, we propose a diagnostic tool to assess the calibration of the mixed posterior approximation. Across tractable toy examples and real-world scientific simulators, our joint inference approach yields accurate and calibrated posteriors. The inference framework is available in the sbi Python package.

[LG-26] Limits of Personalizing Differential Privacy Budgets

链接: https://arxiv.org/abs/2605.13503
作者: Edwige Cyffers,Juba Ziani
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A key technical difficulty in differential privacy is selecting a privacy budget that satisfies privacy requirements while maximizing utility. A natural and well-studied workaround is to use personalized privacy budgets, which may differ across agents. In this paper, we show that personalized budgets come with major limitations and that for mean estimation, the dominant factor is not full personalization, but rather choosing the right effective privacy budget. This can be achieved through a simple thresholding operator that we describe. Compared with this thresholding baseline, the gains obtained by fully personalized mechanisms are limited. In particular, we precisely quantify the constant-factor improvement in settings with mixed private and public datasets and in private datasets with two levels of privacy requirements. We also establish upper bounds and identify regimes of maximal gain for arbitrary privacy requirements.
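The thresholding operator described above is simple enough to sketch; the mean-estimation helper below is an illustrative Laplace-mechanism example at the resulting effective budget, not the paper's exact mechanism, and all names are hypothetical:

```python
import random

def threshold_budgets(budgets, tau):
    # Thresholding operator: cap each personalized privacy budget at a
    # common level tau, so the whole population can be aggregated at a
    # single effective budget.
    return [min(b, tau) for b in budgets]

def laplace_mean(values, tau, rng=None):
    # Illustrative epsilon = tau Laplace-mechanism mean for values in
    # [0, 1]; the sensitivity of the mean over n values is 1/n.
    rng = rng or random.Random(0)
    n = len(values)
    scale = 1.0 / (n * tau)
    # A Laplace sample is the difference of two exponential samples.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(values) / n + noise
```

The paper's point is that picking the right tau captures most of the achievable utility, so the gains from mechanisms that treat every budget individually are limited.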

[LG-27] Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation

链接: https://arxiv.org/abs/2605.13501
作者: Qingyun Zou,Yingze Li,Tianen Liu,Bingsheng He,Weng-Fai Wong
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:LLM-based generation of SystemVerilog Assertions (SVA) is often reported as nearing saturation, with the strongest specialized model reaching ~76% accuracy on NL2SVA-Human. We show that this aggregate hides a temporal gap: models that appear strong overall still collapse to a few implication templates on bounded-delay and liveness specifications. The core issue is that the dominant recipe, supervised fine-tuning on NL/SVA pairs, optimizes token-level mimicry rather than the *property equivalence* that defines SVA correctness. We introduce *Reward-Weighted On-Policy Distillation* (RWOPD), an on-policy distillation method that samples student rollouts, scores them with an open SymbiYosys+Z3 Property-Equivalence Checker (PEC), and applies a verifier-reward-weighted forward-KL gradient from a frozen 14B teacher on verifier-passable rollouts. This keeps the supervision dense at every response token while grounding both selection and loss weight in property-equivalent behavior. RWOPD distills CodeV-SVA-14B into a Qwen2.5-Coder-7B-Instruct student that sets a new state of the art on NL2SVA-Human and NL2SVA-Machine across pass@1, pass@5, and pass@10, surpassing both specialized prior SOTA models and 671B general-purpose baselines.

[LG-28] MARLIN: Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters

链接: https://arxiv.org/abs/2605.13496
作者: H. Moore,S. Qi,D. Milojicic,C. Bash,S. Pasricha
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become increasingly prevalent in cloud-based platforms, propelled by the introduction of AI-based consumer and enterprise services. LLM inference requests in particular account for up to 90% of total LLM lifecycle energy use, dwarfing training energy costs. The rising volume of LLM inference requests is increasing environmental footprints, particularly carbon emissions and water consumption. To improve sustainability for LLM inference serving in cloud datacenter environments, we propose a novel multi-agent game-theoretic reinforcement learning framework called MARLIN to co-optimize time-to-first token (TTFT), carbon emissions, water usage, and energy costs associated with LLM inference. MARLIN demonstrates a reduction of at least 18% in TTFT, 33% in carbon emissions, 43% in water usage, and 11% in energy costs compared to state-of-the-art LLM inference management frameworks.

[LG-29] Path-independent Flow Matching for Multi-parameter Generative Dynamics

链接: https://arxiv.org/abs/2605.13487
作者: Francisco Téllez,AmirHossein Zamani,Philippe Martin,Shuang Ni,Guy Wolf,Eugene Belilovsky,Sina Sanjari,Yanlei Zhang
类目: Machine Learning (cs.LG)
*备注: 12 pages including references for main part of the document, 26 pages in total when including the appendix. 15 figures in total

点击查看摘要

Abstract:Flow Matching is a powerful framework for learning transport maps between probability distributions. Yet its standard single-parameter formulation is not designed to capture multi-parameter variations where the resulting transport should be path-independent. Path independence is crucial because it ensures that transformations depend only on the initial and target distributions, not on the specific path. In this work, we introduce Path-independent Flow Matching (PiFM), a method for learning vector fields whose induced flows yield path-independent transport between distributions. We show that PiFM generalizes Flow Matching to higher-dimensional parameter domains while enforcing structural conditions that ensure consistency of composed transformations. In addition, we show that, under suitable assumptions, PiFM approximates the Wasserstein barycenter, linking the framework to a notion of distributional interpolation. To enable practical training, we propose a tractable, simulation-free objective that regresses onto multi-parameter conditional probability paths. We showcase empirically that PiFM outperforms other approaches on both synthetic and real world data in interpolating path-independent trajectories and generating desired out of distribution samples.

[LG-30] Twincher: Bijective Representation Learning for Robust Inversion of Continuous Systems

链接: https://arxiv.org/abs/2605.13470
作者: Arkady Gonoskov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in AI have been primarily driven by large-scale neural architectures that excel at function approximation, rather than by tailored inductive biases and inference or learning strategies that could be important for resource-efficient real-world perception and planning through the solution of inverse problems. In this work, we consider the possibility of enabling robust inversion of continuous forward processes p \mapsto y by learning representations of y that are bijectively aligned with p while remaining insensitive to perturbations in y caused by noise or model mismatch. We propose Twincher, a class of architectures based on stacks of structured diffeomorphic transformations and tailored adversarial training strategies that enable learning such bijective representations. We provide a public API for training and inference and empirically demonstrate the ability of the proposed architecture to efficiently learn bijective representations of synthetic systems, thereby enabling robust and efficient iterative inverse inference. Compared to a baseline inverse-modeling approach, the method exhibits improved data efficiency and robustness, providing initial evidence for the potential of bijective representation learning in robotics, vision, and physical AI.

[LG-31] A Unified Three-Stage Machine Learning Framework for Diabetes Detection Subtype Discrimination and Cognitive-Metabolic Hypothesis Testing

链接: https://arxiv.org/abs/2605.13464
作者: Vishal Pandey,Ruzina Haque Laskar,Rishav Tewari
类目: Machine Learning (cs.LG)
*备注: 10 Pages

点击查看摘要

Abstract:Diabetes mellitus affects over 537 million adults worldwide and remains a major challenge in preventive healthcare. Existing machine-learning studies primarily formulate diabetes prediction as a binary classification problem, while subtype-oriented analysis and glycaemic-cognitive associations remain comparatively underexplored. We present a reproducible three-stage machine learning framework for diabetes detection, subtype-oriented clustering, and metabolic-cognitive association analysis. In Stage 1, five supervised classifiers together with a stacking ensemble are benchmarked on the NCSU Diabetes Dataset using stratified five-fold cross-validation and evaluation metrics including ROC-AUC, balanced accuracy, recall, and F1-score. SVM-RBF and Logistic Regression achieve the highest ROC-AUC ( 0.825 \pm 0.026 ), while Random Forest achieves the highest accuracy ( 0.762 \pm 0.030 ). SHAP explainability identifies Glucose, BMI, and Age as the dominant predictive biomarkers. In Stage 2, silhouette-validated K-Means clustering ( k=2 , silhouette \approx 0.116 ) is applied to confirmed diabetic cases using Glucose, Insulin, and Age, recovering clinically plausible subtype-oriented partitions without requiring ground-truth subtype labels. In Stage 3, statistical analysis of the Ohio Longitudinal Cognitive Dataset ( n=373 ) reveals a significant positive association between glycaemic control and cognitive function ( \rho_s = 0.208 , p = 5.29 \times 10^-5 ), which survives Holm correction. The findings support the utility of statistically grounded and interpretable ML pipelines for reproducible diabetes analytics and subtype-aware exploratory analysis.

[LG-32] Efficient Sensor Fusion for Gesture Recognition on Resource-Constrained Devices

链接: https://arxiv.org/abs/2605.13462
作者: Pietro Bartoli,Christian Veronesi,Tommaso Bondini,Andrea Giudici,Franco Zappa
类目: Machine Learning (cs.LG)
*备注: The article is already accepted for IEEE Sensors Applications Symposium (IEEE SAS) 2026

点击查看摘要

Abstract:Gesture recognition is a cornerstone of Human-Computer Interaction (HCI) for smart eyewear, enabling natural and device-free control in augmented reality environments. Traditional vision-based approaches face significant challenges regarding power consumption, computational latency, and user privacy. This paper proposes a lightweight, privacy-preserving gesture recognition system based on the fusion of low-resolution Time-of-Flight (ToF) and Infrared (IR) thermal sensors. We used an 8×8 multizone ToF sensor (VL53L8CH) and an 8×8 IR array (AMG8833) to capture complementary depth and thermal cues. A compact Convolutional Neural Network (CNN) with a specialized grouped-convolution architecture is designed to fuse these modalities efficiently on a microcontroller (MCU). Experimental results on a custom dataset of 7 static gestures, validated via k-fold cross-validation, demonstrate that the proposed fusion strategy significantly outperforms single-sensor baselines with an accuracy of 92.3% and a macro F1-score of 0.93. Finally, on-device benchmarks on STM32F4 and STM32H7 MCUs confirm the system’s suitability for resource-constrained wearables, requiring only 6,343 parameters and achieving millisecond-level inference latency with a total system power of 50 mW.

[LG-33] Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

链接: https://arxiv.org/abs/2605.13434
作者: Ammar Mahran,Artavazd Maranjyan,Peter Richtárik
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient arrives. Vanilla ASGD applies each arriving gradient with the same weight. When local data distributions are heterogeneous, this becomes problematic: faster workers contribute more updates, and we show theoretically that the method is biased toward a frequency-weighted average of the local objectives rather than the desired global objective. Existing remedies typically move away from the simple ASGD template by introducing gathering phases, buffering, or extra memory. We show that this is unnecessary. Keeping the standard ASGD mechanism, we recover the correct objective by rescaling worker-specific stepsizes in proportion to their computation times, so that each worker contributes the same aggregate learning rate over a cycle. In the non-convex setting, under smoothness and bounded heterogeneity assumptions, we prove that the resulting method, Rescaled ASGD, converges to stationary points of the correct global objective in the fixed-computation model. Its time complexity matches the known lower bound in the leading term, while the effects of staleness and data heterogeneity appear only in lower-order terms. Experiments confirm that the method converges to the correct objective and is competitive with state-of-the-art baselines.
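
The bias and its fix can be reproduced on a toy problem. Below, two workers optimize quadratics f_i(x) = (x - c_i)²/2 asynchronously; the fast worker fires 4× as often. Vanilla ASGD drifts toward the frequency-weighted mean of the targets, while rescaling each worker's stepsize in proportion to its computation time recovers the true average objective. This event-driven simulation is a deliberate simplification (staleness is ignored) and the exact stepsize schedule is the obvious proportional choice, not necessarily the paper's.

```python
def async_sgd(taus, targets, etas, T=20000):
    """Event-driven ASGD on quadratics f_i(x) = (x - targets[i])^2 / 2.
    Worker i delivers a gradient every taus[i] time units; the server
    applies it immediately with that worker's stepsize etas[i]."""
    x = 0.0
    events = sorted((t, i) for i, tau in enumerate(taus)
                    for t in range(tau, T, tau))
    for _, i in events:
        x -= etas[i] * (x - targets[i])   # gradient of f_i at x
    return x

taus, targets = [1, 4], [0.0, 1.0]   # fast worker targets 0, slow targets 1
eta = 0.01
x_vanilla = async_sgd(taus, targets, [eta, eta])            # same stepsize
x_rescaled = async_sgd(taus, targets, [eta * t for t in taus])  # eta_i ~ tau_i
print(round(x_vanilla, 2), round(x_rescaled, 2))
# roughly 0.2 (frequency-weighted mean) vs 0.5 (true mean of targets)
```

With equal stepsizes the fixed point satisfies Σ f_i (x − c_i) = 0 with f_i the update frequencies, giving (1·0 + 0.25·1)/1.25 = 0.2; rescaling equalizes each worker's aggregate learning rate per cycle, moving the fixed point to the desired 0.5.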

[LG-34] TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation

链接: https://arxiv.org/abs/2605.13433
作者: Huichao Chai,Zhixin Wu,Xuemiao Li,Shiqing Fan,Hengfeng Wang,Maojun Peng,Lu Xu,Yaoyuan Wang,Yibo Jin,Wei Guo,Yongxiang Feng
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

Abstract:Generative recommendation (GR) has emerged as a promising paradigm that replaces fragmented, scenario-specific architectures with unified Transformer-based models, exhibiting scaling-law behavior where recommendation quality improves systematically with increased model capacity and training data. However, deploying GR at scale on Ascend NPUs faces fundamental system-level challenges. These challenges are further exacerbated on Ascend NPUs due to the absence of high-performance implementations for jagged operators and the architectural mismatch between irregular sparse primitives and NPU’s dense-computation-optimized design. In this paper, we present TurboGR, an Ascend-affinity training system for generative recommendation that systematically addresses these bottlenecks through three core innovations: (i) Ascend-affinity jagged acceleration, including fusion operators that eliminate padding redundancy and dynamic load balancing that reduces inter-device imbalance from 47% to 2.4%; (ii) distributed communication optimization, comprising hierarchical sparse parallelism, semi-asynchronous training with proven convergence guarantees, and fine-grained pipeline orchestration that sustains 94% NPU utilization; and (iii) negative sampling optimization via asynchronous offloading, jaggedness-aware FP16 quantization, and intra-batch logit sharing that expand the effective negative space without additional embedding lookups. Evaluated on the KuaiRand-27K dataset, TurboGR supports training at up to 0.2B parameters and achieves 54.71% MFU with near-linear scalability (0.97).

[LG-35] Strategic PAC Learnability via Geometric Definability

链接: https://arxiv.org/abs/2605.13426
作者: Yuval Filmus,Shay Moran,Elizaveta Nesterova,Nir Rosenfeld,Alexander Shlimovich
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注:

点击查看摘要

Abstract:Strategic classification studies learning settings in which individuals can modify their features, at a cost, in order to influence the classifier’s decision. A central question is how the sample complexity of the induced (strategic) hypothesis class depends on the complexities of the underlying hypothesis class and the cost structure governing feasible manipulations. Prior work has shown that in several natural settings, such as linear classifiers with norm costs, the induced complexity can be controlled. We begin by showing that such guarantees fail in general - even in simple cases: there exist hypothesis classes of VC dimension 1 on the real line such that, even under the simplest interval neighborhoods, the induced class has infinite VC dimension. Thus, strategic behavior can turn an easy learning problem into a non-learnable one. To overcome this, we introduce structure via a geometric definability assumption: both the hypothesis class and the cost-induced neighborhood relation can be defined by first-order formulas over \mathbb{R}_{\mathtt{exp}} . Intuitively, this means that hypotheses and costs can be described using arithmetic operations, exponentiation, logarithms, and comparisons. This captures a broad range of natural classes and cost functions, including \ell_p distances, Wasserstein distance, and information-theoretic divergences. Under this assumption, we prove that learnability is preserved, with sample complexity controlled by the complexity of the defining formulas.

[LG-36] DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning ICML2026

链接: https://arxiv.org/abs/2605.13418
作者: Marc Molina Van den Bosch,Riccardo Taiello,Albert Sund Aillet,Andrea Protani,Miguel Angel Gonzalez Ballester,Luigi Serio
类目: Machine Learning (cs.LG)
*备注: Accepted at the International Conference on Machine Learning (ICML 2026). 9 pages main text + appendix, 5 figures, 2 tables. Code: this https URL

点击查看摘要

Abstract:Differentially private optimization suffers from a fundamental geometric mismatch: deep networks have highly anisotropic loss landscapes, yet DP-SGD injects isotropic noise. Second-order preconditioning can resolve this, but estimating curvature typically requires private data (consuming privacy budget) or public data (introducing distribution shift). We show that the Fisher Information Matrix decouples into architectural sensitivity, recoverable via synthetic noise, and input correlations, approximable from modality-specific frequency statistics. We propose DP-KFC, which constructs KFAC preconditioners by probing networks with structured synthetic noise, requiring neither private nor public data. Empirically, DP-KFC consistently outperforms DP-SGD and adaptive baselines across diverse modalities in strong privacy regimes ( \varepsilon \leq 3 ). DP-KFC matches private-data preconditioners while public-data variants degrade by up to 4.8% , showing that curvature can be estimated without consuming privacy budget or introducing distribution shift. This enables privacy-preserving learning in specialized domains (e.g., medical applications) where regulatory constraints make data scarce.

[LG-37] Vector-Quantized Discrete Latent Factors Meet Financial Priors: Dynamic Cross-Sectional Stock Ranking Prediction for Portfolio Construction IJCAI2026

链接: https://arxiv.org/abs/2605.13407
作者: Namhyoung Kim,Jae Wook Song
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Statistical Finance (q-fin.ST)
*备注: IJCAI 2026 Accepted Paper including Technical Appendix

点击查看摘要

Abstract:Predicting cross-sectional stock returns is challenging due to low signal-to-noise ratios and evolving market regimes. Classical factor models offer interpretability but limited flexibility, while deep learning models achieve strong performance yet often underutilize financial priors. We address this gap with PRISM-VQ (PRior-Informed Stock Model with Vector Quantization), a dynamic factor framework that integrates expert prior factors, vector-quantized discrete latent factors learned from cross-sectional structure, and a structure-conditioned Mixture-of-Experts to generate time-varying factor loadings. Vector quantization acts as an information bottleneck that suppresses noise while capturing robust market structure, with discrete codes serving both as latent factors and as routing signals for temporal expert specialization. Experiments on CSI 300 and S&P 500 show consistent improvements in cross-sectional return prediction and portfolio performance over strong baselines while preserving interpretability. Our code is available at this https URL.
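
The core vector-quantization step is just nearest-codebook assignment, which is what makes the discrete codes usable both as latent factors and as routing signals. A minimal sketch (codebook learning and the straight-through gradient trick are omitted; the toy codebook is illustrative):

```python
def vq_assign(vectors, codebook):
    """Vector quantization: map each input to the index of its nearest
    codebook entry under squared Euclidean distance. The discrete code
    acts as an information bottleneck over the continuous input."""
    codes = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        codes.append(min(range(len(codebook)), key=dists.__getitem__))
    return codes

codebook = [(0.0, 0.0), (1.0, 1.0)]
vectors = [(0.1, -0.2), (0.9, 1.1), (0.6, 0.6)]
print(vq_assign(vectors, codebook))  # -> [0, 1, 1]
```

In a Mixture-of-Experts setting the returned index would select which expert processes each input, which is how a single code can double as both a latent factor and a routing signal.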

[LG-38] When is Warmstarting Effective for Scaling Language Models?

链接: https://arxiv.org/abs/2605.13405
作者: Neeratyoy Mallik,Maciej Janowski,Johannes Hog,Herilalaina Rakotoarison,Josif Grabocka,Frank Hutter,Aaron Klein
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model growth from a given checkpoint aims to accelerate training of a larger model, offering potential resource savings. Despite recent interest, warmstarting has seen limited practical adoption in large-scale training. We attribute this to two underexplored factors: (1) an overemphasis on preserving the smaller model’s performance at initialization, which constrains operator design for new architectures, and (2) insufficient analysis of how growth interacts with hyperparameters and scaling behavior, compounded by inconsistent growth factors across the literature. We show that preserving the base model’s initial post-growth performance is not necessary for strong final performance, and that simple, architecture-agnostic growth strategies can outperform more complex warmstarting operators. Crucially, we empirically identify an upper bound on the growth factor g beyond which training from scratch is more efficient. We observe this across multiple ablation setups. Notably, this limit is also present, but unreported, in prior published results. Across our experiments on dense MLPs and dense language models, we find that a 2\times growth factor is the most reliable in yielding convergence speedups, with gains most pronounced under 20 tokens/parameter budgets and diminishing as budget increases. We fit scaling laws over these observations to provide predictive guidance for practitioners deciding when and how much to grow. Together, our analysis provides practical guidelines and empirical limits for model growth.
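
To make the g = 2 growth factor concrete, here is one classic, simple width-growth operator (Net2Net-style duplicate-and-halve) applied to a tiny one-hidden-layer ReLU MLP. This is offered only as an example of a "simple, architecture-agnostic growth strategy"; the abstract stresses that function preservation at initialization, which this particular operator happens to provide, is not actually required for good final performance.

```python
def mlp_forward(x, W1, b1, W2):
    """One-hidden-layer ReLU MLP: y = W2 . relu(W1 x + b1)."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return sum(w * hi for w, hi in zip(W2, h))

def grow_width_2x(W1, b1, W2):
    """2x width growth: duplicate every hidden unit and halve its
    outgoing weight, so the grown network computes the same function."""
    return W1 + W1, b1 + b1, [w / 2.0 for w in W2] * 2

W1, b1, W2 = [[1.0, -2.0], [0.5, 0.5]], [0.1, -0.3], [2.0, -1.0]
x = [0.7, 0.2]
y_small = mlp_forward(x, W1, b1, W2)
y_big = mlp_forward(x, *grow_width_2x(W1, b1, W2))
print(abs(y_small - y_big) < 1e-9)  # grown model matches the base model
```

Growing depth or vocabulary would need different operators; the point of the abstract is that elaborate operator design matters less than choosing a growth factor within the empirical limit (around 2×).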

[LG-39] Trajectory-Level Data Augmentation for Offline Reinforcement Learning ICML2026

链接: https://arxiv.org/abs/2605.13401
作者: Tobias Schmähling,Matthias Burkhardt,Tobias Windisch
类目: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
*备注: 26 pages, 25 figures, Accepted at ICML 2026

点击查看摘要

Abstract:We propose a data augmentation method for offline reinforcement learning, motivated by active positioning problems. Particularly, our approach enables the training of off-policy models from a limited number of suboptimal trajectories. We introduce a trajectory-based augmentation technique that exploits task structure and the geometric relationship between rewards, value functions, and mathematical properties of logging policies. During data collection, our augmentation supports suboptimal logging policies, leading to higher data quality and improved offline reinforcement learning performance. We provide theoretical justification for these strategies and validate them empirically across positioning tasks of varying dimensionality and under partial observability.

[LG-40] The Diffusion Encoder

链接: https://arxiv.org/abs/2605.13399
作者: Akhil Premkumar,Sarah Lucioni
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 22 pages + references, 10 figures

点击查看摘要

Abstract:We construct a new kind of encoder, leveraging the expressive power of diffusion models. In a traditional variational autoencoder, the encoder and decoder jointly negotiate a latent representation of the input. This is made possible by the reparameterization trick, which simplifies training at the cost of restricting the encoder to a simple family of distributions. Replacing this encoder with a diffusion model requires rethinking how the decoder pressure can be transmitted back to the encoder, given that they tend to update their internal estimates of the latent in opposing directions. We solve this problem with an alternating training scheme, inspired by the expectation-maximization algorithm. Our method enables more reliable synchronization between encoder and decoder, while preserving the simple and efficient training objective of standard diffusion models.

[LG-41] Support-Conditioned Flow Matching Is Kernel Smoothing NEURIPS2026

链接: https://arxiv.org/abs/2605.13386
作者: Daniel Matsui Smola
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Submitted to NeurIPS 2026. 18 pages, 10 figures, 1 table. Code at this https URL

点击查看摘要

Abstract:Generative models are often conditioned on a small set of examples via cross-attention. Under the Gaussian optimal-transport path, we show that the exact velocity field induced by a finite support set is a Nadaraya–Watson kernel smoother whose bandwidth decreases with flow time, from broad averaging at early steps to nearest-neighbor at late steps. A single Gaussian-kernel attention head exactly computes this field, connecting cross-attention conditioning to classical kernel theory. The theory predicts three failure regimes: nearest-neighbor collapse of the kernel at high dimension, mismatch between the isotropic kernel and the data geometry, and insufficient support for nonparametric estimation. Experiments on Gaussian mixtures, spherical shells, and DINOv2 ImageNet features confirm that learned conditioning improves in precisely these regimes, and that IP-Adapter’s cross-attention implements approximate NW smoothing in practice.
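
The central identity, a Nadaraya-Watson smoother whose bandwidth shrinks with flow time, is easy to demonstrate in 1D. The linear schedule sigma(t) = sigma0 * (1 - t) below is an illustrative choice, not necessarily the exact bandwidth implied by the Gaussian optimal-transport path:

```python
import math

def nw_smooth(x, support, t, sigma0=1.0):
    """Nadaraya-Watson kernel smoothing over a finite support set,
    with a Gaussian kernel whose bandwidth shrinks as flow time
    t -> 1: broad averaging early, nearest-neighbor selection late."""
    sigma = max(sigma0 * (1.0 - t), 1e-6)
    weights = [math.exp(-(x - s) ** 2 / (2 * sigma ** 2)) for s in support]
    z = sum(weights)
    return sum(w * s for w, s in zip(weights, support)) / z

support = [-1.0, 0.0, 2.0]
early = nw_smooth(0.4, support, t=0.0)  # broad average over all points
late = nw_smooth(0.4, support, t=0.9)   # collapses onto the nearest point, 0.0
print(early, late)
```

The late-time collapse onto the nearest support point is exactly the first failure regime the abstract predicts at high dimension, where every query is roughly equidistant from all support points and nearest-neighbor selection becomes uninformative.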

[LG-42] Teaching and Learning under Deductive Errors NEURIPS

链接: https://arxiv.org/abs/2605.13384
作者: Jan Arne Telle,Brigt Håvardstun,Jose Hernandez-Orallo
类目: Machine Learning (cs.LG)
*备注: 15 pages, preprint neurips

点击查看摘要

Abstract:Most models of machine teaching and learning assume the learner makes no errors in its internal deductive inference. However, humans and large language models in few-shot learning regimes are two important examples of learners where this does not hold. They fail on some consistency checks, and they can fail stochastically. In this paper we introduce a teaching and learning framework that takes these deductive errors into account. We specifically study the case of machine teaching, as different characterizations of the teacher can account for both machine teaching and learning. In an overhauled Probably Approximately Correct (PAC) setting, we show theoretically that, for some estimated error level, the teacher must find a PAC teaching set that with high probability will lead the learner to guess a hypothesis that is approximately correct. We study the computational complexity of six different problems related to computing optimal PAC teaching sets. We give XP algorithms parametrized by size of teaching set, with tight runtime bounds under standard complexity assumptions like ETH. These results are complemented with a small experimental study of which teaching and learning protocols can best represent the observed behavior in some LLM teaching sessions.

[LG-43] Beyond Oversquashing: Understanding Signal Propagation in GNNs Via Observables

链接: https://arxiv.org/abs/2605.13383
作者: Eden Nagar,Ya-Wei Eileen Lin,Ron Levie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) perform computations on graphs by routing the signal between graph regions using a graph shift operator or a message passing scheme. Often, the propagation of the signal leads to a loss of information, where the signal tends to diffuse across the graph instead of being deliberately routed between regions of interest. Two notions that depict this phenomenon are oversmoothing and oversquashing. In this paper, we propose an alternative approach for modeling signal propagation, inspired by quantum mechanics, using the notion of observables. Specifically, we model the place in the graph where the signal lies, how much the signal is concentrated there, and how much of the signal is propagated towards a location of interest when applying a GNN. Using these new concepts, we prove that standard spectral GNNs have poor signal propagation capabilities. We then propose a new type of spectral GNN, termed Schrödinger GNN, which we show has a superior capacity to route the signal across the graph.

[LG-44] Building Interactive Real-Time Agents with Asynchronous I/O and Speculative Tool Calling

链接: https://arxiv.org/abs/2605.13360
作者: Coleman Hooper,Minwoo Kang,Suhong Moon,Nicholas Lee,Eric Wen,John Wawrzynek,Michael W. Mahoney,Yakun Sophia Shao,Amir Gholami,Kurt Keutzer
类目: Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:There is a growing demand for agentic AI technologies for a range of downstream applications like customer service and personal assistants. For applications where the agent needs to interact with a person, real-time low-latency responsiveness is required; for example, with voice-controlled applications, under 1 second of latency is typically required for the interaction to feel seamless. However, if we want the LLM to reason and execute an agentic workflow with tool calling, this can add several seconds or more of latency, which is prohibitive for real-time latency-sensitive applications. In our work, we aim to enable real-time interaction even for agents with complex multi-turn tool calling. We propose Asynchronous I/O, which decouples the core agent reason-and-act thread from waiting for additional information from either the user or environment, thereby allowing for overlapping agentic processing while waiting on external delays. We also propose Speculative Tool Calling as a method to manage task execution when the agent is still unsure if it has received the full information or if additional user information may later be provided. For strong cloud models, our method can be applied out-of-the-box to existing real-time cloud APIs, providing 1.3-1.7 \times speedups with minor accuracy loss. To enable real-time interaction with small edge-scale models, we also present a clock-based training methodology that adapts the model to handle streaming inputs and asynchronous responses, and demonstrate a synthetic data generation strategy for SFT. Altogether, this approach provides 1.6-2.2 \times speedups with the Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct models across multiple tool calling benchmarks.
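
The control flow of Asynchronous I/O plus Speculative Tool Calling can be sketched with `asyncio`: launch the tool call on the first utterance while still listening for follow-up input, and cancel-and-relaunch if the user adds information before the tool finishes. Everything here (the `call_tool` stand-in, the string concatenation of utterances) is a hypothetical simplification of the paper's system.

```python
import asyncio

async def call_tool(query):
    """Stand-in for a slow tool call (e.g., retrieval)."""
    await asyncio.sleep(0.05)
    return f"result({query})"

async def agent_turn(user_stream):
    """Speculatively launch the tool call while still listening for
    late user input; cancel and respeculate if more input arrives."""
    query = await user_stream.get()
    task = asyncio.ensure_future(call_tool(query))      # speculative launch
    while True:
        listen = asyncio.ensure_future(user_stream.get())
        done, _ = await asyncio.wait({task, listen},
                                     return_when=asyncio.FIRST_COMPLETED)
        if task in done:                  # speculation paid off
            listen.cancel()
            return task.result()
        extra = listen.result()           # late user input arrived first
        task.cancel()                     # discard the stale speculation
        query = f"{query} {extra}"
        task = asyncio.ensure_future(call_tool(query))

async def main():
    stream = asyncio.Queue()
    await stream.put("weather")
    await stream.put("in Paris")          # arrives before the tool finishes
    return await agent_turn(stream)

answer = asyncio.run(main())
print(answer)  # -> result(weather in Paris)
```

The key design choice is that waiting never blocks the reason-and-act loop: the agent thread always has either a running speculation or a fresh one, so external delays overlap with useful work.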

[LG-45] GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

链接: https://arxiv.org/abs/2605.13352
作者: Mayank Nautiyal,Li Ju,Andreas Hellander,Ekta Vats,Prashant Singh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Standard dual-encoder vision-language models that map images and text to deterministic points on a shared unit hypersphere through \ell_2 normalization typically expose neither *aleatoric* uncertainty (cross-modal ambiguity) nor *epistemic* uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components, or ignore the hyperspherical geometry of these models’ embeddings. We propose **GeoFlowVLM** as a post-hoc adapter that learns the joint distribution of paired \ell_2-normalised dual-encoder VLM embeddings on the product hypersphere \mathbb{S}^{d-1} \times \mathbb{S}^{d-1} via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid Riemannian flow-matching velocity fields on their respective domains. We derive two quantities from this single model: a conditional retrieval entropy that quantifies aleatoric ambiguity with a decision-theoretic interpretation via a Fano-type bound, and a marginal-typicality epistemic score justified by an exact chain-rule decomposition of the joint NLL. This decomposition isolates a cross-modal pointwise-mutual-information term that is structurally discriminative rather than epistemic, and is empirically the only consistently uninformative standalone component. Empirically, the entropy tracks Recall@1 with near-ideal monotonic calibration across three retrieval benchmarks in both directions, and the marginal-typicality sum yields consistently calibrated selective accuracy across four zero-shot classification benchmarks.

[LG-46] Contextual Bandits for Resource-Constrained Devices using Probabilistic Learning

链接: https://arxiv.org/abs/2605.13346
作者: Marco Angioli,Kevin Johansson,Antonello Rosato,Amy Loutfi,Denis Kleyko
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contextual bandits (CB) are online sequential decision-making problems under partial feedback that underpin many adaptive services. There is a growing demand to deploy CB agents directly on-device, under strict constraints on memory, compute, and energy. However, standard linear CB algorithms are often impractical for resource-constrained devices with their unfavorable scaling in computational and memory costs. Recently, HD-CB, a CB approach based on hyperdimensional computing principles, has been proposed to model and solve CB problems by moving into high-dimensional spaces. HD-CB offers faster convergence, favorable scalability, and improves memory efficiency compared to linear CB algorithms. However, its learning rule is accumulation-based: the values of action vectors grow over time, requiring high precision. While periodic binarization can prevent overflow in low-precision components, it may discard important information about magnitudes and degrade decision quality. This paper introduces probabilistic HD-CB, a low-precision variant that replaces deterministic accumulation with a probabilistic update rule. At each step, only a random subset of vector components is updated, with a time-decaying update probability, and component values are constrained to a predefined range [-k,+k]. This approach enables low-precision components, prevents overflow without periodic binarization, and reduces the expected update cost in proportion to the fraction of updated components. Off-policy evaluation on standardized synthetic CB benchmarks using the Open Bandit Pipeline shows that probabilistic HD-CB consistently outperforms binarized HD-CB at equal precision, while approaching the performance of HD-CB with as few as 3 bits per component.
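
The probabilistic update rule is simple enough to sketch directly: each vector component is touched only with a time-decaying probability, and values are clipped to [-k, +k], so low-precision storage never overflows. The decay schedule 1/(1 + 0.1·step) and k = 3 below are illustrative constants, not the paper's tuned values.

```python
import random

random.seed(1)

def prob_update(vec, delta_sign, p_update, k=3):
    """Probabilistic HD-CB-style update: each component is incremented
    by +/-1 (per delta_sign) only with probability p_update, and the
    result is clipped to [-k, +k]. Expected update cost scales with
    the fraction of components actually touched."""
    return [max(-k, min(k, v + s)) if random.random() < p_update else v
            for v, s in zip(vec, delta_sign)]

vec = [0] * 1000
p = 1.0
for step in range(1, 51):
    vec = prob_update(vec, [1] * 1000, p)
    p = 1.0 / (1.0 + 0.1 * step)   # time-decaying update probability

print(max(vec), min(vec))  # values saturate at the clip bound, never overflow
```

Because the range is bounded by construction, components fit in as few as 3 bits (values in [-3, +3]) without the periodic binarization step that deterministic accumulation would require.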

[LG-47] Hierarchical Transformer Preconditioning for Interactive Physics Simulation

链接: https://arxiv.org/abs/2605.13343
作者: Carl Osborne,Minghao Guo,Crystal Owens,Wojciech Matusik
类目: Graphics (cs.GR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 10 pages, 7 figures. Includes supplementary video and material

点击查看摘要

Abstract:Neural preconditioners for real-time physics simulation offer promising data-driven priors, but they often fail to capture long-range couplings efficiently because they inherit local message passing or sparse-operator access patterns. We introduce the Hierarchical Transformer Preconditioner, a neural preconditioner anchored to a weak-admissibility H-matrix partition. The partition provides a multiscale structural prior (dense diagonal leaves plus coarsening off-diagonal tiles) that enables full-graph approximate-inverse computation with O(N) scaling at fixed block sizes. The network models the inverse through low-rank far-field factors and uses highway connections (axial buffers plus a global summary token) to propagate context across transformer depth. At each PCG iteration, preconditioner application reduces to batched dense GEMMs with regular memory access. The key training contribution is a cosine-Hutchinson probe objective that learns the action of MA on convergence-critical spectral subspaces, optimizing angular alignment of MAz with z rather than forcing eigenvalue clusters to a prescribed location. This removes unnecessary spectral-placement constraints from SAI-style objectives and improves conditioning on irregular spectra. Because both inference and apply are dense, dependency-free tensor programs, the full solve loop is captured as a single CUDA Graph. On stiff multiphase Poisson systems (up to 100:1 density contrast, N = 1,024-16,384), the solver runs from ~143 to ~21 fps. At N = 8,192, it reaches 17.9 ms/frame, with 2.2x speedup over GPU Jacobi, ~28x over GPU IC/DILU (AMGX multicolor_dilu), and 2.7x over neural SPAI retrained per scale on the same benchmark.

[LG-48] Shortcut Mitigation via Spurious-Positive Samples

链接: https://arxiv.org/abs/2605.13340
作者: Phuong Quynh Le,Jörg Schlötterer,Sari Sadiya,Gemma Roig,Christin Seifert
类目: Machine Learning (cs.LG)
*备注: preprint

点击查看摘要

Abstract:Shortcut mitigation strategies commonly rely on training data annotations, group-balanced held-out data or the presence of all groups, i.e., all combinations of (spurious) attributes and classes, in the training data. However, these requirements are rarely met in practice. We instead propose a method for targeted model analysis to identify a small set of instances in which the model relies on spurious attributes. Using that set and following "this feature should not be used for prediction" reasoning, we identify highly relevant neurons in an intermediate layer and regularize their impact. This ensures that models learn to depend on informative features rather than being right for the wrong reasons, thereby improving robustness without requiring additional balanced held-out data or annotations.

[LG-49] Context-Aware Web Attack Detection in Open-Source SIEM Systems via MITRE ATT&CK-Enriched Behavioral Profiling

链接: https://arxiv.org/abs/2605.13337
作者: Badr Alboushy,Assef Jafar,Mohamad Aljnidi,Mohamad Bashar Disoki,Aref Shaheed
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 38 pages, 13 figures, 13 tables

点击查看摘要

Abstract:Security Information and Event Management (SIEM) systems aggregate log data from heterogeneous sources to detect coordinated attacks. Traditional rule-based correlation engines struggle to classify multi-step web application attacks because they examine each event without reference to the behavioural history of the originating host. We present Smart-SIEM, an AI module for the open-source Wazuh SIEM platform with two contributions: (1) a per-source-IP behavioural context vector encoding HTTP response-status distributions, peak rule activation counts, and MITRE ATT&CK technique frequencies from the N most recent prior events; (2) a two-stage hybrid cascade combining LightGBM for binary attack detection and XGBoost for six-class attack categorisation. Evaluated on 46,454 purpose-built Wazuh security events, context features improve all tested gradient boosting algorithms from ~0.705 macro F1 to 0.947-0.967 (Stage 1) and 0.876-0.914 (Stage 2), an average gain of +0.254 and +0.324 respectively. The hybrid cascade achieves F1 of 0.967 (binary) and 0.914 (six-class). Wazuh’s native rule engine detects 0% of Brute Force and Broken Authentication events; the AI module detects 100% and 98.3% respectively. A self-adaptive retraining mechanism recovers from concept drift: F1 drops from 0.905 to 0.465 when unseen attack types emerge, recovering to 0.814 after retraining on the combined corpus.
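A rough sketch of the per-source-IP behavioural context vector (field names and bin choices are assumptions, not the paper's schema):

```python
from collections import Counter, deque

def context_vector(events, n=10):
    """Behavioural context for one source IP (illustrative sketch).

    `events` are that IP's prior events; only the n most recent are kept.
    The vector encodes the HTTP response-status distribution, the peak
    per-rule activation count, and MITRE ATT&CK technique frequencies.
    """
    recent = list(deque(events, maxlen=n))  # keep the N most recent events
    total = max(len(recent), 1)
    status = Counter(e["status_class"] for e in recent)
    rules = Counter(e["rule_id"] for e in recent)
    techniques = Counter(e["technique"] for e in recent)
    return {
        "status_dist": {s: status[s] / total for s in ("2xx", "3xx", "4xx", "5xx")},
        "peak_rule_count": max(rules.values(), default=0),
        "technique_freq": dict(techniques),
    }

ctx = context_vector([
    {"status_class": "4xx", "rule_id": 31101, "technique": "T1110"},
    {"status_class": "4xx", "rule_id": 31101, "technique": "T1110"},
    {"status_class": "2xx", "rule_id": 31100, "technique": "T1190"},
])
```

In the paper this history-derived vector is fed, together with per-event features, into the LightGBM/XGBoost cascade; the sketch only shows how behavioural history enters the feature space.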

[LG-50] Embodied Neurocomputation: A Framework for Interfacing Biological Neural Cultures with Scaled Task-Driven Validation

链接: https://arxiv.org/abs/2605.13315
作者: Johnson Zhou,Daniel Tanneberg,Forough Habibollahi,Alon Loeffler,Kiaran Lawson,Valentina Baccetti,Kwaku Dad Abu-Bonsrah,Candice Desouza,Finn Doensen,Bradley Watmuff,Daria Kornienko,Azin Azadi,Justin Leigh Bourke,Bernhard Sendhoff,Brett J. Kagan
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Biological neural networks (BNNs) have been established as a powerful and adaptive substrate that offer the potential for incredibly energy and data efficient information processing with distinct learning mechanisms. Yet a core challenge to utilizing BNN for neurocomputation is determining the optimal encoding and decoding mechanisms between the traditional silicon computing interface and the living biology. Here, we propose an Embodied Neurocomputation framework as a systems-level approach to this multi-variable optimization encoding/decoding problem. We operationalize this approach through the first large-scale parameter optimization of encoding configurations for a BNN agent performing closed-loop navigation along an odor-style gradient in a simulated grid-world. Despite the relative simplicity of the task, the biological interactions gave rise to a massive multi-combinatorial search space for optimal parameters. By considering how the components of the system are interconnected and parameterized, we evaluated approximately 1,300 parameter combinations, over 4,000 hours of real-time agent-environment interactions, to identify 12 configurations that consistently demonstrated learning across multiple episodes. These configurations achieved significantly higher task performances than optimized silicon-based DQN agents under the same interaction budget. These findings represent an initial step toward robust and scalable goal-oriented learning using BNNs. Our framework establishes a foundation for applying task-driven neurocomputing and supports the development of field-wide benchmarks. In the long term, this work supports the development of hybrid bio-silicon architectures capable of efficient, adaptive and real-time computation, including the potential for robotic control applications.

[LG-51] Supervised Deep Multimodal Matrix Factorization for Interpretable Brain Network Analysis

链接: https://arxiv.org/abs/2605.13312
作者: Amjad Seyedi,Lifang He,Songlin Zhao,Akwum Onwunta,Nicolas Gillis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Supervised Deep Multimodal Matrix Factorization (SD3MF), an interpretable framework for integrative brain network analysis that generalizes Symmetric Nonnegative Matrix Tri-Factorization (SNMTF) from unsupervised single-graph clustering to supervised prediction over populations of multimodal graphs. SD3MF learns deep hierarchical factorizations for each modality together with a shared latent representation that aligns subjects across views. An encoder-decoder formulation jointly optimizes graph reconstruction and supervised prediction, while adaptive weights enable data-driven multimodal fusion. By representing each subject through community-level interaction matrices, the model yields interpretable and discriminative features. Experiments on multimodal connectome datasets show that SD3MF consistently outperforms strong deep learning baselines such as CNNs and GNNs, while enabling biologically interpretable insights. Code for reproducibility is available at: this https URL.

[LG-52] MPINeuralODE: Multiple-Initial-Condition Physics-Informed Neural ODEs for Globally Consistent Dynamical System Learning

链接: https://arxiv.org/abs/2605.13305
作者: Lake Yang,Antonio Malpica-Morales,Frank Ioannis Papadakis Wood,Serafim Kalliadasis
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Neural ordinary differential equations (Neural ODEs) often fit training trajectories while generalizing poorly to unseen initial conditions and long horizons. We propose MPINeuralODE, which combines a soft physics-informed residual with a Multiple-Initial-Condition (MIC) multiple-shooting curriculum whose ingredients are structurally complementary: the physics term anchors the vector-field magnitude on the support that MIC enlarges. We evaluate along three axes: out-of-sample error, long-horizon stability, and Hamiltonian drift, which together expose whether the learned dynamics recover the underlying vector field. On Lotka-Volterra, MPINeuralODE achieves the lowest out-of-sample and long-horizon MSE among data-driven methods, with a 26% reduction over the baseline Neural ODE, while essentially matching the PINN ablation on Hamiltonian drift.
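The combined objective amounts to a data-fit term over short multiple-shooting segments plus a weighted physics residual. A toy sketch (names, shapes, and the weighting lam are assumptions, not the paper's implementation):

```python
def mic_pinode_loss(traj_pred, traj_true, f_pred, f_physics, lam=0.1):
    """MIC multiple-shooting data loss + soft physics residual (sketch).

    traj_pred/traj_true: predicted vs. observed trajectory segments,
    each started from its own initial condition (multiple shooting).
    f_pred/f_physics: learned vector-field samples vs. the known physics.
    """
    def mse(a, b):
        sq = [(x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v)]
        return sum(sq) / len(sq)

    return mse(traj_pred, traj_true) + lam * mse(f_pred, f_physics)

loss = mic_pinode_loss(
    traj_pred=[[1.0, 0.9], [0.5, 0.6]],
    traj_true=[[1.0, 1.0], [0.5, 0.5]],
    f_pred=[[0.1, 0.0]],
    f_physics=[[0.0, 0.0]],
)
```

The two terms are structurally complementary, as the abstract argues: the shooting term enlarges the support the model sees, while the physics term anchors the vector-field magnitude on it.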

[LG-53] Safe Bayesian Optimization for Uncertain Correlation Matrices in Linear Models of Co-Regionalization

链接: https://arxiv.org/abs/2605.13302
作者: Jannis Lübsen,Annika Eichler
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted at IFAC WC26

点击查看摘要

Abstract:This paper extends safety guarantees for multi-task Bayesian optimization with uncertain correlation matrices from intrinsic co-regionalization models to linear models of co-regionalization. The latter allows for more flexible modeling of the inter-task correlations by composing multiple features. We derive uniform error bounds for vector-valued functions sampled from a Gaussian process with a linear model of co-regionalization kernel. Furthermore, we show the potential improvement of performance using linear models of co-regionalization in a numerical comparison on a safe multi-task Bayesian optimization benchmark.
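An LMC covariance composes several input kernels, each scaled by its own inter-task correlation matrix; the intrinsic model is the special case of a single latent feature. A minimal sketch (kernel choices and names are illustrative):

```python
import math

def lmc_kernel(x1, x2, tasks, base_kernels, B_list):
    """Linear model of co-regionalization covariance entry (sketch).

    K((x1, t1), (x2, t2)) = sum_q B_q[t1][t2] * k_q(x1, x2):
    each latent feature q has its own input kernel k_q, scaled by a
    (positive semidefinite) inter-task correlation matrix B_q.
    """
    t1, t2 = tasks
    return sum(B[t1][t2] * k(x1, x2) for k, B in zip(base_kernels, B_list))

rbf = lambda a, b: math.exp(-0.5 * (a - b) ** 2)
lin = lambda a, b: a * b
B1 = [[1.0, 0.5], [0.5, 1.0]]  # feature 1: correlated tasks
B2 = [[0.2, 0.0], [0.0, 0.2]]  # feature 2: task-independent
val = lmc_kernel(1.0, 1.0, tasks=(0, 1), base_kernels=[rbf, lin], B_list=[B1, B2])
```

With a single B the sum collapses to the intrinsic co-regionalization model, which is exactly the restriction the paper lifts.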

[LG-54] PaMM: Periodic Motif Memory for Atomistic Models with an Explicit Local-Structure Interface

链接: https://arxiv.org/abs/2605.13297
作者: Ryan Dong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Periodic crystals repeatedly instantiate similar local coordination motifs across translated cells and chemically related structures, but current equivariant atomistic models usually encode these patterns only implicitly in dense edge features. We introduce PaMM, a periodic motif memory that augments the UMA eSCN-MD edge encoder with explicit pair and triplet lookup features. Pair motifs are keyed by (Z_j, Z_i, b_r) and triplet motifs by (Z_j, Z_i, Z_k, b_\theta) , hashed into fixed-size tables and fused with the baseline edge representation through lightweight gate-only and affine-equipped variants. We evaluate PaMM in a matched UMA-S OMAT setting and focus on a narrow question: whether explicit motif memory helps at a fixed intermediate training budget. At the 10k-step checkpoint, both PaMM variants improve over the plain baseline; gate-only gives the best energy MAE, while the affine-equipped variant gives the best force MAE. A matched 20k follow-up keeps the same operating-point picture. Aligned controls show that the gain weakens for pair-only, triplet-only, random-bucket, and parameter-matched MLP alternatives, suggesting that the benefit is tied to structured pair/triplet organization rather than generic added capacity. A within-OMAT24 source-family evaluation also shows small but consistent gains across held-out generation families. We therefore make a focused claim: in the studied UMA-S + OMAT regime, explicit pair/triplet motif memory is a useful inductive bias for periodic atomistic modeling. We do not claim broad cross-dataset transfer, a uniquely preferred fusion variant, or strong scientific interpretability beyond a more inspectable local-structure interface.
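The hashed lookup can be sketched directly; the hash function and table size are assumptions (the paper only specifies fixed-size tables keyed by the discretized motifs):

```python
TABLE_SIZE = 4096

def motif_index(key, table_size=TABLE_SIZE):
    """Map a discretized motif key to a fixed-size table bucket (sketch).

    Pair motifs are keyed by (Z_j, Z_i, b_r); triplet motifs by
    (Z_j, Z_i, Z_k, b_theta), with b_r / b_theta the radial and
    angular bin indices.
    """
    return hash(key) % table_size

# A tiny motif table of (learnable) feature vectors, initialized to zero.
table = [[0.0] * 8 for _ in range(TABLE_SIZE)]

pair_feat = table[motif_index((8, 6, 3))]     # e.g. an O-C pair, radial bin 3
trip_feat = table[motif_index((8, 6, 1, 5))]  # e.g. an O-C-H triplet, angular bin 5
```

Collisions are accepted by design: hashing keeps memory fixed while still giving recurring local environments a shared, reusable slot to fuse into the edge representation.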

[LG-55] Byzantine-Robust Distributed Sparse Learning Revisited

链接: https://arxiv.org/abs/2605.13283
作者: Yuxuan Wang,Lixin Zhang,Kangqiang Li
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We revisit Byzantine-robust distributed estimation for high-dimensional sparse linear models. By combining local ℓ1-regularized robust estimation with robust aggregation at the server, the framework applies to pseudo-Huber regression, quantile regression, and sparse SVM. We show that the resulting estimators yield non-asymptotic guarantees and attain near-optimal statistical rates under mild conditions, while remaining communication-efficient. Simulations confirm strong robustness in estimation, support recovery and classification accuracy under various Byzantine attacks.
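The server-side robust aggregation can be as simple as a coordinate-wise median over the workers' local ℓ1-regularized estimates, which tolerates a minority of arbitrarily corrupted workers (the paper's exact aggregator may differ; this is a representative choice):

```python
def coordinate_median(local_estimates):
    """Coordinate-wise median aggregation (illustrative sketch)."""
    d = len(local_estimates[0])
    agg = []
    for j in range(d):
        col = sorted(v[j] for v in local_estimates)
        m = len(col)
        # even count: average the two middle order statistics
        agg.append(col[m // 2] if m % 2 else 0.5 * (col[m // 2 - 1] + col[m // 2]))
    return agg

# Four honest workers near the true coefficients, one Byzantine outlier.
est = coordinate_median([
    [1.0, 0.0], [1.1, 0.0], [0.9, 0.1], [1.0, -0.1], [1e6, -1e6],
])
```

Only one d-dimensional vector per worker crosses the network per round, which is the communication-efficiency the abstract refers to.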

[LG-56] LightSplit: Practical Privacy-Preserving Split Learning via Orthogonal Projections

链接: https://arxiv.org/abs/2605.13265
作者: Mert Cihangiroglu,Alessandro Pegoraro,Phillip Rieger,Antonino Nocera,Ahmad-Reza Sadeghi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Split learning (SL) enables collaborative training by partitioning a neural network across clients and a central server, but the cut-layer interface introduces a key challenge: high-dimensional activations incur substantial communication overhead while exposing representations vulnerable to reconstruction attacks. Existing approaches typically address efficiency or privacy in isolation, relying on additional mechanisms such as sparsification, quantization, or noise injection. We propose LightSplit, which limits information exposure and reduces communication overhead by applying a lightweight fixed orthogonal random projection at the cut layer. Based on Shannon’s information theory, this projection acts as an information bottleneck that restricts instance-specific information and suppresses exploitable per-sample signals. By transmitting low-dimensional projections instead of raw activations, the server operates on lifted representations without requiring architectural modifications, ensuring compatibility with existing SL architectures. By avoiding additional trainable components on the client, the method remains lightweight and suitable for edge devices while preserving end-to-end differentiability via exact gradient propagation. Because the projection is non-invertible, part of the original representation is irreversibly discarded at the client, so LightSplit reduces the information available for reconstruction and limits exposure. We extensively evaluate LightSplit on state-of-the-art benchmarks in both IID and non-IID settings across varying projection dimensions and client scales. Our results show that the method retains more than 95% of the baseline accuracy at up to 32x reduction in transmitted dimensionality while maintaining stable training dynamics.
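Generating a fixed orthogonal projection and applying it at the cut layer takes only a few lines; this pure-Python Gram-Schmidt sketch stands in for whatever sampling the paper actually uses:

```python
import math
import random

def random_orthogonal_rows(k, d, seed=0):
    """k x d matrix with orthonormal rows via Gram-Schmidt (sketch).

    Mapping a d-dim activation to k < d dimensions discards the
    (d - k)-dim complement irreversibly; the projection is fixed
    (not trained), so the client stays lightweight.
    """
    rng = random.Random(seed)
    rows = []
    while len(rows) < k:
        v = [rng.gauss(0, 1) for _ in range(d)]
        for r in rows:  # subtract components along earlier rows
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

def project(P, x):
    """Low-dimensional cut-layer message sent to the server."""
    return [sum(p * a for p, a in zip(row, x)) for row in P]

P = random_orthogonal_rows(k=2, d=6)
z = project(P, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

Because the map is linear with orthonormal rows, gradients propagate exactly through its transpose, consistent with the end-to-end differentiability the abstract claims.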

[LG-57] Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction

链接: https://arxiv.org/abs/2605.13262
作者: Deepak Warrier,Raja Sekhar Pappala
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Modern SMILES-based chemical language models obtain strong MoleculeNet performance by treating SMILES as generic text and compensating with multi-million-molecule self-supervised pretraining. We ask: when a domain carries structural priors as rich as chemistry’s, does it warrant a domain-native transformer rather than a generic one rescued by scale? We answer affirmatively with GM-Net (Geometric Measure Network), a transformer family in which every module is replaced by a sphere-native counterpart, and instantiate it as Chem-GMNet. Three blocks follow: SH-Embedding (tokens as learnable directions on S^{k-1} lifted through a Gegenbauer feature map); DualSKA (a per-head fusion of a linear-time gated Sphere-Flow recurrence whose persistent state we prove is the truncated multipole expansion of the input distribution, and a softmax Sphere-Kernel branch over the same Schoenberg-valid kernel); and SH-FFN (sphere projection → Gegenbauer lift → moment readout). On canonical DeepChem scaffold splits, against same-shape ChemBERTa-2 baselines under the chemberta3-faithful protocol: (i) random-initialised, Chem-GMNet wins on 7 of 10 MoleculeNet endpoints at ~35% fewer parameters; (ii) pretrained on the same 10M-SMILES ZINC corpus as ChemBERTa-2 MLM-10M, it matches or beats the public release on 6 of 8 shared endpoints (5/7 excluding a known ClinTox release anomaly). A (k,L) ablation shows that increasing the sphere dimension from k=8 to k=10 at fixed L=3 lowers ESOL RMSE to 0.938 at scratch, beating pretrained ChemBERTa-2 MLM-10M on this endpoint without any pretraining at all.

[LG-58] Unified generalization analysis for physics informed neural networks

链接: https://arxiv.org/abs/2605.13260
作者: Yuka Hashimoto,Tomoharu Iwata
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Functional Analysis (math.FA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) and their variational counterparts (VPINNs) are neural networks that incorporate physical laws, making them useful for scientific problems. Existing generalization analyses for PINNs and VPINNs remain limited, often requiring restrictive assumptions such as stability conditions or linear ellipticity. In this paper, we derive generalization bounds for neural networks that involve differentiation with respect to input variables, covering PINNs and VPINNs under a unified framework. We apply Taylor expansion to represent nonlinear differential operators as linear operators on a high-dimensional space, enabling the use of Koopman-based analysis and showing that high-rank networks can generalize well even in settings involving differential operators. We also show that the nonlinearity of the differential operator exponentially enlarges the bound, highlighting its significant impact on generalization.

[LG-59] EMO: Frustratingly Easy Progressive Training of Extendable MoE

链接: https://arxiv.org/abs/2605.13247
作者: Linghao Jin,Chufan Shi,Huijuan Wang,Nuan Wen,Zhengzhong Liu,Eric Xing,Xuezhe Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, as per-token FLOPs depend only on k active experts rather than the total pool of E experts. Yet, this asymmetry creates an MoE efficiency paradox in practice: adding more experts balloons memory and communication costs, making actual training inefficient. We argue that this bottleneck arises in part because current MoE training allocates too many experts from the beginning, even though early-stage data may not fully utilize such capacity. Motivated by this, we propose EMO, a simple progressive training framework that treats MoE capacity as expandable memory and grows the expert pool over the course of training. EMO explicitly models sparsity in scaling law to derive stage-wise compute-optimal token budgets for progressive expansion. Empirical results show that EMO matches the performance of a fixed-expert setup in large-scale experiments while improving wall-clock efficiency. It offers a surprisingly simple yet effective path to scalable MoE training, preserving the benefits of large expert pools while reducing both training time and GPU cost.
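The "expandable memory" idea reduces to a stage-wise growth plan over the token budget. A placeholder sketch with a linear schedule (EMO instead derives stage budgets from a sparsity-aware scaling law; all names here are assumptions):

```python
def expert_schedule(total_tokens, stages, e_start, e_final):
    """Stage-wise expert-pool growth for progressive MoE training (sketch).

    Splits the token budget into equal stages and grows the expert pool
    linearly from e_start to e_final (stages must be >= 2).
    """
    per_stage = total_tokens // stages
    plan = []
    for s in range(stages):
        experts = e_start + round(s * (e_final - e_start) / (stages - 1))
        plan.append({"stage": s, "tokens": per_stage, "experts": experts})
    return plan

plan = expert_schedule(total_tokens=1_000_000, stages=4, e_start=8, e_final=64)
```

Per-token FLOPs stay fixed throughout (only k experts are active), while memory and all-to-all communication costs grow only once later stages actually need the capacity.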

[LG-60] When and Why is Optimistic Multiplicative Weights Slow? The Geometry of Energy Dissipation

链接: https://arxiv.org/abs/2605.13242
作者: John Lazarsfeld,Anas Barakat,Georgios Piliouras,Antonios Varvitsiotis,Andre Wibisono
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies the convergence of the Optimistic Multiplicative Weights Update algorithm (OMWU) in two-player zero-sum games. Recent works have identified instances on which the last-iterate of OMWU can converge arbitrarily slowly, but understanding when and why this slow convergence occurs has remained open. In this work, we develop a new analysis framework that gives sharp, quantitative explanations for this behavior. Our analysis is based on viewing the algorithm’s dual iterates as an optimistic skew-gradient descent with respect to an energy function. We prove over the dual iterates that energy is dissipative, and by establishing tight bounds on the magnitude of dissipation, our analysis quantifies the geometric bottlenecks that arise when the corresponding primal iterates are close to the simplex boundary. This further translates into a new linear last-iterate convergence rate in KL divergence on games with a unique and interior Nash equilibrium. Compared to prior work, this new rate contains a much sharper dependence on game-specific constants, and we prove this dependence is optimal. Moreover, these geometric insights further translate into new separations on uniform convergence rates for OMWU. On the one hand, we prove constant lower bounds on the uniform best-iterate convergence rate in KL divergence and total variation distance from Nash. On the other hand, we establish for the 2×2 setting a new Õ(T^{-1/2}) best-iterate rate in duality gap, improving substantially over prior work. Together, this shows in general that uniform convergence rate guarantees do not transfer across different measures of distance to Nash.
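For reference, the standard OMWU iteration the paper analyzes extrapolates the payoff vector from the last two observations before the multiplicative update:

```python
import math

def omwu_step(x, u_now, u_prev, eta=0.1):
    """One OMWU step on the simplex.

    The optimistic payoff 2*u_now - u_prev predicts the next payoff
    from the last two observed ones; weights are then renormalized.
    """
    w = [xi * math.exp(eta * (2 * un - up))
         for xi, un, up in zip(x, u_now, u_prev)]
    s = sum(w)
    return [wi / s for wi in w]

# Row player in matching pennies, after observing payoff vector (1, -1).
x = omwu_step([0.5, 0.5], u_now=[1.0, -1.0], u_prev=[0.0, 0.0])
```

The paper's energy-dissipation analysis concerns the dual (log-space) view of exactly this recursion, and its bottlenecks appear when x approaches the simplex boundary.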

[LG-61] Mix Dont Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

链接: https://arxiv.org/abs/2605.13225
作者: Paul Jeha,Anastasiia Sedova,Louis Béthune,Skyler Seto,Jes Frellsen,Pierre Ablin,Natalie Schluter
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:For most languages of the world, language model pre-training operates in a data-constrained regime where models must repeat their training data many times, degrading generalization. Two remedies exist: aggressive hyperparameter tuning such as high weight decay, and mixing in data from a high-resource auxiliary language to directly aid the low-resource target. While hyperparameter tuning regularizes the model by shrinking weights to restrict network capacity, auxiliary data mixing uses a tunable mixing ratio to expand the training distribution and diversify the training signal with new knowledge. Both offer a principled way to improve training in a data-constrained domain. We compare these levers systematically across four model scales from 150M to 1.43B parameters, using Arabic as the low-resource target and English as the auxiliary, over approximately 1000 pre-training runs. Three findings emerge. First, mixing yields larger improvements than hyperparameter tuning on both validation loss and downstream task accuracy, and the gap grows with model size. Second, we quantify how much mixing helps: it boosts performance by an amount equivalent to 2–3× the unique target data on validation loss and 2–13× on downstream task accuracy, with the gain scaling steeply with model size. Third, this divergence reveals that target-language validation loss systematically underestimates mixing’s value. Mixing regularizes by diversifying the training signal and contributes knowledge the repeated target corpus cannot supply; validation loss captures only the first effect. Our practical recommendations are: mix in a high-resource language, prioritize the mixing ratio over hyperparameter tuning, and transfer hyperparameters from a small proxy model via μP.
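Operationally, "mixing" is just a tunable per-example sampling ratio between the two corpora. A toy sketch (names and the per-example Bernoulli scheme are assumptions; real pipelines typically mix at the token or shard level):

```python
import random

def mixed_batch(target_docs, auxiliary_docs, batch_size, mix_ratio, seed=0):
    """Draw a pre-training batch, taking each example from the
    high-resource auxiliary corpus with probability `mix_ratio`,
    otherwise from the (repeated) low-resource target corpus."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        pool = auxiliary_docs if rng.random() < mix_ratio else target_docs
        batch.append(rng.choice(pool))
    return batch

# Low-resource Arabic target, high-resource English auxiliary, 30% mixing.
batch = mixed_batch(["ar_0", "ar_1"], ["en_0", "en_1"],
                    batch_size=100, mix_ratio=0.3)
```

The paper's recommendation amounts to tuning `mix_ratio` before touching weight decay or other regularization hyperparameters.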

[LG-62] Machine Learning-Driven Multimodal Spectroscopic Liquid Biopsy for Early Multicancer Detection

链接: https://arxiv.org/abs/2605.13218
作者: Alejandro Leonardo García Navarro,Javier Cachón Ortiz,Javier González Colsa,Samuel García Díaz,Carlos Viadero Valderrama
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cancer is one of the leading causes of death worldwide, making the development of rapid, minimally invasive, label-free and scalable diagnostic strategies a major challenge in modern oncology. In this context, spectroscopic liquid biopsy has emerged as a promising alternative, as it enables the holistic characterization of biochemical alterations in biological fluids. In this work, we propose a multimodal spectroscopic liquid biopsy framework for multicancer detection based on the combination of Fourier Transform Infrared (FTIR) spectroscopy, Raman spectroscopy, and Excitation-Emission Matrix (EEM) fluorescence spectroscopy together with Machine Learning (ML) methodologies. Serum samples from breast cancer patients, colorectal cancer patients, and healthy controls were analyzed through the three spectroscopic modalities. After modality-specific preprocessing, low-level data fusion (LLDF) was employed to integrate the complementary biochemical information encoded within the different spectroscopic measurements, and classification was performed using XGBoost models. Seven experimental configurations were evaluated, including the three unimodal approaches, all pairwise bimodal configurations, and the full multimodal approach of FTIR, Raman, and EEM fluorescence. The results show that although several individual modalities achieved high discrimination performance, the multimodal fusion provided the most balanced overall results, reaching a ROC-AUC of 0.997 for breast cancer and 0.994 for colorectal cancer, together with highly balanced sensitivity and specificity values.
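Low-level data fusion (LLDF) here means concatenating the preprocessed per-sample feature vectors of the three modalities before classification (a sketch; the downstream XGBoost step is omitted):

```python
def low_level_fusion(ftir, raman, eem):
    """Concatenate per-sample spectra from all modalities (LLDF sketch)."""
    return [list(f) + list(r) + list(e) for f, r, e in zip(ftir, raman, eem)]

# One serum sample with (toy-sized) FTIR, Raman, and flattened EEM features.
fused = low_level_fusion(
    ftir=[[0.1, 0.2]],
    raman=[[0.3]],
    eem=[[0.4, 0.5, 0.6]],
)
```

Because fusion happens before modeling, a single classifier sees the complementary biochemical signatures of all three modalities at once, which is what the unimodal/bimodal/multimodal comparison in the paper varies.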

[LG-63] Backdoor Channels Hidden in Latent Space: Cryptographic Undetectability in Modern Neural Networks

链接: https://arxiv.org/abs/2605.13214
作者: Marte Eggen,Eirik Reiestad,Kristian Gjøsteen,Inga Strümke
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent cryptographic results establish that neural networks can be backdoored such that no efficient algorithm can distinguish them from a clean model. These guarantees, however, have been confined to stylised architectures of limited practical relevance, leaving open whether comparable undetectability extends to modern, end-to-end trained networks. We construct such an attack mechanism for state-of-the-art architectures, closely aligned to the cryptographic notion of undetectability, by identifying backdoor channels as learned latent directions, and show that the question of undetectability reduces to a hypothesis test between two unknown distributions over model parameters, which we conjecture to be intractable in practice. The consequence of this reframing is significant: if exploitable channels within a network’s latent space are statistically indistinguishable from naturally learned directions, an attacker need not introduce foreign structure but can instead exploit the geometry the network already possesses. Demonstrating the approach on ResNet and Vision Transformer architectures trained on standard image classification datasets, the attack achieves both consistently high success rates with negligible clean accuracy degradation, and resists a comprehensive suite of post-training defences, none of which neutralise the backdoor without rendering the model unusable. Our results establish that cryptographic backdoors need not be artefacts requiring exotic architectures or artificial constructions, but identifiable as latent properties inherent to the geometry of learned representations.

[LG-64] Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning

链接: https://arxiv.org/abs/2605.13207
作者: Stefan Stojanovic,Alexandre Proutiere
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hierarchical reinforcement learning can improve generalization by decomposing long-horizon decision-making into simpler subproblems. However, existing approaches often rely on restrictive design choices, such as fixed temporal abstractions or goal-conditioned objectives, which largely confine them to goal-reaching tasks and limit their applicability to general reward functions. In this paper, we introduce switching successor measures, an extension of successor measures that enables hierarchical control in zero-shot reinforcement learning without additional supervision, fixed horizons, or manually designed subgoals. We show that switching successor measures arise naturally from classical successor measures while preserving their underlying structure. Building on this result, we propose FB \pi -Switch, an algorithm that extracts both a high-level subgoal-selection policy and a low-level control policy directly from forward-backward (FB) representations, allowing hierarchical behavior to emerge from a single learned representation. Experiments on both goal-conditioned and general reward-based tasks show that FB \pi -Switch improves over non-hierarchical baselines and matches state-of-the-art hierarchical methods in goal-conditioned settings. These results demonstrate that structured successor representations provide a flexible foundation for hierarchical zero-shot reinforcement learning beyond goal-reaching tasks. Our project website is available at: this https URL.

[LG-65] A Hybrid Tucker-LSTM Tensor Network Model for SOC Prediction in Electric Vehicles

链接: https://arxiv.org/abs/2605.13200
作者: Han Wang,Ying Wang,Bing Wang
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:Accurate state of charge estimation is critical for the success of electric vehicle battery management strategies, but it is well known that conventional estimators suffer from two fundamental shortcomings: cumulative errors that grow over time and reliance on simplified battery models that do not reflect real world dynamics. Therefore, this paper presents a novel hybrid approach combining Tucker tensor decomposition with LSTM networks, using full-lifecycle EV field data for SOC prediction. The inputs are charge status, mileage, voltage, current, cell differentials, and temporal features. Tucker decomposition is skillfully used to reduce dimensionality while maintaining the temporal structure, hence allowing a direct, fair comparison with standard LSTM. The result is unequivocal: Tucker-LSTM outperforms the baseline on all metrics, with MSE dropping 70.5% (from 21.07 to 6.22), MAE improving 48.7% (from 3.37% to 1.73%), RMSE falling from 4.59% to 2.49%, and R^2 rising from 0.918 to 0.976. Since the experimental results demonstrably demonstrate that tensor decomposition compresses high-dimensional battery data very well without loss of predictive fidelity, this paper naturally opens up a new direction for tensor-based analytics in electric vehicle battery management.

[LG-66] Do Heavy Tails Help Diffusion? On the Subtle Trade-off Between Initialization and Training

链接: https://arxiv.org/abs/2605.13175
作者: Hamza Cherkaoui,Hélène Halconruy,Antonio Ocello
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent works have proposed incorporating heavy-tailed (HT) noise into diffusion- and flow-based generative models, with the goals of better recovering the tails of target distributions and improving generative diversity. This motivation is intuitive: if the data are heavy-tailed, HT noise may appear better matched than light-tailed (LT) Gaussian noise. However, replacing Gaussian noise by HT noise also changes the underlying estimation problem. In this paper, we revisit this paradigm through a combined theoretical and empirical study, establishing sampling-error bounds for two representative diffusion models driven by HT and LT noise. We show that HT noise makes the statistical estimation problem harder, leading to less favorable sampling-error bounds. We support these findings with experiments on synthetic and real-world datasets, empirically recovering the predicted error trade-off. Our results call into question a growing design trend in generative modeling and challenge the use of HT noise to improve rare-region exploration.
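The tail mismatch at the heart of this paper is easy to see numerically. A small sketch (ours, not the authors' code) comparing the tail mass of Gaussian versus Student-t noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
gauss = rng.standard_normal(n)           # light-tailed (LT) noise
heavy = rng.standard_t(df=3, size=n)     # heavy-tailed (HT) Student-t, 3 dof

thresh = 5.0
p_gauss = np.mean(np.abs(gauss) > thresh)
p_heavy = np.mean(np.abs(heavy) > thresh)
print(p_gauss, p_heavy)   # HT noise puts far more mass past 5 sigma
```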

[LG-67] Continual Fine-Tuning of Large Language Models via Program Memory

链接: https://arxiv.org/abs/2605.13162
作者: Hung Le,Svetha Venkatesh
类目: Machine Learning (cs.LG)
*备注: 18 pages, preprint

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT), particularly Low-Rank Adaptation (LoRA), has become a standard approach for adapting Large Language Models (LLMs) under limited compute. However, in continual settings where models are updated sequentially with small datasets, conventional LoRA updates struggle to balance rapid adaptation and knowledge retention. Existing methods typically treat the low-rank space as a homogeneous update region, lacking mechanisms to regulate how short-term updates are consolidated over time. We propose a continual LoRA framework with Program memory, inspired by Complementary Learning Systems in neuroscience. Our approach, dubbed ProCL, organizes LoRA adapters into structured program memory slots that are dynamically retrieved through input-conditioned attention. This enables rapid and localized adaptation, encouraging similar inputs to reuse shared adapter regions while reserving unused capacity for future data. The slots are then combined with the underlying adapter, which maintains a distributed representation that gradually accumulates knowledge across tasks to balance plasticity and stability. Our method operates entirely within the LoRA parameterization and incurs no additional inference cost. Experiments on diverse benchmarks demonstrate improved retention and reduced catastrophic forgetting over other continual LoRA strategies.

[LG-68] Collaborating in Multi-Armed Bandits with Strategic Agents

链接: https://arxiv.org/abs/2605.13145
作者: Idan Barnea,Ofir Schlisselberg,Yishay Mansour
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study collaborative learning in multi-agent Bayesian bandit problems, where strategic agents collectively solve the same bandit instance. While multiple agents can accelerate learning by sharing information, strategic agents might prefer to free-ride and avoid exploration. We consider a setting with persistent agents that participate in multiple time periods. This is in contrast to most previous works on incentives in multi-agent MAB, which assume short-lived agents, namely each agent has a single decision to make and optimizes their expected reward in that single decision. As in the multi-agent MAB model with incentives, our model does not have monetary transfers, and the only incentives are through information sharing. We propose CAOS, a mechanism that sustains collaboration as a Nash equilibrium while achieving strong regret guarantees. Our results demonstrate that collaborative exploration can be sustained purely through information sharing, achieving performance close to that of fully cooperative systems despite strategic behavior.

[LG-69] On the Generalization of Knowledge Distillation: An Information-Theoretic View

链接: https://arxiv.org/abs/2605.13143
作者: Bingying Li,Haiyun He
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 6 pages, accepted at ISIT 2026

点击查看摘要

Abstract:Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher’s generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher’s local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.
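In the linear Gaussian case the abstract mentions, the distillation divergence reduces to a Kullback-Leibler divergence between Gaussian kernels. A sketch of that closed form (our illustration, with made-up means and scales):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) ) in closed form."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# toy teacher vs. student update kernels (hypothetical parameters)
d = kl_gauss(0.0, 1.0, 0.5, 1.5)
print(d)   # strictly positive whenever the two kernels differ
```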

[LG-70] Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

链接: https://arxiv.org/abs/2605.13138
作者: Nils Loose,Joseph Bienhüls,Kristoffer Hempel,Felix Mächtle,Thomas Eisenbarth
类目: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automated detection of vulnerability-fixing commits (VFCs) is critical for timely security patch deployment, as advisory databases lag patch releases by a median of 25 days and many fixes never receive advisories. We present a comprehensive evaluation of code-language-model-based VFC detection through a unified framework consolidating over 20 fragmented datasets spanning more than 180,000 commits. Across over 180 experiments with fine-tuned models from 125M to 14B parameters, we find no evidence that models acquire transferable security-relevant code understanding from code changes alone. When commit messages are available, they dominate model attention, and when removed, an attribution analysis shows that enriching diffs with additional intra-procedural semantic context does not shift model attention toward the code changes. Group-stratified evaluation exposes approximately 17% performance drops compared to random splits, while temporal splits on aggregated datasets prove unreliable due to compositional shift in the underlying project distributions. At a false positive rate of 0.5%, all fine-tuned code-only models miss over 93% of vulnerabilities. Larger and more diverse training data or generative approaches show preliminary improvements but do not resolve the underlying limitations. To support future research on code-centric VFC detection, we release our unified framework and evaluation suite.

[LG-71] KAST-BAR: Knowledge-Anchored Semantically-Dynamic Topology Brain Autoregressive Modeling for Universal Neural Interpretation

链接: https://arxiv.org/abs/2605.13133
作者: Haoning Wang,Wenchao Yang,Shuai Shen,Yang Li
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:While EEG foundation models have shown significant potential in universal neural decoding across tasks, their advancement remains constrained by inadequate modeling of complex spatiotemporal topology, as well as the inherent modality gap between low-level physiological signals and high-level textual semantics. To address these challenges, we propose a Knowledge-Anchored Semantically-Dynamic Topology Brain Autoregressive Model (KAST-BAR), which dynamically aligns physiological representations derived from multi-level brain topology with an expert-level semantic space. Specifically, we design a Dual-Stream Hierarchical Attention (DSHA) encoder that accurately captures the brain's intrinsic non-Euclidean topology by jointly modeling local temporal dynamics and global spatial contexts. On this basis, a Knowledge-Anchored Semantic Profiler (KASP) is proposed to synthesize physically-grounded and instance-level textual profiles, which subsequently drive a Semantic Text-Aware Refiner (STAR) to dynamically reconstruct EEG representations using Latent Expert Queries. By conducting large-scale pre-training on 21 diverse datasets to build a foundation model, KAST-BAR effectively integrates expert-level medical knowledge into EEG signal representations, consistently achieving superior performance across six downstream tasks. Our code is available at this https URL

[LG-72] ERPPO: Entropy Regularization-based Proximal Policy Optimization

链接: https://arxiv.org/abs/2605.13131
作者: Changha Lee,Gyusang Cho
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Multi-Agent Proximal Policy Optimization (MAPPO) is a variant of the Proximal Policy Optimization (PPO) algorithm, specifically tailored for multi-agent reinforcement learning (MARL). MAPPO optimizes cooperative multi-agent settings by employing a centralized critic with decentralized actors. However, in multi-dimensional environments, MAPPO cannot extract an optimal policy due to non-stationary agent observations. To overcome this problem, we introduce a novel approach, Entropy Regularization-based Proximal Policy Optimization (ERPPO). For policy optimization, we first define object detection ambiguity in multi-dimensional observation environments. A Distributional Spatiotemporal Ambiguity (DSA) learner is trained to estimate object detection uncertainty under non-stationary constraints. We then enhance PPO with a novel entropy regularization term. This regularization dynamically adjusts the policy update by applying stronger (L1) regularization under high-ambiguity observations to encourage significant exploratory actions and weaker (L2) regularization under low-ambiguity observations to stabilize the proximal policy optimization. This approach is designed to enhance the probability of successful object localization in time-critical operations by reducing detection failures and optimizing the search policy. Experiments on a testbed with AirSim-based maritime search scenarios show that the proposed ERPPO improves accuracy and learns faster than MAPPO. Qualitative results confirm ERPPO's effectiveness in suppressing false detections under visually uncertain conditions.
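The ambiguity-switched L1/L2 regularization described above can be sketched as a tiny helper. The threshold and coefficient below are hypothetical values of our choosing, not settings from the paper:

```python
import numpy as np

def entropy_reg(params, ambiguity, thresh=0.5, lam=1e-2):
    """Ambiguity-switched regularizer (sketch): strong L1 under high
    ambiguity to push sparse, exploratory updates; weak L2 under low
    ambiguity to stabilize the proximal update. thresh/lam are made up."""
    p = np.asarray(params)
    if ambiguity > thresh:
        return lam * np.abs(p).sum()      # L1 branch (high ambiguity)
    return 0.5 * lam * (p ** 2).sum()     # L2 branch (low ambiguity)

w = np.array([0.5, -0.2, 0.1])
print(entropy_reg(w, ambiguity=0.9), entropy_reg(w, ambiguity=0.1))
```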

[LG-73] DiffusionHijack: Supply-Chain PRNG Backdoor Attack on Diffusion Models and Quantum Random Number Defense

链接: https://arxiv.org/abs/2605.13115
作者: Ziyang You,Liling Zheng,Xiaoke Yang,Xuxing Lu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Diffusion models depend on pseudo-random number generators (PRNGs) for latent noise sampling. We present DiffusionHijack, a supply-chain backdoor attack that hijacks the PRNG to deterministically control generated images. A malicious PRNG, injected via compromised packages, forces pixel-perfect reproduction of attacker-chosen content (SSIM = 1.00, N = 100 trials) on Stable Diffusion v1.4, v1.5, and SDXL – without modifying model weights. The attack is inherently undetectable by existing model auditing and content moderation mechanisms, as it operates entirely outside the neural network computation graph. The attack remains effective under stochastic sampling (η > 0), bypasses CLIP-based safety checkers (98-100% success), and operates independently of the user's prompt. As a countermeasure, we replace the PRNG with a quantum random number generator (QRNG), which provides information-theoretic unpredictability. Across N = 100 prompt-model combinations, QRNG defense completely neutralizes the attack, reducing output similarity to random baseline levels (SSIM < 0.20 for SD 1.x models, < 0.45 for SDXL). This work exposes a previously overlooked supply-chain vulnerability and offers a hardware-level fundamental mitigation for generative AI systems.
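The core mechanism, a seeded PRNG making "random" latents fully deterministic, is reproducible in a few lines. A toy stand-in (ours), not the actual Stable Diffusion sampler or the paper's attack code:

```python
import numpy as np

def latent_noise(seed=None):
    """Stand-in for a diffusion model's initial-latent sampler."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((4, 8, 8))

# A hijacked PRNG pins the seed, so every "random" latent is identical...
a = latent_noise(seed=1234)
b = latent_noise(seed=1234)
# ...while an honest entropy source yields fresh noise on each call.
c = latent_noise()
d = latent_noise()
print(np.allclose(a, b), np.allclose(c, d))
```

A QRNG defense amounts to forcing the `seed=None` path with hardware entropy that no package-level patch can pin.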

[LG-74] Bayesian Nonparametric Mixed-Effect ODEs with Gaussian Processes

链接: https://arxiv.org/abs/2605.13088
作者: Julien Martinelli,Maksim Sinelnikov,Harri Lähdesmäki,Quentin Clairon,Mélanie Prague
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dynamical modelling is central to many scientific domains, including pharmacometrics, systems biology, physiology, and epidemiology. In these settings, heterogeneity is often intrinsic: different subjects or units follow related but distinct continuous-time dynamics. Classical nonlinear mixed-effects Ordinary Differential Equation (ODE) models address this by combining population-level structure with subject-specific effects, but they rely on a parametric vector field and are therefore vulnerable to structural misspecification and unmodelled mechanisms. This motivates nonparametric approaches that can retain principled uncertainty quantification, yet existing nonparametric ODE methods typically assume a single shared dynamical system rather than an explicit mixed-effect hierarchy over subject-specific dynamics. We propose MEGPODE, a Bayesian nonparametric mixed-effect ODE model in which each subject's vector field is decomposed into a shared population component and a subject-specific deviation, both endowed with Gaussian process (GP) priors. To avoid repeated ODE solves per subject during training, we combine state-space GP trajectory priors with virtual collocation observations, yielding Kalman-smoothing trajectory updates and closed-form regressions for the vector fields. Across controlled heterogeneous ODE benchmarks spanning oscillatory and biomedical systems, MEGPODE improves population-field recovery and subject-level trajectory prediction relative to strong baselines.

[LG-75] Local Inverse Geometry Can Be Amortized

链接: https://arxiv.org/abs/2605.13068
作者: Aaditya L. Kachhadiya
类目: Machine Learning (cs.LG)
*备注: Preprint. 21 pages, 8 figures, 8 tables. Code available at this https URL

点击查看摘要

Abstract:Nonlinear inverse problems often trade inexpensive but fragile first-order updates against curvature-aware methods such as Gauss-Newton and Levenberg-Marquardt, which obtain stronger directions by repeatedly solving Jacobian-based linearized systems. We propose a learned alternative: amortize local inverse geometry into a reusable reverse operator. Our framework learns a bidirectional surrogate, Deceptron, and deploys it through D-IPG (Deceptron Inverse-Preconditioned Gradient), an iterative solver that pulls residual-corrected measurement-space proposals back to latent space. The key mechanism is a Jacobian Composition Penalty (JCP), which trains the reverse Jacobian to act as a local left inverse of the forward Jacobian; its runtime counterpart, RJCP, measures the same inverse-consistency error along optimization trajectories. We prove that D-IPG is first-order equivalent to damped Gauss-Newton under local pseudoinverse consistency, with deviation controlled by composition error and conditioning. Across seven PDE inverse-problem benchmarks, D-IPG outperforms standard baselines, achieves 94.8% mean success across the six-problem reliability suite, and reaches comparable or better recovery quality at up to 77x lower inference-time solve cost on the main benchmarks.
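For a linear forward map, the Jacobian Composition Penalty described above is zero exactly when the reverse Jacobian is a left pseudoinverse of the forward one. A minimal sketch (ours, with random matrices standing in for learned Jacobians):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5                          # latent dim < measurement dim
J_fwd = rng.standard_normal((m, n))  # forward Jacobian
J_rev = np.linalg.pinv(J_fwd)        # ideal reverse: left pseudoinverse

# Jacobian Composition Penalty: the reverse Jacobian should act as a
# local left inverse of the forward one, i.e. J_rev @ J_fwd ≈ I.
jcp = np.linalg.norm(J_rev @ J_fwd - np.eye(n)) ** 2
print(jcp)   # ~0 for an exact left inverse
```

In training, `J_rev` would come from the learned reverse network and this penalty would be minimized; at runtime the same quantity (RJCP) diagnoses inverse consistency along the trajectory.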

[LG-76] Ergodic Trajectory Design by Learned Pushforward Maps: Provable Coverage via Conditional Flow Matching

链接: https://arxiv.org/abs/2605.13063
作者: Ehsan Aghazadeh,Masoud Malekzadeh,Ahmad Ghasemi,Hossein Pishro-Nik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Designing continuous trajectories whose time-averaged occupancy provably matches a prescribed spatial density (the ergodic coverage problem) is central to UAV-assisted data collection and sensing, robotic exploration, and mobile monitoring. For flying agents in particular, this challenge is acute: trajectories must balance coverage fidelity against tight energy budgets, no-fly zones, and acceleration limits. Existing methods either re-optimize each trajectory online (with cost growing in the horizon and re-running for every target, agent, and realization) or rely on bespoke analytical constructions that must be re-derived for each new constraint. We propose a pushforward framework that decouples ergodicity from density matching: an analytic latent trajectory provides exact uniform ergodicity on a simple annular domain, and a single map, learned offline by optimal-transport conditional flow matching, transports this latent occupancy onto the prescribed target density. The composed trajectory is then asymptotically ergodic with respect to the learned pushforward distribution, with deviation from the target controlled by the flow-matching training loss. Once trained for a given target density and constraint set, the map serves an unbounded number of trajectories and a multi-agent fleet without per-agent retraining, and many differentiable operational constraints (no-fly zones, acceleration ceilings, or fairness penalties) enter as additive soft penalties in the training loss without re-deriving the design. We prove three results (an acceleration-energy bound, an O(1/√K) ergodic convergence rate in the number of trajectory cycles K, and an approximation-error bound) that combine into an end-to-end coverage bound estimable from CFM training diagnostics (certified given an architectural Lipschitz bound on v_θ).

[LG-77] What Information Matters? Graph Out-of-Distribution Detection via Tri-Component Information Decomposition ICML26

链接: https://arxiv.org/abs/2605.13032
作者: Danny Wang,Ruihong Qiu,Zi Huang
类目: Machine Learning (cs.LG)
*备注: ICML26

点击查看摘要

Abstract:Graph neural networks are widely used for node classification, but they remain vulnerable to out-of-distribution (OOD) shifts in node features and graph structure. Prior work established that methods trained with standard supervised learning (SL) objectives tend to capture spurious signals from either features and/or structure, leaving the model fragile under distributional changes. To address this, we propose TIDE, a novel and effective Tri-Component Information Decomposition framework that explicitly decomposes information into feature-specific, structure-specific, and joint components. TIDE aims to preserve only the label-relevant part of the joint information while filtering out spurious feature- and structure-specific information, thereby enhancing the separation between in-distribution (ID) and OOD nodes. Beyond the framework, we provide theoretical and empirical analyses showing that an information bottleneck objective is preferable to standard SL for graph OOD detection, with higher ID confidence and a greater entropy gap between ID and OOD data. Extensive experiments across seven datasets confirm the efficacy of TIDE, achieving up to a 34% improvement in FPR95 over strong baselines while maintaining competitive ID accuracy.

[LG-78] Offline Two-Player Zero-Sum Markov Games with KL Regularization

链接: https://arxiv.org/abs/2605.13025
作者: Claire Chen,Yuheng Zhang,Xinyu Liu,Zixuan Xie,Shuze Daniel Liu,Nan Jiang
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:We study the problem of learning Nash equilibria in offline two-player zero-sum Markov games. While existing approaches often rely on explicit pessimism to address distribution shift, we show that KL regularization alone suffices to stabilize learning and guarantee convergence. We first introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast Õ(1/n) convergence rate under unilateral concentrability, improving over the standard Õ(1/√n) rates in unregularized settings. We then propose Sequential Offline Self-play Mirror Descent (SOS-MD), a practical model-free algorithm based on least-squares value estimation and iterative self-play updates. We prove that the last iterate of SOS-MD attains the same Õ(1/n) statistical rate up to a vanishing optimization error of order Õ(1/√T) in the number of self-play iterations T.

[LG-79] JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

链接: https://arxiv.org/abs/2605.13013
作者: Jing Yu Lim,Rushi Shah,Zarif Ikram,Samson Yu,Haozhe Ma,Tze-Yun Leong,Dianbo Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion world models have recently become competitive for online model-based reinforcement learning, but current approaches expose a tension: pixel diffusion is effective but computationally expensive, while the latest latent diffusion approach improves efficiency yet performs subpar. The latter also relies on separately trained latents rather than the end-to-end world-model objectives that have driven much of modern MBRL progress. In particular, JEPA-style predictive representation learning has emerged as an especially promising direction for world modeling and MBRL. Concurrently, diffusion-style objectives have gained traction across multiple domains, with iterative refinement as a promising approach for multimodal and stochastic targets. Taken together, these trends motivate Joint Embedding DIffusion (JEDI), the first online end-to-end latent diffusion world model. JEDI learns its latent space directly from the diffusion denoising loss with a JEPA framework, using denoising to learn and predict future latents rather than relying on reconstruction and pretrained models. We provide a theoretical motivation showing that conventional JEPA objectives induce a predictive information bottleneck, and that conditional diffusion denoising admits a closely related predictive-compression decomposition. Empirically, JEDI is competitive on Atari100k and outperforms the baseline with separately trained latents where directly comparable. Relative to the pixel diffusion baseline, JEDI uses 43% less VRAM, offers over 3× faster world-model sampling, and trains 2.5× faster. JEDI also exhibits a markedly different task-level performance profile from the pixel baseline, suggesting that end-to-end predictive latents change more than compute alone.

[LG-80] DRIFT: A Benchmark for Task-Free Continual Graph Learning with Continuous Distribution Shifts

链接: https://arxiv.org/abs/2605.12998
作者: Guiquan Sun,Xikun Zhang,Jingchao Ni,Dongjin Song
类目: Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Continual graph learning (CGL) aims to learn from dynamically evolving graphs while mitigating catastrophic forgetting. Existing CGL approaches typically adopt a task-based formulation, where the data stream is partitioned into a sequence of discrete tasks with pre-defined boundaries. However, such assumptions rarely hold in real-world environments, where data distributions evolve continuously and task identity is often unavailable. To better reflect realistic non-stationary environments, we revisit continual graph learning from a task-free perspective. We propose a unified formulation that models the data stream as a time-varying mixture of latent task distributions, enabling continuous modeling of distribution drift. Based on this formulation, we construct DRIFT, a benchmark that spans a spectrum of transition dynamics ranging from hard task switches to smooth distributional drift through a Gaussian parameterization. We evaluate representative continual learning methods under this task-free setting and observe substantial performance degradation compared to traditional task-based protocols. Our findings indicate that many existing approaches implicitly rely on task boundary information and struggle under realistic task-free graph streams. This work highlights the importance of studying continual graph learning under realistic non-stationary conditions and provides a benchmark for future research in this direction. Our code is available at this https URL.
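The Gaussian parameterization of drifting task mixtures described above can be sketched directly. The centers and sigma values below are illustrative choices of ours, not DRIFT's actual settings:

```python
import numpy as np

def drift_weights(t, centers, sigma):
    """Time-varying mixture over latent tasks: task k peaks at centers[k];
    sigma controls transition sharpness (sigma -> 0 recovers hard task
    switches, larger sigma gives smooth distributional drift)."""
    w = np.exp(-0.5 * ((t - np.asarray(centers)) / sigma) ** 2)
    return w / w.sum()

centers = [0.0, 0.5, 1.0]    # three latent tasks over a unit-length stream
print(drift_weights(0.25, centers, sigma=0.1))   # near-hard switch
print(drift_weights(0.25, centers, sigma=0.5))   # smooth drift
```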

[LG-81] Frequency Bias and OOD Generalization in Neural Operators under a Variable-Coefficient Wave Equation

链接: https://arxiv.org/abs/2605.12997
作者: Runlong Xie,An Luo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural operators learn to map initial conditions to the terminal solution of partial differential equations (PDEs), providing a surrogate for the full operator mapping. This enables rapid prediction across different input configurations. While recent neural operator architectures have demonstrated strong performance on diverse PDE tasks, their behavior under structured distribution shifts remains insufficiently understood. To investigate this, we study operator learning in a wave propagation setting governed by a one-dimensional variable-coefficient wave equation, using two representative architectures, the Fourier Neural Operator (FNO) and the Deep Operator Network (DeepONet). To examine their generalization under distribution shifts, we consider structured out-of-distribution (OOD) settings that independently vary input frequency and coefficient smoothness. The results show that under smoothness shifts, both models maintain stable performance, with FNO achieving lower error. In contrast, under frequency shifts, FNO exhibits a sharp increase in error under unseen high-frequency inputs, whereas DeepONet shows milder degradation despite higher overall error. Our analysis reveals that these differences arise from how each architecture represents and responds to variations in frequency structure. Together, these findings highlight a fundamental gap between strong in-distribution performance and generalization under distribution shifts in operator learning, underscoring the role of architectural representation bias in developing more reliable neural operators for physics-based PDE simulations beyond the training distribution.

[LG-82] F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

链接: https://arxiv.org/abs/2605.12995
作者: Rohan Surana,Gagan Mundada,Junda Wu,Xintong Li,Yizhu Jiao,Bowen Jin,Sizhe Zhou,Tong Yu,Ritwik Sinha,Jiawei Han,Jingbo Shang,Julian McAuley
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. However, this flexibility introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full ranked list is generated. Because this feedback is defined over the completed sequence, it cannot distinguish whether a poor result arises from failing to generate a relevant subset or from failing to rank that subset correctly. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. Existing systems often address this by separating candidate generation from ranking. However, such decoupling remains misaligned with downstream utility because ranking is limited by the candidate set it receives. To bridge this gap, we propose a unified framework that performs both within a single autoregressive rollout and optimizes them end-to-end via factorized group-relative policy optimization (F-GRPO). Our framework factorizes the policy into candidate generation and ranking while sharing a single LLM backbone, and jointly trains them with an order-invariant coverage reward and a position-aware utility reward. To address the resulting phase-specific credit assignment problem, we use separate group-relative advantages for generation and ranking within a two-phase sequence-level objective. Across sequential recommendation and multi-hop question answering benchmarks, F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers, with no architectural changes at inference time.
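The group-relative advantage underlying GRPO-style training, which F-GRPO computes separately per phase as the abstract describes, is essentially a within-group standardization. A minimal sketch (ours, with made-up rewards):

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-8):
    """Standardize each rollout's reward within its group, so credit is
    relative to sibling rollouts rather than to a learned critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# F-GRPO (per the abstract) keeps separate group baselines per phase:
gen_adv = group_relative_advantage([0.2, 0.8, 0.5])    # coverage reward
rank_adv = group_relative_advantage([0.1, 0.9, 0.4])   # position-aware reward
print(gen_adv, rank_adv)
```

Keeping the two baselines separate is what lets the update distinguish "bad candidate set" from "bad ordering" within one rollout group.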

[LG-83] DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum

链接: https://arxiv.org/abs/2605.12994
作者: Jihwan Kim,Chenglin Fan
类目: Machine Learning (cs.LG)
*备注: 26 pages

点击查看摘要

Abstract:We study differentially private (DP) training with Muon, a matrix-valued optimizer that updates hidden-layer weights using momentum followed by Newton–Schulz orthogonalization. While DP-SGD is well understood, the interaction between per-example clipping, Gaussian noise, momentum, and nonlinear orthogonalization in Muon has not been systematically analyzed. We formulate DP-Muon, a private Muon procedure that clips per-example matrix gradients, adds Gaussian noise to the clipped lot average, and then applies momentum and Newton–Schulz orthogonalization as post-processing. We prove that DP-Muon inherits the privacy guarantee certified by the corresponding same-lot subsampled Gaussian accountant, with no additional privacy cost from Muon-specific post-processing. On the optimization side, we establish finite-horizon and vanishing stationarity guarantees under per-matrix clipping, with bounds that separate optimization error, clipping residual, privacy noise, and Newton–Schulz approximation error. We further show that the DP-induced bias in Muon arises not in the linear momentum buffer itself, but after the nonlinear Newton–Schulz map, where Gaussian noise induces a matrix-valued heat-smoothing bias. This motivates DP-MuonBC, a bias-corrected variant that removes the leading output-level bias term while preserving the same privacy guarantee. Experiments on E2E and DART show that Muon-style matrix updates improve private fine-tuning, and that DP-MuonBC further improves utility without increasing the privacy budget.
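The clip-noise-orthogonalize pipeline can be sketched end to end. This is a toy NumPy version (ours), with an arbitrary clip norm and noise scale, using a standard first-order Newton-Schulz iteration rather than Muon's tuned polynomial; the orthogonalization is pure post-processing of the privatized average:

```python
import numpy as np

def clip_matrix(G, C):
    """Per-example clipping to Frobenius norm at most C."""
    n = np.linalg.norm(G)
    return G * min(1.0, C / (n + 1e-12))

def newton_schulz(M, steps=40):
    """Newton-Schulz iteration toward the nearest orthogonal factor."""
    X = M / np.linalg.norm(M)          # Frobenius scaling keeps all
    for _ in range(steps):             # singular values in (0, 1]
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
grads = [rng.standard_normal((4, 4)) for _ in range(8)]   # per-example grads
C, sigma = 1.0, 0.5
avg = sum(clip_matrix(g, C) for g in grads) / len(grads)
noisy = avg + (sigma * C / len(grads)) * rng.standard_normal((4, 4))
U = newton_schulz(noisy)               # post-processing: no extra privacy cost
print(np.linalg.norm(U @ U.T - np.eye(4)))   # near-orthogonal output
```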

[LG-84] Decision Tree Learning on Product Spaces ICML2026

链接: https://arxiv.org/abs/2605.12983
作者: Arshia Soltani Moakahr,Faraz Ghahremani,Kiarash Banihashem,MohammadTaghi Hajiaghayi
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*备注: ICML 2026

点击查看摘要

Abstract:Decision tree learning has long been a central topic in theoretical computer science, driven by its practical importance. A fundamental and widely used method for decision tree construction is the top-down greedy heuristic, which recursively splits on the most influential variable. Despite its empirical success, theoretical analysis of this heuristic has been limited. A recent breakthrough by Blanc et al. (ITCS, 2020) provided the first rigorous theoretical guarantees for the greedy approach, but only under the uniform distribution. We extend this analysis to the more general and practically relevant setting of arbitrary product distributions. Our main result shows that for any function f computable by an optimal decision tree of size s , maximum depth D_\textopt , and average depth \Delta_\textopt , the greedy heuristic constructs an \epsilon -approximating tree whose size grows at most with \exp\bigl(\Delta_\textopt D_\textopt \log(e/\epsilon)\bigr) . In the special case where the optimal tree is a full binary tree, this bound improves upon the bound of Blanc et al. and holds under a strictly broader class of distributions. Moreover, we present an algorithm based on the top-down greedy heuristic that is entirely parameter-free – it requires no prior knowledge of the optimal tree’s size or depth – offering a practical advantage over Blanc et al.'s method.

[LG-85] U-HNO: A U-shaped Hybrid Neural Operator with Sparse-Point Adaptive Routing for Non-stationary PDE Dynamics

链接: https://arxiv.org/abs/2605.12965
作者: Yingzhe Ma,Xiao Yang,Yuxin Xie,Zihan Xiong,Jinliang Liu
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 26 pages, 7 figures

点击查看摘要

Abstract:Solutions to many partial differential equations (PDEs) display coexisting smooth global transport and localized sharp features within a single trajectory: shock fronts, thin interfaces, and concentrated high-frequency content sit on top of slowly varying backgrounds. This poses a challenge for neural operators: Fourier-based architectures mix nonlocal interactions efficiently but tend to under-resolve localized non-smooth features, whereas spatially local architectures recover fine detail at the cost of long-range propagation and rollout stability. Existing hybrid operators paper over this tension with a fixed, spatially uniform fusion that forces the same trade-off everywhere. We propose U-HNO, a U-shaped hybrid neural operator whose central design is Sparse-Point Adaptive Routing (SPAR): at every spatial location, a per-pixel hard mask selects whether the global Fourier branch or the local multi-scale Gaussian branch should dominate, and the sparsity ratio is a function of the local contrast of the routing signal, so smooth and shock-aligned regions receive different mixtures of global and local computation. SPAR is embedded in a hierarchical encoder-bottleneck-decoder backbone with skip connections so that the dual branches and the gate operate at every resolution. Training combines pointwise supervision with a finite-difference H^1 gradient term and a band-wise spectral consistency regularizer. Across benchmarks spanning 1D Burgers, Kuramoto-Sivashinsky, KdV, 2D advection, Allen-Cahn, Navier-Stokes, Darcy flow, and 3D transonic compressible Navier-Stokes from PDEBench, U-HNO achieves state-of-the-art rollout accuracy on the majority of tasks in both relative L^2 and H^1 metrics, with the largest gains on problems dominated by sharp localized features. Ablations show that removing any single component substantially degrades rollout error. 
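As a rough illustration of per-pixel hard routing, the sketch below fuses a global and a local branch with a contrast-dependent hard mask. The specific contrast measure and sparsity rule here are invented for illustration and are not the paper's learned SPAR gate.

```python
import numpy as np

def spar_route(global_out, local_out, routing_signal, base_ratio=0.05):
    """Illustrative SPAR-style routing: a per-pixel hard mask lets the local
    branch dominate at high-contrast locations, and the sparsity ratio grows
    with the local contrast of the routing signal."""
    gx, gy = np.gradient(routing_signal)
    contrast = np.hypot(gx, gy)                 # local contrast proxy
    spread = contrast.max() / (contrast.mean() + 1e-12)
    # sharper fields (shocks, thin interfaces) receive more local computation
    ratio = float(np.clip(base_ratio * spread, 0.0, 1.0))
    thresh = np.quantile(contrast, 1.0 - ratio)
    mask = contrast > thresh                    # hard, per-pixel routing decision
    return np.where(mask, local_out, global_out), mask
```

On a step-like field the mask concentrates on the discontinuity, so the local branch handles the shock while the smooth background stays with the global Fourier-style branch.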

[LG-86] Separating Shortcut Transition from Cross-Family OOD Failure in a Minimal Model

链接: https://arxiv.org/abs/2605.12945
作者: Hongmin Li
类目: Machine Learning (cs.LG)
*备注: 14 pages, 3 figures

点击查看摘要

Abstract:Shortcut features are often invoked to explain out-of-distribution (OOD) failure, but training correlation, learned shortcut use, and test-time failure need not coincide. We study a minimal binary model with one invariant coordinate and one family-dependent shortcut coordinate. In the deterministic regime, positive average shortcut correlation pulls logistic ERM toward positive shortcut weight, but ridge regularization keeps the classifier invariant-dominated and prevents deterministic OOD failure. When the invariant coordinate is noisy, ridge-logistic ERM switches to the shortcut rule once the training shortcut signal exceeds the invariant signal. Whether that transition causes failure depends on the held-out family: weaker shortcut correlation yields positive excess risk, and sign-flipped families yield above-chance error. Synthetic checks match these analytic regimes and show that the same training-side transition can have different held-out consequences. The model separates shortcut attraction, shortcut-rule transition, and cross-family OOD failure.
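The minimal two-coordinate setup is easy to reproduce. Below is a synthetic check under assumed parameters (a ridge-logistic fit by plain gradient descent; the correlation and noise levels are chosen for illustration): with strong training shortcut correlation the fitted classifier puts positive weight on the shortcut coordinate, while with zero correlation it does not.

```python
import numpy as np

def ridge_logistic(X, y, lam=0.1, lr=0.5, steps=3000):
    """Ridge-regularized logistic regression fit by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * (X.T @ (p - y) / len(y) + lam * w)
    return w

def simulate(rho, n=4000, noise=0.5, seed=0):
    """Minimal binary model: one noisy invariant coordinate and one
    family-dependent shortcut coordinate with training correlation rho."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, n)
    s = 2.0 * y - 1.0                              # label sign
    x_inv = s + noise * rng.standard_normal(n)     # noisy invariant coordinate
    agree = rng.random(n) < (1.0 + rho) / 2.0
    x_short = np.where(agree, s, -s)               # shortcut coordinate
    return ridge_logistic(np.stack([x_inv, x_short], 1), y)
```

Flipping the shortcut's sign at test time (a sign-flipped family) would then turn that learned positive shortcut weight into above-chance error, the cross-family failure mode the paper separates from the training-side transition.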

[LG-87] Reinforced Collaboration in Multi-Agent Flow Networks

链接: https://arxiv.org/abs/2605.12943
作者: Zheng Wang,Yuang Liu,Yangkai Ding
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-agent systems provide a powerful way to extend large language models (LLMs) by decomposing a complex task into specialized subtasks handled by different agents. However, their performance is often hindered by error propagation, arising from suboptimal workflow design or inaccurate agent outputs, which can propagate through the agent collaboration process and degrade final results. To address the challenges, we present MANGO (Multi-Agent Network Gradient Optimization), a data-driven framework that organizes and refines agent collaboration via a flow network constructed from past successful workflows. MANGO integrates reinforcement learning and textual gradients to jointly optimize workflow paths and agent behaviors, while a skipping mechanism prevents redundant updates to well-optimized agents for improving efficiency. Extensive experiments on seven benchmarks show that MANGO achieves up to 12.8% performance improvement over state-of-the-art baselines, enhances efficiency by 47.4%, and generalizes effectively to unseen domains. Our code and datasets are publicly available at this https URL.

[LG-88] The Efficiency Gap in Byte Modeling

链接: https://arxiv.org/abs/2605.12928
作者: Celine Lee,Jing Nathan Yan,Chen Liang,Jiaxin Shi,Yin Zhang,Jeremiah Liu,Pengcheng Yin,Fernando Pereira,Ed Chi,Derek Cheng,Alexander M. Rush,Ruoxi Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern language models have historically relied on two dominant design choices: subword tokenization and autoregressive (AR) ordering. These design decisions bake in priors that dictate a model’s learning. Recently, two alternative paradigms have challenged this: byte-level modeling, which bypasses static statistically-derived token vocabularies, and masked diffusion modeling (MDM), which conducts parallel, non-sequential generation. Their intersection represents a fully end-to-end modality-agnostic generative prototype; however, removing these structural priors incurs a significant computational cost. In this work, we investigate this cost through a compute-matched scaling study. Our results reveal that the performance penalty of byte modeling is not uniform; across scale, the scaling overhead of byte modeling is worse for MDM than for AR. We hypothesize that this disparity stems from context fragility: while AR’s stable causal history allows models to naturally rediscover subword patterns, the MDM objective destroys the local contiguity required to efficiently resolve semantics from raw bytes. Our findings from controlled permutation experiments suggest that future modality-agnostic designs must incorporate alternative structural biases to maintain viable scaling trajectories in the byte regime.

[LG-89] IV-ICL: Bounding Causal Effects with Instrumental Variables via In-Context Learning

链接: https://arxiv.org/abs/2605.12924
作者: Vahid Balazadeh,Hamidreza Kamkari,Medha Barath,Ricardo Silva,Rahul G. Krishnan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The instrumental-variables (IV) setting is standard for partial identification of causal effects when unobserved confounding makes point identification impossible. Existing approaches face methodological bottlenecks: closed-form bound estimands are required – e.g., Balke-Pearl equations in binary IV – and even when available, designing accurate estimators requires manual effort tailored to each estimand. While direct Bayesian inference of the causal effects, instead of the bounds, circumvents these challenges, it is often computationally intensive and suffers from high prior sensitivity or under-dispersed posteriors. As a remedy, we introduce IV-ICL, an amortized Bayesian in-context learning method that learns the marginal posterior distribution of the causal effects directly and derives bounds as its quantiles. Unlike standard variational inference that optimizes exclusive KL divergence, amortized Bayesian inference minimizes the expected inclusive KL, a mass-covering objective. We empirically observe that optimizing inclusive KL can recover the entire identified set across diverse data-generating processes, while exclusive-KL (e.g. with variational inference) of the same Bayesian formulation collapses onto a single mode and fails to cover the identified set. We evaluate IV-ICL on synthetic and semi-synthetic IV benchmarks and show it produces intervals that are more reliably valid and more informative compared to efficient semi-parametric, Bayesian, and plug-in baselines, at 20-500x lower inference time. Beyond methodology, we propose a procedure to convert randomized controlled trials into IV benchmarks with provably preserved ground-truth causal effects that enables a more realistic evaluation of partial-identification methods.

[LG-90] Revisiting DAgger in the Era of LLM-Agents

链接: https://arxiv.org/abs/2605.12913
作者: Changhao Li,Rushi Qiang,Jiawei Huang,Chenxiao Gao,Chao Zhang,Niao He,Bo Dai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long-horizon LM agents learn from multi-turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Existing recipes fall short in complementary ways: supervised fine-tuning provides dense teacher supervision but suffers from covariate shift because it is trained on off-policy teacher trajectories; while reinforcement learning with verifiable rewards avoids this off-policy mismatch by learning from on-policy rollouts but with only sparse outcome feedback. We address this dilemma by revisiting Dataset Aggregation (DAgger) for multi-turn LM agents: the algorithm collects trajectories through a turn-level interpolation of student and teacher policies, and the student is then trained on these trajectories using supervised labels provided by the teacher. By directly interacting with environments, we expose the model to realistic states likely to be encountered during deployment, thereby effectively mitigating covariate shift. Besides, since the student is learned by mimicking the teacher’s behavior, it receives rich feedback during learning. To demonstrate DAgger enjoys the benefits of both worlds, we tested the algorithm to train a software-engineering agent with 4B- and 8B-scale student models. On SWE-bench Verified, our DAgger-style training improves over the strongest post-training baseline by +3.9 points at 4B and +3.6 points at 8B. The resulting 4B agent reaches 27.3%, outperforming representative published 8B SWE-agent systems, while the 8B agent achieves 29.8%, surpassing SWE-Gym-32B and coming within 5 points of stronger 32B-scale agents. Together with consistent gains on the held-out SWE-Gym split, these results suggest the effectiveness of DAgger for modern long-horizon LM agents.
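The core DAgger loop described above, turn-level interpolation of student and teacher policies with teacher labels on every visited state, can be sketched generically. The environment and policy interfaces below are illustrative stand-ins, not the paper's software-engineering agent stack.

```python
import random

def dagger_collect(env_reset, env_step, student, teacher, beta, horizon, dataset):
    """One DAgger-style rollout: act with the teacher w.p. beta and with the
    student otherwise, but always record the teacher's label for the visited
    state. The student is later trained by supervised learning on `dataset`."""
    state = env_reset()
    for _ in range(horizon):
        expert_action = teacher(state)
        dataset.append((state, expert_action))                 # dense teacher supervision
        action = expert_action if random.random() < beta else student(state)
        state, done = env_step(state, action)                  # on-policy state visitation
        if done:
            break
    return dataset
```

With beta near 0 the states come from the student's own distribution (mitigating covariate shift), yet every state still carries a rich teacher label rather than a sparse outcome reward.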

[LG-91] VIP-COP: Context Optimization for Tabular Foundation Models

链接: https://arxiv.org/abs/2605.12904
作者: Yilong Chen,Xueying Ding,Leman Akoglu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular foundation models (TFMs) have emerged as a powerful paradigm for in-context learning on structured data, enabling direct prediction on new tabular tasks without task-specific training. However, their effectiveness is constrained by context length limits, restricting application to medium-scale data and degrading performance when inference-time data exceed pretraining size distributions. Our work introduces VIP-COP, estimating the Value of Importance for Prediction of training examples and features for hard Context OPtimization for TFMs. Its explicit selection mechanism suppresses noise and isolates influential data, enabling the model to also benefit from data augmentation by prioritizing high-value augmented samples and features. VIP-COP is (i) fast, boosting performance often within minutes of optimization, based on an online KernelSHAP-based regression with iterative refinement, value-guided context sampling, and multi-fidelity pruning; (ii) budget-aware and any-time, improving with additional test-time compute unlike heuristics that produce fixed contexts; (iii) model-aware yet fully black-box, requiring no access to model internals, making it compatible with both proprietary and open-source TFMs; (iv) interpretable, identifying discrete "Very Important Predictors" (samples and features) that maximize signal-to-noise, which makes it (v) robust, isolating high-value data from noise. In contrast, soft-prompt optimization requires model gradients, produces abstract latent tokens, and lacks explicit signal discrimination. Extensive experiments show that VIP-COP consistently outperforms heuristic and optimized baselines across large-scale high-dimensional testbeds, including data augmentation and data-noise settings, establishing a new state of the art in test-time context refinement for TFMs.

[LG-92] ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection

链接: https://arxiv.org/abs/2605.12879
作者: Huy Tran,Max Milkert,David Hyde
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Doubly-stochastic attention has emerged as a transport-based alternative to row-softmax attention, with recent Transformer variants using it to reduce attention sinks and rank collapse while improving performance. In this family, the standard approach is Sinkhorn scaling, which trains more efficiently but still repeats matrix scaling in every inference forward pass. Sliced-transport attention removes the online iteration, but its soft sorting approximation materializes dense tensors for each slice, requiring substantially more training resources than Sinkhorn attention. We introduce ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection, a train-then-compile method that trains the doubly-stochastic layer with Sinkhorn, then replaces the iterative scaling loop at inference with a fixed sliced-dual operator. It learns a lightweight parametric map from exact one-dimensional Kantorovich potentials to the Sinkhorn query-side dual, then reconstructs the attention plan with a two-sided entropic c-transform. Across language and vision benchmarks, ASAP keeps the cheaper training setup and remains highly competitive with recent baselines. In the main frozen-layer benchmark, ASAP is 5.3× faster than the trained Sinkhorn teacher while matching its accuracy; in downstream replacements, ASAP recovers most of the teacher performance without any retraining.

[LG-93] Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing ICML2026

链接: https://arxiv.org/abs/2605.12876
作者: Blaise Delattre,Hengyu Wu,Paul Caillon,Wei Yang Bryan Lim,Yang Cao
类目: Machine Learning (cs.LG)
*备注: ICML 2026. Code: this https URL

点击查看摘要

Abstract:Randomized smoothing provides strong, model-agnostic robustness certificates, but existing guarantees are limited to single modalities, treating continuous and discrete inputs in isolation. This limitation becomes critical in multimodal models, where decisions depend on cross-modal semantics and adversaries can jointly perturb heterogeneous inputs, rendering unimodal certificates insufficient. We introduce a unified randomized smoothing framework for mixed discrete–continuous inputs based on an analytically tractable Neyman–Pearson formulation of the joint worst-case problem. By analyzing the joint likelihood ordering induced by factorized discrete and continuous noise, our approach yields a closed-form, one-dimensional certificate that strictly generalizes both Gaussian (image-only) and discrete (text-only) randomized smoothing. We validate the framework on multimodal safety filtering, providing, to our knowledge, the first model-agnostic Neyman–Pearson certificate for joint discrete-token and continuous-image perturbations in interaction-dependent text–image safety filtering.
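For reference, the unimodal Gaussian special case that the joint certificate strictly generalizes has a one-line closed form (the classic Cohen-style image-only bound, shown here as background rather than the paper's new result).

```python
from statistics import NormalDist

def gaussian_certified_radius(p_lower, sigma):
    """Classic Gaussian randomized-smoothing certificate: if the smoothed
    classifier's top class has probability at least p_lower > 1/2 under
    N(x, sigma^2 I), its prediction is constant within l2 radius
    sigma * Phi^{-1}(p_lower)."""
    if p_lower <= 0.5:
        return 0.0                         # no certificate below majority
    return sigma * NormalDist().inv_cdf(p_lower)
```

The paper's contribution is to extend this Neyman–Pearson reasoning to a joint likelihood ordering over factorized discrete and continuous noise, so the one-dimensional shape of the certificate survives in the mixed setting.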

[LG-94] Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

链接: https://arxiv.org/abs/2605.12874
作者: Jordan F. McCann
类目: Machine Learning (cs.LG)
*备注: 11 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Sparse autoencoders (SAEs) are now standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short natural-language explanation. Existing critiques of this practice focus on polysemanticity – one feature with many meanings – or on whether explanations predict activations. We identify a complementary, structurally distinct problem we call descriptive collision: many distinct SAE features admit the same explanation. Reanalyzing the largest publicly-available dataset of human-annotated SAE features (Marks et al., 2025), comprising 722 annotated features across Gemma 2 2B and Pythia 70M, we find that the mean annotation string is reused across 3.07 features; 82.1% of features share their annotation with at least one other feature; and the single most common annotation string (“plural nouns”) labels 101 distinct features spanning 18 layers and four model components. Information-theoretically, the average annotation resolves only 70% of feature identity. We formalize a property called discrimination, prove that current detection-style auto-interpretability scoring is invariant to collision, and propose two complementary corrective metrics – collision-adjusted detection and discrimination scoring – that explicitly penalize explanations that fail to distinguish a feature from its neighbors. The collision problem is independent of, and additive with, previously identified failure modes of auto-interpretability; ignoring it inflates reported feature interpretability by a quantity equal to roughly one-third of the bits required to identify a feature.
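The collision statistics reported above have a direct formalization. The sketch below computes mean annotation reuse, the share of features whose annotation is shared, and one plausible reading of the "fraction of feature-identity bits resolved" (assuming a uniform prior over features; this is an illustrative reconstruction, not the paper's exact metric).

```python
import math
from collections import Counter

def collision_stats(annotations):
    """Descriptive-collision metrics for a list of per-feature explanation
    strings: mean features per annotation, share of features with a shared
    annotation, and the fraction of feature-identity bits resolved."""
    n = len(annotations)
    counts = Counter(annotations)
    mean_reuse = n / len(counts)
    shared = sum(1 for a in annotations if counts[a] > 1) / n
    # knowing an annotation with multiplicity m leaves log2(m) bits unresolved
    unresolved = sum(math.log2(counts[a]) for a in annotations) / n
    resolved = 1.0 - unresolved / math.log2(n)
    return mean_reuse, shared, resolved
```

Under this reading, the paper's headline numbers (mean reuse 3.07, 82.1% shared, ~70% of identity bits resolved) are all properties of the annotation multiset alone, which is why detection-style scoring that evaluates each feature in isolation cannot see them.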

[LG-95] SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

链接: https://arxiv.org/abs/2605.12872
作者: Truong Pham,Anay Majee,Rishabh Iyer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the recent success of Multimodal Foundation Models (FMs), their reliance on massive paired datasets limits their applicability in low-data and rare-scenario settings where aligned data is scarce and expensive. A key bottleneck is the adoption of an instance-level formulation, which learns alignment by maximizing correlation between individual image-text pairs while neglecting the underlying geometric structure across modalities resulting in a modality gap across input modalities. In this paper, we propose a combinatorial paradigm for multimodal alignment that moves beyond pairwise learning and introduce the Submodular Modality Aligner (SMA), which treats multiple augmentations and descriptions of an entity as a set, leveraging multiple descriptions of the data to capture richer cross-modal structure. We instantiate SMA using a principled objective based on Submodular Mutual Information (SMI), which jointly maximizes inter-modality mutual information while reducing cross-modal divergence. This formulation enables the model to effectively utilize multiple positive associations and extract significantly more information from limited data. We evaluate SMA on 14 zero-shot classification and retrieval tasks from the CLIP benchmark and demonstrate consistent gains in the low-data regime. Notably, SMA achieves strong multimodal generalization using only tens of thousands of samples. This is orders of magnitude fewer than standard approaches. Our results highlight the importance of set-based formulations and submodular objectives for data-efficient multimodal learning.

[LG-96] NeuroRisk: Physics-Informed Neural Optimization for Risk-Aware Traffic Engineering

链接: https://arxiv.org/abs/2605.12862
作者: Yingming Mao,Ximeng Liu,Jingyi Cheng,Xiyuan Liu,Jiashuai Liu,Yike Liu,Zhen Yao,Yuzhou Zhou,Siyuan Feng,Qiaozhu Zhai,Shizhen Zhao
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In production Wide-Area Networks (WANs), correlated failures dominate availability losses, forcing operators to reserve large safety margins that leave substantial capacity underutilized. Achieving high utilization under strict availability targets therefore requires risk-aware Traffic Engineering (TE) over dozens to hundreds of probabilistic failure scenarios-yet solving this problem at operational timescales remains elusive. We demonstrate that existing risk-aware formulations can be unified under an embedded Sort-and-Select structure, exposing a fundamental trade-off between expressiveness and tractability: classical optimizers either restrict scenario selection for efficiency or incur prohibitive decomposition costs. While deep learning appears promising, prior Deep TE methods mainly target maximum link utilization and rely on scaling-based feasibility, which fundamentally breaks under explicit capacity constraints and scenario-dependent risk. We present NeuroRisk, a physics-informed deep unrolled optimizer that exploits the structure of Sort-and-Select. NeuroRisk enforces feasibility via gated edge-local reservations and represents scenario sets through permutation-invariant, gradient-aligned cues. Evaluations on production-style WANs show that NeuroRisk achieves small optimality gaps relative to the solver with orders of magnitude speedup (10^2–10^5×) on risk objectives, while outperforming neural baselines on nominal throughput.

[LG-97] Multitask Multimodal Fusion with Tabular Foundation Models for Peak and Durability Prediction of Pertussis Booster Response

链接: https://arxiv.org/abs/2605.12852
作者: Divya Sitani
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 22 pages, 8 figures, 4 tables. Code available at this https URL

点击查看摘要

Abstract:Pertussis booster vaccination produces immune responses that vary widely across individuals in both peak magnitude and long-term durability. These two phases are governed by partly distinct biological compartments: peak reflects acute B-cell activation and antibody secretion, while durability reflects the establishment of long-term humoral memory. Yet most computational models target only one, missing the full boost-and-wane trajectory. Jointly predicting both is non-trivial because the two endpoints are biologically dissociated rather than redundant; samples are small, modalities are heterogeneous with structured missingness, and the two tasks rely on different measurement windows. We propose a multi-task contrastive multimodal fusion architecture combining frozen TabPFN-v2 per-modality encoders, a dual-label supervised contrastive loss that treats two subjects as a positive pair if they agree on the Task 1 label or the Task 2 label, modality dropout calibrated to empirical missingness, and missingness-masked attention fusion. Applied to a curated subset of the CMI-PB pertussis booster dataset (n = 158 subjects, four modalities, 44.9% with at least one modality missing; Spearman r = -0.58 between peak and durability, n = 96), the model achieves test AUROC 0.797 (95% CI [0.621, 0.948]) for peak response and 0.755 (95% CI [0.519, 0.945]) for durability, with both significant under joint label permutation (N = 1000; p = 0.002 and p = 0.045). Across logistic regression, XGBoost, and MLP baselines on raw features and on TabPFN embeddings, the proposed model is the only one whose 95% CIs lie above chance on both tasks simultaneously. Per-modality contribution analyses recover task-specific modality contributions consistent with the underlying immunology: peak prediction is carried by cytokine signatures, while durability is carried by baseline antibody features.

[LG-98] Discrete Stochastic Localization for Non-autoregressive Generation

链接: https://arxiv.org/abs/2605.12836
作者: Yunshu Wu,Jiayi Cheng,Longxuan Yu,Partha Thakuria,Rob Brekelmans,Evangelos E. Papalexakis,Greg Ver Steeg
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2602.16169

点击查看摘要

Abstract:Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from T=128 to T=1024 , and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as T=48 total steps – without distillation or retraining.

[LG-99] Quantifying Potential Observation Missingness in Inverse Reinforcement Learning

链接: https://arxiv.org/abs/2605.12831
作者: Leo Benac,Abhishek Sharma,Alihan Huyuk,Finale Doshi-Velez
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inverse reinforcement learning (IRL), which infers reward functions from demonstrations, is a valuable tool for modeling and understanding decision-making behavior. Many variants of IRL have been developed to capture complexities of human decision-making, such as subjective beliefs, imperfect planning, and dynamic goals. However, an often-overlooked issue in real-world behavioral datasets is that the recorded data may be missing observations that were available to the original decision-maker. In use-inspired settings such as healthcare, this can make expert actions appear suboptimal, even when they were near-optimal given the information available at the time. As a result, the rewards learned by standard IRL may be misleading. In this paper, we identify the minimal perturbations to the recorded observations needed for the expert’s actions to appear optimal. We develop a practical algorithm for this problem and demonstrate its utility for quantifying the possible extent of missing observations in behavioral datasets through extensive experiments on synthetic navigation tasks, a cancer treatment simulator, and ICU treatment data.

[LG-100] Hessian Matching for Machine-Learned Coarse-Grained Molecular Dynamics

链接: https://arxiv.org/abs/2605.12823
作者: Sanya Murdeshwar,Sanjit Shashi,Kevin Bachelor,William Noid,Ashwin Lokapally,Razvan Marinescu
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph); Biomolecules (q-bio.BM)
*备注: 15 pages, 4 figures, 1 table

点击查看摘要

Abstract:Coarse-grained (CG) molecular dynamics enables simulations of atomic systems such as biomolecules at timescales inaccessible to all-atom (AA) methods, but existing CG neural potentials trained via force matching capture only the gradient of the free-energy surface, leaving its curvature unconstrained. We introduce a framework that augments force matching with stochastic Hessian-vector product (HVP) matching, instilling second-order curvature information into CG potentials without constructing the full Hessian. We derive a decomposition of the target CG Hessian into a model-independent projected AA Hessian, precomputed once before training, and a model-dependent covariance correction computed online at negligible cost. We construct an unbiased stochastic estimator of the Hessian-matching objective by using random probe vectors. We evaluate our method by comparing against force matching on a benchmark of nine fast-folding proteins unseen during training. HVP matching outperforms plain force matching on 8 of 9 proteins on slow-mode metrics, with reductions of up to 85% in the Kullback–Leibler divergence between the CG and reference distributions along the slowest collective mode of the largest protein. Our results demonstrate that higher-order physical supervision is a practical path to more accurate and transferable CG potentials for biomolecular simulation.
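The Hutchinson-style idea behind the estimator, matching Hessians through Hessian-vector products with random probe vectors rather than forming the full matrices, can be sketched as follows. This reproduces only the generic probe estimator, not the paper's projected-AA-Hessian decomposition or covariance correction.

```python
import numpy as np

def stochastic_hvp_loss(hvp_model, hvp_target, dim, n_probes=64, rng=None):
    """Unbiased stochastic estimator of the Hessian-matching objective
    ||H_model - H_target||_F^2 using only Hessian-vector products with
    Rademacher probes: E_v ||(H_model - H_target) v||^2 equals the
    squared Frobenius norm when E[v v^T] = I."""
    rng = np.random.default_rng(0) if rng is None else rng
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=dim)
        diff = hvp_model(v) - hvp_target(v)   # two HVPs, no full Hessian
        total += diff @ diff
    return total / n_probes
```

When the Hessian difference is diagonal the estimator is exact for every Rademacher probe, which makes for a convenient sanity check.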

[LG-101] AGOP as Explanation: From Feature Learning to Per-Sample Attribution in Image Classifiers

链接: https://arxiv.org/abs/2605.12816
作者: Raj Kiran Gupta Katakam
类目: Machine Learning (cs.LG)
*备注: 8 pages. Accepted at the 4th World Conference on eXplainable Artificial Intelligence (XAI 2026), Late-Breaking Work track, Fortaleza, Brazil, July 1-3, 2026

点击查看摘要

Abstract:The Average Gradient Outer Product (AGOP) governs feature learning in neural networks: the Neural Feature Ansatz states that weight Gram matrices at each layer align with the corresponding AGOP matrices computed over the training distribution. We ask a complementary question: can this same quantity serve as a post-hoc attribution method for explaining individual predictions? We introduce AGOP-Weighted: a novel attribution method that multiplies the per-sample gradient by sqrt(diag(M) / max diag(M)), a training-distribution prior that suppresses gradient noise and amplifies consistently important pixels – a combination not present in any prior attribution method. We formalise two companion variants – AGOP-Local (per-sample gradient, equivalent to VanillaGrad) and AGOP-Global (diag(M) directly as a zero-cost saliency map) – and implement an efficient training-time accumulation hook; AGOP-Global then requires zero inference cost (disk lookup) while AGOP-Weighted requires only a single gradient pass. We conduct the first rigorous comparison of AGOP attribution against Integrated Gradients (IG), SmoothGrad, GradCAM, and VanillaGrad across two benchmarks with pixel-level ground truth: (i) the synthetic XAI-TRIS benchmark (four classification scenarios, 8x8 images, CNN8by8) and (ii) the photorealistic CLEVR-XAI benchmark (ResNet-18 fine-tuned from ImageNet). AGOP-Weighted achieves 44% higher mIoU than IG on linear tasks; AGOP-Global achieves 7x higher mIoU than IG on multiplicative tasks (where IG falls below random) at zero inference cost. Both findings generalise to ResNet-18 on CLEVR-XAI (+18% and +37% respectively). We further show that GradCAM fails on small-resolution images due to spatial resolution collapse, and that diag(M) quality improves monotonically throughout training even after classification accuracy has plateaued.
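The AGOP-Weighted rule as stated has a direct implementation. Below is a minimal sketch where `M_diag` is the diagonal of the AGOP accumulated over per-sample input gradients; the function names are illustrative, and the training-time gradient collection is assumed done elsewhere.

```python
import numpy as np

def agop_diag(grads):
    """Diagonal of the Average Gradient Outer Product M = E[g g^T],
    given per-sample input gradients of shape (n_samples, n_features)."""
    return np.mean(grads ** 2, axis=0)

def agop_weighted(sample_grad, M_diag):
    """AGOP-Weighted attribution: the per-sample gradient scaled by
    sqrt(diag(M) / max diag(M)), a training-distribution prior that damps
    pixels that are rarely informative across the data."""
    return sample_grad * np.sqrt(M_diag / (M_diag.max() + 1e-12))
```

`agop_diag` alone gives the zero-cost AGOP-Global saliency map (a disk lookup at inference), while `agop_weighted` needs only one extra gradient pass per explained sample.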

[LG-102] Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse

链接: https://arxiv.org/abs/2605.12808
作者: Ling-Qi Zhang,Kristin Branson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neuroscience data are highly fragmented across labs, formats, and experimental paradigms, and reuse often requires substantial manual effort. A persistent roadblock to data reuse and integration is the need to decipher bespoke and diverse data formatting choices. Common data formats have been proposed in response, but the field continues to struggle with a fundamental tension: formats flexible enough to accommodate diverse experiments are rarely descriptive enough to be self-explanatory, and sufficiently descriptive formats demand detailed documentation and curation effort that few labs can sustain. Agentic AI is a natural candidate to solve this problem: LLMs read code and text faster and with sustained attention to the low-level details humans tend to skim over. To measure how well agentic AI performs on this task, we selected eight recent papers studying large-scale mouse neural population recordings that shared both data and code, spanning diverse recording modalities, behavioral paradigms, and dataset formats (e.g., NWB, specialized APIs, and general-purpose Python or MATLAB files). We provided agents with the data, code, and paper, and prompted them to load, understand, and reformat the data for a common downstream task: training a decoder from neural activity to task or behavioral variables. General-purpose coding agents commonly used by scientists performed well on each sub-task, but rarely strung together a fully error-free end-to-end solution. We characterize the types of mistakes agents made and the dataset properties that elicited them, and propose data-sharing best practices for the agentic-AI era. We further find that agents-as-judges are unreliable at catching errors, especially without ground-truth references, so interactive, human-in-the-loop coding remains necessary.

[LG-103] Pitfalls of Unlabeled Disagreement-Based Drift Detection in Streaming Tree Ensembles ICLR2026

链接: https://arxiv.org/abs/2605.12803
作者: Lara Sá Neves,Afonso Lourenço,Lizy K. John,Goreti Marreiros
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at CAO Workshop at ICLR 2026

点击查看摘要

Abstract:Detecting concept drift in high-speed data streams remains challenging, particularly when models must operate on unlabeled data and avoid false alarms caused by benign shifts. While disagreement-based uncertainty has shown promise in neural networks, its adaptation to ensembles of incremental decision trees (IDTs) remains largely unexplored. We investigate this approach by constructing batch-specific disagreement measures via label flipping in ensemble members and evaluating their effectiveness for drift detection in tabular data streams. Our experiments show that, although this method performs well in ensembles of multi-layer perceptrons (MLPs), it consistently underperforms loss-based detectors when applied to IDTs. We attribute this behavior to the intrinsic rigidity of IDTs: learning primarily through structural expansion, with limited parameter adaptation, restricts model plasticity and prevents disagreement from reliably reflecting learning potential. Recent work on restructuring IDTs using their intrinsic decomposition into non-overlapping rules offers a promising direction for improving adaptability.
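
As a rough illustration of the kind of batch-level disagreement signal such detectors monitor, the sketch below computes the fraction of samples on which ensemble members' predictions differ. This is a minimal stand-in, not the paper's label-flipping construction; the function name and the thresholding use are assumptions.

```python
import numpy as np

def batch_disagreement(probas):
    """Fraction of samples on which ensemble members' argmax predictions differ.

    probas: array of shape (n_members, n_samples, n_classes) holding each
    member's class probabilities for one batch. Returns a scalar in [0, 1];
    a drift detector would flag batches where this rises above a threshold.
    """
    preds = probas.argmax(axis=-1)               # (n_members, n_samples)
    ref = preds[0]                               # compare against first member
    return float((preds != ref).any(axis=0).mean())

# Toy check: identical members never disagree.
agree = np.tile(np.array([[0.9, 0.1], [0.2, 0.8]]), (3, 1, 1))  # 3 members, 2 samples
print(batch_disagreement(agree))  # 0.0
```

A loss-based detector would instead track per-batch error against (possibly delayed) labels; the paper's finding is that for incremental decision trees the disagreement signal above tracks learning potential far less reliably than for MLP ensembles.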

[LG-104] SoK: A Comprehensive Analysis of the Current Status of Neural Tangent Generalization Attacks with Research Directions

链接: https://arxiv.org/abs/2605.12792
作者: Thushari Hapuarachchi,Kaiqi Xiong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The use of unauthorized data in Deep Neural Network (DNN) training has recently become a serious issue. A clean-label generalization attack, one type of data poisoning attack, has been suggested to address this issue. The Neural Tangent Generalization Attack (NTGA) is considered the first well-known clean-label generalization attack under black-box settings, which provided an unprecedented step in data protection approaches. In this paper, we conduct a comprehensive analysis of the state-of-the-art of NTGA; to the best of our knowledge, this is the first thorough analysis regarding NTGA. First, we provide a classification of attacks against DNNs with their explanations and relations to NTGA. Then, this paper presents a taxonomy of black-box attacks and demonstrates that NTGA is the first clean-label generalization attack under the black-box setting. We further analyze the existing studies of NTGA and give a comprehensive comparison of their findings by conducting our own experiments to verify these findings. Moreover, our extensive experiments show that NTGA is vulnerable to adversarial training and image transformations, and applying linear separability to NTGA-generated images makes them more susceptible to such vulnerabilities. We present the pros and cons of NTGA and suggest ways to improve NTGA robustness based on our analysis. Our further experiments indicate that several recently proposed clean-label generalization attacks outperform NTGA on data protection. Finally, we highlight the necessity of further research and offer future research insights on NTGA.

[LG-105] From Heuristics to Analytics: Forecasting Effort and Progress in Online Learning

链接: https://arxiv.org/abs/2605.12788
作者: Eric S. Qiu,Danielle R. Thomas,Boyuan Guo,Vincent Aleven,Conrad Borchers
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: Accepted as full paper to the 19th International Conference on Educational Data Mining (EDM 2026)

点击查看摘要

Abstract:Sustained effort is essential for realizing the benefits of intelligent tutoring systems (ITS), yet many learners disengage or underuse available practice time. We introduce engagement forecasting as a supervised prediction task based on ITS logs, targeting two outcomes central to effort and learning progress: minutes practiced per week and new skills mastered per week. Using interaction log data from 425 middle-school students over a school year, we benchmark fifteen predictors including regressions, decision trees, and neural networks. We show that these feature-based models reduce mean absolute error (MAE) by 22-33% relative to heuristic baselines, including fixed-percentile rules adapted from prior work in other behavioral domains. We find that percentile heuristics systematically overpredict, whereas feature-based models better track student practice trajectories across weeks. To support explainability, we analyze feature importance and ablations, revealing target-specific patterns: effort forecasting is driven mainly by recent activity features, while progress forecasting depends more on learner-state and content difficulty signals. Finally, in a semi-structured user interview case study with eight college tutors, we examine how tutors reasoned about system-generated predictive features when setting goals with students. We find that tutors reasoned differently about effort versus progress goals in ways that mirror our pattern analysis. Together, these results establish a reproducible benchmark for forecasting weekly effort and learning progress in ITS. By making patterns of sustained effort and progress visible at a weekly timescale, engagement forecasting offers a foundation for supporting tutor-learner goal setting and timely instructional decisions.

[LG-106] Identifying the nonlinear string dynamics with port-Hamiltonian neural networks

链接: https://arxiv.org/abs/2605.12785
作者: Maximino Linares,Guillaume Doras,Thomas Hélie
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Hybrid machine learning combines physical knowledge with data-driven models to enhance interpretability and performance. In this context, Port-Hamiltonian Systems (PHS), which generalize Hamiltonian mechanics to describe open, non-autonomous dynamical systems, have been successfully integrated with neural networks under the name Port-Hamiltonian Neural Networks (PHNNs). While the ability of PHNNs to identify Hamiltonian ordinary differential equation (ODE) systems has already been demonstrated, their application to learning Hamiltonian partial differential equation (PDE) systems remains largely unexplored. This limitation restricts their use in musical acoustics, where instruments are typically modeled as distributed parameter systems governed by PDEs. In this work, we demonstrate how to learn the nonlinear string dynamics from data in a physically-consistent framework through a PHNN extension to PDEs. By constructing structured neural network architectures based on PHS, we can recover both the Hamiltonian governing the string and the dissipation affecting it. This approach outperforms baseline, non-physics-informed methods in terms of both accuracy and interpretability. Numerical experiments using synthetic data demonstrate the ability of the proposed PHNN model to identify and emulate the nonlinear dynamics of the system.
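
For context, a finite-dimensional port-Hamiltonian system (a generic structural sketch, not the paper's distributed string model) takes the form:

```latex
\dot{x} = (J - R)\,\nabla H(x) + G\,u, \qquad y = G^{\top} \nabla H(x),
```

where $J = -J^{\top}$ is the interconnection matrix, $R = R^{\top} \succeq 0$ models dissipation, and $u, y$ are the input/output ports. The structure yields the power balance

```latex
\frac{dH}{dt} = \nabla H(x)^{\top}\dot{x} = -\nabla H(x)^{\top} R\, \nabla H(x) + y^{\top} u \le y^{\top} u,
```

so energy can only decrease internally or flow through the ports. A PHNN parameterizes $H$ (and the dissipation) with neural networks while keeping this structure fixed, which is what makes the learned model physically consistent by construction.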

[LG-107] ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery

链接: https://arxiv.org/abs/2605.12784
作者: Andrew Y. Zhou,Sharvaree Vadgama,Sumanth Varambally,Peter Eckmann,Michael K. Gilson,Rose Yu
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Advances in large language models (LLMs) have recently opened new and promising avenues for small-molecule drug discovery. Yet existing LLM-based approaches for molecular generation often suffer from high rates of invalid and low-quality ligand candidates, a result of the syntactic limitations of current models with regard to molecular strings. In this paper, we introduce ToolMol, an evolutionary agentic framework for de novo drug design. ToolMol combines a multi-objective genetic algorithm with an agentic LLM operator that iteratively updates the ligand population. We build a comprehensive toolbox of RDKit-backed functions that allows our agentic operator to consistently make precise ligand modifications. ToolMol achieves state-of-the-art performance on multi-objective property optimization tasks, discovering drug-like and synthesizable ligands that have 10% stronger predicted binding affinity compared to existing methods, evaluated on three protein targets. ToolMol ligands additionally achieve state-of-the-art results in gold-standard Absolute Binding Free Energy scores, outperforming existing methods by over 35%. By studying chain-of-thought reasoning traces, we observe that tool-calling enables the model to more faithfully execute its planned modifications, efficiently exploiting the strong chemical prior knowledge in LLMs.

[LG-108] Graph-Based Financial Fraud Detection with Calibrated Risk Scoring and Structural Regularization

链接: https://arxiv.org/abs/2605.12782
作者: Yunfei Nie,Jiawei Wang,Ruobing Yan,Yuhan Wang,Zouxiaowei Ma,Yilun Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Financial transaction fraud prevention faces challenges such as complex relationship structures, concealed behavioral patterns, and dynamically changing data distribution. Discrimination models relying solely on independent sample features are insufficient to fully characterize the risks of group collaboration and chain transfers within transaction networks. This paper proposes a graph neural network representation learning and risk discrimination framework for financial transaction fraud prevention. It integrates transaction records and identity information into node attributes and constructs a transaction graph based on shared attributes and interaction consistency to explicitly model inter-transaction relationships. In model design, a multi-layer message passing mechanism is employed to aggregate neighborhood information, learn node embedding representations containing structural context semantics, and output transaction-level fraud probability and risk scores through a lightweight risk discrimination head. A weighted supervision objective is introduced to mitigate training bias caused by class imbalance, and structural consistency regularization constraints are combined to suppress the impact of noisy edges on representation drift, thereby improving the stability and usability of risk characterization. Experiments are conducted on a publicly available financial transaction dataset, comparing various methods in the same direction and comprehensively evaluating them under a unified evaluation protocol. The results show that the proposed method outperforms other methods in risk ranking and probability calibration quality, validating the effectiveness of graph structure modeling and representation learning collaboration in financial transaction fraud prevention.

[LG-109] Inference-Time Machine Unlearning via Gated Activation Redirection

链接: https://arxiv.org/abs/2605.12765
作者: Vinícius Conte Turani,Otávio Parraga,João Vitor Boer Abitante,Kristen K. Arguello,Joana Pasquali,Ramiro N. Barros,Flavio du Pin Calmon,Christian Mattjie,Rodrigo C. Barros,Lucas S. Kupssinskü
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.
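
The norm-preserving rotation idea can be illustrated with a small sketch: blend the direction of a hidden state toward a redirection direction by a gate value, then rescale to the original norm so only the direction changes. This is a stand-in under assumed names (`gated_redirect`, a precomputed direction `r`), not GUARD-IT's actual input-dependent intervention.

```python
import numpy as np

def gated_redirect(h, r, gate):
    """Norm-preserving redirection of a residual-stream activation (sketch).

    h:    hidden state vector.
    r:    target redirection direction (e.g. toward refusal behavior).
    gate: input-dependent scalar in [0, 1]; 0 leaves h unchanged.
    """
    n = np.linalg.norm(h)
    h_hat = h / n
    r_hat = r / np.linalg.norm(r)
    mixed = (1.0 - gate) * h_hat + gate * r_hat
    return n * mixed / np.linalg.norm(mixed)     # rotate, keep original norm

h = np.array([3.0, 4.0])
out = gated_redirect(h, np.array([1.0, 0.0]), gate=0.5)
print(np.linalg.norm(out))  # ≈ 5.0, norm preserved
```

The gating is what distinguishes this from naive global steering: inputs outside the forget set receive `gate ≈ 0` and pass through untouched, avoiding the unintended behavior shifts the abstract describes.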

[LG-110] State-Space NTK Collapse Near Bifurcations

链接: https://arxiv.org/abs/2605.12763
作者: James Hazelden,Eric Shea-Brown
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Rich feature learning in tasks that unfold over time often requires the model to pass through bifurcations, constituting qualitative changes in the underlying model dynamics. We develop a local theory of gradient descent near these transitions through the empirical state-space neural tangent kernel (sNTK). Our central finding is that bifurcations both dominate and simplify learning dynamics: near bifurcations, we can reduce sNTK to a rank-one operator corresponding to learning in a classical normal form system, providing an analytically tractable description of the local learning geometry, even for high-dimensional recurrent systems. Concretely, we give a procedure for decomposing sNTK into bifurcation-relevant and residual channels, showing that near commonly codimension-1 bifurcations the relevant channel is a rank-one operator that is highly amplified. This amplification causes the bifurcation channel to dominate the full sNTK. Thus, bifurcations locally warp the learning landscape, funneling gradient descent into a few critical dynamical directions and making the nearby kernel and loss geometry predictable from classical normal forms. We illustrate this in a student-teacher recurrent neural network: the first learned bifurcation coincides with a sharp collapse in sNTK effective rank and the emergence of a dominant parameter direction whose restricted sNTK closely matches the landscape predicted by the scalar pitchfork normal form. Finally, we show that low-rank natural gradient methods resolve the resulting learning instability near bifurcations with very little overhead over SGD.

[LG-111] Predicting Channel Closures in the Lightning Network with Machine Learning

链接: https://arxiv.org/abs/2605.12759
作者: Simone Antonelli,Vincent Davis,Harrison Rush,Anthony Potdevin,Jesse Shrader,Vikash Singh,Emanuele Rossi
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 8 pages, 7 figures, 3 tables

点击查看摘要

Abstract:The Lightning Network (LN) is a second-layer protocol for Bitcoin designed to enable fast and cost-efficient off-chain transactions. Channels in the LN can be closed either by mutual agreement or unilaterally through a forced closure, which locks the involved capital for an extended period and degrades network reliability. In this paper, we study the problem of predicting channel closure types from publicly available gossip data, framing it as a temporal link classification task over the evolving channel graph. We construct a dataset spanning over two years of LN activity and benchmark a range of machine learning approaches, from MLPs to temporal graph neural networks and spectral encodings. Our experiments reveal that the dominant predictive signals are temporal and behavioural, namely how recently each endpoint was active and the per-node history of past closures, while the surrounding network topology provides no additional benefit. We find that a simple MLP operating on edge-level features, node-level event counts, and temporal patterns outperforms all graph-based approaches, and discuss how the inherent privacy of the LN, where critical information such as channel balances and payment flows remains hidden, fundamentally limits the predictability of closures from gossip data alone. We publicly release the dataset and code at this https URL to encourage further research on this practically relevant task.

[LG-112] Constraint-Aware Flow Matching: Decision Aligned End-to-End Training for Constrained Sampling

链接: https://arxiv.org/abs/2605.12754
作者: Jacob K. Christopher,James E. Warner,Ferdinando Fioretto
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep generative models provide state-of-the-art performance across a wide array of applications, with recent studies showing increasing applicability for science and engineering. Despite a growing corpus of literature focused on the integration of physics-based constraints into the generation process, existing approaches fail to enforce strict constraint satisfaction while maintaining sample quality. In particular, training-free constrained sampling methods, while providing per-sample feasibility guarantees, introduce a fundamental mismatch between the training objective and the constrained sampling procedure, often leading to performance degradation. Identifying this training-sampling misalignment as a central limitation of current constrained generative modeling approaches, this paper proposes Constraint-Aware Flow Matching, a novel end-to-end framework that explicitly incorporates constraint projections into the training objective. By aligning the model’s learned dynamics with the constrained sampling process, the proposed method mitigates distributional shift induced by projection-based corrections, enabling high-quality constrained generation. The proposed approach is evaluated on three challenging real-world benchmarks, illustrating the generality and efficacy of the method.
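
Projection-based corrections of the kind the abstract discusses have a closed form when the constraint set is affine. The sketch below shows that generic case (the paper's physics-based constraints and projection operator may differ; the function name is an assumption):

```python
import numpy as np

def project_to_affine(x, A, b):
    """Euclidean projection of x onto {z : A z = b} (generic sketch).

    Uses the closed form x - A^T (A A^T)^{-1} (A x - b), valid when A has
    full row rank. In constraint-aware training this projection would be
    applied inside the sampling path seen by the training objective.
    """
    correction = A.T @ np.linalg.solve(A @ A.T, A @ x - b)
    return x - correction

A = np.array([[1.0, 1.0, 0.0]])                  # constraint: z0 + z1 = 1
b = np.array([1.0])
x = np.array([2.0, 2.0, 3.0])
x_proj = project_to_affine(x, A, b)
print(A @ x_proj)  # satisfies the constraint up to rounding
```

The paper's point is that applying such projections only at sampling time misaligns training and inference; making them part of the training objective removes that distributional shift.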

[LG-113] Low-Rank Adapters Initialization via Gradient Surgery for Continual Learning

链接: https://arxiv.org/abs/2605.12752
作者: Joana Pasquali,Ramiro N. Barros,Arthur S. Bianchessi,Vinícius Conte Turani,João Vitor Boer Abitante,Rafaela Cappelari Ravazio,Christian Mattjie,Otávio Parraga,Lucas S. Kupssinskü,Rodrigo C. Barros
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:LoRA is widely adopted for continual fine-tuning of Large Language Models due to its parameter efficiency, modularity across tasks, and compatibility with replay strategies. However, LoRA-based continual learning remains vulnerable to catastrophic forgetting, whose severity depends on how successive task gradients interact: when consecutive task gradients conflict, standard adapter initializations channel updates into subspaces that overwrite previously learned directions. We propose SLICE, a gradient-surgery-based initialization for LoRA adapters in continual learning. SLICE accumulates gradients from both the current task and a replay buffer of prior tasks, reconciles them through a projection operator, and decomposes the result via truncated SVD to initialize the adapter weights. We evaluate SLICE on the TRACE benchmark and sequences of Super-NI tasks, including a set of adversarial Super-NI sequences that we construct by mining task pairs with maximally opposing gradients. Compared to vanilla LoRA, LoRA-GA, and LoRAM, SLICE consistently achieves a better stability-plasticity trade-off, improving Average Performance, Final Performance and Forgetting metrics while preserving General Performance and In Context Performance across both standard and adversarial continual learning sequences.
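
A minimal sketch of the overall recipe, reconcile conflicting gradients, then factorize via truncated SVD to initialize low-rank adapters, is shown below. The PCGrad-style conflict projection and all names are assumptions for illustration; SLICE's actual projection operator may differ.

```python
import numpy as np

def surgery_svd_init(g_task, g_replay, rank):
    """Gradient-surgery-based low-rank adapter init (illustrative sketch).

    If the current-task gradient conflicts with the replay-buffer gradient
    (negative inner product), project out the conflicting component, then
    take a truncated SVD of the reconciled gradient matrix to obtain
    rank-`rank` adapter factors A (d_out x r) and B (r x d_in).
    """
    dot = np.vdot(g_task, g_replay)
    g = g_task.copy()
    if dot < 0:                                  # conflict: remove opposing component
        g -= dot / np.vdot(g_replay, g_replay) * g_replay
    U, S, Vt = np.linalg.svd(g, full_matrices=False)
    A = U[:, :rank] * S[:rank]
    B = Vt[:rank]
    return A, B

rng = np.random.default_rng(0)
g1, g2 = rng.standard_normal((4, 6)), rng.standard_normal((4, 6))
A, B = surgery_svd_init(g1, g2, rank=2)
print(A.shape, B.shape)  # (4, 2) (2, 6)
```

Initializing the adapter along reconciled directions, rather than randomly or along the raw task gradient, is what steers early updates away from subspaces that would overwrite prior tasks.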

[LG-114] Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

链接: https://arxiv.org/abs/2605.12741
作者: Yuwei Zhang,Sha Li,Changlong Yu,Qin Lu,Shuowei Jin,Chengyu Dong,Haoran Liu,Ilgee Hong,Xintong Li,Zhenyu Shi,Bing Yin,Jingbo Shang
类目: Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Enabling Large Language Models (LLMs) to continuously improve from environmental interactions is a central challenge in post-training. While on-policy self-distillation offers a promising paradigm, existing methods predominantly treat environmental feedback as a passive conditioning signal. Consequently, they heavily rely on successful demonstrations and struggle to learn in rare-success regimes. To bridge this gap, we introduce Reflection-Enhanced Self-Distillation (RESD), a framework that transforms raw failure feedback into an active source of corrective supervision. Instead of passively appending feedback, RESD interprets failed trajectories by generating retrospective reflections to diagnose local errors, and curates a persistent global playbook to preserve reusable lessons across training steps. The enriched context enables the self-teacher to provide actionable token-level supervision even in the absence of successful rollouts. Empirical evaluations on multiple continual learning tasks demonstrate that RESD substantially outperforms standard self-distillation baselines. Furthermore, RESD achieves significantly faster early-stage improvement than GRPO with 8× samples using only a single rollout per prompt, highlighting its superior interaction efficiency.

[LG-115] ConRetroBert: EMA Stabilized Dual Encoders for Template-Based Single-Step Retrosynthesis NEURIPS2026

链接: https://arxiv.org/abs/2605.12736
作者: Mohammad Jahid Ibna Basher,Ali Khodabandeh Yalabadi,Ivan Garibay,Ozlem Ozmen Garibay
类目: Machine Learning (cs.LG)
*备注: Submitted to NeurIPS 2026 Main Conference

点击查看摘要

Abstract:Template-based single-step retrosynthesis predicts reactants by selecting and applying an explicit reaction template, making each prediction traceable to a chemical transformation rule. This is useful for synthesis planning, but template-based methods are often viewed as less competitive than template-free models because template prediction is commonly formulated as global classification over a long-tailed rule library. We argue that this weakness is not inherent to templates, but to the learning formulation. We present ConRetroBert, a dual-encoder framework that reframes template-based retrosynthesis as dense product-template retrieval followed by listwise ranking over candidate sets. Stage 1 uses contrastive pretraining to learn a shared embedding space between products and reaction templates. Stage 2 refines template ranking over mined hard-negative candidate sets with a multi-positive listwise objective. To enable template-side adaptation without destabilizing hard-negative mining, ConRetroBert uses a slow-moving exponential moving average template encoder for retrieval-bank construction while updating the live template encoder through the ranking loss. On the local USPTO-50k benchmark, Stage 2 candidate-set ranking improves top-1 reaction accuracy from 50.5% to 61.3%, while EMA-stabilized template adaptation further improves it to 62.4%. Fine-tuning from a leakage-controlled USPTO-Full checkpoint reaches 75.4% top-1 accuracy on USPTO-50k. We also show that retrieval-based template prediction is strong in the long tail of rare templates, and that many correct reactant predictions arise from alternative explicit templates rather than only the recorded positive label. Code and data are available at this https URL.
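
The "slow-moving EMA encoder" component is a standard exponential-moving-average parameter copy; a minimal sketch (the decay value and function name are assumptions, not taken from the paper):

```python
import numpy as np

def ema_update(ema_params, live_params, decay=0.999):
    """One EMA step: the slow copy drifts toward the live encoder's weights.

    With decay close to 1 the EMA encoder changes slowly, which keeps the
    retrieval bank stable while the live encoder is updated by the ranking loss.
    """
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, live_params)]

ema = [np.zeros(3)]
live = [np.ones(3)]
for _ in range(1000):
    ema = ema_update(ema, live, decay=0.99)
print(ema[0][0])  # converges toward 1.0
```

Decoupling the bank-building encoder from the trained encoder in this way is what lets the template side adapt without destabilizing hard-negative mining.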

[LG-116] Before the Last Token: Diagnosing Final-Token Safety Probe Failures

链接: https://arxiv.org/abs/2605.12726
作者: Shravan Doda
类目: Machine Learning (cs.LG)
*备注: 8 pages, 2 figures, 7 tables

点击查看摘要

Abstract:Final-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe-visible unsafe evidence distributed across earlier user-token representations that is missed by this readout. We study this prefill-time failure mode using SafeSwitch-style probes trained only on clean harmful and benign prompts across three instruction-tuned LLMs. The probes achieve high recall on clean harmful prompts, but miss many jailbreaks and can produce false positives on safety-adjacent benign prompts. Subspace analyses suggest that missed jailbreaks differ from clean benign prompts along directions that are poorly captured by the probe’s representational subspace, and increasing probe bottleneck width does not reliably resolve this mismatch. Token-level prefill analyses reveal that probe-visible unsafe evidence often appears earlier in the sequence but is not exposed at the final-token readout, while naive max-pooling over token positions overfires on safe prompts. A simple PCA-HMM trajectory model, trained only on the same clean split, recovers many final-token misses from user-content prefill trajectories without the catastrophic false-positive behavior of naive token pooling, motivating trajectory-aware hidden-state analyses as diagnostic complements to final-token probes.

[LG-117] A Five-Layer MLOps Architecture for Connected Automated Driving

链接: https://arxiv.org/abs/2605.12719
作者: Bastian Lampe,Lutz Eckstein
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:The continual assurance of safety and performance of automated driving systems (ADSs) poses significant challenges. ADSs operate in complex, dynamic, open-world environments allowing a wide range of scenarios, including ones that are rare or not foreseen during initial development. While the incorporation of artificial intelligence (AI) and machine learning (ML) technology allows ADSs to learn from data gathered during operation and thus enables them to adapt over time, these approaches come with their own challenges. A key advantage of ADSs compared to human drivers is their greater ability to gather data collectively across a fleet of vehicles, or even across multiple fleets operated by different entities, and to learn from this data collectively. Vehicles can share and combine their data to identify additional learning opportunities otherwise missed by individual vehicles. This creates new opportunities to tackle the challenges of continual assurance of safety and performance, but requires the implementation of architectures that leverage the collective learning potential. Based on established MLOps principles and existing work in the field of connected automated driving, this paper presents a five-layer architecture for collective learning-enabled MLOps processes for ADSs. The goal of this architecture is to provide a conceptual blueprint for the design and implementation of MLOps processes by fleet operators and other relevant stakeholders. The paper describes the main responsibilities of each layer, their interactions, and how multi-level self-assessments enabled by the architecture can support the detection and reduction of edge cases including black swan events.

[LG-118] Spectral Energy Centroid: a Metric for Improving Performance and Analyzing Spectral Bias in Implicit Neural Representations

链接: https://arxiv.org/abs/2605.12709
作者: Tomasz Dądela,Adam Kania,Maciej Rut,Przemysław Spurek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Implicit Neural Representations (INRs) model continuous signals using multilayer perceptrons (MLPs), enabling compact, differentiable, and high-fidelity representations of data across diverse domains. However, due to the low-frequency bias of MLPs that prevents effective learning of small details, the model’s frequency must be carefully tuned through the embedding layer. Prior work established that this tuning can be performed before training based on the target signal, but it did not account for the significant effect of model depth, indicating that our understanding of the relationship between frequency and INR performance remains limited. To gain insights into this relationship, we utilize the Spectral Energy Centroid (SEC) metric that quantifies the frequency of target images and the spectral bias of INR models. We show that SEC is a versatile tool for INR analysis, demonstrating its utility across three tasks: (1) a data-driven strategy (SEC-Conf) for hyperparameter selection that outperforms existing heuristics and is robust to model depth, (2) a reliable proxy for signal complexity, and (3) effective alignment of spectral biases across diverse INR architectures.
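
One plausible reading of the SEC metric for images is the energy-weighted mean radial frequency of the 2D Fourier spectrum; the sketch below implements that reading (the paper's exact definition may differ):

```python
import numpy as np

def spectral_energy_centroid(img):
    """Energy-weighted mean radial frequency of a 2D signal (sketch).

    Low-frequency signals (smooth gradients) yield a small centroid;
    broadband signals (noise) yield a large one.
    """
    F = np.fft.fftshift(np.fft.fft2(img))
    energy = np.abs(F) ** 2
    h, w = img.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    return float((radius * energy).sum() / energy.sum())

rng = np.random.default_rng(0)
smooth = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
noise = rng.standard_normal((64, 64))
print(spectral_energy_centroid(smooth) < spectral_energy_centroid(noise))  # True
```

Computing such a scalar for both the target signal and the INR's embedding frequency is what allows the frequency hyperparameter to be matched to the data before training.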

[LG-119] A Resampling-Based Framework for Network Structure Learning in High-Dimensional Data

链接: https://arxiv.org/abs/2605.12706
作者: Ziwei Huang,Zeyuan Song,Paola Sebastiani,Stefano Monti
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 7 pages, 1 figure

点击查看摘要

Abstract:RSNet is an open-source R package that provides a resampling-based framework for robust and interpretable network inference, designed to address the limited-sample-size challenges common in high-dimensional data. It supports both the estimation of partial correlation networks modeled as Gaussian networks and conditional Gaussian Bayesian networks for mixed data types that combine continuous and discrete variables. The framework incorporates multiple resampling strategies, including bootstrap, subsampling, and cluster-based approaches, to accommodate both independent and correlated observations. To enhance interpretability, RSNet integrates graphlet-based topology analysis that captures higher-order connectivity and edge sign information, enabling single-node and subnetwork-level insights. Notably, RSNet is the first R package to efficiently construct signed graphlet degree vector matrices (GDVMs) in near-constant time for sparse networks, providing scalable analysis of higher-order network structure. Collectively, RSNet offers a versatile tool for statistically reliable and interpretable network inference in high-dimensional data.

[LG-120] Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

链接: https://arxiv.org/abs/2605.12705
作者: Lawrence Feng,Gaurav R. Ghosal,Jacob Mitchell Springer,Ziqian Zhong,Aditi Raghunathan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:How can we train models whose post-trained capabilities survive subsequent fine-tuning? Rather than focusing on downstream interventions to mitigate forgetting of upstream capabilities, we study how upstream training choices - that is, the manner in which a capability is acquired - shape how robustly that capability is retained. We investigate this question in a controlled three-stage language-model pipeline: pretraining, post-training to acquire a target capability, and downstream fine-tuning on a new objective. Across 135M and 1B models, two post-training domains, and two downstream fine-tuning tasks, we find that immediate post-training performance does not reliably predict retention after subsequent fine-tuning: training recipes that look equivalent immediately after post-training can retain the target capability very differently after subsequent fine-tuning. In particular, early exposure - mixing post-training data into pretraining - consistently improves the frontier between retained upstream performance and downstream performance. In compute-matched experiments, where the target data must be allocated between pretraining and post-training, we find that the optimum lies at neither extreme. Together with our other empirical and theoretical findings, this supports the view that post-training drives immediate specialization while early exposure improves robustness to later forgetting. Replay and dropout, typically used to mitigate forgetting as it occurs during fine-tuning, provide complementary gains to early exposure when applied during post-training. Our findings suggest that robustness to subsequent fine-tuning should be treated as a first-class objective of upstream training, addressed preventatively through choices like early exposure rather than reactively during fine-tuning itself.

[LG-121] UFO: A Domain-Unification-Free Operator Framework for Generalized Operator Learning

链接: https://arxiv.org/abs/2605.12700
作者: Hanli Qiao,George Em Karniadakis,Muhammad Muniruzzaman
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Neural operators have become an effective framework for learning mappings between function spaces, yet most existing architectures realize operators within a single representational domain, such as physical, spectral, or latent space. In this work, we introduce UFO (Domain-Unification-Free Operator), a cross-domain neural operator framework that realizes operators through adaptive, jointly conditioned interactions among representations defined on distinct domains. UFO enables discretization decoupling: the input function can be observed at resolutions or locations different from those used during training, while the solution can be queried at arbitrary output resolutions. Across four complementary benchmarks covering discontinuous inputs, irregular sampling with spectral mismatch, nonlinear dynamics, and stochastic high-frequency fields, UFO delivers accurate, robust, and physically coherent predictions under distribution shifts. These results establish cross-domain, phase-modulated realization as a powerful framework for discretization-decoupled neural operator learning.

[LG-122] IGT-OMD: Implicit Gradient Transport for Decision-Focused Learning under Delayed Feedback NEURIPS2026

链接: https://arxiv.org/abs/2605.12693
作者: Benjamin Amoh,Geoffrey G. Parker,Wesley Marrero
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures, NeurIPS 2026 conference

点击查看摘要

Abstract:Decision-focused learning trains predictive models end-to-end against downstream decision loss, but online settings suffer delayed feedback: outcomes may not arrive for many environment interactions. We identify staleness amplification, a failure mode unique to bilevel optimization under delay, in which gradient staleness couples with inner-solver sensitivity to inflate regret beyond single-level delay theory. We prove that any black-box delayed optimizer incurs an irreducible regret cost from inner-solver approximation error, and that gradient staleness contributes a quadratically growing transport error without bilevel-aware correction. Our algorithm, IGT-OMD, applies Implicit Gradient Transport to hypergradients within Online Mirror Descent, re-evaluating stale gradients at the current parameters using stored inner solutions. This method reduces transport error from a quadratic to a linear dependence on delay and achieves the first sublinear regret bound for delayed bilevel optimization with queue-length-adaptive step sizes. Controlled experiments provide a mechanistic fingerprint: transport benefit is exactly 0.0% (p = 1.00) at unit delay and grows monotonically to 9.5% at fifty rounds (p < 0.001), isolating the correction's effect. On Linear Quadratic Regulator, Warcraft shortest-path, and Sinkhorn optimal transport, IGT-OMD reduces decision loss by 17–55% relative to single-level baselines, with phase transitions matching the theory.

[LG-123] Profit Maximization in Bilateral Trade against a Smooth Adversary

链接: https://arxiv.org/abs/2605.12664
作者: Simone Di Gregorio,Paul Dütting,Federico Fusco,Chris Schwiegelshohn
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bilateral trade models the task of intermediating between two strategic agents, a seller and a buyer, who wish to trade a good. We study this problem from the perspective of a profit-maximizing broker within an online learning framework, where the agents' valuations are generated by a smooth adversary. We devise a learning algorithm that guarantees an Õ(√T) regret bound, which is tight in the time horizon T up to poly-logarithmic factors. This matches the minimax rate for the stochastic i.i.d. case, and is also well separated from the adversarial setting, where sublinear regret is unattainable. By extending the strong regret guarantees from the i.i.d. case to the smooth adversary, we significantly broaden the scope of settings where such a fast rate is achievable, while closing an important gap in the regret landscape of this fundamental economic problem. To overcome the challenges posed by this adversary, we leverage a continuity property of smooth instances and combine it with a hierarchical net-construction of the broker's action space, which is analyzed via algorithmic chaining. We showcase the applicability of these techniques by deriving a similarly tight Õ(√T) regret bound for a related mechanism design model: the joint ads problem.

[LG-124] scShapeBench: Discovering geometry from high dimensional scRNAseq data

链接: https://arxiv.org/abs/2605.12662
作者: Andrew J Steindl,João Felipe Rocha,Brian Tshilengi Di Bassinga,Zachary Warren,Matthew Scicluna,César Miguel Valdez Córdova,Shabarni Gupta,Leire Torices,Daniel Neumann,Timothy J. Mann,Ihuan Gunawan,Dhananjay Bhaskar,John G Lock,Christine L Chaffer,Guy Wolf,Smita Krishnaswamy
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:High-dimensional point cloud data arise across many scientific domains, especially single-cell biology. The shapes or topologies of these datasets determine the types of information that can be extracted. For example, clustered data supports cell-type identification, trajectory structures support transition analysis, and archetypal structures capture continua of cellular behaviors. Existing analysis pipelines often assume a specific shape. The standard Seurat pipeline combines UMAP visualization with Louvain clustering and therefore assumes clustered data, while tools such as Monocle and SPADE assume tree-like structures, and flow-based models such as MIOFlow and Conditional Flow Matching target trajectories. Choosing which pipeline to apply is therefore often left to bioinformaticians who visually inspect datasets before selecting an analysis strategy. With the rise of agentic AI scientists, automating shape detection is increasingly important for selecting downstream analysis pipelines. To address this problem, we introduce scShapeBench, a benchmark dataset for shape detection containing both synthetic and expert-annotated single-cell datasets. Synthetic datasets are sampled from ground-truth skeleton graphs with controlled variance. Real single-cell datasets are curated from diverse sources and annotated by experts into four categories: clusters, single trajectory, multi-branching, and archetypal. We additionally introduce scReebTower, a baseline method that uses diffusion geometry to extract Reeb graphs and connect visualization with pipeline selection. We provide topology-aware evaluation metrics and compare scReebTower against PAGA and Mapper on synthetic and real data. Our results indicate that scReebTower outperforms existing baselines. Overall, our contributions span benchmarks, evaluation metrics, and a baseline for automated shape detection in single-cell data.

[LG-125] Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

链接: https://arxiv.org/abs/2605.12651
作者: Parv Kapoor,Abigail Hammer,Ashish Kapoor,Karen Leung,Eunsuk Kang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Runtime monitoring of autonomous systems traditionally relies on mapping continuous sensor observations to discrete logical propositions defined over low-dimensional state variables. This abstraction breaks down in perception-driven settings, where such mappings require additional learned modules that are often computationally expensive, brittle, and semantically misaligned. In this work, we propose Embedding Temporal Logic (ETL), a temporal logic that performs monitoring directly in learned embedding spaces. ETL defines predicates through distances between observed embeddings and target embeddings derived from reference observations. This formulation allows specifications to capture high-level perceptual concepts, such as similarity to visual goals or avoidance of semantic regions, that are difficult or impossible to express using traditional predicates. By composing these predicates with temporal operators, ETL naturally expresses temporally extended and sequential perceptual behaviors. We introduce ETL monitors for evaluating specifications over bounded embedding traces, along with a conformal calibration procedure that provides reliable and safety-oriented predicate evaluation. We evaluate our approach across multiple manipulation environments to show that ETL achieves strong empirical agreement with ground-truth semantics, including accurate monitoring of temporally composed behaviors.
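To make the idea concrete, here is a minimal sketch of distance-based embedding predicates composed with bounded-trace temporal operators. This is not the paper's implementation: the predicate name `near`, the toy trace, and the thresholds are all illustrative assumptions, and the conformal calibration step is omitted.

```python
import numpy as np

def near(trace, target, eps):
    """Predicate: True at each step where the embedding is within eps of target."""
    return np.linalg.norm(trace - target, axis=1) <= eps

def eventually(sat):
    """Temporal operator F: the predicate holds at some step of the bounded trace."""
    return bool(np.any(sat))

def always(sat):
    """Temporal operator G: the predicate holds at every step of the bounded trace."""
    return bool(np.all(sat))

# toy trace of 2-D embeddings drifting toward a goal embedding
trace = np.array([[0.0, 0.0], [0.5, 0.4], [0.9, 0.8], [1.0, 1.0]])
goal = np.array([1.0, 1.0])          # target embedding from a reference observation
obstacle = np.array([5.0, 5.0])      # embedding of a region to avoid

reaches_goal = eventually(near(trace, goal, eps=0.1))       # "eventually near goal"
avoids_obstacle = always(~near(trace, obstacle, eps=1.0))   # "always not near obstacle"
```

Composing the two specifications over one trace illustrates how sequential perceptual behaviors ("reach the goal while never entering the avoided region") reduce to boolean operations on distance predicates.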

[LG-126] Population Risk Bounds for Kolmogorov-Arnold Networks Trained by DP-SGD with Correlated Noise

链接: https://arxiv.org/abs/2605.12648
作者: Puyu Wang,Jan Schuchardt,Nikita Kalinin,Junyu Zhou,Sophie Fellenz,Christoph Lampert,Marius Kloft
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We establish the first population risk bounds for Kolmogorov-Arnold Networks (KANs) trained by mini-batch SGD with gradient clipping, covering non-private SGD as well as differentially private SGD (DP-SGD) with Gaussian perturbations that interpolate between independent and temporally correlated noise. This setting is substantially closer to practice than prior KAN theory along two axes: training is by mini-batch SGD, the standard recipe for modern networks, rather than full-batch gradient descent (GD); and correlated-noise mechanisms have empirically shown a more favorable privacy-utility tradeoff than independent-noise mechanisms. Our results cover the corresponding full-batch GD and independent-noise DP-GD results for KANs by Wang et al. (2026), while yielding sharper fixed-second-layer specializations. The technical core is a new analysis route for correlated-noise DP training in the non-convex regime. Temporal dependence breaks the conditional-centering structure underlying standard one-step SGD arguments, and the projection step obstructs the exact cancellation structure of correlated perturbations. We address these difficulties through an auxiliary unprojected dynamics, a shifted iterate that absorbs the current noise perturbation, and a high-probability bootstrap certifying projection inactivity. Combining this optimization analysis with a stability-based generalization argument yields the stated population risk bounds. To the best of our knowledge, this is the first optimization and population risk analysis of a correlated-noise mechanism for DP training beyond convex learning, in particular for neural networks.
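As an illustration of the training setting analyzed here, the following toy sketch combines per-example gradient clipping with Gaussian noise whose temporal correlation is controlled by a mixing parameter rho (rho = 0 recovers independent noise). The AR(1)-style noise process and all names are assumptions for illustration, not the specific correlated mechanism studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip(g, c):
    """Clip a per-example gradient to L2 norm at most c."""
    n = np.linalg.norm(g)
    return g * min(1.0, c / n) if n > 0 else g

def dp_step(w, grads, z_prev, c=1.0, sigma=0.5, rho=0.9, lr=0.1):
    """One noisy clipped-gradient step; rho = 0 gives independent Gaussian
    noise, rho > 0 an AR(1)-correlated noise process across iterations."""
    mean_clipped = np.mean([clip(g, c) for g in grads], axis=0)
    xi = rng.normal(size=w.shape)
    z = rho * z_prev + np.sqrt(1.0 - rho**2) * xi   # stationary AR(1) noise
    noisy_grad = mean_clipped + sigma * c * z / len(grads)
    return w - lr * noisy_grad, z

w, z = np.zeros(3), np.zeros(3)
for _ in range(10):
    grads = [rng.normal(size=3) for _ in range(8)]  # toy per-example gradients
    w, z = dp_step(w, grads, z)
```

The temporal dependence injected by `z` is exactly what breaks the conditional-centering argument the abstract mentions: the noise at step t is no longer independent of the past iterates.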

[LG-127] OceanCBM: A Concept Bottleneck Model for Mechanistic Interpretability in Ocean Forecasting

链接: https://arxiv.org/abs/2605.12639
作者: Sanah Suri,Kieran Ringel,Maike Sonnewald
类目: Machine Learning (cs.LG)
*备注: 17 pages, 9 figures, 4 tables

点击查看摘要

Abstract:Extreme ocean phenomena are challenging not only to predict but to diagnose, as accurate forecasts alone do not reveal the underlying physical drivers. While recent machine learning approaches achieve strong predictive skill, they remain largely opaque and provide limited guarantees of fidelity to ground-truth physics. We introduce OceanCBM, the first concept bottleneck model (CBM) for spatiotemporal prediction and mechanistic interrogation of ocean dynamics. OceanCBM uses mixed supervision to predict mixed layer heat content, a key precursor of marine heatwaves, while routing information through an intermediate layer of prescribed concepts derived from geophysical fluid dynamics and a ‘free’ concept. This design imposes soft physical structure without over-constraining the model, and the free concept both regularizes concept predictions and captures residual physical processes. Across ensemble initializations, we show that mixed supervision yields consistent mechanistic representations, whereas prediction-only and prescription-only baselines learn highly variable latent structures despite similar predictive performance. OceanCBM achieves interpretable, physically grounded representations without sacrificing skill, explicitly characterizing the interpretability-performance trade-off.

[LG-128] CAWI: Copula-Aligned Weight Initialization for Randomized Neural Networks

链接: https://arxiv.org/abs/2605.12580
作者: Mushir Akhtar,M. Tanveer,Mohd. Arshad
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Randomized neural networks (RdNNs) enable efficient, backpropagation-free training by freezing randomly initialized input-to-hidden weights, which permits a closed-form solution for the output layer. However, conventional random initialization is blind to inter-feature dependence, ignoring correlations, asymmetries, and tail dependence in the data, which degrades conditioning and predictive performance. To the best of our knowledge, this limitation remains unaddressed in the RdNN literature. To close this gap, we propose CAWI (Copula-Aligned Weight Initialization), a framework that draws input-to-hidden weights from a data-fitted copula that matches empirical dependence, ensuring the frozen projections respect inter-feature dependence without sacrificing the closed-form solution. CAWI (i) maps each feature to the unit interval using empirical CDFs, (ii) fits a multivariate copula that captures rank-based dependence among features, and (iii) samples each weight column w_j from the fitted copula and applies a fixed inverse marginal transform to set scale. The objective, solver, and “freeze-once” paradigm remain unchanged; only the sampling law for W becomes dependence-aware. For dependence modeling, we consider two copula families: elliptical (Gaussian, t) and Archimedean (Clayton, Frank, Gumbel). This enables CAWI to handle diverse dependence, including tail dependence. We evaluate CAWI across 83 diverse classification benchmarks (binary and multiclass) and two biomedical datasets, BreaKHis and the Schizophrenia dataset, using standard shallow and deep RdNN architectures. CAWI consistently delivers significant improvements in predictive performance over conventional random initialization. Code is available at: this https URL
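A minimal sketch of the three-step recipe described in the abstract, assuming a Gaussian copula and standard-normal inverse marginals (the paper also considers t and Archimedean families); the function names and toy data are illustrative, not the authors' code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# toy data: two dependent features, one with a skewed marginal
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
X[:, 1] = np.exp(X[:, 1])

# (i) map each feature to (0, 1) via rank-based pseudo-observations
ranks = X.argsort(axis=0).argsort(axis=0) + 1
U = ranks / (X.shape[0] + 1)

# (ii) fit a Gaussian copula: correlation matrix of the normal scores
Z = stats.norm.ppf(U)
R = np.corrcoef(Z, rowvar=False)

# (iii) sample weight rows from the fitted copula, then apply a fixed
# inverse marginal transform (standard normal here) to set the scale
def sample_weights(n_hidden):
    z = rng.multivariate_normal(np.zeros(R.shape[0]), R, size=n_hidden)
    u = stats.norm.cdf(z)            # copula sample in (0, 1)^d
    return stats.norm.ppf(u)         # inverse marginal; any marginal could be used

W = sample_weights(n_hidden=64)      # dependence-aware input-to-hidden weights
```

The sampled rows of `W` inherit the rank-based dependence of the features, while the closed-form output-layer solve of the RdNN is left untouched.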

[LG-129] Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance

链接: https://arxiv.org/abs/2605.12561
作者: Adam Haroon,Erick J. Rodríguez-Seda,Cody Fleming,Tristan Schuler
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 27 pages, 6 figures

点击查看摘要

Abstract:Safe reinforcement learning (RL) typically asks what an agent should do. We ask when it needs to act, and show that a single policy can jointly learn control inputs and communication-efficient timing decisions under a pointwise Lyapunov safety shield. We focus on stabilization around a known equilibrium, where CARE-based LQR backups, Lyapunov certificates, and classical Lyapunov-STC are well defined, enabling clean comparison against analytical baselines. A run-time assurance (RTA) layer overrides the policy via a one-step-ahead Lyapunov prediction and a precomputed LQR backup, providing a strictly stronger guarantee than constrained MDP methods that enforce safety only in expectation. On an inverted pendulum, cart–pole, and planar quadrotor, the learned policy achieves 1.91×, 1.45×, and 3.51× higher mean inter-sample interval (MSI) than a Lyapunov-triggered baseline; a fixed LQR controller at the same average rate is unstable on all three plants, showing that adaptive timing, not a lower average rate, makes sparsity safe. A CARE-derived Lyapunov reward transfers across environments without redesign, with a single weight w_c controlling the stability–communication tradeoff; ablations confirm the RTA shield is essential, with its removal reducing MSI by 1.27–1.84× and degrading state norms. A preference-conditioned extension recovers the full tradeoff frontier from one model at 2/11 of training compute, and SAC experiments show the results are algorithm-agnostic across discrete and continuous domains. A 12-state 3D quadrotor case study extends the framework to higher-dimensional systems where classical STC is intractable, and robustness to ±30% mass variation and disturbances shows graceful degradation, with the RTA absorbing what the learned policy cannot.
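The run-time assurance idea, a one-step-ahead Lyapunov check with an LQR-style backup override, can be sketched as follows. The system matrices, gain K, and Lyapunov matrix P below are toy stand-ins, not the paper's CARE-derived quantities.

```python
import numpy as np

# toy discrete-time double integrator: x_{t+1} = A x_t + B u_t
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([0.0, 0.1])
K = np.array([3.2, 2.5])                  # stabilizing gain (stand-in for the LQR backup)
P = np.array([[2.0, 0.5], [0.5, 1.0]])    # Lyapunov matrix, V(x) = x' P x

def V(x):
    return float(x @ P @ x)

def shielded_action(x, u_policy, decay=0.99):
    """Accept the policy action only if the one-step-ahead Lyapunov value
    decreases; otherwise override with the backup controller."""
    x_next = A @ x + B * u_policy
    if V(x_next) <= decay * V(x):
        return u_policy, False            # action passes the shield
    return float(-K @ x), True            # overridden by the backup

x = np.array([1.0, 0.0])
u, overridden = shielded_action(x, u_policy=5.0)   # aggressive action gets overridden
```

Because the check is pointwise (applied at every step), the guarantee holds for each trajectory, in contrast to constrained-MDP methods that only bound expected constraint violation.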

[LG-130] BioSEN: A Bio-acoustic Signal Enhancement Network for Animal Vocalizations

链接: https://arxiv.org/abs/2605.12534
作者: Tianyu Song,Ton Viet Ta,Ngamta Thamwattana,Hisako Nomura,Linh Thi Hoai Nguyen
类目: Sound (cs.SD); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Most work in audio enhancement targets human speech, while bioacoustics is less studied due to noisy recordings and the distinct traits of animal sounds. To fill this gap, we adapt speech enhancement methods and build BioSEN, a model made for bioacoustic signals. BioSEN has three modules: a multi-scale dual-axis attention unit for time-frequency feature extraction, a bio-harmonic multi-scale enhancement unit for capturing harmonic structures, and an energy-adaptive gating connection unit that uses frequency weights to keep vocalizations from being removed as noise. Tests on three bioacoustic datasets show that BioSEN matches or exceeds state-of-the-art speech enhancement models while using far less computation. These results show BioSEN's strength for bioacoustic audio enhancement and its promise for biodiversity monitoring and conservation. Journal reference: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), DOI: https://doi.org/10.1109/ICASSP55912.2026.11463818

[LG-131] Real-World Challenges in Fake News Detection: Dealing with Posts by Cold Users

链接: https://arxiv.org/abs/2605.12511
作者: Sai Keerthana Karnam,Abhirup Kundu,Jashn Arora,Manish Jain,Animesh Mukherjee
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: This paper is accepted at ICWSM 2026

点击查看摘要

Abstract:Social media serves as a primary source of information in the current digital era. Many people consume a vast range of information in a very short span, yet, amidst the stream of genuine information, fake news and rumors continue to spread. The need for effective detection models is becoming increasingly critical. Past user behavior and user engagement on a post are strong signals that SOTA approaches leverage for fake news detection and other post classification tasks. However, these approaches lean too heavily on knowing this past behavior, and thus suffer from a cold user problem: users who are new or have a minimal footprint on the platform provide little or no such signal. In this paper, we make three core contributions. We first establish the value of user behavior, both content and user-user interactions, in the task of fake news and rumor detection. We then establish the extensive prevalence of cold users in real-world datasets, and show the need for newer algorithms that account for them. We next propose a novel socially-aware context representation scheme - USER EVIDENCE NETWORK (UEN) - to detect the spread of misinformation and unverified information while efficiently navigating this cold user challenge. We introduce techniques that approximate missing or absent behavior data of a new user from existing users' interactions. By carefully addressing the cold user challenge, our work provides robust approaches targeting fake news and rumor detection for real-world platforms.

[LG-132] What is Learnable in Valiants Theory of the Learnable?

链接: https://arxiv.org/abs/2605.13840
作者: Steve Hanneke,Anay Mehrotra,Grigoris Velegkas,Manolis Zampetakis
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
*备注: Abstract shortened for arXiv

点击查看摘要

Abstract:Valiant's 1984 paper is widely credited with introducing the PAC learning model, but it, in fact, introduced a different model: unlike PAC learning, the learner receives only positives, may issue membership queries, and must output a hypothesis with no false positives. Prior work characterized variants, including the case without queries. We revisit Valiant's original model and ask: Which classes are learnable in it? For every finite domain, including Valiant's Boolean-hypercube setting, we show that a class is learnable if and only if every realizable positive sample can be certified by a poly-size adaptive query-compression scheme. This is a new variant of sample compression where the learner certifies samples via a short interaction with the membership oracle. Our characterization shows that learnability in Valiant's model is strictly sandwiched between learnability in the PAC model and the variant of Valiant's model without membership queries. This is one of the rare cases where introducing membership queries changes the set of learnable classes, and not just the sample or computational complexity. Next, we study the natural extension of the model to arbitrary domains. While we do not obtain an exact characterization, our techniques readily generalize and show that the same strict sandwiching persists. Finally, we show that d-dimensional halfspaces, which are not learnable without queries, are learnable with queries: we give a poly(d)·Õ(1/ε) sample and poly(d)·polylog(1/ε) query algorithm, and prove that at least Ω(d) samples or queries are necessary. To our knowledge, this is the first algorithm for halfspaces in Valiant's model. Together, these results uncover a surprisingly rich theory behind Valiant's original notion of learnability and introduce ideas that may be of independent interest in learning theory.

[LG-133] Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

链接: https://arxiv.org/abs/2605.13807
作者: Ejaaz Merali,Mohamed Hibat-Allah,Mohammad Kohandel,Richard T. Scalettar,Ehsan Khatami
类目: Strongly Correlated Electrons (cond-mat.str-el); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
*备注: 13 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Neural-network quantum states have emerged as a powerful variational framework for quantum many-body systems, with recent progress often driven by massively parallel architectures such as transformers. Recurrent neural network quantum states, however, are frequently regarded as intrinsically sequential and therefore less scalable. Here we revisit this view by showing that modern recurrent architectures can support fast, accurate, and computationally accessible neural quantum state simulations. Using autoregressive recurrent wave functions together with recent advances in parallelizable recurrence, we develop variational ansätze, called parallel scan recurrent neural quantum states (PSR-NQS), which can be trained efficiently within variational Monte Carlo in one and two spatial dimensions. We demonstrate accurate benchmark results and show that, with iterative retraining, our approach reaches two-dimensional spin lattices as large as 52×52 while remaining in agreement with available quantum Monte Carlo data. Our results establish recurrent architectures as a practical and promising route toward scalable neural quantum state simulations with modest computational resources.
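For intuition on the parallelizable-recurrence ingredient, here is a sketch of an associative (Hillis–Steele-style) scan for the scalar linear recurrence h_t = a_t·h_{t-1} + b_t with h_0 = 0; actual RNQS architectures use gated, vector-valued recurrences on accelerators, so this is purely illustrative.

```python
import numpy as np

def combine(left, right):
    """Associative composition of affine maps h -> a*h + b (right after left)."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def parallel_scan(a, b):
    """Inclusive Hillis-Steele scan: O(log T) depth given T parallel workers.
    Returns h_t for the recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0."""
    pairs = list(zip(a, b))
    n, step = len(pairs), 1
    while step < n:
        # descending order emulates the simultaneous update of the parallel model
        for t in range(n - 1, step - 1, -1):
            pairs[t] = combine(pairs[t - step], pairs[t])
        step *= 2
    return np.array([p[1] for p in pairs])

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.9, size=16)
b = rng.normal(size=16)

# sequential reference for comparison
h, hs = 0.0, []
for t in range(16):
    h = a[t] * h + b[t]
    hs.append(h)
```

Because `combine` is associative, the scan can be evaluated in logarithmic depth while producing exactly the same hidden states as the sequential loop.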

[LG-134] Conformal Anomaly Detection in Python: Moving Beyond Heuristic Thresholds with nonconform

链接: https://arxiv.org/abs/2605.13642
作者: Oliver Hennhöfer,Maximilian Kirsch,Christine Preisach
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 20 pages, 4 figures

点击查看摘要

Abstract:Most anomaly detection systems output scores rather than calibrated decisions, leaving practitioners to choose thresholds heuristically and without clear statistical interpretation. Conformal anomaly detection addresses this limitation by converting anomaly scores into calibrated p-values that are valid under the statistical assumption of data exchangeability, with a growing literature extending this idea beyond that setting. We present ‘nonconform’, a Python package for applying conformal anomaly detection within existing machine-learning workflows, and use it as the basis for an implementation-grounded introduction to the field. The package integrates with ‘scikit-learn’, ‘pyod’, and custom anomaly detectors, and provides a unified interface for calibration, p-value generation, and false discovery rate control. It supports several conformalization strategies, ranging from simple split-conformal calibration to more data-efficient and shift-aware extensions. Through a progression from foundational concepts to advanced conformalization strategies, complemented by code examples, the paper connects the statistical ideas behind conformal anomaly detection to their practical use in ‘nonconform’. Empirical results demonstrate that the implemented methods enable statistically principled anomaly detection. Together, the package and exposition aim to make core conformal anomaly detection workflows more accessible and reproducible in experimental and production-oriented settings.
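A generic sketch of the core workflow the abstract describes: split-conformal p-values followed by Benjamini–Hochberg FDR control. This is not the `nonconform` API; all names and data are illustrative.

```python
import numpy as np

def conformal_p_values(calib_scores, test_scores):
    """Split-conformal p-values from anomaly scores calibrated on nominal data
    (higher score = more anomalous); valid under exchangeability."""
    calib = np.asarray(calib_scores)
    n = len(calib)
    return np.array([(1 + np.sum(calib >= s)) / (n + 1) for s in test_scores])

def benjamini_hochberg(p, alpha=0.1):
    """Boolean mask of discoveries under BH step-up FDR control at level alpha."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

rng = np.random.default_rng(0)
calib = rng.normal(size=200)                              # scores on nominal data
test = np.concatenate([rng.normal(size=5), [6.0, 7.0]])   # last two: clear anomalies
p = conformal_p_values(calib, test)
hits = benjamini_hochberg(p, alpha=0.1)
```

The key shift from heuristic thresholds is that each flagged point now carries a calibrated p-value, and the discovery set controls the false discovery rate at a stated level.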

[LG-135] CO-MAP: A Reinforcement Learning Approach to the Qubit Allocation Problem NEURIPS’26

链接: https://arxiv.org/abs/2605.13638
作者: Ankit Kulshrestha,Xiaoyuan Liu
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: Under review at NeurIPS’26

点击查看摘要

Abstract:A quantum compiler is a critical piece in the quantum computing pipeline since it allows an abstract quantum circuit to be run on a physical quantum computer. One extremely important subproblem in quantum compilation is the generation of a logical-to-physical qubit mapping. Typically in quantum compilers this step is either implemented as a random or a heuristic-based assignment that aims to minimize additional (SWAP) gate overhead in the quantum circuit. In this paper, we present an alternative approach to solving the qubit mapping problem. Specifically, we formulate the qubit mapping problem with a combinatorial optimization (CO) objective. We then present a method to find a solution to the CO problem by training a reinforcement learning (RL) policy. We also propose a local search based post-processing algorithm to further reduce the overhead. Our results show a dramatic improvement over conventional techniques in reducing the number of SWAPs. On different real world datasets like MQTBench and Queko circuits, our trained policy achieves a 65–85% reduction in SWAP overhead when compared to existing quantum compilers.

[LG-136] Causal Learning with the Invariance Principle

链接: https://arxiv.org/abs/2605.13589
作者: Francesco Montagna,Francesco Locatello
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal discovery, the problem of inferring the direction of causality, is generally ill-posed. We use the language of structural causal models (SCM) to show that assuming that the causal relations are acyclic and invariant across multiple environments (e.g., the way minimum wage affects employment rate is stable across different geographical regions), only two auxiliary environments are sufficient to infer the causal graph for arbitrary nonlinear mechanisms. Moreover, we demonstrate that this implies identifiability of the SCM functional mechanisms: as a corollary, we show that two auxiliary environments are sufficient to guarantee correct counterfactual inference. We empirically support our theoretical results on synthetic data.

[LG-137] Reframing preprocessing selection as model-internal calibration in near-infrared spectroscopy: A large-scale benchmark of operator-adaptive PLS and Ridge models

链接: https://arxiv.org/abs/2605.13587
作者: Gregory Beurier,Robin Reiter,Camille Noûs,Lauriane Rouan,Denis Cornet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 23 pages, 4 figures; includes supplementary material. Code: this https URL

点击查看摘要

Abstract:Near-infrared spectroscopy (NIRS) is rapid and non-destructive, but reliable calibration still depends heavily on spectral preprocessing. In routine practice, preprocessing is often selected by large external pipeline searches that are costly, unstable on small calibration sets, and difficult to audit. We introduce operator-adaptive calibration, a framework that moves linear preprocessing selection inside the calibration model. Candidate treatments are encoded as linear spectral operators, while nonlinear or sample-adaptive corrections such as SNV, MSC, and ASLS are handled as fold-local branches to prevent leakage. We instantiate the framework for PLS and Ridge regression. For PLS, covariance identities enable fast NIPALS and SIMPLS variants while preserving original-wavelength coefficients. For Ridge, operator-adaptive kernels yield a dual formulation with recoverable original-space coefficients. The approach was evaluated on more than 50 heterogeneous NIRS datasets against conventional PLS, Ridge, CatBoost, and CNN baselines under documented search budgets. Compact operator-adaptive PLS with ASLS branch preprocessing achieved a median RMSEP/PLS ratio of 0.960 with 42 wins on 57 datasets, while a deployable AOM-Ridge selector improved over tuned Ridge by a median 2.22% with 35 wins on 52 datasets. The proposed models reduce dependence on large preprocessing-HPO campaigns, produce traceable operator choices, retain interpretable coefficients, and fit in seconds for compact AOM-PLS. Operator-adaptive calibration therefore offers a practical route to faster, more robust, and more auditable NIRS method development.
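To illustrate the "operator inside the model" idea in its simplest form, the sketch below folds one candidate linear treatment (a first-difference operator, chosen arbitrarily here) into an ordinary ridge fit and maps the coefficients back to original wavelengths. The paper's AOM-Ridge additionally uses a dual kernel formulation and data-driven operator selection; everything below is a hypothetical illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 40
X = rng.normal(size=(n, p))          # toy "spectra"
beta_true = np.zeros(p)
beta_true[5:10] = 1.0
y = X @ beta_true + 0.1 * rng.normal(size=n)

# encode a candidate linear treatment as a spectral operator:
# a first-difference (derivative-like) operator D of shape (p-1, p)
D = np.eye(p - 1, p, k=1) - np.eye(p - 1, p)

def ridge_with_operator(X, y, D, lam):
    """Fit ridge on the operator-transformed spectra X D^T, then map the
    coefficients back to the original wavelengths as beta = D^T beta_tilde."""
    Z = X @ D.T
    beta_t = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
    return D.T @ beta_t

beta = ridge_with_operator(X, y, D, lam=1.0)   # coefficients in original space
```

Because the operator is linear, the objective and solver are unchanged; only the sampling of features the model sees becomes preprocessing-aware, which is the property the framework exploits.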

[LG-138] Beyond Explained Variance: A Cautionary Tale of PCA

链接: https://arxiv.org/abs/2605.13520
作者: Gionni Marchetti
类目: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 12 pages, 10 figures

点击查看摘要

Abstract:We address shortcomings of principal component analysis (PCA) for visualizing high-dimensional data lying on a nonlinear low-dimensional manifold via two-dimensional scatterplots, focusing on a fossil teeth dataset from the early mammalian insectivore Kuehneotherium. While the PCA scatterplot reported by Jolliffe and Cadima (Philosophical Transactions of the Royal Society A, 2016) shows clustering in the region where PC2 0, our analysis based on t-SNE and persistent homology (PH) reveals a ring-like structure with no evident clustering and intrinsic dimensionality equal to one. We further propose a generative probabilistic-geometric model in which the data are sampled uniformly from a unit circle. Under this model, pairwise cosine distances follow an arcsine distribution, in qualitative agreement with the observed U-shaped distribution, thereby independently supporting the analysis based on t-SNE and persistent homology.
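The claimed arcsine law is easy to check numerically: for points sampled uniformly on the unit circle, the pairwise cosine distance d = 1 − cos(Δθ) has CDF arccos(1 − d)/π on [0, 2]. A minimal sketch (sample size and grid are arbitrary choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=1000)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # uniform samples on the circle

# pairwise cosine distances d_ij = 1 - cos(theta_i - theta_j)
cos_sim = X @ X.T
iu = np.triu_indices_from(cos_sim, k=1)
d = 1.0 - cos_sim[iu]

# arcsine law on [0, 2]: CDF F(d) = arccos(1 - d) / pi
grid = np.linspace(0.01, 1.99, 50)
empirical = np.array([(d <= g).mean() for g in grid])
predicted = np.arccos(1.0 - grid) / np.pi
max_gap = np.abs(empirical - predicted).max()         # small if the law holds
```

The density corresponding to this CDF diverges at d = 0 and d = 2, which is exactly the U-shaped histogram of pairwise distances the abstract refers to.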

[LG-139] On the Limits of Latent Reuse in Diffusion Models

链接: https://arxiv.org/abs/2605.13448
作者: Yifeng Yu,Lu Yu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Diffusion models are often trained in low-dimensional latent spaces, which are then reused for related but shifted datasets. In this work, we study when such latent reuse remains reliable under distribution shift. We consider a source-target setting in which both datasets are approximately low-dimensional but may lie near different subspaces. We show that freezing and reusing a source latent space induces a target-domain score error governed by two quantities: the principal-angle misalignment between the source and target subspaces, and the target ambient noise amplified by the diffusion time scale. Motivated by these limits, we further study mixed source-target training and characterize how the required shared latent dimension depends on the relative geometry of the two distributions. Our results provide theoretical guidance on when latent reuse is reliable and when learning a shared representation may be necessary.

[LG-140] Learning Perturbations to Extrapolate Your LLM

链接: https://arxiv.org/abs/2605.13284
作者: Zetai Cen,Chenfei Gu,Jin Zhu,Ting Li,Yunxiao Chen,Chengchun Shi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 35 pages

点击查看摘要

Abstract:Recent advancements in large language models demonstrate that injecting perturbations can substantially enhance extrapolation performance. However, current approaches often rely on discrete perturbations with fixed designs, which limits their flexibility. In this work, we propose a framework where token prefixes are perturbed by a learnable transformation of a continuous latent vector within an embedding space. To overcome the challenge of an intractable marginal likelihood, we derive unbiased estimating equations for model parameters and optimize them via stochastic gradient descent. We establish the statistical properties of the resulting estimator in over-parameterized regimes. Empirical evaluations on both synthetic and real-world datasets demonstrate that our proposal yields significant gains in out-of-domain settings over a range of state-of-the-art baseline methods.

[LG-141] Proximal-Based Generative Modeling for Bayesian Inverse Problems

链接: https://arxiv.org/abs/2605.13278
作者: Boyang Zhang,Zhiguo Wang,Ya-Feng Liu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Score-based diffusion models demonstrate superior performance in generative tasks but encounter fundamental bottlenecks in inverse problems due to the analytical intractability of the time-dependent likelihood score. To bridge this gap, we propose a novel proximal-based generative modeling (PGM) framework that rigorously circumvents explicit likelihood evaluation. Our framework is built upon a theoretical equivalence between Gaussian convolution in diffusion processes and Moreau-Yosida regularization in nonsmooth optimization. This enables a new sampling mechanism driven by the proposed Moreau score, which admits a closed-form expression via proximal operators. Moreover, we introduce Moreau score matching to learn the proximal operators that rely solely on samples drawn from the prior distribution. Theoretically, PGM eliminates the early-stopping bias inherent in the score-based diffusion model and achieves non-asymptotic convergence. Experiments demonstrate that PGM significantly surpasses state-of-the-art methods in reconstruction quality and sampling time.
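The Moreau-Yosida connection at the heart of this framework can be illustrated on the simplest nonsmooth function. A hypothetical sketch for f(x) = |x| (not the paper's model): the proximal operator is closed-form soft-thresholding, and plugging it back in gives the Moreau envelope, a smoothed Huber-like version of |x|.

```python
import numpy as np

def prox_abs(x, lam):
    # Proximal operator of f(x) = |x|: closed-form soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_env_abs(x, lam):
    # Moreau envelope e_lam(x) = min_y |y| + (x - y)^2 / (2 lam),
    # evaluated at the proximal point; for |x| this is the Huber function.
    p = prox_abs(x, lam)
    return np.abs(p) + (x - p) ** 2 / (2.0 * lam)

x = np.linspace(-3.0, 3.0, 61)
env = moreau_env_abs(x, 1.0)
```

The envelope is everywhere below |x| yet smooth at the origin, which is the kind of regularization the abstract identifies with Gaussian convolution in the diffusion process.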

[LG-142] Physics Guided Generative Optimization for Trotter Suzuki Decomposition

链接: https://arxiv.org/abs/2605.13268
作者: WenBin Yan
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Product formulas for Trotter Suzuki simulation remain a practical route to Hamiltonian evolution on noisy intermediate scale quantum (NISQ) hardware, yet their accuracy hinges on three coupled choices: term grouping, product formula order, and timestep allocation. Toolchains such as Qiskit and Paulihedral lean on hand tuned heuristics, while the discrete nature of grouping and order makes naive gradient based optimization awkward. We describe a generate and evaluate loop: a conditional diffusion model proposes strategies, a physics informed neural network (PINN) supplies differentiable fidelity feedback, and a graph neural network (GNN) encodes commutator structure. Training spans a hybrid space (discrete grouping and order, continuous time steps); the closed loop uses REINFORCE and a Pareto tracker. On the transverse field Ising model (TFIM), under our primary comparison setup, the method reaches 85.6% of the fidelity of a fourth order Qiskit baseline (0.856) at roughly 21.8% of the circuit depth and 19.2% of the baseline CNOT count. Under an equal depth budget, fine tuning in the loop reached a best observed fidelity of 0.9994. Updated ablations show that, for a fixed training budget and default guidance knobs, module contributions depend on the training recipe and guidance hyperparameters; CFG in particular needs to be tuned jointly with compute budget. Overall, the results suggest that “generative model and physics supervision” is a viable angle for NISQ oriented compilation, though where it wins still depends on the operating point.

[LG-143] The Sample Complexity of Multiple Change Point Identification under Bandit Feedback

链接: https://arxiv.org/abs/2605.13252
作者: Maximilian Graf,Victor Thuot
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study multiple change point localization under bandit feedback. An unknown piecewise-constant function on a compact interval can be queried sequentially at adaptively chosen inputs, and each query returns a noisy evaluation of the function. The goal is to identify a prescribed number of discontinuities, known as change points, within a target precision \eta and confidence level 1-\delta , while using as few samples as possible. We propose an adaptive algorithm that first detects intervals likely to contain change points and then refines their locations to precision \eta . We establish non-asymptotic upper bounds on its sample budget, together with corresponding lower bounds. Prior work shows that jump magnitudes alone determine the asymptotic sample complexity as \delta\to 0 . We reveal that this picture is incomplete beyond this regime. We demonstrate, both empirically and theoretically, that for general \delta and \eta , the complexity is jointly governed by the jumps and the relative positions of the change points.
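The detect-then-refine idea can be sketched in a few lines for a single change point. Assuming (hypothetically) a unit jump and repeated noisy queries at each probed input, bisection narrows the location to precision eta:

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_CP = 0.618  # unknown change point of a piecewise-constant function

def query(x, noise=0.3, reps=200):
    """Bandit feedback: average of `reps` noisy evaluations at input x."""
    mean = 0.0 if x < TRUE_CP else 1.0
    return mean + noise * rng.standard_normal(reps).mean()

def localize(lo=0.0, hi=1.0, eta=1e-3):
    # Repeated queries shrink the noise well below half the jump
    # magnitude, so each midpoint is classified correctly with high
    # probability and plain bisection applies.
    while hi - lo > eta:
        mid = 0.5 * (lo + hi)
        if query(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

est = localize()
```

The sample budget here scales as the per-probe repetitions times log(1/eta), matching the intuition in the abstract that smaller jumps and tighter precision both inflate the complexity.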

[LG-144] Coupling-Informed Transport Maps for Bayesian Filtering in Nonlinear Dynamical Systems

链接: https://arxiv.org/abs/2605.13174
作者: Dengfei Zeng,Lijian Jiang,Shuyu Sun,Dunhui Xiao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 29 pages, 14 figures

点击查看摘要

Abstract:A likelihood-free transport filtering method is proposed based on the couplings between state and observation variables. By exploiting a block-triangular structure in the transport map, the analysis step of filtering is reformulated as the minimization of the maximum mean discrepancy (MMD) between the true joint measure and its transport-based approximation. To circumvent the non-convexity in the MMD optimization, we introduce a training-free transport filter method via gradient flows, which leads to an analytic computation for the transport map that implies the steepest descent direction of the MMD. The proposed approach accurately approximates non-Gaussian filtering posteriors and avoids particle collapse. We provide a convergence analysis for the expectation of the MMD between the approximated posterior and the true posterior. Finally, we extend the method to high-dimensional problems through domain localization. Numerical examples demonstrate the superior performance of our approach over conventional filtering methods in nonlinear, non-Gaussian scenarios.
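The MMD objective used in the analysis step has a simple plug-in estimator. A minimal sketch with an RBF kernel (the kernel choice and bandwidth here are assumptions, not taken from the paper):

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
same = rbf_mmd2(X, rng.standard_normal((300, 2)))          # same law
shifted = rbf_mmd2(X, rng.standard_normal((300, 2)) + 2.0)  # shifted law
```

`same` is close to zero while `shifted` is clearly positive; this separation is what the filter's gradient flow drives to zero when matching the approximated posterior to the truth.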

[LG-145] Kernel-based guarantees for nonlinear parametric models in Bayesian optimization

链接: https://arxiv.org/abs/2605.13160
作者: Rafael Oliveira
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern Bayesian optimization and adaptive sampling methods increasingly rely on nonlinear parametric models, yet theoretical guarantees for such models under adaptive data collection remain limited. Existing analyses largely focus on Gaussian processes, kernel machines, linear models, or linearized neural approximations, leaving a gap between theory and the nonlinear models used in practice. We develop a kernel based framework for analyzing regularized nonlinear parametric models trained on adaptively collected data. Our approach uses kernels over the parameter space to induce reproducing kernel Hilbert space structures over the corresponding model class, yielding confidence bounds for models trained with broad classes of regularized convex losses. We show how these bounds can support convergence guarantees for nonlinear acquisition and surrogate models, including randomized regularized policies that select points by maximizing a trained random model. These results provide a unified route to analyzing nonlinear parametric models in Bayesian optimization and related adaptive optimization settings.

[LG-146] Generative Modeling of Approximately Periodic Time Series by a Posterior-Weighted Gaussian Process

链接: https://arxiv.org/abs/2605.13150
作者: Elias Reich,Saverio Messineo,Stefan Huber
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete automated processes in industrial and cyber-physical systems often exhibit a repetitive structure in which successive repetitions follow a common trajectory while differing in duration, amplitude, and fine-scale dynamics. Such *approximately periodic* behavior poses a challenge for Gaussian Processes (GP) modeling: strictly periodic models suppress inter-repetition variability, while non-periodic models fail to capture the strong structural regularities required for generation. In this work, we propose a stochastic generative model for approximately periodic time series. The model is based on a GP whose posterior is modulated by a novel kernel. Our approach decouples intra-repetition structure from inter-repetition variability through a two-stage construction which yields a generative distribution with an identical mean function across repetitions, while allowing smooth variation between repetitions. The modeling choices are supported by an implementation in which realistic synthetic trajectories are generated from toy datasets.
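A standard way to get the behaviour this abstract describes, a shared shape across repetitions with smooth drift between them, is a quasi-periodic kernel: an exactly periodic factor multiplied by a slowly decaying RBF. This is the generic textbook construction, not the paper's novel kernel:

```python
import numpy as np

def quasi_periodic_kernel(t1, t2, period=1.0, ell_per=0.5, ell_decay=3.0):
    # Periodic factor enforces the common within-repetition shape; the
    # slow RBF factor lets successive repetitions drift apart smoothly.
    d = t1[:, None] - t2[None, :]
    per = np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell_per ** 2)
    dec = np.exp(-d ** 2 / (2.0 * ell_decay ** 2))
    return per * dec

t = np.linspace(0.0, 5.0, 250)                 # five repetitions of period 1
K = quasi_periodic_kernel(t, t)

# Draw one approximately periodic trajectory from the GP prior.
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(len(t)), K + 1e-8 * np.eye(len(t)))
```

Samples repeat the same rough shape every period while gradually decorrelating over several periods, the middle ground between the strictly periodic and non-periodic extremes the abstract contrasts.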

[LG-147] Amortized Neural Clustering of Time Series based on Statistical Features

链接: https://arxiv.org/abs/2605.13128
作者: Ángel López-Oriona,Ying Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:This paper introduces an algorithm-agnostic approach to feature-based time series clustering via amortized neural inference. By training neural networks to approximate the optimal partitioning rule from simulated data, the proposed framework reduces reliance on conventional clustering methods, such as K-means, K-medoids, or hierarchical clustering, and their associated objective functions and heuristics. Leveraging statistical features, such as autocorrelations and quantile autocorrelations, the approach learns a data-driven affinity structure from which clustering partitions can be recovered, without requiring explicit prior specification of cluster shapes or structures. In addition, one version of the method can automatically determine the number of clusters, avoiding ad-hoc selection procedures. Comprehensive empirical studies show that the proposed framework achieves competitive or superior clustering accuracy relative to traditional methods, even in challenging scenarios where competing techniques are provided with the true number of clusters. An application to financial time series of stock returns illustrates its practical utility. By reducing the need for algorithm selection and calibration, the proposed framework opens new possibilities for automated, adaptive, and data-driven clustering of temporal data across scientific and industrial domains.
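The statistical features driving the method are ordinary autocorrelations. A minimal sketch (the feature set and series are illustrative, not the paper's): series from different dynamic regimes map to well-separated feature vectors, which is what makes the learned affinity clusterable.

```python
import numpy as np

def acf_features(x, lags=(1, 2, 3, 4, 5)):
    """Sample autocorrelations at the given lags as a feature vector."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = float(np.dot(x, x))
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in lags])

rng = np.random.default_rng(0)
white = rng.standard_normal(500)          # no serial dependence
ar = np.zeros(500)                        # strongly autocorrelated AR(1)
for i in range(1, 500):
    ar[i] = 0.9 * ar[i - 1] + rng.standard_normal()

f_white, f_ar = acf_features(white), acf_features(ar)
```

The white-noise features sit near zero while the AR(1) features decay geometrically from about 0.9, so any reasonable affinity over these vectors separates the two regimes.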

[LG-148] State-of-art minibatches via novel DPP kernels: discretization, wavelets and rough objectives

链接: https://arxiv.org/abs/2605.13127
作者: Hoang-Son Tran,Pranav Gupta,Rémi Bardenet,Subhroshekhar Ghosh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 52 pages

点击查看摘要

Abstract:Determinantal point processes (DPPs) have emerged as a kernelized alternative to vanilla independent sampling for generating efficient minibatches, coresets and other parsimonious representations of large-scale datasets. While theoretical foundations and promising empirical performance have been demonstrated, there are two challenges for current proposals for DPP-based coresets or minibatches. The first is the need for families of DPPs with certain key variance reduction properties, usually constructed in a continuous setting, of which there are few known examples. The second is the need for an ad-hoc construction of a discrete DPP defined on a given dataset, that inherits such variance reduction. In this work, we contribute to the programme of establishing DPPs as a subsampling toolbox for ML by advancing on these two fronts. First, we propose new DPPs on the Euclidean space based on wavelets, with provably better accuracy guarantees than the best known rates. Second, we introduce a general method to convert such continuous DPPs, which are more amenable to proving analytical statements, into discrete kernels, which are pertinent for subsampling tasks such as minibatch and coreset constructions. This conversion mechanism simultaneously preserves the desired variance decay and reveals a low-rank decomposition of the discrete kernel, which makes sampling the corresponding DPP computationally inexpensive. En route, we enlarge the class of ML tasks amenable to improvements via DPP-based minibatches and coresets to include objective functions with arbitrarily low regularity, and rate guarantees that explicitly adapt to this regularity.

[LG-149] Adaptive Kernel Density Estimation with Pre-training

链接: https://arxiv.org/abs/2605.13092
作者: Ruitong Zhang,Ke Deng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 8 pages main text, 14 pages total including references and appendix, 3 figures

点击查看摘要

Abstract:Density estimation in high-dimensional settings is an important and challenging statistical task. Existing methods based on kernel smoothing are inefficient in high dimensions due to the difficulties in specifying appropriate location-adaptive kernels. In this work, we introduce pre-training, a key idea behind many cutting-edge AI technologies, to the context of non-parametric density estimation. By establishing a pre-trained neural network that can recommend an appropriate location-adaptive kernel for each sample point, efficient density estimation with adaptive kernels is achieved in high dimensions. A wide range of numerical experiments show that this strategy is highly effective for improving density-estimation accuracy, when the target distribution is close to the distribution family for pre-training. When the target distribution is substantially different from the pre-training distribution family, the benefit from the proposed pre-training strategy may be diluted, but can be reactivated by an additional fine-tuning procedure.
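The role of a location-adaptive kernel is easy to see even in one dimension. A hypothetical sketch using a classic k-nearest-neighbour bandwidth rule as a stand-in for the paper's pre-trained recommender network:

```python
import numpy as np

def adaptive_kde(query, data, k=20):
    """Gaussian KDE whose bandwidth at each data point is its distance to
    the k-th nearest neighbour: narrow kernels in dense regions, wide in
    sparse ones (a simple proxy for a learned per-point kernel choice)."""
    d = np.abs(data[:, None] - data[None, :])
    h = np.sort(d, axis=1)[:, k]                      # per-point bandwidth
    q = np.abs(query[:, None] - data[None, :])
    kern = np.exp(-0.5 * (q / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    return kern.mean(axis=1)

rng = np.random.default_rng(0)
# Mixture with very different local scales: one tight mode, one broad mode.
data = np.concatenate([rng.normal(-2.0, 0.2, 300), rng.normal(2.0, 1.0, 300)])
dens = adaptive_kde(np.array([-2.0, 0.0, 2.0]), data)
```

A single fixed bandwidth must compromise between the two modes, whereas the adaptive rule resolves the sharp mode and the broad mode simultaneously; the paper's contribution is to learn that per-point choice in high dimensions.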

[LG-150] Implicit Behavioral Decoding from Next-Step Spike Forecasts at Population Scale NEURIPS2026

链接: https://arxiv.org/abs/2605.12999
作者: John R. Minnick,Jesus Gonzalez-Ferrer,Kamran Hussain,Jinghui Geng,Ash Robbins,Mohammed A. Mostajo-Radji,David Haussler,Jason Eshraghian,Mircea Teodorescu
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 21 pages, 6 figures, 5 tables; submitted to NeurIPS 2026 Neuroscience Cognitive Science Track

点击查看摘要

Abstract:Closed-loop brain-computer interfaces often require both a forecast of upcoming neural population activity and a readout of the animal’s behavioral state. A single Mamba forecaster, trained only on next-step spike counts at Neuropixels scale, can deliver both in one forward pass. A lightweight per-session linear head reading the model’s predicted rates decodes behavior better than the same linear classifier reading the raw spike counts, under matched temporal context. We test on the Steinmetz visual-discrimination benchmark, which spans 39 sessions, roughly 27,000 neurons, and 1,994 held-out trials. Across three training seeds, Mamba’s predicted rates decode mouse choice at 75.7 \pm 0.2% trial vote, roughly 2.3 times chance level, and stimulus side at 66.1 \pm 0.6%, about twice chance. Compared to a matched 500 ms-context linear decoder on the raw spike counts, Mamba wins at trial vote by 4-6 pp on response and 4-6 pp on stimulus side. A session-start calibration block of about 100-150 trials brings the readout within 1-2 pp of asymptote, and the full pipeline fits inside the 50 ms bin budget on workstation-class GPUs typical of tethered chronic Neuropixels recordings.

[LG-151] SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting NEURIPS2026

链接: https://arxiv.org/abs/2605.12992
作者: John R. Minnick,Jinghui Geng,Kamran Hussain,Jesus Gonzalez-Ferrer,Ash Robbins,Mohammed A. Mostajo-Radji,David Haussler,Jason K. Eshraghian,Mircea Teodorescu
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 26 pages, 4 figures, 12 tables; submitted to NeurIPS 2026 Datasets and Benchmarks Track; processed dataset at this https URL (CC-BY-4.0); code at this https URL

点击查看摘要

Abstract:Neural population models, which predict the joint firing of many simultaneously recorded neurons forward in time, are typically evaluated by a single aggregate Pearson correlation r between predicted and actual spike counts, a number that masks critical structure. We argue that how we evaluate spike forecasting matters as much as what we build, and introduce SpikeProphecy, the first large-scale benchmark for causal, autoregressive spike-count forecasting on real electrophysiology recordings. Our core contribution is a population metric decomposition that separates aggregate performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment. The decomposition surfaces aspects of the underlying data that an aggregate scalar collapses together. We apply the protocol to 105 Neuropixels sessions (Steinmetz 2019 + IBL Repeated Site; ~89,800 neurons) with seven architecture baselines spanning four structural families: four SSMs (three diagonal and one non-diagonal), a Transformer, an LSTM, and a spiking network. The decomposition surfaces a brain-region predictability ranking that reproduces across all seven baselines and survives ANCOVA correction for firing-statistics constraints (region \Delta R^2 = 0.018 above the firing-statistics covariates). It also exposes a sub-Poisson evaluation floor where rigorous metrics combine with genuine biophysical constraints on regular spike trains, and yields a negative result on KL-on-output-rates distillation for ANN-to-SNN transfer in this Poisson count domain.

[LG-152] Coreset-Induced Conditional Velocity Flow Matching

链接: https://arxiv.org/abs/2605.12951
作者: Xiao Wang,Zihua She,Jianxi Su
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose Coreset-Induced Conditional Velocity Flow Matching (CCVFM), a generative model that augments hierarchical rectified flow with a data-informed source distribution. Hierarchical flow matching models the full conditional velocity law in velocity space, but its inner flow is asked to transport isotropic Gaussian noise to a multimodal target velocity distribution from scratch. Our key observation is that this inner source can be replaced by a closed-form surrogate built from a coreset of the target. CCVFM first compresses the target into weighted atoms using an entropic Sinkhorn coreset and lifts them to a Gaussian mixture. The induced conditional velocity law is then a closed-form Gaussian mixture that can be sampled without a learned neural sampler. A lightweight correction flow, trained from this exact surrogate source, then refines the remaining surrogate-to-target residual rather than learning an entire noise-to-data map. We prove that the surrogate transport cost equals the target–surrogate Wasserstein gap under an explicit compression assumption, whereas the noise-source analogue has a dimension-scale lower bound. We further characterize the conditional second moment of the direct surrogate-source training target and show that its source-dependent excess is small when the surrogate conditional law is close to the true conditional velocity law in mean and covariance. Empirically, on MNIST, CIFAR-10, ImageNet-32, and CelebA-HQ, the proposed method reaches competitive few-step generation under matched architectures.

[LG-153] The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

链接: https://arxiv.org/abs/2605.12908
作者: Ryoya Awano,Taiji Suzuki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 48 pages, 1 figure

点击查看摘要

Abstract:Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student’s representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces V_k , and is fine-tuned under the supervision of a weak model specialized on task \kappa . We prove that the strong model efficiently learns task \kappa , eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target’s. Numerical experiments on synthetic data confirm our theoretical results.

[LG-154] Robust Sequential Experimental Design for A/B Testing

链接: https://arxiv.org/abs/2605.12899
作者: Qianglin Wen,Xiangkun Wu,Chengchun Shi,Ting Li,Niansheng Tang,Yingying Zhang,Hongtu Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Experimental design has emerged as a powerful approach for improving the sample efficiency of A/B testing, yet existing designs rely critically on correctly specified models. We study robust sequential experimental design under model misspecification and develop a unified framework that covers both contextual bandit and dynamic settings. Theoretically, we prove that our design bounds the worst-case mean squared error of the estimated treatment effect. Empirically, we demonstrate the effectiveness of the proposed approach using synthetic and real-world datasets from a leading technology company.

[LG-155] Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts

链接: https://arxiv.org/abs/2605.12890
作者: Luxu Liang,Xiang Li
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has made machine-generated text increasingly difficult to distinguish from human-written text. While recent studies explore leveraging internal representations of language models to uncover deeper detection signals, these raw features often exhibit substantial overlap between classes, limiting their discriminative power. To address this challenge, we propose Steer-to-Detect (S2D), a two-stage framework for detecting LLM-generated text. In the first stage, S2D learns a steering vector that is injected into the hidden states of a frozen observer LLM, producing representations with improved class separability. In the second stage, detection is performed via a hypothesis testing procedure based on the steered representations. We establish finite-sample, high-probability guarantees for Type I and Type II errors, providing a theoretical characterization of the procedure. Empirically, S2D achieves strong and consistent performance across a range of settings, including out-of-distribution scenarios and adversarial perturbations.
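The core mechanism, adding a learned vector to frozen hidden states to improve class separability, can be sketched with toy Gaussian "representations". Everything here (the dimensions, the mean-difference steering direction) is a hypothetical stand-in for the vector the paper learns end-to-end inside a frozen observer LLM:

```python
import numpy as np

def steer(hidden, v, alpha=1.0):
    """Inject a steering vector into every hidden state (broadcast add)."""
    return hidden + alpha * v

rng = np.random.default_rng(0)
d = 16
human = rng.standard_normal((200, d))            # toy human-text states
machine = rng.standard_normal((200, d)) + 0.1    # toy LLM-text states

# Toy steering direction: empirical mean difference between the classes.
v = machine.mean(axis=0) - human.mean(axis=0)

sep_raw = np.linalg.norm(machine.mean(axis=0) - human.mean(axis=0))
sep_steered = np.linalg.norm(
    steer(machine, v).mean(axis=0) - human.mean(axis=0))
```

Steering only the machine-side states doubles the empirical mean gap in this toy setup; the paper then feeds the steered representations into a hypothesis test with finite-sample error guarantees.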

[LG-156] Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization

链接: https://arxiv.org/abs/2605.12878
作者: Yaxin Yu,Long Chen,Minfu Feng
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 25 pages, 13 figures

点击查看摘要

Abstract:We propose Adam-SHANG, a Lyapunov-guided Adam-type method that couples momentum, adaptive preconditioning, and a curvature-aware correction through a more stable lagged-preconditioner update. For stochastic smooth convex optimization, we prove convergence in expectation under an admissible stepsize condition that can always be satisfied by a conservative spectral bound, without imposing global monotonicity on the second-moment sequence. To obtain a less conservative practical rule, we introduce a computable trace-ratio stepsize, motivated by a local coordinatewise alignment condition. The same structural update is also tested beyond the convex setting with simplified parameters. Experiments validate the predicted stochastic decay and show competitive training performance against Adam and AdamW on deep learning tasks.
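A rough sketch of the lagged-preconditioner idea (an illustrative Adam variant, not the paper's exact Adam-SHANG rule, which also adds a curvature-aware correction): the step is preconditioned by the second-moment estimate from before the current gradient is folded in.

```python
import numpy as np

def adam_lagged_step(x, g, m, v, lr=0.02, beta1=0.9, beta2=0.99, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g
    x = x - lr * m / (np.sqrt(v) + eps)      # precondition with *lagged* v
    v = beta2 * v + (1.0 - beta2) * g ** 2   # fold the new gradient in after
    return x, m, v

# Minimize f(x) = 0.5 ||x||^2 (gradient g = x) from a distant start.
x = np.array([2.0, -3.0])
m, v = np.zeros(2), np.ones(2)
f0 = 0.5 * float(x @ x)
for _ in range(1000):
    x, m, v = adam_lagged_step(x, x, m, v)
f_final = 0.5 * float(x @ x)
```

Because the preconditioner lags the gradient that triggered the step, the two quantities are decoupled within an iteration, which is the stability property the convergence analysis in the abstract exploits.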

[LG-157] Decision Support for Marketplace Policies under Incomplete Evidence: From Replay to Launch Readiness

链接: https://arxiv.org/abs/2605.12840
作者: Prashant Shekhar,Caroline Howard
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Marketplace platforms routinely evaluate pricing and allocation policies using logged observational data, yet strong offline performance does not imply that a policy is safe to deploy. In real-time bidding (RTB) marketplaces, reserve-price and floor-policy changes affect not only revenue but also fill, advertiser value, budget pacing, and competition across auctions, creating feedback and interference. The central problem is therefore not to estimate whether a policy improves an offline metric, but to determine whether the available evidence justifies direct launch or only further validation. In this regard, we propose a support-aware decision-support system (DSS) that distinguishes promising from actionable evidence. The framework integrates replay, support-aware off-policy evaluation (OPE), conservative lower-bound ranking, multi-sided guardrails, out-of-time validation, sensitivity analysis, and interference-aware validation design into a claim-preserving pipeline that outputs a launch-readiness classification rather than a single performance estimate. Applying the framework to iPinYou-style RTB logs, we identify a margin-gated floor policy as the leading candidate, with a 47.7% replay yield lift, a 45.8% conservative lower-tail lift, and stable out-of-time performance. However, the framework does not recommend direct launch. A decision-rule ablation shows that simplified pipelines select the same policy but incorrectly recommend deployment, leaving key causal assumptions unresolved. In contrast, the proposed DSS selects the same policy but changes the action to online validation, reflecting missing evidence on propensities, bidder response, and interference. Overall, the contribution is a reproducible DSS protocol that prevents decision overclaim under partial identification and converts offline evaluation into an auditable, action-oriented recommendation.

[LG-158] Digital Twins as Synthetic Controls in Single-Arm Trials

链接: https://arxiv.org/abs/2605.12832
作者: Daniele Bertolini,Franklin Fuller,Aaron M. Smith,Jonathan R. Walsh,Run Zhuang
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Single-arm trials are an important study design for evaluating drug efficacy and safety without enrolling patients into a control arm. Although they do not provide the gold-standard evidence of randomized controlled trials, they are increasingly used in clinical development as they offer an efficient, ethical, and practical alternative. A wide variety of approaches can be used to construct control comparators and estimate treatment effects, from fixed comparators informed by clinical knowledge to data-based and model-based patient-level comparators, also known as synthetic controls. Powerful and flexible machine learning models can allow outcome-model-based synthetic controls to overcome key limitations of direct data-based approaches, yield more robust estimates of treatment effects, and provide a principled way to incorporate corrections or encode additional assumptions when external data are not directly comparable. In this work, we argue that outcome-model-based synthetic control arms are an important tool for single-arm trials. We focus on digital twins, personalized predictions of disease progression generated from machine learning models trained on historical datasets, which naturally leverage these flexible approaches. We review doubly robust estimators, present power and sample size formulas, and discuss trade-offs in selecting historical data for training and analysis. We also outline practical considerations for deploying digital twins within the framework of recent FDA draft guidance on the use of artificial intelligence in drug development. Finally, we reanalyze data from trials in amyotrophic lateral sclerosis and Huntington’s disease to demonstrate the proposed methods.

[LG-159] When to Trust Confidence Thresholding: Calibration Diagnostics for Pseudo-Labelled Regression

链接: https://arxiv.org/abs/2605.12780
作者: Marcell T. Kurbucz
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 24 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Calibrated probability outputs of trained classifiers are increasingly used as inputs to downstream regression estimands such as effects, prevalences, or disparities for a latent group observed only on a small labelled subset. A standard practice is to threshold the calibrated score at a confidence cutoff and treat the hard label as the truth. Building on a recent identification result for the underlying moment equation, we develop a calibration-aware diagnostic apparatus for pseudo-labelling pipelines. We derive a closed-form expression for the attenuation bias that confidence thresholding induces in the downstream regression coefficient, and show that the bias can be predicted, before any inference is run, from the residual score variance V^* = \mathbb{E}[\operatorname{Var}(p \mid X)] on the unlabelled set after partialling out the downstream controls X . We further obtain a sharp sensitivity bound under bounded calibration drift, and identify the boundary V^* = 0 , which holds iff p is a deterministic function of X ; this motivates a structural separation between classifier features W and downstream controls X\subsetneq W . Five controlled simulations and a UCI Adult illustration trace the predictions. The contribution is operational: a (V^*, \kappa) decision rule that practitioners can compute from any classifier output to decide whether confidence thresholding is safe.
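The headline phenomenon, attenuation from thresholding a calibrated score, is easy to reproduce. A hypothetical simulation (not the paper's design): the true group G is observed only through a calibrated score p, and regressing the outcome on the hard label 1{p > 0.5} shrinks the coefficient toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 20000, 2.0
p = rng.uniform(0.05, 0.95, n)                 # calibrated scores: E[G|p] = p
G = (rng.uniform(size=n) < p).astype(float)    # true latent group
y = beta * G + rng.standard_normal(n)          # downstream outcome

G_hat = (p > 0.5).astype(float)                # confidence-thresholded label

def slope(x, y):
    xc = x - x.mean()
    return float((xc * y).sum() / (xc * xc).sum())

b_true, b_pseudo = slope(G, y), slope(G_hat, y)
```

The pseudo-label slope lands well below beta = 2.0 even though the scores are perfectly calibrated, because the residual score variance (the paper's V^*) is strictly positive here; that is exactly the predictable attenuation the diagnostic targets.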

[LG-160] ISOMORPH: A Supply Chain Digital Twin for Simulation Dataset Generation and Forecasting Benchmarks

链接: https://arxiv.org/abs/2605.12768
作者: Zhizhen Zhang,Hyemin Gu,Benjamin J. Zhang,Daniel Elenius,Michael Tyrrell,Theo J. Bourdais,Houman Owhadi,Markos A. Katsoulakis,Tuhin Sahai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Open time-series forecasting (TSF) benchmarks cover retail, energy, weather, and traffic, but supply-chain logistics remains underserved. We introduce ISOMORPH, the first public digital twin of a multi-echelon logistics network with fully interpretable, user-configurable parameters and modular topology, demand process, and control rules. The simulator advances a directed routing graph in discrete time: demand arrives at the destination, is served from stock or recorded as backlog, and triggers replenishment through the network. The state vector tracks per-node on-hand inventory with outstanding orders, in-transit shipments, and a smoothed demand estimate, so the dynamics close as a Markov chain on a tractable state space whose transition kernel acts linearly on the empirical distribution of the state. The released data reproduces the bullwhip effect at empirically consistent magnitudes, and three conservation laws encoded in the Markov chain serve as verification tools when users extend the simulator. We release datasets at two catalogue scales (C=50 and C=200) with six scenario sweeps producing 30 additional rollouts and 20 Latin-hypercube perturbations, exhibiting dynamics absent from fixed TSF benchmarks: variance amplification, cascading bottlenecks, regime shifts, and cross-channel coupling through shared macro shocks. Zero-shot evaluation of four foundation models (Chronos, Moirai, TimesFM, Lag-Llama) shows MASE values exceeding public GIFT-Eval references at low-to-moderate horizons, supporting incorporation into existing benchmarks. The same pairing produces forecast confidence bands via Latin-hypercube perturbation of demand-side knobs, a form of forward UQ from parameter uncertainty that is unavailable on standard TSF datasets, demonstrating that foundation models can serve as fast surrogates for the digital twin’s forward UQ. Code (MIT): this https URL.
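The bullwhip effect the abstract mentions falls out of even a one-node version of these dynamics: an order-up-to policy driven by a smoothed demand forecast amplifies demand variance into order variance. This is not the released ISOMORPH simulator, just an illustrative single-echelon sketch with made-up parameter values.

```python
import random
import statistics

random.seed(42)

def simulate_retailer(T=5000, lead=2, alpha=0.3, safety=4.0):
    """Single echelon, order-up-to policy, exponentially smoothed forecast;
    returns (demand variance, order variance)."""
    demands, orders = [], []
    inventory = 40.0          # on-hand minus backlog
    pipeline = [10.0] * lead  # shipments in transit
    d_hat = 10.0              # smoothed demand estimate
    for _ in range(T):
        d = max(0.0, random.gauss(10, 2))
        inventory += pipeline.pop(0)          # receive the oldest shipment
        inventory -= d                        # serve demand (negative = backlog)
        d_hat = alpha * d + (1 - alpha) * d_hat
        target = d_hat * (lead + 1) + safety  # order-up-to level
        order = max(0.0, target - inventory - sum(pipeline))
        pipeline.append(order)
        demands.append(d)
        orders.append(order)
    return statistics.pvariance(demands), statistics.pvariance(orders)

v_demand, v_order = simulate_retailer()
print(v_order / v_demand)  # bullwhip ratio: > 1 means variance amplification
```

Each forecast update shifts the order-up-to level by (lead+1) times the forecast revision, so orders react more violently than demand itself, which is exactly the variance amplification the released datasets are designed to exhibit.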

[LG-161] Yield Curves Dynamics Using Variational Autoencoders Under No-Arbitrage

链接: https://arxiv.org/abs/2605.12764
作者: Fusheng Luo,Hélyette Geman
类目: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This is the full script of our paper, which is awaiting submission to financial journals/conferences

点击查看摘要

Abstract:This paper introduces a physics-informed generative framework that resolves the fundamental conflict between the statistical flexibility of deep learning and the rigorous theoretical constraints of fixed-income modeling. We demonstrate that standard generative models and unconstrained statistical extrapolations suffer from “manifold collapse” and severe arbitrage violations when forecasting term structures across diverse macroeconomic regimes. To overcome this, we propose a two-stage architecture. First, a Student-t Conditional Variational Autoencoder with Dynamic Level Injection (CVAEsT+LS) extracts a robust, heavy-tailed term structure manifold, effectively decoupling macroeconomic shape dynamics from absolute base rates. Second, the latent dynamic evolution is governed by a continuous-time Neural Stochastic Differential Equation (SDE) strictly penalized by a No-Arbitrage Partial Differential Equation (PDE). Empirical results across multiple sovereign currencies (USD, GBP, JPY) confirm that our synergistic approach drastically reduces out-of-sample forecasting errors – achieving an exceptional 6.58 bps Mean Tenor RMSE – and successfully overcomes the massive parallel drift and zero-lower-bound violations exhibited by the classical HJM model in extreme environments. Furthermore, through phase space vector field analysis, we demonstrate the model’s superior capability in unsupervised macroeconomic regime detection and high-quality continuous-time scenario generation. Ultimately, this research provides a highly scalable, mathematically sound evolutionary engine for term structure modeling.

[LG-162] A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

链接: https://arxiv.org/abs/2605.12697
作者: Tomohiro Hayase,Ryo Karakida
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length n, ranging from (\log n)^{1/2} to \log n and (\log n)^2. We provide a general theory showing that the desirable scale is determined by the gap-counting function N_n of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as arising from different N_n and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
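The entropy-collapse behaviour is easy to observe numerically: for a Gibbs distribution p_i ∝ exp(β s_i), entropy is strictly decreasing in β (dH/dβ = -β·Var_p(s) ≤ 0), falling from log n at β = 0 toward zero once β exceeds the critical scale. The i.i.d. Gaussian scores below are an idealized stand-in for one attention row, not the paper's model.

```python
import math
import random

random.seed(3)

def attention_entropy(scores, beta):
    """Entropy (nats) of softmax(beta * scores), computed stably."""
    m = max(scores)
    w = [math.exp(beta * (s - m)) for s in scores]
    z = sum(w)
    return -sum((wi / z) * math.log(wi / z) for wi in w if wi > 0)

n = 4096
scores = [random.gauss(0, 1) for _ in range(n)]  # one idealized attention row
# the three inverse-temperature laws compared in the abstract, plus beta = 0
betas = [0.0, math.sqrt(math.log(n)), math.log(n), math.log(n) ** 2]
entropies = [attention_entropy(scores, b) for b in betas]
print(entropies)  # strictly decreasing, from log(n) toward collapse
```

Which of the candidate scales sits at the concentration threshold depends on the gap structure of the scores; the paper's N_n formalizes exactly that dependence.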

[LG-163] Online Conformal Prediction: Enforcing monotonicity via Online Optimization

链接: https://arxiv.org/abs/2605.12668
作者: Eduardo Ochoa Rivera,Ambuj Tewari
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal prediction provides a principled framework for uncertainty quantification with finite-sample coverage guarantees. While recent work has extended conformal prediction to online and sequential settings, existing methods typically focus on a single coverage level and do not ensure consistency across multiple confidence levels. In many real-world applications, such as weather forecasting, macroeconomic prediction, and risk management, different users operate under heterogeneous risk tolerances and require calibrated uncertainty estimates across a range of coverage levels. In such settings, it is desirable to produce prediction sets corresponding to different coverage levels that are nested and valid simultaneously. In this paper, we propose two novel online conformal prediction methods that output *nested* prediction sets across a range of coverage levels, enabling simultaneous uncertainty quantification across the entire risk spectrum. Beyond interpretability, jointly estimating multiple coverage levels is known to improve statistical efficiency in classical quantile regression by enforcing non-crossing constraints and sharing information across quantiles. Our approaches leverage an online optimization perspective with small regret that translates to quantile estimation error control while enforcing nestedness of prediction sets. Empirical results on synthetic and real-world datasets, including applications in forecasting tasks with heterogeneous risk requirements, demonstrate that our method achieves stable coverage across all levels, strictly nested prediction sets, and improved efficiency compared to existing online conformal baselines.
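The flavour of the approach can be sketched with online quantile tracking at two coverage levels, re-sorting the interval radii after every update so the sets stay nested. This is an illustrative simplification (adaptive-conformal-style pinball-gradient updates), not the paper's algorithm, and all parameter values are hypothetical.

```python
import random

random.seed(7)

def nested_online_conformal(stream, alphas=(0.1, 0.3), lr=0.05):
    """Track one interval radius per coverage level 1 - alpha via online
    pinball-loss gradient steps, sorting so smaller alpha gets a wider set."""
    radii = [1.0] * len(alphas)
    hits = [0] * len(alphas)
    for y_pred, y in stream:
        score = abs(y - y_pred)  # nonconformity score
        for k, a in enumerate(alphas):
            hits[k] += score <= radii[k]
            radii[k] += lr * ((score > radii[k]) - a)  # pinball gradient step
        radii.sort(reverse=True)  # enforce nestedness across levels
    T = len(stream)
    return radii, [h / T for h in hits]

T = 20000
stream = [(0.0, random.gauss(0, 1)) for _ in range(T)]
radii, coverage = nested_online_conformal(stream)
print(radii, coverage)
```

The update drives each radius toward the corresponding score quantile, so empirical coverage settles near 90% and 70% while the sort guarantees the 90% set always contains the 70% set.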

[LG-164] Recurrent Transformer-Based Near- and Far-Field THz Wideband Channel Estimation for UM-MIMO

链接: https://arxiv.org/abs/2605.12578
作者: Dmitry Artemasov(1),Alexander Shmatok(1),Kirill Andreev(1),Alexey Frolov(1),Manjesh K. Hanawal(2),Nikola Zlatanov(3) ((1) Center for Next Generation Wireless and IoT, Skolkovo Institute of Science and Technology, Moscow, Russia, (2) Department of IEOR, Indian Institute of Technology Bombay, India, (3) Faculty of Computer and Engineering Sciences, Innopolis University, Innopolis, Russia)
类目: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 15 pages, 15 figures

点击查看摘要

Abstract:The integration of terahertz communications and ultra-massive multiple-input multiple-output (UM-MIMO) systems in 6G networks is motivated by their ability to enable unprecedented data rates, mitigate spectrum congestion, and enhance overall network performance. However, the enlarged antenna apertures and higher carrier frequencies in these systems increase the Rayleigh distance, causing users to span both the near-field and conventional far-field regions. Accurate spatial precoding thus requires exact channel estimation at the base station - a task made more challenging by the hybrid coexistence of near- and far-field effects and the limited number of digital chains available in hybrid beamforming architectures. In this paper, we propose a block recurrent transformer model to address this challenge. We demonstrate that a single transformer block equipped with state memory can be trained once and then iteratively applied for hybrid-field channel estimation. Furthermore, we train the model such that it generalizes to wireless channels with varying scatterer distances, different numbers of propagation paths, and wideband operation. Simulation results show that the proposed method achieves performance gains of approximately 5 dB and 7.5 dB in normalized mean squared error (NMSE) over state-of-the-art solutions in narrowband and wideband scenarios, respectively.
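For readers unfamiliar with the reported metric: NMSE in dB normalizes the estimation error power by the channel power, so a 5 dB gain corresponds to cutting error power by a factor of 10^0.5 ≈ 3.16. The toy channel vector below is hypothetical, purely to show the computation.

```python
import math

def nmse_db(h_true, h_est):
    """Normalized mean squared error in dB between a true channel vector and
    its estimate (abs() handles complex entries)."""
    err = sum(abs(t - e) ** 2 for t, e in zip(h_true, h_est))
    ref = sum(abs(t) ** 2 for t in h_true)
    return 10 * math.log10(err / ref)

# Two estimates of the same toy channel: 10% vs 2% scaling error.
h = [1 + 1j, 0.5 - 0.2j, -0.3 + 0.8j]
h_est_a = [x * 1.1 for x in h]
h_est_b = [x * 1.02 for x in h]
print(nmse_db(h, h_est_a), nmse_db(h, h_est_b))
```

For a pure scaling error c, NMSE reduces to 20·log10|c - 1|, which makes the dB numbers easy to sanity-check.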

[LG-165] On Privacy-Preserving Image Transmission in Low-Altitude Networks: A Swin Transformer-Based Framework with Federated Learning

链接: https://arxiv.org/abs/2605.12566
作者: Kexin Zhang,Lixin Li,Yuna Yan,Xin Zhang,Wensheng Lin,Rui Li,Dongwei Zhao,Zhu Han
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 13 pages, 10 figures, 2 tables

点击查看摘要

Abstract:The rapid development of low-altitude economy has driven the proliferation of Unmanned Aerial Vehicle (UAV) applications, including logistics, inspection, and emergency response. However, transmitting high-volume image data from UAVs to ground stations faces significant challenges due to limited bandwidth and stringent privacy requirements. To address these issues, a Semantic Communication (SC) framework based on Federated Learning (FL) is proposed for efficient and privacy-preserving image transmission. A Swin Transformer-based Semantic Communication (STSC) architecture is designed to extract multi-scale semantic features under constrained bandwidth conditions. Dedicated communication and computing nodes are deployed on UAVs to enhance real-time coverage and flexibility. Meanwhile, a FL mechanism enables global model training across distributed devices without sharing raw data, thus preserving user privacy. Simulation experiments conducted on the CIFAR-10 dataset demonstrate that the proposed STSC framework achieves at least 5.7 dB improvement in Peak Signal-to-Noise Ratio (PSNR) compared to DeepJSCC baselines, while also showing superior convergence and generalization performance. The framework effectively integrates UAV-assisted deployment with SC and privacy protection, offering a practical solution for bandwidth-constrained image transmission in low-altitude networks.
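For context on the reported gain: PSNR compares peak signal power to reconstruction MSE on a log scale, so the claimed ≥5.7 dB improvement means reconstruction MSE drops by a factor of more than 10^0.57 ≈ 3.7. The tiny example images below are hypothetical, only to show the formula.

```python
import math

def psnr_db(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    return 10 * math.log10(max_val ** 2 / mse)

ref = [0.0, 64.0, 128.0, 255.0]
noisy = [1.0, 63.0, 129.0, 254.0]  # off by one intensity level everywhere
print(psnr_db(ref, noisy))
```

An everywhere-off-by-one 8-bit image gives MSE = 1 and hence PSNR = 10·log10(255²) ≈ 48.13 dB, a useful mental anchor when reading reconstruction tables.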

[LG-166] The Payment Heterogeneity Index: An Integrated Unsupervised Framework for High-Volume Procurement Oversight and Decision Support

链接: https://arxiv.org/abs/2605.12547
作者: Kyriakos Christodoulides
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Public procurement is vulnerable to error, fraud and corruption, yet high transaction volumes overwhelm oversight. While research often focuses on tender-stage anomalies, post-award payments remain underexplored. Since labelled datasets are rare and existing methods such as Benford’s Law face restrictive assumptions, there is a need for additional interpretable, unsupervised frameworks that augment oversight and simplify management. This paper introduces the Structural Heterogeneity Index (SHI), a composite statistic for one-dimensional samples defined by four components: modality, asymmetry, tail behaviour, and structural dispersion. The Payment Heterogeneity Index (PHI) is its multiplicative instance for post-award payments. PHI combines a tail-behaviour component, sensitive to outliers and point clustering, with a structural-dispersion component summarising payment regime architecture. Structural dispersion is computed via Gaussian Mixture Model (GMM) estimation, integrating within-regime variability, prevalence, and separation from the dominant mode. Applied to UK municipal procurement data, PHI isolates a financially significant cohort (10.1% of high-volume suppliers) whose structural signatures deviate from the population and interact with recurring payment anchors. Permutation and Kolmogorov-Smirnov tests confirm that high-PHI suppliers exhibit statistically significant structural differences. A forensic review by a Certified Fraud Examiner supports the plausibility of the prioritised cases. Comparison shows PHI uniquely identifies regime separation obscured by metrics like the Coefficient of Variation (\rho=0.310). PHI functions as an effective discovery tool where no confirmed labels exist, offering a transparent, lightweight screening mechanism for post-award oversight.
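The abstract's point that the Coefficient of Variation can obscure regime separation is easy to demonstrate: a one-regime supplier and a two-regime supplier can have nearly identical CV while having completely different payment architecture. This is a toy illustration, not the paper's PHI; the 900-1100 "between-regimes" band and all distribution parameters are hypothetical.

```python
import random
import statistics

random.seed(11)

def coeff_variation(xs):
    return statistics.pstdev(xs) / statistics.mean(xs)

def mid_band_share(xs, lo=900.0, hi=1100.0):
    """Fraction of payments falling between the two candidate regimes;
    near zero signals well-separated payment regimes."""
    return sum(lo < x < hi for x in xs) / len(xs)

n = 4000
# One-regime supplier vs a two-regime supplier, matched mean and overall spread.
uni = [random.gauss(1000, 400) for _ in range(n)]
bim = [random.gauss(600, 50) if random.random() < 0.5 else random.gauss(1400, 50)
       for _ in range(n)]

print(coeff_variation(uni), coeff_variation(bim))
print(mid_band_share(uni), mid_band_share(bim))
```

Both suppliers show CV ≈ 0.4, yet the bimodal one has an essentially empty middle band; a structure-aware statistic (in the paper, GMM-based structural dispersion) is needed to tell them apart.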

[LG-167] Earth Science Foundation Models: From Perception to Reasoning and Discovery

链接: https://arxiv.org/abs/2605.12542
作者: Xiangyu Zhao,Bo Liu,Yuehan Zhang,Zelin Song,Wanghan Xu,Feng Liu,Fengxiang Wang,Ben Fei,Fenghua Ling,Wangxu Wei,Wenlong Zhang,Xiao-Ming Wu
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large foundation models (FMs) are transforming Earth science by integrating heterogeneous multimodal data, such as multi-platform imagery, gridded reanalysis data, diverse geophysical and geochemical observations, and domain-specific text, to support tasks ranging from basic perception to advanced scientific discovery. This paper provides a unified review of Earth science foundation models (Earth FMs) through two complementary dimensions: depth, which traces the evolution of model capabilities from perception to multimodal reasoning and agentic scientific workflows, and breadth, which summarizes their expanding applications across the atmosphere, hydrosphere, lithosphere, biosphere, anthroposphere, and cryosphere, as well as coupled Earth system processes. Using this framework, we review representative multimodal Earth foundation models and compile more than 200 datasets and benchmarks spanning diverse Earth science tasks and modalities. We further discuss key challenges in multimodal data heterogeneity, scientific reliability and continual updating, scalability and sustainability, and the transition from foundation models to agentic and embodied Earth intelligence, and outline future directions toward more integrated, trustworthy, and actionable AI Earth scientists. Overall, this paper offers a structured roadmap for understanding the development of Earth foundation models from both capability depth and application breadth.

附件下载

点击下载今日全部论文列表