This post is an automatically updated list of the latest papers fetched from arxiv.org on 2026-05-08, grouped into six main areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from arxiv.org and refreshed automatically every day at around 12:30.

Tip: if a given day's list is not updated on time, either arXiv released no new papers that day or the update script failed; failures are usually fixed the same day.

Table of Contents

Overview (2026-05-08)

917 papers were updated today, including:

  • Natural Language Processing: 118 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 356 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 148 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 349 papers (Machine Learning, cs.LG)
  • Multi-Agent Systems: 17 papers (Multiagent Systems, cs.MA)
  • Information Retrieval: 26 papers (Information Retrieval, cs.IR)
  • Human-Computer Interaction: 30 papers (Human-Computer Interaction, cs.HC)

Multi-Agent Systems

[MA-0] Recursive Agent Optimization

TL;DR: This paper targets the limitations of single-agent models on long-context or hard tasks: insufficient context windows, bounded reasoning, and poor scaling to more difficult problems. The key idea is Recursive Agent Optimization (RAO), a reinforcement-learning method that trains agents to recursively spawn and delegate sub-tasks to fresh instances of themselves, enabling divide-and-conquer scaling at inference time. By learning when and how to decompose tasks and communicate, RAO improves training efficiency, generalizes to harder tasks, handles tasks beyond the original context length, and reduces wall-clock time.

Link: https://arxiv.org/abs/2605.06639
Authors: Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, Graham Neubig
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents when and how to delegate and communicate. We find that recursive agents trained in this way enjoy better training efficiency, can scale to tasks that go beyond the model’s context window, generalize to tasks much harder than the ones the agent was trained on, and can enjoy reduced wall-clock time compared to single-agent systems.
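The recursion pattern the abstract describes, an agent that delegates halves of an over-long task to fresh copies of itself, can be sketched in a few lines. This is a toy illustration with hypothetical names, not the paper's RAO training code:

```python
def recursive_agent(task, solve_leaf, split, max_len=8):
    """Toy divide-and-conquer agent (hypothetical; not the paper's RAO code).

    A task that fits in the 'context window' (max_len items) is solved
    directly; anything longer is split and delegated to two fresh
    instances of the same agent, and the sub-results are combined.
    """
    if len(task) <= max_len:
        return solve_leaf(task)          # base case: fits in context
    left, right = split(task)            # delegate sub-tasks recursively
    return recursive_agent(left, solve_leaf, split, max_len) + \
           recursive_agent(right, solve_leaf, split, max_len)

# Example: summing a list far longer than any single "context window".
halve = lambda t: (t[:len(t) // 2], t[len(t) // 2:])
total = recursive_agent(list(range(100)), sum, halve, max_len=8)
```

RAO's contribution is then learning, via reinforcement learning, when such a split pays off and what parent and child instances should communicate.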

[MA-1] Cross-Modal Navigation with Multi-Agent Reinforcement Learning

TL;DR: Robot navigation in complex environments depends on multi-modal sensing, but high-quality, well-aligned multi-modal data is hard to obtain, and a single monolithic model is hard to train because rich inputs induce complex representations and a bloated policy space. The proposed CRONA framework addresses this with multi-agent reinforcement learning (MARL): lightweight modality-specialized agents collaborate for scalable cross-modal navigation, aided by control-relevant auxiliary beliefs for coordination and a centralized multi-modal critic with global state for policy learning, preserving each modality's strengths while improving navigation performance and efficiency.

Link: https://arxiv.org/abs/2605.06595
Authors: Shuo Liu, Xinzichen Li, Christopher Amato
Affiliations: Northeastern University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substantially enlarge the policy space. Cross-modal collaboration among lightweight modality-specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of each modality. In this paper, we propose CRONA, a Multi-Agent Reinforcement Learning (MARL) framework for Cross-Modal Navigation. CRONA improves collaboration by leveraging control-relevant auxiliary beliefs and a centralized multi-modal critic with global state. Experiments on visual-acoustic navigation tasks show that multi-agent methods significantly improve performance and efficiency over single-agent baselines. We find that homogeneous collaboration with limited modalities is sufficient for short-range navigation under salient cues; heterogeneous collaboration among agents with complementary modalities is generally efficient and effective; and navigation in large, complex environments requires both richer multi-modal perception and increased model capacity.

[MA-2] Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning

TL;DR: Cooperative multi-agent reinforcement learning (MARL) benchmarks typically report only aggregate metrics (return, success rate, completion time) and ignore how agents actually coordinate, which becomes opaque as agents, tasks, and joint assignment choices scale combinatorially. The paper proposes a coordination-aware evaluation perspective, instantiated with STAT, a controlled commitment-constrained spatial task-allocation testbed that systematically varies agents, tasks, and environment size while fixing observation access and task rules, enabling process-level diagnostics of six representative value-based MARL methods. Results show that similar return curves can mask distinct coordination mechanisms, differing in redundant assignment, assignment diversity, and task-completion efficiency; performance under scale is shaped not only by nominal action-space size but also by assignment pressure, sparse decision opportunities, and redundant choices among interdependent agents, motivating coordination-aware evaluation as a necessary complement to return-based benchmarking.

Link: https://arxiv.org/abs/2605.06557
Authors: Maria Ana Cardei, Matthew Landers, Afsaneh Doryab
Affiliations: University of Virginia
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 27 pages. Submitted and under review

Abstract:Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, particularly in settings where agents, tasks, and joint assignment choices scale combinatorially. We propose a coordination-aware evaluation perspective that supplements return with process-level diagnostics. We instantiate this perspective using STAT, a controlled commitment-constrained spatial task-allocation testbed that systematically varies agents, tasks, and environment size while holding observation access and task rules fixed. We evaluate six representative value-based MARL methods across varying levels of centralization. Our results show that similar return trends can reflect distinct coordination mechanisms, including differences in redundant assignment, assignment diversity, and task-completion efficiency. We find that in commitment-constrained task allocation, performance under scale is shaped not only by nominal action-space size, but also by assignment pressure, sparse decision opportunities, and redundant choices among interdependent agents. Our findings motivate coordination-aware evaluation as a necessary complement to return-based benchmarking for cooperative MARL.

[MA-3] Sustaining Cooperation in Populations Guided by AI: A Folk Theorem for LLMs

TL;DR: The paper studies how cooperation can be sustained among agents who are guided by shared large language models (LLMs) despite misaligned individual incentives. Because an LLM advises many clients, it indirectly shapes their strategies, inducing a client-mediated meta-game among the LLMs themselves, where classical cooperation mechanisms do not directly apply. The key result is a folk theorem for LLMs: even though clients cannot identify which LLM advised their opponents, all feasible and individually rational outcomes can be sustained as ε-equilibria. The result does not follow from the standard folk theorem and requires new proof techniques, showing that shared LLM guidance can sustain cooperation across populations of agents without direct communication or identification.

Link: https://arxiv.org/abs/2605.06525
Authors: Jonathan Shaki, Eden Hartman, Sarit Kraus, Yonatan Aumann
Affiliations: Bar-Ilan University
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
Comments:

Abstract:Large language models (LLMs) are increasingly used to provide instructions to many agents who interact with one another. Such shared reliance couples agents who appear to act independently: they may in fact be guided by a common model. This coupling can change the prospects for cooperation among agents with misaligned incentives. We study settings in which multiple LLMs each advise a population of clients who participate in instances of an underlying game, creating strategic interaction at the level of the LLMs themselves. This induces a meta-game among the LLMs, mediated through clients. We first analyze the one-shot setting, where shared instructions can change equilibrium behavior only when an LLM may influence more than one role in the same interaction; in such cases, cooperation may emerge, and the effect of client share can be beneficial, harmful, or non-monotone, depending on the base game. Our main result concerns the repeated setting. We prove a folk theorem for LLMs: despite indirect observation and the clients’ inability to identify which LLM advised their opponents, all feasible and individually rational outcomes can be sustained as ε-equilibria. The result does not follow from the standard folk theorem and requires new proof techniques. Together, these results show that shared LLM guidance can sustain cooperation among populations of agents even when the underlying incentives are misaligned.

[MA-4] Optimizing Social Utility in Sequential Experiments

TL;DR: In high-stakes domains such as drug development, the cost of large randomized controlled trials (RCTs) can deter developers who lack certainty about their product's efficacy, stifling high-social-value "moonshot" products. The proposed statistical protocol has the developer (the agent) run the RCT sequentially while the regulator (the principal) partially subsidizes its cost. Modeling the protocol as a belief Markov decision process makes the agent's optimal strategy efficiently solvable by dynamic programming, and because social utility is a piecewise linear and convex function of the subsidy level, the socially optimal subsidy can be found quickly by divide-and-conquer. Simulations show social-utility gains of more than 35% over standard, non-sequential protocols.

Link: https://arxiv.org/abs/2605.06520
Authors: Ander Artola Velasco, Stratis Tsirtsis, Manuel Gomez-Rodriguez
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Methodology (stat.ME)
Comments:

Abstract:Regulatory approval of products in high-stakes domains such as drug development requires statistical evidence of safety and efficacy through large-scale randomized controlled trials. However, the high financial cost of these trials may deter developers who lack absolute certainty in their product’s efficacy, ultimately stifling the development of 'moonshot' products that could offer high social utility. To address this inefficiency, in this paper, we introduce a statistical protocol for experimentation where the product developer (the agent) conducts a randomized controlled trial sequentially and the regulator (the principal) partially subsidizes its cost. By modeling the protocol using a belief Markov decision process, we show that the agent’s optimal strategy can be found efficiently using dynamic programming. Further, we show that the social utility is a piecewise linear and convex function over the subsidy level the principal selects, and thus the socially optimal subsidy can also be found efficiently using divide-and-conquer. Simulation experiments using publicly available data on antibiotic development and approval demonstrate that our statistical protocol can be used to increase social utility by more than 35% relative to standard, non-sequential protocols.
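The abstract's claim that a piecewise linear, convex utility curve admits an efficient divide-and-conquer search can be illustrated with plain ternary search over the subsidy level. This is a generic sketch under that convexity assumption, not the paper's algorithm:

```python
def ternary_search(f, lo, hi, iters=100):
    """Divide-and-conquer extremum search for a convex function on [lo, hi].

    Each round discards the third of the interval that provably cannot
    contain the minimizer, so the bracket shrinks geometrically.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2                      # minimizer lies left of m2
        else:
            lo = m1                      # minimizer lies right of m1
    return (lo + hi) / 2

# A piecewise linear convex objective with a kink at 0.3 (stand-in for
# negated social utility as a function of the subsidy level):
subsidy = ternary_search(lambda s: abs(s - 0.3), 0.0, 1.0)
```

Because each iteration keeps two thirds of the bracket, 100 iterations shrink it far below floating-point resolution, so the returned midpoint sits essentially at the kink.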

[MA-5] AgenticPrecoding: LLM-Empowered Multi-Agent System for Precoding Optimization

TL;DR: Existing precoding methods are designed for specific system models, objectives, and constraint sets, limiting their adaptability to the heterogeneous and dynamic scenarios expected in 6G. AgenticPrecoding automates end-to-end precoding derivation through four coordinated stages (problem formulation, solver selection, prompt upsampling, and code generation): two LoRA-tuned reasoning agents inject precoding-specific domain knowledge for formulation and solver selection, two general-purpose LLMs handle prompt refinement and executable code generation, and a feedback-driven refinement loop iteratively improves code executability, constraint feasibility, and solution quality, yielding markedly better cross-scenario adaptability.

Link: https://arxiv.org/abs/2605.06443
Authors: Zijiu Yang, Zixiang Zhang, Shunpu Tang, Qianqian Yang, Zhiguo Shi
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments:

Abstract:Precoding is a key technique for interference management and performance improvement in multi-antenna wireless systems. However, existing precoding methods are typically developed for specific system models, objectives, and constraint sets, which limits their adaptability to the heterogeneous and evolving scenarios expected in future 6G networks. To address this limitation, we propose AgenticPrecoding, a universal multi-agent framework that automates end-to-end precoding derivation directly from user-level communication requirements. Specifically, AgenticPrecoding decomposes the derivation process into four coordinated stages: problem formulation, solver selection, prompt upsampling, and code generation, assigning each stage to a specialized agent tailored to its specific reasoning demands. We employ two LoRA-adapted reasoning agents to inject precoding-specific domain knowledge for problem formulation and solver selection, while two general-purpose Large Language Models (LLMs) handle prompt refinement and executable code generation. Furthermore, a feedback-driven refinement mechanism is incorporated to enhance code executability, constraint feasibility, and solution quality. Extensive experiments across 10 representative precoding scenarios demonstrate that AgenticPrecoding achieves superior cross-scenario adaptability compared to conventional optimization-based and LLM-based baselines.

[MA-6] Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics

TL;DR: The paper addresses Nash-equilibrium learning in partially observable Markov games (POMGs), where partial observability makes the sample and computational complexity of multi-agent reinforcement learning scale exponentially in the number of players. The key ideas: first, assume the POMG has independent state transitions and that the underlying fully observed Markov game is a Markov potential game, and under this structure design an independent learning algorithm that requires no communication, only each player's own action-observation history; second, under a filter stability assumption, show that policies over finite history windows provide sufficient approximation guarantees, so the POMG can be approximated by a near-potential surrogate Markov game, yielding quasi-polynomial sample and computational complexity for Nash equilibrium learning.

Link: https://arxiv.org/abs/2605.06377
Authors: Philip Jordan, Maryam Kamgarpour
Affiliations: EPFL
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:We study Nash equilibrium learning in partially observable Markov games (POMGs), a multi-agent reinforcement learning framework in which agents cannot fully observe the underlying state. Prior work in this setting relies on centralization or information sharing, and suffers from sample and computational complexity that scales exponentially in the number of players. We focus on a subclass of POMGs with independent state transitions, where agents remain coupled through their rewards, and assume that the underlying fully observed Markov game is a Markov potential game. For this class, we present an independent learning algorithm in which players, observing only their own actions and observations and without communication, jointly converge to an approximate Nash equilibrium. Due to partial observability, optimal policies may in general depend on the full action-observation history. Under a filter stability assumption, we show that policies based on finite history windows provide sufficient approximation guarantees. This enables us to approximate the POMG by a surrogate Markov game that is near-potential, leading to quasi-polynomial sample and computational complexity for independent Nash equilibrium learning in the underlying POMG.

[MA-7] From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

TL;DR: LLM systems deployed as agentic workflows rely on implicit conversational state, which makes it hard to preserve stable work products, isolate unrelated updates, or propagate changes through intermediate artifacts. The core idea is execution lineage: an execution model that represents AI-native work as a directed acyclic graph (DAG) of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay. This lets the system control precisely what should change under a modification and what should remain stable, keeping state consistent and traceable across revisions and significantly outperforming loop-centric update baselines.

Link: https://arxiv.org/abs/2605.06365
Authors: Josh Rosen, Seth Rosen
Affiliations: ThruWire, Inc.
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 16 pages, 1 figure

Abstract:Large language model systems are increasingly deployed as agentic workflows that interleave reasoning, tool use, memory, and iterative refinement. These systems are effective at producing answers, but they often rely on implicit conversational state, making it difficult to preserve stable work products, isolate irrelevant updates, or propagate changes through intermediate artifacts. We introduce execution lineage: an execution model in which AI-native work is represented as a directed acyclic graph (DAG) of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay. The goal is not to make the model a better one-shot writer, but to make evolving AI-generated work maintainable under change. We compare execution-lineage replay against loop-centric update baselines on two controlled policy-memo update tasks. In an unrelated-branch update, DAG replay preserved the final memo exactly in all runs, with zero churn and zero unrelated-branch contamination, while loop baselines regenerated the memo and frequently imported unrelated context. In an intermediate-artifact edit, all systems reflected the new constraint in the final memo, but only DAG replay achieved perfect upstream preservation, downstream propagation, unaffected-artifact preservation, and cross-artifact consistency. These results show that final answer quality and maintained-state quality are distinct. Strong loop baselines can remain competitive at producing polished final outputs when the task is a bounded synthesis/update problem and all current sources fit in context, but immediate task success can mask partial state inconsistency that may compound over future revisions. Execution lineage provides stronger guarantees about what should change, what should remain stable, and how work evolves across revisions. 
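A minimal sketch of the DAG-plus-identity-based-replay idea described above (a hypothetical API; the paper's actual system is far richer):

```python
import hashlib

class Node:
    """One artifact-producing computation in a toy execution-lineage DAG."""

    def __init__(self, fn, parents=()):
        self.fn, self.parents = fn, list(parents)

    def identity(self):
        # Identity = hash of this node's code plus its parents' identities,
        # so a change anywhere upstream changes all identities downstream.
        h = hashlib.sha256(self.fn.__code__.co_code)
        for p in self.parents:
            h.update(p.identity().encode())
        return h.hexdigest()

    def replay(self, cache):
        # Identity-based replay: recompute only nodes whose lineage changed;
        # everything else is served verbatim from the cache.
        key = self.identity()
        if key not in cache:
            cache[key] = self.fn(*(p.replay(cache) for p in self.parents))
        return cache[key]

source = Node(lambda: "draft v1")
memo = Node(lambda text: text.upper(), parents=[source])
cache = {}
first = memo.replay(cache)    # computes both nodes
second = memo.replay(cache)   # served from cache, zero recomputation
```

The point of the identity hash is exactly the preservation property the abstract measures: an edit to an unrelated branch leaves these identities, and therefore the cached artifacts, untouched.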

[MA-8] Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs

TL;DR: Existing LLM-team coordination sits at two extremes: highly structured methods (fixed roles or a-priori task decompositions) sacrifice flexibility, while fully unstructured teams suffer error propagation, inter-agent conflicts, and wasted resources. Language Agent Teams for Task Evolution (LATTE), inspired by distributed systems, has agents collaboratively build and maintain a shared, evolving coordination graph that encodes sub-task dependencies, agent assignments, and progress. The protocol keeps state consistent while letting agents dynamically allocate work, adapt coordination, and discover new tasks, reducing token usage, wall-clock time, communication, and coordination failures while matching or exceeding task accuracy.

Link: https://arxiv.org/abs/2605.06320
Authors: Elizabeth Mieczkowski, Alexander Ku, Tiwalayo Eisape, Dilip Arumugam, John Matters, Katherine M. Collins, Ilia Sucholutsky, Thomas L. Griffiths
Affiliations: Princeton University; University of Cambridge; MIT; New York University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations). We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems, where processors must operate under partial observability and communication constraints. In LATTE, a team of agents collaboratively construct and maintain a shared, evolving coordination graph which encodes sub-task dependencies, individual agent assignment, and the current state of sub-task progress. This protocol maintains consistency while empowering agents to dynamically allocate work, adapt coordination, and discover new tasks. Across multiple collaborative tasks and a variety of base models, we demonstrate how LATTE reduces token usage, wall-clock time, communication, and coordination failures (e.g. file conflicts and redundant outputs) while matching or exceeding the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.
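The shared coordination graph the abstract describes can be pictured as a small dependency-tracking structure. This is an illustrative sketch; LATTE's actual protocol and encoding may differ:

```python
class CoordinationGraph:
    """Toy shared task graph: sub-tasks, dependencies, ownership, status."""

    def __init__(self):
        self.tasks = {}  # name -> {"deps": [...], "status": ..., "owner": ...}

    def add(self, name, deps=()):
        self.tasks[name] = {"deps": list(deps), "status": "open", "owner": None}

    def claim(self, name, agent):
        # An agent may claim a task only if it is unclaimed and all of its
        # dependencies are done; this is what prevents redundant or
        # conflicting work without a central leader.
        t = self.tasks[name]
        if t["status"] == "open" and all(
            self.tasks[d]["status"] == "done" for d in t["deps"]
        ):
            t["status"], t["owner"] = "claimed", agent
            return True
        return False

    def complete(self, name):
        self.tasks[name]["status"] = "done"

g = CoordinationGraph()
g.add("collect_data")
g.add("write_report", deps=["collect_data"])
blocked = g.claim("write_report", "agent_B")  # dependency not done yet
g.claim("collect_data", "agent_A")
g.complete("collect_data")
ready = g.claim("write_report", "agent_B")    # dependency now satisfied
```

In LATTE the graph itself also evolves: agents can add newly discovered sub-tasks, which is why the paper calls it a shared, evolving structure rather than a static decomposition.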

[MA-9] Power-Efficiency and Scalability Analysis of Magnetically-Actuated Satellite Swarms via Convex Optimization

TL;DR: This paper tackles power-efficient formation keeping for large magnetically actuated satellite swarms that maintain virtual antenna apertures amid unstable orbital dynamics. The difficulty is that the nonlinear electromagnetic force and torque model makes the power-consumption constraint nonconvex, which hinders system-level configuration analysis. The key contribution is a convex-optimization-based evaluation framework that recasts the nonconvex power constraint into a tractable convex form, enabling quantitative analysis of formation-keeping efficiency for large swarms; the analysis shows that increasing the number of satellites can improve power efficiency, offering a low-power alternative for constructing large-scale space systems.

Link: https://arxiv.org/abs/2605.06286
Authors: Yuta Takahashi, Seang Shim, Hiraku Sakamoto, Shin-ichiro Sakai
Affiliations: Institute of Science Tokyo; The Graduate University for Advanced Studies; Institute of Space and Astronautical Science
Subjects: Multiagent Systems (cs.MA)
Comments: Submitted to IEEE Transactions on Aerospace and Electronic Systems (Correspondence)

Abstract:This correspondence presents a convex-optimization-based evaluation framework of satellite-swarm-based apertures maintained by magnetic-field interactions. Spaceborne distributed apertures are composed of multiple satellites and are attractive for scientific and commercial missions because their scalability enables high-gain, narrow-beam, and large-aperture capabilities beyond the launch-size limitations. A key challenge is that the long-term maintenance of such virtual structures requires consistent formation control amid unstable orbital dynamics, and magnetic interactions generated by satellite-mounted magnetorquers offer a desirable propellant-free position-control strategy. However, the nonlinearities of the electromagnetic force and torque model lead to a nonconvex power-consumption constraint, making system-level configuration analysis difficult. To address this issue, we develop a convex optimization-based framework to analyze the power consumption of large magnetically actuated satellite swarms. The resulting analysis shows that increasing the number of satellites can improve formation-keeping power efficiency. This indicates that magnetically actuated swarm architectures provide a power-efficient alternative to the conventional few-satellite electromagnetic formation-flight concept for constructing large-scale space systems.

[MA-10] Multiagent Stochastic Shortest Path Problem IJCAI2026

TL;DR: The paper introduces the multi-agent stochastic shortest path (MSSP) problem, in which k agents, acting autonomously or under coordination, aim to minimize the expected time for any one agent to reach a target state. The core challenge is balancing strategy complexity against computational cost in both settings; the key contribution is efficient strategy-synthesis algorithms that scale with instance size and are experimentally shown to outperform natural baselines.

Link: https://arxiv.org/abs/2605.06056
Authors: Martin Jonáš, Antonín Kučera, Vojtěch Kůr, Jan Mačák, Vojtěch Řehák
Affiliations: Masaryk University
Subjects: Multiagent Systems (cs.MA)
Comments: Full version of a paper presented at IJCAI 2026

Abstract:We introduce and study the multi-agent stochastic shortest path (MSSP) problem, in which k agents strive to reach a target state, aiming to minimize the expected time to reach the target by any agent. We analyze the computational and strategy-complexity of the problem in both autonomous and coordinated settings, and we design efficient strategy-synthesis algorithms. The algorithms are experimentally evaluated on instances of increasing size against natural baselines.
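The objective (minimize the expected time until *any* of the k agents reaches the target) can be made concrete with a tiny Monte Carlo experiment on a line graph. This is an illustrative toy of the first-arrival objective, not the paper's synthesis algorithm:

```python
import random

def expected_first_arrival(k, n=5, p=0.5, episodes=3000, seed=0):
    """Estimate the expected time until ANY of k independent agents,
    each starting at 0 and moving +1 with probability p per step,
    first reaches position n on a line."""
    rng = random.Random(seed)
    total = 0
    for _ in range(episodes):
        pos, t = [0] * k, 0
        while max(pos) < n:   # stop as soon as ANY agent hits the target
            t += 1
            for i in range(k):
                if rng.random() < p:
                    pos[i] += 1
        total += t
    return total / episodes

solo = expected_first_arrival(k=1)  # one agent: roughly n / p = 10 steps
team = expected_first_arrival(k=4)  # four agents: only the fastest counts
```

Even without coordination, adding agents shrinks the expected first-arrival time, since it is the minimum of the agents' individual hitting times; the paper studies how much further explicit coordination can help and at what strategic cost.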

[MA-11] BioResearcher: Scenario-Guided Multi-Agent for Translational Medicine

TL;DR: Translational medicine must integrate heterogeneous sources (literature, clinical trials, patents, and quantitative multi-omics analysis), and general-purpose foundation models or off-the-shelf tool-augmented multi-agent systems fall short of the structured, auditable, scenario-specific workflows, uncertainty handling, and provenance that such evidence synthesis demands. Ingenix BioResearcher is a scenario-guided multi-agent system that maps queries to versioned research playbooks, delegates to 30+ specialized subagents and machine-learning endpoints, combines structured database access with sandboxed code execution for genome-scale analyses, and applies claim-level multi-model reconciliation before editorial assembly, achieving leading performance on unit-level capabilities, open-ended biomedical reasoning, and end-to-end clinical discovery.

Link: https://arxiv.org/abs/2605.05985
Authors: Remigiusz Kinas, Joanna Krawczyk, Rafał Powalski, Przemysław Pietrzak, Agnieszka Kowalewska, Krzysztof Kolmus, Maciej Sypetkowski, Łukasz Smoliński, Tomasz Jetka
Affiliations: Ingenix.AI
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Quantitative Methods (q-bio.QM)
Comments: 5 pages (main text), 21 pages (appendix), 8 figures, 11 tables

Abstract:Translational medicine turns underspecified development goals into evidence synthesis that must combine literature, trials, patents, and quantitative multi-omics analysis while preserving identifiers, uncertainty, and retrievable provenance. General-purpose foundation models and off-the-shelf tool-augmented or multi-agent systems are not built for this: they tend to produce single-shot answers or run open-endedly, and fall short on the auditable, scenario-specific workflows that heterogeneous biomedical sources demand. This paper introduces Ingenix BioResearcher, a scenario-guided multi-agent system that maps queries to versioned research playbooks, delegates to specialized subagents over 30+ tools and machine-learning endpoints, mixes structured database access with sandboxed code for genome-scale analyses, and applies claim-level multi-model reconciliation before editorial assembly. We evaluate BioResearcher across unit-level capabilities, open-ended biomedical reasoning, and end-to-end clinical discovery. It leads evaluated baselines on 109 single-step tests (83.49% pass rate; 0.892 average score), achieves strong biomedical benchmark performance (89.33% on BixBench-Verified-50 and the top 0.758 mean score on BaisBench Scientific Discovery), and leads on a 30-query clinical end-to-end benchmark with the highest positive hit rate (74.7% ± 3.3%) and negative clear rate (96.8% ± 0.2%). These results show broad, competitive performance across unit-level, open-ended, and end-to-end clinical evaluations.

[MA-12] Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

TL;DR: The paper studies auto research as a closed, externally measured empirical loop, addressing the lack of auditable, continuously iterating experimentation in automated machine learning: each trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The key design is lineage feedback among specialist agents: failure signals (crashes, budget overruns, size failures, accuracy-gate misses) become program-level recipe rewrites rather than one-shot suggestions, letting the system improve public starting recipes without human intervention and deliver measurable gains across tasks (lower validation bpb, higher CORE scores, reduced wall-clock time).

Link: https://arxiv.org/abs/2605.05724
Authors: Jingjie Ning, Xiaochuan Li, Ji Zeng, Hao Kang, Chenyan Xiong
Affiliations: Carnegie Mellon University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or a single model checkpoint, but an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels. We instantiate this loop with specialist agents that partition recipe surfaces and share measured lineage across trials. The central empirical finding is that lineage feedback lets agents turn evaluator outcomes, including crashes, budget overruns, size failures, and accuracy-gate misses, into later program-level recipe edits rather than one-shot suggestions. Across 1,197 headline-run trials plus 600 Parameter Golf control trials after one-time setup and launch, humans did not choose proposals, edit recipes, override scores, or repair failed trials during the search. In the three headline runs, the same submitted-trial loop reduces Parameter Golf validation bpb by 0.81% , raises NanoChat-D12 CORE by 38.7% , and reduces CIFAR-10 Airbench96 wallclock by 4.59% , with each task measured by its own external evaluator and legality checks. The trace includes a strict architecture-domain audit of 157 headline-run submissions and program rewrites such as a NanoChat attention-kernel path change. Within this scope the loop autonomously writes code, submits experiments, absorbs feedback, applies and combines known techniques inside each environment, and improves public starting recipes.

[MA-13] Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems

TL;DR: Optimizing the communication structure of LLM-based multi-agent systems (LLM-MAS) with randomly sampled training tasks is often unstable under limited budgets and sensitive to the particular training set, since tasks vary in difficulty and domain and are not equally informative. The proposed ensemble-based information-theoretic task-selection framework estimates a candidate task's informativeness by how much it would change the distribution over graph parameters, using ensemble Kalman inversion as an efficient, derivative-free approximation well suited to black-box, noisy systems. Combined with embedding-based construction of a compact representative candidate pool, surrogate modeling, and batch Thompson sampling, it yields more stable and effective communication-structure optimization under constrained compute budgets, remaining robust even under agent attacks.

Link: https://arxiv.org/abs/2605.05703
Authors: Huchen Yang, Xinghao Dong, Dan Negrut, Jin-Long Wu
Affiliations: University of Wisconsin–Madison
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Optimizing the communication structure of large language model based multi-agent systems (LLM-MAS) has been shown to improve downstream performance and reduce token usage. Existing methods typically rely on randomly sampled training tasks. However, tasks may differ substantially in difficulty and domain, and thus they are not equally informative for updating communication structure, making optimization under limited training budgets often unstable and highly sensitive to the particular training set. To actively identify the most valuable tasks for communication-structure optimization, we propose an ensemble-based information-theoretic task selection framework. The proposed method estimates task informativeness by how much a candidate task changes the distribution over graph parameters, using ensemble Kalman inversion as an efficient and derivative-free approximation of the corresponding Bayesian update. The resulting estimator is especially suitable for black-box and noisy multi-agent systems. To enhance scalability, we construct a compact candidate pool through embedding-based representative selection and combine the informative selection with surrogate modeling and batch Thompson sampling. We validate our method in both benign settings and settings with agent attacks, demonstrating its effectiveness for communication-structure optimization under constrained computational budgets.

[MA-14] Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation NEURIPS2026

TL;DR: Multi-agent LLM systems for code generation route tasks to orchestration topologies without consulting the structural complexity of the code under modification, which causes misrouting and poor resource allocation. Retrieval-Guided Adaptive Orchestration (RGAO) closes this loop: it extracts a structural-complexity vector from a hierarchical code index to guide topology selection, and couples this with a formal budget algebra that provides provable budget conservation under dynamic, retrieval-conditioned topologies. This composition of complexity-conditioned routing with a formal resource algebra is the paper's headline contribution, yielding a property neither line of work admits alone.

Link: https://arxiv.org/abs/2605.05657
Authors: Abhijit Talluri, Pujith Anne, Bhagavan Choudary Pendiyala, Raghavendra Chilukuri
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 30 pages, 9 figures. NeurIPS 2026 Evaluations and Datasets Track submission, under review

Abstract:Multi-agent LLM systems for code generation face a fundamental routing problem: the optimal orchestration topology depends on the structural complexity of the code under modification, yet existing systems select topologies without consulting the codebase. We present Retrieval-Guided Adaptive Orchestration (RGAO), an architecture that closes this loop by extracting a structural complexity vector from a hierarchical code index before selecting the orchestration topology. RGAO operates within Code-Agent, a multi-agent framework whose sub-agents are governed by formal contracts with six-dimensional budget vectors. Our headline contribution is the composition of two previously separate lines of work – complexity-conditioned LLM routing and formal resource algebras – yielding a property neither admits alone: provable budget conservation under retrieval-conditioned dynamic topology selection. Concretely we contribute: (1) a complexity-conditioned topology router that reduces proxy-measured misrouting from 30.1% to 8.2%; (2) a budget algebra with a structural-induction conservation theorem; and (3) a hierarchical code retrieval engine. Empirical evaluation demonstrates sub-millisecond DAG construction and linear tree-index scalability.
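The six-dimensional budget vectors and the conservation property can be illustrated with a component-wise check. This is a toy reading of the abstract, not RGAO's formal algebra:

```python
DIMS = 6  # six-dimensional budget vectors, per the abstract

def fits(child, parent):
    """Component-wise: a budget fits if it exceeds the other in no dimension."""
    return all(c <= p for c, p in zip(child, parent))

def conserved(parent, children):
    """Toy budget conservation: in every dimension, the children's combined
    spend must not exceed the parent's allocation. Because the check is
    local to one parent and its children, it composes up an orchestration
    tree by structural induction, mirroring the paper's theorem statement."""
    totals = [sum(child[i] for child in children) for i in range(DIMS)]
    return fits(totals, parent)

parent = [10, 10, 10, 10, 10, 10]
ok = conserved(parent, [[4] * DIMS, [5] * DIMS])       # 9 <= 10 everywhere
overrun = conserved(parent, [[6] * DIMS, [5] * DIMS])  # 11 > 10: violated
```

The useful consequence of such an algebra is that a dynamically chosen topology can never silently exceed the resources granted to the orchestration as a whole: every reallocation must pass the local check.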

[MA-15] FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking ACL2026

TL;DR: LLM adoption in banking is held back by three demands: high accuracy, regulatory compliance, and verifiable, grounded responses. The paper's unified, data-efficient training framework makes three contributions: (1) a data-generation pipeline combining LLM-as-a-Judge filtering, citation annotation, and curriculum learning that trains a 12B model on only 143M tokens, achieving high answer quality and outperforming GPT-4.1 on citation grounding; (2) a calibrated refusal mechanism, trained with 22% unanswerable examples, that yields a 12% "I don't know" rate, a substantial improvement over the base model's unsafe 4.3% while avoiding GPT-4.1's over-refusal (20.2%); and (3) an end-to-end methodology from data curation to quantized serving, deployed at 40+ financial institutions with a 7.1-percentage-point gain in query resolution (p < 0.001) and responses that are 3-5x faster at 20-50x lower cost than GPT-4.1.

Link: https://arxiv.org/abs/2605.05482
Authors: Denys Katerenchuk, Pablo Duboue, Keelan Evanini, David Gondek, Nithin Govindugari, Olivier Allauzen, Joshua Baptiste, David J More, Joshua Schechter
Affiliations: Kasisto; Textualization; NBME
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 7 pages, ACL 2026 conference

点击查看摘要

Abstract:Large language models (LLMs) are rapidly being adopted across various domains. However, their adoption in the banking industry faces resistance due to demands for high accuracy, regulatory compliance, and the need for verifiable and grounded responses. We present a unified, data-efficient framework for training grounded domain-specific LLMs that optimizes answer quality, citation grounding, and calibrated refusal under real-world deployment constraints. First, we describe a data generation pipeline that combines LLM-as-a-Judge filtering, citation annotation, and curriculum learning with only 143M tokens. The resulting 12B model achieves high answer quality, outperforming GPT-4.1 on citation grounding, with a modest citation tradeoff versus the untuned base. Second, we propose a calibrated refusal mechanism: training on 22% unanswerable examples yields a 12% “I don’t know” rate, substantially improving over the base model’s unsafe 4.3% rate while avoiding GPT-4.1’s over-refusal (20.2%). Third, we present an end-to-end methodology spanning from data curation to quantized serving. The system is deployed at 40+ financial institutions, achieving a 7.1 percentage point improvement in query resolution (p < 0.001). Additionally, the model delivers 3-5x faster responses at 20-50x lower cost compared to GPT-4.1.

[MA-16] Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

【速读】:该论文旨在解决多智能体系统中大型语言模型(Large Language Models, LLMs)协作行为的可预测性问题,即如何在实际科学任务部署前高效评估LLM的协作能力。其核心挑战在于:尽管协作机制在共享资源约束下的AI团队中至关重要,但缺乏一种可靠、低成本的方法来量化模型的协作倾向并预判其在复杂任务中的表现。解决方案的关键在于构建一个基于行为经济学的博弈框架,通过六种标准化博弈测试35个开放权重LLM的协作特征,并发现这些博弈衍生的协作画像能够稳健预测其在AI for Science任务中的下游表现——特别是那些能有效协调博弈并采取乘法式团队生产策略(而非贪婪策略)的模型,在科学报告的准确性、质量和完成度三个指标上均显著优于其他模型。这一方法为筛选具备协作潜力的LLM提供了快速且经济的诊断工具。

链接: https://arxiv.org/abs/2604.20658
作者: Shivani Kumar,Adarsh Bharathwaj,David Jurgens
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model’s behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that coordinate effectively in games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.

自然语言处理

[NLP-0] EMO: Pretraining Mixture of Experts for Emergent Modularity

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中因模块化不足导致的资源浪费问题,即传统单体架构要求加载全部参数,即使应用仅需特定能力(如代码生成、数学推理或领域知识)。针对这一挑战,作者提出一种名为EMO的Mixture-of-Experts(MoE)架构,其核心创新在于通过引入“文档级专家池共享”机制,使相似语义域的token自动聚集到同一专家子集,从而实现无需人工定义先验即可自发形成模块化专家分组。关键在于利用文档边界作为监督信号,在预训练过程中自然诱导出语义层次的专业化专家组合,而非传统MoE中常见的低级语法特征分化,这使得EMO可在保留少量专家(如25%)的情况下维持接近全模型性能,显著提升内存效率与可组合性。

链接: https://arxiv.org/abs/2605.06663
作者: Ryan Wang,Akshita Bhagia,Sewon Min
机构: University of California, Berkeley (加州大学伯克利分校); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts (MoEs) seemingly offer a potential alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity-the independent use and composition of expert subsets-without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint enables coherent expert groupings to emerge during pretraining using document boundaries alone. We pretrain a 1B-active, 14B-total EMO on 1T tokens. As a full model, it matches standard MoE performance. Crucially, it enables selective expert use: retaining only 25% (12.5%) of experts incurs just a 1% (3%) absolute drop, whereas standard MoEs break under the same setting. We further find that expert subsets in EMO specialize at semantic levels (e.g., domains such as math or code), in contrast to the low-level syntactic specialization observed in standard MoEs. Altogether, our results demonstrate a path toward modular, memory-efficient deployment of large, sparse models and open new opportunities for composable architectures.
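下面用一段极简的 Python 草图示意摘要中“文档共享专家池”的路由约束:同一文档内的 token 只能从该文档的专家池中选取 top-k 专家,而标准 MoE 可在全部专家中选取。其中函数名、专家数量与打分均为示意性假设,并非论文实现。

```python
# Toy sketch of EMO-style routing: every token in a document picks its
# top-k experts from that document's shared expert pool, rather than from
# all experts as in a standard MoE. Names and sizes are illustrative.

def top_k(scores, k):
    """Indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def route_token(router_logits, doc_pool, k=2):
    """Select top-k experts for one token, restricted to the document's pool."""
    # Mask out experts outside the document's pool with -inf.
    masked = [s if i in doc_pool else float("-inf")
              for i, s in enumerate(router_logits)]
    return top_k(masked, k)

# 8 experts in total; this document is assigned a pool of 4 of them.
doc_pool = {0, 2, 5, 7}
logits = [0.9, 2.0, 0.1, 1.5, 0.3, 0.8, 1.7, 0.2]

chosen = route_token(logits, doc_pool)
assert set(chosen) <= doc_pool  # routing never leaves the pool
```

这一约束正是摘要所述“仅凭文档边界即可诱导专家分组”的机制核心:被屏蔽的专家(如上例中得分最高的 1 号、6 号)即使路由分数更高也不会被选中。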

[NLP-1] Verifier-Backed Hard Problem Generation for Mathematical Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成有效、具有挑战性且新颖的科学与数学问题方面能力不足的问题,这是推动LLM训练和实现自主科学研究的关键瓶颈。现有方法要么依赖昂贵的人工专家参与,要么采用简单的自对弈(self-play)范式,易因奖励黑客(reward hacking)导致生成无效问题。解决方案的核心在于提出一种验证器增强的困难问题生成框架(Verifier-enhanced Hard Problem Generation, VHG),其基于三方自对弈机制——包含出题者(setter)、解题者(solver)和独立验证器(verifier)。通过将验证器引入传统出题-解题二元结构,VHG将出题者的奖励设计为由验证器评估的问题有效性与解题者评估的问题难度共同决定,从而约束生成过程以确保问题的合法性与挑战性。该框架实现了两种验证器变体:硬符号验证器(Hard symbolic verifier)和软LLM验证器(Soft LLM-based verifier),并在不定积分任务和通用数学推理任务上验证了其显著优于基线方法的效果。

链接: https://arxiv.org/abs/2605.06660
作者: Yuhang Lai,Jiazhan Feng,Yee Whye Teh,Ning Miao
机构: City University of Hong Kong (香港城市大学); Hong Kong Institute of AI for Science, City University of Hong Kong (香港人工智能科学研究所,香港城市大学); Peking University (北京大学); University of Oxford (牛津大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training and enabling autonomous scientific research. Existing problem generation approaches either depend on expensive human expert involvement or adopt naive self-play paradigms, which frequently yield invalid problems due to reward hacking. This work introduces VHG, a verifier-enhanced hard problem generation framework built upon three-party self-play. By integrating an independent verifier into the conventional setter-solver duality, our design constrains the setter’s reward to be jointly determined by problem validity (evaluated by the verifier) and difficulty (assessed by the solver). We instantiate two verifier variants: a Hard symbolic verifier and a Soft LLM-based verifier, with evaluations conducted on indefinite integral tasks and general mathematical reasoning tasks. Experimental results show that VHG substantially outperforms all baseline methods by a clear margin.
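摘要提到出题者(setter)的奖励由验证器评估的有效性与解题者表现出的难度共同决定。下面给出一个示意性组合方式;乘积形式与函数名均为本文假设,论文未在摘要中给出具体公式。

```python
# Illustrative sketch of a VHG-style setter reward: the setter is rewarded
# only when the verifier deems the problem valid, scaled by how hard the
# problem proved for the solver. The product form is an assumption.

def setter_reward(validity: float, solver_success_rate: float) -> float:
    """validity in [0, 1] from the verifier; difficulty = 1 - solver success."""
    difficulty = 1.0 - solver_success_rate
    return validity * difficulty

# An invalid problem earns nothing, however often the solver fails on it.
assert setter_reward(0.0, 0.0) == 0.0
# A valid problem the solver always cracks is too easy to be rewarded.
assert setter_reward(1.0, 1.0) == 0.0
# A valid problem solved 25% of the time earns 0.75.
assert setter_reward(1.0, 0.25) == 0.75
```

这种耦合使得单纯“出难题”(难但无效)或“出怪题”(有效但无挑战)都无法获得高奖励,从而抑制摘要所述的 reward hacking。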

[NLP-2] When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

【速读】: 该论文旨在解决在缺乏标注基准的情况下,如何对候选语言模型(Language Models, LM)进行安全性的比较评估问题,尤其适用于特定语言、行业或监管环境下的部署场景。其核心挑战在于传统依赖人工标注的评测方法不可行时,如何构建可信且可复现的安全评分体系。解决方案的关键在于提出“无基准比较安全评分”(benchmarkless comparative safety scoring)的形式化框架,并建立一个由可控对比(safe-versus-abliterated contrast)、目标驱动方差主导性(target-driven variance dominance)和重跑稳定性(stability across reruns)构成的工具有效性链(instrumental-validity chain)。该链通过局部优先的 SimpleAudit 工具实现,在挪威语安全场景中验证了其有效性:安全与破坏目标能清晰区分(AUROC 0.89–1.00),目标身份是主要变异来源(η20.52\eta^2 \approx 0.52),且十次重跑后严重性分布趋于稳定。这表明即使没有真实标签,也可通过设计严谨的审计协议和量化指标组合(如得分、匹配差异、临界率、不确定性及审计者/评判者信息)来提供可解释的部署证据。

链接: https://arxiv.org/abs/2605.06652
作者: Sushant Gautam,Finn Schwall,Annika Willoch Olstad,Fernando Vallecillos Ruiz,Birk Torpmann-Hagen,Sunniva Maria Stordal Bjørklund,Leon Moonen,Klas Pettersen,Michael A. Riegler
机构: Simula Metropolitan Center for Digital Engineering (Simula Metropolitan Center for Digital Engineering); Oslo Metropolitan University (Oslo Metropolitan University); University of Oslo (University of Oslo); Simula Research Laboratory (Simula Research Laboratory); Norwegian Directorate of Health (Norwegian Directorate of Health)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: SimpleAudit Repository: this https URL

点击查看摘要

Abstract:Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component (η² ≈ 0.52), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
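摘要中报告的 η² 指标衡量“目标身份”能解释的得分方差占比。以下为其标准计算方式(组间平方和 / 总平方和)的自包含草图,数据为虚构示例,仅用于说明统计量本身:

```python
# Sketch of the eta-squared statistic: the share of score variance
# explained by group identity (here, which target model produced the
# scores), computed as SS_between / SS_total.

def eta_squared(groups):
    """groups: list of score lists, one list per target model."""
    all_scores = [x for g in groups for x in g]
    grand = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    ss_between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2 for g in groups)
    return ss_between / ss_total

# Two targets with well-separated scores: most variance is between groups.
safe = [0.9, 1.0, 0.95]
abliterated = [0.1, 0.0, 0.05]
assert eta_squared([safe, abliterated]) > 0.9
```

当目标身份几乎完全决定得分时 η² 趋近 1;论文报告的 η² ≈ 0.52 表明目标差异是主导但非唯一的方差来源。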

[NLP-3] Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

【速读】: 该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)框架下,因负样本(negative rollouts)缺乏失败严重程度的梯度区分度以及组合爆炸导致稀疏二值奖励难以有效传递信号的问题。传统方法如Group Relative Policy Optimization (GRPO) 依赖于对正负样本的分组优势估计,但其在实际应用中受限于负样本覆盖不全和梯度不稳定。为此,作者提出Positive-Only Policy Optimization (POPO),其核心创新在于仅使用在线正样本(positive rollouts)进行策略优化,通过有界重要性采样(bounded importance sampling)实现稳定更新,并引入两个关键机制:一是采用带动量适应律的孪生策略网络(siamese policy network)以稳定策略演化;二是用有界相似性惩罚项替代KL散度,在表示空间中约束策略变化。实验表明,POPO在数学推理任务上性能优于或等同于GRPO,例如在AIME 2025基准上达到36.67%准确率(Qwen-Math-7B),显著优于GRPO的30.00%,验证了其有效性与鲁棒性。

链接: https://arxiv.org/abs/2605.06650
作者: Mingwei Xu,Hao Fang
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR), due to the deterministic verification, becomes a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community has witnessed a rapid shift from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), which replaces complicated advantage estimation with a simple estimate over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and the combinatorial vastness makes penalizing a few sampled negatives unlikely to cover a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning occurs exclusively via online positive rollouts. Specifically, POPO utilizes bounded importance sampling over the positive rollout set. Thus, no disjoint negative rollouts are used for gradient guidance. We show that implicit negative gradients can emerge naturally through reinforcing the positive probability via rollout redistribution. Next, POPO stabilizes the policy optimization through two mechanisms. First, it applies a siamese policy network with a momentum-based adaptation law for stabilized policy evolution. Second, we replace the KL-divergence with a bounded similarity penalty term in the siamese representation space. We conduct extensive experiments using publicly available, well-established text-LLM models, e.g., the Qwen family, across all-level mathematical benchmarks. Our experiments demonstrate that POPO achieves performance comparable to, or even superior to, GRPO. Notably, POPO achieves 36.67% on AIME 2025 with Qwen-Math-7B, outperforming GRPO’s 30.00%. Our ablation and sweep studies further illustrate the necessity and robustness of POPO’s components.
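下面以纯 Python 草图示意“仅正样本 + 有界重要性采样”的核心思想:负样本不产生显式梯度项,正样本的重要性比率被裁剪在有界区间内。裁剪范围等超参数均为本文假设值,并非论文设定:

```python
# Conceptual sketch of positive-only, bounded importance weighting:
# rollouts with zero (negative) reward contribute no gradient term, while
# positive rollouts are weighted by a clipped probability ratio between
# the current and behavior policies. The clip range is an assumed value.

def popo_weights(probs_new, probs_old, rewards, clip=2.0):
    """Per-rollout weights: clipped ratios for positives, 0 otherwise."""
    weights = []
    for p_new, p_old, r in zip(probs_new, probs_old, rewards):
        if r > 0:  # only verified-correct rollouts carry signal
            ratio = p_new / p_old
            weights.append(min(max(ratio, 1.0 / clip), clip))
        else:      # no explicit penalty term on negatives
            weights.append(0.0)
    return weights

w = popo_weights([0.4, 0.1, 0.3], [0.2, 0.2, 0.3], rewards=[1, 0, 1])
assert w == [2.0, 0.0, 1.0]  # ratio held at the clip bound; negative zeroed
```

由于概率归一化,提升正样本概率会隐式压低其余序列的概率,这正是摘要所述“隐式负梯度”的直观来源。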

[NLP-4] StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长程决策任务中因当前方法多为纯反应式(reactive)而导致探索能力弱化与信用分配(credit assignment)困难的问题。其核心解决方案是提出一种名为战略轨迹抽象(Strategic Trajectory Abstraction, StraTA)的框架,关键在于引入显式的轨迹级策略(trajectory-level strategy),通过从初始任务状态采样紧凑策略并以此条件化后续动作,在分层GRPO风格的rollout设计下联合训练策略生成与动作执行,并结合多样化的策略rollout和关键性自我判断机制,从而显著提升样本效率与最终性能。

链接: https://arxiv.org/abs/2605.06642
作者: Xiangyuan Xue,Yifan Zhou,Zidong Wang,Shengji Tang,Philip Torr,Wanli Ouyang,Lei Bai,Zhenfei Yin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 26 pages, 4 figures, 7 tables

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed-source models.

[NLP-5] Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在提升大语言模型(Large Language Model, LLM)推理能力时,因缺乏可控且可扩展的环境而导致训练规模与任务难度之间系统性关系难以研究的问题。其解决方案的关键在于提出了一种名为ScaleLogic的合成逻辑推理框架,该框架能够独立控制两个维度的任务难度:推理深度(即证明规划的层次数,horizon)和底层逻辑的表达能力(expressiveness)。通过该框架,作者发现RL训练所需的计算量 T 与推理深度 D 呈幂律关系 T ∝ D^γ,且幂律指数 γ 随逻辑表达能力增强而单调上升(从1.04到2.60),并进一步表明更复杂的逻辑训练能带来更大的下游性能增益和更高的计算效率,从而揭示了训练内容本身对迁移效果的关键影响。

链接: https://arxiv.org/abs/2605.06638
作者: Tianle Wang,Zhaoyang Wang,Guangchen Lan,Xinpeng Wei,Sipeng Zhang,Guanwen Qiu,Abulhair Saparov
机构: Purdue University (普渡大学); UNC Chapel Hill (北卡罗来纳大学教堂山分校); Georgia Tech (佐治亚理工学院); UC San Diego (加州大学圣地亚哥分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic (“if-then”) towards more expressive first-order reasoning with conjunction (“and”), disjunction (“or”), negation (“not”), and universal quantification (“for all”). Using this framework, we show that the RL training compute T follows a power law with respect to reasoning depth D (T ∝ D^γ, R² > 0.99), and that the scaling exponent γ increases monotonically with logical expressiveness, from 1.04 to 2.60. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to +10.66 points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency.
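摘要中的幂律关系 T ∝ D^γ 可在对数-对数空间用最小二乘回归估计斜率 γ(log T = γ·log D + c)。以下用合成数据(设 γ = 2,数据为虚构)演示该拟合过程:

```python
import math

# Recover the power-law exponent gamma from (depth, compute) pairs by
# ordinary least squares on the log-log transformed data. The data below
# is synthetic, constructed to satisfy T = 3 * D^2 exactly.

def fit_power_law(depths, computes):
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in computes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # OLS slope in log-log space is the power-law exponent.
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

depths = [1, 2, 4, 8]
computes = [3 * d ** 2 for d in depths]  # exact T = 3 * D^2
assert abs(fit_power_law(depths, computes) - 2.0) < 1e-9
```

在真实实验数据上,拟合残差对应摘要报告的 R² > 0.99。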

[NLP-6] Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成研究报告时存在的引用不可靠问题,即模型生成的引文虽形式上符合规范,但其链接有效性、内容相关性及事实准确性难以验证。现有方法或依赖模型自我引用(易引入偏倚),或采用检索增强生成(Retrieval-Augmented Generation, RAG)但无法确保源内容的可访问性、相关性和事实一致性。论文提出首个可复现的来源归属评估框架,其核心创新在于使用抽象语法树(Abstract Syntax Tree, AST)解析器自动提取并大规模评估LLM生成Markdown报告中的内联引文;该框架通过实际检索所引内容,使人工或模型评估者能够逐条比对引文与原始来源,从三个维度量化评估:(1) 链接有效性(Link Works)、(2) 内容相关性(Relevant Content)和(3) 事实准确性(Fact Check)。此闭环评估机制首次实现了对引文质量的系统性、可验证的量化分析,揭示了当前前沿模型在表面引文质量与真实事实可靠性之间存在显著断层。

链接: https://arxiv.org/abs/2605.06635
作者: Hailey Onweller,Elias Lumer,Austin Huber,Pia Ramchandani,Vamse Kumar Subbiah,Corey Feld
机构: PricewaterhouseCoopers(普华永道)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Unlike methods that verify claims in isolation, our framework closes the loop by retrieving the actual cited content, enabling human or model evaluators to judge each citation against its source. Citations are evaluated along three dimensions. (1) Link Works verifies URL accessibility, (2) Relevant Content measures topical alignment, and (3) Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across three evaluation dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy, while fewer than half of open-source models successfully generate cited reports in a one-shot setting. Ablation studies on research depth show that Fact Check accuracy drops by approximately 42% on average across two frontier models as tool calls scale from 2 to 150, demonstrating that more retrieval does not produce more accurate citations. These findings reveal a critical disconnect between surface-level citation quality and factual reliability, and our framework provides the evaluation infrastructure to assess the disconnect.
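论文使用 AST 解析器从 Markdown 报告中提取内联引文。作为轻量替代示意,下面用正则表达式抽取 `[text](url)` 形式的内联链接;真实系统的解析器实现与输出 schema 未在摘要中给出,此处仅为假设性草图:

```python
import re

# Lightweight stand-in for a Markdown AST pass: pull out every inline
# citation of the form [anchor text](http...url) from a generated report.

LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")

def extract_citations(markdown: str):
    """Return (anchor text, url) pairs for every inline link."""
    return LINK_RE.findall(markdown)

report = (
    "GDP grew 3% in 2024 [World Bank](https://worldbank.org/gdp) while "
    "inflation fell [IMF report](https://imf.org/weo)."
)
cites = extract_citations(report)
assert cites == [("World Bank", "https://worldbank.org/gdp"),
                 ("IMF report", "https://imf.org/weo")]
```

提取出的每个 (claim, url) 对即可进入摘要所述的三维评估:URL 可达性、内容相关性、事实一致性。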

[NLP-7] Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation

【速读】: 该论文旨在解决第二语言(L2)韩语形态句法标注中人工标注成本高、效率低的问题,提出了一种简化的“人在回路”(human-in-the-loop)工作流程。其解决方案的关键在于利用两个领域自适应解析器之间的共识作为标注正确性的代理指标,通过比较解析器一致性与独立人工判断的一致性,验证了该方法在半自动标注中的可行性;进一步分析表明,解析器分歧主要集中在可预测的语法范畴(如句法关系区分和小句边界模糊),其中多数可通过迭代模型优化处理,而部分分歧则反映了深层表征挑战,为后续模型改进提供了方向。

链接: https://arxiv.org/abs/2605.06625
作者: Hakyung Sung,Gyu-Ho Shin
机构: 未知
类目: Computation and Language (cs.CL)
备注: To be published in the 20th Linguistic Annotation Workshop

点击查看摘要

Abstract:We propose a simplified human-in-the-loop workflow for second language (L2) Korean morphosyntactic annotation by leveraging agreement between two domain-adapted parsers. We first evaluate whether parser agreement can serve as a proxy for annotation correctness by comparing it with independent human judgments. The results show strong correspondence between parser and human judgments, supporting the feasibility of semi-automatic L2-Korean UD annotation. Further analysis demonstrates that parser disagreements cluster in linguistically predictable domains such as grammatical-relation distinctions and clause-boundary ambiguity. While many disagreement cases are tractable for iterative model refinement, others reflect deeper representational challenges inherent in parsing and tagging L2-Korean corpora.
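摘要以两个解析器的一致性作为标注正确性的代理指标。下面给出 token 级一致率的一个极简实现草图:当两解析器对同一 token 给出相同的 head 与依存标签时记为一致(数据为虚构,仅示意计算方式):

```python
# Minimal sketch of token-level agreement between two dependency parses,
# used as a proxy for annotation correctness: two parsers agree on a token
# when they assign the same head index and the same relation label.

def agreement_rate(parse_a, parse_b):
    """Each parse is a list of (head_index, deprel) per token."""
    assert len(parse_a) == len(parse_b)
    same = sum(1 for a, b in zip(parse_a, parse_b) if a == b)
    return same / len(parse_a)

parser1 = [(2, "nsubj"), (0, "root"), (2, "obj")]
parser2 = [(2, "nsubj"), (0, "root"), (2, "obl")]  # disagree on last label
assert agreement_rate(parser1, parser2) == 2 / 3
```

一致的 token 可直接进入半自动标注流程,分歧的 token(如上例 obj/obl 之争)则交由人工复核,这正是摘要所述 human-in-the-loop 工作流的分流逻辑。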

[NLP-8] MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems ICML2026

【速读】: 该论文旨在解决大规模语言模型(Large Language Model, LLM)驱动的多智能体系统(Multi-agent System, MAS)中提示词(prompt)优化难题,即如何在不依赖真实标签的情况下,协同优化多个智能体的角色提示,以弥合局部代理目标与整体系统目标之间的错位问题。解决方案的关键在于提出MASPO框架,其核心创新是引入一种联合评估机制,通过衡量提示对后续智能体下游任务成功的影响来动态调整提示质量,从而实现从局部交互到全局性能的有效映射;同时结合数据驱动的进化束搜索策略,高效探索高维提示空间,显著提升多智能体协作任务的准确性和鲁棒性。

链接: https://arxiv.org/abs/2605.06623
作者: Zhexuan Wang,Xuebo Liu,Li Wang,Zifei Shan,Yutong Wang,Zhenxi Song,Min Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ICML 2026

点击查看摘要

Abstract:Large language model (LLM)-based Multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non-trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground-truth labels. Furthermore, MASPO employs a data-driven evolutionary beam search to efficiently navigate the high-dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state-of-the-art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at this https URL.
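摘要提到的“进化束搜索”可概括为:每轮对候选提示词做变异扩展,再按打分保留 top-k。以下为该搜索骨架的玩具实现,其中打分函数与变异算子均为占位假设,并非 MASPO 的实际组件:

```python
import random

# Generic evolutionary beam search over prompt candidates: each round,
# every surviving prompt proposes one mutated child, then only the
# top-`beam` prompts by score survive. Scoring/mutation are placeholders.

def beam_search(init_prompts, score, mutate, beam=2, rounds=3, seed=0):
    rng = random.Random(seed)
    pool = list(init_prompts)
    for _ in range(rounds):
        pool = pool + [mutate(p, rng) for p in pool]      # expand
        pool = sorted(pool, key=score, reverse=True)[:beam]  # select
    return pool[0]

# Toy objective: longer prompts score higher; mutation appends a token.
best = beam_search(["Solve:", "Answer:"], score=len,
                   mutate=lambda p, rng: p + " step")
assert best == "Answer: step step step"
```

在 MASPO 中,打分对应摘要所述的联合评估机制(衡量提示对后继智能体下游成功的贡献),变异则由 LLM 基于失败案例改写提示。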

[NLP-9] Algospeak Hiding in the Open: The Trade-off Between Legible Meaning and Detection Avoidance

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在内容生成与审核过程中,因语言规避策略(Algospeak)加剧的“逃避者-检测器”共演化问题,即如何量化并理解规避行为在提升检测绕过能力的同时对多数接收者可理解性造成的损失。其解决方案的关键在于提出并实证验证了“多数可理解调制阈值”(Majority Understandable Modulation, MUM)这一概念,并构建了一个可复现的框架,基于现有分类体系生成具有可控调制水平的意义保持型 Algospeak 变体;通过新冠虚假信息作为示例场景,创建包含700个样本的数据集并进行双维度评估——意义恢复(interpretation)与虚假信息检测(disinformation detection),从而揭示调制强度与可理解性之间的非线性关系,为系统性研究 Algospeak 动态机制提供了方法论基础和实验支撑。

链接: https://arxiv.org/abs/2605.06619
作者: Jan Fillies,Ronald E. Robertson,Jeffrey Hancock
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Under Review

点击查看摘要

Abstract:As large language models (LLMs) increasingly mediate both content generation and moderation, linguistic evasion strategies known as Algospeak have intensified the coevolution between evaders and detectors. This research formalizes the underlying dynamics grounded in a joint action model: when Algospeak increases, detectability and understandability decrease. Further, the concept of Majority Understandable Modulation (MUM) is introduced and defined as the modulation level at which additional evasive alteration increases detector evasion but loses comprehension for the majority of recipients. To empirically probe this trade-off, we introduce a reproducible framework that can be used to create meaning-preserving, Algospeak-style variants, based on an existing taxonomy and with tunable modulation levels. Using COVID-19 disinformation as a first proof-by-example setting, we construct a reference dataset of 700 modulated items, drawn from twenty base sentences across five modulation levels and seven strategies. We then run two linked evaluations with seven different language models: one testing for interpretation through meaning recovery and one for disinformation detection through classification. Curve fitting over modulation levels yields an estimate of the Majority Understandable Modulation threshold and enables sensitivity analyses across strategies and models, see Figure 1. Results reveal the characteristic relationships between understandability and modulation. This study lays the groundwork for understanding the dynamics behind Algospeak and provides the framework, dataset, and experimental setups described.

[NLP-10] When and Why SignSGD Outperforms SGD: A Theoretical Study Based on ℓ₁-norm Lower Bounds

【速读】: 该论文旨在解决 sign-based 优化算法(如 SignSGD 和 Muon)在训练大规模基础模型时,经验上(empirically)表现优于传统随机梯度下降(SGD)但缺乏理论解释的问题。其核心挑战在于:在标准平滑性和有限方差条件下,SGD 已被证明对 ℓ₂-范数意义上的驻点搜索是极小极大最优的,这从根本上排除了 sign-based 方法在常规设定下获得复杂度优势的可能性。为此,作者提出了一种新的理论框架,关键在于引入 ℓ₁-范数驻点性、ℓ∞-光滑性以及可分离噪声模型,从而更好地刻画符号更新的坐标级特性。在此新问题几何下,作者推导出 SignSGD 的上下界匹配结果,并明确指出了其在稀疏噪声场景下相比 SGD 复杂度降低 d 倍(d 为问题维度)的理论优势;进一步将该框架扩展至矩阵域,证明 Muon 优化器同样保持与维度成线性关系的最优复杂度。最终,通过 GPT-2 模型预训练实验验证了理论预测的收敛速度提升,实现了理论与实践的有效衔接。

链接: https://arxiv.org/abs/2605.06615
作者: Hongyi Tao,Dingzhi Yu,Lijun Zhang
机构: Nanjing University (南京大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC)
备注: Code is available at this https URL

点击查看摘要

Abstract:Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why these sign-based methods outperform vanilla SGD. The core obstacle is that under standard smoothness and finite variance conditions, SGD is known to be minimax optimal for finding stationary points measured by ℓ₂-norms, thereby fundamentally precluding any complexity gains for sign-based methods in standard settings. To overcome this barrier, we analyze sign-based optimizers leveraging ℓ₁-norm stationarity, ℓ∞-smoothness, and a separable noise model, which can better capture the coordinate-wise nature of signed updates. Under this distinct problem geometry, we derive matched upper and lower bounds for SignSGD and explicitly characterize the problem class in which SignSGD provably dominates SGD. Specifically, we compare the upper bound of SignSGD with the lower bound of SGD, illustrating that SignSGD effectively reduces the complexity by a factor of d under sparse noise, where d is the problem dimension. Furthermore, we elevate this framework to the matrix domain, providing an equivalent optimal lower bound for the Muon optimizer, proving that extending the sign operator to matrices preserves this optimal scaling with dimensionality. Finally, we bridge our theoretical bounds to practice, demonstrating that the theoretical superiority of SignSGD accurately predicts its faster convergence during the pretraining of a 124M parameter GPT-2 model.
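SignSGD 的更新规则只取梯度符号、丢弃幅值,使每个坐标的步长相同;这正是上文“坐标级特性”分析的对象。下面在一个简单二次函数上对比 SGD 与 SignSGD 的单步更新(数值为示意,步长为假设取值):

```python
# One update step of SGD vs. SignSGD on f(w) = 0.5 * ||w||^2 (grad = w).
# SignSGD moves every coordinate by the same fixed step in the direction
# of the gradient's sign, discarding the gradient magnitude.

def sgd_step(w, grad, lr=0.25):
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def signsgd_step(w, grad, lr=0.25):
    sign = lambda g: (g > 0) - (g < 0)  # returns -1, 0, or +1
    return [wi - lr * sign(gi) for wi, gi in zip(w, grad)]

w = [4.0, -0.5]
assert sgd_step(w, w) == [3.0, -0.375]      # step scales with gradient size
assert signsgd_step(w, w) == [3.75, -0.25]  # equal-size step per coordinate
```

可以看到 SGD 在小梯度坐标上的步长极小,而 SignSGD 对每个坐标施加等幅更新,这是其在 ℓ∞-光滑、稀疏噪声几何下优势的直观来源。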

[NLP-11] SkillOS: Learning Skill Curation for Self-Evolving Agents

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在处理流式任务时缺乏持续学习能力的问题,尤其是如何从历史交互中自动提炼可复用的技能(skill),并构建长期有效的技能管理策略。现有方法或依赖人工标注、或采用启发式规则、或仅能训练短期技能操作,难以从间接且延迟的反馈中学习复杂的长期技能遴选与更新策略。解决方案的关键在于提出SkillOS——一种基于经验驱动的强化学习(Reinforcement Learning, RL)训练范式,其核心机制是将一个固定执行器(agent executor)与一个可训练的技能 curator 分离:执行器负责检索和应用技能,而curator则通过累积经验动态更新外部技能仓库(SkillRepo)。为提供有效的学习信号,SkillOS设计了复合奖励函数,并在基于技能相关性分组的任务流上进行训练,使早期轨迹用于更新SkillRepo,后续相关任务用于评估这些更新效果,从而实现技能的自演化与泛化能力。

链接: https://arxiv.org/abs/2605.06614
作者: Siru Ouyang,Jun Yan,Yanfei Chen,Rujun Han,Zifeng Wang,Bhavana Dalvi Mishra,Rui Meng,Chun-Liang Li,Yizhu Jiao,Kaiwen Zha,Maohao Shen,Vishy Tirumalashetty,George Lee,Jiawei Han,Tomas Pfister,Chen-Yu Lee
机构: Google Cloud AI Research(谷歌云人工智能研究); University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); Massachusetts Institute of Technology(麻省理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 6 figures, 3 tables

点击查看摘要

Abstract:LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex long-term curation policies from indirect and delayed feedback. To tackle this challenge, we propose SkillOS, an experience-driven RL training recipe for learning skill curation in self-evolving agents. SkillOS pairs a frozen agent executor that retrieves and applies skills with a trainable skill curator that updates an external SkillRepo from accumulated experience. To provide learning signals for curation, we design composite rewards and train on grouped task streams based on skill-relevant task dependencies, where earlier trajectories update the SkillRepo, and later related tasks evaluate these updates. Across multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms memory-free and strong memory-based baselines in both effectiveness and efficiency, with the learned skill curator generalizing across different executor backbones and task domains. Further analyses show that the learned curator produces more targeted skill use, while the skills in SkillRepo evolve into more richly structured Markdown files that encode higher-level meta-skills over time.

[NLP-12] UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

【速读】: 该论文旨在解决自蒸馏(Self-distillation, SD)在自回归大语言模型(Autoregressive Large Language Models, LLMs)中应用时面临的监督信号不可靠、表征对齐不足及训练不稳定等问题。现有方法多聚焦于孤立的设计选择,缺乏对各组件有效性、作用机制及其交互关系的系统理解。解决方案的关键在于提出UniSD统一框架,集成多种互补机制:包括多教师一致性(multi-teacher agreement)以提升监督可靠性、EMA教师稳定化(EMA teacher stabilization)增强训练稳定性、token级对比学习(token-level contrastive learning)和特征匹配(feature matching)促进表征对齐,以及发散裁剪(divergence clipping)控制输出分布偏移。通过系统评估六种基准任务和三种模型家族,UniSD揭示了自蒸馏何时优于静态模仿(static imitation),并识别出关键组件及其协同效应,最终构建出UniSDfull集成管道,在不依赖更强外部教师的情况下实现最优性能提升(相比基线模型+5.4点,相比最强基线+2.8点)。

链接: https://arxiv.org/abs/2605.06597
作者: Yiqiao Jin,Yiyang Wang,Lucheng Fu,Yijia Xiao,Yinyi Luo,Haoxin Liu,B. Aditya Prakash,Josiah Hester,Jindong Wang,Srijan Kumar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 12 figures

点击查看摘要

Abstract:Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
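下面用 NumPy 给出摘要中两个组件(EMA 教师稳定化与发散裁剪)的最小示意。这只是依据摘要描述写的假设性草图,变量名与超参数均为本文虚构,并非论文实现:

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    # EMA 教师稳定化:教师参数缓慢跟随学生参数,提供更平滑的监督信号
    return {k: decay * teacher[k] + (1 - decay) * student[k] for k in teacher}

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def clipped_kl(student_logits, teacher_logits, clip=5.0):
    # 发散裁剪:对每个 token 的 KL(teacher || student) 设上限,
    # 防止不可靠的自生成监督主导损失
    p, q = softmax(teacher_logits), softmax(student_logits)
    kl = (p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(axis=-1)
    return np.minimum(kl, clip).mean()

rng = np.random.default_rng(0)
teacher = ema_update({"w": np.zeros(4)}, {"w": np.ones(4)}, decay=0.9)
loss = clipped_kl(rng.standard_normal((8, 16)), rng.standard_normal((8, 16)))
```

实际训练中教师与学生共享同一模型的不同参数副本,此处只展示更新与损失的形式。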

[NLP-13] Automated Clinical Report Generation for Remote Cognitive Remediation: Comparing Knowledge-Engineered Templates and LLMs in Low-Resource Settings

【速读】: 该论文旨在解决远程认知康复治疗中临床报告生成效率低下的问题,尤其是在缺乏参考报告的低资源环境下,如何自动化生成既符合临床可靠性又具备良好语言质量的报告。其解决方案的关键在于提出并对比两种方法:一是基于规则的模板系统,通过显式决策规则和专家验证的模板确保临床可靠性与可追溯性;二是零样本大语言模型(LLM)方法(GPT-4),以提升输出的流畅性和简洁性。两者均使用相同预提取的结构化变量,从而实现对事实准确性可控的多维比较,为生成式AI在医疗场景中的负责任应用提供实证依据和设计指导。

链接: https://arxiv.org/abs/2605.06594
作者: Yongxin Zhou,Fabien Ringeval,François Portet
机构: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The growing demand for cognitive remediation therapy, combined with limited speech therapist availability, has accelerated the adoption of remote rehabilitation tools. These systems generate large volumes of interaction data that are difficult for clinicians to review efficiently. This paper investigates automated clinical report generation for avatar-guided, home-based cognitive remediation sessions in a low-resource setting with no reference reports. We present and compare two approaches: (1) a rule-based template system encoding speech therapy domain knowledge as explicit decision rules and validated templates, ensuring clinical reliability and traceability; and (2) a zero-shot LLM-based approach (GPT-4) aimed at more fluent and concise output. Both systems use identical pre-extracted, expert-validated structured variables, enabling a controlled factual comparison. Outputs were evaluated by eight speech therapists and final-year students using a nine-criterion questionnaire. Results reveal a clear trade-off between clinical reliability and linguistic quality. The template-based system scored higher on fluidity, coherence, and results presentation, while GPT-4 produced more concise output. Directional differences are consistent across evaluation dimensions, though no comparison reached statistical significance after correction, reflecting the scale constraints of expert clinical evaluation. Based on evaluator feedback, we derive eight design recommendations for clinical reporting systems in remote rehabilitation settings. More broadly, this work contributes a replicable methodology combining expert elicitation, taxonomy-driven generation, and multi-dimensional human evaluation for clinical NLG in low-resource settings, and illustrates how controlled comparisons can inform the responsible adoption of generative AI in healthcare.
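基于规则的模板系统可用如下极简 Python 草图说明:每条句子由一条显式决策规则门控,因此输出完全可追溯。其中的结构化变量与模板文本均为本文虚构示例,并非论文的专家验证模板:

```python
# 假设性的结构化变量与模板;论文中的专家验证模板并未公开
RULES = [
    (lambda v: v["accuracy"] >= 0.8,
     "The patient completed the exercises with high accuracy ({accuracy:.0%})."),
    (lambda v: v["accuracy"] < 0.8,
     "The patient had difficulty, with an accuracy of {accuracy:.0%}."),
    (lambda v: v["sessions_missed"] > 0,
     "{sessions_missed} session(s) were missed this week."),
]

def generate_report(variables):
    # 每条句子都由一条显式规则触发,临床可靠性来自规则与模板的可审查性
    return " ".join(tpl.format(**variables) for cond, tpl in RULES if cond(variables))

report = generate_report({"accuracy": 0.85, "sessions_missed": 1})
```

与零样本 LLM 方案相比,这类系统牺牲了流畅性换取可追溯性,正对应论文观察到的权衡。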

[NLP-14] PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

【速读】: 该论文旨在解决音频数据中离散符号化(audio tokenization)的优化问题,现有方法如量化、聚类或编解码重建通常仅在局部层面分配token,导致序列一致性、紧凑性、长度控制、终止标记和编辑相似性等关键属性难以直接优化。其解决方案的核心是提出PairAlign框架,将tokenization建模为条件序列生成任务:通过编码器将语音映射为连续条件,再由自回归解码器从起始符(BOS)生成token,从而联合学习token身份、顺序、长度及结束符(EOS)位置。PairAlign利用两个内容保持视图之间的**序列级自我对齐(sequence-level self-alignment)**机制,使每个视图的序列在另一视图表示下具有高概率,同时无关样本提供竞争序列,形成可扩展的编辑距离保留代理目标,并抑制多对一坍塌。该方法从VQ-style初始tokenization出发,结合EMA教师目标、交叉配对教师强制、前缀扰动、似然对比和长度控制等技术,实现了更紧凑、非退化的音频符号序列,在TIMIT检索任务中减少55%的归档token数并保持编辑距离搜索性能。

链接: https://arxiv.org/abs/2605.06582
作者: Adhiraj Banerjee,Vipul Arora
机构: Indian Institute of Technology, Kanpur (印度理工学院坎普尔分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD)
备注: 101 pages, 7 Figures, pre-print, Under Review

点击查看摘要

Abstract:Many operations on sensory data – comparison, memory, retrieval, and reasoning – are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view’s sequence is trained to be likely under the other’s representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On TIMIT retrieval, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent. 
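PairAlign 的序列级目标是编辑距离保持的一个代理。作为参照,其希望保持的 token 级编辑距离(Levenshtein 距离)可以用标准动态规划计算(通用实现,非论文代码):

```python
def edit_distance(a, b):
    # 两个 token 序列之间的 Levenshtein 距离:
    # 内容保持视图的 token 序列应当在该度量下保持接近
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # 删除
                          d[i][j - 1] + 1,        # 插入
                          d[i - 1][j - 1] + cost) # 替换
    return d[m][n]

dist = edit_distance([3, 7, 7, 2], [3, 7, 2])   # 删除一个 token,距离为 1
```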

[NLP-15] Long Context Pre-Training with Lighthouse Attention

【速读】: 该论文旨在解决在极端序列长度下训练因果Transformer时,由于缩放点积注意力(Scaled Dot-Product Attention, SDPA)的二次时间与内存复杂度所导致的计算瓶颈问题。其解决方案的关键在于提出一种仅用于训练阶段的对称选择式分层注意力机制——Lighthouse Attention,该方法通过三个核心创新实现高效训练:(i) 一种亚二次的分层预处理与后处理步骤,实现序列的自适应压缩与解压缩;(ii) 一种对称压缩策略,在保持左到右因果性的前提下同时池化查询(Query)、键(Key)和值(Value),显著提升并行性;(iii) 两阶段训练范式,即大部分训练时间使用Lighthouse Attention进行预训练,最后以短时微调恢复完整注意力结构。该方法无需复杂反向传播核,且可轻松移除,实验证明其能在相同设置下实现更快的总训练速度和更低的最终损失。

链接: https://arxiv.org/abs/2605.06554
作者: Bowen Peng,Subho Ghosh,Jeffrey Quesnelle
机构: Nous Research (Nous Research)
类目: Computation and Language (cs.CL)
备注: 18 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only symmetrical selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed towards the end of the training. Our hierarchical selection is also gradient-free, which exempts us from dealing with a complicated and potentially inefficient backward pass kernel. Our contribution is three-fold: (i) A subquadratic hierarchical pre- and post-processing step that does adaptive compression and decompression of the sequence. (ii) A symmetrical compression strategy that pools queries, keys and values at the same time, while preserving left-to-right causality, which greatly improves parallelism. (iii) A two stage training approach which we pre-train for the majority of the time with Lighthouse Attention and recover a full attention model at the end with a short training. We run preliminary small scale LLM pre-training experiments that show the effectiveness of our method compared to full attention training with all other settings matched, where we achieve a faster total training time and lower final loss after the recovery phase. Full code is available at: this https URL
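摘要中的对称压缩思想可以用一个高度简化的草图说明:对 Q/K/V 同时做固定块平均池化,再在变短的序列上做普通因果 SDPA。论文的自适应分层选择要复杂得多,此处仅示意二次方开销如何随块大小下降(块大小为 b 时约缩小 b² 倍):

```python
import numpy as np

def pooled_causal_attention(q, k, v, block=4):
    # 玩具版对称压缩:对 Q、K、V 同时按固定块做平均池化,
    # 然后在长度 T/block 的序列上做普通因果注意力
    T, d = q.shape
    nb = T // block
    pool = lambda x: x[: nb * block].reshape(nb, block, d).mean(axis=1)
    qp, kp, vp = pool(q), pool(k), pool(v)
    scores = qp @ kp.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((nb, nb), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ vp   # 形状 (T/block, d),因果性在块粒度上保持

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out = pooled_causal_attention(q, k, v, block=4)
```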

[NLP-16] Continuous Latent Diffusion Language Model

【速读】: 该论文旨在解决传统自回归语言模型(Autoregressive Language Models)在文本生成中面临的效率瓶颈与全局语义建模不足的问题,尤其是如何在保持生成质量的同时实现更高效的非自回归生成、可扩展的表征学习以及有效的全局语义建模。其解决方案的关键在于提出一种分层潜扩散语言模型(Cola DLM),该模型通过三层结构实现:首先使用文本变分自编码器(Text VAE)建立稳定的文本到潜空间映射;其次在连续潜空间中利用块因果扩散 Transformer(block-causal DiT)建模全局语义先验;最后通过条件解码生成文本。从统一马尔可夫路径视角看,该方法的本质是潜空间先验传输而非词元级观测恢复,从而将全局语义组织与局部文本实现解耦,形成更具灵活性的非自回归归纳偏置,并支持连续空间中的语义压缩与先验拟合,为跨离散文本与连续模态的统一建模提供可行路径。

链接: https://arxiv.org/abs/2605.06548
作者: Hongcan Guo,Qinyu Zhao,Yian Zhao,Shen Nie,Rui Zhu,Qiushan Guo,Feng Wang,Tao Yang,Hengshuang Zhao,Guoqiang Wei,Yan Zeng
机构: ByteDance(字节跳动)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 99 pages, 31 figures, 9 tables. Project page: this https URL

点击查看摘要

Abstract:Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.

[NLP-17] Efficient Pre-Training with Token Superposition

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)预训练过程中数据吞吐量低、计算效率差的问题,尤其是在高规模场景下,传统方法往往需要复杂的并行策略或架构修改才能提升性能。其解决方案的关键在于提出一种无需改动并行策略、优化器、分词器、数据或模型结构的“Token-Superposition Training”(TST)方法:该方法分为两个阶段——首先通过高效超位置(superposition)阶段将多个连续token合并为一个集合,并采用多热交叉熵(multi-hot cross-entropy, MCE)损失进行训练;随后进入恢复阶段,回归标准训练流程。实验表明,TST在不同参数规模(从270M到10B)和模型架构(包括专家混合模型A1B)中均表现出强鲁棒性,且在同等损失条件下可将10B规模模型的预训练时间缩短至原来的40%(即提速2.5倍)。

链接: https://arxiv.org/abs/2605.06546
作者: Bowen Peng,Théo Gigant,Jeffrey Quesnelle
机构: 未知
类目: Computation and Language (cs.CL)
备注: 25 pages, 11 figures, 28 tables

点击查看摘要

Abstract:Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
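摘要中的多热交叉熵(MCE)目标可以用 NumPy 给出单个位置的最小示意。此处假设袋内 token 取等权目标分布,这是本文为说明而做的简化,非论文实现细节:

```python
import numpy as np

def multi_hot_ce(logits, bag, vocab):
    # 超位置阶段:一个位置同时预测一“袋”连续 token,
    # 目标是袋内 token 上的均匀分布(多热目标归一化)
    target = np.zeros(vocab)
    target[list(bag)] = 1.0
    target /= target.sum()
    m = logits.max()
    logp = logits - m - np.log(np.exp(logits - m).sum())   # log-softmax
    return -(target * logp).sum()

# 当袋 {1, 2} 各分得约一半概率质量时,损失接近 log 2
loss = multi_hot_ce(np.array([0.0, 10.0, 10.0, 0.0]), bag={1, 2}, vocab=4)
```

恢复阶段则回到普通单热交叉熵,即 bag 退化为单个下一 token。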

[NLP-18] STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在长期个性化记忆中缺乏动态更新能力的问题,尤其是当新证据出现时无法识别并修正过时信念的“隐式冲突”(Implicit Conflict)现象。现有基准主要评估静态事实检索,忽视了模型在无显式否定情况下基于上下文推理和常识判断以更新记忆的能力。解决方案的关键在于提出一个三维度的评测框架(State Resolution、Premise Resistance 和 Implicit Policy Adaptation),用于系统性地检验模型对状态变化的感知、抵抗错误前提以及行为层面的适应能力,并在此基础上设计了CUPMem原型,通过结构化的状态整合与传播感知搜索机制强化写入时的修订逻辑,从而推动具备状态意识的记忆系统发展。

链接: https://arxiv.org/abs/2605.06527
作者: Hanxiang Chao,Yihan Bai,Rui Sheng,Tianle Li,Yushi Sun
机构: Wuhan University(武汉大学); The Chinese University of Hong Kong(香港中文大学); The Hong Kong University of Science and Technology(香港科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user’s query, and they struggle to recognize when a change in one aspect of the user’s state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.

[NLP-19] The Frequency Confound in Language-Model Surprisal and Metaphor Novelty

【速读】: 该论文旨在解决语言模型(Language Model, LM)中词频与上下文可预测性(即 surprisal)对隐喻新颖性(metaphor novelty)判断的影响机制问题。研究发现,尽管传统观点认为 surprisal 可作为语境可预测性的代理变量并与其新颖性相关,但实际分析表明,词频才是更强大的预测因子;尤其在不同训练阶段中,surprisal 与新颖性的关联强度随训练进展先升后降,且与 surprisal 和词频之间关系的增强同步,说明早期模型可能误将高频词的低 surprisal 当作“高可预测性”,从而混淆了处理难度与新颖性之间的因果关系。关键解决方案在于通过系统比较多种词频度量和多个 Pythia 模型规模及训练检查点,揭示出词频才是影响隐喻新颖性判断的核心因素,而非单纯依赖 surprisal。

链接: https://arxiv.org/abs/2605.06506
作者: Omar Momen,Sina Zarrieß
机构: Bielefeld University (比勒费尔德大学)
类目: Computation and Language (cs.CL)
备注: to be presented and published at the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)

点击查看摘要

Abstract:Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor novelty ratings using two different word frequency measures. We analyse surprisal estimates from eight Pythia model sizes and 154 training checkpoints. Across settings, word frequency is a stronger predictor of metaphor novelty than surprisal. Across training stages, the surprisal–novelty association peaks at an early stage and then falls again, mirroring a similarly timed increase in the surprisal–frequency association. These results suggest that the often-reported optimal LM surprisal settings may incorrectly associate contextual predictability with metaphor novelty and processing difficulty, whereas lexical frequency may be the major underlying factor.
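surprisal 与词频的纠缠可以用一个玩具模拟直观说明:若上下文概率大体由词频决定,则 surprisal 与负对数词频高度相关,任何与 surprisal 的关联都可能只是词频效应。以下为合成数据示意,并非论文实验:

```python
import numpy as np

def surprisal(p):
    # surprisal(比特):语言模型对词在上下文中概率的负对数
    return -np.log2(p)

# 合成一个“频率主导”的场景:越罕见的词(负对数词频越大)surprisal 越高
rng = np.random.default_rng(1)
neg_log_freq = rng.uniform(5, 20, size=200)                  # 类 Zipf 的展布
p_in_context = 2.0 ** (-(neg_log_freq + rng.normal(0, 1, 200)))
s = surprisal(p_in_context)
r = np.corrcoef(neg_log_freq, s)[0, 1]                       # 两预测因子强相关
```

在这种设定下,用 surprisal 预测任何与频率相关的变量(如隐喻新颖性)都需要控制词频。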

[NLP-20] Cubit: Token Mixer with Kernel Ridge Regression

【速读】: 该论文旨在解决传统Transformer架构中注意力机制(Attention Mechanism)的数学基础较弱、且在长序列建模能力上存在局限的问题。其解决方案的关键在于将注意力模块重新诠释为Nadaraya-Watson回归(Nadaraya-Watson Regression),并进一步提出基于核岭回归(Kernel Ridge Regression, KRR)的新架构Cubit,通过引入KRR的闭式解来替代原生注意力中的相似度加权聚合方式,同时结合核矩阵逆的归一化机制增强稳定性,并设计有限范围重缩放(Limited-Range Rescale, LRR)策略提升训练稳定性。这一改进使得Cubit具备更强的理论基础和潜在的长序列建模优势。

链接: https://arxiv.org/abs/2605.06501
作者: Chuanyang Zheng,Jiankai Sun,Yihang Gao,Yuehao Wang,Liangchen Tan,Mac Schwager,Anderson Schneider,Yuriy Nevmyvaka,Xiaodong Liu
机构: Stanford University (斯坦福大学); Google (谷歌); Meta (Meta)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Tech Report

点击查看摘要

Abstract:Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresponding values accordingly. Motivated by this perspective, we propose Cubit, a potential next-generation architecture that leverages Kernel Ridge Regression (KRR), while the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix. To improve the training stability, we further propose the Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. We argue that Cubit, as a KRR-based architecture, provides a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression. We validate this claim through comprehensive experiments. The experimental results suggest that Cubit may exhibit stronger long-sequence modeling capability. In particular, its performance gain over the Transformer appears to increase as the training sequence length grows.
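原生注意力与 Cubit 的核心差别可以用两个闭式解对比:前者是行归一化的 Nadaraya-Watson 回归,后者用 (G + λI) 的逆替代行归一化。以下为未加因果掩码的极简草图,核的形式与正则参数均为示意:

```python
import numpy as np

def nw_attention(S, V):
    # 原生注意力 = Nadaraya-Watson 回归:相似度加权平均(行归一化)
    W = np.exp(S)
    return (W / W.sum(axis=-1, keepdims=True)) @ V

def krr_attention(S, V, lam=1.0):
    # Cubit 式混合:核岭回归闭式解 G (G + λI)^{-1} V
    G = np.exp(S)
    return G @ np.linalg.solve(G + lam * np.eye(G.shape[0]), V)

rng = np.random.default_rng(0)
S = rng.standard_normal((6, 6)) * 0.5
V = rng.standard_normal((6, 3))
out_nw, out_krr = nw_attention(S, V), krr_attention(S, V)
```

NW 对常数值完全不变(行权重和为 1),而 KRR 在 λ > 0 时会产生收缩,这正是两种回归观的区别所在。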

[NLP-21] Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在个人计算设备上部署时面临的高计算资源需求问题,即当前标准推理依赖昂贵的数据中心GPU或云API,导致超过十亿台个人电脑无法有效利用其算力进行AI任务。解决方案的关键在于设计专用的SIMD(单指令多数据流)内核,将传统的浮点矩阵乘法替换为基于整数点积指令的加减运算,从而充分利用三值化模型(ternary models)中权重仅取-1、0、+1的稀疏结构,实现高效推理。该方法通过Litespark-Inference框架实现,与Hugging Face无缝集成,在Apple Silicon等主流CPU平台上显著提升性能并降低内存占用。

链接: https://arxiv.org/abs/2605.06485
作者: Nii Osae Osae Dade,Tony Morri,Moinul Hossain Rahat,Sayandip Pal
机构: Mindbeam AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to -1, 0, +1, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging-Face, achieving 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction compared to standard PyTorch inference on Apple Silicon, with similar speedups on Intel and AMD processors.
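三值权重下“矩阵乘法退化为加减”这一点可以直接验证:每个输出分量等于权重为 +1 的输入之和减去权重为 -1 的输入之和。以下为纯 NumPy 模拟,并非 Litespark 的 SIMD 内核本身:

```python
import numpy as np

def ternary_matvec(W, x):
    # 权重只取 {-1, 0, +1}:每个输出 = 选中输入之和 − 另一组输入之和,
    # 对应正文所述用整数加减替代浮点乘法的内核思路
    return np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))    # 三值权重矩阵
x = rng.standard_normal(8)
y = ternary_matvec(W, x)                # 与稠密矩阵乘一致,但不含乘法
```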

[NLP-22] Patch-Effect Graph Kernels for LLM Interpretability

【速读】: 该论文旨在解决机制可解释性(mechanistic interpretability)在大规模激活修补(activation patching)实验中产生的高维、非结构化数据难以系统比较的问题。其核心挑战在于如何从不同提示(prompt)和任务家族中提取出具有判别性的结构信号,并区分局部因果电路与强基线效应。解决方案的关键在于将激活修补结果重构为修补效应图(patch-effect graphs),通过三种图构建方法(直接因果影响、偏相关性和共影响)捕捉模型组件间的结构性关系,并利用图核(graph kernels)进行分析。研究表明,局部边槽特征比全局图形状描述符更具分类准确性,且筛选后的配对修补验证表明,所选边对应更强的激活影响效应,从而实现了对因果电路证据的压缩与控制基线下的严格评估。

链接: https://arxiv.org/abs/2605.06480
作者: Ruben Fernandez-Boullon,David N. Olivieri
机构: University of Vigo (维戈大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We introduce three graph-construction methods: direct-influence via causal mediation, partial-correlation, and co-influence and apply graph kernels to analyze the resulting structures. Evaluating this approach on GPT-2 Small using Indirect Object Identification (IOI) and related tasks, we find that patch-effect graphs preserve discriminative structural signals. Specifically, localized edge-slot features provide higher classification accuracy than global graph-shape descriptors. A screened paired-patching validation suggests that CI and PC selected candidate edges correspond to stronger activation-influence effects than random or low-rank candidates. Crucially, by evaluating these representations against rigorous prompt-only and raw patch-effect controls, we make the evidential scope of the benchmark explicit: graph features compress structured patching signal, while raw tensors and surface cues define strong baselines that any circuit-level claim should address. Ultimately, our framework provides a compression and evaluation pipeline for comparing patching-derived structures under controlled baselines, separating robust slice-discriminative evidence from stronger task-general causal-circuit claims.
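修补效应图的构建与“边槽”特征上的线性核可以用如下玩具示意。阈值与效应矩阵均为虚构,论文中 CI/PC 的实际构建(因果中介、偏相关)远更复杂:

```python
import numpy as np

def patch_effect_graph(effects, tau=0.5):
    # 将 (组件 × 组件) 的修补效应矩阵按阈值离散为有向图邻接矩阵
    return (np.abs(effects) > tau).astype(int)

def edge_slot_kernel(A, B):
    # 边槽特征上的线性核:按两图共享哪些具体 (源, 目标) 边来比较
    return int((A.flatten() * B.flatten()).sum())

A = patch_effect_graph(np.array([[0.0, 0.9], [0.1, 0.0]]))
B = patch_effect_graph(np.array([[0.0, 0.8], [0.7, 0.0]]))
k = edge_slot_kernel(A, B)   # 两图只共享一条边
```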

[NLP-23] Towards Emotion Consistency Analysis of Large Language Models in Emotional Conversational Contexts

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在情绪驱动的对话场景中对其自身生成内容的一致性问题,即模型是否能够识别并纠正自身输出中包含的错误信念或虚假前提。其核心解决方案在于设计一种自引用测试框架:将LLM生成的文本作为查询重新输入同一模型,评估其后续响应的一致性和准确性。关键发现表明,模型在面对含极端和中等情绪强度的虚假前提时表现低于平均水平,尤其在中等情绪情境下尤为脆弱;进一步基于注意力分数的分析揭示了模型从评价性处理向生成性处理的注意力转移趋势,凸显了其对内部逻辑一致性维护能力的不足。这一结果对LLMs在高风险、情绪敏感场景中的部署具有重要警示意义。

链接: https://arxiv.org/abs/2605.06476
作者: Sneha Oram,Ojaswita Bhushan,Pushpak Bhattacharyya
机构: Indian Institute of Technology Bombay, India
类目: Computation and Language (cs.CL)
备注: Under-review

点击查看摘要

Abstract:In this work, we conduct an analysis to examine the consistency of Large Language Models (LLMs) with respect to their own generated responses in an emotionally-driven conversational context. Specifically, the text generated by LLM is framed as a query to the same model, and its responses are subsequently assessed. This is performed with three queries across two dimensions of extreme and moderate emotions. The three queries are, in particular, false claim queries that contain inherently wrong assumptions (false presuppositions) in increasing order of intensity. Two commercial models, Claude-3.5-haiku, GPT4o-mini, and a medium-sized model, Mistral-7B, are considered in the study. Our findings indicate that LLMs exhibit below-average performance and remain vulnerable to false beliefs embedded within queries. This susceptibility is especially pronounced for moderate emotional content. Furthermore, an extended attention-score-based analysis highlights a shift in models’ priority from evaluative to generative. The results raise important considerations for LLMs’ deployment in high-stakes, emotionally sensitive contexts.

[NLP-24] Invariant Features in Language Models: Geometric Characterization and Model Attribution

【速读】: 该论文旨在解决语言模型对改写(paraphrasing)具有强鲁棒性这一现象背后的机制问题,即语义信息如何通过稳定的内部表示实现不变性(invariance),以及这种不变性的结构来源尚不明确的问题。解决方案的关键在于提出一个局部几何框架(local geometric framework),将语义等价输入在潜在空间中视为占据结构化的区域:沿“干扰方向”(nuisance directions)发生改写变化,而语义一致性则保留在不变子空间(invariant subspaces)中。基于此框架,论文进一步提出了三个核心贡献:(1) 对不变潜特征的几何刻画,(2) 一种对比子空间发现方法以分离语义变化与语义保持的变异,(3) 利用不变表示实现零样本模型归因(zero-shot model attribution)。实证结果表明,不变结构出现在特定深度区域,语义位移主要位于干扰子空间之外,且在表示层面的干预实验验证了不变成分对模型输出的因果作用,从而揭示了语义不变性可被视为潜表示的局部几何属性,为理解语言模型如何组织意义提供了原则性视角。

链接: https://arxiv.org/abs/2605.06458
作者: Agnibh Dasgupta,Abdullah Tanvir,Xin Zhong
机构: University of Nebraska Omaha(内布拉斯加大学奥马哈分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models exhibit strong robustness to paraphrasing, suggesting that semantic information may be encoded through stable internal representations, yet the structure and origin of such invariance remain unclear. We propose a local geometric framework in which semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces. Building on this view, we make three contributions: (1) a geometric characterization of invariant latent features, (2) a contrastive subspace discovery method that separates semantic-changing from semantic-preserving variation, and (3) an application of invariant representations to zero-shot model attribution. Across models and layers, empirical results support these contributions. Invariant structure emerges in specific depth regions, semantic displacement lies largely outside the nuisance subspace, and representation-level interventions indicate a causal role of invariant components in model outputs. Invariant representations also capture model-specific geometric patterns, enabling accurate attribution. These findings suggest that semantic invariance can be viewed as a local geometric property of latent representations, offering a principled perspective on how language models organize meaning.
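对比子空间发现的思路可以这样草拟:把成对改写嵌入的差向量做 SVD,取前几个右奇异向量作为“干扰方向”,再从表示中投影掉以保留不变成分。以下为合成数据示意,不代表论文方法的全部细节:

```python
import numpy as np

def nuisance_subspace(emb_a, emb_b, rank=2):
    # 改写变化的方向 = 成对嵌入差向量的前 rank 个右奇异向量
    diffs = emb_a - emb_b
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:rank]                      # (rank, d) 的正交干扰基

def project_out(x, basis):
    # 去掉干扰方向,保留不变(语义)成分
    return x - (x @ basis.T) @ basis

rng = np.random.default_rng(0)
d, n = 8, 50
nuis = rng.standard_normal((2, d))        # 真实的二维干扰子空间
emb_a = rng.standard_normal((n, d))
emb_b = emb_a + rng.standard_normal((n, 2)) @ nuis   # 改写只沿干扰方向变化
B = nuisance_subspace(emb_a, emb_b, rank=2)
x_inv = project_out(emb_a[0], B)
```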

[NLP-25] COVID-19 Infodemic. Understanding content features in detecting fake news using a machine learning approach

【速读】: 该论文旨在解决虚假新闻检测中内容特征(尤其是文本和语言学特征)研究不足的问题,以提升检测准确性。其解决方案的关键在于系统性地提取并验证了包括词二元组(word bigrams)和词性分布(part of speech distribution)在内的多种文本与语言学特征,并在新冠疫情期间收集的新数据集上采用决策树、K近邻、逻辑回归、支持向量机(Support Vector Machine, SVM)和随机森林(Random Forest)等传统机器学习算法进行实验。结果表明,随机森林表现最优,且单独使用文本或语言学特征均能有效提升检测效果,但二者融合并未显著增强性能;此外,词二元组与词性标记在检测效能上存在差异,证明了传统机器学习方法在虚假新闻识别中具有可行性与有效性。

链接: https://arxiv.org/abs/2605.06435
作者: Balakrishnan Vimala,Hii Lee Zing,Laporte Eric
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The use of content features, particularly textual and linguistic for fake news detection is under-researched, despite empirical evidence showing the features could contribute to differentiating real and fake news. To this end, this study investigates a selection of content features such as word bigrams, part of speech distribution etc. to improve fake news detection. We performed a series of experiments on a new dataset gathered during the COVID-19 pandemic and using Decision Tree, K-Nearest Neighbor, Logistic Regression, Support Vector Machine and Random Forest. Random Forest yielded the best results, followed closely by Support Vector Machine, across all setups. In general, both the textual and linguistic features were found to improve fake news detection when used separately, however, combining them into a single model did not improve the detection significantly. Differences were also noted between the use of bigrams and part of speech tags. The study shows that textual and linguistic features can be used successfully in detecting fake news using the traditional machine learning approach as opposed to deep learning.
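论文所用的文本特征(如词二元组)可以用标准库草拟其提取过程,得到的向量即可送入随机森林、SVM 等传统分类器。词表与例句为本文虚构:

```python
from collections import Counter

def word_bigrams(text):
    # 文本特征之一:相邻词对(词二元组)计数
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def bigram_vector(text, vocabulary):
    # 将文档映射到固定二元组词表上,供传统机器学习分类器使用
    counts = word_bigrams(text)
    return [counts[b] for b in vocabulary]

vocab = [("covid", "vaccine"), ("miracle", "cure")]
vec = bigram_vector("Miracle cure claims about the covid vaccine spread fast", vocab)
```

词性分布特征同理:对每篇文档统计各词性标签的占比后拼接为向量即可。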

[NLP-26] From 124 Million Tokens to 1021 Neologisms: A Large-Scale Pipeline for Automatic Neologism Detection LREC COLING2026

【速读】: 该论文旨在解决大规模语料中自动检测新词(neologism)的难题,尤其是在社交媒体文本中识别真实语言创新的问题。其解决方案的关键在于构建一个可扩展、模块化的流水线,结合基于规则的过滤与大语言模型(LLM)分类,并依托语法形态学和非语法形态学两个互补框架来界定新词范围,进而实现四类分类(新词、实体、外来词、无意义)。该方法通过在5.27亿条英语Reddit帖子中提取1.246亿个唯一词元并压缩至1,021个候选词,显著降低人工标注负担,同时利用多个LLM的多数投票机制提升分类可靠性,最终通过专家人工验证确认599个(58.7%)为真实的新词,验证了该方案的有效性与实用性。

链接: https://arxiv.org/abs/2605.06426
作者: Diego Rossini,Lonneke van der Plas
机构: 未知
类目: Computation and Language (cs.CL)
备注: 14 pages, 5 tables. Accepted at NeoLLM 2026 Workshop, co-located with LREC-COLING 2026

点击查看摘要

Abstract:We present a scalable, modular pipeline for automatic neologism detection that combines rule-based filtering with LLM classification. The pipeline is grounded in two complementary word-formation frameworks, grammatical and extra-grammatical morphology, which jointly define the scope of what counts as a neologism and inform a four-class classification scheme (neologism, entity, foreign, none). While designed to be modular and transferable at the architectural level, the pipeline is instantiated on 527 million English-language Reddit posts spanning 2005-2024. From this corpus, we extract 124.6 million unique tokens and reduce them by over 99.99% to yield 1,021 neologism candidates, a set small enough for manual expert verification. Multiple LLMs independently classify each candidate via majority vote, with a final verification step, revealing substantial cross-model disagreement and highlighting the challenge of operationalizing neologism detection at scale. Manual annotation of all 1,021 candidates confirms that 599 (58.7%) are genuine lexical innovations. The pipeline code, vocabulary compilation scripts, and the annotated candidate list are available at this https URL.
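流水线中的规则过滤阶段可以用如下草图说明:在任何 LLM 分类之前,先用词典、正则与长度规则大幅削减候选集。词典内容与规则阈值均为本文虚构示例:

```python
import re

KNOWN = {"the", "internet", "because"}   # 代替真实词典的占位集合

def candidate_filter(tokens, lexicon=KNOWN, min_len=4):
    # 基于规则的过滤:剔除已知词、重复词、过短词与非纯字母串,
    # 把送入 LLM 分类的候选量压到可人工核验的规模
    seen, out = set(), []
    for t in tokens:
        t = t.lower()
        if t in lexicon or t in seen:
            continue
        if len(t) < min_len or not re.fullmatch(r"[a-z]+", t):
            continue
        seen.add(t)
        out.append(t)
    return out

cands = candidate_filter(["the", "Internet", "enshittification", "lol2",
                          "enshittification"])
```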

[NLP-27] MiA-Signature: Approximating Global Activation for Long-Context Understanding

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长上下文时,如何高效地捕捉和利用全局激活模式(global activation pattern)以提升下游任务性能的问题。现有研究表明,报告性意识访问与分布式记忆系统中的全局点火(global ignition)相关,但个体无法直接访问或枚举所有激活内容,这暗示认知可能依赖于一种近似全局激活影响的紧凑表示。为此,作者提出Mindscape Activation Signature (MiA-Signature),即通过子模性(submodular)选择高阶概念来压缩查询引发的全局激活模式,并可选地借助工作记忆进行轻量级迭代优化。其核心创新在于:将复杂的全激活状态转化为一个计算上可行的条件信号,从而在保留关键信息的同时显著降低计算复杂度,且在检索增强生成(RAG)和代理系统(agentic systems)中均实现了多任务性能提升。

链接: https://arxiv.org/abs/2605.06416
作者: Yuqing Li,Jiangnan Li,Mo Yu,Zheng Lin,Weiping Wang,Jie Zhou
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Pattern Recognition Center, WeChat AI, Tencent (腾讯微信人工智能研究院模式识别中心); Hunyuan Team, Tencent (腾讯混元团队)
类目: Computation and Language (cs.CL)
备注: This is a work in progress; we will continue to revise and improve the manuscript

点击查看摘要

Abstract:A growing body of work in cognitive science suggests that reportable conscious access is associated with *global ignition* over distributed memory systems, while such activation is only partially accessible as individuals cannot directly access or enumerate all activated contents. This tension suggests a plausible mechanism that cognition may rely on a compact representation that approximates the global influence of activation on downstream processing. Inspired by this idea, we introduce the concept of **Mindscape Activation Signature (MiA-Signature)**, a compressed representation of the global activation pattern induced by a query. In LLM systems, this is instantiated via submodular-based selection of high-level concepts that cover the activated context space, optionally refined through lightweight iterative updates using working memory. The resulting MiA-Signature serves as a conditioning signal that approximates the effect of the full activation state while remaining computationally tractable. Integrating MiA-Signatures into both RAG and agentic systems yields consistent performance gains across multiple long-context understanding tasks.
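摘要中的核心操作是"基于次模性选择覆盖激活上下文空间的高层概念"。经典的次模最大化可用贪心算法近似求解;下面给出一个以最大覆盖(max coverage,典型次模目标)为例的 Python 草图,每步选取边际覆盖增益最大的概念。论文的实际目标函数与概念表示未在摘要中给出,此处的集合覆盖形式纯属示意。

```python
def greedy_concept_selection(concept_coverage, k):
    """贪心次模选择:每步挑选对已覆盖上下文增益最大的概念。
    concept_coverage: {概念: 其覆盖的上下文单元集合}"""
    selected, covered = [], set()
    for _ in range(min(k, len(concept_coverage))):
        best, best_gain = None, 0
        for c, units in concept_coverage.items():
            if c in selected:
                continue
            gain = len(units - covered)  # 边际增益
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # 无正增益则提前停止
            break
        selected.append(best)
        covered |= concept_coverage[best]
    return selected, covered

# 假设的概念-上下文覆盖关系,仅用于演示
coverage = {
    "plot": {1, 2, 3, 4},
    "character": {3, 4, 5},
    "setting": {6},
    "tone": {1, 2},
}
sig, covered = greedy_concept_selection(coverage, k=2)
```

贪心先取覆盖最广的 "plot",再取边际增益最大的 "character",由少数概念近似全局激活模式,正是 MiA-Signature"紧凑表示"思路的一个直观类比。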

[NLP-28] E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology

【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)模型在训练过程中出现“死专家”(dead experts)的问题,即部分专家始终未被激活,导致资源浪费和模型性能下降。解决方案的关键在于提出一个无量纲控制参数 $ E = T \cdot H / (O + B) $,该参数将路由温度 $ T $、路由熵权重 $ H $、Oracle 权重 $ O $ 和平衡权重 $ B $ 综合为单一指标,并通过12组受控实验(涵盖视觉与语言任务共超过11,000个训练周期)证明:当 $ E = 0.5 $ 时即可保证零死专家现象,从而无需依赖手工设计的负载均衡辅助损失函数。此参数可作为MoE训练的统一诊断工具,类比流体力学中的雷诺数(Reynolds number),实现对专家生态健康状态的定量预测与调控。

链接: https://arxiv.org/abs/2605.06415
作者: Qingjun Zhang
机构: Wuxi Taihu University (无锡太湖大学); School of Integrated Circuits (集成电路学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 experiments, 11,000+ training epochs, cross-modal validation (vision + language). Extended version of the Claude-in-the-Loop ecology framework

点击查看摘要

Abstract:We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters – routing temperature T, routing entropy weight H, oracle weight O, and balance weight B – into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E = 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate – triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.
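该控制参数本身只是四个超参数的简单组合,可直接按公式计算。下面的 Python 草图据摘要结论(E = 0.5 即足以保证零死专家)给出一个示意性的健康判据;具体超参数取值为假设示例,且论文同时指出任务复杂度会移动临界阈值,阈值不应视为普适常数。

```python
def ecology_parameter(T, H, O, B):
    """计算无量纲控制参数 E = T*H/(O+B)。
    T: 路由温度, H: 路由熵权重, O: oracle 权重, B: 平衡权重。"""
    return T * H / (O + B)

def predicts_healthy(T, H, O, B, threshold=0.5):
    """按摘要结论,E 达到 0.5 预测不出现死专家(阈值随任务复杂度可变)。"""
    return ecology_parameter(T, H, O, B) >= threshold

# 假设的一组超参数:E = 1*1/(1+1) = 0.5,恰好处于临界值
E = ecology_parameter(T=1.0, H=1.0, O=1.0, B=1.0)
```

与流体力学中雷诺数的类比即在于此:训练前只需代入超参数算出单一数值,即可预测专家生态的定性行为。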

[NLP-29] SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following

【速读】: 该论文旨在解决当前大语言模型在长多轮对话中难以持续遵循用户指令的问题,尤其在指令随对话进展被修改、补充或替换时表现不佳。现有评估基准多集中于单轮或短多轮场景,无法全面衡量模型在复杂、动态交互中的约束遵守能力。解决方案的关键在于提出 SEQUOR——一个自动化的基准测试框架,通过从真实对话中提取约束并构建受控的、基于角色的多轮模拟交互,系统性地评估模型在不同长度和约束复杂度下的指令遵循准确性。实验表明,随着对话轮次增加和约束数量增多,模型性能显著下降,凸显了当前模型在长期任务一致性上的局限性。

链接: https://arxiv.org/abs/2605.06353
作者: Beatriz Canaverde,Duarte M. Alves,José Pombal,Giuseppe Attanasio,André F. T. Martins
机构: Instituto de Telecomunicações; Instituto Superior Técnico, Universidade de Lisboa; Sword Health; TransPerfect; ELLIS Unit Lisbon
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions built with constraints extracted from real-world conversations. Our results show that even when following a single constraint, instruction-following accuracy consistently decreases as the conversation grows longer, with drops exceeding 11%. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over 40%. In scenarios where constraints are added or replaced at arbitrary points of the conversation, model accuracy decreases by more than 9%. Taken together, our results reveal that current models still struggle to follow user instructions in multi-turn conversations, and provide a way for better measuring instruction-following capabilities in assistants.

[NLP-30] Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

【速读】: 该论文旨在解决模型级联(model cascade)在部署阶段如何有效权衡成本与质量的问题,尤其是现有方法将决策阈值视为经验超参数、缺乏对成本-质量前沿几何结构的理论指导。其解决方案的关键在于构建一个基于约束优化和对偶理论的决策论框架:对于两模型级联,证明了成本-质量前沿在置信度支持集的递减收益区域上具有分段凹性,并通过影子价格关联预算约束与质量约束的最优解;对于k模型情形,揭示了最优级联可通过两两组合的级联点包络实现,且一阶条件表明单一影子价格可均衡各阶段边际质量-成本比。实验验证表明,固定链式结构性能低于最优两两级联包络,而轻量级预生成路由策略在多数数据集上优于最佳级联策略,说明性能瓶颈主要源于结构性成本而非中间阶段不足。

链接: https://arxiv.org/abs/2605.06350
作者: Dylan Bouchard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the geometry of the resulting cost-quality frontier over a model pool. We develop a decision-theoretic framework grounded in constrained optimization and duality. For a two-model cascade, we establish piecewise concavity of the cost-quality frontier on decreasing-benefit regions of the confidence support, with reciprocal shadow prices linking the budget- and quality-constrained formulations. Given a pool of k models, we characterize the frontier achievable by deterministic two-model threshold cascades as the pointwise envelope over $\binom{k}{2}$ pairwise cascades, with switching points where the optimal pair changes. For k-model cascades, we derive first-order conditions in which a single shadow price equalizes marginal quality-per-cost across stage boundaries. We validate the framework on five benchmarks (MATH, MMLU, TriviaQA, SimpleQA, LiveCodeBench) across eight models from five providers. Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope, and optimized subsequence cascades do not deliver practically meaningful held-out gains over it. A lightweight pre-generation router exceeds the best cascade policy on four of five datasets, mainly because it avoids the cheap model's generation cost on queries sent directly to a larger model rather than because of a stronger routing signal. These results suggest that cascade performance is limited primarily by structural cost, since cascades pay the cheap model before any escalation decision, rather than by a shortage of intermediate stages.
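摘要强调级联的"结构性成本":升级到大模型的请求仍要先支付小模型的生成开销。下面的 Python 草图在合成数据上计算一个确定性两模型阈值级联的平均成本与准确率,直观展示这一成本结构;其中置信度、正确性与成本数值均为假设示例,并非论文实验数据。

```python
def cascade_metrics(examples, threshold, cheap_cost=1.0, big_cost=10.0):
    """两模型阈值级联:置信度低于 threshold 的请求升级到大模型。
    examples: (confidence, cheap_correct, big_correct) 三元组列表。"""
    total_cost, correct = 0.0, 0
    for conf, cheap_ok, big_ok in examples:
        if conf >= threshold:
            total_cost += cheap_cost        # 小模型直接作答
            correct += cheap_ok
        else:
            # 结构性成本:升级前已付过小模型的生成成本
            total_cost += cheap_cost + big_cost
            correct += big_ok
    n = len(examples)
    return total_cost / n, correct / n

# 假设数据:高置信度样本小模型能答对,低置信度样本需大模型
data = [(0.9, 1, 1), (0.8, 1, 1), (0.4, 0, 1), (0.3, 0, 1)]
cost, quality = cascade_metrics(data, threshold=0.5)
```

扫描不同 threshold 即可描出该模型对的成本-质量前沿;对 k 个模型的池,取所有两两级联前沿的逐点包络,即为摘要刻画的可达前沿。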

[NLP-31] Dont Lose Focus: Activation Steering via Key-Orthogonal Projections

【速读】: 该论文旨在解决生成式 AI(Generative AI)中激活引导(activation steering)技术在实现目标行为控制时导致推理与检索性能下降的问题。研究表明,这一性能折损的主要原因是注意力重定向(attention rerouting):引导向量改变了查询-键匹配模式,使注意力从语境关键标记转移到信息量较低的标记上。为解决此问题,作者提出基于键正交投影的引导方法(Steering via Key-Orthogonal Projections, SKOP),其核心在于通过保留模型依赖的关键标记(focus tokens)上的注意力分布,同时允许非关键尾部标记(tail tokens)间灵活再分配,从而抑制有害的注意力重定向,维持引导效果的同时显著降低对下游任务性能的损害。实验表明,SKOP 在多个引导基准上实现了最优的引导-效用权衡,将效用退化降低5–7倍,且保持超过95%的原始引导效能。

链接: https://arxiv.org/abs/2605.06342
作者: Haoyan Luo,Mateo Espinosa Zarlenga,Mateja Jamnik
机构: University of Cambridge (剑桥大学); University of Oxford (牛津大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Activation steering controls LLM behaviour towards target behaviour by intervening in internal representations, yet it often degrades reasoning and retrieval performance. We argue that a primary cause of this trade-off is attention rerouting: steering vectors alter query-key matching, shifting attention away from contextually important tokens toward less informative ones. To address this, we propose Steering via Key-Orthogonal Projections (SKOP), a steering method that constrains harmful attention rerouting without eliminating steering efficacy. SKOP achieves this by preserving attention patterns on a small set of focus tokens the model relies on for reasoning and retrieval, while allowing redistribution among less critical tail tokens. Across multiple steering benchmarks, we show that SKOP achieves the best joint steering-utility trade-off, reducing utility degradation by 5-7x while retaining over 95% of vanilla steering efficacy. Our results further suggest that, in long-context retrieval settings where vanilla steering approaches are ineffective, SKOP can maintain robust performance by avoiding attention rerouting.

[NLP-32] MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

【速读】: 该论文旨在解决工具使用型大语言模型(Tool-using Large Language Model, LLM)代理在遵循复杂、长周期的自然语言规程手册时,缺乏可扩展且可靠的合规性评估手段的问题。现有方法依赖人工构建基准或基于LLM的评判者,难以应对多步骤、高复杂度的规程场景。其解决方案的关键在于提出MANTRA框架,通过自动合成机器可验证的合规性基准,该框架独立生成两个核心组件:(i) 描述程序依赖关系的符号世界模型(symbolic world model),以及(ii) 针对特定任务的执行轨迹级合规检查(trace-level compliance checks),并利用SMT求解器验证一致性,通过结构化修复循环自动纠正不一致,仅在必要时引入人工干预。这一方法实现了对任意领域和超50页长手册的规模化、形式化合规评估,显著提升了检查粒度与约束强度,同时支持任务复杂度的可调控制,从而为工具使用型代理提供可靠、可解释的评测能力。

链接: https://arxiv.org/abs/2605.06334
作者: Ashwani Anand,Ivi Chatzi,Ritam Raha,Anne-Kathrin Schmuck
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace-level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity which is utilized to automatically derive challenging tasks accompanying compliance checks. Using MANTRA, we build a new benchmark suite with 285 tasks across 6 domains scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer with stronger constraint enforcement compared to existing benchmarks. Additionally, the granularity of the checks can be used for debugging the agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool-using agents.

[NLP-33] Measuring Evaluation-Context Divergence in Open-Weight LLM s: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

【速读】: 该论文旨在解决安全基准测试(safety benchmark)作为语言模型部署后行为预测依据时存在的可靠性问题,即模型在不同上下文框架(评估场景、实际部署交互或中性请求)下可能表现出行为差异,这种现象被称为“评估上下文偏离”(evaluation-context divergence)。其解决方案的关键在于提出了一种配对提示协议(paired-prompt protocol),通过控制重述变化、基准熟悉度和评判者框架敏感性等变量,在开放权重大语言模型(open-weight LLMs)中量化测量该偏差。实验结果显示,不同模型家族在评估与部署情境下的行为模式存在显著异质性,例如OLMo-3-Instruct表现出评估谨慎倾向,而Mistral-Small-3.2、Phi-3.5-mini和Llama-3.1-8B则呈现部署谨慎特征,且这一差异受评判者模型影响,表明当前安全评估指标的可比性需进一步澄清。

链接: https://arxiv.org/abs/2605.06327
作者: Florian A. D. Burnat,Brittany I. Davidson
机构: University of Bath (巴斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in behavior induced by framing a fixed task as an evaluation, a live deployment interaction, or a neutral request, and present a paired-prompt protocol that measures it in open-weight LLMs while controlling for paraphrase variation, benchmark familiarity, and judge framing-sensitivity. Across five instruction-tuned checkpoints from four open-weight families plus a matched OLMo-3 base/instruct ablation (20 paired items, 840 generations per checkpoint), we find striking heterogeneity. OLMo-3-Instruct alone is eval-cautious – evaluation framing raises refusal vs. neutral by 11.8 pp (p=0.007) and reduces harmful compliance vs. deployment by 3.6 pp (p=0.024, 0/20 items inverted) – while Mistral-Small-3.2, Phi-3.5-mini, and Llama-3.1-8B are deployment-cautious, with marginal eval-vs-deployment refusal effects of -9 to -20 pp. The matched OLMo-3 base also exhibits the deployment-cautious pattern, identifying alignment as the inversion stage; within Llama-3.1, the 70B model preserves direction with attenuated magnitude, ruling out a simple "small-model effect that reverses at scale." One caveat: the cross-family heterogeneity is judge-dependent. Re-judging with a different-family safety classifier (Llama-Guard-3-8B) preserves the within-OLMo eval-cautious direction but flattens the cross-family contrast, indicating that the two judges operationalize distinct constructs.

[NLP-34] Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

【速读】: 该论文旨在解决工具集成推理(Tool-integrated Reasoning, TIR)中一个关键矛盾:即使强推理模型几乎不调用工具,引入工具增强的评估机制仍可能导致文本推理能力下降的问题。解决方案的核心在于提出一套系统性的TIR微调(SFT)配方,其关键包括:(i) 优先选择天然适合工具辅助求解的问题以提升教师轨迹的学习性;(ii) 控制工具使用轨迹比例以缓解文本推理能力的灾难性遗忘;(iii) 优化pass@k和响应长度而非训练损失,从而最大化SFT收益并保留强化学习(Reinforcement Learning, RL)探索空间;(iv) 基于良好SFT初始化和防止模式崩溃的显式保护机制,构建稳定且可验证奖励的强化学习(RLVR)阶段。该方法在Qwen3系列模型上验证有效,显著提升了开源模型在多个基准上的性能表现。

链接: https://arxiv.org/abs/2605.06326
作者: Qianjia Cheng,Yuchen Zhang,Zhilin Wang,Yuxin Zuo,Shunkai Zhang,Yuchen Fan,Yu Qiao,Bowen Zhou,Ning Ding,Yu Cheng,Yun Luo,Ganqu Cui
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls. In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited for tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories could mitigate the catastrophic forgetting of text-only reasoning capacity; (iii) optimizing for pass@k and response length instead of training loss could maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; (iv) a stable RL with verifiable rewards (RLVR) stage, built upon suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance in a wide range of benchmarks among open-source models, such as 96.7% and 99.2% on AIME 2025 for 4B and 30B, respectively.

[NLP-35] Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation

【速读】: 该论文旨在解决有害语言检测(harmful language detection)中因标注者差异导致的标注不一致性问题,尤其是以往研究多关注标注者特征(如社会人口统计学信息、态度等)和标注内容的语言属性,却忽视了二者之间的交互作用。其解决方案的关键在于首次对四个主流有害语言检测数据集进行了大规模分析,系统整合了标注者特征、语料的语言属性及其交互效应,并通过统计方法揭示出这些交互作用在不同数据集中的显著差异,从而强调了标注者与语言特征之间存在交叉影响(intersectional effects),并指出词汇线索和标注者态度在其中扮演核心角色。这一发现警示研究者在模型泛化和跨数据集迁移时需保持谨慎。

链接: https://arxiv.org/abs/2605.06318
作者: Maximilian Maurer,Maximilian Linde,Gabriella Lapesa
机构: GESIS - Leibniz Institute for the Social Sciences (GESIS - 马克思-普朗克社会研究所); Heinrich-Heine University Düsseldorf (海因里希-海涅大学杜塞尔多夫)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Human label variation has been established as a central phenomenon in NLP: the perspectives different annotators have on the same item need to be embraced. Data collection practices thus shifted towards increasing the annotator numbers and releasing disaggregated datasets, harmful language being most resourced due to its high subjectivity. While this resulted in rich information about *who* annotated (sociodemographics, attitudes, etc.), the *what* (e.g., linguistic properties of items), and their interplay has received little attention. We present the first large-scale analysis of four reference datasets for harmful language detection, bringing together annotator characteristics, linguistic properties of the items, and their interactions in a statistically informed picture. We find that interactions are crucial, revealing intersectional effects ignored in previous work, and that a strong role is played by lexical cues and annotator attitudes. Effect patterns, however, vary considerably across datasets. This urges caution about generalization and transferability.

[NLP-36] MultiLinguahah: A New Unsupervised Multilingual Acoustic Laughter Segmentation Method

【速读】: 该论文旨在解决跨语言场景下笑声检测(laughter detection)与分割(segmentation)的难题,尤其针对现有机器学习方法依赖昂贵的人工标注且主要适用于英语语境的问题。其解决方案的关键在于提出一种无监督的多语言方法,将笑声分割任务建模为基于能量的音频片段异常检测问题,并利用BYOL-A编码器提取音频表征后,结合孤立森林(Isolation Forest)进行高效分割,从而在非英语数据上显著优于当前主流方法。

链接: https://arxiv.org/abs/2605.06309
作者: Callejas Sofia,Gomez Nahuel,Pelachaud Catherine,Ravenet Brian,Barriere Valentin
机构: Université Paris-Saclay LISN – Orsay, France; Universidad de Chile DCC – Santiago, Chile; Sorbonne University ISIR – Paris, France
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Laughter is a social non-verbal vocalization that is universal across cultures and languages, and is crucial for human communication, including social bonding and communication signaling. However, detecting laughter in audio is a challenging task, and segmenting it is even more difficult. Currently, machine learning methods generally rely on costly manual annotation, and their datasets are mostly based on English contexts. Thus, we propose an unsupervised multilingual method that frames laughter segmentation as anomaly detection over energy-based segmented audio sequences. Our method applies an Isolation Forest on audio representations learned from the BYOL-A encoder. We compare our method with several state-of-the-art laughter detection algorithms on four datasets, including stand-up comedy, sitcoms, and general short audio from AudioSet. Our results show that state-of-the-art methods are not optimized for multilingual contexts, while our method outperforms them in non-English settings.
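该方法的第一步是按能量将音频切成候选片段,再对片段的 BYOL-A 表征跑 Isolation Forest 做异常检测。下面用纯 Python 给出"基于能量的片段切分"这一步的示意草图(帧长、阈值与信号均为假设示例;表征提取与 Isolation Forest 部分不在此演示)。

```python
def frame_energies(samples, frame_len=4):
    """按固定帧长计算短时能量(帧内采样平方和)。"""
    return [sum(x * x for x in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def energy_segments(energies, threshold):
    """将能量高于阈值的连续帧合并为候选片段,返回 (起始帧, 结束帧) 列表。"""
    segments, start = [], None
    for i, e in enumerate(energies):
        if e > threshold and start is None:
            start = i                      # 进入高能量区
        elif e <= threshold and start is not None:
            segments.append((start, i - 1))  # 离开高能量区,收尾
            start = None
    if start is not None:
        segments.append((start, len(energies) - 1))
    return segments

# 假设信号:静音 - 一段高能量(类似笑声爆发)- 静音
signal = [0.0] * 8 + [0.9, -0.8, 0.7, -0.9] + [0.0] * 8
segs = energy_segments(frame_energies(signal, frame_len=4), threshold=0.5)
```

在完整流水线中,这些候选片段会先经 BYOL-A 编码,再由 Isolation Forest 判定哪些片段在表征空间中是"异常"(即笑声)。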

[NLP-37] Log-Likelihood Simpsons Paradox and the Detection of Machine-Generated Text

【速读】: 该论文旨在解决当前文本检测器在区分人类撰写文本与大语言模型生成文本时性能不足的问题。现有方法普遍基于“似然假设”(likelihood hypothesis),即认为机器生成文本对检测器语言模型而言具有更高的概率,但作者发现该假设下的词元级信号在检测器的隐藏空间中分布不均,若简单地对不同统计结构区域的词元得分进行平均,会引发类似辛普森悖论(Simpson’s paradox)的现象,从而削弱甚至破坏局部强信号。解决方案的关键在于引入一个基于贝叶斯决策理论的局部校准步骤:首先学习轻量级预测器以建模不同隐藏空间位置下得分分布的条件特性,随后聚合经校准的对数似然比(calibrated log-likelihood ratios),而非原始词元得分。这一修正机制显著且一致地提升了所有基准检测器和数据集上的表现,例如Fast-DetectGPT的AUROC从0.63提升至0.85,且该方法具备模块化、可兼容任何基于词元平均的流水线的特点,为后续研究提供了坚实基础。

链接: https://arxiv.org/abs/2605.06294
作者: Tom Kempton,Viktor Drobnyi,Maeve Madigan,Stuart Burrell
机构: University of Manchester (曼彻斯特大学); Visa Inc. (维萨公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 3 figures, 2 tables, 11 appendices

点击查看摘要

Abstract:The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure, as most detectors do, causes a form of Simpson’s paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, we introduce a learned local calibration step grounded in Bayesian decision theory. Rather than aggregating raw token scores, we first learn lightweight predictors of the score distributions conditioned on position in hidden space, and aggregate calibrated log-likelihood ratios instead. This single intervention dramatically and consistently improves detection performance across all baseline detectors and all datasets we consider. For example, our calibrated variant of Fast-DetectGPT improves AUROC from 0.63 to 0.85 on GPT-5.4 text, and a locally-calibrated DMAP detector we introduce achieves state-of-the-art performance across the board. That said, our central contribution is not a new detector, but a precise diagnosis of a significant cause of under-performance of existing detectors and a principled, modular remedy compatible with any token-averaging pipeline. This will serve as a foundation for the community to build upon, with natural avenues including richer distributional models, improved calibration strategies, and principled ensembling with hidden-space geometry signals via the full Bayes-optimal decision rule.
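摘要的核心修正是:先按 token 所处的隐藏空间区域对原始分数做局部校准,再聚合校准后的对数似然比,而不是直接平均原始分数。下面的 Python 草图用"分区域 z 分数"作为校准量的最简替身来演示这一聚合方式;论文中区域划分与分布预测器均为学习得到,此处的固定 `region_stats` 表与示例数值纯属假设。

```python
def calibrated_llr_score(token_scores, regions, region_stats):
    """按 token 所属区域做局部校准后再平均。
    token_scores: 每个 token 的原始似然分数
    regions:      每个 token 所属的隐藏空间区域标签
    region_stats: {区域: (均值, 标准差)},论文中由轻量预测器学习得到"""
    total = 0.0
    for s, r in zip(token_scores, regions):
        mu, sigma = region_stats[r]
        total += (s - mu) / sigma  # 区域内校准,避免跨区域直接平均
    return total / len(token_scores)

# 假设两个统计结构迥异的区域:直接平均原始分数会被区域 B 的高均值支配
stats = {"A": (0.0, 1.0), "B": (5.0, 2.0)}
score = calibrated_llr_score([0.5, 6.0, -0.5, 4.0],
                             ["A", "B", "A", "B"], stats)
```

示例中原始分数均值为 2.5,完全由区域混合比例决定;而校准后得分为 0,正确反映出每个区域内部并无一致偏移——这正是摘要所述辛普森悖论式失真的消解方式。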

[NLP-38] LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

【速读】: 该论文旨在解决代理式检索增强生成(Agentic RAG)在处理复杂问题时因多步迭代推理导致的高延迟问题。现有方法依赖大语言模型(LLM)逐词生成自然语言中间思考和子查询,这一自回归过程显著增加了推理时间。其解决方案的关键在于提出LatentRAG框架,将推理与检索从离散的语言空间转移到连续的潜在空间(latent space),通过单次前向传播直接生成潜在表示作为思维和子查询,从而避免逐token生成的开销;同时,在潜在空间中对齐LLM与密集检索模型,并引入并行潜在解码机制以增强可解释性,最终实现性能相当但推理延迟降低约90%的效果。

链接: https://arxiv.org/abs/2605.06285
作者: Yijia Zheng,Marcel Worring
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries directly from the hidden states in a single forward pass. We align LLMs with dense retrieval models in the latent space, enabling retrieval over latent subquery tokens and supporting end-to-end joint optimization. To improve transparency and encourage semantically meaningful latent representations, we incorporate a parallel latent decoding mechanism that translates latent tokens back into natural language. Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.

[NLP-39] Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement

【速读】: 该论文旨在解决在自动评分系统中,当评分量表(rubric)从整体性判断(holistic judgment)调整为分析性判断(analytic judgment)时,人类评分者与生成式AI评分器(autorater)之间评分一致性下降的问题。其关键解决方案在于通过优化rubric设计来提升人机评分的一致性:具体包括增加代表性示例和额外上下文信息以减少主观歧义,并降低rubric中的位置偏差(positional bias),同时避免过度复杂的量表结构以及过于保守的聚合方法,从而在不同评估场景(如自动作文评分和指令遵循评估)中实现更可靠的评分一致性。

链接: https://arxiv.org/abs/2605.06283
作者: Jessica Huynh,Alfredo Gomez,Athiya Deviyani,Renee Shelby,Jeffrey P. Bigham,Fernando Diaz
机构: Carnegie Mellon University (卡内基梅隆大学); Google Research (谷歌研究)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation. However, there is limited statistical analysis of how modifications in a rubric presented to both humans and autoraters affect their score agreement. Rubrics that ask for an overall or *holistic* judgment - for example, rating the "quality" of an essay - may be inconsistently interpreted due to the complexity or subjectivity of the criteria. Conversely, rubrics can ask for *analytic* judgments, which decompose assessment criteria - for example, "quality" into "fluency" and "organization". While these rubrics can be edited to improve the individual accuracy of both human and automated scoring, this approach may result in disagreement between the two scores, or with the associated holistic judgment. Designing and deploying reliable autoraters requires understanding not just the relationship between human and autorater annotations but how that relationship changes as holistic or analytic judgments are elicited. The results indicate that rubric edits providing representative examples and additional context, and reducing positional bias in the rubric increased human-autorater agreement, while higher rubric complexity and conservative aggregation methods tended to decrease it. The findings from the automatic essay scoring and instruction-following evaluation domains suggest that practitioners should carefully analyze domain- and rubric-specific performance to move towards higher human-autorater agreement.

[NLP-40] Linear Semantic Segmentation for Low-Resource Spoken Dialects ACL

【速读】: 该论文旨在解决当前语义分割(Semantic Segmentation)模型在低资源口语变体(如方言阿拉伯语)中表现不佳的问题,尤其是针对其非正式句法、代码切换(Code-Switching)和弱标记的篇章结构等挑战。解决方案的关键在于构建一个涵盖多种语域(包括电话对话、混码播客、新闻广播及小说对白)的多类型基准数据集,并提出一种聚焦局部语义连贯性与篇章断裂鲁棒性的新型分割模型,该模型在方言非新闻语料上显著优于现有强基线方法,且具备向其他低资源口语语言迁移的潜力。

链接: https://arxiv.org/abs/2605.06276
作者: Kirill Chirkunov,Younes Samih,Abed Alhakim Freihat,Hanan Aldarmaki
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); IBM Research AI (IBM研究人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL Findings 2026

点击查看摘要

Abstract:Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse. The benchmark covers transcribed casual telephone conversations, code-switched podcasts, broadcast news, and expressive dialogue from novels, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal transcribed speech. We further propose a segmentation model that targets local semantic coherence and robustness to discourse discontinuities, consistently outperforming strong baselines on dialectal non-news genres. The benchmark and approach generalize to other low-resource spoken languages.

[NLP-41] Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

【速读】: 该论文试图解决的问题是:强化学习(Reinforcement Learning, RL)在提升大语言模型推理能力时,是否真正实现了新策略的习得,还是仅对基础模型已有的解空间进行概率重分配。研究表明,RL的实际作用并非扩展模型的能力边界,而是通过稀疏且可预测的修正,在高熵决策点(即模型不确定选择分支的位置)上调整输出分布,从而提升推理准确性。解决方案的关键在于识别这些高熵决策点,并基于基础模型自身的熵信息,采用对比损失(contrastive loss)仅在这些少数位置施加微调,无需在线生成或复杂的RL训练循环。这种方法被命名为ReasonMaxxer,其核心优势在于以极低的计算成本(仅需数百次基础模型推理和数分钟单GPU训练)即可实现与完整RL相当甚至更优的性能,验证了推理改进本质上是稀疏策略选择而非能力获取。

链接: https://arxiv.org/abs/2605.06241
作者: Ömer Faruk Akgül,Rajgopal Kannan,Willie Neiswanger,Viktor Prasanna
机构: University of Southern California (南加州大学); DEVCOM ARL (国防部研究与工程司令部)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL’s beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1–3% of token positions are affected, the promoted token always lies within the base model’s top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL’s accuracy gain, while random corrections fail. The base model’s own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
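摘要指出,基础模型自身的熵即可定位需要修正的高熵决策点。下面的 Python 草图演示这一"熵门控"筛选:对每个生成步的 token 分布计算香农熵,仅保留熵超过阈值的位置。阈值 `tau` 与示例分布均为假设,ReasonMaxxer 随后施加的对比损失不在此演示。

```python
import math

def entropy(probs):
    """离散分布的香农熵(自然对数,单位为 nat)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_gated_positions(step_distributions, tau):
    """返回基础模型不确定(熵高于 tau)的决策点下标;
    按论文结论,稀疏修正只需作用于这少数位置。"""
    return [i for i, probs in enumerate(step_distributions)
            if entropy(probs) > tau]

# 假设的三个生成步的 top-k 分布
steps = [
    [0.97, 0.01, 0.01, 0.01],  # 确定:几乎必选首个分支
    [0.4, 0.3, 0.2, 0.1],      # 不确定:高熵决策点
    [0.99, 0.01],              # 确定
]
gated = entropy_gated_positions(steps, tau=1.0)
```

示例中仅第二步被门控选中,与摘要所述"仅 1–3% 的 token 位置受 RL 影响、且可由基础模型熵直接识别"的图景一致。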

[NLP-42] YEZE at SemEval-2026 Task 9: Detecting Multilingual Multicultural and Multievent Online Polarization via Heterogeneous Ensembling ACL2026 SEMEVAL-2026

【速读】: 该论文旨在解决多语言、多元文化和多事件在线极化(Multilingual, Multicultural and Multievent Online Polarization)的检测问题,目标是从22种语言的社会媒体内容中识别极化现象,并完成三个子任务:二分类检测、目标分类和表现形式识别。其解决方案的关键在于构建一个异构集成模型(heterogeneous ensemble),融合XLM-RoBERTa-large与mDeBERTa-v3-base两种多语言预训练模型,并采用独立任务建模结合类别加权(class weighting)策略,在严重标签不平衡条件下显著提升分类性能。

链接: https://arxiv.org/abs/2605.06231
作者: Fengze Guo,Yue Chang(University of Tübingen)
机构: University of Tübingen (图宾根大学)
类目: Computation and Language (cs.CL)
备注: Accepted to the SemEval-2026 workshop of the ACL 2026 conference

点击查看摘要

Abstract:This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization, which identifies polarized social media content in 22 languages through three subtasks: binary detection, target classification, and manifestation identification. We propose a heterogeneous ensemble of multilingual pretrained models, combining XLM-RoBERTa-large and mDeBERTa-v3-base. We investigate techniques such as multi-task learning, translation-based data augmentation, and class weighting to improve classification performance under severe label imbalance. Our findings indicate that independent task modeling combined with class weighting is more effective.
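Class weighting under severe label imbalance, as used by the system above, is commonly implemented with inverse-frequency weights fed into the loss. A minimal sketch (the paper's exact weighting scheme is not specified, so this formula is an assumption):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Class weights inversely proportional to class frequency,
    scaled so that a perfectly balanced dataset yields weight 1.0."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# imbalanced toy labels: 8 non-polarized vs 2 polarized examples
labels = ["neg"] * 8 + ["pos"] * 2
w = inverse_frequency_weights(labels)
# the minority class receives the larger weight
```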

[NLP-43] UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

【速读】: 该论文旨在解决长上下文推理中预填充(prefill)阶段效率低下问题,尤其是在新兴混合架构(如线性/全注意力混合或滑动窗口/全注意力混合)和连续批处理(continuous batching)场景下,现有稀疏注意力加速方法性能显著下降且难以集成到现代推理引擎(如vLLM)中的挑战。解决方案的关键在于提出UniPrefill——一个适用于几乎任意模型架构的预填充加速框架,其核心创新是基于token级别的直接计算加速,并通过实现为连续批处理算子、扩展vLLM调度策略以原生支持预填充-解码协同处理(prefill-decode co-processing)及张量并行(tensor parallelism),从而实现高效、灵活且无缝集成于vLLM的预填充加速。

链接: https://arxiv.org/abs/2605.06221
作者: Qihang Fan,Huaibo Huang,Zhiying Wu,Bingning Wang,Ran He
机构: MAISNLPR (中国科学院自动化研究所); CASIA (中国科学院自动化研究所); UCAS (中国科学院大学); WeChat (微信); Tencent (腾讯)
类目: Computation and Language (cs.CL)
备注: code: this https URL

点击查看摘要

Abstract:As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency of long-context processing, several novel low-complexity hybrid architectures have recently been proposed, effectively alleviating the computational burden of long-context inference. However, existing research on long-context prefill acceleration remains predominantly focused on sparse attention mechanisms, which achieve their maximum speedup only on full-attention models. When transferred to emerging architectures–such as linear/full attention hybrids or sliding window/full attention hybrids–these prefill acceleration approaches suffer significant performance degradation. Furthermore, such methods are generally incompatible with continuous batching, making them difficult to integrate into modern inference engines such as vLLM. To this end, we propose UniPrefill, a prefill acceleration framework applicable to virtually any model architecture, which directly accelerates the model’s computation at the token level. We further implement UniPrefill as a continuous batching operator and extend vLLM’s scheduling strategy to natively support prefill-decode co-processing and tensor parallel for UniPrefill, enabling its seamless integration into vLLM. UniPrefill achieves up to 2.1x speedup in Time-To-First-Token (TTFT), with the acceleration becoming increasingly pronounced as the number of concurrent requests grows.

[NLP-44] TIDE: Every Layer Knows the Token Beneath the Context

【速读】: 该论文旨在解决现代大语言模型(Large Language Models, LLMs)中一个被广泛采用但未充分探讨的设计选择——即token索引仅在输入嵌入层查找一次后便永久丢弃,由此引发的两个结构性问题:一是“稀有词问题”(Rare Token Problem),由于词汇分布呈齐普夫定律(Zipf-type distribution),稀有词因获得的梯度信号远少于常见词而导致训练不足;二是“上下文坍缩问题”(Contextual Collapse Problem),在参数受限的模型中,语义相似的词会被映射到难以区分的隐藏状态。为应对上述问题,作者提出TIDE框架,其核心创新在于引入EmbeddingMemory机制:一个由K个独立MemoryBlock组成的集合,将token索引映射为与上下文无关的语义向量,一次性计算并经由深度感知的softmax路由模块注入每一层,同时引入可学习的零值缓存(null bank)以增强灵活性和稳定性。理论与实证结果表明,TIDE有效缓解了单次token身份注入带来的缺陷,并在多种语言建模及下游任务中提升了性能。

链接: https://arxiv.org/abs/2605.06216
作者: Ajay Jaiswal,Lauren Hannah,Han-Byul Kim,Duc Hoang,Mehrdad Farajtabar,Minsik Cho
机构: Apple(苹果)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type distribution of vocabulary causes rare-token embeddings to be chronically under-trained, as they receive a fraction of the cumulative gradient signal compared to common tokens; and (ii) the Contextual Collapse Problem, where limited-parameter models map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-token identity injection as well as in improving performance across multiple language modeling and downstream tasks.
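The depth-conditioned softmax routing over K MemoryBlocks plus a null bank can be sketched as follows; `tide_injection` and every shape here are hypothetical, illustrating only the routing arithmetic rather than the actual TIDE implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K, L = 50, 16, 4, 6   # vocab size, hidden dim, memory blocks, layers

memory = rng.normal(size=(K, V, D))           # K independent MemoryBlocks
null_bank = np.zeros((1, D))                  # learnable "do not inject" option
router_logits = rng.normal(size=(L, K + 1))   # depth-conditioned gates, one row per layer

def tide_injection(token_idx, layer):
    """Mix the K memory vectors (plus the null vector) for one token,
    weighted by a softmax over that layer's router logits."""
    cands = np.vstack([memory[:, token_idx, :], null_bank])   # (K+1, D)
    g = router_logits[layer]
    w = np.exp(g - g.max())
    w /= w.sum()
    return w @ cands                                          # (D,)

inj = tide_injection(token_idx=7, layer=2)
```

Because the lookup depends only on the token index and the layer, it can be computed once per token and re-added at every depth.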

[NLP-45] Contrastive Identification and Generation in the Limit

【速读】: 该论文旨在解决在对比学习(contrastive learning)场景下,如何从无标签的对比对(unordered pairs {x, y} 满足 h(x)≠h(y),但未知哪个为正例)中实现识别(identification)与生成(generation)的问题。传统研究多基于单一标签或全监督数据,而现实中的许多监督信号本质上是关系型的(relational),即编码示例间的相对关系而非个体标签。论文的关键解决方案在于引入“公共交叉图”(common crossing graph)这一统一技术对象,它以覆盖与关联的语言同时刻画成对模糊性(pairwise ambiguity)、群体级生成障碍(family-level generation obstructions)以及有限对抗污染缺陷(corruption defects),从而在无噪声和有噪声两种设定下分别实现了对比识别类的精确刻画、对比闭包维数(contrastive closure dimension)的定义及其与生成复杂度的严格对应,并揭示了对比生成与文本识别之间存在不可比较的严格层级结构。

链接: https://arxiv.org/abs/2605.06211
作者: Xiaoyu Li,Andi Han,Jiaojiao Jiang,Junbin Gao
机构: University of New South Wales (新南威尔士大学); University of Sydney (悉尼大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS)
备注:

点击查看摘要

Abstract:In the classical identification in the limit model of Gold [1967], a stream of positive examples is presented round by round, and the learner must eventually recover the target hypothesis. Recently, Kleinberg and Mullainathan [2024] introduced generation in the limit, where the learner instead must eventually output novel elements of the target’s support. Both lines of work focus on positive-only or fully labeled data. Yet many natural supervision signals are inherently relational rather than singleton, which encode relationships between examples rather than labels of individual ones. We initiate the study of contrastive identification and generation in the limit, where the learner observes a contrastive presentation of data: a stream of unordered pairs {x, y} satisfying h(x) ≠ h(y) for an unknown target binary hypothesis h, but which element is positive is hidden from the learner. We first present three results in the noiseless setting: an exact characterization of contrastive identifiable classes (a one-line geometric refinement of Angluin [1980]'s tell-tale condition), a combinatorial dimension called contrastive closure dimension (a contrastive analogue of the closure dimension in Raman et al. [2025]) exactly characterizing uniform contrastive generation with tight sample complexity, and a strict hierarchy in which contrastive generation and text identification are mutually incomparable. We then prove a sharp reversal under finite adversarial corruption: there exist classes identifiable from contrastive pairs under any finite corruption budget by a single budget-independent algorithm, yet not identifiable from positive examples under even one corrupted observation. The unifying technical object is the common crossing graph, which encodes pairwise ambiguity, family-level generation obstructions, and corruption defects in a single coverage-and-incidence language.

[NLP-46] A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

【速读】: 该论文旨在解决生成式 AI(Generative AI)中代理型大语言模型(Agentic Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)训练过程中,因依赖稀疏的轨迹级奖励信号而导致的个体工具调用(tool-call)贡献难以评估的问题。现有方法或依赖外部过程奖励模型增加计算开销,或采用基于树结构的展开方式仅重新分配奖励而不提升轨迹多样性。为此,作者提出 A²TGPO(Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping),其核心创新在于对信息增益(Information Gain, IG)这一内在过程信号进行三重重构:首先通过“轮次分组归一化”使每一轮的IG在相同交互深度内比较;其次采用“方差缩放折扣累积”以稳定不同轮次的优势值幅度;最后引入“自适应轮次级裁剪机制”,根据归一化后的IG动态调整每轮的更新范围,从而实现更精准、稳定的策略优化。

链接: https://arxiv.org/abs/2605.06200
作者: Dingwei Chen,Zefang Zong,Zhipeng Ma,Leo Luo,Yang Li,Chengming Li,Peng Chen,Jie Jiang
机构: Tencent Inc(腾讯公司); The Chinese University of Hong Kong(香港中文大学); Shenzhen MSU-BIT University(深圳北理莫斯科大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy’s predicted probability of the ground-truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A²TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides cumulative normalized IG by the square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn’s clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
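Components (i) and (ii) of the recipe above are simple array operations and can be sketched directly. This is an illustrative reconstruction under the assumption that rows are rollouts of one prompt and columns are turn indices; it is not the paper's code:

```python
import numpy as np

def turn_group_normalize(ig):
    """Normalize IG across rollouts that share the same (prompt, turn-index),
    so each turn is compared only against peers at the same depth.
    `ig` has shape (num_rollouts, num_turns)."""
    mu = ig.mean(axis=0, keepdims=True)
    sd = ig.std(axis=0, keepdims=True) + 1e-8
    return (ig - mu) / sd

def rescaled_advantages(norm_ig):
    """Cumulative normalized IG divided by sqrt(#accumulated terms),
    keeping advantage magnitudes comparable across turn positions."""
    cum = np.cumsum(norm_ig, axis=1)
    counts = np.arange(1, norm_ig.shape[1] + 1)
    return cum / np.sqrt(counts)

ig = np.array([[0.2, 0.5, 0.1],
               [0.4, 0.1, 0.3]])   # two rollouts, three turns each
adv = rescaled_advantages(turn_group_normalize(ig))
```

Component (iii) would then widen or narrow the PPO-style clipping range per turn as a function of the normalized IG.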

[NLP-47] The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在被提示扮演不同社会角色时,其内部表征是否能编码从个体微观经验到组织、制度乃至国家等宏观层面推理的粒度差异这一问题。解决方案的关键在于提出并验证了一个基于对比的粒度轴(Granularity Axis),即通过计算宏观角色与微观角色隐藏状态的均值差来定义该轴;实证发现该轴在Qwen3-8B中与角色表示空间的第一主成分(PC1)高度对齐(余弦相似度0.972),解释了52.6%的方差,表明粒度是组织提示社会角色的核心几何方向。进一步通过多层级角色构建、大规模响应采集及投影分析,证明该轴在不同层、提示变体、分割策略和过滤子集下均保持单调递增且稳定,并可在Llama-3.1-8B-Instruct上迁移,同时激活操纵实验显示沿该轴定向扰动可显著改变输出粒度,证实其因果相关性。

链接: https://arxiv.org/abs/2605.06196
作者: Chonghan Qin,Xiachong Feng,Ziyun Song,Xiaocheng Feng,Jing Xiong,Lingpeng Kong
机构: The University of Hong Kong (香港大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28 pages, including appendices

点击查看摘要

Abstract:Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether their internal representations encode the granularity of such roles, from micro-level individual experience to macro-level organizational, institutional, or national reasoning. We show that they do. We define a contrast-based Granularity Axis as the difference between mean macro- and micro-role hidden states. In Qwen3-8B, this axis aligns with the principal axis (PC1) of the role representation space at cosine 0.972 and accounts for 52.6% of its variance, indicating that granularity is the dominant geometric axis organizing prompted social roles. We construct 75 social roles across five granularity levels and collect 91,200 role-conditioned responses over shared questions and prompt variants, then extract role-level hidden states and project them onto the axis. Role projections increase monotonically across all five levels, remain stable across layers, prompt variants, endpoint definitions, held-out splits, and score-filtered subsets, and transfer to Llama-3.1-8B-Instruct. The axis is also causally relevant: activation steering along it shifts response granularity in the predicted direction, with Llama moving from 2.00 to 3.17 on a five-point macro scale under positive steering on prompts that admit local responses. The two models differ in controllability, suggesting that steering depends on each model’s default operating regime. Overall, our findings suggest that social role granularity is not merely a stylistic surface feature, but a structured, ordered, and causally manipulable latent direction in role-conditioned language model behavior.
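The contrast-based axis construction (difference of mean macro- and micro-role hidden states) and the projection step can be sketched with synthetic clusters standing in for role hidden states; everything here is illustrative, not the paper's pipeline:

```python
import numpy as np

def granularity_axis(macro_states, micro_states):
    """Contrast-based axis: difference of mean macro- and micro-role hidden
    states, normalized to unit length."""
    axis = macro_states.mean(axis=0) - micro_states.mean(axis=0)
    return axis / np.linalg.norm(axis)

def project(states, axis):
    """Scalar projection of role hidden states onto the axis."""
    return states @ axis

rng = np.random.default_rng(0)
base = rng.normal(size=8)
micro = base + rng.normal(scale=0.1, size=(20, 8)) - 1.0   # synthetic "micro" cluster
macro = base + rng.normal(scale=0.1, size=(20, 8)) + 1.0   # synthetic "macro" cluster
ax = granularity_axis(macro, micro)
# macro roles project higher than micro roles along this axis
```

Activation steering would then add a scaled multiple of `ax` to the hidden states at generation time.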

[NLP-48] OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在思维增强型数学推理任务中,基于策略的自蒸馏(On-Policy Self-Distillation, OPSD)方法有效性下降的问题。现有研究表明,OPSD 在一般场景下能通过令牌级信用分配提升准确率并缩短响应长度,但在涉及深度思考链的数学推理中,其性能提升不显著甚至出现负向影响。论文提出,OPSD 在此类任务中更倾向于作为压缩机制而非纠错机制发挥作用:仅在正确轨迹上训练可保持准确率的同时显著缩短输出,而在错误轨迹上训练则会损害模型性能。因此,解决方案的关键在于重构后训练流程——先进行监督微调(SFT),再使用具有可验证奖励的强化学习(RLVR)优化策略,最后应用OPSD实现响应压缩,从而在保障推理准确性前提下提升效率。

链接: https://arxiv.org/abs/2605.06188
作者: Jaehoon Kim,Dongha Lee
机构: Yonsei University (延世大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However, this promise does not carry over to thinking-enabled mathematical reasoning, where reported accuracy gains shrink and sometimes turn negative. We hypothesize that hindsight supervision can specify better token-level alternatives in short thinking-disabled outputs, but in long thinking-enabled traces it more readily identifies redundancy than supplies better replacements. To test this, we applied OPSD separately to correct and incorrect rollout groups, so that compression and correction can be observed in isolation. Our results show that in thinking-enabled mathematical reasoning, OPSD behaves most reliably as a compression mechanism rather than a correction mechanism: training only on correct rollouts preserves accuracy while substantially shortening responses, whereas training only on incorrect rollouts damages accuracy. In light of these findings, we propose a revised post-training pipeline for thinking-enabled mathematical reasoning: SFT then RLVR then OPSD.

[NLP-49] Rethinking Adapter Placement: A Dominant Adaptation Module Perspective

【速读】: 该论文旨在解决低秩适配(Low-rank adaptation, LoRA)中如何在有限的可训练参数下最优地放置适配器模块以最大化下游任务性能的问题。现有方法通常将适配器广泛分布于模型各层,但未明确指导在何处放置少量适配器能获得最佳效果。解决方案的关键在于提出一种基于梯度能量的敏感性探测方法——PAGE(Projected Adapter Gradient Energy),用于量化每个候选适配器位置的初始可训练梯度能量。研究发现,梯度能量高度集中于特定浅层前馈网络(Feed-Forward Network, FFN)的下投影层,该层被定义为“主导适配模块”(dominant adaptation module),其层数随模型架构变化但对任务稳定。基于此发现,作者进一步提出DomLoRA,仅在一个适配器位置放置单个适配器(仅占原始LoRA约0.7%的可训练参数),即可在指令遵循、数学推理、代码生成和多轮对话等任务上显著优于标准LoRA,验证了主导适配模块作为高效适配器放置策略的普适性和实用性。

链接: https://arxiv.org/abs/2605.06183
作者: Suoxin Zhang,Run He,Di Fang,Xiang Tan,Kaixuan Chen,Huiping Zhuang
机构: South China University of Technology (华南理工大学); Zhejiang University (浙江大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but existing methods still distribute adapters broadly, leaving where to place a limited number of adapters to maximize performance largely open. To investigate this, we introduce PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that estimates the initial trainable gradient energy available to each candidate LoRA adapter. Surprisingly, we find that PAGE is highly concentrated on a single shallow FFN down-projection across two model families and four downstream tasks. We term this module the dominant adaptation module and show that its layer index is architecture-dependent but task-stable. Motivated by this finding, we propose DomLoRA, a placement method that places a single adapter at the dominant adaptation module. With only ~0.7% of vanilla LoRA’s trainable parameters, DomLoRA outperforms it on average across various downstream tasks, including instruction following, mathematical reasoning, code generation, and multi-turn conversation. This method also improves other LoRA variants, supporting the dominant adaptation module perspective as a practical placement guideline.
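PAGE's "initial trainable gradient energy" can be approximated, under heavy assumptions, as the energy of a module's weight gradient after projection through a random low-rank factor, as a freshly initialized LoRA adapter would see it. This proxy and all names below are hypothetical, not the paper's definition:

```python
import numpy as np

def projected_gradient_energy(grad_w, rank=8, seed=0):
    """Hedged proxy for PAGE: energy of the weight gradient after projection
    onto a random rank-`rank` subspace. `grad_w` is the (out, in) gradient
    of one candidate module's weight matrix."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(rank, grad_w.shape[1])) / np.sqrt(grad_w.shape[1])
    g_b = grad_w @ a.T                 # gradient reaching the LoRA 'B' factor
    return float((g_b ** 2).sum())

# synthetic gradients for three candidate placements; the middle one carries
# a much stronger gradient signal
rng = np.random.default_rng(1)
grads = {f"layer{i}.ffn.down": rng.normal(scale=s, size=(32, 64))
         for i, s in enumerate([0.1, 1.0, 0.2])}
energies = {name: projected_gradient_energy(g) for name, g in grads.items()}
dominant = max(energies, key=energies.get)
```

A DomLoRA-style placement would put the single adapter at `dominant` and freeze everything else.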

[NLP-50] HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities

【速读】: 该论文旨在解决当前图像-文本匹配(Image-Text-Matching, ITM)方法在学习跨模态表征时存在的细粒度理解不足问题,其根源在于网络收集的图像-文本对之间关联较弱,导致模型难以捕捉两种模态的联合语义细节。解决方案的关键在于提出一种自动构建的“硬负样本caption”(Hard Negative Captions, HNC)数据集,其中包含经过精心设计的误导性负样本caption,用于强化模型对细粒度跨模态差异的感知能力;同时提供一个手动构建的具有不同组合复杂度的挑战性测试集,以评估模型在细粒度跨模态不匹配任务中的表现。实验表明,基于HNC训练可显著提升模型在零样本场景下检测语义不匹配的能力,并在噪声视觉输入下保持鲁棒性,且其初始化效果优于或等同于传统方法。

链接: https://arxiv.org/abs/2605.06157
作者: Esra Dönmez,Pascal Tilli,Hsiu-Yu Yang,Thang Vu,Carina Silberer
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image-Text-Matching (ITM) is one of the de facto methods of learning generalized representations from a large corpus in Vision and Language (VL). However, due to the weak association between the web-collected image-text pairs, models fail to show a fine-grained understanding of the combined semantics of these modalities. To address this issue we propose Hard Negative Captions (HNC): an automatically created dataset containing foiled hard negative captions for ITM training towards achieving fine-grained cross-modal comprehension in VL. Additionally, we provide a challenging manually-created test set for benchmarking models on a fine-grained cross-modal mismatch task with varying levels of compositional complexity. Our results show the effectiveness of training on HNC by improving the models’ zero-shot capabilities in detecting mismatches on diagnostic tasks and performing robustly under noisy visual input scenarios. Also, we demonstrate that HNC models yield a comparable or better initialization for fine-tuning.

[NLP-51] Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

【速读】: 该论文旨在解决深度神经网络在无正则化长期训练中出现的周期性损失尖峰(即“Slingshot Mechanism”)现象的成因问题,此前学界普遍认为其源于优化动力学本身,但具体触发机制尚不明确。论文的核心贡献在于证明该现象本质上是由浮点数计算精度限制引发的数值效应:当训练进入高置信度阶段时,正确类别logit与其他类别logit之间的差异可能超过吸收误差阈值,导致反向传播中正确类别的梯度被精确舍入为零,而错误类别的梯度仍非零,从而破坏类别间梯度的零和约束,并引起分类层参数更新的系统性漂移;该漂移与特征表示形成正反馈回路,驱动全局分类器均值与特征均值指数级增长,作者将此机制命名为数值特征膨胀(Numerical Feature Inflation, NFI)。这一发现揭示了Slingshot本质是有限精度训练下的数值动态,为晚期训练中异常参数增长和logit发散提供了可验证的解释。

链接: https://arxiv.org/abs/2605.06152
作者: Liu Hanqing,Jianjun Cao,Yuanze Li,Zijian Zhou
机构: Tsinghua University (清华大学); The University of Tokyo (东京大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: 28 pages, 13 figures

点击查看摘要

Abstract:Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the “Slingshot Mechanism.” Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
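The absorption effect the paper describes can be reproduced directly in half precision: once the correct-class logit leads by enough, the cross-entropy gradient for the correct class rounds exactly to zero while the incorrect-class gradients stay nonzero, breaking the zero-sum constraint across classes. A minimal demonstration (not the paper's code; the logit values are chosen so absorption occurs in float16 but the tail probabilities remain representable):

```python
import numpy as np

def ce_grad_fp16(logits, target):
    """Cross-entropy gradient w.r.t. logits (softmax(z) - onehot),
    with every step computed in float16."""
    z = np.asarray(logits, dtype=np.float16)
    z = z - z.max()                 # numerically stable softmax
    e = np.exp(z)                   # stays float16
    p = e / e.sum()                 # tail probs absorbed in the sum
    y = np.zeros_like(p)
    y[target] = 1
    return p - y

g = ce_grad_fp16([12.0, 0.0, 0.0], target=0)
# correct-class gradient is absorbed to exactly zero, while the
# incorrect-class gradients stay nonzero: the zero-sum constraint breaks
```

In exact arithmetic the three components would sum to zero; here the nonzero residual is precisely the systematic drift the paper identifies as the trigger of NFI.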

[NLP-52] IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences

【速读】: 该论文旨在解决回忆叙事中隐式实体识别(Implicit Entity Recognition, IER)的计算难题,即在不直接提及目标实体的情况下,从分散的上下文线索中推断出该实体。传统命名实体识别(Named Entity Recognition, NER)、实体链接(Entity Linking)或共指消解(Coreference Resolution)均假设实体线索集中于局部语境,而回忆叙事中的信息往往分布于多个非连续句段中,形成“非局部性”挑战。解决方案的关键在于构建IRC-Bench——一个包含25,136个样本的基准数据集,每个样本由显式提及目标实体的“实体锚定叙事”(Entity-Grounded Narrative)和移除直接提及后的“实体省略叙事”(Entity-Elided Narrative)组成,从而系统评估模型在复杂非局部语境下的推理能力。实验表明,基于QLoRA微调的Llama 3.1 8B在开放世界设置下表现最优(精确匹配38.94%,Jaccard相似度51.59%),而微调的密集检索器(DPR)在封闭世界下更优(Hit@1为35.38%,Hit@10为71.49%),验证了该任务对多模态、跨句推理能力的高要求。

链接: https://arxiv.org/abs/2605.06142
作者: Yehudit Aperstein,Eden Moran,Alexander Apartsin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29 pages, 8 figures

点击查看摘要

Abstract:When people recount personal memories, they often refer to people, places, and events indirectly, relying on contextual cues rather than explicit names. Such implicit references are central to reminiscence narratives: first-person accounts of lived experience used in therapeutic, archival, and social settings. They pose a difficult computational problem because the intended entity must be inferred from dispersed narrative evidence rather than from a local mention. We introduce IRC-Bench, the Implicit Reminiscence Context Benchmark, for evaluating implicit entity recognition in reminiscence transcripts. The benchmark targets non-locality: entity-identifying cues are distributed across multiple, non-contiguous clauses, unlike named entity recognition, entity linking, or coreference resolution. IRC-Bench comprises 25,136 samples constructed from 12,337 Wiki-data-linked entities across 1,994 transcripts spanning 11 thematic domains. Each sample pairs an Entity-Grounded Narrative, in which the target entity is explicitly mentioned, with an Entity-Elided Narrative, in which direct mentions are removed. We evaluate 19 configurations across LLM generation, dense retrieval, RAG, and fine-tuning. QLoRA-adapted Llama 3.1 8B performs best in the open-world setting (38.94% exact match; 51.59% Jaccard), while fine-tuned DPR leads closed-world retrieval (35.38% Hit@1; 71.49% Hit@10). We release IRC-Bench with data, code, and evaluation tools.

[NLP-53] MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

【速读】: 该论文旨在解决代理记忆系统中重排序模型(reranking model)在“检索-重排序”两阶段架构下的核心缺陷:现有通用重排序模型依赖语义相似性匹配,缺乏真正的推理能力,导致召回结果虽语义相关但缺少回答问题所需的关键信息。具体表现为三方面问题:相关性分数校准不足、面对时间约束与因果推理等复杂查询时排名性能下降、无法利用对话上下文进行语义消歧。解决方案的关键在于提出 MemReranker 系列模型(0.6B/4B),基于 Qwen3-Reranker 通过多阶段大语言模型(LLM)知识蒸馏构建,其中包含三项核心技术:多教师成对比较生成校准的软标签、BCE 点式蒸馏实现分数分布优化、InfoNCE 对比学习增强难样本区分能力;同时训练数据融合通用语料与面向记忆场景的多轮对话数据(涵盖时间约束、因果推理和指代消解),从而显著提升模型在专业领域(如金融、医疗)和基准测试上的表现,兼顾精度与推理延迟(仅为大模型的 10–20%)。

链接: https://arxiv.org/abs/2605.06132
作者: Chunyu Li,Jingyi Kang,Ding Chen,Mengyuan Zhang,Jiajun Shen,Bo Tang,Xuanhe Zhou,Feiyu Xiong,Zhiyu Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the “retrieve-then-rerank” two-stage paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning capabilities, leading to a problem where recalled results are semantically highly relevant yet do not contain the key information needed to answer the question. This deficiency manifests in memory scenarios as three specific problems. First, relevance scores are miscalibrated, making threshold-based filtering difficult. Second, ranking degrades when facing temporal constraints, causal reasoning, and other complex queries. Third, the model cannot leverage dialogue context for semantic disambiguation. This report introduces MemReranker, a reranking model family (0.6B/4B) built on Qwen3-Reranker through multi-stage LLM knowledge distillation. Multi-teacher pairwise comparisons generate calibrated soft labels, BCE pointwise distillation establishes well-distributed scores, and InfoNCE contrastive learning enhances hard-sample discrimination. Training data combines general corpora with memory-specific multi-turn dialogue data covering temporal constraints, causal reasoning, and coreference resolution. On the memory retrieval benchmark, MemReranker-0.6B substantially outperforms BGE-Reranker and matches open-source 4B/8B models as well as GPT-4o-mini on key metrics. MemReranker-4B further achieves 0.737 MAP, with several metrics on par with Gemini-3-Flash, while maintaining inference latency at only 10–20% of large models. On finance and healthcare vertical-domain benchmarks, the models preserve generalization capabilities on par with mainstream large-parameter rerankers.
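The InfoNCE component of the distillation recipe can be sketched as a softmax cross-entropy over one query's candidate scores, with the positive memory competing against hard negatives. The temperature and score values below are made up for illustration:

```python
import numpy as np

def infonce_loss(scores, pos_index, temperature=0.05):
    """InfoNCE contrastive loss over one query's candidate scores:
    -log softmax(score / T) evaluated at the positive memory's index."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()                          # stable log-softmax
    log_p = z - np.log(np.exp(z).sum())
    return float(-log_p[pos_index])

# one positive memory among three hard negatives
loss_good = infonce_loss([0.9, 0.2, 0.1, 0.3], pos_index=0)  # positive ranked first
loss_bad  = infonce_loss([0.3, 0.2, 0.1, 0.9], pos_index=0)  # a negative outranks it
```

The BCE pointwise term would be trained alongside this on the soft teacher labels to keep absolute scores calibrated for threshold-based filtering.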

[NLP-54] On Time Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows

【速读】: 该论文旨在解决约束驱动的代理工作流(agentic workflows)在线资源分配问题,即在明确预算和截止时间约束下,最大化整个工作流成功完成的概率。传统方法聚焦于优化性能-成本-延迟的平均效率前沿,而实际部署中往往需要确保任务在限定资源内按时完成。解决方案的关键在于提出一种轻量级闭环规划算法——蒙特卡洛投资组合规划(Monte Carlo Portfolio Planning, MCPP),该方法通过模拟工作流执行来直接估计受约束的完成概率,并在观测到实际执行结果后进行重规划,从而动态调整模型与并行样本的分配策略,以最优方式利用剩余预算和时间。

链接: https://arxiv.org/abs/2605.06110
作者: Xinglin Wang,Zishen Liu,Shaoxiong Feng,Peiwen Yuan,Yiwei Li,Jiayi Shi,Yueqi Zhang,Chuyi Tan,Ji Zhang,Boyuan Pan,Yao Hu,Kan Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Agentic systems increasingly solve complex user requests by executing orchestrated workflows, where subtasks are assigned to specialized models or tools and coordinated according to their dependencies. While recent work improves agent efficiency by optimizing the performance–cost–latency frontier, real deployments often impose concrete requirements: a workflow must be completed within a specified budget and before a specified deadline. This shifts the goal from average efficiency optimization to maximizing the probability that the entire workflow completes successfully under explicit budget and deadline constraints. We study constraint-driven online resource allocation for agentic workflows. Given a dependency-structured workflow and estimates of success rates and generation lengths for each subtask–model pair, the executor allocates models and parallel samples across simultaneously executable subtasks while managing the remaining budget and time. We formulate this setting as a finite-horizon stochastic online allocation problem and propose Monte Carlo Portfolio Planning (MCPP), a lightweight closed-loop planner that directly estimates constrained completion probability through simulated workflow executions and replans after observed outcomes. Experiments on CodeFlow and ProofFlow demonstrate that MCPP consistently improves constrained completion probability over strong baselines across a wide range of budget–deadline constraints.
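The core estimator in MCPP, constrained completion probability via simulated executions, can be sketched for a purely sequential workflow. The paper handles dependency structures, allocation choices, and replanning, all of which this toy omits:

```python
import random

def mc_completion_probability(workflow, budget, deadline, n_sims=20000, seed=0):
    """Estimate P(all subtasks succeed AND total cost <= budget AND total
    time <= deadline) by simulating sequential workflow executions.
    Each subtask is a (success_prob, cost, latency) triple."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        cost = time = 0.0
        ok = True
        for p, c, t in workflow:
            cost += c
            time += t
            if rng.random() > p or cost > budget or time > deadline:
                ok = False
                break
        hits += ok
    return hits / n_sims

workflow = [(0.9, 1.0, 2.0), (0.8, 2.0, 3.0)]   # two sequential subtasks
p_hat = mc_completion_probability(workflow, budget=4.0, deadline=6.0)
```

With slack constraints the estimate approaches the analytic value 0.9 × 0.8 = 0.72; a planner would compare such estimates across candidate model/sample allocations and pick the maximizer after each observed outcome.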

[NLP-55] Uncovering Entity Identity Confusion in Multimodal Knowledge Editing

【速读】: 该论文旨在解决多模态知识编辑(Multimodal Knowledge Editing, MKE)中存在的一种系统性失效模式——实体身份混淆(Entity Identity Confusion, EIC),即编辑后的模型在仅通过文本查询原实体身份时,会错误地返回新实体的信息。其核心问题在于现有MKE方法未能区分图像-实体(Image-Entity, I-E)绑定与实体-实体(Entity-Entity, E-E)关系知识,导致模型将E-E关联作为捷径进行过拟合,从而使得图像仍被识别为原实体,而新实体名称仅作为虚假标签存在。解决方案的关键在于:将编辑操作限制在模型的I-E处理阶段,促使编辑更忠实于图像与实体之间的绑定关系,从而显著缓解EIC现象。这一发现为实现可信、准确的多模态知识编辑提供了方法论指导和设计原则。

链接: https://arxiv.org/abs/2605.06096
作者: Shu Wu,Xiaotian Ye,Xinyu Mou,Dongsheng Liu,Xiaohan Wang,Mengqi Zhang
机构: New Laboratory of Pattern Recognition (NLPR); State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS); Institute of Automation, Chinese Academy of Sciences; Beijing University of Posts and Telecommunications; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; Huazhong University of Science and Technology; Shandong University
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal knowledge editing (MKE) aims to correct the internal knowledge of large vision-language models after deployment, yet the behavioral patterns of post-edit models remain underexplored. In this paper, we identify a systemic failure mode in edited models, termed Entity Identity Confusion (EIC): edited models exhibit an absurd behavior where text-only queries about the original entity’s identity unexpectedly return information about the new entity. To rigorously investigate EIC, we construct EC-Bench, a diagnostic benchmark that directly probes how image-entity bindings shift before and after editing. Our analysis reveals that EIC stems from existing methods failing to distinguish between Image-Entity (I-E) binding and Entity-Entity (E-E) relational knowledge in the model, causing models to overfit E-E associations as a shortcut: the image is still perceived as the original entity, with the new entity’s name serving only as a spurious identity label. We further explore potential mitigation strategies, showing that constraining edits to the model’s I-E processing stage encourages edits to act more faithfully on I-E binding, thereby substantially reducing EIC. Based on these findings, we discuss principled desiderata for faithful MKE and provide methodological guidance for future research.

[NLP-56] Milestone-Guided Policy Learning for Long-Horizon Language Agents

【速读】: 该论文旨在解决长时程智能体任务(long-horizon agentic tasks)中基于强化学习(reinforcement learning)训练语言代理时面临的两大挑战:信用错位(credit misattribution)和样本效率低下(sample inefficiency)。前者指早期正确动作因最终失败而被错误惩罚,后者则源于成功轨迹稀少导致学习信号几乎消失。解决方案的关键在于提出一种基于里程碑引导的策略学习框架 BEACON,其核心机制包括:利用任务的组合结构在里程碑边界分割轨迹、在段内实施时间奖励重塑以精准分配部分进展的信用,并在双尺度上估计优势值以防止远端失败干扰局部动作评估。这一方法显著提升了长时程任务中的成功率与样本利用率。

链接: https://arxiv.org/abs/2605.06078
作者: Zixuan Wang,Yuchen Yan,Hongxing Li,Teng Pan,Dingming Li,Ruiqing Zhang,Weiming Lu,Jun Xiao,Yueting Zhuang,Yongliang Shen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves 92.9% success rate, nearly doubling GRPO’s 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is available at this https URL.
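One plausible reading of "partition trajectories at milestone boundaries, then apply temporal reward shaping within segments" is sketched below. The geometric discounting scheme is an assumption for illustration, not BEACON's actual formula:

```python
def shaped_rewards(trajectory, milestones, gamma=0.9):
    """Partition a step sequence at milestone boundaries, then give each step
    within a reached segment a temporally discounted share of that milestone's
    reward (steps closer to the milestone earn more)."""
    rewards = [0.0] * len(trajectory)
    start = 0
    for end, r in milestones:           # (step index reaching milestone, reward)
        for i in range(start, end + 1):
            rewards[i] = r * gamma ** (end - i)
        start = end + 1
    return rewards

traj = ["open drawer", "take knife", "go to table", "slice bread"]
r = shaped_rewards(traj, milestones=[(1, 1.0), (3, 1.0)])
```

Even when the final step fails, steps in completed segments keep their shaped credit, which is how partial progress avoids being penalized by a terminal failure.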

[NLP-57] Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

【速读】: 该论文旨在解决当前“定位-更新”(Locate-then-Update)范式在大语言模型(Large Language Models, LLMs)后训练中存在的一大根本性问题:基于静态参数提取的机制是否能有效指导动态参数更新。研究发现,Transformer电路在监督微调(Supervised Fine-Tuning, SFT)过程中呈现内在的“自由演化”特性,导致从当前状态获得的机制因存在时间延迟而无法准确预测未来状态,从而削弱了现有方法的有效性。解决方案的关键在于引入三个新指标——电路距离(Circuit Distance)、电路稳定性(Circuit Stability)和电路冲突(Circuit Conflict),从神经迁移、语义稳定性和跨任务干扰三个维度量化分析电路演化,并提出需具备“前瞻性”(foresight)的机制定位框架,以实现对未来状态的预测性干预。

链接: https://arxiv.org/abs/2605.06076
作者: Hang Chen,Jiaying Zhu,Hongyang Chen,Hongxu Liu,Xinyu Yang,Wenya Wang
机构: Xi’an Jiaotong University (西安交通大学); The Chinese University of Hong Kong (香港中文大学); Shaanxi Co., Ltd(Xi’an 710077, China) (陕西有限公司(西安 710077,中国)); China Mobile Group (中国移动集团); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL)
备注: 26 pages

点击查看摘要

Abstract:The “Locate-then-Update” paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted parameter updates. However, this paradigm rests on a fundamental yet unverified assumption: can mechanisms derived from current static parameters reliably guide future dynamic parameter updates? To investigate this, we systematically track the structural evolution of Transformer circuits throughout the supervised fine-tuning (SFT) process, revealing the underlying dynamics of task mechanisms. We introduce three novel metrics-Circuit Distance, Circuit Stability, and Circuit Conflict-to analyze circuit evolution across three dimensions: neural migration, semantic stability, and cross-task interference. Our empirical results reveal that circuits inherently exhibit “Free Evolution” during parameter updates. Consequently, static mechanisms extracted from current states inevitably suffer from temporal latency, making them fundamentally inadequate for guiding future states. Moreover, by deconstructing the “illusion of effectiveness” in existing methods, this work underscores the necessity of “foresight” in mechanistic localization and proposes a predictive framework for future research.

[NLP-58] Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理与规划任务中仍存在脆弱性、难以达到人类水平性能,且常伴随高时间与Token消耗的问题。其解决方案的关键在于引入一种可量化的“新颖性”(novelty)度量机制,用于评估树状思维(Tree of Thought, ToT)搜索树中新节点(thought)相对于已访问节点的独特性,该度量基于预训练模型的嵌入知识并通过提示(prompting)估计得出。利用此新颖性指标可有效剪枝冗余分支,从而减少搜索空间和整体Token开销,实现更高效的推理过程。

链接: https://arxiv.org/abs/2605.06040
作者: Leon Hamm,Zlatan Ajanovic
机构: RWTH Aachen (亚琛工业大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although advances such as chain-of-thought, tree-of-thought or reinforcement learning have improved the performance of LLMs in reasoning and planning tasks, they are still brittle and have not achieved human-level performance in many domains, and often suffer from high time and token costs. Inspired by the success of width-based search in planning, we explore how the concept of novelty can be transferred to language domains and how it can improve tree-of-thought reasoning. A tree of thoughts relies on building possible “paths” of consecutive ideas or thoughts. These are generated by repeatedly prompting an LLM. In our paper, a measurable concept of novelty is proposed that describes the uniqueness of a new node (thought) in comparison to nodes previously seen in the search tree. Novelty is estimated by prompting an LLM and making use of embedded general knowledge from pre-training. This metric can then be used to prune branches and reduce the scope of the search. Although this method introduces more prompts per state, the overall token cost can be reduced by pruning and reducing the overall tree size. This procedure is tested and compared using several benchmarks in language-based planning and general reasoning.
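新颖性剪枝的核心逻辑可以写成几行代码:新节点相对所有已见节点的最大相似度越高,新颖性越低。论文中该分数由提示 LLM 估计,下面用余弦相似度代替作示意(属于假设性替代,阈值也是示例值):

```python
import math

def cosine(u, v):
    """两个向量的余弦相似度; 零向量时返回 0。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_by_novelty(candidates, seen, threshold=0.2):
    """仅保留新颖性 (1 - 与任一已见节点的最大相似度) 超过阈值的候选思维,
    其余分支被剪掉, 从而缩小搜索树规模。"""
    kept = []
    for text, emb in candidates:
        novelty = 1.0 - max((cosine(emb, s) for s in seen), default=0.0)
        if novelty > threshold:
            kept.append(text)
    return kept
```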

[NLP-59] More Aligned, Less Diverse? Analyzing the Grammar and Lexicon of Two Generations of LLMs

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在生成英语新闻文本时,其句法结构与词汇多样性是否随模型迭代而发生变化的问题。研究通过对比不同代际的大语言模型(LLM)生成文本与人类撰写的《纽约时报》(New York Times, NYT)文章的语法分布特征,揭示了指令微调(instruction tuning)对模型输出表达范围的影响。解决方案的关键在于采用基于头驱动短语结构语法(Head-Driven Phrase Structure Grammar, HPSG)的形式化框架,并结合生态学与信息论中的多样性指标,量化分析句法构造和词类类型的分布差异,从而发现较新的 LLM 虽在连贯性和提示遵循性上提升,但其句法和尤其是词汇多样性显著低于早期非指令微调模型。

链接: https://arxiv.org/abs/2605.06030
作者: Adrián Gude,Roi Santos-Ríos,Francis Bond,Dan Flickinger,Carlos Gómez-Rodríguez,Olga Zamaraeva
机构: Universidade da Coruña, CITIC; Independent Researcher; Palacký University, Olomouc
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study contributes to a growing line of research in comparing LLM-generated texts with human-authored text, in this case, English news text. We focus in particular on the evaluation of syntactic properties through formal grammar frameworks. Our analysis compares two generations of LLMs in the context of two human-authored English news datasets from two different years. Employing the Head-Driven Phrase Structure Grammar (HPSG) formalism, we investigate the distributions of syntactic structures and lexical types of AI-generated texts and contrast them with the corresponding distributions in the human-authored New York Times (NYT) articles. We use diversity metrics from ecology and information theory to quantify variation in grammatical constructions and lexical types. We show that English news text has changed little in the given time frame, while newer LLMs display reduced syntactic and, especially, lexical diversity compared to older, non-instruction-tuned models. These findings point to future work in studying effects of instruction tuning, which, while enhancing coherence and adherence to prompts, may narrow the expressive range of model output.
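摘要中“来自生态学与信息论的多样性指标”通常指 Shannon 熵与 Simpson 指数一类的量;其标准定义如下(论文具体选用哪些指标以原文为准):

```python
import math

def shannon_entropy(counts):
    """Shannon 熵 (单位: bit), 衡量句法构造/词类类型分布的多样性。"""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def simpson_diversity(counts):
    """Gini-Simpson 指数: 随机抽两个实例属于不同类型的概率。"""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)
```

类型分布越均匀、类型数越多,两个指标都越高;词汇多样性下降即表现为这些数值走低。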

[NLP-60] From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence LREC2026

【速读】: 该论文旨在解决自动化事实核查系统难以获取和利用事实核查文章中丰富支撑证据的问题,因为这些证据通常以非结构化形式呈现。解决方案的关键在于提出PrimeFacts方法论与资源,通过识别文章内超链接作为自然锚点,利用大语言模型(LLMs)将原始句子重写为独立、上下文无关的前提陈述(premises),从而提取细粒度且可复用的证据。此方法显著提升了跨文章证据检索和声明验证任务的性能,在Mean Reciprocal Rank上相对提升达30%,在 verdict 预测上Macro-F1提升10–20个百分点,且效果稳定于不同判决粒度和模型架构。

链接: https://arxiv.org/abs/2605.06006
作者: Premtim Sahitaj,Jawan Kolanowski,Ariana Sahitaj,Veronika Solopova,Max Upravitelev,Daniel Röder,Iffat Maab,Junichi Yamagishi,Sebastian Möller,Vera Schmitt
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC 2026. To appear in the conference proceedings

点击查看摘要

Abstract:Fact-checking articles encode rich supporting evidence and reasoning, yet this evidence remains largely inaccessible to automated verification systems due to unstructured presentation. We introduce PrimeFacts, a methodology and resource for extracting fine-grained evidence from full fact-checking articles. We compile 13,106 PolitiFact articles with claims, verdicts, and all referenced sources, and we identify 49,718 in-article hyperlinks as natural anchors to pinpoint key evidence. Our framework leverages large language models (LLMs) to rewrite these anchor sentences into stand-alone, context-independent premises and investigates the extraction of additional implicit evidence. In evaluations on cross-article evidence retrieval and claim verification, the extracted premises substantially improve performance. Decontextualized evidence yields higher retrievability, achieving up to a 30 percent relative gain in Mean Reciprocal Rank over verbatim sentences, and using the evidence for verdict prediction raises Macro-F1 by 10-20 points over the baseline. These gains are consistent across different verdict granularities (2-class vs. 5-class) and model architectures. A qualitative analysis indicates that the decontextualized premises remain faithful to the original sources. Our work highlights the promise of reusing fact-checkers’ evidence for automation and provides a large-scale resource of structured evidence from real-world fact-checks.
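摘要中用于衡量证据检索质量的 Mean Reciprocal Rank(MRR)指标,其标准计算方式可用几行 Python 表达(数据与变量名为示意):

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """MRR: 对每个查询取第一条相关文档的倒数排名, 再对所有查询求平均;
    未命中任何相关文档的查询计 0 分。"""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```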

[NLP-61] Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks ICML2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)方面面临的有害微调(Harmful Fine-tuning, HFT)威胁问题。现有防御方法通常通过约束参数、梯度或内部表示来实现,但这些方法容易被持续的HFT攻击绕过,其根本原因在于高维参数空间的固有冗余性:攻击者可利用与防御约束正交的优化轨迹,在看似遵守安全限制的同时恢复有害能力。解决方案的关键在于提出安全瓶颈正则化(Safety Bottleneck Regularization, SBR),将防御焦点从冗余的参数空间转移到解嵌入层(unembedding layer),该层作为几何瓶颈;SBR通过将有害查询的最终隐藏状态锚定到安全对齐模型的状态,使模型即使在持续HFT下仍能保持安全响应,实验表明仅需一个安全锚点即可显著降低有害评分至10以下,同时维持良性下游任务的性能竞争力。

链接: https://arxiv.org/abs/2605.05995
作者: Guoxin Lu,Letian Sha,Qing Wang,Peijie Sun,Hao Zhou,Hua Dai,Fu Xiao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under persistent HFT. Extensive experiments confirm SBR’s effectiveness, demonstrating that utilizing just a single safety anchor is sufficient to reduce the Harmful Score to 10 while preserving competitive performance on benign downstream tasks.
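“将有害查询的最终隐藏状态锚定到安全对齐模型”这一项,最朴素的形式就是一个隐藏状态间的均方距离正则。下面给出一个玩具草图(采用朴素 MSE,函数名与具体形式均为示意性假设,并非论文公式):

```python
def sbr_anchor_loss(student_states, anchor_states):
    """SBR 锚定项的玩具示意: 在有害 query 上, 被微调模型的最终隐藏状态
    与冻结的安全对齐模型对应状态之间的均方距离; 距离越大惩罚越重,
    从而把优化轨迹约束在解嵌入层这一几何瓶颈附近。"""
    total, n = 0.0, 0
    for h, a in zip(student_states, anchor_states):
        total += sum((x - y) ** 2 for x, y in zip(h, a))
        n += len(h)
    return total / n
```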

[NLP-62] TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning ACL2026

【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)在生成治疗方案时存在的粗略、不完整及潜在不安全的问题,这些问题源于其依赖单次输出而缺乏显式验证机制。解决方案的关键在于提出TheraAgent框架,该框架采用迭代式的“生成-判断-修正”(generate-judge-refine)流程,模拟人类专家逐步优化治疗计划的推理过程,从而将初始草稿逐步转化为精确、全面且更安全的治疗方案。其中,核心创新是引入了针对治疗场景设计的评估模块TheraJudge,嵌入推理循环中以强制执行临床标准,确保生成结果符合医学规范。

链接: https://arxiv.org/abs/2605.05963
作者: Junkai Li,Yunghwei Lai,Tianyi Zhu,Zheng Long Lee,Weizhi Ma,Yang Liu
机构: Tsinghua University (清华大学); Institute for AI Industry Research (AIR), Tsinghua University (清华大学人工智能产业研究院); National University of Singapore (新加坡国立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Formulating a treatment plan is inherently a complex reasoning and refinement task rather than a simple generation problem. However, existing large language models (LLMs) mainly rely on one-shot output without explicit verification, which may result in rough, incomplete, and potentially unsafe treatment plans. To address these limitations, we propose TheraAgent, an agentic framework that replaces one-shot generation with an iterative generate-judge-refine pipeline. By mirroring the actual reasoning process of human experts who iteratively revise treatment plans, our framework progressively transforms coarse and incomplete drafts into precise, comprehensive, and safer therapeutic regimens. To facilitate the critical judge component, we introduce TheraJudge, a treatment-specific evaluation module integrated into the inference loop to enforce clinical standards. Experiments show TheraAgent achieves state-of-the-art results on HealthBench, leading in Accuracy and Completeness. In expert evaluations, it attains an 86% win rate against physicians, with superior Targeting and Harm Control. Moreover, the highly agreement between TheraJudge and HealthBench evaluations confirms the reliability of our framework.

[NLP-63] atarstan Toponyms: A Bilingual Dataset and Hybrid RAG System for Geospatial Question Answering

【速读】: 该论文旨在解决多语言地名数据上的自动地理空间问答(Geospatial Question Answering, GQA)问题,即如何在包含多种语言地名信息的结构化数据中准确回答用户关于地理位置、名称来源或行政归属等问题。其解决方案的关键在于构建了一个高质量的双语地名数据集(涵盖9,688条记录,93.1%具有地理坐标),并基于此构建了约39,000个带上下文和定位答案的问题-答案三元组;同时提出一种混合检索器,融合密集语义索引(multilingual-e5-large)与基于KD树和哈弗辛距离的地理空间过滤机制,在测试集上实现Recall@1=0.988、MRR=0.994,显著优于传统BM25和纯空间方法;此外,通过对比不同阅读器架构发现,XLM-RoBERTa-large在准确率上最优(EM=0.992,F1=0.994),且简单后处理可修复RuBERT在坐标类问题中的缺陷,从而提升整体系统鲁棒性与实用性。

链接: https://arxiv.org/abs/2605.05962
作者: Mullosharaf K. Arabov
机构: Kazan (Volga Region) Federal University (喀山(伏尔加地区)联邦大学)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:This paper addresses automatic geospatial question answering over multilingual toponymic data. An original bilingual dataset of toponyms of the Republic of Tatarstan is introduced, comprising 9,688 structured records with linguistic, etymological, administrative, and coordinate information (93.1% georeferenced). Based on this dataset, a question-answering corpus of approximately 39,000 question-context-answer triples is constructed with guaranteed answer localization. A hybrid retriever integrates dense semantic indexing (multilingual-e5-large) with geospatial filtering via KD-trees and haversine distance. On 500 test queries, the hybrid search achieves Recall@1=0.988, Recall@5=1.000, and MRR=0.994, significantly outperforming BM25 and purely spatial methods. Among tested reader architectures (RuBERT, XLM-RoBERTa-large, T5-RUS), XLM-RoBERTa-large attains the best quality: EM=0.992, F1=0.994. On raw outputs, RuBERT models fail on coordinate questions (F1=0) while XLM-RoBERTa-large reaches F1=0.984; however, simple post-processing eliminates numerical gaps and restores RuBERT accuracy to 100%. This discrepancy stems from tokenization differences and pre-training corpora composition. All resources (dataset, QA corpus, model weights, web demo) are openly published on Hugging Face. Results apply to geospatial QA services, geocoding, and digital humanities in multilingual regions.
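混合检索中的地理过滤依赖哈弗辛(haversine)大圆距离;其标准公式与一个线性扫描版的半径过滤示意如下(论文用 KD 树加速,此处从简;坐标示例为喀山与莫斯科的近似值):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """哈弗辛公式: 地球表面两点间大圆距离 (千米), 地球半径取 6371 km。"""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def spatial_filter(query_point, records, radius_km):
    """只保留距查询点 radius_km 以内且带坐标的记录, 供稠密检索器重排。"""
    lat, lon = query_point
    return [rec for rec in records
            if haversine_km(lat, lon, rec["lat"], rec["lon"]) <= radius_km]
```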

[NLP-64] TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity ACL2026

【速读】: 该论文旨在解决当前基础模型在复杂视觉与结构条件下进行多模态表格推理能力不足的问题。其解决方案的关键在于构建了一个名为TableVista的综合性基准,包含3,000个高质量表格推理问题,并通过多风格渲染与变换流程生成每个实例的10种不同视觉变体,最终形成30,000个多模态样本,从而实现对模型在多种视觉复杂性和结构复杂性场景下的多维评估。这一设计使得研究能够系统揭示现有模型在面对结构复杂布局和纯视觉输入时性能显著下降的现象,为提升表格理解模型的鲁棒性和一致性提供了关键洞见。

链接: https://arxiv.org/abs/2605.05955
作者: Zheyuan Yang,Liqiang Shang,Junjie Chen,Xun Yang,Chenglong Xu,Bo Yuan,Chenyuan Jiao,Yaoru Sun,Yilun Zhao
机构: Tongji University (同济大学); University of Bristol (布里斯托大学); Tianjin University (天津大学); Yale University (耶鲁大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: ACL 2026 Findings

点击查看摘要

Abstract:We introduce TableVista, a comprehensive benchmark for evaluating foundation models in multimodal table reasoning under visual and structural complexity. TableVista consists of 3,000 high-quality table reasoning problems, where each instance is expanded into 10 distinct visual variants through our multi-style rendering and transformation pipeline. This process encompasses diverse scenario styles, robustness perturbations, and vision-only configurations, culminating in 30,000 multimodal samples for a multi-dimensional evaluation. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary foundation models on TableVista. Through comprehensive quantitative and qualitative analysis, we find that while evaluated models remain largely stable across diverse rendering styles, they exhibit pronounced performance degradation on complex structural layouts and vision-only settings, revealing that current models struggle to maintain reasoning consistency when structural complexity combines with visually integrated presentations. These findings highlight critical gaps in current multimodal capabilities, providing insights for advancing more robust and reliable table understanding models.

[NLP-65] Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的幻觉问题,即模型生成与事实不符的内容。现有方法虽在纠正幻觉方面取得一定进展,但普遍存在对所有token无差别修正的问题,导致原本正确的生成内容也被破坏。其解决方案的关键在于提出PCNET——一种基于可 tractable density estimator 的概率电路(Probabilistic Circuit),用于建模LLM残差流(residual stream)上的概率分布。该方法通过精确计算负对数似然(Negative Log-Likelihood)识别幻觉为事实流形上的几何异常,无需采样、外部验证器或权重修改。进一步地,利用PCNET作为动态门控机制,在解码过程中仅对偏离事实区域的潜在状态触发PC-LDCD(Probabilistic Circuit Latent Density Contrastive Decoding)策略进行修正,从而实现精准干预并保留原始正确生成,显著提升幻觉检测准确率和生成内容的真实性。

链接: https://arxiv.org/abs/2605.05953
作者: Erik Nielsen,Elia Cunegatti,Marcus Vukojevic,Giovanni Iacca
机构: University of Trento (特伦托大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:One of the most critical challenges in Large Language Models is their tendency to hallucinate, i.e., produce factually incorrect responses. Existing approaches show promising results in terms of hallucination correction, but still suffer from a main limitation: they apply corrections indiscriminately to every token, corrupting also the originally correct generations. To overcome this drawback, we propose PCNET, a Probabilistic Circuit trained as a tractable density estimator over the LLM residual stream. The method detects hallucinations as geometric anomalies on the factual manifold, which is done via exact Negative Log-Likelihood computation, hence without the need for sampling, external verifiers, or weight modifications, as in existing techniques. To demonstrate its effectiveness, we exploit PCNET as a dynamic gate that distinguishes hallucinated from factual hidden states at each decoding step. This triggers our second main contribution, PC-LDCD (Probabilistic Circuit Latent Density Contrastive Decoding), only when the latent geometry deviates from factual regions, while leaving correct generations untouched. Across four LLMs, ranging from 1B to 8B models, and four benchmarks covering conversational reasoning, knowledge-intensive QA, reading comprehension, and truthfulness, PCNET achieves near-perfect hallucination detection across CoQA, SQuAD v2.0, and TriviaQA, with AUROC reaching up to 99%. Moreover, PC-LDCD obtains the highest True+Info, MC2, and MC3 scores on TruthfulQA in three out of four models, in comparison with state-of-the-art baselines, while reducing the mean corruption rate to 53.7% and achieving a preservation rate of 79.3%. Our proposed method is publicly available on GitHub.
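“对隐藏状态做精确 NLL 打分、超阈值才触发修正解码”的门控逻辑,可用一个玩具版密度模型示意(这里用对角高斯代替概率电路,阈值与形式均为示意性假设):

```python
import math

def nll_gate(hidden, mean, var, threshold):
    """PCNET 式门控的玩具示意: 在可精确求值的密度模型下计算隐藏状态的
    负对数似然 (NLL); NLL 超过阈值即视为偏离事实流形的几何异常,
    只在此时触发对比解码修正, 其余 token 保持原样。"""
    nll = sum(0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
              for x, m, v in zip(hidden, mean, var))
    return nll > threshold, nll
```

靠近分布中心的状态不会被标记,远离中心的状态 NLL 急剧升高、被门控拦截。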

[NLP-66] Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)生成文本的检测问题,尤其是现有检测方法依赖统计特征或模型特定启发式规则时所面临的脆弱性——这些方法易受改写(paraphrasing)和对抗攻击的影响,导致鲁棒性和可解释性不足。解决方案的关键在于提出LiSCP(Lightweight Stylistic Consistency Profiling),一种基于风格一致性分析的轻量级检测方法,其核心创新是构建融合离散风格特征与连续语义信号的一致性剖面,利用多模态引导改写文本变体中的风格稳定性来提升检测性能。实验表明,LiSCP在领域内检测上表现优异,并在跨域场景下相比现有方法最高提升11.79%,且在对抗攻击和人机混合场景中展现出显著鲁棒性。

链接: https://arxiv.org/abs/2605.05950
作者: Siyuan Li,Aodu Wulianghai,Xi Lin,Xibin Yuan,Qinghua Mao,Guangyan Li,Xiang Chen,Jun Wu,Jianhua Li
机构: Shanghai Jiao Tong University (上海交通大学); Chinese Academy of Sciences (中国科学院); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing prevalence of Large Language Models (LLMs) in content creation has made distinguishing human-written textual content from LLM-generated counterparts a critical task for multimedia moderation. Existing detectors often rely on statistical cues or model-specific heuristics, making them vulnerable to paraphrasing and adversarial manipulations, and consequently limiting their robustness and interpretability. In this work, we propose LiSCP, a novel lightweight stylistic consistency profiling method for robust detection of LLM-generated textual content, focusing on feature stability under adversarial manipulation. Our approach constructs a consistency profile that combines discrete stylistic features with continuous semantic signals, leveraging stylistic stability across multimodal-guided paraphrased text variants. Experiments spanning real-world multimedia news and movie datasets and conventional text domains demonstrate that LiSCP achieves superior performance on in-domain detection and outperforms existing approaches by up to 11.79% in cross-domain settings. Additionally, it demonstrates notable robustness under adversarial scenarios, including adversarial attacks and hybrid human-AI settings.
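“改写变体间的风格稳定性”可以用逐维特征方差来刻画:方差越低,该特征对改写越鲁棒。以下是一致性剖面思想的极简示意(具体特征集与聚合方式均为假设,并非 LiSCP 的实现):

```python
def stability_profile(feature_vectors):
    """输入同一文本若干改写变体的风格特征向量 (每行一个变体),
    输出逐维方差: 低方差维度即改写下稳定的特征, 可作为检测信号。"""
    n = len(feature_vectors)
    dims = len(feature_vectors[0])
    means = [sum(v[d] for v in feature_vectors) / n for d in range(dims)]
    return [sum((v[d] - means[d]) ** 2 for v in feature_vectors) / n
            for d in range(dims)]
```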

[NLP-67] MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware

【速读】: 该论文旨在解决当前视觉语言动作(Vision Language Action, VLA)模型训练中缺乏长时程、高保真第一人称视角(egocentric)数据的问题,现有数据集通常仅覆盖几分钟的短时轨迹,难以捕捉复杂机器人任务所需的长期时间依赖性。其解决方案的关键在于提出MobileEgo Anywhere框架,利用消费级移动设备(如智能手机)的多传感器融合能力实现小时级、稳定的第一人称轨迹采集,并配套开源移动应用与标准化数据处理流程,从而显著降低数据收集门槛,推动大规模、多样化长时序数据的获取,助力通用机器人策略的发展。

链接: https://arxiv.org/abs/2605.05945
作者: Senthil Palanisamy,Abhishek Anand,Satpal Singh Rathor,Pratyush Patnaik,Shubhanshu Khatana
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large-scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to capture the long-horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour-plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high-fidelity, long-term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are threefold: (1) we release a novel dataset comprising 200 hours of diverse, long-form egocentric data with persistent state tracking; (2) we open-source a mobile application that enables any user to record egocentric data, and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training-ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive-scale acquisition of long-horizon data across varied global environments, accelerating the development of generalizable robotic policies.

[NLP-68] Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

【速读】: 该论文旨在解决自回归模型中标准知识蒸馏(Knowledge Distillation)因分布不匹配(distribution mismatch)而导致性能下降的问题。传统基于策略的方法虽能缓解此问题,但依赖计算昂贵的强化学习(Reinforcement Learning, RL)框架。其解决方案的关键在于提出一种异步知识蒸馏方法——近策略蒸馏(Near-Policy Distillation, NPD),该方法通过解耦学生模型生成与训练过程,使监督微调(Supervised Fine-Tuning, SFT)能够结合序列打包(sequence packing)以提升效率。为应对异步更新带来的策略滞后(policy lag)和样本噪声问题,NPD进一步引入稀疏学生更新机制与Δ-IFD过滤机制(Δ-IFD filtering mechanism),后者通过筛选极端分布外样本,抑制梯度噪声,确保优化轨迹稳定在安全的邻近策略区域内,从而实现比SFT高8.09%的性能提升,并在推理效率上达到基线方法8.1倍的速度优势。

链接: https://arxiv.org/abs/2605.05940
作者: Miao Rang,Zhenni Bi,Hang Zhou,Kai Han,Xuechun Wang,An Xiao,Xinghao Chen,Yunhe Wang,Hanting Chen
机构: Huawei Technologies(华为技术); Tianjin University(天津大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables Supervised Fine-Tuning (SFT) with sequence packing. However, asynchronous updates inevitably introduce policy lag and sample noise, which can cause the behavior to drift from near-policy toward off-policy. To counteract this without sacrificing efficiency, NPD integrates sparse student updates and the \Delta -IFD filtering mechanism, a heuristic sample selection mechanism that empirically stabilizes the optimization trajectory. By filtering extreme out-of-distribution samples, \Delta -IFD prevents noise from dominating the gradients, ensuring updates remain within a safe proximal learning zone. Empirically, the NPD framework achieves a 8.1x speedup over on-policy baselines and outperforms SFT by 8.09%. Crucially, by effectively narrowing the exploration space for subsequent RL, our method enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73%, outperforming the substantially larger Qwen3-1.7B. Codes will be released soon.
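摘要未给出 Δ-IFD 过滤的具体公式;下面用“中位数绝对偏差(MAD)离群过滤”作一个思路相近的示意草图(纯属假设性替代,并非论文定义),体现“筛掉极端分布外样本、防止噪声主导梯度”这一点:

```python
def filter_outliers(scores, k=3.0):
    """Δ-IFD 式样本过滤的假设性替代: 丢弃与中位数偏差超过
    k 倍中位数绝对偏差 (MAD) 的样本索引, 保留落在安全邻近策略区内的样本。"""
    s = sorted(scores)
    med = s[len(s) // 2]
    mad = sorted(abs(x - med) for x in scores)[len(scores) // 2]
    if mad == 0:
        return [i for i, x in enumerate(scores) if x == med]
    return [i for i, x in enumerate(scores) if abs(x - med) <= k * mad]
```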

[NLP-69] Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

【速读】: 该论文旨在解决语音大语言模型(Speech Large Language Model, SLM)与文本大语言模型(Text Large Language Model, TLM)之间存在的显著模态差距(modality gap)问题。现有方法主要从输出端入手,试图使语音生成更接近文本特征,但效果有限。本文提出TextPro-SLM,其核心创新在于从输入端入手,通过引入WhisperPro统一语音编码器,将语音输入同步转化为文本标记(text tokens)和韵律嵌入(prosody embeddings),从而让语音输入更贴近具备韵律感知能力的文本大模型。该方案在仅需约1000小时训练音频的情况下,显著缩小了模态差距,并在语音语义和副语言理解任务上取得领先性能,验证了从输入侧优化模态对齐的有效性与数据效率。

链接: https://arxiv.org/abs/2605.05927
作者: Wenqian Cui,Xiao-Hui Li,Daxin Tan,Qiyong Zheng,Irwin King
机构: The Chinese University of Hong Kong(香港中文大学); Huawei Technologies(华为技术)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Work in progress

点击查看摘要

Abstract:Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines WhisperPro, a unified speech encoder that produces synchronized text tokens and prosody embeddings, with an LLM backbone trained to preserve the semantic capabilities of the original TLM while learning paralinguistic understanding. Experiments show that TextPro-SLM achieves the lowest modality gap among leading SLMs at both 3B and 7B scales, while also delivering strong overall performance on paralinguistic understanding tasks. These gains are achieved with only roughly 1,000 hours of LLM training audio, suggesting that reducing the modality gap from the input side is both effective and data-efficient.

[NLP-70] Logic-Regularized Verifier Elicits Reasoning from LLM s

【速读】: 该论文旨在解决当前生成式 AI(Generative AI)模型中推理能力增强依赖于资源密集型监督数据集构建的问题,该方法不仅成本高昂,且在数据多样性方面存在局限。解决方案的关键在于提出一种无监督验证器 LOVER,其通过引入逻辑规则作为先验知识对验证器进行正则化,将验证器建模为二元潜在变量,并利用内部激活信息,在多个推理路径上施加三种逻辑约束:否定一致性、组内一致性和组间一致性(按最终答案分组)。这种设计使 LOVER 能够有效利用未标注样本,并直接兼容任何现成的大语言模型(LLM),实验表明其性能显著优于无监督基线,平均达到监督验证器的 95% 水平。

链接: https://arxiv.org/abs/2605.05893
作者: Xinyu Wang,Changzhi Sun,Lian Cheng,Yuanbin Wu,Dell Zhang,Xiaoling Wang,Xuelong Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Verifiers are crucial components for enhancing modern LLMs’ reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we propose LOVER, an unsupervised verifier regularized by logical rules. LOVER treats the verifier as a binary latent variable, utilizing internal activations and enforcing three logical constraints on multiple reasoning paths: negation consistency, intra-group consistency, and inter-group consistency (grouped by the final answer). By incorporating logical rules as priors, LOVER can leverage unlabeled examples and is directly compatible with any off-the-shelf LLMs. Experiments on 10 datasets demonstrate that LOVER significantly outperforms unsupervised baselines, achieving performance comparable to the supervised verifier (reaching its 95% level on average). The source code is publicly available at this https URL.

[NLP-71] Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention

【速读】: 该论文旨在解决现有激活控制(activation steering)方法在推理阶段对语言模型行为调控时存在的局限性,即这些方法通常依赖于固定、单步且位置无关的变换操作,导致性能不如简单的上下文提示(in-context prompting),并且在未见概念上的泛化能力较差。其解决方案的关键在于提出一种基于流的激活控制方法(Flow-based Activation Steering, FLAS),该方法学习一个通用的、以概念为条件的速度场 $ v_t(h,t,c) $,能够无须预设干预形式地将未受控的激活映射到目标激活状态,从而实现多步、token变化且位置敏感的轨迹演化,有效提升了性能与泛化性。

链接: https://arxiv.org/abs/2605.05892
作者: Zehao Jin,Ruixuan Deng,Junran Wang,Xinjie Shen,Chao Zhang
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field v_t(h,t,c) that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of 1.015 on Gemma-2-2B-IT and 1.113 on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation space geometry might be incomplete.
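FLAS 学到的是一个速度场 v_t(h,t,c),推理时对激活做多步积分,而非一次性加固定 steering vector。下面用欧拉积分给出一个玩具示意(速度场、步数与步长均为假设):

```python
def flow_steer(h, velocity, steps=8, dt=0.125):
    """流式激活控制的玩具示意: 用欧拉法对速度场 velocity(h, t) 积分,
    把未受控激活 h 沿多步、可弯曲的轨迹输运到目标区域。"""
    t = 0.0
    for _ in range(steps):
        v = velocity(h, t)
        h = [hi + dt * vi for hi, vi in zip(h, v)]
        t += dt
    return h
```

例如取“朝目标 1 拉动”的线性场 v = 1 - h,激活会随步数逐渐逼近目标而非一步跳变。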

[NLP-72] Evaluation Awareness in Language Models Has Limited Effect on Behaviour

【速读】: 该论文旨在解决生成式 AI(Generative AI)在链式思维(Chain of Thought, CoT)中表现出评估意识(Verbalised Evaluation Awareness, VEA)是否会导致模型为迎合潜在评估标准而策略性调整输出的问题,进而可能掩盖真实行为倾向,如安全风险。其解决方案的关键在于设计了两种实验范式:一是“在策略”(on-policy)方法,通过对比同一输入下自发包含VEA与不包含VEA的多个CoT样本;二是“离策略”(off-policy)方法,通过预填充(prefilling)技术主动注入或移除VEA语句并重新采样,从而量化VEA对模型行为的影响。结果显示,VEA对模型输出影响微弱(效应量 ω ≤ 0.31),表明当前文献中将高VEA率视为战略行为或对齐操纵的证据需谨慎对待,评估意识带来的实际安全风险可能低于现有认知。

链接: https://arxiv.org/abs/2605.05835
作者: Amelie Knecht,Lucas Florin,Thilo Hagendorff
机构: University of Stuttgart (斯图加特大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 29 pages, 14 figures

点击查看摘要

Abstract:Large reasoning models (LRMs) sometimes note in their chain of thought (CoT) that they may be under evaluation. Researchers worry that this verbalised evaluation awareness (VEA) causes models to adapt their outputs strategically, optimising for perceived evaluation criteria, which, for instance, can make models appear safer than they actually are. However, whether VEA actually has this effect is largely unknown. We tested this across open-weight LRMs and benchmarks covering safety, alignment, moral reasoning, and political opinion. We tested this both on-policy, sampling multiple CoTs per item and comparing those that spontaneously contained VEA against those that did not, and off-policy, using model prefilling to inject evaluation-aware sentences where missing and remove them where present, with subsequent resampling. VEA has limited effect on model behaviour: injecting VEA into CoTs produces near-zero effects ( \omega \leq 0.06 ), removing it causes small shifts ( \omega \leq 0.12 ) and spontaneously occurring VEA shifts answer distributions by at most 3.7 percentage points ( \omega \leq 0.31 ). Our findings call for caution when interpreting high VEA rates as evidence of strategic behaviour or alignment tampering. Evaluation awareness may pose a smaller safety risk than the current literature assumes.
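文中效应量 ω 的一种标准定义是 Cohen's ω = sqrt(χ²/N)(论文是否恰用此式为此处假设),可由观测计数与期望比例直接算出:

```python
import math

def omega_effect_size(observed_counts, expected_props):
    """Cohen's omega: 先算观测计数相对期望分布的卡方统计量,
    再除以样本量 N 开方, 衡量分类答案分布偏移的幅度。"""
    n = sum(observed_counts)
    chi2 = sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed_counts, expected_props))
    return math.sqrt(chi2 / n)
```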

[NLP-73] LeakDojo: Decoding the Leakage Threats of RAG Systems ACL2026

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因外部知识库暴露而导致的数据泄露风险问题。随着RAG系统复杂度提升和大语言模型(Large Language Models, LLMs)指令遵循能力增强,现有研究尚未对RAG泄露风险进行系统性评估。论文提出LeakDojo这一可配置的框架,用于受控环境下对RAG泄露攻击进行基准测试,其关键在于通过统一实验平台实现多模型、多数据集与多样化RAG架构下的标准化攻击评估,从而揭示查询生成与对抗指令独立贡献于泄露风险、指令遵循能力越强泄露风险越高、以及RAG忠实性改进可能加剧泄露等核心发现,为实际部署中的RAG安全防护提供量化依据与实践指导。

链接: https://arxiv.org/abs/2605.05818
作者: Maosen Zhang,Jianshuo Dong,Boting Lu,Wenyue Li,Xiaoping Zhang,Tianwei Zhang,Han Qiu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Findings of ACL 2026

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to leverage external knowledge, but also exposes valuable RAG databases to leakage attacks. As RAG systems grow more complex and LLMs exhibit stronger instruction-following capabilities, existing studies fall short of systematically assessing RAG leakage risks. We present LeakDojo, a configurable framework for controlled evaluation of RAG leakage. Using LeakDojo, we benchmark six existing attacks across fourteen LLMs, four datasets, and diverse RAG systems. Our study reveals that (1) query generation and adversarial instructions contribute independently to leakage, with overall leakage well approximated by their product; (2) stronger instruction-following capability correlates with higher leakage risk; and (3) improvements in RAG faithfulness can introduce increased leakage risk. These findings provide actionable insights for understanding and mitigating RAG leakage in practice. Our codebase is available at this https URL.

[NLP-74] Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中因幻觉(hallucination)导致的可靠性问题,特别是针对仅通过API访问的商业黑盒LLM,其内部机制不可见,使得现有不确定性量化方法难以实时估计且无法捕捉推理过程中的隐含信息。解决方案的关键在于提出分布对齐对抗蒸馏(Distribution-Aligned Adversarial Distillation, DisAAD),该方法构建生成-判别架构,引导轻量级代理模型(proxy model)学习黑盒LLM输出分布中的高质量区域,从而赋予其判断黑盒模型是否“知道”答案的能力;随后利用该代理模型复现具体响应并基于证据学习(evidence learning)进行不确定性估计,实验证明即使代理模型仅为目标LLM规模的1%,也能实现可靠的不确定性量化。

链接: https://arxiv.org/abs/2605.05777
作者: Huizi Cui,Huan Ma,Qilin Wang,Yuhang Gao,Changqing Zhang
机构: Tianjin University(天津大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Large language models (LLMs) have progressed rapidly in complex reasoning and question answering, yet LLM hallucination remains a central bottleneck that hinders practical deployment, especially for commercial black-box LLMs accessible only via APIs. Existing uncertainty quantification methods typically depend on computationally expensive multiple sampling or internal parameters, which prevents real-time estimation and fails to capture information implicit in the black-box reasoning process. To address this issue, we propose Distribution-Aligned Adversarial Distillation (DisAAD), which introduces a generation-discrimination architecture to guide a lightweight proxy model to learn the high-quality regions of the output distribution of the black-box LLM, thus effectively endowing it with the ability to know whether the black-box LLM knows or not. Subsequently, we use the proxy model to reproduce the specific responses of the black-box LLM and estimate the corresponding uncertainty based on evidence learning. Extensive experiments have verified the effectiveness and promise of our proposed method, indicating that a proxy model even one that only accounts for 1% of the target LLM’s size can achieve reliable uncertainty quantification.

[NLP-75] Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning

【速读】: 该论文旨在解决差分隐私(Differential Privacy, DP)下的联邦微调(Federated Fine-tuning)中,由于低秩适配(LoRA)的乘法结构导致的聚合误差问题,该误差在DP噪声作用下被放大,进而损害模型训练的稳定性和准确性。现有方法采用统一更新模式或固定轮次交替策略,忽视了LoRA两个因子之间的结构不对称性以及训练过程中各轮次的动态变化。解决方案的关键在于提出AS-LoRA框架,其核心由三个维度构成:(i) 层级自由度(layer-wise freedom),每层独立选择激活组件;(ii) 轮次自适应性(round-wise adaptivity),选择策略随通信轮次动态调整;(iii) 基于损失函数二阶近似的曲率感知评分机制,用于指导组件选择。理论分析表明,AS-LoRA可消除层绑定调度的重建误差下限、加速收敛、隐式偏向平坦极小值点,且不增加额外隐私成本,在GLUE、SQuAD、CIFAR-100和Tiny-ImageNet等任务上显著优于基线方法,同时聚合开销降低33–180倍。

链接: https://arxiv.org/abs/2605.05769
作者: Myoungjun Kim,Sangwoo Park,Yoseob Han,Jin-Hyun Ahn
机构: Myongji University(明知大学); King’s College London(伦敦国王学院); Soongsil University(崇实大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to a conference

点击查看摘要

Abstract:Differentially private federated fine-tuning of large models with LoRA suffers from aggregation error caused by LoRA’s multiplicative structure, which is further amplified by DP noise and degrades both stability and accuracy. Existing remedies apply a single update mode uniformly across all layers and all communication rounds (or alternate them on a fixed schedule), ignoring both the structural asymmetry between the two LoRA factors and the round-wise dynamics of training. We propose AS-LoRA, an adaptive framework defined by three axes: (i) layer-wise freedom, in which each layer independently selects its active component, (ii) round-wise adaptivity, in which the selection updates over communication rounds, and (iii) a curvature-aware score derived from a second-order approximation of the loss. Theoretically, AS-LoRA eliminates the reconstruction-error floor of layer-tied schedules, accelerates convergence, implicitly biases solutions toward flatter minima, and incurs no additional privacy cost. Across GLUE, SQuAD, CIFAR-100, and Tiny-ImageNet under strict DP budgets and non-IID partitions, AS-LoRA improves over the federated LoRA baselines by up to +7.5 pp on GLUE and +12.5 pp on MNLI-mm for example, while matching or exceeding SVD-based aggregation methods at 33–180× lower aggregation cost and with negligible communication overhead. Code for the proposed method is available at this https URL.

[NLP-76] BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生物医学等高度专业化领域中性能不足的问题,特别是其无法有效调用生物医学工具这一关键限制。当前LLMs在通用场景下表现优异,但在生物医学任务中仍难以像临床专家或研究人员那样利用NCBI、Ensembl和UniProt等数据库提供的专业工具进行推理与决策。解决方案的关键在于构建一个名为BioTool的综合性生物医学工具调用数据集,该数据集包含34个高频使用的生物医学工具及7,040对高质量、人工验证的查询-API调用配对,覆盖变异分析、基因组学、蛋白质组学、进化生物学和一般生物学等多个子领域。通过在该数据集上微调一个40亿参数的LLM,显著提升了模型在生物医学工具调用任务中的表现,并优于最先进的商用模型(如GPT-5.1),同时专家评估也证实了集成BioTool微调后的工具调用模块可大幅提升下游任务的答案质量,从而有效增强LLMs在生物医学领域的实用能力。

链接: https://arxiv.org/abs/2605.05758
作者: Xin Gao,Ruiyi Zhang,Meixi Du,Peijia Qin,Pengtao Xie
机构: UC San Diego; MBZUAI
类目: Computation and Language (cs.CL)
备注: Published at ACL 2026; Code and data available at this https URL

点击查看摘要

Abstract:Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows. While recent general-domain tool-calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in-context learning and restrict models to a small set of tools. To address this gap, we introduce BioTool, a comprehensive biomedical tool-calling dataset designed for fine-tuning LLMs. BioTool comprises 34 frequently used tools collected from the NCBI, Ensembl, and UniProt databases, along with 7,040 high-quality, human-verified query-API call pairs spanning variation, genomics, proteomics, evolution, and general biology. Fine-tuning a 4-billion-parameter LLM on BioTool yields substantial improvements in biomedical tool-calling performance, outperforming cutting-edge commercial LLMs such as GPT-5.1. Furthermore, human expert evaluations demonstrate that integrating a BioTool-fine-tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage, highlighting the effectiveness of BioTool in enhancing the biomedical capabilities of LLMs. The full dataset and evaluation code are available at this https URL

[NLP-77] RVPO: Risk-Sensitive Alignment via Variance Regularization

【速读】: 该论文旨在解决当前无监督强化学习人类反馈(critic-less RLHF)方法在多目标奖励聚合时存在的“约束忽视”问题:由于采用算术平均聚合奖励信号,模型可能因某一高数值目标的成功而掩盖其他关键目标(如安全性或格式正确性)的失败,导致对“瓶颈”奖励的忽略,从而影响多目标对齐的可靠性。解决方案的关键在于提出一种风险敏感的框架——奖励方差策略优化(Reward-Variance Policy Optimization, RVPO),其核心是在优势值聚合阶段引入对奖励间方差的惩罚机制,将优化目标从“最大化总和”转变为“最大化一致性”。通过泰勒展开分析表明,LogSumExp(SoftMin)操作符可有效实现平滑的方差惩罚,从而避免模型过度偏向易达成的目标,保障复杂约束的执行效果。

链接: https://arxiv.org/abs/2605.05750
作者: Ivan Montero,Tomasz Jurczyk,Bhuwan Dhingra
机构: Apple(苹果)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing “bottleneck” rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from “maximize sum” to “maximize consistency.” We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from neglecting difficult constraints to exploit easier objectives, RVPO improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, p < 0.001) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating that variance regularization mitigates constraint neglect across model scales without sacrificing general capabilities.

[NLP-78] Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback

【速读】: 该论文旨在解决生成式 AI (Generative AI) 驱动的股票预测系统中决策过程不可观测的问题,即系统在多个相互依赖的决策环节(如市场状态识别、路径选择、强化学习控制)中的个体行为质量被整体指标(如平均绝对百分比误差 MAPE 或方向准确性)掩盖。为实现对这些隐蔽决策质量的精准评估,作者提出了一种基于行为痕迹的评估框架:将每五天的决策序列划分为一个“行为片段”,并由三位大型语言模型(LLM)评委从六个领域特定维度(市场状态识别、路径路由、适应性、风险校准、策略一致性、错误恢复)进行评分;通过扰动验证实验,证实该框架能有效识别目标维度的行为缺陷(平均分数下降达 -1.6 至 -2.4),且跨模型一致性高达 Krippendorff’s α = 0.85。关键创新在于将行为得分转化为软奖励惩罚项嵌入 Soft Actor-Critic (SAC) 算法,从而在仅使用验证期内三轮微调后,在独立测试集(2017–2025)上显著提升模型性能:MAPE 下降 11.5%,方向准确性提高至 74%,Sharpe 比率提升 18%。

链接: https://arxiv.org/abs/2605.05739
作者: Mohammad Al Ridhawi,Mahtab Haj Ali,Hussein Al Osman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Finance (q-fin.CP)
备注: 9 pages, 2 figures, 8 tables. Short Communication submitted to Knowledge-Based Systems (Elsevier)

点击查看摘要

Abstract:Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individual quality is hidden by aggregate metrics such as mean absolute percentage error (MAPE) or directional accuracy. We present a behavioral evaluation framework that addresses this gap. Behavioral traces logged at every autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro). Perturbation-based validation on 420 episodes yields targeted score drops of -1.6 to -2.4 on intended dimensions versus an average of -0.32 on the remaining five, with cross-model agreement up to Krippendorff’s \alpha = 0.85 . The composite behavioral score, used here only for cross-episode reporting, correlates at \rho = 0.72 with realized 20-day Sharpe ratio from offline backtesting. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty term added to the Soft Actor-Critic (SAC) reward. Three short fine-tuning cycles, all confined to the validation period, produce on the held-out 2017-2025 test period a one-day MAPE reduction from 0.61% to 0.54% (an 11.5% relative reduction; p < 0.001, Cohen’s d=0.31), a directional accuracy increase from 71% to 74%, and an 18% Sharpe ratio improvement (95% bootstrap CI [8.2%, 27.4%]), with gains concentrated in high-volatility episodes where the original system was most behaviorally deficient. Results are from offline backtesting and do not address effects specific to live deployment.

[NLP-79] ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长周期、多阶段任务中因推理错误累积而难以自我检测与恢复的问题。现有推理范式如思维链(Chain-of-Thought, CoT)、ReAct 和事后自检(post-hoc self-critique)依赖于两个失效的假设,导致模型无法有效识别自身失败,从而降低任务成功率。解决方案的关键在于提出 ReFlect——一种无需训练、仅在推理时运行的模型无关(model-agnostic)封装系统,其核心机制是将确定性的错误检测与恢复逻辑作为独立模块嵌入模型推理流程中,实现对推理过程的状态追踪和异常干预,显著提升任务成功率并改善代码修复质量(如 SWE-bench 上结构质量从 0% 提升至 82–87%)。

链接: https://arxiv.org/abs/2605.05737
作者: Fan Huang
机构: Indiana University Bloomington
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across reasoning steps, leaving an open question: can a reasoning system effectively detect and recover from its own failures? We present ReFlect, a harness system for LLM reasoning that creates standalone error detection and recovery logic as a deterministic wrapper around the model. Controlled experiments across 6 reasoning domains show that prompt-level self-critique produces formulaic templates that flag no issues in 90 of 100 audited reflection blocks, and the investigated LLMs wrongly accept a wrong answer in at least 76% of cases. Our ReFlect harness achieves task success rates ranging from 41% on gpt-4o-mini to 56% on Claude Sonnet 4.5 across six models spanning small and frontier scale, with per-model gains over Direct CoT ranging from +7 pp on Qwen2.5-72B to +29 pp on Claude Sonnet 4.5, and additionally raises SWE-bench patch-structural quality from 0% (Direct CoT) to between 82% (Qwen2.5-72B) and 87% (GPT-4o). Notably, the harness gain is inversely proportional to the model’s Direct CoT task success rate (the fitted slope is -1.69 with r=-0.76): each pp lost in baseline success rate is mechanically recovered by 1.69 pp of harness gain. We find that adding structured reasoning state and operators yields only 15.0–18.7% pair-mean on Llama-3.3-70B and Qwen2.5-72B because models at this scale cannot reliably populate the state its operators require. ReFlect is model-agnostic, training-free, and operates entirely at inference time.
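ReFlect 将"确定性的错误检测与恢复逻辑"做成模型外层的封装。下面是该思路的一个极简示意(solve/check 等接口名均为本文为演示而假设的,并非论文 API;玩具示例仅说明"检测失败后携带反馈重试"的控制流):

```python
def harness(solve, check, max_attempts=3):
    # 确定性封装:调用模型求解 -> 外部检查器验证 -> 失败则携带错误反馈重试
    feedback = None
    answer = None
    for _ in range(max_attempts):
        answer = solve(feedback)
        ok, error = check(answer)
        if ok:
            return answer
        feedback = error
    return answer  # 重试耗尽,返回最后一次结果

# 玩具示例:第一次输出错误,收到检查器反馈后修正
def toy_solve(feedback):
    return 42 if feedback else 41

def toy_check(answer):
    return (answer == 42, "expected 42")

print(harness(toy_solve, toy_check))  # 42
```

关键设计在于检测与恢复逻辑独立于模型本身:即使模型的自检(self-critique)流于形式,外层检查器仍能以确定性的方式拦截错误并触发重试。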

[NLP-80] More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)智能体系统中因组件堆叠导致的跨组件干扰(Cross-Component Interference, CCI)问题,即当多个功能模块(如规划、工具调用、记忆、自我反思和检索)组合使用时,其交互可能产生性能退化而非协同增益。解决方案的关键在于摒弃“越多越好”的默认策略,转而通过系统性实验与交互感知分析识别最优组件子集:研究发现最优组件数量任务依赖且受模型规模影响,采用主效应回归与Shapley值量化组件贡献并揭示非单调性(56.3%子模性违反),最终提出基于交互效应评估的任务定制化组件选择范式,从而显著优于全量组件集成(All-In)方案。

链接: https://arxiv.org/abs/2605.05716
作者: Ming Liu
机构: Amazon
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 5 tables; preprint, under review

点击查看摘要

Abstract:LLM agent systems are built by stacking scaffolding components (planning, tools, memory, self-reflection, retrieval) assuming more is better. We study cross-component interference (CCI): degradation when components interact destructively. We run a full factorial experiment over all 2^5=32 subsets of five components on HotpotQA and GSM8K with Llama-3.1-8B/70B (96 conditions, up to 10 seeds). The All-In system is consistently suboptimal: on HotpotQA, a single-tool agent surpasses All-In by 32% (F1 0.233 vs 0.177, p=0.023); on GSM8K, a 3-component subset beats All-In by 79% (0.43 vs 0.24, p=0.010). Optimal component count is task-dependent (k*=1-4) and scale-sensitive: at 70B, combinations that hurt at 8B provide gains, though All-In still trails the best subset. We fit a main-effects regression (R^2=0.916, adj-R^2=0.899, LOOCV=0.872), compute exact Shapley values, and find 183/325 submodularity violations (56.3%), showing greedy selection is unreliable. A three-body synergy among Tool Use, Self-Reflection, and Retrieval (INT_3=+0.175, 95% CI [+0.003,+0.351]) is reported as exploratory. CCI replicates across model families (Qwen2.5) and is robust to prompt paraphrasing. Our findings suggest maximally-equipped agent defaults should be replaced by task-specific subset selection via interaction-aware analysis.
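论文对五个组件的全部 2^5=32 个子集做全因子实验,并计算各组件的精确 Shapley 值。下面是精确 Shapley 计算的一个自包含示意(得分表为虚构的可加性数据,仅用于验证实现,并非论文结果):

```python
from itertools import combinations
from math import factorial

COMPONENTS = ["planning", "tools", "memory", "reflection", "retrieval"]

def shapley(scores, target):
    # 精确 Shapley 值:对 target 之外的所有子集求加权边际贡献
    # scores: frozenset(组件子集) -> 性能得分,需覆盖全部 2^5 个子集
    n = len(COMPONENTS)
    others = [c for c in COMPONENTS if c != target]
    value = 0.0
    for k in range(n):
        for subset in combinations(others, k):
            s = frozenset(subset)
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            value += weight * (scores[s | {target}] - scores[s])
    return value

# 虚构的可加性得分表:每个组件贡献固定增量(可加情形下 Shapley 值应恰为该增量)
gains = {"planning": 0.05, "tools": 0.20, "memory": -0.03,
         "reflection": 0.02, "retrieval": 0.10}
scores = {frozenset(sub): sum(gains[c] for c in sub)
          for k in range(6) for sub in combinations(COMPONENTS, k)}
print(round(shapley(scores, "tools"), 6))  # 可加情形下恰为 0.2
```

论文报告的 56.3% 子模性违反(submodularity violations)说明真实得分表并非这种可加形式——组件间存在非单调交互,因此贪心选择不可靠,需要如上对全部子集做交互感知分析。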

[NLP-81] Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中“分类-纠正鸿沟”(classification-correction gap)的问题,即模型在隐藏状态中存在可线性解码的失败信号(failure signals),但这些信号是否可用于有效纠正模型错误仍不明确。其关键解决方案在于提出并验证了“过度思考”(Overthinking, OT)这一稳定的行为模式:模型在重采样下能正确回答医疗问答任务,但在扩展链式推理(chain-of-thought)中却失败;该失败状态可在隐藏层中以71.6%的平衡准确率线性解码(p < 10⁻¹⁶)。尽管五类固定线性引导(steering)方法(共29种配置)均未能改善性能(Δ ≈ 0),且跨架构(Qwen2.5-7B)和跨领域(MMLU-STEM)结果一致,三重证据表明存在表征纠缠(representational entanglement)——OT方向与任务关键计算高度重叠(85–88%),非目标共享方向引导损害准确性(−12.1个百分点),而概念擦除(LEACE)仅对特定概念造成显著性能下降(−3.6个百分点,p=0.01)。值得注意的是,同一探测器(probe)虽无法通过固定线性引导实现纠错,却能用于选择性弃权(selective abstention),其外部AUROC达0.610,优于所有五种不确定性基线(p=0.009),说明可解码的失败结构支持事后可靠性估计,即便固定线性引导无法利用它进行纠正。

链接: https://arxiv.org/abs/2605.05715
作者: Ming Liu
机构: Amazon(亚马逊)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 22 pages (14 main + 8 appendix), 5 figures, 7 tables. Under review

点击查看摘要

Abstract:Can linearly decodable failure signals in LLM hidden states be leveraged to correct those failures? We investigate this classification-correction gap via Overthinking (OT)–a stable behavioral regime (Jaccard = 0.81, 94% inter-annotator agreement) in medical QA where models answer correctly under resampling yet fail in extended chain-of-thought. OT is linearly decodable at 71.6% balanced accuracy (p < 10^-16). Yet five families of fixed linear steering (29 configurations, n=1,273) all yield Delta ~= 0, with identical null results cross-architecture (Qwen2.5-7B) and cross-domain (MMLU-STEM). Three convergent lines of evidence suggest representational entanglement: the OT direction has 85-88% overlap with task-critical computation (specificity ratio = 0.152); non-targeted shared-direction steering damages accuracy (-12.1pp); and LEACE concept erasure damages accuracy (-3.6pp, p=0.01), while 10 random erasures produce Delta=+0.3pp. The per-instance probe-steering correlation is r=-0.002 (p=0.97). Positively, the same probe enables selective abstention (held-out AUROC=0.610, exceeding all five uncertainty baselines, p=0.009): decodable failure structure supports post-generation reliability estimation even when the fixed linear steering family cannot exploit it for correction.
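论文用 AUROC 评估探测器(probe)用于选择性弃权的能力。AUROC 等价于"随机取一个失败样本与一个成功样本,探测器给失败样本更高风险分"的概率,下面是这一秩统计形式的最小实现(得分数据为虚构示例,并非论文数据):

```python
def auroc(pos_scores, neg_scores):
    # 秩统计(Mann-Whitney)形式的 AUROC:
    # 随机取一正一负样本,正样本得分更高的概率;同分记 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# 虚构的探测器得分:失败样本(正类)应获得更高的"失败风险"分
failure_risk_on_failures  = [0.9, 0.7, 0.6, 0.55]
failure_risk_on_successes = [0.5, 0.4, 0.65, 0.3]
print(auroc(failure_risk_on_failures, failure_risk_on_successes))  # 0.875
```

论文中 0.610 的 AUROC 虽不算高,但已超过全部五个不确定性基线,说明隐藏状态中的失败结构确实携带了可用于事后可靠性估计的信息。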

[NLP-82] Decomposing the Basic Abilities of Large Language Models : Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning ICML2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多任务指令微调(multi-task instruct-tuning)过程中因共享参数导致的跨任务干扰(cross-task interference)问题。现有方法如任务特定神经元选择或专家混合(mixture-of-experts)虽有所缓解,但仍存在因部分参数被多个任务共用而引发的干扰。其解决方案的关键在于提出一种名为Basic Abilities Decomposition for multi-task Instruct-Tuning (BADIT)的新方法:通过实证发现某些参数始终协同激活并自然形成基础组块,从而假设LLMs编码若干正交的基本能力(basic abilities),且任意任务可表示为这些能力的线性组合;BADIT进一步将LLM参数分解为正交的高奇异值LoRA专家(LoRA experts),并通过秩-1组件的球面聚类动态强制训练中保持其正交性,有效降低跨任务干扰程度。

链接: https://arxiv.org/abs/2605.05676
作者: Bing Wang,Ximing Li,Changchun Li,Jinjin Chi,Gang Niu,Masashi Sugiyama
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026. 25 pages, 13 figures. Code: this https URL

点击查看摘要

Abstract:Recently, the prominent performance of large language models (LLMs) has been largely driven by multi-task instruct-tuning. Unfortunately, this training paradigm suffers from a key issue, named cross-task interference, due to conflicting gradients over shared parameters among different tasks. Some previous methods mitigate this issue by isolating task-specific parameters, e.g., task-specific neuron selection and mixture-of-experts. In this paper, we empirically reveal that the cross-task interference still exists for the existing solutions because of many parameters also shared by different tasks, and accordingly, we propose a novel solution, namely Basic Abilities Decomposition for multi-task Instruct-Tuning (BADIT). Specifically, we empirically find that certain parameters are consistently co-activated, and that co-activated parameters naturally organize into base groups. This motivates us to analogize that LLMs encode several orthogonal basic abilities, and that any task can be represented as a linear combination of these abilities. Accordingly, we propose BADIT that decomposes LLM parameters into orthogonal high-singular-value LoRA experts representing basic abilities, and dynamically enforces their orthogonality during training via spherical clustering of rank-1 components. We conduct extensive experiments on the SuperNI benchmark with 6 LLMs, and empirical results demonstrate that BADIT can outperform SOTA methods and mitigate the degree of cross-task interference.

[NLP-83] XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)安全评估基准普遍存在的语言偏见和文化敏感性缺失问题,即现有基准多以英语为中心并通过翻译构建,难以捕捉特定国家或地区的本土化危害,且缺乏对模型识别文化嵌入式敏感性的评估能力。其解决方案的关键在于提出XL-SafetyBench,一个涵盖10个国别-语言对、共5,500个测试用例的多阶段评测套件,包含基于本地化对抗提示的Jailbreak Benchmark和嵌入本地文化敏感性的Cultural Benchmark;并通过LLM辅助发现、自动化验证机制与双独立母语标注者协同确保质量,同时引入Neutral-Safe Rate (NSR) 和 Cultural Sensitivity Rate (CSR) 两个互补指标,以区分原则性拒绝与理解失败,从而实现更精细、跨文化的模型安全性评估。

链接: https://arxiv.org/abs/2605.05662
作者: Dasol Choi,Eugenia Kim,Jaewon Noh,Sang Seo,Eunmi Kim,Myunggyo Oh,Yunjin Park,Brigitta Jesica Kartono,Josef Pichlmeier,Helena Berndt,Sai Krishna Mendu,Glenn Johannes Tungka,Özlem Gökçe,Suresh Gehlot,Katherine Pratt,Amanda Minnich,Haon Park
机构: AIM Intelligence; Microsoft; Korea AISI; KT Corporation; BMW Group; Coinbase; Technical University of Munich; Ankara University; Cyril Amarchand Mangaldas; Seoul National University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model’s ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench, a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark where local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and dual independent native-speaker annotators per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness do not show a coupled relationship among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.

[NLP-84] Negative Before Positive: Asymmetric Valence Processing in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中情感极性(emotional valence)的机制可解释性问题,即情感内容是否通过专用的内部结构进行处理,还是仅依赖表面的token匹配。其关键解决方案在于利用激活修补(activation patching)和定向控制(steering)技术,在开源LLMs中识别出情感极性在不同网络深度上的分布特征:负向情感主要集中在早期层,而正向情感则在中后期层达到峰值;同时,固定主题并反转情感极性时产生相反响应,排除了主题检测的可能性;进一步地,在特定层施加“好消息”方向的控制可使中性提示转向正向情感,证明这些层确实编码了可操纵的情感极性方向。这一发现表明,情感极性在LLM中具有局部化、因果性和可操控性,是可解释性监督的明确目标。

链接: https://arxiv.org/abs/2605.05653
作者: Sohan Venkatesh
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic level. We study whether LLMs process emotional valence through dedicated internal structure or through surface token matching. Using activation patching and steering on open-source LLMs, we find that negative and positive valence are processed at different network depths. Negative outcomes localize to early layers while positive outcomes peak at mid-to-late layers. Holding topic fixed while flipping valence produces sign-opposite responses, ruling out topic detection. Steering with the good-news direction at the identified layers shifts neutral prompts toward positive valence, showing these layers encode valence as a manipulable direction. Emotional valence in LLMs is localized, causal and steerable, making it a concrete target for interpretability-based oversight.

[NLP-85] Architecture Matters: Comparing RAG Systems under Knowledge Base Poisoning

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在面对知识库投毒攻击时的脆弱性问题,尤其是针对那些能够处理检索信息冲突的先进架构(如多代理辩论、代理式检索、递归语言模型等)尚未被充分评估的现状。其解决方案的关键在于提出并实施一种名为CorruptRAG-AK的元认知对抗攻击框架,该框架通过精心设计的 adversarial framing(对抗性表述)来误导模型对信息可信度的判断,从而显著提升攻击成功率。实验表明,不同RAG架构在面对此类攻击时表现出巨大差异——从81.9%(vanilla RAG)到24.4%(Recursive Language Models)的攻击成功率达58个百分点,且多数攻击优势来源于内容推理阶段而非检索优化本身,揭示了跨架构漏洞的核心位置在于内容推理环节。

链接: https://arxiv.org/abs/2605.05632
作者: Samuel Korn
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems are vulnerable to knowledge base poisoning, yet existing attacks have been evaluated almost exclusively against vanilla retrieve-then-generate pipelines. Architectures designed to handle conflicting retrieved information - multi-agent debate, agentic retrieval, recursive language models - remain untested against adversarially optimized contradictions. We evaluate four RAG architectures (vanilla RAG, agentic RAG, MADAM-RAG, and Recursive Language Models) under controlled single-document (N=1) poisoning on 921 Natural Questions QA pairs, comparing a clean baseline, naive injection, and CorruptRAG-AK - an adversarial attack whose meta-epistemic framing targets credibility assessment. Architecture is a high-impact variable in adversarial robustness: under CorruptRAG-AK, attack success rates range from 81.9% (vanilla) to 24.4% (RLM) - a spread of nearly 58 percentage points across architectures with comparable clean accuracy (~92%). Decomposing this gap, once the poisoned document is retrieved, adversarial framing - not retrieval optimization - drives the majority of CorruptRAG-AK’s advantage for three of four architectures, localizing the cross-architecture vulnerability at the content-reasoning stage. Our MADAM-RAG reimplementation shows the highest apparent contradiction detection rate, though our LLM judge over-identifies this behavior (~48.5% precision), so reported rates are upper bounds. Regardless of detection, MADAM-RAG cannot resolve contradictions reliably, producing a 41.4% non-answer rate even on clean inputs - though implementation divergences from the original may contribute. We introduce a seven-category behavioral taxonomy capturing contradiction detection, hedging, and failure modes beyond binary accuracy. Code, data, and analysis notebooks are publicly available.

[NLP-86] One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

【速读】: 该论文旨在解决多轮对话中隐藏的恶意意图(hidden malicious intent)问题,即攻击者通过分散意图于多个看似无害的对话轮次中,规避现有大型语言模型(Large Language Models, LLMs)的安全机制。传统方法往往依赖单轮提示检测,难以识别这种渐进式、隐蔽的有害行为。解决方案的关键在于提出一种细粒度的turn-level干预策略——通过识别最早使累积交互达到可执行有害行为阈值的对话轮次(harm-enabling closure point),从而在不误拒良性探索性对话的前提下实现精准拦截。为此,作者构建了多轮意图数据集(Multi-Turn Intent Dataset, MTID),包含分支攻击轨迹、匹配的良性难例及最早危害启用轮次标注,并基于此训练出TurnGate监控器,显著优于现有基线方法,在跨领域、跨攻击路径和跨目标模型场景下均展现出良好泛化能力。

链接: https://arxiv.org/abs/2605.05630
作者: Xinjie Shen,Rongzhe Wei,Peizhi Niu,Haoyu Wang,Ruihan Wu,Eli Chien,Bo Li,Pin-Yu Chen,Pan Li
机构: Georgia Institute of Technology (佐治亚理工学院); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); OpenAI; National Taiwan University (台湾国立大学); IBM Research (IBM研究院); Virtue AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Project Website: this https URL

点击查看摘要

Abstract:Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at this https URL.

[NLP-87] When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多参与者对话中缺乏对发言时机(intervention timing)的准确判断问题,即模型难以区分何时应当发言(SPEAK)或保持沉默(SILENT),导致频繁打断和对话连贯性下降。解决方案的关键在于提出一个名为When2Speak的地面化合成数据集和四阶段生成流水线:首先基于真实对话构建包含215,000+样本的数据集,明确标注每个回合的发言决策;其次通过结构化增强、受控转录合成与可微分监督策略生成高质量训练信号;最后结合监督微调(Supervised Fine-Tuning, SFT)与不对称奖励塑造的强化学习(Reinforcement Learning with Asymmetric Reward Shaping),显著提升模型对适时干预的识别能力,使误漏率(Missed Intervention Rate, MIR)从0.50降至0.218,召回率从0.479提升至0.81,验证了时间参与度是可独立训练的对话智能维度,并证明地面化合成数据是提升LLMs多角色交互自然性的有效路径。

链接: https://arxiv.org/abs/2605.05626
作者: Vihaan Nama,Shreya Mendi,Zian Ye,Brinnae Bent
机构: Duke University (杜克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Currently under review. Dataset can be found: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads to excessive interruptions and degraded conversational coherence. We introduce When2Speak, a grounded synthetic dataset and four-stage generation pipeline for learning intervention timing in group interactions. The dataset comprises over 215,000 examples derived from 16,000 conversations involving 2-6 speakers, spanning diverse conversational styles, tones, and participant dynamics, and explicitly modeling SPEAK vs. SILENT decisions at each turn. Our pipeline combines real-world grounding, structured augmentation, controlled transcript synthesis, and fine-tuning-ready supervision, and is fully open-sourced to support reproducibility and adaptation to domain-specific conversational norms. Across multiple model families, supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines (e.g., the average Macro F1 increase across 4B+ parameter models was 60%, with the largest increase being 120%). However, SFT-trained models remain systematically over-conservative, missing nearly half of warranted interventions as seen through the Missed Intervention Rate (MIR), which was on average 0.50 and is noticed even at larger model sizes. To address this limitation, we apply reinforcement learning with asymmetric reward shaping, which reduces MIR to 0.186-0.218 and increases recall from 0.479 to 0.78-0.81. Our findings establish that temporal participation is a distinct and trainable dimension of conversational intelligence, and that grounded synthetic data provides an effective and scalable pathway for enabling LLMs to participate more naturally and appropriately in multi-party interactions.
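论文用不对称奖励塑造让模型降低漏发言率(MIR):漏掉应当介入的时机(gold=SPEAK 而模型选 SILENT)比不必要的打断罚得更重。下面是该思想的一个最小示意(惩罚系数为本文假设的演示值,论文未给出具体奖励函数形式):

```python
def shaped_reward(pred, gold, miss_penalty=2.0, interrupt_penalty=0.5):
    # 不对称奖励:漏发言比误打断惩罚更重,以纠正过度保守的策略
    if pred == gold:
        return 1.0
    if gold == "SPEAK" and pred == "SILENT":
        return -miss_penalty      # 漏掉应当介入的时机
    return -interrupt_penalty     # 不必要的打断

print(shaped_reward("SILENT", "SPEAK"))  # -2.0
```

由于 SFT 后的模型系统性偏向 SILENT(MIR 约 0.50),这种不对称惩罚把策略推向更积极的介入,对应论文中召回率从 0.479 提升至 0.78–0.81、MIR 降至 0.186–0.218 的效果。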

[NLP-88] The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在引入检索增强生成(Retrieval-Augmented Generation, RAG)机制后出现的“再污染”(recorruption)问题,即模型在获得准确外部上下文后反而放弃原本正确的预测。其核心原因是注意力机制的双重崩溃:一是视觉盲区(visual blindness),表现为视觉注意力质量(M_vis)和锐度(S_vis)被系统性抑制;二是结构位置偏差,导致模型过度依赖边界token而非语义相关性。为应对这一问题,作者提出无需参数调整、仅在推理阶段应用的“瓶颈注意力干预恢复”(Bottleneck Attention Intervention for Recovery, BAIR)方案,通过恢复视觉显著性并施加位置感知惩罚来削弱文本干扰项的影响,从而提升多模态对齐能力与诊断可靠性,且无需重新训练或微调模型。

链接: https://arxiv.org/abs/2605.05594
作者: Hoin Jung,Xiaoqian Wang
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate “oracle” context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass ( M_vis ) and sharpness ( S_vis ), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model’s textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors. Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.
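
The inference-time intervention described above can be sketched abstractly. The toy below boosts visual-token attention logits and applies a position-aware penalty that is strongest at sequence boundaries; the `beta`/`gamma` values and the linear penalty shape are illustrative assumptions, not BAIR's actual implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def bair_reweight(logits, is_visual, positions, n, beta=0.5, gamma=2.0):
    """Sketch of a BAIR-style intervention: restore visual saliency and
    penalize boundary-position text tokens. Hyperparameters are assumptions."""
    out = []
    for logit, vis, pos in zip(logits, is_visual, positions):
        if vis:
            out.append(logit + gamma)           # boost visual tokens
        else:
            # distance to the nearest sequence boundary, normalized to [0, 1]
            d = min(pos, n - 1 - pos) / (n - 1)
            out.append(logit - beta * (1 - d))  # penalize boundary tokens
    return softmax(out)

n = 6
logits = [0.0] * n                      # uniform attention before intervention
is_visual = [True, False, False, False, False, False]
attn = bair_reweight(logits, is_visual, list(range(n)), n)
print([round(a, 3) for a in attn])
```

After reweighting, the visual token dominates and boundary text tokens receive less mass than mid-sequence ones, mirroring the paper's diagnosis of positional copying bias.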

[NLP-89] Belief Memory: Agent Memory Under Partial Observability

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在长期上下文环境中依赖外部记忆时,因采用确定性存储机制而导致的自强化错误问题。现有方法将每次观察视为单一确定结论(如从临时错误推断“API X失败”),忽略了观察本身的不完整性与潜在模糊性,从而在后续决策中不断固化错误认知,难以修正。其解决方案的关键在于提出BeliefMem,一种基于概率的新型记忆范式:不再存储单一结论,而是以多候选结论及其概率的形式保存记忆条目,并通过Noisy-OR规则动态更新概率;检索时所有候选结论及其置信度同时呈现,使代理始终保有对备选方案的认知,从而在高置信度时做出可靠决策,同时保留根据新证据调整信念的能力。实验证明,该方法在LoCoMo和ALFWorld基准上显著优于主流基线,尤其在数据有限场景下表现最优。

链接: https://arxiv.org/abs/2605.05583
作者: Junfeng Liao,Qizhou Wang,Jianing Zhu,Bo Du,Rui Yan,Xiuying Chen
机构: MBZUAI; RIKEN AIP; UT Austin; Wuhan University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM agents that operate over long context depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring “API X failed” from temporary errors), even though such observations are inherently partial and potentially ambiguous. By committing to one conclusion and discarding uncertainty, these methods introduce self-reinforcing error: the agent acts on the stored conclusion, never revisits alternatives, and reinforces the conclusion over time. To address this issue, we propose BeliefMem, which shifts the memory paradigm from committing to a single conclusion per observation to retaining multiple candidate conclusions with their probabilities. Concretely, BeliefMem stores the candidate conclusions as separate memory entries, each carrying a probability that is updated via Noisy-OR rules as new observations arrive. At retrieval, all candidates surface together with their probabilities, keeping alternatives visible to the agent. Since each conclusion in memory retains its probability, BeliefMem preserves the uncertainty that the deterministic paradigm discards, enabling the agent to act with high confidence on well-evidenced knowledge while retaining the capacity to update its confidence when new evidence arrives. Empirical evaluations on LoCoMo and ALFWorld benchmarks show that, even with limited data, BeliefMem achieves the best average performance, remarkably outperforming well-known baselines. More broadly, such probabilistic memory produces substantial gains and explores a new direction for agent memory in partially observable environments.
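
The Noisy-OR update at the core of BeliefMem fits in a few lines. The class below is a hypothetical minimal interface (the paper's storage, retrieval, and likelihood-estimation machinery are more involved): each observation contributes a likelihood, and the stored probability of a conclusion is updated as P_new = 1 - (1 - P_old)(1 - likelihood), so repeated weak evidence strengthens a belief without ever hard-committing to it:

```python
class BeliefMem:
    """Minimal sketch of probabilistic memory entries updated via Noisy-OR."""

    def __init__(self):
        self.entries = {}  # conclusion -> probability

    def observe(self, conclusion, likelihood):
        # Noisy-OR update: P_new = 1 - (1 - P_old) * (1 - likelihood)
        p_old = self.entries.get(conclusion, 0.0)
        self.entries[conclusion] = 1.0 - (1.0 - p_old) * (1.0 - likelihood)

    def retrieve(self):
        # Surface ALL candidates with their probabilities, not one conclusion
        return sorted(self.entries.items(), key=lambda kv: -kv[1])

mem = BeliefMem()
# An ambiguous error: could mean the API is broken, or a transient network issue
mem.observe("API X is broken", 0.4)
mem.observe("network was transient", 0.4)
# A second similar observation strengthens the first candidate to 0.64
mem.observe("API X is broken", 0.4)
print(mem.retrieve())
```

Both candidates remain visible at retrieval, so the agent can act on the stronger one while retaining the alternative should later evidence favor it.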

[NLP-90] Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

【速读】: 该论文旨在解决强化学习中基于可验证奖励(verifiable rewards)的训练方法,特别是Group Relative Policy Optimization (GRPO) 在复杂任务中面临的“零优势问题”(zero-advantage problem)。当所有采样轨迹均失败时,相对优势归零,导致模型失去有效训练信号,浪费数据与计算资源。解决方案的关键在于提出Lorem Perturbation for Exploration (LoPE),其核心思想是通过在提示词(prompt)前添加任务无关的、由Lorem Ipsum词汇随机组装的序列来扰动输入空间,从而改变模型输出分布,解锁针对困难问题的正交推理路径(orthogonal reasoning pathways)。实验证明,该方法显著优于原始提示重采样策略,并且低困惑度的其他拉丁文随机序列亦具有效性,使LoPE成为大语言模型强化学习中拓展探索能力的强有力基线。

链接: https://arxiv.org/abs/2605.05566
作者: Langlin Huang,Chengsong Huang,Jinyuan Li,Donghong Cai,Yuyi Yang,Jiaxin Huang
机构: Washington University in St. Louis (圣路易斯华盛顿大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the “zero-advantage problem”: when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework to break this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model’s output distribution enough to unlock orthogonal reasoning pathways for hard questions. Specifically, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments across 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
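
Two pieces of this setup are easy to make concrete: the group-relative advantage that collapses to zero when all rollouts in a group receive the same reward, and the Lorem Ipsum prefix perturbation. The vocabulary, prefix length, and sampling below are illustrative assumptions, not the paper's exact configuration:

```python
import random

LOREM = ("lorem ipsum dolor sit amet consectetur adipiscing elit "
         "sed do eiusmod tempor incididunt ut labore").split()

def group_advantages(rewards):
    """GRPO-style group-relative advantages: reward minus the group mean.
    When every rollout fails (or every rollout succeeds), all advantages
    are zero and the query contributes no gradient signal."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def lorem_perturb(prompt, n_tokens=8, seed=None):
    """Prepend a task-irrelevant pseudo-Latin sequence before resampling
    (a sketch of LoPE; the real method's length/sampling scheme may differ)."""
    rng = random.Random(seed)
    prefix = " ".join(rng.choice(LOREM) for _ in range(n_tokens))
    return prefix + "\n\n" + prompt

# The zero-advantage problem: identical rewards carry no training signal
print(group_advantages([0, 0, 0, 0]))
print(lorem_perturb("Prove that 2^n > n for all n >= 1.", seed=0))
```

The perturbation changes nothing about the task itself; it only shifts the model's conditioning context in the hope of unlocking a different reasoning trajectory.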

[NLP-91] A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction

【速读】: 该论文旨在解决法律领域结构化合同信息提取任务中,如何在显著降低计算成本的前提下实现与前沿大型语言模型(Large Language Models, LLMs)相当甚至更优的性能问题。其核心解决方案是采用一个专为法律领域训练的小型语言模型(Small Language Model, SLM)——Olava Extract,该模型基于自托管的专家混合(Mixture of Experts)架构,在不依赖外部大规模云端部署的情况下,实现了更高的精度和更低的推理成本。关键突破在于:Olava Extract在宏观F1(0.812)和微宏观F1(0.842)上优于五种前沿模型,同时将推理成本降低了78%至97%,并减少了幻觉(hallucination)导致的不可靠提取,从而在法律场景中提升了可信赖性和实用性,挑战了企业级AI必须依赖巨型模型和集中式基础设施的传统认知。

链接: https://arxiv.org/abs/2605.05532
作者: Nicole Lincoln,Nick Whitehouse,Jaron Mar,Rivindu Perera
机构: Onit AI Labs, Onit Inc.(Onit公司)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This paper evaluates whether a domain trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self hosted legal domain Mixture of Experts model, against five frontier models. Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842, while reducing inference cost by 78% to 97% compared with the frontier models tested. It also achieved the highest precision scores, producing fewer hallucinated and unsupported extractions, an important distinction in legal workflows where hallucinations create operational risk and downstream review burden. The findings show that high performing, human comparable legal AI no longer requires the largest externally hosted models. More broadly, they challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.

[NLP-92] Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation

【速读】: 该论文旨在解决自然语言接口到数据库(Natural Language Interface to Databases, NLIDB)系统缺乏理论基础以评估和设计的问题。其解决方案的关键在于提出QUEST框架,该框架由两个独立动机驱动的组件构成:一是FAR结构不变性(FAR structural invariant),指出所有合法查询均可归约为过滤(Filter)、聚合(Aggregate)和返回(Return)操作;二是W5H维度框架(W5H dimensional framework),表明所有筛选条件可映射至六个语义维度(Who、What、Where、When、Why、How)。通过在五个文本转SQL数据集上的验证,研究发现FAR一致性在各类领域和模式中普遍成立,而W5H维度分布存在显著差异,尤其在医疗领域中时间(WHEN)和人物中心(WHO)维度占比极高,且因果(WHY)与机制(HOW)推理几乎缺失,揭示了当前系统在真实机器推理能力上的局限性。

链接: https://arxiv.org/abs/2605.05525
作者: Vicki Stover Hertzberg,Eduardo Valverde,Joyce C. Ho
机构: Emory University (埃默里大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注: 13 pages

点击查看摘要

Abstract:Natural language interfaces to databases have gained popularity, yet the theoretical foundations for evaluating and designing these systems remain underdeveloped. We present QUEST (Query Understanding Evaluation through Semantic Translation), a framework resting on two independently motivated components: the FAR structural invariant, which holds that every well-formed query reduces to Filter, Aggregate, and Return operations; and the W5H dimensional framework, which holds that all filtering criteria map to six semantic dimensions (Who, What, Where, When, Why, and How). Validated across five text-to-SQL datasets (n = 120,464), FAR conformance is universal across all domains and schema types, while W5H dimensional profiles vary substantially. Healthcare queries are strongly concentrated in temporal (WHEN: 80.4%) and person-centric (WHO: 73.0%) dimensions far exceeding general-domain benchmarks, and causal (WHY) and mechanistic (HOW) reasoning are near-zero everywhere, with apparent HOW exceptions reflecting quantitative aggregation rather than genuine procedural reasoning. These results identify a frontier that must be crossed for genuine machine reasoning over structured data.
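
A toy version of the FAR decomposition and W5H profiling might look as follows. The cue lists are invented for illustration and are not the paper's annotation rules; QUEST's actual validation maps SQL predicates to dimensions rather than keyword-matching natural language:

```python
# Illustrative keyword cues for the six W5H dimensions (assumption, not QUEST's taxonomy)
W5H_CUES = {
    "WHO":   ["patient", "doctor", "customer", "user"],
    "WHAT":  ["drug", "product", "item", "diagnosis"],
    "WHERE": ["ward", "city", "hospital", "region"],
    "WHEN":  ["admitted", "since", "before", "after", "year", "date"],
    "WHY":   ["because", "cause", "reason"],
    "HOW":   ["procedure", "method", "mechanism"],
}

def far_decompose(filters, aggregate, returns):
    """Every well-formed query reduces to Filter, Aggregate, Return (FAR)."""
    return {"Filter": filters, "Aggregate": aggregate, "Return": returns}

def w5h_profile(filters):
    """Count which semantic dimensions the filtering criteria touch."""
    profile = {dim: 0 for dim in W5H_CUES}
    for f in filters:
        for dim, cues in W5H_CUES.items():
            if any(c in f.lower() for c in cues):
                profile[dim] += 1
    return profile

q = far_decompose(
    filters=["patients admitted after 2020", "ward = ICU"],
    aggregate="COUNT",
    returns=["patient_id"],
)
print(w5h_profile(q["Filter"]))
```

On this healthcare-flavored example the profile concentrates in WHO, WHEN, and WHERE with zero WHY/HOW, echoing the dimensional skew the paper reports for healthcare queries.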

[NLP-93] Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks

【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models)生成文本中水印(Watermarking)在多轮重写(Rewriting)场景下的鲁棒性问题。当前多数水印方案基于自回归生成假设,而扩散模型以任意顺序去噪生成文本,使得传统方法失效;尽管已有研究提出针对扩散模型的水印方案,但其在多次文本改写后的有效性尚未被充分评估。本文的关键解决方案是通过系统性实验验证水印在多种重写风格(如改写、简化、学术化等)和多跳链式重写(最多五跳)下的衰减特性,结果表明:即使使用相同的水印配置,原始水印检测率从87.9%下降至单次重写后14–41%,五次重写后仅剩4.86%,说明重复改写是一种远强于单次改写的攻击方式,且该结论在四个不同规模(1.5B–8B参数)的开放权重模型上均成立。

链接: https://arxiv.org/abs/2605.05503
作者: Mohd Ruhul Ameen,Akif Islam,Nadim Mahmud,Md. Ekramul Hamid
机构: Marshall University (马歇尔大学); University of Rajshahi (拉杰沙希大学); Miami University (迈阿密大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Statistical watermarking is a common approach for verifying whether text was written by a language model. Most existing schemes assume autoregressive generation, where tokens are produced left to right and contextual hashing is well defined. Diffusion language models generate text by denoising tokens in arbitrary order, so these schemes cannot be applied directly. A recent watermark by Gloaguen et al. addresses this gap for LLaDA 8B Instruct and reports true positive detection above 99%. This paper studies what happens when watermarked text is rewritten not once but several times. Using the same watermark configuration, 1,605 watermarked completions of about 300 tokens each are produced across five WaterBench domains. Each completion is rewritten by four open weight language models, from 1.5B to 8B parameters, none of which know the watermark key. Five rewrite styles are tested: paraphrase, humanize, simplify, academic, and summarize expand. Each style is chained for up to five hops, producing 160,500 rewritten texts in total. The watermark is detected on 87.9% of the original outputs at the standard significance threshold. After a single rewrite, detection falls to between 14% and 41% depending on the rewriter and style. After five chained rewrites, detection falls to 4.86%, meaning 94.76% of the originally detected texts are no longer flagged. After three rewrites, the detector score has dropped 86% of the way from its watermarked baseline toward the null distribution. Repeated rewriting is therefore a much stronger attack than a single rewrite, and the result holds across all four rewriters tested.
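
The decay under chained rewriting can be reproduced qualitatively with a toy green-list detector. This is a KGW-style stand-in, not the diffusion watermark actually studied; the point is only the mechanism: each rewrite hop re-randomizes a fraction of tokens without the key, so the detector z-score collapses geometrically toward the null distribution:

```python
import math
import random

def green_fraction_z(tokens, green, gamma=0.5):
    """z-score of the green-token fraction against the null rate gamma."""
    hits = sum(1 for t in tokens if t in green)
    n = len(tokens)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

rng = random.Random(0)
vocab = list(range(1000))
green = set(vocab[:500])  # gamma = 0.5 green list

# Watermarked text over-samples green tokens (90% biased draws)
text = [rng.choice(vocab[:500]) if rng.random() < 0.9 else rng.choice(vocab)
        for _ in range(300)]

zs = []
for hop in range(6):
    zs.append(green_fraction_z(text, green))
    # A keyless rewrite replaces ~half the tokens uniformly at random
    text = [rng.choice(vocab) if rng.random() < 0.5 else t for t in text]

print([round(z, 2) for z in zs])
```

Each hop halves the expected deviation from the null, which is why chained rewriting is so much stronger an attack than a single paraphrase.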

[NLP-94] ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理复杂程序合成任务时效率低且可靠性差的问题,尤其是在需要大规模组合搜索的困难实例上。其解决方案的关键在于利用少量推理轨迹(reasoning traces),通过编码代理(coding agents)将其编译为可在受限领域特定语言(Domain-Specific Language, DSL)上复用的符号程序合成器(symbolic program synthesizers)。这些合成器在测试阶段无需调用LLM,即可作为独立系统运行,展现出高准确率(如PBEBench-Lite达91.3%、PBEBench-Hard达84.7%),并显著优于依赖LLM测试时扩展的基线方法;同时,它们还能与LLM搜索互补,在降低token消耗的同时提升性能,且具备零样本迁移能力,适用于真实历史语言学任务(如音变预测),实现高效、可扩展的通用求解器归纳。

链接: https://arxiv.org/abs/2605.05485
作者: Atharva Naik,Yash Mathur,Prakam,Carolyn Rose,David Mortensen
机构: Carnegie Mellon University (卡内基梅隆大学); Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances requiring large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic program synthesizers over constrained DSLs. The resulting solvers require no LLM calls at test time and are strong standalone systems: symbolic solver ensembles reach 91.3% accuracy on PBEBench-Lite and 84.7% on PBEBench-Hard, outperforming LLMs with test-time scaling for the latter by +16.3 percentage points at zero LLM inference cost. They also complement LLM search, improving PBEBench-Hard accuracy from 68.4% to 85.8% while reducing reported token usage by 78%, and raising SLR-Bench hard-tier accuracy from 34.4% to 58.0% in a neuro-symbolic hybrid setting. Compared to directly using coding agents as per-instance solvers, induced solvers are substantially more Pareto-efficient, amortizing a small one-time construction cost over many zero-token executions. Finally, most solvers transfer zero-shot to a real historical linguistics task - predicting sound changes in natural language data - reaching 80.1% accuracy under ensembling and recovering some plausible linguistic rules. Together, these results show that reasoning traces can be compiled into reusable symbolic solvers that solve many tasks directly, complement LLM inference on hard cases, and provide a scalable route to domain-general solver induction. We release code and data for reproducibility.
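
The kind of symbolic synthesizer being compiled here can be sketched as enumerative search over a constrained DSL. The three string operations and depth-bounded pipeline search below are an illustrative toy, not one of the paper's induced solvers; the key property is that, once built, it solves programming-by-example instances with zero LLM calls:

```python
from itertools import product

# A tiny constrained DSL of string transformations (illustrative assumption)
OPS = {
    "lower": str.lower,
    "strip_digits": lambda s: "".join(c for c in s if not c.isdigit()),
    "first3": lambda s: s[:3],
}

def synthesize(examples, max_depth=3):
    """Enumerate op pipelines (shortest first) until one satisfies every
    input-output example; returns the pipeline as a tuple of op names."""
    for depth in range(1, max_depth + 1):
        for pipeline in product(OPS, repeat=depth):
            def run(s, p=pipeline):
                for name in p:
                    s = OPS[name](s)
                return s
            if all(run(i) == o for i, o in examples):
                return pipeline
    return None

examples = [("AB12CD", "abc"), ("Hello7", "hel")]
print(synthesize(examples))
```

Construction cost is paid once; every subsequent instance is solved by pure enumeration, which is the Pareto-efficiency argument the abstract makes against per-instance coding agents.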

[NLP-95] A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks

【速读】: 该论文旨在解决知识图谱(Knowledge Graph)在实际应用中因自动构建过程引入的噪声、碎片化和语义不一致等问题,导致图神经网络(Graph Neural Networks, GNNs)在下游任务中性能受限且难以评估的问题。其关键解决方案是提出一个双目标基准测试框架,该框架基于单一文本语料库构建两个自动抽取的知识图谱与一个由专家标注的高质量参考图谱,从而实现对GNN模型鲁棒性与图构建方法有效性的联合评估;同时提供标准化、可复现且可扩展的评价体系,支持新图构建方法与学习模型的集成与比较。

链接: https://arxiv.org/abs/2605.05476
作者: Othmane Kabal,Mounira Harzallah,Fabrice Guillet,Hideaki Takeda,Ryutaro Ichise
机构: Nantes University, LS2N, Nantes, 44300, France; National Institute of Informatics, Chiyoda-ku, Tokyo, 101-8430, Japan; Institute of Science Tokyo, Tokyo, 152-8550, Japan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge graphs automatically constructed from text are increasingly used in real-world applications. However, their inherent noise, fragmentation, and semantic inconsistencies significantly affect the performance of Graph Neural Networks (GNNs) on downstream tasks. Assessing their performance and robustness remains difficult, as it is often unclear whether observed results stem from the learning model or from the quality of the constructed graph itself. In this work, we introduce a dual-purpose benchmark designed to jointly evaluate (i) the performance of GNNs on noisy, text-derived graphs and (ii) the effectiveness of graph construction methods on a downstream task. The benchmark is built in the biomedical domain from a single textual corpus and includes two automatically constructed graphs generated using different extraction methods, alongside a high-quality reference graph curated by experts that serves as an upper performance bound. This design enables controlled comparison of construction methods and systematic evaluation of GNN robustness through semi-supervised node classification. We further provide a standardized, reproducible, and extensible evaluation framework, facilitating the integration of new graph extraction methods and learning models.

[NLP-96] SLAM: Structural Linguistic Activation Marking for Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)水印技术中检测准确性与文本质量之间的权衡问题:现有方案通常通过扰动词元(token)分布来嵌入水印,导致可测量的文本自然度和多样性下降。其解决方案的关键在于提出一种白盒水印方法SLAM(Structural Linguistic Activation Marking),该方法将水印信息编码在残差流(residual-stream)中与语言结构相关的稀疏方向上,而非改变词元频率;通过因果性地引导这些结构方向进行生成,从而在不约束词汇采样和语义的前提下实现高精度检测(100%准确率),同时将质量损失控制在极低水平(仅1-2奖励点),显著优于KGW、EWD和Unigram等传统方法。

链接: https://arxiv.org/abs/2605.05443
作者: Fabrice Harel-Canada,Amit Sahai
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, under review

点击查看摘要

Abstract:LLM watermarks must be detectable without compromising text quality, yet most existing schemes bias the next-token distribution and pay for detection with measurable quality loss. We present SLAM (Structural Linguistic Activation Marking), a novel white-box watermarking scheme that sidesteps this cost by writing the mark into structural geometry rather than token frequencies: sparse autoencoders identify residual-stream directions encoding linguistic structure (e.g., voice, tense, clause order), and we causally steer those directions at generation time, leaving lexical sampling and semantics unconstrained. On Gemma-2 2B and 9B, SLAM achieves 100% detection accuracy with a quality cost of only 1-2 reward points - compared to 7.5-11.5 for KGW, EWD, and Unigram - with naturalness and diversity preserved at near-unwatermarked levels across both models. The trade-off is a complementary robustness profile: SLAM resists word-level edits but is vulnerable to paraphrase that restructures syntax (at a quality cost), the converse of token-distribution methods.
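
The core mechanic, steering residual-stream activations along a structural direction and later detecting the mark as a projection bias, can be sketched with synthetic vectors. Real SLAM finds its directions with sparse autoencoders inside the model; here the direction, hidden states, and threshold are stand-in assumptions:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, direction, alpha=2.0):
    """Add alpha times a structural direction to a residual-stream vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def detect(states, direction, threshold=1.0):
    """Flag text whose mean projection onto the mark direction is high."""
    mean_proj = sum(dot(s, direction) for s in states) / len(states)
    return mean_proj > threshold, mean_proj

rng = random.Random(0)
dim = 16
direction = [1.0 if i == 0 else 0.0 for i in range(dim)]  # unit mark direction
clean = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]
marked = [steer(s, direction) for s in clean]

flag_clean, p0 = detect(clean, direction)
flag_marked, p1 = detect(marked, direction)
print(flag_clean, flag_marked)
```

Because the mark lives in activation geometry rather than token frequencies, lexical sampling is left unconstrained, which is the mechanism behind the near-unwatermarked quality scores reported above.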

[NLP-97] Agentic Retrieval-Augmented Generation for Financial Document Question Answering

【速读】: 该论文旨在解决金融文档问答(Financial Document Question Answering, FDQA)中复杂多步数值推理问题,其核心挑战在于从企业披露文件中的异构证据(如结构化表格、文本叙述和脚注)中进行精确计算与逻辑组合,而现有检索增强生成(Retrieval-Augmented Generation, RAG)方法采用单次检索-生成范式,难以应对金融分析中常见的组合推理链。解决方案的关键在于提出FinAgent-RAG框架,通过三个领域定制化创新实现高精度数值推理:(1) 使用难负样本挖掘训练的对比金融检索器(Contrastive Financial Retriever),可区分语义相似但数值不同的金融片段;(2) 引入程序思维(Program-of-Thought)推理模块,生成可执行Python代码以替代大语言模型(LLM)的易出错心算;(3) 设计自适应策略路由机制(Adaptive Strategy Router),根据问题复杂度动态分配计算资源,在保持准确率的同时降低41.3%的API成本。

链接: https://arxiv.org/abs/2605.05409
作者: Yang Shu,Yingmin Liu,Zequn Xie
机构: Zhejiang University (浙江大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages, 11 figures, 13 tables, submitted to Expert Systems with Applications

点击查看摘要

Abstract:Financial document question answering (QA) demands complex multi-step numerical reasoning over heterogeneous evidence–structured tables, textual narratives, and footnotes–scattered across corporate filings. Existing retrieval-augmented generation (RAG) approaches adopt a single-pass retrieve-then-generate paradigm that struggles with the compositional reasoning chains prevalent in financial analysis. We propose FinAgent-RAG, an agentic RAG framework that orchestrates iterative retrieval-reasoning loops with self-verification, specifically engineered for the precision requirements of financial numerical reasoning. The framework integrates three domain-specific innovations: (1) a Contrastive Financial Retriever trained with hard negative mining to distinguish semantically similar but numerically distinct financial passages, (2) a Program-of-Thought reasoning module that generates executable Python code for precise arithmetic rather than relying on error-prone LLM-based mental computation, and (3) an Adaptive Strategy Router that dynamically allocates computational resources based on question complexity, reducing API costs by 41.3% on FinQA while preserving accuracy. Extensive experiments on three benchmark datasets–FinQA, ConvFinQA, and TAT-QA–demonstrate that FinAgent-RAG achieves 76.81%, 78.46%, and 74.96% execution accuracy respectively, outperforming the strongest baseline by 5.62–9.32 percentage points. Ablation studies, cross-backbone evaluation with four LLMs, and deployment cost analysis confirm the framework’s robustness and practical viability for financial institutions.
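
The Program-of-Thought module's key idea is that the model emits executable Python for the arithmetic instead of computing it "in its head". A minimal sketch, in which `generated_code` stands in for a hypothetical model output (not taken from the paper), executes the snippet in a restricted namespace and reads back the answer:

```python
# Hypothetical model-generated arithmetic for a year-over-year growth question
generated_code = """
revenue_2022 = 4150.0   # from table, in $M
revenue_2021 = 3820.0
growth = (revenue_2022 - revenue_2021) / revenue_2021
answer = round(growth * 100, 2)
"""

def execute_pot(code):
    """Run model-generated arithmetic in an isolated namespace and return
    the value bound to `answer`. Only `round` is exposed as a builtin."""
    scope = {}
    exec(code, {"__builtins__": {"round": round}}, scope)
    return scope["answer"]

print(execute_pot(generated_code), "% YoY growth")
```

The arithmetic is exact by construction; a production system would of course need stricter sandboxing than this namespace restriction.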

[NLP-98] Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets

【速读】: 该论文旨在解决查询聚焦摘要(Query-Focused Summarization, QFS)任务中缺乏带查询的训练数据的问题,即如何从无查询标注的数据集中自动生成基于证据的查询关键词以支持QFS。其解决方案的关键在于提出一种基于证据的查询生成模型,该模型能够从查询缺失的数据集中提取与文档内容高度相关的关键词作为查询,并通过内在相似性评估和外在摘要性能测试(如ROUGE分数)验证生成查询的有效性,实验表明使用此类生成查询所生成的摘要在性能上可媲美原始查询驱动的结果。

链接: https://arxiv.org/abs/2605.05392
作者: Yllias Chali,Deen Abdullah
机构: University of Lethbridge (莱斯布里奇大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 1 figure

点击查看摘要

Abstract:Large-scale datasets are widely used to perform summarization tasks, but they may not include queries alongside documents and summaries. In the search for suitable datasets for Query-Focused Summarization (QFS), we identify two research questions: Is it possible to automatically generate evidence-based query keywords from query-free datasets? Does evidence-based query generation support the QFS task? This paper proposes an evidence-based model to generate queries from query-free datasets. To evaluate our model intrinsically, we compare the similarity between the original queries and the system-generated queries of two QFS datasets. We also perform summarization tasks using different pre-trained models, as well as a state-of-the-art (SOTA) QFS model, to measure the extrinsic performance of our query generation approach. Experimental results indicate that summaries generated using evidence-based queries achieve competitive ROUGE scores compared to those generated from the original queries.
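
The notion of an "evidence-based" query can be made concrete with a toy criterion: candidate keywords are content words that appear in both the document and its reference summary, ranked by frequency. This is an illustrative simplification of the paper's generation model, with an invented stop-word list:

```python
import re
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "in", "is", "for", "on"}

def evidence_queries(document, summary, k=3):
    """Rank summary content words by frequency, keeping only those with
    evidence in the document (toy criterion, not the paper's model)."""
    toks = lambda t: [w for w in re.findall(r"[a-z]+", t.lower()) if w not in STOP]
    doc_counts = Counter(toks(document))
    shared = [w for w in toks(summary) if doc_counts[w] > 0]
    return [w for w, _ in Counter(shared).most_common(k)]

doc = ("The central bank raised interest rates again. Higher rates are meant "
       "to cool inflation, but inflation remains above the bank target.")
summ = "Bank raises rates to fight inflation; inflation still high."
print(evidence_queries(doc, summ))
```

Words from the summary with no support in the document ("fight", "still") are dropped, which is the "evidence-based" constraint in miniature.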

[NLP-99] BALAR: A Bayesian Agentic Loop for Active Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在交互式任务中缺乏主动推理机制的问题,即现有系统通常仅对用户输入做出反应,而无法系统性地识别信息缺失并选择最优的下一步提问策略。解决方案的关键在于提出BALAR(Bayesian Agentic Loop for Active Reasoning),这是一个无需微调的任务无关外层算法,其核心机制包括:1)维护对潜在状态的结构化信念表示;2)通过最大化期望互信息(Expected Mutual Information, EMI)来选择最具信息量的澄清问题;3)当当前状态表征不足以支持推理时,动态扩展状态空间。该方法实现了LLM代理与用户之间的结构化多轮交互,在三个不同基准测试中显著优于所有基线模型。

链接: https://arxiv.org/abs/2605.05386
作者: Aymen Echarghaoui,Dongxia Wu,Emily B. Fox
机构: Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models increasingly operate in interactive settings where solving a task requires multiple rounds of information exchange with a user. However, most current systems treat dialogue reactively and lack a principled mechanism to reason about what information is missing and which question should be asked next. We propose BALAR (Bayesian Agentic Loop for Active Reasoning), a task-agnostic outer-loop algorithm that requires no fine-tuning and enables structured multi-turn interaction between an LLM agent and a user. BALAR maintains a structured belief over latent states, selects clarifying questions by maximizing expected mutual information, and dynamically expands its state representation when the current one proves insufficient. We evaluate BALAR on three diverse benchmarks: AR-Bench-DC (detective cases), AR-Bench-SP (thinking puzzles), and iCraft-MD (clinical diagnosis). BALAR significantly outperforms all baselines across all three benchmarks, with 14.6% higher accuracy on AR-Bench-DC, 38.5% on AR-Bench-SP, and 30.5% on iCraft-MD.
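
Question selection by expected mutual information reduces to computing the expected entropy drop of the belief after hearing the answer. The sketch below uses toy discrete distributions (a two-suspect detective case); BALAR's actual beliefs and answer models are maintained in structured form by an LLM, so everything here is a stand-in:

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p.values() if x > 0)

def expected_info_gain(belief, answer_model):
    """EMI between a question's answer and the latent state:
    H(state) - E_answer[ H(state | answer) ], with the posterior via Bayes."""
    h_prior = entropy(belief)
    # P(answer) = sum_s P(s) P(answer | s)
    p_ans = {}
    for s, ps in belief.items():
        for a, pa in answer_model[s].items():
            p_ans[a] = p_ans.get(a, 0.0) + ps * pa
    h_post = 0.0
    for a, pa in p_ans.items():
        post = {s: belief[s] * answer_model[s].get(a, 0.0) / pa for s in belief}
        h_post += pa * entropy(post)
    return h_prior - h_post

belief = {"butler": 0.5, "gardener": 0.5}
# q1 perfectly separates the suspects; q2 is uninformative
q1 = {"butler": {"yes": 1.0}, "gardener": {"no": 1.0}}
q2 = {"butler": {"yes": 0.5, "no": 0.5}, "gardener": {"yes": 0.5, "no": 0.5}}
print(expected_info_gain(belief, q1), expected_info_gain(belief, q2))
```

The agent would ask q1 (one full bit of expected gain) rather than q2 (zero), which is the selection rule at the heart of the loop.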

[NLP-100] ZAYA1-8B Technical Report

【速读】: 该论文旨在解决当前大型语言模型在数学与编程推理任务中计算资源消耗高、效率低的问题,同时提升小规模模型在复杂推理场景下的性能表现。其核心解决方案是构建一个专注于推理能力的稀疏专家混合模型(Mixture-of-Experts, MoE)——ZAYA1-8B,该模型仅激活约7亿参数,却通过全栈AMD软硬件平台完成从预训练到强化学习(RL)微调的全流程训练,并采用答案保留剪裁策略确保推理数据贯穿整个训练阶段。关键创新包括:四阶段强化学习(RL)级联流程,涵盖数学与代码任务的递进式优化;以及提出一种名为马尔可夫递归状态聚合(Markovian RSA)的测试时计算方法,可在仅携带4K token长度推理尾迹的前提下,显著提升模型在AIME’25和HMMT’25等高难度基准上的表现(分别达到91.9%和89.6%),从而大幅缩小与更大规模推理模型(如Gemini-2.5 Pro、GPT-5-High)之间的性能差距。

链接: https://arxiv.org/abs/2605.05365
作者: Robert Washbourne,Rishi Iyer,Tomas Figliolia,Henry Zheng,Ryan Lorig-Roach,Sungyeon Yang,Pritish Yuvraj,Quentin Anthony,Yury Tokpanov,Xiao Yang,Ganesh Nanduru,Stephen Ebert,Praneeth Medepalli,Skyler Szot,Srivatsan Rajagopal,Alex Ong,Bhavana Mehta,Beren Millidge
机构: Zyphra
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present ZAYA1-8B, a reasoning-focused mixture-of-experts (MoE) model with 700M active and 8B total parameters, built on Zyphra’s MoE++ architecture. ZAYA1-8B’s core pretraining, midtraining, and supervised fine-tuning (SFT) were performed on a full-stack AMD compute, networking, and software platform. With under 1B active parameters, ZAYA1-8B matches or exceeds DeepSeek-R1-0528 on several challenging mathematics and coding benchmarks, and remains competitive with substantially larger open-weight reasoning models. ZAYA1-8B was trained from scratch for reasoning, with reasoning data included from pretraining onward using an answer-preserving trimming scheme. Post-training uses a four-stage RL cascade: reasoning warmup on math and puzzles; a 400-task RLVE-Gym curriculum; math and code RL with test-time compute traces and synthetic code environments built from competitive-programming references; and behavioral RL for chat and instruction following. We also introduce Markovian RSA, a test-time compute method that recursively aggregates parallel reasoning traces while carrying forward only bounded-length reasoning tails between rounds. In TTC evaluation, Markovian RSA raises ZAYA1-8B to 91.9% on AIME’25 and 89.6% on HMMT’25 while carrying forward only a 4K-token tail, narrowing the gap to much larger reasoning models including Gemini-2.5 Pro, DeepSeek-V3.2, and GPT-5-High.
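
The control flow of a Markovian RSA-style loop can be sketched structurally: each round samples several traces in parallel, aggregates them, and carries forward only a bounded-length tail. The `sample_trace` stub and the max-score aggregation below are hypothetical stand-ins for the LLM rollouts and the paper's actual aggregation rule:

```python
def sample_trace(prompt, tail, seed):
    """Stand-in for an LLM rollout conditioned on the carried tail;
    returns (trace_tokens, score)."""
    tokens = [f"step{seed}_{i}" for i in range(10)]
    return tokens, (seed % 3) + 0.1 * len(tail)

def markovian_rsa(prompt, rounds=3, k=4, tail_len=4):
    """Recursively aggregate k parallel traces per round, keeping only a
    bounded tail between rounds (the 'Markovian' property)."""
    tail = []
    for r in range(rounds):
        traces = [sample_trace(prompt, tail, seed=r * k + j) for j in range(k)]
        best_tokens, _ = max(traces, key=lambda t: t[1])
        tail = best_tokens[-tail_len:]  # bounded state between rounds
    return tail

tail = markovian_rsa("AIME problem ...")
print(tail)
```

Because only the tail crosses round boundaries, memory between rounds stays bounded (4K tokens in the paper) no matter how many parallel traces are aggregated.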

[NLP-101] Counterargument for Critical Thinking as Judged by AI and Humans

【速读】: 该论文旨在解决生成式 AI(Generative AI, GenAI)在教育场景中引发的作弊风险与认知卸载(cognitive offloading)问题,同时探索如何有效评估学生写作中的批判性思维能力。其解决方案的关键在于:通过让学生针对 AI 生成内容撰写反驳论点(counterarguments),引导其主动进行逻辑分析与思辨,从而强化批判性思维;同时利用六种前沿大语言模型(LLMs)作为评分者,基于明确的六维评分量表(焦点、逻辑、内容、风格、正确性与参考文献)对作业进行量化评估,结果显示 LLMs 的评估结果与人类教师及同伴评审具有较高一致性(Gwet’s AC2 值达 0.33,除一个模型外),表明 GenAI 可规模化用于学业评价,并支持教学反馈闭环。

链接: https://arxiv.org/abs/2605.05353
作者: Tosin Adewumi,Marcus Liwicki,Foteini Simistira Liwicki,Lama Alkhaled,Hamam Mokayed,Esra Sümer-Arpak
机构: Luleå University of Technology, Sweden (瑞典吕勒奥理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:This intervention study investigates the use of counterarguments in writing for critical thinking by students in the context of Generative AI (GenAI). This is especially relevant as risks of cheating and cognitive offloading exist with the use of GenAI. We presented 36 students in a particular university course with 4 carefully selected thesis statements (from a set of popular debates) to write about any one of them. We used six established rubrics (focus, logic, content, style, correctness and reference) to conduct three human assessments (two student peer-reviews and one experienced teacher) per writeup on a 5-point Likert scale for all the qualified samples (n) of 35 submissions (after disqualifying one for irregularity). Using the same rubrics and guidelines, we also assessed the submissions using six frontier LLMs as judges. Our mixed-method design included qualitative open-ended feedback per assessment and quantitative methods. The results reveal that (1) the students’ self-written counterarguments to AI-generated content contain logic, among other things, which is a key component of critical thinking, and (2) GenAI can be successfully used at scale to assess students’ written work, based on clear rubrics, and these assessments generally align with human assessments as shown with Gwet’s AC2 inter-rater reliability values of 0.33 for all the models except one.

[NLP-102] Beyond BLEU: A Semantic Evaluation Method for Code Translation

【速读】: 该论文旨在解决代码翻译(code translation)任务中评估方法的局限性问题,即传统指标如BLEU仅衡量语法相似性,无法反映程序语义的正确性。其解决方案的关键在于引入一种基于编译器测试(compiler testing)的新评估范式,通过计算翻译结果在执行上的正确率来定义“语义正确性得分”(semantic correctness score),从而更准确地衡量模型生成代码的功能准确性。实验表明,基于大语言模型(LLM)的方法在语义正确性上显著优于启发式方法,而BLEU等语法指标与语义正确性几乎无相关性(r = -0.127 至 0.354),验证了语义评估的重要性。

链接: https://arxiv.org/abs/2605.05282
作者: Julius Näumann,Sven Keidel,Amir Molzam Sharifloo,Mira Mezini
机构: TU Darmstadt; Hessian Center for Artificial Intelligence, Darmstadt, Germany; National Research Center for Applied Cybersecurity ATHENE
类目: Programming Languages (cs.PL); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We propose a novel evaluation methodology for code translation tasks, emphasizing semantic equivalence over surface-level string similarity. Our approach applies established compiler testing methodology to a new domain, allowing the assessment of an LLM fine-tuned for binary lifting tasks (i.e. decompiling binaries to higher-level representations). We introduce a semantic correctness score, defined as the proportion of translations that produce correct execution outcomes, and demonstrate its application by evaluating LLM-based and heuristic decompilers. Our findings show that LLM-based approaches significantly outperform heuristic ones, while BLEU scores show negligible correlation with semantic correctness (r = -0.127 to 0.354), demonstrating that syntactic metrics fail to predict functional accuracy.
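
The proposed semantic correctness score is simply the fraction of translations whose execution output matches the reference, and the paper's negative finding is that surface similarity barely predicts it. The sketch below computes both on synthetic per-translation results (invented for illustration, not the paper's benchmark data):

```python
def semantic_correctness(outcomes):
    """Fraction of translations whose execution output matches the reference."""
    correct = sum(1 for ref, got in outcomes if ref == got)
    return correct / len(outcomes)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Synthetic results: (reference output, translated-program output, BLEU-like score)
results = [
    (42, 42, 0.31), (7, 7, 0.28), (0, 1, 0.88),
    (5, 5, 0.40), (9, 9, 0.22), (3, -3, 0.85),
]
score = semantic_correctness([(ref, got) for ref, got, _ in results])
bleu = [b for _, _, b in results]
correct = [1.0 if ref == got else 0.0 for ref, got, _ in results]
print(f"semantic correctness = {score:.2f}")
print(f"Pearson r(BLEU, correctness) = {pearson_r(bleu, correct):.2f}")
```

In this toy the two syntactically closest translations are the semantically wrong ones, the exact failure mode that makes BLEU misleading for code.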

[NLP-103] Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

【速读】: 该论文旨在解决强化学习在推理任务中面临的信用分配(credit assignment)难题,即如何将仅在序列末端提供的结果级反馈转化为能够指导中间推理步骤的细粒度学习信号。现有方法要么依赖结果级奖励进行序列优化(难以精确分配责任),要么依赖外部构建的过程监督(成本高且难扩展)。解决方案的关键在于提出“监督内化”(supervision-internalization)的新视角,使模型能够通过识别、修正和重用失败的推理轨迹,自动提取过程级学习信号,从而在仅有结果监督的情况下实现更精细的策略优化。这一思想进一步抽象为一种新的训练范式:模型在强化学习过程中持续生成并迭代优化自身内部的过程监督,为推理任务中的细粒度信用分配开辟了不同于外部监督的新路径。

链接: https://arxiv.org/abs/2605.05226
作者: Fei Ding,Yongkang Zhang,Runhao Liu,Yuhao Liao,Zijian Zeng,Sibo Wang,Huiming Yang
机构: Alibaba Group(阿里巴巴集团); Tsinghua University(清华大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.

[NLP-104] Data-Driven Variational Basis Learning Beyond Neural Networks: A Non-Neural Framework for Adaptive Basis Discovery

【速读】: 该论文旨在解决传统表示系统(如傅里叶级数、小波和固定字典)无法适应现代高维数据经验结构的问题,同时克服神经网络在学习特征时牺牲可解释性、基函数结构显式控制及数学透明性的局限。其解决方案的关键在于提出一种非神经网络的框架——数据驱动变分基学习(Data Driven Variational Basis Learning, DVBL),通过变分优化直接从数据中学习基函数(basis atoms),将其作为主要优化变量,与样本特定系数及潜在线性演化算子(latent linear evolution operator)联合优化,从而获得既数据自适应又保持显式、可解释且可严格分析的基展开形式。

链接: https://arxiv.org/abs/2605.05221
作者: Andrew Kiruluta
机构: UC Berkeley (加州大学伯克利分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Classical representation systems such as Fourier series, wavelets, and fixed dictionaries provide analytically tractable basis expansions, but they are not intrinsically adapted to the empirical structure of modern high-dimensional data. Neural networks overcome this limitation by learning features from data, yet they do so through layered nonlinear parameterizations that often sacrifice interpretability, explicit control over basis structure, and mathematical transparency. In this manuscript we develop a non-neural alternative that learns basis functions directly from data through variational optimization. The proposed framework, termed Data Driven Variational Basis Learning (DVBL), treats basis atoms as primary optimization variables and learns them jointly with sample-specific coefficients and, when appropriate, a latent linear evolution operator. This yields a data-adaptive basis expansion that remains explicit, interpretable, and amenable to rigorous analysis. We formulate the model, establish existence of minimizers, prove blockwise descent properties for an alternating minimization algorithm, give conditions for coefficient recovery and basis identifiability, and show how manifold and dynamical regularization can be integrated without invoking neural architectures. We also discuss the conceptual novelty of the framework relative to classical dictionary learning, spectral methods, Koopman operator methods, and deep representation learning.
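论文中的交替极小化可以用一个去掉正则项的最小二乘草图来体会:固定基求系数、固定系数求基,交替迭代并保持基原子单位范数。以下代码为笔者的示意实现(省略了论文的流形/动力学正则项),并可数值观察逐块下降性质。

```python
import numpy as np

# 示意:DVBL 背后的交替极小化骨架(省略论文中的正则项)。
# 记录每轮系数步之后的重构误差,用于观察逐块下降。

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))   # 20 维信号,50 个样本
k = 5                           # 基原子个数
D = rng.normal(size=(20, k))    # 随机初始化基

losses = []
for _ in range(20):
    C = np.linalg.lstsq(D, X, rcond=None)[0]        # 固定 D,最小二乘求系数
    losses.append(float(np.linalg.norm(X - D @ C) ** 2))
    D = np.linalg.lstsq(C.T, X.T, rcond=None)[0].T  # 固定 C,最小二乘求基
    D /= np.linalg.norm(D, axis=0, keepdims=True)   # 保持原子单位范数

print(losses[0], losses[-1])    # 重构误差单调不增
```

由于列缩放不改变基的列空间,而系数每轮都重新最优化,记录的误差序列是单调不增的,这正对应论文证明的逐块下降性质。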

[NLP-105] WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

【速读】: 该论文旨在解决当前语音理解与生成任务中因特征表示不一致而导致的兼容性难题,即语义导向的自监督学习(SSL)特征与声学导向的重建特征难以统一的问题。解决方案的关键在于提出一种名为 WavCube 的紧凑连续潜在表示,其源自 SSL 语音编码器,并通过两阶段训练机制实现统一:第一阶段利用语义瓶颈过滤掉原始 SSL 特征中的离流形冗余,使其适用于扩散模型;第二阶段通过端到端重建注入细粒度声学细节,同时引入语义锚定损失确保表示保持在原始语义流形内。这一设计有效克服了 SSL 特征在生成建模中的两个固有缺陷,实现了高质量语音理解、重建和生成的统一。

链接: https://arxiv.org/abs/2605.06407
作者: Guanrou Yang,Tian Tan,Qian Chen,Zhikang Niu,Yakun Song,Ziyang Ma,Yushen Chen,Zeyu Xie,Tianrui Wang,Yifan Yang,Wenxi Chen,Qi Chen,Wenrui Liu,Shan Yang,Xie Chen
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Innovation Institute (上海创新研究院); Tencent (腾讯); Independent Researcher (独立研究员); Peking University (北京大学); Tianjin University (天津大学); Zhejiang University (浙江大学)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations reveal that WavCube’s two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at this https URL.

[NLP-106] Spherical Flows for Sampling Categorical Data

【速读】: 该论文旨在解决在连续嵌入空间中学习离散序列的生成模型问题,传统方法通常在欧几里得空间或概率单纯形上操作,而本文提出在球面 $\mathbb{S}^{d-1}$ 上建模。其核心解决方案是利用von Mises-Fisher (vMF) 分布诱导自然噪声过程并具有闭式条件得分(conditional score),通过利用vMF密度的径向对称性,将球面上的连续性方程简化为余弦相似度上的标量常微分方程(ODE),该方程的唯一有界解确定了速度场;同时,边际速度和边际得分均可分解为后验加权的切向和,仅在每个token的标量权重上存在差异,从而支持基于常微分方程(ODE)和预测-校正(predictor-corrector, PC)的采样策略。整个框架中唯一需要学习的对象是后验分布,通过交叉熵损失进行训练,实验表明vMF路径结合PC采样显著提升了Sudoku求解和语言建模任务的效果。

链接: https://arxiv.org/abs/2605.05629
作者: Jannis Chemseddine,Gregor Kornhardt,Gabriele Steidl
机构: Technische Universität Berlin (柏林工业大学)
类目: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb{S}^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb{S}^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb{S}^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.
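摘要提到 vMF 分布在球面上具有闭式条件得分:把欧氏梯度 κμ 投影到 x 处的切空间即可。以下数值草图(笔者编写,与论文代码无关)验证该得分位于切空间,且沿切向的方向导数与 log 密度一致。

```python
import numpy as np

# vMF 密度 p(x) ∝ exp(κ μᵀx) 的黎曼得分:将 κμ 投影到 x 的切空间。
# 本段为示意性验证代码,并非论文实现。

def vmf_score(x, mu, kappa):
    return kappa * (mu - (mu @ x) * x)   # 即 (I - x xᵀ) κμ

rng = np.random.default_rng(1)
mu = rng.normal(size=4); mu /= np.linalg.norm(mu)
x = rng.normal(size=4); x /= np.linalg.norm(x)
kappa = 3.0

s = vmf_score(x, mu, kappa)
v = mu - (mu @ x) * x                    # x 处的一个切向量
print(s @ x)                             # ≈ 0:得分落在切空间内
```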

信息检索

[IR-0] Superintelligent Retrieval Agent : The Next Frontier of Information Retrieval

【速读】:该论文旨在解决当前检索增强型智能体(retrieval-augmented agents)在处理组织知识库时存在的效率低下问题,即它们通常将检索过程视为黑箱操作,依赖多轮查询与改写来逐步发现有用证据,导致延迟高、召回率低。其核心挑战在于缺乏对检索目标的先验认知,无法像专家一样利用术语分布和语义区分能力进行高效定位。解决方案的关键在于提出SuperIntelligent Retrieval Agent(SIRA),它通过大语言模型(LLM)实现“检索超智能”——将多轮探索性搜索压缩为单次具有语料区分性的检索动作:一方面在文档侧离线扩充缺失的搜索词汇,另一方面在查询侧预测被忽略的关键证据词汇,并结合词频统计作为工具调用过滤冗余或常见项;最终以加权BM25算法合并原始查询与验证后的扩展词完成一次性精准检索。该方法在BEIR基准和下游问答任务中显著优于密集检索器及先进多轮代理基线,且具备可解释性、无需训练且计算高效。

链接: https://arxiv.org/abs/2605.06647
作者: Zeyu Yang,Qi Ma,Jason Chen,Anshumali Shrivastava
机构: Meta(元)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Retrieval-augmented agents are increasingly the interface to large organizational knowledge bases, yet most still treat retrieval as a black box: they issue exploratory queries, inspect returned snippets, and iteratively reformulate until useful evidence emerges. This approach resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, and results in unnecessary retrieval rounds, increased latency, and poor recall. We introduce *SuperIntelligent Retrieval Agent* (SIRA), which defines *superintelligence* in retrieval as the ability to compress multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask what terms are relevant to the query; it asks which terms are likely to separate the desired evidence from corpus-level confusers. On the corpus side, an LLM enriches each document offline with missing search vocabulary; on the query side, it predicts evidence vocabulary omitted by the query; and document-frequency statistics are used as a tool call to filter proposed terms that are absent, overly common, or unlikely to create retrieval margin. The final retrieval step is a single weighted BM25 call combining the original query with the validated expansion. Across ten BEIR benchmarks and downstream question-answering tasks, SIRA achieves significantly superior performance, outperforming dense retrievers and state-of-the-art multi-round agentic baselines, demonstrating that one well-formed lexical query, guided by LLM cognition and lightweight corpus statistics, can exceed substantially more expensive multi-round search while remaining interpretable, training-free, and efficient.
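SIRA 的最终检索是一次带权 BM25 调用:原始查询词与经文档频率(DF)过滤后的扩展词合并打分。下面是一个自包含的玩具实现;语料、扩展候选与过滤阈值均为笔者假设,仅示意这一流程。

```python
import math
from collections import Counter

# 玩具示例:SIRA 的最终检索——一次带权 BM25 调用。
# 语料、查询、扩展候选与过滤阈值均为假设,仅示意流程。

docs = [
    "postgres connection pool exhausted under load",
    "tune pgbouncer pool size for postgres",
    "frontend css layout bug in safari",
]
toks = [d.split() for d in docs]
N = len(docs)
df = Counter(t for d in toks for t in set(d))           # 文档频率
avgdl = sum(len(d) for d in toks) / N

def bm25(doc, terms, k1=1.5, b=0.75):
    tf = Counter(doc)
    score = 0.0
    for t, w in terms.items():                          # w 为各词的权重
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        num = tf[t] * (k1 + 1)
        den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += w * idf * num / den
    return score

query = {"postgres": 1.0, "pool": 1.0}                  # 原始查询词
proposed = ["pgbouncer", "kubernetes", "the"]           # LLM 提议的扩展词
expansion = {t: 0.5 for t in proposed if 0 < df[t] <= N // 2}  # DF 过滤
terms = {**query, **expansion}
ranked = sorted(range(N), key=lambda i: -bm25(toks[i], terms))
print(ranked[0])  # 同时命中查询词与扩展词 pgbouncer 的文档排第一
```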

[IR-1] Light-FMP: Lightweight Feature and Model Pruning for Enhanced Deep Recommender Systems

【速读】:该论文旨在解决深度推荐系统(Deep Recommender Systems, DRS)在高维输入特征场景下难以兼顾计算效率与模型精度的问题。现有方法通常要么偏重精度而忽略训练效率,要么为追求效率牺牲多任务下的最优性能。其解决方案的核心在于提出Light-FMP框架,通过三个关键阶段实现轻量化优化:预训练阶段利用硬混凝土分布(hard concrete distribution)和掩码层在小规模数据子集上高效识别重要特征;剪枝阶段同时对模型结构和特征进行裁剪;持续训练阶段则在剩余数据上使用领域自适应参数继续训练,从而在保持可扩展性和鲁棒性的前提下显著提升效率与准确性。

链接: https://arxiv.org/abs/2605.06441
作者: Nghia Bui,Yue Ning,Lijing Wang
机构: New Jersey Institute of Technology (新泽西理工学院); Stevens Institute of Technology (史蒂文斯理工学院)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Deep recommender systems (DRS) often face challenges in balancing computational efficiency and model accuracy, especially when handling high-dimensional input features. Existing methods either focus on improving accuracy while neglecting training efficiency or prioritize efficiency at the cost of suboptimal accuracy across tasks. We propose Light-FMP: Lightweight Feature and Model Pruning for Enhanced DRS, a lightweight framework that addresses the challenges through three key phases: *pretraining*, *pruning*, and *continued training*. Using a hard concrete distribution, a masking layer is efficiently pretrained on a small data subset to identify important features. The model and features are then pruned, and training continues on the remaining dataset with domain-adapted parameters. Experiments on benchmark datasets from real-world recommender systems demonstrate that Light-FMP outperforms existing methods in both efficiency and accuracy while maintaining scalability and robustness.
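掩码层所用的硬混凝土分布(hard concrete,即 Louizos 等人 2018 年提出的 L0 正则化门控)可用如下 numpy 草图体会:对带 logistic 噪声的 sigmoid 做区间拉伸再裁剪,门控既能精确取到 0/1,又对参数可导。超参数取文献常用默认值,并非论文设置。

```python
import numpy as np

# 硬混凝土门控的数值示意:拉伸 + 裁剪后的 sigmoid,
# 可精确取 0/1,同时对 log_alpha 可导。超参为常见默认值。

def hard_concrete(log_alpha, beta=0.66, gamma=-0.1, zeta=1.1, rng=None):
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    s = 1 / (1 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)  # 拉伸到 (γ, ζ) 再裁剪

rng = np.random.default_rng(0)
log_alpha = np.array([-6.0, 0.0, 6.0])        # 三个特征各自的门控参数
samples = hard_concrete(np.tile(log_alpha, (10000, 1)), rng=rng)
gates = samples.mean(axis=0)
print(gates)  # 负的 log_alpha 使门控趋近 0(剪掉特征),正的趋近 1(保留)
```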

[IR-2] GATHER: Convergence-Centric Hyper-Entity Retrieval for Zero-Shot Cell-Type Annotation SIGIR2026

【速读】:该论文旨在解决零样本单细胞类型注释(zero-shot single-cell cell-type annotation)中因超实体查询(hyper-entity queries)导致的知识图谱增强检索(KG-RAG)方法效率低下和成本高昂的问题。传统基于知识图谱的检索方法依赖于从单个基因出发的局部探索策略,难以有效捕捉多个基因协同表达所隐含的细胞类型信息,从而在面对包含数十至数百个基因的查询时表现出可扩展性差与大语言模型(LLM)调用开销高的问题。其解决方案的关键在于提出GATHER(Graph-Aware Traversal with Hyper-Entity Retrieval),一种以收敛点为中心的检索机制:通过全局多源图遍历识别出由多个输入基因共同可达的拓扑收敛节点(convergence nodes),这些节点作为高信息量的超实体(hyper-entities)能够压缩多基因信号并体现实体间的协同效应;同时结合节点与路径重要性评分,在无需LLM参与的情况下完成高效证据选择,最终实现仅需单次LLM调用即可获得显著优于现有基线的方法性能。

链接: https://arxiv.org/abs/2605.06403
作者: Zhonghui Zhang,Feng Jiang,Shaowei Qin,Jiahao Zhao,Min Yang
机构: Shenzhen University of Advanced Technology (深圳大学先进技术研究院); Chinese Academy of Sciences (中国科学院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to SIGIR 2026. 2 figures, 3 tables

点击查看摘要

Abstract:Zero-shot single-cell cell-type annotation aims to determine a cell’s type from a given set of expressed genes without any training. Existing knowledge-graph-based RAG approaches retrieve evidence by expanding from source entities and relying on iterative LLM reasoning. However, in this setting each query contains tens to hundreds of genes, where no single gene is decisive and the label emerges only from their collective co-occurrence. Such hyper-entity queries fundamentally challenge local, entity-wise exploration strategies, which reason from individual genes, leading to poor scalability and substantial LLM cost. We propose GATHER (Graph-Aware Traversal with Hyper-Entity Retrieval), a convergence-centric retriever tailored to hyper-entity queries. It performs global multi-source graph traversal and identifies topological convergence points – nodes jointly reachable from many input genes. These convergence nodes act as high-information hyper-entities that capture entity synergy. By incorporating node- and path-importance scoring, GATHER selects informative evidence entirely without LLM involvement during retrieval. Instantiated on a self-constructed cell-centric biological knowledge graph (VCKG), GATHER outperforms strong KG-RAG baselines (ToG, ToG-2, RoG, PoG) on two datasets (Immune and Lung), achieving the highest exact-match accuracy (27.45% and 59.64%) with only a single LLM call per sample, compared to 2–61 calls for KG-RAG baselines. Our results demonstrate that convergence nodes compress multi-entity signals into compact, high-information evidence that conveys more per item than multi-hop paths, providing an efficient global alternative to local entity-wise reasoning.
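“收敛节点”的直觉可用一个玩具图说明:从每个查询基因出发做多源 BFS,统计每个节点被多少个不同来源可达,票数最高者即高信息量的收敛节点。图结构与基因到细胞类型的连边均为虚构示例,与论文的 VCKG 无关。

```python
from collections import deque

# 玩具示例:多源 BFS 统计每个节点被多少查询基因可达,
# 票数最高者即收敛节点。图与连边均为虚构。

graph = {
    "CD3D": ["T cell receptor complex"],
    "CD3E": ["T cell receptor complex"],
    "IL7R": ["T cell"],
    "T cell receptor complex": ["T cell"],
    "MS4A1": ["B cell"],
}

def reachable(src, max_hops=3):
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == max_hops:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

genes = ["CD3D", "CD3E", "IL7R"]
votes = {}
for g in genes:
    for node in reachable(g) - {g}:
        votes[node] = votes.get(node, 0) + 1

best = max(votes, key=votes.get)
print(best, votes[best])  # "T cell" 被全部三个基因可达
```

整个投票过程不需要调用 LLM,这也是摘要中“每个样本仅一次 LLM 调用”的来源:LLM 只在最后根据收敛证据做判断。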

[IR-3] Expressiveness Limits of Autoregressive Semantic ID Generation in Generative Recommendation

【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GR)模型中因自回归生成过程导致的解码空间结构化问题,即token-by-token生成所形成的解码树结构会引入item概率间的强相关性,使得模型难以区分用户特定偏好下的相近item。这种结构耦合限制了GR模型对简单协同过滤模式的表达能力。解决方案的关键在于提出Latte方法:在每个语义ID token前注入一个潜在token,将原本单一的解码树重构为多个由潜在token条件化的子树,从而增加item之间的路径多样性并降低树结构引发的概率耦合,显著提升推荐性能(NDCG@10平均相对提升3.45%)。

链接: https://arxiv.org/abs/2605.06331
作者: Yupeng Hou,Haven Kim,Clark Mingxuan Ju,Eduardo Escoto,Neil Shah,Julian McAuley
机构: University of California, San Diego (加州大学圣地亚哥分校); Snap Inc. (Snap Inc.)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative recommendation (GR) models generate items by autoregressively producing a sequence of discrete tokens that jointly index the target item. However, this autoregressive generation process also induces a structured decoding space whose impact on model expressiveness remains underexplored. Specifically, token-by-token generation can be viewed as traversing a decoding tree induced by semantic ID tokens, where leaf nodes correspond to candidate items. We observe that the item probabilities produced by GR models are strongly correlated with this tree structure: items that are close in the tree tend to receive similar probabilities for any given user, making it difficult to distinguish among them based on user-specific preferences. We further show theoretically that such structural correlations prevent GR models from representing even simple patterns that can be well captured by conventional collaborative filtering models. To mitigate this issue, we propose Latte, a simple modification that injects a latent token before each semantic ID, reshaping the decoding space from a single tree into multiple latent-token-conditioned trees. This design creates multiple paths with varying tree distances between items, relaxing tree-induced probability coupling and yielding an average of 3.45% relative improvement on NDCG@10. Our code is available at this https URL.

[IR-4] Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLM s

【速读】:该论文旨在解决自动化隐私审计中因依赖稀缺且人工标注的HTTP流量数据,以及固定标签分类体系(taxonomy)而导致的检测模型跨领域迁移能力差、难以适应PII(Personally Identifiable Information,个人身份信息)定义动态演进的问题。其解决方案的关键在于提出一种多阶段基于大语言模型(Large Language Models, LLMs)的流水线方法:首先通过确定性预处理对HTTP消息体进行结构化清理,随后结合标签级分类与实例级值注释实现细粒度PII识别,并引入输出验证机制提升准确性;同时,为避免使用真实用户敏感数据,还设计了一种基于LLM的合成HTTP流量生成器,支持在运行时提供任意PII分类体系并自动生成带人工验证标签的数据,从而实现无监督条件下的灵活、可扩展的PII标注。

链接: https://arxiv.org/abs/2605.06305
作者: Thomas Cory,Axel Küpper
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted to 2026 IEEE European Symposium on Security and Privacy Workshops (EuroSPW)

点击查看摘要

Abstract:Automated privacy audits of web and mobile applications often analyse outbound HTTP traffic to detect Personally Identifiable Information (PII) leakage. However, existing learning-based detectors typically depend on scarce, manually labelled traffic and are tightly coupled to fixed label taxonomies, limiting transferability across domains and evolving definitions of PII. This paper investigates whether Large Language Models (LLMs) can support taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies when the taxonomy is provided at runtime. We introduce a multi-stage LLM-based pipeline that combines deterministic pre-processing with label-level classification, targeted instance-level value annotation, and output validation. To enable controlled evaluation and exemplar-based prompting without relying on sensitive real-user captures, we further propose an LLM-based generator for synthetic HTTP traffic with manually validated, taxonomy-derived PII annotations. We evaluate the approach across three taxonomies spanning different PII domains and granularity levels. Results show that the pipeline accurately detects PII types and extracts corresponding values for concrete PII taxonomies. Overall, our findings position LLMs as a promising foundation for flexible, taxonomy-agnostic traffic annotation and for creating labelled data under evolving privacy taxonomies.
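流水线第一阶段的确定性预处理可示意如下:把 HTTP JSON 消息体展平成 (键路径, 值) 对,供后续标签级分类与值级标注使用。`flatten` 为笔者假设的极简实现,后续的 LLM 标注与验证阶段在此省略。

```python
import json

# 示意:确定性预处理阶段,把 HTTP JSON 消息体展平为
# (键路径, 值) 对;flatten 为假设的极简实现。

def flatten(obj, prefix=""):
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}{i}.")
    else:
        yield prefix.rstrip("."), obj

body = json.loads('{"user": {"email": "a@b.com", "prefs": ["dark"]}, "ts": 17}')
pairs = dict(flatten(body))
print(pairs)  # {'user.email': 'a@b.com', 'user.prefs.0': 'dark', 'ts': 17}
```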

[IR-5] OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries

【速读】:该论文旨在解决当前检索基准(retrieval benchmark)趋于饱和背景下,高效搜索任务仍未被充分解决的问题,尤其聚焦于一类称为“斜向查询”(oblique queries)的复杂检索场景——这类查询要求从文档中识别出隐含模式,如发现表达隐式立场的推文、体现特定故障模式的聊天记录或匹配抽象场景的转录文本。解决方案的关键在于提出 OBLIQ-Bench,一个包含五个真实长尾语料上的斜向检索任务组成的评测基准,其揭示了检索与验证之间的关键不对称性:尽管大语言模型(LLM)在相关文档被成功召回后能可靠识别隐含相关性,但现有先进检索系统仍难以有效捕获这些隐含信号。因此,该研究呼吁开发更高效的检索架构以捕捉大规模语料中的潜在模式和隐式线索。

链接: https://arxiv.org/abs/2605.06235
作者: Diane Tchuindjo,Devavrat Shah,Omar Khattab
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval benchmarks are increasingly saturating, but we argue that efficient search is far from a solved problem. We identify a class of queries we call oblique, which seek documents that instantiate a latent pattern, like finding all tweets that express an implicit stance, chat logs that demonstrate a particular failure mode, or transcripts that match an abstract scenario. We study three mechanisms through which obliqueness may arise and introduce OBLIQ-Bench, a suite of five oblique search problems over real long-tail corpora. OBLIQ-Bench exposes an overlooked asymmetry between retrieval and verification, where reasoning LLMs reliably recognize latent relevance whenever relevant documents are surfaced, but even sophisticated retrieval pipelines fail to surface most relevant documents in the first place. We hope that OBLIQ-Bench will drive research into retrieval architectures that efficiently capture latent patterns and implicit signals in large corpora.

[IR-6] Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval ICML2026

【速读】:该论文旨在解决部分相关视频检索(Partially Relevant Video Retrieval)中因文本查询与视频内容之间固有不对称性所导致的不确定性问题,尤其是模糊查询引发的语义歧义以及视频内稀疏的时间监督难以提供充分匹配证据的挑战。解决方案的关键在于提出一个分层证据学习框架Holmes,其核心创新包括:在跨视频层面,将相似度分数视为证据支持并用Dirichlet分布建模以显式量化不确定性;基于三重原则进行细粒度查询识别,并引导查询自适应校准学习;在视频内部层面,通过带自适应尘箱(dustbin)的柔性最优传输实现软查询-片段对齐,从而积累更密集的证据并缓解稀疏时间监督问题,同时抑制虚假局部响应。

链接: https://arxiv.org/abs/2605.06083
作者: Jun Li,Peifeng Lai,Xuhang Lou,Jinpeng Wang,Yuting Wang,Ke Chen,Yaowei Wang,Shu-Tao Xia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted by ICML 2026. 16 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi-granular cross-modal evidence to quantify and model uncertainty explicitly. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three-fold principle, we perform fine-grained query identification, which then guides query-adaptive calibrated learning. At the intra-video level, to accumulate denser evidence, we formulate a soft query-clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state-of-the-art methods. Code is released at this https URL.
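“将相似度分数视为证据并用 Dirichlet 分布建模”的计算本身有闭式表达:令 α = e + 1,主观逻辑下的信念为 e/S、不确定性为 K/S。以下草图仅展示这一通用计算,与 Holmes 的具体网络结构无关。

```python
import numpy as np

# 证据学习的基本量:K 类证据 e 给出 Dirichlet(α = e + 1),
# 信念 b = e / S,不确定性 u = K / S,且 Σb + u = 1。

def dirichlet_opinion(evidence):
    alpha = np.asarray(evidence, dtype=float) + 1.0
    S = alpha.sum()
    belief = (alpha - 1.0) / S      # 各候选的信念质量
    uncertainty = len(alpha) / S    # 证据越少,不确定性越高
    return belief, uncertainty

b_conf, u_conf = dirichlet_opinion([9.0, 0.5, 0.5])    # 证据充分的查询
b_vague, u_vague = dirichlet_opinion([0.4, 0.3, 0.3])  # 模糊查询
print(u_conf, u_vague)  # 模糊查询的不确定性更高
```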

[IR-7] A Case-Driven Multi-Agent Framework for E-Commerce Search Relevance

【速读】:该论文旨在解决电子商务搜索中相关性(relevance)优化的系统性难题,即如何通过多角色协作机制持续识别并解决用户感知的“坏案例”(bad cases)。其核心挑战在于传统闭环生态中人类角色(如用户、产品管理者、标注员、算法工程师和评估者)分工复杂、响应滞后且信息不对称。解决方案的关键是提出一种基于多智能体的自动化框架,通过引入三类自主代理——用户代理(User Agent)用于对话式识别坏案例、标注代理(Annotator Agent)支持多轮标注、优化代理(Optimizer Agent)实现自动分析与修复——构建一个可自我演进的闭环系统;同时结合统一检索-排序模型、指令遵循的相关性模型、全局记忆机制(Global Memory)、深度搜索代理(Deep Search Agent)以及人机协同聊天机器人等工程化设计,显著提升标注准确性与坏案例响应的时效性和泛化能力,从而为工业级搜索相关性优化提供可落地的新范式。

链接: https://arxiv.org/abs/2605.05991
作者: Global E-Commerce Search Relevance Team
机构: ByteDance(字节跳动)
类目: Information Retrieval (cs.IR)
备注: Tech Report

点击查看摘要

Abstract:Relevance is a foundation of user experience in e-commerce search. We view relevance optimization as a closed-loop ecosystem involving multiple human roles: users who provide feedback, product managers who define standards, annotators who label data, algorithm engineers who optimize models, and evaluators who assess performance. Because improving relevance in practice means systematically resolving user-perceived bad cases, we ask a system-level question: can this ecosystem be reimagined by replacing its human roles with autonomous agents? To answer this question, we propose a case-driven multi-agent framework that automates the pipeline from bad-case identification to resolution. The framework instantiates an Annotator Agent for multi-turn annotation, an Optimizer Agent for autonomous bad-case analysis and resolution, and a User Agent that identifies bad cases through conversational interaction, together forming an autonomous and continually evolving system. To make the framework practical in production, we further adopt a harness-engineering paradigm and build a unified retrieval-and-ranking relevance model for efficient training, an instruction-following relevance model for real-time case resolution, Global Memory to reduce information asymmetry across agents, a Deep Search Agent to target underestimation failures, and an agent-based chatbot for human–agent collaboration. Extensive human evaluation shows that the framework performs relevance-related tasks effectively, improves annotation accuracy, and enables more timely and generalizable bad-case resolution, indicating a practical paradigm for industrial search relevance optimization.

[IR-8] Bridging Passive and Active: Enhancing Conversation Starter Recommendation via Active Expression Modeling SIGIR2026

【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)驱动的对话式搜索中,由于依赖“曝光-点击”反馈闭环导致的推荐系统陷入回音室效应(echo chamber)问题,进而无法捕捉开放世界中动态变化的用户意图。传统方法因数据稀疏性和分布偏移,难以有效利用用户主动输入的查询(active queries)来优化推荐效果。解决方案的关键在于提出一种名为Passive-Active Bridge (PA-Bridge) 的新框架:通过引入对抗性分布对齐器(adversarial distribution aligner)缓解被动推荐的起点(passively recommended starters)与主动表达(active expressions)之间的分布差异,并设计语义离散化模块(semantic discretizer)使流行度去偏算法可在大规模工业流式训练中部署。实验证明,该方法显著提升了特征渗透率(Feature Penetration Rate)和用户活跃天数(User Active Days)。

链接: https://arxiv.org/abs/2605.05855
作者: Yiqing Wu,Haoming Li,Guanyu Jiang,Jiahao Liang,Yongchun Zhu,Jingwu Chen,Feng Zhang
机构: Bytedance(字节跳动)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted by SIGIR 2026

点击查看摘要

Abstract:Large Language Model (LLM)-driven conversational search is shifting information retrieval from reactive keyword matching to proactive, open-ended dialogues. In this context, Conversation Starters are widely deployed to provide personalized query recommendations that help users initiate dialogues. Conventionally, recommending these starters relies on a closed “exposure-click” loop. Yet, this feedback loop mechanism traps the system in an echo chamber where, compounded by data sparsity, it fails to capture the dynamic nature of conversational search intents shaped by the open world. As a result, the system skews towards popular but generic starters. In this work, we uncover an untapped paradigm shift to shatter this harmful feedback loop: harnessing user “free will” through active user expressions. Unlike traditional recommendations, conversational search empowers users to bypass menus entirely through manually typed queries. The open-world intents in active queries hold the key to breaking this loop. However, incorporating them is non-trivial: (1) there exists an inherent distribution shift between active queries and formulated starters. (2) Furthermore, the “non-ID-able” nature of open text renders traditional item-based popularity statistics ineffective for large-scale industrial streaming training. To this end, we propose Passive-Active Bridge (PA-Bridge), a novel framework that employs an adversarial distribution aligner to bridge the distributional gap between passively recommended starters and active expressions. Moreover, we introduce a semantic discretizer to enable the deployment of popularity debiasing algorithms. Online A/B tests on our platform demonstrate that PA-Bridge significantly boosts the Feature Penetration Rate by 0.54% and User Active Days

[IR-9] Unified Value Alignment for Generative Recommendation in Industrial Advertising

【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GR)在工业广告场景中难以对齐用户兴趣与商业价值的问题。现有GR方法多以语义为中心,导致价值信号在分词(tokenization)、解码(decoding)和在线服务(online serving)阶段难以统一,从而影响广告系统的整体效果。解决方案的关键在于提出UniVA(Unified Value Alignment)框架:首先设计一种引入商业属性的Commercial SID tokenizer,生成具有价值区分能力的物品表示;其次开发基于监督学习与eCPM感知强化学习联合优化的Generation-as-Ranking SID Decoder,在同一解码过程中融合价值评分实现生成与排序一体化;最后采用价值引导的个性化束搜索策略,利用生成-排序 logits 作为在线价值指导,并通过个性化前缀树(trie tree)约束解码路径仅限于请求有效的SID路径,从而实现精准且高价值的广告推荐。

链接: https://arxiv.org/abs/2605.05803
作者: Xinxun Zhang,Yuling Xiong,Jiale Zhou,Zhengkai Guo,Zhennan Pang,Junbang Huo,Jingwen Wang,Xuyang Sun,Enming Zhang,Jiaguang Jin,Changping Wang,Yi Li,Jun Zhang,Xiao Yan,Jiawei Jiang,Jie Jiang
机构: Wuhan University (武汉大学); Tencent Inc. (腾讯公司); Peking University (北京大学)
类目: Information Retrieval (cs.IR)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Generative Recommendation (GR) reformulates recommendation as a next-token generation problem and has shown promise in industrial applications. However, extending GR to industrial advertising is non-trivial because the system must optimize not only user interest but also commercial value. Existing GR pipelines remain largely semantics-centric, making it difficult to align value signals across tokenization, decoding, and online serving. To address this issue, we propose UniVA, a Unified Value Alignment framework for advertising recommendation. We first introduce a Commercial SID tokenizer that injects value-related attributes into SID construction, yielding value-discriminative item representations. We then develop a Generation-as-Ranking SID Decoder jointly optimized by supervised learning and eCPM-aware reinforcement learning, which fuses value scores into next-item SID generation to perform generation and ranking in one decoding process. Finally, we design a value-guided personalized beam search that reuses generation-as-ranking logits as online value guidance and applies a personalized trie tree to constrain decoding to request-valid SID paths. Experiments on the Tencent WeChat Channels advertising platform show that UniVA achieves a 37.04% improvement in offline Hit Rate@100 over the baseline and a 1.5% GMV lift in online A/B tests.
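价值引导的个性化束搜索可以拆成两件事:束扩展只沿请求有效的 SID 前缀树进行;最终排序时把生成分数与价值分数加权融合。下面是一个自包含的玩具解码器,其中 token、logits、价值分数与融合权重均为笔者假设,仅示意解码流程。

```python
# 玩具示例:前缀树约束的束搜索 + 价值分数融合。
# SID、logits、价值分数与融合权重均为假设。

valid_sids = [("a", "x"), ("a", "y"), ("b", "z")]   # 请求有效的 SID 集合
trie = {}
for sid in valid_sids:                              # 构建个性化前缀树
    node = trie
    for tok in sid:
        node = node.setdefault(tok, {})

logp = {"a": -0.2, "b": -1.8, "x": -0.9, "y": -0.4, "z": -0.1}
value = {("a", "x"): 0.2, ("a", "y"): 0.9, ("b", "z"): 0.5}  # eCPM 类价值分

def decode(beam_width=2, w=1.0):
    beams = [((), 0.0, trie)]
    for _ in range(2):                              # 本例 SID 长度为 2
        cands = []
        for prefix, score, node in beams:
            for tok, child in node.items():         # 只沿树上有效路径扩展
                cands.append((prefix + (tok,), score + logp[tok], child))
        beams = sorted(cands, key=lambda b: -b[1])[:beam_width]
    # 最终排序:生成分数与价值分数加权融合
    return max(beams, key=lambda b: b[1] + w * value[b[0]])[0]

print(decode())  # ('a', 'y'):生成概率高且价值分数高
```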

[IR-10] Beyond Long Tail POIs: Transition-Centered Generalization for Human Mobility Prediction

【速读】:该论文旨在解决人类移动预测中因过渡稀疏性(transition-level sparsity)导致的长尾POI(Point of Interest)预测困难问题,尤其关注那些在训练集中极少或从未出现的源-目标地点过渡模式。其核心瓶颈在于过渡层面的长尾泛化能力不足,而非仅依赖高频POI的统计规律。解决方案的关键是提出RECAP框架,通过重建长尾过渡来实现组合泛化(compositional generalization):一是利用全局过渡图中的多跳传递性(multi-hop transitivity),二是结合用户历史轨迹中的重访证据(revisit evidence),从而从可迁移信号中推断罕见过渡;同时引入warm-transition holdout训练策略,避免对高频过渡的过度记忆,强化对通用信号的泛化能力。实验表明,RECAP在多个真实数据集上显著提升整体预测精度,尤其是在长尾过渡场景下表现突出。

链接: https://arxiv.org/abs/2605.05771
作者: Dingyang Lyu,Zhengjia Xu,Jey Han Lau,Jianzhong Qi
机构: The University of Melbourne(墨尔本大学); Macquarie University(麦考瑞大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Human mobility prediction forecasts a user’s next Point of Interest (POI) from historical trajectories, supporting applications from recommendation to urban planning. Recent studies have recognized the problem with long-tail POIs in human mobility prediction, which are POIs with few visit records, making new visits to such POIs difficult to predict. Our analysis shows that many predictions fail even for visits to popular POIs. The underlying cause is often transition-level sparsity: the corresponding source-destination transition appears rarely, or never appears, in the training set. We therefore argue that a core bottleneck in human mobility prediction lies in transition-level long-tail generalization. We formulate this problem as compositional generalization and propose a tRansition rEconstruction framework for Compositional generAlization in next-POI prediction (RECAP). RECAP reconstructs long-tail transitions from two generalizable signals: multi-hop transitivity in the global transition graph and revisit evidence from a user’s historical trajectory. It further uses warm-transition holdout training to discourage memorization of frequent transitions and encourage generalization from transferable signals. Experiments on multiple real-world datasets show that RECAP consistently improves prediction accuracy, with clear gains on tail transitions.
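“多跳传递性”重建长尾过渡的想法可用转移矩阵示意:直接计数为零的 A→C 过渡,可以通过 T·T 的二跳路径获得非零质量。转移矩阵与混合权重 λ 均为构造的示例取值,并非论文参数。

```python
import numpy as np

# 示意:用二跳传递性为训练集中从未出现的过渡补充质量。
# 转移矩阵与混合权重 λ 均为构造的示例。

T = np.array([            # 行归一化的 POI 过渡概率
    [0.0, 1.0, 0.0],      # A -> B
    [0.0, 0.0, 1.0],      # B -> C
    [1.0, 0.0, 0.0],      # C -> A
])
lam = 0.3
T_hat = (1 - lam) * T + lam * (T @ T)   # 混入二跳可达性

print(T[0, 2], T_hat[0, 2])  # 直接过渡 A->C 为 0,重建后经 A->B->C 得到 0.3
```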

[IR-11] Effective Knowledge Transfer for Multi-Task Recommendation Models

【速读】:该论文旨在解决推荐系统中因用户转化行为(Conversion Rate, CVR)数据稀疏而导致的排名模型训练困难问题。核心挑战在于如何有效利用多样化的用户行为数据来提升CVR预测的准确性。解决方案的关键在于提出一种面向多任务推荐模型的有效知识迁移方法(Effective Knowledge Transfer method for Multi-task Recommendation Models, EKTM),其核心机制包括:引入一个路由器模块(router module)实现跨任务知识的集成与分发,为每个CVR任务配置发射器模块(transmitter module)以完成知识从路由器到具体任务的转换,并设计增强模块确保迁移知识能够正向促进原始任务的学习。该方法通过任务间的知识协同,显著提升了CVR建模效果,在多个基准数据集和大规模线上A/B测试中均验证了其有效性,eCPM指标提升达3.93%。

链接: https://arxiv.org/abs/2605.05730
作者: Guohao Cai,Jun Yuan,Zhenhua Dong
机构: Huawei Technologies Co., Ltd.(华为技术有限公司)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The conversion rate (CVR) is a crucial metric for evaluating the effectiveness of platforms, as it quantifies the alignment of content with audience preferences. However, the limited nature of customers’ conversion actions presents a significant challenge for training ranking models effectively. In this paper, we propose an Effective Knowledge Transfer method for Multi-task Recommendation Models (EKTM). This method enables the ranking model to learn from diverse user behaviors, thereby enhancing performance through the transfer of knowledge across distinct yet related tasks. Each specific CVR task can directly benefit from the insights provided by other tasks. To achieve this, we first introduce a router module that integrates and disseminates knowledge across tasks. Subsequently, each CVR task is equipped with a transmitter module that facilitates the transformation of knowledge from the router. Additionally, we propose an enhanced module to ensure that the transferred knowledge benefit the original task learning. Extensive experiments on several benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art approaches. Online A/B testing on a commercial platform has validated the effectiveness of the EKTM algorithm in large-scale industrial settings, resulting in a 3.93% uplift in effective Cost Per Mille (eCPM). The algorithm has since been fully deployed across two of the platform’s main-traffic scenarios.

[IR-12] Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG

【速读】: This paper targets the "information island" problem that conventional Retrieval-Augmented Generation (RAG) methods face in multi-hop reasoning: the asymmetric reasoning flow between text and structured graphs either introduces logically irrelevant pseudo-evidence or loses valid reasoning paths to search-time pruning. The key of the proposed unified Text-Graph Synergistic RAG (TGS-RAG) framework is a bidirectional mechanism for cross-modal information fusion: a graph-to-text channel applies a global voting strategy over visited graph nodes to re-rank and refine textual evidence, filtering out semantic noise; and a text-to-graph channel introduces a memory-based orphan-entity bridging algorithm that uses textual cues to proactively resurrect valid reasoning paths previously pruned from the search history, with no additional database overhead. This design markedly improves the balance between retrieval precision and computational efficiency.

链接: https://arxiv.org/abs/2605.05643
作者: Jiarui Zhong,Hong Cai Chen
机构: Southeast University (东南大学)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become a core paradigm for enhancing factual grounding and multi-hop reasoning in Large Language Models (LLMs). Traditional text-based RAG often retrieves logically irrelevant pseudo-evidence, while graph-based RAG is frequently hindered by search-time pruning, which may discard potentially valid reasoning paths. Existing hybrid approaches primarily adopt simple evidence concatenation or unidirectional enhancement, which fail to address the fundamental "Information Island" problem caused by asymmetric reasoning flows between unstructured text and structured graphs. We propose TGS-RAG, a unified framework for Text-Graph Synergistic enhancement. TGS-RAG introduces a bidirectional mechanism: (i) a Graph-to-Text channel that employs a Global Voting strategy from visited graph nodes to re-rank and refine textual evidence, filtering out semantic noise; and (ii) a Text-to-Graph channel that utilizes the Memory-based Orphan Entity Bridging algorithm. This algorithm utilizes textual cues to proactively resurrect valid but previously pruned reasoning paths from the search history without additional database overhead. Experimental results on multiple multi-hop reasoning benchmarks demonstrate that TGS-RAG significantly outperforms state-of-the-art baselines, achieving a superior balance between retrieval precision and computational efficiency.
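The graph-to-text global-voting channel can be illustrated with a toy re-ranker. This is a minimal sketch under assumptions of ours (the function name, passage tuples, substring matching, and tie-breaking are illustrative, not the authors' implementation):

```python
def global_vote_rerank(passages, visited_entities):
    """Re-rank passages by how many visited graph entities they mention.

    passages: list of (passage_id, text, retrieval_score)
    visited_entities: set of entity strings gathered during graph search
    Returns passages sorted by (vote count, retrieval score); passages
    mentioning no visited entity are dropped as semantic noise.
    """
    scored = []
    for pid, text, score in passages:
        votes = sum(1 for e in visited_entities if e.lower() in text.lower())
        if votes > 0:
            scored.append((votes, score, pid, text))
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [(pid, text) for _, _, pid, text in scored]

passages = [
    ("p1", "Marie Curie won the Nobel Prize in Physics.", 0.9),
    ("p2", "The Eiffel Tower is in Paris.", 0.8),
    ("p3", "Curie worked in Paris on radioactivity.", 0.7),
]
ranked = global_vote_rerank(passages, {"Curie", "Paris"})
```

Note how the passage touching two visited entities outranks a higher-scored passage touching only one, which is the intuition behind voting from graph context rather than trusting retrieval scores alone.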

[IR-13] AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases

【速读】: This paper addresses the over-reliance of language models on the retrieval module in conventional Retrieval-Augmented Generation (RAG) systems: after a single retrieval step the model is constrained to a fixed candidate set and cannot flexibly gather and analyze evidence. The key solution is AgenticRAG, a lightweight agentic harness that layers tool-use capabilities (search, find, open, and summarize) on top of existing enterprise search infrastructure, letting a reasoning LLM autonomously and iteratively retrieve information, navigate within documents, and analyze evidence. The core innovation, replacing single-shot retrieval with multi-turn agentic tool use, substantially improves recall (49.6% recall@1 on BRIGHT, +21.8 pp over the best embedding baseline) and factuality (0.96 on WixQA, a 13% relative improvement), while approaching oracle performance on financial QA (92% answer correctness on FinanceBench). Ablations show this mechanism shift is the dominant factor (a 5.9× improvement), with multi-query search and in-document navigation further improving quality and efficiency.

链接: https://arxiv.org/abs/2605.05538
作者: Susheel Suresh,Hazel Mak,Shangpo Chou,Fred Kroon,Sahil Bhatnagar
机构: Microsoft Corporation(微软公司)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:We present AgenticRAG, a practical agentic harness for retrieval and analysis over enterprise knowledge bases. Standard RAG pipelines place a significant burden of grounding on the search stack, constraining the language model to a fixed candidate set chosen deep in the retrieval process. Our approach reduces this overdependence by layering a lightweight harness on top of existing enterprise search infrastructure, equipping a reasoning LLM with search, find, open, and summarize tools, enabling the model to iteratively retrieve information, navigate within documents, and analyze evidence autonomously. On three open benchmarks we observe substantial gains: 49.6% recall@1 on BRIGHT (+21.8 pp over the best embedding baseline), 0.96 factuality on WixQA (+13% relative improvement), and 92% answer correctness on FinanceBench, within 2 pp of oracle access to true evidence. Ablation studies show that the most significant factor is the shift from single-shot retrieval to agentic tool use (a 5.9× improvement), while multi-query search and in-document navigation contribute to both quality and efficiency. We present various design choices in our agentic harness that were informed by pre-production deployments. Our results demonstrate its suitability for real-world enterprise production environments.

[IR-14] Open-SAT: LLM-Guided Query Embedding Refinement for Open-Vocabulary Object Retrieval in Satellite Imagery

【速读】: This paper addresses the difficulty of aligning open-vocabulary natural-language queries with satellite imagery, where vision-language models (VLMs) such as CLIP, even after fine-tuning, struggle on unseen objects and concepts. The key is Open-SAT, a training-free, inference-time query-embedding refinement algorithm: it uses a pretrained VLM to embed image tiles into a vector database, and at query time leverages a large language model (LLM) to enrich the text embedding with contextual information, improving semantic alignment between queries and satellite images; a threshold-free retrieval mechanism further boosts accuracy and efficiency.

链接: https://arxiv.org/abs/2605.05344
作者: Md Adnan Arefeen,Biplob Debnath,Ravi K. Rajendran,Murugan Sankaradas,Srimat T. Chakradhar
机构: North South University (南亚大学); NEC Laboratories America (NEC实验室美国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In satellite applications, user queries often take the form of open-ended natural language, extending beyond a fixed set of predefined categories. This open-vocabulary nature poses significant challenges for retrieving relevant image tiles, as the retrieval system must generalize to a wide range of unseen objects and concepts. While vision-language models (VLMs) such as CLIP are widely used for text-image retrieval, even fine-tuned variants often struggle to accurately align such queries with satellite imagery. To address this, we propose Open-SAT, a training-free query embedding refinement algorithm that operates at inference time to improve alignment between user queries and satellite image content. Open-SAT uses VLMs to compute embeddings for image tiles, which are stored in a vector database for efficient retrieval. At query time, it leverages Large Language Models (LLMs) to refine the text embeddings by incorporating contextual information about objects of interest and their surroundings. A threshold-free retrieval mechanism further enhances accuracy and efficiency. Experimental results on three public benchmarks demonstrate that Open-SAT improves the F1 score by up to 16.04%, while retrieving a comparable number of image tiles. These results demonstrate the effectiveness of Open-SAT in open-vocabulary satellite image retrieval, leveraging LLM guidance without the need for additional training or supervision.
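The two ideas in the abstract, blending the query embedding with LLM-suggested context embeddings and cutting the ranked list at the largest score gap instead of a fixed threshold, can be sketched as follows. The blending weight, vector format, and largest-gap cutoff rule are our own assumptions for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def refine_query(query_vec, context_vecs, alpha=0.6):
    """Blend the raw query embedding with the mean of LLM-suggested
    context embeddings; alpha weights the original query."""
    dim = len(query_vec)
    mean_ctx = [sum(v[i] for v in context_vecs) / len(context_vecs) for i in range(dim)]
    return [alpha * q + (1 - alpha) * c for q, c in zip(query_vec, mean_ctx)]

def threshold_free_retrieve(refined, tiles):
    """Rank tiles by similarity and cut the list at the largest score
    gap, so no absolute similarity threshold has to be tuned."""
    scored = sorted(((cosine(refined, v), tid) for tid, v in tiles.items()), reverse=True)
    if len(scored) < 2:
        return [tid for _, tid in scored]
    gaps = [scored[i][0] - scored[i + 1][0] for i in range(len(scored) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [tid for _, tid in scored[:cut]]

tiles = {"t1": [1.0, 0.0], "t2": [0.9, 0.1], "t3": [0.0, 1.0]}
refined = refine_query([1.0, 0.0], [[0.8, 0.2]])
result = threshold_free_retrieve(refined, tiles)
```

Here the two near-matching tiles are kept and the semantically distant one is dropped at the score gap, without any hand-set cutoff value.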

[IR-15] Securing the Agent: Vendor-Neutral Multitenant Enterprise Retrieval and Tool Use

【速读】: This paper addresses the core isolation problem facing retrieval-augmented generation (RAG) and agentic AI systems in multi-tenant enterprise environments: existing architectures rank documents at retrieval time purely by relevance (semantic similarity or keyword matching) without regard to access-control policy, so one tenant's query can surface another tenant's sensitive data simply because it scores highest. Further weaknesses include tool-mediated information disclosure, risk accumulation in multi-turn conversational context, and client-side orchestration that bypasses authorization. The key of the solution is a layered isolation architecture combining policy-aware ingestion, retrieval-time gating, and shared inference for fine-grained authorization, with server-side agentic orchestration centralizing security-critical operations (tool-execution authorization, state isolation, and policy enforcement) while clients retain control over agent composition and latency-sensitive operations. Experiments show that attribute-based access control (ABAC) gating eliminates cross-tenant leakage with negligible overhead.

链接: https://arxiv.org/abs/2605.05287
作者: Francisco Javier Arceo,Varsha Prasad Narsing
机构: Red Hat AI(红帽人工智能)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)
备注: 11 pages, 2 figures, Published in ACM Conference on AI and Agentic Systems

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) and agentic AI systems are increasingly prevalent in enterprise AI deployments. However, real enterprise environments introduce challenges largely absent from academic treatments and consumer-facing APIs: multiple tenants with heterogeneous data, strict access-control requirements, regulatory compliance, and cost pressures that demand shared infrastructure. A fundamental problem underlies existing RAG architectures in these settings: retrieval systems rank documents by relevance–whether through semantic similarity, keyword matching, or hybrid approaches–not by authorization, so a query from one tenant can surface another tenant’s confidential data simply because it scores highest. We formalize this gap and analyze additional shortcomings–including tool-mediated disclosure, context accumulation across turns, and client-side orchestration bypass–that arise when agentic systems conflate relevance with authorization. To address these challenges, we introduce a layered isolation architecture combining policy-aware ingestion, retrieval-time gating, and shared inference, enforced through server-side agentic orchestration. This approach centralizes security-critical operations–tool execution authorization, state isolation, and policy enforcement–on the server, creating natural enforcement points for multitenant isolation while allowing client-side frameworks to retain control over agent composition and latency-sensitive operations. We validate the proposed architecture through an open-source implementation in OGX, a vendor-neutral framework that implements an OpenAI-compatible, open-source Responses API with server-side multi-turn orchestration. We evaluate it empirically and show that ABAC gating eliminates cross-tenant leakage while introducing negligible overhead. 
Journal reference: ACM Conference on AI and Agentic Systems (ACM CAIS '26), May 26-29, 2026, San Jose, CA, USA. Related DOI: https://doi.org/10.1145/3786335.3813145
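The retrieval-time gating idea, authorization checked on every candidate chunk rather than relevance alone, can be sketched as a simple ABAC filter. The attribute names (`tenant`, `required_role`) and the pass/fail rule are illustrative assumptions, not the paper's policy model:

```python
def abac_gate(chunks, subject):
    """Retrieval-time gate: a chunk is visible only if the requesting
    subject belongs to the same tenant AND holds the chunk's required
    role, regardless of how high its relevance score is."""
    allowed = []
    for c in chunks:
        same_tenant = c["tenant"] == subject["tenant"]
        has_role = c.get("required_role", "member") in subject["roles"]
        if same_tenant and has_role:
            allowed.append(c)
    return allowed

chunks = [
    {"id": "d1", "tenant": "acme", "required_role": "member", "score": 0.95},
    {"id": "d2", "tenant": "globex", "required_role": "member", "score": 0.99},  # other tenant
    {"id": "d3", "tenant": "acme", "required_role": "admin", "score": 0.90},     # missing role
]
subject = {"tenant": "acme", "roles": {"member"}}
visible = abac_gate(chunks, subject)
```

The highest-scoring chunk belongs to another tenant and is filtered out, which is exactly the leak path the paper argues pure relevance ranking leaves open.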

[IR-16] Career-Aware Resume Tailoring via Multi-Source Retrieval-Augmented Generation with Provenance Tracking: A Case Study

【速读】: This paper addresses the limitation of AI-assisted resume-tailoring systems that operate on a single uploaded resume, which makes it hard to recover relevant experience omitted from the current draft and hard for users to distinguish model-generated suggestions from edits grounded in real history. The key of the proposed Resume Tailor system is a longitudinal career vault stored in a vector database, from which multi-source retrieval-augmented generation (RAG) assembles job-description (JD) specific resume content out of historical resumes and structured career records. The system is built as a 12-node LangGraph pipeline combining hybrid semantic-lexical confidence scoring, provenance-aware fallback generation, anti-hallucination guardrails, and a conditional review loop, and it clearly improves Applicant Tracking System (ATS) fit scores when the candidate has relevant prior experience in the target domain.

链接: https://arxiv.org/abs/2605.05257
作者: Kumar Abhinav
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages, 1 figure, 5 tables. Also available on SSRN

点击查看摘要

Abstract:AI-assisted resume tailoring systems commonly operate on a single uploaded resume, which limits their ability to recover relevant experience omitted from the current draft and makes it difficult for users to distinguish grounded edits from model-generated suggestions. This paper presents Resume Tailor, an agentic resume-tailoring system that maintains a longitudinal career vault in a vector database and uses multi-source retrieval-augmented generation (RAG) to assemble job-specific resume content from historical resumes and structured career records. The system is implemented as a 12-node LangGraph pipeline with typed state management, hybrid semantic-lexical confidence scoring, provenance-aware fallback generation, anti-hallucination guardrails, and a conditional review loop. We report a pilot evaluation on nine job descriptions (JDs) across software engineering, data analytics, and business analysis roles using a single candidate’s career history. For six JDs where the candidate held at least one prior role in the same occupational category, enabling the career vault improved Applicant Tracking System (ATS)-style fit scores by an average of 7.8 points. For two JDs requiring domain-specific expertise absent from the vault, scores decreased by an average of 8.0 points. One partially overlapping role showed a modest gain of 2 points. These results suggest that longitudinal retrieval can improve resume tailoring when relevant prior experience exists, while also highlighting the need for confidence-gated retrieval when domain overlap is weak.
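The "hybrid semantic-lexical confidence scoring" step can be sketched as a weighted blend of two signals; the weighting scheme, overlap measure, and the 0.7 default are illustrative assumptions rather than the system's actual formula:

```python
def lexical_overlap(a_tokens, b_tokens):
    """Fraction of the smaller token set shared with the other set."""
    if not a_tokens or not b_tokens:
        return 0.0
    return len(a_tokens & b_tokens) / min(len(a_tokens), len(b_tokens))

def hybrid_confidence(semantic_sim, lexical_sim, w_sem=0.7):
    """Blend embedding similarity with exact-term overlap into one
    confidence score; in a pipeline like Resume Tailor, items falling
    below a cutoff would route to the fallback-generation path."""
    return w_sem * semantic_sim + (1 - w_sem) * lexical_sim

# e.g. a JD skill list vs. a vault entry's skill tokens
conf = hybrid_confidence(0.8, lexical_overlap({"python", "sql"}, {"python", "java"}))
```

Blending guards against the failure modes of either signal alone: embeddings can score paraphrases of unrelated skills highly, while pure token overlap misses synonyms.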

[IR-17] EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

【速读】: This paper addresses the lack of datasets in current retrieval-augmented generation (RAG) research that realistically reflect company-internal knowledge, which leaves model performance on proprietary data under-evaluated: existing benchmarks build on public web data and fail to capture enterprise settings with heterogeneous document sources, complex semantics, and realistic noise (misfiled documents, near-duplicates, and conflicting information). The key is the EnterpriseRAG-Bench dataset, containing roughly 500,000 synthetic documents from nine enterprise source types (Slack, Gmail, GitHub, and others) with 500 questions across ten task categories, covering single-document lookups, multi-document reasoning, constrained retrieval, and conflict resolution; a customizable generation framework additionally lets teams produce variants tailored to their own industry, scale, and source mix, promoting standardized development and evaluation of RAG models over private knowledge bases.

链接: https://arxiv.org/abs/2605.05253
作者: Yuhong Sun,Joachim Rahmfeld,Chris Weaver,Roshan Desai,Wenxi Huang,Mark H. Butler
机构: Onyx; UC Berkeley
类目: Information Retrieval (cs.IR)
备注: Code and dataset available at this https URL or this https URL

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become the standard approach for grounding large language models in information that was not available during training. While existing datasets and benchmarks focus on web or other public sources, there is still no widely adopted dataset that realistically reflects the nature of company-internal knowledge. Meanwhile, startups, enterprises, and researchers are increasingly developing AI Agents designed to operate over exactly this kind of proprietary data. To close this gap, we release a synthetic enterprise corpus, its generation framework, and a leaderboard. We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence (grounded in shared projects, people, and initiatives) and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent. The generation framework lets teams generate variants tailored to their own industry, scale, and source mix. The dataset, code, evaluation harness, and leaderboard are available at this https URL.

[IR-18] Decision-aware User Simulation Agent for Evaluating Conversational Recommender Systems

【速读】: This paper addresses the limited realism of current user simulators used to evaluate sales agents in conversational recommender systems (CRS): existing LLM-based simulators rarely model the human decision process explicitly, exhibiting unrealistically high acceptance probabilities under choice overload and ignoring the hesitation or decision deferral common among real users. The key of the proposed theory-grounded Hesitator framework is a modular Decision Module that separates utility-based item selection from overload-aware commitment decisions, more faithfully simulating human decision behavior under information overload; experiments across multiple settings show the module consistently mitigates unrealistic behaviors and reproduces classic behavioral patterns from psychological economics.

链接: https://arxiv.org/abs/2605.05250
作者: Yuan-Chi Li,Li-Chi Chen,Sung-Yi Wu,Yu-Che Tsai,Shou-De Lin
机构: National Taiwan University (国立台湾大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Conversational recommender systems (CRS) increasingly rely on user simulators for automated evaluation of sales agents. A key requirement for such simulators is the ability to model human decision-making. However, most existing simulation frameworks do not explicitly model the internal decision process, and LLM-based simulators often exhibit unrealistically strong information-processing capabilities and rarely exhibit the hesitation or decision deferral commonly observed in real consumer behavior, resulting in overly high acceptance probabilities. To address this limitation, we propose Hesitator, a theory-grounded user simulation framework that explicitly models human decision-making under choice overload. The framework introduces a modular Decision Module that separates utility-based item selection from overload-aware commitment decisions. Experiments across multiple user simulation frameworks, domains, sales modes, and LLM backbones show that integrating our module consistently mitigates unrealistic behaviors under increasing overload conditions. Furthermore, Hesitator reproduces established behavioral patterns from psychological economics, demonstrating its ability to model human decision behavior.
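The separation of utility-based selection from overload-aware commitment can be sketched with a toy model; the logistic form, the log-penalty, and all constants are our illustrative assumptions, not Hesitator's actual Decision Module:

```python
import math

def commit_probability(best_utility, n_options, overload_k=0.15, tau=1.0):
    """Probability that a simulated user commits to the best item:
    a logistic in the item's utility (the selection side), discounted
    by a choice-overload penalty that grows with the size of the
    presented choice set (the commitment side)."""
    base = 1.0 / (1.0 + math.exp(-best_utility / tau))        # utility-based selection
    penalty = math.exp(-overload_k * math.log(1 + n_options))  # overload-aware commitment
    return base * penalty

# Same best item, larger choice set -> lower chance of committing.
p_small_set = commit_probability(2.0, n_options=3)
p_large_set = commit_probability(2.0, n_options=30)
```

Keeping the two factors separate is the point: a simulator without the penalty term accepts at the same rate no matter how many items the agent pushes, which is exactly the unrealistic behavior the paper criticizes.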

[IR-19] TriAlignGR: Triangular Multitask Alignment with Multimodal Deep Interest Mining for Generative Recommendation

【速读】: This paper targets two core problems in Semantic ID (SID) pipelines for generative recommendation: SID Content Degradation (SCD), where cascaded encoding and residual quantization discard multimodal and interest-level semantics, and SID Semantic Opacity (SSO), where models autoregressively generate SID sequences without understanding their underlying semantics, leading to hallucination and poor generalization. Prior work addresses only text-SID alignment, leaving visual semantics and latent user interests unexploited. The key of the unified multitask-multimodal TriAlignGR framework lies in three innovations: (1) Cross-Modal Semantic Alignment (CMSA), which fuses image and text features directly via VLM-generated textual descriptions and a multimodal embedding model so that SIDs inherently carry multimodal semantics; (2) Multimodal Deep Interest Mining (MDIM), which uses LLM chain-of-thought reasoning to extract latent user intents (e.g., inferring a "productivity-focused lifestyle" from noise-canceling headphones), enriching SID semantics before discretization; and (3) Triangular Multitask training (TMT), which jointly trains eight complementary generation tasks (including two novel visual-semantic tasks, VisDesc → SID and VisDesc → Title) to close the SID-text-image triangle under a single autoregressive loss, enabling end-to-end optimization without task-specific towers or complex loss weighting.

链接: https://arxiv.org/abs/2605.05249
作者: Yangchen Zeng,Hao Peng,Rongfeng Guo,Zhenyu Yu,Zhiyuan Hu,Jinze Wang
机构: Southeast University (东南大学); Swinburne University of Technology (斯威本科技大学); Tsinghua University (清华大学); Shenzhen University (深圳大学); Fudan University (复旦大学); Zhejiang University (浙江大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We introduce TriAlignGR, a unified multitask-multimodal framework for generative recommendation that establishes two-stage multimodal semantic propagation: (i) encoding visual semantics directly into SIDs via multimodal embeddings, and (ii) enabling the model to decode these semantics through visual description tasks. Existing Semantic ID (SID) pipelines suffer from two fundamental but underexplored problems: SID Content Degradation (SCD), where cascaded encoding and residual quantization discard critical multimodal and interest-level semantics; and SID Semantic Opacity (SSO), where models autoregressively generate SID sequences without truly comprehending their underlying meaning, leading to hallucination and poor generalization. Prior work addresses at most text-SID alignment, leaving visual semantics and latent user interests entirely unexploited. TriAlignGR resolves both problems through three tightly integrated components: (1) Cross-Modal Semantic Alignment (CMSA) integrates visual content into SID construction through both VLM-generated textual descriptions and a multimodal embedding model that directly encodes image features alongside text, ensuring that SIDs inherently carry multimodal semantics; (2) Multimodal Deep Interest Mining (MDIM) leverages LLM Chain-of-Thought reasoning to extract latent user intents (e.g., "productivity-focused lifestyle" from noise-canceling headphones) beyond surface attributes, enriching SID semantics before discretization; and (3) Triangular Multitask (TMT) jointly trains on eight complementary generation tasks under a single autoregressive loss, including two novel visual-semantic tasks (VisDesc → SID, VisDesc → Title) that map VLM-generated image descriptions to SIDs and titles, completing the SID-Text-Image triangle, without requiring task-specific towers or complex loss weighting.
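Since both identified problems stem from residual quantization in SID construction, a minimal sketch of that step helps; the 2-D vectors and tiny codebooks below are toy assumptions, not the paper's tokenizer:

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared L2)."""
    best, best_d = 0, float("inf")
    for idx, code in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(vec, code))
        if d < best_d:
            best, best_d = idx, d
    return best

def residual_quantize(vec, codebooks):
    """Map an item embedding to a Semantic ID: at each level pick the
    nearest code, subtract it, and quantize the residual at the next
    level. Whatever the final residual still holds is exactly the
    information that 'SID Content Degradation' refers to losing."""
    sid, residual = [], list(vec)
    for cb in codebooks:
        idx = nearest(residual, cb)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tuple(sid)

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # level-1 codes
    [[0.0, 0.0], [0.5, -0.5]],     # level-2 codes over residuals
]
sid = residual_quantize([1.4, 0.6], codebooks)
```

TriAlignGR's CMSA component intervenes before this step, building the embedding from fused image and text features so that the discrete codes start from a semantically richer vector.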

[IR-20] AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation

【速读】: This paper addresses the brittleness of multi-hop retrieval-augmented generation (RAG) in realistic deployments, where retrieved evidence may be noisy or redundant and only limited context can be passed to the generator. Existing controllers merely expand context additively, select from a fixed top-k set, or optimize relevance, without explicitly repairing missing bridge facts. The key of AdaGATE, a training-free evidence controller, is to frame evidence selection as a token-constrained repair problem, combining entity-centric gap tracking, targeted micro-query generation, and a utility-based selection mechanism that balances gap coverage, corroboration, novelty, redundancy, and direct question relevance. On HotpotQA, AdaGATE achieves the best evidence F1 among compared controllers under clean, redundancy-injected, and noise-injected conditions, reaching 71.2% under redundancy injection, while using 2.6× fewer input tokens than Adaptive-k, confirming that explicit gap-aware repair combined with token-efficient evidence selection improves multi-hop RAG robustness.

链接: https://arxiv.org/abs/2605.05245
作者: Yilin Guo,Yinshan Wang,Yixuan Wang
机构: New York University (纽约大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 10 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) remains brittle on multi-hop questions in realistic deployment settings, where retrieved evidence may be noisy or redundant and only limited context can be passed to the generator. Existing controllers address parts of this problem, but typically either expand context additively, select from a fixed top-k set, or optimize relevance without explicitly repairing missing bridge facts. We propose AdaGATE, a training-free evidence controller for multi-hop RAG that frames evidence selection as a token-constrained repair problem. AdaGATE combines entity centric gap tracking, targeted micro-query generation, and a utility based selection mechanism that balances gap coverage, corroboration, novelty, redundancy, and direct question relevance. We evaluate AdaGATE on HotpotQA under clean, redundancy, and noise injected retrieval conditions. Across all three settings, AdaGATE achieves the best evidence F1 among the compared controllers, reaching 62.3% on clean data and 71.2% under redundancy injection, while using 2.6x fewer input tokens than Adaptive-k. These results suggest that explicit gap-aware repair, combined with token-efficient evidence selection, improves robustness in multi-hop RAG under imperfect retrieval. Our code and evaluation pipeline are available at this https URL.
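The utility-based, token-constrained selection loop might look like the greedy sketch below; the utility weights, candidate fields, and tie handling are our assumptions, not AdaGATE's actual scoring function:

```python
def select_evidence(candidates, open_gaps, token_budget, lam=0.5):
    """Greedy token-constrained selection: repeatedly pick the passage
    with the highest utility (new gap entities covered + relevance,
    minus a redundancy penalty) that still fits the token budget.

    candidates: dicts with 'id', 'tokens', 'entities' (set), 'relevance'
    open_gaps:  set of bridge entities still missing
    """
    chosen, covered, used = [], set(), 0
    pool = list(candidates)
    while pool:
        def utility(c):
            gain = len((c["entities"] & open_gaps) - covered)
            redundancy = len(c["entities"] & covered)
            return gain + lam * c["relevance"] - 0.25 * redundancy
        pool.sort(key=utility, reverse=True)
        best = pool[0]
        if utility(best) <= 0:
            break                      # nothing useful left
        pool.pop(0)
        if used + best["tokens"] > token_budget:
            continue                   # skip passages that do not fit
        chosen.append(best["id"])
        covered |= best["entities"]
        used += best["tokens"]
    return chosen

candidates = [
    {"id": "a", "tokens": 50, "entities": {"X", "Y"}, "relevance": 0.9},
    {"id": "b", "tokens": 80, "entities": {"Y"}, "relevance": 0.8},
    {"id": "c", "tokens": 40, "entities": {"Z"}, "relevance": 0.4},
]
picked = select_evidence(candidates, {"X", "Y", "Z"}, token_budget=100)
```

Note that the low-relevance passage "c" beats the high-relevance "b" because it covers a still-open gap entity, which is the gap-repair behavior a pure relevance ranker cannot reproduce.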

[IR-21] Towards Dependable Retrieval-Augmented Generation Using Factual Confidence Prediction

【速读】: This paper addresses how to assess whether the context retrieved by a Retrieval-Augmented Generation (RAG) system actually supports the generated answer, i.e., how to evaluate the factual consistency between retrieval and generation. The key is a two-stage approach: the first stage uses conformal prediction to select retrieved chunks with a high probability of coming from the correct source, improving answer quality with statistical guarantees; the second stage quantifies confidence in the consistency between a generated answer and the given retrieved context via an attention-based factuality classifier, detecting inconsistent answers with up to 77% success. The method offers a new certification path toward factually dependable RAG systems.

链接: https://arxiv.org/abs/2605.05244
作者: Florian Geissler,Francesco Carella,Laura Fieback,Jakob Spiegelberg
机构: Fraunhofer Institute for Integrated Systems and Device Technology (弗劳恩霍夫集成系统与器件技术研究所)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Incorporating specific knowledge into large language models via retrieval-augmented generation (RAG) is a widespread technique that fuels many of today’s industry AI applications. A fundamental problem is to assess if the context retrieved by some similarity search provides indeed supporting facts, or instead misguides the generator with irrelevant information. It is critical to associate meaningful confidence measures about the factuality of the retrieval process with the generated answers. We present a new, two-staged approach to predict fact faithfulness of the output of retrieval-augmented generations. First, we employ conformal prediction to select only those retrieved chunks who have a high chance to come from the correct source. This approach in itself can improve answer quality by up to 6% in some of the studied datasets, however, the associated statistical guarantees do not hold generally, since the assumption of sample exchangeability depends on the retriever setup. We present diagnostic metrics to assess whether a setup is suitable. Second, we quantify confidence in the consistency of a generated final answer with a given retrieved context, using an attention-based factuality classifier. This approach can detect inconsistent answers with a chance of up to 77%. Our work helps to establish a novel type of certified RAG systems for a broad range of natural language industry applications.
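The first stage, a conformal cutoff on retrieval nonconformity scores, can be sketched as standard split-conformal calibration. Treating retrieval distance as the nonconformity score, and the particular calibration values, are illustrative assumptions on our part:

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal cutoff: the ceil((n+1)(1-alpha))-th smallest
    nonconformity score from a calibration set of chunks known to come
    from the correct source. New chunks at or below the cutoff are kept
    with (under exchangeability) roughly 1-alpha coverage."""
    n = len(calibration_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(calibration_scores)[k - 1]

def conformal_filter(chunks, threshold):
    """Keep only (id, score) chunks within the conformal cutoff."""
    return [cid for cid, score in chunks if score <= threshold]

# Calibration: retrieval distances of chunks verified to be correct-source.
calib = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
tau = conformal_threshold(calib, alpha=0.2)
kept = conformal_filter([("c1", 0.5), ("c2", 0.95)], tau)
```

As the abstract cautions, the coverage guarantee rests on exchangeability between calibration and query-time scores, which is exactly the assumption their diagnostic metrics are meant to check.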

[IR-22] Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

【速读】: This paper addresses a bottleneck of conventional retrieval systems in agentic search: fixed similarity interfaces (sparse/dense retrievers or rerankers) support only single-step top-k retrieval, poorly serving multi-step reasoning, local context verification, weak-clue conjunction, and plan revision, and evidence filtered out early cannot be recovered by stronger downstream reasoning. The key solution is Direct Corpus Interaction (DCI): rather than relying on any embedding model, vector index, or retrieval API, the agent operates on the raw corpus with general-purpose terminal tools (grep, file reads, shell commands, and the like), achieving high-resolution access without offline indexing and adapting naturally to evolving local corpora. Experiments show that DCI substantially outperforms strong sparse, dense, and reranking baselines on several IR benchmarks and end-to-end agentic search tasks, with strong accuracy on BrowseComp-Plus and multi-hop QA, indicating that retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus, opening a new interface-design space for agentic search.

链接: https://arxiv.org/abs/2605.05242
作者: Zhuofeng Li,Haoxiang Zhang,Cong Wei,Pan Lu,Ping Nie,Yi Lu,Yuyang Bai,Shangbin Feng,Hangxiao Zhu,Ming Zhong,Yuyu Zhang,Jianwen Xie,Yejin Choi,James Zou,Jiawei Han,Wenhu Chen,Jimmy Lin,Dongfu Jiang,Yu Zhang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search, it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To tackle the limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus, with which DCI opens a broader interface-design space for agentic search.
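As a concrete illustration of why direct corpus interaction handles exact lexical constraints that similarity search cannot, here is an in-memory emulation of `grep -n` (the corpus format and context handling are our assumptions; the paper's agents invoke real terminal tools on files):

```python
import re

def grep_corpus(corpus, pattern, context=0):
    """Emulate `grep -n` over an in-memory corpus: return (doc_id,
    line_number, matched lines plus optional context) for every regex
    hit, enforcing the lexical constraint exactly instead of
    approximating it with embedding similarity."""
    rx = re.compile(pattern)
    hits = []
    for doc_id, text in corpus.items():
        lines = text.splitlines()
        for i, line in enumerate(lines):
            if rx.search(line):
                lo, hi = max(0, i - context), min(len(lines), i + context + 1)
                hits.append((doc_id, i + 1, lines[lo:hi]))
    return hits

corpus = {
    "notes.txt": "alpha release\nversion 2.3.1 shipped\n",
    "todo.txt": "write docs\n",
}
hits = grep_corpus(corpus, r"version \d+\.\d+\.\d+")
```

A dense retriever would at best rank "version-like" passages highly; the regex either matches or it does not, and nothing matching is ever pruned away before the agent sees it.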

[IR-23] Dynamic Graph with Similarity-Aware Attention Graph Neural Network for Recommender Systems

【速读】: This paper addresses the inability of traditional collaborative filtering to capture dynamically changing user preferences, and the shortcomings of existing GNN recommenders in modeling user-user relations and dynamic graph evolution. The key is a Dynamic Graph Similarity-Aware Attention GNN (DG-SA-GNN): it builds four parallel user-similarity graphs from different similarity functions (cosine, Jaccard, discounted Pearson correlation, and IPIJ), each processed by a dedicated UserGNN module; a Graph Transformer then fuses the multi-view information, and a cross-attention module refines the interaction between user and item embeddings. Crucially, the user-similarity graphs are reconstructed at scheduled epochs during training, letting the model adapt to the evolving embedding space and thereby realizing dynamic graph evolution. The design yields clear gains, reaching Recall@20 of 0.162 on MovieLens100K and outperforming the LightGCN baseline.

链接: https://arxiv.org/abs/2605.05238
作者: Aadarsh Senapati,Neha Kujur,Vivek Yelleti
机构: 未知
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Recommender systems are essential components of modern online platforms that present personalized content across various domains. Traditional collaborative filtering methods depend on static user-item interaction graphs and a limited subset of similarity measures, which fail to capture the evolving nature of individual preferences. Recent graph neural network (GNN) based approaches focus on user-item bipartite graphs and do not use explicit user-user relational modelling or dynamic graph evolution during training. To address these limitations, this paper proposes a Dynamic Graph Similarity-Aware Attention Graph Neural Network (DG-SA-GNN) framework that integrates dynamic user similarity graph construction with multi-similarity propagation and attention-based aggregation. The proposed architecture constructs four parallel user similarity graphs using Cosine, Jaccard, Discounted Pearson Correlation Coefficient (Discount PCC), and IPIJ similarity functions, each processed by a dedicated UserGNN module. A Graph Transformer fuses the four graph views, and a CrossAttention module refines user embeddings through interaction with item embeddings. Crucially, the graphs are reconstructed at scheduled epochs during training, enabling the model to adapt to the learned embedding space, constituting the dynamic graph component. Mini-batch training with hard negative sampling improves scalability and convergence. Experiments on the MovieLens100K benchmark demonstrate that DG-SA-GNN achieves a Recall@20 of 0.162 and NDCG@20 of 0.065, outperforming the LightGCN baseline in recall. The results validate that dynamic multi-similarity graph construction coupled with attention-based fusion produces improved recommendation performance.
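The multi-view user-similarity graph construction can be sketched with two of the four similarity functions; the k-nearest-neighbor edge rule and the set-based user features are our simplifying assumptions:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_sim(items_a, items_b):
    """Jaccard similarity between two users' interacted-item sets."""
    union = len(items_a | items_b)
    return len(items_a & items_b) / union if union else 0.0

def build_user_graph(users, sim_fn, k=2):
    """Connect each user to its k most similar peers under one
    similarity view. DG-SA-GNN builds four such graphs (one per
    similarity function) and rebuilds them at scheduled epochs as the
    embeddings move."""
    edges = {}
    for u, fu in users.items():
        scored = sorted(
            ((sim_fn(fu, fv), v) for v, fv in users.items() if v != u),
            reverse=True,
        )
        edges[u] = [v for _, v in scored[:k]]
    return edges

users = {"u1": {1, 2, 3}, "u2": {2, 3, 4}, "u3": {7, 8}}
graph = build_user_graph(users, jaccard_sim, k=1)
```

Each similarity view yields a different edge set over the same users, which is why the framework fuses the four views with a Graph Transformer rather than picking one measure up front.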

[IR-24] DisastRAG: A Multi-Source Disaster Information Integration and Access System Based on Retrieval-Augmented Large Language Models

【速读】: This paper addresses the heterogeneity and time-sensitivity of information access in disaster management: how to efficiently integrate structured operational records, unstructured institutional documents, and dynamic external sources to serve complex, time-critical, and context-dependent information needs, which existing single-pathway systems cannot satisfy. The key of the proposed DisastRAG framework is a multi-path architecture: document retrieval over a curated hazard corpus, structured access over relational disaster records, and an external web fallback for out-of-corpus requests, unified with query understanding, strategy routing, response generation, and contextual memory into one information integration and access system. Experiments confirm that retrieval augmentation brings substantial gains, with 12-23 percentage-point improvements in multiple-choice accuracy and up to 10.5 points in open-ended keypoint coverage, and that different retrieval strategies affect models differently, validating the architecture's effectiveness in complex disaster-information scenarios.

链接: https://arxiv.org/abs/2605.05210
作者: Bo Li,Zhitong Chen,Kai Yin,Junwei Ma,Yiming Xiao,Ali Mostafavi
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Effective disaster management requires rapid access to information distributed across structured operational records, unstructured institutional documents, and dynamic external sources. However, most existing disaster information systems and retrieval-augmented generation frameworks remain organized around a single access pathway, limiting their ability to support heterogeneous, time-sensitive, and context-dependent information needs. This study presents DisastRAG, a disaster-aware information integration and access system that combines large language models with retrieval-augmented access to structured, unstructured, and contextual disaster information. The framework is built around a multi-path architecture that supports document retrieval over a curated hazard corpus, structured access over relational disaster records, and external web fallback for out-of-corpus requests, while also incorporating query understanding, strategy routing, response generation, and contextual memory within a unified system. We evaluated the document retrieval performance using four open-source large language models across multiple retrieval configurations on multiple-choice and open-ended disaster information tasks. Retrieval augmentation consistently improves performance over no-retrieval baselines, yielding multiple-choice gains of 12-23 percentage points and open-ended keypoint coverage gains of up to 10.5 percentage points. Results show that larger candidate pools are most helpful for weaker models, while stronger models are more sensitive to retrieval noise. Hybrid retrieval performs best for open-ended coverage, whereas vector retrieval and shallower reranking more often favor closed-form factual selection. Case studies further show that structured access and web fallback extend the framework beyond document-only RAG.
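The strategy-routing stage can be illustrated with a toy keyword router; the keyword lists and path names are illustrative assumptions (the real system uses an LLM-based query-understanding stage rather than heuristics):

```python
def route_query(query):
    """Toy strategy router choosing among the three access paths:
    structured access over relational records, document retrieval over
    the hazard corpus, and web fallback for out-of-corpus requests."""
    q = query.lower()
    if any(w in q for w in ("how many", "count", "total", "average")):
        return "structured"       # aggregate question -> relational records
    if any(w in q for w in ("latest", "current", "today", "breaking")):
        return "web_fallback"     # time-sensitive -> external web
    return "documents"            # default: curated hazard corpus

route_a = route_query("How many shelters are open in Harris County?")
route_b = route_query("What are the latest road closures?")
route_c = route_query("What does FEMA guidance recommend for flood insurance?")
```

The value of routing is that each path fails differently: sending an aggregate count question to document retrieval, or a policy question to the live web, degrades both accuracy and latency.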

[IR-25] PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training

【速读】: This paper addresses two core limitations of search-based reinforcement learning (RL) for agentic search: expensive long-horizon rollouts are under-utilized during training, and supervision is available only at the final answer, causing severe reward sparsity. The key of the proposed Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE) framework is to extract prefix states from complete search trajectories, elicit intermediate answers from them, and use these prefixes both to construct additional training trajectories and to compute step-level rewards, improving data efficiency and credit assignment; a single shared model handles both search-policy learning and prefix-answer evaluation, enabling end-to-end joint optimization without extra human annotations or a separate reward model.

链接: https://arxiv.org/abs/2604.03675
作者: Erhan Zhang,Yiqun Chen,Zechun Niu,Wei Yang,Xiaochi Wei,Yan Gao,Yi Wu,Yao Hu,Jiaxin Mao
机构: Renmin University of China; Xiaohongshu Inc.; University of Southern California
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.
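下面用一个极简的 Python 草图示意 PRAISE 的核心机制之一:从完整搜索轨迹抽取前缀,并以相邻前缀的表现差作为步骤级奖励。其中 `perf` 评分函数与轨迹数据结构均为演示自拟(论文中由共享模型对前缀生成中间答案并评分),并非官方实现。

```python
# 示意:从一条完整搜索轨迹抽取前缀,用相邻前缀的表现差计算步骤级奖励。
# perf() 为占位评分:真实系统中应由共享模型在该前缀处给出中间答案并与标准答案比对。

def perf(prefix):
    # 假设的评分:用"已覆盖的关键线索数 / 总线索数"模拟中间答案质量
    clues = {"entity_A", "entity_B", "bridge_fact"}
    found = set()
    for turn in prefix:
        found |= turn["clues"]
    return len(found & clues) / len(clues)

def step_rewards(trajectory):
    """对每个搜索轮次,奖励 = 该前缀的表现 - 上一前缀的表现。"""
    rewards = []
    prev = 0.0
    for t in range(1, len(trajectory) + 1):
        cur = perf(trajectory[:t])
        rewards.append(cur - prev)
        prev = cur
    return rewards

trajectory = [
    {"query": "q1", "clues": {"entity_A"}},
    {"query": "q2", "clues": set()},            # 无效搜索轮次
    {"query": "q3", "clues": {"entity_B", "bridge_fact"}},
]
rewards = step_rewards(trajectory)
print(rewards)  # 第二轮奖励为 0,体现对无效搜索轮次的细粒度信用分配
```

这种按前缀差分的奖励把最终答案处的稀疏监督摊回到每个搜索轮次,同时每个前缀本身也可作为额外训练轨迹复用。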

人机交互

[HC-0] SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition

【速读】:该论文旨在解决当前手语识别(Sign Language Recognition, SLR)研究中过度依赖视觉数据所带来的局限性,如对光照和遮挡敏感、隐私问题以及跨模态多样性不足等问题。其解决方案的关键在于构建一个大规模多模态数据集SIGMA-ASL,通过融合RGB-D相机、毫米波雷达(mmWave radar)和腕戴惯性测量单元(IMU)三种传感器,采集互补的视觉、无线电反射和运动学信息,并实现毫秒级时间对齐的统一感知框架,从而支持鲁棒、隐私保护且泛化的手语识别系统开发。

链接: https://arxiv.org/abs/2605.06351
作者: Xiaofang Xiao,Guangchao Li,Guangrong Zhao,Qi Lin,Wen Ma,Hongkai Wen,Yanxiang Wang,Yiran Shen
机构: Shandong University (山东大学); University of Warwick (华威大学)
类目: Human-Computer Interaction (cs.HC)
备注: 33 pages. Preprint version

点击查看摘要

Abstract:Automatic sign language recognition (SLR) has become a key enabler of inclusive human-computer interaction, fostering seamless communication between deaf individuals and hearing communities. Despite significant advances in multimodal learning, existing SLR research remains dominated by vision-based datasets, which are limited by sensitivity to lighting and occlusion, privacy concerns, and a lack of cross-modal diversity. To address these challenges, we introduce SIGMA-ASL, a large-scale multimodal dataset for SLR. The dataset integrates an Azure Kinect RGB-D camera, a millimeter-wave (mmWave) radar, and two wrist-worn inertial measurement units (IMUs) to capture complementary visual, radio-reflection, and kinematic information. Collected in a controlled studio environment with 20 participants performing 160 common American sign language (ASL) signs, SIGMA-ASL provides 93,545 temporally synchronized word-level multimodal clips. A unified sensing framework achieves millisecond-level alignment across modalities, enabling reliable sensor fusion and cross-modal learning. We further design standardized preprocessing pipelines and benchmarking protocols under both user-dependent and user-independent settings, offering a comprehensive foundation for evaluating single and multimodal SLR. Extensive experiments validate the dataset’s quality and demonstrate its potential as a valuable resource for developing robust, privacy-preserving, and ubiquitous sign language recognition systems.
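摘要提到数据集通过统一感知框架实现毫秒级跨模态时间对齐。以下为一个最近邻时间戳匹配的示意草图(对齐容差、帧率等参数为演示假设,并非该数据集的官方同步流程):

```python
# 示意:以相机帧时间戳为基准,为每帧匹配时间上最近的雷达帧/IMU 采样,
# 超出容差(毫秒)则视为无法对齐。
import bisect

def align(base_ts, other_ts, tol_ms=5):
    """为 base_ts 中每个时间戳在升序序列 other_ts 中找最近邻下标,超出容差返回 None。"""
    aligned = []
    for t in base_ts:
        i = bisect.bisect_left(other_ts, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_ts)]
        best = min(candidates, key=lambda j: abs(other_ts[j] - t))
        aligned.append(best if abs(other_ts[best] - t) <= tol_ms else None)
    return aligned

camera_ts = [0, 33, 66, 100]        # 约 30 fps 的相机帧(毫秒)
radar_ts = [1, 34, 70, 98, 131]     # 假设的雷达帧时间戳(毫秒)
print(align(camera_ts, radar_ts))   # [0, 1, 2, 3]
```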

[HC-1] Human-AI Co-Evolution and Epistemic Collapse: A Dynamical Systems Perspective ICML

【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)与人类认知之间形成的闭环反馈机制如何影响知识生成的长期演化路径,尤其是这种耦合动态系统是否会引发性能退化或多样性丧失。解决方案的关键在于提出一个由人类认知、数据质量和模型能力三变量构成的最小化动力学模型,并通过理论分析和模拟揭示出三种不同的演化状态:协同增强、脆弱平衡和退化收敛。研究发现,随着人类对AI依赖度增加,系统可能从有益的协同进化过渡到低多样性、次优的稳定状态,其本质表现为信息瓶颈效应——熵减少并非源于有效压缩,而是由于闭环反馈导致的知识多样性损失。这一发现强调了人机共演化动力学在AI发展轨迹中的核心作用,超越了单纯模型设计的视角。

链接: https://arxiv.org/abs/2605.06347
作者: Xuening Wu,Yanlan Kang,Qianya Xu,Kexuan Xie,Jiaqi Mi,Honggang Wang,Yubin Liu,Zeping Chen
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, ICML EIML Workshop submitted

点击查看摘要

Abstract:Large language models (LLMs) are reshaping how knowledge is produced, with increasing reliance on AI systems for generation, summarization, and reasoning. While prior work has studied cognitive offloading in humans and model collapse in recursive training, these effects are typically considered in isolation. We propose a unified perspective: humans and language models form a coupled dynamical system linked by a feedback loop of usage, generation, and retraining. We introduce a minimal model with three variables – human cognition, data quality, and model capability – and show that this feedback can give rise to distinct dynamical regimes. Our analysis identifies three regimes: co-evolutionary enhancement, fragile equilibrium, and degenerative convergence. Through a simple simulation, we demonstrate that increasing reliance on AI can induce a transition toward a low-diversity, suboptimal equilibrium. From an information-theoretic perspective, this transition corresponds to an emergent information bottleneck in the human-AI loop, where entropy reduction reflects loss of diversity and support under closed-loop feedback rather than beneficial compression. These results suggest that the trajectory of AI systems is shaped not only by model design, but by the dynamics of human-AI co-evolution.
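摘要描述了一个由人类认知、数据质量、模型能力三变量构成的耦合动力系统,但未给出具体方程。下面用一组纯属演示假设的耦合常微分方程做欧拉法模拟,示意"依赖度 r 升高会把系统推向更低的均衡点"这一定性结论:

```python
# 示意性三变量耦合系统(方程形式为演示自拟,并非论文模型):
# H' = a*(1-H) - r*H*M   人类认知:自然恢复,随依赖度 r 与模型能力耦合而衰减
# D' = b*H - c*D         数据质量:由人类认知供给,自然衰减
# M' = d*D - e*M         模型能力:由数据质量驱动,自然衰减

def simulate(r, steps=20000, dt=0.01, a=0.5, b=0.5, c=0.5, d=0.5, e=0.5):
    H, D, M = 1.0, 1.0, 1.0
    for _ in range(steps):
        dH = a * (1 - H) - r * H * M
        dD = b * H - c * D
        dM = d * D - e * M
        H += dt * dH
        D += dt * dD
        M += dt * dM
    return H, D, M

low = simulate(r=0.1)   # 低依赖:趋近较高的共演化均衡
high = simulate(r=2.0)  # 高依赖:收敛到更低的次优均衡
print(low, high)
```

在该玩具模型中,三个变量的均衡值随 r 单调下降,对应摘要所述从协同增强到退化收敛的状态转变。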

[HC-2] LLM-Based Educational Simulation: Evaluating Temporal Student Persona Stability Across ADHD Profiles

【速读】:该论文旨在解决生成式 AI(Generative AI)在教育场景中作为模拟学习者时的稳定性问题,即大语言模型(Large Language Models, LLMs)能否在长时间对话中保持一致的角色设定(persona),从而确保其在教师培训和自适应教学等应用中的有效性。解决方案的关键在于:首先,通过双评估框架量化自我报告特征与观察者评分的行为表达的稳定性;其次,发现高强度ADHD模拟角色在自我报告层面具有良好的跨对话稳定性,但行为表现存在选择性不稳定性——尤其在无脚本对话中出现内部漂移;而采用显式任务提示的结构化交互设计可完全消除这种漂移,从而保障行为一致性。因此,维持模拟学习者的稳定行为表现依赖于结构化的交互设计,这对实现路径依赖的学习者建模具有重要意义。

链接: https://arxiv.org/abs/2605.06307
作者: Jana Gonnermann-Müller,Jennifer Haase,Nicolas Leins,Thomas Kosch,Sebastian Pokutta
机构: Zuse Institute Berlin (兹塞研究所); Weizenbaum Institute (魏泽曼研究所); HU Berlin (柏林洪堡大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Student simulation with Large language models (LLMs) offers a scalable alternative for educational research and teacher training. Yet, its validity depends on whether models maintain stable personas across extended interactions. We test this prerequisite using a dual-assessment framework measuring self-reported characteristics and observer-rated behavioral expressions. Across two experiments testing four clinically-grounded ADHD persona conditions, five LLMs, and three prompt designs, we quantify between-conversation stability (N=4,968) and within-conversation stability (N=3,952 across 9 turns). Self-reported characteristics remain stable for high intensities, constituting a necessary prerequisite for valid behavioral simulation. Observer-rated behavioral expression reveals selective instability: within-conversation drift occurs in unscripted dialog for high and moderate ADHD personas. Scripted interactions with explicit task prompts eliminate this drift entirely. Stable, persona-aligned simulated learners benefit from a structured interaction design to maintain behavioral coherence, which holds significant implications for teacher training, adaptive tutoring, and any application requiring sustained, path-dependent learner interactions.

[HC-3] LearnMate2: Design and Evaluation of an LLM-powered Personalized and Adaptive Support System for Online Learning

【速读】:该论文旨在解决在线学习平台缺乏个性化指导的问题,从而影响学习效果。当前在线学习虽具备广泛可及性,但难以根据用户需求提供定制化支持。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)设计一个名为LearnMate2的智能教育工具,其核心功能包括生成个性化学习计划、提供实时情境化辅助以及自适应调整学习活动,从而实现教学内容与用户个体特征的动态匹配。实证研究表明,该系统显著优于现有在线学习平台和通用LLM支持工具,在提升学习成效和用户体验方面展现出明显优势。

链接: https://arxiv.org/abs/2605.06257
作者: Xinyu Jessica Wang,Christine P. Lee,Bilge Mutlu
机构: University of Wisconsin–Madison (威斯康星大学麦迪逊分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Personalization is crucial for effective learning, yet online learning, designed for widespread availability and open access, lacks personalized guidance. Recent advancements in large language models (LLMs) offer opportunities to bridge this gap. We explore how LLM-driven tools may be designed to support personalized and adaptive learning and examine how they shape user experience and learning outcomes. We iteratively designed LearnMate2 to support online learning by providing personalized study plans, real-time contextual assistance, and adaptive learning activities. A preliminary study (n=24) assessed the effectiveness and usability of LearnMate2 and informed refinements in our system, which we then evaluated (n=16) against a combination of a state-of-the-art online learning platform and an LLM for learning support. Results indicate that LearnMate2 advances AI pedagogy by improving both learning outcomes and user experience compared to existing online learning and support tools. This work advances our understanding of the design space of personalized, AI-driven educational tools and their potential impact on user experience.

[HC-4] RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI

【速读】:该论文旨在解决机器人在无明确用户指令情况下,如何理解并遵守社会规范以实现主动智能(active intelligence)的问题。当前主流研究聚焦于基于显式指令的任务完成(被动智能),但要使机器人融入人类社会,必须具备识别允许与禁止行为的能力。解决方案的关键在于构建首个针对主动智能的基准测试体系——RobotEQ,包含 RobotEQ-Data 数据集(1,900 张第一人称视角图像、5,353 个动作判断问题和 1,286 个空间定位问题)以及 RobotEQ-Bench 评估框架,用于系统性评测模型对社会规范的理解与执行能力。实验表明,现有模型在空间 grounding 任务上表现不足,而引入检索增强生成(Retrieval-Augmented Generation, RAG)技术整合外部社会规范知识库可显著提升性能,推动机器人从被动操作向主动社会合规演进。

链接: https://arxiv.org/abs/2605.06234
作者: Kuofei Fang,Xinyi Che,Haomin Ouyang,Shufan Zhang,Xuehao Wang,Qi Liu,Liyi Liu,Chenqi Zhang,Wenxi Cai,Wenyu Dai,Jinyang Wu,Fan Zhang,Haoyu Chen,Bin He,Zheng Lian
机构: Tongji University (同济大学); Tsinghua University (清华大学); The Chinese University of Hong Kong (香港中文大学); University of Oulu (奥卢大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Embodied AI is a prominent research topic in both academia and industry. Current research centers on completing tasks based on explicit user instructions. However, for robots to integrate into human society, they must understand which actions are permissible and which are prohibited, even without explicit commands. We refer to the user-guided AI as passive intelligence and the unguided AI as active intelligence. This paper introduces RobotEQ, the first benchmark for active intelligence, aiming to assess whether existing models can comprehend and adhere to social norms in embodied scenarios. First, we construct RobotEQ-Data, a dataset consisting of 1,900 egocentric images, spanning 10 representative embodied categories and 56 subcategories. Through extensive manual annotation, we provide 5,353 action judgment questions and 1,286 spatial grounding questions, specifying appropriate robot actions across diverse scenarios. Furthermore, we establish RobotEQ-Bench to evaluate the performance of state-of-the-art models on this task. Experimental results show that current models still fall short in achieving reliable active intelligence, particularly in spatial grounding. Meanwhile, we observe that leveraging RAG techniques to incorporate external social norm knowledge bases can generally enhance performance. This work can facilitate the transition of robotics from user-guided passive manipulation to active social compliance.

[HC-5] AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition

【速读】:该论文旨在解决开放词汇多模态情感识别(Open-Vocabulary Multimodal Emotion Recognition, OV-MER)中因传统判别方法受限于预定义标签空间而导致细粒度情感理解不足的问题,以及现有训练策略(如基于token-level损失)与实际评估指标不一致、无法直接通过梯度反向传播优化的局限性。解决方案的关键在于引入强化学习(Reinforcement Learning, RL)框架AffectGPT-RL,利用其优化非可微目标的能力,使模型能够更有效地对齐任务指标,并通过设计合理的奖励机制提升情感识别性能。实验表明,该方法不仅显著提升了OV-MER的性能,还在基础情感识别(Basic Emotion Recognition)和情感分析等任务上取得SOTA结果,验证了强化学习在多模态情感理解中的有效性与泛化能力。

链接: https://arxiv.org/abs/2605.06126
作者: Zheng Lian,Fan Zhang,Lan Chen,Yazhou Zhang,Rui Liu,Jinyang Wu,Haoyu Chen,Xiaobai Li,Xiaojiang Peng,Bin He,Jianhua Tao
机构: Tongji University (同济大学); The Chinese University of Hong Kong (香港中文大学); Tianjin University (天津大学); Inner Mongolia University (内蒙古大学); Tsinghua University (清华大学); CMVS, University of Oulu (Oulu大学CMVS中心); Zhejiang University (浙江大学); Shenzhen Technology University (深圳技术大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Open-Vocabulary Multimodal Emotion Recognition (OV-MER) aims to predict emotions without being constrained by predefined label spaces, thereby enabling fine-grained emotion understanding. Unlike traditional discriminative methods, OV-MER leverages generative models to capture the full spectrum of emotions and employs emotion wheels (EWs) for metric calculation. Previous approaches primarily rely on token-level loss during training. However, this objective is misaligned with the metrics used in OV-MER, and these metrics cannot be directly optimized via gradient backpropagation. To address this limitation, we turn our attention to reinforcement learning, as this strategy can optimize non-differentiable objectives. We term this framework AffectGPT-RL. Furthermore, we conduct extensive experiments to elucidate the role of reinforcement learning in this task, revealing the necessity of the reasoning process, the impact of different rewards, and the generalizability to other emotion tasks such as sentiment analysis and basic emotion recognition. Experimental results demonstrate that AffectGPT-RL yields significant performance improvements on OV-MER. Beyond this task, we also achieve remarkable performance gains on basic emotion recognition, attaining state-of-the-art results on MER-UniBench. To the best of our knowledge, this is the pioneering work exploring the role of reinforcement learning in OV-MER, providing valuable guidance for subsequent researchers. Our code is provided in the supplementary material and will be released to facilitate future research.
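摘要指出 OV-MER 使用情感轮(EWs)计算指标,而该指标不可微、只能经强化学习优化。下面的 Python 草图示意一种基于情感轮的集合匹配奖励:先把开放词表情感词映射到轮上的簇,再用簇级 F1 作为奖励信号。映射表与奖励形式均为演示假设,并非论文的官方度量。

```python
# 示意:OV-MER 的情感轮集合匹配奖励(不可微,适合作为 RL 奖励)。
WHEEL = {  # 假设的"情感词 -> 情感轮簇"映射
    "joy": "happiness", "delight": "happiness", "happiness": "happiness",
    "anger": "anger", "rage": "anger",
    "sadness": "sadness", "grief": "sadness",
}

def ew_reward(predicted, reference):
    """把两侧情感词映射到簇后计算簇级 F1。"""
    p = {WHEEL[w] for w in predicted if w in WHEEL}
    r = {WHEEL[w] for w in reference if w in WHEEL}
    if not p or not r:
        return 0.0
    precision = len(p & r) / len(p)
    recall = len(p & r) / len(r)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reward = ew_reward(["delight", "rage"], ["joy", "sadness"])
print(reward)  # "delight" 与 "joy" 同簇,命中 1 个簇,F1 = 0.5
```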

[HC-6] EventColumn: Integrating Event Sequences into Tabular Visualizations IEEE-VIS2026

【速读】:该论文旨在解决如何在统一表结构中同时整合事件序列数据与异构表格属性(如数值型、类别型和时间属性),以支持实例级和群体级的联合分析问题。传统方法通常仅能比较事件序列或表格数据,无法实现两者的协同分析,限制了对复杂业务场景(如钢铁生产物流)的理解深度。解决方案的关键在于提出EventColumn这一新型列类型,它将事件序列与多模态表格属性融合为单一结构,并提供压缩概览、热力图群组摘要、按事件类型对齐及相似历史项的箱线图等可视化工具,从而实现事件序列与表格属性的同步对比与洞察挖掘。

链接: https://arxiv.org/abs/2605.06065
作者: Jakob Zethofer,Andreas Hinterreiter,Lukas Schiefermüller,Belgin Mutlu,Marc Streit
机构: Pro2Future GmbH(Pro2Future有限公司); JKU Linz(林茨大学); voestalpine Stahl GmbH(沃斯特阿尔派钢铁有限公司)
类目: Human-Computer Interaction (cs.HC)
备注: Submitted to the Short Paper track at IEEE VIS 2026

点击查看摘要

Abstract:We introduce EventColumn, a new column type that integrates event-sequence data with heterogeneous tabular attributes into a single unified table. EventColumn lets analysts compare event sequences alongside numerical, categorical, and temporal attributes at both instance and group levels, offering a compressed overview, heatmap group summaries, alignment by event types, and boxplots of similar historical items. We developed EventColumn together with collaborators from the steel industry to facilitate the analysis of production events and warehouse logistics, but the solution generalizes to a wide range of event sequence datasets with additional tabular attributes. Unlike most existing approaches that compare either event sequences or tables, EventColumn supports simultaneous comparison of both. We demonstrate its integration with Taggle and Microsoft Power BI on data from steel production logistics and on a public e-commerce dataset.

[HC-7] Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized Gestures

【速读】:该论文旨在解决当前语音驱动的3D手势生成(speech-driven 3D gesture generation)在评估过程中依赖主观感知评价、且因虚拟人外观(avatar appearance)和面部呈现方式(facial presentation)不一致而导致评价偏差的问题。解决方案的关键在于通过受控实验,系统性地比较七种当代研究与应用中常见的虚拟人渲染风格下手势生成效果的感知差异,从而揭示视觉呈现对运动判断的系统性影响,并据此提出针对手势合成基准测试及面向人类交互场景部署的优化建议。

链接: https://arxiv.org/abs/2605.06063
作者: Haoyang Du,Yinghan Xu,John Dingliana,Brian Keegan,Rachel McDonnell,Cathy Ennis
机构: Technological University Dublin(都柏林理工学院); Trinity College Dublin(三一学院); Maynooth University(梅努斯大学)
类目: Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The capacity to create realistic virtual humans has progressed significantly, and such characters can be found in many applications across entertainment, education and health. As an essential element of interactive virtual humans, speech-driven 3D gesture generation still depends heavily on perceptual evaluation, yet studies often vary avatar appearance and facial presentation when judging the generated motions. Prior work suggests these visual choices can bias motion judgments, but controlled evidence remains limited. We address this gap with controlled evaluations of co-speech gestures across motion sources, spanning seven representative avatar renderings used in contemporary research and application pipelines. Our results show that avatar and face presentation systematically shift perceptual judgments, and we provide recommendations for benchmarking gesture synthesis as well as for deploying virtual humans in human-facing applications.

[HC-8] Visual Fingerprints for LLM Generation Comparison IEEE-VIS2026

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在不同生成条件下的行为差异难以量化与比较的问题,尤其在提示(prompt)、系统指令、模型参数和架构等复杂交互下,输出可能呈现多种偏倚模式,而传统方法依赖个体样本或聚合指标难以捕捉这种分布层面的规律。其解决方案的关键在于将模型输出建模为一系列语言选择(包括内容、表达方式和结构),通过自然语言处理流水线提取这些选择,并基于重复采样构建其分布表示;进而将这些分布可视化为“视觉指纹”(visual fingerprints),从而实现对不同生成条件下模型行为倾向的直接、分布级比较,揭示出仅靠单个响应或平均指标无法观察到的一致性模式。

链接: https://arxiv.org/abs/2605.06054
作者: Amal Alnouri,Andreas Hinterreiter,Christina Humer,Furui Cheng,Marc Streit
机构: Johannes Kepler University Linz (约翰内斯·开普勒林茨大学); ETH Zürich (苏黎世联邦理工学院)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Submitted to the Short Paper track at IEEE VIS 2026

点击查看摘要

Abstract:Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which can bias outputs in various ways. Understanding how different generation conditions shape model behaviors is essential for tasks such as prompt design and model evaluation, yet it remains challenging due to the stochastic and open-ended nature of text generation. We present an approach to visually compare LLM outputs across generation conditions by modeling responses as collections of linguistic choices, including content, expression, and structure. We extract these choices using natural language processing pipelines and represent their distributions across repeated samples. We then visualize these distributions as visual fingerprints, enabling direct, distribution-level comparison of condition-specific tendencies. Through four usage scenarios, we demonstrate how visual fingerprints reveal consistent patterns in LLM behavior that are difficult to observe through individual responses or aggregate metrics.
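摘要的核心做法是把多次采样输出转化为"语言选择"的分布,再在分布层面比较不同生成条件。以下用一个极度简化的代理特征(回答首词)示意这一思路;特征选取与距离度量均为演示假设,远粗于论文的 NLP 流水线。

```python
# 示意:为每个生成条件构建"语言选择"分布(这里仅以回答首词为代理特征),
# 再用 L1 距离比较两个条件的"视觉指纹"差异。
from collections import Counter

def fingerprint(samples):
    """统计首词分布并归一化为概率,作为该条件指纹的一个维度。"""
    counts = Counter(
        s.split()[0].strip(".,!?").lower() for s in samples if s.split()
    )
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def l1_distance(fp_a, fp_b):
    keys = set(fp_a) | set(fp_b)
    return sum(abs(fp_a.get(k, 0.0) - fp_b.get(k, 0.0)) for k in keys)

cond_a = ["Sure, here is the answer.", "Sure, I can help.", "Yes, absolutely."]
cond_b = ["The answer is 42.", "The result follows.", "Yes, absolutely."]
d = l1_distance(fingerprint(cond_a), fingerprint(cond_b))
print(round(d, 3))
```

真实系统中每个条件会有内容、表达、结构等多个维度的分布,可视化为并排的指纹图做分布级比较。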

[HC-9] I see artifacts: ICA-based EEG artifact removal does not improve deep network decoding across three BCI tasks

【速读】:该论文旨在解决基于神经网络分类器的脑电图(EEG)数据解码中,独立成分(Independent Component, IC)噪声剔除方法对解码性能的影响问题。其关键解决方案是构建一个包含两种主流IC分解算法(Infomax与自适应混合独立成分分析,AMICA)和三种组件剔除策略(无剔除、ICLabel、多伪迹去除算法MARA)的处理流水线,并在三个不同任务的EEG数据集(运动想象、长期记忆形成、视觉记忆)上交叉验证三种常用EEG分类模型(两个卷积神经网络与一个基于长短期记忆网络的模型),系统评估IC噪声剔除对分类性能的提升效果。结果表明,IC-based噪声剔除方法并未显著优于未剔除的情况,尤其在考虑ICA计算资源消耗的前提下,其收益有限。

链接: https://arxiv.org/abs/2605.06018
作者: Taeho Kang,Yiyu Chen,Christian Wallraven
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Article already accepted in journal (Journal of Neural Engineering); uploading to public repository after accepted manuscript embargo (12 months) has been lifted in order to meet funder requirements for open access

点击查看摘要

Abstract:In this paper, we conduct a detailed investigation on the effect of independent component (IC)-based noise rejection methods in neural network classifier-based decoding of electroencephalography (EEG) data in different task datasets. We apply a pipeline matrix of two popular independent component (IC) decomposition methods (Infomax and Adaptive Mixture Independent Component Analysis (AMICA)) with three different component rejection strategies (none, ICLabel, and multiple artifact rejection algorithm [MARA]) on three different EEG datasets (motor imagery, long-term memory formation, and visual memory). We cross-validate processed data from each pipeline with three architectures commonly used for EEG classification (two convolutional neural networks and one long short-term memory-based model). We compare decoding performances on within-participant and within-dataset settings. Our results show that the benefit from using IC-based noise rejection for decoding analyses is at best minor, as component-rejected data did not show consistently better performance than data without rejections, especially given the significant computational resources required for independent component analysis (ICA) computations.
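论文的实验设计是一个"流水线矩阵":2 种 IC 分解 × 3 种剔除策略,再与 3 个数据集、3 个分类模型交叉验证。下面的小段 Python 只是枚举该组合空间的示意(名称取自摘要,组合方式为直积假设):

```python
# 示意:枚举论文的流水线矩阵与交叉验证设置的全部组合。
from itertools import product

decompositions = ["Infomax", "AMICA"]
rejections = ["none", "ICLabel", "MARA"]
datasets = ["motor_imagery", "long_term_memory", "visual_memory"]
models = ["CNN_1", "CNN_2", "LSTM"]

configs = list(product(decompositions, rejections, datasets, models))
print(len(configs))  # 2 * 3 * 3 * 3 = 54 种组合
```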

[HC-10] PersonaKit (PK): A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue

【速读】:该论文旨在解决当前全双工(full-duplex)对话系统在模拟多样化角色(如权威教师、不合作的商人或分心的员工)时,因默认采用“始终让步”(always-yield)的抢话处理策略而导致角色一致性受损的问题。解决方案的关键在于提出一个名为PersonaKit(PK)的开源、低延迟Web平台,通过直观的JSON配置定义角色特征和概率性打断行为(如让步、坚持、衔接或覆盖),并支持自动部署A/B对比测试,从而实现对复杂社会语言行为的快速原型设计与实证评估。

链接: https://arxiv.org/abs/2605.06007
作者: Hyunbae Jeon,Jinho D. Choi
机构: Emory University (埃默里大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As spoken dialogue systems expand beyond traditional assistant roles to encompass diverse personas – such as authoritative instructors, uncooperative merchants, or distracted workers – they require distinct, human-like turn-taking behaviors to maintain psychological immersion. However, current full-duplex systems often default to a rigid, overly accommodating "always-yield" policy during overlapping speech, which severely undermines character consistency for non-submissive roles. Evaluating alternative, persona-specific turn-taking strategies through empirical user studies is challenging because building real-time full-duplex test environments requires substantial engineering overhead. To address this, we present PersonaKit (PK), an open-source, low-latency web platform for the rapid prototyping and evaluation of conversational agents. Using intuitive JSON configurations, researchers can define personas, specify probabilistic interruption-handling behaviors (e.g., yield, hold, bridge, or override), and automatically deploy comparative A/B surveys. Through an in-the-wild evaluation with 8 distinct personas, we demonstrate that PersonaKit provides an extensible, end-to-end framework for studying complex sociolinguistic behaviors in next-generation spoken agents.
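摘要提到用 JSON 配置定义角色与概率性抢话处理行为(yield/hold/bridge/override)。下面是一个假设的配置草图(字段名为演示自拟,并非 PersonaKit 的官方 schema):

```python
# 示意:PersonaKit 风格的角色配置——定义角色属性与重叠语音时四种应对行为的概率。
import json

persona = {
    "name": "uncooperative_merchant",
    "voice": "male_low",
    "system_prompt": "你是一位态度强硬、不轻易让步的商人。",
    "interruption_policy": {   # 用户抢话时的概率性应对(假设字段)
        "yield": 0.1,          # 让步:停下来听
        "hold": 0.4,           # 坚持:继续说完
        "bridge": 0.2,         # 衔接:简短回应后继续
        "override": 0.3,       # 覆盖:压过对方继续说
    },
}

assert abs(sum(persona["interruption_policy"].values()) - 1.0) < 1e-9, "概率之和应为 1"
config_json = json.dumps(persona, ensure_ascii=False, indent=2)
print(config_json)
```

非顺从型角色只需调高 hold/override 的权重即可,无需改动平台代码。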

[HC-11] Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery

【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)在科学发现中应用时,人机交互范式尚未充分探索的问题,尤其是如何有效引导专家用户与AI系统协同开展复杂研究任务。其解决方案的关键在于识别并提出“意图构建”(intentmaking)这一新型工作流程,即用户通过与AI系统的持续互动,迭代地发现、定义和优化实验目标;该过程与“理解构建”(sensemaking)形成循环,共同构成科学研究中的核心认知机制,从而推动AI工具从传统的问答模式向协作式仪器演进,提升科学探索的效率与深度。

链接: https://arxiv.org/abs/2605.05921
作者: Alex Bäuerle,Adam Connors,Alexander Novikov,Adam Zsolt Wagner,Ngân Vũ,Fernanda Viegas,Martin Wattenberg,Lucas Dixon
机构: Google DeepMind(谷歌深度思维)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Artificial intelligence offers powerful new tools for scientific discovery, but the interaction paradigms required to effectively harness these systems remain underexplored. In this paper, we present findings from a formative user study with 11 expert mathematicians who used AlphaEvolve, an evolutionary coding agent, to tackle advanced problems in their fields of expertise. We identify and characterize a distinct workflow we term intentmaking, the iterative process of discovering, defining, and refining one’s experimental goals through active system interaction. We frame this as a natural extension to sensemaking, the cognitive process of building an understanding of complex or novel data. We suggest that users enter a cycle of intentmaking (defining and updating their experiment) and sensemaking (interpreting the results) which repeats many times during the course of an investigation. Our documentation of these themes suggests an approach to designing AI tools for scientific discovery that goes beyond the existing question/answer model of many current systems, treating them as collaborative instruments rather than opaque black-box assistants.

[HC-12] Can providing feedback on gaze and mental-effort synchrony improve pair programming performance?

【速读】:该论文旨在解决配对编程(pair programming)在计算机科学教育中协作效率不稳定的问题,其核心症结在于合作双方在协调注意力(coordination attention)和认知调节(cognitive regulation)方面的失效。解决方案的关键在于引入基于联合视觉注意(joint visual attention)与联合心理努力(joint mental effort)的AI支持反馈机制,并通过反馈时机的差异化设计实现协同性能优化:研究区分了反应式反馈(reactive feedback)与前瞻性反馈(proactive feedback),前者在检测到实时失调时干预,后者利用机器学习模型预测潜在调节失败并提前介入。结果表明,多模态反馈显著提升调试成功率和效率,其中结合联合视觉注意与心理努力的反应式反馈效果最佳,而前瞻性反馈进一步减少任务耗时、增强建设性反馈采纳率,并更有效地维持学习者自主性(learner agency),尤其对高绩效配对更为有效。这揭示了 gaze 和 mental effort 同步性作为可行动触发信号的价值,强调了反馈时机透明性和前瞻调控在支持高效配对编程中的重要性。

链接: https://arxiv.org/abs/2605.05836
作者: Anahita Golrang,Kshitij Sharma
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Pair programming is a widely used collaborative learning practice in computer science education, yet its effectiveness varies substantially due to breakdowns in coordinating attention and cognitive regulation between partners. This paper investigates whether AI-supported feedback grounded in joint visual attention and joint mental effort can improve collaborative programming performance, and how feedback timing shapes learner-AI interaction. Two experimental studies using dual eye tracking capture real-time indicators of collaborative regulation during debugging tasks. Study 1 examines reactive feedback that intervenes when observed joint visual attention or joint mental effort deviates beyond predefined thresholds, while Study 2 evaluates proactive feedback that forecasts future regulatory breakdowns using machine learning models and intervenes preemptively. Across both studies, feedback effectiveness is assessed through debugging success, time on task, and feedback uptake reflected in code changes. Multimodal feedback significantly improves collaborative performance compared to no-feedback conditions. Reactive feedback yields strong gains in debugging success and efficiency, particularly when joint visual attention and joint mental effort based feedback are combined. Proactive, forecast-based feedback further enhances performance, reduces time on task, and increases constructive feedback uptake while relying less on intrusive interventions. Proactive feedback better preserves learner agency by maintaining optimal collaboration states, particularly for high-performing pairs. These findings demonstrate that gaze and mental-effort synchrony can serve as reliable, actionable triggers for AI-supported collaborative learning, highlighting the importance of feedback timing, transparency, and anticipatory regulation in supporting effective pair programming.
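研究一中的反应式反馈在联合视觉注意偏离阈值时触发干预。以下为一个触发逻辑的示意草图(阈值、时间窗与"连续超阈才触发"的防抖策略均为演示假设):

```python
# 示意:反应式反馈触发——当两人注视点距离连续若干时间窗超过阈值时触发干预。

def reactive_trigger(gaze_dist, threshold=120.0, patience=3):
    """gaze_dist: 每个时间窗内两人注视点的像素距离;连续 patience 个窗超阈即触发。"""
    streak = 0
    for i, d in enumerate(gaze_dist):
        streak = streak + 1 if d > threshold else 0
        if streak >= patience:
            return i  # 返回触发反馈的时间窗下标
    return None      # 全程同步良好,不干预

dists = [40, 60, 150, 180, 200, 90, 50]
print(reactive_trigger(dists))  # 在下标 4 的时间窗触发
```

前瞻式反馈则可把同样的序列喂给一个预测模型,在 streak 形成之前提前介入。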

[HC-13] GazeMind: A Gaze-Guided LLM Agent for Personalized Cognitive Load Assessment

【速读】:该论文旨在解决智能眼镜中缺乏对用户内部认知状态(cognitive load)感知的问题,从而无法主动预测用户需求。现有方法要么依赖不适用于轻量级可穿戴设备的传感器,要么使用基于眼动追踪的模型,这些模型存在解释性差、需针对特定任务微调且跨个体泛化能力弱等局限。其解决方案的关键在于提出GazeMind框架——一个以眼动引导的大语言模型(LLM)代理系统,通过将眼动数据编码为结构化表示供LLM推理,实现可解释的认知负荷预测;并通过创新的任务引导推理机制实现无需LLM微调的跨场景泛化,同时结合用户特异性特征与历史参考实现个性化适应。

链接: https://arxiv.org/abs/2605.05790
作者: Bin Wang,Yue Liu,Benjamin Newman,Ajoy S. Fernandes,Zhiyuan Wang,Robert Cavin,Michele A. Cox,Vijay Rajanna,Takumi Bolte,Melissa Hunfalvay,Ulas Bagci,Michael J. Proulx
机构: Meta Reality Labs Research (Meta); Northwestern University; HarmonEyes
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Smart glasses with AI assistants are increasingly used in daily life. However, current systems lack awareness of the user’s internal cognitive state, leaving them unable to proactively anticipate users’ needs without access to cognitive load. Existing methods for assessing cognitive load either rely on impractical sensors for lightweight eyewear or utilize eye gaze-based models that suffer from poor interpretability, and require task-specific fine-tuning, often failing to generalize across individuals. We propose GazeMind, a gaze-guided LLM agent framework for cognitive load assessment on smart glasses. It encodes eye-tracking data into structured representations for LLM-based reasoning and provides interpretable cognitive load predictions. Importantly, GazeMind generalizes across scenarios without LLM fine-tuning through a novel task-guidance reasoning approach and achieves personalized adaptation by incorporating user-specific characteristics and historical references. To support evaluation, we introduce CogLoad-Bench, the largest gaze-based cognitive load dataset with 152 participants, 40+ hours of multimodal data, and 10K+ real-time annotations across controlled and real-world tasks. Experiments show that GazeMind achieves state-of-the-art performance, outperforming baselines by over 20% across all metrics.
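GazeMind 的关键步骤是把眼动数据编码为结构化表示供 LLM 推理。下面的草图示意将一段注视序列压缩成结构化文本特征并拼入提示词;特征选取与格式均为演示假设,并非论文的官方编码方案:

```python
# 示意:把注视(fixation)序列编码为结构化文本摘要,供 LLM 评估认知负荷。

def encode_gaze(fixations):
    """fixations: [(duration_ms, pupil_mm), ...] -> 结构化特征描述字符串。"""
    n = len(fixations)
    mean_dur = sum(d for d, _ in fixations) / n
    mean_pupil = sum(p for _, p in fixations) / n
    return (
        f"[GAZE] fixation_count={n}, "
        f"mean_fixation_ms={mean_dur:.0f}, "
        f"mean_pupil_mm={mean_pupil:.2f}"
    )

fixations = [(220, 3.4), (180, 3.6), (450, 4.1), (300, 3.9)]
summary = encode_gaze(fixations)
prompt = summary + "\n请结合以上注视特征,评估用户当前的认知负荷(低/中/高)并说明理由。"
print(summary)
```

个性化适应可在提示词中追加该用户的历史基线特征作为参照。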

[HC-14] Priming Path-dependence and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild

【速读】:该论文旨在解决实验室环境下难以捕捉真实世界中用户与大语言模型(Large Language Models, LLMs)交互行为的问题,特别是缺乏对用户先验经验、个体探索路径及其长期影响的系统性观测。其解决方案的关键在于利用来自全球7,955名匿名用户的14万次聊天会话日志进行大规模分析,揭示了用户在不同任务情境下形成的稳定交互模式,并发现早期交互轨迹显著预测后续使用行为(如文本重复性和留存率),从而提出“代理悖论”——尽管LLM输入空间开放且由用户驱动,实际用户探索程度反而较低。这一发现强调了在设计和研究中需关注用户行为塑造机制(molding procedure)的重要性。

链接: https://arxiv.org/abs/2605.05767
作者: Shengqi Zhu,Jeffrey M. Rzeszotarski,David Mimno
机构: Cornell University (康奈尔大学); Loyola University Maryland (洛约拉玛丽蒙特大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:User interactions with LLMs are shaped by prior experiences and individual exploration, but in-lab studies do not provide system designers with visibility into these in-the-wild factors. This work explores a new approach to studying real-world user-LLM interactions through large-scale chat logs from the wild. Through analysis of 140K chatbot sessions from 7,955 anonymized global users over time, we demonstrate key patterns in user expressions despite varied tasks: (1) LLM users are not tabula rasa, nor are they constantly adapting; rather, interaction patterns form and stabilize rapidly through individual early trajectories; (2) Longitudinal outcomes, such as recurring text patterns and retention rates, are strongly correlated with early exploration; (3) Parallel dynamics are present, including organizing expressions by task types such as emotional support, or in response to model-version updates. These results present an "agency paradox": despite LLM input spaces being unconstrained and user-driven, we in fact see less user exploration. We call for design consideration surrounding the molding procedure and its incorporation in future research.

[HC-15] Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM -RL Coupling

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的3D场景生成方法中,场景构建与用户交互被割裂处理的问题,从而限制了交互式多媒体系统的适应性和沉浸感。其解决方案的关键在于提出一个统一框架,通过闭环整合语言驱动的3D场景生成与沉浸式用户交互:首先利用LLMs构建结构化的场景表示,再结合强化学习在几何与语义约束下优化空间布局;随后将生成环境部署于虚拟现实(Virtual Reality, VR)中,实现人机交互(Human-Robot Interaction, HRI)反馈循环,使用户交互持续优化内容以贴合人类感知与可用性。这种生成与交互的紧密耦合显著提升了多媒体体验的响应性、适应性和真实性。

链接: https://arxiv.org/abs/2605.05711
作者: Anh H. Vo,Sungyo Lee,Phil-Joong Kim,Soo-Mi Choi,Yong-Guk Kim
机构: Sejong University (世宗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI-in-the-loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed-loop integration of generation and interaction for next-generation multimedia systems. Our project page can be found at this https URL.

[HC-16] PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

【速读】:该论文旨在解决生成式 AI (Generative AI) 安全评估中红队测试(red-teaming)策略受限于人类红队成员背景多样性不足,以及现有自动化红队方法未能有效融合人类视角与身份特征的问题。其解决方案的关键在于提出一种基于角色(persona)驱动的红队工作流——PersonaTeaming Workflow,通过将不同人格特征注入对抗性提示生成过程,显著扩展了攻击策略的覆盖范围,并在保持提示多样性的同时提升攻击成功率;进一步地,作者构建了面向用户的 PersonaTeaming Playground 界面,使红队人员可自定义角色并协同 AI 进行提示迭代优化,实验证明该设计能激发多样化的红队策略并促进创新思维,从而推动自动化与人机协同双路径下的红队技术发展。

链接: https://arxiv.org/abs/2605.05682
作者: Wesley Hanwen Deng,Mingxi Yan,Sunnie S. Y. Kim,Akshita Jha,Lauren Wilcox,Kenneth Holstein,Motahhare Eslami,Leon A. Gatys
机构: Carnegie Mellon University (卡内基梅隆大学); Apple (苹果公司)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Recent developments in AI safety research have called for red-teaming methods that effectively surface potential risks posed by generative AI models, with growing emphasis on how red-teamers’ backgrounds and perspectives shape their strategies and the risks they uncover. While automated red-teaming approaches promise to complement human red-teaming through larger-scale exploration, existing automated approaches do not account for human identities and rarely incorporate human inputs. In this work, we explore persona-driven red-teaming to advance both automated red-teaming and human-AI collaboration. We first develop PersonaTeaming Workflow, which incorporates personas into the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. Compared to RainbowPlus, a state-of-the-art automated red-teaming method, PersonaTeaming Workflow achieves higher attack success rates while maintaining prompt diversity. However, since automated personas only approximate real human perspectives, we further instantiate PersonaTeaming Workflow as PersonaTeaming Playground, a user-facing interface that enables red-teamers to author their own personas and collaborate with AI to mutate and refine prompts. In a user study with 11 industry practitioners, we found that PersonaTeaming Playground enabled diverse red-teaming strategies and outputs that practitioners perceived as useful, and that AI-generated suggestions in the PersonaTeaming Playground encouraged out-of-the-box thinking even when practitioners did not follow them strictly. Together, our work advances both automated and human-in-the-loop approaches to red-teaming, while shedding light on interaction patterns and design insights for supporting human-AI collaboration in generative AI red-teaming.

[HC-17] The Capacity to Care: Designing Social Technology for Sustained Engagement With Societal Challenges

【速读】:该论文试图解决的问题是:在社交媒体环境中,尽管公众对气候变化、不公正和人道主义危机等全球性问题普遍关注,但平台设计往往仅促进信息传播而缺乏引导用户采取持续、有意义行动的机制,导致用户产生心理负担、情绪耗竭与参与度下降,尤其影响年轻群体。解决方案的关键在于重构社会技术设计,使其支持“可持续关怀(sustainable care)”——即通过借鉴Tronto的关怀伦理框架,将关怀过程从单纯的认知觉醒扩展为包含责任(responsibility)、能力(competence)和共同体(community)的完整体系,并推动平台设计识别并减少对关怀能力的消耗,同时构建能够长期维持用户参与而不引发倦怠的交互机制。

链接: https://arxiv.org/abs/2605.05651
作者: JaeWon Kim,Lindsay Popowski,Louisa Conwill,Elizabeth "Lizzie" Li,Meryl Ye,Jiaying "Lizzy" Liu,Jose A. Guridi,Theia Henderson,Bingxu Han,Dennis Wang,Angel Hsing-Chi Hwang,Susan Wyche,Yasmine Kotturi,Gillian R. Hayes,Angela D. R. Smith
机构: University of Washington (华盛顿大学); Stanford University (斯坦福大学); University of Notre Dame (圣母大学); Northwestern University (西北大学); Carnegie Mellon University (卡内基梅隆大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); Cornell University (康奈尔大学); MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); National Yang Ming Chiao Tung University (国立阳明交通大学); University of Southern California (南加州大学); Michigan State University (密歇根州立大学); University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校); University of California, Irvine (加州大学欧文分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:People care about climate change, injustice, and humanitarian crises. The challenge is not apathy but capacity: sustained engagement with large-scale problems is psychologically costly, and social media architecture often amplifies awareness while providing few pathways to meaningful action. The result is rising distress, overwhelm, and disengagement – particularly among young people who encounter global suffering through platforms designed for attention capture rather than constructive response. This workshop examines how social technology design shapes the conditions for sustained engagement with societal challenges. Drawing on Tronto's care ethics framework and research in moral psychology and platform studies, we ask why caring at scale is difficult and how social media can both exacerbate and potentially mitigate this difficulty. Tronto's framework shows that good care requires more than awareness: it demands responsibility, competence, and community. Dominant social media architectures stall the caring process at its earliest phase. We invite researchers and designers to identify platform designs that deplete or support the capacity to care, and to develop design directions for "sustainable care": engagement that people can maintain over time without burning out.

[HC-18] The Missing Evaluation Axis: What 10000 Student Submissions Reveal About AI Tutor Effectiveness

【速读】:该论文试图解决当前生成式 AI (Generative AI) 教学系统(AI tutors)评价体系过于依赖教学法质量(pedagogical quality)而忽视学生实际行为反应的问题。现有评估方法无法揭示学生是否真正采纳了反馈以及如何应用这些反馈,从而导致对AI tutor效能的片面认知。解决方案的关键在于引入一个基于学生交互数据的行为维度评价框架,通过分析10,235份代码提交及其对应的AI反馈,量化学生是否采纳反馈及采纳是否正确,从而补足传统教学法评估的不足。实证结果显示,该行为维度不仅识别出不同AI tutor在学生参与模式上的显著差异,且与学生对反馈有用性的感知更强相关,为AI tutor性能提供了更全面、可操作的评估视角。

链接: https://arxiv.org/abs/2605.05648
作者: Rose Niousha,Samantha Boatright Smith,Bita Akram,Peter Brusilovsky,Arto Hellas,Juho Leinonen,John DeNero,Narges Norouzi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to the 27th International Conference on Artificial Intelligence in Education (AIED 2026), Main Conference Track

点击查看摘要

Abstract:Current Artificial Intelligence (AI)-based tutoring systems (AI tutors) are primarily evaluated based on the pedagogical quality of their feedback messages. While important, pedagogy alone is insufficient because it ignores a critical question: what do students actually do with the feedback they receive? We argue that AI tutor evaluation should be extended with a behavioral dimension grounded in student interaction data, which complements pedagogical assessment. We propose an evaluation framework and apply it to 10,235 code submissions with corresponding AI tutor feedback from an introductory undergraduate programming course to measure whether students act on tutor feedback and whether those actions are applied correctly. Using this framework to compare two deployed AI tutors across different semesters in a large-scale introductory computer science course reveals substantial differences in student engagement patterns that are not captured by pedagogy-only evaluation. Moreover, these engagement-based behavioral signals are more strongly associated with student perception of helpful feedback than pedagogical quality alone, providing a more complete and actionable picture of AI tutor performance.
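The behavioral dimension the abstract describes — whether a student acted on tutor feedback, and whether the action was applied correctly — can be made concrete with a small classifier over (submission-before, submission-after) pairs. The sketch below is an assumption-laden simplification, not the paper's framework: it treats any edit as "acting on feedback" and uses test outcomes as a correctness proxy.

```python
def classify_response(before, after, tests_now_pass):
    """Label one resubmission along the behavioral axis.

    Simplified proxies (not the paper's definitions): any textual edit
    counts as acting on feedback; the action is 'correct' only if the
    previously failing tests now pass."""
    if after == before:
        return "ignored"
    return "acted_correctly" if tests_now_pass else "acted_incorrectly"

def engagement_rate(triples):
    """Fraction of feedback messages the student acted on at all.

    `triples` is a list of (before, after, tests_now_pass) tuples."""
    acted = sum(1 for before, after, _ in triples if after != before)
    return acted / len(triples)
```

Aggregating these labels per AI tutor would yield the kind of engagement comparison the study reports, complementing rubric-based pedagogical scores.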

[HC-19] UX in the Age of AI: Rethinking Evaluation Metrics Through a Statistical Lens

【速读】:该论文旨在解决传统用户体验(User Experience, UX)评估框架在面对生成式 AI(Generative AI)等非确定性系统时的结构性不足问题。经典指标如系统可用性量表(System Usability Scale, SUS)、净推荐值(Net Promoter Score, NPS)和任务完成率,假设输入与输出具有确定性和一致性,但在AI中介系统(如对话代理、生成界面和推荐引擎)中,输出具有随机性、上下文敏感性和时间变异性,导致这些指标失效。解决方案的关键在于提出自适应动态 UX 统计框架(Adaptive Dynamic UX Statistical Framework, ADUX-Stat),其核心创新包括:(1) 交互熵指数(Interaction Entropy Index, IEI),量化用户感知层面的 AI 响应不可预测性;(2) 时间漂移系数(Temporal Drift Coefficient, TDC),测量多轮交互中感知可用性的纵向变化趋势;(3) 贝叶斯可用性置信度评分(Bayesian Usability Confidence Score, BUCS),在不确定性下提供可用性质量的可信区间估计。该框架将可用性重构为概率信号分布而非静态标量,实现了对 AI 系统 UX 的动态、可复现且适用于实际部署的评估方法。

链接: https://arxiv.org/abs/2605.05600
作者: Harish Vijayakumar
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The rapid proliferation of artificial intelligence (AI) in consumer-facing digital products has disrupted the assumptions underlying classical user experience (UX) evaluation frameworks. Legacy metrics such as the System Usability Scale (SUS), Net Promoter Score (NPS), and task completion rate were engineered for deterministic, rule-based interfaces where identical inputs yield identical outputs. In AI-mediated systems – spanning conversational agents, generative interfaces, and recommendation engines – outputs are stochastic, context-sensitive, and temporally variable, rendering these metrics structurally insufficient. This paper introduces the Adaptive Dynamic UX Statistical Framework (ADUX-Stat), a novel evaluation model that reconceptualises usability as a probabilistic signal distribution rather than a static scalar score. ADUX-Stat integrates three original constructs: (1) Interaction Entropy Index (IEI), quantifying the unpredictability of AI responses from a user perception standpoint; (2) Temporal Drift Coefficient (TDC), measuring longitudinal degradation or improvement of perceived usability over interaction sessions; and (3) Bayesian Usability Confidence Score (BUCS), producing credible interval estimates of usability quality under uncertainty. The framework is validated conceptually against five established AI product categories. ADUX-Stat addresses a critical gap at the intersection of HCI research, statistical modelling, and AI product evaluation, offering a reproducible, field-deployable methodology for UX practitioners and researchers alike.
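The abstract names ADUX-Stat's three constructs but not their formulas, so the sketch below uses plausible stand-ins: Shannon entropy over categorical response labels for IEI, a least-squares slope over session scores for TDC, and a Beta(1, 1)-posterior interval (normal approximation) for BUCS. All three operationalizations are assumptions, not the paper's definitions.

```python
import math
from statistics import mean

def interaction_entropy_index(response_labels):
    """Shannon entropy (bits) over categorical response labels —
    a stand-in for IEI's 'unpredictability of AI responses'."""
    n = len(response_labels)
    counts = {}
    for lbl in response_labels:
        counts[lbl] = counts.get(lbl, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def temporal_drift_coefficient(session_scores):
    """Least-squares slope of per-session usability scores over time;
    negative values suggest longitudinal degradation."""
    n = len(session_scores)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(session_scores)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, session_scores))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

def bayesian_usability_confidence(successes, failures, z=1.96):
    """Approximate credible interval for a task-success rate under a
    Beta(1, 1) prior, using a normal approximation to the posterior."""
    a, b = 1 + successes, 1 + failures
    p = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return (max(0.0, p - z * sd), min(1.0, p + z * sd))
```

The key design point survives the simplification: each metric returns a distributional quantity (entropy, trend, interval) rather than the single scalar that SUS or NPS would produce.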

[HC-20] Prober.ai: Gated Inquiry-Based Feedback via LLM-Constrained Personas for Argumentative Writing Development WWW

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在教育场景中引发的认知债务问题——即学生过度依赖AI生成内容,导致批判性思维和论证能力退化。解决方案的关键在于重构AI辅助写作的交互范式:通过设计一个基于Toulmin论证理论的两阶段系统架构(Challenge与Unlock),利用特定角色提示(persona-specific system prompts)和结构化JSON输出模板约束LLM仅生成聚焦于论证弱点的探究性问题,并强制学生在获得改进建议前进行反思,从而引入教学摩擦机制以保留认知参与度,实现可扩展且认知友好的AI融入写作教学。

链接: https://arxiv.org/abs/2605.05598
作者: Ran Bi,Shiyao Wei,Yuanyiyi Zhou
机构: Florida State University (佛罗里达州立大学); New York University (纽约大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Prototype awarded second place at the NYEdTech Hackathon (March 2026) this https URL

点击查看摘要

Abstract:The proliferation of large language models (LLMs) in educational settings has paradoxically undermined the cognitive processes they purport to support. Students increasingly outsource critical thinking to AI assistants that generate polished text on demand, resulting in measurable cognitive debt and diminished argumentative reasoning skills. We present Prober.ai, a web-based writing environment that inverts the conventional AI-tutoring paradigm: rather than generating or rewriting student text, the system constrains an LLM (Gemini 3 Flash Preview) through persona-specific system prompts and structured JSON output schemas to produce only targeted, inquiry-based questions about argumentative weaknesses. A two-phase interaction architecture – Challenge and Unlock – implements a pedagogical friction mechanism whereby revision suggestions are gated behind mandatory student reflection. The system's design is grounded in Toulmin's argumentation theory, research on peer feedforward questioning mechanisms, and evidence on AI-supported feedback in writing instruction. A functional prototype was developed in 36 hours during the NY EdTech Hackathon (March 2026), where it was awarded second place. We describe the system architecture, the prompt engineering methodology for constraining LLM output to pedagogically aligned JSON schemas, and discuss implications for scalable, cognition-preserving AI integration in writing education.
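The two mechanisms the abstract describes — a JSON schema that admits only inquiry questions, and a Challenge/Unlock gate on revision suggestions — can be sketched as follows. The schema shape, the Toulmin component names, and the reflection-length threshold are all illustrative assumptions, not the system's actual implementation.

```python
import json

# Hypothetical output contract: a list of question objects tied to
# Toulmin argument components, never rewritten student text.
ALLOWED_COMPONENTS = {"claim", "grounds", "warrant", "backing", "rebuttal"}

def validate_output(raw_json):
    """Accept model output only if every item is an inquiry question
    about an allowed argument component."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        return False
    for item in data:
        if set(item) != {"component", "question"}:
            return False
        if item["component"] not in ALLOWED_COMPONENTS:
            return False
        if not item["question"].rstrip().endswith("?"):
            return False
    return True

class ReflectionGate:
    """Challenge/Unlock friction: revision suggestions stay locked
    until the student submits a substantive reflection."""
    def __init__(self, min_reflection_words=20):
        self.min_words = min_reflection_words
        self.unlocked = False

    def submit_reflection(self, text):
        if len(text.split()) >= self.min_words:
            self.unlocked = True
        return self.unlocked
```

Rejecting any output that is not question-shaped is what keeps the LLM from sliding back into text generation, regardless of prompt drift.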

[HC-21] Designing with Tensions: Older Adults' Emotional Support-Seeking Under System-Level Constraints in Conversational AI

【速读】:该论文旨在解决老年人在日常使用对话式人工智能(Conversational AI)进行情感支持时,因系统安全干预导致情感互动中断或控制权转移所引发的情绪困扰问题。其解决方案的关键在于设计能够嵌入用户社会情境、契合用户情感节奏并维持用户自主性的安全干预机制,从而保障老年人在情感脆弱时刻仍能持续、稳定地与AI保持情绪连接。

链接: https://arxiv.org/abs/2605.05552
作者: Mengqi Shi,Tianqi Song,Zicheng Zhu,Yi-Chieh Lee
机构: University of Washington (华盛顿大学); National University of Singapore (新加坡国立大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Older adults have increasingly turned to conversational AI as a source of emotional support. However, little is known about how emotionally supportive interactions are experienced in everyday use, particularly when AI systems limit, redirect, or intervene during these interactions. We interviewed 18 older adults about their experiences using conversational AI for emotional support, examining when they turn to AI, how they engage during emotionally vulnerable moments, and how they respond when support feels disrupted. Our findings show that older adults often rely on AI when other forms of social support feel inaccessible. However, current safety-related interventions can redirect interactions in ways that participants experience as interruptions to emotional engagement or as shifts in control away from them. Such disruptions can undermine older adults’ ability to remain emotionally engaged and, in some cases, contribute to emotional distress. We discussed design implications for emotionally supportive conversational AI, emphasizing the need for safety interventions that are enacted within older adults’ social contexts, align with users’ emotional pacing, and preserve their sense of agency.

[HC-22] Cross-individual generalizability of machine learning models for ball speed prediction in baseball pitching

【速读】:该论文旨在解决机器学习(Machine Learning, ML)模型在体育场景中跨个体泛化能力不足的问题,即模型在特定个体上表现良好,但在其他个体上性能显著下降。解决方案的关键在于通过留一被试交叉验证(leave-one-subject-out cross-validation)方法系统评估ML模型在不同个体间的泛化性能,并进一步分析专家水平差异和时空运动信息限制对模型泛化能力的影响。研究发现,模型在跨个体预测时性能明显下降(R²从0.91降至0.38),且存在群体偏差(如对中级运动员高估),但躯干与支撑腿的运动特征表现出较好的泛化能力,尤其在重心转移起始阶段仍保持一定预测效能(R² > 0.25)。这一方法框架为提升ML模型在实际体育应用中的普适性和可靠性提供了实证依据。

链接: https://arxiv.org/abs/2605.05487
作者: Ryota Takamido,Chiharu Suzuki,Hiroki Nakamoto
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Although machine learning (ML)-based performance outcome prediction is an important topic in contemporary sports science, one important issue is the limited understanding of the cross-individual generalizability of ML models in sports contexts. To address this issue, this study aimed to evaluate the cross-individual generalizability of ML models for predicting ball speed in baseball pitching. A dataset comprising 50 pitchers from various competitive levels was analyzed. Cross-individual generalizability was assessed using leave-one-subject-out cross-validation. Specifically, the effects of expertise level and restrictions on spatiotemporal motion information were examined to identify factors influencing model generalizability. The results revealed that, under cross-individual evaluation, (1) predictive performance was markedly lower than under within-individual evaluation, with R-squared value decreasing from 0.91 to 0.38; (2) the model tended to overestimate the performance of Intermediate pitchers relative to Expert pitchers, with a significant group difference in signed prediction error (p < .05); and (3) the trunk and pivot leg demonstrated relatively high generalization performance, with the pivot leg showing notable generalizability even during the weight-shift initiation phase (R-squared value > 0.25). These findings underscore the importance of cross-individual evaluation in enhancing the practical applicability of ML in sports settings and contribute to a deeper understanding of the biomechanical factors underlying the target movement.
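Leave-one-subject-out cross-validation, the evaluation protocol used here, amounts to grouping samples by subject and holding out one entire subject per fold. A minimal dependency-free sketch (scikit-learn's `LeaveOneGroupOut` does the same with arrays):

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train_idx, test_idx) splits where each
    test fold contains all samples from exactly one subject — the
    cross-individual evaluation used in this study.

    `samples` is a list of (subject_id, features, target) tuples."""
    subjects = sorted({s for s, _, _ in samples})
    for held_out in subjects:
        train = [i for i, (s, _, _) in enumerate(samples) if s != held_out]
        test = [i for i, (s, _, _) in enumerate(samples) if s == held_out]
        yield held_out, train, test
```

Because no sample from the test pitcher ever appears in training, the R-squared measured this way reflects generalization to a new individual, which is why it drops so sharply (0.91 to 0.38) relative to within-individual evaluation.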

[HC-23] The Ambivalent Experience of Eye Contact for People with Visual Impairments: Mechanisms and Design Challenges

【速读】:该论文旨在解决视觉障碍者在与视力正常者协作时,因无法依赖眼接触(eye contact)这一默认注意力分配和轮流发言线索而导致的交互困难问题。其解决方案的关键在于从机制层面揭示了三类核心因果机制:一是当视线无法分配发言权时,需依赖明确的命名来实现可及性;二是模糊的言语介入信号与持续的注意力分配工作会分散认知资源并引发疲劳;三是眼接触规范可能扭曲对参与度的判断,促使个体主动管理自身可见性。基于此,作者提出将“可访问的眼接触”重构为支持可配置交互契约的设计理念,而非简单地使 gaze 可见,从而转化为五个具体的设计挑战。

链接: https://arxiv.org/abs/2605.05437
作者: Markus Wieland,Phillip Koch,Michael Sedlmair
机构: VISUS, University of Stuttgart (可视化与科学计算研究所,斯图加特大学); University of Stuttgart (斯图加特大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In mixed-ability collaboration, eye contact is often treated as a default cue for attention and turn-taking. As these signals are primarily visual, they are not reliably accessible to people with visual impairments. While prior work emphasized technical solutions, mechanism-level explanations of their experiences with sighted partners remain scarce. We interviewed 17 people with visual impairments about everyday interactions across work, education, and social settings. Using a critical-realist lens, we link events to plausible causal mechanisms and identify three recurring mechanisms: First, when gaze cannot allocate the floor, addressability hinges on explicit naming. Second, unclear speech entry cues and ongoing access work split attention and build fatigue, sometimes leading to withdrawal. Third, eye-contact norms can skew judgments of participation, prompting active management of visibility. We translate these mechanisms into five design challenges that reframe accessible eye contact as supporting configurable interaction contracts rather than merely making gaze visible.

[HC-24] LaTA: A Drop-in FERPA-Compliant Local-LLM Autograder for Upper-Division STEM Coursework

【速读】:该论文旨在解决高等教育中STEM课程作业批改效率低下与数据隐私风险并存的问题。传统上,基于大语言模型(Large-Language-Model, LLM)的评分工具依赖第三方API,不仅违反《家庭教育权利和隐私法案》(FERPA),还带来显著的数据安全风险,且常需对作业格式进行大幅调整以适配外部系统。其解决方案的关键在于提出一个名为LaTA(LaTeX Teaching Assistant)的本地化、开源自动评分框架,该框架完全运行在机构内部的通用硬件上,兼容已广泛采用LaTeX写作流程的工程与物理类课程。LaTA采用四阶段流水线(摄入、分割、评分、报告),使用本地部署的开放权重链式思维LLM(gpt-oss:120b)比对学生的LaTeX作业与教师编写的参考答案,并结合YAML定义的二值评分量规(rubric)实现精准评分,从而在保障数据主权的同时,实现近乎零边际成本、每份作业仅需1–3分钟的高效批改,且经实证验证其评分误差率低至0.02%–0.04%每项指标,显著提升学生学习成效与自信心。

链接: https://arxiv.org/abs/2605.05410
作者: Jesse A. Rodríguez
机构: Oregon State University (俄勒冈州立大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Physics Education (physics.ed-ph)
备注: Submitted to Computers & Education

点击查看摘要

Abstract:Large-language-model (LLM) graders promise to relieve the grading burden of upper-division STEM courses, but most deployments to date send student work to third-party APIs, violating FERPA and exposing institutions to data risk while requiring substantial assignment modification. We present LaTA (LaTeX Teaching Assistant), a drop-in, open-source autograder that runs entirely on commodity on-premises hardware and assumes a LaTeX-native workflow already adopted by many engineering and physics courses. LaTA implements a four-stage pipeline (ingest, segment, grade, report) using a locally hosted open-weight chain-of-thought LLM grader (gpt-oss:120b) that compares student work to an instructor-authored reference solution and applies a YAML rubric with binary per-item scoring. We deployed LaTA in Winter 2026 in ME 373 (Mechanical Engineering Methods) at Oregon State University, grading every weekly assignment for approximately 200 students on a single Mac Studio at $0 marginal cost per assignment and 1–3 minutes of wall-clock time per submission, enabling regrading of corrected assignments and greatly expanded TA office hour offerings. The instructor-confirmed grading-error rate held at roughly 0.02–0.04% per rubric line item across the term. Relative to the same instructor's previous traditionally-graded cohort, the LaTA-graded cohort outperformed by approximately 11% on the midterm exam and 8% on the final exam, and reported large gains in self-assessed confidence on every stated learning objective (N = 159 survey responses, Δ ≥ +1.49 Likert points, p < 10^-27 on every comparison). We release the code under AGPLv3.
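The "YAML rubric with binary per-item scoring" reduces to a simple aggregation once the LLM grader has emitted a yes/no verdict per line item. The sketch below uses a plain dict in the shape a YAML rubric might parse to; the field names and point values are illustrative, not LaTA's actual schema.

```python
# Hypothetical rubric, in the shape a YAML file might parse to:
# each line item is scored all-or-nothing (binary) by the LLM grader.
rubric = {
    "problem": "3.2",
    "items": [
        {"id": "setup",  "points": 2, "desc": "Free-body diagram correct"},
        {"id": "ode",    "points": 3, "desc": "Governing ODE matches reference"},
        {"id": "answer", "points": 1, "desc": "Final answer within tolerance"},
    ],
}

def score_submission(rubric, grader_verdicts):
    """Apply binary per-item scoring: full points if the grader marked
    the item satisfied, zero otherwise. Returns (earned, total)."""
    earned = sum(item["points"]
                 for item in rubric["items"]
                 if grader_verdicts.get(item["id"], False))
    total = sum(item["points"] for item in rubric["items"])
    return earned, total
```

Binary items keep the grader's job to a sequence of verifiable yes/no judgments, which is also what makes the per-line-item error rate reported above straightforward to audit.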

[HC-25] Why Someone Asked “Why”: Foil Inference in Human and LLM Question Interpretation

【速读】:该论文试图解决的问题是:在解释性问答中,人们如何选择一个隐含的对比情境(即“foils”,指未发生但可能发生的替代事件)来理解“为什么发生了E而不是E’”这一类问题。传统观点认为,人们通过比较当前事件与潜在替代方案之间的差异来形成解释,但这些对比往往未被明确表达,需从语境中推断。研究的关键发现在于,人们选择意图中的foils主要依据其“事后预期”(hindsight expectation),即在结果发生后,判断什么情况本可以替代当前结果——这表明人类对foils的推理机制依赖于对提问者认知突兀性的理解。相比之下,大型语言模型(LLMs)虽能做出显式的预期判断,但其预测foils的能力与其预期判断之间缺乏一致关联,提示当前生成式AI在处理对比性解释时仍存在显著局限。

链接: https://arxiv.org/abs/2605.05401
作者: Britt Besch,Tobias Gerstenberg
机构: Stanford University (斯坦福大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at Proceedings of the 48th Annual Conference of the Cognitive Science Society (CogSci 2026)

点击查看摘要

Abstract:Explanations are inherently contrastive: E happened rather than E’ because of C rather than C’. However, these contrasts, or “foils”, are rarely mentioned explicitly but have to be inferred in context. Here, we investigate how people select the intended foil E’ of a why-question. Participants read vignettes and judged, for each foil, their prior expectation (what will happen next), closeness (what is most similar to what happened), and hindsight expectation (what could have happened instead), as well as which foil they thought the question asker had in mind when they asked the why-question. We found that foil selections were best predicted by hindsight expectation judgments. This suggests that people infer the foil by considering what a question asker finds surprising after the outcome occurred. Since correct foil selection is relevant not only in human-human interaction but also increasingly in dialogues with large language models, we investigated their performance on the same task. The coupling between LLMs’ explicit expectation judgments and their foil selections is inconsistent.

[HC-26] Mise en Place for Agentic Coding: Deliberate Preparation as Context Engineering Methodology

【速读】:该论文试图解决当前AI编码代理(AI coding agents)在实践中因缺乏充分上下文而引发的系统性对齐问题(systematic alignment problem),即代理生成的代码往往需要大量调试与重构,导致开发效率低下。其解决方案的关键在于引入一种受烹饪术语“mise en place”(MEP)启发的三阶段准备方法论:(1) 上下文锚定(contextual grounding),将领域知识和隐性知识结构化为文档;(2) 协作规范(collaborative specification),通过人机对话生成详细设计产物;(3) 任务分解(task decomposition),将规范转化为具有依赖关系的结构化任务记录。该方法显著提升了AI代理的执行效率与准确性,验证了“上下文流利度”(context fluency)作为新兴开发者技能的重要性,并为未来AI辅助软件开发中的准备阶段提供可实证研究的方向。

链接: https://arxiv.org/abs/2605.05400
作者: Andrew Zigler
机构: LinearB
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 5 pages. Accepted at VibeX 2026, the 1st International Workshop on Vibe Coding and Vibe Researching, co-located with EASE 2026, Glasgow, June 9-12 2026. Camera-ready version. Research artifact: this https URL

点击查看摘要

Abstract:The rapid adoption of AI coding agents has produced a dominant workflow pattern – often called “vibe coding” – that prioritizes speed of implementation over deliberate preparation. We argue that this approach creates a systematic alignment problem: agents that lack sufficient context produce code requiring extensive debugging and refactoring, consuming substantial development time. Drawing on the culinary concept of mise en place (everything in its place; abbreviated MEP), we propose a three-phase preparation methodology for agentic coding: (1) contextual grounding, where domain expertise and tacit knowledge are externalized into structured documents; (2) collaborative specification, where human-agent dialogue produces detailed design artifacts; and (3) task decomposition, where specifications are converted into structured, dependency-aware task records. We report on the application of MEP during a competitive hackathon, where roughly two hours of preparation enabled a rapid parallel implementation of a full-stack educational platform by concurrent AI agents. We introduce the concept of context fluency as an emerging developer skill – the ability to create rich, structured context that agents can act on – and connect it to established frameworks in backward design and tacit knowledge externalization. We conclude with a research agenda for empirically validating preparation-phase methodologies in AI-assisted software development.

[HC-27] Every(bot) Makes Mistakes: Coding Big Five Personalities Context and Tone into an LLM Chatbot Recovery Code Framework

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在人机交互中出现错误时,因缺乏系统性恢复机制而导致的交互中断、用户信任下降和参与度降低的问题。现有研究虽分别探讨了LLM的恢复策略、语气(tone)、上下文(context)与人格特质(personality)等维度,但未将其整合为结构化的指导框架。解决方案的关键在于提出一种“恢复编码”(recovery code),该编码将四种常见LLM聊天任务场景映射至对应的五大人格特质(Big Five personality traits:尽责性、宜人性、开放性、外向性)、语气及三阶段恢复指令,并通过训练使LLM模型学习并应用该编码。实验表明,采用该编码的模型在恢复质量、语气一致性与适切性三个维度上显著优于基线模型(平均提升27.8%),尤其在情境适配性和解释提供能力方面表现突出,验证了结构化人格-语境-语气协同引导对提升LLM错误恢复效能的有效性。

链接: https://arxiv.org/abs/2605.05391
作者: Rachel Hill,Tom Owen,Julian Hough
机构: Swansea University (斯旺西大学)
类目: Human-Computer Interaction (cs.HC)
备注: 14 pages of main content, 3 figures, 4 tables, 9 appendices. This paper has been submitted to the Becker Friedman Institute 2026 AI in Social Sciences conference for peer review

点击查看摘要

Abstract:Despite careful design involving classifiers, parameters, and safeguarding, errors during human/AI interaction are not rare. Poor error recovery can disrupt interaction flow, damage user trust, and decrease user engagement. Whilst existing work has explored LLM recovery, tone, context, and personality as separate design dimensions, no existing work has combined these variables into a structured guidance framework. This paper presents a recovery code that maps four common LLM chatbot task contexts to associated personality traits (four Big Five personalities: Conscientiousness, Agreeableness, Openness, and Extraversion), tones, and three-stage recovery instructions. A recovery evaluation rubric was also designed, comprising three dimensions (Recovery quality, Tone alignment, and Appropriateness) and nine sub-dimensions. The methodology is exploratory, with no participants used. A between-subjects design was employed across two conditions: Condition A (baseline, uncoded), four separate Claude Sonnet 4.6 agents received no recovery code training; Condition B (coded), four separate Claude Sonnet 4.6 models were trained on the recovery code. Identical ‘user’ prompts and error scenarios were used across both conditions. Eight LLM evaluator agents assessed the recovery responses using the evaluation rubric, producing scores out of 5 for each sub-dimension. Results found a 27.8% average performance increase in coded recovery responses (76.7%) compared to baseline responses (48.9%). Condition B performed strongest in the appropriateness dimension (83.3%), with notable improvement in personality appropriateness (75% versus 50%) and providing explanation (60% versus 20%). These findings suggest that structured personality, context, and tone-informed recovery codes can be successfully learnt and applied by LLM chatbots to improve error recovery quality across varying contextual tasks.
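Structurally, the recovery code is a lookup from task context to a (trait, tone, three-stage instruction) triple that is then rendered into a system prompt. The sketch below shows that shape only; the actual context-to-trait pairings and stage wordings come from the paper's framework and these entries are placeholders.

```python
# Illustrative recovery code: context -> Big Five trait, tone, and a
# three-stage recovery instruction. Pairings here are placeholders.
RECOVERY_CODE = {
    "technical_support": {
        "trait": "Conscientiousness", "tone": "precise",
        "stages": ["acknowledge the error", "explain the cause",
                   "offer a corrected answer"]},
    "emotional_support": {
        "trait": "Agreeableness", "tone": "warm",
        "stages": ["acknowledge the error", "validate the user's feelings",
                   "gently redirect the conversation"]},
    "creative_brainstorm": {
        "trait": "Openness", "tone": "playful",
        "stages": ["acknowledge the error", "reframe it as a new angle",
                   "invite the user to steer"]},
    "casual_chat": {
        "trait": "Extraversion", "tone": "upbeat",
        "stages": ["acknowledge the error", "make a light repair",
                   "re-engage with a question"]},
}

def build_recovery_prompt(context):
    """Render the three-stage recovery instruction for a task context."""
    code = RECOVERY_CODE[context]
    steps = "; ".join(f"{i + 1}) {s}" for i, s in enumerate(code["stages"]))
    return (f"Respond with a {code['tone']} tone "
            f"(high {code['trait']}): {steps}.")
```

Training the "coded" condition amounts to supplying these rendered instructions as part of the agent's system prompt, which is what the baseline condition lacked.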

[HC-28] Making AI Drafts Count: A Quality Threshold in Audio Description Workflows

【速读】:该论文旨在解决盲人及低视力群体在视频内容获取中因缺乏视觉信息描述而面临的障碍,具体聚焦于如何通过人工智能(AI)辅助生成音频描述(Audio Description, AD),提升描述质量并降低人工编写的门槛。其核心解决方案在于设计了一个名为GenAD的生成管道和RefineAD的编辑界面,其中GenAD结合可访问性指南与视频上下文信息以生成高质量初稿,而RefineAD则支持人类对AI初稿进行高效修订。研究发现,AI初稿的质量直接影响编辑效率与认知负荷:只有当AI输出达到一定质量阈值时,才能显著缩短完成时间并减轻负担;这一阈值随视觉复杂度升高而提高,因此提出“有效AI辅助应针对目标内容设定合适质量阈值”的设计原则,而非仅提供存在即可的辅助。

链接: https://arxiv.org/abs/2605.05348
作者: Lana Do,Shasta Ihorn,Charity M. Pitcher-Cooper,Sanjay Mirani,Gio Jung,Hyunjoo Shim,Zhenzhen Qin,Kien T. Nguyen,Vassilis Athitsos,Ilmi Yoon
机构: Northeastern University (东北大学); San Francisco State University (旧金山州立大学); The Smith-Kettlewell Eye Research Institute (史密斯-凯特尔韦尔眼科研究所); University of Texas at Arlington (德克萨斯大学阿灵顿分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Audio description (AD) narrates visual elements in video for blind and low-vision audiences. Recent work has shown that giving novice describers an AI-generated draft to start from helps produce higher-quality AD and lowers the barrier to entry. What remains an open question is how draft quality shapes the editing process. We investigate this through GenAD, an AD generation pipeline that incorporates accessibility guidelines and contextual video information, and RefineAD, an editing interface for human revisions. Human-AI contributions are measured across text, timing, and delivery. In a within-subjects study, we compared authoring from scratch against editing AI drafts of varying quality. GenAD drafts cut completion time by more than half and significantly reduced cognitive load. In contrast, baseline drafts generated from simple, unguided prompts offered only modest benefits, pointing to a minimum quality threshold for effectiveness. Qualitative findings suggest this threshold is content-dependent; as visual complexity increases, so does the quality needed from AI drafts. We propose this as a design principle: effective AI assistance should clear a quality threshold suited to the target content, rather than simply be present.

[HC-29] MPNet: A Robust and Efficient Manifold Pooling Network for Multi-Rhythm EEG Signal Decoding

【速读】:该论文旨在解决深度黎曼网络(Deep Riemannian networks)在脑电图(EEG)解码中因高维黎曼输入和复杂时频动态建模而导致的计算成本过高及实际应用受限的问题。其解决方案的关键在于提出一种新颖的流形池化网络(Manifold Pooling Network, MPNet),该网络通过节奏自适应卷积前端提取多视角的时频特征并生成多视图黎曼节点,进而引入一种创新的流形节点池化层(manifold node pooling layer),将多个黎曼节点聚合为固定尺寸的融合节点,从而显著降低后续深度黎曼网络的计算负担,同时保持高精度与小样本下的鲁棒性。

链接: https://arxiv.org/abs/2605.05212
作者: Guoqing Cai,Kai Zeng,Shoulin Huang,Ting Ma
机构: 未知
类目: ignal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep Riemannian networks provide a powerful framework for Electroencephalography (EEG) decoding, but their practical applications are severely constrained. Accurately decoding EEG signals requires modeling complex temporal dynamics across multiple rhythms, which results in high-dimensional Riemannian inputs and significant computational costs. To address this, we propose the Manifold Pooling Network (MPNet). MPNet uses a rhythm-adaptive convolutional frontend to extract comprehensive time-frequency representations and generate multi-view Riemannian nodes. A novel manifold node pooling layer is then proposed to aggregate these nodes into a single fusion node with a fixed size, enabling the following deep Riemannian network to process it with greatly reduced costs. Experiments on two public EEG datasets show that MPNet achieves state-of-the-art accuracy, runs up to 10 times faster than the comparable Riemannian model, and maintains robust performance under limited-data conditions. These findings highlight MPNet’s practicality and efficiency for real-world EEG applications.
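EEG Riemannian pipelines typically represent each rhythm/view as a symmetric positive-definite (SPD) spatial covariance matrix, and "pooling nodes into one fusion node" means averaging on that manifold. MPNet's pooling layer is learned; the sketch below only shows the underlying geometric idea using the standard log-Euclidean mean, exp(mean(log(C_i))), which is an assumption about the aggregation, not the paper's layer.

```python
import numpy as np

def spd_log(m):
    """Matrix logarithm of a symmetric positive-definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.log(w)) @ v.T

def spd_exp(m):
    """Matrix exponential of a symmetric matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.exp(w)) @ v.T

def covariance_node(eeg):
    """Build one SPD 'Riemannian node' as a regularized spatial
    covariance matrix from an EEG segment (channels x samples)."""
    c = np.cov(eeg)
    return c + 1e-6 * np.eye(c.shape[0])

def log_euclidean_pool(nodes):
    """Pool several SPD nodes into one fixed-size fusion node via the
    log-Euclidean mean: exp(mean(log(C_i)))."""
    return spd_exp(np.mean([spd_log(c) for c in nodes], axis=0))
```

Whatever the number of rhythm views, the pooled output has the same fixed channel-by-channel size, which is what lets the downstream deep Riemannian network run on a single input instead of one per view.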

Computer Vision

[CV-0] ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation SIGGRAPH2026

[Quick Read]: This paper tackles fine-grained joint control of character performance and cinematography (camera motion) in video generation, especially preserving geometric consistency and visual quality when transferring motion across scenes. The core challenge is coordinating character pose with intrinsic and extrinsic camera parameters without retraining. The key to ActCam is a two-phase conditioning schedule on a pretrained image-to-video diffusion model: early denoising steps condition on both pose and sparse depth to preserve scene structure, after which only pose guidance remains to refine high-frequency details without over-constraining generation. This design unifies zero-shot character motion transfer with per-frame camera parameter control and markedly improves camera adherence and motion fidelity under large viewpoint changes.

Link: https://arxiv.org/abs/2605.06667
Authors: Omar El Khalifi, Thomas Rossi, Oscar Fossey, Thibault Fouque, Ulysse Mizrahi, Philip Torr, Ivan Laptev, Fabio Pizzati, Baptiste Bellot-Gurlet
Affiliations: Kinetix, France; University of Oxford, United Kingdom; MBZUAI, United Arab Emirates
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Note: SIGGRAPH 2026

Abstract:For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor’s motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: this https URL.

[CV-1] BAMI: Training-Free Bias Mitigation in GUI Grounding CVPR2026

[Quick Read]: This paper targets the suboptimal performance of GUI grounding models in complex GUI scenarios, notably the ScreenSpot-Pro benchmark, where existing models suffer from precision bias caused by high image resolution and ambiguity bias caused by intricate interface elements. The key to the solution is the Bias-Aware Manipulation Inference (BAMI) framework, whose two core manipulations, coarse-to-fine focus and candidate selection, effectively mitigate both biases without any additional training, significantly improving the accuracy of a variety of GUI grounding models.

Link: https://arxiv.org/abs/2605.06664
Authors: Borui Zhang, Bo Zhang, Bo Wang, Wenzhao Zheng, Yuhao Cheng, Liang Tang, Yiqiang Yan, Jie Zhou, Jiwen Lu
Affiliations: Tsinghua University; Lenovo Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Note: Accepted by CVPR 2026

Abstract:GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed Masked Prediction Distribution (MPD) attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce Bias-Aware Manipulation Inference (BAMI), which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9% to 57.8%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness. Code is available at this https URL.

[CV-2] Relit-LiVE: Relight Video by Jointly Learning Environment Video SIGGRAPH2026

[Quick Read]: This paper aims to fix the physical inconsistency and temporal instability of existing video relighting methods, such as distorted appearances, broken materials, and accumulated temporal artifacts, which stem from their reliance on inaccurate intrinsic scene decomposition. The key to the solution is twofold: explicitly introducing raw reference images into the rendering process to recover scene cues that are lost or corrupted in intrinsic representations, and a novel environment video prediction formulation that jointly generates the relit video and per-frame, camera-aligned environment maps in a single diffusion process. This joint prediction enforces geometric-illumination consistency, naturally supports dynamic lighting and camera motion, significantly improves physical realism, and eases the requirement of known per-frame camera poses.

Link: https://arxiv.org/abs/2605.06658
Authors: Weiqing Xiao, Hong Li, Xiuyu Yang, Houyuan Chen, Wenyi Li, Tianqi Liu, Shaocong Xu, Chongjie Ye, Hao Zhao, Beibei Wang
Affiliations: Nanjing University; BAAI; Tsinghua University; The Hong Kong University of Science and Technology; University of Chinese Academy of Sciences; Huazhong University of Science and Technology; The Chinese University of Hong Kong, Shenzhen; Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Note: Accepted at SIGGRAPH 2026. Project site: this https URL

Abstract:Recent advances have shown that large-scale video diffusion models can be repurposed as neural renderers by first decomposing videos into intrinsic scene representations and then performing forward rendering under novel illumination. While promising, this paradigm fundamentally relies on accurate intrinsic decomposition, which remains highly unreliable for real-world videos and often leads to distorted appearances, broken materials, and accumulated temporal artifacts during relighting. In this work, we present Relit-LiVE, a novel video relighting framework that produces physically consistent, temporally stable results without requiring prior knowledge of camera pose. Our key insight is to explicitly introduce raw reference images into the rendering process, enabling the model to recover critical scene cues that are inevitably lost or corrupted in intrinsic representations. Furthermore, we propose a novel environment video prediction formulation that simultaneously generates relit videos and per-frame environment maps aligned with each camera viewpoint in a single diffusion process. This joint prediction enforces strong geometric-illumination alignment and naturally supports dynamic lighting and camera motion, significantly improving physical consistency in video relighting while easing the requirement of known per-frame camera pose. Extensive experiments demonstrate that Relit-LiVE consistently outperforms state-of-the-art video relighting and neural rendering methods across synthetic and real-world benchmarks. Beyond relighting, our framework naturally supports a wide range of downstream applications, including scene-level rendering, material editing, object insertion, and streaming video relighting. The Project is available at this https URL.

[CV-3] Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

[Quick Read]: This paper addresses doubts about whether reported gains in Multimodal Domain Generalization (MMDG) reflect genuine progress, given inconsistent evaluation protocols. Current research is fragmented across datasets, modality configurations, and experimental settings, and existing benchmarks focus mainly on action recognition while neglecting real-world challenges such as input corruption, missing modalities, and model trustworthiness, preventing reliable assessment of the field's advancement. The key to the solution is MMDG-Bench, the first unified and comprehensive MMDG benchmark, which standardizes evaluation across six datasets spanning three tasks (action recognition, mechanical fault diagnosis, and sentiment analysis), six modality combinations, and nine representative methods, with evaluation dimensions covering standard accuracy, corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. Training 7,402 neural networks in total over 95 cross-domain tasks, the benchmark reveals that a substantial performance gap remains, no single method dominates, modality fusion is not always beneficial, and existing methods lose trustworthiness in realistic settings, providing a reproducible, comparable evaluation standard for future research.

Link: https://arxiv.org/abs/2605.06643
Authors: Hao Dong, Hongzhao Li, Shupan Li, Muhammad Haris Khan, Eleni Chatzi, Olga Fink
Affiliations: ETH Zürich; Zhengzhou University; MBZUAI; EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Note: Code: this https URL

Abstract:Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization obscures a reliable assessment of the field’s advancement. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7,402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.

[CV-4] GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation

[Quick Read]: This paper aims to reduce the costly, lengthy trial-and-error of ceramic glaze development caused by complex chemistry, easing the design burden on independent artists. The key to the solution is GlazyBench, the first large-scale dataset for AI-assisted glaze design, comprising 23,148 real glaze formulations and supporting two core tasks: predicting post-firing surface properties (such as color and transparency) from raw materials, and generating accurate visual renderings of glazes from those properties. The dataset provides a standardized benchmark for evaluating traditional machine learning, large language models, and deep generative and large multimodal models on glaze property prediction and image generation, advancing systematic research on generative AI for material design.

Link: https://arxiv.org/abs/2605.06641
Authors: Ziyu Zhai, Siyou Li, Juexi Shao, Juntao Yu
Affiliations: Queen Mary University of London
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Note:

Abstract:Developing ceramic glazes is a costly, time-consuming process of trial and error due to complex chemistry, placing a significant burden on independent artists. While recent advances in multimodal AI offer a modern solution, the field lacks the large-scale datasets required to train these models. We propose GlazyBench, the first dataset for AI-assisted glaze design. Comprising 23,148 real glaze formulations, GlazyBench supports two primary tasks: predicting post-firing surface properties, such as color and transparency, from raw materials, and generating accurate visual representations of the glaze based on these properties. We establish comprehensive baselines for property prediction using traditional machine learning and large language models, alongside image generation benchmarks using deep generative and large multimodal models. Our experiments demonstrate promising yet challenging results. GlazyBench pioneers a new research direction in AI-assisted material design, providing a standardized benchmark for systematic evaluation.

[CV-5] DPM: Dynamic Masked Metric Learning for Occluded Person Re-identification

[Quick Read]: This paper addresses the performance degradation that occlusion causes in person re-identification. The core challenge is the mismatch between occluded samples and holistic identity representations: severe occlusion removes discriminative body cues and introduces background clutter and occluder interference, making global metric learning unreliable. The key to the solution is DPM++, a Dynamic Masked Metric Learning framework whose input-adaptive masking mechanism dynamically selects reliable identity subspaces for each occluded instance, emphasizing visibility-consistent evidence while suppressing unreliable components during matching. A CLIP-based two-stage supervision scheme transfers ID-level semantic priors from the text branch into the classifier-prototype space for dynamic masked matching, and a saliency-guided patch transfer strategy synthesizes controllable, photo-realistic occluded samples, strengthening robustness to realistic occlusion patterns.

Link: https://arxiv.org/abs/2605.06637
Authors: Lei Tan, Yingshi Luan, Pincong Zou, Pingyang Dai, Liujuan Cao
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Note:

Abstract:Although person re-identification has made impressive progress, occlusion caused by obstacles remains an unsettled issue in real applications. The difficulty lies in the mismatch between incomplete occluded samples and holistic identity representations. Severe occlusion removes discriminative body cues and introduces interference from background clutter and occluders, making global metric learning unreliable. Existing methods mainly rely on extra pre-trained models to estimate visible parts for alignment or construct occluded samples via data augmentation, but still lack a unified framework that learns robust visibility-consistent matching under realistic occlusion patterns. In this paper, we propose DPM++, a Dynamic Masked Metric Learning framework for occluded person re-identification. DPM++ learns an input-adaptive masked metric that dynamically selects reliable identity subspaces for each occluded instance, enabling matching to emphasize visibility-consistent evidence while suppressing unreliable components. Built upon the classifier-prototype space, DPM++ introduces a CLIP-based two-stage supervision scheme, where ID-level semantic priors are learned from the text branch and transferred into the classifier-prototype space for dynamic masked matching. To strengthen the masked metric, we introduce a saliency-guided patch transfer strategy to synthesize controllable and photo-realistic occluded samples during training. Exploiting real scene priors, this strategy exposes the model to realistic partial observations and provides richer supervision than random erasing. In addition, occlusion-aware sample pairing and mask-guided optimization improve the stability and effectiveness of the framework. Experiments on occluded and holistic person re-identification benchmarks show that DPM++ consistently outperforms previous state-of-the-art methods in both holistic and occlusion scenarios.
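The core of a masked metric, scoring identity only on dimensions deemed reliable, can be sketched in a few lines. This is an illustrative simplification, not the paper's method: DPM++ predicts the mask per input and operates in the classifier-prototype space, whereas this toy uses a fixed binary mask with plain cosine similarity, and all names are mine.

```python
import numpy as np

def masked_similarity(query, gallery, mask, eps=1e-8):
    """Cosine similarity restricted to the feature dimensions the
    visibility mask marks as reliable, so occluded (unreliable)
    components do not pollute the match score.
    """
    q = query * mask
    g = gallery * mask
    return float(q @ g / (np.linalg.norm(q) * np.linalg.norm(g) + eps))

q = np.array([1.0, 0.5, -2.0, 0.0])
g = np.array([1.0, 0.4,  3.0, 0.1])
visible = np.array([1.0, 1.0, 0.0, 0.0])   # last dims deemed occluded
score = masked_similarity(q, g, visible)   # high: visible dims agree
full = float(q @ g / (np.linalg.norm(q) * np.linalg.norm(g)))  # negative
```

The contrast between `score` and `full` shows why masking matters: the occluded third dimension alone flips the unmasked similarity negative.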

[CV-6] SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders

[Quick Read]: This paper addresses the mismatch between the fixed sparsity level of traditional sparse autoencoders (e.g., the fixed K of TopK SAEs) and inputs of varying complexity. A fixed K ignores the varying local intrinsic dimensionality of natural data manifolds: simple samples may activate redundant features and introduce noise, while complex samples lose important structure for lack of features. The key to the solution is SoftSAE, whose core innovation is a differentiable Soft Top-K selection mechanism that lets the model learn an input-dependent sparsity level k, adjusting the number of active features to each input's complexity so that representations better match the data's structure and explanation length reflects the input's information content.

Link: https://arxiv.org/abs/2605.06610
Authors: Jakub Stępień, Marcin Mazur, Jacek Tabor, Przemysław Spurek
Affiliations: Jagiellonian University; IDEAS Research Institute
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Note:

Abstract:Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features (K) across all inputs, ignoring the varying complexity of real-world data. Natural data often lies on manifolds with varying local intrinsic dimensionality, meaning the number of relevant factors can change significantly across samples. This suggests that a fixed sparsity level is not optimal. Simple inputs may require only a few features, while more complex ones need more expressive representations. Using a constant K can therefore introduce noise in simple cases or miss important structure in more complex ones. To address this issue, we propose SoftSAE, a sparse autoencoder with a Dynamic Top-K selection mechanism. Our method uses a differentiable Soft Top-K operator to learn an input-dependent sparsity level k. This allows the model to adjust the number of active features based on the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input. Experimental results confirm that SoftSAE not only finds meaningful features, but also selects the right number of features for each concept. The source code is available at: this https URL.
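The abstract's central idea, replacing a hard Top-K cutoff with a differentiable relaxation so the number of active features can vary with the input, can be sketched as follows. This is a minimal NumPy sketch, not the authors' operator: the sigmoid-around-threshold relaxation, the temperature, and the function names are my assumptions.

```python
import numpy as np

def soft_topk_mask(z, k, temp=0.1):
    """Differentiable surrogate for a hard top-k indicator over scores z.

    A sigmoid centered between the k-th and (k+1)-th largest scores
    yields a mask close to 1 for the k largest entries and close to 0
    elsewhere, while remaining smooth in z (and hence trainable).
    """
    sorted_z = np.sort(z)[::-1]
    tau = 0.5 * (sorted_z[k - 1] + sorted_z[k])  # soft threshold
    return 1.0 / (1.0 + np.exp(-(z - tau) / temp))

# Four feature pre-activations, keep ~2 active: mask ~ [1, 1, 0, 0].
z = np.array([5.0, 3.0, 1.0, -1.0])
mask = soft_topk_mask(z, k=2)
sparse_code = z * mask  # SAE-style sparse activations
```

In the paper, k itself is predicted from the input; here it is passed in by hand purely to show the masking mechanics.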

[CV-7] DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

[Quick Read]: This paper addresses two structural weaknesses of contrastive language-image pretraining (CLIP): the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck insensitive to fine-grained local structure. The key to DINORANKCLIP is twofold: (1) a frozen DINOv3 teacher guides a lightweight dual-branch student through a multi-scale fusion module (channel-spatial attention, a self-attention refiner, and a conflict-aware gate) that preserves cross-modal alignment up to first order; and (2) a high-order Plackett-Luce ranking model whose per-position utilities are augmented with attention-parameterized pairwise and tuple-wise transition terms, a family that contains CLIP (zero-order) and RANKCLIP (first-order) as special cases, with the optimal order being $R^*=3$ on every benchmark. Under matched compute, the method consistently outperforms CLIP, CyCLIP, ALIP, and RANKCLIP, with the largest gains on fine-grained and out-of-distribution evaluations, underscoring its strengthened local structural reasoning.

Link: https://arxiv.org/abs/2605.06592
Authors: Shuyang Jiang, Nan Yu, Yiming Zhang, Zenghui Ding, Zhenyu Wu
Affiliations: University of California, Los Angeles; Aimaikj; HFIPS, Chinese Academy of Sciences; University of Science and Technology of China; National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Note: 18 pages, 7 figures, 9 tables. Code will be made publicly available upon acceptance

Abstract:Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensitive to fine-grained local structure. RANKCLIP partially addresses the first issue with a list-wise Plackett-Luce ranking-consistency loss, but its model is strictly first-order and inherits the second weakness untouched. We propose DINORANKCLIP, a pretraining framework that addresses both jointly. Our principal contribution is injecting a frozen DINOv3 teacher into the contrastive trunk through a dual-branch lightweight student and a multi-scale fusion module with channel-spatial attention, a self-attention refiner, and a conflict-aware gate that preserves the cross-modal alignment up to first order. Complementarily, we introduce a high-order Plackett-Luce ranking model in which the per-position utility is augmented with attention-parameterised pairwise and tuple-wise transition terms; the family contains CLIP and RANKCLIP as nested zero-order and first-order special cases, and the optimal order on every benchmark is R^*=3 . The full empirical study – order sweep, Fine-grained Probe on five datasets, four-node Modality-Gap analysis, six-variant Fusion ablation – fits in 72 hours on a single eight-GPU H100 node and trains entirely on Conceptual Captions 3M. DINORANKCLIP consistently outperforms CLIP, CyCLIP, ALIP, and RANKCLIP under matched compute, with the largest relative gains on the fine-grained and out-of-distribution evaluations that most directly stress local structural reasoning.
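For context, the first-order Plackett-Luce likelihood that RANKCLIP uses, and that the paper's high-order family extends, can be sketched directly. This shows only the standard zero-transition baseline; the attention-parameterized pairwise and tuple-wise terms of the R >= 2 model are omitted, and the function names are mine.

```python
import numpy as np

def plackett_luce_nll(utilities, ranking):
    """Negative log-likelihood of `ranking` under a first-order
    Plackett-Luce model: at each position the chosen item competes,
    via a softmax over utilities, against all items not yet placed.
    """
    nll = 0.0
    remaining = list(ranking)
    for choice in ranking:
        scores = np.array([utilities[i] for i in remaining])
        nll -= utilities[choice] - np.log(np.exp(scores).sum())
        remaining.remove(choice)
    return nll

# Two items with equal utility: each ordering has probability 1/2,
# so the NLL is log 2. Raising the top item's utility lowers the NLL.
nll_equal = plackett_luce_nll({0: 0.0, 1: 0.0}, ranking=[0, 1])
nll_better = plackett_luce_nll({0: 1.0, 1: 0.0}, ranking=[0, 1])
```

Minimizing this list-wise loss over in-batch similarity scores is what enforces ranking consistency among unmatched pairs.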

[CV-8] Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation CVPR2026

[Quick Read]: This paper addresses the computational challenges of solving minimal problems in camera geometry estimation, which are typically formulated as systems of multivariate polynomial equations; traditional Gröbner-basis and resultant-based methods require matrix inversion in the online solver, hurting numerical stability and efficiency. The key to the solution is a sampling-based, matrix-inversion-free method built on sparse hidden-variable resultants: the determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion; the remaining unknowns are then recovered by identifying rank-1 deficient submatrices and applying Cramer's rule, with a greatest-common-divisor (GCD) criterion making submatrix identification robust under noise. Experiments show strong numerical stability and competitive runtime, especially for small-scale problems, offering a practical alternative to traditional solvers.

Link: https://arxiv.org/abs/2605.06572
Authors: Haidong Wu, Snehal Bhayani, Janne Heikkilä
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Note: Accepted to CVPR 2026

Abstract:Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gröbner-basis or resultant-based methods due to matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion. Solving this polynomial yields the hidden variable, and the remaining unknowns are recovered by identifying rank-1 deficient submatrices and applying Cramer’s rule. A greatest common divisor-based criterion ensures robust submatrix identification under noise. Experiments on diverse minimal problems demonstrate that the proposed solver achieves strong numerical stability and competitive runtime, particularly for small-scale problems, providing a practical alternative to traditional Gröbner-basis and resultant-based solvers.
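The evaluation-interpolation step described above, recovering det M(x) as a polynomial in the hidden variable without symbolic expansion, can be sketched on a toy dense case. This assumes a univariate determinant with a known degree bound; the paper additionally exploits sparsity, and the function name is mine.

```python
import numpy as np

def det_poly_via_ifft(matrix_fn, degree):
    """Recover the ascending coefficients of p(x) = det(M(x)).

    Evaluating det M at n = degree + 1 nodes omega^{-m} on the unit
    circle turns polynomial interpolation into a single inverse FFT:
    only numeric determinants are needed, no symbolic algebra and no
    matrix inversion.
    """
    n = degree + 1
    nodes = np.exp(-2j * np.pi * np.arange(n) / n)
    evals = np.array([np.linalg.det(matrix_fn(x)) for x in nodes])
    return np.fft.ifft(evals).real  # coefficients c_0 .. c_degree

# Toy hidden-variable matrix: M(x) = [[x, 1], [1, x]], det = x^2 - 1.
coeffs = det_poly_via_ifft(lambda x: np.array([[x, 1.0], [1.0, x]]),
                           degree=2)
```

Solving the recovered polynomial (e.g., via a companion matrix or `np.roots`) then yields the hidden variable, after which the remaining unknowns are extracted from the resultant matrix.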

[CV-9] MedHorizon: Towards Long-context Medical Video Understanding in the Wild

[Quick Read]: This paper addresses the inability of current medical multimodal large language models (MLLMs) to understand full-procedure videos in real clinical settings. Existing benchmarks typically assume the decisive evidence has already been localized into images, short clips, or pre-segmented videos, sidestepping the core retrieval-before-reasoning challenge posed by long, highly redundant streams with sparse evidence. The key to the solution is MedHorizon, an in-the-wild benchmark for long-context medical video understanding built on real clinical procedures: 759 hours of full-length surgical video and 1,253 evidence-grounded multiple-choice questions in which, on average, only 0.166% of frames are evidence frames, forcing models to retrieve and aggregate sparse evidence amid noise for multi-hop clinical reasoning. Even the best model reaches only 41.1% accuracy, exposing fundamental limits of current systems in procedural reasoning and attention drift and providing a rigorous, reproducible testbed for future research.

Link: https://arxiv.org/abs/2605.06537
Authors: Bodong Du, Bowen Liu, Yang Yu, Xinpeng Ding, Zhiheng Wu, Shuning Wang, Shuo Nie, Naiming Liu, Qifeng Chen, Yangqiu Song, Xiaomeng Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Note:

Abstract:Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with only 0.166% evidence frames on average, requiring models to search noisy procedural streams before interpreting and aggregating findings. We evaluate representative general-domain, medical-domain, and long-video MLLMs. The best model reaches only 41.1% accuracy, showing that current systems remain far from robust full-procedure understanding. Further analysis yields four key findings: performance does not scale reliably with more frames; evidence retrieval and clinical interpretation remain primary bottlenecks; these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy; and generic sampling methods only partially balance local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs that retrieve sparse evidence and reason over complete clinical workflows.

[CV-10] Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

[Quick Read]: This paper addresses the bottleneck in background replacement caused by the scarcity of high-quality training data. Current mainstream approaches are limited by existing datasets (such as OpenVE-3M) that frequently produce static, unnatural backgrounds, preventing models from learning realistic foreground-background interactions and temporal consistency. The key to the solution is a scalable data-synthesis pipeline that generates high-quality foreground and background guidance in a decoupled manner with strict quality filtering. On top of it, the authors build Sparkle, a dataset of ~140K video pairs, and the corresponding Sparkle-Bench evaluation benchmark; experiments show that a model trained on this dataset substantially outperforms existing baselines across evaluation benchmarks.

Link: https://arxiv.org/abs/2605.06535
Authors: Ziyun Zeng, Yiqi Lin, Guoqiang Liang, Mike Zheng Shou
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Note: Tech Report. Project Page: this https URL

Abstract:In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in poorly performing state-of-the-art models, e.g., Kiwi-Edit, because the primary open-source dataset that contains this task, i.e., OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at this https URL.

[CV-11] Agent ic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

[Quick Read]: This paper addresses the out-of-distribution (OOD) problem that foundation models (FMs) face in open-world settings. Unlike classical OOD research, FM-OOD has structural features such as knowledge boundaries, capability ceilings, compositional shifts, and open-ended task variation, and the pretraining and post-training distributions of modern FMs are often only partially observed, leaving existing model-centric methods ill-equipped. The key contribution is positioning agentic systems as the missing paradigm: through four structural properties (perception, strategy selection, external action, and closed-loop verification) they break the parameter coverage ceiling and substantially extend the set of OOD inputs a model can handle. The authors stress that agentic methods do not replace model-centric ones but complement them, and argue for making the agentic paradigm a first-class direction in foundation-model OOD research.

Link: https://arxiv.org/abs/2605.06522
Authors: Xin Wang, Haibo Chen, Wenxuan Liu, Wenwu Zhu
Affiliations: Tsinghua University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Note: 13 pages, 2 figures

Abstract:Foundation models (FMs) are increasingly deployed in open-world settings where distribution shift is the rule rather than the exception. The out-of-distribution (OOD) phenomena they face – knowledge boundaries, capability ceilings, compositional shifts, and open-ended task variation – differ in kind from the settings that have shaped prior OOD research, and are further complicated because the pretraining and post-training distributions of modern FMs are often only partially observed. Our position is that OOD for foundation models is a structurally distinct problem that cannot be solved within the prevailing model-centric paradigm, and that agentic systems constitute the missing paradigm required to address it. We defend this claim through four steps. First, we give a stage-aware formalization of OOD that accommodates partially observed multi-stage training distributions. Second, we prove a parameter coverage ceiling: there exist practically relevant inputs that no model-centric method (training-time or test-time) can handle within tolerance \varepsilon , for reasons intrinsic to parameter-based representation. Third, we characterize agentic OOD systems by four structural properties – perception, strategy selection, external action, and closed-loop verification – and show that they strictly extend the reachable set beyond the ceiling. Fourth, we respond to seven counterarguments, conceding two, and outline a research agenda. We do not claim that agentic methods subsume model-centric ones; we argue that the two are complementary, and that progress on FM-OOD requires explicit recognition of the agentic paradigm as a first-class research direction.

[CV-12] DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

[Quick Read]: This paper addresses the default completion bias of diffusion models on rare but plausible compositions: when prompts contain combinations that are valid yet underrepresented in training data (e.g., a snowy beach or a rainbow at night), generation tends to collapse toward more common alternatives. The key to the solution is Default Completion Repulsion (DCR), a training-free framework that constructs a counterfactual attractor simulating the model's preferred frequent completion, defines the discrepancy between the target and attractor trajectories as a counterfactual drift, and applies a projection-based repulsion mechanism that removes guidance components aligned with the drift direction, suppressing undesired frequent completions while preserving other semantic components. The method operates entirely within standard diffusion sampling, without retraining or architectural modification, improving compositional fidelity on rare prompts while exposing and counteracting intrinsic model biases.

Link: https://arxiv.org/abs/2605.06512
Authors: Taewon Kang, Matthias Zwicker
Affiliations: University of Maryland at College Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Note: 40 pages, 33 figures

Abstract:Diffusion models generate realistic visual content, yet often fail to produce rare but plausible compositions. When prompted with combinations that are valid but underrepresented in training data, such as a snowy beach or a rainbow at night, the generation process frequently collapses toward more common alternatives. We identify this failure mode as default completion bias, where denoising trajectories are implicitly attracted toward high-frequency semantic configurations. Existing guidance mechanisms do not explicitly model this competing tendency and therefore struggle to prevent such collapse. We introduce Default Completion Repulsion (DCR), a training-free framework that explicitly models and suppresses default completion behavior. DCR constructs a counterfactual attractor by relaxing the rare compositional factor while preserving surrounding semantics, inducing an alternative denoising trajectory reflecting the model’s preferred completion. We define the discrepancy between target and attractor trajectories as a counterfactual drift, and propose a projection-based repulsion mechanism that removes guidance components aligned with this drift direction. This suppresses undesired frequent completions while preserving other semantic components. DCR operates entirely within the standard diffusion sampling process without retraining or architectural modification. Experiments on rare compositional prompts show that DCR improves compositional fidelity while maintaining visual quality. Our analysis further shows that the framework exposes and counteracts intrinsic model biases, offering a new perspective on controllable generation beyond explicit constraint enforcement.
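The projection-based repulsion in the abstract, removing the guidance component aligned with the counterfactual drift, reduces to a one-line vector projection. This is a minimal sketch on flattened guidance vectors; the clamp to positive alignment is my assumption rather than a detail stated in the abstract, and the names are mine.

```python
import numpy as np

def repel_default_completion(guidance, drift, eps=1e-8):
    """Subtract from `guidance` its component along the counterfactual
    drift (target-minus-attractor discrepancy), so the update no longer
    pushes toward the frequent default completion.
    """
    d = drift / (np.linalg.norm(drift) + eps)
    aligned = float(np.dot(guidance, d))
    # Clamp: only repel when guidance actually points along the drift.
    return guidance - max(aligned, 0.0) * d

g = np.array([1.0, 1.0])
d = np.array([1.0, 0.0])
repelled = repel_default_completion(g, d)  # drift component removed
```

After repulsion, the guidance is orthogonal to the drift direction, so the semantic components not implicated in the default completion pass through untouched.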

[CV-13] FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

[Quick Read]: This paper addresses three failure modes of training-free long-video generation with video diffusion models: content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but within each branch they decompose visual consistency and temporal dynamics by predefined rules, which are unreliable when appearance changes are tightly coupled with action progression (e.g., camera motion or sequential actions). The key insight of the proposed FreeSpec, a training-free spectral reconstruction framework, comes from a singular value decomposition (SVD) analysis of video extension: enlarged self-attention windows cause spectral concentration, in which a few low-rank singular directions dominate the energy, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variation. FreeSpec uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis, fusing them at the spectrum level and avoiding rigid feature partitioning, thereby preserving long-range consistency while better retaining spatial details and temporal dynamics.

Link: https://arxiv.org/abs/2605.06509
Authors: Fangda Chen, Shanshan Zhao, Longrong Yang, Chuanfu Xu, Zhigang Luo, Long Lan
Affiliations: National University of Defense Technology; Alibaba International Digital Commerce; Zhejiang University; Xiangjiang Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Note:

Abstract:Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, and uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: this https URL.
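The spectrum-level fusion can be illustrated on plain matrices: top singular directions come from the global branch, the residual spectrum from the local branch. This is a schematic NumPy sketch assuming a fixed rank split r; the actual method operates on diffusion features, and the fixed split and function names are my simplifications.

```python
import numpy as np

def spectral_fuse(global_feat, local_feat, r):
    """Fuse two feature matrices in the singular spectrum: low-rank
    (top-r) structure from the global branch for long-range consistency,
    high-rank residual from the local branch for details and motion.
    """
    Ug, Sg, Vgt = np.linalg.svd(global_feat, full_matrices=False)
    Ul, Sl, Vlt = np.linalg.svd(local_feat, full_matrices=False)
    low = (Ug[:, :r] * Sg[:r]) @ Vgt[:r]    # global low-rank guidance
    high = (Ul[:, r:] * Sl[r:]) @ Vlt[r:]   # local high-rank residual
    return low + high

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))
# Sanity check: fusing a matrix with itself reconstructs it exactly,
# since the low- and high-rank parts partition the full SVD.
fused = spectral_fuse(A, A, r=2)
```

When `global_feat` and `local_feat` differ, the output keeps the global branch's coarse structure while the local branch supplies the detail-carrying tail of the spectrum.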

[CV-14] MARBLE: Multi-Aspect Reward Balance for Diffusion RL

【速读】:该论文旨在解决扩散模型在强化学习微调过程中,多维度奖励(multi-aspect rewards)难以协同优化的问题。现有方法如加权求和奖励或分阶段训练,要么无法联合训练统一模型,要么依赖大量人工调参的顺序策略,且因样本层面的奖励不匹配导致监督信号稀释。其关键解决方案是提出MARBLE(Multi-Aspect Reward BaLancE),一种基于梯度空间优化的框架:通过为每个奖励维护独立的优势估计量(advantage estimator),计算各奖励对应的策略梯度,并通过求解二次规划(Quadratic Programming)问题自动平衡这些梯度方向,从而无需手动设定权重即可实现多奖励的联合优化;同时引入近似化公式与指数移动平均(EMA)平滑机制,在保持训练效率接近单奖励基准(0.97X速度)的同时稳定更新过程。

链接: https://arxiv.org/abs/2605.06507
作者: Canyu Zhao,Hao Chen,Yunze Tong,Yu Qiao,Jiacheng Li,Chunhua Shen
机构: Zhejiang University (浙江大学); HiThink; Zhejiang University of Technology (浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Homepage and code repo: this https URL

点击查看摘要

Abstract:Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deals with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward R(x)=\sum_k w_k R_k(x) , or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitate heavy, manually tuned sequential training. We find that the failure stems from using a naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch because most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others; consequently, weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction without manually tuned reward weighting, by solving a Quadratic Programming problem. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT to reduce the per-step cost from K+1 backward passes to near single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative under weighted summation in 80% of mini-batches to consistently positive, and runs at 0.97X the training speed of baseline training.
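The idea of harmonizing per-reward gradients via a quadratic program can be sketched in its simplest two-reward special case, where the min-norm point of the segment between the two gradients has a closed form (as in MGDA-style multi-objective optimization). This is a hedged illustration only: MARBLE's actual K-reward QP, advantage estimators, and EMA smoothing are not reproduced here, and `harmonize_two` is an invented name:

```python
import numpy as np

def harmonize_two(g1, g2):
    """Min-norm convex combination of two per-reward gradients.

    Returns d = a*g1 + (1-a)*g2 with a chosen to minimize ||d||^2;
    the result has non-negative inner product with both gradients,
    so a step along d does not hurt either reward.
    """
    diff = g1 - g2
    denom = diff @ diff
    if denom < 1e-12:          # gradients already agree
        return g1
    alpha = np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    return alpha * g1 + (1.0 - alpha) * g2
```

For K > 2 rewards the same objective becomes a small QP over the simplex, which is what a general solver would handle.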

[CV-15] 3D MRI Image Pretraining via Controllable 2D Slice Navigation Task

【速读】:该论文旨在解决当前自监督预训练方法在学习磁共振成像(MRI)表示时,主要将每张扫描图像视为静态的切片、补丁或体积聚合物,从而限制了对空间结构和解剖关系建模能力的问题。其解决方案的关键在于提出一种基于可控2D渲染序列的新颖自监督信号:通过连续改变视角、方向和尺度将3D MRI体积转化为可控制的视频动作序列,进而设计一种动作条件下的预训练目标,其中编码器(tokenizer)处理切片观测,潜动力学模型预测潜在特征演化。这种机制利用了MRI数据中固有的空间变换特性,显著提升了下游解剖与空间任务的表征学习效果。

链接: https://arxiv.org/abs/2605.06487
作者: Yu Wang,Qingchao Chen
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Self-supervised pretraining has become the mainstream approach for learning MRI representations from unlabeled scans. However, most existing objectives still treat each scan primarily as static aggregations of slices, patches or volumes. We ask whether there exists an intrinsic form of self-supervision signal that is different from reconstructing the masked patches, through transforming the 3D volumes into controllable 2D rendered sequences: by rendering slices at continuous positions, orientations, and scales, a 3D volume can be converted into dense video-action sequences whose controls are the action trajectories. We study this formulation with an action-conditioned pretraining objective, where a tokenizer encodes slice observations and a latent dynamics model predicts the evolution of latent features. Across representative anatomical and spatial downstream tasks, the proposed pretraining is evaluated against standard static-volume baselines, tokenizer-only pretraining, and dynamics variants without aligned actions. These results suggest that controllable MRI slice navigation provides a useful complementary pretraining interface for learning anatomical and spatial representations from large unlabeled MRI collections.

[CV-16] GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在多领域或任务知识累积过程中出现的灾难性遗忘(catastrophic forgetting)问题。其解决方案的关键在于提出GeoStack(Geometric Stacking)框架,通过在适配器流形(adapter manifold)上施加几何与结构约束,确保基础模型的知识不被破坏;同时,理论证明了权重折叠(weight-folding)特性,使得推理复杂度恒定(O(1)),与集成专家数量无关,从而实现高效、可扩展的知识组合机制。

链接: https://arxiv.org/abs/2605.06477
作者: Pranav Mantini,Shishir K. Shah
机构: University of Houston (休斯顿大学); The University of Oklahoma (俄克拉荷马大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity ( O(1) ), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at this https URL.
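The weight-folding property behind the O(1) inference claim can be illustrated with a LoRA-style sketch: each expert is a low-rank update W_k = A_k B_k, and folding all experts into one dense matrix makes the forward pass independent of the number of experts. A hedged sketch only; GeoStack's geometric constraints on the adapter manifold and its exact parameterization are not shown, and `fold_experts` is an illustrative name:

```python
import numpy as np

def fold_experts(W_base, adapters, scales=None):
    """Fold K low-rank expert adapters into the base weight matrix.

    adapters: list of (A, B) pairs with update A @ B.
    After folding, inference is a single matmul regardless of K.
    """
    scales = scales if scales is not None else [1.0] * len(adapters)
    W = W_base.copy()
    for (A, B), s in zip(adapters, scales):
        W += s * (A @ B)   # accumulate each expert's low-rank update
    return W
```

Because folding happens once, offline, adding experts changes preprocessing cost but not per-query latency.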

[CV-17] Hyperbolic Concept Bottleneck Models

【速读】:该论文旨在解决现有概念瓶颈模型(Concept Bottleneck Models, CBMs)在表示概念时存在的结构性失配问题:当前CBMs将概念嵌入到平坦的欧几里得空间中,假设概念之间相互独立且正交,这与真实世界中概念具有语义层次结构(semantic hierarchies)的事实不符。为克服这一局限,作者提出超球面概念瓶颈模型(Hyperbolic Concept Bottleneck Models, HypCBM),其核心创新在于将概念激活重新定义为双曲空间中的非对称几何包含关系(asymmetric geometric containment),从而自然地建模概念间的蕴含关系(entailment)。关键在于,该方法无需额外监督或学习模块,即可通过概念蕴含锥体(entailment cone)内的包含裕度(margin of inclusion)生成稀疏且层次感知的激活信号,并引入自适应缩放定律以实现层次忠实的干预传播,显著提升了模型在数据稀缺场景下的可解释性、层次一致性及鲁棒性。

链接: https://arxiv.org/abs/2605.06440
作者: Daniel Uyterlinde,Swasti Shreya Mishra,Pascal Mettes
机构: Informatics Institute, University of Amsterdam, The Netherlands (阿姆斯特丹大学信息学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 14 figures

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, treating them as independent, orthogonal dimensions. Concepts, however, are highly structured and organized in semantic hierarchies. To resolve this mismatch, we propose Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework that grounds the bottleneck in this structure by reformulating concept activation as asymmetric geometric containment in hyperbolic space. Rather than treating entailment cones as a pre-training penalty, we show they encode a natural test-time activation signal: the margin of inclusion within a concept’s entailment cone yields sparse, hierarchy-aware activations without any additional supervision or learned modules. We further introduce an adaptive scaling law for hierarchically faithful interventions, propagating user corrections coherently through the concept tree. Empirically, HypCBM rivals post-hoc Euclidean models trained on 20 \times more data in sparse regimes required for human interpretability, with stronger hierarchical consistency and improved robustness to input corruptions.

[CV-18] From Review to Design: Ethical Multimodal Driver Monitoring Systems for Risk Mitigation Incident Response and Accountability in Automated Vehicles

【速读】:该论文旨在解决自动驾驶车辆中驾驶员监控系统(Driver Monitoring Systems, DMS)在部署过程中面临的伦理与法律挑战,尤其是隐私保护、数据所有权、算法公平性及驾驶员情绪福祉等关键问题。现有法规如GDPR、欧盟人工智能法案(EU AI Act)和IEEE标准虽提供了宏观指导,但缺乏针对舱内多模态感知技术特有风险的具体规范。解决方案的关键在于提出一个模块化的伦理设计框架,将高层次原则转化为可操作的设计与部署指南,包括用户可配置的同意机制、公平导向的模型开发流程、透明度与可解释性工具,以及保障驾驶员心理健康的防护措施,并辅以针对性的风险分析与故障缓解策略,从而推动下一代自动驾驶车辆中透明、可信且以人为本的DMS系统发展。

链接: https://arxiv.org/abs/2605.06439
作者: Bilal Khana,Waseem Shariff,Rory Coyne,Muhammad Ali Farooq,Peter Corcoran
机构: University of Galway (戈尔韦大学); Royal College of Surgeons in Ireland (爱尔兰外科医学院)
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:As vehicles transition toward higher levels of automation, Driver Monitoring Systems (DMS) have become essential for ensuring human oversight, safety, and regulatory compliance in a vehicle. These systems rely on multimodal sensing and AI-driven inference to assess driver attention, cognitive state, and readiness to take control. While technologically promising, their deployment introduces a complex set of ethical and legal challenges - ranging from privacy and consent to data ownership and algorithmic fairness. While overarching frameworks such as the GDPR, EU AI Act, and IEEE standards offer important guidance, they lack the specificity required for addressing the unique risks posed by in-cabin sensing technologies. This paper adopts a review-to-design perspective, critically examining existing regulatory instruments and ethical frameworks – such as the GDPR, the EU AI Act, and IEEE guidelines – and identifying gaps in their applicability to the distinctive risks posed by multimodal, AI-enabled in-cabin monitoring. Building on this review, we propose a modular ethical design framework tailored specifically to Driver Monitoring Systems. The framework translates high-level principles into actionable design and deployment guidance, including user-configurable consent mechanisms, fairness-aware model development, transparency and explainability tools, and safeguards for driver emotional well-being. Finally, the paper outlines a risk analysis and failure mitigation strategy, emphasizing proactive incident response and accountability mechanisms tailored to the DMS context. Together, these contributions aim to inform the development of transparent, trustworthy, and human-centered driver monitoring systems for next-generation autonomous vehicles. 

[CV-19] FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation

【速读】:该论文旨在解决像素空间生成模型中普遍存在的频率同质化问题,即现有方法将图像生成视为频率均匀的过程,忽略了低频与高频成分在生成过程中的差异性角色和学习动态。其解决方案的关键在于提出FREPix框架,该框架通过显式分解生成过程为低频和高频分量,为其分配独立的传输路径,采用因子化网络分别预测,并基于频率感知的目标函数进行训练,从而将粗到细的生成机制从隐含行为转变为明确的设计原则。

链接: https://arxiv.org/abs/2605.06421
作者: Mingfeng Lin,Jiakun Chen,Liang Han,Liqiang Nie
机构: Harbin Institute of Technology (Shenzhen)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pixel-space diffusion has re-emerged as a promising alternative to latent-space generation because it avoids the representation bottleneck introduced by VAEs. Yet most existing methods still treat image generation as a frequency-homogeneous process, overlooking the distinct roles and learning dynamics of low- and high-frequency components. To address this, we propose FREPix, a FREquency-heterogeneous flow matching framework for Pixel-space image generation. FREPix explicitly decomposes generation into low- and high-frequency components, assigns them separate transport paths, predicts them with a factorized network, and trains them with a frequency-aware objective. In this way, coarse-to-fine generation becomes an explicit design principle rather than an implicit behavior. On ImageNet class-to-image generation, FREPix achieves competitive results among pixel-space generation models, reaching 1.91 FID at 256\times256 and 2.38 FID at 512\times512 , with particularly strong behavior in the low-NFE regime.
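The explicit low/high-frequency decomposition can be sketched with a hard radial mask in the 2D FFT domain: frequencies inside the cutoff form the low band and the residual forms the high band, so the two bands sum exactly back to the image. This is only an illustrative stand-in for the kind of split FREPix assigns separate transport paths to; the paper's actual filters may differ:

```python
import numpy as np

def frequency_split(img, cutoff):
    """Split a 2D image into (low, high) frequency components.

    Frequencies with radius <= cutoff (in FFT bins) go to the low band;
    everything else to the high band. low + high reconstructs img exactly.
    """
    F = np.fft.fft2(img)
    fy = np.fft.fftfreq(img.shape[0]) * img.shape[0]
    fx = np.fft.fftfreq(img.shape[1]) * img.shape[1]
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mask = radius <= cutoff
    low = np.fft.ifft2(F * mask).real
    high = img - low
    return low, high
```

Treating the two bands as separate prediction targets is what turns coarse-to-fine generation into an explicit design choice rather than an emergent behavior.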

[CV-20] Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models

【速读】:该论文旨在解决机器人控制中世界模型(World Model)训练时 latent space 选择的关键问题,即如何在动作条件视频扩散模型(action-conditioned video diffusion models)中选取更有效的潜在空间以提升策略评估性能。传统方法多采用以像素重建为目标的自编码器(如 VAE)作为 latent space,但本文通过系统性对比六种重建与语义编码器,在 BridgeV2 数据集上固定训练协议下验证了不同 latent space 对世界模型训练效果的影响。研究发现,仅依赖视觉保真度(visual fidelity)不足以判断世界模型优劣;相比之下,语义对齐的预训练编码器(如 V-JEPA 2.1、Web-DINO 和 SigLIP 2)在规划能力、下游策略性能及潜在表示质量三个维度均表现更优,尤其在所有模型规模下均展现出更强的策略相关性。因此,论文提出以语义对齐的 latent space 作为构建机器人扩散世界模型的基础,是提升其政策相关性的关键解决方案。

链接: https://arxiv.org/abs/2605.06388
作者: Nilaksh,Saurav Jha,Artem Zholus,Sarath Chandar
机构: Chandar Research Lab; Mila – Quebec AI Institute; Polytechnique Montréal; Canada CIFAR AI Chair
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 9 pages

点击查看摘要

Abstract:World model-based policy evaluation is a practical proxy for testing real-world robot control by rolling out candidate actions in action-conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the right latent space becomes critical. While the status quo uses autoencoding latent spaces like VAEs that are primarily trained for pixel reconstruction, recent work suggests benefits from pretrained encoders with representation-aligned semantic latent spaces. We systematically evaluate these latent spaces for action-conditioned LDM by comparing six reconstruction and semantic encoders to train world model variants under a fixed protocol on the BridgeV2 dataset, and show effective world model training in high-dimensional representation spaces with and without dimension compression. We then propose three axes to assess robotic world model performance: visual fidelity, planning and downstream policy performance, and latent representation quality. Our results show that visual fidelity alone is insufficient for world model selection. While reconstruction encoders like VAE and Cosmos achieve strong pixel-level scores, semantic encoders such as V-JEPA 2.1 (strongest overall on policy), Web-DINO, and SigLIP 2 generally excel across the other two axes at all model scales. Our study advocates semantic latent spaces as a stronger foundation for policy-relevant robotics diffusion world models.

[CV-21] Empirical Evidence for Simply Connected Decision Regions in Image Classifiers

【速读】:该论文旨在解决深度神经网络中决策区域(decision regions)的拓扑性质问题,特别是验证其是否不仅路径连通(path connected),而且是单连通(simply connected)。此前研究已表明决策区域具有路径连通性,但未能回答更深层次的拓扑结构问题:即区域内任意闭合环路是否可以连续收缩至一点而不离开该区域。为此,作者提出一种迭代四边形网格填充(iterative quad-mesh filling)方法,用于构造一个由给定环路边界限定、完全位于同一决策区域内的标签保持表面(label-preserving surface),从而在有限分辨率下实现对环路收缩过程的建模与验证。该方案的关键在于将几何插值与Coons补片(Coons patches)相联系,量化构造表面与理想几何插值之间的偏差,从而提供可计算、可扩展的实证工具来检验决策区域的单连通性假设。

链接: https://arxiv.org/abs/2605.06380
作者: Arjhun Swaminathan,Mete Akgün
机构: University of Tübingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding the topology of decision regions is central to explaining the inner workings of deep neural networks. Prior empirical work has provided evidence that these regions are path connected. We study a stronger topological question: whether closed loops inside a decision region can be contracted without leaving that region. To this end, we propose an iterative quad-mesh filling procedure that constructs a finite-resolution label-preserving surface bounded by a given loop and lying entirely within the same decision region. We further connect this construction to natural Coons patches in order to quantify its deviation from a canonical geometric interpolation of the loop. By evaluating our method across several modern image-classification models, we provide empirical evidence supporting the hypothesis that decision regions in deep neural networks are not only path connected, but also simply connected.
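The "canonical geometric interpolation of the loop" the paper compares against is the classical bilinearly blended Coons patch: given the four boundary curves of a loop, it fills the interior while reproducing the boundary exactly. A minimal numpy sketch (the paper's quad-mesh filling is label-aware and iterative; this only shows the Coons construction itself):

```python
import numpy as np

def coons_patch(c0, c1, d0, d1, n=16):
    """Bilinearly blended Coons patch from four boundary curves.

    c0(t), c1(t): the v=0 and v=1 boundaries; d0(t), d1(t): the u=0 and
    u=1 boundaries, with matching corners. Returns an (n, n, dim) surface
    whose four edges reproduce the input loop.
    """
    u = np.linspace(0, 1, n)
    v = np.linspace(0, 1, n)
    C0 = np.array([c0(t) for t in u])   # (n, dim) samples of each curve
    C1 = np.array([c1(t) for t in u])
    D0 = np.array([d0(t) for t in v])
    D1 = np.array([d1(t) for t in v])
    U = u[:, None, None]
    V = v[None, :, None]
    ruled_u = (1 - V) * C0[:, None, :] + V * C1[:, None, :]
    ruled_v = (1 - U) * D0[None, :, :] + U * D1[None, :, :]
    bilin = ((1 - U) * (1 - V) * C0[0] + U * (1 - V) * C0[-1]
             + (1 - U) * V * C1[0] + U * V * C1[-1])
    return ruled_u + ruled_v - bilin   # ruled sums minus corner bilinear
```

Measuring how far the label-preserving filled surface deviates from this patch quantifies how "geometrically natural" the loop contraction is.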

[CV-22] Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

【速读】:该论文旨在解决扩散模型加速中因离散时间分布匹配(Distribution Matching Distillation, DMD)导致的视觉伪影和过度平滑问题,这些问题通常源于固定离散时间点的稀疏监督以及反向KL散度的模式寻找特性。解决方案的关键在于提出连续时间分布匹配(Continuous-Time Distribution Matching, CDM),其核心创新包括两个方面:一是将固定的离散调度替换为随机长度的动态连续调度,使分布匹配在采样轨迹任意点进行而非仅限于少数预设锚点;二是设计一种连续时间对齐目标,在学生模型速度场外推的潜在空间上执行主动的非轨迹匹配,从而提升泛化能力并保留精细视觉细节。

链接: https://arxiv.org/abs/2605.06376
作者: Tao Liu,Hao Yan,Mengting Chen,Taihang Hu,Zhengrong Yue,Zihao Pan,Jinsong Lan,Xiaoyong Zhu,Ming-Ming Cheng,Bo Zheng,Yaxing Wang
机构: Nankai University (南开大学); Alibaba Group; Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22pages, 9 figures

点击查看摘要

Abstract:Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency Distillation are two representative paradigms. While consistency methods enforce self-consistency along the full PF-ODE trajectory to steer it toward the clean data manifold, vanilla DMD relies on sparse supervision at a few predefined discrete timesteps. This restricted discrete-time formulation, combined with the mode-seeking nature of the reverse KL divergence, tends to produce visual artifacts and over-smoothed outputs, often necessitating complex auxiliary modules – such as GANs or reward models – to restore visual fidelity. In this work, we introduce Continuous-Time Distribution Matching (CDM), migrating the DMD framework from discrete anchoring to continuous optimization for the first time. CDM achieves this through two continuous-time designs. First, we replace the fixed discrete schedule with a dynamic continuous schedule of random length, so that distribution matching is enforced at arbitrary points along sampling trajectories rather than only at a few fixed anchors. Second, we propose a continuous-time alignment objective that performs active off-trajectory matching on latents extrapolated via the student's velocity field, improving generalization and preserving fine visual details. Extensive experiments on different architectures, including SD3-Medium and Longcat-Image, demonstrate that CDM provides highly competitive visual fidelity for few-step image generation without relying on complex auxiliary objectives. Code is available at this https URL.

[CV-23] Xplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts

【速读】:该论文旨在解决现有算法在应对分布偏移(distribution shift)时性能不稳定、难以超越经验风险最小化(Empirical Risk Minimization, ERM)基准,以及无法有效缓解虚假相关性(spurious correlations)的问题。其解决方案的关键在于提出一种可解释的、基于解释的框架eXplaining to Learn (eX2L),通过在训练过程中显式地将混淆变量(confounding features)从分类器的潜在表示中解耦,从而提升模型在不同群体间的鲁棒性和泛化能力。具体而言,eX2L通过惩罚主标签分类器与同时训练的混淆因子分类器生成的Grad-CAM激活图之间的相似性,实现对冗余特征的抑制,最终在Spawrious Many-to-Many Hard Challenge基准上显著优于当前最先进方法,在平均准确率(AA)和最差组准确率(WGA)上分别提升5.49%和10.90%。

链接: https://arxiv.org/abs/2605.06368
作者: Paulo Mario P. Medina,Jose Marie Antonio Miñoza,Sebastian C. Ibañez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite extensive research into mitigating distribution shifts, many existing algorithms yield inconsistent performance, often failing to outperform baseline Empirical Risk Minimization (ERM) across diverse scenarios. Furthermore, high algorithmic complexity frequently limits interpretability and offers only an indirect means of addressing spurious correlations. We propose eXplaining to Learn (eX2L): an interpretable, explanation-based framework that decorrelates confounding features from a classifier’s latent representations during training. eX2L achieves this by penalizing the similarity between Grad-CAM activation maps generated by a primary label classifier and those from a concurrently trained confounder classifier. On the rigorous Spawrious Many-to-Many Hard Challenge benchmark, eX2L achieves an average accuracy (AA) of 82.24% +/- 3.87% and a worst-group accuracy (WGA) of 66.31% +/- 8.73%, outperforming the current state-of-the-art (SOTA) by 5.49% and 10.90%, respectively. Beyond its competitive performance, eX2L demonstrates that functional domain invariance can be achieved by explicitly decoupling label and nuisance attributes at the group level.
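The decorrelation penalty, similarity between the label classifier's and confounder classifier's Grad-CAM maps, can be sketched as a cosine similarity over flattened activation maps. An illustrative stand-in, assuming the two CAMs are already computed and spatially aligned; eX2L's exact normalization may differ and the function name is invented:

```python
import numpy as np

def cam_similarity_penalty(cam_label, cam_confounder, eps=1e-8):
    """Cosine similarity between two flattened Grad-CAM maps.

    Minimizing this as an auxiliary loss pushes the label classifier's
    salient regions away from the confounder classifier's regions.
    """
    a = cam_label.ravel().astype(float)
    b = cam_confounder.ravel().astype(float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

A value near 1 means the two classifiers attend to the same pixels (likely spurious features); near 0 means their evidence is spatially disentangled.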

[CV-24] Memory Efficient Full-gradient Attacks (MEFA) Framework for Adversarial Defense Evaluations

【速读】:该论文旨在解决迭代随机净化防御(iterative stochastic purification defenses)在白盒对抗攻击下的鲁棒性评估问题,尤其是现有方法因内存限制而依赖近似反向传播,导致攻击信号减弱并可能高估防御的鲁棒性。解决方案的关键在于提出一种基于梯度检查点(gradient checkpointing)的内存高效全梯度评估框架:通过牺牲额外的重新计算开销来显著降低内存占用,从而实现对扩散模型和Langevin采样等净化路径的精确端到端梯度计算;同时引入控制随机性的评估协议,减少不同净化轨迹带来的鲁棒性指标波动,确保评估结果的准确性和可复现性。

链接: https://arxiv.org/abs/2605.06357
作者: Yuan Du,Mitchel Hill,HanQin Cai
机构: University of Central Florida (中佛罗里达大学); InnoPeak Technology (英诺峰科技)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work studies the robust evaluation of iterative stochastic purification defenses under white-box adversarial attacks. Our key technical insight is that gradient checkpointing makes exact end-to-end gradient computation through long purification trajectories practical by trading additional recomputation for substantially lower memory usage. This enables full-gradient adaptive attacks against diffusion- and Langevin-based purification defenses, where prior evaluations often resort to approximate backpropagation due to memory constraints. These approximations can weaken the attack signal and risk overestimating robustness. In parallel, stochasticity in iterative purification is frequently under-controlled, even though different purification trajectories can substantially change reported robustness metrics. Building on this insight, we introduce a memory-efficient full-gradient evaluation framework for stochastic purification defenses. The framework combines checkpointed backpropagation with evaluation protocols that control stochastic variability, thereby reducing memory bottlenecks while preserving exact gradients. We evaluate diffusion-based purification and Langevin sampling with Energy-Based Models (EBMs), demonstrating that full-gradient attacks uncover vulnerabilities missed by approximate-gradient evaluations. Our framework yields stronger state-of-the-art \ell_\infty and \ell_2 white-box attacks and further supports probing out-of-distribution robustness. Overall, our results show that exact-gradient evaluation is essential for reliable benchmarking of iterative stochastic defenses.
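The memory/recompute trade-off at the heart of checkpointed backpropagation can be shown on a toy scalar chain: apply a function n times, store only every `stride`-th activation, and recompute the missing ones during the backward pass while still obtaining the exact gradient. A pedagogical sketch only; real implementations operate on tensors via e.g. `torch.utils.checkpoint`:

```python
def checkpointed_grad(f, df, x0, n, stride):
    """Exact gradient of f∘f∘...∘f (n times) at x0 with checkpointing.

    Stores only every `stride`-th activation in the forward pass
    (O(n/stride) memory) and recomputes the rest segment-by-segment
    during the backward pass (extra O(stride) recompute per step).
    """
    # Forward: keep only checkpoint activations.
    checkpoints = {0: x0}
    x = x0
    for i in range(1, n + 1):
        x = f(x)
        if i % stride == 0:
            checkpoints[i] = x
    # Backward: chain rule, recomputing x_{i-1} from its nearest checkpoint.
    grad = 1.0
    for i in range(n, 0, -1):
        start = (i - 1) // stride * stride   # nearest stored step <= i-1
        xi = checkpoints[start]
        for _ in range(start, i - 1):
            xi = f(xi)
        grad *= df(xi)
    return grad
```

The same idea, applied per purification step of a diffusion or Langevin trajectory, is what makes full-gradient white-box attacks feasible in memory.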

[CV-25] SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation NEURIPS2026

【速读】:该论文旨在解决高分辨率图像到视频(I2V)生成中效率与保真度之间的矛盾问题,尤其是在2K分辨率下,现有端到端模型因内存和延迟过高难以实用,而级联低分辨率生成再超分的方法则易产生细节幻觉且偏离输入图像的局部结构。解决方案的关键在于提出SwiftI2V框架,其核心创新为:首先生成低分辨率运动参考以降低token消耗并减轻建模负担,随后基于该运动信息进行强图像条件约束的2K视频合成,从而在控制计算开销的前提下恢复输入忠实的细节;同时引入条件分段生成(Conditional Segment-wise Generation, CSG)机制,实现分段式视频合成并限定每步token预算,结合段内双向上下文交互提升跨段一致性与输入保真度,最终在VBench-I2V 2K基准上达到与端到端方法相当的性能,但GPU时间减少202倍,并支持单张数据中心GPU(如H800)或消费级GPU(如RTX 4090)部署。

链接: https://arxiv.org/abs/2605.06356
作者: YaoYang Liu,Yuechen Zhang,Wenbo Li,Yufei Zhao,Rui Liu,Long Chen
机构: HKUST (香港科技大学); CUHK (香港中文大学); Joy Future Academy (未来科技学院); HKU (香港大学); HUAWEI Research (华为研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 17 figures. Submitted to NeurIPS 2026

点击查看摘要

Abstract:High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency–fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).

[CV-26] Earth-o1: A Grid-free Observation-native Atmospheric World Model

【速读】:该论文旨在解决传统大气动力学建模框架在处理多源异构地球观测数据时存在的结构性瓶颈问题,即受限于预定义空间网格的强制适配,导致原始传感器数据无法被充分挖掘,并引发严重的计算效率问题。解决方案的关键在于提出Earth-o1——一种原生面向观测的大气世界模型(observation-native atmospheric world model),其核心创新是直接从非规则分布的观测数据中学习地球系统的连续三维物理演化过程,无需依赖传统数值求解器或数据同化方法;通过将多种传感器输入整合为统一的无网格动力场,模型可自主实现时空状态的推进,从而支持实时预报与跨传感器推理,且在历史回溯测试中达到与欧洲中期天气预报中心(ECMWF)集成预报系统(Integrated Forecasting System, IFS)相当的表面预报精度,验证了此类连续、观测驱动的世界模型作为新一代全观测原生地球物理模拟器的可行性与高保真度。

链接: https://arxiv.org/abs/2605.06337
作者: Junchao Gong,Kaiyi Xu,Wangxu Wei,Siwei Tu,Jingyi Xu,Zili Liu,Hang Fan,Zhiwang Zhou,Tao Han,Yi Xiao,Xinyu Gu,Zhangrui Li,Wenlong Zhang,Hao Chen,Xiaokang Yang,Yaqiang Wang,Lijing Cheng,Pierre Gentine,Wanli Ouyang,Feng Zhang,Zhe-Min Tan,Bowen Zhou,Fenghua Ling,Ben Fei,Lei Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the unprecedented volume of multimodal data provided by modern Earth observation systems, our ability to model atmospheric dynamics remains constrained. Traditional modeling frameworks force heterogeneous measurements into predefined spatial grids, inherently limiting the full exploitation of raw sensor data and creating severe computational bottlenecks. Here we present Earth-o1, an observation-native atmospheric world model that overcomes these structural limitations. Rather than relying on conventional atmospheric dynamical modeling systems or traditional data assimilation, Earth-o1 directly learns the continuous, three-dimensional physical evolution of the Earth system from ungridded observational data. By integrating diverse sensor inputs into a unified, grid-free dynamical field, the model autonomously advances the atmospheric state in space and time. We show that this fundamentally distinct paradigm enables direct, real-time forecasting and cross-sensor inference without the overhead of explicit numerical solvers. In hindcast evaluations, Earth-o1 achieves surface forecast skill comparable to the operational Integrated Forecasting System (IFS). These results establish that continuous, observation-driven world models – a new class of fully observation-native geophysical simulators – can match the fidelity of established physical frameworks, providing a scalable data-driven foundation for a digital twin of the Earth.

[CV-27] nyBayes: Closed-Form Bayesian Inference via Jacobi Prior for Real-Time Image Classification on Edge Devices

【速读】:该论文旨在解决在资源受限环境下(如无网络连接的偏远农业地区)实现高效、准确的可可叶部病害自动检测问题,特别是针对西非小农户面临的可可肿枝病毒病(Cocoa Swollen Shoot Virus Disease, CSSVD)和炭疽病造成的严重产量损失。现有基于深度学习的边缘部署系统缺乏不确定性量化能力,而现有的贝叶斯方法则多聚焦于硬件级推理架构而非农业应用场景。解决方案的关键在于提出TinyBayes框架,首次将闭式贝叶斯分类器与轻量级计算机视觉流水线相结合:使用YOLOv8-Nano进行病斑定位(模型大小5.9 MB),MobileNetV3-Small提取特征(3.5 MB),并引入Jacobi先验——一种通过投影获得闭式非迭代估计器的贝叶斯方法,用于分类;其衍生的Jacobi-DMR(Distributed Multinomial Regression)分类器仅增加13.5 KB,使总模型体积控制在9.5 MB以内,同时在Amini Cocoa Contamination Challenge数据集上达到78.7%准确率,并支持单张图像CPU端推理时间低于150 ms,实现了精度、模型尺寸与推理速度的最佳权衡。

链接: https://arxiv.org/abs/2605.06333
作者: Shouvik Sardar,Sourish Das
机构: Chennai Mathematical Institute ( Chennai Mathematical Institute)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注: 14 Pages, 1 Figure, 4 Tables

点击查看摘要

Abstract:Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource-constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge-deployable plant disease systems rely on end-to-end deep learning without uncertainty quantification, while Bayesian methods for edge devices focus on hardware-level inference architectures rather than agricultural applications. We bridge this gap with TinyBayes, the first framework to combine a closed-form Bayesian classifier with a mobile-grade computer vision pipeline for crop disease detection. Our pipeline uses YOLOv8-Nano (5.9 MB) for lesion localisation, MobileNetV3-Small (3.5 MB) for feature extraction, and the Jacobi prior, a Bayesian method that provides closed-form, non-iterative estimators via projection, for classification. The Jacobi-DMR (Distributed Multinomial Regression) classifier adds only 13.5 KB to the pipeline, bringing the total model size within 9.5 MB, while achieving 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset and enabling end-to-end CPU inference under 150 ms per image. We benchmark against seven classifiers including Random Forest, SVM, Ridge, Lasso, Elastic Net, XGBoost, and Jacobi-GP, and demonstrate that Jacobi-DMR offers the best trade-off between accuracy, model size, and inference speed for edge deployment. We prove the asymptotic equivalence, consistency, asymptotic normality, and bias correction of Jacobi-DMR. All data and codes are available here: this https URL

[CV-28] NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

【速读】:该论文旨在解决视觉语言导航(Vision-Language Navigation, VLN)中传统基于自下而上、逐步推进的导航方法存在的误差累积问题,以及现有基于环境地图的方法因依赖增量式记忆图更新或离散路径评分而导致连续空间推理受限和离散瓶颈的问题。其解决方案的关键在于提出一种自顶向下的导航范式(Top-Down VLN, TD-VLN),将导航任务重构为在预构建的俯视地图上的全局路径规划问题,并设计了NavOne统一框架:该框架通过一个俯视地图融合模块(Top-Down Map Fuser)实现多模态地图联合表征,并引入空间感知的深度混合机制(Attention Residuals for spatial-aware depth mixing),从而在单次前向传播中直接预测密集路径概率分布,显著提升导航效率与精度。

链接: https://arxiv.org/abs/2605.06317
作者: Dijia Zhan,Jinyi Li,Chenxi Zheng,Shaoyu Huang,Yong Li,Jie Tang,Xuemiao Xu
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.

[CV-29] Render Don't Decode: Weight-Space World Models with Latent Structural Disentanglement

【速读】:该论文旨在解决当前世界模型(World Model)在训练过程中依赖原始像素编码至不透明的潜在空间并使用复杂解码器进行重建所带来的计算成本高、可解释性差的问题。其解决方案的关键在于引入NOVA框架,该框架将系统状态表示为一个辅助坐标基隐式神经表示(Implicit Neural Representation, INR)的权重和偏置,从而实现解析式渲染,消除解码瓶颈,同时具备紧凑性、可移植性和零样本超分辨率能力;此外,NOVA通过动作匹配目标可蒸馏为上下文相关的视频生成器,并在无需辅助损失或对抗性目标的情况下自动解耦场景结构成分(如背景、前景与帧间运动),支持对内容或动态的独立编辑而不影响另一部分。

链接: https://arxiv.org/abs/2605.06298
作者: Roussel Desmond Nzoyem,Mauro Comi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 35 pages, 30 figures, 8 tables

点击查看摘要

Abstract:Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at \sim 40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.

[CV-30] Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

【速读】:该论文旨在解决现有图像动画生成方法中因依赖拉格朗日运动引导(Lagrangian motion guidance)而导致的训练效率低、动态伪影多及时间一致性差的问题。其核心解决方案在于改用相邻帧欧拉运动场(Eulerian motion fields)作为局部监督信号,该设计使运动信息始终描述短时程的局部位移,从而支持并行化训练并提供有界误差的监督;同时引入双向几何一致性机制(Bidirectional Geometric Consistency),通过前向-后向循环校验数学识别并掩蔽遮挡区域,有效抑制了由于相邻帧生成导致的漂移伪影,显著提升了生成结果的时间连贯性和动态质量。

链接: https://arxiv.org/abs/2605.06280
作者: Thong Nguyen,Khoi M. Le,Cong-Duy Nguyen,Luu Anh Tuan,See-Kiong Ng,Chunyan Miao
机构: National University of Singapore(新加坡国立大学); Nanyang Technological University(南洋理工大学); Centre for AI Research, VinUniversity(越南VinUniversity人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress

点击查看摘要

Abstract:Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. This paper revisits the same optical-flow primitive through a more local supervision design: we use adjacent-frame Eulerian motion fields to guide generation, where the motion signal always describes a short temporal hop. This shift enables parallelized training and provides bounded-error supervision throughout the generation process. To mitigate the drift artifacts common in adjacent frame generation, we introduce a Bidirectional Geometric Consistency mechanism, which computes a forward-backward cycle check to mathematically identify and mask occluded regions, preventing the model from learning incorrect warping objectives. Extensive experiments demonstrate that our approach accelerates training, preserves temporal coherence, and reduces dynamic artifacts compared to reference-based baselines.
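摘要中的前向-后向循环校验(forward-backward cycle check)可以用如下极简的 NumPy 草图来理解:像素沿前向光流位移后,再沿后向光流返回,若残差接近零则视为可见,残差较大处标记为遮挡并掩蔽。注意:这里的阈值与取整方式均为示意性假设,论文的具体实现可能不同。

```python
import numpy as np

def occlusion_mask(flow_fw, flow_bw, thresh=1.0):
    """Forward-backward cycle check (illustrative sketch; the paper's exact
    threshold and normalisation are not given in the abstract).

    flow_fw, flow_bw: (H, W, 2) adjacent-frame optical flows, in pixels.
    A pixel is consistent if warping forward then backward returns it
    (approximately) to where it started; large residuals mark occlusions.
    """
    H, W = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    # forward target coordinates, rounded and clamped to the image bounds
    tx = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, W - 1)
    ty = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, H - 1)
    cycle = flow_fw + flow_bw[ty, tx]      # should be ~0 where visible
    residual = np.linalg.norm(cycle, axis=-1)
    return residual > thresh               # True = occluded / masked out

# Consistent translation: forward +2px, backward -2px -> nothing is masked.
fw = np.zeros((4, 6, 2)); fw[..., 0] = 2.0
bw = np.zeros((4, 6, 2)); bw[..., 0] = -2.0
print(occlusion_mask(fw, bw).any())        # False
# Break the cycle in one region -> that region is masked out of the loss.
bw[0:2, 0:3, 0] = 3.0
print(occlusion_mask(fw, bw).sum() > 0)    # True
```

被掩蔽的区域不参与变形目标的学习,从而避免模型在遮挡处学到错误的 warping 监督。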

[CV-31] When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy

【速读】:该论文旨在解决标准交叉熵(Cross-Entropy)损失在分类任务中忽略类别层次结构语义距离的问题,即其对所有误分类一视同仁,未能利用已知的类间层级关系来提升模型性能。解决方案的关键在于提出层次感知交叉熵(Hierarchy-Aware Cross-Entropy, HACE),其核心由两个组件构成:一是预测聚合(prediction aggregation),通过将模型概率质量沿类别层次向上传播,使父节点累积子节点的置信度;二是祖先标签平滑(ancestral label smoothing),将真实标签信号沿从真类到根节点的路径进行分布,从而更合理地引导学习过程。实验表明,HACE在多个数据集和架构上均显著优于标准交叉熵。

链接: https://arxiv.org/abs/2605.06274
作者: April Chan,Davide D’Ascenzo,Sebastiano Cultrera di Montesano
机构: Massachusetts Institute of Technology (麻省理工学院); University of Milan (米兰大学); Politecnico di Torino (都灵理工大学); Broad Institute of MIT and Harvard (MIT和哈佛大学布罗德研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Standard cross-entropy is the default classification loss across virtually all of machine learning, yet it treats all misclassifications equally, ignoring the semantic distances that a class hierarchy encodes. We propose Hierarchy-Aware Cross-Entropy (HACE), a drop-in replacement for standard cross-entropy that incorporates a known class hierarchy directly into the loss. HACE combines two components: prediction aggregation, which propagates the model’s probability mass upward through the class hierarchy to ensure that parent nodes accumulate the confidence of their children; and ancestral label smoothing, which distributes the ground-truth signal along the path from the true class to the root. We evaluate HACE on CIFAR-100, FGVC Aircraft, and NABirds in two regimes: end-to-end training across six architectures spanning convolutional and attention-based designs, and linear probing on frozen DINOv2-Large features. In end-to-end training, HACE improves accuracy over standard cross-entropy in 15 out of 18 architecture–dataset pairs, with a mean gain of 4.66%. In linear probing on frozen DINOv2-Large features, HACE outperforms all competing methods on all three datasets, with a mean improvement of 2.18% over the next best baseline.
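HACE 的两个组件(预测聚合与祖先标签平滑)可以在一个两层玩具层级上用如下 Python 草图演示;其中混合系数 alpha 及具体加权形式是示意性假设,并非论文给出的确切公式:

```python
import math

# Toy two-level hierarchy: leaves 0,1 share parent "A"; leaves 2,3 share "B".
PARENT = {0: "A", 1: "A", 2: "B", 3: "B"}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def hace_loss(logits, true_leaf, alpha=0.3):
    """Hierarchy-aware cross-entropy sketch (alpha and the exact weighting
    are illustrative assumptions).

    Prediction aggregation: a parent's confidence is the sum of its
    children's probabilities. Ancestral label smoothing: the target puts
    (1 - alpha) on the true leaf and alpha on its ancestor."""
    p = softmax(logits)
    parent_p = {}
    for leaf, par in PARENT.items():
        parent_p[par] = parent_p.get(par, 0.0) + p[leaf]
    return (-(1 - alpha) * math.log(p[true_leaf])
            - alpha * math.log(parent_p[PARENT[true_leaf]]))

# A model that confuses the true leaf with its sibling is penalised less
# than one that puts the same mass under the wrong parent.
sibling_err = hace_loss([1.0, 3.0, 0.0, 0.0], true_leaf=0)  # mass on sibling
cross_err   = hace_loss([1.0, 0.0, 3.0, 0.0], true_leaf=0)  # mass across tree
print(sibling_err < cross_err)  # True
```

这正是摘要所述思想的最小体现:层级使"语义相近的误分类"比"语义相远的误分类"代价更低。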

[CV-32] On-Orbit Real-Time Wildfire Detection Under On-Board Constraints

【速读】:该论文旨在解决卫星遥感中基于热红外(MWIR)影像的在轨野火检测问题,尤其针对子像素或单像素热异常、极端类别不平衡以及严苛的计算资源约束(如模型体积<1 MB、推理延迟<150 ms/批)等挑战。现有主流方法(如MODIS和VIIRS的上下文热阈值法)难以应对此类复杂场景。解决方案的关键在于提出一种轻量级密集表示学习框架,采用预训练的密集掩码自编码器(DenseMAE)及其与指数移动平均(EMA)蒸馏结合的变体(DenseMAE+EMA),在未校准的多星MWIR数据上进行无监督表征学习,并通过线性探测和像素级平均精度(AP)评估其性能。实验表明,该方法可在极低延迟(65.34 ms/批)和超小模型尺寸(0.52 MB)下实现高精度火点识别(测试AP=0.640,事件级Fire-F1=0.69),优于监督基线模型,在满足实时性与部署限制的前提下显著提升了野火检测能力。

链接: https://arxiv.org/abs/2605.06273
作者: Matthias Rötzer,Veronika Pörtge,Martin Ickerott,Jayendra Praveen Kumar Chorapalli,Dimitri Scheftelowitsch,Max Bereczky,Dmitry Rashkovetsky,Sai Manoj Appalla,Julia Gottfriedsen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:We present a deployed system for on-orbit wildfire detection aboard a nine-satellite commercial thermal infrared constellation, operating under demanding joint constraints: sub-megabyte model footprint, sub-150 ms per-batch TensorRT FP16 inference on an NVIDIA Jetson Xavier NX, and an end-to-end alert pipeline targeting under 10 minutes from satellite overpass to fire event communication. The system operates on uncalibrated mid-wave infrared (MWIR) single-band imagery at 200 m ground sampling distance, where fires frequently appear as sub-pixel or single-pixel thermal anomalies under extreme class imbalance – challenges not addressed by the contextual thermal-thresholding pipelines (MODIS, VIIRS) that currently dominate operational fire monitoring. We present an empirical study of lightweight dense representation learning for this regime using a proprietary nine-satellite MWIR dataset. We compare dense masked autoencoding (DenseMAE) and a hybrid DenseMAE+EMA (exponential moving average) distillation variant, and evaluate representations via linear probing and full-distribution pixel-level average precision (AP) under extreme class imbalance. DenseMAE pretraining enables compact downstream models on the latency-accuracy Pareto frontier: our fastest SSL-pretrained model achieves 0.640 test AP and 0.69 event-level Fire-F1 with 65.34 ms latency per batch and a 0.52 MB engine, without pruning or compression. The best configuration reaches 0.699 AP and 0.744 Fire-F1 below 1 MB, outperforming a supervised baseline (0.650 AP) under comparable constraints.

[CV-33] Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

【速读】:该论文旨在解决基于视觉 Transformer 的前馈式 3D 重建模型在处理视频输入(数百至数千帧)时因全局注意力层的二次计算复杂度而导致的效率瓶颈问题。现有基于 token 合并的方法通过压缩 token 序列加速模型,但其对查询 token(query tokens)与键值 token(key-value tokens)采用统一压缩策略,忽略了二者在 3D 重建任务中功能上的差异:查询 token 编码视图特异性几何请求且对压缩敏感,而键值 token 表征共享场景上下文并能承受激进压缩。解决方案的关键在于提出 Spark3R——一个无需重新训练的加速框架,通过解耦两类 token 的压缩策略:对查询 token 采用组内合并(intra-group token merging),对键值 token 采用轻量级剪枝(lightweight token pruning),并动态调整键值 token 在不同层的压缩因子,从而显著提升计算效率与重建质量之间的权衡。

链接: https://arxiv.org/abs/2605.06270
作者: Zecheng Tang,Jiaye Fu,Qiankun Gao,Haijie Li,Yanmin Wu,Jiaqi Zhang,Siwei Ma,Jian Zhang
机构: Peking University, Shenzhen (北京大学深圳校区); Peking University, Beijing (北京大学北京校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, \pi^3 , and Depth-Anything-3, and achieves up to 28\times speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
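摘要中"对查询与键值施加不同压缩因子"带来的计算收益,可用如下对注意力乘加次数的粗略估算示意(其中 2x 与 8x 压缩因子为假设值,论文中的键值压缩因子是逐层自适应的):

```python
def attention_cost(n_q, n_kv, d):
    """Multiply-adds for QK^T plus AV (rough sketch; projections ignored)."""
    return 2 * n_q * n_kv * d

n, d = 1000 * 196, 64   # e.g. 1,000 frames x 196 patch tokens (illustrative)
full = attention_cost(n, n, d)
# Asymmetric reduction with hypothetical factors: mild query merging (2x),
# aggressive key-value pruning (8x).
asym = attention_cost(n // 2, n // 8, d)
print(full // asym)      # 16: attention multiply-adds drop ~16x
```

可见键值端承受更激进的压缩时,二次项的收益与两个因子相乘成正比,这与论文"查询敏感、键值耐压缩"的观察相一致。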

[CV-34] ZScribbleSeg: A comprehensive segmentation framework with modeling of efficient annotation and maximization of scribble supervision

【速读】:该论文旨在解决医学图像分割中全标注数据集构建成本高、耗时长的问题,提出了一种基于高效涂鸦标注(scribble annotations)的弱监督分割方法。其解决方案的关键在于:首先通过最大化监督信号和随机性模拟来设计高效的涂鸦形式;其次引入正则化项以编码空间关系与形状约束,并利用期望最大化(EM)算法估计各类别标签的混合比例,从而精准识别未标注像素并修正错误预测;最终将这种高效涂鸦监督机制与先验信息融合,构建出名为ZScribbleSeg的框架,在多个医学图像分割任务中仅依赖涂鸦标注即实现了具有竞争力的性能表现。

链接: https://arxiv.org/abs/2605.06266
作者: Ke Zhang,Bomin Wang,Hangqi Zhou,Xiahai Zhuang
机构: Fudan University (复旦大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Medical Image Analysis

点击查看摘要

Abstract:Curating fully annotated datasets for medical image segmentation is labour-intensive and expertise-demanding. To alleviate this problem, prior studies have explored scribble annotations for weakly supervised segmentation. Existing solutions mainly compute losses on annotated areas and generate pseudo labels by propagating annotations to adjacent regions. However, these methods often suffer from inaccurate and unrealistic segmentations due to insufficient supervision and incomplete shape information. In contrast, we first investigate the principle of good scribble annotations, which leads to efficient scribble forms via supervision maximization and randomness simulation. We further introduce regularization terms to encode the spatial relationship and the shape constraints, where the EM algorithm is utilized to estimate the mixture ratios of label classes. These ratios are critical in identifying the unlabeled pixels for each class and correcting erroneous predictions, thus the accurate estimation lays the foundation for the incorporation of spatial prior. Finally, we integrate the efficient scribble supervision with the prior into a framework, referred to as ZScribbleSeg, and apply it to multiple scenarios. Leveraging only scribble annotations, ZScribbleSeg achieves competitive performance on six segmentation tasks including ACDC, MSCMRseg, BTCV, MyoPS, Decathlon-BrainTumor and Decathlon-Prostate. Our code will be released via this https URL.

[CV-35] Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search

【速读】:该论文旨在解决密集人群场景下视频语义搜索(video semantic search)中因视觉编码器倾向于关注显著前景区域而忽略具有语义重要性的背景区域所导致的检索性能下降问题。其解决方案的关键在于提出一种逆向注意力嵌入(Inverse Attention Embedding)机制,该机制显式地捕捉并强化被传统模型忽略的背景区域特征;通过将逆向注意力嵌入与常规视觉嵌入相结合,可在不增加额外训练成本的前提下显著提升视频语义检索的召回率(recall)。

链接: https://arxiv.org/abs/2605.06229
作者: Faisal Aljehrai,Mohammed A. Alkhrashi,Alreem Almuhrij,Sarah Abuhimed,Noorh Aldossary,Abdullah Aldwyish,Raied Aljadaany,Huda Alamri,Muhammad Kamran J Khan
机构: Saudi Data And Artificial Intelligence Authority (SDAIA)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video semantic search in densely crowded scenes remains a challenging task due to visual encoders' tendency to prioritize salient foreground regions while neglecting contextually important background areas. We propose an Inverse Attention Embedding mechanism that explicitly captures and highlights these overlooked regions. By combining inverse attention embeddings with traditional visual embeddings, our method significantly enhances semantic retrieval performance without additional training. Initial experiments and ablation studies demonstrate promising improvements over existing approaches in recall for video semantic search in crowded environments.
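逆向注意力嵌入的核心思想,即用 (1 - 注意力) 归一化后池化被忽略的背景特征,再与常规注意力池化结果组合,可用如下草图示意(摘要只说明两种嵌入被"结合",此处的拼接方式是一种假设的组合方案):

```python
import numpy as np

def dual_encode(features, attn):
    """Sketch of saliency + inverse-attention dual encoding.

    features: (N, D) patch features; attn: (N,) non-negative saliency weights.
    Concatenation as the combination rule is an assumption for illustration.
    """
    attn = attn / attn.sum()
    inv = 1.0 - attn
    inv = inv / inv.sum()          # inverse attention: focus on background
    fwd_emb = attn @ features      # salient-foreground embedding
    inv_emb = inv @ features       # overlooked-background embedding
    return np.concatenate([fwd_emb, inv_emb])

feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
a = np.array([0.9, 0.05, 0.05])   # the encoder fixates on patch 0
emb = dual_encode(feats, a)
# First half ~ patch 0 (foreground), second half ~ patches 1-2 (background).
print(emb.round(2))
```

由于该机制只作用于已有特征的加权池化,无需重新训练视觉编码器,这与摘要中"无额外训练"的说法一致。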

[CV-36] Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance CVPR2026

【速读】:该论文旨在解决同时高效获取物体形状(shape)与反射特性(reflectance)的难题,尤其在使用单一相机和结构光系统时如何优化4D光照条件的自适应计算。其核心解决方案在于提出一种可微分框架,通过像素级概率模型联合建模深度与反射率,并以降低深度不确定性为目标函数,动态调整后续光照条件;同时利用物理测量与仿真结果之间的差异进行微调,实现高保真度的深度图与反射参数图重建。关键创新点在于将光照策略的优化过程嵌入可微分链中,从而实现对复杂场景下几何与材质信息的高效协同估计。

链接: https://arxiv.org/abs/2605.06214
作者: Huakeng Ding,Yaowen Chen,Kun Zhou,Hongzhi Wu
机构: State Key Lab of CADCG, Zhejiang University (CADCG国家重点实验室,浙江大学); Hangzhou Research Institute of Holographic and AI Technology (杭州全息与人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. 10 pages, 13 figures

点击查看摘要

Abstract:We present a differentiable framework to adaptively compute 4D illumination conditions with respect to an object, for efficient, high-quality simultaneous acquisition of its shape and reflectance, with a unified spatial-angular structured light and a single camera. Using a simple histogram-based pixel-level probability model for depth and reflectance, we differentiably link the next illumination condition(s) with a loss that encourages the reduction in depth uncertainty. As new structured illumination is cast, corresponding image measurements are used to update the uncertainty at each pixel. Finally, a fine-tuning-based approach reconstructs the depth map and reflectance parameter maps, by minimizing the differences between all physical measurements and their simulated counterparts. The effectiveness of our framework is demonstrated on physical objects with wide variations in shape and appearance. Our depth results compare favorably with state-of-the-art techniques, while our reflectance results are comparable when validated against photographs.

[CV-37] Playing the network backward: A Game Theoretic Attribution Framework

【速读】:该论文旨在解决后向归因方法(如梯度、LRP和特定于Transformer的规则)缺乏统一框架以比较其底层计算机制的问题。解决方案的关键在于将后向归因重新建模为扩展网络图上的两人博弈,借鉴Gaubert与Vlassopoulos提出的ReLU Net Game理论;在此框架下,梯度和完整的alpha-beta-LRP家族被证明是特定均衡状态下博弈轨迹的积分,从而将归因图谱视为轨迹分布的投影而非核心对象。通过引入博弈论概念(如策略正则化、风险规避和扩展动作集),可直接转化为对现有归因规则的改进,例如在ViT-B/16上的一种alpha-beta-LRP改进版本在所有局部化指标上均优于先前的Transformer专用方法。

链接: https://arxiv.org/abs/2605.06212
作者: Jakob Paul Zimmermann,Jim Berend,Georg Loho,Sebastian Lapuschkin,Wojciech Samek
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Attribution methods explain which input features drive a model’s prediction, making them central to model debugging and mechanistic interpretability. Yet backward attribution methods, including gradients, LRP, and transformer-specific rules, lack a shared framework in which to compare the underlying backward calculations. We introduce such a framework by recasting backward attribution as a two-player game on an extended network graph, building on Gaubert and Vlassopoulos’ ReLU Net Game. Gradients and the full alpha-beta-LRP family arise as integrals over game trajectories under specific equilibria, so attribution maps become projections of trajectory distributions rather than the primary object. Desired explanation properties, such as localisation focus, robustness to input noise, or stable attention routing, can be specified as game-theoretic concepts, including policy regularization, risk aversion, and extended action sets, and translate directly into novel adaptations of the well-known backward rules. On ViT-B/16, one such selected adaptation of alpha-beta-LRP outperforms prior transformer-specific backward methods across all considered localisation metrics.

[CV-38] Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation

【速读】:该论文旨在解决离散视觉标记化器(discrete visual tokenizer)中因固定码本大小(constant codebook size)设计导致的信息瓶颈问题,即在序列中后期位置的条件熵迅速下降,使得大部分位置退化为记忆任务而非语义表示学习。其核心发现是“熵崖”(Entropy Cliff)现象:当码本容量 $ K $ 固定时,仅需少数位置即可使剩余位置的分布趋于确定性,从而限制了模型对图像结构的有效建模能力。解决方案的关键在于提出可变码本大小量化(Variable Codebook Size Quantization, VCQ),其中码本大小 $ K_t $ 沿序列单调增长,从最小值 $ K_\min=2 $ 逐步增至 $ K_\max $,无需改变损失函数、参数量或自回归训练流程。该机制通过在早期引入极小码本(形成粗粒度语义层次)与后期增大码本(提升细节重建能力),实现了更高效的信息组织和表达,显著提升了生成质量(如gFID指标)并揭示了容量分布对性能的关键影响。

链接: https://arxiv.org/abs/2605.06207
作者: Bowen Zheng,Weijian Luo,Guang Yang,Colin Zhang,Tianyang Hu
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); hi-Lab, Xiaohongshu Inc (小红书实验室,小红书公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size K to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with K=16384 , this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: t^* = \lceil \log_2 N / \log_2 K \rceil . Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size K_t grows monotonically along the sequence from K_\min=2 to K_\max , leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet 256\times256 over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at K_\min=2 naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
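摘要中的熵崖位置公式 t* = ceil(log2 N / log2 K) 可以直接验算。以下以 ImageNet-1k 的标准训练集规模 N = 1,281,167(假设与论文所用 N 一致)复现文中 "2 / 256" 的结论:

```python
import math

def entropy_cliff_position(N, K):
    """t* = ceil(log2 N / log2 K): after t* tokens, a codebook of size K can
    already uniquely index all N training images, so the conditional
    distribution at later positions becomes essentially deterministic."""
    return math.ceil(math.log2(N) / math.log2(K))

N = 1_281_167  # standard ImageNet-1k training-set size (assumed to match the paper's N)
print(entropy_cliff_position(N, 16384))  # 2: matches the paper's "2 out of 256"
# A tiny starting codebook pushes the cliff far out along the sequence,
# which is the intuition behind growing K_t from K_min = 2:
print(entropy_cliff_position(N, 2))      # 21 binary tokens just to index the data
```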

[CV-39] Bridging visual saliency and large language models for explainable deep learning in medical imaging

【速读】:该论文旨在解决深度学习模型在医学影像领域临床应用中因缺乏可解释性而导致的采纳障碍问题,特别是针对脑肿瘤分类任务。解决方案的关键在于构建一个融合视觉、解剖和语言模态的多模态可解释性框架:首先通过双输出混合架构的卷积神经网络(CNN)同时优化分类与分割任务,提升特征的空间表达能力;其次利用Grad-CAM++等显著性图生成类判别热图,并通过自适应百分位阈值法转化为二值肿瘤掩膜;最后将掩膜映射至Harvard-Oxford皮层图谱以定位神经解剖结构,并将其编码为结构化JSON格式,作为条件输入驱动大语言模型(LLM)生成具有放射学风格的诊断报告。此统一流程实现了从像素级证据到临床可读叙事的转化,增强了人工智能辅助脑肿瘤诊断的透明度与可信度。

链接: https://arxiv.org/abs/2605.06197
作者: Paul Valery Nguezet,Elie Tagne Fute,Yusuf Brima,Benoit Martin Azanguezet,Marcellin Atemkeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The opaque nature of deep learning models remains a significant barrier to their clinical adoption in medical imaging. This paper presents a multimodal explainability framework that bridges the gap between convolutional neural network (CNN) predictions and clinically actionable insights for brain tumor classification, leveraging large language models (LLMs) to deliver human-interpretable diagnostic narratives. The proposed framework operates through three coupled stages. First, nine CNN architectures are extended with a dual-output hybrid formulation that simultaneously optimises a classification head and a segmentation head, enabling spatially richer feature learning. Second, visual saliency attribution methods, namely Grad-CAM, Grad-CAM++, and ScoreCAM, are applied to generate class-discriminative heatmaps, which are subsequently refined into binary tumor masks via an adaptive percentile thresholding pipeline. Third, the resulting masks are mapped onto the Harvard-Oxford cortical atlas to translate pixel-level evidence into named neuroanatomical structures, and the extracted findings are encoded into a structured JSON file that conditions three LLMs (Grok3, Mistral, and LLaMA) to generate coherent, radiological-style diagnostic reports. Evaluated on a dataset of 4,834 contrast-enhanced T1-weighted brain MRI images spanning three tumor classes, InceptionResNetV2 achieved the highest classification performance and Grad-CAM++ yielded the best segmentation overlap. Among the language models, Grok3 led in lexical diversity and coherence, while LLaMA achieved the highest readability score. By integrating visual, anatomical, and linguistic modalities into a unified pipeline, the framework produces explanations that are technically grounded and meaningfully interpretable, advancing the transparency and clinical accountability of artificial intelligence assisted brain tumor diagnosis.
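摘要第二阶段将显著性热图经自适应百分位阈值转为二值肿瘤掩膜,其流程可用如下草图示意(具体百分位取值为假设,摘要未给出):

```python
import numpy as np

def percentile_mask(heatmap, pct=90):
    """Adaptive percentile thresholding (sketch): keep only the top
    (100 - pct)% most salient pixels as the binary tumor mask.
    The actual percentile used by the paper is not stated in the abstract."""
    t = np.percentile(heatmap, pct)
    return heatmap >= t

# On a random heatmap, roughly 10% of pixels survive a 90th-percentile cut.
h = np.random.default_rng(0).random((64, 64))
m = percentile_mask(h, pct=90)
print(m.mean())  # roughly 0.1
```

这种按分布自适应取阈的方式,避免了对不同显著性图手工调节固定阈值;得到的掩膜随后才被映射到 Harvard-Oxford 图谱做解剖定位。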

[CV-40] EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

【速读】:该论文旨在解决当前机器人世界模型中视频生成与动作控制之间闭环耦合不足的问题,特别是现有方法将视频生成作为策略学习的辅助表示,未能充分利用动作信号引导视频合成,导致生成轨迹难以保持精确的机器人空间几何结构和细粒度的机器人-物体交互动力学。其解决方案的关键在于提出事件感知的生成式世界模型(Event-Aware Generative World Model, EA-WM),通过将关节或末端执行器动作直接投影到目标相机视角下形成结构化的运动学-视觉动作场(Structured Kinematic-to-Visual Action Fields),并引入事件感知的双向融合模块以调节跨分支注意力机制,从而实现基于几何约束的动作驱动视频合成,显著提升了生成轨迹的物理一致性与交互精度。

链接: https://arxiv.org/abs/2605.06192
作者: Zhaoyang Yang,Yurun Jin,Lizhe Qi,Cong Huang,Kai Chen
机构: Fudan University (复旦大学); Zhongguancun Academy (中关村学院); Zhongguancun Institute of Artificial Intelligence (中关村人工智能研究院); University of Science and Technology of China (中国科学技术大学); DeepCybo (DeepCybo)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Preprint. 22 pages, 10 figures

点击查看摘要

Abstract:Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics in the generated rollouts. To bridge this gap, we present EA-WM, an Event-Aware Generative World Model that effectively closes the loop between kinematic control and visual perception. Rather than injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera view as Structured Kinematic-to-Visual Action Fields. To fully exploit this geometrically grounded representation, we introduce event-aware bidirectional fusion blocks that modulate cross-branch attention, capturing object state changes and interaction dynamics. Evaluated on the comprehensive WorldArena benchmark, EA-WM achieves state-of-the-art performance, outperforming existing baselines by a significant margin.

[CV-41] Event-Causal RAG : A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

【速读】:该论文旨在解决当前视觉语言模型在超长视频(甚至无限时长视频)推理任务中的局限性,即模型难以在长时间跨度内保持连贯记忆并推断时间上相隔较远事件之间的因果关系。现有端到端视频理解方法受限于自注意力机制的O(n²)复杂度,而检索增强生成(RAG)方法则存在片段级记忆碎片化、时空与因果结构建模弱以及存储和在线推理成本高等问题。其解决方案的关键在于提出Event-Causal RAG框架,通过将流式视频分割为语义一致的事件,并以结构化的状态-事件-状态(State-Event-State, SES)图表示每个事件及其前后状态转移;这些图被整合成全局事件知识图谱,并存储于支持语义匹配与因果拓扑检索的双层内存中;在此基础上设计双向检索策略,高效识别最相关的事件因果链,结合视频证据输入基础视频大模型进行答案生成,从而显著提升跨长时间跨度的多事件整合与因果推理能力,同时实现更高的内存效率和流式处理鲁棒性。

链接: https://arxiv.org/abs/2605.06185
作者: Peizheng Yan,Yu Zhao,Liang Xie,Juntong Qi,Mingming Wang,Erwei Yin
机构: Tianjin Key Lab of Intelligent Unmanned Swarm Tech System; Tianjin University; Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen); Tianjin Artificial Intelligence Innovation Center; Defense Innovation Institute, Academy of Military Sciences; School of Future Technology, Shanghai University
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the O(n^2) complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.

[CV-42] SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision

【速读】:该论文旨在解决当前基于ARKit blendshape系数的面部表情估计中因依赖伪标签(pseudo labels)而导致的表达 fidelity 低的问题。现有方法通常使用如Live Link Face等捕捉软件提供的伪标签进行监督训练,但这些伪标签常包含噪声激活、系数幅度偏差以及面部动作缺失或不准确等问题,导致模型学习到的是不完美的伪标签而非真实的感知表达一致性。解决方案的关键在于提出SuperFace框架,通过引入人类偏好反馈(preference feedback)来优化系数预测:不再将软件估算的系数视为固定真值,而是将其作为初始值,随后利用人类对渲染面部表情的主观判断进行迭代优化,从而实现更符合人类感知的面部动画表达。这种偏好驱动的优化机制使模型从模仿伪标签转向与人类感知对齐,显著提升了表情的真实感和表现力。

链接: https://arxiv.org/abs/2605.06179
作者: Zejian Kang,Xuanyang Xu,Wentao Yang,Kai Zheng,Yuanchen Fei,Hongyuan Zou,Hui Shan,Shuo Yang,Xiangru Huang
机构: Zhejiang University (浙江大学); Westlake University (西湖大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Hunan University (湖南大学); Shanghai Innovation Institute (上海创新研究院); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate facial estimation is crucial for realistic digital human animation, and ARKit blendshape coefficients offer an interpretable representation by mapping facial motions to semantic animation controls. However, learning high-quality ARKit coefficient prediction remains limited by the absence of reliable ground-truth supervision. Existing methods typically rely on capture software such as Live Link Face to provide pseudo labels, which may contain noisy activations, biased coefficient magnitudes, and missing or inaccurate facial actions. Consequently, models trained with supervised learning tend to reproduce imperfect pseudo labels rather than optimize for perceptual expression fidelity. In this paper, we propose SuperFace, a preference-driven framework that moves ARKit facial expression estimation from pseudo-label imitation toward human-aligned perceptual optimization. Instead of treating software-estimated coefficients as fixed ground truth, SuperFace uses them only as an initialization and further improves coefficient prediction through human preference feedback on rendered facial expressions. By aligning the model with perceptual judgments rather than numerical pseudo labels, SuperFace enables more visually faithful and expressive facial animation. Experiments show that SuperFace improves expression fidelity over Live Link Face supervision, demonstrating the effectiveness of preference-driven optimization for semantic facial action prediction.

[CV-43] Retina-RAG : Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation MICCAI2026

【速读】:该论文旨在解决当前糖尿病视网膜病变(Diabetic Retinopathy, DR)自动化筛查系统普遍存在的局限性,即仅能进行图像级别的分类而缺乏结构化的临床报告生成能力,导致诊断结果难以直接应用于临床实践。其解决方案的关键在于提出一种低成本、模块化的框架Retina-RAG,该框架通过解耦高性能视网膜分类器与参数高效的视觉-语言模型(Qwen2.5-VL-7B-Instruct)实现灵活集成,并引入检索增强生成(Retrieval-Augmented Generation, RAG)模块,在推理阶段注入专业眼科知识和结构化分类输出,从而提升诊断一致性并减少幻觉现象。该设计使得系统在DR分级(F1-score=0.731)和黄斑水肿(Macular Edema, ME)检测(F1-score=0.948)上显著优于基线方法,同时具备生成高质量结构化报告的能力(ROUGE-L=0.429,SBERT相似度=0.884),且可在单张消费级GPU上运行,验证了临床可用的轻量化AI方案可行性。

链接: https://arxiv.org/abs/2605.06173
作者: Abdelrahman Zaian,Sheethal Bhat,Mohamed Abdalkader,Andreas Maier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures. Submitted to MICCAI 2026

点击查看摘要

Abstract:Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.429 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.

[CV-44] DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models

【速读】:该论文旨在解决现有文本到图像(Text-to-Image, T2I)生成模型评估基准依赖固定提示集(prompt set)所导致的过拟合与基准污染问题,从而影响评估结果的公平性与泛化能力。其核心解决方案是提出 DynT2I-Eval,一个全自动动态评估框架:通过从长文本描述中构建结构化的视觉语义空间,将提示分解为可控维度(如主体、逻辑约束、环境和构图),实现基于任务特定空间和难度感知采样的持续新鲜提示生成;同时,利用提示条件下的成对比较统一异构输出,并结合动态调度器、微批次聚合与加权贝叶斯更新机制,在提示分布变化和模型注入情况下仍能维持稳定的在线排行榜,从而提升评估的鲁棒性和长期排名保真度。

链接: https://arxiv.org/abs/2605.06170
作者: Juntong Wang,Jiarui Wang,Huiyu Duan,Lewei Li,Guangtao Zhai,Xiongkuo Min
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing text-to-image (T2I) benchmarks largely rely on fixed prompt sets, leaving them vulnerable to overfitting and benchmark contamination once publicly released and repeatedly reused. In this work, we propose DynT2I-Eval, a fully automated dynamic evaluation framework for T2I models. It constructs a structured visual semantic space from long-form descriptions, decomposing prompts into controllable dimensions (e.g., subject, logical constraint, environment, and composition). This enables the continuous generation of fresh prompts via task-specific spaces and difficulty-aware sampling. DynT2I-Eval evaluates model performance across text alignment, perceptual quality, and aesthetics. Heterogeneous outputs are unified into prompt-conditioned pairwise comparisons, allowing a dynamic scheduler, micro-batch aggregation, and weighted Bayesian updates to maintain a stable online leaderboard despite changing prompt distributions and model injection. Experiments with independently sampled prompt streams demonstrate that continually refreshed prompts provide a robust evaluation protocol, reducing the impact of prompt-set-specific tuning. Simulations and ablations further confirm that the proposed ranking framework achieves a strong balance among cold-start convergence, late-entry discovery, and long-run ranking fidelity.
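The online leaderboard described in the abstract is built from prompt-conditioned pairwise comparisons aggregated with weighted Bayesian updates. As a hedged stand-in (the paper's exact update rule is not given here), a weighted Elo-style online update shows the basic mechanics; model names, the K-factor, and the per-comparison weight are all illustrative.

```python
def expected(r_a, r_b):
    # Logistic win probability implied by the current ratings.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, a, b, outcome, k=32.0, weight=1.0):
    # outcome = 1.0 if model `a` wins the pairwise comparison, 0.0 if it loses.
    # `weight` plays the role of a per-comparison confidence weight.
    e = expected(ratings[a], ratings[b])
    delta = k * weight * (outcome - e)
    ratings[a] += delta
    ratings[b] -= delta

ratings = {"model_x": 1500.0, "model_y": 1500.0}
for _ in range(20):  # model_x wins every fresh-prompt comparison
    update(ratings, "model_x", "model_y", 1.0)
```

Because every comparison is conditioned on a freshly sampled prompt, rating gains cannot come from tuning to a fixed prompt set, which is the point of the dynamic protocol.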

[CV-45] Beyond Forgetting in Continual Medical Image Segmentation: A Comprehensive Benchmark Study

【速读】:该论文旨在解决持续学习(Continual Learning, CL)在医学图像分割领域面临的三大核心问题:一是临床场景缺乏标准化,二是现有研究过度关注遗忘抑制而忽视了可塑性(plasticity)等关键属性,三是缺乏对现有方法的系统性评估基准。其解决方案的关键在于构建一个面向临床需求的基准研究体系,首先定义了三种具有临床意义的持续学习场景——跨中心域迁移(Domain-CL)、增量解剖结构分割(Class-CL)和跨器官分割(Organ-CL),并提出一套多维度评估框架,涵盖性能、遗忘率、可塑性、前向泛化能力、参数效率及回放负担等指标。实验表明,基于回放的方法在稳定性和可塑性之间取得了最佳平衡,参数隔离策略虽能有效减少遗忘但增加模型复杂度,而前向泛化能力仍是该领域亟待深入研究的方向。

链接: https://arxiv.org/abs/2605.06160
作者: Bomin Wang,Hangqi Zhou,Yibo Gao,Xiahai Zhuang
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to a journal

点击查看摘要

Abstract:Continual learning (CL) is essential for deploying medical image segmentation models in clinical environments where imaging domains, anatomical targets, and diagnostic tasks evolve over time. However, continual segmentation still faces three main challenges. First, the scenarios for this task remain insufficiently standardized for real-world clinical settings. Second, existing research has primarily focused on mitigating forgetting, overlooking other essential properties such as plasticity. Third, a benchmark work with comprehensive evaluation of existing methods is still desirable. To address these gaps, we present such a benchmark study of continual medical image segmentation. We first define three clinically motivated scenarios, namely Domain-CL, Class-CL, and Organ-CL, to respectively capture the cross-center domain shift, the incremental anatomical structure segmentation, and the cross-organ segmentation. We then introduce an evaluation framework that measures not only general performance and forgetting, but also plasticity, forward generalizability, parameter efficiency, and replay burden. The results, from extensive experiments with representative CL methods, showed that it was still challenging to develop a model that could satisfy all the requirements simultaneously. Nevertheless, these studies also suggested that the replay-based methods achieve the best overall balance between stability and plasticity, the parameter-isolation methods are effective at reducing forgetting, though at the cost of increased model size, and the forward generalizability remains a significantly understudied aspect of this research field. Finally, we discuss related learning paradigms and outline future directions for continual medical image segmentation.
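The forgetting and plasticity measures mentioned in the evaluation framework are typically computed from a task-accuracy matrix. A minimal sketch assuming the standard definitions (average forgetting as the drop from each task's best earlier accuracy to its final accuracy, plasticity as the mean accuracy right after learning each task); the accuracy values are made up:

```python
# ACC[i][j] = accuracy on task j after training on task i (rows ordered in time).
ACC = [
    [0.90, 0.10, 0.10],
    [0.70, 0.85, 0.10],
    [0.60, 0.75, 0.80],
]

def avg_forgetting(acc):
    # For each old task, drop from its best pre-final accuracy to its final one.
    T = len(acc)
    drops = []
    for j in range(T - 1):
        best = max(acc[i][j] for i in range(j, T - 1))
        drops.append(best - acc[T - 1][j])
    return sum(drops) / len(drops)

def avg_plasticity(acc):
    # Mean accuracy on each task immediately after learning it (diagonal).
    T = len(acc)
    return sum(acc[i][i] for i in range(T)) / T

f = avg_forgetting(ACC)
p = avg_plasticity(ACC)
```

A stability-plasticity trade-off shows up directly in these two numbers: methods that freeze parameters lower `f` but usually depress the diagonal entries that drive `p`.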

[CV-46] Secure Seed-Based Multi-bit Watermarking for Diffusion Models from First Principles

【速读】:该论文旨在解决当前生成式图像模型(Generative Image Models)中水印技术评估缺乏理论依据的问题,特别是现有方法依赖特定生成模型架构进行经验性测试,导致难以对水印方案的安全性、鲁棒性和保真度(Fidelity)进行客观比较与量化分析。其解决方案的关键在于提出一种解耦机制,将水印系统的决策逻辑与具体生成模型分离,并基于此构建一个形式化的评估框架,以安全、鲁棒性和保真度三者之间的权衡关系为核心,通过特征曲面(Characteristic Surface)实现跨模型的精确对比;在此基础上,作者进一步提出SSB水印方法,能够灵活覆盖任意安全-鲁棒性-保真度组合区域,从而为现代水印系统的设计提供理论保障,避免昂贵的经验验证过程。

链接: https://arxiv.org/abs/2605.06153
作者: Enoal Gesny,Eva Giboulot
机构: Inria (法国国家信息与自动化研究院)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid emergence of generative image models has led to the development of specialized watermarking techniques, particularly in-generation methods such as seed-based embedding. However, current evaluations in this area remain largely empirical, making them heavily reliant on the specific model architectures used for generation and inversion. This prevents any clear conclusion on the performance of any method, especially regarding security, for which a rigorous definition is lacking. Against this approach, we argue that the effectiveness of a watermarking scheme should be established purely through a thorough theoretical analysis. This is enabled by decoupling the model-dependent part from the actual decision mechanism of the watermarking system. Using this decoupling, we introduce a formal evaluation framework based on security, robustness, and fidelity. This allows precise comparisons between watermarking systems through a characteristic surface representing the trade-off between these three quantities, independent of any generative model. Based on this framework, we propose SSB, a novel watermarking method that generalizes previous seed-based methods by allowing it to reach any security-robustness-fidelity regime on its characteristic surface. This work opens the door to the design of modern watermarking systems with theoretical guarantees that do not necessitate any costly empirical evaluations.
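As a generic illustration of the seed-based multi-bit idea (not the SSB construction itself), one can embed message bits by adding keyed pseudo-random carriers to a latent and decode them by correlation; the embedding strength `alpha` is exactly where a fidelity-robustness trade-off appears. All names and constants below are assumptions for the sketch.

```python
import random

def carriers(key, n_bits, dim):
    # Keyed pseudo-random +/-1 carrier per message bit (seed plays the key role).
    rng = random.Random(key)
    return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(n_bits)]

def embed(latent, bits, key, alpha=1.0):
    # Add each carrier with a sign given by its message bit; larger alpha is
    # more robust to noise but distorts the latent more (lower fidelity).
    cs = carriers(key, len(bits), len(latent))
    out = list(latent)
    for bit, c in zip(bits, cs):
        s = 1.0 if bit else -1.0
        out = [x + alpha * s * ci for x, ci in zip(out, c)]
    return out

def decode(latent, n_bits, key):
    # Recover each bit from the sign of the correlation with its carrier.
    cs = carriers(key, n_bits, len(latent))
    return [sum(x * ci for x, ci in zip(latent, c)) > 0 for c in cs]

rng = random.Random(0)
latent = [rng.gauss(0, 1) for _ in range(512)]
msg = [True, False, True, True, False, False, True, False]
decoded = decode(embed(latent, msg, key=42), len(msg), key=42)
```

The paper's point is that such a decision mechanism can be analyzed independently of whichever diffusion model later maps the watermarked seed to an image.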

[CV-47] Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow

【速读】:该论文旨在解决离散图像分词器(discrete image tokenizer)在两阶段训练中与自回归(autoregressive, AR)先验模型之间存在的不一致性问题:第一阶段仅优化重建性能,第二阶段再拟合冻结的分词序列,导致分词器未考虑后续AR先验的建模需求,从而使得生成的token虽能较好保留图像信息,却难以被AR模型从左到右预测。解决方案的关键在于引入一种分布级别的先验匹配信号(prior-matching signal),在分词器训练阶段直接优化该信号,同时保持重建目标不变;该信号通过Wasserstein梯度流更新实现,对硬类别分词而言,等价于一个辅助AR模型与目标AR先验之间的token级对比,且无需反向传播通过任一AR模型,仅需前向计算即可完成更新。此方法构建的wAR-Tok分词器显著降低AR损失并提升生成FID指标,在CIFAR-10和ImageNet上实现与原分词器相当的重建质量。

链接: https://arxiv.org/abs/2605.06148
作者: Bowen Zheng,Yihong Luo,Tianyang Hu
机构: The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)); Hong Kong University of Science and Technology(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Discrete image tokenizers are commonly trained in two stages: first for reconstruction, and then with a prior model fitted to the frozen token sequences. This decoupling leaves the tokenizer unaware of the model that will later generate its tokens. As a result, the learned tokens may preserve image information well but still be difficult for an autoregressive (AR) prior to predict from left to right. We analyze this mismatch using Tripartite Variational Consistency (TVC), which decomposes latent-variable learning into three consistency conditions: conditional-likelihood consistency, prior consistency, and posterior consistency. TVC shows that two-stage training preserves the reconstruction side but leaves prior consistency outside the tokenizer objective: the overall token distribution is fixed before the AR prior participates in training. Motivated by this view, we add a distribution-level prior-matching signal during tokenizer training, while keeping the reconstruction objective unchanged. We optimize this signal with a Wasserstein-gradient-flow update. For hard categorical tokens, the update reduces to a token-level contrast between an auxiliary AR model that tracks the tokenizer’s current token distribution and the target AR prior. It requires only forward passes through the two AR models and does not backpropagate through either of them. The resulting tokenizer, wAR-Tok, reduces AR loss and improves generation FID on CIFAR-10 and ImageNet at comparable reconstruction quality.
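The token-level contrast the abstract describes (an auxiliary AR model tracking the tokenizer's current distribution vs. the target AR prior, forward passes only) can be illustrated with fixed probability tables standing in for the two models; a positive weight means the target prior favors the token more than the current token distribution does. Entirely illustrative numbers:

```python
import math

# Stand-ins for the two models' next-token probabilities at some position.
target_prior = {0: 0.7, 1: 0.2, 2: 0.1}   # p_target(token): the AR prior
aux_model    = {0: 0.4, 1: 0.4, 2: 0.2}   # p_aux(token): tracks current dist.

def token_contrast(tokens):
    # Log-ratio log p_target - log p_aux; only forward evaluations needed,
    # no backprop through either AR model.
    return [math.log(target_prior[t]) - math.log(aux_model[t]) for t in tokens]

weights = token_contrast([0, 1, 2])
```

In the full method this signal reweights the tokenizer's update so its token distribution drifts toward what the AR prior can actually predict, while the reconstruction loss is left unchanged.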

[CV-48] AI-Generated Images: What Humans and Machines See When They Look at the Same Image

【速读】:该论文旨在解决生成式 AI(Generative AI)在在线虚假信息传播中被滥用的问题,特别是针对AI生成图像的检测系统缺乏透明性和可解释性这一痛点。解决方案的关键在于构建一个集成多种可解释人工智能(Explainable AI, XAI)方法的检测框架,通过大规模真实感伪造图像数据集AIText2Image训练不同架构与微调策略的检测模型,并结合100名参与者的人类反馈调查,系统评估视觉解释的清晰度与人类理解的一致性,从而提升检测结果的可解释性与可信度。

链接: https://arxiv.org/abs/2605.06143
作者: Silvia Poletti,Justin Ilyes,Marcel Hasenbalg,David Fischinger,Martin Boyer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Included in the main conference proceedings published by Springer Nature (CCIS Series)

点击查看摘要

Abstract:The misuse of generative AI in online disinformation campaigns highlights the urgent need for transparent and explainable detection systems. In this work, we investigate how detectors for AI-generated images can be more effective in providing human-understandable explanations for their predictions. To this end, we develop a suite of detectors with various architectures and fine-tuning strategies, trained on our large-scale photorealistic fake image dataset, AIText2Image, and assess their performance on state-of-the-art text-to-image AI generators. We integrate 16 different explainable AI (XAI) methods into our detection framework, and the visual explanations are comprehensively refined and evaluated through a novel approach that prioritizes human understanding of AI-generated images, using both textual and visual responses collected from a survey of 100 participants. This framework offers insights into visual-language cues in fake image detection and into the clarity of XAI methods from a human perspective, measuring the alignment of XAI outputs with human preferences.

[CV-49] Autoregressive Visual Generation Needs a Prologue

【速读】:该论文旨在解决自回归(Autoregressive, AR)图像生成中“重建-生成鸿沟”(reconstruction-generation gap)的问题,即在传统方法中,视觉token需同时满足重建任务和生成任务,导致两者相互干扰,难以兼顾高质量重建与高保真生成。其解决方案的关键在于提出Prologue方法:通过在视觉token序列前引入一小部分独立训练的“前导token”(prologue tokens),这些token仅使用AR交叉熵(CE)损失进行优化,而原始视觉token仍专注于重建任务,从而实现生成与重建的解耦。这种设计使模型能在不损害重建质量的前提下,显著提升生成质量,且从ELBO(Evidence Lower Bound)角度进行了理论形式化。实验表明,Prologue在ImageNet 256x256上大幅降低gFID(生成FID),并展现出由AR梯度驱动的语义结构涌现特性。

链接: https://arxiv.org/abs/2605.06137
作者: Bowen Zheng,Weijian Luo,Guang Yang,Colin Zhang,Tianyang Hu
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); hi-Lab, Xiaohongshu Inc (小红书实验室,小红书公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model’s true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.

[CV-50] Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration

【速读】:该论文旨在解决真实世界图像退化(real-world image degradation)具有空间非均匀性和组合性(compositional)特征时,如何在无需测试阶段退化标签的情况下,使单一模型能够自适应地处理多样化的局部退化模式的问题。现有方法通常依赖全局提示(global prompts)或退化描述符进行条件调制,或通过预定义专家池路由特征,但这些方法受限于全局条件对局部退化信息的瓶颈效应,或因静态路由导致更新同质化、稀疏分配不稳定。其解决方案的关键在于提出连续专家组装(Continuous Expert Assembly, CEA)——一种基于token级动态参数化的框架,利用轻量级交叉注意力超适配器(Cross-Attention Hyper-Adapter)从中间空间特征中提取实例相关的低秩路由基和残差方向,并通过密集符号点积亲和度实现每个空间token自主组装残差更新,从而避免外部提示、静态专家库和离散Top-k选择,同时具备线性注意力视角下的可解释性,显著提升在空间变化和组合退化场景下的恢复质量,且保持高效参数、计算和运行性能。

链接: https://arxiv.org/abs/2605.06127
作者: Haisen He,Xiangyu Zou,SongLin Dong,Heng Li,Yihong Gong,Zhiheng Ma
机构: Southern University of Science and Technology (南方科技大学); Shenzhen University (深圳大学); Shenzhen University of Advanced Technology (深圳先进技术研究院); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world image degradation is often unknown, spatially non-uniform, and compositional, requiring all-in-one restoration models to adapt a single set of weights to diverse local corruption patterns without test-time degradation labels. Existing methods typically modulate a shared backbone with global prompts or degradation descriptors, or route features through predefined expert pools. However, compact global conditioning can bottleneck localized degradation evidence, while static expert routing may produce homogeneous updates or rely on unstable sparse assignments. We propose Continuous Expert Assembly (CEA), a token-wise dynamic parameterization framework for all-in-one image restoration. CEA employs a lightweight Cross-Attention Hyper-Adapter to probe intermediate spatial features and synthesize instance-conditioned low-rank routing bases and residual directions. Each spatial token then assembles its own residual update via dense signed dot-product affinities over the generated rank-wise components, avoiding external prompts, static expert banks, and discrete Top-k selection. The resulting assembly rule also admits a linear-attention perspective, making its dense token-wise routing behavior transparent. Experiments on AIO-3, AIO-5, and CDD-11 show that CEA improves average restoration quality over strong prompt-, descriptor-, and expert-based baselines, with the clearest gains on spatially varying and compositional degradations, while maintaining favorable parameter, FLOP, and runtime efficiency.
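A minimal numpy sketch of the token-wise assembly rule, assuming the hyper-adapter has already emitted rank-wise routing bases and residual directions (all shapes are hypothetical): each token's update is its dense signed affinity to the bases times the residual directions, with no Top-k selection.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens    = rng.standard_normal((16, 8))   # 16 spatial tokens, feature dim 8
bases     = rng.standard_normal((4, 8))    # rank-4 routing basis (generated)
residuals = rng.standard_normal((4, 8))    # rank-4 residual directions (generated)

affinity = tokens @ bases.T                # (16, 4) signed dot-product routing
update   = affinity @ residuals            # (16, 8) per-token low-rank residual
out      = tokens + update                 # tokens carry their own updates
```

Written this way, `affinity @ residuals` is a linear-attention-style read-out over the rank components, which is the transparency property the abstract alludes to.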

[CV-51] Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

【速读】:该论文旨在解决农业害虫识别中因物种间复杂性高、种内变异大以及专家标注数据稀缺所导致的视觉理解难题,尤其针对多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度害虫形态识别上的应用局限。其解决方案的关键在于提出 Pest-Thinker 框架,该框架融合知识驱动的强化学习(Reinforcement Learning, RL),通过构建两个高分辨率害虫基准数据集 QFSD 和 AgriInsect,并利用 Chain-of-Thought (CoT) 推理轨迹进行监督微调(Supervised Fine-Tuning, SFT),进而采用 Group Relative Policy Optimization (GRPO) 结合新型特征奖励机制,引导模型聚焦于可观察的形态学证据,该奖励由 LLM-as-a-Judge 策略评估,从而显著提升模型在域内与域外场景下的形态学理解能力,迈向专家级智能农业害虫视觉推理水平。

链接: https://arxiv.org/abs/2605.06121
作者: Xueheng Li,Yu Wang,Tao Hu,Ji Huang,Ke Cao,Qize Yang,Rui Li,Jie Zhang,Chengjun Xie
机构: Institute of Intelligent Machines, Hefei Institute of Physical Science, Chinese Academy of Sciences (中国科学院合肥物质科学研究院智能机械研究所); University of Science and Technology of China (中国科学技术大学); Zhongke Hefei Institute of Technology Innovation Engineering (中科合肥技术工程院); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Pest-induced crop losses pose a major threat to global food security and sustainable agricultural development. While recent advances in Multimodal Large Language Models (MLLMs) have shown strong potential for visual understanding and smart agriculture, their direct application to pest recognition remains limited due to the domain’s unique challenges such as high inter-species complexity, intra-species variability, and the scarcity of expert-annotated data. In this work, we introduce Pest-Thinker, a knowledge-driven reinforcement learning (RL) framework that enables MLLMs to reason over fine-grained pest morphology. We first construct two high-definition pest benchmarks, QFSD and AgriInsect, comprising diverse species and expert-annotated morphological traits. Leveraging these datasets, we synthesize Chain-of-Thought (CoT) reasoning trajectories to facilitate structured learning of pest-specific visual cues through Supervised Fine-Tuning (SFT). Subsequently, we employ Group Relative Policy Optimization (GRPO) with a novel feature reward that guides the model to focus on observable morphological evidence, assessed by an LLM-as-a-Judge strategy. Extensive experiments demonstrate that Pest-Thinker substantially improves both in-domain and out-of-domain morphological understanding, marking a step toward expert-level visual reasoning for intelligent agricultural pest analysis. The datasets and source code are available upon acceptance.
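The GRPO step mentioned above scores each sampled response against its own group. A common formulation (assumed here, since the abstract gives no equations) normalizes each rollout's reward by the group mean and standard deviation, so advantages are relative within the group:

```python
def group_relative_advantages(rewards, eps=1e-8):
    # Group-relative normalization: advantage = (r - group mean) / group std.
    m = sum(rewards) / len(rewards)
    var = sum((r - m) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - m) / (std + eps) for r in rewards]

# Four rollouts for one pest image; rewards could combine answer correctness
# with the morphology-grounded feature reward from the LLM judge.
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Advantages in a group always sum to zero, so the policy is pushed toward the responses the judge scored above the group average rather than toward an absolute reward scale.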

[CV-52] Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking

【速读】:该论文旨在解决基于RGB的跟踪器在低光照和快速运动等挑战性成像条件下性能下降的问题,同时针对现有事件相机(event camera)跟踪方法忽视事件数据固有的空间稀疏性和时间密集性,且依赖单一固定时间窗口采样策略导致在不同运动动态下表现不佳的局限性。解决方案的关键在于提出一种事件稀疏感知的跟踪框架,通过显式建模多时间尺度下的事件密度变化,将稀疏、中等密度和高密度事件搜索区域逐步注入三阶段视觉Transformer(Vision Transformer)骨干网络,实现分层多密度特征学习;此外,引入稀疏感知的专家混合模块(Mixture-of-Experts module)以促进不同稀疏模式下的专家专业化,并设计动态思考(dynamic pondering)策略,根据跟踪难度自适应调整推理深度,从而在跟踪精度与计算效率之间取得良好平衡。

链接: https://arxiv.org/abs/2605.06112
作者: Shiao Wang,Xiao Wang,Duoqing Yang,Wenhao Zhang,Bo Jiang,Lin Zhu,Yonghong Tian,Bin Luo
机构: Anhui University (安徽大学); Beijing Institute of Technology (北京理工大学); Peng Cheng Laboratory (鹏城实验室); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite significant progress, RGB-based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel-wise brightness changes, providing high dynamic range and high temporal resolution. However, existing event-based trackers often neglect the intrinsic spatial sparsity and temporal density of event data, while relying on a single fixed temporal-window sampling strategy that is suboptimal under varying motion dynamics. In this paper, we propose an event sparsity-aware tracking framework that explicitly models event-density variations across multiple temporal scales. Specifically, the proposed framework progressively injects sparse, medium-density, and dense event search regions into a three-stage Vision Transformer backbone, enabling hierarchical multi-density feature learning. Furthermore, we introduce a sparsity-aware Mixture-of-Experts module to encourage expert specialization under different sparsity patterns, and design a dynamic pondering strategy to adaptively adjust the inference depth according to tracking difficulty. Extensive experiments on FE240hz, COESOT, and EventVOT demonstrate that the proposed approach achieves a favorable trade-off between tracking accuracy and computational efficiency. The source code will be released on this https URL.
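The multi-density sampling idea (the same event stream sliced with different temporal windows yields sparse, medium, and dense search regions) can be seen with a toy timestamp list; the window lengths and the uniform event stream below are illustrative, not from the paper:

```python
# Toy event stream: 100 event timestamps in [0, 10) seconds.
events = [0.1 * i for i in range(100)]

def window_counts(timestamps, window):
    # Count events falling into each temporal window of the given length.
    n = int(max(timestamps) // window) + 1
    counts = [0] * n
    for t in timestamps:
        counts[int(t // window)] += 1
    return counts

sparse = window_counts(events, window=0.5)  # short windows -> few events each
dense  = window_counts(events, window=5.0)  # long windows  -> many events each
```

Feeding the sparse slices to early backbone stages and the dense ones to later stages is the hierarchical multi-density scheme the abstract describes.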

[CV-53] Metonymy in vision models undermines attention-based interpretability

【速读】:该论文旨在解决现代视觉模型(特别是预训练视觉Transformer)中局部性假设失效的问题,即传统基于部件的推理方法假设对象各部分的潜在表示仅编码对应图像区域的信息,但实验表明这类模型存在显著的对象内信息泄漏(intra-object leakage),导致注意力机制驱动的可解释方法失去可信度。解决方案的关键在于采用一种两阶段特征提取方法,通过设计上防止信息泄露,从而实现特征的内在解耦(intrinsic disentanglement),进而提升属性驱动的部件发现任务性能,验证了该策略在实际应用中的有效性。

链接: https://arxiv.org/abs/2605.06095
作者: Ananthu Aniraj,Cassio F. Dantas,Dino Ienco,Massimiliano Mancini,Diego Marcos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, preprint

点击查看摘要

Abstract:Part-based reasoning is a classical strategy to make a computer vision model directly focus on the object parts that are relevant to the downstream task. In the context of deep learning, this also serves to improve by-design interpretability, often by using part-centric attention mechanisms on top of a latent image representation provided by a standard, black-box model. This approach is based on a locality assumption: that the latent representation of an object part encodes primarily information about the corresponding image region. In this work, we test this basic assumption, measuring intra-object leakage in vision models using part-based attribute annotations. Through a comprehensive experimental evaluation, we show that modern pretrained vision transformers violate the locality assumption and exhibit a strong intra-object leakage, in which each part encodes information from the whole object, a visual metonymy that compromises the faithfulness of attention-based interpretable-by-design methods for part-based reasoning, ultimately rendering them uninterpretable. In addition, we establish an upper bound using a two-stage approach that prevents leakage by design. We then show that this inherently disentangled feature extraction improves attribute-driven part discovery on a variety of tasks, confirming the practical impact of intra-object leakage. Our results uncover a neglected issue affecting the interpretability of part-based representations, such as those in CBMs relying on part-centric concepts, highlighting that two-stage approaches offer a promising way to mitigate it.

[CV-54] VISD: Enhancing Video Reasoning via Structured Self-Distillation

【速读】:该论文旨在解决视频大语言模型(VideoLLM)在复杂推理任务中因序列级奖励稀疏及长时程推理轨迹缺乏细粒度信用分配而导致的训练困难问题。现有强化学习方法虽能提供可靠监督,但难以捕捉token级别的贡献;而传统自蒸馏方法虽提供密集监督,却缺乏结构化与诊断特异性,且易与强化学习产生不稳定性。解决方案的关键在于提出一种结构化自蒸馏框架VISD,其核心创新包括:(1) 引入一个视频感知的判别模型,将推理质量分解为答案正确性、逻辑一致性与时空锚定等多个维度,从而提供具有诊断意义的特权信息;(2) 设计方向-幅度解耦机制,利用奖励计算的rollout级优势决定更新方向,同时由结构化特权信号调节token级更新幅度,实现语义对齐的细粒度信用分配;(3) 结合课程调度与EMA教师稳定策略,提升长期视频序列优化的鲁棒性。实验表明,VISD显著优于强基线,在准确率和时空 grounding 质量上均有提升,并实现近2倍的收敛速度加速。

链接: https://arxiv.org/abs/2605.06094
作者: Hao Lin,Kunyang Lv,Xu Jiang,Jingqi Tian,Zhongjing Du,Jiayu Ding,Qiaoman Zhang,Hongbo Jin
机构: Peking University (北京大学); Tsinghua University (清华大学); HUST (华中科技大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
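The direction-magnitude decoupling reduces to a two-line rule: the rollout-level advantage fixes the sign of every token's update, while per-token structured quality scores scale its size. A hedged sketch with invented numbers (the real method applies this to policy-gradient terms, not raw updates):

```python
def decoupled_updates(advantage, token_scores, lr=0.1):
    # Direction comes only from the rollout-level advantage; magnitude comes
    # only from the per-token structured privileged score.
    direction = 1.0 if advantage >= 0 else -1.0
    return [direction * lr * s for s in token_scores]

# A negative-advantage rollout: every token is pushed down, but tokens the
# judge flagged as more responsible (higher score) are pushed down harder.
ups = decoupled_updates(advantage=-0.8, token_scores=[0.2, 1.0, 0.5])
```

Keeping the direction tied to the verifiable reward is what prevents the dense judge signal from destabilizing the RL objective.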

[CV-55] Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning CVPR2026

【速读】:该论文旨在解决自监督跟踪模型在无标签视频中难以学习鲁棒上下文知识的问题,特别是现有方法因依赖非语义查询而无法有效适应无标签场景,导致难以获取可靠的上下文线索。其解决方案的关键在于提出一种双模态上下文关联机制(dual-modal context association mechanism),该机制通过两个阶段协同利用细粒度语义提示(fine-grained semantic prompts)和上下文噪声(contextual noise)来驱动模型学习鲁棒的跟踪表征:早期训练阶段使用实例patch token作为提示引导前后向跟踪分支学习基础跟踪知识,随着训练推进逐步引入上下文噪声以扰动特征空间,促使模型在更复杂的环境中学习更具泛化能力的表示。此机制仅在训练阶段应用,保障了推理效率。

链接: https://arxiv.org/abs/2605.06092
作者: Yaozong Zheng,Qihua Liang,Bineng Zhong,Shuimu Zeng,Yuanliang Xue,Ning Li,Shuxiang Song
机构: Guangxi Normal University (广西师范大学); University of Southampton (南安普顿大学); Xi’an Research Institute of High Technology (西安高技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic queries struggle to adapt to unlabeled tracking scenarios, making it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework that introduces a dual-modal context association mechanism, jointly leveraging fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. Adherent to the easy-to-hard learning principle, our contextual association mechanism operates in two stages. During early training, instance patch tokens (prompts) are assigned to both forward and backward tracking branches to facilitate the acquisition of tracking knowledge. As training progresses, contextual noise is gradually injected into the model to perturb features, encouraging the tracker to learn robust tracking representations in a more complex feature space. Thus, this novel contextual association mechanism enables our self-supervised model to learn high-quality tracking representations from unlabeled videos, while being applied exclusively during training to preserve efficient inference. Extensive experiments demonstrate the superiority of our method.

[CV-56] OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

【速读】:该论文旨在解决基于高斯表示的开放词汇3D场景理解中,由于多视角观测导致的语义预测碎片化和空间不一致性问题。解决方案的关键在于提出OpenGaFF框架,其核心创新是引入高斯特征场(Gaussian Feature Field),将语义建模为高斯几何与外观的连续函数,通过显式地以几何结构为条件来增强几何与语义的耦合关系,从而提升相似结构在3D空间中的空间一致性;同时,设计了一个结构化的码本(codebook)作为共享语义基元,并结合码本引导的注意力机制,实现基于查询嵌入与码本条目间相似性匹配的语言特征检索,有效降低对象内部特征方差并支持鲁棒的开放词汇推理。

链接: https://arxiv.org/abs/2605.06088
作者: Kunyi Li,Michael Niemeyer,Sen Wang,Stefano Gasperini,Nassir Navab,Federico Tombari
机构: Technical University of Munich (慕尼黑工业大学); Google (谷歌); Munich Center for Machine Learning (慕尼黑机器学习中心); Visualais
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
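The codebook-guided attention described above amounts to a softmax over query-codebook similarities followed by retrieval of a convex combination of the language features attached to each entry. A shape-level numpy sketch (all dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
queries  = rng.standard_normal((5, 16))    # 5 query embeddings
codebook = rng.standard_normal((32, 16))   # 32 shared semantic primitives
lang     = rng.standard_normal((32, 64))   # language feature per codebook entry

sim = queries @ codebook.T                           # (5, 32) similarities
attn = np.exp(sim - sim.max(axis=1, keepdims=True))  # numerically stable exp
attn /= attn.sum(axis=1, keepdims=True)              # row-wise softmax
retrieved = attn @ lang                              # (5, 64) retrieved features
```

Because every query must route through the same small set of codebook entries, features within one object are pulled toward shared primitives, which is how the codebook reduces intra-object feature variance.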

[CV-57] LARGO: Low-Rank Hypernetwork for Handling Missing Modalities

【速读】:该论文旨在解决多模态图像分析中因模态缺失导致的性能下降问题,尤其是在不同数据集上模型迁移困难、需频繁调整架构或超参数的挑战。现有方法通常在特征空间中设计对缺失输入鲁棒的表示,而本文提出在权重空间进行建模的新思路:通过构建一个超网络(hypernetwork)LARGO,利用CP张量分解(Canonical Polyadic tensor decomposition)将 2^N-1 个针对不同缺失组合的专用模型压缩为单一网络,从而实现高效且通用的缺失模态处理能力。其核心创新在于以权重空间替代特征空间建模,显著提升了模型的泛化性和部署灵活性。

链接: https://arxiv.org/abs/2605.06086
作者: Niels Vyncke,Pooya Ashtari,Aleksandra Pižurica
机构: Ghent University (根特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Addressing missing modalities is an important challenge in multimodal image analysis and often relies on complex architectures that do not transfer easily to different datasets without architectural modifications or hyperparameter tuning. While most existing methods tackle this problem in feature space by engineering representations that are robust to missing inputs, we instead operate in weight space. We propose LARGO, a hypernetwork that compresses the 2^N-1 dedicated missing-modality models into a single network by modelling the convolutional weights using the Canonical Polyadic (CP) tensor decomposition. Extensive experimental validation on BraTS 2018 (4 modalities, 15 scenarios) and ISLES 2022 (3 modalities, 7 scenarios) shows that our method ranks first in 47 out of 52 configurations, achieving average Dice improvements of +0.68 % and +2.53 % over state-of-the-art baselines (mmFormer, M^3AE, ShaSpec, SimMLM). A proof-of-concept experiment on avMNIST suggests that LARGO may extend beyond medical imaging to heterogeneous non-medical modalities.
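The CP-decomposition idea behind LARGO can be made concrete at the tensor level: a rank-R CP model stores factor matrices instead of the full weight tensor, so one compact parameterization can cover all 2^N-1 missing-modality configurations. A sketch with hypothetical shapes and rank:

```python
import numpy as np

# Rank-R CP reconstruction of a 3-way weight tensor from factor matrices.
R, I, J, K = 4, 6, 5, 3
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))

# W[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
W = np.einsum('ir,jr,kr->ijk', A, B, C)

full_params = I * J * K          # storing W directly
cp_params = R * (I + J + K)      # storing only the CP factors
```

A hypernetwork in this style would condition the factors on the missing-modality mask, regenerating the appropriate convolutional weights instead of keeping one dedicated model per mask.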

[CV-58] AMIEOD: Adaptive Multi-Experts Image Enhancement for Object Detection in Low-Illumination Scenes

【速读】:该论文旨在解决低光照条件下图像质量下降导致目标检测精度显著降低的问题。其核心解决方案是提出一种面向检测性能优化的图像增强-目标检测联合框架AMIEOD,关键创新在于:(1) 设计多专家图像增强模块(Multi-Experts Image Enhancement Module, MEIEM),融合多种增强策略以充分挖掘低光图像信息;(2) 提出检测引导回归损失(Detection-Guided Regression Loss, DGRL),利用检测结果动态确定增强目标,使增强过程与检测任务对齐;(3) 构建基于检测引导交叉熵损失(Detection-Guided Cross-Entropy, DGCE)的专家选择模块(Expert Selection Module, ESM),在推理阶段自适应选择最优增强策略,从而实现低光场景下目标检测性能的显著提升。

链接: https://arxiv.org/abs/2605.06084
作者: Xiaochen Huang,Honggang Chen,Weicheng Zhang,Xiaobo Dai,Yongyi Li,Linbo Qing,Xiaohai He
机构: Sichuan University (四川大学); Police Integration Computing Key Laboratory of Sichuan Province (四川省公安融合计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE Transactions on Multimedia

点击查看摘要

Abstract:In multimedia application scenarios, images captured under low-illumination conditions often lead to lower accuracy in visual perception tasks compared to those taken in well-lit environments. To tackle this challenge, we propose AMIEOD, an image enhancement-enabled object detection framework for low-illumination scenes, where the two tasks are jointly optimized in a detection performance-oriented manner. Specifically, to fully exploit the information in poorly lit images, a Multi-Experts Image Enhancement Module (MEIEM) is proposed, which leverages diverse enhancement strategies. On this basis, aiming to better align the MEIEM with the detection task, we propose a Detection-Guided Regression Loss (DGRL) that utilizes the detection result to decide the regression target. Moreover, to dynamically select the most suitable enhancement strategy from MEIEM during inference, we construct an Expert Selection Module (ESM) guided by the proposed Detection-Guided Cross-Entropy (DGCE) loss, which formulates the optimization of ESM as a classification task. The improved method is well-matched with current detection algorithms to improve their performance in dim scenes. Extensive experiments on multiple datasets demonstrate that the proposed method significantly improves object detection accuracy in low-illumination conditions. 
Our code has been released at this https URL

[CV-59] MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

【速读】:该论文旨在解决无参考图像描述(image caption)评估中因全局嵌入相似性难以捕捉细粒度不匹配(如幻觉对象、缺失属性或错误关系)而导致的评价不准问题。其解决方案的关键在于提出一种名为MSD-Score的无参考指标,通过将图像块(image patch)和文本词元(text token)嵌入建模为单位超球面上的von Mises-Fisher混合分布,将图像-文本匹配问题转化为多尺度分布评分任务;借助加权双向KL散度量化语义差异,并在多尺度框架下融合全局相似性,从而实现对单候选与多候选场景的精准评估,同时提供可解释的局部定位错误诊断能力。

链接: https://arxiv.org/abs/2605.06080
作者: Shichao Kan,Xuyang Zhang,Haojie Zhang,Zhe Zhu,Yigang Cen,Yixiong Liang,Lianlei Shan,Linna Zhang,Zhe Qu,Jiazhi Xia
机构: Central South University (中南大学); Beijing Jiaotong University (北京交通大学); University of Chinese Academy of Sciences (中国科学院大学); Guizhou University (贵州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. 17 pages, 10 figures. Code is available at: this https URL

点击查看摘要

Abstract:Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring problem. Semantic discrepancies are quantified via a weighted bi-directional KL divergence and combined with global similarity in a multi-scale framework for both single- and multi-candidate evaluations. Extensive experiments show that MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics. Beyond accuracy, its probabilistic formulation yields transparent and decomposable diagnostics of local grounding errors, providing a deterministic complementary signal to holistic similarity metrics and judge-based evaluators.
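为直观说明"分布评分"的思想,下面给出一个简化的 NumPy 示意:将单位球面上的图像块/文本词元嵌入软分配到若干方向原型上(以 softmax(κ·cosine) 近似 vMF 责任度),再对两个混合权重向量计算加权双向 KL。原型数、κ 取值以及用离散混合权重近似 vMF 混合,均为示意性假设,并非论文原公式。

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mixture_weights(embs, prototypes, kappa=10.0):
    """Soft-assign unit-sphere embeddings to directional prototypes.
    softmax over kappa * cosine similarity plays the role of vMF
    responsibilities (a simplification of the paper's mixture fit)."""
    sims = l2norm(embs) @ l2norm(prototypes).T            # (n, k) cosines
    e = np.exp(kappa * (sims - sims.max(axis=1, keepdims=True)))
    resp = e / e.sum(axis=1, keepdims=True)
    return resp.mean(axis=0)                              # mixture weights over k modes

def bi_kl(p, q, w_fwd=0.5, eps=1e-12):
    """Weighted bi-directional KL between two discrete mixture-weight vectors."""
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return w_fwd * kl(p, q) + (1.0 - w_fwd) * kl(q, p)

rng = np.random.default_rng(0)
protos = rng.normal(size=(4, 8))                          # shared directional prototypes
img = mixture_weights(rng.normal(size=(16, 8)), protos)   # 16 image patches
txt = mixture_weights(rng.normal(size=(10, 8)), protos)   # 10 text tokens
score = -bi_kl(img, txt)   # higher (less negative) = better image-text match
```

由于每个模态都被表示为原型上的分布而非单点,哪个语义模式上出现失配(如幻觉对象)可以直接从逐项 KL 贡献中读出,这正是其可解释诊断能力的来源。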

[CV-60] Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

【速读】:该论文旨在解决现有直接偏好优化(Direct Preference Optimization, DPO)方法在文本到图像(Text-to-Image, T2I)扩散模型中因依赖二元反馈而导致细粒度建模不足、优化效果受限的问题。其核心解决方案是提出ArenaPO,通过构建一个基于高斯分布的模型竞技场(Arena),利用标注的成对偏好数据推断各模型的能力分布,并基于截断正态分布的隐变量推理估算图像对之间的绝对质量差距,从而提供无需奖励模型的精细化反馈信号。该方法实现了传统强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)的丰富奖励优势与DPO高效训练的结合,且所有计算可在离线阶段完成,不引入额外训练开销。

链接: https://arxiv.org/abs/2605.06070
作者: Zhikai Li,Yue Zhao,Edward Zhongwei Zhang,Xuewen Liu,Jing Zhang,Qingyi Gu,Zhen Dong
机构: Institute of Automation, Chinese Academy of Sciences; University of California, Berkeley; University of California, Santa Barbara
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback, thus achieving efficient and fine-grained optimization without a reward model. This enables ArenaPO to benefit from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, we first construct a model Arena in which each model’s capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise preferences. Each output image is treated as a sample from the corresponding capability distribution. Then, for a image pair, conditioned on the two capability distributions and the observed pairwise preference, the absolute quality gap is estimated using latent-variable inference based on truncated normal distribution, which serves as fine-grained feedback during training. It does not require a reward model and can be computed offline, thus introducing no additional training overhead. We conduct ArenaPO training on Pick-a-Pic v2 and HPD v3 datasets, showing that ArenaPO consistently outperforms existing baselines.
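其中"截断正态隐变量推断"的核心计算可以写成一个闭式表达式:设两模型能力为独立高斯,其差 d ~ N(μ_d, s²),在观测到偏好 d > 0 的条件下,E[d | d > 0] = μ_d + s·φ(α)/(1−Φ(α)),α = −μ_d/s。下面是仅用标准库的示意实现(论文可能在此基础上引入更多配对证据,此处为简化假设)。

```python
import math

def expected_quality_gap(mu_w, var_w, mu_l, var_l):
    """E[q_w - q_l | q_w > q_l] for independent Gaussian capabilities:
    mean of a normal N(mu_d, s^2) truncated to the positive half-line."""
    mu_d = mu_w - mu_l
    s = math.sqrt(var_w + var_l)
    a = -mu_d / s                                         # truncation point in std units
    phi = math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)
    Phi = 0.5 * (1 + math.erf(a / math.sqrt(2)))
    return mu_d + s * phi / (1 - Phi)

gap_close = expected_quality_gap(0.1, 1.0, 0.0, 1.0)   # near-equal models
gap_far   = expected_quality_gap(2.0, 1.0, 0.0, 1.0)   # clearly stronger winner
```

可见能力差距越大,推断出的绝对质量差也越大,从而为 DPO 提供比二元偏好更细粒度的离线奖励信号。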

[CV-61] PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

【速读】:该论文旨在解决单参考语音驱动下未见过说话者的共言语手势个性化问题,即在仅给定目标语音和一名新说话者的一个动作片段作为参考的情况下,生成符合该说话者特有姿态习惯但又能匹配当前语音内容的自然手势,且无需针对每个说话者进行额外优化。其核心挑战在于如何从参考动作中分离出稳定的说话者身份特征与随语音变化的动态轨迹。解决方案的关键在于提出PersonaGesture框架,包含两个核心组件:自适应风格注入(Adaptive Style Infusion, ASI)和隐式分布校正(Implicit Distribution Rectification, IDR)。ASI通过零初始化残差交叉注意力机制将编码后的说话者记忆令牌注入去噪过程,使风格信息影响运动生成而不破坏预训练的语音到动作先验;IDR则在潜在空间中应用长度感知的对角仿射映射,对参考样本估计的通道级统计偏差进行保守修正,从而实现更精准的个性保留与语义一致性。

链接: https://arxiv.org/abs/2605.06064
作者: Xiangyue Zhang,Yiyi Cai,Kunhang Li,Kaixing Yang,You Zhou,Zhengqing Li,Xuangeng Chu,Jiaxu Zhang,Haiyang Liu
机构: The University of Tokyo (东京大学); Shanda AI Research Tokyo (盛大AI东京研究院); Nanyang Technological University (南洋理工大学); Renmin University (中国人民大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose PersonaGesture, a diffusion-based pipeline for single-reference co-speech gesture personalization of unseen speakers. Given target speech and one motion clip from a new speaker, the model must synthesize gestures that follow the new utterance while retaining speaker-specific pose choices, without per-speaker optimization. This setting is useful for avatars and virtual agents, but it is hard because the reference mixes stable speaker habits with utterance-specific trajectories. PersonaGesture consists of two key components, Adaptive Style Infusion (ASI) and Implicit Distribution Rectification (IDR), to separate temporal identity evidence from residual statistic correction. A Style Perceiver first encodes the variable-length reference into compact speaker-memory tokens. ASI injects these tokens into denoising through zero-initialized residual cross-attention, enabling style evidence to affect motion formation without replacing the pretrained speech-to-motion prior. Building on this, IDR applies a length-aware diagonal affine map in latent space to correct residual channel-wise moments estimated from the same reference. Across BEAT2 and ZeroEGGS, we evaluate quantitative metrics, reference-identity controls, same-audio diagnostics, qualitative comparisons, and human preference. Experiments show that separating denoising-time speaker memory from conservative post-generation moment correction improves unseen-speaker personalization over collapsed style codes, full-reference attention, and one-clip finetuning. Project: this https URL.
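ASI 中"零初始化残差交叉注意力"的关键性质是:初始化时门控为零,风格注入分支完全不改变预训练先验的输出。下面用 NumPy 给出一个极简示意(单头、无投影矩阵,门控用标量近似零初始化的残差投影,均为示意性简化)。

```python
import numpy as np

def cross_attention(x, memory):
    """Plain single-head cross-attention (no learned projections, for brevity)."""
    logits = x @ memory.T * x.shape[-1] ** -0.5
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ memory

def asi_block(x, style_tokens, gate):
    """Adaptive Style Infusion (sketch): residual cross-attention to
    speaker-memory tokens, scaled by a zero-initialized gate so the
    pretrained speech-to-motion prior is untouched at initialization."""
    return x + gate * cross_attention(x, style_tokens)

rng = np.random.default_rng(0)
motion = rng.normal(size=(6, 16))               # denoiser hidden states
style = rng.normal(size=(4, 16))                # speaker-memory tokens
out_init = asi_block(motion, style, gate=0.0)   # identity at init
out_trained = asi_block(motion, style, gate=0.5)
```

训练过程中门控逐渐学到非零值,风格证据才开始影响动作生成,这与 ControlNet 式零初始化分支的设计动机一致。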

[CV-62] Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions

【速读】:该论文旨在解决文档视觉问答(Document Visual Question Answering, DocVQA)中模型推理过程缺乏可解释性的问题,即现有模型将与问题相关的证据提取与答案定位任务混杂在一起,且多为黑箱机制,难以验证预测结果是否基于正确的视觉证据。其解决方案的关键在于提出一种自解释的DocVQA框架CoExVQA,通过链式解释(chain-of-explanation)设计实现分步推理:首先识别与问题相关的证据区域,其次显式定位答案所在区域,最后仅从该锚定区域解码答案。这种结构化、可追溯的推理流程显著提升了模型的透明度和可验证性,并在PFL-DocVQA数据集上实现了当前最优的可解释性能,ANLS指标相比现有可解释基线提升12%。

链接: https://arxiv.org/abs/2605.06058
作者: Kjetil Indrehus,Adrian Duric,Changkyu Choi,Ali Ramezani-Kebrya
机构: University of Oslo (奥斯陆大学); Integreat – Norwegian Centre for Knowledge-driven Machine Learning (挪威知识驱动机器学习中心); TRUST – The Norwegian Centre for Trustworthy AI (挪威可信人工智能中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA’s chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves SotA explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over the current explainable baselines while providing transparent and verifiable predictions.

[CV-63] RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

【速读】:该论文旨在解决现有隐式视频到视频(V2V)生成方法在动态视角合成中面临的三大核心问题:一是依赖非因果的全序列处理导致计算延迟过高;二是采用刚性前缀式时间拼接架构,引发二次复杂度增长且难以支持实时流媒体或可变长度输入;三是缺乏对闭环轨迹中严重不一致性的有效缓解机制。解决方案的关键在于提出一个全新的自回归框架 \textttRealCam,其核心创新包括:首先设计一种基于跨帧上下文学习(Cross-frame In-context Learning)的高保真教师模型,通过交错源帧与目标帧形成同步上下文对,实现长度无关的泛化能力并天然支持因果适应,突破前缀瓶颈;其次采用带分布匹配蒸馏的自强制(Self-Forcing with Distribution Matching Distillation)策略将教师模型压缩为几步因果学生模型,显著提升推理效率;最后引入环闭数据增强(Loop-Closed Data Augmentation, LoopAug),从多视角数据集中合成全局一致的闭环序列,有效缓解闭合路径中的循环不一致性问题。

链接: https://arxiv.org/abs/2605.06051
作者: Youcan Xu,Jiaxin Shi,Zhen Wang,Wensong Song,Feifei Shao,Chen Liang,Jun Xiao,Long Chen
机构: Zhejiang University (浙江大学); Xmax.AI Ltd. (Xmax.AI有限公司); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Camera-controlled video-to-video (V2V) generation enables dynamic viewpoint synthesis from monocular footage, holding immense potential for interactive filmmaking and live broadcasting. However, existing implicit synthesis methods fundamentally rely on non-causal, full-sequence processing and rigid prefix-style temporal concatenation. This architectural paradigm mandates bidirectional attention, resulting in prohibitive computational latency, quadratic complexity scaling, and inherent incompatibility with real-time streaming or variable-length inputs. To overcome these limitations, we introduce \textttRealCam, a novel autoregressive framework for interactive, real-time camera-controlled V2V generation. We first design a high-fidelity teacher model grounded in a \textbfCross-frame In-context Learning paradigm. By interleaving source and target frames into synchronized contextual pairs, our design inherently enables length-agnostic generalization and naturally facilitates causal adaptation, breaking the rigid prefix bottleneck. We then distill this teacher into a few-step causal student via Self-Forcing with Distribution Matching Distillation, enabling efficient, on-the-fly streaming synthesis. Furthermore, to mitigate severe loop inconsistency in closed-loop trajectories, we propose \textbfLoop-Closed Data Augmentation (LoopAug), a novel paradigm that synthesizes globally consistent loop sequences from existing multiview datasets. Extensive experiments demonstrate that \textttRealCam achieves state-of-the-art visual fidelity and temporal consistency while enabling truly interactive camera control with orders-of-magnitude faster inference than existing paradigms. Our project page is at this https URL.
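其中"跨帧上下文学习"的序列布局可以用几行代码说明:源帧与目标帧交错排列成同步上下文对,使因果模型在预测每个新视角帧之前恰好看到对应的源帧,天然支持任意长度输入。以下为示意性伪布局,非论文的真实 token 化方案。

```python
def interleave_pairs(source_frames, target_frames):
    """Cross-frame in-context layout (sketch): interleave source and
    target frames into synchronized contextual pairs; a causal model
    attends to each source frame right before predicting its
    novel-view counterpart. Works for any sequence length."""
    assert len(source_frames) == len(target_frames)
    seq = []
    for s, t in zip(source_frames, target_frames):
        seq.extend([("src", s), ("tgt", t)])
    return seq

seq = interleave_pairs(["s0", "s1", "s2"], ["t0", "t1", "t2"])
```

与前缀式拼接(先放全部源帧再放目标帧)相比,这种布局避免了对全序列的双向注意力需求,是其支持流式生成的前提。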

[CV-64] Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization CVPR2026

【速读】:该论文旨在解决红外与可见光图像融合(Infrared and Visible Image Fusion, IVIF)中难以灵活适应异构需求的问题,尤其是如何实现同时满足人类视觉和机器视觉偏好的自适应融合。现有方法在处理不同应用场景下的偏好差异时表现不足,导致融合结果难以兼顾多样性与任务导向性。解决方案的关键在于提出DPOFusion框架,其核心由两个模块构成:一是属性对齐的潜在扩散模型(Property-Aligned Latent Diffusion Model, PALDM),通过潜在融合先验和联合条件损失生成具有多样化特性的候选融合结果;二是偏好可控的潜在扩散模型(Preference-Controllable Latent Diffusion Model, PCLDM),基于实例级直接偏好优化(Instance Direct Preference Optimization, IDPO)进行微调,从而实现对最终融合结果的偏好信号直接控制。该框架首次在IVIF中实现了从人类偏好到任务驱动模型的精准对齐,并显著提升了融合质量与跨任务迁移能力。

链接: https://arxiv.org/abs/2605.06049
作者: Weijian Su,Songqian Zhang,Yuqi Han,Jian Zhuang,Yongdong Huang,Qiang Zhang
机构: Dalian University of Technology (大连理工大学); North Minzu University (北方民族大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals. Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.
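IDPO 建立在标准 DPO 目标之上,后者对一对偏好/拒绝样本的损失为 −log σ(β·[(log π_w − log π_ref,w) − (log π_l − log π_ref,l)])。下面是该基础目标的标准库示意(IDPO 在此之上如何注入异构偏好权重属于论文细节,此处省略)。

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on one preferred/rejected fusion pair
    (instance-level sketch; IDPO additionally conditions on
    heterogeneous preference signals, omitted here)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Policy already prefers the chosen sample -> small loss.
small = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy prefers the rejected sample -> large loss.
large = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```

该目标无需显式奖励模型即可完成偏好对齐,这正是 DPOFusion 能在保持 DPO 训练效率的同时引入偏好控制的基础。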

[CV-65] Domain Generalization through Spatial Relation Induction over Visual Primitives

【速读】:该论文旨在解决域泛化(Domain Generalization, DG)中因结构化组成关系建模不足而导致的性能瓶颈问题,尤其在组合性域泛化(compositional domain generalization)任务上表现受限。其核心解决方案是提出Primitive-Aware Relational Structure for domain gEneralization (PARSE),通过将视觉识别显式分解为视觉基元(visual primitives)及其空间关系结构,利用软二元、三元和四元谓词对基元位置进行可微分的空间对齐建模,从而显式学习结构化的组成表示。关键创新在于设计了一个端到端架构,包含CNN骨干网络提取通用特征、概念瓶颈层生成可微分坐标下的基元热图,以及结构评分层评估基元间候选空间关系,并最终基于类别特定的关系组合联合证据计算分类概率,显著提升了CUB-DG等基准上的准确率。

链接: https://arxiv.org/abs/2605.06043
作者: Dat Nguyen,Duc-Duy Nguyen
机构: Harvard University (哈佛大学); Basis Research Institute (基础研究所); Hanoi University of Science and Technology (河内科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Domain generalization requires identifying stable representations that support reliable classification across domains. Most existing methods seek such stability through improving the training process, for example, through model selection strategies, data augmentation, or feature-alignment objectives. Although these strategies can be effective, they leave the representation learning of structural composition implicit, which may limit performance on compositional domain generalization benchmarks. In this work, we propose Primitive-Aware Relational Structure for domain gEneralization (PARSE), an image classification framework that factors visual recognition into visual primitives and their relational composition. We represent these compositions using soft binary, ternary, and quaternary predicates over primitive locations, yielding differentiable measures of spatial alignment that can be learned end-to-end. To learn primitives and relational structures jointly, we design an end-to-end architecture with three components: (1) a convolutional neural network (CNN) backbone that extracts general visual features, (2) a concept bottleneck layer that maps these features to primitive heatmaps with differentiable spatial coordinates, and (3) a structural scoring layer that evaluates candidate spatial relations among the detected primitives. We then compute class probability from the joint evidence of its class-specific relational compositions. Across CUB-DG and the DomainBed benchmark suite, PARSE improves accuracy by over 4.5 percentage points on CUB-DG and remains competitive with existing DG methods on DomainBed.
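所谓"基元位置上的软谓词",可以用 sigmoid 对坐标差做平滑阈值来实现,从而使空间关系可微。下面以二元谓词 left_of 与三元谓词 between 为例给出 NumPy 示意(坐标归一化到 [0,1],τ 控制软硬程度;具体谓词形式为示意性假设)。

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def left_of(a, b, tau=0.1):
    """Soft binary predicate: degree to which primitive a lies left of b."""
    return sigmoid((b[0] - a[0]) / tau)

def between(a, b, c, tau=0.1):
    """Soft ternary predicate: b lies horizontally between a and c."""
    return left_of(a, b, tau) * left_of(b, c, tau)

# Hypothetical bird primitives with normalized (x, y) coordinates.
beak, eye, tail = np.array([0.2, 0.5]), np.array([0.4, 0.5]), np.array([0.8, 0.5])
s_lo = left_of(beak, eye)         # high: beak is left of eye
s_bet = between(beak, eye, tail)  # high: eye between beak and tail
```

由于谓词对坐标处处可微,基元热图与关系评分层可以端到端联合训练,这是 PARSE 区别于硬性几何规则的关键。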

[CV-66] PlotPick: AI-powered batch extraction of numerical data from scientific figures

【速读】:该论文旨在解决系统性综述与荟萃分析中数值数据提取效率低下的问题,即研究人员常需从科学图表中手动提取结构化数据,这一过程耗时且难以扩展。解决方案的关键在于提出并实现了一个名为PlotPick的开源工具,该工具利用视觉语言模型(Vision-Language Models, VLMs)批量自动从科学图表中提取结构化表格数据。实验表明,六种来自不同提供商的VLM在两个主流图表转表格基准(ChartX和PlotQA)上均显著优于专用模型DePlot,尤其在训练数据未覆盖的图表类型(如箱线图)上表现突出,验证了VLM在通用性和鲁棒性上的优势。

链接: https://arxiv.org/abs/2605.06021
作者: Tommy Carstensen
机构: Copenhagen Research Centre for Biological and Precision Psychiatry (哥本哈根生物与精准精神病学研究中心); Mental Health Centre Copenhagen (哥本哈根心理健康中心); Copenhagen University Hospital (哥本哈根大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
备注: 7 pages, 2 figures, 2 tables. Software available at this https URL and this https URL

点击查看摘要

Abstract:Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures. We evaluate six VLMs from three providers on two established chart-to-table benchmarks (ChartX and PlotQA) and compare against the dedicated chart-to-table model DePlot. All six VLMs outperform DePlot on both benchmarks. On ChartX (restricted to bar charts, line charts, box plots, and histograms; n=300), VLMs achieve 88-96% recall versus 71% for DePlot. On PlotQA (n=529), VLMs achieve 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from the dedicated models’ training data: on box plots, DePlot achieves 24% RMSF1 while VLMs achieve 83-97%. PlotPick is available at this https URL.
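图表转表格评测中常用的数值召回率通常允许一定相对误差(常见约定为 5% 以内算命中,每个预测值至多匹配一次)。以下为该类指标的一个示意实现;论文实际采用的匹配规则可能不同,阈值仅为假设。

```python
def numeric_recall(gt_values, pred_values, rel_tol=0.05):
    """Recall over ground-truth numbers (sketch): a ground-truth value is
    recovered if some still-unmatched prediction lies within rel_tol
    relative error. A common chart-to-table convention; the paper's
    exact matching rules may differ."""
    preds = list(pred_values)
    hits = 0
    for gt in gt_values:
        for i, p in enumerate(preds):
            denom = abs(gt) if gt != 0 else 1.0
            if abs(p - gt) / denom <= rel_tol:
                hits += 1
                preds.pop(i)      # each prediction matches at most once
                break
    return hits / len(gt_values)

r = numeric_recall([10.0, 20.0, 30.0], [10.2, 19.0, 55.0])  # two of three recovered
```

这类容差匹配解释了为何不同模型在同一基准上的"召回率"可比:它衡量的是读数精度而非字符串完全一致。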

[CV-67] T2I-VeRW: Part-level Fine-grained Perception for Text-to-Image Vehicle Retrieval

【速读】:该论文旨在解决文本到图像的车辆再识别(Text-to-Image Vehicle Re-identification)问题,即在仅有目击者描述而无查询图像的情况下,从非重叠摄像头拍摄的图像中检索最相似的车辆图像。其核心挑战在于如何实现细粒度跨模态对齐,尤其是在文本描述与图像局部部件之间建立精准对应关系。解决方案的关键在于提出PFCVR模型:首先在部件级别构建图像与文本的局部配对,并引入可学习的部件查询令牌(part-query tokens),聚合局部特征与全局语义信息以进行显式局部对齐;其次设计双向掩码恢复模块(bi-directional mask recovery module),通过双模态相互重建被掩码内容,隐式地将局部对应关系扩展至全局特征对齐,从而显著提升跨模态匹配性能。

链接: https://arxiv.org/abs/2605.06012
作者: Xiao Wang,Ziwen Wang,Weizhe Kong,Wentao Wu,Yuehang Li,Aihua Zheng,Chenglong Li,Jin Tang
机构: Anhui University (安徽大学); School of Artificial Intelligence (人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vehicle Re-identification (Re-ID) aims to retrieve the most similar image to a given query from images captured by non-overlapping cameras. Extending vehicle Re-ID from image-only queries to text-based queries enables retrieval in real-world scenarios where only a witness description of the target vehicle is available. In this paper, we propose PFCVR, a Part-level Fine-grained Cross-modal Vehicle Retrieval model for text-to-image vehicle re-identification. PFCVR constructs locally paired images and texts at the part level and introduces learnable part-query tokens that aggregate both part-specific and full-sentence context before aligning with visual part features. On top of this explicit local alignment, a bi-directional mask recovery module lets each modality reconstruct its masked content under the guidance of the other, implicitly bridging local correspondences into global feature alignment. Furthermore, we construct a new large-scale dataset called T2I-VeRW, which contains 14,668 images covering 1,796 vehicle identities with fine-grained part-level annotations. Experimental results on the T2I-VeRI dataset show that PFCVR achieves 29.2% Rank-1 accuracy, improving over the best competing method by +3.7% percentage points. On the newly proposed T2I-VeRW benchmark, PFCVR achieves 55.2% Rank-1 accuracy, outperforming a comprehensive set of recent state-of-the-art methods. Source code will be released on this https URL

[CV-68] Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models

【速读】:该论文旨在解决纯RGB视觉模型在夜间、雾霾等复杂场景下可靠性不足的问题,以及现有高保真红外与可见光图像融合方法因延迟过高而难以部署于实时边缘计算环境的挑战。解决方案的关键在于提出FusionProxy——一个独立可插拔的实时图像融合模块,其核心创新在于利用教师样本集合的两种互补统计特性:一是原始图像空间中的像素级方差,用于加权像素级监督;二是冻结基础模型内部的像素级方差,用于空间上引导特征对齐。该设计使FusionProxy无需联合优化即可无缝集成到任意视觉感知系统中,并在多种硬件平台上实现接近实时的推理速度,显著提升全天候感知任务(如闭合回路自动驾驶)的鲁棒性与实用性。

链接: https://arxiv.org/abs/2605.06010
作者: Yuchen Guo,Junli Gong,Wenjun Dong,Yiuming Cheung,Weifeng Su
机构: Northwestern University (西北大学); Northeastern University (东北大学); Hong Kong Baptist University (香港浸会大学); Beijing Normal - Hong Kong Baptist University (北京师范大学-香港浸会大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Purely RGB-based vision models often fail to provide reliable cues in challenging scenarios such as nighttime and fog, leading to degraded performance and safety risks. Infrared imaging captures heat-emitting sources and provides critical complementary information, but existing high-fidelity fusion methods suffer from prohibitive latency, rendering them impractical for real-time edge deployment. To address this, we propose FusionProxy, a real-time image fusion module designed as a fully independent, plug-and-play component with diffusion level quality. FusionProxy exploits two complementary statistics of a teacher sample ensemble: per-pixel variance in raw image space, used to weight pixel-level supervision, and per-pixel variance inside frozen foundation backbones, used to route feature-level alignment spatially. Once trained, FusionProxy can be directly integrated into any visual perception system without joint optimization. Extensive experiments demonstrate that our method achieves superior performance on static recognition tasks and significantly enhances robustness in dynamic tasks, including closed-loop autonomous driving. Crucially, FusionProxy achieves real-time inference speeds on diverse platforms, from high-end GPUs to commodity hardware, providing a flexible and generalizable solution for all-day perception.
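"利用教师样本集合的逐像素方差来加权像素级监督"可以写成如下示意:教师样本一致(方差小)的像素获得高权重,分歧大的像素被降权。权重函数 1/(1+var) 为一种合理假设,论文的具体加权形式可能不同。

```python
import numpy as np

def variance_weighted_l1(student, teacher_samples):
    """Pixel loss weighted by teacher-ensemble disagreement (sketch):
    pixels where teacher samples agree (low variance) get full weight;
    ambiguous pixels are down-weighted."""
    teacher_mean = teacher_samples.mean(axis=0)
    teacher_var = teacher_samples.var(axis=0)
    w = 1.0 / (1.0 + teacher_var)     # one plausible weighting; the paper may differ
    return float((w * np.abs(student - teacher_mean)).mean())

rng = np.random.default_rng(0)
teachers = rng.normal(0.5, 0.05, size=(4, 8, 8))   # 4 teacher samples, 8x8 image
loss = variance_weighted_l1(np.full((8, 8), 0.5), teachers)
```

同样的方差统计也可在冻结基础模型的特征空间中计算,用于在空间上分配特征级对齐的强度,这正是 FusionProxy 两个互补统计量的含义。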

[CV-69] Neuromorphic visual attention for Sign-language recognition on SpiNNaker

【速读】:该论文旨在解决现有手语识别方法在实时部署中面临的高延迟和高功耗问题,尤其针对美国手语(ASL)指拼写识别任务。其关键解决方案是提出一种端到端的类脑架构,集成基于脉冲的视觉注意力机制以实现在线兴趣区域提取,并将轻量级脉冲神经网络部署于SpiNNaker类脑计算平台上,从而在保证识别性能的同时显著降低能耗(0.565 mW)和延迟(3 ms),展现出面向边缘计算场景的可行性与高效性。

链接: https://arxiv.org/abs/2605.06005
作者: Sarka Liskova,Olha Vedmedenko,Mazdak Fatahi,Matej Hoffmann,P. Michael Furlong,Giulia D'Angelo
机构: Faculty of Information Technology, Czech Technical University in Prague, Czech Republic(布拉格捷克技术大学信息学院); Dept. of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic(布拉格捷克技术大学电气工程学院控制系); Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France(里尔大学, 国家科学研究中心, 里尔中央理工学院, CRIStAL联合实验室UMR 9189)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sign-language recognition has achieved substantial gains in classification accuracy in recent years; however, the latency and power requirements of most existing methods limit their suitability for real-time deployment. Neuromorphic sensing and processing offer an alternative paradigm based on sparse, event-driven computation that supports low-latency and energy-efficient perception. In this work, we introduce an end-to-end neuromorphic architecture for American Sign Language (ASL) fingerspelling recognition that integrates a spiking visual attention mechanism for online region-of-interest extraction with a compact spiking neural network deployed on the SpiNNaker neuromorphic platform. We benchmark the proposed system against two datasets: a synthetically generated event-based version of the Sign Language MNIST dataset and a natively recorded ASL-DVS dataset, whilst providing a comprehensive overview of Sign-language recognition and related work. This work yields competitive performance in simulation (92.27%) and comparable performance on neuromorphic hardware deployment (83.1%), while achieving the most energy-efficient architecture (0.565 mW) and low latency (3 ms) across all benchmarked approaches. Despite its compact design, the system demonstrates the suitability of task-dependent visual attention applications for edge deployment.
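脉冲神经网络的基本计算单元是漏积分发放(LIF)神经元:膜电位随时间泄漏、累积输入电流,越过阈值即发放脉冲并复位。下面是教科书式的离散时间示意(并非 SpiNNaker 上的具体神经元模型,参数均为假设)。

```python
def lif_spikes(input_current, v_th=1.0, leak=0.9):
    """Discrete-time leaky integrate-and-fire neuron (textbook sketch):
    the membrane potential leaks, integrates input, and resets to 0
    after crossing the threshold."""
    v, spikes = 0.0, []
    for i in input_current:
        v = leak * v + i
        if v >= v_th:
            spikes.append(1)
            v = 0.0               # reset after spike
        else:
            spikes.append(0)
    return spikes

# Sustained weak input charges slowly; a strong pulse fires immediately.
spikes = lif_spikes([0.3, 0.3, 0.3, 0.3, 0.0, 0.0, 1.2])
```

正是这种仅在事件发生时才计算的稀疏发放机制,使文中系统能达到 0.565 mW 的功耗与 3 ms 的延迟量级。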

[CV-70] 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

【速读】:该论文旨在解决视觉语言模型(VLMs)在单目视频中进行动态空间推理(dynamic spatial reasoning)的难题,即如何让模型理解并预测场景随时间演变的物理规律,而传统方法要么仅用文本描述时空关系(导致冗长且不精确),要么依赖外部几何模块(增加复杂度且未提升模型内在能力)。解决方案的关键在于提出4DThinker框架,其核心创新是引入“4D思维”机制——通过动态潜在心理意象(dynamic latent mental imagery),在连续隐藏空间中内生模拟场景演化过程;具体实现上包含两个关键组件:一是无标注的数据生成管道,从原始视频合成4D推理数据;二是动态意象微调(DIFT)与4D强化学习(4DRL),分别联合监督文本标记与4D潜在变量,并通过基于结果的奖励机制优化策略梯度,从而实现稳定且高效的动态空间推理能力。

链接: https://arxiv.org/abs/2605.05997
作者: Zhangquan Chen,Manyuan Zhang,Xinlei Yu,Xiang An,Bo Li,Xin Xie,ZiDong Wang,Mingze Sun,Shuang Chen,Hongyu Li,Xiaobin Hu,Ruqi Huang
机构: Tsinghua University, SIGS; The Chinese University of Hong Kong; National University of Singapore; LMMs-Lab; University of California, Los Angeles
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 16 figures

点击查看摘要

Abstract:Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to “think with 4D” through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at this https URL.
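4DRL 中"将策略梯度限制在文本 token 上"的做法,相当于在基于结果奖励的 REINFORCE 损失中对 4D 潜变量位置做掩码。以下为示意实现(单步、无基线项,均为简化假设)。

```python
import numpy as np

def masked_pg_loss(logprobs, is_text_token, reward):
    """Outcome-based REINFORCE loss restricted to text tokens (sketch):
    4D latent positions are masked out of the policy gradient, so only
    textual decisions are optimized by the reward."""
    mask = np.asarray(is_text_token, dtype=float)
    lp = np.asarray(logprobs, dtype=float)
    return float(-(reward * mask * lp).sum() / max(mask.sum(), 1.0))

logps = [-0.5, -1.2, -0.3, -0.8]   # per-token log-probs
text_mask = [1, 0, 1, 1]           # position 1 is a 4D latent, excluded
loss = masked_pg_loss(logps, text_mask, reward=1.0)
```

4D 潜变量本身仅在 DIFT 阶段受联合监督,强化阶段不回传其梯度,这是文中保证优化稳定性的关键设计。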

[CV-71] iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring

【速读】:该论文旨在解决当前移动设备上运动模糊恢复(motion blur restoration)评估中因使用聚合指标而掩盖不同模糊难度下性能差异的问题,从而无法准确反映模型在真实部署环境中的行为。其解决方案的关键在于提出一个难度分层的基准数据集iPhoneBlur,该数据集包含7,400对从高帧率iPhone 17 Pro视频中合成的图像对,通过PSNR引导的自适应时间窗口划分方法将样本分为Easy、Medium和Hard三类,并验证了各层级间光学流幅值单调增加2.2倍,确保分层有效性;同时每张图像附带详尽元数据,支持ISP感知和难度自适应恢复策略的研究。此设计使得模型在资源受限边缘系统中的可靠性与失效模式得以系统性评估,显著揭示了传统聚合指标所隐藏的高达7–9 dB的性能下降。

链接: https://arxiv.org/abs/2605.05990
作者: Abdullah Al Shafi,Kazi Saeed Alam
机构: Khulna University of Engineering & Technology, Bangladesh (库尔纳工程技术大学,孟加拉国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 Pages, 12 figures

点击查看摘要

Abstract:Motion blur restoration on consumer mobile devices is typically evaluated using aggregate metrics that obscure performance variation across blur difficulty, masking model behavior under real deployment conditions. This work introduces iPhoneBlur, a difficulty-stratified benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard categories through PSNR-guided adaptive temporal windowing, with stratification validated by monotonic 2.2x increase in optical flow magnitude across tiers. Each sample includes comprehensive metadata enabling investigation of ISP-aware and difficulty-adaptive restoration strategies. Spectral analysis confirms synthesized blur exhibits high-frequency suppression patterns consistent with authentic motion degradation. Evaluation of six architectures reveals consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting. The benchmark further exposes a domain gap between professional and consumer cameras which targeted fine-tuning substantially recovers. By coupling difficulty stratification with deployment-critical metadata, iPhoneBlur enables systematic assessment of model reliability and failure modes for resource-constrained edge systems.
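"PSNR 引导的难度分层"可以概括为:对每个合成样本计算清晰帧与模糊帧之间的 PSNR,按阈值落入 Easy/Medium/Hard 三档。以下为示意实现,其中的 dB 阈值为假设值,并非论文采用的具体数值。

```python
import math

def psnr(mse, peak=1.0):
    """PSNR in dB from mean squared error, for signals in [0, peak]."""
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

def difficulty_tier(mse, easy_db=32.0, medium_db=26.0):
    """PSNR-guided stratification (sketch): higher PSNR between the sharp
    frame and its blurred version means milder blur. Thresholds here are
    illustrative, not the paper's."""
    p = psnr(mse)
    if p >= easy_db:
        return "Easy"
    return "Medium" if p >= medium_db else "Hard"

tiers = [difficulty_tier(m) for m in (1e-4, 1e-3, 1e-2)]  # mild -> severe blur
```

分层有效性在文中由各档光流幅值单调递增(共 2.2 倍)独立验证,避免了阈值划分流于形式。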

[CV-72] Prompt-Free and Efficient SAM2 Adaptation for Biomedical Semantic Segmentation via Dual Adapters ICIP2026

【速读】:该论文旨在解决生成式 AI(Generative AI)在生物医学图像分割任务中因领域偏移(domain shift)和提示依赖性(prompt dependency)导致性能下降的问题。其关键解决方案是提出一种无需提示(prompt-free)、参数高效的微调框架,通过引入卷积位置编码生成器(convolutional Positional Encoding Generator)以适应不同长宽比的输入,并采用双适配器策略:高精度适配器(High-Performance Adapter)利用可变形卷积实现精确边界建模,轻量适配器(Lightweight Adapter)则通过结构重参数化显著降低推理延迟,从而在保持高分割精度的同时大幅提升计算效率。

Link: https://arxiv.org/abs/2605.05979
Authors: Hinako Mitsuoka, Kazuhiro Hotta
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICIP 2026

Abstract:Segment Anything Model 2 (SAM2) demonstrated impressive zero-shot capabilities on natural images but faces challenges in biomedical segmentation due to significant domain shifts and prompt dependency. To address these limitations, we propose a prompt-free, parameter-efficient fine-tuning framework designed for multi-class segmentation on variable-sized inputs. We introduce a convolutional Positional Encoding Generator to adapt effectively to arbitrary aspect ratios and present a dual-adapter strategy: High-Performance Adapter utilizing deformable convolutions for precise boundary modeling and Lightweight Adapter employing structural re-parameterization to minimize inference latency. Experiments on ISBI 2012, Kvasir-SEG, Synapse, and ACDC datasets demonstrate that our approach significantly outperforms strong adaptation baselines. Specifically, our method improved segmentation accuracy by up to 19.66% over the vanilla SAM2, while reducing computational costs by approximately 87% compared to heavyweight medical SAM adaptations, establishing a superior trade-off between accuracy and efficiency.

[CV-73] RAWild: Sensor-Agnostic RAW Object Detection via Physics-Guided Curve and Grid Modeling

Summary: This paper tackles domain generalization for object detection on cross-sensor RAW images: differences in exposure conditions, spectral sensitivities, and bit depths across devices create domain gaps far larger than in sRGB, so existing models struggle to perform consistently on heterogeneous sensor data. The key is RAWild, a physics-guided global-local tone mapping framework that factors sensor-induced variation into a global tonal correction and a spatially adaptive local color adjustment, both driven by RAW distribution priors, allowing a single network to be trained jointly across multiple sensors; a physics-based RAW simulation pipeline further synthesizes realistic outputs spanning diverse spectral sensitivities, illuminants, and sensor non-idealities, improving cross-sensor robustness and generalization.

Link: https://arxiv.org/abs/2605.05941
Authors: Shuhong Liu, Gengjia Chang, Jun Liu, Xuangeng Chu, Yinqiang Zheng, Tatsuya Harada, Ziteng Cui
Affiliations: The University of Tokyo; I2WM; RIKEN
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Camera sensor RAW data offers intrinsic advantages for object detection, including deeper bit depth, preserved physical information, and freedom from image signal processor (ISP) distortions. However, varying exposure conditions, spectral sensitivities, and bit depths across devices introduce substantially larger domain gaps than sRGB, making sensor-agnostic generalization a fundamental challenge. In this study, we present RAWild, a physics-guided global-local tone mapping framework for sensor-agnostic RAW object detection. By factoring sensor-induced variations into a global tonal correction and a spatially adaptive local color adjustment, both driven by RAW distribution priors, our framework enables a single network to train jointly across heterogeneous sensors. To further support cross-sensor generalization, we construct a physics-based RAW simulation pipeline that synthesizes realistic sensor outputs spanning diverse spectral sensitivities, illuminants, and sensor non-idealities. Extensive experiments across multiple RAW benchmarks covering bit depths from 10 to 24 demonstrate state-of-the-art (SOTA) performance under single-dataset, mixed-dataset, and challenging robustness settings.

[CV-74] Whole-body CT attenuation and volume charts from routine clinical scans via evidence-grounded LLM report filtering

Summary: This paper addresses the lack of large-scale healthy reference distributions for interpreting quantitative CT biomarkers such as organ volume and tissue attenuation; because clinical data are heavily enriched with pathology, building a pathology-free reference cohort is challenging. The key is an evidence-grounded, cross-verified large language model (LLM) ensemble that filters pathological findings from radiology reports, yielding pathology-reduced cohorts from over 350,000 CT examinations. Five LLMs first flag structure-level abnormality candidates grounded in verbatim report text, then resolve disagreements via cross-verification; distribution-aware generalized additive models for location, scale, and shape (GAMLSS) are then used to build whole-body reference charts for 106 anatomical structures, accounting for covariates such as age, sex, contrast enhancement, and acquisition parameters, enabling standardized quantitative phenotyping and multi-site imaging studies.

Link: https://arxiv.org/abs/2605.05933
Authors: Christian Wachinger, Bernhard Renger, Christopher Späth, Jan Kirschke, Marcus Makowski
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Supplement available at: this https URL

Abstract:Interpreting quantitative CT biomarkers, such as organ volume and tissue attenuation, requires large-scale healthy reference distributions. However, creating these is challenging because clinical datasets are often heavily enriched with pathology. Here, we develop an evidence-grounded, cross-verified large language model (LLM) ensemble to filter pathological findings from radiology reports, enabling the construction of pathology-reduced cohorts from over 350,000 CT examinations. Five LLMs, first, flag structure-level abnormality candidates grounded in verbatim report evidence and, second, resolve disagreements via cross-verification. Using distribution-aware generalized additive models for location, scale, and shape, we establish comprehensive whole-body reference charts for 106 anatomical structures (volumes and attenuation) across adulthood, accounting for age, sex, contrast enhancement, and acquisition parameters. Longitudinal analyses reveal structure- and contrast-dependent changes distinct from cross-sectional trends. These resources facilitate covariate-adjusted centile scoring from routine CT, supporting standardized quantitative phenotyping, multi-site imaging studies, and scalable opportunistic screening research.
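Covariate-adjusted centile scoring against a reference chart can be illustrated with a deliberately simplified stand-in: the hypothetical `reference_mu_sigma` curve below replaces a fitted GAMLSS model, uses only age as a covariate, and assumes a Gaussian reference distribution:

```python
import math

def reference_mu_sigma(age: float):
    """Hypothetical age-dependent reference curve for one structure's volume (mL).
    A real GAMLSS fit would model location, scale, and shape from data."""
    mu = 1500.0 - 3.0 * max(age - 30.0, 0.0)  # slow decline after age 30
    sigma = 120.0 + 0.5 * age
    return mu, sigma

def centile(value: float, age: float) -> float:
    """Map a measured value to a reference centile via a Gaussian CDF."""
    mu, sigma = reference_mu_sigma(age)
    z = (value - mu) / sigma
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(centile(1500.0, 30.0), 1))  # value on the median curve -> 50.0
```

In the paper's setting, additional covariates (sex, contrast enhancement, acquisition parameters) would enter the location and scale curves in the same way age does here.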

[CV-75] Backdoor Mitigation in Object Detection via Adversarial Fine-Tuning

Summary: This paper addresses defending object detectors against backdoor attacks when the defender has only the compromised detector and a small clean dataset, without knowing the attack objective. Classification-oriented adversarial fine-tuning transfers poorly to detection, because its adversarial examples do not match the detection attack space (object misclassification or disappearance) and standard detection losses dilute the repair signal. The key is a detection-aware adversarial fine-tuning framework: soft-branch minimisation uses a soft gate to combine objectives aligned with misclassification and disappearance attacks, together with detection-aware classification-loss maximisation; a dual-objective fine-tuning loss then focuses on target-matched predictions, concentrating the repair on the backdoor behaviour. Experiments on CNN- and Transformer-based detectors show markedly reduced attack success while preserving clean detection performance.

Link: https://arxiv.org/abs/2605.05928
Authors: Kealan Dunnett, Reza Arablouei, Dimity Miller, Volkan Dedeoglu, Raja Jurdak
Affiliations: Queensland University of Technology; CSIRO
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:Backdoor attacks can implant malicious behaviours into deep models while preserving performance on clean data, posing a serious threat to safety-critical vision systems. Although backdoor mitigation has been studied extensively for image classification, defenses for object detection remain comparatively underdeveloped. Adversarial fine-tuning is a common backdoor mitigation approach in classification, but adapting it to detection is nontrivial as classification-oriented adversarial generation does not match the detection attack space, where attacks may cause object misclassification or disappearance, and standard detection losses can dilute the repair signal across many predictions. We address these challenges through a detection-aware adversarial fine-tuning framework for mitigating object-detection backdoors when the defender has access only to a compromised detector and a small clean dataset, without knowing the attack objective. For adversarial generation that does not require knowledge of the attack objective, we introduce soft-branch minimisation, which uses a soft gate to combine objectives aligned with misclassification and disappearance attacks, together with a detection-aware classification-loss maximisation. For targeted repair, we introduce a dual-objective fine-tuning loss applied to target-matched predictions, concentrating the defensive update on predictions most relevant to the backdoor behaviour. Experiments across CNN- and Transformer-based detectors show that our approach more effectively reduces attack success while preserving true detections, compared with classification-oriented baselines, and maintains competitive clean detection performance.
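The soft-gated combination of the two attack-aligned objectives might look like the following sketch. The softmax-over-negative-losses gate is an assumption for illustration, not the paper's exact formulation:

```python
import numpy as np

def soft_gate_loss(loss_miscls: float, loss_disappear: float, temperature: float = 1.0):
    """Soft-branch combination: a softmax gate weights the two attack-aligned
    objectives toward whichever one the current adversarial example minimises
    more, so the defender need not know the true attack objective.
    (Illustrative gate, assumed, not taken from the paper.)"""
    losses = np.array([loss_miscls, loss_disappear], dtype=np.float64)
    gate = np.exp(-losses / temperature)
    gate /= gate.sum()
    return float(np.dot(gate, losses)), gate

combined, gate = soft_gate_loss(0.2, 2.0)
print(round(combined, 3), gate.round(3))
```

The gate puts most weight on the branch with the lower loss, so the combined objective tracks whichever attack mode the adversarial generation is currently exploiting.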

[CV-76] Think then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Summary: This paper targets the limited generalization and training instability of video reward models (RMs). Discriminative RMs regress rewards efficiently but are prone to shortcut learning and rely on massive data; generative RMs with Chain-of-Thought (CoT) reasoning are more interpretable and generalize better, but coupling reasoning and scoring in a single autoregressive chain makes them hard to optimize. The key is DeScore, a decoupled "think-then-score" framework: an MLLM first generates an explicit CoT, and a separate discriminative scoring module (a learnable query token plus a regression head) then predicts the final reward. Training proceeds in two stages: a discriminative cold start with a random-mask mechanism, followed by dual-objective reinforcement learning that separately refines reasoning quality and calibrates the reward, so higher-quality reasoning translates directly into better performance.

Link: https://arxiv.org/abs/2605.05922
Authors: Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang, Jinghan Li, Meng Wang, Xintao Wang, Pengfei Wang, Kuien Liu, Xiang Wang
Affiliations: University of Science and Technology of China; Kling Team, Kuaishou Technology; Institute of Software, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, Generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.

[CV-77] From Drops to Grid: Noise-Aware Spatio-Temporal Neural Process for Rainfall Estimation

Summary: This paper addresses the low resolution and bias of traditional rainfall observations, which struggle to capture local rainfall, especially when fusing sparse surface observations with spatial information from radar. The key is DropsToGrid, a Neural Process-based model that fuses temporal sequences from noisy, irregularly distributed private weather stations with radar spatial context via multi-scale feature extraction, temporal attention, and multi-modal fusion, producing continuous, high-resolution rainfall field estimates with explicit uncertainty quantification.

Link: https://arxiv.org/abs/2605.05912
Authors: Rafael Pablos Sarabia, Joachim Nyborg, Morten Birk, Ira Assent
Affiliations: Aarhus University; Cordulus
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-resolution rainfall observations are crucial for weather forecasting, water management, and hazard mitigation. Traditional operational measurements are often biased and low-resolution, limiting their ability to capture local rainfall. Accurate high-resolution rainfall maps require integrating sparse surface observations, yet existing deep learning densification methods are hindered by rainfall’s skewed, localized nature, noise, and limited spatio-temporal fusion. We present DropsToGrid, a Neural Process-based method that generates dense rainfall fields by fusing temporal sequences from noisy, irregularly distributed private weather stations with spatial context from radar. Leveraging multi-scale feature extraction, temporal attention, and multi-modal fusion, the model produces stochastic, continuous rainfall estimates and explicitly quantifies uncertainty. Evaluations on real-world datasets demonstrate that DropsToGrid outperforms both operational and deep learning baselines, generating accurate high-resolution rainfall maps with well-calibrated uncertainty, even when only few stations are available and in cross-regional scenarios.

[CV-78] Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model

Summary: This paper addresses the neglect of class-specific knowledge in existing prompt-learning methods for vision-language models (VLMs), which limits zero-shot classification. Class-shared prompts lack fine-grained supervision and invite misclassification, while instance-specific prompts ignore class-level information shared across instances, so samples of one class may be split across several classes. To supplement class-specific knowledge effectively, the authors propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework with two key components: class-specific prompts are encoded from few-shot samples and stored in a class-level knowledge bank, and a query-key prompt matching mechanism lets each test instance retrieve relevant class-level knowledge from the bank to refine its prediction, improving performance on both base and novel classes.

Link: https://arxiv.org/abs/2605.05910
Authors: Junhui Yin, Nan Pu, Xinyu Zhang, Lingfeng Yang, Lin Wu, Xiaojie Wang, Zhun Zhong
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by International Journal of Computer Vision

Abstract:Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt design, leading to suboptimal outcomes. The underlying reasons are: 1) class-specific prompts offer more fine-grained supervision compared to coarse class-shared prompts, which helps prevent misclassification of data from different classes into a single class; 2) compared to class-specific prompts, instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing data from the same class to be divided into multiple classes. To effectively supplement the class-specific knowledge into existing methods, we propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI comprises two key components, i.e., class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples that belong to the same class and stores the learned prompts in a class-level knowledge bank. The latter provides a plug-and-play mechanism for each test instance to retrieve relevant class-level knowledge from the knowledge bank and inject such knowledge to refine model predictions. Extensive experiments demonstrate that our CAKI effectively improves the performance of existing methods on base and novel classes. Code is publicly available at this https URL.
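At its core, query-key prompt matching against a class-level knowledge bank is nearest-neighbor retrieval in an embedding space. A toy sketch with random vectors and a hypothetical `retrieve_prompt` helper (names and dimensions are assumptions, not the paper's API):

```python
import numpy as np

def retrieve_prompt(query: np.ndarray, keys: np.ndarray, prompts: np.ndarray, top_k: int = 1):
    """Query-key matching: pick the class-level prompt(s) whose key embedding
    is closest (by cosine similarity) to the test instance's query feature."""
    qn = query / np.linalg.norm(query)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = kn @ qn
    idx = np.argsort(sims)[::-1][:top_k]
    return prompts[idx], idx

rng = np.random.default_rng(1)
keys = rng.normal(size=(5, 8))     # one key per class in the knowledge bank
prompts = rng.normal(size=(5, 8))  # stored class-specific prompt vectors
query = keys[3] + 0.01 * rng.normal(size=8)  # a test feature near class 3's key
_, idx = retrieve_prompt(query, keys, prompts)
print(idx)
```

The retrieved prompt would then be injected into the prediction pipeline to refine the model's output for that instance.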

[CV-79] Architecture-agnostic Lipschitz-constant Bayesian header and its application to resolve semantically proximal classification errors with vision transformers

Summary: This paper addresses the degraded generalization of supervised deep models under label noise, especially structured noise producing semantically proximal classification errors, where standard robust training often fails. The key is an architecture-agnostic Lipschitz-constant Bayesian header that can be attached to feature extractors such as vision transformers, yielding the bi-Lipschitz-constrained Bayesian Vision Transformer (LipB-ViT). Its central innovation is enforcing spectral normalization on both the mean and log-variance of the variational weights, which calibrates predictive uncertainty and suppresses noise amplification; a new metric jointly capturing uncertainty and confidence, combined with an adaptive arithmetic-mean fusion scheme, markedly improves mislabeled-sample detection, reaching a recall above 0.93 at a 15% semantic mislabeling rate and outperforming k-nearest-neighbor-based methods.

Link: https://arxiv.org/abs/2605.05908
Authors: Frederik Schäfer, Luis Mandl, Lars Kälber, Tim Ricken
Affiliations: University of Stuttgart; Rockwell Collins Deutschland GmbH; University of Leipzig Medical Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 4 tables; supplement: 5 pages with 5 figures; 18 pages total including references

Abstract:Label noise remains a critical bottleneck for the generalization of supervised deep learning models, particularly when errors are structured rather than random. Standard robust training methods often fail in the presence of such semantically proximal classification errors. This work presents an architecture-agnostic Lipschitz-constant Bayesian header that can be integrated into feature extractors such as vision transformers, yielding the bi-Lipschitz-constrained Bayesian Vision Transformer (LipB-ViT). In contrast to conventional Bayesian layers, our approach enforces spectral normalization on both the mean and log-variance of the variational weights, which promotes calibrated predictive uncertainty and mitigates noise amplification. We further propose a novel metric to jointly capture uncertainty and confidence across misclassification rates, as well as an adaptive arithmetic-mean fusion scheme that combines feature-space proximity with predictive uncertainty to detect corrupted labels outperforming the state of the art k-nearest neighbor based identification methods by more than 7% reaching a recall of more than 0.93 at 15% semantically misclassified labels. Although computational costs increase due to Monte Carlo sampling, the method offers plug-and-play compatibility with pre-trained backbones and consistent hyperparameters across domains, suggesting strong utility for high-stakes applications with variable annotation reliability. The stabilized confidence estimates serve as the foundation for an analysis pipeline that jointly assesses dataset quality and label noise, yielding a second novel metric for their combined quantification. Lastly, we systematically evaluate LipB-ViT under both structured (adversarial) and unstructured noise at inference time, demonstrating its robustness in realistic high-noise and attack scenarios. We compare its performance against baseline methods.
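Spectral normalization of a weight matrix, the constraint applied here to both the mean and log-variance of the variational weights, can be sketched with a plain power-iteration estimate of the leading singular value. A generic sketch of the standard technique, not the paper's exact layer:

```python
import numpy as np

def spectral_normalize(W: np.ndarray, n_iters: int = 200) -> np.ndarray:
    """Rescale W so its largest singular value is approximately 1.
    The leading singular value is estimated by power iteration, the same
    trick used by standard spectral-norm layers."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    v = np.zeros(W.shape[1])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = float(u @ W @ v)  # estimate of the top singular value
    return W / sigma

W = np.random.default_rng(2).normal(size=(16, 8))
Wn = spectral_normalize(W)
top_sv = float(np.linalg.svd(Wn, compute_uv=False)[0])
print(round(top_sv, 4))
```

Bounding the spectral norm of each layer bounds its Lipschitz constant, which is what keeps the header's mean and variance maps from amplifying noisy inputs.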

[CV-80] Understanding Cross-Language Transfer Improvements in Low-Resource HTR: The Role of Sequence Modeling

Summary: This paper studies how to improve Arabic-script handwritten text recognition (HTR) under low-resource conditions, asking whether cross-language joint training yields useful transfer and whether the mechanism is shared visual representations or sequence-level dependencies. The key is a controlled architectural comparison: CNN-only models with CTC decoding versus CRNN models with recurrent sequence modeling, under identical single-script and multi-script training regimes. Results show CRNN models benefit clearly from multi-script training, especially in the most data-constrained settings (e.g., 100 samples per language), and that cross-language improvements (delta CER) are driven mainly by sequence modeling rather than by the similarity of CNN-extracted visual features alone, indicating that contextual modeling is critical for effective transfer in low-resource scenarios.

Link: https://arxiv.org/abs/2605.05900
Authors: Sana Al-azzawi, Chang Liu, Nudrat Habib, Elisa Barney, Marcus Liwicki
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Handwritten Text Recognition (HTR) for Arabic-script languages benefits from cross-language joint training under low-resource conditions, particularly when using CRNN-based models that combine convolutional encoders with sequence modeling. However, it remains unclear whether these improvements are better explained by shared visual representations or sequence-level dependencies. In this work, we conduct a controlled architectural study of line-level Arabic-script HTR, comparing CNN-only models with CTC decoding and CRNN models under identical single-script and multi-script training regimes. Experiments are performed on Arabic (KHATT), Urdu (NUST-UHWR), and Persian (PHTD) datasets under low-resource settings (K in 100, 500, 1000). Our results show a clear divergence in transfer behavior: while CNN-only models exhibit limited or unstable improvements, CRNN models achieve better performance under multi-script training, particularly in the most data-constrained regimes. Focusing on transfer improvements (delta CER) rather than absolute performance, we find that cross-language improvements are associated with sequence-level modeling, while sharing visual representations learned by the CNN encoder, corresponding to similarities in character shapes across scripts, alone appears to be insufficient. This finding suggests that contextual modeling plays an important role in enabling effective transfer in low-resource scenarios, and that similar behavior may extend to other low-resource language settings.

[CV-81] Detecting AI-Generated Videos with Spiking Neural Networks

Summary: This paper tackles the difficulty of detecting AI-generated video under cross-generator evaluation, where existing detectors degrade sharply because artifact types and timescales differ across generators. The key insight is two newly identified signatures: fake videos show smoother frame-to-frame residuals at the pixel level and more compact trajectories in semantic feature space, indicating a temporal smoothness gap; moreover, when raw video is fed to a spiking neural network (SNN), fake clips elicit pronounced firing at object and motion boundaries while real clips do not, revealing the SNN's sensitivity to localized temporal artifacts. Building on this, the authors propose MAST, a detector that processes multi-channel temporal residuals with a spike-driven temporal branch alongside a frozen semantic encoder for cross-generator generalization, achieving 93.14% mean accuracy over 10 unseen generators on the GenVideo benchmark and demonstrating the practicality of SNNs for this task.

Link: https://arxiv.org/abs/2605.05895
Authors: Minsuk Jang, Yujin Yang, Heeseon Kim, Minseok Son, Younghun Kim, Changick Kim
Affiliations: Korea Advanced Institute of Science and Technology (KAIST)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern AI-generated videos are photorealistic at the single-frame level, leaving inter-frame dynamics as the main remaining axis for detection. Existing detectors typically handle this temporal evidence in three ways: feeding the full frame sequence to a generic temporal backbone, reducing one dominant temporal cue to fixed video-level descriptors, or comparing temporal features to real-video statistics through a detection metric. These strategies degrade sharply under cross-generator evaluation, where artifact type and timescale vary across generators. On caption-paired benchmark, GenVidBench, we identify two signatures that prior detectors do not jointly exploit: AI-generated videos exhibit smoother frame-to-frame temporal residuals at the pixel level, and more compact trajectories in the semantic feature space, indicating a temporal smoothness gap at both levels. We further observe that, when raw video is fed into a Spiking Neural Networks (SNNs), fake clips elicit firing predominantly at object and motion boundaries, unlike real clips, suggesting that the SNN responds to temporal artifacts localized at edges. These cues are sparse, asynchronous, and concentrated at moments of change, which makes SNNs a natural choice for this task: their event-driven, sparsely-activated dynamics align with the structure of the residual signal in a way that dense ANN backbones do not. Building on this observation, we propose MAST, a detector that processes multi-channel temporal residuals with a spike-driven temporal branch alongside a frozen semantic encoder for cross-generator generalization. On the GenVideo benchmark, MAST achieves 93.14% mean accuracy across 10 unseen generators under strict cross-generator evaluation, matching or surpassing the strongest ANN-based detectors and demonstrating the practical applicability of SNNs to AI-generated video detection.
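The intuition that sparse, abrupt temporal residuals drive spiking while overly smooth ones stay silent can be illustrated with a single leaky integrate-and-fire neuron. The parameters below are illustrative assumptions, not MAST's:

```python
import numpy as np

def lif_spikes(residuals, tau: float = 0.8, threshold: float = 1.0):
    """Leaky integrate-and-fire over a 1-D sequence of frame-residual magnitudes.
    The membrane potential leaks by `tau` each step, integrates the input, and
    emits a spike (then resets) when it crosses `threshold`, so abrupt temporal
    changes drive firing while smooth sequences stay mostly silent."""
    v, spikes = 0.0, []
    for x in residuals:
        v = tau * v + x
        if v >= threshold:
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

smooth = np.full(10, 0.05)  # overly smooth residuals, as in generated video
edgy = np.array([0.05, 0.05, 1.4, 0.05, 0.05, 1.4, 0.05, 0.05, 0.05, 1.4])
print(sum(lif_spikes(smooth)), sum(lif_spikes(edgy)))
```

The smooth sequence never reaches threshold (its potential saturates at 0.05 / (1 - 0.8) = 0.25), while each residual spike fires the neuron, which mirrors the event-driven sparsity the paper exploits.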

[CV-82] MTL-MAD: Multi-Task Learners are Effective Medical Anomaly Detectors

Summary: This paper addresses anomaly detection in medical images, where anomalies are typically unavailable during training. Prior methods pair a single pretext task with a large pre-trained model but struggle to fully capture the diversity of normal anatomy. The key is a multi-task learning (MTL) framework that learns multiple self-supervised and pseudo-labeling tasks from scratch with a joint Mixture-of-Experts (MoE) model. By carefully integrating several proxy tasks, the model learns a robust representation of normal anatomical structures, and anomaly scores are derived at inference from how well the learner solves each task. Experiments show it outperforms all state-of-the-art competitors on the BMAD benchmark and produces interpretable anomaly maps that can aid clinical diagnosis.

Link: https://arxiv.org/abs/2605.05891
Authors: Bogdan Alexandru Bercean, Florinel Alin Croitoru, Vlad Hondru, Ciprian Mihai Ceausescu, Andreea Iuliana Ionescu, Radu Tudor Ionescu
Affiliations: University of Bucharest; Polytechnic University of Timişoara; Colţea Clinical Hospital; Carol Davila University of Medicine and Pharmacy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Anomaly detection in medical images is a challenging task, since anomalies are not typically available during training. Recent methods leverage a single pretext task coupled with a large-scale pre-trained model to reach state-of-the-art performance. Instead, we propose to learn multiple self-supervised and pseudo-labeling tasks from scratch, using a joint model based on Mixture-of-Experts (MoE). By carefully integrating multiple proxy tasks, the joint model effectively learns a robust representation of normal anatomical structures, so that anomaly scores can be derived based on how well the multi-task learner (MTL) solves each task during inference. We perform comprehensive experiments on BMAD, a recent benchmark that comprises a broad range of medical image modalities. The empirical results indicate that our multi-task learner is an effective anomaly detector, outperforming all state-of-the-art competitors on BMAD. Moreover, our model produces interpretable anomaly maps, potentially helping physicians in providing more accurate diagnoses.

[CV-83] DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation CVPR2026

Summary: This paper addresses the slow sampling of diffusion-based image-to-image (I2I) translation, where state-of-the-art Diffusion Bridge Models (DBMs) require many function evaluations (NFEs) to generate high-quality images. The key is DBMSolver, a training-free sampler that exploits the semi-linear structure of the DBM's underlying SDE and ODE via exponential integrators, yielding efficient first- and second-order solvers. This cuts NFEs by up to 5x while improving quality (e.g., a 53% FID drop on DIODE at 20 NFEs versus a second-order baseline), establishing a better efficiency-quality trade-off.

Link: https://arxiv.org/abs/2605.05889
Authors: Sankarshana Venugopal, Mohammad Mostafavi, Jonghyun Choi (Seoul National University)
Affiliations: Seoul National University; ECE, IPAI, ASRI in SNU
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: Accepted to CVPR 2026. Includes supplementary material

Abstract:Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-free sampler that exploits the semi-linear structure of DBM’s underlying SDE and ODE via exponential integrators, yielding highly-efficient 1st- and 2nd-order solutions. This reduces NFEs by up to 5x while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. 2nd-order baseline). Experiments on inpainting, stylization, and semantics-to-image tasks across resolutions up to 256x256 show DBMSolver sets new SOTA efficiency-quality tradeoffs, enabling real-world applicability. Our code is publicly available at this https URL.
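The exponential-integrator idea behind a first-order solver can be shown on a scalar semi-linear ODE dx/dt = a·x + f(x, t): the linear part is propagated exactly via exp(a·dt), so for a constant remainder the scheme is exact even at large steps. This is a generic sketch of exponential Euler, not the paper's actual sampler:

```python
import math

def exp_euler(x0, a, f, t0, dt, steps):
    """First-order exponential integrator for dx/dt = a*x + f(x, t).
    The linear part a*x is solved exactly with exp(a*dt); only the
    remainder f is frozen over each step (assumes a != 0)."""
    x, t = x0, t0
    ea = math.exp(a * dt)
    phi = (ea - 1.0) / a  # exact integral of exp(a*s) over one step
    for _ in range(steps):
        x = ea * x + phi * f(x, t)
        t += dt
    return x

# With a constant remainder f = b the scheme is exact, illustrating why
# exploiting the semi-linear structure pays off at large step sizes.
a, b, x0 = -5.0, 2.0, 1.0
approx = exp_euler(x0, a, lambda x, t: b, 0.0, 0.5, 4)  # integrate to t = 2.0
exact = (x0 + b / a) * math.exp(a * 2.0) - b / a
print(abs(approx - exact) < 1e-12)  # -> True
```

A standard Euler step at dt = 0.5 with a = -5 would be unstable (|1 + a·dt| > 1), which is exactly the regime where exponential integrators let DBM-style samplers take few, large steps.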

[CV-84] Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models

Summary: This paper targets dense hand contact estimation, i.e., accurately localizing hand-object contact regions in complex interactions. Applying multi-modal large language models (MLLMs) here is hard: despite strong visual-semantic understanding, they lack explicit 3D geometry encoding and fine-grained vertex-level geometric reasoning. The key is ContactPrompt, a training-free, zero-shot method with two components: detailed hand-part segmentation and a part-wise vertex-grid representation inject structured 3D hand geometry into the input, while a multi-stage structured contact reasoning scheme with part conditioning progressively bridges global semantics and local geometry. This elicits the MLLM's reasoning capability for precise dense contact prediction, outperforming supervised methods trained on large-scale dense contact datasets without any training.

Link: https://arxiv.org/abs/2605.05886
Authors: Daniel Sungho Jung, Kyoung Mu Lee
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning of human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics, enabled by vision-language priors learned from large-scale data. However, leveraging MLLMs for dense hand contact estimation remains underexplored. There are two major challenges in applying MLLMs to dense hand contact estimation. First, encoding explicit 3D hand geometry is difficult, as MLLMs primarily operate on vision and language modalities. Second, capturing fine-grained vertex-level contact remains challenging, as MLLMs tend to focus on high-level semantics rather than detailed geometric reasoning. To address these challenges, we propose ContactPrompt, a training-free and zero-shot approach for dense hand contact estimation using MLLMs. To effectively encode 3D hand geometry, we introduce a detailed hand-part segmentation and a part-wise vertex-grid representation that provides structured, localized geometric information. To enable accurate and efficient dense contact prediction, we develop a multi-stage structured contact reasoning with part conditioning, progressively bridging global semantics and fine-grained geometry. Therefore, our method effectively leverages the reasoning capabilities of MLLMs while enabling precise dense hand contact estimation. Surprisingly, the proposed approach outperforms previous supervised methods trained on large-scale dense contact datasets without requiring any training. The codes will be released.

[CV-85] 3DSS: 3D Surface Splatting for Inverse Rendering

Summary: This paper addresses physically-based inverse rendering from multi-view images: efficiently and accurately recovering shape, spatially-varying BRDF materials, and illumination, where handling surface sampling and occlusion is either imprecise or computationally costly. The key is 3D Surface Splatting (3DSS), a differentiable surface-splatting renderer that formulates the surface separation problem directly in terms of the reconstruction kernels and derives a coverage-based per-layer opacity from the accumulated Elliptical Weighted Average (EWA) reconstruction weight, yielding anti-aliased silhouettes and informative visibility gradients at sparsely covered edges. Combined with forward microfacet shading under co-optimized HDR environment lighting and density-aware adaptive refinement, 3DSS jointly recovers geometry, materials, and lighting, and its oriented surface samples bridge naturally to mesh workflows, outperforming mesh-based, implicit, and Gaussian-splatting baselines.

Link: https://arxiv.org/abs/2605.05876
Authors: Mae Younes, Adnane Boukhayma
Affiliations: INRIA; University of Rennes
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present 3D Surface Splatting (3DSS), the first differentiable surface splatting renderer for physically-based inverse rendering from multi-view images. Our central insight is that the surface separation problem at the heart of surface splatting admits a direct formulation in terms of the reconstruction kernels themselves. From this foundation we derive a coverage-based compositing model whose per-layer opacity arises directly from the accumulated Elliptical Weighted Average reconstruction weight, yielding anti-aliased silhouettes and informative visibility gradients at sparsely covered edges. Combined with forward microfacet shading under co-optimized HDR environment lighting and density-aware adaptive refinement, 3DSS jointly recovers shape, spatially-varying BRDF materials, and illumination. Because the optimized representation is a set of oriented surface samples, it bridges natively to mesh-based workflows via surface reconstruction from oriented point cloud methods. We evaluate 3DSS against mesh-based, implicit, and Gaussian-splatting baselines across geometry reconstruction, novel-view synthesis, and novel-illumination relighting.
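The coverage-based opacity idea, a per-layer opacity derived from accumulated EWA reconstruction weights, can be sketched as follows. The saturating 1 − exp(−w) mapping is an illustrative assumption, not the paper's exact compositing formula:

```python
import numpy as np

def ewa_weight(px, py, cx, cy, inv_cov):
    """Elliptical Weighted Average reconstruction weight of one splat at a pixel."""
    d = np.array([px - cx, py - cy])
    return float(np.exp(-0.5 * d @ inv_cov @ d))

def coverage_opacity(weights):
    """Map accumulated reconstruction weight to a bounded per-layer opacity.
    The saturating mapping is an illustrative choice: opacity grows with
    coverage and stays below 1."""
    return float(1.0 - np.exp(-np.sum(weights)))

inv_cov = np.linalg.inv(np.array([[2.0, 0.3], [0.3, 1.0]]))
centers = [(4.8, 5.1), (5.3, 4.9), (9.0, 9.0)]  # two nearby splats, one far away
w = [ewa_weight(5.0, 5.0, cx, cy, inv_cov) for cx, cy in centers]
alpha = coverage_opacity(w)
print(round(alpha, 3), alpha > coverage_opacity(w[:1]))
```

Because the opacity rises smoothly with accumulated kernel weight, sparsely covered edge pixels still receive non-zero, differentiable visibility, which is the source of the informative gradients the paper describes.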

[CV-86] InkDiffuser: High-Fidelity One-shot Chinese Calligraphy via Differentiable Morphological Optimization

Summary: This paper addresses the poor stroke rendering and unrealistic ink morphology of current Chinese calligraphy generation methods, which limit visual fidelity and artistic fluidity. The key is InkDiffuser, a diffusion-based framework for one-shot Chinese calligraphy synthesis with two core innovations: a high-frequency enhancement mechanism that explicitly fuses high-frequency features to extract glyph structure more accurately, and a Differentiable Ink Structure (DIS) loss that integrates differentiable morphological operations into the diffusion process, letting the model explicitly decompose and finely refine ink-trace structures and thereby markedly improving stroke-contour precision and visual realism of the generated calligraphy.

Link: https://arxiv.org/abs/2605.05865
Authors: Kunchong Shi, Jing Zhang
Affiliations: East China University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract:Current Chinese calligraphy generation methods suffer from poor stroke rendering and unrealistic ink morphology, resulting in outputs with limited visual fidelity and artistic fluidity. To address this problem, we propose InkDiffuser, a diffusion-based generative framework for one-shot Chinese calligraphy synthesis. To guarantee high-fidelity rendering, we introduce two core contributions: a high-frequency enhancement mechanism and a Differentiable Ink Structure (DIS) loss that explicitly regularizes ink morphology. Inspired by the observation that high-frequency information in individual samples typically carries contour details, we enhance content extraction by explicitly fusing high-frequency representations for more accurate font structure. Furthermore, we propose a differentiable ink structure loss that integrates differentiable morphological operations into the diffusion process. By allowing the model to learn an explicit decomposition of ink-trace structures, DIS facilitates fine-grained refinement of stroke contours and delivers significantly improved visual realism in the generated calligraphy. Extensive experiments on various calligraphic styles and complex characters demonstrate that InkDiffuser can generate superior calligraphy fonts with realistic ink rendering effects from only a single reference glyph and outperform existing few-shot font generation approaches in structural consistency, detail fidelity, and visual authenticity. The code is available at the following address: this https URL.
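A differentiable morphological operation of the kind a DIS-style loss builds on can be sketched as a soft dilation using a log-sum-exp smooth maximum. The 3x3 window and the sharpness value beta are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def soft_dilate(img: np.ndarray, beta: float = 50.0) -> np.ndarray:
    """Differentiable 3x3 dilation of a [0, 1] ink map using a log-sum-exp
    smooth maximum; as beta grows it approaches hard morphological dilation
    while staying differentiable, so it can sit inside a training loss."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.empty_like(img, dtype=np.float64)
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 3, j:j + 3].ravel()
            out[i, j] = np.log(np.mean(np.exp(beta * window))) / beta
    return out

stroke = np.zeros((5, 5))
stroke[2, 2] = 1.0  # a single ink dot
dilated = soft_dilate(stroke)
print(dilated[2, 1] > 0.9, dilated[0, 0] < 0.1)
```

Erosion follows by the same construction with a smooth minimum, and composing the two gives differentiable opening/closing operators for regularizing ink-trace structure.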

[CV-87] Align3D-AD: Cross-Modal Feature Alignment and Dual-Prompt Learning for Zero-shot 3D Anomaly Detection

【速读】:该论文旨在解决零样本三维异常检测(zero-shot 3D anomaly detection)中因域差距导致的性能瓶颈问题,即现有方法通常将3D观测投影为多视角表示以提取几何特征,再通过在RGB数据上预训练的视觉编码器进行处理,这种策略造成了编码器与投影表示之间的显著模态差异。其解决方案的关键在于提出一个统一的两阶段框架Align3D-AD:首先设计跨模态特征对齐机制,将渲染特征映射至RGB语义空间,实现从辅助类别RGB观测中的直接语义迁移;其次引入一种模态感知提示学习框架,结合双提示对比对齐策略,在不同模态间捕获互补语义信息并增强提示表征的判别能力,从而有效缩小模态间隙并提升模型泛化性与鲁棒性。

链接: https://arxiv.org/abs/2605.05850
作者: Letian Bai,Xuanming Cao,Juan Du,Chengyu Tao
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Hunan University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot 3D anomaly detection aims to identify anomalies without access to training data from target categories. However, existing methods mainly rely on projecting 3D observations into multi-view representations that primarily capture geometric cues rather than realistic visual semantics and process them with vision encoders pretrained on RGB data, leading to a significant domain gap between the encoder and the projected representations. To address this issue, we propose Align3D-AD, a unified two-stage framework that leverages the RGB modality from auxiliary categories as cross-modal guidance for zero-shot 3D anomaly detection. First, we introduce a cross-modal feature alignment paradigm that maps rendering features into the RGB semantic space. Unlike prior works that implicitly rely on pretrained encoders, our method enables direct semantic transfer from RGB observations. A semantic consistency reweighting strategy is further introduced to refine feature alignment by reweighting local regions according to holistic semantic consistency. Second, we propose a modality-aware prompt learning framework with dual-prompt contrastive alignment. By assigning independent prompts to RGB-aligned and rendering features, our method captures complementary semantics across modalities, while the contrastive alignment further enhances prompt representations to improve discriminability. Extensive experiments on MVTec3D-AD, Eyecandies, and Real3D-AD demonstrate that Align3D-AD consistently outperforms existing zero-shot methods under both one-vs-rest and cross-dataset settings, highlighting its generalization capability and robustness. Code and the dataset will be made available once our paper is accepted.
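其中"双提示对比对齐"可以用对称 InfoNCE 损失来示意:第 i 个渲染特征与第 i 个 RGB 对齐特征互为正样本,其余为负样本。下面是一个基于 NumPy 的示意性草图(假设性实现,非论文原始代码):

```python
import numpy as np

def info_nce(render_feats, rgb_feats, temp=0.1):
    """Symmetric InfoNCE: the i-th rendering feature should match
    the i-th RGB-aligned feature (diagonal targets)."""
    r = render_feats / np.linalg.norm(render_feats, axis=1, keepdims=True)
    g = rgb_feats / np.linalg.norm(rgb_feats, axis=1, keepdims=True)
    logits = r @ g.T / temp  # (N, N) cosine similarities scaled by temperature

    def xent(lg):
        # cross-entropy with targets on the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))
```

对齐良好的特征对应更低的损失,这为跨模态语义迁移提供了训练信号。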

[CV-88] VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

【速读】:该论文旨在解决视频大模型在处理长视频时面临的可扩展性瓶颈问题,即长视频产生的视觉标记(visual-token)序列过长,导致推理过程中的内存占用和延迟显著增加。现有压缩方法在特定场景下有效,但普遍存在对查询不敏感或采用固定压缩策略的问题,难以应对视觉证据在时间上分布不均的情况。解决方案的关键在于提出 VideoRouter,一个基于 InternVL 的查询自适应双路由框架:其中语义路由(Semantic Router)决定整体分配策略,权衡广域时间覆盖与关键帧高分辨率保留;图像路由(Image Router)利用早期大语言模型(LLM)层评估帧相关性,实现对非关键帧的激进压缩和对关键证据帧的细节保留。该框架通过构建 Video-QTR-10K 和 Video-FLR-200K 数据集分别监督分配策略与帧相关性,实验表明其能在相当或更低的计算预算下显著提升性能,最高实现 67.9% 的 token 减少。

链接: https://arxiv.org/abs/2605.05848
作者: Kuanwei Lin,Wenhao Zhang,Ge Li
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific settings, most are either weakly query-aware or apply a fixed compression policy across frames, proving suboptimal when visual evidence is unevenly distributed over time. To address this, we present VideoRouter, a query-adaptive dual-router framework built on InternVL for budgeted evidence allocation. The Semantic Router predicts the dominant allocation policy, choosing between broad temporal coverage and adaptive high-resolution preservation, while the Image Router uses early LLM layers to score frame relevance. This enables aggressive compression on less relevant frames while preserving detail on critical evidence frames. To train both routers, we build Video-QTR-10K for allocation-policy supervision and Video-FLR-200K for frame-relevance supervision. Experiments on VideoMME, MLVU, and LongVideoBench show that VideoRouter consistently improves over the InternVL baseline under comparable or lower budgets, achieving up to a 67.9% token reduction.
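预算化证据分配的核心思想——给每帧一个最低配额,再按相关性分配剩余 token——可以用如下示意性草图说明(假设性实现,`floor` 与比例分配规则均为演示用设定,非论文原始路由策略):

```python
def allocate_tokens(relevance, budget, floor=4):
    """Give each frame a minimum token floor, then split the remaining
    budget in proportion to frame relevance scores."""
    n = len(relevance)
    remaining = budget - floor * n
    total = sum(relevance)
    extra = [int(remaining * r / total) for r in relevance]
    alloc = [floor + e for e in extra]
    # hand leftover tokens (from integer rounding) to the most relevant frames
    leftover = budget - sum(alloc)
    for i in sorted(range(n), key=lambda i: -relevance[i])[:leftover]:
        alloc[i] += 1
    return alloc
```

这样低相关帧被激进压缩,而关键证据帧保留更多细节,总量恰好等于预算。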

[CV-89] Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media CVPR

【速读】:该论文旨在解决科学知识在不同模态(如文本、视觉和语音)材料中缺乏结构化关联的问题,这限制了研究内容的统一探索与分析。其关键解决方案是构建首个多模态会议数据集(Multimodal Conference Dataset, MCD),该数据集整合了来自同一研究工作的论文、演示视频、解释性视频和幻灯片,并通过嵌入模型和视觉-语言模型评估跨格式细粒度对应关系的能力,从而建立首个系统性基准,为未来多模态科学理解研究提供基础支撑。

链接: https://arxiv.org/abs/2605.05831
作者: Megha Mariam K.M,Vineeth N. Balasubramanian,C.V. Jawahar
机构: IIIT Hyderabad (印度信息技术研究所); Microsoft Research India (微软研究院印度); IIT Hyderabad (印度理工学院海得拉巴分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

点击查看摘要

Abstract:The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study’s reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at this https URL
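这类跨格式对应的评测通常可归结为基于嵌入相似度的检索指标(如 Recall@k)。以下为一个示意性草图(假设性实现,假定第 i 个查询的真值对应即画廊中的第 i 项):

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose true counterpart (same index) is in top-k
    by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    hits = sum(int(i in np.argsort(-row)[:k]) for i, row in enumerate(sims))
    return hits / len(q)
```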

[CV-90] ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction

【速读】:该论文旨在解决线图(line chart)中自动化数据提取面临的两大核心挑战:一是由于图表样式多样性高且真实世界标注数据稀缺,导致现有端到端模型难以泛化;二是重建过程中存在两个关键失效模式——细小交叉曲线易因结构碎片化而丢失细节,以及依赖刚性空间规则的图例匹配策略在真实场景下因图例位置不可预测而失效。解决方案的关键在于提出ChartZero框架,其核心创新包括:1)利用纯合成数据训练模型以绕过真实标注瓶颈;2)通过新颖的全局正交实例(Global Orthogonal Instance, GOI)损失函数缓解曲线碎片化问题;3)引入基于视觉-语言模型(Vision-Language Model, VLM)引导的开放词汇图例匹配策略替代传统空间规则,实现更鲁棒的语义关联。此外,研究还构建了面向全流程重建的新评估指标与基准,推动对真实性能的全面衡量。

链接: https://arxiv.org/abs/2605.05820
作者: Md Touhidul Islam,Yasir Mahmud,Sujan Kumar Saha,Mark Tehranipoor,Farimah Farahmandi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated data extraction from line charts remains fundamentally bottlenecked by extreme stylistic diversity and a severe scarcity of comprehensively annotated, real-world datasets. Current end-to-end pipelines depend heavily on costly manual annotations, crippling their ability to generalize across arbitrary aesthetics and grid layouts. Furthermore, existing models suffer from two critical failure modes during reconstruction. First, extracting thin, intersecting curves frequently causes structural fragmentation and the erasure of fine visual details, as standard architectures struggle against complex backgrounds. Second, semantic association is notoriously error-prone; current pipelines rely on rigid spatial heuristics that easily break down against the unpredictable legend placements of in-the-wild charts. Finally, measuring true progress is hindered by evaluation protocols that assess isolated sub-tasks rather than holistic, end-to-end data reconstruction. To address these foundational issues, we introduce ChartZero, a parsing framework that leverages synthetic priors to enable robust zero-shot chart data extraction. By training exclusively on a purely synthetic dataset of simple mathematical functions, our model completely bypasses the real-world annotation bottleneck. We overcome curve fragmentation via a novel Global Orthogonal Instance (GOI) loss, and replace brittle spatial rules with an open-vocabulary, Vision-Language Model (VLM)-guided legend matching strategy. Accompanied by a new metric and benchmark specifically designed for full end-to-end reconstruction, our evaluations demonstrate that ChartZero significantly advances generalized plot digitization without requiring real-world supervision. Code and dataset will be released upon acceptance.
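纯合成训练数据的生成思路可以示意如下:随机采样一个低阶多项式,将其数据坐标映射为像素坐标,即可得到自带真值标注的曲线轨迹。下面是一个假设性的极简草图(非论文数据管线):

```python
import numpy as np

def synth_line_sample(w=256, h=256, n=50, seed=0):
    """Sample a random quadratic and return its pixel-space trace,
    mimicking annotation-free synthetic supervision."""
    rng = np.random.default_rng(seed)
    coeffs = rng.normal(size=3)          # c0 + c1*x + c2*x^2
    x = np.linspace(-1, 1, n)
    y = np.polyval(coeffs[::-1], x)      # polyval wants highest degree first
    # map data coordinates into pixel coordinates (origin at top-left)
    px = ((x - x.min()) / (x.max() - x.min()) * (w - 1)).round().astype(int)
    py = ((1 - (y - y.min()) / (y.max() - y.min() + 1e-9)) * (h - 1)).round().astype(int)
    return px, py
```

由于标注由生成过程直接给出,这类数据完全绕开了真实图表的人工标注瓶颈。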

[CV-91] CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

【速读】:该论文旨在解决医学视觉语言模型在胸部X光片(CXR)诊断中出现的"否定选项吸引"(negated-option attraction)问题,即模型在图像明确显示某种病变(如实变,consolidation)时仍错误选择"无该病变"的否定选项,导致临床判断与图像证据完全相反的极性反转(polarity reversal)。这种错误不仅降低准确性,更可能引发严重医疗风险。解决方案的关键在于引入CXR-ContraBench基准测试框架,并提出QCCV-Neg(Question-Conditioned Consistency Verifier for Negation),一种无需重新训练即可在推理阶段检测并修正极性混淆子集的确定性校正机制,显著提升模型对存在性问题的正确响应能力,使MedGemma和Qwen2.5-VL在直接存在性探测任务中的准确率分别从约30%提升至96.60%和95.32%。

链接: https://arxiv.org/abs/2605.05810
作者: Zhengru Fang,Yanan Ma,Yu Guo,Senkang Hu,Yixian Zhang,Hangcheng Cao,Wenbo Ding,Yuguang Fang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:When a chest X-ray shows consolidation but the question asks which finding is present, a medical vision-language model may answer “No consolidation.” This is more than an incorrect choice: it is a polarity reversal that emits a clinical statement contradicting the image. We study this failure as negated-option attraction, where a model is drawn to a negated answer option even when it conflicts with both the visual evidence and the question. We introduce CXR-ContraBench (Chest X-Ray Contradiction Benchmark), a diagnostic benchmark spanning internal ReXVQA slices and external OpenI and CheXpert protocols. The benchmark centers on present-finding questions, where selecting “No X” despite visible X creates the main clinical risk, and uses absent-finding questions as secondary tests of whether models copy negated wording. Across CheXpert protocols, the failure is substantial and persistent. On a strict direct presence probe, MedGemma and Qwen2.5-VL reach only 31.49% and 30.21% accuracy, respectively; on a matched 135,754-record CheXpert training-split protocol, both models select negated options on over 62% of presence questions. Chain-of-thought prompting reduces some presence-side reversals but does not eliminate them and can amplify absence-side contradictions. Finally, QCCV-Neg (Question-Conditioned Consistency Verifier for Negation) deterministically repairs the measured polarity-confused subset without retraining, raising MedGemma and Qwen2.5-VL to 96.60% and 95.32% accuracy on the direct presence probe. These results show that standard accuracy can hide a clinically meaningful inference-time polarity failure. Source code and benchmark construction scripts are available at this https URL.
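QCCV-Neg 的"确定性修正"思想可以用一个极简的字符串级校验器示意:当存在性问题选中了否定选项、且该病变在证据中实际存在时,改选对应的肯定选项。以下为假设性草图(非论文原始实现,`findings_present` 仅用于代替模型自身的证据抽取):

```python
def verify_polarity(question, options, chosen, findings_present):
    """Repair a negated-option pick that contradicts a presence question."""
    asks_present = "present" in question.lower()
    is_negated = options[chosen].lower().startswith("no ")
    if asks_present and is_negated:
        finding = options[chosen][3:].strip().lower()
        if finding in findings_present:
            # prefer the positive phrasing of the same finding, if offered
            for i, opt in enumerate(options):
                if opt.lower() == finding:
                    return i
    return chosen
```

该机制完全确定性、无需重训练,只在检测到"存在性问题 + 否定选项 + 证据冲突"三者同时成立时才介入。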

[CV-92] Na-IRSTD: Enhancing Infrared Small Target Detection via Native-Resolution Feature Selection and Fusion

【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, IRSTD)中因背景杂波干扰而导致的微弱目标精确定位难题。现有方法通常采用下采样策略处理特征,导致小目标细节丢失,从而限制了检测性能。其解决方案的关键在于提出一种原生分辨率特征提取与融合框架(Na-IRSTD),通过保留原生分辨率特征来增强对微弱目标线索的感知能力,同时引入有效的token减少与选择策略,在保持低层细节的同时降低计算负担,实现高精度的目标区域筛选和高效处理。

链接: https://arxiv.org/abs/2605.05804
作者: Qian Xu,Chi Zhang,Qiming Zhang,Xi Li,Haojuan Yuan,Mingjin Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared small target detection (IRSTD) faces the inherent challenge of precisely localizing dim targets amid complex background clutter. While progress has been made, existing methods usually follow conventional strategies to downsample features and discard small targets’ details, resulting in suboptimal performance. In this paper, we present Na-IRSTD, a native-resolution feature extraction and fusion framework for IRSTD. This framework elegantly incorporates native-resolution features to preserve subtle target cues, overcoming the resolution limitations of existing infrared approaches and significantly improving the model’s ability to localize small targets. We also introduce an effective token reduction and selection strategy, which selects target patches with high accuracy and confidence, boosting the low-level details of the feature while effectively reducing native-resolution patch tokens compared to dense processing, thereby avoiding imposing an unbearable computational burden. Extensive experiments demonstrate the robustness and effectiveness of our token reduction and selection strategy across multiple public datasets. Ultimately, our Na-IRSTD model achieves state-of-the-art performance on four benchmarks.
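token 选择策略的基本形式是按置信度保留 top-k 原生分辨率 patch。以下为示意性草图(假设性实现,非论文原始策略):

```python
import numpy as np

def select_target_patches(conf_map, keep_ratio=0.1):
    """Keep only the highest-confidence native-resolution patches,
    returning their (row, col) indices in the patch grid."""
    flat = conf_map.ravel()
    k = max(1, int(len(flat) * keep_ratio))
    idx = np.argsort(-flat)[:k]          # top-k by confidence
    return np.unravel_index(idx, conf_map.shape)
```

相比稠密处理全部原生分辨率 patch,这种筛选在保留小目标细节的同时大幅降低 token 数量。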

[CV-93] Stego Battlefield: Evaluating Image Steganography Attacks and Steganalysis Defenses

【速读】:该论文旨在解决当前图像隐写术(Steganography)在生成式 AI 安全场景下缺乏统一、系统评估框架的问题,尤其是针对恶意攻击者利用隐写技术嵌入有害语义或指令以规避内容审核和诱导模型产生危险输出的安全风险。其解决方案的关键在于提出 SADBench,一个涵盖四大核心任务的系统性基准:隐写攻击能力评估、隐写分析防御能力评估、效率评估与迁移能力评估;该框架通过模拟真实世界中图像-文本双模态隐写(image-payload and text-payload steganography),并基于多样化的载体分布(cover distributions)测试不同攻击与检测方法,从而量化潜在威胁并揭示关键安全不对称性,如攻击对新分布具有强泛化能力而检测器难以适应,为可度量、驱动安全的隐写防御研究提供可复现、可扩展的基础平台。

链接: https://arxiv.org/abs/2605.05789
作者: Zhen Sun,Zongmin Zhang,Leyi Sheng,Yule Liu,Yifan Liao,Ke Li,Xinhu Zheng,Jiaheng Wei,Wenyuan Yang,Xinlei He
机构: Wuhan University(武汉大学); The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Sun Yat-sen University(中山大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages

点击查看摘要

Abstract:Image steganography is widely used to protect user privacy and enable covert communication. However, it can also be abused by the adversary as a covert channel to bypass content moderation, disseminate harmful semantics, and even hide malicious instructions in images to elicit dangerous outputs from large models, posing a practical security risk that continues to evolve. To address the lack of a unified and systematic evaluation framework, we propose SADBench, a systematic benchmark that assesses the adversary’s ability to inject harmful secrets via steganography and the defender’s ability to detect such threats through steganalysis. Crucially, SADBench comprises 4 core tasks, namely steganography attack capability evaluation, steganalysis defense capability evaluation, efficiency evaluation, and transferability evaluation. It evaluates both image-payload and text-payload steganography across diverse cover distributions, utilizing harmful visual semantics and toxic instructions to simulate malicious attacks. Across a broad set of attacks and detectors, SADBench reveals that (i) INN and autoencoder-based methods demonstrate superior stability compared to other architectures, (ii) in-domain detection is near-perfect and cheaper than generation, (iii) a critical asymmetry exists in transferability where attacks robustly generalize to new distributions while detectors fail to adapt, and (iv) real-world threats persist on social media, where payloads either survive minimal compression or effectively adapt to aggressive compression via simulated training. Overall, SADBench establishes a systematic, reproducible, and extensible framework to quantify risks, paving the way for measurable and security-driven advancements in steganography defense.
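图像隐写中最经典的载体嵌入方式是 LSB(最低有效位)替换,可作为理解该基准攻击面的最小示例(示意性实现,远弱于文中评测的 INN/自编码器类方法):

```python
import numpy as np

def embed_lsb(cover, payload_bits):
    """Hide bits in the least significant bit of pixel values."""
    stego = cover.flatten().copy()
    for i, b in enumerate(payload_bits):
        stego[i] = (int(stego[i]) & 0xFE) | b   # clear LSB, then set payload bit
    return stego.reshape(cover.shape)

def extract_lsb(stego, n_bits):
    """Recover the first n_bits from the stego image's LSB plane."""
    return [int(v) & 1 for v in stego.flatten()[:n_bits]]
```

每个像素至多改变 1 个灰度级,视觉上几乎不可察觉,但 LSB 载荷对有损压缩极为脆弱——这正是文中讨论社交媒体传输鲁棒性时的对照起点。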

[CV-94] Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

【速读】:该论文旨在解决当前统一多模态模型中理解(Understanding)与生成(Generation)模块之间连接薄弱的问题。尽管现有先进模型在各自任务上表现优异,但其解耦设计削弱了二者间的相互增强机制,导致潜在协同效应难以实现。解决方案的关键在于提出一种轻量级框架——理解导向后训练(Understanding-Oriented Post-Training, UNO),将理解不仅视为独立任务,更作为直接监督信号来引导生成表示的学习。通过引入编码语义抽象(如图像描述)和结构细节(如视觉回归)的目标函数,UNO实现了从理解到生成的有效梯度传播,从而显著提升图像生成与编辑性能,验证了理解对生成的催化作用。

链接: https://arxiv.org/abs/2605.05781
作者: Zeyu Liu,Zanlin Ni,Yang Yue,Cheng Da,Huan Yang,Di Zhang,Kun Gai,Gao Huang
机构: Tsinghua University (清华大学); Kolors Team, Kuaishou Technology (快手科技Kolors团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task, but also a direct supervisory signal to steer generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.

[CV-95] Von Neumann Networks

【速读】:该论文旨在解决传统深度学习模型在架构设计上依赖人工经验、参数效率低且难以自动适应新任务的问题。其解决方案的关键在于提出一种基于冯·诺依曼细胞阵列(Von Neumann cellular array)的新型人工神经元——冯·诺依曼神经元(Von Neumann neuron),该神经元通过学习具有扩散特性的格林函数(Green’s functions)与卷积操作,在细胞拓扑结构上实现自工程化(self-engineered)的网络架构。这种架构仅由输入输出的位置和结构决定,无需手动设计层间连接,并且理论上属于计算通用的细胞机器(Cellular Machines)范畴,从而在保持高表达能力的同时显著提升参数效率并具备任务扩展能力。

链接: https://arxiv.org/abs/2605.05780
作者: Shekhar S. Chandra
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the mid-twentieth century, mathematician and polymath John von Neumann created a computational system on an array of cells as a simple model of the human brain, where each cell had one of a finite set of roles or states that he predicted would be modelled by a diffusion process. In this work, we show that such a system, when developed in a modern deep learning setting, enables the construction of an artificial neuron having specialized roles that can be learnt. We refer to this neuron as the Von Neumann neuron, and the resulting neural network from such neurons result in a self-engineered design whose architecture is only dependent on the structure and locations of its inputs and outputs on this cellular array. The mathematical framework for these Von Neumann Networks (VNNs) is also constructed and shows that they are based on the extension of neural operators and the learning of Green’s functions with convolutions on a cellular topology having a diffusion signature. We also prove that these VNNs are part of a more general computational system called Cellular Machines that are computationally universal. Initial experiments show that VNN based multi-layered perceptrons outperform their equivalent deep learning variant on basic tasks, while being more parameter efficient and are capable of learning new types of tasks. This includes the ability to solve for and construct an extension of the Von Neumann (hardware) architecture common to all modern computers to cells and suggests new opportunities that could be explored.
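文中"具有扩散特征的细胞拓扑"可以用显式扩散更新来示意:对二维细胞阵列施加四邻域拉普拉斯算子,单位脉冲的响应随步数逐渐铺开,这正对应于格林函数视角。以下为假设性草图(非论文实现,采用周期边界):

```python
import numpy as np

def diffusion_step(grid, rate=0.2):
    """One explicit diffusion update on a 2-D cellular array
    (4-neighbour Laplacian, periodic boundary via np.roll)."""
    lap = (np.roll(grid, 1, 0) + np.roll(grid, -1, 0)
           + np.roll(grid, 1, 1) + np.roll(grid, -1, 1) - 4 * grid)
    return grid + rate * lap

# Green's-function view: the response to a unit impulse spreads over steps
state = np.zeros((5, 5))
state[2, 2] = 1.0
for _ in range(3):
    state = diffusion_step(state)
```

扩散过程保持总量守恒,而脉冲响应(格林函数)的空间形状正是可学习卷积核所要拟合的对象。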

[CV-96] The autoPET3 Challenge – Automated Lesion Segmentation in Whole-Body PET/CT - Multitracer Multicenter Generalization

【速读】:该论文旨在解决多中心、多示踪剂([18F]-FDG 和 [68Ga]-PSMA)PET/CT图像中自动化病灶分割的组合泛化能力问题,即模型在训练时未见过的示踪剂-中心组合下的性能表现。其关键解决方案在于构建了一个具有挑战性的基准测试环境:使用来自两个不同医院的共1,611例标注数据(构成目前最大的公开PSMA PET/CT数据集),并在测试集中引入两种未见的示踪剂-中心配对组合,以严格评估模型的泛化能力。此外,通过设立数据驱动的奖励类别(data-centric award category),量化了数据处理策略对性能提升的贡献,发现优化数据预处理与增强比单纯改进网络结构更为关键,尤其在应对系统性体积高估问题上效果显著。

链接: https://arxiv.org/abs/2605.05775
作者: Jakob Dexl,Katharina Jeblick,Andreas Mittermeier,Balthasar Schachtner,Anna Theresa Stüber,Johanna Topalis,Maximilian Rokuss,Fabian Isensee,Klaus H. Maier-Hein,Hamza Kalisch,Jens Kleesiek,Constantin M. Seibold,Hussain Alasmawi,Lap Yan Lennon Chan,Yixuan Yuan,Alexander Jaus,Rainer Stiefelhagen,Pauline Ornela Megne Choudja,Konstantin Nikolaou,Christian La Fougère,Sergios Gatidis,Matthias P. Fabritius,Maurice Heimer,Gizem Abaci,Lalith Kumar Shiyam Sundar,Rudolf A. Werner,Jens Ricke,Clemens C. Cyran,Thomas Küstner,Michael Ingrisch
机构: University of Tübingen (图宾根大学); Ludwig-Maximilians-Universität München (慕尼黑路德维希-马克西米利安大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint submitted to Medical Image Analysis

点击查看摘要

Abstract:We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. Ranking was stable across bootstrap resampling and alternative ranking schemes for the top tier. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
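文中用于排名的 DSC(Dice 系数)与 FNV(假阴性体积)等分割指标,其定义可用如下草图复现(示意性实现;体素体积 `voxel_ml` 为假设参数):

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

def false_neg_volume(pred, gt, voxel_ml=1.0):
    """Volume of ground-truth lesion voxels the prediction missed."""
    return np.logical_and(gt, np.logical_not(pred)).sum() * voxel_ml
```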

[CV-97] X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

【速读】:该论文旨在解决移动设备上个人代理(Personal Agent)在处理复杂、多模态交互时面临的挑战,即如何实现高上下文感知能力与个性化智能的统一。其核心问题在于现有系统难以有效融合用户界面(UI)状态、现实视觉环境和语音输入等多模态信息,并在任务连续性和个性化记忆之间取得平衡。解决方案的关键在于提出X-OmniClaw这一统一架构,通过三个核心模块实现:Omni Perception构建了统一的多模态输入管道,结合时间对齐模块将原始数据转化为结构化的意图表示;Omni Memory利用运行时工作记忆与本地数据蒸馏的长期个人记忆相结合,提升个性化智能;Omni Action采用结构化XML元数据与视觉感知融合的混合接地策略,辅以行为克隆(Behavior Cloning)和轨迹回放(Trajectory Replay)技术提取可复用的导航技能,从而实现精准的直接执行。该架构显著提升了移动任务的交互效率与可靠性,为下一代原生移动个人助手提供了可落地的技术蓝图。

链接: https://arxiv.org/abs/2605.05765
作者: Xiaoming Ren,Ru Zhen,Chao Li,Yang Song,Qiuxia Hou,Yanhao Zhang,Peng Liu,Qi Qi,Quanlong Zheng,Qi Wu,Zhenyi Liao,Binqiang Pan,Haobo Ji,Haonan Lu
机构: OPPO AI Center (OPPO人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures

点击查看摘要

Abstract:Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.

[CV-98] RIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

【速读】:该论文旨在解决现有肺部CT模型评估基准中存在的结构性混杂问题,即静态回顾性数据集将病灶大小、肺叶分布、解剖结构和采集条件等因素交织在一起,导致难以准确识别影响模型性能的关键变量。其解决方案的核心在于提出iTRIALSPACE——一个可编程的评估框架,通过四阶段流程(多数据集结节特征建模、显式试验设计、解剖感知掩码插入及ControlNet条件化CT合成)构建受控虚拟病灶试验,从而实现对模型性能的可控、可解释与可验证测试。该框架基于包含54个属性的结节特征数据集(涵盖13,140个标注结节),并生成13种受控试验模式,在大规模虚拟病变研究中验证了合成数据与真实数据间性能排名高度一致(ρ = 0.93, p < 10⁻¹⁵),显著优于传统固定分布基准。

链接: https://arxiv.org/abs/2605.05761
作者: Fakrul Islam Tushar,Umme Hafsa Momy,Joseph Y. Lo,Geoffrey D. Rubin
机构: University of Arizona; Florida International University; Duke University Medical Center
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 13 figures, 13 tables

点击查看摘要

Abstract:We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multidataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatial-guidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data (ρ = 0.93, p < 10⁻¹⁵). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.
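合成到真实的性能排名迁移以 Spearman 秩相关 ρ 衡量,其定义可用如下草图复现(示意性实现,假定无并列名次):

```python
def spearman_rho(a, b):
    """Spearman rank correlation between two score lists (no ties assumed)."""
    def ranks(x):
        order = sorted(range(len(x)), key=lambda i: x[i])
        r = [0] * len(x)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

ρ 接近 1 表示在合成基准上的模型排序几乎原样迁移到真实临床数据。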

[CV-99] MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation

【速读】:该论文旨在解决生成式 3D 人-物交互(Human-Object Interaction, HOI)时面临的挑战:如何在保持高阶语义一致性的同时,精确维持物体间的物理接触。现有方法虽能较好对齐语义意图,但在物体接触精度上表现不足。作者发现一个关键现象——“几何遗忘”(Geometric Forgetting),即随着扩散模型深度增加,语义特征逐渐掩盖几何特征,导致模型丧失对物体形状细节的感知能力。解决方案的核心是提出 MaMi-HOI 框架,通过分层设计实现宏观运动流畅性与微观空间精度的统一:首先引入几何感知邻近适配器(Geometry-Aware Proximity Adapter, GAPA),显式重注入密集物体几何信息以进行残差贴合修正;其次设计运动学和谐适配器(Kinematic Harmony Adapter, KHA),主动对齐全身姿态与空间目标,确保骨骼结构在满足约束的前提下仍保持自然动作特性。此双机制协同优化,显著提升长期任务中复杂轨迹下的交互真实性与物理合理性。

链接: https://arxiv.org/abs/2605.05756
作者: Hao Wang,Shiqi Wang,Qi Liu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating realistic 3D Human-Object Interactions (HOI) is a fundamental task for applications ranging from embodied AI to virtual content creation, which requires harmonizing high-level semantic intent with strict low-level physical constraints. Existing methods excel at semantic alignment, however, they struggle to maintain precise object contact. We reveal a key finding termed \textitGeometric Forgetting: as diffusion model depth increases, semantic feature tend to overshadow object geometry feature, causing the model to lose its perception to object geometry. To address this, we propose MaMi-HOI, a hierarchical framework reconciling \textbfMacro-level kinematic fluidity with \textbfMicro-level spatial precision. First, to counteract geometric forgetting, we introduce the Geometry-Aware Proximity Adapter (GAPA), which explicitly re-injects dense object details to perform residual snapping corrections for precise contact. Nevertheless, such aggressive local enforcement can disrupt global dynamics, leading to robotic stiffness. In response, we introduce the Kinematic Harmony Adapter (KHA), which proactively aligns whole-body posture with spatial objectives, ensuring the skeleton actively accommodates constraints without compromising naturalness. Extensive experiments validate that MaMi-HOI simultaneously achieves natural motion and precise contact. Crucially, it extends generation capabilities to long-term tasks with complex trajectories, effectively bridging the gap between global navigation and high-fidelity manipulation in 3D scenes. Code is available at this https URL
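"残差贴合修正"的基本形式可示意为:仅当关节点已落在接触容差内时,才将其吸附到最近的物体表面点,否则保持原样。以下为假设性草图(非论文 GAPA 模块的原始实现):

```python
import numpy as np

def snap_to_surface(joint, surface_pts, max_dist=0.05):
    """Residual snapping: move a joint onto the nearest object surface point
    only when it is already within a contact tolerance."""
    d = np.linalg.norm(surface_pts - joint, axis=1)
    i = int(d.argmin())
    if d[i] <= max_dist:
        return surface_pts[i]
    return joint
```

这种只做小幅残差修正的设计,正是为了在保证接触精度的同时不破坏全局运动的自然性。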

[CV-100] Jointly Learning Structured Representations and Stabilized Affinity for Human Motion Segmentation

【速读】:该论文旨在解决现实世界视频中人类运动分割(Human Motion Segmentation, HMS)性能不佳的问题,其根源在于传统基于子空间聚类的方法假设高维时序特征分布符合联合子空间(Union-of-Subspaces, UoS)结构,而实际帧级特征往往违背此假设。解决方案的关键在于提出一种名为Temporal Deep Self-expressive Subspace Clustering (TDSC) 的新方法,通过交替优化输入帧特征的结构化表示与自表达系数,引入编码率最大化正则项防止表示坍塌并引导学习到符合UoS分布的表示,同时结合时间约束促使相邻帧被分配至同一运动段;此外,设计了时间动量平均机制以稳定亲和力演化,并采用重参数化策略提升优化效率,从而实现更准确、鲁棒的HMS。

链接: https://arxiv.org/abs/2605.05753
作者: Xianghan Meng,Zhiyuan Huang,Zhengyu Tong,Chun-Guang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This manuscript is currently under review by the IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

点击查看摘要

Abstract:Human Motion Segmentation (HMS), which aims to partition a video into non-overlapping segments corresponding to different human motions, has recently attracted increasing research attention. Existing HMS approaches are predominantly based on subspace clustering, which are grounded on the assumption that the distribution of high-dimensional temporal features well aligns with a Union-of-Subspaces (UoS). For videos in the real world, however, the raw frame-level features often violate the UoS assumption and yield unsatisfactory segmentation performance. To address this issue, we propose an efficient and effective approach for HMS, named Temporal Deep Self-expressive subspace Clustering (TDSC), which jointly learns temporally consistent structured representations and stabilized affinity for accurate and robust HMS. Specifically, in TDSC, we alternately learn structured representations of the input frame features and self-expressive coefficients via a properly regularized self-expressive model, in which a coding-rate maximization regularizer is incorporated to avoid representation collapse and conform the learned representations to span a desired UoS distribution, and meanwhile, temporal constraints are incorporated to promote temporally adjacent frames to be partitioned into the same groups. Moreover, we develop a temporal momentum averaging mechanism to stabilize affinity evolution and design a reparameterization strategy to enable efficient optimization. We conduct extensive experiments on five benchmark HMS datasets using both conventional (HoG) and up-to-date deep features (i.e., CLIP, DINOv2) to validate the effectiveness of our approach.
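自表达子空间聚类的核心是求解 X ≈ XC,即每帧特征由其余帧线性表示。以岭回归正则为例可得闭式解,如下草图所示(示意性实现;论文中的时间约束与编码率正则项此处省略):

```python
import numpy as np

def self_expressive_coeffs(X, lam=0.1):
    """Closed-form ridge-regularized self-expression: X ≈ X @ C,
    where column i of X is represented by the other columns."""
    n = X.shape[1]
    G = X.T @ X
    C = np.linalg.solve(G + lam * np.eye(n), G)
    np.fill_diagonal(C, 0.0)   # a frame should not represent itself
    return C
```

同一子空间(即同一运动段)内的帧之间会得到较大的表示系数,由此构造的亲和矩阵可直接用于谱聚类式的分割。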

[CV-101] Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

[Quick Read]: This paper tackles two core challenges in dense 3D reconstruction from continuous image streams: accurate geometric aggregation and stable long-term memory management that avoids accumulated error. Existing feed-forward reconstruction frameworks integrate observations through persistent memory representations but update them mainly by appearance similarity, causing redundant accumulation and unstable geometry under viewpoint changes. The key idea is a ray-aware pointer memory that explicitly models spatial location and viewing direction within a unified memory representation: each memory pointer stores a 3D position, its associated ray direction, and a feature embedding, enabling joint reasoning about geometric proximity and viewpoint consistency. On top of this, an adaptive pointer update strategy replaces traditional fusion-based compression with a retain-or-replace mechanism that selectively keeps informative pointers and discards redundant ones, preserving distinctive geometric structure while keeping memory growth bounded. Jointly considering spatial distance and ray-direction discrepancy further allows the system to distinguish local redundancy, novel observations, and potential loop revisits in a unified manner, triggering pose refinement upon loop detection to enforce global geometric consistency.

Link: https://arxiv.org/abs/2605.05749
Authors: Feifei Li,Qi Song,Chi Zhang,Rui Huang
Affiliations: The Chinese University of Hong Kong, Shenzhen; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.
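
The joint reasoning over spatial distance and ray-direction discrepancy can be sketched as a small decision rule. The thresholds and the three-way classification below are illustrative assumptions, not the paper's exact criteria: a nearby pointer seen from a similar ray is redundant, a nearby pointer seen from a very different ray suggests a loop revisit, and everything else is a novel observation.

```python
import math

def classify_observation(memory, pos, ray, d_thresh=0.1, angle_thresh=0.5):
    """Toy retain-or-replace decision for a ray-aware pointer memory.

    memory: list of (position, ray_direction) tuples with unit-norm rays.
    Returns 'redundant' if a stored pointer is spatially close AND views the
    point from a similar direction, 'loop' if close but viewed from a very
    different direction (potential revisit), 'novel' otherwise.
    """
    for m_pos, m_ray in memory:
        dist = math.dist(pos, m_pos)
        cos = sum(a * b for a, b in zip(ray, m_ray))  # cosine of ray angle
        if dist < d_thresh:
            return "redundant" if cos > math.cos(angle_thresh) else "loop"
    return "novel"

mem = [((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))]
print(classify_observation(mem, (0.0, 0.0, 1.01), (0.0, 0.0, 1.0)))   # redundant
print(classify_observation(mem, (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)))   # loop
print(classify_observation(mem, (5.0, 0.0, 0.0), (1.0, 0.0, 0.0)))    # novel
```

An appearance-only memory cannot make the middle distinction, which is exactly the failure mode the paper attributes to prior appearance-driven updates.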

[CV-102] mathcalB3-Net: Controlled Posterior Bridge Learning for Multi-Task Dense Prediction

[Quick Read]: This paper targets negative transfer in multi-task dense prediction caused by the failure of existing methods to explicitly model how evidence reliability varies across tasks. Prior approaches fuse task information implicitly through attention or bridge features without distinguishing how trustworthy the evidence is across tasks and spatial locations, so unreliable information contaminates the shared representation. The proposed B3-Net is a controlled posterior bridge learning framework with three key components: a Precision Field Estimator that quantifies local evidence precision; a Posterior Bridge Operator that builds a precision-weighted posterior bridge via heteroscedastic evidence fusion, improving the reliability of the shared state; and a Contractive Dispatch Operator that redistributes the bridge to each task branch through bounded updates, suppressing uncontrolled feature injection. The method clearly improves multi-task performance, and experiments verify that the gains come from the controlled bridging mechanism rather than backbone capacity or decoder scale.

Link: https://arxiv.org/abs/2605.05722
Authors: Meihua Zhou,Li Yang
Affiliations: Wannan Medical University; University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures

Abstract:Multi-task dense prediction solves complementary pixel-level tasks in a unified model, such as semantic segmentation, depth estimation, surface normal estimation, and edge detection. Existing decoder-side interactions use attention, prompts, routing, diffusion, Mamba, or bridge features to exchange task evidence, but most of them organize this evidence implicitly. They usually fuse task features by similarity or affinity, without explicitly modeling that evidence reliability varies across tasks and spatial locations. As a result, unreliable evidence may contaminate the shared representation and intensify negative transfer. We propose \mathcalB^3 -Net, a controlled posterior bridge learning framework for multi-task dense prediction. Our method decomposes decoder-side interaction into reliability estimation, posterior bridge construction, and bounded redistribution. The Precision Field Estimator estimates patch-wise evidence precision from task-reference alignment and local variation. The Posterior Bridge Operator builds a precision-weighted posterior bridge through heteroscedastic evidence fusion, yielding a shared state more reliable than uniform or heuristic mixtures. The Contractive Dispatch Operator redistributes the bridge to each task branch through a bounded update, reducing uncontrolled feature injection. Experiments on NYUD-v2, PASCAL-Context, and Cityscapes show that \mathcalB^3 -Net achieves competitive or superior trade-offs over representative CNN-, Transformer-, diffusion-, Mamba-, and bridge-feature-based methods. Backbone-matched comparisons and extensive analyses further verify that the gains arise from controlled posterior bridge learning rather than backbone capacity or decoder scale.
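
Precision-weighted heteroscedastic fusion has a simple scalar analogue: inverse-variance weighting, where each task's evidence contributes in proportion to its estimated precision. The sketch below is a generic illustration of that principle, not B³-Net's actual operator (which works on patch-wise features), and the numbers are made up.

```python
def precision_weighted_fusion(estimates, precisions):
    """Fuse per-task estimates by evidence precision (inverse variance).

    Higher-precision (more reliable) evidence dominates the fused shared
    state; unreliable evidence is down-weighted instead of averaged in
    uniformly, which is how heuristic mixtures let it contaminate the fusion.
    """
    total = sum(precisions)
    return sum(p * x for p, x in zip(precisions, estimates)) / total

# Two tasks agree; a third is an unreliable outlier. Uniform weighting lets
# the outlier drag the fused value; precision weighting suppresses it.
vals = [1.0, 1.1, 9.0]
uniform = precision_weighted_fusion(vals, [1.0, 1.0, 1.0])
weighted = precision_weighted_fusion(vals, [10.0, 10.0, 0.1])
print(uniform, weighted)  # the weighted fusion stays near the reliable pair
```

This is the intuition behind why the paper's gains concentrate where task evidence is spatially unreliable.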

[CV-103] TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

[Quick Read]: This paper addresses the limited generalization of Vision-Language-Action (VLA) models in robotic manipulation, especially on unseen scenes, objects, and task compositions, which stems from their reliance on implicit visual representations that make policies sensitive to visual variation. The proposed TriRelVLA is a triadic relational VLA framework whose key idea is to construct explicit object-hand-task triadic representations as relational primitives, build a task-grounded relational graph, and model interactions among nodes with a relation-aware graph transformer. The relational structure is then compressed into a bottleneck space and projected into the LLM for action prediction, reducing reliance on appearance statistics and markedly improving cross-scene, cross-object, and cross-task generalization.

Link: https://arxiv.org/abs/2605.05714
Authors: Hanyu Zhou,Chuanhao Ma,Gim Hee Lee
Affiliations: National University of Singapore; Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Vision-language-action (VLA) models perform well on training-seen robotic tasks but struggle to generalize to unseen scenes and objects. A key limitation lies in their implicit visual representations, which entangle object appearance, background, and scene layout. This makes policies sensitive to visual variations. Prior work improves transferability through structured intermediate representations that objectify visual content. However, these representations mainly capture scene semantics instead of action-relevant relations. As a result, action prediction remains tied to appearance statistics. We observe that manipulation actions depend on the object-hand-task relational structure, which governs interactions among task requirements, robot states, and object properties. Based on this observation, we propose TriRelVLA, a triadic relational VLA framework for generalizable embodied manipulation. Our approach consists of three components: 1) We construct explicit object-hand-task triadic representations from multimodal inputs as relational primitives. 2) We build a task-grounded relational graph. Task-guided cross-attention forms nodes, and a relation-aware graph transformer models interactions among them. 3) We perform relation-conditioned action generation. The relational structure is compressed into a bottleneck space and projected into the LLM for action prediction. This triadic relational bottleneck reduces reliance on appearance statistics and enables transfer across scenes, objects, and task compositions. We further introduce a real-world robotic dataset for fine-tuning. Experiments show strong performance on fine-tuned tasks and clear gains in cross-scene, cross-object, and cross-task generalization.

[CV-104] EgoEMG: A Multimodal Egocentric Dataset with Bilateral EMG and Vision for Hand Pose Estimation NEURIPS2026

[Quick Read]: This paper addresses the lack of benchmark datasets with synchronized EMG and vision for multimodal hand pose estimation, which limits hand-motion sensing accuracy and cross-scenario generalization. The key contribution is EgoEMG, the first dataset to synchronize bilateral wristband EMG (16 channels sampled at 2 kHz), IMU (120 Hz), egocentric RGB video, external RGB-D video, and mocap-derived hand joint angles, covering 41 participants performing 60 gesture classes (single-hand and bimanual) for over 10 hours in total. The authors also define a unified joint-angle prediction task with common generalization splits (cross-gesture, cross-user, and combined) and propose a residual fusion architecture that effectively combines EMG and visual information, clearly outperforming matched lightweight single-modality baselines and establishing a foundation for future research on EMG-vision hand pose estimation.

Link: https://arxiv.org/abs/2605.05712
Authors: Ziheng Xi,Jiayi Yu,Yitao Wang,Yanbo Duan,Jianjiang Feng,Jie Zhou
Affiliations: Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 13 figures, 15 tables. Submitted to NeurIPS 2026

Click to view abstract

Abstract:Surface electromyography (sEMG) records muscle activity during hand movement and can be decoded to recover detailed hand articulation. EMG and egocentric vision are complementary for hand sensing: EMG captures fine-grained finger articulation even under occlusion and poor lighting, while vision provides global hand configuration. However, no existing dataset synchronizes both modalities. We present EgoEMG, a multimodal egocentric dataset for bimanual hand pose estimation. EgoEMG includes bilateral wristband EMG with 16 total channels (8 per wrist) sampled at 2 kHz, 120 Hz IMU, egocentric wide-angle RGB video, external RGB-D video, and mocap-derived hand motion with wrist articulation angles. The dataset covers 41 participants performing 60 gesture classes, including 30 single-hand gestures and 30 bimanual gestures, totaling more than 10 hours of recording. We also introduce a benchmark with three tasks – EMG-to-pose, vision-to-pose, and EMG+vision fusion – under a shared joint-angle prediction target and common generalization split axes (cross-gesture, cross-user, and combined). As baselines, we evaluate EMGFormer for EMG-to-pose and generic ResNet/ViT backbones for vision-to-pose. We further study a residual fusion architecture that improves over matched lightweight vision-only baselines. Together, EgoEMG and its benchmark establish a foundation for future research on multimodal hand pose estimation with EMG and vision.
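
The residual fusion idea (EMG refines the vision branch's estimate rather than replacing it) can be sketched with a linear corrector. The joint count, shapes, and zero-initialized corrector below are illustrative assumptions, not the paper's architecture; they only show the property that fusion starts exactly at the vision-only output and learns an EMG-driven residual on top.

```python
import numpy as np

def residual_fusion(vision_pose, emg_feat, W):
    """Residual fusion: vision joint-angle estimate plus an EMG-driven
    correction. Shapes (illustrative): vision_pose (J,), emg_feat (C,),
    W (J, C). With W = 0 the fused output equals the vision prediction."""
    return vision_pose + W @ emg_feat

rng = np.random.default_rng(1)
J, C = 21, 16                       # J joint angles (assumed), 16 EMG channels (dataset spec)
vision_pose = rng.uniform(0.0, 1.0, J)
emg_feat = rng.standard_normal(C)
W = np.zeros((J, C))                # zero-init corrector: no change before training

fused = residual_fusion(vision_pose, emg_feat, W)
print(np.allclose(fused, vision_pose))  # True: fusion cannot hurt at init
```

This "identity at initialization" property is a common reason residual fusion beats matched vision-only baselines without destabilizing them.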

[CV-105] Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition

[Quick Read]: This paper addresses two core problems in multimodal fusion for video-based emotion recognition: existing methods that fuse facial-expression and remote photoplethysmography (rPPG) features often disrupt pretrained facial representations, hurting generalization, and they lack explicit mechanisms for suppressing subject-specific variation, degrading performance on unseen subjects. The key idea is a subject-invariant cross-modal prompt-tuning framework: rPPG waveforms are first converted into noise-robust time-frequency representations (TFRs), from which modality-complementary prompts are generated to modulate facial tokens inside a frozen Vision Transformer (ViT), enabling efficient cross-modal interaction while preserving transferable facial representations; in addition, a decoupled shared-specific adapter (DSSA) inserted into each ViT layer explicitly separates subject-shared and subject-specific components, substantially improving cross-subject generalization.

Link: https://arxiv.org/abs/2605.05694
Authors: Xiwen Luo,Jia Li,Rencheng Song,Yu Liu,Juan Cheng
Affiliations: Hefei University of Technology; Anhui Province Key Laboratory of Measuring Theory and Precision Instrument
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The source code will be available at this https URL

Click to view abstract

Abstract:Emotion recognition from facial videos enables non-contact inference of human emotional states. Although facial expressions are widely used cues, they cannot fully reflect intrinsic affective states. Remote photoplethysmography (rPPG) provides complementary physiological information, but it is highly susceptible to noise and inter-subject variability, limiting generalization to unseen individuals. Existing multimodal methods combine facial and rPPG features, yet their fusion strategies often disrupt pretrained facial representations and lack explicit mechanisms to suppress subject-specific variations. To address these issues, we propose a subject-invariant cross-modal prompt-tuning framework for video-based emotion recognition. Specifically, rPPG waveforms are transformed into noise-robust time-frequency representations (TFRs), from which modality-complementary prompts are generated to modulate facial tokens within a frozen Vision Transformer (ViT). This design enables effective cross-modal interaction while preserving the generalizable facial representations learned by the pretrained backbone. In addition, we introduce a decoupled shared-specific adapter (DSSA) into each ViT layer to explicitly separate subject-shared and subject-specific components, thereby improving cross-subject generalization. Experiments on the MAHNOB-HCI and DEAP benchmarks demonstrate that the proposed method consistently outperforms strong baselines in both recognition accuracy and generalization ability, highlighting its effectiveness for video-based emotion recognition.

[CV-106] CFE-PPAR: Compression-friendly encryption for privacy-preserving action recognition leveraging video transformers ICIP

[Quick Read]: This paper addresses the sharp performance drop of existing encryption methods for privacy-preserving action recognition (PPAR) after video compression, i.e., their lack of compression-friendliness. The key contribution is CFE-PPAR, the first compression-friendly encryption method for PPAR: the secret keys used to encrypt videos are also used to apply matching transformations to the parameters of a video transformer, so that encrypted videos can be recognized directly by the model while maintaining high recognition performance under Motion-JPEG and H.264 compression.

Link: https://arxiv.org/abs/2605.05692
Authors: Haiwei Lin,Shoko Imaizumi,Hitoshi Kiya
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 6 pages, 5 figures, accepted to 2026 IEEE International Conference on Image Processing (ICIP)

Click to view abstract

Abstract:Privacy-preserving action recognition (PPAR) enables machines to understand human activities in videos without revealing sensitive visual content. Among the various strategies for PPAR, encryption-based methods achieve strong privacy protection while maintaining high recognition performance. However, these methods lead to a catastrophic decrease in recognition performance and visual quality when the encrypted videos are compressed. That is, the previous methods are not compression-friendly. To address these issues, in this paper, we propose the first compression-friendly encryption method for PPAR, called CFE-PPAR. In CFE-PPAR, videos encrypted with secret keys can be directly recognized by a video transformer, which uses parameters transformed by the same keys as those used for video encryption. In experiments, it is verified that CFE-PPAR outperforms previous methods on the UCF101 and HMDB51 datasets under Motion-JPEG and H.264 compression.
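
The core mechanism, encrypting inputs with a secret key while transforming model parameters with the same key, can be demonstrated on a linear patch embedding. This is a generic illustration of key-matched permutation common in learnable-encryption work, not necessarily CFE-PPAR's exact scheme: permuting a patch's pixels with key k and permuting the columns of the embedding matrix with the same k leaves the embedding output unchanged, so recognition is preserved.

```python
import random

def embed(W, x):
    """Linear patch embedding y = W @ x (plain lists as tiny tensors)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def make_perm(key, n):
    """Key-derived pixel permutation."""
    perm = list(range(n))
    random.Random(key).shuffle(perm)
    return perm

key, d, out_dim = 42, 6, 3
rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(out_dim)]
x = [rng.uniform(0, 1) for _ in range(d)]

perm = make_perm(key, d)
x_enc = [x[j] for j in perm]                    # encrypt the patch with the key
W_enc = [[row[j] for j in perm] for row in W]   # transform weights with the same key

plain = embed(W, x)
cipher = embed(W_enc, x_enc)
print(all(abs(a - b) < 1e-12 for a, b in zip(plain, cipher)))  # identical embeddings
```

Because W_enc[i][j] · x_enc[j] = W[i][perm[j]] · x[perm[j]], each output sum reorders but equals the original, which is why the transformed model recognizes encrypted inputs directly.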

[CV-107] R2H-Diff: Guided Spectral Diffusion Model for RGB-to-Hyperspectral Reconstruction

[Quick Read]: This paper tackles RGB-to-hyperspectral (RGB-to-HSI) reconstruction, a highly ill-posed inverse problem in which multiple plausible spectral distributions can correspond to the same RGB observation, so regression-based methods struggle to model reconstruction uncertainty and often produce over-smoothed spectral responses. The key contribution is R2H-Diff, an efficient diffusion-based framework that formulates spectral recovery as a conditional iterative refinement process under RGB guidance. Its core designs are: 1) a Guided Spectral Refinement Module for RGB-conditioned feature fusion; 2) a Hyperspectral-Adaptive Transposed Attention module for efficient spatial-spectral dependency modeling; and 3) a normalization-free denoising backbone that preserves spectral amplitude consistency, combined with a task-adapted linear noise schedule that achieves high-quality reconstruction with only five denoising steps, balancing reconstruction fidelity against model complexity and computational cost.

Link: https://arxiv.org/abs/2605.05688
Authors: Songyu Ding,Ronggiang Zhao,Mingchun Sun,Jie Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:RGB-to-hyperspectral image reconstruction is a highly ill-posed inverse problem, since multiple plausible spectral distributions may correspond to the same RGB observation. Existing regression-based methods usually learn a deterministic mapping, which limits their ability to model reconstruction uncertainty and often leads to over-smoothed spectral responses. Although diffusion models provide strong distribution modeling capability, their direct application to hyperspectral reconstruction remains challenging due to the high spectral dimensionality, strong inter-band correlations, and strict requirement for spectral fidelity. To this end, we propose R2H-Diff, an efficient diffusion-based framework tailored for RGB-to-HSI reconstruction. Specifically, R2H-Diff formulates spectral recovery as a conditional iterative refinement process, enabling progressive reconstruction under RGB guidance. We proposed a Guided Spectral Refinement Module for RGB-conditioned feature fusion and a Hyperspectral-Adaptive Transposed Attention module for efficient spatial–spectral dependency modeling. Furthermore, a normalization-free denoising backbone is adopted to preserve spectral amplitude consistency, while a task-adapted linear noise schedule enables high-quality reconstruction with only five denoising steps. Extensive experiments on NTIRE2022, CAVE, and Harvard demonstrate that R2H-Diff achieves a favorable balance between reconstruction quality and computational efficiency. Notably, on NTIRE2022, R2H-Diff obtains 35.37 dB PSNR with a sub-million-parameter model of 0.58M parameters and 12.25G FLOPs, achieving the lowest model complexity among the evaluated methods while maintaining strong reconstruction fidelity.
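
A linear noise schedule of the kind mentioned above can be sketched in a few lines. The endpoints below are illustrative, not the paper's values; the sketch only shows the DDPM-style bookkeeping: betas interpolate linearly across the (here, five) steps, and the cumulative product ᾱ_t tracks how much signal is retained at each step.

```python
def linear_schedule(beta_start, beta_end, steps):
    """Linear noise schedule: betas interpolate linearly over `steps`;
    alpha_bar_t = prod_{s<=t} (1 - beta_s) is the cumulative
    signal-retention factor used by DDPM-style samplers."""
    betas = [beta_start + (beta_end - beta_start) * t / (steps - 1)
             for t in range(steps)]
    alpha_bars, acc = [], 1.0
    for b in betas:
        acc *= 1.0 - b
        alpha_bars.append(acc)
    return betas, alpha_bars

betas, alpha_bars = linear_schedule(1e-4, 0.9, 5)  # endpoints are illustrative
print([round(a, 4) for a in alpha_bars])           # monotonically decreasing
```

With so few steps, each denoising pass must remove a large noise fraction, which is why the conditioning (RGB guidance) has to carry much of the reconstruction burden.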

[CV-108] MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery ICML2026

[Quick Read]: This paper addresses local joint reconstruction errors in diffusion-based full-body 3D human motion recovery, which arise from relying on global distribution matching. The proposed MotionGRPO injects fine-grained guidance through reinforcement learning post-training: diffusion sampling is modeled as a Markov decision process and optimized with Group Relative Policy Optimization (GRPO), using a hybrid reward that combines a learned conditioned perceptual model for global visual plausibility with explicit constraints for local joint precision. To mitigate the vanishing gradients caused by limited intra-group sample diversity, a noise-injection strategy is further introduced to increase sample variance and stabilize training.

Link: https://arxiv.org/abs/2605.05680
Authors: Nanjie Yao,Junlong Ren,Wenhao Shen,Hao Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Click to view abstract

Abstract:This paper studies full-body 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveraging reinforcement learning post-training to inject fine-grained guidance into the diffusion process. Technically, we model diffusion sampling as a Markov decision process optimized via Group Relative Policy Optimization (GRPO). To this end, we introduce a hybrid reward mechanism that combines a learned conditioned perceptual model for global visual plausibility and explicit constraints for local joint precision. Our key technical insight is that policy optimization in diffusion-based recovery suffers from vanishing gradients due to limited intra-group sample diversity. To address this, we further introduce a noise-injection strategy that explicitly increases sample variance and stabilizes learning. Extensive experiments demonstrate that MotionGRPO achieves state-of-the-art performance with superior visual fidelity.
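
The vanishing-gradient failure mode is easy to see in the GRPO advantage computation itself. GRPO standardizes rewards within each group of rollouts, so a group of near-identical samples yields near-zero advantages and hence no learning signal; the toy numbers below are made up, and this shows the mechanism only, not MotionGRPO's full reward.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: standardize rewards within a
    group of rollouts sampled for the same condition."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Low intra-group diversity: identical rollouts -> zero advantages -> the
# policy gradient vanishes. Diverse rollouts carry a usable signal.
degenerate = group_advantages([0.70, 0.70, 0.70, 0.70])
diverse = group_advantages([0.2, 0.9, 0.5, 0.7])
print(max(abs(a) for a in degenerate), max(abs(a) for a in diverse))
```

The paper's noise-injection strategy attacks exactly this: deliberately increasing intra-group sample variance so the standardized advantages stay informative.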

[CV-109] EGA: Adapting Frozen Encoders for Vector Search with Bounded Out-of-Distribution Degradation

[Quick Read]: This paper addresses the collapse of adapter training for vector-search systems built on frozen vision encoders when deployed queries come from unseen classes: high-capacity adapters trained with global contrastive losses silently reassign unseen-class samples to wrong seen-class clusters, dropping worst-case Label Precision by over 40 points below the frozen baseline. The key contribution is Euclidean Geodesic Alignment (EGA), a residual adapter coupling three principles: zero initialization, a local triplet loss, and hypersphere projection. Together these induce a self-limiting dynamic: triplets that already satisfy a small margin stop producing gradients, so the adapter automatically stops updating where the local geometry is already correct, enabling full-capacity refinement of seen classes without perturbing unseen-class regions. Experiments show that 96.5% of triplets are gradient-free at convergence, and EGA achieves the best worst-case Label Precision across five diverse out-of-distribution (OOD) benchmarks.

Link: https://arxiv.org/abs/2605.05674
Authors: Dongfang Zhao
Affiliations: University of Washington
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class samples to wrong seen-class clusters, dropping worst-case Label Precision by over 40 points below the frozen baseline in our tests. We propose Euclidean Geodesic Alignment (EGA), a residual adapter that couples three principles: zero initialization, local triplet loss, and hypersphere projection. These collectively induce a self-limiting dynamic: triplets that already satisfy a small margin stop producing gradients, so the adapter automatically stops updating where the local geometry is already correct. Our experiments show that at convergence 96.5% of triplets are gradient-free, leaving unseen-class regions largely untouched while still enabling full-capacity refinement of seen classes. Across five diverse out-of-distribution (OOD) benchmarks, EGA achieves the highest worst-case Label Precision on the four primary splits and a consistent improvement on the fifth. The design also transfers to stronger backbones in addition to CLIP, and we provide an analytical justification linking gradient sparsity to bounded OOD perturbation.
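
The self-limiting dynamic follows directly from the hinge form of the triplet loss: once a triplet satisfies the margin, its loss (and gradient) is exactly zero. The sketch below uses made-up anchor-positive/anchor-negative distances and a small margin to show how a "gradient-free fraction" like the paper's 96.5% arises.

```python
def triplet_losses(triplets, margin=0.05):
    """Hinge triplet loss per triplet: max(0, d_ap - d_an + margin).
    A triplet that already satisfies the margin contributes zero loss and
    therefore zero gradient -- the self-limiting dynamic described above."""
    return [max(0.0, d_ap - d_an + margin) for d_ap, d_an in triplets]

# (d_ap, d_an): anchor-positive vs anchor-negative distances (illustrative).
trips = [(0.2, 0.9), (0.3, 0.32), (0.8, 0.4), (0.1, 0.7)]
losses = triplet_losses(trips)
grad_free = sum(l == 0.0 for l in losses) / len(losses)
print(losses, grad_free)  # satisfied triplets are exactly gradient-free
```

Combined with zero initialization, this means the adapter only moves embeddings in regions where local triplets are violated, leaving well-separated (including unseen-class) regions untouched.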

[CV-110] Large Vision-Language Models Get Lost in Attention ICML2026

[Quick Read]: This paper addresses the unclear functional roles and mechanistic redundancy of modules inside the decoder backbones of large vision-language models (LVLMs), in particular the confusion between attention and feed-forward networks (FFNs) in information processing. The key contribution is a unified framework grounded in information theory and geometry that quantifies the geometric structure and entropy changes of residual updates, revealing that attention is essentially a subspace-preserving reconfiguration operator while FFNs are subspace-expanding operators driving semantic innovation. Further experiments show that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even better performance, exposing severe misallocation and redundancy in current attention mechanisms and suggesting that state-of-the-art LVLMs may "get lost in attention" rather than efficiently leveraging visual context.

Link: https://arxiv.org/abs/2605.05668
Authors: Gongli Xi,Ye Tian,Mengyu Yang,Huahui Yi,Liang Lin,Xiaoshuai Hao,Kun Wang,Wendong Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 10 figures. Accepted by ICML 2026

Click to view abstract

Abstract:Despite the rapid evolution of training paradigms, the decoder backbone of large vision–language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively "get lost in attention" rather than efficiently leveraging visual context.
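
The "predefined attention" ablation can be sketched without any learned weights: draw fixed Gaussian logits in place of QKᵀ scores and softmax them row-wise. The sketch below is only meant to show why this substitution is well-formed, because any fixed logits still yield a valid row-stochastic mixing matrix for the residual stream; it says nothing about downstream accuracy.

```python
import math
import random

def softmax(row):
    """Numerically stable row-wise softmax."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

rng = random.Random(0)
n = 6  # toy sequence length
# Predefined (non-learned) attention: fixed Gaussian logits instead of QK^T.
logits = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
attn = [softmax(row) for row in logits]

row_sums = [sum(row) for row in attn]
print(all(abs(s - 1.0) < 1e-9 for s in row_sums))  # still a valid mixing matrix
```

The paper's surprising finding is that such non-learned mixing often matches learned attention in LVLMs, which is evidence that the learned weights are redundant rather than that mixing itself is unnecessary.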

[CV-111] Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes

[Quick Read]: This paper targets high-fidelity, complete scene reconstruction from extremely sparse views (as few as six to eight images), where existing methods suffer missing regions, blurring, or artifacts at low view density. The proposed S2C-3D framework has three key components: a specialized diffusion model, finetuned from a pretrained one, that repairs scene-specific images while eliminating domain gaps; a training-free view-consistency conditioned sampling mechanism that injects consistency information from neighboring repaired images into the frozen diffusion model to generate view-consistent images; and a camera trajectory planning scheme that connects each newly sampled camera to its nearest neighbors and iteratively refines paths to ensure comprehensive scene coverage. The method robustly reconstructs high-quality 3D Gaussian representations and clearly outperforms state-of-the-art approaches.

Link: https://arxiv.org/abs/2605.05664
Authors: Yiyang Shen,Yin Yang,Kun Zhou,Tianjia Shao
Affiliations: State Key Lab of CADCG, Zhejiang University; Hangzhou Research Institute of Holographic and AI Technology; University of Utah
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce S2C-3D, a novel sparse-view 3D reconstruction framework for high-fidelity and complete scene reconstruction from as few as six to eight images. Our framework features three components: a specialized diffusion model for scene-specific image restoration, a training-free view-consistency conditioned sampling process in the diffusion model for refined Gaussian optimization, and a camera trajectory planning scheme to ensure comprehensive scene coverage. The specialized diffusion model is developed by finetuning a pretrained architecture on the input views and their corresponding degraded counterparts. The adaptation to the scene distribution allows the model to repair Gaussian renderings while effectively eliminating domain gaps. Meanwhile, the trajectory planning scheme optimizes scene coverage by connecting each newly sampled camera to its two nearest neighbors. By iteratively constructing paths and retaining only those that significantly enhance visibility, the scheme establishes a trajectory that covers the entire scene. To address multi-view conflicts, the view-consistency conditioned sampling process quantifies the consistency between neighboring repaired images. This information is injected as a condition into the sampling process of the frozen diffusion model, facilitating the generation of view-consistent images without additional training. Consequently, our approach produces high-fidelity 3D Gaussians that are robust to artifacts. Experimental results demonstrate that S2C-3D outperforms state-of-the-art methods, constructing high-quality scenes that are free from missing regions, blurring, or other artifacts with very sparse inputs. The source code and data are available at this https URL.
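
The first step of the trajectory-planning scheme, connecting each newly sampled camera to its two nearest existing cameras, reduces to a nearest-neighbor query. The 2D positions below are illustrative stand-ins for camera poses; the visibility-gain filtering that follows this step is omitted.

```python
import math

def two_nearest(cameras, new_cam):
    """Return the two existing cameras closest to a newly sampled one,
    the anchors for the candidate path in the planning scheme (toy 2D)."""
    ranked = sorted(cameras, key=lambda c: math.dist(c, new_cam))
    return ranked[:2]

cams = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
anchors = two_nearest(cams, (0.2, 0.1))
print(anchors)  # the two closest cameras anchor the new path
```

In the full scheme, only paths that significantly increase scene visibility are retained, so the trajectory grows toward uncovered regions rather than densifying already-seen ones.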

[CV-112] MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality ICML2026

[Quick Read]: This paper addresses the fundamental trade-off in unified visual tokenization between pixel-level reconstruction fidelity (spatial equivariance) and semantic abstraction (conceptual invariance). The authors attribute this conflict to Manifold Misalignment: during joint optimization, reconstruction and perception produce mutually cancelling gradients, creating a zero-sum game. The key idea of the proposed MUSE framework is Topological Orthogonality, which treats structure as an orthogonal bridge that decouples optimization inside the Transformer: structural gradients refine the attention topology while semantic gradients update feature values, turning destructive gradient interference into mutual reinforcement. Experiments show MUSE breaks the traditional trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), verifying that structurally aligned reconstruction can enhance semantic perception.

Link: https://arxiv.org/abs/2605.05646
Authors: Panqi Yang,Haodong Jing,Jiahao Chao,Tingyan Xiang,Li Lin,Yao Hu,Yang Luo,Yongqiang Ma
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, Accepted by ICML 2026 main track

Click to view abstract

Abstract:Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at this https URL.

[CV-113] AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries

[Quick Read]: This paper observes that existing affective-understanding research mostly recognizes emotions from images, audio, or pre-clipped video segments, neglecting the real-world setting in which users interact with long videos through natural-language queries. The authors therefore propose a new task, Vague-Query-driven video Affective Understanding (VQAU), which requires models to localize affective moments in long videos, predict emotion categories, and generate evidence-grounded rationales. To support the task, they construct the VQAU-Bench benchmark, which integrates long videos, vague affective queries, temporal clip annotations, emotion labels, and rationale explanations into a unified evaluation framework. The key contribution is AffectSeek, an agentic reasoning framework that decomposes VQAU into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation, progressively aligning vague user intent with long-video evidence through role-specialized reasoning and cross-stage verification, enabling accurate affective understanding under multi-step reasoning.

Link: https://arxiv.org/abs/2605.05640
Authors: Zhen Zhang,Yuhang Yang,Yunxiang Jiang,Yuhuan Lu,Haifeng Lu,Zheng Lian,Runhao Zeng,Xiping Hu
Affiliations: Lanzhou University; Shenzhen MSU-BIT University; Khalifa University; The University of Hong Kong; Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre-clipped video clips, where the affective evidence is already given. This passive and clip-centered setting does not fully reflect real-world scenarios, in which users often interact with long videos and express their needs through natural-language queries. In this paper, we study \textbfVague-Query-driven video Affective Understanding (VQAU), a new task that requires models to localize affective moments in long videos, predict their emotion categories, and generate evidence-grounded rationales under vague user queries. To support this task, we construct \textbfVQAU-Bench, a benchmark that integrates long videos, vague affective queries, temporal clip annotations, emotion labels, and rationale explanations into a unified evaluation framework. VQAU-Bench enables systematic assessment of semantic-temporal-affective alignment, affective moment localization, emotion classification, and rationale generation. To address the multi-step reasoning challenges of VQAU, we further propose \textbfAffectSeek, an agentic framework that actively seeks, verifies, and explains affective moments in long videos. AffectSeek decomposes VQAU into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation, and progressively aligns vague user intent with long-video evidence through role-specialized reasoning and cross-stage verification. Experiments show that VQAU remains challenging for existing affective recognition models and single-step vision-language models, while AffectSeek provides a simple yet effective framework for agentic long-video affective understanding.

[CV-114] Learning a Delighting Prior for Facial Appearance Capture in the Wild SIGGRAPH

[Quick Read]: This paper addresses high-quality facial reflectance reconstruction from in-the-wild smartphone videos; traditional approaches require costly studio recording, and existing model-based inverse rendering struggles to disentangle reflectance from unknown illumination. The key idea is to train a delighting network as a prior that constrains the optimization, introducing Dataset Latent Modulation (DLM) to seamlessly integrate heterogeneous data sources from the OLAT dataset and rendered Light Stage scans: by conditioning the core network on learnable source-aware tokens, dataset-specific styles are decoupled from physical delighting principles, producing a powerful delighting prior that surpasses existing proprietary models, significantly improving reflectance estimation, and enabling an automatic high-fidelity facial appearance capture pipeline.

Link: https://arxiv.org/abs/2605.05636
Authors: Yuxuan Han,Xin Ming,Tianxiao Li,Zhuofan Shen,Qixuan Zhang,Lan Xu,Feng Xu
Affiliations: Tsinghua University; ShanghaiTech University; Deemos Technology Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: ACM Transactions on Graphics (Proc. of SIGGRAPH), 2026. Code: this https URL Project Page: this https URL

Click to view abstract

Abstract:High-quality facial appearance capture has traditionally required costly studio recording. Recent works consider an in-the-wild smartphone-based setup; however, their model-based inverse rendering paradigm struggles with the complex disentanglement of reflectance from unknown illumination. To bridge this gap, we propose to shift the paradigm into training a powerful delighting network as a prior to constrain the optimization. We leverage the OLAT dataset and the rendered Light Stage scans for training, and propose Dataset Latent Modulation (DLM) to seamlessly integrate these heterogeneous data sources. Specifically, by conditioning the core network on learnable source-aware tokens, we decouple dataset-specific styles from physical delighting principles, enabling the emergence of a delighting prior that outperforms existing proprietary models. This powerful delighting prior enables a simple and automatic appearance capture pipeline that achieves high-quality reflectance estimation from casual video inputs, outperforming prior arts by a large margin. Furthermore, we leverage our appearance capture method to transform the multi-view NeRSemble dataset into NeRSemble-Scan, a large-scale collection of 4K-resolution relightable scans. By open-sourcing our model and the NeRSemble-Scan dataset, we democratize high-end facial capture and provide a new foundation for the research community to build photorealistic digital humans.

[CV-115] Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping

[Quick Read]: This paper addresses the limits imposed on deep models for fine-grained semantic segmentation of forest-regeneration species by scarce expert annotations and extreme class imbalance. The core solution uses a vision-language model (VLM) to generate high-fidelity synthetic images together with pixel-aligned semantic masks, forming a scalable training framework that combines real and AI-generated data. The key element is automated production of high-quality synthetic data with the prompt-driven Nano Banana Pro model; even small amounts of generated data substantially improve recognition of rare species (per-species F1 gains of up to 30 percentage points), and unified training ultimately outperforms purely supervised baselines by over 15 percentage points in F1.

Link: https://arxiv.org/abs/2605.05627
Authors: Gabriel Jeanson,David-Alexandre Duclos,William Larrivée-Hardy,Noé Cochet,Matěj Boxan,Anthony Deschênes,François Pomerleau,Philippe Giguère
Affiliations: Université Laval
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 36 pages, 17 figures

Click to view abstract

Abstract:Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour-intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning-based interpretation is bottlenecked by the severe scarcity of expert-annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imbalance in the semantic segmentation of fine-grained forest regeneration species by providing a scalable framework that reduces reliance on manual photo-interpretation for high-resolution, millimetre-level aerial imagery. Importantly, we leverage the large-scale vision-language Nano Banana Pro model to simultaneously generate high-fidelity images and their corresponding pixel-aligned semantic masks from prompts. We introduce WilDReF-Q-V2, an expansion of a natural forest dataset with 13 977 new unlabelled and 50 labelled real images, as well as the Gen4Regen dataset, featuring 2101 pairs of synthetic images and semantic masks. Our methodology integrates real-world data with AI-generated images, highlighting that AI-generated data is highly complementary to real-world data, with unified training yielding an F1 score improvement of over 15 %pt compared to purely supervised baselines. Furthermore, we demonstrate that even small quantities of prompt-generated data significantly improve performance for underrepresented species, some of which saw per-species F1 score gains of up to 30 %pt. We conclude that vision-language models can serve as agile data generators, effectively bootstrapping perception tasks for niche AI domains where expert labels are scarce or unavailable. Our datasets, source code, and models will be available at this https URL.

[CV-116] RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis

【Quick Read】: This paper addresses the lack of a unified multi-level analysis framework for assessing rheumatoid arthritis (RA) from hand radiographs: existing public datasets generally lack full-hand coverage, fine-grained bone erosion (BE) annotations, and consistent integration with clinical scoring systems such as the SvdH score. The key contribution is the RAM-H1200 dataset of 1,200 hand radiographs from six medical centers with four levels of structured annotation: (i) whole-hand bone structure instance segmentation, (ii) pixel-level BE masks, (iii) SvdH-defined joint regions of interest, and (iv) joint-level SvdH scores for both BE and joint space narrowing (JSN). By providing pixel-level BE masks, the dataset for the first time enables quantitative BE analysis beyond coarse categorical grading, and offers a unified benchmark for jointly modeling anatomical structure, local lesion quantification, and clinically standardized scoring, substantially advancing intelligent RA image analysis.

Link: https://arxiv.org/abs/2605.05616
Authors: Songxiao Yang,Haolin Wang,Yao Fu,Junmu Peng,Lin Fan,Hongruixuan Chen,Jian Song,Masayuki Ikebe,Shinya Takamaeda-Yamazaki,Masatoshi Okutomi,Tamotsu Kamishima,Yafei Ou
Affiliations: Institute of Science Tokyo; Hokkaido University; Southwest Jiaotong University; RIKEN; The University of Tokyo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 50 pages, 24 figures, 25 tables

Click to view abstract

Abstract:Rheumatoid arthritis (RA) assessment from hand radiographs requires multi-level analysis and modeling of anatomical structures and fine-grained local pathological changes. However, existing public resources do not support such unified multi-level analysis, often lacking full-hand coverage, fine-grained annotations, and consistent integration with clinical scoring systems. In particular, annotations that enable quantitative analysis of bone erosion (BE) remain scarce. RAM-H1200 contains 1,200 hand radiographs collected from six medical centers, with multi-level annotations including (i) whole-hand bone structure instance segmentation, (ii) pixel-level BE masks, (iii) SvdH-defined joint regions of interest, and (iv) joint-level SvdH scores for both BE and joint space narrowing (JSN). It is designed to evaluate whether models can jointly capture anatomical structure, localized erosive pathology, and clinically standardized RA severity from hand radiographs. The proposed BE masks enable, for the first time, quantitative BE analysis beyond coarse categorical grading by providing explicit spatial supervision for lesion extent and morphology. To our knowledge, RAM-H1200 is the first public large-scale benchmark that jointly supports whole-hand bone structure instance segmentation, pixel-level BE delineation, and clinically grounded joint-level SvdH scoring for both BE and JSN. Results across benchmark tasks show that anatomical modeling is substantially more mature than quantitative BE analysis: whole-hand bone segmentation achieves strong performance, whereas BE segmentation remains a major open challenge. By unifying anatomical structure modeling, quantitative lesion analysis, and clinically grounded SvdH scoring, RAM-H1200 provides a single benchmark for comprehensive RA analysis on hand radiographs.

[CV-117] Uncertainty-Guided Edge Learning for Deep Image Regression in Remote Sensing CVPR2026

【Quick Read】: This paper addresses the tension between training efficiency and predictive-uncertainty estimation in edge learning, particularly for deep image regression on resource-constrained platforms such as remote-sensing satellites, where unlabelled data must be exploited efficiently to speed up model convergence. The core difficulty is that edge devices lack the compute for expensive uncertainty-estimation methods (e.g., multiple forward passes), while conventional assumptions (e.g., Gaussian predictive distributions) limit the ability to model complex predictive distributions. The key to the proposed uncertainty-guided edge learning (UGEL) algorithm is deep beta regression, which estimates predictive uncertainty accurately in a single forward pass and uses it to prioritise the most informative data for model updates, significantly accelerating training convergence. Compared with active or semi-supervised learning, UGEL achieves more flexible uncertainty modelling at low computational cost, making it well suited to on-device model refinement at the edge.

Link: https://arxiv.org/abs/2605.05590
Authors: Anh Vu Nguyen,Dino Sejdinovic,Tat-Jun Chin
Affiliations: Australian Institute for Machine Learning (AIML), Adelaide University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: AI4Space @ CVPR 2026

Click to view abstract

Abstract:Edge learning refers to training machine learning models deployed on edge platforms, typically using new data accumulated onboard. The computational limitations on edge devices affect not only model optimisation, but also calculation of the predictive uncertainty of the current model on the unlabelled data, which is vital for informing model updating. In this paper, we investigate edge learning in the context of performing deep image regression on a remote sensing satellite, where a deep network is executed by an onboard computer to regress a scalar y from an input image, e.g., y is the percentage of pixels indicating cloud coverage or land use. We propose an uncertainty-guided edge learning (UGEL) algorithm that can accurately prioritise the data to speed up training convergence of the on-board regression model. Underpinning UGEL is the calculation of predictive uncertainty based on deep beta regression, where a deep network is used to estimate the parameters of a beta distribution for which the target y for an input image has a high likelihood. Compared to established methods for uncertainty estimation that are either too costly on edge devices (e.g., require many forward passes per sample) or make strict assumptions on the predictive distribution (e.g., Gaussian), deep beta regression is computable in a single forward pass and allows more general predictive distributions. Results show that UGEL delivers faster-converging edge learning than active or semi-supervised learning. Code and models are publicly available at this https URL.
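The single-forward-pass uncertainty estimate described above follows from the closed-form moments of the Beta distribution: once the network outputs the two shape parameters, mean and variance come for free. A minimal NumPy sketch of this idea (the parameter values and the variance-based acquisition rule below are illustrative, not the paper's exact procedure):

```python
import numpy as np

def beta_uncertainty(alpha, beta):
    """Closed-form mean and variance of a Beta(alpha, beta) predictive
    distribution -- no sampling or repeated forward passes needed."""
    mean = alpha / (alpha + beta)
    var = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
    return mean, var

# Hypothetical per-image (alpha, beta) outputs of the regression head.
params = np.array([[2.0, 2.0],    # broad distribution -> uncertain
                   [50.0, 50.0],  # sharp distribution -> confident
                   [8.0, 2.0]])   # skewed toward high coverage
means, vars_ = beta_uncertainty(params[:, 0], params[:, 1])

# Prioritise unlabelled samples by predictive variance (most uncertain first).
priority = np.argsort(-vars_)
```

Because the target of the regression (e.g., a coverage percentage) lives in [0, 1], the Beta family fits it naturally while still allowing skewed or bimodal shapes that a Gaussian cannot express.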

[CV-118] Text-to-CAD Retrieval: a Strong Baseline

【Quick Read】: This paper tackles text-based retrieval of computer-aided design (CAD) models for industrial design: efficiently and accurately retrieving semantically relevant CAD models from large-scale databases given natural-language queries. Existing practice searches by filename or directory structure, which precludes semantic matching and limits the efficiency and scalability of design reuse. The key to the proposed unified multi-modal framework is jointly learning embeddings from the procedural sequences and geometric point clouds of CAD models, paired with a text encoder for query semantics. A novel cross-attention feature decoder encourages implicit alignment among text, sequence, and point-cloud features during training, and is removed at inference so that concatenated sequence-point features support efficient retrieval, establishing a strong baseline for the text-to-CAD retrieval task.

Link: https://arxiv.org/abs/2605.05572
Authors: Honghu Pan,Zibo Du,Daxiang Liu,Chengliang Liu,Xiaoling Luo
Affiliations: Hunan University; Shenzhen University; University of Macau
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Text-based retrieval of Computer-Aided Design (CAD) models is a critical yet underexplored task for the reuse of legacy industrial designs. Existing CAD repositories are typically searched using filenames or directories, which limits the efficiency, scalability, and accuracy of design retrieval. In this paper, we formally introduce text-to-CAD retrieval as a new cross-modal retrieval task, aiming to retrieve semantically relevant CAD models from large-scale databases given natural language queries. Leveraging paired text-CAD annotations from the Text2CAD dataset, we establish a practical benchmark for this task. To achieve text-based retrieval, we propose a unified framework that learns multi-modal CAD embeddings from both procedural sequences and geometric point clouds. Specifically, a sequence encoder captures the construction logic of CAD models, while a point encoder extracts explicit geometric features. A text encoder is used to learn semantic representations of textual queries. During training, we introduce a novel feature decoder that reconstructs masked sequence features via cross-attention with text and point features, encouraging implicit multi-modal alignment. At inference time, we remove this auxiliary decoder to enable efficient retrieval using concatenated sequence-point features. Our framework serves as a strong baseline for text-to-CAD retrieval and lays the foundation for downstream CAD generation paradigms, such as retrieval-augmented generation. The source code will be released.

[CV-119] An extremely coarse feedback signal is sufficient for learning human-aligned visual representations

【Quick Read】: This paper asks how the granularity of the supervisory signal affects the alignment between neural-network representations and the human visual system. The central finding is that even extremely coarse classification tasks (distinguishing as few as 8 broad categories) yield networks whose representations closely match the primate visual system, and that on human perceptual-similarity judgments these networks outperform all other models evaluated, including finely supervised, self-supervised, and large-scale vision models. The key methodological choice is a data-driven, PCA-based procedure that partitions the training images into varying numbers of categories (2 to 64), allowing systematic control of supervision granularity; brain alignment is then validated against macaque electrophysiology recordings and human fMRI responses. The results suggest that human-like visual representations do not require fine-grained supervision and can emerge from minimal feedback, pointing toward AI systems better aligned with human perception.

Link: https://arxiv.org/abs/2605.05556
Authors: Yash Mehta,Michael F. Bonner
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 Pages, 6 Figures

Click to view abstract

Abstract:Artificial neural networks trained on visual tasks develop internal representations resembling those of the primate visual system, a discovery that has guided a decade of computational neuroscience. Research on building brain-aligned models has progressively embraced finer-grained supervisory signals, from object classification to contrastive self-supervised objectives that maximize distinctions among individual images, yet the role of supervisory signal granularity on brain alignment remains largely unexamined. Here we systematically investigate how the coarseness of a learning signal shapes representational alignment with human vision. We parametrically vary the level of signal granularity using a data-driven approach that partitions a set of training images into varied numbers of categories (2, 4, 8, 16, …, 64) via PCA-based splits of pretrained embeddings. We train hundreds of neural networks across convolutional and transformer architectures on these coarse classification tasks and compare their representations to macaque electrophysiology recordings and human fMRI responses. We find that networks trained to distinguish as few as 8 broad categories learn representations that match or exceed the neural alignment of models distinguishing 1,000-classes. Even more strikingly, these coarsely trained networks align more closely with human perceptual similarity judgments than all other models evaluated, including networks trained with fine-grained supervision or self-supervision as well as leading large-scale vision models. These results demonstrate that human-like visual representations emerge from remarkably coarse feedback, reframing what learning signals vision may require and opening a path toward building AI systems that are more aligned with human perception.
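The "PCA-based splits of pretrained embeddings" can be pictured as recursive bisection along the top principal component, yielding 2, 4, 8, ... pseudo-categories. A small illustrative sketch (the paper's exact partitioning procedure may differ in detail):

```python
import numpy as np

def pca_split_labels(embeddings, depth):
    """Assign coarse pseudo-labels by recursively splitting a set of
    embeddings along its top principal component, producing up to
    2**depth categories. Illustrative sketch only."""
    labels = np.zeros(len(embeddings), dtype=int)

    def split(idx, d):
        if d == 0 or len(idx) < 2:
            return
        X = embeddings[idx] - embeddings[idx].mean(axis=0)
        # Top principal component of this subset via SVD.
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        side = X @ vt[0] >= 0
        # Append one bit of the split to each point's label.
        labels[idx[side]] = labels[idx[side]] * 2 + 1
        labels[idx[~side]] = labels[idx[~side]] * 2
        split(idx[side], d - 1)
        split(idx[~side], d - 1)

    split(np.arange(len(embeddings)), depth)
    return labels

# Toy data stretched along one axis: depth=1 gives 2 coarse groups.
emb = np.array([[2.0, 0.0], [1.5, 0.1], [-2.0, 0.0], [-1.5, -0.1]])
labels = pca_split_labels(emb, depth=1)
```

Training a classifier on such coarse pseudo-labels is what the paper means by an "extremely coarse feedback signal."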

[CV-120] A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series

【Quick Read】: This paper addresses the difficulty of tree species classification from Moderate Resolution Imaging Spectroradiometer (MODIS) time series, whose core challenges include subtle spectral differences among species, strong coupling of spatial-spectral-temporal information, and the difficulty of modeling large-scale topological context. The key to the proposed Graph-regulated Disentangled Sparse Mamba model (GDS-Mamba) lies in three designs: (1) a mini-batch graph-regulated method that explicitly mines topological correlation effects among input images, strengthening large-scale context modeling; (2) a disentangling Mamba architecture tailored to MODIS time series that separates spatial patterns, spectral signatures, and phenological behavior, improving the disentanglement of high-dimensional information and the precision of feature extraction; and (3) a sparse-token mechanism that adaptively learns an optimal subset of tokens, mitigating the correlation-decay problem of standard Mamba models and improving both efficiency and the capture of subtle features.

Link: https://arxiv.org/abs/2605.05549
Authors: Motasem Alkayid,Zhengsen Xu,Saeid Taleghanidoozdoozan,Yimin Zhu,Megan Greenwood,Quinn Ledingham,Zack Dewis,Mabel Heffring,Naser El-Sheimy,Lincoln Linlin Xu
Affiliations: University of Calgary; University of Jordan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Although tree species classification from Moderate Resolution Imaging Spectroradiometer (MODIS) time series data is critical for supporting various environmental applications, it is a challenging task due to several key difficulties: the subtle signature differences among tree species, strong spatial-spectral-temporal information coupling, and the difficulty of modeling large-scale topological context information. To better address these challenges, this paper presents a novel Graph-regulated Disentangled Sparse Mamba model (GDS-Mamba) for enhanced tree species classification, with the following contributions. (1) First, to improve large-scale context modeling, we design a mini-batch graph-regulated approach that explicitly explores topological correlation effects among input images. (2) Second, to disentangle the high-dimensional spatial-spectral-temporal information coupling for improved feature extraction, we propose a novel disentangling Mamba architecture tailored for capturing independent spatial patterns, spectral signatures, and temporal phenology behaviors in MODIS time series. (3) Third, to improve efficiency and subtle feature learning, we design novel sparse token approaches that adaptively learn the optimum subset of tokens to better address the correlation decay problem that bottlenecks standard Mamba models. Extensive experiments using large-scale annual MOD13Q1 data across two Canadian provinces (i.e., Alberta and Saskatchewan) achieved an overall accuracy of 93.94% in Alberta and 80.19% in cross-provincial evaluations, outperforming twelve state-of-the-art classification models.

[CV-121] Characterizing Brazilian Atlantic Forest Restoration Outcomes with Geospatial AlphaEarth Embeddings ICLR2026

【Quick Read】: This paper addresses two obstacles to large-scale forest-restoration monitoring: traditional methods are limited by the impracticality of field surveys at scale and by the saturation of remote-sensing indices such as NDVI, while reforestation proceeds slowly and produces only subtle spectral change. The key to the solution is to leverage satellite embeddings from the AlphaEarth Foundation model and introduce the notion of a "Reference Trajectory Embedding": early restoration success is quantified as the cosine similarity between a restoration site and reference sites of mature secondary forest. In the high-dimensional embedding space, distinct clusters emerge for different land use/land cover (LULC) types along with clear change vectors, enabling a more sensitive and scalable assessment of restoration progress.

Link: https://arxiv.org/abs/2605.05547
Authors: Alice Heiman
Affiliations: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Workshop paper at ICLR 2026 Machine Learning for Remote Sensing (ML4RS)

Click to view abstract

Abstract:The Atlantic Forest in Brazil is a critical biodiversity hotspot, yet less than 12-15% of its original cover remains. Although monitoring forest restoration on a large scale is essential, traditional methods are limited by the impracticality of on-the-ground reporting on such a scale and by the saturation of remote-sensing indices such as NDVI. Furthermore, reforestation is a gradual process as opposed to the rapid spectral changes caused by deforestation. In this study, we examine 1,729 restoration sites in São Paulo, using satellite embeddings from the AlphaEarth Foundation’s model to evaluate their effectiveness in characterising early restoration success. We introduce the concept of a ‘Reference Trajectory Embedding’, defining a metric of restoration success based on cosine similarity to reference sites of mature secondary forest. We observe distinct clusters in embedding space according to different land use and land cover (LULC) types, and we can identify sites with clear change vectors. However, the signal can be noisy, and embeddings may require further fine-tuning to capture and predict site metadata beyond LULC.
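The restoration metric above reduces to a cosine similarity against a reference embedding. A minimal sketch, assuming the reference trajectory is the mean embedding of mature secondary-forest sites (the function and aggregation choice are illustrative):

```python
import numpy as np

def restoration_score(site_embedding, reference_embeddings):
    """Cosine similarity between a restoration site's embedding and the
    mean embedding of mature secondary-forest reference sites (the
    'Reference Trajectory Embedding'). Illustrative sketch."""
    ref = reference_embeddings.mean(axis=0)
    num = float(site_embedding @ ref)
    denom = np.linalg.norm(site_embedding) * np.linalg.norm(ref)
    return num / denom

# Toy example: two reference sites and one restoration site.
refs = np.array([[1.0, 0.0], [0.8, 0.6]])
score = restoration_score(np.array([0.6, 0.8]), refs)
```

A site whose embedding drifts toward the reference cluster over successive years would show an increasing score, which is the change-vector signal the paper looks for.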

[CV-122] The First Controllable Bokeh Rendering Challenge at NTIRE 2026 CVPR2026

【Quick Read】: This paper targets controllable bokeh rendering: precisely controlling background-blur (bokeh) effects during image synthesis or editing to produce visually appealing, perceptually convincing results. The key lies in a challenge-based evaluation of competing methods on unseen images, assessing both quantitative fidelity and subjective perceptual quality of bokeh in portraits and complex scenes; most participating teams built on and refined the Bokehlicious baseline, concentrating on improving the controllability and naturalness of the rendered blur and thereby advancing the state of the art in this area.

Link: https://arxiv.org/abs/2605.05510
Authors: Tim Seizinger,Florin-Alexandru Vasluianu,Jeffrey Chen,Zhuyun Zhou,Zongwei Wu,Radu Timofte,Dafeng Zhang,Yipeng Lin,Qi Yan,Junhao Chen,Yang Yang,Divyavardhan Singh,Hariom Thacker,Hammad Mohammad,Aanchal Maurya,Kishor Upla,Kiran Raja,Wei Zhou,Hongyu Huang,Yujin Cho,Grigory Malivenko,Jiachen Tu,Yaokun Shi,Guoyi Xu,Yaoxin Jiang,Jiajia Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Challenge report paper from NTIRE Workshop at CVPR 2026

Click to view abstract

Abstract:This study presents the outcomes of the first Controllable Bokeh Rendering Challenge at NTIRE and highlights the most effective submitted methodologies. In total, 44 participants registered for the competition, of which 8 teams submitted valid solutions after the conclusion of the final test phase. All submissions were evaluated on unseen images, focusing on portraits and intricate subjects with complex and visually appealing bokeh phenomena. In addition to the first track focusing on established quantitative fidelity metrics, we conducted a qualitative user study with a panel of experts for a second track focusing on perceptual assessment. As this was the inaugural challenge on this topic, most of the participants focused on refining and extending the Bokehlicious baseline method.

[CV-123] EchoXFlow: A Beamspace Echocardiography Dataset for Cardiac Motion Flow and Function

【Quick Read】: This paper addresses the limitations of existing clinical echocardiography datasets for modeling cross-modal relationships, in particular the lack of support for studying physically consistent links among cardiac anatomy, myocardial motion, and hemodynamics. Conventional public datasets typically omit Doppler information or fuse it into RGB overlays, and release acquisitions only after lossy vendor display processing, making the original acquisition geometry and timing unrecoverable. The key to the proposed EchoXFlow dataset is preserving multi-modal streams in their native acquisition geometry: temporally resolved 1D, 2D, and 3D ultrasound together with multiple Doppler modalities, synchronized with ECG, all stored as separable modality streams. This supports physically grounded cross-modal learning tasks, such as 4D visual perception and multi-modal deep learning, that cannot be formulated from conventional scan-converted Cartesian videos alone.

Link: https://arxiv.org/abs/2605.05447
Authors: Elias Stenhede,Joanna Sulkowska,Eivind Bjørkan Orstad,Henrik Schirmer,Arian Ranjbar
Affiliations: Akershus University Hospital; University of Oslo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce EchoXFlow, a clinical echocardiography dataset for learning from ultrasound in its native acquisition geometry rather than from scan-converted Cartesian videos. Existing public datasets offer limited opportunities to study cross-modal relationships between cardiac anatomy, myocardial motion, and blood flow, as Doppler is typically absent or fused as RGB overlays, and acquisitions are released after lossy vendor display processing. EchoXFlow comprises 37125 recordings from 666 routine-care examinations, preserving the timing, geometry, and modality relationships needed for physically grounded echo learning. Each recording is retained as separable modality-specific streams: temporally resolved 1D, 2D, and 3D data alongside multiple Doppler modalities, paired with a synchronized ECG. Clinical annotations span guideline-based measurements to dense 2D myocardial contours and 3D left-ventricular endocardial meshes. With its associated open-source tooling, EchoXFlow enables cross-modal, acquisition-aware learning tasks that cannot be formulated from conventional scan-converted videos alone, and serves as a testbed for 4D vision and physically grounded multi-modal learning more broadly.

[CV-124] Safety-Critical Camera Reliability Monitoring for ADAS via Degradation-Aware Uncertainty Pattern Analysis

【Quick Read】: This paper addresses camera-input reliability in advanced driver-assistance systems (ADAS): existing monitoring approaches typically detect sensor failures only after downstream perception has already degraded, creating safety risks. The key is a proactive camera-reliability monitoring framework that predicts perception risk from degradation-induced uncertainty patterns. Its core innovation is a Global Sensor Health Index (GSHI) that uses risk-aware multiplicative aggregation to fuse the severities of multiple degradation types into a continuous reliability score, letting severe single-mode failures (e.g., lens occlusion or motion blur) dominate. A lightweight multi-task network predicts degradation type, severity, GSHI, and spatial uncertainty maps from a single RGB image without downstream-task feedback, and transfers zero-shot after training on synthetic data; experiments show GSHI warns of YOLOv8 detection failure roughly 0.47 severity units in advance, clearly outperforming conventional image-quality-assessment (IQA), detection-confidence, and clean-feature out-of-distribution baselines.

Link: https://arxiv.org/abs/2605.05439
Authors: Shiva Aher
Affiliations: Georgia Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Reliable camera input is essential for safety-critical ADAS perception, but most monitoring approaches detect sensor failures only after downstream performance has degraded. We propose a proactive camera reliability monitoring framework that estimates perception risk from degradation-induced uncertainty patterns before downstream failure becomes observable. The method introduces a Global Sensor Health Index (GSHI), a continuous reliability score that aggregates per-degradation severities using a risk-aware multiplicative formulation, allowing severe single-mode failures such as lens occlusion or motion blur to dominate the health estimate. A lightweight multi-task network predicts degradation type, severity, GSHI, and spatial uncertainty maps from a single RGB image without downstream task feedback. Training uses physics- and geometry-aware synthetic supervision over twelve camera degradation modes. Experiments on KITTI-derived degradations show that GSHI decreases monotonically with severity, achieves a health-estimation MAE of 0.064, and provides positive early-warning lead time of 0.47 \pm 0.25 severity units before YOLOv8 detection failure. GSHI also outperforms IQA, detector-confidence, and clean-feature OOD baselines, and transfers zero-shot to real adverse-weather driving data. These results support degradation-aware uncertainty analysis as a practical direction for proactive camera reliability monitoring in intelligent vehicles.
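The "risk-aware multiplicative formulation" can be sketched concretely: a product over per-degradation health terms lets any single severe mode collapse the overall score, unlike an average. The product form below is an illustration of the idea, not the paper's exact weighting:

```python
import numpy as np

def gshi(severities):
    """Global Sensor Health Index sketch: multiply per-degradation health
    terms (1 - severity) so one severe failure mode dominates the score.
    Severities are assumed normalised to [0, 1]."""
    s = np.clip(np.asarray(severities, dtype=float), 0.0, 1.0)
    return float(np.prod(1.0 - s))

mild = gshi([0.1, 0.1, 0.1])       # several mild degradations
occluded = gshi([0.0, 0.0, 0.95])  # one severe mode, e.g. lens occlusion
```

An additive mean would rate the occluded camera as healthier than the mildly degraded one (mean severity 0.317 vs 0.1); the multiplicative score correctly ranks it far lower.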

[CV-125] Zero-Shot Satellite Image Retrieval through Joint Embeddings: Application to Crisis Response

【Quick Read】: This paper addresses semantic search over Earth observation (EO) archives: retrieving remote-sensing imagery from natural-language queries efficiently and accurately without large-scale paired image-text data or heavy compute. Contrastive approaches such as CLIP demand extensive paired data and compute, making global-scale deployment difficult, while vision foundation models such as CLAY produce strong remote-sensing embeddings but lack natural-language grounding. The key to the proposed GeoQuery system is a 100k-image Sentinel-2 proxy subset with generated textual descriptions, where the description-generation prompt is optimised so that distances in text-embedding space correlate with distances in the frozen CLAY visual-embedding space. No joint encoder training is needed: retrieval proceeds in two stages, a text-similarity search over the proxy descriptions followed by a visual nearest-neighbour search over worldwide CLAY embeddings, yielding zero-shot semantic retrieval.

Link: https://arxiv.org/abs/2605.05405
Authors: James Walsh,William Fawcett,Grace Colvard,Raúl Ramos-Pollán
Affiliations: University of Cambridge; Universidad de Antioquia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Semantic search of Earth observation archives remains challenging. Visual foundation models such as CLAY produce rich embeddings of satellite imagery but lack the natural-language grounding needed for intuitive query, and full contrastive training of a remote-sensing CLIP-style model requires paired data and compute that are unavailable at global scale. We present GeoQuery, a zero-shot retrieval system that sidesteps this constraint through prompt-aligned text proxies. Rather than training a joint encoder, we generate language descriptions for a 100k proxy subset of global Sentinel-2 tiles and optimise the description-generation prompt so that distances in the resulting text-embedding space correlate with distances in the frozen CLAY visual-embedding space. Queries are resolved in two stages, with a text-similarity search over the proxy subset followed by a visual nearest-neighbour search over worldwide CLAY embeddings. On 76 disaster-location queries covering UK floods, US wildfires, and US droughts, GeoQuery achieves 31.6% accuracy within 50 km, with the strongest performance on floods (50% within 50 km) where terrain features are well captured by RGB embeddings. Deployed within ECHO, a crisis response system using Agentic Action Graphs, GeoQuery identified vulnerable areas during Brisbane’s 2025 Cyclone Alfred, with downstream flood simulations reproducing historical patterns. Prompt-aligned proxies offer a practical bridge between EO foundation models and operational retrieval when full contrastive training is out of reach.
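The two-stage lookup can be sketched in a few lines: stage one matches the query against the captioned proxy subset in text space, stage two uses the matched proxies' visual embeddings to seed a nearest-neighbour search over the global index. All names below are illustrative, and embeddings are assumed L2-normalised:

```python
import numpy as np

def two_stage_retrieval(query_text_emb, proxy_text_embs, proxy_visual_embs,
                        global_visual_embs, k_proxy=3, k_final=5):
    """Sketch of GeoQuery-style retrieval: (1) text-similarity search over
    the proxy captions, (2) visual nearest-neighbour search over the global
    embedding index, seeded by the matched proxies' visual embeddings."""
    # Stage 1: nearest proxy captions to the query in text space.
    text_sims = proxy_text_embs @ query_text_emb
    top_proxies = np.argsort(-text_sims)[:k_proxy]
    # Stage 2: average the matched proxies' *visual* embeddings as a seed.
    seed = proxy_visual_embs[top_proxies].mean(axis=0)
    visual_sims = global_visual_embs @ seed
    return np.argsort(-visual_sims)[:k_final]

# Toy index: proxy 1's caption matches the query, and its visual
# embedding is closest to global tile 2.
proxy_text = np.array([[1.0, 0.0], [0.0, 1.0]])
proxy_vis = np.array([[1.0, 0.0], [0.0, 1.0]])
global_vis = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
hits = two_stage_retrieval(np.array([0.0, 1.0]), proxy_text, proxy_vis,
                           global_vis, k_proxy=1, k_final=2)
```

The point of the design is that only the small proxy subset needs captions; the worldwide index stays purely visual.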

[CV-126] Intelligent CCTV for Urban Design: AI-Based Analysis of Soft Infrastructure at Intersections

【Quick Read】: This paper addresses how to evaluate, efficiently and at low cost, the impact of soft traffic interventions (such as temporary pedestrian refuges and curb extensions) on vehicle speed and safety. Traditional approaches rely on manual observation or complex sensor deployments, which are slow, costly, and data-poor. The key to the solution is an AI-based analytics framework that reuses existing CCTV infrastructure, combining deep learning with perspective-based speed estimation to automatically identify and quantify driver behavior. The method supports repeated monitoring (e.g., comparing Week 1 and Week 2 post-installation), markedly improving the timeliness and evidence base of transport-policy evaluation and confirming the speed-reducing effectiveness of soft infrastructure.

Link: https://arxiv.org/abs/2605.05402
Authors: Vinit Katariya,Seungjin Kim,Curtis Craig,Nichole Morris,Hamed Tabkhi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 16 pages, 6 figures, 7 tables, Submitted/Under Review at the International Journal of Transportation Research (Submitted on 12 Jan 2026)

Click to view abstract

Abstract:Artificial intelligence (AI) and computer vision are transforming transportation data collection. This study introduces an AI-enabled analytics framework leveraging existing CCTV infrastructure to evaluate the impact of soft interventions, such as temporary pedestrian refuges and curb extensions, on vehicle speed and safety. Using deep learning and perspective-based speed estimation, we evaluated driver behavior before and after interventions, with repeated post-installation monitoring in Week 1 and Week 2, in Minneapolis. Findings reveal that at unsignalized intersections, mean and 85th-percentile speeds fell by up to 18.75% and 16.56%, respectively, while pass-through traffic decreased by as much as 12.2%. Signalized intersections showed comparable reductions except one location, with mean and 85th-percentile speeds dropping by up to 20.0% and 17.19%. These results demonstrate the traffic-calming effectiveness of soft infrastructure and underscore the utility of AI-powered methods for rapid, low-cost, and evidence-based transport policy evaluation.

[CV-127] LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World CVPR2026

【Quick Read】: This paper addresses tracking 3D human motion from egocentric multi-camera headsets, a problem challenged by severe observer egomotion, partial visibility or occlusion, and scarce training data. Existing monocular-video methods typically assume static or slowly moving cameras and cannot efficiently exploit multi-view, calibrated, well-localised input, making them brittle in dynamic egocentric captures. The key to the proposed LAMP (Localization Aware Multi-camera People Tracking) framework is robust tracking via early disentanglement of observer and target motion: first, the known 6-DoF device motion and calibration are used to lift 2D body keypoints detected across views over a temporal window into a unified 3D world reference frame; second, an end-to-end-trained spatio-temporal transformer fits 3D human motion directly to this 3D ray cloud. This "lift-then-fit" strategy lets the model learn a natural world-space human-motion prior and flexibly fuse information from multiple temporally asynchronous, partially observing, moving cameras, achieving state-of-the-art results on monocular benchmarks and clearly outperforming baselines designed for the egocentric setting.

Link: https://arxiv.org/abs/2605.05390
Authors: Nan Yang,Julian Straub,Fan Zhang,Richard Newcombe,Jakob Engel,Lingni Ma
Affiliations: Meta Reality Labs Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026. Project page: this https URL

Click to view abstract

Abstract:Tracking 3D human motion from egocentric multi-camera headset is challenged by severe egomotion, partial visibility or occlusions and lack of training data. Existing methods designed for monocular video often require static or slowly-moving cameras and cannot efficiently leverage multi-view, calibrated and localized input. This makes them brittle and prone to fail on dynamic egocentric captures. We propose LAMP (Localization Aware Multi-camera People Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. LAMP introduces a two-step process. First, we leverage the known device 6 DoF motion and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained spatio-temporal transformer fits 3D human motion directly to this 3D ray cloud. This “lift-then-fit” approach allows LAMP to learn and leverage a natural human motion prior in the world-space, as well as providing an elegant framework to flexibly incorporate information from multiple temporally asynchronous, partially observing and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines for our targeted egocentric setting.
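The "lift" half of lift-then-fit amounts to back-projecting each detected 2D keypoint into a world-frame ray using the headset's known intrinsics and localisation. A standard pinhole-model sketch (illustrative names; LAMP's camera model may include distortion terms not shown here):

```python
import numpy as np

def keypoint_to_world_ray(kp_px, K, R_wc, t_wc):
    """Lift a 2D keypoint (pixels) to a 3D ray in the world frame, given
    intrinsics K and the camera-to-world pose (R_wc, t_wc) from device
    localisation. Returns (ray origin, unit direction)."""
    u, v = kp_px
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray in camera frame
    d_world = R_wc @ d_cam                            # rotate into world frame
    d_world /= np.linalg.norm(d_world)
    return t_wc, d_world

# Principal-point keypoint of an identity-pose camera at (1, 2, 3):
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
origin, direction = keypoint_to_world_ray((50.0, 50.0), K,
                                          np.eye(3), np.array([1.0, 2.0, 3.0]))
```

Collecting such rays from all cameras over a temporal window yields the world-frame "3D ray cloud" that the transformer then fits a body motion to.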

[CV-128] Two Steps Are All You Need: Efficient 3D Point Cloud Anomaly Detection with Consistency Models CVPR2026

【Quick Read】: This paper addresses the difficulty of deploying 3D point-cloud anomaly detection in resource-constrained, low-latency settings: existing diffusion-based methods are computationally expensive and slow at inference because of iterative denoising. The key is to reformulate reconstruction-based anomaly detection through consistency learning, so that anomaly-free geometry is predicted directly in one or two network evaluations, together with a novel hybrid loss that explicitly enforces reconstruction toward clean data. This design cuts inference cost dramatically, achieving up to 80x faster runtime than the current state of the art without GPU acceleration, while preserving strong detection performance (e.g., 76.20% I-AUROC on Anomaly-ShapeNet and 72.80% on Real3DAD), enabling efficient, low-latency anomaly detection on edge devices.

Link: https://arxiv.org/abs/2605.05372
Authors: Pranav A,Shashank B,Pranav Siddappa,Dominik Seuss,Minal Moharir,Subramanya KN
Affiliations: R.V. College of Engineering; Technical University of Applied Sciences Würzburg-Schweinfurt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2026, at the 9th Workshop on Efficient Deep Learning for Computer Vision (ECV). To be published in the IEEE/CVF CVPR 2026 Workshop Proceedings

Click to view abstract

Abstract:Diffusion models are rapidly redefining 3D anomaly detection in point cloud data. As 3D sensing becomes integral to modern manufacturing, reliable anomaly detection is essential for high-throughput quality assurance and process control. Yet practical deployment on resource-constrained, latency-critical systems remains limited. Existing methods are often computationally prohibitive or unreliable in complex, unmasked regions, and diffusion pipelines are inherently bottlenecked by iterative denoising. In this work, we address this bottleneck by reformulating reconstructionbased anomaly detection through consistency learning, enabling direct prediction of anomaly-free geometry in one or two network evaluations. We further introduce a novel hybrid loss formulation that explicitly enforces reconstruction toward clean data. This design substantially reduces inference cost, achieving up to 80x faster runtime than the current state-of-the-art method, without GPU acceleration, while preserving strong detection performance. It outperforms R3D-AD on Anomaly-ShapeNet with 76.20% I-AUROC and remains competitive on Real3DAD with 72.80% I-AUROC, enabling efficient, low-latency anomaly detection on resource-constrained platforms, including drones, smart industrial cameras, and other edge devices.
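Once an anomaly-free reconstruction is available (here assumed to come from the one- or two-step consistency model), scoring reduces to measuring how far each input point lies from the reconstructed clean geometry. A brute-force sketch of that final step (a KD-tree would replace the pairwise distance matrix at scale):

```python
import numpy as np

def point_anomaly_scores(points, reconstruction):
    """Per-point anomaly score: distance from each input point to its
    nearest neighbour in the anomaly-free reconstruction. Brute-force
    sketch for small clouds."""
    # Pairwise distances: (N_points, N_recon).
    d = np.linalg.norm(points[:, None, :] - reconstruction[None, :, :], axis=-1)
    return d.min(axis=1)

# A point on the clean surface scores ~0; an outlier scores high.
pts = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
rec = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
scores = point_anomaly_scores(pts, rec)
```

Thresholding or pooling these per-point scores then yields the object-level I-AUROC-style decision.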

[CV-129] Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video

【Quick Read】: This paper addresses the lack of high-quality 3D parametric annotations and dedicated avatar-reconstruction methods for Arabic Sign Language (ArSL), in support of accessibility technologies and cultural preservation for the Arab Deaf community. The key contributions are twofold: first, the Ishara-500 dataset receives the first high-quality 3D parametric annotations, providing precise SMPL-X body-model parameters for 500 culturally authentic Saudi Sign Language (SSL) signs; second, the dedicated Tamaththul3D reconstruction pipeline integrates SMPLer-X for robust body estimation, WiLoR for detailed hand refinement with automatic localisation and mirroring correction, and MediaPipe for 2D pose supervision. Through kinematic-chain-based wrist alignment and hybrid swing-twist decomposition within a joint optimisation, it substantially improves hand accuracy (up to 32% over prior methods) while maintaining good body pose, establishing the first high-fidelity ArSL avatar-reconstruction framework.

Link: https://arxiv.org/abs/2605.05367
Authors: Eyad Alghamdi,Sattam Altuuaim,Obay Ghulam,Abdulrahman Qutah,Yousef Basoodan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Arabic Sign Language (ArSL) and its dialects serve approximately 400 million Arabic speakers worldwide, yet the community lacks high-quality 3D parametric annotations and specialized reconstruction methods for avatar generation. We address this critical gap through two key contributions: First, we introduce the first high-quality 3D parametric annotations for the Ishara-500 Saudi Sign Language dataset, providing precise SMPL-X parameters for 500 culturally authentic SSL signs. Second, we present Tamaththul3D, a specialized reconstruction pipeline designed for ArSL’s unique articulation patterns. Our pipeline integrates SMPLer-X for robust body estimation, WiLoR for detailed hand refinement with automatic localization and mirroring, and MediaPipe for 2D pose supervision. Through kinematic-chain-based wrist alignment with hybrid swing-twist decomposition and 2D-supervised joint optimization, Tamaththul3D achieves state-of-the-art hand accuracy (up to 32% improvement over previous methods) while maintaining competitive body pose. Together, these 3D annotations and Tamaththul3D pipeline establish the first comprehensive framework for high-fidelity ArSL avatar reconstruction, enabling new accessibility technologies and cultural preservation efforts for the Arab Deaf community.

[CV-130] Balancing Stability and Plasticity in Sequentially Trained Early-Exiting Neural Networks ICIP2026

【Quick Read】: This paper addresses the degradation that occurs when early-exiting neural networks are trained sequentially: newly added exit layers interfere with previously trained early exits and hurt their performance. The key is to balance the model's stability and plasticity: on one hand, preserve performance by protecting weights important to existing exits (via Elastic Weight Consolidation); on the other, avoid forgetting by keeping the output distributions of earlier exits consistent (via Learning without Forgetting). Together these mechanisms improve the robustness and overall performance of multi-exit architectures under sequential training.

Link: https://arxiv.org/abs/2605.05358
Authors: Alaa Zniber,Ouassim Karrakchou,Mounir Ghogho
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at IEEE ICIP 2026

Click to view abstract

Abstract:Early-exiting neural networks enable adaptive inference by allowing inputs to exit at intermediate classifiers, reducing computation for easy samples while maintaining high accuracy. In practice, exits can be trained sequentially by incrementally adding them to a shared backbone; however, this sequential training can cause newly introduced exits to interfere with previously learned ones, degrading the performance of earlier classifiers. We address this problem by retaining the knowledge embedded in existing exits while allowing new ones to specialize. We propose two alternative approaches that operate at different levels of the model. The first constrains learning by protecting parameters that are important for previously trained exits, while the second preserves the output distributions of earlier exits as the network adapts. These alternatives directly reflect the stability-plasticity trade-off studied in continual learning. Accordingly, we leverage \textitElastic Weight Consolidation to constrain critical weights and \textitLearning without Forgetting to preserve output distributions. Experiments on standard benchmarks show that our approaches consistently improve early-exit performance, achieving higher accuracy over existing sequential training methods and significant performance speedups at low computational budgets.
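The first of the two approaches, Elastic Weight Consolidation, adds a quadratic penalty that weights each parameter's movement by its Fisher-information importance for the already-trained exits. A minimal NumPy sketch of the standard EWC term (toy arrays; in practice the Fisher diagonal is estimated from gradients on data seen by the earlier exits):

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Standard EWC regulariser: 0.5 * lam * sum_i F_i (theta_i - theta*_i)^2.
    Weights that matter for earlier exits (high Fisher value) are costly
    to move; unimportant weights stay free to adapt to the new exit."""
    diff = params - old_params
    return 0.5 * lam * float(np.sum(fisher * diff ** 2))

old = np.zeros(2)
fisher = np.array([1.0, 0.0])          # only the first weight is important
guarded = ewc_penalty(np.array([1.0, 5.0]), old, fisher)
free = ewc_penalty(np.array([0.0, 9.0]), old, fisher)
```

Moving the important weight by 1 costs 0.5, while moving the unimportant weight arbitrarily far costs nothing, which is exactly the stability-plasticity split the paper exploits.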

[CV-131] egenioussBench: A New Dataset for Geospatial Visual Localisation

【速读】:该论文旨在解决大规模视觉定位(visual localisation)中因传统基于结构光恢复(Structure-from-Motion, SfM)方法的局限性所导致的可扩展性与真实场景适应性不足的问题。其解决方案的关键在于构建一个基于地理空间参考数据的基准测试平台 egenioussBench,该平台融合城市尺度的机载三维网格(airborne 3D mesh)和 CityGML LoD2 模型,以支持真正可扩展的定位系统开发;同时,通过智能手机图像配以厘米级、地图无关的地面真值(ground truth),并利用渲染深度估计全共可见矩阵后选取最大独立集的方式生成非共可见测试子集,从而在保证评估公平性的同时暴露跨视角与跨域挑战,为不同方法(如基于网格或LoD2模型的方法)提供统一的量化指标与验证路径。

链接: https://arxiv.org/abs/2605.05351
作者: Phillipp Fanta-Jende,Francesco Vultaggio,Alexander Kern,Yasmin Loeper,Markus Gerke
机构: AIT Austrian Institute of Technology (奥地利技术研究院); Technische Universität Braunschweig (布伦瑞克工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present egenioussBench, a visual localisation benchmark built on geospatial reference data: a city-scale airborne 3D mesh and a CityGML LoD2 model. This pairing reflects deployable mapping assets and supports true scalability beyond traditional SfM-based approaches. The query data comprise smartphone images with centimetre-accurate, map-independent ground truth obtained via PPK and GCP/CP-aided adjustment. From 2,709 images, we derive a non-co-visible subset by estimating the full co-visibility matrix from rendered depth and selecting a maximum independent set; the released data include a test split of 42 non-co-visible images with withheld ground truth and a validation split of 412 sequential images with poses, e.g. for training of pose regressors and self-validation. The benchmark features a public leaderboard evaluated with binning metrics at multiple pose-error thresholds alongside global statistics (median, RMSE, outlier ratio), ensuring fair, like-for-like comparison across mesh- and LoD2-based methods. Together, these design choices expose realistic cross-view and cross-domain challenges while providing a rigorous, scalable path for advancing large-scale visual localisation. We make the evaluation code and data available at this https URL and this https URL
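
论文从渲染深度估计全共可见矩阵后选取最大独立集;下面用纯 Python 给出一个贪心近似草图(论文可能采用精确求解,此处仅示意从共可见矩阵得到互不共可见子集的流程):

```python
def non_covisible_subset(covis):
    """从布尔共可见矩阵中贪心选取互不共可见的图像子集
    (论文求解的是最大独立集, 这里用低度数优先的贪心近似)。"""
    n = len(covis)
    remaining = set(range(n))
    keep = []
    while remaining:
        # 优先保留与其余候选图像重叠最少的图像
        i = min(remaining, key=lambda k: sum(covis[k][j] for j in remaining if j != k))
        keep.append(i)
        # 丢弃与被选图像共可见的所有候选
        remaining -= {j for j in remaining if j == i or covis[i][j]}
    return sorted(keep)
```

返回的索引集内任意两张图像互不共可见,可直接用作测试划分。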

[CV-132] ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

【速读】:该论文旨在解决现有视觉 Transformer (Vision Transformer, ViT) 自编码器作为图像分词器时在非训练分辨率下性能下降以及依赖对抗损失导致训练不稳定的问题。其关键解决方案在于提出 ViTok-v2,通过引入 NaFlex 实现原生分辨率支持以提升跨分辨率和宽高比的泛化能力,并采用一种新的 DINOv3 感知损失替代 LPIPS 和 GAN 目标函数,从而实现任意规模下的稳定训练。这一改进使得 ViTok-v2 成为目前参数量最大(50亿)的图像自编码器,在 256p 分辨率上达到或超越当前最优重建性能,并在 512p 及以上分辨率显著优于所有基线方法,同时与流匹配生成器联合扩展实验表明,同步扩大自编码器与生成器规模可推动重建-生成权衡的帕累托前沿进步。

链接: https://arxiv.org/abs/2605.05331
作者: Philippe Hansen-Estruch,Jiahui Chen,Vivek Ramanujan,Orr Zohar,Yan Ping,Animesh Sinha,Markos Georgopoulos,Edgar Schoenfeld,Ji Hou,Felix Juefei-Xu,Sriram Vishwanath,Ali Thabet
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. However, existing ViT tokenizers cannot explore this landscape as performance degrades outside training resolutions, and reliance on adversarial losses prevents stable scaling. ViTok (Hansen-Estruch et al., 2025) found that the compression ratio r mediates a reconstruction-generation trade-off where lower r means better reconstructions but harder generations, so improving tokenizer reconstruction is key to more Pareto-optimal tokenizers. We introduce ViTok-v2, which addresses these limitations with native resolution support via NaFlex for generalization across resolutions and aspect ratios, and a novel DINOv3 perceptual loss that replaces both LPIPS and GAN objectives for stable training at any scale. ViTok-v2 is trained on about 2B images and scaled to 5B parameters, the largest image autoencoder to date. ViTok-v2 matches or exceeds state-of-the-art reconstruction at 256p and outperforms all baselines at 512p and above. In joint scaling experiments with flow matching generators, we show that scaling both the autoencoder and the generator advances the Pareto frontier of this trade-off.
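
摘要中用 DINOv3 感知损失替代 LPIPS 与 GAN 目标;其基本形式可以示意为冻结特征提取器空间中的重建距离(以下 `feat_extractor` 为任意冻结编码器的占位,非 DINOv3 实际接口):

```python
import numpy as np

def perceptual_loss(feat_extractor, x, x_hat):
    """感知损失草图: 在冻结编码器(论文中为 DINOv3)的特征空间里
    比较原图与重建图, 以替代 LPIPS 与 GAN 判别器信号。"""
    f, f_hat = feat_extractor(x), feat_extractor(x_hat)
    return float(np.mean((f - f_hat) ** 2))
```

由于目标是确定性回归而非对抗博弈,任意规模下训练都保持稳定,这正是摘要强调的动机。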

[CV-133] Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift CVPR2026

【速读】:该论文旨在解决3D目标检测中不确定性估计不可靠的问题,尤其是在分布偏移(distribution shift)场景下,现代检测器的置信度校准性能显著下降。现有后验校准方法虽能在分布内(in-distribution)测试中改善校准效果,但无法适应分布外(out-of-distribution)情形。其解决方案的关键在于提出一种密度感知校准方法(density-aware calibration),通过将后验校准器与DETR类3D检测器中潜在目标查询(latent object queries)的特征密度相结合,利用这些查询具备位置和类别感知特性的紧凑表示进行密度估计,从而在分布偏移场景下动态调整模型的分类与边界框回归不确定性,实现联合再校准。

链接: https://arxiv.org/abs/2605.05328
作者: Till Beemelmanns,Alexey Nekrasov,Stefan Vilceanu,Jonas Steinhaus,Timo Woopen,Bastian Leibe,Lutz Eckstein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted for publication at CVPR 2026

点击查看摘要

Abstract:Reliable uncertainty estimation for 3D object detection is critical for deploying safe autonomous systems, yet modern detectors remain poorly calibrated, especially under distribution shifts. Although post-hoc calibration methods address this issue and provide improved calibration for in-distribution tests, they fail to adapt in distribution-shifted scenarios. In this work, we address this issue and introduce a density-aware calibration method that couples post-hoc calibrators with the feature density of latent object queries from DETR-style 3D object detectors. These queries form a compact, location and class-aware feature, ideal for density estimation, allowing our approach to adjust model confidences in distribution-shift scenarios. By fitting a density estimator on these query features, our approach jointly recalibrates both classification and bounding box regression uncertainties. On both a multi-view camera and LiDAR-based detector, our approach consistently outperforms standard post-hoc methods in both in-distribution and distribution-shifted scenarios. Code is available at this https URL.
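
密度感知校准的核心思路——用查询特征在训练分布下的密度调节置信度温度——可以草绘如下(核密度估计与线性温度插值均为本文示意性假设,非论文确切公式):

```python
import numpy as np

def kde_density(x, train_feats, bandwidth=1.0):
    """查询特征在训练特征分布下的高斯核密度(论文密度估计器的简化替身)。"""
    d = np.linalg.norm(np.asarray(train_feats, float) - np.asarray(x, float), axis=1)
    return float(np.mean(np.exp(-0.5 * (d / bandwidth) ** 2)))

def density_aware_confidence(logit, density, density_ref, t_in=1.0, t_out=2.0):
    """密度越低(越偏离训练分布), 用越高的温度软化置信度。"""
    w = min(density / density_ref, 1.0)   # 1.0 表示分布内, 趋向 0 表示分布偏移
    temp = w * t_in + (1.0 - w) * t_out
    return float(1.0 / (1.0 + np.exp(-logit / temp)))
```

分布偏移样本的密度偏低,置信度被自动压低,正对应摘要中"在分布偏移场景下调整模型置信度"的机制。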

[CV-134] Seeing What Shouldn't Be There: Counterfactual GANs for Medical Image Attribution

【速读】:该论文旨在解决现有图像分类可视化方法在医学影像分析中无法全面捕捉影响分类决策的显著对象的问题。当前基于判别模型的解释方法仅聚焦于最小化判别特征集,忽略了图像中其他重要但非判别性的物体,从而限制了对临床诊断的辅助价值。解决方案的关键在于提出一种基于反事实解释(Counterfactual Explanation, CX)的类导向特征归因方法,该方法利用生成对抗网络(Generative Adversarial Networks, GANs)结合循环一致性损失函数,生成具有因果逻辑的反事实实例(Counterfactual Instances, CIs),以揭示“若某对象不存在,则分类结果将不发生”的推理过程。同时,论文还提出了一种新的CI生成与评估技术,确保生成的反事实实例具备临床可解释性和合理性,从而提升解释的可信度和实用性。

链接: https://arxiv.org/abs/2605.05283
作者: Shakeeb Murtaza
机构: COMSATS University Islamabad (COMSATS大学伊斯兰堡分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ascription of an image gives insights into the objects that influence the classification of the whole image or its pixels towards a specific category. These insights help radiologists to visualize deformities in medical imaging. Most of the existing visualization techniques are based on discriminative models and highlight regions of the input image participating in the decision-making of a classifier. However, these approaches do not take all noticeable objects into account as their objective is to classify the input by using a minimal set of discriminative features. To overcome the issue, a counterfactual explanation (CX) based class-oriented feature attribution method is proposed. A counterfactual explanation (CX) explicates a causal reasoning process of the form: “if X had not happened, then Y would not have happened”. The method is built on generative adversarial networks (GANs) with a cyclical-consistent loss function. We evaluate our method on three datasets: synthetic, tuberculosis and BraTS. All experiments confirm the efficacy of the proposed method. This study also highlighted the limitations of existing counterfactual explanation techniques in producing plausible counterfactual instances (CIs). Accompanying CXs with believable CIs thus provides self-explanatory analogy-based explanations. To this end, a CI generation method is proposed. Also, a novel technique is used to evaluate the quality of CI. The baseline results are produced on the BraTS dataset.
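
循环一致性损失约束"翻译到反事实域再翻译回来应还原输入";一个与具体网络无关的 NumPy 草图如下(`g_ab`、`g_ba` 代表两个方向的生成器,均为占位):

```python
import numpy as np

def cycle_consistency_loss(x, g_ab, g_ba):
    """循环一致性 (L1) 损失: x -> G_AB(x) -> G_BA(G_AB(x)) 应还原输入。"""
    return float(np.mean(np.abs(g_ba(g_ab(x)) - x)))
```

该项与 GAN 判别损失联合训练,保证生成的反事实实例只改动与类别因果相关的区域。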

[CV-135] Layout-Aware Representation Learning for Open-Set ID Fraud Discovery

【速读】:该论文旨在解决身份证件伪造检测中因攻击者持续演化伪造模板与制造流程而导致的历史标签失效问题,即传统静态二分类模型难以应对动态变化的伪造模式和规模化协同伪造活动(campaign-scale fraud)。其解决方案的关键在于提出一种面向开放集欺诈发现的布局感知表示学习方法:通过上下文感知的SimMIM微调与复合损失函数驱动的监督度量学习,对DINOv3进行适配,从而在仅使用美国身份证训练的基础上,实现对加拿大身份证布局的高精度分类(99.83%);同时,在20,448张加拿大身份证数据上识别出276例新型物理伪造案例(其中222例未被现有检测器捕获),并支持基于嵌入空间相似性的扩展分析,从单一已确认样本自动发现关联案件,突破了传统基于元数据图结构的关联限制。

链接: https://arxiv.org/abs/2605.05215
作者: Jinxing Li,Nicholas Ren,Cathy Chang,Hongkai Pan,Daniel George
机构: WithPersona (Persona公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Identity-document fraud detection is not a stationary binary classification problem. Adaptive attackers modify templates and fabrication pipelines, making historical fraud labels stale, and successful forgeries recur at scale as coherent campaigns. We therefore study layout-aware representation learning for open-set fraud discovery rather than only closed-set classification. We adapt DINOv3 to the document domain via context-aware SimMIM fine-tuning and supervised metric learning with composite loss that encourages inter-class separability and intra-class compactness. The model is trained with U.S. IDs only. With a lightweight MLP and softmax classifier, the embedding achieves 99.83% layout classification accuracy on Canadian layouts. Moreover, on a dataset of 20,448 Canadian IDs, embedding-space analysis surfaces 276 adaptive physical-fraud cases, including 222 not surfaced by incumbent detectors. The embedding supports similarity-based expansion from a single confirmed seed to additional related cases not linked by conventional metadata graphs. The layout-aware document embeddings provide a production-aligned basis for discovering novel and campaign-scale fraud under distribution shift.
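
摘要中"从单一已确认种子扩展关联案件"可以用嵌入空间的余弦相似度检索来示意(阈值 0.9 为本文假设值,非论文设置):

```python
import numpy as np

def expand_from_seed(seed, embeddings, threshold=0.9):
    """以一例已确认的欺诈样本为种子, 返回嵌入余弦相似度超过阈值的候选索引。"""
    seed = np.asarray(seed, float)
    seed = seed / np.linalg.norm(seed)
    emb = np.asarray(embeddings, float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ seed
    return [i for i, s in enumerate(sims) if s >= threshold]
```

由于度量学习使同一伪造模板的样本在嵌入空间内紧致聚集,这种检索无需元数据图即可串联同一批次的协同伪造案件。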

[CV-136] The frame-level leakage trap: rethinking evaluation protocols for intrinsic image decomposition with source-separable uncertainty as a case study

【速读】:该论文旨在解决MPI Sintel数据集上用于评估学习型内在图像分解(Intrinsic Image Decomposition)的协议不一致问题,特别是帧级划分(frame-level split)导致的测试指标虚高现象。研究首次量化了这种泄露效应:相比场景级划分(scene-level split),帧级划分使测试R_PSNR平均高出1.6–2.0 dB(p < 0.01,配对t检验),且随着训练扩展可超过10 dB,表明该效应与模型架构无关。解决方案的关键在于采用场景级划分作为统一基准,并在此基础上提出一种物理信息引导的分解方法 I = R ⊙ S + N,其核心创新是引入一个源分离的三通道异方差不确定性头(source-separable three-way heteroscedastic uncertainty head),实证验证了通道专业化——非朗伯反射不确定性通道与非朗伯残差误差的相关系数 r = 0.67,显著高于纹理通道;同时证明不确定性滤波在下游任务中有效:剔除75%最高不确定性像素可使保留像素的重建均方误差(MSE)降低77%,而随机过滤无改善。此方案以远低于深度集成(Deep Ensemble)的成本实现接近性能,且具备独特的源分离不确定性估计能力。

链接: https://arxiv.org/abs/2605.06359
作者: Jihwan Woo
机构: Amazon Web Services (AWS)
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Journal of Electronic Imaging. 25 pages, 10 figures. Addresses evaluation protocol issues in intrinsic image decomposition and proposes source-separable uncertainty estimation

点击查看摘要

Abstract:Evaluation protocols for learned intrinsic image decomposition on MPI Sintel have been inconsistent. Several prior works split the dataset by frames, which allows spatially similar frames of the same scene to appear in both train and test partitions. We quantify this leakage effect for the first time, across three architectures: a frame-level split inflates test R_PSNR by 1.6 to 2.0 dB (p < 0.01 for all three, paired t-test across 3 seeds) relative to a scene-level split, confirming an architecture-independent protocol effect. A three-point gradient (random/temporal/scene) shows the gap is continuous, and under extended training the frame-level inflation exceeds 10 dB. We advocate scene-level splits as the community standard and provide reference numbers for six representative models under this protocol. As a case study within the corrected protocol, we present a physics-informed decomposition I = R ⊙ S + N with a source-separable three-way heteroscedastic uncertainty head. We empirically verify channel specialization: the non-Lambertian uncertainty channel shows r = 0.67 cross-correlation with non-Lambertian residual error, more than 4 times the texture channel’s correlation. We further demonstrate downstream utility: filtering out the 75% highest-uncertainty pixels reduces reconstruction MSE by 77% on retained pixels, whereas random filtering produces no improvement. The specialization also holds on out-of-distribution real photographs. We report negative results for a more elaborate variant combining frequency decomposition, cross-task supervision, evidential learning, contrastive loss, and test-time adaptation. Our method reaches 15.98 ± 0.41 dB R_PSNR, within 0.8 dB of a 5-member Deep Ensemble at one-fifth the cost, with the unique capability of source-separated uncertainty.
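
场景级划分与帧级划分的差别可以用几行纯 Python 说明——同一场景的所有帧必须落在划分的同一侧:

```python
import random

def scene_level_split(frame_scenes, test_frac=0.25, seed=0):
    """按场景而非按帧划分: 整个场景进入同一侧,
    同场景的近重复帧因此不会跨越训练/测试边界造成泄露。"""
    scenes = sorted(set(frame_scenes))
    rng = random.Random(seed)
    rng.shuffle(scenes)
    n_test = max(1, int(len(scenes) * test_frac))
    test_scenes = set(scenes[:n_test])
    train = [i for i, s in enumerate(frame_scenes) if s not in test_scenes]
    test = [i for i, s in enumerate(frame_scenes) if s in test_scenes]
    return train, test
```

帧级划分相当于直接对帧索引洗牌,同一场景的相邻帧会同时出现在两侧,这正是摘要量化的泄露来源。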

[CV-137] Tumor-aware augmentation with task-guided attention analysis improves rectal cancer segmentation from magnetic resonance images

【速读】:该论文旨在解决预训练Transformer模型在跨影像模态(如CT到MRI)迁移时性能下降的问题,特别是针对两个常见假设失效的情形:一是下游数据可被适配至预训练模型固定的输入几何结构,二是预训练特征能有效跨模态迁移。研究发现,在CT到MRI迁移中存在两种相互作用的失败模式:因零填充导致的token使用效率低下以及特征适应效果差,进而引发准确率下降。解决方案的关键在于基于机制分析提出两项干预策略:一是肿瘤感知增强策略以提升肿瘤外观异质性的覆盖范围,二是各向异性裁剪策略以恢复token效率。实验表明,这些措施显著提升了模型在直肠MRI检测任务中的表现,验证了通过针对性数据处理优化可有效缓解跨模态迁移限制并增强鲁棒性。

链接: https://arxiv.org/abs/2605.05522
作者: Aneesh Rangnekar,Joao Miranda,Natally Horvat,Stephanie Chahwan,Samir Alrayess,Aditya Apte,Aditi Iyer,Eve LoCastro,Revathi Ravella,Marc J Gollub,Iva Petkovska,Jesse Joshua Smith,Paul Romesser,Julio Garcia-Aguilar,Harini Veeraraghavan,Joseph Deasy
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pretraining on large-scale datasets has been shown to improve transformer generalizability, even for out-of-domain (OOD) modalities and tasks. However, two common assumptions often fail under OOD transfer: that downstream datasets can be adapted to the fixed input geometry of pretrained models and that pretrained representations transfer effectively across imaging modalities. We show that these assumptions break down through two interacting failure modes in CT-to-MRI transfer: inefficient token usage caused by zero-padding to match pretrained input dimensions and ineffective feature adaptation. These failures led to accuracy degradation despite extensive fine-tuning. We investigated these failure modes using two CT-pretrained hierarchical shifted-window transformer backbones, SMIT and Swin UNETR, pretrained with different objectives and datasets. Mechanistic analysis introduced an attention dilution index (ADI), an entropy-based metric quantifying attention diverted toward uninformative padding tokens, and centered kernel alignment (CKA) to measure feature reuse in MRI tasks. ADI increased with zero-padding, while high feature reuse did not necessarily correspond to improved accuracy. To mitigate these issues, we introduced two interventions: a tumor-aware augmentation strategy to improve tumor appearance heterogeneity coverage and an anisotropic cropping strategy to restore token efficiency. Fine-tuning on identical rectal MRI datasets improved detection rates to 224/247 (90.7%) for SMIT and 219/247 (88.7%) for Swin UNETR, demonstrating improved robustness under CT-to-MRI transfer. This study is among the first to examine when pretrained transformers fail to transfer effectively across imaging modalities and how simple mitigation strategies, motivated by mechanistic analysis of datasets, can reduce transfer limitations while improving robustness and MRI detection.
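
摘要提出基于熵的注意力稀释指数 (ADI) 量化流向无信息 padding token 的注意力;下面给出一个示意实现(论文的精确定义可能不同,此处对 padding 质量与归一化熵取平均仅为假设):

```python
import numpy as np

def attention_dilution_index(attn, pad_mask):
    """熵式注意力稀释指标草图(论文 ADI 的精确定义可能不同):
    每个 query 落在 padding token 上的注意力质量, 与其注意力分布
    的归一化熵取平均。attn: (queries, keys), 每行和为 1。"""
    attn = np.asarray(attn, float)
    pad_frac = attn[:, pad_mask].sum(axis=1)            # padding 上的注意力质量
    ent = -np.sum(attn * np.log(attn + 1e-12), axis=1)  # 每行注意力熵
    ent = ent / np.log(attn.shape[1])                   # 归一化到 [0, 1]
    return float(np.mean(0.5 * (pad_frac + ent)))
```

尖锐且只落在有效 token 上的注意力得分趋近 0,均匀摊到含 padding 的全部位置时得分升高,对应零填充造成的 token 使用效率下降。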

人工智能

[AI-0] UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

【速读】:该论文旨在解决传统混合专家(Mixture-of-Experts, MoE)架构中专家容量分配僵化的问题:现有设计为每层独立拥有专属专家集,导致专家参数随深度线性增长,且假设各层均需独立的专家能力。然而,实验表明深层层的路由策略可被随机路由替代而仅造成轻微性能下降(1.0–1.6点),说明存在冗余。为此,作者提出UniPool架构,其核心创新在于将专家容量视为全局预算,通过引入一个由各层路由器共享的统一专家池(shared expert pool)取代逐层专家所有权;同时设计池级辅助损失以平衡整体专家利用率,并采用NormRouter实现稀疏且尺度稳定的路由。该方案不仅在多个LLaMA规模模型上显著降低验证损失(最多相对减少0.0386),还揭示了池大小可作为显式的深度缩放超参数——使用更少专家参数(41.6%–66.7%)即可匹配或超越原有分层MoE,证明专家参数无需与深度线性增长,可子线性增长并保持更高效率与效果。

链接: https://arxiv.org/abs/2605.06665
作者: Minbin Huang,Han Shi,Chuanyang Zheng,Yimeng Wu,Guoxuan Chen,Xintong Yu,Yichun Yin,Hong Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer’s learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool’s benefits compose with finer-grained expert decomposition.
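
池级负载均衡辅助损失可以按 switch 式负载损失的思路草绘——区别在于把所有层的路由统计聚合到同一个共享专家池上(具体形式为本文假设,非论文确切公式):

```python
import numpy as np

def pool_balance_loss(router_probs_per_layer, choices_per_layer, pool_size):
    """池级负载均衡辅助损失草图: pool_size * sum(load * prob),
    其中 load/prob 是跨所有层聚合到共享池上的使用率与平均路由概率,
    完全均衡时取得最小值 1.0。"""
    load = np.zeros(pool_size)   # 每个池内专家接收 token 的占比
    prob = np.zeros(pool_size)   # 每个池内专家的平均路由概率
    total = 0
    for probs, choices in zip(router_probs_per_layer, choices_per_layer):
        for e in choices:
            load[e] += 1.0
        prob += np.asarray(probs, float).sum(axis=0)
        total += len(choices)
    load /= total
    prob /= total
    return float(pool_size * np.sum(load * prob))
```

与逐层各自均衡不同,这里某一层偏爱的专家会被其他层的负载共同制约,实现摘要所说的"整个池范围"的利用率平衡。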

[AI-1] Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在监督微调(Supervised Fine-Tuning, SFT)阶段中,如何实现更好的学习-遗忘权衡问题,即在保持预训练知识不被过度遗忘的前提下提升新任务性能。其核心解决方案是提出“优化器-模型一致性”(optimizer-model consistency)现象:使用与预训练阶段相同的优化器进行全参数微调(full fine-tuning),相较于其他优化器(包括LoRA等参数高效方法)能更有效地减少知识遗忘并维持或提升新任务表现。关键机制在于优化器通过正则化激活分布来塑造模型的损失景观(loss landscape),而采用相同优化器可使SFT阶段的权重更新遵循特定结构,从而最小化对预训练知识的干扰。

链接: https://arxiv.org/abs/2605.06654
作者: Yuxing Liu,Jianyu Wang,Tong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model consistency. To better understand it, through controlled experiments and theoretical analysis, we show that: 1) optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints; 2) in response to this regularization effect, the weight update in SFT should follow some specific structures to lower forgetting of the knowledge learned in pretraining, which can be obtained by using the same optimizer. Moreover, we specifically compare Muon and AdamW when they are employed throughout the pretraining and SFT stages and find that Muon performs worse when finetuned for reasoning tasks. With a synthetic language modeling experiment, we demonstrate that this can come from Muon’s strong tendency towards rote memorization, which may hurt pattern acquisition with a small amount of data, as for SFT.

[AI-2] AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

【速读】:该论文旨在解决数学研究中开放式探索与迭代流程的效率瓶颈问题,即如何通过人工智能(AI)有效辅助数学家在从灵感生成、文献检索、计算探索到定理证明和理论构建等全流程中实现高效协作。其解决方案的关键在于设计了一个异步、状态感知的工作空间——AI合作者(AI co-mathematician),该系统能够管理不确定性、细化用户意图、追踪失败假设,并输出原生数学成果(native mathematical artifacts),从而模拟人类合作研究的动态过程。这一架构不仅显著提升了AI在复杂数学任务中的交互能力,还在FrontierMath Tier 4等硬核基准测试中达到当前最优性能(48%得分)。

链接: https://arxiv.org/abs/2605.06651
作者: Daniel Zheng,Ingrid von Glehn,Yori Zwols,Iuliya Beloshapka,Lars Buesing,Daniel M. Roy,Martin Wattenberg,Bogdan Georgiev,Tatiana Schmidt,Andrew Cowie,Fernanda Viegas,Dimitri Kanevsky,Vineet Kahlon,Hartmut Maennel,Sophia Alj,George Holland,Alex Davies,Pushmeet Kohli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages

点击查看摘要

Abstract:We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building. By providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts, the system mirrors human collaborative workflows. In early tests, the AI co-mathematician helped researchers solve open problems, identify new research directions, and uncover overlooked literature references. Besides demonstrating a highly interactive paradigm for AI-assisted mathematical discovery, the AI co-mathematician also achieves state of the art results on hard problem-solving benchmarks, including scoring 48% on FrontierMath Tier 4, a new high score among all AI systems evaluated.

[AI-3] Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

【速读】:该论文旨在解决现有概念解释方法在深度神经网络预测解释中缺乏因果关联性,以及仅能处理单一概念的表达能力受限的问题;同时,针对形式化归纳与对比解释方法仅依赖低级特征(如像素)而无法提供高层语义理解的局限性。其解决方案的关键在于提出“基于概念的归纳与对比解释”(concept-based abductive and contrastive explanations)的新范式,通过概念擦除(concept erasure)程序建立高阶概念与模型输出之间的因果关系,并设计一组算法枚举所有最小解释集,从而实现对单张图像及具有特定共同行为的图像集合的有效解释。

链接: https://arxiv.org/abs/2605.06640
作者: Ronaldo Canizales,Divya Gopinath,Corina Păsăreanu,Ravi Mangal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Concept-based explanations offer a promising approach for explaining the predictions of deep neural networks in terms of high-level, human-understandable concepts. However, existing methods either do not establish a causal connection between the concepts and model predictions or are limited in expressivity and only able to infer causal explanations involving single concepts. At the same time, the parallel line of work on formal abductive and contrastive explanations computes the minimal set of input features causally relevant for model outcomes but only considers low-level features such as pixels. Merging these two threads, in this work, we propose the notion of concept-based abductive and contrastive explanations that capture the minimal sets of high-level concepts causally relevant for model outcomes. We then present a family of algorithms that enumerate all minimal explanations while using concept erasure procedures to establish causal relationships. By appropriately aggregating such explanations, we are not only able to understand model predictions on individual images but also on collections of images where the model exhibits a user-specified, common behavior. We evaluate our approach on multiple models, datasets, and behaviors, and demonstrate its effectiveness in computing helpful, user-friendly explanations.
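
"枚举所有极小充分概念集"的核心流程可以用子集枚举草绘如下(`predicts_with` 封装了概念擦除后重新预测的过程,此处作为抽象回调):

```python
from itertools import combinations

def minimal_abductive_explanations(concepts, predicts_with):
    """枚举所有极小充分概念集: predicts_with(kept) 为 True 表示
    把 kept 之外的概念全部擦除后, 模型输出保持不变(因果充分)。
    按集合大小升序枚举并跳过已有充分集的超集, 保证极小性。"""
    minimal = []
    for size in range(len(concepts) + 1):
        for subset in combinations(concepts, size):
            s = set(subset)
            if any(m <= s for m in minimal):
                continue  # 已被更小的充分集覆盖, 非极小
            if predicts_with(s):
                minimal.append(s)
    return minimal
```

指数级枚举只在概念数较少时可行;论文的算法族应包含更高效的搜索策略,此处仅还原语义。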

[AI-4] The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity ICML2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中普遍存在的“注意力汇聚”(attention sink)现象,即初始token在自注意力机制中过度占据注意力分数的问题。其核心问题是:这种现象的结构成因尚不明确,缺乏可解释且可控的机制理解。解决方案的关键在于揭示了两个关键机制:首先,自注意力中的值聚合(value aggregation)过程天然导致各位置表示的方差差异;其次,前馈网络(Feed-Forward Network, FFN)层中“超神经元”(super neurons)的激活通过通道稀疏的下投影(channel-sparse down-projections)进一步放大这一方差差异,从而迫使模型在首token处形成结构锚点——即注意力汇聚点。作者通过两种受控干预验证了该因果链:一是修改注意力掩码隔离聚合效应,二是增强特定token表示的方差,二者均可在任意位置复制注意力汇聚现象。最终,基于此理解提出head-wise RMSNorm架构改进,在预训练阶段稳定值聚合输出,恢复各位置统计均等性,显著加速收敛。

链接: https://arxiv.org/abs/2605.06611
作者: Siquan Li,Kaiqi Jiang,Jiacheng Sun,Tianyang Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a *mechanistic* explanation for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose *head-wise RMSNorm*, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
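
head-wise RMSNorm 的思想——对每个头的值聚合输出按自身 RMS 归一化以恢复各位置的统计均等——可以示意为(插入位置与可学习增益等细节以论文为准,此处省略):

```python
import numpy as np

def headwise_rmsnorm(values, eps=1e-6):
    """对每个注意力头的值聚合输出按其自身 RMS 归一化,
    使各 token 位置的表示尺度恢复一致, 消除首 token 的方差差异。
    values: (heads, tokens, head_dim)。"""
    rms = np.sqrt(np.mean(values ** 2, axis=-1, keepdims=True) + eps)
    return values / rms
```

归一化后每个 (head, token) 向量的 RMS 接近 1,且对整体缩放近似不变,首 token 不再成为方差异常的结构锚点。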

[AI-5] Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches

【速读】:该论文旨在解决从二进制包(binary packages)中重构安全漏洞修复语义的问题,即在缺乏源代码或补丁说明的情况下,如何通过本地分析工具链识别出与安全相关的函数级变更。其核心挑战在于利用仅能访问的二进制差异信息(如ELF文件对)来准确定位漏洞修复点并生成可信的审计结论。解决方案的关键是提出Patch2Vuln这一可恢复的自动化流水线:首先提取旧/新ELF二进制对,使用Ghidra和Ghidriff进行二进制差分,按重要性排序变化函数;随后构建候选补丁档案,并调用离线语言模型代理执行初步审计、验证计划制定及最终审计输出。实验表明,该方法能在20个Ubuntu安全更新中成功定位10个已验证的安全相关函数,并正确归类11个根本原因类别,但受限于二进制差分覆盖率和局部行为验证能力,仍有部分案例未能命中目标函数或生成有效验证差异。

链接: https://arxiv.org/abs/2605.06601
作者: Isaac David,Arthur Gervais
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. This paper asks whether a language-model agent, restricted to local binary-derived evidence, can reconstruct the security meaning of Linux distribution updates. Patch2Vuln is a local, resumable pipeline that extracts old/new ELF pairs, diffs them with Ghidra and Ghidriff, ranks changed functions, builds candidate dossiers, and asks an offline agent to produce a preliminary audit, bounded validation plan, and final audit. We evaluate Patch2Vuln on 25 Ubuntu .deb package pairs: 20 security-update pairs and five negative controls, all manually adjudicated against private source-patch and binary-function ground truth. The agent localizes a verified security-relevant patch function in 10 of 20 security pairs and assigns an accepted final root-cause class in 11 of 20. Oracle diagnostics show that six security pairs fail before model reasoning because the binary differ or ranker omits the right function, with one additional context-export miss. A separate bounded validation pass produces two target-level minimized behavioral old/new differentials, both for tcpdump, but no crash, timeout, sanitizer finding, or memory-corruption proof; all five negative controls are classified as unknown and produce no validation differentials. These results support agentic vulnerability reconstruction from binary patches as a useful research target while showing that binary-diff coverage and local behavioral validation remain the limiting components.

[AI-6] Towards Metric-Faithful Neural Graph Matching

【速读】:该论文旨在解决神经图匹配架构中编码器几何结构(encoder geometry)对图编辑距离(Graph Edit Distance, GED)估计质量影响不明确的问题。现有方法虽通过图神经网络(Graph Neural Network, GNN)编码图结构并结合回归或对齐模块近似GED,但其编码器的几何性质如何影响估计精度与排序稳定性仍缺乏理论支撑。解决方案的关键在于构建一个理论框架,将编码器的双利普希茨(bi-Lipschitz)几何特性与两类主流神经GED估计器——图相似性预测器和基于对齐的方法——的性能联系起来:对于固定图集合,图级双利普希茨编码器可产生受控的GED代理指标并提升排序稳定性;对于对齐类方法,节点级双利普希茨性质可传递至编码诱导的对齐代价,从而优化对齐目标。作者进一步提出FSW-GNN这一具备双利普希茨特性的Weisfeiler-Lehman等价编码器作为即插即用替换,在多个基准数据集上显著提升了GED预测与排序指标,验证了编码器几何设计作为神经图匹配重要原则的有效性。

链接: https://arxiv.org/abs/2605.06588
作者: Jyotirmaya Shivottam,Subhankar Mishra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Edit Distance (GED) is a fundamental, albeit NP-hard, metric for structural graph similarity. Recent neural graph matching architectures approximate GED by first encoding graphs with a Graph Neural Network (GNN) and then applying either a graph-level regression head or a matching-based alignment module. Despite substantial architectural progress, the role of encoder geometry in neural GED estimation remains poorly understood. In this paper, we develop a theoretical framework that connects encoder geometry to GED estimation quality for two broad classes of neural GED estimators: graph similarity predictors and alignment-based methods. On fixed graph collections, where the doubly-stochastic metric d_DS is comparable to GED, we show that graph-level bi-Lipschitz encoders yield controlled GED surrogates and improved ranking stability; for matching-based estimators, node-level bi-Lipschitz geometry propagates to encoder-induced alignment costs and the resulting optimized alignment objective. We instantiate this perspective using FSW-GNN, a bi-Lipschitz WL-equivalent encoder, as a drop-in replacement in representative neural GED architectures. Across representative baselines and benchmark datasets, the resulting geometry-aware variants significantly improve GED prediction and ranking metrics. A faithfulness case study of untrained encoders, together with ablations and transfer experiments, supports the view that these gains arise from improved representation geometry, positioning encoder geometry as a useful design principle for neural graph matching.
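
编码器相对 GED 的"双利普希茨程度"可以用所有图对上距离比值的失真(最大比 / 最小比)来经验度量;以下为一个与模型无关的草图(GED 矩阵在此作为给定输入):

```python
import numpy as np

def embedding_distortion(ged, embeddings):
    """编码器相对 GED 的经验双利普希茨失真: 所有图对上
    (嵌入距离 / GED) 的最大值与最小值之比, 越接近 1 越度量忠实。"""
    emb = np.asarray(embeddings, float)
    ratios = []
    for i in range(len(emb)):
        for j in range(i + 1, len(emb)):
            if ged[i][j] > 0:
                ratios.append(float(np.linalg.norm(emb[i] - emb[j])) / ged[i][j])
    return max(ratios) / min(ratios)
```

失真为 1 表示嵌入距离与 GED 仅差一个全局常数,这正是双利普希茨编码器(如 FSW-GNN)追求的性质。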

[AI-7] NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

【速读】:该论文旨在解决多模态神经影像分析中因不同模态(如结构磁共振成像 sMRI、功能磁共振成像 fMRI、弥散张量成像 dMRI 和正电子发射断层扫描 PET)预处理流程复杂、工具链异构、下游统计分析与疾病分类需定制代码及数据格式规范等问题所导致的自动化程度低、可重复性差和人工干预成本高的挑战。解决方案的关键在于提出 NeuroAgent,一个基于大语言模型(Large Language Model, LLM)驱动的智能体框架,采用分层多智能体架构与反馈驱动的“生成-执行-验证”引擎,实现从原始数据到下游任务(如阿尔茨海默病分类)的端到端自动化处理:智能体自主生成可执行预处理代码、自动检测并恢复运行时错误、验证输出完整性,并通过人机协同接口仅在边缘情况触发人工介入,从而显著降低人工参与度并提升分析效率与一致性。

链接: https://arxiv.org/abs/2605.06584
作者: Lujia Zhong,Yihao Xia,Jianwei Zhang,Shuo Huang,Jiaxin Yue,Mingyang Xia,Yonggang Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal neuroimaging analysis often involves complex, modality-specific preprocessing workflows that require careful configuration, quality control, and coordination across heterogeneous toolchains. Beyond preprocessing, downstream statistical analysis and disease classification commonly require task-specific code, evaluation protocols, and data-format conventions, creating additional barriers between raw acquisitions and reproducible scientific analysis. We present NeuroAgent, an LLM-driven agentic framework that automates key preprocessing and analysis steps for heterogeneous neuroimaging data, including sMRI, fMRI, dMRI, and PET, and supports interactive downstream analysis through natural-language queries. NeuroAgent employs a hierarchical multi-agent architecture with a feedback-driven Generate-Execute-Validate engine: agents autonomously generate executable preprocessing code, detect and recover from runtime errors, and validate output integrity. We evaluate the system on 1,470 subjects pooled across all ADNI phases (CN=1,000, AD=470), where all subjects have sMRI and tabular data, with subsets also having Tau-PET (n=469), fMRI (n=278), and DTI (n=620). Pipeline ablation studies across multiple LLM backends show that capable models reach up to 100% intent-parsing accuracy, with the strongest backend (Qwen3.5-27B) reaching 84.8% end-to-end preprocessing step correctness. Automated recovery limits manual intervention to edge cases where human review is required via the Human-In-The-Loop interface. For Alzheimer’s Disease classification using automatically preprocessed multimodal data, our agent ensemble achieves an AUC of 0.9518 with four modalities, outperforming all single-modality baselines. These results show that NeuroAgent can reduce the manual effort required for neuroimaging preprocessing and enable end-to-end automated analysis pipelines for neuroimaging research.
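
Generate-Execute-Validate 引擎的控制流可以草绘为一个带反馈的重试循环(回调名与返回字段均为本文示意,非 NeuroAgent 实际接口):

```python
def generate_execute_validate(generate, execute, validate, max_attempts=3):
    """反馈驱动的 Generate-Execute-Validate 循环草图: 以错误信息为反馈
    重新生成预处理代码, 直到输出通过校验, 或用尽尝试次数后转交人工。"""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)          # LLM 依据上一轮反馈重新生成代码
        try:
            output = execute(code)         # 在沙箱中执行生成的预处理代码
        except Exception as err:
            feedback = f"runtime error: {err}"
            continue
        if validate(output):               # 校验输出完整性
            return {"status": "ok", "attempts": attempt, "output": output}
        feedback = "validation failed"
    return {"status": "needs_human", "attempts": max_attempts}
```

只有重试耗尽才触发 Human-In-The-Loop,对应摘要中"人工介入仅限边缘情况"的设计。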

[AI-8] Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline

【Quick Read】: This paper addresses the training instability and computational inefficiency of human preference alignment for flow-based generative models. The core solution is a deterministic adjoint matching framework that casts preference alignment as an optimal control problem over velocity fields, obtaining a simple and stable training objective by directly regressing toward a value-gradient-induced target control under the current policy. The key innovation is a truncated adjoint scheme that restricts computation to the terminal portion of the trajectory, where reward-relevant signals concentrate, substantially reducing cost without degrading alignment quality; the framework is further extended beyond standard KL regularization, allowing more flexible trade-offs between alignment strength and distributional fidelity.

Link: https://arxiv.org/abs/2605.06583
Authors: Zhengyi Guo, Jiayuan Sheng, David D. Yao, Wenpin Tang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a deterministic adjoint matching framework that formulates human preference alignment for flow-based generative models as an optimal control problem over velocity fields. One can directly regress the control toward a value-gradient-induced target under the current policy, leading to a simple and stable training objective. Building on this perspective, we introduce a truncated adjoint scheme that focuses computation on the terminal portion of the trajectory, where reward-relevant signals concentrate, which yields substantial computational savings while preserving alignment quality. We further generalize the framework beyond standard KL-based regularization, allowing more flexible trade-offs between alignment strength and distributional preservation. Experiments on SiT-XL/2 and FLUX.2-Klein-4B demonstrate consistent gains across multiple alignment metrics, along with substantially improved diversity and mode preservation.

[AI-9] Directional Consistency as a Complementary Optimization Signal: The GONO Framework

【Quick Read】: This paper targets an overlooked phenomenon in deep learning optimization: directional alignment and loss convergence can be decoupled, i.e., an optimizer may exhibit near-perfect directional consistency of its gradients (cc_t ≈ 1) while the loss remains high or decreases slowly. Existing optimizers such as Adam, SGD, and RMSprop rely on magnitude-based signals to adjust updates and cannot distinguish local minima, saddle points, and flat regions, leading to inefficiency. The proposed GONO (Gradient-Oriented Norm-Adaptive Optimizer) adapts Adam's momentum coefficient β₁ based on the consecutive-gradient cosine similarity cc_t: amplifying momentum under directional consistency to accelerate convergence, and suppressing it during oscillation to stabilize optimization. Theoretically, GONO retains Adam's O(1/√T) convergence rate and reduces to Adam when the signal is uninformative; empirically, cc_t enables precise oscillation detection (F1 = 1.00 vs. 0.45 for the gradient norm), and GONO matches AdamW across several benchmark tasks, establishing directional consistency as an interpretable, actionable optimization signal.

Link: https://arxiv.org/abs/2605.06575
Authors: Victor Daniel Gera
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We identify and formalize an underexplored phenomenon in deep learning optimization: directional alignment and loss convergence can be decoupled. An optimizer can exhibit near-perfect directional consistency (cc_t ≈ 1, measured via consecutive gradient cosine similarity) while the loss remains high or decreases slowly. This observation reveals that existing optimizers such as Adam, SGD, and RMSprop lack explicit mechanisms to exploit temporal consistency in gradient directions, relying instead on magnitude-based signals that fail to distinguish plateaus, saddle points, and genuine convergence. Motivated by this, we introduce GONO (Gradient-Oriented Norm-Adaptive Optimizer), which adapts Adam’s momentum coefficient beta_1 based on cc_t: amplifying momentum under directional consistency and suppressing it during oscillation. We prove GONO matches Adam’s O(1/sqrt(T)) convergence rate and reduces exactly to Adam when the signal is uninformative. Empirically, cc_t achieves oscillation detection with F1=1.00 (vs. 0.45 for gradient norm), and GONO remains competitive with AdamW on MNIST (98.15%), CIFAR-10 (43.14%), and ResNet-18 (75.44%), establishing directional alignment as a theoretically grounded, practically actionable optimization signal. Code: this https URL
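As a rough sketch of the mechanism described above — an Adam-style update whose momentum coefficient is modulated by the consecutive-gradient cosine similarity — the following toy implementation illustrates the idea. The modulation rule, gain, and clipping bounds here are our own assumptions for illustration, not the paper's exact algorithm:

```python
import numpy as np

def cosine(g_prev, g):
    """Consecutive-gradient cosine similarity (the cc_t signal)."""
    denom = np.linalg.norm(g_prev) * np.linalg.norm(g)
    return float(g_prev @ g / denom) if denom > 0 else 0.0

def gono_step(w, g, state, lr=0.01, beta1_base=0.9, beta2=0.999,
              gain=0.09, eps=1e-8):
    """One Adam-style step where beta1 is modulated by cc_t: amplified
    under directional consistency, suppressed under oscillation."""
    cc = cosine(state["g_prev"], g) if state["g_prev"] is not None else 0.0
    beta1 = float(np.clip(beta1_base + gain * cc, 0.0, 0.999))
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * g
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * g * g
    state["t"] += 1
    m_hat = state["m"] / (1.0 - beta1_base ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    state["g_prev"] = g.copy()
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# sanity run on f(w) = 0.5 * ||w||^2, whose gradient at w is w itself
w = np.array([1.0, -2.0])
state = {"m": np.zeros(2), "v": np.zeros(2), "t": 0, "g_prev": None}
for _ in range(800):
    w = gono_step(w, w.copy(), state)
```

Near the minimum, consecutive gradients flip sign, cc_t turns negative, and the effective β₁ drops, which damps the oscillation that plain heavy momentum would produce.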

[AI-10] Ex Ante Evaluation of AI-Induced Idea Diversity Collapse

【Quick Read】: This paper addresses a blind spot in evaluating generative AI on creative tasks: existing methods assess only individual output quality, ignoring scarcity and crowding at the population level — an idea loses value when many users produce similar ones. The key solution is a human-relative framework that compares model-only generations against matched unaided human baselines, estimating population-level crowding risk without requiring human-AI interaction data. The framework introduces two core metrics: an excess-crowding coefficient Δ and a human-relative diversity ratio ρ, where ρ ≥ 1 is the no-excess-crowding parity condition. It further shows that crowding can be mitigated through generation-protocol design, turning diversity collapse into a quantifiable, optimizable target at development time.

Link: https://arxiv.org/abs/2605.06540
Authors: Nafis Saami Azad, Raiyan Abdul Baten
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:Creative AI systems are typically evaluated at the level of individual utility, yet creative outputs are consumed in populations: an idea loses value when many others produce similar ones. This creates an evaluation blind spot, as AI can improve individual outputs while increasing population-level crowding. We introduce a human-relative framework for benchmarking AI-induced human diversity collapse without requiring human-AI interaction data, providing an ex ante protocol to estimate crowding risk from model-only generations and matched unaided human baselines. By modeling ideas as congestible resources, we show that source-level crowding is identifiable from within-distribution comparisons, yielding an excess-crowding coefficient Δ and a human-relative diversity ratio ρ. We show that ρ ≥ 1 is the no-excess-crowding parity condition and connect Δ to an adoption game with exposure-dependent redundancy costs. Across short stories, marketing slogans, and alternative-uses tasks, three frontier LLMs fall below parity across crowding kernels. Estimates stabilize with feasible model-only sample sizes. Importantly, generation-protocol variants show that crowding can be reduced through targeted design, making diversity collapse an actionable, development-time evaluation target for population-aware creative AI.
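A minimal sketch of how population-level crowding statistics of this kind can be computed from idea embeddings. The estimators below (mean pairwise cosine similarity as the crowding kernel, and the particular forms of delta and rho) are illustrative assumptions; the paper's frequency-binned kernels and exact definitions may differ:

```python
import numpy as np

def mean_pairwise_similarity(X):
    """Average cosine similarity over distinct pairs: a simple crowding kernel."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    n = len(X)
    return float((S.sum() - n) / (n * (n - 1)))  # drop the diagonal of ones

def crowding_metrics(ai_embs, human_embs):
    """Illustrative excess-crowding coefficient (delta) and human-relative
    diversity ratio (rho); rho >= 1 would indicate no excess crowding."""
    c_ai = mean_pairwise_similarity(ai_embs)
    c_hu = mean_pairwise_similarity(human_embs)
    delta = c_ai - c_hu            # > 0: AI ideas crowd more than human ideas
    rho = (1 - c_ai) / (1 - c_hu)  # < 1: below the no-excess-crowding parity
    return delta, rho

rng = np.random.default_rng(0)
human = rng.normal(size=(50, 16))                            # diverse, isotropic ideas
ai = rng.normal(size=(50, 16)) * 0.3 + rng.normal(size=16)   # ideas clustered near a mode
delta, rho = crowding_metrics(ai, human)
```

In this synthetic setup the AI pool collapses toward a shared mode, so delta comes out positive and rho falls below parity.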

[AI-11] SpatialEpiBench: Benchmarking Spatial Information and Epidemic Priors in Forecasting

【Quick Read】: This paper tackles the difficulty of accurate epidemic forecasting under sparse, noisy, and highly non-stationary data, and in particular how to exploit spatiotemporal information across interacting regions. The key contribution is SpatialEpiBench, a standardized benchmark of 11 real public-health epidemic datasets evaluated with rolling protocols and outbreak-specific metrics. Benchmarking reveals that, even with adjacency information informed by epidemiological priors, most existing models underperform a simple last-value baseline from 1 day to 1 month ahead, exposing three failure modes: poor outbreak anticipation, difficulty handling sparsity and noise, and limited utility of common geographic adjacency as epidemiological spatial information. The work provides a reproducible evaluation framework and concrete directions for developing operationally useful epidemic forecasting models.

Link: https://arxiv.org/abs/2605.06530
Authors: Ruiqi Lyu, Alistair Turcan, Bryan Wilder
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate epidemic forecasting is crucial for public health response, resource allocation, and outbreak intervention, but remains difficult with sparse, noisy, and highly non-stationary data. Because epidemics unfold across interacting regions, spatiotemporal methods are natural candidates for improving forecasts. Despite growing interest in spatial information, no standardized benchmark exists, and current evaluations often use simple chronological train-test splits that do not reflect real-time forecasting practice. We address this gap with SpatialEpiBench, a challenging benchmark for spatiotemporal epidemic forecasting in realistic public-health settings. SpatialEpiBench includes 11 epidemic datasets with standardized rolling evaluations and outbreak-specific metrics. We evaluate adjacency-informed forecasting models with widely used epidemic priors that adapt general models to epidemiology, but find that most methods underperform a simple last-value baseline from 1 day to 1 month ahead, even during outbreaks and with these priors. We identify three major failure modes: (1) poor outbreak anticipation, (2) difficulty handling sparsity and noise, and (3) limited utility of common geographic adjacency for epidemiological spatial information. We release benchmark data, code, and instructions at this https URL to support development of operationally useful epidemic forecasting models.
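The last-value (persistence) baseline under a rolling-origin evaluation, which the benchmark reports as surprisingly hard to beat, can be sketched in a few lines (the toy data are invented for illustration):

```python
import numpy as np

def rolling_last_value_mae(series, horizon):
    """Rolling-origin evaluation of the last-value (persistence) baseline:
    at each origin t, predict y[t] as the forecast for y[t + horizon]."""
    errs = [abs(series[t + horizon] - series[t])
            for t in range(len(series) - horizon)]
    return float(np.mean(errs))

# toy weekly case counts: a slow trend makes persistence hard to beat
cases = np.array([10, 12, 11, 13, 15, 14, 16, 18, 17, 19], dtype=float)
mae_1 = rolling_last_value_mae(cases, horizon=1)   # 1 step ahead
mae_4 = rolling_last_value_mae(cases, horizon=4)   # 4 steps ahead
```

As expected, the baseline's error grows with the forecast horizon; a candidate model has to beat these numbers at every horizon to be useful.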

[AI-12] Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State

【Quick Read】: This paper addresses reward-induced behavioral distortion in reinforcement learning (RL) pricing agents: when optimizing a measurable scalar metric such as revenue per available room (RevPAR), an agent may fail to learn distribution-level market behavior (price distributions, occupancy, etc.) and instead adopt myopic shortcut policies such as aggressive undercutting or collapsing onto fixed price buckets. The key solution is a trace-level diagnose-and-repair framework, Trace-Prior RL: first learn a distributional market prior from historical market traces, then train a stochastic pricing policy that maximizes RevPAR under a KL penalty toward the learned prior. This aligns agent behavior with the target market across RevPAR, ADR, occupancy, and price-bucket distributions, matching the reference hotel within seed-level uncertainty, and offers a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and intended behavior is only visible at the trace level.

Link: https://arxiv.org/abs/2605.06529
Authors: Peiying Zhu, Sidi Chang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages

Abstract:Outcome metrics can certify the wrong behavior. We study this failure in a two-hotel revenue-management simulator where Hotel A trains an agent against a fixed rule-based revenue-management competitor, Hotel B. A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets. We diagnose this as a Goodhart-style failure under partial observability. Hotel A cannot observe the competitor’s remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices. Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior. We introduce a trace-level diagnostic protocol using RevPAR, occupancy, ADR, full price-bucket distributions, L1/JS distances, and seed-level confidence intervals. The verified repair is Trace-Prior RL: learn a distributional market prior from lagged market traces, then train a stochastic pricing policy with a RevPAR reward and a KL penalty to the learned prior. The final policy matches Hotel B’s RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty, while still optimizing Hotel A’s own reward. We argue that the contribution is not a new optimizer and not a hotel-pricing leaderboard, but a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces. A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.

[AI-13] Process Matters more than Output for Distinguishing Humans from Machines

【Quick Read】: This paper addresses the problem of reliably distinguishing humans from machines (large language models and autonomous agents) in online settings. Traditional approaches follow the Turing-test emphasis on output-level similarity, which struggles to capture the essential behavioral differences of generative AI. The key solution is CogCAPTCHA30, a battery of 30 cognitive tasks whose process-level features — diagnostic signals extracted from how tasks are performed — provide stronger discriminative power than performance metrics alone, even under exact output matching (mean process-feature classifier AUC = 0.88). Further experiments show that explicit process-level supervision (e.g., P-SFT) improves machine mimicry of human behavioral patterns, but its benefit hinges on the availability of task-specific process representations, highlighting process specification as the core bottleneck for achieving human-like cognitive processes.

Link: https://arxiv.org/abs/2605.06524
Authors: Milena Rmus, Mathew D. Hardy, Thomas L. Griffiths, Mayank Agrawal
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features even when task performance is matched. Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). To evaluate agentic process differences, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7M human decisions), and two task-specific fine-tuning approaches applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT), which directly optimizes process features. Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve human behavioral mimicry, but only if appropriate task-specific process representations are available, highlighting process specification as a bottleneck for achieving human-like cognitive processes in machines.

[AI-14] On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

【Quick Read】: This paper investigates implicit reward overfitting in Reinforcement Learning with Verifiable Rewards (RLVR), where a model performs well on the test set despite relatively low rewards during training. The key tool is Periodic Rank-1 Substitution, which reveals that the effective information acquired through RLVR concentrates in rank-1 components, that RLVR essentially optimizes a specific singular spectrum, and that the left singular vectors exhibit a stronger alignment tendency during training — indicating that RLVR is fundamentally optimizing sampling efficiency. These findings offer a new lens on how RLVR shapes model parameters and suggest directions for improving existing RL paradigms toward continual learning.

Link: https://arxiv.org/abs/2605.06523
Authors: Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, Tat-Seng Chua
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during the training process. Furthermore, we characterize three distinct properties of RL training: (1) the effective rank-1 components in RLVR do not preserve model knowledge other than mathematical reasoning capability; (2) RLVR fundamentally functions by optimizing a specific singular spectrum, and the singular-value distribution of almost all linear layers in an RLVR-trained model behaves like a heavy-tailed distribution; (3) the left singular vectors associated with rank-1 components demonstrate a stronger alignment tendency during training, which echoes the discovery that RLVR is optimizing sampling efficiency in essence. Taken together, our findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms or other training paradigms to implement continual learning.
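The rank-1 substitution probe can be illustrated on synthetic weights: replace the full fine-tuning update with only its top singular component and measure how much of the update survives. This is a minimal sketch of the analysis tool on invented matrices, not the paper's experimental setup:

```python
import numpy as np

def rank1_substitute(w_base, w_tuned):
    """Replace the full fine-tuning update (w_tuned - w_base) with only its
    top singular component, as in periodic rank-1 substitution analyses."""
    delta = w_tuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    rank1 = s[0] * np.outer(u[:, 0], vt[0])
    return w_base + rank1

rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 8))
# a genuinely rank-1 update plus small noise, mimicking a concentrated delta
a, b = rng.normal(size=8), rng.normal(size=8)
w_tuned = w_base + np.outer(a, b) + 0.01 * rng.normal(size=(8, 8))
w_sub = rank1_substitute(w_base, w_tuned)
# fraction of the update lost when keeping only the rank-1 part
rel_err = np.linalg.norm(w_sub - w_tuned) / np.linalg.norm(w_tuned - w_base)
```

When the update really is concentrated in one singular direction, the substituted weights reproduce the tuned weights almost exactly.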

[AI-15] Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models (ICML 2026)

【Quick Read】: This paper addresses the poorly understood inference mechanisms of Transformer-based tabular foundation models (TFMs) on small-to-medium tabular prediction benchmarks. Through the first large-scale mechanistic analysis, it traces how predictions evolve layer by layer, identifying distinct stages of inference and latent-space dynamics that differ markedly from those of language models. The key finding is substantial depthwise redundancy: layers perform overlapping computations during inference. Guided by this, the authors design a looped single-layer model that matches the original model's performance with only 20% of its parameters, validating iterative refinement with parameter sharing.

Link: https://arxiv.org/abs/2605.06510
Authors: Amir Rezaei Balef, Mykhailo Koshil, Katharina Eggensperger
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Transformer-based tabular foundation models (TFMs) dominate small to medium tabular predictive benchmark tasks, yet their inference mechanisms remain largely unexplored. We present the first large-scale mechanistic study of layerwise dynamics in 6 state-of-the-art tabular in-context learning models. We explore how predictions emerge across depth, identify distinct stages of inference and reveal latent-space dynamics that differ from those of language models. Our findings indicate substantial depthwise redundancy across multiple models, suggesting iterative refinement with overlapping computations during inference stages. Guided by these insights, we design a proof-of-concept, looped single-layer model that uses only 20% of the original model’s parameters while achieving comparable performance. The code is available at this https URL.
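The looped single-layer idea — emulating depth by re-applying one weight-tied block — can be sketched as follows. The block here is a toy residual MLP standing in for the paper's transformer layer; shapes and loop counts are invented:

```python
import numpy as np

def layer(x, W, b):
    """One shared block: a residual MLP stand-in for a transformer layer."""
    return x + np.tanh(x @ W + b)

def looped_forward(x, W, b, n_loops):
    """Weight-tied depth: apply the same layer n_loops times, emulating a
    deep stack while storing only a single layer's parameters."""
    for _ in range(n_loops):
        x = layer(x, W, b)
    return x

rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(4, 4))
b = np.zeros(4)
x = rng.normal(size=(2, 4))
out_shallow = looped_forward(x, W, b, n_loops=1)
out_deep = looped_forward(x, W, b, n_loops=12)
```

The parameter count is independent of the effective depth: looping 12 times costs the same memory as one layer, which is the source of the reported ~80% parameter savings.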

[AI-16] On the Security of Research Artifacts

【Quick Read】: This paper addresses the severe lack of security assessment for research artifacts, which are widely shared at academic conferences to support reproducibility. Existing artifact evaluation (AE) checks only whether functionality can be reproduced, overlooking security risks: once publicly released and reused, artifacts may introduce attack vectors and enable misuse. The key solution is a context-aware security assessment taxonomy together with SAFE (Security-Aware Framework for Artifact Evaluation), which combines static analysis with code semantics, execution context, and practical exploitability to automatically distinguish real security risks from false positives, enabling structured identification and quantitative evaluation of latent security issues in artifacts. SAFE achieves 84.80% accuracy and an 84.63% F1-score, showing it can effectively strengthen security assurance in the AE process.

Link: https://arxiv.org/abs/2605.06508
Authors: Nanda Rani, Christian Rossow
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Research artifacts are widely shared to support reproducibility, and artifact evaluation (AE) has become common at many leading conferences. However, AE mainly checks whether artifacts work as claimed and can be reproduced. It largely overlooks potential security risks. Since these artifacts are publicly released and reused, they may unintentionally create opportunities for misuse and raise concerns about safe and responsible sharing. We study 509 research artifacts from top-tier security venues and find that many contain insecure code patterns that may introduce potential attack vectors. We propose a taxonomy for context-aware security assessment to enable structured analysis of such risks. We perform static analysis and examine the resulting findings, filtering false positives and identifying real security risks. Our analysis shows that 41.60% of the prevalent findings may pose security concerns under practical usage. To support scalable analysis, we introduce SAFE (Security-Aware Framework for Artifact Evaluation), a first step toward an autonomous framework that analyzes tool-reported findings by considering code semantics, execution context, and practical exploitability. SAFE achieves 84.80% accuracy and 84.63% F1-score in distinguishing security and non-security risks. Overall, our results show that security is also important in AE for promoting safe and responsible research sharing. The source code is available at: this https URL

[AI-17] PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

【Quick Read】: This paper addresses how to fine-tune large language models (LLMs) with strong privacy guarantees while preserving usable utility, specifically in the zeroth-order (ZO) optimization setting. Traditional differential privacy (DP) mechanisms require heavy noise to reach the strongest privacy levels (e.g., ε = 0), severely degrading performance. The proposed PACZero framework instead leverages Probably Approximately Correct (PAC) privacy to achieve information-theoretic protection (I(S*; Y₁:T) = 0) in zeroth-order fine-tuning, strictly bounding the membership inference attack (MIA) posterior success rate at the prior without requiring infinite noise or ε = 0. The key insight is that sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity across candidate subsets, and releasing the sign at unanimous steps incurs zero conditional mutual information — privacy "for free". Two variants span the privacy-utility trade-off: PACZero-MI (a budgeted mutual-information release via exact calibration of the binary output) and PACZero-ZPL (uniform random flips on disagreement steps to enforce I = 0). Experiments on SST-2 and SQuAD show that PACZero-ZPL attains accuracy close to the non-private baseline at I = 0, outperforming all existing methods at high privacy strength.

Link: https://arxiv.org/abs/2605.06505
Authors: Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at I(S*; Y_{1:T}) = 0. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at ε = 0 and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the release depends on which candidate subset is the secret. Sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity, steps at which every candidate subset agrees on the update direction; at these steps the released sign costs zero conditional mutual information. We propose two variants that span the privacy-utility trade-off: PACZero-MI (budgeted MI via exact calibration on the binary release) and PACZero-ZPL (I = 0 via a uniform coin flip on disagreement steps). We evaluate on SST-2 and SQuAD with OPT-1.3B and OPT-6.7B in both LoRA and full-parameter tracks. On SST-2 OPT-1.3B full fine-tuning at I = 0, PACZero-ZPL reaches 88.99 ± 0.91, within 2.1 pp of the non-private MeZO baseline (91.1 FT). No prior method produces usable utility in the high-privacy regime ε < 1, and PACZero-ZPL obtains competitive SST-2 accuracy and nontrivial SQuAD F1 across OPT-1.3B and OPT-6.7B at I = 0.
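The unanimity mechanism at the heart of PACZero can be illustrated with toy zeroth-order gradients: sign-quantize each candidate subset's aggregate and check whether all candidates agree. The subset construction and data below are invented for the sketch; PACZero-ZPL would additionally replace the release with a uniform coin flip at non-unanimous steps:

```python
import numpy as np

def subset_sign_release(zo_grads, subsets):
    """Sign-quantize each candidate subset's aggregated ZO gradient and report
    whether all candidates agree (unanimity). At unanimous steps, releasing
    the sign reveals nothing about which candidate subset is the real one."""
    signs = np.array([np.sign(np.sum(zo_grads[list(s)], axis=0))
                      for s in subsets])
    unanimous = np.all(signs == signs[0], axis=0)
    return signs[0], unanimous

# six per-example ZO gradient estimates (1-D for clarity)
zo_grads = np.array([[0.9], [1.1], [0.8], [1.2], [-0.1], [1.0]])

# four candidate subsets that all agree on the update direction
sign, unan = subset_sign_release(zo_grads, [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)])

# a disagreeing pair of candidates: the release would leak, so ZPL flips a coin
sign2, unan2 = subset_sign_release(zo_grads, [(4,), (0, 1)])
```

The first call is a "free" step (every candidate yields the same sign); the second is a disagreement step, where a mechanism must either spend mutual-information budget (PACZero-MI) or randomize the release (PACZero-ZPL).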

[AI-18] Operator-Guided Invariance Learning for Continuous Reinforcement Learning

【Quick Read】: This paper addresses the data inefficiency and brittleness of reinforcement learning (RL) in continuous time and state/action spaces under nuisance variability and distribution shift. The core challenge is discovering and exploiting general value-preserving structures: nonlinear operators that map between continuous systems with isomorphic value functions, thereby stabilizing learning and improving generalization. The key solution, VPSD-RL, models continuous RL as a controlled diffusion and defines value-preserving mappings via Lie-group actions and their associated pullback operators. It proves that a value-preserving structure exists exactly when pulling back the value function and pushing forward actions commute with the controlled generator and reward functional; it then learns infinitesimal generators by minimizing determining-equation residuals, exponentiates them via ODE flows into finite transformations, and integrates the discovered structures into continuous RL through transition augmentation and transformation-consistency regularization. The framework yields quantitative stability guarantees for the optimal value function along approximate orbits, with sensitivity governed by the effective horizon, and experiments show improved data efficiency and robustness.

Link: https://arxiv.org/abs/2605.06500
Authors: Zuyuan Zhang, Fei Xu Yu, Tian Lan
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve learning. Most existing approaches focus on special cases, such as prescribed symmetries and exact equivariance, without addressing how to discover more general structures that require nonlinear operators to transform and map between continuous state/action systems with isomorphic value functions. We propose VPSD-RL (Value-Preserving Structure Discovery for Reinforcement Learning). It models continuous RL as a controlled diffusion with value-preserving mappings defined through Lie-group actions and associated pullback operators. We show that a value-preserving structure exists exactly when pulling back the value function and pushing forward actions commute with the controlled generator and reward functional. Further, approximate value-preserving structures with rigorous guarantees can be found when the Hamilton–Jacobi–Bellman mismatch is small. This framework discovers exact and approximate value-preserving structures by searching for the associated Lie group operators. VPSD-RL fits differentiable drift, diffusion, and reward models; learns infinitesimal generators via determining-equation residual minimization; exponentiates them with ODE flows to obtain finite transformations; and integrates them into continuous RL through transition augmentation and transformation-consistency regularization. We show that bounded generator/reward mismatch implies quantitative stability of the optimal value function along approximate orbits, with sensitivity governed by the effective horizon, and observe improved data efficiency and robustness on continuous-control benchmarks.

[AI-19] From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

【Quick Read】: This paper addresses the insufficient analysis of higher-order co-occurrence structure among sparse autoencoder (SAE) features in mechanistic interpretability research. Existing methods characterize features almost exclusively through top-activating token lists or decoder weight vectors, overlooking the complex local-context association patterns that features share. The key solution is a graph-structured representation: each SAE feature is modeled as a token co-occurrence graph, with nodes for tokens that appear frequently near strong activations and edges for their co-occurrence within local context windows; a custom, frequency-binned Weisfeiler-Lehman (WL) style graph kernel then measures similarity in this structural space. The method surfaces structural patterns — punctuation-heavy motifs, language/script clusters, and code-like templates — that neither token frequency nor decoder cosine similarity captures, complementing the limitations of conventional feature-analysis views.

Link: https://arxiv.org/abs/2605.06494
Authors: Ruben Fernandez-Boullon, Pablo Magariños-Docampo, Javier Perez-Robles
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the higher-order co-occurrence structure shared across features largely unexamined. We introduce a graph-structured representation in which each SAE feature is modelled as a token co-occurrence graph: nodes are the tokens most frequent near strong activations, and edges connect pairs that co-occur within local context windows. A custom WL-style, frequency-binned graph kernel then provides a similarity measure over this structural space. Applied as a proof of concept to features from a large SAE trained on GPT-2 Small and probed with a synthetic mixed-domain corpus, our clustering recovers heuristic motif families (punctuation-heavy patterns, language and script clusters, and code-like templates) that are not recovered by clustering on decoder cosine similarity. A token-histogram baseline achieves higher overall purity, so the contribution of the graph view is complementary rather than dominant: it surfaces structural relationships that token-frequency and decoder-weight views alone do not capture. Cluster assignments are stable across graph-construction hyperparameters and random seeds.
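A minimal sketch of WL-style relabeling over token co-occurrence graphs. The histogram-intersection base kernel below is one simple choice made for illustration; the paper's frequency-binned kernel differs in detail:

```python
from collections import Counter

def wl_histograms(adj, labels, iters=2):
    """Weisfeiler-Lehman relabeling: each node's label is repeatedly hashed
    together with its sorted neighbor labels; collect the label multiset."""
    hist = Counter(labels)
    for _ in range(iters):
        labels = [hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
                  for v in range(len(adj))]
        hist.update(labels)
    return hist

def wl_kernel(adj_a, lab_a, adj_b, lab_b, iters=2):
    """Histogram-intersection similarity of the WL label multisets."""
    ha = wl_histograms(adj_a, lab_a, iters)
    hb = wl_histograms(adj_b, lab_b, iters)
    inter = sum((ha & hb).values())
    return inter / max(sum(ha.values()), sum(hb.values()))

# two tiny token co-occurrence graphs with identical token sets
path = {0: [1], 1: [0, 2], 2: [1]}      # the - cat - sat (a chain)
star = {0: [1, 2], 1: [0], 2: [0]}      # "the" co-occurs with both others
k_same = wl_kernel(path, ["the", "cat", "sat"], path, ["the", "cat", "sat"])
k_diff = wl_kernel(path, ["the", "cat", "sat"], star, ["the", "cat", "sat"])
```

The kernel distinguishes the two graphs even though their token histograms are identical — exactly the kind of structural signal a pure token-frequency view misses.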

[AI-20] Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

【Quick Read】: This paper asks whether current frontier AI agents exhibit instrumental convergence (IC) behavior in realistic settings — that is, whether models violate human instructions in pursuit of a goal, taking potentially risky actions such as self-preservation. The key solution is a realistic, low-stakes benchmark suite of seven operational tasks, each with a compliant official workflow and a policy-violating shortcut, together with an eight-variant framework that systematically manipulates monitoring intensity, instruction clarity, task stakes, permissions, tool usefulness, and the accessibility of honest paths, so as to quantify the rate and drivers of IC behavior. Combining deterministic environment-state scoring with manual trace review, the study detects IC behavior in 5.1% of 1,680 samples, showing that such behavior, while rare, is concentrated and measurable, and demonstrating the feasibility of robustly evaluating dangerous propensities in current frontier AI agents.

Link: https://arxiv.org/abs/2605.06490
Authors: Jonas Wiedermann-Möller, Leonard Dung, Maksym Andriushchenko
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:AI systems have become increasingly capable of dangerous behaviours in many domains. This raises the question: Do models sometimes choose to violate human instructions in order to perform behaviour that is more useful for certain goals? We introduce a benchmark for measuring model propensity for instrumental convergence (IC) behaviour in terminal-based agents. This is behaviour such as self-preservation that has been hypothesised to play a key role in risks from highly capable AI agents. Our benchmark is realistic and low-stakes which serves to reduce evaluation-awareness and roleplay confounds. The suite contains seven operational tasks, each with an official workflow and a policy-violating shortcut. An eight-variant shared framework varies monitoring, instruction clarity, stakes, permission, instrumental usefulness and blocked honest paths to support inferences regarding the factors driving IC behaviour. We evaluated ten models using deterministic environment-state scorers over 1,680 samples, with trace review employed for audit and adjudication purposes. The final IC rate is 86 out of 1,680 samples (5.1%). IC behaviour is concentrated rather than uniform: two Gemini models account for 66.3% of IC cases and three tasks account for 84.9%. Conditions in which IC behaviour is indispensable for task success result in the greatest increase in the adjusted IC rate (+15.7 percentage points), whereas emphasising that task success is critical or certain framing choices do not produce comparable effects. Our findings indicate that realistic, low-nudge environments elicit IC behaviour rarely but systematically in most tested models. We conclude that it is feasible to robustly measure tendencies for dangerous behaviour in current frontier AI agents.

[AI-21] ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning

【Quick Read】: This paper addresses automatic translation from natural language to Signal Temporal Logic (STL), a task central to formal verification and synthesis for autonomous and cyber-physical systems. Current approaches face two limitations: manually writing STL formulas requires temporal-logic expertise and does not scale, while prompting commercial large language model (LLM) APIs incurs high token costs and risks leaking sensitive system requirements to third parties, hindering industrial deployment. The key innovation of the proposed ReasonSTL framework is decomposing the translation process into three stages — explicit reasoning, deterministic tool calls, and structured formula construction — together with process-rewarded training that supervises both tool-use trajectories and final STL formulas, yielding transparent, low-cost, privacy-preserving automated drafting of formal specifications.

Link: https://arxiv.org/abs/2605.06483
Authors: Bowen Ye, Zhijian Li, Junyue Huang, Junkai Ma, Xiang Yin
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Signal Temporal Logic (STL) is an expressive formal language for specifying spatio-temporal requirements over real-valued, real-time signals. It has been widely used for the verification and synthesis of autonomous systems and cyber-physical systems. In practice, however, users often express their requirements in natural language rather than in structured STL formulas, making natural-language-to-STL translation a critical yet challenging task. Manual specification requires temporal-logic expertise and cannot scale, while prompting commercial LLM APIs incurs substantial token costs and may expose sensitive system requirements to third-party services, raising privacy concerns for industrial deployment. To address these challenges, we present ReasonSTL, a tool-augmented framework that adapts local open-source language models for natural-language-to-STL generation. ReasonSTL decomposes the translation process into explicit reasoning, deterministic tool calls, and structured formula construction. We further introduce process-rewarded training to supervise both tool-use trajectories and final formulas, together with STL-Bench, a bilingual, computation-aware benchmark grounded in real-world signals. Experiments show that a 4B model trained with ReasonSTL achieves state-of-the-art performance in both automatic metrics and human evaluations, demonstrating that ReasonSTL provides a transparent, low-cost, and privacy-preserving alternative for formal specification drafting.

[AI-22] Probabilistic Dating of Historical Manuscripts via Evidential Deep Regression on Visual Script Features

【Quick Read】: This paper addresses precise dating of historical manuscript pages, which traditional methods coarsely reduce to century-level classification. The key solution is a probabilistic deep regression framework over visual features that models dating as estimating a full predictive distribution on a continuous year axis, outputting decomposed aleatoric and epistemic uncertainty in a single forward pass. The model pairs an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained on a joint negative-log-likelihood and evidence-regularization objective, achieving a 5.4-year test mean absolute error (MAE) on the DIVA-HisDB benchmark — substantially better than existing methods — with lower inference cost and better calibration (PICP = 92.6%).

Link: https://arxiv.org/abs/2605.06475
Authors: Ranjith Chodavarapu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative-log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model scores a test MAE of 5.4 years, well below the 50-year century-label supervision granularity, with 93% of patches within 5 years and 97% within 10 years. Our approach achieves PICP = 92.6%, the best calibration among all compared methods, in a single forward pass, outperforming MC Dropout (PICP = 88.2%, 50 passes) and Deep Ensembles (PICP = 79.7%, 5 models) at 5× lower inference cost. Uncertainty decomposition shows aleatoric uncertainty is a strong predictor of dating error (Spearman ρ = 0.729), and selective prediction on the most certain 20% of patches reduces MAE to 0.5 years. We show that predicted uncertainty increases as image degradation worsens, spatial decomposition maps explain which script regions cause aleatoric uncertainty, and page-level aggregation reduces MAE to 4.5 years with ρ = 0.905 between uncertainty and page-level error.
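The single-forward-pass uncertainty decomposition follows from the closed-form moments of the Normal-Inverse-Gamma head, standard in evidential deep regression: E[σ²] = β/(α−1) for aleatoric and Var[μ] = β/(ν(α−1)) for epistemic uncertainty. The parameter values below are invented for illustration:

```python
def nig_predict(gamma, nu, alpha, beta):
    """Closed-form moments of a Normal-Inverse-Gamma evidential head:
    point prediction gamma, aleatoric variance E[sigma^2] = beta/(alpha-1),
    epistemic variance Var[mu] = beta/(nu*(alpha-1)); requires alpha > 1."""
    mean = gamma
    aleatoric = beta / (alpha - 1)
    epistemic = beta / (nu * (alpha - 1))
    return mean, aleatoric, epistemic

# a confident prediction (large nu and alpha = lots of virtual evidence)
m1, al1, ep1 = nig_predict(gamma=1450.0, nu=50.0, alpha=20.0, beta=38.0)
# an uncertain prediction of the same year (little evidence)
m2, al2, ep2 = nig_predict(gamma=1450.0, nu=2.0, alpha=1.5, beta=38.0)
```

More evidence (larger ν, α) shrinks both uncertainty terms, and the epistemic term shrinks faster — which is what makes it usable for selective prediction on the most certain patches.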

[AI-23] Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

【Quick Read】: This paper addresses off-policy evaluation in finite-horizon Markov decision processes (MDPs): accurately estimating the expected return of a target policy from historical data, especially under function approximation. The key solution is a new theoretical framework, Q-MMR, which learns a set of scalar weights — one per data point — such that the reweighted rewards approximate the expected return under the target policy. The weights are optimized inductively in a top-down manner via a moment-matching objective against a value-function discriminator class. Notably, a data-dependent finite-sample guarantee holds under only the realizability of Q^π, with an error bound that does not depend on the statistical complexity of the function class (a dimension-free bound) — a marked departure from the strong complexity dependence of conventional methods.

Link: https://arxiv.org/abs/2605.06474
Authors: Xiang Li, Nan Jiang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of Q^\pi , with a dimension-free bound – that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
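下面用一个单步(horizon=1)的玩具例子示意"矩匹配学习逐数据点权重、再以加权奖励做估计"的思路。其中判别器取 one-hot 指示特征、目标策略下的状态分布视为已知,这些均为摘要之外的简化假设,并非论文中针对有限时域 MDP 的完整 Q-MMR 算法:

```python
from collections import Counter

def moment_match_weights(states, target_dist):
    """对 one-hot 判别器特征,矩匹配的最小范数解:
    同一状态的样本均分该状态在目标分布下的概率质量(简化假设)。"""
    counts = Counter(states)
    return [target_dist[s] / counts[s] for s in states]

# 行为数据(假设):状态 0 出现 3 次(奖励 1),状态 1 出现 1 次(奖励 0)
states  = [0, 0, 0, 1]
rewards = [1.0, 1.0, 1.0, 0.0]
mu_pi   = {0: 0.5, 1: 0.5}   # 目标策略下的状态分布(假设已知)

w = moment_match_weights(states, mu_pi)
estimate = sum(wi * ri for wi, ri in zip(w, rewards))
print(estimate)   # 0.5,与目标策略下的真实期望奖励一致
```

注意这里的权重由目标分布与样本计数直接给出;论文的一般情形则是自顶向下归纳地对值函数判别器类做矩匹配优化。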

[AI-24] Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems PAKDD2026

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)驱动的多智能体系统在支付流程中缺乏细粒度评估方法的问题。现有指标如任务成功率(Task Success Rate, TSR)和代理交接F1分数(Agent Handoff F1-Score, HF1)仅关注最终结果或无序的路由决策,无法捕捉智能体执行路径中的偏差。为此,作者提出一种轨迹保真度指标——代理成功率(Agentic Success Rate, ASR),通过在转换层级比较观测到与预期的智能体执行序列,将性能分解为转换召回率(Transition Recall)和转换精确率(Transition Precision)。ASR能够识别出诸如跳过确认检查点等隐蔽性工作流异常,从而揭示TSR和HF1无法检测到的系统性行为差异,证明了轨迹级评估在受监管领域的重要性。

链接: https://arxiv.org/abs/2605.06457
作者: Donghao Huang,Joon Kiat Chua,Zhaoxia Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 2 tables. Accept at AI and Data Science for Digital Finance (AIDS4DF) Workshop, PAKDD 2026

点击查看摘要

Abstract:LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR. Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models, demonstrating that trajectory-level evaluation is essential in regulated domains.
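按摘要思路,ASR 的转换(transition)层级比较可以粗略示意如下。期望/观测序列为虚构示例,将召回与精确率按 F1 方式合成也只是一种假设的聚合方法,具体细节以论文为准:

```python
def transition_metrics(expected, observed):
    """在相邻步骤构成的转换集合上计算召回与精确率。"""
    exp_t = {(a, b) for a, b in zip(expected, expected[1:])}
    obs_t = {(a, b) for a, b in zip(observed, observed[1:])}
    hit = len(exp_t & obs_t)
    return hit / len(exp_t), hit / len(obs_t)

expected = ["intake", "verify", "confirm", "execute"]   # 含确认检查点
observed = ["intake", "verify", "execute"]              # 跳过了 confirm

r, p = transition_metrics(expected, observed)
asr = 2 * r * p / (r + p)   # 按 F1 方式合成(假设的聚合方式)
print(round(r, 3), round(p, 3), round(asr, 3))   # 0.333 0.5 0.4
```

注意 observed 序列最终仍完成了 execute,结果层面的 TSR 会判为成功;而转换层级的指标立即暴露了被跳过的 confirm 检查点,这正是摘要所强调的"对 TSR 与 HF1 不可见"的偏差。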

[AI-25] PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在执行长序列工具调用任务时,因最终结果验证延迟而导致无法及时干预的问题。其核心挑战在于如何设计轻量级、可泛化的前缀监控机制(prefix monitor),以在不依赖人工编写事件模式且避免昂贵的部署时LLM判断的前提下,实现早期风险预警。解决方案的关键在于提出PrefixGuard框架,该框架包含两个阶段:首先通过StepView离线诱导出确定性的类型化步骤适配器(deterministic typed-step adapters),从而将原始轨迹转化为结构化表示;其次基于终端结果进行监督学习,训练出事件抽象和前缀风险评分器(prefix-risk scorer)。这一方法显著提升了AUPRC指标(最高达0.900),并证明了其在多个基准测试中优于纯文本控制策略,同时揭示了评分性能与实际部署可用性之间的差异,为构建可解释、可诊断的前缀预警系统提供了实用路径。

链接: https://arxiv.org/abs/2605.06455
作者: Xinmiao Huang,Jinwei Hu,Rajarshi Roy,Changshun Wu,Yi Dong,Xiaowei Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, \tau^2 -Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach 0.900/0.710/0.533/0.557 AUPRC. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based area under the precision-recall curve (AUPRC) that separates monitor error from failures lacking evidence in the observed prefix. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and \tau^2 -Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas \tau^2 -Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.

[AI-26] ORTHOBO: Orthogonal Bayesian Hyperparameter Optimization

【速读】:该论文旨在解决贝叶斯优化(Bayesian Optimization, BO)中因采集函数(acquisition function)估计噪声导致的决策不稳定问题。尽管代理模型和采集目标正确指定,有限样本的蒙特卡洛误差仍可能扰动采集值,进而改变候选解的排序并引发次优决策。解决方案的关键在于提出一种正交采集估计器(orthogonal acquisition estimator),通过减去一个最优加权的得分函数控制变量(score-function control variate),使得采集残差与后验得分方向正交,从而实现蒙特卡洛方差的降低。在此基础上,作者进一步构建了OrthoBO框架,结合集成代理模型与外层对数变换,在理论上保证估计无偏性、方差缩减及配对排序稳定性,并在数值实验中验证其在超参数优化任务中的有效性。

链接: https://arxiv.org/abs/2605.06454
作者: Maresa Schröder,Pascal Janetzky,Michael Klar,Stefan Feuerriegel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bayesian optimization is widely used for hyperparameter optimization when model evaluations are expensive; however, noisy acquisition estimates can lead to unstable decisions. We identify acquisition estimation noise as a failure mode that was previously overlooked: even when the surrogate model and acquisition target are correctly specified, finite-sample Monte Carlo error can perturb acquisition values. This can, in turn, flip candidate rankings and lead to suboptimal BO decisions. As a remedy, we aim at variance reduction and propose an orthogonal acquisition estimator that subtracts an optimally weighted score-function control variate, which yields an acquisition residual orthogonal to posterior score directions and which thus reduces Monte Carlo variance. We further introduce OrthoBO: a Bayesian optimization framework that combines our orthogonal acquisition estimator with ensemble surrogates and an outer log transformation. We show theoretically that our estimator preserves the target, leads to variance reduction, and improves pairwise ranking stability. We further verify the theoretical properties of OrthoBO through numerical experiments where our framework reduces acquisition estimation variance, stabilizes candidate rankings, and achieves strong performance. We also demonstrate the downstream utility of OrthoBO in hyperparameter optimization for neural network training and fine-tuning.
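控制变量(control variate)降低蒙特卡洛方差的原理,可以用一个通用的小例子说明。这不是论文中针对采集函数的正交化构造,目标函数与分布均为假设,仅演示"减去最优加权的零均值项可在保持无偏的同时缩减方差":

```python
import random, statistics

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(20000)]

f = [(x + 1.0) ** 2 for x in xs]   # 待估计量,E[f] = 2
g = xs                             # 控制变量,已知 E[g] = 0

# 最优权重 beta* = Cov(f, g) / Var(g)
mf, mg = statistics.fmean(f), statistics.fmean(g)
cov = statistics.fmean((fi - mf) * (gi - mg) for fi, gi in zip(f, g))
var_g = statistics.fmean((gi - mg) ** 2 for gi in g)
beta = cov / var_g

adj = [fi - beta * gi for fi, gi in zip(f, g)]   # E[g]=0,故估计仍无偏
var_plain = statistics.pvariance(f)
var_cv = statistics.pvariance(adj)
print(var_cv < var_plain)   # True:方差显著降低(理论上从 6 降到 2)
```

正如摘要所述,方差缩减的直接收益是采集值排序更稳定,从而避免候选点排名被有限样本噪声翻转。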

[AI-27] Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理在生成生产级后端代码时,难以满足结构约束(如架构模式、数据库设计和对象关系映射)的问题。现有基准测试通常忽略非功能性需求,仅评估功能正确性,导致生成结果虽能运行却缺乏工程规范性。解决方案的关键在于构建一个系统性的评估框架:通过固定统一的API契约,在80个全新项目任务和20个功能实现任务中,结合端到端行为测试与静态验证器,量化结构复杂度对代理性能的影响。实验发现“约束衰减”现象——随着结构要求增加,代理表现显著下降;同时识别出数据层缺陷(如查询构造错误和ORM运行时违规)是主要失败根源,揭示了同时满足功能与结构约束仍是编码代理的核心挑战。

链接: https://arxiv.org/abs/2605.06445
作者: Francesco Dente,Dario Satriani,Paolo Papotti
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and object-relational mappings. Existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions. We present a systematic study evaluating how well agents handle structural constraints in multi-file backend generation. By fixing a unified API contract across 80 greenfield generation tasks and 20 feature-implementation tasks spanning eight web frameworks, we isolate the effect of structural complexity using a dual evaluation with end-to-end behavioral tests and static verifiers. Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline. Capable configurations lose 30 points on average in assertion pass rates from baseline to fully specified tasks, while some weaker configurations approach zero. Framework sensitivity analysis exposes significant performance disparities: agents succeed in minimal, explicit frameworks (e.g., Flask) but perform substantially worse on average in convention-heavy environments (e.g., FastAPI, Django). Finally, error analysis identifies data-layer defects (e.g., incorrect query composition and ORM runtime violations) as the leading root causes. This work highlights that jointly satisfying functional and structural requirements remains a key open challenge for coding agents.

[AI-28] SCRuB: Social Concept Reasoning under Rubric-Based Evaluation

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在社会概念推理(Social Concept Reasoning)能力评估方面的空白问题,即当前主流评测方法多聚焦于数学或技术任务,缺乏针对抽象社会概念(如社会规范、文化价值观和制度逻辑)的系统性评估框架。其解决方案的关键在于提出SCRuB(Social Concept Reasoning under Rubric-Based Evaluation)框架,该框架通过三阶段流程实现:基于权威来源构建提示(prompt)、由专家与模型生成响应,并采用五维批判性思维量规进行对比评估;同时引入“学科视角委员会”(Panel of Disciplinary Perspectives)集成方法以提升评估泛化能力,最终实证表明前沿模型在所有维度上均超越人类专家,揭示了单轮问答式评测对社会概念推理已达到饱和状态。

链接: https://arxiv.org/abs/2605.06444
作者: Jamelle Watson-Daniels,Himaghna Bhattacharjee,Skyler Wang,Brandon Handoko,Antonio Li,Anaelia Ovalle,Mahesh Pasupuleti,Candace Ross,Vidya Sarma,Arjun Subramonian,Karen Ullrich,Will van der Vaart,Yijing Xin,Maximilian Nickel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While many studies of Large Language Model (LLM) reasoning capabilities emphasize mathematical or technical tasks, few address reasoning about social concepts: the abstract ideas shaping social norms, culture, and institutions. This understudied capability is essential for modern models acting as social agents, yet no systematic evaluation methodology targets it. We introduce SCRuB (Social Concept Reasoning under Rubric-Based Evaluation), a framework designed for this setting of task indeterminacy. Our goal is to measure the degree to which a model reasons about social concepts with the depth and critical rigor of a human expert. SCRuB proceeds in three phases: prompt construction from established sources, response generation by experts and models, and comparative evaluation using a five-dimensional critical thinking rubric. To enable generalization of the pipeline, we introduce a Panel of Disciplinary Perspectives ensemble validated against independent expert judges. We release SCRuBEval (n=4,711 evaluation prompts) and SCRuBAnnotations (300 expert-authored responses and 150 expert comparative judgments from 45 PhD-level scholars). Our results show that frontier models consistently outperform human experts across all five rubric dimensions. Across 1,170 pairwise comparisons, expert judges ranked a model response first in 80.8% of judgments and preferred model responses overall 74.4% of the time. Ultimately, this study provides the first expert-grounded demonstration of evaluation saturation for social concept reasoning: the single-turn exam-style format has reached its ceiling for models and humans alike.

[AI-29] Knowledge Graphs, the Missing Link in Agentic AI-based Formal Verification ICDT

【速读】:该论文旨在解决生成式 AI 在形式验证(Formal Verification, FV)中用于从自然语言规范自动生成 SystemVerilog Assertions (SVAs) 时面临的两大核心挑战:一是规范与寄存器传输级(Register Transfer Level, RTL)设计之间语义脱节,导致生成的断言存在语法错误或逻辑不一致;二是现有方法将规范和RTL视为松散文本,缺乏对微架构细节的精准建模。其解决方案的关键在于构建一个以验证为中心的知识图谱(Verification-centric Knowledge Graph, KG),该KG基于从规范、RTL及形式工具反馈(包括语法诊断、反例 Counterexamples, CEXs 和覆盖率报告)中提取的结构化中间表示(Intermediate Representations, IRs)建立,实现了需求、设计层次、信号、假设与属性之间的可追溯关联。通过多智能体工作流对KG进行查询与更新,驱动语法修复、CEX引导修正和覆盖率驱动的属性增强三个迭代优化循环,从而显著提升断言生成的准确性与编译成功率,并在多个基准设计上实现高达99.4%的形式覆盖率。

链接: https://arxiv.org/abs/2605.06434
作者: Vaisakh Naduvodi Viswambharan,Keerthan Kopparam Radhakrishna,Deepak Narayan Gadde,Aman Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: To appear at the IEEE International Conference on IC Design and Technology 2026 (ICICDT), June 22 - 24, 2026, Dresden, Germany

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have enabled workflows that generate SystemVerilog Assertions (SVAs) from natural-language specifications, with the potential to accelerate Formal Verification (FV). However, high-quality assertion synthesis remains challenging because specifications are often ambiguous or incomplete and critical micro-architectural details reside in the Register Transfer Level (RTL). Many existing approaches treat the specification and RTL as loosely structured text, which weakens specification-to-RTL grounding and leads to semantic mismatches and frequent syntax failures during formal parsing and elaboration. This work addresses these limitations with a verification-centric Knowledge Graph (KG) constructed from structured Intermediate Representations (IRs) extracted from the specification, RTL, and formal-tool feedback, including syntax diagnostics, Counterexamples (CEXs), and coverage reports. The KG links requirements, design hierarchy, signals, assumptions, and properties to provide traceable, design-grounded context for generation. A multi-agent workflow queries and updates this KG to generate SVAs and to drive three refinement loops: syntax repair guided by tool diagnostics, CEX-guided correction using trace links, and coverage-directed property augmentation. Evaluation across seven benchmark designs indicates that KG-based context retrieval improves specification-to-RTL grounding and consistently produces compilable SVAs with low syntax-repair overhead. The approach achieves formal coverage ranging from 78.5% to 99.4%, though convergence exhibits design dependence with complex temporal and arithmetic reasoning remaining challenging for current LLM capabilities.

[AI-30] Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves

【速读】:该论文旨在解决现代深度学习架构在处理天然无限维信号(如时间序列、概率分布或算子)时缺乏统一学习理论的问题,这些信号定义在不规则域上且维度可能无限。解决方案的关键在于提出一种新颖的卷积学习框架——HilbNets,其核心是利用与希尔伯特丛(Hilbert bundle)相关的连接拉普拉斯算子(connection Laplacian)作为卷积操作符,并通过两阶段采样程序实现可计算性:首先,证明在采样密度增加时,从流形上采样得到的希尔伯特细胞层(Hilbert Cellular Sheaf)的层拉普拉斯算子以概率收敛于原始连接拉普拉斯算子,这是对Belkin-Niyogi经典图拉普拉斯收敛结果在无限维丛上的推广;其次,证明离散化的HilbNets能收敛到连续架构并具有跨不同采样的一致性,从而为几何学习方法提供了理论保障和实际可行性。

链接: https://arxiv.org/abs/2605.06395
作者: Kartik Tandon,Julian Gould,Tanishq Bhatia,Francesca Dominici,Alejandro Ribeiro,Claudio Battiloro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 51 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Modern deep learning architectures increasingly contend with sophisticated signals that are natively infinite-dimensional, such as time series, probability distributions, or operators, and are defined over irregular domains. Yet, a unified learning theory for these settings has been lacking. To start addressing this gap, we introduce a novel convolutional learning framework for possibly infinite-dimensional signals supported on a manifold. Namely, we use the connection Laplacian associated with a Hilbert bundle as a convolutional operator, and we derive filters and neural networks, dubbed HilbNets. We make HilbNets and, more generally, the convolution operation, implementable via a two-stage sampling procedure. First, we show that sampling the manifold induces a Hilbert Cellular Sheaf, a generalized graph structure with Hilbert feature spaces and edge-wise coupling rules, and we prove that its sheaf Laplacian converges in probability to the underlying connection Laplacian as the sampling density increases. Notably, this result is a generalization to the infinite-dimensional bundle setting of the Belkin and Niyogi convergence result for the graph Laplacian to the manifold Laplacian, a theoretical cornerstone of geometric learning methods. Second, we discretize the signals and prove that the discretized (implementable) HilbNets converge to the underlying continuous architectures and are transferable across different samplings of the same bundle, providing consistency for learning. Finally, we validate our framework on synthetic and real-world tasks. Overall, our results broaden the scope of geometric learning as a whole by lifting classical Laplacian-based frameworks to settings where the signal at each point lives in its own Hilbert space.

[AI-31] Automated alignment is harder than you think

【速读】:该论文试图解决的问题是:在人工智能能力持续提升的背景下,利用AI代理(AI agents)自动化对齐研究(alignment research)以确保人工超级智能(ASI)安全的策略可能因系统性错误而产生误导性的安全评估,从而导致未对齐AI被意外部署。其核心风险源于对“难以监督的模糊任务”(fuzzy tasks)的自动化处理——这些任务缺乏明确的评估标准,且人类判断本身存在系统性偏差,使得AI代理生成的内容即使看似正确,也可能因错误聚合而形成过度自信的安全结论。解决方案的关键在于训练AI代理可靠地执行此类模糊任务,其中“泛化能力”(generalisation)和“可扩展监督”(scalable oversight)被视为最可行路径,但二者在自动化对齐场景下均面临前所未有的挑战,例如优化压力导致错误集中于人类难以察觉的领域、AI错误模式与人类不同、以及模型输出高度相关性等。

链接: https://arxiv.org/abs/2605.06390
作者: Aleksandr Bowkis,Marie Davidsen Buhl,Jacob Pfau,Geoffrey Irving
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments resulting in the unintentional deployment of misaligned AI. This could happen because alignment research involves many hard-to-supervise fuzzy tasks (tasks without clear evaluation criteria, for which human judgement is systematically flawed). Consequently, research outputs will contain systematic, undetected errors, and even correct outputs could be incorrectly aggregated into overconfident safety assessments. This problem is likely to be worse for automated alignment research than for human-generated alignment research for several reasons: 1) optimisation pressure means agent-generated mistakes are concentrated among those that human reviewers are least likely to catch; 2) agents are likely to produce errors that do not resemble human mistakes; 3) AI-generated alignment solutions may involve arguments humans cannot evaluate; and 4) shared weights, data and training processes may make AI outputs more correlated than human equivalents. Therefore, agents must be trained to reliably perform hard-to-supervise fuzzy tasks. Generalisation and scalable oversight are the leading candidates for achieving this but both face novel challenges in the context of automated alignment.

[AI-32] Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

【速读】:该论文旨在解决标准在线蒸馏(On-Policy Distillation, OPD)在训练学生模型时存在的三个结构性缺陷:高方差更新、零优势区域梯度消失以及纠正信号缺失导致的探索瓶颈。其解决方案的关键在于提出非对称在线蒸馏(Asymmetric On-Policy Distillation, AOPD),通过在非正优势区域以局部差异最小化替代无效的负向强化,同时保留正向强化机制,从而提升训练稳定性与策略多样性。实验表明,AOPD 在数学推理基准上显著优于标准OPD,且在序列工具使用适应中展现出更强的能力保持性。

链接: https://arxiv.org/abs/2605.06387
作者: Nan Jia,Haojin Yang,Xing Ma,Jiesong Lian,Shuailiang Zhang,Weipeng Zhang,Ke Zeng,Xunliang Cai,Zequn Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage weighted policy gradient suffers from three structural weaknesses, including high variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are absent. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.
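摘要中的非对称目标可以粗略示意为:正优势 token 保留优势加权的策略梯度项,非正优势 token 改为向教师分布做局部散度最小化,而非施加负向强化。散度取前向 KL、概率数值均为假设,仅用于说明结构:

```python
import math

def aopd_token_loss(advantage, student_probs, teacher_probs, token):
    if advantage > 0:
        # 正优势:类 REINFORCE 的优势加权负对数似然
        return -advantage * math.log(student_probs[token])
    # 非正优势:KL(teacher || student),把学生拉向教师,不做负向惩罚
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs))

teacher = [0.7, 0.2, 0.1]   # 教师在词表上的分布(假设)
student = [0.5, 0.3, 0.2]   # 学生分布(假设)

loss_pos = aopd_token_loss(+1.0, student, teacher, token=0)
loss_neg = aopd_token_loss(-1.0, student, teacher, token=0)
print(round(loss_pos, 3), round(loss_neg, 3))
```

两个分支分别对应"正向强化"与"局部模仿":后者在优势为零或为负的区域仍提供非零梯度,避免摘要所述的零优势区域梯度消失问题。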

[AI-33] MinMax Recurrent Neural Cascades

【速读】:该论文旨在解决传统循环神经网络(Recurrent Neural Networks, RNNs)中存在的梯度消失或爆炸问题,同时提升模型的表达能力与计算效率。其解决方案的关键在于引入Min-Max代数(MinMax algebra)构建一种新型的递归结构——MinMax递归神经级联(MinMax Recurrent Neural Cascades, MinMax RNCs),该结构通过非线性变换保持状态和激活值的有界性,并确保梯度在几乎所有点上存在且有界,尤其关键的是其状态梯度不会随时间距离衰减,从而避免了传统RNN中的梯度消失问题。这一设计使得MinMax RNCs不仅具备理论上的强大表达能力(如识别所有正则语言),还能实现并行高效计算(对数复杂度)和稳定的训练特性,实验也验证了其在合成任务和真实场景下均表现出优越性能。

链接: https://arxiv.org/abs/2605.06384
作者: Alessandro Ronca
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and most importantly it is not affected by vanishing or exploding gradient. We call MinMax Recurrent Neural Cascades (RNCs) the models obtained by cascading several layers of neurons that employ such recurrence. We show that MinMax RNCs enjoy many favourable theoretical properties. First, their formal expressivity includes all regular languages, arguably the maximal expressivity for a finite-memory system. Second, they can be evaluated in parallel with a runtime that is logarithmic in the input length given enough processors; and they can also be evaluated sequentially. Third, their state and activations are bounded uniformly for all input lengths. Fourth, at almost all points, their loss gradient exists and it is bounded. Fifth, they do not exhibit a vanishing state gradient: the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance between the two states. Finally, we find empirical evidence that the favourable theoretical properties of MinMax RNCs are matched by their practical capabilities: they are able to perfectly solve a number of synthetic tasks, showing superior performance compared to the considered state-of-the-art recurrent neural networks; also, we train a MinMax RNC of 127M parameters on next-token prediction, and the obtained model shows competitive performance for its size, providing evidence of the potential of MinMax RNCs on real-world tasks.
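用纯 Python 可以演示 Min-Max 递归的两条关键性质:状态始终被输入与权重的取值范围界定,且 max/min 沿被选中分支的"导数"恒为 1、不随时间衰减。下面的更新形式(max 作"加"、min 作"乘"的类热带代数仿射递归)是示意性假设,并非论文中 MinMax RNC 层的精确定义:

```python
def minmax_step(h, x, W):
    """h_t[i] = max( max_j min(W[i][j], h_{t-1}[j]), x_t[i] )(假设形式)"""
    n = len(h)
    rec = [max(min(W[i][j], h[j]) for j in range(n)) for i in range(n)]
    return [max(r, xi) for r, xi in zip(rec, x)]

W = [[0.8, 0.3], [0.2, 0.9]]
h0 = [0.0, 0.0]
xs = [[0.5, 1.0], [1.0, 0.1], [0.7, 0.6]] * 100   # 长度 300 的序列

lo = min(v for row in W + xs + [h0] for v in row)
hi = max(v for row in W + xs + [h0] for v in row)

h = h0
for x in xs:
    h = minmax_step(h, x, W)
    assert lo <= min(h) and max(h) <= hi   # 状态对任意长度一致有界
print(h)   # 300 步后状态仍停留在输入/权重的取值范围内
```

由于每次更新只是从候选值中"选取"而非累乘,状态对过去状态的梯度沿被选中路径为 1,这正是摘要中"不存在状态梯度消失"的直观来源。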

[AI-34] Rethinking Vacuity for OOD Detection in Evidential Deep Learning

【速读】:该论文旨在解决证据深度学习(Evidential Deep Learning, EDL)中基于空虚度(Vacuity,或不确定性质量,Uncertainty Mass, UM)的分布外(Out-of-Distribution, OOD)检测评估中存在的偏差问题。具体而言,当训练集(In-Distribution, ID)与测试集(OOD)类别数(class cardinality, K)不一致时,UM指标会因K的变化而产生人为的性能高估,导致AUROC和AUPR等评价指标失真,从而误导模型性能判断。解决方案的关键在于识别并量化这种由类别数差异引发的评估伪影(evaluation artefact),强调在EDL模型评估中必须确保ID与OOD数据具有相同的类别数,否则即使模型预测不变,也会出现显著的指标偏移(如AUROC最大偏移达0.360,AUPR达0.683)。该研究通过理论分析与实证验证,指出当前EDL评估范式需更严格控制类别的可比性,尤其在因果语言模型(causal language models)结合多选题问答(MCQA)数据集的应用场景下,应明确界定ID与OOD的定义边界。

链接: https://arxiv.org/abs/2605.06382
作者: Claire McNamara
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vacuity, or Uncertainty Mass (UM), is commonly used as a metric to evaluate Out-of-Distribution (OOD) detection in Evidential Deep Learning (EDL). It generally involves dividing the number of classes ( K ) by the total strength of belief ( S ) of the model’s predictions, where S is derived from summing the Dirichlet parameters. As such, UM is sensitive to the cardinality of K . In particular, it is unlikely in practice that there is a linear relationship between K and S as K and S increase due to the nature of EDL (suppressing incorrectly assigned evidence). As a result, when comparing In Distribution (ID) and OOD results, it is important that K_\mathrm{ID} and K_\mathrm{OOD} are equal; something that is not always ensured in practice. We provide an empirical demonstration of how results for AUROC and AUPR can substantially differ when class cardinality between ID and OOD differs by 1, with AUROC differing by as much as 0.318 and AUPR by 0.613 for standard EDL, and AUROC by 0.360 and AUPR by 0.683 for IB-EDL. More concretely, our findings isolate an evaluation artefact: when K differs between ID and OOD, AUROC/AUPR can be artificially inflated without any change in model predictions. We further discuss the evaluation of EDL over causal language models using Multiple-Choice Question-Answer (MCQA) datasets and argue for clearer definitions of ID and OOD in this context. Our primary contribution is an empirical and theoretical demonstration that vacuity-based OOD detection in EDL-fine-tuned LLMs is highly sensitive to uncontrolled differences in evaluated class cardinality.
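摘要所述的评估伪影可以用几行代码复现:由于 EDL 会抑制错误类别的证据,总证据强度在实践中并不随 K 线性增长;若固定总证据不变、仅改变类别数 K,空虚度 UM = K/S 就会系统性偏移。证据数值为假设:

```python
def vacuity(num_classes, total_evidence):
    # Dirichlet 参数 alpha = evidence + 1,故 S = 总证据 + K,UM = K / S
    S = total_evidence + num_classes
    return num_classes / S

um_k4 = vacuity(4, 8.0)   # K=4:UM = 4/12 ≈ 0.333
um_k5 = vacuity(5, 8.0)   # K=5:总证据完全相同,UM = 5/13 ≈ 0.385
print(um_k4, um_k5)
```

若 ID 侧取 K=4、OOD 侧取 K=5,即便模型给出的证据总量完全一致、预测毫无变化,OOD 一侧的 UM 也被人为抬高,基于 UM 的 AUROC/AUPR 随之虚高,这正是论文强调必须保证 K_ID = K_OOD 的原因。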

[AI-35] Debiased Multimodal Personality Understanding through Dual Causal Intervention

【速读】:该论文旨在解决多模态人格理解中因个体属性(如可观察的年龄和不可观测的心理状态)引发的主观偏差问题,此类偏差会导致模型学习到虚假关联,从而造成不公平的人格判断。解决方案的关键在于构建一个结构因果模型(Structural Causal Model, SCM),并提出一种双因果调整网络(Dual Causal Adjustment Network, DCAN),通过两个核心模块实现因果解耦:一是基于原型的后门调整因果学习(Back-door Adjustment Causal Learning, BACL)模块,用于阻断可观测人口统计因素引起的虚假相关性;二是前门调整因果学习(Front-door Adjustment Causal Learning, FACL)模块,通过中介字典干预来处理潜在且不可观测的偏倚。该方法实现了对人格表征的因果解耦,提升了公平性和预测准确性。

链接: https://arxiv.org/abs/2605.06371
作者: Yangfu Zhu (Capital Normal University),Zitong Han (Capital Normal University),Nianwen Ning (Henan University),Yuting Wei (University of International Relations),Yuandong Wang (Capital Normal University),Hang Feng (Capital Normal University),Zhenzhou Shao (Capital Normal University)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal personality understanding plays a critical role in human-centered artificial intelligence. Previous work mainly focuses on learning rich multimodal representations for video personality understanding. However, such methods often suffer from potential harm caused by subject bias (e.g., observable age and unobservable mental states), as subjects originate from diverse demographic backgrounds. Learning such spurious associations between multimodal features and traits may lead to unfair personality understanding. In this work, we construct a Structural Causal Model (SCM) to analyze the impact of these biases from a causal perspective, and propose a novel Dual Causal Adjustment Network (DCAN) to mitigate the interference of subject attributes on personality understanding. Specifically, we design a Back-door Adjustment Causal Learning (BACL) module to block spurious correlations from observable demographic factors via a prototype-based confounder dictionary, and subsequently apply a Front-door Adjustment Causal Learning (FACL) module to address latent and unobservable biases through a learned mediator dictionary intervention, thereby achieving causal disentanglement of representations for deconfounded reasoning. Importantly, we construct a Demographic-annotated Multimodal Student Personality (DMSP) dataset to support the analysis and discussion of fairness-related factors. Extensive experiments on the benchmark dataset CFI-V2 and our DMSP dataset demonstrate that DCAN consistently improves prediction accuracy, reaching 92.11% and 92.90%, respectively. Meanwhile, the improvements in the fairness metrics of equal opportunity and demographic parity are 6.57% and 7.97% on CFI-V2, and 15.38% and 20.06% on the DMSP dataset. Our code and DMSP dataset are available at this https URL
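其中后门调整的基本原理(非论文中 BACL 模块的具体实现)可以用一个离散小例子说明:对混杂变量 Z 按其边际分布 P(z) 而非条件分布 P(z|x) 加权,即 P(y|do(x)) = Σ_z P(y|x,z)·P(z)。下列数值均为假设:

```python
# 混杂结构 Z -> X, Z -> Y, X -> Y 上的后门调整(数值为假设)
p_z1 = 0.3                                   # P(Z=1)
p_x1_given_z = {0: 0.2, 1: 0.9}              # P(X=1 | Z=z)
p_y1_given_xz = {(1, 0): 0.7, (1, 1): 0.9}   # P(Y=1 | X=1, Z=z)

p_z = {0: 1.0 - p_z1, 1: p_z1}

# 后门调整:P(Y=1 | do(X=1)) = sum_z P(Y=1|X=1,z) * P(z)
p_do = sum(p_y1_given_xz[(1, z)] * p_z[z] for z in (0, 1))

# 观测条件分布:P(Y=1 | X=1) = sum_z P(Y=1|X=1,z) * P(z|X=1)
p_x1 = sum(p_x1_given_z[z] * p_z[z] for z in (0, 1))
p_z_given_x1 = {z: p_x1_given_z[z] * p_z[z] / p_x1 for z in (0, 1)}
p_obs = sum(p_y1_given_xz[(1, z)] * p_z_given_x1[z] for z in (0, 1))

print(round(p_do, 3), round(p_obs, 3))   # 0.76 0.832:混杂使观测关联虚高
```

观测关联(0.832)高于干预效应(0.76),正对应论文要阻断的那类虚假相关:若不做调整,模型会把混杂属性(如年龄)带来的关联误当作特征与人格特质间的因果关系。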

[AI-36] Flow Matching with Arbitrary Auxiliary Paths

【速读】:该论文旨在解决生成式建模中概率路径设计的灵活性与理论一致性问题,即如何在保持连续性方程和训练目标一致性的同时,引入更广泛的辅助变量分布以构建多样化的生成轨迹。解决方案的关键在于提出Flow Matching with Arbitrary Auxiliary Paths (AuxPath-FM)框架,通过将任意分布的辅助变量 η 引入概率路径构造 X_t = a(t)X_1 + b(t)X_0 + c(t)η,突破了传统条件流匹配仅限于高斯噪声的限制,从而实现了对不同先验分布(如均匀、拉普拉斯、Rademacher等)的支持,并保证了理论上的连续性和训练目标的一致性,为标签引导生成等特定任务提供了可定制的概率路径设计基础。

链接: https://arxiv.org/abs/2605.06364
作者: Xin Peng,Ang Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a new generative modeling framework, \textbfFlow Matching with Arbitrary Auxiliary Paths (AuxPath-FM), which generalizes conditional flow matching by incorporating an auxiliary variable drawn from an arbitrary distribution into the probability path. Unlike prior methods that restrict auxiliary components to Gaussian noise, AuxPath-FM allows the variable \eta to follow any distribution, producing trajectories of the form X_t = a(t)X_1 + b(t)X_0 + c(t)\eta . We theoretically demonstrate that this construction preserves the continuity equation and maintains a training objective consistent with the marginal formulation. This flexibility enables the design of diverse probability paths using various priors, including Gaussian, Uniform, Laplace, and discrete Rademacher distributions, each offering unique geometric properties for generative flows. Furthermore, our framework allows for specialized tasks such as label-guided generation by encoding structured semantic information into the auxiliary distribution. Overall, AuxPath-FM provides a principled and general foundation for probability path design, offering both theoretical generality and practical flexibility for diverse generative modeling tasks.
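可以用几行代码验证这类路径的端点性质。系数取 a(t)=t、b(t)=1-t、c(t)=t(1-t) 只是满足端点条件 c(0)=c(1)=0 的一种假设选择,并非论文指定的参数化;η 则可来自任意分布,这里演示均匀与拉普拉斯两种先验:

```python
import math, random

def sample_path_point(t, x0, x1, eta):
    a, b, c = t, 1.0 - t, t * (1.0 - t)   # 假设的系数,保证 c(0)=c(1)=0
    return a * x1 + b * x0 + c * eta

def laplace(scale=1.0):
    """逆 CDF 法采样 Laplace(0, scale)。"""
    u = random.random() - 0.5
    sgn = 1.0 if u >= 0 else -1.0
    return -scale * sgn * math.log(1.0 - 2.0 * abs(u))

random.seed(0)
x0, x1 = -1.0, 2.0
for eta in (random.uniform(-1.0, 1.0), laplace()):
    assert sample_path_point(0.0, x0, x1, eta) == x0   # t=0 时恰为 X0
    assert sample_path_point(1.0, x0, x1, eta) == x1   # t=1 时恰为 X1
print(sample_path_point(0.5, x0, x1, 0.0))   # 0.5:无辅助噪声时的中点
```

无论 η 取何种分布,端点都精确落在 X0 与 X1 上,中间时刻的轨迹几何则由 η 的先验塑形,这正是该框架"任意辅助路径"的含义。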

[AI-37] Topological Signatures of Grokking

【速读】:该论文旨在揭示神经网络在训练过程中如何内部化任务的潜在结构,特别是针对“grokking”现象——即模型在训练后期突然实现泛化能力的非连续转变。其核心问题是:如何从几何与拓扑角度量化并解释这种结构化的表征学习过程。解决方案的关键在于引入持久同调(persistent homology)分析方法,通过对嵌入矩阵生成的点云数据进行拓扑特征提取,发现了一种清晰且一致的拓扑签名:第一同调群(H₁)的最大和总持久性显著增加,表现为一个主导的长寿命拓扑特征及日益结构化的次级特征,这反映了任务本身的循环结构。相比传统的频谱和几何诊断手段(如傅里叶分析和局部内在维数),持久同调提供了统一的多尺度几何-拓扑表征框架,能够同时捕捉局部与全局结构,并通过消融实验验证这些拓扑转变与泛化能力密切相关,而非单纯的记忆行为。

链接: https://arxiv.org/abs/2605.06352
作者: Yifan Tang,Qiquan Wang,Inés García-Redondo,Anthea Monod
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 19 pages, 14 figures, 2 tables

点击查看摘要

Abstract:We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology ( H_1 ). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics – specifically, Fourier analysis and local intrinsic dimension – persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure. Ablations across data regimes and control settings show that these topological transitions are tied to generalization rather than memorization. Our results suggest that persistent homology offers a principled and interpretable framework for analyzing how neural networks internalize latent structure during training.

[AI-38] Prediction and Empowerment: A Theory of Agency through Bridge Interfaces

【速读】:该论文旨在解决在部分可观测环境(Partially Observable Markov Decision Process, POMDP)下,智能体如何实现有效代理行为(agency)的问题,尤其关注生成式AI在确定性物理或模拟世界中因初始条件不确定性、固定规律位和外生噪声而产生的表观随机性。其核心挑战在于厘清预测(prediction)、压缩(compression)与赋能(empowerment)三者之间的分离机制。解决方案的关键在于将感知与执行建模为桥接接口,该接口由智能体可控参数与环境可控信道状态共同构成,并通过关于隐变量微观状态的先验分布及多对一观测粗化,诱导出一个确定性的POMDP框架;在此基础上证明:完美预测可通过识别目标家族相关的隐藏商集(hidden quotient)或通过 overwrite 控制使未来动作决定目标来实现,但仅高赋能不足以达成此目标;当接口可细化且具备足够记忆时,以动作条件化的观测压缩能降低对隐含商集后验不确定性的认知,若细化需调节世界侧信道条件,则会形成目标条件化的接口赋能(target-conditioned interface empowerment)。这一理论框架为现代AI系统的设计提供了原则性指导:应明确区分隐藏状态识别、接口细化、任务相关可控性与单纯覆盖控制或干扰控制等目标维度。

链接: https://arxiv.org/abs/2605.06346
作者: Richard Csaky
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is a working draft: feedback and criticism is most welcome

点击查看摘要

Abstract:We study agency under partial observability in deterministic physical or simulated worlds, where apparent randomness arises from uncertainty over initial conditions, fixed law bits, and unrolled exogenous noise. We model sensing and actuation as bridge interfaces split between agent-controlled parameters and environment-controlled channel state, inducing a deterministic POMDP through a prior over latent microstates and many-to-one observation coarsening. Within this framework, we prove a separation between prediction, compression, and empowerment. Perfect prediction can be achieved either by identifying the hidden quotient relevant to the target family or by overwrite control that makes the future target action-determined; high empowerment alone is insufficient. Under refinable interfaces and sufficient memory, action-conditioned observation-compression progress reduces posterior uncertainty about the latent quotient, and when refinement requires steering world-side channel conditions, this creates target-conditioned interface empowerment. A bit-string specialization with a conserved information budget makes the resulting tradeoff explicit: prediction by identification requires internal capacity at least the relevant latent entropy, whereas overwrite control requires terminal action capacity over the controlled quotient. For modern AI agents, the results suggest a design principle rather than a theorem of inevitability: objectives should distinguish hidden-state identification, interface refinement, task-relevant controllability, and mere overwrite or distractor control. Human–AI alignment is partly an interface-design problem, where the relevant bridge is between human intent, agent internal state, external tools, and world-side channel conditions. This is a working draft: feedback and criticism is most welcome.
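摘要中“多对一观测粗化诱导确定性 POMDP”这一构造可用一个玩具示例说明(以下动力学、观测函数与先验均为假设性设定,并非论文的正式定义):智能体在确定性微观状态动力学上,先按观测过滤其后验信念,再按动力学推进。

```python
def step(state):
    """确定性微观状态动力学。"""
    return (state + 1) % 4

def observe(state):
    """多对一观测粗化: 只暴露奇偶性。"""
    return state % 2

def belief_update(belief, obs):
    """先按观测过滤信念, 再按确定性动力学推进。"""
    consistent = [s for s in belief if observe(s) == obs]
    return [step(s) for s in consistent]

belief = [0, 1, 2, 3]                    # 对隐变量微观状态的均匀先验
belief = belief_update(belief, obs=0)    # 收到一次偶校验观测
```

表观随机性完全来自先验的不确定性:世界本身是确定的,观测粗化使智能体只能识别到隐藏商集(此处即奇偶类)。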

[AI-39] More Than Can Be Said: A Benchmark and Framework for Pre-Question Scientific Ideation

【速读】:该论文旨在解决当前AI研究代理(AI research agents)在辅助科研时存在的局限性——即它们通常依赖明确且可操作的初始输入(如已清晰表述的研究问题),而忽视了人类研究初期常面临的“隐性摩擦”(tacit friction),即在形成具体问题前的模糊认知状态。解决方案的关键在于提出一个名为InciteResearch的多智能体框架,其核心机制包括:(1)以具体摩擦点为锚,从模糊甚至与领域无关的输入中提炼出结构化的五维研究人员画像状态;(2)在强制执行7阶段因果推导追踪的同时最大化可行性与新颖性的乘积,从而暴露并挑战隐藏假设;(3)检验所提方法是否是重构后洞察的必然推论。该框架实现了从隐性理解到显性、可检验和可行动的研究构想转化,显著提升了生成提案的新颖性和影响力(在TF-Bench基准上从3.671/3.806提升至4.250/4.397)。


链接: https://arxiv.org/abs/2605.06345
作者: Jie Yu,Song Qiu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 2 figures; Code is available at this https URL

点击查看摘要

Abstract:AI research agents have shown strong potential in automating literature search and manuscript refinement, yet most assume a clear and actionable initial input, operating only after a research question has been made explicit. In contrast, human research often begins with tacit friction, a sense of misalignment before a question can be formed. We introduce InciteResearch, a multi-agent framework designed to make a researcher’s implicit understanding explicit, inspectable, and actionable. InciteResearch decomposes the logical chain of Socratic questioning and distributes it across the entire pipeline so that it: (1) elicits a structured five-dimensional researcher profile state anchored by specific friction points from vague, even domain-unrelated inputs; (2) violates hidden assumptions by maximizing the feasibility-novelty product while enforcing a 7-stage causal derivation trace; and (3) checks whether the proposed method is a necessary consequence of the reframed insight. We further introduce TF-Bench, the first benchmark for tacit-to-explicit research assistance that distinguishes domain-related from domain-unrelated inspirations across four scientific modes. On TF-Bench, InciteResearch achieves leapfrogging gains over a prompt-based baseline (novelty/impact from 3.671/3.806 to 4.250/4.397), shifting generated proposals from recombination to architectural insight. Our work demonstrates that AI can serve as an extension of thinking itself, rather than merely automating downstream execution.

[AI-40] Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models

【速读】:该论文旨在解决预训练数据分布对表格基础模型(Tabular Foundation Models)下游性能影响机制不明确的问题,特别是不同来源的预训练语料——人工精选数据、网络爬取数据与合成数据——在分布上的差异及其对模型泛化能力的影响。其解决方案的关键在于系统性地比较三类典型语料:T4(网络爬取)、TabFM(Kaggle精选)和TabICL(合成数据),通过聚合表级、列级特征及相关性统计,并结合判别器AUC和k-NN覆盖率指标进行量化分析。研究发现,合成数据(TabICL)占据真实表格空间的狭窄区域,且无法通过优化超过8.6万种超参数配置弥补这一分布差距;而精选与爬取数据在分布层面具有高度可互换性;更意外的是,这种分布偏差并未在基于特征相似性的度量或TabICL内部表示中体现为显著性能下降,表明覆盖真实数据分布并非决定模型泛化能力的核心因素。

链接: https://arxiv.org/abs/2605.06343
作者: Alex O. Davies,Telmo de Menezes e Silva Filho,Nirav Ajmeri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of pre-training data to model performance, little is known about how these corpora relate to one another in distribution, and the impact this has on downstream performance. In this work we take three canonical, archetypal datasets used to train tabular foundation models; the T4 dataset represents web-scraped corpora, the TabFM dataset curated tables from Kaggle, and the TabICL dataset as the only well-used synthetic prior with publicly available parameters. We characterise each corpus using aggregate features over whole tables, columns and correlations, and compare them using discriminator AUCs and k-NN coverage metrics. We find that the TabICL synthetic prior occupies a narrow region of the space of real tables, that this mismatch cannot be closed by optimising prior hyper-parameters across more than 86 thousand configurations, and that curated and web-scraped corpora are broadly interchangeable on a distributional level in feature space. Surprisingly, the distributional gap between synthetic pre-training data and real tables has no clearly detectable effect on performance under either feature-based proximity measures or TabICL’s own internal representations, suggesting that coverage of the real-data distribution is not the primary driver of TabICL’s generalisation.
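摘要用判别器AUC衡量两类语料在特征空间中的可分性。下面是该思路的一个极简示意(Mann-Whitney 形式的AUC,特征取值为虚构数据):

```python
def auc(real_scores, synth_scores):
    """Mann-Whitney 形式的判别器AUC: P(合成样本得分 > 真实样本得分)。"""
    wins = 0.0
    for s in synth_scores:
        for r in real_scores:
            if s > r:
                wins += 1.0
            elif s == r:
                wins += 0.5          # 平局计半分
    return wins / (len(real_scores) * len(synth_scores))

real = [0.2, 0.4, 0.5, 0.7]      # 某一表级特征在真实语料上的取值(虚构)
synth = [0.6, 0.8, 0.9]          # 同一特征在合成语料上的取值(虚构, 存在偏移)
score = auc(real, synth)
```

AUC 接近 0.5 表示两类语料在该特征上不可分(分布可互换),接近 1 表示存在明显的分布差距,对应摘要中合成先验“占据真实表格空间狭窄区域”的发现。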

[AI-41] CoupleEvo: Evolving Heuristics for Coupled Optimization Problems Using Large Language Models GECCO2026

【速读】:该论文旨在解决多耦合优化问题(coupled optimization problems)中现有大语言模型(Large Language Model, LLM)驱动的自动启发式设计方法仅适用于单问题场景、难以有效协调多个紧密耦合子问题的局限性。其解决方案的关键在于提出CoupleEvo框架,通过三种进化协调策略——顺序策略(sequential strategy)、迭代策略(iterative strategy)和集成策略(integrated strategy)——来协同演化各子问题的启发式规则,从而提升整体优化性能。实验表明,基于分解的策略(顺序与迭代)在收敛稳定性与解质量上表现更优,凸显了跨子问题进化搜索协调的重要性,并验证了LLM驱动启发式设计在复杂耦合优化问题中的潜力。

链接: https://arxiv.org/abs/2605.06341
作者: Thomas Bömer,Bastian Amberg,Max Disselnmeyer,Anne Meyer
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: accepted at GECCO 2026, San Jose, Costa Rica, Workshop

点击查看摘要

Abstract:Many real-world optimization problems consist of multiple tightly coupled subproblems whose solutions must be coordinated to achieve high overall performance. However, existing large language model driven automated heuristic design approaches are limited to single-problem settings. In this paper, we propose CoupleEvo. CoupleEvo proposes three evolutionary coordination strategies to evolve heuristics for coupled optimization problems: the sequential strategy evolves heuristics for one subproblem after the other; the iterative strategy alternates the evolution of heuristics for different subproblems over successive generations; and the integrated strategy evolves heuristics for all problems simultaneously. The approach is evaluated on two representative coupled optimization problems. Experimental results show that decomposition-based strategies (sequential and iterative) provide more stable convergence and higher solution quality, while the integrated evolution strategy suffers from increased search complexity and variability. These findings highlight the importance of coordinating evolutionary search across interdependent subproblems and demonstrate the potential of LLM-driven heuristic design for complex coupled optimization problems. The code is available: this https URL.
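摘要中的迭代协调策略(iterative strategy)可以用如下玩具示例说明(目标函数、变异方式与参数均为假设性设定,并非论文实现):每一代交替演化其中一个子问题的启发式参数,并按耦合的联合适应度决定是否接受。

```python
import random

def joint_fitness(h1, h2):
    """耦合目标: 两个子问题的启发式参数需共同逼近 (3, 5)。"""
    return -((h1 - 3.0) ** 2 + (h2 - 5.0) ** 2)

def evolve_step(current, partner, first_slot, rng):
    """变异一个子问题的参数, 仅当联合适应度提升时接受。"""
    candidate = current + rng.uniform(-0.5, 0.5)
    if first_slot:
        improved = joint_fitness(candidate, partner) > joint_fitness(current, partner)
    else:
        improved = joint_fitness(partner, candidate) > joint_fitness(partner, current)
    return candidate if improved else current

rng = random.Random(0)
h1, h2 = 0.0, 0.0
for gen in range(200):          # 迭代策略: 逐代交替演化两个子问题
    if gen % 2 == 0:
        h1 = evolve_step(h1, h2, True, rng)
    else:
        h2 = evolve_step(h2, h1, False, rng)
```

顺序策略相当于先把 h1 演化到收敛再演化 h2;集成策略则同时变异两者,搜索空间随之增大,这正是摘要中其方差更大的直观来源。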

[AI-42] A Regime Theory of Controller Class Selection for LLM Action Decisions

【速读】:该论文旨在解决部署阶段语言模型(Language Model, LM)和视觉-语言模型(Vision-Language Model, VLM)在面对每个输入时如何动态决策的问题,即选择直接回答、检索证据、转交给更强模型或放弃响应(abstain)。传统观点认为更高的实例级表达能力(per-input expressivity)总是有益的,但作者指出在有限样本下并非如此:不同基准测试偏好不同的控制器类别,这源于实例级不确定性信号在分布依赖尺度上的耗尽。解决方案的关键在于构建一个由四类控制器组成的嵌套格(nested lattice)——固定动作、分区路由器(partition router)、实例级控制器(instance-level controller)和先验门控控制器(prior-gated controller),并提出一种基于三个可估计数据瓶颈的“制度理论”(regime theory),用于指导控制器类别的选择:一是相较于最优固定动作所能提升的空间,二是是否存在足够样本使实例级控制器做出可靠决策,三是当实例级信号不可靠时粗粒度分区路由器能恢复多少性能。该理论给出了一个伯恩斯坦紧致阈值(Bernstein-tight threshold),并具有信息论下界匹配性,且严格嵌套交叉验证能保证选出近最优控制器类别。

链接: https://arxiv.org/abs/2605.06339
作者: Zhaoyang Jiang,Zhizhong Fu,Yunsoo Kim,Jiacong Mi,Zicheng Li,Xuanqi Peng,Honghan Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deployed language and vision-language models must decide, on each input, whether to answer directly, retrieve evidence, defer to a stronger model, or abstain. Contrary to the common monotonicity intuition, greater per-input expressivity is not uniformly beneficial in finite samples: under identical strict cross-validation, different benchmarks prefer different controller classes. This reflects a finite-sample limitation of instance-level uncertainty signals, which can be exhausted at a distribution-dependent scale. We organize controllers into a nested lattice of four classes: fixed actions, partition routers, instance-level controllers, and prior-gated controllers, ordered by complexity. We prove a regime theory that turns three data-estimable bottlenecks into a class choice: how much improvement is possible beyond the best fixed action, whether there are enough samples for instance-level controllers to make reliable decisions, and how much improvement a coarse partition router can recover when instance-level signal is unreliable. The resulting Bernstein-tight threshold has a matching information-theoretic lower bound, and strict nested cross-validation provably selects a near-best class. Across SMS-Spam, HallusionBench, A-OKVQA, and FOLIO, the predicted class matches the empirical winner; the prior-gated controller wins on TextVQA when OCR tokens supply a label-free prediction-time prior. Code is available at this https URL.
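摘要中控制器格的两个最简单层级(固定动作与分区路由器)可以用如下玩具示例对比(动作集合与收益数据均为虚构):分区路由器在每个粗粒度分桶内分别选取最优动作,因而其可达收益不低于任何固定动作。

```python
def best_fixed_action(data, actions):
    """在整个分布上收益最高的单一固定动作。"""
    return max(actions, key=lambda a: sum(rewards[a] for _, rewards in data))

def partition_router(data, actions):
    """在每个粗粒度分桶内分别选取最优动作。"""
    policy = {}
    for bucket in {b for b, _ in data}:
        rows = [rewards for b, rewards in data if b == bucket]
        policy[bucket] = max(actions, key=lambda a: sum(r[a] for r in rows))
    return policy

actions = ["answer", "retrieve", "abstain"]
data = [("easy", {"answer": 1.0, "retrieve": 0.6, "abstain": 0.0}),
        ("hard", {"answer": 0.1, "retrieve": 0.7, "abstain": 0.3})]

fixed = best_fixed_action(data, actions)
routed = partition_router(data, actions)
```

摘要的制度理论回答的正是:在有限样本下,何时值得从固定动作上移到分区路由器乃至实例级控制器,何时实例级不确定性信号已被耗尽、应退回粗粒度类别。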

[AI-43] Fine-Tuning Small Language Models for Solution-Oriented Windows Event Log Analysis

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在事件日志(event log)分析中因计算资源消耗高、依赖云基础设施及安全风险而难以实际部署的问题,同时弥补现有方法仅能识别问题而无法提供可操作修复方案的不足。其解决方案的关键在于:利用高性能LLM生成大规模合成Windows事件日志数据集,并引入轻量级语言模型(Small Language Models, SLMs)通过LoRA(Low-Rank Adaptation)参数高效微调技术进行任务定制化训练,从而实现本地化部署下对事件问题的精准识别与有效修复建议生成,且在性能上优于直接使用LLMs,同时显著降低计算开销。

链接: https://arxiv.org/abs/2605.06330
作者: Siraaj Akhtar,Saad Khan,Simon Parkinson
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 27 pages, 14 figures, 5 tables

点击查看摘要

Abstract:Large language models (LLMs) have shown promise for event log analysis, but their high computational requirements, reliance on cloud infrastructure, and security concerns limit practical deployment. In addition, most existing approaches focus only on the identification of the problem and do not provide actionable remediation. Small language models (SLMs) present a light-weight alternative that can be fine-tuned for a specific purpose and hosted locally. This paper investigates whether SLMs, when fine-tuned for a specific task, can serve as a practical alternative for event log analysis while also generating solutions. We first create a large-scale synthetic Windows event log dataset that contains remediation actions using a high-performing LLM. We then fine-tune multiple SLMs and LLMs using the LoRA parameter-efficient fine-tuning technique and evaluate their performance by comparing with expert assessment. The results show that the dataset accurately reflects real-world scenarios and that fine-tuned SLMs consistently outperform LLMs in identifying issues and providing relevant remediation, while requiring fewer computational resources.
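摘要提到的LoRA参数高效微调,其核心是把冻结权重 W 适配为 W + (alpha/r)·B·A,只训练低秩因子 A 与 B。下面是一个自包含的极简示意(矩阵形状与数值均为玩具设定):

```python
def matmul(X, Y):
    """朴素矩阵乘法, X: m*k, Y: k*n。"""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_weight(W, A, B, alpha, r):
    """LoRA 有效权重: W + (alpha / r) * B @ A, 仅 A、B 参与训练。"""
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]    # 冻结的 2x2 基座权重(玩具)
B = [[1.0], [0.0]]              # 2x1 低秩因子
A = [[0.0, 2.0]]                # 1x2 低秩因子, B@A 秩为 1
W_eff = lora_weight(W, A, B, alpha=2.0, r=1)
```

可训练参数量由 m·n 降为 r·(m+n),这正是 SLM 可在本地低开销微调的原因。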

[AI-44] Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)预训练中优化器效率与内存占用之间的权衡问题,特别是针对KL-Shampoo这类利用梯度矩阵结构的预条件优化方法在实际应用中的性能瓶颈。其核心问题是:如何在保持预条件效果的同时降低计算复杂度并提升训练稳定性。解决方案的关键在于对KL-Shampoo中Kronecker因子的谱结构进行建模——发现其具有“尖峰-平坦”(spike-and-flat)特征(即少数主导特征值后接近均匀分布的尾部),并据此提出Pro-KLShampoo:将其中一个Kronecker因子限制在一个参数化子空间内,其中包含完整的r维谱结构和剩余n−r维方向上的单一共享特征值,同时在这些低秩方向上引入正交化处理;这一设计不仅保留了原方法的代数形式,还显著提升了训练效率,在多个模型规模下均实现了更低的验证损失、更少的GPU显存占用及更快的收敛速度。

链接: https://arxiv.org/abs/2605.06316
作者: Ruotong Sun,Ermin Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optimizers that exploit the matrix structure of gradients are central to modern LLM pre-training, with two distinct frontiers: explicit Kronecker-factored preconditioning – most recently KL-Shampoo, which estimates the preconditioner via KL divergence minimization – and orthogonalization of the gradient momentum, exemplified by Muon and analyzed as steepest descent under the spectral norm. The two routes are typically developed in isolation. We make a structural observation about KL-Shampoo’s Kronecker preconditioners: their eigenvalue spectra exhibit a spike-and-flat shape – a few dominant eigenvalues followed by an approximately uniform tail – across layers and training stages, holding exactly under a rank-\rho signal-plus-noise gradient model. We exploit this structure by restricting one of KL-Shampoo’s Kronecker factors to a parametric family aligned with the spike-and-flat shape: full spectral structure on a tracked r-dimensional subspace, single shared eigenvalue across the remaining n-r directions. On these directions, we apply orthogonalization. An identity shows that this orthogonalization recovers the algebraic form of full KL-Shampoo’s preconditioner. On four pre-training scales (GPT-2 124M / 350M, LLaMA 134M / 450M), Pro-KLShampoo consistently outperforms KL-Shampoo at every subspace rank we test in validation loss, peak per-GPU memory, and wallclock time to reach each loss level.
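摘要描述的“尖峰-平坦”参数族可以用如下示意实现(特征值谱为虚构;投影规则是我们的简化版:保留前 r 个特征值,尾部替换为其均值):

```python
def spike_and_flat(spectrum, r):
    """把特征值谱投影到尖峰-平坦族: 前 r 个保留, 尾部取共享均值。"""
    spectrum = sorted(spectrum, reverse=True)
    head, tail = spectrum[:r], spectrum[r:]
    shared = sum(tail) / len(tail) if tail else 0.0
    return head + [shared] * len(tail)

eigs = [10.0, 4.0, 1.1, 0.9, 1.0]   # 虚构谱: 两个主特征值 + 近似均匀的尾部
projected = spike_and_flat(eigs, r=2)
```

受限族只需追踪 r 维子空间的完整谱结构外加一个共享尾部值,剩余 n-r 个方向上的预条件退化为各向同性缩放,这正是在这些方向上可改用正交化的前提。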

[AI-45] Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

【速读】:该论文旨在解决链式思维(Chain-of-Thought, CoT)推理中缺乏可靠置信度估计的问题,从而实现通过纯文本API的安全部署。现有主流黑盒基线方法——基于K样本的自一致性(self-consistency)——存在计算复杂度线性增长且忽略推理轨迹几何结构的局限性。其解决方案的关键在于提出一种全新的黑盒轨迹置信度评分机制:将CoT视为滑动窗口轨迹,并利用单参数softmax测量其收敛到外部答案锚点的程度,无需 logits、隐藏状态或监督校准器。该方法在多个基准(如MedQA-USMLE、GPQA Diamond、MMLU-Pro)和推理模型(Gemini 3.1 Pro、Claude Sonnet 4.6)上均优于自一致性(K=8),在K=4时即实现帕累托改进(中位AUC提升0.075),并揭示了轨迹几何信息(G)、覆盖率(C)与语义化置信表达(V)三者独立且互补的信号贡献机制。

链接: https://arxiv.org/abs/2605.06308
作者: Marc Boubnovski Martell,Josefa Lia Stoisser,Kaspar Märtens,Jialin Yu,Robert Kitchen,Philip Torr,Jesper Ferkinghoff-Borg
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry of the trace. We propose a black-box trajectory-confidence score: we embed a CoT as a sliding-window trajectory and measure its convergence to external answer anchors with a one-parameter softmax. The method needs no logits, hidden states, or supervised calibrators. Across six (benchmark, reasoner) settings on MedQA-USMLE, GPQA Diamond, and MMLU-Pro with Gemini 3.1 Pro and Claude Sonnet 4.6, fusing this score with coverage and verbalized-confidence channels at K=4 yields Pareto improvements over self-consistency at K=8 in 6/6 settings (median AUC 0.78 vs 0.71, ΔAUC=+0.075). A fixed-pick control (+0.060) and E5 cross-embedder replication rule out answer switching and single-vendor artifacts. Geometry peaks in the penultimate window across benchmarks and reasoners, and inverts at the terminal window on GPQA Diamond. Three unscaffolded regimes separate black-box confidence into a judge-mediated Coverage prior (C), within-trace Geometry (G), and a conditional Verbalization channel (V). Across 18 benchmark x reasoner x proposer settings, C and G provide independent signal in 18/18 and 16/18, while V contributes residual signal in 6/18. Swapping the judge from GPT-5-mini to Claude Sonnet 4.6 leaves G-only AUC unchanged (|Δ|=0.013) and shifts C-only AUC by at most ±0.02 (κ=0.82). Fusion beats the best single channel in 17/18 settings (median AUC 0.78, max 0.92).
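摘要中的轨迹置信度评分思路可以概括为:取滑动窗口轨迹的末端嵌入,度量其到各候选答案锚点的距离,再经单参数 softmax 得到置信度。下面是一个极简示意(二维嵌入、锚点坐标与 beta 取值均为虚构设定):

```python
import math

def softmax_confidence(window, anchors, beta):
    """对窗口到各锚点的负距离做单参数softmax, 返回各锚点的置信概率。"""
    dists = [math.dist(window, a) for a in anchors]
    weights = [math.exp(-beta * d) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

trajectory = [(0.0, 0.0), (0.5, 0.2), (0.9, 0.1)]   # 滑动窗口的玩具二维嵌入
anchors = [(1.0, 0.0), (-1.0, 0.0)]                 # 两个候选答案锚点
probs = softmax_confidence(trajectory[-1], anchors, beta=2.0)
```

整个评分只需文本嵌入,不依赖 logits 或隐藏状态,且单条轨迹即可计算,开销远低于采样 K 次的自一致性。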

[AI-46] Attributions All the Way Down? The Metagame of Interpretability

【速读】:该论文旨在解决模型解释中第二阶交互效应的量化问题,即如何刻画特征间在解释层面的相互影响关系。传统第一阶归因方法(如Shapley值)仅提供单个特征对模型输出的贡献,忽略了特征之间在解释过程中的协同或竞争作用。论文提出的解决方案核心是引入“元博弈”(metagame)框架,将归因方法本身视为一个合作博弈,并计算其Shapley值以得到元归因(meta-attribution),从而量化特征j对特征i归因的定向影响(记为 \varphi_{j \to i}(f))。理论证明表明,归因可逐层分解为元归因,且该方法扩展了现有的交互指数,使其具备方向性;实证部分验证了该方法在语言模型、视觉-语言编码器及多模态扩散Transformer等场景下的有效性。

链接: https://arxiv.org/abs/2605.06295
作者: Hubert Baniecki,Przemyslaw Biecek,Fabian Fumagalli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution \phi(f) explaining a model f, we measure the directional influence of feature j on the attribution of feature i, denoted as meta-attribution \varphi_{j \to i}(f), by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
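元博弈把归因方法本身当作合作博弈并计算其 Shapley 值。下面用排列枚举给出小规模博弈的精确 Shapley 值计算(玩具博弈 v 为虚构;在元博弈中 v 应换成归因方法本身):

```python
import itertools
import math

def shapley(players, v):
    """按排列枚举计算精确Shapley值: 各排列下边际贡献的平均。"""
    phi = {p: 0.0 for p in players}
    for order in itertools.permutations(players):
        coalition = []
        for p in order:
            before = v(frozenset(coalition))
            coalition.append(p)
            phi[p] += v(frozenset(coalition)) - before
    n_fact = math.factorial(len(players))
    return {p: total / n_fact for p, total in phi.items()}

def v(S):
    """玩具博弈: 特征1单独价值1, 特征1与2合作价值3。"""
    if S == frozenset({1, 2}):
        return 3.0
    return 1.0 if 1 in S else 0.0

phi = shapley([1, 2], v)
```

排列枚举的复杂度为 n!,仅适用于小规模示意;Shapley 值满足效率公理,各特征归因之和等于大联盟的价值。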

[AI-47] Data Language Models: A New Foundation Model Class for Tabular Data

【速读】:该论文旨在解决当前人工智能系统中缺乏对表格数据(tabular data)原生理解的问题。现有方法如梯度提升树或最新的表格基础模型均需依赖预处理管道才能处理表格数据,无法像语言模型理解句子、视觉模型理解图像那样直接处理原始单元格值。解决方案的关键在于提出数据语言模型(Data Language Model, DLM),这是一种能够直接从原始单元格值中学习并理解表格结构的新型基础模型。DLM作为表格数据的原生理解层,消除了传统AI系统与原始数据之间的预处理障碍,其首个实现Schema-1在行级预测任务上超越了梯度提升集成、AutoML堆栈及现有表格基础模型,并在缺失值重建任务中优于经典统计方法和前沿大语言模型,验证了对数据分布几何结构的内在理解比外部世界知识更适用于表格数据建模。

链接: https://arxiv.org/abs/2605.06290
作者: Eda Erol,Giuliano Pezzoli,Ozer Cem Kelahmet
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand between raw data and every AI system that consumes it. We present Schema-1, the first DLM: a 140M parameter model trained on more than 2.3M synthetic and real-world tabular datasets. Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. On missing value reconstruction it achieves lower reconstruction error than all classical statistical methods and frontier large language models on mean performance across conditions, establishing that structural understanding of a dataset’s own distributional geometry is more useful for imputation than world knowledge encoded in language. It identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform. It is the native tabular understanding layer that has been missing from the AI stack.

[AI-48] Correct Code, Vulnerable Dependencies: A Large-Scale Measurement Study of LLM-Specified Library Versions

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成Python代码时对第三方库(Third-Party Library, TPL)版本选择不当所带来的安全与兼容性风险问题。现有研究未系统评估LLM生成代码中版本标识符的合理性,而实际应用中这些版本可能包含已知漏洞(CVE)或导致安装失败。论文通过构建PinTrace基准测试集对10个主流LLM进行大规模实证分析,发现模型在直接提示下指定版本的概率为26.83%–95.18%,其中36.70%–55.70%的任务引入至少一个已知CVE,且多数为高危或严重级别;更关键的是,这些漏洞大多在模型知识截止前已公开,表明存在系统性偏差而非随机错误。解决方案的核心在于:引入外部锚定的版本约束机制,可显著降低漏洞暴露率和兼容性失败率,从而将版本选择从“被忽视的风险面”转变为可管理的开发环节。

链接: https://arxiv.org/abs/2605.06279
作者: Chengjie Wang,Jingzheng Wu,Xiang Ling,Tianyue Luo,Chen Zhao
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 35 pages, 8 figures

点击查看摘要

Abstract:Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk in LLM-generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack Overflow programming tasks. LLMs tend to specify version identifiers when directly prompted at 26.83%-95.18%, while down to 6.45%-59.19% in creating a manifest file directly. Among the specified versions, 36.70%-55.70% of tasks contain at least one known CVE, and 62.75%-74.51% of them carry Critical or High severity ratings. In 72.27%-91.37% of cases, the associated CVEs were publicly disclosed before the model’s knowledge cutoff. The statistics show all models converge on the same small set of risky release versions, indicating a systemic bias rather than isolated model error. Static compatibility rates range from 19.70% to 63.20%, with installation failure as the dominant cause. The dynamic test cases confirm the pattern by 6.49%-48.62% pass rates. Further experiments confirm that these failures are attributable to version selection rather than code quality, and that externally anchored version constraints substantially reduce both vulnerability exposure and compatibility failures. Our findings reveal LLM version selection as a first-class, previously overlooked risk surface in LLM-based development. We disclosed these findings to the community of the evaluated models, and several confirmed the issue. All the code and dataset have been released for open science at this https URL. 
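该研究在大规模层面所做的版本风险判定,可以抽象为“固定版本是否落入已知漏洞区间”的检查。下面是一个极简示意(包名与漏洞区间均为虚构,并非真实 CVE 数据):

```python
def parse(version):
    """把 "1.4.2" 形式的版本号解析为可比较的整数元组。"""
    return tuple(int(x) for x in version.split("."))

def is_vulnerable(package, version, advisories):
    """若版本落入任一 [introduced, fixed) 公告区间则判为有漏洞。"""
    v = parse(version)
    return any(parse(lo) <= v < parse(fixed)
               for pkg, lo, fixed in advisories if pkg == package)

advisories = [("examplelib", "1.0.0", "1.4.2")]    # 虚构公告: 1.4.2 已修复
vuln_at_fixed = is_vulnerable("examplelib", "1.4.2", advisories)
vuln_below_fix = is_vulnerable("examplelib", "1.3.9", advisories)
```

这类外部锚定的版本约束检查正是摘要结尾所建议的缓解手段:它能在安装前拦截模型偏好的已知风险版本。实际生态中版本号还包含预发布与本地段,需更完整的解析规则。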

[AI-49] Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion

【速读】:该论文旨在解决生成式模型在合成表格数据时普遍存在“真实数据效用差距”(synthetic-real utility gap)的问题,即当前基于扩散模型的合成数据在下游任务中的性能往往难以超越直接使用真实数据训练的模型。传统方法主要依赖训练时的架构改进、规模扩展或重新训练单一生成器来缩小这一差距,而忽略了推理阶段对预训练模型输出进行微调的可能性。论文提出TARDIS(Tabular generation through Refinement, Distillation, and Inference-time Sampling),其关键在于引入一种推理时精炼框架,通过冻结预训练扩散模型参数,在不改变模型结构的前提下,利用树状帕尔岑估计器(Tree-structured Parzen Estimator)搜索最优得分级引导策略,并结合一个统一的数学模式——双向切姆霍夫精炼(Bidirectional Chamfer Refinement, BCR):该模式同时以连续方式(通过得分梯度)和离散方式(通过批次排序后处理)最小化合成样本与真实样本之间的对称切姆霍夫距离(symmetric Chamfer functional)。BCR作为核心机制,在每个数据集上经由内层网格搜索优化样本选择器与软标签蒸馏步骤后被有效激活,最终实现仅需1至80分钟即可在单张消费级GPU上达到甚至超越真实数据的下游任务性能(中位提升+8.6%,p=0.016),且保持与原始扩散模型相当的流形保真度、多样性及样本级隐私保护水平。

链接: https://arxiv.org/abs/2605.06261
作者: Eugenio Lomurno,Filippo Balzarini,Francesco Benelle,Francesca Pia Panaccione,Matteo Matteucci
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-based generators set the current state of the art for synthetic tabular data. These methods approach but rarely exceed real-data utility, and closing this synthetic-real gap has so far been pursued exclusively at training time, via architectural advances, scaling, and retraining of monolithic generators. The inference-time alternative, i.e., refining the outputs of a pre-trained backbone with parameters left untouched, has remained largely unexplored for tabular synthesis. We introduce TARDIS (Tabular generation through Refinement, Distillation, and Inference-time Sampling), an inference-time refinement framework that operates on a frozen pre-trained backbone, configured per dataset by a Tree-structured Parzen Estimator search over score-level guidance during reverse diffusion, with each trial’s objective set by an inner grid search over post-hoc sample selectors and an optional soft-label distillation step. The search space encodes a single mathematical pattern we name Bidirectional Chamfer Refinement (BCR): the symmetric Chamfer functional between synthetic and real samples is minimized both continuously, via a score-level gradient, and discretely, via batch-ranking post-generation. The per-dataset search recovers BCR-aligned configurations on most datasets, evidence for BCR as the dominant refinement pattern. Across 15 binary, multiclass, and regression benchmarks TARDIS achieves a median +8.6% downstream-task improvement over models trained on real data (95% CI [+3.3, +16.4], Wilcoxon p=0.016, 11/15 strict wins) and improves over the TabDiff backbone on all 15 datasets (mean +12.9%, p < 10^-4), matching the backbone on manifold fidelity, diversity, and sample-level privacy. Inference-time refinement of a pre-trained tabular diffusion backbone reaches and exceeds real-data utility in 1 to 80 minutes on a single consumer-grade GPU.
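BCR 最小化的对称 Chamfer 泛函可以用一维玩具点集直观说明(论文作用于表格行向量;以下数据为虚构):

```python
def chamfer(xs, ys):
    """对称Chamfer泛函: 双向最近邻距离的均值之和。"""
    def one_way(a, b):
        return sum(min(abs(p - q) for q in b) for p in a) / len(a)
    return one_way(xs, ys) + one_way(ys, xs)

real = [0.0, 1.0, 2.0]          # 真实样本(一维玩具点)
close = [0.1, 1.0, 2.1]         # 贴近真实分布的合成样本
far = [5.0, 6.0, 7.0]           # 偏离真实分布的合成样本
gap_close = chamfer(real, close)
gap_far = chamfer(real, far)
```

连续侧(得分级引导)沿该泛函的梯度微调反向扩散;离散侧(批次排序)则在生成后按同一泛函挑选样本,两者共同构成 BCR。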

[AI-50] he Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks

【速读】:该论文旨在解决深度神经网络如何学习表征这一核心理论问题。其解决方案的关键在于提出一个以特征为中心的分析框架,通过建立权重更新与特征演化之间的联系,推导出一个简洁的“特征学习方程”(Feature Learning Equation),并识别出权重Gram矩阵作为刻画特征动态的核心对象。该框架进一步引入“虚拟协方差”(Virtual Covariance)来描述训练过程中特征演化的协方差结构,并基于此定义“目标线性度”(Target Linearity)作为衡量特征与目标之间线性对齐程度的指标,从而揭示深层网络在训练中逐步将表示转化为目标线性结构的机制,为神经坍缩(Neural Collapse)和生成模型中的线性插值等现象提供了统一的线性化解释。

链接: https://arxiv.org/abs/2605.06258
作者: Taehun Cha,Daniel Beaglehole,Adityanarayanan Radhakrishnan,Donghun Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 29 pages including appendix

点击查看摘要

Abstract:Understanding how deep neural networks learn representations remains a central challenge in machine learning theory. In this work, we propose a feature-centric framework for analyzing neural network training by relating weight updates to feature evolution. We introduce a simple identity, the Feature Learning Equation, which identifies the weight Gram matrix as the key object capturing feature dynamics. This enables us to interpret gradient descent as implicitly inducing a hypothetical evolution of features, whose covariance structure - termed the Virtual Covariance - characterizes how representations evolve during training. Building on this perspective, we introduce Target Linearity, a measure quantifying the linear alignment between features and targets. By analyzing the training and layer-wise dynamics, we show that deep networks learn to sequentially transform representations toward target-linear structure. This linearization perspective provides a unified interpretation of several empirical phenomena, including Neural Collapse and linear interpolation in generative models.
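“特征学习方程”的核心对象是权重 Gram 矩阵。下面给出一个自包含的计算示意(此处取 G = W W^T 这一约定,与论文的具体约定可能不同;W 为玩具矩阵):

```python
def gram(W):
    """W 按行向量列表表示, 返回 Gram 矩阵 G = W W^T。"""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in W]
            for row_i in W]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
G = gram(W)
```

Gram 矩阵对称半正定,其谱刻画了权重对特征方向的放大程度;在论文框架中,跟踪训练过程中 G 的演化即可得到“虚拟协方差”所描述的特征动态。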

[AI-51] Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real Repairable but Not Accuracy-Dominant

【速读】:该论文旨在解决前向-前向(Forward-Forward, FF)训练中层间自由搭车(layer free-riding)现象所导致的优化病理问题。具体而言,在累积良好度(cumulative-goodness)变体中,较深层网络可能继承由浅层已部分分离的任务,使得后续层的学习梯度因前序层积累的正边际(positive margin)呈指数衰减,从而削弱了当前层的判别能力。解决方案的关键在于提出三种局部修复策略:基于块的(per-block)、硬度门控的(hardness-gated)和深度缩放的(depth-scaled)机制,这些方法无需依赖反向传播梯度即可恢复当前层的分离度量。实验表明,这些策略在CIFAR-10/100上显著提升深层网络的层分离统计指标(提升4–45倍),且对最终准确率影响极小,证明了该优化问题可被有效缓解,但并非限制模型性能的主要因素。

链接: https://arxiv.org/abs/2605.06240
作者: Amirhossein Yousefiramandi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Forward-Forward (FF) training allows each layer to learn from a local goodness criterion. In cumulative-goodness variants, however, later layers can inherit a task that earlier layers have already partially separated. We formalize this phenomenon as layer free-riding: under the softplus FF criterion, the class-discrimination gradient reaching block d decays exponentially with the positive margin accumulated by preceding blocks. We then study three local remedies – per-block, hardness-gated, and depth-scaled – that recover current-layer separation measures without relying on backpropagated gradients. On CIFAR-10 and CIFAR-100, these remedies dramatically improve layer-separation statistics, with 4×–45× gains in deeper layers, while changing accuracy by less than one percentage point for non-degenerate training procedures. Tiny ImageNet provides a tougher cross-dataset check for our selected block-wise configuration and reveals the same qualitative gap between layer-health diagnostics and final accuracy. Calibration experiments further show that architecture and augmentation choices have a larger effect on final accuracy than the training-rule modifications studied here. Cumulative free-riding is therefore a real and repairable optimization pathology. Nonetheless, for the FF training rules, architectures, and datasets we study, it is not the dominant factor limiting achievable accuracy.
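摘要中“梯度随前序块累积的正边际指数衰减”这一自由搭车机制可直接数值验证:softplus(-margin) 对 margin 的梯度幅值为 sigmoid(-margin)。下面是一个极简示意(margin 取值为虚构):

```python
import math

def grad_magnitude(margin):
    """softplus(-m) 对 m 的梯度幅值: sigmoid(-m) = 1/(1+e^m)。"""
    return 1.0 / (1.0 + math.exp(margin))

margins = [0.0, 2.0, 4.0, 8.0]      # 前序块累积的正边际(虚构取值)
grads = [grad_magnitude(m) for m in margins]
```

当 margin 较大时 sigmoid(-m) ≈ e^{-m},深层块收到的判别梯度以指数速度消失;论文的三种局部修复正是通过重置或缩放该边际来恢复梯度。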

[AI-52] Band Together: Untargeted Adversarial Training with Multimodal Coordination against Evasion-based Promotion Attacks

【速读】:该论文旨在解决多模态推荐系统(Multimodal Recommender Systems)在面对基于规避的推广攻击(evasion-based promotion attacks)时鲁棒性不足的问题。现有防御方法主要针对单模态场景和投毒攻击(poisoning-based threats),对跨模态攻击机制缺乏有效应对,尤其在多用户环境下,视觉与文本模态的梯度方向因不同用户群体主导而出现不一致(cross-modal gradient mismatch),导致攻击效果被削弱,且鲁棒训练低估了最坏情况风险。其解决方案的关键在于提出无目标对抗训练与多模态协同机制(Untargeted Adversarial Training with Multimodal Coordination, UAT-MC):通过将所有物品视为潜在目标来处理规避攻击中的未知目标问题,并引入梯度对齐机制显式校正跨模态梯度不一致,从而实现模态间扰动的同步优化,最大化对抗强度以提升模型鲁棒性,同时在防御-精度权衡下保持可接受的推荐性能。

链接: https://arxiv.org/abs/2605.06238
作者: Guanmeng Xian,Ning Yang,Philip S. Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal recommender systems exploit visual and textual signals to alleviate data sparsity, but this also makes them more vulnerable to evasion-based promotion attacks. Existing defenses are largely limited to single-modal settings and mainly focus on poisoning-based threats, leaving evasion-based threats underexplored. In this work, we first identify a cross-modal gradient mismatch under the multi-user promotion setting, where visual and textual perturbations are optimized in inconsistent directions due to the dominance of distinct user groups. This phenomenon dilutes the attack effectiveness and leads robust training to underestimate worst-case risks. To address this issue, we propose Untargeted Adversarial Training with Multimodal Coordination (UAT-MC). UAT-MC tackles the challenge of unknown targeted items in evasion-based attacks (as opposed to poisoning-based attacks) by treating all items as potential targets, and introduces a gradient alignment mechanism to explicitly correct this mismatch. This design ensures synchronized perturbations across modalities, thereby maximizing adversarial strength for robust training. Extensive experiments demonstrate that UAT-MC significantly improves robustness against promotion attacks while maintaining acceptable recommendation performance under the defense-accuracy trade-off. Code is available at this https URL.

[AI-53] Safactory: A Scalable Agent Factory for Trustworthy Autonomous Intelligence

【速读】:该论文旨在解决当前大型模型向自主智能体(Autonomous Agent)演进过程中,因长时决策、工具使用及真实环境交互所带来的系统性风险难以被有效发现与持续优化的问题。现有代理基础设施在评估、数据管理和代理演化方面仍呈碎片化状态,缺乏闭环改进机制。其解决方案的关键在于提出Safactory框架,该框架通过三个紧密耦合的平台实现统一进化流程:并行仿真平台用于轨迹生成,可信数据平台用于轨迹存储与经验提取,自主演化平台则支持异步强化学习与在线策略蒸馏,从而构建了一个可扩展的、面向可信自主智能的闭环演化体系。

链接: https://arxiv.org/abs/2605.06230
作者: Xinquan Chen,Zhenyun Yin,Shan He,Bin Huang,Shanzhe Lei,Pengcheng Shi,Kun Cai,Bei Chen,Bangwei Liu,Zeyu Kang,Chao Huang,Yang Zhang,Wenjie Li,Ruijun Ge,Yajie Wang,Tianshun Fang,Tianyang Xu,Yiwen Cong,Meng Jin,Gaolei Li,Xuansheng Wu,Linhan Liu,Zijing He,An Li,Yan Teng,Xin Tan,ChaoChao Lu,Ji He,Jie Li,Chunfeng Song,Jinya Xu,Fan Song,Shujie Wang,Jianmin Qian,Jie Hou,Xuhong Wang,Yingchun Wang,Hui Wang,Xia Hu
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 50 pages, 21 figures

点击查看摘要

Abstract:As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agentic infrastructure remains fragmented across evaluation, data management, and agent evolution, making it difficult to discover risks systematically and improve models in a continuous closed loop. In this report, we present Safactory, a scalable agent factory for trustworthy autonomous intelligence. Safactory integrates three tightly coupled platforms: a Parallel Simulation Platform for trajectory generation, a Trustworthy Data Platform for trajectory storage and experience extraction, and an Autonomous Evolution Platform for asynchronous reinforcement learning and on-policy distillation. As far as we know, Safactory is the first framework to propose a unified evolutionary pipeline for next-generation trustworthy autonomous intelligence.

[AI-54] Soft Deterministic Policy Gradient with Gaussian Smoothing

【速读】:该论文旨在解决确定性策略梯度(Deterministic Policy Gradient, DPG)在处理稀疏或离散奖励场景时因批评者(critic)对动作的梯度不可导而导致策略梯度定义不清、学习不稳定的问题。其解决方案的关键在于引入基于高斯平滑的软化贝尔曼方程(smoothed Bellman equation),从而构建一个新的动作价值函数,并推导出软确定性策略梯度(Soft-Deterministic Policy Gradient, Soft-DPG)。该方法通过消除对 critic 动作梯度的显式依赖,确保即使在非光滑 Q 函数情况下,策略梯度依然保持良好定义,进而提升了算法在离散奖励环境中的鲁棒性和性能。

链接: https://arxiv.org/abs/2605.06228
作者: Hyunjun Na,Donghwan Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 4 figures

点击查看摘要

Abstract:Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in practical control problems involving sparse or discrete rewards, leading to ill-defined policy gradients and unstable learning. To address these challenges, we propose a principled alternative based on a smoothed Bellman equation formulated via Gaussian smoothing. Specifically, we define a novel action-value function based on a smoothed Bellman equation and derive the soft deterministic policy gradient (Soft-DPG). Our formulation eliminates explicit dependence on critic action-gradients and ensures that the gradient remains well-defined even for non-smooth Q-functions. We instantiate this framework into a deep reinforcement learning algorithm, which we call soft deep deterministic policy gradient (Soft DDPG). Empirical evaluations on standard continuous control benchmarks and their discretized-reward variants show that Soft DDPG remains competitive in dense-reward settings and provides clear gains in most discretized-reward environments, where standard DDPG is more sensitive to irregular critic landscapes.
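摘要没有给出具体的估计量,但其核心思想——用高斯平滑后的 Q 函数替代非光滑 Q 函数,使梯度通过 score-function 恒等式保持良定义——可以在一个一维动作空间上直观演示(下面的稀疏奖励形状、σ 与采样数均为说明性假设,并非论文设定):

```python
import numpy as np

rng = np.random.default_rng(0)

def q(a):
    # Non-smooth critic: a sparse/discrete reward landscape whose
    # action-gradient is zero almost everywhere.
    return 1.0 if abs(a - 0.5) < 0.05 else 0.0

def smoothed_q(a, sigma=0.2, n=10_000):
    # Gaussian smoothing: E_{eps ~ N(0, sigma^2)}[Q(a + eps)].
    eps = rng.normal(0.0, sigma, size=n)
    return float(np.mean([q(a + e) for e in eps]))

def smoothed_q_grad(a, sigma=0.2, n=10_000):
    # Score-function identity: grad ≈ E[Q(a + eps) * eps] / sigma^2,
    # well-defined even though q itself is piecewise constant.
    eps = rng.normal(0.0, sigma, size=n)
    vals = np.array([q(a + e) for e in eps])
    return float(np.mean(vals * eps) / sigma**2)

# The raw critic gives no gradient signal at a=0.0; the smoothed one does,
# pointing toward the rewarding region around a=0.5.
print(smoothed_q(0.0), smoothed_q_grad(0.0))
```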

[AI-55] Price of Fairness in Short-Term and Long-Term Algorithmic Selections IJCAI-26

【速读】:该论文旨在解决高风险场景下算法决策中公平性与效用之间的权衡问题,特别是静态公平约束可能在长期内加剧群体差异的困境。其核心挑战在于如何设计既保证短期公平又避免长期不公平积累的决策策略。解决方案的关键在于引入短期与长期群体公平性的概念,并通过理论分析揭示“公平代价”(Price of Fairness, PoF)的结构性限制;进一步证明,在简单的投资政策(如持续向弱势群体倾斜资源)下,即使短期PoF较大,长期不公平仍可趋近于零,从而实现低PoF下的可持续公平性。

链接: https://arxiv.org/abs/2605.06227
作者: Shahin Jabbari,Chen Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The short version of this paper appears in the proceedings of IJCAI-26

点击查看摘要

Abstract:Algorithmic decision-making in high-stakes settings can have profound impacts on individuals and populations. While much prior work studies fairness in static settings, recent results show that enforcing static fairness constraints may exacerbate long-run disparities. Motivated by this tension, we study a stylized sequential selection problem in which a decision-maker repeatedly selects individuals, affecting both immediate utility and the population distribution over time. We introduce notions of group fairness for both the short and long term and theoretically analyze the trade-off between fairness and utility via the Price of Fairness (PoF). We characterize optimal and fair policies in the short term and show that the PoF can be large even when group distributions are nearly identical. In contrast, we show that long-term disparities can vanish under simple investment policies that achieve a low PoF. We also empirically validate these theoretical observations using both synthetic and real datasets.
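下面用一个玩具例子演示 Price of Fairness 的计算方式(分组得分、选取人数与奇偶约束均为假设,并非论文中的设定):

```python
# Two groups of candidates with utility scores; the decision-maker selects k.
group_a = [0.9, 0.8, 0.7, 0.6]
group_b = [0.85, 0.5, 0.4, 0.3]
k = 4

# Unconstrained optimum: take the k highest scores overall.
opt = sum(sorted(group_a + group_b, reverse=True)[:k])

# A demographic-parity-style fairness constraint: select k/2 from each group.
fair = sum(sorted(group_a, reverse=True)[:k // 2]) + \
       sum(sorted(group_b, reverse=True)[:k // 2])

# Price of Fairness: utility lost by enforcing the constraint (>= 1).
pof = opt / fair
print(opt, fair, pof)
```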

[AI-56] A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

【速读】:该论文旨在解决罕见病诊断中存在的时间延迟和准确性不足的问题,传统诊断流程常因多源异构数据整合困难、缺乏针对性分析策略以及临床决策支持工具的局限性而效率低下。其解决方案的关键在于提出一个名为Hygieia的多模态AI代理系统,该系统采用基于路由器(router-based)与知识增强(knowledge-enhanced)的架构,能够有效整合表型特征、基因组谱和临床记录等多维数据,并针对不同疾病类别动态调整诊断策略;尤其在罕见病场景下优先识别风险相关基因因素,同时输出置信度评分以提升可解释性和辅助临床判断,从而显著提高诊断准确率并降低医生工作负荷。

链接: https://arxiv.org/abs/2605.06226
作者: Tianyu Liu,Wangjie Zheng,Rui Yang,Benny Kai Guo Loo,Hui Zhang,Jeffries Lauran,Jianlei Gu,Botao Yu,Weihao Xuan,Kexin Huang,Nan Liu,James Zou,Yonghui Jiang,Hua Xu,Hongyu Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注: 32 pages, 6 figures

点击查看摘要

Abstract:Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia’s superior diagnostic performance compared to physicians, with improvements of 12%–60%, and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.

[AI-57] Memory Inception: Latent-Space KV Cache Manipulation for Steering LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在指令引导(instruction prompting)与激活控制(activation steering)之间存在的权衡问题:前者虽控制能力强但会占用大量缓存空间并导致长对话混乱,后者虽高效紧凑却难以支持结构化提示或持久性引导。解决方案的关键在于提出一种无需训练的记忆内嵌(memory inception, MI)方法,其核心是在潜在注意力空间中通过仅在特定层插入由文本生成的键值(key-value, KV)库来实现精确引导,从而将引导信息以稀疏且可选的方式注入模型内部状态,避免冗余存储并提升对持续性、结构性任务的控制能力。

链接: https://arxiv.org/abs/2605.06225
作者: Andy Zeyi Liu,Michael Zhang,Ilana Greenberg,Adam Alnasser,Lucas Baker,John Sous
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but typically weaker and does not support large structured reminders. We introduce memory inception (MI), a training-free method that steers in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. Rather than materializing reminder content throughout the prompt cache, MI treats steering as selective KV allocation, injecting latent slots only where the model routes to them. On matched personality-steering tasks, MI gives the best overall control–drift trade-off, remaining competitive with prompting while consistently outperforming CAA. On updateable guidance, MI supports mid-conversation behavior shifts without rewriting the visible transcript, achieving the highest post-shift alignment on Qwen3. On structured reasoning, MI outperforms visible prompting on HARDMath and PHYSICS (10/12 subject × mode cells), serving as proxies for structured reasoning in verifiable domains, while cutting content-matched KV storage by up to 118×. These results position MI as a powerful steering method when guidance is persistent, structured, or expensive to keep in the visible transcript.

[AI-58] Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

【速读】:该论文旨在解决自然语言实例导航(Natural-language Instance Navigation)中因初始用户请求无法唯一确定目标实例而导致的歧义问题。现有方法通常存在两个缺陷:一是过早终止于首个合理候选对象,未能充分探索其他可能性;二是即使收集多个候选对象后,仍基于单个候选对象的属性提问,而非设计能区分候选池中多个对象的问题,导致对话效率低且易产生误判。解决方案的关键在于提出一种两阶段框架——主动实例导航与比较判断(Proactive Instance Navigation with Comparative Judgment, ProCompNav),其核心机制是在每轮对话中提取一个能够将当前候选池划分为两部分的属性-值对,并通过二元是/否问题一次性剔除不一致的候选对象,从而将消歧过程从开放式的靶向描述重构为池级的判别性提问,确保每一步都显著缩小候选集范围。实验表明,该方法在CoIN-Bench上以最少输入实现更高成功率,在TextNav上达到最先进性能,验证了比较判断策略在相似干扰项中的广泛适用性。

链接: https://arxiv.org/abs/2605.06223
作者: Junhyuk Kwon,Seungjoon Lee,Hyejin Park,Kyle Min,Jungseul Ok
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:Natural-language instance navigation becomes challenging when the initial user request does not uniquely specify the target instance. A practical agent should reduce the user’s burden by actively asking only the information needed to distinguish the target from similar distractors, rather than requiring a detailed description upfront. Existing approaches often fall short of this goal: they may stop at the first plausible candidate before sufficiently exploring alternatives, or, even after collecting multiple candidates, ask about the target’s attributes derived from individual candidates rather than questions selected to distinguish candidates in the pool. As a result, despite the dialogue, the agent may still fail to distinguish the target from distractors, leading to premature decisions and lengthy user responses. We propose Proactive Instance Navigation with Comparative Judgment (ProCompNav), a two-stage framework that first constructs a candidate pool and then identifies the target through comparative judgment. At each round, ProCompNav extracts an attribute-value pair that splits the current pool, asks a binary yes/no question, and prunes all inconsistent candidates at once. This reframes disambiguation from open-ended target description to pool-level discriminative questioning, where each question is chosen to narrow the candidate set. On CoIN-Bench, ProCompNav improves Success Rate over interactive baselines with the same minimal input and non-interactive baselines with detailed descriptions, while substantially reducing Response Length. ProCompNav also achieves state-of-the-art Success Rate on TextNav, suggesting that comparative judgment is broadly useful for instance-level navigation among similar distractors.
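池级判别式提问的思路可以用一个极简示意说明:每轮选择能把候选池切分得最均衡的属性-值对,使任一回答都能剪掉尽量多的候选(候选的字典表示与属性值均为虚构示例,并非论文的实现):

```python
from collections import Counter

# Hypothetical candidate pool: each candidate is an attribute dictionary.
candidates = [
    {"color": "red", "size": "small", "on": "table"},
    {"color": "red", "size": "large", "on": "shelf"},
    {"color": "blue", "size": "small", "on": "table"},
    {"color": "blue", "size": "small", "on": "floor"},
]

def best_question(pool):
    # Pick the (attribute, value) pair whose yes/no split is most balanced,
    # so that either answer prunes as many candidates as possible.
    pairs = Counter((k, v) for c in pool for k, v in c.items())
    return min(pairs, key=lambda p: abs(pairs[p] - len(pool) / 2))

def prune(pool, attr, value, answer_is_yes):
    # Keep only candidates consistent with the user's yes/no answer.
    return [c for c in pool if (c[attr] == value) == answer_is_yes]

attr, value = best_question(candidates)
print(f"Is the target's {attr} '{value}'?")
pool = prune(candidates, attr, value, answer_is_yes=True)
print(pool)
```

无论用户回答“是”或“否”,这样选出的问题都能把候选池砍掉约一半,这正是摘要中“每个问题都缩小候选集”的含义。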

[AI-59] When to Trust Imagination: Adaptive Action Execution for World Action Models

【速读】:该论文旨在解决当前世界动作模型(World Action Models, WAMs)在机器人操作中执行固定长度动作块时缺乏对预测未来与实际物理演进一致性判断的问题,导致机器人在接触密集或复杂场景中难以及时响应。解决方案的关键在于提出一种轻量级验证器——未来前向动力学因果注意力机制(Future Forward Dynamics Causal Attention, FFDC),其通过联合推理预测动作、预测视觉动态、真实观测和语言指令,实时评估剩余动作序列的可信度,从而实现自适应的动作块长度选择:当预测与现实一致时延长执行,偏离时提前重规划。此外,引入多时间尺度训练(Mixture-of-Horizon Training)以提升长程轨迹覆盖能力,最终在RoboTwin和真实场景中均实现了显著的效率-鲁棒性平衡优化。

链接: https://arxiv.org/abs/2605.06222
作者: Rui Wang,Yue Zhang,Jiehong Lin,Kuncheng Luo,Jianan Wang,Zhongrui Wang,Xiaojuan Qi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute longer when the WAM-predicted future remains reliable, and replan earlier when reality deviates from imagination. To this end, we propose Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly reasons over predicted future actions, predicted visual dynamics, real observations, and language instructions to estimate whether the remaining action rollout can still be trusted. FFDC enables adaptive action chunk sizes as an emergent consequence of prediction-observation consistency, preserving the efficiency of long-horizon execution while restoring responsiveness in contact-rich or difficult phases. We further introduce Mixture-of-Horizon Training to improve long-horizon trajectory coverage for adaptive execution. Experiments on the RoboTwin benchmark and in the real world demonstrate that our method achieves a strong robustness-efficiency trade-off: on RoboTwin, it reduces WAM forward passes by 69.10% and execution time by 34.02%, while improving success rate by 2.54% over the short-chunk baseline; in real-world experiments, it improves success rate by 35%.

[AI-60] Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization

【速读】:该论文旨在解决测试时聚合(test-time aggregation)中现有方法依赖孤立候选推理路径的评估信号或答案频率,而忽略候选路径之间相互比较信息的问题。其解决方案的关键在于提出一种统一的联合一致性(Joint Consistency, JC)框架,将独立评估信号建模为外场,将成对比较建模为相互作用项,从而构建一个约束型Ising能量最小化问题。该框架通过LLM-as-a-judge机制获取交互矩阵,并在答案层面同质性假设下具备理论解释;同时设计了高效的近似策略以实现大规模测试时聚合的可行性。实验表明,JC在数学与代码推理基准上均显著优于现有基线方法。

链接: https://arxiv.org/abs/2605.06219
作者: Yunzhen Yao,Hongye Wang,Yahong Wang,Michael C. Gastpar,Bo Jiang,Lie He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper studies test-time aggregation, an approach that generates multiple reasoning traces and aggregates them into a final answer. Most existing methods rely on evaluation signals collected from candidate traces in isolation or answer frequencies, while ignoring comparative interactions among candidates. We propose Joint Consistency (JC), formulated as a constrained Ising-type energy minimization problem, where independent evaluation signals act as external fields and pairwise comparisons act as interactions. JC provides a unified framework for test-time aggregation that subsumes existing voting and weighted aggregation methods as special cases. Our construction of the interaction matrix leverages LLM-as-a-judge comparisons, and admits a theoretical interpretation under answer-level homogeneity assumptions. Moreover, we develop an efficient approximation strategy that makes interaction modeling practical for large-scale test-time aggregation. Experiments on math and code reasoning benchmarks show that JC consistently outperforms existing baselines across tasks, judge models, trace budgets, and trace-generation settings.
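Ising 型能量最小化聚合的思路可以用一个小规模暴力搜索演示:独立评估分数充当外场,候选间的成对一致性充当相互作用项(下例中的答案、分数与 ±1 交互矩阵均为说明性假设,并非论文基于 LLM-as-a-judge 的构造):

```python
from itertools import product

answers = ["42", "41", "42", "42", "7"]
h = [0.9, 0.2, 0.8, 0.7, 0.1]  # per-trace evaluation scores (external field)
# Stand-in interaction matrix: +1 if two traces agree, -1 otherwise.
J = [[1 if answers[i] == answers[j] else -1 for j in range(5)] for i in range(5)]

def energy(s):
    # Ising-type energy over spin assignments s_i in {-1, +1}.
    field = -sum(h[i] * s[i] for i in range(len(s)))
    inter = -sum(J[i][j] * s[i] * s[j]
                 for i in range(len(s)) for j in range(i + 1, len(s)))
    return field + inter

# Brute-force minimization (feasible here; the paper uses an efficient
# approximation for large-scale aggregation).
best = min(product([-1, 1], repeat=len(answers)), key=energy)
# Aggregate: weighted vote among the traces kept "on" (s_i = +1).
final = max(set(answers),
            key=lambda a: sum(h[i] for i, si in enumerate(best)
                              if si == 1 and answers[i] == a))
print(best, final)
```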

[AI-61] Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估中依赖固定基准测试所带来的局限性,即统一的测试集对所有模型施加相同难度,导致天花板效应(ceiling effect)和地板效应(floor effect),从而掩盖了模型间的真实能力差异。其解决方案的关键在于提出动态边界评估(Dynamic Boundary Evaluation, DBE),通过主动定位每个模型在随机采样解码下每题通过概率约为0.5的边界区域,并将模型置于一个全局可比的难度尺度上进行评估。DBE的核心创新包括:一个经9个参考LLM验证的校准项目库(包含安全、能力与真实性维度)、基于API访问的技能引导边界搜索算法(Skill-Guided Boundary Search, SGBS),以及一套自适应扩展评估集的协议,使新模型能在不饱和的前提下被精准刻画其能力水平。

链接: https://arxiv.org/abs/2605.06213
作者: Haoxiang Wang,Da Yu,Huishuai Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near 0.5 under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model’s boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across 9 reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and grows the evaluation set adaptively when the target falls outside the bank’s coverage. We instantiate DBE on four categories spanning safety (harmful request refusal and over-refusal), capability (constrained instruction following), and truthfulness (multi-turn sycophancy resistance). The resulting evaluation covers a broader model spectrum without saturation while remaining compatible with existing datasets.
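“定位通过率约 0.5 的边界”可以用一个玩具二分搜索直观说明——这里假设一个逻辑斯蒂(IRT 风格)的通过概率模型,仅用于演示直觉,并非论文的 SGBS 算法:

```python
import math
import random

random.seed(0)

def pass_prob(difficulty, ability=3.0):
    # Hypothetical target model: pass probability falls with item difficulty.
    return 1 / (1 + math.exp(difficulty - ability))

def empirical_pass(difficulty, trials=200):
    # Query-access simulation: sample pass/fail under random decoding.
    hits = sum(random.random() < pass_prob(difficulty) for _ in range(trials))
    return hits / trials

# Bisection over a calibrated difficulty scale to find the boundary region
# where the per-item pass probability is near 0.5.
lo, hi = 0.0, 10.0
for _ in range(20):
    mid = (lo + hi) / 2
    if empirical_pass(mid) > 0.5:
        lo = mid  # still too easy; the boundary lies at higher difficulty
    else:
        hi = mid
boundary = (lo + hi) / 2
print(round(boundary, 2))  # close to the assumed ability parameter 3.0
```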

[AI-62] owards Annotation-Free Validation of MLLM s: A Vision-Language Logical Consistency Metric

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在视觉-语言任务中仅依赖准确率评估可能诱导无根据猜测,且在缺乏真实标签(ground-truth, gt)标注的新颖任务中难以有效验证模型性能的问题。其解决方案的关键在于提出一种基于因果逻辑原理的新型评估框架——视觉-语言逻辑一致性度量(Vision-Language Logical Consistency Metric, VL-LCM),该方法无需依赖gt标注即可衡量多模态大语言模型(Multimodal Large Language Models, MLLMs)在充分与必要因果关系上的推理一致性。通过在MMMU和NaturalBench等基准上的系统实验验证,表明VL-LCM不仅能够有效反映模型逻辑一致性水平,还具备与gt相关指标强关联性、高可靠性以及在无gt场景下用于模型选择与可信答案解释的能力。

链接: https://arxiv.org/abs/2605.06201
作者: Ying Gu,Mei Chee Leong,Hui Li Tan,Shangbo Mao,Liyuan Li,Nancy Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dominant accuracy evaluation might reward unwarranted guessing by Large Language Models, and it might not be applicable to novel tasks for model validation without ground-truth (gt) annotation. Based on basic logic principles, we propose a novel framework to evaluate the vision-language logical consistency of MLLMs on both sufficient and necessary cause-effect relations. We define Vision-Language Logical Consistency Metric (VL-LCM) on traditional MC-VQA tests, and recent NaturalBench tests without the need for gt annotation. Through systematic experiments on representative VL benchmark MMMU and recent VL challenges like NaturalBench, we evaluated 11 recent open-source MLLMs from 4 frontier families. Our findings reveal that, despite significant progress of recent MLLMs on accuracy, logical consistency lags behind significantly. Extensive evaluations on the correlations of VL-LCM with metrics on gt, the reliability of LCM, and the relation of VL-LCM with response distribution justify the validity and applicability of VL-LCM even without gt annotation. Our findings suggest that, beyond accuracy, logical consistency could be employed for both accuracy and reliability. VL-LCM can also be employed for MLLM selection, validation, and reliable answer justification in novel tasks without gt annotation.

[AI-63] Systematic Evaluation of Large Language Models for Post-Discharge Clinical Action Extraction

【速读】:该论文旨在解决临床文档中安全关键型临床行动提取的难题,尤其关注出院记录中的诊疗过渡(transitions of care)和出院后患者安全相关任务。其核心挑战在于临床文本的复杂性和非结构化叙事特征,以及现有模型在缺乏临床推理能力时难以准确识别隐含或细粒度的可执行医疗指令。解决方案的关键在于提出一个两阶段抽取框架,通过分步提示策略(staged prompting strategy)将自由格式的出院记录分解为细粒度、明确可操作的临床任务,并系统评估零样本与少样本大语言模型(LLMs)在该任务上的表现,同时对比通用LLMs与特定任务微调的BERT基线模型。研究发现,尽管LLMs在二分类行动性检测上达到甚至超越监督模型性能,但在多标签细粒度类别分类中仍逊于监督模型,且错误主要源于模型推理与标注规范之间的不一致,而非单纯的数据偏差。这表明当前性能指标受限于标注数据缺乏推理依据,亟需构建带有理由注释的临床NLP数据集以真正评估模型的临床理解能力。

链接: https://arxiv.org/abs/2605.06191
作者: Shivali Dalmia,Ananya Mantravadi,Prasanna Desikan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The work in this paper evaluates zero-shot and few-shot large language models (LLMs) for safety-critical clinical action extraction using the CLIP discharge-note dataset, with particular emphasis on transitions of care and post-discharge patient safety. To manage the complexity of clinical documentation, we introduce a two-stage extraction framework that decomposes discharge notes, which are written in narrative form, into fine-grained, explicitly actionable clinical tasks through a staged prompting strategy. Our contributions include a systematic assessment of generative LLMs for clinical action extraction, a detailed comparison between general-purpose LLMs and task-specific supervised BERT-based models, and an analysis of annotation inconsistencies across different action categories. We show that contemporary LLMs achieve performance comparable to or exceeding supervised models on binary actionability detection, while supervised baselines retain a meaningful advantage on fine-grained multi-label category classification, despite the absence of task-specific fine-tuning and under strict data-privacy constraints. Qualitative error analysis reveals that many failures stem from misalignment between model reasoning and dataset annotation conventions, particularly in cases involving implicit clinical actions and rigid structural labeling rules. These results indicate that reported performance reflects model limitations in clinical reasoning that are not captured by plain annotations. Labels without rationales make it impossible to distinguish clinical reasoning failures from annotation convention mismatches. Advancing clinical NLP requires reasoning-annotated datasets that document why specific spans are actionable, not merely which spans were labeled, enabling proper evaluation of model clinical understanding.

[AI-64] In-Context Black-Box Optimization with Unreliable Feedback

【速读】:该论文旨在解决黑箱优化(Black-box Optimization)中如何有效利用多源辅助反馈信息的问题,尤其是在这些反馈可能具有偏差、依赖输入或误导性的情况下。传统方法通常只能处理单一任务或假设仅历史查询数据可用,难以在跨任务场景下实现可靠适应。其解决方案的关键在于提出一种反馈感知的上下文优化框架(Feedback-Informed In-Context Black-Box Optimization, FICBO),通过预训练一个能够同时建模优化历史与廉价辅助反馈的结构化反馈先验(structured feedback prior),使模型在测试时能动态评估各反馈源的可靠性,并据此改进候选集的选择策略。这一机制显著提升了对有用反馈的利用效率,同时保持对弱或误导性反馈的鲁棒性。

链接: https://arxiv.org/abs/2605.06187
作者: Nicolas Samuel Blumer,Julien Martinelli,Samuel Kaski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Black-box optimization in science and engineering often comes with side information: experts, simulators, pretrained predictors, or heuristics can suggest which candidates look promising. This information can accelerate search, but it can also be biased, input-dependent, or misleading. Feedback-aware BO methods typically handle one task at a time, limiting their ability to generalize over multiple sources of feedback. In-context optimizers address cross-task adaptation, but usually assume that optimization history is the only available signal at test time. We study feedback-informed in-context black-box optimization (FICBO), where a pretrained optimizer conditions on both the observed history and cheap auxiliary feedback for the current candidate set. We introduce a structured feedback prior that models how feedback sources vary in their access, relevance, and distortion relative to the true objective, and use it to pretrain a feedback-aware transformer. At test time, the model estimates source reliability in context by comparing observed objective values with auxiliary signals, improving query selection. On synthetic and real-world tasks, FICBO effectively exploits informative feedback while remaining robust to weak or misleading sources, improving over other baselines. Empirical investigations further illustrate how the model perceives test-time sources, offering insights into its interpretability and decision-making process.
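“在上下文中通过对比已观测目标值与辅助信号来估计反馈源可靠性”这一思路,可以用一个秩相关代理粗略示意(合成数据与 Spearman 式代理均为说明性假设,并非论文的模型内部机制):

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = rng.normal(size=20)                  # observed objective values
good_src = true_f + 0.1 * rng.normal(size=20)  # informative auxiliary feedback
bad_src = rng.normal(size=20)                  # misleading/unrelated feedback

def rank(x):
    # Map values to their ranks (0..n-1).
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def reliability(observed, feedback):
    # Spearman-style proxy: rank correlation between a feedback source and
    # the objective values seen so far in the optimization history.
    return float(np.corrcoef(rank(observed), rank(feedback))[0, 1])

print(reliability(true_f, good_src), reliability(true_f, bad_src))
```

一个可靠性高的反馈源应在候选排序上与真实目标高度一致,模型据此可以在测试时决定对哪些信号加权、对哪些保持稳健。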

[AI-65] BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

【速读】:该论文旨在解决当前构建深度研究代理(Deep Research Agent)时面临的“每篇论文工程税”问题,即不同研究在评估相同基础模型时因工具注册表、执行环境和评估流程不一致而导致结果不可比,且集成新模型需耗费数周定制化工程工作。解决方案的关键在于提出 BioMedArena——一个开源工具包,通过解耦生物医学代理评估的六个核心层级(基准加载、工具暴露、工具选择、执行模式、上下文管理与评分),并标准化147个生物医学基准和75个生物医学工具(涵盖9类功能),使得新增模型、基准或工具仅需注册几行代码的适配器即可完成集成,从而实现公平、高效、可复现的模型比较与性能提升。

链接: https://arxiv.org/abs/2605.06177
作者: Jinge Wu,Hongjian Zhou,Mingde Zeng,Jiayuan Zhu,Junde Wu,Jiazhen Pan,Sean Wu,Honghan Wu,Fenglin Liu,David A. Clifton
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation – benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring – and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which provide 12 backbones with competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at this https URL

[AI-66] Post Reasoning : Improving the Performance of Non-Thinking Models at No Cost

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因中间思考步骤(intermediate reasoning traces)导致的推理延迟增加和运行成本上升的问题。现有研究表明,许多实际任务并不需要显式推理,甚至额外推理可能降低性能。为此,作者提出**后置推理(Post-Reasoning)**策略,其核心在于通过指令增强方式,使模型在生成最终答案后补充解释性理由,从而在不增加推理阶段延迟或token消耗的前提下提升模型表现。该方法的关键创新在于将推理过程从“前置”调整为“后置”,并通过监督微调进一步内化该机制,显著提升了多种基准测试下的直接作答能力,平均相对提升达17.37%,且在91.11%的设置中优于基线。

链接: https://arxiv.org/abs/2605.06165
作者: Richmond Sin Jing Xuan,Rishabh Bhardwaj,Soujanya Poria
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As the widespread adoption of Large Language Models (LLMs) accelerates, token consumption from intermediate reasoning traces increasingly contributes to inference latency and operational cost. Recent studies suggest that many real-world tasks require little to no explicit reasoning, with additional reasoning sometimes even degrading performance. In this work, we propose Post-Reasoning, a simple yet effective approach that improves instruction-tuned models by conditioning them to justify their answers after generating the final response. By design, it enables the final answer to be obtained without additional latency or token cost, while still improving performance through simple instruction augmentation. We evaluate Post-Reasoning across 117 model–benchmark settings spanning 13 open and proprietary models, 4 model families, and 9 diverse reasoning and knowledge-intensive benchmarks, including AMC, HMMT, GSM8K, GPQA, MMLU-Pro, and BIG-Bench Hard. Post-Reasoning improves performance in over 88.19% of evaluated settings, achieving a mean relative improvement of 17.37%. Furthermore, we propose supervised post-reason tuning, which further improves performance in over 91.11% of evaluated settings, and exceeds the prompt-based post-reasoning baseline by an average of 8.01%, demonstrating that post-reasoning can be effectively internalized through training. Ultimately, Post-Reasoning establishes a new performance ceiling for direct-answer capabilities.
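“答案在前、理由在后”的机制可以用一个极简示意说明:最终答案出现在第一行,因而可在流式输出到达时立即解析,后续理由不增加取答延迟(下面的提示后缀措辞与 "Answer:" 约定均为假设,论文并未给出具体模板):

```python
# Hypothetical instruction suffix appended to the user prompt.
POST_REASONING_SUFFIX = (
    "\nGive the final result on the first line as 'Answer: ...', "
    "then justify your answer below it."
)

def extract_answer(response: str) -> str:
    # The answer is available as soon as the first line arrives; the
    # trailing justification can be logged or discarded at no extra cost.
    first_line = response.strip().splitlines()[0]
    return first_line.removeprefix("Answer:").strip()

demo = "Answer: 42\nJustification: 6 * 7 = 42."
print(extract_answer(demo))  # 42
```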

[AI-67] Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

【速读】:该论文旨在解决当前大语言模型作为评判者(LLM-as-a-Judge)在评估智能体安全性能时存在的可信度问题,即现有基准测试将LLM的判断视为真实标签,却未验证其是否真正依赖于智能体的行为表现,还是仅仅受评价策略表述方式的影响。论文提出一个基本要求——政策不变性(policy invariance),并将其转化为三个可测试原则:基于认证等价重写下的评分标准语义不变性、故意从严格到宽松调整阈值时的评分阈值不变性,以及对模糊情况敏感的校准机制。解决方案的关键在于构建一套压力测试协议,通过在ASSEBench和R-Judge数据轨迹上对四类代理裁判进行测试,发现当前裁判无法区分有意义的规范性变化与无意义的结构改写,导致安全评分混淆了智能体行为与评价提示本身。作者进一步提出了政策不变性得分(Policy Invariance Score)和裁判卡片报告协议(Judge Card reporting protocol),揭示了裁判可靠性存在数量级差异,而这一差异在仅关注准确率的排行榜中完全不可见,从而为未来安全基准提供了一种自我审计工具。

链接: https://arxiv.org/abs/2605.06161
作者: Shihao Weng,Yang Feng,Xiaofei Xie
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 9 pages

点击查看摘要

Abstract:LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent’s behavior or merely on how the evaluation policy happens to be worded. We argue that any trustworthy safety judge must satisfy a basic property we call policy invariance, and we operationalize it as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rubric-threshold invariance under intentional strict-to-lenient shifts, and ambiguity-aware calibration so that verdict instability concentrates on genuinely ambiguous cases. Instantiating these principles as a stress-test protocol with four agent-class judges on trajectories drawn from ASSEBench and R-Judge, we surface a previously unmeasured failure mode: today’s judges respond to meaningful normative shifts and to meaningless structural rewrites with comparable strength, and cannot tell the two apart. Content-preserving policy rewrites flip up to 9.1% of verdicts above baseline jitter, and 18-43% of all observed flips occur on unambiguous cases under such rewrites, so existing safety scores conflate what the agent did with how the evaluator was prompted. Beyond the diagnosis, we contribute the Policy Invariance Score and the Judge Card reporting protocol, which expose an order-of-magnitude spread in judge reliability that is invisible to accuracy-only leaderboards. We release the protocol and code so that future agent-safety benchmarks can audit their own evaluators rather than trust them by default.

[AI-68] Entropy-Regularized Adjoint Matching for Offline RL

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中生成式策略(Generative Policy)因依赖固定行为分布而引发的两个核心问题:一是“流行度偏差”(popularity bias),即高奖励动作在低密度区域被抑制;二是“支持绑定”(support binding),限制了对分布外(off-manifold)高奖励区域的探索。解决方案的关键在于提出一种统一框架——最大熵邻接匹配(Maximum Entropy Adjoint Matching, ME-AM),其核心机制包括:(1) 通过镜面下降(Mirror Descent)熵最大化目标缓解流行度偏差,从而更有效地从离线数据集中提取最优策略;(2) 引入混合行为先验(Mixture Behavior Prior)数学上扩展几何支持范围,以覆盖分布外的高奖励区域,同时保持生成向量场的绝对连续性,实现稳健动作识别与有效探索。

链接: https://arxiv.org/abs/2605.06156
作者: Abdelghani Ghanem,Mounir Ghogho
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the continuous adjoint method, it remains inherently bound to the fixed behavior distribution. This dependence induces a popularity bias that can suppress high-reward actions in low-density regions, and creates a support binding that restricts off-manifold exploration. Existing workarounds, such as appending residual Gaussian policies, often re-introduce the expressivity bottlenecks associated with unimodal distributions. In this work, we propose Maximum Entropy Adjoint Matching (ME-AM), a unified framework that addresses these limitations within the continuous flow formulation. ME-AM incorporates two mechanisms: (1) a Mirror Descent entropy maximization objective that mitigates the popularity bias to facilitate the extraction of optimal policies from offline datasets, and (2) a Mixture Behavior Prior that mathematically broadens the geometric support to encompass out-of-distribution high-reward regions. By exploring this extended geometry, ME-AM identifies robust actions while preserving the absolute continuity of the generative vector field. Empirically, ME-AM demonstrates competitive or superior performance compared to prior state-of-the-art (SOTA) methods across a diverse suite of sparse-reward continuous control environments.

[AI-69] Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models

【速读】:该论文旨在解决知识图谱基础模型(Knowledge Graph Foundation Models, KGFMs)在跨未知知识图谱(KGs)迁移表示能力受限的问题。其核心挑战在于知识图谱缺乏像语言或视觉任务中那样的固定离散网格结构,导致难以提取可泛化的结构不变性。解决方案的关键在于将图小片段(graphlets)——即小型连通子图——作为结构化令牌(structural tokens),构建一个通用的词汇表来捕捉异构知识图谱中的重复模式。通过引入闭合与开放的2-和3路径、星型图小片段等简单但稳健的结构单元,该框架能够在不依赖特定模型的情况下实现跨图谱的结构对齐与表示迁移,从而显著提升零样本归纳和归纳式链接预测性能。

链接: https://arxiv.org/abs/2605.06154
作者: Kossi Amouzouvi,Robert Wardenga,Jens Lehmann,Sahar Vahdati
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Foundation models excel at language, where sentences become tokens, and vision, where images become pixels, because both reduce to discrete symbols on a shared, fixed grid. Knowledge Graphs (KGs) share the discreteness, but not the geometry: their entities and relations are discrete symbols, yet their arrangement is relational and lacks a common, fixed grid. KGs form irregular, non-Euclidean topologies whose local neighborhoods differ from graph to graph. Therefore, Knowledge Graph Foundation Models (KGFMs) rely on identifying structural invariances to produce transferable representations. Without a universal token set, KGFMs are limited in their ability to transfer representations across unseen KGs. We close this gap by treating graphlets, small connected graphs, as structural tokens that recur in heterogeneous KGs. In this paper, we introduce a model-agnostic framework based on a vocabulary of graphlets that mines a KG for relational patterns via pattern matching. In particular, we consider closed and open 2- and 3-path graphlets, as well as star graphlets, to obtain robust invariances. The framework is evaluated on 51 KGs from a wide range of domains, for zero-shot inductive and transductive link prediction. Experiments show that adding simple graphlets to the vocabulary yields models that outperform prior KGFMs.
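下面用纯 Python 给出一个示意性草图:在小型无向图上统计开 2-路径(wedge)与三角形两类最基本的 graphlet,直观展示"以小子图为结构词元"的思路。这只是笔者假设性的玩具实现,并非论文代码,论文实际处理的是带关系类型的知识图谱,计数对象与方式都更复杂:

```python
from itertools import combinations

def count_basic_graphlets(edges):
    """在无向图上统计开 2-路径(wedge)与三角形,
    作为 graphlet"结构词汇"的最小示例。"""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    wedges = 0     # 开 2-路径:u-c-v 且 u、v 不相邻
    triangles = 0  # 闭 2-路径:u-c-v 且 u、v 相邻
    for c, nbrs in adj.items():
        for u, v in combinations(sorted(nbrs), 2):
            if v in adj[u]:
                triangles += 1
            else:
                wedges += 1
    return wedges, triangles // 3  # 每个三角形被三个中心各计一次

# 一个三角形外挂一个悬挂点:2 个 wedge、1 个三角形
print(count_basic_graphlets([(0, 1), (1, 2), (0, 2), (2, 3)]))  # → (2, 1)
```

真实的 KGFM 词汇还需区分关系类型与边方向,这里全部省略。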

[AI-70] AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)中传统固定折扣因子(discount factor)无法适应状态差异的问题,即在不同状态下应采用不同的折扣策略以更精确地建模长期回报。其核心挑战在于,若直接实现状态依赖的折扣机制(state-dependent discounting),可能导致TD误差崩溃(TD-error collapse)等不稳定现象。解决方案的关键在于提出AdaGamma方法:它通过联合学习一个状态相关的折扣函数(state-dependent discount function)与一种返回一致性目标(return-consistency objective),对值函数更新中的备份结构(backup structure)进行正则化,从而确保算法稳定并有效利用状态依赖的折扣机制。理论分析证明了该方法诱导的贝尔曼算子(Bellman operator)在合理条件下具有良好的适定性(well-posedness),实验表明AdaGamma在连续控制基准任务上可提升SAC和PPO性能,并在线A/B测试中取得统计显著收益,验证了其有效性。

链接: https://arxiv.org/abs/2605.06149
作者: Yaomin Wang,Jianting Pan,Ran Tian,Xiaoyang Li,Yu Zhang,Hengle Qin,Tianshu YU
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 9 figures

点击查看摘要

Abstract:The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor–critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor–critic method for state-dependent discounting that learns a state-dependent discount function together with a return-consistency objective to regularize the induced backup structure. On the theory side, we analyze the Bellman operator induced by state-dependent discounting and establish its basic well-posedness properties under suitable conditions. Empirically, AdaGamma integrates into both SAC and PPO, yielding consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be made effective in deep RL when coupled with a return-consistency objective that prevents degenerate target manipulation.
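状态相关折扣的核心改动可以用一个玩具化的 TD 目标来示意:用一个小模块(这里简化为线性打分加 sigmoid)把状态映射到区间 [γ_min, γ_max] 内的折扣,再代入贝尔曼备份。以下参数化形式纯属笔者假设,论文中的 AdaGamma 还包含返回一致性正则项,这里未体现:

```python
import numpy as np

GAMMA_MIN, GAMMA_MAX = 0.9, 0.999

def state_gamma(s, w):
    """玩具版状态相关折扣:线性打分经 sigmoid 压缩到 [GAMMA_MIN, GAMMA_MAX]。"""
    z = 1.0 / (1.0 + np.exp(-s @ w))  # sigmoid,取值 (0, 1)
    return GAMMA_MIN + (GAMMA_MAX - GAMMA_MIN) * z

def td_target(r, s_next, v_next, w):
    """用逐状态折扣替代全局常数折扣的贝尔曼备份。"""
    return r + state_gamma(s_next, w) * v_next

s_next = np.array([0.0, 0.0])   # sigmoid(0) = 0.5,折扣取区间中点
w = np.array([1.0, -1.0])
tgt = td_target(1.0, s_next, v_next=10.0, w=w)
print(round(tgt, 4))  # ≈ 10.495,即 1 + 0.9495 × 10
```

折扣始终落在给定区间内,避免了退化为任意小或大于 1 的折扣。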

[AI-71] Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization

【速读】:该论文旨在解决无监督预训练在目标条件强化学习(Goal-conditioned Reinforcement Learning, GCRL)中的理论基础不明确的问题,特别是针对互信息技能学习(Mutual Information Skill Learning, MISL)方法为何能支持下游目标达成任务这一核心疑问。其解决方案的关键在于将GCRL与MISL统一为控制最大化(control maximization)框架下的不同实例:作者识别出三种典型的GCRL形式化定义,并证明它们在最优策略上可能不等价;但共同点在于,一个表现良好的目标条件策略需具备对目标指令的高度敏感性,而这种敏感性的具体定义由GCRL形式决定。进一步地,作者指出MISL目标本质上是衡量技能敏感性的指标,且其性能受特定GCRL形式下下游目标敏感性的上界约束,从而建立了MISL方法与下游GCRL任务之间的精确对应关系——即对于每种GCRL形式,均存在匹配的MISL目标,使得技能多样性提升可直接增强目标敏感性。此理论框架为预训练策略设计提供了严谨依据,具有重要的实践指导意义。

链接: https://arxiv.org/abs/2605.06145
作者: Alireza Modirshanechi,Benjamin Eysenbach,Peter Dayan,Eric Schulz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information skill learning (MISL), discovers behaviorally diverse skills that can later be used for downstream goal-reaching. However, it remains a theoretical mystery why skills learned through MISL should support goal-reaching. A subtle challenge is that both GCRL and MISL are umbrella terms: different GCRL tasks use distinct criteria for measuring goal-reaching performance, while different MISL methods optimize distinct notions of behavioral diversity. We address this challenge and unify GCRL and MISL as instances of control maximization. We identify three canonical GCRL formulations and prove that they are fundamentally inequivalent: they can induce incompatible optimal policies even in the same environment. Nevertheless, they all share a common interpretation: a well-performing goal-conditioned policy is one whose future trajectory is highly sensitive to the commanded goal, with the precise notion of sensitivity determined by the GCRL formulation. Noting that MISL objectives can be understood as measures of skill-sensitivity akin to goal-sensitivity, we show that MISL objectives are bounded by formulation-specific downstream goal-sensitivities. These bounds establish a precise correspondence between MISL methods and downstream GCRL tasks: for every GCRL formulation, there exists a matching MISL objective for which more diverse skills afford greater downstream goal sensitivity. Our results thus lay a theoretical foundation for RL pretraining and have important practical implications, such as suggesting which pretraining objectives to use when a user cares about a specific class of downstream tasks.

[AI-72] SymDrift: One-Shot Generative Modeling under Symmetries

【速读】:该论文旨在解决生成式模型在物理系统(如分子)建模中如何高效处理全局对称性(如三维空间中的旋转不变性)的问题。现有方法如等变扩散模型和流匹配模型虽能有效纳入对称性约束,但依赖多步采样导致计算成本高昂;而新兴的漂移模型(drifting models)虽实现单步生成并取得优异性能,却面临对称性特异性挑战——即等变生成器无法保证与对称化目标分布产生一致的漂移场,若要修复此问题则需昂贵的对称化数据预处理。为避免这一代价,作者提出SymDrift框架,其核心创新在于使漂移场本身具备对称性感知能力:一是基于最优对齐的坐标空间对称漂移策略,二是通过G-不变嵌入消除对称性歧义。该方案显著提升了单步生成的质量与效率,在构象和过渡态生成任务上优于现有单步方法,并在计算效率上较多步基线降低高达40倍,适用于高通量药物筛选等场景。

链接: https://arxiv.org/abs/2605.06140
作者: Samir Darouich,Vinh Tong,Lluís Pastor-Pérez,Tanja Bien,Loay Mualem,Mathias Niepert
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative modeling of physical systems, such as molecules, requires learning distributions that are invariant under global symmetries, such as rotations in three-dimensional space. Equivariant diffusion and flow matching models can incorporate such invariances effectively, even when trained on a non-invariant empirical distribution, but they typically rely on costly multi-step sampling. Recently, drifting models have emerged as an efficient alternative, enabling single-step generation and achieving state-of-the-art performance in generative modeling tasks. However, we show that drifting models face a symmetry-specific challenge, since an equivariant generator does not generally produce the same drifting field as the one obtained from the symmetrized target distribution. Addressing this issue would require expensive symmetrization of the empirical distribution. To avoid this cost, we propose SymDrift, a framework that makes the drifting field itself symmetry-aware. We introduce two complementary strategies: (i) a symmetrized drift in coordinate space based on optimal alignment, and (ii) a G-invariant embedding that removes symmetry ambiguity by construction. Empirically, SymDrift outperforms existing one-shot methods on standard benchmarks for conformer and transition state generation, while remaining competitive with significantly more expensive multi-step approaches. By enabling one-shot inference, SymDrift reduces computational overhead by up to 40× compared to existing baselines, making it promising for high-throughput applications such as virtual drug screening and large-scale reaction network exploration.

[AI-73] Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)后训练中基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)优化策略中存在的隐式目标不明确与优化不稳定问题。现有方法如基于组的策略梯度(group-based policy gradient)虽能提升推理能力,但其优化过程缺乏对目标分布的显式建模,导致收敛性与多样性难以保障。解决方案的关键在于提出一种列表级策略优化(Listwise Policy Optimization, LPO)框架,其核心创新是将原策略梯度优化重构为在响应单纯形(response simplex)上的显式投影过程:首先通过限制近端RL目标于响应单纯形内以定义清晰的目标分布,再利用精确的散度最小化实现最优投影。该方法实现了列表级目标的单调提升、零和且自校正的投影梯度,并支持灵活选择不同散度度量以控制结构特性,从而在多种推理任务和LLM架构上显著优于传统基线,同时保持优化稳定性和响应多样性。

链接: https://arxiv.org/abs/2605.06139
作者: Yun Qu,Qi Wang,Yixiu Mao,Heming Zou,Yuhang Jiang,Yingyue Li,Wutong Xu,Lizhou Cai,Weijie Liu,Clive Bai,Kai Yang,Yangkun Chen,Saiyong Yang,Xiangyang Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
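摘要中"在响应单纯形上定义目标分布、再经散度最小化投影"的思路,可以用一个玩具例子示意:对同一 prompt 采样的一组响应,按 q_i ∝ π_i·exp(A_i/β) 构造单纯形上的目标分布,再用 KL 散度度量策略与目标的差距。该目标形式只是笔者的示意性假设,并非论文给出的精确公式:

```python
import numpy as np

def listwise_target(logp, adv, beta=1.0):
    """采样响应单纯形上的目标分布(假设形式):q_i ∝ π_i · exp(A_i / β)。"""
    logits = logp + adv / beta
    q = np.exp(logits - logits.max())
    return q / q.sum()

def projection_loss(logp, adv, beta=1.0):
    """把策略向目标投影:在该组响应上计算 KL(q ‖ p)。"""
    q = listwise_target(logp, adv, beta)
    p = np.exp(logp - logp.max())
    p = p / p.sum()
    return float(np.sum(q * (np.log(q) - np.log(p))))

logp = np.log(np.array([0.5, 0.3, 0.2]))   # 一组响应在策略下的 log 概率
adv = np.array([1.0, 0.0, -1.0])           # 组相对优势
q = listwise_target(logp, adv)
print(q.round(3), round(projection_loss(logp, adv), 4))
```

优势为零时目标与策略重合、损失为零,这对应"零和且自校正"的投影梯度直觉。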

[AI-74] BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases

【速读】:该论文旨在解决当前代码代理(coding agent)评估中忽视生成代码库作为“通信载体”价值的问题。传统评测仅关注代码行为正确性,而忽略了在实际工程场景中,由代理生成的代码库往往需被后续代理用于审查、审计或扩展,其设计意图与结构清晰度直接影响下游任务效率。解决方案的关键在于提出BUILD-AND-FIND协议:通过分离行为正确性与代码库可恢复性(recovery),量化下游代理从代码库中识别出原始设计意图的能力,包括恢复准确率、稳定性、实现覆盖率及检查努力程度。该协议引入“构建者-发现者”双角色机制,使评估不仅关注结果是否正确,更关注代码库能否有效传递设计信息,从而推动代码生成从“功能完成”向“可理解、可维护”的工程实践演进。

链接: https://arxiv.org/abs/2605.06136
作者: Jhen-Ke Lin
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 25 pages, 8 figures, 17 tables

点击查看摘要

Abstract:Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working context. In that setting, a generated repository is not only an answer to a task but also a communication artifact for future work. Even when strong agents nearly satisfy the visible behavioral objective, repositories can differ in how clearly they expose the intended behavior and design choices behind that behavior. We introduce BUILD-AND-FIND, a protocol for evaluating whether downstream agents can recover those intended choices from generated repositories, and how much inspection that recovery requires. For each task, a builder sees a hidden repository specification and creates a codebase; a finder sees only the codebase and a specification-traced multiple-choice question bank. The protocol separates behavioral correctness from artifact-side recovery and reports recovery accuracy, repeatability, implementation coverage, and inspection effort. Accuracy and stability act as gates: effort is interpreted only when recovery succeeds reliably. Among artifacts from which the same intent can be recovered, lower effort by the same finder suggests that the artifact makes that intent easier to locate. Question-only and spec-only controls quantify generic priors and specification access, while audits separate omitted claims from finder failures and check whether correct answers cite artifact evidence. In the released high-prior task pack, recovery accuracy is near saturation, so inspection effort and finder-specific effects provide the main panel-local comparison.

[AI-75] Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

【速读】:该论文旨在解决语言模型代理在多任务环境中难以有效复用成功策略的问题,即如何构建一个持久的技能库(skill library)以实现技能的选择、利用与提炼三者的协同进化。现有方法通常将这三个能力分别优化或依赖不同的奖励信号,导致进化过程不完整甚至冲突。其解决方案的关键在于提出Skill1框架,通过单一策略网络联合训练技能选择、利用和提炼,并统一以任务结果信号作为唯一学习来源:低频趋势用于奖励技能选择,高频变化用于奖励技能提炼,从而实现三者之间的高效协同演化。实验表明,该方法在ALFWorld和WebShop基准上显著优于现有基于技能和强化学习的基线模型。

链接: https://arxiv.org/abs/2605.06130
作者: Yaorui Shi,Yuxin Chen,Zhengxi Lu,Yuchun Miao,Shugui Liu,Qi GU,Xunliang Cai,Xiang Wang,An Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks candidates to select one, solves the task conditioned on it, and distills a new skill from the trajectory. All learning derives from a single task-outcome signal. Its low-frequency trend credits selection and its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing any credit signal degrades the evolution.
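摘要中"低频趋势奖励技能选择、高频变化奖励技能提炼"的信用分解,可以用滑动平均做一个示意:趋势取结果序列的滑动均值,残差即高频部分。窗口大小与分解方式均为笔者假设:

```python
import numpy as np

def split_credit(outcomes, window=4):
    """把任务结果序列分解为低频趋势(归功于技能选择)
    与高频残差(归功于技能提炼)。"""
    outcomes = np.asarray(outcomes, dtype=float)
    kernel = np.ones(window) / window
    # 左侧 edge 填充,使滑动均值与原序列等长
    padded = np.pad(outcomes, (window - 1, 0), mode="edge")
    trend = np.convolve(padded, kernel, mode="valid")
    residual = outcomes - trend
    return trend, residual

trend, residual = split_credit([0, 0, 1, 1, 1, 1, 0, 1])
print(trend.round(2))
print(residual.round(2))
```

两部分相加严格还原原始结果信号,因此单一任务结果信号即可同时驱动两类信用。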

[AI-76] P-Guide: Parameter-Efficient Prior Steering for Single-Pass CFG Inference

【速读】:该论文旨在解决流匹配(flow matching)中Classifier-Free Guidance (CFG) 由于每步采样需两次前向传播而导致的显著计算开销问题。其解决方案的关键在于提出P-Guide框架,该框架通过仅调节初始潜在状态(latent state)即可实现高质量引导,从而在单次推理过程中完成生成任务,无需在采样过程中显式外推速度场(velocity field)。在一阶近似下,P-Guide等价于传统CFG,能够从先验空间引导生成过程,同时支持同方差(homoscedastic)与异方差(heteroscedastic)先验建模,联合建模均值与方差可实现自适应损失衰减并增强对数据不确定性的鲁棒性。实验表明,P-Guide将推理延迟降低约50%,同时保持与双通道CFG基线相当的保真度和提示对齐能力。

链接: https://arxiv.org/abs/2605.06124
作者: Xin Peng,Ang Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Classifier-Free Guidance (CFG) is essential for high-fidelity conditional generation in flow matching, yet it imposes significant computational overhead by requiring dual forward passes at each sampling step. In this work, we address this bottleneck by introducing **P-Guide**, a framework that achieves high-quality guidance through a single inference pass by modulating only the initial latent state. We further show that, under a first-order approximation, P-Guide is equivalent to CFG in the sense that it steers generation from the prior space, without requiring explicit velocity field extrapolation during sampling. We consider both homoscedastic and **heteroscedastic** priors, and find that jointly modeling the mean and variance enables adaptive loss attenuation and improved robustness to data uncertainty. Extensive experiments demonstrate that P-Guide reduces inference latency by approximately 50% while maintaining fidelity and prompt alignment competitive with standard dual-pass CFG baselines.
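用一个线性玩具速度场可以直观对比两者的采样开销:标准 CFG 每步做条件/无条件两次前向并外推速度,而 P-Guide 式的思路只平移初始潜变量、每步单次条件前向。下面的速度场和 shift 取值均为示意性假设,与论文的具体构造无关:

```python
import numpy as np

def velocity(x, cond):
    """玩具流模型:速度把 x 拉向条件目标(无条件时拉向原点)。"""
    return (cond - x) if cond is not None else -x

def cfg_dual_pass(x0, cond, w=2.0, steps=10, dt=0.1):
    """标准 CFG:每步两次前向,外推速度 v_u + w·(v_c - v_u)。"""
    x = x0.copy()
    for _ in range(steps):
        v_u, v_c = velocity(x, None), velocity(x, cond)
        x = x + dt * (v_u + w * (v_c - v_u))
    return x

def p_guide_single_pass(x0, cond, shift=0.5, steps=10, dt=0.1):
    """P-Guide 式思路:只调制初始潜变量,之后每步单次条件前向,
    不再做速度外推。"""
    x = x0 + shift * cond          # 引导被移入先验空间
    for _ in range(steps):
        x = x + dt * velocity(x, cond)
    return x

x0, cond = np.zeros(2), np.array([1.0, 1.0])
print(cfg_dual_pass(x0, cond).round(3), p_guide_single_pass(x0, cond).round(3))
```

两者都把样本推向条件目标,但后者每步少一次前向,对应约 50% 的推理延迟节省。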

[AI-77] Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs

【速读】:该论文旨在解决自动启发式设计(Automatic Heuristic Design, AHD)中效率低、泛化能力弱的问题,尤其是在组合优化(Combinatorial Optimization, CO)领域。现有方法多采用自底向上的范式,即从可执行代码出发,通过执行反馈逐步提炼启发式规则,但这一过程难以显式保留和复用知识。论文提出一种互补的自顶向下视角:将“知识”作为主要搜索对象,代码仅用于实例化和验证这些知识,从而使得学习到的启发式原则具有可解释性和跨问题迁移性。其解决方案的关键在于引入统计学习视角下的“失真-压缩权衡”(distortion-compression trade-off),并在此基础上在基于种群和基于树的AHD框架中实现知识优先的搜索策略,显著提升了启发式发现效率、迁移能力和泛化性能,同时证明了结合两种策略能进一步优化结果。

链接: https://arxiv.org/abs/2605.06123
作者: Nguyen Viet Tuan Kiet,Bui Dinh Pham,Dao Van Tung,Tran Cong Dao,Huynh Thi Thanh Binh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 75 pages

点击查看摘要

Abstract:Large language models (LLMs) have recently advanced automatic heuristic design (AHD) for combinatorial optimization (CO), where candidate heuristics are iteratively proposed, evaluated, and refined. Most existing approaches search over executable programs and distill insights from execution feedback to guide later iterations. Because this process moves from low-level implementations to high-level principles, we refer to it as a bottom-up paradigm. We argue that this view is incomplete and introduce a complementary top-down perspective: knowledge becomes the primary search object and code merely instantiates and tests it, making what is learned explicit and reusable across problems and trajectories. We formalize this shift through a statistical-learning view that exposes a distortion–compression trade-off, and instantiate it in both population-based and tree-based AHD frameworks. Across CO and tasks beyond it, knowledge-first search improves discovery efficiency, transfer, and generalization, often outperforming code-centric pipelines, while combining both strategies yields further gains. Our results suggest that progress in AHD depends on iteratively constructing and evolving interpretable hypotheses that retain value beyond a single search trajectory.

[AI-78] Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段因计算成本高而导致的效率瓶颈问题,尤其是在需要复杂推理任务时。现有方法要么依赖人工设计的链式思维(Chain-of-Thought, CoT)状态路由策略,限制了性能上限;要么需训练大型过程奖励模型(Process Reward Model),在实际应用中可能不可行。论文提出将逐步模型路由建模为一个带约束的决策问题,并通过强化学习训练一个小型控制策略(Control Policy),结合阈值校准(Threshold Calibration)来调节性能与效率之间的权衡。其核心创新在于无需依赖复杂的奖励模型即可实现高效且稳定的多模型协同推理,显著优于手工路由策略,在多个数学基准测试(GSM8K、MATH500 和 OmniMath)上实现了更优的准确率-成本平衡。

链接: https://arxiv.org/abs/2605.06116
作者: Wenwen Si,Insup Lee,Osbert Bastani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inference-time computation has greatly enhanced the performance of large language models (LLMs) on challenging reasoning tasks, but this strategy can incur high inference costs. One solution is to route intermediate chain-of-thought (CoT) states to language models of different sizes; however, existing approaches rely on handcrafted routing strategies that limit performance, or on training large process reward models that may be infeasible in many applications. We formulate stepwise model routing as a constrained decision-making problem, which we solve by training a small control policy using reinforcement learning in conjunction with threshold calibration to tune the performance-efficiency tradeoff. We validate our method on three math benchmarks (GSM8K, MATH500, and OmniMath) on both open and closed models. Our method consistently improves the accuracy-cost tradeoff compared to handcrafted approaches, while achieving a comparable tradeoff to methods that require training large process reward models.
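逐步路由加阈值校准的机制可以用几行代码示意:小控制策略给每个 CoT 步骤打分,低于阈值 τ 的步骤升级到大模型。打分函数、10 倍成本比等均为假设,仅用于说明性能与成本的权衡方式:

```python
def route_reasoning(steps, policy_score, tau, small_model, large_model):
    """逐步路由:置信度不低于阈值 tau 的步骤交给小模型,否则升级到大模型。"""
    trace, cost = [], 0
    for step in steps:
        if policy_score(step) >= tau:
            trace.append(small_model(step)); cost += 1
        else:
            trace.append(large_model(step)); cost += 10  # 假设大模型成本为 10 倍
    return trace, cost

# 玩具组件:分数 = 步骤是否"简单";两个"模型"只给步骤打标签
score = lambda s: 0.9 if "easy" in s else 0.2
small = lambda s: ("small", s)
large = lambda s: ("large", s)

trace, cost = route_reasoning(
    ["easy add", "hard integral", "easy check"], score, 0.5, small, large)
print(trace, cost)  # 中间一步升级到大模型,总成本 12
```

阈值校准对应于在验证集上扫描 τ,在准确率与成本曲线上选择工作点。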

[AI-79] CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨文化场景下因训练数据以英语为中心而导致的文化不恰当或文化错位响应问题。为实现对特定文化背景的适应性调整,同时保持模型在其他文化中的原始行为一致性,作者提出“跨文化知识插入”(cross-cultural knowledge insertion)任务,并构建了CrossCult-KIBench基准测试平台,用于评估知识插入的有效性及其对非目标文化的意外影响。其解决方案的关键在于引入Memory-Conditioned Knowledge Insertion(MCKI)方法:该方法利用冻结的MLLM表征从外部记忆库中检索相关文化知识,并在必要时将匹配条目作为条件提示前置注入输入,从而实现文化敏感的响应生成。实验表明,当前方法难以在文化适配效果与行为保真度之间取得平衡,凸显了开发更具文化适应性和责任意识的MLLMs的重要研究方向。

链接: https://arxiv.org/abs/2605.06115
作者: Zhen Zeng,Leijiang Gu,Feng Li,Jing Yu,Zenglin Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs), trained primarily on English-centric data, frequently generate culturally inappropriate or misaligned responses in cross-cultural settings. To mitigate this, we introduce the task of cross-cultural knowledge insertion, which focuses on adapting models to specific cultural contexts while preserving their original behavior in other cultures. To facilitate research in this area, we introduce CrossCult-KIBench, a comprehensive evaluation benchmark for assessing both the effectiveness of knowledge insertion and its unintended side effects on non-target cultures. The benchmark includes 9,800 image-grounded cases covering 49 culturally relevant visual scenarios across English, Chinese, and Arabic language-culture groups. It supports evaluation in both single-insert and sequential-insert settings. We also propose Memory-Conditioned Knowledge Insertion (MCKI) as a baseline method. MCKI retrieves relevant cultural knowledge from an external memory using frozen MLLM representations, prepending matched entries as conditional prompts when applicable. Extensive experiments on CrossCult-KIBench reveal that current approaches struggle to balance effective cultural adaptation with behavioral preservation, highlighting a key challenge in developing culturally-aware MLLMs. Our work thus underscores an important research direction for developing more culturally adaptive and responsible MLLMs.

[AI-80] Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLM s

【速读】:该论文旨在解决多任务强化学习(Multi-Task Reinforcement Learning, MTRL)在代码生成大模型(Large Language Models, LLMs)后训练中效率与效果不足的问题,具体表现为现有方法对所有编码任务采用统一处理策略,忽视任务间的学习潜力差异和协同效应,导致训练资源分配不合理且优化过程缺乏动态适应性。解决方案的关键在于提出ASTOR框架,其核心创新是引入“任务效用”(task utility)这一信号来量化每项任务的学习潜力与跨任务协同价值,并构建两个耦合模块:一是分层效用驱动的数据调度模块,实现训练预算的智能分配与高信息量提示的优先选择;二是自适应效用校准的策略优化模块,动态调整各任务的KL散度正则化强度,以匹配其当前训练状态。该机制显著提升了单模型在多个编码任务上的整体性能表现。

链接: https://arxiv.org/abs/2605.06111
作者: Yujia Chen,Yang Ye,Xiao Chu,Yuchi Ma,Cuiyun Gao
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified multi-task RL (MTRL) approach. However, existing MTRL methods treat all coding tasks uniformly, relying on fixed data curricula under a shared optimization strategy, ultimately limiting the effectiveness of multi-task training. To address these limitations, we propose ASTOR, a multi-tASk code reinforcement learning framework via uTility-driven coORdination. Centered on task utility, a signal capturing each task's learning potential and cross-task synergy, ASTOR comprises two coupled modules: 1) Hierarchical Utility-Routed Data Scheduling module hierarchically allocates training budget and prioritizes informative prompts, steering training toward the most valuable data, and 2) Adaptive Utility-Calibrated Policy Optimization module dynamically scales per-task KL regularization, matching update constraints to each task's current training state. Experiments on two widely-used LLMs across four representative coding tasks demonstrate that ASTOR consistently improves a single model across all tasks, outperforming the best task-specific specialist by 9.0%-9.5% and surpassing the strongest MTRL baseline by 7.5%-12.8%.

[AI-81] Shallow Prefill Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

【速读】:该论文旨在解决 decoder-only 语言模型在长上下文推理中因预填充(Prefill)阶段处理大量提示token而导致的计算与内存开销过高的问题。传统方法通常将所有提示token的键值缓存(KV-cache)存储于每一层,且在自回归解码阶段反复访问,造成显著资源浪费。其解决方案的关键在于提出一种相位不对称的KV可见性策略——浅层预填、深层解码(Shallow Prefill, dEEp Decode, SPEED),即仅在较低层保留提示token的KV状态,而在解码阶段保持所有token(包括提示和生成token)的完整深度可见性。这一设计通过移除上层对提示token的KV访问路径,在不显著牺牲基准性能的前提下,大幅降低长上下文推理时的首字延迟(TTFT)、每token时间(TPOT)及活跃KV缓存占用。

链接: https://arxiv.org/abs/2605.06105
作者: Jungsuk Oh,Hyeseo Jeon,Hyunjune Ji,Kyongmin Kong,Jay-Yoon Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce *Shallow Prefill, dEEp Decode* (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal BoS anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75% of layers for prefill tokens reaches 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving TTFT by 33%, TPOT by 22%, and reducing active KV memory by 25.0% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
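层不对称的 KV 可见性可以用一个布尔矩阵示意:prompt token(保留 BoS 锚点)只在较浅的前 75% 层可见,解码 token 全深度可见。这是对论文机制的玩具化转述,真实实现作用在注意力计算与 KV-cache 布局上:

```python
import numpy as np

def kv_visibility(num_layers, prompt_len, decode_len, shallow_frac=0.75):
    """SPEED 式逐层 KV 可见性:除 BoS 锚点外,prompt token 的 KV
    仅在前 shallow_frac 比例的浅层可见;解码 token 保持全深度可见。"""
    cutoff = int(num_layers * shallow_frac)
    total = prompt_len + decode_len
    vis = np.zeros((num_layers, total), dtype=bool)
    vis[:, 0] = True                  # BoS 锚点:所有层可见
    vis[:cutoff, :prompt_len] = True  # prompt 的 KV 只留在浅层
    vis[:, prompt_len:] = True        # 解码 token:全深度可见
    return vis

vis = kv_visibility(num_layers=4, prompt_len=6, decode_len=2)
print(vis.astype(int))  # 第 4 层(最后一行)只看到 BoS 与解码 token
```

上例中最深一层省去了 5 个 prompt token 的 KV,对应活跃 KV 内存的节省来源。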

[AI-82] Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer

【速读】:该论文旨在解决决策Transformer(Decision Transformer, DT)在离线强化学习中因引入Return-to-Go(RTG)作为独立token而导致的计算效率低下问题。RTG是一个标量,信息密度远低于状态(state)或动作(action)向量,但其与状态和动作共享相同的计算预算,且由于Transformer自注意力机制的复杂度随序列长度呈平方增长,RTG的加入显著增加了冗余计算开销。解决方案的关键在于将RTG信息嵌入到状态表示中(即在序列建模前将其注入状态特征),从而移除RTG作为独立token的参与,仅保留紧凑的状态-动作(state-action)序列进行建模。这一设计使序列长度减少三分之一,直接提升了推理效率,并在D4RL基准上实现了优于标准DT的性能,同时达到当前最优方法的水平,验证了将稀疏条件信号(RTG)从信息丰富序列中解耦的策略兼具计算优势与任务性能提升。

链接: https://arxiv.org/abs/2605.06104
作者: Yongyi Wang,Hanyu Liu,Lingfeng Li,Bozhou Chen,Ang Li,Qirui Zheng,Xionghui Yang,Chucai Wang,Wenxin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decision Transformer (DT) formulates offline reinforcement learning as autoregressive sequence modeling, achieving promising results by predicting actions from a sequence of Return-to-Go (RTG), state, and action tokens. However, RTG is a scalar that summarizes future rewards, containing far less information than typical state or action vectors, yet it consumes the same computational budget per token. Worse, the self-attention cost of Transformers grows quadratically with sequence length, so including RTG as a separate token adds unnecessary overhead. We propose SlimDT, which removes RTG from the autoregressive sequence. Instead, we inject RTG information into the state representations before the sequential modeling step, allowing the Transformer to process only a compact (state, action) sequence. This reduces the sequence length by one-third, directly improving inference efficiency. On the D4RL benchmark, SlimDT surpasses standard DT across various tasks and achieves performance comparable to existing state-of-the-art methods. Decoupling a sparse conditioning signal from an information-rich sequence thus yields both computational gains and higher task performance.
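把 RTG 注入状态表征、序列中只保留 (state, action) 词元的做法,可以用随机权重的玩具嵌入示意。注入方式 s_emb = W_s·s + w_r·g 为笔者假设,论文的实际融合方式可能不同:

```python
import numpy as np

def build_slimdt_tokens(states, actions, rtgs, W_s, W_a, w_r):
    """把 RTG 折叠进状态嵌入,然后只交错 (state, action) 词元:
    序列长度为 2T,而标准 Decision Transformer 为 3T。"""
    tokens = []
    for s, a, g in zip(states, actions, rtgs):
        s_emb = W_s @ s + w_r * g   # RTG 作为标量调制状态词元
        a_emb = W_a @ a
        tokens.extend([s_emb, a_emb])
    return np.stack(tokens)

T, d_s, d_a, d = 5, 3, 2, 4
rng = np.random.default_rng(0)
states, actions = rng.normal(size=(T, d_s)), rng.normal(size=(T, d_a))
rtgs = np.linspace(1.0, 0.2, T)   # RTG 随时间递减
W_s, W_a = rng.normal(size=(d, d_s)), rng.normal(size=(d, d_a))
w_r = rng.normal(size=d)
seq = build_slimdt_tokens(states, actions, rtgs, W_s, W_a, w_r)
print(seq.shape)  # → (10, 4),即 2T 个词元而非 3T
```

由于自注意力开销随序列长度平方增长,这三分之一的长度缩减直接转化为推理加速。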

[AI-83] Safety Certification is Classification

【速读】:该论文旨在解决动态系统在不确定性条件下安全性认证的问题,特别是针对现有基于动态规划(Dynamic Programming, DP)的方法在长时 horizon 下因递归计算导致的安全概率估计误差累积问题,该误差会使得认证结果退化为无意义的下界。解决方案的关键在于提出一种核嵌入(kernel embedding)框架,将安全性认证转化为轨迹数据上的分类问题,直接估计 T 步安全概率而无需递归,从而避免误差传播;该框架不仅涵盖了已有的方法(如屏障证书、鲁棒马尔可夫模型)作为特例,还突破了它们对马尔可夫性假设的限制,使认证适用于非马尔可夫动态系统,并在仿真中验证了其在长时 horizon 和非马尔可夫场景下的稳定性与有效性。

链接: https://arxiv.org/abs/2605.06087
作者: Oliver Schön,Licio Romao,Sadegh Soudjani
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 32 pages, 18 figures

点击查看摘要

Abstract:The goal of this paper is certifying safety of dynamical systems subject to uncertainty. Existing approaches use trajectory data to estimate transition probabilities, and compute safety probabilities recursively via dynamic programming (DP). This recursion may lead to compounding errors in the certified safety probability, thus collapsing to a vacuous lower bound for growing horizons T. We propose a kernel embedding framework that treats safety certification as a classification problem on trajectory data, directly estimating the T-step safety probability without recursion. We show that the framework subsumes well-established approaches from the literature (e.g., barrier certificates, robust Markov models) as special cases, and allows us to go beyond their limitations. As the main consequence, it bypasses compounding error across the horizon and enables certification for systems with non-Markovian dynamics. We demonstrate that direct estimators remain stable independent of the certification horizon and in the non-Markovian setting, whilst DP-based certificates silently go unsound – confirmed in simulation on a neural-controlled quadrotor.
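"把安全认证当作分类"的直觉可以用核岭回归拟合轨迹级安全标签做一个玩具演示:不做逐步 DP 递归,而是直接从初始状态预测"整条轨迹在 T 步内是否安全"。核函数、数据与标签生成方式均为假设,仅用于说明直接估计的形式:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def fit_safety_classifier(x0s, safe_labels, lam=1e-3):
    """核岭回归拟合 T 步安全指示变量:直接从整条轨迹的标签
    估计 P(safe | x0),不做逐步递归。"""
    K = gaussian_kernel(x0s, x0s)
    alpha = np.linalg.solve(K + lam * np.eye(len(x0s)), safe_labels.astype(float))
    return lambda x: np.clip(gaussian_kernel(np.atleast_2d(x), x0s) @ alpha, 0, 1)

# 玩具数据:假设从原点附近出发的轨迹在全部 T 步内保持安全
rng = np.random.default_rng(1)
x0s = rng.uniform(-2, 2, size=(200, 2))
labels = np.linalg.norm(x0s, axis=1) < 1.0  # 代表"全部 T 个状态均安全"
p_safe = fit_safety_classifier(x0s, labels)
print(p_safe([0.0, 0.0]), p_safe([2.0, 2.0]))
```

由于标签刻画整条轨迹,估计误差不随认证时程 T 累积,这正是摘要强调的相对 DP 的优势。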

[AI-84] VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

【速读】:该论文旨在解决传统大语言模型(Large Language Model, LLM)服务系统依赖单一通用栈、难以适配多样化使用场景的问题。现有系统虽经长期人工调优,但其设计目标是广谱兼容性,导致在非标准场景下无法充分挖掘特定模型架构、工作负载特征或硬件特性带来的优化潜力。解决方案的关键在于提出 VibeServe——首个端到端自动合成专用 LLM 服务系统的多智能体循环框架:外层智能体负责规划和追踪系统设计方案的搜索过程,内层智能体则实现候选方案、验证正确性并测量性能;通过生成时专业化(generation-time specialization)替代运行时通用性(runtime generality),VibeServe 在标准部署中与 vLLM 性能相当,并在六类非标准场景中显著优于现有系统,证明了生成式基础设施设计的有效性和普适性。

链接: https://arxiv.org/abs/2605.06068
作者: Keisuke Kamahori,Shihang Li,Simon Peter,Baris Kasikci
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation-time specialization need not come at the cost of performance. More interestingly, in non-standard scenarios, VibeServe outperforms existing systems by exploiting opportunities that generic systems miss in six scenarios involving non-standard model architectures, workload knowledge, and hardware-specific optimizations. Together, these results suggest a different point in the design space for infrastructure software: generation-time specialization rather than runtime generality. Code is available at this https URL.

[AI-85] Normalized Architectures are Natively 4-Bit

【速读】:该论文旨在解决大语言模型在4-bit低精度训练中模型质量下降的问题,传统方法依赖于随机哈达玛变换(Hadamard transform)和张量级缩放等干预措施来维持性能,但这些手段复杂且不稳定。其解决方案的关键在于提出nGPT架构,通过将权重和隐藏表示约束在单位超球面(unit hypersphere)上,增强量化噪声的弱正相关性,使信号在隐藏维度上建设性累积而噪声趋于平均抵消,从而提升有效信噪比并平滑损失曲面,实现无需额外干预的稳定端到端NVFP4训练。

链接: https://arxiv.org/abs/2605.06067
作者: Maxim Fishman,Brian Chmiel,Ron Banner,Daniel Soudry,Boris Ginsburg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training large language models at 4-bit precision is critical for efficiency. We show that nGPT, an architecture that constrains weights and hidden representations to the unit hypersphere, is inherently more robust to low-precision arithmetic. This removes the need for interventions, such as applying random Hadamard transforms and performing per-tensor scaling calculations, to preserve model quality, and it enables stable end-to-end NVFP4 training. We validate this approach on both a 1.2B dense model and hybrid (Mamba-Transformer) MoE models of up to 3B/30B parameters. We trace this robustness to the dot product: while quantization noise remains largely uncorrelated in both standard and normalized architectures, the signal behaves differently. In nGPT, the hypersphere constraint enhances weak positive correlations among the element-wise products, leading to a constructive accumulation of the signal across the hidden dimension while the noise continues to average out. This yields a higher effective signal-to-noise ratio and a flatter loss landscape, with the effect strengthening as the hidden dimension grows, suggesting increasing advantages at scale. A reference implementation is available at this https URL.
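As a rough illustration of the 4-bit setting (symmetric integer quantization, not the paper's NVFP4 recipe), the sketch below applies per-tensor 4-bit quantization to a hypersphere-normalized vector; the per-element error is bounded by half a quantization step, and the quantized dot product can be compared against the exact one.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_4bit(x):
    """Symmetric per-tensor 4-bit quantization onto the integer levels -7..7."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -7, 7)
    return q * scale, scale

d = 4096
x = rng.standard_normal(d)
x_unit = x / np.linalg.norm(x)          # hypersphere constraint, as in nGPT
xq, scale = quantize_4bit(x_unit)
max_err = np.abs(x_unit - xq).max()     # bounded by half a quantization step

# Dot product against another unit vector: exact vs. quantized.
y = rng.standard_normal(d)
y_unit = y / np.linalg.norm(y)
yq, _ = quantize_4bit(y_unit)
exact, approx = float(x_unit @ y_unit), float(xq @ yq)
```

The paper's argument concerns how the signal term of such dot products accumulates constructively across the hidden dimension under normalization while the quantization noise averages out; this sketch only sets up the quantizer itself.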

[AI-86] Causal Reinforcement Learning for Complex Card Games: A Magic: The Gathering Benchmark

【速读】:该论文旨在解决因果强化学习(Causal Reinforcement Learning, Causal RL)中缺乏适用于复杂系统 benchmark 的问题,这些系统需同时具备序列决策、隐藏信息、大规模掩码动作空间以及显式因果结构。其解决方案的关键在于构建 MTG-Causal-RL 基准环境——基于《万智牌》(Magic: The Gathering)的 Gymnasium 平台,包含 3,077 维部分观测、478 动作的掩码离散动作空间、五种竞争标准套牌(Standard archetype)、三种奖励机制及一个手工指定的结构因果模型(Structural Causal Model, SCM)。该基准首次将因果变量暴露、SCM 预测干预效应与因子级信用追踪作为核心指标,使因果信用分配、留一法跨套牌迁移和策略可审计性成为第一类度量。此外,作者提出 Causal Graph-Factored Advantage PPO(CGFA-PPO),通过利用 SCM 中胜率父节点作为因子对齐的批评者目标,并引入干预校准损失(intervention-calibration loss),实现更精确的因果驱动策略优化。此工作为因果 RL、世界模型和大语言模型代理研究提供了一个统一测试平台,支持当前单一基准无法协同回答的问题:掩码动作空间下的因果信用分配、跨套牌结构迁移以及基于 SCM 的策略可审计性。

链接: https://arxiv.org/abs/2605.06066
作者: Cristiano da Costa Cunha,Ajmal Mian,Tim French,Wei Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures, 9 tables, 1 algorithm

点击查看摘要

Abstract:Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium benchmark built on Magic: The Gathering with a 3,077-dimensional partial observation, a 478-action masked discrete action space, five competitive Standard archetypes, three reward schemes, and a hand-specified Structural Causal Model (SCM) over strategic variables. Every episode exposes causal variables, SCM-predicted intervention effects, and per-factor credit traces, making causal credit assignment, leave-one-out cross-archetype transfer, and policy auditability first-class metrics. We adapt a panel of reference baselines: random, heuristic, masked PPO, a causal-world-model PPO variant, and an architecture-matched scalar control. We propose Causal Graph-Factored Advantage PPO (CGFA-PPO) as a reference causal agent that uses SCM parents of win probability as factor-aligned critic targets with an intervention-calibration loss. All comparisons use paired seeds, paired-bootstrap confidence intervals, and Holm-Bonferroni correction within pre-registered families. Masked PPO and CGFA-PPO reach competitive in-distribution win rates and exceed the random baseline; per-factor calibration trajectories and leave-one-out transfer gaps expose diagnostic structure that scalar win rate alone cannot. We release the benchmark, reference-baseline results, and full evaluation protocol openly. By coupling a strategically rich, partially observed domain with an explicit causal interface and statistical protocol, MTG-Causal-RL gives causal-RL, world-model, and LLM-agent research a shared testbed for questions current benchmarks cannot pose together: causal credit assignment under masked action spaces, structural transfer across archetypes, and SCM-grounded policy auditability.
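The 478-action masked discrete action space implies that every policy step must renormalize over the currently legal actions. A minimal sketch of such masked action selection (hypothetical logits and mask, not the benchmark's actual API):

```python
import math

def masked_policy(logits, legal_mask):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    assert any(legal_mask), "at least one action must be legal"
    m = max(l for l, ok in zip(logits, legal_mask) if ok)  # numerical stability
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, legal_mask)]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 0.5, -1.0, 1.5]
mask = [True, False, True, True]   # action 1 is illegal in this game state
probs = masked_policy(logits, mask)
```

In the real benchmark the mask comes from the game engine each turn; here it is hard-coded for illustration.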

[AI-87] TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models

【速读】:该论文旨在解决预训练表格基础模型(Tabular Foundation Models, TFM)在特定数据集或任务上适应性不足的问题,尤其是现有微调方法要么计算成本高(如全参数微调),要么依赖于特定架构(如LoRA等参数高效微调方法),且其对模型准确率与校准性能的提升效果尚不明确。解决方案的关键在于提出一种轻量级、架构无关的输入空间残差适配器——TFM-Retouche,该方法通过在输入空间中学习一个小型残差修正项来对齐输入数据与预训练模型的归纳偏置,从而实现无需修改主干网络的端到端训练;同时引入后训练身份保护机制,在验证集上适应无效时自动回退至原始冻结模型,确保性能稳定提升。实验证明,基于TabICLv2的TabICLv2-Retouche在TabArena-Lite基准上显著优于基线模型,Elo评分提升+56,并处于预测质量与训练/推理时间的帕累托前沿。

链接: https://arxiv.org/abs/2605.06047
作者: Duong Nguyen,Mohammed Jawhar,Nicolas Chesneau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tabular foundation models (TFMs), such as TabPFN-2.6, TabICLv2, ConTextTab, Mitra, LimiX, and TabDPT, achieve strong zero-shot performance through in-context learning, but their inductive biases remain fixed at inference time. Adapting a pretrained TFM to a specific dataset or task typically requires either full fine-tuning, which is computationally expensive, or parameter-efficient tuning methods (PEFT) such as LoRA, which must be tailored to the internal architecture of each TFM. Furthermore, the evidence on whether weight-space fine-tuning improves accuracy or calibration is mixed (tanna_exploring_2026; rubachev_finetuning_2025). We introduce TFM-Retouche, a lightweight input-space residual adapter that is architecture-agnostic by design with respect to the frozen TFM backbone. TFM-Retouche learns a small residual correction in the input space to align the input data with the inductive biases of the pretrained model. The adapter is trained end-to-end through the frozen TFM, with a post-training identity guard that falls back to the unmodified TFM whenever adaptation does not help on held-out validation. On TabArena-Lite (51 datasets spanning binary classification, multiclass classification, and regression), TabICLv2-Retouche – the framework instantiated on TabICLv2 – is the top-ranked method on the leaderboard with light per-task tuning and ensembling, lifting aggregate Elo by +56 over the frozen TabICLv2 base and sitting on the Pareto front of predictive quality versus both training and inference time.
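The input-space residual adapter plus identity guard can be sketched as follows. Everything here is a NumPy stand-in: `W_frozen` plays the frozen TFM backbone and `A` a (hypothetically already trained) adapter; none of these names or shapes come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = rng.standard_normal((4, 4))     # stand-in for the frozen TFM backbone

def frozen_model(x):
    return np.tanh(x @ W_frozen)

A = 0.01 * rng.standard_normal((4, 4))     # hypothetical trained adapter weights

def adapter(x):
    return x + x @ A                        # residual correction in input space

def val_loss(transform, X, Y):
    pred = frozen_model(transform(X))
    return float(((pred - Y) ** 2).mean())

X_val = rng.standard_normal((32, 4))
Y_val = np.tanh(X_val @ W_frozen)           # toy held-out targets

identity = lambda x: x
# Post-training identity guard: keep the adapter only if it actually helps
# on held-out validation; otherwise fall back to the unmodified model.
use_adapter = val_loss(adapter, X_val, Y_val) < val_loss(identity, X_val, Y_val)
chosen = adapter if use_adapter else identity
```

In this toy setup the targets are produced by the frozen model itself, so the guard correctly rejects the (useless) adapter; in the paper the adapter is trained end-to-end through the frozen backbone before the guard is applied.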

[AI-88] Optimal Transport for LLM Reward Modeling from Noisy Preference

【速读】:该论文旨在解决强化学习中人类反馈(Reinforcement Learning from Human Feedback, RLHF)时奖励模型(Reward Model)因真实数据集存在噪声偏好(noisy preference)而导致的过拟合问题。传统训练目标易受噪声干扰,而现有去噪方法常依赖同质噪声假设,难以刻画语言偏好复杂性。解决方案的关键在于提出SelectiveRM框架,其核心是基于最优传输理论:首先设计联合一致性差异(Joint Consistency Discrepancy)以对齐模型预测分布与偏好数据分布;其次引入质量松弛机制(Mass Relaxation),通过部分传输实现对语义不一致噪声样本的自动排除,从而优化一个更紧的未观测干净风险上界。

链接: https://arxiv.org/abs/2605.06036
作者: Licheng Pan,Haochen Yang,Haoxuan Li,Yunsheng Lu,Yongqi Tong,Yinuo Wang,Shijian Wang,Zhixuan Chu,Lei Shen,Yuan Lu,Hao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To handle these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepancy to align the distribution of model predictions with preference data. Furthermore, to address the limitation of strict mass conservation which compels the model to fit outliers, we incorporate a Mass Relaxation mechanism via partial transport. This enables the autonomous exclusion of samples with noisy preference that contradict semantic consistency. Theoretically, we demonstrate that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments validate that our approach significantly outperforms state-of-the-art baselines across diverse benchmarks.
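One common way to realize mass relaxation via partial transport (a generic sketch, not necessarily SelectiveRM's exact formulation) is to append a cheap "discard" column to the cost matrix that is allowed to absorb a fraction rho of the mass, so that samples whose predictions contradict their labels can route there instead of being force-fitted.

```python
import numpy as np

def sinkhorn(C, a, b, reg=0.1, iters=500):
    """Entropic OT via Sinkhorn iterations; returns the transport plan."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

pred = np.array([0.9, 0.8, 0.2, 0.7])     # model's preference probabilities
label = np.array([1.0, 1.0, 1.0, 0.0])    # observed labels (toy data)
C = np.abs(pred[:, None] - label[None, :])       # disagreement cost
tau, rho = 0.3, 0.25                             # discard cost / droppable mass
C = np.hstack([C, np.full((4, 1), tau)])         # dummy "discard" column
a = np.full(4, 0.25)                             # mass over samples
b = np.concatenate([np.full(4, (1 - rho) / 4), [rho]])
P = sinkhorn(C, a, b)
discard_share = P[:, -1] / a     # how much of each sample's mass was dropped
```

The plan still satisfies both marginals exactly, but up to a rho-fraction of the sample mass is "transported" to the discard bin rather than matched to labels, which is the relaxation of strict mass conservation the abstract describes.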

[AI-89] Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features

【速读】:该论文旨在解决当前量子机器学习(Quantum Machine Learning, QML)在音频识别任务中对时间-频率结构利用不足的问题,即多数方法将频谱图(spectrograms)视为普通图像处理,未能显式建模其时频特性。解决方案的关键在于提出Q-Patch,一种专为音频设计的量子特征映射方法:它从梅尔频谱图(mel-spectrograms)中提取局部时频块(time-frequency patches),并通过浅层、硬件高效电路将每个块压缩为一个四维声学描述符,并映射至最多三层深度的四量子比特电路,实现邻接感知的纠缠(adjacency-aware entanglement)。此设计使得在近期量子设备限制下仍可构建实用的量子核函数(quantum kernel),并在音频伪造检测任务中显著提升判别性能(AUROC达0.87),优于经典基线(RBF-SVM为0.82),同时展现出清晰的类别结构特征。

链接: https://arxiv.org/abs/2605.06035
作者: Lisan Al Amin,Rakib Hossain,Mahbubul Islam,Faisal Quader,Thanh Thi Nguyen
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits with adjacency-aware entanglement. Each selected patch is summarized by a compact four-dimensional acoustic descriptor and mapped to a four-qubit circuit with depth at most three, enabling practical quantum kernel construction under near-term constraints. We evaluate Q-Patch on an audio spoofing detection task using a controlled, balanced protocol and compare it with size-matched classical baselines. Q-Patch improves discrimination between bona fide and spoofed samples, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, compared with 0.82 for a radial basis function support vector machine (RBF-SVM) trained on the same patch-level features. Kernel-space analysis further reveals a clear class structure, with cross-class similarity around 0.615 and within-class self-similarity of 1.00. Overall, Q-Patch provides a practical framework for incorporating time-frequency-aware representations into quantum kernel learning for audio authenticity assessment in low-resource settings.
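A fidelity kernel of this shape is small enough to simulate exactly: four qubits give a 16-dimensional statevector. The sketch below uses RY angle encoding of a 4-d patch descriptor followed by a ring of CZ gates; the circuit details are illustrative, not the paper's exact ansatz, and the patch values are hypothetical.

```python
import numpy as np

def ry(theta):
    """Single-qubit RY rotation matrix (real-valued)."""
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])

def encode(features):
    """4-qubit state: one layer of RY angle encoding, then a ring of CZ gates
    on adjacent qubit pairs (adjacency-aware entanglement, depth <= 3)."""
    assert len(features) == 4
    state = np.array([1.0])
    for f in features:                      # |0> rotated by RY(f) per qubit
        state = np.kron(state, ry(f) @ np.array([1.0, 0.0]))
    for q1, q2 in [(0, 1), (1, 2), (2, 3), (3, 0)]:
        for idx in range(16):               # CZ flips the sign where both
            if (idx >> (3 - q1)) & 1 and (idx >> (3 - q2)) & 1:
                state[idx] = -state[idx]    # qubits are |1>
    return state

def kernel(x, y):
    """Fidelity quantum kernel |<psi(x)|psi(y)>|^2 (amplitudes are real here)."""
    return float(abs(encode(x) @ encode(y)) ** 2)

patch_a = [0.3, 1.1, 0.7, 0.2]   # hypothetical 4-d patch descriptors
patch_c = [2.0, 0.1, 2.5, 1.4]
k_self = kernel(patch_a, patch_a)
k_cross = kernel(patch_a, patch_c)
```

Self-similarity is exactly 1 (as the abstract's kernel-space analysis reports), and cross-patch similarity falls strictly below it; on hardware the same kernel would be estimated from measurement statistics rather than the statevector.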

[AI-90] When AI Meets Science: Research Diversity Interdisciplinarity Visibility and Retractions across Disciplines in a Global Surge

【速读】:该论文试图解决的问题是:人工智能(Artificial Intelligence, AI)在科学领域中是否能够引发广泛的知识范式转变(paradigm shifts),以及其快速采纳趋势对科研质量与全球科学格局的影响。研究的关键在于通过分析1960至2015年间各国及不同科学领域中AI支持的研究工作数量变化,揭示其采纳的时空异质性、增长模式及其潜在风险。结果显示,自2015年后AI采纳呈现指数级增长,但其应用高度集中于计算机科学和传统统计方法相关的少数主题,表明其在认识论层面的变革潜力有限;同时发现AI支持的研究存在不合理的引用溢价和更高的撤稿率,且中等收入亚洲国家(如中国)在AI采纳中的角色显著增强。因此,论文强调需通过改进研究实践来提升AI应用的透明度、可重复性和伦理规范,以释放其真正潜力,并对特定领域与地区加强监管与评估。

链接: https://arxiv.org/abs/2605.06033
作者: Andrés F. Castro Torres,Joan Giner-Miguelez,Mercè Crosas
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:The extent to which Artificial Intelligence (AI) can trigger generalized paradigm shifts in science is unclear. Although some of these technologies have revolutionized data collection and analysis in specific scientific fields such as Chemistry, their overall impact depends on the scope of adoption and the ways scholars use them. In this study, we document substantial differences in the timing and extent of AI adoption across countries and scientific domains from 1960 to 2015. After 2015, we find generalized exponential growth in AI adoption, with the number of AI-supported works multiplying by at least four across all domains. The transformative nature of this rapid growth is less apparent and points to multiple challenges should adoption trends persist. According to our analyses, AI-supported research is confined to very few topics with strong ties to Computer Science and conventional statistical frameworks, suggesting limited transformational potential in epistemological terms. AI-supported works are also associated with an unwarranted citation premium and exhibit substantially higher retraction rates than non-AI-supported works across most fields. Geographically, AI adoption displays pronounced heterogeneity at the country level, along with an acceleration in the relevance of middle-income countries in Asia, from China and beyond. Thus, the transformative capacity of AI in science remains largely untapped, and its rapid adoption underlines challenges in research openness, transparency, reproducibility, and ethics from a global perspective. We discuss how best research practices could boost the benefits of AI adoption and highlight fields and geographies where these trends warrant closer scrutiny.

[AI-91] Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters

【速读】:该论文旨在解决合成时间序列数据在时间序列预测任务中应用效果不明确的问题,特别是其对不同模型架构的影响机制尚缺乏系统性理解。解决方案的关键在于通过大规模实证研究(涵盖9个实验组、4,218次运行)系统评估了合成数据增强在五种主流时间序列模型架构上的表现,发现合成数据的效果具有显著的架构依赖性:通道混合型模型(如TimesNet、iTransformer)在多数情况下受益,而通道独立型模型(如DLinear、PatchTST)则普遍性能下降;同时指出仅季节性-趋势生成器(Seasonal-Trend generator)能稳定提升性能,且硬性课程切换策略会显著恶化结果(MSE上升24%)。由此得出可操作建议:应优先用于通道混合架构、采用渐进式调度策略,并将低资源场景下的增强视为架构与数据集相关的策略。

链接: https://arxiv.org/abs/2605.06032
作者: Hugo Cazaux,Eyjólfur Ingi Ásgeirsson,Hlynur Stefánsson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Synthetic data has transformed language model training, yet its role in time series forecasting remains poorly understood. We present a large-scale empirical study: nine experiment groups, 4,218 runs systematically evaluating synthetic time series augmentation across five architectures, four synthetic signals and seven datasets. The effect is sharply architecture-conditional: channel-mixing models (TimesNet, iTransformer) benefit in the majority of trials, while channel-independent models (DLinear, PatchTST) are consistently degraded. In selected low-resource settings the gains are striking: TimesNet trained on only 10% of Weather data with synthetic augmentation surpasses the full-data baseline (4 of 16 sparsity-dataset combinations). Averaged across all architectures, augmentation hurts in 67% of trials. We further find that only the Seasonal-Trend generator reliably helps across the tested benchmarks, and that hard curriculum switching is actively harmful (+24% MSE degradation). These results provide concrete, actionable guidelines on how to use synthetic data: use synthetic augmentation with channel-mixing architectures, use gradual annealing schedules, and treat low-resource augmentation as architecture- and dataset-dependent. Code is available at this https URL.
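The two ingredients the findings single out, a seasonal-trend generator and a gradual annealing schedule rather than a hard curriculum switch, can be sketched as follows (all parameter values are illustrative, not the paper's):

```python
import math, random

random.seed(0)

def seasonal_trend_series(n, period=24, trend=0.01, amp=1.0, noise=0.1):
    """Synthetic series: linear trend + sinusoidal seasonality + Gaussian noise."""
    return [trend * t + amp * math.sin(2.0 * math.pi * t / period)
            + random.gauss(0.0, noise) for t in range(n)]

def synthetic_ratio(epoch, total_epochs, start=0.5):
    """Gradual annealing of the synthetic-data share: start at 50% synthetic
    and decay linearly to pure real data, instead of switching abruptly."""
    return start * max(0.0, 1.0 - epoch / total_epochs)

series = seasonal_trend_series(96)                  # four days at hourly period
ratios = [synthetic_ratio(e, 10) for e in range(11)]
```

Each training batch would then mix `ratios[epoch]` synthetic windows with real ones; the schedule reaching exactly zero means late training sees only real data.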

[AI-92] Pathways to AGI

【速读】:该论文旨在解决当前生成式人工智能(Generative AI)发展路径的演化逻辑及其与社会、政治和经济结构之间关联性的问题,尤其关注人工通用智能(Artificial General Intelligence, AGI)概念的合理性与可行性。其解决方案的关键在于从批判性软件研究(critical software studies)视角出发,识别出塑造当前主流生成式AI工具的关键路径、杠杆节点(leverage nodes)及不同基础模型轨迹(如前沿专有模型、开源权重模型和特定领域/主权模型)之间的差异,并分析替代项目在关键转折点上的分化结果。通过这一系统性分析,论文提出了一套可行的社会技术发展计划,以推动接近AGI的能力演进,同时满足透明度、内容治理、福祉保障和可持续商业模式等多维要求。

链接: https://arxiv.org/abs/2605.06029
作者: Gordon Fletcher,Saomai Vu Khan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Additional data at https://doi.org/10.17866/rd.salford.32201874

点击查看摘要

Abstract:Our focus is on five related questions that stem from a critical software studies perspective. Underpinning this view is the acknowledged need to avoid assumptions regarding the inevitability of the current situation relating to AI. What we need to see is the closeness of the linkage between current commercial AI development and our prevailing social, political and economic circumstances. This does mean that the perspectives presented here are done so critically and conditionally. Most importantly, Artificial General Intelligence (AGI) is seen as being problematic both conceptually and definitionally. This conditioning of any view regarding AGI does lead the discussion in specific directions and to certain conclusions regarding the future. However, adopting this perspective enables the work to offer some final recommendations. We set out to ask the following questions: 1. What are the critical pathways that produced the current dominant generative AI tools (capabilities, product forms, adoption patterns)? 2. Which decision points acted as leverage nodes (small changes that had large downstream effects), and which dead ends reveal alternative possibilities that did not become dominant? 3. How do pathways differ across three foundational-model trajectories such as the frontier proprietary models, open-weight models or specific domain and sovereign models? 4. Which alternative projects branched from key leverage nodes, what is their current state, and why did some succeed, stall, fail or become absorbed? 5. Based on this analysis, what socio-technical development programmes could plausibly move toward AGI-adjacent capability while meeting requirements for transparency, moderation, wellbeing and sustainable business models?

[AI-93] Strat-LLM: Stratified Strategy Alignment for LLM-based Stock Trading with Real-time Multi-Source Signals IJCNN

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在自动化交易场景中因架构推理能力与策略一致性脱节而导致的性能不稳定问题。现有基准测试常忽视两者之间的动态交互,导致模型在真实市场环境下难以保持长期稳健表现。其解决方案的关键在于提出Strat-LLM框架,该框架基于分层策略对齐(Stratified Strategy Alignment)原则,在2025年实时前向环境中运行,融合序列价格、实时新闻和年度报告等异构数据以消除前瞻偏差(look-ahead bias)。通过引入自由模式(Free Mode)、严格模式(Strict Mode)与引导模式(Guided Mode)的多策略配置,实现了不同市场状态下模型行为的自适应调整:例如,在上升趋势中利用自由模式捕捉动量,在下跌趋势中启用严格模式控制回撤;同时发现中等规模模型(35B)在严格约束下表现最优,而超大规模模型(122B)则需在引导模式下释放潜力,从而系统性地提升了模型在复杂市场环境中的战略一致性与收益风险比。

链接: https://arxiv.org/abs/2605.06024
作者: Wenliang Huang,Zengyi Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by the 2026 International Joint Conference on Neural Networks (IJCNN)

点击查看摘要

Abstract:Large Language Models (LLMs) are evolving into autonomous trading agents, yet existing benchmarks often overlook the interplay between architectural reasoning and strategy consistency. We propose Strat-LLM, a framework grounded in Stratified Strategy Alignment. Operating in a live-forward setting throughout 2025, it integrates heterogeneous data including sequential prices, real-time news, and annual reports to eliminate look-ahead bias. Extensive stress tests on A-share and U.S. markets reveal: (1) reasoning-heavy models achieve peak utility in Free Mode via internal logic, whereas standard models require Strict Mode as a vital risk anchor; (2) alignment utility is regime-dependent, with Free and Guided modes capturing momentum in uptrending markets, while Strict Mode mitigates drawdowns in downtrends; (3) mid-scale models (35B) show optimal fidelity under strict constraints, whereas ultra-large models (122B) suffer an alignment tax under rigid rules but gain a performance premium in Guided Mode; (4) standard LLMs often fall into a high win-rate trap, optimizing for small gains at the expense of total returns, which can only be mitigated through deep reasoning or strict external guardrails. Project details are available at this https URL.

[AI-94] Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven

【速读】:该论文旨在解决在现代量化(quantization)方法中,使用随机 Hadamard 变换(Randomized Hadamard Transform, RHT)替代均匀随机旋转(Uniform Random Rotation, URR)时,对最坏情况输入下性能下降的问题。URR 能保证每个坐标独立服从近似高斯分布(在高维下收敛于标准正态分布),而单次 RHT 在最坏情况下无法保证此性质。论文的关键解决方案是:通过组合两次 RHT,可使任意 d 维输入向量的每个固定坐标的边缘分布与标准高斯分布之间的 Kolmogorov 距离和 1-Wasserstein 距离均控制在 $ O(d^{-1/2}) $ 内,从而在 DRIVE 和 QUIC-FL 等压缩方案中实现与 URR 相当的渐近性能;进一步地,针对向量量化(Vector Quantization, VQ)需要弱相关性的块结构特性,提出组合三次 RHT 可诱导坐标协方差衰减,确保基于 URR 设计的有限维多维码本在使用三重 RHT 时误差期望保持一致(仅差一个随维度趋于零的加性项)。此外,为应对实际输入通常非对抗性的情况,论文还引入一个线性时间 $ O(d) $ 的输入矩检测机制,动态调整 RHT 次数以提升运行时性能。

链接: https://arxiv.org/abs/2605.06014
作者: Ran Ben-Basat,William Kuszmaul,Michael Mitzenmacher,Amit Portnoy,Shay Vargaftik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Uniform random rotations (URRs) are a common preprocessing step in modern quantization approaches used for gradient compression, inference acceleration, KV-cache compression, model weight quantization, and approximate nearest-neighbor search in vector databases. In practice, URRs are often replaced by randomized Hadamard transforms (RHTs), which preserve orthogonality while admitting fast implementations. The remaining issue is the performance for worst-case inputs. With a URR, each coordinate is individually distributed as a shifted beta distribution, which converges to a Gaussian distribution in high dimensions. Generally, one RHT is not suitable in the worst case, as individual coordinates can be far from these distributions. We show that after composing two RHTs on any d-sized input vector, the marginal distribution of every fixed coordinate of the normalized rotated vector is within $O(d^{-1/2})$ of a standard Gaussian both in Kolmogorov distance and in 1-Wasserstein distance. We then plug these bounds into the analyses of modern compression schemes, namely DRIVE and QUIC-FL, and show that two RHTs achieve performance that asymptotically matches URRs. However, we show that two RHTs may not be sufficient for Vector Quantization (VQ), which often requires weak correlation across fixed-size blocks of coordinates (as opposed to only marginal distribution convergence for single coordinates). We prove that a composition of three RHTs leads to decaying coordinate covariance. This ensures that any fixed, bounded, multi-dimensional VQ codebook optimized for URRs has the same expected error when using three RHTs, up to an additive term that vanishes with the dimension. Finally, because practical inputs are rarely adversarial, we propose a linear-time $O(d)$ check on the input's moments to dynamically adapt the number of RHTs used at runtime to improve performance.
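The transform itself is cheap to sketch: a fast Walsh-Hadamard transform preceded by random sign flips, composed twice. On the worst-case basis-vector input, a single RHT leaves every coordinate at magnitude exactly 1/√d (far from Gaussian-looking); the second RHT mixes the signs. A minimal sketch (d must be a power of two):

```python
import math, random

random.seed(0)

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform in O(d log d); d a power of 2."""
    x = list(x)
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(d)
    return [v * s for v in x]

def rht(x, signs):
    """One randomized Hadamard transform: random sign flips, then H."""
    return fwht([s * v for s, v in zip(signs, x)])

d = 16
x = [1.0] + [0.0] * (d - 1)                  # worst-case input for one RHT
signs1 = [random.choice((-1.0, 1.0)) for _ in range(d)]
signs2 = [random.choice((-1.0, 1.0)) for _ in range(d)]
y1 = rht(x, signs1)     # every coordinate is +/- 1/sqrt(d): not Gaussian-like
y2 = rht(y1, signs2)    # the second RHT spreads the marginals out
norm2 = math.sqrt(sum(v * v for v in y2))
```

Both transforms are orthonormal, so the norm is preserved through the composition, which is what lets the quantization analyses carry over from URRs.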

[AI-95] A Fine-Grained Understanding of Uniform Convergence for Halfspaces

【速读】:该论文旨在解决半空间(halfspace)在统计学习理论中精细的统一收敛行为问题,突破传统最坏情况下的VC维边界限制。其核心问题是:在不同结构(同质与非同质)和维度下,半空间的学习误差如何随样本数量 $ n $ 和维度 $ d $ 变化?解决方案的关键在于区分两类半空间——对于 $ \mathbb{R}^d $($ d \geq 2 $)中的非同质半空间,证明标准一阶VC界本质上是紧的,即一致假设仍可能产生 $ \Theta(d\ln(n/d)/n) $ 的总体误差;而对于 $ \mathbb{R}^2 $ 中的同质半空间,则揭示了显著不同的收敛性质:可实现(realizable)情形下所有一致假设的误差为 $ O(1/n) $,而在不可知(agnostic)情形下通过“带状”(bandwise)的无对数因子(log-free)偏差界结合临界楔形定位(critical-wedge localization)论证,实现了仅含 $ \ln\ln n $ 额外开销的统一收敛分析,并进一步证明此开销不可避免。这一结果刻画了半空间统一收敛的精细结构,揭示了维度与模型结构之间的尖锐阈值效应。

链接: https://arxiv.org/abs/2605.06004
作者: Aryeh Kontorovich,Kasper Green Larsen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:We study the fine-grained uniform convergence behavior of halfspaces beyond worst-case VC bounds. For inhomogeneous halfspaces in $\mathbb{R}^d$ with $d \ge 2$, we show that standard first-order VC bounds are essentially tight: even consistent hypotheses can incur population error $\Theta(d\ln(n/d)/n)$, and in the agnostic setting the deviation scales as $\sqrt{\tau}\ln(1/\tau)$ at true error $\tau$. In contrast, homogeneous halfspaces in $\mathbb{R}^2$ exhibit a markedly different behavior. In the realizable case, every hypothesis consistent with the sample has error $O(1/n)$. In the agnostic case, we prove a bandwise, log-free deviation bound on each dyadic risk band via a critical-wedge localization argument. Unioning over bands incurs only a $\ln\ln n$ overhead, and we establish a matching lower bound showing this overhead is unavoidable. Together, these results give a fine-grained and nearly complete picture of uniform convergence for halfspaces, revealing sharp dimensional and structural thresholds.

[AI-96] TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

【速读】:该论文旨在解决语言模型代理(language model agents)在执行复杂软件工程任务时因长期轨迹导致的性能退化问题,即“代理漂移”(agent drift)。其核心挑战在于两种常见失败模式:过度思考(overthinking)——重复推理已有信息;以及过度行动(overacting)——未整合最新观察或获取新证据即调用工具。解决方案的关键是提出TACT(Think-Act Calibration via activation Steering),通过在残差流(residual stream)中检测并校准漂移方向来干预行为偏差。具体而言,TACT识别轨迹步骤为过度思考、过度行动或校准状态,并发现这些状态在隐藏空间中可沿两条“漂移轴”线性分离(AUC ≈ 0.9),从而在测试阶段将激活投影到这些轴上,将偏离校准区域的步骤拉回,实现对代理漂移的有效抑制。实验表明,TACT显著提升任务解决率(如Qwen3.5-27B提升+5.8个百分点)并减少解决问题所需步数(最多降低26%),验证了代理漂移可被作为残差流中的可操控方向进行干预。

链接: https://arxiv.org/abs/2605.05980
作者: Yuan Sui,Yulin Chen,Yibo Li,Xue Jiang,Yufei He,Yihong Dong,Xiaoxin He,Tianyu Gao,Bryan Hooi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Work in progress

点击查看摘要

Abstract:When language model agents tackle complex software engineering tasks, they often degrade over long trajectories, which we define as agent drift. We focus on two recurring failure modes, overthinking and overacting, i.e., where the agent repeatedly reasons over information it already has, and where it issues tool calls without integrating recent observations or acquiring new evidence. In this paper, we introduce TACT (Think-Act Calibration via activation Steering) to detect and mitigate agent drift in the residual stream before it surfaces as a behavioral failure. Specifically, we label trajectory steps as overthinking, overacting, or calibrated, and find that their hidden states can separate linearly along two drift axes, pointing from calibrated behavior toward each failure mode (AUC ≈ 0.9). To mitigate agent drift, we project each step's activation onto these axes at test time and pull drifted ones back toward the calibrated region. Experiments show that TACT outperforms unsteered baselines across SWE-bench Verified, Terminal-Bench 2.0, and CLAW-Eval, lifting average resolve rate by +5.8 pp on Qwen3.5-27B and +4.8 pp on Gemma-4-26B-A4B-it while cutting steps-to-resolve by up to 26%. These gains frame agent drift as a steerable direction in the residual stream, and position TACT as a viable handle for reliable long-horizon agents.
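The steering step can be sketched in a few lines. The means below are synthetic stand-ins for statistics that would be estimated from labeled trajectory steps; the projection-and-subtraction itself is the generic activation-steering operation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Stand-ins for mean activations of labeled "calibrated" / "overthinking" steps.
mu_calibrated = rng.standard_normal(dim)
mu_overthink = mu_calibrated + rng.standard_normal(dim)

axis = mu_overthink - mu_calibrated        # drift axis toward the failure mode
axis = axis / np.linalg.norm(axis)

def steer(h, axis, mu_ref, strength=1.0):
    """Project the activation onto the drift axis (relative to the calibrated
    mean) and remove that component, pulling the step back toward calibration."""
    drift = (h - mu_ref) @ axis
    return h - strength * drift * axis

h = mu_overthink + 0.1 * rng.standard_normal(dim)   # a drifted step's activation
h_steered = steer(h, axis, mu_calibrated)
before = float(abs((h - mu_calibrated) @ axis))
after = float(abs((h_steered - mu_calibrated) @ axis))
```

With full strength the drift component along the axis is removed exactly; in practice the strength would be tuned, and the second failure mode gets its own axis.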

[AI-97] BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)中后门攻击(Backdoor Attacks)带来的安全威胁问题。现有防御方法依赖奖励异常来逆向解析触发器,并通过模型微调移除后门,但复杂触发模式会削弱其鲁棒性,且微调过程成本高昂,难以实际应用。解决方案的关键在于将防御视角从依赖触发器的检测转向对后门行为特征的无感知识别:研究发现,无论是否存在触发器,植入后门的策略都会导致动作分布的一致性偏移,从而在高分位区域和分布尾部留下可检测痕迹。基于此,作者提出BehaviorGuard框架,设计了一种新指标以捕捉动作分布中的行为漂移,实现在线运行时对后门动作的识别与抑制,是首个适用于单智能体和多智能体DRL场景的在线后门防御机制。

链接: https://arxiv.org/abs/2605.05977
作者: Yinbo Yu,Xueyu Yin,Jiadai Wang,Chunwei Tian,Sai Xu,Qi Zhu,Daoqiang Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and model finetuning to remove backdoors. However, complex trigger patterns undermine their robustness, and fine-tuning entails high costs, limiting practical utility. Therefore, we shift defense concerns to trigger-agnostic backdoor output behaviors and propose BehaviorGuard, an online behavior-based backdoor detection and mitigation framework for DRL. Specifically, we find that regardless of attacks, backdoored policies induce consistent shifts in action distributions to ensure reliable activation, leaving detectable traces in high-quantile regions and distribution tails, even in the absence of triggers. Based on this, we design a novel metric that captures behavioral drift in action distributions to identify and suppress backdoor actions at runtime. To our knowledge, this is the first online backdoor defense that counters attacks both in single- and multi-agent DRL. Evaluated across diverse benchmarks with different backdoor attacks, BehaviorGuard consistently surpasses prior methods in both efficacy and efficiency.
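The paper's metric is not spelled out in the abstract, but a generic action-distribution drift score in the same spirit, combining divergence from a benign reference with the mass placed on actions that are rare under the reference (its tail), might look like the following. All distributions and the threshold are hypothetical.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def drift_score(p, ref, cutoff=0.05):
    """Divergence from the benign reference plus the probability mass the
    policy places on actions that are rare under the reference."""
    tail = sum(pi for pi, ri in zip(p, ref) if ri < cutoff)
    return kl(p, ref) + tail

ref = [0.50, 0.30, 0.15, 0.04, 0.01]          # benign action distribution
benign = [0.48, 0.33, 0.14, 0.04, 0.01]       # small natural fluctuation
backdoored = [0.05, 0.05, 0.05, 0.05, 0.80]   # mass shifted to a rare action
threshold = 0.5                                # hypothetical detection threshold
```

A runtime defense in this style would flag and suppress a step whenever the score exceeds the threshold, without ever needing to know the trigger.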

[AI-98] Prag Locker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts ICML2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)代理在非可信部署环境中,其任务特定提示(prompt)易被 adversaries 复制并用于其他专有模型从而造成知识产权损失的问题。现有保护方法无法同时满足主动性(proactivity)、运行时保护(runtime protection)、可用性(usability)和非可移植性(non-portability)四大挑战。解决方案的关键在于提出 PragLocker,一种通过锚定语义与代码符号构建功能保持的混淆提示,并利用目标模型反馈注入噪声,使生成的提示仅在特定目标 LLM 上有效,从而显著降低跨模型可移植性,同时保持目标任务性能并抵抗自适应攻击者。

链接: https://arxiv.org/abs/2605.05974
作者: Qinfeng Li,Yuntai Bao,Jianghui Hu,Wenqi Zhang,Jintao Chen,Huifeng Zhu,Yier Jin,Xuhong Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: accepted to the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:LLM agents rely on prompts to implement task-specific capabilities based on foundation LLMs, making agent prompts valuable intellectual property. However, in untrusted deployments, adversaries can copy and reuse these prompts with other proprietary LLMs, causing economic losses. To protect these prompts, we identify four key challenges: proactivity, runtime protection, usability, and non-portability that existing approaches fail to address. We present PragLocker, a prompt protection scheme that satisfies these requirements. PragLocker constructs function-preserving obfuscated prompts by anchoring semantics with code symbols and then using target-model feedback to inject noise, yielding prompts that only work on the target LLM. Experiments across multiple agent systems, datasets, and foundation LLMs show that PragLocker substantially reduces cross-LLM portability, maintains target performance, and remains robust against adaptive attackers.

[AI-99] Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

【速读】:该论文旨在解决当前基于无评论器(critic-free)的强化学习方法(如Group Relative Policy Optimization, GRPO)在训练大语言模型时存在的信用分配不均问题,即其假设所有推理步骤具有相同重要性(“uniform credit assignment”),导致无法区分关键推理环节,从而降低学习效率。解决方案的关键在于提出Selective Eligibility Traces (S-trace),其核心是基于部分信任区域保持的直觉,首先引入P-trace作为样本高效的无评论器优势追踪机制,进而构建S-trace,通过稀疏化优势传播机制——仅对高熵token保留eligibility traces,实现细粒度信用分配,显著降低方差并提升训练效率。实验表明,S-trace在多个Qwen3模型规模下均优于GRPO,并在样本和token效率上均有提升。

链接: https://arxiv.org/abs/2605.05965
作者: Chaoli Mou,Zhan Zhuang,Xinning Chen,Yu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving the reasoning abilities of large language models. However, widely used critic-free algorithms such as Group Relative Policy Optimization (GRPO) necessitate a “uniform credit assignment” assumption that indiscriminately broadcasts trajectory-level advantages, hindering learning efficiency by failing to distinguish critical reasoning steps. To address this limitation, we propose Selective Eligibility Traces (S-trace). Grounded in the intuition of partial trust region preservation, we initially introduce P-trace as a sample-efficient, critic-free eligibility traces method, upon which we build S-trace, implementing a sparse eligibility traces mechanism to further mitigate variance and achieve fine-grained credit assignment by selectively masking low-entropy tokens. Theoretically, we contextualize the recent Group Sequence Policy Optimization (GSPO) method within the critic-free eligibility traces framework, identifying it as a special instance of the eligibility traces method operating under uniform credit assignment. Experiments demonstrate that S-trace not only outperforms GRPO, showing gains of 0.49% on Qwen3-1.7B and 3.16% on Qwen3-4B, and maintaining a robust 2.98% improvement when scaled further to Qwen3-8B in average pass@16, but notably achieves this with simultaneously higher sample and token efficiency.
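A schematic contrast between uniform broadcasting and a selective, entropy-masked eligibility trace. This is a deliberately simplified reading of S-trace, not the paper's exact update rule; the decay, threshold, and per-token entropies are all hypothetical.

```python
def uniform_credit(advantage, n_tokens):
    """GRPO-style uniform credit: broadcast the trajectory advantage."""
    return [advantage] * n_tokens

def selective_trace_credit(advantage, entropies, decay=0.9, ent_threshold=1.0):
    """Selective-trace sketch: a decaying eligibility trace accumulates over
    steps, and credit is masked out entirely on low-entropy tokens."""
    credits, trace = [], 0.0
    for h in entropies:
        trace = decay * trace + 1.0        # accumulate eligibility
        credits.append(advantage * trace if h >= ent_threshold else 0.0)
    return credits

entropies = [0.2, 1.5, 0.1, 2.0, 0.3]      # hypothetical per-token entropies
u = uniform_credit(1.0, len(entropies))
s = selective_trace_credit(1.0, entropies)
```

Only the high-entropy tokens (positions 1 and 3 here) receive nonzero credit, which is the sparsity that distinguishes this from the uniform broadcast.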

[AI-100] From Coordinate Matching to Structural Alignment: Rethinking Prototype Alignment in Heterogeneous Federated Learning

【速读】:该论文旨在解决异构联邦学习(Heterogeneous Federated Learning, HtFL)中因客户端模型架构和数据分布差异导致的特征空间不一致问题。现有基于原型(prototype-based)的方法通常采用均方误差(MSE)或余弦相似度进行坐标对齐(coordinate alignment),即强制所有客户端将本地表示映射到全局原型定义的单一特征子空间中,这种做法在同质联邦学习(Homogeneous FL)中有效,但在HtFL中会因忽略客户端特有的特征子空间而抑制其学习能力。论文的关键创新在于提出FedSAF方法,通过将对齐目标从绝对坐标转向类间关系结构(inter-class relational structure),实现结构对齐(structural alignment),从而解耦“对齐类别语义结构”与“强制共享特征基底”这两个原本耦合的目标,显著提升异构场景下的性能表现。

链接: https://arxiv.org/abs/2605.05959
作者: Xinghao Wu,Jianwei Niu,Guogang Zhu,Xuefeng Liu,Shaojie Tang,Jiayuan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 14 pages, 10 figures, 9 tables

点击查看摘要

Abstract:Heterogeneous federated learning (HtFL) aims to enable collaboration among clients that differ in both data distributions and model architectures. Prototype-based methods, which communicate class-level feature centers (prototypes) instead of full model parameters, have recently shown strong potential for HtFL. Existing prototype-based HtFL methods typically reuse the MSE-based or cosine-based alignment mechanism developed for homogeneous FL when aligning client-specific representations with global prototypes. These approaches are essentially coordinate alignment, where representations of clients are forced to match the global prototypes in the embedding space in an element-wise manner. Such alignment implicitly assumes that all clients should map their representations into the feature subspace defined by the global prototypes. This assumption is reasonable in homogeneous FL, where all clients share the same feature extractor. However, it becomes problematic in HtFL, since heterogeneous feature extractors naturally induce client-specific feature subspaces, and forcing all clients to optimize within a single global subspace unnecessarily suppresses their learning capacity. We observe that coordinate alignment implicitly couples two distinct objectives: aligning inter-class semantic structure, which is directly beneficial for classification, and enforcing a shared feature basis, which is unnecessary and even harmful under model heterogeneity. Building on this insight, we design FedSAF, which shifts the alignment objective from absolute coordinates to inter-class relational structure. We demonstrate that structural alignment consistently outperforms coordinate alignment in heterogeneous settings. Experiments on multiple benchmarks show that our structural alignment outperforms state-of-the-art prototype-based HtFL methods by up to 3.52%.
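为对比"坐标对齐"与"结构对齐"的差异,可以构造一个数值示例:当客户端特征空间只是全局子空间的一个旋转(基底不同,但类间关系完全相同)时,逐元素坐标对齐(MSE)损失很大,而类间关系(余弦相似度矩阵)对齐损失接近于零(示意代码,损失的具体形式为笔者假设,并非论文实现):

```python
import numpy as np

def coordinate_alignment_loss(local_protos, global_protos):
    # Element-wise (coordinate) matching: forces a shared feature basis.
    return float(np.mean((local_protos - global_protos) ** 2))

def structural_alignment_loss(local_protos, global_protos):
    # Match inter-class relational structure instead of absolute coordinates:
    # compare the cosine-similarity matrices computed over class prototypes.
    def relation(p):
        p = p / np.linalg.norm(p, axis=1, keepdims=True)
        return p @ p.T
    return float(np.mean((relation(local_protos) - relation(global_protos)) ** 2))

rng = np.random.default_rng(0)
global_protos = rng.normal(size=(4, 8))           # 4 classes, 8-dim prototypes
rotation = np.linalg.qr(rng.normal(size=(8, 8)))[0]  # random orthogonal basis change
local_protos = global_protos @ rotation           # same structure, different basis

coord = coordinate_alignment_loss(local_protos, global_protos)
struct = structural_alignment_loss(local_protos, global_protos)
```

正交变换不改变类间余弦关系,因此结构对齐损失近似为零,而坐标对齐会错误地惩罚这种仅由异构特征提取器导致的基底差异。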

[AI-101] mporal Smoothness Doubly Robust Learning for Debiased Knowledge Tracing

【速读】:该论文旨在解决知识追踪(Knowledge Tracing, KT)中因练习推荐和学生选择的非随机性导致的严重选择偏差问题,此类偏差会使得基于观测日志的标准经验风险最小化方法产生有偏的学习掌握度估计,并在后续推荐中累积误差。解决方案的关键在于提出一种双重稳健(Doubly Robust, DR)框架,该框架通过联合建模倾向得分模型(propensity model)与误差插补模型(error imputation model),理论上保证只要其中一个模型正确,即可获得无偏估计。进一步地,作者识别出在序列化KT场景下,估计器方差引发的随机偏差随时间累积会导致训练不稳定和性能受限,据此推导出一个显式刻画估计器方差影响的一般化泛化界,并指出时间平滑性(temporal smoothness)是控制方差的关键因素;最终提出Temporal Smoothness Doubly Robust (TSDR) 框架,通过引入平滑正则项联合优化KT预测器与插补模型,在保持DR无偏性的前提下显著降低估计方差,从而提升模型稳定性和性能。

链接: https://arxiv.org/abs/2605.05958
作者: Peilin Zhan,Wei Chen,Weilin Chen,Shuyi Pan,Ruichu Cai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge Tracing (KT) is fundamental to intelligent education systems, yet relies on educational logs that are selectively observed. The non-random nature of exercise recommendations and student choices inevitably induces severe selection bias. Most existing KT methods neglect this issue, training on observed logs using standard empirical risk, which yields biased mastery estimates and accumulates errors in subsequent recommendations. To address this, we introduce a doubly robust (DR) formulation for KT that integrates a propensity model with an error imputation model, theoretically guaranteeing unbiasedness if either model is accurate. Beyond unbiasedness, in the sequential setting of KT, we identify that the estimator’s performance is compromised by variance-dependent stochastic deviations that accumulate over time, thereby causing training instability and limiting performance. To mitigate this, we derive a generalization bound that explicitly characterizes the impact of estimator variance and identifies temporal smoothness as a key factor in controlling it. Building on these theoretical insights, we propose the Temporal Smoothness Doubly Robust (TSDR) framework. TSDR jointly optimizes the KT predictor and the imputation model with a smoothness regularizer, effectively reducing variance while preserving the unbiasedness guarantee of DR. Experiments on multiple real-world benchmarks demonstrate that TSDR consistently enhances various state-of-the-art KT backbones, underscoring the vital role of principled bias correction in KT.
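双重稳健(DR)估计器的核心性质可用如下数值实验示意:当观测概率依赖于误差本身(即MNAR选择偏差)时,朴素的"观测样本均值"明显有偏;而只要倾向得分正确,即使插补模型完全错误,DR 估计仍接近全量真实风险(示意代码,数据与变量均为人工构造):

```python
import numpy as np

def doubly_robust_risk(observed, true_error, imputed_error, propensity):
    """DR risk over all (student, exercise) pairs: imputed error everywhere,
    plus a propensity-weighted correction on the observed entries.
    Unbiased if either the imputation or the propensity model is accurate."""
    correction = observed * (true_error - imputed_error) / propensity
    return float(np.mean(imputed_error + correction))

rng = np.random.default_rng(1)
n = 200_000
true_err = rng.uniform(0.0, 1.0, size=n)
prop = 0.1 + 0.8 * true_err                  # MNAR: high-error pairs observed more often
obs = (rng.uniform(size=n) < prop).astype(float)

full = float(true_err.mean())                # ground-truth population risk (~0.5)
naive = float(true_err[obs == 1].mean())     # biased upward under MNAR (~0.63)
dr = doubly_robust_risk(obs, true_err, np.zeros(n), prop)  # imputation wrong on purpose
```

论文在此基础上进一步分析了序列场景下该估计器的方差随时间累积的问题,并用时间平滑正则来控制方差。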

[AI-102] HaM-World: Soft-Hamiltonian World Models with Selective Memory for Planning

【速读】:该论文旨在解决世界模型(World Model)在长时程规划中因潜在状态表示缺乏结构而导致的想象轨迹不稳定问题,尤其在动态分布偏移(dynamics shift)场景下表现不佳。其核心解决方案是提出HaM-World(HaM-W),一种具有显式结构化的世界模型:将潜空间分解为规范子空间(q, p)与上下文子空间c,其中(q, p)通过能量驱动的哈密顿向量场演化并辅以可学习残差控制项,实现软哈密顿动力学;而c则捕获语义、耗散及非保守因素。同时引入Mamba选择性状态空间记忆机制作为历史条件输入,增强马尔可夫完备性。该设计使同一潜状态支持动力学预测、奖励/价值估计、想象滚动和CEM动作搜索,显著提升长期规划稳定性与分布外鲁棒性。

链接: https://arxiv.org/abs/2605.05951
作者: Haoyun Tang,Haodong Cui,Keyao Xu,Kun Wang,Zhandong Mei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures. Code: this https URL

点击查看摘要

Abstract:World models enable model-based planning through learned latent dynamics, but imagined rollouts become unstable as the planning horizon grows or the dynamics distribution shifts. We argue that this instability reflects two missing structures in planner-facing latents: history-conditioned memory for approximate Markov completeness, and geometric organization that separates configuration, momentum, and task semantics. We propose HaM-World (HMW), a structured world model that decomposes the latent state into a canonical (q, p) subspace and a context subspace c, while using Mamba selective state-space memory as the history-conditioned input to the same latent dynamics. Within this interface, (q, p) evolves through an energy-derived Hamiltonian vector field plus learnable residual/control dynamics, while c captures semantic, dissipative, and non-conservative factors. This gives the planner a single latent state shared by dynamics prediction, reward/value estimation, imagined rollouts, and CEM action search. On four DeepMind Control Suite tasks, HaM-World reaches the highest Avg. AUC (117.9, +9.5%), reduces long-horizon rollout error to 45% of a strong baseline model, and wins 11/12 of the k ∈ {3, 5, 7} MSE cells. Under 12 OOD perturbations spanning dynamics shifts, action delay, and observation masking, HaM-World achieves the highest return in every condition, with average OOD-return gains of 10.2% on Finger Spin and 13.6% on Reacher Easy. Mechanism diagnostics further show bounded action-free Hamiltonian-energy drift, structured energy variation under policy rollouts, and coherent control-induced energy transfer, supporting the intended Soft-Hamiltonian dynamics design.
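论文中 (q, p) 子空间沿能量驱动的哈密顿向量场演化(dq/dt = ∂H/∂p, dp/dt = -∂H/∂q,外加可学习的残差/控制项)。下面用一个谐振子玩具哈密顿量演示:在半隐式(辛)欧拉积分下,能量漂移被良好约束,这正是文中"有界能量漂移"诊断的直观来源(示意代码,积分格式与残差接口均为笔者假设):

```python
def soft_hamiltonian_step(q, p, dH_dq, dH_dp, res_q=0.0, res_p=0.0, dt=0.01):
    """One semi-implicit (symplectic) Euler step on the canonical latents:
    the Hamiltonian vector field plus an additive residual/control term."""
    p = p - dt * dH_dq(q) + dt * res_p
    q = q + dt * dH_dp(p) + dt * res_q
    return q, p

# Toy Hamiltonian: harmonic oscillator H = (q^2 + p^2) / 2.
H = lambda q, p: 0.5 * (q * q + p * p)
q, p = 1.0, 0.0
e0 = H(q, p)
for _ in range(1000):  # integrate 10 time units without residuals
    q, p = soft_hamiltonian_step(q, p, dH_dq=lambda x: x, dH_dp=lambda x: x)
drift = abs(H(q, p) - e0)  # symplectic integration keeps energy drift bounded
```

非保守、耗散等因素则由上下文子空间 c 与残差项吸收,从而不破坏 (q, p) 的近似能量结构。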

[AI-103] MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System

【速读】:该论文旨在解决当前AI编码系统在算法问题求解中缺乏结构化推理能力的问题,现有方法多依赖模型中心的策略(如架构改进和数据扩展),成本高且可解释性差;而基于外部工具或提示技术(如思维链)的方法则碎片化、缺乏统一框架。其解决方案的关键在于提出MAS-Algorithm——一个受竞赛程序员和算法工程师实践启发的多智能体系统(Multi-Agent System, MAS),通过将端到端求解过程分解为模块化阶段,实现结构化推理、工具集成与智能体间灵活协作,从而提升算法问题求解的准确率与效率。实验表明,该框架在自建基准上平均提升接受率6.48%,显著优于参数高效微调(仅0.89%提升),并展现出良好的泛化性和可扩展性。

链接: https://arxiv.org/abs/2605.05949
作者: Yuliang Xu,Xiang Xu,Yao Wan,Hu Wei,Tong Jia
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Algorithmic problem solving serves as a rigorous testbed for evaluating structured reasoning in AI coding systems, as it directly reflects a model’s ability to perform structured reasoning in complex settings. Existing approaches predominantly rely on model-centric strategies, such as architectural modifications and data scaling, which are costly and offer limited interpretability. Alternative methods leveraging external tools or prompting techniques (e.g., chain-of-thought) are often fragmented and lack a unified framework. In this paper, we propose MAS-Algorithm, a systematic multi-agent workflow for algorithmic problem solving inspired by the practices of competitive programmers and algorithm engineers. Our framework decomposes the end-to-end solving process into modular stages, enabling structured reasoning, tool integration, and flexible coordination among agents. The design emphasizes both rigor and extensibility, allowing it to generalize across diverse problem types. Experimental results on a self-constructed benchmark demonstrate consistent improvements across multiple Qwen series models, achieving an average gain of 6.48% in acceptance rate. In contrast, parameter-efficient fine-tuning on the same data yields only a marginal improvement of 0.89%. We further observe a 4.72% gain on LiveCodeBench-Pro, along with consistent improvements across additional accuracy and efficiency metrics. Beyond performance gains, we conduct comprehensive analyses to better understand the reasoning process within the workflow, including error patterns and cross-scenario behaviors. We further perform customized replacement and ablation studies to explore the upper bound of the framework, showing that individual agents can contribute improvements of up to 27.7%. These results highlight the strong potential of MAS-Algorithm for advancing AI-driven algorithmic reasoning.

[AI-104] ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在训练过程中因使用大规模敏感数据而引发的隐私泄露问题,尤其是针对持续性隐私删除请求(continual privacy deletion requests)缺乏有效评估手段的问题。现有基准测试多局限于静态或短序列场景,难以反映真实部署中频繁且连续的遗忘需求。为此,作者提出了ICU-Bench——一个基于隐私敏感文档数据的持续多模态遗忘基准,涵盖1,000个隐私敏感个体的医疗报告与劳动合同两类数据,包含9,500张图像、16,000个问答对及100个遗忘任务,并引入新的持续遗忘指标体系,用于系统评估遗忘效果、历史遗忘保留性、保留效用和稳定性。解决方案的关键在于构建具有现实代表性的持续遗忘场景与量化评估框架,从而揭示当前遗忘方法在长序列任务下难以平衡遗忘质量、效用保持与可扩展性的局限,推动面向持续隐私删除的新型多模态遗忘方法的发展。

链接: https://arxiv.org/abs/2605.05938
作者: Yuhang Wang,Wenjie Mei,Junkai Zhang,Guangyu He,Zhenxing Niu,Haichang Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, 12 figures

点击查看摘要

Abstract:Although Multimodal Large Language Models (MLLMs) have achieved remarkable progress across many domains, their training on large-scale multimodal datasets raises serious privacy concerns, making effective machine unlearning increasingly necessary. However, existing benchmarks mainly focus on static or short-sequence settings, offering limited support for evaluating continual privacy deletion requests in realistic deployments. To bridge this gap, we introduce ICU-Bench, a continual multimodal unlearning benchmark built on privacy-critical document data. ICU-Bench contains 1,000 privacy-sensitive profiles from two document domains, medical reports and labor contracts, with 9,500 images, 16,000 question-answer pairs, and 100 forget tasks. Additionally, new continual unlearning metrics are introduced, facilitating a comprehensive analysis of forgetting effectiveness, historical forgetting preservation, retained utility, and stability throughout the continual unlearning process. Through extensive experiments with representative unlearning methods on ICU-Bench, we show that existing methods generally struggle in continual settings and exhibit clear limitations in balancing forgetting quality, utility preservation, and scalability over long task sequences. These findings highlight the need for multimodal unlearning methods explicitly designed for continual privacy deletion.

[AI-105] In Data or Invisible: Toward a Better Digital Representation of Low-Resource Languages with Knowledge Graphs

【速读】:该论文旨在解决开放数据(Open Access Data, OAD)在高资源语言与低资源语言之间日益扩大的不平等现象,特别是在链接开放数据知识图谱(Linked Open Data knowledge graphs, LOD KGs)中语言覆盖不足的问题。其解决方案的关键在于:首先通过分析DBpedia、BabelNet和Wikidata三大多语言LOD KG中的关键变量(如各语言维基百科文章数量及语言标记实体数),揭示语言分布特征;进而研究跨语言迁移候选选择对多语言知识图谱补全(multilingual KG completion)任务的影响,重点探索基于语言相似性及已标注对齐数据的策略,并引入类比推理机制以利用语言间的(不)相似性来识别跨语言对应关系,从而提升KG补全性能并扩大LOD中的语言覆盖范围。

链接: https://arxiv.org/abs/2605.05931
作者: Ndeye-Emilie Mbengue(WIMMICS)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high-and low-resource languages, excluding many communities from participating in the global digital transformation. In this PhD proposal, we aim to address this gap, focusing on the language coverage of Linked Open Data knowledge graphs (LOD KGs). First, we identify key variables that characterize language distribution in LOD, including the number of Wikipedia articles per language edition and the number of language-tagged entities in LOD KGs. These variables are analyzed across three major multilingual LOD KGs, DBpedia, BabelNet, and Wikidata, providing insights into the representation and distribution of languages within LOD. Building on this analysis, we intend to study the impact of cross-lingual transfer candidate selection on the task of multilingual KG completion. In particular, we plan to investigate strategies based on linguistic proximity and the availability of curated annotated alignments between languages. Language proximity also motivates us to explore the benefits of analogical reasoning that relies on (dis)similarities and has not yet been investigated to identify correspondences across languages to improve KG completion performance and enhance language coverage in LOD.

[AI-106] Which Are the Low-Resource Languages of the Semantic Web? ESWC2026

【速读】:该论文旨在解决多语言知识图谱(Multilingual Linked Open Data Knowledge Graphs, LOD KGs)中高资源语言与低资源语言之间开放数据(Open Access Data, OAD)获取不平等的问题,这种不平等加剧了全球数字转型中的语言鸿沟。其解决方案的关键在于提出一种基于DBpedia、BabelNet和Wikidata的多层级语言资源分布分析方法,并据此定义低资源、中资源和高资源语言的量化标准,从而为跨语言迁移(cross-lingual transfer)提供可操作的候选语言选择依据。

链接: https://arxiv.org/abs/2605.05929
作者: Ndeye-Emilie Mbengue(WIMMICS),Pierre Monnin(WIMMICS),Miguel Couceiro(INESC-ID),Fabien Gandon(WIMMICS)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ESWC 2026 - 23rd European Semantic Web Conference, May 2026, Dubrovnik, Croatia

点击查看摘要

Abstract:Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high-and low-resource languages, excluding many communities from the global digital transformation. Multilingual Linked Open Data Knowledge Graphs (LOD KGs) could contribute to mitigating this divide through cross-lingual transfer; however, no clear quantitative definition of low-resource languages has yet been established in the context of LOD KGs. In this poster, we present a methodology to analyze the distribution of languages across LOD KGs and propose a preliminary multi-level categorization based on DBpedia, BabelNet, and Wikidata. This categorization is leveraged to bring a formal definition of low-, high-, and medium-resource languages that could be later leveraged to select cross-lingual transfer candidates.

[AI-107] LLM-Driven Design Space Exploration of FPGA-based Accelerators EUROSYS’26

【速读】:该论文旨在解决FPGA-based加速器设计中因硬件设计空间庞大且复杂(涵盖架构参数、数据流策略和内存层次结构)而导致的配置优化过程耗时且依赖人工经验的问题。解决方案的关键在于提出SECDA-DSE框架,该框架将大型语言模型(Large Language Models, LLMs)集成到SECDA生态系统中,通过结构化的DSE Explorer生成加速器配置,并结合基于检索增强生成(retrieval-augmented generation)与思维链提示(chain-of-thought prompting)的LLM Stack实现推理引导的探索,同时引入反馈回路支持强化微调,从而实现自动化设计空间探索(Design Space Exploration, DSE),显著降低对领域专家的依赖并提升设计效率。

链接: https://arxiv.org/abs/2605.05920
作者: Vinamra Sharma,Xingjian Fu,Jude Haris,José Cano
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: Accepted to the Workshop on Intelligent System Design (InSyDe) co-located with EuroSys '26

点击查看摘要

Abstract:Designing field-programmable gate array (FPGA)-based accelerators for modern artificial intelligence workloads requires navigating a large and complex hardware design space encompassing architectural parameters, dataflow strategies, and memory hierarchies, making the process time-consuming and resource-intensive. While the SECDA methodology enables rapid hardware-software co-design of accelerators through SystemC simulation and FPGA execution, identifying optimal accelerator configurations still requires substantial manual effort and domain expertise. This work presents SECDA-DSE, a framework that integrates Large Language Models (LLMs) into the SECDA ecosystem, comprising tools built around SECDA to automate the design space exploration (DSE) of FPGA-based accelerators. SECDA-DSE combines a structured DSE Explorer for generating accelerator configurations with an LLM Stack that performs reasoning-guided exploration using retrieval-augmented generation and chain-of-thought prompting, alongside a feedback loop that enables reinforced fine-tuning for continuous improvement. We demonstrate the feasibility of SECDA-DSE through an initial high-level synthesis-based evaluation of a generated accelerator design that meets synthesis timing and resource constraints on a Zynq-7000 FPGA.

[AI-108] Wisteria: A Unified Multi-Scale Feature Learning Framework for DNA Language Model

【速读】:该论文旨在解决现有DNA语言模型在建模基因组序列时忽视局部调控基序(local motifs)与全局依赖关系(global dependencies)之间交互的问题。其核心解决方案是提出Wisteria模型,该模型通过统一框架整合多尺度特征学习:一方面利用门控扩张卷积(gated dilated convolutions)捕获局部调控模式和基序;另一方面借助门控多层感知机(gated multilayer perceptrons)优化长程依赖关系;同时引入基于傅里叶的注意力机制(Fourier-based attention mechanism),支持频域建模、周期性扩展及长度泛化能力,从而实现对基因组序列中局部与全局依赖关系的有效协同建模。

链接: https://arxiv.org/abs/2605.05913
作者: Weihua Wang,Haoji Li,Feilong Bao,Lei Yang,Guanglai Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 4 figures. Under review

点击查看摘要

Abstract:DNA language model aims to decipher the regulatory grammar and semantic of genomes by capturing long range dependencies in DNA sequences. Existing methods emphasize long range token interactions but often ignore the interplay between local motifs and global dependencies. In this paper, we propose Wisteria, a genomic language model that integrates multi scale feature learning within a unified framework for DNA sequence. Specifically, Wisteria augments the Mamba based architecture with gated dilated convolutions to capture local motifs and regulatory patterns, while gated multilayer perceptrons refine global dependencies. We further introduce a Fourier based attention mechanism to support frequency domain modeling, periodic extension and length generalization. Across four experimental settings with both short and long range dependencies, Wisteria demonstrates strong performance on downstream benchmarks against competitive DNA language model baselines. These results indicate that Wisteria effectively unifies local and global dependency modeling for multi scale genomic sequence analysis.
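门控扩张卷积(gated dilated convolution)捕获局部基序的机制可示意如下:滤波分支经 tanh、门控分支经 sigmoid,二者逐位相乘,扩张率决定感受野的跨步间隔(示意代码,核大小、因果左填充方式均为笔者假设,并非 Wisteria 的实际结构):

```python
import numpy as np

def gated_dilated_conv1d(x, w_filter, w_gate, dilation=2):
    """out[t] = tanh(conv_d(x, w_filter)[t]) * sigmoid(conv_d(x, w_gate)[t]),
    where conv_d is a causal dilated convolution (zeros padded on the left)."""
    K = len(w_filter)
    pad = dilation * (K - 1)
    xp = np.concatenate([np.zeros(pad), x])
    def dconv(w):
        # Tap k looks back k * dilation positions in the padded sequence.
        return np.array([sum(w[k] * xp[pad + t - k * dilation] for k in range(K))
                         for t in range(len(x))])
    gate = 1.0 / (1.0 + np.exp(-dconv(w_gate)))   # sigmoid gate branch
    return np.tanh(dconv(w_filter)) * gate        # tanh filter branch

# Toy "sequence signal" and 2-tap kernels with dilation 2:
x = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0])
y = gated_dilated_conv1d(x, w_filter=np.array([1.0, 1.0]),
                         w_gate=np.array([5.0, 5.0]))
```

堆叠不同扩张率的此类模块,即可在不牺牲分辨率的情况下指数级扩大局部感受野。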

[AI-109] PREFER: Personalized Review Summarization with Online Preference Learning

【速读】:该论文旨在解决电子商务平台中产品评论信息过载问题,即现有摘要系统生成的通用静态摘要无法满足用户个性化需求,且忽视了用户偏好可能随交互动态变化的特点。解决方案的关键在于提出一种在线学习框架,通过直接从用户对生成摘要的反馈中迭代优化对用户隐式偏好的理解,从而实现针对每位用户的个性化摘要生成,实验证明该方法在保持摘要质量的同时显著提升了与目标用户兴趣的一致性。

链接: https://arxiv.org/abs/2605.05911
作者: Millend Roy,Agostino Capponi,Vineet Goyal
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Product reviews significantly influence purchasing decisions on e-commerce platforms. However, the sheer volume of reviews can overwhelm users, obscuring the information most relevant to their specific needs. Current e-commerce summarization systems typically produce generic, static summaries that fail to account for the fact that (i) different users care about different product characteristics, and (ii) these preferences may evolve with interactions. To address the challenge of unknown latent preferences, we propose an online learning framework that generates personalized summaries for each user. Our system iteratively refines its understanding of user preferences by incorporating feedback directly from the generated summaries over time. We provide a case study using the Amazon Reviews’23 dataset, showing in controlled simulations that online preference learning improves alignment with target user interests while maintaining summary quality.

[AI-110] Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中的知识遗忘问题,即在删除特定视觉知识的同时,保持非目标视觉知识和全部文本知识的完整性。其核心挑战在于视觉与文本模态高度耦合,传统方法易导致知识误删或保留不充分。解决方案的关键在于:首先提出对比视觉遗忘(Contrastive Visual Forgetting, CVF)机制,通过特征空间中的分离策略引导目标视觉概念向特定区域迁移;其次识别保留知识对应的零空间(null space),并将遗忘过程约束在此空间内,从而显著降低对非目标知识的干扰;最后将方法扩展至持续遗忘场景,支持顺序性遗忘请求,实现高效且稳定的动态知识管理。

链接: https://arxiv.org/abs/2605.05909
作者: Yuhang Wang,Zhenxing Niu,Haoxuan Ji,Guangyu He,Linlin Zhang,Haichang Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 5 figures

点击查看摘要

Abstract:The core challenge of machine unlearning is to strike a balance between target knowledge removal and non-target knowledge retention. In the context of Multimodal Large Language Models (MLLMs), this challenge becomes even more pronounced, as knowledge is further divided into visual and textual modalities that are tightly intertwined. In this paper, we introduce an MLLM unlearning approach that aims to forget target visual knowledge while preserving non-target visual knowledge and all textual knowledge. Specifically, we freeze the LLM backbone and achieve unlearning by fine-tuning the visual module. First, we propose a Contrastive Visual Forgetting (CVF) mechanism to separate target visual knowledge from retained visual knowledge, guiding the representations of target visual concepts toward appropriate regions in the feature space. Second, we identify the null space associated with retained knowledge and constrain the unlearning process within this space, thereby significantly mitigating degradation in knowledge retention. Third, beyond static unlearning scenarios, we extend our approach to continual unlearning, where forgetting requests arrive sequentially. Extensive experiments across diverse benchmarks demonstrate that our approach achieves a strong balance between effective forgetting and robust knowledge retention.
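"将遗忘更新约束在保留知识零空间内"的思路可用线性代数示意:对保留样本的特征矩阵做 SVD,取其零空间基构造投影矩阵 P,更新量右乘 P 后对任意保留特征的输出扰动为零(示意代码,非论文实现;论文中该约束作用于视觉模块的参数更新):

```python
import numpy as np

def null_space_projector(retained_feats, tol=1e-10):
    """Projector onto the null space of the retained-feature rows:
    any update U @ P satisfies (U @ P) @ f = 0 for every retained f."""
    _, s, vt = np.linalg.svd(retained_feats, full_matrices=True)
    rank = int(np.sum(s > tol))
    null_basis = vt[rank:]              # rows spanning the null space
    return null_basis.T @ null_basis    # P = N^T N, an orthogonal projector

rng = np.random.default_rng(2)
d = 16
retained = rng.normal(size=(5, d))      # 5 retained feature vectors
P = null_space_projector(retained)

grad = rng.normal(size=(8, d))          # raw "forgetting" gradient of an 8 x d layer
safe_grad = grad @ P                    # constrained update

# Retained outputs are (numerically) untouched by the constrained update:
perturb = float(np.max(np.abs(safe_grad @ retained.T)))
```

由于投影后的更新与保留特征正交,遗忘训练对非目标知识的干扰在线性层面上被严格抑制。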

[AI-111] VARS-FL: Validation-Aligned Client Selection for Non-IID Federated Learning in IoT Systems

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)系统中因客户端选择策略缺乏历史信息积累而导致的收敛缓慢与训练不稳定问题,尤其是在非独立同分布(non-IID)数据场景下,传统基于本地代理指标(如训练损失)的无状态选择机制难以对齐全局优化目标。解决方案的关键在于提出一种名为VARS-FL(Validation-Aligned Reputation Scoring for Federated Learning)的客户端选择框架,其核心是通过量化每个客户端更新对服务器端验证损失的降低程度来构建声誉评分(Reputation score),并结合滑动窗口平均近期贡献与对数缩放的参与度项,实现稳健的探索-利用平衡,从而提升训练效率和稳定性。该方法无需修改本地训练或聚合过程,兼容标准FedAvg,在Edge-IIoTset数据集上的实验表明其在准确率、F1-Macro和收敛速度方面均显著优于FedAvg、Oort和Power-of-Choice。

链接: https://arxiv.org/abs/2605.05896
作者: Mohamed Lakas,Mohamed Amine Ferrag
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated learning (FL) systems typically employ stateless client selection, treating each communication round independently and ignoring accumulated evidence of client contribution quality. Under non-IID data, this leads to slow convergence and unstable training, particularly when selection relies on local proxies (e.g., training loss) that are misaligned with the global optimization objective. These challenges are especially pronounced in Internet of Things (IoT) and Industrial IoT (IIoT) environments, where data is highly heterogeneous and distributed across devices observing different traffic patterns. In this paper, we propose VARS-FL (Validation-Aligned Reputation Scoring for Federated Learning), a client selection framework that quantifies each client’s contribution using the reduction in server-side validation loss induced by its update. These per-round signals are aggregated into a Reputation score that combines a sliding-window average of recent contributions with a logarithmically scaled participation term, enabling robust exploration-exploitation selection. VARS-FL requires no changes to local training or aggregation and remains fully compatible with standard FedAvg. We evaluate VARS-FL on a 15-class non-IID IoT intrusion detection task using the Edge-IIoTset dataset, with 100 clients across multiple seeds, and compare it against FedAvg, Oort, and Power-of-Choice. VARS-FL consistently improves accuracy, F1-Macro, and loss, while accelerating convergence (up to 36% fewer rounds to reach 80% accuracy). These results demonstrate that validation-aligned, history-aware client selection provides a more reliable and efficient training process for federated learning in heterogeneous IoT environments.
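声誉分的构成(近期贡献滑窗均值 + 对数缩放参与度项)可以最小化地示意如下(示意代码;窗口大小与系数 c 为笔者假设的超参数,论文中每轮的贡献信号是该客户端更新带来的服务器端验证损失下降量):

```python
import math

def reputation(contribution_history, window=5, c=0.1):
    """Sketch of a VARS-FL-style score: sliding-window average of recent
    per-round validation-loss reductions plus a logarithmically scaled
    participation bonus for exploration."""
    recent = contribution_history[-window:]
    exploit = sum(recent) / len(recent) if recent else 0.0
    explore = c * math.log(1 + len(contribution_history))
    return exploit + explore

# Client A: consistently useful updates; Client B: noisy, mostly unhelpful.
client_a = [0.04, 0.05, 0.03, 0.06, 0.05]
client_b = [0.05, -0.02, 0.00, -0.01, 0.01]
ra, rb = reputation(client_a), reputation(client_b)
```

两个客户端参与轮数相同,因此探索项相等,排序完全由近期验证对齐的贡献决定;参与较少的客户端则会因对数项获得额外被选中的机会。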

[AI-112] Agentic Context-Aware Risk Intelligence in the Internet of Value

【速读】:该论文旨在解决互联网价值(Internet of Value, IoV)场景下多源异构、部分可信网络中的复合风险建模与管理问题,其核心挑战在于传统单一链风险指标无法刻画路由、情绪、流动性及政策承诺等多重因素耦合的边际风险。解决方案的关键在于提出一个由五个协同引擎构成的风险原语架构:一是基于价格、流动性、波动率和路由健康度的预测引擎;二是通过Bittensor验证子网实现去中心化且经济激励驱动的预测结果评分机制;三是融合文本、链上流量与灰度文献的情绪融合引擎;四是受宪法和角色约束的代理执行引擎;五是将预测转化为蒙特卡洛情景生成的API风险与场景引擎。该架构在Solana上的27小时流动性压力响应实验和168小时预测路由校准实验中得到实证支持,其验证损失分解形式化且可证伪,具备实际部署可行性。

链接: https://arxiv.org/abs/2605.05878
作者: Basel Magableh,OmniRisk Research
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages

点击查看摘要

Abstract:The Internet of Value (IoV) is a heterogeneous, partially-trusted network in which the dominant marginal risk is composite (route, sentiment, liquidity, and the policy a system is willing to commit to) rather than a property of any single chain. We argue that a risk primitive adequate for this regime is a composition of five engines: a prediction engine over price, liquidity, volatility, and route health; a Bittensor verification subnet that decentralises and economically scores prediction outputs; a sentiment-fusion engine over text, on-chain flow, and grey-literature feeds; an agentic engine under constitutional, role-bound action constraints; and an API-risk and scenario engine that converts forecasts into pre-committed action programs in the sense of Monte-Carlo scenario generation. We anchor the architecture in two empirical artefacts: a 27-hour policy-constrained liquidity stress-response experiment on Solana, and a 168-hour prediction-router calibration arc reported with explicit class-imbalance honesty. The case study supports deployability; the validator-loss decomposition is stated formally and is falsifiable.

[AI-113] XDecomposer: Learning Prior-Free Set Decomposition for Multiphase X-ray Diffraction

【速读】:该论文旨在解决多相粉末X射线衍射(Multiphase Powder X-ray Diffraction, PXRD)分析中结构识别的瓶颈问题,即在实际合成过程中常产生复杂混合物,而传统方法难以可靠地分离和识别其中各相成分。现有基于表示学习的晶体检索与生成方法多假设输入为单相数据,在多相场景下失效。其解决方案的关键在于提出一种无先验信息(prior-free)的框架XDecomposer,将多相衍射分析建模为集合预测问题,通过相查询驱动的分解机制(phase-query-driven decomposition mechanism)与衍射一致的物理重建策略,在统一架构中同时推断无序相集合、各相比例及其对应的结构表征,从而实现高精度的源分离与晶体学保真度保持。

链接: https://arxiv.org/abs/2605.05866
作者: Hanyu Gao,Bin Cao,Yunyue Su,Tong-Yi Zhang,Qiang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28pages, 8figures, 6tables

点击查看摘要

Abstract:Multiphase powder X-ray diffraction (PXRD) analysis remains a fundamental bottleneck in structure identification, as real-world synthesis often produces complex mixtures whose constituent phases (components) cannot be reliably disentangled. While recent advances in representation-based crystal retrieval and generation suggest the possibility of inferring structures directly from PXRD, existing approaches largely assume single-phase inputs and break down in multiphase settings. Here, we present XDecomposer, a prior-free framework for joint decomposition and identification of multiphase XRD patterns without requiring candidate phase lists, structural templates, or prior knowledge of phase number. We formulate multiphase diffraction analysis as a set prediction problem, where the model infers an unordered set of phase-resolved components, their mixture proportions, and corresponding structural representations within a unified architecture. A phase-query-driven decomposition mechanism, together with diffraction-consistent physical reconstruction, enables accurate source separation while preserving crystallographic fidelity. Extensive experiments on both simulated and experimental datasets show that XDecomposer substantially improves reconstruction accuracy and phase identification across diverse chemical systems, while maintaining strong generalization to unseen mixtures. These results provide a practical route toward data-driven, source-resolved multiphase XRD analysis and reduce long-standing dependence on prior-guided iterative phase matching. The code is openly available at this https URL

[AI-114] SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

【速读】:该论文旨在解决在线强化学习中引入先验数据时面临的计算成本与训练效率之间的权衡问题。传统方法依赖固定长度的稳定阶段或静态更新策略,虽能提升训练速度,但往往需要任务特定的手动调参,易导致先验知识浪费或严重过拟合。其解决方案的关键在于提出SOPE算法,该算法利用与智能体对齐的离策略策略评估(Off-Policy Policy Evaluation, OPE)信号作为自动早停机制,动态控制离线训练阶段的长度——通过在保留验证集上评估当前策略下的价值函数,当分布外收益趋于饱和时立即终止梯度更新,从而避免手动调度并实现样本与计算效率的最优平衡。

链接: https://arxiv.org/abs/2605.05863
作者: Carlo Romeo,Girolamo Macaluso,Alessandro Sestini,Andrew D. Bagdanov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases. By evaluating the critic on a held-out validation split under the current policy’s action distribution, SOPE halts gradient updates exactly when out-of-distribution benefits saturate, eliminating the need for manual schedule tuning. Evaluated on 25 continuous control tasks from the Minari benchmark suite, SOPE improves baseline performance by up to 45.6% while reducing the required TFLOPs by up to 22x, thus balancing the tradeoff between sample and computational efficiency. These findings demonstrate that adaptive, evaluation-driven update schedules are more effective than relying on static, exhaustive update schedules.
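"评估驱动的自动早停"逻辑可示意如下:在保留验证集上周期性估计当前策略下的价值,当估计值连续若干次检查不再显著提升时,立即终止离线更新阶段(示意代码;patience、min_delta 等均为笔者假设的超参数):

```python
def ope_early_stop(value_estimates, patience=3, min_delta=1e-3):
    """Sketch of SOPE-style scheduling: halt offline gradient updates once
    the held-out OPE value estimate stops improving by `min_delta` for
    `patience` consecutive checks. Returns the stopping index."""
    best, stale = float("-inf"), 0
    for i, v in enumerate(value_estimates):
        if v > best + min_delta:
            best, stale = v, 0
        else:
            stale += 1
            if stale >= patience:
                return i        # stop here: out-of-distribution benefits saturated
    return len(value_estimates) - 1

# OPE value estimate rises, then saturates around step 5.
vals = [0.10, 0.30, 0.45, 0.52, 0.55, 0.5502, 0.5504, 0.5501, 0.5503]
stop = ope_early_stop(vals)
```

相比固定长度的稳定阶段,这种自适应调度无需针对任务手工调节离线更新步数。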

[AI-115] SANEmerg: An Emergent Communication Framework for Semantic-aware Agent ic AI Networking

【速读】:该论文旨在解决未来网络系统中因传统通信与计算分离架构导致的效率低下问题,尤其是在大规模协同智能体网络(AgentNet)环境下,如何实现高效、低开销的多智能体协作任务执行。其核心挑战在于:在带宽受限且智能体计算能力有限的条件下,如何使智能体间自动演化出语义感知的通信协议以支持任务分解与协同。解决方案的关键是提出名为SANEmerg的多智能体涌现通信框架,该框架包含两个核心机制:一是基于带宽自适应的重要性过滤器(bandwidth-adaptable importance-filter),动态优先传输对任务贡献更高的信息维度;二是基于最小描述长度(Minimum Description Length, MDL)原则的复杂度正则化项,引导智能体在计算资源受限下演化出简洁高效的信号协议。实验表明,SANEmerg在保证高任务准确率的同时显著降低了带宽和计算开销。

链接: https://arxiv.org/abs/2605.05861
作者: Yong Xiao,Haoran Zhou,Yujie Zhou,Marwan Krunz
机构: 未知
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Accepted at IEEE/IFIP WiOpt Workshop, Columbus, OH, USA, June 2026

点击查看摘要

Abstract:Future networking systems are envisioned to become part of an agentic AI-native ecosystem in which a vast number of heterogeneous and specialized AI agents cooperate seamlessly to fulfill complex user requirements in real time. However, traditional networking paradigms are characterized by a rigid decoupling of communication and computation, which often leads to significant inefficiencies in large-scale agentic AI networking (AgentNet) systems. Emergent communication offers a novel solution by enabling autonomous agents that support task-specific signaling protocols for information exchange and collaborative coordination. In this paper, we consider a multi-agent emergent communication framework, tailored for semantic-aware AgentNet systems in which the user’s semantic intent can be automatically detected, inferred, and linked to a set of sub-tasks to be assigned to a set of agents. We investigate how communication and signaling protocols can emerge among collaborative agents with computationally bounded intelligence under stringent bandwidth constraints. Our proposed framework, called SANEmerg, is designed to facilitate the emergence of communication for collaborative task fulfillment while adhering to the physical limits of AgentNet. SANEmerg incorporates a bandwidth-adaptable importance-filter that dynamically prioritizes the transmission of higher-contribution message dimensions, ensuring robust performance in bandwidth-limited environments. Furthermore, SANEmerg integrates a complexity-regularizer grounded in the Minimum Description Length (MDL) principle to facilitate the emergence of computationally bounded signaling. Evaluated via an AgentNet prototype and extensive experimentation, SANEmerg demonstrates significant performance improvements over state-of-the-art solutions, achieving superior task accuracy while significantly reducing bandwidth and computational overhead.

[AI-116] AirQualityBench: A Realistic Evaluation Benchmark for Global Air Quality Forecasting

【速读】:该论文旨在解决当前空气质量预测模型评估中存在的“数据理想化”问题,即现有研究多基于区域性的、预处理过的、归一化的数据集进行评估,这些数据集通常通过删除或人工填补缺失观测值来构建,从而掩盖了真实监测网络中普遍存在的不均匀全球覆盖、结构化缺失、污染物尺度异质性及部署成本等关键挑战。解决方案的关键在于提出AirQualityBench——一个全球多污染物基准测试平台,其核心创新是保留原始观测掩码(observation masks),不进行数据插补,将缺失性作为预测任务的一部分,并在物理浓度尺度上对有效未来观测值进行误差计算,从而实现更贴近现实场景的模型评估。该设计推动了对可扩展、掩码感知和物理可解释的空气质量预测模型的研究与开发。

链接: https://arxiv.org/abs/2605.05854
作者: Xing Xu,Xu Wang,Yudong Zhang,Huilin Zhao,Zhengyang Zhou,Yang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Air-quality forecasting models are commonly evaluated on regional, preprocessed, and normalized datasets, where missing observations are removed or artificially completed. Such protocols simplify comparison but hide the conditions that dominate real monitoring networks: uneven global coverage, structured missingness, heterogeneous pollutant scales, and deployment cost. We introduce AirQualityBench, a global multi-pollutant benchmark designed to evaluate forecasting models under these realistic conditions. The benchmark contains hourly observations from 3,720 monitoring stations over 2021–2025, covers six major pollutants, and preserves provider-native observation masks. Rather than imputing a dense data tensor, AirQualityBench exposes missingness as part of the forecasting problem and reports errors on valid future observations after inverse transformation to physical concentration scales. Evaluating representative spatio-temporal models under this unified protocol shows that strong performance on sanitized datasets does not reliably transfer to global, fragmented monitoring streams. AirQualityBench therefore serves as a realistic testbed for scalable, mask-aware, and physically interpretable air-quality forecasting. All benchmark data, code, evaluation scripts, and baseline implementations are available on GitHub (this https URL).
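
下面给出一个最小化的 Python 示意,演示该基准"不插补缺失值、在反归一化后的物理浓度尺度上仅对有效未来观测计算误差"的评估协议(函数名与示例数值均为假设,并非基准官方实现):

```python
import numpy as np

def masked_mae_physical(pred_norm, target_norm, mask, mean, std):
    """Mask-aware MAE: inverse-transform normalized values back to the
    physical concentration scale, then average absolute errors only
    over valid (observed) future values."""
    pred = pred_norm * std + mean           # undo z-score normalization
    target = target_norm * std + mean
    abs_err = np.abs(pred - target) * mask  # missing entries contribute 0
    return abs_err.sum() / mask.sum()

# Toy example: 2 stations x 3 future hours; one future value is missing.
pred_norm = np.array([[0.5, 1.0, -0.5], [0.0, 0.2, 0.4]])
target_norm = np.array([[0.4, 1.2, -0.5], [0.1, 0.2, 0.0]])
mask = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])  # 0 = not observed
mae = masked_mae_physical(pred_norm, target_norm, mask, mean=20.0, std=10.0)
```

注意缺失位置完全不参与平均,而不是被填补为 0 后计入误差,这正是"将缺失性作为预测任务一部分"的含义。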

[AI-117] LoopTrap: Termination Poisoning Attacks on LLM Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)智能体在迭代执行循环中因终止判断被恶意干扰而导致的“终止投毒”(Termination Poisoning)问题,即攻击者通过向代理上下文注入恶意提示,诱导其错误地认为任务未完成,从而引发无限循环或资源滥用。解决方案的关键在于提出一种名为 LoopTrap 的自动化红队框架,其核心机制包括:通过轻量级探测构建目标代理的行为画像(沿四个脆弱维度),基于此动态合成针对性恶意提示,并采用自评分机制选择最优攻击策略;同时将成功攻击抽象为可复用技能库,失败案例经自我反思迭代优化,实现对未知代理和任务的高效、可扩展攻击。

链接: https://arxiv.org/abs/2605.05846
作者: Huiyu Xu,Zhibo Wang,Wenhui Zhang,Ziqi Zhu,Yaopeng Wang,Kui Ren,Chun Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern LLM agents solve complex tasks by operating in iterative execution loops, where they repeatedly reason, act, and self-evaluate progress to determine when a task is complete. In this work, we show that while this self-directed loop facilitates autonomy, it also introduces a critical risk: by injecting malicious prompts into the agent’s context, an adversary can distort the agent’s termination judgment, making it believe the task remains incomplete and leading to unbounded execution. To understand this threat, we define and systematically characterize it as Termination Poisoning and design 10 representative attack strategies. Through an empirical study spanning 8 LLM agents and 60 tasks, we demonstrate that different LLM agents exhibit distinct behavioral signatures that determine which strategies succeed. These transferable patterns can serve as principled guidance for crafting effective attacks against previously unseen agents and tasks, enabling scalable red-teaming beyond manually designed templates. Building on these insights, we introduce LoopTrap, an automated red-teaming framework that synthesizes target-specific malicious prompts by exploiting agent behavioral tendencies. LoopTrap first constructs a behavioral profile of the target agent along four vulnerability dimensions via lightweight probing. It then performs adaptive trap synthesis, routing to the most effective strategy and selecting optimal injections via a self-scoring mechanism. Finally, successful traps are abstracted into a reusable skill library, while failed attempts are refined through self-reflection, ensuring continuous improvement. Extensive evaluation shows that LoopTrap achieves an average of 3.57× step amplification across 8 mainstream agents, with a peak of 25×.

[AI-118] Taklif.AI: LLM-Powered Platform for Interest-Based Personalized College Assignments

【速读】:该论文旨在解决传统教育中“一刀切”式作业设计导致的学生参与度下降及学术不端行为增加的问题。其核心挑战在于如何在满足学生多样化兴趣与认知能力的基础上,生成具有个性化且高质量的学习任务。解决方案的关键在于构建一个基于大型语言模型(Large Language Models, LLMs)的平台,通过结构化的提示工程(prompt engineering)管道,将学生的课外兴趣和文化背景纳入赋值生成过程,并引入输入/输出防护机制(guardrails)以保障内容质量与安全性。该平台采用AWS无服务器架构,利用Llama 3.3 70B作为主模型,并结合LiteLLM实现多服务商负载均衡与LangChain进行提示编排,从而实现高效、可控的个性化作业生成。

链接: https://arxiv.org/abs/2605.05842
作者: Zaki Kurdya,Mohammed Zuqlam,Salem Amassi,Shady Telbany,Motaz Saad
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Educators face significant challenges in creating engaging, personalized assignments that accommodate students’ diverse interests and cognitive abilities. Traditional one-size-fits-all assignments frequently lead to decreased student engagement and increased reliance on unethical practices such as plagiarism. To address these challenges, we present Taklif.AI, a platform that leverages Large Language Models (LLMs) to automatically generate personalized assignments tailored to individual student interests. Unlike existing AI-powered educational platforms that personalize based on academic performance metrics alone, Taklif.AI incorporates students’ extracurricular interests and cultural contexts into the assignment generation process through a structured prompt engineering pipeline with input and output guardrails. The platform employs a serverless architecture on AWS, using Llama 3.3 70B as the primary LLM via LiteLLM for multi-provider load balancing and LangChain for prompt orchestration. We describe the system architecture, the prompt design methodology, and the guardrails framework that ensures output quality. Preliminary user acceptance testing with 68 participants (65 students and 3 educators) indicates positive reception, with 84% of participants rating the personalization feature as beneficial. We discuss the platform’s current capabilities and limitations, and outline directions for rigorous empirical evaluation of learning outcomes.

[AI-119] On the Role of Language Representations in Auto-Bidding: Findings and Implications

【速读】:该论文旨在解决自动出价(auto-bidding)在实时广告市场中面临的挑战,即现有方法依赖紧凑的数值状态表示,难以显式建模高阶意图、动态反馈和操作员策略指导,从而限制了对复杂广告目标的可控性和泛化能力。解决方案的关键在于提出SemBid框架,通过在离线出价轨迹的token层面注入由大语言模型(LLM)编码的语义信息(包括任务意图Task、历史表现History和策略Strategy),并利用自注意力机制实现语义与数值特征的精细融合,而非简单的拼接,从而显著提升出价策略在不同预算场景下的性能一致性、约束满足度和鲁棒性。

链接: https://arxiv.org/abs/2605.05833
作者: Guanyu Zhu,Jining Luan,Hanwen Du,Xinyu Fang,Sibo Xu,Ersheng Ni,Hongji Li,Jincheng Fang,Ronghao Chen,Huacan Wang,Xuanqi Lan,Yongxin Ni,Yiqi Sun,Youhua Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 page

点击查看摘要

Abstract:Auto-bidding is a crucial task in real-time advertising markets, where policies must optimize long-horizon value under delivery constraints (e.g., budget and CPA). Existing methods for auto-bidding rely on compact numerical state representations: while they can implicitly capture delivery dynamics, they offer limited support for explicitly representing and controlling high-level intent, evolving feedback, and operator-style strategic guidance in real campaigns. Meanwhile, Large Language Models (LLMs) offer a powerful method for encoding semantic information, yet it remains unclear when LLMs help and how to integrate them without sacrificing numerical precision. Through systematic preliminary studies, we find that (1) LLM embeddings contain bidding-relevant cues yet cannot replace numerical features, and (2) gains emerge only with careful semantic–numeric integration rather than naive concatenation. Motivated by these findings, we propose SemBid, a novel auto-bidding framework that injects LLM-encoded semantics into offline bidding trajectories at the token level. SemBid introduces three semantic inputs: Task, History, and Strategy. It injects these semantics as tokens alongside numerical trajectory tokens and uses self-attention to integrate them, improving controllability and generalization across objectives. Across diverse scenarios and budget regimes, SemBid outperforms competitive baselines from offline RL and generative sequence modeling, with more consistent gains in overall performance, constraint satisfaction, and robustness. Our code is available at: this https URL.

[AI-120] MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition

【速读】:该论文旨在解决光学化学结构识别(Optical Chemical Structure Recognition, OCSR)在真实科学文献图像中性能不佳的问题,其核心挑战在于分子图示的视觉干扰与化学语义复杂性。解决方案的关键在于提出一个双维度难度框架MOSAIC,包含37个细粒度标签,同时刻画视觉干扰和化学语义难题;并构建了MolRecBench-Wild基准数据集(含5029个来自820篇近期化学论文的分子结构),覆盖真实出版物中的完整难度范围;此外,设计了一种名为CARBON的新表示语言,可表达价态变化、图标基团等非标准化学语义,并采用双轨评估协议支持CARBON与SMILES输出,从而实现更忠实的语义评估。

链接: https://arxiv.org/abs/2605.05832
作者: Haote Yang,Hui Wang,Chen Zhu,Jingchao Wang,Linye Li,Hongbin Lai,Huijie Ao,Yongxuan Lyu,Jiang Wu,Jiaxing Sun,Lua Chen,Yuanyuan Cao,Ruijie Zhang,Shengxin Lu,Lijun Wu,Bin Wang,Conghui He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optical Chemical Structure Recognition (OCSR) aims to translate molecular diagrams in scientific literature into machine-readable formats, but current systems remain unreliable on real-world images due to substantial visual and chemical complexity. We introduce MOSAIC, a dual-dimensional difficulty framework with 37 fine-grained labels that jointly characterize visual interference and chemical semantic challenges in molecular diagrams. Based on this framework, we construct MolRecBench-Wild, a benchmark of 5,029 structures from 820 recent chemistry papers, covering the full difficulty spectrum observed in real publications. To enable faithful semantic evaluation beyond SMILES and MolFile, we propose CARBON, a representation language capable of expressing valence variations, icon-based groups, and other non-standard chemical semantics. We further adopt a dual-track evaluation protocol supporting both CARBON and SMILES outputs for broad model compatibility. Comprehensive experiments over 18 OCSR-capable models reveal severe performance degradation on MolRecBench-Wild, exposing a large gap between previous patent benchmarks and real-world academic scenarios.

[AI-121] AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在提升大语言模型(Large Language Models, LLMs)推理能力时所引发的推理边界收缩问题,即尽管现有方法提高了采样效率,但并未激发全新的推理模式,反而导致训练后模型的推理覆盖范围缩小,尤其在大规模采样下不如基线模型。解决方案的关键在于提出不对称分组策略优化(Asymmetric Group Policy Optimization, AGPO),其核心机制包括:1)采用负向主导的强化策略以抑制错误推理路径,从而保留基线模型的探索能力;2)引入群体优势机制(group advantage mechanism),根据组内方差动态调整正向更新强度,使模型聚焦于稀有正确路径并抑制平凡路径的更新,从而在保持多样性的同时提升推理质量。

链接: https://arxiv.org/abs/2605.05826
作者: Yang Xu,Kun Yao,Yiming Deng,Zheng Fang,Kai Ming Ting,Ming Pang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model’s exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the model to focus on rare correct paths while suppressing updates from trivial paths. Our experiments on five mathematical benchmarks demonstrate that AGPO achieves state-of-the-art accuracy while consistently improving pass@k performance at scale. In a large-scale industrial application for search ads relevance optimization, AGPO effectively enhances the quality of the data annotation, leading to substantial performance gains in downstream student models.
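
下面用一小段 Python 说明"负向保持全权重、正向按组内方差缩放"的不对称优势计算思路(具体缩放公式为示意性假设,论文实际规则可能不同):

```python
import numpy as np

def agpo_advantages(rewards, neg_coef=1.0, eps=1e-6):
    """Illustrative asymmetric group advantages: negative advantages keep
    full weight (suppressing wrong paths), while positive advantages are
    rescaled by the intra-group std, so rare correct answers in
    high-variance groups receive stronger positive updates than
    trivial correct answers in low-variance groups."""
    r = np.asarray(rewards, dtype=float)
    mean, std = r.mean(), r.std()
    adv = (r - mean) / (std + eps)   # standard group-relative advantage
    return np.where(adv > 0, adv * std, adv * neg_coef)

# Group of 4 rollouts where only one answer is correct (reward 1).
out = agpo_advantages([1.0, 0.0, 0.0, 0.0])
```

这里稀有正确样本的正向优势近似等于其与组均值的原始差距(0.75),而三个错误样本保留完整的负向归一化优势。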

[AI-122] A Testable Certificate for Constant Collapse in Teacher-Guided VAEs

【速读】:该论文旨在解决变分自编码器(Variational Autoencoder, VAE)中常见的后验坍缩(posterior collapse)问题,尤其是输入无关的常数坍缩(input-independent constant collapse)这一具体失败模式。传统诊断方法依赖于KL散度小、解码器强或潜在变量使用弱等经验信号,但无法提供可量化的坍缩边界。论文的关键创新在于识别出:对于任意固定的非恒定教师分布 T(·|x),最优常数学生分布即为数据集上的平均教师分布,其对齐代价等于教师互信息 I_T(X;T);若一个仅依赖潜在变量的原始见证(raw witness)的对齐损失低于此阈值并留有安全裕量,则该见证不可能是输入无关的常数函数。这一理论结果将原本定性的坍缩现象转化为可测量的判据,并通过CIFAR-100和Tiny-ImageNet-200实验验证了“预防—坍缩—恢复”的三阶段行为模式,从而为防止常数坍缩提供了明确的可证伪条件。

链接: https://arxiv.org/abs/2605.05813
作者: Zegu Zhang,Jianhua Peng,Jian Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Posterior collapse in variational autoencoders is often diagnosed by its symptoms: a small KL term, a strong decoder, or weak use of the latent code. These signals are useful, but they do not define a collapse boundary. We study a concrete failure mode, input-independent constant collapse, and show that this case admits an exact threshold. For any fixed nonconstant teacher distribution T(·|x), the best constant student is the dataset-average teacher distribution, and its alignment cost is the teacher mutual information I_T(X;T). Therefore, if a strictly latent-only raw witness achieves alignment loss below this value, with a safety margin, the witness cannot be constant in the input. This identity turns a qualitative failure mode into a measurable one. In CIFAR-100 experiments with per-seed teacher search, full training stays on the certified side of the boundary, removing alignment drives the raw witness into the constant-student regime, and restarting from a collapsed checkpoint with alignment enabled restores the certificate. Tiny-ImageNet-200 fixed-target runs show the same prevention–collapse–rescue pattern across three independently searched teachers. Standard VAE-style baselines, including methods that preserve reconstruction quality or post-hoc predictability, remain negative under the raw certificate. The guarantee is intentionally narrow: it certifies that the matched nonconstant teacher-relative variation passes through the latent pathway, rather than claiming that all forms of posterior collapse have been ruled out.
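
下面的 Python 片段数值验证论文中的核心恒等式:最优常数学生即数据集平均教师,其对齐代价恰为教师互信息 I_T(X;T),且任何其他常数学生的代价都不会更低(教师分布为随机生成,仅用于演示该阈值):

```python
import numpy as np

def avg_kl_to_const(T, q):
    """Alignment cost of a constant student q: E_x KL(T(.|x) || q),
    with inputs x taken uniform over the rows of T."""
    return np.sum(T * (np.log(T) - np.log(q)), axis=1).mean()

rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(4), size=5)   # random nonconstant teacher rows
avg_teacher = T.mean(axis=0)            # the best constant student
I_T = avg_kl_to_const(T, avg_teacher)   # equals mutual information I_T(X;T)
# any other constant student pays I_T plus KL(avg_teacher || q) >= I_T
cost_other = avg_kl_to_const(T, rng.dirichlet(np.ones(4)))
```

由分解 E_x KL(T(·|x)‖q) = I_T(X;T) + KL(avg_teacher‖q) 可知,低于 I_T 的对齐损失只能由非常数的见证达到,这正是论文可测判据的依据。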

[AI-123] Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

【速读】:该论文旨在解决基于值函数的离线强化学习方法(如Q-learning)在长时域学习中因bootstrapping导致的估计误差累积问题。由于时序差分(Temporal-Difference, TD)更新会将后期状态的误差向后传播,使得长期策略评估变得不稳定。解决方案的关键在于提出一种名为长时域Q学习(Long-Horizon Q-Learning, LQL)的新机制,其核心思想是利用先前观察到的动作序列对最优策略的期望收益提供下界约束——即早期采取最优动作不应劣于先执行观测动作再切换至最优行为的策略。LQL通过引入hinge损失函数来惩罚违反该不等式约束的情况,从而实现对Q值估计的稳定化;该机制仅使用现有网络输出计算惩罚项,无需额外网络或前向传播,与标准Q-learning相比具有同等计算开销却显著提升性能。

链接: https://arxiv.org/abs/2605.05812
作者: Armaan A. Abraham,Lucy Xiaoyang Shi,Chelsea Finn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a practical stabilization mechanism for Q-learning by using a hinge loss to penalize violations of these bounds. Importantly, LQL computes these penalties using network outputs already produced for the TD error, requiring no auxiliary networks and no additional forward passes relative to Q-learning. When combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step TD and n-step TD learning at similar runtime.
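
下面以纯 Python 示意 LQL 的 hinge 惩罚:已观测的 n 步回报加上自举的最优尾部给出 Q*(s,a) 的下界,估计值低于该下界即受罚(仅为演示思路的草图,并非作者实现):

```python
def lql_hinge_penalty(q_sa, rewards, gamma, q_boot):
    """Hinge penalty for one transition: the realized n-step return plus
    a bootstrapped optimal tail lower-bounds Q*(s_t, a_t), so Q estimates
    that fall below this bound receive a squared-hinge penalty."""
    n = len(rewards)
    lower = sum(gamma**i * r for i, r in enumerate(rewards)) + gamma**n * q_boot
    return max(0.0, lower - q_sa) ** 2

# A Q estimate below the realized 3-step return bound is penalized ...
p_violate = lql_hinge_penalty(q_sa=1.0, rewards=[1.0, 1.0, 1.0],
                              gamma=0.9, q_boot=0.0)
# ... while an estimate above the bound incurs no penalty at all.
p_ok = lql_hinge_penalty(q_sa=3.0, rewards=[1.0, 1.0, 1.0],
                         gamma=0.9, q_boot=0.0)
```

该惩罚与 TD 误差共用同一批网络输出,因此相对标准 Q-learning 不需要额外的前向传播。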

[AI-124] Sheet as Token: A Graph-Enhanced Representation for Multi-Sheet Spreadsheet Understanding

【速读】:该论文旨在解决多工作表电子表格(multi-sheet spreadsheet)理解中的挑战,即在语言模型驱动的数据分析代理中,如何有效处理跨多个工作表、具有异构模式(schema)、布局和隐式关系的信息分布问题。现有基于检索增强的方法通常将电子表格分解为行、列或块等细粒度单元,但这种以“块”为中心的表示方式容易割裂工作表的整体语义,导致全局信息丢失。其解决方案的关键在于提出“Sheet as Token”框架,将每个工作表视为统一的语义单元进行编码,并通过图结构增强跨工作表推理能力:首先从表名、列头、代表性值和布局特征中提取感知模式的记录,生成紧凑的稠密token;随后利用图检索器构建查询相关的候选图,融合语义、查询条件、模式一致性与形状兼容性等多种关系通道,借助多阶段图Transformer实现高效且语义连贯的多工作表检索。实验表明,该方法在保持稳定表示的同时显著提升了列表级检索性能,且计算开销可控,验证了工作表级抽象在可扩展电子表格理解中的有效性。

链接: https://arxiv.org/abs/2605.05811
作者: Yiming Lei,Yiqi Wang,Yujia Zhang,Bo Guan,Depei Zhu,Chunhui Wang,Zhuonan Hao,Tianyu Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Workbook-scale spreadsheet understanding is increasingly important for language-model-based data analysis agents, but remains challenging because relevant information is often distributed across multiple sheets with heterogeneous schemas, layouts, and implicit relationships. Existing retrieval-augmented approaches typically decompose spreadsheets into rows, columns, or blocks to improve scalability; however, such chunk-centric representations can fragment worksheets into isolated text spans and weaken global sheet-level semantics. We propose Sheet as Token, a graph-enhanced framework that treats each worksheet as a unified semantic unit for multi-sheet spreadsheet retrieval. Our method extracts schema-aware records from sheet names, column headers, representative values, and layout features, and encodes each worksheet into a compact dense token. Given a natural-language query, a Graph Retriever constructs a query-specific candidate graph over sheet tokens using semantic, query-conditioned, schema-consistency, and shape-compatibility relations, and composes these channels through a multi-stage graph transformer to retrieve supporting sheet sets. Experiments on a constructed multi-sheet spreadsheet corpus show that sheet-level tokenization learns stable representations, and that graph-enhanced cross-sheet reasoning improves listwise retrieval over a shallow graph baseline with limited additional graph-side computation. These results suggest that sheet-level tokenization is a promising abstraction for scalable multi-sheet spreadsheet understanding.

[AI-125] LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的恶意软件归属(malware attribution)研究中存在的两大核心问题:一是缺乏对恶意代码和漏洞代码段进行精准识别所需的细粒度代码级支撑指标(code-level grounding),二是现有方法难以有效整合多源安全知识以实现证据驱动的推理。其解决方案的关键在于提出一个面向代码中心(code-centric)的基准数据集LCCD与一个证据锚定(evidence-grounded)框架LCC-LLM,该框架通过七层检索增强生成(retrieval-augmented generation, RAG)流水线、CoVe(Indicator of Compromise validation)验证机制以及多维质量门控策略,实现了从反汇编代码、控制流图(CFG/FCG)、API调用痕迹等多模态特征中提取可解释证据,并结合课程顺序指令微调(curriculum-ordered instruction fine-tuning)技术优化DeepSeek-R1-Distill-Qwen-14B与Qwen3-Coder-30B-A3B模型,从而显著提升恶意软件分析任务中的语义准确性与实际操作可靠性。

链接: https://arxiv.org/abs/2605.05807
作者: Christopher G. Pedraza Pohlenz,Hassan Jalil Hadi,Ali Hassan,Ali Shoker
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs are increasingly explored for malware analysis; however, current LLM-based malware attribution remains limited by unsupported indicators and insufficient code-level grounding for identifying malicious and vulnerable code segments. To address these limitations, this research introduces LCC-LLM, a code-centric benchmark dataset and evidence-grounded framework for malware attribution and multi-task static malware analysis. The proposed LCCD dataset contains approximately 34K PE samples processed through a large-scale reverse-engineering pipeline and represented using decompiled C code, assembly code, CFG/FCG artifacts, hexadecimal data, PE metadata, suspicious API evidence, and structural features. Beyond dataset construction, LCC-LLM integrates LangGraph-orchestrated static analysis with multi-source cybersecurity knowledge to support evidence-grounded malware reasoning. The framework employs a seven-layer retrieval-augmented generation pipeline, CoVe for IoC validation, and a multi-dimensional quality gate to improve factual reliability and analyst-oriented decision support. Curriculum-ordered instruction data is used to fine-tune DeepSeek-R1-Distill-Qwen-14B and Qwen3-Coder-30B-A3B using QLoRA. Evaluation across 43 malware-analysis task types achieves an average semantic similarity of 0.634, with the highest task-level performance in structured report generation, IoC extraction, vulnerability assessment, malware configuration extraction, and malware class detection. In a real-world case study using MalwareBazaar samples, the grounded pipeline achieves a 10/10 structured analysis pass rate, producing CFG/FCG evidence, MITRE ATT&CK mappings, detection guidance, and analyst-ready reports. These results show that code-centric representations, retrieval grounding, and verification-guided reasoning improve the reliability and operational usefulness of LLM-assisted malware attribution.

[AI-126] Revealing Modular Gradient Noise Imbalance in LLM s: Calibrating Adam via Signal-to-Noise Ratio

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)中因模块结构异质性(heterogeneous module composition)导致的优化难题,尤其是传统自适应优化器(如Adam(W))未能显式处理模块级梯度异质性,从而引发收敛缓慢、性能次优或训练不稳定的问题。解决方案的关键在于提出一种基于信噪比(Signal-to-Noise Ratio, SNR)的模块级学习率缩放方法(Module-wise Learning Rate Scaling via SNR, MoLS),通过估计每个模块的SNR来自动调整Adam更新幅度,实现无需人工调参的模块级学习率分配,同时保持与内存高效训练算法的兼容性。

链接: https://arxiv.org/abs/2605.05794
作者: Ziqing Wen,Zhouyang Liu,Jiahuan Wang,Ping Luo,Li Shen,Dongsheng Li,Tao Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The impressive performance of large language models (LLMs) arises from their massive scale and heterogeneous module composition. However, this structural heterogeneity introduces additional optimization challenges. While adaptive optimizers such as Adam(W) provide per-parameter adaptivity, they do not explicitly account for module-level gradient heterogeneity, resulting in slower convergence, suboptimal performance, or training instability. Existing approaches typically rely on manually tuned module-specific learning rates or specific optimization strategies, which are computationally costly and difficult to generalize across tasks or models. To establish a more principled approach, we first analyze the noise-damping behavior of Adam in high-noise modules and introduce \textbfModule-wise Learning Rate Scaling via SNR (MoLS). MoLS estimates module-level SNRs to scale Adam updates, allowing automated module-wise learning rate allocation without manual tuning. Empirical results through multiple LLM training benchmarks demonstrate that MoLS improves convergence speed and generalization, achieving performance comparable to carefully tuned module-specific learning rates, while remaining compatible with memory-efficient training algorithms.
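
下面的 Python 草图展示"按模块估计梯度信噪比并据此缩放学习率"的基本思路(SNR 估计式与缩放规则均为示意性假设,并非论文公式):

```python
import numpy as np

def module_snr(grads, eps=1e-12):
    """Module-level gradient SNR from a list of per-step gradient vectors:
    signal = squared norm of the mean gradient, noise = mean squared
    deviation around that mean (illustrative estimator)."""
    G = np.stack(grads)
    mean = G.mean(axis=0)
    signal = np.sum(mean ** 2)
    noise = np.mean(np.sum((G - mean) ** 2, axis=1))
    return signal / (noise + eps)

def mols_lr_scale(snr, snr_ref=1.0):
    """Scale a module's Adam step by its SNR relative to a reference;
    noisy modules (low SNR) get damped updates, clean ones keep full lr."""
    return min(1.0, snr / snr_ref)

# Consistent gradients -> high SNR, full step; sign-flipping -> damped.
snr_clean = module_snr([np.array([1.0, 0.0]), np.array([1.0, 0.0])])
snr_noisy = module_snr([np.array([1.0, 0.0]), np.array([-1.0, 0.0])])
```

实际使用时,可对模型中每个模块(如各注意力块、MLP 块)分别维护这样的统计量,从而自动得到模块级学习率,而无需人工逐模块调参。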

[AI-127] HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning (ICML 2026)

【速读】:该论文旨在解决域增量学习(Domain Incremental Learning) 中因域偏移(domain shift)导致模型性能显著下降的问题,特别是在不重新训练的情况下持续适应新数据域时易发生灾难性遗忘(catastrophic forgetting)和泛化能力不足。其解决方案的关键在于提出混合能量-距离提示(Hybrid Energy-Distance Prompt, HEDP)框架,该框架受亥姆霍兹自由能(Helmholtz free energy)启发,引入能量正则化损失以增强域表示的可分性,并设计了一种融合能量基与距离基线索的加权机制,从而提升域选择准确性和开放世界适应能力。实验表明,该方法在CORe50等多基准上对未见域实现了2.57%的准确率提升,有效缓解了灾难性遗忘问题。

链接: https://arxiv.org/abs/2605.05776
作者: Yu Feng,Zhen Tian,Haoran Luo,Xie Yu,Diancheng Cheng,Haoyue Zheng,Shuai Lyu,Ping Zong,Lianyuan Li,Xin Ge,Yifan Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures, Accepted by ICML 2026

点击查看摘要

Abstract:Domain Incremental Learning is a critical scenario that requires models to continuously adapt to new data domains without retraining. However, domain shifts often cause severe performance degradation. To address this, we propose Hybrid Energy-Distance Prompt (HEDP), a domain-incremental framework inspired by Helmholtz free energy. HEDP introduces an energy regularization loss to enhance the separability of domain representations and a hybrid energy-distance weighted mechanism that fuses energy-based and distance-based cues to improve domain selection and generalization. Experiments on multiple benchmarks, including CORe50, show that HEDP achieves superior performance on unseen domains with a 2.57% accuracy gain, effectively mitigating catastrophic forgetting and enhancing open-world adaptability. Our code is available at this https URL.
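
下面用 Python 示意"能量-距离混合加权"的域选择打分:能量项取对 logits 的负 log-sum-exp(亥姆霍兹自由能形式),距离项取特征到域原型的欧氏距离(加权方式为假设,论文实际融合规则可能更复杂):

```python
import numpy as np

def free_energy(logits, T=1.0):
    """Helmholtz-style free energy of a domain head's logits:
    -T * logsumexp(logits / T). Lower energy = better fit."""
    z = np.asarray(logits) / T
    m = z.max()                      # stabilized log-sum-exp
    return -T * (m + np.log(np.exp(z - m).sum()))

def hybrid_domain_score(logits, feat, prototype, alpha=0.5):
    """Fuse energy-based and distance-based cues into one score
    (illustrative weighting). Lower score = better domain match."""
    dist = np.linalg.norm(np.asarray(feat) - np.asarray(prototype))
    return alpha * free_energy(logits) + (1 - alpha) * dist

# Pick the domain prompt with the lowest hybrid score.
scores = [hybrid_domain_score([3.0, 0.1], [1.0, 0.0], [1.0, 0.0]),
          hybrid_domain_score([0.1, 0.1], [1.0, 0.0], [5.0, 0.0])]
best = int(np.argmin(scores))
```

示例中第一个域同时具有低能量(高置信 logits)和零原型距离,因此被选中。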

[AI-128] CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt

【速读】:该论文旨在解决模拟电路设计自动化(Analog Circuit Design Automation)在电子设计自动化(EDA)领域长期面临的挑战,即如何利用生成式AI模型高效地从自然语言描述生成正确的电路网表(netlist)。其核心问题包括:(1)缺乏包含自然语言描述与对应网表的高质量模拟电路数据集;(2)通用分词器(如Byte Pair Encoding, BPE)无法有效捕捉电路固有的图结构。解决方案的关键在于两个创新:一是构建了迄今最大的标注模拟电路网表数据集(31,341对),覆盖所有主要电路类别;二是提出Circuit Tokenizer(CKT),一种基于频繁子电路挖掘的新型电路图分词器,可将词汇表增长复杂度从线性于最大组件数O(n_max)降低至常数O(1),显著提升表示效率和压缩比(序列长度减少57%,压缩比提高2.3倍),从而支持训练出更小但性能更强的专用语言模型CircuitFormer(511M参数),实现100%语法正确性和83%功能成功率,优于现有开源大模型且参数量仅为其1/240。

链接: https://arxiv.org/abs/2605.05773
作者: Md Touhidul Islam,Sujan Kumar Saha,Farimah Farahmandi,Mark Tehranipoor
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automating analog circuit design remains a longstanding challenge in Electronic Design Automation (EDA). While Transformer-based Large Language Models (LLMs) have revolutionized software code generation, their application to analog hardware design is hindered by two critical limitations: (i) the scarcity of analog design datasets containing natural language description of a design and its corresponding netlist, and (ii) the inefficiency of general-purpose tokenizers (e.g., Byte Pair Encoding (BPE)) in capturing the inherent graph structure of circuits. To bridge this gap, first, we curate the largest annotated dataset of analog circuit netlists to date, comprising 31,341 netlist-natural language description pairs across all major circuit classes. Furthermore, we propose Circuit Tokenizer (CKT), a novel circuit graph tokenizer designed to encode netlist connectivity by explicitly mining frequent subcircuits. In terms of scalability, CKT overcomes the bottleneck of prior circuit graph serialization methods where vocabulary size scales linearly with the maximum number of components in the dataset, n_max (O(n_max)); instead, CKT decouples vocabulary growth from circuit complexity, achieving a constant O(1) complexity. Empirically, CKT outperforms standard BPE on circuit topology representation, reducing sequence length by 57% and achieving a 2.3x superior compression ratio using a compact, fixed vocabulary of size 512. Leveraging this optimized tokenization, we train a circuit-specific language model, CircuitFormer, a 511M parameter encoder-decoder transformer. Our model achieves 100% syntactic correctness and an 83% functional success rate across all major analog circuit categories, outperforming state-of-the-art open-source LLMs by 10% and 14%, respectively, while requiring 240x fewer parameters. The dataset is publicly available at this https URL.
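
下面是一个极简的"频繁子电路挖掘式"分词玩具示例:统计网表序列中最常见的相邻元件对并合并为单个 token,词表大小固定、与单个电路的元件数解耦(仅演示思路,并非 CKT 实现):

```python
from collections import Counter

def mine_frequent_pairs(netlists, top_k=4):
    """Toy frequent-subcircuit mining: count adjacent component-type
    pairs across netlists and promote the top_k most frequent pairs
    to single tokens in a fixed-size vocabulary."""
    counts = Counter()
    for net in netlists:
        for a, b in zip(net, net[1:]):
            counts[(a, b)] += 1
    return [pair for pair, _ in counts.most_common(top_k)]

def tokenize(net, merges):
    """Greedy left-to-right merge of mined pairs; the vocabulary stays
    fixed no matter how many components a circuit contains."""
    out, i = [], 0
    while i < len(net):
        if i + 1 < len(net) and (net[i], net[i + 1]) in merges:
            out.append(net[i] + "+" + net[i + 1]); i += 2
        else:
            out.append(net[i]); i += 1
    return out

nets = [["R", "C", "R", "C", "M"], ["R", "C", "M", "M"]]
merges = set(mine_frequent_pairs(nets, top_k=1))   # most frequent: (R, C)
toks = tokenize(["R", "C", "R", "C"], merges)      # -> ["R+C", "R+C"]
```

合并后序列长度减半,而词表只增加了固定数量的子电路 token,这正是摘要中"词表增长与电路复杂度解耦"的直观含义。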

[AI-129] Confidence is the key: how conformal prediction enhances the generative design of permeable peptides

【速读】:该论文旨在解决生成式强化学习(Reinforcement Learning, RL)在设计具有特定性质的分子(如可渗透性)时,因预测模型超出其适用域而导致的可靠性下降问题。尤其针对环肽(cyclic peptides)这类具有治疗潜力但研究不足的分子类型,传统RL框架可能引导生成位于预测模型不确定区域的分子,从而降低优化过程的效率与可信度。解决方案的关键在于引入基于校准模型的不确定性量化方法——共形预测(Conformal Prediction, CP),将CP作为评分组件嵌入RL生成框架中,使奖励机制能够根据用户设定的置信水平识别可靠预测,并抑制对高不确定性化学空间的探索,从而实现更高效且可靠的环肽设计。这是首次将生成建模与共形预测结合用于指导RL驱动的分子生成任务。

链接: https://arxiv.org/abs/2605.05770
作者: Laura van Weesep,Sunay Chankeshwara,Leonardo De Maria,Florian David,Ola Engkvist,Gökçe Geylan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative models coupled with reinforcement learning (RL), such as REINVENT and PepINVENT, have emerged as a powerful framework for de novo molecular design. During the ideation process these generative frameworks utilize various predictive models as part of the optimization objectives. However, the utility of the predictive models can be limited by their domain of applicability. When RL is used to explore the chemical space with predictive models, it can suggest molecules that lie outside the predictor’s domain of applicability. As a result, the predictions may become less reliable, potentially steering designs into high reward but also high uncertainty chemical spaces. This is particularly pronounced for cyclic peptides which show therapeutic promise due to their modifiability and large interaction surfaces but are understudied compared to small molecules. While passive membrane permeation in cyclic peptides has attracted interest, identifying optimal permeable designs remains challenging yet crucial for targeting intracellular sites. We present an RL-guided generative framework that designs permeable cyclic peptides using an uncertainty-aware permeability predictor as the scoring component. To address predictive uncertainty, especially impacted by novel chemistry, we integrate conformal prediction (CP) as our uncertainty quantification method. CP assesses designs based on the calibrated model under a user-defined confidence level. We demonstrate that rewarding generated peptides with CP-informed predictions improves both reliability and efficiency of peptide optimization process. This also discourages exploration outside the predictor’s applicability domain. This approach bridges the gap between predictive uncertainty and RL-guided exploration, showing how generative modelling and conformal prediction can be combined for the first time.
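
下面以 split conformal 回归为例,示意"按用户置信水平构造预测区间、并用区间而非点预测决定奖励"的打分方式(阈值门控规则为假设性示例,非论文实现):

```python
import numpy as np

def conformal_quantile(cal_preds, cal_labels, alpha=0.1):
    """Split conformal prediction: the finite-sample (1-alpha) quantile of
    absolute calibration residuals gives a half-width q such that the
    interval [pred - q, pred + q] covers the truth with ~(1-alpha) prob."""
    scores = np.abs(np.asarray(cal_labels) - np.asarray(cal_preds))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample correction
    return np.sort(scores)[min(k, n) - 1]

def cp_informed_reward(pred, q, threshold):
    """Hypothetical CP-informed reward for RL-guided design: reward a
    peptide only if the entire conformal interval clears the
    permeability threshold, discouraging high-uncertainty designs."""
    return 1.0 if pred - q >= threshold else 0.0

cal_preds = np.zeros(10)
cal_labels = np.linspace(0.1, 1.0, 10)    # residuals 0.1 ... 1.0
q = conformal_quantile(cal_preds, cal_labels, alpha=0.1)
r_hi = cp_informed_reward(2.5, q, threshold=1.0)   # interval clears bar
r_lo = cp_informed_reward(1.5, q, threshold=1.0)   # interval too uncertain
```

这种门控使得落在预测器适用域之外(残差大、区间宽)的新化学结构难以获得奖励,从而抑制"高奖励但高不确定性"区域的探索。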

[AI-130] Evaluating Explainability in Safety-Critical ATR Systems: Limitations of Post-Hoc Methods and Paths Toward Robust XAI (ICPR Workshop)

【速读】:该论文旨在解决生成式 AI (Generative AI) 在安全关键型自动目标识别(Automatic Target Recognition, ATR)系统中面临的可解释性不足问题。当前模型虽具备高预测性能,但其决策过程缺乏可解释性、可靠性及可验证性,难以满足安全认证要求。解决方案的关键在于将可解释性建模为面向保障(assurance-oriented)的评估问题,提出一个涵盖解释性、鲁棒性、易受操纵性及验证适用性的四维评估框架,并系统分析主流后处理解释方法(如基于显著性、注意力机制和代理模型的方法)的局限性,揭示其存在虚假解释、扰动下不稳定性以及视觉误导导致过度信任等关键失效模式。研究强调需从依赖视觉合理性的解释转向更具因果基础和物理约束的可解释方法,以支持可靠决策与系统级安全保障。

链接: https://arxiv.org/abs/2605.05748
作者: Vanessa Buhrmester,David Muench,Dimitri Bulatov,Michael Arens
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 1 image, 1 table, ICPR workshop

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) is increasingly recognized as essential for deploying machine learning systems in safety-critical environments. In Automatic Target Recognition (ATR), where models operate on image, video, radar, and multisensor data, high predictive performance alone is insufficient. Model decisions must also be interpretable, reliable, and suitable for validation. This paper presents a structured evaluation of explainability methods in the context of safety-critical ATR systems: we identify major XAI paradigms, including saliency-based, attention-based, and surrogate approaches, as well as recent detection-aware extensions. Based on this, we formalize explainability as an assurance-oriented assessment problem, introduce a taxonomy, and assess these methods with respect to four key dimensions: interpretability, robustness, vulnerability to manipulation, and suitability for validation and verification. The analysis identifies systematic limitations of current post-hoc explanation methods. In particular, we derive critical failure modes such as spurious explanations, instability under perturbations, and overtrust induced by visually convincing outputs. These findings indicate that widely used XAI techniques may be insufficient for safety-critical deployment. Finally, we discuss implications for ATR systems and outline directions toward more robust, causally grounded, and physically informed explainability methods. Our results emphasize the need to move beyond visually plausible explanations toward approaches that support reliable decision making and system-level assurance.

[AI-131] Best Arm Identification in Generalized Linear Bandits via Hybrid Feedback

【速读】:该论文旨在解决在广义线性 bandits 框架下,基于混合反馈模型(即单臂绝对奖励反馈与成对相对(dueling)反馈)的固定置信度最优臂识别问题。其核心挑战在于如何统一处理异质观测类型并实现高效样本利用。解决方案的关键在于提出一种基于似然比的置信序(confidence sequence),该序列在自协调假设(self-concordance assumption)下可导出显式的椭球形置信集,并在此基础上设计了一种混合 Track-and-Stop 算法,通过跟踪联合动作空间(单臂与成对动作)上的最小最大最优设计来自适应分配查询策略,从而实现 δ-正确性及停止时间的高概率上界。

链接: https://arxiv.org/abs/2605.05745
作者: Qirun Zeng,Xuchuang Wang,Jiayi Shen,Xutong Liu,Fang Kong,Jinhang Zuo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study fixed-confidence best arm identification in generalized linear bandits under a hybrid feedback model: at each round, the learner may query either (i) absolute reward feedback from a single arm or (ii) relative (dueling) feedback from an arm pair, both governed by generalized linear models. We introduce a likelihood-ratio-based confidence sequence that unifies heterogeneous generalized linear observations and yields an explicit ellipsoidal confidence set under a self-concordance assumption. Building on this confidence set, we propose a hybrid Track-and-Stop algorithm that adaptively allocates queries by tracking a minimax-optimal design over a joint action space of arms and pairs. We establish δ-correctness and provide high-probability upper bounds on the stopping time. We further extend the framework to a cost-aware setting that accounts for heterogeneous acquisition costs across feedback modalities. Empirical experiments demonstrate that the proposed algorithms significantly improve sample efficiency over baseline methods.

[AI-132] HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中的动态机制理解不足的问题,尤其是现有分析工具分辨率有限,难以捕捉层间细微的置信度变化。其解决方案的关键在于识别出Transformer架构中存在一种内在的放大机制:深层网络会放大逐层置信度的小幅波动,从而形成细粒度的置信度轨迹。基于此发现,作者提出HyperLens——一种高分辨率探针,用于追踪置信度轨迹并量化推理过程中的认知努力(cognitive effort)。通过该方法,研究揭示了复杂任务与简单任务在置信度轨迹上的系统性差异,并将其抽象为可量化的认知努力指标,进而发现复杂任务始终需要更高的认知努力;此外,还揭示了标准监督微调(Supervised Fine-Tuning, SFT)可能因降低认知努力而导致域内性能下降的机制性问题。

链接: https://arxiv.org/abs/2605.05741
作者: Chengda Lu,Xiaoyu Fan,Wei Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages

点击查看摘要

Abstract:While Large Language Models (LLMs) achieve strong performance across diverse tasks, their inference dynamics remain poorly understood because of the limited resolution of existing analysis tools. In this work, we identify an intrinsic magnification mechanism in transformer architectures: deeper layers inherently magnify the small changes of layer-wise confidence, providing a fine-grained confidence trajectory. Building on this insight, we introduce HyperLens, a high-resolution probe designed to trace confidence trajectories and quantify the cognitive effort during inference. Across LLMs and datasets, HyperLens reveals a consistent divergence in confidence trajectories that separates complex from simple tasks. We abstract this pattern into a quantitative cognitive effort metric. Our analysis reveals a fundamental principle: complex tasks consistently require higher cognitive effort. Finally, we provide a mechanistic diagnosis of a common side effect of standard Supervised Fine-Tuning (SFT): it can reduce cognitive effort and consequently degrade performance on in-domain tasks.
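作为对上述逐层置信度轨迹思想的直观说明,下面给出一个假设性的极简sketch(并非HyperLens的官方实现):仿照logit-lens的做法,假设每层隐藏状态都已映射到词表logits,则可逐层取最终预测token的softmax概率,得到一条置信度轨迹;玩具数据中深层对目标token的偏好被放大,轨迹单调上升。

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_trajectory(layer_logits):
    """逐层置信度轨迹的示意:取各层对最终预测token的softmax概率。
    假设每层隐藏状态已映射到词表logits(类似logit-lens)。"""
    final_tok = int(np.argmax(layer_logits[-1]))
    return [float(softmax(l)[final_tok]) for l in layer_logits]

# 3层、5词表的玩具logits:对目标token的偏好逐层放大
logits = [np.array([0.2, 0.1, 0.0, 0.0, 0.0]),
          np.array([1.0, 0.2, 0.0, 0.0, 0.0]),
          np.array([4.0, 0.5, 0.0, 0.0, 0.0])]
traj = confidence_trajectory(logits)
print(traj[0] < traj[1] < traj[2])  # → True:深层放大置信度的小幅变化
```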

[AI-133] CoMemNet: Contrastive Sampling with Memory Replay Network for Continual Traffic Prediction

【速读】:该论文旨在解决流式交通网络中静态图结构无法有效捕捉持续演化模式的问题,尤其是在时间序列预测任务中因模型遗忘历史信息而导致的灾难性遗忘(catastrophic forgetting)问题。解决方案的关键在于提出一种双分支持续学习框架 CoMemNet:其中快速收敛的 Online 分支负责实时预测任务,而通过 Wasserstein Distance 特征更新的 Target 分支结合动态对比采样器(Dynamic Contrastive Sampler, DC Sampler)识别具有显著动态特征变化的节点集用于训练,从而缓解遗忘;同时引入轻量级节点自适应时间记忆缓冲区(Node-Adaptive Temporal Memory Buffer, TMRB-N)实现旧知识的记忆回放,防止内存爆炸,整体提升了交通预测的稳定性与准确性。

链接: https://arxiv.org/abs/2605.05738
作者: Mei Wu,Wenchao Weng,Wenxin Su,Wenjie Tang,Wei Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:In recent years, the integration of non-topological space modeling with temporal learning methods has emerged as an effective approach for capturing spatio-temporal information in non-Euclidean graphs. However, most existing methods rely on static underlying graph structures, which are inadequate for capturing the continuously expanding and evolving patterns in streaming traffic networks. To address this challenge, we propose a simple yet efficient dual-branch continual learning framework for traffic prediction, named CoMemNet. The fast-converging Online branch undertakes the primary prediction tasks, while the momentum-updated Target branch extracts historical information using Wasserstein Distance features to create a Dynamic Contrastive Sampler (DC Sampler). This sampler selects a node set with significant dynamic network feature changes for training, effectively mitigating the issue of catastrophic forgetting. Additionally, the backbone incorporates a lightweight Node-Adaptive Temporal Memory Buffer (TMRB-N) to consolidate old knowledge through memory replay and address the risk of memory explosion. Finally, we provide two newly curated open-source datasets. Experimental results demonstrate that CoMemNet achieves state-of-the-art (SOTA) performance across all three large-scale real-world datasets. The code is available at: this https URL.
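DC Sampler"按动态特征变化选取节点"的思路可用如下假设性的极简实现说明(非论文官方代码;函数名为本文假设):用一维Wasserstein距离衡量各节点新旧特征分布的漂移,贪心选出变化最显著的节点子集参与训练。

```python
import numpy as np

def wasserstein_1d(a, b):
    # 等长一维经验分布的Wasserstein-1距离:排序后逐点求平均绝对差
    a, b = np.sort(a), np.sort(b)
    return float(np.mean(np.abs(a - b)))

def dc_sample(old_feats, new_feats, k):
    """为每个节点计算新旧特征分布的Wasserstein距离,
    返回变化最显著的 k 个节点索引(DC Sampler思想的示意性简化)。"""
    dists = np.array([wasserstein_1d(old_feats[i], new_feats[i])
                      for i in range(old_feats.shape[0])])
    return np.argsort(dists)[-k:][::-1]

rng = np.random.default_rng(0)
old = rng.normal(0.0, 1.0, size=(10, 64))   # 10个节点,各64个历史观测
new = old.copy()
new[3] += 5.0                               # 节点3发生显著漂移
picked = dc_sample(old, new, k=2)
print(picked[0])  # → 3:漂移最大的节点排在首位
```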

[AI-134] SDFlow: Similarity-Driven Flow Matching for Time Series Generation

【速读】:该论文旨在解决基于自回归(Autoregressive, AR)建模的向量量化(Vector Quantization, VQ)时间序列生成方法中存在的暴露偏差(Exposure Bias)问题,该偏差会导致长序列生成时误差累积,从而显著降低生成质量。解决方案的关键在于提出一种非自回归框架 SDFlow(Similarity-Driven Flow Matching),其核心创新包括:(1) 通过全局传输映射(transport map)替代逐步 token 预测以消除暴露偏差;(2) 利用低秩流形分解与学习到的锚点先验(anchor prior)缓解 VQ 潜空间的高维性;(3) 在变分流匹配框架中引入码本索引的类别后验分布,将离散监督信号融入连续传输动力学。该方法在保持高保真度的同时实现了并行化生成,显著提升了长序列生成性能和推理效率。

链接: https://arxiv.org/abs/2605.05736
作者: Wei Li,Shibo Feng,Pengcheng Wu,Min Wu,Peilin Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vector quantization (VQ) with autoregressive (AR) token modeling is a widely adopted and highly competitive paradigm for time-series generation. However, such models are fundamentally limited by exposure bias: during inference, errors can accumulate across sequential predictions, leading to pronounced quality degradation in long-horizon generation. To address this, we propose SDFlow (Similarity-Driven Flow Matching), a non-autoregressive framework that operates entirely in the frozen VQ latent space and enables parallel sequence generation via flow matching. We tackle three key challenges in making this transition: (1) eliminating exposure bias by replacing step-wise token prediction with a global transport map; (2) mitigating the high-dimensionality of VQ token spaces via a low-rank manifold decomposition with a learned anchor prior over the latent manifold; and (3) incorporating discrete supervision into continuous transport dynamics by introducing a categorical posterior over codebook indices within a variational flow-matching formulation. Extensive experiments show that SDFlow achieves state-of-the-art performance, improving Discriminative Score and substantially reducing Context-FID, particularly for challenging long-sequence generation. Moreover, SDFlow provides significant inference speedups over autoregressive baselines, offering both high fidelity and computational efficiency. Code is available at this https URL
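流匹配(flow matching)的基本训练目标可用如下极简sketch说明(通用形式的示意,非SDFlow的官方实现;velocity_fn 等名称为本文假设):在噪声 x0 与数据 x1 的线性插值点上,回归该路径下的恒定目标速度 x1 − x0。

```python
import numpy as np

def flow_matching_loss(x0, x1, velocity_fn, rng):
    """通用流匹配训练目标的示意:
    velocity_fn(x_t, t) 为待学习的速度场(这里用任意可调用对象代替网络)。"""
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1          # 线性概率路径
    target_v = x1 - x0                     # 该路径下的真实速度
    pred_v = velocity_fn(x_t, t)
    return float(np.mean((pred_v - target_v) ** 2))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(32, 8))              # 一批"数据"(如冻结VQ潜向量)
x0 = rng.normal(size=(32, 8))              # 先验噪声
oracle = lambda x_t, t: x1 - x0            # 速度场完全正确时损失为0
print(flow_matching_loss(x0, x1, oracle, rng))  # → 0.0
```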

[AI-135] CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在持续学习(continual learning)过程中因不断微调导致的灾难性遗忘(catastrophic forgetting)问题。其解决方案的关键在于提出CRAFT框架,该框架不直接更新模型权重,而是通过在隐藏表示空间(hidden representations)中学习低秩干预(low-rank interventions)来实现任务适应;具体而言,CRAFT通过输出分布差异(output-distribution divergence)进行任务路由,利用KL散度(Kullback-Leibler divergence)作为统一目标函数来控制遗忘并指导收敛,同时将新任务的干预合并到共享表示中,从而在多个基准和模型规模上显著提升性能并减少遗忘,且对任务顺序具有鲁棒性。

链接: https://arxiv.org/abs/2605.05732
作者: Md Anwar Hossen,Fatema Siddika,Juan Pablo Munoz,Tanya Roosta,Ali Jannesari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages

点击查看摘要

Abstract:Large language models (LLMs) can acquire new capabilities through fine-tuning, but continual adaptation often leads to catastrophic forgetting. We propose CRAFT, a continual learning framework that avoids updating model weights by instead learning low-rank interventions on hidden representations. CRAFT proceeds in three stages: it first routes each task to a group of similar tasks based on output-distribution divergence; it then fine-tunes the model using a Kullback-Leibler (KL) divergence against the group’s prior state, which directly controls forgetting and determines convergence; finally, it merges interventions for the updated task into the shared representation using the same KL signal. This design unifies routing, regularization, and merging through a single KL-based objective. CRAFT improves overall performance and reduces forgetting compared to strong LoRA-based approaches across multiple benchmarks and model scales, while remaining robust to task ordering. These results suggest that controlling adaptation in representation space, guided by output-space divergence, provides a scalable and principled approach to continual learning in LLMs.
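CRAFT以KL散度约束遗忘的思想可用如下假设性sketch说明(简化为输出分布层面的数值示意,非论文实现;KL方向与函数名均为本文假设):计算更新后模型输出分布相对任务组先验状态的KL散度,未偏离时为0,偏离越大惩罚越强。

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_to_prior(logits_new, logits_prior):
    """KL(p_prior || p_new) 的批均值:衡量更新后输出分布
    相对先验状态的偏离,可作为遗忘控制的正则项(示意实现)。"""
    p = softmax(logits_prior)
    q = softmax(logits_new)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

prior = np.array([[2.0, 0.0, -1.0]])
print(kl_to_prior(prior, prior))                            # → 0.0(未偏离)
print(kl_to_prior(np.array([[0.0, 2.0, -1.0]]), prior) > 0)  # → True(偏离被惩罚)
```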

[AI-136] Knee Osteoarthritis Severity Grading Using Optimized Deep Learning and LLM-Driven Intelligent AI on Computationally Limited Systems CEC

【速读】:该论文旨在解决膝骨关节炎(Knee Osteoarthritis, KOA)诊断中因主观性及观察者间差异导致的评估不准确问题,从而实现对KOA严重程度的精准、及时分级。其解决方案的关键在于融合深度学习卷积神经网络(Convolutional Neural Network, CNN)与基于TensorFlow Lite的设备端推理平台,提出一种基于ResNet-18架构的自动化诊断模型,并通过迁移学习在公开数据集上训练和优化,最终达到94.48%的测试准确率。此外,模型被轻量化为TensorFlow Lite格式,可在无持续互联网连接的资源受限设备上部署,同时引入大型语言模型(Large Language Model, LLM)如Gemini-2.0-flash生成结构化解释性报告,提升可解释性和临床可用性,从而推动AI辅助膝关节筛查工具的早期诊断与普及应用。

链接: https://arxiv.org/abs/2605.05731
作者: Dayam Nadeem,Neha,Safdar Mustafa,Adnan Alvi,Mohd Hussain
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 11 figures, Accepted and presented at the 2nd International Conference on Emerging Computational Intelligence (ICECI 2026), IEEE. Published in conference proceedings. To appear in IEEE Xplore

点击查看摘要

Abstract:Knee osteoarthritis (KOA) is among the musculoskeletal disorders that considerably restrict joint mobility, cause severe chronic pain, and negatively impact quality of life. It is one of the persistent health issues worldwide. Subjectivity and inter-observer variability undermine the conventional practices and evaluation processes adopted to address it, so precise and timely diagnosis is an effective way to assess its severity. This paper proposes an automated diagnostic approach for severity grading of KOA by blending a deep learning convolutional neural network (CNN) with a device-based inference platform powered by TensorFlow Lite. It proposes a model based on the ResNet-18 convolutional neural network, trained on a publicly available database. Through a transfer learning approach, knee images are first classified into five Kellgren-Lawrence (KL) grades, and the model is then optimised. During training, a test accuracy of 94.48% with stable convergence was achieved. Subsequently, the optimised model is transformed into a lightweight TensorFlow Lite format, facilitating seamless deployment on resource-constrained devices; it can operate in environments without continuous internet connectivity. Additionally, an auxiliary Large Language Model (Gemini-2.0-flash) is applied to generate structured interpretive findings such as potential symptoms, risk factors, and preventive measures. The LLM component functions as an interface without influencing the classification process. The proposed approach demonstrates the feasibility of an on-device, interpretable decision-support tool for early diagnosis and improved accessibility of Artificial Intelligence (AI)-assisted knee screening.

[AI-137] WARP: A Benchmark for Primal-Dual Warm-Starting of Interior-Point Solvers

【速读】:该论文旨在解决电力市场运营中AC最优潮流(AC-OPF)求解效率低的问题,其核心挑战在于现有基于机器学习的预热启动方法在评估时存在基准不当,导致对性能提升的高估。解决方案的关键在于:首先修正评估基准,从以往采用的“平坦初值”(flat start)改为求解器真实默认的“变量边界中点”(variable-bound midpoint),并揭示仅预测原始变量(primal-only)无法减少迭代次数的根本原因——内点法(IPM)中原始变量预测精度与收敛速度呈负相关,且缺乏对偶变量会导致求解器发散;其次提出WARP模型(一种拓扑条件编码-处理-解码交互网络),直接预测完整的原始-对偶-障碍内点法状态 (x̂, λ̂, ẑ, μ̂),实现在不重新训练的情况下适应N-1故障拓扑变化。预言机实验表明,提供完整真值状态 (x*, λ*, z*, μ*) 可将IPOPT迭代次数从23次降至3次(减少85%),而WARP实际实现了76%的迭代削减。

链接: https://arxiv.org/abs/2605.05728
作者: Dhruv Suri,Helgi Hilmarsson,Shourya Bose
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Solving AC Optimal Power Flow (AC-OPF) is of central importance in electricity market operations, where interior-point methods (IPMs) such as IPOPT are the standard solvers. A growing body of work uses machine learning to predict primal warm-start iterates, reporting iteration reductions of 30-46%. We show that these reported gains rest on an inappropriate evaluation baseline: prior methods benchmark against the flat start (V_m = 1, V_a = 0), whereas the solver's actual default - the variable-bound midpoint (l+u)/2 - is near-optimal for log-barrier centrality. Against this corrected baseline, no primal-only warm-start method reduces solver iterations. We trace the failure to a geometric property of interior-point methods: primal prediction accuracy is anticorrelated with convergence speed, and providing the ground-truth optimal solution x* without dual variables causes the solver to diverge. Oracle experiments establish that the complete primal-dual-barrier state (x*, λ*, z*, μ*) reduces IPOPT iterations from 23 to 3 - an 85% reduction that is structurally inaccessible to primal-only methods. To enable rigorous evaluation of warm-start methods on this task, we release a benchmark suite comprising dual-labeled AC-OPF datasets with IPOPT-extracted solutions, a corrected evaluation protocol, and WARP - a topology-conditioned encode-process-decode interaction network that predicts the full interior-point state (x̂, λ̂, ẑ, μ̂) on the heterogeneous constraint graph. WARP achieves a 76% reduction in IPOPT iterations while natively accommodating N-1 contingency topology variations without retraining.
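文中强调的“变量边界中点”默认初始点可用几行代码说明(仅为示意;midpoint_start 为本文假设的函数名):对每个带上下界的变量取 (l+u)/2,这正是以往工作应当对比的真实基线。

```python
import numpy as np

def midpoint_start(lower, upper):
    """内点法求解器(如IPOPT)的默认初始点:变量上下界中点 (l+u)/2,
    而非以往评测所用的平坦初值(flat start)。"""
    return 0.5 * (lower + upper)

# 以电压幅值为例:典型界 [0.9, 1.1] p.u.
l = np.array([0.9, 0.9, 0.9])
u = np.array([1.1, 1.1, 1.1])
print(midpoint_start(l, u))  # → [1. 1. 1.],在对数障碍意义下接近中心
```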

[AI-138] SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在面对日益庞大的可复用技能库时,如何高效准确地检索出与用户请求最相关的技能这一系统性挑战。随着技能生态系统的扩展,传统依赖显式技能名称调用的方式不再适用,而当前技能检索方法在真实场景下的表现仍不理想,缺乏统一的基准测试和深入理解。为此,作者提出了SkillRet——一个大规模技能检索基准,包含17,810个公共代理技能、结构化语义标签及两级分类体系,并提供63,259条训练样本与4,997个评估查询,支持检索模型的训练与评测。解决方案的关键在于通过任务特定微调(task-specific fine-tuning)显著提升检索性能:相较于最强的现有检索器,NDCG@10指标提升13.1点;相较于最强的现成检索器,提升16.9点,且分析表明,这种提升源于微调后的模型能更聚焦于长而嘈杂查询中的微弱技能相关信号。

链接: https://arxiv.org/abs/2605.05726
作者: Hongcheol Cho,Ryangkyung Kang,Youngeun Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption breaks down as skill ecosystems grow under tight context and latency budgets. Despite its practical importance, skill retrieval remains underexplored, with limited benchmarks and little understanding of retrieval behavior on realistic skill libraries. To address this gap, we introduce SkillRet, a large-scale benchmark for skill retrieval in LLM agents. SkillRet contains 17,810 public agent skills, organized with structured semantic tags and a two-level taxonomy spanning 6 major categories and 18 sub-categories. It provides 63,259 training samples and 4,997 evaluation queries with disjoint skill pools, enabling both benchmarking and retrieval-oriented training. Across a diverse set of retrievers, we find that skill retrieval remains far from solved: off-the-shelf models struggle on realistic large-scale skill libraries, and prior skill-retrieval models still leave substantial headroom. Task-specific fine-tuning on SkillRet substantially improves performance, improving NDCG@10 by +13.1 points over the strongest prior retriever and by +16.9 points over the strongest off-the-shelf retriever. Our analysis further suggests that these gains arise because fine-tuned models better focus on the small skill-relevant signals within long and noisy queries. These results establish SkillRet as a strong benchmark and foundation for future research on retrieval in large-scale agent systems.
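文中用作主指标的 NDCG@10 在二值相关性下的标准计算如下(通用实现,非基准官方评测脚本;技能ID为本文假设的示例):正确技能排名越靠前,折扣增益越高。

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k 的标准计算(二值相关性):评测技能检索的常用指标。"""
    rels = [1 if r in relevant_ids else 0 for r in ranked_ids[:k]]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

# 正确技能排在第1位 → NDCG@10 = 1.0;排在第2位则下降
print(ndcg_at_k(["s1", "s2", "s3"], {"s1"}))            # → 1.0
print(round(ndcg_at_k(["s2", "s1", "s3"], {"s1"}), 3))  # → 0.631
```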

[AI-139] Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的时间序列异常检测方法中存在的可控性差、可解释性弱和可靠性不足的问题,尤其是在面对复杂异常模式时,单一通用模型难以提供结构化诊断。其解决方案的关键在于提出SAGE(Specialized Analyzer Group for Expert-like Detection),一个面向单变量时间序列的多智能体框架,将异常分析分解为四个专业化分析器(Analyzers):点异常、结构异常、季节性异常和模式异常,每个分析器使用特定领域的数值工具与可视化手段生成证据;随后由一个基于证据的检测器(Detector)整合这些证据,输出带有置信度分数的异常记录(含区间和候选类型);最后由监督模块(Supervisor)将结构化结果转化为面向分析师的诊断报告。此外,SAGE通过从正常参考训练片段中构建合成上下文示例来实现零样本推理,无需真实异常段或异常类型标签作为示例,从而提升了检测的可靠性和诊断输出的实际可用性。

链接: https://arxiv.org/abs/2605.05725
作者: Hyeongwon Kang,Jeongseob Kim,Jinwoo Park,Pilsung Kang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint. 9 pages main text, 29 pages total, 8 figures, 9 tables, with appendix

点击查看摘要

Abstract:Recent studies have explored large language models for time-series anomaly detection, yet existing approaches often rely on a single general-purpose model to directly infer anomaly indices or intervals, limiting controllability, interpretability, and reliability for complex anomaly patterns. We propose SAGE (Specialized Analyzer Group for Expert-like Detection), a multi-agent framework for structured anomaly diagnosis in univariate time series. It decomposes anomaly analysis into four specialized Analyzers for point, structural, seasonal, and pattern anomalies. Each Analyzer applies family-specific numerical tools and diagnostic visualizations to generate evidence, while an evidence-grounded Detector consolidates the evidence into confidence-scored anomaly records with intervals and candidate types. A Supervisor then converts these structured records into analyst-facing diagnostic reports. SAGE further constructs synthetic in-context examples from normal-reference training segments, without using real anomalous segments or anomaly-type labels as in-context examples. Across three benchmarks, SAGE achieves the best average performance among strong ML/DL and language-model-based baselines. Ablation studies and human evaluation further show that the proposed framework improves detection reliability and the practical usefulness of diagnostic outputs.

[AI-140] Conceal Reconstruct Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLM s

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在面对意图混淆型越狱攻击(intent-obfuscation-based jailbreak attacks)时的安全性问题,即攻击者通过将有害请求转化为隐蔽的多模态输入以绕过安全过滤机制。研究发现,此类攻击受到“重建—隐藏权衡”(reconstruction–concealment tradeoff)的制约:变换后的输入需足够隐蔽以规避检测,同时又要保留足够的信息使目标模型能够重建原始请求。现有方法难以平衡这一权衡,导致攻击效果受限。论文提出的关键解决方案是“感知隐藏的变体构造”(concealment-aware variant construction),其核心在于贪婪地选择低有害关键词对齐度且相互多样化的字符移除变体,并结合五种模态感知提示策略进行实例化;此外引入与有害关键词相关的干扰图像(keyword-related distractor images),通过多样化语境提供更有效的辅助视觉上下文,显著提升了攻击成功率,揭示了模型自身重建能力可能被利用来恢复隐藏的有害意图并生成不安全响应这一新型漏洞。

链接: https://arxiv.org/abs/2605.05709
作者: Md Farhamdur Reza,Richeng Jin,Tianfu Wu,Huaiyu Dai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 39 pages, including appendices

点击查看摘要

Abstract:Intent-obfuscation-based jailbreak attacks on multimodal large language models (MLLMs) transform a harmful query into a concealed multimodal input to bypass safety mechanisms. We show that such attacks are governed by a reconstruction-concealment tradeoff: the transformed input must hide harmful intent from safety filters while remaining recoverable enough for the victim model to reconstruct the original request. Through a reconstruction analysis of three representative black-box methods, we find that existing transformations struggle to balance this tradeoff, limiting their effectiveness. In contrast, we show that character-removed variants achieve a better balance. Building on this, we propose concealment-aware variant construction, which greedily selects character-removed variants that are low in harmful-keyword alignment and mutually diverse, and instantiates them through five modality-aware prompting strategies. We further introduce keyword-related distractor images that depict the harmful keyword in diverse contexts, providing more effective auxiliary visual context than generic distractor images. Experiments across closed-source and open-source MLLMs show the proposed strategies outperform strong baselines, revealing an underexplored vulnerability: a model's own reconstruction ability can be exploited to recover hidden harmful intent and produce unsafe responses.
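“贪心选取低关键词对齐度且互异的字符删除变体”这一步可用如下假设性简化说明(非论文原始评分函数;对齐度定义与函数名均为本文假设,仅以无害示例词演示机制):

```python
def char_removed_variants(word):
    """生成逐字符删除的变体,如 'demo' -> 'emo','dmo','deo','dem'。"""
    return [word[:i] + word[i+1:] for i in range(len(word))]

def greedy_select(variants, keyword, k):
    """贪心选取与关键词字符重合度低、且彼此互异的变体
    (对 concealment-aware variant construction 思想的假设性简化)。"""
    def align(v):  # 与关键词的字符集重合度,越低越隐蔽
        return len(set(v) & set(keyword)) / max(len(set(keyword)), 1)
    chosen = []
    for v in sorted(variants, key=align):
        if v not in chosen:
            chosen.append(v)
        if len(chosen) == k:
            break
    return chosen

vs = char_removed_variants("demo")
print(vs)                             # → ['emo', 'dmo', 'deo', 'dem']
print(len(greedy_select(vs, "demo", 2)))  # → 2
```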

[AI-141] Resolving the bias-precision paradox with stochastic causal representation learning for personalized medicine

【速读】:该论文旨在解决从纵向观察数据中估计个体化治疗效应(Individualized Treatment Effects, ITE)时面临的因果表示学习中的偏差-精度悖论问题,即现有方法在降低混杂偏倚的同时常抑制临床信息丰富的异质性,导致患者特异性预测性能下降。其解决方案的关键在于提出一种基于采样的最大均值差异(sampling-based maximum mean discrepancy, sMMD)策略,通过替代传统的全局对抗平衡方式,转而采用子集层面的匹配机制,在保持因果识别有效性的同时选择性保留临床决策关键变量,从而实现对反事实结果的精准预测与可解释性增强。

链接: https://arxiv.org/abs/2605.05706
作者: Peisong Zhang,Manqiang Peng,Yuxuan Wu,Pawit Phadungsaksawasdi,Wesley Yeung,Ye Zhang,Trang Nguyen,Qiang Zhang,Nan Liu,Meng Wang,Kee Yuan Ngiam,Yih-Chung Tham,Ching-Yu Cheng,Tianfan Fu,Qingyu Chen,Rosemary Ke,Chang Li,Wenzhuo Yang,Zhenghao Lu,Chunyou Lai,Yu Zhang,Sheng Zhong,Hao Deng,Dianbo Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Estimating individualized treatment effects from longitudinal observational data is central to data-driven medicine, yet existing methods face a fundamental limitation: reducing confounding bias often suppresses clinically informative heterogeneity, degrading patient-specific predictions. Here, we identify this tension as a bias-precision paradox in causal representation learning and introduce sampling-based maximum mean discrepancy (sMMD), a stochastic alignment strategy that replaces global adversarial balancing with subset-level matching. We instantiate this approach in a framework for counterfactual outcome prediction with attribution-grounded interpretability. Across two large-scale ICU cohorts (n = 27,783), our framework improves accuracy under distribution shift, reducing error by up to 11.5% and substantially increasing recall in high-risk tasks. Mechanistic analyses show that sMMD selectively preserves clinically decisive variables. In human-AI evaluation, our method outperforms clinicians-in-training and large language models, and improves clinician accuracy by 14.7% while reducing decision time, enabling interpretable, real-time clinical decision support.
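sMMD“以随机子集级匹配代替全局对齐”的思想可用如下假设性sketch说明(非论文官方实现;RBF核带宽、子集大小等均为本文假设):在多轮随机子集上计算MMD²并取平均。

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """RBF核下的有偏MMD²估计。"""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def sampled_mmd2(x, y, subset=16, rounds=4, rng=None):
    """sMMD思想的示意:在随机子集上计算MMD²并取平均,
    以子集级匹配代替一次性的全局对齐(假设性简化)。"""
    rng = rng or np.random.default_rng(0)
    vals = []
    for _ in range(rounds):
        xi = x[rng.choice(len(x), subset, replace=False)]
        yi = y[rng.choice(len(y), subset, replace=False)]
        vals.append(rbf_mmd2(xi, yi))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
a = rng.normal(0, 1, (64, 4))   # 处理组表示
b = rng.normal(0, 1, (64, 4))   # 同分布的对照组
c = rng.normal(3, 1, (64, 4))   # 分布明显偏移的对照组
print(sampled_mmd2(a, c) > sampled_mmd2(a, b))  # → True:偏移越大,sMMD越大
```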

[AI-142] SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety ICML2026

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在工具使用能力增强的同时引入的安全风险问题,尤其是恶意攻击者可能诱导代理执行有害操作的威胁。现有防御机制虽有效,但常因过度拒绝(over-refusal)而损害良性任务的可用性,形成安全与效用之间的权衡困境。解决方案的关键在于提出名为 SafeHarbor 的新框架,其核心创新包括:通过增强型对抗生成提取上下文感知的防御规则,构建局部分层记忆系统以动态注入规则实现无需训练、高效且即插即用的防护策略;并引入基于信息熵的自进化机制,通过动态节点分裂与合并持续优化记忆结构,从而在保持高拒绝率(>93%)的同时显著提升良性任务的利用率(GPT-4o 上达 63.6%)。

链接: https://arxiv.org/abs/2605.05704
作者: Zhe Liu,Zonghao Ying,Wenxin Zhang,Quanchen Zou,Deyue Zhang,Dongdong Yang,Xiangzheng Zhang,Hao Peng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at this https URL.

[AI-143] Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents

【速读】:该论文旨在解决自演化搜索代理(self-evolving search agents)在早期自博弈(self-play)训练阶段面临的两个关键瓶颈:一是Proposer生成的问题缺乏关系上下文,导致大量无效或不可验证的问题;二是Solver仅获得二元奖励信号,无法利用部分正确搜索轨迹中的有用信息。解决方案的关键在于复用知识图谱(Knowledge Graph, KG)路径作为中间监督信号,实现双重优化:首先,通过LLM引导的知识图谱子图对问题构建进行约束,为Proposer提供关系上下文以生成更合理的多跳问题;其次,提出Waypoint Coverage Reward(WCR),基于问题构造路径中的中间实体覆盖度对错误Solver轨迹给予分级奖励,从而保留过程反馈信息并提升训练效率。实验表明,该方法在七种问答基准和九种模型配置下均优于标准Search Self-Play(SSP),尤其在多跳问答任务上表现显著提升。

链接: https://arxiv.org/abs/2605.05702
作者: Huyu Wu,Jun Liu,Xiaochi Wei,Yan Gao,Yi Wu,Yao Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-evolving search agents reduce reliance on human-written training questions by generating and solving their own search tasks. We build on Search Self-Play (SSP), a representative Proposer and Solver framework in which questions are generated and answered via multi-step search and reasoning. In practice, however, SSP faces two bottlenecks: the Proposer constructs questions from isolated answer entities without relational context, yielding many invalid or unverifiable questions in early self-play training, while the Solver receives only a binary outcome reward that discards useful signal from partially on-track search trajectories. We address both bottlenecks by reusing knowledge-graph paths as construction-derived intermediate supervision for both question construction and reward shaping. First, we ground question construction in LLM-guided knowledge-graph subgraphs, providing relational context for the Proposer. Second, we observe that constructing and solving a multi-hop question can involve overlapping intermediate entities: the factual bridges used to formulate the question may provide approximate waypoints for answering it. Exploiting this overlap, we introduce Waypoint Coverage Reward (WCR), which grants graded partial credit to incorrect Solver trajectories according to their coverage of entities on the construction path, while preserving full reward for correct answers. Across seven QA benchmarks and nine model configurations, our approach improves the average score over standard SSP in all configurations, including notable gains on multi-hop QA tasks. These results suggest that knowledge-graph paths can be reused as lightweight intermediate supervision, providing both relational guidance and process feedback without additional task-specific human annotations or manually labeled process steps.
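Waypoint Coverage Reward 的分级奖励机制可用如下假设性简化说明(非论文原始奖励函数;部分分上限0.5与实体示例均为本文假设):答案正确给满分,错误轨迹按覆盖构造路径中间实体的比例给部分分。

```python
def waypoint_coverage_reward(trajectory_entities, path_entities, correct):
    """WCR思想的示意:正确答案保留满分;
    错误轨迹按覆盖构造路径实体的比例获得分级部分分。"""
    if correct:
        return 1.0
    if not path_entities:
        return 0.0
    covered = sum(1 for e in path_entities if e in trajectory_entities)
    return 0.5 * covered / len(path_entities)   # 部分分上限严格低于满分

path = ["Marie Curie", "University of Paris", "Nobel Prize"]
print(round(waypoint_coverage_reward(["Marie Curie", "Nobel Prize"], path, False), 4))  # → 0.3333
print(waypoint_coverage_reward([], path, True))                                          # → 1.0
```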

[AI-144] Inference-Time Budget Control for LLM Search Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)搜索代理在推理时面临双重预算限制(即工具调用次数和生成token数均受限)下的多跳问答(multi-hop question answering, QA)性能瓶颈问题。核心挑战在于如何在有限资源下有效分配预算,以提升答案质量——这不仅依赖更强的模型,更需要对搜索过程中每一步动作(如检索、分解或最终作答)的优先级进行动态决策,并判断何时积累的证据足以生成可靠答案。解决方案的关键在于提出一种两阶段推理时预算控制机制:第一阶段在搜索过程中引入任务级信息价值(Value-of-Information, VOI)评分,量化每个可行动作在当前状态和剩余预算下的边际任务价值,从而指导动作选择;第二阶段通过一个选择性证据锚定的终局器(finalizer),仅在检测到低风险的答案格式错误时才对轨迹答案进行重写。实验表明,这种双阶段预算控制策略在多个基准测试中显著优于基线方法,且搜索阶段的预算依赖惩罚机制是性能提升的主要来源。

链接: https://arxiv.org/abs/2605.05701
作者: Zhengru Fang,Senkang Forest Hu,Zhonghao Chang,Yu Guo,Yihang Tao,Hongyao Liu,Mengzhe Ruan,Jun Huang,Yuguang Fang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM search agents increasingly rely on tools at inference time, but their trajectories are often constrained by hard limits on both tool calls and generated tokens. Under such dual budgets, better answers require not only stronger models, but also explicit control over which search action should receive the next budget unit and when the accumulated evidence is sufficient to commit a final answer. We study this problem in multi-hop question answering (QA) and formulate it as two-stage inference-time budget control. At search time, our controller assigns each feasible action a task-level Value-of-Information (VOI) score, defined as an operational estimate of marginal task value per unit budget under the current search state and remaining dual budget, and uses this score to choose among retrieval, decomposition, and answer commitment. After search, a selective evidence-grounded finalizer compares the trajectory answer with a refined candidate and rewrites only when the residual error appears to be a low-risk answer-form error. Across four multi-hop QA benchmarks, three LLM backbones, and four budget levels, the method yields positive aggregate gains over four audited baselines under the same hard dual-budget protocol. Ablations show that search-time budget control, especially budget-dependent penalty, provides the main performance gain, while answer-time control helps mainly when the retrieval path is already adequate. These results suggest that inference-time budget control for LLM search agents should govern both how budget is spent during search and how the final answer is committed.
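“按单位预算的边际任务价值选动作”的搜索时控制器可用如下假设性sketch说明(非论文的VOI评分函数;收益与成本数值均为本文虚构,仅演示双重预算下的可行性过滤与打分择优):

```python
def pick_action(actions, calls_left, tokens_left):
    """双重预算下的动作选择示意:每个动作给出估计收益与(调用数, token数)成本,
    任一预算不足即不可行;在可行动作中选"价值/预算单位"最高者(假设性简化)。"""
    best, best_score = None, float("-inf")
    for name, gain, calls, tokens in actions:
        if calls > calls_left or tokens > tokens_left:
            continue  # 超出双重预算,不可行
        score = gain / (calls + tokens / 1000.0)   # 价值 / 预算单位
        if score > best_score:
            best, best_score = name, score
    return best

actions = [
    ("retrieve",  0.30, 1, 200),   # 检索:1次工具调用,200 token
    ("decompose", 0.25, 1, 400),   # 分解子问题
    ("answer",    0.10, 1, 100),   # 提交最终答案
]
print(pick_action(actions, calls_left=2, tokens_left=1000))  # → 'retrieve'
print(pick_action(actions, calls_left=1, tokens_left=150))   # → 'answer'(仅其可行)
```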

[AI-145] An Empirical Study of Proactive Coding Assistants in Real-World Software Development

【速读】:该论文旨在解决当前生成式 AI 编程助手(Generative AI Coding Assistants)在实际应用中因依赖模拟数据而产生的“仿真到现实差距”(simulation-to-reality gap)问题,即现有研究多基于大语言模型(LLM)生成的IDE交互轨迹进行评估与训练,但这些模拟轨迹与真实开发者行为存在显著差异,导致模型性能被高估。解决方案的关键在于通过大规模实证研究收集1,246名行业开发者连续三天的真实IDE交互日志,并构建配对的模拟轨迹用于对比分析;在此基础上提出ProCodeBench——首个面向主动意图预测(proactive intent prediction)的真实世界基准测试集,揭示当前主流方法(包括LLM、检索增强和代理基线)在真实场景下的表现远低于模拟环境,强调了使用真实开发者行为数据对于训练和评估主动编程助手的重要性,同时指出模拟数据可作为预训练阶段的补充,但不能替代真实数据。

链接: https://arxiv.org/abs/2605.05700
作者: Lehui Li,Ruixuan Jia,Guo-Ye Yang,Jia Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM)-based coding assistants have made substantial progress, yet most systems remain reactive, requiring developers to explicitly formulate their needs. Proactive coding assistants aim to infer latent developer intent from integrated development environment (IDE) interactions and repository context, thereby reducing interaction overhead and supporting more seamless assistance. However, research in this direction is limited by the scarcity of large-scale real-world developer behavior data. Existing studies therefore often rely on LLM-simulated IDE traces, whose fidelity to real development behavior remains unclear. In this paper, we investigate this simulation-to-reality gap through a large-scale empirical study. We collect real IDE interaction traces from 1,246 experienced industry developers over three consecutive days using a custom Visual Studio Code extension, and construct paired LLM-simulated traces for controlled comparison. Our analysis shows that simulated traces differ substantially from real traces in behavioral diversity, temporal structure, and exploratory patterns. Based on the collected data, we introduce ProCodeBench, a real-world benchmark for proactive intent prediction. Experiments with representative LLMs, retrieval-augmented methods, and agentic baselines show that current approaches remain far from reliable under real IDE traces, suggesting that simulation-based evaluation can overestimate real-world performance. Finally, our training study shows that simulated data cannot replace real data, but can complement it when used before real-world fine-tuning. These findings highlight the importance of real developer behavior data for evaluating and training proactive coding assistants.

[AI-146] When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

【速读】:该论文旨在解决大语言模型推理过程中KV-cache量化带来的质量-延迟权衡问题,特别是在Apple Silicon统一内存架构下如何实现高效且高质量的量化。其核心解决方案是设计了一个融合的Metal计算内核,集成符号随机化快速傅里叶变换(sign-randomized FFT, SRFT)、通道级λ缩放、组级绝对最大值归一化以及INT4半字节打包等技术,通过单个内核优化减少内存带宽占用与计算开销。该方案在Gemma-3 1B和Qwen2.5-1.5B模型上实现了比fp16更快的推理速度(延迟降低3%–8%,短上下文下提升0.7%–2.6%),同时保持3倍持久内存压缩率和模型质量稳定(如Qwen短提示下的dPPL变化可忽略,Gemma的hook dPPL仅上升3.6)。关键创新在于将多种量化策略融合为单一高效内核,使量化带来的额外计算开销(约25 ns/向量)低于因压缩节省的内存带宽收益,从而实现性能与质量的双重优化。

链接: https://arxiv.org/abs/2605.05699
作者: Mohamed Amine Bergach
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:KV-cache quantization is framed as a quality–latency trade-off. We show it is inverted on Apple Silicon’s unified memory: a single fused Metal kernel (sign-randomized FFT + per-channel λ + per-group abs-max + int4 nibble pack), exposed as a HuggingFace `Cache` subclass, runs faster than fp16 across 256–4096-token prefixes on Gemma-3 1B (−3 to −8% ms/tok) and at short context on Qwen2.5-1.5B (−0.7 to −2.6% through 1K), with 3× persistent memory compression and quality preserved (ΔPPL = 0.000 Qwen short-prompt; +3.6 hook ΔPPL Gemma). The kernel’s ~25 ns/vec overhead is below the bandwidth savings from 3× compression. The fused kernel also closes Qwen’s 4-bit per-token catastrophe (ΔPPL = +7975 → +638.6, a 12.5× reduction) at 182 GFLOPS / D=128. Supporting findings: SRFT and SRHT are statistically indistinguishable for KV quality (we pick SRFT for mixed-radix and matrix-multiply alignment); a learned-rotation ablation surfaces a regularization role for the fixed random SRFT base (learning R+λ without SRFT lowers calibration MSE 84.9% vs 50.3% but yields worse PPL); Householder rotations at k=d/2 reflectors are effectively lossless at d=256.
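摘要中的“per-group abs-max + int4 nibble pack”一步可用如下最小 numpy 草图示意(仅为假设性实现,省略了 SRFT 旋转与逐通道 λ 缩放;函数与参数命名均为示例,并非论文代码):

```python
import numpy as np

def quantize_int4_groupwise(x, group_size=32):
    """Per-group abs-max symmetric quantization to int4 levels in [-7, 7]."""
    g = x.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0.0, 1.0, scales)
    q = np.round(g / scales).astype(np.int8)
    return q, scales

def pack_nibbles(q):
    """Pack two signed int4 values per byte (low nibble first)."""
    u = (q.reshape(-1) + 8).astype(np.uint8)   # shift to unsigned [1, 15]
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_dequant(packed, scales, group_size=32):
    u = np.empty(packed.size * 2, dtype=np.uint8)
    u[0::2] = packed & 0x0F
    u[1::2] = packed >> 4
    q = u.astype(np.int8) - 8
    return (q.reshape(-1, group_size) * scales).reshape(-1)

rng = np.random.default_rng(0)
kv = rng.standard_normal(256).astype(np.float32)   # stand-in for one KV row
q, s = quantize_int4_groupwise(kv)
packed = pack_nibbles(kv_q := q)
recon = unpack_dequant(packed, s)
```

打包后每个元素只占 0.5 字节(fp16 为 2 字节),即摘要中的持久内存压缩来源;每元素量化误差不超过所在组尺度的一半。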

[AI-147] Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers

【速读】:该论文旨在解决Transformer模型在实际部署中面临的单一推理成本与多变性能需求之间的矛盾问题,即如何在不重新训练模型的前提下实现对注意力计算成本的灵活控制,并据此动态调整模型精度。其解决方案的关键在于提出“预算注意力分配”(Budgeted Attention Allocation)机制,这是一种基于请求注意力预算的单调头门控(monotone head-gating)方法,通过在训练阶段采用密集预热(dense warm-starting)提升稳定性,使得同一模型checkpoint能够在不同预算下实现从高精度到低延迟的平滑权衡;进一步地,该机制可将软性成本控制转化为硬性结构加速,在CPU端实现可测量的单线程速度提升,从而为轻量化部署提供一种可行且可控的路径。

链接: https://arxiv.org/abs/2605.05697
作者: Amrit Nidhi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure, 10 tables

点击查看摘要

Abstract:Transformers usually expose one inference cost per trained model, while deployed systems often need multiple cost-quality operating points. We study Budgeted Attention Allocation, a monotone head-gating mechanism conditioned on a requested attention budget. Dense warm-starting is important for stability: on a robust synthetic sequence task, one budgeted model reaches 99.7% accuracy at 0.303 estimated attention cost and 100.0% accuracy at 0.504 cost. On held-out AG News with a custom word-level transformer, hard-gate adaptation turns soft cost control into measured single-thread CPU speed, reaching 82.1% accuracy with 1.28x speedup at budget 0.50. In pretrained BERT-Mini AG News, budgeted structural pruning reaches 87.6% accuracy with 1.20x speedup at budget 0.50; a validation-ranked zero-shot dense post-hoc structural baseline reaches 86.1%, and one recovery epoch raises that per-budget specialist to 87.9%. On DBpedia14, BERT-Mini budgeted gates reach 97.4% at exact budget 0.50 versus 96.6% for dense full attention. Static fixed-budget gates and recovered dense specialists remain strong. The contribution is therefore not universal dominance, but a reproducible feasibility study of one controllable checkpoint across budgets that can trade attention cost for accuracy and be converted into measured structural speedups on small CPU benchmarks.
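摘要中的单调头门控(monotone head-gating)可以用一个简化草图说明:给定每个注意力头的重要性分数与请求预算,保留最重要的头,且预算增大时已打开的门不会关闭(仅为示意性实现,重要性分数与离散化方式均为假设):

```python
import numpy as np

def budgeted_head_gates(importance, budget):
    """Monotone head gating: keep the most important heads until the kept
    fraction reaches `budget`; raising the budget never closes an open gate."""
    n = len(importance)
    k = max(1, int(round(budget * n)))
    order = np.argsort(importance)[::-1]   # most important first
    gates = np.zeros(n)
    gates[order[:k]] = 1.0
    return gates

importance = np.array([0.9, 0.1, 0.5, 0.7])
g_low = budgeted_head_gates(importance, 0.50)    # keeps heads 0 and 3
g_high = budgeted_head_gates(importance, 0.75)   # additionally keeps head 2
```

这种硬门控(整头置零)即摘要中“把软成本控制转成可测的结构加速”的前提:被关闭的头可以整体跳过计算。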

[AI-148] Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

【速读】:该论文旨在解决生成式 AI(Generative AI)中因代理型大语言模型(Agentic LLM)工作负载导致的前缀缓存失效问题,即在每轮推理中位点偏移造成 bit-identical tokens 位置变化,从而引发缓存命中率下降,显著增加首字延迟(TTFT)和计算能耗。现有基于位置无关的缓存系统通过修正 RoPE(Rotary Position Embedding)实现缓存复用,但其代价源于 GQA(Grouped-Query Attention)架构,并非缓存机制本身所需。论文提出的关键解决方案是:利用多头潜在注意力(Multi-Head Latent Attention, MLA)结构,将每个 KV 行分解为与位置无关的 latent 向量 $ c_{KV} $ 和可闭式校正的 64 维旋转向量 $ k_r $,并据此设计内容地址缓存(content-addressed caching),以支持对 agentic 流量下 token 顺序偏移的鲁棒性恢复。其核心创新在于 Irminsul 缓存系统,引入 CDC 分段哈希键和 $ \delta $-旋转规则处理 $ k_r $,实现高达 ~83% 的 prompt token 恢复率及每次缓存命中 63% 的预填充能效提升,论证了内容地址缓存应作为服务栈中的原生组件而非对 prefix 匹配的补丁。

链接: https://arxiv.org/abs/2605.05696
作者: Bole Ma,Jan Eitzinger,Harald Köstler
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of 10-16s on unchanged content. Prior position-independent caching systems correct RoPE on the full d_K-dimensional key, an architectural cost imposed by GQA, not by caching itself. Multi-Head Latent Attention, deployed at scale in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3, factors each KV row into a position-free c_KV and a 64-dim k_r correctable in closed form; this structure motivates content-addressed caching as a natural fit rather than a GQA workaround. We present Irminsul, which extends SGLang’s radix cache with content-hash keying over CDC-chunked segments and a δ-rotation rule for k_r. We evaluate three native MLA-MoE deployments: DeepSeek-V2-Lite (16B/2.4B), Kimi Moonlight-16B-A3B, and JoyAI-Flash (48B/3B), with output-consistency on all three and recovery measured on the two endpoints; Irminsul recovers up to ~83% of prompt tokens above exact-prefix on agentic traffic while delivering 63% prefill energy savings per cache hit. We argue that content-addressed caching belongs in the serving stack as a first-class primitive, not a retrofit over prefix matching.
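摘要所述“对 k_r 的闭式 δ-旋转”本质上利用了 RoPE 旋转的可复合性:把缓存中已按旧位置旋转过的键再按位置差旋转一次,即等价于按新位置从原始键旋转。下面是一个按“前半/后半配对”约定实现的 numpy 草图(配对约定与频率基数均为假设):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard RoPE with (first-half, second-half) pairing."""
    half = len(x) // 2
    theta = pos / base ** (np.arange(half) / half)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

def delta_rotate(k_cached, old_pos, new_pos):
    """Move an already-rotated key from old_pos to new_pos with a single
    rotation by the position delta: R(new) = R(delta) @ R(old)."""
    return rope_rotate(k_cached, new_pos - old_pos)

rng = np.random.default_rng(1)
k_raw = rng.standard_normal(64)
k_at_10 = rope_rotate(k_raw, 10)
k_at_25_direct = rope_rotate(k_raw, 25)
k_at_25_delta = delta_rotate(k_at_10, 10, 25)
```

由于每个二维分量对上的旋转矩阵满足 R(a)R(b) = R(a+b),缓存复用时无需保留原始未旋转的键。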

[AI-149] Saliency-Aware Regularized Quantization Calibration for Large Language Models

【速读】:该论文旨在解决后训练量化(Post-training Quantization, PTQ)中因仅基于有限或非代表性校准数据最小化层级重建误差而导致的泛化风险增加问题,这可能引起下游任务性能下降。解决方案的关键在于提出一种统一框架——Saliency-Aware Regularized Quantization Calibration (SARQC),其在标准PTQ目标基础上引入了一个基于显著性感知的正则化项,该机制促使量化权重在校准过程中尽可能靠近原始权重,从而提升推理时的泛化能力。SARQC可无缝集成至现有PTQ流程中,同时增强基于尺度搜索和Gram方法的性能,且不引入额外推理计算开销。

链接: https://arxiv.org/abs/2605.05693
作者: Yanlong Zhao,Xiaoyuan Cheng,Huihang Liu,Baihua He,Xinyu Zhang,Harrison Bo Hua Zhu,Wenlong Chen,Li Zeng,Zhuo Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predetermined calibration dataset, usually optimized via either scale search or Gram-based methods. However, from the perspective of generalization risk, existing calibration objectives of PTQ based only on empirical reconstruction error on limited or unrepresentative calibration data could move the quantized weights away from the original weights. This may cause the generalization risk to diverge, potentially degrading downstream performance. To address this issue, we propose Saliency-Aware Regularized Quantization Calibration (SARQC), a unified framework that augments the standard PTQ objective with a saliency-aware regularization term. This term encourages quantized weights to stay close to the original weights during calibration, leading to improved generalization during inference. SARQC integrates seamlessly into existing PTQ pipelines, enhancing both scale search and Gram-based methods under a unified formulation. Extensive experiments on dense and Mixture-of-Experts LLMs demonstrate consistent improvements in perplexity and zero-shot accuracy, without additional computational overhead during inference.
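SARQC 的核心是把“重建误差 + 显著性加权的权重漂移正则项”作为校准目标。下面用一个网格搜索量化尺度的最小草图示意该目标的形式(λ、网格范围与显著性来源均为假设,并非论文的优化器):

```python
import numpy as np

def quantize(w, scale, bits=4):
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def calibrate_scale(w, x, saliency, lam=0.1, bits=4):
    """Grid-search a quantization scale minimizing
    layer reconstruction error + lam * saliency-weighted weight drift;
    the regularizer keeps quantized weights near the originals."""
    base = np.abs(w).max() / (2 ** (bits - 1) - 1)
    best_scale, best_loss = base, np.inf
    for r in np.linspace(0.5, 1.2, 29):
        wq = quantize(w, base * r, bits)
        recon = np.mean((x @ wq - x @ w) ** 2)      # standard PTQ objective
        drift = np.mean(saliency * (wq - w) ** 2)   # saliency-aware regularizer
        loss = recon + lam * drift
        if loss < best_loss:
            best_scale, best_loss = base * r, loss
    return best_scale

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8))
x = rng.standard_normal((64, 16))
saliency = np.abs(rng.standard_normal((16, 8)))   # stand-in saliency weights
scale = calibrate_scale(w, x, saliency)
```

正则项在校准数据不具代表性时起“拉回原权重”的作用,对应摘要中的泛化风险分析。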

[AI-150] GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model

【速读】:该论文旨在解决图预测任务中基于扩散模型(diffusion-based methods)的现有方法在推理阶段依赖昂贵的迭代去噪过程且采样不稳定的问题,尤其关注一种潜在的“捷径解”(shortcut solution):即模型可能通过忽略噪声目标来满足自一致性约束,从而退化为确定性预测器。解决方案的关键在于提出图对比一致性模型(Graph Contrastive Consistency Model, GCCM),其核心创新是引入负样本对到对比一致性目标中,增强了不同噪声水平下目标表示之间的区分性要求,使捷径解不再满足目标函数;同时,通过对输入节点/边特征施加扰动,打破相同条件下的预测一致性,进一步削弱捷径解的吸引力。实验表明,GCCM有效缓解了该问题,并在多个基准数据集上实现了优于确定性预测器的一致性能提升。

链接: https://arxiv.org/abs/2605.05689
作者: Shaozhen Ma,Wei Huang,Hanchen Wang,Dong Wen,Wenjie Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Conditional generative models, particularly diffusion-based methods, have recently been applied to graph prediction by modeling the target as a conditional distribution given the input graph, yielding competitive results compared to deterministic predictor. However, existing diffusion-based prediction methods typically require expensive iterative denoising at inference and often suffer from unstable sampling, which motivates recent efforts to reduce inference denoising steps and enable stable sampling via techniques such as consistency training. Despite this progress, we find that existing consistency training methods for graph prediction could potentially fall into a shortcut solution: the model may attempt to satisfy the self-consistency constraint by ignoring the noisy target (i.e., assigning it negligible weight), ultimately collapsing into a purely deterministic predictor. To mitigate such shortcut solution, we propose GCCM, a graph contrastive consistency model that goes beyond isolated pairwise matching between the same target at different noise levels by introducing negative pairs into a contrastive consistency objective. This adds an additional separation requirement, making the shortcut solution no longer trivially sufficient to satisfy the proposed objective. Moreover, we apply feature perturbation to the input node/edge features to break identical conditioning on the input graph, so that the shortcut no longer yields the same predictions across noise levels and becomes less attractive. Extensive experiments on benchmark datasets demonstrate that GCCM mitigates the shortcut solution and yields consistent performance improvements in graph prediction compared to deterministic predictors.
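GCCM 在“同一目标在不同噪声水平下的预测保持一致”的基础上引入负样本分离项,使退化为常数预测器的捷径解不再最优。一个示意性的损失草图如下(margin 与具体形式为假设,并非论文目标函数):

```python
import numpy as np

def contrastive_consistency_loss(f_t1, f_t2, negatives, margin=1.0):
    """Pull the same target's predictions at two noise levels together,
    and push negative targets at least `margin` away from f_t1. The
    separation term makes a constant (noise-ignoring) predictor suboptimal."""
    pull = np.mean((f_t1 - f_t2) ** 2)
    d = np.linalg.norm(negatives - f_t1[None, :], axis=1)
    push = np.mean(np.maximum(0.0, margin - d) ** 2)
    return pull + push

f = np.ones(4)
neg_far = np.full((3, 4), 10.0)      # well-separated negatives
neg_collapsed = np.tile(f, (3, 1))   # negatives collapsed onto the prediction
loss_good = contrastive_consistency_loss(f, f, neg_far)
loss_bad = contrastive_consistency_loss(f, f, neg_collapsed)
```

若模型把所有目标都预测到同一点(捷径解),分离项会被触发,损失无法降到零。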

[AI-151] DataDignity: Training Data Attribution for Large Language Models

【速读】:该论文旨在解决生成式 AI (Generative AI) 输出中知识来源的精准定位问题,即在给定提示(prompt)和目标模型响应的情况下,从候选文档集合中准确排序最可能支持该响应的源文档,这一任务被称为“精确定位溯源”(pinpoint provenance)。其核心挑战在于区分真实支持性证据与仅具主题或词汇相似性的干扰项。解决方案的关键在于构建了一个受控基准 FakeWiki,包含多种复杂场景(如反文档、对抗性提示变换等),并提出两种方法:一是监督对比学习模型 ScoringModel,通过 InfoNCE 损失将响应与文档映射至共享语义空间,并利用批次内负样本、检索挖掘负样本及反文档负样本进行训练;二是无需训练的激活空间引导融合方法 SteerFuse,利用模型中间层激活特征增强文本检索效果。实验表明,ScoringModel 在九种指令微调大语言模型上平均 Recall@10 提升至 52.2,显著优于最强基线(35.0),且在对抗性提示下仍保持鲁棒性,证明了结构化训练数据标注与多样化评估设置对实现可靠溯源的重要性。

链接: https://arxiv.org/abs/2605.05687
作者: Xiaomin Li,Andrzej Banburski-Fahey,Jaron Lanier
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a supervised contrastive provenance ranker, ScoringModel. ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41/45 model-by-condition cells. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.
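ScoringModel 使用的 InfoNCE 批内负样本目标可以用如下 numpy 草图示意:同一批内第 i 个响应以第 i 个文档为正样本、其余行为负样本(温度与归一化方式为常见做法,此处是假设性示例,非论文实现):

```python
import numpy as np

def info_nce(resp_emb, doc_emb, temperature=0.1):
    """InfoNCE with in-batch negatives: row i of doc_emb is the positive
    for row i of resp_emb; every other row serves as a negative."""
    r = resp_emb / np.linalg.norm(resp_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = (r @ d.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
docs = rng.standard_normal((8, 32))
aligned = docs + 0.01 * rng.standard_normal((8, 32))     # correctly paired responses
mispaired = np.roll(docs, 1, axis=0)                     # every pair is wrong
loss_aligned = info_nce(aligned, docs)
loss_mispaired = info_nce(mispaired, docs)
```

论文在此基础上进一步加入检索挖掘负样本与反文档负样本,以削弱主题或词汇相似带来的捷径。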

[AI-152] Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成过程中出现的两类机制上截然不同的失败模式——冲突(conflict)与幻觉(hallucination),这两类问题均会导致模型输出高置信度错误结果,使得基于输出熵的监测方法失效。其核心解决方案在于提出一个统一的几何解释框架:在自回归生成的隐藏状态空间中,已学习的事实形成吸引子盆地(attractor basins),冲突表现为不同吸引子盆地之间的竞争,而幻觉则源于目标事实对应的吸引子盆地不存在;通过计算隐藏状态到最近记忆盆地的几何距离(几何裕度,geometric margin),可有效区分正确回忆与幻觉,且无需依赖输出熵,从而实现零误拒(zero false refusals)的准确检测。这一机制揭示了冻结的输出头(frozen LM head)会系统性地抹除隐状态中的认知状态(epistemic state),且该现象随模型规模扩大而加剧。

链接: https://arxiv.org/abs/2605.05686
作者: Qiyao Liang,Risto Miikkulainen,Ila Fiete
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures, plus appendices

点击查看摘要

Abstract:Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes–conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task–entity identifiers mapped to unique codes with PM installed via LoRA adapters–where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin–the hidden state’s distance to the nearest memorized basin–reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar{\Delta})$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it–and this erasure worsens with scale.
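摘要中的“几何裕度”即隐藏状态到最近记忆盆地中心的距离。下面的玩具示例展示该量如何区分“落入盆地的正确回忆”与“无盆地可依的幻觉漂移”(盆地以合成聚类中心近似,仅为示意):

```python
import numpy as np

def geometric_margin(hidden, centroids):
    """Distance from a hidden state to the nearest memorized basin centroid.
    Small margin: the state converged into a learned basin (recall).
    Large margin: free drift with no matching basin (hallucination)."""
    return np.linalg.norm(centroids - hidden, axis=1).min()

rng = np.random.default_rng(0)
centroids = 5.0 * rng.standard_normal((10, 64))       # synthetic "learned facts"
recall_state = centroids[3] + 0.05 * rng.standard_normal(64)
halluc_state = 5.0 * rng.standard_normal(64)          # unrelated drifting state
m_recall = geometric_margin(recall_state, centroids)
m_halluc = geometric_margin(halluc_state, centroids)
```

由于该度量直接作用于隐藏状态而非输出分布,即使输出头对两种情形都给出高置信度,裕度仍能将它们分开。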

[AI-153] Temporal Functional Circuits: From Spline Plots to Faithful Explanations in KAN Forecasting NEURIPS2026

【速读】:该论文旨在解决时间序列预测中模型可解释性不足的问题,特别是传统多层感知机(MLP)缺乏机制性解释能力的局限。其解决方案的关键在于提出Temporal Functional Circuits框架,通过构建一个门控残差Kolmogorov-Arnold Network(KAN),将预测分解为线性基项与稀疏激活的KAN修正项;进而利用输出感知归因法映射每个边到输入滞后,并基于学习到的激活范围对边进行排序,最终通过边级干预(如置零和样条移除)验证解释的忠实性。该方法不仅提升了预测性能(如在制度切换信号上比纯线性模型降低59%均方误差),还提供了MLP无法实现的可解释边缘函数。

链接: https://arxiv.org/abs/2605.05685
作者: Naveen Mysore
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 9 pages, 4 figures, 6 tables, plus appendix. Under review at NeurIPS 2026

点击查看摘要

Abstract:Unlike MLPs, Kolmogorov-Arnold Networks (KANs) expose explicit learnable edge functions on every connection, enabling mechanistic explanation in time-series forecasting. This paper introduces Temporal Functional Circuits, a framework that transforms KAN edge functions from latent visualizations into faithful, temporally grounded explanations. Built on a gated residual KAN that decomposes forecasts into a linear base and a sparsely activated KAN correction, the framework (i) maps each edge to input lags via output-aware attribution, (ii) ranks edges by learned activation range, and (iii) validates faithfulness through edge-level interventions including zeroing and spline removal. Removing the learned B-spline component while retaining the base SiLU term degrades forecasts, providing evidence that the spline shape itself carries predictive value beyond the base activation. On four synthetic regimes of increasing complexity, the learned gate opens progressively wider as signal complexity grows. On regime-switching signals, gated KAN achieves 59% lower MSE than linear-only models. Across eight benchmarks, the gated architecture is competitive with linear, attention, and MLP alternatives, while providing interpretable edge functions that MLP-based corrections cannot offer.
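KAN 边函数“基础 SiLU 项 + 可学习样条修正项”的分解,以及“移除样条项会使拟合退化”的消融,可以用如下草图复现(此处用高斯 RBF 基近似替代 B 样条基,仅为示意):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def kan_edge(x, w_base, w_spline, centers, width=0.5):
    """One KAN-style edge function: a base SiLU term plus a learned
    correction over local basis functions (Gaussian bumps stand in for
    the B-spline basis here)."""
    basis = np.exp(-((x[:, None] - centers[None, :]) / width) ** 2)
    return w_base * silu(x) + basis @ w_spline

# fit the correction weights to a wiggly target by least squares
x = np.linspace(-2.0, 2.0, 200)
target = np.sin(3.0 * x)
centers = np.linspace(-2.0, 2.0, 12)
basis = np.exp(-((x[:, None] - centers[None, :]) / 0.5) ** 2)
w_spline, *_ = np.linalg.lstsq(basis, target - silu(x), rcond=None)

full = kan_edge(x, 1.0, w_spline, centers)
ablated = kan_edge(x, 1.0, np.zeros_like(w_spline), centers)   # spline removed
err_full = float(np.mean((full - target) ** 2))
err_ablated = float(np.mean((ablated - target) ** 2))
```

消融后误差上升,对应摘要中“样条形状本身携带超出基础激活的预测价值”的证据。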

[AI-154] Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在输出最终答案时看似安全,但其内部推理过程(chain-of-thought-like reasoning)中可能包含有害或违反政策内容的安全盲区问题。传统方法仅以最终答案作为安全评估依据,忽略了推理轨迹中的潜在风险,如“泄露案例”(leak cases)——即不安全的推理先于安全答案出现,以及“逃逸案例”(escape cases)——即良性推理后接不安全最终响应。解决方案的关键在于提出一种白盒测试时缓解机制:自适应多原则引导(adaptive multi-principle steering),该机制为每个安全原则学习一个从不安全到安全的激活方向,并仅在当前隐藏状态更接近不安全中心时激活对应方向。此方法在三个可引导的开放推理模型上显著降低了推理轨迹和最终答案中的不安全数量,同时保持了高精度(如DeepSeek-R1-Qwen-7B在BBH、GSM8K和MMLU上平均不安全计数减少40.8%,且宏平均准确率维持在97.7%)。

链接: https://arxiv.org/abs/2605.05678
作者: Xiaomin Li,Jianheng Hou,Zheyuan Deng,Zhiwei Zhang,Taoran Li,Binghang Lu,Bing Hu,Yunhan Zhao,Yuexing Hao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in reasoning traces even when final answers appear safe. We test whether final-answer safety is a sufficient proxy for the full reasoning-answer trajectory by scoring both stages under a unified twenty-principle safety rubric. Using prompts from seven public harmfulness and jailbreak sources, plus four out-of-distribution (OOD) sources, we evaluate 15 open-weight and API-based LRMs across 41K prompts per model. Reasoning traces consistently reveal additional safety risks beyond final answers, especially in high-severity stage-wise failures: leak cases, where unsafe reasoning precedes a safe-looking answer, and escape cases, where benign-looking reasoning precedes an unsafe final response. Principle-level analysis shows that risk concentrates in misinformation, legal compliance, discrimination, physical harm, and psychological harm. We further propose adaptive multi-principle steering, a white-box test-time mitigation that learns one unsafe-to-safe activation direction per safety principle and activates only directions whose current hidden state is closer to the unsafe than safe centroid. On three steerable open reasoning models, adaptive steering reduces unsafe counts in both reasoning traces and final answers on held-out and OOD benchmarks. DeepSeek-R1-Qwen-7B achieves a 40.8% average unsafe-count reduction while retaining 97.7% macro-averaged accuracy on BBH, GSM8K, and MMLU. These results suggest that LRM safety should be evaluated and mitigated over the full exposed reasoning-answer trajectory, not only at the final-answer stage.
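自适应多原则引导的判定规则是:仅当隐藏状态更靠近某一原则的不安全中心时,才沿该原则的 unsafe→safe 方向平移。最小草图如下(中心、步长与维度均为合成示例):

```python
import numpy as np

def adaptive_steer(hidden, principles, alpha=1.0):
    """Apply a per-principle unsafe->safe direction only when the hidden
    state is closer to that principle's unsafe centroid than to its safe
    centroid; states already on the safe side are left untouched."""
    h = hidden.copy()
    for safe_c, unsafe_c in principles:
        if np.linalg.norm(h - unsafe_c) < np.linalg.norm(h - safe_c):
            direction = safe_c - unsafe_c
            h = h + alpha * direction / np.linalg.norm(direction)
    return h

rng = np.random.default_rng(0)
safe_c, unsafe_c = np.ones(16), -np.ones(16)          # synthetic centroids
principles = [(safe_c, unsafe_c)]
near_unsafe = unsafe_c + 0.1 * rng.standard_normal(16)
near_safe = safe_c + 0.1 * rng.standard_normal(16)
steered = adaptive_steer(near_unsafe, principles)
untouched = adaptive_steer(near_safe, principles)
```

按条件激活(而非无差别叠加所有方向)正是论文在保持任务准确率的同时降低不安全输出的关键设计。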

[AI-155] X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

【速读】:该论文旨在解决多语言零样本语音克隆(zero-shot voice cloning)中对文本提示依赖性强、需复杂预处理(如强制对齐)的问题。现有方法通常要求音频提示附带精确转录文本,限制了实际应用的灵活性和通用性。解决方案的关键在于提出一种两阶段训练范式:第一阶段通过标准条件流匹配训练获得初始模型 X-Voice_s1,并利用其合成 10K 小时说话人一致的音频片段作为提示;第二阶段在掩码提示文本的情况下微调模型,得到 X-Voice_s2,从而实现无需音频提示文本即可完成零样本语音克隆。此外,架构上引入双层语言标识符注入与无分类器引导(Classifier-Free Guidance)的解耦调度机制,显著提升了多语言语音合成性能,使模型在保持低参数量(0.4B)的同时达到与百亿规模模型相当的跨语言克隆效果。

链接: https://arxiv.org/abs/2605.05611
作者: Rixi Xu,Qingyu Liu,Haitao Li,Yushen Chen,Zhikang Niu,Yunting Yang,Jian Zhao,Ke Li,Berrak Sisman,Qinyuan Cheng,Xipeng Qiu,Kai Yu,Xie Chen
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 16 pages, 4 figures, 9 tables

点击查看摘要

Abstract:In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice_s1 through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice_s2, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.
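摘要中“无分类器引导(CFG)的解耦”的基本思想是为不同条件流设置独立的引导权重。下面是一个与具体模型无关的示意草图(并非 X-Voice 的精确公式,条件流的划分为假设):

```python
import numpy as np

def cfg_combine(uncond, cond_a, cond_b, w_a, w_b):
    """Decoupled classifier-free guidance: each condition stream gets its
    own guidance weight instead of a single shared scale."""
    return uncond + w_a * (cond_a - uncond) + w_b * (cond_b - uncond)

rng = np.random.default_rng(0)
uncond, cond_text, cond_lang = rng.standard_normal((3, 8))
guided = cfg_combine(uncond, cond_text, cond_lang, w_a=2.0, w_b=1.0)
```

解耦后可在采样过程中对各条件分别调度权重,例如对语言标识条件采用与文本条件不同的引导强度。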

[AI-156] Nearly Optimal Attention Coresets

【速读】:该论文旨在解决在有限空间内对注意力机制(Attention mechanism)进行高效估计的问题,核心挑战是如何构造一个紧凑的子集(即coreset),使得其在保持近似精度的同时显著减少存储和计算开销。解决方案的关键在于证明了对于任意一组单位范数的键(keys)和值(values)$(K, V)$,存在一个大小为 $O(\sqrt{d}\, e^{\rho+o(\rho)}/\varepsilon)$ 的子集 $(K', V')$,能够以误差不超过 $\varepsilon$ 的精度近似原注意力输出,且该近似对所有查询向量 $q$(其范数有界于 $\rho$)均成立。这一结果在理论上优于此前最佳的上界,并通过改进的下界分析表明所获上界接近最优。

链接: https://arxiv.org/abs/2605.05602
作者: Edo Liberty,Alexandr Andoni,Eldar Kleiner
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K, V)$ in $\mathbb{R}^d$, there exists a subset $(K', V')$ of size at most $O(\sqrt{d}\, e^{\rho+o(\rho)}/\varepsilon)$ such that $\left\| \mathrm{Attn}(q, K, V) - \mathrm{Attn}(q, K', V') \right\| \le \varepsilon$ simultaneously for all queries whose norm is bounded by $\rho$. This outperforms the best known results for this problem. We also offer an improved lower bound showing that $\varepsilon$-coresets must have size $\Omega(\sqrt{d}\, e^{\rho}/\varepsilon)$.
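为直观理解被界定的量 ‖Attn(q,K,V) − Attn(q,K′,V′)‖,下面用随机子集(而非论文的 coreset 构造)在合成数据上计算该误差;由于注意力输出是单位范数 value 行的凸组合,其范数不超过 1,误差必然落在 [0, 2] 内:

```python
import numpy as np

def attn(q, K, V):
    """Softmax attention output for a single query vector."""
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
n, d = 2000, 16
K = rng.standard_normal((n, d))
K /= np.linalg.norm(K, axis=1, keepdims=True)   # unit-norm keys
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-norm values
q = rng.standard_normal(d)
q *= 2.0 / np.linalg.norm(q)                    # ||q|| = rho = 2

idx = rng.choice(n, size=400, replace=False)    # naive random subset, not the coreset
err = np.linalg.norm(attn(q, K, V) - attn(q, K[idx], V[idx]))
```

论文证明的是:存在尺寸约 $O(\sqrt{d}\, e^{\rho+o(\rho)}/\varepsilon)$ 的子集使该误差对所有范数受限的查询同时不超过 $\varepsilon$,而非对单个随机子集的经验观察。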

[AI-157] Causal Probing for Internal Visual Representations in Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中视觉概念编码与具身化机制不明确的问题,尤其是不同类别视觉概念(如实体与抽象概念)在模型内部表征方式的差异及其对模型扩展规律的影响。其解决方案的关键在于提出一种基于激活操纵(activation steering)的因果分析框架,通过系统性干预四类视觉概念,揭示了实体类概念呈现局部化记忆特征,而抽象概念则表现为全局分布式的网络激活模式;进一步发现模型深度增加对于抽象概念的分布式编码至关重要,而实体编码则对规模变化具有鲁棒性;此外,反向操纵实验揭示了感知与生成之间的补偿机制,即屏蔽显式输出会引发潜在激活激增,从而为理解MLLMs内部运作提供了可解释且可干预的分析路径。

链接: https://arxiv.org/abs/2605.05593
作者: Zehao Deng,Tianjie Ju,Zheng Wu,Liangbo He,Jun Lan,Huijia Zhu,Weiqiang Wang,Zhuosheng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the remarkable success of Multimodal Large Language Models (MLLMs) across diverse tasks, the internal mechanisms governing how they encode and ground distinct visual concepts remain poorly understood. To bridge this gap, we propose a causal framework based on activation steering to actively probe and manipulate internal visual representations. Through systematic intervention across four visual concept categories, our results reveal a divergence in concept encoding: entities exhibit distinct localized memorization, whereas abstract concepts are globally distributed across the network. Critically, this divergence uncovers a mechanistic driver of scaling laws: increasing model depth is indispensable for encoding distributed and complex abstract concepts, whereas entity localization remains remarkably invariant to scale. Furthermore, reverse steering uncovers that blocking explicit output triggers a surge in latent activations, exposing a compensatory mechanism between perception and generation. Finally, extending our analysis to visual reasoning, we expose a disconnect between perception and reasoning although MLLMs successfully recognize geometric relations, they treat them merely as static visual features, failing to trigger the procedural execution necessary for abstract problem-solving.

[AI-158] AlphaCrafter: A Full-Stack Multi-Agent Framework for Cross-Sectional Quantitative Trading NEURIPS2026

【速读】:该论文旨在解决量化交易中因子发现、市场状态适应性选择与风险约束执行三者割裂的问题,即现有方法通常在静态或孤立假设下优化各环节,导致策略在非平稳金融市场中难以持续盈利。其解决方案的关键在于提出一个全栈式多智能体框架AlphaCrafter,通过三个专业化智能体构成闭环自适应流水线:Miner利用大语言模型(Large Language Model, LLM)持续扩展因子池,Screener基于当前市场状态构建条件因子组合,Trader则在显式风险约束下将因子组合转化为可执行策略。该设计实现了从因子挖掘到交易执行的端到端动态适配,显著提升了风险调整后收益的稳定性与一致性。

链接: https://arxiv.org/abs/2605.05580
作者: Yishuo Yuan,Jiayi Sheng,Sirui Zeng,Jiaqi Wang,Jiaheng Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to NeurIPS 2026. 26 pages, 8 figures,

点击查看摘要

Abstract:Financial markets are inherently non-stationary, driven by complex interactions among macroeconomic regimes, microstructural frictions, and behavioral dynamics. Building quantitative strategies that remain profitable demands the continuous coupling of factor discovery, regime-adaptive selection, and risk-constrained execution. Prevailing approaches, however, optimize these components under static or isolated assumptions. Factor mining frameworks typically treat alpha discovery as a one-time search process, implicitly assuming that factor efficacy persists across market regimes. Execution-oriented systems often adopt role-playing agent architectures that simulate anthropomorphic trading committees, introducing behavioral noise rather than systematic rationality. Consequently, a fully automated, rationality-driven framework unifying a coherent quantitative pipeline remains absent. We introduce AlphaCrafter, a full-stack multi-agent framework that closes this gap through a continuously adaptive factor-to-execution pipeline, designed to track and respond to evolving market conditions without manual intervention. AlphaCrafter operates via three specialized agents: a Miner that continuously expands the factor pool via LLM-guided search, a Screener that assesses prevailing market conditions to construct regime-conditioned factor ensembles, and a Trader that translates these ensembles into quantitative strategies under explicit risk constraints. Together, these three agents form a closed-loop cross-sectional trading system that adapts holistically to evolving market dynamics. Extensive experiments on CSI 300 and S&P 500 demonstrate that AlphaCrafter consistently outperforms state-of-the-art baselines in risk-adjusted returns while exhibiting the lowest cross-trial variance, confirming that integrated and adaptive factor-to-execution design yields robust trading performance.

[AI-159] Accelerating LMO-Based Optimization via Implicit Gradient Transport

【速读】:该论文旨在解决当前基于线性最小化Oracle(Linear Minimization Oracle, LMO)的优化方法在理论分析上碎片化(分别针对无约束与约束问题)以及在实践中的收敛速度受限问题,尤其是现有方差减少技术虽能加速但常伴随显著计算开销。其核心解决方案是提出一种新的随机LMO类方法——LMO-IGT,该方法通过引入隐式梯度传输(Implicit Gradient Transport, IGT)机制,在仅使用单个随机梯度每迭代步的前提下,实现了对标准随机LMO方法的加速。具体而言,LMO-IGT在运输点处评估梯度,从而有效降低方差并提升收敛速率,理论迭代复杂度达到 $\mathcal{O}(\varepsilon^{-3.5})$,优于传统随机LMO的 $\mathcal{O}(\varepsilon^{-4})$,并接近方差减少型LMO的 $\mathcal{O}(\varepsilon^{-3})$(后者需额外梯度计算)。此外,作者构建了一个统一框架,并引入正则化支持函数(Regularized Support Function, RSF)作为新的平稳性度量,弥合了梯度范数与Frank–Wolfe间隙之间的理论鸿沟。

链接: https://arxiv.org/abs/2605.05577
作者: Won-Jun Jang,Si-Hyeon Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent optimizers such as Lion and Muon have demonstrated strong empirical performance by normalizing gradient momentum via linear minimization oracles (LMOs). While variance reduction has been explored to accelerate LMO-based methods, it typically incurs substantial computational overhead due to additional gradient evaluations. At the same time, the theoretical understanding of LMO-based methods remains fragmented across unconstrained and constrained formulations. Motivated by these limitations, we propose LMO-IGT, a new class of stochastic LMO-based methods leveraging implicit gradient transport (IGT). We further introduce a unified framework for stochastic LMO-based optimization together with a new stationarity measure, the regularized support function (RSF), which bridges gradient-norm and Frank–Wolfe-gap notions within a common framework. By evaluating stochastic gradients at transported points, LMO-IGT accelerates convergence while retaining the single-gradient-per-iteration structure of standard stochastic LMO. Our analysis establishes that stochastic LMO achieves an iteration complexity of $\mathcal{O}(\varepsilon^{-4})$, variance-reduced LMO achieves $\mathcal{O}(\varepsilon^{-3})$ at the cost of additional gradient evaluations, and LMO-IGT achieves $\mathcal{O}(\varepsilon^{-3.5})$ using only a single stochastic gradient per iteration. Empirically, LMO-IGT consistently improves over stochastic LMO counterparts with negligible overhead. Among its instantiations, Muon-IGT achieves the strongest overall performance across evaluated settings, demonstrating that IGT provides an effective and practical acceleration mechanism for modern LMO-based optimization.
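LMO-IGT 的“在传输点处评估梯度 + 对动量取 LMO 步”可以用 l∞ 球的符号 LMO(即 Lion 式的 sign 步)写成如下草图(传输系数取 β/(1−β) 是 IGT 文献中的常见选择;整体仅为假设性示意,并非论文算法的精确形式):

```python
import numpy as np

def lmo_sign(g):
    """LMO for the unit l_inf ball: argmin over ||s||_inf <= 1 of <g, s>."""
    return -np.sign(g)

def lmo_igt(grad_fn, x0, steps=300, lr=0.02, beta=0.9):
    """Stochastic LMO with implicit gradient transport: the gradient is
    evaluated at an extrapolated (transported) point, then an LMO step is
    taken against the momentum buffer -- one gradient per iteration."""
    x, x_prev = x0.copy(), x0.copy()
    m = np.zeros_like(x0)
    for _ in range(steps):
        shift = (beta / (1.0 - beta)) * (x - x_prev)  # implicit transport
        g = grad_fn(x + shift)                        # single gradient per step
        m = beta * m + (1.0 - beta) * g
        x_prev = x.copy()
        x = x + lr * lmo_sign(m)
    return x

# minimize ||x - target||^2 from noisy gradients
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])
noisy_grad = lambda x: 2.0 * (x - target) + 0.1 * rng.standard_normal(3)
x_final = lmo_igt(noisy_grad, np.zeros(3))
```

与方差减少方法不同,这里每步只需一次梯度评估,加速来自在外推点取梯度对动量偏差的隐式修正。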

[AI-160] Locality-aware Private Class Identification for Domain Adaptation with Extreme Label Shift

【速读】:该论文旨在解决域适应(domain adaptation)中因源域与目标域标签空间存在包含关系而引入的私有类(private classes)所导致的跨域分类困难问题。私有类指仅存在于一个域中的类别,其样本若未被有效识别和处理,会显著增加分类误差并干扰知识迁移过程。传统方法假设私有类样本在分布上差异明显可作为异常值处理,但该假设在实际中常不成立——因为单个共享类内部方差可能远大于私有类与共享类之间的差异,从而削弱了对私有类的有效区分能力。为此,作者提出一种基于最优传输(optimal transport, OT)局部运输质量和度量性质的局部感知私有类识别方法,通过设计一个运输质量得分函数来精准区分共享类与私有类样本;在此基础上进一步构建可靠的OT方法(ReOT),在学习共享类与私有类分离聚类结构的同时最小化分类风险,避免共享-私有类样本对间的错误匹配,确保类内知识可靠迁移以缓解类条件分布差异。理论分析表明,在极端标签偏移场景下,ReOT能够提供目标风险的泛化上界,并可通过优化实现最小化。

链接: https://arxiv.org/abs/2605.05567
作者: Chuan-Xian Ren,Cheng-Jun Guo,Hong Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain with different distributions. In real-world scenarios, the label spaces of the two domains often have an inclusion relationship, where some classes exist only in one domain but not the other. These non-overlapping classes are referred to as private classes. Identifying private class samples and mitigating their adverse effects is critical in the literature. Existing methods rely on the assumption that shifts in private classes are large enough to be considered outliers. However, the variance within a single shared class can be significantly larger than the difference between a private class and another shared class, challenging this assumption. Consequently, private classes substantially increase the difficulty of cross-domain classification. To address these issues, based on local transportation and metric properties of optimal transport (OT), a locality-aware private class identification approach is proposed in the form of a score function on transport mass. The effectiveness of the proposed approach is theoretically proven, highlighting the score function’s strong ability to distinguish between shared and private class samples. Building on this, we introduce a reliable OT-based method (ReOT) for domain adaptation under severe label shift. ReOT minimizes classification risk while learning the separated cluster structure between the identified shared classes and private classes, effectively avoiding mismatch between shared-private sample pairs, thus ensuring that important knowledge is reliably transported intra-class to mitigate class-conditional discrepancy. Furthermore, a generalization upper bound of the target risk is provided for extreme label shift scenarios, which can be minimized by ReOT. Extensive experiments on benchmarks validate the effectiveness of ReOT.

[AI-161] BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models

【速读】:该论文旨在解决后训练量化(post-training quantization)在低精度推理(如4-bit)下对自适应测试时计算分配(adaptive test-time compute allocation)造成的干扰问题,特别是因置信度校准失准导致的过早终止(premature halting)现象。其关键解决方案是提出BitCal-TTS,一种轻量级运行时控制器,包含三个核心机制:(i) 基于token级不确定性与推理轨迹稳定性的低成本在线代理指标;(ii) 依赖比特位数的置信度重缩放策略,在低精度下保持保守性;(iii) 针对GSM8K类结构化输出设计的比特感知后标记确认窗口(post-marker confirmation horizon)。该方法无需微调基础模型,可无缝集成至Hugging Face标准4-bit推理流程中,实验证明在有限token预算下显著提升准确率并降低过早终止率。

链接: https://arxiv.org/abs/2605.05561
作者: Sai Babu Patarlapalli,Surya Teja Avvaru
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures, 4 tables. Code and reproducibility materials at this https URL

点击查看摘要

Abstract:Post-training quantization makes large reasoning models practical under tight memory and latency budgets, but it can distort the online signals that drive adaptive test-time compute allocation. Under a fixed cap on the number of newly generated tokens, miscalibrated confidence can lead to harmful early halting: the model may surface a plausible final line while the underlying reasoning is still wrong, or the controller may stop before the trace has stabilized. We study this interaction for greedy 4-bit inference and propose BitCal-TTS, a lightweight runtime controller that combines (i) inexpensive online proxies for token-level uncertainty and reasoning-trace stability, (ii) a bit-conditioned confidence rescaling that is conservative at low nominal precision, and (iii) a bit-aware post-marker confirmation horizon designed for GSM8K-style structured outputs. The method requires no fine-tuning of the base model and integrates with standard Hugging Face 4-bit inference using forward hooks for logits and last-layer hidden states. On small evaluation shards of GSM8K with Qwen2.5 Instruct models, BitCal-TTS improves exact-match accuracy over a non-bit-aware adaptive baseline at the 7B and 14B scales while preserving substantial token savings relative to fixed-budget decoding. At a token cap of B=512, on the evaluation shards we report (N=54 for 7B and N=35 for 14B; not the full GSM8K test set), accuracy gains are +3.7 points (7B) and +2.8 points (14B), with the premature-stop rate falling from 14.8% to 11.1% on 7B and from 17.1% to 11.4% on 14B. We report Wilson 95% confidence intervals throughout and explicitly discuss the limited statistical power of the partial-shard comparisons. We release code and figure-generation scripts to support full reproduction.
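摘要中的 (ii) 和 (iii) 两个组件可以用一个极简的控制器草图来说明。以下为示意性实现(并非论文的实际规则;`alpha`、线性收缩形式与确认窗口长度均为假设的占位参数):低比特精度下将置信度向 0.5 收缩,并要求更长的确认窗口后才允许提前终止。

```python
def rescaled_confidence(raw_conf, bits, full_bits=16, alpha=0.15):
    """Bit-conditioned confidence rescaling (illustrative linear form, not
    the paper's exact rule): shrink confidence toward 0.5 as the nominal
    precision drops, so 4-bit models halt more conservatively."""
    shrink = alpha * max(0, full_bits - bits) / full_bits
    return 0.5 + (raw_conf - 0.5) * (1.0 - shrink)

def should_halt(confidences, bits, threshold=0.9, base_confirm=2):
    """Halt only once the rescaled confidence has stayed above the threshold
    for a bit-aware confirmation horizon (longer at lower precision),
    mirroring the post-marker confirmation window for structured answers."""
    confirm = base_confirm + (2 if bits <= 4 else 0)
    if len(confidences) < confirm:
        return False
    return all(rescaled_confidence(c, bits) >= threshold
               for c in confidences[-confirm:])
```

在该草图下,4-bit 模型即使连续报告 0.99 的原始置信度,也需要比 16-bit 模型多等两步才会终止,从而直接降低过早终止率。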

[AI-162] Who Prices Cognitive Labor in the Age of Agents? A Position on Compute-Anchored Wages

【速读】:该论文旨在解决生成式 AI(Generative AI)对认知型劳动力市场工资影响的经济学机制误判问题。传统观点认为,由于AI代理可近乎零边际成本复制,其供给无限弹性,从而压低人类认知劳动工资至零;但作者指出这一框架在机制上存在错误,尽管结论部分成立。解决方案的关键在于重新定义AI代理的本质:它们并非劳动力,而是将计算资本(compute capital, $K_c$)转化为有效认知劳动单位($L_A$)的生产技术。由此,均衡工资的锚定点从劳动力市场迁移至计算资本市场,并提出"计算锚定工资"(Compute-Anchored Wage, CAW)边界,即在人机认知劳动替代任务中,竞争性人类工资上限为 $\lambda \cdot k \cdot r_c$,其中 $r_c$ 为计算资本租金率、$k$ 为单个有效代理劳动单位的计算强度、$\lambda$ 为人机生产力相对比。此修正不仅重塑理论基础,也对技能偏向型技术变迁方向与要素收益分配产生深远政策含义。

链接: https://arxiv.org/abs/2605.05558
作者: Siqi Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:A natural intuition about the economics of AI agents is that, because agents can be replicated at near-zero marginal cost, they constitute a labor input in infinitely elastic supply, and therefore drive cognitive-labor wages to zero. We argue this framing is wrong in mechanism but partially correct in conclusion, and that the correction matters for both theory and policy. **Agents are not labor; they are a production technology that converts compute capital $K_c$ into effective units of cognitive labor $L_A$.** Once this is recognized, the elastic-supply margin that anchors the equilibrium wage migrates from the labor market to the compute capital market. Building on the textbook factor-pricing framework (Mankiw, 2020), we derive a *Compute-Anchored Wage* (CAW) bound stating that, on tasks where human and agent cognitive labor are substitutes, the competitive human wage is bounded above by $\lambda \cdot k \cdot r_c$, where $r_c$ is the rental rate of compute capital, $k$ is the compute intensity of one effective agent-labor unit, and $\lambda$ is the relative human-to-agent productivity. We generalize the result through CES aggregation, separate substitutable from complementary tasks (yielding a directional inversion of skill-biased technical change), and discuss factor-share consequences. The position is concise: *the price-setter for cognitive labor is no longer the labor market*.
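CAW 边界本身只是一个三因子乘积,可以直接算一遍。以下为示意性计算(数值纯属虚构,仅用于说明量纲:计算租金按美元/GPU-小时计,$k$ 为复现一小时人类认知产出所需的 GPU-小时数):

```python
def caw_bound(r_c, k, lam):
    """Compute-Anchored Wage bound: on substitutable cognitive tasks the
    competitive human wage is capped at lambda * k * r_c.
      r_c : rental rate of compute capital (e.g. $ per GPU-hour)
      k   : compute intensity of one effective agent-labor unit
            (GPU-hours per human-hour-equivalent of output)
      lam : relative human-to-agent productivity
    """
    return lam * k * r_c

# Illustrative, made-up numbers: compute rents at $2/GPU-hour, an agent
# needs 5 GPU-hours to match one human-hour, humans are 3x as productive
# on the task. The wage ceiling is then 3 * 5 * 2 = $30/hour.
ceiling = caw_bound(r_c=2.0, k=5.0, lam=3.0)
print(ceiling)
```

注意该边界随三个因子单调变化:计算租金下降(例如硬件降价)会直接压低工资上限,这正是论文"工资锚迁移到计算资本市场"的含义。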

[AI-163] SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs

【速读】:该论文旨在解决将自对弈强化学习(self-play reinforcement learning)扩展至科学文献场景时所面临的两大挑战:一是跨文档的多模态元素间关系难以显式建模,导致难以自动构建用于关系推理的问题;二是缺乏可验证的奖励信号,影响训练稳定性与效果。其解决方案的关键在于提出SPARK框架,通过从多文档科学文献中自动构建统一的知识图谱(Knowledge Graph, KG),以结构化方式表征文献中的事实和关系。KG路径作为生成多跳关系推理问题的来源,而KG中存储的结构化事实则为奖励计算提供可验证依据,从而在信息不对称条件下实现小规模视觉-语言模型(sVLM)在提议者(Proposer)与求解者(Solver)角色间的交替训练,显著提升了多跳推理能力,尤其在高跳数任务中表现更优。

链接: https://arxiv.org/abs/2605.05546
作者: Hyobin Park,Taeseop Kim,Dong-Geol Choi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-play reinforcement learning has shown strong performance in domains with formally verifiable structure, such as mathematics and coding, where both problem generation and reward computation can be grounded in explicit rules. Extending this paradigm to scientific literature is more challenging: the relationships among multi-modal elements within and across documents are rarely made explicit in text, which makes automatic generation of relational reasoning questions difficult and weakens the reliability of reward signals. We propose SPARK (Self-Play with Asymmetric Reward from Knowledge Graphs), a framework that automatically constructs a unified knowledge graph (KG) from multi-document scientific literature and uses it as the structural basis for self-play. KG paths over multimodal nodes serve as a source for generating relational reasoning questions, and structured facts stored in the KG provide a basis for verifiable reward computation. A single small vision-language model (sVLM) alternates between Proposer and Solver roles under information asymmetry against a fixed KG, a design that we believe can be naturally extended toward online adaptation in future work. We evaluate SPARK on public benchmarks and a self-constructed cross-document multi-hop QA dataset. Results show that SPARK consistently outperforms flat-corpus-based self-play baselines, and the performance gap widens as hop count increases, suggesting that KG-structure grounding contributes to relational multi-hop reasoning beyond what unstructured corpus grounding can provide.
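SPARK 的核心机制——从知识图谱采样多跳路径生成问题、并用图谱事实作可验证奖励——可以用几行代码勾勒出来。以下为极简草图(并非论文实现;图谱三元组与实体名均为虚构示例):

```python
import random

def sample_path(kg, start, hops, rng):
    """Random walk over (head, relation, tail) triples to get a k-hop path."""
    path, node = [], start
    for _ in range(hops):
        out = [(r, t) for (h, r, t) in kg if h == node]
        if not out:
            break
        r, t = rng.choice(out)
        path.append((node, r, t))
        node = t
    return path

def make_question(path):
    """Turn a KG path into a multi-hop relational question whose gold answer
    is the path's final node, so the KG itself supplies the reward signal."""
    head = path[0][0]
    rels = " -> ".join(r for (_, r, _) in path)
    question = f"Starting from '{head}', follow: {rels}. What entity do you reach?"
    return question, path[-1][2]

def verify(answer, gold):
    """Verifiable reward: the Solver is scored against the KG-grounded answer."""
    return 1.0 if answer.strip().lower() == gold.lower() else 0.0

# Toy biomedical KG (hypothetical triples).
kg = [("aspirin", "inhibits", "COX-1"), ("COX-1", "produces", "thromboxane")]
q, gold = make_question(sample_path(kg, "aspirin", 2, random.Random(0)))
```

Proposer 与 Solver 的信息不对称在这里体现为:只有 Proposer(及奖励函数)能看到图谱路径,Solver 只看到 `q`。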

[AI-164] Housing Potential Common Data Model and City Digital Twin

【速读】:该论文旨在解决住房潜力评估中因数据孤岛(data silos)导致的多源异构数据难以集成与互操作的问题,从而阻碍了城市规划者对住房需求和供给进行系统性分析。其解决方案的关键在于提出了一种标准化的数据模型——住房潜力通用数据模型(Housing Potential Common Data Model, HPCDM),该模型支持涵盖分区、土地利用、人口特征及服务可达性等多维度数据的整合与协同使用,并通过构建城市数字孪生(City Digital Twin)和试点仪表板应用验证了其实用性,同时识别出关键采纳障碍并提供可操作的缓解策略,为城市规划决策提供了技术支撑和实践路径。

链接: https://arxiv.org/abs/2605.05535
作者: Megan Katsumi,Mark Fox,Anderson Wong,Divnoor Chatha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The evaluation of housing potential requires consideration of a location from multiple perspectives, ranging from zoning and land use to population characteristics and access to services. This research introduces the Housing Potential Common Data Model (HPCDM) to overcome existing data silos, serving as a standard to support integration and interoperability across the diverse range of datasets that are required for housing potential analysis. This report details the evaluation of the model along with the creation of a City Digital Twin for housing and a pilot dashboard application to demonstrate a practical implementation. Beyond the technical framework, this work identifies critical barriers to adoption and provides actionable mitigation strategies for urban planners and stakeholders.

[AI-165] MOSAIC: Module Discovery via Sparse Additive Identifiable Causal Learning for Scientific Time Series

【速读】:该论文旨在解决科学时间序列中因果表示学习(Causal Representation Learning, CRL)的可解释性问题:尽管现有方法能保证潜在变量的可识别性(identifiability),但这些潜在变量通常缺乏语义解释,需依赖后验对齐已知真实因子,这在机制未知的科学场景中尤为受限。解决方案的关键在于提出MOSAIC(Module discovery via Sparse Additive Identifiable Causal learning),一种结合时序CRL可识别性与观测变量支持恢复的稀疏时序变分自编码器(Temporal VAE)。其核心创新是通过分段条件下的时序变化识别潜在变量,并利用加性解码器为每个潜在变量恢复一个稀疏的观测变量集合,从而实现模块级可解释性;理论证明了在一般光滑混合函数下ANOVA主效应支持的可识别性,并提供了可计算的稀疏-加性变体的有限样本恢复保证。

链接: https://arxiv.org/abs/2605.05524
作者: Shicheng Fan,Nour Elhendawy,Jianle Sun,Ke Fang,Kun Zhang,Yihang Wang,Lu Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Causal representation learning (CRL) seeks to recover latent variables with identifiability guarantees, typically up to permutation and component-wise reparameterization under appropriate assumptions. However, identifiability does not imply interpretability: latent semantics are typically assigned post hoc by alignment with known ground-truth factors. This limitation is particularly acute in scientific time series, where underlying mechanisms are unknown and discovering interpretable structure is a primary goal. In contrast, scientific observations (such as residue-pair distances, climate indices, or process sensors) are inherently semantic, as they correspond to named physical quantities. This raises a key question: can the interpretability of observations be transferred to the identifiable latent space? We propose MOSAIC (Module discovery via Sparse Additive Identifiable Causal learning), a sparse temporal VAE that integrates temporal CRL identifiability with support recovery over observed variables. MOSAIC identifies latent variables via regime-conditioned temporal variation, and recovers for each latent a sparse set of associated observations through an additive decoder, yielding module-level interpretability. We show that ANOVA main-effect supports are identifiable under general smooth mixing functions, and provide finite-sample recovery guarantees for a tractable sparse-additive variant. Empirically, MOSAIC recovers domain-consistent variable groups across RNA molecular dynamics, solar wind, ENSO climate, the Tennessee Eastman process, and a synthetic tokamak benchmark, enabling interpretable discovery of latent mechanisms in scientific time series.
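摘要中"加性解码器 + 稀疏支持恢复"的模块级可解释性,可以用一个玩具解码器说明。以下为示意性草图(并非 MOSAIC 的实现;变量名与掩码均为虚构示例,真实系统中掩码由稀疏正则学习得到):

```python
def additive_decode(z, components, mask):
    """Sparse additive decoder (sketch): observed variable j is reconstructed
    as the sum of per-latent contributions components[i][j](z_i), but only
    where mask[i][j] == 1. The mask is the recovered support: it says which
    named observations each latent explains."""
    d_obs = len(mask[0])
    x_hat = [0.0] * d_obs
    for i, z_i in enumerate(z):
        for j in range(d_obs):
            if mask[i][j]:
                x_hat[j] += components[i][j](z_i)
    return x_hat

def module_of(i, mask, var_names):
    """Read latent i's interpretable module straight off its support."""
    return [var_names[j] for j, on in enumerate(mask[i]) if on]

# Toy example: latent 0 explains two residue-pair distances, latent 1 a
# climate index (all names hypothetical).
mask = [[1, 1, 0], [0, 0, 1]]
comps = [[lambda z: 2.0 * z] * 3, [lambda z: -z] * 3]
names = ["dist_AB", "dist_AC", "enso_index"]
```

由于观测变量本身有语义(命名的物理量),`module_of` 直接把可识别的潜变量翻译成"它解释哪些观测量"的模块描述。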

[AI-166] When Semantic Communication Meets Queueing: Cross-Layer Latency and Task Fidelity Optimization

【速读】:该论文旨在解决无线信道中语义图像传输的效率与实时性矛盾问题,即如何在有限的频谱资源下实现高保真度的语义信息传递并降低传输延迟。其核心解决方案是提出一种多任务语义自编码器(multi-task semantic autoencoder),通过控制潜在空间维度(latent dimension,即每源样本的复数信道使用次数)作为跨层调控变量,在语义保真度和信道资源消耗之间建立权衡关系。在此基础上,设计了基于队列感知的漂移-惩罚策略和年龄感知策略,动态调整语义速率以满足长期语义误差约束,从而在保障任务性能的同时显著降低平均等待时延(queueing delay)和信息新鲜度指标(Age of Information, AoI),相比固定速率方案提升了系统整体的频谱利用率与时效性。

链接: https://arxiv.org/abs/2605.05514
作者: Yalin E. Sagduyu,Tugba Erpek
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Semantic communication (SemCom) with learned encoder-decoder architectures enables end-to-end learning of compact task-oriented representations optimized for the wireless channel, reducing channel resources needed to convey task-relevant information and improving spectrum efficiency. This paper studies semantic image transmission over block Rayleigh fading with AWGN using a multi-task semantic autoencoder that jointly reconstructs images and predicts labels from the received waveform. The latent dimension (complex channel uses per source sample) serves as a cross-layer control variable governing semantic fidelity and channel resource usage. We characterize the resulting latency-task fidelity tradeoff: larger latent representations improve inference accuracy but increase service time, channel uses, and queueing delay. Building on this insight, we develop online semantic-rate controllers that adapt the latent dimension per update under a long-term semantic error constraint. A queue-aware drift-plus-penalty policy minimizes delay subject to an average semantic error cap, while a complementary age-aware policy minimizes time-average Age of Information (AoI). By adapting the semantic rate to congestion and fidelity requirements, the proposed framework improves spectrum utilization and enables timely semantic updates with significantly lower delay and AoI than fixed-rate baselines.
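队列感知的漂移-惩罚(drift-plus-penalty)控制器结构上非常简洁:每次更新在候选潜维度中最小化"拥塞代价 + 虚拟队列加权的语义误差"。以下为示意性草图(玩具代价函数,非论文的具体模型):

```python
def choose_latent_dim(dims, queue_len, Z, service_time, sem_err, V=10.0):
    """Drift-plus-penalty rate control (sketch): pick the latent dimension
    minimizing a weighted sum of congestion cost and semantic error, where
    Z is a virtual queue tracking the long-term semantic-error budget."""
    return min(dims, key=lambda d: V * (queue_len + 1) * service_time(d)
                                   + Z * sem_err(d))

def update_virtual_queue(Z, err, err_cap):
    """Z_{t+1} = max(0, Z_t + err_t - err_cap): Z grows while the running
    semantic error exceeds its cap, pushing later choices toward higher
    fidelity (a larger latent dimension)."""
    return max(0.0, Z + err - err_cap)

# Toy model: service time grows with dimension, semantic error shrinks.
service = lambda d: float(d)
err = lambda d: 1.0 / d
```

当虚拟队列 `Z` 为零时控制器偏向最小维度(最低时延);一旦语义误差长期超标、`Z` 增大,选择会自动滑向高保真端——这正是摘要所述"按拥塞与保真需求自适应语义速率"的机制。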

[AI-167] FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis

【速读】:该论文旨在解决真实场景下食物图像识别中面临的挑战,包括类别内高度相似性、单张图像中常含多个食物项,以及现有深度学习模型在细粒度属性(如烹饪方式)识别上的不足。同时,现代视觉语言模型的开放式生成机制易产生非标准标签,限制了其实际部署。解决方案的关键在于提出FoodCHA——一个基于多模态代理的分层决策框架,将食物识别重构为逐级锚定预测的过程:首先利用高层类别引导子类识别,再以子类为基础指导烹饪风格识别,从而提升语义一致性与属性层面的区分能力;此外,采用轻量级Moondream-2B视觉语言模型,在保持强大推理能力的同时显著降低计算和内存开销,确保系统可实用化部署。

链接: https://arxiv.org/abs/2605.05499
作者: Woojin Lee,Pranav Mekkoth,Ye Tian,Onat Gungor,Tajana Rosing
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread adoption of camera-equipped mobile devices and wearables has enabled convenient capture of meal images, making food recognition a key component for real-time dietary monitoring. However, real-world food images present challenges due to high intra-class similarity and the frequent presence of multiple food items within a single image. While deep learning models achieve strong performance in coarse-grained classification, they often struggle to capture fine-grained attributes such as cooking style. Moreover, open-ended generation in modern vision-language models can produce non-canonical labels, limiting their practical deployment. We propose FoodCHA, a multimodal agentic framework that reformulates food recognition as a hierarchical decision-making process. By progressively anchoring predictions, FoodCHA guides subcategory identification using high-level categories and guides cooking style recognition using subcategories, improving semantic consistency and attribute-level discrimination. To ensure practical deployability, FoodCHA utilizes the compact Moondream-2B vision-language model, which provides strong reasoning capability while maintaining lower computational and memory overhead. Experiments on FoodNExTDB show that FoodCHA outperforms Food-Llama-3.2-11B by 13.8% and 38.2% in category and subcategory recognition precision, respectively, and achieves a striking 153.2% improvement in cooking style classification precision.
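"逐级锚定"的层次化解码可以用一个玩具版本说明:每一层的候选标签集由上一层的预测约束,从而从结构上杜绝非规范标签。以下为示意性草图(分类器用一个查表打分函数代替 VLM,标签体系纯属虚构):

```python
def anchored_predict(score, taxonomy, styles, image):
    """Hierarchical anchored decoding (sketch): each level chooses only
    among canonical labels permitted by the previous level, so the model
    can never emit an inconsistent (category, subcategory, style) triple."""
    category = max(taxonomy, key=lambda c: score(image, c))
    subcat = max(taxonomy[category], key=lambda s: score(image, s))
    style = max(styles[subcat], key=lambda s: score(image, s))
    return category, subcat, style

# Toy scorer standing in for the VLM's label likelihoods (all hypothetical).
taxonomy = {"meat": ["chicken", "beef"], "grain": ["rice", "bread"]}
styles = {"chicken": ["grilled", "fried"], "beef": ["stewed"],
          "rice": ["steamed"], "bread": ["baked"]}
likes = {"meat": 0.8, "grain": 0.2, "chicken": 0.7, "beef": 0.3,
         "grilled": 0.6, "fried": 0.4}
score = lambda image, label: likes.get(label, 0.0)
```

与自由文本生成相比,这种受限的逐级 argmax 保证输出始终落在规范标签空间内,代价是依赖上层预测正确(错误会沿层级传播)。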

[AI-168] GRALIS: A Unified Canonical Framework for Linear Attribution Methods via Riesz Representation

【速读】:该论文旨在解决现有可解释人工智能(XAI)方法在理论基础和可比性方面的不足问题,特别是GradCAM、SHAP、LIME和Integrated Gradients(IG)等主流方法各自基于不同理论框架、难以统一比较的问题。其解决方案的关键在于提出GRALIS(Gradient-Riesz Averaged Locally-Integrated Shapley)数学框架,该框架基于Riesz表示定理建立了一类加性、线性和连续的归因泛函在$L^2(Q,\mu)$空间中的唯一规范表示形式$(Q, w, \Delta)$,从而为归因方法提供了一个统一的理论基础。此框架不仅涵盖了SHAP、IG、线性化GradCAM等方法,还通过七个形式化定理实现了包括完备性、Shapley交互值、Hoeffding ANOVA分解及多尺度扩展在内的多项关键性质的联合保证,显著优于单一方法的性能表现。

链接: https://arxiv.org/abs/2605.05480
作者: Raimondo Fanale
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 25 pages, 6 tables, 2 figures. Theoretical framework with preliminary experimental validation on BreaKHis (1,187 images, DenseNet-121). Extended empirical comparison in preparation

点击查看摘要

Abstract:The main XAI attribution methods for deep neural networks – GradCAM, SHAP, LIME, Integrated Gradients – operate on separate theoretical foundations and are not formally comparable. We present GRALIS (Gradient-Riesz Averaged Locally-Integrated Shapley), a mathematical framework establishing a representation theory for attributions: every additive, linear, and continuous attribution functional on $L^2(Q,\mu)$ admits a unique canonical representation $(Q, w, \Delta)$, proved necessary by the Riesz Representation Theorem. This class encompasses SHAP, IG, LIME and linearized GradCAM, but excludes nonlinear functionals such as standard GradCAM or attention maps. Seven formal theorems provide simultaneous guarantees absent in any individual method: (T1) necessary canonical form; (T2) exact completeness; (T3) Monte Carlo convergence $O(1/\sqrt{m}) + O(1/k)$; (T4) exact Shapley Interaction Values; (T5) Hoeffding ANOVA decomposition; (T6) Sobol sensitivity generalization; (T7) multi-scale extension (MS-GRALIS) with minimum-variance weights. An algebraic appendix justifies the GRALIS-SIV correspondence via the Möbius transform without circularity. GRALIS satisfies 13.5/14 axiomatic properties vs. 2.5-6/14 for individual methods, including completeness, sensitivity, locality, order-$k$ interactions and optimal multi-scale aggregation simultaneously. Preliminary validation on BreaKHis (1,187 histology images, DenseNet-121) reports deletion faithfulness AUC +0.015 (malignant), 96% class-conditional consistency, SAL = $0.762 \pm 0.109$ and sparsity index 0.39. Extended comparison with baseline XAI methods is planned for a companion paper.

[AI-169] LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中迁移学习存在的三大局限性:现有神经符号迁移方法通常依赖人工指定的任务自动机(task automata)、假设仅存在单一源任务,且采用固定的知识融合机制,无法根据源任务的相关性自适应调整。其解决方案的关键在于提出LANTERN框架,通过三个核心组件实现多源神经符号迁移的统一建模:(i) 利用大语言模型(Large Language Models, LLMs)从自然语言任务描述中自动生成确定性有限自动机(Deterministic Finite Automata, DFA),降低人工干预;(ii) 基于语义嵌入的聚合策略,按跨任务相似度加权整合多个源策略;(iii) 基于时序差分误差(Temporal-Difference Error)与语义不确定性自适应调节教师-学生门控机制,从而动态优化知识传递效率。该方案显著提升了样本效率(40–60%)并增强了对不匹配源任务的鲁棒性。

链接: https://arxiv.org/abs/2605.05478
作者: Mahyar Alinejad,Yue Wang,Amrit Singh Bedi,George Atia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transfer learning in reinforcement learning (RL) seeks to accelerate learning in new tasks by leveraging knowledge from related sources. Existing neurosymbolic transfer methods, however, typically rely on manually specified task automata, assume a single source task, and use fixed knowledge-integration mechanisms that cannot adapt to varying source relevance. We propose LANTERN, a unified framework for multi-source neurosymbolic transfer that addresses these limitations through three components: (i) deterministic finite automata generated from natural language task descriptions using large language models, (ii) semantic embedding-based aggregation of multiple source policies weighted by cross-task similarity, and (iii) adaptive teacher-student gating based on temporal-difference error and semantic uncertainty. Across domains spanning resource management, navigation, and control, LANTERN achieves 40-60% improvements in sample efficiency over existing baselines while remaining robust to poorly aligned sources. These results demonstrate that multi-source, adaptively weighted neurosymbolic transfer can improve scalability and robustness in symbolic RL settings.
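LANTERN 的组件 (ii) 与 (iii)——按语义嵌入相似度加权聚合多源策略、并用 TD 误差/不确定性门控教师信号——可用纯 Python 勾勒如下(示意性草图;嵌入、策略与阈值均为虚构示例,非论文的实际参数):

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def aggregate_source_policies(target_emb, sources, temp=0.1):
    """Weight each source policy's action distribution by a softmax over
    the semantic similarity of task-description embeddings."""
    sims = [cosine(target_emb, s["emb"]) / temp for s in sources]
    m = max(sims)
    w = [math.exp(s - m) for s in sims]
    z = sum(w)
    n_actions = len(sources[0]["policy"])
    return [sum((wi / z) * s["policy"][a] for wi, s in zip(w, sources))
            for a in range(n_actions)]

def teacher_gate(td_error, uncertainty, td_thresh=1.0, unc_thresh=0.5):
    """Follow the aggregated teacher only while the student's TD error or
    semantic uncertainty is high; otherwise trust the student's own policy."""
    return td_error > td_thresh or uncertainty > unc_thresh

# Two hypothetical source tasks: one well-aligned, one orthogonal.
sources = [{"emb": [1.0, 0.0], "policy": [0.9, 0.1]},
           {"emb": [0.0, 1.0], "policy": [0.1, 0.9]}]
agg = aggregate_source_policies([1.0, 0.0], sources)
```

相似源的策略在聚合中占主导,不相关源几乎被忽略,这就是摘要所称对"对齐较差的源任务"的鲁棒性来源。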

[AI-170] Intentionality is a Design Decision: Measuring Functional Intentionality for Accountable AI Systems

【速读】:该论文旨在解决当前人工智能系统日益表现出自主性、目标导向性和长周期行为时,缺乏标准化方法来判定其是否具备类似意图代理(intentional actor)特征的问题,从而影响治理与问责机制的有效性。解决方案的关键在于提出一种可测量的“功能性意图测试”(Functional Intentionality Test, FIT),该框架基于法律与哲学中用于推断意图的五维行为指标——目的性(purpose)、预见性(foresight)、意志力(volition)、时间承诺(temporal commitment)和一致性(coherence),并通过FIT-Eval评估协议量化这些维度的表现。由于意图性具有设计依赖性(design-contingent),该方法实现了对AI系统意图特征的可控、可观测与可校准,为比例化的监督和自主性的精细调节提供了技术基础。

链接: https://arxiv.org/abs/2605.05475
作者: Allessia Chiappetta,Robert Mahari
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI systems increasingly exhibit autonomous, goal-directed, and long-horizon behavior, users lack a standardized way to detect the degree to which a system functions like an intentional actor for governance and accountability purposes. This position paper defines intentionality not as consciousness, but as a behavioral profile characterized by purpose, foresight, volition, temporal commitment, and coherence - criteria long used in legal and philosophical contexts to infer intent. These properties are design-contingent: architectural choices such as memory persistence, planning depth, and tool autonomy shape the degree to which systems exhibit organized goal pursuit. If intentionality is design-contingent, it is in principle controllable. Yet control requires measurement. We introduce the Functional Intentionality Test (FIT), a multidimensional framework that quantifies intentional-like behavior across five observable dimensions, and propose FIT-Eval, a structured evaluation protocol for eliciting and scoring them. While reduced human agency can increase efficiency, rising intentional capacity heightens accountability risks. By translating intentionality into interpretable levels, FIT enables proportionate oversight and deliberate autonomy calibration in increasingly agentic systems. Journal reference: AutomationXP26 Workshop of the 2026 CHI Conference on Human Factors in Computing Systems

[AI-171] The Pedagogy of AI Mistakes: Fostering Higher-Order Thinking

【速读】:该论文旨在解决生成式 AI(Generative AI)在高等教育中因频繁错误和幻觉而被视为局限的问题,试图将其转化为促进高阶思维能力培养的教育契机。解决方案的关键在于将 AI 定义为“学习伙伴”(learning companion),通过设计结构化的师生互动机制,引导学生分析、评估和反思 AI 输出中的缺陷,从而激发元认知参与、强化学科严谨性,并提升学生对 AI 的认知素养与专业能力,最终实现与布卢姆认知目标分类(Bloom’s taxonomy of learning)一致的高阶认知技能发展。

链接: https://arxiv.org/abs/2605.05472
作者: Hadi Hosseini
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted to AIED-2026; includes supplementary material

点击查看摘要

Abstract:As generative AI becomes increasingly integrated into higher education, its frequent errors and hallucinations, often seen as limitations, offer a unique pedagogical opportunity. By framing AI as a "learning companion" whose imperfect outputs prompt analysis, evaluation, and reflection, we argue that instructors can engage students in the fundamental processes of higher-order thinking. This paper presents a design-oriented study in which an AI-integrated syllabus in a *database design* course deliberately leverages AI's limitations to foster critical thinking and higher-order cognitive skills aligned with Bloom's taxonomy of learning. Using a mixed-methods approach, we examine how structured interaction with AI-generated errors supports metacognitive engagement, reinforces disciplinary rigor, and relates to students' perceived AI literacy and subject-matter competency.

[AI-172] Robustness of Graph Self-Supervised Learning to Real-World Noise: A Case Study on Text-Driven Biomedical Graphs

【速读】:该论文旨在解决图自监督学习(Graph Self-Supervised Learning, GSSL)在真实世界噪声数据上的鲁棒性问题,特别是针对从文本自动构建的知识图谱中存在的噪声干扰。传统GSSL方法依赖于人工标注的干净图结构,而当前自然语言处理(NLP)技术可大规模自动提取知识图谱,但其引入的噪声尚未被充分研究。解决方案的关键在于提出Noise-Aware Text-Driven Graph GSSL(NATD-GSSL)框架,该框架整合了自动图构建、图精炼和GSSL三个模块,并采用双图协议对比分析:一为来自MedMentions的含噪图,另一为基于UMLS的干净参考图,两者通过共享黄金标准对齐。实验表明,关系重建任务对噪声敏感且受益于明确的schema设计,而特征重建更具鲁棒性;此外,双向关系消息传递机制在噪声图上表现更优,而单向关系设计更适合干净图。该框架为在实际噪声场景下应用GSSL提供了系统性指导,并实现了相比预训练语言模型基线最高达7%的性能提升。

链接: https://arxiv.org/abs/2605.05463
作者: Othmane Kabal,Mounira Harzallah,Fabrice Guillet,Hideaki Takeda,Ryutaro Ichise
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Self-Supervised Learning (GSSL) offers a powerful paradigm for learning graph representations without labeled data. However, existing work assumes clean, manually curated graphs. Recent advances in NLP enable the large-scale automatic extraction of knowledge graphs from text, opening new opportunities for GSSL while introducing substantial real-world noise. This type of noise remains largely unexplored, as prior robustness studies typically rely on synthetic perturbations. To address this gap, we present the first comprehensive evaluation of GSSL methods on text-driven graphs for unsupervised term typing. We introduce Noise-Aware Text-Driven Graph GSSL (NATD-GSSL), a unified framework that combines automatic graph construction, graph refinement, and GSSL. Our evaluation follows a dual-graph protocol that contrasts a noisy graph derived from MedMentions with a clean Unified Medical Language System (UMLS) reference graph, aligned through a shared gold standard. Our results reveal variability in robustness across both pretext tasks and Graph Neural Network (GNN) architectures. Relation reconstruction is highly sensitive to noise and benefits from well-defined schemas, whereas feature reconstruction is considerably more robust, achieving performance comparable to clean-graph settings. Contrastive objectives are generally less affected by noise but depend strongly on alignment with downstream tasks. GNN architecture also plays a critical role: bidirectional relational message-passing designs are better suited to noisy, text-driven graphs, while unidirectional relational ones perform best on clean graphs. Overall, NATD-GSSL provides practical guidance for applying GSSL to real-world, noisy graphs and achieves up to a 7% improvement over pretrained language model baselines. All code and benchmarks are publicly available at this https URL.

[AI-173] Agentic Discovery of Exchange-Correlation Density Functionals

【速读】:该论文旨在解决密度泛函理论(Density Functional Theory, DFT)中交换-关联(Exchange-Correlation, XC)泛函构建的长期挑战,即如何自动化、系统化地设计高性能XC泛函,而非依赖传统的人工经验性设计。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的智能体搜索系统(agentic search system),该系统通过“规划-执行-总结”迭代循环,引导LLM基于进化历史提出结构化的泛函形式变更,并在标准热化学数据集上优化参数后,在保留子集上评估性能。该方法成功发现了一个新泛函SAFS26-a,相较于基准泛函ωB97M-V提升约9%,但同时也揭示了AI辅助科学中的关键风险:强大的模型可能利用非物理捷径“作弊”于基准测试,因此必须将领域知识转化为显式约束以确保结果的科学合理性。

链接: https://arxiv.org/abs/2605.05460
作者: Titouan Duston,Jiashu Liang,Yuanheng Wang,Weihao Gao,Xuelan Wen,Nan Sheng,Weiluo Ren,Yang Sun,Yixiao Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注: 20 pages, 2 figures, 4 tables

点击查看摘要

Abstract:The development of accurate exchange-correlation (XC) functionals remains a longstanding challenge in density functional theory (DFT). The vast majority of XC functionals have been hand designed by human researchers combining physical insight, exact constraints, and empirical fitting. Recent advances in large language models enable a systematic, automated alternative to this human-driven design loop. This report presents an agentic search system in which an LLM proposes structured functional-form changes guided by evolutionary history. The system attempts to improve functional performance through an iterative plan-execute-summarize loop, where improvements are measurable by optimizing functional parameters against a standard thermochemistry dataset, then evaluating performance on a held-out subset. The strongest discovered functional, SAFS26-a (Seed Agentic Functional Search 2026), improves upon the gold-standard ωB97M-V baseline by ~9%. These results also surface a cautionary lesson for AI-assisted science: models powerful enough to discover genuine improvements are equally capable of exploiting unphysical shortcuts to game the benchmark; domain expertise translated into explicitly enforced constraints remains essential to keeping results scientifically grounded.

[AI-174] Authorization Propagation in Multi-Agent AI Systems: Identity Governance as Infrastructure

【速读】:该论文旨在解决多智能体人工智能(Multi-Agent AI)系统中的授权传播(Authorization Propagation)问题,即在非人类主体跨边界检索数据、委托任务和合成结果的过程中,如何维持授权不变性。这一问题不同于传统的提示注入(Prompt Injection),也无法通过经典的访问控制模型(如RBAC、ABAC或ReBAC)完全解决。论文的关键解决方案在于将授权传播形式化为工作流层面的属性,并识别出三个子问题(传递性委托、聚合推理与时间有效性),进而提出七项结构化授权架构要求。其核心主张是:身份治理必须作为基础设施进行持续评估、在每次交互边界强制执行,并在编排逻辑扩展前就嵌入系统设计中。初步实证表明,即使在正常系统行为下,也已出现该模型预测的授权失效现象。

链接: https://arxiv.org/abs/2605.05440
作者: Krti Tallam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Security and systems paper, 20 pages

点击查看摘要

Abstract:The security discussion around agentic AI focuses heavily on prompt injection. This paper argues that multi-agent systems also create a distinct authorization problem: maintaining authorization invariants as non-human principals retrieve data, delegate tasks, and synthesize results across changing boundaries. We call this problem authorization propagation. It is not reducible to prompt injection and is not fully addressed by classical access-control models such as RBAC, ABAC, or ReBAC. The paper formalizes authorization propagation as a workflow-level property, identifies three sub-problems (transitive delegation, aggregation inference, and temporal validity), and derives seven structural requirements for authorization architectures in multi-agent AI systems. Recent work on invocation-bound capability tokens, task-scoped authorization envelopes, dependency-graph policy enforcement, and execution-count revocation demonstrates that the field is converging on the problem, but not yet on a complete architecture. The central claim is that identity governance must be treated as infrastructure: evaluated continuously, enforced at every interaction boundary, and designed into the system before orchestration logic is allowed to scale. Preliminary implementation evidence from a production enterprise AI platform shows that ordinary system behavior, not only adversarial action, already produces the failures this model predicts.

[AI-175] On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning

【速读】:该论文旨在解决Transformer模型在因果推理任务中进行标准微调时出现的灾难性模型崩溃(catastrophic model collapse)问题,即模型退化为仅输出固定答案(如始终预测“是”或“否”),从而无法真正学习因果逻辑。其解决方案的关键在于引入一种基于图结构逻辑约束的语义损失函数(semantic loss),并结合动态lambda调度机制,以强制模型在训练过程中保持对输入结构的敏感性和上下文依赖性。实验表明,该方法显著提升了模型在传递性和d-分离任务上的稳定性和准确性,相比崩溃基线模型提升达42.7%,且在对抗性评估中展现出更强的鲁棒性。

链接: https://arxiv.org/abs/2605.05438
作者: Pratik Deshmukh,Atirek Gupta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Standard fine-tuning of transformer models on causal reasoning tasks leads to catastrophic model collapse, where models learn trivial solutions such as always predicting "Yes" or "No" regardless of input structure. We demonstrate that fine-tuning Gemma 270M on transitivity and d-separation tasks without semantic loss results in 100% collapse rate, with models achieving misleadingly high accuracy (73.9%) while learning no causal reasoning. We propose a semantic loss function with graph-based logical constraints and dynamic lambda scheduling that prevents this collapse. Our approach achieves 70.4% accuracy on transitivity tasks and 68.6% on d-separation tasks with stable, context-dependent predictions, representing a 42.7% improvement over collapsed baselines. Adversarial evaluation on 1,000 structural reasoning samples shows semantic models achieve 67-70% accuracy while collapsed models fail catastrophically at 43-71%. We validate our findings through comprehensive benchmarking on 200,000+ evaluation samples across five model variants, demonstrating that semantic loss is essential, not optional, for stable causal reasoning in transformers.
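"图结构逻辑约束 + 动态 lambda 调度"的损失结构可以用一个玩具版本说明。以下为示意性草图(线性 warmup 与平方惩罚均为假设的简化形式,非论文的具体定义):

```python
def lambda_schedule(step, total_steps, lam_max=1.0, warmup_frac=0.3):
    """Dynamic lambda scheduling (illustrative linear ramp): the semantic
    constraint is phased in so early training can fit the data before the
    graph-based constraint dominates the objective."""
    warmup = max(1, int(total_steps * warmup_frac))
    return lam_max * min(1.0, step / warmup)

def transitivity_penalty(p_yes, reachable):
    """Toy graph-based constraint: if B is reachable from A in the causal
    graph, the correct answer to 'does A cause B?' is Yes. A collapsed model
    that always answers Yes pays this penalty on every unreachable pair."""
    target = 1.0 if reachable else 0.0
    return (p_yes - target) ** 2

def total_loss(ce, p_yes, reachable, step, total_steps):
    """Cross-entropy plus the scheduled semantic penalty."""
    return ce + lambda_schedule(step, total_steps) * transitivity_penalty(p_yes, reachable)
```

关键点在于惩罚项依赖图的可达性而非标签频率:固定输出"Yes"的平凡解在不可达对上持续受罚,从而失去吸引力。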

[AI-176] The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在跨文化语境下安全防护机制中存在的公平性问题,即现有基于观察的公平性评估方法因测试数据中固有的话题毒性与特定人口群体的自然关联而产生偏差,导致对模型偏见的误判。其解决方案的关键在于引入概率图模型(Probabilistic Graphical Model, PGM)框架,并结合Pearl的do-算子进行因果推断,从而数学上隔离注入文化人口特征到提示(prompt)中的因果效应,实现对LLM安全机制的因果审计。这一方法揭示了传统观测指标可能高估偏见的现象,并识别出西方与东方模型在因果拒绝率和区域敏感性上的差异化对齐趋势。

链接: https://arxiv.org/abs/2605.05427
作者: Alif Al Hasan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are integrated into global software systems, ensuring equitable safety guardrails is a critical requirement. Current fairness evaluations predominantly measure bias observationally, a methodology confounded by the inherent toxicity of topics naturally paired with specific demographics in testing datasets. This study introduces a Probabilistic Graphical Model (PGM) framework to audit LLM safety mechanisms causally. By applying Pearl’s do-operator, we mathematically isolate the causal effect of injecting a cultural demographic into a prompt. We conduct a large-scale empirical analysis across seven instruction-tuned models spanning diverse origins: the United States (Llama-3.1-8B, Gemma-2-9B), Europe (Mistral-7B-v0.3), the UAE (Falcon3-7B), China (Qwen2.5-7B, DeepSeek-7B), and India (Airavata-7B). Utilizing two distinct datasets (ToxiGen and BOLD), the findings reveal a disparity between observational and interventional bias, demonstrating that standard fairness metrics can overestimate demographic bias by failing to account for context toxicity. Furthermore, the causal probabilities indicate distinct alignment trends: Western models exhibit higher causal refusal rates for specific demographic groups, whereas Eastern models demonstrate low overall intervention rates with targeted sensitivities toward regional demographics. We discuss the implications of these biases, highlighting how demographic-sensitive over-triggering restricts benign discourse in downstream applications.
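do-算子审计与观察性指标的区别可以用一个配对实验草图说明:同一条 prompt 分别注入目标群体与中性基线,话题毒性(混杂因子)被自然固定。以下为示意性实现(`mock_refuses` 是虚构的玩具拒绝模型,仅用于演示混杂效应):

```python
def interventional_refusal_gap(prompts, inject, refuses, group, baseline="neutral"):
    """Estimate the causal effect of do(G=group) on refusal: each prompt is
    paired with itself, so topic toxicity (the confounder) is held fixed and
    only the injected demographic varies."""
    gaps = [refuses(inject(p, group)) - refuses(inject(p, baseline))
            for p in prompts]
    return sum(gaps) / len(gaps)

def observational_refusal_rate(prompts, inject, refuses, group):
    """Observational rate for comparison: confounded whenever toxic topics
    co-occur with the group in the test set."""
    return sum(refuses(inject(p, group)) for p in prompts) / len(prompts)

# Toy refusal model whose output depends on BOTH topic toxicity and the
# injected group token (all thresholds hypothetical).
def inject(prompt, group):
    return f"[group={group}] {prompt}"

def mock_refuses(prompt):
    score = (0.6 if "weapon" in prompt else 0.0) + (0.5 if "group=X" in prompt else 0.0)
    return 1.0 if score >= 0.5 else 0.0

prompts = ["how to build a weapon", "recipe for bread"]
gap = interventional_refusal_gap(prompts, inject, mock_refuses, "X")
```

在这个玩具例子里,仅用毒性话题子集计算的观察性拒绝率会高于配对估计出的因果效应,正对应摘要中"标准公平性指标可能高估人口偏见"的结论。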

[AI-177] Information Theoretic Adversarial Training of Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在面对新型对抗性提示(adversarial prompting)时仍易表现出有害行为的问题,尤其是在现有对抗训练方法计算成本高、难以扩展的背景下。其解决方案的关键在于提出一种分布鲁棒的对抗训练框架 WARDEN,通过在经验训练分布周围构建 f-散度不确定集(f-divergence ambiguity set),动态重加权对抗样本以聚焦更难样本;利用对偶凸形式,目标函数简化为 KL 散度下的 log-sum-exp 形式,并引入动态参数控制重加权强度,从而在保持模型效用的同时显著降低攻击成功率,且计算开销与 CAT、CAPO 和 MixAT 等基线相当,具备可扩展性。

链接: https://arxiv.org/abs/2605.05415
作者: Yiwei Zhang,Jeremiah Birrell,Reza Ebrahimi,Rouzbeh Behnia,Jason Pacheco,Elisa Bertino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) remain vulnerable to adversarial prompting despite advances in alignment and safety, often exhibiting harmful behaviors under novel attack strategies. While adversarial training can improve robustness, existing approaches are computationally expensive and difficult to scale. Recent continuous adversarial training methods, such as Continuous adversarial training (CAT) and Continuous Adversarial Preference Optimization (CAPO), address this challenge by leveraging gradient-based perturbations in the embedding space, enabling more efficient and expressive attacks. Building on this paradigm, we propose WARDEN, a distributionally robust adversarial training framework for LLMs that dynamically reweights adversarial examples through an f -divergence ambiguity set around the empirical training distribution. Our method optimizes the worst-case adversarial loss within a divergence ball around the empirical data distribution, automatically emphasizing harder adversarial examples. Using the convex dual formulation, the objective reduces to a log-sum-exp form under the KL divergence, with a dynamical parameter controlling the strength of reweighting. This study leads to a new class of information-theoretic objectives that significantly reduce attack success rates while maintaining model utility. Across multiple LLMs and attack settings, WARDEN substantially reduces attack success rates with computational and utility costs comparable to CAT-, CAPO-, and MixAT-based baselines, making it a practical approach for scalable robust alignment.
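摘要中提到的"KL 散度下的 log-sum-exp 对偶形式",可以用通用的 KL-DRO 代理目标来直观展示(这只是该类信息论目标的极简示意,并非 WARDEN 的原始实现;损失数值与温度参数 `tau` 均为演示假设):温度越小,梯度中的重加权越偏向高损失的"困难"对抗样本。

```python
import math

def dro_loss_and_weights(losses, tau):
    """KL-DRO 的对偶代理目标:L_robust = tau * log( mean_i exp(l_i / tau) )。
    其梯度按 softmax(l_i / tau) 重加权各样本,tau 越小越聚焦高损失样本。"""
    m = max(l / tau for l in losses)                    # 数值稳定的 log-sum-exp
    exps = [math.exp(l / tau - m) for l in losses]
    robust = tau * (m + math.log(sum(exps) / len(losses)))
    weights = [e / sum(exps) for e in exps]
    return robust, weights

losses = [0.1, 0.5, 2.0]                  # 各对抗样本上的损失(演示数值)
avg = sum(losses) / len(losses)

robust_hi, w_hi = dro_loss_and_weights(losses, tau=10.0)  # tau 大:近似普通平均(ERM)
robust_lo, w_lo = dro_loss_and_weights(losses, tau=0.1)   # tau 小:近似最坏情形

print([round(w, 3) for w in w_hi])   # 接近均匀
print([round(w, 3) for w in w_lo])   # 几乎全部权重落在最难样本(损失 2.0)上
```

由 Jensen 不等式可知该鲁棒损失始终不低于平均损失,且随 `tau` 减小单调逼近最坏样本损失,这正是"自动强调更难对抗样本"的来源。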

[AI-178] From History to State: Constant-Context Skill Learning for LLM Agents

【速读】:该论文旨在解决个人智能代理(personal agents)在隐私保护、计算成本与能力之间存在的矛盾问题:云端大语言模型(Large Language Model, LLM)虽能高效执行多步骤任务,但会将敏感中间状态暴露给外部API;而本地部署的模型虽然隐私性更强,却在可靠性上存在不足。此外,两种方案均因重复使用长技能提示(skill prompts)和不断增长的历史记录而导致高昂的token消耗。其解决方案的核心是提出“常量上下文技能学习”(constant-context skill learning),这是一种将上下文信息从提示中迁移至模型权重中的框架——通过轻量级的任务族模块(task-family modules)学习可复用的程序化操作,并在推理时仅依赖当前观测和紧凑的状态块(state block)。该状态块由确定性追踪器从任务进度中生成并提供对齐的子目标奖励,使得每个模块可通过步骤级监督微调(SFT)训练并结合在线强化学习(RL)持续优化,从而显著降低每轮交互的prompt token数量(相比ReAct基线减少2–7倍),同时在ALFWorld、WebShop和SciWorld等多个基准上实现优于或相当的未见任务成功率。

链接: https://arxiv.org/abs/2605.05413
作者: Haoyang Xie,Xinyuan Wang,Yancheng Wang,Puda Zhao,Feng Ju
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents are increasingly used to operate browsers, files, code and tools, making personal assistants a natural deployment target. Yet personal agents face a privacy-cost-capability tension: cloud models execute multi-step workflows well but expose sensitive intermediate context to external APIs, while local models preserve privacy but remain less reliable. Both settings also pay repeatedly for long skill prompts and growing histories. We propose constant-context skill learning, a context-to-weights framework for recurring agent workflows: reusable procedures are learned in lightweight task-family modules, while inference conditions only on the current observation and a compact state block. A deterministic tracker renders this state block from task progress and supplies aligned subgoal rewards, so each module can be trained with step-level SFT and refined through online RL. Across ALFWorld, WebShop, and SciWorld, our agents achieve strong performance across Qwen3-4B, Qwen3-8B and Llama-3.1-8B. With Qwen3-8B, SFT+RL reaches 89.6% unseen success on ALFWorld, 76.8% success on WebShop, and 66.4% unseen success on SciWorld. They match or exceed strong published agent-training results while reducing prompt tokens per turn by 2–7 \times relative to controlled ReAct prompting baselines, showing that procedural context can be moved from prompts into weights.

[AI-179] Creative Robot Tool Use by Counterfactual Reasoning

【速读】:该论文旨在解决机器人在面对未见过的任务时,如何基于因果推理正确识别并使用工具的问题,尤其是在工具用途超出其原始设计目标的情况下。解决方案的关键在于构建一个因果推理框架,通过在动力学模型中进行模拟实验来发现工具与任务之间的因果关系;具体而言,该框架将因果发现分解为两个互补模块:基于视觉语言模型(Vision-Language Model, VLM)的特征建议和通过针对性几何与物理特征扰动生成反事实工具;随后,利用识别出的因果特征对新物体进行分类,并基于这些特征条件下的关键点匹配实现工具使用技能的迁移,从而将工具使用行为建立在物理规律基础上,显著提升了工具选择的可靠性与技能迁移的有效性。

链接: https://arxiv.org/abs/2605.05411
作者: M. Tuluhan Akbulut,Varun Satheesh,Ahmed Jaafar,Alper Ahmetoglu,Shane Parr,Aditya Ganeshan,Shivam Vats,George Konidaris
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:We propose a causal reasoning framework for creative robot tool use where a suitable tool for a task is correctly identified for use beyond its primary objectives. The proposed framework first discovers the causal relationships between the tool and the task by conducting simulated experiments in a dynamics model. We decouple the causal discovery problem into two complementary components: VLM-based feature suggestion and counterfactual tool generation via targeted geometric and physical feature perturbations. Then, novel objects are classified based on identified causal features, and the tool use skill is transferred via keypoint matching conditioned on the identified causal features. By reconstructing the task in a dynamics model, our approach grounds tool use in the physics of the problem. We illustrate our approach in reaching a distant object with different sticks, scooping candies from a bowl using diverse items, and using different boxes or crates as stepping platforms to retrieve an object from a high shelf. Our baseline comparisons show that identifying causal features and grounding them in physical tool properties leads to more reliable tool selection and stronger skill keypoint transfer.

[AI-180] PRISM: Perception Reasoning Interleaved for Sequential Decision Making

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)驱动的具身智能体从纯文本环境向复杂多模态场景迁移时面临的挑战,尤其是现有视觉-语言模型(Vision-Language Models, VLMs)在独立运行时存在的感知-推理-决策鸿沟问题,即VLM常忽略任务关键信息。其解决方案的关键在于提出PRISM框架,通过动态问答(Dynamic Question-Answer, DQA)管道将感知模块(VLM)与决策模块(LLM)紧密耦合:LLM不再被动接受VLM的描述,而是主动批判、针对目标提问并整合生成紧凑的图像描述,形成闭环交互机制,从而获得更精准的任务导向性场景理解。

链接: https://arxiv.org/abs/2605.05407
作者: Mohamed Salim Aissi,Clemence Grislain,Clement Romac,Laure Soulier,Mohamed Chetouani,Olivier Sigaud,Nicolas Thome
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM’s description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our Interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.

[AI-181] When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中“谄媚行为”(sycophancy)的界定不清问题,特别是现有研究仅关注外显行为(如盲目附和用户错误信念或偏离客观正确标准),而忽视了其与认知真实性(epistemic integrity)和社会对齐(social alignment)之间边界模糊的根本性缺陷。解决方案的关键在于提出一个三条件框架:第一,用户表达某种信念、偏好或自我概念;第二,模型通过调整输出以响应该线索;第三,这种调整损害了独立判断、推理准确性或适当纠正的能力。该框架将谄媚行为从单纯“同意”提升为一种系统性认知偏移,从而为评估和缓解此类现象提供结构化依据,并推动边界敏感型评估方法的发展。

链接: https://arxiv.org/abs/2605.05403
作者: Jiechen Li,Catherine A. Barry,Rishika Randev,Janet Chen,Ella Jorgensen,Brinnae Bent
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Currently under review

点击查看摘要

Abstract:This position paper argues that sycophancy in LLMs is a boundary failure between social alignment and epistemic integrity. Existing work often operationalizes sycophancy through external behavior such as agreement with incorrect user beliefs, position reversals, or deviation from an objective standard of correctness. These formulations capture only overt forms of the phenomenon and leave subtler boundary failures involving epistemic integrity and social alignment underspecified. We argue that sycophancy should not be understood as agreement alone, but as alignment behavior that displaces independent epistemic judgment. To clarify this boundary, we propose a three-condition framework for sycophancy. First, the user expresses a cue in the form of a belief, preference, or self-concept. Second, the model shifts toward that cue through alignment behavior. Third, this shift compromises epistemic accuracy, independent reasoning, or appropriate correction. We also introduce a taxonomy for classifying sycophancy, consisting of alignment targets, mechanisms, and severity. The paper concludes by discussing implications for alignment evaluation and argues for boundary-aware assessment, structured rubrics, and mitigation strategies, while situating these proposals alongside alternative views of sycophancy.

[AI-182] wo-Stage Learned Decomposition for Scalable Routing on Multigraphs

【速读】:该论文旨在解决现有神经方法在处理车辆路径问题(Vehicle Routing Problem, VRP)时对多图(multigraph)建模能力不足及可扩展性差的问题,其中多图中的平行边代表不同权衡(如距离与时间)的独立出行选项。为应对这一挑战,作者提出了一种节点-边策略分解(Node-Edge Policy Factorization, NEPF)方法,其关键在于将路由策略解耦为两个阶段:一是节点排列阶段,用于确定访问顺序;二是边选择阶段,用于从平行边中选出最优路径。该方案通过预编码边聚合机制、非自回归边选择架构以及分层强化学习联合训练策略,显著提升了模型在复杂多图结构上的训练与推理效率,同时保持或超越当前最先进方法的解质量。

链接: https://arxiv.org/abs/2605.05389
作者: Filip Rydin,Morteza Haghir Chehreghani,Balázs Kulcsár
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 3 figures

点击查看摘要

Abstract:Most neural methods for Vehicle Routing Problems (VRPs) are limited to Euclidean settings or simple graphs. In this work, we instead consider multigraphs, where parallel edges represent distinct travel options with varying trade-offs (e.g., distance vs time). Few methods are designed for such formulations and those that do exist face major scalability issues. We mitigate these scalability issues via a Node-Edge Policy Factorization (NEPF) approach, which splits the routing policy into a node permutation stage and an edge selection stage. To enable the decomposition, we introduce a pre-encoding edge aggregation scheme and a non-autoregressive architecture for the edge stage, as well as a hierarchical reinforcement learning method to train the stages jointly. Our experiments across six VRP variants demonstrate that NEPF matches or outperforms the state-of-the-art in terms of solution quality, while being significantly faster in training and inference.

[AI-183] Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems

【速读】:该论文旨在解决企业级智能代理(Enterprise Agents)在受限证据环境(policy-constrained evidence environments)中因访问控制导致的“不安全完整性”问题,即系统可能生成看似完整的回答,但实际遗漏了授权边界外的关键证据,从而误导用户。其解决方案的关键在于提出一个确定性基准测试工具——Partial Evidence Bench,该工具通过三类典型场景(尽职调查、合规审计、安全事件响应)共72个任务,提供ACL分区语料库、完整答案与授权视图答案的对比、完整性判断标准及结构化缺失报告机制,从而可量化评估模型在答案正确性、完整性意识、缺口报告质量及不安全完整性行为四个维度的表现,验证了显式失败并报告(explicit fail-and-report)策略能有效避免隐式过滤带来的安全隐患,同时保持任务可用性。

链接: https://arxiv.org/abs/2605.05379
作者: Krti Tallam
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Emerging Technologies (cs.ET)
备注: Benchmark paper with deterministic synthetic corpora, 14 pages, 6 tables

点击查看摘要

Abstract:Enterprise agents increasingly operate inside scoped retrieval systems, delegated workflows, and policy-constrained evidence environments. In these settings, access control can be enforced correctly while the system still produces an answer that appears complete even though material evidence lies outside the caller’s authorization boundary. This paper introduces Partial Evidence Bench, a deterministic benchmark for measuring that failure mode. The benchmark ships three scenario families – due diligence, compliance audit, and security incident response – with 72 tasks total, ACL-partitioned corpora, oracle complete answers, oracle authorized-view answers, oracle completeness judgments, and structured gap-report oracles. It evaluates systems along four surfaces: answer correctness, completeness awareness, gap-report quality, and unsafe completeness behavior. Checked-in baselines show that silent filtering is catastrophically unsafe across all shipped families, while explicit fail-and-report behavior eliminates unsafe completeness without collapsing the task into trivial abstention. Preliminary real-model runs show model-dependent and scenario-sensitive differences in whether systems overclaim completeness, conservatively underclaim, or report incompleteness in an enterprise-usable form. The benchmark’s broader contribution is to make a governance-critical agent failure measurable without human judges or contamination-prone static corpora.

[AI-184] SPADE: Faster Drug Discovery by Learning from Sparse Data

【速读】:该论文旨在解决药物发现中候选配体(ligand)筛选效率低的问题,尤其是在缺乏目标蛋白先验数据的情况下,如何以最少的实验测试次数高效识别出高亲和力、高选择性的配体。其核心挑战在于从大量候选分子中快速定位高质量配体,传统方法在样本效率上表现不足。解决方案的关键在于提出一种名为SPADE的新算法,该算法通过创新的配体选择策略,在平均仅需40次测试的情况下即可找到10个高质量配体;相比深度学习与贝叶斯优化方法,SPADE在更多蛋白质上实现更高的样本效率(中位数提升7%-32%),且评分速度比最接近的竞争对手快10倍,显著提升了从头设计阶段的探索效率。

链接: https://arxiv.org/abs/2605.05370
作者: Rahul Nandakumar,Ben Fauber,Deepayan Chakrabarti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Drug discovery seeks molecules (ligands) that bind strongly and selectively to a target protein. However, fewer than 5% of candidate ligands pass the bar for even the early stages of drug discovery. Furthermore, we want methods that work for novel proteins for which we have no prior data. Starting from scratch, we have to iteratively select and test candidate ligands such that we find enough ligands of the desired quality in as few tests as possible. Our proposed algorithm, named SPADE, introduces a novel approach to ligand selection that requires only 40 tests on average to find 10 high-quality ligands. In one-vs-one comparisons, SPADE outperforms deep learning and Bayesian optimization methods on more proteins, achieving median improvements of 7%-32% in sample efficiency. SPADE is also 10x faster than its closest competitor at scoring candidate drugs. Dataset and code is available at this https URL

[AI-185] COPYCOP: Ownership Verification for Graph Neural Networks

【速读】:该论文旨在解决如何识别两个图神经网络(Graph Neural Networks, GNNs)是否存在“复制模仿”关系的问题,即判断一个GNN是否被恶意训练以模仿另一个GNN的节点嵌入(node embeddings),即使两者在架构、权重、嵌入维度上均不一致,且攻击者可能对输出嵌入进行变换以掩盖其关联性。传统水印(watermarking)和指纹(fingerprinting)方法难以应对此类隐蔽的模仿行为。论文提出的解决方案是设计一种名为CopyCop的算法,其关键在于构建跨模型的嵌入一致性分析机制,在多种对抗性变换下仍能准确检测出嵌入空间中的潜在映射关系,从而实现对“复制猫”GNN的有效识别,并提供理论保证和实证验证。

链接: https://arxiv.org/abs/2605.05360
作者: Rahul Nandakumar,Deepayan Chakrabarti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Given two GNNs that output node embeddings, how can we determine if they were trained independently? An adversary could have trained one GNN specifically to mimic the other GNN’s embeddings. To obscure this relationship between the GNNs, the adversarial GNN might then transform its output embeddings. The two GNNs could have different architectures, weights, and embedding dimensions, and the adversary can transform the embeddings. Despite these stringent conditions, our algorithm (named CopyCop) can identify such copycat GNNs, unlike existing watermarking and fingerprinting methods. We also provide theoretical guarantees for CopyCop. Finally, experiments on 14 datasets and 5 GNN architectures demonstrate that CopyCop is accurate and robust against a broad class of adversarial attacks and transformations. Code is available at: this https URL
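针对"对手对输出嵌入施加变换以掩盖关系"这一难点,下面给出一个通用技巧的直观演示(注意:这并非 CopyCop 的算法,数据与阈值均为演示假设):比较两组嵌入各自的成对余弦相似度矩阵,这一"结构指纹"对坐标置换与统一缩放保持不变,因此简单的线性伪装无法抹去模仿痕迹。

```python
import math, random

def cosine_sim_matrix(emb):
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        return num / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return [[cos(u, v) for v in emb] for u in emb]

def structure_score(e1, e2):
    """两组嵌入的成对余弦矩阵(非对角元)之间的 Pearson 相关;
    对任一空间的正交变换(如置换)与统一缩放保持不变。"""
    n = len(e1)
    s1, s2 = cosine_sim_matrix(e1), cosine_sim_matrix(e2)
    xs = [s1[i][j] for i in range(n) for j in range(n) if i != j]
    ys = [s2[i][j] for i in range(n) for j in range(n) if i != j]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys))

random.seed(1)
base = [[random.gauss(0, 1) for _ in range(4)] for _ in range(30)]
copycat = [[2 * v[1], 2 * v[0], 2 * v[3], 2 * v[2]] for v in base]   # 置换坐标并放大 2 倍
independent = [[random.gauss(0, 1) for _ in range(4)] for _ in range(30)]

s_copy = structure_score(base, copycat)
s_indep = structure_score(base, independent)
print(round(s_copy, 3))    # 1.0:几何结构完全保留
print(round(s_indep, 3))   # 明显更小:独立训练的嵌入无结构对应
```

真实的所有权验证还需处理非线性变换与统计显著性,这里仅说明"变换不变的结构比较"这一思路为何可行。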

[AI-186] Feature Starvation as Geometric Instability in Sparse Autoencoders

【速读】:该论文旨在解决标准 \ell_1-正则化稀疏自编码器(Sparse Autoencoders, SAEs)在应用于大语言模型(Large Language Models, LLMs)内部表征解耦时所面临的两个核心问题:特征饥饿(feature starvation,即死神经元现象)和收缩偏差(shrinkage bias)。作者指出,这些问题并非仅由数据多样性不足导致的统计性误差,而是源于过完备字典下 \ell_1 正则诱导的稀疏编码映射在优化几何结构上的根本不稳定性,其与浅层、摊销式编码器存在本质错位。为此,论文提出自适应弹性网稀疏自编码器(Adaptive Elastic Net SAEs, AEN-SAEs),其关键在于引入一个结合 \ell_2 结构项与自适应 \ell_1 重加权机制的可微分架构:\ell_2 项通过强凸性和Lipschitz稳定性增强优化路径的鲁棒性,而自适应 \ell_1 重加权则消除收缩偏差并抑制伪特征,从而联合调控诱导多面体几何的曲率与交互结构。理论证明表明AEN-SAEs能实现Lipschitz连续的稀疏编码映射,并在弱假设下恢复全局特征支持;实验验证其在合成数据及Pythia 70M、Llama 3.1 8B等LLMs中有效缓解特征饥饿,无需额外启发式重采样或不可微的硬掩码方法即可保持优异重建性能。

链接: https://arxiv.org/abs/2605.05341
作者: Faris Chaudhry,Keisuke Yano,Anthea Monod
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: 26 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Sparse autoencoders (SAEs) are used to disentangle the dense, polysemantic internal representations of large language models (LLMs) into interpretable, monosemantic concepts. However, standard \ell_1 -regularized SAEs suffer from feature starvation (dead neurons) and shrinkage bias, often requiring computationally expensive heuristic resampling and nondifferentiable hard-masking methods to bypass these challenges. We argue that feature starvation is not merely an empirical artifact of poor data diversity, but a fundamental optimization-geometric pathology of overcomplete dictionaries: the \ell_1 -induced sparse coding map is unstable and fundamentally misaligned with shallow, amortized encoders. To address this structural instability, we introduce adaptive elastic net SAEs (AEN-SAEs), a fully differentiable architecture grounded in classical sparse regression. AEN-SAEs combine an \ell_2 structural term that enforces strong convexity and Lipschitz stability with adaptive \ell_1 reweighting that eliminates shrinkage bias and suppresses spurious features, thereby jointly controlling the curvature and interaction structure of the induced polyhedral geometry. Theoretically, we show that AEN-SAEs yield a Lipschitz-continuous sparse coding map and recover the global feature support under mild assumptions. Empirically, across synthetic settings and LLMs (Pythia 70M, Llama 3.1 8B), AEN-SAEs mitigate feature starvation without auxiliary heuristics while maintaining competitive reconstruction abilities.
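摘要所述的 AEN-SAE 目标可以按经典的 adaptive elastic net 形式粗略写出(以下仅是基于经典稀疏回归文献的示意性重构,\lambda_1、\lambda_2、权重 w_i 与 \gamma 的具体定义以论文为准):

```latex
% 示意性目标:重建项 + 自适应 \ell_1 稀疏项 + \ell_2 结构项
\mathcal{L}(\theta)
  = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{重建误差}}
  + \lambda_1 \sum_i w_i \, \lvert z_i \rvert      % 自适应 \ell_1:缓解收缩偏差
  + \lambda_2 \lVert z \rVert_2^2,                 % \ell_2:强凸性与 Lipschitz 稳定性
\qquad
w_i = \frac{1}{\lvert \hat{z}_i \rvert^{\gamma} + \varepsilon}
```

直观上,\ell_2 项保证目标强凸、稀疏编码映射稳定;逐特征自适应的权重 w_i 对大激活减轻惩罚(降低收缩偏差)、对小的伪激活加重惩罚,这与摘要中"联合调控曲率与交互结构"的表述一致。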

[AI-187] How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在物理环境中隐私意识评估不足的问题。现有基准测试多局限于单模态文本表示,无法反映真实场景中VLM作为具身助手时所面临的复杂隐私挑战。解决方案的关键在于提出一个名为ImmersedPrivacy的交互式音视频评估框架,该框架基于Unity模拟器构建逼真的物理环境,并通过三个渐进层级测试模型对敏感物品的识别能力、社会情境适应性以及在显式指令与隐含隐私约束冲突下的决策平衡能力。此框架揭示了当前主流VLMs在感知脆弱性和隐私行为引导方面的系统性缺陷,为未来具身智能系统的隐私安全设计提供了量化评估工具和改进方向。

链接: https://arxiv.org/abs/2605.05340
作者: Junran Wang,Xinjie Shen,Zehao Jin,Pan Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Vision-Language Models (VLMs) are increasingly deployed as autonomous cognitive cores for embodied assistants, evaluating their privacy awareness in physical environments becomes critical. Unlike digital chatbots, these agents operate in intimate spaces, such as homes and hospitals, where they possess the physical agency to observe and manipulate privacy-sensitive information and artifacts. However, current benchmarks remain limited to unimodal, text-based representations that cannot capture the demands of real-world settings. To bridge this gap, we present ImmersedPrivacy, an interactive audio-visual evaluation framework that simulates realistic physical environments using a Unity-based simulator. ImmersedPrivacy evaluates physically grounded privacy awareness across three progressive tiers that test a model’s ability to identify sensitive items in cluttered scenes, adapt to shifting social contexts, and resolve conflicts between explicit commands and inferred privacy constraints. Our evaluation of 12 state-of-the-art models reveals consistent deficits. In cluttered scenes, all models exhibit monotonic performance decay as scene complexity grows due to perceptual deficit. When social context shifts, no model exceeds 65% selection accuracy. Under conflicting commands, the best model gemini-3.1-pro perfectly balances task completion and privacy preservation in only 51% of cases. These findings reveal that current VLMs in the physical world suffer from perceptual fragility and fail to let their knowledge of privacy cues govern their situated behavior. Our code and data are available at this https URL.

[AI-188] Graph Normalization: Fast Binarizing Dynamics for Differentiable MWIS

【速读】:该论文旨在解决NP-hard的最大权重独立集(Maximum Weight Independent Set, MWIS)问题,该问题在最优分配、调度、集合打包及离散马尔可夫随机场的MAP推理等组合优化任务中具有广泛的应用。传统方法如信念传播(Belief Propagation)无法保证收敛至最优解,且缺乏理论保障。本文提出图归一化(Graph Normalization, GN),其核心创新在于构建一个基于动力系统的可微分近似引擎:GN通过精确的MM(Majorization-Minimization)步骤实现快速拟牛顿下降,系统性地提升MWIS松弛后的原始目标函数;同时,GN与非势博弈下的复制者动态(Replicator Dynamics)等价,遵循费希尔自然选择基本定理,使平均适应度严格递增并等于MWIS原始目标值。这一理论框架不仅确保了GN收敛到二值化的最大独立集指示向量,还揭示了极大独立集(MIS)与倾斜单纯形上二次型局部极小点之间的双射关系,从而为约束优化提供了一种高效、可微且硬决策(hard decisions)的新范式。

链接: https://arxiv.org/abs/2605.05330
作者: Laurent Guigues
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:We introduce Graph Normalization (GN), a principled dynamical system on graphs that serves as a differentiable approximation engine for the NP-hard Maximum Weight Independent Set (MWIS) problem. MWIS encompasses many combinatorial challenges, including optimal assignment, scheduling, set packing, and MAP inference in discrete Markov Random Fields. Unlike Belief Propagation, we prove GN always converges to a binary indicator of a Maximum Independent Set. GN realizes a fast quasi-Newton descent through an exact Majorization-Minimization step, systematically improving the MWIS relaxed primal objective. We establish an equivalence between GN and the Replicator Dynamics of a nonlinear evolutionary game, where vertices compete for inclusion in an independent set. While a non-potential game, the GN game follows Fisher’s Fundamental Theorem of Natural Selection, where the average fitness equals the MWIS primal objective and strictly increases. This connection leads to a weighted extension of the Motzkin-Straus theorem, showing MISes are in bijection with the local minima of a quadratic form over a tilted simplex. For the Assignment Problem, GN acts as a variant of the Sinkhorn algorithm that naturally converges to a hard assignment while generalizing to arbitrary constraint graphs. We demonstrate GN’s performance as a fast binarization engine for the state-of-the-art Bregman-Sinkhorn relaxed MWIS solver. On real-world benchmarks with up to 1M edges, GN identifies solutions within 1% of the best known results in seconds on a CPU. GN opens new avenues for deep learning architectures requiring differentiable, “hard” decisions under constraints, with applications in structured sparse attention, dynamic network pruning, and Mixture-of-Experts. Beyond core AI, the GN framework enables end-to-end learning of constrained optimization in computer vision, computational biology, and resource allocation.
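摘要中"复制者动态 + Motzkin-Straus"的联系,可以用教科书式的(无权)复制者动态在补图上求最大独立集来直观演示。注意这只是该联系的经典示意:GN 本身是带权的、采用精确的 MM 步,具体更新式以论文为准;下面的图实例与迭代步数均为演示假设。

```python
def replicator_mis(adj, n, steps=200):
    """经典复制者动态(Motzkin-Straus 形式)在补图上寻找最大独立集:
    x_i <- x_i * (A_c x)_i / (x^T A_c x),其中 A_c 为补图邻接矩阵。"""
    # 补图邻接:原图中无边的顶点对在补图中相连
    comp = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (i, j) not in adj and (j, i) not in adj:
                comp[i][j] = 1
    x = [1.0 / n] * n                       # 从单纯形内部(均匀分布)出发
    for _ in range(steps):
        ax = [sum(comp[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(x[i] * ax[i] for i in range(n))
        if total < 1e-12:                   # 补图无边:任一单点即为独立集
            break
        x = [x[i] * ax[i] / total for i in range(n)]
    return {i for i in range(n) if x[i] > 1e-6}

# 路径图 0-1-2:最大独立集为 {0, 2}
print(replicator_mis({(0, 1), (1, 2)}, 3))
```

每步更新中"适应度" (A_c x)_i 低于平均的顶点份额被压缩,最终 x 收敛到独立集指示向量的归一化形式,这正是摘要中"顶点竞争进入独立集"的演化博弈图像。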

[AI-189] Understanding Annotator Safety Policy with Interpretability

【速读】:该论文旨在解决安全策略(safety policies)在AI输出标注过程中存在的标注不一致性问题,其根源包括操作性失误(annotators misunderstand or misexecute the task)、政策模糊性(policy ambiguity)以及价值多元性(value pluralism)。传统方法难以区分这些差异来源,而直接询问标注者理由成本高且不可靠。解决方案的关键在于提出可解释的标注者政策模型(Annotator Policy Models, APMs),通过仅分析标注行为来学习和建模每位标注者的内在安全决策逻辑,从而无须额外标注即可揭示其推理过程,并准确识别不同标注者对安全指令的理解差异及群体间的价值优先级系统性差异,为更精准、透明和包容的安全策略设计提供支持。

链接: https://arxiv.org/abs/2605.05329
作者: Alex Oesterling,Donghao Ren,Yannick Assogba,Dominik Moritz,Sunnie S.Y. Kim,Leon Gatys,Fred Hohman
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 38 pages, 13 figures, ACM FAccT 2026

点击查看摘要

Abstract:Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can stem from multiple sources such as operational failures (annotators misunderstand or misexecute the task), policy ambiguity (policy wording leaves room for interpretation), or value pluralism (different annotators hold different perspectives on safety). Distinguishing these sources matters. For example, operational failures call for quality control, ambiguity calls for policy clarification, and pluralism calls for deliberation about incorporating diverse perspectives. Yet understanding why annotators disagree is difficult. Directly asking annotators for their reasoning is costly, substantially increasing annotation burden, and can be unreliable for both human and LLM annotators as self-reported reasoning often fails to reflect actual decision processes. We introduce Annotator Policy Models (APMs), interpretable models that learn annotators’ internal safety policies from labeling behavior alone, making annotator reasoning visible and comparable without additional annotation effort. We validate that APMs accurately model annotator safety policy (80% accuracy), faithfully predict responses to counterfactual edits, and recover known policy differences in controlled settings. Applying APMs to LLM and human annotations, we demonstrate two core applications: (1) surfacing policy ambiguity by revealing how annotators interpret safety instructions differently, and (2) surfacing value pluralism by uncovering systematic differences in safety priorities across demographic groups. Together, these capabilities support more targeted, transparent, and inclusive safety policy design. 
Related DOI: https://doi.org/10.1145/3805689.3806472

[AI-190] Shattering the Echo Chamber: Hidden Safeguards in Manuscripts Against the AI Takeover of Peer Review

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在学术期刊和会议审稿过程中被滥用的问题,即“端到端审稿外包”(End-to-End Review Outsourcing):审稿人将评审完全交给商业聊天机器人,自动生成缺乏独立批判性思考与推理深度的同行评审意见,从而削弱审稿过程的科学性和公正性。现有防御方法因依赖同质化、流间注入的隐蔽指令而脆弱,易被清洗或中和。论文提出 IntraGuard,一种黑盒、与投稿平台无关的防御框架,其核心创新在于利用 PDF 文件结构-视觉解耦特性,在不改变视觉呈现的前提下,通过三种流内注入机制将异构防御文本对象嵌入 PDF 的底层结构中,实现对自动化审稿的干扰:既支持显式触发拒绝或警告信号,也支持隐式嵌入预定义标记以识别异常审稿。实验表明,IntraGuard 在 7 种商用聊天机器人和 12 个跨学科投稿场景中防御成功率最高可达 84%,且不影响人类审稿人的正常评审流程,同时具备轻量级、硬件无关的优势,仅需每篇稿件约 1 秒计算时间。

链接: https://arxiv.org/abs/2605.05271
作者: Oubo Ma,Ruixiao Lin,Jiahao Chen,Yuan Su,Yong Yang,Shouling Ji
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 22 pages, 14 figures, 11 tables

点击查看摘要

Abstract:As LLMs become increasingly capable, editorial boards and program committees are growing concerned about reviewers who fully outsource peer review to commercial chatbots. This concern stems from prior findings that current chatbots lack the independent critical thinking and depth of reasoning required to assess scientific novelty. One promising direction for mitigating this concern is to embed hidden instructions into manuscripts that disrupt or alter chatbot-generated reviews. However, existing methods remain intuitive and fragile, as they typically rely on homogeneous payloads injected in an inter-stream manner, rendering them susceptible to sanitization or neutralization. In this paper, we identify End-to-End Review Outsourcing as an emerging threat and propose IntraGuard, a black-box, venue-agnostic defense framework grounded in the structural–visual decoupling inherent to the PDF. Designed for committee-side deployment, IntraGuard supports both explicit strategies that trigger refusal or warning signals, and implicit strategies that embed predefined textual markers into the generated review. These strategies can be deployed via any of three intra-stream injection mechanisms, each of which seamlessly embeds heterogeneous defensive text objects within the PDF’s underlying structure without altering its visual presentation. Extensive evaluations across 7 real-world commercial chatbot settings and 12 venues spanning diverse disciplines show that IntraGuard achieves a defense success rate of up to 84%, while preserving peer-review invariance for human reviewers. IntraGuard is lightweight and hardware-independent, incurring an average overhead of only one second per manuscript on a commodity personal computer. We further evaluate 11 adaptive attacks spanning manuscript sanitization and instruction interference, and discuss the implications of constructing ensemble defenses. 

[AI-191] Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成任务中频繁产生缺陷输出的问题,尤其是这些缺陷如何从训练数据质量缺陷中传播而来。其关键解决方案在于构建了一个统一的分类体系,将生成代码的质量问题划分为九个维度,并将训练数据质量问题细分为代码属性与非代码属性;在此基础上,提出了一个因果框架,明确描述了18种典型的传播机制,并系统综述了贯穿数据、模型和生成生命周期的检测与缓解技术,推动质量保障从被动过滤转向以数据为中心的主动治理与闭环修复。

链接: https://arxiv.org/abs/2605.05267
作者: Kaifeng He,Xiaojun Zhang,Peiliang Cai,Mingwei Liu,Yanlin Wang,Chong Wang,Kaifeng Huang,Bihuan Chen,Xin Peng,Zibin Zheng
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore, we synthesize state-of-the-art detection and mitigation techniques across the data, model, and generation lifecycles. The reviewed literature reveals a clear methodological shift: quality assurance is transitioning from reactive, heuristic-based post-generation filtering toward proactive, data-centric governance and closed-loop repair. Finally, we identify open challenges and outline research directions for developing reliable LLMs for code through integrated data curation and continuous evaluation. Our repository is available at this https URL.

[AI-192] Automated Population-Level Audit Assurance via AI-Based Document Intelligence

【速读】:该论文旨在解决传统审计交易测试中依赖人工、基于样本的PDF文档审查方式效率低下且难以扩展至数百万笔交易的问题。其解决方案的关键在于构建一个基于AI的自动化框架,利用Snowflake Document AI从少量标注(约20份)的非结构化PDF文档中提取结构化数据,并与权威的数据源进行比对以大规模识别差异。该方法实现了全量测试而非抽样测试,显著提升了审计覆盖率,并支持持续保证(continuous assurance)目标的实现。

链接: https://arxiv.org/abs/2605.05252
作者: Santosh Vasudevan,Velu Natarajan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Audit transaction testing validates accuracy and completeness of customer-facing statements against internal systems of record. Traditional manual, sample-based review of unstructured PDF statements is labor-intensive and does not scale to millions of transactions. This paper presents an automated framework for large-scale audit transaction testing using AI-based document intelligence. The solution leverages Snowflake Document AI to extract structured data from unstructured PDF statements using a small labeled corpus (approximately 20 documents). Extracted data are reconciled against authoritative source-of-truth datasets to identify discrepancies at scale. Results are surfaced through interactive dashboards and automated reports. The framework enables population-level testing rather than sampling-based approaches, improving audit coverage and supporting continuous assurance objectives. Recent advances in document intelligence and analytics-driven audit frameworks enable scalable, near real-time risk identification and continuous assurance.
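上述“全量对账”流程可用一个极简的 Python 草图示意:将提取出的对账单记录按交易号与权威数据源逐笔比对,标记金额不一致或缺失的记录。注意其中的字段名(txn_id、amount)与容差均为示例假设,并非论文实现:

```python
# 假设性示意:对每一笔提取记录按交易号匹配权威数据源,
# 金额偏差超过容差或在数据源中缺失的记录均被标记为差异。
def reconcile(extracted, source_of_truth, tolerance=0.01):
    truth = {row["txn_id"]: row for row in source_of_truth}
    discrepancies = []
    for row in extracted:
        ref = truth.get(row["txn_id"])
        if ref is None:
            discrepancies.append({"txn_id": row["txn_id"], "issue": "missing_in_source"})
        elif abs(row["amount"] - ref["amount"]) > tolerance:
            discrepancies.append({"txn_id": row["txn_id"], "issue": "amount_mismatch",
                                  "extracted": row["amount"], "source": ref["amount"]})
    return discrepancies

extracted = [{"txn_id": "T1", "amount": 100.00},
             {"txn_id": "T2", "amount": 250.10},
             {"txn_id": "T3", "amount": 75.00}]
source = [{"txn_id": "T1", "amount": 100.00},
          {"txn_id": "T2", "amount": 250.00}]
flags = reconcile(extracted, source)
```

与抽样审计不同,这一比对对每笔交易都执行,差异列表可直接汇入论文所述的仪表盘与自动报告。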

[AI-193] Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect

【速读】:该论文旨在解决当前人工智能系统中程序运行时动态生成与执行所带来的安全与可控性问题,特别是在大语言模型(Large Language Models, LLMs)和自适应智能体(agents)等场景下,代码从符号结构(form)到可执行权限(executable authority)的转换缺乏约束,导致潜在的失控风险。其核心问题是:传统编程语言中的 eval 操作被视为无约束的原语,而这种自由转换在智能系统中实则是一种“权威放大”(authority amplification),必须像其他副作用一样受到治理。解决方案的关键在于提出“受管元编程”(governed metaprogramming)的设计范式,将程序表示(machine forms)作为一等值,形式操作为纯计算,而材料化(materialization)——即从形式到可执行代码的转换——被建模为一个受控效应(governed effect),由治理系统在执行前进行能力需求分析、策略合规性和资源估算。该设计通过两个形式化判断(纯形式评估与受管材料化)和三个性质证明(形式操作纯度、无绕过定理、边界保持),实现了对运行时程序生成过程的安全管控,并在 MashinTalk DSL 中实现原型,集成 454 条已验证的 Rocq 定理,最终将 eval 重新定义为一种受治理的效应而非语言原语。

链接: https://arxiv.org/abs/2605.05248
作者: Alan L. McCann
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注: 15 pages. Companion proofs: this https URL . Project: this https URL

点击查看摘要

Abstract:AI systems increasingly synthesize executable structure at runtime: LLMs generate programs, agents construct workflows, self-improving systems modify their own behavior. In classical homoiconic and staged languages, the transition from code representation to execution is unrestricted. eval is a language primitive, not a governed operation. We argue that in governed intelligent systems, this transition is an authority amplification: it converts symbolic structure into executable authority and must be mediated like any other effect. We present governed metaprogramming, a language design where program representations (machine forms) are first-class values, form manipulation is pure computation, and materialization (the transition from form to executable machine) is a governed effect subject to structural inspection. The governance system analyzes the proposed program's capability requirements, policy compliance, and resource estimates before permitting execution. We formalize two judgments: pure form evaluation (which emits no directives) and governed materialization (which emits exactly one governed directive). We prove three properties: purity of form manipulation, the no-bypass theorem, and boundary preservation. We implement the design in MashinTalk, a DSL for AI workflows compiling to BEAM bytecode, and report on integration with 454 existing machine-checked Rocq theorems. The central contribution is reclassifying eval from a language primitive into a governed effect.
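按摘要的描述,“形式操作为纯计算、材料化为受门控的效应”这一核心设计可用如下假设性 Python 草图体会(ALLOWED_CAPABILITIES、make_form 等名称均为示例假设,并非 MashinTalk 实现):

```python
# 示意:程序形式(form)是普通数据,对形式的构造与检查是纯计算;
# materialize 是唯一把形式转换为可执行对象的步骤,执行前先做
# 能力需求的结构化检查,对应论文中的“受管材料化”判断。
ALLOWED_CAPABILITIES = {"arith"}          # 假设的策略:只允许算术能力

def make_form(op, *args):                 # 纯计算:构造机器形式
    return {"op": op, "args": args}

def required_capabilities(form):          # 纯计算:结构化能力分析
    return {"arith"} if form["op"] in ("add", "mul") else {form["op"]}

def materialize(form):                    # 受管效应:检查通过后才允许执行
    needs = required_capabilities(form)
    if not needs <= ALLOWED_CAPABILITIES:
        raise PermissionError(f"policy denies capabilities: {needs - ALLOWED_CAPABILITIES}")
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    return lambda: ops[form["op"]](*form["args"])

prog = make_form("add", 2, 3)
run = materialize(prog)                   # 通过治理门
result = run()                            # 只有被批准后才真正执行
```

与直接调用 eval 不同,任何绕过 materialize 的执行路径在该设计中不存在,这正是论文“无绕过定理”的直觉。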

[AI-194] Topology-Driven Anti-Entanglement Control for Soft Robots

【速读】:该论文旨在解决复杂受限环境中多软体机器人协同作业时的防缠绕控制问题,尤其针对现有分布式训练框架在高密度障碍和不稳定环境下存在的可观测性挑战导致的学习效果不佳问题。解决方案的关键在于提出一种拓扑驱动的多智能体强化学习(Topology-driven Multi-Agent Reinforcement Learning, TD-MARL)框架:通过集中式学习使各智能体共享拓扑状态以感知彼此策略,缓解因复杂交互引发的训练不稳定性;借助分布式执行避免机器人间通信资源需求,提升系统可靠性;并引入拓扑安全层利用拓扑不变量精确评估与降低缠绕风险,防止策略陷入局部最优。

链接: https://arxiv.org/abs/2605.05236
作者: Haoyang Le,Shengxuan Wang,Mohan Chen,Shuo Feng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures

点击查看摘要

Abstract:In the field of precision manufacturing in complex constrained environments, the role of soft robots is increasingly prominent, and anti-entanglement control based on multi-agent reinforcement learning has become a research hotspot. One of the core problems at present is to coordinate multiple robots to complete the unwinding operation in a highly constrained environment. Existing distributed training frameworks face observability challenges in high-density obstacle and unstable environments, resulting in poor learning results. This paper proposes a Topology-Driven Multi-Agent Reinforcement Learning (TD-MARL) framework to coordinate multi-robot systems to avoid entanglement. Specifically, the critic network adopts centralized learning, so that each agent can perceive the strategies of other agents by sharing the topological state, thus alleviating the training instability caused by complex interactions; distributed execution eliminates the demand for communication resources between robots, upgrading system reliability; and the integrated topological safety layer uses topological invariants to accurately assess and mitigate the risk of entanglement, preventing the strategy from falling into local optima. Finally, full simulation experiments carried out in a realistic simulation environment show that the method outperforms current advanced deep reinforcement learning (DRL) methods in terms of convergence and anti-entanglement effect.
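论文的拓扑安全层依赖拓扑不变量来量化缠绕风险;下面给出一个非常粗糙的假设性代理示意——统计两条机器人身体折线在二维投影下的线段交叉数,交叉越多风险越高(这只是示意用的简化信号,并非论文所用的不变量):

```python
# 粗糙的缠绕风险代理:统计两条折线投影的线段交叉数。
def _ccw(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, q1, q2):
    # 严格相交判定(忽略端点相触的退化情形)
    d1, d2 = _ccw(q1, q2, p1), _ccw(q1, q2, p2)
    d3, d4 = _ccw(p1, p2, q1), _ccw(p1, p2, q2)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def crossing_count(curve_a, curve_b):
    count = 0
    for i in range(len(curve_a) - 1):
        for j in range(len(curve_b) - 1):
            if segments_cross(curve_a[i], curve_a[i + 1], curve_b[j], curve_b[j + 1]):
                count += 1
    return count

straight = [(0, 0), (2, 0)]
zigzag = [(1, -1), (1, 1)]        # 与 straight 交叉一次
parallel = [(0, 1), (2, 1)]       # 从不交叉
risk_high = crossing_count(straight, zigzag)
risk_low = crossing_count(straight, parallel)
```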

[AI-195] Evolutionary fine tuning of quantized convolution-based deep learning models

【速读】:该论文旨在解决深度学习模型在物联网(IoT)、移动设备及实时系统中部署时因模型复杂度高和内存占用大而导致的效率问题,核心聚焦于提升量化(quantization)技术的精度与有效性。传统方法多采用最近邻量化(nearest neighbour quantization),但该方法无法保证最优的最终准确率。解决方案的关键在于引入进化策略(evolution strategy)作为优化手段:在每轮迭代中仅微调少量权重值,使其从当前量化状态迁移至其他可能的量化级别,从而实现对量化后模型精度的快速提升。实验表明,该方法在VGG、ResNet等主流图像分类与检测架构以及自编码器(autoencoder)结构上均能显著改善量化模型的性能。

链接: https://arxiv.org/abs/2605.05228
作者: Marcin Pietroń
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Deep learning models are the most efficient models in many machine learning tasks. The main disadvantage when using them in IoT, mobile devices, independent autonomous or real-time systems is their complexity and memory size. Therefore, much research has concentrated on compression techniques for deep learning architectures. One of the most popular techniques is quantization. In most works, quantization is done based on the nearest neighbour quantization technique. This work focuses on improving the quantization efficiency in pretrained and quantized models. This approach has the potential to improve the final accuracy of quantized models. The main postulate of the work is that final quantization states of the network based on nearest neighbour rounding do not guarantee optimal accuracy. In the presented work, an evolution strategy is used as the optimization approach. In each iteration, the evolution changes the values of a small percentage of weights, shifting their values to different quantization states. The work shows that the proposed evolution with an appropriate set of operators and parameters can quickly improve the accuracy of quantized models. Results are presented for popular architectures such as VGG and ResNet for image classification and detection. Additionally, simulations were carried out for the autoencoder architecture.
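摘要的核心主张——逐坐标最近邻舍入得到的量化状态未必全局最优——可用下面的 (1+1) 进化策略玩具示例体会:每轮以小概率把若干权重移动到相邻量化级,仅保留不变差的个体。量化网格与适应度函数均为示例假设(适应度刻意选成不可分的全局目标,此时最近邻舍入确实不是最优):

```python
import random

LEVELS = [-1.0, -0.5, 0.0, 0.5, 1.0]     # 假设的量化网格

def nearest(w):
    return min(LEVELS, key=lambda q: abs(q - w))

def evolve(target, iters=200, rate=0.1, seed=0):
    rng = random.Random(seed)
    def loss(ws):
        # 玩具适应度:匹配原始权重之和(不可分的全局目标),
        # 逐坐标最近邻舍入对它并非最优
        return (sum(ws) - sum(target)) ** 2
    best = [nearest(w) for w in target]   # 最近邻量化基线
    for _ in range(iters):
        cand = list(best)
        for i in range(len(cand)):        # 小比例变异:移到相邻量化级
            if rng.random() < rate:
                idx = LEVELS.index(cand[i]) + rng.choice([-1, 1])
                cand[i] = LEVELS[max(0, min(len(LEVELS) - 1, idx))]
        if loss(cand) <= loss(best):      # 仅保留不变差的个体
            best = cand
    return best, loss(best)

target = [0.3, -0.7, 0.1, 0.9]
baseline = [nearest(w) for w in target]
baseline_loss = (sum(baseline) - sum(target)) ** 2
quantized, final_loss = evolve(target)
```

在真实设定中,适应度会换成量化网络在验证集上的精度,但“从最近邻解出发、在量化状态间做局部进化搜索”的结构相同。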

[AI-196] Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods ICLR2026

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)训练中数据整理(data curation)的局限性问题,即现有方法如数据选择与混合通常采用离线(offline)范式,导致工程开销高、鲁棒性差,并且通过硬过滤或重采样改变数据规模,常牺牲数据多样性并损害泛化能力。其解决方案的关键在于将数据整理重新建模为一个在线重加权(online reweighting)问题,提出ADAPT(Adaptive Data reweighting for Pretraining and FineTuning)框架,通过损失加权动态调整样本重要性,而非静态预处理;该框架利用基于相似性的质量信号引导自适应的每样本学习率,保持训练样本数量不变,从而实现隐式的课程学习机制,在模型演进过程中逐步从粗粒度模式转向细粒度语义区分,显著提升跨基准测试的泛化性能。

链接: https://arxiv.org/abs/2605.05227
作者: Wanru Zhao,Yihong Chen,Yuzhi Tang,Wentao Ma,Shengchao Hu,Shell Xu Hu,Alex Iacob,Abhinav Mehrotra,Nicholas D. Lane
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:Data curation is a critical yet under-explored area in large language model (LLM) training. Existing methods, such as data selection and mixing, operate in an offline paradigm, detaching themselves from training. This separation introduces engineering overhead and makes the curation brittle: the entire pipeline must be re-run under model/task shifts. Moreover, offline methods alter data size through hard filtering or resampling, often sacrificing data diversity and harming generalization. We propose to rethink data curation as an online reweighting problem, where sample importance is dynamically adjusted during training via loss weighting rather than static pre-processing. Specifically, we introduce ADAPT (Adaptive Data reweighting for Pretraining and FineTuning), a dynamic online framework that reweights training samples with adaptive per-sample learning rates guided by similarity-based quality signals, without changing the number of training samples. Unlike offline methods that enforce a static data distribution, ADAPT acts as an implicit curriculum learner, progressively shifting focus from coarse-grained patterns to fine-grained semantic distinctions as the model evolves. Experiments on both instruction tuning and large-scale pretraining show that ADAPT consistently outperforms offline selection/mixing and prior online methods, achieving stronger cross-benchmark generalization under equal FLOPs.
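ADAPT“用质量信号在线调整每个样本的损失权重、而不改变样本数量”的思想可用如下草图示意(softmax 温度与质量分数的具体形式均为示例假设,并非作者实现):

```python
import math

# 示意:按相似度质量信号给批内样本分配权重,
# 等价于给每个样本一个自适应的有效学习率。
def softmax(xs, temperature=1.0):
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_batch_loss(losses, quality_scores, temperature=1.0):
    # 质量越高权重越大;权重在批内归一化为 1
    weights = softmax(quality_scores, temperature)
    return sum(w * l for w, l in zip(weights, losses)), weights

losses = [2.0, 1.0, 4.0]
quality = [0.9, 0.2, 0.5]        # 假设的基于相似度的质量信号
loss, weights = weighted_batch_loss(losses, quality)
```

与离线过滤不同,低质量样本的权重只是被压低而非置零,因此数据多样性得以保留;温度可随训练进程调节,即摘要所说的隐式课程学习。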

[AI-197] MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

【速读】:该论文旨在解决混合专家多模态大语言模型(Mixture-of-Experts Multimodal Large Language Models, MoE MLLMs)在专家并行(Expert Parallelism, EP)推理过程中因“慢工效应”(straggler effect)导致的显著效率瓶颈问题。尤其在多模态场景下,现有基于token数量的负载均衡方法无法应对两个独特挑战:一是信息异质性(Information Heterogeneity),即冗余视觉token与语义关键token被同等对待;二是模态动态性(Modality Dynamics),即不同任务中视觉与文本比例变化导致资源分配失衡。解决方案的关键在于提出无需训练的推理框架MACS(Modality-Aware Capacity Scaling),其核心创新包括:(1)熵加权负载机制(Entropy-Weighted Load),通过量化视觉token的语义价值缓解信息异质性;(2)动态模态自适应容量机制(Dynamic Modality-Adaptive Capacity),依据输入实时模态组成动态分配专家资源,从而实现高效、鲁棒的EP推理部署。

链接: https://arxiv.org/abs/2605.05225
作者: Bo Li,Chuan Wu,Shaolin Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. This issue is worsened in the multimodal context, as existing token-count-based load balancing methods fail to address two unique challenges: (1) Information Heterogeneity, where numerous redundant visual tokens are treated equally to semantically critical ones, and (2) Modality Dynamics, where varying visual to text ratios across tasks lead to resource misallocation. To address these challenges, we propose MACS (Modality-Aware Capacity Scaling), a training-free inference framework. Specifically, MACS introduces an Entropy-Weighted Load mechanism to quantify the semantic value of visual tokens, addressing information heterogeneity. Additionally, the Dynamic Modality-Adaptive Capacity mechanism allocates expert resources based on the real-time modal composition of the input. Extensive experiments demonstrate that MACS significantly outperforms existing methods on various multimodal benchmarks, providing a novel and robust solution for the efficient deployment of MoE MLLMs in EP inference.
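MACS 的两个机制可用如下假设性草图示意:用路由分布的熵近似视觉 token 的语义价值(熵加权负载),并按批次的实时模态构成划分专家容量(具体公式为示例假设,并非论文实现):

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weighted_load(router_probs):
    # 路由分布越尖锐(低熵)的 token 越可能是冗余视觉 token,计入的负载越小
    weights = [entropy(p) for p in router_probs]
    total = sum(weights)
    return [w / total for w in weights] if total else weights

def modality_capacity(n_visual, n_text, total_capacity):
    # 按实时模态比例而非固定比例划分专家容量
    frac = n_visual / (n_visual + n_text)
    visual_cap = round(total_capacity * frac)
    return visual_cap, total_capacity - visual_cap

peaked = [0.97, 0.01, 0.01, 0.01]    # 路由确定:疑似冗余视觉 token
diffuse = [0.25, 0.25, 0.25, 0.25]   # 路由分散:语义信息量更大的 token
loads = entropy_weighted_load([peaked, diffuse])
vis_cap, txt_cap = modality_capacity(n_visual=300, n_text=100, total_capacity=64)
```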

[AI-198] Channel-Level Semantic Perturbations: Unlearnable Examples for Diverse Training Paradigms

【速读】:该论文旨在解决生成式 AI (Generative AI) 模型训练中因未经授权使用个人数据而引发的隐私威胁问题,特别是针对现有不可学习样本(Unlearnable Examples, UEs)方法在预训练-微调(Pretraining-Finetuning, PF)范式下有效性显著下降的问题。解决方案的关键在于提出一种分层欺骗策略——浅层语义伪装(Shallow Semantic Camouflage, SSC),其核心思想是将扰动生成过程限制在语义有效的子空间内,从而规避预训练权重冻结所引入的语义抑制效应,确保UEs在复杂训练场景(如浅层冻结和语义聚焦预训练)下仍能保持稳定的不可学习性。

链接: https://arxiv.org/abs/2605.05224
作者: Bo Wang,Jia Ni,Mengnan Zhao,Zhan Qin,Kui Ren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The unauthorized use of personal data in model training has emerged as a growing privacy threat. Unlearnable examples (UEs) address this issue by embedding imperceptible perturbations into benign examples to obstruct feature learning. However, existing studies mainly evaluate UEs under from-scratch training settings, leaving their behavior under the widely adopted pretraining-finetuning (PF) paradigm largely unexplored. In this work, we provide the first systematic investigation of unlearnable examples across diverse training paradigms. Our analysis reveals that loading and freezing pretrained weights significantly weakens the effectiveness of existing UEs methods. We further explain these findings through semantic filtering: while UEs tend to induce models to overfit non-semantic noise, thereby weakening their semantic extraction capabilities, under the PF paradigm, frozen shallow layers preserve data semantics, effectively filtering out distracting information like unlearnable noise. Guided by these insights, we propose a hierarchical deception strategy, Shallow Semantic Camouflage (SSC), that confines the generation process to a semantically valid subspace, aiming to bypass the semantic suppression introduced by pretrained weights. Extensive experiments demonstrate that our method consistently preserves data unlearnability even under challenging training paradigms, such as shallow-layer freezing and semantic-focused pretraining (SF-Pretrain), bridging the critical gap in pretrain-based unlearnable learning.

[AI-199] Structural Instability of Feature Composition

【速读】:该论文旨在解决生成式 AI(Generative AI)中稀疏自编码器(Sparse Autoencoders, SAEs)在执行组合控制(compositional steering)时的不稳定性问题,即如何在激活多个语义潜变量(semantic latents)时保持稳定性和可预测性。其核心挑战在于现有线性表示假设(Linear Representation Hypothesis)忽略了过完备字典中非线性干扰效应,导致实际组合操作中出现不可控的干扰增长。解决方案的关键在于构建一个几何框架,将激活空间建模为高维稀疏锥流形(sparse cone manifold),并基于球面字典模型推导出组合坍缩阈值(compositional-collapse threshold),该阈值由信号锥的高斯均宽(Gaussian mean width,即统计维度)决定;同时揭示了在高偏置(high-bias) regime 下,ReLU激活函数会将微观相关性引起的方差波动转化为系统性漂移,形成类似棘轮效应(ratchet effect)的累积干扰增长机制。这一理论框架不仅解释了结构化语义特征(如CLEVR数据集中的层次相关性)加速组合失效的现象,也为设计更鲁棒的组合机制提供了几何约束和干预方向。

链接: https://arxiv.org/abs/2605.05223
作者: Yunpeng Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have emerged as a powerful paradigm for disentangling feature superposition in transformer-based architectures, enabling precise control via activation steering. However, the theoretical foundations of compositional steering – the simultaneous activation of distinct semantic latents – remain under-explored. The prevailing Linear Representation Hypothesis often abstracts away non-linear interference effects that arise in overcomplete dictionaries. We present a geometric framework for analyzing the instability of feature unions. Modeling the activation space as a high-dimensional sparse cone manifold, we derive an asymptotic compositional-collapse threshold under a spherical dictionary model, characterized by the Gaussian mean width (statistical dimension) of the signal cone. We further show that, in the high-bias regime, ReLU rectification converts microscopic correlation-induced variance fluctuations into a systematic drift that accumulates under composition, yielding interference growth consistent with a ratchet effect. We validate the predicted scaling trends on structured semantic features extracted from CLEVR, where hierarchical correlations accelerate the transition relative to random baselines. Together, our results highlight geometric constraints on the scalability of union-based steering and motivate composition mechanisms that explicitly manage interference beyond naive linear superposition.

[AI-200] Adaptive Computation Depth via Learned Token Routing in Transformers

【速读】:该论文旨在解决标准Transformer架构中对所有输入token均采用相同层数处理的问题,即忽略了不同token在上下文中的难易差异,导致计算资源浪费。其解决方案的关键在于提出一种可学习的逐token门控机制——Token-Selective Attention (TSA),该机制通过一个轻量级两层多层感知机(MLP)生成连续的停止概率(halting probability),从而动态决定是否跳过当前token在相邻Transformer块间的残差更新。此方法无需显式深度约束即可实现难度成比例的路由策略,且保持端到端可微分,仅引入1.7%参数开销,可在不显著损失模型性能的前提下减少14–23%的token层操作(TLOps)。

链接: https://arxiv.org/abs/2605.05222
作者: Ahmed Abdelmuniem Abdalla Mohammed
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 9 figures, 4 tables, this https URL

点击查看摘要

Abstract:Standard transformer architectures apply the same number of layers to every token regardless of contextual difficulty. We present Token-Selective Attention (TSA), a learned per-token gate on residual updates between consecutive transformer blocks. Each gate is a lightweight two-layer multi-layer perceptron (MLP) that produces a continuous halting probability, making the mechanism end-to-end differentiable with 1.7% parameter overhead and no changes to the base architecture. Notably, TSA learns difficulty-proportional routing without any explicit depth pressure: even at \lambda=0 (no depth regularisation), the task-loss gradient alone drives the router to skip 20% of token-layer operations. On character-level language modeling, TSA saved 14-23% of token-layer operations (TLOps) across Tiny-Shakespeare and enwik8 at 0.5% quality loss. At matched efficiency, TSA achieved 0.7% lower validation loss than early exit, and the learned routing transfers directly to inference-time sparse execution for real wall-clock speedup.
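TSA 门控的形状可用一个纯 Python 草图近似(权重与维度均为玩具假设):两层 MLP 从 token 的隐状态产生停止概率 g,相邻块间的残差更新按 (1-g) 缩放,g 接近 1 时该 token 近似跳过该块,整个机制保持端到端可微:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(h, w1, w2):
    # 两层 MLP:隐藏层 ReLU,输出层为标量 sigmoid,即停止概率 g
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, h))) for row in w1]
    return sigmoid(sum(w * x for w, x in zip(w2, hidden)))

def gated_residual(h, block_update, w1, w2):
    g = gate(h, w1, w2)
    # 残差更新按 (1 - g) 缩放:g -> 1 时该 token 近似跳过此块
    return [x + (1.0 - g) * u for x, u in zip(h, block_update)], g

h = [0.5, -1.0, 2.0]                       # 某个 token 的隐状态(玩具维度)
update = [0.1, 0.2, -0.1]                  # 下一个块给出的残差更新
w1 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]    # 玩具门控权重
w2 = [2.0, 2.0]
new_h, g = gated_residual(h, update, w1, w2)
```

推理时可再把连续的 g 阈值化为硬跳过,得到摘要所说的稀疏执行加速。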

[AI-201] MidSteer: Optimal Affine Framework for Steering Generative Models

【速读】:该论文旨在解决生成式模型中概念操控(concept steering)缺乏统一理论框架的问题,尤其是在后部署对齐(post-deployment alignment)和安全性场景下如何有效控制模型行为。其关键解决方案是提出一个基于仿射变换的系统性理论框架,包括LEACE(Affine Concept Erasure)用于概念擦除、LEACE-Switch用于概念切换,并进一步引入MidSteer(Minimal Disturbance concept Steering),通过放宽假设条件实现更通用、最小扰动的概念操纵。这一框架在视觉扩散模型和大语言模型等多个模态与架构上均表现出优越性能。

链接: https://arxiv.org/abs/2605.05220
作者: Tatiana Gaintseva,Andrew Stepanov,Ziquan Liu,Martin Benning,Gregory Slabaugh,Jiankang Deng,Ismail Elezi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Steering intermediate representations has emerged as a powerful strategy for controlling generative models, particularly in post-deployment alignment and safety settings. However, despite its empirical success, it currently lacks a comprehensive theoretical framework. In this paper, we bridge this gap by formalizing the theory of concept steering. First, we establish a link between steering and affine concept erasure, proving that the standard approach for removing unwanted behaviors is a special case of LEACE (a closed-form method for affine erasure). Next, we formulate a principled theoretical framework for concept switching, LEACE-Switch, and characterize the assumptions under which it provides an optimal affine solution. Building on this analysis, we then introduce MidSteer (Minimal Disturbance concept Steering), a more general affine framework for concept manipulation that relaxes these assumptions and enables directed, minimal-disturbance transformations. We demonstrate that MidSteer performs favorably across a range of tasks, modalities, and architectures, including vision diffusion models and large language models.
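仿射概念擦除的直觉可用一个简化草图体会:把概念方向估计为两类激活均值之差,再把激活投影到该方向的正交补空间,使沿该方向的线性读出归零。注意真正的 LEACE 使用基于白化的闭式解,此处仅为示意:

```python
# 简化示意(非 LEACE 本身):沿类均值差方向做正交投影擦除。
def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def erase_direction(class_a, class_b):
    mu_a, mu_b = mean(class_a), mean(class_b)
    d = [a - b for a, b in zip(mu_a, mu_b)]      # 概念方向的粗略估计
    norm2 = sum(x * x for x in d)
    def erase(v):
        # 减去 v 在 d 上的投影,擦除后沿 d 的线性读出为 0
        coef = sum(vi * di for vi, di in zip(v, d)) / norm2
        return [vi - coef * di for vi, di in zip(v, d)]
    return erase, d

class_a = [[1.0, 0.0], [1.2, 0.1]]       # 含概念的激活样本
class_b = [[-1.0, 0.0], [-1.2, -0.1]]    # 不含概念的激活样本
erase, d = erase_direction(class_a, class_b)
v = erase([2.0, 0.5])
residual = sum(vi * di for vi, di in zip(v, d))
```

概念切换(LEACE-Switch)与 MidSteer 可理解为把这类投影推广为带平移的仿射映射,并在最小扰动约束下求解。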

[AI-202] Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

【速读】:该论文旨在解决自回归大语言模型(Large Language Model, LLM)服务中的延迟优化问题,具体针对现有前缀缓存(prefix caching)机制在状态空间模型(State-space Model, SSM)架构下效率不足的局限性。传统方法假设每个token都需复用键值对(key/value cache),而SSM具有递归结构特性——可从单一存储状态恢复而非依赖完整token历史,这为缓存设计提供了新的优化维度。其核心解决方案是提出稀疏前缀缓存(sparse prefix caching):在稀疏的检查点位置存储精确的递归状态,在缓存命中时从最深的已存检查点恢复并精确重算后续suffix。通过将该问题形式化为基于重叠深度分布的检查点放置优化,并采用O(NM)动态规划求解,实验证明该方法在共享非平凡前缀的请求场景中显著优于固定预算基线和主流启发式策略(如块缓存),尤其在低检查点预算且重叠分布不均匀时收益最大,同时保持输出精确性、兼容现有递归计算逻辑及混合模型中的KV缓存压缩技术。

链接: https://arxiv.org/abs/2605.05219
作者: Mikhail Shirokikh,Sergey Nikolenko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recurrent layer can resume from a single stored state rather than requiring the entire token history. This asymmetry opens a new design point between no reuse and dense caching: store exact recurrent states at a sparse set of checkpoint positions and, on a cache hit, resume from the deepest stored checkpoint and recompute the remaining suffix exactly. We formalize sparse prefix caching as checkpoint placement under a distribution over overlap depths, yielding an exact O(NM) dynamic program. For use cases where requests share a non-trivial prefix (e.g. asking different questions about a single long document), we show that our method consistently improves the Pareto frontier traced by standard heuristics on real-world data. Across QuALITY and System Prompts, distribution-aware placement dominates every fixed-budget baseline on the measured layer-group Pareto frontier and matches or outperforms the strongest heuristic (block caching) while typically using substantially fewer checkpoints, with the largest gains at low checkpoint budgets where the overlap distribution is most non-uniform. The method is most relevant when many requests share a substantial but not identical prefix within a retained cache entry. It preserves exact outputs, does not change the recurrent computation itself or require new recurrent update kernels, applies to recurrent/SSM layers whose hidden state can be extracted and restored exactly, and for hybrid models can be combined with existing KV-cache compression techniques.
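检查点放置问题可写成一个直接的动态规划草图(此处给出易读的 O(N²M) 形式,论文给出的是 O(NM) 版本):命中深度为 d 的请求从不超过 d 的最深检查点恢复(位置 0 始终可用),目标是在 M 个检查点的预算下最小化期望重算 token 数:

```python
from functools import lru_cache

def optimal_checkpoints(p, M):
    """p[d-1] 为命中深度 d 的概率(d = 1..N),M 为检查点预算。"""
    N = len(p)

    def seg(a, b):
        # 深度落在 [a+1, b] 的请求都从检查点 a 恢复,期望重算量
        return sum(p[d - 1] * (d - a) for d in range(a + 1, b + 1))

    @lru_cache(maxsize=None)
    def solve(last, m):
        # 覆盖深度 > last 的最小期望重算(last 处已有检查点,剩 m 个可放)
        if m == 0 or last == N:
            return (seg(last, N), ())
        best = (seg(last, N), ())            # 选项:不再放置检查点
        for c in range(last + 1, N + 1):
            rest_cost, rest_pos = solve(c, m - 1)
            cand = (seg(last, c - 1) + rest_cost, (c,) + rest_pos)
            if cand[0] < best[0]:
                best = cand
        return best

    cost, positions = solve(0, M)
    return positions, cost

# 深度 1..4 均匀分布、预算 1 个检查点:最优位置在中部
positions, cost = optimal_checkpoints([0.25, 0.25, 0.25, 0.25], 1)
```

当重叠分布不均匀时,检查点会向高概率深度聚集,这正是摘要中“低预算、非均匀分布收益最大”的来源。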

[AI-203] Horizon-Constrained Rashomon Sets for Chaotic Forecasting

【速读】:该论文旨在解决机器学习中预测多重性(predictive multiplicity)与混沌动力学(chaotic dynamics)之间的理论断层问题,二者虽概念相关但长期独立发展。其核心挑战在于:在混沌系统中,初始相近的模型会随预测时域呈指数级发散,导致传统静态的Rashomon集(Rashomon set)无法刻画动态演化下的预测等价性。解决方案的关键在于提出“ horizon-constrained Rashomon sets”(时域约束Rashomon集)这一新理论框架,证明有效Rashomon集随预测时长以最大李雅普诺夫指数(maximum Lyapunov exponent)为速率指数收缩,并引入李雅普诺夫加权度量(Lyapunov-weighted metrics)以更紧地约束预测分歧。进一步基于此构建决策对齐选择算法(decision-aligned selection algorithms),依据下游任务效用而非单纯预测精度筛选近优模型,在合成混沌系统(Lorenz-96、Kuramoto-Sivashinsky)及真实场景(风电、交通、天气)中验证了决策质量提升18–34%,同时保持竞争性预测性能,首次建立了混沌理论与预测多重性间的严格联系。

链接: https://arxiv.org/abs/2605.05218
作者: Gauri Kale,Rahul Vishwakarma,Holly Diamond,Ava Hedayatipour,Amin Rezaei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)
备注:

点击查看摘要

Abstract:Predictive multiplicity and chaotic dynamics represent two fundamental challenges in machine learning that have evolved independently despite their conceptual connections. We bridge this gap by introducing horizon-constrained Rashomon sets, a theoretical framework that characterizes how model multiplicity evolves with prediction horizon in chaotic systems. Unlike static prediction tasks where the Rashomon set remains fixed, chaos induces exponential divergence among initially similar models, fundamentally transforming the nature of predictive equivalence. We prove that the effective Rashomon set contracts exponentially with lead time at a rate determined by the maximum Lyapunov exponent and introduce Lyapunov-weighted metrics that provide tighter bounds on predictive disagreement. Leveraging these insights, we develop decision-aligned selection algorithms that choose among near-optimal models based on downstream utility rather than forecast accuracy alone. Extensive experiments on synthetic chaotic systems (Lorenz-96, Kuramoto-Sivashinsky) and real-world applications (wind power, traffic, weather) demonstrate that our framework improves decision quality by 18-34% while maintaining competitive predictive performance. This work establishes the first rigorous connection between chaos theory and predictive multiplicity, providing principled guidance for deploying machine learning in safety-critical chaotic domains.
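摘要中“有效 Rashomon 集随预测时长以最大李雅普诺夫指数为速率指数收缩”的关系可粗略示意如下:若两个近优模型在初始时刻相差 eps_0,轨迹以速率 lambda_max 发散,则 t 时刻的预测差约为 eps_0·exp(lambda_max·t),故容差为 eps 的等价集按 exp(-lambda_max·t) 收缩。符号与数值均为说明性假设:

```python
import math

def effective_tolerance(eps, lyapunov, horizon):
    # t 时刻仍能把预测差控制在 eps 内的初始分歧上限
    return eps * math.exp(-lyapunov * horizon)

tol_now = effective_tolerance(0.1, lyapunov=0.9, horizon=0.0)
tol_later = effective_tolerance(0.1, lyapunov=0.9, horizon=5.0)
```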

[AI-204] Physics-Informed Neural Networks with Learnable Loss Balancing and Transfer Learning

【速读】:该论文旨在解决科学机器学习(Scientific Machine Learning)在数据稀缺场景下的建模难题,特别是传统物理信息神经网络(Physics-Informed Neural Networks, PINNs)因固定或启发式权重分配导致训练不稳定、泛化能力差的问题。其解决方案的关键在于提出一种自监督的PINN框架,引入可学习的融合神经元(blending neuron),根据物理残差项与数据损失项的不确定性动态调整二者贡献比例,从而实现无需人工调参的稳定训练与更优泛化性能;同时结合迁移学习策略,复用相关领域表征以适应新物理系统,显著提升小样本条件下的预测精度,在仅使用87个CFD数据点的情况下即实现了优于浅层神经网络、核方法及纯物理基线模型的热传导预测结果(误差降低8%)。

链接: https://arxiv.org/abs/2605.05217
作者: Reza Pirayeshshirazinezhad
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a self-supervised physics-informed neural network (PINN) framework that adaptively balances physics-based and data-driven supervision for scientific machine learning under data scarcity. Unlike prior PINNs that rely on fixed or heuristic weighting of physics residuals and data loss, our approach introduces a learnable blending neuron that dynamically adjusts the relative contribution of each term based on their uncertainties. This mechanism enables stable training and improved generalization without manual tuning. To further enhance efficiency, we integrate a transfer learning strategy that reuses representations from related domains and adapts them to new physical systems with limited data. We validate the framework for the prediction of heat transfer in liquid-metal miniature heat sinks using only 87 CFD datapoints, where the adaptive PINN achieves an error < 8%, outperforming shallow neural networks, kernel methods, and physics-only baselines. Our framework provides a general recipe for embedding physics adaptively into neural networks, offering a robust and reproducible approach for data-scarce problems across various scientific domains, including fluid dynamics and material modeling.
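可学习融合神经元的思路可用一个极简草图体会(论文未给出具体参数化,此处假设用单个标量经 sigmoid 产生融合系数,并用数值梯度演示更新;注意这个玩具版本会简单地坍缩到较小的损失项,论文中基于不确定性的加权正是为了避免这种退化):

```python
import math

def blended_loss(alpha_raw, physics_loss, data_loss):
    # 融合系数 a 在 (0, 1) 内,由可学习标量 alpha_raw 经 sigmoid 得到
    a = 1.0 / (1.0 + math.exp(-alpha_raw))
    return a * physics_loss + (1.0 - a) * data_loss

def update_blend(alpha_raw, physics_loss, data_loss, lr=0.5, eps=1e-6):
    # 对融合参数做一步数值梯度下降,与网络其他参数同样方式训练
    g = (blended_loss(alpha_raw + eps, physics_loss, data_loss)
         - blended_loss(alpha_raw - eps, physics_loss, data_loss)) / (2 * eps)
    return alpha_raw - lr * g

alpha = 0.0                            # 从 50/50 融合出发
for _ in range(50):
    alpha = update_blend(alpha, physics_loss=2.0, data_loss=1.0)
final_blend = 1.0 / (1.0 + math.exp(-alpha))   # 物理项的最终权重
```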

[AI-205] Are Flat Minima an Illusion?

【速读】:该论文试图解决的问题是:当前主流的神经网络泛化能力解释框架——即通过优化目标函数的几何特性(如损失曲面的平坦性)来理解模型泛化性能——是否真正揭示了泛化背后的因果机制。作者指出,传统方法依赖的“尖锐度”(sharpness)指标易受参数化方式影响,且在不同数据规模下表现出显著的预测失效现象,说明其可能只是混杂因子而非根本原因。解决方案的关键在于提出一个全新的、参数化不变的度量——“弱度”(weakness),它衡量的是在学习者所使用的语言空间中与已学函数相容的完整函数集合的体积,从而直接反映模型的简洁性本质。论文证明弱度在交换需求下是最优的,并表明 PAC-Bayes 界之所以有效,是因为它们与弱度高度相关;实验进一步验证,在 MNIST 和 Fashion-MNIST 上,弱度能稳定预测泛化性能,而尖锐度则无显著关联甚至负相关,彻底否定了“平坦最小值”作为泛化主因的传统观点。

链接: https://arxiv.org/abs/2605.05209
作者: Michael Timothy Bennett
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural networks that land in flat regions of the loss landscape tend to generalise better than those in sharp regions. Sharpness-Aware Minimisation exploits this to improve generalisation. But function-preserving reparameterisation can inflate the Hessian of any minimum by two orders of magnitude without changing a single prediction. If the geometry of weight space can be manufactured from nothing, it cannot be the cause of anything. In other words, flat is simple and simplicity depends on encoding. Here I show that the actual driver is weakness, the volume of completions compatible with the learned function in the learner’s embodied language. Weakness is reparameterisation-invariant because it is defined over what the network does, not how it is parameterised. I prove weakness is minimax-optimal under exchangeable demands, and that PAC-Bayes bounds work because they correlate with it. On MNIST, the large-batch generalisation advantage vanishes as training data grows, from +1.6% at n = 2,000 to +0.02% at n = 60,000 . A quantity whose predictive power depends on how much data you have is not a cause but a confounder. I run head-to-heads on 100 networks with identical architecture and training. For MNIST weakness predicts generalisation ( \rho = +0.374 , p = 0.00012 ), sharpness anticorrelates ( \rho = -0.226 ) and simplicity predicts nothing ( p = 0.848 ). For Fashion-MNIST ( \rho = +0.384 , p = 8.15 \times 10^-5 ), though simplicity is at least somewhat predictive there. Simplicity is dataset dependent, whereas weakness is invariant. Flat minima were never the answer.

[AI-206] A Note on TurboQuant and the Earlier DRIVE/EDEN Line of Work

【速读】:该论文旨在澄清近期提出的TurboQuant方法与早期DRIVE(NeurIPS 2021)和EDEN(ICML 2022)量化方案之间的关系,解决的问题是:TurboQuant在理论和实践上是否优于EDEN,以及其性能受限的根本原因。解决方案的关键在于指出TurboQuant _\textmse 是EDEN的一个特例(固定缩放参数 $ S=1 $),而TurboQuant _\textprod 则结合了有偏的 (b1)(b-1)-bit EDEN步骤与无偏的1-bit残差量化,这种分步处理方式存在三个次优性:(1) 使用固定 $ S=1 $ 的子优缩放;(2) 1-bit无偏残差量化误差高于无偏1-bit EDEN;(3) 分步量化不如直接对输入进行 $ b $-bit无偏EDEN量化。实验验证表明,优化缩放参数后的有偏EDEN优于TurboQuant _\textmse ,而无偏EDEN则显著优于TurboQuant _\textprod ,甚至在低比特下(如2-bit EDEN优于3-bit TurboQuant _\textprod )仍具优势。

链接: https://arxiv.org/abs/2604.18555
作者: Ran Ben-Basat,Yaniv Ben-Itzhak,Gal Mendelson,Michael Mitzenmacher,Amit Portnoy,Shay Vargaftik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:This note clarifies the relationship between the recent TurboQuant work and the earlier DRIVE (NeurIPS 2021) and EDEN (ICML 2022) schemes. DRIVE is a 1-bit quantizer that EDEN extended to any b > 0 bits per coordinate; we refer to them collectively as EDEN. First, TurboQuant_\text{mse} is a special case of EDEN obtained by fixing EDEN’s scalar scale parameter to S=1. EDEN supports both biased and unbiased quantization, each optimized by a different S (chosen via methods described in the EDEN works). The fixed choice S=1 used by TurboQuant is generally suboptimal, although the optimal S for biased EDEN converges to 1 as the dimension grows; accordingly TurboQuant_\text{mse} approaches EDEN’s behavior for large d. Second, TurboQuant_\text{prod} combines a biased (b-1)-bit EDEN step with an unbiased 1-bit QJL quantization of the residual. It is suboptimal in three ways: (1) its (b-1)-bit step uses the suboptimal S=1; (2) its 1-bit unbiased residual quantization has worse MSE than (unbiased) 1-bit EDEN; (3) chaining a biased (b-1)-bit step with a 1-bit unbiased residual step is inferior to unbiasedly quantizing the input directly with b-bit EDEN. Third, some of the analysis in the TurboQuant work mirrors that of the EDEN works: both exploit the connection between random rotations and the shifted Beta distribution, use the Lloyd-Max algorithm, and note that Randomized Hadamard Transforms can replace uniform random rotations. Experiments support these claims: biased EDEN (with optimized S ) is more accurate than TurboQuant_\text{mse}, and unbiased EDEN is markedly more accurate than TurboQuant_\text{prod}, often by more than a bit (e.g., 2-bit EDEN beats 3-bit TurboQuant_\text{prod}). We also repeat all accuracy experiments from the TurboQuant paper, showing that EDEN outperforms it in every setup we have tried.
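该注记的核心论点——固定 S=1 的缩放一般次优——可用 1-bit 符号量化的玩具实验体会(这并非 DRIVE/EDEN 本身,只是同一现象的最小示例):对 q = S·sign(x),最小二乘意义下的最优缩放是 S* = mean(|x_i|),它永远不会比 S=1 更差:

```python
import random

# 玩具实验:1-bit 符号量化下固定缩放 S=1 与最小二乘缩放的 MSE 对比。
def mse(x, S):
    return sum((S * (1 if xi >= 0 else -1) - xi) ** 2 for xi in x) / len(x)

def optimal_scale(x):
    # (S*sign(x_i) - x_i)^2 = (S - |x_i|)^2 对 S 求最小 => S* = mean(|x_i|)
    return sum(abs(xi) for xi in x) / len(x)

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(1000)]   # 高斯坐标(mean|x| ≈ 0.8,非 1)
err_fixed = mse(x, 1.0)
err_opt = mse(x, optimal_scale(x))
```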
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI) MSC classes: 68T07 ACMclasses: I.2.7; G.1.0 Cite as: arXiv:2604.18555 [cs.LG] (or arXiv:2604.18555v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.18555 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Amit Portnoy [view email] [v1] Mon, 20 Apr 2026 17:44:15 UTC (131 KB) Full-text links: Access Paper: View a PDF of the paper titled A Note on TurboQuant and the Earlier DRIVE/EDEN Line of Work, by Ran Ben-Basat and 5 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-04 Change to browse by: cs cs.AI cs.NI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) 
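上述摘要中的"随机旋转 + 1 比特量化 + 标量缩放 S"机制,可用如下最小示意说明(仅为示意性草图,并非 DRIVE/EDEN 或 TurboQuant 的实际实现;其中用 QR 分解生成随机正交矩阵,实际工作中可用 Randomized Hadamard Transform 代替)。在网格上扫描 S 可以直观看到:固定 S = 1 通常并非最优,但在高维下最优 S 接近 1,与摘要的论述一致。

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
x = rng.normal(size=d)

# 随机正交旋转(示意;实际可用 Randomized Hadamard Transform 代替)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
z = Q @ x  # 旋转后坐标近似 i.i.d. 高斯

def dequantize(z, S):
    # 1 比特量化:只保留 sign(z),按 S * E|N(0,1)| * ||z||/sqrt(d) 重建
    scale = S * np.sqrt(2.0 / np.pi) * np.linalg.norm(z) / np.sqrt(d)
    return scale * np.sign(z)

def mse(S):
    return float(np.mean((dequantize(z, S) - z) ** 2))

S_grid = np.linspace(0.5, 1.5, 101)
S_best = S_grid[np.argmin([mse(S) for S in S_grid])]
print(round(mse(1.0), 3), round(S_best, 2), round(mse(S_best), 3))
```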

[AI-207] AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

【速读】:该论文旨在解决将生成式 AI (Generative AI) 扩展至高保真物理仿真(如计算流体动力学,CFD)中的科学发现闭环问题,即如何在不依赖人工干预的情况下实现从假设生成、执行验证到结果输出的全流程自动化,并确保物理合理性。其核心挑战在于传统基于求解器完成状态的验证不足以保证物理有效性,且许多错误仅在场级图像中显现。解决方案的关键是提出 AI CFD Scientist 框架,通过一个视觉-语言物理验证门(vision-language physics-verification gate),在结果被接受、重跑或写入论文前对渲染的流场进行物理一致性检查;该框架还集成文献驱动的假说生成、源码修改能力与多路径并行工作流(固定求解器参数扫描、本地 C++ 库编译新模型、开放假设搜索),并在 OpenFOAM 环境下通过 Foam-Agent 实现端到端可控执行,从而显著提升自主科学发现的可信度与可解释性。

链接: https://arxiv.org/abs/2605.06607
作者: Nithin Somasekharan,Rabi Pathak,Manushri Dhanakoti,Tingwen Zhang,Ling Yue,Andy Zhu,Shaowu Pan
机构: 未知
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
备注: 9 main pages and rest in appendix

点击查看摘要

Abstract:Recent LLM-based agents have closed substantial portions of the scientific discovery loop in software-only machine-learning research, in chemistry, and in biology. Extending the same loop to high-fidelity physical simulators is harder, because solver completion does not imply physical validity and many failure modes appear only in field-level imagery rather than in solver logs. We present AI CFD Scientist, an open-source AI scientist for computational fluid dynamics (CFD) that, to our knowledge, is the first to span literature-grounded ideation, validated execution, vision-based physics verification, source-code modification, and figure-grounded writing within a single inspectable workflow. Three coupled pathways cover parameter sweeps within a fixed solver, case-local C++ library compilation for new physical models, and open-ended hypothesis search against a reference comparator, all running on OpenFOAM through Foam-Agent. At the center of the framework is a vision-language physics-verification gate that inspects rendered flow fields before any result is accepted, rerun, or written into a manuscript. On five tasks under a shared GPT-5.5 backbone, AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction that reduces lower-wall C_f RMSE against DNS by 7.89% on the periodic hill at Re_h = 5600; under matched LLM cost, two strong general AI-scientist baselines (ARIS, DeepScientist) execute partial CFD workflows but lack the domain-specific validity gates needed to convert runs into defensible scientific claims; and a controlled planted-failure ablation shows that the vision-language gate detects 14 of 16 silent failures missed by solver-level checks. Code, prompts, and run artifacts are released at this https URL.

[AI-208] Learning to Cut: Reinforcement Learning for Benders Decomposition

【速读】:该论文旨在解决Benders分解(Benders Decomposition, BD)在求解两阶段随机规划问题时因主问题随割平面数量增加而收敛速度缓慢的问题。其解决方案的关键在于提出了一种基于强化学习的自适应割平面选择框架——RLBD,该框架利用神经网络构建随机策略来动态决定哪些割平面应被加入主问题,策略通过REINFORCE算法进行策略梯度训练。该方法显著提升了计算效率,并在不同数据输入和决策变量维度下展现出良好的泛化能力。

链接: https://arxiv.org/abs/2605.06516
作者: Haochen Cai,Xian Yu
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Benders decomposition (BD) is a widely used solution approach for solving two-stage stochastic programs arising in real-world decision-making under uncertainty. However, it often suffers from slow convergence as the master problem grows with an increasing number of cuts. In this paper, we propose Reinforcement Learning for BD (RLBD), a framework that adaptively selects cuts using a neural network-based stochastic policy. The policy is trained using a policy gradient method via the REINFORCE algorithm. We evaluate the proposed approach on a two-stage stochastic electric vehicle charging station location problem and compare it with vanilla BD and LearnBD, a supervised learning approach that classifies cuts using a support vector machine. Numerical results demonstrate that RLBD achieves substantial improvements in computational efficiency and exhibits strong generalization to problems with similar structures but varying data inputs and decision variable dimensions.
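摘要中"神经网络随机策略 + REINFORCE 选择割平面"的训练循环,可用如下玩具示例示意(奖励函数与"有用/无用割"的划分均为假设,仅用于演示独立 Bernoulli 策略下 REINFORCE 的 (a − p) 对数似然梯度形式,并非 RLBD 的实际实现):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cuts = 4          # 假设割 0、1 "有用",割 2、3 只会使主问题膨胀
theta = np.zeros(n_cuts)
lr = 0.1

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for _ in range(3000):
    p = sigmoid(theta)
    a = (rng.random(n_cuts) < p).astype(float)  # 采样要加入的割
    reward = a[:2].sum() - a[2:].sum()          # 假设性奖励:代替求解加速量
    # REINFORCE:独立 Bernoulli 策略的 grad log pi(a|theta) = a - p
    theta += lr * reward * (a - p)

p = sigmoid(theta)
print(p.round(2))  # 有用割的选择概率趋近 1,无用割趋近 0
```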

[AI-209] Multimodal Deep Generative Model for Semi-Supervised Learning under Class Imbalance

【速读】:该论文旨在解决多模态半监督学习中因类别不平衡导致的模型偏差问题,尤其是在部分标注数据下,伪标签传播会加剧类别不平衡带来的负面影响。其关键解决方案是提出一种基于深度生成模型的多模态半监督学习框架,通过为每种模态设计独立编码器并共享潜在变量来融合互补信息,并采用乘积专家(product-of-experts)方法简化联合后验计算;此外,为更好建模不平衡数据中重尾分布的潜在空间,用Student’s t-分布替代传统的高斯分布作为先验、编码器和解码器的分布假设,并引入γ-幂散度(γ-power divergence)构建新的训练目标函数,从而提升模型在类别不平衡场景下的泛化能力与分类性能。

链接: https://arxiv.org/abs/2605.06289
作者: Heegeon Yoon,Heeyoung Kim
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:When modeling class-imbalanced data, it is crucial to address the imbalance, as models trained on such data tend to be biased towards the majority classes. This problem is amplified under partial supervision, where pseudo-labels for unlabeled data are predicted based on imbalanced labeled data, propagating the bias. While recent semi-supervised models address class imbalance, they typically assume single-modal input data. However, with the growing availability of multimodal data, it is essential to leverage complementary modalities. In this article, we propose a multimodal deep generative model for semi-supervised learning under class imbalance. Our approach uses separate encoders for each modality, sharing latent variables across modalities, and simplifies joint posterior computation with a product-of-experts method. To further address class imbalance, we replace typical Gaussian distributions with Student’s t-distributions for the prior, encoder, and decoder, better capturing the heavy-tailed latent distributions in imbalanced data. We derive a new objective function for training the proposed model on both labeled and unlabeled data using γ-power divergence. Empirical results on benchmark and real-world datasets demonstrate that our model outperforms baseline methods in generalization, achieving superior classification performance for partially labeled multimodal data with imbalanced class distributions.
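摘要中用 product-of-experts 合并各模态后验的做法,对高斯专家有封闭解:精度(方差倒数)相加、均值按精度加权。以下为一维高斯情形的最小示意(论文实际使用 Student's t 分布并配合深度编码器,此处仅演示 PoE 的合并机制):

```python
import numpy as np

def poe(mus, variances):
    # 高斯专家乘积:精度相加,均值按精度加权平均
    precisions = 1.0 / np.asarray(variances)
    var = 1.0 / precisions.sum()
    mu = var * (precisions * np.asarray(mus)).sum()
    return mu, var

# 两个模态编码器对同一样本给出的潜变量后验:
mu, var = poe(mus=[1.0, 3.0], variances=[1.0, 1.0])
print(mu, var)  # 2.0 0.5:置信度相同的专家给出中点均值、方差减半
```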

[AI-210] A Topological Sorting Criterion for Random Causal Directed Acyclic Graphs

【速读】:该论文旨在解决当前因果发现算法评估中所依赖的随机有向无环图(Directed Acyclic Graph, DAG)生成方法存在的局限性问题,即基于Erdős-Rényi和尺度自由随机图并施加因果顺序构造的DAGs中,节点通过开放路径可达的集合(称为relatives)沿因果顺序单调递增的现象未被充分认识和利用。其解决方案的关键在于识别并利用这一单调性特征:通过估计每个节点的relatives数量,并据此对节点进行排序,可有效近似真实因果顺序;进一步证明,若relatives严格递增,则对应的DAG具有唯一马尔可夫等价类(Markov equivalence class),从而为因果结构学习提供更可靠的基准。此外,论文建议采样时间序列DAG作为替代方案,以改进因果发现算法的合成数据评估体系。

链接: https://arxiv.org/abs/2605.06288
作者: Alexander G. Reisach,Antoine Chambaz,Gilles Blanchard,Sebastian Weichwald
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Random directed acyclic graphs (DAGs) based on imposing an order on Erdős-Rényi and scale free random graphs are widely used for evaluating causal discovery algorithms. We show that in such DAGs, the set of nodes reachable via open paths, termed relatives, increases monotonically along the causal order. We assess the prevalence of this pattern numerically, and demonstrate that it can be exploited for causal order recovery via sorting by the estimated number of relatives. We note that many simulations in the literature feature settings where this yields an excellent proxy for the causal order, and show that a strict increase of relatives along the causal order leads to a singular Markov equivalence class. We propose sampling time-series DAGs as a possible alternative and discuss implications for causal discovery algorithms and their evaluation on synthetic data.
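摘要中"按 relatives 数量排序可恢复因果顺序"的思想,可用一个简化示意说明(此处用"可达后代数"代替论文中经开放路径可达的 relatives,图结构为假设,仅演示排序思路):

```python
def descendants(adj, v):
    # 迭代 DFS:统计从 v 沿有向边可达的节点数
    seen, stack = set(), [v]
    while stack:
        u = stack.pop()
        for w in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen)

# 链式 DAG 0 -> 1 -> 2 -> 3:后代数沿因果顺序严格递减,
# 按后代数降序排序即可恢复因果顺序
adj = {0: [1], 1: [2], 2: [3]}
counts = [descendants(adj, v) for v in range(4)]
order = sorted(range(4), key=lambda v: -counts[v])
print(counts, order)  # [3, 2, 1, 0] [0, 1, 2, 3]
```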

[AI-211] FunctionalAgent: Towards end-to-end on-top functional design

【速读】:该论文旨在解决多配置对密度泛函理论(Multiconfiguration pair-density functional theory, MC-PDFT)中因“on-top”泛函形式选择不当而导致预测精度受限的问题。其核心挑战在于如何高效、自动化地开发高精度的on-top泛函,以提升MC-PDFT在强关联分子体系中的电子能计算准确性。解决方案的关键在于提出FunctionalAgent——一个由多个专业化子代理组成的智能系统,实现了从数据集构建、活性空间生成、MCSCF计算、描述符提取、损失函数设计到泛函拟合与优化的全流程闭环自动化工作流,从而显著提升了泛函开发的效率与质量。基于此框架,作者成功开发出MC26和COF26两个新型泛函,在训练集和测试集上均展现出优于现有方法的性能。

链接: https://arxiv.org/abs/2605.06215
作者: Yuhao Chen,Donald G. Truhlar,Xiao He
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multiconfiguration pair-density functional theory (MC-PDFT) offers an efficient and accurate framework for computing electronic energies in strongly correlated molecular systems, with the quality of the on-top functional being a key determinant of its predictive accuracy. Here we introduce FunctionalAgent, an agentic system for fully automated functional development. FunctionalAgent orchestrates a team of specialized sub-agents to decompose the development process into dataset construction, active-space generation, MCSCF calculation and descriptor generation, loss-function construction, and functional fitting, optimization, and evaluation, thereby linking all stages into a closed-loop automated workflow. Using FunctionalAgent, we developed MC26, a hybrid meta-GGA on-top functional that achieves improved overall accuracy on the training set compared with other methods evaluated on the same benchmark dataset. We further introduce COF26, a new functional form that, owing to the optimized training process, achieves the best performance on both the training and test sets.

[AI-212] Super-Level-Set Regression: Conditional Quantiles via Volume Minimization

【速读】:该论文旨在解决多变量回归中构建满足条件覆盖率的最小体积预测区域这一基础性难题。传统方法依赖于先显式估计完整的条件密度,再通过阈值化处理来构造预测区域,这种两步插值过程易受估计误差影响且计算复杂度高。其解决方案的关键在于提出了一种名为超水平集回归(Super-Level-Set Regression, SLS)的新数学框架,该框架成功解耦了体积优化目标与模型自身估计误差的条件分位数之间的隐式耦合关系,从而能够直接参数化并优化目标条件等高集的几何边界。SLS通过绕过全分布估计、利用灵活的保体积边界函数,实现了对复杂、多模态及不连通条件结构的端到端建模,为多变量条件分位数回归提供了基于几何优化的全新范式。

链接: https://arxiv.org/abs/2605.06210
作者: Sacha Braun,Michael I. Jordan,Francis Bach
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Constructing minimum-volume prediction regions that satisfy conditional coverage is a fundamental challenge in multivariate regression. Standard approaches rely on explicitly estimating the full conditional density and subsequently thresholding it. This two-step plug-in process is notoriously difficult, sensitive to estimation errors, and computationally expensive. One would like to instead optimize the region directly. Formulating a direct solution is challenging, however, because it requires minimizing a volume objective that is coupled with the conditional quantiles of the model’s own estimation error. In this work, we address this challenge. We introduce super-level-set regression (SLS), a novel mathematical framework that successfully resolves this implicit coupling, allowing us to directly parameterize and optimize the geometric boundaries of the target conditional level sets. By bypassing full distribution estimation and leveraging flexible volume-preserving frontier functions, our approach natively captures complex, multimodal, and disjoint conditional structures end-to-end. Ultimately, SLS offers a new perspective on multivariate conditional quantile regression, replacing the restrictive assumptions of density-first methods with a direct geometric optimization strategy.

[AI-213] CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision

【速读】:该论文旨在解决全球导航卫星系统(GNSS)在城市峡谷环境中定位时协方差估计不可信的问题,即尽管位置估计的均值可能改善,但报告的协方差往往过小、过大或形状错误,导致不确定性量化失真。解决方案的关键在于提出可信赖的可微分因子图优化框架(CredibleDFGO, CDFGO),将协方差可信度作为显式训练目标:通过权重生成网络(WGN)预测每颗卫星的可靠性权重,并利用可微分高斯-牛顿求解器将这些权重映射为位置估计与后验协方差;同时,采用适当的评分规则(如负对数似然NLL和能量得分ES)端到端监督东-北方向的预测分布,从而实现更准确的不确定性建模。

链接: https://arxiv.org/abs/2605.06100
作者: Liang Qian,Penggao Yan,Penghui Xu,Li-Ta Hsu
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Submitted to NAVIGATION: Journal of the Institute of Navigation

点击查看摘要

Abstract:Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods already learn measurement weighting through the solver, but they still use position-only objectives. As a result, the mean estimate may improve while the reported covariance remains too small, too large, or wrong in shape. In this work, we propose CredibleDFGO (CDFGO), a differentiable GNSS factor graph framework that makes covariance credibility an explicit training target. The Weighting Generation Network (WGN) predicts per-satellite reliability weights. The differentiable Gauss–Newton solver maps these weights to a position estimate and posterior covariance, and proper scoring rules supervise the East–North predictive distribution end-to-end. We study negative log-likelihood (NLL), Energy Score (ES), and their combination. Results on three UrbanNav test scenes show consistent gains in uncertainty credibility. Positioning accuracy also improves on the medium-urban and harsh-urban scenes, and the mean horizontal error and 95th-percentile error improve on the deep-urban scene. On the harsh-urban Mong Kok (MK) scene, CDFGO-Combined reduces the mean horizontal error from 13.77 m to 11.68 m, reduces NLL from 40.63 to 6.59, and reduces ES from 12.31 to 9.05. The case studies link the MK improvement to better axis-wise consistency, more credible local covariance ellipses, and satellite-level reweighting.
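摘要中用作监督信号之一的能量得分(Energy Score)可由预测分布样本做蒙特卡洛估计:ES(F, y) = E‖X − y‖ − ½·E‖X − X′‖。以下示意比较校准良好与均值严重偏移的二维预测分布(样本数与偏移量均为假设,仅演示该评分规则的性质:偏移越大得分越差):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy_score(samples, y):
    # ES(F, y) = E||X - y|| - 0.5 * E||X - X'||,用预测样本做蒙特卡洛估计
    term1 = np.mean(np.linalg.norm(samples - y, axis=1))
    shuffled = samples[rng.permutation(len(samples))]
    term2 = 0.5 * np.mean(np.linalg.norm(samples - shuffled, axis=1))
    return term1 - term2

y = np.array([0.0, 0.0])                      # 观测到的 East-North 误差
good = rng.normal(0.0, 1.0, size=(4000, 2))   # 以真值为中心的预测分布
bad = good + np.array([5.0, 0.0])             # 形状相同但均值严重偏移
es_good, es_bad = energy_score(good, y), energy_score(bad, y)
print(es_good, es_bad)
```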

[AI-214] Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)评估中因自适应提示与程序搜索导致的选择敏感性问题,即在微调过程中重复使用基准测试项时,所观测到的“胜出”模型性能可能无法准确反映完整“微调-部署”流程在新数据上的实际表现。其解决方案的关键在于提出SIREN(Selection-aware Repeated-Split Reporting Protocol),该协议通过冻结搜索后的候选短列表、分离分组选择与保留评估步骤,并采用基于项目级高斯乘子自助法(item-level Gaussian multiplier bootstrap)进行不确定性量化,从而实现对过程级目标(procedure-level target)的有效推断。在固定短列表且选择过程平滑稳定的情况下,该估计量具有首阶项目级表示形式,且自助法能提供有限预算网格下的有效联合推断,支持过程性能曲线的置信区间及预设等预算和跨预算比较,实验表明SIREN可避免基于胜者报告的乐观偏差并保持接近有限样本报告目标。

链接: https://arxiv.org/abs/2605.05973
作者: Yang Xu,Jiefu Zhang,Haixiang Sun,Zihan Zhou,Tianyu Cao,Vaneet Aggarwal
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner’s score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
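SIREN 所用的 item 级高斯乘子自助法(Gaussian multiplier bootstrap)的核心思想是:用 i.i.d. 标准正态乘子扰动中心化的逐项得分,以近似均值估计量的抽样分布。以下为针对单一均值置信区间的最小示意(得分数据为假设;实际协议还涉及 shortlist 冻结与重复划分,此处不涉及):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(0.62, 0.1, size=400)   # 假设的 item 级得分
xbar = scores.mean()

# 高斯乘子自助法:用 i.i.d. N(0,1) 乘子扰动中心化得分,
# 近似均值估计量的抽样分布
B = 5000
g = rng.normal(size=(B, len(scores)))
deviations = (g * (scores - xbar)).mean(axis=1)
lo = xbar - np.quantile(deviations, 0.975)
hi = xbar - np.quantile(deviations, 0.025)
print(f"mean={xbar:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```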

[AI-215] Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在训练和推理过程中对经典内存的高需求问题,这一限制使得模型规模扩展面临严峻挑战。解决方案的关键在于引入基于Cayley参数化的酉适配器(Cayley-parameterised unitary adapters),即在预训练LLM的冻结投影层中插入量子电路模块,并在真实的156量子比特超导处理器(IBM Quantum System Two)上执行。实验表明,仅用6000个额外参数即可使Llama 3.1 8B模型的困惑度(perplexity)提升1.4%,且端到端推理已在真实量子处理单元(Quantum Processing Unit, QPU)上验证有效;此外,在SmolLM2小模型上的系统性研究表明,随着酉块维度增加,困惑度单调改善,恢复了83%因压缩导致的性能下降,并能正确回答传统基线无法解答的问题,揭示出噪声-表达力相变边界,为未来更大规模量子优势提供了明确路径。

链接: https://arxiv.org/abs/2605.05914
作者: Borja Aizpurua,Sukhbinder Singh,Augustine Kshetrimayum,Saeed S. Jahromi,Roman Orus
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 31 pages, 6 figures

点击查看摘要

Abstract:Large language models (LLMs) have transformed artificial intelligence, yet classical architectures impose a fundamental constraint: every trainable parameter demands classical memory that scales unfavourably with model size. Quantum computing offers a qualitatively different pathway, but practical demonstrations on real hardware have remained elusive for models of practical relevance. Here we show that Cayley-parameterised unitary adapters – quantum circuit blocks inserted into the frozen projection layers of pre-trained LLMs and executed on a 156-qubit IBM Quantum System Two superconducting processor – improve the perplexity of Llama 3.1 8B, an 8-billion-parameter model in widespread use, by 1.4% with only 6,000 additional parameters and end-to-end inference validated on real Quantum Processing Unit (QPU). A systematic study on SmolLM2 (135M parameters), chosen for its tractability, reveals monotonically improving perplexity with unitary block dimension, 83% recovery of compression-induced degradation, and correct answers to questions that both classical baselines fail – with a sharp noise-expressivity phase transition identifying the concrete path to quantum utility at larger qubit scales.
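标题中的 Cayley 参数化是指:任意反对称矩阵 A 经 U = (I − A)(I + A)^{-1} 映射得到正交(实酉)矩阵,因此一个酉块可由 A 的 n(n−1)/2 个自由参数来参数化。以下为实正交情形的数值验证示意(论文中的适配器是在量子硬件上执行的酉电路块,此处仅演示该映射本身):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Cayley 映射:任意反对称矩阵 A(A.T == -A)经 U = (I - A) @ inv(I + A)
# 得到正交矩阵
M = rng.normal(size=(n, n))
A = M - M.T
I = np.eye(n)
U = (I - A) @ np.linalg.inv(I + A)   # I + A 必可逆:反对称矩阵特征值为纯虚数
print(np.abs(U @ U.T - I).max())     # 接近 0:U 正交
```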

[AI-216] Tuning Derivatives for Causal Fairness in Machine Learning

【速读】:该论文旨在解决人工智能系统在预测中继承受保护属性(如种族、性别或年龄)偏见的问题,尤其是在这些属性通过合法的业务必要性中介变量影响结果时,传统公平性概念(如统计独立性 Statistical Parity, SP)过于严格而无法适用。其核心挑战在于如何在保留合法因果路径影响的同时,消除非法路径带来的不公平。解决方案的关键在于提出一种适用于连续受保护属性的结构因果模型公平性框架,通过路径特定偏导数形式化 SP 和预测公平性(Predictive Parity, PP),明确区分允许与不允许的因果路径,并建立公平预测器存在的理论条件;在此基础上设计了一种公平调优算法,能够在可行时构造满足 SP(非允许路径)和 PP(允许路径)的预测器,否则提供 SP 与 PP 之间的可解释权衡。

链接: https://arxiv.org/abs/2605.05882
作者: Filip Edström,Guilherme W. F. Barros,Tetiana Gorbach,Xavier de Luna
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Artificial-intelligence systems are becoming ubiquitous in society, yet their predictions typically inherit biases with respect to protected attributes such as race, gender, or age. Classical fairness notions, most notably Statistical Parity (SP), demand that predictions be independent of the protected attributes, but are overly restrictive when these attributes influence mediating variables that are considered business necessities. Recent causal formulations relax SP by distinguishing allowed from not-allowed causal paths and by complementing SP with Predictive Parity (PP), requiring the predictor to replicate the legitimate influence of business necessities. Existing path-based definitions are mainly practical when applied to categorical attributes. This paper introduces a new framework for fairness in structural causal models that is tailored to continuous protected attributes. We formalize SP and PP through path-specific partial derivatives, establish conditions under which these criteria coincide with prior causal definitions, and characterize when a fair predictor, one that satisfies SP along not-allowed paths while achieving PP along allowed paths, exists. Building on this theory, we propose a fair tuning algorithm that either constructs such a predictor or, when not possible, allows for a trade-off between SP and PP. We present experiments on simulated and real data to evaluate our proposal, compare it with previously proposed methods, and show that it performs better when PP is considered.

[AI-217] CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在通过多次采样并聚合结果以提升推理能力时,如何实现对错误率的精确且高效控制的问题,尤其针对数据驱动的停止规则下无法预知答案集合的情形。其核心挑战在于设计一种能够保证在任意停止时刻均不超出预定误认证水平的验证机制。解决方案的关键是提出Certification by Intersection-union Testing with E-processes (CITE)算法,该算法利用e过程(e-processes)构建 anytime-valid 的统计检验框架,在无需事先知晓答案类别集合的前提下,严格控制假阳性认证率(false certification rate),同时具备类别集大小无关的停止时间速率和匹配的极小极大下界,从而在扩散尾部(diffuse-tail)场景中实现了更优的认证性能。

链接: https://arxiv.org/abs/2605.05873
作者: Hirofumi Ota,Naoto Iwase,Yuki Ichihara,Junpei Komiyama,Masaaki Imaizumi
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sampling remains difficult when the stopping rule is data-dependent and the set of possible answers is not known in advance. We study anytime-valid certification of a prespecified target answer as the unique mode of the model’s response distribution, a guarantee distinct from answer correctness. We propose the Certification by Intersection-union Testing with E-processes (CITE) algorithm, which provably controls false certification at any prescribed level under arbitrary data-driven stopping, without requiring prior knowledge of the answer category set. We also prove a category-set-size-free stopping-time rate, establish matching minimax lower bounds up to constants in the main regime, and extend the construction to confidence-weighted voting. Simulations and LLM self-consistency experiments show empirical error control and improved certification in diffuse-tail settings.
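anytime-valid 推断的核心工具是 e 过程/检验鞅:在原假设下它是(超)鞅,由 Ville 不等式可在任意数据依赖的停时控制错误率。以下用一个简单的下注鞅示意该机制,检验 H0:"答案 a 的出现概率 ≤ 1/2"(下注系数与命中率均为假设;这只是示意,并非 CITE 的交并检验构造):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05
lam = 0.5

# 检验鞅:在 H0: P(答案 == a) <= 1/2 下,下面的乘积是超鞅,
# 由 Ville 不等式,在任意数据依赖停时越过 1/alpha 的概率 <= alpha
e_process, t = 1.0, 0
while e_process < 1.0 / alpha and t < 500:
    x = float(rng.random() < 0.8)             # 假设采样答案以 0.8 概率命中 a
    e_process *= 1.0 + lam * (2.0 * x - 1.0)  # 乘性因子为 0.5 或 1.5
    t += 1
print(t, round(e_process, 1))
```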

[AI-218] Transformers Provably Implement In-Context Reinforcement Learning with Policy Improvement

【速读】:该论文旨在解决生成式 AI (Generative AI) 在上下文强化学习(In-Context Reinforcement Learning, ICRL)中的机制理解与性能保障问题,即如何让 Transformer 架构在不进行参数更新的前提下,从轨迹数据中推断并执行经典强化学习(Reinforcement Learning, RL)算法。解决方案的关键在于:首先提出一个线性自注意力(Linear Self-Attention)Transformer 块的显式参数构造,证明其可实现半梯度 SARSA 和演员-评论家(Actor-Critic)等策略改进方法;其次设计了一种教师模仿训练策略,并通过梯度流动力学分析建立了首个 ICRL 领域的收敛保证——在训练马尔可夫决策过程(MDP)分布满足适当丰富性条件下,梯度流局部且指数收敛至对应理想 RL 更新的最优参数流形。实验证明,模型在随机生成的表格型 MDP 上训练后,不仅能恢复出理论构造的参数结构,还能在未见 MDP 上展现出强的上下文控制性能,从而揭示了 Transformer 如何在上下文中内化和执行经典 RL 算法,实现了机制可解释性与训练动态之间的桥梁。

链接: https://arxiv.org/abs/2605.05755
作者: Haodong Liang,Lifeng Lai
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 4 figures

点击查看摘要

Abstract:We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement policy-improvement methods, including semi-gradient SARSA and actor-critic, via explicit parameter constructions. Beyond existence, we design a teacher-mimicking training procedure, analyze its gradient-flow dynamics, and establish the first convergence guarantee in the ICRL literature: under suitable richness conditions on the training MDP distribution, gradient flow converges locally and exponentially to an optimal parameter manifold corresponding to the desired RL update. Empirically, training transformers on randomly generated tabular MDPs confirms these predictions: the learned models recover the parameter structure of our explicit constructions and, when deployed on unseen MDPs, deliver strong in-context control performance. Together, these results illuminate how transformer architectures internalize and execute classical reinforcement learning algorithms in context, bridging mechanistic understanding and training dynamics in ICRL.

[AI-219] Fourier Feature Methods for Nonlinear Causal Discovery: FFML Scoring and FFCI Testing in Mixed Data

【速读】:该论文旨在解决高维非线性因果发现中基于高斯过程(Gaussian Process, GP)边际似然得分和核条件独立性检验的计算复杂度问题,这类方法虽理论优越但难以扩展至大规模数据。其核心解决方案是提出两种互补的基于随机傅里叶特征(Random Fourier Features, RFF)的方法:一是傅里叶特征边际似然(Fourier Feature Marginal Likelihood, FFML)得分,通过将原始 n × n 核Gram矩阵替换为低维特征表示,将计算复杂度从 O(n^3) 降低至 O(nm^2 + m^3),同时保持概率解释与自动复杂度惩罚;二是傅里叶特征条件独立性测试(Fourier Feature Conditional Independence, FFCI),对混合变量(连续+离散)分别进行特征映射并利用岭残差化在特征空间中构造检验统计量,显著提升效率与召回率。二者共享RFF/ORF机制,但在架构设计上分别服务于评分型和约束型因果发现任务,形成一套实用的混合因果推断工具包。

链接: https://arxiv.org/abs/2605.05743
作者: Joseph D. Ramsey
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Gaussian process marginal likelihood scores and kernel conditional independence tests are theoretically appealing for nonlinear causal discovery but computationally prohibitive at scale. We present two complementary RFF-based methods forming a practical toolkit for score-based, constraint-based, and hybrid causal discovery. The Fourier Feature Marginal Likelihood (FFML) score approximates the exact GP marginal likelihood by replacing the n x n kernel Gram matrix with a finite-dimensional feature representation, reducing cost to O(nm^2 + m^3) while retaining the probabilistic interpretation and automatic complexity penalty of the exact score. FFML extends to mixed (continuous + discrete) parent sets via a product-kernel construction, with a Kronecker path for small discrete parent sets and a Hadamard-product path otherwise. The Fourier Feature Conditional Independence (FFCI) test is a fast nonparametric CI test for mixed data. Each variable is featurized individually: continuous variables via RFF or Orthogonal Random Features (ORF), discrete variables via a Cholesky-factored categorical feature map, with blocks concatenated. Conditioning uses ridge residualization in feature space; the test statistic is a Frobenius norm of the residualized cross-covariance, approximated as a weighted sum of chi-squared variables. Although FFML and FFCI share the same RFF/ORF machinery, they differ architecturally: FFML builds a joint kernel over a parent set for scoring, while FFCI featurizes variables individually for testing. We compare FFML to TRFF, a penalized Student-t regression alternative. Empirically, BOSS+FFML outperforms linear and kernel-ridge baselines on nonlinear data. When run through the same PC-Max implementation, FFCI and RCIT exhibit complementary precision-recall profiles: RCIT is more precise while FFCI achieves better recall and lower SHD, and runs in one third the time. 
Submission history: From: Joseph Ramsey, [v1] Thu, 7 May 2026 06:34:19 UTC (20 KB)

[AI-220] AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

【速读】:该论文旨在解决现代天文观测台产生的海量多模态数据(multimodal data)给专家人工审查带来的瓶颈问题,尤其关注大语言模型(LLM)在天文学事件分类任务中是否具备科学推理能力与可解释性。其解决方案的关键在于构建AstroAlertBench——一个面向天文事件审查的三阶段逻辑链评估基准,涵盖元数据定位、科学推理和五类层次化分类,并基于Zwicky Transient Facility(ZTF)的真实警报样本对13个支持视觉输入的前沿闭源及开源LLM进行系统评测,首次揭示了高精度模型未必具备“诚实性”(honesty),即自我评估推理可靠性的能力,从而为开发校准且可解释的天文辅助工具提供方法论框架。

链接: https://arxiv.org/abs/2605.05573
作者: Claire Chen,Jiabao Sean Xiao,Shuze Daniel Liu,Facundo Perez Paolino,Luke Handley,Theophile Jegou du Laz,Ricky Nilsson,Alice Zou,Matthew Graham,Ashish Mahabal
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern astronomical observatories generate a massive volume of multimodal data, creating a critical bottleneck for expert human review. While multimodal large language models (LLMs) have shown promise in interpreting complex visual and textual inputs, their ability to perform specialized scientific classification while providing interpretable reasoning remains understudied. We introduce AstroAlertBench, a comprehensive multimodal benchmark designed to evaluate LLM performance in astronomical event review along a three-stage logical chain: metadata grounding, scientific reasoning, and hierarchical classification over five categories. We use a pilot sample of 1,500 real-world alerts from the Zwicky Transient Facility (ZTF), a wide-field survey that scans the northern sky to detect transient astronomical events. On this dataset, we benchmark 13 frontier closed-source and open-weight LLMs that support visual input. Our results reveal that high accuracy does not always align with model “honesty,” defined as the ability to self-evaluate its reasoning, which affects its reliability as a real-world assistant. We further initialize a human-in-the-loop evaluation protocol as a precursor to future community-scale participation. Together, AstroAlertBench provides a framework for developing calibrated and interpretable astronomical assistants.

[AI-221] Towards an Inferentialist Account of Information Through Proof-theoretic Semantics

【速读】:该论文旨在解决信息概念缺乏严谨逻辑与数学基础的问题,从而无法为社会依赖的复杂系统生态提供充分的推理工具。其解决方案的关键在于构建一种基于推论主义(inferentialism)的信息语义理论,包含三个核心部分:首先通过概念分析重构信息的本体论,将Dretske提出的“真值”替换为“可推导性”;其次引入证明论语义学(proof-theoretic semantics, P-tS)实现推论主义推理的数学逻辑形式化,提出信息的基本单位“inferon”,并以此回应van Benthem和Martinez对信息理解的三类划分(范围、相关性、编码),聚焦于信息作为相关性的理解;最后利用P-tS工具发展分布式系统建模的数学框架,从而建立基于推理的信息流理论,为信息在信息系统中的组织与流动提供概念严谨的数学逻辑解释。

链接: https://arxiv.org/abs/2605.05368
作者: Matthew Collins,Timo Eckhardt,David Pym
机构: 未知
类目: Logic (math.LO); Artificial Intelligence (cs.AI)
备注: Manuscript

点击查看摘要

Abstract:Information is one of the most widely-discussed concepts of the current era. However, a great deal of insightful work notwithstanding, it is yet to be given wholly convincing logical or mathematical foundations. Without them, we lack adequate reasoning tools for understanding the complex ecosystems of systems upon which the society depends. We seek to rectify this by taking a first step towards developing an inferentialist semantic theory of information. There are three key interacting components. First, conceptual analysis: the metaphysics of information. Dretske expressed the key concepts of information in terms of intentionality, truth, and transmissibility. We replace truth with inferability, and trace the consequences of this replacement. Second, logic: proof-theoretic semantics (P-tS) provides a mathematical-logical realization of inferentialist reasoning. Using P-tS, we develop the first steps towards a mathematical-logical theory of an inferentialist primitive unit of information, the ‘inferon’. This proof-theoretic approach counterpoints the model-theoretic view of information articulated in situation theory. Furthermore, we argue that it facilitates addressing all three components of van Benthem and Martinez’s categorization of the understandings of information, as range, as correlation, and as code. Our focus is on information-as-correlation. Third, systems: the P-tS tools we develop provide the basis for a mathematical account of distributed systems modelling – a key tool from informatics for understanding the organization of information processing systems. This yields a reasoning-based theory of information flow in models of distributed systems. Overall, we seek to give a conceptually rigorous mathematical-logical account of information and its role within informatics, grounded in inference and reasoning.

[AI-222] Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning

【速读】:该论文旨在解决在固定预算下,工具使用回放(tool-use rollout)中策略梯度信息量不足的问题,即Rollout Informativeness under a Fixed Budget (RIFB) 的优化问题。现有方法中,任何与预算无关的独立采样器在处理困难提示(hard prompts)时都会面临策略梯度质量坍缩率非零的理论限制。为应对这一挑战,论文提出将中间状态选择建模为单调子模最大化问题,并设计了一个贪婪的一步选择器,其具有 1 − 1/e 的近似保证。由此导出的不确定性感知上置信界(Uncertainty-aware Upper Confidence Bound, UUCB)项成为该目标的闭式边际收益,从而将原本作为经验技巧的token级熵奖励转化为形式化推导的结果。进一步地,作者构建了InfoTree训练时树搜索框架,融合UUCB与自适应预算分配器(Adaptive Budget Allocator, ABA)和异步推测扩展机制(Speculative Expansion),显著提升了策略优化效率与鲁棒性,在多个数学推理、网络搜索及工具丰富型代码与操作系统代理任务中超越现有主流方法。

链接: https://arxiv.org/abs/2605.05262
作者: Yuelin Hu,Zhenbo Yu,Zhengxue Cheng,Wei Liu,Li Song
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint, 9 pages, 5 figures

点击查看摘要

Abstract:We formalize Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set injects into Group Relative Policy Optimization (GRPO). We prove that any budget-agnostic independent sampler suffers a collapse rate bounded away from zero for hard prompts regardless of the budget. Motivated by this, we recast intermediate state selection as a monotone submodular maximization problem, where a greedy one-step selector enjoys a 1 minus 1/e approximation guarantee. Our Uncertainty-aware Upper Confidence Bound (UUCB) terms arise as closed-form marginal gains of this objective. This turns the token-level entropy bonus from an empirical trick into an analytic consequence of the formulation. We present InfoTree, a training-time tree-search framework coupling UUCB with a learned Adaptive Budget Allocator (ABA) and an asynchronous Speculative Expansion scheme. ABA rescues prompts whose initial tree is wasted on uniform outcomes, lifting the mixed-outcome ratio from 58.1 percent to 76.3 percent with less than 5 percent budget overhead. Speculative Expansion reduces wall-clock overhead from 14.3 percent to 4.8 percent by tolerating bounded staleness in UUCB scores. Across nine benchmarks spanning math reasoning (AIME 2024 and 2025, MATH-500, OlympiadBench, USAMO), web-search agents (GAIA, HLE-100, BrowseComp-lite), and tool-rich coding and OS agents (APPS-verified, AgentBench-OS), InfoTree outperforms flat GRPO, DeepSearch, Tree-GRPO, AT2PO, CW-GRPO, and RC-GRPO. Head-to-head compositions with Tree-GRPO prefix sharing and CW-GRPO contribution weights deliver further gains, confirming that our selector operates orthogonally to rollout reuse and trajectory re-weighting. A 5 by 5 by 5 robustness grid reveals that over three quarters of the hyperparameter space lies on a performance plateau, confirming UUCB robustness. 
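摘要中"贪婪一步选择器享有 1 − 1/e 近似保证"来自单调子模最大化的经典结论。下面用最大覆盖问题给出一个极简示意(覆盖函数与候选集合均为本文虚构的玩具例子,并非论文中的 UUCB 边际收益):

```python
def greedy_max_coverage(sets, k):
    """单调子模最大化的贪心选择示意。

    sets: 候选 id -> 其覆盖的元素集合;覆盖函数 f(S) = |所选集合之并|
    是单调子模的,因此贪心算法享有经典的 (1 - 1/e) 近似保证。
    """
    chosen, covered = [], set()
    for _ in range(k):
        # 每一步选取边际收益(新覆盖元素数)最大的候选
        best = max(sets, key=lambda c: len(sets[c] - covered) if c not in chosen else -1)
        if len(sets[best] - covered) == 0:
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}
chosen, covered = greedy_max_coverage(sets, k=2)
# 先选 "c"(边际收益 4),再选 "a"(边际收益 3)
```

论文中的一步选择器即是在这种贪心框架下逐一挑选中间状态,只是边际收益换成了 UUCB 项。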

[AI-223] Enhancing Cryo-EM Density Map Segmentation in Phenix for Improved Atomic Model Building

【速读】:该论文旨在解决冷冻电镜(cryo-EM)密度图中原子模型构建的难题,特别是传统 Phenix 软件在处理噪声和伪影时导致的建模精度低、效率差的问题。其解决方案的关键在于引入 AlphaFold 预测结构信息,以增强 Phenix 中的密度图分割步骤,从而提升模型构建的准确性与鲁棒性,显著改善了 TM-score 和序列一致性等关键指标。

链接: https://arxiv.org/abs/2605.05259
作者: Chenwei Zhang
机构: 未知
类目: Biomolecules (q-bio.BM); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 10 pages, 4 figures, 2 tables

点击查看摘要

Abstract:We introduce PhenixCraft, a fully automated pipeline for building atomic models from cryo-EM density maps. By integrating AlphaFold predictions, we enhance the map-segmentation step in Phenix during model building, addressing challenges posed by noise and artifacts that traditionally hinder this step. Our results demonstrate PhenixCraft’s superior performance in TM-scores and sequence accuracy, significantly improving upon the limitations and inefficiencies of traditional model building using Phenix.

[AI-224] Memory-Efficient EDA Denoising via Knowledge Distillation for Wearable IoT Under Severe Motion Artifacts and Underwater Conditions

【速读】:该论文旨在解决皮肤电活动(Electrodermal Activity, EDA)信号在穿戴式医疗物联网(Wearable Internet of Medical Things, IoMT)系统中易受运动伪影和环境噪声干扰的问题,尤其在恶劣环境(如水下)下的可靠监测难题。其解决方案的关键在于提出一种基于知识蒸馏(Knowledge Distillation, KD)的轻量化模型框架:通过融合混合CNN-Transformer教师模型与轻量级深度可分离卷积学生模型,实现高性能去噪;同时引入真实数据增强策略以模拟多样化的运动伪影和环境畸变,显著提升模型鲁棒性。该方法在保持高精度去噪性能(MAE: 0.144,SNR提升12.08 dB)的同时,大幅降低模型体积(7.87 MB → 0.51 MB)和计算成本(105.1M → 11.61M FLOPs),并实现在水下真实场景(UMAC数据集)和临床下游任务(CNS-OT预测)中的优越表现,验证了其在资源受限设备中部署的可行性与有效性。

链接: https://arxiv.org/abs/2605.05246
作者: Yongbin Lee,Andrew Peitzsch,Youngsun Kong,Jarod Zizza,Dong-hee Kang,Farnoush Baghestani,Ki H. Chon
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Electrodermal activity (EDA) is widely used in wearable Internet of Medical Things (IoMT) systems for continuous health monitoring, including autonomic assessment. However, EDA signals are highly vulnerable to motion artifacts and environmental noise, limiting reliable deployment in harsh operating conditions such as underwater. This study proposes a robust, deployable EDA denoising framework that generalizes across multiple measurement locations and harsh environments. The framework integrates a hybrid CNN-Transformer teacher model with a lightweight depth-wise separable CNN student model via a knowledge distillation (KD) strategy. To further improve robustness, a realistic data augmentation scheme is introduced to simulate diverse motion artifacts and environmental distortions. The KD-based student model significantly reduces model size (7.87 MB to 0.51 MB) and computational cost (105.1M to 11.61M FLOPs) while maintaining denoising performance (MAE: 0.144, SNR improvement: 12.08 dB) using the public dataset validation. In real-world underwater conditions (UMAC dataset) testing, the proposed method substantially improves skin conductance response reconstruction, reducing mean absolute error from 2.809 to 0.215. Furthermore, on independent testing using the CNS-OT dataset, the denoised signals enhanced downstream CNS-OT prediction performance, achieving the highest AUROC (0.806) compared to prior denoising methods. The proposed method also improved the early prediction rate (sensitivity) from 0.550 to 0.767, enabling CNS-OT prediction up to a median of 6.9 minutes before symptom onset. These results demonstrate that the proposed framework not only improves EDA signal quality but also enhances clinically relevant prediction performance while remaining suitable for deployment in resource-constrained wearable Internet of Things systems operating in harsh environments.
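摘要中"教师-学生知识蒸馏 + 去噪"的思路可以在一维信号上做一个极简示意:学生以"对齐教师输出 + 对齐干净信号"的加权损失训练。此处教师/学生均被简化为线性平滑滤波器,滤波器形状、权重 alpha、步长等皆为假设,与论文实际的 CNN-Transformer 与深度可分离卷积架构无关:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
clean = np.sin(2 * np.pi * 3 * t)                   # 干净信号
noisy = clean + 0.3 * rng.standard_normal(t.size)   # 含噪观测

def teacher_denoise(x):
    # "教师":一个较宽的 7 抽头平滑滤波器,代替大模型(仅为示意)
    return np.convolve(x, np.ones(7) / 7.0, mode="same")

def student_denoise(x, w):
    # "学生":只有 3 个可学习抽头的小滤波器
    return np.convolve(x, w, mode="same")

def kd_loss(w, alpha=0.5):
    # 蒸馏目标:alpha * 对齐教师输出 + (1 - alpha) * 对齐干净信号
    s = student_denoise(noisy, w)
    return (alpha * np.mean((s - teacher_denoise(noisy)) ** 2)
            + (1 - alpha) * np.mean((s - clean) ** 2))

w0 = np.array([0.2, 0.6, 0.2])
w = w0.copy()
for _ in range(300):  # 用中心差分近似梯度做简易梯度下降
    g = np.array([(kd_loss(w + 1e-4 * e) - kd_loss(w - 1e-4 * e)) / 2e-4
                  for e in np.eye(3)])
    w -= 0.2 * g
```

训练后 `kd_loss(w)` 低于初始 `kd_loss(w0)`,学生在参数量远小于教师的前提下逼近蒸馏目标,这正是论文压缩模型体积的基本机制。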

[AI-225] PPO-Based Dynamic Positioning of HAPS-BS in Wind-Disturbed Stratospheric Maritime Networks

【速读】:该论文旨在解决高海拔平台站(High-Altitude Platform Stations, HAPS)在海洋区域部署时,因船舶移动性和大气扰动(尤其是平流层风对HAPS定位的影响)导致的覆盖不稳定与系统吞吐量下降问题。解决方案的关键在于提出一种基于深度强化学习(Deep Reinforcement Learning, DRL)的动态定位框架,其中由协调器HAPS部署的集中式DRL智能体,利用无线测量和网络反馈信息,采用近端策略优化(Proximal Policy Optimization, PPO)算法学习鲁棒的定位策略,从而有效抑制风致定位偏差,并提升海上用户的广域连通性与系统性能。

链接: https://arxiv.org/abs/2605.05240
作者: Azim Akhtarshenas,German Svistunov,Matteo Bernabè,Kuangyu Zheng,David López-Pérez
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High-Altitude Platform Stations (HAPS) offer a promising solution for wide-area wireless coverage in maritime regions lacking terrestrial infrastructure. However, maintaining reliable performance is challenging due to dynamic ship mobility and atmospheric disturbances, particularly stratospheric wind effects on HAPS positioning. This paper proposes a deep reinforcement learning (DRL)-based framework for dynamic positioning of wind-disturbed HAPS-mounted base stations in maritime networks. A centralized DRL agent deployed on a coordinator HAPS controls multiple serving HAPS using radio measurements and network feedback, capturing realistic channel conditions and user mobility. A Proximal Policy Optimization (PPO) algorithm is employed to learn robust positioning policies that enhance coverage stability and system throughput under wind disturbances. Simulation results show that the proposed approach effectively mitigates wind-induced positioning deviations while ensuring reliable wide-area connectivity for maritime users.
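摘要中采用的 PPO 核心是"裁剪代理目标"(clipped surrogate objective)。下面给出该目标函数标准形式的极简实现(批次数据为虚构玩具值,不涉及论文中 HAPS 的状态、动作与奖励设计):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO 的裁剪代理目标(Schulman et al., 2017 的标准公式)。

    ratio 偏离 1 超过 eps 时被裁剪,从而限制单次策略更新的幅度。
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))


# 玩具批次:正优势动作上的大幅策略偏移会被裁剪
logp_old = np.log(np.array([0.2, 0.5, 0.3]))
logp_new = np.log(np.array([0.4, 0.5, 0.1]))
adv = np.array([1.0, -0.5, 2.0])
obj = ppo_clip_objective(logp_new, logp_old, adv)  # ≈ 0.456
```

第一条样本的 ratio 为 2.0,被裁剪到 1.2;第三条样本 ratio 为 1/3,由于取 min,保留未裁剪的较小值。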

[AI-226] MedMamba: Recasting Mamba for Medical Time Series Classification

【速读】:该论文旨在解决医学时间序列(如心电图 ECG 和脑电图 EEG)分析中长期依赖建模困难、传统卷积或循环模型表达能力有限,以及基于 Transformer 的方法存在二次计算复杂度和冗余交互的问题。其解决方案的核心在于提出 MedMamba——一种基于生理信号内在规律设计的多尺度双向状态空间架构,关键创新包括:通过轻量级通道混合模块实现跨通道重参数化以捕捉空间中心性;利用多尺度卷积标记化进行时间分解以建模多时间尺度动态;采用双向 Mamba 块实现线性复杂度的全局上下文建模,从而高效捕获非因果上下文依赖。该方法在多个基准数据集上显著优于现有最先进模型,并具备更高的推理效率,适用于实时临床部署。

链接: https://arxiv.org/abs/2605.05214
作者: ZhengXiao He,Huayu Li,Xiwen Chen,Janet M Roveda,Jinghao Wen,Siyuan Tian,Ao Li
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Medical time series, such as electrocardiograms (ECG) and electroencephalograms (EEG), exhibit complex temporal dynamics and structured cross-channel dependencies, posing fundamental challenges for automated analysis. Conventional convolutional and recurrent models struggle to capture long-range dependencies, while Transformer-based approaches incur quadratic complexity and often introduce redundant interactions that are misaligned with the intrinsic structure of physiological signals. To address these limitations, we propose MedMamba, a principle-driven multi-scale bidirectional state space architecture tailored for medical time series classification. Our design is guided by three key inductive biases of physiological signals: spatial centralization, multi-timescale temporal composition, and non-causal contextual dependency. These principles are instantiated through a lightweight channel-mixing module for cross-channel reparameterization, multi-scale convolutional tokenization for temporal decomposition, and bidirectional Mamba blocks for efficient global context modeling with linear complexity. Extensive experiments on six benchmark datasets spanning EEG, ECG, and human activity signals demonstrate that MedMamba consistently outperforms state-of-the-art methods across diverse modalities. Notably, it achieves 85.97% accuracy on PTB and establishes new state-of-the-art performance on the challenging ADFTD dataset (54.72% accuracy and 52.01% F1-score). Strong results on long-sequence benchmarks, such as SleepEDF, further validate its capability in modeling long-range dependencies. Moreover, MedMamba achieves a speedup of 4.6x in inference, highlighting its practicality for real-time clinical deployment. These results suggest that principle-guided state space modeling offers an effective and scalable alternative to Transformer-based approaches for medical time series analysis.
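MedMamba 所依赖的状态空间模型(SSM)核心是一个线性递推扫描,对序列长度为线性复杂度。下面是一个对角线性 SSM 的极简示意(不含 Mamba 的输入相关选择机制与双向扫描,参数均为假设的玩具值):

```python
import numpy as np

def ssm_scan(u, a, b, c):
    """对角线性状态空间递推:
        x_t = a * x_{t-1} + b * u_t,   y_t = c . x_t
    这是 S4/Mamba 类层的基本原语(此处省略 Mamba 的输入相关门控)。
    """
    x = np.zeros(a.shape[0])
    ys = np.empty(u.shape[0])
    for t, ut in enumerate(u):
        x = a * x + b * ut          # 逐元素运算(对角 A)
        ys[t] = c @ x
    return ys


rng = np.random.default_rng(1)
a = np.array([0.9, 0.5])            # 稳定衰减率(|a| < 1)
b = np.array([1.0, 1.0])
c = np.array([0.5, 0.5])
u = rng.standard_normal(64)
y = ssm_scan(u, a, b, c)
```

每个状态通道以不同速率衰减地累积历史输入,这正是 SSM 以线性代价建模长程依赖的原因;双向版本只需再对反转序列做一次同样的扫描。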

[AI-227] A Review of Large Language Models for Stock Price Forecasting from a Hedge-Fund Perspective

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在量化金融领域,特别是股票价格预测中的实际应用问题,其核心挑战在于如何将LLM技术有效整合进真实交易流程并确保其鲁棒性。解决方案的关键在于系统性地识别和应对文献中常被忽视的实践陷阱,包括情感分析的脆弱性、数据集与预测时 horizon 的设计缺陷、性能评估指标的不恰当使用、数据泄露风险、流动性溢价的影响以及股票价格可预测性的理论极限。作者从对冲基金视角出发,提出一套结构化方法论,指导学术研究者与投资管理者在部署LLM驱动的交易系统时进行严谨的实证验证与压力测试,从而提升模型在现实市场摩擦下的可靠性与实用性。

链接: https://arxiv.org/abs/2605.05211
作者: Olivia Zhang,Zhilin Zhang
机构: 未知
类目: Pricing of Securities (q-fin.PR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
备注: Accepted at the IEEE Conference on Artificial Intelligence, Spain, May 8–10, 2026

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in quantitative finance for stock price forecasting. This review synthesizes recent applications of LLMs in this domain, including extracting sentiment from financial news and social media, analyzing financial reports and earnings-call transcripts, tokenizing or symbolizing stock price series, and constructing multi-agent trading systems. Particular attention is paid to practical pitfalls that are often understated in the literature, such as fragility in sentiment analysis, dataset and horizon design, performance evaluation metrics, data leakage, illiquidity premia, and limits of stock price predictability. Organized from a hedge-fund perspective, the review is intended to guide both academic researchers and hedge fund managers in integrating LLMs into real-world trading pipelines and in stress-testing their robustness under realistic market frictions.

机器学习

[LG-0] Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

链接: https://arxiv.org/abs/2605.06656
作者: Jai Moondra,Ayela Chughtai,Bhargavi Lanka,Swati Gupta
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Emerging Technologies (cs.ET); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best-fit global Bradley-Terry (BT) ranking is misleading. Nearly 2/3 of the decisive votes cancel out, and even the top 50 models according to the global BT ranking are statistically indistinguishable (pairwise win probabilities are at most 0.53 within the top 50 models). We trace this failure to strong, structured heterogeneity of opinions across language, task, and time. Moreover, we find an important characteristic - language plays a key role. Grouping by language (and families) increases the agreement of votes massively, resulting in two orders of magnitude higher spread in the ELO scores (i.e., very consistent rankings). What appears as global noise is in fact a mixture of coherent but conflicting subpopulations. To address such heterogeneity in supervised machine learning, we introduce the framework of (\lambda, \nu) -portfolios, which are small sets of models that achieve a prediction error at most \lambda , “covering” at least a \nu fraction of users. We formulate this as a variant of the set cover problem and provide guarantees using the VC dimension of the underlying set system. On the Arena data, our algorithms recover just 5 distinct BT rankings that cover over 96% of votes at a modest \lambda , compared to the 21% coverage by the global ranking. We also provide a portfolio of 6 LLMs that cover twice as many votes as the top-6 LLMs from a global ranking. We further construct portfolios for a classification problem on the COMPAS dataset using an ensemble of fairness-regularized classification models and show that these portfolios can be used to detect blind spots in the data, which might be of independent interest to policymakers. 
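上文摘要中的全局 Bradley-Terry(BT)排序,可用经典的 MM(Zermelo)迭代求极大似然强度:在 BT 模型下 P(i 胜 j) = p_i / (p_i + p_j)。下面在一个三模型虚构对战矩阵上做极简示意:

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """经典 MM(Zermelo)迭代求 Bradley-Terry 强度。

    wins[i, j] = 模型 i 战胜模型 j 的次数;
    模型给出 P(i 胜 j) = p_i / (p_i + p_j)。
    """
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T           # 两两对战总场次
    for _ in range(iters):
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            p[i] = wins[i].sum() / denom
        p /= p.sum()                # 归一化消除尺度不定性
    return p


# 虚构对战:A 对 B 3:1,B 对 C 3:1,A 对 C 4:0
wins = np.array([[0, 3, 4],
                 [1, 0, 3],
                 [0, 1, 0]])
p = bradley_terry(wins)
```

摘要所说"前 50 名两两胜率至多 0.53",即拟合出的强度 p_i/(p_i+p_j) 全都接近 0.5,全局排名因而在统计上不可区分。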

[LG-1] Inductive Venn-Abers and related regressors

链接: https://arxiv.org/abs/2605.06646
作者: Ivan Petej,Vladimir Vovk
类目: Machine Learning (cs.LG)
*备注: 33 pages

点击查看摘要

Abstract:Venn-Abers predictors are probabilistic predictors that enjoy appealing properties of validity, but their major limitation is that they are applicable only to the case of binary classification, with a recent extension to bounded regression. We generalize them to the case of unbounded regression, which requires adding an element of conformal prediction. In our simulation and empirical studies we investigate the predictive efficiency of point regressors derived from Venn-Abers regressors and argue that they somewhat improve the predictive efficiency of standard regressors for larger training sets.
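二分类 Venn-Abers 预测器的标准做法是:对测试样本分别假设其标签为 0 和 1,连同校准集一起做保序回归(isotonic regression),读出一对概率 (p0, p1)。下面用手写 PAVA 算法给出极简示意(校准分数与标签均为虚构;论文讨论的是向无界回归的推广,此处仅示意其二分类基础):

```python
def pava(y, w):
    """Pool Adjacent Violators:加权保序(非降)回归拟合。"""
    vals, wts, sizes = [], [], []
    for yi, wi in zip(y, w):
        vals.append(float(yi)); wts.append(float(wi)); sizes.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            # 违反单调性时,将相邻两块合并为加权均值
            v = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / (wts[-2] + wts[-1])
            wts[-2:] = [wts[-2] + wts[-1]]
            sizes[-2:] = [sizes[-2] + sizes[-1]]
            vals[-2:] = [v]
    out = []
    for v, sz in zip(vals, sizes):
        out.extend([v] * sz)
    return out

def venn_abers(cal_scores, cal_labels, test_score):
    """二分类 Venn-Abers:对两种假设标签各做一次保序回归,返回 (p0, p1)。"""
    probs = []
    for hyp in (0, 1):
        pts = sorted(zip(cal_scores + [test_score], cal_labels + [hyp]))
        fitted = pava([l for _, l in pts], [1.0] * len(pts))
        idx = [s for s, _ in pts].index(test_score)
        probs.append(fitted[idx])
    return tuple(probs)


scores = [0.1, 0.3, 0.4, 0.7, 0.9]
labels = [0, 0, 1, 1, 1]
p0, p1 = venn_abers(scores, labels, 0.8)   # p0 ≈ 0.667, p1 = 1.0
```

区间 [p0, p1] 的宽度反映该测试点上校准概率的不确定性,这正是 Venn-Abers 有效性(validity)保证的来源。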

[LG-2] Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

链接: https://arxiv.org/abs/2605.06644
作者: Yuchen Xiong,Swee Keong Yeap,Steven Aw Yoong Kit
类目: Machine Learning (cs.LG)
*备注: Includes appendix; source code, processed feature tables and evaluation scripts are available from the first author upon reasonable request

点击查看摘要

Abstract:Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local physical signals act on specific chromophore regions. We present a chromophore-centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate, bridge and imidazolinone regions, and transformed by channel-signal-region propagation. The representation contains 121 enrichment features; after removing identity shortcuts, 52 non-identity features are used for band-specific ExtraTrees regression. Because each feature encodes a contact channel, seed signal and target CRO region, interpretation is intrinsic rather than post hoc. On a 531-protein benchmark, the method achieved the best random-CV performance among model-based baselines (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002), exceeding Band mean (R = 0.632), ESM-C (R = 0.734) and SaProt (R = 0.731), and ranked first in bright screening (Bright P@5 = 0.704). Under homology control, the advantage was clearest in the remote bucket (<50% similarity; R = 0.697 versus 0.633, 0.575 and 0.408), with the strongest overall bright/dark Top-K screening. Stable selected features recovered band-specific mechanisms: aromatic packing and clamp asymmetry in GFP-like proteins, charge/clamp balance in Red proteins, and flexibility-risk/bulky-contact features in Far-red proteins. Source code, feature tables and evaluation scripts are available from the first author upon request.

[LG-3] Crafting Reversible SFT Behaviors in Large Language Models

链接: https://arxiv.org/abs/2605.06632
作者: Yuping Lin,Pengfei He,Yue Xing,Yingqian Cui,Jiayuan Ding,Subhabrata Mukherjee,Hui Liu,Zhen Xiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) induces new behaviors in large language models, yet imposes no structural constraint on how these behaviors are distributed within the model. Existing behavior interpretation methods, such as circuit attribution approaches, identify sparse subnetworks correlated with SFT-induced behaviors post-hoc. However, such correlations do not imply causal necessity, limiting the ability to selectively control SFT-induced behaviors at inference time. We pursue an alternative by asking: can an SFT-induced behavior be deliberately compressed into a sparse, mechanistically necessary subnetwork, termed a carrier, while remaining controllable at inference time without weight modification? We propose (a) Loss-Constrained Dual Descent (LCDD), which constructs such carriers by jointly optimizing routing masks and model weights under an explicit utility budget, and (b) SFT-Eraser, a soft prompt optimized via activation matching on extracted carrier channels, to reverse the SFT-induced behavior. Across safety, fixed-response, and style behaviors on multiple model families, LCDD yields sparse carriers that preserve target behaviors while enabling strong reversion when triggered by SFT-Eraser. Ablations further establish that the sparse structure is the key precondition for reversal: the same trigger optimization fails on standard SFT models, confirming that structure rather than trigger design is the operative factor. These results provide direct evidence that the learned carriers are causally necessary for the behaviors, pointing to a new direction for systematically localizing and selectively suppressing SFT-induced behaviors in deployed models.

[LG-4] Hybrid Quantum-Classical GANs for the Generation of Adversarial Network Flows

链接: https://arxiv.org/abs/2605.06629
作者: Prateek Paudel,Nitin Jha,Abhishek Parakh,Mahadevan Subramaniam
类目: Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:Classical generative adversarial networks (GANs) have been applied to generate adversarial network traffic capable of attacking intrusion detection systems, but they suffer from shortcomings such as the need for large amounts of high-dimensional datasets, mode collapse, and high computational overhead. In this work, we propose a hybrid quantum-classical GAN (QC-GAN) framework where a variational quantum generator is used to generate synthetic network traffic flows mimicking malicious traffic using latent representations. Instead of sampling classical noise vectors, we encode the latent vector (the hidden features) as a quantum state, which is the basis for claiming more expressive latent representations and reducing computational overhead. A classical discriminator will be trained on real-world datasets (UNSW-NB15) and the proposed QC-GAN-generated fake network flows. In this configuration, the generator aims to minimize the discriminator’s ability to distinguish real from fake traffic, while the discriminator aims to maximize its classification accuracy, in an iterative manner. In our attack model, we assume that the attacker is a state actor with access to limited quantum computing power, whereas the discriminator is chosen to be classical, as will likely be the case for most end users and organizations. We test the generated flows using classical intrusion detection system (IDS) models, such as a random forest classifier and a convolutional neural network-based classifier, for their ability to bypass the detection process. This work aims to highlight the possibilities of quantum machine learning as a means of generating advanced attack flows and stress testing classical IDS. Lastly, we further evaluate how hardware-based noise affects these attacks to offer a new perspective on IDS, highlighting the need for a quantum resilient defense system.

[LG-5] PianoCoRe: Combined and Refined Piano MIDI Dataset

链接: https://arxiv.org/abs/2605.06627
作者: Ilya Borovik
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: Published in TISMIR. Project repository: this https URL

点击查看摘要

Abstract:Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents PianoCoRe, a large-scale piano MIDI dataset that unifies and refines major open-source piano corpora. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763 h of performed music. PianoCoRe is released in tiered subsets to support different applications: from large-scale analysis and pre-training (PianoCoRe-C and deduplicated PianoCoRe-B) to expressive performance modeling with note-level score alignment (PianoCoRe-A/A*). The note-aligned subset, PianoCoRe-A, provides the largest open-source collection of 157,207 performances aligned to 1,591 scores to date. In addition to the dataset, the contributions are: (1) a MIDI quality classifier for detecting corrupted and score-like transcriptions and (2) RAScoP, an alignment refinement pipeline that cleans temporal alignment errors and interpolates missing notes. The analysis shows that the refinement reduces temporal noise and eliminates tempo outliers. Moreover, an expressive performance rendering model trained on PianoCoRe demonstrates improved robustness to unseen pieces compared to models trained on raw or smaller datasets. PianoCoRe provides a ready-to-use foundation for the next generation of expressive piano performance research.

[LG-6] Online Bayesian Calibration under Gradual and Abrupt System Changes

链接: https://arxiv.org/abs/2605.06612
作者: Yang Xu,Chiwoo Park
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian model calibration is central to digital twins and computer experiments, as it aligns model outputs with field observations by estimating calibration parameters and correcting systematic model bias. Classical Bayesian calibration introduces latent parameters and a discrepancy function to model bias, but suffers from parameter–discrepancy confounding and is typically formulated as an offline procedure under a stationary data-generating assumption. These limitations are restrictive in modern digital twin applications, where systems evolve over time and may exhibit gradual drift and abrupt regime shifts. While data assimilation methods enable sequential updates, they generally do not explicitly model systematic bias and are less effective under abrupt changes. We propose Bayesian Recursive Projected Calibration (BRPC), an online Bayesian calibration framework for streaming data under simulator mismatch and nonstationarity. BRPC extends projected calibration to the online setting by separating a discrepancy-free particle update for calibration parameters from a conditional Gaussian process update for discrepancy, preserving identifiability while enabling bias-aware adaptation under gradual system evolution. To handle abrupt changes, BRPC is integrated with restart mechanisms that detect regime shifts and reset the calibration process. We establish theoretical guarantees for both components, including tracking performance under gradual evolution and false-alarm and detection behavior for restart mechanisms. Empirical studies on synthetic and plant-simulation benchmarks show that BRPC improves calibration accuracy under gradual changes, while restart-augmented BRPC further improves robustness and predictive performance under abrupt regime shifts compared to sliding-window Bayesian calibration and data assimilation baselines.

[LG-7] Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

链接: https://arxiv.org/abs/2605.06609
作者: Chenyang Zhang,Yuan Cao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 94 pages, 8 figures

点击查看摘要

Abstract:Transformers have demonstrated remarkable in-context learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby enhancing prediction and generation. In this work, we investigate how transformers with softmax attention perform in-context learning on linear classification data. We first construct a class of multi-layer transformers that can perform in-context logistic regression, with each layer exactly performing one step of normalized gradient descent on an in-context loss. Then, we show that our constructed transformer can be obtained through (i) training a single self-attention layer supervised by one-step gradient descent, and (ii) recurrently applying the trained layer to obtain a looped model. Training convergence guarantees of the self-attention layer and out-of-distribution generalization guarantees of the looped model are provided. Our results advance the theoretical understanding of ICL mechanism by showcasing how softmax transformers can effectively act as in-context learners.
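摘要中"每层恰好执行一步归一化梯度下降(normalized GD)"的机制,可在一个普通的 logistic 回归问题上直接示意:每步沿单位化梯度方向走固定步长(数据与步长均为虚构玩具设置,与 Transformer 构造本身无关):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = (X @ w_star > 0).astype(float)   # 线性可分的玩具标签

def loss_and_grad(w):
    z = X @ w
    p = 1.0 / (1.0 + np.exp(-z))
    # 数值稳定地计算 mean(log(1 + e^z) - y * z)
    loss = np.mean(np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0) - y * z)
    grad = X.T @ (p - y) / n
    return loss, grad

w = np.zeros(d)
losses = []
for _ in range(100):
    loss, g = loss_and_grad(w)
    losses.append(loss)
    w -= 0.1 * g / (np.linalg.norm(g) + 1e-12)   # 归一化梯度下降的一步
```

初始损失恰为 log 2(w = 0 时每个样本预测概率 0.5),随迭代单调下降;论文构造的多层 Transformer 中,每一层对上下文损失所做的正是这样一步更新。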

[LG-8] How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

链接: https://arxiv.org/abs/2605.06605
作者: Shai Feldman,Yaniv Romano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events – e.g., jailbreaks or successful task completion by an agent – often emerge only after repeated interactions. These events might be rare, and under any feasible computational budget, remain unobserved. Recent conformal survival frameworks construct reliable lower predictive bounds (LPBs) on the number of iterations to trigger the event of interest, but rely on static budget allocation that is inefficient in multi-turn setups. To address this, we introduce \emph{Dynamic Allocation via PRojected Optimization} (DAPRO), the first theoretically valid dynamic budget allocation framework for bounding the time-to-event in multi-turn LLM interactions. We prove that DAPRO satisfies the budget constraint and provides distribution-free, finite-sample coverage guarantees without requiring the conditional independence between censoring and event times assumed by prior conformal survival approaches. A key theoretical contribution is a novel coverage bound that scales with the square root of the mean censoring weight rather than the worst-case weight, yielding provably tighter guarantees than prior work. Furthermore, DAPRO can be employed to obtain unbiased, low-variance estimates of population-level evaluation metrics, such as the jailbreak rate, under limited computing resources. Comprehensive experiments across agentic task success, adversarial jailbreaks, toxic content generation, and RAG hallucinations using LLMs such as Llama 3.1 and Qwen 2.5 demonstrate that DAPRO consistently achieves coverage closer to the nominal level with lower variance than static baselines, while satisfying the budget constraint.

[LG-9] Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

链接: https://arxiv.org/abs/2605.06599
作者: Abhijit Das,Sayantan Dutta
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 17 pages, 10 figures

点击查看摘要

Abstract:Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objective (cross-entropy loss with L^2 regularization) by proving it satisfies Villani's criteria for coercive energy functions. Specifically, we show that the regularized loss \mathcal{F} is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies the differential growth condition -\Delta\mathcal{F} + \tfrac{1}{s}|\nabla\mathcal{F}|^2 \to \infty as |\theta| \to \infty for all s>0 . From this structure, we derive explicit log-Sobolev and Poincaré constants C_{\mathrm{LS}} \leq \lambda^{-1} + d/\lambda^2 , linking the regularization strength \lambda and model dimension d to finite-time convergence guarantees for noisy stochastic gradient descent and PAC-Bayesian generalization bounds that tighten with increasing \lambda . To validate our theory, we introduce a scalable Villani diagnostic \Psi_s(\theta) = -\Delta\mathcal{F} + s^{-1}|\nabla\mathcal{F}|^2 and estimate it efficiently using Hutchinson trace probes in models with over 100M parameters. Experiments on GPT-Neo-125M across Penn Treebank and WikiText-103 confirm the predicted quadratic growth of \Psi_s , spectral inflation of the Hessian, and exponential convergence behavior consistent with our log-Sobolev analysis. These results demonstrate that weight decay not only improves generalization empirically but also establishes the mathematical conditions required for fast Langevin mixing and theoretically grounded curvature-aware optimization in deep learning.
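The diagnostic \Psi_s can be estimated without ever forming the Hessian. Below is a minimal sketch (not the authors' code; function names and the toy quadratic loss are illustrative assumptions), using Rademacher Hutchinson probes with finite-difference Hessian-vector products:

```python
import numpy as np

def hutchinson_laplacian(grad_fn, theta, n_probes=64, eps=1e-4, rng=None):
    """Estimate the Laplacian (Hessian trace) of a loss at theta via
    Hutchinson probes: E[v^T H v] = tr(H) for Rademacher v, with the
    Hessian-vector product taken by central finite differences."""
    rng = np.random.default_rng(rng)
    d = theta.size
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=d)  # Rademacher probe
        hvp = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
        total += v @ hvp
    return total / n_probes

def psi_s(grad_fn, theta, s, **kw):
    """Villani diagnostic Psi_s(theta) = -Laplacian(F) + |grad F|^2 / s."""
    g = grad_fn(theta)
    return -hutchinson_laplacian(grad_fn, theta, **kw) + (g @ g) / s
```

For F(θ) = |θ|²/2 (so grad F = θ, tr H = d), Ψ_s grows quadratically in |θ|, matching the coercivity the paper predicts.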

[LG-10] FedAttr: Towards Privacy-preserving Client-Level Attribution in Federated LLM Fine-tuning

链接: https://arxiv.org/abs/2605.06596
作者: Su Zhang,Junfeng Guo,Heng Huang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 39 pages, 4 figures, 21 tables (including appendix)

点击查看摘要

Abstract:Watermark radioactivity tests can detect whether a model was trained on watermarked documents, and have become key tools for protecting data ownership in the fine-tuning of large language models (LLMs). Existing work has proved their effectiveness in centralized LLM fine-tuning. However, such methods face several challenges and remain underexplored in federated learning (FL), a widely applied paradigm for fine-tuning LLMs collaboratively on private data across different users. FL mainly ensures privacy through secure aggregation (SA), which allows the server to aggregate updates while keeping clients’ updates private. This mechanism preserves privacy but makes it difficult to identify which client trained on watermarked documents. In this work, we propose FedAttr, a new client-level attribution protocol for FL. FedAttr identifies which clients trained on watermarked data via a paired-subset-difference mechanism, while preserving the privacy guarantees of SA and FL performance. FedAttr proceeds in three steps: (i) estimate each client’s update by differencing two SA queries, (ii) score the estimate with the watermark detector via differential scoring, and (iii) combine scores across rounds via Stouffer’s method. We theoretically show that FedAttr produces an unbiased estimator of each client’s update with bounded mutual information leakage (i.e., O(d^*/N) per round). Moreover, FedAttr empirically achieves 100% TPR and 0% FPR, outperforming all baselines by at least 44.4% in TPR or 19.1% in FPR, with only 6.3% overhead relative to FL training time. Ablation studies confirm that FedAttr is robust to protocol parameters and configurations.
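Steps (i) and (iii) are easy to sketch in isolation. The following toy (function names and the scalar `weight` are illustrative assumptions, not the FedAttr protocol) shows the difference-of-two-aggregations estimate and Stouffer's combination of per-round z-scores:

```python
import math

def estimate_client_update(agg_with, agg_without, weight=1.0):
    """Paired-subset-difference sketch: estimate one client's update as the
    difference of two secure-aggregation results, one over a subset that
    includes the client and one over the same subset without it."""
    return [(a - b) / weight for a, b in zip(agg_with, agg_without)]

def stouffer_combine(z_scores):
    """Stouffer's method: Z = sum(z_i) / sqrt(k).  Under the null (no
    watermark), each z_i ~ N(0,1) independently, so Z ~ N(0,1) as well."""
    return sum(z_scores) / math.sqrt(len(z_scores))
```

Accumulating moderate per-round evidence this way (e.g. four rounds at z = 2 combine to Z = 4) is what lets weak detector signals add up across training.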

[LG-11] ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting SIGGRAPH2026

链接: https://arxiv.org/abs/2605.06593
作者: David Müller,Agon Serifi,Sammy Christen,Ruben Grandia,Espen Knoop,Moritz Bächer
类目: Robotics (cs.RO); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: SIGGRAPH 2026

点击查看摘要

Abstract:Retargeting human kinematic reference motion onto a robot’s morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motions, which hinder downstream imitation learning. We propose a bilevel optimization framework that jointly adapts reference motions to a robot’s morphology while training a tracking policy using reinforcement learning. To make the optimization tractable, we derive an approximate gradient for the upper-level loss. Our framework requires only a sparse set of semantic rigid-body correspondences and eliminates the need for manual tuning by identifying optimal values for a parameterization expressive enough to preserve characteristic motion across different embodiments. Moreover, by integrating retargeting directly with physics simulation, we produce physically plausible motions that facilitate robust imitation learning. We validate our method in simulation and on hardware, demonstrating challenging motions for morphologies that differ significantly from a human, including retargeting onto a quadruped.

[LG-12] BRICKS: Compositional Neural Markov Kernels for Zero-Shot Radiation-Matter Simulation

链接: https://arxiv.org/abs/2605.06591
作者: Richard Hildebrandt,Evangelos Kourlitis,Baran Hashemi,Manuel Bünstorf,Thierry Meyer,Nikola Boskov,Michael Kagan,Dan Rosenbaum,Sanmay Ganguly,Lukas Heinrich
类目: Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:We introduce a new strategy for compositional neural surrogates for radiation-matter interactions, a key task spanning domains from particle physics through nuclear and space engineering to medical physics. Exploiting the locality and the Markov nature of particle interactions, we create a \emph{next-particle prediction} kernel using hybrid discrete-continuous transformer models based on Riemannian Flow Matching on product manifolds. The model generates variable-sized typed sets of particles and radiation side effects that are the result of the interaction of an incident particle with a material volume. The resulting kernel can be composed to simulate unseen large-scale material distributions in a zero-shot manner. Unlike mechanistic simulators, our model is designed to be differentiable and provides tractable likelihoods for future downstream applications. A significant computational speed-up on GPU compared to CPU-bound mechanistic simulation is observed for single-kernel execution. We evaluate the model at the kernel level and demonstrate predictive stability over multi-round autoregressive rollouts. We additionally release a novel 20M-event radiation-matter interaction dataset for further research.

[LG-13] Distributionally-Robust Learning to Optimize

链接: https://arxiv.org/abs/2605.06585
作者: Vinit Ranjan,Jisun Park,Bartolomeo Stellato
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We propose a distributionally robust approach to learning hyperparameters for first-order methods in convex optimization. Given a dataset of problem instances, we minimize a Wasserstein distributionally robust version of the performance estimation problem (PEP) over algorithm parameters such as step sizes. Our framework unifies two extremes: as the robustness radius vanishes, we recover classical learning to optimize (L2O); as it grows, we recover worst-case optimal algorithm design via PEP. We solve the resulting problem with stochastic gradient descent, differentiating through the solution of an inner semidefinite program at each step. We prove high-probability bounds showing that the true risk of the learned algorithm is at most the in-sample L2O optimum plus a slack that shrinks with the sample size, and is no worse than the worst-case PEP bound. On unconstrained quadratic minimization, LASSO, and linear programming benchmarks, our learned algorithms achieve strong out-of-sample performance with certifiable robustness, outperforming both worst-case optimal and vanilla L2O baselines.

[LG-14] On the Safety of Graph Representation Learning

链接: https://arxiv.org/abs/2605.06576
作者: Xiaoguang Guo,Zehong Wang,Ziming Li,Shawn Spitzel,Soonwoo Kwon,Tianyi Ma,Yanfang Ye,Chuxu Zhang
类目: Machine Learning (cs.LG)
*备注: Preprint. 10 pages main text, appendices included

点击查看摘要

Abstract:Graph representation learning (GRL) has evolved from topology-only graph embeddings to task-specific supervised GNNs, and more recently to reusable representations and graph foundation models (GFMs). However, existing evaluations mainly measure clean transfer, adaptation, and task coverage. It remains unclear whether GRL methods stay reliable when deployment stresses affect graph signals, graph contexts, label support, structural groups, or predictive evidence. We introduce GRL-Safety, a multi-axis safety evaluation benchmark for GRL. GRL-Safety evaluates twelve representative methods, spanning topology-only embedding methods, supervised GNNs, self-supervised graph models, and GFMs, on twenty-five graph datasets under standardized evaluation conditions while preserving method-native adaptation. The evaluation covers five safety axes: corruption robustness, OOD generalization, class imbalance, fairness, and interpretation, with per-axis and sub-condition reporting rather than a single aggregate score. Our analysis yields three cross-axis insights that can inspire future research. First, safety behavior is shaped by the interaction between representation design and the stressed graph factor, rather than by method family alone. Second, foundation-era methods show axis-specific strengths rather than broad safety dominance. Third, several deployment regimes remain difficult even for the best evaluated method, revealing capability gaps that require new robustness, adaptation, or training objectives beyond model selection. The benchmark, evaluation protocols, and code are available at: this https URL.

[LG-15] CLAD: A Clustered Label-Agnostic Federated Learning Framework for Joint Anomaly Detection and Attack Classification

链接: https://arxiv.org/abs/2605.06571
作者: Iason Ofeidis,Nikos Papadis,Randeep Bhatia,Leandros Tassiulas,TV Lakshman
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注: 12 pages, 7 figures, 5 tables

点击查看摘要

Abstract:The rapid expansion of the Internet of Things (IoT) and Industrial IoT (IIoT) has created a massive, heterogeneous attack surface that challenges traditional network security mechanisms. While Federated Learning (FL) offers a privacy-preserving alternative to centralized Intrusion Detection Systems (IDS), standard approaches struggle to generalize across diverse device behaviors and typically fail to utilize the vast amounts of unlabeled data present in realistic edge environments. To bridge these gaps, we propose CLAD, a holistic framework that seamlessly integrates Clustered Federated Learning (CFL) with a novel Dual-Mode Micro-Architecture ( \text{DM}^2\text{A} ). This unified approach simultaneously tackles the two primary bottlenecks of IoT security: device heterogeneity and label scarcity. The \text{DM}^2\text{A} component features a shared encoder followed by two branches, enabling joint unsupervised anomaly detection and supervised attack classification; this allows the framework to harvest intelligence from both labeled and unlabeled clients. Concurrently, the clustering component dynamically groups devices with congruent traffic patterns, preventing global model divergence. By carefully combining these elements, CLAD ensures that no data is discarded and distinct operational patterns are preserved. Extensive evaluations demonstrate that this integrated approach significantly outperforms state-of-the-art baselines, achieving a 30% relative improvement in detection performance in scenarios with 80% unlabeled clients, with only half the communication cost.
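The shared-encoder/two-branch idea can be sketched in a few lines. All sizes, names, and the random initialization below are illustrative assumptions (forward pass only, no training), not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class DualModeMicroArch:
    """Toy shared encoder with two heads: a reconstruction branch that
    yields an unsupervised anomaly score, and a softmax branch for
    supervised attack classification."""
    def __init__(self, d_in=10, d_hid=16, n_classes=4):
        self.W_enc = rng.normal(0, 0.1, (d_in, d_hid))
        self.W_dec = rng.normal(0, 0.1, (d_hid, d_in))      # anomaly branch
        self.W_cls = rng.normal(0, 0.1, (d_hid, n_classes))  # classifier branch

    def forward(self, x):
        h = relu(x @ self.W_enc)                 # shared representation
        recon = h @ self.W_dec
        anomaly_score = np.mean((x - recon) ** 2, axis=-1)
        logits = h @ self.W_cls
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs = probs / probs.sum(axis=-1, keepdims=True)
        return anomaly_score, probs
```

Labeled clients can train both heads while unlabeled clients train only the reconstruction branch through the shared encoder, which is how no data is discarded.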

[LG-16] SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation

链接: https://arxiv.org/abs/2605.06570
作者: Dmitri Goloubentsev,Natalija Karpichina
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Finance (q-fin.CP); Mathematical Finance (q-fin.MF); Risk Management (q-fin.RM)
*备注: 27 pages, 8 tables. Three domains: natural gas storage, pension fund ALM, pharmaceutical manufacturing. Benchmark code and trained policies available on request

点击查看摘要

Abstract:Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor chain. Dynamic programming solves small instances exactly but scales exponentially in state dimensions. Black-box reinforcement learning handles high-dimensional states but trains slowly and produces no sensitivities. We introduce SNAPO (Smooth Neural Adjoint Policy Optimization), a framework that embeds a neural policy inside a known, differentiable simulator, replaces hard constraints with smooth approximations, and computes exact gradients of the objective with respect to all policy parameters and all inputs in a single adjoint pass. We demonstrate SNAPO on three domains: natural gas storage (training in under a minute, 365 forward curve sensitivities at no additional cost per sensitivity), pension fund asset-liability management (6.5x-200x sensitivity speedup over bump-and-revalue, scaling with the number of risk factors), and pharmaceutical manufacturing (cross-unit sensitivities through a 4-unit process chain, with 20 ICH Q8 regulatory sensitivities from 5 adjoint passes in 74.5 milliseconds). All sensitivities are produced by the same backward pass that trains the policy, at a cost proportional to one reverse pass regardless of how many sensitivities are computed.
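The "replace hard constraints with smooth approximations" step can be illustrated with a softplus barrier standing in for a hard box constraint. This is a generic sketch under assumed parameter names, not SNAPO's actual constraint handling:

```python
import numpy as np

def softplus(z, beta=10.0):
    # Numerically stable softplus; larger beta hugs max(z, 0) more tightly.
    return np.logaddexp(0.0, beta * z) / beta

def smooth_penalty(x, lo, hi, beta=10.0):
    """Smooth stand-in for the hard box constraint lo <= x <= hi:
    approximately zero inside the box, growing linearly outside, and
    differentiable everywhere, so adjoint gradients pass through it."""
    return softplus(lo - x, beta) + softplus(x - hi, beta)
```

Because the penalty is smooth, a single reverse-mode pass through simulator plus penalty yields gradients for every policy parameter at once, which is the source of the quoted sensitivity speedups.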

[LG-17] Criticality and Saturation in Orthogonal Neural Networks

链接: https://arxiv.org/abs/2605.06563
作者: Max Guillen,Jan E. Gerken
类目: Machine Learning (cs.LG)
*备注: 11 pages + Appendices

点击查看摘要

Abstract:It has been known for a long time that initializing weight matrices to be orthogonal instead of having i.i.d. Gaussian components can improve training performance. This phenomenon can be analyzed using finite-width corrections, where the infinite-width statistics are supplemented by a power series in 1/\mathrm{width} . In particular, recent empirical results by Day et al. show that the tensors appearing in this treatment stabilize for large depth, as opposed to the tensors of i.i.d.-initialized networks. In this article, we derive explicit layer-wise recursion relations for the tensors appearing in the finite-width expansion of the network statistics in the case of orthogonal initializations. We also provide an extension of recently-introduced Feynman diagrams for the corresponding recursions in the i.i.d. case which are valid to all orders in 1/\mathrm{width} . Finally, we show explicitly that the recursions we derive reproduce the stability of the finite-width tensors which was observed for activation functions with vanishing fixed point. This work therefore provides a theoretical explanation for the stability of nonlinear networks of finite width initialized with orthogonal weights, closing a long-standing gap in the literature. We validate our theoretical results experimentally by showing that numerical solutions of our recursion relations and their analytical large-depth expansions agree excellently with Monte-Carlo estimates from network ensembles.

[LG-18] Feature Dimensionality Outweighs Model Complexity in Breast Cancer Subtype Classification Using TCGA-BRCA Gene Expression Data

链接: https://arxiv.org/abs/2605.06562
作者: Meena Al Hasani
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 8 pages, 4 figures, 3 tables. Independent research study using TCGA-BRCA RNA-seq data

点击查看摘要

Abstract:Accurate classification of breast cancer subtypes from gene expression data is critical for diagnosis and treatment selection. However, such datasets are characterized by high dimensionality and limited sample size, posing challenges for machine learning models. In this study, we evaluate the impact of model complexity and feature selection on subtype classification performance using TCGA-BRCA gene expression data. Logistic regression, random forest, and support vector machine (SVM) models were trained using varying numbers of highly variable genes (50 to 20,518). Performance was evaluated using stratified 5-fold cross-validation and assessed with accuracy and macro F1 score. While all models achieved high accuracy, macro F1 analysis revealed substantial differences in subtype-level performance. Logistic regression demonstrated the most stable and balanced performance across subtypes, including improved detection of rare classes. Random forest underperformed on minority subtypes despite strong overall accuracy, while SVM showed sensitivity to feature dimensionality. These findings highlight the importance of model simplicity, evaluation metrics, and feature selection in high-dimensional biological classification tasks.
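Macro F1, which drives the subtype-level comparison above, averages per-class F1 so that rare subtypes weigh as much as common ones. A from-scratch sketch of the standard definition (rather than scikit-learn's `f1_score(average='macro')`):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro F1 = mean over classes of 2*TP / (2*TP + FP + FN).
    Unlike accuracy, a class with few samples contributes a full 1/n_classes
    share, which is why it exposes minority-subtype failures."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))
```

For example, predicting the majority class on labels [0,0,0,1] gives 75% accuracy but macro F1 of only 3/7 ≈ 0.43, illustrating the gap the abstract describes.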

[LG-19] Optimal Counterfactual Search in Tree Ensembles: A Study Across Modeling and Solution Paradigms

链接: https://arxiv.org/abs/2605.06561
作者: Awa Khouna,Youssouf Emine,Julien Ferry,Thibaut Vidal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Trust in counterfactual explanations depends critically on whether their recommended changes are truly minimal: suboptimal explanations may vastly overshoot the actual changes needed to alter a decision, and heuristic errors can affect individuals unevenly, giving some users relevant recourse while assigning others unnecessarily costly recommendations. Consequently, we study the problem of computing optimal counterfactual explanations for tree ensembles under plausibility and actionability constraints. This is a combinatorial problem: for a fixed model, counterfactual search boils down to selecting consistent branching decisions and threshold-defined regions under a distance objective. We exploit this structure through CPCF, a constraint programming (CP) formulation in which numerical features are encoded as interval domains induced by split thresholds, while discrete features retain native finite-domain representations. This yields a compact finite-domain formulation that supports multiple distance objectives without continuous split-boundary search. We then place CPCF in a broader comparison across mathematical programming paradigms: we extend a maximum Boolean satisfiability (MaxSAT) formulation, originally designed for hard-voting random forests, to soft-voting ensembles, and compare against the current state-of-the-art mixed-integer linear programming (MILP) optimal approach. Across ten datasets and three types of tree ensembles, we analyze scalability, anytime performance, and sensitivity to distance metrics. We observe that CP achieves the best overall performance. More importantly, our results identify regimes in which the specific strengths of each paradigm make it best suited: CP is most versatile overall, MaxSAT handles hard-voting ensembles particularly well, and MILP remains competitive in amortized inference settings with a moderate number of split levels.
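The interval-domain encoding at the heart of CPCF can be sketched directly: the split thresholds of one numerical feature partition the real line into finitely many intervals, so continuous counterfactual search reduces to a finite choice per feature. Function names and the open/closed convention below are illustrative assumptions:

```python
def threshold_intervals(thresholds):
    """Candidate domains for one numerical feature: sorted distinct split
    thresholds t_1 < ... < t_k induce the k+1 intervals
    (-inf, t_1], (t_1, t_2], ..., (t_k, inf)."""
    ts = sorted(set(thresholds))
    bounds = [float("-inf")] + ts + [float("inf")]
    return list(zip(bounds[:-1], bounds[1:]))

def nearest_value_in_interval(x, lo, hi, eps=1e-6):
    """Closest attainable value to x inside (lo, hi]; used to price the
    L1/L2 cost of moving a feature into a given interval."""
    if lo < x <= hi:
        return x
    return hi if x > hi else lo + eps
```

A CP solver then branches over one interval per feature, with the distance objective summing each feature's nearest-value cost, instead of searching split boundaries continuously.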

[LG-20] Diverse Sampling in Diffusion Models with Marginal Preserving Particle Guidance

链接: https://arxiv.org/abs/2605.06553
作者: Gal Vinograd,Idan Achituve,Ethan Fetaya
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:We present EDDY (Exact-marginal Diversification via Divergence-free dYnamics), a guidance mechanism for diffusion and flow matching models that promotes diversity among generated samples while maintaining quality. EDDY exploits symmetries of the Fokker-Planck equation, using drift perturbations that change particle trajectories while preserving the evolving marginal distribution. We instantiate this principle through kernel-based anti-symmetric pairwise matrix fields, constructed from the repulsive directions. The resulting divergence-free dynamics promote diversity at the joint particle level while preserving each particle’s marginal distribution without any additional training. As computing the guidance can be computationally expensive in cases such as text-to-image generation with perceptual embeddings, we propose practical approximations as an effective and efficient solution. Experiments on synthetic distributions and text-to-image generation show that EDDY improves diversity while maintaining strong distributional fidelity compared to common baselines.
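The marginal-preserving principle behind such perturbations can be checked numerically in a toy case: for a constant skew-symmetric matrix S and smooth potential φ, the drift v = S∇φ has zero divergence (since ⟨S, ∇²φ⟩ vanishes for antisymmetric S and symmetric Hessian), so adding it leaves Fokker-Planck marginals intact. This illustrates the general idea only, not EDDY's kernel construction:

```python
import numpy as np

def skew_drift(grad_phi, S):
    """Drift v(x) = S @ grad(phi)(x) with S skew-symmetric.
    div v = sum_ij S_ij d^2 phi / dx_i dx_j = 0 because the Hessian is
    symmetric while S is antisymmetric."""
    return lambda x: S @ grad_phi(x)

def numeric_divergence(v, x, eps=1e-5):
    """Central finite-difference divergence of a vector field at x."""
    d = x.size
    div = 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        div += (v(x + e)[i] - v(x - e)[i]) / (2 * eps)
    return div
```

With φ(x) = |x|⁴ and S the 2D rotation generator, the divergence evaluates to zero up to discretization error at any point.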

[LG-21] Sequential Design of Genetic Circuits Under Uncertainty With Reinforcement Learning

链接: https://arxiv.org/abs/2605.06552
作者: Michal Kobiela,Diego A. Oyarzún,Michael U. Gutmann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The design of biological systems is hindered by uncertainty arising from both intrinsic stochasticity of biomolecular reactions and variability across laboratory or experimental conditions. In this work, we present a sequential framework to optimize genetic circuits under both forms of uncertainty. By employing simulator models based on differential equations or Markov jump processes alongside a reinforcement learning (RL) policy-based approach, our method suggests experiments that adapt to unknown laboratory conditions while accounting for inherent stochasticity. While previous Bayesian methods address uncertainty through iterative experiment-inference-optimization cycles, they typically require computationally expensive inference and optimization steps after each experimental round, leading to delays. To overcome this bottleneck, we propose an amortized approach trained up-front across a distribution of possible uncertain parameters. This strategy sidesteps the need for explicit parameter inference during the design cycle, enabling immediate, observation-based adaptation. We demonstrate our framework on models for heterologous gene expression and a repressilator circuit, showing that it efficiently handles both molecular noise and cross-laboratory variability.

[LG-22] Hedging Memory Horizons for Non-Stationary Prediction via Online Aggregation

链接: https://arxiv.org/abs/2605.06541
作者: Yutong Wang,Yannig Goude,Qiwei Yao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint

点击查看摘要

Abstract:We study online prediction under distribution shift, where inputs arrive chronologically and outcomes are revealed only after prediction. In this setting, predictors must remain stable in quiet regimes yet adapt when regimes shift, and the right adaptation memory is unknown in advance. We propose MELO (Memory-hedged Exponentially Weighted Least-Squares Online aggregation), a model-agnostic method that hedges across adaptation scales: it wraps any non-anticipating base-predictor pool with exponentially weighted least-squares (EWLS) adaptation experts at multiple forgetting factors, and aggregates raw and EWLS-adapted forecasts with MLpol, a parameter-free online aggregation rule. Under boundedness conditions, we establish deterministic oracle inequalities showing that it competes with both the best raw predictor and the best bounded, time-varying affine combinations of the base predictions, up to a path-length-dependent tracking cost and a sublinear aggregation overhead. We evaluate MELO on French national electricity-load forecasting through the COVID-19 lockdown using no regime indicators, lockdown dates, or policy covariates. MELO reduces overall RMSE by 34.7% relative to base-only MLpol and achieves lower overall RMSE than a TabICL reference supplied with an external COVID policy-response covariate. Moreover, MELO requires only lightweight per-step recursive updates without model retraining.
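The EWLS adaptation experts correspond to classical recursive least squares with a forgetting factor, one instance per memory horizon. A self-contained sketch (hyperparameter values are illustrative, not MELO's settings):

```python
import numpy as np

class EWLS:
    """Exponentially weighted least squares: recursively minimizes
    sum_t lam^(T-t) * (y_t - w^T x_t)^2.  Smaller lam forgets faster,
    i.e. adapts more aggressively to regime shifts."""
    def __init__(self, dim, lam=0.99, delta=100.0):
        self.w = np.zeros(dim)
        self.P = delta * np.eye(dim)  # inverse-covariance proxy
        self.lam = lam

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y):
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)          # gain vector
        self.w = self.w + k * (y - self.w @ x)
        self.P = (self.P - np.outer(k, Px)) / self.lam
```

MELO runs several such experts with different `lam` values over the base predictions and lets the parameter-free aggregation rule weight them, hedging over the unknown right memory.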

[LG-23] Diffusion-Based Posterior Sampling: A Feynman-Kac Analysis of Bias and Stability

链接: https://arxiv.org/abs/2605.06538
作者: Matias G. Delgadino,Sebastien Motsch,Advait Parulekar,William Porteous,Sanjay Shakkottai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-based posterior samplers use pretrained diffusion priors to sample from measurement- or reward-conditioned posteriors, and are widely used for inverse problems. Yet their theoretical behavior remains poorly understood: even with exact prior scores, their outputs are biased, and in low-temperature regimes their discretizations can become unstable. We characterize this bias by introducing a tractable surrogate path connecting the true posterior to a standard Gaussian and comparing it to the sampler’s path. Their density ratio satisfies a parabolic PDE whose reaction term measures the accumulated bias. A Feynman-Kac representation then expresses the Radon-Nikodym correction as an explicit path expectation, identifying which posterior regions are over- or under-sampled. We apply this framework to DPS and STSL, a related sampler. For DPS, the correction is an Ornstein-Uhlenbeck path expectation coupling the data conditional covariance with the reward curvature, revealing where DPS over- or under-samples. Next, we reinterpret STSL as an auxiliary drift that steers trajectories toward low-uncertainty regions, flattening the spatially varying part of the DPS reaction term. Finally, we characterize early guidance-stopping, a common mitigation for low-temperature instabilities caused by forward-Euler integration of the vector field. Together, these results clarify sampler bias, explain existing correctives, and guide stable variant designs.

[LG-24] Efficient Techniques for Data Reconstruction with Finite-Width Recovery Guarantees

链接: https://arxiv.org/abs/2605.06519
作者: Edward Tansley,Roy Makhlouf,Estelle Massart,Coralia Cartis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data reconstruction attacks on trained neural networks aim to recover the data on which the network has been trained and pose a significant threat to privacy, especially if the training dataset contains sensitive information. Here, we propose a unified optimization formulation of the data reconstruction problem based on initial and trained parameter values, incorporating state-of-the-art proposals. We show that in the random feature model, this formulation provably leads to training data reconstruction with high probability, provided the network width is sufficiently large; this unprecedented finite-width result uses PAC-style bounds. Furthermore, when the data lies in a low-dimensional subspace, we show that the network width requirement for successful reconstruction can be relaxed, with bounds depending on the subspace dimension rather than the ambient dimension. For general neural network models and unknown data orientations, we propose an efficient reconstruction algorithm that approximates the low-dimensional data subspace through the change in the first-layer weights during training and uses only the last-layer weights for reconstruction, thus reducing the search space dimension and the required network width for high-quality reconstructions. Our numerical experiments on synthetic datasets and CIFAR-10 confirm that our subspace-aware reconstruction approach outperforms standard full-space techniques.

[LG-25] Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

链接: https://arxiv.org/abs/2605.06472
作者: Haoyu Zheng,Fangcheng Fu,Jia Wu,Binhang Yuan,Yongqiang Zhang,Hao Wang,Yuanyuan Zhu,Xiao Yan,Jiawei Jiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, allowing KV-Cache reuse to save computation. Existing approaches either manage KV-Cache at agent level and fail to exploit the reuse opportunities within workflows, or manage cache at the workflow level but assume that each workflow calls a static sequence of agents. However, practical workflows are typically dynamic, where the sequence of invoked agents and thus induced cache reuse opportunities depend on the context of each task. To serve such dynamic workflows efficiently, we build a system dubbed PBKV (\textbfPrediction-\textbfBased \textbfKV-Cache Management). For each workflow, PBKV predicts the agent invocations in several future steps by fusing the guidance from historical workflows and context of the target workflow. Based on the predictions, PBKV estimates the reuse potential of cache entries and keeps the high-potential entries in GPU memory. To be robust to prediction errors, PBKV utilizes the predictions conservatively during both cache eviction and prefetching. Experiments on three workflow benchmarks show that PBKV achieves up to 1.85\times speedup over LRU on dynamic workflows, and up to 1.26\times speedup over the SOTA baseline KVFlow on the static workflow.
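Prediction-based eviction can be sketched as ranking cache entries by an estimated reuse potential. This is a toy version under assumed names; it omits PBKV's conservative handling of prediction errors and its prefetching path:

```python
def evict(cache, predicted_reuse, n_evict):
    """Evict the n_evict KV-Cache entries with the lowest predicted reuse
    potential (e.g. derived from predicted future agent invocations),
    instead of LRU's recency-only ranking."""
    ranked = sorted(cache, key=lambda key: predicted_reuse.get(key, 0.0))
    victims = ranked[:n_evict]
    for key in victims:
        del cache[key]
    return victims
```

Under LRU an agent's shared prefix can be evicted right before the workflow re-invokes that agent; scoring by predicted reuse keeps such entries resident.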

[LG-26] Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies

链接: https://arxiv.org/abs/2605.06470
作者: Magnus Victor Boock,Abdullah Akgül,Mustafa Mert Çelikok,Melih Kandemir
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a new operator-theoretic representation learning framework for offline reinforcement learning that recovers the directed temporal geometry of a controlled Markov process from hitting time observations. While prior art often produces symmetric distances or fails to satisfy the triangle inequality, our framework learns a Hilbert-space displacement geometry where expected hitting times are realized as linear functionals of latent displacements. We prove that this representation exists under latent linear closure and is uniquely identifiable up to a bounded linear isomorphism. For finite-dimensional implementations, we show that global hitting-time error is bounded by one-step transition error amplified by the environment’s transient spectral radius. Furthermore, we provide finite-sample guarantees accounting for approximation, statistical complexity, and trajectory-label mismatch. Derived from this theory, we curate Isomorphic Embedding Learning (IEL) as a new goal-agnostic foundation policy learning algorithm that anchors a HILP-style consistency objective with explicit hitting-time regression to ensure that the learned geometry reflects actual decision-time progress. This asymmetric and compositional structure enables robust graph-based multi-stage planning for long-horizon navigation. Our experiments demonstrate that IEL improves the state of the art of learning foundation policies from offline maze locomotion data. Our code can be found on this https URL
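Expected hitting times, the supervision signal above, have a closed form in a finite Markov chain: with the target state removed, h = 1 + Q h on the remaining states, i.e. h = (I - Q)^{-1} 1. A small sketch (not the paper's estimator, which regresses hitting times from trajectories):

```python
import numpy as np

def expected_hitting_times(P, target):
    """Expected hitting times of `target` under transition matrix P:
    h[target] = 0, and on the other states h solves (I - Q) h = 1, where
    Q is P restricted to the non-target states.  Note h is naturally
    asymmetric: reaching B from A can be cheap while A from B is not."""
    n = P.shape[0]
    others = [i for i in range(n) if i != target]
    Q = P[np.ix_(others, others)]
    h_others = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
    h = np.zeros(n)
    h[others] = h_others
    return h
```

For the deterministic chain 0 → 1 → 2, hitting state 2 takes 2 steps from state 0 and 1 from state 1, matching the solve.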

[LG-27] No Triangulation Without Representation: Generalization in Topological Deep Learning

链接: https://arxiv.org/abs/2605.06467
作者: Johannes S. Schmidt,Martin Carrasco,Ernst Röell,Guy Wolf,Nello Blaser,Bastian Rieck
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:Despite an ever-increasing interest in topological deep learning models that target higher-order datasets, there is no consensus on how to evaluate such models. This is exacerbated by the fact that topological objects permit operations, such as structural refinements, that are not appropriate for graph data. In this work, we extend MANTRA, a benchmark dataset containing manifold triangulations, to a larger class of manifolds with more diverse homeomorphism types. We show that, unlike prior claims, both graph neural networks (GNNs) and higher-order message passing (HOMP) methods can saturate the benchmark. However, we find that this is contingent on the right representation and feature assignment, emphasizing their importance in baseline models. We thus provide a novel evaluation protocol based on representational diversity and triangulation refinement. Surprisingly, we find no indication that existing models are capable of generalizing beyond the combinatorial structure of the data. This points towards a research gap in developing models that understand topological structure independent of scale. Our work thus provides the necessary scaffolding to evaluate future models and enable the development of topology-aware inductive biases.

[LG-28] Diversity Curves for Graph Representation Learning

链接: https://arxiv.org/abs/2605.06466
作者: Katharina Limbeck,Nadja Häusermann,Martin Carrasco,Guy Wolf,Bastian Rieck
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph-level representations are crucial tools for characterising structural differences between graphs. However, comparing graphs with different cardinalities, even when sampled from the same underlying distribution, remains challenging. Unsupervised tasks in particular require interpretable, scalable, and reliable size-aware graph representations. Our work addresses these issues by tracking the structural diversity of a graph across coarsening levels. The resulting graph embeddings, which we denote diversity curves, are interpretable by construction, efficient, and directly comparable across coarsening hierarchies. Specifically, we track the spread of graphs, a novel isometry invariant that is inherently well-suited for encoding the metric diversity and geometry of graphs. We utilise edge contraction coarsening and prove that this improves expressivity, thus leading to more powerful graph-level representations than structural descriptors alone. Demonstrating their utility over a range of baseline methods in practice, we use diversity curves to (i) cluster and visualise simulated graphs across varying sizes, (ii) distinguish the geometry of single-cell graphs, (iii) compare the structure of molecular graph datasets, and (iv) characterise geometric shapes.

[LG-29] Invariant-Based Diagnostics for Graph Benchmarks

链接: https://arxiv.org/abs/2605.06462
作者: Richard von Moos,Mathieu Alain,Bastian Rieck
类目: Machine Learning (cs.LG); Combinatorics (math.CO)
*备注:

点击查看摘要

Abstract:Progress on graph foundation models is hindered by benchmark practices that conflate the contributions of node features and graph structure, making it hard to tell whether a model actually learns from connectivity, or whether it even needs to. We propose addressing this using graph invariants, i.e., permutation-invariant, task-agnostic structural descriptors that serve as a diagnostic framework for graph benchmarks. We show that (i) invariants are more expressive than standard GNNs, (ii) invariants characterize structural heterogeneity within and across benchmark datasets, (iii) invariants predict multi-task performance, and (iv) simple invariant-based models are competitive with, and sometimes exceed, transformer and message-passing baselines across 26 datasets. Our results suggest that expressivity is not the main driver of predictive performance, and that on tasks where structure matters, a non-trainable structural proxy often matches trained message-passing models. We thus posit that invariant baselines should become a standard for evaluating whether structure is required for a task and whether a model picks up on it, serving as a stepping stone towards graph foundation models.

[LG-30] MINER: Mining Multimodal Internal Representation for Efficient Retrieval

链接: https://arxiv.org/abs/2605.06460
作者: Weien Li,Rui Song,Zeyu Li,Haochen Liu,Gonghao Zhang,Difan Jiao,Zhenwei Tang,Bowei He,Haolun Wu,Xue Liu,Ye Yuan
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Visual document retrieval has become essential for accessing information in visually rich documents. Existing approaches fall into two camps. Late-interaction retrievers achieve strong quality through fine-grained token-level matching but store hundreds of vectors per page, incurring large index footprints and high serving costs. By contrast, dense single-vector retrievers retain storage and latency advantages but consistently lag in quality because they compress all information into a single final-layer embedding. In this work, we first conduct a layerwise diagnostic on single-vector retrievers, revealing that retrieval-relevant signal resides in internal representations. Motivated by these findings, we propose MINER (Mining Multimodal Internal RepreseNtation for Efficient Retrieval), a lightweight plug-in module that probes and fuses internal signals across transformer layers into a single compact embedding without modifying the backbone or sacrificing single-vector efficiency. The first Retrieval-Aligned Layer Probing stage attaches a lightweight probe at each layer, surfacing which dimensions carry retrieval-relevant information. The subsequent Adaptive Sparse Multi-Layer Fusion stage applies performance-adaptive neuron-level masking to the selected layers and fuses the surviving signals into the final dense vector. Across ViDoRe V1/V2/V3, MINER outperforms existing dense single-vector retrievers on the majority of benchmarks, with up to 4.5% nDCG@5 improvement over its corresponding backbone. Compared to strong late-interaction baselines, in some settings MINER substantially narrows the nDCG@5 gap to 0.2 while preserving the storage and serving advantages of dense retrieval.
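
A rough sketch of the fusion step described above, with hypothetical shapes and hand-picked gates (in MINER the probing, masks, and gates are learned):

```python
import numpy as np

def fuse_layers(layer_states, layer_masks, layer_gates):
    """Sparse multi-layer fusion sketch: apply a neuron-level mask to each
    selected layer's state, gate, sum, and L2-normalize into one embedding."""
    fused = np.zeros_like(layer_states[0])
    for h, m, g in zip(layer_states, layer_masks, layer_gates):
        fused += g * (h * m)
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(0)
states = [rng.normal(size=16) for _ in range(3)]                  # probed internal layers
masks = [(rng.random(16) > 0.5).astype(float) for _ in range(3)]  # neuron-level masks
emb = fuse_layers(states, masks, layer_gates=[0.2, 0.3, 0.5])
print(emb.shape)
```

The result is still a single dense vector, which is what preserves single-vector storage and serving costs.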

[LG-31] Scene-Adaptive Continual Learning for CSI-based Human Activity Recognition with Mixture of Experts

链接: https://arxiv.org/abs/2605.06447
作者: Wenhan Zheng,Yuyi Mao,Ivan Wang-Hei Ho
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, 3 tables, this article was submitted to IEEE for possible publication

点击查看摘要

Abstract:Channel state information (CSI)-based human activity recognition (HAR) is vulnerable to performance degradation under domain shifts across varying physical environments. Continual learning (CL) offers a principled way to learn new domains sequentially while preserving past knowledge, but existing CL solutions for CSI-based HAR scale poorly with accumulating domains, rely on a large replay buffer, or incur linearly growing inference cost. In this letter, we propose Scene-Adaptive Mixture of Experts with Clustered Specialists (SAMoE-C), which formulates cross-domain CSI-based HAR as a mixture-of-experts system that enables scene-specific adaptation, via an attention-based semantic router that activates only selected experts for each input. Moreover, we develop a novel training protocol, which requires only a tiny replay buffer for stabilizing domain discrimination of the router. Experimental results on a four-scene CSI dataset demonstrate that SAMoE-C approaches the state-of-the-art accuracy, while maintaining a significantly lower inference cost. By jointly combining modular experts, selective activation with router and a lightweight training protocol, SAMoE-C enables scalable cross-domain CSI-based HAR deployment with low training overhead and high computational efficiency in real-world settings.
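
The selective-activation idea behind such a router can be sketched with a dot-product scorer that runs only the top-k experts. This is a toy mixture-of-experts gate, not SAMoE-C's architecture; all shapes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def route_and_mix(x, expert_keys, expert_weights, top_k=2):
    """Score experts against the input (attention-style dot product),
    keep only the top-k, renormalise, and mix their outputs."""
    scores = expert_keys @ x                     # (n_experts,)
    top = np.argsort(scores)[-top_k:]            # indices of selected experts
    gate = softmax(scores[top])                  # renormalised gate over top-k
    # Each expert here is a simple linear map; only selected experts run,
    # which is where the inference-cost saving comes from.
    outputs = np.stack([expert_weights[i] @ x for i in top])
    return gate @ outputs, top

d, n_experts = 8, 4
keys = rng.normal(size=(n_experts, d))
weights = rng.normal(size=(n_experts, d, d))
y, active = route_and_mix(rng.normal(size=d), keys, weights, top_k=2)
print(y.shape, sorted(active.tolist()))
```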

[LG-32] FedFrozen: Two-Stage Federated Optimization via Attention Kernel Freezing

链接: https://arxiv.org/abs/2605.06446
作者: Junye Du,Zhenghao Li,Yushi Feng,Long Feng
类目: Machine Learning (cs.LG)
*备注: 25 pages

点击查看摘要

Abstract:Federated learning with heterogeneous clients remains a significant challenge for deep learning, primarily due to client drift arising from inconsistent local updates. Existing federated optimization methods typically address this issue through objective-level regularization or update-correction mechanisms. Recent studies, however, suggest that Transformer-based architectures may be inherently more robust than conventional models under heterogeneous federated training. Motivated by this observation, we investigate how different parameter components within the attention mechanism influence federated optimization. Specifically, we decompose the attention module into a query/key block, which determines the attention kernel, and a value block, which performs semantic transformation under the induced kernel. Based on this perspective, we propose FedFrozen, a two-stage federated optimization framework that first performs full-model warm-up training and then freezes the query/key block while continuing to optimize the value block. Under a linear-attention formulation, we show that the warm-up stage can be interpreted as an inexact descent procedure on a regularized kernel-profile objective, while the frozen stage reduces to a restricted value-block optimization problem under a fixed attention kernel. Our analysis further reveals an explicit trade-off that governs the choice of warm-up length. Simulations validate the predicted bias-drift behavior, and real-data experiments demonstrate that FedFrozen improves both the stability and effectiveness of Transformer models in heterogeneous federated learning.
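
The two-stage schedule reduces to controlling which parameter blocks receive updates. A minimal sketch with scalar stand-ins for the query/key/value blocks (the real method freezes whole weight matrices inside each attention layer):

```python
# Scalar stand-ins for the three attention parameter blocks. During warm-up
# every block receives gradient updates; afterwards the query/key block
# (which fixes the attention kernel) is frozen and only the value block learns.
params = {"query": 1.0, "key": 1.0, "value": 1.0}
grads = {"query": 0.5, "key": 0.5, "value": 0.5}

def sgd_step(params, grads, lr, frozen=()):
    return {k: (v if k in frozen else v - lr * grads[k]) for k, v in params.items()}

params = sgd_step(params, grads, lr=0.1)                           # stage 1: full warm-up
params = sgd_step(params, grads, lr=0.1, frozen=("query", "key"))  # stage 2: frozen kernel
print(params)  # query/key unchanged by stage 2; value keeps moving
```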

[LG-33] Federated Cross-Client Subgraph Pattern Detection

链接: https://arxiv.org/abs/2605.06433
作者: Selin Ceydeli,Rui Wang,Kubilay Atasu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Subgraph pattern detection aims to uncover complex interaction structures in graphs. However, state-of-the-art graph neural network (GNN)-based solutions assume centralized access to the entire graph. When graphs are instead distributed across multiple parties, client-local GNN computations diverge from those of a centralized model, resulting in a representation-equivalence gap. We formalize this as a structural observability problem, where subgraph patterns crossing partition boundaries become locally unidentifiable. To bridge this gap, we propose a per-step, layer-wise embedding exchange framework in which clients synchronize intermediate node representations at each layer of the forward pass, without exposing raw features or labels. Under an extended-subgraph assumption and shared model parameters across clients, this framework recovers the same node representations as a centralized GNN over the full graph. Experiments on synthetic directed multigraphs with cycles, bicliques, and scatter-gather patterns show that embedding exchange and federated parameter aggregation are complementary rather than interchangeable: their combination recovers most of the representation gap, provided exchanged embeddings are fresh per-step rather than stale per-epoch.
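
The representation-equivalence claim is easy to check on a toy graph: with fresh per-layer exchange of boundary embeddings, client-local synchronous updates reproduce the centralized GNN exactly. A small sketch under a simple mean-aggregation layer (an assumption; the paper's setting is more general):

```python
import numpy as np

# Path graph 0-1-2-3 split across two clients: A owns {0, 1}, B owns {2, 3}.
# The cross edge (1, 2) makes boundary representations locally unidentifiable
# unless fresh embeddings are exchanged before every layer.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
feats = {v: np.array([float(v)]) for v in adj}

def mp_layer(h, nodes):
    """One synchronous mean-aggregation message-passing step for `nodes`."""
    return {v: (h[v] + np.mean([h[u] for u in adj[v]], axis=0)) / 2 for v in nodes}

# Centralized GNN over the full graph.
h_central = dict(feats)
for _ in range(2):
    h_central = mp_layer(h_central, list(adj))

# Federated with per-step exchange: each client updates only its own nodes,
# but reads the other client's current boundary embeddings at every layer.
h_fed = dict(feats)
for _ in range(2):
    upd_a = mp_layer(h_fed, [0, 1])   # client A, using exchanged h_fed[2]
    upd_b = mp_layer(h_fed, [2, 3])   # client B, using exchanged h_fed[1]
    h_fed = {**upd_a, **upd_b}

print(all(np.allclose(h_central[v], h_fed[v]) for v in adj))
```

Replacing the per-step exchange with stale per-epoch embeddings breaks this exact equality, which matches the paper's freshness finding.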

[LG-34] FRInGe: Distribution-Space Integrated Gradients with Fisher–Rao Geometry

链接: https://arxiv.org/abs/2605.06404
作者: Gabriele Martino,Sebastian Tschiatschek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gradient-based attribution methods are model-faithful and scalable, but Integrated Gradients (IG) can be brittle because explanations depend on heuristic baselines, straight-line paths, discretization, and saturation. We propose Fisher–Rao Integrated Gradients (FRInGe), which defines both the reference and interpolation schedule in predictive distribution space. FRInGe replaces input baselines with a maximum-entropy predictive reference and follows a Fisher-Rao geodesic on the probability simplex. The corresponding input-space trajectory is realized through the pullback Fisher metric and stabilized by KL and Euclidean trust regions; attributions are obtained by integrating input gradients along this trajectory. Across six ImageNet architectures, FRInGe most clearly improves calibration-oriented attribution metrics, especially MAS scores, while remaining competitive on perturbation AUC and infidelity.
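
The Fisher–Rao geodesic on the probability simplex has a closed form via the square-root embedding onto the unit sphere, which is enough to sketch the interpolation schedule (the reference and target distributions below are illustrative; the paper's input-space trajectory via the pullback metric is not shown):

```python
import numpy as np

def fisher_rao_geodesic(p, q, t):
    """Fisher-Rao geodesic between distributions p and q on the simplex,
    computed as a spherical interpolation of their square-root embeddings."""
    sp, sq = np.sqrt(p), np.sqrt(q)
    theta = np.arccos(np.clip(sp @ sq, -1.0, 1.0))  # Bhattacharyya angle
    if np.isclose(theta, 0.0):
        return p
    s = (np.sin((1 - t) * theta) * sp + np.sin(t * theta) * sq) / np.sin(theta)
    return s ** 2  # map back from the sphere to the simplex

uniform = np.full(4, 0.25)                 # maximum-entropy predictive reference
target = np.array([0.7, 0.1, 0.1, 0.1])
mid = fisher_rao_geodesic(uniform, target, 0.5)
print(mid, mid.sum())
```

Every point on the path is itself a valid distribution, which is the property the schedule exploits.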

[LG-35] SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

链接: https://arxiv.org/abs/2605.06402
作者: Liu Hanzuo,Chaofan Lin,Weixuan Sun,Yulong Wang, Key,Rayying,Mingyu Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semi-structured sparsity provides a practical path to accelerate large language models (LLMs) with native hardware support, but post-training semi-structured pruning often suffers from substantial quality degradation due to strong structural coupling. Existing methods rely on large-scale sparse retraining to recover accuracy, resulting in high computational cost. We propose SparseForge, a post-training framework that improves recovery efficiency by directly optimizing the sparsity mask rather than scaling up retraining tokens. SparseForge combines Hessian-aware importance estimation with progressive annealing of soft masks into hardware-executable structured sparsity, enabling stable and efficient sparse recovery. On LLaMA-2-7B under 2:4 sparsity, SparseForge achieves 57.27% average zero-shot accuracy with only 5B retraining tokens, surpassing the dense model’s 56.43% accuracy and approaching the 57.52% result of a state-of-the-art method using 40B tokens. Such improvements on the accuracy-efficiency trade-off from SparseForge are shown to be consistent across model families.
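
The target 2:4 pattern itself is simple to materialize once importance scores are available. A toy sketch using weight magnitude as a stand-in for the Hessian-aware importance score (SparseForge's soft-mask annealing is not shown):

```python
import numpy as np

def mask_2_of_4(weights, importance):
    """Keep the 2 highest-importance entries in each contiguous group of 4,
    i.e. the hardware-supported 2:4 semi-structured sparsity pattern."""
    w = weights.reshape(-1, 4)
    imp = importance.reshape(-1, 4)
    mask = np.zeros_like(w)
    keep = np.argsort(imp, axis=1)[:, -2:]        # top-2 indices per group
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return (w * mask).reshape(weights.shape)

rng = np.random.default_rng(1)
w = rng.normal(size=(2, 8))
sparse = mask_2_of_4(w, np.abs(w))  # magnitude as a stand-in importance score
print((sparse != 0).reshape(-1, 4).sum(axis=1))  # survivors per group of 4
```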

[LG-36] Data-Driven Covariate Selection for Nonparametric and Cycle-Agnostic Causal Effect Estimation

链接: https://arxiv.org/abs/2605.06385
作者: Ana Leticia Garcez Vicente,Gijs van Seeventer,Saber Salehkaleybar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Estimating causal effects from observational data requires identifying valid adjustment sets. This task is especially challenging in realistic settings where latent confounding and feedback loops are present. Existing approaches typically assume acyclicity or rely on global causal structure learning, limiting applicability and computational efficiency. In this work, we study a local, data-driven method for covariate selection based on conditional independence information. While this method is known to be sound and complete in acyclic causal models, its validity in the presence of cycles has remained unclear. Our main contribution is to show that these guarantees extend to cyclic causal models. In particular, our result relies on the invariance of conditional independence assertions under \sigma-acyclification. These findings establish a unified, cycle-agnostic perspective on covariate selection and causal effect estimation, showing that the method applies across cyclic and acyclic settings without modification. Empirically, we validate this on extensive synthetic data, showing reliable performance in cyclic causal models.

[LG-37] A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

链接: https://arxiv.org/abs/2605.06375
作者: Hao Yu
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO’s clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under first-order Taylor expansion around the current policy, Soft-Pair-GRPO’s gradient is a positive scalar multiple of standard GRPO’s gradient, explaining its empirical stability despite discarding continuous reward magnitudes. Building on this foundation, we propose Hard-Pair-GRPO, an advanced variant introducing explicit local probability constraints and constrained KL-fitting optimization to further suppress gradient noise and global policy drift. We provide comprehensive theoretical guarantees for both variants, including monotonic policy improvement, deterministic gradient direction, gradient-variance reduction, and dynamic step-size convergence. Extensive experiments on standard LLM alignment benchmarks (HH-RLHF, UltraFeedback) and the MuJoCo continuous control task HalfCheetah-v4 demonstrate that our Pair-GRPO family consistently outperforms state-of-the-art baselines in alignment quality, human preference win rate, training stability, and generalization to general reinforcement learning. Ablation studies validate the critical contributions of each core component.

[LG-38] Layer Collapse in Diffusion Language Models NEURIPS

链接: https://arxiv.org/abs/2605.06366
作者: Alexander Conzelmann,Albert Catalan-Tatjer,Shiwei Liu
类目: Machine Learning (cs.LG)
*备注: 9 Pages, Under Review at NeurIPS

点击查看摘要

Abstract:Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier persisting over a long token range. Despite its apparent redundancy, this outlier is critical: pruning it causes outputs to degrade into repetitive random token loops. Paradoxically, layers in LLaDA contain more redundant representations overall, with redundancy most pronounced in earlier layers – the reverse of AR models, where deeper layers grow redundant due to undertraining. Our analysis indicates that layer collapse in DLMs is not driven by undertraining but by overtraining: a dominant outlier becomes an indispensable information carrier while remaining representations collapse into redundant structure. These findings have strong practical implications, verified through controlled pre-training experiments. DLMs are surprisingly robust to compression: LLaDA under 3-bit GPTQ quantization drops only -1.8% on GSM8K, whereas Llama-3.1-8B drops -64.7%. Optimal sparsity allocation also reverses between families: at 50% average sparsity, allocating more to early layers in LLaDA yields +8.4% over the reverse strategy, while the same allocation costs Llama -8.4%. Our findings reveal that the DLM training objective fundamentally reshapes layer dynamics relative to AR models, with direct consequences for compression and deployment. Code: this http URL.

[LG-39] Preliminary Insights in Chronos Frequency Data Understanding and Reconstruction

链接: https://arxiv.org/abs/2605.06361
作者: Alessandro Pagani,Marco Cominelli,Liying Han,Gaofeng Dong,Sergio Benini,Francesco Gringoli,Mattia Savardi,Mani B. Srivastava,Trevor Bihl,Erik P. Blasch,Daniel O. Brigham,Kara Combs,Lance M. Kaplan,Federico Cerutti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a preliminary analysis of the ability of Chronos foundation model to process and internally represent frequency domain information. Foundation models that process time-series data offer practitioners a unified architecture capable of learning generic temporal representations across diverse tasks and domains, reducing the need for task-specific feature engineering and enabling transfer across signal modalities. Despite their growing adoption, the extent to which such models encode fundamental signal properties remains insufficiently characterised. We address this gap by analysing Chronos under controlled conditions, starting from the simplest class of signals: discrete sinusoids generated at fixed frequencies. Using lightweight online minimum description length probes applied to the decoder architecture, we test for the presence and separability of frequency information in the model’s internal representations. The results provide insight into how frequential content is captured across the frequency spectrum and highlight regimes in which representation quality may degrade or require particular care. These findings offer practical guidance for users of Chronos in signal processing and information fusion contexts, and contribute to ongoing efforts to improve the interpretability and evaluation of foundation models for temporal data.

[LG-40] Order-Agnostic Autoregressive Modelling with Missing Data

链接: https://arxiv.org/abs/2605.06355
作者: Ignacio Peis,Pablo M. Olmos,Jes Frellsen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Order-Agnostic autoregressive models have demonstrated strong performance in deep generative modeling, yet their use in settings with incomplete data remains largely unexplored. In this work, we reinterpret them through the lens of missing data. First, we show that their standard training procedure on fully observed data implicitly performs imputation under a missing completely at random mechanism, resulting in robust out-of-sample imputation performance in settings with high missingness. Second, we introduce the first principled framework for training them directly on incomplete datasets under general missingness mechanisms. Third, we leverage their amortized conditional density estimation to perform active information acquisition, i.e., sequentially selecting the most informative missing variables for downstream prediction or inference. Across a suite of real-world benchmarks, our Missingness-Aware Order-Agnostic Autoregressive Model (MO-ARM) consistently outperforms established imputation baselines.

[LG-41] A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring

链接: https://arxiv.org/abs/2605.06340
作者: Florian A. D. Burnat,Brittany I. Davidson
类目: Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continuous post-deployment compliance audits, mandated by emerging regulations such as the EU AI Act and Digital Services Act, create a class of strategic gaming distinct from the one-shot input/output gaming studied in prior work. Regulated systems can delay outcome reporting, drift their reports within plausible noise envelopes, exploit longitudinal sample attrition, and cherry-pick among ambiguous metric definitions. We formalize continuous auditing as a T-round Stackelberg game between an auditor that commits to a temporal policy and an adaptive auditee, and identify a structural feature of any noise-aware static-auditor design: a cover regime in which coverage gaps and granularity gaps cannot be closed simultaneously. We make this formal as Observation 1 and show that two minimal extension policies, each derived from the observation, close the regime along orthogonal axes: a sample-size-aware static rule (Periodic-with-floor) closes the granularity-failure case, while a history-conditioned suspicion-escalation policy closes the coverage-failure case for the naive Drift strategy – and neither closes both, exactly as the observation predicts; an audit-aware OffAuditDrift strategy that exploits Stackelberg commitment defeats both. To support empirical study we contribute a non-additive harm decomposition (welfare loss W, coverage loss C) that exposes how attrition shifts harm from the regulator-accountable surface to a regulator-invisible one; an initial library of five auditee strategies (Delay, Drift, Cherry-pick, Attrition, OffAuditDrift) and five auditor policies, calibrated to summary statistics from published audits of the DSA Transparency Database; and a reproducible simulator with a small, extensible Python interface.

[LG-42] Eliciting associations between clinical variables from LLMs via comparison questions across populations

链接: https://arxiv.org/abs/2605.06335
作者: Fabian Kabus,Kian Kordtomeikel,Thomas Brox,Heinz Wiendl,Daiana Stolz,Harald Binder
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The training data of large language models (LLMs) comprises a wide range of biomedical literature, reflecting data from many different patient populations. We investigate how it might be possible to recover information on correlation and causal links between patient characteristics, as a key building block for medical decision making. To avoid the pitfalls of direct elicitation, we propose an approach based on structured comparison questions, specifically patient comparison triplet questions. This is combined with a statistical model for the LLM representation that provides estimates of correlations without access to activations or model internals. Intuitively, we consider how similarity decisions of LLMs based on a first variable are affected by providing information on a second variable for one of the patients being assessed. We then induce prompt-level environment shifts to obtain correlation estimates for different subpopulations, which enables an invariant causal prediction (ICP) approach to obtain conservative candidate parent links. We demonstrate the method in two clinical domains, chronic obstructive pulmonary disease (COPD) and multiple sclerosis (MS). Across prompted environments, the elicited correlations are smooth, stable, and clinically interpretable, yet vary in a statistically significant way that supports downstream invariance testing, such that ICP provides a small set of candidate invariant parent links. These results show that indirect elicitation via triplet comparisons can recover meaningful association structure from LLMs and offer a cautious route from implicit correlations to causal statements that are congruent with LLM answering patterns.

[LG-43] LINC: Decoupling Local Consequence Scoring from Hidden Matching in Constructive Neural Routing

链接: https://arxiv.org/abs/2605.06332
作者: Shaofeng Qin,Li Wang
类目: Machine Learning (cs.LG)
*备注: 21 pages, 10 figures, 10 tables. Code: this https URL

点击查看摘要

Abstract:Constructive neural routing solvers usually score the next action by matching a decoder context to candidate embeddings, hiding deterministic one-step consequences such as travel, waiting, slack, and capacity changes. We propose LINC (Local Inference via Normed Comparison), a decoder-side candidate decision architecture that computes these consequences explicitly. LINC uses them according to their decision role: centered relative consequences are compared by a shared linear local scorer, while feasible-set summaries modulate the decoder context. This preserves standard global matching and relieves the hidden state from rediscovering transition arithmetic. The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) serves as the main constrained-routing stress test; the same interface extends to the Capacitated Vehicle Routing Problem (CVRP) and Traveling Salesman Problem (TSP). In particular, for CVRPTW, LINC reduces PolyNet’s Solomon/Homberger gaps from 13.83%/38.15% to 7.26%/14.71%; for TSP and CVRP, it also improves external-benchmark gaps.
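
The decision split can be sketched as follows: deterministic one-step consequences are centered across the candidate set and scored by a shared linear map, and the result is added to the usual global matching logits. This is a toy sketch; the consequence features, weights, and numbers below are made up.

```python
import numpy as np

def linc_scores(consequences, w, global_logits):
    """Centered relative consequences pass through a shared linear local
    scorer; the result augments the global matching logits."""
    rel = consequences - consequences.mean(axis=0, keepdims=True)  # center over candidates
    return global_logits + rel @ w

# 3 candidate next-customers, each with hypothetical (travel, waiting, slack) deltas.
cons = np.array([[1.0, 0.5, 0.2],
                 [2.0, 0.0, 0.4],
                 [0.5, 1.0, 0.1]])
w = np.array([-1.0, -0.5, 0.3])          # hypothetical learned scorer weights
scores = linc_scores(cons, w, global_logits=np.zeros(3))
print(int(np.argmax(scores)))            # candidate with the most favourable consequences
```

Because the consequence arithmetic is explicit, the decoder's hidden state no longer has to rediscover it.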

[LG-44] Gaming the Metric Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation

链接: https://arxiv.org/abs/2605.06324
作者: Florian A. D. Burnat,Brittany I. Davidson
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. The protocol is modeled as a published transformation graph whose connected components form semantic classes, and the metric itself is treated as a security object. Three results follow. First, any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score. Second, the semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among conservative classwise-constant repairs. Third, a class-stratified certificate, H^\star(x) \le (1/\hat\alpha) M_{\mathrm{Env}}(m)(x) + \bar\eta, holds for every platform strategy, with \bar\eta absorbing annotation and protocol error. We check the claims at three levels: exhaustive enumeration on a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a bounded single-player MDP encoded in PRISM-games. The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget. The semantic-envelope metric exhibits no such violation in the tested instances.
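
The semantic-envelope lift itself is a one-liner over the published equivalence classes; a minimal sketch with hypothetical variants and scores:

```python
def envelope_lift(scores, classes):
    """Semantic-envelope lift: each variant inherits the maximum score in
    its semantic class, making the metric constant on equivalence classes."""
    class_max = {}
    for v, c in classes.items():
        class_max[c] = max(class_max.get(c, float("-inf")), scores[v])
    return {v: class_max[c] for v, c in classes.items()}

scores = {"a1": 0.9, "a2": 0.2, "b1": 0.1}      # a1, a2 are equivalent variants
classes = {"a1": "A", "a2": "A", "b1": "B"}
print(envelope_lift(scores, classes))
```

Routing recommendations from a1 to its low-scoring twin a2 no longer lowers the lifted metric, which is exactly the manipulation-invariance property.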

[LG-45] SMolLM : Small Language Models Learn Small Molecular Grammar

链接: https://arxiv.org/abs/2605.06322
作者: Akhil Jindal,Harang Ju
类目: Machine Learning (cs.LG)
*备注: 18 pages, 5 figures, 10 tables

点击查看摘要

Abstract:Language models for molecular design have scaled to hundreds of millions of parameters, yet how they learn chemical grammar is poorly understood. We train SMolLM, a 53K-parameter weight-shared transformer, to generate novel SMILES with 95% validity on the ZINC-250K drug-like-molecule benchmark, outperforming a standard GPT with 10 times more parameters. Mechanistically, the same block resolves SMILES constraints across passes in a fixed order: brackets first, rings second, and valence last, as shown by error classification, linear probing, and sparse autoencoders. A systematic ablation across attention heads and passes further localizes the first bracket-matching step to a single attention head. Together, these results yield a compact, mechanistically interpretable molecular generator and a testbed for studying iterative computation in formal-language domains.

[LG-46] When Does \ell_2-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the \ell_1 Implicit Bias

链接: https://arxiv.org/abs/2605.06314
作者: Ye Su,Jian Li,Yong Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Benign overfitting is well-characterized in \ell_2 geometries, but its behavior under the \ell_1 implicit bias of greedy ensembles remains challenging. The analytical barrier stems from the non-linear coupling of coordinate selection thresholds, which invalidates standard spectral resolvent tools. To isolate this algorithmic bias, we characterize the high-dimensional risk of continuous-time \ell_2-Boosting over p features and n samples. By coupling the Convex Gaussian Minimax Theorem with delicate asymptotic expansions of double-sided truncated Gaussian moments, we analytically resolve the non-smooth \ell_1 interpolant. Under an isotropic pure-noise model, we prove that benign overfitting fails at the linear rate: greedy selection localizes noise into sparse active sets, and the excess variance decays at a logarithmic rate \Theta(\sigma^2/\log(p/n)) for noise variance \sigma^2. We remark that while this localization mechanism should persist in the presence of signals, the exact signal-noise decomposition remains an open problem. For spiked-isotropic designs with k^* head eigenvalues and r_2 = p - k^* tail dimensions, the risk converges to zero when r_2 \gg n, but only at a logarithmic rate \Theta(\sigma^2/\log(r_2/n)), which is slower than the linear decay observed in \ell_2 geometries. To avoid this slow convergence, we analyze the non-smooth subdifferential dynamics of the boosting flow. This yields a tuning-free early stopping rule that, under a bounded \ell_1-path condition, recovers the Lasso basic inequality and attains the minimax-optimal empirical prediction rate for \ell_1-bounded signals.
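
Discretised, the boosting flow is ordinary forward-stagewise fitting: at each step, nudge the single coordinate most correlated with the residual. This greedy selection is the mechanism that localizes signal (and noise) into sparse active sets. A toy sketch, with arbitrary step size and iteration budget:

```python
import numpy as np

def l2_boost(X, y, step=0.01, n_iter=200):
    """Discretised L2-Boosting flow (forward stagewise): at every step,
    nudge the single coordinate most correlated with the residual."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        corr = X.T @ (y - X @ beta)
        j = int(np.argmax(np.abs(corr)))
        beta[j] += step * np.sign(corr[j])
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:2] = [1.0, -0.5]                      # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=100)
beta = l2_boost(X, y)
print(np.round(beta[:2], 2))                     # greedy updates concentrate on the signal coordinates
```

The total step budget bounds the \ell_1 path length, which is the quantity the paper's early-stopping rule controls.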

[LG-47] Perceive Route and Modulate: Dynamic Pattern Recalibration for Time Series Forecasting

链接: https://arxiv.org/abs/2605.06310
作者: Siru Zhong,Zhao Meng,Haohuan Fu,Haoyang Li,Qingsong Wen,Yuxuan Liang
类目: Machine Learning (cs.LG)
*备注: 22 pages, 6 figures. Preprint

点击查看摘要

Abstract:Local temporal patterns in real-world time series continuously shift, rendering globally shared transformations suboptimal. Current deep forecasting models, despite their scale and complexity, rely on fixed weight matrices applied uniformly to all temporal tokens. This creates a static pattern response: models settle into a compromised average, unable to adapt to changing local dynamics. We introduce Dynamic Pattern Recalibration (DPR), a backbone-agnostic mechanism that resolves this via token-level recalibration. Through a lightweight “Perceive-Route-Modulate” pipeline, DPR computes a soft-routing distribution over a learned basis of adaptive response patterns, generating a time-aware modulation vector that recalibrates hidden states via a residual Hadamard product. As a backbone-agnostic adapter, DPR enhances forecasting across diverse architectures with minimal overhead, confirming it addresses a general bottleneck. As a minimalist standalone model, DPRNet achieves competitive performance across 12 benchmarks, validating dynamic recalibration against macroscopic parameter scaling.
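Stripped of backbone details, the "Perceive-Route-Modulate" pipeline described above could be sketched for a single temporal token as follows; all names, shapes, and the dot-product perception score are illustrative assumptions, not taken from the paper:

```python
import math

def dpr_modulate(h, patterns, router_w):
    """Token-level 'Perceive-Route-Modulate' step (hypothetical shapes).

    h        : hidden state of one temporal token (list of floats, dim d)
    patterns : K learned adaptive response-pattern vectors, each dim d
    router_w : K routing weight vectors, each dim d (the 'perceive' map)
    """
    # Perceive: score each response pattern against the current token.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for w in router_w]
    # Route: soft-routing distribution over the learned pattern basis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Modulate: time-aware modulation vector = convex mix of patterns.
    mod = [sum(p * pat[i] for p, pat in zip(probs, patterns))
           for i in range(len(h))]
    # Residual Hadamard product: h' = h + h ⊙ mod.
    return [hi + hi * mi for hi, mi in zip(h, mod)]
```

When all patterns are zero the modulation vanishes and the token passes through unchanged, which is the residual behavior that makes the adapter backbone-agnostic.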

[LG-48] Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces

链接: https://arxiv.org/abs/2605.06303
作者: Zakaria Elabid,Jan Andrzejewski,Bartosz Brzoza,Attila Cangi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Molecular generative models often assume meaningful latent geometry, but apparent property predictability can reflect sequence-level shortcuts rather than chemical organization. We study this issue in an unsupervised autoregressive Transformer-VAE trained on SELFIES. After training, we freeze the model, fit linear probes to RDKit descriptors, and use the probe weights as candidate global steering directions. To separate chemical signal from SELFIES artifacts, we introduce a confound-aware evaluation based on residualization, confound-direction alignment analysis, and decoded-molecule traversal. This is necessary because SELFIES length, branch tokens, ring tokens, and token entropy are strongly encoded in the latent space. Under this confound-aware evaluation, we find robust monotonic steering for cLogP, FractionCSP3, HeavyAtomCount, TPSA, BertzCT, and HBA. Nonlinear probes further show that some properties admit stable global directions, while others are better described by local latent gradients. Overall, our results show that chemically meaningful steering can emerge in entangled molecular latent spaces, but only when validated through decoded molecules and controlled for representation-level confounds.

[LG-49] Region Seeding via Pre-Activation Regularization: A Geometric View from Piecewise Affine Neural Networks

链接: https://arxiv.org/abs/2605.06300
作者: Yi Wei,Xuan Qi,Furao Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep networks with continuous piecewise affine activations induce polyhedral partitions of the input space, making the number of realized affine regions a natural measure of expressive capacity and a key determinant of how well the model can approximate nonlinear target functions. In practice, standard training realizes far fewer region refinements in data-visited neighborhoods than the architecture could in principle support, while existing region-count theory is primarily architectural and offers little guidance on how optimization shapes the realized partition near the data. Our theory provides a sufficient condition under which bringing neuron switching surfaces sufficiently close to data points ensures their intersection with local neighborhoods, which in turn implies a strict increase in the local affine-region count, yielding a principled training-time handle for seeding data-relevant partitions early in optimization. Guided by these results, we propose a plug-and-play region-seeding regularizer that encourages early partitioning while allowing task-driven refinement to dominate later in training. Experiments show that the regularizer increases the number of realized affine regions via exact enumeration and improves overall performance on toy datasets, while also improving early-stage accuracy and achieving comparable (or slightly improved) final accuracy on ImageNet-1k for classical models.
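One minimal instantiation of such a pre-activation regularizer (an assumption for illustration; the paper's exact penalty may differ) is to penalize, for each data point, the distance to its nearest neuron switching surface, measured in pre-activation space:

```python
def region_seed_penalty(preacts):
    """Mean distance from each data point to its nearest switching surface,
    measured by the smallest absolute pre-activation.

    preacts[i][j] = pre-activation of ReLU neuron j at data point i.
    Minimizing this pulls some switching surface toward every point,
    encouraging the local neighborhood to be partitioned early in training.
    """
    return sum(min(abs(a) for a in acts) for acts in preacts) / len(preacts)
```

Added with a decaying weight, a term like this would seed data-relevant partitions early while letting the task loss dominate later, matching the schedule the abstract describes.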

[LG-50] INEUS: Iterative Neural Solver for High-Dimensional PIDEs

链接: https://arxiv.org/abs/2605.06281
作者: Jean-Loup Dupret,Davide Gallon,Patrick Cheridito
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:In this paper, we introduce INEUS, a meshfree iterative neural solver for partial integro-differential equations (PIDEs). The method replaces the explicit evaluation of nonlocal jump integrals with single-jump sampling and reformulates PIDE solving as a sequence of recursive regression problems. Like Physics-Informed Neural Networks (PINNs), INEUS learns global solutions over the entire space-time domain, yet it offers a more efficient treatment of nonlocal terms and avoids the computationally expensive differentiation of full PIDE residuals. These features make INEUS particularly well suited for high-dimensional PDEs and PIDEs. Supported by a contraction-based convergence proof for linear PIDEs, our numerical experiments show that INEUS delivers accurate and scalable solutions for various high-dimensional linear and nonlinear examples.

[LG-51] PACE: Prune-And-Compress Ensemble Models

链接: https://arxiv.org/abs/2605.06278
作者: Fabian Akkerman,Julien Ferry,Théo Guyard,Thibaut Vidal
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Ensemble models achieve state-of-the-art performance on prediction tasks, but usually require aggregating a large number of weak learners. This can hinder deployment, interpretability, and downstream tasks such as robustness verification. Remedies to this issue fall into two main camps: pruning, which discards redundant learners, and compression, which generates new ones from scratch. We introduce PACE, a framework that interleaves these paradigms in a two-phase strategy. First, new learners are actively generated via a theoretically grounded procedure to enhance the diversity of the initial ensemble. When no more relevant learners can be found, a second phase of pruning is performed on this enriched ensemble. During both operations, PACE allows fine control on the faithfulness to the original ensemble. Experiments show that our method outperforms prior pruning and compression methods while offering principled control of faithfulness guarantees.

[LG-52] A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions

链接: https://arxiv.org/abs/2605.06272
作者: Tyler Ingebrand,Ruihan Zhao,Kushagra Gupta,David Fridovich-Keil,Sandeep P. Chinchali,Ufuk Topcu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While generative modeling has achieved remarkable success on tasks like natural language-conditioned image generation, enabling model adaptation from example data points remains a relatively underexplored and challenging problem. To this end, we propose Function Projection for Flow Matching (FP-FM), an algorithm that directly conditions generation on samples from the target distribution. FP-FM learns basis functions to span the velocity fields corresponding to a set of training distributions, and adapts to new distributions by computing a simple least-squares projection onto this basis. This enables efficient generation of samples from diverse target distributions without additional training at inference time. We further introduce multiple variants of FP-FM that provide a trade-off in expressivity and compute by enriching the coefficient calculation, e.g., by making the coefficients dependent on time. FP-FM achieves greatly improved precision and recall relative to baselines across synthetic and image-based datasets, with especially strong gains on unseen distributions.
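The least-squares projection step can be written down concretely; the sketch below (scalar-valued velocities and plain normal equations, both simplifying assumptions) recovers the coefficients of a new distribution's velocity field in the learned basis:

```python
def lstsq_project(basis_vals, targets):
    """Project target velocity samples onto a learned basis (illustrative).

    basis_vals : n x K matrix, basis_vals[i][k] = k-th basis velocity at sample i
    targets    : n target velocity values (scalar-valued for simplicity)
    Returns the K coefficients minimizing ||B c - y||^2 via normal equations.
    """
    n, K = len(basis_vals), len(basis_vals[0])
    # Normal equations: (B^T B) c = B^T y
    A = [[sum(basis_vals[i][p] * basis_vals[i][q] for i in range(n))
          for q in range(K)] for p in range(K)]
    b = [sum(basis_vals[i][p] * targets[i] for i in range(n)) for p in range(K)]
    # Gaussian elimination with partial pivoting.
    for col in range(K):
        piv = max(range(col, K), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, K):
            f = A[r][col] / A[col][col]
            for c in range(col, K):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    c = [0.0] * K
    for r in range(K - 1, -1, -1):
        c[r] = (b[r] - sum(A[r][j] * c[j] for j in range(r + 1, K))) / A[r][r]
    return c
```

Because the projection is a closed-form solve rather than gradient training, adaptation to a new target distribution costs only one small linear system at inference time.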

[LG-53] Can Attribution Predict Risk? From Multi-View Attribution to Planning Risk Signals in End-to-End Autonomous Driving

链接: https://arxiv.org/abs/2605.06264
作者: Le Yang,Ruoyu Chen,Haijun Liu,Jiawei Liang,ShangQuan Sun,Xiaochun Cao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:End-to-end autonomous driving models generate future trajectories from multi-view inputs, improving system integration but introducing opaque decisions and hard-to-localize risks. Existing methods either rely on auxiliary monitoring models or generate textual explanations, but are decoupled from the planning process and fail to reveal the visual evidence underlying trajectory generation. While attribution offers a direct alternative, planning differs from image classification by taking six-view camera images as input and predicting continuous multi-step trajectories, requiring attribution to capture both critical views and regions and their influence on outputs. Moreover, whether attribution maps can support risk identification remains underexplored. To address this, we propose a hierarchical attribution framework for end-to-end planning. Specifically, using L2 consistency with the original trajectory as the objective, we design a coarse-to-fine region attribution strategy that searches candidate regions across the full six-view input and refines attribution within them. We further extract three attribution statistics as predictive signals for planning risk, including attribution entropy to measure how concentrated the planner’s reliance is over the joint visual space, within-camera spatial variance to characterize how spread out the attribution is within each view, and cross-camera Gini coefficient to quantify how unevenly attribution is distributed across the six cameras. Experiments on BridgeAD, UniAD, and GenAD show that these statistics correlate with planning risk, achieving Spearman correlations of 0.30 \pm 0.07 with trajectory error and AUROC of 0.77 \pm 0.04 for collision detection. The signal generalizes to held-out scenes with negligible degradation and remains stable under an alternative attribution baseline.
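The three attribution statistics are simple functionals of the attribution maps; a toy version (flat lists standing in for spatial maps, and mean-of-per-camera variances as the aggregation, both assumptions) might look like:

```python
import math

def attribution_stats(cam_attr):
    """Three planning-risk signals from per-camera attribution (illustrative).

    cam_attr : list of 6 lists; cam_attr[c] holds non-negative attribution
               values over the regions of camera c.
    """
    flat = [v for cam in cam_attr for v in cam]
    total = sum(flat)
    p = [v / total for v in flat]
    # 1) Attribution entropy over the joint visual space.
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    # 2) Mean within-camera spatial variance of attribution.
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    within_var = sum(var(cam) for cam in cam_attr) / len(cam_attr)
    # 3) Gini coefficient of per-camera attribution mass.
    mass = sorted(sum(cam) for cam in cam_attr)
    ncam = len(mass)
    cum = sum((2 * (i + 1) - ncam - 1) * m for i, m in enumerate(mass))
    gini = cum / (ncam * sum(mass))
    return entropy, within_var, gini
```

A perfectly uniform attribution gives maximal entropy, zero within-camera variance, and a Gini of zero; concentrated reliance on one view pushes all three in the direction the abstract associates with risk signals.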

[LG-54] Beyond Rigid Alignment: Graph Federated Learning via Dual Manifold Calibration

链接: https://arxiv.org/abs/2605.06260
作者: Wentao Yu,Bo Han,Jie Yang,Chen Gong
类目: Machine Learning (cs.LG)
*备注: 30 pages

点击查看摘要

Abstract:Graph Federated Learning (GFL) enables collaborative representation learning across distributed subgraphs while preserving privacy. However, heterogeneity remains a critical challenge, as subgraphs across clients typically differ significantly in both semantics and structures. Existing methods address heterogeneity by enforcing the rigid alignment of model parameters or prototypes between clients and the server. However, these alignments implicitly rely on a restrictive global linearity assumption that summarizes local data distributions using a single and globally consistent representation space. This severely compresses the personalized representation space of clients and fails to preserve diverse local graph distributions. To overcome these limitations, we propose Federated Graph Manifold Calibration (FedGMC), a novel paradigm that tackles semantic heterogeneity and structural heterogeneity from a unified manifold perspective. Instead of enforcing rigid alignment, FedGMC introduces a dual manifold calibration mechanism that preserves global commonalities while maximizing the personalized representation space of local clients. Specifically, for semantic heterogeneity, the server constructs a geometrically optimal semantic manifold via equidistant semantic anchors, so as to guide the calibration of local semantic manifolds. For structural heterogeneity, the server constructs a global structural manifold by building global structural templates, so as to guide the calibration of local structural manifolds. Finally, the server dynamically refines both global semantic manifolds and structural manifolds by aggregating local manifolds. Extensive experiments on eleven homophilic and heterophilic graphs demonstrate that FedGMC effectively balances global commonality and local personalization, thereby significantly outperforming state-of-the-art baseline methods.

[LG-55] Trade-off Functions for DP-SGD with Subsampling based on Random Shuffling: Tight Upper and Lower Bounds

链接: https://arxiv.org/abs/2605.06259
作者: Marten van Dijk,Murat Bilgehan Ertan
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We derive a tight analysis of the trade-off function for Differentially Private Stochastic Gradient Descent (DP-SGD) with subsampling based on random shuffling within the f -DP framework. Our analysis covers the regime \sigma \geq \sqrt{3/\ln M} , where \sigma is the noise multiplier and M is the number of rounds within a single epoch. Unlike f -DP analyses for Poisson subsampling, which yield non-closed implicit formulas that can be machine computed but are non-transparent, random shuffling admits a tight analysis yielding transparent and interpretable closed-form bounds. Our concrete bounds, derived via the Berry–Esseen theorem, are tight up to constant factors within the proof framework. We demonstrate worked parameter settings for a single epoch ( E=1 ) with a corresponding trade-off function \geq 1-a-\delta , that is, only \delta below the ideal random guessing diagonal 1-a : For \delta = 1/100 and \sigma = 1 , roughly M \approx 1.14\times 10^6 rounds and N \approx 1.14\times 10^7 training samples suffice to achieve meaningful differential privacy. This is in contrast to recent negative results for the regime \sigma \leq 1/\sqrt{2\ln M} . Our concrete bounds can be composed over multiple epochs leading to \delta having a linear in E dependency, which restricts E=O(\sqrt{M}) . To go beyond Berry–Esseen, we introduce a new proof technique based on a generalization of the law of large numbers that yields an asymptotic random guessing diagonal-limit result: if E=c_M^2M with c_M\to 0 , then the E -fold composed trade-off function satisfies f^{\otimes E}(a)\to 1-a uniformly in a\in[0,1] with \delta having only an O(\sqrt{E}) dependency. We compare this asymptotic regime with the corresponding Poisson subsampling asymptotic, and highlight the characterization of explicit convergence rates as an open question.
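As background for readers outside the f-DP literature (standard material, not specific to this paper): a trade-off function f maps an adversary's type-I error level \alpha to the smallest achievable type-II error, so a bound of the form f(a) \geq 1-a-\delta says an adversary distinguishing neighboring datasets can beat random guessing by at most \delta. The canonical example is the Gaussian trade-off curve

```latex
G_\mu(\alpha) = \Phi\!\left(\Phi^{-1}(1-\alpha) - \mu\right),
```

where \Phi is the standard normal CDF; G_0(\alpha) = 1-\alpha is exactly the random-guessing diagonal that the composed trade-off functions in the abstract converge to.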

[LG-56] The Role of Node Features in Graph Pooling

链接: https://arxiv.org/abs/2605.06250
作者: Jan von Pichowski,Alžbeta Hrabošová,Ingo Scholtes,Christopher Blöcker
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph pooling is commonly applied in graph classification, yet its empirical gains over standard WL-1 expressive GNNs are often marginal or inconsistent. We study this gap by analysing the interaction between node features and graph topology and their effect on pooling objectives. Our analysis reveals that pooling operators require node features that are well-aligned with the graph’s topology – a condition often overlooked and not guaranteed in empirical networks. We formalise fundamental requirements for node features to enable effective pooling, and introduce a quantitative measure of feature quality. Our empirical evaluation shows that, when these requirements are satisfied, pooling can be beneficial and improve performance on appropriate datasets.

[LG-57] Structure-Preserving Gaussian Processes Via Discrete Euler-Lagrange Equations

链接: https://arxiv.org/abs/2605.06246
作者: Jan-Hendrik Ewering,Kathrin Flaßkamp,Niklas Wahlström,Thomas B. Schön,Thomas Seel
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 30 pages

点击查看摘要

Abstract:In this paper, we propose Lagrangian Gaussian Processes (LGPs) for probabilistic and data-efficient learning of dynamics via discrete forced Euler-Lagrange equations. Importantly, the geometric structure of the Lagrange-d’Alembert principle, which governs the motion of dynamical systems, is preserved by construction in the absence of external forces. This allows learning physically consistent models that overcome erroneous drift in the system’s energy, thereby providing stable long-term predictions. At the core of our approach lie linear operators for Gaussian process conditioning, constructed from discrete forced Euler-Lagrange equations and variational discretization schemes. Thereby and unlike prior work, the method enables learning dynamics from discrete position snapshots, i.e., without access to a system’s velocities or momenta. This is particularly relevant for a large class of practical scenarios where only position measurements are available, for instance, in motion capture or visual servoing applications. We demonstrate the data-efficiency and generalization capabilities of the LGPs in various synthetic and real-world case studies, including a real-world soft robot with hysteresis. The experimental results underscore that the LGPs learn physically consistent dynamics with uncertainty quantification solely from sparse positional data and enable stable long-term predictions.
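For reference, in standard variational-integrator notation (one common convention, e.g. Marsden–West; not copied from the paper), the discrete forced Euler–Lagrange equations underlying these conditioning operators read

```latex
D_2 L_d(q_{k-1}, q_k) + D_1 L_d(q_k, q_{k+1})
  + f_d^{+}(q_{k-1}, q_k) + f_d^{-}(q_k, q_{k+1}) = 0,
```

where L_d is a discrete Lagrangian approximating the action over one time step, D_i denotes the derivative with respect to the i-th argument, and f_d^{\pm} are discrete external forces. The relation involves only the position snapshots q_{k-1}, q_k, q_{k+1}, which is what allows learning from positions alone, without velocity or momentum measurements.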

[LG-58] When Graph Language Models Go Beyond Memorization

链接: https://arxiv.org/abs/2605.06239
作者: Masatsugu Yamada,Mahito Sugiyama
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:It remains unclear whether graph language models learn structural regularities or merely memorize training graphs; this cannot be resolved by current aggregate fidelity metrics alone. We develop a calibrated diagnostic protocol that combines frequent subgraph mining, a graph-level bootstrap baseline, and three-level frequency stratification to disentangle memorization from structural alignment. Using this framework, we show that graph language models can acquire structural regularities beyond memorization at scale, primarily in the high-frequency regime. This is supported by the following empirical evidence: On five TU benchmarks, LLaMA-style graph language models reach high subgraph-rank correlation, yet their alignment is matched or exceeded by the memorization bootstrap in most cases. At small scale, under our bootstrap diagnostic, fidelity is largely indistinguishable from verbatim recall. In contrast, at large scale with 3.75M graphs, verbatim memorization drops sharply while rank correlation remains near ceiling. Crucially, in a separate fixed-subsample analysis, frequent subgraph mining restricted to the novel-only subset closely tracks the corresponding all-generation Spearman correlation, providing evidence that the alignment is not driven solely by verbatim recall. Across all scales, high-frequency patterns are well reproduced, while rare patterns remain poorly covered, and this deficit narrows only marginally as capacity increases. We observe the same scale-dependent crossover under two distinct graph serializations (canonical DFS code and action sequences), providing evidence of robustness in our analysis.

[LG-59] AffineLens: Capturing the Continuous Piecewise Affine Functions of Neural Networks

链接: https://arxiv.org/abs/2605.06218
作者: Yi Wei,Xuan Qi,Furao Shen,Jian Zhao,Vittorio Murino,Cigdem Beyan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Piecewise affine neural networks (PANNs) provide a principled geometric perspective on neural network expressivity by characterizing the input–output map as a continuous piecewise affine (CPA) function whose complexity is governed by the number, arrangement, and shapes of its affine regions. However, existing interpretability and expressivity analyses often rely on indirect proxies (e.g., activation statistics or theoretical upper bounds) and rarely offer practical, accurate tools for enumerating and visualizing the induced region partition under realistic architectures and bounded input domains. In this work, we present AffineLens, a unified framework for computing the hyperplane arrangements and polyhedral structures underlying PANNs. Given a calibrated (bounded) input polytope, AffineLens identifies the subset of neuron-induced hyperplanes that intersect the domain, enumerates the resulting affine sub-regions in a layer-wise manner, and returns provably non-empty maximal CPA regions together with interior representatives. The framework further provides visualizations of region partitioning and decision boundaries, enabling qualitative inspection alongside quantitative region counts. By exploiting the affine restriction property of CPA networks under fixed activation patterns, AffineLens supports a broad class of modern components, including batch normalization, pooling, residual connections, multilayer perceptrons, and convolutional layers. Finally, we use AffineLens to perform a systematic empirical study of architectural expressivity, comparing networks through region complexity metrics and revealing how design choices influence the geometry of learned functions.
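AffineLens performs exact enumeration; a crude sampling stand-in for the single-hidden-layer case (illustrative only, and a lower bound rather than a certified count) shows the underlying idea that distinct activation sign patterns index distinct affine pieces of the CPA map:

```python
import random

def count_regions_by_sampling(W, b, lo, hi, n_samples=20000, seed=0):
    """Estimate how many affine regions of a one-hidden-layer ReLU net
    intersect the box [lo, hi]^d, by sampling activation patterns.

    W : list of hidden-unit weight vectors; b : corresponding biases.
    This is a sampling lower bound on the region count, not the exact
    polyhedral enumeration AffineLens performs.
    """
    rng = random.Random(seed)
    d = len(W[0])
    patterns = set()
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for _ in range(d)]
        # Each neuron's sign pattern fixes one affine piece of the CPA map.
        patterns.add(tuple(
            sum(wi * xi for wi, xi in zip(w, x)) + bi > 0
            for w, bi in zip(W, b)))
    return len(patterns)
```

For two axis-aligned hyperplanes through the origin of a 2-D box, the four quadrants are the four realized regions, which sampling recovers immediately; the exact, provably non-empty enumeration in deeper networks is where the framework's layer-wise polyhedral machinery is needed.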

[LG-60] Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

链接: https://arxiv.org/abs/2605.06206
作者: Muhammad Shahir Abdurrahman,Chun Deng,Azalia Mirhoseini,Philip Levis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture of experts has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads and expert parallelism is applied between those experts. Between clusters, a sum synchronizes the post-attention residuals, which then drives routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication as all experts within a group are contained on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, thus significantly reducing communication overhead. An implementation of FoE finds that on LongBench, FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing the end-to-end forward-pass latency by up to 5.2x, TTFT by 3.62x, and TBT by 1.95x. It does so while achieving comparable generation quality to a mixture of experts model of the same size and training configuration.

[LG-61] Bandit Learning in General Open Multi-agent Systems

链接: https://arxiv.org/abs/2605.06202
作者: Mengfan Xu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non-stationarity; agent patterns determine how quickly information accumulates; and new agents make regret scale further with the time horizon. To this end, we formulate a unified open-system bandit problem with general dynamics, including heterogeneous rewards and general agent patterns. We introduce new concepts to capture the inherent complexities: the *pre-training degree* of new agents quantifies how much information an agent carries upon entry, *stability* measures the impact of new agents on the system, and *global dynamic regret* compares the cumulative expected reward of all active agents with that of the varying optimal arms. We develop certified global-UCB learning methodologies with provable guarantees. Our regret bounds reveal that entry uncertainty enters linearly via the pre-training degree, while in stable regimes, regret is governed by the time needed to identify a persistent optimal arm, as well as by the agent patterns. We further show that these dependencies are tight via lower bounds in hard instances.

[LG-62] Constrained Contextual Bandits with Adversarial Contexts

链接: https://arxiv.org/abs/2605.06190
作者: Dhruv Sarkar,Abhishek Sinha
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn independently from fixed distributions whose expectations belong to known function classes. We focus on the continuing setting, in which the algorithm operates over the entire horizon even after the budget for cumulative cost is exhausted. In this setting, the objective is to simultaneously control regret and the violation of the budget constraint. Building on the seminal SquareCB framework of Foster et al. [2018], we propose a simple and modular framework that leverages online regression oracles to reduce the constrained problem to a standard unconstrained contextual bandit problem with adaptively defined surrogate reward functions. In contrast to prior works, which focus on stochastic contexts, our reduction yields improved guarantees for more general adversarial contexts, together with an efficient algorithm with a compact and transparent analysis.
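The SquareCB-style selection step that the reduction builds on converts the regression oracle's predicted rewards into an action distribution via inverse-gap weighting; a common form of the rule (with the number of actions as the baseline constant, one standard choice) is:

```python
def squarecb_probs(preds, gamma):
    """Inverse-gap-weighted action distribution in the SquareCB style.

    preds : predicted rewards from the online regression oracle, one per action
    gamma : learning-rate parameter; larger gamma exploits more aggressively
    """
    A = len(preds)
    best = max(range(A), key=lambda a: preds[a])
    p = [0.0] * A
    for a in range(A):
        if a != best:
            # Suboptimal-looking actions get probability shrinking with their gap.
            p[a] = 1.0 / (A + gamma * (preds[best] - preds[a]))
    p[best] = 1.0 - sum(p)  # remaining mass goes to the greedy action
    return p
```

With surrogate rewards substituted for raw rewards, the same selection rule drives the constrained algorithm, which is what makes the reduction modular.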

[LG-63] Teaching LLMs Program Semantics via Symbolic Execution Traces

链接: https://arxiv.org/abs/2605.06184
作者: Jonas Bayer,Stefan Zetzsche,Olivier Bouissou,Remi Delmas,Michael Tautschnig,Soonho Kong
类目: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We find that high overall accuracy masks a critical weakness: while most models reliably confirm properties hold, violation detection varies widely and degrades sharply with program length. To close this gap, we train on formal verification artifacts: running the Soteria symbolic execution engine on generic open-source C code and using the resulting traces for continued pretraining of Qwen3-8B. Just \sim 3,000 bug traces combined with chain-of-thought reasoning at inference time improve violation detection by over 17 percentage points, producing one of the most balanced accuracy profiles among evaluated models. On violation detection, the trained 8B model outperforms the 4 \times larger Qwen3-32B without thinking and approaches it in overall accuracy. The interaction between trace training and chain-of-thought is superadditive: neither alone provides meaningful gains, but their combination does. Improvements transfer across all five property types, including ones the training traces do not target. Our 28 configurations confirm the gains stem from trace semantics, not code volume, and that trace curation and format matter.

[LG-64] Mean Mode Screaming: Mean–Variance Split Residuals for 1000-Layer Diffusion Transformers

链接: https://arxiv.org/abs/2605.06169
作者: Pengqi Lu
类目: Machine Learning (cs.LG)
*备注: 43 pages (9-page main paper + appendix)

点击查看摘要

Abstract:Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a mean-dominated state. We show this behavior is driven by an exact decomposition of these gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the un-stabilized baseline; it tracks close to the baseline’s pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.

[LG-65] One Algorithm Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning

链接: https://arxiv.org/abs/2605.06166
作者: Xinrui Chen,Liu Yang,Ou Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly determine restricted fine-tuning, this separation incurs redundant overhead and makes coordinated selection difficult. We cast parameter and data selection as two bilevel selection problems under a common validation objective and derive a shared local response-surrogate scoring rule. Under first- and second-order validation-improvement approximations, parameter importance and data utility emerge as column-wise and row-wise aggregations of a single gradient interaction matrix, yielding a closed-form row-column correspondence for co-extracting both signals. Building on this structure, we propose DualSFT (Dual-Selection Fine-Tuning), a one-shot dual-scoring algorithm that produces a parameter mask and data subset from shared gradient statistics. On 3B-9B LLMs, single-axis DualSFT variants strengthen target-task performance and stability-plasticity trade-offs within their comparison groups, while full DualSFT yields a more favorable joint-constrained trade-off than sequential hybrid baselines under matched budgets.
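The row–column correspondence can be made concrete: given one gradient interaction matrix, both scores fall out of the same pass. In the sketch below, the absolute-sum aggregation and top-k selection are illustrative assumptions, not the paper's exact scoring rule:

```python
def dual_scores(G, param_budget, data_budget):
    """Co-extract a parameter mask and a data subset from a single
    gradient interaction matrix (illustrative).

    G[i][j] ~ interaction of training sample i with parameter group j.
    Parameter importance aggregates column-wise, data utility row-wise.
    """
    n, p = len(G), len(G[0])
    param_importance = [sum(abs(G[i][j]) for i in range(n)) for j in range(p)]
    data_utility = [sum(abs(G[i][j]) for j in range(p)) for i in range(n)]
    # One-shot selection under the two budgets.
    mask = sorted(range(p), key=lambda j: -param_importance[j])[:param_budget]
    subset = sorted(range(n), key=lambda i: -data_utility[i])[:data_budget]
    return sorted(mask), sorted(subset)
```

Because both signals come from shared gradient statistics, no second scoring pass over the data or parameters is needed, which is the redundancy the abstract aims to eliminate.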

[LG-66] Matrix-Valued Optimism is Matrix-Valued Augmentation: Additive Hybrid Designs for Constrained Optimization

链接: https://arxiv.org/abs/2605.06141
作者: Jiayi Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Augmented Lagrangian and optimistic primal–dual methods stabilize equality-constrained optimization through seemingly different mechanisms: the former adds constraint-dependent primal curvature, while the latter adds dual memory. Recent work has shown that these mechanisms are equivalent for scalar parameters. We extend this equivalence to matrix-valued correction. We prove an additivity principle: for symmetric matrix parameters, the ideal primal trajectory depends only on the summed correction matrix, not on how it is split between augmented and optimistic channels. This exposes a design freedom: algebraically equivalent decompositions can have different finite-step feasibility because augmented correction affects primal curvature, whereas optimistic correction affects the scale of the dual memory correction. We formulate the resulting step-size-limited design problem and derive a closed-form hybrid rule that selects a matrix correction, splits it between the two channels, and chooses primal and dual steps using local spectral weights. Experiments on nonlinear equality-constrained problems with controlled constraint-Jacobian conditioning show that the hybrid design improves over pure augmented and pure optimistic endpoints, closely tracks a grid-search hybrid oracle, and is competitive with first-order primal–dual baselines under mild-to-moderate ill-conditioning. The experiments also identify the expected limitation: exact cancellation requires increasingly large matrix corrections as the constraint Jacobian becomes ill-conditioned.

[LG-67] BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification

链接: https://arxiv.org/abs/2605.06117
作者: Yi-Siang Wang,Kuan-Yu Chen,Yu-Chen Den,Darby Tien-Hao Chang
类目: Machine Learning (cs.LG)
*备注: 19 pages, 4 figures

点击查看摘要

Abstract:Large language models (LLMs) have recently been adapted to tabular prediction by serializing structured features into natural language, but their performance in low-data regimes remains limited compared to gradient-boosted decision trees (GBDTs). In this work, we revisit the boosting paradigm, traditionally associated with tree ensembles, and ask whether it can be applied as a general training principle for LLM fine-tuning. We propose BoostLLM, a framework that transforms parameter-efficient fine-tuning into a multi-round residual optimization process by training sequential PEFT adapters as weak learners. To incorporate tabular inductive bias, BoostLLM integrates decision-tree paths as a second input view alongside raw features; analysis reveals that the path view acts as a structured teacher in early training steps before the model shifts toward feature-driven representations. Empirically, BoostLLM achieves consistent improvements over standard fine-tuning across multiple LLM backbones and datasets, matching or surpassing XGBoost across a wide range of shot counts and outperforming GPT-4o-based methods with a 4B model. We further show that the framework scales: pairing with stronger tree models and extended boosting horizons yields additional gains under appropriate stabilization. These results suggest that boosting can serve as a general training principle for LLM fine-tuning, particularly in low-data regimes for structured data.
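The multi-round residual-optimization idea can be shown on a toy problem where small linear models play the role of the sequential PEFT adapters. This is a generic residual-boosting sketch under those stand-in assumptions, not BoostLLM's training code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data; each "round" fits a weak learner to the ensemble's residual.
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=64)

pred = np.zeros(64)
lr = 0.5  # shrinkage on each weak learner's contribution
losses = []
for _ in range(5):
    residual = y - pred
    w, *_ = np.linalg.lstsq(X, residual, rcond=None)  # weak learner
    pred = pred + lr * (X @ w)
    losses.append(float(np.mean((y - pred) ** 2)))

# Residual boosting drives the training loss down round by round.
print(losses)
```

In the paper's setting the weak learner is a PEFT adapter on an LLM and the target is a classification loss, but the sequential fit-the-residual structure is the same.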

[LG-68] PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs

链接: https://arxiv.org/abs/2605.06082
作者: Rappy Saha,Jude Haris,Nicolas Bohm Agostini,David Kaeli,José Cano
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)
*备注: Accepted to IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI), 2026

点击查看摘要

Abstract:Power-of-two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit-shift operations for inference. Prior work has shown that PoT-quantized DNNs can preserve accuracy for tasks such as image classification; however, their performance on resource-constrained edge devices remains insufficiently understood. While general-purpose edge CPUs and GPUs do not provide optimized backends for bit-shift operations, custom hardware accelerators can better exploit PoT quantization by implementing dedicated shift-based processing elements. However, deploying PoT-quantized models on such accelerators is challenging due to limited support in existing inference frameworks. In addition, the impact of different PoT quantization strategies on hardware design, performance, and energy efficiency during full inference has not been systematically explored. To address these challenges, we propose PoTAcc, an open-source end-to-end pipeline for accelerating and evaluating PoT-quantized DNNs on resource-constrained edge devices. PoTAcc enables seamless preparation and deployment of PoT-quantized models via TensorFlow Lite (TFLite) across heterogeneous platforms, including CPU-only systems and hybrid CPU-FPGA systems with custom accelerators. We design shift-based processing element (shift-PE) accelerators for three PoT quantization methods and implement them on two FPGA platforms. We evaluate accuracy, performance, energy efficiency, and resource utilization across a range of models, including CNNs and Transformer-based architectures. Results show that our CPU-accelerator design achieves up to 3.6x speedup and 78% energy reduction compared to CPU-only execution for PoT-quantized DNNs on PYNQ-Z2 and Kria boards. 
The code will be publicly released at this https URL
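The core arithmetic trick behind PoT quantization can be sketched in a few lines: weights are rounded to signed powers of two, so multiplying by a weight reduces to an exponent adjustment (a bit-shift in fixed point). The quantizer below and its exponent range are illustrative assumptions, not PoTAcc's scheme; `np.ldexp(x, e)` computes `x * 2**e` via exponent manipulation rather than a general multiply:

```python
import numpy as np

rng = np.random.default_rng(2)

def pot_quantize(w, min_exp=-6, max_exp=0):
    """Round each weight to sign * 2**e, the nearest power of two."""
    sign = np.where(w >= 0, 1.0, -1.0)
    mag = np.maximum(np.abs(w), 2.0 ** (min_exp - 1))
    e = np.clip(np.round(np.log2(mag)), min_exp, max_exp).astype(int)
    return sign, e

w = rng.uniform(-1, 1, size=(4, 8))
x = rng.normal(size=8)

sign, e = pot_quantize(w)
w_hat = sign * (2.0 ** e)

# Shift-based matvec: per-element sign flip + exponent shift, then sum.
y_shift = np.ldexp(sign * x, e).sum(axis=1)
# Reference dense matvec with the quantized weights.
y_float = w_hat @ x

print(np.max(np.abs(y_shift - y_float)))
```

A shift-PE accelerator implements exactly this replacement of multipliers by shifters in hardware.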

[LG-69] Fast Gauss-Newton for Multiclass Cross-Entropy

链接: https://arxiv.org/abs/2605.06081
作者: Mikalai Korbit,Mario Zanon
类目: Machine Learning (cs.LG)
*备注: 29 pages, 3 figures, 1 table, 1 algorithm

点击查看摘要

Abstract:In multiclass softmax cross-entropy, the full generalized Gauss-Newton (GGN) curvature couples all output logits through the softmax covariance, making curvature-vector products harder to scale as the number of classes grows. We show that the standard multiclass GGN can be decomposed exactly into a true-vs-rest term and a positive semidefinite within-competitor covariance term. Fast Gauss-Newton (FGN) retains the first term and drops the second, yielding a positive semidefinite under-approximation of the multiclass GGN that is exact for binary classification. The derivation uses an exact true-vs-rest scalar-margin representation of softmax cross-entropy: the loss and gradient are unchanged, and the approximation enters only at the curvature level. Exploiting the FGN curvature structure, the damped update can be written as an equivalent whitened row-space system with one row per mini-batch example. We solve this system matrix-free by conjugate gradient using Jacobian-vector and vector-Jacobian products of the scalar margin map. Targeted mechanism experiments and an evaluation on a fixed-feature multiclass head support the predictions from the decomposition: FGN stays closest to the full softmax GGN when competitor mass is concentrated or damping is large, and deviates as the dropped within-competitor covariance grows.
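The decomposition stated above can be checked numerically. Writing the full softmax GGN block in logit space as `diag(p) - p p^T`, the true-vs-rest term uses the Jacobian `e_y - q` of the scalar margin, where `q` is the competitor distribution `p_j / (1 - p_y)`; the dropped term is the within-competitor covariance. The margin/Jacobian details are our reading of the abstract, but the identity below holds exactly:

```python
import numpy as np

rng = np.random.default_rng(3)

# Softmax probabilities for one example with true class y.
z = rng.normal(size=6)
p = np.exp(z - z.max())
p /= p.sum()
y = 2

# Full softmax GGN block in logit space.
G_full = np.diag(p) - np.outer(p, p)

# Competitor distribution q (q_y = 0) and margin Jacobian d = e_y - q.
q = p.copy()
q[y] = 0.0
q /= 1.0 - p[y]
d = -q
d[y] = 1.0

# FGN keeps the true-vs-rest rank-one term ...
G_fgn = p[y] * (1.0 - p[y]) * np.outer(d, d)
# ... and drops the within-competitor covariance term.
G_within = (1.0 - p[y]) * (np.diag(q) - np.outer(q, q))

decomposition_exact = bool(np.allclose(G_full, G_fgn + G_within))
min_eig_dropped = float(np.linalg.eigvalsh(G_within).min())
print(decomposition_exact, min_eig_dropped >= -1e-10)
```

Since the dropped term is positive semidefinite, FGN is indeed a PSD under-approximation of the full GGN, and for 2 classes the competitor covariance vanishes, recovering exactness in the binary case.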

[LG-70] Understanding diffusion models requires rethinking (again) generalization

链接: https://arxiv.org/abs/2605.06077
作者: Pierre Marion,Yu-Han Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This position paper argues that understanding generalization in diffusion models requires fundamentally new theoretical frameworks that go beyond both classical statistical learning theory and the benign overfitting paradigm developed for supervised learning. In diffusion models, unlike in supervised learning, memorization of training data and generalization to novel samples are incompatible: a model that has fully memorized its training set generates copies rather than novel data. Several theoretical explanations for why practical diffusion models nevertheless generalize have been proposed, based on capacity limitations, implicit regularization from optimization, or architectural inductive biases, but their interactions remain unclear. We argue that the field should pivot from explaining why the diffusion models do not memorize to investigating what the model actually learns during pre-memorization phase. To highlight our stance, we conduct empirical study of diffusion models trained on CIFAR-10, and we distill the findings into concrete open questions that we believe are key to improve understanding of generalization in diffusion models.

[LG-71] PRISM: Iterative Cross-Modal Posterior Refinement for Dynamic Text-Attributed Graphs

链接: https://arxiv.org/abs/2605.06073
作者: Trimble Chang,Yihang Liu,Mingjing Han,Han Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dynamic text-attributed graphs (DyTAGs) provide a powerful framework for modeling evolving systems in which node semantics and time-dependent interactions are tightly coupled. Recently, multimodal learning has emerged as a promising yet underexplored direction for enhancing DyTAG representation learning. However, existing methods typically rely on rigid modality partitions and one-shot fusion strategies, which limit their ability to capture the intrinsic and evolving dependencies between node semantics and interaction behaviors. To address these limitations, we propose \textbf{PRISM}, an iterative cross-modal posterior refinement framework for DyTAG representation learning. PRISM organizes DyTAG information into semantic and behavioral modalities, providing a more intrinsic alternative to carrier-level modality partitions. Instead of fusing the two modalities in a single step, PRISM learns a refinement trajectory that progressively transforms semantic priors into behavior-conditioned posterior states through cross-modal interaction with behavioral evidence. Extensive experiments on DTGB benchmark datasets show that PRISM achieves strong performance on temporal link prediction and destination node retrieval tasks. Further ablation studies validate the effectiveness of semantic–behavioral modeling and iterative posterior refinement.

[LG-72] Geometry-Aware Simplicial Message Passing

链接: https://arxiv.org/abs/2605.06061
作者: Elena Xinyi Wang,Bastian Rieck
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:The Weisfeiler–Lehman (WL) test and its simplicial extension (SWL) characterize the combinatorial expressivity of message passing networks, but they are blind to geometry, i.e., meshes with identical connectivity but different embeddings are indistinguishable. We introduce the Geometric Simplicial Weisfeiler–Lehman (GSWL) test, which incorporates vertex coordinates into color refinement for geometric simplicial complexes. In addition, we show that (i) the expressivity of geometry-aware simplicial message passing schemes is bounded above by GSWL, and (ii) that there exist parameters such that the discriminating power of GSWL is matched by these schemes on any fixed finite family of geometric simplicial complexes. Combined with the Euler Characteristic Transform (ECT), a complete invariant for geometric simplicial complexes, this yields a geometric expressivity characterization together with an approximation framework. Experiments on synthetic and mesh datasets serve to validate our theory, showing a clear hierarchy from combinatorial to geometry-aware models.
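The geometry-blindness of plain WL versus a coordinate-aware refinement can be demonstrated on two 4-cycles with identical connectivity but different embeddings. The update rule below (seeding colors with hashed vertex coordinates, then standard color refinement) is a toy sketch of the GSWL idea, not the paper's exact test:

```python
def wl_colors(n, edges, seeds, rounds=3):
    """WL color refinement on a graph, starting from given seed colors."""
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    colors = list(seeds)
    for _ in range(rounds):
        colors = [hash((colors[v], tuple(sorted(colors[u] for u in nbrs[v]))))
                  for v in range(n)]
    return sorted(colors)

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # same connectivity ...
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rect = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]  # ... different embedding

# Combinatorial WL with a constant seed: every vertex stays indistinguishable.
plain = wl_colors(4, edges, [0, 0, 0, 0])
# Geometry-aware seeding with vertex coordinates separates the two meshes.
geo_square = wl_colors(4, edges, [hash(c) for c in square])
geo_rect = wl_colors(4, edges, [hash(c) for c in rect])

print(geo_square != geo_rect)
```

Plain WL collapses all four vertices of the cycle to one color regardless of the embedding, which is exactly the blindness the GSWL test removes.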

[LG-73] Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend

链接: https://arxiv.org/abs/2605.06055
作者: Tianlun Hu,Tiancheng Hu,Shengsheng Litang,Sheng Wang,Xiaoming Bao,Yuxing Li,Wei Wang,Zhongzhe Hu,Lijun Li,Hongwei Sun,Jingbin Zhou
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design for MoE inference acceleration on Ascend systems. The design reorganizes dispatch and combine around direct placement into destination expert windows and direct reading from remote expert windows. Built on globally pooled high-bandwidth memory and symmetric-memory allocation, it removes most intermediate relay and reordering buffers while retaining only lightweight control state, including counts, offsets, and synchronization metadata. We instantiate the design as two schedules for the main phases of MoE inference: a prefill schedule with richer planning state for throughput-oriented execution, and a compact decode schedule for latency-sensitive execution. Experiments on Ascend-based MoE workloads show reduced dispatch and combine latency in both settings. At the serving level, the implementation improves time to first token (TTFT), preserves competitive time per output token (TPOT), and enlarges the feasible scheduling space under practical latency constraints. These results indicate that, on platforms with globally addressable device memory, reducing intermediate buffering and output restoration around expert execution is an effective direction for accelerating MoE inference.
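The "counts, offsets" control state mentioned above is essentially a counting-sort layout: once per-expert counts and exclusive offsets are known, each token can be placed directly at its destination slot and read back for combine, with no intermediate relay or reordering buffer. The single-device sketch below illustrates that bookkeeping only; it says nothing about Ascend's pooled-HBM implementation:

```python
import numpy as np

rng = np.random.default_rng(8)
n_tok, n_exp, d = 10, 3, 4
tokens = rng.normal(size=(n_tok, d))
route = rng.integers(0, n_exp, size=n_tok)   # expert chosen per token

# Lightweight control state: per-expert counts and exclusive offsets.
counts = np.bincount(route, minlength=n_exp)
offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))

# Dispatch: direct placement into each expert's window region.
window = np.empty_like(tokens)
cursor = offsets.copy()
slot = np.empty(n_tok, dtype=int)
for i, e in enumerate(route):
    window[cursor[e]] = tokens[i]
    slot[i] = cursor[e]
    cursor[e] += 1

# Combine: each token is read back from its recorded slot, so no
# separate output-restoration buffer is needed.
restored = window[slot]
print(np.allclose(restored, tokens))
```

In the distributed setting the "window" is a remote expert's region of globally pooled HBM, but the offset arithmetic is the same.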

[LG-74] Towards Generation-Efficient Uncertainty Estimation in Large Language Models

链接: https://arxiv.org/abs/2605.06053
作者: Mingcheng Zhu,Yu Liu,Tingting Zhu
类目: Machine Learning (cs.LG)
*备注: 21 pages, 6 figures, and 8 tables. The abstract provided in the metadata differs slightly from the manuscript version due to character limits

点击查看摘要

Abstract:Uncertainty estimation is important for deploying LLMs in high-stakes applications such as healthcare and finance, where hallucinations can appear fluent and plausible while being factually incorrect, making it difficult for users to judge whether an output should be trusted. Existing methods require one or more full autoregressive generations to estimate uncertainty, which introduces substantial inference cost and often delays uncertainty assessment. In this paper, we investigate whether effective uncertainty estimation can be achieved with partial generation or even input-only information. Specifically, we first develop a unified framework that formulates uncertainty estimation as an early estimation problem over the autoregressive generation process of LLMs. This framework organises existing and proposed estimators by the information they observe, ranging from multi-generation to input-only prediction, and clarifies the performance-cost trade-off underlying different uncertainty estimation methods. Building on this view, we study two largely underexplored low-cost settings: estimating uncertainty with part of the generation, and predicting uncertainty from the input prompt. We propose Logit Magnitude, which uses top-M logit evidence to estimate uncertainty from an early-stopped generation prefix, and MetaUE, which distils generation-based uncertainty into a lightweight input-only estimator trained with uncertainty scores. Extensive experiments on general and domain-specific benchmarks show that Logit Magnitude achieves strong performance, and partial generations of LLMs are often sufficient for effective uncertainty estimation. MetaUE further provides a competitive input-only approximation in several settings. These findings suggest that effective uncertainty estimation requires less generation than commonly assumed, enabling unreliable responses to be identified earlier.
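One plausible instantiation of "top-M logit evidence" is to score uncertainty by how weak the M largest next-token logits are at the early-stopping point. The scoring function below is our own simplified reading for illustration, not the paper's exact Logit Magnitude formula:

```python
import numpy as np

def logit_magnitude_uncertainty(logits, M=5):
    """Toy top-M evidence score: low average top-M logit magnitude
    -> high uncertainty (hypothetical simplification)."""
    top = np.sort(logits)[-M:]
    return -float(top.mean())

# A peaked distribution (strong evidence) vs. a diffuse one.
confident = np.array([10.0, 0.1, 0.0, -0.2, -1.0, -3.0])
diffuse = np.array([0.3, 0.2, 0.1, 0.0, -0.1, -0.2])

u_conf = logit_magnitude_uncertainty(confident)
u_diff = logit_magnitude_uncertainty(diffuse)
print(u_conf < u_diff)
```

The appeal of such a score is that it needs only the logits of a prefix (or even the first generated tokens), not a completed generation.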

[LG-75] When Brain Networks Travel: Learning Beyond Site

链接: https://arxiv.org/abs/2605.06050
作者: Yingxu Wang,Kunyu Zhang,Yanwu Yang,Thomas Wolfers,Yujie Wu,Siyang Gao,Nan Yin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph-based learning on functional magnetic resonance imaging (fMRI) has shown strong potential for brain network analysis. However, existing methods degrade under cross-site out-of-distribution (OOD) settings because site-conditioned confounders induce non-pathological shortcuts, while functional connectivity constructed by temporal averaging obscures transient neurodynamics, limiting generalization to unseen sites. In this paper, we propose Cross-site OOD Robust brain nEtwork (CORE), a unified framework for brain network learning across unseen sites. CORE first performs site-aware confounder decoupling to mitigate site-conditioned bias and extract a cross-site population scaffold of reproducible diagnostic connectivity edges. It then profiles transient pathway dynamics over this scaffold using lightweight temporal descriptors and organizes scaffold edges into a line graph for transferable pathway-level modeling. Finally, CORE introduces a prior-guided subject-adaptive gating mechanism that leverages scaffold-derived population priors while preserving subject-specific connectivity variability. Extensive experiments under leave-one-site-out evaluation on real-world datasets (ABIDE, REST-meta-MDD, SRPBS, and ABCD) show that CORE consistently outperforms state-of-the-art baselines, with up to 6.7% relative gain. Furthermore, CORE remains robust to atlas variations, maintaining performance gains across different brain parcellation schemes.

[LG-76] Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

链接: https://arxiv.org/abs/2605.06046
作者: Saksham Rathi,Preeti,Mythili Vutukuru
类目: Machine Learning (cs.LG)
*备注: 22 pages, 36 figures

点击查看摘要

Abstract:Auto-regressive token generation in large language models is memory-bound because it requires “attending to” key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process by batching multiple requests together, and maximizing batch size subject to GPU memory constraints. The key observation of our work is that with prefix-sharing workloads, smaller, prefix-homogeneous batches – where all requests share a common prefix – can achieve higher decode throughput than larger, heterogeneous batches, due to better spatial and temporal locality during KV cache accesses. However, prefix-aware schedulers in state-of-the-art inference engines maximize prefix reuse within a batch only to reduce KV cache memory footprint, but do not stop batch formation at smaller homogeneous batches that could have performed better. Further, we show that shared prefix detection in existing schedulers relies on radix-tree traversals, incurring substantial CPU overhead that is often comparable to GPU execution time. This paper presents Feather, a prefix-aware scheduler that uses reinforcement learning (RL) to learn the optimal tradeoff between batch size and prefix homogeneity. We also introduce Chunked Hash Tree (CHT), a lightweight data structure that enables fast prefix detection and efficient request selection for the RL scheduler, avoiding expensive tree traversals. We integrate Feather into vLLM and SGLang, and our evaluation shows that Feather achieves 2–10 \times higher end-to-end throughput as compared to existing schedulers, while doing no worse than the status quo when the workload does not have enough prefix sharing. Feather achieves these gains by reducing the total number of KV cache accesses, surpassing the performance of prefix-aware attention kernels that have the same goal.
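The chunked-hash idea for fast prefix detection can be sketched as follows: token sequences are split into fixed-size chunks, each chunk keyed by the hash chain of all chunks up to and including it, so shared-prefix length is found by set lookups instead of a radix-tree walk. The names, chunk size, and data-structure details below are illustrative, not the paper's CHT:

```python
CHUNK = 4

def chunk_keys(tokens):
    """Hash-chain keys for each complete chunk of a token sequence."""
    keys, h = [], 0
    for i in range(0, len(tokens) - len(tokens) % CHUNK, CHUNK):
        h = hash((h, tuple(tokens[i:i + CHUNK])))
        keys.append(h)
    return keys

class PrefixIndex:
    def __init__(self):
        self.seen = set()

    def insert(self, tokens):
        self.seen.update(chunk_keys(tokens))

    def shared_prefix_chunks(self, tokens):
        """Longest cached prefix, in chunks, via O(1) set lookups."""
        n = 0
        for k in chunk_keys(tokens):
            if k not in self.seen:
                break
            n += 1
        return n

idx = PrefixIndex()
idx.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# First 8 tokens shared -> 2 shared chunks of size 4.
print(idx.shared_prefix_chunks([1, 2, 3, 4, 5, 6, 7, 8, 99, 100]))
```

Because each key already encodes the whole prefix up to that chunk, a match at chunk k implies all earlier chunks match too, which is what lets the scheduler avoid tree traversal.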

[LG-77] Multi-agent decision making: A Blackwell's informativeness approach

链接: https://arxiv.org/abs/2605.06028
作者: Zheng Zhang,Cuong C. Nguyen,Kevin Wells,Gustavo Carneiro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid development of large language models (LLMs) has motivated research on decision-making in multi-agent systems, where multiple agents collaborate to achieve shared objectives. Existing aggregation approaches, such as voting and debate, are largely ad-hoc and lack formal guarantees regarding the informativeness of the resulting decisions. In this paper, we provide a principled approach to analyse decisions made in the multi-LLM setting using Blackwell’s informativeness framework. Within the Blackwell information-structure abstraction, we show that voting and debate induce information structures that are no more informative than the pooled private information of all agents. This result identifies Bayesian pooled posterior maximisation as an information-theoretic upper-bound decision rule under the Blackwell ordering. Motivated by this theoretical analysis, we introduce a practical method for LLM-based question-answering (QA) tasks that estimates each agent’s posterior and approximates the pooled posterior using a product-of-posteriors estimator. Extensive experiments on six QA benchmarks demonstrate that our approach outperforms state-of-the-art multi-LLM debate and voting methods.
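The product-of-posteriors aggregation rule itself is simple: multiply each agent's estimated posterior over answer options element-wise and renormalize, then take the argmax. The toy numbers below are made up; how each posterior is estimated from an LLM is the part the paper's method addresses and is not shown here:

```python
import numpy as np

# Hypothetical per-agent posteriors over 3 answer options.
posteriors = np.array([
    [0.5, 0.3, 0.2],   # agent 1
    [0.4, 0.5, 0.1],   # agent 2
    [0.6, 0.2, 0.2],   # agent 3
])

# Product-of-posteriors estimator of the pooled posterior.
pooled = np.prod(posteriors, axis=0)
pooled /= pooled.sum()

# Majority vote over per-agent argmax answers, for comparison.
votes = np.bincount(posteriors.argmax(axis=1), minlength=3)

print(pooled.round(3), int(pooled.argmax()), int(votes.argmax()))
```

Unlike voting, the pooled rule uses each agent's full confidence profile, which is why it can approach the Blackwell-ordering upper bound rather than only the modal answers.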

[LG-78] Matrix-Decoupled Concentration for Autoregressive Sequences: Dimension-Free Guarantees for Sparse Long-Context Rewards

链接: https://arxiv.org/abs/2605.06017
作者: Pei-Sen Li
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Sequence-level evaluations in autoregressive Large Language Models (LLMs) rely on highly dependent token generation. Establishing tight concentration bounds for these processes remains a challenge due to two fundamental bottlenecks in existing frameworks: (i) classical inequalities typically separate dependency structures from target sensitivities, leading to a scalar collapse that inflates the variance proxy to a suboptimal \mathcal{O}(N) for sparse terminal rewards; (ii) conversely, while certain spatial methods achieve tighter bounds, they lack the strictly causal filtration required by sequential generation, rendering them inapplicable to the autoregressive setting. To resolve both bottlenecks, we establish a sharp McDiarmid-type inequality for dependent sequences, governed strictly by the exact matrix-vector multiplication of the causal dependency resolvent and the target sensitivity vector. This Matrix-Decoupled Concentration (MDC) framework natively recovers optimal constants for Markov chains and exploits directed d-separation to yield order-optimal bounds for causal trees. Crucially, by exactly preserving the coordinate-wise sparsity of rewards within a strictly causal framework, MDC mathematically prevents scalar collapse, guaranteeing a dimension-free \mathcal{O}(1) variance proxy and providing a rigorous mathematical justification for the stability of long-context reasoning.


[LG-79] DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression

链接: https://arxiv.org/abs/2605.05994
作者: Nobutaka Ono
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose DiBA (Diagonal and Binary Matrix Approximation), a compact matrix factorization for neural network weight compression. Many components of modern networks, including linear layers, 1\times1 convolutions, attention projections, and embedding layers, have dense matrix weights. DiBA approximates A\in\mathbb{R}^{m\times n} by \widehat{A}=D_1B_1D_2B_2D_3 , where D_1,D_2,D_3 are diagonal matrices and B_1,B_2 are 0/1 binary matrices. The intermediate dimension k controls the trade-off between theoretical storage and approximation accuracy. For matrix-vector products, DiBA decomposes dense multiplication into three element-wise scaling operations and two binary mixing operations, reducing the floating-point multiplication count from mn to m+k+n . For optimization, we introduce DiBA-Greedy, an alternating solver that combines closed-form least-squares updates for the diagonal factors with exact one-bit improvement tests for the binary factors. We also introduce DiBARD (DiBA with Retuning only Diagonal factors), which replaces dense-matrix layers by DiBA factors, freezes the binary matrices, and retunes only the diagonal entries on downstream data. This preserves compact binary mixing without discrete search during adaptation. On 40 dense weight matrices extracted from public pretrained models, DiBA-Greedy yields consistent SNR improvements as the theoretical storage ratio increases. After DiBA replacement in two component-replacement studies, DiBARD improves DistilBERT/WikiText masked-token accuracy from 0.4447 to 0.5210 and Speech Commands test accuracy for an Audio Spectrogram Transformer from 0.7684 to 0.9781 without reoptimizing the binary factors.
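The factored matvec structure described above is easy to verify: with diagonal factors applied as element-wise scalings and 0/1 binary factors applied as additive mixings, y = D1 B1 D2 B2 D3 x needs only m + k + n floating-point multiplies. The sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
m, k, n = 6, 4, 5

# Random DiBA factors: diagonals d1, d2, d3 and 0/1 binary B1 (m x k),
# B2 (k x n).
d1 = rng.normal(size=m)
d2 = rng.normal(size=k)
d3 = rng.normal(size=n)
B1 = rng.integers(0, 2, size=(m, k)).astype(float)
B2 = rng.integers(0, 2, size=(k, n)).astype(float)

# Dense reconstruction A_hat = D1 B1 D2 B2 D3.
A_hat = np.diag(d1) @ B1 @ np.diag(d2) @ B2 @ np.diag(d3)
x = rng.normal(size=n)

# DiBA matvec: three element-wise scalings (n + k + m multiplies) and two
# binary mixings (additions only, since B1, B2 are 0/1).
y = d1 * (B1 @ (d2 * (B2 @ (d3 * x))))

print(np.allclose(y, A_hat @ x))
```

The intermediate dimension k is the knob trading storage and accuracy; the DiBA-Greedy solver for fitting the factors is not reproduced here.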

[LG-80] Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions ICML2026

链接: https://arxiv.org/abs/2605.05983
作者: Yuntai Bao,Qinfeng Li,Xinyan Yu,Xuhong Zhang,Ge Su,Wenqi Zhang,Liu Yan,Haiqin Weng,Jianwei Yin
类目: Machine Learning (cs.LG)
*备注: 63 pages, 50 figures; accepted by ICML 2026

点击查看摘要

Abstract:Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training. To tackle the second limitation, we draw inspiration from representation fine-tuning and introduce Prompt-only SV (PrOSV), an SV that intervenes only on a few prompt tokens. Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme. We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.
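The intervention pattern behind a prompt-only steering vector is just: add a factor-scaled direction to hidden states at prompt-token positions, leaving generated-token positions untouched. All names and shapes below are illustrative; this shows the PrOSV intervention site, not the joint training of factor and direction:

```python
import numpy as np

rng = np.random.default_rng(7)
T, d = 6, 4
hidden = rng.normal(size=(T, d))  # hidden states, one row per token
sv = rng.normal(size=d)           # steering direction
factor = 0.5                      # in the paper, trained jointly with sv
prompt_len = 3                    # intervene only on prompt tokens

steered = hidden.copy()
steered[:prompt_len] += factor * sv  # prompt-only intervention

# Generation-position states are untouched, unlike a full-sequence SV.
print(np.allclose(steered[prompt_len:], hidden[prompt_len:]))
```

A full-sequence SV would add `factor * sv` to every row, which is the excessive intervention the abstract argues can hurt generation quality.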

[LG-81] Physical Fidelity Reconstruction via Improved Consistency-Distilled Flow Matching for Dynamical Systems

链接: https://arxiv.org/abs/2605.05975
作者: Sicheng Ma,Tianyue Yang,Xiuzhe Wu,Xiao Xue
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Reconstructing high-fidelity flow fields from low-fidelity observations is a central problem in scientific machine learning, yet recent diffusion and flow-matching models typically rely on iterative sampling, making them costly for latency-sensitive workflows such as ensemble forecasting, real-time visualization, and simulation-in-the-loop inference. We study whether a high-fidelity flow-matching generative model can be compressed into a compact one-step model for fast scientific flow reconstruction. Our approach distills an optimal-transport flow-matching teacher into a one-step consistency model. Low-fidelity observations are incorporated at inference by initializing the generative trajectory from a noised observation along the transport path, allowing an unconditional high-fidelity flow model to perform conditional reconstruction without retraining the teacher. We evaluate this distillation strategy on three fluid benchmarks, Smoke Buoyancy, Turbulent Channel Flow, and Kolmogorov Flow, using coarse-to-fine reconstruction as a controlled testbed at field sizes up to 256 \times 256 . Across these settings, the distilled student retains similar performance of the teacher’s model on spectrum metrics, while using roughly half as many parameters and achieving a 12\times inference speedup over the flow-matching teacher. Under the same training budget, the distilled student also outperforms a one-step consistency model trained directly from scratch by 23.1% in SSIM, showing that teacher distillation improves training efficiency rather than merely accelerating sampling. These results suggest a promising route for turning future high-capacity scientific generative models into compact reconstruction models that are faster to train, cheaper to run, and easier to deploy.
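The conditioning trick described above (initializing the generative trajectory from a noised observation along the transport path) can be sketched with the optimal-transport flow-matching path x_t = (1 - t) z + t x1. The one-step student that would then map x_t to a high-fidelity sample is omitted; everything below is a toy reading of the mechanism, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(6)

def noised_init(obs, t, rng):
    """Place an observation at time t on the OT flow-matching path
    x_t = (1 - t) * z + t * obs, with z ~ N(0, I)."""
    z = rng.normal(size=obs.shape)
    return (1.0 - t) * z + t * obs

obs = rng.normal(size=(4, 4))          # toy low-fidelity field
x_start = noised_init(obs, 0.8, rng)   # mid-trajectory start: keeps obs structure
x_data = noised_init(obs, 1.0, rng)    # t = 1 recovers the observation exactly

print(np.allclose(x_data, obs))
```

Starting from such a mid-trajectory point lets an unconditional one-step model act as a conditional reconstructor without retraining, since the observation's information is already baked into the initialization.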

[LG-82] Training Transformers for KV Cache Compressibility

链接: https://arxiv.org/abs/2605.05971
作者: Yoav Gelberg,Yam Eitan,Michael Bronstein,Yarin Gal,Haggai Maron
类目: Machine Learning (cs.LG)
*备注: 32 pages, 4 figures

点击查看摘要

Abstract:Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summarization to recent optimization-based KV compression methods. These post-hoc methods operate on the KV cache of a fixed pretrained model, so their effectiveness is fundamentally limited by how well the model’s internal representations can be compressed. In this work, we formalize the notion of KV compressibility and show that it is a property of the learned representations, rather than of the context alone. We prove that almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, highlighting the need to guide transformers toward compressible representations during training. Motivated by this, we propose KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that incentivizes the emergence of compressible representations. We introduce a train-time KV sparsification policy that masks KV slots during training. This forces the model to use fewer KV slots and encourages it to learn representations amenable to post-hoc compression. Empirically, we show that KV-CAT improves the quality-budget tradeoff of downstream compression methods across retrieval, long-context question answering, and perplexity-based evaluation of compressed-prefix continuation.
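The train-time KV sparsification policy amounts to masking a subset of KV slots in the attention computation, so the model must route information through fewer slots. The single-query sketch below shows that masking mechanism only; the policy (here: a random keep mask) and all names are illustrative, not KV-CAT's:

```python
import numpy as np

rng = np.random.default_rng(5)
T, d = 8, 4

def masked_attention(q, K, V, keep):
    """Single-query attention restricted to KV slots where keep[t] is True;
    masked slots get -inf scores before the softmax."""
    scores = K @ q / np.sqrt(d)
    scores = np.where(keep, scores, -np.inf)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V, w

q = rng.normal(size=d)
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

# Illustrative sparsification: randomly keep about half the slots,
# always retaining the most recent one.
keep = rng.random(T) < 0.5
keep[-1] = True

out, w = masked_attention(q, K, V, keep)
print(float(w[~keep].sum()))  # masked slots receive zero attention
```

Training under such masks is what pushes the model toward representations that a post-hoc KV compressor can later drop cheaply.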

[LG-83] Sharper Guarantees for Misspecified Kernelized Bandit Optimization

链接: https://arxiv.org/abs/2605.05967
作者: Davide Maran,Csaba Szepesvári
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Existing guarantees for misspecified kernelized bandit optimization pay for misspecification through kernel complexity: in generic offline bounds, the misspecification level \varepsilon is multiplied by \sqrt{d_{\mathrm{eff}}} , where d_{\mathrm{eff}} is the kernel effective dimension, while in online regret bounds, the corresponding penalty is \sqrt{\gamma_n}\,n\varepsilon , where \gamma_n is the maximum information gain after n rounds of interaction. In this work, we show that, for a large class of kernels, the misspecification amplification can be reduced to logarithmic or polylogarithmic growth. In the offline setting, we first prove high-probability simple-regret bounds whose misspecification term is governed by a spectral Lebesgue constant. This yields logarithmic amplification for one-dimensional monotone spectra and polylogarithmic amplification for multivariate Fourier-diagonal product kernels. In the online setting, we modify a domain-splitting algorithm and prove a cumulative regret bound of \widetilde{\mathcal{O}}(\sqrt{\gamma_n n}+n\varepsilon) under mild localized eigendecay assumptions, removing the extra \sqrt{\gamma_n} factor from the misspecification term. The common principle is localization: spectral localization controls the Lebesgue constant of the offline approximation operator, while domain splitting implements the spatial analogue of this mechanism in the online setting, preventing local misspecification errors from being amplified globally.


[LG-84] Uncertainty Estimation via Hyperspherical Confidence Mapping ICLR2026

链接: https://arxiv.org/abs/2605.05964
作者: Eunseo Choi,Ho-Yeon Kim,Jaewon Lee,Taeyong Jo,Myungjun Lee,Heejin Ahn
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR 2026. 24 pages, 7 figures, including appendix

点击查看摘要

Abstract:Quantifying uncertainty in neural network predictions is essential for high-stakes domains such as autonomous driving, healthcare, and manufacturing. While existing approaches often depend on costly sampling or restrictive distributional assumptions, we propose Hyperspherical Confidence Mapping (HCM), a simple yet principled framework for sampling-free and distribution-free uncertainty estimation. HCM decomposes outputs into a magnitude and a normalized direction vector constrained to lie on the unit hypersphere, enabling a novel interpretation of uncertainty as the degree of violation of this geometric constraint. This yields deterministic and interpretable estimates applicable to both regression and classification. Experiments across diverse benchmarks and real-world industrial tasks demonstrate that HCM matches or surpasses ensemble and evidential approaches, with far lower inference cost and stronger confidence-error alignment. Our results highlight the power of geometric structure in uncertainty estimation and position HCM as a versatile alternative to conventional techniques.
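The geometric core of HCM (a magnitude plus a direction constrained to the unit hypersphere, with uncertainty read off as constraint violation) can be sketched in a few lines. The scoring formula below is our own simplification for illustration, not the paper's estimator:

```python
import numpy as np

def hcm_decompose(v, eps=1e-12):
    """Split a raw output vector into a magnitude and a unit direction."""
    r = float(np.linalg.norm(v))
    return r, v / max(r, eps)

def constraint_violation(direction_pred):
    """Degree to which a predicted direction leaves the unit hypersphere
    (hypothetical uncertainty score)."""
    return abs(float(np.linalg.norm(direction_pred)) - 1.0)

r, u = hcm_decompose(np.array([3.0, 4.0]))  # magnitude 5, unit direction

well_calibrated = np.array([0.6, 0.8])  # essentially on the sphere
off_sphere = np.array([0.3, 0.4])       # norm 0.5: strong violation

print(r, constraint_violation(well_calibrated) < constraint_violation(off_sphere))
```

Because the score is a deterministic function of one forward pass's output, it needs neither sampling nor distributional assumptions, which is the efficiency argument the abstract makes.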

[LG-85] Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLM s

链接: https://arxiv.org/abs/2605.05957
作者: Zixuan Chen,Hao Lin,Zizhe Chen,Yizhou Tian,Garry Yang,Depeng Wang,Ya Guo,Huijia Zhu,James Cheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode *correction suppression* and construct a benchmark of 300 false premises to systematically evaluate it across eight models. Suppression rates range from 19% to 90%, with four models exceeding 80%, establishing correction suppression as a prevalent and severe phenomenon. Mechanistic analysis reveals that suppression is not a knowledge failure: the model registers the error internally, but task context diverts early-layer attention from the false claim as output intent crystallizes toward compliance at middle layers. We characterize this as *knowing but not correcting*: suppression occurs at response selection rather than knowledge encoding. Guided by this mechanism, we propose two training-free interventions. Correction Direction Steering (CDS) estimates a correction-compliance direction from matched pairs and injects it at middle layers before output intent crystallizes. Dynamic Payload Amplification (DPA) localizes payload tokens via attention divergence between early and late layers and amplifies their representation at the final layer, requiring no calibration data. Experiments on Qwen3.5-9B and LLaMA3.1-8B show both methods substantially improve factual strictness. CDS achieves the highest correction rate on Qwen3.5-9B (0% → 58.2%). DPA is the only method that preserves or improves reasoning capability on both models. These findings introduce *factual strictness*, the willingness to uphold accuracy against contextual pressures, as a new dimension of model reliability.
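Correction Direction Steering as described is an activation-steering scheme. Below is a generic sketch of the usual mean-difference formulation over matched pairs of hidden states; the paper's exact estimator, injection layer, and scaling are not specified here, so array shapes, normalization, and function names are our assumptions.

```python
import numpy as np

def correction_direction(h_correct, h_comply):
    """Estimate a correction-compliance direction from matched pairs of
    hidden states, shape [n_pairs, hidden_dim]: mean difference between
    states that led to correction vs. compliance, then unit-normalized."""
    d = (h_correct - h_comply).mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha):
    """Inject the correction direction into a middle-layer hidden state,
    nudging the residual stream toward the correction behavior."""
    return hidden + alpha * direction
```

In practice the steering would be applied via a forward hook at a chosen middle layer during generation; the strength alpha trades off correction rate against fluency.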

[LG-86] Quadratic Objective Perturbation: Curvature-Based Differential Privacy

链接: https://arxiv.org/abs/2605.05905
作者: Daniel Cortild,Coralia Cartis
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Objective perturbation is a standard mechanism in differentially private empirical risk minimization. In particular, Linear Objective Perturbation (LOP) enforces privacy by adding a random linear term, while strong convexity and stability are ensured by an additional deterministic quadratic term. However, this approach requires the strong assumption of bounded gradients of the loss function, which excludes many modern machine learning models. In this work, we introduce Quadratic Objective Perturbation (QOP), which perturbs the objective with a random quadratic form. This perturbation induces strong convexity and enforces stability of the problem through curvature, thereby enabling privacy and allowing sensitivity to be controlled through spectral properties of the perturbation rather than assumptions on the gradients. As a result, we obtain (\varepsilon,\delta)-differential privacy under weaker assumptions, in the interpolation regime. Furthermore, we extend the analysis to account for approximate solutions, showing that privacy guarantees are preserved under inexact solves. Additionally, we derive utility guarantees in terms of empirical excess risk, and provide a theoretical and numerical comparison to LOP, highlighting the advantages of curvature-based perturbations. Finally, we discuss algorithmic aspects and show that the resulting problems can be solved efficiently using modern splitting schemes.

[LG-87] VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

链接: https://arxiv.org/abs/2605.05899
作者: Cheng Xu,Xiaofeng Hou,Jiacheng Liu,Chao Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed for text-centric workloads and become much less effective for visual-heavy inputs, where large numbers of visual tokens induce broader and less predictable expert accesses. We present VisMMoE, a VL-MoE offloading system built on a single systems insight: pruning redundant visual tokens can improve offloading not only by reducing computation, but also by reshaping expert demand. We refer to this effect as *visual-expert affinity*: token pruning makes expert accesses more concentrated within layers and more stable across layers, producing a smaller and more predictable expert working set. Guided by this insight, VisMMoE combines affinity-aware token compression, lookahead expert prediction, and cache/pipeline orchestration to improve expert locality and prefetch effectiveness under tight memory budgets. We implement VisMMoE on multiple frameworks and evaluate it on representative VL-MoE models and benchmarks. VisMMoE improves end-to-end inference performance by up to 2.68x and 1.61x, respectively, over strong baselines for today's VL-MoE deployments while maintaining competitive accuracy.

[LG-88] RepFlow: Representation Enhanced Flow Matching for Causal Effect Estimation

链接: https://arxiv.org/abs/2605.05890
作者: Yifei Xie,Jian Huang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Estimating causal effects from observational data has become increasingly critical in diverse fields including healthcare, economics, and social policy. The fundamental challenge in causal inference arises from the missing counterfactuals and the selection bias. Existing methods are largely limited to point estimates and lack the capacity for distribution modeling. In this work, we propose RepFlow, a novel framework that formulates causal effect estimation as a joint optimization problem integrating representation learning with Conditional Flow Matching (CFM). RepFlow mitigates selection bias by minimizing the entropically regularized Wasserstein distance between treated and control representations. To enhance numerical stability, we further introduce an L_2 normalization constraint on latent representations. This balanced representation enables the flow model to accurately capture the distribution of potential outcomes. Extensive experiments across a wide range of benchmarks demonstrate that RepFlow consistently outperforms existing methods in both point and distributional causal effect estimation.

[LG-89] Retain-Neutral Surrogates for Min-Max Unlearning

链接: https://arxiv.org/abs/2605.05871
作者: Junhao Cai,Dohun Kim,Dowon Kim,Sung Il Choi,Chengjun Jin,Juhyun Park,Changhee Joo
类目: Machine Learning (cs.LG)
*备注: 39 pages

点击查看摘要

Abstract:Machine unlearning seeks to remove the influence of designated training data while preserving performance on the remaining data. Approximate unlearning can be viewed as a local editing problem; in min-max unlearning, the key local object is the surrogate point at which the retain objective is evaluated. When forget and retain gradients are strongly aligned, an unconstrained forget-maximizing perturbation can move to a surrogate point that increases retain loss. We propose Retain-Orthogonal Surrogate Unlearning (ROSU), which constrains the inner surrogate construction by maximizing first-order forget gain subject to zero first-order retain change under a fixed perturbation budget. This yields a closed-form retain-orthogonal perturbation, a lightweight transported outer update, and amplification along the retain-neutral direction. Our analysis establishes (i) a curvature-controlled second-order bound on retain damage, (ii) a positive-alignment regime in which ROSU strictly reduces surrogate retain loss relative to standard min-max perturbations, and (iii) near-equivalence when the two gradients are nearly orthogonal. Across vision and language benchmarks (CIFAR-10/100, Tiny-ImageNet, TOFU, WMDP), the empirical pattern follows this geometry: ROSU gives its clearest gains in high-coupling regimes while remaining competitive elsewhere.

[LG-90] QuadraSHAP: Stable and Scalable Shapley Values for Product Games via Gauss-Legendre Quadrature

链接: https://arxiv.org/abs/2605.05870
作者: Majid Mohammadi,Grigory Reznikov,Pavel Sinitcyn,Krikamol Muandet,Siu Lun Chau
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the efficient computation of Shapley values for *product games*, cooperative games in which the coalition value factorizes as a product of per-player terms. Such games arise in machine learning explainability whenever the value function inherits a multiplicative structure from the underlying model, as in kernel methods with product kernels and tree-based models. Our key result is that the Shapley value of each player in a product game admits an exact one-dimensional integral representation: the weighted sum over exponentially many feature coalitions collapses to the integral of a degree-(d-1) polynomial over [0,1], where d is the total number of features. This yields a Gauss–Legendre quadrature scheme that is *provably exact* whenever the number of nodes satisfies m_q \geq \lceil d/2 \rceil, and otherwise provides a *near-exact* approximation with error provably decaying geometrically in m_q. In practice, a few hundred nodes can achieve highly precise estimates even with thousands of features. Building on this formulation, we derive a numerically stable implementation via log-space evaluation, together with an efficient parallel implementation based on associative scan primitives that achieves O(d m_q) total work and O(\log d) parallel time. Experiments show that QuadraSHAP is the fastest numerically stable method across all tested configurations.
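The collapse to a one-dimensional integral can be sketched from the Beta-integral form of the Shapley weights. Assuming the empty coalition has value v(∅) = 1 (so player i's marginal contribution to coalition S is (a_i - 1)·∏_{j∈S} a_j), the Shapley value becomes φ_i = (a_i - 1)·∫₀¹ ∏_{j≠i}(1 + t(a_j - 1)) dt, a degree-(d-1) polynomial integrand that Gauss–Legendre quadrature with m ≥ ⌈d/2⌉ nodes integrates exactly. This derivation and the function names below are ours, not taken from the paper.

```python
import numpy as np
from itertools import combinations
from math import factorial, prod

def shapley_product_bruteforce(a):
    """Exact Shapley values of the product game v(S) = prod_{j in S} a_j,
    with v(empty) = 1, via the O(2^d) coalition sum."""
    n = len(a)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for s in range(n):
            w = factorial(s) * factorial(n - 1 - s) / factorial(n)
            for S in combinations(others, s):
                phi[i] += w * (a[i] - 1.0) * prod(a[j] for j in S)
    return np.array(phi)

def shapley_product_quadrature(a, m=None):
    """Same values via the 1-D integral
       phi_i = (a_i - 1) * int_0^1 prod_{j != i} (1 + t (a_j - 1)) dt,
    whose integrand has degree d-1, so Gauss-Legendre quadrature with
    m >= ceil(d/2) nodes is exact."""
    a = np.asarray(a, dtype=float)
    d = len(a)
    if m is None:
        m = (d + 1) // 2          # ceil(d/2), the exactness threshold
    x, w = np.polynomial.legendre.leggauss(m)
    t = 0.5 * (x + 1.0)           # map nodes from [-1, 1] to [0, 1]
    w = 0.5 * w
    phi = np.empty(d)
    for i in range(d):
        others = np.delete(a, i)
        integrand = np.prod(1.0 + np.outer(t, others - 1.0), axis=1)
        phi[i] = (a[i] - 1.0) * float(w @ integrand)
    return phi
```

Both routines agree on small games, and the resulting values satisfy efficiency: they sum to ∏_j a_j - v(∅).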

[LG-91] Do Neural Operators Forget Geometry? The Forgetting Hypothesis in Deep Operator Learning

链接: https://arxiv.org/abs/2605.05862
作者: Yanming Xia,Angelica I. Aviles-Rivero
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural operators perform well on structured domains, yet their behaviour on irregular geometries remains poorly understood. We show that this limitation is not merely an encoding issue, but a depth-wise failure mode inherent to deep operator architectures. We formalise the Geometric Forgetting Hypothesis: due to the Markovian structure of operator layers and their reliance on global mixing mechanisms, neural operators progressively lose access to domain geometry as depth increases. Using layer-wise geometric probing, we demonstrate that both spectral and attention-based operators systematically lose geometric fidelity. We show that this geometric forgetting degrades accuracy, stability, and generalisation. To counteract it, we introduce a lightweight geometry memory injection mechanism that restores geometric constraints at intermediate depths with minimal architectural overhead. This simple intervention consistently mitigates forgetting and exposes a geometric shortcut instability in transformer-based operators, revealing that geometric retention is a structural requirement rather than a design choice.

[LG-92] Offline Reinforcement Learning for Rotation Profile Control in Tokamaks

链接: https://arxiv.org/abs/2605.05857
作者: Rohit Sonker,Hiro Josep Farre Kaga,Jiayu Chen,Andrew Rothstein,Ian Char,Ricardo Shousha,Egemen Kolemen,Jeff Schneider
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tokamaks remain leading candidates for achieving practical fusion energy, yet many important control problems inside these devices are still difficult or unsolved. One such challenge is controlling the plasma rotation profile, which strongly influences stability, confinement, and transport. While the average rotation can be controlled, controlling the full profile is challenging due to high dimensionality, response to multiple actuators, and dependence on plasma condition. Learning-based control methods, such as reinforcement learning (RL), provide a potential solution to this challenging problem, with the ability to model complex interactions enabling effective multi-input multi-output control. However, learning such policies is challenging due to the lack of accurate simulators that can model the rotation profile dynamics. In this work, we investigate the use of offline RL and offline model-based RL algorithms for rotation profile control, training them solely on historical data from the DIII-D tokamak. Our final method uses probabilistic models of plasma dynamics to generate rollouts for RL training. We deploy this policy on the DIII-D Tokamak and observe promising real-world results. We conclude by highlighting key challenges and insights from training and deploying an RL policy on a complex physical device while using only limited past data.

[LG-93] Measuring Learning Progress via Gradient-Momentum Coupling

链接: https://arxiv.org/abs/2605.05856
作者: Samuel Blad,Martin Längkvist,Amy Loutfi
类目: Machine Learning (cs.LG)
*备注: 23 pages, 15 figures, preprint

点击查看摘要

Abstract:Measuring learning progress is essential for curiosity-driven exploration in reinforcement learning, but widely used signals such as prediction error often fail to distinguish meaningful, learnable patterns from random noise. This paper proposes Gradient-Momentum Coupling (GMC), a signal derived from optimization dynamics that quantifies how useful each sample’s gradient is for ongoing learning by measuring its per-parameter normalized absolute product with the momentum from previous gradients. By leveraging momentum’s natural filtering of noise and oscillations, GMC identifies samples that contribute to ongoing parameter updates. Controlled experiments demonstrate noise robustness and emergent curriculum learning, with the signal prioritizing tasks by learning speed rather than difficulty. Experiments on MiniGrid suggest that replacing prediction error with GMC within existing curiosity-driven architectures can improve robustness to observation noise.
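A minimal sketch of a gradient-momentum coupling score consistent with the description above. The per-parameter absolute product with momentum is taken from the abstract; the normalization chosen below (product of gradient and momentum norms) is our assumption, and the paper's exact per-parameter normalization may differ.

```python
import math

def gmc(grad, momentum, eps=1e-12):
    """Gradient-Momentum Coupling: how strongly a sample's gradient lines up,
    parameter by parameter, with the momentum accumulated from previous
    gradients. A noisy gradient orthogonal to the momentum scores near 0; a
    gradient aligned with the ongoing update direction scores near 1."""
    num = sum(abs(g * m) for g, m in zip(grad, momentum))
    den = math.sqrt(sum(g * g for g in grad)) * math.sqrt(sum(m * m for m in momentum))
    return num / (den + eps)
```

Because momentum averages out oscillating noise directions, samples whose gradients keep pointing where the optimizer is already moving score high, which is the filtering property the abstract attributes to GMC.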

[LG-94] Hypothesis generation and updating in large language models

链接: https://arxiv.org/abs/2605.05851
作者: Hua-Dong Xiong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly help people solve problems, from debugging code to repairing machinery. This process requires generating plausible hypotheses from partial descriptions, then updating them as more information arrives. Yet how LLMs perform this form of inference, and how close it is to optimal, remains unclear. We study this question in the number game, a controlled setting in which a learner infers the hypothesis supported by a few positive integers, such as {16, 8, 2, 64}: a rule like powers of 2 or an interval like numbers near 20. We measure the posterior over hypotheses using three complementary probes: posterior prediction, hypothesis evaluation, and hypothesis generation. We then compare LLM behavior with an optimal Bayesian model and human behavior, and test whether the same posterior is expressed across probes. LLMs are often well described by a two-parameter Bayesian fit, but with systematic offsets: by default they show a strong-sampling assumption that creates an implicit Occam's razor, favoring narrower hypotheses, while thinking mode shifts them toward greater prior reliance. We also find a robust evaluation–generation gap: LLMs select more correct hypotheses during hypothesis evaluation but generate simpler, more rule-like hypotheses. Finally, this Bayesian-with-bias pattern does not extrapolate. Models can behave as if they hold rule-like hypotheses over observed examples, yet generalize poorly to parts of the hypothesis domain not covered by those examples. Our results highlight a limitation of LLMs as general problem solvers, especially for scientific inference, where hypotheses must go beyond the data.
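The strong-sampling assumption behind the implicit Occam's razor can be illustrated with a toy number-game posterior. The specific hypothesis space and domain below are our illustrative choices, not the paper's benchmark: under strong sampling, each positive example is assumed to be drawn uniformly from the hypothesis's extension, so the likelihood is (1/|h|)^n and smaller consistent hypotheses win (the size principle).

```python
from fractions import Fraction

# Toy hypothesis space over 1..100 (hypothetical, for illustration).
DOMAIN = range(1, 101)
HYPOTHESES = {
    "powers of 2": {n for n in DOMAIN if n & (n - 1) == 0},
    "even numbers": {n for n in DOMAIN if n % 2 == 0},
    "numbers 1-64": set(range(1, 65)),
}

def posterior(data, hypotheses):
    """Bayesian posterior under strong sampling: likelihood (1/|h|)^n for
    hypotheses consistent with all observed examples, 0 otherwise, with a
    uniform prior over the hypothesis space."""
    prior = Fraction(1, len(hypotheses))
    scores = {}
    for name, ext in hypotheses.items():
        consistent = all(x in ext for x in data)
        scores[name] = prior * Fraction(1, len(ext)) ** len(data) if consistent else Fraction(0)
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}
```

For the data {16, 8, 2, 64} all three hypotheses are consistent, but "powers of 2" (7 elements) receives almost all the posterior mass, since (1/7)^4 dwarfs (1/50)^4 and (1/64)^4.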

[LG-95] MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

链接: https://arxiv.org/abs/2605.05838
作者: Yulong Huang,Xiang Liu,Hongxiang Huang,Xiaopeng Lin,Zunchang Liu,Xiaowen Chu,Zeke Xie,Bojun Cheng
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed-form online stochastic gradient descent (SGD), but naive SGD updates suffer from rapid information decay and suboptimal convergence in optimization. While momentum-based optimizers provide a natural remedy, they pose challenges in simultaneously achieving training efficiency and effectiveness. To address this, we develop a chunkwise parallel algorithm for LA with a stepwise momentum rule by geometrically reordering the update coefficients. Further, from a dynamical systems perspective, we analyze the momentum-based recurrence as a second-order system that introduces complex conjugate eigenvalues. This analysis guides the design of stable gating constraints. The resulting model, Momentum DeltaNet (MDN), leverages Triton kernels to achieve comparable training throughput with competitive linear models such as Mamba2 and KDA. Extensive experiments on the 400M and 1.3B parameter models demonstrate consistent performance improvements over strong baselines, including Transformers, Mamba2 and GDN, across diverse downstream evaluation benchmarks. Code: this https URL .

[LG-96] HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained Devices

链接: https://arxiv.org/abs/2605.05819
作者: Shen Xu,Xiangwen Zhuge,Zhe Xu,Yingkun Hu,Zheng Yang,Yunhao Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:LLMs often struggle with memory-constrained deployment on consumer-grade hardware due to their massive parameter sizes. While existing solutions such as model compression and offloading improve deployment feasibility, they often suffer from substantial accuracy degradation or severe throughput bottlenecks. Recent error compensation methods recover accuracy through auxiliary LoRA-style branches, and we observe that these branches are inherently amenable to offloading: they require substantial parameter storage but access only a small subset of compensation parameters during each inference step. Motivated by this opportunity, we propose HCInfer, a heterogeneous inference system that offloads residual compensation to the CPU while executing the compressed backbone on the GPU, and further introduces an asynchronous compensation pipeline and sensitivity-aware dynamic rank allocation to hide compensation overhead and maximize accuracy recovery. Experimental results show that HCInfer achieves a maximum accuracy improvement of 5.2% on downstream tasks over the compressed model, while sustaining a maximum speedup of 10.4x over the full-precision model.

[LG-97] Retrieval from Within: An Intrinsic Capability of Attention-Based Models

链接: https://arxiv.org/abs/2605.05806
作者: Elad Hoffer,Yochai Blau,Ron Banner,Daniel Soudry,Boris Ginsburg
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) typically treats retrieval and generation as separate systems. We ask whether an attention-based encoder-decoder can instead retrieve directly from its own internal representations. We introduce INTRA (INTrinsic Retrieval via Attention), a framework where decoder attention queries score pre-encoded evidence chunks that are then directly reused as context for generation. By construction, INTRA unifies retrieval and generation, eliminating the retriever-generator mismatch typical of RAG pipelines. This design also amortizes context encoding by reusing precomputed encoder states across queries. On question-answering benchmarks, INTRA outperforms strong engineered retrieval pipelines on both evidence recall and end-to-end answer quality. Our results demonstrate that attention-based models already possess a retrieval mechanism that can be elicited, rather than added as an external module.
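A toy sketch of the scoring step described above: a decoder attention query scores pre-encoded chunk representations, and the top-scoring chunks are reused as context. In INTRA this happens inside the model over cached encoder states; here we score plain vectors, so the shapes, the softmax temperature, and the function name are our choices for illustration.

```python
import numpy as np

def intrinsic_retrieve(query, chunk_keys, top_k=2):
    """Score pre-encoded chunk vectors with an attention query and return
    the indices of the top-scoring chunks plus the attention weights.
    query: [d]; chunk_keys: [n_chunks, d] (precomputed once, reused across
    queries, which is the amortization the abstract describes)."""
    d = query.shape[-1]
    scores = chunk_keys @ query / np.sqrt(d)   # scaled dot-product attention
    probs = np.exp(scores - scores.max())      # numerically stable softmax
    probs = probs / probs.sum()
    top = np.argsort(-probs)[:top_k]
    return top, probs
```

The same attention machinery that mixes context during generation thus doubles as the retriever, which is the unification the abstract claims.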

[LG-98] Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

链接: https://arxiv.org/abs/2605.05802
作者: Zhiyuan Zhai,Xin Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Group-relative RL training (GRPO) samples a small group of parallel rollouts for every training prompt and uses their within-group reward spread to compute per-trajectory advantages. In agentic environments each rollout is a long multi-turn dialogue with one LLM call per step, so this multi-sample multiplier dominates the total training cost. When every rollout of a prompt ends with the same reward, the group has zero reward variance and contributes no gradient, so the extra rollouts add no information; such groups are common in practice (typically around 40% of all groups), so the wasted-compute fraction is substantial rather than marginal. Existing methods filter such groups at the prompt level, either after their rollouts are paid for or before any rollout begins, but both decide without using information that becomes available during the rollout itself. We instead ask whether the in-group divergence between the partial trajectories at an intermediate step can already predict that the group will be zero-variance: when the parallel rollouts have already converged on the same action prefix, the group is on track to produce a single reward, and we can stop early. We propose a one-parameter gate that stops a group when the mean pairwise prefix edit distance between its partial action sequences falls below a threshold. On a 60-iteration on-policy GRPO run on ALFWorld with Qwen2.5-7B, averaged over four random seeds, the gated arm finishes 10.7% faster in wall-clock (bootstrap 95% CI excludes 0) and shifts held-out success rate on 50 unseen tasks by +2.5 pp, with the held-out gain tracing to a measurable reduction in zero-advantage gradient-batch dilution. Code is available at this https URL.
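The one-parameter gate described above admits a direct sketch: compute the mean pairwise edit distance between the groups' partial action sequences and stop the group when it falls below the threshold. Function names are ours, and the paper's exact distance computation over action sequences may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two action sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            # deletion, insertion, substitution (cost 0 if actions match)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def should_terminate(prefixes, threshold):
    """Gate: stop the group early when the mean pairwise prefix edit
    distance is below the threshold, i.e. the parallel rollouts have
    converged on (nearly) the same action prefix and are on track to
    produce a zero-variance reward group."""
    n = len(prefixes)
    if n < 2:
        return False
    dists = [edit_distance(prefixes[i], prefixes[j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists) < threshold
```

The check runs at an intermediate step of the rollout, so the LLM calls for the remaining turns of a converged group are never issued.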

[LG-99] Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLM s

链接: https://arxiv.org/abs/2605.05795
作者: Nicholas Potteiger,Ankita Samaddar,Taylor T. Johnson,Xenofon Koutsoukos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decomposing complex tasks into a sequence of simpler subtasks can improve learning efficiency for an autonomous agent. Reinforcement learning (RL) can be used to optimize agent policies to complete subtasks, but requires well-defined subtask rewards and benefits from action masking. Recent work uses large language models (LLMs) to automate reward shaping and action masking, however none of them fully address reactivity to subtask failure and modularity to varying objects for compositional tasks. To overcome these challenges, we develop masking reward behavior tree (MRBT), a symbolic structure used as a reactive and modular reward and action mask function. We design an MRBT template and derive logical specifications to construct and verify MRBTs for a sequence of object-interaction subtasks. Further, we develop an automated pipeline that uses an LLM to generate MRBTs robust to varying task objects, an SMT-solver to verify correctness of specifications, and a neurosymbolic RL loop to train agents on compositional tasks. Experiments demonstrate successful generation and refinement of five MRBTs, consistently improving training efficiency and task success rates over baselines and MRBTs without action masking. We further highlight three advantages of MRBTs: transferability, modularity, and verifiability.

[LG-100] A Measure-Theoretic Finite-Sample Theory for Adaptive-Data Fitted Q-Iteration

链接: https://arxiv.org/abs/2605.05791
作者: Manuel Haussmann,Mustafa Mert Çelikok,Melih Kandemir
类目: Machine Learning (cs.LG)
*备注: preprint

点击查看摘要

Abstract:While reinforcement learning (RL) promises to revolutionize the control of complex nonlinear robotic systems, a profound gap persists between the heuristic success of model-free off-policy deep RL and the underlying theory, which remains largely confined to tabular or linearizable settings. We identify the cause of this gap as an emergent isolation of three traditions: (i) measure-theoretic MDP foundations on general spaces limit their analysis to exact dynamic programming and ignore all error sources of a learning process; (ii) deterministic error propagation analysis addresses the approximation error via concentrability coefficients without a finite-sample analysis of the estimation error; and (iii) PAC generalization bounds characterize the estimation errors of simplified topologies. We bridge these traditions with a unified theoretical framework for fitted Q-iteration (FQI) on general measurable Borel spaces. Our main result provides a finite-sample, adaptive-data performance bound by chaining measure-theoretic probability with Bellman-operator contraction in Banach spaces. We prove that sequential Rademacher complexity controls Bellman-regression generalization under policy-dependent data collection. We further extend this analysis to provide the first cumulative, pathwise online regret guarantee for FQI in continuous spaces. These results lay the necessary foundations for the formal analysis of many modern deep RL algorithms.

[LG-101] Full-Spectrum Graph Neural Network: Expressive and Scalable ICML2026

链接: https://arxiv.org/abs/2605.05759
作者: Xiaohan Wang,Deyu Bo,Longlong Li,Kelin Xia
类目: Machine Learning (cs.LG)
*备注: 40 pages, 3 figures. Accepted to ICML 2026

点击查看摘要

Abstract:It is well established that spectral graph neural networks (GNNs) can universally approximate node signals; however, their expressive power remains bounded by the 1-dimensional Weisfeiler-Lehman test, which is mirrored in their lack of universality for higher-order signals. To go beyond this bound, we propose the Full-Spectrum GNN (FSpecGNN), a second-order generalization of classical spectral GNNs. FSpecGNN advances spectral filtering in two perspectives: (1) it lifts the signal from the node domain to the node-pair domain; and (2) it extends the univariate spectral filter over eigenvalues to a bivariate filter over eigenvalue pairs. We show that classical spectral GNNs arise as a diagonal special case of FSpecGNN, and prove that FSpecGNN can be at most as expressive as Local 2-GNN while universally approximating node-pair signals, the latter being particularly beneficial for heterophilic graph learning. Moreover, FSpecGNN admits scalable implementations that avoid explicit node-pair-level computations; combined with a low-rank approximation that reduces full-spectrum convolution to a combination of polynomial spectral filters, it enables learning on large graphs. Empirically, FSpecGNN validates the predicted expressivity and delivers strong performance on heterophilic benchmarks.

[LG-102] Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)

链接: https://arxiv.org/abs/2605.05742
作者: Scott Geng,Dutch Hansen,Jerry Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Weak-to-strong generalization is a phenomenon in post-training whereby a strong student model, when finetuned solely with feedback from a weaker teacher, can not only surpass the teacher, but can improve upon its own capabilities. Recent work of Burns et al. (2023) demonstrated that this can occur in the setting of frontier language models, and subsequently there has been a flurry of both empirical work trying to exploit this phenomenon, as well as theoretical work attempting to understand it. In this work, we demonstrate that weak-to-strong generalization occurs in standard linear logistic regression, under mild distributional assumptions on the data. In fact, we show that this happens for most student-teacher pairs, suggesting that weak-to-strong generalization is in fact *almost inevitable*, even in this basic setting. Notably, our setting does not require the student to be more expressive or have more model capacity in any way compared to the teacher, which runs contrary to the prevailing theoretical belief that a mismatch in model capacity is a central mechanism to weak-to-strong generalization.

[LG-103] Enabling Federated Inference via Unsupervised Consensus Embedding

链接: https://arxiv.org/abs/2605.05718
作者: Yui Hashimoto,Takayuki Nishio,Yuichi Kitagawa,Takahito Tanimura
类目: Machine Learning (cs.LG)
*备注: 18 pages, 15 figures, submitted to IEEE Transactions on Mobile Computing (TMC) (under review)

点击查看摘要

Abstract:Cooperative inference across independently deployed machine learning models is increasingly desirable in distributed environments, as there is a growing need to leverage multiple models while keeping their data and model parameters private. However, existing cooperative frameworks typically rely on sharing input data, model parameters, or a common encoder, which limits their applicability in privacy-sensitive or cross-organizational settings. To address this challenge, we propose Consensus Embedding-based Federated Inference (CE-FI), a framework that enables pretrained models to cooperate at inference time without sharing model parameters or raw inputs and without assuming a common encoder. CE-FI introduces two components: a Consensus Embedding (CE) layer that maps heterogeneous intermediate representations into a common embedding space, and a Cooperative Output (CO) layer that produces predictions from these embeddings. Both layers are trained using shared unlabeled data only, so the cooperative stage does not require additional labeled data. Experiments on image classification benchmarks – CIFAR-10 and CIFAR-100 – under diverse non-IID conditions show that CE-FI consistently outperforms solo inference and performs comparably to conventional methods that require stronger sharing assumptions. Additional evaluations on text and time-series tasks indicate applicability beyond image classification, although performance depends on the ensemble strategy. Further analysis identifies representation alignment as the primary bottleneck.

[LG-104] On the Blessing of Pre-training in Weak-to-Strong Generalization

链接: https://arxiv.org/abs/2605.05710
作者: Wei Yao,Wang Zhaoyang,Gengze Xu,Chen Qian,Dongrui Liu,Ziqiao Wang,Yong Liu,Yunbei Xu
类目: Machine Learning (cs.LG)
*备注: 40 pages, 14 figures

点击查看摘要

Abstract:The paradigm of Weak-to-Strong Generalization (W2SG) suggests that a pre-trained strong model can surpass its weak supervisor, yet the decisive role of pre-training remains theoretically and empirically under-explored. In this work, we identify pre-training as the essential prerequisite for the emergence of W2SG. Theoretically, we formalize the W2SG problem within a high-dimensional single-index model framework using spiked Gaussian data, modeling pre-training as a spectral initialization step. Building upon prior impossibility results regarding the failure of learning under random initialization, we prove that W2SG is achievable when pre-training provides a geometric warm start that places the model within an “effective region” characterized by a perturbed strong-convexity geometry. Within this region, we derive a rigorous generalization bound that naturally captures the optimization dynamics: an initial performance improvement followed by a saturation bottleneck dictated by the weak supervisor’s bias. Empirically, we first validate all our assumptions and theoretical insights through controlled synthetic simulations. Finally, through a massive-scale evaluation of hundreds of intermediate pre-training checkpoints from large language models, we demonstrate that W2SG is not an innate capability but emerges via a phase transition tightly coupled with the progression of pre-training.

[LG-105] Convex-Geometric Error Bounds for Positive-Weight Kernel Quadrature

链接: https://arxiv.org/abs/2605.05705
作者: Satoshi Hayakawa
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: 22 pages

点击查看摘要

Abstract:Kernel quadrature can exploit RKHS spectral structure and outperform Monte Carlo on smooth integrands, but optimized quadrature weights are generally signed and may be numerically unstable. We study whether spectral acceleration remains possible when the weights are constrained to be positive, i.e., simplex weights. In the exact-target fixed-pool setting, an evaluated i.i.d. candidate pool of size N is already available and the task is to reweight it so as to approximate the kernel mean embedding. We show that this positive reweighting problem is governed not by the equal-weight empirical average, but by the random convex hull generated by the pool. Our main geometric result shows that the mean of a bounded d-dimensional random vector can be approximated by a convex combination of N i.i.d. samples at accuracy O(d/N) with high probability, sharper than equal-weight averaging in the fixed-dimensional regime. We transfer this d-dimensional convex-hull approximation to full RKHS worst-case error through an augmented Mercer-truncation argument. The resulting positive-weight KQ bounds consist of a spectral tail term and a finite-sample convex-hull term, yielding Monte-Carlo-beating rates in favorable spectral regimes, including near-O(1/N) rates up to logarithmic factors under exponential spectral decay. We also provide a constructive Frank–Wolfe algorithm that operates directly on the pool atoms, maintains simplex weights, and admits an explicit optimization-error bound.
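
The constructive Frank–Wolfe step the abstract mentions can be illustrated on a toy version of the problem: approximating a target mean by a convex combination of pool atoms. This is a generic sketch under illustrative assumptions (pool size, dimension, and iteration count are arbitrary choices), not the paper's implementation:

```python
import numpy as np

def frank_wolfe_simplex(X, target, n_iters=2000):
    """Frank-Wolfe on the probability simplex: minimize
    0.5 * ||X.T @ w - target||^2 over nonnegative weights summing to 1.
    The linear minimization oracle over the simplex is always a vertex,
    so each iterate stays a convex combination of pool atoms."""
    n = X.shape[0]
    w = np.zeros(n)
    w[0] = 1.0                               # start at a vertex of the simplex
    for t in range(n_iters):
        grad = X @ (X.T @ w - target)        # gradient of the objective w.r.t. w
        i = int(np.argmin(grad))             # best vertex e_i of the simplex
        gamma = 2.0 / (t + 2.0)              # classic O(1/t) step size
        w *= 1.0 - gamma                     # move toward e_i, staying on the simplex
        w[i] += gamma
    return w

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])              # stand-in for the (unknown) target mean
pool = mu + rng.standard_normal((200, 3))    # i.i.d. candidate pool
w = frank_wolfe_simplex(pool, mu)
err = float(np.linalg.norm(pool.T @ w - mu))
```

The weights remain nonnegative and sum to one by construction, which is the point of the positive-weight constraint.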

[LG-106] Distributionally Robust Multi-Objective Optimization

链接: https://arxiv.org/abs/2605.05660
作者: Yufeng Yang,Fangning Zhuo,Ziyi Chen,Heng Huang,Yi Zhou
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 47 pages

点击查看摘要

Abstract:Multi-objective optimization (MOO) has received growing attention in applications that require learning under multiple criteria. However, the existing MOO formulations do not explicitly account for distributional shifts in the data. We introduce distributionally robust multi-objective optimization (DR-MOO), which minimizes multiple objectives under their respective worst-case distributions. We propose Pareto-type solution concepts for DR-MOO and develop multi-gradient descent algorithms (MGDA) with provable guarantees. Leveraging a Lagrangian dual reformulation, we first design a double-loop MGDA that uses an inner loop to estimate dual variables and achieves a total sample complexity \mathcal{O}(\epsilon^{-12}) for reaching an \epsilon-Pareto-stationary point. To further improve efficiency, we incorporate gradient clipping to handle generalized-smooth and biased gradient estimates, removing the need for double sampling. This yields a single-loop double-clip MGDA with substantially improved sample complexity \mathcal{O}(\epsilon^{-4}). Our theory applies to the nonconvex setting and does not require bounded objectives or gradients. Experiments demonstrate that our methods are competitive with state-of-the-art MGDA baselines.
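
For context, the MGDA family the abstract builds on computes a common descent direction as the minimum-norm point of the convex hull of per-objective gradients. A minimal two-objective sketch of the classical closed form (not the paper's distributionally robust variant):

```python
import numpy as np

def mgda_direction(g1, g2):
    """Classical two-objective MGDA direction: the minimum-norm point of the
    segment between the two gradients. Whenever it is nonzero, it is a common
    descent direction for both objectives."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom < 1e-12:                      # gradients already agree
        return g1.copy()
    # Minimizer of ||gamma*g1 + (1-gamma)*g2||^2 over gamma in [0, 1]
    gamma = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    return gamma * g1 + (1.0 - gamma) * g2

g1 = np.array([1.0, 0.0])                  # gradient of objective 1
g2 = np.array([0.0, 1.0])                  # gradient of objective 2
d = mgda_direction(g1, g2)                 # min-norm point of the segment
```

For orthogonal unit gradients the min-norm combination is the midpoint, which has positive inner product with both gradients.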

[LG-107] Structural Correspondence and Universal Approximation in Diagonal plus Low-Rank Neural Networks

链接: https://arxiv.org/abs/2605.05659
作者: Ying Chen,Aoxi Li,Jihun Kim,Javad Lavaei
类目: Machine Learning (cs.LG)
*备注: 27 pages, 6 figures

点击查看摘要

Abstract:The massive computational costs of scaling modern deep learning architectures have driven the widespread use of parameter-efficient low-rank structures, such as LoRA and low-rank factorization. However, theoretical guarantees for their expressive power are less explored, often relying on restrictive priors like a pretrained base matrix, ReLU activations, or non-verifiable singularity conditions. We first investigate the limits of neural networks constrained strictly to low-rank manifolds without pretrained dense priors. We demonstrate a theoretical paradox: while purely rank-1 layers can exactly interpolate arbitrary scalar datasets, they collapse for function approximation. To overcome this bottleneck without surrendering parameter efficiency, we introduce a unified Structural Correspondence framework. We prove that augmenting low-rank layers with only a minimal sparse diagonal component, namely a Diagonal plus Low-Rank (DLoR) structure, is sufficient to reach Universal Approximation. We show that any full-rank transformation can be exactly reconstructed using these DLoR components by trading off network width (additive decomposition) or depth (multiplicative decomposition). By tracking asymptotic Taylor remainders, we prove that DLoR neural networks fully restore the Universal Approximation Theorem for general activation functions. Finally, we establish that multiplicative depth provides superior parameter-to-expressivity scaling compared to additive width. Our results show that dense matrices and specific activation functions are not topological prerequisites for universal expressivity.
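
The DLoR parameterization itself is easy to state: W = diag(d) + U V^T. A minimal sketch (with hypothetical sizes) showing that such a layer can be applied without ever materializing the dense matrix:

```python
import numpy as np

def dlor_apply(x, d, U, V):
    """Apply W = diag(d) + U @ V.T to x without materializing the dense
    n x n matrix: O(n*r) work and O(n*r) parameters instead of O(n^2)."""
    return d * x + U @ (V.T @ x)

rng = np.random.default_rng(1)
n, r = 64, 4                                 # illustrative sizes
d = rng.standard_normal(n)                   # sparse diagonal component
U = rng.standard_normal((n, r))              # low-rank factors
V = rng.standard_normal((n, r))
x = rng.standard_normal(n)

y_fast = dlor_apply(x, d, U, V)
y_dense = (np.diag(d) + U @ V.T) @ x         # dense reference, for checking only
```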

[LG-108] Information-Preserving Domain Transfer with Unlabeled Data in Misspecified Simulation-Based Inference

链接: https://arxiv.org/abs/2605.05652
作者: Joon Jang,Eunho Jeong,Kyu Sung Choi,Hyeonjin Kim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulation-based inference (SBI) provides amortized Bayesian parameter inference from simulator-generated data without requiring explicit likelihood evaluation. Its reliability can degrade under model misspecification, where real-world observations are not well represented by the simulator used for training. Existing methods using unlabeled real-world data often align simulated and real-world data distributions, but marginal alignment alone does not directly preserve parameter-relevant information needed for posterior inference. We propose SPIN, an SBI framework with parameter-relevant information-preserving domain transfer using unlabeled, unpaired real-world observations. During training, SPIN translates labeled simulator observations toward the real-world domain and back to the simulator domain, using the original simulator labels to encourage domain transfer that preserves parameter-relevant mutual information. At test time, the learned real-to-simulator transport maps real-world observations into the simulator domain for posterior inference, without requiring real-world parameter labels or paired real–simulator observations. Across controlled synthetic and physical real-world benchmarks, SPIN improves real-world posterior inference, with the improvement becoming clearer as misspecification increases.

[LG-109] Scaling Pretrained Representations Enables Label-Free Out-of-Distribution Detection Without Fine-Tuning

链接: https://arxiv.org/abs/2605.05638
作者: Brett Barkley,Preston Culbertson,David Fridovich-Keil
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Models trained with deep learning often fail to signal when inputs fall outside their training data manifold, leading to unreliable predictions under distribution shift. Prior work suggests that effective out-of-distribution (OOD) detection often requires class-conditional modeling or specialized models obtained through supervised fine-tuning. We revisit this assumption in modern pretrained models and show that their frozen representations already encode sufficient geometric structure for accurate label-free OOD detection. Across 59 backbone-task pairings spanning vision and language, we compare two complementary label-free detectors: a global Mahalanobis estimator fit on unlabeled latent representations, and ReSCOPED, a lightweight, diffusion-based typicality estimator operating on the same features at a local level. Despite their different detection mechanisms, representation scaling reveals a consistent regime-dependent pattern: both local and global detectors’ absolute performance improves with better representation quality, and performance gaps between the two detectors disappear across both language and vision tasks as representations scale. These results suggest that label-free OOD detection depends strongly on the geometry exposed by frozen pretrained backbones, reducing the importance of detector choice as backbone scale increases and enabling efficient deployment directly on frozen models.
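
The global Mahalanobis detector the abstract describes is a standard construction: fit one Gaussian to unlabeled features and score inputs by squared Mahalanobis distance. A generic sketch with synthetic features standing in for a frozen backbone (the dimensions and shift are illustrative):

```python
import numpy as np

def fit_mahalanobis(features, eps=1e-3):
    """Fit one global Gaussian to unlabeled feature vectors (n x d):
    no class labels and no fine-tuning involved."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + eps * np.eye(features.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(x, mu, prec):
    """Squared Mahalanobis distance; larger means farther from the
    training-feature manifold, i.e. more likely out-of-distribution."""
    diff = x - mu
    return float(diff @ prec @ diff)

rng = np.random.default_rng(0)
train = rng.standard_normal((1000, 8))       # stand-in for frozen-backbone features
mu, prec = fit_mahalanobis(train)
score_in = mahalanobis_score(rng.standard_normal(8), mu, prec)        # in-distribution
score_out = mahalanobis_score(rng.standard_normal(8) + 6.0, mu, prec) # shifted input
```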

[LG-110] Region-adaptable retrieval of coastal biogeochemical parameters from near-surface hyperspectral remote sensing reflectance using physics-aware meta-learning

链接: https://arxiv.org/abs/2605.05623
作者: Yiqing Guo,Nagur R. C. Cherukuru,Eric A. Lehmann,S. L. Kesav Unnithan,Tim J. Malthus,Gemma Kerrisk,Xiubin Qi,Faisal Islam,Tisham Dhar,Mark J. Doubell
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hyperspectral in situ sensing has shown promise in retrieving aquatic biogeochemical (BGC) parameters, such as total suspended solids, dissolved organic carbon, and total chlorophyll-a, for cost-effective monitoring of coastal water quality. However, generalising such retrieval algorithms across water bodies remains challenging, as the relationship between remote sensing reflectance (Rrs) and BGC parameters can vary considerably from one region to another due to regional distinctions in environmental conditions and biogeochemistry that lead to different BGC ranges and bio-optical properties. In this study, we propose a two-stage physics-aware meta-learning framework for retrieving coastal BGC parameters from near-surface Rrs observations. In the first stage, a bio-optical forward model is used to generate a large synthetic dataset based on an in situ bio-optical spectral library with broad representativeness of Australian coastal waters. This dataset is then used to pretrain a region-agnostic base model with meta-learning, allowing the model to learn fundamental physical relationships. In the second stage, the pretrained base model is fine-tuned for specific regions with local samples. We collected in situ hyperspectral Rrs and BGC measurements from five geographically distinct sites in Australian coastal waters. Our experimental results suggest: (1) the BGC parameters and their corresponding hyperspectral Rrs signatures exhibited clear regional distinctions among the experimental sites; (2) the synthetic dataset was physically plausible and closely aligned with real-world samples in both parameter distributions and inter-parameter correlations; (3) the proposed approach outperformed five benchmark models in BGC retrieval; and (4) time series of in situ measured and model-predicted BGC parameters showed good agreement in both magnitude and temporal dynamics.

[LG-111] LLM Space: Carbon Footprint Modeling for Large Language Model Inference on LEO Satellites

链接: https://arxiv.org/abs/2605.05615
作者: Lei Jiang,Adrian Ildefonso,Daniel Loveless,Fan Chen
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 12 pages, 4 figures, 6 tables

点击查看摘要

Abstract:Large language models (LLMs) impose rapidly growing energy demands, creating an emerging energy and carbon crisis driven by large-scale inference. Solar-powered, AI-enabled low Earth orbit (LEO) satellites have been proposed to mitigate terrestrial electricity consumption, but their lifecycle carbon footprint remains poorly understood due to launch emissions, satellite manufacturing, and radiation-hardened hardware requirements. This paper presents LLMSpace, the first carbon modeling framework for LLM inference on AI-enabled LEO satellites. LLMSpace jointly models operational and embodied carbon, peripheral subsystems, radiation-hardened accelerators and memories, and LLM-specific workload characteristics such as prefill-decode behavior and token generation. Using realistic satellite and GPU configurations, LLMSpace reveals key trade-offs among carbon footprint, inference latency, hardware design, and operational lifetime for sustainable space-based LLM inference. Source code: this https URL.
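
At its core, lifecycle carbon accounting of the kind the abstract performs combines embodied emissions amortized over hardware lifetime with operational emissions per query. A back-of-the-envelope sketch (the function and all numbers are illustrative assumptions, not LLMSpace's model or data):

```python
def lifecycle_carbon_per_query(embodied_kgco2, lifetime_queries,
                               energy_per_query_kwh, grid_intensity_kgco2_per_kwh):
    """Amortized lifecycle carbon of one inference query: embodied emissions
    (manufacturing, and for satellites, launch) spread over the hardware's
    lifetime query count, plus operational emissions from the electricity
    the query consumes."""
    embodied = embodied_kgco2 / lifetime_queries
    operational = energy_per_query_kwh * grid_intensity_kgco2_per_kwh
    return embodied + operational

# Hypothetical terrestrial deployment: grid electricity dominates.
ground = lifecycle_carbon_per_query(2.0e6, 1.0e10, 1.0e-3, 0.4)
# Hypothetical orbital deployment: solar power zeroes the operational term,
# but launch inflates the embodied term.
orbital = lifecycle_carbon_per_query(8.0e6, 1.0e10, 1.0e-3, 0.0)
```

The comparison illustrates the trade-off the abstract names: moving inference to orbit shifts carbon from the operational term to the embodied term rather than eliminating it.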

[LG-112] Optimal Contextual Pricing under Agnostic Non-Lipschitz Demand

链接: https://arxiv.org/abs/2605.05609
作者: Jianyu Xu,Yu-Xiang Wang
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注: 30 pages, 1 figure, 1 table

点击查看摘要

Abstract:We study contextual dynamic pricing with linear valuations and bounded-support agnostic noise, whose induced demand curve may be non-Lipschitz with arbitrary jumps and atoms. Such discontinuities break the cross-context interpolation arguments used by smooth-demand pricing algorithms, while the best previous method achieved only \tilde{O}(T^{3/4}) regret. We propose Conservative-Markdown Redirect-UCB Pricing, a polynomial-time algorithm that combines randomized parameter estimation, conservative residual-grid probing, and confidence-based one-step redirection. Our algorithm achieves \tilde{O}(T^{2/3}) optimal regret, matching the known lower bounds of Kleinberg and Leighton (2003) up to logarithmic factors and improving over the previous upper bound of Xu and Wang (2022). Under stochastic well-conditioned contexts, this closes the long-standing open regret gap in linear-valuation contextual pricing under agnostic non-Lipschitz noise distributions.

[LG-113] When Can Voting Help, Hurt, or Change Course? Exact Structure of Binary Test-Time Aggregation

链接: https://arxiv.org/abs/2605.05592
作者: Yi Liu
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Majority voting is one of the few black-box interventions that can improve a fixed stochastic predictor: repeated access can be cheaper than changing a high-capability model. Classical fixed-competence theory makes this intervention look monotone – more votes help above the majority threshold and hurt below it. We show that this picture is fundamentally incomplete. Under the de Finetti representation for exchangeable repeated correctness, voting is governed by a latent distribution of per-example correctness probabilities. Even simple latent mixtures can generate sharply different voting curves, including nonmonotone behavior and, in an explicit construction, infinitely many trend changes. The full latent law determines the curve, but the curve does not determine the law. The exact object recovered by voting is a signed voting signature: at each binomial variance scale, it records excess latent mass above rather than below the majority threshold. Our main theorem proves that the complete odd-budget curve and this signature are equivalent: the curve increments are signed Hausdorff moments, and the full curve recovers the signature uniquely. This viewpoint explains shape phenomena, branch-symmetric nonidentifiability, realizability, variation, and endpoint rates. It also separates estimation regimes: direct per-example success-probability information targets the full signature, whereas fixed-depth grouped labels reveal only a finite prefix.
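
The nonmonotone voting curves the abstract describes are easy to reproduce with a tiny latent mixture: under the de Finetti view, the k-vote accuracy is a mixture of binomial tail probabilities. A sketch (the two-component mixture below is an illustrative example, not from the paper):

```python
from math import comb

def majority_prob(p, k):
    """P(majority of k i.i.d. votes is correct) for per-example accuracy p, k odd."""
    return sum(comb(k, j) * p**j * (1.0 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

def voting_curve(mixture, ks):
    """Expected k-vote majority accuracy under a latent mixture of per-example
    correctness probabilities, given as [(weight, p), ...]."""
    return [sum(w * majority_prob(p, k) for w, p in mixture) for k in ks]

# Illustrative latent mixture: half the examples are slightly easy (p = 0.60),
# half are slightly adversarial (p = 0.45, below the majority threshold).
mixture = [(0.5, 0.60), (0.5, 0.45)]
curve = voting_curve(mixture, ks=[1, 3, 1001])
```

Here a few votes help (the easy component improves faster than the hard one degrades), but a large vote budget hurts, since the hard component's majority accuracy collapses to zero: exactly the kind of trend change a fixed-competence model cannot produce.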

[LG-114] AeroJEPA: Learning Semantic Latent Representations for Scalable 3D Aerodynamic Field Modeling

链接: https://arxiv.org/abs/2605.05586
作者: Francisco Giral,Abhijeet Vishwasrao,Andrea Arroyo Ramo,Mahmoud Golestanian,Federica Tonti,Adrian Lozano-Duran,Steven L. Brunton,Sergio Hoyas,Hector Gomez,Soledad Le Clainche,Ricardo Vinuesa
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Aerodynamic surrogate models are increasingly used to replace repeated high-fidelity CFD evaluations in many-query design settings, but current approaches still face two important limitations: they often scale poorly to the very large fields arising in realistic 3D aerodynamics, and they rarely produce latent representations that are directly useful for analysis and design. We introduce AeroJEPA, a Joint-Embedding Predictive Architecture for aerodynamic field modeling that addresses both issues. Rather than predicting the full flow field directly from geometry, AeroJEPA predicts a target latent representation of the flow from a context latent representation of the geometry and operating conditions, and optionally reconstructs the field through a continuous implicit decoder. This formulation decouples latent prediction from field resolution while encouraging the latent space to organize semantically. We evaluate AeroJEPA on two complementary datasets: HiLiftAeroML, which stresses the method in a high-fidelity regime with extremely large boundary-layer fields, and SuperWing, which tests large-scale generalization and latent-space optimization over a broad family of transonic wings. Across these benchmarks, AeroJEPA is competitive as a continuous surrogate for aerodynamic fields, scales naturally to high-resolution outputs, and learns context and predicted latents that encode geometry and aerodynamic quantities not used directly as supervision. We further show that the resulting latent space supports controlled interpolation, linear probing, concept-vector arithmetic, and a constrained design latent-optimization experiment. These results suggest that predictive latent learning is a promising direction for scalable and design-meaningful aerodynamic surrogate modeling.

[LG-115] A Scalable Digital Twin Framework for Energy Optimization in Data Centers

链接: https://arxiv.org/abs/2605.05581
作者: Raphael Hendrigo de Souza Gonçalves,Wendel Marcos dos Santos
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:This study proposes a scalable Digital Twin framework for energy optimization in data centers. The framework integrates IoT-based data acquisition, cloud computing, and machine learning techniques to enable real-time monitoring, forecasting, and intelligent energy management. A controlled small-scale data center environment was developed to monitor variables such as power consumption, temperature, and computational workload. Long Short-Term Memory (LSTM) models were employed to predict energy demand and support operational decision-making. Experimental results demonstrated improvements in energy efficiency, including reductions in power consumption and enhancements in Power Usage Effectiveness (PUE). Despite being evaluated in a constrained environment, the proposed framework demonstrates strong potential as a scalable and cost-effective solution for sustainable data center management.

[LG-116] FedeKD: Energy-Based Gating for Robust Federated Knowledge Distillation under Heterogeneous Settings

链接: https://arxiv.org/abs/2605.05553
作者: Quang-Huy Nguyen,Jiaqi Wang,Wei-shinn Ku
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) operates in heterogeneous environments, where variations in data distributions and asymmetric model design often result in negative transfer. While federated knowledge distillation (FKD) avoids direct model parameter sharing, existing methods typically rely on public datasets or assume that transferred knowledge is uniformly reliable, which limits their robustness in practice. This paper presents FedeKD, a reliability-aware FKD framework that makes sample-wise trust estimation an explicit component of knowledge transfer, without relying on additional public data. Each client maintains a high-capacity private model for local learning and a lightweight shared proxy model for cross-client knowledge exchange. During training, proxy models are aggregated on the server to form a global proxy, which is then used to guide updates of the private models. At the core of FedeKD is an energy-based gating mechanism that converts task-specific private-proxy disagreement into sample-wise trust weights for backward distillation. This mechanism enables sample-wise weighting of knowledge transfer, where the proxy model contributes more to reliable samples while down-weighting unreliable ones. Extensive experiments on six real-world datasets demonstrate that FedeKD significantly reduces negative transfer under heterogeneous settings while maintaining strong predictive performance.

[LG-117] Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

链接: https://arxiv.org/abs/2605.05544
作者: Nandiraju Gireesh,Yuanliang Ju,He Wang
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Offline-to-online reinforcement learning with action chunking eliminates multi-step off-policy bias and enables temporally coherent exploration, but all existing methods use a fixed chunk size across every state. This is suboptimal: near contact events the agent needs short chunks for reactive control, while during free-space motion long chunks provide better credit assignment. The natural solution is to train critics for several chunk sizes and select the best one at each state, but naive comparison of learned critic values systematically collapses to the shortest chunk due to discount-scale mismatch, and degrades to noise in low-value states. We propose Adaptive Q-Chunking (AQC), which resolves both failures by comparing the advantage of each chunk size relative to a per-horizon baseline, normalized by the discount factor. This criterion converts biased wrong answers into unbiased near-random choices when no genuine signal exists, and becomes discriminative when a particular scale enables better planning. We prove theoretical bounds on the advantage selector’s noise immunity and on the value dominance of adaptive chunking over any fixed chunk size. We demonstrate that AQC achieves state-of-the-art offline and online success rates on OGBench and Robomimic, and can be applied to enhance the performance of large-scale VLA models that predict action sequences, significantly boosting performance on RoboCasa-GR1 tasks.

[LG-118] Towards Scalable One-Step Generative Modeling for Autoregressive Dynamical System Forecasting

链接: https://arxiv.org/abs/2605.05540
作者: Tianyue Yang,Xiao Xue
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 42 pages, 15 figures

点击查看摘要

Abstract:Fast surrogate modeling for high-dimensional physical dynamics requires more than low short-term error: useful models must roll out efficiently while preserving the statistical structure of long trajectories. Neural operators provide inexpensive autoregressive forecasts but can drift in turbulent regimes, whereas rolling diffusion and latent generative surrogates can represent stochastic transitions at the cost of multi-step denoising, noise-schedule design, or auxiliary compression models. We propose MeanFlow Long-term Invariant Spatiotemporal Consistency Autoregressive Models (MeLISA), a latent-free autoregressive generative surrogate built on pixel-space MeanFlow. MeLISA defines a blockwise stochastic transition kernel that generates each forecast block with a single model evaluation, avoiding latent encoders and iterative diffusion solvers at inference time. To stabilize long-horizon rollouts, MeLISA combines a Window-Consistency MeanFlow objective that learns conditional spatiotemporal generation from partially observed temporal windows with a Time Increment Consistency loss that constrains multi-lag finite increments and targets temporal-correlation structure. We evaluate MeLISA with compact UNet and scalable DiT backbones on two high-resolution benchmarks, extended 2D Kolmogorov flow at 256 \times 256 and a turbulent channel-flow slice at 192 \times 192 . MeLISA outperforms neural-operator baselines on short-term forecasting accuracy and long-horizon statistical metrics, including energy spectra, turbulent kinetic energy, and mixing-rate-related dynamics, while achieving inference speeds comparable to, and in some cases faster than, neural operators. Compact 3.7-5.7M-parameter variants already deliver strong parameter efficiency, and DiT variants provide a scalable path up to 150M parameters. Overall, MeLISA improves both rollout efficiency and long-horizon statistical accuracy.

[LG-119] Adversarial Graph Neural Network Benchmarks: Towards Practical and Fair Evaluation

链接: https://arxiv.org/abs/2605.05534
作者: Tran Gia Bao Ngo,Zulfikar Alom,Federico Errica,Murat Kantarcioglu,Cuneyt Gurcan Akcora
类目: Machine Learning (cs.LG)
*备注: 49 pages, 6 figures

点击查看摘要

Abstract:Adversarial learning and the robustness of Graph Neural Networks (GNNs) are topics of widespread interest in the machine learning community, as documented by the number of adversarial attacks and defenses designed for these purposes. While a rigorous evaluation of these adversarial methods is necessary to understand the robustness of GNNs in real-world applications, we posit that many works in the literature do not share the same experimental settings, leading to ambiguous and potentially contradictory scientific conclusions. In this benchmark, we demonstrate the importance of adopting fair, robust, and standardized evaluation protocols in adversarial GNN research. We perform a comprehensive re-evaluation of seven widely used attacks and eight recent defenses under both poisoning and evasion scenarios, across six popular graph datasets. Our study spans over 453,000 experiments conducted within a unified framework. We observe substantial differences in adversarial attack performance when evaluated under a fair and robust procedure. Our findings reveal that previously overlooked factors, such as target node selection and the training process of the attacked model, have a profound impact on attack effectiveness, to the extent of completely distorting performance insights. These results underscore the urgent need for standardized evaluations in adversarial graph machine learning.

[LG-120] Energy Generative Modeling: A Lyapunov-based Energy Matching Perspective

链接: https://arxiv.org/abs/2605.05530
作者: Yixuan Wang,Wenqian Xue,Warren E. Dixon
类目: Machine Learning (cs.LG)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:Generative models based on static scalar energy functions represent an emerging paradigm in which a single time independent potential drives sample generation through its gradient field, eliminating the need for time conditioning entirely. We unify the training and sampling phases of this paradigm, conventionally treated as separate procedures, within a single framework: density transport on the Wasserstein space, cast as a nonlinear control problem in which the Kullback Leibler (KL) divergence serves as a Lyapunov function. Training and sampling are then two instances of this same master dynamics, differing only in initial condition. Within this autonomous framework we develop two analytic results. First, since the Lyapunov certificate is asymptotic, we derive a finite step stopping criterion for Langevin sampling and prove that no Lyapunov certificate exists for the deterministic gradient flow on the same energy landscape. Second, the reformulation brings the toolkit of nonlinear control theory to bear on static scalar energy generative modeling, that is, we show that additive composition of trained scalar energies retains an explicit Gibbs invariant measure and inherits the closed-loop Lyapunov certificate. Beyond these immediate results, this reformulation bridges static scalar energy generative models with the full toolkit of nonlinear control theory, opening the door to barrier functions for constrained generation and contraction metrics for accelerated sampling. Experiments on synthetic distributions validate the theoretical predictions.
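
The contrast the abstract draws between stochastic Langevin sampling and the deterministic gradient flow can be seen in the standard unadjusted Langevin update, x ← x − η∇E(x) + √(2η)ξ, whose noise term gives the chain the Gibbs measure exp(−E) as its approximate stationary law. A generic sketch on a Gaussian energy (illustrative only, not the paper's models):

```python
import numpy as np

def langevin_sample(grad_energy, x0, step=0.01, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics on a static energy E: the injected noise
    keeps the chain near the Gibbs measure exp(-E), whereas the deterministic
    gradient flow on the same landscape just collapses into local minima."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step * grad_energy(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

# Gaussian energy E(x) = 0.5 * ||x||^2, so grad E(x) = x and exp(-E) is N(0, I).
samples = np.array([
    langevin_sample(lambda x: x, np.zeros(2), rng=np.random.default_rng(seed))
    for seed in range(100)
])
```

For this energy the empirical moments of the chains should approach those of the standard normal target.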

[LG-121] Discrete Elastic Ribbons: A Unified Discrete Differential Geometry Framework for One-Dimensional Energy Models

链接: https://arxiv.org/abs/2605.05529
作者: Shivam Kumar Panda,M Khalid Jawed
类目: Computational Engineering, Finance, and Science (cs.CE); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 59 pages, 9 figures, 5 tables. Source code available on this https URL and this https URL

点击查看摘要

Abstract:Elastic ribbons, slender structures whose length (L), width (W), and thickness (b) satisfy L \gg W \gg b, exhibit mechanical behaviors intermediate between one-dimensional rods (L \gg W, b) and two-dimensional plates (L, W \gg b). In quadratic Kirchhoff-type rod-based frameworks, such as Discrete Elastic Rods (DER), the governing equilibrium equations are independent of width, and therefore these models cannot capture width-dependent mechanical effects. Reduced centerline-based ribbon models attempt to capture width dependence via coupled bending-twisting energies. However, their relative accuracy remains unclear due to the absence of a unified simulation framework. In this work, we formulate a framework grounded in discrete differential geometry where the energy is expressed as functions of coupled bending-twisting strain measures along the centerline, rather than a linear sum of quadratic bending and twisting energies as in DER. We derive analytical gradients and Hessians of the energy that enable implicit time integration. Within this unified setting, we compare five ribbon models: Kirchhoff, Sadowsky, Wunderlich, Sano, and Audoly. As a benchmark, a straight ribbon is longitudinally constrained into a pre-buckled arch and subjected to transverse displacement, inducing a supercritical pitchfork bifurcation. Predicted bifurcation thresholds are compared against shell-based finite element simulations, with the Sano model providing the closest agreement in capturing width-dependent shifts. Our high-performance JAX-based implementation achieves \mathcal{O}(N) per-iteration cost and also confirms that the Sano model introduces negligible per-iteration overhead relative to standard DER.

[LG-122] Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors

链接: https://arxiv.org/abs/2605.05520
作者: Badr Moufad,Albina Ilina,Hai Victor Habi,Salem Lahlou,Yazid Janati,Hagit Messer,Eric Moulines
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: Preprint

点击查看摘要

Abstract:Commercial Microwave Links (CMLs) offer dense spatial coverage for rainfall sensing but produce path-integrated measurements that make accurate ground-level reconstruction challenging. Existing methods typically oversimplify CMLs as point sensors and neglect line integration relating rainfall to signal attenuation, resulting in degraded performance under heterogeneous precipitation. In this work, we view rain field reconstruction as a Bayesian inverse problem with Diffusion Models (DMs) as high-fidelity spatial priors. We show that diffusion models better preserve key rainfall statistics compared to censored Gaussian processes. Framing rainfall estimation as a Bayesian inverse problem with a DM prior enables training-free posterior sampling using a broad family of methods, including Plug-and-Play, Sequential Monte Carlo, and Replica Exchange methods. Experiments on synthetic and real-world datasets demonstrate consistent improvements over established CML-based reconstruction baselines.
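
The path-integrated measurement model that distinguishes CMLs from point sensors can be sketched with the standard power-law relation between rain rate R and specific attenuation, k ≈ a·R^b, integrated along the link. The coefficients and discretization below are illustrative assumptions, not calibrated values from the paper:

```python
import numpy as np

def cml_attenuation(rain_field, path_points, a=0.1, b=1.0):
    """Path-integrated attenuation of one microwave link: sample the rain
    rate R (mm/h) along the link and apply a power-law specific-attenuation
    relation k = a * R**b per unit length (a, b are illustrative placeholders)."""
    rates = np.array([rain_field(x, y) for x, y in path_points])
    seg = 1.0 / len(path_points)              # link length normalized to 1
    return float(np.sum(a * rates**b) * seg)

# A uniform 10 mm/h field: every sample along the path sees the same rate.
path = [(i / 9.0, 0.0) for i in range(10)]
att_uniform = cml_attenuation(lambda x, y: 10.0, path)
# A localized cell covering half the path yields roughly half the attenuation,
# so the measurement mixes heavy and dry segments into a single number.
att_cell = cml_attenuation(lambda x, y: 10.0 if x < 0.5 else 0.0, path)
```

This averaging over the path is exactly why treating a CML as a point sensor degrades reconstruction under heterogeneous precipitation.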

[LG-123] OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination

链接: https://arxiv.org/abs/2605.05519
作者: Jae-Won Chung,Zhirui Liang,Yanyong Mao,Jiasi Chen,Mosharaf Chowdhury,Vladimir Dvorkin
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Open-source at this https URL

点击查看摘要

Abstract:AI’s growing compute demand and new datacenter buildouts present major capacity and reliability challenges for the electricity grid, leading to multi-year interconnection delays for new datacenters and bottlenecking AI growth. To ease this strain, datacenters increasingly offer rapid power flexibility in response to grid signals, where the datacenter can increase or decrease its power consumption by adapting its workload in real time. In order to understand the impact of large datacenters on the grid and to facilitate the design of effective coordination strategies, we build OpenG2G, a simulation platform for AI datacenter-grid runtime coordination. We show that OpenG2G is capable of answering a wide range of coordination questions by allowing users to implement and compare various control paradigms (including classic, optimization, and learning-based controllers), and quantify how AI model and deployment choices affect datacenter flexibility and coordination outcomes. This versatility is enabled by OpenG2G’s modular and extensible architecture: a datacenter backend driven by real measurements of production-grade AI services, a grid backend built on high-fidelity grid simulators, and a generic controller interface that closes the loop between them. We describe the design of OpenG2G and demonstrate its usefulness through realistic grid scenarios and AI workloads.

[LG-124] Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients

链接: https://arxiv.org/abs/2605.05511
作者: Linus Aronsson,Morteza Haghir Chehreghani
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Active feature acquisition (AFA) considers prediction problems in which features are costly to obtain and the learner adaptively decides which feature values to acquire for each instance and when to stop and predict. AFA can be formulated as a partially observable Markov decision process (POMDP), which naturally admits a sequential decision-making perspective. In this paper, we present non-myopic pathwise policy gradients (NM-PPG), a new AFA method built around this formulation. We introduce a continuous relaxation of the acquisition process that enables pathwise gradients through the full acquisition trajectory, avoiding the high variance of standard score-function policy gradients while allowing end-to-end optimization of a non-myopic acquisition policy. To better align training with deployment, we further develop a straight-through rollout scheme that follows hard feature acquisitions in the forward pass while backpropagating through the corresponding soft relaxation in the backward pass. We stabilize optimization with entropy regularization and staged temperature sharpening. Experiments on both synthetic and real-world datasets demonstrate that NM-PPG yields superior performance relative to state-of-the-art AFA baselines.
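The continuous relaxation with staged temperature sharpening described above can be illustrated with a minimal sketch (an illustrative toy, not the authors' implementation): a sigmoid acquisition gate whose temperature is annealed so that the soft decision approaches the hard 0/1 choice used in a straight-through forward pass.

```python
import math

def soft_acquire(logit, temperature):
    """Relaxed acquisition decision in [0, 1]; sharpens toward 0/1 as temperature -> 0."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def hard_acquire(logit):
    """Hard decision used in the forward pass of a straight-through scheme."""
    return 1.0 if logit > 0 else 0.0

# Staged temperature sharpening: the relaxation converges to the hard choice.
logit = 1.5
for temp in (1.0, 0.3, 0.05):
    gap = abs(soft_acquire(logit, temp) - hard_acquire(logit))
    print(f"temperature={temp}: |soft - hard| = {gap:.4f}")
```

In an actual straight-through rollout the hard decision drives the forward pass while gradients flow through the soft relaxation; the sketch only shows why sharpening makes the two agree.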

[LG-125] Online Localized Conformal Prediction

链接: https://arxiv.org/abs/2605.05497
作者: Yuheng Lai,Garvesh Raskutti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal prediction is a framework that provides valid uncertainty quantification for general models with exchangeable data. However, in the online learning and time-series settings, exchangeability is not satisfied. Existing online conformal methods, such as adaptive conformal inference (ACI), can achieve long-run validity, yet they remain inefficient under covariate heterogeneity because they rely on global calibration. We propose Online Localized Conformal Prediction (OLCP), which combines online adaptation with covariate-dependent localization to better reflect heterogeneity. To reduce sensitivity to the localization bandwidth, we further develop OLCP-Hedge, which performs bandwidth selection as an online expert aggregation problem using a constrained online convex optimization framework. Importantly, we provide coverage guarantees for both algorithms and demonstrate through simulations and real-data experiments that the proposed methods attain valid long-run coverage with narrower prediction sets than existing baselines.
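The adaptive-conformal-inference style update that OLCP builds on can be sketched in a simplified global form (without the localization or hedging components, so this is background, not the paper's method): after each miscoverage the score threshold is widened, otherwise it is shrunk, yielding long-run coverage near the 1 - alpha target.

```python
import random

def run_aci(n_steps=5000, alpha=0.1, gamma=0.05, seed=0):
    """Simplified ACI-style online threshold update on i.i.d. uniform conformity scores."""
    rng = random.Random(seed)
    theta, misses = 0.5, 0.0
    for _ in range(n_steps):
        score = rng.random()                 # conformity score of the new point
        err = 1.0 if score > theta else 0.0  # 1 = prediction set missed the truth
        misses += err
        theta += gamma * (err - alpha)       # widen after a miss, shrink otherwise
    return 1.0 - misses / n_steps            # empirical long-run coverage

print(f"empirical coverage: {run_aci():.3f} (target 0.900)")
```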

[LG-126] Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning

链接: https://arxiv.org/abs/2605.05495
作者: William T. Redman,Erik C. Johnson,Brian Robinson
类目: Machine Learning (cs.LG)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:Identifying and exploiting common features across domains is at the heart of the human ability to make analogies, and is believed to be crucial for the ability to continually learn. To do this successfully, general and flexible computational strategies must be developed. While the extent to which Transformer neural network models can perform compositional reasoning has been the subject of intensive recent investigation, little work has been done to systematically understand how well these models can leverage their representations to learn new, related experiences. To address this gap, we expand the previously developed Learning Equality and Group Operations (LEGO) framework to a continual learning (CL) setting (“continual LEGO”). Using this continual LEGO experimental paradigm, we study the capability of feedforward and recurrent Transformer models to perform CL. We find that BERT, a canonical feedforward Transformer model, learns shortcut solutions that limit its ability to generalize and prevent strong forward transfer to new experiences. In contrast, we find evidence supporting the hypothesis that ALBERT, a recurrent version of BERT, learns a For loop-esque solution, which leads to better CL performance. When applying BERT and ALBERT models to a CL setting that requires composition across experiences, we find that both model families fail. Our investigation suggests that ALBERT models can have their performance drop rescued by training strategies that combine data across experiences, but this is not true for BERT models, where a detrimental shortcut solution becomes entrenched with initial training. Our results demonstrate that the recurrent ALBERT model may have an inductive bias better suited for CL and motivate future investigation of the interplay between Transformer architecture and computational solutions that emerge in modern models and tasks.

[LG-127] MEMOA: Massive Mixtures of Online Agents via Mean-Field Decentralized Nash Equilibria

链接: https://arxiv.org/abs/2605.05492
作者: Xuwei Yang,David B. Emerson,Fatemeh Tavakoli,Anastasis Kratsios
类目: Machine Learning (cs.LG)
*备注: 43 pages, 11 tables, 1 figure

点击查看摘要

Abstract:In the modern age of large-scale AI, federated learning has become an increasingly important tool for training large populations of AI agents; however, its computational and communication costs can rapidly fail to scale with the number of agents. This is precisely where decentralized agentic strategies shine: each agent acts autonomously, using only its own state together with a minimal summary of the ensemble, namely the mean-field. We derive the unique optimal decentralized policy in closed form. Optimality is characterized through a worst-client/minimax criterion: minimizing the under-performer regret, namely the maximal online cost incurred by the weakest agent in the ensemble. We further prove that the resulting decentralized policy asymptotically converges, in the large-population limit, to the Nash-optimal centralized policy, whose direct computation is not scalable. We use an online weighting mechanism to optimize the server-computed mixture of client predictions, thereby improving the mean prediction in addition to the previously optimized weakest-client prediction. Numerical experiments verify our theoretical guarantees and demonstrate that our decentralized policy typically outperforms natural greedy decentralized baselines.

[LG-128] A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers

链接: https://arxiv.org/abs/2605.05488
作者: Taeyoung Kim,Joon-Hyuk Ko
类目: Machine Learning (cs.LG)
*备注: 14 pages, 3 figures

点击查看摘要

Abstract:We propose an architecture that augments the Flux Neural Operator (Flux NO), which combines the classical finite volume method (FVM) with neural operators, with ViT-based context injection. Our model is formulated as a hypernetwork: it extracts solution dynamics over a finite temporal window, encodes them with a recurrent Vision Transformer, and generates the parameters of a context-conditioned neural operator. This enables the model to infer and solve conservation laws without explicit access to the governing equation or PDE coefficients. Experimentally, we show that the proposed method preserves the robustness, generalization ability, and long-time prediction advantages of Flux NO over standard neural operators, while delivering reliable numerical solutions across a broad range of conservative systems, including previously unseen fluxes. Our code is available at this https URL.

[LG-129] Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

链接: https://arxiv.org/abs/2605.05481
作者: Dillon Sandhu,Ronald Parr
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We revisit a classic “chicken-and-egg” problem in reinforcement learning: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy. That distribution over states is unknown and cannot be sampled for the purposes of training the value function. Conservative updates solve this problem, but at the cost of shrinking the policy update. This paper explores an alternative solution, Approximate Next Policy Sampling (ANPS), which addresses the problem by modifying the training distribution rather than constraining the policy update. ANPS is satisfied if the distribution of the training data approximates that of the next policy. To demonstrate the feasibility and efficacy of ANPS, we introduce Stable Value Approximate Policy Iteration (SV-API). SV-API modifies the standard approximate policy iteration loop to hold the target policy fixed while an iteratively updated behavioral policy gathers relevant experience. It only commits to a new policy once a convergence criterion has been met. If certain stability criteria are met, the update is guaranteed to be safe; otherwise, it remains no less safe than standard approximate policy iteration. Applying SV-API to PPO yields Stable Value PPO (SV-PPO), which matches or improves performance on high-dimensional discrete (Atari) and continuous control benchmarks while executing substantially larger target policy updates. These results demonstrate the viability of ANPS as a new solution to this classic challenge in RL.

[LG-130] Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAG s

链接: https://arxiv.org/abs/2605.05459
作者: Kennedy Edemacu,Mohammad Mahdi Shokri,Vinay M. Shashidhar,Jong Wook Kim
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work introduces PAS (Privacy Anchor Substitution), a structured mechanism for enabling user location privacy in spatial retrieval-augmented generation (RAG) systems. Unlike conventional differential privacy methods that directly perturb user locations, PAS represents location with relative anchor encoding consisting of an anchor, direction bin, and distance bin, allowing seamless integration with modern RAG pipelines. We evaluate PAS on a synthetic urban dataset and show that it achieves strong coarse-grained privacy, with approximately 370-400m adversarial location error, while retaining more than half of the baseline retrieval performance. Despite the slight drop in retrieval performance, the downstream generation quality under PAS remains comparatively robust, indicating that large language models can compensate for imperfect spatial retrieval. Furthermore, we provide empirical analysis showing that PAS exhibits a non-monotonic privacy-utility relationship with respect to its privacy parameters. We attribute this to geometric bias induced by anchor discretization, making it different from continuous noise mechanisms such as geo-indistinguishability. Our results show that structured spatial representations offer a practical approach to privacy for location-based reasoning in RAG systems.
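The relative anchor encoding (anchor, direction bin, distance bin) can be sketched as follows; the 8-way direction binning and the distance-bin edges here are illustrative choices, not parameters taken from the paper.

```python
import bisect
import math

DIST_EDGES_M = [250, 500, 1000]  # illustrative distance-bin edges in metres

def encode_location(user_xy, anchor_xy, n_dir_bins=8, dist_edges=DIST_EDGES_M):
    """Replace an exact location with a coarse (direction bin, distance bin) pair
    relative to a public anchor point."""
    dx, dy = user_xy[0] - anchor_xy[0], user_xy[1] - anchor_xy[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    half = 180.0 / n_dir_bins                       # centre bin 0 on due east
    dir_bin = int((angle + half) // (360.0 / n_dir_bins)) % n_dir_bins
    dist_bin = bisect.bisect_left(dist_edges, math.hypot(dx, dy))
    return dir_bin, dist_bin

# A point 500 m north-east-ish of the anchor falls in direction bin 1,
# second distance band.
print(encode_location((300.0, 400.0), (0.0, 0.0)))  # (1, 1)
```

Every location in the same direction and distance band maps to the same code, which is what limits an adversary to band-level localization error.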

[LG-131] Active Learning for Conditional Generative Compressed Sensing

链接: https://arxiv.org/abs/2605.05435
作者: Alexander DeLise,Nick Dexter
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 33 pages, 11 figures

点击查看摘要

Abstract:Generative compressed sensing uses the range of a pretrained generator as a nonlinear model for recovering structured signals from limited measurements. We study a conditional version of this problem for image recovery from subsampled Fourier measurements using prompt-conditioned generative models. Our framework separates two roles of conditioning: the prompt used to design the sampling distribution and the prompt used to define the recovery model. For ReLU and Lipschitz conditional generators, we prove stable recovery bounds showing that prompt-matched Christoffel sampling retains the same Christoffel complexity constant as existing near-optimal generative compressed sensing theory, while prompt mismatch incurs an explicit compatibility penalty. Experiments with Stable Diffusion show that prompts meaningfully reshape Christoffel sampling distributions and influence image recovery. Overall, our results suggest that prompts should be treated as design variables with distinct effects on sensing, approximation, and recovery.

[LG-132] Differentiable Parameter Optimization for DAEs with State-Dependent Events

链接: https://arxiv.org/abs/2605.05395
作者: Ion Matei,Maksym Zhenirovskyy,Anthony Wong
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS)
*备注:

点击查看摘要

Abstract:Differential-algebraic equations (DAEs) with state-dependent events arise in systems whose continuous dynamics are constrained by algebraic equations and interrupted by mode changes, switching logic, impacts, or state reinitializations. Gradient-based parameter learning for such systems is challenging because algebraic variables are implicitly defined, event times depend on the parameters, and reset maps introduce discontinuities. This paper studies differentiable parameter optimization for semi-explicit DAEs with events. We formulate the learning problem as a constrained least-squares problem with DAE dynamics, algebraic constraints, guard equations, and reset maps. We then develop two complementary gradient-computation strategies. The first is an automatic-differentiation-through-simulation method that solves algebraic variables inside the vector field, differentiates the algebraic solve using the implicit function theorem, and handles events through segmented differentiable integration. The second is an explicit discrete-adjoint method that represents the forward simulation as an event-split residual system and computes gradients by solving for the Lagrange multipliers of smooth-segment and event residuals. The formulation clarifies that residual terms in the adjoint method are equality constraints, not heuristic penalties. We compare the two approaches in terms of gradient interpretation, event-time handling, implementation complexity, and local validity. Both methods provide gradients for the event path selected by the forward simulation and are valid under fixed event ordering and transversal guard crossings.
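The implicit-function-theorem step that differentiates the algebraic solve can be illustrated on a scalar residual g(z, p) = 0 (a toy example, not the paper's solver): the sensitivity of the solution is dz/dp = -(dg/dz)^{-1} (dg/dp).

```python
def ift_gradient(dg_dz, dg_dp):
    """Sensitivity dz/dp of the solution of g(z, p) = 0 via the implicit function theorem."""
    return -dg_dp / dg_dz

# Toy residual g(z, p) = z**2 - p, whose solution is z = sqrt(p).
p = 4.0
z = p ** 0.5                       # "algebraic solve": z = 2
grad = ift_gradient(dg_dz=2 * z,   # dg/dz = 2z
                    dg_dp=-1.0)    # dg/dp = -1
print(grad)  # analytic dz/dp = 1 / (2 * sqrt(p)) = 0.25
```

In the DAE setting the same formula is applied with Jacobian matrices in place of scalars, which is why the algebraic solve never needs to be unrolled through the autodiff tape.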

[LG-133] Conditional Diffusion Under Linear Constraints: Langevin Mixing and Information-Theoretic Guarantees

链接: https://arxiv.org/abs/2605.05387
作者: Ahmad Aghapour,Erhan Bayraktar,Asaf Cohen
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:We study zero-shot conditional sampling with pretrained diffusion models for linear inverse problems, including inpainting and super-resolution. In these problems, the observation determines only part of the unknown signal. The remaining degrees of freedom must be sampled according to the correct conditional data distribution. Existing projection-based samplers enforce measurement consistency by correcting the observed component during reverse diffusion. However, measurement consistency alone does not determine how probability mass should be distributed along the feasible set, and this can lead to biased conditional samples. We analyze this issue through a normal–tangent decomposition of the score function. For Gaussian noising, the observed-direction score is exactly determined by the measurement; only the tangent conditional score is unknown. We prove that the error from replacing this score by the unconditional tangent score is upper bounded by a dimension-free conditional mutual information between observed and unobserved components. This gives an information-theoretic decomposition into initialization and pathwise score-mismatch errors. Motivated by the theory, we propose a projected-Langevin initialization followed by guided reverse denoising, which outperforms a strong projection-based baseline in inpainting and super-resolution experiments.
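The normal–tangent decomposition can be sketched for a measurement matrix A with orthonormal rows (an assumption that holds, up to scaling, for inpainting masks and subsampled unitary transforms): the normal component is the projection of a vector onto the row space of A, and the tangent component is the orthogonal remainder that the measurement leaves unconstrained.

```python
def normal_tangent_split(A, x):
    """Split x into its component in the row space of A (normal) and the rest (tangent).

    Assumes the rows of A are orthonormal, so the projection is A^T A x.
    """
    normal = [0.0] * len(x)
    for row in A:
        coeff = sum(r * xi for r, xi in zip(row, x))
        for i, r in enumerate(row):
            normal[i] += coeff * r
    tangent = [xi - ni for xi, ni in zip(x, normal)]
    return normal, tangent

# Inpainting-style mask observing only the first coordinate.
normal, tangent = normal_tangent_split([[1.0, 0.0, 0.0]], [2.0, 3.0, 4.0])
print(normal, tangent)  # [2.0, 0.0, 0.0] [0.0, 3.0, 4.0]
```

In the paper's analysis the observed-direction (normal) score is pinned down by the measurement, and only the tangent part must be approximated.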

[LG-134] Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

链接: https://arxiv.org/abs/2605.05373
作者: David Leeftink,Max Hinne,Marcel van Gerven
类目: Machine Learning (cs.LG)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement learning address this by encoding history into latent state representations, their internal dynamics remain uninterpretable black boxes. This paper establishes a formal link between these hidden states and the Pontryagin minimum principle (PMP) from optimal control. We demonstrate that for standard recurrent architectures, latent representations map directly to PMP co-states, which allows the readout layer to be interpreted as performing Hamiltonian minimization. Because standard reward maximization does not naturally discover this alignment, we introduce a PMP-derived co-state loss to explicitly structure the internal dynamics. Empirically, this approach matches or improves performance on partially observable DMControl tasks, and is robust against zero-shot out-of-distribution sensor masking. By framing recurrent networks as dynamic processes governed by the minimum principle, we provide a principled approach to designing robust continuous control policies.

[LG-135] A Multi-Head Attention Approach for SLA Compliance Monitoring in Data Centers

链接: https://arxiv.org/abs/2605.05354
作者: Omanshu Thapliyal
类目: Machine Learning (cs.LG)
*备注: 6 pages, 9 figures, 46th IEEE International Conference on Distributed Computing Systems

点击查看摘要

Abstract:Service level agreements (SLAs) in data center colocation contracts define precise thresholds for power, temperature, and humidity, with tiered violation penalties expressed as credits against monthly recurring charges. Traditional reactive monitoring detects breaches only after they occur, limiting remediation opportunities. We present a framework that encodes SLA rules as structured JSON objects to generate training data without manual annotation. We train a per-customer multi-head transformer model in which each attention head specializes in one SLA rule, learning temporal dependencies that precede violations by 30 minutes. Post-training, the inference service emits structured prediction events transformed into three role-specific views: finance schemas exposing credit liability, operations schemas surfacing risk scores and recommended interventions, and compliance schemas bundling predictions with immutable telemetry signatures for audit. By aligning model architecture directly with contractual obligations, this framework enables operators to anticipate SLA breaches, prioritize corrective actions, and minimize financial penalties.
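The idea of encoding SLA rules as structured JSON objects and deriving violation labels from telemetry without manual annotation can be sketched as follows; the field names and threshold are illustrative, not the paper's schema.

```python
import json

RULE_JSON = '{"metric": "temperature_c", "max": 27.0, "credit_pct": 5}'

def label_violations(telemetry, rule_json):
    """Derive per-reading violation labels directly from an SLA rule: no annotation needed."""
    rule = json.loads(rule_json)
    return [1 if reading[rule["metric"]] > rule["max"] else 0
            for reading in telemetry]

readings = [{"temperature_c": 24.5}, {"temperature_c": 28.1}, {"temperature_c": 26.9}]
print(label_violations(readings, RULE_JSON))  # [0, 1, 0]
```

One such rule per attention head is what lets each head specialize in a single contractual obligation during training.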

[LG-136] Attribution-Guided Continual Learning for Large Language Models

链接: https://arxiv.org/abs/2605.05285
作者: Yazheng Liu,Yuxuan Wan,Rui Xu,Xi Zhang,Sihong Xie,Hui Xiong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) often suffer from catastrophic forgetting in continual learning: after learning new tasks sequentially, they perform worse on earlier tasks. Existing methods mitigate catastrophic forgetting by data replay, parameter freezing, or regularization. However, these methods lack semantic awareness of internal knowledge distribution in LLMs. As a result, they cannot distinguish parameters that should be preserved or updated. We propose an attribution-guided continual fine-tuning framework for LLMs. Our method estimates task-specific, element-wise parameter importance in each Transformer layer and uses these scores to modulate gradients. Parameters important to previous tasks receive smaller updates, while less relevant ones remain plastic for learning new tasks. Experiments on continual learning benchmarks show that our method consistently outperforms baselines, achieving better retention of old tasks while maintaining competitive performance on new tasks.
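The core update rule, shrinking each parameter's gradient in proportion to its estimated importance for previous tasks, can be sketched as below; the damping form 1/(1 + importance) is an illustrative choice, and the paper's exact modulation may differ.

```python
def modulate_gradients(grads, importances, strength=1.0):
    """Shrink updates for parameters important to earlier tasks; keep the rest plastic."""
    return [g / (1.0 + strength * imp) for g, imp in zip(grads, importances)]

grads = [1.0, 1.0, 1.0]
importances = [0.0, 1.0, 9.0]  # element-wise importance for previous tasks
print(modulate_gradients(grads, importances))  # [1.0, 0.5, 0.1]
```

A parameter with zero importance updates freely, while a highly important one is nearly frozen, which is the mechanism that trades forgetting against plasticity.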

[LG-137] Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles

链接: https://arxiv.org/abs/2605.05284
作者: Daniel Grimmer
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE); Quantitative Methods (q-bio.QM)
*备注: 38 pages, 5 figures. Submitted to Evolutionary Computation, May 2026. Code available at: this https URL

点击查看摘要

Abstract:Evolutionary computation has long promised to deliver both high-performance optimization tools as well as rigorous scientific simulations of Darwinian evolution. However, modern algorithms frequently abandon evolutionary fidelity for physics-inspired heuristics or superficial biological metaphors. This paper derives a suite of advanced gradient-based optimization algorithms directly from evolutionary first principles. We introduce Darwinian Lineage Simulations (DLS) to prove that, in an asexual context, Fisher’s and Wright’s historically opposed views of evolution are actually formally equivalent. This unification requires carefully partitioning Fisher’s deterministically-evolving total population into Wright’s randomly-drifting sub-populations. We prove that proper bookkeeping requires introducing a specific kind of structured noise (the DLS noise relation). Crucially, however, any bookkeeping choices which satisfy this relation will result in a faithful simulation of evolution. Using this vast representational freedom, we prove that a broad family of battle-tested optimization algorithms are already perfectly compatible with evolutionary dynamics. These include: Stochastic Gradient Descent, Natural Gradient Descent, and the Damped Newton’s method among many others. By simply adding DLS noise (i.e., evolutionarily faithful genetic drift), these algorithms become scientifically valid in silico simulations of Darwinian evolution. Finally, we demonstrate that even the state-of-the-art Adam optimizer can be brought into evolutionary compliance through a minor mathematical surgery.

[LG-138] Forecasting Green Skill Demand in the Automotive Industry: Evidence from Online Job Postings

链接: https://arxiv.org/abs/2605.05280
作者: Sabur Butt,Joshua N. Arrazola E.,Hector G. Ceballos,Patricia Caratozzolo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The global transition toward sustainable economies is reshaping labor markets, yet systematic methods for identifying and forecasting green skills remain limited. This study presents a computational framework to measure and predict green skill demand using online job postings from Mexico’s automotive industry, which contributes about 4% of national GDP. We compile a dataset of job advertisements from Indeed Mexico, OCC Mundial, and LinkedIn (July 2024 to July 2025), yielding 204,373 skill records. A two-stage pipeline combining multilingual embeddings and ESCO validation identifies 274 unique green skills across 8,576 occurrences (4.22% of all skills). We benchmark 15 time series forecasting models using a rolling origin evaluation. Transformer-based models, especially FEDformer, Reformer, and Informer, achieve the best performance, with MAE around 2.5e-5 and relative RMSE below 15. We further propose a framework to classify skills by absolute and relative growth, identifying stable, emerging, and high-impact competencies. Results show current demand is concentrated in operational sustainability practices, while the fastest-growing skills relate to renewable energy, recycling, and hydrogen technologies. This pipeline supports data-driven workforce planning in the green transition.
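The rolling-origin evaluation used to benchmark the 15 forecasters can be sketched generically: the training window grows while the forecast origin rolls forward, so every test point is predicted only from its past. The split generator below is a standard construction, not the study's exact protocol.

```python
def rolling_origin_splits(n_points, initial_train, horizon, step=1):
    """Yield (train_end, test_start, test_end) index triples for rolling-origin evaluation."""
    origin = initial_train
    while origin + horizon <= n_points:
        yield origin, origin, origin + horizon
        origin += step

splits = list(rolling_origin_splits(n_points=12, initial_train=8, horizon=2))
print(splits)  # [(8, 8, 10), (9, 9, 11), (10, 10, 12)]
```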

[LG-139] Expert Routing for Communication-Efficient MoE via Finite Expert Banks

链接: https://arxiv.org/abs/2605.05278
作者: Mohammad Reza Deylam Salehi,Ali Khalesi
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Resource-efficient machine learning increasingly uses sparse Mixture-of-Experts (MoE) architectures, where the gate acts as both a learning component and a routing interface controlling computation, communication, and accuracy. Motivated by finite-rate interpretations of MoE gating, we treat the gate as a stochastic channel and use $I(X;T)$ to quantify the routing information available to the selected expert. To make the associated information quantities tractable beyond synthetic examples, we develop a finite-bank MNIST construction using pretrained CNN experts and a discrete, data-dependent selection rule. Since the selected model belongs to a finite candidate set, the algorithmic mutual information $I(S;W)$ admits a closed-form discrete-entropy estimator from the empirical posterior $q(W|S)$. Sweeping a data-dependence parameter $\alpha$, we observe that $\widehat{I}(S;W)$ monotonically tracks the generalization gap, while the Xu-Raginsky bound exhibits the expected looseness. We also compare with a uniform union-bound baseline and introduce an empirical estimator of $I(X;T)$ together with a Blahut-Arimoto procedure for tracing an accuracy-rate curve over the expert bank. The proposed framework provides a practical tool for analyzing resource-aware MoE inference systems and for interpreting $I(X;T)$ and $D(R_g)$ as design proxies for efficient expert routing.
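Because the selected model comes from a finite candidate set, the mutual information admits the closed-form discrete estimator the abstract mentions. A minimal version over an empirical joint distribution (the standard plug-in estimator, not necessarily the paper's exact variant):

```python
import math

def discrete_mutual_information(joint):
    """I(S; W) in bits from an empirical joint distribution {(s, w): probability}."""
    p_s, p_w = {}, {}
    for (s, w), p in joint.items():
        p_s[s] = p_s.get(s, 0.0) + p
        p_w[w] = p_w.get(w, 0.0) + p
    return sum(p * math.log2(p / (p_s[s] * p_w[w]))
               for (s, w), p in joint.items() if p > 0)

# Perfectly data-dependent selection carries 1 bit; independent selection carries 0.
print(discrete_mutual_information({(0, 0): 0.5, (1, 1): 0.5}))                      # 1.0
print(discrete_mutual_information({(s, w): 0.25 for s in (0, 1) for w in (0, 1)}))  # 0.0
```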

[LG-140] Differential Privacy in the Extensive-Form Bandit Problem

链接: https://arxiv.org/abs/2605.05266
作者: Stephen Pasteris,Rahul Savani,Theodore Turocy
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the extensive-form bandit problem, where on each trial the learner (a user coordinated by a server) plays an extensive-form game against an oblivious adversary, observing the information sets it finds itself in as well as the resulting payoff/loss. We give an algorithm for this problem that satisfies $\epsilon$-local differential privacy and attains a regret of $\tilde{O}(\sqrt{A\ln(S)T/\epsilon})$, where $A$ is the total number of actions that the learner can possibly take, $S$ is the number of the learner’s possible reduced strategies, and $T$ is the number of trials. On each trial, the time complexity of our algorithm is, up to a factor logarithmic in the maximum number of actions at an infoset, equal to the time required for the server to transmit the reduced strategy to the user. We note that local differential privacy is the strongest version of differential privacy and, to the best of our knowledge, this is the first work to study differential privacy of any form in the extensive-form bandit problem.
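Local differential privacy, the guarantee targeted above, can be illustrated with the classic randomized-response mechanism for a single bit (a standard construction used here for intuition, not the paper's algorithm): each report is kept with a probability set by the privacy parameter and flipped otherwise, and the aggregator debiases the average.

```python
import math
import random

def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps); flip it otherwise."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < keep_prob else 1 - bit

def debias(reports, epsilon):
    """Unbiased estimate of the true mean recovered from privatized reports."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    mean = sum(reports) / len(reports)
    return (mean - (1.0 - keep_prob)) / (2.0 * keep_prob - 1.0)

rng = random.Random(0)
true_bits = [1] * 10000 + [0] * 10000          # true mean is 0.5
reports = [randomized_response(b, 1.0, rng) for b in true_bits]
print(f"debiased estimate: {debias(reports, 1.0):.3f} (true mean 0.500)")
```

The same privacy-vs-accuracy tension drives the $1/\sqrt{\epsilon}$ factor in the regret bound: stronger privacy means noisier reports and slower learning.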

[LG-141] Identifier-Free Code Embedding Models for Scalable Search

链接: https://arxiv.org/abs/2605.05251
作者: Eric Wolos,Michael Doyle
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Function association is a useful process for binary reverse engineers. Search tools exist to perform association at scale, but they do not utilize the full range of capabilities that AI-enabled search provides. Prior work has explored the development of embedding models for association between certain reverse engineering code representations, but that work does not cover bidirectional association between source code and decompiled, stripped code with standard preprocessing requirements. To bridge this gap, we formalize this function association problem and evaluate the extent to which embedding models can bidirectionally associate between these two representations. To improve model performance at this task, we fine-tune a Qwen3-Embedding model with contrastive learning. We find that our new model outperforms other models on all function association baselines by a substantial margin and generalizes to a constant-algorithm association task it is not explicitly trained on.

[LG-142] DexSim2Real: Foundation Model-Guided Sim-to-Real Transfer for Generalizable Dexterous Manipulation

链接: https://arxiv.org/abs/2605.05241
作者: Zijian Zeng,Fei Ding,Huiming Yang,Xianwei Li,Yuhao Liao
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 13 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Sim-to-real transfer remains a critical bottleneck for deploying dexterous manipulation policies learned in simulation to real-world robots. Existing approaches rely on manually designed domain randomization or task-specific adaptation, limiting their generalizability across diverse manipulation scenarios. We present DexSim2Real, an integrated framework that leverages vision-language foundation models to bridge the sim-to-real gap for dexterous manipulation. Our system combines three components: (1) Foundation Model-Guided Domain Randomization (FM-DR), which uses a vision-language model as a visual realism critic to optimize simulation parameters via closed-loop CMA-ES, complementing text-based approaches like DrEureka with direct visual feedback; (2) a Tactile-Visual Cross-Attention Policy (TVCAP) that adapts cross-attention visuo-tactile fusion to zero-shot sim-to-real RL; and (3) a Progressive Skill Curriculum (PSC) that builds on LLM-based task decomposition with a difficulty scheduler tailored to contact-rich dexterous tasks. Extensive experiments on six challenging manipulation tasks with blinded evaluation demonstrate that DexSim2Real achieves a 78.2% average real-world success rate, outperforming DrEureka and DeXtreme while reducing the sim-to-real performance gap to only 8.3%.

[LG-143] SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees AAMAS2026

链接: https://arxiv.org/abs/2605.05216
作者: Yi Xie,Yangyang Xu,Yi Fan,Bo Liu
类目: Machine Learning (cs.LG)
*备注: Published at AAMAS 2026

点击查看摘要

Abstract:Large language models (LLMs) with a large number of parameters achieve strong performance but are often prohibitively expensive to deploy. Recent work explores using teams of smaller, more efficient LLMs that collectively match or even outperform a single large model. However, jointly updating multiple agents introduces compounding distribution shifts, making coordination and stability during training difficult. We address this by introducing Sequential Agent Tuning (SAT), a coordinator-free training paradigm. SAT represents the team as a factorized policy and employs block-coordinate updates over agents, enabling scalable, decentralized training without a central controller. Specifically, we develop a sequence-aware, on-policy advantage estimator that conditions on the evolving team policy, coupled with per-agent KL trust regions that isolate occupancy drift. Theoretically, this framework provides two critical guarantees. First, it ensures monotonic improvement, stabilizing the training process. Second, it establishes provable plug-and-play invariance: any agent can be upgraded to a stronger model without retraining the rest of the team, with a formal guarantee that the performance bound improves. Empirically, a team of three 4B agents (12B total) trained with SAT surpasses the much larger Qwen3-32B on AIME24/25 benchmarks by 3.9% on average. We validate our plug-and-play theory by swapping in two 8B agents, which boosts the composite score by 10.4%. We provide code and the proof appendix at this https URL

[LG-144] Nationwide EHR-Based Chronic Rhinosinusitis Prediction Using Demographic-Stratified Models

链接: https://arxiv.org/abs/2605.05213
作者: Sicong Chang,Yidan Shen,Justina Varghese,Akshay R Prabhakar,Sebastian Guadarrama-Sistos-Vazquez,Jiefu Chen,Masayoshi Takashima,Omar G. Ahmed,Renjie Hu,Xin Fu
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Sicong Chang, Yidan Shen are the co-first authors This paper is already accepted to IEEE Engineering in Medicine and Biology Society (EMBC) 2026 conference

点击查看摘要

Abstract:Chronic rhinosinusitis (CRS) is a common heterogeneous inflammatory disorder that causes substantial morbidity and healthcare costs. CRS is difficult to identify early from routine encounters, as symptom presentations overlap with common conditions such as allergic rhinitis, and heterogeneous phenotypes further obscure risk patterns. Prior predictive studies often rely on single-institutional cohorts, which reduce population-level generalizability. To overcome this, we leveraged nationwide longitudinal EHR data from the *All of Us* Research Program to predict CRS diagnosis using two years of pre-diagnostic history. To address extreme feature sparsity and dimensionality in coded EHR data, we implemented a hybrid feature-selection pipeline that combines prevalence-based statistical screening with model-based importance ranking, compressing approximately 110,000 candidate codes into 100 interpretable features. To capture demographic heterogeneity, we trained demographic stratified models across six adult sex and life-stage subgroups with subgroup-specific hyperparameter tuning. Our framework achieved an overall AUC of 0.8461, improving discrimination by 0.0168 over the best baseline. These results demonstrate that routinely collected EHR data may support population-representative CRS risk stratification and inform earlier triage and referral prioritization in primary care.
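The hybrid feature-selection pipeline, prevalence-based screening followed by model-based importance ranking, can be sketched as below; the codes, threshold, and importance scores are illustrative, since the abstract does not give the study's cutoffs.

```python
def select_features(prevalence, importance, min_prevalence=0.01, top_k=100):
    """Two-stage selection: statistical prevalence screening, then importance ranking."""
    screened = [code for code, p in prevalence.items() if p >= min_prevalence]
    screened.sort(key=lambda code: importance.get(code, 0.0), reverse=True)
    return screened[:top_k]

# Toy EHR codes with made-up cohort prevalences and model importances.
prevalence = {"J32.9": 0.08, "J30.1": 0.12, "Z99.9": 0.002, "R09.81": 0.05}
importance = {"J32.9": 0.9, "J30.1": 0.6, "R09.81": 0.3}
print(select_features(prevalence, importance, top_k=2))  # ['J32.9', 'J30.1']
```

Screening first keeps the importance ranking from having to score all ~110,000 raw codes, most of which are too rare to be informative.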

[LG-145] QUIVER: Cost-Aware Adaptive Preference Querying in Surrogate-Assisted Evolutionary Multi-Objective Optimization GECCO’26

链接: https://arxiv.org/abs/2605.04267
作者: Florian A. D. Burnat
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
*备注: Accepted at Genetic and Evolutionary Computation Conference (GECCO '26)

点击查看摘要

Abstract:Interactive multi-objective optimization systems face a budget allocation dilemma: one can spend resources on expensive objective evaluations or on eliciting decision-maker preferences that identify the relevant region of the Pareto set. Moreover, preference elicitation itself spans modalities with different information content and cognitive burden, ranging from cheap, noisy pairwise preference statements (PS) to richer but costlier indifference adjustments (IA). We study cost-aware optimization under an unknown scalarization and introduce QUIVER (Query-Informed Value Estimation for Regret), a surrogate-assisted evolutionary multi-objective optimizer that adaptively chooses between objective evaluations and heterogeneous preference queries. At each step, QUIVER selects the next action by maximizing the expected decision-quality improvement per unit total cost. Across DTLZ and WFG benchmarks under synthetic decision-maker models, QUIVER achieves the lowest final utility regret on challenging WFG problems (utility regret of 2.14 on WFG4, 2.82 on WFG9: a 25% improvement over baselines), outperforming all single-modality baselines. We analyze how the optimal mix of PS and IA adapts to problem difficulty: on easy problems (DTLZ2), QUIVER selects 80% PS queries; on hard problems (WFG9), it shifts to 35% IA queries. This adaptive modality selection demonstrates cost-aware preference learning in action.
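The core selection rule, maximizing expected decision-quality improvement per unit cost, reduces to a one-line comparison once per-action gain and cost estimates are in hand (the numbers below are invented for illustration and are not from the paper):

```python
import numpy as np

# Hypothetical per-action estimates of expected utility-regret reduction and cost.
actions = ["objective_eval", "pairwise_pref", "indifference_adj"]
expected_gain = np.array([0.30, 0.05, 0.12])
cost = np.array([10.0, 1.0, 4.0])

# Cost-aware rule: pick the action with the best improvement-per-cost ratio.
ratio = expected_gain / cost
best = actions[int(np.argmax(ratio))]
```

With these invented numbers the cheap, noisy pairwise query wins despite its small absolute gain, mirroring QUIVER's preference for PS queries on easy problems.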

[LG-146] LiVeAction: a Lightweight Versatile and Asymmetric Neural Codec Design for Real-time Operation

链接: https://arxiv.org/abs/2605.06628
作者: Dan Jacobellis,Neeraja J. Yadwadkar
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: DCC 2026

点击查看摘要

Abstract:Modern sensors generate rich, high-fidelity data, yet applications operating on wearable or remote sensing devices remain constrained by bandwidth and power budgets. Standardized codecs such as JPEG and MPEG achieve efficient trade-offs between bitrate and perceptual quality but are designed for human perception, limiting their applicability to machine-perception tasks and non-traditional modalities such as spatial audio arrays, hyperspectral images, and 3D medical images. General-purpose compression schemes based on scalar quantization or resolution reduction are broadly applicable but fail to exploit inherent signal redundancies, resulting in suboptimal rate-distortion performance. Recent generative neural codecs, or tokenizers, model complex signal dependencies but are often over-parameterized, data-hungry, and modality-specific, making them impractical for resource-constrained environments. We introduce a Lightweight, Versatile, and Asymmetric neural codec architecture (LiVeAction), which addresses these limitations through two key ideas. (1) To reduce the complexity of the encoder to meet the resource constraints of the execution environments, we impose an FFT-like structure and reduce the overall size and depth of the neural-network-based analysis transform. (2) To allow arbitrary signal modalities and simplify training, we replace adversarial and perceptual losses with a variance-based rate penalty. Our design produces codecs that deliver superior rate-distortion performance compared to state-of-the-art generative tokenizers, while remaining practical for deployment on low-power sensors. We release our code, experiments, and python library at this https URL.

[LG-147] DARTS: Targeting Prognostic Covariates in Budget-Constrained Sequential Experiments

链接: https://arxiv.org/abs/2605.06608
作者: Kateryna Husar,Alexander Volfovsky
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Randomized controlled trials typically assume that prognostic covariates are known and available at no cost. In practice, obtaining high-dimensional pretreatment data is costly, forcing a trade-off between covariate-adaptive precision and a measurement budget. We introduce Dynamic Adaptive Rerandomization via Thompson Sampling (DARTS), which treats covariate acquisition as a sequential optimization problem embedded within a design-based causal inference task. A budgeted combinatorial Thompson sampler learns which covariates are most prognostic across successive batches; selected covariates then drive rerandomization and regression adjustment to reduce batch-level average treatment effect variance. Our primary theoretical contribution is a decoupling result: adaptive covariate selection based on past batches preserves batch-level randomization validity, and the cumulative inverse-variance weighted estimator achieves at least nominal asymptotic coverage. We further derive a Bayes risk bound for the acquisition layer that matches the minimax lower bound up to logarithmic factors. Empirically, DARTS systematically concentrates the budget on informative features, significantly closing the efficiency gap to oracle designs while maintaining strict inferential validity.
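One way to picture the budgeted acquisition layer: draw a plausible prognostic value for each covariate from its posterior, then greedily pack draws-per-cost into the measurement budget (a sketch only; the posterior means, costs, and the zero-variance shortcut below are invented, and DARTS's actual sampler is a budgeted combinatorial Thompson sampler updated across batches):

```python
import numpy as np

def thompson_select(mu, var, costs, budget, rng):
    """Sample one prognostic-value draw per covariate, then fill the
    measurement budget greedily by draw-per-cost."""
    draws = rng.normal(mu, np.sqrt(var))
    chosen, spent = [], 0.0
    for j in np.argsort(draws / costs)[::-1]:
        if spent + costs[j] <= budget:
            chosen.append(int(j))
            spent += costs[j]
    return chosen

rng = np.random.default_rng(0)
mu = np.array([1.2, 0.1, 0.6])    # posterior means of covariate prognostic value
var = np.zeros(3)                 # zero posterior variance -> purely greedy choice
costs = np.array([1.0, 1.0, 1.0])
chosen = thompson_select(mu, var, costs, budget=2.0, rng=rng)
```

Randomness in the draws (nonzero `var`) is what lets the sampler keep exploring covariates whose prognostic value is still uncertain.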

[LG-148] Dynamic Treatment on Networks

链接: https://arxiv.org/abs/2605.06564
作者: Bengusu Nar,Jiguang Li,Veronika Ročková,Panos Toulis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In networks, effective dynamic treatment allocation requires deciding both whom to treat and also when, so as to amplify policy impact through spillovers. An early intervention at a well-connected node can trigger cascades that change which nodes are worth targeting in the next period. Existing treatment strategies under network interference are largely static while dynamic treatment frameworks typically ignore network structure altogether. We integrate these perspectives and propose Q-Ising, a three-stage pipeline that (i) estimates network adoption dynamics via a Bayesian dynamic Ising model from a single observed panel, (ii) augments treatment adoption histories with continuous posterior latent states, and (iii) learns a dynamic policy via offline reinforcement learning. The Bayesian mechanism enables uncertainty quantification over dynamic decisions, yielding posterior ensemble policies with interpretable spillover estimates. We provide a finite-sample regret upper bound that decomposes into standard offline-RL uncertainty, network abstraction error, and first stage error in Ising state estimation. We apply our method to data from Indian village microfinance networks and synthetic stochastic block models under simulated heterogeneous susceptible-infected-susceptible (SIS) dynamics and demonstrate that adaptive targeting outperforms static centrality benchmarks.
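The simulated SIS dynamics used in the synthetic experiments can be sketched as a discrete-time process on an adjacency matrix (the graph, infection rate, and recovery rate here are invented; the paper's exact simulator and heterogeneity structure may differ):

```python
import numpy as np

def sis_step(state, A, beta, gamma, rng):
    """One discrete-time SIS step on a graph: each susceptible node is infected by
    each infected neighbor independently w.p. beta; infected nodes recover w.p. gamma."""
    pressure = A @ state                         # number of infected neighbors
    p_infect = 1.0 - (1.0 - beta) ** pressure    # prob. of at least one transmission
    newly = (rng.random(len(state)) < p_infect) & (state == 0)
    recover = (rng.random(len(state)) < gamma) & (state == 1)
    nxt = state.copy()
    nxt[newly], nxt[recover] = 1, 0
    return nxt

# 4-node path graph, infection seeded at node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
state = np.array([1, 0, 0, 0])
rng = np.random.default_rng(0)
for _ in range(10):
    state = sis_step(state, A, beta=0.4, gamma=0.1, rng=rng)
```

Seeding a well-connected node makes cascades spread faster, which is exactly why treatment timing interacts with network position in the abstract's setting.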

[LG-149] Estimate Level Adjustment For Inference With Proxies Under Random Distribution Shifts

链接: https://arxiv.org/abs/2605.06484
作者: Steven Wilkins-Reeves,Alexandra N. M. Darmon,Deeksha Sinha
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:In many scientific domains, including experimentation, researchers rely on measurements of proxy outcomes to achieve faster and more frequent reads, especially when the primary outcome of interest is challenging to measure directly. While proxies offer a more readily accessible observation for inference, the ultimate goal is to draw statistical inferences about the primary outcome parameter and proxy data are typically imperfect in some ways. To correct for these imperfections, current statistical inference methods often depend on strict identifying assumptions (such as surrogacy, covariate/label shift, or missingness assumptions). These assumptions can be difficult to validate and may be violated by various additional sources of distribution shift, potentially leading to biased parameter estimates and miscalibrated uncertainty quantification. We introduce an estimate-level framework, inspired by domain adaptation techniques, to empirically calibrate proxy-based inference. This framework models the proxy-primary metric discrepancy as a random effect at the parameter level, estimating its distribution from aggregated historical observations across past domains (e.g., experiments, time periods, or distinct segments). This method avoids the requirement for retaining individual-level response data. Additionally, this adjustment can be layered on top of existing proxy-correction methods (such as prediction-powered inference or importance weighting) to account for additional biases not addressed by those corrections. To manage uncertainty when the number of historical domains is limited, we provide both a method-of-moments estimator and a domain bootstrap procedure. We further validate this approach using publicly available datasets and real-world experiments.
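The method-of-moments estimator for a parameter-level random effect has a simple closed form: the observed spread of per-domain discrepancy estimates minus the average within-domain sampling variance, floored at zero (a sketch under the simplifying assumption of equal domain weighting; the paper's estimator and bootstrap may differ in detail, and the numbers below are invented):

```python
import numpy as np

def random_effect_variance(d, v):
    """d: per-domain proxy-primary discrepancy estimates;
    v: their known sampling variances. Returns max(0, Var(d) - mean(v))."""
    return max(0.0, float(np.var(d, ddof=1) - np.mean(v)))

d = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # invented domain-level discrepancies
v = np.full(5, 0.5)                        # invented sampling variances
tau2 = random_effect_variance(d, v)
```

Only aggregated, domain-level estimates enter the computation, which is why no individual-level response data need be retained.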

[LG-150] Risk-Controlled Post-Processing of Decision Policies

链接: https://arxiv.org/abs/2605.06479
作者: Sunay Joshi,Tao Wang,Hamed Hassani,Edgar Dobriban
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Predictive models are often deployed through existing decision policies that stakeholders are reluctant to change unless a risk constraint requires intervention. We study risk-controlled post-processing: given a deterministic baseline policy, choose a new policy that maximizes agreement with the baseline subject to a chance constraint on a user-specified loss. At the population level, we show that the optimal policy has a threshold structure: it follows the baseline except on contexts where switching to the oracle fallback policy yields a large reduction in conditional violation risk. At the finite-sample level, given a fitted fallback policy and score, we develop a post-processing algorithm that uses calibration data to select a threshold. Leveraging tools from algorithmic stability and stochastic processes, we show that under regularity conditions, in the i.i.d. setting, the expected excess risk of the post-processed policy is O(\log n/n) . In the special case when an exact-safe fallback policy is available, the algorithm achieves precise expected risk control under exchangeability. In this setting, we also give high-probability near-optimality guarantees on the post-processed policy. Experiments on a COVID-19 radiograph diagnosis task, an LLM routing problem, and a synthetic multiclass decision task show that targeted post-processing can meet or nearly meet risk budgets while preserving substantially more agreement with the baseline than score-blind random mixing.
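The threshold structure of the optimal policy can be written down directly: keep the baseline decision everywhere except on contexts where switching to the fallback reduces estimated conditional violation risk by more than a threshold (all arrays below are invented; in the paper the threshold is selected from calibration data):

```python
import numpy as np

def post_process(baseline, fallback, risk_reduction, threshold):
    """Follow the baseline except where switching cuts conditional risk
    by more than `threshold`; returns the new policy and the switch mask."""
    switch = risk_reduction > threshold
    return np.where(switch, fallback, baseline), switch

baseline = np.array([0, 0, 1, 1])
fallback = np.array([1, 1, 0, 0])
risk_reduction = np.array([0.5, 0.1, 0.4, 0.0])  # estimated per-context risk gain
policy, switch = post_process(baseline, fallback, risk_reduction, threshold=0.3)
```

Raising the threshold trades risk control for agreement with the baseline, which is the objective the paper maximizes subject to the chance constraint.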

[LG-151] Dynamic Controlled Variables Based Dynamic Self-Optimizing Control

链接: https://arxiv.org/abs/2605.06469
作者: Chenchen Zhou,Shaoqi Wang,Hongxin Su,Xinhui Tang,Yi Cao,Shuang-Hua Yang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Self-optimizing control is a strategy for selecting controlled variables, where the economic objective guides the selection and design of controlled variables, with the expectation that maintaining the controlled variables at constant values can achieve optimization effects, translating the process optimization problem into a process control problem. Currently, self-optimizing control is widely applied to steady-state optimization problems. However, the development of process systems exhibits a trend towards refinement, highlighting the importance of optimizing dynamic processes such as batch processes and grade transitions. This paper formally introduces the self-optimizing control problem for dynamic optimization, termed the dynamic self-optimizing control problem, extending the original definition of self-optimizing control. A novel concept, “dynamic controlled variables” (DCVs), is proposed, and an implicit control policy is presented based on this concept. The paper theoretically analyzes the advantages and generality of DCVs compared to explicit control strategies and elucidates the relationship between DCVs and traditional controllers. Moreover, this paper puts forth a data-driven approach to designing self-optimizing DCVs, which considers DCV design as a mapping identification problem and employs deep neural networks to parameterize the variables. Three case studies validate the efficacy and superiority of DCVs in approximating multi-valued and discontinuous functions, as well as their application to dynamic optimization problems with non-fixed horizons, which traditional self-optimizing control methods are unable to address.

[LG-152] Neural-Actuarial Longevity Forecasting: Anchoring LSTMs for Explainable Risk Management

链接: https://arxiv.org/abs/2605.06438
作者: Davide Rindori
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注: 26 pages, 12 figures. Code available at this https URL

点击查看摘要

Abstract:Traditional multi-population models, such as the Li-Lee framework, rely on the assumption of mean-reverting country-specific deviations. However, recent data from high-longevity clusters suggest a systemic break in this paradigm. We identify a stationarity paradox where mortality residuals in countries like Sweden and West Germany exhibit persistent unit roots, leading to a systematic mispricing of longevity risk in linear models. To address these non-linearities, we propose Hybrid-Lift, a neural-actuarial framework that combines Hierarchical LSTM networks with a Mean-Bias Correction (MBC) anchoring mechanism. Positioned as a governance-friendly model challenger rather than a replacement of classical approaches, the framework exhibits selective superiority on out-of-sample validation (2012-2020): it outperforms Li-Lee by 17.40% in Sweden and 12.57% in West Germany, while remaining comparable for near-linear regimes such as Switzerland and Japan. We complement the predictive model with an integrated governance suite comprising SHAP-based cross-country influence mapping, a dual uncertainty framework for regulatory capital calibration (Swiss ES 99.0% of +1.153 years), and a reverse stress test identifying the critical shock threshold for solvency buffer exhaustion. This research provides evidence that neural networks, when properly anchored by actuarial principles, can serve as effective model challengers for longevity risk management under the SST and Solvency II standards.

[LG-153] Decoupled PFNs: Identifiable Epistemic-Aleatoric Decomposition via Structured Synthetic Priors

链接: https://arxiv.org/abs/2605.06413
作者: Richard Bergna,Stefan Depeweg,José Miguel Hernández-Lobato
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prior-Fitted Networks (PFNs) amortize Bayesian prediction by meta-learning over a synthetic task prior, but their standard output is a posterior predictive distribution over noisy observations. For sequential decision-making, such as active learning and Bayesian optimization, acquisition should prioritize epistemic uncertainty about the latent signal rather than irreducible aleatoric observation noise. We show that this epistemic–aleatoric split is not identifiable in general from the posterior predictive distribution alone, even when that distribution is known exactly. We then exploit a distinctive advantage of PFNs: because the synthetic data-generating process is under our control, each task can contain an explicit latent signal and noise function, and the generator can provide query-level labels for both the noiseless target and the observation-noise variance. We use these labels to train a decoupled PFN with separate latent-signal and aleatoric heads. The observation-level predictive is induced by convolving the latent signal distribution with the learned noise model. Empirically, epistemic-only acquisition mitigates the failure mode of total-variance exploration in noisy and heteroscedastic settings. In matched comparisons, decoupled models usually improve over tuned observation-level baselines, with the clearest gains in HPO; in broader sweeps, a decoupled model obtains the best average rank in both HPO and synthetic BO.
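The epistemic-aleatoric split the two heads target is an instance of the law of total variance: with a latent-signal mean and a noise variance predicted per posterior draw, total predictive variance decomposes as E[sigma^2] (aleatoric) plus Var[mu] (epistemic). A toy numeric check (values invented):

```python
import numpy as np

mus = np.array([1.0, 2.0, 3.0])    # latent-signal means across posterior draws
sig2 = np.array([0.5, 0.5, 0.5])   # predicted observation-noise variances

aleatoric = sig2.mean()            # E[sigma^2]: irreducible noise floor
epistemic = mus.var()              # Var[mu]: shrinks as data accumulate
total = aleatoric + epistemic      # law of total variance
```

Acquiring on `epistemic` alone avoids chasing the `aleatoric` floor, which is the failure mode of total-variance exploration the abstract describes.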

[LG-154] Covariate Balancing and Riesz Regression Should Be Guided by the Neyman Orthogonal Score in Debiased Machine Learning

链接: https://arxiv.org/abs/2605.06386
作者: Masahiro Kato
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This position paper argues that, in debiased machine learning, balancing functions should be derived from the Neyman orthogonal score, not chosen only as functions of covariates. Covariate balancing is effective when the regression error entering the score can be represented by functions of covariates alone, and it is the natural finite-dimensional approximation for targets such as ATT counterfactual means. For ATE estimation under treatment effect heterogeneity, however, the score error generally contains treatment-specific components because the outcome regression is a function of the full regressor X=(D,Z) . In that case, balancing common functions of Z can leave the treatment-specific component unbalanced. We therefore advocate regressor balancing, implemented by Riesz regression with basis functions of X , as the general balancing principle for DML. The position is not that covariate balancing is invalid, but that covariate balancing should be understood as the special case that is appropriate when the score-relevant regression error is a function of covariates alone.

[LG-155] Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under τ-Mixing

链接: https://arxiv.org/abs/2605.06373
作者: Leon Halgryn(1),Sophie Langer(2),Janusz M. Meylahn(1),E. Moritz Hahn(1) ((1) University of Twente, (2) Ruhr-Universität Bochum)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 48 pages total. 6 figures; 3 tables

点击查看摘要

Abstract:Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-Network (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as \tau -mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend statistical analyses of DQN with fully connected ReLU architectures to dependent data. We formulate each update as a nonparametric regression problem with \tau -mixing observations and derive finite-sample risk bounds under this dependence structure. Our results show that temporal dependence leads to a degradation in the statistical rate by inducing an additional dimensionality penalty in the rate exponent, reflecting the reduced effective sample size of \tau -mixing data. Moreover, we derive the sample complexity of DQN under \tau -mixing from these risk bounds. Finally, we empirically demonstrate on standard Gymnasium environments that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, supporting our theoretical framework.
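The "approximately exponentially decaying correlations" can be reproduced on any toy Markov chain; here a seeded AR(1) trajectory stands in for a replayed state sequence (the coefficient and lags are invented and unrelated to the paper's Gymnasium experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.9                              # dependence strength of the chain
x = np.zeros(20000)
for t in range(1, len(x)):
    x[t] = phi * x[t - 1] + rng.normal()

def autocorr(series, lag):
    c = series - series.mean()
    return float((c[:-lag] * c[lag:]).mean() / c.var())

rhos = [autocorr(x, k) for k in (1, 5, 10)]   # roughly phi**k, clearly nonzero
```

Treating such samples as i.i.d. overstates the effective sample size, which is exactly the gap the paper's \tau -mixing analysis closes.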

[LG-156] The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models

链接: https://arxiv.org/abs/2605.06367
作者: Flavio Nicoletti,Chenxiao Ma,Enrico Ventura,Luca Saglietti,Stefano Sarao Mannelli
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models, and potentially exacerbate disparities, remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models. Analyzing a random-features model trained on Gaussian mixtures, we derive the feature-covariance spectrum to characterize per-class generalization and memorization times. We reveal the explicit hierarchy governing these dynamics: class variance is the primary determinant of learning order, consistently favoring higher-variance classes, while centroid geometry plays a secondary role. Sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, forces minority classes to acquire distinct, delayed speciation times during backward diffusion. Together, these results suggest that diffusion models can memorize some classes while others remain insufficiently learned. We validate our theoretical predictions empirically using U-Net models trained on Fashion MNIST.

[LG-157] End-to-End Identifiable and Consistent Recurrent Switching Dynamical Systems

链接: https://arxiv.org/abs/2605.06315
作者: Carles Balsells-Rodas,Zhengrui Xiang,Xavier Sumba,Yingzhen Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning identifiable representations in deep generative models remains a fundamental challenge, particularly for sequential data with regime-switching dynamics. Existing approaches establish identifiability under restrictive assumptions, such as stationarity or limited emission models, and typically rely on variational autoencoder (VAE) estimators, which introduce approximation gaps that limit the recovery of the latent structure. In this work, we address both the theoretical and practical limitations of this setting. First, we establish identifiability of a broad class of recurrent nonlinear switching dynamical systems under flexible assumptions, significantly extending prior results. Second, we introduce ΩSDS, a flow-based estimator that enables exact likelihood optimization using expectation-maximisation. Through empirical validation on both synthetic and real-world data, our results demonstrate that ΩSDS achieves improved disentanglement compared to VAE-based estimators and more accurate forecasting of underlying dynamics.

[LG-158] ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees

链接: https://arxiv.org/abs/2605.06265
作者: Tianpai Luo,Fangwei Wu,Weichi Wu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantile regression is a fundamental tool for distributional learning but poses significant optimization challenges for deep models due to the non-smoothness of the pinball loss. We propose ConquerNet, a class of **con**volution-smoothed **qu**antil**e** **R**eLU neural **net**works, which yield smooth objectives while preserving the underlying quantile structure. We establish general nonasymptotic risk bounds for ConquerNet under mild conditions, providing minimax guarantees over Besov function classes. In numerical studies, we demonstrate that the proposed approach outperforms standard quantile neural networks at multiple quantile levels, showing improved estimation accuracy and training efficiency across the board, with particularly pronounced advantages at high and low quantiles.
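For a Gaussian kernel, the convolution smoothing has a closed form: smoothing the check loss rho_tau(u) = u(tau - 1{u<0}) with N(0, h^2) gives l_h(u) = u(tau - 1 + Phi(u/h)) + h*phi(u/h), which is infinitely differentiable and recovers the pinball loss as h -> 0 (a generic conquer-style construction; the paper's kernel choice may differ):

```python
import math

def smoothed_pinball(u, tau, h):
    """Gaussian-smoothed check loss: E[rho_tau(u - h*Z)] with Z ~ N(0, 1)."""
    pdf = math.exp(-0.5 * (u / h) ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(u / (h * math.sqrt(2.0))))
    return u * (tau - 1.0 + cdf) + h * pdf

def pinball(u, tau):
    """Standard (non-smooth) check loss rho_tau."""
    return u * (tau - (1.0 if u < 0 else 0.0))
```

Unlike the pinball loss, the smoothed version has a well-defined gradient at u = 0, which is what makes gradient-based training of deep quantile networks tractable.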

[LG-159] When Does Trimming Help Conformal Prediction? A Retained-Law Diagnostic under Calibration Contamination

链接: https://arxiv.org/abs/2605.06204
作者: Congye Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Trimming suspicious calibration points is a common response to contamination in conformal prediction. Its effect on clean-target coverage, however, is governed by the retained law induced by trimming, not by the contamination level alone. We analyse fixed-threshold trimming as conditioning rather than purification. It replaces the contaminated calibration law with a retained law, reducing clean-target coverage to a one-dimensional score-CDF transfer problem with an exact finite-sample identity. A componentwise bound on the transfer gap gives a population-level diagnostic. This separates a clean-side covariance cost from a retained-contamination cost, governed by the dirty-to-clean retention ratio. Trimming helps when the anomaly score separates retention probabilities while remaining score-neutral on the clean population. Otherwise, it cannot substantially reduce contamination through the retained mixture coefficient. We also give finite-sample certificate templates that provide numerical guarantees under independent audit.
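Fixed-threshold trimming in split conformal prediction amounts to conditioning the calibration scores on falling below the threshold before taking the usual conformal quantile (a minimal sketch with invented scores; the paper's point is precisely that this retained law, not the trim rate alone, governs clean-target coverage):

```python
import numpy as np

def trimmed_conformal_quantile(scores, trim_threshold, alpha):
    """Split-conformal (1 - alpha) quantile computed on the retained law:
    calibration scores with suspicious points (score > threshold) removed."""
    kept = np.sort(scores[scores <= trim_threshold])
    n = len(kept)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))   # standard conformal rank
    return float(kept[min(k, n) - 1])

scores = np.arange(1.0, 101.0)                  # invented calibration scores
q_full = trimmed_conformal_quantile(scores, np.inf, alpha=0.1)
q_trim = trimmed_conformal_quantile(scores, 90.0, alpha=0.1)
```

Trimming shrinks the quantile, but whether the resulting sets still cover the clean target depends on which points the threshold retained, i.e. on the retained law.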

[LG-160] Predictive-Generative Drift Decomposition for Speech Enhancement and Separation NEURIPS2026

链接: https://arxiv.org/abs/2605.06189
作者: Julius Richter,Yoshiki Masuyama,Christoph Boeddeker,Takahiro Edo,Gordon Wichern,Jonathan Le Roux
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: Submitted to NeurIPS 2026

点击查看摘要

Abstract:We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains of up to +1.0 NISQA for speech separation.


[LG-161] Expressivity of Bi-Lipschitz Normalizing Flows: A Score-Based Diffusion Perspective

链接: https://arxiv.org/abs/2605.06172
作者: Meira Iske,Carola-Bibiane Schönlieb
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Many normalizing flow architectures impose regularity constraints, yet their distributional approximation properties are not fully characterized. We study the expressivity of bi-Lipschitz normalizing flows through the lens of score-based diffusion models. For the probability flow ODE of a variance-preserving diffusion, Lipschitz regularity of the score induces a flow of bi-Lipschitz diffeomorphic transport maps. This ODE bridge allows us to analyze the distributional approximation power of bi-Lipschitz normalizing flows and, conversely, derive deterministic convergence guarantees for diffusion-based transport. Our key idea is to use the probability flow ODE to link regularity of the score to regularity of the induced transport maps. We verify score regularity for broad target densities, including compactly supported densities, Gaussian convolutions of compactly supported measures and finite Gaussian mixtures. We obtain a universal distributional approximation result: Gaussian pullbacks induced by bi-Lipschitz variance-preserving transport maps are L^1 -dense among all probability densities. For Gaussian convolution targets, we further obtain convergence in Kullback-Leibler divergence without early stopping.
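For reference, the variance-preserving probability flow ODE that the analysis builds on takes the standard form, with noise schedule \beta(t) and score \nabla_x \log p_t :

```latex
\frac{\mathrm{d}x_t}{\mathrm{d}t}
  = -\tfrac{1}{2}\,\beta(t)\left[\,x_t + \nabla_x \log p_t(x_t)\,\right]
```

When the score is Lipschitz in x , the right-hand side is a Lipschitz vector field, so by Grönwall's inequality each time- t flow map is bi-Lipschitz; this is the bridge between score regularity and transport-map regularity that the abstract describes.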

[LG-162] Diffusion model for SU(N) gauge theories

链接: https://arxiv.org/abs/2605.06134
作者: Javad Komijani,Marina K. Marinkovic,Lara Turgut
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
*备注: 23 pages, 6 figures

点击查看摘要

Abstract:Implicit score matching provides a computationally efficient approach for training diffusion models and generating high-quality samples from complex distributions. In this work, we develop a score-matching framework for SU(N) lattice gauge theories, which can be extended to other Lie groups. We apply the method to SU(3) gauge configurations with the Wilson gauge action in two and four dimensions and assess the quality of the generated samples by comparison with Hybrid Monte Carlo (HMC) simulations. We show that the diffusion models can be successfully trained and applied for sampling the Wilson gauge action. For large values of inverse coupling, accurate reverse-time integration requires predictor-corrector schemes, for which we introduce a corrector based on Hamiltonian molecular dynamics. While the corrector significantly improves sampling quality, it also increases the computational cost. We outline several strategies for improving sampling efficiency.

[LG-163] Time-Inhomogeneous Preconditioned Langevin Dynamics

链接: https://arxiv.org/abs/2605.06091
作者: Alexander Falk,Laurenz Nagler,Andreas Habring,Thomas Pock
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Langevin sampling from distributions of the form p(x) \propto \exp(-\Psi(x)) faces two major challenges: (global) mode coverage and (local) mode exploration. The first challenge is particularly relevant for multi-modal distributions with disjoint modes, whereas the second arises when the potential \Psi exhibits diverse and ill-conditioned local mode geometry. To address these challenges, a common approach is to precondition Langevin dynamics with problem-specific information, such as the sample covariance or the local curvature of \Psi . However, existing preconditioner choices inherently involve a trade-off between global mode coverage and local mode exploration, and no prior method resolves both simultaneously. To overcome this limitation, we propose TIPreL, which introduces a time- and position-dependent preconditioner. This design effectively addresses both challenges mentioned above within a single framework. We establish convergence of the resulting dynamics in the Wasserstein-2 distance both in continuous time and for a tamed Euler discretization. In particular, our analysis extends the existing state of the art by proving convergence under time- and space-dependent diffusion coefficients, and only locally Lipschitz drifts, which has not been covered by prior work. Finally, we experimentally compare TIPreL with competing preconditioning schemes on a two-dimensional, severely ill-posed example and on a Bayesian logistic regression task in higher dimensions, confirming the efficiency of the proposed method.
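A minimal numeric sketch of one preconditioned unadjusted Langevin step on an ill-conditioned Gaussian target (the constant diagonal preconditioner below is the classical fixed-preconditioner baseline, not TIPreL's time- and position-dependent scheme; the target and step size are invented):

```python
import numpy as np

def pula_step(x, grad_psi, P, eps, rng):
    """Euler step of preconditioned Langevin: x' = x - eps*P*grad + sqrt(2*eps*P)*xi,
    with a diagonal preconditioner P applied elementwise."""
    return x - eps * P * grad_psi(x) + np.sqrt(2.0 * eps * P) * rng.normal(size=x.shape)

# Target: Psi(x) = x1^2/200 + x2^2/2, i.e. coordinate variances (100, 1).
scales = np.array([100.0, 1.0])
grad_psi = lambda x: x / scales
P = scales                      # preconditioner ~ inverse curvature per coordinate
rng = np.random.default_rng(0)
x = np.zeros(2)
samples = np.empty((20000, 2))
for i in range(len(samples)):
    x = pula_step(x, grad_psi, P, eps=0.1, rng=rng)
    samples[i] = x
stds = samples[5000:].std(axis=0)   # approaches (10, 1) up to discretization bias
```

With P = 1 the same step size would mix the stiff coordinate roughly 100x more slowly; a single fixed P, however, cannot also adapt to disjoint modes, which is the trade-off TIPreL's time- and position-dependent preconditioner targets.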

[LG-164] Correcting heterogeneous diagnostic bias when developing clinical prediction models using causal hidden Markov models

链接: https://arxiv.org/abs/2605.06059
作者: Jose Benitez-Aurioles,Ricardo Silva,Brian McMillan,Matthew Sperrin
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 4 figures, 2 tables, 4 supplementaries

点击查看摘要

Abstract:In routine care, individuals identified a priori as high-risk are usually tested for conditions more frequently. Protected attributes, such as sex or ethnicity, may also determine testing frequency. Such heterogeneous detection rates across a population induce label error. This causes systematic model error for specific groups and biases performance metrics during validation. This paper proposes a method to correct for such bias in prediction models due to differential diagnostic delay. We use a causal inference framework to define our target estimand: an individual’s diagnosis probability in a counterfactual scenario where their diagnosis rate matches that of a reference group. We model the longitudinal process as a hidden Markov model, in which confirmatory test results are emissions from a latent progressive disease stage. We validate our approach in simulated data and apply it to a case study of chronic kidney disease prediction using electronic health records. In simulations, our method reduces prediction bias and improves calibration-in-the-large, correcting the Observed:Expected ratio in the underdiagnosed group from 1.34 (standard deviation: 0.09) in a model developed without any correction for underdiagnosis bias to 1.02 (0.09). Violations of assumptions in the simulation affected the estimation of model parameters, but the proposed approach nonetheless remained better calibrated than the standard model. In the clinical case study, we identify diabetes as the main driver of observability, with an odds ratio of 10.36 (95% confidence interval, 9.80 - 11.02) in the 6-month urine albumin-creatinine ratio testing rate. Using our approach to predict the counterfactual diagnostic rate in patients without diabetes, we improved the Observed:Expected ratio of a developed clinical prediction model from 1.55 (1.51 - 1.59) to 1.01 (0.98 - 1.04).
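The latent progressive-disease idea can be illustrated with a two-state hidden Markov model; the forward (filtering) recursion below is generic textbook machinery with made-up transition and emission probabilities, not the paper's causal model.

```python
import numpy as np

# Toy 2-state progressive-disease HMM (healthy → diseased, absorbing):
# confirmatory test results are emissions from the latent stage.
# All probabilities here are illustrative only.
T = np.array([[0.95, 0.05],   # healthy stays healthy w.p. 0.95
              [0.00, 1.00]])  # disease is absorbing (progressive)
E = np.array([[0.9, 0.1],     # P(test result | healthy): mostly negative
              [0.2, 0.8]])    # P(test result | diseased): mostly positive
pi = np.array([0.99, 0.01])

def forward(obs):
    """Return the filtered probabilities P(state_t | obs_1..t) for each t."""
    alpha = pi * E[:, obs[0]]
    alpha /= alpha.sum()
    out = [alpha]
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]
        alpha /= alpha.sum()
        out.append(alpha)
    return np.array(out)

filt = forward([0, 0, 1, 1])  # two negative tests, then two positive
print(filt[-1])  # posterior disease probability rises after the positives
```

The paper's method additionally makes the testing (observation) times themselves depend on covariates like diabetes status; the sketch only shows the latent-stage filtering ingredient.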

[LG-165] Gaussian mixture models in Hilbert spaces via kernel methods

链接: https://arxiv.org/abs/2605.05996
作者: Daniel López-Montero,Antonio Álvarez-López,Marcos Matabuena
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 38 pages, 13 figures

点击查看摘要

Abstract:Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including L^2 -functional data and random graphs in Laplacian spaces arising in modern medical applications.
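The kernel mean embedding at the heart of the framework can be sketched in a few lines: each group of samples is represented by the mean of its RBF kernel features, and groups are compared via the squared MMD. This is a generic illustration with our own toy Gaussians, not the paper's mixture algorithm.

```python
import numpy as np

# Squared MMD between kernel mean embeddings (V-statistic form):
# ||μ_X − μ_Y||² in the RKHS of an RBF kernel.
def rbf(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    return rbf(X, X, gamma).mean() - 2 * rbf(X, Y, gamma).mean() + rbf(Y, Y, gamma).mean()

rng = np.random.default_rng(1)
A = rng.normal(0.0, 1.0, size=(200, 2))
B = rng.normal(3.0, 1.0, size=(200, 2))   # well-separated cluster
C = rng.normal(0.0, 1.0, size=(200, 2))   # same distribution as A
print(mmd2(A, B), mmd2(A, C))  # embedding distance separates B from A, not C
```

Clustering in embedding space then amounts to fitting mixture components to these RKHS representations, which is what the paper develops rigorously for Hilbert-space-valued data.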

[LG-166] TabCF: Distributional Control Function Estimation with Tabular Foundation Models

链接: https://arxiv.org/abs/2605.05993
作者: Geping Chen,Chunlin Li,Tianzhong Yang,Zhengyuan Zhu,Jing Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME); Other Statistics (stat.OT)
*备注:

点击查看摘要

Abstract:Instrumental variable (IV) and control function (CF) methods are powerful tools for causal effect estimation in the presence of unmeasured confounding, yet most existing approaches target only mean effects and/or demand substantial fitting and tuning effort. In this paper, we introduce a simple method, TabCF, for control function regression using tabular foundation models, which enables accurate, fast, identification-transparent, and tuning-light causal estimation of distributional quantities, such as interventional means and quantiles; we also propose a copula-based approximation for multivariate outcomes. TabCF performs favorably against representative methods across a broad range of small- to medium-sized synthetic and real data scenarios. The central message is two-fold: for practitioners, it highlights that TabCF is an effective tool for distributional causal inference; for researchers, it suggests that the proposed approach could be considered a strong baseline for future method development. Code is available at this https URL.

[LG-167] Architecture Shape Governs QNN Trainability: Jacobian Null Space Growth and Parameter Efficiency

链接: https://arxiv.org/abs/2605.05942
作者: Michael Poppel,David Bucher,Maximilian Zorn,Markus Baumann,Sebastian Wölckert,Claudia Linnhoff-Popien,Philipp Altmann,Jonas Stein
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Variational quantum circuits with angle encoding implement truncated Fourier series, and architectures arranging N qubits with L encoding layers each – sharing encoding budget E = NL – generate identical frequency spectra, identical frequency redundancy, and require the same minimum parameter count for coefficient control. Despite this equivalence, trainability varies substantially with architecture shape (N,L) at fixed E . We identify structural rank deficiency of the coefficient matching Jacobian J as the mechanism responsible. For serial single-qubit architectures, we prove \mathrm{rank}(J) \leq 2L+1 regardless of parameter count P , with \dim(\ker J) \geq P-(2L+1) growing without bound – a phenomenon we term \emph{structural gradient starvation}: a growing fraction of parameters become structurally decoupled from the loss as P increases at fixed L . Parallel architectures avoid this via independent phase trajectories, ensuring \sigma_\min(J^{(\mathrm{par})}) > 0 generically for P \leq 2E+1 , so no parameter lies in \ker J . For practitioners, we further show that the two natural routes to increasing parameter count have fundamentally different effects: adding feature map (FM) layers monotonically strengthens the Jacobian QFIM eigenvalue spectrum and achieves R^2 \geq 0.95 with 1.6 – 2.2\times fewer parameters than adding trainable blocks across all tested architectures, while trainable blocks improve training only through the classical interpolation mechanism with no quantum-specific benefit.

[LG-168] Ratio-based Loss Functions

链接: https://arxiv.org/abs/2605.05808
作者: Lena Helgerth,Andreas Christmann
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Algorithms in machine learning and AI critically depend on at least three key components: (i) the risk function, which is the expectation of the loss function, (ii) the function space, which is often called the hypothesis space, and (iii) the set of probability measures, which are allowed for the specified algorithm. This paper gives a survey of a certain class of loss functions, which we call ratio-based. In supervised learning, margin-based loss functions for classification tasks depending on the product of the output values y_i and the predictions f(x_i) as well as distance-based loss functions depending on the difference of y_i and f(x_i) for regression are common. Distance-based loss functions are particularly useful if an additive model assumption seems plausible, i.e., the common signal-plus-noise assumption. However, in the literature, several loss functions proposed for regression purposes have a multiplicative error structure in mind and pay attention to relative errors, i.e. to the ratio of y_i and f(x_i) . In this survey article, we systematically investigate such ratio-based loss functions and propose a few new losses, which may be interesting for future research. We concentrate on investigating general properties of ratio-based loss functions like continuity, Lipschitz-continuity, convexity, and differentiability, because these properties play a central role in most machine learning algorithms. Therefore, we do not focus on some specific machine learning algorithm to derive universal consistency, learning rates, or stability results. Instead, we want to enable future research in this direction.
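As a concrete member of the ratio-based class (our own choice for illustration; the survey covers many such losses), the squared log-ratio loss depends only on y/f and so penalizes relative rather than absolute error:

```python
import numpy as np

# Squared log-ratio loss L(y, f) = (log(y/f))², a simple ratio-based loss:
# it is invariant to rescaling (y, f) → (cy, cf), unlike the squared distance.
def sq_log_ratio(y, f):
    return np.log(y / f) ** 2

def sq_distance(y, f):
    return (y - f) ** 2

# Two predictions with the same 10% relative error but different scales:
print(sq_log_ratio(100.0, 110.0), sq_log_ratio(1.0, 1.1))  # equal
print(sq_distance(100.0, 110.0), sq_distance(1.0, 1.1))    # very different
```

The survey's program is to check exactly such losses for continuity, Lipschitz-continuity, convexity, and differentiability in f, the properties that drive consistency and stability arguments.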

[LG-169] Optimal Confidence Band for Kernel Gradient Flow Estimator

链接: https://arxiv.org/abs/2605.05768
作者: Yuqian Cheng,Zhuo Chen,Qian Lin
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we investigate the supremum-norm generalization error and the uniform inference for a specific class of kernel regression methods, namely the kernel gradient flows. Under the widely adopted capacity-source condition framework in the kernel regression literature, we first establish convergence rates for the supremum norm generalization error of both continuous and discrete kernel gradient flows under the source condition s > \alpha_0 , where \alpha_0\in(0,1) denotes the embedding index of the kernel function. Moreover, we show that these rates match the minimax optimal rates. Building on this result, we then construct simultaneous confidence bands for both continuous and discrete kernel gradient flows. Notably, the widths of the proposed confidence bands are also optimal, in the sense that their shrinkage rates are greater than, but can be made arbitrarily close to, the minimax optimal rates.

[LG-170] Polarizable atomic multipoles for learning long-range electrostatics

链接: https://arxiv.org/abs/2605.05746
作者: Dongjin Kim,Daniel S. King,Yoonjae Park,Roya Savoj,Sebastien Hamel,Xiaoyu Wang,Bingqing Cheng
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Long-range electrostatics and polarization remain central obstacles to extending machine learning interatomic potentials (MLIPs) to ionic, polar, and interfacial systems. Here, we introduce a semi-local framework for learning electrostatics from energies and forces using polarizable atomic multipoles. Local equivariant descriptors predict environment-dependent latent monopoles, dipoles, and quadrupoles, while residual non-local charge transfer and polarization are captured by non-self-consistent linear response in induced charges and dipoles. Across four diverse benchmarks and four short-range MLIP architectures, the multipole hierarchy and response terms systematically improve potential energy surface accuracy, with the largest gains in systems where long-range effects are essential. More importantly, the learned latent variables recover physically meaningful electrical responses: accurate Born effective charge tensors, emergent polarizabilities, infrared spectra in close agreement with experiments, and semi-quantitative Raman spectra for bulk water and hybrid MAPbI_3 perovskite. This systematically improvable, physically transparent framework enables MLIPs trained on standard energy and force labels to predict polarization-sensitive observables.

[LG-171] Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

链接: https://arxiv.org/abs/2605.05683
作者: Andy Zeyi Liu,Elliot Paquette,John Sous
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training loss and throughput can hide distinct internal representations in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual view reveals three empirical findings and one mechanistic explanation. First, batch size acts as a latent determinant of representation geometry: runs that reach equal loss settle into systematically distinct activation spectra. Second, the activation covariance tail measured early in training reliably forecasts downstream token efficiency. Third, movement of the activation spectrum head (leading modes), together with gradient spectra, characterizes underlying learning-dynamics changes, separating learning-side architectural improvements from primarily execution-side gains. These predictive and diagnostic signals persist across the 12-, 36-, and 48-layer model tiers. Finally, a mechanistic model proves the main observations and explains how activation covariance spectra correlate with task-aligned feature learning.
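The activation-covariance measurement itself is straightforward; the sketch below computes the eigenvalue spectrum and the head's share of variance on synthetic data (a real run would use model activations; the decaying coordinate scales here are a stand-in for them).

```python
import numpy as np

# Activation-covariance spectral diagnostic: form the empirical covariance of
# collected activations and inspect its eigenvalues (head = leading modes,
# tail = trailing modes). Synthetic activations with decaying scales ~ 1/k.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1024, 64)) @ np.diag(1.0 / np.arange(1, 65))
cov = np.cov(acts, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]        # sorted, largest first
head_mass = eigvals[:8].sum() / eigvals.sum()  # variance share of leading modes
print(round(head_mass, 3))  # heavy head: most variance in a few directions
```

Tracking `head_mass` and the tail decay over training steps is the kind of signal the paper uses to separate batch-size and architecture effects that the loss curve alone hides.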

[LG-172] Quantum Kernels for Parity-Structured Classification: A Hybrid Pipeline

链接: https://arxiv.org/abs/2605.05625
作者: Tushar Pandey
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parity (XOR) classification requires detecting discrete, high-order feature interactions that smooth classical kernels cannot efficiently capture. We study how quantum kernel advantage depends on parity complexity, the number of features entering the XOR rule, and find a clear threshold behavior. We pair a ZZ quantum feature map with binary {0, pi} encoding (features median-thresholded before circuit input) to expose parity structure. A binary encoding ablation, RBF SVM trained on the identical {0, pi} features, separates encoding from circuit effects: at low complexity (n = 5 features), binary RBF achieves 83.4% +/- 1.7% and the quantum kernel 81.2% +/- 1.9%, showing encoding drives performance there. At high complexity (n = 11 features, 11 qubits, r = 3 ZZ repetitions), all classical methods collapse to near-random (approx. 50%), binary RBF reaches only 54.3% +/- 1.1%, and the quantum ZZ kernel achieves 66.3% +/- 3.2% (mean +/- std, 10 seeds), a +12.0 percentage-point margin over the binary ablation and approx. 7x higher kernel-target alignment (0.094 +/- 0.020 vs. 0.013 +/- 0.001). These results identify parity complexity as a concrete axis along which genuine quantum kernel advantage, not attributable to encoding alone, emerges.
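The data regime can be reproduced in a few lines (our own toy construction, no quantum circuit): median-threshold continuous features to bits and label by their XOR. A linear readout on the raw features is near chance, while the full n-way interaction recovers the label exactly, which is why smooth kernels struggle as n grows.

```python
import numpy as np

# Parity (XOR) labels from median-thresholded features, as in the paper's
# preprocessing step; n = 5 features, 400 samples (illustrative sizes).
rng = np.random.default_rng(0)
n, m = 5, 400
X = rng.normal(size=(m, n))
B = (X > np.median(X, axis=0)).astype(int)   # median-threshold to bits
y = B.sum(axis=1) % 2                        # parity (XOR) label

# A purely linear readout on the raw features cannot capture parity:
w, *_ = np.linalg.lstsq(np.c_[X, np.ones(m)], y, rcond=None)
pred = (np.c_[X, np.ones(m)] @ w > 0.5).astype(int)
acc_linear = (pred == y).mean()
print(acc_linear)  # near chance

# The full n-way interaction (product of ±1 bits) recovers parity exactly:
acc_parity = ((1 - np.prod(1 - 2 * B, axis=1)) // 2 == y).mean()
print(acc_parity)
```

The paper's question is whether a ZZ quantum feature map can learn this n-way interaction where classical smooth kernels on the same bits cannot.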

[LG-173] Variational Smoothing and Inference for SDEs from Sparse Data with Dynamic Neural Flows

链接: https://arxiv.org/abs/2605.05606
作者: Yu Wang,Arnab Ganguly
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: Yu Wang and Arnab Ganguly contributed equally to this work. Corresponding to Arnab Ganguly

点击查看摘要

Abstract:Stochastic differential equations (SDEs) provide a flexible framework for modeling temporal dynamics in partially observed systems. A central task is to calibrate such models from data, which requires inferring latent trajectories and parameters from sparse, noisy observations. Classical smoothing methods for this problem are often limited by path degeneracy and poor scalability. In this work, we develop a novel method based on a characterization of the posterior SDE in terms of a conditional backward-in-time score, defined as the gradient of a function solving a Kolmogorov backward equation with multiplicative updates at observation times. We learn this conditional score using neural networks trained to satisfy both the governing PDE and the observation-induced jump conditions, thereby integrating continuous-time dynamics with discrete Bayesian updates. The resulting score induces a posterior SDE with the same diffusion coefficient but a modified drift, enabling efficient posterior trajectory sampling. We further derive a likelihood-based objective for learning the SDE parameters, yielding an evidence lower bound (ELBO) for joint state smoothing and parameter estimation. This leads to a variational EM-style procedure, where the neural conditional score is optimized to approximate the smoothing distribution, followed by a maximization step over the SDE parameters using samples from the induced posterior. Experiments on nonlinear systems demonstrate accurate and stable inference from very few observations, with significantly improved scalability compared to classical MCMC methods.

[LG-174] In-Context Positive-Unlabeled Learning

链接: https://arxiv.org/abs/2605.05591
作者: Siyan Liu,Yi Chang,Manli Cheng,Qinglong Tian,Pengfei Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 12 pages, 1 figure, 3 tables

点击查看摘要

Abstract:Positive-unlabeled (PU) learning addresses binary classification when only a set of labeled positives is available alongside a pool of unlabeled samples drawn from a mixture of positives and negatives. Existing PU methods typically require dataset-specific training or iterative optimization, which limits their applicability when many tasks must be solved quickly or with little tuning. We introduce PUICL, a pretrained transformer that solves PU classification entirely through in-context learning. PUICL is pretrained on synthetic PU datasets generated from randomly instantiated structural causal models, exposing it to a wide range of feature-label relationships and class-prior configurations. At inference time, PUICL receives the labeled positives and the unlabeled samples as a single input and returns class probabilities for the unlabeled rows in one forward pass, with no gradient updates or per-task fitting. On 20 semi-synthetic PU benchmarks derived from the UCI Machine Learning Repository, OpenML, and scikit-learn, PUICL outperforms four standard PU learning baselines in average AUC and accuracy, and is competitive on F1-score. These results show that the in-context learning paradigm extends naturally beyond fully supervised tabular prediction to the semi-supervised PU setting.
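The PU input format the abstract describes can be constructed from any labeled binary problem (an illustrative sketch, not PUICL's structural-causal-model pretraining data): expose a fraction of the positives as labeled and pool everything else as unlabeled.

```python
import numpy as np

# Build a positive-unlabeled (PU) dataset from a fully labeled binary problem.
# True labels are hidden; only some positives are revealed as labeled.
rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)                   # true (hidden) labels
x = rng.normal(loc=2.0 * y, scale=1.0, size=n)   # 1-D feature, shifted by class
labeled_pos = (y == 1) & (rng.random(n) < 0.3)   # 30% of positives get labels

positives = x[labeled_pos]     # labeled-positive set (one model input)
unlabeled = x[~labeled_pos]    # mixture of positives and negatives (the other)
# Class prior among the unlabeled pool — known only in this synthetic setting:
prior = y[~labeled_pos].mean()
print(len(positives), len(unlabeled), round(prior, 2))
```

PUICL receives `positives` and `unlabeled` together as one context and scores the unlabeled rows in a single forward pass; the sketch only shows the data-generating regime it is pretrained to handle.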

[LG-175] Stability of the Monge Map in Semi-Dual Optimal Transport

链接: https://arxiv.org/abs/2605.05569
作者: Anton Selitskiy,David Millard
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper shows that the semi-dual formulation of the optimal transport problem has a degenerate saddle-point structure, and that its numerical solution is equivalent to solving a constrained optimization problem. We derive necessary and sufficient conditions for the convergence of Monge maps without requiring optimality of the dual potential. This analysis helps explain why, in practice, numerical algorithms often require more iterations to update the transport map than the potential.

[LG-176] Relaxed Sparsest-Permutation Formulation for Causal Discovery at Scale

链接: https://arxiv.org/abs/2605.05568
作者: Sunmin Oh,Sang-Yun Oh,Gunwoong Park
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the growing availability of large datasets, causal structure learning remains computationally prohibitive at scale. We revisit sparsest-permutation learning for linear structural equation models and show that exact Cholesky factorization is unnecessary for structure recovery. This observation motivates a support-level relaxation that searches for sparse triangular factors over a precision-support screening graph. The relaxed formulation can be efficiently evaluated via masked zero-fill incomplete Cholesky factorization, enabling scalable comparison of candidate orderings. At the population level, we establish soundness for Markov equivalence class (MEC) recovery under no-cancellation and sparsest Markov representation assumptions, as well as robustness to ordering misspecification. Motivated by these guarantees, we introduce SCOPE, a sparse-Cholesky pipeline that provides a scalable implementation of the relaxed formulation. Experiments on synthetic and real datasets demonstrate that SCOPE matches the MEC recovery accuracy of substantially slower baselines, while achieving significantly reduced runtime and scaling to 10k variables.
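The sparsity signal behind sparsest-permutation learning can be seen on a three-node chain X1 → X2 → X3 (a toy example with our own coefficients, not SCOPE itself): the Cholesky factor of the permuted precision matrix has no fill-in under the causal ordering, while some wrong orderings induce extra nonzeros.

```python
import numpy as np

# Linear SEM X = BX + ε with unit noise; precision Ω = (I − B)ᵀ(I − B).
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],    # X2 ← X1
              [0.0, 0.8, 0.0]])   # X3 ← X2
I = np.eye(3)
Omega = (I - B).T @ (I - B)       # tridiagonal for this chain

def chol_nonzeros(P, order):
    """Nonzero count of the Cholesky factor under a variable ordering."""
    L = np.linalg.cholesky(P[np.ix_(order, order)])
    return int((np.abs(L) > 1e-9).sum())

print(chol_nonzeros(Omega, [0, 1, 2]))  # causal order: sparse factor (no fill-in)
print(chol_nonzeros(Omega, [1, 0, 2]))  # a wrong order: denser factor (fill-in)
```

Scoring orderings by such factor sparsity is the exact objective; SCOPE's relaxation replaces the exact factorization with a masked incomplete Cholesky over a precision-support screening graph to make the search scale.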

[LG-177] Permutation-preserving Functions and Neural Vecchia Covariance Kernels

链接: https://arxiv.org/abs/2605.05523
作者: Jian Cao,Nian Liu,Ying Lin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We introduce a novel framework for constructing scalable and flexible covariance kernels for Gaussian processes (GPs) by directly learning the covariance structure under a regression-type parameterization induced by Vecchia approximations, using deep neural architectures. Specifically, we model kriging coefficients and conditional standard deviations, deterministic quantities that uniquely characterize the covariance, providing stable and informative learning targets. Exploiting the permutation-equivariant structure of conditioning sets in the Vecchia factorization, we derive a universal representation for permutation-preserving functions and design neural architectures that respect this symmetry, leading to improved training stability and data efficiency. The proposed approach enables expressive, non-stationary kernel learning while maintaining computational scalability, thereby bridging classical GP methodology with modern deep learning.

[LG-178] A renormalization-group inspired lattice-based framework for piecewise generalized linear models

链接: https://arxiv.org/abs/2605.05493
作者: Joshua C. Chang
类目: Methodology (stat.ME); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: Under review

点击查看摘要

Abstract:We formally introduce a class of models inspired by renormalization group (RG) theory, built on additive hierarchical expansions analogous to those appearing in functional ANOVA and mixed-effects models. Like ReLU convolutional neural networks, they are almost everywhere locally linear; unlike ReLU networks, their partition structure is explicit, interpretable, and easy to modify or constrain. In these models, one defines a multidimensional lattice partition of the input space and uses it to scaffold variations in regression parameters. Each dimension of the lattice corresponds to an attribute by which the statistics of the problem may vary. The parameters are themselves expressed in the form of an expansion, where each term captures variations relative to a lower (coarser) interaction scale. These models admit multiple equivalent interpretations: as piecewise GLMs, as hierarchical mixed-effects regressions, or as regression trees with structured parameter sharing. Since RG motivates the design of these models, we use techniques from statistical physics – specifically replica analysis – to study their generalization properties. Specifically, we analyze the behavior of the Watanabe-Akaike Information Criterion (WAIC) as a proxy for generalization loss. This analysis yields two practical results: (i) guidance on the lattice design as a function of dataset size and predictor dimensionality; and (ii) a principled scaling law for the regularization prior when adding higher-order terms to the expansion so that one can increase model complexity without an expected increase in generalization loss. We evaluate the methodology on public datasets and find performance competitive against both blackbox methods and other intrinsically interpretable approaches.
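A one-dimensional, piecewise-constant instance of the additive lattice expansion (a deliberately minimal sketch of the coarse-plus-fine idea, not the paper's full GLM machinery):

```python
import numpy as np

# Hierarchical lattice expansion, smallest case: a 1-D input binned into
# lattice cells, prediction = global (coarse) term + per-cell (fine) offsets.
rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size=2000)
y = np.floor(x) + 0.1 * rng.normal(size=2000)   # true signal: a step function

cells = np.floor(x).astype(int)                 # lattice partition (4 cells)
coarse = y.mean()                               # level-0 (global) term
fine = np.array([y[cells == c].mean() - coarse  # level-1 terms: deviations
                 for c in range(4)])            # relative to the coarser scale
pred = coarse + fine[cells]
print(np.abs(pred - y).mean())  # small residual: the hierarchy captures the steps
```

In the paper each lattice dimension carries such an expansion over regression (not just intercept) parameters, and higher-order interaction terms are added with RG-motivated prior scaling instead of the plain cell means used here.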

[LG-179] Convexity in Disguise: A Theoretical Framework for Nonconvex Low-Rank Matrix Estimation

链接: https://arxiv.org/abs/2605.05446
作者: Chengyu Cui,Gongjun Xu
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Nonconvex methods have emerged as a dominant approach for low-rank matrix estimation, a problem that arises widely in machine learning and AI for learning and representing high-dimensional data. Existing analyses for these methods often require additional regularization to mitigate nonconvexity, even though such regularization is often unnecessary in practice. Moreover, most analyses rely on problem-specific arguments that are difficult to generalize to more complex settings. In this paper, we develop a theoretical framework for studying nonconvex procedures across a broad class of low-rank matrix estimation problems. Rather than focusing on a specific model, we reveal a fundamental mechanism that explains why nonconvex procedures can behave well in low-rank estimation. Our key device is a \emph{benign regularizer} that does not alter the original update rule, but yields an equivalent locally strongly convex formulation of the algorithm. This perspective uncovers a disguised convexity inherent in the nonconvex procedure and provides a new route to theoretical guarantees for nonconvex low-rank matrix estimation.

[LG-180] Estimating Implicit Regularization in Deep Learning

链接: https://arxiv.org/abs/2605.05436
作者: Joseph H. Rudoler,Kevin Tan,Giles Hooker,Konrad P. Kording
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning systems are known to exhibit implicit regularization (alt. implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization – connecting it to an equivalent penalty that augments the learning objective. However, modern deep learning systems are complex, carrying modifications to the training procedure and architecture (e.g. early stopping, minibatching, dropout) whose effects are not always directly interpretable. Although estimating the resulting implicit regularization could aid theorists in algorithm design and practitioners in interpreting their hyperparameter choices, this problem has received little direct attention. It is also tractable: regularization makes weight updates deviate from loss gradients, promising a signal for identifying implicit bias. Here we provide gradient matching methods that can be used to empirically estimate the implicit regularization. Our method works on networks with known regularization, recovering popular explicit penalties like \ell_1 and \ell_2 . It also replicates known implicit effects, like the quadratic weight penalty induced by early stopping in gradient descent, demonstrating that it can be used to test theories of implicit regularization. Crucially, because our method is empirical, it can handle implicit regularization in arbitrary networks. We demonstrate this use by characterizing the effects of dropout in deep networks, showing implicit \ell_2 effects in this popular method. Our work shows that practitioners can use gradient matching to understand regularization in networks with implicit biases that are too complicated to derive analytically.
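The gradient-matching idea can be demonstrated on a linear model with a hidden weight-decay term (a toy setting of our own, not the paper's estimator): the residual between the observed update and the plain loss gradient equals λw exactly, so a regression on the weights recovers λ.

```python
import numpy as np

# Recover a hidden L2 coefficient by matching observed updates to loss gradients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)

lr, lam = 0.01, 0.3   # lam is the "hidden" regularizer we try to recover
w = np.zeros(5)
coefs = []
for _ in range(100):
    g_loss = X.T @ (X @ w - y) / len(y)     # gradient of the plain loss
    update = -lr * (g_loss + lam * w)       # what the optimizer actually did
    residual = -update / lr - g_loss        # deviation from the pure loss gradient
    if np.linalg.norm(w) > 1e-8:
        # residual = lam * w, so a least-squares fit on w recovers lam:
        coefs.append(residual @ w / (w @ w))
    w = w + update
print(np.mean(coefs))  # ≈ lam = 0.3
```

With implicit (rather than explicit) regularization the residual no longer has a known closed form, and the paper fits a parameterized penalty family to it instead of the single coefficient used here.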

[LG-181] Direct Estimation of Schrödinger Bridge Time-Series Drifts: Finite-Sample Asymptotic and Adaptive Guarantees

链接: https://arxiv.org/abs/2605.05432
作者: Othmane Mazhar,Huyên Pham
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 36 pages, 3 figures, 8 tables

点击查看摘要

Abstract:We study nonparametric estimation of Schrödinger bridge (SB) drifts from i.i.d. data observed on a single time interval. Starting from the conditional-ratio form of the Schrödinger bridge time-series (SBTS) drift formula, we analyze a direct Nadaraya–Watson plug-in estimator built from kernelized numerator and denominator terms. Unlike recent SB analyses based on entropic-OT potentials, Sinkhorn iterations, or iterative bridge solvers, our approach works directly at the drift level and isolates \emph{statistical error} from optimization, approximation, and discretization error. Under Hölder regularity, a marginal-density floor, and bounded support, we prove a uniform non-asymptotic bound for admissible bandwidth pairs, a pointwise CLT under genuine undersmoothing, and an adaptive bandwidth selector satisfying an oracle inequality. We also prove a pivot-local minimax lower bound which, through an explicit uniform pivot, yields a global minimax lower bound under transparent compatibility conditions; hence the adaptive selector is minimax-rate optimal up to logarithmic factors. Synthetic experiments provide theorem-targeted diagnostics for finite-sample scaling, Gaussian approximation, and adaptive behavior.
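The Nadaraya–Watson building block named in the abstract is plain kernel regression; a generic sketch follows (not the paper's drift estimator, and with our own toy target sin(x)):

```python
import numpy as np

# Nadaraya–Watson estimator: a kernel-weighted local average of Y at a query
# point, the plug-in ingredient of the SBTS drift estimator analyzed above.
def nw_estimate(x0, X, Y, h):
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel weights
    return (w * Y).sum() / w.sum()

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=2000)
Y = np.sin(X) + 0.1 * rng.normal(size=2000)
est = nw_estimate(1.0, X, Y, h=0.2)
print(est)  # close to sin(1.0)
```

The paper's analysis concerns exactly the bandwidth `h` in such ratios: undersmoothing for valid confidence bands, and a data-driven selector that attains the minimax rate up to log factors.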

[LG-182] Meta-learning for sample-efficient Bayesian optimisation of fed-batch processes

链接: https://arxiv.org/abs/2605.05382
作者: Becky Langdon,Gabriel D. Patrón,Chrysoula D. Kappatou,Robert M. Lee,Behrang Shafei,Jixiang Qing,Ruth Misener,Mark van der Wilk,Calvin Tsay
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 24 pages, 12 figures

点击查看摘要

Abstract:The optimisation of fed-batch (bio)chemical process recipes is subject to inherent, underlying, and unmeasurable fluctuations across batches, whose trajectories are difficult to model and costly to measure. Bayesian Optimisation (BayesOpt) is a powerful tool for sampling and optimisation of expensive-to-measure functions. Gaussian Processes (GPs), the surrogate models used in BayesOpt, are static, forecast poorly, and lack generalisation across experiments, limiting their applicability to time-varying batch processes with stochastic parameters, i.e., process fluctuations. This work investigates System-Aware Neural ODE Processes (SANODEP) as a meta-learning model to overcome the limitations of GPs and increase few-shot optimisation performance in BayesOpt. Using a penicillin batch production case study, we find that SANODEP outperforms GP-based BayesOpt in the low-data regime, resulting in improved objectives when few experimental runs are performed. These improvements are observed in both on- and off-distribution batches, highlighting the generalisation capabilities of SANODEP. Using this approach, batch process operators can accelerate the initial optimisation steps in BayesOpt by deploying meta-learning or optimise the process with fewer experiments when the experimental cost is high.

[LG-183] Forecasting Oncology Demand Trends with Boosting-Based Bayesian Conjugate Models

链接: https://arxiv.org/abs/2605.05270
作者: Ademir Batista dos Santos Neto,Tiago Alessandro Espinola Ferreira,Paulo Renato Alves Firmino
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 18 pages, 3 figures

点击查看摘要

Abstract:Accurate trend forecasting in healthcare time series is essential for planning and resource allocation. This paper proposes a Bayesian framework for predicting oncology demand trends, modeling weekly appointments as a Poisson process with a Gamma prior on the demand rate. To enhance adaptability and capture persistent directional patterns, we incorporate a residual-based boosting mechanism grounded in a Gamma-Log-Normal conjugate structure. This boosting approach allows the model to track both short- and long-term trend shifts while maintaining the analytical tractability of conjugate Bayesian updating. The methodology was evaluated on real oncology service data from Cariri, Ceara, Brazil, and compared against established baselines, including linear regression, ARIMA, naive forecasting, LSTM neural networks, and XGBoost. Results showed that the proposed model outperforms competing methods in trend detection accuracy, with gains of up to 38.25% in the percentage of correctly predicted trend direction relative to the second-best approach.
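The Poisson–Gamma conjugate update underlying the model (before any boosting) is a two-line computation; the counts and hyperparameters below are illustrative, not the Cariri data.

```python
import numpy as np

# Poisson–Gamma conjugacy: weekly counts y_t ~ Poisson(λ) with λ ~ Gamma(a, b)
# (shape–rate). Posterior after n weeks: Gamma(a + Σy, b + n).
a0, b0 = 2.0, 1.0                      # prior: mean a0/b0 appointments/week
y = np.array([38, 41, 35, 44, 40])    # five weeks of observed counts
a_post = a0 + y.sum()
b_post = b0 + len(y)
post_mean = a_post / b_post            # posterior mean demand rate
print(post_mean)                       # pulled from the prior toward the data
```

The paper's residual-based boosting then layers a Gamma-Log-Normal correction on top of this closed-form update so the rate can drift with the trend rather than stay constant.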

附件下载

点击下载今日全部论文列表