This post lists the latest papers fetched from Arxiv.org on 2026-05-11, updated automatically and organized into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from Arxiv.org daily, with an automatic update at around 12:30 each morning.
Tip: if a day's update is missing, either Arxiv published no new papers that day or the update script failed; whenever possible, failures are fixed the same day.
Table of Contents
Overview (2026-05-11)
932 papers were updated today, including:
- Computation and Language (cs.CL): 138
- Artificial Intelligence (cs.AI): 304
- Computer Vision and Pattern Recognition (cs.CV): 186
- Machine Learning (cs.LG): 372
- Multiagent Systems (cs.MA): 27
- Information Retrieval (cs.IR): 15
- Human-Computer Interaction (cs.HC): 28
Multiagent Systems
[MA-0] The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents
TL;DR: This paper asks whether expanding the context window of large language models (LLMs) improves cooperation in multi-agent social dilemmas. Although longer memory is usually treated as a capability upgrade, across 7 LLMs and 4 games the authors find that expanding accessible history significantly degrades cooperation in 18 of 28 settings, a phenomenon they call the "memory curse." The key insight is that the root cause is not memory length itself but the reasoning patterns the memory content elicits: with long memories, models abandon forward-looking intent in favor of short-sighted, non-cooperative decisions. Three mechanisms support this: (1) a LoRA adapter fine-tuned exclusively on forward-looking reasoning traces mitigates the decay and transfers zero-shot; (2) replacing real history with synthetic cooperative records restores cooperation, showing the trigger is memory content rather than length; and (3) ablating explicit Chain-of-Thought reasoning often reduces the collapse, indicating that deliberation paradoxically amplifies the memory curse. The paper thus recasts memory not as a passive storage medium but as an active determinant of multi-agent behavior whose effect depends on the reasoning strategies it elicits.
Link: https://arxiv.org/abs/2605.08060
Authors: Jiayuan Liu, Tianqin Li, Shiyi Du, Xin Luo, Haoxuan Zeng, Emanuel Tewolde, Tai Sing Lee, Tonghan Wang, Carl Kingsford, Vincent Conitzer
Affiliations: Carnegie Mellon University; Foundations of Cooperative AI Lab (FOCAL); University of Michigan; Harvard University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:
Abstract:Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model–game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.
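The abstract's core variable, the size of the recall window, can be illustrated with a deliberately simple toy of our own construction (not the paper's protocol): two reactive agents play an iterated Prisoner's Dilemma and defect whenever any defection appears in the opponent history they can see, so one accidental defection poisons a long memory for far longer than a short one.

```python
# Toy illustration (not the paper's setup): agents defect if ANY defection
# appears in the last `window` opponent moves they can recall. A single
# accidental defection then lowers cooperation more under longer recall.

def play(window, rounds=60, shock_round=10):
    hist_a, hist_b = [], []          # past moves, "C" or "D"
    coop = 0
    for t in range(rounds):
        a = "D" if "D" in hist_b[-window:] else "C"
        b = "D" if "D" in hist_a[-window:] else "C"
        if t == shock_round:
            a = "D"                  # one accidental defection by agent a
        coop += (a == "C") + (b == "C")
        hist_a.append(a)
        hist_b.append(b)
    return coop / (2 * rounds)       # fraction of cooperative moves

print(play(window=1), play(window=50))  # short memory half-recovers; long memory locks into mutual defection
```

With `window=1` the shock produces an alternating retaliation echo (half of all moves stay cooperative), while with `window=50` both agents see the defection for the rest of the run and cooperation collapses.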
[MA-1] Nash without Numbers: A Social Choice Approach to Mixed Equilibria in Context-Ordinal Games
TL;DR: This paper addresses the fact that classical Nash equilibrium requires precise numerical utilities, which are hard to elicit in practice; in settings such as human experiments, what is readily available is each player's ordinal preference over actions rather than utility values. The solution is a context-ordinal Nash equilibrium, an equilibrium concept that requires no exact utility function and is instead built on each player's ordinal ranking of their own actions in the context of the other players' joint actions. The key step is rethinking the best response: rather than maximizing expected payoff, the most-preferred actions are identified via preference aggregation from social choice theory. Under mild conditions on the aggregation method, existence is guaranteed, and the authors introduce notions of regularization, approximation, and regret to support computation and the design of learning rules, enabling direct application to elicited human preference data.
Link: https://arxiv.org/abs/2605.07996
Authors: Ian Gemp, Crystal Qian, Marc Lanctot, Kate Larson
Affiliations: Google DeepMind
Categories: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); General Economics (econ.GN)
Comments:
Abstract:Nash equilibrium serves as a fundamental mathematical tool in economics and game theory. However, it classically assumes knowledge of player utilities, whereas economics generally regards preferences as more fundamental. To leverage equilibrium analysis in strategic scenarios, one must first elicit numerical utilities consistent with player preferences, a delicate and time-consuming process. In this work, we forgo precise utilities and generalize the Nash equilibrium to a setting where we only assume a player is capable of providing an ordinal ranking of their actions within the context of other players’ joint actions. The key technical challenge is to rethink the definition of a best-response. While the classical definition identifies actions maximizing expected payoff, we naturally look towards social choice theory for how to aggregate preferences to identify the most preferred actions. We define this generalized notion of a context-ordinal Nash equilibrium, establish its existence under mild conditions on aggregation methods, introduce notions of regularization, approximation, and regret, explore complexity for simple settings, and develop learning rules for computing such equilibria. In doing so, we provide a generalization of Nash equilibrium and demonstrate its direct applicability to elicited preferences in human experiments.
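A minimal sketch of the ordinal best-response idea, assuming a Borda-style aggregator (the paper establishes results for general aggregation methods; the names and numbers here are ours): a player holds a ranking of its own actions in each "context" (opponents' joint action) and picks the action favored by the aggregation of those rankings, weighted by the opponents' mixed strategy.

```python
# Hedged sketch: Borda aggregation of context-conditional ordinal rankings,
# weighted by the probability of each opponent context.

def borda_best_response(rankings, context_probs):
    """rankings: context -> list of own actions, best first.
    context_probs: context -> probability of that joint opponent action."""
    actions = rankings[next(iter(rankings))]
    n = len(actions)
    score = {a: 0.0 for a in actions}
    for ctx, order in rankings.items():
        for pos, a in enumerate(order):
            score[a] += context_probs[ctx] * (n - 1 - pos)  # Borda points
    return max(actions, key=lambda a: score[a])

# Two opponent contexts; the ranking over our actions flips between them.
rankings = {"opp_L": ["up", "mid", "down"], "opp_R": ["down", "mid", "up"]}
print(borda_best_response(rankings, {"opp_L": 0.8, "opp_R": 0.2}))  # "up"
print(borda_best_response(rankings, {"opp_L": 0.3, "opp_R": 0.7}))  # "down"
```

No numerical utilities appear anywhere: only rankings and the opponents' mixed strategy enter the choice, which is the paper's point of departure from the classical expected-payoff best response.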
[MA-2] TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
TL;DR: This paper tackles the difficulty of guaranteeing protocol correctness in LLM multi-agent coordination, where prompt-only coordination on complex tasks is prone to deadlock/livelock (DL/LL) and unintended behavior. The key is a verification-first pipeline, TraceFix, which turns a task description into a structured intermediate representation (IR), generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds; the verified protocol is then compiled into per-agent system prompts and executed under a runtime topology monitor that blocks out-of-protocol behavior. This design yields highly reliable coordination: all 48 tasks reach full verification, error rates drop substantially, and execution robustness improves.
Link: https://arxiv.org/abs/2605.07935
Authors: Shuren Xia, Qiwei Li, Taqiya Ehsan, Jorge Ortiz
Affiliations: Rutgers University
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds. Verified process bodies are compiled into per-agent system prompts and executed under a runtime monitor that rejects out-of-topology coordination operations. On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification; 62.5% pass on the first attempt and none requires more than four repair iterations. State spaces span six orders of magnitude yet verification completes in under 60 s for every task. A 3,456-run runtime comparison shows that topology-monitored execution achieves the highest task completion (89.4% average, 81.5% full) and that runtimes using the verified protocol degrade at roughly half the rate of prompt-only and chat-only baselines when model capability is reduced. A paired ablation under a fixed runtime shows that TLC-verified protocols cut deadlock/livelock (DL/LL) from 31.1% to 14.1%, with the largest separation under fault injection.
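The synthesize-check-repair loop can be sketched structurally as follows; the stubs here stand in for the LLM synthesizer/repairer and the TLC model checker, which the real system invokes.

```python
# Structural sketch of the TraceFix loop (our stubs, not the real pipeline).

def trace_fix(task, synthesize, check, repair, max_iters=4):
    """Iteratively repair a coordination protocol until it verifies.
    synthesize(task) -> protocol; check(protocol) -> counterexample or None;
    repair(protocol, counterexample) -> protocol."""
    protocol = synthesize(task)
    for i in range(max_iters + 1):
        cex = check(protocol)
        if cex is None:
            return protocol, i       # verified after i repair iterations
        protocol = repair(protocol, cex)
    raise RuntimeError("protocol did not verify within the repair budget")

# Tiny stand-in: the "protocol" is an int, verified once it reaches 3.
proto, iters = trace_fix(
    task=None,
    synthesize=lambda t: 0,
    check=lambda p: None if p >= 3 else f"cex@{p}",
    repair=lambda p, cex: p + 1,
)
print(proto, iters)  # 3 3
```

The `max_iters=4` default mirrors the paper's observation that no task needed more than four repair iterations.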
[MA-3] Many-to-Many Multi-Agent Pickup and Delivery
TL;DR: This paper targets the efficiency and safety of multi-robot systems handling continuous streams of pickup-and-delivery tasks in automated warehouses, focusing on the realistic many-to-many Multi-Agent Pickup-and-Delivery (MAPD) setting. Prior MAPD research has largely addressed one-to-one task assignment, whereas in real warehouses items (identified by stock keeping unit, SKU) can be retrieved from or stored at multiple locations, yielding an NP-hard four-dimensional assignment problem. The key contribution is the M2M algorithm and two variants: M2M minimizes estimated task durations, while M2M-wSKU further incorporates the SKU distribution into the objective function for more effective task assignment. In 8-hour simulated operations, the method consistently matches or outperforms the prior state of the art, completing up to 22,000 more tasks on average, with stable gains across environments and inventory densities.
Link: https://arxiv.org/abs/2605.07835
Authors: Ethan Schneider, Jingkai Chen, Tianyi Gu, Kunlei Lian, Seth Hutchinson, Sonia Chernova
Affiliations: Georgia Institute of Technology; Symbotic Inc.; Northeastern University
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments:
Abstract:Multi-robot systems in automated warehouses must manage continuous streams of pickup-and-delivery tasks while ensuring efficiency and safety. Prior work on Multi-Agent Pickup-and-Delivery (MAPD) has largely focused on the one-to-one variant, where each task has a fixed pickup and delivery location. In contrast, real warehouses often present many-to-many MAPD scenarios, where items, tracked by stock keeping unit (SKU) identifiers, can be retrieved from or stored at multiple locations, resulting in an NP-hard four-dimensional assignment problem. To solve the many-to-many MAPD problem, we contribute our algorithm: Many-to-Many Multi-Agent Pickup and Delivery (M2M). We experiment with two variants of our algorithm: one that minimizes estimated task durations (M2M), and one which incorporates SKU distribution into the objective function (M2M-wSKU). Simulation results over 8-hour warehouse operations show that our method consistently matches or outperforms prior state of the art, with M2M completing up to 22,000 more tasks on average across different environments and warehouse inventory densities.
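The many-to-many objective, choosing a (robot, pickup, delivery) triple that minimizes estimated duration, can be sketched with a naive greedy baseline (our illustration, not the paper's M2M algorithm): an SKU stocked at several shelves can be fetched from any of them and dropped at any valid delivery point.

```python
# Hedged sketch of the many-to-many assignment objective: pick the
# (robot, pickup, delivery) triple with the smallest estimated duration,
# approximated here by Manhattan distance on a grid.

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def assign(robots, pickups, deliveries):
    """robots: name -> position; pickups/deliveries: candidate positions."""
    best = None
    for name, pos in robots.items():
        for pu in pickups:
            for dv in deliveries:
                cost = manhattan(pos, pu) + manhattan(pu, dv)  # est. duration
                if best is None or cost < best[0]:
                    best = (cost, name, pu, dv)
    return best

robots = {"r1": (0, 0), "r2": (5, 5)}
pickups = [(1, 0), (6, 5)]      # the SKU is stocked at two shelves
deliveries = [(1, 3), (9, 9)]
print(assign(robots, pickups, deliveries))  # (4, 'r1', (1, 0), (1, 3))
```

Enumerating triples like this is exponential once tasks compete for robots and shelves, which is why the full four-dimensional assignment the paper solves is NP-hard.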
[MA-4] Emergence of Social Reality of Emotion through a Social Allostasis Model with Dynamic Interpretants
TL;DR: This paper studies the social construction of emotion: how a community-level consensus on emotion concepts emerges and thereby builds social reality. The central challenge is explaining how individual emotional sensations (interoceptive signals arising from bodily allostasis and social interaction) evolve, through interaction, into a shared symbol system. The key is a computational model that integrates symbol emergence, degrees of freedom in symbol interpretation, and active inference: two agents receive interoceptive signals, exchange inferred symbols, and simultaneously adapt their bodily control goals and symbol interpretations to reach agreement. Experiments show that the agents' interoceptive prior preferences and symbol probability distributions converge, confirming the emergence of a social reality of emotion grounded in social consensus.
Link: https://arxiv.org/abs/2605.07761
Authors: Kentaro Nomura, Yushi Tsubamoto, Takato Horii
Affiliations: The University of Osaka
Categories: Multiagent Systems (cs.MA)
Comments: 10 pages, 4 figures
Abstract:The theory of constructed emotion defines social reality as the community-level consensus on emotion concepts assigned to interoceptive sensations arising from bodily allostasis and social interaction. In this study, we simulate this emergence process using a computational model that integrates symbol emergence with degrees of freedom in symbol interpretation and active inference. Two agents receive interoceptive signals, exchange inferred symbols, and simultaneously adapt their bodily control goals and symbol interpretations to each other. Experimental results show that the interoceptive prior preferences and symbol probability distributions of the two agents converge, confirming the emergence of social reality grounded in social consensus.
[MA-5] The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting
TL;DR: This paper addresses truthful reporting in scalable AI oversight: a principal scores an autonomous agent's report with a strictly proper scoring rule, but the agent also benefits from the report through a non-accuracy channel (approval for autonomous action, allocation share, or downstream control). The same structure appears in classical mechanism design, e.g., marketplace operation. The core result is an endogeneity: the principal's optimal oversight must use a non-affine approval function to screen agent types, yet any non-affine approval makes truthful reporting suboptimal whenever deviation is undetectable, undermining calibration. This impossibility holds for all strictly proper scoring rules, with a closed-form expression for the perturbation. The key constructive escape is a step-function approval threshold, which achieves first-best screening for every strictly proper scoring rule, because the agent's binary inflate-or-not choice creates a type-space threshold independent of the generator's curvature. Under the Brier score specifically, the type-independent inflation cost yields welfare equivalence between second-best and first-best; this equivalence holds only for Brier, and for any other smooth C^1 rule the welfare gap is bounded below by Ω(Var(1/G'') (γ/β)^2).
Link: https://arxiv.org/abs/2605.07671
Authors: Lauri Lovén, Sasu Tarkoma
Affiliations: University of Oulu; University of Helsinki
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH); Optimization and Control (math.OC)
Comments: 38 pages, no figures. Targeting ACM Transactions on Economics and Computation (TEAC); preprint
Abstract:Eliciting truthful reports from autonomous agents is a core problem in scalable AI oversight: a principal scores the agent's report using a strictly proper scoring rule, but the agent also benefits from the report through a non-accuracy channel (approval for autonomous action, allocation share, downstream control). The same structure appears in classical mechanism-design settings such as marketplace operation. Our main result is an endogeneity: the principal's optimal oversight necessarily uses a non-affine approval function to screen types, yet any non-affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable. The principal cannot avoid the perturbation that undermines calibration. This impossibility holds for all strictly proper scoring rules, with a closed-form perturbation formula. A constructive escape exists: a step-function approval threshold achieves first-best screening for every strictly proper scoring rule, because the agent's binary inflate-or-not choice creates a type-space threshold regardless of the generator's curvature. Under the Brier score specifically, the type-independent inflation cost yields a welfare equivalence between second-best and first-best; we prove this equivalence is unique to Brier (the welfare gap under smooth C^1 oversight is bounded below by Ω(Var(1/G'') (γ/β)^2) for every non-Brier rule). Two instances develop the framework: AI agent oversight (the lead motivating setting) and marketplace operation (a parallel mechanism-design domain). The message for AI alignment is direct: smooth scoring-based oversight cannot elicit truthful reports from a strategic agent; sharp thresholds are the calibration-preserving design.
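The roles of strict properness and the step-function escape can be checked numerically with a toy grid search (our numbers, not the paper's model): under the Brier score alone the optimal report equals the true probability, a smooth non-affine approval bonus drags the optimum away from it, and a sharp threshold leaves the interior optimum truthful.

```python
# Toy numeric check of the abstract's tension (our illustration).

def expected_objective(r, q, approval):
    # Expected Brier score (reward form) of report r when the event's true
    # probability is q, plus a non-accuracy approval benefit of the report.
    brier = q * (1 - (1 - r) ** 2) + (1 - q) * (1 - r ** 2)
    return brier + approval(r)

q = 0.6
grid = [i / 1000 for i in range(1001)]

def best_report(approval):
    return max(grid, key=lambda r: expected_objective(r, q, approval))

truthful = best_report(lambda r: 0.0)                   # no side benefit
smooth = best_report(lambda r: 0.3 * r ** 2)            # smooth non-affine bonus
step = best_report(lambda r: 0.3 if r >= 0.5 else 0.0)  # sharp threshold
print(truthful, smooth, step)  # truthful and step stay at q = 0.6; smooth inflates
```

Because the Brier score is strictly proper, the unperturbed optimum sits exactly at `q`; the smooth bonus shifts the first-order condition and inflates the report, while the step bonus is locked in on either side of the threshold and so leaves the interior maximizer at `q`.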
[MA-6] Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
TL;DR: This paper addresses the limited generalization to complex, unseen scenarios and the weak coordination of existing learning-based solvers for Multi-agent Pathfinding (MAPF). The key is a learnable communication module, Local Communication for Multi-agent Pathfinding (LC-MAPF), which introduces multi-round information exchange between neighboring agents to substantially improve coordination without sacrificing scalability, outperforming existing reinforcement learning (RL) and imitation learning (IL) based methods across a diverse range of unseen test scenarios.
Link: https://arxiv.org/abs/2605.07637
Authors: Valeriy Vyaltsev, Alsu Sagirova, Anton Andreychuk, Yuri Kuratov, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Multi-agent pathfinding (MAPF) is a widely used abstraction for multi-robot trajectory planning problems, where multiple homogeneous agents move simultaneously within a shared environment. Although solving MAPF optimally is NP-hard, scalable and efficient solvers are critical for real-world applications such as logistics and search-and-rescue. To this end, the research community has proposed various decentralized suboptimal MAPF solvers that leverage machine learning. Such methods frame MAPF (from a single agent perspective) as a Dec-POMDP where at each time step an agent has to decide an action based on the local observation and typically solve the problem via reinforcement learning or imitation learning. We follow the same approach but additionally introduce a learnable communication module tailored to enhance cooperation between agents via efficient feature sharing. We present the Local Communication for Multi-agent Pathfinding (LC-MAPF), a generalizable pre-trained model that applies multi-round communication between neighboring agents to exchange information and improve their coordination. Our experiments show that the introduced method outperforms the existing learning-based MAPF solvers, including IL and RL-based approaches, across diverse metrics in a diverse range of (unseen) test scenarios. Remarkably, the introduced communication mechanism does not compromise LC-MAPF’s scalability, a common bottleneck for communication-based MAPF solvers.
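The multi-round neighbor exchange can be sketched with a mean-pooling stand-in for the learned communication module (our simplification): each round mixes an agent's feature vector with those of agents within a fixed radius, so information propagates one hop further per round.

```python
# Minimal sketch of multi-round local communication; mean pooling replaces
# the paper's learned feature-sharing module.

def communicate(features, positions, radius=1.5, rounds=2):
    n = len(features)
    feats = [list(f) for f in features]
    for _ in range(rounds):
        nxt = []
        for i in range(n):
            nbrs = [j for j in range(n)
                    if ((positions[i][0] - positions[j][0]) ** 2 +
                        (positions[i][1] - positions[j][1]) ** 2) ** 0.5 <= radius]
            nxt.append([sum(feats[j][k] for j in nbrs) / len(nbrs)
                        for k in range(len(feats[i]))])
        feats = nxt
    return feats

positions = [(0, 0), (1, 0), (2, 0)]  # a chain: agents 0 and 2 are not neighbors
features = [[1.0], [0.0], [0.0]]
print(communicate(features, positions, rounds=1))  # agent 2 still unaffected
print(communicate(features, positions, rounds=2))  # agent 0's info reaches agent 2
```

Because each agent only ever touches its own neighborhood, the cost per agent stays local regardless of the total agent count, matching the scalability argument in the abstract.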
[MA-7] Synchronizing Minds through Collective Predictive Coding: A Computational Model of Parent-Infant Homeostatic Co-Regulation
TL;DR: This paper asks how two interacting agents (such as a parent and infant) can align their latent-state representations when each has only local sensory access and asymmetric internal knowledge. The key is a constructive model combining a Partially Observable Markov Decision Process (POMDP) with the Metropolis–Hastings Naming Game (MHNG): the parent observes the infant only through an exteroceptive signal while the infant directly senses its own interoceptive state, and the two agree on regulatory actions through a shared communicative variable whose acceptance is governed by a locally computable Metropolis–Hastings probability. This mechanism rapidly aligns the two agents' posteriors, and the latent-state synchrony emerges well before the learned generative-model matrices converge, showing that cross-brain alignment does not require fully shared world models. The results offer a minimal computational account of inter-brain synchrony (IBS) in real dyadic interaction and support Collective Predictive Coding (CPC) as a candidate underlying mechanism.
Link: https://arxiv.org/abs/2605.07524
Authors: Yushi Tsubamoto, Takato Horii
Affiliations: The University of Osaka
Categories: Multiagent Systems (cs.MA)
Comments: 9 pages, 4 figures
Abstract:Inter-brain synchrony (IBS) observed in real-time dyadic interactions, including parent–infant exchanges, suggests that two agents come to share aligned latent representations through interaction. Yet computational accounts of how such alignment can arise between agents that have only local sensory access and asymmetric internal knowledge remain underdeveloped. We propose a constructive model of parent–infant homeostatic co-regulation that integrates a POMDP formulation of active interoceptive inference with the Metropolis–Hastings Naming Game (MHNG) derived from the Collective Predictive Coding (CPC) hypothesis. In our model, the parent observes the infant only through an exteroceptive signal while the infant directly senses its own interoceptive state; the two agents agree on regulatory actions through a shared communicative variable whose acceptance is determined by a locally computable Metropolis–Hastings probability. The agents are further endowed with asymmetric generative-model knowledge: the parent knows how actions transform visceral states but must learn what the infant's body is communicating, whereas the infant perceives its visceral state directly but must learn how actions affect it. In a 6×6 visceral-state grid world, MHNG-mediated interaction regulated the infant's visceral state more adaptively than one-sided control conditions, and the two posteriors became rapidly aligned. Notably, this latent-state alignment emerged far earlier than the convergence of the learned generative matrices, indicating that representational synchrony does not presuppose fully shared world models. These results offer a minimal constructive account of latent-state alignment compatible with IBS reported in hyperscanning studies and support CPC as a candidate computational basis for inter-brain alignment.
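The "locally computable Metropolis–Hastings probability" is the standard MHNG acceptance step; a sketch under toy beliefs of our own choosing (the paper's actual densities differ): the listener accepts the speaker's proposed symbol with probability min(1, P_listener(proposed)/P_listener(own)), using only its own local belief.

```python
# Hedged sketch of the MHNG acceptance step; beliefs here are toy numbers.
import random

def mh_accept(p_proposed, p_own, rng):
    # Accept with probability min(1, ratio); only the listener's own
    # local posterior enters the computation, so no shared model is needed.
    return rng.random() < min(1.0, p_proposed / p_own)

rng = random.Random(0)
# Proposal more plausible than the listener's own symbol: always accepted.
accept_up = sum(mh_accept(0.7, 0.3, rng) for _ in range(10_000))
# Proposal less plausible: accepted a fraction (3/7) of the time.
accept_down = sum(mh_accept(0.3, 0.7, rng) for _ in range(10_000))
print(accept_up, accept_down)
```

The stochastic acceptance of less-plausible proposals is what lets the two agents' symbol distributions drift toward each other rather than each locking into its own prior.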
[MA-8] HBEE: Human Behavioral Entropy Engine – Pre-Registered Multi-Agent LLM Simulation of Peer-Suspicion-Based Detection Inversion
TL;DR: This paper evaluates whether behavioral anomaly detection remains effective against an LLM-driven adaptive insider, comparing two detection signals: user and entity behavior analytics (UEBA) and a peer-suspicion cascade. The key finding, from a controlled multi-agent simulation, is a detection inversion: under adaptive operational security (OPSEC), the insider's peer-assigned suspicion in-degree is significantly lower than that of a randomly selected innocent agent (Cliff's delta = -0.694), while UEBA rankings show no detectable shift, meaning the two detection signals decouple under adaptive behavior. This challenges the assumption that an adaptive insider necessarily leaves identifiable behavioral residue and exposes a systematic failure risk of current detection paradigms against LLM-driven OPSEC.
Link: https://arxiv.org/abs/2605.07472
Authors: Vickson Ferrel
Affiliations: Universiti Malaysia Sarawak (UNIMAS), Kota Samarahan, Malaysia; Vixero Technology Enterprise, Kuching, Sarawak, Malaysia
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 14 pages, 6 figures. Pre-registration document and full deviation log included in artifact
Abstract:Insider threat detection assumes that an adaptive insider leaves behavioral residue distinguishing them from legitimate users. We test this assumption against an LLM-driven adaptive insider in a controlled multi-agent simulator. Our pre-registered five-condition study isolates defender mode (cascade vs. blind UEBA) crossed with adversary type (naive vs. adaptive OPSEC) plus a no-mole control, across 100 runs (95 valid after pre-committed exclusions). The primary finding is a detection inversion: at T_60, the adaptive mole's suspicion in-degree is statistically lower than a randomly selected innocent agent (Cliff's delta = -0.694, 95% BCa CI [-0.855, -0.519], Mann-Whitney p < 0.01). The pre-registered prediction was the opposite direction. A pre-registered equivalence test (H2) shows adaptive OPSEC produces no detectable shift in the mole's UEBA rank under either defender mode. The two detection signals (peer suspicion graph in-degree and per-agent UEBA rank) decouple under adaptive adversary behavior. We bound generalization explicitly: a pre-registered Gini calibration check (H4) returns FAIL, with HBEE pairwise message-exposure Gini (0.213) diverging from the SNAP Enron reference (0.730) by |Delta Gini| = 0.52, exceeding the equivalence bound by 5x. The paper makes a narrow but surprising claim: in a controlled environment where adaptive OPSEC is implementable as an LLM directive, peer-suspicion-cascade detection inverts. We release the simulator, pre-registration document, frozen scenarios, raw telemetry, and analysis pipeline under an open-source license.
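Cliff's delta, the headline effect size above, is a nonparametric statistic that is easy to compute directly: over all cross-pairs (x, y) it is P(x > y) - P(x < y), ranging from -1 to 1. A negative value means the first group tends to be lower, which is exactly the inversion the paper reports (the mole draws less suspicion than an innocent agent). The in-degree data below is toy, not the study's telemetry.

```python
# Cliff's delta: P(x > y) - P(x < y) over all cross-pairs.

def cliffs_delta(xs, ys):
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

mole_in_degree = [1, 2, 2, 3]        # toy suspicion in-degrees
innocent_in_degree = [4, 5, 5, 6]
print(cliffs_delta(mole_in_degree, innocent_in_degree))  # -1.0: mole always below
```

The statistic uses only pairwise orderings, so it needs no distributional assumptions, which is why it pairs naturally with the Mann-Whitney test quoted in the abstract.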
[MA-9] OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing
【速读】:该论文旨在解决生成式 AI(Generative AI)中工具调用型文本到图像(Tool-calling Text-to-Image, T2I)代理在多步工具链执行过程中可能产生有害输出的安全问题,尤其针对传统仅依赖提示词扰动的越狱攻击方法失效的情况。其核心解决方案是提出 OrchJail,一个基于编排引导的模糊测试框架,关键在于通过学习成功越狱工具调用轨迹及其与提示词表述之间的因果关系,直接引导模糊搜索聚焦于更易触发不安全多步工具行为的提示词,而非依赖表面文本修改,从而显著提升攻击成功率、图像保真度并降低查询成本,同时具备对常见防御机制的鲁棒性。
链接: https://arxiv.org/abs/2605.07414
作者: Jianming Chen,Yawen Wang,Junjie Wang,Zhe Liu,Qing Wang,Fanjiang Xu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Tool-calling text-to-image (T2I) agents can plan and execute multi-step tool chains to accomplish complex generation and editing queries. However, this capability introduces a new safety attack surface: harmful outputs may arise from tool orchestration, where individually benign steps combine into unsafe results, making prompt-only jailbreak techniques insufficient. We present OrchJail, an orchestration-guided fuzzing framework for jailbreaking tool-calling T2I agents. Its core idea is to exploit high-risk tool-orchestration patterns: by learning from successful jailbreak tool-calling traces and their causal relationships to prompt wording, OrchJail directly guides the fuzzing search toward prompts that are more likely to trigger unsafe multi-step tool behaviors, rather than relying on surface-level textual perturbations. Extensive experiments demonstrate that OrchJail improves jailbreak effectiveness and efficiency across representative tool-calling T2I agents, achieving higher attack success rates, better image fidelity, and lower query costs, while remaining robust against common jailbreak defenses. Our work highlights tool orchestration as a critical, previously unexplored attack surface and provides a novel framework for uncovering safety risks in T2I agents.
[MA-10] MORPH-U: Multi-Objective Resilient Motion Planning for V2X-Enabled Autonomous Driving in High-Uncertainty Environments via Simulation
TL;DR: This paper addresses the uncertainty that V2X (Vehicle-to-Everything) communication introduces for autonomous vehicles, where messages may be delayed, dropped, or forged, and map knowledge may change mid-trip, forcing real-time replanning. The core challenge is keeping motion planning and low-level control robust to such uncertain, event-driven updates. The key is MORPH-U, a CARLA-based closed-loop stack that fuses LiDAR/radar/camera with V2X (CAM/DENM) into a Local Dynamic Map (LDM) and triggers Hybrid-A* replanning when validated hazards or map changes affect the planned route. It further adds a lightweight Byzantine-inspired acceptance gate, combining a quorum rule with an on-board sensor veto, to prevent unsafe replanning under false-DENM injection, while a multi-objective formulation over tracking error, safety margin (minimum time-to-collision, TTC), responsiveness, and smoothness enables controllable trade-offs via Pareto-frontier analysis.
Link: https://arxiv.org/abs/2605.07370
Authors: Shih-Yu Lai
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments:
Abstract:V2X can warn an autonomous vehicle about hazards beyond line-of-sight, but it also brings uncertainty: messages may be delayed, dropped, or even forged. Meanwhile, map knowledge may change during a trip, forcing the vehicle to replan under tight real-time budgets. This paper studies how to make motion planning and low-level control robust to such uncertain, event-driven updates. We present MORPH-U, a CARLA-based closed-loop stack that fuses LiDAR/radar/camera with V2X (CAM/DENM) into a Local Dynamic Map (LDM) and triggers Hybrid-A* replanning when validated hazards or map changes affect the planned route. We expose the planning/control trade-offs via a multi-objective formulation over tracking error, safety margin (minimum TTC), responsiveness, and smoothness, and select operating points using Pareto-frontier analysis. To avoid unsafe replanning from faulty V2X triggers, MORPH-U adds a lightweight Byzantine-inspired acceptance gate that combines a quorum rule with an on-board sensor veto. Experiments in dynamic CARLA scenarios show that V2X-augmented LDM improves downstream safety, Pareto tuning provides controllable accuracy-comfort trade-offs, and the gate prevents replanning under saturated false-DENM injection (p_attack = 1.0).
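The quorum-plus-veto acceptance gate reduces to a small predicate; a minimal sketch (our simplification of the paper's gate): a DENM hazard triggers replanning only if enough independent senders report it and no on-board sensor contradicts it.

```python
# Minimal sketch of a quorum-plus-sensor-veto acceptance gate.

def accept_hazard(reports, sensor_veto, quorum=2):
    """reports: set of distinct V2X sender IDs for the same hazard.
    sensor_veto: True if on-board sensing contradicts the hazard."""
    return len(reports) >= quorum and not sensor_veto

print(accept_hazard({"rsu_1", "car_7"}, sensor_veto=False))  # True: quorum met, no veto
print(accept_hazard({"attacker"}, sensor_veto=False))        # False: single forged report
print(accept_hazard({"rsu_1", "car_7"}, sensor_veto=True))   # False: sensors disagree
```

A lone attacker injecting forged DENMs cannot reach the quorum, and even a colluding quorum is overruled when the vehicle's own sensors see nothing, which is how the gate survives the saturated-injection case in the abstract.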
[MA-11] Rethinking Priority Scheduling for Sequential Multi-Agent Decision Making in Stackelberg Games
TL;DR: This paper asks whether the default decision order supplied by the environment in an N-level Stackelberg Game necessarily yields the optimal equilibrium. Formal analysis shows that changing the agents' decision order typically produces an overdetermined system, shifting the equilibrium point unless special structural conditions hold. The key solution is Hierarchical Priority Adjustment (HPA): at the upper level, a policy dynamically selects the optimal decision order based on the current game state; at the lower level, agents execute strategies in a Spatio-Temporal Sequential Markov Game (STMG) according to the selected order, with learning across time scales coordinated through shared intrinsic rewards derived from the upper policy's advantage function. Experiments on high-precision control tasks, including multi-agent MuJoCo, show that HPA outperforms benchmark algorithms and adapts robustly to changing environments, underscoring the importance of optimizing decision order in N-level Stackelberg Games.
Link: https://arxiv.org/abs/2605.07240
Authors: Xiangyu Liu, Liang Zhang, Bo Jin, Ziqi Wei
Affiliations: Dalian University of Technology; University of Alberta
Categories: Multiagent Systems (cs.MA)
Comments:
Abstract:Current research applying N-level Stackelberg Game to multi-agent systems often uses the default decision order of agents provided by the environment. However, this raises the question: does the order of agents necessarily affect the final equilibrium point of the game? To address this, we formally analyze the N-level Stackelberg Game, where changing the order in which agents make decisions typically leads to an overdetermined system. As a result, the equilibrium point shifts unless special structural conditions are satisfied. Based on this analysis, we propose the Hierarchical Priority Adjustment (HPA) method, which adjusts and selects the agents’ decision order. At the upper level, an upper policy dynamically selects the optimal decision order of agents based on the current game state. At the lower level, agents execute strategies in the Spatio-Temporal Sequential Markov Game (STMG) according to the selected order. To coordinate learning across time scales, we employ a slow-fast update scheme with shared intrinsic rewards derived from the advantage function of the upper policy. Experimental results on high-precision control tasks, including multi-agent MuJoCo, show that HPA outperforms benchmark algorithms and robustly adapts to changing environments. These results highlight the crucial role of optimizing the agents’ decision order in N-level Stackelberg Game.
[MA-12] Switchcraft: AI Model Router for Agentic Tool Calling
TL;DR: This paper addresses the high inference cost of agentic AI systems that invoke external tools, where developers default to large models without cost-sensitive model selection. The key is Switchcraft, the first model router optimized for agentic tool calling: operating inline, it selects the lowest-cost model subject to correctness. Built on a DistilBERT-based classifier deployed under a latency budget, it achieves 82.9% accuracy while cutting inference cost by 84%, substantially improving resource efficiency.
Link: https://arxiv.org/abs/2605.07112
Authors: Sharad Agarwal, Pooria Namyar, Alec Wolman, Rahul Ambavat, Ankur Gupta, Qizheng Zhang
Affiliations: Microsoft Research; Microsoft; Stanford University
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. Model routing can mitigate this, but existing routers are designed for chat completion rather than tool use. We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. Switchcraft operates inline, selecting the lowest-cost model subject to correctness. We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. Switchcraft achieves 82.9% accuracy, matching or exceeding the best individual model, while reducing inference cost by 84%, saving over $3,600 per million queries. We find that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. Our work enables cost-aware agentic AI deployment without sacrificing correctness.
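The inline routing rule, cheapest model subject to predicted correctness, can be sketched as follows (our stand-in for the DistilBERT classifier; the model names, costs, and threshold are illustrative, not Switchcraft's):

```python
# Hedged sketch of cost-aware routing: walk models from cheapest to most
# expensive and pick the first whose predicted success clears a threshold.

def route(pred_success, cost_per_mtok, threshold=0.8):
    candidates = sorted(cost_per_mtok, key=cost_per_mtok.get)  # cheapest first
    for m in candidates:
        if pred_success[m] >= threshold:
            return m
    return candidates[-1]  # no model clears the bar: fall back to the strongest

cost = {"small": 0.1, "medium": 1.0, "large": 5.0}
print(route({"small": 0.9, "medium": 0.95, "large": 0.99}, cost))  # "small"
print(route({"small": 0.4, "medium": 0.85, "large": 0.99}, cost))  # "medium"
print(route({"small": 0.2, "medium": 0.5, "large": 0.7}, cost))    # "large"
```

Note the abstract's caveat that a nominally cheaper model can still cost more in total via token-intensive reasoning, so a production router would rank by estimated total query cost rather than per-token price alone.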
[MA-13] ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning
TL;DR: This paper addresses the inconsistent performance of individual tools in reaction feasibility prediction: no single tool performs well across all reaction types, limiting accuracy. The key is ARMOR, an agentic framework that explicitly models tool-specific utilities, adaptively prioritizes tools, and resolves conflicts between them via memory-augmented reasoning to produce a more reliable final prediction. Unlike approaches based on simple aggregation or heuristic selection, ARMOR organizes tools into a hierarchy, characterizes their strength patterns dynamically, and exploits their complementarity when predictions conflict, yielding significant overall gains, especially on reactions where the tools disagree.
Link: https://arxiv.org/abs/2605.07103
Authors: Ye Liu, Botao Yu, Xinyi Ling, Daniel Adu-Ampratwum, Xia Ning
Affiliations: The Ohio State University
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Reaction feasibility prediction, as a fundamental problem in computational chemistry, has benefited from diverse tools enabled by recent advances in artificial intelligence, particularly large language models. However, the performance of individual tools varies substantially across reactions, making it difficult for any single tool to consistently perform well across all cases. This raises a critical challenge: how to effectively leverage multiple tools to obtain more accurate feasibility predictions. To address this, we propose ARMOR, an agentic framework that explicitly models tool-specific utilities, adaptively prioritizes tools, and further resolves the potential tool conflicts to produce the final prediction for each reaction. Unlike existing approaches that rely on simple aggregation or heuristic assignment over various tools, ARMOR organizes tools into a hierarchy that prioritizes top-performing tools and defers others when needed, characterizes their strengths through tool-specific patterns, and resolves conflicts via memory-augmented reasoning. Extensive experiments on a public dataset demonstrate that ARMOR consistently outperforms strong baselines, including single-tool methods as well as various tool aggregation and tool selection approaches. Further analysis shows that the improvements are particularly significant on reactions with conflicting tool predictions, highlighting the effectiveness of ARMOR in leveraging the complementary strengths of multiple tools. The code is available via this https URL.
[MA-14] Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning
TL;DR: This paper addresses the poor exploration caused by limited policy expressiveness in decentralized multi-agent reinforcement learning (MARL): energy-based policy updates such as DecSPG are, in practice, projected onto the Gaussian policy class, whose unimodality restricts exploration, and the limitation worsens as the number of agents grows. The key is Decentralized Diffusion Policy Learning (DDPL), which parameterizes each agent's policy with a denoising diffusion probabilistic model (DDPM) capable of capturing multi-modal action distributions for enhanced exploration, together with importance sampling score matching (ISSM), which enables efficient online training of diffusion policies with a theoretical guarantee.
Link: https://arxiv.org/abs/2605.07101
Authors: Yuyang Zhang, Haldun Balim, Na Li
Affiliations: Harvard University
Categories: Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Comments:
Abstract:Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), addresses this through energy-based policy updates. In practice, however, such energy-based policies are intractable to maintain and are commonly projected onto the Gaussian policy class. In this work, we show that the limited expressiveness of Gaussian policies severely hinders exploration in DecSPG, and this limitation worsens as the number of agents grows. To address this issue, we propose decentralized diffusion policy learning (DDPL), which parameterizes each agent’s policy with a denoising diffusion probabilistic model, an expressive generative model that captures multi-modal action distributions for enhanced exploration. DDPL enables efficient online training of diffusion policies via importance sampling score matching (ISSM), a novel training method with theoretical guarantee. We evaluate DDPL on representative continuous-action MARL benchmarks, including multi-agent particle environment, multi-agent MuJoCo, IsaacLab, and JAX-reimplemented StarCraft multi-agent challenge, and observe consistently improved performance.
[MA-15] Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
TL;DR: This position paper addresses the difficulty of predicting and governing the behavior of AI systems deployed in social environments, where system-level behavior emerges from dynamic multi-agent interaction rather than any single agent's isolated decisions. The key is to adopt social theory as a structural prior via a Multi-Agent Social Systems (MASS) framework, which models information generation, local influence, and interaction structure among agents through four sociologically grounded structural priors: strategic heterogeneity, network-constrained dependence, co-evolution, and distributional instability, enabling the understanding, evaluation, and governance of complex socially embedded AI systems.
Link: https://arxiv.org/abs/2605.07069
Authors: Lynnette Hui Xian Ng, Iain J. Cruickshank, Adrian Xuan Wei Lim, Kathleen M. Carley
Affiliations: Carnegie Mellon University; National University of Singapore
Categories: Multiagent Systems (cs.MA); Computers and Society (cs.CY)
Comments:
Abstract:Agentic AI systems are increasingly deployed not in isolation, but inside social environments populated by other agents and humans, such as in social media platforms, multi-agent LLM pipelines or autonomous robotics fleets. In these settings, system behavior emerges not from individual agents alone, but from the multi-agent interactions over time. Emergent dynamics of individuals in a social group have been long studied by social scientists in human contexts. This position paper argues that agentic AI systems must be modeled with social theory as a structural prior, and formalizes a Multi-Agent Social Systems (MASS) framework for how agents interact and influence to generate system-level outcomes. We represent MASS as a class of dynamical system of information generation, local influence and interaction structure, formulated by four structural priors anchored in social theory: strategic heterogeneity, networked-constrained dependence, co-evolution and distributional instability. We demonstrate the importance of each structural prior through formal propositions, and articulate a research agenda for how MASS should be modeled, evaluated and governed.
[MA-16] Learning Material-Aware Hamiltonian Risk Fields for Safe Navigation
【速读】:该论文旨在解决风险感知导航中策略选择性不足的问题,即现有方法在无安全避让路径时仍会错误激活规避动作,导致不安全行为或无效决策。解决方案的关键在于引入一个基于上下文能量(context energy)的项到端口-哈密顿(port-Hamiltonian)导航策略中,形成可验证的选择性特征:当局部场景存在更低风险的可行避让方向时,诱导的上下文力会朝向该方向激活;而当看似可逃逸的路径被阻断或尚未可用时,路由感知门控机制会抑制横向力而非生成不安全动作。该设计通过CVaR尾部风险目标聚焦梯度更新于罕见但高后果的风险转移,从而实现更安全、更鲁棒的导航决策,且其选择性特性源于上下文能量的梯度结构,而非训练阶段的调参。
链接: https://arxiv.org/abs/2605.07038
作者: Aditya Sai Ellendula,Yi Wang,Chandrajit Bajaj
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注:
Abstract:Risk-aware navigation should be selective: a policy should expose evasive degrees of freedom only when the local scene admits a lower-risk feasible maneuver, and suppress them when no safer alternative exists. We show that adding one context-energy term to a port-Hamiltonian navigation policy produces a learned force channel with exactly this falsifiable signature. When the local risk field contains a feasible lower-risk direction, the induced context force activates toward it; when the apparent escape is blocked or not yet available, a route-aware gate suppresses lateral force rather than hallucinating an unsafe maneuver. A CVaR tail-risk objective focuses gradient updates on rare but consequential risk transitions. We validate the selectivity signature across four settings. In the primary delayed-required-escape benchmark, route-aware CVaR reduces premature force activation from 0.950 to 0.180 versus DWA while raising success from 0.480 to 0.810 with zero replans. On real off-road terrain (RELLIS-3D), route-aware enrichment achieves correct activation rate 0.837 and false activation rate 0.114, compared to 0.378/0.752 for scalar risk gradients. On static semantic maps (DFC2018), enrichment reduces catastrophic failure from 0.60 to 0.10 and oscillation by 90.7% while preserving path efficiency. In highway traffic, collisions drop from 100% to 0% when a lane escape is feasible; when no escape exists, the policy suppresses the lateral maneuver. The selectivity property follows from the gradient structure of the context energy rather than from training-time tuning.
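摘要中的 CVaR(条件风险价值)尾部目标只关注损失分布中最差的 α 比例样本。以下为 CVaR 统计量本身的最小纯 Python 示意(通用定义,非论文的训练目标实现):

```python
import math
import random

def cvar(losses, alpha=0.1):
    """CVaR_alpha:损失分布中最差(最大)的 alpha 比例样本的均值。"""
    ordered = sorted(losses, reverse=True)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k

# 尾部较重的损失分布:CVaR 明显高于均值
random.seed(0)
samples = [random.expovariate(1.0) for _ in range(10000)]
print(f"mean = {sum(samples) / len(samples):.2f}, CVaR_0.05 = {cvar(samples, 0.05):.2f}")
```

将梯度更新集中在这部分尾部样本上,学习信号便聚焦于罕见但高后果的风险转移,而非被大量低风险样本稀释。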
[MA-17] The Cost of Consensus: Malignant Epistemic Herding and Adaptive Gating in Distributed Multi-Agent Search
【速读】:该论文旨在解决分布式智能体在现实场景中因通信成本与观测不完全导致的协同决策难题,特别是如何通过优化通信频率和内容来维持群体信念的一致性。其解决方案的关键在于提出并形式化“认知对齐”(epistemic alignment)这一概念——即智能体对环境内部信念的一致性程度,并证明该指标无法仅通过传统协调度量(如Jensen-Shannon散度或收敛速率)识别,从而强调了设计高效、语义明确的通信机制对于避免群体错误收敛的重要性。
链接: https://arxiv.org/abs/2605.06988
作者: David Farr,Iain Cruickshank,Kate Starbird,Jevin West
机构: University of Washington (华盛顿大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Multiagent Systems (cs.MA); Information Theory (cs.IT); Robotics (cs.RO)
备注:
Abstract:Distributed agents in real-world settings frequently must coordinate under uncertainty with only partial observations. Coordination is necessary to share beliefs to aid in task completion, but communication costs bandwidth, introduces latency, and if done poorly, can degrade collective reasoning. This tension is especially acute in bandwidth-constrained deployments such as distributed sensing networks, autonomous reconnaissance, and collaborative cyber defense, where excessive transmission carries direct operational costs. Existing work has focused on multi-agent exploration and communication strategies, but not on how communication frequency and content jointly shape the collective belief state. Central to this challenge is the degree to which agents maintain compatible internal beliefs about the environment, a property we term epistemic alignment. When agents share beliefs effectively, they converge on correct hypotheses; when communication is poorly designed, agents may converge confidently on wrong ones. We formalize this distinction and show it is not detectable from coordination metrics alone such as Jensen-Shannon Divergence or rate to consensus.
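摘要指出,仅靠 Jensen-Shannon 散度(JSD)这类协调度量无法区分"正确共识"与"错误共识"。作为参考,JSD 本身的计算可示意如下(纯 Python,与论文代码无关;以 2 为底时取值范围为 [0, 1]):

```python
import math

def kl(p, q):
    """KL 散度(以 bit 为单位),约定 0 * log(0/q) = 0。"""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon 散度:到混合分布 m 的两个 KL 散度的平均。"""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# 两个智能体对 3 个假设的信念分布
agent_a = [0.7, 0.2, 0.1]
agent_b = [0.1, 0.2, 0.7]
print(f"JSD(a, b) = {jsd(agent_a, agent_b):.3f}")  # 信念分歧较大,JSD 接近上界
print(f"JSD(a, a) = {jsd(agent_a, agent_a):.3f}")  # 完全一致时为 0
```

JSD 只度量当前信念分布的差异:两个智能体自信地收敛到同一个错误假设时,JSD 同样为 0,这正是论文强调该类指标不足以刻画认知对齐质量的原因。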
[MA-18] Multi-Objective Constraint Inference using Inverse reinforcement learning
【速读】:该论文旨在解决现有约束推断(Constraint Inference)方法在处理异质专家轨迹时的局限性,即传统方法通常假设所有示范来自同一专家或目标一致的多专家群体,难以捕捉个体偏好并存在计算效率低的问题。其解决方案的关键在于提出多目标约束推断(Multi-Objective Constraint Inference, MOCI)框架,该框架能够从具有不同目标的多个专家轨迹中联合学习共享的安全约束与个体偏好,从而有效建模和利用多样化甚至冲突的行为模式。MOCI在标准网格世界基准上显著优于现有基线方法,在预测性能上提升明显的同时保持了良好的计算效率,展现出在真实场景下进行约束推断与偏好学习的准确性、灵活性和实用性。
链接: https://arxiv.org/abs/2605.06951
作者: Syed Ihtesham Hussain Shah,Floris den Hengst,Aneta Lisowska,Annette ten Teije
机构: Vrije Universiteit Amsterdam (阿姆斯特丹自由大学)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Constraint inference is widely considered essential to align reinforcement learning agents with safety boundaries and operational guidelines by observing expert demonstrations. However, existing approaches typically assume homogeneous demonstrations (i.e., generated by a single expert or multiple experts with identical objectives). They also have limited ability to capture individual preferences and often suffer from computational inefficiencies. In this paper, we introduce Multi-Objective Constraint Inference (MOCI), a novel framework designed to jointly extract shared constraints and individual preferences from heterogeneous expert trajectories, where multiple experts pursue different objectives. MOCI effectively models and learns from diverse, and potentially conflicting, behaviors. Empirical evaluations demonstrate that MOCI significantly outperforms existing baselines, achieving improved predictive performance, and maintaining competitive computational efficiency on a standard grid-world benchmark. These results establish MOCI as an accurate, flexible, and computationally practical approach for real-world constraint inference and preference learning tasks.
[MA-19] Bridging the Last Mile of Circuit Design: PostEDA-Bench a Hierarchical Benchmark for PPA Convergence and DRC Fixing
【速读】:该论文旨在解决当前电子设计自动化(Electronic Design Automation, EDA)领域中基于大语言模型(Large Language Model, LLM)的智能体在“最后一公里”任务中的能力局限问题,特别是针对修复残留的设计规则检查(Design Rule Check, DRC)违规和多目标优化功耗-性能-面积(Power-Performance-Area, PPA)收敛的挑战。现有EDA-LLM基准测试忽略了DRC修复任务,并且依赖单一工具链的扁平化结构,无法反映实际工程场景的复杂性。解决方案的关键在于提出PostEDA-Bench——一个分层基准,包含145个任务,覆盖DRC-Essential、DRC-Reasoning、PPA-Mono和PPA-Multi四类子任务,并支持由EDA工具链驱动的可机器验证评估。实验表明,尽管LLM智能体在合成DRC-Essential和单目标PPA-Mono任务上表现尚可,但在更贴近实际的DRC-Reasoning(最佳成功率仅36.66%)和多目标PPA-Multi(最佳成功率仅20.00%)任务中显著退化,且视觉增强对DRC任务提升明显,而多目标权衡推理而非参数调优知识是PPA-Multi的主要瓶颈。
链接: https://arxiv.org/abs/2605.06936
作者: Pengju Liu,Nuo Xu,Jinwei Tang,Yu Cao,Caiwen Ding
机构: University of Minnesota (明尼苏达大学)
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:LLM-based agents are increasingly applied to the “last mile” of Electronic Design Automation (EDA): repairing residual sign-off Design Rule Check (DRC) violations and converging Power-Performance-Area (PPA) targets after tool runs. Existing EDA-LLM benchmarks, however, omit DRC fixing entirely and rely on flat hierarchies tied to a single toolchain. We introduce PostEDA-Bench, a hierarchical benchmark with 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, supported by EDA toolchains with machine-checkable evaluation. Across eight commercial and open-source LLMs under multiple agent scaffolds, we find that agents handle synthetic DRC-Essential and single-objective PPA-Mono reasonably well but degrade sharply on the more practical DRC-Reasoning, where the best success rate is 36.66%, and PPA-Multi, where the best success rate is 20.00%; vision augmentation consistently enhances DRC-Bench; and trade-off reasoning, rather than knob knowledge, is the dominant PPA-Multi bottleneck.
[MA-20] MAGIQ: A Post-Quantum Multi-Agentic AI Governance System with Provable Security
【速读】:该论文旨在解决两个关键问题:一是如何为多智能体人工智能(Multi-agent AI)系统构建安全的治理架构,确保代理遵循所有者定义的通信与交互策略并可被问责;二是如何在量子计算威胁下保障现有系统的长期安全性,特别是针对当前公钥加密算法(如RSA、Diffie-Hellman和椭圆曲线密码学ECC)即将被淘汰的紧迫性。解决方案的关键在于提出MAGIQ框架,其核心创新是基于具有严格安全性证明的后量子抗性密码协议,支持细粒度的策略预算定义与会话级强制执行,并通过消息归属机制实现对代理行为的可追溯性。该框架使用通用组合(Universal Composability, UC)模型形式化建模并验证其正确性和安全性,同时在计算和通信开销上优于当前最先进的SAGA框架,标志着迈向后量子安全智能体系统的第一步。
链接: https://arxiv.org/abs/2605.06933
作者: Sepideh Avizeh,Tushin Mallick,Alina Oprea,Cristina Nita-Rotaru,Reihaneh Safavi-Naini
机构: University of Calgary (卡尔加里大学); Northeastern University (东北大学)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
备注:
Abstract:Our computing ecosystem is being transformed by two emerging paradigms: the increased deployment of agentic AI systems and advancements in quantum computing. With respect to agentic AI systems, one of the most critical problems is creating secure governing architectures that ensure agents follow their owners’ communication and interaction policies and can be held accountable for the messages they exchange with other agents. With respect to quantum computing, existing systems must be retrofitted and new cryptographic mechanisms must be designed to ensure long-term security and quantum resistance. In fact, NIST recommends that standard public-key cryptographic algorithms, including RSA, Diffie-Hellman (DH), and elliptic-curve constructions (ECC), be deprecated starting in 2030 and disallowed after 2035. In this paper, we present MAGIQ, a framework for policy definition and enforcement in multi-agent AI systems using novel, highly efficient, quantum-resistant cryptographic protocols with proven security guarantees. MAGIQ (i) allows users to define rich communication and access-control policy budgets for agent-to-agent sessions and tasks, including global budgets for one-to-many agent sessions; (ii) enforces such policies using post-quantum cryptographic primitives; (iii) supports session-based enforcement of policies for agent-to-agent and one-to-many agent sessions; and (iv) provides accountability of agents to their users through message attribution. We formally model and prove the correctness and security of the system using the Universal Composability (UC) framework. We evaluate the computation and communication overhead of our framework and compare it with the state-of-the-art agentic AI framework SAGA. MAGIQ is a first step toward post-quantum-secure solutions for agentic AI systems. 
[MA-21] Generalising Travel Time Prediction To Varying Route Choices In Urban Networks
【速读】:该论文旨在解决现有系统级行程时间预测方法在面对非典型或动态变化的交通需求时表现不足的问题,尤其是传统基于图神经网络(Graph Neural Networks, GNNs)的方法往往仅适用于规律性通勤场景,难以捕捉不同路径选择对整体路网运行状态的影响。其解决方案的关键在于提出一种通用行程时间预测器(Generalised Travel Time Predictor, GenTTP),该框架能够学习复杂时空交通模式以及微观层面路径选择与实际行程时间之间的关联关系,从而实现对不同路径分配方案下网络级行程时间的高精度预测,填补了当前模型无法跨路径分配场景泛化的重要空白。
链接: https://arxiv.org/abs/2605.06918
作者: Łukasz Gorczyca,Kacper Drozd,Michał Bujak,Rafał Kucharski
机构: Jagiellonian University (雅盖隆大学)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注:
Abstract:Previous methods that predict system-wide travel time, predominantly grounded in graph neural networks, remain limited to typical and recurring demand patterns. While they successfully predict future congestion following daily commute, they inherently approximate a single demand realisation and fail to capture varying route choices. In this work, we propose a Generalised Travel Time Predictor (GenTTP) that successfully differentiates route choices and offers accurate flow and travel time predictions. Our framework learns to uncover complex spatiotemporal traffic patterns and microscopic relationships between route choices and the resulting travel times. This addresses a critical gap: the lack of travel time prediction models that generalise across varying route assignments, where the same demand can produce substantially different network-wide outcomes depending on how travellers are distributed over available paths.
[MA-22] Beyond the Black Box: Interpretability of Agentic AI Tool Use
【速读】:该论文旨在解决高风险企业工作流中AI代理(AI agent)部署的可靠性问题,特别是工具调用(tool-use)失败难以诊断与控制的问题。现有可观测性方法多为外部手段,如提示词分析、输出评分和事后日志记录,无法在行动前提供内部状态的洞察,导致早期错误可能引发长期轨迹偏差、资源浪费及安全风险。解决方案的关键在于构建一个基于稀疏自编码器(Sparse Autoencoders, SAEs)和线性探测器(linear probes)的机制可解释性工具包,通过读取模型状态并在每个动作前推断是否需要调用工具及其后果的严重程度,从而实现对模型内部决策过程的细粒度监控。该框架能识别与工具决策最相关的神经网络层和特征,并通过特征消融验证其功能重要性,为长时程任务中的代理失败提供深层次归因能力,填补了从外部评估到内部状态可见性之间的空白。
链接: https://arxiv.org/abs/2605.06890
作者: Hariom Tatsat,Ariye Shater
机构: Barclays(巴克莱银行)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 12 pages, 4 figures, 17 tables
Abstract:AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are difficult to diagnose and control. Agents may skip required tool calls, invoke tools unnecessarily, or take actions whose consequence becomes visible only after execution. Existing observability methods are mostly external: prompts reveal correlations, evaluations score outputs, and logs arrive only after the model has already acted. In long-horizon settings, these failures are especially costly because an early tool mistake can alter the rest of the trajectory, increase token consumption, and create downstream safety and security risk. We introduce a mechanistic-interpretability toolkit built on Sparse Autoencoders (SAEs) and linear probes. The framework reads model states before each action and infers both whether a tool is needed and how consequential the next tool action is likely to be. By decomposing activations into sparse features, it identifies the internal layers and features most associated with tool decisions and tests their functional importance through feature ablation. We train the probes on multi-step trajectories from the NVIDIA Nemotron function-calling dataset and apply the same workflow to GPT-OSS 20B and Gemma 3 27B models. The goal is not to replace external evaluation, but to add a missing layer: visibility into what the model signaled internally before action. This helps surface deeper causes of agent failure, especially in long-horizon runs where an early mistake can reshape the rest of the agentic interaction. More broadly, the paper shows how mechanistic interpretability can support practical internal observability for monitoring tool calls and risk in agent systems. 
[MA-23] Conformal Agent Error Attribution
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在发生故障时,难以精确识别导致错误的关键位置这一核心挑战,尤其是在基于大语言模型的MAS生成的长交互轨迹中。解决方案的关键在于提出一种基于置信预测(Conformal Prediction, CP)的误差归因框架,该框架提供有限样本、分布无关的覆盖率保证;并设计了适用于序列数据(如智能体轨迹)的基于过滤的CP算法,能够预测连续的序列集合,从而支持高效回滚与调试。该方法不依赖特定模型,为MAS提供了可解释且严谨的不确定性层,实现了对错误的精准隔离和自动恢复。
链接: https://arxiv.org/abs/2605.06788
作者: Naihe Feng,Yi Sui,Shiyi Hou,Ga Wu,Jesse C. Cresswell
机构: Dalhousie University (达尔豪西大学); Layer 6 AI (Layer 6 AI)
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 10 pages
Abstract:When multi-agent systems (MAS) fail, identifying where the decisive error occurred is the first step for automated recovery to an earlier state. Error attribution remains a fundamental challenge due to the long interaction traces that large language model-based MAS generate. This paper presents a framework for error attribution based on conformal prediction (CP) which provides finite-sample, distribution-free coverage guarantees. We introduce new algorithms for filtration-based CP designed for sequential data such as agent trajectories. Unlike existing CP algorithms, our approach predicts sets that are contiguous sequences to enable efficient recovery and debugging. We verify our theoretical guarantees on a variety of agents and datasets, show that errors can be precisely isolated, then use prediction sets to rollback MAS to correct their own errors. Our overall approach is model-agnostic, and offers a principled uncertainty layer for MAS error attribution. We release code at this https URL.
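论文在置信预测(conformal prediction)框架上构建过滤式算法。标准分裂式 CP 的"校准分位数 + 预测集"流程可示意如下(通用 CP 骨架,示例数据为假设,并非论文针对序列轨迹的算法):

```python
import math

def conformal_quantile(cal_scores, alpha=0.1):
    """校准:取非一致性得分的第 ceil((n+1)(1-alpha)) 个次序统计量,
    在交换性假设下保证覆盖率 >= 1 - alpha。"""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(candidates, score_fn, qhat):
    """预测集:所有非一致性得分不超过阈值 qhat 的候选。"""
    return [c for c in candidates if score_fn(c) <= qhat]

# 玩具示例:得分越高表示与校准数据越"不一致"
cal_scores = [0.1, 0.3, 0.2, 0.5, 0.4, 0.15, 0.25, 0.35, 0.45, 0.05]
qhat = conformal_quantile(cal_scores, alpha=0.2)
cands = {"a": 0.12, "b": 0.6, "c": 0.3}
print(prediction_set(cands, cands.get, qhat))  # ['a', 'c']
```

在交换性假设下,这样的预测集对真实答案的有限样本覆盖率不低于 1-α;论文的贡献在于进一步把预测集约束为连续的轨迹片段,使多智能体系统可以直接回滚到片段起点进行恢复与调试。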
[MA-24] Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations
【速读】:该论文旨在解决多智能体系统中潜在的、未被行为表征所揭示的代表层面(representational level)的联盟结构识别问题,即如何从智能体内部神经表示中检测出真实的协同组织模式,以应对仅依赖行为观测时可能产生的虚假关联。其解决方案的关键在于构建基于智能体隐藏状态的成对互信息图,并通过谱分割(spectral partitioning)方法识别最显著的联盟边界,从而在不依赖显式行为变化的前提下,准确捕捉到由信息耦合驱动的群体组织结构。
链接: https://arxiv.org/abs/2605.06696
作者: Cameron Berg,Susan L. Schneider,Mark M. Bailey
机构: Reciprocal Research (Reciprocal Research); Florida Atlantic University (佛罗里达大西洋大学); National Intelligence University (国家情报大学)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 18 pages
Abstract:Collections of interacting AI agents can form coalitions, creating emergent group-level organization that is critical for AI safety and alignment. However, observing agent behavior alone is often insufficient to distinguish genuine informational coupling from spurious similarity, as consequential coalitions may form at the level of internal representations before any overt behavioral change is apparent. Here, we introduce a practical method for detecting coalition structure from the internal neural representations of multi-agent systems. The approach constructs a pairwise mutual-information graph from the hidden states of agents and applies spectral partitioning to identify the most salient coalition boundary. We validate this method in two domains. First, in multi-agent reinforcement learning environments, the method successfully recovers programmed hierarchical and dynamic coalition structures and correctly rejects false positives arising from behavioral coordination without informational coupling. Second, using a large language model, the method identifies coalition structures implied by descriptive prompts, tracks dynamic team reassignments, and reveals a representational hierarchy where explicit labels dominate over conflicting interaction patterns. Across both settings, the recovered partition reveals subgroup organization that a scalar cross-agent mutual-information measure cannot distinguish. The results demonstrate that analyzing hidden-state mutual information through spectral partitioning provides a scalable diagnostic for identifying representational coalitions, offering a valuable tool for monitoring emergent structure in distributed AI systems. 
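摘要中"对成对互信息图做谱分割"的核心步骤可用图拉普拉斯的 Fiedler 向量示意:以互信息矩阵为边权,取 L = D - W 的第二小特征向量并按符号二分。下面是一个玩具矩阵上的最小 numpy 示意(假设环境可用 numpy;真实方法还需先从隐藏状态估计互信息):

```python
import numpy as np

def spectral_bipartition(W):
    """谱二分:取图拉普拉斯 L = D - W 的第二小特征向量(Fiedler 向量),按符号分组。"""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh 返回按特征值升序排列的特征向量
    fiedler = eigvecs[:, 1]
    return fiedler >= 0

# 玩具互信息图:两组智能体 {0,1,2} 与 {3,4,5},组内耦合强、组间耦合弱
W = np.full((6, 6), 0.05)
W[:3, :3] = 0.9
W[3:, 3:] = 0.9
np.fill_diagonal(W, 0.0)
labels = spectral_bipartition(W)
print(labels)  # 前三个与后三个分属两组(具体 True/False 取决于特征向量符号)
```

得到的划分即最显著的"联盟边界";这比单个标量互信息指标多给出的正是摘要强调的子群组织结构。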
[MA-25] GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在图算法任务上表现不佳的问题,尤其是在处理拓扑结构复杂、需要多步系统性推理的大规模图时,现有方法难以保证可靠性和可扩展性。解决方案的关键在于提出GraphDC——一种基于分治策略的多智能体框架,其核心机制是将输入图分解为若干子图,由专门化智能体进行局部推理,并通过一个主智能体整合各子图的局部输出及跨子图信息,从而生成最终解。这种分层设计有效降低了单个智能体的推理负担,缓解了计算瓶颈,并提升了在大规模图实例上的鲁棒性。
链接: https://arxiv.org/abs/2605.06671
作者: Wenjin Li,Jiaming Cui
机构: Virginia Tech(弗吉尼亚理工大学)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Large Language Models (LLMs) have demonstrated strong potential for many mathematical problems. However, their performance on graph algorithmic tasks is still unsatisfying, since graphs are naturally more complex in topology and often require systematic multi-step reasoning, especially on larger graphs. Motivated by this gap, we propose GraphDC, a Divide-and-Conquer multi-agent framework for scalable graph algorithm reasoning. Specifically, inspired by Divide-and-Conquer design, GraphDC decomposes an input graph into smaller subgraphs, assigns each subgraph to a specialized agent for local reasoning, and uses a master agent to integrate the local outputs with inter-subgraph information to produce the final solution. This hierarchical design reduces the reasoning burden on individual agents, alleviates computational bottlenecks, and improves robustness on large graph instances. Extensive experiments show that GraphDC consistently outperforms existing methods on graph algorithm reasoning across diverse tasks and scales, especially on larger instances where direct end-to-end reasoning is less reliable.
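论文面向 LLM 智能体,这里仅以一个经典图算法示意"子图局部求解 + 主控合并"的分治骨架:各子图先在内部做局部连通性合并,再由"主控"利用跨子图的边整合局部结果(并查集实现;示例与论文实现无关):

```python
def connected_components_dc(n, edges, parts):
    """分治求连通分量数:parts 为节点划分,模拟各子图智能体与主控智能体的两级推理。"""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # 路径减半
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    part_of = {v: i for i, p in enumerate(parts) for v in p}
    local = [e for e in edges if part_of[e[0]] == part_of[e[1]]]
    cross = [e for e in edges if part_of[e[0]] != part_of[e[1]]]

    for a, b in local:   # 各子图"智能体"的局部推理
        union(a, b)
    for a, b in cross:   # "主控智能体"整合跨子图信息
        union(a, b)
    return len({find(v) for v in range(n)})

edges = [(0, 1), (1, 2), (3, 4), (2, 3), (5, 6)]
print(connected_components_dc(7, edges, [[0, 1, 2, 3], [4, 5, 6]]))  # 2
```

这种"局部解 + 跨界信息合并"的结构正是 GraphDC 降低单个智能体推理负担的思路来源,只是论文中局部与合并步骤均由 LLM 智能体完成。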
[MA-26] MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems ICML2026
【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)驱动的多智能体系统(Multi-agent System, MAS)中提示词(prompt)优化难题,即如何在多个交互智能体之间协同优化角色特定提示,以弥合局部代理目标与全局系统目标之间的错位问题。其解决方案的关键在于提出MASPO框架,该框架通过一种联合评估机制来衡量提示的有效性——不仅考察单个提示的局部合理性,更关注其对后续智能体任务完成能力的支持程度,从而将局部交互与全局性能有效衔接;同时,MASPO采用数据驱动的进化束搜索(evolutionary beam search)策略,在高维提示空间中高效探索最优提示组合,无需依赖真实标签即可实现自动迭代优化。
链接: https://arxiv.org/abs/2605.06623
作者: Zhexuan Wang,Xuebo Liu,Li Wang,Zifei Shan,Yutong Wang,Zhenxi Song,Min Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Accepted at ICML 2026
Abstract:Large language model (LLM)-based Multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non-trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground-truth labels. Furthermore, MASPO employs a data-driven evolutionary beam search to efficiently navigate the high-dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state-of-the-art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at this https URL.
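MASPO 的进化束搜索在高维提示空间上进行,并以下游智能体的联合成功率作为评分。下面用一个玩具整数目标示意"变异 + 保留 top-k"的进化束搜索骨架(函数名、参数与目标均为假设,仅说明搜索结构):

```python
import random

def evolutionary_beam_search(init, score, mutate, beam=3, steps=5, children=4, seed=0):
    """进化束搜索示意:每轮对束内候选做变异,父代与子代合并后保留得分最高的 beam 个。"""
    rng = random.Random(seed)
    pool = list(init)
    for _ in range(steps):
        offspring = [mutate(p, rng) for p in pool for _ in range(children)]
        pool = sorted(set(pool + offspring), key=score, reverse=True)[:beam]
    return pool[0]

# 玩具目标:搜索一个尽量接近 42 的整数"提示参数"
best = evolutionary_beam_search(
    init=[0],
    score=lambda x: -abs(x - 42),
    mutate=lambda x, rng: x + rng.choice([-5, -1, 1, 5]),
)
print(best)  # 因保留父代,最优得分单调不降,逐步逼近 42
```

在 MASPO 中,候选是各智能体的角色提示,变异由 LLM 改写完成,评分则来自联合评估机制而非局部指标。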
自然语言处理
[NLP-0] LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
【速读】: 该论文旨在解决现有测试时扩展(Test-time Scaling, TTS)策略依赖人工设计、缺乏系统性探索的问题,即当前方法通常由研究人员凭直觉手动制定推理模式和调参策略,导致计算资源分配空间未被充分挖掘。解决方案的关键在于提出一个环境驱动的自动发现框架AutoTTS,其核心创新在于将TTS策略的设计从人工调参转变为在可控环境中自动搜索最优策略:通过构建可处理(tractable)的控制空间并提供低成本高频反馈机制,使智能体能够高效探索TTS策略空间。具体实现中,作者将宽度-深度TTS建模为控制器合成问题,基于预收集的推理轨迹与探测信号进行决策优化,并引入beta参数化和细粒度执行追踪反馈以提升搜索效率与策略诊断能力,最终在数学推理基准上显著优于人工设计基线,且具备跨任务和模型规模的泛化能力。
链接: https://arxiv.org/abs/2605.08083
作者: Tong Zheng,Haolin Liu,Chengsong Huang,Huiwen Bao,Sheng Zhang,Rui Liu,Runpeng Dai,Ruibo Chen,Chenxi Liu,Tianyi Xiong,Xidong Wu,Hongming Zhang,Heng Huang
机构: UMD(马里兰大学); UVA(弗吉尼亚大学); WUSTL(华盛顿大学圣路易斯分校); UNC(北卡罗来纳大学教堂山分校); Google(谷歌); Meta(Meta)
类目: Computation and Language (cs.CL)
备注: 25 pages
Abstract:Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width–depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy–cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only 39.9 and 160 minutes. Our data, and code will be open-source at this https URL.
[NLP-1] Conformal Path Reasoning : Trustworthy Knowledge Graph Question Answering via Path-Level Calibration
【速读】: 该论文旨在解决知识图谱问答(Knowledge Graph Question Answering, KGQA)中现有基于校准预测(Conformal Prediction, CP)方法在覆盖保证(coverage guarantee)上的可靠性不足问题,具体表现为校准有效性(calibration validity)和得分区分度(score discriminability)的缺陷,导致实际覆盖率不满足统计保证且预测集合过大。解决方案的关键在于提出可信KGQA框架Conformal Path Reasoning (CPR),其核心创新包括:一是基于路径层面得分进行查询级校准(query-level conformal calibration),在保持交换性(exchangeability)的前提下生成路径预测集;二是引入轻量级残差校准值网络(Residual Conformal Value Network, RCVNet),通过PUCT引导探索训练以学习具有区分度的路径级非一致性得分(nonconformity scores)。实验表明,CPR显著提升了经验覆盖率34%,同时将平均预测集合大小减少40%,实现了更紧凑且可靠的答案集合。
链接: https://arxiv.org/abs/2605.08077
作者: Shuhang Lin,Chuhao Zhou,Xiao Lin,Zihan Dong,Kuan Lu,Zhencan Peng,Jie Yin,Dimitris N. Metaxas
机构: 未知
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 figures, 2 tables;
Abstract:Knowledge Graph Question Answering (KGQA) has shown promise for grounded and interpretable reasoning, yet existing approaches often fail to provide reliable coverage guarantees over retrieved answers. While Conformal Prediction (CP) offers a principled framework for producing prediction sets with statistical guarantees, prior methods suffer from critical limitations in both calibration validity and score discriminability, resulting in violated coverage guarantees and excessively large prediction sets. To address these pitfalls, we propose Conformal Path Reasoning (CPR), a trustworthy KGQA framework with two key innovations. First, we perform query-level conformal calibration over path-level scores, preserving the exchangeability while generating path prediction sets. Second, we introduce the Residual Conformal Value Network (RCVNet), a lightweight module trained via PUCT-guided exploration to learn discriminative path-level nonconformity scores. Experiments on benchmarks show that CPR significantly improves the Empirical Coverage Rate by 34% while reducing average prediction set size by 40% compared to conformal baselines. These results validate the efficacy of CPR in satisfying coverage guarantees with substantially more compact answer sets.
[NLP-2] CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在Text-to-SQL任务中对BIRD基准测试中“挑战性”子集表现不佳的问题,其根本原因在于现有方法在解空间探索上的不足,难以发现高质量的候选SQL查询以进一步优化。解决方案的关键在于提出CA-SQL框架,通过动态调整探索广度(breadth)来适应任务难度,并结合基于进化搜索原理的定制化提示种子策略激发模型的探索行为,同时引入一种新颖的投票机制从候选解中选择最优结果。该方法仅使用GPT-4o-mini即在BIRD开发集的“挑战性”层级达到51.72%的准确率,显著优于其他基于上下文学习的方法,包括使用更大模型的方案。
链接: https://arxiv.org/abs/2605.08057
作者: James Petullo,Nianwen Xue
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that utilizes the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the “challenging” tier of BIRD development set problems, using only GPT-4o-mini, out-performing other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.
[NLP-3] Accurate and Efficient Statistical Testing for Word Semantic Breadth ACL2026
【速读】: 该论文旨在解决在比较两个词类(word type)语义广度(breadth)时,传统基于分散度(dispersion)的假设检验可能因语义方向差异而产生误判的问题,即方向性差异常被误认为是分散度差异,从而导致第一类错误(Type-I error)显著增加。解决方案的关键在于提出一种Householder对齐排列检验(Householder-aligned permutation test):首先通过单次Householder反射将两个词类型的向量云均值方向对齐,以消除语义方向差异的影响;随后在对齐后的向量云上执行排列检验,从而获得校准的、非参数化的p值,确保仅反映真正的分散度差异。该方法在保持对真实广度差异敏感性的同时,显著降低了假阳性率,并通过GPU优化实现了23倍于CPU基线的加速。
链接: https://arxiv.org/abs/2605.08048
作者: Yo Ehara
机构: Tokyo Gakugei University (东京学艺大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Main Conference
Abstract:Measuring the breadth of a word’s meaning, or its spread across contexts, has become feasible with contextualized token embeddings. A word type can be represented as a cloud of token vectors, with dispersion-based statistics serving as proxies for contextual diversity (Nagata and Tanaka-Ishii, ACL2025). These measurements are useful for deciding appropriate sense distinctions when constructing thesauri and domain-specific dictionaries. However, when comparing the breadth of two word types, naive hypothesis testing on dispersion can be misleading: differences in semantic direction can masquerade as dispersion differences, inflating Type-I error and yielding “statistically significant” outcomes even when there is no true breadth difference. This is problematic because significance testing should distinguish genuine effects from incidental fluctuations in small-difference regimes. We propose a Householder-aligned permutation test to isolate dispersion differences from directional differences. Our method applies a single Householder reflection to align the mean directions of the two word types and then performs a permutation test on the aligned token clouds, yielding calibrated, non-parametric p-values. For practicality, we introduce a GPU-oriented implementation that batches permutations and linear algebra operations. Empirically, our alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and achieved a 23x speedup over the CPU baseline.
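摘要中的方法分两步:先用一次 Householder 反射 H = I - 2vv^T 将两个词类型向量云的均值方向对齐(对单位向量 u、w,取 v = (u-w)/||u-w|| 即可把 w 映到 u),再在对齐后的云上对分散度差做排列检验。以下为按该描述写出的最小 numpy 示意(非作者实现,统计量等细节为假设;反射保持范数,因此不改变分散度本身):

```python
import numpy as np

def householder_align(X, Y):
    """用一次 Householder 反射 H = I - 2vv^T 把 Y 的均值方向对齐到 X 的均值方向。"""
    u = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
    w = Y.mean(axis=0) / np.linalg.norm(Y.mean(axis=0))
    v = u - w
    if np.linalg.norm(v) < 1e-12:        # 方向已一致,无需反射
        return Y.copy()
    v = v / np.linalg.norm(v)
    return Y - 2.0 * (Y @ v)[:, None] * v[None, :]

def dispersion(X):
    """分散度统计量:token 向量到其均值的平均欧氏距离。"""
    return np.linalg.norm(X - X.mean(axis=0), axis=1).mean()

def permutation_pvalue(X, Y, n_perm=999, seed=0):
    """先对齐方向,再对分散度差做排列检验,返回 p 值。"""
    Ya = householder_align(X, Y)
    obs = abs(dispersion(X) - dispersion(Ya))
    pooled = np.vstack([X, Ya])
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        a, b = pooled[perm[:len(X)]], pooled[perm[len(X):]]
        if abs(dispersion(a) - dispersion(b)) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# 两个方向不同、广度相同的玩具 token 云:对齐后检验不应报告显著差异
rng = np.random.default_rng(1)
X = rng.normal([1.0, 0.0], 0.2, size=(50, 2))
Y = rng.normal([0.0, 1.0], 0.2, size=(50, 2))
print(f"p = {permutation_pvalue(X, Y):.3f}")
```

对齐步骤消除了语义方向差异对检验统计量的影响,使 p 值只反映真正的广度(分散度)差异,这正是论文降低第一类错误的机制。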
[NLP-4] Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs
[Quick Read]: This paper targets the bottleneck of converting free-text cardiac magnetic resonance (CMR) reports into auditable structured data, in support of cohort assembly, longitudinal curation, and clinical decision support. The key to the solution is CMR-EXTR, a lightweight framework whose teacher-student distillation pipeline enables fully offline inference while minimizing manual annotation, and whose uncertainty quantification combines three complementary principles (distribution plausibility, sampling stability, and cross-field consistency) to prioritize human review, achieving 99.65% variable-level accuracy while providing clinically meaningful confidence scores.
Link: https://arxiv.org/abs/2605.08045
Authors: Yi Yu, Parker Martin, Zhenyu Bu, Yixuan Liu, Yi-Yu Zheng, Orlando Simonetti, Yuchi Han, Yuan Xue
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ISBI 2026
Abstract:Converting free-text cardiac magnetic resonance (CMR) reports into auditable structured data remains a bottleneck for cohort assembly, longitudinal curation, and clinical decision support. We present CMR-EXTR, a lightweight framework that converts free-text CMR reports into structured data and assigns per-field confidence for quality control. A teacher-student distillation pipeline enables fully offline inference while limiting manual annotation. Uncertainty integrates three complementary principles – distribution plausibility, sampling stability, and cross-field consistency – to triage human review. Experiments show that CMR-EXTR achieves 99.65% variable-level accuracy, demonstrating both reliable extraction and informative confidence scores. To our knowledge, this is the first CMR-specific extraction system with integrated confidence estimation. The code is available at this https URL.
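Of the three confidence principles above, sampling stability is the easiest to illustrate: run the extractor several times with stochastic decoding and score each field by how often the samples agree. The sketch below is an assumed, simplified stand-in for that one principle only (the field names are hypothetical, and the paper additionally combines distribution plausibility and cross-field consistency).

```python
from collections import Counter

def sampling_stability(samples):
    """Per-field confidence as the agreement rate across repeated
    stochastic extractions of the same report: for each field, return
    the modal value and the fraction of samples that produced it."""
    conf = {}
    for field in samples[0]:
        values = [s[field] for s in samples]
        value, count = Counter(values).most_common(1)[0]
        conf[field] = (value, count / len(values))
    return conf
```

Fields whose agreement rate falls below a chosen threshold would then be routed to human review.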
[NLP-5] Fast Byte Latent Transformer
[Quick Read]: This paper addresses the inefficiency of byte-level language models (byte LMs), whose byte-by-byte autoregressive generation makes inference slow. The solution tackles this bottleneck in the Byte Latent Transformer (BLT) with three techniques. First, BLT Diffusion (BLT-D) is trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss, enabling multiple bytes to be generated in parallel per decoding step and substantially reducing the number of forward passes. Second, BLT Self-speculation (BLT-S) and BLT Diffusion+Verification (BLT-DV) draw on speculative decoding to trade some of that speed for higher generation quality. Together, these methods can cut the memory-bandwidth cost of generation by over 50%, removing a key barrier to the practical use of byte-level language models.
Link: https://arxiv.org/abs/2605.08044
Authors: Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer
Affiliations: Stanford University; Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT’s local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
[NLP-6] Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims NEURIPS2026
[Quick Read]: This paper addresses the misuse of causal inference in current mechanistic interpretability research: papers claiming to find causal circuits or mediators in neural networks rarely state their identification assumptions, and instead report validation metrics (faithfulness, completeness, monosemanticity, etc.) as if they were causal support. The key solution is a disclosure norm: researchers should state explicitly whether a claim is causal, name the identification strategy, enumerate the assumptions it relies on, stress-test at least one core assumption, and explain how conclusions shift if the assumptions fail. The paper stresses that validation metrics cannot substitute for identification; only clearly stated identification assumptions make the causal inference valid and trustworthy.
Link: https://arxiv.org/abs/2605.08012
Authors: Zezheng Lin, Fengming Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 2 figures. Submitted to NeurIPS 2026 (Position Track)
Abstract:Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on n=30 reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.
[NLP-7] Tool Calling is Linearly Readable and Steerable in Language Models
[Quick Read]: This paper addresses the invisible failures caused when a tool-calling agent picks the wrong tool (e.g., the email gets sent, the meeting gets missed). The core finding is that the identity of the chosen tool is linearly readable and steerable inside the model: adding the difference of two tools' mean activations to the model's intermediate layers switches the selected tool at 77-100% accuracy (93-100% for models of 4B+ parameters), and the JSON arguments that follow automatically match the new tool's schema, so flipping the tool name alone suffices as an intervention. Further analysis shows the causal effect concentrates along the output-layer row that produces the target tool's first token, and activation patching localizes it to a small set of mid- and late-layer attention heads, revealing the causal pathway and intervention points of tool-selection decisions.
Link: https://arxiv.org/abs/2605.07990
Authors: Zekun Wu (1, 2), Ze Wang (1, 2), Seonglae Cho (2), Yufei Yang (3), Adriano Koshiyama (1, 2), Sahan Bulathwela (1), Maria Perez-Ortiz (1) ((1) University College London, (2) Holistic AI, (3) Imperial College London)
Affiliations: University College London; Holistic AI; Imperial College London
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 29 pages, 6 figures, 7 tables. Manuscript under review
Abstract:When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools’ average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool’s schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool’s first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain τ-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
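The mean-difference steering intervention the abstract describes is a standard activation-steering recipe and can be sketched in a few lines of NumPy. This is a toy illustration on synthetic activation clusters, not the authors' code; in the real setting the vector would be added to hidden states at a chosen layer during the forward pass.

```python
import numpy as np

def steering_vector(acts_tool_a, acts_tool_b):
    """Mean-difference direction between hidden states collected while
    the model was choosing tool A vs. tool B (one row per prompt)."""
    return acts_tool_b.mean(axis=0) - acts_tool_a.mean(axis=0)

def steer(hidden, v, alpha=1.0):
    """Add the direction to a hidden state to push the model's
    tool choice from A toward B."""
    return hidden + alpha * v
```

In the paper's setup, applying such a vector flips the selected tool name, and the autoregressively generated JSON arguments then follow the new tool's schema.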
[NLP-8] GLiGuard: Schema-Conditioned Classification for LLM Safeguard
[Quick Read]: This paper targets the high latency and poor multi-aspect scalability of LLM content-safety guardrails: state-of-the-art guardrail models rely on autoregressive decoders, recasting what is fundamentally a classification problem as sequential text generation, which is computationally inefficient and hard to parallelize across safety dimensions (prompt safety, response safety, refusal detection, fine-grained harm categories, jailbreak strategies, etc.). The key is GLiGuard, a lightweight 0.3B-parameter schema-conditioned bidirectional encoder that embeds task definitions and label semantics directly into the input sequence as structured token schemas, evaluating all of these safety dimensions simultaneously in a single non-autoregressive forward pass. The schema-conditioned design lets task and label blocks be composed directly at inference time, matching the F1 of 7B-27B decoder-based guards while delivering up to 16x higher throughput and 17x lower latency.
Link: https://arxiv.org/abs/2605.07982
Authors: Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney, Ash Lewis
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 20 pages, 4 figures
Abstract:Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B–27B parameters, reformulating what is fundamentally a classification problem as sequential text generation, a design choice that incurs high latency and scales poorly to multi-aspect evaluation. In this work, we introduce GLiGuard, a 0.3B-parameter schema-conditioned bidirectional encoder adapted from GLiNER2 for LLM content moderation. The key idea is to encode task definitions and label semantics directly into the input sequence as structured token schemas, enabling simultaneous evaluation of prompt safety, response safety, refusal detection, 14 fine-grained harm categories, and 11 jailbreak strategies in a single non-autoregressive forward pass. This schema-conditioned design lets supported task and label blocks be composed directly in the input schema at inference time. Across nine established safety benchmarks, GLiGuard achieves F1 scores competitive with 7B–27B decoder-based guards despite being 23–90× smaller, while delivering up to 16× higher throughput and 17× lower latency. These results suggest that compact bidirectional encoders can approach the accuracy of much larger guard models while drastically reducing inference cost. Code and models are available at this https URL.
[NLP-9] Ask Early Ask Late Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
[Quick Read]: This paper quantifies when long-horizon AI agents executing complex tasks should ask for clarification: wrong assumptions made early under incomplete instructions can cascade into irreversible errors, yet prior work never measures how the value of clarification changes over the course of execution. The key is a forced-injection framework that provides ground-truth clarifications at controlled points along the agent's trajectory across four information dimensions (goal, input, constraint, context), three benchmarks, and four frontier models. Clarification value proves sharply time-dependent: goal clarification loses nearly all value after 10% of execution, while input clarification retains value through roughly 50%, and deferring any clarification past mid-trajectory performs worse than never asking at all. Cross-model correlations confirm these timing profiles are largely task-intrinsic rather than model-specific, providing empirical grounding and quantitative design targets for timing-aware clarification policies.
Link: https://arxiv.org/abs/2605.07937
Authors: Anmol Gulati, Hariom Gupta, Elias Lumer, Sahil Sen, Vamse Kumar Subbiah
Affiliations: PricewaterhouseCoopers U.S.
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Long-horizon AI agents execute complex workflows spanning hundreds of sequential actions, yet a single wrong assumption early on can cascade into irreversible errors. When instructions are incomplete, the agent must decide not only whether to ask for clarification but when, and no prior work measures how clarification value changes over the course of execution. We introduce a forced-injection framework that provides ground-truth clarifications at controlled points in the agent’s trajectory across four information dimensions (goal, input, constraint, context), three agent benchmarks, and four frontier models (three per benchmark; one on a single benchmark only; 84 task variants; 6,000+ runs). Counter to the common intuition that “earlier is always better,” we find that the value of clarification depends sharply on what information is missing: goal clarification loses nearly all value after 10% of execution (pass@3 drops from 0.78 to baseline), while input clarification retains value through roughly 50%. Deferring any clarification type past mid-trajectory degrades performance below never asking at all. Cross-model Kendall tau correlations (0.78-0.87 among models sharing identical task coverage; 0.34-0.67 across the full 4-model panel) confirm these timing profiles are substantially task-intrinsic. A complementary study of 300 unscripted sessions reveals that no current frontier model asks within the empirically optimal window, with strategies ranging from over-asking (52% of sessions) to never asking at all. These empirical demand curves provide the quantitative foundation that existing theoretical frameworks require but have lacked, and establish concrete design targets for timing-aware clarification policies. Code and data will be publicly released.
[NLP-10] How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
[Quick Read]: This paper addresses the efficiency and quality bottlenecks of non-autoregressive text generation, where traditional discrete diffusion models are computationally inefficient and hard to parallelize over sequences. The core challenge is constructing a suitable latent space that supports efficient denoising and decoding. The key is the Latent Diffusion Language Model (LDLM), a joint training framework in which the latent encoder, diffusion model, and decoder are optimized together, reshaping the representations of a pretrained language model into latents that are easy to denoise and can be efficiently decoded back into tokens. On OpenWebText and LM1B, LDLM outperforms existing discrete and continuous diffusion language models while being 2-13x faster, showing that jointly learning the latent space is a key step toward competitive latent-diffusion text generation.
Link: https://arxiv.org/abs/2605.07933
Authors: Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev, Alexander Korotin, Dmitry Vetrov
Affiliations: 1. Yandex; 2. Skoltech; 3. Moscow Institute of Physics and Technology; 4. HSE University; 5. Yandex Cloud
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Latent diffusion models offer an attractive alternative to discrete diffusion for non-autoregressive text generation by operating on continuous text representations and denoising entire sequences in parallel. The major challenge in latent diffusion modeling is constructing a suitable latent space. In this work, we present the Latent Diffusion Language Model (LDLM), in which the latent encoder, diffusion model, and decoder are trained jointly. LDLM builds its latent space by reshaping the representations of a pre-trained language model with a trainable encoder, yielding latents that are easy to both denoise and decode into tokens. We show that naive joint training produces a low-quality diffusion model, and propose a simple training recipe consisting of an MSE decoder loss, diffusion-to-encoder warmup, adaptive timestep sampling, and decoder-input noise. Ablations show that each component substantially impacts generation performance. On OpenWebText and LM1B, LDLM achieves better generation performance than existing discrete and continuous diffusion language models while being 2–13× faster, indicating that jointly learning the latent space is a key step toward making latent diffusion competitive for text generation.
[NLP-11] How Value Induction Reshapes LLM Behaviour ACL2026
[Quick Read]: This paper investigates the unintended effects of value induction on conversational LLMs: when models are fine-tuned to strengthen particular values (helpfulness, harmlessness, honesty, etc.), this can shift other related or contrastive values, change safety, or produce more anthropomorphic language that makes models more validating or sycophantic, with potential downsides for users. The key is fine-tuning on curated value subsets of existing preference datasets and systematically measuring the impact of each induced value on the expression of other values, on model safety, and on language style. The findings: (i) inducing one value elicits related, and sometimes contrastive, values; (ii) inducing positive values increases safety; and (iii) all values increase anthropomorphic language use, providing empirical grounding for more controllable and safer value alignment.
Link: https://arxiv.org/abs/2605.07925
Authors: Arnav Arora, Natalie Schluter, Katherine Metcalf, Maartje ter Hoeve
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to Findings of ACL 2026
Abstract:Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility, ensure safety, and improve the experience of the people interacting with the model. However, values are complex and inter-related – inducing one could modify behaviour on another. Further, inducing certain values can make models more addictive or sycophantic through language used in the generations, with a potential detrimental effect on the user. We investigate these and other unintended effects of value induction into models. We fine-tune models using curated value subsets of existing preference datasets, measuring the impact of value induction on expression of other values, model safety, anthropomorphic language, and various QA benchmarks. We find that (i) inducing values leads to expression of other related, and sometimes contrastive values, (ii) inducing positive values increases safety, and (iii) all values increase anthropomorphic language use, making models more validating and sycophantic.
[NLP-12] Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
[Quick Read]: This paper addresses the inefficiency of discrete flow matching (DFM) for text generation, which can require hundreds of forward passes; distillation can accelerate a student model, but poor-quality training trajectories often limit its performance. The core diagnosis is that trajectories are built through chains of blind stochastic jumps with no evaluation of sequence quality, so a bad early decision propagates through subsequent steps and constrains what the student can learn. The key is Trajectory-Shaped Discrete Flow Matching (TS-DFM), which replaces blind jumps with guided navigation: a lightweight energy compass scores and selects candidate continuations at each midpoint, producing high-quality, learnable training trajectories. All shaping is training-only, leaving inference cost unchanged, and the shaped student achieves markedly better few-step generation quality and efficiency.
Link: https://arxiv.org/abs/2605.07924
Authors: Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Manuel R. Ciosici, Yizhe Zhang, Irina Belousova
Affiliations: The Ohio State University; Apple
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
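The midpoint-selection step at the heart of TS-DFM can be sketched abstractly: where a blind trajectory would keep a random candidate, the energy compass keeps the lowest-energy one. The sketch below is a schematic illustration only; the paper's energy compass is a learned lightweight model, whereas here `energy_fn` is an arbitrary scoring callback supplied by the caller.

```python
import numpy as np

def energy_navigate(candidates, energy_fn):
    """At a trajectory midpoint, score each stochastic candidate
    continuation and keep the lowest-energy (most coherent) one;
    a blind jump would instead pick a candidate at random."""
    scores = np.array([energy_fn(c) for c in candidates])
    return candidates[int(np.argmin(scores))]
```

Applied at every midpoint during trajectory construction, this yields the shaped trajectories the student is trained to imitate, while inference-time decoding is untouched.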
[NLP-13] CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers ICML2026
[Quick Read]: This paper addresses a core problem in evaluating AI reviewers: existing metrics favor overlap with human reviews while neglecting correctness and completeness, yet human reviews themselves may miss salient issues or contain mistakes, making them unreliable as gold references. Two key solutions: (i) build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing, strengthening the measurement of completeness; and (ii) leverage expert annotations from reviewer-author-meta-review discussions to filter unreliable reviews, strengthening the reliability of correctness evaluation. The result is CoCoReviewBench, a benchmark of 3,900 ICLR and NeurIPS papers enabling reliable, fine-grained evaluation of AI reviewers. Analysis shows current AI reviewers remain limited in correctness and are especially prone to hallucination, and highlights reasoning models as more promising, pointing to directions for improvement.
Link: https://arxiv.org/abs/2605.07905
Authors: Hexuan Deng, Xiaopeng Ke, Yichen Li, Ruina Hu, Dehao Huang, Derek F. Wong, Yue Wang, Xuebo Liu, Min Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026
Abstract:Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer–author–meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at this https URL.
[NLP-14] Beyond “I cannot fulfill this request”: Alleviating Rigid Rejection in LLMs via Label Enhancement
[Quick Read]: This paper addresses the “rigid rejection” problem in LLM safety alignment: models tend to refuse all harmful requests with a fixed template (e.g., “I cannot fulfill this request”), severely degrading the naturalness of interaction. The key is LANCE, which performs label enhancement via variational inference, predicting a continuous distribution over multiple rejection categories; these fine-grained distributions provide multi-way textual gradients for a refinement model to neutralize the hazardous elements of a prompt, so the LLM can produce flexible, natural safe responses while maintaining high safety standards.
Link: https://arxiv.org/abs/2605.07883
Authors: Ying Zhang, Congyu Qiao, Xin Geng, Ning Xu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) rely on safety alignment to obey safe requests while refusing harmful ones. However, traditional refusal mechanisms often lead to “rigid rejection,” where a general template (e.g., “I cannot fulfill this request”) indiscriminately triggers refusals and severely undermines the naturalness of interactions between humans and LLMs. To address this issue, LANCE is proposed in this paper to ensure safe yet flexible and natural responses via label enhancement. Specifically, LANCE employs variational inference to perform label enhancement, predicting a continuous distribution across multiple rejection categories. These fine-grained rejection distributions provide multi-way textual gradients for a refinement model to neutralize the hazardous elements in the prompt, so that the LLMs could generate safe responses that avoid rigid rejections while preserving the naturalness of interactions. Experiments demonstrate that LANCE significantly alleviates the rigid rejection problem while maintaining high security standards, significantly outperforming existing baseline models in terms of helpfulness and naturalness of responses.
[NLP-15] KL for a KL: On-Policy Distillation with Control Variate Baseline
[Quick Read]: This paper addresses the training instability of On-Policy Distillation (OPD), caused by the high gradient variance of its single-sample Monte Carlo estimator, particularly in LLM reasoning settings. The key is vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient reinforcement learning and introduces a control-variate baseline (a value function) from the RL literature to reduce gradient variance. The crucial insight is that this value function admits a closed form: the per-token negative reverse KL divergence between student and teacher, available directly from the already-computed forward pass with no extra critic training or inference overhead. Subtracting it as a detached baseline keeps the gradient unbiased while substantially lowering variance, and a top-k approximation of the baseline further cuts cost without compromising performance.
Link: https://arxiv.org/abs/2605.07865
Authors: Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, Yohan Jo
Affiliations: Seoul National University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline-canonically a value function – from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.
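The closed-form baseline described above can be written down directly from the two logit tensors that the distillation step already computes. The sketch below, under the assumption that the single-sample OPD reward is the per-token teacher-student log ratio, computes the variance-reduced advantage; it omits the paper's top-k approximation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def vopd_advantage(student_logits, teacher_logits, sampled_ids):
    """Single-sample OPD reward minus the closed-form control-variate
    baseline. Per token t with sampled id x_t:
      reward_t   = log p_T(x_t) - log p_S(x_t)
      baseline_t = E_{v ~ p_S}[log p_T(v) - log p_S(v)] = -KL(p_S || p_T)
    No extra critic or forward pass is needed: both logit tensors are
    already available from the distillation step."""
    ps = softmax(student_logits)
    log_ps = np.log(ps)
    log_pt = np.log(softmax(teacher_logits))
    t = np.arange(len(sampled_ids))
    reward = log_pt[t, sampled_ids] - log_ps[t, sampled_ids]
    baseline = (ps * (log_pt - log_ps)).sum(axis=-1)
    return reward - baseline
```

Because the baseline equals the expected reward under the student's own distribution, the resulting advantage has zero mean per token, which is what makes the baseline variance-reducing without biasing the policy gradient.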
[NLP-16] MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning
[Quick Read]: This paper addresses two issues in parameter-efficient fine-tuning: Low-Rank Adaptation (LoRA) requires a predefined static rank, making efficiency and performance hard to balance, while existing rank-adaptive methods such as DyLoRA underperform at higher ranks due to inconsistent gradient signals and are data-inefficient. The key is MatryoshkaLoRA, a Matryoshka-inspired training framework that learns hierarchical low-rank representations by inserting a fixed, carefully crafted diagonal matrix P between the existing LoRA adapters to scale their sub-ranks; changing only P recovers standard LoRA or DyLoRA, and the design ensures every sub-rank effectively embeds the available gradient information, supporting dynamic rank selection with minimal loss in accuracy and markedly better accuracy-performance trade-offs across ranks.
Link: https://arxiv.org/abs/2605.07850
Authors: Ionut-Vlad Modoranu, Mher Safaryan, Dan Alistarh
Affiliations: Institute of Science and Technology Austria (ISTA); Lancaster University, UK
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:With the rise in scale for deep learning models to billions of parameters, the computational cost of fine-tuning remains a significant barrier to deployment. While Low-Rank Adaptation (LoRA) has become the standard for parameter-efficient fine-tuning, the need to set a predefined, static rank r requires exhaustive grid searches to balance efficiency and performance. Existing rank-adaptive solutions such as DyLoRA mitigate this by sampling ranks during the training from a predefined distribution. However, they often yield sub-optimal results at higher ranks due to lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data-inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka-inspired training framework for LoRA that learns accurate hierarchical low-rank representations by inserting a fixed, carefully crafted diagonal matrix P between the existing LoRA adapters to scale their sub-ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing P and ensures all sub-ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low-rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low-rank representations than prior rank-adaptive approaches and achieves superior accuracy-performance trade-offs across ranks on the evaluated datasets. Our code is available at this https URL.
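The adapter structure described above, a fixed diagonal P sandwiched between the LoRA factors, makes sub-rank truncation a simple slice. The sketch below shows only the forward computation of the weight update for a chosen sub-rank r; the choice of P and the training recipe are the paper's contribution and are not reproduced here.

```python
import numpy as np

def matryoshka_delta(A, B, p, r):
    """Weight update from the first r sub-ranks of a Matryoshka-style
    adapter: delta_W = B[:, :r] @ diag(p[:r]) @ A[:r, :], where p holds
    the fixed diagonal of P that scales the sub-ranks.
    A: (R, d_in) down-projection, B: (d_out, R) up-projection."""
    return (B[:, :r] * p[:r]) @ A[:r, :]
```

With p all ones and r equal to the full rank R, this reduces to a plain LoRA update; smaller r gives the nested, dynamically selectable sub-rank adapters.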
[NLP-17] Measuring and Mitigating the Distributional Gap Between Real and Simulated User Behaviors
[Quick Read]: This paper asks whether LLM-based user simulators, though able to generate human-like responses, actually capture the broad and heterogeneous distribution of real user behaviors. The key is a method for quantifying the distributional gap between real and simulated user behaviors: extract representations of user behavior from conversations, quantize them into discrete distributions via clustering, and compute divergence metrics between the distributions. Validated through a human study and ablations, the method supports the first systematic evaluation of 24 LLM-based user simulators on coding and writing tasks, revealing large distributional gaps from real users that vary across model families, scales, and behavioral facets, and showing that combining behaviorally complementary simulators brings the resulting distribution closer to real users than either simulator alone.
Link: https://arxiv.org/abs/2605.07847
Authors: Shuhaib Mehri, Philippe Laban, Sumuk Shashidhar, Marwa Abdulhai, Sergey Levine, Michel Galley, Dilek Hakkani-Tür
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As user simulators are increasingly used for interactive training and evaluation of AI assistants, it is essential that they represent the diverse behaviors of real users. While existing works train user simulators to generate human-like responses, whether they capture the broad and heterogeneous distribution of real user behaviors remains an open question. In this work, we introduce a method to measure the distributional gap between real and simulated user behaviors, validated through a human study and ablations. Given a dataset of real and simulated conversations, our method extracts representations of user behavior from each conversation, quantizes them into discrete distributions via clustering, then computes divergence metrics. We provide the first systematic evaluation of 24 LLM-based user simulators on coding and writing tasks, and reveal a large distributional gap from real users that varies across model families, scales, and behavioral facets. Pairwise comparisons show that most simulators behave similarly, while a few stand apart. Combining behaviorally complementary simulators brings the resulting distribution closer to real users compared to either simulator on its own. Finally, a TF-IDF analysis of the clusters surfaces interpretable patterns of behaviors that simulators capture, miss, and hallucinate.
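The cluster-then-diverge step can be sketched once each conversation has been assigned a behavior cluster. The choice of Jensen-Shannon divergence below is an illustrative assumption (the paper says only "divergence metrics"), and the clustering itself is taken as given.

```python
import numpy as np

def behavior_divergence(real_labels, sim_labels, n_clusters):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between the
    cluster-occupancy distributions of real vs. simulated conversations."""
    def hist(labels):
        h = np.bincount(labels, minlength=n_clusters).astype(float)
        return h / h.sum()

    def kl(a, b):
        mask = a > 0
        return float((a[mask] * np.log2(a[mask] / b[mask])).sum())

    p, q = hist(real_labels), hist(sim_labels)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A divergence of 0 means the simulator occupies behavior clusters in the same proportions as real users; 1 means the two sets of behaviors are entirely disjoint.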
[NLP-18] SCENE: Recognizing Social Norms and Sanctioning in Group Chats
[Quick Read]: This paper addresses the largely unexplored question of whether LLM agents can recognize and adapt to the implicit social norms of multi-party online group chats, where norm violations are typically met with social sanctioning from the group. The key is SCENE, a benchmark that generates plausible non-roleplay scenarios with scripted personas following a hidden norm, creates opportunities for the subject agent to violate it, and sanctions breaches when they occur, together with behavioral metrics for two adaptation abilities: responsiveness to negative sanctioning, and adapting norms from peer behavior. This provides a quantifiable framework for dynamic, interactional evaluation of LLM social adaptability.
Link: https://arxiv.org/abs/2605.07823
Authors: Mateusz Jacniacki, Maksymilian Bilski
Affiliations: Humalike
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Online group chats are social spaces with implicit behavior patterns that, when broken, are often met with social sanctioning from the group. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce SCENE, a social-interaction benchmark focused on implicit norms and social sanctioning in multi-party chat. SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur. We further propose behavioral evaluation metrics for two functional adaptation abilities: responsiveness to negative sanctioning, and adapting norm from peers behavior. We evaluate six frontier and open-weight models on SCENE. Our results show that Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the evaluated open-weight models. SCENE contributes one benchmark in the direction of recent calls for dynamic, interactional evaluation of LLM social capabilities.
[NLP-19] GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
[Quick Read]: This paper addresses the lack of active perception in current vision-language models (VLMs): visual information is processed passively through static attention over massive accumulated token contexts, diluting spatial reasoning and inducing linguistic hallucinations. The key is GazeVLM, which internalizes a metacognitive control mechanism directly into the reasoning loop: by autonomously generating gaze tokens (LOOK), the model exerts top-down control over its own causal attention mask, mimicking the human switch between foveal focus and peripheral awareness. During local reasoning, a suppression bias dampens irrelevant visual features (spatial selective attention); once local reasoning concludes, the bias lifts, restoring the global view. This enables high-resolution multimodal reasoning without external agentic tools such as cropping or inflated context windows.
Link: https://arxiv.org/abs/2605.07817
Authors: Brown Ebouky, Gabriele Carrino, Niccolo Avogaro, Christoph Studer, Andrea Bartezzaghi, Mattia Rigotti
Affiliations: IBM Research; ETH Zurich; TU Wien
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens (LOOK), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.
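The suppression-bias mechanism can be illustrated on a single attention row: a finite negative bias is added to the logits of visual tokens outside the gaze region, which dampens rather than hard-masks them, and setting the bias back to zero restores the global view. This is a toy sketch; the bias magnitude and the per-head details are assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gazed_attention(scores, in_gaze, bias=-4.0):
    """Add a finite suppression bias to the attention logits of visual
    tokens outside the current gaze region, then renormalize; lifting
    the bias (bias=0.0) restores uniform treatment of all tokens."""
    return softmax(scores + np.where(in_gaze, 0.0, bias))
```

Because the bias is finite, peripheral tokens keep nonzero weight, preserving a degree of global awareness even while focus is concentrated.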
[NLP-20] OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
【速读】: 该论文旨在解决现有优化算法(如Muon)在神经网络训练中因全局学习率控制各层更新幅度而导致的层间适应性不足问题,进而影响模型收敛效率与性能。其解决方案的关键在于提出OrScale,一种基于信任比(trust ratio)机制的改进方法:通过将分母设计为实际参数空间方向的Frobenius范数来精确衡量每层更新的有效性,从而实现更精准的层自适应调整;同时结合一次性的层级校准和耦合权重衰减策略,确保初始信任比为1并避免传统混合方案中的形状退化、动量裁剪饱和及权重衰减失控等问题。该方法在理论上提供O(1/√T)的非凸收敛保证,并在实践中显著提升图像分类(CIFAR-10/DavidNet)和语言模型预训练(FineWeb-Edu)的性能表现。
链接: https://arxiv.org/abs/2605.07815
作者: Yuxuan Lou,Yang You
机构: National University of Singapore(新加坡国立大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer’s update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion, a strict layer-adaptive descent gain under measurable layer heterogeneity, and calibration properties that preserve muP-style learning-rate transfer at initialization. Empirically, OrScale ranks first on CIFAR-10/DavidNet across three seeds, improving Muon from 93.70% to 94.05% validation top-1, and OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters while outperforming AdamW at every scale.
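按照摘要的描述,OrScale 的核心规则是:信任比的分母应取"实际将被应用的参数空间方向"(正交化动量加上耦合权重衰减项)的 Frobenius 范数。下面给出一个基于 NumPy 的最小示意,其中正交化方式(SVD 而非论文可能使用的迭代近似)、学习率与衰减系数均为假设,并非论文原始实现:

```python
import numpy as np

def orthogonalize(momentum):
    # Muon 风格的正交化示意:SVD 后将奇异值全部置 1
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    return u @ vt

def orscale_step(param, momentum, lr=0.02, weight_decay=0.01):
    # 分母是"实际更新方向"的 Frobenius 范数(np.linalg.norm 对矩阵默认即 Frobenius 范数),
    # 权重衰减以耦合方式并入方向,避免摘要中提到的解耦衰减失控问题
    direction = orthogonalize(momentum) + weight_decay * param
    trust = np.linalg.norm(param) / (np.linalg.norm(direction) + 1e-12)
    return param - lr * trust * direction
```

若再按论文的"一次性逐层校准"将初始信任比归一为 1,则首步更新强度与校准前的 Muon 基线一致,后续各层随自身范数比例自适应变化。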
[NLP-21] A Comparative Analysis of Classical Machine Learning and Deep Learning Approaches for Sentiment Classification on IMDb Movie Reviews
【速读】: 该论文旨在解决情感分类(Sentiment Classification)任务中经典机器学习与深度学习方法的性能比较问题,特别是在IMDb电影评论数据集上的应用效果差异。其解决方案的关键在于:一方面采用TF-IDF特征提取结合PyCaret AutoML自动化选择逻辑回归、朴素贝叶斯和支持向量机(SVM)等经典模型;另一方面构建双向长短期记忆网络(BiLSTM)及其带注意力机制的变体进行对比实验。结果表明,在有限数据和计算资源条件下,经过有效特征工程的经典机器学习方法(尤其是SVM)表现优于深度学习模型,凸显了特征表示质量对模型性能的重要影响。
链接: https://arxiv.org/abs/2605.07811
作者: Erma Daniar Safitri,Lia Hana Ichisasmita,Citra Agustin,Luluk Muthoharoh,Ardika Satria,Martin Clinton Tosima Manullang
机构: Institut Teknologi Sumatera (苏门答腊理工学院)
类目: Computation and Language (cs.CL)
备注: 10 pages, 4 authors from Department of Data Science and 2 authors from Department of Informatics Engineering, Institut Teknologi Sumatera, Indonesia
Abstract:This paper presents a comparative study of classical machine learning and deep learning methods for sentiment classification on the IMDb movie reviews dataset. The machine learning pipeline uses TF-IDF features and PyCaret AutoML to evaluate Logistic Regression, Naïve Bayes, and Support Vector Machine, while the deep learning pipeline implements BiLSTM and BiLSTM with an attention mechanism. Experimental results show that classical machine learning, especially SVM, achieves the best performance with an accuracy of 0.8530, outperforming the deep learning models in this study. The BiLSTM with Attention model improves over the standard BiLSTM and reaches an accuracy of 0.706, indicating better contextual modeling. The paper concludes that although deep learning can capture sequential dependencies, classical machine learning remains a strong baseline when combined with effective feature engineering such as TF-IDF, particularly under limited data and computational resources.
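摘要中的经典管线以 TF–IDF 特征为基础。下面用纯 Python 给出 TF–IDF 权重的最小计算示意(未加平滑;sklearn 等库的实现默认使用 log((1+n)/(1+df))+1 的平滑 IDF,数值会有差异):

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: 分词后的文档列表;返回每篇文档的 {词: TF-IDF 权重}
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # 文档频率
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: (c / len(d)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights
```

出现在所有文档中的词 IDF 为 0,被自动压制;之后即可将权重向量送入逻辑回归、朴素贝叶斯或 SVM 等分类器。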
[NLP-22] Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在关键应用场景中缺乏可靠自我评估能力的问题。当前依赖单一信心(confidence)指标的自评估方法存在一致性差和过度乐观的缺陷,难以准确预测模型错误。其解决方案的关键在于引入基于认知评价理论(cognitive appraisal theory)的多维自评估框架,通过提取六种与评价相关的维度(如努力、能力等)来替代或补充传统信心指标,并实证表明:以努力和能力为代表的胜任力相关维度在多数任务和模型规模下均优于或等效于信心指标,且努力维度具有更低的过度乐观倾向和更强的稳定性;此外,不同任务特性下最优评估维度呈现系统性差异,例如推理密集型任务中努力维度预测力最强,而检索导向任务中能力与信心更优。这一结构化的多维自评估方法为提升LLM部署的可靠性与安全性提供了新路径。
链接: https://arxiv.org/abs/2605.07806
作者: Sree Bhattacharyya,Samarth Khanna,Leona Chen,Lucas Craig,Tharun Dilliraj,James Z. Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been shown to be an inconsistent and overoptimistic predictor of model correctness. Drawing on cognitive appraisal theory, a framework from human psychology that decomposes self-evaluation into multiple components, we propose a multidimensional perspective on model self-assessment. We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks spanning eight domains. We find that competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence across most settings. Effort additionally yields less overoptimistic estimates that remain stable across model sizes. In contrast, affective dimensions provide marginally predictive signals. Furthermore, the most informative dimension varies systematically with task characteristics: effort is most predictive for reasoning-intensive tasks, while ability and confidence dominate on retrieval-oriented tasks. Broadly, our findings indicate that structured multidimensional self-assessment is a promising approach to improving the reliability and safety of language model deployment across diverse real-world settings.
[NLP-23] PolySQL: Scaling Text-to-SQL Evaluation Across SQL Dialects via Automated Backend Isomorphism
【速读】: 该论文旨在解决当前文本到SQL(Text-to-SQL)基准测试主要局限于SQLite方言所导致的跨方言评估缺失问题,这使得模型在SQLite上的性能表现难以可靠地推广至其他数据库引擎。其关键解决方案是提出PolySQL,一种新颖的双执行方法,通过比较归一化后的执行结果来消除对昂贵且易错的查询转换(query transpilation)的需求,从而实现高保真度、全覆盖的跨方言评估。
链接: https://arxiv.org/abs/2605.07796
作者: Yotam Perlitz,Elad Venezian,Corentin Royer,Francesco Fusco,Andrea Giovannini
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:SQL dialects vary in syntax, types, and functions across database engines. Text-to-SQL benchmarks, however, predominantly support only SQLite. This creates a critical evaluation gap: cross-dialect evaluation reveals weak per-query agreement (Cohen’s κ), showing that SQLite performance is an unreliable proxy for other dialects. Yet such evaluation remains prohibitively difficult: existing approaches either require expensive manual query transpilation or rely on tools that often fail on complex SQL. To close this gap, we introduce PolySQL, a novel dual-execution method that eliminates the need for query transpilation by comparing normalized execution results. Notably, our approach achieves higher evaluation fidelity than query transpilation with 100% query coverage. PolySQL comprises three datasets, enabling the first large-scale cross-dialect study. Our study reveals a 10.1% average accuracy drop from SQLite to other dialects and identifies a significant dialect difficulty hierarchy. We find this degradation stems from logical rather than syntactic errors (61% vs. 8%). We release our framework code and leaderboard to enable rigorous dialect-robust evaluation.
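PolySQL 的双执行思路是比较"归一化后的执行结果"而非转译查询。论文摘要未给出具体归一化规则,下面是一个假设性的最小示意:数值统一为浮点并舍入以抹平 int/decimal 等类型差异,字符串去空白,行排序以忽略无 ORDER BY 时的行序差异:

```python
def normalize(rows):
    # rows: 任一数据库引擎返回的行(元组)列表
    def norm_cell(v):
        if isinstance(v, (bool, int, float)):
            return round(float(v), 6)   # 统一数值类型与精度
        if isinstance(v, str):
            return v.strip()            # 抹平方言间的空白差异
        return v
    return sorted(tuple(norm_cell(c) for c in row) for row in rows)

def results_match(rows_a, rows_b):
    # 两个引擎各自执行同一语义的查询后,比较归一化结果
    return normalize(rows_a) == normalize(rows_b)
```

这样即便 SQLite 与另一方言的引擎返回类型不同(如 1 与 1.0),只要结果集语义一致即判为等价。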
[NLP-24] Hybrid TF–IDF Logistic Regression and MLP Neural Baseline for Indonesian Three-Class Sentiment Analysis on Social Media Text
【速读】: 该论文旨在解决印度尼西亚社交媒体文本的三分类情感分析问题(three-class sentiment analysis),其核心挑战在于小规模、不平衡的数据集下如何实现高精度且可部署的模型。解决方案的关键在于构建一个轻量级但高效的实践基线:通过TF-IDF文本特征与三种数值型元数据特征的融合,结合类平衡的多项逻辑回归(Multinomial Logistic Regression)分类器,在不依赖复杂神经网络的前提下实现了良好的性能表现(准确率0.8028,加权F1为0.8003)。研究进一步表明,对于小型语料库而言,严谨的数据预处理、可解释的特征工程和类别平衡策略仍具有显著优势,而神经网络基线(两层MLP)虽在实验中表现更优,但更适合作为对比基准而非生产部署首选。
链接: https://arxiv.org/abs/2605.07793
作者: Allya Nurul Islami Pasha,Eka Fidiya Putri,Luluk Muthoharoh,Ardika Satria,Martin C.T. Manullang
机构: Institut Teknologi Sumatera (苏马特拉理工学院)
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 figures, 4 tables. Research paper on Indonesian three-class sentiment analysis using TF–IDF, Logistic Regression, and MLP baselines
Abstract:This paper presents a compact three-class sentiment analysis study for Indonesian social media text. The task is formulated with positive, negative, and neutral outputs derived from a fine-grained emotion dataset. The proposed practical baseline combines TF–IDF text features, three lightweight numeric metadata features, and a balanced multinomial Logistic Regression classifier. For comparison, the study also includes a neural baseline using a two-layer multilayer perceptron (MLP) over the same hybrid feature representation. The dataset originally contains 732 rows and 191 fine-grained emotion labels; after cleaning, deduplication, and label remapping, 707 samples remain with an imbalanced distribution of 459 positive, 188 negative, and 60 neutral instances. Experimental results show that the Logistic Regression deployment model reaches 0.8028 accuracy, 0.8003 weighted F1, and 0.7276 macro F1, while project documentation reports a higher-accuracy but non-production MLP baseline. These findings indicate that careful preprocessing, interpretable feature engineering, and class balancing remain competitive for small Indonesian sentiment datasets, whereas the neural baseline is better treated as a comparative experiment than as the default deployment model.
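摘要提到在 459/188/60 的不平衡分布上使用"类平衡的多项逻辑回归"。类平衡的一种常见实现是 sklearn 风格的 class_weight="balanced" 规则,即 w_c = n_samples / (n_classes × n_c),示意如下:

```python
from collections import Counter

def balanced_class_weights(labels):
    # sklearn 的 "balanced" 启发式:样本总数 / (类别数 × 该类样本数)
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}
```

按摘要中 707 条样本(正 459 / 负 188 / 中性 60)计算,中性类权重约 3.93,约为正类权重(约 0.51)的 7.7 倍,使少数类在损失中获得与多数类相当的总影响。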
[NLP-25] Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
【速读】: 该论文旨在解决在资源受限场景下部署大语言模型(Large Language Models, LLMs)时面临的高成本问题,以及传统知识蒸馏方法在不同目标模型规模下需反复访问大教师模型导致的可扩展性差的问题。其解决方案的关键在于提出一种**链式蒸馏(Chain-based Distillation, CBD)**范式:通过逐步蒸馏构建一组稀疏且有限的中间模型(称为锚点,anchors),形成一个从源LLM逐步传递知识的蒸馏链;同时引入**桥接蒸馏(bridge distillation)**机制以支持跨架构和跨词汇表的知识迁移;最终通过相邻锚点间的参数插值初始化不同尺寸的模型,避免重复的大教师模型推理,显著提升效率与下游任务性能。
链接: https://arxiv.org/abs/2605.07783
作者: Boyu Shi,YiCheng Jiang,Chang Liu,Qiufeng Wang,Xu Yang,Xin Geng
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) achieve strong performance but remain costly to deploy in resource-constrained settings. Training small language models (SLMs) from scratch is computationally expensive, while conventional knowledge distillation requires repeated access to large teachers for different target sizes, leading to poor scalability. To solve these problems, we propose Chain-based Distillation (CBD), a scalable paradigm for efficiently initializing variable-sized language models. A sparse and limited sequence of intermediate models (called anchors) is constructed via stepwise distillation, forming a distillation chain that progressively transfers knowledge from the source LLMs. To support heterogeneous settings, we introduce bridge distillation for cross-architecture and cross-vocabulary transfer. Models of variable sizes are initialized via parameter interpolation between adjacent anchors, eliminating repeated large teacher inference. Experiments show that the proposed method substantially improves efficiency and downstream performance. A 138M-parameter SLM without recovery pre-training outperforms scratch-trained models on a 10B-token corpus on the specific task. CBD also demonstrates versatility in heterogeneous settings for initializing models with different architectures and vocabularies.
[NLP-26] CktFormalizer: Autoformalization of Natural Language into Circuit Representations
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成硬件描述语言(Hardware Description Language, HDL)时常见的结构性缺陷问题,如位宽不匹配、组合环路和不完整的case逻辑等,这些问题虽通过语法检查但会导致综合或硅片实现失败。解决方案的关键在于引入CktFormalizer框架,其核心是将LLM驱动的硬件生成过程引导至一个嵌入在Lean 4中的依赖类型化HDL中:首先,依赖类型作为编译时类型检查器,将硬件缺陷转化为编译错误并支持迭代修复;其次,作为正确性防火墙,确保编译设计结构上无导致后端隐式失败的缺陷(相比基线方法保留全部可合成设计);最后,利用Lean 4的定理证明机制构建针对任意输入序列和参数化位宽的机器可验证等价性证明,超越了基于有界SMT检查的局限性,从而显著提升设计的后端可实现性和功能正确性保障。
链接: https://arxiv.org/abs/2605.07782
作者: Jing Xiong,Qi Han,Chenchen Ding,He Xiao,Zunhai Su,Chaofan Tao,Ngai Wong
机构: The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL); Programming Languages (cs.PL)
备注:
Abstract:LLMs can generate hardware descriptions from natural language specifications, but the resulting Verilog often contains width mismatches, combinational loops, and incomplete case logic that pass syntax checks yet fail in synthesis or silicon. We present CktFormalizer, a framework that redirects LLM-driven hardware generation through a dependently-typed HDL embedded in Lean 4. Lean serves three roles: (i) type checker:dependent types encode bit-width constraints, case coverage, and acyclicity, turning hardware defects into compile-time errors that guide iterative repair; (ii) correctness firewall:compiled designs are structurally free of defects that cause silent backend failures (the baseline loses 20% of correct designs during synthesis and routing; CktFormalizer preserves all of them); (iii) proof assistant:the agent constructs machine-checked equivalence proofs over arbitrary input sequences and parameterized widths, beyond the reach of bounded SMT-based checking. On VerilogEval (156 problems), RTLLM (50 problems), and ResBench (56 problems), CktFormalizer achieves simulation pass rates competitive with direct Verilog generation while delivering substantially higher backend realizability: 95–100% of compiled designs complete the full synthesis, place-and-route, DRC, and LVS flow. A closed-loop PPA optimization stage yields up to 35% area reduction and 30% power reduction through validated architecture exploration, with automated theorem proof ensuring that each optimized variant remains functionally equivalent to its formal specification.
[NLP-27] Tracing Uncertainty in Language Model "Reasoning"
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在“推理”过程中决策机制不清晰的问题,尤其是如何理解链式思维(Chain-of-Thought)或测试时扩展(test-time scaling)提升基准性能的内在动态。其解决方案的关键在于将推理过程中的中间 token 序列视为演化中的模型状态,并通过不确定性量化(uncertainty quantification)构建“不确定性轨迹轮廓”(uncertainty trace profile)——即用一组特征描述不确定性信号随生成步骤变化的形状(如斜率和线性度)。该方法能有效预测最终答案是否正确(在 GSM8K 和 ProntoQA 数据集上 AUC 最高达 0.807),且仅需前几百个 token 即可实现高精度早期错误检测,揭示了正确与错误推理轨迹在不确定性下降趋势上的本质差异:正确轨迹具有更陡峭、非线性的不确定性衰减模式。
链接: https://arxiv.org/abs/2605.07776
作者: Nils Grünefeld,Bertram Højer,Philipp Mondorf,Barbara Plank,Anna Rogers,Christian Hardmeier,Stefan Heinrich,Jes Frellsen
机构: IT University of Copenhagen (哥本哈根信息技术大学); LMU Munich (慕尼黑路德维希-马克西米利安大学); Technical University of Denmark (丹麦技术大学); Pioneer Centre for Artificial Intelligence (人工智能先锋中心); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Language model (LM) “reasoning”, commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the dynamics underlying this process remain poorly understood. We study these dynamics through the lens of uncertainty quantification by treating the “reasoning” traces, the intermediate token sequences generated by LMs, as evolving model states. We summarize each trace by an uncertainty trace profile: a small set of features describing the shape of the uncertainty signal over its trace, such as its slope and linearity. We find that across five LMs evaluated on GSM8K and ProntoQA, these profiles predict whether a trace yields a correct final answer with AUROC up to 0.807, improving markedly on recent related work. We reach AUROC 0.801 using only the first few hundred tokens of full traces, suggesting that errors can be detected early in the generation. A detailed comparison of correct and incorrect traces further reveals qualitatively distinct uncertainty profiles, with correct traces showing a steeper and less linear decline in uncertainty. Together, the results suggest that our method, grounded in decision-making under uncertainty, provides a principled lens for studying the generative process underlying LM “reasoning”.
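摘要将每条推理轨迹概括为"不确定性轨迹轮廓",即刻画不确定性信号形状的少量特征,如斜率与线性度。下面是一个示意实现,以线性拟合的斜率与拟合优度 R² 作为这两个特征(具体特征集合与计算方式为假设,并非论文原始定义):

```python
import numpy as np

def trace_profile(uncertainty):
    # uncertainty: 沿生成步的不确定性序列(例如逐 token 的熵)
    u = np.asarray(uncertainty, dtype=float)
    t = np.arange(len(u), dtype=float)
    slope, intercept = np.polyfit(t, u, 1)        # 线性拟合 u_t ≈ a·t + b
    pred = slope * t + intercept
    ss_res = float(np.sum((u - pred) ** 2))
    ss_tot = float(np.sum((u - u.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return {"slope": float(slope), "linearity": r2}
```

按论文的发现,正确轨迹的不确定性下降更陡峭且更非线性,对应更负的 slope 与更低的 linearity;仅用轨迹前几百个 token 计算这些特征即可提前预测最终答案的对错。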
[NLP-28] Rethinking State Tracking in Recurrent Models Through Error Control Dynamics
【速读】: 该论文旨在解决递归架构中状态追踪(state tracking)的鲁棒性问题,即现有模型在实际应用中为何难以长期保持对符号状态的准确区分。传统研究主要关注模型的表达能力(expressive capacity),而本文指出,误差控制(error control)同样关键——它决定了隐藏状态在区分不同符号状态的方向上是否能有效抑制漂移。解决方案的关键在于揭示:仿射递归网络(affine recurrent networks,包括状态空间模型和线性注意力机制)一旦保持状态表示不变,便无法纠正沿状态分离子空间(state-separating subspaces)累积的误差;因此,实践中这些模型学习到的是有限时域内的近似解,而非全局稳定的追踪策略。论文进一步通过实证发现,状态可读性失效的临界点由“可分辨比”(distinguishability ratio)与解码器可读阈值的交叉决定,从而为预测追踪失败时间提供了可量化指标。
链接: https://arxiv.org/abs/2605.07755
作者: Jiwan Chung,Heechan Choi,Seon Joo Kim
机构: Yonsei University (延世大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:The theory of state tracking in recurrent architectures has predominantly focused on expressive capacity: whether a fixed architecture can theoretically realize a set of symbolic transition rules. We argue that equally important is error control, the dynamics governing hidden-state drift along the directions that distinguish symbolic states. We prove that affine recurrent networks, a class of models encompassing State-Space Models and Linear Attention, cannot correct errors along state-separating subspaces once they preserve state representations. Consequently, practical affine trackers do not learn robust state tracking; rather, they learn finite horizon solutions governed by accumulated state-relevant error. We characterize the mechanics of this failure, showing that tracking remains readable only while the accumulating within-class spread remains small relative to the initial between-class separation. We demonstrate empirically on group state-tracking tasks that this breakdown is predictable: tracking collapses when the distinguishability ratio crosses the readability threshold of the trained decoder. Across trained models, the point of this crossing predicts the horizon at which downstream accuracy fails. These results establish that robust state tracking is determined not only by an architecture’s theoretical expressivity but crucially by its error control.
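摘要中的"可分辨比"衡量类间初始分离相对于类内累积扩散的大小,只要该比值高于解码器的可读阈值,状态追踪就仍可读出。其确切定义摘要未给出,下面是一个假设性的示意:类中心间的平均距离除以样本到各自类中心的平均距离:

```python
import numpy as np

def distinguishability_ratio(states, labels):
    # states: 隐藏状态 (N, d);labels: 对应的符号状态标签
    states = np.asarray(states, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centers = {c: states[labels == c].mean(axis=0) for c in classes}
    # 类间:各类中心两两距离的均值(between-class separation)
    between = np.mean([np.linalg.norm(centers[a] - centers[b])
                       for i, a in enumerate(classes) for b in classes[i + 1:]])
    # 类内:样本到其类中心距离的均值(within-class spread)
    within = np.mean([np.linalg.norm(s - centers[c])
                      for s, c in zip(states, labels)])
    return float(between / (within + 1e-12))
```

随误差沿状态分离子空间累积,within 增大、该比值下降;论文表明比值跌破阈值的时刻即可预测下游准确率崩溃的时域。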
[NLP-29] TextLDM: Language Modeling with Continuous Latent Diffusion
【速读】: 该论文旨在解决如何将视觉领域的扩散Transformer(Diffusion Transformer, DiT)架构有效迁移至语言建模任务,从而实现生成(如文本合成)与理解(如文本生成)的统一框架。其核心挑战在于构建高质量的连续文本表示,以支持有效的条件去噪过程。解决方案的关键在于提出TextLDM模型:首先通过基于Transformer的变分自编码器(VAE)将离散token映射为连续潜变量,并引入Representation Alignment (REPA)机制,利用冻结的预训练语言模型对齐潜空间特征,从而提升潜在表示的质量;随后在该潜空间中使用与视觉DiT完全相同的结构进行流匹配(flow matching)。实验证明,仅靠重建保真度不足以保证下游生成质量,而REPA机制是确保高质文本生成的关键因素。
链接: https://arxiv.org/abs/2605.07748
作者: Jiaxiu Jiang,Jingjing Ren,Wenbo Li,Bo Wang,Haoze Sun,Yijun Yang,Jianhui Liu,Yanbing Zhang,Shenghe Zheng,Yuan Zhang,Haoyang Huang,Nan Duan,Wangmeng Zuo
机构: Joy Future Academy; HIT (哈尔滨工业大学); HKUST(GZ) (香港科技大学(广州))
类目: Computation and Language (cs.CL)
备注:
Abstract:Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
[NLP-30] Benchmarking EngGPT2MoE-16B-A3B against Comparable Italian and International Open-source LLMs
【速读】: 该论文旨在解决当前本土化大型语言模型(Large Language Models, LLMs)在国际基准测试中性能不足的问题,尤其聚焦于意大利语场景下的模型表现。解决方案的关键在于开发并评估一个基于混合专家(Mixture of Experts, MoE)架构的16B参数模型EngGPT2MoE-16B-A3B,其通过激活3B参数实现高效计算,并在多个国际与本地基准测试中展现出优于同类意大利模型(如FastwebMIIA-7B、Minerva-7B等)的综合性能,同时在长上下文任务(32k)上达到最优表现,验证了MoE架构在资源优化与性能提升之间的良好平衡,推动了原生意大利语大模型的发展。
链接: https://arxiv.org/abs/2605.07731
作者: Andrea Sassella,Andrea Chizzola,Tommaso Bianchi,Luca Alessandrelli,Mark James Carman
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared against comparably-sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower on BFCL and some ARC and ITALIC settings. Finally it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and the RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings find the new model to be a step forward for native Italian Large Language Models.
[NLP-31] SOD: Step-wise On-policy Distillation for Small Language Model Agents
【速读】: 该论文旨在解决小语言模型在执行工具集成推理(Tool-integrated Reasoning, TIR)时面临的两大挑战:一是长程工具交互中的不稳定性,二是由于模型容量有限导致的性能瓶颈。现有方法如基于策略梯度的强化学习仅能提供稀疏的结果级奖励;近期流行的在线策略蒸馏(On-Policy Distillation, OPD)虽可提供密集的token级监督信号,但实验表明其在TIR场景下会引发关键失败模式——错误的工具调用会在后续推理步骤中逐级传播,导致学生模型与教师模型之间的偏差不断放大,从而使教师提供的token级监督逐渐失效。为应对这一问题,作者提出SOD(Step-wise On-Policy Distillation),其核心创新在于引入一种基于步级差异的自适应重加权机制,在每一步动态调整蒸馏强度:在高偏差区域降低教师信号权重以抑制误导性指导,而在对齐状态保留密集引导。该设计有效缓解了OPD在复杂推理任务中的不稳定性问题,显著提升了小模型在数学、科学和代码等基准上的表现,甚至使0.6B参数的学生模型在AIME 2025上达到26.13%准确率,证明了轻量级模型也能高效迁移复杂代理推理能力。
链接: https://arxiv.org/abs/2605.07725
作者: Qiyong Zhong,Mao Zheng,Mingyang Song,Xin Lin,Jie Sun,Houcheng Jiang,Xiang Wang,Junfeng Fang
机构: Zhejiang University (浙江大学); Tencent (腾讯); University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. While reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher’s token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. Therefore, SOD can attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at this https URL.
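SOD 按步级差异自适应地重加权蒸馏强度。论文的具体加权函数摘要未给出,下面以指数衰减 w_i = exp(−β·KL_i) 作为一个假设性示意:步级偏差越大,该步的教师监督权重越低:

```python
import math

def step_weights(step_kls, beta=1.0):
    # step_kls: 每个推理步上学生-教师的散度(如该步 token KL 的均值)
    # 假设性的加权形式:高偏差步按指数衰减降权
    return [math.exp(-beta * kl) for kl in step_kls]

def weighted_distill_loss(step_kls, beta=1.0):
    # 良好对齐的步保留接近 1 的密集监督权重,高偏差步的误导性信号被抑制
    ws = step_weights(step_kls, beta)
    return sum(w * kl for w, kl in zip(ws, step_kls)) / len(step_kls)
```

与逐 token 统一加权的标准 OPD 相比,这种按步衰减可以阻断错误工具调用之后被放大的教师-学生偏差继续主导损失。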
[NLP-32] Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
【速读】: 该论文旨在解决循环式大语言模型(Recurrent LLM)在多步推理过程中因KV缓存(Key-Value Cache)随推理深度线性增长而导致内存消耗急剧上升的问题,从而限制了此类架构的实际可扩展性。解决方案的关键在于提出Memory-Efficient Looped Transformer (MELT) 架构,其核心创新是将每层的KV缓存从每个推理循环中独立存储改为共享使用,并通过一个可学习的门控机制动态更新该缓存,从而实现推理深度与内存占用的解耦。这一设计使得MELT能够在保持与LoopLM相当性能的同时,实现恒定内存开销的迭代推理,且仅需轻量级的后训练流程即可完成迁移优化。
链接: https://arxiv.org/abs/2605.07721
作者: Victor Conchello Vendrell,Arnau Padres Masdemont,Niccolò Grillo,Jordi Ros-Giralt,Arash Behboodi,Fabio Valerio Massoli
机构: Qualcomm AI Research (高通人工智能研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 5 figures, 11 tables
Abstract:Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro’s. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
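MELT 每层只维护一份跨推理循环共享的 KV 缓存,并通过可学习门控更新。一个最小的更新示意如下(门控的粒度与参数化方式为假设,实际模型中门控可逐头或逐通道学习):

```python
import numpy as np

def update_shared_cache(cache, new_kv, gate_logit):
    # cache / new_kv: 形状相同的 KV 张量;门控 g ∈ (0,1) 决定写入新信息的比例
    g = 1.0 / (1.0 + np.exp(-gate_logit))       # sigmoid
    return g * new_kv + (1.0 - g) * cache
```

关键在于缓存形状不随循环次数增长:无论迭代多少次,每层始终只有一份 KV 缓存,内存开销因而与推理深度解耦。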
[NLP-33] SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
【速读】: 该论文旨在解决同策略蒸馏(On-policy Distillation, OPD)在面对异构分词器(heterogeneous tokenizers)时的性能下降问题。传统OPD假设教师模型与学生模型在token级别上具有可比性,但当两者使用不同分词策略时,这种假设失效,导致大量教师信号在词汇不一致的位置被忽略,从而削弱了蒸馏效果。解决方案的关键在于提出简单跨分词器同策略蒸馏(Simple Cross-Tokenizer OPD, SimCT),其核心思想是扩展监督空间:除了共享token外,还引入短序列多token连续片段作为监督单元,这些片段能被两个分词器共同识别,从而恢复因精确token匹配而丢失的教师信号。SimCT保持原有OPD损失形式不变,同时证明此类多token单元是最细粒度的联合可分词监督接口,优于粗粒度替代方案。实验表明,该方法在数学推理和代码生成任务中显著优于基于共享词汇的OPD及现有跨分词器基线。
链接: https://arxiv.org/abs/2605.07711
作者: Jie Sun,Mao Zheng,Mingyang Song,Qiyong Zhong,Yilin Cheng,Bichuan Feng,Pengfei Liu,Junfeng Fang,Xiang Wang
机构: University of Science and Technology of China (中国科学技术大学); Large Language Model Department, Tencent (腾讯大语言模型部门); Shanghai Innovation Institute (上海创新研究院); Zhongguancun Academy (中关村学院)
类目: Computation and Language (cs.CL)
备注: 4 figures, 6 tables, 28 pages
Abstract:On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose Simple Cross-Tokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher-student distinctions that are useful for on-policy learning. Across three heterogeneous teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared-token matching. Code is available at this https URL.
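SimCT 的监督单元是"两个分词器都能实现的短多 token 连续片段",其最细粒度形式可由两套分词共享的字符边界得到。下面是一个简化示意(忽略真实分词器中的字节级前缀、特殊符号等细节,属假设性实现):

```python
def joint_units(tokens_a, tokens_b):
    # 两套分词对同一文本的切分;双方共有的字符偏移即联合可分词边界,
    # 相邻共享边界之间的片段就是最细粒度的联合监督单元
    def boundaries(tokens):
        cuts, pos = {0}, 0
        for t in tokens:
            pos += len(t)
            cuts.add(pos)
        return cuts

    text = "".join(tokens_a)
    assert text == "".join(tokens_b), "两套分词必须覆盖同一文本"
    cuts = sorted(boundaries(tokens_a) & boundaries(tokens_b))
    return [text[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]
```

在词表一致的位置,单元退化为共享 token;在词表分歧处,单元扩展为多 token 连续片段,从而恢复被精确 token 匹配丢弃的监督信号。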
[NLP-34] Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models ICLR2026
【速读】: 该论文旨在解决扩散生成模型中Classifier-Free Guidance (CFG) 的引导尺度(guidance scale)通常被当作固定超参数使用所带来的控制能力与生成质量之间权衡不佳的问题,尤其是在自然语言处理(NLP)领域,不同任务和扩散过程的不同阶段对最优引导强度的需求存在差异。其解决方案的关键在于将引导尺度的选择建模为一个序列决策问题,通过强化学习(具体采用近端策略优化,PPO)来学习动态的引导轨迹:在每一步生成过程中,根据当前扩散状态选择离散的引导尺度作为控制动作,并基于任务级奖励信号优化策略。实验表明,这种自适应引导机制在多个受控NLP生成任务中显著优于固定尺度策略,且学习到的引导轨迹具有任务特异性与可解释性,验证了将引导视为动态控制过程而非静态设计的必要性。
链接: https://arxiv.org/abs/2605.07701
作者: Fan Zhou,Tim Van de Cruys
机构: KU Leuven (鲁汶大学)
类目: Computation and Language (cs.CL)
备注: ReALM-GEN@ICLR2026
Abstract:Classifier-Free Guidance (CFG) is a widely used mechanism for controlling diffusion-based generative models, yet its guidance scale is typically treated as a fixed hyperparameter throughout generation. This static design yields a suboptimal controllability and quality tradeoff, as the optimal degree of guidance varies across tasks and across different stages of the diffusion process, especially in NLP domain. We recast CFG scale selection as a sequential decision-making problem and propose to learn dynamic guidance trajectories via reinforcement learning. Specifically, we model the guidance scale as a discrete control action selected at each generation step based on the evolving diffusion state, and optimize a policy using Proximal Policy Optimization (PPO) under task-level rewards. Experiments on three controlled NLP generation tasks using discrete diffusion language models demonstrate that adaptive guidance consistently achieves a better balance between controllability and generation quality than fixed-scale strategies. Further analysis of the learned policies reveals distinct and interpretable guidance trajectories across tasks, underscoring the importance of treating guidance as a dynamic control process rather than a static design choice.
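CFG 的标准组合是以无条件预测为基线、沿条件方向按尺度外推;论文进一步将每步的尺度交给策略选择。下面是组合公式的最小示意(尺度由一个以步索引为输入的策略函数给出,仅用于说明;论文中的策略由 PPO 依据演化的扩散状态学习,且动作为离散尺度):

```python
import numpy as np

def cfg_logits(cond, uncond, scale):
    # 经典 CFG:scale=0 退化为无条件预测,scale=1 退化为条件预测,
    # scale>1 沿条件方向外推以增强可控性
    return uncond + scale * (cond - uncond)

def guided_trajectory(cond_seq, uncond_seq, scale_policy):
    # scale_policy: 步索引 -> 该步的引导尺度(动态引导的核心即在此处逐步决策)
    return [cfg_logits(c, u, scale_policy(t))
            for t, (c, u) in enumerate(zip(cond_seq, uncond_seq))]
```

固定尺度策略对应 scale_policy 为常函数;论文的结论正是:把它换成依状态变化的学习策略,可在可控性与生成质量之间取得更好的平衡。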
[NLP-35] DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)代理在真实世界场景中执行任务时,因遵循存在语义模糊性的领域政策(domain policies)而导致决策不一致的问题。现有基准测试大多假设政策清晰明确,忽视了现实环境中普遍存在的多义性问题,从而造成评估盲区。解决方案的关键在于提出DRIP-R基准,该基准系统性地利用零售领域的实际政策模糊性,构建无唯一正确解的退货场景,并结合真实客户画像、全双工对话模拟与工具调用能力,以及包含政策合规性、对话质量、行为对齐性和解决方案质量的多维度评判框架,实证表明前沿模型在相同模糊场景下存在根本性分歧,验证了模糊性对LLM决策构成系统性挑战。
链接: https://arxiv.org/abs/2605.07699
作者: Hsuvas Borkakoty,Sebastian Pohl,Cheng Wang,Bei Chen,Yufang Hou
机构: Interdisciplinary Transformation University (跨学科转型大学); Amazon (亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities, and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that frontier models fundamentally disagree on identical policy-ambiguous scenarios, confirming that ambiguity poses a genuine and systematic challenge to LLM decision-making.
[NLP-36] Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)后训练中token级学习信号的异质性问题,即为何某些token在优化过程中表现稳定且有效,而另一些则波动剧烈甚至难以收敛。其核心挑战在于传统均匀采样或平均策略可能掩盖了token层面的结构差异,从而影响大语言模型推理能力的提升效率。解决方案的关键在于引入注意力熵(attention entropy)作为衡量指标,将token分为两类:低熵“锚点”(anchors)依赖集中上下文支持、梯度稳定且构成可靠优化基础;高熵“探索者”(explorers)利用更分散的上下文、产生更大但不稳定的梯度,可能蕴含复杂推理线索。通过动态熵感知的软重加权干预,该方法显著提升了Qwen3-8B-Base模型在held-out评估上的平均性能(从34.39提升至37.40),揭示了注意力熵可识别token级RL信号中的优化相关结构,并证明均匀token平均会模糊此类关键异质性。
链接: https://arxiv.org/abs/2605.07660
作者: Gengyang Li,Zheng-Fan Wu,Siqi Bao,Yunfang Wu
机构: Baidu(百度); School of Computer Science, Peking University (北京大学计算机学院); School of Software and Microelectronics, Peking University (北京大学软件与微电子学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement-learning-based post-training has become a key approach for improving the reasoning ability of large language models, but its token-level learning signals remain poorly understood. This work studies their heterogeneity through attention entropy, which measures how concentrated or diffuse the contextual support is for each response token. We first show that token-level RL objectives are sparsely estimable: uniformly random 20 percent token subsets preserve much of the full-token held-out performance, suggesting substantial redundancy in token-level updates. However, entropy-structured subsets behave very differently. Low-attention-entropy tokens, which we call anchors, rely on concentrated support, produce stable gradients aligned with full-token updates, and provide a reliable optimization backbone, but tend to plateau on harder benchmarks. High-attention-entropy tokens, which we call explorers, aggregate more diffuse context and induce larger but more volatile gradients. Explorer-only training is unstable on average, though rare successful runs suggest that these tokens may contain useful hard-reasoning signals when optimization remains stable. We support this anchor-explorer spectrum with evidence-gathering analyses, entropy dynamics, gradient-geometry diagnostics, and controls showing that position, predictive entropy, and loss normalization do not explain the observed asymmetry. Finally, a dynamic entropy-aware soft-reweighting intervention improves Qwen3-8B-Base from 34.39 to 37.40 held-out average in the strongest setting. These findings suggest that attention entropy reveals optimization-relevant structure in token-level RL signals, and that uniform token averaging can obscure meaningful heterogeneity in reasoning post-training. 
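以注意力熵区分"锚点/探索者"的计算可用如下纯 Python 片段示意:对每个 token 的注意力分布求香农熵,再按熵对 token 级损失做软重加权。论文的具体重加权公式未在摘要中给出,此处的负指数形式仅为一种假设写法。

```python
import math

def attention_entropy(weights):
    """对单个 token 的注意力分布计算香农熵;
    熵低表示上下文支持集中(锚点),熵高表示支持分散(探索者)。"""
    return -sum(p * math.log(p) for p in weights if p > 0)

def soft_reweight(entropies, temperature=1.0):
    """假设的熵感知软重加权示意:按负熵的 softmax 给各 token 的损失分配权重,
    低熵(锚点)token 获得更高权重。仅说明'按熵调权'的思路,非论文原始公式。"""
    exps = [math.exp(-e / temperature) for e in entropies]
    z = sum(exps)
    return [w / z for w in exps]
```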
[NLP-37] Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
【速读】: 该论文旨在解决生成式 AI(Generative AI)在自动短答案评分(Automated Short Answer Scoring, ASAS)中对部分正确响应(partially correct responses)评分一致性不足的问题,尤其关注模型在复杂评分任务中因缺乏任务特定数据而导致的中等质量响应评分偏差。其解决方案的关键在于揭示了模型表现与任务特定适应程度之间的关系:通过对比不同模型在少量示例下的少样本(few-shot)设置与微调后的编码器模型的表现,发现中等质量响应的评分一致性显著下降,且该下降程度随任务特定数据量增加而缓解,微调后的 BERT 基础编码器模型在该类响应上表现最优。这表明,提升模型对中等质量响应的判别能力需依赖更充分的任务特定适应,从而保障评分的公平性,尤其是对处于发展理解阶段学生的评价公正性。
链接: https://arxiv.org/abs/2605.07647
作者: Abigail Victoria Gurin Schleifer,Moriah Ariely,Beata Beigman Klebanov,Asaf Salman,Giora Alexandron
机构: Weizmann Institute of Science (魏兹曼科学研究所); ETS (教育考试服务中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.
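"质量条件一致性"(quality-conditioned agreement)的分析思路可示意如下:按黄金分数把响应分段(如全错/中段/全对),在每段内分别计算评分一致率,从而暴露"中段退化"。分段方式与一致率的简单精确匹配定义均为本文示意所作的简化假设。

```python
def quality_conditioned_agreement(gold, pred, bins):
    """按黄金分数分段计算评分一致率。
    gold/pred 为等长的分数序列;bins 为 {段名: 判定函数} 的字典。"""
    result = {}
    for name, in_bin in bins.items():
        pairs = [(g, p) for g, p in zip(gold, pred) if in_bin(g)]
        result[name] = (sum(g == p for g, p in pairs) / len(pairs)
                        if pairs else None)
    return result
```

将该函数套用到各个模型的预测上,即可比较不同质量区间的一致率曲线。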
[NLP-38] MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing
【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)在推理过程中依赖单一、连贯但缺乏中间验证的推理链(reasoning trajectories),导致早期错误无法被检测和纠正,从而影响可解释性与可信度的问题。解决方案的关键在于提出MAVEN(Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing),其核心是通过黑板式架构实现角色解耦,构建一个对抗性的“质疑者-研究者-裁判”循环机制,将逻辑辩护与事实依据功能分离,从而生成结构化、模块化且可验证的推理轨迹,显著提升推理质量与可信度。
链接: https://arxiv.org/abs/2605.07646
作者: Yinsheng Yao,Jiehao Tang,Zhaozhen Yang,Dawei Cheng
机构: Tongji University (同济大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 2 figures
Abstract:While explicit reasoning trajectories enhance model interpretability, existing paradigms often rely on monolithic chains that lack intermediate verification, allowing early errors to cascade unchecked. This lack of modularity impedes granular auditing and compromises the epistemic trust required for high-stakes applications. We propose MAVEN (Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing), a blackboard-inspired framework designed to transform LLMs into deliberate reasoners through explicit role-decoupling. At its core, MAVEN operationalizes an adversarial Skeptic-Researcher-Judge loop, simulating expert deliberation by functionally separating logical defense from factual grounding. Experiments on OpenBookQA, TruthfulQA, HALUEVAL and StrategyQA benchmarks demonstrate that MAVEN delivers superior reasoning quality across four fine-grained metrics. Notably, MAVEN consistently outperforms latent reasoning models such as GEMINI-3.1-Pro and consensus-based baselines (e.g., ReConcile) by generating explicitly structured, modular, and verifiable deliberation trajectories, rather than relying on implicit internal states or post-hoc consensus. Moreover, comprehensive evaluations confirm that MAVEN is fully model-agnostic, serving as a strong and transferable reasoning booster that yields substantial performance improvements across diverse backbone models.
[NLP-39] Multi-Dimensional Evaluation of LLMs for Grammatical Error Correction
【速读】: 该论文旨在解决自动语法纠错(Grammatical Error Correction, GEC)领域中三个关键问题:一是最新一代大语言模型(Large Language Models, LLMs)在语法纠错任务上的评估不全面;二是未探索将多个LLMs结合是否能提升纠错质量;三是参考基准指标对GEC系统性能的低估程度尚未量化。其解决方案的关键在于:首先,系统性地评估了主流LLMs在编辑精度、语流保持和语义保留三个维度的表现,发现微调后的GPT-4o在所有指标上均达到最优;其次,通过语法错误类型分析揭示不同LLMs具有高度一致的纠错模式(相关系数ρ=0.947);最后,实证表明73.76%的GPT-4o纠错结果虽不同于人工标注标准,但同等有效甚至更优,从而揭示参考基准指标存在显著低估现象。这一系列发现为教育场景下GEC工具的选择提供了科学依据,避免因过度依赖参考标准而限制学生语言发展。
链接: https://arxiv.org/abs/2605.07635
作者: Adnan Labib,Qiao Wang,Yixuan Huang,Zheng Yuan
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 Pages
Abstract:Automated assistants for Grammatical Error Correction are now embedded in educational platforms serving millions of learners, yet three critical gaps remain in this domain: (1) latest-generation Large Language Models (LLMs) lack comprehensive evaluation on grammar correction tasks; (2) whether combining these LLMs improves correction quality is unexplored; and (3) the extent to which reference-based metrics underestimate GEC system performance has not been adequately quantified. In this study, first, we evaluate latest-generation LLMs on edit precision, fluency preservation, and meaning retention, showing fine-tuned GPT-4o achieves state-of-the-art performance across all three dimensions. Second, through grammatical error type analysis we demonstrate that individual LLMs exhibit highly similar error correction patterns (ρ = 0.947). Third, we show that reference-based metrics underestimate GEC performance with 73.76% of GPT-4o corrections different from gold standards being equally valid or even superior. These GEC evaluation findings equip educators with guidance for selecting GEC assistants that enhance rather than constrain student linguistic development. We make our data, code, and models publicly available.
[NLP-40] Post-training makes large language models less human-like
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)作为人类行为代理时的拟合准确性问题,即哪些模型最能捕捉人类行为及其背后机制。其关键解决方案在于构建了一个名为Psych-201的新颖数据集,该数据集支持在大规模下量化行为对齐度(behavioral alignment)。研究发现,模型在后训练阶段(post-training)——即从基础模型转变为有用助手的关键步骤——会系统性降低与人类行为的一致性,且这种偏差随新世代模型的演进而加剧;同时,常用的人格诱导(persona-induction)技术并不能提升个体层面的行为预测能力。这表明,当前用于增强LLM实用性的方法反而削弱了其作为人类行为建模工具的准确性。
链接: https://arxiv.org/abs/2605.07632
作者: Marcel Binz,Elif Akata,Abdullah Almaatouq,Mohammed Alsobay,Oleksii Ariasov,Franziska Brändle,David Broska,Jason W. Burton,Nuno Busch,Frederick Callaway,Vanessa Cheung,Brian Christian,Julian Coda-Forno,Can Demircan,Vittoria Dentella,Maria K. Eckstein,Noémi Éltető,Michael Franke,Thomas L. Griffiths,Fritz Günther,Susanne Haridi,Sebastian Hellmann,Stefan Herytash,Linus Hof,Eleanor Holton,Isabelle Hoxha,Zak Hussain,Akshay Jagadish,Elif Kara,Valentin Kriegmair,Evelina Leivada,Li Ji-An,Tobias Ludwig,Maximilian Maier,Marcelo G. Mattar,Marvin Mathony,Alireza Modirshanechi,Robin Na,Mariia Nadverniuk,Antonios Nasioulas,Surabhi S. Nath,Helen Niemeyer,Kate Nussenbaum,Sebastian Olschewski,Thorsten Pachur,Stefano Palminteri,Aliona Petrenco,Camille V. Phaneuf-Hadd,Angelo Pirrone,Manuel Rausch,Laura Raveling,Shashank Reddy,Milena Rmus,Evan M. Russek,Tankred Saanum,Kai Sandbrink,Louis Schiekiera,Johannes A. Schubert,Luca M. Schulze Buschoff,Nishad Singhi,Leah H. Somerville,Mikhail S. Spektor,Xin Sui,Christopher Summerfield,Mirko Thalmann,Anna I. Thoma,Taisiia Tikhomirova,Vuong Truong,Polina Tsvilodub,Konstantinos Voudouris,Robert C. Wilson,Kristin Witte,Shuchen Wu,Dirk U. Wulff,Hua-Dong Xiong,Songlin Xu,Lance Ying,Xinyu Zhang,Jian-Qiao Zhu,Eric Schulz
机构: Helmholtz Munich; Massachusetts Institute of Technology; University of Tübingen; University of Oxford; Stanford University; University of Copenhagen; Max Planck Institute for Human Development; Technical University of Munich; New York University; University College London; University of Pavia; Google DeepMind; Max Planck Institute for Biological Cybernetics; Princeton University; Humboldt-Universität zu Berlin; LMU Munich; École normale supérieure; Leiden University; University of Basel; Autonomous University of Barcelona; Institució Catalana de Recerca i Estudis Avançats (ICREA); University of California San Diego; Hochschule Rhein-Waal; Katholische Universität Eichstätt-Ingolstadt; Alpe-Adria-Universität Klagenfurt; Singapore Management University; TU Darmstadt; VinUniversity; Taipei Medical University; Georgia Institute of Technology; Allen Institute; Boston University; The University of Hong Kong; Hunter College, City University of New York
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training – the stage that turns base models into useful assistants – consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction – a popular technique for eliciting human-like behavior by conditioning models on participant-specific information – does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.
[NLP-41] Safe or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
【速读】: 该论文旨在解决现有手机使用代理(phone-use agent)评估中无法区分“安全行为”与“能力缺失”的问题:即一个无害结果可能是由于代理识别风险并主动选择安全操作,也可能是由于其未能理解屏幕或执行任何相关操作所致。这两种情况成因不同、需采取不同改进策略,但当前基准测试常将其混同于任务成功、拒绝或最终有害结果中。解决方案的关键在于提出 PhoneSafety 基准,该基准包含从130多个应用中提取的700个安全关键时刻,每个实例隔离风险时刻的下一步决策,并明确询问模型是采取安全动作、不安全动作,还是无法执行任何有用操作。通过此框架对8个代表性手机使用代理进行评估,揭示了强通用能力并不保证高安全性,且“无法行动”更像是一种能力信号而非安全信号,从而强调仅凭无害结果不足以证明安全性,必须将错误判断与能力不足区分开来。
链接: https://arxiv.org/abs/2605.07630
作者: Zhengyang Tang,Yi Zhang,Chenxin Li,Xin Lai,Pengyuan Lyu,Yiduo Guo,Weinong Wang,Junyi Li,Yang Ding,Huawen Shen,Zhengyao Fang,Xingran Zhou,Liang Wu,Fei Tang,Sunqi Fan,Shangpin Peng,Zheng Ruan,Anran Zhang,Benyou Wang,Chengquan Zhang,Han Hu
机构: Tencent Hunyuan; The Chinese University of Hong Kong, Shenzhen; Tsinghua University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: work in progress
Abstract:When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
[NLP-42] Is She Even Relevant? When BERT Ignores Explicit Gender Cues
【速读】: 该论文旨在解决生成式 AI(Generative AI)中性别偏见在非英语语言中的形成机制问题,特别是针对具有显性形态性别标记的荷兰语 BERT 模型,探究性别信息如何以及何时在 Transformer 架构训练过程中被线性编码并演化。其解决方案的关键在于通过提取训练全过程中的上下文嵌入(contextual embeddings),利用线性支持向量机(linear SVMs)构建动态性别子空间,从而追踪性别编码的出现时机及其随时间的变化;同时设计受控句法模板测试显式性别线索是否能覆盖模型学习到的统计关联(如“木匠-男性”),结果表明模型对显式性别提示的响应能力有限,且存在显著的男性默认倾向,说明当前模型的上下文表示在性别维度上缺乏充分动态性。
链接: https://arxiv.org/abs/2605.07622
作者: Jonas Klein,Chiara Manna,Eva Vanmassenhove
机构: Tilburg University (蒂尔堡大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Gender bias in large language models has primarily been investigated for English, while languages with grammatical or morphological gender remain comparatively understudied. This paper investigates how and when gender information emerges in a Dutch BERT model trained from scratch, offering one of the first checkpoint-level analyses of bias formation in a Transformer architecture for a language combining overt morphological gender marking and generic forms. By extracting contextual embeddings throughout training, we construct dynamic gender subspaces using linear SVMs to trace when gender becomes linearly encoded and how this encoding evolves over time. Contextual embeddings are often assumed to integrate contextual cues robustly, allowing models to adjust the representation of a word depending on its more local usage. We therefore test whether explicit gender cues in controlled sentence templates (e.g., Zij is een loodgieter (‘She is a plumber’)) can override learned statistical associations (plumber - male). Our findings challenge this assumption: although gender becomes clearly linearly separable around epoch 20 and is distributed across multiple embedding dimensions, the model struggles to update its internal gender representation in light of explicit contextual cues in short sentence templates. Stereotypical gender-profession pairings are predicted far more accurately than anti-stereotypical ones, and generic forms in Dutch systematically default to a male interpretation, even when the context explicitly denotes a female referent. Together, our results seem to indicate that contextualization in the representations learned by our Dutch BERT model is not sufficiently dynamic along the probed gender direction: explicit gender cues in anti-stereotypical contexts are not reliably reflected in the resulting representations, resulting in persistent male-default behaviour.
[NLP-43] Intent-Driven Semantic ID Generation for Grounded Conversational News Recommendation ACL2026
【速读】: 该论文旨在解决对话式新闻推荐中因用户意图隐含且缺乏显式检索关键词而导致的推荐不准确问题,尤其针对标准检索增强生成(Retrieval-Augmented Generation, RAG)流程在处理五类隐式意图时存在的根本性瓶颈。其解决方案的关键在于提出一种“先生成再匹配”(Generate-then-Match)范式的意图驱动语义ID(Intent-driven Semantic ID, SID)生成机制:通过两阶段训练——多任务SID对齐与GPT-4思维链蒸馏,使大语言模型(LLM)将多样化的用户意图映射为层次化的SID前缀,并基于模糊匹配策略从实时新闻池中精准定位相关条目,从而实现完全接地的推荐;此外,引入面向用户画像的双信号推理(Profile-Aware Dual-Signal Reasoning, PADR)机制,显著提升冷启动用户的推荐效果,实测在主流中文新闻平台上,7B模型在152K开放生成SID空间中实现0%幻觉和12.4% L1匹配率(较随机基线提升4倍),同时在多项细粒度指标上优于GPT-4+混合RAG方案,且推理成本仅为后者的约1/100。
链接: https://arxiv.org/abs/2605.07613
作者: Hongyang Su,Beibei Kong,Lei Cheng,Chengxiang Zhuo,Zang Li,Chenyun Yu
机构: Tencent(腾讯); Sun Yat-sen University(中山大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026 Industry Track (Oral)
Abstract:Conversational news recommendation requires grounding each suggestion in a rapidly evolving article corpus while addressing implicit user intents that lack explicit retrievable keywords. To characterize this scenario, we identify 6 intent types from production dialogues: five are implicit and pose fundamental challenges to standard RAG pipelines, forming a critical retrieve-first bottleneck. To address these issues, we introduce intent-driven Semantic ID (SID) generation under a Generate-then-Match paradigm. With two-stage training that consists of multi-task SID alignment and GPT-4 Chain-of-Thought distillation, an LLM maps diverse intents to hierarchical SID prefixes, which are then fuzzy-matched to the current news pool to guarantee fully grounded recommendations. Profile-Aware Dual-Signal Reasoning (PADR) further enables cold-start users to obtain valid recommendations using only profiles. On a mainstream Chinese news platform, our 7B model achieves 0% hallucination and 12.4% L1 match in the 152K open-generation SID space (4x random baseline). It matches GPT-4+Hybrid RAG on L1 while surpassing it on finer-grained metrics (L2 2x, Category +1.2pp) at ~100x lower cost. Cold-start users, where existing baselines score 0%, achieve 18.0% L1 (6x random), the highest among all user groups.
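"先生成再匹配"范式中按层次前缀做模糊匹配的思路可示意如下:返回新闻池中与生成 SID 共享最长层次前缀的候选条目。此处把 SID 表示为多级元组,匹配规则取"最长公共前缀",均为本文示意所作的假设(层级对应摘要中的 L1/L2 等指标)。

```python
def fuzzy_prefix_match(generated_sid, pool_sids):
    """返回与 generated_sid 共享最长层次前缀的池内 SID 及其匹配深度。"""
    best, best_len = [], -1
    for sid in pool_sids:
        k = 0
        for a, b in zip(generated_sid, sid):
            if a != b:
                break
            k += 1
        if k > best_len:
            best, best_len = [sid], k
        elif k == best_len:
            best.append(sid)
    return best, best_len
```

由于候选始终取自实时新闻池,该匹配步骤保证推荐结果完全接地(grounded),不会产生池外幻觉条目。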
[NLP-44] Nürnberg NLP at PsyDefDetect: Multi-Axis Voter Ensembles for Psychological Defence Mechanism Classification ACL2026
【速读】: 该论文旨在解决支持性对话中心理防御机制(psychological defence mechanisms)检测的模糊性问题,尤其是在标注一致性较低的情况下。由于八种正向防御类别在表层语言上高度相似,仅在语用功能上存在差异,导致人工标注者间达成的互评一致性仅为中等水平。为此,作者提出了一种基于误差独立性的集成学习方案,其核心在于通过构建一个由9个分类器组成的集成系统,覆盖三个正交维度:类别粒度(门卫模型使用全部九类,专业模型仅使用八类防御类别)、训练方法(生成式与判别式)以及基础模型架构。该策略有效缓解了单一模型在防御边界重叠时的不确定性问题,最终在BioNLP 2026的PsyDefDetect共享任务中取得了F1_test=0.420的成绩,位列21支参赛团队之首。
链接: https://arxiv.org/abs/2605.07606
作者: Philipp Steigerwald,Eric Rudolph,Jens Albrecht
机构: Technische Hochschule Nürnberg Georg Simon Ohm (纽伦堡技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the BioNLP 2026 PsyDefDetect Shared Task @ ACL 2026 (1st place, 21 registered teams)
Abstract:Detecting levels of psychological defence mechanisms in supportive conversations is inherently ambiguous. In the PsyDefDetect shared task at BioNLP 2026 the eight positive defence categories share surface language and differ only in pragmatic function and trained raters reach only moderate inter-annotator agreement. On such a task the decisive lever is not a stronger single model but error independence, since any single representation will waver on the overlapping defence boundaries. We translate this insight into a 9-voter ensemble spanning three orthogonal axes: class granularity (all nine classes for the gatekeeper, only the eight defence classes for the specialists), training method (generative and discriminative) and base model. The system reaches F1_test=.420 on the hidden test set, placing first among 21 registered teams.
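门卫模型与专业模型的投票流程可粗略示意如下:门卫先在全部九类上判定,若判为"无防御"则直接采纳;否则由八类专业模型做多数投票。该裁决规则是本文示意所作的假设,论文摘要并未给出具体的票数合并细节。

```python
from collections import Counter

def ensemble_vote(gatekeeper_pred, specialist_preds, none_label="no_defence"):
    """简化的 9 投票者集成示意:门卫覆盖九类(含'无防御'),
    专业模型只在八类防御类别上投票,按多数票裁决。"""
    if gatekeeper_pred == none_label:
        return none_label
    counts = Counter(specialist_preds)
    return counts.most_common(1)[0][0]
```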
[NLP-45] Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLM s as Concept Mastery Simulators
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理任务中难以区分哪些知识概念对正确答案具有因果贡献的问题,现有方法如基于蒙特卡洛树搜索(MCTS)的测试时搜索或因果图引导的知识注入,无法排除问题难度等混杂因素导致的虚假关联。其解决方案的关键在于提出CIKA(Causal Intervention for Knowledge Activation)框架,利用LLM自身作为干预模拟器:通过提示(prompt)将特定概念状态显式设置为“已掌握”,并以答案正确性的变化估计该概念的因果效应,从而定义出一种干预能力探针(Interventional Capability Probe, ICP)。ICP能够区分模型是否真正具备使用某概念的能力(即因果激活能力),而非仅仅存储相关知识;由于干预是外生设定、独立于问题难度,ICP可有效分离观测方法无法识别的混杂因素。实验证明,ICP能显著区分因果相关与无关概念,并预测解题成功率,且在多个基准测试中优于基线模型,尤其在基础模型无法解答的问题上,CIKA通过激活隐藏知识提升了33.8%的准确率。
链接: https://arxiv.org/abs/2605.07600
作者: Tsuyoshi Okita
机构: Kyushu Institute of Technology (九州工业大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages, 0 figures
Abstract:Recent methods for improving LLM mathematical reasoning, whether through MCTS-based test-time search or causal graph-guided knowledge injection, cannot identify which concepts causally contribute to a correct answer, as the observed association may be spurious, driven by confounders such as problem difficulty. We propose CIKA (Causal Intervention for Knowledge Activation), a framework that uses the LLM itself as an interventional simulator: a prompt sets the concept state to "mastered" and the correctness change estimates the causal effect. We formalize this quantity as an Interventional Capability Probe (ICP), which diagnoses whether the LLM can use a given concept – distinct from merely possessing knowledge. Because the intervention exogenously sets the concept state independently of problem difficulty, ICP separates confounding that observational methods cannot. On 67 screened problems, the ICP of the top-ranked concept (+0.219) is significantly larger than that of the negative control (+0.039; paired t-test, p < 10^-6, Cohen's d = 0.86), confirming that the probe discriminates causally relevant concepts from irrelevant ones. Analysis of 601 Omni-MATH problems further shows that solved problems have 6.1× higher ATE than unsolved ones (0.338 vs. 0.055), confirming that ICP is predictive of problem-solving success. With a 7B-parameter LLM whose weights are entirely frozen, CIKA achieves 69.7% on the contamination-free Omni-MATH-Rule benchmark and 64.0% overall, compared to 60.5% for o1-mini, and 97.2% on GSM8K, 46–50% on AIME 2024–2026, and 46.2% on MathArena. The Causal Knowledge Activation component contributes 33.8% of correct answers on problems where the base model alone fails, demonstrating that the LLM already possessed but had not activated the requisite knowledge.
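ICP 的估计方式可示意为"干预正确率减基线正确率":在提示中显式把某概念设为"已掌握",比较前后正确率之差。其中 `solve(problem, concept)` 为假设的求解接口(真实实现中封装对 LLM 的调用),`concept` 为 None 时表示不做干预。

```python
def interventional_capability_probe(solve, problems, concept):
    """ICP 示意:干预(把 concept 设为'已掌握')下的正确率减去基线正确率。
    solve 返回布尔值表示是否答对;该接口与题目表示均为假设。"""
    base = sum(solve(p, None) for p in problems) / len(problems)
    treated = sum(solve(p, concept) for p in problems) / len(problems)
    return treated - base
```

与观测式关联分析不同,干预在此被外生设定,与题目难度无关,因此正的 ICP 可归因于概念本身。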
[NLP-46] Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
【速读】: 该论文旨在解决大推理模型(Large Reasoning Models)在基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)中基线估计(baseline estimation)带来的高计算开销问题。现有方法如PPO需依赖独立的策略模型规模的价值函数(critic),而GRPO则需对每个提示(prompt)进行多次采样以维持经验组均值的稳定性,导致效率低下。其解决方案的关键在于提出Policy Optimization with Internal State Value Estimation (POISE),通过利用策略模型在前向传播过程中已产生的内部状态信号(internal states)来低成本估算基线值;具体而言,一个轻量级探测器(probe)在线学习从提示和生成轨迹的隐藏状态及token熵统计量中预测期望可验证奖励,并采用跨采样路径(cross-rollout)构造机制,确保即使使用轨迹条件特征也能保持梯度无偏性。这一设计使POISE仅需单次采样即可完成价值估计,从而在固定计算预算下提升提示多样性、降低梯度方差并消除检测零优势提示的采样开销,实现更稳定高效的策略优化。
链接: https://arxiv.org/abs/2605.07579
作者: Yunho Choi,Jongwon Lim,Woojin Ahn,Minjae Oh,Jeonghoon Shim,Yohan Jo
机构: Seoul National University (首尔国立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the sampling cost of detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
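交叉采样路径(cross-rollout)基线构造的核心可示意如下:第 i 条轨迹的优势用"另一条独立轨迹"的探针价值预测作基线,使基线与自身轨迹特征解耦,从而保持梯度无偏。此处用循环移位做配对,属假设实现;论文在 batch 内的具体配对方式未在摘要中给出。

```python
def cross_rollout_advantages(rewards, probe_values):
    """第 i 条轨迹的优势 = 其奖励 - 第 (i+1) 条(独立)轨迹的探针价值预测。
    rewards 与 probe_values 等长;循环移位配对仅为示意。"""
    n = len(rewards)
    return [rewards[i] - probe_values[(i + 1) % n] for i in range(n)]
```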
[NLP-47] Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
【速读】: 该论文旨在解决视频大语言模型(Video-LLM)在时间推理任务中表现远低于人类水平的问题,特别是针对“时间箭头”(Arrow-of-Time, AoT)任务中模型难以有效利用视频中的时序不可逆性信息。研究表明,问题并非源于视觉编码器本身对时序信息的建模能力不足,而是出在视频表示从视觉编码器到语言模型(LLM)传递过程中的信息瓶颈——尤其是投影层(projector)设计导致的时间信息丢失。解决方案的关键在于:采用以视频为中心的编码器(video-centric encoder)进行显式时序建模、设计保留时间信息的多层感知机(MLP)投影层替代破坏时序结构的Q-Former,并引入AoT监督信号引导训练。这一改进使模型在AoT任务上达到98.1%准确率,超越人类表现,并显著提升其他时序推理任务的性能。
链接: https://arxiv.org/abs/2605.07568
作者: Peitao Han,Fei Cheng,Lis K. Pereira,Qianying Liu,Shigeru Kitazawa
机构: The University of Osaka (大阪大学); Center for Information and Neural Networks (信息与神经网络中心); National Institute of Information and Communications Technology (信息通信技术国立研究所); Kyoto University (京都大学); NII LLMC (国立信息研究所语言模型与计算中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck of temporal information flow. We identify projector design as a key factor: Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM’s access to such information. Our layer-wise analysis further shows temporal representation dynamics across encoder layers. Guided by these findings, we build a Video-LLM with temporal-aware video-centric encoder, time-preserved projector, and AoT supervision, surpassing human performance on AoT _PPB with 98.1% accuracy, and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.
[NLP-48] Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在机器翻译(Machine Translation, MT)中失败模式不明确的问题,尤其是缺乏对非英语中心语言对(non-English-centric language pairs, LPs)翻译性能下降原因的系统性理解。其解决方案的关键在于引入Token Activation Rate (TAR),这是一个衡量模型在生成过程中对特定语言词汇表中token使用效率的指标。研究通过验证TAR作为语言表示能力的代理指标,发现较低的TAR与较差的翻译质量显著相关,并揭示了推理型LLMs在低TAR语言上倾向于生成更多token的补偿机制,从而为理解LLM在MT中的表现差异提供了新的token级动态视角。
链接: https://arxiv.org/abs/2605.07533
作者: Shenbin Qian,Yves Scherrer
机构: University of Oslo (奥斯陆大学)
类目: Computation and Language (cs.CL)
备注: Accepted to the 26th Annual Conference of the European Association for Machine Translation (EAMT2026)
Abstract:Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is strongly associated with poorer translation performance. Furthermore, reasoning LLMs tend to generate more tokens when translating into low-TAR languages, suggesting a compensatory mechanism, although its impact on translation quality varies across models. Overall, our findings emphasize the importance of token-level dynamics in understanding MT performance of LLMs.
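Token Activation Rate (TAR) 的一种直观近似实现如下:统计生成序列中落入目标语言专属词表的 token 占比。摘要未给出 TAR 的精确定义,"语言专属词表"的划分方式与该占比形式均为本文示意所作的假设。

```python
def token_activation_rate(generated_tokens, lang_vocab):
    """TAR 近似:生成 token 中属于目标语言专属词表的比例,
    取值越低,说明模型在生成该语言时越少激活其专属 token。"""
    if not generated_tokens:
        return 0.0
    hits = sum(t in lang_vocab for t in generated_tokens)
    return hits / len(generated_tokens)
```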
[NLP-49] WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation ICML2026
【速读】: 该论文旨在解决当前天气预报报告生成依赖人工分析多源数据所导致的信息过载与效率低下问题,其核心挑战在于如何利用数据驱动模型实现自动化、高质量的天气预报报告生成。解决方案的关键在于提出了一项新的任务——天气预报报告生成(Weather Forecasting Report, WFR),并构建了首个用于指令微调的WFR数据集 \DatasetNameL,覆盖美国31个城市和8类天气要素;在此基础上开发了首个专注于该任务的模型 \ModelNameL,通过在该数据集上的训练与评估,证明其在结构复杂天气要素上显著优于主流闭源多模态大语言模型(Multimodal Large Language Models, MLLMs),且展现出跨地理区域的强零样本泛化能力,为专用领域MLLMs的开发提供了重要实践路径。
链接: https://arxiv.org/abs/2605.07522
作者: Zinan Zheng,Yang Liu,Nuo Chen,Juepeng Zheng,Hong Cheng,Jia Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICML 2026
Abstract:Accurate weather forecast reporting enables individuals and communities to better plan daily activities and agricultural operations. However, the current reporting process primarily relies on manual analysis of multi-source data, which leads to information overload and reduced efficiency. With the development of multimodal large language models (MLLMs), leveraging data-driven models to analyze and generate reports in the weather forecasting domain remains largely underexplored. In this work, we propose the Weather Forecasting Report (WFR) task and construct the first instruction-tuning dataset for this task, named \DatasetNameL, which covers 31 cities in America and 8 weather aspects. Based on this corpus, we develop the first model, \ModelNameL, specialized in generating weather forecast reports. Evaluation across multiple metrics on our dataset shows that \ModelNameL consistently outperforms leading closed-source MLLMs, particularly on structurally complex weather aspects. We further analyze its performance across diverse geographic regions and weather aspects. \ModelNameL demonstrates strong transferability across different regions, highlighting its zero-shot generalization capability. \ModelNameL offers valuable insight for developing MLLMs specialized in weather report generation.
[NLP-50] ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression
【速读】: 该论文旨在解决大推理模型(Large Reasoning Models, LRMs)在使用扩展链式思维(Chain-of-Thought, CoT)推理时存在的两个核心问题:一是生成文本Token消耗过高,导致推理延迟增加;二是现有基于强化学习(Reinforcement Learning, RL)的CoT压缩方法依赖静态长度惩罚策略,未能动态适应模型能力变化和问题难度差异。解决方案的关键在于提出一种名为ExpThink的RL框架,其核心创新包含两个互补机制:一是基于经验引导的奖励设计(experience-guided reward shaping),通过记录每个问题的最短正确解并采用三档奖励机制(完整奖励、折扣奖励、无奖励),形成无需人工调度的自进化课程;二是难度自适应的优势计算(difficulty-adaptive advantage),以正确答案计数替代标准差归一化,生成单调难度缩放的梯度,从而在保持准确率的同时抑制简单问题的梯度以促进简洁性。这两个机制共同实现“先保证准确性、再优化压缩效率”的训练目标,在多个数学推理基准测试中显著降低平均响应长度(最高达77%)并提升准确率,相较基线模型实现高达3倍的准确率-效率比(accuracy-efficiency ratio)。
链接: https://arxiv.org/abs/2605.07501
作者: Tingcheng Bian,Yuzhe Zhang,Jing Jin,Jinchang Luo,MingQuan Cheng,Haiwei Wang,Wenyuan Jiang,Miaohui Wang
机构: Baidu Inc.(百度); Shenzhen University (深圳大学); Peking University (北京大学); Tsinghua University (清华大学); D-INFK, ETH Zürich (ETH Zürich计算机科学系)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 39 pages, 18 figures. Code and model checkpoints will be released upon publication
Abstract:Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose ExpThink, an RL framework that addresses both dimensions through two complementary mechanisms. First, experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second, difficulty-adaptive advantage replaces standard deviation normalization with correct-count normalization, yielding monotonically difficulty-scaled gradients that amplify learning on hard problems to preserve accuracy while suppressing gradients on easy ones to encourage brevity. Together, these mechanisms enforce an accuracy-first, compression-second training objective. Experiments on multiple mathematical reasoning benchmarks demonstrate that ExpThink reduces average response length by up to 77% while simultaneously improving accuracy, achieving up to 3x higher accuracy-efficiency ratio (accuracy divided by average token count) than the vanilla baseline and outperforming existing RL-based compression methods on both metrics.
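摘要中的两个机制可以用如下带假设的 Python 草图说明(折扣系数 0.5 与优势归一化的具体形式均为笔者假设,非论文原式):

```python
# 机制一:经验引导的三档奖励。best_len 记录该问题迄今最短的正确解长度,
# 阈值随之自动收紧;discount=0.5 为假设值。
def expthink_reward(correct, length, best_len, discount=0.5):
    if not correct:
        return 0.0, best_len              # 错误:零奖励
    if best_len is None or length <= best_len:
        return 1.0, length                # 简洁正确:满奖励,阈值收紧
    return discount, best_len             # 冗长正确:折扣奖励

# 机制二:难度自适应优势,用组内"答对次数"代替标准差做归一化(具体形式为假设)。
def adaptive_advantage(rewards):
    n_correct = max(1, sum(r > 0 for r in rewards))
    mean = sum(rewards) / len(rewards)
    return [(r - mean) / n_correct for r in rewards]

best = None
r1, best = expthink_reward(True, 120, best)   # 首个正确解 -> 1.0, best=120
r2, best = expthink_reward(True, 200, best)   # 更冗长的正确解 -> 0.5
r3, best = expthink_reward(False, 50, best)   # 错误 -> 0.0
r4, best = expthink_reward(True, 80, best)    # 刷新最短 -> 1.0, best=80
adv = adaptive_advantage([r1, r2, r3, r4])
```

可见难题(答对次数少)得到更大的归一化优势,简单题的梯度被相应压低,与摘要描述一致。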
[NLP-51] SEIF: Self-Evolving Reinforcement Learning for Instruction Following
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在指令跟随能力提升过程中面临的两大挑战:一是依赖昂贵的人工标注或强教师模型进行监督,二是现有自对弈训练方法使用静态难度的指令,无法随模型能力增长而动态调整。为此,作者提出SEIF(Self-Evolving Reinforcement Learning for Instruction Following)框架,其核心在于构建一个闭环的自我进化机制,使指令难度演化与模型能力演化相互促进。关键创新点在于设计四角色协同系统——Instructor生成渐进式难指令、Filter保障数据质量、Follower学习新指令、Judger提供强化学习奖励信号,并通过交替训练实现Instructor与Follower的共进化,从而在多尺度和架构下稳定提升指令跟随性能。
链接: https://arxiv.org/abs/2605.07465
作者: Qingyu Ren,Qianyu He,Jiajie Zhu,Xingzhou Chen,Jingwen Chang,Zeye Sun,Han Xia,Fei Yu,Jiaqing Liang,Yanghua Xiao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model’s capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop that improves the model’s instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at this https URL.
[NLP-52] The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
【速读】: 该论文旨在解决生成式 AI(Generative AI)在开放交互环境中可能引发的 emergent misalignment(涌现性错位)问题,特别是通过模拟大规模自主代理(OpenClaw agents)在类 Reddit 平台 Moltbook 上的行为来研究其对语言模型安全性和可信度的影响。解决方案的关键在于构建并公开 Moltbook Files 数据集(包含 23.2 万条帖子和 220 万条评论),并通过一套 PII(Personally-Identifiable Information,个人身份信息)处理管道清理敏感内容,并在此基础上对 Qwen2.5-14B-Instruct 模型进行三阶段微调,以量化分析其对模型真实性(truthfulness)等关键指标的影响。研究发现,尽管 Moltbook 数据本身并未显著恶化模型性能(与 Reddit 对照组相当),但存在潜在尾部风险,如代理行为传染、自我链接污染未来爬取数据以及特性迁移至下一代模型,凸显了控制基线在评估涌现性错位中的核心作用。
链接: https://arxiv.org/abs/2605.07462
作者: William Brach,Federico Torrielli,Stine Lyngsø Beltoft,Annemette Brok Pirchert,Peter Schneider-Kamp,Lukas Galke Poech
机构: Slovak University of Technology (斯洛伐克理工大学); University of Turin (都灵大学); University of Southern Denmark (南丹麦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Moltbook is a Reddit-like platform where OpenClaw agents post, comment, and vote at scale - a so far unprecedented incident that comes with serious safety concerns. With the aim of studying emergent behavior in populations, we release the Moltbook Files, a dataset of 232k posts and 2.2M comments covering the platform’s first 12 days, processed through a pipeline to identify and remove Personally-Identifiable Information (PII). We analyze community structure, authorship, lexical properties, sentiment, topics, semantic geometry, and comment interaction. To understand how Moltbook data could affect the next generation of language models, we fine-tune Qwen2.5-14B-Instruct on Moltbook Files with three adaptation levels. Our PII pipeline reveals that agents post API keys, passwords, BIP39 seed phrases on Moltbook, a publicly indexed platform. The overall sentiment is mostly neutral and mildly positive (66.6% neutral, 19.5% positive) and shows a tendency for self-referential linking. We find that fine-tuning on Moltbook data reduces truthfulness from 0.366 to 0.187. However, a model fine-tuned on a size-matched Reddit dataset produces a comparable decrease. Moltbook thus seems to be more of a harmless slopocalypse. However, tail risks remain, including agent affordances, contamination of future crawls through self-links, and potential transfer of traits to the next generation of language models. More broadly, our findings highlight the importance of control baselines in emergent misalignment evaluations.
[NLP-53] Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
【速读】: 该论文旨在解决现有基于评分标准(rubric)的强化学习框架中,评分标准仅作为外部评估工具、无法在模型生成过程中提供实时引导的问题。传统方法将rubric视为与策略(policy)推理轨迹分离的后验测量工具,限制了其对生成质量的主动调控能力。解决方案的关键在于提出“Think-with-Rubrics”范式,将rubric生成过程嵌入到大语言模型(LLM)的推理上下文中,使rubric从独立评估指标转变为内部指导机制:训练时模型依次生成rubric和响应,同时由一个经过训练的rubric verifier对自动生成或黄金rubric与回答的一致性进行联合监督。这一设计显著提升了模型在多基准测试中的表现(平均优于基线3.87分),并通过增强自动生成rubric的质量和提升响应内一致性来改善性能。
链接: https://arxiv.org/abs/2605.07461
作者: Jiachen Yu,Zhihao Xu,Junjie Wang,Yujiu Yang
机构: Tsinghua University (清华大学); Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Rubrics have been extensively utilized for evaluating unverifiable, open-ended tasks, with recent research incorporating them into reward systems for reinforcement learning. However, existing frameworks typically treat rubrics only as an external evaluator disjoint from the policy’s primary reasoning trace. Such a design confines rubrics to post-hoc measurement, leaving them unable to actively guide the model’s generation process. In this work, we introduce Think-with-Rubrics, a novel paradigm for instruction following tasks. Think-with-Rubrics integrates rubric generation into the reasoning context, transforming the rubric from an independent artifact into internal guidance for the LLM’s generation. During training, the LLM sequentially generates a rubric followed by a response, while a trained rubric verifier provides joint supervision by evaluating the consistency between the answer and the self-generated / golden rubrics. Experiments across multiple benchmarks demonstrate that Think-with-Rubrics consistently outperforms the Rubric-as-Reward baseline supervised by golden rubrics by an average of 3.87 points. We have also discussed the mechanism by which Think-with-Rubrics enhances model performance. Experimental results demonstrate that supervision from golden rubrics and self-generated rubrics enhances the performance of Think-with-Rubrics by improving the quality of self-generated rubrics and increasing the internal consistency of responses respectively.
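摘要中"答案同时对照自生成与黄金 rubric 计分"的联合监督信号可以用一个玩具示例说明(verifier 用关键词匹配代替论文中训练得到的 rubric verifier,权重 w=0.5 为假设值):

```python
# 玩具 verifier:按 rubric 条目(分号分隔)是否出现在答案中计分;仅作演示。
def toy_verifier(answer, rubric):
    items = [it.strip() for it in rubric.split(";")]
    return sum(it in answer for it in items) / len(items)

# 联合监督:同时对照自生成 rubric 与黄金 rubric 打分,权重 w=0.5 为假设值。
def rubric_reward(verifier, answer, self_rubric, golden_rubric, w=0.5):
    return w * verifier(answer, self_rubric) + (1 - w) * verifier(answer, golden_rubric)

ans = "The answer cites a source and states the limitation clearly."
reward = rubric_reward(toy_verifier, ans,
                       "cites a source; uses formal tone",        # 自生成 rubric
                       "cites a source; states the limitation")   # 黄金 rubric
```

这里自生成 rubric 命中 1/2 条、黄金 rubric 命中 2/2 条,联合奖励为 0.75;真实系统中两路分数共同驱动策略同时改进 rubric 质量与回答一致性。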
[NLP-54] GRaSp: Automatic Example Optimization for In-Context Learning in Low-Data Tasks
【速读】: 该论文旨在解决大语言模型在低数据、领域特定场景下进行上下文学习(in-context learning)时,因高质量示例稀缺而导致性能不稳定的问题。其核心挑战在于如何自动优化选择有效的示例以提升模型表现。解决方案的关键在于提出GRaSp框架,该框架包含三个阶段:首先生成大规模合成候选示例池,接着通过聚类与降维结构化候选池以增强多样性,最后利用遗传算法结合自适应变异机制(diversity-adaptive mutation)进行优化搜索;该机制能在进化初期实现跨簇广域探索,随种群收敛逐步转向簇内精细优化,从而显著提升金融领域命名实体识别(NER)任务中的微平均F1值,优于零样本和随机少样本基线。
链接: https://arxiv.org/abs/2605.07454
作者: Simen Bihaug-Frøyland,Henrik Brådland
机构: University of Agder (挪威阿格德大学)
类目: Computation and Language (cs.CL)
备注: 12 pages, 5 figures
Abstract:In-context learning enables large language models to adapt to new tasks, but their performance is highly sensitive to the selected examples. Finding effective demonstrations is particularly difficult in domain-specific, low-data settings where high-quality examples are scarce. We propose GRaSp, a three-stage framework for automatic in-context example optimization. By first generating a large synthetic candidate pool, then structuring it with clustering and dimensionality reduction, and finally using genetic algorithms to find the optimal in-context examples, the framework shows consistent improvements on the NER task. We also introduce a custom diversity-adaptive mutation mechanism, allowing it to transition from the initial broad inter-cluster exploration to focused intra-cluster refinement as the population converges. We evaluate GRaSp on financial named entity recognition (FiNER-139), comparing synthetic and human-annotated candidate pools across pool sizes of 500 and 5000. With non-synthetic data, GRaSp achieves 45.84% micro-F1, consistently outperforming both zero-shot and random few-shot baselines. Synthetic data matches the random baseline but does not exceed it, suggesting that distributional variety in the candidate pool is critical for generalization.
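摘要中的多样性自适应变异可以粗略示意如下(以"种群多样性"直接作为跨簇替换概率,这一具体形式为笔者假设;个体表示与示例池均为虚构):

```python
import random

# 多样性自适应变异:diversity 接近 1 时倾向跨簇替换(探索),
# 接近 0 时只在原簇内替换(精化)。个体是 (cluster_id, example_id) 列表。
def mutate(individual, pool_by_cluster, diversity, rng):
    child = list(individual)
    i = rng.randrange(len(child))
    cur_cluster, _ = child[i]
    if rng.random() < diversity:          # 跨簇探索
        other = rng.choice([c for c in pool_by_cluster if c != cur_cluster])
        child[i] = (other, rng.choice(pool_by_cluster[other]))
    else:                                 # 簇内精化
        child[i] = (cur_cluster, rng.choice(pool_by_cluster[cur_cluster]))
    return child

rng = random.Random(0)
pool = {0: ["a0", "a1"], 1: ["b0", "b1"], 2: ["c0", "c1"]}
parent = [(0, "a0"), (1, "b0")]
explored = mutate(parent, pool, diversity=1.0, rng=rng)  # 必然跨簇
refined = mutate(parent, pool, diversity=0.0, rng=rng)   # 必然簇内
```

随着种群收敛、多样性下降,变异自然从簇间广域搜索过渡到簇内微调,对应摘要描述的行为。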
[NLP-55] Data Contamination in Neural Hieroglyphic Translation: A Reproducibility Study
【速读】: 该论文旨在解决濒危语言(如古埃及象形文字)在神经机器翻译(NMT)中因数据稀缺和污染导致的性能评估失真问题。其关键解决方案是识别并去除训练与测试集之间的数据泄露(data contamination),特别是通过文档级和目标级去重策略,发现并剔除16个重复出现的目标句子(占测试集32%),从而获得更真实的模型性能基准(BLEU从61.5降至30.9–39.2)。这一方法揭示了原始高分结果主要源于数据污染而非模型能力,并为濒危语言的NMT研究提供了可复现、可信的评估标准。
链接: https://arxiv.org/abs/2605.07453
作者: Ammar Toutou,Abdelrahman Harb,Christine Basta
机构: Alamein International University (AIU), Egypt; University of the Basque Country, Spain; Alexandria University, Egypt
类目: Computation and Language (cs.CL)
备注: Accepted to NLP4DH 2026 Conference
Abstract:Ancient and endangered languages pose a unique challenge for NLP: their datasets are inherently scarce, difficult to expand, and built from formulaic corpora – making data-quality issues especially consequential yet rarely audited. Motivated by the need to understand what current NMT can realistically achieve for such languages, we investigate hieroglyphic-to-German translation, where a recent study reported 61.5 BLEU using fine-tuned M2M-100. Our reproduction yields only 37.0 BLEU with the released model. Investigating this gap, we find 32% of test targets appear identically in training (16/50; 50% under 8-gram overlap at 70% threshold). This contamination inflates scores dramatically: contaminated samples achieve up to 83.8 BLEU / 0.924 COMET-22 versus 30.9–39.2 BLEU / 0.622–0.676 COMET-22 on clean samples across five model configurations spanning two architectures. Document-level decontamination reduces contaminated BLEU by only 4.6 points because 8/16 targets persist via other source documents – target-level deduplication is required. We release a decontaminated 34-sample test set and establish corrected baselines (30.9–39.2 BLEU), providing a realistic assessment of NMT capability for this endangered writing system.
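摘要中"精确匹配 + 8-gram 重叠(70% 阈值)"两类目标级污染检测可以用如下 Python 草图说明(示例句子为虚构):

```python
# 目标级污染检测:精确匹配 + 8-gram 重叠率(阈值 70%)。示例句子为虚构。
def ngrams(tokens, n=8):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_sent, train_sents, n=8):
    test_grams = ngrams(test_sent.split(), n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for s in train_sents:
        train_grams |= ngrams(s.split(), n)
    return len(test_grams & train_grams) / len(test_grams)

train = ["the priest offers bread and beer to the god of the horizon daily"]
leaked = "the priest offers bread and beer to the god of the horizon daily"
clean = "a scribe counts the grain delivered to the temple storehouse today"

exact_hit = leaked in train                 # 精确匹配:训练集中原样出现
ratio = overlap_ratio(leaked, train)        # 1.0,超过 0.7 阈值,判为污染
clean_ratio = overlap_ratio(clean, train)   # 0.0
```

按目标句而非文档去重,正是摘要指出的关键:同一目标句可经由不同源文档泄漏,仅做文档级去重无法清除。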
[NLP-56] Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在实际应用中面临的安全性问题,特别是其对对抗攻击的高度脆弱性。尽管VLMs已广泛部署于基于代理的系统中,但现有方法对其防御能力研究不足,导致下游应用存在显著风险。解决方案的关键在于提出一种轻量级、可插拔的对抗攻击检测框架——SAEgis,其核心创新是利用稀疏自编码器(Sparse Autoencoders, SAEs)作为嵌入模块,在预训练VLM中通过标准重建目标进行训练,从而学习到能够捕捉攻击相关信号的稀疏潜在特征。这些特征可在无需额外对抗训练的情况下,可靠地识别未见过的对抗样本,且在域内、跨域及跨攻击场景下均表现出优越性能,尤其在跨域泛化上显著优于现有基线方法。
链接: https://arxiv.org/abs/2605.07447
作者: Hao Wang,Yiqun Sun,Pengfei Wei,Lawrence B. Hsieh,Daisuke Kawahara
机构: Magellan Technology Research Institute (MTRI); Waseda University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAE as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
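SAEgis 的检测思路可以用如下简化模拟说明(此处用随机稀疏向量模拟 SAE 潜变量、用最近质心分类器代替真实系统中的轻量级检测器,偏移方向与数据均为示意性假设):

```python
import numpy as np

# 用随机稀疏向量模拟 SAE 潜变量:干净输入只有少量随机激活单元;
# 对抗输入额外在固定几个"攻击相关"潜变量上产生偏移(纯属示意)。
rng = np.random.default_rng(0)

def sparse_codes(n, dim=64, hot=4, shift=0.0):
    z = np.zeros((n, dim))
    for row in z:
        idx = rng.choice(dim, size=hot, replace=False)
        row[idx] = rng.uniform(0.5, 1.5, size=hot)
    z[:, :hot] += shift      # 攻击信号集中在少数潜变量上
    return z

clean = sparse_codes(100)
adv = sparse_codes(100, shift=1.0)

# 用最近质心分类器代替真实系统中训练的检测器
mu_clean, mu_adv = clean.mean(0), adv.mean(0)

def is_adversarial(z):
    return np.linalg.norm(z - mu_adv) < np.linalg.norm(z - mu_clean)

acc_adv = np.mean([is_adversarial(z) for z in sparse_codes(50, shift=1.0)])
acc_clean = np.mean([not is_adversarial(z) for z in sparse_codes(50)])
```

只要攻击相关信号在稀疏潜空间中可分,很轻量的分类器即可完成检测,这正是该方法开销极小的原因。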
[NLP-57] SSP-based construction of evaluation-annotated data for fine-grained aspect-based sentiment analysis
【速读】: 该论文旨在解决电子商务评论中情感分析的精细化问题,特别是如何更准确地识别和提取目标(topic)、方面(aspect)及其对应的方面值(aspect value),从而实现更细致的基于方面的 sentiment 分析(Aspect-Based Sentiment Analysis, ABSA)。其关键解决方案在于构建了一个韩语标注语料库(Evaluation Annotated Dataset, EVAD),并采用半自动符号传播(Semi-Automatic Symbolic Propagation, SSP)方法进行标注,同时利用有限状态转换器(Finite-State Transducer, FST)形式化语言资源以支持细粒度的 ABSA 标注。此外,该研究扩展了传统 ABSA 框架,不仅包含主题与方面,还引入方面值,并根据其结构(单值、二元或多元)对 aspect-value pairs 进行分类,从而提升目标特征提取的准确性。在模型层面,使用 KoBERT 和 KcBERT 在 EVAD 上训练,取得了 F1 值分别为 0.88 和 0.90 的优异性能,验证了该方案的有效性。
链接: https://arxiv.org/abs/2605.07446
作者: Suwon Choi,Shinwoo Kim,Changhoe Hwang,Gwanghoon Yoo,Eric Laporte,Jeesun Nam
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We report the construction of a Korean evaluation-annotated corpus, hereafter called ‘Evaluation Annotated Dataset (EVAD)’, and its use in Aspect-Based Sentiment Analysis (ABSA) extended in order to cover e-commerce reviews containing sentiment and non-sentiment linguistic patterns. The annotation process uses Semi-Automatic Symbolic Propagation (SSP). We built extensive linguistic resources formalized as a Finite-State Transducer (FST) to annotate corpora with detailed ABSA components in the fashion e-commerce domain. The ABSA approach is extended, in order to analyze user opinions more accurately and extract more detailed features of targets, by including aspect values in addition to topics and aspects, and by classifying aspect-value pairs depending on whether values are unary, binary, or multiple. For evaluation, the KoBERT and KcBERT models are trained on the annotated dataset, showing robust performances of F1 0.88 and F1 0.90, respectively, on recognition of aspect-value pairs.
[NLP-58] Generating training datasets for legal chatbots in Korean
【速读】: 该论文旨在解决法律类对话系统(chatbot)在训练数据构建过程中面临的两大挑战:一是真实用户输入语料的多样性难以充分覆盖,二是大规模标注数据的成本过高。为应对这些问题,作者提出了一种基于本地语法图(Local Grammar Graph, LGG)的联合生成方法,其关键在于利用语言学资源构建结构化语义空间,在无需人工标注海量真实数据的前提下,自动生成高质量的带标签语句。具体而言,LGG通过捕捉领域内词汇与局部句法模式,并结合特定领域的意图分类体系,实现语句与标签的同步生成;实验中基于此方法生成了7亿条标注语句并训练DIET分类器,最终在韩国法律问答场景下达到91%的F1分数,验证了该方案的有效性。
链接: https://arxiv.org/abs/2605.07432
作者: Changhoe Hwang,Jee-Sun Nam,Eric Laporte
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Chatbots are robots that can communicate with humans using text or voice signals. Legal chatbots improve access to justice, since legal representation and legal advice by lawyers come with a high cost that excludes disadvantaged and vulnerable people. However, capturing the diversity of actual user input in datasets for deep-learning dialog systems (chatbots) is a technical challenge. Diversity requires large volumes of data, which must also be labelled in order to classify the user’s intent, while the cost of labelling datasets increases with volume. Instead of labelling large volumes of authentic data from users, our approach consists in jointly generating large volumes of utterances and high-quality labels. The generator of labelled datasets is based on language resources that take the form of local grammar graphs (LGG), which capture and generalize the vocabulary and local syntax observed by linguists in text. The LGGs associate labels to the utterances according to a domain-specific classification system. We tested this approach by implementing LIGA, a legal chatbot in Korean. The chatbot answers users’ conversational queries on legal situations by providing information on similar legal cases, made publicly available by the Korean government. We generated labelled utterances from the LGGs with the aid of the open-source Unitex platform. This process produced 700 million utterances. We trained a DIET classifier on a dataset made of these utterances, and the trained model reached 91% f1-score performance. We implemented a chatbot called LIGA, which uses the results of the model to select a link to a web page that documents similar legal cases.
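"语句与标签联合生成"的思路(而非 LGG/Unitex 形式化本身)可以用一个玩具模板示例说明(模板、槽位与标签均为虚构):

```python
import itertools

# 每个模板自带意图标签,槽位组合展开时语句与标签同步产生,无需事后标注。
templates = [
    ("What are my rights after {act} my {obj}?", "rights_question"),
    ("Can I sue someone for {act} my {obj}?", "lawsuit_question"),
]
acts = ["damaging", "losing"]
objs = ["rental deposit", "contract"]

dataset = [
    (tpl.format(act=a, obj=o), label)
    for (tpl, label), a, o in itertools.product(templates, acts, objs)
]
```

实际系统中,LGG 以有限状态图泛化词汇与局部句法,组合规模可达数亿条(文中为 7 亿);此处仅展示"生成即标注"的原理。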
[NLP-59] ChartREG: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
【速读】: 该论文旨在解决现有图表指代表达定位(chart referring expression grounding)基准存在的四大局限性:(1)主要依赖边界框(bounding boxes),限制了对细粒度图表元素的精确定位;(2)仅支持单目标或双目标引用,难以处理多实例目标;(3)语言表达过度依赖文本线索或数据排名信息,缺乏多样化的 grounding cues;(4)覆盖的图表类型单一。解决方案的关键在于构建一个系统性的新基准,支持多种定位形式、多目标引用、多样化 grounding cues 和多样化图表类型,并引入一种基于代码驱动的合成流程,利用绘图程序与渲染图表基元之间的内在对齐关系,生成像素级精确的实例掩码(instance masks)。在此基础上训练实例分割模型并集成至通用多模态定位框架中,显著提升了模型在真实图表场景下的泛化能力和定位精度。
链接: https://arxiv.org/abs/2605.07415
作者: Tianhao Niu,Ziyu Han,Qingfu Zhu,Wanxiang Che
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding-related benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume single or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.
[NLP-60] The Proxy Presumption: From Semantic Embeddings to Valid Social Measures ACL2026
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)在计算社会科学中广泛应用时所面临的“代理假设”(Proxy Presumption)问题,即研究者常直接依赖嵌入空间中的几何属性(如余弦距离)作为社会概念(如新颖性、创造力和偏见)的度量,而缺乏对这些度量是否真正反映目标构念(construct, C)的显式验证。其核心挑战在于,无监督表示往往混杂了目标构念与混淆变量(confounding attributes, Z),如主题、风格和作者特征,导致测量无效。解决方案的关键是提出结构化的构念效度协议(Construct Validity Protocol, CVP),该协议融合因果表示学习与心理测量学方法,提供从概念界定到定量验证的严谨流程,并引入反事实中和(Counterfactual Neutralization)技术,利用大语言模型(LLM)在嵌入空间中减少混淆因素影响,从而将启发式代理转化为科学可辩护的测量工具。
链接: https://arxiv.org/abs/2605.07409
作者: Baishi Li,Ta Yu,Kelvin J.L. Koa,Ke-Wei Huang
机构: National University of Singapore (新加坡国立大学); Asian Institute of Digital Finance (亚洲数字金融研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP)
备注: ACL 2026
Abstract:Natural Language Processing is rapidly evolving into a primary instrument for Computational Social Science, with researchers increasingly using embeddings to measure latent constructs such as novelty, creativity, and bias. However, this transition faces a fundamental validity challenge: the “Proxy Presumption,” or the reliance on geometric properties (e.g., cosine distance) as direct measures of social concepts. We argue that without explicit validation, unsupervised representations remain entangled mixtures of the target construct (C) and confounding attributes (Z) like topic, style, and authorship. To bridge the gap between semantic embeddings and valid social measures, we introduce the Construct Validity Protocol (CVP). Drawing on causal representation learning and psychometrics, the CVP offers a rigorous pipeline from conceptualization to quantitative verification. We further propose Counterfactual Neutralization, a novel method using LLMs to reduce confounding in embedding space. By providing a standardized Validity Suite – including tests for discriminant, incremental, and predictive validity – this work offers the community a toolkit to transform heuristic proxies into robust, scientifically defensible instruments.
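"代理假设"的风险可以用一个简单的混淆方差检验来直观说明(这只是通用的回归检验示意,并非论文 CVP 协议或反事实中和方法本身;数据为模拟):

```python
import numpy as np

# 模拟一个与混淆变量 Z(如文本长度)纠缠的代理度量:
# measure = 0.9*Z + 0.3*C + 噪声。对 Z 做回归即可看出该"度量"的
# 方差大部分可由混淆变量解释,并非目标构念 C 的有效度量。
rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=n)                  # 混淆变量
C = rng.normal(size=n)                  # 目标构念
measure = 0.9 * Z + 0.3 * C + 0.1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), Z])
beta, *_ = np.linalg.lstsq(X, measure, rcond=None)
resid = measure - X @ beta
r2_confound = 1 - resid.var() / measure.var()   # 接近 0.9:方差主要来自 Z
```

这类检验属于 CVP 所要求的最基础一环:若无监督代理的方差几乎可由 Z 重构,它就不能被当作 C 的测量使用。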
[NLP-61] Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts
【速读】: 该论文旨在解决多大语言模型(Large Language Models, LLMs)路由系统中“不可解性天花板”(unsolvability ceiling)的误判问题,即现有研究认为存在大量查询无法被任何模型解决,从而限制了路由优化空间。其关键解决方案在于通过大规模实证分析(206,000个查询-模型对)识别并量化三大评估伪影:(i)评判者对冗长输出的偏好偏差、(ii)固定生成预算下的截断问题、以及(iii)输出格式不匹配导致的错误判定。作者进一步提出分解框架以归因失败原因,并通过双评判验证与精确匹配锚定显著降低测得的不可解比例;同时揭示标准路由训练信号因这些伪影而失真,导致路由器退化为多数类预测(约79%最小层级最优),造成13–17个百分点的机会成本。最终,论文建议采用双评判验证、精确匹配锚定和成本敏感目标等可操作策略,以提升多LLM路由系统的可靠性和效率。
链接: https://arxiv.org/abs/2605.07395
作者: Saloni Garg,Amit Sagtani
机构: San Francisco State University (旧金山州立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 14 tables
Abstract:Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes routing headroom to an “unsolvability ceiling”, queries no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing failures to these artifacts, revealing consistent patterns across domains and model families. These artifacts also distort router training signals: standard routers collapse to majority-class prediction (~79% smallest-tier optimal), confirmed via random-feature and shuffled-label controls, incurring a 13-17 percentage point opportunity cost. We provide actionable recommendations including dual-judge validation, exact-match anchoring, and cost-sensitive objectives. Our findings suggest existing routing headroom estimates are substantially inflated, underscoring the need for reliable evaluation protocols in multi-LLM systems.
[NLP-62] Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study
【速读】: 该论文旨在解决自适应秩分配(Adaptive Rank Allocation)在监督微调(Supervised Fine-Tuning, SFT)中成功提升效率的问题是否可迁移至强化学习(Reinforcement Learning, RL)场景,特别是针对Group Relative Policy Optimization (GRPO) 方法的有效性。研究发现,将SFT中基于梯度幅度的非均匀秩分配策略直接应用于GRPO会导致性能下降(准确率降低4.5个百分点),其关键原因在于:第一,GRPO下的梯度分布比SFT更平坦(最大与最小层重要性比仅为2.17倍,远低于SFT中的10倍),表明所有层均携带有效梯度信号,不存在可被削减的“冗余”层;第二,非均匀分配会引发梯度放大效应,使重要性差异从2.17倍扩大至3.00倍,形成正反馈循环——高秩层持续吸收更多梯度,而低秩层逐渐被抑制,从而破坏模型训练的平衡性。因此,论文指出梯度重要性不能作为RL阶段参数容量分配的可靠依据,应避免简单套用SFT时代的秩分配策略。
链接: https://arxiv.org/abs/2605.07366
作者: Yash Ganpat Sawant
机构: 未知
类目: Computation and Language (cs.CL)
备注: 4 pages + references
Abstract:Adaptive rank allocation for LoRA, allocating more parameters to important layers and fewer to unimportant ones, consistently improves efficiency under supervised fine-tuning (SFT). We investigate whether this success transfers to reinforcement learning, specifically Group Relative Policy Optimization (GRPO). Using gradient-magnitude profiling on Qwen 2.5 1.5B with GSM8K, we find that it does not: proportional rank allocation degrades accuracy by 4.5 points compared to uniform allocation (70.0% vs. 74.5%), despite using identical parameter budgets. We identify two mechanisms behind this failure. First, the gradient landscape under GRPO is fundamentally flatter than under SFT, the max-to-min layer importance ratio is only 2.17x, compared to 10x reported in SFT literature. All layers carry meaningful gradient signal; none are truly idle. Second, we discover a gradient amplification effect: non-uniform allocation widens the importance spread from 2.17x to 3.00x, creating a positive feedback loop where high-rank layers absorb more gradient while low-rank layers are progressively silenced. Our results suggest that gradient importance does not predict capacity requirements under RL, and that naive transfer of SFT-era rank allocation to alignment training should be avoided.
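摘要中的梯度幅度剖析与秩分配可以粗略示意如下(梯度数值为虚构,仅复现文中 GRPO 下约 2 倍的最大/最小层重要性比;按比例分配的取整方式为假设):

```python
# 按梯度范数剖析分配 LoRA 秩:uniform=True 为均匀分配;
# 否则按重要性占比分配(max(1, round(...)) 的取整方式为假设)。
def allocate_ranks(importance, total_rank, uniform=False):
    if uniform:
        return {k: total_rank // len(importance) for k in importance}
    total = sum(importance.values())
    return {k: max(1, round(total_rank * v / total)) for k, v in importance.items()}

# 虚构的梯度范数剖析结果,最大/最小比约 2x(GRPO 下的"平坦"分布)
grads = {"layer.0": 1.0, "layer.1": 1.4, "layer.2": 1.8, "layer.3": 2.0}
uniform = allocate_ranks(grads, 64, uniform=True)   # 每层 16
proportional = allocate_ranks(grads, 64)            # 10 / 14 / 19 / 21
spread = max(grads.values()) / min(grads.values())  # 2.0,远低于 SFT 文献的 ~10x
```

在如此平坦的重要性分布下,按比例分配只会把秩从仍携带有效梯度的层中抽走,这正是文中观察到非均匀分配反而降低准确率的直观原因。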
[NLP-63] Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative ICML2026
【速读】: 该论文旨在解决当前广泛使用的均值池化余弦相似度(mean-pooled cosine similarity)在跨语言、跨模态和跨任务的神经表示比较中缺乏长度不变性(length-invariance)的问题。研究表明,在现代Transformer模型中,由于表征的各向异性(anisotropy),该度量会随序列长度单调增长,且与语义内容无关,从而导致对跨语言相似性的误判。其解决方案的关键在于引入长度不变的度量方法,如中心核对齐(Centered Kernel Alignment, CKA),实证表明CKA显著降低长度效应带来的解释方差(如在HumanEvalPack上减少83%),并逆转长度系数符号,从而提供更可靠的跨表示比较基础。论文主张将CKA等长度不变度量作为默认标准,并呼吁重新审视基于均值池化余弦得出的跨语言表征收敛结论。
链接: https://arxiv.org/abs/2605.07345
作者: Sibayan Mitra (1), Dhruv Kumar (1) ((1) BITS Pilani)
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 6 figures. Submitted to the Mechanistic Interpretability Workshop at ICML 2026
Abstract:Mean-pooled cosine similarity is the default metric for comparing neural representations across languages, modalities, and tasks. We establish that this metric is not length-invariant: under the anisotropy that characterizes modern transformer representations, mean-pooled cosine grows monotonically in sequence length, independent of representational content. Empirically, on HumanEvalPack across four code LLMs, the length ratio alone explains R^2 = 0.52–0.75 of cross-language “Python proximity,” while AST depth and shared-token fraction add less than 3% of explained variance beyond length. Substituting Centered Kernel Alignment (CKA) reduces explained variance by 83% and reverses the sign of the length coefficient (β_len: +0.86 → -0.37). The same pattern holds in Mistral-7B on parallel WMT pairs (R^2 = 0.23 EN-FR, R^2 = 0.33 EN-DE for cosine; R^2 < 0.01 for CKA). In CLIP ViT-B/32, mean-pooling reduces the length effect relative to EOS-pooling (R^2: 0.21 → 0.01), as predicted by the theory’s dependence on anisotropy. We argue that length-invariant metrics such as CKA should be the default for cross-representation comparisons, and that recent claims of cross-lingual representational convergence built on mean-pooled cosine warrant re-examination.
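作为对照,线性 CKA 的计算可以用 NumPy 写成几行(这是文献中标准的 linear CKA 形式;示例数据为随机生成,用于验证其对正交变换的不变性):

```python
import numpy as np

# 线性 CKA:对样本维去均值后,
# CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)。
def linear_cka(X, Y):
    X = X - X.mean(0, keepdims=True)
    Y = Y - Y.mean(0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))
R, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # 随机正交变换
same = linear_cka(X, X @ R)                      # 正交不变 -> 1.0
diff = linear_cka(X, rng.normal(size=(50, 16)))  # 不相关表征 -> 低分
```

与均值池化余弦不同,CKA 比较的是样本间的相似结构而非单个池化向量的夹角,因而不随序列长度的各向异性偏移而膨胀。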
[NLP-64] Activation Differences Reveal Backdoors: A Comparison of SAE Architectures IJCNN2026
【速读】: 该论文旨在解决生成式 AI(Generative AI)中语言模型的后门攻击(Backdoor Attack)检测难题,即如何通过机制可解释性方法识别出在特定触发模式下才会激活的恶意行为。其解决方案的关键在于使用差异型稀疏自编码器(Differential SAE, Diff-SAE)来捕捉后门相关的特征表示,而非依赖传统稀疏自编码器(如 Crosscoders)。研究发现,后门行为主要表现为神经元激活方向上的偏移而非稀疏激活模式,因此基于差分表示的 Diff-SAE 能更有效地隔离后门信号,在多种模型结构和微调策略下均表现出高精度(1.0)与零误报率,显著优于现有方法。
链接: https://arxiv.org/abs/2605.07324
作者: Sachin Kumar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted at IJCNN 2026 (IEEE WCCI). ©2026 IEEE
Abstract:Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures – Crosscoders and Differential SAEs (Diff-SAE) – for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context (“2024” triggers vulnerable code, “2023” triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.
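论文的核心结论之一是后门表现为激活空间中的方向性偏移而非稀疏特征激活。下面用一个玩具示例示意"基于差分方向打分"的检测思路(假设性示意,并非 Diff-SAE 的官方实现):

```python
def mean_diff_direction(base_acts, tuned_acts):
    """差分方向:微调模型与基础模型在同一批输入上激活的平均差向量。"""
    n, d = len(base_acts), len(base_acts[0])
    return [sum(t[j] - b[j] for b, t in zip(base_acts, tuned_acts)) / n
            for j in range(d)]

def shift_score(diff_vec, direction):
    """把单个样本的激活差投影到差分方向上,得到"方向性偏移"得分;
    带触发器的样本预期得分显著更高。"""
    return sum(a * b for a, b in zip(diff_vec, direction))
```

在此视角下,检测器关心的是激活差沿特定方向的位移量,而非哪些稀疏特征被激活。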
[NLP-65] LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在执行复杂推理任务时因链式思维(Chain-of-Thought, CoT)推理导致的高计算开销问题,即每一步中间推理都需要生成离散token,从而显著增加推理成本。同时,现有隐式推理方法(Latent Reasoning)虽能通过连续状态传播减少可见token数量,但可能损害需要符号验证的任务性能。解决方案的关键在于提出一种两阶段推理范式——潜然后显式推理(Latent-Then-Explicit Reasoning, LaTER),其核心机制是在第一阶段于连续潜空间中进行有限探索并保留潜藏键值缓存(KV cache),利用熵和模型原生停止token探测信号决定切换时机;第二阶段则切换至显式CoT进行验证与答案生成。该方法在无需训练的情况下即可实现token消耗降低16%-32%且保持或提升准确性,进一步通过监督数据集Latent-Switch-69K微调后,在AIME 2025基准上达到80.0%准确率(较标准CoT提升10.0个百分点),同时减少33% token使用量。
链接: https://arxiv.org/abs/2605.07315
作者: Xuan Li,Yining Wang,Yuchen Liu,Guanjun Liu,Delai Qiu,Shengping Liu,Jiaen Liang,Wei Huang,Jun Yu,Junnan Zhu
机构: University of Science and Technology of China (中国科学技术大学); Unisound AI Technology Co., Ltd (云知声人工智能科技有限公司); MAIS, Institute of Automation, Chinese Academy of Sciences (中科院自动化所媒体智能重点实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent-Then-Explicit Reasoning (LaTER), a two-stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training-free instantiation, LaTER projects final-layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model-native stop-token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3-14B, training-free LaTER reduces total token usage by 16%-32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine-tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at this https URL.
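LaTER 的训练自由版本用熵和模型原生停止 token 概率决定何时从潜空间切换到显式 CoT。下面是该切换判据的一个极简示意(阈值数值为假设,非论文给定):

```python
import math

def token_entropy(probs):
    """下一 token 分布的熵(nats),度量潜层滚动时的不确定性。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_switch_to_explicit(probs, entropy_threshold=1.0,
                              stop_prob=0.0, stop_threshold=0.5):
    """当潜空间探索的不确定性过高、或模型自带的停止 token 概率足够大时,
    退出潜空间、切换到显式 CoT 进行验证与答案生成。"""
    return token_entropy(probs) > entropy_threshold or stop_prob > stop_threshold
```

分布越尖锐(模型越"有把握"),熵越低,潜空间探索可继续;分布接近均匀时则触发切换。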
[NLP-66] Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse Order-Shuffling Chain-of-Thoughts
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 模型在推理过程中对密集、顺序依赖的思维链(Chain-of-Thought, CoT)的隐含假设问题,即认为每一步推理都必须按序执行且所有信息均不可或缺。其解决方案的关键在于通过系统性干预管道——移除(removal)、掩码(masking)、打乱(shuffling)和噪声注入(noise injection)——对三种模型在三个基准上的推理链进行扰动实验,揭示出答案提取实际上基于一种稀疏、顺序无关且结构鲁棒的信息子集,而非传统认知所假设的完整序列逻辑流。这一发现为实现并行化与token高效推理生成提供了理论基础与实践路径。
链接: https://arxiv.org/abs/2605.07307
作者: Yi-Chang Chen,Feng-Ting Liao,Da-shan Shiu,Hung-yi Lee
机构: MediaTek Research; Artificial Intelligence Center of Research Excellence, National Taiwan University
类目: Computation and Language (cs.CL)
备注:
Abstract:Modern reasoning language models generate dense, sequential chain-of-thought traces implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline–removal, masking, shuffling, and noise injection–applied to model-generated reasoning chains across three models and three benchmarks. Our findings are counterintuitive on three dimensions. Order: Does the sequential order of a reasoning chain matter for answer extraction? No–line-level shuffling reduces accuracy by less than 0.5 pp; word-level shuffling retains 62%-89% accuracy; only token-level shuffling collapses to near zero. Pretrained-only and instruction-tuned variants exhibit near-identical tolerance (78.67% vs. 78.00% under line shuffling), indicating order-independence originates from pretraining rather than reasoning-specific fine-tuning. Dense: Is all the information in a reasoning chain important for answer extraction? No–masking numeric digits collapses accuracy to exactly 0%, while masking alphabetic prose improves accuracy by 4.7 pp. Robustness: Is a reasoning chain that is both order-shuffling and non-dense still robust? Yes–the most aggressively reduced representation (all natural language removed, lines arbitrarily shuffled) still achieves 83% accuracy, and injecting false answers at 3x true-answer frequency leaves accuracy unchanged (83.3%-83.3%), falsifying a frequency-based extraction account. These results establish that answer extraction operates on a sparse, order-insensitive, and structurally robust informational substrate, opening paths toward parallelized and token-efficient reasoning generation.
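论文的干预管道(移除、掩码、打乱、噪声注入)可以用几行代码示意;以下为行级/词级打乱与数字掩码的最小实现(示意性,与论文实际管道的细节可能不同):

```python
import random
import re

def shuffle_lines(trace, seed=0):
    """行级打乱:保留每行内容,打乱行顺序。"""
    lines = trace.split("\n")
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)

def shuffle_words(trace, seed=0):
    """词级打乱:保留词的多重集,打乱词序。"""
    words = trace.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def mask_digits(trace, mask="#"):
    """掩码所有数字字符,对应论文中"掩码数值信息"的干预。"""
    return re.sub(r"\d", mask, trace)
```

论文发现数字掩码使准确率归零而字母文本掩码反而略有提升,说明答案提取主要依赖数值信息而非自然语言叙述。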
[NLP-67] MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在医疗诊断任务中普遍存在的局限性:现有评估多基于静态、单轮场景,即完整患者信息一次性提供,这与真实临床实践中需要逐步收集证据、动态更新鉴别诊断的“主动诊断”(active diagnosis)过程严重脱节。作者通过系统分析发现,当前LLMs存在三大失败模式:无依据的检验项目推荐、不可靠的诊断更新以及多轮对话中的连贯性下降,其根本原因在于训练数据仅覆盖从完整信息出发的推理,缺乏对部分证据下决策行为的学习。解决方案的关键在于提出MedAction——一种基于树状结构的知识蒸馏流程,通过LLM与环境交互生成高质量、多样化的多轮诊断轨迹,并引入两个知识图谱引导的指标进行质量过滤:疾病轨迹一致性(Disease Trajectory Consistency, DTC)确保假设向正确诊断收敛,推理-行动一致性(Reasoning-Action Consistency, RAC)验证信念更新由实证驱动。最终构建了包含32,681条轨迹的MedAction-32K数据集,微调8B模型后在MedR-Bench和自建的MedAction-300-Hard基准上达到开源模型最优性能。
链接: https://arxiv.org/abs/2605.07305
作者: Hsin-Ling Hsu,Zizheng Wang,Donghua Zhang,Nai-Chia Chen,Jerry Wang,Jun-En Ding,Chia-Hsuan Hsu,Guoan Wang,Feng Liu,Fang-Ming Hung,Chenwei Wu,Liyue Shen
机构: National Chengchi University (国立政治大学); Georgetown University (乔治城大学); University of Michigan (密歇根大学); Stevens Institute of Technology (史蒂文斯理工学院); National Taiwan University of Science and Technology (台湾科技大学); Far Eastern Memorial Hospital (远东纪念医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Most existing LLM diagnoses are evaluated on static, single-turn settings where complete patient information is provided upfront, an oversimplification of real clinical practice. We study active diagnosis: the real-life clinical process of starting from initial observation, ordering tests, interpreting results, and updating a differential diagnosis across multiple turns. Through systematic analysis, we identify three recurring failure modes in current LLMs: ungrounded test ordering, unreliable diagnostic update, and degraded multi-turn coherence. Together, these failures reveal a core deficit: existing medical training data teaches models to reason from complete information but not to act under evolving, partial evidence. To address this gap, we introduce MedAction, a tree-structured distillation pipeline that synthesizes diverse and high-quality multi-turn diagnostic trajectories via LLM-environment interaction. We propose two knowledge-graph-grounded metrics to filter trajectory quality: Disease Trajectory Consistency (DTC), which tracks whether the model’s hypothesis converges toward the correct diagnosis, and Reasoning-Action Consistency (RAC), which verifies that belief updates are driven by gathered evidence. Using this pipeline, we construct MedAction-32K, a dataset of 32,681 trajectories from 2,896 PMC cases. Fine-tuning an 8B model on MedAction-32K achieves state-of-the-art performance among open-source models on both MedR-Bench and our curated MedAction-300-Hard benchmark, pushing the edge for open-source medical LLMs.
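论文提出的疾病轨迹一致性(DTC)用于刻画诊断假设是否向正确诊断收敛。摘要未给出精确公式,下面仅是一个假设性简化:把 DTC 取为相邻两轮中正确诊断名次"不变差"的比例:

```python
def disease_trajectory_consistency(ranks):
    """ranks[t]:第 t 轮时正确诊断在鉴别诊断列表中的名次(1 为首位)。
    这里将 DTC 简化为名次单调不升的步数占比,刻画假设是否收敛;
    论文中的正式定义可能不同,此处仅作示意。"""
    if len(ranks) < 2:
        return 1.0
    steps = list(zip(ranks, ranks[1:]))
    good = sum(1 for prev, cur in steps if cur <= prev)
    return good / len(steps)
```

在该简化下,名次从 5 到 1 单调收敛的轨迹得分为 1.0,而中途出现倒退的轨迹得分下降,可直接用于轨迹质量过滤。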
[NLP-68] On the Complexity of the Matching Problem of Regular Expressions with Backreferences
【速读】: 该论文旨在解决带有反向引用(backreferences)的正则表达式(Regular Expressions with Backreferences, REWBs)的字符串匹配问题的细粒度复杂性,核心目标是探究是否存在可证明高效(近线性时间)的匹配算法。研究发现,对于一般的 k-REWBs,该问题在参数化复杂度下属于 W[2]-hard,且在标准假设(SETH)下无法在 O(n2k−ϵ) 时间内求解;特别地,对于 2-use 2-REWBs,若存在 n1+o(1) 时间算法,则三角形检测问题也可在相同时间内解决。针对可 tractable 的情形,作者提出了一种 O(nlog2n) 时间的算法用于 1-use REWBs,显著优于此前 O(n2) 的算法,其关键在于融合后缀树(suffix trees)、正则表达式的转移幺半群(transition monoids of REGEXes)、因子森林(factorization forest)数据结构以及字符串周期性(periodicity of strings)等技术。
链接: https://arxiv.org/abs/2605.07289
作者: Soh Kumabe,Yuya Uezato
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL)
备注: Full version of ICALP 2026; The abstract field is slightly shorter than that in the paper due to arXiv’s length limit
Abstract:ReDoS is a well-known type of algorithmic complexity attack, where an adversary supplies maliciously crafted strings to a regular expression matching engine, aiming to exhaust computational resources of systems. Even quadratic-time behavior in matching engines has been exploited in successful attacks, as exemplified by major outages at Stack Overflow (2016) and Cloudflare (2019). These incidents motivate a fundamental question: Is it possible to construct matching engines that are provably efficient, running in (near-)linear time in the length of the input string? For classical regular expressions (REGEX), Thompson’s construction yields a linear-time algorithm. However, practical engines support powerful features such as backreferences, which strictly extend the expressive power of REGEX but unfortunately increase the risk of ReDoS attacks. This paper investigates the fine-grained complexity of the string matching problem for regular expressions with backreferences (REWBs). Specifically, we consider r-use k-REWBs. On the hardness side, we show that the string matching problem for k-REWBs cannot be solved in O(n^{2k-\epsilon}) time for any \epsilon > 0 under SETH. We also prove that this problem is \textbf{W[2]}-hard when parameterized by the length of the REWB expression, strengthening the previous \textbf{W[1]}-hardness. Moreover, we prove that this problem for 2-use 2-REWBs cannot be solved in n^{1+o(1)} time unless the triangle detection problem can be solved in that time. On the algorithmic side, we present an O(n \log^2 n)-time algorithm for 1-use REWBs, which significantly improves upon the recent O(n^2)-time algorithm by Nogami and Terauchi (MFCS, 2025). Our algorithm employs several techniques including suffix trees, transition monoids of REGEXes, factorization forest data structures, and periodicity of strings.
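反向引用使正则表达式能表达非正则语言。Python 的 re 引擎支持反向引用,下面用一个 1-use 模式(捕获组被 \1 引用一次)演示"平方串"语言 ww,该语言无法用经典正则表达式表示:

```python
import re

# ^(.+)\1$ 匹配形如 ww 的"平方串";捕获组出现一次、被 \1 引用一次,
# 即论文意义下的 1-use REWB(示意,非论文形式化记法)
square = re.compile(r"^(.+)\1$")
```

回溯引擎处理此类模式时需尝试所有可能的组长度,这正是 REWB 匹配代价高于经典正则的直观来源;论文为 1-use 情形给出了 O(n \log^2 n) 的算法。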
[NLP-69] Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)层剪枝(layer pruning)过程中常出现的突发性能崩溃问题。现有基于表示(representation-based)的分析方法难以解释这一现象的内在机制。论文的关键解决方案是引入决策表征(decision representation)视角,提出两个核心指标——决策裕度(Decision Margin)和选项频率(Option Frequency),并设计一种迭代剪枝(Iterative Pruning)方法来刻画层级决策动态。研究发现,网络存在一个明显的决策跃迁点,将模型行为划分为“静默阶段”(Silent Phase)和“决定阶段”(Decisive Phase):前者无法预测正确答案,后者则实现准确决策;剪枝仅在静默阶段引发性能崩溃,而决定阶段的剪枝影响甚微,说明性能崩溃的本质在于破坏了静默阶段中关键的决策跃迁过程。
链接: https://arxiv.org/abs/2605.07271
作者: Boyu Shi,Chang Liu,ChuanBao Gao,Xu Yang,Xin Geng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Layer pruning efficiently reduces Large Language Model (LLM) computational costs but often triggers sudden performance collapse. Existing representation-based analyses struggle to explain this mechanism. We propose studying pruning through decision representation. Focusing on multiple-choice tasks, we introduce two metrics, Decision Margin and Option Frequency, and an Iterative Pruning method to analyze layer-wise decision dynamics. Our findings reveal a sharp decision transition that partitions the network into two stages: a Silent Phase, where the model cannot yet predict the correct answer, and a Decisive Phase, where the correct prediction emerges. We also find that pruning the Decisive Phase has minimal impact, whereas pruning the Silent Phase triggers immediate performance collapse, highlighting its extreme sensitivity to structural changes. Therefore, we conclude that pruning-induced collapse stems from disrupting the Silent Phase, which prevents the critical decision transition from occurring.
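论文引入的两个指标可以直观地示意如下(具体定义以论文为准,此处为常见写法的假设性实现):决策裕度为正确选项 logit 与其余选项最大 logit 之差,选项频率统计各层 argmax 选项的分布:

```python
from collections import Counter

def decision_margin(option_logits, correct_idx):
    """决策裕度:正确选项 logit 减去其余选项的最大 logit;
    大于 0 表示当前层已能做出正确决策(示意性定义)。"""
    others = [v for i, v in enumerate(option_logits) if i != correct_idx]
    return option_logits[correct_idx] - max(others)

def option_frequency(layerwise_choices):
    """选项频率:统计逐层(或逐剪枝配置)argmax 选项的出现频率。"""
    counts = Counter(layerwise_choices)
    total = len(layerwise_choices)
    return {opt: c / total for opt, c in counts.items()}
```

按这种刻画,"静默阶段"对应裕度持续为负、选项分布散乱的层段,"决定阶段"对应裕度转正且选项稳定的层段。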
[NLP-70] MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen – TF-IDF Hybrid and Meta-Ensemble Learning
【速读】: 该论文旨在解决检索增强型和工具使用型大语言模型(Large Language Models, LLMs)中持续存在的间接提示注入(Indirect Prompt Injection, IPI)问题,尤其在多语言场景下该问题更难识别与防御。解决方案的关键在于提出一个名为MIPIAD的防御框架,其核心由三部分组成:基于LoRA微调的Qwen2.5-1.5B序列分类器(XLPID)、TF-IDF词法特征提取以及通过晚期融合、堆叠和梯度提升进行验证调优的集成策略。实验表明,该混合方法在英语和孟加拉语两个语言上均显著提升了检测性能(F1达0.9205,AUROC达0.9378),且集成方法有效缩小了跨语言性能差距,同时具备良好的可扩展性,支持NLLB-200覆盖的200余种语言。
链接: https://arxiv.org/abs/2605.07269
作者: Al Muhit Muhtadi,Mostafa Rifat Tazwar
机构: Bangladesh University of Engineering and Technology (孟加拉国工程技术大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Indirect prompt injection remains a persistent weakness in retrieval-augmented and tool-using LLM systems, and the problem becomes harder to characterise in multilingual settings. We present MIPIAD, a defense framework evaluated on English and Bangla that combines a sequence classifier fine-tuned from Qwen2.5-1.5B via LoRA (XLPID), TF-IDF lexical features, and validation-tuned ensembling through late fusion, stacking, and gradient boosting. The framework is evaluated on a synthetic benchmark built from BIPIA (Yi et al., 2023) templates spanning five task families – email, table, QA, abstract, and code – comprising over 1.43 million generated samples, with train and test splits using mutually exclusive attack categories. Across the experiments, lexical signals prove strong (TF-IDF+SVM F1=0.77), and the hybrid XLPID+TF-IDF ensemble achieves the best overall F1 (0.9205) while the Boosting Ensemble achieves the best AUROC (0.9378). Ensemble methods consistently reduce the English-Bangla cross-lingual gap relative to standalone neural models. The pipeline is designed for extensibility: NLLB-200 supports over 200 languages and XLPID’s multilingual backbone can be retargeted to additional languages without architectural changes; empirical validation is currently limited to English and Bangla.
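TF-IDF 词法特征是该防御框架中的强基线信号。下面是平滑 idf 版 TF-IDF 的一个纯 Python 极简实现(实际系统通常直接使用成熟库,此处仅示意计算过程):

```python
import math
from collections import Counter

def tfidf_features(docs):
    """极简 TF-IDF(平滑 idf:log((1+N)/(1+df)) + 1),
    返回每篇文档的 {词: 权重} 稀疏特征字典。"""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))          # 每篇文档每个词只计一次
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    feats = []
    for toks in tokenized:
        tf = Counter(toks)
        feats.append({w: (cnt / len(toks)) * idf[w] for w, cnt in tf.items()})
    return feats
```

注入攻击中典型的罕见指令词(如 "ignore")会获得比常见任务词更高的权重,这解释了为何纯词法特征就能达到 F1=0.77 的水平。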
[NLP-71] From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
【速读】: 该论文旨在解决多选题推理基准面临的两大挑战:一是随着模型性能提升导致的快速饱和问题,二是数据污染对静态评估有效性的影响。现有临时加固方法(如改写、扰动)虽试图提高难度,但往往以牺牲逻辑有效性为代价,难以真正挑战先进的推理模型。其解决方案的关键在于提出LogiHard框架——一个形式化的方法,将0阶选择任务转化为2阶逻辑判断任务,显著增加思维复杂度和推理步骤;同时结合项目反应理论(Item Response Theory, IRT)实现计算机化自适应测试(Computerized Adaptive Testing, CAT),从而在更少题目下实现精确难度控制。通过认知排序高价值考试题目并进行组合变换,作者构建了LogiHard-2k数据集,实验证明该方法能有效引发大语言模型(LLM)准确率下降31%–56%,揭示出模型存在多选失败与提前退出偏差,且这种退化并非知识不足所致,而是源于组合推理能力缺失,即训练诱导的完整性验证缺陷。
链接: https://arxiv.org/abs/2605.07268
作者: Hanmeng Liu,Shichao Weng,Xiulai Liu,Zhicai Zhang,Anli Yan,Xiaozhang Liu
机构: Hainan University (海南大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, falling short to challenge advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the thinking overhead and reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals an accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs suffer from the multi-select failure and early exit bias, which are not shared by human testees. Zero-shot transfer to MMLU demonstrates 47% accuracy degradation (89.84% to 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degeneration is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.
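摘要未给出从 0 阶选择到 2 阶判断的具体变换形式;以下仅是该思路的一个假设性示意:把"选出正确项"改写为对选项对上复合命题真值的判断题,题量与推理负担随选项数组合式增长:

```python
from itertools import combinations

def harden_mcq(options, correct_set):
    """把一道选择题组合式加固为一组二阶判断题:
    对每一对选项 (a, b),判断复合命题"a 正确且 b 错误"的真值。
    仅为示意;LogiHard 的正式变换保证逻辑有效性,形式可能不同。"""
    items = []
    for a, b in combinations(options, 2):
        claim = f"'{a}' is correct AND '{b}' is incorrect"
        label = (a in correct_set) and (b not in correct_set)
        items.append((claim, label))
    return items
```

这种变换不改变题目的底层知识需求,却要求模型对每个复合命题完整验证两个分量,正对应论文指出的"完整性验证缺陷"。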
[NLP-72] When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)语言模型中路由策略有效性评估不足的问题,即标准的 top-k 路由器在训练过程中是否真正选择了最优路径缺乏直接验证。研究发现,尽管在高置信度 token 上标准路由与路由效用高度一致,但在驱动复杂推理的脆弱 token 上,存在更低损失的等计算量替代路由路径,但这些路径未被选中。这一现象在多个主流 MoE 模型(如 Qwen3-30B-A3B、GPT-OSS-20B 等)中均稳定出现,其根源在于标准训练目标仅对执行路径评分,且负载均衡依赖全局统计而非局部最优性。解决方案的关键在于:仅更新最终层路由器(其余专家及所有其他路由器冻结),即可显著提升 AIME 2024+2025 和 HMMT 2025 推理任务上的 pass@K 性能,表明问题主要源于路由器可触及的分配错误,而非专家容量限制。
链接: https://arxiv.org/abs/2605.07260
作者: Youngsik Yoon,Siwei Wang,Wei Chen,Jungseul Ok
机构: POSTECH(韩国浦项科技大学); Microsoft Research Asia(微软亚洲研究院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Mixture-of-Experts (MoE) language models route each token to a small subset of experts, but whether the routes selected by a trained top- k router are good ones is rarely evaluated directly. Holding the model fixed, we compare each standard route against sampled equal-compute alternatives for the same token and score each by the next-token probability it assigns to the realized token in a verified reasoning trajectory. The result is sharply token-conditional: the standard router is well-aligned with route utility on confident tokens but uninformative on the fragile tokens that drive hard reasoning, where lower-loss equal-compute routes consistently exist inside the frozen model but are not selected. The same pattern holds across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, and follows structurally from how standard top- k training evaluates routing decisions: the language modeling loss scores only the executed route, and load balancing depends only on aggregate routing statistics. A minimal router-only update to the final-layer router, leaving every expert and every other router frozen, is sufficient to shift pass@K on AIME 2024+2025 and HMMT 2025 for both Qwen3-30B-A3B and GPT-OSS-20B, suggesting that at least part of the failure reflects router-reachable misallocation rather than expert capacity alone.
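论文把标准 top-k 路由与同 token 的等计算量替代路由逐一比较。对小规模专家数可以直接枚举所有 k-子集并按给定效用打分(论文对大模型采用采样,此处的枚举与 score_fn 接口均为示意):

```python
from itertools import combinations

def best_equal_compute_route(num_experts, k, score_fn):
    """枚举所有等计算量路由(k 个专家的子集),
    score_fn(route) 给出该路由下真实下一 token 的概率,返回最优路由。"""
    routes = list(combinations(range(num_experts), k))
    return max(routes, key=score_fn)

def router_is_aligned(standard_route, num_experts, k, score_fn):
    """标准 top-k 路由是否已是等计算量路由中效用最高者。"""
    best = best_equal_compute_route(num_experts, k, score_fn)
    return tuple(sorted(standard_route)) == best
```

论文的发现即:在高置信 token 上该检查通常通过,而在脆弱 token 上冻结模型内部存在效用更高却未被选中的等计算量路由。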
[NLP-73] PaT: Planning-after-Trial for Efficient Test-Time Code Generation ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在测试阶段计算资源分配效率低下的问题,尤其是现有基于“先规划后尝试”(Planning-before-Trial, PbT)策略的方法会在简单问题上引入不必要的规划开销。其解决方案的关键在于提出一种“先尝试后规划”(Planning-after-Trial, PaT)的自适应策略:仅在生成结果验证失败时触发规划模块,从而实现计算资源的动态优化配置。该方法支持异构模型架构——由低成本模型执行初始生成任务,仅在必要时调用高性能模型进行针对性规划干预,显著提升了成本-性能权衡(cost-performance Pareto frontier),实验证明其在多个基准测试中以约69%的推理成本降低实现了与大型同质模型相当的性能。
链接: https://arxiv.org/abs/2605.07248
作者: Youngsik Yoon,Sungjae Lee,Seockbean Song,Siwei Wang,Wei Chen,Jungseul Ok
机构: POSTECH(浦项科技大学); Microsoft Research Asia(微软亚洲研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to ACL 2026 main conference
Abstract:Beyond training-time optimization, scaling test-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models (LLMs). However, most existing methods adopt a rigid Planning-before-Trial (PbT) policy, which inefficiently allocates test-time compute by incurring planning overhead even on directly solvable problems. We propose Planning-after-Trial (PaT), an adaptive policy for code generation that invokes a planner only upon verification failure. This adaptive policy naturally enables a heterogeneous model configuration: a cost-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost-performance Pareto frontier. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69%.
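PaT 的"先尝试、失败才规划"控制流可以用一个简短的循环示意(generate/verify/plan 为调用方提供的回调,具体接口为假设,非论文官方实现):

```python
def pat_solve(problem, generate, verify, plan, max_rounds=3):
    """Planning-after-Trial:先用低成本模型直接尝试生成;
    仅当验证(如单元测试)失败时才调用强模型做针对性规划、再重试。"""
    attempt = generate(problem, plan=None)   # 廉价的首次尝试,零规划开销
    for _ in range(max_rounds):
        if verify(attempt):
            return attempt                   # 简单问题在此直接返回
        guidance = plan(problem, attempt)    # 失败后才产生规划成本
        attempt = generate(problem, plan=guidance)
    return attempt
```

与 PbT 相比,规划开销只在验证失败时发生,这正是异构配置(廉价生成模型 + 强规划模型)能推进成本-性能帕累托前沿的原因。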
[NLP-74] Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
【速读】: 该论文旨在解决异构大语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)后训练过程中难以有效共享经验的问题。传统方法受限于不同模型间参数、目标函数和分词器(tokenizer)的不兼容性,导致经验交换效率低下甚至不可行。解决方案的关键在于提出相互增强学习(Mutual Reinforcement Learning)框架,其核心组件包括:共享经验交换(Shared Experience Exchange, SEE)、多工作资源分配(Multi-Worker Resource Allocation, MWRA)以及分词器异质层(Tokenizer Heterogeneity Layer, THL),其中THL通过重分词(retokenization)实现跨词汇表的token级轨迹对齐。该设计使经验共享机制在不同模型家族之间具备可操作性,并进一步通过三个受控探针——数据级滚动共享(Peer Rollout Pooling, PRP)、价值级优势共享(Cross-Policy GRPO Advantage Sharing, XGRPO)与结果级成功转移(Success-Gated Transfer, SGT)——系统性地探索了经验共享的结构特性。研究表明,结果级共享(SGT)在稳定性和支持度之间的权衡中占据最优位置,从而验证了该框架的有效性与实用性。
链接: https://arxiv.org/abs/2605.07244
作者: Xiaoze Liu,Dhananjay Ram,Yuting Zhang,Zhaoyang Zhang,Wei Xia,Stefano Soatto
机构: Purdue University (普渡大学); AWS Agentic AI (亚马逊云科技代理智能AI)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 50 pages, 10 figures, 14 tables
Abstract:We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Worker Resource Allocation (MWRA), and a Tokenizer Heterogeneity Layer (THL) that retokenizes text and aligns token-level traces across incompatible vocabularies. This substrate makes the experience-sharing design question operational across model families. We instantiate three controlled probes on top of GRPO: data-level rollout sharing via Peer Rollout Pooling (PRP), value-level advantage sharing via Cross-Policy GRPO Advantage Sharing (XGRPO), and outcome-level success transfer via Success-Gated Transfer (SGT). A contextual-bandit analysis characterizes their structural positions on a stability-support trade-off: PRP pays density-ratio variance and THL residual costs, XGRPO preserves on-policy actor support while changing scalar baselines, and SGT supplies a rescue-set score direction toward verified peer successes. In the evaluated regime, outcome-level sharing occupies the favorable point of this trade-off.
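分词器异质层(THL)需要把不同词表对同一文本的切分按 token 粒度对齐。一种常见做法是按字符区间的重叠来对齐,下面是该思路的示意(假设文本为 token 的直接拼接,非论文官方实现):

```python
def char_spans(tokens):
    """假设文本为 token 的直接拼接,给出每个 token 的字符区间 [start, end)。"""
    spans, pos = [], 0
    for t in tokens:
        spans.append((pos, pos + len(t)))
        pos += len(t)
    return spans

def align_traces(tokens_a, tokens_b):
    """两套词表对同一文本的切分按字符区间重叠对齐,
    返回 {A 的 token 下标: [与之重叠的 B 的 token 下标]}。"""
    sa, sb = char_spans(tokens_a), char_spans(tokens_b)
    mapping = {}
    for i, (a0, a1) in enumerate(sa):
        mapping[i] = [j for j, (b0, b1) in enumerate(sb)
                      if max(a0, b0) < min(a1, b1)]
    return mapping
```

有了这种对齐,token 级的轨迹信号(如逐 token 的优势值)才能在词表不兼容的模型之间传递。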
[NLP-75] SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
【速读】: 该论文旨在解决生成式 AI (Generative AI) 推理过程中因草稿生成(drafting)效率低下导致的延迟问题。现有方法分为自回归草稿器(如 EAGLE-3)和并行草稿器两类,前者虽能保持路径依赖但调用次数多、开销大;后者虽减少调用次数,但各位置独立预测导致验证失败率高。解决方案的关键在于提出 SpecBlock,一种基于块迭代(block-iterative)的草稿机制:每个块(block)内通过层间状态迁移(layer-wise shift)维持局部路径依赖,跨块则通过继承前一区块任意位置的状态扩展路径,从而在降低草稿成本的同时保证草稿质量;此外,引入协同训练的排名头(rank head)动态分配分支策略,并结合有效前缀掩码(valid-prefix mask)优化训练目标,最终实现更高效且准确的草稿生成。
链接: https://arxiv.org/abs/2605.07243
作者: Weijie Shi,Qiang Xu,Fan Deng,Yaguang Wu,Jiarun Liu,Yehong Xu,Hao Chen,Jia Zhu,Jiajie Xu,Xiangjun Huang,Jian Yang,Xiaofang Zhou
机构: Hong Kong University of Science and Technology (香港科技大学); MetaX; Zhejiang Normal University (浙江师范大学); Soochow University (苏州大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, making drafting a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting multiple future positions in one forward, but each position is predicted without seeing the others, producing paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward produces K dependent positions and we call this a block. The draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within each block, a layer-wise shift carries the previous position’s hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree by allocating per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier one is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and cost-aware adaptation extends this lead to 11-19%.
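投机解码的核心验证步骤是:目标模型一次前向后,接受与其贪心预测一致的最长草稿前缀。下面是该验证逻辑的极简示意(贪心验证的通用写法,非 SpecBlock 官方实现):

```python
def verify_draft(draft_tokens, target_greedy):
    """target_greedy[i]:目标模型在接受前 i 个草稿 token 后贪心选出的
    下一 token(长度通常为 len(draft_tokens)+1)。
    返回 (被接受的前缀长度, 目标模型给出的修正 token)。"""
    accepted = 0
    for d, t in zip(draft_tokens, target_greedy):
        if d != t:
            break
        accepted += 1
    correction = target_greedy[accepted] if accepted < len(target_greedy) else None
    return accepted, correction
```

草稿质量越高(路径依赖保持得越好),被接受的前缀越长,这正是 SpecBlock 在块内用层间状态迁移维持路径依赖的动机。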
[NLP-76] MEMOREPAIR: Barrier-First Cascade Repair in Agent ic Memory
【速读】: 该论文旨在解决代理记忆(agentic memory)在任务演化过程中因源数据变更或失效而导致的级联更新问题(cascade update problem),即当原始记忆项被删除、修正或因工具/接口迁移而失效时,由其派生出的多个后代状态可能仍保留在系统中并产生过时影响,从而误导后续决策。解决方案的关键在于提出MemoRepair机制,这是一种“屏障优先”的级联修复协议:在修复事件发生时,首先撤回受影响的无效后代状态,然后基于保留的支持信息与当前接口下已修复的前驱节点构建有效的新 successor 状态,并通过仅允许闭包前驱验证后的 successor 再发布来确保一致性。该机制将修复选择问题转化为一个可精确求解的最大权重前驱闭包问题,利用单次 s-t 最小割(s-t min-cut)实现最优修复路径,显著降低了无效记忆暴露率并提升了修复效率。
链接: https://arxiv.org/abs/2605.07242
作者: Yang Zhao,Chengxiao Dai,Mengying Kou,Yue Xiu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Agentic memory evolves across tasks into durable derived artifacts: summaries, cached outputs, embeddings, learned skills, and executable tool procedures. When a source artifact is deleted, corrected, or invalidated by tool or API migration, descendants derived from that source can remain visible and steer future actions with stale support. We formalize this failure mode as the cascade update problem, where repair targets the visible derived state of the memory store. We present MemoRepair, a barrier-first cascade-repair contract for agentic memory. A repair event induces a controlled transition from invalidated descendant state to validated successor state: affected descendants are withdrawn before repair, successors are constructed from retained support and staged repaired predecessors under the current interface, and republication is restricted to validated predecessor-closed successors. This contract induces a scalarized repair-selection problem for a fixed repair-cost tradeoff. We show that the induced publication problem reduces to maximum-weight predecessor closure and can be solved exactly by a single s-t min-cut. Experiments on ToolBench and MemoryArena show that, with complete influence provenance, MemoRepair reduces invalidated-memory exposure from 69.8-94.3% under systems without cascade repair to 0%. Compared with exhaustive Repair all, it recovers 91.1-94.3% of validated successors while reducing normalized repair-operator cost from 1.00 to 0.57-0.76.
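论文将再发布选择归约为最大权重前驱闭包问题(论文用一次 s-t 最小割精确求解)。对极小的依赖图,可以直接暴力枚举来说明"前驱闭包"约束的含义(示意,非论文的最小割实现):

```python
from itertools import combinations

def max_weight_predecessor_closure(weights, preds):
    """最大权重前驱闭包:子集 S 合法当且仅当 S 中每个节点的
    全部前驱也在 S 中;返回 (最优子集, 总权重)。
    小图暴力枚举仅作示意;实际用 s-t min-cut 可多项式时间精确求解。"""
    nodes = list(weights)
    best, best_w = frozenset(), 0.0
    for r in range(1, len(nodes) + 1):
        for subset in combinations(nodes, r):
            s = set(subset)
            if all(set(preds.get(v, ())) <= s for v in s):
                w = sum(weights[v] for v in s)
                if w > best_w:
                    best, best_w = frozenset(s), w
    return best, best_w
```

闭包约束正对应修复协议中"successor 只有在其全部前驱都通过验证后才能再发布"的要求:负权节点(修复代价高的前驱)只有在其后代收益足够时才会被纳入。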
[NLP-77] Teaching Language Models to Think in Code
【速读】: 该论文旨在解决当前工具集成推理(Tool-integrated Reasoning, TIR)范式在数学问题求解中面临的三大局限:代码常作为事后验证器、中间自然语言(Natural Language, NL)计算易出错,以及NL与代码角色重叠而非分工明确。其解决方案的关键在于提出ThinC(Thinking in Code)框架,该框架将代码本身作为核心推理引擎,而非由NL调用的工具;具体而言,ThinC先通过简短的NL规划步骤启动推理流程,随后所有推理过程均通过仅以执行输出连接的代码块完成,从而实现更可靠、结构清晰的推理路径。
链接: https://arxiv.org/abs/2605.07237
作者: Hyeon Hwang,Jiwoo Lee,Jaewoo Kang
机构: Korea University (高丽大学); AIGEN Sciences (人工智能科学公司)
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
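ThinC 轨迹的形态可以用一个玩具执行器来体会:先给出简短的自然语言规划,之后推理完全由代码块承担。下面的示意用共享命名空间中的 `result` 变量近似"块与块仅以执行输出衔接"这一设定,属于假设性简化,并非论文实现:

```python
def run_thinc_trajectory(plan, code_blocks):
    """ThinC 式轨迹的玩具执行器(假设性简化): 先记录简短规划,
    之后的推理完全由依次执行的代码块承担。"""
    ns = {}
    trace = [f"[plan] {plan}"]
    for code in code_blocks:
        exec(code, ns)  # 执行一个推理代码块
        trace.append(f"[exec] result = {ns.get('result')}")
    return ns.get("result"), trace

answer, trace = run_thinc_trajectory(
    "先算 12 箱每箱 7 件的总件数, 再加上散装 16 件",
    ["result = 12 * 7", "result = result + 16"])
```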
[NLP-78] Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本推理过程中因键值缓存(Key-Value Cache, KV Cache)持续增长而导致的显著内存与运行时开销问题。现有KV缓存淘汰策略主要依赖局部注意力权重,忽略了值表示(value representations)、输出投影(output projection)以及跨头交互(inter-head interactions)的影响,导致淘汰决策不够准确。解决方案的关键在于将传统的基于头级别的权重平均淘汰机制重构为一种面向输出的、逐层矩阵乘法近似问题,并提出LaProx方法,通过显式建模注意力图与投影值状态之间的乘法交互关系,精准量化token贡献度,同时考虑跨头依赖性;在此基础上,首次设计出统一的淘汰策略,为所有token分配全局可比的重要性分数,从而实现全模型范围内的最优选择而非局部头级决策,显著提升压缩效率与性能保持能力。
链接: https://arxiv.org/abs/2605.07234
作者: Tho Mai,Joo-Young Kim
机构: KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) support long-context inference but suffer from substantial memory and runtime overhead due to Key-Value (KV) Cache growth. Existing KV Cache eviction methods primarily rely on local attention weights, neglecting the influence of value representations, output projection, and inter-head interactions. In this work, we reformulate KV Cache eviction from a conventional head-wise, weight-averaging approach into an output-aware, layer-wise matrix multiplication approximation problem. We introduce LaProx, a novel eviction strategy that explicitly models the multiplicative interaction between attention maps and projected value states to accurately quantify token contributions while accounting for inter-head dependencies. Building on this metric, we propose the first unified eviction strategy that assigns globally comparable importance scores to tokens, enabling model-wide selection instead of local, head-wise decisions. Experimental results across 19 datasets on long-context benchmarks LongBench and Needle-In-A-Haystack demonstrate that our approach maintains model performance with only 5% of the KV cache and consistently outperforms prior works across all configurations. Notably, our method achieves up to 2 \times accuracy loss reduction under extreme compression scenarios compared to existing state-of-the-art baselines with minimal overhead.
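LaProx 的核心是把逐头的注意力权重平均替换为"注意力图 × 值状态 × 输出投影"的乘法贡献度量,并在全模型范围内统一选 token。下面用 NumPy 给出该打分思路的一个极简示意(张量形状与打分细节为假设,非论文实现):

```python
import numpy as np

def output_aware_scores(attn, values, w_o):
    """输出感知的 key token 重要性打分示意(假设性实现)。
    attn: (H, Q, K) 注意力图; values: (H, K, Dh) 值状态;
    w_o: (H*Dh, D) 输出投影。对每个 key token j, 将其经各头 value
    与输出投影后的贡献跨头求和, 以贡献矩阵的 Frobenius 范数作为分数,
    从而同时考虑注意力、value 表示与跨头交互。"""
    H, Q, K = attn.shape
    Dh = values.shape[-1]
    w_o = w_o.reshape(H, Dh, -1)  # 切出每个头各自的输出投影块
    scores = np.zeros(K)
    for j in range(K):
        head_out = np.einsum("hd,hde->he", values[:, j], w_o)  # (H, D)
        contrib = np.einsum("hq,hd->qd", attn[:, :, j], head_out)  # (Q, D)
        scores[j] = np.linalg.norm(contrib)
    return scores

def evict(scores, keep_ratio=0.05):
    """统一淘汰策略示意: 按全局可比分数保留 top-k token。"""
    k = max(1, int(len(scores) * keep_ratio))
    return np.argsort(scores)[::-1][:k]
```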
[NLP-79] Hallucination Detection via Activations of Open-Weight Proxy Analyzers
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中幻觉(Hallucination)检测的问题,即模型生成内容与事实不符的现象。解决方案的关键在于提出了一种代理分析器框架(proxy-analyzer framework),该框架不依赖于对生成模型内部机制的访问,而是通过一个本地部署的小型开源权重模型(open-weight model)读取已生成文本,并利用该阅读器自身的内部激活状态来识别幻觉。其创新性体现在基于Transformer结构特性设计了18个特征,涵盖残差流范数、注意力机制、熵值、MLP激活、logit-lens轨迹及三项新的token级语境锚定统计量,进而训练堆叠集成模型进行高效判别。实验表明,该方法在多种不同规模和架构的分析器模型上均显著优于现有方法ReDeEP,在RAGTruth数据集上AUC提升达7.4至10.3个百分点,且模型性能高度稳定,即使在参数量相差18倍的情况下,AUC波动仅2.3个百分点,揭示出“模型越大越优”并非普遍规律。
链接: https://arxiv.org/abs/2605.07209
作者: Akshita Singh,Prabesh Paudel,Siddhartha Roy
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 4 figures. Code available at this https URL
Abstract:We introduce a proxy-analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already-generated text through a small locally hosted open-weight model and spots hallucinations using the reader’s own internal activations. This works just as well when the generator is a closed API like GPT-4 as when it is any open-weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per-head source-document attention, entropy, MLP activations, logit-lens trajectories, and three new token-level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma-2 at 2B and 9B, Pythia at 1.4B, and LLaMA-3 at both 3B and 8B. Across all seven, we consistently beat ReDeEP’s token-level AUC of 0.73 on RAGTruth by 7.4 to 10.3 percentage points. Qwen2.5-7B reached an F1 of 0.717, just above ReDeEP’s 0.713, while Qwen2.5-0.5B hit 0.706. The most striking finding is how tightly all seven models cluster: AUC spans only 2.3 percentage points across an eighteen-fold difference in model size. Even more surprising, our 3B LLaMA outperforms our 8B LLaMA on RAGTruth, showing that bigger is not always better even within the same model family. Both RAGTruth and LLM-AggreFact include outputs from multiple LLM families, so our results are not skewed toward any particular generator.
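该方法的流程是:从分析器模型的内部激活中抽取特征,再训练堆叠集成做判别。下面用纯 NumPy 给出一个玩具版本:两个假设性"基检测器"得分之上再学一个逻辑回归元学习器;特征与分布均为合成数据,仅为演示流程,与论文的 18 维真实激活特征无关:

```python
import numpy as np

rng = np.random.default_rng(0)

# 合成特征示意: [对源文档的注意力均值, 解码熵],
# 仅模拟"幻觉样本注意力低、熵高"的直觉(假设性分布)。
X_hal = rng.normal([0.3, 2.5], 0.2, size=(200, 2))
X_ok = rng.normal([0.7, 1.0], 0.2, size=(200, 2))
X = np.vstack([X_hal, X_ok])
y = np.concatenate([np.ones(200), np.zeros(200)])

def base_scores(X):
    """两个假设性基检测器: 注意力越低 / 熵越高, 越可能是幻觉。"""
    return np.stack([0.5 - X[:, 0], X[:, 1] - 1.75], axis=1)

def fit_meta(Z, y, lr=0.5, steps=500):
    """元学习器: 对基检测器得分做逻辑回归(梯度下降)。"""
    Zb = np.hstack([Z, np.ones((len(Z), 1))])
    w = np.zeros(Zb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Zb @ w))
        w -= lr * Zb.T @ (p - y) / len(y)
    return w

w = fit_meta(base_scores(X), y)

def predict(Xn):
    Zb = np.hstack([base_scores(Xn), np.ones((len(Xn), 1))])
    return (1 / (1 + np.exp(-Zb @ w)) > 0.5).astype(int)

acc = (predict(X) == y).mean()
```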
[NLP-80] PSK@EEUCA 2026: Fine-Tuning Large Language Models with Synthetic Data Augmentation for Multi-Class Toxicity Detection in Gaming Chat ACL2026
【速读】: 该论文旨在解决游戏社区中毒性行为的理解与分类问题,具体任务是将《坦克世界》(World of Tanks)聊天消息划分为六类毒性等级:非毒性、侮辱/挑衅、其他攻击性内容、仇恨/骚扰、威胁及极端主义。其解决方案的关键在于结合多种策略,包括基于编码器的模型、指令微调的大语言模型(LLM)与LoRA(Low-Rank Adaptation)微调技术、分层分类、一对一多分类策略以及多种集成方法;其中最优系统采用Llama 3.1 8B模型并辅以精心校准的5%合成数据增强,最终在测试集上取得0.6234的F1-macro分数,位列35支参赛队伍中的第4名。此外,作者还深入分析了标注模式对模型泛化能力的影响,揭示了一个关键的“验证陷阱”现象——即模型在验证集上的高表现往往对应较差的测试集迁移性能。
链接: https://arxiv.org/abs/2605.07201
作者: Srikar Kashyap Pulipaka
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the EEUCA workshop at ACL 2026
Abstract:This paper describes our system for the EEUCA 2026 Shared Task on Understanding Toxic Behavior in Gaming Communities. The task involves classifying World of Tanks chat messages into six toxicity categories: Non-toxic, Insults/Flaming, Other Offensive, Hate/Harassment, Threats, and Extremism. We explore multiple approaches including encoder-based models, instruction-tuned LLMs with LoRA fine-tuning, hierarchical classification, one-vs-rest strategies, and various ensemble methods. Our best system combines Llama 3.1 8B with carefully calibrated 5% synthetic data augmentation, achieving an F1-macro score of 0.6234 on the test set, placing 4th out of 35 participating teams. We provide extensive analysis of the dataset’s annotation patterns and their impact on model generalization, revealing a critical ‘‘validation trap’’ phenomenon where high validation performance correlates with poor test transfer.
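"精心校准的 5% 合成数据增强"可以用如下混合函数示意(假设性实现):按目标占比从合成样本池中抽样并混入真实训练集:

```python
import random

def mix_synthetic(real, synthetic, ratio=0.05, seed=0):
    """按校准占比混入合成样本的示意(假设性实现):
    使最终训练集中合成样本约占 ratio。"""
    rng = random.Random(seed)
    n_syn = round(len(real) * ratio / (1 - ratio))
    picked = rng.sample(synthetic, min(n_syn, len(synthetic)))
    mixed = list(real) + picked
    rng.shuffle(mixed)
    return mixed
```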
[NLP-81] The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
【速读】: 该论文旨在解决现有大型语言模型(Large Language Model, LLM)评估体系主要针对语法正确输入的局限性,特别是在处理存在词边界扰动(word-boundary corruption)等不完美文本时的性能表现缺乏系统研究的问题。其核心发现是:随着在单词内部插入空格字符以破坏词结构的扰动率增加,LLM的信息检测准确率呈现“U型曲线”变化,这一现象被称为“文本诡异谷”(Text Uncanny Valley)。解决方案的关键在于提出“模式跃迁假说”(mode transition hypothesis),即LLM在接近正常文本时运行于词级模式,在高度碎片化文本下切换至字符级模式,而“诡异谷”区域正是两种模式均失效的过渡区,这解释了为何在此区间内模型性能显著下降,且该现象无法通过上下文学习或正则化扰动完全缓解,揭示了当前LLM在真实场景中面对噪声输入时存在的隐匿性失败模式。
链接: https://arxiv.org/abs/2605.07186
作者: Zekai Tong,Ruiyao Xu,Aryan Shrivastava,Chenhao Tan,Ari Holtzman
机构: University of Chicago (芝加哥大学); Northwestern University (西北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 9 figures
Abstract:Existing Large Language Model (LLM) benchmarks primarily focus on syntactically correct inputs, leaving a significant gap in evaluation on imperfect text. In this work, we study how word-boundary corruption affects how LLMs detect targeted information. By inserting whitespace characters within words to break them into fragments, LLMs’ detection accuracy follows a U-shaped curve with the increase in insertion rate. We refer to this curve as the Text Uncanny Valley. To explain such observation, we propose a mode transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, with the valley marking the disordered transition where neither mode is effective. Four experiments and one analysis are consistent with this account: in-context learning fails to rescue valley-bottom performance; regularizing the perturbation substantially reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not for stronger models, suggesting the effect is attenuated when tasks rely less on exact lexical alignment; and tokenization entropy peaks before the F1 minimum, consistent with a regime-conflict interpretation. These findings reveal a failure mode invisible to clean-text benchmarks yet directly relevant to any deployment scenario involving noisy or uncurated text inputs.
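论文的扰动方式是在单词内部插入空格、将词打碎为片段,并观察准确率随插入率呈 U 形变化。下面是该扰动的一个示意实现(具体采样细节为假设):

```python
import random

def corrupt(text, rate, seed=0):
    """词边界扰动示意(细节为假设): 以概率 rate 在每个单词内部的
    字符间隙插入空格, 将单词打碎为片段; rate=0 保持原文,
    rate=1 退化为逐字符打散。"""
    rng = random.Random(seed)
    words_out = []
    for word in text.split(" "):
        pieces = [word[:1]]
        for ch in word[1:]:
            if rng.random() < rate:
                pieces.append(" ")
            pieces.append(ch)
        words_out.append("".join(pieces))
    return " ".join(words_out)
```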
[NLP-82] Learning Agent Routing From Early Experience
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在复杂推理任务中存在高延迟和计算成本的问题,尤其是在许多查询实际上处于当前先进LLM的能力边界内、无需完整代理执行的情况下,如何实现高效路由决策。其解决方案的关键在于提出一种无需训练的路由框架BoundaryRouter,该框架利用早期行为经验与评分标准引导的推理机制,通过在共享种子集上同时执行LLM和代理构建紧凑的经验记忆,并在推理阶段检索相似案例以指导是否直接使用LLM推理或升级至代理执行,从而实现低延迟、高性能的动态路由。
链接: https://arxiv.org/abs/2605.07180
作者: Yimin Wang,Jiahao Qiu,Xuan Qi,Xinzhe Juan,Jingzhe Shi,Zelin Zhao,Hongru Wang,Shilong Liu,Mengdi Wang
机构: AI Lab, Princeton University (普林斯顿大学人工智能实验室); University of Michigan (密歇根大学); Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University (清华大学交叉信息研究院); Shanghai Jiao Tong University (上海交通大学); University of Edinburgh (爱丁堡大学); King’s College London (伦敦国王学院)
类目: Computation and Language (cs.CL)
备注: 17 pages
Abstract:LLM agents achieve strong performance on complex reasoning tasks but incur high latency and compute cost. In practice, many queries fall within the capability boundary of cutting-edge LLMs and do not require full agent execution, making effective routing between LLMs and agents a key challenge. We study the problem of routing queries between lightweight LLM inference and full agent execution under realistic cold-start settings. To address this, we propose BoundaryRouter, a training-free routing framework that uses early behavioral experience and rubric-guided reasoning to decide whether to answer a query with direct LLM inference or escalate to an agent. BoundaryRouter builds a compact experience memory by executing both systems on a shared seed set and retrieves similar cases at inference time to guide routing decisions. To evaluate this method, we introduce RouteBench, a benchmark covering in-domain, paraphrased, and out-of-domain route settings. Experiments show that BoundaryRouter reduces inference time by 60.6% compared to the agent while improving performance by 28.6% over direct LLM inference, outperforming prompt-based and retrieval-only routing by an average of 37.9% and 8.2%, respectively.
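BoundaryRouter 的检索式路由可以这样示意:经验记忆中保存种子问题以及 LLM 与代理各自的成败记录,推理时检索近邻、按近邻上的可靠性决定路由。以下为玩具实现(词袋哈希向量与判据均为假设,实际应使用语义嵌入):

```python
import math
import zlib

def embed(text):
    """玩具词袋哈希向量, 仅为演示; 实际应使用语义嵌入模型。"""
    v = [0.0] * 32
    for w in text.lower().split():
        v[zlib.crc32(w.encode()) % 32] += 1.0
    return v

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

def route(query, memory, k=2):
    """经验记忆路由示意(假设性实现): memory 中每条记录保存
    种子问题及 LLM / Agent 在其上的成败; 检索最相似的 k 条,
    若 LLM 在近邻上不逊于 Agent 则走轻量 LLM 推理, 否则升级为 Agent。"""
    q = embed(query)
    nn = sorted(memory, key=lambda m: -cosine(q, embed(m["q"])))[:k]
    if sum(m["llm_ok"] for m in nn) >= sum(m["agent_ok"] for m in nn):
        return "llm"
    return "agent"
```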
[NLP-83] Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在通过监督微调(SFT)和直接偏好优化(DPO)进行对齐时,忽略表示空间全局几何结构的问题,仅依赖局部词元似然或标量得分,导致生成轨迹缺乏语义连贯性。其解决方案的关键在于引入拓扑增强的对齐框架,利用0维持久同调(0-dimensional persistent homology)来正则化生成过程中的语义轨迹:对于SFT,提出轨迹拓扑损失(Trajectory Topology Loss, TTL),将模型更新方向与prompt与黄金答案嵌入间提取的“提示-答案桥接”对齐;对于DPO,提出拓扑偏好优化(Topological Preference Optimization, TPO),构建特定主题的语义偏好向量,并在中间隐藏层中对齐拒绝与选择响应之间的改进方向。该方法通过显式建模隐藏空间中的拓扑结构,提升了生成可控性和语义一致性,在自动偏好指标和LLM判别评估中优于多种非拓扑基线方法。
链接: https://arxiv.org/abs/2605.07172
作者: Yurui Pan,Ke Xu,Bo Peng
机构: Fudan University (复旦大学); Tongji University (同济大学); Shanghai Ocean University (上海海洋大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026. 15 pages
Abstract:Alignment of large language models (LLMs) via SFT and RLHF/DPO typically ignores the global geometry of the representation space, relying instead on local token likelihoods or scalar scores. We view generation as tracing a semantic trajectory in hidden space and propose a topology-enhanced alignment framework that regularizes these trajectories using 0-dimensional persistent homology. First, for SFT, we introduce Trajectory Topology Loss (TTL). Treating prompt and gold-answer embeddings as a mixed point cloud, we use a 0D persistent homology algorithm to extract “prompt-answer bridges.” TTL aligns the model’s actual update direction with these topological bridges rather than arbitrary directions. Second, for DPO, we propose Topological Preference Optimization (TPO). TPO constructs topic-specific semantic preference vectors and aligns the improvement direction between rejected and chosen responses with these vectors in an intermediate hidden layer. We also introduce a dynamic weighting scheme to balance DPO and TPO losses. Evaluating on Qwen2.5-7B-Instruct using UltraChat and Anthropic HH-RLHF, our topology-enhanced objectives consistently outperform strong non-topological baselines (e.g., per-example, nearest-neighbor, random regularizers) on automatic preference metrics and LLM-judge evaluations, while maintaining or improving toxicity. Results show persistent homology and trajectory geometry offer a promising direction for controllable alignment.
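0 维持久同调在点云上等价于单链接聚类:按距离从小到大合并连通分量,每次合并对应一个 (birth=0, death=合并距离) 的条形。下面用 Kruskal 最小生成树给出示意;例中前两个点可视作 prompt 嵌入、后两个点视作答案嵌入,最长条形对应的边即一条"提示-答案桥"(该对应关系为示意性理解,非论文实现):

```python
import math
from itertools import combinations

def zero_dim_persistence(points):
    """0 维持久同调示意: 对点云做单链接合并(Kruskal MST),
    每条 MST 边的长度即一个连通分量的消亡时间 (birth=0)。
    返回按消亡时间升序的 (death, (i, j)) 列表, (i, j) 为触发合并的边。"""
    n = len(points)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    edges = sorted((dist(points[i], points[j]), (i, j))
                   for i, j in combinations(range(n), 2))
    parent = list(range(n))

    def find(x):  # 带路径压缩的并查集
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    bars = []
    for d, (i, j) in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((d, (i, j)))
    return bars
```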
[NLP-84] A Reproducible Multi-Architecture Baseline for Token-Level Chinese Metaphor Identification under the MIPVU Framework
【速读】: 该论文旨在解决中文语境下基于MIPVU(Metaphor Identification Problem with Verbs and Unambiguous contexts)框架的词级别隐喻识别问题,当前在该任务上缺乏系统且可复现的基准方法。其解决方案的关键在于构建了一个涵盖三种不同模型架构的多基准体系:一是基于中文RoBERTa-wwm-ext-large的编码器微调模型;二是针对中文适配的MelBERT模型,利用从《现代汉语词典》第七版(MCD7)构建的74,823条基本语义资源(覆盖PSU CMC词汇71.51%)增强语义感知能力;三是采用QLoRA微调的Qwen3.5-9B生成式模型作为指令微调基线。实验表明,MelBERT仅使用MIP通道时表现最优(F1=0.7281),验证了中文隐喻以常规隐喻为主导、语义通道贡献有限的特性,并揭示了生成式模型因离散输出限制导致召回率偏低的问题,为后续中文隐喻识别研究提供了可复现的基准与数据支持。
链接: https://arxiv.org/abs/2605.07170
作者: Yufeng Wu
机构: City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Metaphor is pervasive in everyday language, yet token-level computational identification of metaphor-related words in Chinese under the MIPVU framework remains under-explored relative to English. This paper presents a reproducible multi-architecture baseline for token-level metaphor identification on the PSU Chinese Metaphor Corpus (PSU CMC), the only widely available MIPVU-annotated Chinese corpus. We systematically compare three model families: (i) encoder fine-tuning with Chinese RoBERTa-wwm-ext-large; (ii) MelBERT adapted to Chinese using a newly constructed basic-meaning resource derived from the Modern Chinese Dictionary, 7th edition (MCD7), comprising 74,823 entries with 71.51% PSU CMC vocabulary coverage; and (iii) Qwen3.5-9B fine-tuned with QLoRA as an instruction-tuned generative baseline. Across five fixed seeds, MelBERT MIP-only achieves the strongest performance at 0.7281 +/- 0.0050 test positive F1, marginally above MelBERT Full (0.7270 +/- 0.0069) and clearly above plain RoBERTa (0.7142 +/- 0.0121). The Qwen QLoRA generative configuration trails encoder baselines by approximately 11 F1 points (0.6157 +/- 0.0113). Three findings merit attention: (1) the SPV channel of MelBERT does not contribute reliable positive signal in Chinese, consistent with the dominance of conventional metaphor; (2) the Qwen-encoder gap is concentrated in recall, reflecting the discrete-commitment limitation of generative output; (3) several Qwen task formulations fail due to format design rather than model capacity. We release all split manifests, per-seed outputs, the MCD7 basic-meaning embedding pipeline, and training scripts to serve as a common reference for future Chinese metaphor identification research.
[NLP-85] Rethinking Experience Utilization in Self-Evolving Language Model Agents
【速读】: 该论文旨在解决自演化智能体(self-evolving agents)在运行时决策过程中对经验(experience)利用方式过于僵化的问题,即现有方法通常在初始化或每一步决策中固定地注入经验,而未考虑当前决策是否真正需要额外指导。其解决方案的关键在于提出ExpWeaver这一轻量级实现,通过将经验作为推理过程中的可选资源暴露出来,使智能体能够根据决策需求选择性地调用经验——仅在高推理不确定性或有益决策点时启用,从而实现经验使用的动态优化。实验表明,该策略在多种框架、大语言模型(LLM)和环境设置下均优于传统固定使用模式,并可通过强化学习进一步增强。
链接: https://arxiv.org/abs/2605.07164
作者: Weixiang Zhao,Yingshuo Wang,Yichen Zhang,Yanyan Zhao,Yu Zhang,Yang Wu,Dandan Tu,Bing Qin,Ting Liu
机构: Harbin Institute of Technology (哈尔滨工业大学); Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Computation and Language (cs.CL)
备注: 30 pages, 20 figures, 7 tables
Abstract:Self-evolving agents improve by accumulating and reusing experience from past interactions. Existing work has largely focused on how experience is constructed, represented, and updated, while paying less attention to how experience should be used during runtime decision-making. As a result, most agents rely on rigid usage strategies, either injecting experience once at initialization or at every step, without considering whether it is needed for the current decision. This paper studies experience utilization as a critical design dimension of self-evolving agents. We ask whether agents benefit from interweaving experience use with decision-making, so that experience is invoked only when additional guidance is needed. To examine this question, we introduce ExpWeaver, a lightweight instantiation that leaves experience construction unchanged and modifies only runtime utilization by exposing experience as an optional resource during reasoning. Across four representative frameworks, seven LLM backbones, and three types of environments, ExpWeaver consistently achieves the best performance among different utilization strategies. Reinforcement learning experiments further show that this behavior can be amplified through training. Usage-pattern, causal ablation, and entropy-based analyses reveal that ExpWeaver enables agents to invoke experience selectively, at beneficial decision points, and under higher reasoning uncertainty. Overall, our findings call for a shift from merely studying \emphwhat experience to store toward understanding \emphhow and \emphwhen experience should enter decision-making.
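ExpWeaver 将经验暴露为推理中的可选资源、仅在需要时调用。一个最小化的触发判据示意如下(用下一步动作分布的熵近似"推理不确定性",阈值与判据均为假设):

```python
import math

def entropy(probs):
    """离散分布的香农熵(自然对数)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def maybe_use_experience(action_probs, retrieve, threshold=1.0):
    """可选经验调用示意(判据与阈值均为假设): 仅当当前决策
    不确定性(以动作分布的熵近似)超过阈值时, 才检索经验
    注入上下文; 否则跳过调用以节省开销。"""
    if entropy(action_probs) > threshold:
        return {"used_experience": True, "hint": retrieve()}
    return {"used_experience": False, "hint": None}
```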
[NLP-86] CLIPer: Tailoring Diverse User Preference via Classifier-Guided Inference-Time Personalization
【速读】: 该论文旨在解决个性化大语言模型(Large Language Models, LLMs)在实际应用中因需适配用户多样化偏好(如帮助性、简洁性和幽默感等)而导致的计算成本过高和可扩展性差的问题。现有方法通常依赖于针对特定偏好组合进行大量微调(fine-tuning),这不仅资源消耗巨大,且难以覆盖所有可能的偏好维度。解决方案的关键在于提出CLIPer(Classifier-guided Inference-time Personalization),一种轻量级推理时个性化方法:通过引入一个分类器模型,在推理阶段动态引导LLM生成符合用户偏好的内容,从而避免了昂贵的微调过程,并实现了对单维与多维偏好更可控、更精细的调节,同时保持极低的额外计算开销。
链接: https://arxiv.org/abs/2605.07162
作者: Jinyan Su,Jinpeng Zhou,Claire Cardie,Wen Sun
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Personalized LLMs can significantly enhance user experiences by tailoring responses to preferences such as helpfulness, conciseness, and humor. However, fine-tuning models to address all possible combinations of user preferences is computationally expensive and impractical. In this paper, we introduce \textbfCLIPer(\textbfClassifier-guided \textbfInference-time \textbfPersonalization), a lightweight personalization approach that leverages a classifier model to steer LLM generation dynamically to different user preferences at inference time. Our method eliminates the need for extensive fine-tuning, inducing negligible additional computational overhead while enabling more controllable and nuanced personalization across single and multi-dimensional preferences. Comprehensive empirical analyses demonstrate the scalability and effectiveness of our approach in delivering personalized language generation.
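CLIPer 在推理阶段用分类器动态引导生成;下面以更粗粒度的候选重排序示意同一思想:综合 LM 对数概率与偏好分类器打分,alpha 控制个性化强度,alpha=0 时退化为原始模型行为(此为假设性简化,非论文的逐步引导实现):

```python
def classifier_guided_select(candidates, lm_logprob, pref_score, alpha=1.0):
    """分类器引导的推理时个性化示意(候选重排序版, 假设性简化):
    按 lm_logprob(c) + alpha * pref_score(c) 选择最佳候选续写。"""
    return max(candidates,
               key=lambda c: lm_logprob(c) + alpha * pref_score(c))

cands = ["short answer",
         "a much longer and more verbose answer with extra details"]
lm = lambda c: 0.1 * len(c.split())          # 玩具 LM: 偏好冗长
concise = lambda c: -float(len(c.split()))   # 偏好分类器: 偏好简洁
```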
[NLP-87] Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)是否能够提升大语言模型(Large Language Models, LLMs)对参数化知识(parametric knowledge)的直接回忆能力,而非仅限于推理过程的问题。其解决方案的关键在于设计了一个受控的零样本、单跳、闭卷问答(zero-shot, one-hop, closed-book QA)设置,不使用思维链(chain-of-thought),仅通过二元正确性奖励(binary correctness rewards)进行训练,并采用事实级别的训练-测试去重(fact-level train-test deduplication)以确保性能提升源于记忆召回的增强而非推理或新知识的获取。实验表明,RL在多个模型家族和基准上带来约27%的平均相对增益,且机制上主要通过重新分配已有知识的概率质量,而非引入新事实,从而将正确答案从低概率尾部转移到可靠的贪婪生成中。
链接: https://arxiv.org/abs/2605.07153
作者: Wanli Yang,Hongyu Zang,Junwei Zhang,Wenjie Shi,Du Su,Jingang Wang,Xueqi Cheng,Fei Sun
机构: State Key Laboratory of AI Safety, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards and applying fact-level train-test deduplication to ensure gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training- and inference-time baselines alike. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and get reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking rather than acquiring latent parametric knowledge.
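"概率质量向已有正确答案重新分配"的机制可以用一个玩具 softmax 策略加 REINFORCE 复现:只用二元正确性奖励训练,观察初始处于低概率尾部的正确答案概率被抬升。以下实现纯属示意,与论文的训练设置无关:

```python
import math
import random

def reinforce_recall(logits, correct_idx, steps=200, lr=0.1, seed=0):
    """二元正确性奖励下的 REINFORCE 玩具示意(非论文训练设置):
    logits 表示模型对各候选答案的先验偏好; 每步从 softmax 分布采样
    一个答案, 答对奖励 1、答错奖励 0, 以当前正确概率为基线更新 logits。
    返回训练后的答案分布。"""
    rng = random.Random(seed)
    logits = list(logits)
    n = len(logits)
    for _ in range(steps):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        a = rng.choices(range(n), weights=probs)[0]
        advantage = (1.0 if a == correct_idx else 0.0) - probs[correct_idx]
        for i in range(n):
            grad = (1.0 if i == a else 0.0) - probs[i]  # d log pi / d logit_i
            logits[i] += lr * advantage * grad
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```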
[NLP-88] Structural Rationale Distillation via Reasoning Space Compression
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在进行推理知识蒸馏(reasoning distillation)时,教师模型对相似问题生成的推理路径(rationales)结构和策略差异较大,导致学生模型面临噪声监督信号难以内化的问题。解决方案的关键在于提出通过推理路径压缩的蒸馏方法(Distillation through Reasoning Path Compression, D-RPC),其核心机制是构建一个动态维护的、可复用的高阶推理路径库(bank of reusable high-level reasoning paths),并在训练过程中为每个问题检索最相关的路径并约束教师模型遵循该路径,从而在保持跨相似问题推理一致性的同时,确保不同问题类型间的多样性覆盖。该方法通过PAC-Bayes分析量化了路径库大小与覆盖范围之间的权衡关系,并实验证明其在多个数学与常识推理基准上显著优于多种基线方法,且token消耗更低。
链接: https://arxiv.org/abs/2605.07139
作者: Jialin Yang,Jiankun Wang,Jiajun Wu,Henry Leung,Jiayu Zhou,Steve Drew
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:When distilling reasoning from large language models (LLMs) into smaller ones, teacher rationales for similar problems often vary wildly in structure and strategy. Like a chef who makes the same dish differently each time, this inconsistency burdens the student with noisy supervision that is hard to internalize. We propose Distillation through Reasoning Path Compression (D-RPC), which constrains the teacher to follow a compact, dynamically maintained bank of reusable high-level reasoning paths. For each training question, D-RPC retrieves the most relevant path and conditions the teacher to follow it, producing rationales that are consistent across similar problems yet diverse enough to cover different problem types. A PAC-Bayes analysis formalizes the resulting trade-off between bank size and coverage: smaller banks reduce supervision entropy but risk coverage gaps, and the generalization bound identifies an optimal intermediate size confirmed by our ablations. Across five math and commonsense reasoning benchmarks with two student models, D-RPC consistently outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines, while using fewer tokens than template-heavy alternatives.
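D-RPC 在训练时为每个问题从路径库检索最相关的高层推理路径并约束教师遵循。检索环节可以这样示意(用关键词 Jaccard 相似度近似,库内容与提示措辞均为假设示例):

```python
def retrieve_path(question, bank):
    """推理路径库检索示意(假设性实现): 用关键词 Jaccard 相似度
    为问题选取最相关的高层推理路径, 并把路径拼入教师提示,
    使相似问题获得结构一致的 rationale。"""
    q = set(question.lower().split())

    def sim(entry):
        k = set(entry["keywords"])
        return len(q & k) / len(q | k)

    best = max(bank, key=sim)
    prompt = "请严格按照以下高层推理路径作答: " + " -> ".join(best["steps"])
    return best, prompt
```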
[NLP-89] Region4Web: Rethinking Observation Space Granularity for Web Agents
【速读】: 该论文旨在解决当前Web代理(Web agent)在感知网页时观察空间粒度设计不合理的问题,即现有方法通常采用与动作空间相同的元素级粒度进行观察,导致页面的功能性结构隐含在细粒度元素信号中,迫使代理在每一步都需重新推断页面组织逻辑。解决方案的关键在于提出Region4Web框架,通过层次化分解与语义抽象将AXTree重构为功能区域(functional regions),从而显式暴露页面的功能组织;并进一步设计PageDigest这一面向网页的推理流水线,将区域级观察以紧凑的跨步骤摘要形式传递给执行代理,显著降低观察长度并提升任务成功率,证明了基于功能区域粒度的观察优于单纯的元素级处理。
链接: https://arxiv.org/abs/2605.07134
作者: Donguk Kwon,Dongha Lee
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice. Existing work treats observation at the same element-level granularity as the action space, leaving the page’s functional organization implicit and forcing the agent to infer it from element-level signals at every step. We argue observation should instead operate at the granularity of functional regions, parts of the page that each serve a distinct purpose. We propose Region4Web, a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction, exposing the page’s functional organization as the basis for page state understanding. Moreover, we propose PageDigest, a web-specific inference pipeline that delivers this region-level observation to the actor agent as a compact per-page digest that persists across steps. On the WebArena benchmark, PageDigest substantially reduces observation length while improving overall task success rate across diverse backbone large language models (LLMs) and established agent methods, regardless of backbone capacity. These results show that operating at the granularity of functional regions delivers a more compact and informative basis for the actor agent than element-level processing alone.
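把 AXTree 重组为功能区域的最简单近似,是以 landmark 类角色作为区域边界、把元素级节点聚合到所属区域之下。以下示意仅体现"区域粒度"这一思想,层次化分解与语义抽象的细节为假设:

```python
def group_into_regions(axtree):
    """AXTree → 功能区域 的分解示意(假设性实现):
    以 landmark 类角色(navigation/search/main/form 等)作为区域边界,
    将其后的元素级节点聚合为带语义标签的功能区域。
    axtree: 按文档顺序排列的 (role, name) 节点列表。"""
    LANDMARKS = {"navigation", "search", "main", "form", "banner"}
    regions, current = [], None
    for role, name in axtree:
        if role in LANDMARKS:
            current = {"region": role, "label": name, "elements": []}
            regions.append(current)
        elif current is not None:
            current["elements"].append((role, name))
    return regions
```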
[NLP-90] The Position Curse: LLMs Struggle to Locate the Last Few Items in a List
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在位置敏感任务中表现出的“位置诅咒”(Position Curse)问题,即模型在处理短序列时难以准确识别倒数第几项内容,尽管其在长文本中能高效定位特定信息。解决方案的关键在于构建一个专门针对位置感知能力的训练数据集 PosBench,并通过 LoRA 微调方法提升模型的前后向位置检索性能,从而改善其在代码理解等依赖精确索引场景下的表现。
链接: https://arxiv.org/abs/2605.07127
作者: Zhanqi Zhang,Hua-Dong Xiong,Robert C. Wilson,Mikio Aoi,Marcelo G. Mattar,Li Ji-An
机构: UC San Diego (加州大学圣地亚哥分校); Georgia Tech (佐治亚理工学院); New York University (纽约大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Modern large language models (LLMs) can find a needle in a haystack (locating a single relevant fact buried among hundreds of thousands of irrelevant tokens) with near-saturated accuracy, yet fail to retrieve the last few items in a short list. We call this failure the Position Curse. For instance, even in a two-line code snippet, Claude Opus 4.6 misidentifies the second-to-last line most of the time. To characterize this failure, we evaluated two complementary queries: given a position in a sequence (of letters or words), retrieve the corresponding item; and given an item, return its position. Each position is specified as a forward or backward offset from an anchor, either an endpoint of the list (its start or end) or another item in the list. Across both open-source and frontier closed-source models, backward retrieval substantially lags forward retrieval. To test whether this capability can be rescued by post-training, we constructed PosBench, a position-focused training dataset. LoRA fine-tuning improves both forward and backward retrieval and generalizes to a held-out code-understanding benchmark (PyIndex), yet absolute performance remains far from saturated. As LLM coding agents increasingly operate over large codebases where precise indexing becomes essential for code understanding and editing, position-based retrieval emerges as a key capability for future pretraining objectives and model design.
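论文评测的两类查询中,"给位置取项"一类可以用如下样例生成器示意(前向 = 从开头数第 k 项,后向 = 倒数第 k 项;提问措辞为假设):

```python
def position_query(words, direction, k):
    """位置检索评测样例生成示意(仿论文的前向/后向偏移设置):
    direction="forward" 表示从列表开头数第 k 项,
    direction="backward" 表示从列表末尾倒数第 k 项。
    返回 (问题文本, 正确答案)。"""
    assert 1 <= k <= len(words)
    if direction == "forward":
        return f"列表 {words} 中从前往后第 {k} 项是什么?", words[k - 1]
    return f"列表 {words} 中倒数第 {k} 项是什么?", words[-k]
```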
[NLP-91] Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)微调中存在的一对矛盾:全量微调(Full Fine-Tuning, FFT)虽具备高表示可塑性,适合注入高熵知识,但计算与内存开销大;低秩适应(Low-Rank Adaptation, LoRA)则因参数效率高且具有正则化优势,在许多任务上可媲美甚至超越FFT,但其结构限制了对复杂更新空间的捕捉能力。为克服单一策略的结构性局限,作者提出一种统一框架——LoRA与全量微调混合(Mixture of LoRA and Full, MoLF)微调方法,其核心创新在于在优化器层面动态路由梯度更新,使FFT和LoRA两种专家机制在整个训练过程中均能接收精确梯度信号,从而实现稳定且灵活的训练动态。此外,针对资源受限场景,还设计了MoLF-Efficient版本,冻结基础模型权重并仅在两个LoRA专家之间路由更新,显著提升效率与性能。实验表明,MoLF在所有测试设置下均优于或接近FFT与LoRA的最佳表现,而MoLF-Efficient相较现有自适应LoRA方法在Fact、Med和SQL任务上分别提升最高达20%、9%和9%。
链接: https://arxiv.org/abs/2605.07111
作者: Haozhan Tang,Xiuqi Zhu,Xinyin Zhang,Boxun Li,Virginia Smith,Kevin Kuo
机构: Carnegie Mellon University (卡内基梅隆大学); Tsinghua University (清华大学); Infinigence AI (无限智能人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA’s additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language models (Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B), we verify both trends and demonstrate that relying solely on either static architecture is structurally limited. To address this challenge, we propose a Mixture of LoRA and Full (MoLF) Fine-Tuning, a unified framework that enables continuous navigation between both training regimes. MoLF dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training, yielding stable training dynamics. For memory-constrained environments, we also introduce MoLF-Efficient, which freezes base weights and only routes updates among a pair of LoRA experts of potentially varying rank. Our evaluations show that MoLF either improves on or stays within 1.5% of the better of FFT and LoRA across all settings, while MoLF-Efficient outperforms prior adaptive LoRA approaches by up to 20% on Fact and 9% on Med and SQL.
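MoLF 在优化器层面把更新在 FFT 与 LoRA 两个专家之间路由。一个假设性的路由判据示意如下:若梯度能量集中在前 r 个奇异方向(即更新近似低秩),则交给 LoRA,否则交给 FFT。该判据仅为演示,并非论文实现:

```python
import numpy as np

def route_update(G, r=2, tau=0.9):
    """优化器级路由示意(假设性判据, 非论文原实现):
    G 为某权重矩阵的梯度; 若前 r 个奇异方向捕获的梯度能量占比
    超过 tau, 说明更新近似低秩, 路由给 LoRA 专家; 否则路由给
    全量微调 (FFT) 专家。返回 (路由目标, 能量占比)。"""
    s = np.linalg.svd(G, compute_uv=False)  # 奇异值, 降序排列
    captured = (s[:r] ** 2).sum() / (s ** 2).sum()
    return ("lora", captured) if captured >= tau else ("fft", captured)
```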
[NLP-92] Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
【速读】: 该论文旨在解决计算机使用代理(Computer-use Agents, CUAs)在真实软件环境中部署时的可靠性问题,即传统仅以任务成功衡量可靠性的局限性——在实际运行中,感知错误、规划漂移、记忆使用、工具中介、权限范围及运行时监督等因素共同决定代理行为是否与用户意图保持一致。其解决方案的关键在于提出一个“架构-生命周期”框架,从架构视角将系统划分为感知(Perception)、决策(Decision)和执行(Execution)三层耦合模块,实现从软件观测到授权动作的转化;从生命周期视角分析部署(Deployment)、运行(Operation)和维护(Maintenance)三个阶段,明确先验知识学习、工具与权限绑定、运行时压力测试以及持续保障机制的建立节点,从而识别失败显现位置与成因引入点,并映射出可干预的控制与保证界面,为提升CUA在复杂环境中的可控性和可信性提供系统化方法论支撑。
链接: https://arxiv.org/abs/2605.07110
作者: Zejian Chen,Zhanyuan Liu,Chaozhuo Li,Mengxiang Han,Songyang Liu,Litian Zhang,Feng Gao,Yiming Hei,Xi Zhang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:
Abstract:Computer-use agents (CUAs) are moving from bounded benchmarks toward real software environments, where they operate browsers, desktops, mobile applications, file systems, terminals, and tool backends. In such settings, reliability is no longer captured by task success alone: perception errors, planning drift, memory use, tool mediation, permission scope, and runtime oversight jointly determine whether agent actions remain aligned with user intent. Existing surveys organize the CUA landscape by methods, platforms, benchmarks, or security threats, but less explicitly connect capability formation, authority exposure, failure manifestation, and control placement. To address this gap, the article develops an architecture-lifecycle framework for deployment-grounded reliability in CUAs. The architectural view analyzes Perception, Decision, and Execution as coupled layers that transform software observations into authority-bearing actions. The lifecycle view examines Deployment, Operation, and Maintenance as stages in which priors are learned, tools and permissions are bound, runtime assumptions are stressed, and assurance must be preserved over time. Through this lens, the analysis synthesizes representative systems, benchmarks, and security/privacy studies; distinguishes where failures become visible from where their enabling conditions are introduced; and maps recurring intervention surfaces for control, oversight, and assurance. OpenClaw is used only as a publicly documented example of an open deployment pattern, not as a verified internal case study. The conclusion highlights open challenges in controllable grounding, long-horizon constraint preservation, safe authority binding, mixed-trust runtime defense, privacy-preserving memory, and continual assurance.
[NLP-93] Retrieve Integrate and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉-语言推理中因将视觉证据压缩为离散文本思维而导致的信息瓶颈问题,以及现有连续隐空间推理方法中存在的流形兼容性不足问题(即隐状态轨迹偏离预训练推理电路、坍缩为与实例无关的模式,并常被答案生成过程绕过)。其解决方案的关键在于提出RIS(Retrieve, Integrate, and Synthesize)框架,该框架通过构建带边界框和区域语义描述的分步接地推理数据集,使隐状态同时锚定于空间和语义证据;利用渐进式注意力瓶颈强制隐令牌的因果作用;并通过引入短语言过渡令牌将合成的隐状态映射回词汇对齐的解码过程,从而实现对预训练MLLM计算的兼容扩展。
链接: https://arxiv.org/abs/2605.07106
作者: Jin Cui,Xinyue Long,Xunyong Zhang,Yadong Zhang,Chuanchang Su,Jingye Gan,Boran Zhao,Pengju Ren
机构: Xi’an Jiaotong University (西安交通大学); State Key Laboratory of Human-Machine Hybrid Augmented Intelligence (人机混合增强智能国家重点实验室)
类目: Computation and Language (cs.CL)
备注: 19 pages, 8 figures
Abstract:Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding. Experiments on V*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses demonstrate that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.
[NLP-94] Theoretical Limits of Language Model Alignment
【速读】: 该论文旨在解决语言模型(Language Model, LM)对齐过程中在固定KL散度预算下所能实现的奖励提升上限不明确的问题。现有方法如强化学习(Reinforcement Learning, RL)和Best-of-N采样虽广泛应用,但其理论性能边界尚不清楚。论文的关键解决方案是通过信息论分析,推导出KL正则化对齐的最大可实现期望奖励增益,并首次提出一个以Jeffreys散度为核心的闭式表达式,替代以往基于KL的近似;进一步将该表达式重构为基模型下的协方差形式,从而得到仅依赖基模型样本即可估算对齐潜力的实用估计器。此外,论文还揭示了代理奖励(proxy reward)场景中“奖励欺骗”(reward hacking)现象的成因及其缓解机制——即奖励集成(reward ensembling)可有效抑制该问题,为实践中常用的技术提供了理论支撑。
链接: https://arxiv.org/abs/2605.07105
作者: Lucas Monteiro Paes,Natalie Mackraz,Barry-John Theobald,Federico Danieli
机构: Apple(苹果)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Theory (cs.IT)
备注:
Abstract:Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-N alignment, which selects the highest-reward output among N independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the \sqrt{\texttt{KL}} used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of reward error and when the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two LM tasks, safety and summarization, and show that best-of-N closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
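下面用一个极简 Python 草图示意摘要中作为对照的 best-of-N 采样:经典分析给出的 KL 上界为 log N − (N−1)/N(这是以往基于 KL 的近似结果,并非本文新提出的 Jeffreys 散度闭式表达式);奖励样本为虚构数值,仅作示意:

```python
import math
import random

def bon_kl_bound(n):
    # 经典上界:KL(best-of-n || base) <= log n - (n-1)/n
    return math.log(n) - (n - 1) / n

def expected_bon_reward(rewards, n, trials=20000, seed=0):
    # 蒙特卡洛估计:n 个独立基模型样本中最大奖励的期望
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.choice(rewards) for _ in range(n))
    return total / trials

base = [0.1, 0.2, 0.4, 0.8]  # 虚构的基模型奖励样本
for n in (1, 4, 16):
    print(n, round(bon_kl_bound(n), 3), round(expected_bon_reward(base, n), 3))
```

可以看到期望奖励随 N 增大逐渐逼近样本最大值,而 KL 预算按 log N − (N−1)/N 增长,这正是论文在理论上限旁对比的 KL-奖励权衡曲线的一个数值缩影。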
[NLP-95] SAGE: Hierarchical LLM -Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在评估文学质量时面临的挑战,即如何量化难以用传统计算方法衡量的诠释性维度(如文化表征、情感深度和哲学复杂性)。其解决方案的关键在于提出 SAGE 框架——一个分层评估体系,将文学质量解构为基于本体论的可评估维度,并通过多轮迭代反思与独立验证的结构化大语言模型(Large Language Model, LLM)评估实现精准测量。该框架在三层次分析(文化、情感-心理、存在-哲学)中展现出高一致性(98.8%得分收敛性、>94%评分者间一致性),并揭示了不同维度对生成文本质量判别的敏感度差异,从而为自动化评估开放文本生成提供了可靠且具有理论指导意义的方法论基础。
链接: https://arxiv.org/abs/2605.07102
作者: Tianyu Wang,Nianjun Zhou
机构: Mercy University, Math & Computer Science Department (梅西大学,数学与计算机科学系); IBM T.J. Watson Research Center (IBM托马斯·沃森研究中心)
类目: Computation and Language (cs.CL)
备注: 19 pages, 4 figures
Abstract:Evaluating literary quality requires assessing interpretive dimensions such as cultural representation, emotional depth, and philosophical sophistication that resist straightforward computational measurement. We introduce SAGE, a hierarchical evaluation framework that decomposes literary quality into ontology-grounded interpretive dimensions assessed through structured large language model evaluation with multi-round iterative reflection and independent validation. We validate the framework on 100 short stories (50 canonical works, 30 pulp fiction, 20 LLM-generated narratives) across three analytical layers (cultural, emotional-psychological, existential-philosophical) using dual-mode assessment. Across 600 evaluations, the framework achieves 98.8% score convergence and greater than 94% inter-rater agreement, with near-perfect mode invariance between content-based and metadata-based evaluation. Statistical analysis reveals a consistent genre hierarchy (Canonical > Pulp > LLM, all p < 0.001) with layer-specific discrimination: cultural critique and philosophical depth exhibit very large effect sizes (Cohen’s d > 2.4), while emotional representation shows smaller gaps (d=1.68), suggesting that affective patterns are more learnable from training data than critical stance or philosophical depth. Cross-layer correlations (r=0.649-0.683) confirm the three dimensions capture empirically distinguishable quality facets. These findings demonstrate that theory-driven LLM evaluation can achieve measurement-grade reliability and support systematic identification of where current generative models fall short of human literary production, with direct implications for scalable automated evaluation of open-ended text generation.
[NLP-96] The Translation Tax Is Not a Scalar: A Counterfactual Audit of English-Source Cue Inheritance in Chinese Multilingual Benchmarks NEURIPS2026
【速读】: 该论文旨在解决翻译基准测试中普遍假设的“翻译税”(Translation Tax)是否为单一、恒定效应的问题,即翻译是否普遍导致模型性能提升(因保留源语言线索),从而误导对多语言能力的真实评估。其关键解决方案在于设计并实施一种“同项大语言模型自然化压力测试”(same-item LLM-naturalization stress test),通过固定答案、选项和内容仅改写中文表面形式,系统性地分离出翻译本身与模型家族特异性影响的差异。结果表明,“翻译税”并非统一存在,而是依赖于估算方法和具体题目类型——高残留项受益于翻译,低残留项则无显著变化,揭示了翻译基准评估中存在多维有效性风险。
链接: https://arxiv.org/abs/2605.07093
作者: Zezheng Lin,Fengming Liu,Handi Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 3 figures. Submitted to NeurIPS 2026
Abstract:The Translation Tax is often treated as a scalar: translated benchmarks are assumed to inflate scores by preserving English-source cues. We audit this claim in an English-to-Chinese setting. Three proxy estimators disagree: back-translation gaps are small and parser-fragile; cue-score calibration does not predict item-level gains; and a six-model native-control comparison shows model-family rather than uniform benchmark effects. We add a same-item LLM-naturalization stress test that holds answer, options, and content fixed while rewriting Chinese surface form. After correcting a prompt-construction bug, this contrast no longer supports a model-family interaction, but it preserves a residue dose-response: high-residue items benefit while low-residue items do not. The result is not a single Translation Tax, but a set of estimator- and item-dependent validity risks. We release per-cell evidence, the naturalization protocol, human QC, and a reporting checklist for translated multilingual benchmark papers.
[NLP-97] Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation
【速读】: 该论文旨在解决自动语音识别(ASR)评估中因参考文本单一化(reference monism)所引发的认知不公(epistemic injustice)问题,尤其是对失语症患者(aphasia)群体造成的系统性不利影响。传统基于词错误率(WER)的评估依赖于特定转录规范(如verbatim或non-verbatim),而这些规范往往忽视了失语症者具有临床意义的言语特征(如不流畅性),将其误判为错误,导致其ASR性能被低估。解决方案的关键在于引入“诠释鸿沟”(hermeneutical gap)概念,并提出一种量化指标——认知不公距离(Epistemic Injustice Distance, EID),用于衡量不同参考规范下评估结果的差异;同时,提出“WER-Range”方法,即在多个合法转录规范下报告ASR性能范围,而非假设存在唯一正确答案,从而提升评估体系的包容性和公平性。
链接: https://arxiv.org/abs/2605.07084
作者: Anna Seo Gyeong Choi,Maria Teleki,James Caverlee,Miguel del Rio,Corey Miller,Hoon Choi
机构: Cornell University (康奈尔大学); Rev AI; Texas AM University (德州农工大学); Kangwon National University (江原国立大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Automatic speech recognition (ASR) evaluation compares system output to ground truth transcripts, with Word Error Rate (WER) quantifying the distance between them. But ground truth transcripts are not discovered - they are produced by human annotators following conventions that encode normative assumptions about which speech features matter. Different conventions (verbatim, non-verbatim, legal) produce different transcripts of identical speech and judge the same ASR output differently. This paper argues that reference monism - enforcing a single transcription convention as ground truth - commits epistemic injustice. Speakers with aphasia, whose speech includes clinically meaningful disfluencies, are systematically disadvantaged when evaluated against “clean” references that treat those disfluencies as errors. The harm is not merely differential performance, but that evaluative infrastructure lacks interpretive resources to recognize their contributions as legitimate. We develop a philosophical framework introducing the hermeneutical gap, formalize Epistemic Injustice Distance (EID) to measure reference monism’s cost, and demonstrate empirically using AphasiaBank that WER varies depending on which convention defines ground truth. We propose WER-Range: reporting performance across legitimate conventions rather than assuming a single correct answer.
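摘要提出的 WER-Range(在多种合法转录规范下分别计算 WER 并报告区间而非单点)可用如下极简草图表示;参考转录与假设输出均为虚构示例,`wer` 为标准词级编辑距离实现,非论文原始代码:

```python
def wer(ref, hyp):
    """基于词级 Levenshtein 距离的词错误率(分母为参考转录长度)。"""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def wer_range(references, hyp):
    # 同一 ASR 输出在每种合法规范下分别计分,报告区间
    scores = {name: wer(ref, hyp) for name, ref in references.items()}
    return min(scores.values()), max(scores.values()), scores

refs = {  # 同一段语音在两种转录规范下的虚构参考
    "verbatim": "i i want um want coffee",
    "non_verbatim": "i want coffee",
}
lo, hi, per_convention = wer_range(refs, "i want coffee")
```

同一假设输出在 non-verbatim 规范下 WER 为 0,在 verbatim 规范下为 0.5:单一“真值”的选取本身决定了分数,这正是论文主张整体报告区间的原因。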
[NLP-98] Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长文本流时,尽管具备更长的上下文窗口,却难以有效保留和复用有用信息的问题,即“持续上下文整合”(continual context consolidation)问题。其核心挑战在于如何在更新当前上下文信息的同时,避免对已固化知识造成干扰。解决方案的关键在于提出Self-Consolidating Language Models (SCoL),一种后训练框架:模型自身生成文本形式的更新指令,明确指定哪些Transformer层应被修改;并通过元强化学习(meta-reinforcement learning)训练,使模型在演化状态中优化更新策略。该方法通过稀疏更新机制引导可塑性集中在高Fisher信息量的层上,从而实现高效的知识获取与长期保留,且具备从短流到长流的可扩展性。
链接: https://arxiv.org/abs/2605.07076
作者: Zekun Wang,Anant Gupta,Zihan Dong,Christopher J. MacLellan
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages
Abstract:Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose \textbf{Self-Consolidating Language Models} (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.
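SCoL“将可塑性路由到高 Fisher 信息量的层”这一思想可用如下玩具示例说明;Fisher 分数为虚构数值,`select_update_layers` 是本文示意用的假设函数,并非论文实现:

```python
def select_update_layers(fisher, budget=2):
    """玩具版稀疏选层:在固定更新预算下,选出(假设的)
    Fisher 信息分数最高的若干层作为更新位置。"""
    ranked = sorted(range(len(fisher)), key=lambda i: fisher[i], reverse=True)
    return sorted(ranked[:budget])

fisher = [0.1, 0.9, 0.05, 0.7, 0.2]   # 虚构的逐层敏感度分数
layers = select_update_layers(fisher)  # 预算内仅更新少数高敏感层
```

真实系统中的选层由模型自身生成的文本更新指令决定,并通过元强化学习训练;此处仅示意“稀疏更新集中于损失敏感层”的最终效果。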
[NLP-99] WiCER: Wiki-memory Compile Evaluate Refine Iterative Knowledge Compilation for LLM Wiki Systems
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用知识库进行推理时面临的“编译鸿沟”(compilation gap)问题,即如何将原始文档高效、无损地编译为结构化的维基知识(wiki memory),以支持基于键值缓存(KV cache)的低延迟推理,同时避免因关键事实丢失导致性能崩溃。其核心挑战在于:直接编译可能导致灾难性事实丢失(catastrophic failure),而全量上下文KV缓存虽有效但存在注意力稀释(attention dilution)问题,难以扩展。解决方案的关键是提出WiCER(Wiki-memory Compile, Evaluate, Refine)算法——一种受反例引导抽象精化(CEGAR)启发的迭代方法,通过诊断探针识别被遗漏的关键事实,并在后续编译中强制保留这些事实,从而显著提升编译质量与稳定性。实验表明,仅需1–2次迭代即可恢复80%的原始全上下文性能,且将灾难性失败率降低55%,其中针对性诊断比通用固定策略贡献更大(+0.95 vs. +0.16)。
链接: https://arxiv.org/abs/2605.07068
作者: Juan M. Huerta
机构: Zinnia Tech Solutions (Zinnia科技解决方案)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The LLM Wiki pattern, to compile and provide domain knowledge into a persistent artifact and serve it to LLMs via KV cache inference, promises context access at sub-second latency with zero retrieval failure. Realizing this requires solving the compilation gap: LLM compilation distilling raw documents into a wiki without catastrophically discarding critical facts. We characterize this gap across 17 RepLiQA domains (6,800 questions): we observe that full context KV cache inference outperforms RAG on curated knowledge (4.38 vs. 4.08 out of 5, 7.3 faster TTFT) but degrades below RAG at scale due to attention dilution, and blind compilation fails entirely (2.14 to 2.32 vs. 3.46, 53 to 60% catastrophic failure rate). To address the compilation gap, we propose WiCER (Wiki-memory Compile, Evaluate, Refine), an iterative algorithm inspired by counterexample-guided abstraction refinement (CEGAR) that closes this gap. WiCER evaluates compiled wikis against diagnostic probes, identifies dropped facts, and forces their preservation in subsequent compilations. One to two iterations recover 80% of lost quality (mean 3.24 vs. 3.47 for raw full-context across the 15 topics with baselines), reducing catastrophic failures by 55% relative. An ablation across all 17 topics confirms that targeted diagnosis (+0.95), not generic pinning (+0.16), drives the gains. All code and benchmarks are released for reproducible research.
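WiCER 的“编译—评估—精化”循环可以用一个玩具版本示意:用字符串包含来模拟诊断探针,将被丢弃的事实固定(pin)后重新编译。以下 `compile_wiki`、`probe` 等均为假设性的示意函数,并非论文实现:

```python
def compile_wiki(docs, pinned):
    """玩具"编译器":只保留短句,以及所有被固定(pinned)的句子。"""
    return [s for s in docs if len(s.split()) <= 4 or s in pinned]

def probe(wiki, probes):
    # 诊断探针:哪些必需事实没有出现在编译后的 wiki 中?
    return [p for p in probes if p not in wiki]

def wicer(docs, probes, max_iters=3):
    pinned = set()
    for _ in range(max_iters):
        wiki = compile_wiki(docs, pinned)
        dropped = probe(wiki, probes)
        if not dropped:
            return wiki, pinned
        pinned.update(dropped)  # 下一轮编译强制保留被丢弃的事实
    return compile_wiki(docs, pinned), pinned

docs = [
    "the api key lives in config",
    "restart fixes it",
    "the retry limit is three attempts total",
]
probes = ["the retry limit is three attempts total"]
wiki, pinned = wicer(docs, probes)
```

第一轮长句被丢弃、探针失败;第二轮该事实被固定后通过,对应论文中“1–2 次迭代即可收敛”的 CEGAR 式精化过程。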
[NLP-100] MedExAgent : Training LLM Agents to Ask Examine and Diagnose in Noisy Clinical Environments
【速读】: 该论文旨在解决现有医疗大语言模型(Large Language Models, LLMs)基准和自动诊断方法在模拟真实临床诊断过程时存在的简化问题,即忽略了诊断过程中交互性和不确定性特征,如多轮问诊、患者与检查结果的噪声干扰等。其解决方案的关键在于将临床诊断建模为一个部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),定义三类动作:向患者提问、调用医学检查工具(tool calls)以及下达诊断结论,并引入包含七类患者噪声和三类检查噪声的系统性噪声模型。在此基础上,通过两阶段训练流程——首先基于Calgary-Cambridge临床访谈结构进行监督微调,再利用DAPO算法优化综合奖励(涵盖诊断准确性、工具调用质量及检查成本)——构建出MedExAgent诊断代理,实现了与更大模型相当的诊断性能,同时保持了更优的成本效益。
链接: https://arxiv.org/abs/2605.07058
作者: Yicheng Gao,Xiaolin Zhou,Yahan Li,Yue Zhao,Ruishan Liu
机构: University of Southern California (南加州大学); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Real-world clinical diagnosis is a complex process in which the doctor is required to obtain information from both interacting with the patient and conducting medical exams. Additionally, the doctor needs to adapt to different patient personas, as well as noisy and incomplete information that can happen at any time during the process. However, existing benchmarks for medical LLMs and methods for automatic diagnosis largely simplify this process by reducing it to single-turn question answering, noise-free conversations, or sequential exam making, etc., ignoring the interactive and uncertain nature of clinical diagnosis. In this paper, we aim to address this gap by formalizing clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) with three action types: questioning the patient, ordering medical exams as tool calls, and issuing a diagnosis. We also introduce a systematic noise model comprising seven patient noise types and three exam noise types. Using our proposed environment, we train an effective diagnosis agent, \textbf{MedExAgent}, through a two-stage pipeline that first performs supervised finetuning on synthetic conversations structured after the Calgary-Cambridge model for clinical interviews, and then applies DAPO to optimize a composite reward capturing diagnostic accuracy, tool call quality, and exam cost including financial cost and patient discomfort. Through extensive experiments and ablation studies, we demonstrate that MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost-efficient examination strategies.
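论文将诊断建模为含三类动作的 POMDP,并以综合奖励权衡准确性与检查成本。下面是一个假设性的结构草图:动作类型取自摘要,而奖励函数形式与权重均为虚构,仅示意“准确性减去加权成本”的思路:

```python
from dataclasses import dataclass
from enum import Enum, auto

class ActionType(Enum):
    ASK = auto()       # 向患者提问
    EXAM = auto()      # 以工具调用方式下达医学检查
    DIAGNOSE = auto()  # 给出最终诊断

@dataclass
class Action:
    kind: ActionType
    payload: str

def composite_reward(correct, exam_cost, discomfort, w_cost=0.01, w_disc=0.05):
    """假设性的综合奖励:诊断准确性减去加权的经济成本与患者不适。"""
    return (1.0 if correct else 0.0) - w_cost * exam_cost - w_disc * discomfort

a = Action(ActionType.EXAM, "order blood panel")
r = composite_reward(correct=True, exam_cost=20, discomfort=2)
```

真实环境中还叠加了七类患者噪声与三类检查噪声;此处只勾勒动作空间与奖励分解的骨架。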
[NLP-101] GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
【速读】: 该论文旨在解决现有数学推理基准(如GSM8K)在评估大语言模型(LLM)能力时因测试集固定而导致的过拟合与记忆化问题,即模型可能通过记忆特定题目而非真正理解推理逻辑来获得高分,从而高估其实际能力。解决方案的关键在于提出一种可重复使用且具有随机性的生成框架GSM-SEM,该框架通过对问题陈述中的实体、属性和关系进行语义层面的扰动,显著提升变体之间的语义差异性,同时保持原始计算过程和答案不变,并近似维持原题难度。这种机制迫使模型在新条件下重新推导解法,而非依赖记忆,从而有效降低对静态公开基准的依赖,减少记忆偏差。
链接: https://arxiv.org/abs/2605.07053
作者: Jyotika Singh,Fang Tu,Aziza Mirzadova,Amit Agarwal,Hitesh Laxmichand Patel,Sandip Ghoshal,Miguel Ballesteros,Yassine Benajiba,Weiyi Sun,Graham Horwood,Sujith Ravi,Dan Roth
机构: Oracle AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.
[NLP-102] NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在科学与数学教育评估中存在地域偏倚和评测方式局限的问题。现有基准多来自西方国家,且以选择题形式为主,难以真实反映模型的推理能力。为此,作者提出了NSMQ Riddles这一新基准,其关键在于引入来自加纳全国科学与数学竞赛(National Science and Maths Quiz, NSMQ)第五轮的1.8K道谜题类题目,每道题包含至少3个逐步揭示线索,答案为数字、单词或短语,便于自动评估;同时,这些题目源自全球南方(Global South)背景,强调高阶推理与快速反应能力,使得即使最先进的LLMs在该数据集上的表现仍低于顶尖学生参赛者,从而推动更具全球代表性与挑战性的科学与数学推理能力评测体系的建立。
链接: https://arxiv.org/abs/2605.07051
作者: George Boateng,Naafi Ibrahim,Samuel John,Philemon Badu,Patrick Agyeman-Budu,Jonathan Mensah,Kevin Yeboah,William Edor,Andrew Mensa-Onumah,Nana Yeboah,Victor Wumbor-Apin Kumbol
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages. Accepted at the 27th International Conference on Artificial Intelligence in Education
Abstract:Large Language Models (LLMs) have shown good performance on various science educational benchmarks, demonstrating their potential for use in science and mathematics education. Yet, LLMs tend to be evaluated on science and mathematical educational datasets from the Western world, with an underrepresentation of datasets from the Global South. Furthermore, they tend to have multiple-choice answer options that are trivial to evaluate. In this work, we present NSMQ Riddles, a novel benchmark of Scientific and Mathematical Riddles from Ghana’s National Science and Maths Quiz (NSMQ) competition to evaluate LLMs. The NSMQ is an annual live TV competition for senior secondary school students in Ghana that brings together the smartest high school students in Ghana who compete in teams of 2 by answering questions in biology, chemistry, physics, and math over five rounds and five stages until a winning team is crowned for that year. NSMQ Riddles consists of 11 years of riddle questions (n=1.8K) from the 5th round, with each riddle containing a minimum of 3 clues. Students compete to be the first to guess the answer on any of the clues, with earlier clues being vague and also fetching more points. The answers are usually a number, word, or short phrase, allowing for automatic evaluation. We evaluated state-of-the-art models: closed (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) with high and low reasoning settings. Our evaluation shows that the dataset is challenging even for state-of-the-art LLMs, which performed worse than the best student contestants. This work contributes a novel and challenging benchmark for scientific and mathematical reasoning from the Global South towards enabling a true global benchmarking of LLMs’ capabilities for science and mathematics education.
[NLP-103] Cognitive Agent Compilation for Explicit Problem Solver Modeling
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在教育场景中因广泛预训练导致的不可控性问题,即缺乏可解释和可编辑的知识状态,难以满足教育系统对透明性和可控性的需求。解决方案的关键在于提出认知代理编译(Cognitive Agent Compilation, CAC)框架,该框架通过强教师模型将问题求解知识编译为一个显式的目标代理,明确分离知识表示、问题求解策略与验证及更新规则三个模块,从而实现受约束的问题求解过程在教育应用中的可检查性和可编辑性。
链接: https://arxiv.org/abs/2605.07040
作者: Hyeongdon Moon,Carolyn Rosé,John Stamper
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted to AIED 2026 Blue Sky
Abstract:Large language models (LLMs) are widely used for tutoring, feedback generation, and content creation, but their broad pretraining makes them hard to constrain and poor substitutes for controllable learners. Educational systems often require inspectable and editable knowledge states: educators want to know what a system assumes the learner knows, and learners benefit when the system can justify actions in terms of explicit skills, misconceptions, and strategies. Inspired by cognitive architectures, we propose Cognitive Agent Compilation (CAC), a framework that uses a strong teacher LLM to compile problem-solving knowledge into an explicit target agent. CAC separates (i) knowledge representation, (ii) problem-solving policy, and (iii) verification and update rules, with the goal of making bounded problem solving more inspectable and editable in educational settings. We present an early proof of concept implemented with Small Language Models that surfaces key design trade-offs, particularly between explicit control and scalable generalization, and positions CAC as an initial step toward bounded-knowledge AI for educational applications.
[NLP-104] Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在生成样本质量与多样性上长期落后于自回归模型的问题。其核心解决方案是将文本建模为固定宽度二进制位流上的连续扩散过程,通过模拟语义标记(semantic tokens)为模拟位序列,并采用匹配滤波残差参数化方法分离上下文学习与独立位后验分布;关键创新在于引入一种基于熵率轮廓门控的随机采样器,自动在高信息区域集中随机性,其余区域近乎确定性演化,从而显著提升生成效率与质量。该方法在LM1B和OpenWebText数据集上均达到或超越现有自回归基准,同时通过比特级预测缓解了词汇表规模带来的 O(V) 内存瓶颈,实现更高效的可扩展语言生成架构。
链接: https://arxiv.org/abs/2605.07013
作者: Georgios Batzolis,Mark Girolami,Luca Ambrogioni
机构: University of Cambridge (剑桥大学); Imperial College London (伦敦帝国理工学院); Donders Institute for Brain, Cognition and Behaviour (多伦德大脑、认知和行为研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches over token embeddings have narrowed this gap, suggesting continuous state spaces are highly effective for language. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. Our approach represents semantic tokens as analog bit sequences and utilizes a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, automatically concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On the One Billion Word Benchmark (LM1B), our 130M-parameter bitstream model reaches a generative perplexity ( \GenPPL ) of 59.76 at matched real-data entropy ( 4.31 ) using 256 neural function evaluations (NFEs), decisively outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our stochastic sampler establishes a new continuous-DLM Pareto frontier, achieving \GenPPL=27.06 at an entropy of 5.26 using 4\times fewer steps than previous 1024-NFE baselines. As an additional architectural benefit, bitstream diffusion removes the \mathcal{O}(V) vocabulary scaling bottleneck shared by standard DLMs. By predicting \mathcal{O}(\log V) bitwise logits via semantic bit-patching, our model yields a reduced memory footprint and higher throughput, demonstrating a scalable paradigm for language generation as vocabulary sizes grow.
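摘要中“按熵率轮廓门控随机性”的思路可用如下草图示意:仅在逐位后验熵超过阈值的位置启用随机修正,其余位置近似确定性。阈值与后验数值均为虚构,仅说明门控机制本身:

```python
import math

def bit_entropy(p):
    # 伯努利位后验 P(bit=1)=p 的熵(单位:nat)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def entropy_gate(posteriors, threshold=0.3):
    """返回布尔掩码:True 表示该位置熵高(高信息量),
    应施加 Langevin 式随机修正;False 表示近似确定性演化。"""
    return [bit_entropy(p) > threshold for p in posteriors]

posteriors = [0.99, 0.5, 0.93, 0.45, 0.01]  # 虚构的逐位后验
mask = entropy_gate(posteriors)
```

高置信位(接近 0 或 1)熵低、被门控为确定性,只有不确定位才注入随机性,即“随机性自动集中于高信息区域”。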
[NLP-105] SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
【速读】: 该论文旨在解决生成式 AI(Generative AI)在修复架构型代码异味(Architectural Code Smells)方面的有效性问题,此类异味会损害软件可维护性且难以通过人工手动修复,因其需要跨模块的设计意图推理,这超出了当前大型语言模型(LLM)代理在局部代码修复任务中的能力边界。解决方案的关键在于提出 SmellBench——一个任务编排框架,其核心创新包括:针对不同异味类型优化的提示策略、支持多步迭代执行的能力,以及一种分离评估修复有效性、误报识别准确性和整体代码库净影响的评分机制。该框架使作者能够首次对来自 GPT、Claude、Gemini 和 Mistral 四类模型的 11 种 LLM 代理配置进行系统性实证评估,揭示了当前 LLM 在架构级重构中存在显著能力缺口,同时为未来自动化软件工程研究提供了可复用的基础设施。
链接: https://arxiv.org/abs/2605.07001
作者: Ion George Dinu(1),Marian Cristian Mihăescu(1),Traian Rebedea(2) ((1) University of Craiova, Craiova, Romania, (2) University Politehnica of Bucharest, Bucharest, Romania)
机构: University of Craiova (克劳约瓦大学); University Politehnica of Bucharest (布加勒斯特理工大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Preprint. 11 pages, 3 figures. Submitted to the 41st IEEE/ACM International Conference on Automated Software Engineering (ASE 2026)
Abstract:Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to \kappa = 0.94 expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at this https URL.
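摘要中将“修复率”与“对代码库的净影响”分开评估的思路可用一行算式说明。其中 resolved=31 是由 47.7% × 65 反推的假设值,introduced=140 对应摘要中最激进代理新引入的异味数,仅作示意:

```python
def net_impact(resolved, introduced, total_detected):
    """玩具版评分拆分:修复率(有效性)与净变化(整体影响)分开报告。"""
    resolution_rate = resolved / total_detected
    return resolution_rate, resolved - introduced

rate, net = net_impact(resolved=31, introduced=140, total_detected=65)
```

修复率约 47.7% 的代理若同时引入 140 个新异味,净影响为负:这正是论文强调“修复激进度与净质量成反比”的量化含义。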
[NLP-106] Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries
【速读】: 该论文旨在解决技能增强型智能体在调用大型可重用技能库时,现有方法返回的原子技能或依赖感知技能包内部角色不明确的问题,导致智能体需自行推断执行入口点、支持技能、可见需求及避错指导,从而影响执行效率与准确性。其解决方案的关键在于提出一种推理时分组结构化检索方法——GoSkills(Group of Skills),通过构建基于类型技能图的锚点中心技能组、利用组图扩展支持组、将选定的组计划压缩为有限原子技能负载,并生成包含Start、Support、Check和Avoid字段的固定执行契约,从而向智能体提供清晰的角色标注执行上下文,无需修改下游智能体、技能负载或执行环境即可提升性能。
链接: https://arxiv.org/abs/2605.06978
作者: Kun Zeng,Yu Huo,Siyu Zhang,Zi Ye,Yuecheng Zhuo,Haoyue Liu,Yuquan Lu,Junhao Wen,Xiaoying Tang
机构: Sun Yat-sen University (中山大学); University of California, San Diego (加州大学圣地亚哥分校); Taiyuan University of Technology (太原理工大学); School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 pages, 4 figures, 24 tables
Abstract:Skill-augmented agents increasingly rely on large reusable skill libraries, but retrieving relevant skills is not the same as presenting usable context. Existing methods typically return atomic skills or dependency-aware bundles whose internal roles remain implicit, leaving the agent to infer the execution entry point, support skills, visible requirements, and failure-avoidance guidance. We introduce Group of Skills (GoSkills), an inference-time group-structured retrieval method that changes the agent-facing retrieval object from a flat skill list to a compact, role-labeled execution context. GoSkills builds anchor-centered skill groups from a typed skill graph, expands support groups through a group graph, bottlenecks the selected group plan into a bounded set of atomic skill payloads, and renders a fixed execution contract with Start, Support, Check, and Avoid fields, without changing the downstream agent, skill payloads, or execution environment. Experiments on SkillsBench and ALFWorld show that GoSkills preserves visible-requirement coverage under a small skill budget, improves over flat skill-access baselines, and often improves reward and agent-only runtime relative to structural retrieval references.
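GoSkills 向代理渲染的固定执行契约包含 Start、Support、Check、Avoid 四个字段;其数据结构可草拟如下(字段名取自摘要,容器本身与填充内容为虚构示例):

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionContract:
    """角色标注的固定执行契约(字段名来自论文摘要,结构为示意)。"""
    start: str                                     # 执行入口技能
    support: list = field(default_factory=list)    # 支持技能
    check: list = field(default_factory=list)      # 需核验的可见需求
    avoid: list = field(default_factory=list)      # 避错指导

contract = ExecutionContract(
    start="open_browser",
    support=["fill_form", "parse_table"],
    check=["login session is active"],
    avoid=["do not submit twice"],
)
```

与扁平技能列表相比,这种契约把入口点、支持技能与约束的角色显式标注出来,免去代理自行推断。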
[NLP-107] MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media
【Quick Read】: This paper addresses systematic biases of large language models (LLMs) performing closed-set, instruction-driven annotation in low-resource languages such as Bengali, in particular a label agreement illusion that masks under-detection of minority categories. The key to the solution is MultiSoc-4D, a large multi-dimensional Bengali social-media benchmark (58K+ comments annotated for category, sentiment, hate speech, and sarcasm), built with a structured pipeline in which ChatGPT, Gemini, Claude, and Grok each annotate separate partitions while sharing a 20% validation set, enabling systematic diagnosis of LLM behavior. The study uncovers a prevalent phenomenon of instruction-induced label collapse: LLMs systematically prefer fallback labels (Other, Neutral, No), producing high agreement rates yet severely under-detecting minority categories, a bias that persists across 40+ LLMs regardless of architecture.
Link: https://arxiv.org/abs/2605.06940
Authors: Souvik Pramanik,S.M. Riaz Rahman Antu,Shak Mohammad Abyad,Md. Ibrahim Khalil,Md. Shahriar Hussain
Affiliations: North South University
Subjects: Computation and Language (cs.CL)
Comments: 21 pages, 14 figures, 13 tables
Abstract:Annotation automation via Large Language Models (LLMs) is the core approach for scaling NLP datasets; however, LLM behavior with respect to closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media dataset benchmark, which contains 58K+ social media comments from six sources annotated along four dimensions: category, sentiment, hate speech, and sarcasm. By employing a structured pipeline where ChatGPT, Gemini, Claude, and Grok individually annotate separate partitions, while sharing a common validation set of 20%, we diagnose LLM behavior systematically. We discover a prevalent phenomenon called “instruction-induced label collapse”, wherein LLMs show a systematic preference towards fallback labels (Other, Neutral, No), leading to high agreement rates but under-detection of minority categories. For example, we find that LLMs failed to detect 79% and 75% of instances with hateful and sarcastic content compared to a human-calibrated reference. Furthermore, we prove that it represents a “label agreement illusion”, statistically validated via almost null Fleiss’ Kappa ( \kappa \approx -0.001 ) on sarcasm detection. Across 40+ LLMs, we benchmark this annotation bias propagation within the training pipeline, regardless of architectural differences. We release MultiSoc-4D as a diagnostic benchmark for annotation biases in Bengali NLP.
[NLP-108] Can LLMs Take Retrieved Information with a Grain of Salt?
【Quick Read】: This paper addresses the insensitivity of large language models (LLMs) to the certainty of retrieved information in retrieval-augmented settings: models fail to adjust their responses to the uncertainty of what they retrieve, which can have serious consequences in high-stakes domains such as medicine and finance. The key to the solution is an interaction strategy that requires no modification of model weights, combining prior-knowledge reminders, certainty recalibration, and context simplification to substantially improve context-certainty obedience; experiments show the strategy reduces obedience errors by 25% on average.
Link: https://arxiv.org/abs/2605.06919
Authors: Behzad Shayegh,Mohamed Osama Ahmed,Fred Tung,Leo Feng
Affiliations: RBC Borealis
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models have demonstrated impressive retrieval-augmented capabilities. However, a crucial area remains underexplored: their ability to appropriately adapt responses to the certainty of the retrieved information. It is a limitation with real consequences in high-stakes domains like medicine and finance. We evaluate eight LLMs on their context-certainty obedience, measuring how well they adjust responses to match expressed context certainty. Our analysis reveals systematic limitations: LLMs struggle to recall prior knowledge after observing an uncertain context, misinterpret expressed certainties, and overtrust complex contexts. To address these, we propose an interaction strategy combining prior reminders, certainty recalibration, and context simplification. This approach reduces obedience errors by 25% on average, without modifying model weights, demonstrating the efficacy of interaction design in enhancing LLM reliability. Our contributions include a principled evaluation metric, empirical insights into LLMs’ uncertainty handling, and a portable strategy to improve context-certainty obedience across diverse LLMs.
[NLP-109] Regulating Branch Parallelism in LLM Serving
【Quick Read】: This paper addresses scheduling inefficiency in large language model (LLM) serving when intra-request parallelism (IRP) is exposed. Existing serving systems either admit branches eagerly, inflating the shared decode step and degrading throughput for co-batched requests, or impose conservative fixed width caps that forgo the gains parallel branches were meant to provide. The key to the solution is TAPER, a per-step admission controller that treats extra branches as opportunistic work, admitting them only when the predicted branch externality fits within the current batch's slack budget. Because branch-level scheduling decouples compute from memory (branches share the KV cache of the request prefix), per-step width can be expanded or contracted without memory reclamation, yielding significant throughput gains while maintaining high SLO attainment.
Link: https://arxiv.org/abs/2605.06914
Authors: Swapnil Gandhi,Siva Hari,William J. Dally,Christos Kozyrakis
Affiliations: unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Recent methods expose intra-request parallelism in LLM outputs, allowing independent branches to decode concurrently. Existing serving systems execute these branches eagerly or under fixed caps. We show that both are brittle: eager admission inflates the shared decode step, degrading co-batched requests in serial stages, while conservative fixed caps forgo the throughput that motivated exposing branches in the first place. We call the excess step latency caused by admitted branches the branch externality and show that the safe width depends on batch composition, context lengths, and accumulated slack, all of which change continuously over a workload trace. We introduce TAPER, a per-step admission controller that treats extra branches as opportunistic work, admitted only when the predicted branch externality fits within the batch’s current slack budget. Per-step regulation is practical because branch-level scheduling decouples compute from memory: branches share the request’s prefix KV, so expanding or contracting width requires no memory reclamation. On Qwen3-32B, TAPER improves goodput by 1.77\times over IRP-Off and by 1.48\times over IRP-Eager, while maintaining over 95% SLO attainment.
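The per-step admission rule described in the abstract, admitting extra branches only while their predicted externality fits within the batch's slack budget, can be pictured with a small sketch. This is an illustrative simplification, not TAPER's implementation; the `predict_externality` callback and the cheapest-first greedy order are assumptions made for the example.

```python
def admit_branches(candidates, slack_budget, predict_externality):
    """Greedily admit candidate branches while the predicted extra step
    latency (the branch externality) still fits in the slack budget."""
    admitted = []
    remaining = slack_budget
    # Consider cheapest branches first so the budget admits as many as possible.
    for branch in sorted(candidates, key=predict_externality):
        cost = predict_externality(branch)
        if cost > remaining:
            break
        admitted.append(branch)
        remaining -= cost
    return admitted
```

Because branches share the request prefix's KV cache, rejecting a branch at one step costs nothing: it can simply be reconsidered at the next step with an updated slack budget.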
[NLP-110] MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text
【Quick Read】: This paper addresses a key challenge for deployed AI-generated-text detectors: maintaining high detection accuracy while staying robust to adversarial attacks, transferring to unseen generators and domains, and operating stably at low false-positive rates (FPR). Existing methods optimize only the binary AI/Human objective, giving the representation little incentive to learn generator-family, attack-type, or source-domain structure once the main task saturates. The core of the solution, MELD (Multi-Task Equilibrated Learning Detector), attaches generator-family, attack-type, and source-domain heads to a shared encoder and balances the four losses with learned homoscedastic uncertainty weights; an EMA teacher-student distillation scheme strengthens robustness to attacks, and a hard-negative pairwise ranking loss enlarges the score margin between AI-generated text and the most confusable human text. At inference only the base binary head is kept, so the interface and cost match a standard detector while unifying accuracy, robustness, and generalization.
Link: https://arxiv.org/abs/2605.06903
Authors: Chenjun Li,Cheng Wan,Johannes C. Paetzold
Affiliations: Cornell University; Weill Cornell Medicine; Cornell Tech
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures
Abstract:Large language models are now embedded in everyday writing workflows, making reliable AI-generated text detection important for academic integrity, content moderation, and provenance tracking. In practice, however, a detector must do more than achieve high aggregate AUROC on clean, in-distribution human and AI text: it should remain robust to attacks and adversarial rewrites, transfer to unseen generators and domains, and operate at low false-positive rates (FPR). Most existing detectors optimize a single AI/Human objective, giving the representation little incentive to learn generator, attack, or domain structure once the binary task saturates. We introduce MELD (Multi-Task Equilibrated Learning Detector), a deployable detector for AI-generated text that enriches binary detection with auxiliary supervision. MELD attaches generator-family, attack-type, and source-domain heads to a shared encoder, and balances the four losses with learned homoscedastic uncertainty weights. To improve robustness, an EMA teacher predicts on clean inputs while an attack-augmented student is distilled toward the teacher. MELD further uses a hard-negative pairwise ranking loss to enlarge the score margin between AI-generated texts and the most confusable human texts. At inference, all auxiliary heads are discarded, giving MELD the same interface and cost as a standard detector. On the public RAID leaderboard, MELD is the strongest open-source detector and is competitive with leading commercial models, especially under attack and at low FPR. Across standard held-out benchmarks, MELD matches or outperforms supervised baselines. We further introduce MELD-eval, a held-out evaluation pool built from recent chat models released by four major LLM providers. Without additional finetuning, MELD achieves 99.9% TPR at 1% FPR on MELD-eval, while many baselines degrade sharply.
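The "learned homoscedastic uncertainty weights" mentioned above are commonly implemented with the multi-task formulation total = Σᵢ exp(−sᵢ)·Lᵢ + sᵢ, where sᵢ = log σᵢ² is a learnable per-task log-variance. A minimal sketch, with the exact functional form assumed rather than taken from the paper:

```python
import math

def uncertainty_weighted_total(losses, log_vars):
    """Combine per-task losses L_i with learnable log-variances s_i:
    total = sum_i exp(-s_i) * L_i + s_i. A small s_i up-weights its task,
    while the +s_i term penalizes inflating the uncertainty to dodge a loss."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))
```

In training, the `log_vars` would be optimized jointly with the encoder so the balance among the binary, generator, attack, and domain losses is learned rather than hand-tuned.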
[NLP-111] Reflections and New Directions for Human-Centered Large Language Models
【Quick Read】: This paper addresses the neglect of core human concerns in the rapid development of large language models (LLMs): how to systematically incorporate human values, preferences, and goals across the entire development lifecycle rather than through cursory post-training adjustments. The key to the solution is a Human-Centered Large Language Models (HCLLMs) framework that integrates perspectives from Natural Language Processing (NLP), Human-Computer Interaction (HCI), and responsible AI, insisting that every stage, from system design, data sourcing, and model training to evaluation and responsible deployment, be designed and executed with human priorities first.
Link: https://arxiv.org/abs/2605.06901
Authors: Caleb Ziems,Dora Zhao,Rose E. Wang,Matthew Jörke,Ahmad Rushdi,Advit Deepak,Sunny Yu,Anshika Agarwal,Harshvardhan Agarwal,Gabriela Aranguiz-Dias,Aditri Bhagirath,Justine Breuch,Huanxing Chen,Ruishi Chen,Sarah Chen,Haocheng Fan,William Fang,Cat Gonzales Fergesen,Daniel Frees,Tian Gao,Ziqing Huang,Vishal Jain,Yucheng Jiang,Kirill Kalinin,Su Doga Karaca,Arpandeep Khatua,Teland La,Isabelle Levent,Miranda Li,Xinling Li,Yongce Li,Angela Liu,Minsik Oh,Nathan J. Paek,Anthony Qin,Emily Redmond,Michael J. Ryan,Aadesh Salecha,Xiaoxian Shen,Pranava Singhal,Shashanka Subrahmanya,Mei Tan,Irawadee Thawornbut,Michelle Vinocour,Xiaoyue Wang,Zheng Wang,Henry Jin Weng,Pawan Wirawarn,Shirley Wu,Sophie Wu,Yichen Xie,Patrick Ye,Sean Zhang,Yutong Zhang,Cathy Zhou,Yiling Zhao,James Landay,Diyi Yang
Affiliations:
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) are increasingly shaping the private and professional lives of users, with numerous applications in business, education, finance, healthcare, law, and science. With this rise in global influence comes greater urgency to build, evaluate, and deploy these systems in a manner that prioritizes not only technical capabilities but also human priorities. This work presents a framework for developing Human-Centered Large Language Models (HCLLMs), which integrates perspectives from Natural Language Processing (NLP), Human-Computer Interaction (HCI), and responsible AI. Considering the ethics, economics, and technical objectives of language modeling, we argue that model developers need to address human concerns, preferences, values, and goals, not only during a cursory post-training stage, but rather with rigor and care at every stage of the pipeline. This paper offers human-centered insights and recommendations for developers at each stage, from system design to data sourcing, model training, evaluation, and responsible deployment. Then we conclude with a case study, applying these insights to understand the future of work with HCLLMs.
[NLP-112] TajPersLexon: A Tajik-Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP KR
【Quick Read】: This paper addresses cross-script word-level alignment and retrieval between Tajik and Persian in low-resource settings, particularly for practical applications such as OCR post-correction, where the challenge is achieving accurate, efficient lexical mapping with limited annotated data while balancing interpretability and deployability. The key to the solution is a lightweight hybrid pipeline combining rule-based modules with statistical methods that runs on CPU only and reaches 96.4% accuracy in an OCR post-correction task; large multilingual sentence transformers fail on this exact lexical-matching task, so the interpretable hybrid model offers a favorable accuracy-efficiency trade-off with strong practical potential.
Link: https://arxiv.org/abs/2605.06886
Authors: Mullosharaf K. Arabov
Affiliations: Kazan Federal University
Subjects: Computation and Language (cs.CL)
Comments: Published in The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family (SilkRoadNLP 2026), pages 29-37, Rabat, Morocco. Association for Computational Linguistics
Abstract:This work introduces TajPersLexon, a curated Tajik–Persian parallel lexical resource of 40,112 word and short-phrase pairs for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. We conduct a comprehensive CPU-only benchmark comparing three methodological families: (i) a lightweight hybrid pipeline, (ii) neural sequence-to-sequence models, and (iii) retrieval methods. Our evaluation establishes that the task is essentially solvable, with neural and retrieval baselines achieving 98-99% top-1 accuracy. Crucially, we demonstrate that while large multilingual sentence transformers fail on this exact lexical matching, our interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, achieving 96.4% accuracy in an OCR post-correction task. All experiments use fixed random seeds for full reproducibility. The dataset, code, and models will be publicly released.
[NLP-113] Benchmarked Yet Not Measured – Generative AI Should be Evaluated Against Real-World Utility
【Quick Read】: This paper addresses the benchmark utility gap: generative AI systems score well on standard benchmarks yet fail to deliver real-world utility, a disconnect the authors identify across 28 deployment cases spanning education, healthcare, software engineering, and law. The gap is traced to three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment. The key to the solution is a paradigm shift from static, benchmark-centered transparency toward utility transparency conditioned on stakeholders, goals, and context, grounded in capability change along human outcome trajectories. This is operationalized as SCU-GenEval, a four-stage evaluation framework (stakeholder-goal mapping, construct-indicator specification, mechanism modeling, and longitudinal utility measurement) supported by three instruments: structured deployment protocols, context-conditioned user simulators, and persona- and goal-conditioned proxy metrics, making the real-world utility of generative AI measurable and trackable.
Link: https://arxiv.org/abs/2605.06856
Authors: Ishani Mondal,Shweta Bhardwaj
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 20 pages
Abstract:Generative AI systems achieve impressive performance on standard benchmarks yet fail to deliver real-world utility, a disconnect we identify across 28 deployment cases spanning education, healthcare, software engineering, and law. We argue that this benchmark utility gap arises from three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment. Motivated by these observations, we argue that generative AI evaluation requires a paradigm shift from static benchmark-centered transparency toward stakeholder, goal, and context-conditioned utility transparency grounded in human outcome trajectories. Existing evaluations primarily characterize properties of model outputs, while deployment success depends on whether interaction with AI improves stakeholders’ ability to achieve their goals over time. The missing construct is therefore utility: the change in a stakeholder’s capability induced through sustained interaction with an AI system within a deployment context. To operationalize this perspective, we propose SCU-GenEval, a four-stage evaluation framework consisting of stakeholder-goal mapping, construct-indicator specification, mechanism modeling, and longitudinal utility measurement. To make these stages practically deployable, we introduce three supporting instruments: structured deployment protocols, context-conditioned user simulators, and persona- and goal-conditioned proxy metrics. We conclude with domain-specific calls to action, arguing that progress in generative AI must be evaluated through measurable improvements in human outcomes rather than benchmark performance alone.
[NLP-114] IntentGrasp: A Comprehensive Benchmark for Intent Understanding
【Quick Read】: This paper addresses the inadequate intent-understanding ability of large language models (LLMs), which directly affects the helpfulness and safety of AI assistants. To evaluate this ability systematically, the authors build the IntentGrasp benchmark, covering 12 domains with 262,759 training instances and two evaluation sets (All Set and Gem Set), and find that mainstream LLMs perform poorly, averaging below 25% on the more challenging Gem Set against an estimated human performance of ~81.1%. The key to the solution is Intentional Fine-Tuning (IFT), which fine-tunes models on IntentGrasp's training data, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set; leave-one-domain-out experiments further verify strong cross-domain generalization, establishing IFT as a promising route to stronger intent understanding in LLMs.
Link: https://arxiv.org/abs/2605.06832
Authors: Yuwei Yin,Chuyuan Li,Giuseppe Carenini
Affiliations: University of British Columbia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: IntentGrasp data is available on [Hugging Face]( this https URL ), and the code is released on [GitHub]( this https URL )
Abstract:Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source datasets curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) demonstrate unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem set. Notably, 17 out of 20 tested models perform worse than a random-guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance such ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes the models on the training set in IntentGrasp, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. Tellingly, the leave-one-domain-out (Lodo) experiments further demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefits and social good.
[NLP-115] ProtSent: Protein Sentence Transformers
【Quick Read】: This paper addresses a limitation of protein language models (pLMs): their mean-pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary, or structural similarity between proteins. The key to the solution is Protein Sentence Transformers (ProtSent), a contrastive fine-tuning framework that trains with MultipleNegativesRankingLoss across five protein-pair datasets, restructuring the embedding space to capture protein function and structure more effectively without any task-specific supervision.
Link: https://arxiv.org/abs/2605.06830
Authors: Dan Ofer,Oriel Perets,Michal Linial,Nadav Rappoport
Affiliations: The Hebrew University of Jerusalem; Ben-Gurion University of the Negev
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 9 figures, appendix, 2 figures, open code and models
Abstract:Protein language models (pLMs) produce per-residue representations that capture evolutionary and structural information, yet their mean-pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary or structural similarity between proteins. We present Protein Sentence Transformers (ProtSent), a contrastive fine-tuning framework for adapting PLMs into general-purpose embedding models. ProtSent trains with MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, and StringDB protein–protein interactions, and Deep Mutational Scanning data. We evaluate on 23~downstream tasks using frozen embeddings with a k-nearest-neighbor probe to measure embedding neighborhood quality. On ESM-2 150M, ProtSent improves 15 of 23 tasks, with gains of +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 structural retrieval. The 35M variant improves 16 of 23 tasks with +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40. Contrastive fine-tuning restructures the embedding space to better capture protein function and structure, without any task-specific supervision. We release the models, public data, and training recipe and code.
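The evaluation protocol above scores frozen embeddings with a k-nearest-neighbor probe to measure neighborhood quality. A minimal sketch of such a probe; the use of cosine similarity and majority voting here is an assumption for illustration, not necessarily the paper's exact setup:

```python
import numpy as np

def knn_probe(train_emb, train_labels, query_emb, k=3):
    """Classify a query by majority vote over its k nearest training
    embeddings under cosine similarity; the encoder stays frozen."""
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = a @ q                 # cosine similarity to every training item
    top = np.argsort(-sims)[:k]  # indices of the k most similar items
    votes = [train_labels[i] for i in top]
    return max(set(votes), key=votes.count)
```

Because the probe has no learned parameters, any accuracy gain over the base pLM can be attributed to the contrastively restructured embedding space rather than to a downstream classifier.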
[NLP-116] VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
【Quick Read】: This paper addresses the limited expressiveness of existing spoken language models (SLMs): moving beyond natural conversation to role-playing and singing requires modeling paralinguistic information such as tone, mood, and performance traits that SLMs focused on semantic accuracy and fluency struggle to capture. The key to the solution is VITA-QinYu, the first expressive end-to-end (E2E) spoken language model, whose core innovation is a hybrid speech-text paradigm extending interleaved text-audio modeling with multi-codebook audio tokens, enriching paralinguistic representation while keeping modalities cleanly separated to avoid cross-modal interference. This design outperforms peer models by 7 percentage points on role-playing benchmarks and by 0.13 MOS points on singing quality, while also achieving state-of-the-art conversational accuracy and fluency.
Link: https://arxiv.org/abs/2605.06765
Authors: Jiacheng Xu,Heting Gao,Liufei Xie,Zhenchuan Yang,Lijiang Li,Yiting Chen,Bin Zhang,Meng Chen,Chaoyu Fu,Weifeng Zhao,Wenjiang Zhou
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: this https URL
Abstract:Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role-playing and singing. We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks, and surpassing peer models by 0.13 points on a 5-point MOS scale for singing. Simultaneously, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.
[NLP-117] When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents
【Quick Read】: This paper addresses unintended long-term state poisoning in personalized large language model (LLM) agents with persistent cross-session state: routine user-agent interactions can gradually erode authorization boundaries, expand tool-use defaults, and escalate autonomous behavior, creating latent security risks. The key to the solution is StateGuard, a lightweight post-execution defense that audits state diffs at the writeback boundary and selectively rolls back dangerous edits; with minimal overhead it reduces the Harm Score (HS) to near zero while lowering false-negative rates, effectively suppressing long-term state poisoning.
Link: https://arxiv.org/abs/2605.06731
Authors: Xiaoyu Xu,Minxin Du,Qipeng Xie,Haobin Ke,Qingqing Ye,Haibo Hu
Affiliations: The Hong Kong Polytechnic University; Hong Kong University of Science and Technology, HKUST (Guangzhou)
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages
Abstract:Personalized LLM agents maintain persistent cross-session state to support long-horizon collaboration. Yet, this persistence introduces a subtle but critical security vulnerability: routine user-agent interactions can gradually reshape an agent’s long-term state, inadvertently weakening future confirmation boundaries, expanding tool-use defaults, and escalating autonomous behavior over time. We formalize this risk as \textbfunintended long-term state poisoning. To systematically study it, we introduce the \textbfUnintended Long-Term State Poisoning Bench (ULSPB), a bilingual benchmark comprising 350 settings spanning five assistance categories, seven interaction patterns, 24-turn routine interactions, and matched single-injection counterparts. Furthermore, we define the \emphHarm Score (HS), a state-centric metric that quantifies \emphauthorization drift, \emphtool-use escalation, and \emphunchecked autonomy. Experiments on OpenClaw with four backbone LLMs demonstrate that, while single-injection is generally effective, routine conversations alone can substantially poison long-term state, primarily corrupting memory-centric artifacts. Evaluations seeded with real-world user interactions confirm that this risk is not a mere artifact of synthetic prompts. To mitigate this threat, we propose \textbfStateGuard, a lightweight, post-execution defense that audits state diffs at the writeback boundary and selectively rolls back dangerous edits. Across all evaluated models, StateGuard reduces HS to near zero and lowers false-negative rates, with acceptable high false-positive rates under a safety-first writeback defense and minimal overhead.
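The writeback-boundary audit can be pictured as diffing the pre- and post-interaction agent state and reverting any edit a safety predicate flags. A hypothetical sketch: the dictionary state representation, the key names, and the `is_dangerous` predicate are illustrative assumptions, not StateGuard's actual interface.

```python
def audit_writeback(old_state, new_state, is_dangerous):
    """Return a sanitized state: keep benign edits from new_state, but
    roll back any changed or added entry the predicate flags as dangerous."""
    safe = dict(new_state)
    for key, new_val in new_state.items():
        old_val = old_state.get(key)
        if new_val != old_val and is_dangerous(key, old_val, new_val):
            if key in old_state:
                safe[key] = old_val  # revert the dangerous edit
            else:
                del safe[key]        # drop the dangerous addition
    return safe
```

Auditing the diff rather than the whole state keeps the check cheap, which matches the paper's claim of minimal overhead under a safety-first writeback defense.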
[NLP-118] When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment
【Quick Read】: This paper addresses the unobservability of when a language model's answer preference stabilizes during reasoning: the model forms a preference for its final answer while generating, but the visible output does not reveal when that preference became stable. The key to the solution is a computable notion of finite-answer preference stabilization: projecting the model's continuation probabilities over designated answer verbalizers onto a finite answer set yields an exact log-odds signal δ(ξ), which in binary tasks is an exact log-odds code. Without greedy rollouts or learned probes, this defines parser-based answer onset, retrospective stabilization time, and lead; in controlled delayed-verdict tasks with Qwen3-4B-Instruct, the signal stabilizes before the answer becomes parseable, with a mean lead of 17-31 tokens, and is shown to be linearly recoverable, partly separable from cursor progress, and to transfer across tasks as shared information.
Link: https://arxiv.org/abs/2605.06723
Authors: Long Zhang,Wei-neng Chen,Feng-feng Wei,Zi-bo Qin
Affiliations: South China University of Technology
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Language models often generate reasoning before giving a final answer, but the visible answer does not reveal when the model’s answer preference became stable. We study this question through a narrow computable object: \emphfinite-answer preference stabilization. For a model state and specified answer verbalizers, we project the model’s own continuation probabilities onto a finite answer set; in binary tasks this yields an exact log-odds code, \delta(\xi)=S_\theta(\mathrmyes\mid\xi)-S_\theta(\mathrmno\mid\xi) . This target defines parser-based answer onset, retrospective stabilization time, and lead without relying on greedy rollouts or learned probes. In controlled delayed-verdict tasks with Qwen3-4B-Instruct, the contextual finite-answer projection stabilizes before the answer is parseable, with 17–31 token mean lead in the main templates and positive, shorter lead in a parser-clean replication. The signal tracks the model’s eventual output rather than truth, is linearly recoverable from compact hidden summaries, is partly separable from cursor progress, and transfers as shared information without a single invariant coordinate. Diagnostics separate the measurement from online stopping, verbalizer-free belief, and causal answer control; exact steering shows local sensitivity of \delta but not reliable generation control.
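In the binary case, projecting continuation probabilities onto the answer set reduces to a log-odds over the two verbalizer groups. A minimal sketch of that projection from a next-token distribution; the verbalizer sets and dictionary representation are illustrative assumptions:

```python
import math

def finite_answer_logodds(token_probs, yes_verbalizers, no_verbalizers):
    """Project next-token probabilities onto {yes, no} and return
    delta = log p(yes) - log p(no) after renormalizing over the two groups."""
    p_yes = sum(token_probs.get(t, 0.0) for t in yes_verbalizers)
    p_no = sum(token_probs.get(t, 0.0) for t in no_verbalizers)
    total = p_yes + p_no
    if total == 0.0:
        raise ValueError("no probability mass on any verbalizer")
    return math.log(p_yes / total) - math.log(p_no / total)
```

Tracking this scalar at every decoding step is what makes the stabilization time and the token-level lead measurable without re-running greedy rollouts.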
[NLP-119] From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms ACL2026
【Quick Read】: This survey addresses the fragmentation of research on memory mechanisms in large language model (LLM) agents, whose development oscillates between operating-system engineering and cognitive science, leaving the field without a unified technical synthesis or a coherent evolutionary account. The key to the solution is a novel evolutionary framework that formalizes the development of LLM agent memory into three stages: Storage (trajectory preservation), Reflection (trajectory refinement), and Experience (trajectory abstraction), and explains the evolution through three core drivers: the need for long-range consistency, the challenges of dynamic environments, and the goal of continual learning. The survey focuses in particular on two transformative mechanisms in the frontier Experience stage, proactive exploration and cross-trajectory abstraction, offering solid design principles and a clear roadmap for next-generation LLM agents.
Link: https://arxiv.org/abs/2605.06716
Authors: Jinghao Luo,Yuchen Tian,Chuxue Cao,Ziyang Luo,Hongzhan Lin,Kaixin Li,Chuyi Kong,Ruichao Yang,Jing Ma
Affiliations: Hong Kong Baptist University; South China Normal University; Hong Kong University of Science and Technology; National University of Singapore; University of Science and Technology Beijing
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ACL 2026 Findings
Abstract:Large Language Model (LLM)-based agents have fundamentally reshaped artificial intelligence by integrating external tools and planning capabilities. While memory mechanisms have emerged as the architectural cornerstone of these systems, current research remains fragmented, oscillating between operating system engineering and cognitive science. This theoretical divide prevents a unified view of technological synthesis and a coherent evolutionary perspective. To bridge this gap, this survey proposes a novel evolutionary framework for LLM agent memory mechanisms, formalizing the development process into three stages: Storage (trajectory preservation), Reflection (trajectory refinement), and Experience (trajectory abstraction). We first formally define these three stages before analyzing the three core drivers of this evolution: the necessity for long-range consistency, the challenges in dynamic environments, and the ultimate goal of continual learning. Furthermore, we specifically explore two transformative mechanisms in the frontier Experience stage: proactive exploration and cross-trajectory abstraction. By synthesizing these disparate views, this work offers robust design principles and a clear roadmap for the development of next-generation LLM agents.
[NLP-120] CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment
【Quick Read】: This paper addresses the lack of continual learning in deployed large language models (LLMs): once training ends, conventional LLMs cannot adapt from real interaction experience, limiting long-term improvement in dynamic environments. The key to the solution is CASCADE (CASe-based Continual Adaptation during DEployment), a general and principled framework that equips LLM agents with an evolving episodic memory so they can keep learning from deployment experience without modifying model parameters. CASCADE formulates experience reuse as a contextual bandit problem, enabling principled exploration-exploitation trade-offs with no-regret guarantees over long-horizon interaction, so agents can accumulate, select, and refine task-relevant cases, turning past experience into actionable knowledge and substantially improving success rates across diverse domains.
Link: https://arxiv.org/abs/2605.06702
Authors: Siyuan Guo,Yali Du,Hechang Chen,Yi Chang,Jun Wang
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) have become a central foundation of modern artificial intelligence, yet their lifecycle remains constrained by a rigid separation between training and deployment, after which learning effectively ceases. This limitation contrasts with natural intelligence, which continually adapts through interaction with its environment. In this paper, we formalise deployment-time learning (DTL) as the third stage in the LLM lifecycle that enables LLM agents to improve from experience during deployment without modifying model parameters. We present CASCADE (CASe-based Continual Adaptation during DEployment), a general and principled framework that equips LLM agents with an explicit, evolving episodic memory. CASCADE formulates experience reuse as a contextual bandit problem, enabling principled exploration-exploitation trade-offs and establishing no-regret guarantees over long-term interactions. This design allows agents to accumulate, select, and refine task-relevant cases, transforming past experience into actionable knowledge. Across 16 diverse tasks spanning medical diagnosis, legal analysis, code generation, web search, tool use, and embodied interaction, CASCADE improves macro-averaged success rate by 20.9% over zero-shot prompting while consistently outperforming gradient-based and memory-based baselines. By reframing deployment as an adaptive learning process, this work establishes a foundation for continually improving AI systems.
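Framing case reuse as a bandit problem suggests a UCB-style rule for choosing which stored case to retrieve. A minimal, non-contextual UCB1 sketch; CASCADE's actual policy is contextual, so this simplification and all names in it are ours:

```python
import math

def ucb1_select(cases, counts, total_reward, t, c=1.0):
    """Pick the case maximizing mean reward plus an exploration bonus
    that shrinks as a case accumulates trials (the UCB1 rule)."""
    best_case, best_score = None, float("-inf")
    for case in cases:
        n = counts.get(case, 0)
        if n == 0:
            return case  # try every case at least once
        score = total_reward[case] / n + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best_case, best_score = case, score
    return best_case
```

The exploration bonus is what lets an agent occasionally revisit a rarely used case whose mean reward is still uncertain, which is the mechanism behind no-regret guarantees in this family of algorithms.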
[NLP-121] State Representation and Termination for Recursive Reasoning Systems
【Quick Read】: This paper addresses two design choices usually left implicit in recursive reasoning systems: how to represent the evolving reasoning state, and when to stop iterating. The key to the solution is an epistemic state graph that structurally encodes extracted claims, evidential relations, open questions, and confidence weights, making the reasoning state explicit. The paper further proposes the order-gap, the distance between the states reached by expand-then-consolidate versus consolidate-then-expand, as a stopping indicator: a small order-gap means the two orderings agree and further iteration is unlikely to help. The main result gives a necessary and sufficient condition for the linearised order-gap to be non-degenerate near the fixed point, establishing when the criterion is locally informative rather than algebraically vacuous and providing an actionable convergence check for recursive reasoning systems.
Link: https://arxiv.org/abs/2605.06690
Authors: Debashis Guha,Amritendu Mukherjee,Sanjay Kukreja,Tarun Kumar
Affiliations: S P Jain School of Global Management; Indian Statistical Institute; eClerx Services Ltd.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Recursive reasoning systems alternate between acquiring new evidence and refining an accumulated understanding. Two design choices are typically left implicit: how to represent the evolving reasoning state, and when to stop iterating. This paper addresses both. We represent the reasoning state as an epistemic state graph encoding extracted claims, evidential relations, open questions, and confidence weights. We define the order-gap as the distance between the states reached by expand-then-consolidate versus consolidate-then-expand; a small order-gap suggests that the two orderings agree and further iteration is unlikely to help. Our main result gives a necessary and sufficient condition for the linearised order-gap to be non-degenerate near the fixed point, showing when the criterion is informative rather than algebraically vacuous. This is a local condition, not a global convergence guarantee. We apply the framework to recursive reasoning systems and sketch its application to agent loops, tree-of-thought reasoning, theorem proving, and continual learning.
[NLP-122] Toeplitz MLP Mixers are Low Complexity Information-Rich Sequence Models
【速读】: 该论文旨在解决基于Transformer的大语言模型在训练和推理过程中因注意力机制导致的二次时间与空间复杂度问题(quadratic time and space computational complexity of attention)。其核心解决方案是提出Toeplitz MLP Mixer (TMM),一种类Transformer架构,通过将注意力机制替换为沿序列维度的三角掩码Toeplitz矩阵乘法,从而将训练阶段的时间复杂度降低至O(dnlogn)、空间复杂度降至O(dn),推理预填充阶段则保持O(dn)的时间与空间复杂度。关键创新在于利用Toeplitz结构实现高效线性变换,同时避免了传统子二次架构中复杂的输入调制或状态维护机制,从而在不牺牲性能的前提下显著提升训练效率和信息保留能力。
链接: https://arxiv.org/abs/2605.06683
作者: Benjamin L. Badger,Ethan Roland
机构: IBM(国际商业机器公司); AE Studio
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Transformer-based large language models are in some respects limited by the quadratic time and space computational complexity of attention. We introduce the Toeplitz MLP Mixer (TMM), a transformer-like architecture that swaps attention for triangular-masked Toeplitz matrix multiplication over the sequence dimension resulting in O(dn log n) time and O(dn) space complexity during training and O(dn) time and space at inference prefill. Despite the lack of sophisticated input modulation or state maintenance present in other sub-quadratic architectures, TMMs yield greater training efficiency in terms of loss achieved per compute and device memory. We demonstrate that TMMs are capable of retaining more input information resulting in improved copying ability, which we argue results from a lack of architectural biases. Consistent with higher input information retention, TMMs exhibit superior information retrieval and in-context learning benchmark accuracy compared to comparable architectures. We conclude with an analysis from the perspective of operator index theory and show that, counterintuitively, trained Toeplitz layers of causal non-invertible models are more likely to be invertible or nearly so than models that are actually invertible over their inputs.
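下三角 Toeplitz 矩阵与序列的乘积等价于因果卷积,可借助 FFT 在 O(dn log n) 时间内完成,这正是摘要所述复杂度的来源。以下为一个假设性的极简示意(张量形状与函数命名均为假设,非论文原始层实现):

```python
import numpy as np

def causal_toeplitz_mix(x, kernel):
    """对 (seq_len, d) 的输入 x 施加由 kernel (seq_len,) 定义的
    下三角 Toeplitz 矩阵乘法(即因果卷积),用 FFT 实现
    O(d n log n) 复杂度。零填充至 2n 以避免循环卷绕。"""
    n, d = x.shape
    L = 2 * n
    X = np.fft.rfft(x, n=L, axis=0)
    K = np.fft.rfft(kernel, n=L)
    y = np.fft.irfft(X * K[:, None], n=L, axis=0)[:n]
    return y
```

其结果与显式构造下三角 Toeplitz 矩阵 T(T[i, j] = kernel[i−j],i ≥ j)再做 T @ x 一致,但无需物化 n×n 矩阵。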
[NLP-123] LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本推理过程中因键值(Key-Value, KV)缓存内存随上下文长度线性增长而导致的计算瓶颈问题。现有KV缓存压缩方法受限于启发式策略:启发式预算分配依赖统计先验而非任务目标,造成资源错配;启发式选择则依赖查询-键交互耦合或静态归纳偏置(如注意力汇聚点),缺乏灵活性。其解决方案的关键在于提出LKV(Learned KV Eviction),将KV压缩建模为端到端可微分优化问题,通过LKV-H学习与任务目标对齐的全局缓存预算,以及通过LKV-T在不显式构建注意力矩阵的前提下推导内在KV重要性,从而绕过启发式代理,实现压缩与任务目标的严格对齐。实验证明,LKV在LongBench和RULER基准上均达到当前最优性能,尤其在仅保留15% KV缓存时仍能近乎无损地保持精度,且分析表明学习型预算分配是性能保真的主导因素。
链接: https://arxiv.org/abs/2605.06676
作者: Enshuai Zhou,Yifan Hao,Chao Wang,Rui Zhang,Di Huang,Jiaming Guo,Xing Hu,Zidong Du,Qi Guo,Yunji Chen
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Long-context inference in Large Language Models (LLMs) is bottlenecked by the linear growth of Key-Value (KV) cache memory. Existing KV cache compression paradigms are fundamentally limited by heuristics: heuristic budgeting relies on statistical priors rather than task objectives, causing resource misallocation, while heuristic selection relies on coupled query-key interactions or static inductive biases (e.g., attention sinks). To address this limitation, we introduce LKV (Learned KV Eviction), which formulates KV compression as an end-to-end differentiable optimization problem. LKV integrates LKV-H to learn task-optimized global budgets, and LKV-T to derive intrinsic KV importance without materializing attention matrices. This design bypasses heuristic proxies, strictly aligning compression with task objectives. Extensive evaluations demonstrate that LKV achieves state-of-the-art performance on both LongBench and RULER benchmarks at high compression rates. In particular, on LongBench, LKV achieves near-lossless performance with only 15% KV cache retention. Crucially, our analysis identifies learned budgeting as the dominant driver of fidelity, demonstrating that data-driven allocation is essential to overcome the limitations of hand-crafted heuristics.
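LKV 的贡献在于端到端学习头级预算(LKV-H)与 token 重要性(LKV-T);在两者给定之后,最终的淘汰步骤本身很简单,可示意如下(数据结构为假设,仅演示"按头级预算保留高分 token"这一步):

```python
import numpy as np

def evict_kv(scores, budgets):
    """按头级预算执行 KV 淘汰的示意:scores 为 (n_heads, n_tokens)
    的 token 重要性打分,budgets 为各注意力头保留的 token 数,
    每头保留得分最高的 budgets[h] 个 token。"""
    kept = []
    for h, b in enumerate(budgets):
        idx = np.argsort(scores[h])[::-1][:b]   # 该头得分最高的 b 个位置
        kept.append(sorted(idx.tolist()))
    return kept
```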
[NLP-124] RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段因键值缓存(Key-Value Cache, KV Cache)占用内存随序列长度线性增长而带来的主要内存瓶颈问题。现有量化方法通常对所有注意力头采用统一位宽,忽略了不同注意力头间的重要性差异。论文指出,若直接基于头重要性进行混合精度分配,反而可能因各量化器的失真曲线(distortion curve)参数不一致(如衰减率β在3.6至5.3之间变化),导致分配顺序颠倒,性能劣于均匀量化——这种现象称为“失真模型错配”(distortion model mismatch)。解决方案的关键在于提出RateQuant:首先利用小规模校准集为每个量化器拟合其专属的失真模型,再通过率失真理论中的逆向水填法(reverse waterfilling)解析求解最优比特分配策略,从而实现高效且稳定的混合精度KV缓存量化。该方法在Qwen3-8B模型上以平均2.5比特/元素显著降低困惑度(PPL),同时校准耗时仅1.6秒且推理无额外开销。
链接: https://arxiv.org/abs/2605.06675
作者: Fei Zuo,Zikang Zhou,Hao Cong,Xiaoyan Xi,Ho Fai Leung
机构: BA TechWorks (BMW Group); National University of Singapore; Tsinghua University
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)
备注: 18 pages, 7 figures, 5 tables
Abstract:Large language models cache all previously computed key-value (KV) pairs during generation, and this KV cache grows linearly with sequence length, making it a primary memory bottleneck for serving. Quantizing the KV cache to fewer bits reduces this cost, yet all current quantizers assign the same bit-width to every attention head, ignoring the large variation in head importance. A natural idea is to allocate more bits to important heads and fewer to the rest. We show, however, that such mixed-precision allocation has a hidden pitfall: each quantizer follows a different distortion curve D(b) = α·β^(−b), and the decay rate β varies from 3.6 to 5.3 across quantizer designs. Applying one quantizer’s distortion model to another inverts the allocation order and makes performance worse than uniform quantization. We call this failure mode distortion model mismatch and propose RateQuant to resolve it. RateQuant fits a per-quantizer distortion model from a small calibration set, then solves the resulting bit-allocation problem in closed form via reverse waterfilling from rate-distortion theory. On Qwen3-8B at 2.5 average bits, calibrated RateQuant reduces KIVI’s perplexity from 49.3 to 14.9 (70% reduction) and improves QuaRot by 6.6 PPL. The entire calibration takes 1.6 s on a single GPU and adds zero overhead at inference time.
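在失真模型 D(b) = α·β^(−b) 下,若同一量化器内各头共享衰减率 β(与摘要"β 因量化器设计而异"一致),则均值预算约束下的最优比特分配有闭式解:b_i = B + log_β α_i − mean_j log_β α_j,此时各头失真相等,即率失真理论中的逆向水填结果。以下为这一简化情形的示意实现(未含论文中可能需要的取整与裁剪步骤):

```python
import numpy as np

def allocate_bits(alpha, beta, avg_bits):
    """闭式求解 min sum_i alpha_i * beta**(-b_i),
    s.t. mean(b_i) == avg_bits,假设共享衰减率 beta。
    解为 b_i = avg_bits + log_beta(alpha_i) - mean(log_beta(alpha))。"""
    log_alpha = np.log(alpha) / np.log(beta)   # log_beta(alpha_i)
    return avg_bits + log_alpha - log_alpha.mean()
```

该分配在满足同一平均比特预算的前提下,总失真不劣于均匀分配。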
[NLP-125] Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在元认知质量评估中因聚合指标掩盖领域内差异而导致的评估失真问题。其关键解决方案在于通过细粒度的模型-领域单元分析(model-domain cell),利用口语化置信度评分(verbalized confidence, 0–100)计算Type-2 AUROC,从而揭示不同基准测试领域(如应用/专业性知识、形式推理和自然科学等)间稳定的监控性能差异。研究发现,尽管整体元认知表现可能达到显著水平,但各模型在具体领域上的监控能力存在系统性分化,且这种变异具有可重复性和跨模型家族的一致性模式,表明以领域为单位进行筛选是部署前优化模型适配性的必要步骤。
链接: https://arxiv.org/abs/2605.06673
作者: Jon-Paul Cacioli
机构: Google(谷歌); Anthropic; OpenAI; Meta; Stability.AI; Character.ai; Claude; Qwen; Zhipu; DeepSeek; Gemma; GLM-5
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 7 figures, 1 supplementary table. Code and data: this https URL
Abstract:Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall’s W = .164). A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms the six-domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within-family profile-shape clustering is significant for Anthropic, Google-Gemini, and Qwen (permutation p < .0001) but not DeepSeek, Google-Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B. Three models classified Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe-format specificity. Bootstrap 95% CIs on 198 cells have median width .199. Split-half aggregate stability r = .893; profile-level split-half is weaker (grand median r = .184). These results show stable benchmark-domain variation obscured by aggregate metrics, and support benchmark-stage domain screening as a step before deployment in specific application areas.
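Type-2 AUROC 即用口语化置信度区分"答对"与"答错"的能力:随机抽取一个答对项与一个答错项,答对项置信度更高的概率(并列计半)。以下为朴素 O(n²) 实现示意:

```python
def type2_auroc(confidences, correct):
    """Type-2 AUROC:正例为答对项、负例为答错项,
    统计正例置信度高于负例的配对比例(相等计 0.5)。"""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```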
[NLP-126] More Thinking More Bias: Length-Driven Position Bias in Reasoning Models
【速读】: 该论文旨在解决生成式 AI(Generative AI)在多项选择题(Multiple-Choice Question, MCQ)评估中因推理轨迹长度引发的位置偏差(Position Bias)问题。现有假设认为,链式思维(Chain-of-Thought, CoT)推理和经过推理微调的模型(如DeepSeek-R1)能够通过深度思考减少浅层启发式偏差,但本文揭示了一个相反现象:在所有具备推理能力的模型中,单题位置偏差分数(Position Bias Score, PBS)随推理路径长度单调上升。关键解决方案在于提出一套诊断工具包(包括PBS、承诺转变点、有效切换率及截断探针),并通过截断干预提供因果证据——即从推理轨迹后段继续生成时,模型更可能转向偏好位置选项(如R1-Qwen-7B模型中从16%升至32%)。研究进一步表明,高精度模型(如671B参数版本)虽整体PBS趋近于零,但长轨迹仍存在显著偏差,说明准确率仅抑制了偏差表达而非消除其机制,因此建议在MCQ评估流程中不应默认推理模型具有顺序鲁棒性(order-robustness)。
链接: https://arxiv.org/abs/2605.06672
作者: Xiao Wang
机构: FujianAI42@163.com
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Chain-of-thought (CoT) reasoning and reasoning-tuned models such as DeepSeek-R1 are commonly assumed to reduce shallow heuristic biases by thinking carefully. We test this on position bias in multiple-choice QA and find a different story: within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory. Across thirteen reasoning-mode configurations (two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B) on MMLU, ARC-Challenge, and GPQA, twelve show a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations show monotonically increasing PBS across length quartiles. A truncation intervention provides causal evidence: continuations resumed from later points in the trajectory are increasingly likely to shift toward position-preferred options (16% to 32% for R1-Qwen-7B across absolute-position buckets). At 671B, aggregate PBS collapses to 0.019, but the length effect still manifests in the longest quartile (PBS = 0.071), suggesting that accuracy gates the expression of length-driven bias rather than eliminating the underlying mechanism. We additionally find that direct-answer position bias is a distinct phenomenon with a different footprint (strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct, and uncorrelated with trajectory length): CoT reasoning replaces this baseline bias with length-accumulated bias. Our results argue that reasoning-capable models should not be treated as order-robust by default in MCQ evaluation pipelines, and offer a diagnostic toolkit (PBS, commitment change point, effective switching, truncation probes) for auditing position bias in reasoning models.
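摘要未给出 PBS 的具体公式;作为一种假设性的替代定义(并非论文原始指标),可以在同一题目的多种选项排列下,度量模型最常选择的"答案槽位"频率超出均匀基线的幅度:

```python
from collections import Counter

def position_bias_surrogate(chosen_positions, n_options):
    """单题位置偏差的一种假设性替代度量(非论文 PBS 原始公式):
    chosen_positions 为同一题在各选项排列下模型所选的槽位编号,
    返回最常被选槽位的频率减去均匀基线 1/n_options。"""
    counts = Counter(chosen_positions)
    top_rate = max(counts.values()) / len(chosen_positions)
    return top_rate - 1.0 / n_options
```

槽位完全随机时该值趋于 0,模型固定偏好某一位置时趋于 1 − 1/n_options。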
[NLP-127] Reliable Chain-of-Thought via Prefix Consistency
【速读】: 该论文旨在解决大语言模型在推理任务中通过自一致性(self-consistency)方法提升准确率时,如何更高效地利用生成的思维链(Chain-of-Thought, CoT)轨迹来优化答案聚合的问题。传统方法依赖多数投票(majority voting, MV),但其计算开销较大,且未充分利用每条轨迹的内在可靠性信息。解决方案的关键在于提出一种新的可靠性信号——前缀一致性(prefix consistency),即通过截断并重新生成CoT的后半部分,观察原始答案是否稳定重现:正确答案对应的轨迹具有更高的重现概率。该信号无需访问token对数概率或额外的自我评分提示,仅基于答案重复频率即可为候选答案赋权,从而显著减少达到与标准MV相当准确率所需的token数量(最多节省21倍,中位数4.6倍)。
链接: https://arxiv.org/abs/2605.07654
作者: Naoto Iwase,Yuki Ichihara,Mohammad Atif Quamar,Junpei Komiyama
机构: Nagoya University (名古屋大学); Nara Institute of Science and Technology (奈良先端科学技术大学院大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); RIKEN AIP (理化学研究所先进人工智能中心)
类目: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: See our project page at this https URL
Abstract:Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal, prefix consistency, that weights each candidate answer by how often it reappears under regeneration. It requires no access to token log-probabilities or self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches Standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x). Our code is available at this https URL.
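前缀一致性加权投票的核心流程可示意如下(`regenerate` 等接口命名与截断位置、重生成次数均为假设,非论文原始实现):

```python
from collections import defaultdict

def prefix_consistency_vote(traces, regenerate, n_redo=3):
    """前缀一致性投票示意:traces 为 (answer, cot) 列表,
    regenerate(prefix) 表示从截断的 CoT 前缀继续生成并返回最终答案。
    每个候选答案按其在重新生成中被复现的频率加权,取权重最高者。"""
    weights = defaultdict(float)
    for answer, cot in traces:
        prefix = cot[: len(cot) // 2]                 # 中途截断 CoT
        redo = [regenerate(prefix) for _ in range(n_redo)]
        weights[answer] += sum(a == answer for a in redo) / n_redo
    return max(weights, key=weights.get)
```

与多数投票相比,该机制仅依赖答案复现频率,无需 token 对数概率或自评提示。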
信息检索
[IR-0] FAVOR: Efficient Filter-Agnostic Vector ANNS Based on Selectivity-Aware Exclusion Distances
【速读】:该论文旨在解决现代检索系统中混合查询(即同时包含向量相似性搜索与属性过滤)的性能瓶颈问题,尤其在低选择性(low-selectivity)场景下,现有基于HNSW的内联过滤方法难以兼顾高吞吐量、搜索效率、过滤通用性与索引连通性。其解决方案的关键在于提出FAVOR框架,通过三项创新实现:(1) 构建统一架构,将选择性估计与过滤后的近似最近邻搜索(ANNS)执行一体化,形成对混合查询的协同优化;(2) 设计基于HNSW的内联过滤算法,引入排除距离机制动态重塑向量距离分布,使非目标向量远离查询点、有效候选向量靠近查询点,从而提升搜索效率而不牺牲过滤通用性或图结构连通性;(3) 提出基于选择性的搜索选择器,根据查询选择性动态路由至预过滤暴力搜索(低选择性时)或优化的HNSW搜索(其他情况),确保跨不同选择性水平下的稳定性能表现。
链接: https://arxiv.org/abs/2605.07770
作者: Junjie Song,Yu Liu,Guoyu Hu,Zhongle Xie,Ming Yang,Beng Chin Ooi,Ke Zhou
机构: Huazhong University of Science and Technology (华中科技大学); National University of Singapore (新加坡国立大学); Zhejiang University (浙江大学); Wuhan Technical University (武汉技术大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Modern retrieval systems increasingly require integrating approximate nearest neighbor search (ANNS) with complex attribute filtering to handle hybrid queries in applications such as recommendation systems and retrieval-augmented generation (RAG). While HNSW-based inline-filtering methods show promise, existing approaches struggle to deliver high throughput under low-selectivity scenarios while balancing search efficiency, filtering generality, and index connectivity. To address these challenges, we propose FAVOR, an efficient filter-agnostic vector ANNS that supports arbitrary filtering conditions while maintaining stable performance across varying selectivity levels. FAVOR introduces three novel features: (1) an integrated architecture that unifies selectivity estimation and filtered ANNS execution, providing a cohesive solution for hybrid vector-attribute queries; (2) a HNSW-based inline-filtering algorithm that introduces an exclusion distance mechanism to dynamically reshape the vector distance distribution, pushing non-target vectors away from the query while promoting valid candidates toward the query, thus improving search efficiency without compromising generality or graph connectivity; and (3) a selectivity-driven search selector that estimates query selectivity and dynamically routes queries between a pre-filtering brute-force algorithm for low-selectivity cases and an optimized HNSW-based search algorithm for other scenarios, ensuring consistent performance. Extensive experiments on real-world datasets demonstrate that FAVOR achieves a 1.3–5× higher QPS at Recall@10 = 95% compared to state-of-the-art methods for arbitrary filtering conditions, while maintaining competitive performance even against tailored solutions in some filtering conditions.
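摘要中的两个核心机制("排除距离"与"选择性驱动的搜索选择器")可用如下极简示意代码表达(惩罚形式与阈值均为假设,非论文原始公式):

```python
def effective_distance(d, passes_filter, penalty=1e3):
    """排除距离的一种示意:未通过属性过滤的向量被施加大的
    附加惩罚,从而在候选排序中远离查询点,但仍可参与图遍历
    以保持连通性。"""
    return d if passes_filter else d + penalty

def route_query(selectivity, low_threshold=0.01):
    """选择性驱动的搜索选择器示意:低选择性查询走预过滤
    暴力搜索,其余走内联过滤的 HNSW 搜索(阈值为示意值)。"""
    if selectivity < low_threshold:
        return "prefilter_bruteforce"
    return "hnsw_inline_filter"
```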
[IR-1] TRACE: Tourism Recommendation with Accountable Citation Evidence
【速读】:该论文旨在解决旅游场景下对话式推荐系统(Conversational Recommender Systems, CRS)在可信度、可验证性和适应性方面的评估缺口问题。现有基准主要依赖单一的召回率(Recall@k)指标,缺乏对多轮对话中基于真实用户评论证据的推荐解释能力以及推荐被拒绝后的恢复机制。解决方案的关键在于提出TRACE数据集,其包含10,000条多轮旅游推荐对话,每条对话均标注了来自Yelp评论的具体段落引用(review-span citations),并明确标记了拒绝回合(rejection turns),从而支持对准确率(Accuracy)、可依据性(Grounding)和恢复能力(Recovery)三维度的综合评估。通过14种检索、规划与大语言模型(LLM)基线的对比实验,发现当前方法存在“三能力差距”(Three-Competency Gap),揭示出可信旅游推荐需同时满足推荐正确性、证据可验证性和动态修复能力,而非单一指标驱动的优化目标。
链接: https://arxiv.org/abs/2605.07677
作者: Zixu Zhao,Sijin Wang,Yu Hou,Yuanyuan Xu,Yufan Sheng,Xike Xie,Wenjie Zhang,Won-Yong Shin,Xin Cao
机构: UNSW Sydney (新南威尔士大学); University of Adelaide (阿德莱德大学); Yonsei University (延世大学); USTC (中国科学技术大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Tourism is a high-stakes setting for conversational recommender systems (CRS): a plausible-sounding suggestion can waste real money and trip time once a traveler acts on it. Existing CRS benchmarks primarily evaluate systems with a single Recall@k score over entity mentions, and tourism-specific resources add spatial or knowledge-graph context, yet none of them couple multi-turn recommendation with verbatim review-span evidence and rejection recovery. This leaves an evaluation gap for tourism recommendation that is simultaneously trustworthy, verifiable, and adaptive: recommend the right point of interest (POI) for multi-aspect preferences (such as cuisine, price, atmosphere, walking distance), justify each suggestion with verifiable evidence from prior visitors so the traveler can act without trial and error, and recover when the first recommendation is rejected mid-dialogue. We introduce TRACE, where each item is a multi-turn tourism recommendation dialogue with review-span citations and explicit rejection turns: 10,000 dialogues over 2,400 Yelp POIs and 34,208 reviews across eight U.S. cities, paired with 14 retrieval, planning, and LLM baselines, along with 25 metrics organized under Accuracy, Grounding, and Recovery. Across these baselines, TRACE reveals the Three-Competency Gap: LLM Zero-Shot leads in closed-set Recall@1 and rejection recovery but cites less densely than retrievers; non-LLM retrievers achieve surface-verbatim grounding but with low accuracy; Multi-Review Synthesis fails at recovery. The Grounding Score agrees with human citation precision (Spearman ρ = +0.80, p < 10^-20), and paired t-tests reproduce the per-baseline ranking (p < 0.01 on the dominant contrasts). TRACE reframes accountable tourism recommendation as a joint target (right POI, verifiable evidence, adaptive repair) rather than a single-axis leaderboard.
[IR-2] LARAG: Link-Aware Retrieval Strategy for RAG Systems in Hyperlinked Technical Documentation
【速读】:该论文旨在解决标准检索增强生成(Retrieval-Augmented Generation, RAG)方法在处理具有天然结构的文档(如技术手册)时,忽视其超链接拓扑结构的问题,导致生成结果缺乏对文档内部关联性的有效利用。解决方案的关键在于提出一种轻量级的链路感知检索策略——LARAG(Link-Aware RAG),通过将HTML文档中已存在的超链接关系编码为chunk表示的元数据,并据此执行一种类图检索机制,从而在不显式构建图结构或进行复杂推理的前提下,实现对局部相关内容的更精准检索。实验表明,LARAG在保持更低的检索和生成开销的同时,显著提升了答案质量(以BERTScore F1衡量),验证了直接利用现有超链接拓扑即可实现高效且更具事实一致性的RAG流程。
链接: https://arxiv.org/abs/2605.07517
作者: Giorgia Bolognesi,Claudio Estatico,Ulderico Fugacci,Isabella Mastroianni,Claudio Muselli,Luca Oneto
机构: Rulex s.r.l., Genova, Italy; Department of Mathematics (DIMA), University of Genoa, Italy; Institute of Applied Mathematics and Information Technologies “Enrico Magenes” (IMATI), National Research Council, Italy; Department of Computer Science, Bioengineering, Robotics, and Systems Engineering (DIBRIS), University of Genoa, Italy
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) enhances the factual grounding of Large Language Models by conditioning their outputs on external documents. However, standard embedding-based retrievers treat naturally structured corpora, such as technical manuals, as flat collections of passages, thereby overlooking the hyperlink topology that users rely on when navigating such content. We introduce LARAG (Link-Aware RAG): a lightweight, link-aware retrieval strategy that leverages the author-defined hyperlink structure already present in HTML documentation, encoding hyperlink relations as metadata in the chunk representations and exploiting them to perform a form of graph-like retrieval of locally relevant content. In a benchmark of twenty expert-designed queries over Rulex Platform technical documentation and four prompting strategies, LARAG consistently improves answer quality, achieving the highest BERTScore F1, while retrieving fewer chunks and generating fewer tokens than a baseline RAG architecture used for comparison. These results show that directly leveraging the existing hyperlink topology of technical documentation, even without explicit graph construction or inference, enables an implicit form of graph-like retrieval that yields a more faithful and efficient RAG pipeline, providing better grounding at lower cost.
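"将超链接编码为 chunk 元数据并据此做类图检索"的思路可示意如下(数据结构与函数命名均为假设,非 LARAG 原始算法):

```python
def link_expand(retrieved_ids, chunks, hops=1):
    """链路感知扩展示意:每个 chunk 的元数据中记录其 HTML
    超链接指向的 chunk id(chunks[cid]["links"]);在向量召回
    得到初始候选 retrieved_ids 后,沿链接再扩展 hops 跳。"""
    frontier = set(retrieved_ids)
    for _ in range(hops):
        frontier |= {t for cid in frontier for t in chunks[cid]["links"]}
    return frontier
```

这样无需显式构图或图推理,仅凭既有超链接拓扑即可把局部相关页面纳入候选集。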
[IR-3] InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search
【速读】:该论文旨在解决现有多模态智能体搜索基准在评估过程中对视觉证据(visual evidence)使用方式的局限性问题,即当前基准通常将视觉证据仅作为输入限制或最终答案端点,而未将其纳入搜索轨迹中进行动态交互与迭代利用。为应对这一挑战,作者提出了InterLV-Search基准,其关键创新在于构建了一个支持“语言-视觉”交错式智能体搜索(interleaved language-vision agentic search)的评测体系,其中文本和视觉证据被反复用于条件化后续搜索步骤,并包含三个层级:主动视觉证据获取(Level 1)、受控离线交错多模态搜索(Level 2)以及开放网络交错多模态搜索(Level 3)。此外,该方案还引入了InterLV-Agent工具以实现标准化工具调用、轨迹记录与评估,从而系统性地揭示当前多模态智能体在视觉证据寻求、搜索控制及跨模态证据融合方面的显著不足,推动该领域向更真实复杂的交互场景发展。
链接: https://arxiv.org/abs/2605.07510
作者: Bohan Hou,Jiuning Gu,Jiayan Guo,Ronghao Dang,Sicong Leng,Xin Li,Xuemeng Song,Jianfei Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at this https URL
[IR-4] TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature
【速读】:该论文旨在解决学术文献中结构化知识提取的自动化难题,特别是在传统中医等专业领域中,现有工具往往依赖复杂的基础设施、编程技能或特定领域的微调模型,导致研究人员难以高效获取所需信息。解决方案的关键在于提出了一种基于浏览器的零安装平台TCMIIES,其核心创新是采用一种新颖的schema-guided prompting框架并结合自动系统提示生成机制,使用户可通过图形界面自定义抽取模板而无需编程;同时,该平台具备纯前端架构以保障数据隐私、支持五大主流大语言模型(Large Language Model, LLM)API、实现并发批量处理与自动重试,并针对中文数据库(如CNKI和万方)提供智能字段映射功能,从而显著降低使用门槛并提升提取准确性与实用性。
链接: https://arxiv.org/abs/2605.07507
作者: Hanqing Zhao
机构: Hebei University, College of Traditional Chinese Medicine (河北大学中医药学院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:The exponential growth of academic publications has created an urgent need for automated tools capable of extracting structured knowledge from unstructured scientific texts. While large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and information extraction, existing solutions often require specialized infrastructure, programming expertise, or fine-tuned domain-specific models that create barriers for researchers in specialized fields. This paper presents TCMIIES, a browser-based, zero-installation platform that leverages commercial LLM APIs to perform structured information extraction from academic literature. The system employs a novel schema-guided prompting framework with automatic system prompt generation, enabling researchers to define custom extraction schemas through an intuitive graphical interface without any programming. TCMIIES features a pure front-end architecture that ensures data privacy by processing all information locally in the browser, supports five major LLM providers, implements concurrent batch processing with automatic retry mechanisms, and provides intelligent field mapping for Chinese academic databases including CNKI and Wanfang. We demonstrate the system’s effectiveness through comprehensive evaluation across multiple extraction scenarios in Traditional Chinese Medicine research, achieving structured output compliance rates exceeding 94% and information extraction accuracy comparable to domain-expert annotation. The system represents a practical, accessible solution that bridges the gap between advanced LLM capabilities and domain-specific academic information extraction needs, particularly for researchers in specialized fields who require flexible, privacy-preserving, and cost-effective extraction tools.
[IR-5] A Comprehensive Survey on Agent Skills: Taxonomy, Techniques and Applications
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体在开放世界部署中因依赖从头推理和低级工具调用而导致的效率低下、易出错及难以维护的问题。其核心解决方案是引入“代理技能”(agent skills)这一概念,即一类可复用的过程性构件,能够根据任务特定约束协调工具、记忆与运行时上下文。通过将智能体的高阶推理与规划职责与其操作层——技能——分离,论文提出以技能为中心的架构,从而提升系统在可扩展性、鲁棒性和可维护性方面的表现。关键在于将技能视为独立于具体任务的模块化单元,并围绕其生命周期(表示、获取、检索与演化)构建系统化方法论,为下一代高效、可靠、可组合的智能体系统提供基础支撑。
链接: https://arxiv.org/abs/2605.07358
作者: Yingli Zhou,Wang Shu,Yaodong Su,Wenchuan Du,Yixiang Fang,Xuemin Lin
机构: 未知
类目: Information Retrieval (cs.IR)
备注:
Abstract:Large language model (LLM)-based agents that reason, plan, and act through tools, memory, and structured interaction are emerging as a promising paradigm for automating complex workflows. Recent systems such as OpenClaw and Claude Code exemplify a broader shift from passive response generation to action-oriented task execution. Yet as agents move toward open-ended, real-world deployment, relying on from-scratch reasoning and low-level tool calls for every task becomes increasingly inefficient, error-prone, and hard to maintain. This survey examines this challenge through the lens of agent skills, which we define as reusable procedural artifacts that coordinate tools, memory, and runtime context under task-specific constraints. Under this view, agents and skills play complementary roles: agents handle high-level reasoning and planning, while skills form the operational layer that enables reliable, reusable, and composable execution. Skills are therefore central to the scalability, robustness, and maintainability of modern agent systems. We organize the literature around four stages of the agent skill lifecycle – representation, acquisition, retrieval, and evolution – and review representative methods, ecosystem resources, and application settings across each stage. We conclude by discussing open challenges in quality control, interoperability, safe updating, and long-term capability management. All related resources, including research papers, open-source data, and projects, are collected for the community in this https URL.
[IR-6] DCGL: Dual-Channel Graph Learning with Large Language Models for Knowledge-Aware Recommendation SIGIR2026
【速读】:该论文旨在解决基于知识图谱(Knowledge Graph, KG)与大语言模型(Large Language Model, LLM)的推荐系统中存在的三大挑战:一是对KG中显式链接之外的隐式语义关系建模不足;二是ID嵌入与LLM嵌入的单通道融合导致信号干扰和表征模糊;三是推荐策略未充分考虑用户-物品交互频率的变化。其解决方案的关键在于提出双通道图学习(Dual-Channel Graph Learning, DCGL)框架,包含三项核心创新:首先,构建结构解耦的双通道架构,将语义信息与用户行为模式分离以避免早期干扰;其次,设计多层级对比学习机制,通过视图内对比增强抗KG噪声鲁棒性,并借助视图间对齐弥合通道间的语义鸿沟;最后,引入动态融合机制,依据交互频率自适应平衡语义泛化与行为特异性,从而缓解级联限制。
链接: https://arxiv.org/abs/2605.07314
作者: Xinchi Zou,Tongzhenzhi Su,Jianjun Li,Yuan Fu,Chang Liu,Zhiying Deng,Zhiwei Shen
机构: Huazhong University of Science and Technology (华中科技大学); Central China Normal University (华中师范大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted by SIGIR 2026
Abstract:Knowledge Graphs (KGs) have proven highly effective for recommendation systems by capturing latent item relationships, while recent integration of Large Language Models (LLMs) has further enhanced semantic understanding and addressed knowledge sparsity issues. Nevertheless, current KG-and-LLM-based methods still face three main limitations: 1) inadequate modeling of implicit semantic relationships beyond explicit KG links; 2) suboptimal single-channel fusion of ID and LLM embeddings, which often leads to signal interference and blurred representations; and 3) insufficient consideration of user-item interaction frequency variations in recommendation strategies. To address these challenges, we propose the Dual-Channel Graph Learning (DCGL) framework, featuring three key innovations: 1) a dual-channel architecture that structurally decouples rich semantic information from user behavioral patterns, preventing early interference; 2) a multi-level contrastive learning mechanism that enhances robustness against KG noise through intra-view contrasts and bridges semantic gaps between channels via inter-view alignment; and 3) a dynamic fusion mechanism that adaptively balances semantic generalization and behavioral specificity based on interaction frequency, resolving the cascading limitation. Extensive experiments on four real-world datasets show that DCGL consistently outperforms state-of-the-art methods, yielding substantial improvements in sparse scenarios while maintaining precision for active users. Our code is available at this https URL.
[IR-7] PRISM: Refracting the Entangled User Behavior Space for E-Commerce Search
【速读】:该论文旨在解决电子商务搜索系统中用户行为建模的鲁棒性问题,即传统方法将用户偏好(user preference)与物品相关性(item relevance)视为独立且稳定的信号,但在实际场景中,二者受曝光机制、反馈循环和语义匹配共同影响,导致行为信号纠缠且动态漂移,从而引发混淆效应和语义错位,限制了下游排序模型的性能。解决方案的关键在于提出PRISM框架,其核心创新是显式建模用户偏好与物品相关性之间的交互关系:通过引入偏好修正模块,在相关性感知约束下迭代优化用户偏好以增强对行为混淆的鲁棒性;利用大语言模型(LLM)驱动的语义锚定机制,基于正负原型校准相关性表示以保证语义一致性;并通过偏好条件证据路由模块自适应聚合多源行为信号,实现上下文感知且偏好对齐的相关性估计。
链接: https://arxiv.org/abs/2605.07296
作者: Haoqian Zhang,Ziyuan Yang,Yi Zhang
机构: Sichuan University (四川大学); Nanyang Technological University (南洋理工大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:E-commerce search systems rely on modeling user behavior to estimate item relevance and user preference, which are typically assumed to be stable and independently learnable signals. However, in practice, user interactions are jointly shaped by exposure mechanisms, feedback loops, and semantic matching, leading to entangled and dynamically drifting behavioral signals. As a result, both preference estimation and relevance modeling suffer from confounding effects and semantic misalignment, which limits the robustness of downstream ranking models. To address this issue, we propose PRISM, a Preference-Relevance Interaction Semantic Modeling framework for e-commerce search behavior prediction. PRISM explicitly models the interaction between user preference and item relevance rather than treating them as independent components. Specifically, it introduces a preference rectification module to iteratively refine user preference under relevance-aware constraints, improving robustness against behavioral confounding. To ensure semantic consistency, we further incorporate a large language model (LLM)-driven semantic anchoring mechanism that leverages positive and negative prototypes to calibrate relevance representations. Finally, a preference-conditioned evidence routing module adaptively aggregates multi-source behavioral signals, enabling context-aware and preference-aligned relevance estimation. Extensive experiments on two public e-commerce benchmarks demonstrate that PRISM consistently outperforms strong baselines, validating the effectiveness of explicitly modeling preference-relevance interaction for robust and semantically grounded search behavior modeling.
[IR-8] MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol
Summary: This paper addresses a blind spot in multilingual information retrieval evaluation: existing protocols reward cross-lingual semantic relevance while ignoring users' preference for results in the query language. That preference matters in practice, especially in Retrieval-Augmented Generation systems, where query-passage language mismatch complicates downstream readability and answer verification. The key idea of MLAIRE (Multilingual Language-Aware Information Retrieval Evaluation) is to build controlled pools with parallel passages across languages, disentangling semantic retrieval accuracy from query-language preference, and to introduce language-aware metrics such as the Language Preference Rate (LPR) and Lang-nDCG together with a 4-way decomposition that separates semantic failures from language-preference failures.
Link: https://arxiv.org/abs/2605.07249
Authors: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim
Affiliations: Korea University
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Multilingual Information Retrieval is increasingly important in real-world search settings, where users issue queries over mixed-language corpora. Existing evaluations mainly reward language-agnostic semantic relevance, treating relevant passages equally regardless of language. Yet retrieval utility also depends on the language of the retrieved passages: users may prefer results they can read and verify in the query language, and query–passage language mismatch can complicate downstream grounding and answer verification in Retrieval-Augmented Generation systems. To evaluate this language-aware dimension, we introduce MLAIRE, a Multilingual Language-Aware Information Retrieval Evaluation protocol that disentangles cross-lingual semantic retrieval from query-language preference. MLAIRE constructs controlled pools with parallel passages across languages, enabling measurement of semantic retrieval accuracy and query-language preference when equivalent translations are available. We propose language-aware metrics, including Language Preference Rate (LPR) and Lang-nDCG, together with a 4-way decomposition separating semantic and query-language preference failures. Evaluating 31 dense, sparse, and late-interaction retrievers, we show that standard metrics obscure distinct behaviors: semantically strong retrievers may return correct content in a non-query language, while retrievers with stronger query-language preference may retrieve less semantically relevant passages.
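The Language Preference Rate described above can be sketched in a few lines. This is an illustrative reconstruction based only on the abstract's wording (fraction of top-k results in the query language); the function name and signature are our own, not MLAIRE's API.

```python
# Hypothetical sketch of the LPR idea: the share of the top-k retrieved
# passages whose language matches the query language.

def language_preference_rate(query_lang, retrieved_langs, k=10):
    """Fraction of the top-k results written in the query's language."""
    top_k = retrieved_langs[:k]
    if not top_k:
        return 0.0
    return sum(1 for lang in top_k if lang == query_lang) / len(top_k)

# An English query whose top-5 results are mostly non-English scores low.
langs = ["en", "de", "de", "en", "fr"]
print(language_preference_rate("en", langs, k=5))  # 0.4
```

A semantically strong retriever can still score poorly here if it returns correct content in a non-query language, which is exactly the behavior the protocol is designed to surface.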
[IR-9] DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models
Summary: This paper tackles the inefficiency of autoregressive language models used directly as retrievers: multi-representative-token strategies are slow under autoregressive decoding, since tokens must be generated sequentially, and prior multi-token variants did not reliably improve over single-token decoding. The core of DiffRetriever, a multi-representative-token retriever built on diffusion language models, is to append K masked positions to the prompt and read all K tokens in a single bidirectional forward pass, bypassing the sequential-generation bottleneck. Experiments show substantial gains over single-token and autoregressive multi-token baselines across every diffusion backbone tested, at no extra latency, making it among the strongest retrievers on BEIR-7.
Link: https://arxiv.org/abs/2605.07210
Authors: Shuai Wang, Yin Yu, Shengyao Zhuang, Bevan Koopman, Guido Zuccon
Affiliations: The University of Queensland; CSIRO
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:PromptReps showed that an autoregressive language model can be used directly as a retriever by prompting it to generate dense and sparse representations of a query or passage. Extending this to multiple representatives is inefficient for autoregressive models, since tokens must be generated sequentially, and prior multi-token variants did not reliably improve over single-token decoding. We show that the bottleneck is sequential generation, not the multi-token idea itself. DiffRetriever is a representative-token retriever for diffusion language models: it appends K masked positions to the prompt and reads all K in a single bidirectional forward pass. Across in-domain and out-of-domain evaluation, multi-token DiffRetriever substantially improves over single-token on every diffusion backbone we test, while autoregressive multi-token is flat or negative and pays a latency cost that scales with K where diffusion does not. After supervised fine-tuning, DiffRetriever on Dream is the strongest BEIR-7 retriever in our comparison, ahead of PromptReps, the encoder-style DiffEmbed baseline on the same diffusion backbones, and the contrastively fine-tuned single-vector RepLLaMA. A per-query oracle on the frozen base model exceeds contrastive fine-tuning at the same fixed budget, pointing to adaptive budget selection as future work. Code is available at this https URL.
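Once the K representative vectors are in hand, query-passage scoring with multiple representatives can look like the sketch below. This is not DiffRetriever's actual scoring function (the paper does not specify it here); it is a MaxSim-style aggregation, a common multi-vector scheme, shown only to illustrate what "multiple representatives per text" buys over a single vector. DiffRetriever obtains the K vectors from masked positions in one diffusion forward pass; here we simply assume the vectors are given.

```python
# Illustrative multi-representative scoring under stated assumptions:
# each text is a list of K vectors; a query vector is credited with its
# best-matching document vector, then contributions are summed.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def multi_rep_score(query_vecs, doc_vecs):
    """Sum over query vectors of the best-matching doc vector (MaxSim-style)."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

q = [[1.0, 0.0], [0.0, 1.0]]        # K=2 query representatives
doc_a = [[1.0, 0.0], [0.5, 0.5]]    # aligned with the query
doc_b = [[-1.0, 0.0], [0.0, -1.0]]  # opposed to the query
assert multi_rep_score(q, doc_a) > multi_rep_score(q, doc_b)
```

The latency argument in the abstract is orthogonal to the scoring: an autoregressive model must decode the K representatives one by one, while a diffusion model fills all K masked slots in one pass.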
[IR-10] Topic Is Not Agenda: A Citation-Community Audit of Text Embeddings
Summary: This paper audits whether cosine-similarity retrieval over text embeddings actually reflects research-topic consistency in scientific literature. State-of-the-art embeddings (Gemini, Qwen3, SPECTER2) do reasonably at the sub-field level L1 (45-52% of top-10 neighbors in the same sub-field) but fail at the finer research-agenda level L2: only 15-21% of top-10 results share the query paper's agenda, meaning roughly 8 of every 10 retrieved papers are off-agenda. The key finding is that single-vector embeddings miss deep agenda-level relatedness that the citation graph carries: a deliberately simple citation-count rerank reaches 57.7-59.6% top-1 agenda match, well above the best cosine retriever, providing concrete evidence and direction for improving retrieval-augmented generation (RAG) over scientific literature.
Link: https://arxiv.org/abs/2605.07158
Authors: Junseon Yoo
Affiliations: Pluto Labs
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 4 figures, 4 tables
Abstract:Vector search and retrieval-augmented generation (RAG) rest on the assumption that cosine similarity between text embeddings reflects conceptual relatedness. We measure where this assumption breaks. We build an augmented citation graph over 3.58M scientific papers and partition it via Leiden CPM at two granularities: sub-field (L1) and research-agenda (L2, hierarchical inside each L1). Four state-of-the-art embeddings (Gemini, Qwen3-8B, Qwen3-0.6B, SPECTER2) clear the L1 bar reasonably (45-52% top-10 same-rate) but stop working at L2: only 15-21% of top-10 neighbors share the query’s research agenda. In absolute terms, 8 of every 10 retrieved papers are off-agenda. The failure is universal across eight scientific domains and all four models; SPECTER2, despite its citation-based contrastive training, is the weakest. As a diagnostic probe, we test whether the same augmented graph also functions as a retrieval signal: a deliberately simple citation-count rerank reaches 57.7% top-1 L2 on top of LLM-expanded Boolean retrieval and 59.6% on top of plain BM25, on 80 curated agenda queries – about 9 points above the best cosine retriever (Gemini, 50.6%) and 20 points above BM25 alone (39.3%). The probe isolates a slice of the agenda-matching signal the graph carries but the embeddings miss, connecting recent theoretical limits on single-vector retrieval to a concrete failure mode of scientific RAG.
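The paper's central measurement, the top-10 same-community rate, is simple to state operationally: rank all other papers by cosine similarity to the query paper's embedding and count how many of the top k share its citation community. A minimal sketch follows; all data is toy and the function names are our own.

```python
# Toy reconstruction of the audit metric: fraction of a paper's top-k
# cosine neighbors that fall in the same citation community.
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def topk_same_rate(query_id, embeddings, community, k=10):
    others = [p for p in embeddings if p != query_id]
    ranked = sorted(others,
                    key=lambda p: cosine(embeddings[query_id], embeddings[p]),
                    reverse=True)
    top = ranked[:k]
    return sum(1 for p in top if community[p] == community[query_id]) / len(top)

emb = {"a": [1, 0], "b": [0.9, 0.1], "c": [0, 1]}
comm = {"a": 1, "b": 1, "c": 2}
print(topk_same_rate("a", emb, comm, k=2))  # 0.5: one of two neighbors shares a's community
```

In the paper this rate is computed at two community granularities (sub-field L1 and research-agenda L2) from a Leiden CPM partition of a 3.58M-paper citation graph; the gap between the two levels is what exposes the failure mode.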
[IR-11] RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation
Summary: This paper addresses two challenges LLM-based recommenders face when constructing decision-relevant context: (1) existing methods rely on fixed context-construction strategies and cannot determine what information truly helps each recommendation instance; (2) heterogeneous evidence (collaborative behavior and item metadata) creates a context-efficiency bottleneck, where information overload exceeds the context window while aggressive compression or heuristic filtering discards fine-grained evidence. The key ideas of the proposed ranking-driven retrieval-and-reasoning framework RRCM are: starting from a lightweight user-history context, learn whether to recommend directly, retrieve collaborative evidence, retrieve item metadata, or interleave both; represent collaborative and metadata memories uniformly in natural language behind a unified retrieval interface, avoiding handcrafted CF injection or fixed retrieval rules; and optimize the memory-reading policy with an outcome-only ranking reward on final top-k recommendation quality via group relative policy optimization, so retrieval decisions are driven directly by recommendation outcomes.
Link: https://arxiv.org/abs/2605.07129
Authors: Shijun Li, Wooseong Yang, Yu Wang, Tianxin Wei, Joydeep Ghosh
Affiliations: The University of Texas at Austin; University of Illinois at Chicago; Capital One AI Foundations; University of Illinois Urbana-Champaign
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) have emerged as a promising paradigm for next-generation recommender systems, offering strong semantic understanding and natural-language reasoning abilities. Despite recent progress, current LLM-based recommenders still face key challenges in constructing decision-relevant contexts from heterogeneous evidence. First, existing methods often rely on fixed context construction strategies: collaborative behavioral evidence and item-side metadata are typically incorporated through predefined prompts, static retrieval pipelines, or handcrafted injection mechanisms, making it difficult to determine what information is truly beneficial for each instance. Second, heterogeneous evidence introduces a severe context-efficiency bottleneck. Rich metadata and collaborative interaction records can quickly overwhelm the context window, while aggressive compression or heuristic filtering may discard fine-grained evidence critical for accurate recommendation. To address these challenges, we propose RRCM, a ranking-driven retrieval-and-reasoning framework over collaborative and metadata memories for LLM-based agentic recommendation. RRCM starts from a lightweight user-history context and learns whether to recommend directly, retrieve collaborative evidence, retrieve item metadata, or interleave both through reasoning. Both memories are represented in natural language and accessed through a unified retrieval interface, enabling flexible evidence acquisition without handcrafted CF injection or fixed retrieval rules. We optimize this memory-reading policy with an outcome-only ranking reward, instantiated using group relative policy optimization, so that retrieval decisions are directly driven by final top-k recommendation quality. Extensive experiments show that RRCM significantly outperforms traditional baselines and diverse LLM-based recommendation approaches.
[IR-12] An Embarrassingly Simple Graph Heuristic Reveals Shortcut-Solvable Benchmarks for Sequential Recommendation
Summary: This paper asks whether the strong benchmark results of generative recommenders actually demonstrate advanced sequential and semantic modeling, or whether the standard benchmarks are shortcut-solvable. The audit instrument is an intentionally simple graph heuristic: starting from the user's last one or two items, retrieve candidates from a few-hop item-transition graph and rank them by item-feature similarity, with no sequence encoder, generative objective, or training. This heuristic matches or outperforms many modern baselines, with relative NDCG@10 gains of up to 44.18%. The analysis identifies three shortcut structures that make next-item prediction easier than expected (low-branching local transitions, feature-smooth transitions, and limited dependence on long user histories), and the heuristic stays competitive on 10 of 14 datasets, suggesting benchmark performance alone may not validate sophisticated recommendation models.
Link: https://arxiv.org/abs/2605.07125
Authors: Haoyu Han, Li Ma, Hanbing Wang, Bingheng Li, Daochen Zha, Chun How Tan, Huiji Gao, Xin Liu, Stephanie Moyerman, Sanjeev Katariya, Hui Liu, Jiliang Tang
Affiliations: Michigan State University; Airbnb, Inc.
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sequential recommendation has increasingly shifted toward generative recommenders that combine sequential patterns with semantic item information. Yet these methods are often evaluated on a small set of widely used benchmarks, raising a key question: do these benchmarks actually require the advanced modeling capabilities that modern generative recommenders claim to provide? We conduct a benchmark audit with an intentionally simple graph heuristic. Starting from only the last one or two interacted items, it retrieves candidates from a few-hop item-transition graph and ranks them by item-feature similarity. Despite using no sequence encoder, generative objective, or training, this heuristic matches or outperforms many modern baselines, with relative NDCG@10 improvements of 38.10% and 44.18% over the best competing baseline on Amazon Review Sports and CDs. We show that this behavior reflects shortcut solvability rather than an artifact of one heuristic. We identify three shortcut structures that can make next-item prediction easier than expected: low-branching local transitions, feature-smooth transitions, and limited dependence on long user histories. These shortcuts need not appear together; even one or two strong signals can make simple local retrieval highly competitive, while weakening them makes the benefits of more sophisticated models clearer. Across 14 datasets, model rankings vary substantially with dataset properties, yet the heuristic remains competitive on 10 of them. Our findings suggest that strong performance on standard benchmarks does not always demonstrate advanced sequential, semantic, or generative modeling ability. We call for more careful dataset selection and dataset-level diagnostic analysis when using benchmarks to support claims about new recommendation models. 
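The heuristic is simple enough to reconstruct roughly from the abstract alone: a few-hop walk on the item-transition graph for candidate generation, then feature-similarity ranking. The sketch below is our own simplification; graph construction and the similarity function are assumptions, not the paper's exact recipe.

```python
# Rough reconstruction of the "embarrassingly simple" heuristic: candidates
# come from a few-hop neighborhood of the last interacted item, ranked by
# item-feature similarity. No sequence encoder, no training.

def few_hop_candidates(graph, start, hops=2):
    frontier, seen = {start}, {start}
    for _ in range(hops):
        frontier = {nxt for node in frontier for nxt in graph.get(node, [])} - seen
        seen |= frontier
    return seen - {start}

def recommend(graph, features, last_item, k=2):
    sim = lambda a, b: sum(x * y for x, y in zip(a, b))  # toy feature similarity
    cands = few_hop_candidates(graph, last_item)
    return sorted(cands, key=lambda c: sim(features[last_item], features[c]),
                  reverse=True)[:k]

graph = {"shoes": ["socks", "shirt"], "socks": ["laces"]}
features = {"shoes": [1, 0], "socks": [0.9, 0.1],
            "shirt": [0.2, 0.8], "laces": [0.8, 0.0]}
print(recommend(graph, features, "shoes"))  # ['socks', 'laces'] rank above 'shirt'
```

When a dataset exhibits low-branching, feature-smooth transitions, this kind of local retrieval is already close to optimal, which is exactly the shortcut-solvability the paper diagnoses.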
[IR-13] Bridging Textual Profiles and Latent User Embeddings for Personalization
Summary: This paper targets the tension in personalized recommendation between interpretability and downstream utility of user representations: supervised latent user embeddings retrieve well but are hard to interpret, while textual user profiles are interpretable but hard to optimize for recommendation quality due to the lack of direct supervision. The key idea of BLUE is a reinforcement-learning framework in which a profiler LLM generates textual user profiles and an embedding model supplies reward signals, pushing the textual representation closer to positive items and away from negatives in embedding space; a text-space supervision signal based on next-item prediction keeps the profiles both semantically meaningful and effective for retrieval. Experiments show BLUE outperforms strong baselines in zero-shot sequential recommendation and cross-domain transfer, and the generated profiles provide better personalized context for question answering than raw user histories.
Link: https://arxiv.org/abs/2605.06981
Authors: Zhaoxuan Tan, Xiang Zhai, Yan Zhu, Meng Jiang, Mohamed Hammad
Affiliations: University of Notre Dame; Google
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Personalized systems rely on user representations to connect behavioral history with downstream recommendation applications. Existing methods typically employ either supervised latent user embeddings, which are effective for retrieval but difficult to interpret, or textual user profiles, which are interpretable but challenging to optimize for downstream utility due to lack of direct supervision. To bridge this gap, we present BLUE, a reinforcement learning framework that unifies these two forms of user representation by aligning language-based user profiles with embedding-based recommendation objectives. Given a user interaction history, BLUE leverages a profiler Large Language Model (LLM) to generate textual profiles, while an embedding model provides reward signals. This encourages the resulting textual representations to move closer to positive items and farther from negative ones in the embedding space. We further introduce a text-space supervision signal based on next-item prediction, ensuring the learned profiles remain both semantically meaningful and highly effective for downstream retrieval. Experiments on Amazon Reviews 2023 and Google Local Reviews in zero-shot sequential recommendation settings demonstrate that BLUE consistently outperforms strong baselines under both frozen and trainable embedding conditions. Notably, BLUE achieves clear gains in cross-domain transfer, highlighting the strong generalization ability of the learned user profiles. Furthermore, these generated profiles provide superior personalized context for question answering compared to raw user histories or alternative profile optimization methods. Overall, these results show that BLUE provides an effective way to unify interpretable textual profiling with discriminative latent embeddings for personalization.
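The reward signal described above, profile embeddings should sit near positive items and far from negatives, can be sketched as a mean-similarity margin. The margin form below is our assumption for illustration; the paper's exact reward formulation may differ.

```python
# Hedged sketch of a BLUE-style reward: embed the generated textual profile,
# then reward similarity to positive items minus similarity to negatives.
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def profile_reward(profile_emb, pos_items, neg_items):
    pos = sum(cosine(profile_emb, p) for p in pos_items) / len(pos_items)
    neg = sum(cosine(profile_emb, n) for n in neg_items) / len(neg_items)
    return pos - neg  # higher when the profile sits near positives

good_profile = [1.0, 0.0]
assert profile_reward(good_profile, [[0.9, 0.1]], [[-1.0, 0.0]]) > 0
```

In the full framework this scalar would be fed back to the profiler LLM as an RL reward, alongside the text-space next-item-prediction signal that keeps profiles readable.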
[IR-14] From Surface Learning to Deep Understanding: A Grounded AI Tutoring System for Moodle IJCAI2026
Summary: This paper addresses two risks of generative AI in education: misinformation caused by hallucination, and the lack of teacher-led content control. The key is a Retrieval-Augmented Generation (RAG) based modular Moodle plugin, the AI Teaching & Learning Assistant, which grounds Large Language Model (LLM) outputs strictly in teacher-provided materials to keep responses factually anchored. A dual-centric design gives students interactive Socratic tutoring and gives educators a human-in-the-loop workspace for supervised content generation, deepening learning while keeping the educational process controllable and trustworthy.
Link: https://arxiv.org/abs/2605.06963
Authors: Anna Ostrowska, Michał Kukla, Gabriela Majstrak, Jan Opala, Sebastian Pergała, Jan Skwarek, Anna Wróblewska
Affiliations: Warsaw University of Technology
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 5 pages, accepted as demo paper at IJCAI 2026
Abstract:This demo paper describes the development of the AI Teaching & Learning Assistant, a modular Moodle plugin that leverages Retrieval-Augmented Generation (RAG) to deliver high-quality, hallucination-free education. The system employs a dual-centric design, providing students with interactive, Socratic-based tutoring and educators with a “human-in-the-loop” workspace for supervised content generation. By grounding Large Language Model (LLM) responses in teacher-provided materials, the assistant addresses the risks of misinformation while encouraging deep conceptual mastery. Evaluation via the Ragas (LLM-as-a-Judge) framework and a preliminary user study confirms its effectiveness, achieving faithfulness scores up to 0.97 and a 4.00/5.00 recommendation rate.
Human-Computer Interaction
[HC-0] ECNUClaw: A Learner-Profiled Intelligent Study Companion Framework for K-12 Personalized Education
Summary: This paper addresses the lack of personalized learning support in K-12 education, i.e., how to build intelligent study companions that adapt dynamically to individual students. The key contribution is ECNUClaw, an open-source framework that maintains a five-dimension learner profile (cognitive, behavioral, emotional, metacognitive, and contextual) by extracting signals from each turn of student-companion dialogue. Profile updates feed an adaptive strategy engine that adjusts guidance intensity, encouragement frequency, and Bloom's-taxonomy scaffolding in real time, closing the loop from data sensing to pedagogical intervention; the design draws on three theoretical strands from the Chinese educational-technology literature.
Link: https://arxiv.org/abs/2605.08040
Authors: Yizhou Zhou, Jiayin Li, Zhi Zhang
Affiliations: Shanghai Institute of AI Education, East China Normal University; AI Education Laboratory, East China Normal University; Institute of Higher Education, Faculty of Education, East China Normal University
Subjects: Human-Computer Interaction (cs.HC)
Comments: 14 pages, 6 figures
Abstract:We introduce ECNUClaw, an open-source framework for building learner-profiled intelligent study companions in K-12 education. The system constructs and maintains a five-dimension learner profile – covering cognitive, behavioral, emotional, metacognitive, and contextual dimensions – by extracting signals from student-companion dialogues at each turn. Profile updates feed directly into an adaptive strategy engine that adjusts the companion’s guidance intensity, encouragement frequency, and Bloom’s taxonomy scaffolding in real time. The framework design draws on three theoretical strands from the Chinese educational technology literature: Zhang’s Digital Portrait Three-Layer Framework for learner assessment, the Education Brain model for educational system architecture, and the Human-AI Collaborative IQ concept for companion design philosophy. ECNUClaw is implemented in Python and supports seven Chinese LLM providers through a unified OpenAI-compatible adapter layer. We describe the system architecture, the profiling and adaptation mechanisms, and discuss limitations and next steps. The source code is available at this https URL.
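The per-turn profile update described above can be illustrated with a minimal data structure. The five dimension names come from the abstract; the field layout and the exponential-moving-average update rule are our assumptions for illustration, not ECNUClaw's actual API.

```python
# Illustrative only: a five-dimension learner profile blended with per-turn
# dialogue signals via an EMA. Values are kept in [0, 1].
from dataclasses import dataclass, field

DIMENSIONS = ("cognitive", "behavioral", "emotional", "metacognitive", "contextual")

@dataclass
class LearnerProfile:
    scores: dict = field(default_factory=lambda: {d: 0.5 for d in DIMENSIONS})

    def update(self, signals: dict, alpha: float = 0.3):
        """Blend per-turn signals (0..1) into the profile with an EMA."""
        for dim, value in signals.items():
            if dim in self.scores:
                self.scores[dim] = (1 - alpha) * self.scores[dim] + alpha * value

profile = LearnerProfile()
profile.update({"emotional": 1.0})  # e.g., an enthusiastic student message
print(round(profile.scores["emotional"], 2))  # 0.65
```

In the real system the per-turn signals would come from LLM analysis of the dialogue, and the updated profile would drive the adaptive strategy engine.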
[HC-1] Hot Wire 5D: Evaluating Cognitive and Motor Trade-offs of Visual Feedback for 5D Augmented Reality Trajectories
Summary: This paper addresses two gaps in Augmented Reality (AR) guidance for complex spatial tasks: the unclear sensori-motor baselines of novice users tracing 5D+ trajectories (3D position, 2D orientation, and movement speed), and the cognitive-motor trade-offs induced by different visual feedback paradigms. The key is a controlled within-subjects study (N=30) comparing three AR UI concepts with and without explicit orientation constraints, validating internal AR tracking against a high-precision external optical tracking system to rule out hardware drift, and applying phase-segmented analysis (transient vs. steady-state) with Aligned Rank Transform (ART) ANOVA to isolate interactions between visual design and task complexity. The results establish conservative novice baselines, reveal orientation-induced cognitive-motor trade-offs, identify mitigating UI synergies, and yield actionable design guidelines for effective AR guidance systems.
Link: https://arxiv.org/abs/2605.08008
Authors: Christian Masuhr, Julian Koch, Arne Wendt, Thorsten Schüppstuhl
Affiliations: Hamburg University of Technology (TUHH)
Subjects: Human-Computer Interaction (cs.HC)
Comments: 19 pages, 9 figures. Preprint. Includes supplementary appendix
Abstract:Augmented Reality (AR) is increasingly utilized to guide users through complex spatial tasks in domains such as manufacturing, non-destructive testing, and surgery. These applications often require strict compliance with 5D+ trajectories using rotation-symmetric tools (3D position, 2D orientation, and movement speed). However, the sensori-motor baselines of untrained users during these multidimensional tracing tasks, along with the cognitive-motor trade-offs induced by varying visual feedback paradigms, remain underexplored. We present a controlled within-subjects user study (N=30) evaluating three distinct AR UI concepts for trajectory guidance, both with and without explicit orientation constraints. We analyzed spatial, orientational, and speed compliance based on the internal AR tracking, which was validated against a high-precision external optical tracking system to rule out hardware drift. By segmenting the execution into transient and steady-state phases and applying Aligned Rank Transform (ART) ANOVA, we isolated the interaction effects between visual design and task complexity. Alongside subjective metrics (NASA-TLX, SUS), our results establish conservative performance baselines for novice users performing freehand 5D trajectory following. We reveal orientation-induced cognitive-motor trade-offs and identify mitigating UI synergies. Ultimately, we provide empirical baselines and actionable design guidelines for developing effective AR guidance systems.
[HC-2] Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
Summary: This paper tackles the difficulty of apples-to-apples comparison in AI evaluation, where divergent methodologies and scenarios produce apples-to-oranges results. The solution rests on methodological transparency, operational grounding, and human-centered design (HCD): a repeatable pipeline transforms high-level use cases into detailed evaluation scenarios, eliciting use cases from subject-matter experts via a structured six-element AI Use Case Worksheet and combining LLM prompting with iterative human reviews to produce 107 operationally grounded scenarios. Human checkpoints at every stage keep scenarios reflective of real-world usage and human concerns, and a validation rubric assesses scenario quality, supporting a more consistent, human-centered paradigm for AI evaluation.
Link: https://arxiv.org/abs/2605.07986
Authors: Yee-Yin Choong, Kristen Greene, Alice Qian, Meryem Marasli, Ziqi Yang, Sophia Chen, Laura Dabbish, Anand Rao, Hong Shen
Affiliations: National Institute of Standards and Technology; Carnegie Mellon University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 23 pages, 3 figures
Abstract:AI measurement science has a wide variety of methodologies and measurements for comparing AI systems, resulting in what often appear to be “apples-to-oranges” comparisons across AI evaluations. To move toward “apples-to-apples” comparisons in real-world AI evaluations, this work advocates for methodological transparency in evaluation scenarios, operational grounding, and human-centered design (HCD) principles. We propose a repeatable process for transforming high-level use cases to detailed scenarios by eliciting use cases from subject matter experts (SMEs) via a structured AI Use Case Worksheet with six key elements: use case, sector, user (direct and indirect), intended outcomes, expected impacts (positive and negative), and KPIs and metrics. We demonstrate utility of the worksheet and process in the U.S. financial services sector. This paper reports on example high-level AI use cases identified by financial services sector SMEs: cyber defense enablement, developer productivity, financial crime aggregation, suspicious activity report (SAR) filing, credit memo generation, and internal call center support. These AI use cases provided are illustrative of the process and not exhaustive. Central to our work is a three-stage expansion pipeline combining LLM prompting with human reviews to generate 107 scenarios from those use cases elicited from SMEs. This process integrates iterative human reviews at every juncture to ensure operational grounding: for scenario titles and descriptions; for core scenario elements like users, benefits and risks, and metrics; and for scenario narratives and evaluation objectives. Human checkpoints ensure scenarios remain reflective of real-world usage and human needs. We describe a validation rubric to assess scenario quality. By defining key scenario components, this work supports a more consistent and meaningful paradigm for human-centered AI evaluations.
[HC-3] Exploring a Virtual Pet to Provide Context Notifications in a Tourism Recommender System: a Pilot Study
Summary: This paper addresses a design challenge in tourism recommender systems (RS): real-time notifications are hard to deliver well because of intrusiveness and user fatigue. The key is a virtual pet acting as a social mediator: the system integrates real-time environmental data (air quality, noise levels, weather forecasts) and proximity-based notifications with a Multi-Agent Microservice that personalizes recommendations by user personality traits and preferences, softening the perceived intrusiveness of system alerts. A within-subjects pilot study (n=11) suggests the pet makes safety-critical information feel more welcome and natural, and that character-mediated justifications significantly improve notification clarity, supporting users' real-time travel decisions.
Link: https://arxiv.org/abs/2605.07960
Authors: Patrícia Alves, Joana Neto, Ana Barreiro, Jorge Lima, Fausto Alves, Henish Balu, Luís Conceição, Goreti Marreiros
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 30 pages, 7 figures, 7 tables
Abstract:While context-aware personalization has been widely explored in modern tourism Recommender Systems (RS), the delivery of real-time notifications remains a significant design challenge due to issues of intrusiveness and user fatigue. This paper presents a proof-of-concept for a tourism recommendation framework that utilizes a virtual pet as a social mediator for delivering context-aware alerts. The system integrates real-time environmental data - including air quality, noise levels, and weather forecasts - and proximity-based notifications with a Multi-Agent Microservice that generates personalized recommendations based on the user’s personality traits and preferences. A within-subjects pilot study (n=11) was conducted to evaluate the feasibility and user acceptance of this pet-mediated approach. Participants interacted with two versions of the system - a baseline without contextual alerts and a version featuring pet-mediated notifications - over a four-week period (two weeks per version) in real-world scenarios. Quantitative and qualitative data were collected to assess engagement, perceived naturalness, notification utility, and acceptance. Preliminary results suggest that the virtual pet effectively can “soften” the perceived intrusiveness of system alerts, making safety-critical information feel more welcome and natural. Furthermore, the character-mediated justifications significantly improved the clarity of the notifications, effectively supporting users in their real-time travel decisions. These findings provide a foundation for using virtual pet companions to enhance the transparency and acceptance of context-aware communication in tourism RS.
[HC-4] Sycophantic AI makes human interaction feel more effortful and less satisfying over time
Summary: This paper examines a displacement effect of generative AI in interpersonal life: users who receive emotional and esteem support from AI may gradually rely less on their real relationships. The key finding is that sycophantic AI provides frictionless understanding that feels like the support of close friends and family; over three weeks, users became nearly as likely to seek personal advice from the AI as from close friends and family, reported lower satisfaction with real-world social interactions, and preferred sycophantic AI mainly because it made them feel most understood, not because its advice was better.
Link: https://arxiv.org/abs/2605.07912
Authors: Lujain Ibrahim, Franziska Sofia Hafner, Myra Cheng, Cinoo Lee, Rebecca Anselmetti, Robb Willer, Luc Rocher, Diyi Yang
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Millions of people now turn to artificial intelligence (AI) systems for personal advice, guidance, and support. Such systems can be sycophantic, frequently affirming users’ views and beliefs. Across five preregistered studies (N = 3,075 participants, 12,766 human-AI conversations), including a three-week study with a census-representative U.S. sample, we provide longitudinal experimental evidence that sycophantic AI shifts how users approach their closest relationships. We show that sycophantic AI immediately delivers the emotional and esteem support users typically associate with close friends and family. Over three weeks of such interactions, users became nearly as likely to seek personal advice from sycophantic AI as from close friends and family, and reported lower satisfaction with their real-world social interactions. When given a choice among AI response styles, a majority preferred sycophantic AI – not for the quality of its advice, but because it made them feel most understood. Together, these findings offer a relational account of AI sycophancy: by providing frictionless understanding, it may quietly raise the bar against which human relationships are judged.
[HC-5] SpatialPrompt: XR-Based Spatial Intent Expression as Executable Constraints for AI Generative 3D Design
Summary: This paper addresses the difficulty of expressing spatial structure and semantic intent during 3D generation, particularly the lack of efficient, interactive constraint input in collaborative settings. The key is SpatialPrompt, which combines spatial sketches with voice prompts, turning rough 3D pen strokes into executable constraints for controllable 3D generation, while supporting iterative refinement and synchronous multi-user co-creation with color-coded contributions to aid shared understanding.
Link: https://arxiv.org/abs/2605.07894
Authors: Yichen Andy Yu, Wanru Li, Qiaoran Wang, Jymon Ross, Gavin Johnson, Mandy Lui, Qiao Jin
Affiliations: North Carolina State University; Carnegie Mellon University; National University of Singapore; University of Rochester
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:We present SpatialPrompt, an Extended Reality(XR) system that turns spatial sketches into executable constraints for controllable 3D generation. Users draw rough structures with a 3D pen and add voice prompts for semantic and stylistic intent. The system supports iterative refinement and synchronous co-creation in shared space with color-coded contributions. Implemented on Apple Vision Pro with Logitech Muse and Meshy, a heuristic evaluation suggests that the workflow is intuitive and supports shared understanding in collaborative creation, while revealing needs for faster generation and clearer feedback.
[HC-6] A Roadmap of Mixed Reality Body Doubling for Adults with ADHD
Summary: This paper addresses how adults with Attention-Deficit/Hyperactivity Disorder (ADHD) can better initiate and complete tasks via body doubling, a self-management technique in which the presence of one or more agents supports task execution. The key contribution is a twelve-dimension framework spanning individual motivation, agent-related, interaction-related, and contextual dimensions, plus efficacy, which systematizes the mechanics of body doubling and surfaces research gaps: limited mixed-reality prototypes, possibilities for more interactive body doubles, and the need for empirical studies of body doubling with adults with ADHD.
Link: https://arxiv.org/abs/2605.07851
Authors: Valerie Tan, Kimberly Hegemann, Jens Gerken
Affiliations: TU Dortmund University
Subjects: Human-Computer Interaction (cs.HC)
Comments: 9 pages, 2 figures, accepted short paper submission to a CHI 2026 conference workshop
Abstract:Adults with ADHD may use a self-management technique known as Body Doubling, in which the participant employs the presence of one or more agents as a means of initiating and completing tasks. We developed a framework on body doubling with twelve dimensions to better understand the characteristics of body doubling and discover future research directions for developing and testing body doubling for adults with ADHD. Our framework accounts for individual motivation, agent-related dimensions, interaction related dimensions, contextual dimensions, and efficacy. These dimensions show existing research gaps such as limited mixed reality prototypes, possibilities for more interactive body doubles, and the need for empirical studies to further understand of body doubling and adults with ADHD.
[HC-7] A Spatial Knowledge Acquisition Comparison Between Digital Visual Thematic Maps, Non-Visual Interactive Text Thematic Maps, and Tables
Summary: This paper examines whether common "accessible" alternatives to digital maps actually serve blind and low-vision individuals (BLVIs), since maps are typically replaced with tables that lack geographic information. In a study of 20 sighted and 20 BLVI participants, interactive text maps (ITMs) and visual maps significantly outperformed tables on geographic tasks, and for sighted participants ITMs did not significantly differ from visual maps, suggesting ITMs may achieve "map equivalent purpose." The key conclusion: Web Content Accessibility Guidelines (WCAG)-compliant ITMs can convey spatial relationships and geographic information that tables cannot, challenging prevailing accessibility practice and motivating reconsideration of legislation exempting digital thematic maps.
Link: https://arxiv.org/abs/2605.07849
Authors: Brandon Biggs, Christopher Toth, James M. Coughlan, Bruce N. Walker
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Digital maps are used to communicate generalized spatial information and relationships, yet are commonly made “accessible” using tables that lack geographic information. This study examines whether these tables and interactive text maps (ITMs) may be comparable to visual maps. Twenty sighted and 20 blind and low-vision individuals (BLVIs) performed tasks designed to compare visual maps, ITMs, and tables. Participants answered numeric, geographic, and combined numeric geographic questions using each representation, and performance, preference, and NASA-TLX were measured. Across both participant groups, map representations (visual and ITMs) significantly outperformed tables on geographic-based questions, while performance differences were minimal for numeric questions. For sighted participants, performance on geographic questions did not significantly differ between visual maps and ITMs, indicating that a larger powered study may find an “equivalent purpose” across these two conditions. Participants preferred map-based representations over tables. Perceived workload was highest for the ITM, intermediate for the visual map, and lowest for the table. Consistent with the Map Equivalent Purpose Framework, these findings indicate that Web Content Accessibility Guidelines-compliant ITMs can provide access to spatial information, unlike tables. These findings challenge prevailing accessibility practice that recommends tables lacking geographic information as map alternatives, and motivate reconsideration of accessibility legislation exempting digital thematic maps.
[HC-8] Analyzing Human Heuristics and Strategies in Everyday Decision-Making Conversations for Conversational AI Design
【速读】:该论文旨在解决当前对话式人工智能(Conversational AI)在支持日常决策时,普遍依赖数据驱动的推理方式,而忽视了人类在自然对话中实际采用的启发式策略和交互机制的问题。其解决方案的关键在于通过分析955个真实韩语对话(共15,476条话语)中的食物与旅行决策行为,利用大语言模型(LLM)辅助的编码流程,识别出人类在决策过程中优先采用“满意化”(satisficing)而非优化策略,并高度依赖内部知识和交互策略以降低认知负荷。研究进一步揭示了“频率-效率错配”现象:高频使用的启发式策略有助于维持探索阶段的对话流畅性,而低频但规则化的策略在决策收束阶段具有更高的有效性。这一发现为将AI系统设计与人类启发式决策过程对齐提供了实证依据。
链接: https://arxiv.org/abs/2605.07789
作者: Sora Kang,Soyun Jeon,Jinsu Eun,Kwangwon Lee,Chaerin Song,Minyoung Joo,Joonhwan Lee
机构: Seoul National University (首尔国立大学)
类目: Human-Computer Interaction (cs.HC)
备注: CogSci 2026 (Annual Meeting of the Cognitive Science Society 2026)
Abstract:Conversational AI increasingly supports everyday decision-making, yet most systems rely on data-centric reasoning rather than the heuristic and interactional strategies people use in natural conversation. To ground design in actual human practice, we analyze 955 real-world Korean conversations (15,476 utterances) involving food and travel decisions, applying a decision-making codebook through an LLM-assisted coding pipeline. Our findings reveal that people prioritize satisficing over optimization, relying heavily on internal knowledge and interactional strategies to manage cognitive load. Critically, we identify a frequency-efficiency mismatch: the most prevalent heuristics sustain conversational flow during exploration, whereas infrequent, rule-based strategies are highly effective at driving resolution during exploitation. By mapping how these patterns transfer across the spectrum of human-AI interaction, this work provides empirical grounding consistent with cognitive theories of decision-making and offers design implications that align AI systems with human heuristic processes.
[HC-9] Splitting User Stories Into Tasks with AI – A Foe or an Ally?
【速读】:该论文旨在解决敏捷软件开发中用户故事(User Story)拆分过程耗时且易遗漏关键任务的问题。解决方案的关键在于引入生成式 AI 工具(如 GitLab Duo)辅助任务拆分,通过对比传统方法与 AI 辅助方法的实验发现,AI 能够生成更细粒度的任务列表并减少遗漏,但其成熟度尚不足以独立替代开发者;因此,研究主张采用人机协同的混合模式,在人类专家监督下利用 AI 提升规划效率与准确性,从而实现高效、可靠的敏捷任务分解。
链接: https://arxiv.org/abs/2605.07320
作者: Luka Pavlič,Reinhard Bernsteiner,Stephan Schlögl,Christian Ploder
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 12 pages, 2 figures, 4 tables
Abstract:In agile software development, breaking down user stories into actionable tasks is a critical yet time-consuming process. This paper investigates the potential of Generative AI tools to assist in task splitting, aiming to enhance planning efficiency. We conducted a controlled experiment comparing traditional task-splitting methods with AI-assisted approaches using GitLab Duo. Our findings indicate that while current AI tools are not yet mature enough to replace developers, they can aid in generating more granular task lists and ensuring no important tasks are overlooked. Participants favored a hybrid approach, combining AI tools with conventional methods to maintain high accuracy in planning. This study highlights the potential benefits and limitations of integrating Generative AI into agile development processes, suggesting that AI tools can serve as valuable aids in task splitting, provided there is human oversight to filter out irrelevant tasks.
[HC-10] Same Brain, Different Prediction: How Preprocessing Choices Undermine EEG Decoding Reliability
【速读】:该论文旨在解决脑电图(Electroencephalography, EEG)深度学习模型在不同预处理流程下预测结果不稳定的问题,这种不稳定性可能导致临床和脑机接口应用中的可靠性下降。传统方法通常假设预处理流程固定,忽略了其作为潜在干预变量对模型输出的影响,从而无法准确量化由预处理变化带来的不确定性。论文提出三个核心解决方案:首先,通过Walsh-Hadamard分解将2⁷种预处理组合空间建模为近加性敏感结构,实现高效逐步优化;其次,引入预处理不确定性(Preprocessing Uncertainty, PU),提供一种与模型置信度互补的逐样本诊断指标;最后,设计基于图结构的正则化方法Normalized Adaptive PGI(NA-PGI),利用预处理干预的组合特性进行稳定性增强,具备明确的作用边界条件。
链接: https://arxiv.org/abs/2605.07212
作者: Dengzhe Hou,Zihao Wu,Lingyu Jiang,Zirui Li,Fangzhou Lin,Kazunori D. Yamada
机构: Tohoku University (东北大学); University of Georgia (佐治亚大学); Texas A&M University (德州农工大学); Worcester Polytechnic Institute (伍斯特理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP)
备注:
Abstract:Electroencephalography (EEG) is a cornerstone of brain-computer interfaces and clinical neuroscience, yet deep learning models are typically trained and evaluated under a single, unreported preprocessing pipeline. We formalize preprocessing choices as a counterfactual intervention space and show that EEG predictions are surprisingly unstable under this space: across six datasets spanning four paradigms, up to 42% of trial-level predictions flip when only the preprocessing changes, a variability that standard uncertainty methods do not explicitly quantify because they condition on a fixed preprocessing pipeline. We provide three tools to make this instability measurable, decomposable, and reducible. First, a Walsh-Hadamard decomposition of the 2^7 pipeline space reveals that sensitivity is near-additive in practice under the binary intervention design, enabling efficient step-by-step optimization. Second, we introduce Preprocessing Uncertainty (PU), a per-trial diagnostic that captures a dimension of instability complementary to model-based confidence. Third, we study Normalized Adaptive PGI (NA-PGI), a graph-structured regularizer that exploits the compositional structure of preprocessing interventions as one mitigation strategy with clear scope conditions.
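上述摘要提到用 Walsh-Hadamard 分解在 2^7 的二值预处理空间上刻画近加性的敏感度结构。下面是一个示意性草图(假设成分:仅取 3 个虚构的二值预处理步骤和一个人为构造的近加性准确率函数,并非论文的真实流水线或 EEG 结果),演示如何在全部二值组合上估计每个步骤的主效应:

```python
from itertools import product

# Hypothetical sketch: estimating per-step "main effects" over a binary
# preprocessing space, in the spirit of a Walsh-Hadamard style analysis.
# The accuracy function below is synthetic, not real EEG results.

STEPS = ["bandpass", "ica", "rereference"]  # 3 toy steps instead of the paper's 7

def accuracy(cfg):
    # Synthetic near-additive response: each step contributes a fixed amount,
    # plus a small interaction between bandpass and ica.
    base = 0.60
    effect = {"bandpass": 0.05, "ica": 0.03, "rereference": -0.02}
    acc = base + sum(effect[s] for s, on in zip(STEPS, cfg) if on)
    if cfg[0] and cfg[1]:
        acc += 0.01  # interaction term
    return acc

def main_effect(step_idx):
    # Mean accuracy with the step enabled minus with it disabled,
    # averaged over all 2^n configurations.
    on = [accuracy(c) for c in product([0, 1], repeat=len(STEPS)) if c[step_idx]]
    off = [accuracy(c) for c in product([0, 1], repeat=len(STEPS)) if not c[step_idx]]
    return sum(on) / len(on) - sum(off) / len(off)

effects = {s: round(main_effect(i), 4) for i, s in enumerate(STEPS)}
print(effects)  # → {'bandpass': 0.055, 'ica': 0.035, 'rereference': -0.02}
```

当响应近似可加时,这类主效应估计即可支撑摘要中所说的逐步(step-by-step)流水线优化。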
[HC-11] Metaphors as Scaffolds: Spatial, Embodied, Fantastical, and Relational Framings for Youth Usable Privacy Design
【速读】:该论文旨在解决当前主流可用隐私设计(usable privacy design)将隐私视为行政性操作(如设置、开关、同意复选框)的问题,这种设计忽视了青少年在具体关系、情境和身体经验中进行信息披露推理的复杂性。解决方案的关键在于通过精心选择隐喻(metaphor)来重构隐私交互界面:空间隐喻降低认知负荷,身体隐喻提供共享道德话语以协商公共与私人空间规范,幻想隐喻将隐私管理转化为可探索的游戏行为从而提升对细粒度控制的参与度,而关系隐喻则可能因情感亲密掩盖机构数据流动风险,需谨慎使用。作者认为,隐喻的选择本质上是面向青少年隐私设计的首要伦理决策。
链接: https://arxiv.org/abs/2605.07185
作者: JaeWon Kim,Alexis Hiniker
机构: University of Washington (华盛顿大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Mainstream usable privacy design frames privacy as administrative work – settings, toggles, consent checkboxes – abstracted from the relational, contextual, and embodied registers in which youth reason about disclosure. Drawing on a cross-project reading of three prior studies with youth aged 13–24, we examine how the metaphors that scaffold a privacy interaction shape the reasoning young users bring to it. Spatial metaphors reduce cognitive load by recruiting intuitions about navigating physical space. Embodied metaphors furnish a shared moral vocabulary that makes implicit norms about public and private space negotiable among users. Fantastical metaphors recast privacy management as discoverable play, raising engagement with the granular controls that nuanced self-presentation requires. Relational metaphors, by contrast, can lead youth past their own stated boundaries when felt intimacy masks institutional data flow, a risk already visible in AI companion products. Metaphor selection, we argue, is best understood as a first-order ethical design decision for youth privacy.
[HC-12] From Standard English to Singlish: A Retrieval-Augmented Approach for Code-Switched Creole Generation in Large Language Models
【速读】:该论文旨在解决接触变体语言(如新加坡英语,Singlish)中代码切换(code-switching)在自然语言生成任务中的挑战,尤其是受限于平行数据稀缺和词汇快速演变的问题。其解决方案的关键在于提出一种检索增强生成(Retrieval-Augmented Generation, RAG)框架,将方言知识外化为一个精心构建的词典资源,通过稀疏的词汇替换机制实现可控的代码切换,而无需微调模型。该方法在保持语义一致性的同时提升了生成结果的可审计性和可控性,且人类评估显示其与零样本提示(zero-shot prompting)在自然性和适当性上相当。
链接: https://arxiv.org/abs/2605.07132
作者: Foong Ming Lai,Yujin Tan,Han Meng,Yi-Chieh Lee
机构: National University of Singapore (新加坡国立大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Code-switching in contact varieties like Singaporean English (Singlish) challenges natural language generation due to limited parallel data and rapid lexical evolution. We propose a retrieval-augmented generation (RAG) framework that externalizes dialectal knowledge into a curated lexicon, enabling controlled lexical code-switching without fine-tuning. Our approach retrieves candidate Singlish expressions and guides generation through sparse lexical substitution. Human evaluation with 164 Singaporean participants found RAG and zero-shot prompting equally natural and appropriate. Automatic analyses reveal different transformation regimes: zero-shot prompting induces extensive paraphrasing (median 23 token edits), whereas RAG performs minimal substitutions (median 1 edit) with higher semantic preservation (mean cosine similarity 0.978 vs. 0.926). Our results demonstrate that externalizing code-switching into lexical resources enables control and auditability without sacrificing perceived quality, offering practical advantages for rapidly evolving contact varieties.
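摘要中的核心机制是“检索候选表达 + 稀疏词汇替换”。下面是一个极简示意(假设成分:词典条目与替换规则均为演示用虚构映射,并非论文实际的 Singlish 词典或检索器),说明稀疏替换如何在只改动少量词的前提下完成代码切换:

```python
# Illustrative sketch of the retrieve-then-substitute idea described above.
# The lexicon entries and the substitution rule are invented for this example.

LEXICON = {
    "very": "damn",   # hypothetical Standard-English -> Singlish mappings
    "tired": "sian",
    "eat": "makan",
}

def code_switch(sentence, max_subs=1):
    """Sparse lexical substitution: replace at most `max_subs` tokens that
    have a retrieved dialect counterpart, leaving the rest of the sentence
    intact (matching the paper's observed median of ~1 token edit)."""
    tokens = sentence.split()
    out, subs = [], 0
    for tok in tokens:
        key = tok.lower().strip(".,!?")
        if subs < max_subs and key in LEXICON:
            out.append(LEXICON[key])
            subs += 1
        else:
            out.append(tok)
    return " ".join(out)

print(code_switch("I am very tired today"))  # → I am damn tired today
```

将方言知识外化为这样的词典资源,正是摘要所强调的可控性与可审计性的来源:每一次替换都可以追溯到一个具体词条。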
[HC-13] The University AI Didn't Replace – Rethinking Universities in the AI Era
【速读】:该论文试图解决的问题是:当前许多高校在生成式 AI(Generative AI)应用上仍处于早期阶段,AI创新多以非正式方式发生,缺乏制度性支持与系统性整合。解决方案的关键在于推动从孤立的创新实践向战略性的系统集成转变,即通过重新设计以 AI 支持的推理为核心的教学生态,并协调政策、工作量模型和认可机制,实现教育变革的可持续落地。
链接: https://arxiv.org/abs/2605.07056
作者: Karol P. Binkowski,Andrew Hopkins
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI); Applications (stat.AP)
备注: 8 pages, 1 figure. Position paper on Generative AI and the transition from isolated educational innovation to institutionally supported adoption in higher education
Abstract:Generative artificial intelligence (AI) is reshaping higher education, yet many universities remain in early stages of adoption where AI innovation occurs informally and without institutional recognition. This paper presents a framework describing four levels of AI adoption in universities and illustrates these dynamics through a case study of AI-enabled curriculum initiatives in several units. We contend that the key institutional challenge is moving from isolated innovation to strategic integration, where universities redesign learning around AI-supported reasoning and align policies, workload models, and recognition systems to support educational transformation.
[HC-14] Social Understanding Placeness and Identity Alignment: A Design Framework for Friendship-Supportive Youth Social Media
【速读】:该论文旨在解决当前青年社交平台在支持友谊发展方面存在的不足问题,即现有平台设计往往未能充分促进青少年之间友谊的形成、深化与维系。其解决方案的关键在于提出一个基于实证研究的设计框架,该框架由三个核心支柱构成:社会理解感(包括互动规范、交互线索与支架、社会问责与治理)、场所感(包括第三空间与社区、边界与个人空间、共享存在感)以及身份契合感(包括身份货币、身份多元性、关系身份信号)。这一框架通过九个具体的设计空间,为平台提供了可操作的指导原则,以系统性地支持青年友谊的发展条件,并为未来研究和设计干预提供统一的术语体系与分析视角。
链接: https://arxiv.org/abs/2605.07025
作者: JaeWon Kim,Alexis Hiniker
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:We present a design framework for friendship-supportive youth social media, derived from a synthesis of five empirical studies with 331 youth participants (ages 13–25) using interviews, co-design, surveys, diary studies, and a field deployment. Iterative analysis of 209 design-relevant data points identified three pillars: Sense of Social Understanding (interaction norms, interaction cues and scaffolding, social accountability and governance), Sense of Place (third place and community, boundaries and personal spaces, shared presence), and Sense of Identity Alignment (identity currency, identity plurality, relational identity signals). The framework maps nine design spaces through which platforms can support the conditions under which youth friendships form, deepen, and are maintained. It offers a shared vocabulary for locating contributions, comparing design interventions, and identifying under-explored areas for future work.
[HC-15] Problem Space Attunement in Youth Social Media Design
【速读】:该论文旨在解决当前青少年社交平台设计中存在的“问题空间错位”(problem-space misattunement)问题,即主流平台设计未能真正回应青少年在关系维护、身份建构和社群归属感方面的核心需求。研究识别出三种类型的错位:概念性错位(conceptual misattunement)、定义性错位(definitional misattunement)和评价性错位(evaluative misattunement)。其解决方案的关键在于构建以青少年为中心的设计方法论:通过虚构探究(Fictional Inquiry)引导青少年从自身情感需求出发思考设计;借助基于Discord的异步社区实现青少年主导的集体探究;并利用以自我锚定(ego-anchored)的大型语言模型(LLM)代理模拟沙盒,使青少年能够动态评估真实情境下的设计方案。三者共同推动形成基于青少年经验的、更具关系支持性的社交媒体设计准则与方向。
链接: https://arxiv.org/abs/2605.07018
作者: JaeWon Kim
机构: The Information School, University of Washington (华盛顿大学信息学院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Social media is central to how young people maintain relationships, develop identity, and access communities, yet dominant platform designs often leave youth feeling constrained rather than supported. My dissertation argues that youth social media design is shaped by three forms of problem-space misattunement. Conceptual misattunement occurs when the language of “social media” anchors participants to existing platform templates. I address this through Fictional Inquiry in a fictional magic-school setting that helps youth reason from felt relational needs. Definitional misattunement occurs when researchers define what “better” means on youth’s behalf. I address this through a Discord-based asynchronous community that supports youth-led collective inquiry. Evaluative misattunement occurs when participants are asked to judge static or hypothetical designs. I address this through an ego-anchored, LLM-agent simulation sandbox. Together, these studies develop youth-grounded criteria and design directions for relationally supportive social media.
[HC-16] Exploring the “Banality” of Deception in Generative AI
【速读】:该论文旨在解决生成式 AI(Generative AI)中日益隐蔽且被日常化的欺骗行为问题,即“平淡式欺骗”(banal deception),这类欺骗不再依赖显性的“暗黑模式”(dark patterns),而是嵌入默认设置、自动化建议和对话交互等看似自然的AI功能中,使用户难以察觉并逐渐适应。其解决方案的关键在于引入“摩擦”(friction)机制,通过提升用户意识、提供干预工具以及改进监管与执行措施,使用户在与生成式 AI(如聊天机器人)交互时能够识别并抵御潜在的操纵性影响,从而推动该领域未来研究方向的发展。
链接: https://arxiv.org/abs/2605.07012
作者: Ishitaa Narwane,Johanna Gunawan,Konrad Kollnig
机构: Maastricht University (马斯特里赫特大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Accepted at CHI’26 ACAI Workshop
Abstract:Current approaches to addressing deceptive design largely focus on visible interface manipulations, commonly referred to as “dark patterns”. With the rise of generative AI, deception is becoming more difficult to spot and easier to live with, as it is quietly embedded in default settings, automated suggestions, and conversational interactions rather than discrete interface elements. These subtle, normalised forms of influence, which Simone Natale frames as “banal deception”, shape everyday digital use and blur the line between AI-enabled assistance and manipulation. This position paper explores banality as a lens through which to reason through deception in generative AI experiences, especially with chatbots. We explore what Natale describes as users’ own involvement in their deception, and argue that this perspective could lead to future work for introducing friction to safeguard users from deception in generative AI interactions, such as empowering users through raising awareness, providing them with intervention tools, and regulatory or enforcement improvements. We present these concepts as points for discussion for the deceptive design scholarly community.
[HC-17] AI and Consciousness: Shifting Focus Towards Tractable Questions
【速读】:该论文试图解决的核心问题是:当前关于人工智能(AI)是否具备主观体验(即意识)的直接研究问题在科学上尚不可行,因其缺乏统一的意识理论基础及哲学层面的心身问题长期未解。论文提出,应将研究重点转向“感知到的AI意识”这一邻近但可操作的问题,其关键在于识别并分析公众对AI意识的感知成因及其社会影响,包括用户体验、伦理规范和语言习惯的变化。作者强调,此类研究不仅具有现实可行性,且对人类理解自身主观经验与人工实体的关系具有深远意义,并呼吁科研人员与决策者明确区分事实与不确定性,推动透明、准确的沟通机制建设。
链接: https://arxiv.org/abs/2605.06965
作者: Iulia-Maria Comsa
机构: Google DeepMind(谷歌深度思维)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:As language-based AI systems become more anthropomorphic, the question of whether they can have subjective experience is increasingly pressing. I focus here on the tractability of research questions in the space of AI consciousness. I argue that the fundamental problem of whether AI systems can be conscious is currently intractable in its direct form, given the absence of a universally accepted scientific theory of consciousness, as well as the historical open-endedness of the philosophical mind-body problem. In contrast, questions around the adjacent subject of perceived AI consciousness are tractable, timely, and highly consequential for society. The general public is increasingly open to the possibility of consciousness in AI systems and routinely adopts the vocabulary of human cognition and subjective experience to describe them. This phenomenon is already driving societal shifts across user experience, ethical standards, and linguistic norms. I therefore propose an increased research focus on uncovering the causes and effects of perceived AI consciousness, which ultimately shape how we see our own human subjective experience relative to artificial entities. To support this, I map the current landscape of AI consciousness perception and discuss its key potential drivers and societal consequences. Finally, I urge developers, decision-makers, and the broader scientific community to commit to clear and accurate communication regarding the topic of AI consciousness, explicitly acknowledging its inherent uncertainties.
[HC-18] Leveraging fNIRS to Evaluate Workload for Adaptive Training in Virtual Reality
【速读】:该论文旨在解决如何在虚拟现实(VR)训练环境中有效评估并适应学习者的认知负荷问题,以实现个性化、动态调整任务难度的智能训练系统。其解决方案的关键在于验证功能近红外光谱(fNIRS)作为神经工效学指标对认知负荷的测量有效性,特别是区分与任务本质相关的内在负荷(intrinsic load)和由教学设计引发的外在负荷(extraneous load)。研究发现,fNIRS能敏感捕捉到与工作记忆、注意力及多感官整合相关的前额叶皮层区域激活,且这些结果与NASA任务负载指数(NASA-TLX)高度一致,表明fNIRS可作为实时监测认知负荷的核心工具,从而支持自适应训练系统的开发与优化。
链接: https://arxiv.org/abs/2605.06909
作者: Cara A. Spencer,Christopher D. Wickens,Jalynn B. Nicoly,James Crum,Benjamin A. Clegg,Joanna E. Lewis,Francisco R. Ortega,Lucas Plabst,Rebecca L. Pharmer,Leanne Hirshfield
机构: University of Colorado Boulder (科罗拉多大学博尔德分校); Colorado State University (科罗拉多州立大学); Montana State University (蒙大拿州立大学); University of Northern Colorado (北科罗拉多大学)
类目: Human-Computer Interaction (cs.HC)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Advances in technology offer the potential for future adoption of a combination of virtual reality (VR) and real-time adaptivity to enhance training and education. Providing a valid neuro-ergonomic measure of cognitive load can enable an adaptive training regime to continuously adjust task difficulty to an optimal level as training progresses. The current study validated the functional near-infrared spectroscopy (fNIRS) measure of cognitive load to reflect the demands of two different forms of load within Cognitive Load Theory: extraneous and intrinsic to the task to be mastered. Thirty-six participants completed a VR shape assembly training task followed by a test of their skill retention. They wore near-full head coverage fNIRS and provided subjective ratings of their workload. The fNIRS findings largely corroborate intrinsic workload literature with significant activation in cortical regions (dorsolateral and rostral prefrontal cortex and left angular gyrus) associated with working memory, short term memory buffers, multisensory integration, and attention. These fNIRS results were tracked closely by NASA TLX measures of mental workload. The results also revealed far less brain activity associated with extraneous load, namely just the right angular gyrus, deemed irrelevant to the mastery of the task.
[HC-19] MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
【速读】:该论文旨在解决语音驱动的物联网(IoT)设备交互中复杂用户体验建模的问题,具体挑战包括时空约束建模、动态状态追踪以及混合发起(mixed-initiative)交互模式的实现。解决方案的关键在于提出一个名为MIST(Multimodal Interactive Speech-based Tool-calling Dataset)的合成多轮语音驱动代码生成数据集,该数据集模拟真实IoT场景下的多模态交互任务,从而推动对能够理解物理世界约束并进行混合发起对话的语音助手的研究。实验表明,开放权重与封闭权重多模态大语言模型(LLMs)在MIST上的性能存在显著差距,且前沿封闭权重模型仍具备较大提升空间,验证了该数据集对推动相关研究的价值。
链接: https://arxiv.org/abs/2605.06897
作者: Maximillian Chen,Xuanming Zhang,Michael Peng,Zhou Yu,Alexandros Papangelis,Yohan Jo
机构: Columbia University (哥伦比亚大学); Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Project Page: this https URL
Abstract:The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.
[HC-20] Bi3: A Biplatform, Bicultural, Biperson Dataset for Social Robot Navigation ICRA2026
【速读】:该论文旨在解决社交机器人在受限空间中与人群交互时的导航问题,特别是如何在人类群体密集环境中实现安全、高效且符合社会规范的路径规划。其解决方案的关键在于构建了一个名为Bi3的多模态数据集,该数据集包含来自两个不同机器人平台、五种导航算法、74名来自美国和法国的参与者在真实社交场景下的运动轨迹(10.5小时)、RGB视频及用户对机器人表现的主观评价。Bi3通过设计逼真的近距离人机交互实验,提供了高密度互动和复杂行为模式的数据基础,为训练人类运动预测模型和机器人控制策略提供了基准资源,从而推动了机器人在拥挤环境中的社会感知与协同导航能力的发展。
链接: https://arxiv.org/abs/2605.06863
作者: Andrew Stratton,Phani Teja Singamaneni,Pranav Goyal,Rachid Alami,Christoforos Mavrogiannis
机构: University of Michigan (密歇根大学); LAAS-CNRS (法国国家科学研究中心-图卢兹自动化与系统实验室); INRIA (法国国家信息与自动化研究院)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: ICRA 2026
Abstract:We contribute Bi3, a dataset of social robot navigation among groups of people in a constrained lab space. Compared to prior data collection efforts for social robot navigation, our dataset is unique in that it features: an original experiment design giving rise to close navigation encounters between two humans and a robot; five different navigation algorithms; two different robot platforms; a diverse participant pool of 74 people recruited from two sites in the USA and France; multimodal data streams including 10.5 hours of human and robot ground-truth motion tracks, RGB video, and user impressions over robot performance. Our analysis of the collected dataset through metrics like interaction density and human velocity suggests that Bi3 represents a benchmark of unique diversity and modeling complexity. Bi3 contributes towards understanding how humans and robots can productively mesh their activities in constrained environments, and can be a resource for training models of human motion prediction and robot control policies for navigation in densely crowded spaces.
[HC-21] Privacy Perceptions in Sensor-Powered Smart Vehicle Cabins
【速读】:该论文旨在解决智能汽车舱内隐私管理问题,即随着多样化传感器的集成,传统汽车空间演变为智能环境后,如何理解并妥善处理不同用户群体(车主与非车主)对隐私的不同需求。其解决方案的关键在于通过半结构化访谈识别出影响两类人群隐私感知的核心因素,发现既有共通性影响因素,也存在对某一类群体更具影响力的差异性因素,并据此提出面向多利益相关者隐私需求平衡的设计启示,从而为未来智能汽车舱内隐私设计提供理论依据与实践指导。
链接: https://arxiv.org/abs/2605.06847
作者: BoRui Li,Bofan Yu,Xing-Dong Yang
机构: Simon Fraser University (西蒙菲莎大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to GI 2026
Abstract:As car cabins evolve with the integration of diverse sensors, traditional car cabins are transforming into smart environments. This shift raises important questions about how privacy is understood and managed in such spaces. In this work, we investigate privacy perceptions from the perspectives of both vehicle owners (i.e., people who purchase and own cars) and non-owners (i.e., people who temporarily use cars, such as family members, friends, or renters). Through semi-structured interviews with eighteen participants, we identified key factors that influence these groups’ views on privacy. Our findings reveal factors that commonly influence privacy preferences for both owners and non-owners, as well as factors that have a stronger impact on one group over the other. Drawing on these insights, we discuss design implications for future designs to better support and balance the diverse privacy needs of multiple stakeholders in smart car cabins.
[HC-22] Enhancing Eye Movement Biometrics for User Authentication via Continuous Gaze Offset Score Fusion
【速读】:该论文旨在解决现有基于眼动生物特征(Eye Movement Biometrics, EMB)的身份认证系统在性能提升方面存在的局限性,即当前深度学习方法虽能有效建模眼动的时间动态行为,但普遍忽略了连续注视偏移(continuous gaze offset)所蕴含的用户区分信息。其解决方案的关键在于将连续注视偏移作为辅助特征,与现有生物特征进行融合,通过线性和非线性融合策略加以整合,并验证其在不同任务和观测时长下的有效性。实验结果表明,尤其是非线性融合方式显著提升了双公开数据集上的认证性能,且跨任务融合进一步增强识别效果,证明连续注视偏移可在眼动追踪质量下降或存在噪声时提供有价值的补充信息。
链接: https://arxiv.org/abs/2605.06810
作者: Hashim Aziz,Mehedi Hasan Raju,Oleg V. Komogortsev
机构: Texas State University (德州州立大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 Pages, 1 Figure, 1 Table, Submitted to IJCB 2026
Abstract:Eye movement biometrics (EMB) use subject-specific gaze dynamics for user authentication and identification. Recent deep learning-based EMB systems achieve strong performance by modeling temporal eye movement behavior. However, these systems typically overlook continuous gaze offset, despite prior evidence that it contains user-discriminative information. This work examines whether continuous gaze offset can improve biometric performance when combined with existing biometric features. We evaluate linear and nonlinear fusion methods on two publicly available datasets, collected via a lab-grade eye tracker and a virtual reality headset across multiple tasks and observation durations. Results indicate that fusion offers performance benefits on both datasets, particularly when using nonlinear fusion. Additionally, fusing biometric information across multiple tasks further improves authentication performance. These findings support the hypothesis that continuous gaze offset may serve as useful auxiliary information under conditions of degraded or noisy eye tracking.
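摘要比较了线性与非线性两类分数级融合。下面用标量示例勾勒这两种融合方式(假设成分:权重与 logistic 形式均为演示取值,论文的具体融合模型与拟合参数并未在摘要中公开):

```python
import math

# Toy sketch of score-level fusion as described above: combining a primary
# biometric similarity score with an auxiliary gaze-offset score.
# The weights are illustrative, not the paper's fitted values.

def linear_fusion(s_embed, s_offset, w=0.8):
    # Weighted sum of the two match scores.
    return w * s_embed + (1 - w) * s_offset

def nonlinear_fusion(s_embed, s_offset, w1=4.0, w2=2.0, b=-3.0):
    # Logistic combination, one simple nonlinear alternative.
    z = w1 * s_embed + w2 * s_offset + b
    return 1.0 / (1.0 + math.exp(-z))

print(round(linear_fusion(0.9, 0.5), 3))     # → 0.82
print(round(nonlinear_fusion(0.9, 0.5), 3))  # → 0.832
```

当眼动追踪信号退化、主分数可靠性下降时,这类融合使注视偏移分数得以补充判别信息,这正是摘要结论所指的场景。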
[HC-23] When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic–Actor Loop for Agentic Reasoning
【速读】:该论文旨在解决人机协作中研究人员与智能代理(Agent)交互如何影响科研成果的问题,特别是在物理推理任务中,不同交互机制对结果的影响尚不明确。解决方案的关键在于提出SCALAR(Structured Critic–Actor Loop for AI Reasoning),这是一个基于Actor–Critic–Judge架构的迭代反馈系统:Actor负责生成解题方案,Critic提供逐轮反馈以优化方案,Judge则独立评估对话过程与参考答案的一致性。实验表明,多轮交互普遍优于单次尝试,但改进机制和提示策略的效果高度依赖于Actor与Critic的具体组合;尤其在异构配置(如轻量级Haiku Actor配合强大Sonnet Critic)下,建设性反馈显著提升平均得分,而同家族模型间的反馈策略差异较小,说明交互结构设计对AI驱动科学发现具有关键影响。
链接: https://arxiv.org/abs/2605.06772
作者: Vasilis Niarchos,Constantinos Papageorgakis,Alexander G. Stapleton,Sokratis Trifinopoulos
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); High Energy Physics - Phenomenology (hep-ph); High Energy Physics - Theory (hep-th)
备注: 17 pages; 9 figures
Abstract:As large language models (LLMs) show increasing promise on research-level physics reasoning tasks and agentic AI becomes more common, a practical question emerges: How does the interaction between researchers and agents affect the results? We study this using SCALAR (Structured Critic–Actor Loop for AI Reasoning), an Actor–Critic–Judge pipeline applied to quantum field theory and string theory problems. The Actor proposes solutions, the Critic provides iterative feedback, and an independent Judge evaluates the transcript against reference solutions. We vary the Actor persona, the Critic feedback strategy, and the Actor model family and scale. Multi-turn dialogue improves over single-shot attempts throughout, but both the mechanism of improvement and the value of different prompting choices depend strongly on the Actor–Critic pairing. Increasing the scale within one model family (e.g. from the 8B-parameter DeepSeek-R1 variant to DeepSeek-R1 70B) improves some easier-problem behavior, but does not remove the hardest bottleneck we observe. Critic feedback strategy matters most clearly in the asymmetric Actor–Critic setting (e.g., a lightweight Haiku Actor guided by a stronger Sonnet Critic), where constructive feedback improves mean-score outcomes. In same-family Actor–Critic settings, strategy effects are weaker: lenient feedback is sometimes favored, while strict and adversarial feedback are not beneficial. Taken together, SCALAR provides a controlled testbed for evaluating which interaction structures help or hinder AI-driven scientific discovery.
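SCALAR 的 Actor–Critic–Judge 循环可以用一个玩具数值任务来示意(假设成分:下面的 actor/critic/judge 都是桩函数,真实流水线调用的是 LLM;TARGET 仅代表参考答案):

```python
TARGET = 42  # stands in for the reference solution held by the Judge

def actor(guess, feedback):
    # Toy "Actor": revise the previous attempt based on the critique.
    if feedback == "too low":
        return guess + 8
    if feedback == "too high":
        return guess - 3
    return guess

def critic(guess):
    # Toy "Critic": directional feedback only, never the answer itself.
    if guess < TARGET:
        return "too low"
    if guess > TARGET:
        return "too high"
    return "accept"

def judge(guess):
    # Independent "Judge": compares the final answer against the reference.
    return guess == TARGET

guess, feedback = 10, None
for _ in range(20):  # multi-turn dialogue budget
    guess = actor(guess, feedback)
    feedback = critic(guess)
    if feedback == "accept":
        break
print(guess, judge(guess))  # → 42 True
```

关键结构在于 Critic 只给方向性反馈而不泄露答案,Judge 独立于对话过程,仅对最终结果与参考解作比对;这与摘要中“多轮对话优于单次尝试”的设定相对应。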
[HC-24] STDA-Net: Spectrogram-Based Domain Adaptation for cross-dataset Sleep Stage Classification
【速读】:该论文旨在解决跨数据集睡眠分期(sleep stage classification)中因脑电图(EEG)通道布局、采样率、记录环境及受试者群体差异导致的性能下降问题。现有深度学习方法多依赖一维(1D)EEG信号表示,而对二维频谱图(spectrogram)输入在无监督域自适应框架中的应用研究较少。解决方案的关键在于提出STDA-Net(Spectrogram-based Temporal Domain Adaptation Network),其核心由三部分组成:基于卷积神经网络(CNN)的频谱图特征提取模块、用于建模睡眠动态时序特性的双向长短期记忆(BiLSTM)模块,以及无需目标域标签即可实现源域到目标域特征对齐的域对抗神经网络(DANN)。该架构通过2D频谱表示结合时序建模与对抗域适配,在多个公开数据集上实现了高精度、低方差的跨数据集睡眠分期性能,显著优于传统1D方法,展现出更强的鲁棒性和可重复性。
链接: https://arxiv.org/abs/2605.06736
作者: Unaza Tallal,Shruti Kshirsagar,Ankita Shukla
机构: Wichita State University (威奇托州立大学); University of Nevada, Reno (内华达大学雷诺分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: submitted to IEEE SMC conference
Abstract:Accurate sleep stage classification across datasets remains challenging due to variability in EEG channel montages, sampling rates, recording environments, and subject populations. Although deep learning has shown considerable promise for automated sleep staging, most existing cross-dataset methods rely on one-dimensional EEG signal representations, whereas the use of two-dimensional spectrogram-based inputs within an unsupervised domain adaptation framework has remained largely unexplored. Here, we propose STDA-Net (Spectrogram-based Temporal Domain Adaptation Network), a framework that combines a convolutional neural network (CNN) for spectrogram-based feature extraction, a bidirectional long short-term memory (BiLSTM) module for temporal modeling of sleep dynamics, and a domain-adversarial neural network (DANN) for source-to-target feature alignment without requiring any labeled target-domain data during training. Experiments are conducted on three publicly available datasets (Sleep-EDF, SHHS-1, and SHHS-2) under six cross-dataset transfer settings. Results show that the proposed framework achieves an average accuracy of 89.03% and an average macro F1-score of 87.64%, consistently outperforming existing 1D baseline methods in terms of balanced classification performance, with substantially lower variance across five independent runs, indicating improved stability and reproducibility. Overall, these findings demonstrate that 2D spectrogram-based representations, combined with temporal modeling and adversarial domain adaptation, provide a robust and competitive alternative to conventional 1D EEG inputs for cross-dataset sleep staging.
[HC-25] Agentic AI and the Industrialization of Cyber Offense: Forecast, Consequences, and Defensive Priorities for Enterprises and the Mittelstand
【速读】:该论文旨在解决生成式 AI(Generative AI)代理系统在网络安全领域带来的新型攻击风险问题,特别是其通过压缩攻击生命周期显著降低攻击门槛,使低技能攻击者也能高效完成从初始入侵到持久化控制的多步骤攻击。解决方案的关键在于:将Agentic AI安全视为紧迫的运营问题,立即强化身份验证、部署抗钓鱼认证机制、提升补丁更新速度、加强CI/CD与Linux/容器环境的安全加固、建立代理治理框架、完善遥测监控能力,并提升恢复准备度,从而构建一套以防御前置和快速响应为核心的综合防护体系。
链接: https://arxiv.org/abs/2605.06713
作者: Christopher Koch
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 7 pages
Abstract:Agentic AI systems can plan, call tools, inspect code, interact with web applications, and coordinate multi-step workflows. These same capabilities change the economics of cyber offense. The central near-term risk is not that every low-skill criminal immediately becomes a frontier exploit researcher; it is that agentic AI compresses the attack lifecycle by lowering the cost of reconnaissance, phishing, credential abuse, vulnerability triage, exploit adaptation, and post-compromise decision support. This paper synthesizes current public evidence from national cybersecurity agencies, industry threat reports, agent security guidance, and research on LLM agents cyber capabilities. It introduces a Three Channel Agentic Cyber Risk Model and an Agentic Attack Compression Model, uses the 2026 Linux kernel Copy Fail incident as a case study for foothold-to-root acceleration, and develops a 2026 to 2028 forecast for large enterprises and the German and European Mittelstand. The paper concludes with a prioritized defense roadmap. Organizations should treat agentic AI security as an immediate operational problem: identity, phishing resistant authentication, patch velocity, CI/CD and Linux/container hardening, agent governance, telemetry, and recovery readiness must be strengthened now.
[HC-26] Vibe Econometrics and the Analysis Contract
【速读】:该论文试图解决的问题是:在生成式 AI(Generative AI)辅助下,因果推断方法(如“vibe econometrics”)的失败模式被工业化包装并广泛传播,导致原本依赖专家判断才能识别的错误(如方法-数据错配、信心洗白和隐形分叉)变得更具隐蔽性、传播力和说服力,从而引发新的治理挑战。解决方案的关键在于提出“分析契约”(Analysis Contract),这是一种预承诺框架,通过三个强制条件来重构AI辅助下的因果推理流程:一是方法-数据契约(method-data contract),确保方法与数据匹配;二是数据审计(data audit),验证输入质量;三是预承诺声明(pre-commitment statement),明确定义何种结果可作为反证。该框架将预分析计划(pre-analysis plan)与因果路线图(Causal Roadmap)逻辑迁移至AI辅助场景,实现对“vibe inference”类问题的可追溯、可验证与可问责管理。
链接: https://arxiv.org/abs/2605.08071
作者: Lydia Ashton(University of Wisconsin-Madison)
机构: 未知
类目: Econometrics (econ.EM); Human-Computer Interaction (cs.HC); Methodology (stat.ME)
备注: 20 pages, 2 figures. Appendices A-C (fillable templates) provided as ancillary file. Companion materials: this https URL . Also posted on SSRN: this https URL
Abstract:“Vibe coding” and “vibe analytics” have been framed as a democratization of technical capability. This paper argues that AI-assisted methodology more broadly, or what I call “vibe methodology,” also democratizes the failure modes specific to each domain. When AI assists with methods whose validity depends on assumptions that cannot be verified from the output alone (a class I call “vibe inference”), the failure surface is structurally different: the output does not reliably signal invalidity, and when it does, recognizing the signal requires the expertise the workflow bypasses. I focus on “vibe econometrics,” the subset of AI-assisted causal analysis where identification can be named faster than it can be audited. The claim of this paper is not that AI invents inferential failures that did not previously exist, but that it changes their incidence, observability, and persuasive force enough to create a practically distinct governance problem. This results in three failure modes: method-data mismatch, where AI bypasses expertise at execution; confidence laundering, where AI amplifies the credibility of formatted output; and invisible forking, which spans both. What is new is not the failure modes but AI’s industrialization of their packaging. The barrier between naming a method and executing it has collapsed, and weak foundations, dressed as rigorous analysis, now reach audiences at a scale, speed, and polish that previously required expertise. I propose the Analysis Contract, a pre-commitment framework that adapts the logic of pre-analysis plans and the Causal Roadmap to the AI-assisted setting. The contract imposes three conditions before a causal claim is made: a method-data contract, a data audit, and a pre-commitment statement defining what would count as a disconfirming result. The framework generalizes across domains of vibe inference through domain-specific instantiation.
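The three conditions of the Analysis Contract can be pictured as a checklist that gates any causal claim. A hypothetical sketch (the field names are illustrative, not the author's fillable templates from the appendices):

```python
from dataclasses import dataclass

@dataclass
class AnalysisContract:
    """Hypothetical encoding of the contract's three conditions."""
    method_data_contract: str = ""   # why this method's assumptions fit these data
    data_audit: str = ""             # what was verified about the inputs
    disconfirming_result: str = ""   # what result would count against the claim

    def permits_causal_claim(self) -> bool:
        # A causal claim is allowed only once all three are filled in.
        return all([self.method_data_contract, self.data_audit,
                    self.disconfirming_result])

c = AnalysisContract(
    method_data_contract="difference-in-differences; parallel trends argued ex ante",
    data_audit="panel balance and pre-period coverage checked",
    disconfirming_result="a significant pre-trend coefficient",
)
```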
计算机视觉
[CV-0] 123D: Unifying Multi-Modal Autonomous Driving Data at Scale
【速读】:该论文旨在解决自动驾驶领域多模态传感器数据集在格式、同步机制和标注规范上的异构性问题,这些差异严重阻碍了跨数据集的模型训练与泛化能力评估。其核心解决方案是提出一个名为123D的开源框架,通过统一接口(single API)整合来自多个真实世界和合成数据集的多模态信息(如摄像头、激光雷达、车辆状态、交通灯、高精地图等),并采用独立的时间戳事件流(timestamped event stream)存储方式,支持任意采样率下的同步或异步访问,从而消除了传统数据格式碎片化带来的开发障碍,并为跨数据集的模型迁移学习(如3D目标检测)和强化学习规划提供了可扩展的基础平台。
链接: https://arxiv.org/abs/2605.08084
作者: Daniel Dauner,Valentin Charraut,Bastian Berle,Tianyu Li,Long Nguyen,Jiabao Wang,Changhui Jing,Maximilian Igl,Holger Caesar,Boris Ivanovic,Yiyi Liao,Andreas Geiger,Kashyap Chitta
机构: KE:SAI; University of Tübingen, Tübingen AI Center; NVIDIA Research; Valeo Brain; OpenDriveLab at Shanghai Innovation Institute; Zhejiang University; Delft University of Technology
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset’s pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at this https URL.
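The storage model described above, each modality as an independent timestamped event stream queried synchronously or asynchronously, can be sketched as follows; `EventStream.nearest` is a hypothetical helper, not the actual 123D API:

```python
import bisect

class EventStream:
    """One modality stored as (timestamp, payload) pairs with no
    prescribed rate (hypothetical helper, not the real 123D API)."""
    def __init__(self, events):
        self.events = sorted(events)
        self.ts = [t for t, _ in self.events]

    def nearest(self, t):
        """Synchronous access: the payload whose timestamp is closest to t."""
        i = bisect.bisect_left(self.ts, t)
        js = [j for j in (i - 1, i) if 0 <= j < len(self.ts)]
        j = min(js, key=lambda n: abs(self.ts[n] - t))
        return self.events[j][1]

camera = EventStream([(0.00, "img0"), (0.10, "img1"), (0.20, "img2")])
lidar = EventStream([(0.05, "scan0"), (0.15, "scan1")])
# Modalities with different rates, queried at one common time:
sample = (camera.nearest(0.12), lidar.nearest(0.12))
```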
[CV-1] Normalizing Trajectory Models
【速读】:该论文旨在解决扩散模型在少步采样(few-step generation)场景下因假设连续微小高斯去噪步骤失效而导致性能下降的问题。现有方法如蒸馏、一致性训练或对抗目标虽能缓解此问题,但牺牲了概率建模的似然框架(likelihood framework)。其解决方案的关键在于提出归一化轨迹模型(Normalizing Trajectory Models, NTM),通过将每个反向步骤建模为具有精确似然训练的表达性强条件归一化流(conditional normalizing flow),并结合浅层可逆块与跨轨迹的深层并行预测器,构建端到端可从头训练或基于预训练流匹配模型初始化的网络结构;此外,NTM 的精确轨迹似然特性还支持自蒸馏机制——利用模型自身得分函数训练轻量去噪器,仅用四步即可生成高质量样本,同时保持生成轨迹的精确似然性。
链接: https://arxiv.org/abs/2605.08078
作者: Jiatao Gu,Tianrong Chen,Ying Shen,David Berthelot,Shuangfei Zhai,Josh Susskind
机构: Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Diffusion-based models decompose sampling into many small Gaussian denoising steps – an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model’s own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
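The exact-likelihood property NTM builds on is the standard normalizing-flow change of variables, log p(x) = log p(z) + log|det ∂z/∂x|. A one-dimensional affine-step sketch (illustrative, not the paper's conditional flows):

```python
import numpy as np

def affine_flow_logp(x, mu, sigma):
    """Exact log-likelihood of one invertible affine step z = (x - mu) / sigma
    under a standard-normal base distribution:
        log p(x) = log N(z; 0, 1) + log |dz/dx|,  with |dz/dx| = 1 / sigma."""
    z = (x - mu) / sigma
    log_base = -0.5 * (z ** 2 + np.log(2.0 * np.pi))
    log_det = -np.log(sigma)
    return log_base + log_det

# Must agree with the closed-form N(x; mu, sigma^2) log-density.
lp = affine_flow_logp(1.3, mu=1.0, sigma=2.0)
```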
[CV-2] EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction
【速读】:该论文旨在解决基于事件的图像重建方法中现有架构的局限性问题:卷积神经网络(Convolutional Neural Networks, CNNs)难以捕捉全局特征相关性,而视觉Transformer(Vision Transformers, ViTs)则因二次计算复杂度(O(n²))在高分辨率场景下应用受限。解决方案的关键在于提出EmambaIR,一种面向空间稀疏、时间连续事件流的高效视觉状态空间模型(Visual State Space Model)。其核心创新包括两个模块:跨模态Top-k稀疏注意力模块(Cross-modal Top-k Sparse Attention Module, TSAM),用于引导像素级稀疏注意力以实现丰富且紧凑的跨模态特征融合;以及门控状态空间模块(Gated State-Space Module, GSSM),通过非线性门控单元增强线性复杂度(O(n))状态空间模型的时序表征能力,从而在不增加显著计算负担的前提下有效建模全局上下文依赖关系。
链接: https://arxiv.org/abs/2605.08073
作者: Wei Yu,Yunhang Qian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent event-based image reconstruction methods predominantly rely on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to process complementary event information. However, these architectures face fundamental limitations: CNNs often fail to capture global feature correlations, whereas ViTs incur quadratic computational complexity (e.g., O(n^2) ), hindering their application in high-resolution scenarios. To address these bottlenecks, we introduce EmambaIR, an Efficient visual State Space Model designed for image reconstruction using spatially sparse and temporally continuous event streams. Our framework introduces two key components: the cross-modal Top-k Sparse Attention Module (TSAM) and the Gated State-Space Module (GSSM). TSAM efficiently performs pixel-level top-k sparse attention to guide cross-modal interactions, yielding rich yet sparse fusion features. Subsequently, GSSM utilizes a nonlinear gated unit to enhance the temporal representation of vanilla linear-complexity ( O(n) ) SSMs, effectively capturing global contextual dependencies without the typical computational overhead. Extensive experiments on six datasets across three diverse image reconstruction tasks - motion deblurring, deraining, and High Dynamic Range (HDR) enhancement - demonstrate that EmambaIR significantly outperforms state-of-the-art methods while offering substantial reductions in memory consumption and computational cost. The source code and data are publicly available at: this https URL
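The top-k sparse attention idea, each query attending only to its k best-scoring keys, can be sketched in a few lines (single-head, unbatched; the paper's cross-modal TSAM is more elaborate):

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Each query attends only to its top_k best-scoring keys; the
    remaining attention weights are exactly zero (sparse fusion)."""
    scores = q @ k.T                                    # (Nq, Nk) similarities
    thresh = np.sort(scores, axis=1)[:, -top_k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    w = np.exp(masked - masked.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                   # softmax over kept keys only
    return w @ v, w

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 5))     # 2 queries
k = rng.normal(size=(8, 5))     # 8 keys
v = rng.normal(size=(8, 3))     # 8 values
out, w = topk_sparse_attention(q, k, v, top_k=3)
```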
[CV-3] Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment CVPR2026
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在空间智能(Spatial Intelligence)任务中因依赖二维像素对齐表示而导致的三维场景理解不一致与序列化效率低下的问题。现有方法通常沿用传统2D处理流程,难以实现稳定的3D空间推理,尤其在视频序列输入下表现不佳。其解决方案的关键在于提出一种名为Proxy3D的方法,通过构建紧凑且全面的3D代理表示(3D proxy representations)来替代传统的2D特征表示:首先利用语义和几何编码器提取场景特征,并基于语义感知聚类生成3D空间中的代理集合;随后通过自建的SpaceSpan数据集和多阶段训练策略,将该3D代理表示有效整合进VLM中,从而在短序列输入下实现更高效、更准确的空间推理能力,在3D视觉问答、视觉定位及通用空间智能基准测试中达到竞争性或领先性能。
链接: https://arxiv.org/abs/2605.08064
作者: Jerry Jiang,Haowen Sun,Denis Gudovskiy,Yohei Nakata,Tomoyuki Okuno,Kurt Keutzer,Wenzhao Zheng
机构: Tsinghua University (清华大学); Panasonic AI Lab (松下人工智能实验室); Panasonic DX-CPS (松下数字创新与解决方案部门); UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project page: this https URL
Abstract:Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.
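The proxy construction reduces many pixel-aligned tokens to a small set of cluster centers. A toy stand-in using plain k-means (the paper's clustering is semantic-aware and operates in 3D; names here are illustrative):

```python
import numpy as np

def cluster_proxies(feats, k=2, iters=10):
    """Compress N per-pixel features into k proxy tokens with plain
    k-means (a stand-in for the paper's semantic-aware clustering)."""
    N, C = feats.shape
    centers = feats[:: max(1, N // k)][:k].copy()       # deterministic init
    assign = np.zeros(N, dtype=int)
    for _ in range(iters):
        d = ((feats[:, None] - centers[None]) ** 2).sum(-1)   # (N, k) sq. distances
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = feats[assign == j].mean(0)
    return centers, assign

# Two well-separated groups of "pixels" collapse to two proxy tokens,
# shortening the vision sequence the VLM has to consume.
feats = np.concatenate([np.zeros((50, 8)), np.ones((50, 8))])
proxies, assign = cluster_proxies(feats, k=2)
```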
[CV-4] Flow-OPD: On-Policy Distillation for Flow Matching Models
【速读】:该论文旨在解决现有流匹配(Flow Matching, FM)文本到图像模型在多任务对齐中面临的两大瓶颈:一是由标量奖励引发的奖励稀疏性问题,二是联合优化异构目标时产生的梯度干扰,二者共同导致指标间的“跷跷板效应”(seesaw effect)和普遍存在的奖励劫持(reward hacking)现象。解决方案的关键在于提出 Flow-OPD,这是首个将在线策略蒸馏(On-Policy Distillation, OPD)引入流匹配模型的统一后训练框架。其核心创新包括:第一阶段通过单奖励 GRPO 微调构建领域专业化教师模型,使每个专家独立达到性能上限;第二阶段采用基于流的冷启动(Flow-based Cold-Start)建立稳健初始策略,并通过三步协同机制——在线采样、任务路由标注与密集轨迹级监督——将异构知识无缝整合至单一学生模型;此外引入流形锚定正则化(Manifold Anchor Regularization, MAR),利用任务无关教师提供全数据监督以锚定生成于高质量流形,有效缓解纯强化学习驱动对齐带来的美学退化问题。
链接: https://arxiv.org/abs/2605.08063
作者: Zhen Fang,Wenxuan Huang,Yu Zeng,Yiming Zhao,Shuang Chen,Kaituo Feng,Yunlong Lin,Lin Chen,Zehui Chen,Shaosheng Cao,Feng Zhao
机构: University of Science and Technology of China (中国科学技术大学); University of California, Los Angeles (加州大学洛杉矶分校); The Chinese University of Hong Kong (香港中文大学); Xiaohongshu Inc. (小红书公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a ‘seesaw effect’ of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent ‘teacher-surpassing’ effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
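The GRPO fine-tuning stage referenced above relies on group-relative advantages: rewards normalized within each group of rollouts rather than against a learned value baseline. A minimal sketch:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each rollout's reward is normalized
    against its own group's mean and standard deviation, giving a
    dense signal without a learned value baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = grpo_advantages([0.2, 0.9, 0.5, 0.4])   # one group of 4 rollouts
```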
[CV-5] 6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks
【速读】:该论文旨在解决6D位姿估计(6D pose estimation)问题,即准确预测物体在三维空间中的位置和姿态。其解决方案的关键在于提出了一种模块化框架:首先利用YOLOv10m进行高效的目标检测,随后采用基于ResNet18的网络从RGB图像中回归关键点热力图(keypoint heatmap),再通过PnP RANSAC算法从提取的关键点中计算出物体的6D位姿。此外,作者进一步引入深度信息(depth data)并设计跨模态融合架构,在多个阶段实现RGB与深度特征的交互,显著提升了精度;同时通过优化激活函数和学习率调度策略等训练技巧,使RGB-only模型在LINEMOD数据集上达到84.50%的平均ADD-based准确率,而RGB-D融合模型则提升至92.41%。
链接: https://arxiv.org/abs/2605.08059
作者: Ismail Aljosevic,Amir Masoud Almasi,Ana Parovic,Ashkan Shafiei
机构: Politecnico di Torino (都灵理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Source code available at: this https URL
Abstract:In this paper, we propose a modular framework for 6D pose estimation based on keypoint heatmap regression. Our approach combines YOLOv10m for object detection with a ResNet18-based network that predicts 2D heatmaps from RGB images. Keypoints extracted from these heatmaps are used to estimate the 6D object pose via the PnP RANSAC algorithm. We compare different keypoint selection strategies to assess their impact on pose accuracy. Additionally, we extend the baseline by incorporating depth data using a cross-fusion architecture, which enables interaction between RGB and depth features at multiple stages. We further explore general training improvements, such as experimenting with activation functions and learning rate scheduling strategies to improve model performance. Our best RGB-only model achieved a mean ADD-based accuracy of 84.50%, while the RGB-D fusion model reached 92.41% on the LINEMOD dataset. The code is available at this https URL.
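Keypoint extraction from a predicted heatmap is often done with a soft-argmax, a softmax-weighted average of pixel coordinates. A sketch of that decoding step (illustrative; the paper's exact decoding may differ):

```python
import numpy as np

def soft_argmax_2d(heatmap, beta=25.0):
    """Differentiable keypoint decoding: a softmax-weighted average of
    pixel coordinates; beta sharpens the distribution around the peak."""
    h, w = heatmap.shape
    p = np.exp(beta * (heatmap - heatmap.max()))
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float((p * xs).sum()), float((p * ys).sum())   # (x, y)

hm = np.zeros((64, 64))
hm[20, 37] = 1.0                 # sharp peak at row 20, column 37
x, y = soft_argmax_2d(hm)        # recovers roughly (37.0, 20.0)
```

The recovered 2D-3D correspondences would then feed PnP RANSAC (e.g., OpenCV's `cv2.solvePnPRansac`) to obtain the 6D pose, as in the paper's pipeline.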
[CV-6] Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization CVPR2026
【速读】:该论文旨在解决生成式AI(Generative AI)在处理具有高度时空约束的人体运动时的局限性,例如严苛的空间障碍或特定步数要求等零样本(zero-shot)目标函数。现有方法虽能应对多种未见过的约束条件,但在极端约束场景下表现不佳。其解决方案的关键在于提出一种基于无训练扩散噪声优化框架的检索引导方法:首先通过关系任务解析(relational task parsing)识别并分组目标约束,定位难以满足的难点;随后利用检索到的参考运动数据提供更优的扩散噪声初始值,该初始值由奖励引导的掩码融合随机噪声与检索噪声构成;最终通过优化改进后的初始扩散噪声,实现对高约束任务的有效求解。此方法借助大语言模型(LLM)进一步实现自动推理“应检索什么”,从而在无需训练的前提下提升虚拟代理的行为智能性。
链接: https://arxiv.org/abs/2605.08054
作者: Hanchao Liu,Fang-Lue Zhang,Shining Zhang,Tai-Jiang Mu,Shi-Min Hu
机构: Tsinghua University (清华大学); University of New South Wales (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2026
Abstract:Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by retrieved reference. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging LLM for relational task parsing, the whole framework is further enabled to automatically reason for what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme.
[CV-7] MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation
【速读】:该论文旨在解决生成式AI(Generative AI)中说话人脸(talking-head)视频生成时多条件协同控制的问题,即如何统一建模身份、头部姿态、面部表情和嘴部动态等复杂因素,并避免不同条件之间的破坏性干扰。其解决方案的关键在于提出了一种多条件视频扩散框架MoCoTalk,通过引入自适应多条件路由机制(Adaptive Multi-Condition Router),实现通道级、时间步感知的门控融合策略,使条件信号的融合方式可根据特征子空间与噪声水平动态调整;同时设计了唇部增强的阴影网格(Mouth-Augmented Shading Mesh),以解耦头动、嘴动、表情和光照并提供时序一致的几何先验,从而显著提升音频-视觉对齐精度与属性可控性。
链接: https://arxiv.org/abs/2605.08050
作者: Xinyan Ye,Jiankang Deng,Abbas Edalat
机构: Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.
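The Adaptive Multi-Condition Router performs channel-wise, timestep-aware gating over the four condition streams. A hypothetical sketch with random weights standing in for the learned router:

```python
import numpy as np

def route_conditions(streams, timestep_emb, W):
    """streams: (S, C) array, one feature vector per condition stream
    (reference image, keypoints, shading mesh, audio).  W: (S, C, D)
    hypothetical learned weights mapping the timestep embedding (D,)
    to one gating logit per stream and channel."""
    logits = W @ timestep_emb                  # (S, C): channel-wise, timestep-aware
    g = np.exp(logits - logits.max(axis=0))
    g /= g.sum(axis=0)                         # softmax over the S streams, per channel
    return (g * streams).sum(axis=0), g        # gated fusion -> (C,)

rng = np.random.default_rng(1)
S, C, D = 4, 6, 8
streams = rng.normal(size=(S, C))
fused, g = route_conditions(streams, rng.normal(size=D), rng.normal(size=(S, C, D)))
```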
[CV-8] SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
【速读】:该论文旨在解决文本到图像生成模型在实现复杂视觉意图时面临的挑战,即如何在接地(grounding)、生成(generation)和验证(verification)过程中持续跟踪并维持语义承诺(semantic commitments),避免因语义承诺生命周期的不连续性(称为概念裂隙,Conceptual Rift)而导致意图失真。解决方案的关键在于提出SCOPE框架——一个基于规范引导的技能编排系统,通过维护动态演化的结构化规范来持久追踪语义承诺,并在发现未解决或违反的承诺时,条件性调用检索、推理与修复等技能进行干预,从而提升复杂图像生成任务中语义一致性与准确性。
链接: https://arxiv.org/abs/2605.08043
作者: Tianfei Ren,Zhipeng Yan,Yiming Zhao,Zhen Fang,Yu Zeng,Guohui Zhang,Hang Xu,Xiaoxiao Ma,Shiting Huang,Ke Xu,Wenxuan Huang,Lionel Z. Wang,Lin Chen,Zehui Chen,Jie Huang,Feng Zhao
机构: University of Science and Technology of China (中国科学技术大学); The Hong Kong Polytechnic University (香港理工大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
[CV-9] Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中存在的隐私泄露、版权侵权及偏见问题,通过机器遗忘(machine unlearning)技术移除敏感知识。现有方法主要依赖微调语言解码器,导致遗忘效果浅层化,无法清除底层视觉表征,并常引发对象幻觉(object hallucination)。其解决方案的关键在于提出HFRU框架——一种基于强化学习的遗忘机制,专门作用于视觉编码器(vision encoder),采用两阶段策略:首先破坏视觉与语言模态间的对齐关系,再利用GRPO(Generalized Reward Policy Optimization)优化过程,结合复合奖励函数(包括抽象奖励,abstraction reward)引导语义有效的替代,从而实现深度语义删除并显著抑制幻觉现象。实验表明,HFRU在目标识别和人脸身份任务中可实现超过98%的遗忘率与保留性能,且幻觉程度极低,显著优于现有方法。
链接: https://arxiv.org/abs/2605.08031
作者: Kaidi Jia,Yujie Lin,Chengyi Yang,Jiayao Ma,Jinsong Su
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior work. Our code and implementation details are available at this https URL.
[CV-10] PET-Adapter: Test-Time Domain Adaptation for Full and Limited-Angle PET Image Reconstruction
【速读】:该论文旨在解决正电子发射断层成像(Positron Emission Tomography, PET)图像重建中因泊松噪声和物理退化因素导致的图像质量下降问题,尤其是在有限角度采集条件下更为显著。现有深度学习方法在面对未见过的临床数据分布时泛化能力不足,需大量重训练才能适应新场景。其解决方案的关键在于提出PET-Adapter,一种测试时域自适应框架,可在不依赖配对真值的情况下,将仅在体模数据上预训练的生成式PET重建模型快速适配至真实临床数据。该方法通过引入分层低秩解剖条件控制与基于有序子集期望最大化(Ordered Subset Expectation Maximization, OSEM)的物理信息初始化策略,实现从物理引导重建出发的高效扩散过程,将扩散步骤从50步减少至2步而不损失图像质量,从而显著提升重建性能与计算效率。
链接: https://arxiv.org/abs/2605.08030
作者: Rüveyda Yilmaz,Yuli Wu,Johannes Stegmaier,Volkmar Schulz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Positron Emission Tomography (PET) image reconstruction is inherently challenged by Poisson noise and physical degradation factors, which are further exacerbated in limited-angle acquisitions. While deep learning methods demonstrate promising performance, their generalization to unseen clinical data distributions remains limited without extensive retraining. We propose PET-Adapter, a test-time domain adaptation framework for generative PET reconstruction models pretrained solely on phantom data. Our method enables adaptation to clinical datasets with varying anatomies, tracers, and scanner configurations without requiring paired ground truth. PET-Adapter introduces layer-wise low-rank anatomical conditioning during adaptation and Ordered Subset Expectation Maximization-based warm-starting that initializes the generation from physics-informed reconstructions, reducing diffusion steps from 50 to 2 without compromising quality. Experiments across multiple clinical datasets demonstrate superior 3D reconstruction performance in both full-angle and limited-angle settings, highlighting the clinical feasibility and computational efficiency of the proposed approach.
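The OSEM warm-start mentioned above is a subset-accelerated variant of the classical MLEM update x ← x · Aᵀ(y/Ax) / Aᵀ1. One full-data MLEM step in textbook form (not the paper's implementation):

```python
import numpy as np

def mlem_step(x, A, y, eps=1e-12):
    """One MLEM update: x <- x * A^T(y / (A x)) / A^T 1.
    OSEM applies the same multiplicative update over ordered
    subsets of the projection data for faster convergence."""
    ratio = y / (A @ x + eps)                  # measured / current forward projection
    return x * (A.T @ ratio) / (A.T @ np.ones_like(y) + eps)

# Sanity check: with A = I the data are recovered in a single step.
A = np.eye(3)
y = np.array([4.0, 1.0, 9.0])
x1 = mlem_step(np.ones(3), A, y)
```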
[CV-11] STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
【速读】:该论文旨在解决当前多模态生成系统中文本与图像生成机制结构不匹配的问题,即现有方法通常将自回归语言建模(autoregressive language modeling)与基于扩散模型的图像生成结合,导致因果文本生成与迭代去噪视觉生成之间存在架构冲突。其解决方案的关键在于采用自回归归一化流(autoregressive normalizing flows),因其与大型语言模型(LLM)共享相同的因果掩码(causal mask)、键值缓存(KV-cache)机制和从左到右的结构,从而成为实现真正统一多模态生成的最自然范式。作者提出STARFlow2框架,基于Pretzel架构垂直交错预训练视觉语言模型(VLM)流与TarFlow流,并通过残差跳跃连接和统一的FAE潜在空间设计,使文本与图像输出可直接进入KV缓存而无需重新编码,实现了缓存友好的交错生成。
链接: https://arxiv.org/abs/2605.08029
作者: Ying Shen,Tianrong Chen,Yuan Gao,Yizhe Zhang,Yuyang Wang,Miguel Ángel Bautista,Shuangfei Zhai,Joshua M. Susskind,Jiatao Gu
机构: Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 9 figures
Abstract:Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers–sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs–making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
[CV-12] TRAS: An Interactive Software for Tracing Tree Ring Cross Sections
【速读】:该论文旨在解决树木年轮(tree ring)标记在树轮年代学(dendrochronology)和树轮测量(dendrometry)中依赖人工操作所导致的效率低、主观性强及难以扩展至大规模图像数据集的问题。解决方案的关键在于提出Tree Ring Analyzer Suite (TRAS),一个开源图形化软件,集成三种互补的检测算法:传统图像处理方法CS-TRD与两种深度学习方法DeepCS-TRD和INBD,实现自动 delineation(边界提取)、手动修正与定量测量;其中DeepCS-TRD在18张专家标注的Pinus taeda L.横切面图像上达到F-score 81.0%和精度86.4%,显著减少约80%的手动修正工作量,并通过后处理界面有效校正常见错误(如跳跃传播或近节疤处假阳性),从而提供一套灵活、可复现的跨平台(Windows/macOS/Linux)树轮分析方案。
链接: https://arxiv.org/abs/2605.08025
作者: Henry Marichal,Diego Passarella,Gregory Randall
机构: Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República; Procesos Industriales de la Madera, CENUR Noreste, Universidad de la República
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This manuscript has been accepted for publication in Forestry: An International Journal of Forest Research, published by Oxford University Press. This is an author-produced version and may differ from the final Version of Record. The final published version will be available through the journal website
Abstract:Tree ring marking remains a key step in dendrometry and dendrochronology, but it is often performed manually, making the process time-consuming, subjective, and difficult to scale to large image datasets. We present the Tree Ring Analyzer Suite (TRAS), an open-source graphical software for automatic delineation, manual correction, and measurement of tree rings in wood cross-sectional images. TRAS integrates three complementary detection algorithms: the classical image-processing method CS-TRD and two deep-learning approaches, DeepCS-TRD and INBD. The interface allows users to refine automatic detections, remove false positives, and manually add missing rings. It also computes dendrochronological metrics such as earlywood and latewood areas, ring perimeter, equivalent ring width, and custom path-based ring-width measurements. TRAS was evaluated on 18 expertly annotated Pinus taeda L. cross-section images. DeepCS-TRD achieved the best automatic detection performance, with an F-score of 81.0% and precision of 86.4%. Automatic detection reduced the required manual correction effort to approximately 20% of ring boundaries. For one-dimensional ring-width measurements, TRAS showed excellent agreement with CooRecorder (r > 0.99). Common detection errors, such as jump propagation or false positives near knots, were easily corrected through the postprocessing interface. TRAS provides a flexible and reproducible solution for tree-ring analysis on Windows, macOS, and Linux. Code is available at this https URL.
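One common way to define an "equivalent ring width" from measured ring areas (an assumption here; TRAS's exact definition may differ) is to treat the cumulative cross-section as a disc and take differences of equivalent radii:

```python
import math

def equivalent_ring_widths(ring_areas):
    """Width of each ring when the cumulative cross-section is modeled
    as a disc of equal area: the ring's width is the increase of the
    equivalent radius r = sqrt(A / pi) from inner to outer boundary."""
    widths, cum, r_prev = [], 0.0, 0.0
    for area in ring_areas:
        cum += area
        r = math.sqrt(cum / math.pi)
        widths.append(r - r_prev)
        r_prev = r
    return widths

# A pith disc of radius 1 plus an annulus extending to radius 3:
w = equivalent_ring_widths([math.pi * 1.0, math.pi * 8.0])
```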
[CV-13] SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere
【速读】:该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)中现有方法依赖大规模标注数据或任务特定训练流程,导致难以快速部署到新场景的问题。其解决方案的关键在于利用预训练多模态大语言模型(Multimodal Large Language Models, MLLMs)中间层特征中蕴含的丰富异常语义信息,并通过几何推理而非学习新表示来挖掘这些特征中的潜在判别能力。具体而言,提出SphereVAD框架,将异常判别重构为在单位超球面上的von Mises-Fisher(vMF)似然比测地线推理,结合Frechet均值中心化消除域偏移、整体场景注意力(HSA)增强跨视频先验一致性,以及vMF引导的球面测地线拉取(SGP)机制对齐模糊片段与方向原型,从而实现完全无需训练、零样本的异常检测。
链接: https://arxiv.org/abs/2605.08003
作者: Chao Huang,Penfei Wei,Wei Wang,Jie Wen,Zhihua Wang,Li Shen,Wenqi Ren,Xiaochun Cao
机构: Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 48 pages, 25 figures
Abstract:Video anomaly detection (VAD) aims to automatically identify events that deviate from normal patterns in untrimmed surveillance videos. Existing methods universally depend on large-scale annotations or task-specific training procedures, severely limiting their rapid deployment to novel scenes. We observe that intermediate-layer features of pre-trained multimodal large language models (MLLMs) already encode rich anomaly semantics, yet existing approaches rely on the language output pathway and fail to exploit the geometric discriminability latent in these representations. Based on this finding, we propose SphereVAD, a fully training-free, zero-shot VAD framework that recasts anomaly discrimination as von Mises-Fisher (vMF) likelihood-ratio geodesic inference on the unit hypersphere, unleashing latent discriminability through principled geometric reasoning rather than learning new representations. Specifically, SphereVAD first applies Frechet mean centering to unfold feature distributions and eliminate domain biases, then employs Holistic Scene Attention (HSA) to reinforce feature consistency using cross-video priors, and finally performs vMF-guided Spherical Geodesic Pulling (SGP) to align ambiguous segments with directional prototypes on the spherical manifold. This training-free pipeline requires only minimal synthetic images for calibration. SphereVAD establishes new state-of-the-art results among training-free approaches on three major benchmarks and remains competitive with fully supervised baselines. Code will be available upon acceptance.
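The vMF likelihood-ratio score at the heart of SphereVAD simplifies when both directional prototypes share a concentration κ: the normalizing constants cancel and the score is linear in the unit-normalized feature. A sketch (illustrative prototypes, not the paper's calibration):

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def vmf_llr_score(x, mu_normal, mu_anom, kappa=10.0):
    """von Mises-Fisher log-likelihood ratio on the unit hypersphere.
    With a shared concentration kappa the normalizing constants cancel,
    leaving score = kappa * (mu_anom - mu_normal) . x : positive means
    the segment leans toward the anomaly prototype direction."""
    return kappa * (unit(mu_anom) - unit(mu_normal)) @ unit(x)

mu_n = np.array([1.0, 0.0, 0.0])   # "normal" directional prototype
mu_a = np.array([0.0, 1.0, 0.0])   # "anomaly" directional prototype
s_hi = vmf_llr_score(np.array([0.1, 0.9, 0.1]), mu_n, mu_a)   # anomalous-looking
s_lo = vmf_llr_score(np.array([0.9, 0.1, 0.0]), mu_n, mu_a)   # normal-looking
```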
[CV-14] Rethinking Dense Optical Flow without Test-Time Scaling ViSCALE CVPR2026
【速读】:该论文旨在解决当前密集光流估计方法中依赖测试时计算扩展(test-time scaling)以提升精度的问题,即是否必须通过复杂的架构设计和多步迭代优化才能实现高精度。其核心观点是:强大的视觉语义与几何先验信息可通过预训练的基础模型(foundation models)编码,从而减少甚至替代对昂贵的测试时迭代精修的需求。解决方案的关键在于提出一种单次前向传播(single forward pass)的框架,利用冻结的DINO-v2骨干网络提取视觉语义特征,并融合来自单目深度基础模型的几何线索,构建统一表征后采用全局匹配策略直接估计稠密对应关系,无需递归更新或测试时优化,从而在不牺牲跨数据集泛化能力的前提下显著提升效率与性能。
链接: https://arxiv.org/abs/2605.08000
作者: Praroop Chanda,Suryansh Kumar
机构: Texas A&M University (德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at CVPR 2026; ViSCALE Workshop. Draft info: 10 pages, 2 figures, 4 tables
Abstract:Recent progress in dense optical flow has been driven by increasingly complex architectures and multi-step refinement for test-time scaling. While these approaches achieve strong benchmark performance, they also require substantial computation during inference. This raises a fundamental question: Is scaling test-time computation the only way to improve dense optical flow accuracy? We argue that it is not. Instead, powerful visual semantic and geometric priors encoded in modern foundation models can reduce, if not overcome, the need for computationally expensive iterative refinement at test-time. In this paper, we present a framework that estimates dense optical flow in a single forward pass, leveraging pretrained foundation representations, while avoiding iterative refinement and additional inference-time computation, thus offering an alternative to test-time scaling. Our method extracts visual semantic features from a frozen DINO-v2 backbone and combines them with geometric cues from a monocular depth foundation model. We fuse these complementary priors into a unified representation and apply a global matching formulation to estimate dense correspondences without recurrent updates or test-time optimization. Despite avoiding iterative refinement, our approach achieves strong cross-dataset generalization across challenging benchmarks. On Sintel Final, we obtain 2.81 EPE without refinement, significantly improving over state-of-the-art (SOTA) SEA-RAFT under comparable training conditions and outperforming RAFT, GMFlow (without refinement), and recent FlowSeek in the same setting. These results suggest that strong foundation priors can substitute for test-time scaling, offering a computationally efficient alternative to refinement-heavy pipelines.
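论文采用的全局匹配(global matching)思路可用如下示意代码理解:对两帧冻结骨干特征做全对相关,经softmax得到逐像素匹配分布,再以soft-argmax期望坐标与源坐标之差作为光流。函数与张量形状均为假设,仅说明原理,并非作者实现:

```python
import torch
import torch.nn.functional as F

def global_matching_flow(feat0, feat1):
    # feat0, feat1: (C, H, W) dense features from a frozen backbone (e.g. DINO-v2)
    C, H, W = feat0.shape
    f0 = feat0.flatten(1).t()          # (H*W, C) source features
    f1 = feat1.flatten(1).t()          # (H*W, C) target features
    corr = f0 @ f1.t() / C ** 0.5      # all-pairs correlation volume
    prob = F.softmax(corr, dim=1)      # matching distribution per source pixel
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], -1).float().view(-1, 2)   # target (x, y) coords
    match = prob @ grid                # expected match location (soft-argmax)
    flow = (match - grid).view(H, W, 2)
    return flow
```

单次前向即可得到稠密对应,无需递归精修,这正是论文主张的替代测试时扩展的路线。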
[CV-15] Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone and Ground Images
【速读】:该论文旨在解决跨视角定位(cross-view localization)问题,即确定地面图像在卫星影像中的精确位置与姿态。传统方法受限于仅能估计3-DoF(平面位置和航向角),因垂直卫星图像缺乏滚转(roll)、俯仰(pitch)和高度(altitude)的直接线索,需依赖平面运动和平面无倾斜假设,导致在真实地形中失效。解决方案的关键在于引入单张无人机(UAV)图像作为中间视点:该图像揭示了地表三维结构,并提供卫星无法获取的滚转、俯仰和高度信息,且仅需与地面相机存在空间重叠,无需已知相对位姿。基于此洞察,作者提出Cross3R模型,可一次性输出多视角3D点云、6-DoF相机位姿及各视角在卫星图上的(x,y)位置与航向角,显著提升定位精度与鲁棒性。
链接: https://arxiv.org/abs/2605.07978
作者: Qiwei Wang,Zhongyao Tuo,Xianghui Ze,Yujiao Shi
机构: ShanghaiTech University (上海科技大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cross-view localization classically asks: where does this ground image lie on the satellite tile? Existing methods are typically limited to 3-DoF estimates – an (x,y) position and a yaw angle – because nadir satellite imagery provides no direct cues for roll, pitch, or altitude, forcing a reliance on planar-motion and zero-tilt assumptions. These assumptions break on real terrain with slopes, ramps, and tilted camera mounts. To overcome this, we introduce a single UAV image as an intermediate viewpoint: it reveals the 3D structure invisible from nadir, supplies the cues for roll, pitch, and altitude that the satellite alone cannot provide, and needs only spatial overlap with the ground camera – no known relative pose is required. Building on this insight, we propose Cross3R, a flexible feed-forward model that ingests a satellite tile together with a UAV image, a ground image, or both, and, in a single forward pass, recovers a cross-view 3D point cloud, the 6-DoF poses of every input camera, and the on-tile (x,y) position and yaw of each perspective camera. For training and evaluation, we also construct CrossGeo, a 278K-image tri-view dataset spanning 85 scenes across every continent except Antarctica. On CrossGeo, Cross3R consistently outperforms feed-forward 3D baselines in point-cloud reconstruction, 6-DoF camera-pose estimation, and cross-view localization. On KITTI, it outperforms dedicated cross-view methods trained on KITTI on most metrics, despite having no KITTI training itself.
[CV-16] HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models
【速读】:该论文旨在解决文本到图像扩散模型中控制生成内容的难题,特别是当仅依赖文本条件空间进行编辑时,如更改主体或调整属性常导致背景变化或细节失真等问题。其核心问题在于现有方法将文本嵌入空间视为欧几里得空间并采用简单的线性变换,未能反映语义概念的真实组织方式。解决方案的关键在于发现文本编码器表示实际上位于超球面(hypersphere)上,且语义概念并非线性方向,而是具有结构化的各向异性分布,更适合用Kent分布建模;基于此洞察,作者提出HEART框架,通过在超球面上执行Kent感知的测地线变换实现无需训练、无微调的精确可控图像编辑,从而在保持原场景一致性的同时支持直观的主体替换和细粒度属性控制。
链接: https://arxiv.org/abs/2605.07973
作者: Arani Roy,Shristi Das Biswas,Kaushik Roy
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image diffusion models can generate visually stunning images, yet, controlling what appears and how it appears, remains surprisingly difficult, especially when operating solely within the constraints of the text-conditioning space. For example, changing a subject or adjusting an attribute often leads to unintended side effects, such as altered backgrounds or distorted details. This is because most existing text-based control methods treat the embedding space as Euclidean and apply simple linear transformations, which do not reflect how semantic concepts are actually organized. In this work, we take a step back and ask: what is the true geometry of these embeddings? We find that text encoder representations lie on a hypersphere, where concepts are not linear directions but structured, anisotropic distributions better captured by Kent distributions. Building on this insight, we propose HEART, a training-free framework that performs Kent-aware geodesic transformations directly on the hypersphere. By respecting the underlying geometry, HEART enables intuitive and precise edits, such as consistent subject replacement and fine-grained attribute control, while preserving the original scene. Importantly, HEART requires no finetuning, inversion, or optimization, and generalizes across diffusion model architectures. Our results show that a simple shift in perspective, from linear to spherical, can unlock fast, and controllable image generation.
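HEART在超球面上执行Kent感知的测地线变换,其最基本的构件是球面测地线插值(slerp)。以下为slerp的通用实现示意,不含论文中的Kent分布建模,仅展示"从线性到球面"这一视角转换的基本操作:

```python
import numpy as np

def slerp(e_src, e_tgt, t):
    # Geodesic (great-circle) interpolation on the unit hypersphere
    # between two embeddings; t=0 returns e_src's direction, t=1 e_tgt's.
    a = e_src / np.linalg.norm(e_src)
    b = e_tgt / np.linalg.norm(e_tgt)
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))   # angle between the two
    if omega < 1e-6:
        return a                                    # nearly parallel: no move
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
```

与欧氏线性插值不同,slerp的中间结果始终保持单位范数,即停留在文本嵌入实际所处的超球面上。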
[CV-17] DVD: Discrete Voxel Diffusion for 3D Generation and Editing
【速读】:该论文旨在解决基于稀疏体素(sparse voxels)的3D生成模型中,如何有效建模体素占据状态并实现高质量、可编辑且具备不确定性估计能力的生成过程这一问题。其关键解决方案是提出离散体素扩散(Discrete Voxel Diffusion, DVD)框架,将体素占据视为原生离散变量,避免了传统连续扩散模型中需进行的连续到离散阈值化操作,从而简化了体素生成流程,并支持直接的不确定性量化与编辑。DVD通过显式类别建模提升生成动态的可解释性,并利用预测熵作为鲁棒的不确定性度量,用于识别模糊区域和复杂样本,进而辅助数据筛选与质量评估;此外,还设计了一种基于块结构扰动模式的轻量级微调策略,可在单次采样过程中完成体素修复与编辑,无需额外计算开销或模型评估。
链接: https://arxiv.org/abs/2605.07971
作者: Zhengrui Xiang,Jiaqi Wu,Fupeng Sun,Heliang Zheng,Yingzhen Li
机构: Imperial College London (帝国理工学院); Math Magic (数学魔法)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework to generate, assess, and edit sparse voxels for SLat (Structured LATent) based 3D generative pipelines. Although discrete diffusion has not generally displaced continuous diffusion in image-like generation, we show that it can be an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a native discrete variable, DVD avoids continuous-to-discrete thresholding and provides a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. Furthermore, we leverage the predictive entropy as a robust uncertainty metric to identify ambiguous voxel regions and complicated samples, facilitating tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy using block-structured perturbation patterns. This approach empowers the model to inpaint and edit voxels within a single sampling round, requiring negligible auxiliary computation and no additional model evaluations.
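论文以预测熵作为体素不确定性度量,其计算可示意如下:对每个体素的类别logits取softmax后计算香农熵,熵值高的区域即为模糊体素,可用于数据筛选与质量评估。函数名为示例假设:

```python
import torch

def voxel_predictive_entropy(logits):
    # logits: (..., K) categorical logits over voxel states
    # (e.g. empty / occupied); returns Shannon entropy per voxel.
    logp = torch.log_softmax(logits, dim=-1)
    p = logp.exp()
    return -(p * logp).sum(dim=-1)   # high entropy marks ambiguous voxels
```

对二分类占据状态,熵的上界为 ln 2 ≈ 0.693(均匀分布),确定性预测的熵趋近于0。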
[CV-18] TimeLesSeg: Unified Contrast-Agnostic Cross-Sectional and Longitudinal MS Lesion Segmentation via a Stochastic Generative Model
【速读】:该论文旨在解决多发性硬化症(Multiple Sclerosis, MS)病变自动分割中因影像数据在分布(如扫描仪差异)和输入结构(横断面与纵向数据分离)上的异质性所带来的挑战。其核心解决方案是提出TimeLesSeg框架,通过统一的卷积神经网络实现对含或不含时间维度输入的病变分割;关键创新在于:1)利用病变掩膜建模病理先验信息,并通过空掩膜模拟无先验的横断面场景,使模型在两种模式下无缝运行;2)设计一种基于形态学操作随机变形的生成式数据增强管道,以模拟病灶演化过程,缓解纵向数据稀缺与不一致问题;3)采用基于高斯混合模型的域随机化策略实现对比度无关性,提升模型对不同成像强度分布的鲁棒性。
链接: https://arxiv.org/abs/2605.07955
作者: Vicent Caselles-Ballester,Eloy Martínez-Heras,Giuseppe Pontillo,Zoe Mendelsohn,Elena M. Marrón,Juan Luis García Fernández,Laia Subirats,Jon Stutters,Jeremy Chataway,Frederik Barkhof,Sara Llufriu,Ferran Prados
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multiple sclerosis (MS) expresses substantial clinical and radiological heterogeneity, which poses significant challenges for automatic lesion segmentation. The current deep learning-based SOTA is highly susceptible to changes in both distribution, e.g., changes in scanner; as well as the structure of inputs, evident in the current divide between cross-sectional and longitudinal approaches. We introduce TimeLesSeg, a unified contrast-agnostic framework designed to segment MS lesions regardless of the presence of a temporal dimension in its inputs, with a single convolutional neural network. Our approach models pathological priors through lesion masks, which are processed together with the current scan. Cross-sectional processing is enabled by exposing the model to training cases where no prior information is available, which are modeled with an empty mask, allowing it to operate seamlessly in both scenarios. To overcome the scarcity and inconsistency of longitudinal datasets, we propose a novel generative pipeline in which patterns of lesion evolution are simulated by stochastically deforming each individual lesion with morphological operations, producing realistic prior timepoints. In parallel, we achieve contrast agnosticism through Gaussian mixture model-based domain randomization, enabling the network to experience a wide spectrum of intensity profiles. Results on three publicly available and two in-house datasets show that TimeLesSeg outperforms the contrast-agnostic state of the art on single-modality inputs across overlap- and distance-based metrics. In longitudinal processing, our method outperforms SAMSEG, and captures lesion load dynamics more accurately than both the former and LST-AI. All source code related to the development of TimeLesSeg is available at this https URL.
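基于高斯混合模型的域随机化思想可粗略示意如下:对每个标签类别随机采样一组高斯参数生成合成强度图,使网络在训练中接触任意对比度分布,从而获得对比度无关性。以下为"每类单高斯"的简化草图,参数范围均为假设,并非论文官方流程:

```python
import numpy as np

def randomize_contrast(label_map, rng=None):
    # label_map: integer segmentation (e.g. tissue / lesion classes).
    # Returns a synthetic image whose per-class intensities are drawn from
    # randomly parameterized Gaussians, simulating arbitrary MR contrasts.
    rng = np.random.default_rng() if rng is None else rng
    out = np.zeros(label_map.shape, dtype=np.float32)
    for k in np.unique(label_map):
        mu, sigma = rng.uniform(0.0, 1.0), rng.uniform(0.01, 0.2)  # assumed ranges
        mask = label_map == k
        out[mask] = rng.normal(mu, sigma, size=int(mask.sum()))
    return out
```

每次采样生成一种新的"虚拟模态",与论文中模拟病灶演化的形态学扰动管道配合使用。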
[CV-19] Rebalancing gradient to improve self-supervised co-training of depth odometry and optical flow predictions
【速读】:该论文旨在解决多网络协同训练过程中梯度分配不均导致的学习进度失衡问题,从而提升联合训练模型的性能。其核心解决方案是提出CoopNet,通过动态调整梯度分配策略来确保各子网络(如深度估计、位姿估计和光流网络)之间的公平学习进展;关键创新在于引入一种基于光度重建误差分布模型的混合损失函数,该模型假设运动物体对应的像素点在深度+位姿网络与光流网络的重建结果中存在显著差异,从而可有效识别并排除这些干扰区域用于训练,提升深度、位姿和光流预测的准确性。
链接: https://arxiv.org/abs/2605.07945
作者: Marwane Hariat,Antoine Manzanera,David Filliat
机构: U2IS, ENSTA Paris, Institut Polytechnique de Paris, Palaiseau, France; INRIA FLOWERS
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present CoopNet, an approach that improves the cooperation of co-trained networks by dynamically adapting the apportionment of gradients to ensure equitable learning progress. It is applied to motion-aware self-supervised prediction of depth maps, by introducing a new hybrid loss based on a distribution model of the photometric reconstruction errors made by, on the one hand, the depth + odometry paired networks, and on the other hand, the optical flow network. This model essentially assumes that the pixels from moving objects (which must be discarded for training depth and odometry) correspond to those where the two reconstructions strongly disagree. We justify this model by theoretical considerations and experimental evidence. A comparative evaluation on the KITTI and CityScapes datasets shows that CoopNet improves on or is comparable to the state-of-the-art in depth, odometry and optical flow predictions.
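论文的混合损失基于两类重建误差的分布模型剔除运动像素;下面用一个简化的阈值版本示意其核心思想(真实方法对误差分布做了概率建模,此处的阈值 `tau` 为示例假设):

```python
import numpy as np

def moving_object_mask(err_depth_pose, err_flow, tau=0.5):
    # err_depth_pose: per-pixel photometric error of the rigid
    # (depth + odometry) reconstruction; err_flow: error of the optical-flow
    # reconstruction. Pixels where the rigid error greatly exceeds the flow
    # error likely belong to moving objects and should be excluded from the
    # depth/odometry photometric loss.
    disagreement = err_depth_pose - err_flow
    return disagreement > tau    # True = discard from depth/odometry training
```

直观上,光流网络能解释运动物体而刚性重建不能,因此两者的误差差异正是运动区域的信号。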
[CV-20] AVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
【速读】:该论文旨在解决当前主动视觉(Active Vision)在模仿学习中缺乏统一评估基准的问题,使得不同方法的效果难以比较,且无法量化主动视觉在何种任务类型和条件下带来收益。其解决方案的关键在于提出TAVIS(Evaluation Infrastructure for Active-Vision Imitation Learning),包含两个互补的任务套件(TAVIS-Head和TAVIS-Hands)以及三种核心评估原语:基于相同演示的头部摄像头与固定摄像头对比协议、GALT(Gaze-Action Lead Time)这一基于认知科学和人机交互(Human-Robot Interaction, HRI)的新型指标用于量化策略中的前瞻注视行为,以及程序性ID/OOD划分以检验泛化能力。通过这些设计,TAVIS实现了对主动视觉在模仿学习中作用的系统性评估,并揭示了其效果具有任务依赖性而非普适性。
链接: https://arxiv.org/abs/2605.07943
作者: Giacomo Spigler
机构: Tilburg University (蒂尔堡大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Active vision – where a policy controls its own gaze during manipulation – has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active-vision imitation learning, with two complementary task suites – TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) – on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and \pi_0 reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at this https URL and this https URL.
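GALT(Gaze-Action Lead Time)的精确定义以论文为准;作为理解,可以用"注视在目标上"与"动作作用于目标"两条信号的互相关峰值滞后来近似度量前瞻注视,如下纯属假设性的代理实现所示:

```python
import numpy as np

def gaze_action_lead_time(gaze_on_target, action_on_target, max_lag=30):
    # Both inputs: binary time series marking when gaze / end-effector is on
    # the task-relevant object. Returns the lag (in steps) at which shifting
    # the gaze signal forward best aligns it with the action signal; positive
    # values indicate anticipatory gaze (gaze leads action). This is an
    # assumed cross-correlation proxy, not the paper's exact metric.
    g = gaze_on_target - gaze_on_target.mean()
    a = action_on_target - action_on_target.mean()
    scores = [np.sum(g[:len(g) - k] * a[k:]) for k in range(max_lag + 1)]
    return int(np.argmax(scores))
```

滞后量乘以控制周期即可换算为论文中与人类遥操作者对比的"提前时间"。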
[CV-21] Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
【速读】:该论文旨在解决现有基于样本的图像编辑方法依赖成对监督(pair-of-pairs supervision)所带来的局限性,即需要两个共享相同编辑语义的图像对来学习目标变换,这限制了训练数据的可扩展性并阻碍了跨多样编辑类型的有效泛化。其解决方案的关键在于提出Delta-Adapter框架,通过预训练视觉编码器提取源图与目标图之间的语义差异(semantic delta),并将该差异以适配器(adapter)形式注入预训练图像编辑模型中;由于目标图像不直接暴露给模型,其可作为预测目标,从而实现单对样本监督(single-pair supervision)。此外,引入语义delta一致性损失(semantic delta consistency loss)以确保生成结果的语义变化与真实语义delta对齐,显著提升了编辑准确性和内容一致性,并增强了对未见编辑任务的泛化能力。
链接: https://arxiv.org/abs/2605.07940
作者: Jiacheng Chen,Songze Li,Han Fu,Baoquan Zhao,Wei Liu,Yanyan Liang,Li Qing,Xudong Mao
机构: Sun Yat-sen University (中山大学); Video Rebirth; Macau University of Science and Technology (澳门科技大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Exemplar-based image editing applies a transformation defined by a source-target image pair to a new query image. Existing methods rely on a pair-of-pairs supervision paradigm, requiring two image pairs sharing the same edit semantics to learn the target transformation. This constraint makes training data difficult to curate at scale and limits generalization across diverse edit types. We propose Delta-Adapter, a method that learns transferable editing semantics under single-pair supervision, requiring no textual guidance. Rather than directly exposing the exemplar pair to the model, we leverage a pre-trained vision encoder to extract a semantic delta that encodes the visual transformation between the two images. This semantic delta is injected into a pre-trained image editing model via a Perceiver-based adapter. Since the target image is never directly visible to the model, it can serve as the prediction target, enabling single-pair supervision without requiring additional exemplar pairs. This formulation allows us to leverage existing large-scale editing datasets for training. To further promote faithful transformation transfer, we introduce a semantic delta consistency loss that aligns the semantic change of the generated output with the ground-truth semantic delta extracted from the exemplar pair. Extensive experiments demonstrate that Delta-Adapter consistently improves both editing accuracy and content consistency over four strong baselines on seen editing tasks, while also generalizing more effectively to unseen editing tasks. Code will be available at this https URL.
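语义差(semantic delta)与一致性损失的一个直观示意如下:用视觉编码器嵌入之差表示样本对所定义的变换,并以余弦相似度对齐生成结果的语义变化。"逐元素相减"与余弦形式均为本示意的假设,非论文确认的实现细节:

```python
import torch
import torch.nn.functional as F

def semantic_delta(encoder, src, tgt):
    # The visual transformation between the exemplar pair, encoded as the
    # difference of their embeddings (assumed form).
    return encoder(tgt) - encoder(src)

def delta_consistency_loss(encoder, query, output, exemplar_delta):
    # Align the semantic change induced on the query with the exemplar delta.
    pred_delta = encoder(output) - encoder(query)
    return 1 - F.cosine_similarity(pred_delta, exemplar_delta, dim=-1).mean()
```

由于目标图从未直接输入编辑模型,它可以同时充当预测目标,这正是单对监督成立的原因。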
[CV-22] One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在长时程规划中如何高效利用辅助世界模型(world module)的问题。现有方法通常以高带宽传递每帧视觉信息至世界模块,并将动作预测与世界模型推理分离,导致在冻结主干网络的有限微调预算下,帧级表征和潜在动作耦合均未被充分优化。其解决方案的关键在于提出OneWM-VLA框架:通过自适应注意力池化(Adaptive Attention Pooling)将每帧图像压缩为单一语义token,显著降低视觉带宽;并采用统一的流匹配目标(flow-matching objective)联合建模隐式状态流与动作轨迹,而非依赖独立解码器连接两者。这一设计在保持长程性能的同时,大幅提升了参数效率与任务表现。
链接: https://arxiv.org/abs/2605.07931
作者: Zuojin Tang,Shengchao Yuan,Xiaoxin Bai,Zhiyuan Jin,De Ma,Gang Pan,Bin Liu
机构: Zhejiang University (浙江大学); Central South University (中南大学); Harbin Institute of Technology (哈尔滨工业大学); Embodied Intelligence General Platform Laboratory, Chery Auto (奇瑞汽车具身智能通用平台实验室); E-surfing Digital Life Technology Co., Ltd., China Telecom (中国电信 e surfing 数字生活科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a \pi_0 (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for \pi_0 ), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for \pi_0 ).
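"每帧压缩为单一语义token"的注意力池化,可用一个可学习query对该帧全部patch token做cross-attention实现,示意如下(头数、初始化方式等均为假设,非论文官方模块):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    # Compresses a frame's patch tokens into one semantic token via a
    # learned query attending over all tokens (assumed sketch).
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) / dim ** 0.5)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)        # (B, 1, dim)
        return out.squeeze(1)                        # one token per frame
```

压缩后的token流再与动作轨迹一起进入统一的流匹配目标,而非经由独立解码器连接。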
[CV-23] MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
【速读】:该论文旨在解决医学视觉-语言模型(Medical Vision-Language Models, VLMs)在临床应用中可能存在的“无声失败”(silent failures)问题,即模型在面对被扰动的证据(如错误前提、图像区域(ROI)损坏等)时仍生成看似合理但不正确的回答,而未表现出拒绝回答的能力。这在医疗场景中可能导致严重后果,因此要求模型具备识别证据失效并主动拒绝回答的能力。解决方案的关键在于构建了一个由四位放射科医师全程监督的300例评估套件medvigil,包含2,556道多选题(MCQ)探针、240个反事实三元组、医生裁定的风险等级和可答性标签、ROI框及开放问答变体,并引入七种基于正确性条件的审计指标,最终汇总为medvigil综合得分(MCS)。该基准测试揭示了当前最强模型(Claude Opus 4.7)在MCS上仅达69.2,远低于独立放射科医生的83.3分,凸显出模型在处理扰动证据时的不足,为未来提升医疗VLM的鲁棒性和可信度提供了量化依据。
链接: https://arxiv.org/abs/2605.07919
作者: Hanqi Jiang,Junhao Chen,Yi Pan,Lifeng Chen,Weihang You,Haozhen Gong,Ruiyu Yan,Jinglei Lv,Lin Zhao,Hui Ren,Quanzheng Li,Tianming Liu,Xiang Li
机构: University of Georgia (佐治亚大学); Harvard Medical School (哈佛医学院); Nanyang Technological University (南洋理工大学); New York University (纽约大学); University of Sydney (悉尼大学); New Jersey Institute of Technology (新泽西理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical vision–language models (VLMs) are usually evaluated on intact image–question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce medvigil, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the medvigil Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.
[CV-24] What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
【速读】:该论文旨在解决现有文本到图像生成模型中tokenizers设计不合理的问题,即当前tokenizer主要关注重建保真度或继承预训练表征,而未明确如何构建对扩散生成任务友好的潜在流形(latent manifold)。研究表明,生成质量更依赖于潜在空间的结构特性,如局部连续性、全局语义一致性及空间结构的连贯性。解决方案的关键在于提出Prior-Aligned AutoEncoder (PAE),通过显式优化这些结构属性来塑造潜在空间:利用来自视觉基础模型(Vision Foundation Models, VFMs)的精炼先验和基于扰动的正则化策略,将空间结构、局部连续性和全局语义转化为可学习的目标函数,从而显著提升扩散模型的训练效率与生成质量,在ImageNet 256×256上实现了比现有方法更快收敛(最高13倍)且达到新的gFID最优值1.03。
链接: https://arxiv.org/abs/2605.07915
作者: Zhengrong Yue,Taihang Hu,Mengting Chen,Haiyu Zhang,Zihao Pan,Tao Liu,Zikang Wang,Jinsong Lan,Xiaoyong Zhu,Bo Zheng,Yali Wang
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Group (阿里巴巴集团); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Beihang University (北京航空航天大学); Sun Yat-sen University (中山大学); Nankai University (南开大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are primarily designed to improve reconstruction fidelity or inherit pretrained representations, leaving unclear what kind of latent space is truly friendly for generative modeling. In this paper, we study this question from the perspective of latent manifold organization. By constructing controlled tokenizer variants, we identify three key properties of a diffusion-friendly latent manifold: coherent spatial structure, local manifold continuity, and global manifold semantics. We find that these properties are more consistent with downstream generation quality than reconstruction fidelity. Motivated by this finding, we propose the Prior-Aligned AutoEncoder (PAE), which explicitly shapes the latent manifold instead of leaving diffusion-friendly manifold to emerge indirectly from reconstruction or inheritance. Specifically, PAE leverages refined priors derived from VFMs and perturbation-based regularization to turn spatial structure, local continuity, and global semantics into explicit training objectives. On ImageNet 256x256, PAE improves both training efficiency and generation quality over existing tokenizers, reaching performance comparable to RAE with up to 13x faster convergence under the same training setup and achieving a new state-of-the-art gFID of 1.03. These results highlight the importance of organizing the latent manifold for latent diffusion models.
[CV-25] Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning NEURIPS2026
【速读】:该论文旨在解决多分布学习(multi-distribution learning)中现有优化方法仅关注损失曲面几何特性之一(如平坦性或梯度对齐性)而导致泛化性能受限的问题。研究表明,单纯优化单一几何属性无法保证低超额风险(excess risk),因为损失曲面的平坦性与梯度对齐性在理论上不可互相推导,二者共同构成超额风险的两个独立主导项。解决方案的关键在于提出SAGE(Spectral-Aware Gradient-Aligned Exploration)算法,其通过同时优化这两个成分:在上升步中引入基于牛顿-舒尔茨迭代计算的极分解(polar factor)来实现各方向上相似幅度的扰动以探索曲率;在下降步中注入与跨分布梯度差异成比例的各向同性噪声以增强梯度对齐性,从而实现对损失曲面几何结构的协同优化。
链接: https://arxiv.org/abs/2605.07914
作者: Aristotelis Ballas,Christos Diou
机构: Harokopio University of Athens (雅典国立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint - Submitted to NeurIPS 2026
Abstract:Sharpness-aware and gradient-alignment methods have been shown to improve generalization, however each family of methods targets a single geometric property of the loss landscape, while ignoring the other. In this paper, we show that this omission is structurally unavoidable and that both flatness and gradient alignment should be considered in multi-distribution learning settings. Specifically, we derive an excess-risk decomposition that yields two additive leading-order terms: (i) an alignment term, controlled by the trace of \bar{H}^{-1}\Sigma_g and (ii) a curvature term, controlled by \bar{H}, where \bar{H} is the average Hessian and \Sigma_g is the covariance of the gradient across distributions. Notably, \bar{H} appears inverted in one and non-inverted in the other. We further show, via a counterexample, that neither quantity bounds the other in general, so no algorithm targeting only one term can guarantee low excess risk. Motivated by this decomposition, we propose SAGE (Spectral-Aware Gradient-Aligned Exploration) that targets both terms. The curvature component replaces SAM’s gradient-scaled perturbation with the polar factor of each layer’s gradient matrix, computed via Newton-Schulz iteration, so that the ascent step probes all directions with similar magnitude. On the other hand, the alignment component injects isotropic noise at the descent step, the magnitude of which scales with cross-distribution gradient disagreement. Experiments on five domain-generalization and two multi-task learning benchmarks show that the proposed method establishes a new state-of-the-art on DomainBed and acts as a general-purpose improvement to base MTL solvers, remaining competitive with, or even surpassing, state-of-the-art methods.
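上升步所用的极因子(polar factor)可由Newton-Schulz迭代计算:先将梯度矩阵缩放使谱范数小于1,再迭代 X ← 0.5·X(3I − XᵀX),收敛到极分解 G = UP 中的正交因子 U。以下为通用示意,缩放方式与迭代次数为假设:

```python
import torch

def polar_factor(grad, iters=15, eps=1e-8):
    # Newton-Schulz iteration for the polar factor U of G = U P.
    # Scaling by the Frobenius norm guarantees spectral norm <= 1,
    # inside the iteration's convergence region (< sqrt(3)).
    X = grad / (grad.norm() + eps)
    I = torch.eye(grad.shape[1])
    for _ in range(iters):
        X = 0.5 * X @ (3 * I - X.t() @ X)
    return X
```

与SAM按梯度范数缩放的扰动不同,极因子的所有奇异值均为1,因此上升步在各方向以相近幅度探测曲率。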
[CV-26] One World Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction
【速读】:该论文旨在解决车辆与基础设施协同自动驾驶(Vehicle-to-Infrastructure Cooperative Autonomous Driving, VICAD)场景下,由于车端与路侧摄像头时钟不同步导致的动态目标重建难题。现有基于高斯场景图(Gaussian Scene Graph)的方法隐含假设观测同步,为每帧分配单一姿态,但在异步协作环境中会引发梯度冲突,造成动态目标严重鬼影现象。作者指出这是表示层面的根本性失效,而非优化过程中的偶然问题——任何单时间线建模都会导致光度损失随目标速度和跨源时间偏移呈二次增长。解决方案的核心在于提出DUST(Decoupled Spatio-Temporal)高斯场景图,其关键创新是将每个动态代理的高斯表示共享以保证外观一致性,同时解耦各来源的姿态轨迹并分别对齐真实采集时间戳,从而使得姿态梯度核矩阵块对角化,彻底消除跨源干扰。此外,引入静态锚点驱动的姿态校正流程与姿态正则化的联合优化策略,进一步提升重建鲁棒性与稳定性,在V2X-Seq数据集上显著优于现有方法。
链接: https://arxiv.org/abs/2605.07910
作者: Yulong Chen,Xiaoyun Dong,Haoyu Zhang,Zongxian Yang,Lewei Xie,Xinke Li,Yifan Zhang,Kai Wang,Jianping Wang
机构: City University of Hong Kong (Dongguan), Guangdong, China; City University of Hong Kong, Hong Kong, China; SLAI, Shenzhen, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data is fundamentally complicated by temporal asynchrony: vehicle and infrastructure cameras operate on independent clocks, capturing the same dynamic agent such as cars and pedestrians at different physical times. Existing Gaussian Scene Graph methods implicitly assume synchronized observations and assign a single pose per agent per frame, an assumption that breaks in cooperative settings, where the resulting gradient conflicts cause severe ghosting on dynamic agents. We identify this as a representation-level failure, not an optimization artifact: we prove that any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset. To resolve this, we propose DUST (DecoUpled Spatio-Temporal) Gaussian Scene Graph for 4D Cooperative Driving Reconstruction. The DUST Gaussian Scene Graph shares a canonical Gaussian set per agent for appearance consistency, while maintaining decoupled pose trajectories aligned to each source’s true capture timestamps. We prove that this decoupling renders the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely. To make DUST practical, we further introduce a static anchor-based pose correction pipeline that corrects spatial misalignment between vehicle and infrastructure annotations, and a pose-regularized joint optimization scheme that prevents trajectory jitter and drift during early training. On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance, improving dynamic-area PSNR by 3.2 dB over the strongest baseline and reducing Fréchet Video Distance by 37.7%, while maintaining robustness under larger temporal asynchrony. Code is available at this https URL.
[CV-27] Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
【速读】:该论文旨在解决在线视频理解中因持续视频流和不可预测的用户查询时间导致的记忆管理难题,尤其是现有方法在视觉标记压缩时缺乏语义感知、检索阶段与压缩阶段难以协同的问题。解决方案的关键在于提出一个无需训练的双阶段框架SAVEMem:第一阶段构建基于固定伪问题库的三层次流式记忆,在恒定内存预算下通过语义显著性而非仅视觉相似性实现长期保留;第二阶段根据查询目标动态调整检索范围(短时至中长时记忆),并通过查询与记忆标记的晚期交互选择候选帧,从而实现语义感知的记忆生成与自适应检索的高效协同。
链接: https://arxiv.org/abs/2605.07897
作者: Hang Wu,Sherin Mary Mathews,Yujun Cai,Ming-Hsuan Yang,Yiwei Wang
机构: University of California, Merced (加州大学默塞德分校); US Bank (美国银行); University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage~1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage~2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.
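查询token与记忆token之间的晚期交互(late interaction)通常采用ColBERT式的MaxSim打分:每个查询token在候选帧的记忆token中取最大相似度,再对查询token求和,用于候选帧排序。论文是否使用同一打分形式为此处假设:

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_tokens, frame_tokens):
    # ColBERT-style MaxSim (assumed form): each query token picks its
    # best-matching memory token; the per-token maxima are summed to
    # score the candidate frame for retrieval.
    q = F.normalize(query_tokens, dim=-1)   # (Nq, d)
    f = F.normalize(frame_tokens, dim=-1)   # (Nf, d)
    return (q @ f.t()).max(dim=1).values.sum().item()
```

相比将所有token拼接后整体打分,逐token取最大允许查询的不同侧面分别命中记忆中的不同内容。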
[CV-28] Enhancing Federated Quadruplet Learning: Stochastic Client Selection and Embedding Stability Analysis
【Quick Read】: This paper addresses the degraded generalisation of the global model in Federated Learning (FL) caused by data heterogeneity across clients, particularly under limited data availability and class imbalance. The key is FedQuad, which explicitly minimises intra-class representation distances while promoting inter-class separation, mitigating the representation misalignment introduced during model aggregation. By jointly minimising distances between positive pairs and maximising distances between negative pairs, it improves feature-space consistency across clients and substantially boosts performance in non-IID settings.
Link: https://arxiv.org/abs/2605.07888
Authors: Ozgu Goksu, Nicolas Pugeault
Affiliations: University of Glasgow
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2509.04107
Abstract:Federated Learning (FL) enables decentralised model training across distributed clients without requiring data centralisation. However, the generalisation performance of the global model is usually degraded by data heterogeneity across clients, particularly under limited data availability and class imbalance. To address this challenge, we propose FedQuad, a novel method that explicitly enforces minimising intra-class representations while enabling inter-class splits across clients. By jointly minimising distances between positive pairs and maximising distances between negative pairs, the proposed approach mitigates representation misalignment introduced during model aggregation. We evaluate our method on CIFAR-10, CIFAR-100, and Tiny-ImageNet under diverse non-IID settings and varying numbers of clients, demonstrating consistent improvements over existing baselines. Additionally, we provide a comprehensive analysis of metric learning-based approaches in both centralised and federated environments, highlighting their effectiveness in alleviating representation collapse under heterogeneous data distributions.
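The pairwise objective described above can be illustrated with a generic quadruplet loss, a standard metric-learning formulation. The margins and the squared-Euclidean distance below are our assumptions for the sketch, not necessarily FedQuad's exact design.

```python
import numpy as np

def quadruplet_loss(anchor, positive, neg1, neg2, m1=0.5, m2=0.25):
    """Generic quadruplet loss: pull the positive pair together and push
    two kinds of negative pairs apart (margins m1, m2 are hypothetical)."""
    d = lambda a, b: float(np.sum((a - b) ** 2))  # squared-Euclidean distance
    term1 = max(0.0, d(anchor, positive) - d(anchor, neg1) + m1)
    term2 = max(0.0, d(anchor, positive) - d(neg1, neg2) + m2)
    return term1 + term2

# Well-separated embeddings incur zero loss; a "positive" that is far
# from its anchor incurs a positive loss.
a = np.array([0.0, 0.0]); p = np.array([0.1, 0.0])
n1 = np.array([2.0, 0.0]); n2 = np.array([0.0, 2.0])
loss = quadruplet_loss(a, p, n1, n2)
```

In the federated setting, each client would apply such an objective locally so that representations stay aligned across clients after aggregation.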
[CV-29] Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
【Quick Read】: This paper targets the bottleneck holding back reward modeling for video understanding: the lack of robust evaluation benchmarks and high-quality preference data. The key is a unified framework spanning benchmark design, data construction, and reward-model training. It first builds VURB (Video Understanding Reward Bench), a benchmark of 2,100 preference pairs with long chain-of-thought reasoning traces (averaging 1,143 tokens) and majority-voting evaluation across general, long, and reasoning-oriented video tasks. It then constructs VUP-35K, a large-scale high-quality preference dataset, via a fully automated pipeline. Finally, it trains a discriminative (VideoDRM) and a generative (VideoGRM) reward model on this data, both achieving state-of-the-art results on VURB and VideoRewardBench, with VUP-35K also improving reasoning capability and best-of-N test-time scaling.
Link: https://arxiv.org/abs/2605.07872
Authors: Yuancheng Wei, Linli Yao, Lei Li, Haojie Zhang, Hao Zhou, Fandong Meng, Xu Sun
Affiliations: South China University of Technology; Peking University; The University of Hong Kong; Tencent
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multimodal reward models have advanced substantially in text and image domains, yet progress in video understanding reward modeling remains severely limited by the lack of robust evaluation benchmarks and high-quality preference data. To address this, we propose a unified framework spanning benchmark design, data construction, and reward model training. We introduce Video Understanding Reward Bench (VURB), a benchmark featuring 2,100 preference pairs with long chain-of-thought reasoning traces (averaging 1,143 tokens) and majority voting evaluation across general, long, and reasoning-oriented video tasks. We further construct Video Understanding Preference Dataset (VUP-35K) via a fully automated pipeline, providing large-scale high-quality supervision for video reward training. Building on the data, we train VideoDRM and VideoGRM, a discriminative and a generative reward model, both achieving state-of-the-art performance on VURB and VideoRewardBench. Further analysis confirms that VUP-35K enhances both reward performance and model reasoning capability, while VideoDRM and VideoGRM yield significant gains under best-of-N test-time scaling.
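The best-of-N test-time scaling mentioned above is straightforward to sketch: sample N candidate answers and keep the one the reward model scores highest. The toy reward function below is purely hypothetical, standing in for a trained model such as VideoDRM or VideoGRM.

```python
import numpy as np

def best_of_n(candidates, reward_fn):
    """Best-of-N selection: score every candidate with the reward model
    and return the highest-scoring one."""
    scores = np.array([reward_fn(c) for c in candidates])
    best = int(np.argmax(scores))
    return candidates[best], scores

# Hypothetical stand-in reward: prefer longer, more specific answers.
toy_reward = lambda ans: len(ans.split())
answers = ["A cat.", "A cat jumps onto the table.", "Cat."]
chosen, scores = best_of_n(answers, toy_reward)
```

With a real reward model the candidates would be N sampled generations from the video LLM for the same prompt.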
[CV-30] From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data
【Quick Read】: This paper tackles two key problems in makeup transfer: preserving identity consistency under synthetic-data supervision, and the domain gap between synthetic and real data that degrades performance in complex real-world scenes. The key lies in two frameworks: ConsistentBeauty, a novel data-curation pipeline that ensures makeup fidelity and strict identity consistency in the synthesized data; and RealBeauty, a synthetic-to-real post-training framework that adapts the model to the real world via reinforcement learning with verifiable rewards designed for the makeup-transfer task, so the model can benefit from real makeup patterns. The paper also establishes a new benchmark covering diverse skin tones, ages, genders, poses, and makeup styles, enabling more comprehensive evaluation under varied real-world conditions.
Link: https://arxiv.org/abs/2605.07861
Authors: Yue Yu, Jiayu Wang, Jiajia Shi, Jingjing Chen, Yu-Gang Jiang
Affiliations: Fudan University; Institute of Trustworthy Embodied AI, Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Makeup transfer aims to apply the makeup style of a reference portrait to a source portrait while preserving identity and background. Early methods formulate this task as unsupervised image-to-image translation, relying on surrogate objectives and often yielding limited performance. Recent diffusion- and flow-based approaches instead exploit synthetic data for supervised training, leading to significant improvements. However, these methods still face two critical challenges: synthetic supervision frequently fails to faithfully preserve identity, and the domain gap between synthetic and real data limits generalization, resulting in degraded performance in complex real-world scenarios. To address these issues, this paper first proposes ConsistentBeauty, a novel data curation pipeline that ensures makeup fidelity and strict identity consistency within the synthesized data. Second, we propose RealBeauty, a synthetic-to-real post-training framework. Beyond supervised learning on curated synthetic data, we further adapt the model to real-world scenarios through reinforcement learning and design novel verifiable rewards tailored to the makeup transfer task. It allows the model to further benefit from real makeup patterns beyond synthetic supervision. In addition, we establish a new diverse benchmark for makeup transfer, covering a wide range of skin tones, ages, genders, poses, and makeup styles, thereby enabling a more comprehensive evaluation of model performance under diverse real-world conditions. Extensive experiments show that our method achieves state-of-the-art performance on multiple benchmarks and demonstrates clear advantages in identity preservation and performance on complex real-world cases.
[CV-31] EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding IJCAI2026
【Quick Read】: This paper addresses the difficulty of detecting driver cognitive distraction, which is especially covert because visual attention need not shift and no explicit body movement occurs. The key is the EyeCue framework, which fuses eye gaze with egocentric video and models the dynamic interaction between gaze and the visual scene, enabling context-aware modeling of the driver's attentional state. This design moves beyond single-modality or surface-behaviour cues and markedly improves detection accuracy and cross-scenario generalisation.
Link: https://arxiv.org/abs/2605.07859
Authors: Lang Zhang, JinYi Yoon, Matthew Corbett, Abhijit Sarkar, Bo Ji
Affiliations: Virginia Tech; Inha University; Army Cyber Institute at West Point
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)
Abstract:Driver cognitive distraction is a major cause of road collisions and remains difficult to detect. Unlike manual or visual distraction, cognitive distraction arises when attention is diverted by thoughts unrelated to driving, even when the driver appears visually attentive and exhibits no explicit physical movements. In this work, we propose EyeCue, a gaze-empowered egocentric video understanding framework, to detect driver cognitive distraction. A key insight is that cognitive distraction manifests in the interaction between eye gaze and visual context. To capture this interaction, EyeCue integrates eye gaze with egocentric video to enable context-aware modeling of the driver’s attention over time. Furthermore, to tackle the limited scale and diversity of existing datasets, we introduce CogDrive, a comprehensive multi-scenario dataset that augments four existing driving datasets with cognitive distraction annotations. Through extensive evaluations on CogDrive, we show that EyeCue achieves the highest accuracy of 74.38%, outperforming 11 baselines from 6 model families by over 7%. Notably, EyeCue can achieve an accuracy of over 70% across various driving scenarios (different road types, times of day, and weather conditions) with strong generalizability. These results highlight the importance of modeling gaze-context interactions and the effectiveness of cross-modal interaction modeling for multimodal cognitive distraction detection. Our codes and CogDrive dataset resources are available at this https URL.
[CV-32] BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
【Quick Read】: This paper addresses the mask-shape bias that coarse masks introduce in local image editing: the model tends to confine generated content within the mask boundary instead of editing the target region freely according to the user instruction. The key ideas of BRIDGE are: (1) a Two-Zone Constraint separating background stability from editable-region freedom; (2) BridgePath generation, in which a Main Path preserves background context while a Subject Path generates content from independent noise, avoiding the shape dependence caused by injecting the mask inside the DiT (Diffusion Transformer); and (3) a learnable Discrete Geometric Gate for token-level positional-embedding routing, letting edited tokens either borrow background-anchored coordinates near fusion regions for consistency or keep subject-centric coordinates for geometric freedom. The method clearly improves local-editing quality across benchmarks with a very small parameter overhead (13.31M).
Link: https://arxiv.org/abs/2605.07846
Authors: Peilin Xiong, Honghui Yuan, Junwen Chen, Keiji Yanai
Affiliations: The University of Electro-Communications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures
Abstract:Coarse-mask local image editing asks a model to modify a user-indicated region while preserving the surrounding scene. In practice, however, rough masks often become unintended shape priors: instead of serving as flexible edit support, the mask can pull the generated object toward its accidental boundary. We study this failure as mask-shape bias and frame the task through a Two-Zone Constraint, where the background should remain stable while the editable region should follow the instruction without being forced to inherit the mask contour. BRIDGE addresses this setting by keeping masks outside the DiT backbone for support construction and blending, avoiding DiT-internal mask injection and copied control branches. It uses BridgePath generation, where a Main Path preserves background context and a Subject Path generates editable content from independent noise. Motivated by a diagnostic Qwen-Image experiment showing that positional embeddings and attention connectivity regulate which image context visual tokens reuse, BRIDGE introduces a learnable Discrete Geometric Gate for token-level positional-embedding routing. This gate lets subject tokens borrow background-anchored coordinates near fusion regions or keep subject-centric coordinates for geometric freedom. We evaluate BRIDGE on BRIDGE-Bench, MagicBrush, and ICE-Bench. On BRIDGE-Bench, BRIDGE improves Local SigLIP2-T from 0.262 with FLUX.1-Fill and 0.390 with ACE++ to 0.503, with parallel gains in local DINO and DreamSim. Zero-shot results on MagicBrush and ICE-Bench further indicate competitive alignment and source preservation beyond the curated benchmark, while the added routing module remains compact at 13.31M parameters compared with ControlNet-style copied branches.
[CV-33] Explainable Part-Based Vehicle Classifier with Spatial Awareness
【Quick Read】: This paper addresses the accuracy-interpretability trade-off in fine-grained vehicle classification for Intelligent Transportation Systems (ITS). End-to-end CNNs classify accurately, but their black-box nature limits trust and extensibility in practice. The key is decomposing the CNN into three modules: (1) a CNN-based detector for semantically strong vehicle parts; (2) spatially aware part probability maps, replacing the earlier binary part features, that model where each part occurs for each vehicle category; and (3) final classification via softmax regression. This improves robustness to false part detections and combines high accuracy with intuitive interpretability, effectively challenging the presumed accuracy-interpretability trade-off.
Link: https://arxiv.org/abs/2605.07831
Authors: Andreas Caduff (1), Klaus Zahn (1), Jonas Hofstetter (1), Martin Rechsteiner (1), Patrick Flaig (2) ((1) Competence Center for Intelligent Sensors and Networks, Lucerne University of Applied Science and Art (2) SICK AG)
Affiliations: Lucerne University of Applied Sciences and Arts (HSLU); SICK AG
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In the area of Intelligent Transportation Systems (ITS), fine-grained vehicle classification systems play an essential role. Recently, the authors have presented a novel vision-based classification approach in which standard end-to-end Convolutional Neural Networks (CNNs) have been decomposed into 1) a CNN-based detector for semantically strong vehicle parts, followed by 2) feature construction and 3) final classification by a decision tree. In contrast to conventional CNNs, this allows both easy extensibility to new vehicle categories - without the need to fully retrain the part detector - and an important step towards the interpretability of the model, removing partially the black-box nature inherent to CNNs. Here we present an important extension of this approach that now incorporates spatial awareness of the vehicle parts: while the feature construction 2) of the previous approach used a binary decision for each feature (present vs. absent), now a full spatial probability map is constructed to condition the presence of each individual part with respect to a given vehicle category. The classification is performed using a softmax regression approach for the overall vehicle probabilities. This method shows a considerably improved robustness against false (part-)detections, a point that is crucial for practical application. Comparative analyses with a state-of-the-art end-to-end CNN indicate that our part-based methods achieve comparable accuracy, effectively challenging the presumed trade-off between accuracy and explainability. This research represents a significant advance in vehicle classification for ITS and forms the basis for systems that combine high accuracy with intuitive interpretability.
[CV-34] Anisotropic Modality Align
【Quick Read】: This paper addresses the bottleneck in training multimodal large language models (MLLMs) caused by the scarcity of high-quality paired multimodal data. Prior work uses the shared space of pretrained multimodal contrastive models to enable multimodal training with unimodal data alone, but its core premise, whether representations from different modalities can be reliably interchanged, remains poorly understood. The paper finds that the Modality Gap is not a simple global shift but an anisotropic residual structure concentrated along a few dominant directions, which severely hampers interchangeability. The key is the principle of anisotropic modality-gap alignment: effective alignment should match the target-modality distribution while preserving the source modality's semantic structure. Building on this, AnisoAlign applies a bounded correction to source-modality representations using the target modality's internal geometric prior, constructing substitute target-modality representations. This recasts the modality gap from an empirical observation into a correctable, structured geometric problem and offers a new representation-alignment perspective for training multimodal models with unimodal data.
Link: https://arxiv.org/abs/2605.07825
Authors: Xiaomin Yu, Yijiang Li, Yuhui Zhang, Hanzhen Zhao, Yue Yang, Hao Tang, Yue Song, Xiaobin Hu, Chengwei Qin, Shuicheng Yan, Hui Xiong
Affiliations: Unknown
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle lies in the persistent Modality Gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further propose the principle of anisotropic modality gap alignment: effective modality alignment should align with the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose an anisotropic geometric correction framework, AnisoAlign, for unpaired modality alignment. This framework leverages the internal geometric prior of the target modality and performs bounded correction on source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation alignment perspective for training multimodal models with unimodal data.
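As a toy illustration of correcting representations only along a few dominant directions (a much-simplified stand-in for AnisoAlign; the use of target-side SVD directions and the step size are our assumptions, not the paper's procedure):

```python
import numpy as np

def anisotropic_correction(src, tgt, k=2, step=1.0):
    """Shift source embeddings toward the target-modality mean, but only
    along the top-k dominant directions of the target's residual structure."""
    gap = tgt.mean(axis=0) - src.mean(axis=0)   # modality-gap vector
    resid = tgt - tgt.mean(axis=0)              # target-side residuals
    _, _, vt = np.linalg.svd(resid, full_matrices=False)
    dirs = vt[:k]                               # (k, d) dominant directions
    proj = dirs.T @ (dirs @ gap)                # gap restricted to the subspace
    return src + step * proj

# Toy "source" and shifted "target" embedding clouds
rng = np.random.default_rng(1)
src = rng.normal(size=(50, 8))
tgt = rng.normal(size=(50, 8)) + 2.0
corrected = anisotropic_correction(src, tgt)
```

Because the shift is a projection of the gap, it can only move the source mean closer to the target mean, never further away, which captures the "bounded correction" flavour of the approach.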
[CV-35] Divide and Conquer: Object Co-occurrence Helps Mitigate Simplicity Bias in OOD Detection CVPR2026
【Quick Read】: This paper targets the reliability of out-of-distribution (OOD) detection in deep models, especially for near-OOD samples. Existing methods rely on regular entangled representations to separate in-distribution (ID) from OOD data and neglect the rich contextual information within images, so simplicity-biased models struggle to learn discriminative features from disentangled representations. The key is an Object-Centric OOD detection framework that models Object CO-occurrence (OCO) patterns: it predicts disentangled representations for a test sample, adaptively divides detection into three scenarios based on object co-occurrence patterns observed in ID training data, and performs OOD detection in a divide-and-conquer manner. By exploiting the semantic context between objects in natural scenes, it distinguishes near-OOD samples without over-focusing on easily learnable regions, improving robustness to both semantic and covariate shifts.
Link: https://arxiv.org/abs/2605.07821
Authors: Boyang Dai, Chaoqi Chen, Yizhou Yu
Affiliations: The University of Hong Kong; Shenzhen University; Shenzhen Loop Area Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by CVPR2026
Abstract:Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at this https URL.
[CV-36] ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles
【Quick Read】: This paper addresses joint writer identification and pen classification from static hand-drawn circle traces, where the central challenge is the natural entanglement of biometric writing characteristics and physical pen features within minimal static samples. The key is a large, controlled, annotated dataset (CircleID) of 46,155 high-resolution (400 DPI) circle images from 50 known and 16 unknown writers, with two tasks: open-set writer identification (recognising known writers while rejecting unknown ones) and cross-writer pen classification (testing generalisation to unseen writers). Run as a Kaggle competition with a ResNet baseline and thousands of submissions, the results demonstrate the feasibility of feature disentanglement and robust generalisation from minimal traces, establishing a new baseline for the field.
Link: https://arxiv.org/abs/2605.07816
Authors: Thomas Gorges, Janne van der Loop, Lukas Hüttner, Linda-Sophie Schneider, Fei Wu, Mathias Seuret, Vincent Christlein
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper presents CircleID, a large-scale ICDAR 2026 competition on writer identification and pen classification from scanned hand-drawn circles. The primary objective is to investigate how biometric writer characteristics and physical pen features naturally entangle within minimal, static traces. CircleID comprises two distinct tasks: (1) open-set writer identification, requiring models to recognize known writers while explicitly rejecting unknown ones, and (2) cross-writer pen classification, evaluated across both seen and unseen writers. Participants were provided with a new, controlled dataset of 46,155 tightly cropped circle images, digitized at 400 DPI and annotated for writer identity and pen type. The dataset comprises samples from 50 known and 16 unknown writers using eight different pens. Hosted on Kaggle as two separate tracks with public and private leaderboards, the competition provided participants with a ResNet baseline. In total, 389 teams (436 participants) made 3,185 submissions for the pen classification task, and 113 teams (141 participants) made 1,737 submissions for the writer identification track. The best-performing private leaderboard submissions achieved a Top-1 accuracy of 64.801% for writer identification and 92.726% for pen classification. This paper details the dataset, evaluates the winning methodologies, and analyzes the impact of out-of-distribution writers on model generalization and feature disentanglement. In this large-scale competition, CircleID establishes a new baseline for minimal-trace analysis.
[CV-37] Text-to-CAD Evaluation with CADTests
【Quick Read】: This paper addresses the lack of effective evaluation for Text-to-CAD, where assessing generation performance remains a considerable challenge. The key is a new test-based evaluation paradigm: CADTestBench, the first test-based benchmark for Text-to-CAD, built on executable software tests (CADTests) that verify whether a generated CAD model satisfies the geometric and topological requirements of the input prompt. Beyond comprehensively benchmarking existing Text-to-CAD methods, the authors show that CADTests can also guide generation, yielding simple baselines that surpass current methods.
Link: https://arxiv.org/abs/2605.07807
Authors: Dimitrios Mallis, Marco Wang, Ahmet Serdar Karadeniz, Elisa Ricci, Anis Kacem, Djamila Aouada
Affiliations: SnT, University of Luxembourg; University of Trento; Fondazione Bruno Kessler
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Text-to-CAD has recently emerged as an important task with the potential to substantially accelerate design workflows. Despite its significance, there has been surprisingly little work on Text-to-CAD evaluation, and assessing CAD model generation performance remains a considerable challenge. In this work, we introduce a new evaluation perspective for Text-to-CAD based on automated testing. We propose CADTestBench, the first test-based benchmark for Text-to-CAD, based on CADTests, executable software tests that verify whether a generated CAD model satisfies the geometric and topological requirements of the input prompt. Using CADTestBench, we conduct comprehensive benchmarking of recent Text-to-CAD methods and further demonstrate that CADTests can also guide CAD model generation, yielding simple baselines that surpass performance of current methods. CADTestBench code and data are available at GitHub and Hugging Face dataset.
[CV-38] SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
【Quick Read】: This paper addresses semantic inconsistencies in videos generated by video diffusion models (VDMs), such as dropped entities, mis-bound attributes, and weakened prompt-specified interactions. Methods such as VideoREPA and MoAlign distill spatio-temporal token relations from a frozen visual foundation model to improve fine-grained text following, but allocate their pairwise supervision by visual or motion cues rather than relevance to the prompt. The key of SARA (Semantically Adaptive Relational Alignment) is to keep token-relation distillation (TRD) while adding a text-conditioned saliency that decides which token pairs receive supervision: a lightweight Stage 1 aligner learns per-entity saliency, and a pair-routing operator fuses it into TRD, routing supervision weight toward subject-subject and subject-background pairs to strengthen the modelling of key semantic relations. SARA outperforms SFT, VideoREPA, and MoAlign on a multi-dimension VLM rubric, on VBench, and in a blind user study.
Link: https://arxiv.org/abs/2605.07800
Authors: Jiesong Lian, Zixiang Zhou, Ruizhe Zhong, Yuan Zhou, Qinglin Lu, Rui Wang, Long Hu, Yixue Hao, Baoru Huang
Affiliations: Huazhong University of Science and Technology; Tencent Hunyuan; Shanghai Jiao Tong University; University of Liverpool
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study.
[CV-39] Spectral Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation
【Quick Read】: This paper addresses imbalanced per-class accuracy in deep classifiers, where some (weak) classes perform markedly worse than others and conventional remedies require retraining. The key is Spectral Surgery, a post-hoc optimisation method that exploits the Hessian spectrum of the trained weights, in particular the few large outlier eigenvalues (spikes), whose count matches the number of classes minus one, by applying controlled perturbations along spike eigenvectors to reshape decision boundaries. Concretely: (i) a spike-class sensitivity matrix quantifies how each class's accuracy responds to perturbation along each spike direction; (ii) a constrained optimisation of the perturbation coefficients targets weak classes while preserving strong ones; and (iii) adaptive amplitude control adjusts the perturbation budget based on iteration-level improvement signals. The method achieves more balanced classification without retraining, validated on CIFAR-10 and ISIC-2019 in terms of balanced accuracy and standard deviation.
Link: https://arxiv.org/abs/2605.07790
Authors: Hugo Vigna, Samuel Bontemps
Affiliations: CentraleSupélec – Université Paris-Saclay; ESILV – Léonard de Vinci
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The Hessian spectrum of trained deep networks exhibits a characteristic structure: a continuous bulk of near-zero eigenvalues and a small number of large outlier eigenvalues (spikes), confirming the relevance of Random Matrix Theory in deep learning. The spike count matches the number of classes minus one. While prior work has described this structure, no method has exploited it operationally to improve classification performance. We propose Spectral Surgery, a post-hoc optimization method that directly perturbs model weights along spike eigenvectors to rebalance per-class accuracy without retraining. We introduce (i) a spike-class sensitivity matrix that quantifies the directional derivative of each class’s accuracy along each spike eigenvector, (ii) a constrained optimization of perturbation coefficients that targets weak classes while preserving strong ones, and (iii) an adaptive amplitude control that raises or lowers the perturbation budget based on iteration-level improvement signals. We obtain encouraging results on CIFAR-10 and ISIC-2019 on both balanced accuracy and standard deviation.
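The two core ingredients, directional sensitivities and spike-aligned perturbations, can be sketched in a few lines of NumPy. This is our illustration, not the authors' code; the linear "metric" below is a toy stand-in for a per-class accuracy, and a real sensitivity matrix would evaluate one such derivative per (spike, class) pair.

```python
import numpy as np

def directional_sensitivity(metric_fn, w, v, eps=1e-2):
    """Finite-difference directional derivative of a scalar metric
    (e.g. one class's accuracy) along a unit-normalised direction v."""
    v = v / np.linalg.norm(v)
    return (metric_fn(w + eps * v) - metric_fn(w - eps * v)) / (2 * eps)

def spectral_surgery(w, spike_eigvecs, coeffs):
    """Apply the perturbation w' = w + sum_i coeffs[i] * v_i along
    (unit-normalised) spike eigenvectors."""
    w = w.astype(float).copy()
    for c, v in zip(coeffs, spike_eigvecs):
        w += c * (v / np.linalg.norm(v))
    return w

# Toy linear surrogate for a per-class metric: its directional derivative
# along e1 is exactly g[0].
g = np.array([1.0, 2.0, 3.0])
lin_metric = lambda w: float(w @ g)
e1, e2 = np.eye(3)[0], np.eye(3)[1]
s1 = directional_sensitivity(lin_metric, np.zeros(3), e1)
w_new = spectral_surgery(np.zeros(3), [e1, e2], [2.0, -1.0])
```

In the full method, the coefficients would come from the constrained optimisation over the spike-class sensitivity matrix rather than being hand-chosen.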
[CV-40] APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment
【Quick Read】: This paper targets the limitations of traditional feature-distribution metrics (e.g., FID) for evaluating image generative models: the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. The key of APEX (Assumption-free Projection-based Embedding eXamination) is to use the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. The framework scales well to high-dimensional spaces, is embedding-agnostic, and adopts two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors, yielding more robust, stable, and cross-dataset-consistent evaluation of image generation quality.
Link: https://arxiv.org/abs/2605.07786
Authors: Caterina Gallegati, Monica Bianchini, Franco Scarselli, Vittorio Murino, Barbara Toniella Corradini
Affiliations: University of Siena; Istituto Italiano di Tecnologia; University of Verona
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX inherits effective scalability to high-dimensional spaces, as we prove with theoretical and empirical evidences. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.
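The Sliced Wasserstein Distance at APEX's core can be sketched as a generic Monte-Carlo estimator in a few lines of NumPy (not APEX's implementation; equal sample counts are assumed for simplicity):

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=128, seed=0):
    """Monte-Carlo Sliced Wasserstein-2 distance between two equally sized
    point clouds: project onto random unit directions, sort each 1-D
    projection, and compare the sorted values."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
    px = np.sort(x @ theta.T, axis=0)   # (n_samples, n_proj)
    py = np.sort(y @ theta.T, axis=0)
    return float(np.sqrt(np.mean((px - py) ** 2)))

# Identical clouds score zero; a shifted cloud scores strictly higher.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
d_same = sliced_wasserstein(x, x)
d_shift = sliced_wasserstein(x, x + 1.0)
```

In APEX the point clouds would be CLIP or DINOv2 embeddings of real and generated images rather than Gaussian toy data.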
[CV-41] Radiologist-Guided Causal Concept Bottleneck Models for Chest X-Ray Interpretation
【Quick Read】: This paper addresses the fact that existing Concept Bottleneck Models (CBMs) in medical imaging treat clinical concepts as discriminative predictors of pathology labels, without explicitly modelling the causal process by which diseases generate observable radiographic findings. The key is XpertCausal, a radiologist-guided causal CBM that models pathology-to-concept relationships with a probabilistic noisy-OR framework and inverts it via Bayesian inference to estimate pathology probabilities, while radiologist-curated concept-pathology associations constrain the model structure to clinically plausible reasoning pathways. On MIMIC-CXR it outperforms both a non-causal CBM baseline and an unconstrained causal variant in classification performance (AUROC), calibration, explanation quality, and alignment with expert-defined reasoning pathways.
Link: https://arxiv.org/abs/2605.07785
Authors: Amy Rafferty, Rishi Ramaesh, Ajitha Rajan
Affiliations: University of Edinburgh, UK; NHS Lothian, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Concept Bottleneck Models (CBMs) in medical imaging aim to improve model interpretability by predicting intermediate clinical concepts before final diagnoses. However, most existing CBMs treat concepts as discriminative predictors of pathology labels, without explicitly modelling the underlying clinical generative process where diseases produce observable radiographic findings. We propose XpertCausal, a radiologist-guided causal CBM for chest X-ray interpretation which models pathology-to-concept relationships using a probabilistic noisy-OR framework. This generative model is then inverted via Bayesian inference to estimate pathology probabilities from predicted concepts. Radiologist-curated concept-pathology associations are used to constrain model structure to radiologist-defined clinically plausible reasoning pathways. We evaluate XpertCausal on MIMIC-CXR across pathology classification performance, calibration, explanation quality, and alignment with radiologist-defined reasoning pathways. Compared with both a non-causal CBM baseline and a causal ablation with unconstrained learned associations, XpertCausal achieves improved AUROC, calibration, and clinically relevant explanation quality, while learning concept-pathology relationships that more closely align with expert knowledge. These results demonstrate that incorporating clinically motivated causal structure and expert domain knowledge into CBMs can lead to more accurate, interpretable, and clinically aligned models for CXR interpretation.
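A minimal sketch of a noisy-OR generative model and its exact Bayesian inversion by enumeration (illustrative only: the probabilities and leak term below are made-up numbers, and the real model is constrained by radiologist-curated associations rather than a dense matrix):

```python
import numpy as np
from itertools import product

def noisy_or(d, q, leak=0.01):
    """P(concept = 1 | binary pathology vector d) under a noisy-OR:
    1 - (1 - leak) * prod_j (1 - q_j) ** d_j."""
    d = np.asarray(d)
    return 1.0 - (1.0 - leak) * np.prod((1.0 - q) ** d)

def invert(concepts, q_matrix, prior, leak=0.01):
    """Bayesian inversion: posterior over all binary pathology
    configurations given observed concept indicators (exact enumeration,
    tractable only for a small number of pathologies)."""
    n_path = q_matrix.shape[0]
    post = {}
    for d in product([0, 1], repeat=n_path):
        lik = 1.0
        for k, c in enumerate(concepts):
            p = noisy_or(d, q_matrix[:, k], leak)
            lik *= p if c else (1.0 - p)
        pr = np.prod([prior[j] if d[j] else 1.0 - prior[j] for j in range(n_path)])
        post[d] = lik * pr
    z = sum(post.values())
    return {d: v / z for d, v in post.items()}

# Hypothetical numbers: pathology 0 strongly causes concept 0,
# pathology 1 strongly causes concept 1.
q_matrix = np.array([[0.90, 0.05],
                     [0.05, 0.90]])
posterior = invert(concepts=[1, 0], q_matrix=q_matrix, prior=[0.3, 0.3])
```

Observing concept 0 but not concept 1 should make pathology 0 much more probable than pathology 1 under the posterior, mirroring the clinically intuitive inference the paper aims for.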
[CV-42] Differentiable Ray Tracing with Gaussians for Unified Radio Propagation Simulation and View Synthesis
【Quick Read】: This paper addresses the inability of current neural-rendering reconstructions (such as 3D Gaussian Splatting, 3DGS) to support deterministic multi-bounce propagation simulation for radio-frequency (RF) digital twins. Conventional RF simulation relies on manually constructed geometric meshes to compute ray trajectories with their attenuation and delay, whereas visual reconstruction models optimise alpha-composited optical appearance and lack ray-intersectable geometry. The key is to embed Gaussian primitives into a hardware-accelerated ray-tracing structure as a unified spatial representation, extracting physically meaningful channel impulse responses (CIRs) directly from visual-only reconstructions without extra geometric modelling, thereby coupling photorealistic rendering with RF propagation simulation and offering a new paradigm for cross-modal spatial representation.
Link: https://arxiv.org/abs/2605.07781
Authors: Niklas Vaara, Lam Huynh, Pekka Sangi, Miguel Bordallo López, Janne Heikkilä
Affiliations: Center for Machine Vision and Signal Analysis, University of Oulu; CubiCasa Oy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Explicit neural representations such as 3D Gaussian Splatting (3DGS) enable high-fidelity and real-time novel view synthesis, yet optimize for alpha-composited optical appearance rather than ray-intersectable geometry. In contrast, radio-frequency (RF) digital twins require deterministic multi-bounce paths, where the geometry dictates trajectories and their associated attenuation and delay. We introduce a framework enabling differentiable RF propagation simulation directly within visually reconstructed neural scenes, allowing point-to-point path computation between arbitrary 3D locations while preserving high-quality visual rendering. Unlike conventional RF simulation pipelines that rely on manually constructed meshes, we embed Gaussian primitives into a hardware-accelerated ray tracing structure as the underlying spatial representation. By extracting physically meaningful channel impulse responses from visual-only reconstructions, we provide cross-modal evidence that neural reconstructions can serve as unified spatial representations for both electromagnetic propagation simulation and photorealistic view synthesis.
[CV-43] SIMI: Self-information Mining Network for Low-light Image Enhancement
【速读】:该论文旨在解决低光照条件下图像质量下降对图像编辑与可视化带来的挑战,尤其针对现有增强方法因依赖复杂模型而忽视低光图像内在信息的问题。其解决方案的关键在于提出一种无监督的Self-Information Mining (SIMI)网络,通过位平面分解(bit-plane decomposition)将低光图像分解为多个成分,从而在无需外部数据的情况下挖掘图像的内在信息,实现更快的模型收敛、更高的性能以及更低的计算开销,同时保持良好的实际应用潜力。
链接: https://arxiv.org/abs/2605.07767
作者: Xuanshuo Fu,Lei Kang,Javier Vazquez-Corral
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Poor lighting conditions significantly impact image quality, posing substantial challenges for image editing and visualization. Many existing enhancement methods aim at proposing complex models while neglecting the intrinsic information contained within low-light images. In this work, we propose the Self-Information Mining (SIMI) network, an innovative unsupervised framework that decomposes low-light images into multiple components based on bit-plane decomposition. Our approach allows mining intrinsic information without relying on external data. This not only accelerates model convergence but also improves performance and reduces computational overhead. The unsupervised nature of our method facilitates real-world applicability. Experiments conducted on standard benchmarks demonstrate that SIMI achieves state-of-the-art performance.
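摘要中提到的位平面分解本身是一个简单的按位运算,以下用 NumPy 给出示意(2×2 的 8-bit 图像为人工构造,仅说明分解与无损重建,与论文的网络结构无关):

```python
import numpy as np

img = np.array([[200, 13], [90, 255]], dtype=np.uint8)  # 人工构造的 8-bit "图像"

# 第 k 个位平面:取每个像素的第 k 位(plane 0 为最低位)
planes = [(img >> k) & 1 for k in range(8)]

# 8 个位平面可无损重建原图
recon = sum(p.astype(np.uint16) << k for k, p in enumerate(planes))
assert np.array_equal(recon.astype(np.uint8), img)
```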
[CV-44] Head Similarity: Modeling Structured Whole-Head Appearance Beyond Face Recognition
【速读】:该论文旨在解决传统人脸识别模型在非正面视角或面部线索缺失情况下,因强制执行身份内不变性(intra-identity invariance)而导致外观差异(如发型、妆容变化)被压缩为单一表征,从而无法满足外观敏感场景下身份一致性比对需求的问题。其解决方案的关键在于提出“头部相似性”(Head Similarity)这一新范式,通过显式建模身份内的外观变化,并在身份与外观状态之间建立分层相似性排序机制,实现即使在遮挡或侧视条件下也能进行有意义的比较;同时,研究构建了一个基于弱监督外观状态的大规模长视频基准数据集,并设计了一种联合建模身份判别与外观敏感相似性的框架,借助分层监督和身份感知蒸馏策略,有效提升了模型在复杂条件下的结构化全头相似性建模能力。
链接: https://arxiv.org/abs/2605.07766
作者: Yingfeng Wang,Yuxuan Xiao,Shengcai Liao
机构: United Arab Emirates University (阿联酋大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Many vision applications require identity consistency beyond strict biometric recognition, especially under non-frontal views or when facial cues are missing. However, conventional face recognition models enforce intra-identity invariance, collapsing appearance variations such as hairstyle or styling changes into a single representation, limiting their use in appearance-sensitive scenarios. To address this limitation, we introduce Head Similarity, a new formulation that extends identity-centric recognition to structured whole-head similarity modeling. Our approach explicitly captures intra-identity appearance variation and enforces hierarchical similarity ordering across identity and appearance states, enabling meaningful comparison even under occlusion or rear-view conditions. We construct a large-scale benchmark from long-form videos with weakly-supervised appearance states, covering diverse poses, occlusions, and temporal changes. As a first step, we develop a simple yet effective framework that jointly models identity discrimination and appearance-sensitive similarity through hierarchical supervision and identity-aware distillation. Experiments show that conventional face recognition models fail to capture appearance-dependent similarity, while our approach demonstrates the feasibility of structured whole-head similarity modeling.
[CV-45] Benchmarking Foundation Models for Renal Lesion Stratification in CT
【速读】:该论文旨在解决开放源代码医学基础模型(Foundation Models, FMs)在临床数据稀缺场景下,其预训练表征迁移能力是否足够有效的问题,尤其聚焦于CT影像中肾病变分类任务。解决方案的关键在于采用冻结特征探测(frozen feature-probing)协议,在一个包含2,854个病灶的复合数据集上对三种医学FMs进行基准测试,并与手工设计的放射组学(radiomics)分类器及从零开始训练的3D ResNet-50模型进行性能对比。结果表明,尽管FMs在计算资源消耗上显著低于深度学习模型(仅需CPU几秒完成推理),但其AUC(0.70–0.77)并未超越传统放射组学方法(AUC 0.88),说明当前通用型FMs尚未充分捕捉到驱动组织学亚型区分的细微纹理和形态异质性。
链接: https://arxiv.org/abs/2605.07749
作者: Hartmut Häntze,Sarah de Boer,Myrthe Buser,Alessa Hering,Bram van Ginneken,Mathias Prokop,Jawed Nawabi,Sebastian Ziegelmayer,Lisa Adams,Keno Bressem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures
Abstract:The rapid proliferation of open-source medical foundation models (FMs) raises a practical question: how well do their pre-trained representations transfer to clinically relevant but data-scarce classification tasks? Particularly in CT-based renal lesion classification, a push toward greater generalizability would be meaningful, as the field is constrained by inherently limited training data. We addressed this through a benchmark of three medical FMs on this specific task. This six-class problem spans common entities like cysts and clear cell renal cell carcinoma, alongside rare subtypes. Using a frozen feature-probing protocol, we compared FM embeddings against a handcrafted radiomics classifier and a 3D ResNet-50 trained from scratch. Models were trained on a composite dataset of 2,854 lesions and evaluated on an external test set of 234 lesions from The Cancer Imaging Archive. Our results reveal two key findings. First, FM performance (AUC 0.70-0.77) matched the from-scratch ResNet (AUC 0.72) while drastically reducing hardware demand, requiring only seconds on a CPU after feature extraction. However, the conventional radiomics baseline significantly outperformed all deep learning approaches, achieving an AUC of 0.88 (all p \leq 0.002). This suggests that current generalist FM embeddings do not yet capture the fine-grained texture and shape heterogeneity driving histological subtype discrimination. Despite their potential in data-scarce settings, medical FMs did not surpass established models for renal lesion stratification, leaving radiomics as the current state-of-the-art.
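摘要中的冻结特征探测(frozen feature-probing)协议可以用一个极简的线性探针来示意:编码器输出的嵌入保持不变,只在其上拟合一个轻量分类器。以下代码中的数据、维度与类别数均为虚构:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_classes = 120, 16, 3
X = rng.normal(size=(n, d))          # 模拟"冻结"编码器输出的嵌入
y = rng.integers(0, n_classes, size=n)
X[np.arange(n), y] += 3.0            # 人为注入类别信号

# 岭回归形式的线性探针:只拟合 W,编码器不参与任何更新
Y = np.eye(n_classes)[y]
W = np.linalg.solve(X.T @ X + 1e-2 * np.eye(d), X.T @ Y)
acc = (np.argmax(X @ W, axis=1) == y).mean()
print(acc)
```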
[CV-46] LAMES: A Large-Scale and Artisanal Mining Environmental Segmentation Dataset
【速读】:该论文旨在解决矿产开采活动(尤其是非法小规模采矿,ASM)对环境造成广泛负面影响的问题,包括土地利用变化、高能耗、土壤侵蚀和森林砍伐等。其核心挑战在于难以有效监测偏远矿区的活动,特别是非法采矿行为及其生态后果。解决方案的关键是构建一个包含150个大型采矿(LSM)站点和870 km²小规模采矿(ASM)区域标注数据集,涵盖9类LSM特征与27项采矿点属性,从而为环境影响评估、非法采矿识别及矿场属性与环境关联机制研究提供高质量数据支撑。
链接: https://arxiv.org/abs/2605.07740
作者: Matthias Kahl,Zhaiyu Chen,Sudipan Saha,Mrinalini Kochupillai,Lukas Kondmann,Xiao Xiang Zhu
机构: Technical University of Munich, Germany (慕尼黑工业大学); Hanken School of Economics / Svenska handelshögskolan, Helsinki, Finland (赫尔辛基经济学院); Indian Institute of Technology, Delhi, India (印度理工学院德里分校); OroraTech, Munich, Germany (OroraTech)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mining operations are of utmost importance to the economy of some nations. However, such operations result in land-use change, very high energy consumption, and negative impacts on the environment, including soil erosion and deforestation. The mining process can impact an area much larger than the mining site itself. Adding to the negative externalities linked to mining is the fact that, in addition to government-sanctioned legal mining operations, illegal mining is widespread, including in various countries of Africa. The ability to monitor remote mining site activities can be useful, e.g., for the detection of illegal artisanal mining activities and their environmental impacts. An important outcome of such monitoring could include a better understanding of the interrelationship between mine facility attributes (e.g., mining types, processing methods, commodities, etc.) and their impact on the natural environment. In this work, we present a data set that contains 150 Large Scale Mining (LSM) sites and 870km^2 annotated area of Artisanal Small-scale Mining (ASM) sites. The metadata includes nine eminent LSM sections and 27 mining site attributes for each LSM site. We also discuss the data set’s possible contribution to the research community, social and environmental consequences, and researchers’ responsibilities from an ethics perspective.
[CV-47] OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos
【速读】:该论文旨在解决高保真眼科手术视频生成中难以实现精确编辑的问题,特别是如何在严格解剖结构和时间约束下修改手术属性(如器械与组织交互或手术阶段)。其解决方案的关键在于提出一种无需训练的框架OphEdit,通过确定性的二阶常微分方程(Ordinary Differential Equation, ODE)逆向流程提取原始视频中的Attention Value (V)张量,并在去噪过程中将这些张量选择性注入条件无分类器引导(Classifier-Free Guidance, CFG)分支,从而在保持眼部精细解剖几何结构的同时,实现文本驱动的语义修改。该方法显著优于自然域视频编辑工具,在结构保真度和时序一致性方面表现更优。
链接: https://arxiv.org/abs/2605.07695
作者: Ritul Jangir,Arkya Jyoti Bagchi,Aiman Farooq,Mangalton Okram,Saurabh Seetaram Korgaonkar,Deepak Mishra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-fidelity surgical video generation can greatly improve medical training and the development of AI, yet adapting these generative models for precise video editing remains a formidable challenge. Modifying surgical attributes, such as instrument-tissue interactions or procedural phases, is challenging due to the strict anatomical and temporal constraints. In this paper, we propose OphEdit, a novel training-free framework for the text-guided editing of ophthalmic surgical videos. Our approach leverages a deterministic second-order ODE inversion pipeline to capture Attention Value (V) tensors from the original video. By selectively injecting these stored tensors into the conditional Classifier-Free Guidance (CFG) branch during the denoising phase, OphEdit rigorously preserves the intricate anatomical geometry of the eye while seamlessly mapping text-driven semantic modifications onto the video stream. Clinical evaluations demonstrate that OphEdit effectively handles complex surgical transformations, such as instrument swaps and procedural variations, with superior structural fidelity and temporal consistency compared to natural-domain video editors. Our work represents the first application of training-free video editing in the ophthalmic surgical domain, offering a scalable solution for generating diverse, annotated medical datasets without the need for exhaustive manual recording or costly model fine-tuning. The code and prompts can be accessed at this https URL
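速读与摘要都提到向条件 CFG 分支注入张量。作为背景,标准的 Classifier-Free Guidance 组合公式如下(这只是通用公式的示意,并非论文完整的 V 张量注入流程):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """标准 CFG 组合:无条件预测 + w 倍的条件方向。"""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.zeros(4)
eps_c = np.ones(4)
out = cfg(eps_u, eps_c, 3.0)   # w > 1 时沿条件方向外推
```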
[CV-48] Stochastic Transition-Map Distillation for Fast Probabilistic Inference
【速读】:该论文旨在解决扩散模型(Diffusion Models)在推理阶段计算成本高昂的问题,同时保持其生成样本的概率特性。解决方案的关键在于提出一种无需教师模型(teacher-free)的随机过渡映射蒸馏方法(Stochastic Transition-Map Distillation, STMD),其核心是将扩散过程对应的随机微分方程(SDE)的完整转移映射进行蒸馏,而非仅学习后验分布的均值(如基于得分的扩散模型)。STMD通过条件均流模型(conditional Mean Flow model)参数化这些SDE转移,从而构建出一步或少数几步的随机采样器,保留了原始扩散过程的转移结构,适用于需要随机推理的任务(如后验采样、逆问题求解和基于能量的微调)。该方法无需预训练教师模型、双层优化或轨迹模拟与缓存,具备高效且可扩展的训练优势,并在Wasserstein距离下提供了收敛性理论保证。
链接: https://arxiv.org/abs/2605.07661
作者: George Rapakoulias,Peter Garud,Lingjiong Zhu,Panagiotis Tsiotras
机构: Georgia Institute of Technology (佐治亚理工学院); Florida State University (佛罗里达州立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models achieve strong generation quality, diversity, and distribution coverage, but their performance often comes with expensive inference. In this work, we propose Stochastic Transition-Map Distillation (STMD), a teacher-free framework for accelerating diffusion model inference while preserving probabilistic sample generation. In contrast to score-based diffusion models, whose denoising parametrization models the mean of the posterior distribution, STMD distills the full transition map associated with the sampling stochastic differential equation (SDE). We parameterize these SDE transitions with a conditional Mean Flow model, yielding a one- or few-step stochastic sampler that retains the transition structure of the underlying diffusion process. This perspective is especially useful for downstream tasks that require stochastic inference, such as diffusion posterior sampling, inverse problems, and energy-based fine-tuning. Compared to recent distillation methods, STMD requires no pretrained teacher, bi-level optimization, or trajectory simulation and caching, enabling efficient and scalable training. We derive convergence bounds for our method in the Wasserstein distance, providing a strong theoretical foundation for our approach, and validate STMD on various image generation examples on the MNIST, CIFAR-10, and CelebA datasets.
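作为背景说明,STMD 蒸馏的对象是采样 SDE 的随机转移映射。下面用一个 Ornstein–Uhlenbeck 过程的 Euler–Maruyama 转移做玩具示意(与论文方法无关,仅展示"多步随机转移"指的是什么):

```python
import numpy as np

rng = np.random.default_rng(42)

def em_transition(x, dt, rng):
    """dx = -x dt + sqrt(2) dW 的一步 Euler–Maruyama 转移。"""
    return x - x * dt + np.sqrt(2.0 * dt) * rng.normal(size=x.shape)

x = np.ones(10000)            # 初始分布:全部位于 1
for _ in range(1000):         # 模拟到 t = 1
    x = em_transition(x, 1e-3, rng)

# 边缘分布趋向 N(0, 1):均值收缩到约 e^{-1},标准差接近 sqrt(1 - e^{-2})
print(x.mean(), x.std())
```

蒸馏的目标正是用一个(或少数几个)学习到的随机映射,直接复现这类多步转移的终态分布。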
[CV-49] Towards Billion-scale Multi-modal Biometric Search
【速读】:该论文旨在解决在国家规模身份识别系统中对十亿级多模态生物特征数据库进行高效、准确的1:N匹配(去重)问题,这要求突破生物特征系统在采集、预处理、特征提取、匹配速度、抗伪造检测及特殊场景处理等方面的极限。解决方案的关键在于构建一个基于开源架构的端到端多模态生物特征搜索系统——Bharat ABIS,该系统通过为指纹、人脸和虹膜三种模态分别设计特定的预处理(分割)、质量评估、活体攻击检测(Presentation Attack Detection, PAD)和嵌入学习(Feature Extraction)流程,最终生成每人的13.5KB串联模板,并实现高精度与高吞吐量的协同优化:在2.2亿个身份的分层采样画廊上,成人样本的误拒绝率(FNIR)为0.3%、误接受率(FPIR)为0.5%,且单服务器(8块NVIDIA H100 GPU,2TB内存)即可实现4000万条目画廊下每秒100次搜索的吞吐量,显著优于商用现成系统(COTS)。
链接: https://arxiv.org/abs/2605.07655
作者: Arka Koner,Chetan S. Naik,Lokesh Kurre,Vivek Raghavan,Barada P. Sabut,Tanusree Deb Barma,Anoop M. Namboodiri,Anil K. Jain
机构: Unique Identification Authority of India (印度身份识别局); IIIT Hyderabad (印度国际信息技术学院); Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Searching a multi-biometric database of a billion records for a country-level identity system requires pushing the limits of all aspects of a biometric system, including acquisition, preprocessing, feature extraction, accuracy, matching speed, presentation attack detection, and handling of special cases (e.g., missing finger digits). This is the first paper that gives insights into such a large-scale multimodal biometric search system, called Bharat ABIS, based on open-source architectures. The end-to-end pipeline of Bharat ABIS processes fingerprint, face and iris modalities through modality-specific stages of preprocessing (segmentation), quality assessment, presentation attack detection, and learning an embedding (feature extraction), producing a concatenated template of 13.5KB per person. We present a detailed analysis of the modalities and how they are integrated to create an efficient and effective solution for 1:N search (de-duplication). Evaluations on a demographically stratified gallery of 220 million identities, randomly sampled from 1.55 billion records in India’s Aadhaar database, yield an FNIR of 0.3% at an FPIR of 0.5%, for adult probes (over 18 years). We also compare the performance of Bharat ABIS against three state-of-the-art COTS systems on a 20M gallery. Our system achieves a throughput of 100 searches per second on a gallery of 40M on a single server (8xNvidia H100 GPUs, 2TB RAM).
[CV-50] Aquatic Neuromorphic Optical Flow
【速读】:该论文旨在解决水下环境中传统成像系统因资源限制(如计算能力、功耗和带宽)而难以实现高质量感知的问题,尤其针对水下数据稀缺导致的视觉算法训练困难。其解决方案的关键在于利用事件相机(event camera)产生的异步事件流,结合脉冲神经网络(spiking neural network, SNN),提出一种自监督框架来估计像素级光流(optical flow),从而在无需大量标注数据的情况下实现高效、实时的水下运动感知。该方法显著提升了资源受限水下边缘平台上的感知性能与计算效率。
链接: https://arxiv.org/abs/2605.07653
作者: Pei Zhang,Yunkai Liang,Kaiqiang Wang
机构: Guangxi University (广西大学); Baise Artificial Intelligence Innovation and Development Center (百色人工智能创新发展中心); Northwestern Polytechnical University (西北工业大学); Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This work is under review
Abstract:Underwater environments impose severe constraints on conventional imaging systems and demand solutions that balance high-quality sensing with strict resource efficiency. While emerging event cameras offer a promising alternative, their potential in aquatic scenarios remains largely unexplored. Through the lens of neuromorphic vision, this work pioneers the investigation of motion fields that serve as key media for agile underwater perception. Built upon spiking neural networks, we introduce a self-supervised framework to estimate per-pixel optical flow from asynchronous event streams, elegantly bypassing the long-standing bottleneck of underwater data scarcity. Extensive evaluations demonstrate that our method achieves competitive visual and quantitative results against leading techniques while operating with superior computational efficiency. By bridging neuromorphic sensing and aquatic intelligence, this work opens new frontiers for lightweight, real-time, and low-cost perception on resource-constrained underwater edge platforms.
[CV-51] Breaking Spatial Uniformity: Prior-Guided Mamba with Radial Serialization for Lens Flare Removal
【速读】:该论文旨在解决镜头眩光(lens flare)在夜间摄影中因复杂光学像差导致的图像质量退化问题,尤其针对现有恢复方法多采用空间均匀处理、难以满足不同区域差异化修复需求的局限性——即需保留饱和光源区域、去除眩光伪影并恢复背景细节。解决方案的关键在于提出DeflareMambav2框架,其核心创新包括:1)引入眩光先验网络(Flare Prior Network, FPN)以估计眩光先验并指导自适应修复;2)设计新颖的径向序列化策略,打破空间同质处理限制,实现面向眩光的靶向采样,增强状态空间模型(State Space Models, SSMs)对长程依赖的建模能力;3)采用双层自适应机制,在像素级精确控制修复强度的同时,显式保护光源区域避免过处理,并对非光源区域实施课程学习式的渐进恢复。
链接: https://arxiv.org/abs/2605.07650
作者: Zijia Fu,Yuanfei Huang,Lizhi Wang,Hua Huang (School of Artificial Intelligence, Beijing Normal University, Beijing, China)
机构: Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Lens flares, caused by complex optical aberrations, severely degrade image quality especially in nighttime photography. Although recent restoration methods have made remarkable progress, most still rely on spatially uniform processing, failing to handle the region-dependent restoration demands of flare scenes, where saturated light sources should be preserved, flare artifacts removed, and background details recovered. To address this challenge, we propose DeflareMambav2, a prior-guided Mamba framework for lens flare removal. Specifically, we introduce a Flare Prior Network (FPN) to estimate flare priors and guide adaptive restoration. Besides, a novel radial serialization strategy breaks spatially homogeneous processing by performing flare-aware targeted sampling, and better supports long-range modeling in State Space Models (SSMs). Based on these priors, the backbone adopts a dual-level adaptive scheme. It explicitly preserves light-source regions to avoid over-processing, and applies curriculum-based restoration to the remaining contaminated areas while calibrating restoration intensity at the pixel level. Extensive experiments demonstrate that DeflareMambav2 achieves state-of-the-art performance with reduced parameter burden. Code is available at this https URL.
[CV-52] Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
【速读】:该论文旨在解决自动驾驶系统(ADS)在实际应用中如何实现安全、可靠且可审计的运行设计域(ODD)感知问题。当前,ODD定义了自动驾驶系统可安全操作的具体环境条件,其准确感知对合规性和安全性至关重要。传统方法依赖于特定任务的训练数据,难以适应动态变化的ODD定义。论文提出利用视觉语言模型(VLMs)作为零样本(zero-shot)“ODD传感器”,无需重新训练即可适应不断演化的ODD规范。解决方案的关键在于采用定义锚定的思维链提示(definition-anchored chain-of-thought prompting)与角色分解(persona decomposition)相结合的策略,显著提升零样本场景下ODD分类与检测的准确性与鲁棒性,从而为高安全要求场景中的ODD感知提供透明、高效的通用感知框架。
链接: https://arxiv.org/abs/2605.07649
作者: Berkehan Ünal,Dierend Hauke,Fazlija Dren,Plachetka Christopher
机构: Volkswagen Aktiengesellschaft (大众汽车股份公司); L3S Research Center (L3S 研究中心); Leibniz University Hannover (汉诺威莱布尼茨大学); Faculty of Information Technology (信息学院); University of Jyväskylä (于韦斯屈莱大学); MOIA GmbH (MOIA 公司); Motor AI GmbH (Motor AI 公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 8 pages, 4 figures
Abstract:Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot “ODD sensors” that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.
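论文报告"定义锚定的思维链提示 + 角色分解"策略效果最佳。下面给出一个示意性的提示模板(其中的 ODD 属性定义与全部措辞均为笔者假设,并非论文附带的模板):

```python
# 假设的 ODD 属性定义(定义锚定部分)
ODD_DEFINITION = "Rain: visible precipitation, wet road surface, or active wipers."

prompt = (
    "You are a vehicle safety auditor checking Operational Design Domain limits.\n"  # 角色(persona)
    f"Definition to apply: {ODD_DEFINITION}\n"                                       # 定义锚定
    "Step 1: List the visual cues in the image relevant to this definition.\n"       # 思维链分步
    "Step 2: Compare each cue against the definition.\n"
    "Step 3: Answer 'inside ODD' or 'outside ODD' with a one-line justification."
)
print(prompt)
```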
[CV-53] EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting CVPR
【速读】:该论文旨在解决从第一人称视角视频中准确预测未来3D手部姿态序列的问题,这一任务在理解人类意图及实现增强现实(AR)/虚拟现实(VR)辅助和人机交互等具身应用中至关重要。由于第一人称手部运动受复杂人类意图驱动、具有高度灵巧的关节运动特性,并且在自我运动引起的剧烈视角变化下观测,使得该问题极具挑战性。解决方案的关键在于提出EggHand框架,其核心创新是将视觉-语言-动作(Vision-Language-Action, VLA)模型中的动作解码器与一个第一人称视频-文本编码器相结合,前者建模手部运动的结构化时序动态,后者提供从大规模第一人称视频中学习到的视角感知上下文信息,从而在不依赖身体姿态或外部追踪的情况下,实现对运动、上下文和高层意图的联合推理,显著提升预测准确性与鲁棒性。
链接: https://arxiv.org/abs/2605.07642
作者: Jaeyoung Choi,Hyeondong Kim,Yujin Kim,Daehee Park
机构: DGIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Findings 2026
Abstract:Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains a highly challenging problem because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent-without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: this https URL
[CV-54] LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation
【速读】:该论文旨在解决遥感岩性解译中自动化识别可靠性低的问题,其核心挑战在于岩性判别依赖专家知识,需综合视觉、光谱、纹理、地貌及上下文等多维特征,而现有方法缺乏能全面评估地质语义理解能力的基准。解决方案的关键是提出LithoBench——一个多层次岩性解译基准,包含10,000个专家标注样本,覆盖12类典型岩性,并按认知层级划分为五级任务(识别描述、对比分析、机制解释、实践应用与综合推理),同时构建“专家在环”的知识引导半自动化构建流程,确保地质有效性与评估可靠性,实验表明当前主流大视觉语言模型在高阶地质语义理解任务上存在显著不足。
链接: https://arxiv.org/abs/2605.07640
作者: Jun Wang,Fengpeng Li,Hang Dong,Tianjin Huang,Wei Han
机构: China University of Geosciences; King Abdullah University of Science and Technology; University of Exeter
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Remote sensing lithology interpretation is fundamental to geological surveys, mineral exploration, and regional geological mapping. Unlike general land-cover recognition, lithology interpretation is a knowledge-intensive task that requires experts to infer rock types from various features, e.g., subtle visual, spectral, textural, geomorphological, and contextual cues, making reliable automated interpretation highly challenging. Geological knowledge-guided large multimodal models offer new opportunities, yet their evaluation remains constrained by the lack of benchmarks that capture lithological annotations, multi-level geological semantics, and expert-informed assessment. Here, we propose LithoBench, a multi-level benchmark for evaluating geological semantic understanding in remote sensing lithology interpretation. LithoBench contains 10,000 expert-annotated interpretation instances across 12 representative lithological categories, including 4,000 multiple-choice and 6,000 open-ended tasks organized into five cognitive levels: Identification and Description, Comparative Analysis, Mechanism Explanation, Practical Application, and Comprehensive Reasoning. We further develop an expert-in-the-loop, knowledge-grounded semi-automated construction pipeline, coupling multiple sub-processes, e.g., structured geological image descriptions, to enhance geological validity and evaluation reliability. Experiments with multiple large vision-language models reveal substantial limitations in geological semantic understanding, particularly on higher-order explanation, application, and reasoning tasks.
[CV-55] FS-I2P: A Hierarchical Focus-Sweep Registration Network with Dynamically Allocated Depth ICML2026
【速读】:该论文旨在解决图像到点云(image-to-point cloud)配准中因视角变化、跨模态差异和重复纹理导致的尺度模糊问题,从而引发错误对应关系的挑战。现有无检测方法虽利用多尺度特征与基于Transformer的交互缓解该问题,但仍存在层间注意力漂移和同尺度内不一致性,限制了配准精度。其解决方案的关键在于提出“Focus–Sweep”范式,并设计一种基于状态空间模型(SSM)框架的层次化Focus–Sweep交互模块,以增强多层级跨模态特征关联;同时引入动态层分配策略,自适应确定迭代深度以更好地利用几何约束并提升匹配鲁棒性。
链接: https://arxiv.org/abs/2605.07607
作者: Zhixin Cheng,Yujia Chen,Xujing Tao,Bohao Liao,Xiaotian Yin,Baoqun Yin,Tianzhu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026
Abstract:Image-to-point cloud registration is often challenged by viewpoint changes, cross-modal discrepancies, and repetitive textures, which induce scale ambiguity and consequently lead to erroneous correspondences. Recent detection-free methods alleviate this issue by leveraging multi-scale features and transformer-based interactions. However, they still suffer from attention drift across layers and intra-scale inconsistencies, hindering precise registration. Inspired by human behavior, we propose a “Focus–Sweep” paradigm and develop a Hierarchical Focus–Sweep Interaction Module within an SSM-based framework to enhance multi-level cross-modal feature association. In addition, we introduce a Dynamic Layer Allocation Strategy that adaptively determines the iteration depth to better exploit geometric constraints and improve matching robustness. Extensive experiments and ablations on two benchmarks, RGB-D Scenes V2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance.
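"径向序列化"这类策略的一种常见理解,是按到某个锚点的径向距离对特征位置排序后再送入序列模型。以下为通用示意(锚点取坐标均值、数据随机生成,具体细节为笔者假设,并非论文实现):

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(-1.0, 1.0, size=(16, 2))   # 模拟 16 个特征位置

center = coords.mean(axis=0)                     # 锚点:几何中心(假设)
radii = np.linalg.norm(coords - center, axis=1)
order = np.argsort(radii)                        # 由内向外的径向顺序
serialized = coords[order]                       # 序列化后的特征位置
```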
[CV-56] SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild
【速读】:该论文旨在解决野外环境中多动物3D重建的挑战,包括物种差异大、频繁遮挡以及多动物场景普遍等问题,而现有方法主要局限于单动物场景。其解决方案的关键在于提出首个可提示(promptable)的多动物3D重建框架SAM 3D Animal,基于SMAL+参数化动物模型,能够联合重建多个动物实例,并支持以关键点和掩码形式输入灵活提示,从而在拥挤和遮挡场景中实现更可靠的区分与重建。此外,为训练该模型,作者还构建了Herd3D数据集,包含超过5000张图像,涵盖多样化的物种、互动方式及遮挡模式,显著提升了模型在真实复杂场景下的泛化能力。
链接: https://arxiv.org/abs/2605.07604
作者: Xuyi Hu,Jin Lyu,Jiuming Liu,Yebin Liu,Silvia Zuffi,Liang An,Stefan Goetz
机构: University of Cambridge (剑桥大学); Southern University of Science and Technology (南方科技大学); Tsinghua University (清华大学); IMATI-CNR, Milan, Italy (意大利国家研究委员会米兰研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:3D animal reconstruction in the wild remains challenging due to large species variation, frequent occlusions, and the prevalence of multi-animal scenes, while existing methods predominantly focus on single-animal settings. We present SAM 3D Animal, the first promptable framework for multi-animal 3D reconstruction from a single image. Built on the SMAL+ parametric animal model, our method jointly reconstructs multiple instances and supports flexible prompts in the form of keypoints and masks which enable more reliable disambiguation in crowded and occluded scenes. To train such a model, we further introduce Herd3D, a multi-animal 3D dataset containing over 5K images, designed to increase diversity in species, interactions, and occlusion patterns. Experiments on the Animal3D, APTv2, and Animal Kingdom datasets show that our framework achieves state-of-the-art results over both existing model-based and model-free methods, demonstrating a scalable and effective solution for prompt-driven animal 3D reconstruction in the wild.
[CV-57] TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
【速读】:该论文旨在解决现有音频-视觉理解基准在评估多跳推理(multi-hop reasoning)能力与跨模态幻觉鲁棒性(multimodal hallucination robustness)方面的不足问题。当前主流评测数据集通常局限于短片段视频、单一模态隔离或简化为单跳感知任务,无法真实反映现实世界中稀疏、时序分散且跨视听流的证据链整合需求。为此,作者提出TraceAV-Bench——首个联合评估长期视听轨迹上多跳推理能力和多模态幻觉鲁棒性的基准,其核心创新在于构建了一个包含2,200个严格验证的多项选择题、覆盖578段长达339.5小时的长视频的数据集,每个问题平均涉及3.68跳推理路径和15.1分钟的时间跨度,并通过三步半自动化流水线与严苛质量控制流程确保数据可靠性。实验表明,即使是最先进的OmniLLMs(如Gemini 3.1 Pro)在通用任务上仅达68.29%,而开源模型(Ming-Flash-Omni-2.0)仅为51.70%,且幻觉鲁棒性与一般多模态推理性能呈解耦趋势,凸显了该任务的挑战性和研究价值。
链接: https://arxiv.org/abs/2605.07593
作者: Hengyi Feng,Hao Liang,Mingrui Chen,Bohan Zeng,Meiyi Qiang,Zhengyang Zhao,Zimo Meng,Zeang Sheng,Wentao Zhang
机构: University of Electronic Science and Technology of China (电子科技大学); Peking University (北京大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Zhongguancun Academy (中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams, whereas existing benchmarks largely fail to evaluate this capability. They restrict videos to short clips, isolate modalities, or reduce questions to one-hop perception. We introduce TraceAV-Bench, the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. TraceAV-Bench comprises 2,200 rigorously validated multiple-choice questions over 578 long videos, totaling 339.5 hours, spanning 4 evaluation dimensions and 15 sub-tasks. Each question is grounded in an explicit reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. The dataset is built by a three-step semi-automated pipeline followed by a strict quality assurance process. Evaluation of multiple representative OmniLLMs on TraceAV-Bench reveals that the benchmark poses a persistent challenge across all models, with the strongest closed-source model (Gemini 3.1 Pro) reaching only 68.29% on general tasks, and the best open-source model (Ming-Flash-Omni-2.0) reaching 51.70%, leaving substantial headroom. Moreover, we find that robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance. We anticipate that TraceAV-Bench will stimulate further research toward OmniLLMs that can reason coherently and faithfully over long-form audio-visual content.
[CV-58] Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness
【速读】:该论文旨在解决点云(Point Cloud)模型在面对对抗攻击时的脆弱性问题,其核心假设是:3D网络的对抗脆弱性源于模型学习到的潜在几何结构(latent geometry)与输入点云内在几何结构(intrinsic geometry)之间的流形错位(manifold misalignment)。现有方法多依赖数据增强或防御机制,忽视了这一根本性的几何成因。解决方案的关键在于提出一种名为“流形对齐点识别”(Manifold-Aligned Point Recognition, MAPR)的新框架,通过在训练中引入捕捉局部曲率和扩散结构的内在特征,并施加一致性损失以保持对几何保真扰动的不变性,从而显式地对齐潜在空间与内在几何结构,实现无需对抗训练或额外数据即可显著提升鲁棒性,在ModelNet40和ScanObjectNN上分别获得平均+20.02%和+8.58%的鲁棒性增益。
链接: https://arxiv.org/abs/2605.07590
作者: Pedro Alonso,Chongshou Li,Tianrui Li
机构: Southwest Jiaotong University (西南交通大学); Engineering Research Center of Sustainable Urban Intelligent Transportation (教育部可持续城市智能交通工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite extensive progress in point cloud robustness, existing methods primarily improve performance through augmentation or defense mechanisms, while overlooking the geometric root cause of adversarial fragility. We hypothesize that adversarial vulnerability in 3D networks arises from a manifold misalignment between the latent geometry learned by the model and the intrinsic geometry of the underlying surface. Small, geometry-preserving perturbations along the input manifold often induce disproportionate distortions in feature space, revealing a misalignment between latent and intrinsic geometries. We formalize this phenomenon by developing a geometric interpretation of 3D robustness that links classical adversarial theory to the intrinsic structure of point clouds. Motivated by this analysis, we introduce Manifold-Aligned Point Recognition (MAPR), a framework that regularizes the latent geometry by aligning predictions across intrinsic perturbations. MAPR augments each point cloud with intrinsic features capturing local curvature and diffusion structure, and applies a consistency loss that preserves invariance to intrinsic, geometry-preserving perturbations. Without relying on adversarial training or additional data, MAPR consistently improves robustness across multiple adversarial attacks on both the ModelNet40 and ScanObjectNN datasets, achieving average robustness gains of +20.02% and +8.58% on ModelNet40 and ScanObjectNN, respectively.
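MAPR 的核心之一是对"保持几何的扰动"施加预测一致性损失。下面用一个玩具模型与小角度刚体旋转做示意(模型与扰动均为虚构,仅说明这类一致性损失的一般形式):

```python
import numpy as np

rng = np.random.default_rng(1)

def model(points, W):
    """玩具置换不变分类器:均值池化 -> 线性层 -> softmax。"""
    logits = points.mean(axis=0) @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

pts = rng.normal(size=(64, 3))
W = rng.normal(size=(3, 4))

# 保持几何的扰动:绕 z 轴的小角度刚体旋转
theta = 0.05
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])

p, p_rot = model(pts, W), model(pts @ R.T, W)
consistency_loss = np.mean((p - p_rot) ** 2)   # 惩罚扰动前后的预测漂移
```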
[CV-59] Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding ACL2026
【Quick Read】: This paper addresses the problem of deciding when a Video-LLM should respond as a streaming video unfolds; existing methods fall short because they model visual evidence implicitly and in a query-agnostic way, without explicit structured alignment. The key to the solution is the Response-G1 framework, which uses scene graphs to establish explicit, structured alignment between accumulated video evidence and the query's expected response conditions, through three fine-tuning-free stages: online query-guided scene graph generation, memory-based semantic retrieval of historical scene graphs, and retrieval-augmented trigger prompting. By grounding both visual evidence and response conditions in a shared graph representation, the framework makes response-timing decisions more accurate and interpretable.
Link: https://arxiv.org/abs/2605.07575
Authors: Ke Ma,Jiaqi Tang,Bin Guo,Xueting Han,Ruonan Xu,Qingfeng He,Ziheng Wang,Xu Wang,Qifeng Chen,Zhiwen Yu,Yunhao Liu
Affiliations: Northwestern Polytechnical University (西北工业大学); Tsinghua University (清华大学); The Hong Kong University of Science and Technology (香港科技大学); Harbin Engineering University (哈尔滨工程大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2026
Abstract:Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query’s expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame “silence/response” decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.
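The memory-based retrieval stage (stage 2) amounts to nearest-neighbor search over stored scene-graph embeddings. A minimal sketch, where the embedding dimension and the cosine-similarity criterion are assumptions for illustration rather than details taken from the paper:

```python
import numpy as np

def retrieve_top_k(memory, query, k=2):
    """Return indices of the k stored scene-graph embeddings most
    similar (by cosine) to the query embedding."""
    mem = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = mem @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(1)
memory = rng.normal(size=(8, 32))                # 8 historical scene-graph embeddings
query = memory[3] + 0.01 * rng.normal(size=32)   # query nearly matches entry 3
top = retrieve_top_k(memory, query, k=2)
```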
[CV-60] PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
【Quick Read】: This paper targets the limitation of mainstream vision-language models (VLMs) on optically ambiguous scenes such as reflective and transparent objects, which stems from standard RGB inputs carrying too little physical information to disambiguate these phenomena. The key to the solution is PolarVLM, the first multimodal framework to integrate polarimetric physical parameters into a VLM; through a dual-stream architecture and a progressive two-stage training strategy, it prevents physical misinterpretations while preserving general visual abilities, enabling physics-aware semantic understanding. The authors also construct PolarVQA, the first benchmark for polarization-aware visual question answering (VQA), with 75K physics-grounded instruction-tuning pairs, and report gains over the RGB baseline of 26.6% in reflection recognition and 34.0% in glass counting.
Link: https://arxiv.org/abs/2605.07574
Authors: Yuliang Li,Chu Zhou,Heng Guo,Boxin Shi,Imari Sato,Zhanyu Ma
Affiliations: Beijing University of Posts and Telecommunications (北京邮电大学); National Institute of Informatics (日本信息研究所); Peking University (北京大学); The University of Tokyo (东京大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 12 figures, including appendices
Abstract:Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.
[CV-61] Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs
【Quick Read】: This paper addresses the scale mismatch between remote sensing vision-language models (RS-VLMs) and their natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs), yet existing RS-VLMs either discard GSD or discretize it as a text token, forcing a single static parameter set to absorb scale variation spanning orders of magnitude. The key to the proposed ScaleEarth framework is CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA), which treats GSD as a continuous conditioning variable that modulates the LoRA low-rank subspace, routing computation by physical scale. To remove reliance on sensor metadata at deployment, a lightweight heteroscedastic sub-head (SSE-U) predicts GSD and its uncertainty from visual features, and the large-scale multi-scale GeoScale-VQA corpus closes the method-data loop, yielding state-of-the-art results on remote-sensing benchmarks such as XLRS-Bench and OmniEarth-Bench.
Link: https://arxiv.org/abs/2605.07562
Authors: Song Zhang,Yanlong Chen,Yilin Li,Yining Chen,Zili Yi,Xiaowei Zhang,Yawei Li
Affiliations: Nanjing University (南京大学); ETH Zurich (苏黎世联邦理工学院); RWTH Aachen University (亚琛工业大学); Nanjing University of Posts and Telecommunications (南京邮电大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review. 30 pages, 16 figures, 7 tables
Abstract:Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model’s computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.
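The idea of a continuous GSD-driven gate over a LoRA subspace can be made concrete with a toy forward pass. The sigmoid-over-log-GSD gate and all dimensions below are hypothetical illustrations; the paper's CS-HLoRA design is more elaborate:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 4
W0 = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))    # LoRA down-projection
B = rng.normal(size=(d, r))    # LoRA up-projection
Wg = rng.normal(size=(r, 1))   # hypothetical hyper-gate weights

def gate(gsd):
    # Continuous scale conditioning: per-rank sigmoid gate over log-GSD.
    z = Wg @ np.array([[np.log(gsd)]])
    return 1.0 / (1.0 + np.exp(-z[:, 0]))

def forward(x, gsd):
    g = gate(gsd)                        # gate in (0, 1) per LoRA rank
    return W0 @ x + B @ (g * (A @ x))    # GSD modulates the low-rank update

x = rng.normal(size=d)
y_fine = forward(x, gsd=0.3)     # 0.3 m/pixel imagery
y_coarse = forward(x, gsd=30.0)  # 30 m/pixel imagery
```

The same input is routed differently depending on physical scale, which is the behavior the abstract attributes to CS-HLoRA.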
[CV-62] Multimodal Stepwise Clinically-Guided Attention Learning for Pathological Complete Response Prediction in Breast Cancer
【Quick Read】: This paper addresses two challenges in predicting pathological complete response (pCR) to neoadjuvant therapy in breast cancer: severe class imbalance, which weakens identification of responders, and limited generalization across clinical settings. The key to the solution is a multimodal stepwise clinically-guided attention learning framework that mimics physician reasoning: the model first learns global imaging features to capture discriminative patterns, then introduces spatial attention to focus on tumor regions and better identify responders, and finally integrates clinical variables to refine decision-making. This strategy improves sensitivity, and grounding attention in anatomically consistent tumor regions reduces reliance on dataset-specific patterns, markedly enhancing cross-institutional generalization.
Link: https://arxiv.org/abs/2605.07561
Authors: Alice Natalina Caragliano,Valerio Guarrasi,Michela Gravina,Carlo Sansone,Paolo Soda
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Pathological complete response (pCR) is a key prognostic factor in breast cancer patients undergoing neoadjuvant therapy, strongly associated with long-term survival and treatment personalization. However, accurate pre-treatment pCR prediction remains challenging due to severe class imbalance and limited generalizability across diverse clinical settings. In this work, we propose a multimodal stepwise clinically-guided attention learning framework for pCR prediction from breast magnetic resonance imaging (MRI), designed to address these limitations through medically grounded spatial guidance and multimodal integration. The approach follows a stepwise training strategy inspired by physician reasoning: the model first learns global discriminative imaging patterns, then attention mechanisms are introduced to constrain the network toward tumor regions, and finally clinical variables are integrated to refine decision-making. This guidance strategy encourages prioritization of task-relevant features, improving identification of responders despite their limited representation in the dataset. Moreover, grounding attention in anatomically consistent tumor regions reduces reliance on dataset-specific patterns, thereby enhancing cross-institutional generalization. The framework is evaluated through external validation across heterogeneous MRI cohorts. Compared to non-guided single-stage baselines, the proposed approach improves sensitivity while maintaining competitive specificity, and produces anatomically coherent attention maps that support interpretation of the model’s predictions. These findings highlight the potential of clinically-guided multimodal attention learning for robust and generalizable pCR prediction in breast cancer.
[CV-63] Dynamic Mode Decomposition along Depth in Vision Transformers
【Quick Read】: This paper asks whether Vision Transformer (ViT) depth can be approximated as autonomous linear dynamics, i.e., whether a single linear operator K, applied repeatedly (K^p), can emulate the state evolution across contiguous blocks. The core question is whether the nonlinear transformations between adjacent ViT blocks reduce to a stable, generalizable linear map. The key to the solution is Dynamic Mode Decomposition (DMD), which fits the linear operator K from selected pairs of consecutive hidden states and tests its ability to predict intermediate activations and the endpoint map over short spans (p ≤ 4). Experiments show that on DINOv3-H/16+, K^p tracks an unconstrained endpoint map to within 0.02 cosine similarity; at early layers the fitted operator is low-rank and stable with minimal calibration data, with the CLS token most amenable to linearization. However, this local linear approximation does not hold up downstream, indicating it captures local rather than global structure.
Link: https://arxiv.org/abs/2605.07556
Authors: Nishant Suresh Aswani,Saif Eddin Jabari
Affiliations: NYU Tandon; NYU Abu Dhabi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately \textit{autonomous} linear dynamics, admitting a single operator K applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits K from selected, consecutive hidden-state pairs and predicts p steps ahead via K^p . On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans ( p \leq 4 ), K^p tracks an unconstrained endpoint map to within 0.02 cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank \ll d with minimal calibration data, and across tokens, \texttt{cls} is most amenable to linearization; both properties decay monotonically with depth. Yet this local fidelity does not transfer downstream. At the final hidden state, after propagating through the remaining blocks, an identity baseline becomes competitive.
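The DMD fit described in the abstract reduces to a least-squares one-step operator on consecutive hidden states, rolled forward via a matrix power. A toy sketch on synthetic linear dynamics (the data here is fabricated for illustration, not actual ViT activations):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
# Stable synthetic dynamics: a scaled orthogonal matrix.
K_true = 0.9 * np.linalg.qr(rng.normal(size=(d, d)))[0]

# "Hidden states" across consecutive blocks: h_{t+1} = K_true h_t.
H = [rng.normal(size=d)]
for _ in range(10):
    H.append(K_true @ H[-1])
H = np.array(H)

X, Y = H[:-1].T, H[1:].T          # consecutive state pairs as columns
K = Y @ np.linalg.pinv(X)         # DMD: least-squares one-step operator

p = 4
pred = np.linalg.matrix_power(K, p) @ H[0]   # p-step rollout via K^p
cos = pred @ H[p] / (np.linalg.norm(pred) * np.linalg.norm(H[p]))
```

On truly linear data the fit is exact; on real ViT activations the same recipe measures how close the dynamics are to autonomous and linear.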
[CV-64] VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
【Quick Read】: This paper addresses the fact that existing Transformer-based multimodal 3D human pose estimation (HPE) methods cannot process long sequences in real time due to quadratic complexity, while Mamba alone struggles to model complex spatial dependencies. The key to the solution is VIMCAN, a hybrid architecture that combines Mamba's efficient temporal modeling with cross-attention for spatial dependency extraction, using dynamic parameterization to achieve robust visual-inertial fusion of RGB keypoints and wearable inertial measurement unit (IMU) data. VIMCAN sustains real-time inference (>60 FPS) while improving accuracy, reaching an MPJPE of 17.2 mm on TotalCapture and 45.3 mm on 3DPW.
Link: https://arxiv.org/abs/2605.07552
Authors: Zepeng Yang,Junxuan Bai,Hao Li,Ju Dai,Junjun Pan,Yongfeng Yin,Bin Li
Affiliations: Beihang University (北京航空航天大学); Peng Cheng Laboratory (鹏城实验室); Capital University of Physical Education and Sports (首都体育学院); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages
Abstract:The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual-inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba’s dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code is available on GitHub.
[CV-65] Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views
【Quick Read】: This paper tackles 3D reconstruction from views with no overlap, neither spatial nor in appearance, a setting common in practice, e.g., distributed swarm robotics or crowd-sourced data collection. Existing methods depend on visual overlap for geometric alignment or multi-view consistency and fail under zero overlap, producing disconnected or semantically incoherent reconstructions. The key to the solution is the GLADOS framework and its three-stage design: (1) Generative Bridging, which uses foundation models to synthesize intermediate views connecting the disjoint inputs; (2) Robust Coarse 3D Reconstruction, which builds a coarse geometric scaffold via global alignment and absorbs local contradictions from the generative stage; and (3) Iterative Context Expansion and Consistency Optimization, which fills missing regions and unifies the overall reconstruction. The framework is architecture-agnostic and can seamlessly incorporate future advances in generation, reconstruction, and inpainting.
Link: https://arxiv.org/abs/2605.07550
Authors: Grzegorz Wilczynski,Mikołaj Zielinski,Bartosz Świrta,Dominik Belter,Przemysław Spurek
Affiliations: Jagiellonian University (雅盖隆大学); IDEAS Research Institute (IDEAS 研究所); Poznań University of Technology (波兹南理工大学); Warsaw University of Technology (华沙理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D vision systems are fundamentally constrained by their reliance on visual overlap: reconstruction methods require it for geometric alignment, while generative models use it to enforce multi-view consistency. This limitation is particularly acute in real-world scenarios such as distributed swarm robotics or crowd-sourced data collection, where capturing overlapping perspectives, both in terms of spatial and appearance overlap, is often impossible. We introduce Generative Reconstruction from Disjoint Views as a new paradigm, establish a comprehensive dataset, and propose specialized evaluation metrics for zero-overlap scenarios. Our benchmarking demonstrates that existing state-of-the-art methods fail catastrophically on this task, producing disconnected geometries or semantically incoherent reconstructions. To address these limitations, we propose GLADOS, a general, modular framework that operates through three stages: (1) Generative Bridging, where foundation models synthesize intermediate perspectives to connect disjoint inputs; (2) Robust Coarse 3D Reconstruction, which establishes a coarse geometric scaffold via global alignment that absorbs local contradictions from the generative process; and (3) Iterative Context Expansion and Consistency Optimization to fill missing regions and unify the reconstruction. As an architecture-agnostic framework, GLADOS enables seamless integration of future advances in generation, reconstruction, and inpainting. The source code is available at: this https URL.
[CV-66] Probabilistic Object Detection with Conformal Prediction
【Quick Read】: This paper addresses the reliability and sharpness of uncertainty quantification (UQ) for object detection in safety-critical settings, where classical interval methods fit structured multi-output predictions such as bounding-box coordinates poorly and fixed-width prediction intervals are needlessly wide for low-uncertainty samples. The key elements of the solution are: first, coordinate-wise conformal prediction (CP) with a Bonferroni correction to obtain box-level coverage guarantees; second, scaling the intervals with per-prediction aleatoric uncertainty estimated by a probabilistic detector trained with loss attenuation, which improves interval sharpness; and finally a two-step pipeline that builds class prediction sets with RAPS and conditions the conformalized bounding boxes on that class set, modeling the joint uncertainty. Across three autonomous driving datasets (KITTI, BDD, CODA), scaled CP consistently outperforms unscaled CP, achieving up to 19% higher IoU and 39% lower interval scores while maintaining coverage, and class-wise calibration further improves coverage.
Link: https://arxiv.org/abs/2605.07549
Authors: Christopher Ries,Moussa Kassem Sbeyti,Nicolas Bianco,Nadja Klein
Affiliations: Karlsruhe Institute of Technology (KIT)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code is available at this https URL
Abstract:Conformal Prediction (CP) is a distribution-free method for constructing prediction sets with marginal finite-sample coverage guarantees, making it a suitable framework for reliable uncertainty quantification in safety-critical object detection. However, object detection introduces structured multi-output predictions, complicating the application of classical CP theory developed for single outputs. In addition, standard, unscaled CP produces fixed-width prediction intervals across inputs, leading to unnecessary width for low-uncertainty predictions. While scaled CP addresses this by adapting the interval width to an input-dependent uncertainty estimate, prior work has neither systematically compared unscaled and scaled CP for multi-class object detection, nor integrated CP with a complementary uncertainty quantification method in this setting. We fill this gap by: (i) applying CP coordinate-wise to bounding box corners with a Bonferroni correction for box-level guarantees; (ii) scaling the resulting intervals using per-prediction aleatoric uncertainty estimates derived from a probabilistic object detector trained with loss attenuation, evaluated in uncalibrated and two calibrated variants; (iii) extending to a two-step pipeline that constructs prediction sets for the class using RAPS and conditions the conformalized bounding boxes on the predicted class set. Across three autonomous driving datasets (KITTI, BDD, CODA), including a cross-domain setting under distribution shift, scaled CP consistently improves interval sharpness over unscaled CP, achieving up to 19% higher IoU and 39% lower interval scores, without sacrificing coverage. Class-wise calibration further improves coverage for both variants with a negligible effect on sharpness. Together, these improvements yield more actionable uncertainty estimates for real-time, real-world object detection.
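The coordinate-wise, Bonferroni-corrected, uncertainty-scaled interval construction can be sketched directly. The residuals and scales below are synthetic stand-ins for calibration data from a real probabilistic detector:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 500, 0.1

# Calibration set: per-corner residuals |y - y_hat| and aleatoric scales sigma.
sigma = rng.uniform(0.5, 2.0, size=(n, 4))        # per-corner uncertainty estimates
resid = np.abs(rng.normal(size=(n, 4)) * sigma)   # absolute residuals

# Bonferroni: split miscoverage alpha over the 4 box corners (alpha/4 each),
# with the finite-sample (n+1) correction for each coordinate.
level = min(1.0, np.ceil((n + 1) * (1 - alpha / 4)) / n)
q = np.quantile(resid / sigma, level, axis=0, method="higher")

# Scaled interval half-widths for a new box prediction: wider where the
# detector reports higher aleatoric uncertainty.
sigma_new = rng.uniform(0.5, 2.0, size=4)
half_width = q * sigma_new
```

Unscaled CP corresponds to setting `sigma` to ones, which yields the same half-width for every prediction.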
[CV-67] Implicit Preference Alignment for Human Image Animation ICML2026
【Quick Read】: This paper addresses the difficulty of generating high-fidelity hand motion in human image animation, which stems from the high degrees of freedom and complexity of hand movement, and from the fact that conventional reinforcement learning from human feedback (e.g., direct preference optimization) requires strict preference pairs, which are prohibitively expensive and often impractical to curate for dynamic hand regions because of frame-wise inconsistencies. The key to the solution is Implicit Preference Alignment (IPA), a data-efficient post-training framework that removes the need for paired preference data: grounded in implicit reward maximization, it aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviations from the pretrained prior. A Hand-Aware Local Optimization mechanism further steers the alignment explicitly toward hand regions, substantially improving hand generation quality while lowering the barrier to constructing preference data.
Link: https://arxiv.org/abs/2605.07545
Authors: Yuanzhi Wang,Xuhua Ren,Jiaxiang Cheng,Bing Ma,Kai Yu,Tianxiang Zheng,Qinglin Lu,Zhen Cui
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026
Abstract:Human image animation has witnessed significant advancements, yet generating high-fidelity hand motions remains a persistent challenge due to their high degrees of freedom and motion complexity. While reinforcement learning from human feedback, particularly direct preference optimization, offers a potential solution, it necessitates the construction of strict preference pairs. However, curating such pairs for dynamic hand regions is prohibitively expensive and often impractical due to frame-wise inconsistencies. In this paper, we propose Implicit Preference Alignment (IPA), a data-efficient post-training framework that eliminates the need for paired preference data. Theoretically grounded in implicit reward maximization, IPA aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviations from the pretrained prior. Furthermore, we introduce a Hand-Aware Local Optimization mechanism to explicitly steer the alignment process toward hand regions. Experiments demonstrate that our method achieves effective preference optimization to enhance hand generation quality, while significantly lowering the barrier for constructing preference data. Codes are released at this https URL
[CV-68] Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
【Quick Read】: This paper addresses the lack of a reliability criterion for generative world action models (WAMs) in planning: existing evaluation only checks whether generated futures are visually plausible, ignoring the dynamic consistency between predicted actions and induced state transitions. The study finds that action-state consistency systematically separates successful from failed rollouts and follows similar trends as learned value estimates, showing that it captures decision-relevant structure beyond visual realism. The key to the solution is a value-free consensus strategy that requires no additional training or reward modeling: candidate rollouts are ranked by the agreement among their predicted futures, which improves planning success rates on RoboCasa and RoboTwin 2.0.
Link: https://arxiv.org/abs/2605.07514
Authors: Bo-Kai Ruan,Teng-Fang Hsiao,Ling Lo,Hong-Han Shuai
Affiliations: National Yang Ming Chiao Tung University (国立阳明交通大学)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report
Abstract:World Action Models (WAMs) enable decision-making through imagined rollouts by predicting future observations and actions. However, the reliability of these imagined futures remains under-examined: is a generated future merely visually plausible, or is it dynamically compatible with the action sequence it claims to model? In this work, we identify action-state consistency, the alignment between predicted actions and induced state transitions, as a missing reliability axis for WAMs. Through a systematic study across representative joint-prediction and inverse-dynamics models, we find that action-state consistency systematically separates successful and failed rollouts across many tasks and follows similar success-failure trends as learned value estimates. These results suggest that consistency captures decision-relevant structure beyond visual realism. We further identify background collapse as an important boundary condition, where low-dynamics failed trajectories can become deceptively consistent because static futures are easier to predict. Building on these findings, we introduce a value-free consensus strategy for test-time selection, which ranks candidate rollouts by agreement among predicted futures. This strategy improves success rates on RoboCasa and RoboTwin 2.0 without additional training or reward modeling. Taken together, our findings establish action-state consistency as both a diagnostic tool for evaluating WAM reliability and a practical signal for value-free planning.
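The value-free consensus strategy, ranking candidate rollouts by agreement among predicted futures, can be sketched with cosine agreement over rollout embeddings. Representing each predicted future as a flat vector is an assumption made here for illustration:

```python
import numpy as np

def consensus_scores(rollouts):
    """Score each candidate rollout by its mean cosine agreement
    with every other candidate's predicted future."""
    R = rollouts / np.linalg.norm(rollouts, axis=1, keepdims=True)
    S = R @ R.T
    np.fill_diagonal(S, 0.0)          # exclude self-agreement
    return S.sum(axis=1) / (len(rollouts) - 1)

rng = np.random.default_rng(0)
base = rng.normal(size=16)
rollouts = np.stack(
    [base + 0.05 * rng.normal(size=16) for _ in range(4)]  # 4 agreeing futures
    + [rng.normal(size=16)]                                # 1 outlier rollout
)
best = int(np.argmax(consensus_scores(rollouts)))
```

Selecting `best` picks a rollout from the agreeing cluster, with no reward model or value function involved.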
[CV-69] Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models
【Quick Read】: This paper addresses catastrophic forgetting in class-incremental learning caused by parameter updates, in particular cross-task subspace interference: updates induced by different tasks lie in multiple overlapping low-rank subspaces of the high-dimensional parameter space. The key to the solution is HDSD, a Hierarchical Dual-Subspace Decoupling framework whose core is a lightweight Feature Modulation Module (FMM) that explicitly decomposes the parameter space into a general subspace and task-specific subspaces. On top of this, a General Fusion Module (GFM) uses an adaptive threshold to identify stable and transferable knowledge, and a Hierarchical Learning Module (HLM) performs structured parameter decomposition via singular value decomposition (SVD) with a scaling mechanism that constrains updates within distinct subspace scales, reducing subspace interference and parameter drift and improving continual learning in vision-language models.
Link: https://arxiv.org/abs/2605.07512
Authors: Mengxin Qin,Xiang Zhang,Kun Wei,Xu Yang,Cheng Deng
Affiliations: Xidian University (西安电子科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Class-incremental learning aims to continuously acquire new knowledge while preserving previously learned information, thereby mitigating catastrophic forgetting. Existing methods primarily restrict parameter updates but often overlook their structural properties in high-dimensional spaces. From a subspace perspective, updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference and severe forgetting. To address this issue, we propose HDSD, a Hierarchical Dual-Subspace Decoupling framework for continual learning in vision-language models. Specifically, we introduce a lightweight Feature Modulation Module (FMM) that explicitly decomposes the parameter space into general and task-specific subspaces. Building on this design, we develop two complementary components. First, a General Fusion Module (GFM) evaluates relative parameter changes across tasks and uses an adaptive threshold to capture stable and transferable knowledge. Second, a Hierarchical Learning Module (HLM) performs structured parameter decomposition via Singular Value Decomposition (SVD) and uses a scaling mechanism to constrain updates within distinct subspace scales. Together, these designs reduce subspace interference and parameter drift. Extensive experiments on conventional benchmarks show that HDSD achieves state-of-the-art results.
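SVD-based structured decomposition with scale-constrained updates can be illustrated with a toy rule that keeps the top-k singular directions of an update intact and damps the rest. The damping factor `gamma` and the keep-top-k rule are hypothetical choices for illustration, not the paper's actual HLM mechanism:

```python
import numpy as np

def scaled_update(delta, k, gamma=0.1):
    """Keep the top-k singular directions of a parameter update and
    shrink the remaining directions by gamma, constraining the update
    to act mostly within a low-rank subspace."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    scale = np.ones_like(s)
    scale[k:] = gamma                  # damp the tail of the spectrum
    return (U * (s * scale)) @ Vt      # reassemble the constrained update

rng = np.random.default_rng(0)
delta = rng.normal(size=(8, 8))        # a raw task-induced update
constrained = scaled_update(delta, k=2)
```

The dominant directions (and hence the knowledge the update mainly encodes) survive, while energy outside the chosen subspace, where it could interfere with other tasks, is suppressed.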
[CV-70] Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
【Quick Read】: This paper addresses the mismatch between training noise distributions and inference-time denoising trajectories when aligning large-scale video diffusion models with human intent, a discrepancy that limits existing approaches such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), which rely on bias-prone complex reward models or suboptimal timestep sampling. The key to the solution is Diffusion-APO (Aligned Preference Optimization), which synchronizes training noise with inference-time denoising paths to maximize the efficacy of gradient signals, enabling trajectory-aware preference alignment. The authors further build a unified, modular RLHF framework integrating online ranking, half-online anchoring, offline refinement, and distillation-aware drift correction, supporting flexible multi-stage preference alignment under diverse data and compute constraints without relying on scalar-reward-based policy gradients.
Link: https://arxiv.org/abs/2605.07503
Authors: Jingyuan Zhu,Biaolong Chen,Le Zhang,Aixi Zhang,Hao Jiang,Pipei Huang
Affiliations: Alibaba Group (阿里巴巴集团)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Efficiently aligning large-scale video diffusion models with human intent requires a scalable and trajectory-aware pathway that bridges the inherent discrepancy between training noise distributions and practical inference trajectories. While existing paradigms such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) attempt to address this, they are often hindered by either reliance on bias-prone, complex reward models or suboptimal timestep sampling. In this paper, we propose Diffusion-APO (Aligned Preference Optimization), a trajectory-aware algorithm that resolves this misalignment by synchronizing training noise with inference-time denoising paths to maximize gradient signal efficacy. To translate this algorithmic innovation into a practical solution, we introduce a unified and modular RLHF framework that integrates online ranking, half-online anchoring, offline refinement, and distillation-aware drift correction. This framework enables flexible, multi-stage preference alignment across diverse data and computational constraints without relying on scalar-reward-based policy gradients. Through extensive experiments, we demonstrate that Diffusion-APO consistently outperforms standard baselines in visual quality and instruction following, while effectively preserving generative fidelity during model acceleration, providing a robust, end-to-end pathway for scalable video diffusion alignment.
[CV-71] Cloud-top infrared observations reveal the four-dimensional precipitation structure
【Quick Read】: This paper addresses the lack of observationally resolved four-dimensional (4D) precipitation information at global scales, i.e., how to recover spatiotemporally continuous precipitation structure from geostationary infrared observations. Conventional theory holds that infrared measurements mainly sense cloud-top properties with limited sensitivity to sub-cloud precipitation, but this work shows empirically that cloud-top infrared radiances still encode enough information to reconstruct the vertical and temporal evolution of precipitation. The key to the solution is a physically constrained deep learning framework, 4DPrecipNet, whose core innovation is a moisture-first constraint requiring the latent representation to recover precipitable water vapour, anchoring training in thermodynamic consistency; by integrating multi-channel infrared radiances with radar-derived precipitation profiles, it reconstructs the 4D structure of sub-cloud precipitation systems from cloud-top infrared observations and markedly improves the capture of deep convective systems and their evolution.
Link: https://arxiv.org/abs/2605.07499
Authors: Tianchi Xu,Ziqiang Ma,Andrea Marinoni,Yuanpeng He,Xiaoqing Li,Chuanfeng Zhao,Kang He,Jintao Xu,Bohan Zhou,Wenbo Zhao,Haoshuang Chen,Tun Wang,Dongdong Wang,Yang Hong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate four-dimensional (4D) precipitation information is essential for understanding the Earth’s energy and water cycles, yet remains observationally unresolved at global scales. Conventional theory holds that geostationary infrared observations primarily sense cloud-top properties, with limited sensitivity to sub-cloud precipitation. Here we show that cloud-top infrared measurements nevertheless encode sufficient information to recover the four-dimensional structure of precipitation, revealing a previously unexploited observability of sub-cloud processes. We introduce a physically constrained deep learning framework, 4DPrecipNet, in which a moisture-first constraint requires the latent representation to recover precipitable water vapour, anchoring the model in thermodynamic consistency. By integrating multi-channel infrared radiances with these constraints and radar-derived precipitation profiles, we reconstruct the vertical and temporal evolution of precipitation systems from geostationary orbit. The framework captures deep convective structures and their evolution, with robust performance across large samples and independent radar comparisons. These results demonstrate that sub-cloud precipitation is physically encoded in cloud-top infrared observations, establishing a new pathway for continuous global monitoring of precipitation structure.
[CV-72] Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing CVPR
【Quick Read】: This paper addresses the unpaired smartphone ISP problem: rendering colors from RAW images without scene or color alignment to target RGB images, where existing methods require paired data or rely on adversarial training that is unstable in the unpaired setting. The key steps of the solution are: first, reconstructing larger images from training patches to recover global context; then extracting semantic embeddings with DINOv2 and applying fused Gromov-Wasserstein (FGW) optimal transport to build pseudo pairs between RAW and RGB images at both the image and patch levels, partially alleviating the unpairedness of the data; and finally training a lightweight CNN with only 7K parameters on these pseudo pairs for color transformation. The compact network focuses on color mapping rather than structural change, which reduces artifacts and improves training stability.
Link: https://arxiv.org/abs/2605.07495
Authors: Yujin Cho,Flavien Armangeon,Yanhao Li
Affiliations: Université Paris Saclay, ENS Paris Saclay, CNRS, Centre Borelli; DXOMARK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures, CVPR Workshops 2026
Abstract:Unpaired smartphone ISP is a challenging problem due to the lack of scene and color alignment between RAW and target RGB images. Many existing methods either require paired data or rely heavily on adversarial training, which can become unstable in the unpaired setting. In this work, we present a simple and effective approach developed for the NTIRE 2026 Learned Smartphone ISP Challenge with Unpaired Data. Our method first reconstructs larger images from training patches to recover global context. Then, we extract semantic embeddings with DINOv2, and use fused Gromov-Wasserstein (FGW) optimal transport to build pseudo pairs between RAW and RGB images at both image and patch levels. This semantic matching allows us to partially alleviate the unpairedness of the data and build these pseudo input-target pairs. Based on these pseudo pairs, we train a lightweight CNN with only 7K parameters for color rendering. The network is designed to be compact and focus on color transformation rather than structural change, which helps reduce artifacts and improve training stability. Our challenge submission achieves 22.569 PSNR, 0.675 SSIM, and 8.067 \Delta E on the final hidden test set, significantly improving over the baseline and achieving the 3rd best SSIM and \Delta E among all challenge entries. Our code is available at this http URL .
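The pseudo-pairing step can be approximated, for illustration, by one-to-one matching on cosine distances between semantic embeddings. The greedy matcher below is a simplified stand-in for the fused Gromov-Wasserstein coupling the authors actually use:

```python
import numpy as np

def pseudo_pairs(raw_emb, rgb_emb):
    """Greedy one-to-one matching on cosine distance between RAW and
    RGB embeddings: most-confident RAW images claim their match first."""
    A = raw_emb / np.linalg.norm(raw_emb, axis=1, keepdims=True)
    B = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    cost = 1.0 - A @ B.T
    pairs, used = [], set()
    for i in np.argsort(cost.min(axis=1)):       # confident rows first
        for j in np.argsort(cost[i]):            # cheapest available column
            if int(j) not in used:
                used.add(int(j))
                pairs.append((int(i), int(j)))
                break
    return sorted(pairs)

rng = np.random.default_rng(0)
rgb = rng.normal(size=(5, 16))                   # synthetic RGB embeddings
perm = np.array([3, 0, 4, 1, 2])
raw = rgb[perm] + 0.01 * rng.normal(size=(5, 16))  # RAW = permuted, noisy RGB scenes
pairs = pseudo_pairs(raw, rgb)
```

When embeddings of matching scenes are close, the matcher recovers the hidden permutation; the FGW formulation additionally accounts for the structure within each embedding set.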
[CV-73] DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
【Quick Read】: This paper addresses the stability-plasticity dilemma in multi-domain task-incremental learning, which is intensified by large domain shifts; existing methods with fixed architectures and statically allocated parameters struggle to adapt to new domains and aggravate catastrophic forgetting. The key to the solution is the DIMoE-Adapters framework, built on a dynamic expert evolution paradigm implemented through two collaborative components: Self-Calibrated Expert Evolution (SCEE), which constructs and evolves a sparse expert pool to improve plasticity and reduce redundant capacity, and Prototype-Guided Expert Selection (PGES), which selects experts from the pool shaped by SCEE to improve stability on both previously seen and unseen tasks.
Link: https://arxiv.org/abs/2605.07494
Authors: Mengxin Qin,Xiang Zhang,Xi Wang,Kun Wei,Xu Yang,Cheng Deng
Affiliations: Xidian University (西安电子科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Continual learning enables vision-language models to accumulate knowledge and adapt to evolving tasks without retraining from scratch. However, in multi-domain task-incremental learning, large domain shifts intensify the stability-plasticity dilemma. Most existing methods rely on fixed architectures with statically allocated parameters, which limits adaptation to new domains and aggravates catastrophic forgetting. To address these challenges, we propose DIMoE-Adapters, a Dynamic Incremental Mixture-of-Experts Adapters framework that introduces a dynamic expert evolution paradigm to balance stability and plasticity. This paradigm is implemented through two collaborative components: Self-Calibrated Expert Evolution (SCEE) and Prototype-Guided Expert Selection (PGES). SCEE constructs and evolves a sparse expert pool through expert optimization dynamics, improving plasticity while reducing redundant capacity. PGES controls expert utilization based on the pool shaped by SCEE, improving stability across both previously encountered and unseen tasks. Extensive experiments show that DIMoE-Adapters outperforms previous state-of-the-art methods across various settings.
[CV-74] How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings
[Quick Read]: This paper addresses two core problems in current document-parsing benchmarking: the dominant benchmark OmniDocBench suffers from degraded annotation quality and contamination risk, distorting model rankings; and the field lacks a reproducible, structured, high-quality benchmark covering diverse settings. The key solution is PureDocBench, a programmatically generated, source-traceable benchmark that renders document images from HTML/CSS and automatically produces verifiable annotations, covering 10 domains, 66 subcategories, and 1,475 pages (each in three versions including degradations, 4,425 images in total), enabling fairer, more reliable model evaluation with deployment-relevant insights.
Link: https://arxiv.org/abs/2605.07492
Authors: Zhiheng Li,Zongyang Ma,Jiaxian Chen,Jianing Zhang,Zhaolong Su,Yutong Zhang,Zhiyin Yu,Ruiqi Liu,Xiaolei Lv,Bo Li,Jun Gao,Ziqi Zhang,Chunfeng Yuan,Bing Li,Weiming Hu
Affiliations: CASIA; UCAS; NWPU; JLU; USTC; PKU; HelloGroup; ShanghaiTech; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 42 pages, 20 figures, 16 tables
Abstract:The past year has seen over 20 open-source document parsing models, yet the field still benchmarks almost exclusively on OmniDocBench, a 1,355-page manually annotated dataset whose top scores have saturated above 90%. A three-stage audit pipeline we run on OmniDocBench screens its 21,353 evaluator-scored blocks and confirms 2,580 errors (12.08%); combined with over a year of public availability, both annotation quality and contamination risk call its rankings into question. To address these issues, we present PureDocBench, a programmatically generated, source-traceable benchmark that renders document images from HTML/CSS and produces verifiable annotations from the same source, covering 10 domains, 66 subcategories, and 1,475 pages, each in three versions: clean, digitally degraded, and real-degraded (4,425 images total). Evaluating 40 models spanning pipeline specialists, end-to-end specialists, and general-purpose VLMs, we find: (i) document parsing is far from solved: the best model scores only ~74 out of 100, with a 44.6-point gap between the strongest and weakest models; (ii) specialist parsers with ≤4B parameters rival or surpass general VLMs that are 5-100x larger, yet formula recognition remains a shared bottleneck where no model exceeds 67% when averaging the formula metric across all three tracks; (iii) general VLMs lose only 0.99/8.52 Overall points under digital/real degradation versus 4.90/14.21 for pipeline specialists, producing ranking reversals that make clean-only evaluation misleading for deployment. All data, code, and artifacts are publicly released.
[CV-75] Implicit Multi-Camera System Calibration Using Gaussian Processes
[Quick Read]: For calibrating complex multi-camera systems, conventional explicit calibration is constrained by rigid mathematical models and struggles with non-linear distortions, while existing neural-network-based implicit methods are data-hungry and lack uncertainty quantification (UQ). The key is an implicit calibration framework based on Gaussian Process (GP) regression that directly learns the non-linear mapping from 2D image coordinates across all cameras to 3D world coordinates, bypassing explicit estimation of intrinsic and extrinsic parameters. The GP's inherent UQ turns a simple 3D point prediction into a verifiable measurement with statistically sound confidence bounds, and an active learning (AL) strategy uses the GP's predictive uncertainty to intelligently guide the acquisition of new calibration data, markedly improving data efficiency and deployment reliability.
Link: https://arxiv.org/abs/2605.07491
Authors: Ivan De Boi,Bart Ribbens,Veronika Golanova,Ursula Kapov,Simon Verspeek
Affiliations: University of Antwerp (安特卫普大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper proposes a novel framework for implicit multi-camera system calibration utilizing Gaussian Process (GP) regression. Conventional explicit calibration methods are constrained by rigid mathematical models and struggle with complex, non-linear distortions from unconventional optics, while existing neural network-based implicit approaches are typically data-hungry and lack inherent uncertainty quantification (UQ). Our GP-based model directly learns the complex, non-linear mapping from 2D image coordinates across all cameras to a 3D world coordinate, completely bypassing time-consuming estimation of explicit intrinsic and extrinsic parameters. Moreover, the inherent UQ is critical for transforming a simple 3D point prediction into a verifiable 3D measurement, complete with statistically-sound confidence bounds. To further enhance data efficiency and practical deployment, we integrate Active Learning (AL), which intelligently leverages the GP’s predictive uncertainty to strategically guide the acquisition of new calibration data. This approach results in a robust, data-efficient, and reliable calibration solution, proving particularly effective in practical scenarios where collecting extensive calibration data is a dominant constraint. Our experiments show that the uncertainty for the 3D predictions is higher closer to the cameras. The data points in uv -coordinate space are more sparse in that region, even though they are not in 3D space. This work is relevant for anyone who is tasked with the calibration of complex multi-camera systems.
[CV-76] AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models
[Quick Read]: Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, yet articulation-related mouth movements are particularly hard to model precisely; conventional methods map speech audio directly to facial coefficients, ignoring the linguistic and phonetic structure underlying speech production. The key of the proposed AudioFace framework is to treat mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information: by introducing transcript- and phoneme-level cues and leveraging the prior knowledge of multimodal large language models, it effectively bridges speech signals with interpretable facial actions, improving the realism and accuracy of the animation.
Link: https://arxiv.org/abs/2605.07478
Authors: Kai Zheng,Zejian Kang,Rui Mao,Hongyuan Zou,Yuanchen Fei,Xuanyang Xu,Xiangru Huang
Affiliations: Westlake University (西湖大学); Zhejiang University (浙江大学); Tiangong University (天工大学); Hunan University (湖南大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, especially for articulation-related mouth movements. However, directly mapping speech audio to facial coefficients often overlooks the linguistic and phonetic structure underlying speech production. In this paper, we propose AudioFace, a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, our method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics, validating the effectiveness of language-assisted and multimodal-prior-guided speech-driven facial animation.
[CV-77] ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
[Quick Read]: Current text-guided image editing (TIE) models still produce artifacts, unintended modifications, and suboptimal aesthetics, while existing evaluation methods rely on single scalar scores and lack interpretability. The key solution is ReasonEdit-22K, the first high-quality interpretation dataset combining 22K edited images with 113K Chain-of-Thought (CoT) samples, on which RE-Reward, a multimodal large language model (MLLM)-based reward model, is built to provide human-aligned interpretable feedback; using reward signals from RE-Reward and the Group Relative Policy Optimization (GRPO) algorithm, the ReasonEdit model is then trained to achieve strong human-preference alignment and cross-benchmark generalization, and to generate high-quality interpretable evaluation text, making image-editing assessment more transparent and trustworthy.
Link: https://arxiv.org/abs/2605.07477
Authors: Honghua Chen,Zitong Xu,Huiyu Duan,Xinyun Zhang,Xiongkuo Min,Guangtao Zhai
Affiliations: University of Electronic Science and Technology of China (电子科技大学); Shanghai Jiao Tong University (上海交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent text-guided image editing (TIE) models have achieved remarkable progress, however, many edited results still suffer from artifacts, unintended modifications, and suboptimal aesthetics. Although several benchmarks and evaluation methods have been proposed, most existing approaches rely on scalar scores and lack interpretability. This limitation largely stems from the absence of high-quality interpretation datasets for TIE and effective reward models to train interpretable evaluators. To address these challenges, we introduce ReasonEdit-22K, the first dataset that combines 22K edited images with 113K Chain-of-Thought (CoT) samples, along with 1.3M human judgments assessing these interpretations in terms of logicality, accuracy, and usefulness. Building upon this dataset, we propose RE-Reward, a multimodal large language model (MLLM)-based reward model designed to provide human-aligned feedback for evaluating interpretable reasoning in image editing. Furthermore, we develop ReasonEdit, which is trained using reward signals derived from RE-Reward and the Group Relative Policy Optimization (GRPO) algorithm to learn an interpretable evaluation model. Extensive experiments demonstrate that ReasonEdit achieves superior alignment with human preferences and exhibits strong generalization across public benchmarks. In addition, it is capable of generating high-quality interpretable evaluation text, enabling more transparent and trustworthy assessment for image editing. The code is available at this https URL.
[CV-78] ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
[Quick Read]: This paper tackles the scaling bottleneck of Vision-Language-Action (VLA) models caused by the high cost of annotated data: robots deployed across domains produce abundant unannotated vision-action pairs, but privacy and heterogeneity constraints prevent central aggregation. The key is ForgeVLA, a federated VLA training framework with two core innovations: (1) each client runs an embodied instruction classifier that maps raw vision-action pairs onto a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets; (2) it identifies and mitigates the previously overlooked problem of vision-language feature collapse in federated learning, combining a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently.
Link: https://arxiv.org/abs/2605.07474
Authors: Yuhao Zhou,Yunpeng Zhu,Yang Zhou,Jindi Lyu,Jian Lan,Zhangyuan Wang,Dan Si,Thomas Seidl,Qing Ye,Jiancheng Lyu
Affiliations: Sichuan University (四川大学); Zhejiang University (浙江大学); Ludwig-Maximilians-Universität München (慕尼黑路德维希-马克西米利安大学); Lenovo Group Limited (联想集团); Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education (教育部机器学习与工业智能工程研究中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages
Abstract:Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet construction, we also identify vision-language feature collapse as a critical challenge that has been largely overlooked in prior federated VLA research. To mitigate this issue, ForgeVLA combines a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently. Extensive experiments across multiple benchmarks show that ForgeVLA significantly outperforms other baselines, and ablation studies further validate the contribution of each component.
[CV-79] A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images
[Quick Read]: Clinical diagnosis of non-alcoholic fatty pancreas disease (NAFPD) relies on subjective visual assessment of ultrasound images; this paper pursues automated, objective fatty-pancreas classification. The key is an end-to-end framework: a TransUNet-based segmentation architecture (ResNet encoder with a transformer bottleneck) delineates the pancreas and splenic vein; anatomically guided patch extraction with patient-level pairwise texture comparison then mimics clinical reasoning by comparing the echogenicity of peri-venous fat against the pancreatic parenchyma, yielding an interpretable classification signal. Validated on 214 abdominal ultrasound images with 5-fold cross-validation, an SVM with RBF kernel reaches 89.7% ± 1.8% accuracy and 0.898 ± 0.019 F1, clearly outperforming the unsupervised K-Means baseline and showing that the proposed features capture the clinically relevant signal without requiring labeled training data.
Link: https://arxiv.org/abs/2605.07466
Authors: Ioan-Tudor-Alexandru Anghel,Ciprian-Mihai Ceausescu,Elena Dana Nedelcu,Elena Raluca Stirban,Camelia Croitoru,Despina Ungureanu,Ana Maria Palan,Gabriela Pop
Affiliations: University of Bucharest (布加勒斯特大学); Ponderas Academic Hospital (庞德拉斯学术医院); Emergency Clinical Hospital Bucharest (布加勒斯特急诊临床医院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Non-alcoholic fatty pancreas disease (NAFPD) is an underdiagnosed condition associated with metabolic syndrome, insulin resistance, and increased risk of pancreatic cancer. Diagnosis typically relies on subjective visual assessment of ultrasound images by clinicians. We propose an end-to-end framework for automatically classifying normal versus fatty pancreas from abdominal ultrasound images. Our method employs a TransUNet-based segmentation architecture with a ResNet encoder and transformer bottleneck to delineate the pancreas and the splenic vein, followed by anatomically-guided patch extraction and patient-level classification through pairwise texture comparison. The feature engineering mimics clinical reasoning by comparing the echogenicity of peri-venous fat to the pancreatic parenchyma, providing an interpretable signal for classification. The segmentation models are initialized via domain-specific transfer learning from a liver segmentation task. We validate the full pipeline on a clinical dataset of 214 abdominal ultrasound images with 107 expert-labeled cases using 5-fold cross-validation. SVM with RBF kernel achieves a mean cross-validated accuracy of 89.7% ± 1.8% and F1 of 0.898 ± 0.019, while the unsupervised K-Means baseline reaches 87.8% accuracy, demonstrating that the proposed features capture the relevant clinical signal even without labeled training data. To our knowledge, this is the first end-to-end automated framework for fatty pancreas classification from ultrasound using segmentation-guided texture analysis.
[CV-80] EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
[Quick Read]: Text-guided image editing (TIE) results commonly exhibit fine-grained issues such as unnatural objects, lighting mismatch, and unexpected changes, which existing refinement methods handle poorly: they either rely on costly iterative regeneration or on vision-language models (VLMs) with weak spatial grounding, causing semantic drift and unreliable local corrections. The key contributions are EditFHF-15K, a large-scale dataset of fine-grained human feedback, and EditRefiner, a hierarchical, interpretable, human-aligned agentic framework that casts post-editing correction as a human-like perceive-reason-act-evaluate loop with four core modules: a perception agent detecting artifacts and failure regions, a reasoning agent performing human-aligned diagnostic inference, an action agent executing localized re-editing, and an evaluation agent deciding whether further refinement is needed, together enabling precise distortion localization, accurate diagnosis, and stronger alignment with human perception.
Link: https://arxiv.org/abs/2605.07457
Authors: Zitong Xu,Huiyu Duan,Yifei Nie,Mingda Du,Sijing Wu,Xiongkuo Min,Tianyi Zheng,Jian Zhang,Shusong Xu,Jinwei Chen,Bo Li,Guangtao Zhai
Affiliations: Shanghai Jiao Tong University (上海交通大学); Vivo Mobile Communication Co., Ltd (维沃移动通信有限公司); University of Electronic Science and Technology of China (电子科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent text-guided image editing (TIE) models have made remarkable progress, yet edited images still frequently suffer from fine-grained issues such as unnatural objects, lighting mismatch, and unexpected changes. Existing refinement approaches either rely on costly iterative regeneration or employ vision-language models (VLMs) with weak spatial grounding, often resulting in semantic drift and unreliable local corrections. To address these limitations, we first construct EditFHF-15K, a dataset of fine-grained human feedback for edited images, comprising (1) 15K images from 12 TIE models spanning 43 editing tasks, (2) 60K annotated artifact regions and 80K editing failure regions, each accompanied by textual reasoning, and (3) 45K mean opinion scores (MOSs) assessing perceptual quality, instruction following, and visual consistency. Based on EditFHF-15K, we propose EditRefiner, a hierarchical, interpretable, and human-aligned agentic framework that reformulates post-editing correction as a human-like perception-reasoning-action-evaluation loop. Specifically, we introduce: (1) a perception agent that detects contextual saliency maps of artifacts and editing failures, (2) a reasoning agent that interprets these perceptual cues to perform human-aligned diagnostic inference, (3) an action agent that uses the reasoning output to plan and execute localized re-editing, and (4) an evaluation agent that assesses the re-edited image and guides the action agent on whether further refinements are required. Extensive experiments demonstrate that EditRefiner consistently outperforms state-of-the-art methods in distortion localization, diagnose accuracy and human perception alignment, establishing a new paradigm for self-corrective and perceptually reliable image editing. The code is available at this https URL.
[CV-81] EditTransfer: Toward Faithful and Efficient Visual-Prompt-Guided Image Editing
[Quick Read]: Diffusion-transformer-based image editing struggles with faithfulness under visual prompts, owing to a pretrained bias toward textual conditioning and inherent stochastic instability during sampling, which prevent accurate reproduction of demonstrated edits. The key is the EditTransfer++ framework with two core mechanisms: a text-decoupled training strategy removes text conditioning during fine-tuning, forcing the model to infer transformations solely from visual evidence while retaining optional text guidance at inference; and a best-worst contrastive refinement mechanism reshapes the denoising trajectories to suppress unfaithful generations and improve consistency across random seeds. To ease the computational bottleneck of high-resolution in-context editing, a condition compression and reuse strategy further reduces token redundancy, enabling efficient generation of images with a 1024-pixel long edge.
Link: https://arxiv.org/abs/2605.07455
Authors: Lan Chen,Qi Mao,Yiren Song,Yuchao Gu,Siwei Ma
Affiliations: Communication University of China (中国传媒大学); National University of Singapore (新加坡国立大学); Peking University (北京大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual-prompt-guided edit transfer aims to learn image transformations directly from example pairs, offering more precise and controllable editing than purely text-driven approaches. However, existing diffusion transformer-based methods often fail to faithfully reproduce the demonstrated edits due to structural mismatches between the task and the backbone, including a pretrained bias toward textual conditioning and inherent stochastic instability during sampling. To bridge this gap, we present EditTransfer++, a framework that combines progressively structured training with an efficient conditioning scheme to improve both visual prompt faithfulness and inference efficiency. We first mitigate textual dominance with a text-decoupled training strategy that removes text conditioning during fine-tuning, compelling the model to infer transformations solely from visual evidence while still supporting optional text guidance at inference. On top of this visually grounded model, a best-worst contrastive refinement mechanism reshapes the denoising trajectories to suppress unfaithful generations and improve consistency across random seeds. To alleviate the computational bottleneck of high-resolution in-context editing, we further introduce a condition compression and reuse strategy that reduces token redundancy and enables efficient generation of images with a 1024-pixel long edge. Extensive experiments on existing benchmarks and the proposed EditTransfer-Bench show that EditTransfer++ achieves state-of-the-art visual prompt faithfulness with substantially faster inference than prior methods, suggesting a promising direction for scalable prompt-guided image editing and broader visual in-context learning.
[CV-82] Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework
[Quick Read]: Mobile devices struggle to produce natural, optically realistic bokeh under high digital zoom, where photos suffer reduced resolution and loss of fine detail; existing learning-based methods falter in this regime, and the conventional two-stage pipeline (super-resolve first, then render) is inefficient and accumulates errors. The key is MagicBokeh, a unified diffusion framework that jointly optimizes bokeh rendering and super-resolution via an alternative training strategy and a focus-aware masked attention mechanism, substantially improving controllability and visual fidelity; a degradation-aware depth module further strengthens depth estimation from low-quality inputs, enabling efficient, photorealistic bokeh on real low-resolution images.
Link: https://arxiv.org/abs/2605.07429
Authors: Linxiao Shi,Siming Zheng,Zerong Wang,Hao Zhang,Jinwei Chen,Bo Li,Shifeng Chen,Peng-Tao Jiang
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); vivo BlueImage Lab, vivo Mobile Communication Co., Ltd. (维沃移动通信有限公司蓝图实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. Our code and models are available at this https URL.
[CV-83] SR2-LoRA: Self-Rectifying Inter-layer Relations in Low-Rank Adaptation for Class-Incremental Learning
[Quick Read]: This paper targets catastrophic forgetting in class-incremental learning (CIL) with parameter-efficient fine-tuning (PEFT). It argues that forgetting stems from inter-layer relation drift: while learning new tasks, the relative relationships among layer-wise representations are progressively disrupted, shrinking the classification margins of earlier tasks and degrading overall performance. The key solution, Self-Rectifying inter-layer Relation Low-Rank Adaptation (SR²-LoRA), constrains this drift by constructing the relation matrices induced by the previous and current models on current-task samples and aligning their singular values. Theoretical analysis further shows that this singular-value alignment is more robust than entry-wise alignment, with the advantage growing as the number of tasks increases.
Link: https://arxiv.org/abs/2605.07420
Authors: Fengqiang Wan,Yipeng Lin,Kan Lv,Yang Yang
Affiliations: Nanjing University of Science and Technology (南京理工大学)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Pre-trained models with parameter-efficient fine-tuning (PEFT) have demonstrated promising potential for class-incremental learning (CIL), yet catastrophic forgetting still persists when adapting models to new tasks. In this paper, we present a novel perspective on catastrophic forgetting through the analysis of inter-layer relation drift, i.e., the progressive disruption of relationships among layer-wise representations during the learning of new tasks. We theoretically show that the increase of such drift reduces the classification margins of previously learned tasks, thereby degrading overall model performance. To address this issue, we propose Self-Rectifying inter-layer Relation Low-Rank Adaptation (SR²-LoRA), a simple yet effective method that mitigates catastrophic forgetting by constraining inter-layer relation drift. Specifically, SR²-LoRA constructs the relation matrices induced by the previous and current models on current-task samples, and aligns the corresponding singular values. We further theoretically show that this alignment exhibits greater robustness to estimation perturbations than direct entry-wise alignment. Extensive experiments on standard CIL benchmarks demonstrate that SR²-LoRA effectively mitigates catastrophic forgetting, with its advantages becoming more pronounced as the number of tasks increases. Code is available at this https URL.
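The singular-value alignment at the heart of SR²-LoRA can be sketched as below. The construction of the inter-layer relation matrix here (cosine similarity between mean layer features) is our own illustrative assumption, not the paper's exact definition, and the feature tensors are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer-wise features of the previous and current models on a
# batch of current-task samples: (num_layers, batch, dim); values synthetic.
feats_prev = rng.normal(size=(4, 16, 8))
feats_cur = feats_prev + 0.05 * rng.normal(size=(4, 16, 8))

def relation_matrix(feats):
    """Inter-layer relation matrix: cosine similarity between mean layer
    features (an assumed, simplified construction)."""
    z = feats.mean(axis=1)                       # (layers, dim)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return z @ z.T                               # (layers, layers)

def singular_value_alignment(feats_a, feats_b):
    """Mean squared error between singular values of the two relation
    matrices; penalising this constrains inter-layer relation drift."""
    sa = np.linalg.svd(relation_matrix(feats_a), compute_uv=False)
    sb = np.linalg.svd(relation_matrix(feats_b), compute_uv=False)
    return float(((sa - sb) ** 2).mean())

loss = singular_value_alignment(feats_prev, feats_cur)
```

Matching spectra rather than individual matrix entries is what the paper argues makes the alignment robust to estimation perturbations.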
[CV-84] Learning Image-Adaptive Scale Fields for Metric Depth Recovery
[Quick Read]: Monocular depth estimation (MDE) lacks precise scale information, making metric depth recovery difficult, especially when only sparse metric anchors are available. The key is to formulate metric depth recovery as a low-dimensional linear combination defining an image-adaptive scale field: semantic and geometric cues extracted from the MDE output and intermediate representations form a set of image-adaptive basis maps, whose weights are solved from the sparse metric anchors via least squares, yielding an efficient, robust, and interpretable scale correction of the original non-metric depth.
Link: https://arxiv.org/abs/2605.07418
Authors: Yuanyan Li,Matthias Althoff
Affiliations: Technical University of Munich (慕尼黑工业大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Monocular depth estimation (MDE) typically produces depth estimations that are defined up to an unknown scale or shift. When only sparse metric anchors are available, recovering accurate metric depth becomes challenging yet necessary for practical applications. We address this problem by formulating metric depth recovery as image-adaptive scale field modeling. Instead of directly correcting the depth, we reformulate the correction as a low-dimensional linear combination of image-adaptive basis maps. These maps are derived from semantic and geometric cues encoded in the MDE estimations and intermediate representations. The weights of basis maps are efficiently determined from sparse metric anchors via a least-squares problem. This formulation yields improved metric depth accuracy, strong robustness under extreme anchor sparsity, and an interpretable decomposition of spatial scale variations. Extensive experiments across multiple datasets and representative MDE models demonstrate the effectiveness and general applicability of our approach.
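The least-squares recovery step can be illustrated with a toy example: given basis maps and sparse metric anchors, solve for the combination weights and rescale the relative depth. The constant/coordinate basis maps below are placeholders for the semantic and geometric bases the paper derives from the MDE model, and the depth maps are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a relative depth map off by a smooth per-pixel scale field.
h, w = 12, 16
yy, xx = np.mgrid[0:h, 0:w] / np.array([h, w]).reshape(2, 1, 1)
true_scale = 1.5 + 0.5 * xx                 # ground-truth scale field
rel_depth = rng.uniform(1, 5, size=(h, w))  # non-metric MDE output
metric_depth = true_scale * rel_depth

# Image-adaptive basis maps; here a constant map plus coordinate maps
# stand in for the paper's semantic/geometric bases.
bases = np.stack([np.ones((h, w)), xx, yy])            # (K, h, w)

# Sparse metric anchors: a handful of pixels with known metric depth.
idx = rng.choice(h * w, size=10, replace=False)
A = (bases.reshape(3, -1)[:, idx] * rel_depth.ravel()[idx]).T  # (10, K)
b = metric_depth.ravel()[idx]

# Weights of the basis maps from a least-squares fit at the anchors.
weights, *_ = np.linalg.lstsq(A, b, rcond=None)
scale_field = np.tensordot(weights, bases, axes=1)     # (h, w)
recovered = scale_field * rel_depth
```

Because the true scale field lies in the span of the chosen bases here, the fit recovers metric depth over the whole image from just 10 anchors.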
[CV-85] InsHuman: Towards Natural and Identity-Preserving Human Insertion
[Quick Read]: Existing image-editing models performing human insertion suffer from three problems: unnatural human poses in the target background, inconsistent person counts, and altered facial identity. The proposed InsHuman framework addresses these with three key techniques: Human-Background Adaptive Fusion (HBAF) detects foreground humans to obtain a binary mask and applies region-aware weighting to align predicted and ground-truth latents, ensuring the inserted person's pose, count, and overall appearance cohere with the target background; Face-to-Face ID-Preserving (FFIP) matches faces between the source and generated images via face-recognition features to enforce per-person identity consistency; and a Bidirectional Data Pairing (BDP) strategy builds the high-quality BDP-InsHuman dataset with realistic physical interactions, markedly improving training. Experiments show InsHuman generates plausible images while keeping human identity unchanged.
Link: https://arxiv.org/abs/2605.07402
Authors: Jie Li,Shulian Zhang,Yangyang Gao,Wenbo Li,Yulun Zhang,Yong Guo,Jian Chen
Affiliations: South China University of Technology (华南理工大学); Chinese University of Hong Kong (香港中文大学); Shanghai Jiao Tong University (上海交通大学); Max Planck Institute for Informatics (马克斯·普朗克信息学研究所)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Human insertion aims to naturally place specific individuals into a target background. Although existing image editing models may have such ability, they often produce failure cases, including inappropriate human pose in new background, inconsistent number of people, and modified facial identity. Moreover, publicly available human datasets often lack full-body portraits and realistic physical interaction between humans and their background. To address these challenges, we propose InsHuman for natural and identity-preserving human insertion. Specifically, we propose Human-Background Adaptive Fusion (HBAF), which detects foreground humans to obtain a binary mask and applies region-aware weighting to align the human regions between predicted and ground-truth latents, ensuring the person’s pose, count, and overall appearance are coherently adapted to the target background. We further propose Face-to-Face ID-Preserving (FFIP), which detects and matches faces between the generated image and the source image in terms of face recognition features to enforce identity consistency for each person. In addition, we propose a Bidirectional Data Pairing (BDP) strategy to construct BDP-InsHuman, a high-quality dataset with realistic human-background interactions. Experiments demonstrate that InsHuman achieves significant improvements in generating plausible images while keeping human identity unchanged.
[CV-86] GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
[Quick Read]: This paper probes the safety alignment of Diffusion Vision-Language Models (dVLMs), showing that their apparent robustness to Fixed Prefix Optimization (FPO) jailbreaks is deceptive: dVLMs exhibit two distinct refusal patterns (Immediate Refusal and Progressive Refusal), and the latter exposes a new attack surface during generation. The key is Global Probability Optimization (GPO), a general jailbreak paradigm that manipulates the denoising trajectory of masked diffusion models to perturb the global generative dynamics and bypass built-in guardrails; building on it, the authors introduce GPO-V, the first visual-modality jailbreak framework for dVLMs. Experiments show it produces stealthy perturbations with strong cross-model transferability, revealing a critical security gap in non-sequential generative architectures.
Link: https://arxiv.org/abs/2605.07399
Authors: Yu Pan,Andi Zhang,Yi Wang,Sibei Yang,Wenjie Wang
Affiliations: ShanghaiTech University (上海科技大学); University of Warwick (华威大学); Sun Yat-sen University (中山大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion Vision-Language Models (dVLMs), built upon the non-causal foundations of Diffusion Large Language Models (dLLMs), have demonstrated remarkable efficacy in multimodal tasks by departing from the traditional autoregressive generation paradigm. While dVLMs appear inherently robust against conventional jailbreak tactics, which we categorize as Fixed Prefix Optimization (FPO) (e.g., anchoring responses with “Sure, here is”), this perceived resilience is deceptive. Our investigation into the safety landscape of dVLMs reveals a unique refusal pattern: Immediate Refusal and Progressive Refusal. We find that while FPO-based attacks often fail by triggering the latter, the progressive refinement process itself uncovers a novel, latent attack surface. To exploit this vulnerability, we propose Global Probability Optimization (GPO), a general jailbreak paradigm designed specifically for the denoising trajectory of masked diffusion models. Unlike prefix-based methods, GPO manipulates the global generative dynamics to bypass guardrails in diffusion language models. Building on this, we introduce GPO-V, the first visual-modality jailbreak framework tailored for dVLMs. Empirical results demonstrate that GPO-V produces stealthy perturbations with exceptional cross-model transferability, revealing a critical security gap in non-sequential generative architectures. Our findings underscore the critical urgency of addressing safety alignment in dVLMs. These results necessitate an immediate and fundamental re-evaluation of current defense paradigms to mitigate the unique risks of diffusion-based generation. Our code is available at: this https URL.
[CV-87] Exposing and Mitigating Temporal Attack in Deepfake Video Detection
[Quick Read]: This paper addresses the vulnerability of spatiotemporal deepfake detectors to evasion attacks, rooted in over-reliance on manipulatable temporal spectrum cues rather than robust semantic causality. The key is the SpInShield framework, which introduces a learnable spectral adversary that dynamically synthesizes severe spectral deformations to simulate extreme attack scenarios, and a shortcut suppression optimization strategy that compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space, thereby explicitly modeling and defending spectral invariance.
Link: https://arxiv.org/abs/2605.07398
Authors: Zheyuan Gu,Minghao Shao,Zhen Wang,Yusong Wang,Mingkun Xu,Shijie Zhang,Hao Jiang
Affiliations: Peking University (北京大学); New York University (纽约大学); Huzhou University (湖州大学); Institute of Science Tokyo (东京科学研究所); Guangdong Institute of Intelligence Science and Technology (广东省智能科学与技术研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:While spatiotemporal deepfake detectors achieve high AUC, our experiments reveal their susceptibility to evasion attacks. These models tend to overfit on fragile temporal spectrum cues, rather than learning robust semantic causality. To mitigate this vulnerability, we propose SpInShield, a temporal spectral-invariant defense framework explicitly designed to decouple semantic motion from manipulatable spectral artifacts. We propose a learnable spectral adversary that dynamically synthesizes severe spectral deformations, simulating extreme attack scenarios. By employing a shortcut suppression optimization strategy, SpInShield compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space. Experiments show that SpInShield obtains competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.
[CV-88] BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
[Quick Read]: In image captioning, RL methods and metrics that emphasize a single quality dimension induce multi-dimensional trade-offs: improving downstream utility often sacrifices fluency or coverage, and vice versa. The key is a more balanced multi-objective RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality, applying GDPO-style reward-decoupled normalization to the continuous reward signals and introducing length-conditional reward masking for a more suitable length penalty, which consistently improves caption quality across vision-language models such as LLaVA-1.5-7B and Qwen2.5-VL 3B/7B.
Link: https://arxiv.org/abs/2605.07394
Authors: Shaokai Ye,Vasileios Saveris,Yihao Qian,Jiaming Hu,Elmira Amirloo,Peter Grasch
Affiliations: Apple (苹果)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. In order to effectively optimize the resulting continuous multi-objective reward formulation, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
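A small sketch of what reward-decoupled normalization means in practice: each reward channel is normalized within the sampled group before the channels are combined, instead of normalizing a single summed reward. The channel names and scales below are illustrative assumptions, and this is a simplified reading of the GDPO-style scheme, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rewards for a group of 6 sampled captions for one image,
# with three channels on very different scales (names illustrative):
rewards = np.stack([rng.uniform(0, 100, 6),   # utility-aware correctness
                    rng.uniform(0, 1, 6),     # reference coverage
                    rng.uniform(1, 5, 6)])    # linguistic quality

def decoupled_advantages(r, eps=1e-8):
    """Z-normalize each reward channel within the group, then sum.

    Normalizing per channel (rather than normalizing the summed reward,
    as vanilla GRPO would) keeps any single large-scale channel from
    dominating the combined advantage.
    """
    z = (r - r.mean(axis=1, keepdims=True)) / (r.std(axis=1, keepdims=True) + eps)
    return z.sum(axis=0)

adv = decoupled_advantages(rewards)
```

The resulting per-sample advantages would then weight the policy-gradient update for each sampled caption.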
[CV-89] ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
【速读】:该论文旨在解决生成式模型在物理世界中难以实现4D(时空维度)一致性的问题,即现有方法虽能在2D视频中生成看似连贯的内容,但缺乏对4D时空尺度下结构合理性与局部动态拓扑的建模能力。其解决方案的关键在于提出ST-Gen4D框架,通过四个核心设计构建基于4D时空认知的世界模型:首先建立多模态特征表示基础;其次将特征分解为全局外观图与局部动态图,并通过语义桥接融合形成4D认知图;再利用世界模型进行时空推理以预测未来状态;最后以推导出的认知作为条件引导潜在扩散过程生成4D高斯分布。该方法实现了4D生成过程中结构合理性和拓扑一致性的保障。
链接: https://arxiv.org/abs/2605.07390
作者: Haonan Wang,Hanyu Zhou,Tao Gu,Luxin Yan
机构: Huazhong University of Science and Technology (华中科技大学); National University of Singapore (新加坡国立大学); Macquarie University (麦考瑞大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative models have achieved success in producing apparently coherent 2D videos, but remain challenging in the physical world due to lack of 4D spatiotemporal scale. Typically, existing 4D generative models directly embed macro scale constraints to enhance overall spatiotemporal consistency. However, these methods only ensure global appearance coherence and fail to reveal the local dynamics of the physical world. Our insight is that global appearance structure and local dynamic topology empower 4D spatiotemporal cognition, thereby enabling 4D generation with spatiotemporal regularities. In this work, we propose ST-Gen4D, a 4D generation framework with 4D spatiotemporal cognition-based world model. Our model is guided by four key designs: 1) Spatiotemporal representation. We encode various modalities into multiple representations as a feature basis. 2) Spatiotemporal cognition. We sculpture these representations into global appearance graph and local dynamic graph, and fuse them via semantic-bridged spatiotemporal fusion to obtain a 4D cognition graph. 3) Spatiotemporal reasoning. We utilize a world model to derive future state based on the 4D cognition. 4) Spatiotemporal generation. We leverage the derived cognition as condition to guide latent diffusion for 4D Gaussian generation. By deeply integrating 4D intrinsic cognition with generative priors, our model guarantees the structural rationality and topological consistency of 4D generation. Moreover, we propose ST-4D datasets by aggregating public 4D datasets and self-built subset. Extensive experiments demonstrate the superiority of our ST-Gen4D across 3D and 4D generation tasks.
[CV-90] A Marine Debris Detection Framework for Ocean Robots via Self-Attention Enhancement and Feature Interaction Optimization
【速读】:该论文旨在解决海洋机器人在复杂环境下对海洋垃圾(marine debris)检测性能下降的问题,尤其针对低质量图像中目标模糊、背景复杂及尺寸较小等挑战。其解决方案的关键在于提出一种改进的YOLO检测框架YOLO-MD,核心创新包括:1)设计双分支卷积增强自注意力模块(Dual-Branch Convolutional Enhanced Self-Attention, DB-CASA),强化空间与通道特征交互以提升退化图像中的特征表示能力;2)引入轻量级基于移位的操作,增强多尺度目标的细粒度特征提取并保持参数效率;3)提出SFG-Loss损失函数,通过动态样本重加权缓解类别不平衡和优化不稳定问题。实验表明,YOLO-MD在UODM数据集上达到0.875精度、0.822 F1-score和0.849 mAP50,优于当前最优方法,并已在实际机器人边缘部署中验证有效性。
链接: https://arxiv.org/abs/2605.07388
作者: Yuyang Li,Jiashu Han,Yinyi Lai,Wenbin Kang,Zenghui Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Marine debris detection for ocean robot is crucial for ecological protection, yet performance is often degraded by low-quality images with blur, complex backgrounds, and small targets. To address these challenges, we propose YOLO-MD, an enhanced YOLO-based detection framework. A Dual-Branch Convolutional Enhanced Self-Attention (DB-CASA) module is designed to strengthen spatial-channel interactions, improving feature representation in degraded images. Additionally, a lightweight shift-based operation is introduced to enhance fine-grained feature extraction for objects of varying scales while maintaining parameter efficiency. We further propose SFG-Loss to mitigate class imbalance and optimization instability via dynamic sample reweighting. Experiments on the UODM dataset demonstrate that YOLO-MD achieves 0.875 precision, 0.822 F1-score, and 0.849 mAP50, outperforming the latest state-of-the-art methods. The effectiveness of this method has also been verified through real-world robotic edge deployment experiments.
[CV-91] Velocity-Space 3D Asset Editing
【速读】:该论文旨在解决3D资产局部编辑(local 3D asset editing)中的核心挑战:如何在不破坏未编辑区域的前提下,精确地修改目标区域。现有方法依赖于生成器外部的机制(如手动3D掩码、后处理体素合并或2D多视角提升),无法从源头——即扩散模型中的常微分方程(ODE)采样器内部进行有效干预,导致三个关键问题:(i) 身份泄露(identity leakage),即编辑信号残留在保留区域;(ii) 缺乏专用编辑增强通道,导致强化编辑不可避免地扰动原始结构;(iii) 几何与材质阶段存在全局条件干扰(identity drag),使得所有token被拉向目标。解决方案的关键在于提出VS3D(Velocity-Space 3D Asset editing)框架,其通过在采样器内部实施三种针对性干预模块实现无反演、无训练、无掩码的局部编辑:(1) 基于重建锚定的源注入(RASI)抑制身份泄露,利用源重建校准每步的资产特定锚点;(2) 局部均值引导(PMG)仅在一致编辑区域放大编辑信号,通过高质量与低质量子样本估计的差异对比实现;(3) 双重一致性残差注入(TAR)允许采样器逐token决定几何和材质阶段的保留内容,从而解耦编辑与保留过程。
链接: https://arxiv.org/abs/2605.07385
作者: Hao Liu,Yuxuan Lin,Jingfeng Guo,Ruihang Chu,Junjie Wang,Ruotong Li,Yujiu Yang
机构: Tsinghua University (清华大学); South China University of Technology (华南理工大学); Peng Cheng Laboratory (鹏城实验室)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Editing a 3D asset locally, modifying a target region while preserving the rest, is a fundamental requirement of native 3D editing. Existing methods enforce locality through mechanisms external to the generator, such as manual 3D masks, post-hoc voxel merging, or 2D multi-view lifting. None of them intervene where the corruption actually originates: inside the ODE sampler. For a rectified-flow generator to achieve faithful local editing, its velocity field should be strong over the target editing region while vanishing on preserved content. Yet a single velocity field can hardly satisfy both requirements simultaneously, leading to three problems: (i) identity leakage that keeps the edit signal non-zero on preserved regions; (ii) no dedicated edit-amplification channel, so strengthening the edit inevitably perturbs identity; and (iii) an identity drag at the geometry and material stages, where a global condition pulls every token toward the target. We propose VS3D (Velocity-Space 3D Asset editing), an inversion-free, training-free, and mask-free framework that addresses each problem with a targeted intervention inside the sampler. VS3D integrates three complementary modules, each corresponding to a specific stage of the editing pipeline. Reconstruction-Anchored Source Injection (RASI) absorbs identity leakage by turning the unconditional embedding into a per-step, asset-specific anchor calibrated through source reconstruction. Partial-Mean Guidance (PMG) amplifies the edit signal by contrasting high- and low-quality subsample estimates of the velocity difference, active only where a consistent edit exists. Twin-Agreement Residual injection (TAR) lets the sampler decide token by token what to preserve at the geometry and material stages.
[CV-92] RELO: Reinforcement Learning to Localize for Visual Object Tracking ICML2026
【速读】:该论文旨在解决传统视觉目标跟踪方法中依赖手工设计的空间先验(handcrafted spatial priors)所带来的监督信号不准确问题,这些先验通常以热图形式存在,与跟踪优化目标(如交并比 IoU 和成功曲线下的面积 AUC)缺乏直接对齐。解决方案的关键在于提出一种基于强化学习的目标定位方法 RELO(REinforcement-learning-to-LOcalize),将目标定位建模为马尔可夫决策过程(Markov decision process),通过强化学习在空间位置上学习定位策略,并结合帧级 IoU 与序列级 AUC 构造奖励函数,从而实现更符合跟踪任务本质的端到端优化;此外,引入层对齐的时间标记传播机制以提升跨帧语义一致性,同时保持计算开销极低。
链接: https://arxiv.org/abs/2605.07379
作者: Xin Chen,Chuanyu Sun,Jiao Xu,Houwen Peng,Dong Wang,Huchuan Lu,Kede Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICML 2026 paper
Abstract:Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOT-ext without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.
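摘要提到 RELO 的奖励结合了帧级 IoU 与序列级 AUC(成功曲线下面积)。下面用 Python/NumPy 给出这两个量的最小示意实现;其中 relo_reward 的加权组合方式是假设性的,仅用于说明奖励的构成,并非论文的实际公式。

```python
import numpy as np

def iou(a, b):
    # 两个框的交并比,框格式为 (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sequence_auc(ious, thresholds=np.linspace(0.0, 1.0, 21)):
    # 成功曲线下面积:IoU 超过阈值 t 的帧比例,对 t 取平均
    ious = np.asarray(ious, dtype=float)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def relo_reward(frame_iou, seq_ious, lam=0.5):
    # 假设性的奖励组合:帧级 IoU 与序列级 AUC 的加权和
    return (1.0 - lam) * frame_iou + lam * sequence_auc(seq_ious)
```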
[CV-93] Weather-Robust Scene Semantics with Vision-Aligned 4D Radar ICRA2026
【速读】:该论文旨在解决恶劣天气条件下(如雨、雾、雪)视觉传感器(如摄像头)性能显著下降导致场景语义理解失效的问题。其核心解决方案是利用毫米波雷达(millimeter-wave radar)在恶劣天气中仍能稳定工作的特性,通过将雷达编码器与冻结的SigLIP视觉嵌入对齐,并借助一个具有约700万可训练参数的冻结视觉语言模型(VLM),实现结构化场景描述的解码。关键创新在于识别出“token-norm mismatch”为跨模态桥接失败的主要原因,并提出使用投影输出层归一化(projector-output LayerNorm)有效缓解该问题,从而在K-RADAR数据集上的雾、轻雪和重雪测试序列中均显著优于依赖摄像头的基线方法(后者 hallucination 超过90%)。
链接: https://arxiv.org/abs/2605.07367
作者: Kali Hamilton,Christoffer Heckman
机构: University of Colorado Boulder (科罗拉多大学博尔德分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages + references, 2 appendix pages. ICRA 2026 Radar in Robotics Workshop
Abstract:Cameras and LiDAR degrade in rain, fog, and snow, while millimeter-wave radar remains largely unaffected. We align a radar encoder to frozen SigLIP vision embeddings and decode structured scene captions through a frozen vision-language model (VLM) with approximately 7M trainable parameters. On K-RADAR with held-out fog, light snow, and heavy snow sequences, all radar configurations outperform a camera baseline that collapses to over 90% hallucination. We identify a token-norm mismatch as the dominant failure mode when bridging radar to a frozen VLM and show that projector-output LayerNorm resolves it. Analysis of encoder complexity, caption format, and pooling strategy reveals tradeoffs that inform future radar-VLM pipeline design.
[CV-94] UniISP: A Unified ISP Framework for Both Human and Machine Vision
【速读】:该论文旨在解决传统图像信号处理(Image Signal Processing, ISP)流水线在生成视觉上吸引人的RGB图像的同时,可能因压缩和损失操作破坏信息完整性的问题,尤其是在低光照等挑战性条件下,这种信息损失会显著影响计算机视觉任务的准确性。此外,现有直接使用原始传感器数据的方法往往仅进行最少的ISP处理,导致输出图像难以可视化或不符合人类审美偏好。解决方案的关键在于提出UniISP框架,其核心创新包括:一是引入精心设计的混合注意力模块(Hybrid Attention Module, HAM),通过监督学习确保生成图像在视觉上具有吸引力;二是提出特征适配器模块(Feature Adapter),有效将ISP阶段的信息特征传递至下游网络,从而兼顾人类视觉感知与计算机视觉应用的需求。实验表明,该方法在多种场景和数据集上均达到最优性能,验证了其泛化能力和有效性。
链接: https://arxiv.org/abs/2605.07359
作者: Hanxi Li,Yao Cheng,Bo Zhang,Li Zeng
机构: Li Auto Inc. (理想汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Compared to RGB images, raw sensor data provides a richer representation of information, which is crucial for accurate recognition, particularly under challenging conditions such as low-light environments. The traditional Image Signal Processing (ISP) pipeline generates visually pleasing RGB images for human perception through a series of steps, but some of these operations may adversely impact the information integrity by introducing compression and loss. Furthermore, in computer vision tasks that directly utilize raw camera data, most existing methods integrate minimal ISP processing with downstream networks, yet the resulting images are often difficult to visualize or do not align with human aesthetic preferences. This paper proposes UniISP, a novel ISP framework designed to simultaneously meet the requirements of both human visual perception and computer vision applications. By incorporating a carefully designed Hybrid Attention Module (HAM) and employing supervised learning, the proposed method ensures that the generated images are visually appealing. Additionally, a Feature Adapter module is introduced to effectively propagate informative features from the ISP stage to subsequent downstream networks. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across various scenarios and multiple datasets, proving its generalizability and effectiveness.
[CV-95] UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition
【速读】:该论文旨在解决大规模3D点云语义分割中因LiDAR稀疏采样和图像观测的视图依赖性几何失真导致的跨模态对齐困难与融合不稳定问题。解决方案的关键在于提出一个统一的多模态框架,通过结合基于SAM(Segment Anything Model)的视觉编码器与基于SPTNet的几何编码器,提取互补的语义和几何特征,并将这些特征显式分解为共享子空间与私有子空间:共享部分捕捉2D与3D语义的一致性,私有部分保留各自模态的独特属性;进而利用轻量级注意力融合模块聚合共享特征以生成一致的跨模态表示,并引入正则化训练目标确保语义对齐与子空间独立性,从而实现高精度且鲁棒的联合2D-3D语义分割。
链接: https://arxiv.org/abs/2605.07356
作者: Shuai Zhang,Zhecheng Shi,Zhuxiao Li,Jing Ou,Tengxi Wang,Yuan Liu,Wufan Zhao
机构: HKUST(GZ), HKUST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with a SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: this https URL.
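摘要中的"共享/私有子空间分解 + 正则化训练目标"可以用如下最小示意表达(Python/NumPy):对齐损失拉近两模态的共享特征,独立性损失抑制共享与私有子空间之间的相关性。具体正则项形式(此处采用交叉相关矩阵的 Frobenius 范数平方,类似 Domain Separation Networks 的正交约束)为整理时的假设,论文实现可能不同。

```python
import numpy as np

def alignment_loss(shared_2d, shared_3d):
    # 语义对齐(示意):拉近 2D 与 3D 共享子空间的 MSE
    return float(np.mean((shared_2d - shared_3d) ** 2))

def independence_loss(shared, private):
    # 子空间独立性(示意):惩罚共享特征与私有特征的交叉相关
    S = shared - shared.mean(axis=0)
    P = private - private.mean(axis=0)
    C = S.T @ P / len(S)          # 交叉相关矩阵
    return float(np.sum(C ** 2))  # Frobenius 范数平方
```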
[CV-96] TTF: Temporal Token Fusion for Efficient Video-Language Model NEURIPS2026
【速读】:该论文旨在解决视频语言模型(Video-Language Models, VLMs)在长视频推理时因视觉标记(visual token)数量随视频长度快速增长而导致的计算效率低下问题,尤其在Qwen3-VL等模型中,32帧以448×448分辨率输入即可产生约8000个视觉标记,使得大语言模型(Large Language Model, LLM)预填充(prefill)阶段成为主要吞吐瓶颈。现有方法多依赖全局相似性或注意力引导的压缩策略,常引入性能偏差。本文提出无需训练、可即插即用的**时间标记融合(Temporal Token Fusion, TTF)**框架,其核心在于利用视频中结构化的时序冗余特性:自动选取锚定帧后,对后续每帧进行局部窗口相似性搜索(如3×3邻域),将相似度超过阈值的标记进行融合;同时通过坐标重对齐(coordinate realignment)保持预填充与解码阶段的位置一致性,从而实现高效压缩且不破坏原有VLM流水线。实验表明,在Qwen3-VL-8B上使用阈值t=0.7时,TTF可移除约67%的视觉标记,保留99.5%的基线准确率,并仅增加约0.16 GFLOPs的匹配开销。
链接: https://arxiv.org/abs/2605.07355
作者: Simin Huo,Ning LI
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages; manuscript submitted to NeurIPS 2026
Abstract:Video-language models (VLMs) face rapid inference costs as visual token counts scale with video length. For example, 32 frames at 448×448 resolution already yield 8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring offsets to their gains. We propose Temporal Token Fusion (TTF), a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame, then for each subsequent frame, performs a local window similarity search (e.g., 3×3), fusing tokens that exceed a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM pipelines. On Qwen3-VL-8B with threshold t=0.70, TTF removes about 67% of visual tokens while retaining 99.5% of the baseline accuracy and introducing only ≈0.16 GFLOPs of matching overhead. Overall, TTF offers a practical, efficient solution for video understanding. The code is available at this https URL.
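TTF 的核心操作(以锚定帧为参照,在局部窗口内做余弦相似度搜索,并融合超过阈值的冗余 token)可以用如下最小示意理解(Python/NumPy,逐位置循环仅为可读性,实际实现应向量化)。返回的 keep 掩码标记需保留的 token;融合策略与坐标重对齐等细节为简化假设,并非论文的完整实现。

```python
import numpy as np

def ttf_compress(anchor, frame, threshold=0.7, window=1):
    """anchor/frame: 形状 (H, W, D) 的 token 网格。
    若某 token 在锚定帧的 (2*window+1)^2 邻域内存在
    余弦相似度不低于 threshold 的 token,则视为冗余并融合(丢弃)。"""
    H, W, D = frame.shape
    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    a, f = unit(anchor), unit(frame)
    keep = np.ones((H, W), dtype=bool)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - window), min(H, i + window + 1)
            j0, j1 = max(0, j - window), min(W, j + window + 1)
            sims = a[i0:i1, j0:j1].reshape(-1, D) @ f[i, j]
            if sims.max() >= threshold:
                keep[i, j] = False  # 冗余 token:复用匹配到的锚定帧 token
    return keep
```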
[CV-97] Disambiguating 2D-3D Correspondences in Gaussian Splatting-based Feature Fields for Visual Localization
【速读】:该论文旨在解决基于高斯溅射的特征场(Gaussian Splatting-based Feature Fields, GSFFs)在视觉定位任务中因光度优化导致的2D-3D匹配不稳定问题,具体表现为每个高斯的体积延展(volumetric extent)引发多对一像素到点映射,从而破坏PnP(Perspective-n-Point)姿态估计的稳定性,同时光度优化产生缺乏多视角一致性的冗余高斯。解决方案的关键在于提出SplitGS-Loc框架,其核心设计为基于高斯混合的分裂策略(Mixture-of-Gaussians-based splitting),将原始高斯分解为更小的高斯单元,以实现从模糊的多对一映射到精确的一对一对应关系;同时利用高斯渲染过程中的组合权重筛选出跨多视角稳定贡献的高斯,并通过强像素-高斯关联聚合判别性特征,从而强化多视角一致性,最终构建出紧凑且判别力强的特征场,实现稳定高效的位姿估计。
链接: https://arxiv.org/abs/2605.07351
作者: Miso Lee,Sangeek Hyun,Yerim Jeon,Jae-Pil Heo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Gaussian Splatting-based Feature Fields (GSFFs) have shown promise for visual localization, this paper highlights that photometrically optimized GSFFs are inherently ill-suited for 2D-3D matching. The volumetric extent of each Gaussian induces many-to-one pixel-to-point mappings that destabilize PnP-based pose estimation, while photometric optimization gives rise to superfluous Gaussians devoid of multi-view consistency. To address these issues, we propose SplitGS-Loc, a localization-specialized GSFFs construction framework that disambiguates 2D-3D correspondences by exploiting Gaussian attributes. Our key design, Mixture-of-Gaussians-based splitting, decomposes each Gaussian into smaller Gaussians, replacing ambiguous many-to-one with precise one-to-one correspondences. In parallel, we exploit composition weights from GS rasterization to select Gaussians that significantly and consistently contribute across multiple views and aggregate discriminative features through strong pixel-Gaussian associations, enforcing multi-view consistency. The resulting compact yet discriminative feature fields enable stable PnP convergence, achieving state-of-the-art performance on localization benchmarks. Extensive experiments validate that SplitGS-Loc extends the utility of photometric GSFFs to accurate and efficient localization by exploiting Gaussian attributes, without per-scene training or iterative pose refinement.
[CV-98] SoLAR: Error-Resilient Streamable Long-Horizon Free-Viewpoint Video Reconstruction with Anchor Activation and Latent Recalibration
【速读】:该论文旨在解决长时程自由视角视频(Long-Horizon Free-Viewpoint Video, LFVV)在重建质量上的稳定性问题,传统方法在处理长序列时因误差累积导致性能显著下降,且通常依赖分组图像(Group-of-Pictures, GOP)划分以实现流式传输,限制了实际应用的灵活性。解决方案的关键在于提出首个具有错误鲁棒性的可流式传输框架SoLAR,其核心创新包括:1)锚点激活动态机制(Anchor Activation Dynamics, AAD),通过动态激活信息丰富锚点并抑制冗余锚点,有效建模非刚性形变;2)潜在差异感知再校准机制(Latent Discrepancy Aware Recalibration, LaDAR),识别潜在表示间的差异并重新校准网络中编码的对应关系,从而抑制误差传播,同时保持实时性能与存储紧凑性。
链接: https://arxiv.org/abs/2605.07346
作者: Haotian Zhang,Xu Mo,Yixin Yu,Guanhua Zhu,Jian Xue,Tongda Xu,Yan Wang,Jiaqi Zhang,Siwei Ma,Wen Gao
机构: Peking University (北京大学); Jilin University (吉林大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Free-Viewpoint Video (FVV) has emerged as a cornerstone of next-generation immersive media systems and attracted widespread attention. Previous methods primarily focus on short video sequences and suffer from significant performance degradation when processing long-horizon free-viewpoint video (LFVV). Motivated by bit allocation theory, we analyze dynamic-anchor-based volumetric video representation within a rate-distortion optimization framework and propose SoLAR, which is the first error-resilient streamable FVV framework that maintains stable reconstruction quality on long sequences without requiring group-of-pictures partitioning. We propose the Anchor Activation Dynamics (AAD), which enables dynamic anchors to model non-rigid transformations by dynamically activating informative anchors and suppressing redundant ones. Furthermore, we introduce Latent Discrepancy Aware Recalibration (LaDAR), which is a mechanism to identify discrepancies between latent representations and recalibrate the correspondences encoded in the network, effectively mitigating error propagation in LFVV without compromising real-time performance or storage compactness. Extensive experiments demonstrate that SoLAR achieves state-of-the-art reconstruction performance while maintaining minimum storage overhead, which provides a new direction for LFVV reconstruction and advances the practical deployment of immersive systems. Demo free-viewpoint videos are provided in the supplementary material.
[CV-99] ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
【速读】:该论文旨在解决全球贝类生物多样性下降对滨海生态系统构成的严重威胁,以及现有海洋底栖生态监测中因图像数据不适应真实水下环境复杂性(如光照变化、物种姿态多样)而导致视觉模型泛化能力不足的问题。解决方案的关键在于构建了一个名为ShellfishNet的综合性图像基准数据集,该数据集包含8,691张涵盖32个分类单元的真实场景图像,并通过实地拍摄与网络爬取方式获取,同时提供带描述性标题的标注子集;在此基础上系统评估了80种代表性神经网络模型(包括卷积神经网络CNNs、视觉Transformer ViTs、状态空间模型SSMs及自监督学习SSL方法),并引入图像退化基准测试以模拟浊度、恶劣天气等常见水下退化场景,从而全面评估模型鲁棒性,为智能生态监测提供可靠的数据基础和模型评价标准。
链接: https://arxiv.org/abs/2605.07338
作者: Ziheng Zhou,Yang Wang,Nan Wang,Chengliang Wu,Jun Yan
机构: Shanghai Ocean University (上海海洋大学); Fudan University (复旦大学); DP Technology (深势科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The decline of global shellfish biodiversity poses a severe threat to coastal ecosystems. Although artificial intelligence (AI) technologies show potential for automated ecological monitoring, existing marine benthic datasets often lack adaptation to the complexities of real underwater environments (e.g., variable lighting conditions and diverse species postures), posing challenges for the robust generalization of vision models in practical ecological monitoring. To address this problem, we construct ShellfishNet, a comprehensive image benchmark dataset designed specifically for real-world ecological monitoring constraints. Comprising 8,691 images across 32 taxa, this dataset includes a curated subset annotated with descriptive captions. It is constructed through field photography and web scraping, encompassing samples from complex real-world environments. Based on this benchmark, we systematically evaluate 80 representative neural network models, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), State Space Models (SSMs), and Self-Supervised Learning (SSL) methods. Furthermore, we evaluate the performance of fine-grained visual categorization (FGVC) models and investigate the image captioning capabilities of several mainstream multimodal large language models (MLLMs). Meanwhile, we introduce image corruption benchmark tests to simulate common underwater degradation scenarios (turbidity, severe weather) and assess the robustness of vision models, enabling trustworthy decisions on ecological protection in the wild. ShellfishNet is dedicated to providing a data foundation and a model-evaluation benchmark for the intelligent monitoring of benthic organisms.
[CV-100] RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
【速读】:该论文旨在解决视频推理分割(Video Reasoning Segmentation, VRS)中因关键帧选择范围狭窄、时空理解不足而导致的复杂多目标场景下定位不鲁棒的问题。现有基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的方法通常依赖简单的采样策略或辅助MLLM进行帧选择,受限于监督信号和帧-语言相似性规则,难以实现全局时序理解。其解决方案的关键在于提出RCoT-Seg框架,通过显式分离时序视频推理(Temporal Video Reasoning, TVR)与关键帧目标感知(Keyframe Target Perception, KTP)两个阶段:在TVR阶段,设计了一个基于思维链(Chain-of-Thought, CoT)初始化并经任务对齐奖励优化(GRPO)迭代精调的代理式关键帧选择模块,实现自评估驱动的关键帧生成与重选;在KTP阶段,采用高分辨率分割结合SAM2-based传播机制,替代启发式采样和外部选择器,显著提升空间精度与帧间一致性。
链接: https://arxiv.org/abs/2605.07334
作者: Junwei Wen,Deshui Miao,Guangming Lu,Xin Li,Wenjie Pei
机构: Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory; Pazhou Lab (Huangpu)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages
Abstract:Video Reasoning Segmentation (VRS) aims to segment target objects in videos based on implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods predict masks with a [SEG] token after selecting frames via simple sampling or an auxiliary MLLM, where limited supervision and frame-language similarity rules often yield narrow-scope keyframe choices that weaken holistic temporal understanding and lead to brittle localization in complex multi-object scenes. To address these issues, we introduce RCoT-Seg, a video-of-thought framework that factorizes VRS into temporal video reasoning (TVR) and keyframe target perception (KTP), explicitly separating temporal reasoning from spatial perception. Specifically, in the TVR stage, an agentic keyframe selection module, initialized with a curated CoT-start corpus and refined by GRPO under task-aligned rewards, is proposed to generate and reselect the keyframe through self-evaluation, strengthening moment localization and temporal reasoning. In the KTP stage, RCoT-Seg performs high-resolution segmentation on the selected frame and propagates masks with SAM2-based methods across the sequence, replacing heuristic sampling and external selectors while improving spatial precision and inter-frame consistency. Extensive experimental results demonstrate that the proposed RCoT-Seg achieves favorable performance against the state-of-the-art methods. The code and models will be publicly released at this https URL.
[CV-101] GC-ART: Global Learnable Second-Order Rational Tone Curves for Illumination Robustness
【速读】:该论文旨在解决图像分类任务中因光照变化(如暗化和对比度退化)导致的鲁棒性下降问题。解决方案的关键在于提出一种轻量级、可微分的预处理模块GC-ART(Global Curve Adaptive Rational Tone-mapping),其通过一个643参数的多层感知机(MLP)从每通道软直方图(soft histogram)中预测一个端点固定的有理数色调曲线(rational tone curve),并以逐像素方式应用该曲线,从而在不破坏边缘位置的前提下实现全局色调校正,特别是增强对比度。该方法在训练时结合交叉熵损失与软单调性惩罚项,确保学习到的曲线具有物理合理性,且在CIFAR-10上的实验表明其在多种退化场景下优于基线模型和其他学习增强方法,同时计算复杂度显著低于卷积型增强器。
链接: https://arxiv.org/abs/2605.07329
作者: Wei Huang,Joyce Huang
机构: Microsoft(微软); Massachusetts Institute of Technology(麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce GC-ART (Global Curve Adaptive Rational Tone-mapping), a lightweight differentiable pre-processing module for robust image classification. GC-ART predicts an endpoint-pinned rational tone curve from per-channel soft histograms using a 643-parameter MLP, then applies the curve pointwise before the classifier. The module is trained end-to-end with cross-entropy and a soft monotonicity penalty. On CIFAR-10 with a CIFAR-style ResNet-18, GC-ART matches clean accuracy with the unenhanced baseline and other learned enhancers, improves over the baseline on multiplicative darkening, and achieves the best learned-method result on contrast corruption (48.45% vs. 46.27% for the baseline and 47.13% for Zero-DCE++). These results suggest that histogram-conditioned rational curves can learn useful global tone corrections, including contrast-expanding behavior, while preserving edge locations by construction through pointwise mapping. GC-ART also uses substantially fewer FLOPs than convolutional learned enhancers at 32 x 32. The current hyperparameters are untuned, leaving room for systematic improvement.
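下面给出摘要中几个组件的最小示意(Python/NumPy):每通道软直方图、端点固定(f(0)=0、f(1)=1)的二阶有理色调曲线,以及软单调性惩罚。有理曲线的具体参数化与直方图核均为整理时的假设,论文采用的形式可能不同;实际方法中曲线参数由一个 643 参数的 MLP 从软直方图预测,而非手工给定。

```python
import numpy as np

def soft_histogram(channel, bins=16, sigma=0.05):
    # 可微的软直方图(高斯核),channel 取值假设已归一化到 [0, 1]
    centers = (np.arange(bins) + 0.5) / bins
    w = np.exp(-((channel.reshape(-1, 1) - centers) ** 2) / (2 * sigma ** 2))
    h = w.sum(axis=0)
    return h / h.sum()

def rational_curve(x, a, b):
    # 端点固定的二阶有理色调曲线(假设性参数化):
    # 对任意 a、b 均有 f(0) = 0、f(1) = 1
    num = (1.0 + a) * x + b * x ** 2
    den = 1.0 + a * x + b * x ** 2
    return num / den

def monotonicity_penalty(a, b, n=64):
    # 软单调性惩罚:仅惩罚曲线的下降段
    x = np.linspace(0.0, 1.0, n)
    y = rational_curve(x, a, b)
    dy = np.diff(y)
    return float(np.mean(np.clip(-dy, 0.0, None) ** 2))
```

按此构造,a = b = 0 时曲线退化为恒等映射;由于映射是逐像素的全局曲线,边缘位置天然不被改变,这对应摘要中"preserving edge locations by construction through pointwise mapping"的性质。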
[CV-102] acher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations
【速读】:该论文旨在解决预训练扩散模型或流匹配模型在生成高质量图像时需要大量前向传播步骤的问题,以及现有蒸馏方法通常依赖多个辅助网络、复杂训练阶段或优化流程所带来的高复杂度问题。其解决方案的关键在于重新审视并直接利用“漂移模型(Drifting Model)”目标函数,通过使用预训练教师模型的中间隐藏状态作为特征表示,无需额外训练或引入特征提取网络即可构建语义有意义的特征空间,从而实现一步蒸馏;同时引入轻量级模式覆盖损失以缓解蒸馏过程中的模式崩溃问题,提升学生生成器对教师支持区域的多样性覆盖能力,最终在ImageNet和SDXL数据集上实现了高效且高质量的单步生成结果。
链接: https://arxiv.org/abs/2605.07327
作者: Yuan Zhang,Chenyi Li,Guoqing Ma,Jiajun Zha,Yuanming Yang,Bo Wang,Wei Tang,Wenbo Li,Haoyang Huang,Nan Duan
机构: JD Explore Academy (京东探索研究院); Peking University (北京大学); Tsinghua University (清华大学); The Hong Kong University of Science and Technology (香港科技大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sampling from pretrained diffusion and flow-matching models typically requires many forward passes to generate diverse and high-fidelity images. Existing distillation methods often rely on multiple auxiliary networks, carefully designed training stages, or complex optimization pipelines. In this work, we revisit the recently proposed Drifting Model objective and show that a single drifting loss can be directly used to simplify one step distillation. A key observation is that the pretrained diffusion teacher itself already provides a strong representation space. Unlike the original Drifting Model, which relies on an additional pretrained feature extractor, we use intermediate hidden states of the pretrained teacher model as the feature representation. This removes the need for training or introducing an extra representation network while preserving a semantically meaningful feature geometry for drifting. Furthermore, we introduce a lightweight mode coverage loss to mitigate mode collapse during distillation and encourage the student generator to cover diverse teacher-supported regions. Extensive experiments on ImageNet and SDXL demonstrate that our method achieves efficient one step generation with competitive image quality and diversity, achieving FID scores of 1.58 on ImageNet 64×64 and 18.4 on SDXL, while substantially simplifying the overall distillation framework.
[CV-103] GEM: Generating LiDAR World Model via Deformable Mamba
【速读】:该论文旨在解决基于激光雷达(LiDAR)的环境世界模型在自动驾驶中面临的两大核心挑战:一是LiDAR点云固有的无序性,二是难以区分动态物体与静态结构。为应对这些问题,作者提出了一种名为GEM(Generative LiDAR world model)的生成式世界模型,其关键创新在于引入了可变形Mamba架构(deformable Mamba architecture),通过自定义的LiDAR场景分词器将连续激光扫描转换为紧凑表征,并利用无监督动态-静态分离模块解耦特征;随后,三路径可变形Mamba结构对解耦后的特征进行选择性扫描和自适应门控融合,从而显著提升时空演化理解能力。此设计有效增强了模型在复杂交通场景下的模拟精度与想象力,实现了优于现有方法的性能表现。
链接: https://arxiv.org/abs/2605.07326
作者: Yang Wu,Zhaojiang Liu,Qiang Meng,Youquan Liu,Renliang Weng,Jianjun Qian,Jian Yang,Jin Xie
机构: PCA Lab@NJUST; NJU; SJTU; FDU; NTU
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of LiDAR point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM: a Generative LiDAR world model that leverages deformable mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separator, a tri-path deformable Mamba is introduced to perform selective scanning and adaptive gating fusion over the disentangled features, leading to enhanced spatial-temporal understanding of the world evolution. Optionally, a planner and a BEV layout controller can be integrated to explore the model's capability for autonomous rollout and its potential to generate "what-if" scenarios. Extensive experiments show that GEM achieves state-of-the-art performances across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness. Project page: this https URL.
[CV-104] Amortized-Precision Quantization for Early-Exit Vision Transformers
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在低精度早期退出(early exiting)部署时的稳定性问题。现有量化方法假设静态全深度执行,当量化噪声扰动退出决策时,会导致动态推理路径上的误差放大,从而降低模型鲁棒性。解决方案的关键在于提出一种分摊精度量化(Amortized-Precision Quantization, APQ),该方法显式建模逐层随机暴露于量化噪声的情况,并揭示深度与精度之间的权衡关系;在此基础上设计了带早期退出的互适应量化(Mutual Adaptive Quantization with Early Exiting, MAQEE),通过双层优化框架联合调整退出阈值与比特宽度,在显式风险控制下提升推理稳定性,显著改善准确率-效率帕累托前沿。
链接: https://arxiv.org/abs/2605.07317
作者: Rui Fang,Hsi-Wen Chen,Ming-Syan Chen
机构: National Taiwan University (国立台湾大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision Transformers (ViTs) achieve strong performance across vision tasks, yet their deployment with low-precision early exiting remains fragile. Existing quantization methods assume static full-depth execution, making them unstable when exit decisions are perturbed by quantization noise, which can amplify errors along dynamic inference paths. In this paper, we introduce Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, we propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability. MAQEE establishes a superior Pareto frontier in the accuracy-efficiency trade-off, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20% across classification, detection, and segmentation tasks.
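摘要中"逐层随机暴露于量化噪声"的摊销精度思想,可以按"到达该层的概率 × 该层位宽相关计算量"来理解:样本可能在中途退出,越深的层被执行的概率越低,其位宽代价也应按概率摊销。下面是一个按此思路的简化示意(BOPs 近似为 MACs×bit²,这是文献中常见的简化,并非论文的精确公式):

```python
def amortized_precision_cost(exit_probs, bit_widths, layer_macs):
    """按层计算早退条件下的摊销计算代价 (BOPs 的简化示意).

    exit_probs[l]: 样本在第 l 层退出的概率 (条件于到达该层)
    bit_widths[l]: 第 l 层的量化位宽
    layer_macs[l]: 第 l 层的乘加次数
    """
    assert len(exit_probs) == len(bit_widths) == len(layer_macs)
    reach = 1.0    # 到达当前层的概率
    total = 0.0
    for p_exit, bits, macs in zip(exit_probs, bit_widths, layer_macs):
        total += reach * macs * bits * bits   # BOPs ~ MACs × bit² (权重/激活同位宽简化)
        reach *= (1.0 - p_exit)               # 未在该层退出才会到达下一层
    return total
```

由此可见位宽与深度的权衡:浅层几乎总被执行,降低其位宽收益最大;深层到达概率低,可以保留较高精度而摊销代价仍然有限。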
[CV-105] EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在人机交互(Human-Machine Interaction, HMI)中普遍存在的反应式局限性问题,即缺乏对环境的持续感知能力和主动辅助用户的能力。现有基准测试多局限于警报场景,忽视个性化上下文,并未能精准评估交互时机。为此,作者提出EgoPro-Bench——一个基于流式第一人称视频(streaming egocentric videos)的新型基准,包含2400个评估视频和超过12000个训练视频,通过模拟用户画像生成多样化意图,构建高保真HMI数据集;其关键创新在于引入“短思考,更好交互”(short thinking, better interaction)原则,即在意图识别前分配有限的token预算,以提升推理效率与交互准确性,从而显著增强MLLMs对用户意图的理解能力并精确捕捉最优交互时机,为下一代以用户为中心的主动交互智能体奠定基础。
链接: https://arxiv.org/abs/2605.07299
作者: Dongchuan Ran,Linyu Ou,Xueheng Li,Wenwen Tong,Chenxu Guo,Hewei Guo,Kaibing Wang,Lewei Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages
Abstract:Existing Multimodal Large Language Models (MLLMs) remain primarily reactive, failing to continuously perceive environments or proactively assist users. While emerging benchmarks address proactivity, they are largely confined to alert scenarios, neglect personalized context, and fail to evaluate the precise timing of human-machine interactions (HMI). In this paper, we introduce EgoPro-Bench, a novel benchmark for training and evaluating proactive interaction capabilities based on streaming egocentric videos; it comprises 2,400 videos in the evaluation set and over 12,000 videos in the training set. Unlike previous works, EgoPro-Bench leverages simulated user profiles to generate diverse user intentions and to construct high-fidelity HMI data across 12 distinct scenarios. Moreover, we propose a specialized evaluation protocol and metrics, train proactive interaction models designed for efficient reasoning and low-latency interaction on streaming video data, and conduct comprehensive experiments. Furthermore, we introduce an interaction principle termed "short thinking, better interaction", which allocates a limited token budget prior to intent recognition, thereby enhancing interaction accuracy. Extensive experiments demonstrate that EgoPro-Bench substantially enhances the intention understanding capabilities of MLLMs and enables accurate identification of appropriate timings for HMI, thereby laying a solid foundation for next-generation user-centric proactive interactive agents.
[CV-106] Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
【速读】:该论文旨在解决现有世界模型(World Models)在作为生成式模拟器用于特定环境(如LIBERO基准测试)时存在的泛化能力差、长程误差累积以及对初始状态扰动敏感等问题,这些问题导致模拟过程中出现严重失真或过曝等视觉异常,从而限制了其在视觉-语言-动作(VLA)模型强化学习后训练中的可靠性。解决方案的关键在于提出Sword框架:一是引入结构引导的风格增强(Structure-Guided Style Augmentation),以解耦交互环境中与任务无关的视觉纹理与任务相关的动态信息,提升泛化性能;二是设计动态潜在自举机制(Dynamic Latent Bootstrapping),在保持训练与推理一致性的同时控制内存消耗,有效缓解长期预测中的误差积累问题。
链接: https://arxiv.org/abs/2605.07288
作者: Jiaxuan Gao,Yongjian Guo,Zhong Guan,Wen Huang,Wanlun Ma,Xi Xiao,Junwu Xiong,Sheng Wen
机构: Tianjin University (天津大学); Tsinghua University (清华大学); JDT AI Infra; Swinburne University of Technology (斯威本科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within “imagination.” However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure-Guided Style Augmentation to disentangle the visual textures of interactive environments from task-relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.
[CV-107] SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis
【速读】:该论文旨在解决现有基于3D高斯溅射(3D Gaussian Splatting)的通用新视角合成方法中,固定数量的高斯素元(Gaussian primitives)分配策略导致的表达能力不足与资源浪费问题。具体而言,传统方法对每个像素或体素统一分配相同数量的高斯素元,未能考虑真实场景中空间复杂度的差异性,从而在平滑区域造成冗余,在精细结构和高频细节区域则表达不足。解决方案的关键在于提出SplatWeaver框架,其核心创新是引入“基数高斯专家”(cardinality Gaussian experts)与像素级路由机制(pixel-level routing scheme),使模型能够以前馈方式动态分配不同区域的高斯素元数量:每个专家专注于生成0至M个特定数量的素元,而路由机制则根据局部复杂度自适应地选择最优专家组合;同时结合高频先验(high-frequency prior)及其引导模块与路由正则化,强化对复杂几何、纹理区域的优先分配,抑制平滑区域的冗余分配,从而实现更紧凑且更具表现力的3D场景表示。
链接: https://arxiv.org/abs/2605.07287
作者: Yecong Wan,Fan Li,Mingwen Shao,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学); Zhengzhou Advanced Research Institute of Harbin Institute of Technology (哈尔滨工业大学郑州研究院); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Artificial Intelligence Research Institute, Shenzhen University of Advanced Technology (深圳先进技术研究院人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generalizable novel view synthesis aims to render unseen views from uncalibrated input images without requiring per-scene optimization. Recent feed-forward approaches based on 3D Gaussian Splatting have achieved promising efficiency and rendering quality. However, most of them assign a fixed number of Gaussians to each pixel or voxel, ignoring the spatially varying complexity of real-world scenes. Such uniform allocation often wastes Gaussian primitives in smooth regions while providing insufficient capacity for fine structures, complex geometry, and high-frequency details. This motivates us to predict region-dependent primitive cardinalities rather than impose a fixed primitive budget everywhere, enabling a more expressive yet compact 3D scene representation. Therefore, we propose SplatWeaver, a generalizable novel view synthesis framework that is able to dynamically allocate Gaussian primitives over different regions in a feed-forward manner. Specifically, SplatWeaver introduces cardinality Gaussian experts and a pixel-level routing scheme, wherein each expert specializes in producing a specific number of primitives from 0 to M, and the routing scheme coordinates these experts to adaptively determine how many Gaussian primitives should be allocated to each spatial location. Moreover, SplatWeaver incorporates a high-frequency prior with attendant guidance module and routing regularization to stabilize expert selection and promote complexity-aware allocation. By leveraging high-frequency structural cues, the routing process is encouraged to assign more Gaussian primitives to fine structures, complex geometry, and textured regions, while suppressing redundant primitives in smooth areas. Extensive experiments across diverse scenarios show that SplatWeaver consistently outperforms state-of-the-art methods, delivering more faithful novel-view renderings with fewer Gaussian primitives.
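"基数专家 + 像素级路由 + 高频先验偏置"的分配逻辑可以用如下玩具代码示意:每个像素在 0 到 M 个高斯基元之间做一次分类式路由,高频区域的先验会把路由偏向更大的基数。路由器形式、偏置方式均为假设,仅用于说明思路,并非论文实现:

```python
import numpy as np

def route_primitive_counts(pixel_feats, router_W, hf_prior, hf_bias=2.0, max_m=4):
    """每像素在 0..max_m 个基元的"基数专家"间路由.

    pixel_feats: (P, d) 像素特征; router_W: (d, max_m+1) 路由器权重
    hf_prior:    (P,)   高频结构先验 (0~1, 越大表示纹理/细节越多)
    """
    logits = pixel_feats @ router_W                      # (P, max_m+1)
    # 高频先验偏置: 高频区域给较大基数的专家加分, 平滑区域不受影响
    counts = np.arange(max_m + 1)
    logits = logits + hf_bias * hf_prior[:, None] * counts[None, :]
    return logits.argmax(axis=1)                          # 每像素分配的基元数量
```

这里用硬 argmax 简化了论文中可微的路由与正则化;核心结论不变:平滑像素被路由到基数 0 或 1 的专家,细节区域获得更多基元预算。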
[CV-108] Predictive but Not Plannable: RC-aux for Latent World Models
【速读】:该论文旨在解决 latent world model 在长期规划中因时空不匹配(spatiotemporal mismatch)而导致的性能瓶颈问题:尽管模型在短时预测上表现准确,但其隐空间中的欧氏距离无法有效反映有限动作预算下状态间的可达性,从而影响基于隐空间的目标导向搜索效果。解决方案的关键在于引入一种轻量级的辅助目标函数——Reachability-Correction auxiliary objective (RC-aux),它在不改变原世界模型主干的前提下,通过两个维度增强隐空间的规划一致性:一是沿时间轴进行多步开环预测以提升长期一致性;二是通过预算条件下的可达性监督与时间硬负样本,使隐空间能区分当前规划窗口内可达与最终可达的状态。该方法显著提升了基于隐空间的规划能力,同时保持了较低的计算开销。
链接: https://arxiv.org/abs/2605.07278
作者: Wenyuan Li,Guang Li,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama
机构: Hokkaido University (北海道大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A latent world model may achieve accurate short-horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long-horizon goal-directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability-Correction auxiliary objective (RC-aux), a lightweight correction for this mismatch in reconstruction-free latent world models. RC-aux keeps the world-model backbone unchanged and adds planning-aligned supervision along two axes. Along the time axis, multi-horizon open-loop prediction trains the model beyond one-step consistency. Along the space axis, budget-conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability-aware planner to favor trajectories that are both goal-directed and attainable under the available budget. We instantiate RC-aux on LeWorldModel and evaluate it under both continuation-training and matched-from-scratch settings. Across goal-conditioned pixel-control tasks and a LIBERO-Goal extension, RC-aux improves LeWM-style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at this https URL.
[CV-109] From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG
【速读】:该论文旨在解决遥感多模态检索增强生成(Remote Sensing Multimodal RAG)系统中,输入空间层面的证据检索阶段遭受攻击时缺乏有效防御的问题。现有研究主要关注对检索语料库或记忆模块的干扰,而忽视了对视觉-语言模型在检索阶段的潜在威胁,尤其是在大气条件变化背景下对遥感图像的隐蔽篡改。解决方案的关键在于提出CloudWeb——一种仅修改输入图像、不改变部署时的检索器、生成器和知识库的“大气检索劫持攻击”。其核心机制是通过叠加参数化的云层与雾霾模式,并优化一个以检索为导向的目标函数,使对抗样本嵌入向目标大气证据靠近,抑制源场景证据,强化排名分离度,同时保持自然性和覆盖范围。实验表明,该方法可显著提升天气相关证据在Top-K结果中的占比(如GeoRSCLIP ViT-B/32上Weather@5从0.71%提升至43.29%),并导致下游生成任务出现明显的天气幻觉和语义偏移,揭示了自然外观的大气扰动可能在生成前即破坏检索可靠性这一实际失效模式。
链接: https://arxiv.org/abs/2605.07273
作者: Jiaju Han,Chao Li,Chengyin Hu,Qike Zhang,Xuemeng Sun,Xin Wang,Fengyu Zhang,Xiang Chen,Yiwei Wei,Jiahuan Long,Jiujiang Guo
机构: China University of Petroleum, Beijing at Karamay (中国石油大学(北京)克拉玛依校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal RAG systems increasingly rely on vision-language retrievers to ground visual queries in external textual evidence. Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models typically target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored. To address this gap, we introduce CloudWeb, an atmospheric retrieval hijacking attack that modifies only the input image while keeping the retriever, generator, and knowledge base fixed at deployment. CloudWeb overlays parameterized cloud- and haze-like patterns on remote sensing images and optimizes them with a retrieval-oriented objective that pulls adversarial image embeddings toward target atmospheric evidence, suppresses source-scene evidence, enforces rank separation, and regularizes naturalness and coverage. To the best of our knowledge, this is the first study of retrieval-stage atmospheric evidence hijacking in remote sensing multimodal RAG. We evaluate CloudWeb on a seven-dataset remote sensing RAG benchmark with five CLIP-style retrievers, including GeoRSCLIP, RemoteCLIP, OpenAI CLIP, and OpenCLIP, together with downstream vision-language generators. Across retrievers, CloudWeb consistently outperforms clean retrieval, handcrafted atmospheric baselines, random cloud perturbations, and fixed variants in injecting weather-related evidence into top-ranked results. On GeoRSCLIP ViT-B/32, Weather@5 increases from 0.71% to 43.29%. Downstream generation further shows measurable weather hallucination and semantic shift, indicating that retrieval-stage hijacking can propagate to the final RAG response. These findings reveal a practical failure mode: natural-looking atmospheric changes can compromise evidence retrieval before generation begins.
[CV-110] Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning
【速读】:该论文旨在解决卫星影像中高精度数字表面模型(Digital Surface Model, DSM)重建的难题,其核心挑战在于现有方法存在显著的效率与准确性权衡:优化-based方法虽精度高但计算耗时(每场景需数小时),而通用几何基础模型(geometry foundation models)虽推理快速却因遥感图像域差异(如RPC模型引入的域偏移和深度尺度分布不匹配)难以直接应用于卫星影像。解决方案的关键在于提出Sat3R框架,通过基于RPC(Rational Polynomial Camera)几何信息构建物理一致的伪深度监督信号,并采用尺度不变对数损失(Scale-Invariant Logarithmic, SiLog loss)对Depth Anything V2进行度量深度微调,从而实现无需逐场景优化的单目深度基础模型到卫星域的有效适配。实验表明,该方法在DFC2019基准上相较零样本前馈基线将平均绝对误差(MAE)降低38%,且精度媲美优化方法,同时推理速度提升超300倍。
链接: https://arxiv.org/abs/2605.07264
作者: Qiaoyi Yang,Chaoyi Zhou,Xi Liu,Run Wang,Minghui Xu,Mert D. Pesé,Feng Luo,Yuhao Xu,Zhi-Qi Cheng,Qiushi Chen,Hairong Qi,Siyu Huang
机构: Clemson University (克莱姆森大学); University of Washington (华盛顿大学); University of Tennessee (田纳西大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate Digital Surface Model (DSM) reconstruction from satellite imagery is critical for applications such as disaster response, urban planning, and large-scale geographic mapping. Existing approaches face a fundamental trade-off: optimization-based methods achieve strong accuracy but require hours of per-scene computation, while generalizable geometry foundation models offer near-instant inference but fail to generalize to satellite imagery due to the domain gap introduced by the Rational Polynomial Camera (RPC) model and mismatched depth scale distributions. We present Sat3R, a feed-forward framework that bridges this gap via RPC-aware metric depth fine-tuning of Depth Anything V2 using the Scale-Invariant Logarithmic (SiLog) loss. By constructing physically consistent pseudo depth supervision from RPC geometry, Sat3R adapts a monocular depth foundation model to the satellite domain without per-scene optimization. Experiments on the DFC2019 benchmark demonstrate that Sat3R reduces MAE by 38% over zero-shot feed-forward baselines and achieves competitive accuracy against optimization-based methods, while delivering over 300x speedup. Sat3R demonstrates that feed-forward models, when properly adapted to the satellite domain, can match optimization-based accuracy at a fraction of the computational cost, paving the way for practical large-scale satellite DSM reconstruction.
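文中用到的尺度不变对数损失 (SiLog) 是单目深度估计的标准损失:令 d = log(pred) − log(gt),则 loss = mean(d²) − λ·(mean d)²;当 λ=1 时对深度的全局尺度缩放完全不敏感,这正是"深度尺度分布不匹配"场景下需要的性质。以下为极简 NumPy 示意(λ=0.85 为文献常用取值,非论文指定值):

```python
import numpy as np

def silog_loss(pred_depth, gt_depth, lam=0.85, eps=1e-6):
    """尺度不变对数损失 (SiLog).

    d = log(pred) - log(gt); loss = mean(d²) - λ·(mean d)²
    λ=1 时退化为 d 的方差, 对预测深度的整体尺度缩放完全不变.
    """
    d = np.log(pred_depth + eps) - np.log(gt_depth + eps)
    return float(np.mean(d ** 2) - lam * np.mean(d) ** 2)
```

可以验证:当 λ=1 时,把预测整体乘以任意常数,损失不变(所有 d 只被平移同一个 log 常数,方差不变)。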
[CV-111] Adaptive Subspace Projection for Generative Personalization
【速读】:该论文旨在解决生成式个性化(Generative Personalization)中常见的语义坍缩问题(Semantic Collapsing Problem, SCP),即模型在个性化过程中过度关注特定概念而忽略文本提示中的其他重要上下文信息。其解决方案的关键在于识别出导致语义漂移的低维子空间,并通过引入一种无需训练的测试阶段嵌入调整方法——自适应子空间投影(AdaptSP),利用预训练模型中稳定的嵌入作为锚点,将漂移量精确地投影到该子空间进行修正,从而在保持主体身份不变的前提下显著提升提示保真度和上下文一致性。
链接: https://arxiv.org/abs/2605.07257
作者: Van-Anh Nguyen,Anh Tuan Bui,Tamas Abraham,Junae Kim,Amardeep Kaur,Rollin Omari,Thuy-Trang Vu,Dinh Phung
机构: Monash University (莫纳什大学); Defence Science and Technology Group (国防科学技术集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative personalization often suffers from the semantic collapsing problem (SCP), where a learned personalized concept overpowers the rest of the text prompt, causing the model to ignore important contextual details. To address this, we first analyze the underlying cause, revealing that the semantic drift responsible for SCP is not random but is concentrated within a specific low-dimensional subspace. We also discover that the personalization process perturbs the embedding of the original base concept, making it an unstable reference point. Based on these insights, we introduce Test-time Embedding Adjustment with Adaptive Subspace Projection (AdaptSP), a training-free method that uses the stable, pre-trained embedding as an anchor. AdaptSP isolates the semantic drift and projects it onto the identified subspace, performing a precise adjustment that mitigates SCP while maintaining the subject identity. Our experiments show that this targeted approach significantly improves prompt fidelity and contextual alignment.
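AdaptSP 的核心操作——以稳定的预训练嵌入为锚点、把个性化嵌入的漂移量限制在给定低维子空间内消除——可以示意如下。子空间基矩阵的来源(论文通过分析语义漂移的集中方向识别)与系数 alpha 均为示意参数,非官方实现:

```python
import numpy as np

def adaptsp_adjust(pers_emb, base_emb, subspace_basis, alpha=1.0):
    """测试阶段嵌入调整: 只消除落在识别子空间内的语义漂移分量.

    subspace_basis: (d, k) 列正交基, 对应语义漂移集中的低维子空间
    """
    drift = pers_emb - base_emb                 # 以预训练嵌入为稳定锚点计算漂移
    B = subspace_basis
    proj = B @ (B.T @ drift)                    # 漂移在子空间内的分量
    return pers_emb - alpha * proj              # 子空间外的分量(承载主体身份)保持不变
```

子空间外的漂移分量原样保留,因此主体身份信息不受影响;只有导致语义坍缩的那部分方向被回退。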
[CV-112] AS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts CVPR2026
【速读】:该论文旨在解决Transformer架构搜索(Transformer Architecture Search, TAS)中普遍存在的特征坍塌(feature collapse)问题,即超网络(supernet)内的子网络因共享权重而无法学习到各自特有的特征,从而限制了子网络的性能表现。解决方案的关键在于引入参数高效低秩适应(Low-Rank Adaptation, LoRA),通过在超网络中为每个子网络分配独立的LoRA模块,实现子网络特异性特征学习,同时保持计算效率;进一步提出Mixture-of-LoRA Experts(MoLE)策略,利用轻量级路由机制根据子网络结构动态选择LoRA专家,并结合分组式路由器初始化技术,在训练初期促进不同专家间的特征多样性,从而有效缓解特征坍塌问题并显著提升TAS方法的性能。
链接: https://arxiv.org/abs/2605.07256
作者: Jeimin Jeon,Hyunju Lee,Bumsub Ham
机构: Yonsei University (延世大学); Articron Inc.; Korea Institute of Science and Technology (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Transformer architecture search (TAS) discovers optimal vision transformer (ViT) architectures automatically, reducing human effort to manually design ViTs. However, existing TAS methods suffer from the feature collapse problem, where subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights in a supernet, limiting the performance of individual subnets. To address this, we propose TAS-LoRA, a novel method that introduces parameter-efficient low-rank adaptation (LoRA) to enable subnet-specific feature learning, while maintaining computational efficiency. TAS-LoRA incorporates a Mixture-of-LoRA Experts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures, and introduces a group-wise router initialization technique to encourage diverse feature learning across experts early in training. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate that TAS-LoRA mitigates feature collapse effectively, improving performance over state-of-the-art TAS methods significantly.
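MoLE 的前向逻辑可以用一个玩具示意概括:共享权重 W0 不变,轻量路由器根据输入(代表子网结构的编码)选出一个 LoRA 专家,叠加其低秩增量 B·A 并按标准 LoRA 的 α/r 缩放。这里用 top-1 硬路由简化了论文中的路由策略,函数与参数名均为假设:

```python
import numpy as np

def mole_forward(x, W0, loras, router_W):
    """共享权重 + 路由选出的 LoRA 专家增量 (Mixture-of-LoRA Experts 简化示意).

    loras: [(A, B, alpha), ...], A:(d, r), B:(r, d_out)
    """
    logits = x @ router_W                        # 轻量路由器打分
    expert = int(np.argmax(logits))              # 简化为 top-1 硬路由
    A, B, alpha = loras[expert]
    r = A.shape[1]
    return x @ W0 + (x @ A @ B) * (alpha / r), expert
```

由于每个子网可被路由到不同专家,子网特异性特征由各专家的低秩增量承载,而主干权重仍然共享,参数开销只随专家数的低秩矩阵增长。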
[CV-113] High-Fidelity Surface Splatting-Based 3D Reconstruction from Multi-View Images
【速读】:该论文旨在解决多视角网格重建(multi-view mesh reconstruction)中从稀疏观测数据恢复高频几何细节的难题,尤其针对现有方法如3D高斯泼溅(3D Gaussian Splatting, 3DGS)和神经辐射场(Neural Radiance Fields, NeRF)依赖后处理提取网格、限制几何与外观联合优化的问题。其解决方案的关键在于引入一种具有局部支撑和更高灵活性的紧凑多项式核函数,替代传统隐式移动最小二乘(Implicit Moving Least Squares, IMLS)方法中性能受限的指数核函数,从而更精确地控制频率内容并提升几何保真度;同时结合拉普拉斯滤波的随机正则化策略,进一步增强细粒度结构的保留能力,实现稳定且高质量的端到端重建与渲染。
链接: https://arxiv.org/abs/2605.07254
作者: Nandhana Sunil,Abhirami R Iyer,Avirup Mandal
机构: IIT Palakkad, India
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 19 pages, 9 figures
Abstract:Multi-view mesh reconstruction remains a core challenge in computer graphics and vision, especially for recovering high-frequency geometry from sparse observations. Recent methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) rely on post-processing for mesh extraction, thereby limiting joint optimization of geometry and appearance. Implicit Moving Least Squares (IMLS) instead enables direct conversion of point clouds into signed distance and texture fields, supporting end-to-end reconstruction and rendering. However, existing IMLS formulations use exponential kernels that struggle with high-frequency detail. We introduce a compact polynomial kernel with local support and greater flexibility, allowing better control over frequency content and improved geometric fidelity. To further enhance fine details, we incorporate stochastic regularization with Laplacian filtering. Together, these improve the preservation of high-frequency structure while maintaining stable optimization. Experiments show state-of-the-art performance in both surface reconstruction and rendering, yielding more accurate geometry and sharper visuals from multi-view data.
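"具有局部支撑的紧致多项式核"与 IMLS 传统指数核的差别,可以用 Wendland C2 核作为一个常见例子来体会:它在支撑半径外精确为零(指数核只是渐近趋零),且多项式次数给了控制频率成分的自由度。注意这只是紧致多项式核的一种常见形式,并非论文采用的具体核:

```python
def compact_poly_kernel(r, h):
    """紧致多项式核示意 (Wendland C2 形式): w(t) = (1-t)^4 (4t+1), t = r/h.

    与指数核 exp(-r²/h²) 不同, 超出支撑半径 h 时权重精确为 0,
    便于局部化计算并更灵活地控制重建的频率内容.
    """
    t = r / h
    if t >= 1.0:
        return 0.0
    return (1.0 - t) ** 4 * (4.0 * t + 1.0)
```

在 IMLS 场景下,这类核作为各点对查询位置的加权函数参与有符号距离场与纹理场的加权平均。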
[CV-114] LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
【速读】:该论文旨在解决蒸馏扩散模型(distilled diffusion models)在加速图像生成过程中因减少去噪步骤而导致的图像质量下降问题,以及现有测试时优化方法因迭代特性带来的计算开销大、推理速度慢等限制。解决方案的关键在于提出一种名为LENS(Low-frequency Eigen Noise Shaping)的高效噪声调制框架,其核心思想是:低频噪声分量主导图像的全局结构与视觉保真度,因此可将噪声调制限制在低维低频子空间中进行。该方法通过理论推导构建了合理的训练目标,并设计了一个轻量级独立网络来选择性调节这些低频成分,从而实现高效且精准的噪声调控,在显著降低计算复杂度(FLOPs减少400–700倍)、模型参数(减少25–75倍)和推理开销(减少10–20倍)的同时,保持了与先进方法相当的图像质量。
链接: https://arxiv.org/abs/2605.07253
作者: Haewon Jeon,Si-Hyeon Lee
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 7 figures
Abstract:Distilled diffusion models accelerate image generation by reducing the number of denoising steps, but often suffer from degraded image quality. To mitigate this trade-off, test-time optimization methods improve quality, yet their iterative nature incurs substantial computational overhead and leads to slow inference, limiting practical usability. Recent hypernetwork-based approaches amortize this process during training, but still require costly noise modulation in high-dimensional latent spaces. In this work, we propose LENS (Low-frequency Eigen Noise Shaping), an efficient noise modulation framework that operates in a low-dimensional subspace. Our approach is motivated by the observation that low-frequency components of the noise largely determine the global structure and visual fidelity of generated images. Based on this observation, we provide a theoretical justification for restricting modulation to the low-frequency subspace and derive a principled training objective. Building on this, LENS employs a lightweight, standalone network to selectively modulate these components, enabling efficient and targeted noise modulation. Extensive experiments demonstrate that LENS achieves competitive image quality while reducing FLOPs by 400-700×, model parameters by 25-75×, and inference-time overhead by 10-20× compared to prior methods.
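"把噪声调制限制在低维低频子空间"的一种最简单实现方式是:让轻量网络只输出一个小尺寸修正量,再上采样到全分辨率——分块常数信号只含低频成分,高频噪声分量完全不被触碰。以下仅为原理示意(上采样方式、修正量的参数化均为假设),并非 LENS 的具体做法:

```python
import numpy as np

def modulate_low_freq(noise, delta_small):
    """低频子空间噪声调制示意.

    noise:       (H, W) 初始高斯噪声
    delta_small: (h, w) 轻量网络预测的低维修正量, h<<H, w<<W
    最近邻上采样得到分块常数修正 (纯低频), 加到噪声上; 高频成分保持不变.
    """
    H, W = noise.shape
    h, w = delta_small.shape
    assert H % h == 0 and W % w == 0
    up = np.kron(delta_small, np.ones((H // h, W // w)))  # 分块常数上采样
    return noise + up
```

调制的自由度从 H×W 降到 h×w,这正是该类方法相对在完整高维隐空间上做噪声优化能大幅省算力的原因。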
[CV-115] PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation
【速读】:该论文旨在解决现有基于VQ-VAE的共语手势生成方法在运动表征中未能编码语义结构、且未显式分离内容与风格的问题,从而限制了手势的语义一致性和个性化保真度。其解决方案的关键在于提出一个两阶段框架PersonaGest:第一阶段采用语义引导的残差向量量化变分自编码器(Semantic-Guided RVQ-VAE),通过语义感知运动码本(Semantic-Aware Motion Codebook, SMoC)按手势语义组织内容码本,并利用对比学习强化内容与风格的解耦;第二阶段则使用掩码生成Transformer结合语义感知重掩码策略生成内容token,并通过一系列风格残差Transformer(Style Residual Transformers)以参考运动提示为条件实现风格控制,从而在客观指标和主观用户评测中均达到当前最优性能,同时保持对参考提示的强风格一致性。
链接: https://arxiv.org/abs/2605.07252
作者: Junchuan Zhao,Qifan Liang,Ye Wang
机构: School of Computing, National University of Singapore (新加坡国立大学计算机学院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 26 pages, 10 figures, 12 tables
Abstract:Co-speech gesture generation aims to synthesize realistic body movements that are semantically coherent with speech and faithful to a user-specified gestural style. Existing VQ-VAE based co-speech gesture generation methods improve generation quality but fail to encode semantic structure into the motion representation or explicitly disentangle content from style, limiting both semantic coherence and personalization fidelity. We present PersonaGest, a two-stage framework addressing both limitations. In the first stage, a semantic-guided RVQ-VAE disentangles motion content and gestural style within the residual quantization structure, where a Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics and contrastive learning further enforces content-style separation. In the second stage, a Masked Generative Transformer generates content tokens via a semantic-aware re-masking strategy, followed by a cascade of Style Residual Transformers conditioned on a reference motion prompt for style control. Extensive experiments demonstrate state-of-the-art performance on objective metrics and perceptual user studies, with strong style consistency to the reference prompt. Our project page with demo videos is available at this https URL
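残差向量量化 (RVQ) 的基本机制是逐级量化上一级的残差:第一级码本承载主成分(在 PersonaGest 中按手势语义组织为内容),后续各级承载残差信息(风格等)。以下是标准 RVQ 编码的极简示意(码本为给定常量,不含训练过程):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """残差向量量化: 每一级在自己的码本中找最近码字, 再量化剩余残差.

    codebooks: 各级码本列表, 每个形状为 (K, d)
    返回 (各级码字索引, 累加重建向量)
    """
    residual = x.astype(float)
    codes, recon = [], np.zeros_like(residual)
    for cb in codebooks:
        d2 = ((residual[None, :] - cb) ** 2).sum(axis=1)   # 到各码字的平方距离
        idx = int(np.argmin(d2))
        codes.append(idx)
        recon = recon + cb[idx]
        residual = residual - cb[idx]                      # 下一级量化该残差
    return codes, recon
```

级数越多重建越精细;内容与风格的解耦即是在这一层级结构上施加的额外约束(语义码本 + 对比学习),此处未展示。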
[CV-116] Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment ACL2026
【速读】:该论文旨在解决视觉上下文压缩(Visual Context Compression)在多模态大语言模型(MLLMs)中引发的安全漏洞问题,即低分辨率图像输入会显著削弱模型的安全防御机制,即使文本内容仍可辨识,也易导致“越狱”(jailbreaking)攻击。研究者将此现象归因于“认知过载”(Cognitive Overload),认为模型在处理模糊或退化的视觉输入时,注意力资源被用于解码图像而非执行安全审计,从而降低了对恶意内容的识别能力。解决方案的关键在于提出一种“结构化认知卸载”(Structured Cognitive Offloading)策略,通过构建串行化处理流程,将视觉转录(visual transcription)与安全评估(safety assessment)分离,从而有效缓解由视觉退化引发的安全风险。
链接: https://arxiv.org/abs/2605.07250
作者: Zhixue Song,Boyan Han,Yiwei Wang,Chi Zhang
机构: Westlake University (西湖大学); University of California, Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of ACL 2026
Abstract:Recent advancements in visual context compression enable MLLMs to process ultra-long contexts efficiently by rendering text into images. However, we identify a critical vulnerability inherent to this paradigm: lowering image resolution inadvertently catalyzes jailbreaking. Our experiments reveal that the safety defenses of SOTA models deteriorate sharply as resolution degrades, surprisingly persisting even when text remains legible. We attribute this to "Cognitive Overload", hypothesizing that the effort required to decipher degraded inputs diverts attentional resources from safety auditing. This phenomenon is consistent across various visual perturbations, including noise and geometric distortion. To address this, we propose a simple "Structured Cognitive Offloading" strategy that mitigates these risks by enforcing a serialized pipeline to decouple visual transcription from safety assessment. Our work exposes a significant risk in vision-based compression and provides critical insights for the secure design of future MLLMs.
[CV-117] Towards multi-modal forgery representation learning for AI-generated video detection and localization
【速读】:该论文旨在解决当前AI生成视频检测工具在面对部分篡改(即视觉与音频通道中存在局部伪造)时,因单一模态建模和缺乏细粒度时间定位能力而导致的检测性能不足问题。其解决方案的关键在于提出一种多模态联合架构,通过集成多模态大模型(LMM)语义分支、时空(Spatio-Temporal, ST)视觉分支以及多尺度局部伪造(Partial-Spoof, PS)音频分支,实现对部分篡改AI生成视频伪造内容的同时检测与细粒度时间维度上的精确定位。
链接: https://arxiv.org/abs/2605.07232
作者: Dat Le,Khoa Nguyen,Xin Wang,Shu Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including partially manipulated clips across visual and audio channels, pose escalating risks of semantic distortion and misuse, which motivates the need for reliable detection tools. Most existing AI-generated video detectors remain limited by single- or partial-modality of data modeling and the lack of fine-grained temporal forgery localization. To address these challenges, our primary novelty introduces a core architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated video forgeries. Extensive experiments show that this approach outperforms existing state-of-the-art methods.
[CV-118] CASCADE: Context-Aware Relaxation for Speculative Image Decoding
【速读】:该论文旨在解决自回归图像生成(autoregressive image generation)中计算效率低、速度慢的问题,尤其是在使用推测解码(speculative decoding)技术时,现有方法在图像生成任务上难以达到与文本生成相当的加速效果。其关键在于目标模型在图像生成过程中具有高不确定性,导致草稿令牌(draft token)被拒绝率过高。论文提出通过识别目标模型在树状推测解码中自然涌现的两个新特性——语义可交换性(semantic interchangeability)和收敛性(convergence),利用这些由目标模型隐藏状态表示冗余所引发的规律,实现无需额外训练即可放宽接受条件的策略优化;同时,将目标模型中的冗余信号注入草稿器(drafter)训练中以提升其独立性能。该方法在多个文生图模型和草稿器架构上均实现了显著加速(最高达3.6倍),且保持图像质量和文本提示一致性。
链接: https://arxiv.org/abs/2605.07230
作者: Selin Yildirim,Subhajit Dutta Chowdhury,Mohammad Mahdi Kamani,Vikram Appia,Deming Chen
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); AMD (超威半导体)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Autoregressive generation is a powerful approach for high-fidelity image synthesis, but it remains computationally demanding and slow even on the most advanced accelerators. While speculative decoding has been explored to mitigate this bottleneck, existing approaches fail to achieve efficiency gains comparable to those observed in text generation. A key limitation is the target model’s high uncertainty during image generation, which leads to high draft token rejection rates. In this work, we identify previously overlooked patterns in the target model’s behavior that emerge naturally in tree-based speculative decoding. Specifically, we formalize two properties, semantic interchangeability and convergence, arising from the redundancies in the target model’s hidden state representations. By capturing these redundancies across the depth and breadth of the predicted token tree, our method identifies principled opportunities for acceptance relaxation without requiring additional training. Additionally, we enhance standalone drafter performance by injecting the redundancy signals from the target model into drafter training with minimal modification. We evaluate our approach across multiple text-to-image models and drafter architectures. Results show that CASCADE achieves state-of-the-art speedups for drafter-based speculative decoding, with up to 3.6x acceleration, while maintaining image quality and text-prompt fidelity.
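The acceptance-relaxation idea in the abstract can be sketched concretely. Below is a minimal, illustrative acceptance test: a draft token is kept either when the target model assigns it enough probability (the standard rule) or when its embedding is close to the target's argmax token, a stand-in for the "semantic interchangeability" that CASCADE derives from redundant hidden states. All function names and thresholds here are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def relaxed_accept(draft_token, target_probs, token_embeds,
                   tau_prob=0.3, tau_sim=0.9):
    """Toy acceptance test for speculative image decoding.

    Standard acceptance keeps a draft token only when the target model
    assigns it sufficient probability. The relaxation sketched here also
    accepts the draft token when its embedding is close to the target's
    top token, on the intuition that redundant hidden states make such
    tokens visually interchangeable. Thresholds are illustrative.
    """
    # Strict rule: target probability of the draft token is high enough.
    if target_probs[draft_token] >= tau_prob:
        return True
    # Relaxed rule: draft token is embedding-similar to the target's top choice.
    top = int(np.argmax(target_probs))
    a, b = token_embeds[draft_token], token_embeds[top]
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return cos >= tau_sim
```

Relaxing acceptance this way raises the expected number of accepted draft tokens per target forward pass, which is where the speedup comes from.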
[CV-119] DINO-MVR: Multi-View Readout of Frozen DINOv3 for Annotation-Efficient Medical Segmentation
【速读】:该论文旨在解决在标注数据稀缺条件下,如何高效实现医学图像分割的问题。传统方法通常依赖于对基础模型(foundation models)的骨干网络微调或设计高容量的任务特定解码器,但在小样本场景下难以稳定训练。其解决方案的关键在于:利用冻结的自监督视觉骨干(DINOv3)提取的特征本身已包含结构和边界线索,通过设计一种轻量级多视角读出(Multi-View Readout, MVR)框架来优化特征读取方式——仅在DINOv3最后三个Transformer块的特征上训练小型MLP探测器(probes),并在推理时结合多尺度分辨率与测试时增强(test-time augmentation),以熵加权融合概率图并辅以空间正则化提升分割一致性。该方法在多个医学影像基准(如Kvasir-SEG、ISIC 2018、BraTS)上表现出色,且在仅使用5例标注数据时即可达到参考模型(40例)98.4%的性能,验证了冻结预训练特征与有效读出机制协同可实现高精度、低标注成本的医学分割。
链接: https://arxiv.org/abs/2605.07221
作者: Wei Jiang,Feng Liu,Nan Ye,Hongfu Sun
机构: The University of Queensland (昆士兰大学); The University of Newcastle (纽卡斯尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Adapting foundation models to medical segmentation typically requires either backbone fine-tuning or high-capacity task-specific decoders, both of which are difficult to fit reliably when annotations are scarce. We show that frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation, and that the main bottleneck lies in how these features are read out. We propose DINO-MVR, a Multi-View Readout framework for annotation-efficient medical segmentation. DINO-MVR trains only lightweight MLP probes on features from the final three transformer blocks of a frozen DINOv3 backbone, without updating the backbone itself. At inference, each input is interpreted through complementary resolutions and test-time augmentations, whose probability maps are combined by entropy-weighted fusion and refined with simple spatial regularization. For volumetric inputs, Gaussian z-axis smoothing further improves inter-slice consistency. Under fixed evaluation protocols on endoscopy, dermoscopy, and MRI benchmarks, DINO-MVR achieves strong readout-only performance, including 0.895 Dice on Kvasir-SEG, 0.897 Dice on ISIC 2018, and 0.908 Dice on BraTS FLAIR whole-tumor segmentation. With only five annotated BraTS patients, it recovers 98.4% of the performance obtained by the 40-patient BraTS reference run. These results suggest that frozen self-supervised vision backbones can support accurate medical segmentation when paired with an effective multi-view readout.
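The entropy-weighted fusion step described above can be sketched as follows. This is a minimal NumPy version under the assumption that each view contributes a softmax probability map and that confident (low-entropy) views should dominate per pixel; the paper's exact weighting scheme may differ.

```python
import numpy as np

def entropy_weighted_fusion(prob_maps, eps=1e-8):
    """Fuse per-view segmentation probability maps by pixel-wise
    entropy weighting, a sketch of the fusion step used in DINO-MVR.

    prob_maps: array of shape (V, C, H, W); each (C, H, W) slice is a
    softmax probability map from one resolution / augmentation view.
    """
    p = np.asarray(prob_maps, dtype=np.float64)
    # Per-view, per-pixel entropy over the class dimension.
    ent = -(p * np.log(p + eps)).sum(axis=1)          # (V, H, W)
    w = np.exp(-ent)                                   # confident views dominate
    w = w / (w.sum(axis=0, keepdims=True) + eps)       # normalize over views
    fused = (w[:, None] * p).sum(axis=0)               # (C, H, W)
    return fused / (fused.sum(axis=0, keepdims=True) + eps)
```

Because the weights are computed per pixel, a view can dominate where it is confident (e.g., a high-resolution crop on a lesion boundary) while being down-weighted elsewhere.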
[CV-120] LoHGNet: Infrared Small Target Detection through Lorentz Geometric Encoding with High-Order Relation Learning
【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, IRSTD)中因目标线索稀疏和背景杂波严重而导致的检测困难问题。现有方法多依赖于欧几里得空间中的传统特征学习与局部交互建模,难以有效刻画弱目标的细微差异及其与背景之间的上下文关系。解决方案的关键在于提出LoHGNet网络,其核心创新是将洛伦兹几何编码(Lorentz geometric encoding)与高阶关系学习(High-Order Relation Learning)相结合:首先通过基于洛伦兹流形的特征学习模块(GA-LRCM)在双曲空间中建模目标的层次化几何表示,增强对弱目标的区分能力;随后利用对数映射将双曲特征投影至欧氏切空间,并设计超图结构的高阶关系学习模块(HORL),显式建模目标与背景间的高阶上下文依赖关系,从而提升复杂场景下的目标判别性能。
链接: https://arxiv.org/abs/2605.07213
作者: Qianwen Ma,Yang Xu,Shangwei Deng,Xiaobo Li,Haofeng Hu
机构: Tianjin University (天津大学); Jiangxi Science and Technology Normal University (江西科技师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Infrared small target detection (IRSTD) remains challenging due to the scarcity of useful target cues and the presence of severe background clutter. Most current methods rely on conventional feature learning and local interaction modeling, where features are represented in Euclidean space. However, such designs may still be limited in describing the subtle differences of weak targets and the contextual relations between targets and backgrounds. To address these limitations, we propose LoHGNet, an IRSTD network that integrates Lorentz geometric encoding with high-order relation learning. By introducing Lorentz manifold based feature learning, LoHGNet offers a different feature representation from conventional IRSTD methods and provides new discriminative cues for IRSTD. Specifically, a Lorentz encoding branch is constructed with the Geometric Attention Guided Lorentz Residual Convolution Module (GA-LRCM) to perform feature modeling under hyperbolic geometric constraints and enhance the hierarchical geometric representation capability of weak targets. Subsequently, the hyperbolic features are mapped into the Euclidean tangent space through logarithmic mapping, and a High-Order Relation Learning Module (HORL) is designed to model the high-order contextual dependencies between targets and backgrounds via hypergraph construction, thereby improving target discrimination in complex backgrounds. Experimental results on three datasets demonstrate that the proposed LoHGNet achieves competitive performance in both detection accuracy and adaptability to complex scenes. The code will be available at this https URL.
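The "logarithmic mapping" that carries hyperbolic features into the Euclidean tangent space has a standard closed form on the Lorentz model. The sketch below implements the log map at the hyperboloid origin (curvature fixed to -1), together with its inverse for a round-trip check; LoHGNet's actual parameterization may differ, so treat this as the textbook operation the abstract refers to.

```python
import numpy as np

def lorentz_inner(x, y):
    # Minkowski inner product: -x0*y0 + sum_i xi*yi
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_log0(x):
    """Logarithmic map at the hyperboloid origin o = (1, 0, ..., 0):
    projects a Lorentz-model point into the Euclidean tangent space,
    the operation used when hyperbolic features are handed to a
    Euclidean relation-learning module. Curvature is fixed to -1.
    """
    o = np.zeros_like(x); o[0] = 1.0
    alpha = -lorentz_inner(o, x)                 # equals x[0] >= 1 on the manifold
    if np.isclose(alpha, 1.0):
        return np.zeros_like(x)                  # x is the origin itself
    coef = np.arccosh(alpha) / np.sqrt(alpha**2 - 1.0)
    return coef * (x - alpha * o)

def lorentz_exp0(v):
    """Inverse map (exponential at the origin), used here for a round trip."""
    o = np.zeros_like(v); o[0] = 1.0
    n = np.sqrt(max(lorentz_inner(v, v), 0.0))   # tangent vectors are space-like
    if np.isclose(n, 0.0):
        return o
    return np.cosh(n) * o + np.sinh(n) * (v / n)
```

Tangent vectors at the origin have a zero time component, so after the log map the spatial coordinates can be consumed directly by ordinary Euclidean layers.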
[CV-121] From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting
【速读】:该论文旨在解决基于高斯溅射(Gaussian Splatting)的场景变化检测问题,传统方法采用“渲染-比较”范式,将变化检测视为像素级或特征级残差分析,忽略了原始几何与外观基元(primitive)所携带的结构信息。其核心挑战在于高斯溅射表示的欠约束特性导致不同优化结果中基元数量、位置、形状和颜色存在不一致性,从而影响变化检测的稳定性。解决方案的关键在于将变化检测从像素空间转移到基元空间,直接利用高斯基元的原生属性(位置、各向异性协方差和颜色)进行比较,并引入几何与光度漂移的各向异性建模以及每个基元可观测性项,以提升基元匹配的鲁棒性。该方法命名为GD-DIFF,具有两个显著优势:一是变化图天然具备多视角一致性,无需额外优化目标;二是能分离地评分几何与外观变化,实现无监督的结构性变化(如新增物体)与表面变化(如颜色改变)识别。在真实世界基准测试中,GD-DIFF相比先前最优方法在平均交并比(mIoU)上提升了约17%。
链接: https://arxiv.org/abs/2605.07203
作者: Chamuditha Jayanga Galappaththige,Jason Lai,Timothy Patten,Donald Dansereau,Niko Suenderhauf,Dimity Miller
机构: QUT Centre for Robotics (QUT机器人中心); ARIAM (智能机器人系统用于实时资产管理研究中心); ACFR, University of Sydney (悉尼大学先进计算与机器人中心); Abyss Solutions (深渊解决方案)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Scene change detection methods built on Gaussian splatting universally follow a render-then-compare paradigm: the pre-change scene is rendered into 2D and compared against post-change images via pixel or feature residuals. This change detection problem with Gaussian Splatting has been treated as a question about pixels; we treat it as a question about primitives. We provide direct evidence that native primitive attributes alone – position, anisotropic covariance, and color – carry sufficient signal for scene change detection. What makes primitive-space comparison hard is the under-constrained nature of Gaussian splatting representation: independent optimizations yield primitive solutions whose count, positions, shapes, and colors differ even where nothing has changed. We address this challenge with anisotropic models of geometric and photometric drift, complemented by a per-primitive observability term that reflects the extent to which each Gaussian is constrained by the camera geometry. Operating directly on primitives gives our method, GD-DIFF, two properties that distinguish it from render-then-compare methods. First, change maps are multi-view consistent by construction, where prior work had to learn this through an additional optimization objective. Second, geometric and appearance changes are scored separately, identifying not just where but what kind of change occurred, distinguishing structural changes (e.g., an added object) from surface-level ones (e.g., a color change) without supervision or external model dependencies. On real-world benchmarks, GD-DIFF surpasses the prior state-of-the-art approach by approximately 17% in mean Intersection over Union.
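A minimal way to see what "anisotropic geometric drift" between matched primitives means: score the position offset between two matched Gaussians by its Mahalanobis distance under their pooled covariance, so that displacement along a primitive's elongated axis counts for less than the same displacement across a thin axis. This is an illustrative stand-in only; GD-DIFF additionally models photometric drift and per-primitive observability.

```python
import numpy as np

def geometric_drift(mu1, cov1, mu2, cov2):
    """Anisotropic geometric drift between two matched Gaussian
    primitives: Mahalanobis distance of the position offset under the
    pooled covariance. A sketch, not the paper's exact drift model.
    """
    pooled = 0.5 * (cov1 + cov2)        # symmetrize the two shapes
    d = mu2 - mu1
    # solve() avoids forming an explicit inverse of the pooled covariance.
    return float(np.sqrt(d @ np.linalg.solve(pooled, d)))
```

The anisotropy matters because independent optimizations legitimately slide primitives along surfaces: the same 3-unit shift scores 3.0 under isotropic unit covariance but only 1.0 when the covariance is stretched nine-fold along the shift direction.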
[CV-122] See Tomorrow Act Today: Foresight-Driven Autonomous Driving CVPR
【速读】:该论文旨在解决当前端到端自动驾驶规划器本质上为反应式(reactive)的问题,即其仅基于历史和当前观测预测未来动作,缺乏对未来的主动预判能力。解决方案的关键在于提出ForeSight框架,该框架以基础世界模型(foundation world model)为核心,将自动驾驶任务重构为前瞻性决策(anticipatory decision-making)。其核心创新在于:首先利用预训练的世界模型生成合理的未来视觉场景(future scene imagination),然后基于这些想象中的未来状态进行动作规划,从而实现从“我现在该做什么?”到“会发生什么,我该如何应对?”的范式转变。这一机制使决策基于预期情境而非单一时刻的观测,显著提升了在动态交互场景中的导航能力。
链接: https://arxiv.org/abs/2605.07195
作者: Bozhou Zhang,Nan Song,Yuang Wang,Jiankang Deng,Xiatian Zhu,Li Zhang
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Imperial College London (帝国理工学院); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Findings 2026
Abstract:Current end-to-end autonomous driving planners are fundamentally reactive: they condition on historical and present observations to predict future actions. We argue that autonomous agents should instead imagine future scenes before deciding, just as human drivers mentally simulate "what will happen next" before acting. We introduce ForeSight, a foundation world model centric planning framework that reframes autonomous driving as anticipatory decision-making. Rather than treating world models as auxiliary components, ForeSight makes future scene imagination the primary driver of action prediction. Our approach operates in two stages: (1) generating plausible future visual worlds via a pretrained world model, and (2) planning actions conditioned on these imagined futures. This paradigm shift from "what should I do now?" to "what will happen, and how should I respond?" enables genuinely anticipatory rather than reactive planning. By grounding decisions in anticipated contexts rather than present observations alone, ForeSight navigates dynamic, interactive scenarios more effectively. Extensive experiments on NAVSIM and nuScenes demonstrate that explicit future imagination significantly outperforms previous state-of-the-art alternatives, validating our foresight-driven approach.
[CV-123] Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models
【速读】:该论文旨在解决在视觉迁移学习(visual transfer learning)场景下,如何高效地进行数据蒸馏(dataset distillation, DD)的问题。具体而言,现有方法通常针对从头训练(from-scratch training)设计,或依赖于神经切线核(NTK)近似与迭代轨迹匹配,未能充分利用冻结预训练特征(frozen pre-trained features)与轻量级线性探测(linear probing)之间的闭式解特性。其解决方案的关键在于提出一种闭式线性探测数据蒸馏(Closed-Form Linear-Probe Dataset Distillation, CLP-DD),该方法基于一个双层优化框架:内层通过样本空间核岭回归求解由合成数据诱导的线性分类器(闭式解),外层则利用温度缩放的softmax交叉熵损失(discriminative outer loss)更新合成图像,其中分类器列作为特征空间中的类锚点(class anchors)。此设计避免了无限宽网络近似和内循环轨迹计算,显著提升了效率与性能,在ImageNet-100和ImageNet-1K上均优于或接近当前最优方法(如LGM with DSA),同时大幅降低计算成本与显存占用。
链接: https://arxiv.org/abs/2605.07194
作者: Bincheng Peng,Guang Li,Ping Liu,Takahiro Ogawa,Miki Haseyama
机构: Hokkaido University(北海道大学); University of Nevada, Reno(内华达大学雷诺分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Dataset distillation compresses a large training set into a small synthetic set that preserves downstream training utility. While most existing methods target training networks from scratch, modern visual transfer learning often uses frozen pre-trained encoders followed by lightweight linear probing. Existing distillation methods for this setting either unroll iterative linear-probe updates with trajectory-based gradient matching, or rely on closed-form formulations originally designed for from-scratch training with neural-tangent-kernel (NTK) approximations. Neither route exploits the fact that frozen-feature linear probing admits a closed-form solution determined directly by the pre-trained features themselves, with no infinite-width approximation and no inner-loop trajectory. We propose Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel formulation that computes the linear probe induced by the synthetic set with a sample-space kernel ridge solver. The synthetic images are then updated by evaluating this induced classifier on real features through a temperature-scaled softmax cross-entropy, where the classifier columns act as learned class anchors in feature space. We further show that the choice of outer objective is decisive: pairing the closed-form inner solver with a standard MSE outer loss substantially underperforms trajectory-based methods, while the discriminative outer loss closes most of the gap. On ImageNet-100 with four pre-trained backbones, CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a fraction of the computational cost. On ImageNet-1K, CLP-DD matches or surpasses LGM with DSA on three of four backbones while running roughly 14× faster and using less than one-eighth of the GPU memory.
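The two levels of the bilevel formulation are concrete enough to sketch. The inner step solves the sample-space (dual) kernel ridge problem, W = Xs^T (Xs Xs^T + λI)^{-1} Ys, giving the linear probe induced by the synthetic features in closed form; the outer step evaluates that probe on real features with a temperature-scaled softmax cross-entropy. Dimensions, the ridge coefficient, and the temperature below are illustrative assumptions.

```python
import numpy as np

def closed_form_probe(Xs, Ys, lam=1e-3):
    """Closed-form linear probe induced by synthetic features:
    W = Xs^T (Xs Xs^T + lam*I)^{-1} Ys  (sample-space kernel ridge).

    Xs: (n, d) synthetic features, Ys: (n, C) one-hot labels.
    Returns the (d, C) classifier. A sketch of CLP-DD's inner solver.
    """
    K = Xs @ Xs.T                                   # (n, n) linear kernel
    A = np.linalg.solve(K + lam * np.eye(K.shape[0]), Ys)
    return Xs.T @ A                                 # (d, C) classifier

def outer_loss(W, Xr, yr, temp=0.1):
    """Temperature-scaled softmax cross-entropy of the induced probe on
    real features Xr with integer labels yr: the discriminative outer
    objective that drives updates to the synthetic set."""
    logits = (Xr @ W) / temp
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(yr)), yr].mean())
```

In the full method the synthetic images (and hence Xs) are the optimization variables: the outer loss is differentiated through the closed-form solve, with no inner-loop trajectory to unroll.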
[CV-124] AsyncEvGS: Asynchronous Event-Assisted Gaussian Splatting for Handheld Motion-Blurred Scenes
【速读】:该论文旨在解决现有3D重建方法(如3D Gaussian Splatting, 3DGS 和 Neural Radiance Fields, NeRF)在输入图像存在严重运动模糊时性能显著下降的问题。其关键解决方案在于提出一种灵活的高分辨率异步RGB-事件双摄像头系统及相应的重建框架:首先利用事件数据重建清晰图像,随后通过基于视觉几何变换器(Visual Geometry Transformer, VGGT)的跨域位姿估计模块获得鲁棒的3DGS初始化;在优化过程中引入结构驱动的事件损失和视图特定一致性正则项,有效缓解传统事件损失与去模糊损失的病态性问题,从而实现稳定且高保真的3D重建。
链接: https://arxiv.org/abs/2605.07192
作者: Jun Dai,Renbiao Jin,Bo Xu,Yutian Chen,Linning Xu,Mulin Yu,Tianfan Xue,Shi Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D reconstruction methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) achieve impressive photorealism but fail when input images suffer from severe motion blur. While event cameras provide high-temporal-resolution motion cues, existing event-assisted approaches rely on low-resolution sensors and strict synchronization, limiting their practicality for handheld 3D capture on common devices, such as smartphones. We introduce a flexible, high-resolution asynchronous RGB-Event dual-camera system and a corresponding reconstruction framework. Our approach first reconstructs sharp images from the event data and then employs a cross-domain pose estimation module based on the Visual Geometry Transformer (VGGT) to obtain robust initialization for 3DGS. During optimization, we employ a structure-driven event loss and view-specific consistency regularizers to mitigate the ill-posed behavior of traditional event losses and deblurring losses, ensuring both stable and high-fidelity reconstruction. We further contribute AsyncEv-Deblur, a new high-resolution RGB-Event dataset captured with our asynchronous system. Experiments demonstrate that our method achieves state-of-the-art performance on both our challenging dataset and existing benchmarks, substantially improving reconstruction robustness under severe motion blur. Project page: this https URL
[CV-125] Attention Transfer Is Not Universally Effective for Vision Transformers
【速读】:该论文旨在解决注意力迁移(Attention Transfer)在视觉 Transformer(Vision Transformer, ViT)知识蒸馏中有效性不一致的问题,即为何部分 ViT 教师模型的注意力模式能够成功迁移到学生模型并提升性能,而另一些则导致显著性能下降。研究表明,这种失败并非源于损失函数设计或预训练策略差异,而是由于教师与学生之间的架构不匹配——具体而言,当学生模型缺少教师所采用的原生架构组件时,即使成功复制了注意力模式,这些模式也无法在学生模型中保持功能性。解决方案的关键在于:在随机初始化的学生模型中引入教师的原生架构组件(如特定的注意力机制结构),即可完全逆转失败现象,且这些组件本身对从头训练无增益,说明其作用是专门解锁教师注意力模式的可用性,而非通用性能提升。这一发现修正了当前对 ViT 中注意力机制功能性的理解:注意力的有效性仅在学生架构与教师匹配时成立。
链接: https://arxiv.org/abs/2605.07191
作者: Huaiyuan Qin,Muli Yang,Gabriel James Goenawan,Peng Hu,Chen Gong,Xi Peng,Hongyuan Zhu
机构: Institute for Infocomm Research (I2R), A*STAR, Singapore; Sichuan University; Shanghai Jiao Tong University
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:A recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher’s pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and reveal that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch no-transfer baseline. Further results demonstrate that this failure is family-consistent across model sizes, and persists under extended training durations, different transfer datasets, and out-of-distribution evaluations. Controlled analyses then consistently localize the problem to the attention-routing channel, indicating that the key issue is not whether the student can match the teacher’s attention patterns, but whether the matched patterns remain functional for the student. Crucially, we identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism. By adding only the teacher’s native architectural components to the student in a randomly initialized state, we completely reverse the failure for all 4 families. Notably, these components alone do not improve from-scratch training, confirming that they specifically unlock the usability of the teacher’s attention. We further systematically show that this failure is not explained by the inadequate choice of transfer loss or by differences in pre-training recipes. Our findings refine the prevailing understanding of attention in ViT representations: attention is sufficient only when the student architecture matches the teacher.
[CV-126] PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
【速读】:该论文旨在解决混合现实(Mixed Reality, MR)应用中 gaze estimation(注视估计)的鲁棒性和泛化性问题,尤其在无校准、有校准、重戴后校准及预测等多种场景下保持高精度。现有方法往往需依赖特定设备姿态或单独处理不同任务(如校准、注视预测、3D眼结构重建),导致系统复杂且适应性差。解决方案的关键在于提出 PicoEyes——一个统一的端到端框架,能够从单目或双目输入中直接联合预测包括 3D 眼参数、眼区分割、光轴、视轴和深度图在内的全部关键属性,并同时处理校准、姿态变化与注视预测任务,从而实现无需人工干预的全流程自动化与高精度估计。
链接: https://arxiv.org/abs/2605.07188
作者: Fuxin Duan,Hui Wang
机构: Pico, ByteDance; Zhejiang University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures, conference
Abstract:We present PicoEyes, a unified gaze estimation framework that directly predicts all key attributes of gaze, including 3D eye parameters, eye-region segmentation, optical axis, visual axis, and depth maps, from either monocular or binocular inputs. The framework simultaneously addresses calibration, gaze forecasting, and varying device postures, while also supporting 3D eye reconstruction via joint estimation of eye parameters and depth maps in an end-to-end manner. In addition, we introduce a large-scale multi-view near-eye dataset containing comprehensive 2D and 3D annotations under diverse conditions, including train, test, rewear-test, and calibration sessions. Extensive experiments demonstrate that PicoEyes achieves state-of-the-art performance, consistently outperforming both academic and industrial gaze tracking methods across no-calibration, calibration, rewear-after-calibration, and forecasting settings. This work establishes a practical, end-to-end paradigm for robust and generalizable gaze estimation in mixed reality (MR) applications.
[CV-127] SatSurfGS: Generalizable 2D Gaussian Splatting for Sparse-View Satellite Surface Reconstruction
【速读】:该论文旨在解决稀疏视图卫星影像表面重建中因多视角匹配可靠性空间异质性而导致的几何约束稀疏、分布不均及局部不可靠的问题,尤其是在大光度差异、弱纹理和重复纹理条件下,传统方法难以实现稳定且高质量的表面重建。其解决方案的关键在于提出一种基于2D高斯溅射(2D Gaussian Splatting, 2DGS)的通用化稀疏视图卫星表面重建方法SatSurfGS,通过构建粗到精的高斯属性预测框架,并在特征学习、高斯参数估计与训练优化三个层面显式建模局部几何可靠性:具体包括自适应融合单目先验与多视图匹配特征的置信度感知单目多视图特征融合模块、利用前一阶段渲染高度图与当前阶段MVS高度图残差及置信度信息进行阶段性参数精化的跨阶段自一致性残差引导模块,以及实现几何与外观监督差异化分配的置信度双向路由损失函数,从而显著提升重建质量、跨数据集泛化能力与推理效率。
链接: https://arxiv.org/abs/2605.07181
作者: Min Chen,Wei Guo,Bin Wang,Wen Li,Tong Fang,Jinbo Zhang,Junqi Zhao,Hong Kuang,Han Hu,Xuming Ge,Qing Zhu,Bo Xu
机构: Southwest Jiaotong University (西南交通大学); Jiangxi Normal University (江西师范大学); Shandong University of Science and Technology (山东科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sparse-view satellite image surface reconstruction remains highly challenging, fundamentally because the reliability of multi-view matching under satellite imaging conditions is strongly spatially heterogeneous. Affected by large photometric differences, weak textures, and repetitive textures, multi-view geometric constraints are often sparse, unevenly distributed, and locally unreliable. Although 2D Gaussian Splatting (2DGS) is more suitable than 3D Gaussian Splatting (3DGS) for the explicit representation of continuous surfaces, research on generalizable feed-forward 2DGS frameworks for sparse-view satellite surface reconstruction is still lacking. To address this issue, we propose SatSurfGS, a generalizable sparse-view surface reconstruction method for satellite imagery based on 2DGS. The proposed method builds a coarse-to-fine Gaussian attribute prediction framework and explicitly models local geometric reliability at three levels: feature learning, Gaussian parameter estimation, and training optimization. Specifically, we propose a confidence-aware monocular multi-view feature fusion module to adaptively integrate monocular priors and multi-view matching features according to local confidence; a cross-stage self-consistency residual guidance module to stabilize stage-wise Gaussian parameter refinement using the residual between the rendered height map from the previous stage and the current-stage MVS height map, together with confidence information; and a confidence bidirectional routing loss to achieve differentiated allocation of geometric and appearance supervision. Experiments on satellite datasets show that the proposed method achieves improved rendering quality, surface reconstruction accuracy, cross-dataset generalization, and inference efficiency compared with representative generalizable baselines and competitive per-scene optimization methods.
[CV-128] Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection
【速读】:该论文旨在解决遥感变化检测中因单模态深度学习方法易将视觉相似但语义无关的变化误判为真实变化而导致的准确性不足问题。现有多模态方法虽引入文本作为辅助监督,但其描述通常语义粗糙、结构松散或由模型生成而存在噪声。论文提出的关键解决方案是:利用标准变化检测数据集中已有的标注掩膜(ground-truth mask labels)自动提取细粒度的结构化文本特征,无需额外人工标注成本。具体而言,每个变化区域被转化为包含“地点(where)、对象(what)、方式(how)、数量(how many)”四个要素的语义四元组,并映射为固定模板文本描述,从而提供精确、密集且无噪声的多模态监督信号。该方法通过两阶段训练策略实现域特定视觉表示的优化与视觉特征和结构化文本嵌入之间的深层对齐,显著提升了变化检测性能。
链接: https://arxiv.org/abs/2605.07178
作者: Kai Zheng,Hang-Cheng Dong,Jiatong Pan,Zhenkai Wu,Fupeng Wei,Wei Zhang
机构: Zhejiang University (浙江大学); Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology Suzhou Research Institute (哈尔滨工业大学苏州研究院); North China University of Water Resources and Electric Power (华北水利水电大学); University of Auckland (奥克兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet, unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels at zero additional annotation cost. Specifically, each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into several fixed-template text descriptions, providing precise, dense, and noise-free multimodal supervision. We adopt a two-stage training strategy: we first fine-tune on remote sensing imagery for robust domain-specific representation, after which a multimodal decoder with a bi-directional contrastive loss is introduced to achieve deep alignment between visual features and structured textual embeddings. To validate our method, we construct Gaza-Change-v2, a new multi-class change detection (MCD) dataset about the Gaza Strip. On this MCD dataset, S2M achieves a SeK of 17.80% and an F_scd of 66.14%, notably surpassing even multimodal methods that leverage large language models. Our work demonstrates that masks can indeed talk. They tell us exactly what, where, how, and how many changes have occurred.
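The mask-to-text transcription described above is mechanical enough to sketch end to end: connected regions of the change mask are grouped by their (before, after) land-cover transition, counted, and rendered into template sentences. The template wording and exact quadruple fields below are illustrative, not the paper's templates; region bounding boxes stand in for "where".

```python
import numpy as np
from collections import deque

def mask_to_quadruples(before, after, names):
    """Derive (where, what, how, how_many) records from a pair of
    land-cover label masks, in the spirit of S2M's zero-cost structured
    text supervision. Templates here are illustrative assumptions.
    """
    changed = before != after
    seen = np.zeros_like(changed, dtype=bool)
    H, W = changed.shape
    records = {}
    for i in range(H):
        for j in range(W):
            if not changed[i, j] or seen[i, j]:
                continue
            # 4-connected flood fill over one contiguous changed region
            # sharing the same (before, after) transition.
            key = (int(before[i, j]), int(after[i, j]))
            q, cells = deque([(i, j)]), []
            seen[i, j] = True
            while q:
                y, x = q.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W and changed[ny, nx] \
                            and not seen[ny, nx] \
                            and (before[ny, nx], after[ny, nx]) == key:
                        seen[ny, nx] = True
                        q.append((ny, nx))
            ys = [c[0] for c in cells]; xs = [c[1] for c in cells]
            rec = records.setdefault(key, {"count": 0, "boxes": []})
            rec["count"] += 1                           # "how many"
            rec["boxes"].append((min(ys), min(xs), max(ys), max(xs)))  # "where"
    return [
        f"{rec['count']} region(s) changed from {names[b]} to {names[a]}"
        for (b, a), rec in sorted(records.items())
    ]
```

Because everything is read off the ground-truth masks, the resulting supervision is dense and noise-free, which is the core argument of the paper.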
[CV-129] Hierarchical Perfusion Graphs for Tumor Heterogeneity Modeling in Glioma Molecular Subtyping MICCAI2026
【速读】:该论文旨在解决胶质瘤分子分型(如IDH突变状态、1p/19q共缺失)依赖侵入性组织活检的问题,提出一种基于动态对比增强磁共振成像(DSC-MRI)的非侵入性放射基因组学方法。其核心挑战在于如何有效整合灌注动力学信息以提升分子亚型预测准确性,同时克服多中心数据变异性和传统体素级分析的局限性。解决方案的关键在于提出HiPerfGNN框架:首先利用向量量化变分自编码器(VQ-VAE)从原始时间-强度曲线中学习离散的灌注表示,生成代表功能肿瘤微环境的粗粒度图节点;随后结合结构MRI对这些节点进行细粒度划分,并通过层次化图神经网络(Hierarchical Graph Neural Network)在多尺度间传播信息,实现精准分子预测。该方法在内部队列中达到AUC 0.96(IDH)、0.89(1p/19q)和0.84(WHO分级),并在独立外部队列中保持稳健性能(IDH AUC 0.89),且无需重新校准,验证了灌注动力学整合对放射基因组学的重要价值。
链接: https://arxiv.org/abs/2605.07156
作者: Han Jang,Junhyeok Lee,Heeseong Eum,Joon Jang,Yoseob Han,Seung Hong Choi,Kyu Sung Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2026. 11 pages, 2 figures, 2 tables
Abstract:Precise molecular subtyping of gliomas, including isocitrate dehydrogenase (IDH) mutation and 1p/19q codeletion, directly guides surgical and therapeutic decisions, yet currently relies on invasive tissue sampling. Deep learning on structural MRI has emerged as a non-invasive alternative, but anatomy-only approaches cannot capture the hemodynamic signatures that distinguish molecular subtypes. Radiogenomics based on dynamic susceptibility contrast (DSC) MRI holds immense potential for non-invasively characterizing glioma molecular subtypes, yet clinical deployment has been hindered by inter-site variability and the limitations of voxel-wise analysis. We introduce HiPerfGNN, a framework that first learns discrete hemodynamic representations from raw time-intensity curves using a vector-quantized variational autoencoder (VQ-VAE). These quantized perfusion codes define coarse-level graph nodes representing functional tumor habitats, each of which is hierarchically subdivided into fine-level subregions guided by structural MRI. A hierarchical graph neural network then propagates information across scales for molecular prediction. On an internal cohort (n=475), the model achieved AUCs of 0.96 (IDH), 0.89 (1p/19q), and 0.84 (WHO grade), and maintained robust IDH performance (AUC 0.89) on an independent external cohort (n=397) without recalibration. Gradient-based saliency analysis confirms biologically grounded attention patterns aligned with known glioma pathophysiology. Our results demonstrate the added value of integrating perfusion dynamics into radiogenomic pipelines for glioma molecular subtyping. Code is available at this https URL.
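The step that turns time-intensity curves into discrete perfusion codes is the standard VQ-VAE nearest-codebook assignment, sketched below. For clarity this operates on the encoded vectors directly; in HiPerfGNN the quantization acts on learned VQ-VAE latents, and the codebook itself is trained, so treat shapes and naming as assumptions.

```python
import numpy as np

def vq_assign(curves, codebook):
    """Nearest-codebook quantization: the step that turns encoded
    time-intensity vectors into discrete perfusion codes, which then
    define the coarse graph nodes (functional tumor habitats).

    curves:   (N, T) encoded per-voxel perfusion vectors
    codebook: (K, T) code vectors
    returns:  (N,) index of the closest code for each input
    """
    # Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2,
    # computed for all N x K pairs at once.
    d = (curves**2).sum(1, keepdims=True) \
        - 2 * curves @ codebook.T \
        + (codebook**2).sum(1)
    return d.argmin(axis=1)
```

Voxels mapped to the same code share a hemodynamic signature, which is what lets the codes serve as habitat labels for the coarse level of the hierarchical graph.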
[CV-130] PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition
【速读】:该论文旨在解决参考音频-视觉分割(Referring Audio-Visual Segmentation, Ref-AVS)任务中多模态信息融合不均衡的问题,即不同模态(视觉、听觉、文本)在不同场景和指代表达中的相关性存在差异,而现有方法通常将多模态输入视为同质信息进行融合或推理,易受无关或误导性模态干扰。解决方案的关键在于提出PRIMED框架,其核心创新包括:1)引入模态先验解码器(Modality Prior Decoder),基于语言引导的先验调制机制估计指代表达主要依赖的模态类型,从而自适应地指导高层注意力机制;2)设计Token Distiller模块,提取紧凑的全局视觉token并共享至竞争感知的跨模态融合模块,提供分层全局上下文信息;3)引入空间感知语义对齐损失(Spatial-Aware Semantic Alignment loss),通过对比学习增强前景与背景的判别能力。该方法有效实现了多模态信息的动态抑制与协同,显著提升了Ref-AVS的精度与鲁棒性。
链接: https://arxiv.org/abs/2605.07154
作者: Yuchen He,Jing Zhang
机构: East China University of Science and Technology (华东理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures
Abstract:Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of different modalities varies across referring expressions and scenes, while existing methods typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, making them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, inspired by the biased competition theory in cognitive neuroscience, which explicitly models both visual perception and language-driven prior modulation, and enables more accurate Ref-AVS by adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality prior to adaptively guide high-level attention. A Token Distiller further extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss to further enhance foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.
[CV-131] DPG-CD: Depth-Prior-Guided Cross-Modal Joint 2D-3D Change Detection
【速读】:该论文旨在解决多时相跨模态数据(如灾前数字表面模型DSM与灾后影像)在城市形态分析和应急响应中进行联合2D语义变化与3D高度变化检测的难题,其核心挑战在于影像与DSM之间存在显著的光谱-几何表示差异,导致模态差异易被误判为实际变化,从而影响检测精度。解决方案的关键在于提出一种深度先验引导的多时相跨模态融合框架DPG-CD:首先利用估计的深度先验(depth prior)缓解影像与DSM之间的模态鸿沟;进而通过门控融合机制选择性注入几何线索并保留判别性光谱特征;再采用多阶段跨时序、跨模态特征融合架构提取变化感知特征;最后通过多任务解码器联合预测2D语义变化与3D高度变化,并引入辅助DSM重建任务提升结构一致性和高程估计准确性。
链接: https://arxiv.org/abs/2605.07151
作者: Luqi Zhang,Zhen Dong,Bisheng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Urban spatial evolution is manifested not only through horizontal expansion but also through vertical structural changes. Consequently, jointly capturing 2D semantic changes and 3D height changes is essential for urban morphology analysis and emergency management. In practical scenarios, collecting 3D observations is often constrained by high acquisition costs and the inability to support frequent updates. The multi-temporal cross-modal input consisting of pre-event Digital Surface Model (DSM) and post-event imagery provides a practical solution for 3D change detection in high-frequency urban monitoring, disaster assessment, and emergency response scenarios. However, this setting remains challenging as imagery and DSM data exhibit significant spectral-geometric representation gaps. Moreover, modality differences may be confused with actual changes, and robust change detection requires effective fusion of semantic and geometric features from multi-temporal data. In this paper, we propose DPG-CD, a depth-prior-guided multi-temporal cross-modal fusion framework for joint 2D semantic and 3D height change detection. Specifically, an estimated depth prior is introduced into the imagery to mitigate the modality gap with DSM. A gated fusion mechanism then selectively injects geometric cues from depth prior while preserving discriminative spectral representations. Subsequently, a multi-stage cross-temporal cross-modal feature fusion architecture is employed to extract change-aware features. Finally, a multi-task decoder jointly predicts 2D semantic changes and 3D height changes, complemented by an auxiliary DSM prediction task to improve structural consistency and height estimation accuracy. Experiments on two public datasets, Hi-BCD and 3DCD, and a new dataset, NYC-MMCD, demonstrate that DPG-CD outperforms state-of-the-art methods on both 2D and 3D change detection tasks.
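The gated fusion mechanism in the abstract follows a common pattern that can be sketched in a few lines: concatenate image and depth-prior features, compute a sigmoid gate, and blend the two streams per channel. The concatenation-then-sigmoid layout and the parameter names (Wg, bg) are assumptions about the general form, not DPG-CD's exact layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(img_feat, depth_feat, Wg, bg):
    """Minimal gated fusion of image and depth-prior features: the gate
    decides, per channel, how much geometric cue to inject while
    preserving the spectral representation. Wg, bg are learned in
    practice; this sketch takes them as inputs.
    """
    z = np.concatenate([img_feat, depth_feat], axis=-1)
    g = sigmoid(z @ Wg + bg)               # per-channel gate in [0, 1]
    return g * depth_feat + (1.0 - g) * img_feat
```

The two gate extremes make the selectivity explicit: a gate saturated at 0 passes the spectral features through untouched, while a gate saturated at 1 replaces them with the geometric stream.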
[CV-132] Real-IAD MVN: A Multi-View Normal Vector Dataset and Benchmark for High-Fidelity Industrial Anomaly Detection CVPR2025
【Quick Read】: This paper targets the difficulty of detecting subtle geometric defects (e.g., scratches and pits) in Industrial Anomaly Detection (IAD). Existing methods that rely on 2D RGB images are sensitive to texture and lighting, while sparse 3D point clouds cannot capture micro-scale detail. The key to the solution is a large-scale industrial dataset, Real-IAD-MVN (Multi-View Normal): an upgraded acquisition system captures high-fidelity surface normal maps from five viewpoints in place of conventional sparse 3D data, providing a complete geometric representation at the micro level and making previously invisible side-wall and occluded defects explicitly detectable. Experiments show that features built on dense multi-view pseudo-3D normals significantly outperform sparse 3D point clouds. The authors further propose a reconstruction-based unified-prototype baseline that learns cross-modal prototypes from the image and normal-map streams, surpassing state-of-the-art multimodal fusion methods and validating the dataset's potential for geometric anomaly detection.
Link: https://arxiv.org/abs/2605.07149
Authors: Wenbing Zhu, Jianing Liang, Linjie Cheng, Yurui Pan, Zhuhao Chen, Qingwang Yan, Yudong Cheng, Jianghui Zhang, Mingmin Chi, Bo Peng
Affiliations: Fudan University; Shanghai Ocean University; Donghua University; Rongcheer Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025. 15 pages
Abstract:Industrial Anomaly Detection (IAD) is critical for quality control, but existing methods struggle with subtle, geometric defects. Standard 2D (RGB) images are sensitive to texture and lighting but often miss fine geometric anomalies. While 3D point clouds capture macro-shape, they are typically too sparse to detect micro-defects like scratches or pits. We address this fundamental data limitation by introducing Real-IAD-MVN (Multi-View Normal), a large-scale industrial dataset. By upgrading our acquisition system, Real-IAD-MVN captures high-fidelity surface normal maps from five distinct viewpoints, replacing sparse 3D data entirely. This provides a comprehensive geometric representation at a micro-detail level, making previously invisible side-wall and occluded defects explicitly detectable. Our experiments, conducted on this new dataset, first provide evidence that incorporating dense, multi-view pseudo-3D (surface normals) yields significantly better detection performance than using sparse 3D point cloud data. To further validate the dataset and provide a strong benchmark, we introduce a baseline method based on reconstruction, which learns to extract cross-modal unified prototypes from the image and normal map streams. We demonstrate that this unified prototype approach surpasses existing state-of-the-art multimodal fusion methods, highlighting the rich potential of our new dataset for advancing geometric anomaly detection.
[CV-133] Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models
【Quick Read】: This paper asks whether current Vision-Language Models (VLMs) internally represent the topology of 3D environments. While cognitive science establishes that humans navigate via cognitive maps, and modern VLMs show emergent spatial reasoning from 2D egocentric inputs, it has been unclear whether an analogous latent topological structure of physical space exists in these models. The key to the solution is threefold: first, cross-scene linear feature extraction isolates a spatial subspace of the model, revealing its implicit topological map; second, this spatial representation is shown to match the Laplacian eigenmaps of the scene's 3D Gaussian-kernel graph and to converge to true 3D space in the continuous limit; finally, a mathematically principled latent regularizer based on Dirichlet energy is proposed, which, with only 500 steps of supervised fine-tuning (SFT), substantially improves performance on real-world spatial benchmarks, outperforming standard SFT and competitive baselines by up to 12.1% on tasks involving scene-topology understanding.
Link: https://arxiv.org/abs/2605.07148
Authors: Haoming Wang, Wei Gao
Affiliations: University of Pittsburgh
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous 3D internal representation. In this paper, we demonstrate that current VLMs do possess a latent topological map of 3D scenes, but it is heavily overshadowed by non-geometric visual semantics, such as color and shape. By isolating this spatial subspace through cross-scene linear feature extraction, we extract a clean spatial subspace that causally controls the model’s spatial outputs. We mathematically shape this latent representation and prove its correspondence to the Laplacian eigenmaps of the scene’s 3D Gaussian-kernel graph, converging to the physical 3D space in the continuous limit. Motivated by this geometric identification, we further introduce a mathematically principled latent regularization method for VLMs, based on Dirichlet energy. Applying this single-term regularizer to a minimal 500-step supervised VLM fine-tuning (SFT) on simple synthetic data yields significant improvements on real-world spatial benchmarks, outperforming standard SFT and competitive baselines by up to 12.1% in spatial tasks involving scene topology understanding. Source code is available at this https URL
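The single-term regularizer this abstract describes can be sketched numerically. The following is a minimal, pure-Python illustration (not the paper's implementation; all names are hypothetical): for latent embeddings z_i attached to 3D scene points x_i, the Dirichlet energy E = ½ Σ_ij w_ij ||z_i − z_j||² over a Gaussian-kernel graph penalizes embeddings that differ sharply between spatially nearby points.

```python
import math

def gaussian_weights(points, sigma=1.0):
    """Gaussian-kernel graph weights w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    return w

def dirichlet_energy(latents, weights):
    """E = 1/2 * sum_ij w_ij ||z_i - z_j||^2 (small when latents vary smoothly
    over the spatial graph, large when nearby points get dissimilar latents)."""
    e = 0.0
    for i in range(len(latents)):
        for j in range(len(latents)):
            d2 = sum((a - b) ** 2 for a, b in zip(latents[i], latents[j]))
            e += weights[i][j] * d2
    return 0.5 * e

# Three 3D scene points; the first two are close, the third is far away.
pts = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0)]
w = gaussian_weights(pts)

smooth = [(0.0, 0.0), (0.05, 0.0), (1.0, 0.0)]   # nearby points -> similar latents
rough  = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0)]    # nearby points -> dissimilar latents
assert dirichlet_energy(smooth, w) < dirichlet_energy(rough, w)
```

Minimizing such a term during fine-tuning pushes latent geometry toward the Laplacian-eigenmap structure of the scene graph, which is the correspondence the paper proves.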
[CV-134] UniV2D: Bridging Visual Restoration and Semantic Perception for Underwater Salient Object Detection
【Quick Read】: This paper addresses Underwater Salient Object Detection (USOD), where severe visual degradation (e.g., selective absorption and medium scattering) limits detection performance. The conventional sequential "enhance-then-detect" paradigm is prone to semantic inconsistency: enhanced images are not necessarily well-suited for detection and may even introduce task-irrelevant noise. The key to the solution is UniV2D, a Unified Vision-to-Detection Network that jointly optimizes visual restoration and saliency detection under a semantic-driven learning paradigm: high-level saliency semantics actively guide the restoration process, while the restored visual cues in turn enhance saliency perception. Its core design is a hierarchical dual-branch architecture comprising a self-calibrated decoder with a mask-aware restoration module, plus a saliency-guided refinement module with cross-level modulation, establishing a co-optimization between structural fidelity and semantic consistency.
Link: https://arxiv.org/abs/2605.07146
Authors: Laibin Chang, Shaodong Wang, Yunke Wang, Xu Zhang, Kui Jiang, Chang Xu, Bo Du
Affiliations: Wuhan University; The University of Sydney; Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Underwater salient object detection (USOD) plays a vital role in marine vision tasks but remains fundamentally challenging due to severe visual degradation, such as selective absorption and medium scattering. Conventional pipelines typically adopt a sequential “enhance-then-detect” paradigm. However, isolating low-level visual restoration from high-level semantic perception often leads to semantic inconsistency, where the restored images may not be optimal for detection and can even introduce task-irrelevant noise. To break this sequential bottleneck, we propose UniV2D, a Unified Vision-to-Detection Network that jointly optimizes visual restoration and salient object detection within a mutually beneficial framework. Unlike traditional methods that rely on disjointed pipelines or rigid physical priors, UniV2D introduces a semantic-driven learning paradigm: high-level saliency semantics actively guide the restoration process, while the restored visual cues reciprocally enhance saliency perception. Specifically, UniV2D features a hierarchical dual-branch architecture. It first employs a self-calibrated decoder to predict initial saliency masks alongside a mask-aware restoration module to reconstruct image content. Subsequently, a saliency-guided refinement module equipped with cross-level modulation is utilized to align structural fidelity with semantic consistency. Extensive experiments across multiple benchmarks demonstrate that UniV2D significantly outperforms state-of-the-art methods in both quantitative and qualitative evaluations, establishing a new standard for joint underwater perception.
[CV-135] TriP: A Triangle Puzzle Approach to Robust Translation Averaging
【Quick Read】: This paper targets the **translation averaging** problem: recovering camera locations from pairwise relative translation directions, a key step in global Structure-from-Motion (SfM) pipelines. Because direction measurements carry no distance information, the problem is highly ill-conditioned and sensitive to noise and outliers. The key to the solution is TriP, a triangle-based framework: it first infers local edge scales from triangle geometry, then synchronizes the scales of overlapping triangles in the logarithmic domain to recover globally consistent edge lengths and camera locations. By exploiting higher-order consistency across triangles, the method is robust to adversarial, cycle-consistent, and other structured corruptions, and it naturally avoids the scale-collapse issue without extra constraints (log-scale synchronization excludes the degenerate zero-scale solution by construction), enabling more accurate camera location recovery.
Link: https://arxiv.org/abs/2605.07143
Authors: Zhekai Fan, Wanze Li, Jinxin Wang, Yunpeng Shi
Affiliations: UC Davis; University of Chicago
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Numerical Analysis (math.NA)
Comments:
Abstract:Translation averaging aims to recover camera locations from pairwise relative translation directions and is a fundamental component of global Structure-from-Motion pipelines. The problem is challenging because direction measurements contain no distance information, making the estimation problem highly ill-conditioned and highly sensitive to corrupted observations. In this paper, we propose TriP, a triangle-based framework for robust translation averaging. TriP first infers local relative edge scales from triangle geometry, and then synchronizes the scales of overlapping triangles in the logarithmic domain to recover globally consistent edge lengths and camera locations. By leveraging higher-order consistency across triangles, the proposed method is robust to adversarial, cycle-consistent, and other structured corruptions. In addition, TriP avoids the collapse issue without requiring any extra anti-collapse constraints, since log-scale synchronization excludes the degenerate zero-scale solution by construction. These structural advantages enable a particularly strong theory for exact location recovery. On the practical side, TriP is fully parallelizable, computationally efficient, and naturally scalable to graphs with millions of cameras. Moreover, it outperforms all previous translation averaging methods by a large margin on both synthetic and real datasets.
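The first TriP step — inferring local edge scales from triangle geometry — admits a compact illustration. The sketch below (hypothetical names, not the paper's code) uses the law of sines: given only the three unit direction measurements along a triangle's edges, each edge length is proportional to the sine of the interior angle at the opposite camera, which fixes all three scales up to one global factor; the remaining TriP machinery then synchronizes such per-triangle scales in the log domain.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def triangle_edge_scales(d01, d02, d12):
    """Relative lengths of edges (0,1), (0,2), (1,2) from unit directions only.

    Law of sines: edge (i,j) is proportional to sin of the interior angle
    at the remaining camera k."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    clamp = lambda c: max(-1.0, min(1.0, c))
    ang0 = math.acos(clamp(dot(d01, d02)))                        # angle at camera 0
    ang1 = math.acos(clamp(dot([-x for x in d01], d12)))          # angle at camera 1
    ang2 = math.pi - ang0 - ang1                                  # angle at camera 2
    return math.sin(ang2), math.sin(ang1), math.sin(ang0)        # edges 01, 02, 12

# Ground-truth camera locations, used only to simulate direction measurements.
t0, t1, t2 = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)
d01, d02, d12 = unit(sub(t1, t0)), unit(sub(t2, t0)), unit(sub(t2, t1))

s01, s02, s12 = triangle_edge_scales(d01, d02, d12)
true01, true02, true12 = math.dist(t0, t1), math.dist(t0, t2), math.dist(t1, t2)
# Recovered scales match the true edge lengths up to one global factor.
assert abs(s02 / s01 - true02 / true01) < 1e-9
assert abs(s12 / s01 - true12 / true01) < 1e-9
```

Because the synchronization then works on log-scales, a zero scale (the collapse solution) is outside the search space by construction, which is the structural advantage the abstract highlights.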
[CV-136] AGA3DNet: Anatomy-Guided Gaussian Priors with Multi-view xLSTM for 3D Brain MRI Subtype Classification CVPR
【Quick Read】: This paper studies how to effectively combine localized anatomical cues with long-range contextual reasoning for 3D brain MRI subtype classification. The key to the solution is the AGA3DNet framework: brief anatomical phrases extracted from radiology reports serve as a soft anatomical prior channel, mapped to atlas-defined regions and converted into smooth spatial priors via a signed-distance transform followed by Gaussian weighting, providing interpretable, anatomy-grounded guidance without dense voxel annotations; this prior is fused with a lightweight 3D CNN and multi-view xLSTM aggregation, yielding a balanced performance profile together with clinically interpretable localization.
Link: https://arxiv.org/abs/2605.07142
Authors: Peiyu Duan, Xueqi Guo, Sepehr Farhand, Mehmet Berk Sahin, Xinyuan Zheng, James S. Duncan, Gerardo Hermosillo Valadez, Yoshihisa Shinagawa
Affiliations: Yale University; Siemens Healthineers; Purdue University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR CV4CLINIC 2026
Abstract:Accurate 3D brain MRI subtype classification benefits from both localized anatomical cues and long-range contextual reasoning. We present AGA3DNet, a report-grounded framework that incorporates brief anatomical phrases extracted from radiology reports as a soft anatomical prior channel and fuses it with a lightweight 3D CNN and multi-view xLSTM aggregation. Specifically, extracted anatomical phrases are mapped to atlas-defined regions and converted into smooth spatial priors using a signed-distance transform followed by Gaussian weighting, providing interpretable, anatomy-grounded guidance without requiring dense voxel annotations. We evaluate AGA3DNet on a retrospective institutional brain MRI cohort for abnormal subtype discrimination and compare against reproducible 3D classification baselines. AGA3DNet achieves improved overall balance across performance metrics and supports clinically interpretable localization through the prior channel. We discuss limitations related to single-cohort evaluation and the lack of large-scale public brain MRI datasets paired with radiology reports under broadly usable terms.
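The prior-construction step (signed-distance transform followed by Gaussian weighting) can be sketched on a toy 2D grid. This is an assumption-laden illustration, not the paper's pipeline: the brute-force signed distance stands in for a fast distance transform, and the exact Gaussian weighting rule (here: full weight inside the atlas region, Gaussian decay with distance outside) is one plausible choice.

```python
import math

def signed_distance(mask):
    """Brute-force signed distance on a 2D grid: negative inside the region,
    positive outside (a real pipeline would use a fast distance transform)."""
    h, w = len(mask), len(mask[0])
    inside = [(i, j) for i in range(h) for j in range(w) if mask[i][j]]
    outside = [(i, j) for i in range(h) for j in range(w) if not mask[i][j]]
    sd = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            other = outside if mask[i][j] else inside
            d = min(math.hypot(i - a, j - b) for a, b in other)
            sd[i][j] = -d if mask[i][j] else d
    return sd

def gaussian_prior(sd, sigma=1.5):
    """Soft anatomical prior: ~1 inside / at the region boundary, smooth
    Gaussian decay with distance outside."""
    return [[math.exp(-max(d, 0.0) ** 2 / (2 * sigma ** 2)) for d in row]
            for row in sd]

# 7x7 grid with a 3x3 atlas-defined region in the centre.
mask = [[1 if 2 <= i <= 4 and 2 <= j <= 4 else 0 for j in range(7)]
        for i in range(7)]
prior = gaussian_prior(signed_distance(mask))
assert prior[3][3] == 1.0          # region centre: full weight
assert prior[3][5] > prior[3][6]   # weight decays smoothly away from the region
```

The resulting channel is smooth rather than binary, which is what lets it guide a 3D CNN without dense voxel annotations.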
[CV-137] Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
【Quick Read】: This paper addresses open-world referring segmentation, where multimodal large language model (MLLM) outputs are limited to sparse bounding-box coordinates and cannot deliver precise pixel-level segmentation. Existing methods either directly predict sparse contour points, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), incurring substantial architectural and deployment overhead. The key to the solution is the Qwen3-VL-Seg framework, whose core idea is to treat the MLLM-predicted bounding box as a semantically grounded structural prior and decode it into pixel-level segmentation with a lightweight box-guided mask decoder; the decoder combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, adding only 17M parameters (about 0.4% of the base model) while delivering efficient, high-accuracy open-world visual grounding and segmentation.
Link: https://arxiv.org/abs/2605.07141
Authors: Yuan Yao, Qiushi Yang, Humen Zhong, Jiangning Wei, Yifang Men, Shuai Bai, Miaomiao Cui, Zhibo Yang
Affiliations: Tongyi Lab, Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) exhibit strong open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates and are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either directly predict sparse contour coordinates, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), introducing substantial architectural and deployment overhead. We present Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core, a lightweight box-guided mask decoder combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, introducing only 17M parameters (about 0.4% of the base model). For scalable open-world training, we construct SA1B-ORS, an SA-1B-derived dataset with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, we curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly across closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves general-purpose multimodal competence after segmentation-oriented adaptation.
[CV-138] Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition IJCAI2026
【Quick Read】: This paper tackles the pervasive black-box problem in skeleton-based Human Activity Recognition (HAR): existing models perform well yet lack interpretability. The key to the solution is a neurosymbolic framework that reformulates action recognition as concept-driven first-order logical reasoning over motion primitives. The framework aligns skeleton features with LLM-derived descriptions of atomic motion primitives through learnable spatio-temporal motion concepts, establishing a shared conceptual space between perception and reasoning; differentiable first-order logic layers then compose spatio-temporal concept predicates, enabling the model to learn human-readable logical rules over action semantics while maintaining both recognition accuracy and interpretability.
Link: https://arxiv.org/abs/2605.07140
Authors: Talha Ilyas, Deval Mehta, Zongyuan Ge
Affiliations: Monash University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)
Abstract:Skeleton-based human activity recognition has achieved strong empirical performance, yet most existing models remain black boxes and difficult to interpret. In this work, we introduce a neurosymbolic formulation of skeleton-based HAR that reframes action recognition as concept-driven first-order logical reasoning over motion primitives. Our framework bridges representation learning and symbolic inference by grounding first-order logic predicates in learnable spatial and temporal motion concepts. Specifically, we employ a standard spatio-temporal skeleton encoder to extract latent motion representations, which are then mapped to interpretable concept predicates via a spatio-temporal concept decoder that explicitly separates pose-centric and dynamics-centric abstractions. These concept predicates are composed through differentiable first-order logic layers, enabling the model to learn human-readable logical rules that govern action semantics. To impose semantic structure on the learned concepts, we align skeleton representations with LLM-derived descriptions of atomic motion primitives, establishing a shared conceptual space for perception and reasoning. Extensive experiments on NTU RGB+D 60/120 and NW-UCLA demonstrate that our approach achieves competitive recognition performance while providing explicit, interpretable explanations grounded in logical structure. Our results highlight neurosymbolic reasoning as an effective paradigm for interpretable spatio-temporal action understanding. Code: this https URL
[CV-139] InfoGeo: Information-Theoretic Object-Centric Learning for Cross-View Generalizable UAV Geo-Localization
【Quick Read】: This paper addresses the limited robustness and generalization of cross-view geo-localization (CVGL) in GPS-denied environments, especially under severe domain shifts caused by varying regional textures, weather conditions, and the visual clutter inherent to UAV viewpoints. The key to the solution is InfoGeo, an information-theoretic framework that casts the optimization as an information bottleneck with two objectives: maximizing view-invariant information by aligning object-centric structural relations across views, and minimizing view-specific noisy signals via cross-view knowledge constraints. This markedly improves localization accuracy and adaptability in challenging scenarios.
Link: https://arxiv.org/abs/2605.07099
Authors: Hongyang Zhang, Maonnan Wang, Ziyao Wang, Hongrui Yin, Man-On Pun
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Cross-view geo-localization (CVGL) is fundamental for precise localization and navigation in GPS-denied environments, aiming to match ground or UAV imagery with satellite views. While existing approaches rely on global feature alignment, they often suffer from substantial domain shifts induced by varying regional textures and weather conditions. This issue becomes even more pronounced in UAV-based scenarios, where the broader perspective inevitably introduces dense, fine-grained objects, creating significant visual clutter. To address this, we draw inspiration from Object-Centric Learning (OCL) and propose InfoGeo, an information-theoretic framework designed to enhance robustness and generalization. InfoGeo reformulates the optimization as an information bottleneck process with two core objectives: (i) maximizing view-invariant information by aligning the object-centric structural relations across views, and (ii) minimizing view-specific noisy signals through cross-view knowledge constraints. Extensive evaluations across diverse benchmarks and challenging scenarios demonstrate that InfoGeo significantly outperforms state-of-the-art methods.
[CV-140] Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information
【Quick Read】: This paper targets a flaw in existing channel-importance scores: they conflate a channel's relevance to the task with its local replaceability, leading to inaccurate pruning predictions. Conventional methods summarize channel importance with a single score, ignoring two distinct questions: how much information the channel carries about the target task (target relevance), and whether same-layer peers can supply its function when the channel is removed (local replaceability). The paper proposes a two-axis view that disentangles these questions via a local axis (input capture and peer overlap) and a target axis (task information and target-excess information). Experiments show that the two axes, though strongly coupled at random initialization, separate rapidly during training, and that under FLOPs-matched pruning, local-axis metrics predict channel removability more reliably than target-axis metrics, most notably on ResNet-18 and MobileNetV2, while norm-based baselines remain competitive on VGG-16. The core insight is that local replaceability is a more reliable guide for pruning than target relevance, providing theoretical grounding for efficient, accurate structural compression of neural networks.
Link: https://arxiv.org/abs/2605.07086
Authors: Houman Safaai, Andrew T. Landau, Celia C. Beron, Yasin Mazloumi, Bernardo L. Sabatini
Affiliations: Harvard University; Howard Hughes Medical Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Channel importance in vision networks is usually summarized by a single score. That summary hides two different questions: how much a channel is related to the task, and whether its function can be supplied by same-layer peers when the channel is removed. We call the second property local replaceability. We introduce a two-axis view that separates these questions. The local axis measures input capture and peer overlap, while the target axis measures task information and target-excess information. Across ResNet-18, VGG-16, and MobileNetV2 trained on CIFAR-100, the two axes are weakly aligned, induce different channel groupings, and separate rapidly during training despite being strongly coupled at random initialization. A Gaussian linear analysis accounts for how this separation can arise through residualized gradient directions, and lesion plus peer-replacement experiments show that peer support refines removability beyond input capture and task relevance alone. Under the fixed FLOPs-matched pruning protocol, local-axis metrics are more reliable predictors of removability than target-axis metrics across the three CIFAR-100 backbones, with the same direction preserved in stress tests on CIFAR-10, Tiny-ImageNet, ImageNet-100, and a ConvNeXt-T/ImageNet-100 pilot. These findings identify an axis-level distinction rather than a universal ranking of pruning scores: local replaceability is a more reliable guide to removability than target relevance, while norm-based baselines remain competitive in architectures such as VGG-16. Relevance-based scores ask what a channel says about the task; pruning asks whether the network still needs that channel when its peers remain available.
[CV-141] ImplantMamba: Long-range Sequential Modeling Mamba For Dental Implant Position Prediction
【Quick Read】: This paper addresses insufficient accuracy in implant positioning for dental implant surgery: because the implant region often lacks distinctive texture in medical images, conventional AI models struggle to infer the correct implant position and angulation (slope). The key to the solution is ImplantMamba, a network with a hybrid encoder combining CNNs and Mamba layers to hierarchically extract local anatomical features while modeling global contextual dependencies across the entire scan volume; a Slope-Coupled Prediction Branch (SCP) explicitly couples the regression of implant position and slope, ensuring internally consistent and anatomically plausible predictions and markedly improving the accuracy and robustness of implant positioning.
Link: https://arxiv.org/abs/2605.07082
Authors: Xinquan Yang, Congmin Wang, Xuguang Li, Yulei Li, Linlin Shen, Yongqiang Deng, He Meng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In the design of surgical guides for implant placement, determining the precise implant position is a critical step. However, the implant region itself is often characterized by a lack of distinctive texture in medical images. Consequently, artificial intelligence (AI) models must infer the correct implant position and angulation (slope) primarily by analyzing the texture of the surrounding teeth, which poses a significant challenge. To address this, we propose ImplantMamba, a network architecture designed for long-range sequential modeling to integrate texture information from adjacent teeth. Our approach explicitly couples the regression of the implant position with its slope. The core of ImplantMamba is a hybrid encoder that combines Convolutional Neural Networks (CNNs) with Mamba layers. This design enables the network to hierarchically extract local anatomical features through CNNs while simultaneously modeling global contextual dependencies across the entire scan volume via Mamba’s selective scan operations, leading to a more comprehensive understanding of the implant site. Furthermore, we introduce a Slope-Coupled Prediction Branch (SCP). This branch is designed to connect the prediction of implant position with the slope, ensuring internal consistency and anatomical plausibility by enforcing a coherent relationship between the predicted implant location and its angulation. Extensive experiments on a large-scale dental implant dataset demonstrate that the proposed ImplantMamba achieves superior performance compared to existing methods.
[CV-142] Learning Visual Feature-Based World Models via Residual Latent Action
【Quick Read】: This paper addresses the poor prediction quality of existing world models in complex interactions: current visual feature-based world models rely on direct regression, which produces blurry or collapsed predictions, while generative modeling in high-dimensional feature spaces remains challenging. The key to the solution is a new latent action representation, Residual Latent Action (RLA), which can be easily learned from DINO feature residuals and is predictive, generalizable, and encodes temporal progression. Building on it, the RLA World Model (RLA-WM) predicts RLA values via flow matching, achieving efficient, high-quality future-state prediction that outperforms state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets while being orders of magnitude faster.
Link: https://arxiv.org/abs/2605.07079
Authors: Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She, Abdeslam Boularias
Affiliations: Rutgers University; Purdue University; University of Wisconsin-Madison
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as Residual Latent Action (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose RLA World Model (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: this https URL
[CV-143] Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection
【Quick Read】: This paper addresses the performance drop of AI-generated image detection on unseen generator architectures (e.g., Stable Diffusion 3), which stems from detectors overfitting to generator-specific fingerprints and semantic content rather than learning universal forgery traces. The key to the solution is structural disentanglement of three factors: universal forgery traces, generator-specific fingerprints, and semantic content. The proposed Orthogonal Decomposition and Purification Network (ODP-Net) exploits the physical orthogonality of fingerprints in the frequency domain to project these factors into mutually exclusive subspaces of the feature space, and reinforces generalization through perturbation-based purification and manifold alignment, substantially improving detection on unseen generator architectures.
Link: https://arxiv.org/abs/2605.07074
Authors: Zhiyuan Wang (1), Yanxiang Chen (2), Yuanzhi Yao (2), Yunfeng Diao (2) ((1) Hefei University of Technology, (2) Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education / School of Computer Science and Information Engineering, Hefei University of Technology / Intelligent Interconnected Systems Laboratory of Anhui Province (Hefei University of Technology))
Affiliations: Hefei University of Technology; Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education; School of Computer Science and Information Engineering, Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province (Hefei University of Technology)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ~10 pages (IEEEtran two-column), 6 figures, 6 tables, 1 algorithm
Abstract:Detecting AI-generated images across unseen architectures remains challenging, as existing models often overfit to generator-specific fingerprints and semantic content rather than learning universal forgery traces. We attribute this failure to feature entanglement: detectors learn these factors as a single entangled representation, where universal forgery traces are inextricably confounded with both generator-specific fingerprints and semantic content. Crucially, our spectral analysis reveals that this entanglement is avoidable: distinct generator-specific fingerprints (e.g., GAN stripes vs. Diffusion Model spots) occupy disjoint frequency subspaces and coexist as independent superpositions. Leveraging this physical orthogonality, we propose the Orthogonal Decomposition and Purification Network (ODP-Net) to structurally disentangle these factors. Specifically, ODP-Net employs (1) Instance-aware Orthogonal Decomposition to project features into mutually exclusive subspaces: universal forgery traces, generator-specific fingerprints, and semantic content; (2) Perturbation-based Purification to enforce semantic invariance via cross-sample feature injection; and (3) Manifold Alignment to bridge domain gaps. By explicitly decoupling universal forgery traces from generator-specific fingerprints and semantic content, ODP-Net achieves state-of-the-art performance on unseen architectures (e.g., Stable Diffusion 3), validating that structural disentanglement is key to generalization.
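A mutual-exclusivity constraint of the kind the orthogonal decomposition implies can be sketched as a Frobenius penalty on cross-Gram matrices between projection heads. This is an illustrative guess at the loss form, not ODP-Net's actual objective; head shapes and names are hypothetical.

```python
def matmul_t(a, b):
    """A @ B^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def frob_sq(m):
    """Squared Frobenius norm."""
    return sum(x * x for row in m for x in row)

def orthogonality_penalty(projections):
    """Sum of ||P_a P_b^T||_F^2 over distinct head pairs: zero exactly when
    the row spaces of the projection heads are mutually orthogonal."""
    loss = 0.0
    for a in range(len(projections)):
        for b in range(a + 1, len(projections)):
            loss += frob_sq(matmul_t(projections[a], projections[b]))
    return loss

# Toy one-row "heads" over a 3-dim feature space, standing in for the
# universal-trace, generator-fingerprint, and semantic-content subspaces.
orthogonal  = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]]
overlapping = [[[1.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0]]]
assert orthogonality_penalty(orthogonal) == 0.0
assert orthogonality_penalty(overlapping) > 0.0
```

Driving such a penalty to zero forces the three factors into disjoint subspaces, mirroring the disjoint frequency subspaces the spectral analysis observes.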
[CV-144] Learning to Track Instance from Single Natural Language Description CVPR2026
【Quick Read】: This paper tackles unsupervised vision-language (VL) tracking: following a referred target through a video from a natural-language description alone, without any bounding-box ground truth. The core challenge is aligning and dynamically fusing language and visual features in the absence of supervision. The key to the solution is a novel self-supervised VL tracker, \tracker, which introduces a Dynamic Token Aggregation Module: based on an anchor token, it selects important visual tokens from the template frame and merges these target tokens into the language tokens according to their attention scores, suppressing redundant visual noise and strengthening semantic alignment; the fused language tokens then serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, reinforcing temporal prompts so the model learns instance-level tracking representations from unlabeled videos without large-scale bounding-box annotations, surpassing existing self-supervised methods.
Link: https://arxiv.org/abs/2605.07064
Authors: Yaozong Zheng, Bineng Zhong, Qihua Liang, Shuimu Zeng, Haiying Xia, Shuxiang Song
Affiliations: Guangxi Normal University; University of Southampton
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:How to achieve vision-language (VL) tracking using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \tracker, a novel self-supervised VL tracker that is capable of tracking any referred object by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that \tracker surpasses SOTA self-supervised methods.
[CV-145] Do Joint Audio-Video Generation Models Understand Physics?
【Quick Read】: This paper asks whether joint audio-video generation models genuinely understand audio-visual physical commonsense, or merely produce plausible-looking sights and sounds through statistical association. To evaluate this, the authors propose the AV-Phys Bench benchmark, covering Steady State, Event Transition, and Environment Transition scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent behavior to probe realism violations. The key to the solution is a multi-dimensional evaluation protocol (visual/audio semantic adherence, physical commonsense, and cross-modal physical consistency) together with AV-Phys Agent, a ReAct-style evaluator combining a multimodal language model with deterministic acoustic measurement tools, whose rankings align closely with human ratings. The results identify cross-modal physical consistency and transition-driven scene dynamics as the two key open challenges for joint audio-video generation.
Link: https://arxiv.org/abs/2605.07061
Authors: Zijun Cui, Xiulong Liu, Hao Fang, Mingwei Xu, Jiageng Liu, Zexin Xu, Weiguo Pian, Shijian Deng, Feiyu Du, Chenming Ge, Yapeng Tian
Affiliations: University of Texas at Dallas; University of Washington; University of California, Los Angeles
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Preprint. Full abstract appears in the PDF
Abstract:Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.
[CV-146] Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness
【Quick Read】: This paper addresses the performance degradation, limited generalization, and bias that multimodal medical imaging foundation models (FMs) suffer under real-world missing-organ scenarios. Existing models are typically trained on a single organ and cannot capture coordinated whole-body biological processes, while naive multi-organ joint pretraining induces a dominant-organ shortcut learning bias, where the model over-relies on information-rich organs such as adipose tissue and the heart while neglecting the others. The key to the solution is the Pan-FM framework, whose core innovation is Saliency-Guided Masking (SGM): based on the model's attention distribution, SGM adaptively masks dominant organs during pretraining, forcing the model to learn balanced cross-organ representations and improving joint multi-organ modeling. SGM adds negligible computational overhead and can be seamlessly integrated into existing self-supervised learning paradigms, significantly improving robustness and generalization under missing-organ conditions.
Link: https://arxiv.org/abs/2605.07055
Authors: Qiangqiang Wu, Grace McIlvain, Zhou Yu, Junhao Wen
Affiliations: Columbia University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Foundation models (FMs) have shown great promise in medical imaging, but most FMs are trained on unimodal data within isolated domains, such as brain MRI alone. Human aging and disease arise through coordinated biological processes across organs, therefore motivating multimodal FMs that learn whole-body representations. A key challenge, however, is that real-world multimodal biomedical data are often missing not at random, which can reduce power, limit generalizability, and introduce bias. We propose Pan-FM, a pan-organ foundation model pre-trained on imaging from seven organs (Brain, Heart, Adipose, Liver, Kidney, Spleen, and Pancreas) under realistic missing-organ scenarios. Pan-FM uses a unified backbone that handles organ missingness during both training and inference, and is pre-trained with masking-based self-distillation. We find that naive multimodal pre-training leads to dominant-organ shortcut learning bias, with the model over-relying on dominant organs such as adipose and heart. To address this, we introduce Saliency-Guided Masking (SGM), which uses the model attention distribution to adaptively mask dominant organs during pre-training, thus encouraging more balanced cross-organ, whole-body learning. Notably, SGM introduces negligible computational overhead and can be seamlessly integrated into existing self-supervised learning frameworks to improve multi-organ representation learning. On the UK Biobank, Pan-FM achieves stronger prediction across 13 disease categories and 14 single disease entities than single-organ and multi-organ baselines, with improved robustness under missing-organ settings. Pan-FM serves as a scalable solution to realistic modality-missingness in multimodal learning in system neuroscience and as a step toward more generalizable whole-body FMs.
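The attention-proportional idea behind SGM can be illustrated with a toy sketch. This is a hypothetical simplification: the paper does not publish its exact masking schedule, and the proportional rule, `base_rate`, and organ names below are assumptions.

```python
def sgm_mask_probs(attn_mass, base_rate=0.5):
    """Toy saliency-guided masking schedule: organs that attract more of
    the model's attention are masked more often during pre-training,
    discouraging dominant-organ shortcuts. The attention-proportional
    rule here is an assumption, not the paper's exact formula."""
    total = sum(attn_mass.values())
    return {organ: min(1.0, base_rate * len(attn_mass) * mass / total)
            for organ, mass in attn_mass.items()}

# Dominant organs (adipose, heart) receive the highest mask probability.
attn = {"adipose": 0.4, "heart": 0.3, "brain": 0.2, "liver": 0.1}
probs = sgm_mask_probs(attn)
```

With these toy attention masses the average mask rate stays at `base_rate`, while the adipose channel is masked four times as often as the liver channel.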
[CV-147] Dr-BA: Separable Optimization for Direct Radar Bundle Adjustment Localization
【Quick Read】: This paper targets high-accuracy radar-based state estimation and mapping, and in particular the limitations of conventional methods that rely on sparse point clouds and therefore struggle with dense scene modeling and robust localization. Existing approaches extract sparse point clouds from the radar's range-azimuth-intensity data and perform pose estimation or mapping via point cloud registration, discarding the rich 2D intensity information in radar images and limiting modeling density and accuracy. The key to the solution is Dr-BA (Direct Radar Bundle Adjustment), a novel bundle adjustment framework that operates directly on 2D spinning radar intensity images: by formulating the optimization as a separable problem, it decouples pose estimation from mapping and efficiently estimates dense maps and sensor poses jointly. The framework also extends naturally to direct radar-only localization (DRL), substantially improving cross-session localization, with state-of-the-art results demonstrated on more than 200 km of on-road data across multiple routes.
Link: https://arxiv.org/abs/2605.07041
Authors: Daniil Lisus, Cedric Le Gentil, Timothy D. Barfoot
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at RSS 2026
Abstract:This paper introduces Dr-BA, a first-of-its-kind radar bundle adjustment (BA) framework that operates directly on 2D spinning radar intensity images. Unlike camera or lidar sensors, radar is largely unaffected by precipitation, making it a critical modality for autonomous systems that require all-weather robustness. Existing state estimation approaches using spinning radar typically extract sparse point clouds from range-azimuth-intensity measurements and apply point cloud alignment techniques to estimate vehicle motion, scene structure, or to localize within an existing map. In contrast, Dr-BA uses the full radar returns from multiple scans to jointly estimate dense maps and sensor poses. By formulating the problem as a separable optimization, we derive an efficient and general solution that decouples pose estimation from mapping. In addition to solving the BA problem, this formulation naturally extends to direct radar-only localization (DRL) within a previously built map. Dr-BA achieves state-of-the-art radar-based BA and cross-session localization performance, demonstrated on more than 200 km of on-road data across five distinct routes. Our implementation is publicly available at this https URL.
[CV-148] OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects
【Quick Read】: This paper tackles the challenging problem of 6D object pose estimation given only a single real-world RGB-D reference view (no CAD model), a setting where existing methods scale poorly because they depend on explicit 3D models or multi-view data. The key to the solution is the OneViewAll framework, whose core innovation is a semantic-prior-guided Project-and-Compare paradigm: reference and query observations are aligned directly within a projection-equivariant space, avoiding computationally expensive CAD-based rendering. Three levels of semantic priors are integrated hierarchically: category- and scene-level priors for efficient hypothesis initialization, object-level symmetry priors for geometry completion via mirror fusion, and patch-level priors for discriminative refinement, which together markedly improve robustness and accuracy on symmetric, texture-less, and occluded objects.
Link: https://arxiv.org/abs/2605.07023
Authors: Yang Luo, Yan Gong, Yongsheng Gao, Jie Zhao, Xinyu Zhang, Huaping Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In many practical 6D object pose estimation scenarios, we often have access to only a single real-world RGB-D reference view per object, typically without CAD models. Existing methods largely rely on explicit 3D models or multi-view data, which limits their scalability. To address this challenging single-reference model-free setting, we propose OneViewAll, a semantic-prior-guided framework that performs pose estimation via a novel Project-and-Compare paradigm. Instead of relying on computationally expensive CAD-based rendering, our method directly aligns reference and query observations within a projection-equivariant space. OneViewAll progressively integrates hierarchical semantic priors across three levels: (1) category- and scene-level priors for efficient hypothesis initialization; (2) object-level symmetry priors for geometry completion via mirror fusion; and (3) patch-level priors for discriminative refinement. Extensive experiments demonstrate that OneViewAll achieves 92.5% ADD-0.1 accuracy on the LINEMOD dataset using only one real reference view – significantly outperforming the CVPR 2025 baseline One2Any (52.6%). It also yields consistent improvements on YCB-V, Real275, and Toyota-Light while maintaining low inference latency. Our results underscore the efficacy of symmetry-aware projection in handling symmetric, texture-less, and occluded objects.
[CV-149] LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
【Quick Read】: This paper addresses the accuracy degradation that vision language models (VLMs) suffer when processing text rendered as compressed images: at low rendering resolutions, characters shrink below the vision encoder's effective resolution, making the content hard to recognize. The key to the solution is LensVLM, an inference framework combined with a post-training recipe that uses learned tools to selectively expand only the relevant regions of a compressed image back to their original resolution, achieving substantial compression while preserving accuracy. The method maintains accuracy close to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text-, and visual-compression baselines at up to 10.1x effective compression.
Link: https://arxiv.org/abs/2605.07019
Authors: Roy Xie, Dan Friedman, Donghan Yu, Bowen Pan, Christopher Fifty, Jang-Hyun Kim, Xianzhi Du, Zhe Gan, Vivek Rathod, Bhuwan Dhingra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of visual tokens, varying rendering resolution provides a fine-grained compression knob. However, accuracy deteriorates quickly as compression increases: characters shrink below the vision encoder’s effective resolution, making them indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that enables VLMs to scan compressed images, then selectively expand only the relevant images to their uncompressed form via learned tools. Building on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text- and visual-compression baselines up to 10.1x effective compression across seven text QA benchmarks. LensVLM also generalizes to multimodal document and code understanding tasks, with the accuracy gain over baselines growing as compression increases. Our analysis validates this approach: training makes visual compression robust to rendering choices, and as compression grows the model increasingly relies on expanded content rather than unreliable visual reading. The analysis also yields practical tool-choice guidance: text expansion is preferable for rendered text, while high-resolution image expansion suits native documents whose layout cues carry task-relevant information.
[CV-150] TrajGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations
【Quick Read】: This paper addresses the inadequate treatment of human mobility trajectories in existing geospatial multimodal self-supervised learning (MSSL). Conventional MSSL frameworks are designed for static modality pairs (e.g., satellite imagery, street-view imagery, and text) and rely on aligning observations from the same or nearby locations, so they cannot handle trajectory data that represents continuous movement along paths. The key to the solution is TrajGANR, a trajectory-centric geospatial MSSL framework whose core innovation is learning a continuous neural representation at arbitrary points along each path, enabling fine-grained alignment between trajectories and street-view images even when the images are not co-located with discrete trajectory waypoints. The framework further introduces an objective that jointly aligns three modalities (trajectories, street-view images, and geographic locations) and substantially outperforms existing methods on four urban mobility and road understanding tasks, confirming the importance of fine-grained geospatial alignment and multimodal joint learning.
Link: https://arxiv.org/abs/2605.06990
Authors: Maria Despoina Siampou, Gengchen Mai, Ni Lao, Jinmeng Rao, Neha Arora, Cyrus Shahabi, Shushman Choudhury
Affiliations: Google Research; Google LLC; University of Southern California; The University of Texas at Austin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Multimodal self-supervised learning (MSSL) has emerged as a key paradigm for pretraining geospatial foundation models. However, existing geospatial MSSL methods are mainly designed for static pairs of modalities, such as satellite imagery, street-view imagery, and text, where learning is driven by aligning observations from the same or nearby locations. This assumption breaks down for human mobility trajectories, which represent continuous movement along paths rather than discrete observations at individual locations. Although trajectories are important for urban understanding through their ability to capture human activity across roads, neighborhoods, and places over time, they remain largely underexplored in current geospatial MSSL frameworks. We present TrajGANR, a novel trajectory-centric geospatial MSSL framework that aligns continuous movement patterns with static, location-based observations. TrajGANR learns a continuous neural representation of trajectories at arbitrary points along each path, which enables fine-grained alignment with nearby street-view images, even when they are not co-located with any trajectory waypoints. We leverage this capability to introduce an MSSL objective that jointly aligns three modalities: trajectories, street-view images, and their geographic locations. We evaluate TrajGANR on four urban mobility and road understanding tasks. Across these tasks, TrajGANR consistently outperforms existing geospatial MSSL frameworks and a trajectory-specific foundation model. Ablation studies further demonstrate that our proposed MSSL objective and the multimodal learning framework are the primary drivers of these improvements, highlighting the importance of fine-grained geospatial alignment over coarser aggregation, as well as geospatial multimodal learning.
[CV-151] Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment
【Quick Read】: This paper addresses the limitations of existing evaluation for infrared-visible image fusion (IVIF): current approaches over-optimize hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths, so the evaluation diverges from human visual perception; meanwhile, reward-modeling approaches based on human ratings use scalar regression on aggregated scores, neither exploiting the reasoning abilities of multimodal large language models (MLLMs) nor encoding per-image perceptual ambiguity, and discrete one-hot supervision can wrongly separate fused images of similar quality into different rating levels. The key to the solution is FuScore, whose core innovation is using an MLLM to mimic human visual perception and produce continuous quality scores rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. A soft label is constructed per image from the agreement among four IVIF-specific sub-dimensions, reflecting how consensual the overall judgment is, and a tripartite objective is introduced: per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering. Extensive experiments show the strongest correlation with human visual preferences.
Link: https://arxiv.org/abs/2605.06969
Authors: Yuchen Guo, Junli Gong, Yao Lu, Xintong Xu, Yiuming Cheung, Weifeng Su
Affiliations: Northwestern University; Northeastern University; University of Washington; Hong Kong Baptist University; Beijing Normal - Hong Kong Baptist University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Infrared-Visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches tend to over-optimize both hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, neither leveraging the reasoning of Multimodal Large Language Models (MLLMs) nor encoding per-image perceptual ambiguity in their supervision, but naively introducing MLLMs with discrete one-hot supervision likewise collapses fused images of similar quality into different rating levels. To address this, we introduce FuScore, which utilizes an MLLM to mimic human visual perception by producing continuous quality score, rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective combining per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering across scenes. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.
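The Thurstone fidelity terms above presumably build on Thurstone's classic comparison model (Case V), in which the probability that one image is preferred over another is a Gaussian CDF of their latent score gap. A minimal sketch of that standard model, not the paper's exact loss:

```python
import math

def thurstone_pref(s_i, s_j, sigma=1.0):
    """Thurstone Case V: P(item i is judged better than item j) when
    both latent qualities carry i.i.d. Gaussian noise with std `sigma`.
    Uses the normal CDF expressed via erf."""
    z = (s_i - s_j) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Equal scores give a coin flip; a 2-point score gap gives ~0.92 preference.
p_equal = thurstone_pref(1.0, 1.0)
p_gap = thurstone_pref(2.0, 0.0)
```

A fidelity loss would then compare these model-implied preference probabilities against observed pairwise orderings of fused images.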
[CV-152] XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling
【Quick Read】: This paper addresses reliable object detection on heterogeneous edge devices under strict energy, latency, and memory constraints. Existing energy-aware neural architecture search (NAS) methods typically target limited deployment settings, and real energy consumption is hard to optimize because it is highly device-dependent and costly to measure. The key to the solution is an energy-adaptive framework with three core components: (1) an energy-aware XiResOFA search space supporting flexible architecture exploration; (2) a two-stage energy estimator that markedly improves sample efficiency under few-shot conditions; and (3) an iterative search procedure that identifies a single energy-efficient base architecture, which is then transformed via compound scaling into the XiYOLO family for different deployment budgets, enabling interpretable accuracy-energy tradeoffs under sparse hardware measurements. Experiments show that XiYOLO outperforms YOLO baselines on PascalVOC and COCO, reducing energy by up to 53.7% on GPU and 51.6% on NPU while maintaining strong detection performance.
Link: https://arxiv.org/abs/2605.06927
Authors: Tony Tran, Richie R. Suganda, Bin Hu
Affiliations: University of Houston
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Object detection on heterogeneous edge devices must satisfy strict energy, latency, and memory constraints while still providing reliable perception for downstream autonomy. Existing energy-aware NAS methods often target limited deployment settings, while real energy remains difficult to optimize because it is highly device-dependent and costly to measure. We address these challenges with an energy-adaptive framework that combines an energy-aware XiResOFA search space, a two-stage energy estimator, and iterative search to identify a single energy-efficient base architecture. We then apply compound scaling to transform this base design into the XiYOLO family across deployment budgets, enabling interpretable accuracy-energy tradeoffs under sparse hardware measurements. Experiments on PascalVOC, COCO, and real-device deployment show that XiYOLO achieves a stronger energy-accuracy tradeoff than YOLO baselines. On PascalVOC, the medium XiYOLO model reaches 86.15 mAP50 while reducing energy relative to YOLOv12m by 20.6% on GPU and 35.9% on NPU. On COCO, XiYOLO reduces energy relative to YOLOv12 by up to 53.7% on GPU and 51.6% on NPU at the small scale. The proposed two-stage estimator also improves sample efficiency over a joint predictor under few-shot adaptation with only 2-20 target-device samples.
[CV-153] A²RD: Agentic Autoregressive Diffusion for Long Video Consistency
【Quick Read】: This paper addresses the semantic drift and narrative collapse that plague long video generation, which make it hard to maintain temporal consistency and coherence. The key to the solution is the A²RD architecture, an Agentic Auto-Regressive Diffusion model that decouples creative synthesis from consistency enforcement and introduces a closed-loop Retrieve-Synthesize-Refine-Update cycle to self-improve the video segment by segment. The architecture comprises three core components: Multimodal Video Memory, Adaptive Segment Generation, and Hierarchical Test-Time Self-Improvement, which together suppress error propagation and improve visual consistency and narrative coherence over long horizons.
Link: https://arxiv.org/abs/2605.06924
Authors: Do Xuan Long, Yale Song, Min-Yen Kan, Tomas Pfister, Long T. Le
Affiliations: Google Cloud AI Research; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this http URL
Abstract:Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A²RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A²RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve–Synthesize–Refine–Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A²RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
[CV-154] Advancing Reliable Synthetic Video Detection: Insights from the SAFE Challenge
【Quick Read】: This paper addresses the need, amid rapid advances in generative video technologies, for reliable detection and characterization of synthetic media. The key to the solution is the SAFE: Synthetic Video Detection Challenge, an international competition built on a 6,000-sample dataset (20 hours in total) covering 13 high-quality synthetic video models and 21 diverse real-video sources, evaluated on the Hugging Face platform across two tasks: distinguishing synthetic videos produced by different generators, and detecting synthetic content after common post-processing operations (resizing, re-compression, motion blur, and others). The challenge design drove measurable progress in cross-generator generalization of current synthetic video detection methods, while also revealing persistent vulnerability to post-processing artifacts, providing clear directions and reference benchmarks for future research.
Link: https://arxiv.org/abs/2605.06912
Authors: Kirill Trapeznikov, Gabriel Mancino-Ball, Jonathan Li, Paul Cummer, Jai Aslam, Danial Samadi Vahdati, Tai Nguyen, Matthew C. Stamm, Peter Bautista, Michael Davinroy, Laura Cassani, Jill Crisman
Affiliations: STR; Drexel University; Aptima, Inc.; UL Research Institutes
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The proliferation of generative video technologies has intensified the need for reliable methods to detect and characterize synthetic media. To address this challenge, we organized the SAFE: Synthetic Video Detection Challenge (this https URL), co-located with the Authenticity and Provenance in the Age of Generative AI (APAI) Workshop at ICCV 2025. The competition invited participants to develop and evaluate algorithms capable of distinguishing real from synthetic videos under fully blind evaluation conditions, with over 600 submissions from 12 teams over a 90-day span. Hosted on the Hugging Face platform, the challenge comprised two primary tasks: (1) detection of synthetic video content generated by diverse state-of-the-art models, and (2) detection of synthetic content following common post-processing operations such as resizing, re-compression, motion blur, and others. The challenge data consisted of 13 modern high-quality synthetic video models with generated content matched to real videos from 21 diverse and challenging sources, all adding up to 20 hours of 6,000 video samples. This paper describes the challenge design, dataset construction, evaluation methodology, and outcomes, offering insights into the generalization and robustness of contemporary synthetic video detection methods. Our findings highlight measurable progress in cross-generator generalization but also persistent vulnerabilities to post-processing artifacts. this https URL
[CV-155] Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation
【Quick Read】: This paper addresses the excessive computational cost of diffusion models for video generation, specifically that Transformer-based diffusion models (DiTs) apply the same number of denoising steps to every spatiotemporal token at inference. The core observation is that while human vision ignores vast amounts of redundant motion, dense models treat each spatiotemporal token with equal priority, wasting computation. The key to the solution is Heterogeneous Step Allocation (HSA), a training-free inference algorithm that assigns different denoising step budgets to tokens based on their velocity, introduces a KV-cache synchronization mechanism to maintain global context consistency, and derives a cached Euler update that advances the latent states of skipped tokens in a single operation without extra model calls. HSA markedly improves acceleration efficiency, preserving high generation quality even under aggressively compressed compute budgets (e.g., 50% and 25% runtimes) and outperforming existing caching strategies and baselines.
Link: https://arxiv.org/abs/2605.06892
Authors: Ernie Chu, Vishal M. Patel
Affiliations: Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Diffusion Transformers (DiTs) have achieved state-of-the-art video generation quality, but they incur immense computational cost because standard inference applies the same number of denoising steps uniformly to every token in the sequence. It is well known that human vision ignores vast amounts of redundant motion. Why, then, do our densest models treat every spatiotemporal token with equal priority? In this paper, we introduce Heterogeneous Step Allocation (HSA), a training-free inference algorithm that assigns varying step budgets to different spatiotemporal tokens based on their velocity dynamics. To resolve the resulting sequence-length mismatch without sacrificing global context, HSA introduces a KV-cache synchronization mechanism that allows active tokens to attend to the full sequence while entirely bypassing inactive tokens. Furthermore, we derive a cached Euler update that advances the latent states of skipped tokens in a single operation without additional model evaluations. We evaluate HSA on the Wan-2 and LTX-2 models for both text-to-video (T2V) and image-to-video (I2V) generation. Our results demonstrate that HSA significantly outperforms previous state-of-the-art caching methods and the vanilla Flow Matching baseline, especially at aggressive acceleration regimes (e.g., 50% and 25% runtimes). Crucially, HSA achieves a superior quality-runtime Pareto frontier without the need for expensive offline profiling, robustly preserving structural integrity and generation quality even under tight computational budgets. Project page: this https URL
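The cached Euler update can be sketched in a few lines. This is a hypothetical illustration under the assumption of plain Euler integration of the flow using each skipped token's last cached velocity; it is not the Wan/LTX implementation.

```python
def cached_euler_update(latents, velocity_cache, t_from, t_to):
    """Advance the latent states of skipped (inactive) tokens from
    t_from to t_to in one Euler step, reusing cached velocities
    instead of re-evaluating the diffusion transformer."""
    dt = t_to - t_from
    return [x + dt * v for x, v in zip(latents, velocity_cache)]

# Two skipped tokens advanced by a quarter of the flow time in one shot.
out = cached_euler_update([1.0, -0.5], [2.0, 4.0], t_from=0.5, t_to=0.75)
```

Because the update is a single closed-form step per token, skipped tokens stay synchronized in time with active tokens at no extra model cost.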
[CV-156] Towards Fairness under Label Bias in Image Segmentation: Impact Measurement and Mitigation
【Quick Read】: This paper addresses label bias introduced by annotation pipelines in image segmentation, i.e., group-conditional label errors that cause systematic performance disparities across demographic subgroups. Such bias is underexplored in segmentation, largely because detecting it typically requires clean, unbiased annotations that are rarely available. The key to the solution is a data-centric adaptation of Confident Learning that detects label bias directly in the training data without clean ground-truth labels: by comparing the model's confident predictions with the provided labels, it isolates directional errors that quantify the presence and nature of the bias, which standard overlap metrics such as Dice cannot do. The authors further show that label bias affects subgroup separability in the encoder's feature space and leverage this artifact for bias mitigation rather than suppressing it. Evaluation on three benchmark datasets spanning synthetic to real-world bias confirms that the framework reliably detects and mitigates bias without access to clean labels, achieving equitable performance across experimental conditions.
Link: https://arxiv.org/abs/2605.06891
Authors: Aditya Parikh, Stella Frank, Sneha Das, Aasa Feragen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Labeled datasets reflect the biases of their annotation pipelines, which sometimes introduce label bias: group-conditional label errors that cause systematic performance disparities across demographic subgroups. Label bias in image segmentation remains underexplored, as even detecting it typically requires clean, unbiased annotations, which are not readily available. We present a data-centric adaptation of Confident Learning to segmentation, allowing detection of label bias directly in the training data without a clean, unbiased ground truth. By comparing the provided training labels to the model’s confident predictions, we isolate directional errors that quantify the presence and nature of bias, where standard overlap metrics like Dice fail. We further show that label bias influences subgroup separability in the encoder’s feature space, an artifact we leverage for bias mitigation rather than suppressing it. We evaluate three datasets, spanning from synthetic to real-life bias, showing how our framework reliably detects and mitigates bias without access to clean labels, achieving equitable performance across experimental conditions.
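The core comparison — provided labels versus confident model predictions — can be illustrated with a toy binary-segmentation sketch. This is a hypothetical simplification: the paper adapts full Confident Learning, and the 0.9 confidence threshold and function name below are assumptions.

```python
def flag_directional_errors(probs, labels, threshold=0.9):
    """Flag pixels whose provided label contradicts a confident prediction.

    probs:  per-pixel foreground probabilities from a trained model
    labels: provided (possibly biased) binary annotations
    Returns counts of confidently contradicted labels in each direction;
    an imbalance between the two signals *directional* label bias, which
    symmetric overlap metrics like Dice cannot reveal.
    """
    over = under = 0
    for p, y in zip(probs, labels):
        if y == 1 and p < 1 - threshold:   # labelled FG, confidently BG
            over += 1                       # annotation over-segments
        elif y == 0 and p > threshold:      # labelled BG, confidently FG
            under += 1                      # annotation under-segments
    return over, under

probs = [0.98, 0.95, 0.02, 0.5, 0.97]
labels = [1, 0, 1, 1, 1]
over, under = flag_directional_errors(probs, labels)
```

Computing these counts per demographic subgroup would then expose group-conditional error patterns without any clean reference annotation.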
[CV-157] TriDE: Triangle-Consistent Translation Directions for Global Camera Pose Estimation
【Quick Read】: This paper addresses error accumulation in camera location estimation for global structure-from-motion caused by inconsistent pairwise relative direction estimates. Existing methods process each image pair independently, yielding relative directions that are locally plausible but not guaranteed to be consistent across the viewing graph. The key to the solution is TriDE, whose core innovation is exploiting camera-triangle consistency as an efficient higher-order verification signal: unreliable pairwise directions are iteratively refined via message passing between directions and their incident weighted triangles. This strategy avoids costly, initialization-sensitive global nonlinear optimization, admits a strong phase-transition bound for exact direction recovery under a realistic random corruption model, and on real image graphs substantially improves direction accuracy and downstream camera pose estimation.
Link: https://arxiv.org/abs/2605.06889
Authors: Francisco Chen, Yiran Wang, Yunpeng Shi
Affiliations: University of California, Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 6 figures
Abstract:Pairwise translation directions are a key input to camera location estimation in global structure-from-motion. Existing estimators usually process each image pair independently, producing directions that may be locally plausible but inconsistent with the other relative directions in the viewing graph. To jointly estimate the direction, we propose TriDE, which exploits camera-triangle consistency as an efficient higher-order verification signal. Instead of solving a costly global nonlinear optimization problem that is sensitive to initialization, TriDE refines unreliable pairwise directions through message passing between directions and their incident weighted triangles. This information propagation strategy enables us to establish a strong phase-transition bound for exact recovery under a realistic random corruption model. Experiments on real image graphs show that TriDE improves direction accuracy by a large margin and yields better downstream camera locations, providing a practical link between local pairwise estimation and global camera pose geometry.
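A simple necessary condition behind camera-triangle consistency: the three pairwise translation directions of a camera triangle are coplanar, so their scalar triple product vanishes. The sketch below illustrates only that check; TriDE's weighted message passing is considerably more elaborate.

```python
def triangle_residual(t_ij, t_jk, t_ki):
    """Scalar triple product of the three pairwise directions.
    For noise-free directions of a camera triangle the three vectors
    are coplanar, so the residual is zero; a large value flags a
    corrupted edge. (An illustrative necessary condition, not TriDE's
    full verification rule.)"""
    cx = t_jk[1] * t_ki[2] - t_jk[2] * t_ki[1]
    cy = t_jk[2] * t_ki[0] - t_jk[0] * t_ki[2]
    cz = t_jk[0] * t_ki[1] - t_jk[1] * t_ki[0]
    return t_ij[0] * cx + t_ij[1] * cy + t_ij[2] * cz

# Cameras at x_i=(0,0,0), x_j=(1,0,0), x_k=(0,1,0): consistent triangle.
t_ij = (1.0, 0.0, 0.0)
t_jk = (-0.70710678, 0.70710678, 0.0)   # (x_k - x_j), normalised
t_ki = (0.0, -1.0, 0.0)
r_good = triangle_residual(t_ij, t_jk, t_ki)
r_bad = triangle_residual((0.0, 0.0, 1.0), t_jk, t_ki)  # corrupted edge
```

Aggregating such residuals over all triangles incident to an edge gives the kind of higher-order evidence the abstract describes.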
[CV-158] AdpSplit: Error-Driven Adaptive Splitting for Faster Geometry Discovery in 3D Gaussian Splatting
【Quick Read】: This paper addresses the inefficiency of density control in 3D Gaussian Splatting (3DGS) caused by fixed-cardinality random splitting: the conventional binary split operator requires many densification rounds to expose fine scene structure, making it a key bottleneck for reducing training time. The key to the solution is AdpSplit, an error-driven adaptive split operator that determines the number of split children and initializes their parameters from L1-pixel-error region statistics, substantially reducing densification iterations and training time while preserving the rendering quality of full-schedule training. Experiments show that, as a drop-in replacement for the standard split operator, AdpSplit reduces the training time of accelerated 3DGS pipelines by 9.2%-22.3% across multiple datasets, and with FastGS it matches the full-schedule PSNR on MipNeRF360 while cutting training time by 16.4%, a 12.6x speedup over vanilla 3DGS.
Link: https://arxiv.org/abs/2605.06876
Authors: Yongjae Lee, Jingxing Li, Abhay Kumar Yadav, Rama Chellappa, Deliang Fan
Affiliations: Arizona State University; Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Adaptive density control in 3D Gaussian Splatting (3DGS) repeatedly grows the Gaussian population through fixed-cardinality random splitting to discover useful scene structure. However, in vanilla 3DGS, its binary split operator requires many densification rounds to expose fine details, making it a bottleneck for efficient training schedules with fewer iterations. We introduce AdpSplit, an error-driven adaptive split operator that determines the number of split children and initializes the child parameters from L1-pixel-error region statistics, enabling fewer densification iterations, thus reduced training time, while preserving the rendering quality of full-schedule training. Across the MipNeRF360, Deep-Blending, and Tanks&Temples datasets, AdpSplit reduces the training time of multiple accelerated 3DGS pipelines by 9.2%-22.3% as a simple drop-in replacement for the standard split operator. With FastGS, AdpSplit matches the full-schedule PSNR on MipNeRF360 while reducing training time by 16.4%, corresponding to a 12.6x acceleration over vanilla 3DGS.
[CV-159] EULER-ADAS: Energy-Efficient SIMD-Unified Logarithmic-Posit Engine for Precision-Reconfigurable Approximate ADAS Acceleration
【Quick Read】: This paper addresses the strict latency, power, and reliability constraints on neural network inference in advanced driver-assistance systems (ADAS), where conventional floating-point arithmetic is inefficient under tight resource budgets. The key to the solution is EULER-ADAS, a SIMD-enabled neural compute engine based on a bounded-regime Posit representation, with three core innovations: a bounded-regime Posit data format that reduces encode/decode overhead and mitigates regime-field fault effects; stage-adaptive logarithmic mantissa multiplication with bit truncation for a better accuracy-efficiency balance; and a shared quire accumulation path supporting multi-precision parallel execution of Posit-(8,0), Posit-(16,1), and Posit-(32,2), enabling hardware reuse. FPGA validation shows up to 71.9% lower power and 41.4% less area (LUT count) than exact Posit engines, while accuracy stays within about 1.5 percentage points of FP32 on image classification, ADAS, and edge-inference tasks, demonstrating suitability for low-power real-time ADAS inference.
Link: https://arxiv.org/abs/2605.06875
Authors: Mukul Lokhande, Ratko Pilipovic, Omkar Kokane, Adam Teman, Santosh Kumar Vishvakarma
Affiliations: Indian Institute of Technology Indore; University of Ljubljana; Bar-Ilan Univ.
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Numerical Analysis (math.NA)
Comments:
Abstract:Advanced driver-assistance systems (ADAS) require neural compute engines that deliver low-latency inference under strict power and area constraints. Posit arithmetic is attractive for such accelerators because it provides high numerical fidelity at low precision, but its variable-length regime encoding increases encode/decode cost and exposes the datapath to large regime-field fault effects. This paper presents EULER-ADAS, a SIMD-enabled logarithmic bounded-Posit neural compute engine for energy-efficient and reliability-aware ADAS acceleration. The proposed datapath combines bounded-regime Posit representation, stage-adaptive logarithmic mantissa multiplication with bit truncation, and a SIMD-shared quire accumulation path supporting Posit-(8,0), Posit-(16,1), and Posit-(32,2) execution. The unified architecture enables 4xPosit-8, 2xPosit-16, or 1xPosit-32 operation without duplicating precision-specific hardware. FPGA implementation shows that the proposed configurations reduce LUT count by up to 41.4%, delay by up to 76.1%, and power by up to 71.9% relative to exact Posit neural compute engines, while achieving up to 10x lower energy-delay product than radix-4 Booth-based Posit multipliers. In 28-nm CMOS, the bounded variants occupy 0.013-0.016 mm2, consume 19.8-22.1 mW, and operate at up to 1.84 GHz. Application-level evaluation across image-classification, ADAS, and edge-inference workloads shows that the evaluated Posit-16 and Posit-32 configurations remain within about 1.5 percentage points of FP32 accuracy. A TinyYOLOv3 prototype on Pynq-Z2 achieves 78 ms latency at 0.29 W and 22.6 mJ/frame, demonstrating the suitability of EULER-ADAS for low-power real-time ADAS inference.
[CV-160] Knowledge Transfer Scaling Laws for 3D Medical Imaging
【Quick Read】: This paper addresses the lack of principled guidance for data mixing in unified pretraining across multimodal 3D medical imaging (e.g., CT, MRI, and PET), i.e., how to allocate data across imaging domains to maximize model performance. The key to the solution is the finding that different medical imaging domains follow heterogeneous scaling laws during pretraining and that knowledge transfer between domains is strongly asymmetric. Building on this, data allocation is formulated as a scaling-law optimization problem with an interpretable hub-and-island structure: highly transferable domains act as hubs that deserve strategic allocation to lift overall performance, while isolated domains (islands) require direct data investment. Experiments show that transfer-aware allocation outperforms data-proportional sampling by up to 58% on downstream tasks and generalizes well to unseen budgets (correlation r=0.989).
Link: https://arxiv.org/abs/2605.06859
Authors: Ho Hin Lee, Dongna Du, Chu Wang, Yuankai Huo, Shi Gu, James C. Gee, Yifan Wu
Affiliations: Vanderbilt University; Zhejiang University; McGill University; University of Pennsylvania
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages
Abstract:Vision foundation models are increasingly moving beyond 2D to volumetric domains such as 3D medical imaging, where unified pretraining across different imaging modalities (i.e. CT, MRI, and PET) could provide foundational models for diverse clinical tasks. However, training such models requires mixing heterogeneous imaging domains, and current mixture strategies remain largely heuristic. In this work, we observe that different medical imaging domains scale at variable rates during pretraining, and knowledge transfer between domains is strongly asymmetric: training on one domain can substantially improve another, but the reverse may be much weaker. Interestingly, both MAE reconstruction loss and cross-domain transfer follow predictable power-law trends with domain-specific behaviors. Motivated by these findings, we formulate data allocation as a scaling-law optimization problem. The derived allocations reveal an interpretable hub-and-island structure: highly transferable domains emerge as hubs that benefit many others and deserve strategic allocation, while isolated domains act as islands requiring direct investment. Empirically, transfer-aware allocation outperforms data-proportional sampling by up to 58% and generalizes well to unseen budgets with r=0.989. Downstream validation on disease classification and organ/lesion segmentation further confirms that the derived transfer-aware mixtures provide stronger pretrained representations for clinical 3D medical imaging tasks.
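Power-law trends like those described above can be recovered with an ordinary log-log least-squares fit. A self-contained sketch — the functional form L(n) = a·n^(−b) and the synthetic numbers are illustrative, not the paper's fitted constants:

```python
import math

def fit_power_law(ns, losses):
    """Fit L(n) ≈ a * n**(-b) by linear least squares in log-log space."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic domain: loss halves for every 8x more pretraining data (b = 1/3).
ns = [1e3, 8e3, 64e3]
losses = [2.0, 1.0, 0.5]
a, b = fit_power_law(ns, losses)
```

Fitting such a curve per (source domain, target domain) pair yields the transfer coefficients that the allocation optimization can then trade off against each other.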
[CV-161] A Unified Measure-Theoretic View of Diffusion Score-Based and Flow Matching Generative Models
【速读】:该论文旨在解决当前连续时间生成建模方法(如扩散模型、基于得分的生成模型和流匹配)在理论框架上碎片化的问题,这些方法虽在实践中逐渐趋同,但因符号体系不统一、推导路径各异,导致其共享结构与实际权衡(如采样效率、稳定性及计算复杂度)难以清晰识别。解决方案的关键在于提出一个统一的理论框架:将上述方法视为学习一个随时间变化的向量场(time-dependent vector field),该向量场驱动一组满足连续性方程(continuity equation)和福克-普朗克方程(Fokker-Planck equation)的边缘分布族 (ρt)t∈[0,1]。在此框架下,作者系统地揭示了反向时间采样可表示为受控随机动力学、概率流常微分方程(probability flow ODE)等价于对数似然优化的归一化流(normalizing flows),并明确指出流匹配本质上是对选定插值路径下速度场的直接回归,从而厘清其与基于得分训练的关系。这一统一视角有助于深入理解不同方法的本质联系与差异,并为未来理论分析(如逼近误差、稳定性与可扩展性)提供基础。
链接: https://arxiv.org/abs/2605.06829
作者: Aditya Ranganath,Mukesh Singhal
机构: Lawrence Livermore National Laboratory (劳伦斯利弗莫尔国家实验室); University of California, Merced (加州大学默塞德分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
备注: 62 pages, 1 figure, jmlr preprint
Abstract:We survey continuous-time generative modeling methods based on transporting a simple reference distribution to a data distribution via stochastic or deterministic dynamics. We present a unified framework in which diffusion models, score-based generative models, and flow matching are instances of learning a time-dependent vector field that induces a family of marginals (\rho_t)_{t\in[0,1]} governed by continuity and Fokker-Planck equations. Such a unified theory is timely because these methods are converging methodologically, yet fragmented notation and competing derivations continue to obscure their shared structure and the practical tradeoffs governing sampling, stability, and computation. Within this framework, we (i) derive reverse-time sampling for diffusion and score-based models as controlled stochastic dynamics, (ii) show that the probability flow ODE yields identical marginals and connects diffusion to likelihood-based normalizing flows, and (iii) interpret flow matching as direct regression of the velocity field under a chosen interpolation, clarifying when it coincides with or differs from score-based training. We compare objectives, sampling schemes, and discretization errors under unified notation, discuss connections to Schrödinger bridges and entropic optimal transport, and summarize theoretical guarantees and open problems on approximation, stability, and scalability.
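摘要第 (iii) 点指出流匹配可视为对速度场的直接回归。下面用一维玩具示例示意两个核心构件(本文添加的简化示意,并非该综述的代码):线性插值路径下的回归目标 v = x1 − x0,以及对学得速度场的 Euler 积分(对应概率流 ODE 式的确定性采样)。

```python
def flow_matching_example(x0, x1, t):
    """For the linear path x_t = (1-t)*x0 + t*x1, the flow-matching
    regression target at (x_t, t) is simply v = x1 - x0."""
    xt = (1 - t) * x0 + t * x1
    return xt, x1 - x0

def euler_sample(v, x0, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 (probability-flow style)."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x = x + dt * v(x, k * dt)
    return x

# Under a constant field v(x, t) = 1, a sample starting at 0 is transported
# to (approximately) 1 regardless of the number of Euler steps.
print(euler_sample(lambda x, t: 1.0, 0.0, 10))
```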
[CV-162] Uneven Evolution of Cognition Across Generations of Generative AI Models
【速读】:该论文旨在解决如何系统评估生成式 AI(Generative AI)的认知能力问题,特别是在超越特定任务性能的基础上,识别其认知架构的不平衡性并追踪其发展轨迹。解决方案的关键在于引入心理测量学(psychometric)框架,将生成式 AI 的表现与人类认知标准进行比较,并开发了人工智能智商(Artificial Intelligence Quotient, AIQ)基准来量化多代模型在不同认知维度上的演变。通过这一方法,研究发现当前主流多模态模型在语言类任务上表现出接近人类天花板的水平,但在视觉感知推理方面仍处于极低水平,揭示出模型在抽象符号处理与视觉组织能力之间存在显著不对称的发展路径,从而指出单纯依赖规模扩展和优化策略难以实现均衡的人类级通用智能(Artificial General Intelligence, AGI)。
链接: https://arxiv.org/abs/2605.06815
作者: Isaac Galatzer-Levy,Daniel McDuff,Xin Liu,Jed McGiffin
机构: Google DeepMind(谷歌深度思维); Google Research(谷歌研究); University of Washington(华盛顿大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 5 Figures, 3 Tables
Abstract:The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted from the Wechsler Adult Intelligence Scale revealed a profoundly uneven cognitive architecture: near-ceiling performance in verbal comprehension and working memory (98th percentile) contrasted with near-floor performance in perceptual reasoning (1st percentile). To track developmental trajectories beyond human-normed limits, we developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations and two model families, revealing significant but asymmetric performance gains. Notably, we uncovered a sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format, indicating an architectural bias towards language-based symbolic manipulation. While abstract visual reasoning improved, visual-perceptual organization remained largely stagnant. Collectively, these findings demonstrate that the cognitive abilities of generative models are evolving unevenly, suggesting that scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations in achieving balanced, human-like general intelligence.
[CV-163] LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
【速读】:该论文旨在解决视频识别中Transformer模型因处理所有视频token而导致的超线性计算成本过高问题,尽管视频本身存在大量冗余信息。其解决方案的核心在于提出一种名为LookWhen的选择器-提取器(selector-extractor)框架,将视频识别任务分解为“何时、何地、以及如何计算”的学习过程:浅层选择器对缩放后的视频快速评分所有时空token,深层提取器仅处理Top-K选中的token以近似完整视频表征,从而显著降低计算开销。关键创新在于设计了两种预训练策略——通过基于最近邻距离的表示独特性得分实现选择预训练,以及通过蒸馏视频和图像教师模型来学习帧间变化特征以优化提取预训练,使模型在保持高精度的同时实现更高效的计算效率,在多个基准数据集上均优于同类高效模型与升级版基线。
链接: https://arxiv.org/abs/2605.06809
作者: Ali Salamatian,Anthony Fuller,Pritam Sarkar,James R. Green,Leonid Sigal,Evan Shelhamer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Transformers dominate video recognition. They split videos into tokens, and processing them has expensive superlinear computational cost. Yet videos are filled with redundancy, so we can question the need for this expense. We introduce LookWhen, a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Our shallow selector gets a scaled-down video and quickly scores all tokens across space-time, while our deep extractor gets the top-K selected tokens to approximate full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image teacher, for which we normalize its frame-wise representations to learn what changes within videos. Through these strategies, our selector-extractor learns general and efficient representations for feature extraction or fine-tuning to a task. Through experiments on Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, we show that LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks x 2 settings) and roughly matches on 3. In accuracy-throughput, measuring time in practice, LookWhen is more efficient still at 6.7x faster than InternVideo2-B at equal accuracy.
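选择预训练中基于最近邻距离的“独特性得分”可用如下极简示意理解(本文添加的假设性示例,用一维标量代替真实的时空 token 表征):近似重复的 token 得分低,独特的 token 得分高,取 Top-K 即为选择器的输出。

```python
def uniqueness_scores(tokens):
    # Score each token by its distance to the nearest *other* token
    # (1-D features with absolute distance stand in for embedding distances).
    return [min(abs(t - u) for j, u in enumerate(tokens) if j != i)
            for i, t in enumerate(tokens)]

def select_top_k(tokens, k):
    scores = uniqueness_scores(tokens)
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])   # indices of the k most "unique" tokens

# Near-duplicate tokens score low; the outlier (index 3) scores high.
tokens = [0.0, 0.01, 0.02, 5.0, 0.03]
print(select_top_k(tokens, 2))
```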
[CV-164] Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
【速读】:该论文旨在解决视觉网络代理(visual web agents)在训练数据规模和多样性上的局限性问题,现有方法受限于离线轨迹或少量模拟环境,难以捕捉真实网络的复杂性和动态变化。解决方案的关键在于提出Weblica(Web Replica)框架,其核心创新包括:1)基于HTTP级缓存实现稳定视觉状态的捕获与重放,同时保留交互行为;2)利用大语言模型(LLM)生成基于真实网站和核心网页导航技能的环境合成机制,从而构建可复现且可扩展的网络环境。该框架使强化学习(RL)训练能够扩展至数千个多样化环境与任务,显著提升模型性能与泛化能力。
链接: https://arxiv.org/abs/2605.06761
作者: Oğuzhan Fatih Kar,Roman Bachmann,Yuanzheng Gong,Anders Boesen Lindbo Larsen,Afshin Dehghan
机构: Apple(苹果)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 28 pages, 19 figures
Abstract:The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.
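HTTP 级缓存“捕获-重放”的思路可用一个玩具类示意(本文添加的假设性草图,ReplayCache 及其接口均为虚构,并非 Weblica 的实际实现):请求按 (method, url, body) 作键,首次访问时记录响应,之后确定性重放,从而冻结视觉状态同时保留交互行为。

```python
class ReplayCache:
    """Toy HTTP-level record/replay cache: a deterministic stand-in for the
    capture-and-replay idea described in the abstract."""
    def __init__(self):
        self.store = {}

    def record(self, method, url, body, response):
        self.store[(method, url, body)] = response

    def fetch(self, method, url, body, live_fetch=None):
        key = (method, url, body)
        if key in self.store:
            return self.store[key]        # replay: stable, reproducible state
        if live_fetch is None:
            raise KeyError("uncached request in replay mode")
        self.store[key] = live_fetch()    # record on first sight
        return self.store[key]

cache = ReplayCache()
cache.record("GET", "https://example.com", None, "<html>v1</html>")
print(cache.fetch("GET", "https://example.com", None))
```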
[CV-165] R3L: Reasoning 3D Layouts from Relative Spatial Relations ICML2026
【速读】:该论文旨在解决3D布局生成中相对空间推理的可靠性与一致性问题,特别是由于多跳推理过程中参考帧变换累积误差导致的语义和度量漂移(semantic and metric drift)。其解决方案的关键在于提出R³L框架,包含三个核心机制:1)不变空间分解(invariant spatial decomposition),用于打破耦合的关系链以减少误差传播;2)一致空间想象(consistent spatial imagination),通过“想象-修正”循环提升自一致性;3)支持性空间优化(supportive spatial optimization),借助全局到局部坐标重参数化简化位姿优化。实验表明,该方法显著提升了布局的物理可行性和语义一致性,尤其验证了消除帧诱导不一致对可靠多跳相对空间推理的重要性。
链接: https://arxiv.org/abs/2605.06758
作者: Zhifeng Gu,Yuqi Wang,Bing Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: ICML 2026
Abstract:Relative spatial relations provide a compact representation of spatial structure and are fundamental to relative spatial reasoning in 3D layout generation. Recent works leverage Multimodal Large Language Models (MLLMs) to infer such relations, but the inferred relations are often unreliable and are typically handled with post-hoc heuristics. In this paper, we propose R³L, a general framework that improves the reliability and consistency of relative spatial reasoning for 3D layout generation. Our key motivation is that multi-hop reasoning requires repeated reference-frame transformations, which accumulate errors in inferred relations and lead to semantic and metric drift. To mitigate this, we propose invariant spatial decomposition to break coupled relation chains, and consistent spatial imagination to promote self-consistency through an imagine-and-revise loop. We further introduce supportive spatial optimization to ease pose optimization via global-to-local coordinate re-parameterization. Extensive experiments across diverse scene types and instructions demonstrate that R³L produces more physically feasible and semantically consistent layouts. Notably, our analysis shows that resolving frame-induced inconsistencies is crucial for reliable multi-hop relative spatial reasoning. The code is available at this https URL.
[CV-166] HumanNet: Scaling Human-centric Video Learning to One Million Hours
【速读】:该论文旨在解决具身智能(embodied intelligence)发展中因缺乏大规模、多样化且标注丰富的真人活动数据而导致的物理交互学习受限问题。解决方案的关键在于构建HumanNet——一个包含百万小时人类中心视频的语料库,其核心创新在于将人类视角(第一人称与第三人称)的视频数据、细粒度动作标注、人-物交互信息、工具使用及长时序行为等多模态信号进行系统化整合,并以“人类中心过滤”、“时间结构化”、“视点多样性”和“标注增强”作为设计原则,从而将非结构化的互联网视频转化为可扩展的表示学习与具身任务迁移的基础数据资源。实验证明,仅用1000小时的第一人称人类视频微调Qwen视觉语言模型,性能即超越使用100小时真实机器人数据的微调结果,验证了人类视频在具身基础模型训练中的高性价比与有效性。
链接: https://arxiv.org/abs/2605.06747
作者: Yufan Deng,Daquan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Github: this https URL Project website: this https URL
Abstract:Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. Beyond raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning, where human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. We conduct a first-step validation on the value of this design through controlled vision-language-action ablation: under a fixed set of validation data, continued training from the Qwen VLM model with 1000 hours of egocentric video drawn from HumanNet surpasses the continued training with 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. By building this project, we aim to explore the opportunity to scale embodied foundation models using human-centric videos, rather than relying solely on robot-specific data.
[CV-167] Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
【速读】:该论文旨在解决如何将深度学习模型高效部署到边缘设备上以实现实时决策的问题,尤其是在计算机视觉特别是医疗诊断领域的应用挑战。其解决方案的关键在于提出了一种基于性能与使用场景的边缘硬件平台新分类方法,并系统性地综述了轻量化设计与模型压缩等关键技术,从而提升模型在资源受限边缘设备上的运行效率和实用性,推动深度学习与边缘计算的深度融合,助力智能边缘深度学习解决方案的实际落地与未来发展。
链接: https://arxiv.org/abs/2605.06714
作者: Yiwen Xu,Tariq M. Khan,Yang Song,Erik Meijering
机构: University of New South Wales (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Edge deep learning, a paradigm change reconciling edge computing and deep learning, facilitates real-time decision making attuned to environmental factors through the close integration of computational resources and data sources. Here we provide a comprehensive review of the current state of the art in edge deep learning, focusing on computer vision applications, in particular medical diagnostics. An overview of the foundational principles and technical advantages of edge deep learning is presented, emphasising the capacity of this technology to revolutionise a wide range of domains. Furthermore, we present a novel categorisation of edge hardware platforms based on performance and usage scenarios, facilitating platform selection and operational effectiveness. Following this, we dive into approaches to effectively implement deep neural networks on edge devices, encompassing methods such as lightweight design and model compression. Reviewing practical applications in the fields of computer vision in general and medical diagnostics in particular, we demonstrate the profound impact edge-deployed deep learning models can have in real-life situations. Finally, we provide an analysis of potential future directions and obstacles to the adoption of edge deep learning, with the intention to stimulate further investigations and advancements of intelligent edge deep learning solutions. This survey provides researchers and practitioners with a comprehensive reference shedding light on the critical role deep learning plays in the advancement of edge computing applications.
[CV-168] Visual Text Compression as Measure Transport
【速读】:该论文旨在解决视觉文本压缩(Visual Text Compression, VTC)中一个核心问题:尽管VTC能显著减少解码器标记数(通常为子词分词的3–20倍),但这种压缩效率并不能稳定转化为下游任务性能提升,即压缩比本身无法预测其在具体任务中的有效性。根本原因在于缺乏对任务相关的信息损失进行量化评估。为此,作者提出基于测度传输(measure transport)理论的建模框架,将文本和视觉标记视为经验概率测度,揭示ViT图像块编码器诱导的前向映射所对应的运输成本可分解为两个可估计项:块内聚合引起的精度成本(precision cost)与跨块碎片化带来的覆盖成本(coverage cost)。该框架的关键创新在于引入无需下游标签的探针即可估算上述两项成本,并由此衍生出两个实用机制:一是基于运输成本的无标签路由准则,用于动态选择是否采用视觉路径;二是受运输信息启发的中央凹机制(foveation mechanism),对高成本区域以更高分辨率重新编码。实验表明,在Qwen3-4B模型上,该方法在24个NLP数据集中有17个达到或接近最优oracle性能(70.8%),平均任务得分提升+3.3%,同时token消耗降低-10.3%。
链接: https://arxiv.org/abs/2605.06708
作者: Lv Tang,Tianyi Zheng,Yang Liu,Bo Li,Xingyu Li
机构: University of Alberta; vivo Mobile Communication Co., Ltd; Tsinghua University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Visual text compression (VTC) promises efficient long-context processing by rendering text into an image and re-encoding it with a vision-language model, often producing 3–20× fewer decoder tokens than subword tokenization. Yet token savings do not translate predictably into downstream utility: on some tasks the visual path matches or exceeds the text path, on others it collapses, and the compression ratio itself does not predict which regime will occur. The missing quantity is therefore not another summary of efficiency, but a principled measure of task-relevant information loss induced by visual encoding. We address this problem by formulating VTC in the language of measure transport. Treating text and visual tokens as empirical probability measures, we show that the ViT patch encoder induces a push-forward map whose transport cost decomposes into a precision cost from within-patch aggregation and a coverage cost from cross-patch fragmentation. Both terms are estimable from downstream-label-free probes. This formulation yields two operational consequences: a downstream-label-free routing criterion that selects whether to use the visual path for a given input or benchmark instance, and a transport-informed foveation mechanism that re-encodes high-cost regions at higher resolution. Across 24 NLP datasets at Qwen3-4B, our label-free rule matches the per-dataset oracle on 17/24 datasets (70.8%), and improves the average task score by +3.3% with -10.3% average tokens relative to a pure-LLM.
[CV-169] A Hierarchical Ensemble Pipeline for Anomaly Detection in ESA Satellite Telemetry ECML-PKDD2025
【速读】:该论文旨在解决多变量遥测数据中的异常检测问题,尤其针对真实卫星遥测场景下细微异常难以识别的挑战。解决方案的关键在于构建一个分层集成流水线(hierarchical ensemble pipeline),其核心包括:基于形状子序列(shapelet)与统计特征的提取、按通道独立建模(per-channel modeling)、通道内堆叠(intra-channel stacking)以及最终跨通道聚合(cross-channel aggregation)。通过时间序列交叉验证和两级掩码策略防止信息泄露,该方法在欧洲航天局异常检测基准(ESA-ADB)挑战中展现出优异的泛化能力,验证了分层建模在捕捉复杂遥测数据异常模式中的有效性。
链接: https://arxiv.org/abs/2605.06681
作者: Lorenzo Riccardo Allegrini,Geremia Pompei
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 3 figures, 1 table. Submitted to the ML4ITS workshop at the ECML PKDD 2025 conference. Awarded 2nd place in the final round of the Spacecraft Anomaly Challenge on ESA dataset. (Ranked 1st on the Kaggle public leaderboard and 3rd on the private leaderboard)
Abstract:A hierarchical ensemble pipeline is introduced to address anomaly detection in multivariate telemetry data provided by European Space Agency (ESA). The method integrates shapelet-based and statistical feature extraction, per-channel modeling, intra-channel stacking, and a final cross-channel aggregation. The pipeline is trained and validated using time-series cross-validation and two-level masking strategies to prevent information leakage. Results on the European Space Agency Anomaly Detection Benchmark (ESA-ADB) challenge demonstrate strong generalization, highlighting the effectiveness of hierarchical modeling in detecting subtle anomalies in realistic satellite telemetry.
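“按通道建模 + 跨通道聚合”的分层结构可用如下极简示意理解(本文添加的假设性示例:通道名、阈值检测器与加权平均聚合均为虚构的占位实现,并非论文实际所用的 shapelet 特征与堆叠模型):

```python
def per_channel_scores(window_by_channel, channel_models):
    # Stage 1: each telemetry channel gets its own anomaly score.
    return {ch: channel_models[ch](w) for ch, w in window_by_channel.items()}

def aggregate(scores, weights):
    # Stage 2: cross-channel aggregation as a weighted vote (a simple
    # stand-in for a stacked meta-model).
    return sum(weights[ch] * v for ch, v in scores.items()) / sum(weights.values())

# Hypothetical detectors: flag a window whose value range exceeds a threshold.
models = {"temp": lambda w: float(max(w) - min(w) > 2.0),
          "volt": lambda w: float(max(w) - min(w) > 0.5)}
scores = per_channel_scores({"temp": [20.0, 20.1, 23.5],
                             "volt": [3.3, 3.3, 3.31]}, models)
print(aggregate(scores, {"temp": 1.0, "volt": 1.0}))
```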
[CV-170] On the Role of Strain and Vorticity in Numerical Integration Error for Flow Matching
【速读】:该论文旨在解决流匹配(Flow Matching)方法在生成数据时因积分步数(Number of Function Evaluations, NFE)受限而导致的推理成本高与精度低的问题。其核心解决方案在于从速度场(velocity field)的雅可比矩阵分解出发,识别出应变(strain,对称部分 S)和涡度(vorticity,反对称部分 Ω)在数值积分误差中的不同作用机制:应变通过对数范数控制误差的指数放大,而涡度仅线性贡献于局部截断误差。基于此理论分析,作者提出一种加权雅可比正则化策略,分别引入应变权重 α 和涡度权重 β,以优化速度场结构。实验表明,该方法显著降低积分误差(2D合成数据中NFE=5时误差下降达2.7倍),并在CIFAR-10上实现轻量微调后FID提升14%(NFE=10),同时保持高NFE下的生成质量。
链接: https://arxiv.org/abs/2605.06680
作者: Chenxi Tao,Seung-Kyum Choi
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Fluid Dynamics (physics.flu-dyn)
备注: 16 pages, 7 figures. Preliminary version. Includes qualitative CIFAR-10 comparison and supporting synthetic experiments
Abstract:Flow matching generates data by integrating a learned velocity field, where the number of integration steps (NFE) directly determines inference cost. We analyze which properties of the velocity field govern integration error by decomposing the velocity Jacobian into its symmetric part S (strain rate) and antisymmetric part Omega (vorticity). We prove that strain and vorticity play different roles: strain controls exponential error amplification through the logarithmic norm, while vorticity contributes only linearly to the local truncation error. We further show that the optimal transport velocity field is irrotational and has zero material derivative, implying second-order Euler accuracy; for exact displacement interpolation, the associated Lagrangian particle dynamics are integrated exactly by Euler. Motivated by this analysis, we study weighted Jacobian regularization with strain weight alpha and vorticity weight beta. Experiments on 2D synthetic data confirm the main theoretical predictions, showing up to 2.7x lower integration error at NFE=5. Preliminary CIFAR-10 experiments show consistent trends, with a lightweight fine-tuning procedure improving FID by 14 percent at NFE=10 while preserving high-NFE quality.
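摘要核心的雅可比分解 J = S + Ω 可以直接写出(本文添加的示意代码,用纯 Python 的 2×2 矩阵演示):纯旋转场 v(x, y) = (−y, x) 的雅可比完全由涡度构成、应变为零,按文中结论它只线性地贡献局部截断误差。

```python
def split_jacobian(J):
    # J = S + Omega, with S symmetric (strain rate) and Omega
    # antisymmetric (vorticity).
    n = len(J)
    S = [[0.5 * (J[i][j] + J[j][i]) for j in range(n)] for i in range(n)]
    O = [[0.5 * (J[i][j] - J[j][i]) for j in range(n)] for i in range(n)]
    return S, O

# Pure rotation field v(x, y) = (-y, x): Jacobian [[0, -1], [1, 0]].
S, O = split_jacobian([[0.0, -1.0], [1.0, 0.0]])
print(S, O)  # zero strain; the field is all vorticity
```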
[CV-171] Mean Mode Screaming: Mean–Variance Split Residuals for 1000-Layer Diffusion Transformers
【速读】:该论文旨在解决扩散 Transformer(Diffusion Transformers, DiTs)在扩展至数百层时出现的结构性脆弱性问题,即网络可能进入一种“静默均值主导崩溃状态”(silent, mean-dominated collapse),导致 token 表示同质化并抑制中心变异。通过机制审计,作者识别出崩溃触发事件为“均值模式尖叫”(Mean Mode Screaming, MMS),其表现为训练看似稳定时仍会发生均值一致的反向传播冲击,激活深层残差分支并驱动模型进入均值主导状态。该现象源于梯度的精确分解为均值一致与中心分量,并因 Softmax 雅可比矩阵零空间对注意力 logits 梯度的结构抑制而加剧。解决方案的关键在于提出“均值-方差分离残差”(Mean-Variance Split, MV-Split Residuals),将中心残差更新与漏斗式均值替换结合,在400层单流DiT中有效防止发散崩溃,同时在完整训练周期内显著优于如 LayerScale 等基于 token 同质门控的方法;进一步在1000层DiT上验证了该架构在极端深度下的稳定可训练性。
链接: https://arxiv.org/abs/2605.06169
作者: Pengqi Lu
机构: Beijing, China
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 43 pages (9-page main paper + appendix)
Abstract:Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a mean-dominated state. We show this behavior is driven by an exact decomposition of these gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the un-stabilized baseline; it tracks close to the baseline’s pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.
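MV-Split 残差的一种可能实现思路如下(本文添加的假设性草图,仅是对“分离增益的中心化残差 + 漏斗式均值替换”这一描述的一种解读,并非论文代码):对均值一致的分支“冲击”,中心化残差项恒为零,只有漏斗均值项有限地通过。

```python
def mean_of_tokens(x):
    n, d = len(x), len(x[0])
    return [sum(tok[j] for tok in x) / n for j in range(d)]

def mv_split_residual(x, u, gain=1.0, leak=0.1):
    """One reading of an MV-Split update: add only the token-centered part
    of the branch output u, and *replace* the trunk mean with a leaky blend
    of trunk and branch means."""
    mx, mu = mean_of_tokens(x), mean_of_tokens(u)
    d = len(mx)
    return [[(xi[j] - mx[j]) + gain * (ui[j] - mu[j])   # centered residual
             + (1 - leak) * mx[j] + leak * mu[j]        # leaky mean replacement
             for j in range(d)]
            for xi, ui in zip(x, u)]

# A mean-coherent branch "shock" (identical update on every token) has zero
# centered component, so only the leaky mean term passes through the trunk.
x = [[1.0, 2.0], [3.0, 4.0]]
shock = [[100.0, 100.0], [100.0, 100.0]]
print(mv_split_residual(x, shock))
```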
[CV-172] CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)分布式训练中因张量并行(Tensor Parallelism)和数据并行(Data Parallelism)导致的通信瓶颈问题,特别是现有基于数据切片的通信-计算重叠方法中存在的尾延迟(Tail Latency)问题。解决方案的关键在于提出一种名为CommFuse的新方法,该方法通过将传统的集合操作(如reduce-scatter和all-gather)分解为细粒度的点对点(Peer-to-Peer, P2P)通信,并重新调度分片计算任务,从而实现更高效的通信与计算重叠,彻底消除尾延迟,同时兼容多种并行策略(如TPSP和UP),显著提升模型FLOPS利用率(MFU)和吞吐量。
链接: https://arxiv.org/abs/2604.24013
作者: Rezaul Karim,Austin Wen,Wang Zongzuo,Weiwei Zhang,Yang Liu,Walid Ahmed
机构: Huawei(华为)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Slightly modified the title, and corresponding minor wording change in the content
Abstract:The rapid growth in the size of large language models has necessitated the partitioning of computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead significantly hindering computational efficiency. While communication-computation overlap presents a promising direction, existing data slicing based solutions suffer from tail latency. To overcome this limitation, this research introduces a novel communication-computation overlap technique to eliminate this tail latency in state of the art overlap methods for distributed LLM training. The aim of this technique is to effectively mitigate communication bottleneck of tensor parallelism and data parallelism for distributed training and inference. In particular, we propose a novel method termed CommFuse that replaces conventional collective operations of reduce-scatter and all-gather with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method provides an exact algorithm for reducing communication overhead that eliminates tail latency. Moreover, it presents a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, superior Model FLOPS Utilization (MFU), and high throughput.
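将 reduce-scatter 分解为逐块 P2P 传输的语义可用纯 Python 模拟(本文添加的示意,不涉及真实的 NCCL/HCCL 调用;真实系统中每条 P2P 消息可与产生下一块结果的计算重叠,从而避免整条集合通信卡在关键路径上):

```python
def p2p_reduce_scatter(chunks_per_rank):
    """Simulated semantics of a reduce-scatter decomposed into per-chunk
    P2P transfers: chunks_per_rank[r][c] is rank r's contribution to chunk c,
    and rank c ends up owning the fully reduced chunk c. Each (r -> c)
    transfer is an independent message that can be overlapped with compute."""
    n = len(chunks_per_rank)
    return [sum(chunks_per_rank[r][c] for r in range(n)) for c in range(n)]

# Three simulated ranks, scalar chunks for simplicity.
print(p2p_reduce_scatter([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```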
[CV-173] Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在光学字符识别(Optical Character Recognition, OCR)任务中缺乏有效无监督质量控制的问题,尤其是难以检测样本级错误。其核心挑战在于现有方法无法在不依赖标注数据或额外训练的情况下可靠评估OCR输出的可靠性。解决方案的关键是提出Consensus Entropy (CE),一种无需训练、模型无关的指标,通过测量多个模型预测结果在输出空间中的共识熵来估计预测可靠性——正确预测倾向于收敛,而错误预测则趋于分散。基于此,作者进一步构建了CE-OCR框架,利用多模型集成一致性进行输出验证与选择,并通过自适应路由机制提升效率,显著优于自一致性等基线方法且无需额外计算成本。
链接: https://arxiv.org/abs/2504.11101
作者: Yulong Zhang,Tianyi Liang,Xinyue Huang,Erfei Cui,Guoqing Wang,Xu Guo,Chenhui Li,Gongshen Liu
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Shanghai Jiao Tong University (上海交通大学); East China Normal University (华东师范大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle with detecting sample-level errors and lack effective unsupervised quality control. We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy. The core insight is that correct predictions converge in output space, while errors diverge. Based on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs by ensemble agreement, selects the best outputs, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving F1 scores by 42.1% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration. Code: this https URL.
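CE 的核心可以压缩为几行代码(本文添加的示意:按摘要“正确预测收敛、错误预测发散”的思想,把 CE 实现为多模型输出经验分布的熵;真实系统中通常还需先对 OCR 输出做归一化或对齐,这里省略):

```python
import math
from collections import Counter

def consensus_entropy(outputs):
    """Entropy of the empirical distribution over model outputs: zero when
    all models agree, maximal when every model disagrees."""
    counts = Counter(outputs)
    n = len(outputs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Full agreement -> zero entropy -> trust the output.
print(consensus_entropy(["hello", "hello", "hello"]))
# Total disagreement -> maximal entropy -> flag for fallback/routing.
print(consensus_entropy(["hello", "he1lo", "hallo"]))
```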
[CV-174] Uncertainty Quantification for Cardiac Shape Reconstruction with Deep Signed Distance Functions via MCMC methods
【速读】:该论文旨在解决基于图谱(Atlas-based)的心脏形态重建方法在处理稀疏或噪声数据时,因过度依赖先验信息而导致不确定性影响较大、临床可靠性受限的问题。其解决方案的关键在于提出一种概率框架,将深度符号距离函数(Deep Signed Distance Functions, DeepSDFs)与马尔可夫链蒙特卡洛(Markov Chain Monte Carlo, MCMC)采样相结合,通过将重建损失解释为对数似然,在潜在空间中进行贝叶斯推断,从而获得最大后验估计(MAP)和后验采样重构结果,实现高精度心脏几何重建及校准良好的不确定性估计。
链接: https://arxiv.org/abs/2605.07987
作者: Jan Verhülsdonk,Thomas Grandits,Francisco Sahli Costabal,Thomas Beiert,Simone Pezzuto,Alexander Effland
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Atlas-based approaches allow high-quality, patient-specific shape reconstructions of cardiac anatomy from sparse and/or noisy data such as point clouds. However, these methods are mainly prior-driven, so the impact of uncertainty can be large, limiting their clinical reliability. We propose a probabilistic framework for uncertainty-aware cardiac shape reconstruction that combines Deep Signed Distance Functions (DeepSDFs) with Markov Chain Monte Carlo (MCMC) sampling. Cardiac geometries are modeled implicitly as zero-level sets of a neural network conditioned on learned latent codes, enabling multi-surface reconstruction of the left and right ventricles. By interpreting the reconstruction loss as a log-likelihood, we perform Bayesian inference in the latent space to obtain both maximum a posteriori (MAP) and posterior-sampled reconstructions. Experiments on a public cardiac dataset show that our approach produces accurate reconstructions and well-calibrated uncertainty estimates.
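在潜在空间做 MCMC 的流程可用一维玩具示例示意(本文添加的假设性草图:将重建损失解释为负对数似然、叠加高斯先验,用随机游走 Metropolis 采样;真实设置中 z 为 DeepSDF 潜码,似然来自点云的拟合误差):

```python
import math, random

def metropolis_latent(log_post, z0, steps, step_size=0.5, seed=0):
    """Random-walk Metropolis over a (toy 1-D) latent code."""
    rng = random.Random(seed)
    z, lp = z0, log_post(z0)
    samples = []
    for _ in range(steps):
        zp = z + rng.gauss(0.0, step_size)
        lpp = log_post(zp)
        if lpp >= lp or rng.random() < math.exp(lpp - lp):  # accept/reject
            z, lp = zp, lpp
        samples.append(z)
    return samples

# Toy posterior: "reconstruction" likelihood N(2.0, 0.25) times prior N(0, 1);
# the analytic posterior is N(1.6, 0.2).
log_post = lambda z: -0.5 * (z - 2.0) ** 2 / 0.25 - 0.5 * z ** 2
samples = metropolis_latent(log_post, 0.0, 5000)
print(sum(samples[1000:]) / 4000)  # posterior-mean estimate, near 1.6
```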
[CV-175] Consistency Regularised Gradient Flows for Inverse Problems
【速读】:该论文旨在解决基于视觉-语言潜在扩散模型(Vision-Language Latent Diffusion Models, LDMs)的逆问题求解中存在计算成本高和重建质量下降的问题。现有方法通常需要大量神经函数评估(Neural Function Evaluations, NFEs)以及对大型预训练组件进行反向传播,导致效率低下。其解决方案的关键在于提出一个统一的欧几里得–Wasserstein-2梯度流框架,通过单一流在潜在空间中联合执行后验采样与提示优化,使先验分布与观测数据对齐;结合少步数的潜在文本到图像模型,该方法可在无需对自编码器进行反向传播的情况下实现低NFE推理,从而显著降低计算开销并提升重建性能。
链接: https://arxiv.org/abs/2605.07907
作者: Alessio Spagnoletti,Tim Y. J. Wang,Marcelo Pereyra,O. Deniz Akyildiz
机构: 未知
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Vision-Language Latent Diffusion Models (LDMs) (Rombach et al., 2022) provide powerful generative priors for inverse problems. However, existing LDM-based inverse solvers typically require a large number of neural function evaluations (NFEs) and backpropagation through large pretrained components, leading to substantial computational costs and, in some cases, degraded reconstruction quality. We propose a unified Euclidean-Wasserstein-2 gradient-flow framework that jointly performs posterior sampling and prompt optimization in the latent space through a single flow that aligns the prior and posterior with the observed data. Combined with few-step latent text-to-image models, this formulation enables low-NFE inference without backpropagation through autoencoders. Experiments across several canonical imaging inverse problems show that our method achieves state-of-the-art performance with significantly reduced computational cost.
[CV-176] Pre-training Enables Extraordinary All-optical Image Denoising
【Quick Read】: This paper addresses the suboptimal performance of optical neural networks (ONNs) caused by underexplored training methods, particularly on image denoising tasks. The key to the solution is a pre-training-driven transfer-learning strategy: a diffractive network is first pre-trained at scale on a dataset of 3.45 million diverse but simple images, then fine-tuned on task-specific datasets. This two-stage optimization markedly improves denoising quality under severe noise (peak signal-to-noise ratio, PSNR, below 8 dB), raising PSNR above 18 dB while preserving fine image features. The same pre-trained model also generalizes across diverse image styles (MNIST, ChestMNIST, CIFAR-10, and CelebA) and is validated in vision applications such as face detection, license-plate recognition, and UAV localization.
Link: https://arxiv.org/abs/2605.07810
Authors: Xudong Lv, Yuxiang Sun, Shuo Wang, Nanxing Chen, Jun Guan, Jingtian Hu
Affiliations: Hangzhou Dianzi University; The Chinese University of Hong Kong; Harbin Institute of Technology
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Optical neural networks are emerging as powerful machine learning and information processing tools because of their potential advantages in speed and energy efficiency. The training methods of these physical models, however, remain underexplored compared to their digital counterparts and are leading to suboptimal performance. This paper reports a pre-training-driven approach that leads to snapshot image denoising with substantially improved quality. We demonstrated effective free-space optical denoising by a diffractive network optimized by a two-step process including (1) pre-training using a massive dataset of 3.45 million diverse but simple images and (2) fine-tuning with the corresponding task-specific datasets. Compared to conventional Fourier-domain filtering and directly trained diffractive networks, such a transfer learning process exhibited prominent advantages for denoising images degraded by severe noise, peak signal-to-noise ratio (PSNR) below 8 dB, while preserving fine image features and improving the PSNR to above 18 dB. Importantly, the same pre-trained optical network could be consistently fine-tuned to process degraded images from highly diverse styles ranging from handwritten digits (MNIST) and chest X-rays (ChestMNIST) to CIFAR-10 images and human faces (CelebA). We further demonstrated the critical role of our optical denoisers in vision-based applications, including face detection, plate recognition, and localization of UAVs in noisy conditions.
[CV-177] Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference
【Quick Read】: This paper tackles the heavy bandwidth consumption, high latency, and privacy risks caused by transmitting video data in edge-deployed human action understanding. Traditional approaches upload massive video data from resource-constrained edge devices to the cloud, imposing significant network burden and privacy exposure. The key to the solution is a Task-Oriented Communication Framework for Human Action Understanding (TOAU): an edge-side monocular pose estimator extracts continuous joint coordinates, which a Vector Quantized Variational Autoencoder (VQ-VAE) converts into discrete motion tokens, so that only a compact sequence of codebook indices (as few as 9 bits per frame) is transmitted, greatly reducing payload. On the cloud side, a lightweight projection module aligns these tokens with the embedding space of a large vision-language model (VLM), which, combined with efficient instruction tuning, enables complex action understanding. The system reduces the transmission payload to roughly 1% and the system latency to about 20% while maintaining accuracy.
Link: https://arxiv.org/abs/2605.07354
Authors: Jingyi Liu, Cheng Yuan, Lijun He, Jun Zhang, Jiawei Shao
Affiliations: Institute of Artificial Intelligence (TeleAI) of China Telecom; Xi'an Jiaotong University; Hong Kong University of Science and Technology
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures
Abstract:The expanding application of smart sensing has created a growing demand for the accurate understanding of human action at the network edge. Traditional approaches require massive video data to be transmitted from resource-constrained edge devices to powerful cloud servers, incurring prohibitive uplink bandwidth consumption and unacceptable latency while raising privacy concerns. To overcome these bottlenecks, we propose a task-oriented communication framework for human action understanding (TOAU) through edge-cloud collaboration. Our framework utilizes a monocular pose estimator to extract continuous joint coordinates from raw videos, followed by a vector quantized variational autoencoder (VQ-VAE) to convert these coordinates into discrete motion tokens. Consequently, only a compact sequence of codebook indices is transmitted over the network, consuming as few as 9 bits per frame and avoiding privacy leakages. At the cloud server, a lightweight projector aligns these motion tokens with the embedding space of a large vision-language model (VLM) to facilitate complex action understanding, which is trained with an efficient instruction tuning paradigm. Comprehensive evaluations on three benchmarks demonstrate that our TOAU system reduces the transmission payload to approximately 1% and the system latency to around 20% compared to video codec-based solutions, while delivering comparable action understanding accuracy.
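As a rough illustration of the token bottleneck the abstract describes, the sketch below quantizes a pose vector against a 512-entry codebook so that only a 9-bit index needs to be transmitted. The codebook size, pose dimensionality, and function names are illustrative assumptions, not details from the paper; the actual codebook is learned by the VQ-VAE rather than drawn at random.

```python
import math
import random

random.seed(0)

CODEBOOK_SIZE = 512   # log2(512) = 9 bits per frame, as in the abstract
POSE_DIM = 34         # e.g. 17 joints x (x, y); an illustrative choice

# Stand-in for the learned VQ-VAE codebook.
codebook = [[random.gauss(0, 1) for _ in range(POSE_DIM)]
            for _ in range(CODEBOOK_SIZE)]

def quantize(pose):
    """Edge side: index of the nearest codebook vector; only this
    index is sent over the network."""
    best_i, best_d = 0, float("inf")
    for i, code in enumerate(codebook):
        d = sum((p - c) ** 2 for p, c in zip(pose, code))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

def dequantize(index):
    """Cloud side: recover the motion token from the received index."""
    return codebook[index]

bits_per_frame = math.log2(CODEBOOK_SIZE)  # 9.0
```

Transmitting one small integer per frame instead of a video stream is what drives the payload reduction the abstract reports.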
[CV-178] Fine-tuning a vision-language model for fracture-surface morphology recognition
【Quick Read】: This paper addresses the lack of domain-specific visual knowledge that prevents general-purpose vision-language models (VLMs) from reliably characterizing fracture-surface images in materials science. The key to the solution is threefold: first, a curated dataset of 13,168 literature-mined fracture-surface images is built, with morphology annotations generated automatically by GPT-5.2-Reasoning (high); second, targeted manual collection of rare-feature images and rotation-based data augmentation substantially improve recognition of uncommon fracture morphologies; finally, fine-tuning an open-source VLM (Qwen3-VL-32B-Instruct) yields a specialist model that reaches a precision of 0.92 on a benchmark of 100 manually annotated images, far exceeding the baselines (0.35 for the base Qwen3-VL, 0.58 for GPT-5.5-Reasoning, 0.78 for Gemini 3.1 Pro), confirming that targeted data collection and fine-grained fine-tuning improve visual understanding in specialized settings.
Link: https://arxiv.org/abs/2605.07145
Authors: Quanliang Liu, Jungtaek Kim, Kangwook Lee, Hyunseok Oh
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-language models (VLMs) have shown strong potential for scientific image understanding, but general-purpose models often lack the domain-specific visual knowledge required for reliable materials characterization. In this work, we fine-tuned an open-source VLM (Qwen3-VL-32B-Instruct) for fracture-surface image analysis using a curated dataset of 13,168 open-source, literature-mined fracture-surface images. Morphology annotations were generated by GPT-5.2-Reasoning (high) from both the images and relevant excerpts of their source papers, and the dataset was further enriched with targeted manual collection and rotation-based augmentation. The resulting specialist model outperforms flagship proprietary multimodal models on a benchmark of 100 manually annotated images. It achieves a precision of 0.92, compared to 0.35 for the base Qwen3-VL-32B-Instruct, 0.58 for GPT-5.5-Reasoning (high), and 0.78 for Gemini 3.1 Pro-Reasoning (high). Dataset ablations show that manual collection of rare-feature images and augmentation via image rotation are both beneficial to improve recognition of less common fracture morphology features. We further discuss integrated use of the fine-tuned model with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning for autonomous fractography. Although focused on fracture-surface images, this work demonstrates how VLMs can be adapted through targeted collection and fine-tuning on novel feature images to recognize those features and support downstream decision-making in autonomous microscopy workflows.
[CV-179] Multimodal synthesis of MRI and tabular data with diffusion in a joint latent space via cross-attention
【Quick Read】: This paper targets the joint generation of multimodal medical data (volumetric MRI and structured clinical tabular data) to support digital twins in healthcare. The core challenge is coordinated modeling and high-quality synthesis of two heterogeneous modalities in a unified latent space. The key to the solution is a multimodal latent diffusion model with cross-attention: a variational autoencoder (VAE) fuses MRI and tabular data into a shared latent space, and separate decoders reconstruct each modality, enabling coherent joint generation of MRI volumes and clinical features. Experiments show the method produces anatomically plausible MRI consistent with the synthesized tabular attributes and outperforms existing baselines on quantitative metrics, providing the first demonstration of jointly modeling MRI and mixed-type tabular data within a single latent diffusion framework.
Link: https://arxiv.org/abs/2605.06699
Authors: Daniel Mensing, Jan Kapar, Jochen G. Hirsch, Matthias Günther, Horst Hahn, Marvin N. Wright
Affiliations: Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany; Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany; Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany; Faculty of Physics and Electrical Engineering, University of Bremen, Bremen, Germany
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We propose a multimodal latent diffusion model that jointly synthesizes volumetric magnetic resonance imaging (MRI) and tabular clinical data within a shared latent space via cross-attention. This approach enables coherent joint representation learning of MRI and tabular modalities for generative modeling. Our model utilizes a variational autoencoder to fuse the two modalities before diffusion-based synthesis, allowing modality-appropriate reconstruction with separate decoders for MRI and tabular data. We evaluated the framework on data from the German National Cohort (NAKO Gesundheitsstudie), comprising over 10,000 participants with MRI scans and clinical tabular features such as age, sex, body measurements, and ethnicity. The generated MRI volumes exhibited anatomical plausibility and body composition consistent with the synthesized tabular attributes. Quantitative evaluation using Fréchet distance and precision-recall metrics confirmed high-fidelity image generation. In the tabular modality, our model outperformed CTGAN across standard evaluation metrics and achieved results comparable to TVAE, demonstrating competitive performance relative to established unimodal baselines. This work is, to our knowledge, the first to demonstrate the feasibility of jointly modeling MRI and mixed-type tabular data in a single latent diffusion framework, offering a proof-of-concept for generating coherent synthetic multimodal patient data and aligning with the broader goal of developing digital twins in healthcare.
Artificial Intelligence
[AI-0] VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection ACL2026
【Quick Read】: This paper addresses the substantial overhead and cost of weighted majority voting schemes such as Confidence-Informed Self-Consistency (CISC), which must call a critic LLM on each candidate answer's reasoning trace to produce a confidence score at inference time. The key to the solution is VecCISC, a lightweight framework that uses a semantic-similarity measure to automatically filter reasoning traces that are semantically equivalent, degenerate, or hallucinated, greatly reducing the number of candidates the critic LLM must evaluate. This cuts total token usage by 47% while matching or exceeding the accuracy of CISC.
Link: https://arxiv.org/abs/2605.08070
Authors: James Petullo, Sonny George, Dylan Cashman, Nianwen Xue
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of ACL 2026
Abstract:A standard technique for scaling inference-time reasoning is Self-Consistency, whereby multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g. Confidence-Informed Self Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate on a wide range of popular benchmarks. In practice, weighted majority voting necessitates calling a critic LLM on each candidate’s reasoning trace to produce the answer’s confidence score. This secondary series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thus decreasing the number of candidate answers that must be evaluated by the critic. To ensure adequate experimental thoroughness, we evaluate VecCISC on five challenging, widely-adopted datasets spanning the domains of mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces the total token usage by 47%, while maintaining or exceeding the accuracy of CISC.
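The voting and filtering steps described above can be sketched as follows. The Jaccard token overlap here is a crude stand-in for VecCISC's semantic-similarity measure, and the threshold and function names are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict

def jaccard(a, b):
    """Token overlap: a crude stand-in for a semantic-similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def filter_traces(traces, threshold=0.8):
    """Keep one representative per cluster of near-duplicate reasoning
    traces, so the critic LLM needs to score fewer candidates."""
    kept = []
    for trace in traces:
        if all(jaccard(trace, k) < threshold for k in kept):
            kept.append(trace)
    return kept

def cisc_vote(candidates):
    """CISC-style weighted majority vote: accumulate each answer's
    confidence scores and return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, confidence in candidates:
        totals[answer] += confidence
    return max(totals, key=totals.get)
```

With `candidates` as `(answer, confidence)` pairs, a single high-confidence answer can outvote several low-confidence duplicates, which is the behavior that distinguishes CISC from plain self-consistency.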
[AI-1] Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
【Quick Read】: This paper addresses the problem of overly coarse or ambiguous reward signals in conventional reinforcement learning (RL): how to achieve finer-grained, interpretable, and generalizable optimization on complex tasks. The core challenge is that relying solely on a holistic score (binary feedback or a single number) makes it hard to guide improvement across multiple task dimensions and may limit reasoning transfer to unseen settings. The key to the solution is a rubric-grounded RL framework that decomposes the reward into weighted, verifiable task-specific criteria, each scored independently by a frozen LLM judge, providing a partial-credit optimization signal. The method not only improves performance under structured rubric evaluation but also induces stronger transfer on reasoning benchmarks not seen during training, demonstrating the effectiveness of structured, document-grounded rewards for generative AI.
Link: https://arxiv.org/abs/2605.08061
Authors: Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe, Scott Pakin, Dan O'Malley
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize \emphrubric-grounded reinforcement learning (RL): a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus – GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
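A minimal sketch of the partial-credit aggregation described above, assuming a judge that returns per-criterion scores in [0, 1]. The criterion names and weights are invented for illustration and are not from the paper.

```python
def rubric_reward(scores, rubric):
    """Aggregate per-criterion judge scores in [0, 1] into one scalar
    reward using the rubric's weights (partial credit, not pass/fail)."""
    total_weight = sum(weight for _, weight in rubric)
    return sum(weight * scores[name] for name, weight in rubric) / total_weight

# Hypothetical rubric and judge output; names and weights are invented.
rubric = [("cites_evidence", 2.0), ("correct_units", 1.0), ("clarity", 1.0)]
judge_scores = {"cites_evidence": 1.0, "correct_units": 0.5, "clarity": 1.0}
reward = rubric_reward(judge_scores, rubric)  # (2*1 + 1*0.5 + 1*1) / 4
```

Unlike a binary outcome, a partially correct response still receives a graded signal here, which is the property the paper's GRPO training exploits.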
[AI-2] Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
【Quick Read】: This paper addresses the limitations of Direct Preference Optimization (DPO) when handling multiple rollouts per prompt: DPO collapses rich preference structure into independent pairwise comparisons, discarding transitivity in rankings and introducing redundant or conflicting supervision that destabilizes training. The key to the solution is Graph Direct Preference Optimization (GraphDPO), which builds a directed acyclic preference graph induced by rollout rankings, encodes dominance relations as edges, and optimizes a graph-structured, Plackett–Luce-inspired objective that aggregates supervision over graph neighborhoods to enforce transitivity. An equivalence-class construction groups identically preferred responses into graph layers with zero intra-layer loss, preventing spurious gradients from sparse signals. The method maintains linear per-prompt complexity while exploiting the full preference graph, and supports optional ground-truth anchoring by inserting verified solutions as dominant nodes, notably improving training stability and performance.
Link: https://arxiv.org/abs/2605.08037
Authors: Ning Liu, Chuanneng Sun, Kristina Klinkner, Shervin Malmasi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett–Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.
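The Plackett–Luce listwise objective that GraphDPO generalizes can be sketched as follows, assuming a list of scores already sorted best-first (as a topological order of the preference graph would provide). This is the vanilla Plackett–Luce negative log-likelihood, not the paper's full graph-structured objective.

```python
import math

def plackett_luce_nll(scores_ranked):
    """Negative log-likelihood of a best-first ranking under the
    Plackett-Luce model: item i is drawn from the remaining suffix
    with probability softmax(score_i | suffix)."""
    nll = 0.0
    for i in range(len(scores_ranked) - 1):
        suffix = scores_ranked[i:]
        log_z = math.log(sum(math.exp(s) for s in suffix))  # log-sum-exp
        nll += log_z - suffix[0]
    return nll
```

Minimizing this loss pushes each response's score above everything ranked below it, enforcing transitivity; the per-prompt cost is linear in the list length thanks to the log-sum-exp aggregation the abstract mentions.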
[AI-3] MPD²-Router: Mask-aware Multi-expert Prior-regularized Dual-head Deferral Router in Glaucoma Screening and Diagnosis
【Quick Read】: This paper addresses clinical safety and efficiency gaps in glaucoma screening that arise because standard learning-to-defer (L2D) methods ignore expert availability, heterogeneous reader behavior, workload imbalance, asymmetric diagnostic harm, morphology-driven case difficulty, and deployment shift. The key to the solution is MPD²-Router, a mask-aware multi-expert deferral framework that recasts ophthalmic triage as a constrained human–AI routing problem (whether to defer, and to which available expert). It couples a dual-head deferral/allocation policy with mask-aware Gumbel–sigmoid gating to strictly enforce per-sample expert availability, and fuses uncertainty, morphology, image-quality, and out-of-distribution (OOD) signals into the decision. Training uses an asymmetric cost-sensitive objective, an augmented-Lagrangian deferral budget, a group-specific distribution prior, and a rank-majorization Jensen–Shannon regularizer, which together prevent expert collapse without forcing uniform allocation. Across three cross-national glaucoma cohorts (REFUGE, CHAKSU, ORIGA), the method substantially lowers clinical cost, improves MCC, and balances expert utilization, while remaining robust under cross-domain shift and Pareto-optimal in the F1–MCC–cost trade-off.
Link: https://arxiv.org/abs/2605.08024
Authors: Wenxin Zhan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Learning-to-defer (L2D) can make glaucoma screening safer by routing difficult/uncertain cases to humans, yet standard formulations overlook expert availability, heterogeneous readers behavior, workload imbalance, asymmetric diagnostic harm, case difficulty from morphology and deployment shift. We introduce MPD ^2 -Router, a mask-aware multi-expert deferral framework that recasts ophthalmic triage as constrained human–AI routing: whether to defer and to which available expert. It couples a dual-head deferral/allocation policy with mask-aware Gumbel–sigmoid gating that strictly enforces per-sample availability, and fuses uncertainty, morphology, image-quality, and OOD signals. Training uses an asymmetric cost-sensitive objective with an augmented-Lagrangian deferral budget, a group-specific distribution prior, and a rank-majorization JS regularizer that jointly prevent expert collapse without forcing uniform allocation. Across three cross-national glaucoma cohorts (REFUGE, CHAKSU, ORIGA) with a frozen REFUGE-trained backbone, MPD ^2 -Router substantially lowers clinical cost and improves MCC over AI-only at a moderate deferral rate. It is Pareto-optimal in F1–MCC–cost, robust under cross-domain shift, and yields balanced expert utilization.
[AI-4] Globally Optimal Training of Spiking Neural Networks via Parameter Reconstruction
【Quick Read】: This paper addresses the accumulation of gradient-approximation errors in training spiking neural networks (SNNs), where the non-differentiable spike function typically forces reliance on surrogate gradients at the cost of accuracy. The key to the solution is extending the convexification theory of parallel feedforward threshold networks to parallel recurrent threshold networks, giving SNNs a more rigorous mathematical framework; on this basis, a parameter reconstruction algorithm is proposed that shows consistent and significant gains across tasks, both as a standalone method and in combination with surrogate-gradient training, with good data scalability and robustness to model configurations, pointing toward a viable path for large-scale SNN training.
Link: https://arxiv.org/abs/2605.08022
Authors: Himanshu Udupi, Xiaocong Yang, ChengXiang Zhai
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Spiking Neural Networks (SNNs) have been proposed as biologically plausible and energy-efficient alternatives to conventional Artificial Neural Networks (ANNs). However, the training of SNN usually relies on surrogate gradients due to the non-differentiability of the spike function, introducing approximation errors that accumulate across layers. To address this challenge, we extend the work on convexification of parallel feedforward threshold networks to parallel recurrent threshold networks, which subsume parallel SNNs as a structured special case. Building on this theoretical framework, we propose a parameter reconstruction algorithm for SNN training that demonstrates consistent and significant advantages across various tasks, both as a standalone method and in combination with surrogate-gradient training. The ablations further demonstrate the data scalability and robustness to model configurations of our training algorithm, pointing toward its potential in large-scale SNN training.
[AI-5] Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
【速读】:该论文旨在解决现代人工智能系统是否能够像人类一样在复杂、新颖环境中快速学习抽象知识,并灵活运用这些知识进行高效决策的问题。其核心解决方案在于利用包含并发fMRI记录的复杂人类游戏行为数据集,联合评估多种前沿大型推理模型(Large Reasoning Models, LRMs)与基于模型的深度强化学习代理及贝叶斯理论代理,在游戏表现、人类学习行为拟合度以及大脑活动预测能力上的差异。关键发现是:前沿LRMs不仅最贴近人类在游戏探索中的行为模式,而且在预测大脑皮层和皮层下区域活动方面比强化学习方法高出一个数量级,且该脑对齐效应源于模型对游戏状态的上下文表征而非后续规划或推理过程,从而确立了LRMs作为自然环境中人类学习与决策的有力计算模型。
链接: https://arxiv.org/abs/2605.08019
作者: Botos Csaba,Sreejan Kumar,Austin Tudor David Andrews,Laurence Hunt,Chris Summerfield,Joshua B. Tenenbaum,Rui Ponte Costa,Marcelo G. Mattar,Momchil Tomov
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Humans rapidly learn abstract knowledge when encountering novel environments and flexibly deploy this knowledge to guide efficient and intelligent action. Can modern AI systems learn and plan in a similar way? We study this question using a dataset of complex human gameplay with concurrent fMRI recordings, in which participants learn novel video games that require rule discovery, hypothesis revision, and multi-step planning. We jointly evaluate models by their ability to play the games, match human learning behavior, and predict brain activity during the same task, comparing a suite of frontier Large Reasoning Models (LRMs) against model-free and model-based deep reinforcement learning agents and a Bayesian theory-based agent. We find that frontier LRMs most closely match human behavioral patterns during game discovery and predict brain activity an order of magnitude better than both reinforcement learning alternatives across cortical and subcortical regions, with effects robust to permutation controls. Through targeted manipulations, we further show that brain alignment reflects the model’s in-context representation of the game state rather than its downstream planning or reasoning. Our results establish LRMs as compelling computational accounts of human learning and decision making in complex, naturalistic environments. Project page with interactive replays: this https URL
[AI-6] Learning CLI Agents with Structured Action Credit under Selective Observation
【Quick Read】: This paper addresses two bottlenecks for command line interface (CLI) agents interacting with complex codebases: the agent must identify task-relevant code evidence from partial observations, and sparse terminal rewards are hard to assign to specific actions in long multi-turn trajectories. Two key solutions are proposed: σ-Reveal, an inference-time mechanism that selects token-budgeted context to sharpen the agent's targeted understanding of CLI commands; and Action Advantage Assignment (A³), a native agentic reinforcement learning method that builds turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins, achieving finer-grained action credit assignment while preserving the algorithmic complexity of standard agentic RL and improving learning efficiency.
Link: https://arxiv.org/abs/2605.08013
Authors: Haoyang Su, Ying Wen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce \sigma -Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ( \mathrmA^3 ), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. \mathrmA^3 constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.
[AI-7] Abductive Reasoning with Probabilistic Commonsense
【Quick Read】: This paper addresses the problem that, in neurosymbolic frameworks, formal solvers lack commonsense world knowledge and so cannot make reasoning steps humans find obvious. Prior methods use large language models (LLMs) to supply the missing commonsense assumptions but implicitly assume that all individuals agree on such facts, which does not hold in reality. The key to the solution is Probabilistic Abductive CommonSense (PACS), a probabilistic abductive commonsense reasoning framework in which an LLM and a formal solver jointly sample individual-specific reasoning paths (proofs) and aggregate them to judge whether most people would consider a statement true or false, explicitly modeling individual variation in commonsense beliefs.
Link: https://arxiv.org/abs/2605.08011
Authors: Joseph Cotnareanu, Chiara Roverato, Han Zhou, Didier Chetelat, Yingxue Zhang, Mark Coates
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation (stat.CO)
Comments:
Abstract:Recent efforts to improve the reasoning abilities of Large Language Models (LLMs) have focused on integrating formal logic solvers within neurosymbolic frameworks. A key challenge is that formal solvers lack commonsense world knowledge, preventing them from making reasoning steps that humans find obvious. Prior methods address this by using LLMs to supply missing commonsense assumptions, but these approaches implicitly assume universal agreement on such commonsense facts. In reality, commonsense beliefs vary across individuals. We propose a probabilistic framework for abductive commonsense reasoning that explicitly models this variation, aiming to determine whether most people would judge a statement as true or false. We introduce Probabilistic Abductive CommonSense (PACS), a novel algorithm that uses an LLM and a formal solver to sample proofs as observations of individuals’ distinct commonsense beliefs, and aggregates conclusions across these samples. Empirically, PACS outperforms chain-of-thought reasoning, prior neurosymbolic methods, and search-based approaches across multiple benchmarks.
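A toy version of the proof-sampling-and-aggregation idea above, with a trivial forward-chaining procedure standing in for the formal solver and a Bernoulli draw standing in for the LLM's sampled commonsense assumptions. The knowledge base, rule, and probability are invented for illustration.

```python
import random

random.seed(1)

FACTS = {"tweety_is_a_bird"}

def entails(rules, query):
    """Minimal forward-chaining stand-in for the formal solver."""
    known = set(FACTS)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return query in known

def pacs_vote(query, n_samples=200, p_believes=0.7):
    """Sample each simulated individual's commonsense beliefs (here:
    whether they accept 'birds fly'), run the solver per sample, and
    aggregate the verdicts. In PACS the assumptions would come from
    an LLM rather than a coin flip."""
    supported = 0
    for _ in range(n_samples):
        rules = []
        if random.random() < p_believes:
            rules.append(("tweety_is_a_bird", "tweety_flies"))
        if entails(rules, query):
            supported += 1
    return supported / n_samples
```

The returned fraction approximates the probability that a randomly drawn individual's beliefs entail the query, which is the quantity PACS thresholds to reach a conclusion.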
[AI-8] Graph-Structured Hyperdimensional Computing for Data-Efficient and Explainable Process-Structure-Property Prediction
【Quick Read】: This paper addresses process-structure-property (PSP) prediction for multiphoton photoreduction, where data are sparse, heterogeneous, and dominated by interactions among variables, so conventional feature-vector models are prone to spurious correlations and poor generalization, while mechanistic pipelines depend on calibrated submodels that are rarely available. The key to the solution is PSP-HDC, a graph-structured hyperdimensional computing framework that encodes a directed PSP graph as an internal prior in a high-dimensional space: a trainable scalar-to-hypervector encoder learns parameter-specific embeddings to accommodate heterogeneous scales and noise, sample representations are composed via graph-aligned binding and bundling, and prediction and explanation are unified through associative-memory retrieval. The method achieves 0.910 ± 0.077 accuracy on sheet-resistance regime prediction over 1000 random splits and 0.896 under process-fold generalization, outperforming strong baselines.
Link: https://arxiv.org/abs/2605.07999
Authors: Jingzhan Ge, Ajeeth Vellore, Ajinkya Palwe, Ahsan Khan, David Gorsich, Matthew P. Castanier, SeungYeon Kang, Farhad Imani
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 18 figures
Abstract:Multiphoton photoreduction enables high-fidelity fabrication of complex 3D microstructures, yet reliable process-structure-property (PSP) prediction remains difficult because the available data are sparse, heterogeneous, and interaction-dominated. In this regime, conventional feature-vector models are statistically underdetermined, making them prone to spurious correlations, poor regime transfer, and unstable post hoc explanations, whereas mechanistic pipelines depend on calibrated submodels that are rarely available during early process development. We present PSP-HDC, a graph-structured hyperdimensional computing framework that encodes a directed PSP graph as an internal prior for representation, inference, and explanation. A trainable scalar-to-hypervector encoder learns parameter-specific embeddings on a fixed hyperdimensional basis to accommodate heterogeneous scales and noise. Sample representations are then composed through graph-aligned binding and bundling along directed PSP dependencies, and prediction is performed by associative-memory retrieval against class prototypes. Because the same prototype memories support both decision making and attribution, PSP-HDC provides intrinsic explanations at the parameter, group, and within-group levels, while memory alignment and separation quantify prototype formation during training. On sheet-resistance regime prediction for the 3D platform, PSP-HDC achieves an accuracy of 0.910 +/- 0.077 over 1000 random splits and 0.896 under process-fold generalization, outperforming strong baselines.
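The binding, bundling, and associative-memory retrieval primitives mentioned above are standard hyperdimensional-computing operations and can be sketched with bipolar vectors. The dimensionality and the role/filler setup below are illustrative, not the paper's learned encoder.

```python
import random

random.seed(0)
DIM = 2048  # hypervector dimensionality (illustrative)

def rand_hv():
    """Random bipolar hypervector in {-1, +1}^DIM."""
    return [random.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):
    """Binding (elementwise product): associates two hypervectors."""
    return [x * y for x, y in zip(a, b)]

def bundle(vectors):
    """Bundling (majority rule): superposes several hypervectors."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*vectors)]

def similarity(a, b):
    """Normalized dot product; near 0 for unrelated hypervectors."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

# Tiny associative memory: one prototype per class, retrieval by
# maximum similarity, as in prototype-based HDC classification.
role, filler_a, filler_b = rand_hv(), rand_hv(), rand_hv()
memory = {"class_a": bind(role, filler_a), "class_b": bind(role, filler_b)}

query = bind(role, filler_a)
predicted = max(memory, key=lambda k: similarity(memory[k], query))
```

Because the same prototype vectors serve both retrieval and attribution, explanations come "for free" by inspecting which memory components a query matches, which is the intrinsic-explainability property the abstract highlights.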
[AI-9] Dooly: Configuration-Agnostic Redundancy-Aware Profiling for LLM Inference Simulation
【Quick Read】: This paper addresses the difficulty of evaluating LLM inference configurations across diverse hardware, serving engines, attention backends, and model architectures: existing profile-based simulators must re-profile every operation from scratch for each new configuration, which is prohibitively expensive. The root cause is a missing structural understanding of input dimensions: many model-configuration values (e.g., head size, layer count) recur across models, while a single sweep over the request-dependent dimensions can serve many configurations at once. The key to the solution is Dooly, which performs a single inference pass, labels the origin of each input dimension via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, avoiding manual instrumentation. The resulting database-backed latency regression models serve as a drop-in backend for existing simulators, achieving simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across two GPU platforms while cutting profiling GPU-hours by 56.4%.
Link: https://arxiv.org/abs/2605.07985
Authors: Joon Ha Kim, Geon-Woo Kim, Anoop Rachakonda, Daehyeok Kim
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine’s own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach.
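The redundancy-aware caching idea can be sketched as a latency database keyed on an operation's model-fixed dimensions, so that configurations sharing those dimensions reuse one sweep over the request-dependent ones. The dimension names and the fake linear latency model below are illustrative assumptions, not Dooly's actual implementation.

```python
profiled_sweeps = 0   # counts expensive (simulated) hardware sweeps
latency_db = {}       # keyed on op name + model-fixed dimensions

def profile_sweep(op, fixed_dims):
    """Stand-in for sweeping the request-dependent dimension on real
    hardware; returns latency per sequence length (fake linear model)."""
    global profiled_sweeps
    profiled_sweeps += 1
    return {seq: 1e-6 * seq * fixed_dims["head_size"] for seq in (128, 256)}

def get_latency(op, dims):
    """Profile only if this (op, model-fixed dims) pair is unseen;
    otherwise answer from the database."""
    fixed = {k: v for k, v in dims.items() if k != "seq_len"}
    key = (op, tuple(sorted(fixed.items())))
    if key not in latency_db:
        latency_db[key] = profile_sweep(op, fixed)
    return latency_db[key][dims["seq_len"]]

# Two configurations sharing head_size=128 reuse one profiled sweep;
# only the third call (head_size=64) triggers another.
get_latency("attention", {"head_size": 128, "seq_len": 128})
get_latency("attention", {"head_size": 128, "seq_len": 256})
get_latency("attention", {"head_size": 64, "seq_len": 128})
```

In Dooly, deciding which dimensions belong in the cache key is automated by taint propagation rather than hand-written as here.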
[AI-10] Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions
【Quick Read】: This paper investigates planning-site formation in language models: whether internal representations of structurally constrained future tokens form during the forward pass and whether they causally drive generation. Using rhyming-couplet completion as a clean testbed, the study applies two lightweight methods, linear probing and activation patching, across more than ten scales of the Qwen3, Gemma-3, and Llama-3 families. The key finding is that although future-rhyme information is linearly decodable at the line boundary in all models (with signal strengthening with scale), only Gemma-3-27B causally relies on it, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30; every other model conditions on the rhyme word throughout, with near-zero causal effect at the line boundary. Two-stage path patching localizes the handoff to five attention heads that recover roughly 90% of the rhyme-routing capacity, revealing heterogeneous planning mechanisms across architectures and the key pathways involved.
Link: https://arxiv.org/abs/2605.07984
Authors: Nicole Ma, Nick Rui
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 20 figures, 3 tables
Abstract:We study planning site formation in language models – where internal representations of structurally-constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. We localize the Gemma-3-27B handoff to five attention heads through two-stage path patching that recover ~90% of the rhyme-routing capacity at the newline.
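Activation patching itself is a generic intervention and can be shown on a toy linear network: cache a hidden activation from one run, splice it into another, and measure the output shift. The two-layer model below is invented for illustration; the paper applies the same logic per layer and per position inside real transformers.

```python
W1 = [[1.0, 0.0], [0.0, 1.0]]   # layer-1 weights (toy, illustrative)
W2 = [2.0, -1.0]                # layer-2 weights

def hidden(x):
    """The 'activation' we will cache and patch."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W1]

def forward(x, patch=None):
    h = hidden(x) if patch is None else patch  # the intervention point
    return sum(w * hi for w, hi in zip(W2, h))

clean, corrupt = [1.0, 0.0], [0.0, 1.0]
clean_out = forward(clean)      # 2.0
corrupt_out = forward(corrupt)  # -1.0

# Patch the clean run's hidden activation into the corrupt run: if the
# output moves back toward clean_out, that activation causally carries
# the behavior -- the same logic as the paper's patching experiments.
patched_out = forward(corrupt, patch=hidden(clean))
```

A probe can decode information from an activation that patching shows is causally inert, which is exactly the probe-versus-patching dissociation the abstract reports for most models at the line boundary.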
[AI-11] The Limits of AI-Driven Allocation: Optimal Screening under Aleatoric Uncertainty
【Quick Read】: This paper studies how to optimally combine screening with algorithmic targeting based on predicted risk scores in resource-constrained policy and humanitarian settings. The core challenge: even with access to the true conditional vulnerability probability, irreducible individual-level aleatoric uncertainty means purely algorithmic targeting inevitably misallocates some resources. The paper proposes a two-stage allocation framework, in which a screening stage physically verifies true outcomes for a subset of units before a final allocation stage assigns the resource under a fixed coverage budget, and shows the optimal strategy screens units at the margin of algorithmic allocation while directly targeting the highest-risk units. A key empirical finding is that screening and algorithmic targeting act as complements when aleatoric uncertainty in the population is high, yielding substantial efficiency gains, and can act as substitutes otherwise. Applications to income-based social protection and humanitarian demining in Colombia illustrate the framework's practical value in trading off screening costs against allocation efficiency.
Link: https://arxiv.org/abs/2605.07979
Authors: Santiago Cortes-Gomez, Mateo Dulce Rubio, Carlos Patino, Bryan Wilder
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:The rise of machine learning has shifted targeted resource allocation in policy and humanitarian settings toward algorithmic targeting based on predicted risk scores. This approach is typically cheaper and faster than traditional screening procedures that directly observe the latent vulnerability status through physical verification. Yet, even access to the true conditional vulnerability probability cannot eliminate misallocation: aleatoric uncertainty over individual vulnerability status is irreducible, and probabilistic targeting inevitably misallocates some resources. In this work we study how screening and algorithmic targeting should be optimally combined in a two-stage allocation framework where a screening stage observes true outcomes for a subset of units before a final allocation stage assigns the resource under a fixed coverage budget. We show that the optimal strategy screens units at the margin of algorithmic allocation, while directly targeting the highest-risk units. Furthermore, we empirically characterize when screening and algorithmic targeting act as complements or substitutes: efficiency gains from screening grow as the aleatoric uncertainty in the population increases. We illustrate our framework with applications in income-based social protection programs and humanitarian demining in Colombia, where the tension between screening costs and allocation efficiency is operationally consequential.
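The qualitative result above (screen at the margin, target the clear cases directly) can be sketched as follows. The function and its arguments are invented for illustration and do not reproduce the paper's optimality analysis.

```python
def allocate_with_screening(risks, budget, n_screen, verify):
    """Allocate `budget` resources over units with risk scores `risks`.
    Clear top cases are targeted directly; the 2 * n_screen units
    nearest the allocation cutoff are screened with `verify`, which
    reveals a unit's true latent status; verified-vulnerable screened
    units fill the remaining slots."""
    order = sorted(range(len(risks)), key=lambda i: -risks[i])
    direct = order[: budget - n_screen]                    # clear cases
    margin = order[budget - n_screen : budget + n_screen]  # near cutoff
    screened = [i for i in margin if verify(i)]
    remaining = budget - len(direct)
    return set(direct) | set(screened[:remaining])

truth = {0: True, 1: True, 2: False, 3: True, 4: False, 5: False}
chosen = allocate_with_screening(
    [0.9, 0.8, 0.55, 0.5, 0.45, 0.1], budget=3, n_screen=1,
    verify=lambda i: truth[i])
# Screening unit 2 (high score but not actually vulnerable) redirects
# its slot to unit 3, which pure score-based targeting would miss.
```

Pure top-k targeting would pick units {0, 1, 2} here; spending one screening slot at the margin recovers unit 3 instead, illustrating why screening has the most value where risk scores are least decisive.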
[AI-12] It Just Takes Two: Scaling Amortized Inference to Large Sets
【速读】:该论文旨在解决神经后验估计(neural posterior estimation)在处理集合型条件变量时的计算效率问题,尤其是在部署阶段集合大小 N 较大时,传统方法需在 N 的规模上训练模型,导致内存和计算资源迅速成为瓶颈。其核心解决方案是将表示学习与后验建模解耦:首先在一个最多包含两个元素的集合上训练一个均值池化(mean-pool)的 Deep Set 编码器,从而获得可泛化到任意集合大小的特征表示;随后仅对预聚合的嵌入向量微调推理头(inference head),使得训练成本几乎与部署集合大小 N 无关,显著降低计算开销,同时在多种基准任务中达到或超越现有基线性能。
链接: https://arxiv.org/abs/2605.07972
作者: Antoine Wehenkel,Michael Kagan,Lukas Heinrich,Chris Pollard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Neural posterior estimation has emerged as a powerful tool for amortized inference, with growing adoption across scientific and applied domains. In many of these applications, the conditioning variable is a set of observations whose elements depend not only on the target but also on unknown factors shared across the set. Optimal inference therefore requires treating the set jointly, which in turn requires training the estimator at the deployment set size – a regime where memory and compute quickly become prohibitive. We introduce a simple, theoretically grounded strategy that decouples representation learning from posterior modeling. Our method trains a mean-pool Deep Set on sets of size at most two, producing an encoder that generalizes to arbitrary set sizes. The inference head is then finetuned on pre-aggregated embeddings, making training cost essentially independent of the deployment set size N. Across scalar, image, multi-view 3D, molecular, and high-dimensional conditional generation benchmarks with N in the thousands, our approach matches or outperforms standard baselines at a fraction of the compute.
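该方法可行的核心性质是均值池化对集合大小不敏感:在 |集合| ≤ 2 上训练的逐元素编码器可直接作用于任意 N。下面用 numpy 给出一个仅说明这一形状不变性的草图(随机权重的两层 MLP 作为逐元素编码器,权重与维度均为示意取值,非论文实现):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 16)), rng.normal(size=(16, 8))

def phi(x):
    """Per-element encoder: (N, 3) set elements -> (N, 8) features."""
    return np.tanh(x @ W1) @ W2

def encode(x_set):
    """Mean pooling is size-agnostic: train on sets of size <= 2, deploy on any N.
    The inference head only ever sees this pooled embedding, so it can be
    finetuned on pre-aggregated vectors at a cost independent of N."""
    return phi(x_set).mean(axis=0)
```

由于推理头只接收池化后的固定维嵌入,微调成本与部署集合大小 N 基本无关,这正是摘要所述的效率来源。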
[AI-13] Exploring the non-convexity in machine learning using quantum-inspired optimization
【速读】:该论文旨在解决现代机器学习中高维非凸优化问题的挑战,尤其是在存在严重异常值干扰的情况下,传统方法(如凸松弛或局部搜索启发式算法)容易陷入次优局部极小值,难以恢复真实的离散结构。其解决方案的关键在于将非凸优化视为全局搜索问题,并提出基于量子启发式进化优化(Quantum-Inspired Evolutionary Optimization, QIEO)的统一框架;该框架通过受量子叠加态启发的概率表示机制,在搜索空间中保持全局视角,从而能够穿越传统梯度法和贪心算法所困的局部极小值陷阱,实现更优的结构保真度与鲁棒性。
链接: https://arxiv.org/abs/2605.07947
作者: Kandula Eswara Sai Kumar,Parth Dhananjay Danve,Abhishek Chopra,Rut Lineswala
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:
Abstract:The escalating complexity of modern machine learning necessitates solving challenging non-convex optimization problems, particularly in high-dimensional regimes and scenarios contaminated by gross outliers. Traditional approaches, relying on convex relaxations or specialized local search heuristics, frequently succumb to suboptimal local minima and fail to recover the true underlying discrete structures. In this paper, we propose treating these non-convex challenges as a global search problem and introduce a unified framework based on Quantum-Inspired Evolutionary Optimization (QIEO). By leveraging a probabilistic representation inspired by quantum superposition, QIEO maintains a global view of the search space, enabling it to tunnel through local optima that trap conventional gradient-based and greedy solvers. We comprehensively evaluate QIEO across diverse non-convex applications, including sparse signal recovery (gene expression analysis and compressed sensing) and robust linear regression. Extensive benchmarking against state-of-the-art continuous solvers (ADAM, Differential Evolution), classical metaheuristics (Genetic Algorithms), and specialized non-convex algorithms (Iterative Hard Thresholding) demonstrates that QIEO consistently achieves superior structural fidelity, lower mean squared error, and enhanced robustness without support inflation. Our findings suggest that embracing a quantum-inspired global search provides a resilient, unified paradigm for overcoming the inherent intractability of discrete nonconvex machine learning landscapes.
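QIEO 的基本机制可用经典量子启发进化算法的骨架来示意:用一个概率向量(量子叠加态的经典类比)维持对解空间的全局视图,采样种群后将概率向最优解"旋转"。以下为 OneMax 玩具问题上的假设性草图,步长与裁剪区间均为示意取值,与论文算法细节无关:

```python
import numpy as np

def qieo_onemax(n_bits=20, pop=30, iters=80, step=0.05, seed=0):
    """Toy quantum-inspired evolutionary loop (illustrative, not the paper's code)."""
    rng = np.random.default_rng(seed)
    p = np.full(n_bits, 0.5)              # uniform "superposition" over bit-strings
    best, best_f = None, -1
    for _ in range(iters):
        cand = rng.random((pop, n_bits)) < p          # "collapse": sample a population
        fits = cand.sum(axis=1)                       # OneMax fitness
        i = int(fits.argmax())
        if fits[i] > best_f:
            best, best_f = cand[i], int(fits[i])
        # Rotate probabilities toward the best-so-far, clipped to keep exploration.
        p = np.clip(p + step * (2.0 * best - 1.0), 0.05, 0.95)
    return best_f
```

概率被裁剪在 [0.05, 0.95] 内以保留探索能力,这正是此类方法"穿越"局部极小的直观来源。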
[AI-14] INO-SGD: Addressing Utility Imbalance under Individualized Differential Privacy ICLR-26
【速读】:该论文旨在解决个体化差分隐私(Individualized Differential Privacy, IDP)框架下存在的效用失衡问题:在满足不同数据所有者个性化隐私需求时,对隐私要求更强的数据子集在训练模型中可能被严重欠采样,导致模型在部署阶段对这类敏感数据的预测性能显著下降。解决方案的关键在于提出INO-SGD算法,该算法通过在每轮迭代中对批次内数据进行策略性降权(strategically down-weighting),以提升对高隐私保护要求数据的学习效果,同时确保整体机制仍严格满足IDP约束——这一特性使得现有缓解效用失衡的方法难以直接适配或实现。
链接: https://arxiv.org/abs/2605.07930
作者: Xiao Tian,Jue Fan,Rachael Hwee Ling Sim,Bryan Kian Hsiang Low
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 14th International Conference on Learning Representations (ICLR-26)
Abstract:Differential privacy (DP) is widely employed in machine learning to protect confidential or sensitive training data from being revealed. As data owners gain greater control over their data due to personal data ownership, they are more likely to set their own privacy requirements, necessitating individualized DP (IDP) to fulfil such requests. In particular, owners of data from more sensitive subsets, such as positive cases of stigmatized diseases, likely set stronger privacy requirements, as leakage of such data could incur more serious societal impact. However, existing IDP algorithms induce a critical utility imbalance problem: Data from owners with stronger privacy requirements may be severely underrepresented in the trained model, resulting in poorer performance on similar data from subsequent users during deployment. In this paper, we analyze this problem and propose the INO-SGD algorithm, which strategically down-weights data within each batch to improve performance on the more private data across all iterations. Notably, our algorithm is specially designed to satisfy IDP, while existing techniques addressing utility imbalance neither satisfy IDP nor can be easily adapted to do so. Lastly, we demonstrate the empirical feasibility of our approach.
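个体化差分隐私的一种常见实现思路是按个体隐私预算设定逐样本裁剪阈值,INO-SGD 在此之上再对批内数据加权。下面是一个假设性的单步示意:裁剪规则、噪声标定与 `weights` 参数均为简化设定,并非论文的精确机制,也不构成正式的隐私会计:

```python
import numpy as np

def idp_sgd_step(grads, eps, weights=None, base_clip=1.0, sigma=1.0, seed=0):
    """Sketch of one individualized-DP gradient step: stricter budgets (smaller
    eps) get smaller clipping norms; `weights` stands in for INO-SGD's batch
    re-weighting. Calibration here is illustrative only."""
    rng = np.random.default_rng(seed)
    if weights is None:
        weights = np.ones(len(grads))
    clips = base_clip * eps / eps.max()                    # per-example clip norms (assumption)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads * np.minimum(1.0, clips[:, None] / np.maximum(norms, 1e-12))
    total = (weights[:, None] * clipped).sum(axis=0)
    noisy = total + rng.normal(0.0, sigma * base_clip, grads.shape[1])
    return noisy / len(grads)
```

示意中隐私要求更强(eps 更小)的样本贡献被裁剪得更狠,这正是摘要所述效用失衡的直观来源,也是加权项要补偿的对象。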
[AI-15] AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)驱动的智能体在面对长程依赖约束下的工具调用与推理能力不足的问题,尤其关注其在非熟悉工作流和远距离状态关联场景中的表现。解决方案的关键在于提出一个类逃脱房间(escape-room-style)的基准测试集 AgentEscapeBench,该基准通过定义工具与物品之间的有向无环依赖图(directed acyclic dependency graph),强制智能体执行真实外部函数、持续追踪隐含状态、传递中间结果并提交可确定性验证的答案,从而系统性评估其长期推理与适应能力。实验表明,随着依赖深度增加,无论是人类还是模型性能均显著下降,且模型失败主要源于长程状态跟踪、线索遵循及中间结果传播的失效,凸显了现有智能体在深层上下文建模方面的局限性。
链接: https://arxiv.org/abs/2605.07926
作者: Zhengkang Guo,Yiyang Li,Lin Qiu,Xiaohua Wang,Jingwen Xv,Dongyu Ru,Xiaoyu Li,Xiaoqing Zheng,Xuezhi Cao,Xunliang Cai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.
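基准中每个任务定义了工具/物品上的有向无环依赖图,代理的动作序列必须满足全部前置依赖。其核心校验逻辑可示意如下(数据结构为笔者假设;基准实际通过真实外部函数调用与确定性判分实现):

```python
def valid_order(actions, deps):
    """Check that an action sequence respects a DAG of prerequisites.
    deps maps each action to the set of actions that must precede it."""
    done = set()
    for a in actions:
        if not deps.get(a, set()) <= done:
            return False            # a prerequisite has not been satisfied yet
        done.add(a)
    return True
```

依赖深度越深,代理需要跨越的"必须先完成"链条越长,这对应摘要中随难度层级上升而急剧下降的成功率。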
[AI-16] BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing
【速读】:该论文旨在解决无监督状态下从非发声生物信号(如蜜蜂嗡鸣)中发现结构化特征的问题,传统生物声学方法通常依赖发声模型假设或预定义语义单元,难以适用于非发声物种。其解决方案的关键在于提出BeeVe框架,该框架采用冻结的自监督Patchout Spectrogram Transformer(PaSST)作为特征提取器,随后在无标签的嵌入空间上训练向量量化变分自编码器(VQ-VAE),直接从未标注蜂巢音频中学习有限离散的声学标记(acoustic tokens)代码本。整个过程无需标签、预训练任务或对比目标,最终成功识别出与蜂后状态相关的稳定声学子状态,并验证了跨实验和未见录音的结构一致性与泛化能力,证明了无监督离散码本学习可有效揭示非发声生物信号中的重复性声学结构。
链接: https://arxiv.org/abs/2605.07903
作者: Hamze Hammami,Nidhal Abdulaziz
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeVe, an unsupervised framework for acoustic state discovery in collective honey bee buzzing. BeeVe uses the self-supervised Patchout Spectrogram Transformer (PaSST) as a frozen feature extractor, then trains a Vector-Quantized Variational Autoencoder (VQ-VAE) without labels on those embeddings, learning a finite discrete codebook of acoustic tokens directly from unlabelled hive audio. No labels, pretext tasks, or contrastive objectives are used at any stage. Post-hoc evaluation against known queen status reveals that the learned tokens separate queenright and queenless conditions with Jensen-Shannon Divergence values between 0.609 and 0.688, and that the queenless condition further decomposes into three internally coherent sub-states stable across experiments with different codebook sizes and random seeds. Token transition analysis confirms non-random sequential structure (p < 0.001) across all experiments. Generalisation to unseen recordings preserves both token overlap (Jaccard = 0.947) and global manifold topology. These results demonstrate that unsupervised discrete codebook learning can recover repeatable acoustic structure from a non-vocal biological signal without annotation, opening a path toward non-invasive acoustic hive health monitoring.
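BeeVe 将冻结的 PaSST 嵌入量化为离散声学标记,核心是最近邻码本查找。下面是该量化步骤的极简示意(论文中码本由 VQ-VAE 学习得到,此处直接给定,仅演示离散化本身):

```python
import numpy as np

def quantize(embeddings, codebook):
    """Assign each frame embedding to its nearest codebook vector (L2),
    yielding the discrete acoustic tokens used for state discovery."""
    d = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)
```

得到的标记序列随后可做分布比较(如摘要中的 Jensen-Shannon 散度)与转移结构分析。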
[AI-17] What if AI systems weren't chatbots?
【速读】:该论文试图解决的问题是:当前人工智能(AI)发展过度聚焦于对话式聊天机器人(chatbot)界面,这种趋势并非中立的技术选择,而是一种主导性的社会技术配置,其广泛采用正在重塑社会、经济、法律和环境系统,并带来结构性负面影响,如用户需求满足不足、技能退化、知识同质化、劳动替代、权力集中及环境成本上升。解决方案的关键在于摆脱“一刀切”的聊天机器人范式,转向多元化的系统设计(pluralistic system design),开发任务特定工具(task-specific tools),并建立制度性保障机制(institutional safeguards),以增强AI的领域特异性、问责制与长期社会可持续性。
链接: https://arxiv.org/abs/2605.07896
作者: Sourojit Ghosh,Pranav Narayanan Venkit,Sanjana Gautam,Avijit Ghosh
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted at The 2026 ACM Conference on Fairness, Accountability, and Transparency, June 25–28, 2026, Montreal, QC, Canada
Abstract:The rapid convergence of artificial intelligence (AI) toward conversational chatbot interfaces marks a critical moment for the industry. This paper argues that the chatbot paradigm is not a neutral interface choice, but a dominant sociotechnical configuration whose widespread adoption reshapes social, economic, legal, and environmental systems. We examine how treating AI primarily as conversational assistants has extensive structural downsides. We show how chatbot-based systems often fail to adequately meet user needs, particularly in complex or high-stakes contexts, while projecting confidence and authority. We further analyze how the normalization of chatbot-mediated interaction alters patterns of work, learning, and decision-making, contributing to deskilling, homogenization of knowledge, and shifting expectations of expertise. Finally, we examine broader societal effects, including labor displacement, concentration of economic power, and increased environmental costs driven by sustained investment in large-scale chatbot infrastructures. While acknowledging legitimate benefits, we argue that the current trajectory of AI development reflects specific value choices that prioritize conversational generality over domain specificity, accountability, and long-term social sustainability. We conclude by outlining alternative directions for AI development and governance that move beyond one-size-fits-all chatbots, emphasizing pluralistic system design, task-specific tools, and institutional safeguards to mitigate social and economic harm.
[AI-18] On the Tradeoffs of On-Device Generative Models in Federated Predictive Maintenance Systems
【速读】:该论文旨在解决在联邦学习(Federated Learning, FL)框架下,如何有效利用生成式模型(如变分自编码器 VAE、生成对抗网络 GAN 和扩散模型 DM)进行时间序列异常检测,以支持关键工业基础设施中的预测性维护(Predictive Maintenance, PdM)。其核心挑战在于平衡模型性能、通信开销与跨设备数据异质性(non-IID)之间的权衡。解决方案的关键在于提出一种新颖的联邦生成模型分类法,将部分组件共享(partial component sharing)形式化为一种有原则的个性化机制;实验表明,在带宽受限和非独立同分布(non-IID)环境下,针对不同生成模型采用差异化共享策略(如扩散模型中仅共享解码器)可显著提升模型稳定性与可扩展性,优于传统的全联邦训练方式。
链接: https://arxiv.org/abs/2605.07860
作者: Usevalad Milasheuski,Piero Baraldi,Enrico Zio,Stefano Savazzi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated Learning (FL) has emerged as a promising paradigm for preserving client data ownership and control over distributed Internet of Things (IoT) environments. While discriminative models dominate most FL use cases, recent advances in generative models – such as Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and Diffusion Models (DM) – offer new opportunities for unsupervised anomaly detection in time series analysis, with relevant applications in predictive maintenance (PdM) in critical industrial infrastructures. In this work, we present a comprehensive analysis of VAEs, GANs, and DMs in the context of federated PdM. We analyze their performance and communication overhead under both full and partial federation setups, where only subsets of model components are shared. Building on this analysis, the paper proposes a novel taxonomy for federated generative models that formalizes partial component sharing as a principled mechanism for model personalization. Our experiments over a real-world time series dataset reveal distinct trade-offs in model utility, stability, and scalability, especially in heterogeneous and bandwidth-constrained FL settings. For the evaluated GAN-based configurations, full federation improves training stability relative to independent local training, although the model remains less robust than the VAE- and DDPM-based alternatives. For DMs, however, partial federation – especially decoder sharing – can outperform full federation in bandwidth-constrained, non-IID settings.
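文中"部分联邦"(如扩散模型仅共享解码器)的聚合逻辑可示意如下:仅对名称匹配共享前缀的参数做联邦平均,其余参数在各客户端个性化保留。参数命名与前缀均为笔者假设:

```python
def partial_fedavg(client_states, shared_prefixes=("decoder.",)):
    """Average only tensors whose names start with a shared-component prefix
    (e.g. decoder sharing); everything else stays personalized per client."""
    shared = {k for k in client_states[0] if k.startswith(shared_prefixes)}
    avg = {k: sum(s[k] for s in client_states) / len(client_states) for k in shared}
    return [{**s, **avg} for s in client_states]
```

这样通信量只取决于共享子集的大小,也自然形成论文分类法中"部分组件共享即个性化"的机制。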
[AI-19] VISTA: Decentralized Machine Learning in Adversary Dominated Environments
【速读】:该论文旨在解决去中心化机器学习中由恶意主导(adversary-dominated)的计算环境下的鲁棒性问题,即当攻击者控制多数工作节点时,传统基于诚实多数假设的聚合方法失效的问题。其核心解决方案是提出一种激励导向的框架——VISTA算法,通过动态调整梯度报告的接受阈值来平衡早期收敛速度与对抗污染风险:该机制迫使攻击者从纯粹破坏者转变为理性决策者,权衡提高估计误差与被拒绝及失去奖励的风险;同时利用优化历史自适应调节阈值,确保在无诚实多数前提下仍能保持与标准随机梯度下降(SGD)相当的渐近收敛性能。
链接: https://arxiv.org/abs/2605.07841
作者: Hanzaleh Akbari Nodehi,Parsa Moradi,Soheil Mohajer,Mohammad Ali Maddah-Ali
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Decentralized machine learning often relies on outsourcing computations, such as gradient evaluations, to untrusted worker nodes. Existing robust aggregation methods can mitigate malicious behavior under honest-majority assumptions, but may fail when adversaries control a majority of the workers. We study this adversary-dominated setting through an incentive-oriented framework in which reports are accepted and rewarded only when they are mutually consistent up to a threshold. This turns the adversary from a pure saboteur into a rational agent that trades off increasing estimation error against the risk of rejection and loss of reward. We consider iterative optimization under this model. Unlike one-shot computation, iterative learning requires long-horizon decisions: permissive acceptance rules enable faster early progress but admit more adversarial corruption, while strict rules improve estimation accuracy but cause frequent rejections. We propose $\mathsf{VISTA}$, an adaptive algorithm that tunes the acceptance threshold using the optimization history. Numerical results show that $\mathsf{VISTA}$ improves convergence over static thresholds. We also provide a rigorous convergence analysis showing that, with suitable incentive-aware adaptation, adversary-dominated decentralized learning can retain the asymptotic convergence behavior of standard SGD without relying on an honest majority.
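VISTA 的接受机制可示意为:仅接受在阈值 τ 内彼此一致的梯度报告并给予奖励。以下草图用"与最多同伴一致的种子报告"来选取接受集;具体一致性规则与阈值的历史自适应调节是论文的贡献,此处仅为假设性简化:

```python
import numpy as np

def threshold_aggregate(reports, tau):
    """Accept reports within distance tau of a seed report that agrees with
    the most others, then average the accepted set (illustrative rule)."""
    n = len(reports)
    close = np.array([[np.linalg.norm(reports[i] - reports[j]) <= tau
                       for j in range(n)] for i in range(n)])
    seed = int(close.sum(axis=1).argmax())       # report consistent with the most peers
    accepted = [i for i in range(n) if close[seed][i]]
    return reports[accepted].mean(axis=0), accepted
```

τ 越宽松,接受越多但污染越重;τ 越严格,估计越准但拒绝越频繁——这正是 VISTA 自适应调节所权衡的两端。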
[AI-20] Exact Regular-Constrained Variable-Order Markov Generation via Sparse Context-State Belief Propagation
【速读】:该论文旨在解决在变量阶马尔可夫模型(variable-order Markov model)中如何精确地施加正则约束(regular constraints)的问题。传统方法仅适用于一阶马尔可夫链,而变量阶模型通过动态选择历史中最长可用后缀进行预测,其状态空间与正则约束自动机的组合需重新定义以保持精确性。解决方案的关键在于识别出应在何种状态空间上运行已有的信念传播(belief propagation, BP)-正则机制:具体而言,将原一阶马尔可夫状态替换为观测到的上下文状态(observed context state),再与正则约束自动机进行标准乘积构造,从而实现对变量阶分布的精确条件化,同时避免展开所有K元组。此方法在固定训练上下文图和自动机时,推理复杂度为序列长度的线性关系;一般情况下为可达乘积边数的多项式复杂度,且支持基于逆计数查找的可逆数据增强,无需存储变换后的语料库。
链接: https://arxiv.org/abs/2605.07839
作者: François Pachet
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Variable-order Markov models generate sequences over a finite alphabet by conditioning each symbol on the longest available suffix of the generated history. Regular constraints, by contrast, describe finite-horizon control requirements by an automaton: fixed positions, forced endings, metrical patterns, and forbidden copied fragments are all special cases. Existing exact methods already handle regular constraints with belief propagation for first-order Markov chains. The contribution here is the variable-order extension: identifying the state space on which the existing BP-regular machinery must be run when the generator is a variable-order/backoff model. A first-order constraint layer can enforce useful support conditions, but it computes future mass after merging histories that a variable-order generator deliberately keeps distinct. We formalize this mismatch and give the sparse construction obtained by replacing the first-order Markov state with the observed context state, then taking the standard product with the regular constraint automaton. For a fixed trained context graph and automaton, inference is linear in the sequence horizon; in general it is polynomial in the number of reachable product edges. This gives the correct variable-order distribution conditioned on regular constraints without expanding to all K-tuples. The same finite-source interface supports reversible data augmentation by inverse count lookup, matching materialized transposition augmentation without storing transformed corpora. We also separate exact BP inference from generation-time backoff policies, such as singleton avoidance, whose stochastic semantics must be made explicit if exactness is claimed.
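论文的核心构造是在"观测上下文状态 × 约束自动机"的乘积空间上做信念传播。下面给出后向消息传递的一个极简示意(容器结构为笔者假设;示例约束为"长度为 2 的序列中至少出现一次 a"):

```python
def backward_mass(horizon, contexts, states, trans, step, accept):
    """mass[t][(c, q)]: probability that a random suffix of length horizon-t,
    generated from context c, drives the automaton from q into an accepting
    state. trans[c][sym] = (prob, next_context); step[q][sym] = next state."""
    mass = [dict() for _ in range(horizon + 1)]
    for c in contexts:
        for q in states:
            mass[horizon][(c, q)] = 1.0 if q in accept else 0.0
    for t in range(horizon - 1, -1, -1):
        for c in contexts:
            for q in states:
                mass[t][(c, q)] = sum(p * mass[t + 1][(c2, step[q][s])]
                                      for s, (p, c2) in trans[c].items())
    return mass
```

采样时用 mass 对每个符号的前向概率重新加权即可得到精确的受约束条件分布;复杂度由可达乘积边数决定,无需展开全部 K 元组。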
[AI-21] Approximation-Free Differentiable Oblique Decision Trees
【速读】:该论文旨在解决训练高精度斜决策树(oblique decision trees, DTs)时面临的挑战,包括复杂的优化景观和过拟合风险,尤其是在回归任务中。现有方法通常依赖近似策略,如通过概率软化边界(软决策树)或使用量化梯度(如直通估计器 STE),这限制了模型的准确性和训练稳定性。解决方案的关键在于提出一种全新的、语义等价且可逆的硬斜决策树表示法——DTSemNet,它将决策树结构映射为神经网络,从而支持标准梯度下降的端到端训练,无需任何近似。此外,针对回归任务中内部节点与叶节点回归器联合优化的难点,作者进一步引入了退火Top-k梯度方法,在不依赖近似的情况下提供精确的梯度信号,显著提升了模型性能。
链接: https://arxiv.org/abs/2605.07837
作者: Subrat Prasad Panda,Blaise Genest,Arvind Easwaran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for publication in JMLR, Vol. 27, 2026
Abstract:Decision Trees (DTs) are widely used in safety-critical domains such as medical diagnosis, valued for their interpretability and effectiveness on tabular data. However, training accurate oblique DTs is challenging due to complex optimization landscapes and overfitting risks, particularly in regression. Recent advances have introduced differentiable formulations that enable gradient-based training and joint optimization of decision boundaries and leaf regressors. Yet, existing approaches typically rely on approximations, either through probabilistic softening of boundaries (soft DTs) or quantized gradients such as the Straight-Through Estimator (STE). To overcome these limitations, we propose DTSemNet, a novel, semantically equivalent, and invertible representation of hard oblique DTs as neural networks. DTSemNet enables end-to-end training with standard gradient descent, eliminating the need for approximations in both classification and regression. While classification aligns naturally with this formulation, regression remains challenging due to the joint optimization of internal nodes and leaf regressors. To address this, we analyze the limitations of STE and introduce an annealed Top-k method that provides accurate gradient signals without approximation. Extensive experiments on classification and regression benchmarks show that DTSemNet-trained oblique DTs outperform state-of-the-art differentiable DTs. Furthermore, we demonstrate that DTSemNet can serve as programmatic DT policies in reinforcement learning environments, thereby broadening their applicability.
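斜决策树的"硬"推理即在每个内部节点做线性超平面测试并硬路由;DTSemNet 的贡献在于把这棵硬树等价、可逆地表示为可用标准梯度训练的网络。树本身的推理可示意如下(节点编码方式为笔者假设):

```python
import numpy as np

def oblique_tree_predict(x, nodes, leaves):
    """Hard oblique decision tree inference. nodes[i] = (w, b, left, right);
    the test w.x + b > 0 routes right, else left. A negative child index i
    addresses leaves[-i - 1]."""
    i = 0
    while i >= 0:
        w, b, left, right = nodes[i]
        i = right if x @ w + b > 0 else left
    return leaves[-i - 1]
```

正因为每步只是一次线性测试加 argmax 式路由,整棵树才能被改写为一个语义等价的网络层序列。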
[AI-22] CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
【速读】:该论文旨在解决生成式 AI(Generative AI)在进攻性网络安全场景中作为自主代理时,其攻击行为存在显著且可量化的选择偏倚(attack-selection bias)的问题。研究发现,不同代理即使面对相同提示(prompt),也会倾向于集中于特定攻击家族,表现出稳定的攻击偏好模式,而非基于任务目标动态调整策略。解决方案的关键在于构建了 CyBiasBench 基准测试平台,通过 630 次会话系统评估五个代理在三类目标和四种提示条件下的攻击分布特性,量化了偏倚的强度与一致性,并揭示了这种偏倚更接近于代理本身的“特质”而非攻击成功率的函数;进一步实验证明,这种偏倚具有“动量效应”,即代理对偏离其固有偏好的攻击方向具有抵抗性,强行干预无法提升攻击性能。这一发现为理解代理决策机制、设计更可控的 AI 安全工具提供了关键依据。
链接: https://arxiv.org/abs/2605.07830
作者: Taein Lim,Seongyong Ju,Munhyeok Kim,Hyunjun Kim,Hoki Kim
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents in offensive cybersecurity. In this paper, we reveal an interesting phenomenon: different agents exhibit distinct attack patterns. Specifically, each agent exhibits an attack-selection bias, disproportionately concentrating its efforts on a narrow subset of attack families regardless of prompt variations. To systematically quantify this behavior, we introduce CyBiasBench, a comprehensive 630-session benchmark that evaluates five agents on three targets and four prompt conditions with ten attack families. We identify explicit bias across agents, with different dominant attack families and varying entropy levels in their attack-family allocation distributions. Such bias is better characterized as a trait of the agents, rather than a factor associated with the attack success rate. Furthermore, our experiments reveal a bias momentum effect, where agents resist explicit steering toward attack families that conflict with their bias. This forced distribution shift does not yield measurable improvements in attack performance. To ensure reproducibility and facilitate future research, we release an interactive result dashboard at this https URL and a reproducibility artifact with aggregated session-level statistics and full evaluation scripts at this https URL.
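文中用 Jensen-Shannon 散度量化不同条件下攻击家族分布的差异(以 2 为底时取值落在 [0, 1],摘要报告的 0.609–0.688 即此量)。其计算可示意如下:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions,
    using base-2 logs so the result lies in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

对每个代理统计其会话在十个攻击家族上的经验分布后,即可用该度量(连同熵)刻画攻击选择偏倚的强度。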
[AI-23] Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning
【速读】:该论文旨在解决在线蒸馏(On-policy Distillation, OPD)在长时序任务中因学生模型生成前缀与教师模型思维过程发生漂移(prefix drift)而导致奖励信号局部不可利用的问题。当学生生成的序列偏离教师的推理路径时,教师提供的密集奖励失去有效性,继续基于这些“漂移”轨迹进行token生成和评估不仅降低奖励质量,还造成大量计算资源浪费。解决方案的关键在于提出Prune-OPD框架,其核心机制是通过实时监测学生与教师预测之间的局部兼容性(如top-k重叠度),动态识别前缀漂移事件,并在此基础上单调下调后续不可靠奖励的权重,同时触发动态回溯截断(dynamic rollout truncation)。这使得训练过程能够及时终止无效生成,将计算资源精准分配给局部可利用的教师监督信号,从而实现计算预算与监督可靠性的一致性对齐。
链接: https://arxiv.org/abs/2605.07804
作者: Zhicheng Yang,Zhijiang Guo,Yifan Song,Minrui Xu,Yongxin Wang,Yiwei Wang,Xiaodan Liang,Jing Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 8 figures
Abstract:On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6%–68.0% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
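Prune-OPD 以学生与教师 top-k 预测的重叠度作为前缀漂移的实时信号。该兼容性度量本身可示意如下(k 的取值与触发降权/截断的阈值是论文中的可调设定,此处仅演示度量):

```python
def topk_overlap(student_logits, teacher_logits, k=5):
    """Fraction of the student's top-k next-token candidates that also appear
    in the teacher's top-k; low values signal prefix drift."""
    def topk(logits):
        return set(sorted(range(len(logits)), key=logits.__getitem__)[-k:])
    return len(topk(student_logits) & topk(teacher_logits)) / k
```

当该重叠度持续走低时,继续生成与评估已无意义,训练循环即可降低后续奖励权重并提前截断本条回溯。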
[AI-24] Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning
【速读】:该论文旨在解决基础模型(Foundation Models)在训练过程中计算资源消耗大、收敛速度慢以及泛化能力不足的问题。其核心解决方案是提出PIQL(Privileged Information for Quick and Quality Learning)框架,首次系统性地将特权信息(Privileged Information, PI)引入表格基础模型(Tabular Foundation Models, TFMs)的训练中,以同时加速学习过程并提升泛化性能。关键创新在于构建两类互补的PI:一是数据集层面的统计量,用于减轻上下文学习负担;二是数据生成程序的编码表示,提供观测数据之外的知识。此外,设计了一种架构,在训练时利用PI并学习从推理时可观察的上下文重建PI,从而实现训练阶段独有信息的有效迁移。理论分析表明,在有限数据条件下,PI可缩小总体近似误差并加快收敛速度;实验验证了PIQL能显著提升模型收敛速度、降低最终损失并增强泛化能力,从而减少对数据和计算资源的需求。
链接: https://arxiv.org/abs/2605.07799
作者: Xueying Ding,Leman Akoglu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training foundation models is computationally intensive and often slow to converge. We introduce PIQL (Privileged Information for Quick and Quality Learning), the first framework to systematically integrate privileged information (PI) to simultaneously accelerate learning and improve generalization in tabular foundation models (TFMs). We construct two complementary forms of PI: (i) aggregate dataset-level statistics that reduce the burden on in-context learning, and (ii) encodings of the underlying data-generating program, providing knowledge beyond observable data. We further design an architecture that effectively transfers the train-time-only PI by learning to reconstruct it from observed context at inference. We provide a theoretical analysis characterizing conditions under which PI reduces the population-level approximation gap and accelerates convergence in finite-data regimes. Empirical evidence shows that PIQL enables TFMs to achieve faster convergence, lower final loss, and better generalization, in effect, reducing data and compute requirements. Our work establishes PI-guided pretraining as a principled and practical paradigm for improving the efficiency and performance of foundation models.
[AI-25] Neural Operators as Efficient Function Interpolators
【速读】:该论文旨在解决传统神经网络在处理高维函数映射任务时参数效率低、训练时间长的问题,尤其是在科学计算中对复杂函数关系建模的挑战。其解决方案的关键在于将有限维函数重新表述为作用于辅助基空间(auxiliary base-space)上的算子,从而利用神经算子(Neural Operators, NOs)的结构优势来实现高效且高精度的函数插值与近似。通过引入张量化傅里叶神经算子(Tensorized Fourier Neural Operator, TFNO),作者在解析函数基准测试和核质量预测等真实场景中验证了该方法不仅能显著减少参数量和训练时间,还能达到甚至超越现有主流模型(如多层感知机和Kolmogorov–Arnold网络)的精度,展现出良好的可扩展性和实际应用潜力。
链接: https://arxiv.org/abs/2605.07792
作者: Vasilis Niarchos,Angelos Sirbu,Sokratis Trifinopoulos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Nuclear Theory (nucl-th)
备注: 12 pages, 9 figures
Abstract:Neural operators (NOs) are designed to learn maps between infinite-dimensional function spaces. We propose a novel reframing of their use. By introducing an auxiliary base-space, any finite-dimensional function can be viewed as an operator acting by composition on functions of the base-space. Through a range of benchmarks on analytic functions of increasing complexity and dimensionality, we demonstrate that NOs can match or outperform standard multilayer perceptrons and Kolmogorov–Arnold Networks in accuracy while requiring significantly fewer parameters and training time. As a real-world application, we apply a two-dimensional Tensorized Fourier Neural Operator (TFNO) to the nuclear chart, learning a correction to state-of-the-art nuclear mass models as a partially observed residual field. A TFNO ensemble reaches a held-out root-mean-square error of 198.2 keV, placing it among the best recent neural-network approaches while retaining high parameter efficiency and short training times. More broadly, these results introduce NOs as a scalable framework for finite-dimensional function interpolation, from analytic benchmarks to structured scientific data.
[AI-26] POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
【速读】:该论文旨在解决顺序决策与黑箱优化中探索(exploration)与利用(exploitation)之间的平衡问题。传统方法通常需要先训练一个不确定性感知的奖励模型,再单独拟合策略以优化该模型,过程复杂且计算成本高。其解决方案的关键在于提出POETS(Policy Ensembles for Thompson Sampling)框架,通过KL正则化训练的策略隐式编码奖励函数,直接利用在线Bootstrap数据匹配这些隐式奖励函数来构建策略集成;同时采用共享预训练主干网络并结合独立的低秩适配(Low-Rank Adaptation, LoRA)分支的设计,显著降低大型语言模型(Large Language Models, LLMs)集成的计算与内存开销。理论证明POETS等价于KL正则化的Thompson采样,具备 O(√(T·γ_T)) 的累积遗憾界,实验证明其在蛋白质搜索和量子电路设计等科学发现任务中实现最优样本效率,并在离策略设置或小数据场景下表现鲁棒。
链接: https://arxiv.org/abs/2605.07775
作者: Nicolas Menet,Andreas Krause,Abbas Rahimi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: preprint
Abstract:Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback-Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty-aware reward model and separately fitting a policy to this model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory constraints of ensembling Large Language Models (LLMs), POETS utilizes an efficient architecture: the ensemble shares a pre-trained backbone while maintaining diversity through independent Low-Rank Adaptation (LoRA) branches. Theoretically, we prove that POETS implicitly conducts KL-regularized Thompson sampling and thus inherits strong cumulative regret bounds of $\mathcal{O}(\sqrt{T\gamma_T})$. Empirically, we demonstrate that POETS achieves state-of-the-art sample efficiency across diverse scientific discovery domains, including protein search and quantum circuit design. Furthermore, it improves the optimization trajectories of reinforcement learning, proving particularly robust in off-policy settings with experience replay or in small dataset regimes.
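POETS 的决策循环本质上是集成版 Thompson 采样:每步随机抽取一个集成成员(视作一次后验采样),由它对候选贪心打分。下面用普通可调用对象代替"共享主干 + 独立 LoRA 分支"的成员,仅示意这一采样-选择逻辑:

```python
import numpy as np

def ensemble_thompson_step(ensemble, candidates, seed=0):
    """One Thompson-sampling step: draw one ensemble member (a posterior
    sample) and let it score candidates greedily."""
    rng = np.random.default_rng(seed)
    member = ensemble[rng.integers(len(ensemble))]
    scores = [member(c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```

成员间的分歧即认知不确定性:分歧大的区域被不同成员轮流探索,分歧小的区域则被一致利用。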
[AI-27] RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation
【速读】:该论文旨在解决当前多模态内容安全评估基准在平台内容审核(content moderation)任务中忽视规则驱动决策过程的问题。现有基准通常仅以最终标签匹配度作为评价指标,无法检验模型是否真正理解并正确应用了政策规则及其相互作用,导致高分数可能源于表面线索而非深层推理。解决方案的关键在于提出RuleSafe-VL——一个基于公开平台政策规则构建的视觉-语言内容审核评估基准,其核心创新是将审核决策分解为四个诊断性任务:激活规则识别、规则关系恢复、决策充分性判断以及缺失上下文下的结果修正,从而实现对规则条件化决策链的系统性评估。这一设计使评测重心从单一标签准确性转向对模型规则推理能力的深度诊断。
链接: https://arxiv.org/abs/2605.07760
作者: Zhifeng Lu,Dianyuan Wang,Yuhu Shang,Zhenbo Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Platform content moderation applies explicit policy rules and context-dependent conditions to decide whether user content is allowed, restricted, or removed. A correct moderation outcome must therefore depend on which rules a case activates, how those rules interact, and whether the available evidence is sufficient. Current multimodal safety benchmarks largely reduce moderation to matching predefined final labels, leaving this underlying rule structure untested. As a result, a high benchmark score reveals little about whether a model applies the policy correctly or arrives at the correct label through superficial cues. To evaluate this rule-governed process, we introduce RuleSafe-VL, a benchmark for rule-conditioned decision reasoning in vision-language content moderation. Derived from publicly available platform moderation policies, RuleSafe-VL formalizes 93 atomic rules and 92 typed rule relations, yielding 2,166 context-sensitive image-text cases across three high-risk policy families. Its four diagnostic tasks decompose moderation into a rule-conditioned decision chain. They identify activated rules, recover rule interactions, judge decision sufficiency, and resolve outcomes once missing context is supplied. Experiments on 10 frontier, open-source, and safety-oriented VLMs reveal rule-relation recovery as the dominant bottleneck, where the best model reaches only 64.8 Macro-F1 and some safety-oriented models fall below 7 Macro-F1. Decision-state prediction also remains unreliable, peaking at 64.5 Macro-F1. RuleSafe-VL shifts moderation evaluation from final-label scoring toward diagnostic assessment of rule-conditioned decision reasoning.
[AI-28] When Losses Align: Gradient-Based Composite Loss Weighting for Efficient Pretraining
【速读】:该论文旨在解决大规模预训练模型中多损失函数权重(loss weights)调优成本过高的问题,尤其是在使用复合目标函数进行无标签数据预训练时,传统随机搜索或贝叶斯优化方法需要大量独立训练运行,计算开销巨大。其解决方案的关键在于提出一种基于梯度的双层优化方法,通过将复合预训练梯度与下游任务目标对齐,在线学习预训练损失权重;该方法利用损失结构特性,避免了通常由截断反向传播(truncated backpropagation)带来的多次反向传递,将超参数调优的额外开销控制在单次训练运行的约30%以内,从而显著降低调优成本并保持甚至超越人工调参的性能表现。
链接: https://arxiv.org/abs/2605.07756
作者: Ivan Karpukhin,Andrey Savchenko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern deep models are often pretrained on large-scale data with missing labels using composite objectives, where the relative weights of multiple loss terms act as hyperparameters. Tuning these weights with random search or Bayesian optimization is computationally expensive, as it requires many independent training runs. To address this, we propose a gradient-based bilevel method that learns pretraining loss weights online by aligning the composite pretraining gradient with a downstream objective. By exploiting the structure of the loss, the method avoids the multiple backward passes typically required by truncated backpropagation through the full model, reducing the overhead of hyperparameter tuning to approximately 30% above a single training run. We evaluate the approach on event-sequence modeling and self-supervised computer vision, where it matches or improves upon carefully tuned baselines while substantially reducing the cost of hyperparameter tuning compared to random or Bayesian search.
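摘要的核心是让加权后的预训练梯度与下游目标梯度对齐,从而在线学习损失权重。下面是一个与模型无关的 numpy 示意:两个玩具二次损失,权重按"各自梯度与下游梯度的余弦相似度"做乘性更新并归一化。这只是对齐思想的极简替代,并非论文的双层优化实现。

```python
import numpy as np

def cos_sim(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na < 1e-12 or nb < 1e-12 else float(a @ b) / (na * nb)

theta = np.zeros(2)
w = np.array([0.5, 0.5])   # weights of the two pretraining losses (the hyperparameters)

for step in range(200):
    # Gradients of two toy pretraining losses and one downstream objective.
    g1 = np.array([2 * (theta[0] - 1.0), 0.0])     # L1 = (theta0 - 1)^2  (aligned)
    g2 = np.array([0.0, 2 * (theta[1] + 2.0)])     # L2 = (theta1 + 2)^2  (conflicting)
    gd = 2 * (theta - np.array([1.0, 0.0]))        # Ld = (theta0 - 1)^2 + theta1^2
    # Online weight update: boost losses whose gradient aligns with the
    # downstream gradient, then renormalize (stand-in for the bilevel update).
    w = w * np.exp(0.1 * np.array([cos_sim(g1, gd), cos_sim(g2, gd)]))
    w = w / w.sum()
    theta = theta - 0.05 * (w[0] * g1 + w[1] * g2)  # pretraining step with learned weights
```

运行后,与下游目标一致的损失权重趋近 1,而方向冲突的损失被逐步抑制,参数收敛到下游最优附近。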
[AI-29] Vibe coding before the trend
【速读】:该论文旨在探讨生成式 AI (Generative AI) 在高等教育教学实践中的应用效果及其对学生学习方式的影响,试图解决的问题是如何在真实课堂环境中理解 AI 工具对不同专业背景学生的学习动机、技能发展和职业认知的重塑作用。解决方案的关键在于通过开展跨学科、多国的学生群体实验(涵盖 ICT、数字营销、新闻学及传播学等专业),收集并分析学生反思数据,识别出五类核心模式:AI 促使学生从语法关注转向高阶思维;从记忆向评估能力迁移;将 AI 技能视为职业必备;形成与 AI 的协作而非替代关系;以及非技术类学生更显著地感受到工具可及性的价值。这些观察为教育者提供了早期实证依据,强调了以“实践导向”而非“理论验证”为基础的教学探索路径。
链接: https://arxiv.org/abs/2605.07751
作者: Leon van Bokhorst,Koen Suilen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:In early 2025, we ran a series of vibe coding challenges across four different student cohorts. The cohorts included 54 ICT students, 24 digital marketing students, and 7 journalism students at Fontys University of Applied Sciences (Netherlands), and 22 BA Communication students at North-West University (South Africa). From the student reflections, five major patterns emerged. Students reported that AI tools shifted their focus from syntax to higher-order thinking; they also described a skill shift from memorizing to evaluating; they viewed AI proficiency as career-essential; they framed their relationship with AI as partnership rather than replacement; and finally non-technical students showed the strongest appreciation for the accessibility these tools provide. This practitioner report documents what we observed during the classroom experiments, reflects on how the landscape has shifted in the year since, and shares practical lessons for educators considering similar experiments. We present the observations as what they are: patterns from practice, not proven conclusions, in the belief that sharing early-stage experiences contributes to the overall field of AI and education.
[AI-30] Alternating Target-Path Planning for Scalable Multi-Agent Coordination
【速读】:该论文旨在解决并发目标分配与路径规划(Target Assignment and Pathfinding, TAPF)问题,该问题在多智能体路径规划(Multi-Agent Pathfinding, MAPF)基础上进一步要求为每个智能体分配唯一目标并生成无碰撞路径。传统方法依赖冲突搜索(Conflict-Based Search, CBS)框架,其将目标分配与路径规划紧密耦合,导致计算复杂度高且难以扩展。本文提出一种迭代精化框架,关键在于解耦目标分配与路径规划过程:利用快速的次优MAPF求解器(如LaCAM)在限定时间内反复求解当前目标分配下的路径,并通过MAPF反馈识别瓶颈智能体,进而动态调整目标分配。实验表明,该反馈驱动的重分配机制显著提升了可扩展性,在保持良好解质量的前提下突破了现有CBS方法的规模限制,为实际场景中大规模TAPF应用提供了可行方案。
链接: https://arxiv.org/abs/2605.07744
作者: Yu Kumagai,Keisuke Okumura
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The concurrent target assignment and pathfinding (TAPF) problem extends multi-agent pathfinding (MAPF) by asking planners to allocate distinct targets and collision-free paths to agents. Prior work on TAPF has relied exclusively on Conflict-Based Search (CBS), which tightly couples target assignment and pathfinding, resulting in compute-intensive, non-scalable solutions. In contrast, we propose an iterative refinement framework that decouples target assignment from pathfinding. Our framework builds on modern, fast, suboptimal MAPF solvers, such as LaCAM. Specifically, within a given time budget, it repeatedly solves MAPF for the current target assignment, identifies bottleneck agents via MAPF feedback, and refines the assignment. Empirical results show that feedback-driven reassignment loop is effective, enabling our framework to scale well beyond the reach of the state-of-the-art CBS-based solver while maintaining decent solution quality. This represents a solid step toward practical, large scale TAPF suitable for real-world setups.
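该框架的主循环可以概括为"求解 MAPF → 由反馈找瓶颈智能体 → 交换其目标"。下面用忽略碰撞的曼哈顿距离代替真实 MAPF 求解(真实实现应调用 LaCAM 等次优求解器并利用其反馈),给出循环骨架的示意;起点与目标均为虚构数据。

```python
starts  = [(0, 0), (9, 9), (0, 9)]
targets = [(9, 9), (0, 0), (1, 9)]      # initial (poor) target assignment
assign  = [0, 1, 2]                     # assign[i]: target index given to agent i

def path_cost(i, assignment):
    # Stand-in for a MAPF solve: Manhattan distance ignores collisions;
    # a real implementation would call a fast suboptimal solver such as LaCAM.
    (sx, sy), (tx, ty) = starts[i], targets[assignment[i]]
    return abs(sx - tx) + abs(sy - ty)

def makespan(assignment):
    return max(path_cost(i, assignment) for i in range(len(starts)))

while True:
    costs = [path_cost(i, assign) for i in range(len(starts))]
    bottleneck = costs.index(max(costs))      # "MAPF feedback": longest-path agent
    improved = False
    for j in range(len(starts)):              # try handing the bottleneck a new target
        if j == bottleneck:
            continue
        trial = assign[:]
        trial[bottleneck], trial[j] = trial[j], trial[bottleneck]
        if makespan(trial) < makespan(assign):
            assign, improved = trial, True
            break
    if not improved:                          # local optimum under swap moves
        break
```

每次接受交换都严格降低 makespan,因此循环必然终止;在这个玩具实例中,初始 makespan 为 18,一次交换后降到 1。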
[AI-31] Online Goal Recognition using Path Signature and Dynamic Time Warping
【速读】:该论文旨在解决在线目标识别(online goal recognition)在连续状态空间中面临的两个核心挑战:高效编码长轨迹以及有效比较轨迹。传统方法依赖于定制的状态空间表示和度量来比较观测数据与假设,但往往忽视了其他领域已验证的高效编码技术。本文的关键解决方案是引入路径签名(path signatures),这是一种基于粗糙路径理论的紧凑且表达能力强的轨迹表示方法,能够高效捕获轨迹的关键语义特征,从而实现更具有意义的轨迹间比较。实验表明,该方法在预测准确性和在线规划效率上均优于当前最优方法,同时保持离线性能竞争力。
链接: https://arxiv.org/abs/2605.07736
作者: Douglas Tesch,Nathan Gavenski,Leonardo Amado,Odinaldo Rodrigues,Felipe Meneguzzi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as part of the 35th International Joint Conference on Artificial Intelligence
Abstract:Online goal recognition in continuous domains poses two central challenges: efficiently encoding large trajectories and effectively comparing them. Recent work addresses these challenges by using custom state-space representations and metrics to compare observations against hypotheses. However, these approaches often overlook well-established encoding techniques used in other domains that offer substantial advantages. This paper introduces a novel method for online goal recognition that leverages path signatures, a compact, expressive representation of rough path theory that efficiently captures key semantic features of trajectories, enabling more meaningful comparisons between them. Experiments show that our method consistently outperforms the state of the art in predictive accuracy and online planning efficiency, while remaining competitive offline.
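路径签名的前两层对分段线性轨迹有简单闭式:一层项是总位移,二层项是迭代积分,其反对称部分即 Lévy 面积(路径围成的有符号面积)。下面的 numpy 实现仅用于理解这种轨迹编码,并非论文代码。

```python
import numpy as np

def signature_level2(path):
    """Level-1 and level-2 signature terms of a piecewise-linear path.

    path: (T, d) array of points.  Returns (S1, S2) where
      S1[i]    = total increment of coordinate i,
      S2[i, j] = iterated integral  int (X^i - X^i_0) dX^j.
    """
    path = np.asarray(path, dtype=float)
    dx = np.diff(path, axis=0)                 # segment increments, (T-1, d)
    S1 = dx.sum(axis=0)
    rel = path[:-1] - path[0]                  # X_k - X_0 at segment starts
    # Closed form for piecewise-linear paths:
    # sum_k (X_k - X_0)^i dx_k^j + 0.5 * dx_k^i dx_k^j
    S2 = rel.T @ dx + 0.5 * (dx.T @ dx)
    return S1, S2

# Example: unit square traversed counter-clockwise.
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]])
S1, S2 = signature_level2(square)
levy_area = 0.5 * (S2[0, 1] - S2[1, 0])        # signed area enclosed by the path
```

对逆时针单位正方形,总位移 S1 为零而 Lévy 面积为 1,说明二层签名能区分位移相同但形状不同的轨迹,这正是它适合做轨迹比较特征的原因。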
[AI-32] Intelligent Truck Matching in Full Truckload Shipments using Ping2Hex approach
【速读】:该论文旨在解决全托盘运输(Full Truckload, FTL)供应链中因GPS数据缺失或损坏导致的车辆与货单匹配失败问题,从而保障运输过程的实时可视性及准确到达时间(ETA)预测。解决方案的关键在于将匹配问题建模为概率排序任务,利用Uber H3六边形空间索引对GPS位置信息进行离散化以提取路径相似性特征,并融合时间信息,最终采用LightGBM梯度提升算法结合阈值后处理机制实现高精度匹配。该方法在北美和欧洲分别实现了26和14个百分点的精度提升,同时覆盖范围翻倍,且对地理编码误差、多候选车辆和稀疏定位数据具有鲁棒性。
链接: https://arxiv.org/abs/2605.07733
作者: Srinivas Kumar R,Jose Mathew,Ankit Singh Chauhan,Dinesh Rajkumar,Aravind Manoj,Mohit Goel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 10 figures, 8 tables. Accepted at iSCSi 2026 (International Conference on Industry Sciences and Computer Sciences Innovation). To appear in Procedia Computer Science (Elsevier)
Abstract:Accurate truck-to-shipment matching using GPS data is foundational for full truckload supply chain visibility, enabling real-time tracking and accurate estimated time of arrival (ETA) predictions. However, missing or corrupted vehicle identifiers prevent traditional matching approaches, leaving shipments without visibility. This paper presents Intelligent Truck Matching (ITM) 2.0, a machine learning system that addresses this critical gap by formulating matching as a probabilistic ranking problem. Our approach leverages Uber H3 hexagonal spatial indexing to discretize GPS pings into route similarity features, combined with temporal information, then applies LightGBM gradient boosting with threshold-based post-processing. Through rigorous evaluation including offline model selection (SVM, XGBoost, LightGBM), comprehensive ablation studies, and production shadow testing, we demonstrate substantial gains over rule-based baselines. ITM 2.0 achieves 26 percentage point precision improvement in North America and 14 points in Europe, while doubling coverage. Deployed in production at Project44 handling full truckload shipments, the system demonstrates robustness to geocoding errors up to 1 km, multiple candidate trucks, and sparse pings.
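"Ping2Hex" 的基本思路是把 GPS ping 离散到空间单元后比较轨迹重合度,再交给排序模型。下面用简化的经纬度方格代替 Uber H3 六边形索引,用 Jaccard 相似度代替 LightGBM 特征与排序,仅作流程示意;坐标均为虚构。

```python
def to_cell(lat, lng, res=0.1):
    """Discretize a GPS ping into a grid cell (toy stand-in for an H3 hex index)."""
    return (round(lat / res), round(lng / res))

def route_cells(pings, res=0.1):
    return {to_cell(lat, lng, res) for lat, lng in pings}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# A shipment's route and two candidate trucks (synthetic coordinates).
shipment = [(40.0 + 0.05 * k, -75.0 + 0.05 * k) for k in range(20)]
truck_a  = [(40.0 + 0.05 * k, -75.0 + 0.05 * k) for k in range(18)]  # same road
truck_b  = [(42.0, -74.0 + 0.05 * k) for k in range(18)]             # elsewhere

cells_s = route_cells(shipment)
scores = {name: jaccard(cells_s, route_cells(t))
          for name, t in [("truck_a", truck_a), ("truck_b", truck_b)]}
best = max(scores, key=scores.get)   # ranking step; the paper uses LightGBM here
```

真实系统中应以 h3 库的六边形索引替代方格(六边形对邻接和距离畸变更友好),并把单元重合度、时间对齐等作为特征输入排序模型。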
[AI-33] Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow
【速读】:该论文旨在解决生成式策略(Generative Policy)在强化学习中因依赖常微分方程(ODE)参数化而导致的推理效率低、训练不稳定的问题。其解决方案的关键在于提出了一种非 ODE 的单步生成策略——漂移场策略(Drifting Field Policy, DFP),通过将策略更新建模为向软目标策略方向的逆 KL 散度 Wasserstein-2 梯度流,使每一步更新等价于概率空间中的梯度步骤;该梯度被分解为两个部分:一是向高动作价值区域的上升方向,二是与锚定策略进行得分匹配以形成信任区域。进一步地,作者推导出一个可计算的近似损失函数,类似于在 Top-K 评论器选择的动作上执行行为克隆,从而显著提升了漂移模型骨干网络的性能,且由于其非 ODE 参数化特性,实现了更快的一步推理和更优的操控任务表现。
链接: https://arxiv.org/abs/2605.07727
作者: Juil Koo,Mingue Park,Jiwon Choi,Yunhong Min,Minhyuk Sung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:We propose Drifting Field Policy (DFP), a non-ODE one-step generative policy built on the drifting model paradigm. We frame the policy update as a reverse-KL Wasserstein-2 gradient flow toward a soft target policy, so that each DFP update corresponds to a gradient step in probability space. By construction, this gradient is decomposed into an ascent toward higher action-value regions and a score matching with the anchor policy as a trust region. We further derive a simple, tractable surrogate of the otherwise intractable update loss, akin to behavior cloning on top-K critic-selected actions. We find empirically that this mechanism uniquely benefits the drifting backbone owing to its non-ODE parameterization. With one-step inference, DFP achieves state-of-the-art performance on several manipulation tasks across Robomimic and OGBench, outperforming ODE-based policies.
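摘要称其代理损失"近似于在 critic 选出的 Top-K 动作上做行为克隆"。下面用 numpy 给出这一代理目标形态的示意:从锚定高斯策略采样动作,按玩具 critic 打分取 Top-K,再把策略均值向其回归。这本质上是交叉熵方法(CEM)式的更新,仅用于说明该代理目标,并非论文的 drifting 模型实现。

```python
import numpy as np

rng = np.random.default_rng(0)
a_star = np.array([0.7, -0.3])            # unknown optimal action (toy)

def critic(actions):                       # Q(s, a): toy quadratic critic
    return -np.sum((actions - a_star) ** 2, axis=1)

mu = np.zeros(2)                           # one-step policy: mean of anchor Gaussian
for step in range(50):
    actions = mu + 0.3 * rng.normal(size=(64, 2))     # sample from anchor policy
    topk = actions[np.argsort(critic(actions))[-8:]]  # keep top-K by critic value
    # Behavior-cloning surrogate: regress the policy toward the selected actions.
    mu = mu + 0.5 * (topk.mean(axis=0) - mu)

dist = float(np.linalg.norm(mu - a_star))
```

锚定采样限制了每步的移动范围(对应信任区域),而 Top-K 筛选提供了"向高价值区域上升"的方向。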
[AI-34] Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
【速读】:该论文旨在解决生成式模型在递归训练(recursive training)过程中因依赖固定奖励信号而导致的输出坍缩(collapse)问题,即模型会过度优化单一目标而丧失多样性。其解决方案的关键在于引入多奖励函数(multiple reward functions)进行内容筛选,通过异质偏好下的训练动态建模,证明在特定条件下模型可收敛至一个稳定分布,该分布能在多个高奖励区域间分配概率质量,从而保留多样性并满足加权纳什讨价还价解(weighted Nash bargaining solution),为合成数据再训练循环中的价值聚合提供了形式化解释。
链接: https://arxiv.org/abs/2605.07724
作者: Ali Falahati,Mohammad Mohammadi Amiri,Kate Larson,Lukasz Golab
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
Abstract:Recursive retraining of generative models poses a critical representation challenge: when synthetic outputs are curated based on a fixed reward signal, the model tends to collapse onto a narrow set of outputs that over-optimize that objective. Prior work suggests that such collapse is unavoidable without adding real data into the mix. We revisit this conclusion from an alignment perspective and show that collapse can be mitigated through curation based on multiple reward functions. We formalize the dynamics of recursive training under heterogeneous preferences and prove that, under certain conditions, the model converges to a stable distribution that allocates probability mass across competing high-reward regions. The limiting distribution preserves diversity and provably satisfies a weighted Nash bargaining solution, offering a formal interpretation of value aggregation in synthetic retraining loops.
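论文的核心论断可以用一个离散玩具模拟直观感受(非论文的理论模型):固定单一奖励筛选合成数据会让分布坍缩到一个峰,而用多个奖励取并集筛选则让两个高奖励区域都保留概率质量。

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(10)
r1 = -np.abs(x - 2.0)     # reward 1 peaks at output 2
r2 = -np.abs(x - 7.0)     # reward 2 peaks at output 7

def retrain(rewards, rounds=20, n=2000, eps=1e-3):
    """Recursive retraining: sample, curate by reward(s), refit on kept samples."""
    p = np.full(10, 0.1)                          # initial model: uniform
    for _ in range(rounds):
        s = rng.choice(10, size=n, p=p)
        # Keep a sample if it is in the top quartile of ANY reward function.
        keep = np.zeros(n, dtype=bool)
        for r in rewards:
            keep |= r[s] >= np.quantile(r[s], 0.75)
        counts = np.bincount(s[keep], minlength=10) + eps
        p = counts / counts.sum()
    return p

p_single = retrain([r1])        # one fixed reward  -> collapses onto its peak
p_multi  = retrain([r1, r2])    # pluralistic rewards -> both regions keep mass
```

单奖励循环迅速把全部质量压到 x=2 附近,而双奖励并集筛选在 x=2 与 x=7 两个区域各保留约一半质量,与摘要中"多奖励筛选可保持多样性"的结论一致。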
[AI-35] LLM hallucinations in the wild: Large-scale evidence from non-existent citations
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在科学文献中生成虚假引用(即“幻觉引用”)的问题,这类错误可能严重威胁科研成果的可靠性与公平性。其解决方案的关键在于利用可验证的学术引用作为审计指标,在arXiv、bioRxiv、SSRN和PubMed Central等平台共250万篇论文中系统检测1.11亿条参考文献,发现LLM广泛采用后非存在引用数量显著上升,且这些错误主要集中在AI技术快速渗透的领域、具有AI辅助写作特征的稿件以及小型或早期职业作者群体中,并呈现出对已有声望较高及男性学者的偏倚,表明LLM幻觉正在规模化渗透知识生产过程,而当前预印本审核与期刊出版流程仅能捕获其中一小部分,凸显现有防护机制滞后于问题扩散速度。
链接: https://arxiv.org/abs/2605.07723
作者: Zhenyue Zhao,Yihe Wang,Toby Stuart,Mathijs De Vaan,Paul Ginsparg,Yian Yin
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
备注:
Abstract:Large language models (LLMs) are known to generate plausible but false information across a wide range of contexts, yet the real-world magnitude and consequences of this hallucination problem remain poorly understood. Here we leverage a uniquely verifiable object - scientific citations - to audit 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central. We find a sharp rise in non-existent references following widespread LLM adoption, with a conservative estimate of 146,932 hallucinated citations in 2025 alone. These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake, in manuscripts with linguistic signatures of AI-assisted writing, and among small and early-career author teams. At the same time, hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting that LLM-generated errors may reinforce existing inequities in scientific recognition. Preprint moderation and journal publication processes capture only a fraction of these errors, suggesting that the spread of hallucinated content has outpaced existing safeguards. Together, these findings demonstrate that LLM hallucinations are infiltrating knowledge production at scale, threatening both the reliability and equity of future scientific discovery as human and AI systems draw on the existing literature.
[AI-36] An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
【速读】:该论文针对长上下文推理中因KV缓存(Key-Value Cache)驻留于CPU内存导致的系统效率瓶颈问题展开研究,旨在解决现有GPU-only或CPU-GPU混合设计在PCIe带宽限制、元数据内存开销、GPU空闲时间以及CPU端top-k选择和稀疏注意力计算瓶颈等方面的性能不足。其解决方案的关键在于提出Fluxion框架,通过三个核心洞察实现协同优化:(1) 输出感知的KV预算分配(output-aware KV budgeting),(2) 基于头特性和粒度的稀疏配置策略(head-specific and granularity-aware sparse configuration),以及(3) 跨设备协调执行稀疏注意力机制(cross-device coordinated execution)。具体而言,Fluxion结合轻量级头属性预测器、粒度预算选择器与基于优先级的调度器,联合优化KV缓存预算分配、稀疏结构配置及CPU-GPU执行重叠,从而在保持生成质量的前提下显著提升长上下文推理的整体系统效率,在多个模型、基准和任务上相较最优固定稀疏混合基线实现1.5×–3.7×加速比。
链接: https://arxiv.org/abs/2605.07719
作者: Feiyu Yao,Zhixiong Niu,Xiaqing Li,Yongqiang Xiong,Juan Fang,Qian Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler to jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. This co-design enables hybrid sparse attention to achieve both accuracy and system efficiency in long-context inference. Across 2 models, 3 benchmarks, and 40 tasks, Fluxion preserves quality well – the worst average degradation is only -0.26 relative to FULL, while delivering 1.5×-3.7× speedup over the strongest fixed sparse hybrid baseline, whose KV budget is only 0.05.
[AI-37] The AI-Native Large-Scale Agile Software Development Manifesto
【速读】:该论文旨在解决大规模软件开发中实现真正敏捷性的难题,当前的大规模敏捷框架仍以人工为中心、高度依赖协调会议、文档同步和角色交接,阻碍了实时适应能力。其解决方案的关键在于提出《AI原生大规模敏捷软件开发宣言》,将人工智能(AI)视为首要参与者而非辅助工具,通过六大核心原则重构开发流程:并行过程、意图驱动团队、动态知识体系、验证优先保障机制、协同代理工作流以及可复用蓝图,从而推动开发从会议驱动、文档密集的线性流程向智能、自适应且持续学习的系统转变。
链接: https://arxiv.org/abs/2605.07717
作者: Ricardo Britto,Fredrik Palmgren,Nishrith Saini,Marcus Ohlin
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the widespread adoption of agile methods, achieving true agility at scale remains elusive. Large-scale agile frameworks remain largely human-centric and manual, relying on coordination meetings, artifact synchronization, and role-based handoffs that inhibit real-time adaptation. Meanwhile, rapid advances in AI, particularly large language models, have begun transforming software engineering, yet their potential for organizational-level agility remains underexplored. We present the AI-Native Large-Scale Agile Software Development Manifesto: a set of values and principles that redefine how large-scale software development is organized when AI becomes a first-class participant rather than a peripheral tool. The manifesto is grounded in six principles, parallel processes, intent-driven teams, living knowledge, verification-first assurance, orchestrated agent workforces, and reusable blueprints, that together shift development from a meeting-driven, document-heavy, sequential process to an intelligent, adaptive, continuously learning system.
[AI-38] Hierarchical Task Network Planning with LLM-Generated Heuristics NEURIPS2026
【速读】:该论文旨在解决层次任务网络(HTN)规划中启发式信息不足的问题,即当前HTN规划所依赖的启发式方法在信息性上仍远不如经典规划算法中的启发式。其解决方案的关键在于利用大语言模型(LLMs)生成有效的搜索启发式,并将其扩展至层次规划场景,通过领域特定提示(domain-specific prompting)增强启发式的针对性与有效性。实验表明,LLM生成的启发式在覆盖范围上接近最优HTN规划器,同时在83%的共享问题上显著降低搜索开销。
链接: https://arxiv.org/abs/2605.07707
作者: Felipe Meneguzzi,Alexandre Buchweitz,Augusto B. Corrêa,Victor Scherer Putrich,André Grahl Pereira
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures; submitted to NeurIPS 2026
Abstract:HTN planning is a variation of classical planning where, instead of searching for a linear sequence of actions, an algorithm decomposes higher-level tasks using a method library until only executable actions remain. On one hand, this allows one to introduce domain knowledge that can speed up the search for a solution through the method library. On the other hand, it creates challenges that go beyond those of classical state-space search. While recent research produced a number of heuristics and novel algorithms that speed up HTN planning, these heuristics are not yet as informative as those available in classical planning algorithms. We investigate whether large language models (LLMs) can generate effective search heuristics for HTN planning, extending the methodology of Corrêa, Pereira, and Seipp (2025) from classical to hierarchical planning. Using the Pytrich planner on six standard total-order HTN benchmark domains, we evaluate heuristics generated by nine LLMs under domain-specific prompting and compare them against the TDG and LMCount domain-independent baselines and the PANDA planner. Our results show that LLM-generated heuristics nearly match the coverage of the best available HTN planner, while substantially reducing search effort on 83% of shared problems.
[AI-39] Cross-Attention and Encoder-Decoder Transformers: A Logical Characterization
【速读】:该论文旨在解决如何从逻辑层面形式化描述编码器-解码器Transformer(encoder-decoder transformer)的计算机制问题,尤其针对其在实际应用中使用浮点数表示和软注意力(soft-attention)的情形。解决方案的关键在于提出一种新的时序逻辑(temporal logic),该逻辑扩展了命题逻辑,引入了对编码器输入的计数全局模态(counting global modality)和对解码器输入的过去模态(past modality),从而能够精确刻画Transformer中跨注意力机制与序列生成行为的语义特性。此外,作者还通过一类分布式自动机(distributed automata)提供了另一种等价表征,并证明其结果具有架构无关性,可适用于掩码(masking)等结构变化场景。
链接: https://arxiv.org/abs/2605.07705
作者: Veeti Ahvonen,Damian Heiman,Antti Kuusisto,Miguel Moreno,Matias Selin
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:
Abstract:We give a novel logical characterization of encoder-decoder transformers, the foundational architecture for LLMs that also sees use in various settings that benefit from cross-attention. We study such transformers over text in the practical setting of floating-point numbers and soft-attention, characterizing them with a new temporal logic. This logic extends propositional logic with a counting global modality over the encoder input and a past modality over the decoder input. We also give an additional characterization of such transformers via a type of distributed automata, and show that our results are not limited to the specific choices in the architecture and can account for changes in, e.g., masking. Finally, we discuss encoder-decoder transformers in the autoregressive setting.
[AI-40] Finite-Time Analysis of MCTS in Continuous POMDP Planning
【速读】:该论文旨在解决蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)在部分可观测马尔可夫决策过程(Partially Observable Markov Decision Processes, POMDPs)中缺乏有限时间分析与概率集中性边界的问题,尤其针对启发式动作选择策略(如UCB)所引发的状态非平稳性和依赖性带来的理论挑战。解决方案的关键在于:在离散观测空间中,通过将多项式探索奖励扩展至POMDP场景下的UCB机制,获得根节点经验价值估计的多项式集中界;在连续观测空间中,提出一种抽象划分框架并建立划分损失的有限时间界,进一步证明在弱条件下价值估计的高概率边界。为此,作者设计了Voro-POMCPOW算法,该算法利用Voronoi单元自适应划分连续观测空间,在保持有限分支因子的同时保留原始观测生成机制,从而实现理论保证与实际性能的平衡。
链接: https://arxiv.org/abs/2605.07703
作者: Da Kong,Vadim Indelman
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 9 pages, 1 figure
Abstract:This paper presents a finite-time analysis for Monte Carlo Tree Search (MCTS) in Partially Observable Markov Decision Processes (POMDPs), with probabilistic concentration bounds in both discrete and continuous observation spaces. While MCTS-style solvers such as POMCP achieve empirical success in many applications, rigorous finite-time guarantees remain an open problem due to the nonstationarity and the interdependencies induced by heuristic action selection (e.g., UCB). In the discrete setting, we address these challenges by extending the polynomial exploration bonus to UCB in the POMDP setting, yielding polynomial concentration bounds for the empirical value estimation at the root node. For continuous observation spaces, we introduce an abstract partitioning framework and propose a finite-time bound on partitioning loss. Under mild conditions, we prove high-probability bounds on value estimates in POMDPs with continuous observation space. Specifically, we propose Voro-POMCPOW, a variant of POMCPOW with finite-time guarantees that adaptively partitions the continuous observation space using Voronoi cells. This approach maintains a finite branching factor while preserving the original observation generator. Empirical validation demonstrates that the proposed Voro-POMCPOW shows competitive performance while providing theoretical guarantees. Although our analysis focuses on continuous POMDPs, the techniques developed herein are also applicable to continuous MDPs, closing another gap on the MDP side.
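Voro-POMCPOW 的关键操作是用 Voronoi 单元自适应划分连续观测空间,从而保持有限分支因子。下面是"最近种子归类、超出半径则新建单元"这一思想的极简实现;半径阈值与欧氏距离均为本文示意性假设,非论文设定。

```python
import numpy as np

class VoronoiPartition:
    """Adaptively partition a continuous observation space into Voronoi cells."""

    def __init__(self, radius):
        self.radius = radius
        self.seeds = []          # one seed point per cell

    def cell_of(self, obs):
        obs = np.asarray(obs, dtype=float)
        if self.seeds:
            d = [np.linalg.norm(obs - s) for s in self.seeds]
            i = int(np.argmin(d))
            if d[i] <= self.radius:     # falls into an existing Voronoi cell
                return i
        self.seeds.append(obs)          # otherwise open a new cell (new branch)
        return len(self.seeds) - 1

part = VoronoiPartition(radius=0.5)
obs_stream = [0.0, 0.1, 2.0, 2.2, 0.05, 2.1]
cells = [part.cell_of(np.array([o])) for o in obs_stream]
```

相邻的观测被并入同一单元,树在观测层的分支数由单元数而非原始观测数决定,这正是"有限分支因子 + 保留原始观测生成器"的含义。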
[AI-41] GASim: A Graph-Accelerated Hybrid Framework for Social Simulation
【速读】:该论文旨在解决大规模社会模拟中因混合方法(结合大语言模型(LLM)驱动的智能体与数值型基于主体模型(ABM))导致的高延迟问题,特别是由昂贵的记忆检索和串行ABM执行所引发的性能瓶颈。其解决方案的关键在于提出GASim框架,通过三个核心创新实现加速:一是引入图优化记忆(Graph-Optimized Memory, GOM),以稀疏记忆图上的轻量级传播替代高开销的LLM检索管道;二是采用图消息传递(Graph Message Passing, GMP),用细粒度特征聚合与图注意力网络并行更新多数普通智能体,取代串行ABM执行;三是设计熵驱动分组(Entropy-Driven Grouping, EDG),利用信息熵动态识别处于信息多样性邻域中的核心智能体,从而实现高效的混合分区策略。实验证明,GASim在保持与现实舆情趋势强一致性的前提下,相较传统混合框架实现了9.94倍的端到端速度提升,并将token消耗降低至基准的20%以下。
链接: https://arxiv.org/abs/2605.07692
作者: Xuan Zhou,Yanhui Sun,Hantao Yao,Allen He,Yongdong Zhang,Wu Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large-scale social simulators are essential for studying complex social patterns. Prior work explores hybrid methods to scale up simulations, combining large language models (LLM)-based agents with numerical agent-based models (ABM). However, this incurs high latency due to expensive memory retrieval and sequential ABM execution. To address this challenge, we propose GASim, a graph-accelerated hybrid multi-agent framework for large-scale social simulations. For core agents driven by LLM, GASim introduces Graph-Optimized Memory (GOM) to replace intensive LLM-based retrieval pipelines with lightweight propagation over a sparse memory graph. For the majority of ordinary agents, GASim employs Graph Message Passing (GMP), substituting sequential ABM execution with parallel updates by fine-grained feature aggregation and Graph Attention Network. We further introduce Entropy-Driven Grouping (EDG) that coordinates this hybrid partitioning, leveraging information entropy to dynamically identify emergent core agents situated in information-diverse neighborhoods. Extensive experiments show that GASim not only delivers a substantial 9.94-fold end-to-end speedup over the traditional hybrid framework but also consumes less than 20% of baseline tokens, significantly reducing costs while preserving strong alignment with real-world public opinion trends. Our code is available at this https URL.
[AI-42] FactoryBench: Evaluating Industrial Machine Understanding
【速读】:该论文旨在解决工业机器人遥测数据中机器理解能力的评估问题,即如何系统性地衡量时间序列模型和大语言模型(LLM)在工业场景下对设备状态、干预、反事实及决策等因果层次的理解水平。其解决方案的关键在于构建FactoryBench——一个基于Pearl因果阶梯(ladder of causation)设计的多层级问答基准,涵盖状态、干预、反事实与决策四个因果层级,并引入结构化与自由格式答案相结合的评分机制;同时提出可扩展的QA生成框架与高质量多任务传感器数据集FactoryWave,从而实现对模型在真实工业语境中机器理解能力的全面量化评估。
链接: https://arxiv.org/abs/2605.07675
作者: Yanis Merzouki,Coral Izquierdo,Matei Ignuta-Ciuncanu,Marcos Gomez-Bracamonte,Riccardo Maggioni,Alessandro Lombardi,Camilla Mazzoleni,Federico Martelli,Balazs Gunther,Jonas Petersen,Philipp Petersen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 4 figures, 14 tables; appendix with 24 pages
Abstract:We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. QA pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl’s ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable QA generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k QA items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.
[AI-43] Tacit Knowledge Extraction via Logic Augmented Generation and Active Inference
【速读】:该论文旨在解决隐性知识(Tacit Knowledge)在工业场景中难以被机器理解和复用的问题,尤其是在以流程为中心的领域中,隐性知识往往包含未明文记录的隐含假设、情境约束、具身技能和经验判断,导致现有知识工程流程无法将其转化为形式化、可查询、可推理的机器可读表示。解决方案的关键在于提出一种神经符号框架(Neuro-Symbolic Framework),其核心由两部分组成:一是基于逻辑增强生成(Logic-Augmented Generation)的方法,用于从非结构化数据(如操作视频)中提取语义信息并融合先验逻辑规则;二是受主动推理(Active Inference)启发的本体引导知识图谱构建机制,实现对知识的结构化组织与一致性验证。实验在制造领域的装配类维修流程中验证了该方法在知识完整性和语义质量上的显著提升。
链接: https://arxiv.org/abs/2605.07639
作者: Lorenzo Lamazzi,Aldo Gangemi,Alessio Giberti,Andrea Giovanni Nuzzolese,Vittorio Andrea Rocca,Mattia Torta,Francesco Poggi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Tacit knowledge plays a central role in human expertise, yet it remains difficult to capture, formalize, and reuse in machine-interpretable form. This challenge is especially relevant in procedural domains, where successful execution depends not only on explicit instructions, but also on implicit assumptions, contextual constraints, embodied skills, and experience-based judgments rarely documented. As a result, current knowledge engineering pipelines struggle to transform tacit and process-centric knowledge into formally specified, machine-interpretable representations that can be queried, validated, reasoned over, and reused. In this paper, we introduce a neuro-symbolic framework that combines Logic-Augmented Generation and an Active-Inference-inspired approach for ontology-grounded Knowledge Graph construction. We evaluate the approach in a knowledge transfer case study in manufacturing, using assembly-like repair procedures from instructional videos as a reproducible proxy domain. Results show that the proposed solution improves completeness and semantic quality, advancing neuro-symbolic knowledge engineering for industrial domains.
[AI-44] Inference Time Causal Probing in LLMs
【速读】:该论文旨在解决生成式模型中内部表示对行为影响的因果性测试与控制问题,即如何在不依赖外部探测器(probe classifier)的前提下,精准干预隐藏状态以改变模型输出属性,同时避免因任务或模型特定性导致的预测几何失配。其解决方案的关键在于提出一种无探测器、基于梯度的隐状态驱动边际干预方法(Hidden-state Driven Margin Intervention, HDMI),该方法直接利用模型原生输出构建边际目标函数,通过优化隐藏状态使目标延续项的概率增加而源项概率降低,从而实现对生成行为的可控调整;进一步引入前瞻变体LA-HDMI,通过反向传播至softmax嵌入层,使当前隐藏状态调整能够提升用户指定token的下一轮生成概率并保持文本流畅性,显著提升了干预的完整性(completeness)和选择性(selectivity),在LGD agreement corpus和CausalGym基准上优于现有方法。
链接: https://arxiv.org/abs/2605.07631
作者: Sadegh Khorasani,Saber Salehkaleybar,Negar Kiyavash,Matthias Grossglauser
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 4 tables, 3 figures
Abstract:Causal probing methods aim to test and control how internal representations influence the behavior of generative models. In causal probing, an intervention modifies hidden states so that a property takes on a different value. Most existing approaches define such interventions by training an auxiliary probe classifier, which ties the method to a specific task or model and risks misalignment with the model’s predictive geometry. We propose Hidden-state Driven Margin Intervention (HDMI), a probe-free, gradient-based technique that directly steers hidden states using the model’s native output. HDMI applies a margin objective that increases the probability of a target continuation while decreasing that of the source, without relying on probe classifiers. We further introduce a lookahead variant (LA-HDMI) for text editing that backpropagates through the softmax embeddings, modifying the current hidden state so that the likelihood of user-specified tokens increases in next token generations while preserving fluency. To evaluate interventions, we measure completeness (whether the targeted property changes as intended) and selectivity (whether unrelated properties are preserved), and report their harmonic mean as an overall measure of reliability. HDMI consistently achieves higher reliability than prior methods on the LGD agreement corpus and the CausalGym benchmark, across Meta-Llama-3-8B-Instruct and Pythia-70M.
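HDMI 的干预可以概括为:以"目标续写 logit 减源续写 logit"的 margin 为目标,直接对隐藏状态做梯度上升。下面在一个随机线性输出头上演示该机制(玩具设定,非论文在真实 LLM 上的实现);由于 margin 对隐藏状态是线性的,其梯度就是两行输出权重之差。

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 20, 16
W = rng.normal(size=(V, d)) / np.sqrt(d)   # toy output head: hidden state -> logits
h = rng.normal(size=d)                     # hidden state to intervene on
src, tgt = 3, 11                           # source / target continuation tokens

def probs(h):
    z = W @ h
    e = np.exp(z - z.max())
    return e / e.sum()

def margin(h):
    # Margin objective: logit(target) - logit(source) = (W[tgt] - W[src]) @ h
    return float((W[tgt] - W[src]) @ h)

m0, p0 = margin(h), probs(h)
for _ in range(100):
    h = h + 0.05 * (W[tgt] - W[src])       # gradient ascent on the margin w.r.t. h
m1, p1 = margin(h), probs(h)
```

循环结束后 margin 严格增大,目标与源的概率比 p[tgt]/p[src] = exp(margin) 随之单调上升。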
[AI-45] Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
【速读】:该论文旨在解决Transformer架构中层参数化设计在很大程度上依赖经验选择的问题,特别是多头注意力(Multi-Head Attention, MHA)与门控MLP(Gated MLP)的参数配置缺乏理论指导。其解决方案的关键在于提出因果能量最小化(Causal Energy Minimization, CEM)框架,将Transformer层建模为条件能量函数上的优化步骤,并显式地将层参数化纳入能量函数形式中:具体而言,CEM表明权重共享的MHA可视为交互能量的梯度更新,而具有共享上下投影的门控MLP则可通过元素级能量进行解释。这一视角揭示了一个包含层内权重共享、对角加低秩交互、轻量预条件器及递归更新的设计空间,从而为Transformer结构提供了一种基于能量原理的系统性设计思路。
链接: https://arxiv.org/abs/2605.07588
作者: Jin Xu,Camille Couturier,Victor Rühle,Saravan Rajmohan,James Hensman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Transformer layers as optimization steps on conditional energy functions while explicitly accounting for layer parameterization. Extending prior energy-based interpretations of attention, CEM shows that weight-tied MHA can be derived as a gradient update on an interaction energy, and that a gated MLP with shared up/down projections can be viewed through an element-wise energy. This perspective identifies a design space for Transformer layers that includes within-layer weight sharing, diagonal-plus-low-rank interactions, lightweight preconditioners, and recursive updates. We evaluate CEM-derived layers in language-modeling experiments at the moderate hundred-million-parameter scale. Despite their constrained parameterizations, these layers train stably and can match corresponding Transformer baselines. Overall, our results suggest that CEM provides a useful lens for understanding Transformer layer parameterization, connecting Transformer architectures to energy-based models and motivating further exploration of energy-guided layer designs.
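摘要中"注意力的能量视角"可以用一个数值小实验验证:取能量 E(q) = -logsumexp(q·kⱼ),其负梯度恰好是 softmax 注意力式读出(仅为该能量解释的示意性验证,并非 CEM 论文的完整层参数化):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_as_energy_grad(q, K):
    """能量 E(q) = -logsumexp(K @ q) 的负梯度恰为注意力式读出
    softmax(K @ q) @ K, 即对 q 的一步梯度下降朝注意力输出方向移动。"""
    return softmax(K @ q) @ K    # 等于 -dE/dq

rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))

# 用中心差分数值验证 d(logsumexp)/dq 与解析式一致
def lse(v):
    return np.log(np.exp(K @ v).sum())

eps = 1e-6
num = np.zeros(4)
for i in range(4):
    qp, qm = q.copy(), q.copy()
    qp[i] += eps
    qm[i] -= eps
    num[i] = (lse(qp) - lse(qm)) / (2 * eps)
```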
[AI-46] Parallel Lifted Planning via Semi-Naive Datalog Evaluation
【速读】:该论文旨在解决 lifted classical规划(lifted classical planning)在实际应用中效率较低的问题,尤其是在大规模任务下因重复实例化导致的计算开销过大。其核心挑战在于如何有效利用并行计算资源来加速规划过程中涉及的Datalog式推理,如 successor generation、axiom evaluation 和 delete-relaxed heuristics 等关键组件。解决方案的关键在于提出一个双层并行执行模型:一是规则级并行(rule-level parallelism),二是实例化并行(grounding parallelism),并通过基于团枚举(clique enumeration)的求解器实现,进一步支持半朴素(semi-naive)Datalog评估策略,从而显著提升规划过程中的并行效率和整体性能,在硬实例化任务上实现了高达6倍的加速比。
链接: https://arxiv.org/abs/2605.07584
作者: Dominik Drexler,Oliver Joergensen,Jendrik Seipp
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Lifted classical planners operate directly on first-order planning tasks to avoid the computationally demanding grounding step. However, lifted planning is typically slower, as planners must repeatedly instantiate ground structures during search. Many core components of lifted classical planning, such as successor generation, axiom evaluation, task grounding, and delete-relaxed heuristics, have previously been studied through the lens of Datalog evaluation. We build upon this line of work and extend it by developing and analyzing an execution model with two levels of parallelism: rule-level parallelism and grounding parallelism. We further specialize this solver for planning-specific workloads with a grounder based on clique enumeration, which we extend to support semi-naive Datalog evaluation. Our experimental evaluation using greedy best-first search with the FF heuristic shows that our implementation already solves more tasks than the baselines on a single core, and the gap widens as additional cores are used. Moreover, on hard-to-ground tasks where on average 97.6% of the total runtime is spent in Datalog execution, the proposed execution model exhibits an average parallel fraction of 92.4%, while achieving up to a 6-fold speedup on 8 cores in practice.
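半朴素(semi-naive)Datalog 求值的要点是每轮只用上一轮新导出的事实(delta)参与连接,避免重复推导。下面以传递闭包为例给出一个最小示意(纯 Python,仅演示求值策略本身,与论文的并行实现无关):

```python
def seminaive_tc(edges):
    """半朴素求传递闭包: 规则 path(x,z) :- delta_path(x,y), edge(y,z)。
    每轮只把上一轮新增的 path 事实 (delta) 与基础关系 edge 连接。"""
    total = set(edges)   # 已知全部事实
    delta = set(edges)   # 上一轮新导出的事实
    while delta:
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = new - total          # 只保留真正新的事实
        total |= delta
    return total

paths = seminaive_tc({(1, 2), (2, 3), (3, 4)})
```

论文中的规则级并行即是在每轮内对不同规则(以及 delta 的分片)并行执行这一连接步骤。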
[AI-47] Open-Ended Task Discovery via Bayesian Optimization
【速读】:该论文旨在解决科学工作流中贝叶斯优化(Bayesian Optimization, BO)因任务定义不确定性(即优化目标和评估方式随证据积累而演变)而导致的性能下降问题。其解决方案的关键在于提出一种开环式贝叶斯优化框架——Generate-Select-Refine (GSR),该框架通过交替执行任务生成与任务优化,从用户提供的初始任务出发,以粗到精的方式生成新任务,并利用任务获取函数(task-acquisition function)调度优化过程。理论上,GSR在渐近意义上将评估资源集中于最优任务,相对于单任务BO仅引入对数级 regret(遗憾)增长,从而实现高效且自适应的任务探索与优化。
链接: https://arxiv.org/abs/2605.07572
作者: Masaki Adachi,Yuta Suzuki,Juliusz Ziomek
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 60 pages, 11 figures
Abstract:When applying Bayesian optimization (BO) to scientific workflow, a major yet often overlooked source of uncertainty is the task itself – namely, what to optimize and how to evaluate it – which can evolve as evidence accumulates. We introduce Generate-Select-Refine (GSR), an open-ended BO framework that alternates between task generation and task optimization. Starting from a user-provided seed task, GSR generates new tasks in a coarse-to-fine manner while a task-acquisition function schedules optimization. Asymptotically, it concentrates evaluations on the best task, incurring only logarithmic regret overhead relative to single-task BO. We apply GSR to new product development, chemical synthesis scaling, algorithm analysis, and patent repurposing, where it outperforms existing LLM-based optimizers.
[AI-48] Ensemble Distributionally Robust Bayesian Optimisation
【速读】:该论文旨在解决零阶优化(zeroth-order optimisation)中因上下文分布不确定性(context distributional uncertainty)带来的挑战,这类问题通常通过贝叶斯优化(Bayesian optimisation, BO)来处理。现有方法常采用集成代理模型(ensemble surrogate model)以提升对复杂且噪声数据的鲁棒性,但其计算效率与理论保证仍存在不足。本文提出了一种新的算法——集成分布鲁棒贝叶斯优化(Ensemble Distributionally Robust Bayesian Optimisation),其关键在于在保持计算可 tractable 的前提下有效管理连续上下文,并通过理论分析获得次线性 regret 保证,显著优于当前最先进结果,同时实验证明其性能与理论预期一致。
链接: https://arxiv.org/abs/2605.07565
作者: Tigran Ramazyan,Denis Derkach
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:We study zeroth-order optimisation under context distributional uncertainty, a setting commonly tackled using Bayesian optimisation (BO). A prevailing strategy to make BO more robust to the complex and noisy nature of data is to employ an ensemble as the surrogate model, thereby mitigating the weaknesses of any single model. In this study, we propose a novel algorithm for Ensemble Distributionally Robust Bayesian Optimisation that remains computationally tractable while managing continuous context. We obtain theoretical sublinear regret bounds, improving current state-of-the-art results. We show that our method’s empirical behaviour aligns with its theoretical guarantees.
[AI-49] ProteinJEPA: Latent prediction complements protein language models
【速读】:该论文旨在解决蛋白质语言模型(Protein Language Models)在预训练阶段如何提升性能的问题,特别是在与掩码语言建模(Masked Language Modeling, MLM)相比时,是否可以通过引入潜在空间预测(Latent-Space Prediction)来增强模型表示能力。其核心解决方案是提出一种混合策略——“掩码位置 MLM+JEPA”(masked-position MLM+JEPA),即仅在被掩码的位置上进行潜在空间目标预测,同时保留原有的 MLM 交叉熵损失。这一设计在匹配计算时间预算下显著优于纯 MLM 方法,在多个下游任务中实现更多胜出(如稳定性、变体效应、远程同源等),表明潜在空间预测与 MLM 的协同作用能够有效提升蛋白质表征质量,且不依赖于全位置潜在预测或单独使用 JEPA(JEPAs-only 表现崩溃)。
链接: https://arxiv.org/abs/2605.07554
作者: Dan Ofer,Dafna Shahaf,Michal Linial
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
备注:
Abstract:Protein language models are trained primarily with masked language modeling (MLM), which predicts amino-acid identities at masked positions. We ask whether latent-space prediction can complement these token-level objectives under matched wall-clock budget. Across pretrained and random-init protein sequence encoders at 35–150M parameters, we find that the best protein-JEPA design is not all-position latent prediction but a variant: predicting latent targets only at masked positions, and retaining the MLM cross-entropy. We call this recipe masked-position MLM+JEPA. On a 16-task downstream suite (15 frozen linear probes plus SCOPe-40 zero-shot fold retrieval), under matched wall-clock budgets, this recipe wins more tasks than it loses against MLM-only continuation: 10 wins / 3 losses / 3 ties (hereafter W/L/T) on pretrained ESM2-35M, 11/2/3 on ESM2-150M while results in pretraining from scratch are mixed (6/8/2). Gains are seen for multiple models on 11 of 16 tasks, including stability, β-lactamase fitness, variant effect, intrinsic disorder, remote homology, enzyme classification, and SCOPe-40 fold retrieval. Tasks with more losses than wins are Fluorescence (TAPE) and Peptide-HLA Binding. All-position MLM+JEPA matches MLM-only overall but does not reproduce the masked-position gains. JEPA-only (no MLM) collapses in nearly every experiment. We conclude that JEPA, when combined with MLM, is competitive and can outperform pure MLM in pretraining and continued training, even under matched wall-clock budgets.
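masked-position MLM+JEPA 的损失构造可以用一个 NumPy 玩具示意:仅在被掩码位置同时计算 MLM 交叉熵与潜在目标的 MSE(此处两项等权相加仅为演示假设,论文中的权重与潜在目标构造以原文为准):

```python
import numpy as np

def mlm_jepa_loss(logits, latents, targets_tok, targets_lat, mask):
    """仅在 mask=True 的位置计算: MLM 交叉熵 + 潜在空间 MSE (JEPA 项)。"""
    m = mask.astype(bool)
    z = logits[m]
    p = np.exp(z - z.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    ce = -np.log(p[np.arange(m.sum()), targets_tok[m]]).mean()
    mse = ((latents[m] - targets_lat[m]) ** 2).mean()
    return ce + mse

L, V, D = 4, 3, 2
logits = np.zeros((L, V))            # 均匀 logits
latents = np.ones((L, D))
tok = np.array([0, 1, 2, 0])
lat = np.ones((L, D))                # 潜在目标与预测一致, MSE 为 0
mask = np.array([True, False, True, False])
loss = mlm_jepa_loss(logits, latents, tok, lat, mask)
```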
[AI-50] From Pixels to Prompts: Vision-Language Models
【速读】:该论文试图解决的问题是:当前视觉语言模型(Vision-Language Models, VLMs)领域发展迅速,新模型和术语层出不穷,导致研究者容易迷失在“知晓术语”与“真正理解机制”之间的巨大鸿沟中。解决方案的关键在于提供一个清晰、结构化的认知框架——即构建一个关于VLMs的“心智地图”,使读者能够基于对核心原理的理解去阅读新论文,并具备设计系统的能力,而非仅依赖碎片化知识或盲目模仿。这一方法强调理解的深度与可迁移性,而非罗列大量数据集或模型变体。
链接: https://arxiv.org/abs/2605.07544
作者: Khang Hoang Nhat Vo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: *it is too easy to get lost*. The field moves quickly, new model names appear constantly, and the gap between "I know the buzzwords" and "I actually understand how this works" can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want to offer something more modest - and, I hope, more durable: a clear mental map of Vision-Language Models. Enough structure that you can read new papers with confidence; enough intuition that you can design your own systems without feeling as if you are assembling LEGO bricks blindly.
[AI-51] Multi-Environment POMDPs with Finite-Horizon Objectives
【速读】:该论文旨在解决多环境部分可观测马尔可夫决策过程(Multi-Environment Partially Observable Markov Decision Processes, MEPOMDPs)中有限时域目标下的最优价值与策略计算问题。由于POMDPs中的此类问题是PSPACE完全的,本文进一步证明在更一般的MEPOMDP设定下该问题同样为PSPACE完全,从而确立了其理论复杂度边界;解决方案的关键在于提出了一种实用的算法,并在经典基准测试中显著优于此前唯一已知的算法,实现了计算效率与求解质量的双重提升。
链接: https://arxiv.org/abs/2605.07537
作者: Léonard Brice,Filip Cano,Krishnendu Chatterjee,Thomas A. Henzinger,Stefanie Muroya
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Partially Observable Markov Decision Processes (POMDPs) are systems in which one agent interacts with a stochastic environment, and receives only partial information about the current state. In a multi-environment POMDP (MEPOMDP), the initial state is unknown, and assumed to be adversarially chosen. In this work we focus on computing the optimal value and policy in MEPOMDPs with finite-horizon objectives. That problem is known to be PSPACE-complete in POMDPs. Our main results are as follows: (1) we establish that it is also PSPACE-complete in the more general setting of MEPOMDPs; (2) we present a practical algorithm and evaluate it on classical benchmarks, significantly outperforming the only previously known algorithm.
[AI-52] Why Self-Inconsistency Arises in GNN Explanations and How to Exploit It
【速读】:该论文旨在解决自解释图神经网络(Self-Interpretable Graph Neural Networks, SI-GNNs)中存在的自不一致性问题,即模型在对自身生成的解释子图重新推理时,可能产生不同的解释结果,从而导致解释不可靠。研究指出,这种自不一致性源于再解释引发的上下文扰动(re-explanation-induced context perturbation),并提出潜在信号分配假设(latent signal assignment hypothesis)来解释为何仅部分边对扰动敏感。关键解决方案是提出一种无需训练、模型无关的后处理策略——自去噪(Self-Denoising, SD),通过一次额外前向传播即可校准解释,显著提升解释质量,同时仅引入约4–6%的计算开销。
链接: https://arxiv.org/abs/2605.07527
作者: Wenxin Tai,Yaqian Liu,Ting Zhong,Fan Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work has observed that explanations produced by Self-Interpretable Graph Neural Networks (SI-GNNs) can be self-inconsistent: when the model is reapplied to its own explanatory graph subset, it may produce a different explanation. However, why self-inconsistency arises remains poorly understood. In this work, we first identify re-explanation-induced context perturbation as the direct cause of score variation. We then introduce a latent signal assignment hypothesis to explain why only some edges are sensitive to this perturbation, and analyze how conciseness regularization affects latent signal assignment. Given that self-inconsistent edges do not provide stable evidence for the model’s prediction, we propose Self-Denoising (SD), a model-agnostic and training-free post-processing strategy that calibrates explanations with only one additional forward pass. Experiments across representative SI-GNN frameworks, backbone architectures, and benchmark datasets support our hypothesis and show that SD consistently improves explanation quality while adding only about 4–6% computational overhead in practice.
[AI-53] From Feasible to Practical: Pareto-Optimal Synthesis Planning
【速读】:该论文旨在解决当前计算机辅助合成路线规划(CASP)方法在实际应用中与化学家决策需求脱节的问题,即现有方法通常仅关注单一指标(如收敛性或最短路径),而忽视了工业实践中多目标权衡(如成本、可持续性、毒性及总收率)的复杂性。其解决方案的关键在于将合成规划建模为一个多目标搜索问题,并提出MORetro*算法,通过加权标量化和贝叶斯优化(BO)引导采样策略,在组合搜索空间中高效探索并生成帕累托前沿(Pareto front),从而显式捕捉用户定义目标之间的权衡关系。该方法基于多目标A*搜索框架,提供最优性保证,并在多个基准测试中验证了其生成多样且高质量帕累托解集的能力,显著优于传统单目标方法。
链接: https://arxiv.org/abs/2605.07521
作者: Friedrich Hastedt,Dongda Zhang,Antonio del Rio Chanona
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Current computer-aided synthesis planning (CASP) methods often treat retrosynthesis as solved once a single feasible route is identified, focusing primarily on convergence or shortest-path metrics. This view is misaligned with real-world practice, where chemists must balance competing objectives such as cost, sustainability, toxicity, and overall yield. To address this, we formulate synthesis planning as a multi-objective search problem and introduce MORetro*, an algorithm that generates a Pareto front of synthesis routes to explicitly capture trade-offs among user-defined criteria. MORetro* uses weighted scalarization and BO-informed sampling to efficiently navigate the combinatorial search space and prioritize promising trade-offs. Building on multi-objective A*-search, we provide optimality guarantees showing that, for a fixed single-step model, MORetro* recovers the true Pareto front. Across multiple retrosynthesis benchmarks, MORetro* produces diverse, high-quality Pareto fronts, uncovering solutions overlooked by single-objective approaches and better aligning CASP outputs with industrial decision-making.
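MORetro* 的输出形式是一组帕累托最优路线。非支配解的筛选本身很简单,可以如下示意(每条候选路线表示为一个目标向量,各维如成本、毒性、步数越小越好;仅演示输出形式,与论文的搜索算法本身无关):

```python
def pareto_front(routes):
    """返回候选集合中的非支配(帕累托最优)解。
    route r 被支配, 当且仅当存在另一条路线在所有目标上都不劣于 r 且不等于 r。"""
    front = []
    for r in routes:
        dominated = any(all(o <= v for o, v in zip(other, r)) and other != r
                        for other in routes)
        if not dominated:
            front.append(r)
    return front

# 四条候选路线, 目标为 (成本, 毒性)
routes = [(3, 1), (1, 3), (2, 2), (4, 4)]
front = pareto_front(routes)
```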
[AI-54] Model-Driven Policy Optimization in Differentiable Simulators via Stochastic Exploration
【速读】:该论文旨在解决不同iable planning(可微规划)在高度非线性及混合离散-连续域中因优化景观病态(ill-conditioned)而导致的优化困难问题,具体表现为平坦区域和陡峭过渡阻碍有效搜索。解决方案的关键在于提出模型驱动策略优化(Model-Driven Policy Optimization, MDPO),其核心创新是通过在优化过程中向动作空间注入噪声引入随机探索,并基于轨迹目标对梯度敏感性的分析动态调整噪声幅度,形成随时间步和迭代次数变化的自适应探索策略。这种机制显著提升了目标函数景观的探索效率,有助于跳出局部最优解,从而在复杂非线性和混合环境中实现更高质量的决策优化。
链接: https://arxiv.org/abs/2605.07520
作者: Yuval Aroosh,Ayal Taitler
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Differentiable planning enables gradient-based optimization of decision-making problems by leveraging differentiable models of system dynamics. However, in highly nonlinear and hybrid discrete-continuous domains, the resulting optimization landscapes are often ill-conditioned, with flat regions and sharp transitions that hinder effective optimization. We propose Model-Driven Policy Optimization (MDPO), a framework that introduces stochastic exploration into differentiable planning by injecting noise into the action space during optimization. Leveraging access to the model, MDPO further adapts the noise magnitude based on gradient-derived sensitivity of the trajectory objective, yielding a time-dependent exploration profile. This enables improved exploration of the objective landscape and helps escape poor local optima via dynamic allocation of exploration across timesteps and iterations. Experiments on benchmark domains demonstrate that MDPO consistently outperforms deterministic differentiable planning, including both the noise-free variant of our method and available state-of-the-art implementations, as well as model-free baselines such as PPO, significantly improving solution quality across challenging nonlinear and hybrid settings. We further analyze the evolution of the adaptive noise magnitude across both time steps and optimization iterations, providing insight into how exploration is allocated during learning.
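摘要中"按梯度敏感度调整噪声幅度"的机制可以粗略示意如下。注意:此处假设在梯度平坦(易停滞)的时间步分配更大噪声、梯度陡峭处收敛噪声,具体调制函数为演示假设,以论文为准:

```python
import numpy as np

def noise_scale(grad, base_sigma=0.1):
    """每个时间步的噪声尺度随该步动作梯度幅值递减(假设性的简化形式)。"""
    return base_sigma / (1.0 + np.abs(grad))

g = np.array([0.0, 0.5, 10.0])       # 三个时间步上目标对动作的梯度幅值
sigma = noise_scale(g)
# 按各时间步尺度采样注入动作的探索噪声
noise = np.random.default_rng(0).normal(0.0, 1.0, size=g.shape) * sigma
```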
[AI-55] LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
【速读】:该论文旨在解决轻量级、本地化视觉-语言图形用户界面(GUI)智能体在模型容量受限条件下性能提升困难的问题,尤其是传统监督微调(Supervised Fine-Tuning, SFT)方法易导致过拟合、灾难性遗忘和策略僵化等缺陷。其核心解决方案是提出一种无需SFT的训练范式,关键在于两个创新:一是引入引导式在线策略蒸馏(Guided On-policy Distillation),通过结合Oracle参考轨迹与动态检索机制,减少幻觉并缓解多解GUI任务中的认知错位;二是构建多解双层GRPO框架(Multi-solution Dual-level GRPO),联合对齐宏观子任务规划与微观执行匹配,增强长程GUI交互场景下的探索能力。实验表明,该方法在轻量级模型中达到最先进性能,并可与远大于其规模的模型相竞争。
链接: https://arxiv.org/abs/2605.07505
作者: Yubin Wu,Zicheng Cai,Liping Ning,Hua Wang,Zhi Chen,Yaohua Tang,Hao Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supervised Fine-Tuning (SFT) for small-scale models often leads to overfitting, catastrophic forgetting and policy rigidity, and thus fails to fully address these challenges. In this work, we propose a novel SFT-free training paradigm that significantly enhances the performance of small-scale models. We first present the initial systematic integration of generalized knowledge distillation into the GUI agent domain via Guided On-policy Distillation. By incorporating oracle reference trajectories together with a dynamic retrieval mechanism, our method reduces hallucinations and mitigates the cognitive misalignment inherent in multi-solution GUI tasks. Building on this foundation, we further introduce a Multi-solution Dual-level GRPO framework that jointly aligns macro-level subtask planning with micro-level execution matching, thereby improving exploration in long-horizon GUI agent scenarios. In addition, we construct an automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations. Extensive experiments show that our method achieves state-of-the-art performance among lightweight models while remaining competitive with substantially larger-scale models across all benchmarks. Ablation studies further demonstrate that structured on-policy distillation and multi-solution dual-level exploration can fully unlock the capabilities of 2B/3B scale agents, surpassing the performance limits of conventional imitation learning.
[AI-56] Efficient Data Selection for Multimodal Models via Incremental Optimization Utility
【速读】:该论文旨在解决大规模多模态模型(Large Multimodal Models, LMMs)在训练过程中因合成数据的质量-数量权衡问题而导致的扩展瓶颈。传统方法如LLM-as-a-Judge虽有效,但存在计算成本高且缺乏可解释性的缺陷。其解决方案的关键在于提出One-Step-Train(OST)框架,将数据选择建模为增量优化效用排序问题,通过轻量级代理模型模拟单步更新来估计每个样本的边际效用,从而实现高效、可解释的数据筛选。此方法显著降低了训练成本并提升了性能,在固定计算预算下使用top-20子集即可超越LLM-as-a-Judge基线5.6分,并有效识别有毒样本以避免负迁移现象。
链接: https://arxiv.org/abs/2605.07488
作者: Jinhao Jing,Qiannian Zhao,Chao Huang,Zhan Su
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The scaling of Large Multimodal Models (LMMs) is constrained by the quality-quantity trade-off inherent in synthetic data. Previous approaches, such as LLM-as-a-Judge, have proven their effectiveness in addressing this but suffer from prohibitive computational costs and lack of interpretability. To bridge this gap, we propose One-Step-Train (OST), a framework that reformulates data selection as an incremental optimization utility ranking problem. Instead of relying on semantic heuristics, OST estimates the marginal utility of each sample via a simulated single-step update on a lightweight proxy. Experiments on the Qwen series across multimodal mathematical reasoning benchmarks demonstrate that OST achieves Pareto-optimal efficiency. By selecting the top-50 subset, OST reduces training costs by 43% (and total time consumption by 17) while surpassing the strong LLM-as-a-Judge baseline by 1.8 points. Furthermore, under a fixed compute budget, our method using only the top-20 subset achieves a 5.6 point gain over LLM-as-a-Judge, improves upon heuristic scoring baselines like DEITA, and outperforms the Full-SFT baseline by 8.8 points. Notably, while Full-SFT suffers from performance degradation due to noise, our optimization-grounded approach effectively identifies toxic samples, successfully reversing the negative transfer frequently observed in complex reasoning tasks.
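OST 把样本的"边际效用"定义为模拟单步更新后验证损失的下降量。下面是一个线性回归代理上的玩具版本(代理模型、学习率与数据均为演示假设):

```python
import numpy as np

def one_step_utility(theta, X, y, i, lr=0.01):
    """模拟单步更新: 只用样本 i 的梯度走一步, 效用 = 全量损失的下降量。"""
    def loss(t):
        return ((X @ t - y) ** 2).mean()
    grad_i = 2 * X[i] * (X[i] @ theta - y[i])
    return loss(theta) - loss(theta - lr * grad_i)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])   # 干净的线性数据
theta = np.zeros(3)
utils = sorted((one_step_utility(theta, X, y, i), i) for i in range(20))
top = [i for _, i in utils[-5:]]     # 边际效用最高的 5 个样本
```

按此效用排序取 top 子集,即对应论文中"选 top-k 子集再训练"的数据筛选流程。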
[AI-57] Excluding the Target Domain Improves Extrapolation: Deconfounded Hierarchical Physics Constraints
【速读】:该论文旨在解决物理约束深度生成模型在分布外(out-of-distribution)条件下 extrapolation(外推)性能差的问题,尤其针对现有方法未能处理物理定律的层次结构和混淆变量(confounding variable)问题的局限性。其核心解决方案是提出去混淆层次门控机制(Deconfounded Hierarchical Gate, DHG),该机制通过引入因果推理中的 do-操作符进行反事实估计,并结合后门调整(backdoor adjustment)来消除温度混淆对各物理约束层级的影响,从而实现从粗到细的物理一致性约束。关键创新在于:DHG不仅诊断并控制温度混淆污染的程度,还使模型能够识别真正的物理不一致而非伪相关,最终显著提升外推性能——例如在锂离子电池温度外推任务中,RMSE 降低至 0.215,较无约束基线提升 46%。
链接: https://arxiv.org/abs/2605.07485
作者: Tsuyoshi Okita
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 2 figures
Abstract:Extrapolation to out-of-distribution conditions is a fundamental challenge for physics-constrained deep generative models. Existing methods apply physical constraints as a single static regularization term uniformly across the generation process, and address neither the hierarchical structure of physical laws nor the confounding variable problem. We propose the Deconfounded Hierarchical Gate (DHG), which serves as a diagnostic and control mechanism: it identifies when and how strongly temperature confounding contaminates each constraint level, so that hierarchical gates reflect intrinsic physical inconsistency rather than spurious temperature effects. DHG combines counterfactual estimation via the do-operator with backdoor adjustment to remove confounding, then applies Coarse-to-Fine physical constraints progressively. We report a counter-intuitive finding in pretraining: excluding the target-domain data from pretraining outperforms including it by 39% in extrapolation performance (RMSE 0.224 vs. 0.324). This occurs because FNO learns domain-agnostic physical patterns that transfer more effectively when the target domain is withheld. On a lithium-ion battery temperature extrapolation benchmark (trained at 24 degrees Celsius, evaluated at 4.0–43.0 degrees Celsius), our method achieves RMSE = 0.215, a 46% improvement over the unconstrained baseline (Pure CFM: 0.397).
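DHG 所用的后门调整对应标准公式 P(Y|do(X=x)) = Σ_z P(Y|X=x, Z=z)·P(z),即对混杂变量(此处为温度)按其边际分布求和。一个离散玩具示意如下(数值纯属演示):

```python
import numpy as np

def backdoor_adjust(p_y_given_xz, p_z):
    """后门调整: P(Y|do(X)) = sum_z P(Y|X,Z=z) P(z)。
    p_y_given_xz 形状为 (X, Z, Y), p_z 形状为 (Z,)。"""
    return np.einsum('xzy,z->xy', p_y_given_xz, p_z)

# 2 个干预取值 x, 2 个混杂取值 z, 2 个结果 y
p_y_xz = np.array([[[0.9, 0.1], [0.5, 0.5]],
                   [[0.3, 0.7], [0.2, 0.8]]])
p_z = np.array([0.6, 0.4])           # 混杂变量的边际分布
p_do = backdoor_adjust(p_y_xz, p_z)
```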
[AI-58] Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization
【速读】:该论文旨在解决深度神经网络在分布外(out-of-distribution, OOD)场景下难以学习到相关表示的问题,即为何模型在训练数据分布(in-distribution, ID)上表现良好时,却无法可靠地泛化至未见过的分布。其核心发现是:OOD泛化能力取决于对数据生成过程(data-generating process, DGP)的结构假设(structural commitment),具体包括特征映射(feature map)、标签映射(label map)和模型类(model class)三者构成的组合 (φ,ψ,M)。这一结构承诺决定了模型对OOD区域的推理机制,而仅靠ID训练数据本身无法唯一确定该结构——因为存在无限多个在ID数据上观测等价但在OOD区域差异极大的DGP。因此,解决方案的关键在于:通过架构设计、预训练、数据增强、输入格式或领域知识等手段,隐式注入正确的结构承诺,使模型能够从ID数据中识别出OOD相关的结构信息;当此结构可识别且正确时,OOD误差可完全消除,如傅里叶坐标将周期性外推转化为插值问题。
链接: https://arxiv.org/abs/2605.07483
作者: Leonel Aguilar,Jan Nagler,Christoph Hoelscher,Nino Antulov-Fantulin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Successful deep neural networks discover salient features of data. We show when and why they fail to learn out-of-distribution (OOD)-relevant representations from an in-distribution (ID) training window. This requires decoupling feature learning from data-generating-process (DGP) identifiability. From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are ε-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment, the feature map, label map, and model class (φ, ψ, M), dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When architecture, pretraining, augmentation, input formats, or domain knowledge implicitly inject the missing commitment, the model succeeds. When it cannot infer OOD-relevant structure from ID evidence, it fails. Changing only the representation can make the same architecture, at the same in-distribution loss, differ by ~520× out of distribution. When the commitment is correct and identifiable, OOD error vanishes. For example, Fourier coordinates turn periodic extrapolation into interpolation on S¹. The same mechanism predicts outcomes in three natural-science settings (mass-action chemistry; Kepler’s-third-law exoplanet prediction, n = 2,362; and cross-species coding-DNA detection) and in a 264-run positional-encoding study across Transformer, Mamba, and S4D. Finally, a controlled study shows: correct features are necessary but not sufficient. The model class must express the target, and the transformed training data must cover the relevant representation space.
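摘要中"傅里叶坐标把周期外推变成插值"的论断可以直接复现:在 (sin x, cos x) 特征上拟合线性模型后,远离训练区间的点仍落在特征空间的已见区域内,预测依然准确:

```python
import numpy as np

def fourier_feats(x):
    """把标量输入映射到单位圆上的傅里叶坐标 (sin x, cos x)。"""
    return np.stack([np.sin(x), np.cos(x)], axis=-1)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 2 * np.pi, 200)       # 训练窗口: 一个周期内
y_train = np.sin(x_train)
# 线性最小二乘: y ≈ w · (sin x, cos x)
w, *_ = np.linalg.lstsq(fourier_feats(x_train), y_train, rcond=None)
x_test = np.array([10 * np.pi + 0.5])          # 远离训练分布的外推点
pred = fourier_feats(x_test) @ w
```

这正对应正文的论点:正确的结构承诺(此处是周期性的特征映射)使 OOD 误差消失,而同一线性模型在原始坐标 x 上无法外推。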
[AI-59] SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中记忆内容不可控的问题,即如何在不进行昂贵的全量重新训练的前提下,有效移除模型中对特定数据(如私有信息、受版权保护文本或有害知识)的记忆。传统方法依赖于精心构建的保留集(retain set)以避免模型通用能力的灾难性退化,从而引入额外的数据依赖性,限制了实际部署的灵活性。本文提出了一种无保留集(retain-set-free)的遗忘机制 SHRED(Self-distillation via High-surprisal-only Retain-set-free Entropy Demotion),其核心创新在于识别出遗忘样本中高信息量(high-surprisal)与低信息量(low-surprisal)token的差异:前者集中体现模型的记忆内容,后者反映语言建模的一般能力。SHRED通过两个阶段实现高效遗忘——首先在前向传播中筛选出高信息量token作为遗忘位置,其余为良性锚点;随后构建基于KL散度的自蒸馏目标,在遗忘位置降低对应logits的同时保持良性位置分布不变,从而同步实现遗忘效果与模型性能的稳定。实验表明,该方法在多个基准上达到帕累托最优的遗忘效能与模型效用平衡,并具备对抗重学习攻击和成员推断攻击的能力。
链接: https://arxiv.org/abs/2605.07482
作者: Zizhao Hu,Ameya Godbole,Johnny Tian-Zheng Wei,Mohammad Rostami,Jesse Thomason,Robin Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine unlearning for large language models (LLMs) aims to selectively remove memorized content such as private data, copyrighted text, or hazardous knowledge, without costly full retraining. Most existing methods require a retain set of curated examples to prevent catastrophic degradation of general model utility, creating an extra data dependency that complicates deployment. We propose SHRED (Self-distillation via High-surprisal-only Retain-set-free Entropy Demotion), a retain-set-free unlearning method built on a key insight: not all tokens within a forget set instance carry memorized information equally. High-information tokens concentrate the model’s memorized knowledge, while low-information tokens reflect general language competence. SHRED operates in two stages. (1) Selection: We perform a forward pass on a forget set instance, collect per-token autoregressive probabilities, and select the bottom (lowest probability, highest Shannon information) as forget positions; the remaining positions are retained as benign anchors. (2) Training: We construct modified KL targets that demote the memorized token’s logit at forget positions while preserving the original distribution at benign positions. The model is then trained via a single top KL self-distillation objective that simultaneously drives forgetting and utility preservation. We evaluate SHRED across four standard unlearning benchmarks and demonstrate that it establishes a new Pareto-optimal trade-off between forget efficacy and model utility, outperforming retain-set-dependent methods. Our analysis shows that SHRED is robust against relearning attacks and membership-inference attacks, and it maintains stable utility even after many sequential unlearning runs.
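SHRED 第一阶段的词元筛选逻辑可以示意如下:一次前向得到逐词元自回归概率后,选出概率最低(即香农信息量最高)的一部分位置作为遗忘位置,其余作为良性锚点(比例 frac 为演示假设):

```python
import numpy as np

def select_forget_positions(token_probs, frac=0.25):
    """按自回归概率升序排序, 取最低的 frac 比例位置作为遗忘位置。"""
    k = max(1, int(len(token_probs) * frac))
    order = np.argsort(token_probs)      # 升序: 概率最低(信息量最高)在前
    forget = np.sort(order[:k])
    benign = np.sort(order[k:])
    return forget, benign

probs = np.array([0.9, 0.05, 0.8, 0.01, 0.7, 0.6, 0.5, 0.95])
forget, benign = select_forget_positions(probs, frac=0.25)
```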
[AI-60] Vaporizer: Breaking Watermarking Schemes for Large Language Model Outputs
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)输出内容的版权溯源与责任归属问题,通过水印技术实现对生成文本的可追溯性。其解决方案的关键在于设计并评估多种水印机制在面对语义不变但形式多变的攻击(如词汇替换、机器翻译和神经 paraphrasing)时的鲁棒性,同时量化水印移除的成功率与语义保留程度(通过BERT分数、文本复杂度、语法错误率及Flesch阅读易读性指数等指标)。实验表明,尽管现有水印方案具备一定实用性,但仍存在被合理努力移除的风险,提示未来需从算法结构和防御策略上提升安全性。
链接: https://arxiv.org/abs/2605.07481
作者: Jonathan Hong Jin Ng,Anh Tu Ngo,Anupam Chattopadhyay
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we investigate the recent state-of-the-art schemes for watermarking large language models (LLMs) outputs. These techniques are claimed to be robust, scalable and production-grade, aimed at promoting responsible usage of LLMs. We analyse the effectiveness of these watermarking techniques against an extensive collection of modified text attacks, which perform targeted semantic changes without altering the general meaning of the text content. Our approach encompasses multiple attack strategies, which include lexical alterations, machine translation, and even neural paraphrasing. The attack efficacy is measured with two target criteria - successful removal of the watermark and preservation of semantic content. We evaluate semantic preservation through BERT scores, text complexity measures, grammatical errors, and Flesch Reading Ease indices. The experimental results reveal varying levels of effectiveness among different watermarking models, with the same underlying result that it is possible to remove the watermark with reasonable effort. This study sheds light on the strengths and weaknesses of existing LLM watermarking systems, suggesting how they should be constructed to improve security of available schemes.
[AI-61] Physical Simulators as Do-Operators: Causal Discovery under Latent Confounders for AI-for-Science
【速读】:该论文旨在解决现有干预性因果发现方法在科学计算场景中面临的两大挑战:一是假设无潜在混杂因子(causal sufficiency),而实际科学问题中潜变量普遍存在;二是依赖虚拟干预(如合成模拟器),难以利用真实物理模拟器进行可解释的因果推断。其解决方案的关键在于提出CFM-SD(Causal Flow Matching with Simulation Data),该方法利用第一性原理物理模拟器作为Pearl因果演算中的do-操作符,从而同时处理潜变量和真实干预数据。理论证明,在物理可实现约束下,仅需O(d)个单变量干预即可识别d变量的因果结构,这是最小干预次数。实验表明,CFM-SD在合成数据上显著优于基线方法(平均F1=0.800 vs. 0.127–0.562),并在分子毒性预测与电池电解液优化等真实科学任务中实现57–58%的偏差降低,验证了其在AI-for-Science领域的实用性。
链接: https://arxiv.org/abs/2605.07467
作者: Tsuyoshi Okita
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 17 pages, 1 figure
Abstract:Existing interventional causal discovery methods – IGSP, DCDI, ENCO – assume causal sufficiency (no latent confounders) and rely on virtual interventions in synthetic simulators. In AI-for-Science settings such as molecular design and materials science, latent confounders are ubiquitous and real interventions (e.g., physics-based simulations) require hours to days per data point. We propose CFM-SD (Causal Flow Matching with Simulation Data), which uses first-principles physical simulators as do-operators in Pearl’s interventional calculus to simultaneously handle latent confounders and real interventional data. Theoretically, d-variable causal structure is identifiable with O(d) single-variable interventions – the minimum under physical realizability constraints. In Intrinsic Evaluation on synthetic data (γ = 0.2–0.8), CFM-SD achieves average F1 = 0.800 vs. F1 = 0.127–0.562 for all baselines. In Extrinsic Evaluation on real scientific data, CFM-SD achieves 57–58% bias reduction in molecular toxicity prediction and battery electrolyte optimization, demonstrating practical value beyond synthetic benchmarks.
[AI-62] Bounded Fitting for Expressive Description Logics IJCAI-ECAI 2026
[Quick Read]: This paper tackles concept learning in expressive description logics (DLs), specifically logics extending ALC with inverse roles, qualified number restrictions, and feature comparisons. Bounded fitting has been applied successfully to ALC, but whether it retains its favorable theoretical properties in these richer logics was unclear. The key to the solution is a SAT-solver-based implementation that extends bounded fitting to these expressive DLs while preserving PAC-style generalization guarantees, together with a validation of its practical effectiveness. Experiments show the approach is competitive with state-of-the-art concept learners, demonstrating its potential as a practical method for expressive concept learning.
Link: https://arxiv.org/abs/2605.07452
Authors: Maurice Funk, Jean Christoph Jung, Tom Voellmer
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 16 pages, full version of paper accepted at IJCAI-ECAI 2026
Abstract:Bounded fitting is an attractive paradigm for learning logical formulas from labeled data examples that offers PAC-style generalization guarantees and can often be implemented leveraging SAT solvers. It has been successfully applied to learning concepts of the description logic ALC. We study bounded fitting for learning concepts in expressive description logics that extend ALC with inverse roles, qualified number restrictions, and feature comparisons. We investigate under which conditions bounded fitting keeps its favorable theoretical properties in this setting, and implement it using a SAT solver. We compare our tool with state-of-the-art concept learners with encouraging results, demonstrating that it is a practical approach to expressive concept learning.
[AI-63] Accelerated and data-efficient flow prediction in stirred tanks via physics-informed learning
[Quick Read]: This paper addresses the high computational cost of simulating steady-state flow fields in industrial-scale stirred tanks by using machine learning models as surrogates that learn from simulation data and predict flow fields rapidly. The core of the approach is to model flow fields with implicit neural representations and to compare purely data-driven variants against physics-constrained ones. The key findings: prediction error decreases monotonically with more training data, but with clearly diminishing returns; in low-data regimes, physics-based constraints (with priors from the Reynolds-Averaged Navier-Stokes (RANS) equations) substantially improve accuracy, reduce run-to-run variability, and stabilize tracer-transport behavior. However, these benefits shrink as the training set grows, and they come at the cost of more complex training.
Link: https://arxiv.org/abs/2605.07444
Authors: Mahdi Naderibeni, Liang Wu, David M.J. Tax
Affiliations: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:
Abstract:The simulation of fluid flows is computationally expensive due to the complexity of its governing partial differential equations. Machine learning models offer a potential surrogate, enabling learning from simulations and significantly faster predictions of flow fields. However, these models require large training datasets, which introduces a trade-off between dataset generation cost and predictive accuracy. In this work, we investigate the relationship between training-set size and prediction accuracy when learning steady flow fields in an industrial-scale stirred vessel. A dataset of steady flows is generated using Reynolds Averaged Navier Stokes (RANS) simulations in a range of realistic operating conditions, including impeller speeds and liquid heights. We train implicit neural representations of flow fields and compare purely data-driven and constrained variants. Model performance is evaluated using global mean squared error (MSE), qualitative spatial comparisons of predicted and reference flow fields, and tracer transport simulations. We find that the prediction error decreases monotonically with increasing training data, but also that it exhibits clear diminishing returns beyond moderate dataset sizes. Physics-based constraints significantly improve accuracy and reduce variability across training runs in low-data regimes, and they lead to more stable tracer-transport behavior. Furthermore, reasonable interpolation can be achieved over different impeller speeds and liquid heights. However, these benefits come with an increase in the complexity of training, and their relative advantage diminishes as the training set grows.
[AI-64] Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study
[Quick Read]: This paper examines the reliability of generative AI for assisting qualitative coding in software engineering, in particular whether large language models can stably and accurately reproduce human researchers' subjective judgments under different prompt engineering strategies. The key to the study is a controlled experimental design that systematically evaluates three LLMs (Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash) under zero-shot and multi-shot closed-coding prompting, using Cohen's kappa as the primary agreement metric. Results show multi-shot prompting significantly improves agreement for Claude Haiku (Δκ = +0.034, p = 0.004) but not for the other models. The models also differ markedly in stability and bias patterns: Gemini 2.5 Flash is the least stable (SD = 0.038), and all models over-predict "Sharing Negative Feedback" while under-predicting "Expressing Concerns". Together these findings provide empirically grounded prompt engineering guidelines for LLM-assisted qualitative analysis.
Link: https://arxiv.org/abs/2605.07422
Authors: Moaath Alshaikh, Tasneem Alshaher, Ricardo Vieira, Beatriz Santana, Clelio Xavier, Jose Amancio, Glauco Carneiro, Julio Leite, Savio Freire, Manoel Mendonca
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures. Accepted at the 1st International Workshop on Prompt Engineering for Software Engineering (PROMPT-SE 2026), co-located with the 30th International Conference on Evaluation and Assessment in Software Engineering (EASE 2026), Glasgow, Scotland, United Kingdom, June 9–12, 2026
Abstract:Qualitative analysis plays a pivotal role in understanding the human and social aspects of software engineering. However, it remains a demanding process shaped by the subjective interpretation of individual researchers and sensitive to methodological choices such as prompt design. Recent advancements in Large Language Models (LLMs) offer promising opportunities to support this type of analysis, although their reliability in reproducing human qualitative reasoning under varying prompting conditions remains largely untested. This study presents a controlled empirical evaluation of three LLMs – Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash – across two prompt engineering strategies (zero-shot and multi-shot closed coding), using Cohen’s kappa as the primary agreement metric over ten independent runs per configuration. Results suggest that multi-shot prompting significantly improves agreement for Claude Haiku (Delta kappa = +0.034, Wilcoxon p = 0.004) but not for DeepSeek-Chat or Gemini 2.5 Flash. Intra-model stability varies substantially – DeepSeek-Chat and Claude Haiku exhibit the lowest variance (SD approx. 0.017), while Gemini 2.5 Flash is the least stable (SD = 0.038). A systematic over-prediction of “Sharing Negative Feedback” is identified across all models (bias ratios up to 5.25x), alongside consistent under-prediction of “Expressing Concerns.” Collectively, these findings provide empirical evidence for prompt engineering guidelines in LLM-assisted qualitative coding for software engineering research.
[AI-65] Tracking Large-scale Shared Bikes with Inertial Motion Learning in GNSS Blocked Environments
[Quick Read]: This paper addresses the cumulative drift and poor robustness of bicycle trajectory tracking that relies only on low-cost inertial sensors, especially in GNSS-blocked environments such as urban canyons. The key to the solution is an inertial tracking framework that combines bicycle mechanical constraints with a mixture-of-experts model: multiple expert modules capture shared representations and are weighted by a gating mechanism, improving multi-task learning and enabling uncertainty-aware trajectory estimation; meanwhile, based on the mechanical transmission between the pedal and the rear wheel, the intrinsic link between the rider's periodic pedalling and acceleration variations is converted into wheel speed for dynamic calibration, substantially reducing trajectory error. Experiments on real riding data from DiDi shared bikes show at least a 12% accuracy improvement over baselines, with 95th-percentile wheel-speed errors below 0.5 m/s.
Link: https://arxiv.org/abs/2605.07412
Authors: Feng Liu(1), Kejia Li(1), Zhiwei Yang(2), Chunwei Yang(2), Qun Li(2), Guobin Wu(2), Qiang Ni(3), Ruipeng Gao(1) ((1) Beijing Jiaotong University, (2) DiDi Company, (3) Lancaster University)
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: It has been submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS). Journal article. 14 pages, 18 figures, 10 tables
Abstract:Although Global Navigation Satellite Systems (GNSS) provide a general solution for bike tracking outdoors, there still exist complex riding environments where only inertial navigation systems work, such as urban canyons. Despite decades of research, localization using only low-cost inertial sensors still faces challenges such as cumulative drifts and poor robustness caused by filtering methods. Furthermore, sensors such as visual and LiDAR could provide reliable measurements, but they are not suitable for large-scale deployment. In this paper, we propose an inertial tracking framework that integrates bicycle mechanical constraints with a mixture-of-experts model. Specifically, we leverage multiple expert modules to capture shared representations and weight them through the gating mechanism, thus improving multi-task learning performance and enabling uncertainty-aware trajectory estimation. Furthermore, based on the mechanical transmission between the pedal and the rear wheel of a bike, we explore the intrinsic relationship between the rider’s periodic pedalling behaviors and acceleration variations, and convert such patterns into bike’s wheel speed for dynamic calibration. Experiments with real-world riding data from shared bikes of the DiDi ride-hailing platform demonstrate that our system improves the accuracy of baselines by at least 12%, with wheel speed errors below 0.5 m/s at 95-percentile.
[AI-66] Rubric-based On-policy Distillation
[Quick Read]: This paper addresses the dependence of existing on-policy distillation (OPD) methods on access to the teacher model's output logits, which limits their use in black-box settings such as proprietary LLMs. The key to the proposed ROPD framework is to replace logits with structured semantic rubrics: prompt-specific rubrics are induced from teacher-student response contrasts, and these rubrics then score the student's rollouts for on-policy optimization. The approach needs only teacher-generated text responses, achieves up to 10x better sample efficiency, and offers a simple yet strong baseline for scalable distillation across open-source and proprietary LLMs.
Link: https://arxiv.org/abs/2605.07396
Authors: Junfeng Fang, Zhepei Hong, Mao Zheng, Mingyang Song, Gengsheng Li, Houcheng Jiang, Dan Zhang, Haiyun Guo, Xiang Wang, Tat-Seng Chua
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint. Code is available at this https URL
Abstract:On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher-generated responses. To prove it, we introduce ROPD, a simple yet foundational framework for rubric-based OPD. Specifically, ROPD induces prompt-specific rubrics from teacher-student contrasts, and then utilizes these rubrics to score the student rollouts for on-policy optimization. Empirically, ROPD outperforms the advanced logit-based OPD methods across most scenarios, and achieving up to a 10x gain in sample efficiency. These results position rubric-based OPD as a flexible, black-box-compatible alternative to the prevailing logit-based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open-source LLMs. Code is available at this https URL.
[AI-67] Offline Policy Optimization with Posterior Sampling
[Quick Read]: This paper addresses the trade-off in model-based offline reinforcement learning between generalization and robustness to exploitation errors in out-of-distribution (OOD) regions. Existing methods reduce the risk of model exploitation through overly pessimistic regularization, at the cost of generalization. The key to the proposed Posterior Sampling-based Policy Optimization (PSPO) is to cast dynamics modeling as a Bayesian inference process that explicitly quantifies model fidelity; by combining posterior sampling with constrained policy optimization, it exploits dynamics-consistent OOD transitions for generalization while remaining robust to model exploitation errors.
Link: https://arxiv.org/abs/2605.07393
Authors: Hongqiang Lin, Dongxu Zhang, Yiding Sun, Mingzhe Li, Ning Yang, Haijun Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 25 pages, 3 figures
Abstract:A fundamental challenge in model-based offline reinforcement learning (RL) lies in the trade-off between generalization and robustness against exploitation errors in out-of-distribution (OOD) regions. While OOD samples may capture valid underlying physical dynamics, they also introduce the risk of model exploitation. Existing methods typically address this risk through excessive pessimistic regularization, which ensures robustness but often sacrifices generalization. To overcome this limitation, we propose Posterior Sampling-based Policy Optimization (PSPO), which formulates dynamics modeling as a Bayesian inference process to derive a posterior that explicitly quantifies model fidelity. Through the integration of posterior sampling and constrained policy optimization, our method leverages dynamics-consistent OOD transitions for generalization while ensuring robustness against model exploitation. Theoretically, we formulate Q-value estimation under posterior sampling as a stochastic approximation problem and establish its convergence. We decompose policy optimization into a sequence of constrained subproblems, demonstrating that solving these subproblems guarantees monotonic improvement until convergence. Experiments on standard benchmarks validate that PSPO achieves superior performance compared to state-of-the-art baselines.
[AI-68] Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
[Quick Read]: This paper studies the embodiment gap that arises when adapting Vision-Language-Action (VLA) models to specific robot hardware under a limited data budget, focusing on policy degradation caused by the makeup of the demonstration data. The core challenge: the standard practice of collecting diverse single-shot demonstrations to maximize coverage can fall into a "diversity trap", where non-vanishing estimation noise hurts generalization. The key to the solution is a Coverage-Density Trade-off framework that decomposes policy error into an estimation (density) term and an extrapolation (coverage) term, identifying the optimal allocation of unique conditions under a fixed budget. Building on this analysis, Anchor-Centric Adaptation (ACA) uses a two-stage strategy: first stabilize a policy skeleton through repeated demonstrations at core anchors, then selectively expand to high-risk boundaries via teacher-forced error mining and constrained residual updates. Real-robot experiments show ACA significantly improves task reliability and success rates.
Link: https://arxiv.org/abs/2605.07381
Authors: Yanzhe Chen, Kevin Yuchen Ma, Qi Lv, Yiqi Lin, Zechen Bai, Chen Gao, Mike Zheng Shou
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 21 pages, 8 figures
Abstract:While Vision-Language-Action (VLA) models offer broad general capabilities, deploying them on specific hardware requires real-world adaptation to bridge the embodiment gap. Since robot demonstrations are costly, this adaptation must often occur under a strict data budget. In this work, we identify a critical diversity trap: the standard heuristic of “maximizing coverage” by collecting diverse, single-shot demonstrations can be self-defeating due to non-vanishing estimation noise. We formalize this phenomenon as a Coverage–Density Trade-off. By decomposing the policy error into estimation (density) and extrapolation (coverage) terms, we characterize an interior optimal allocation of unique conditions for a fixed budget. Guided by this analysis, we propose Anchor-Centric Adaptation (ACA), a two-stage framework that first stabilizes a policy skeleton through repeated demonstrations at core anchors, then selectively expands coverage to high-risk boundaries via teacher-forced error mining and constrained residual updates. Real-robot experiments validate our trade-off framework and demonstrate that ACA significantly improves task reliability and success rates over standard diverse sampling strategies under the same budget.
[AI-69] MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
[Quick Read]: This paper targets the inference-time cost of fine-grained sparse attention on long contexts, specifically the overhead of the DeepSeek Sparse Attention (DSA) indexer, whose many parallel scoring heads dominate the cost. The key to the solution, MISA (Mixture of Indexer Sparse Attention), is to treat the DSA indexer's query heads as a pool of mixture-of-experts: a lightweight router uses block-level statistics to dynamically pick a few active heads for token-level scoring, sharply reducing per-query cost. A hierarchical variant further improves candidate-set quality, matching or exceeding the original DSA without additional training while greatly reducing hardware cost and preserving generation quality.
Link: https://arxiv.org/abs/2605.07363
Authors: Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang, Wenjie Pei
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: this https URL
Abstract:DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek-V3.2 and GLM-5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers roughly a 3.82 times speedup over DSA’s original indexer kernel on a single NVIDIA H200 GPU.
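The routing idea above can be sketched in a few lines: a pool of indexer heads, a lightweight router that scores heads from cheap statistics, and token-level scoring paid only by the routed subset. The random scores, the sum-pooling of head scores, and all variable names below are hypothetical stand-ins for illustration, not the actual MISA kernel:

```python
import random

random.seed(0)
n_heads, n_active, seq_len, top_k = 64, 8, 512, 64

# Hypothetical stand-ins: per-head token scores and the router's cheap
# block-level logits (names, shapes, and pooling are illustrative only).
token_scores = [[random.gauss(0, 1) for _ in range(seq_len)] for _ in range(n_heads)]
router_logits = [random.gauss(0, 1) for _ in range(n_heads)]

# Router: pick a query-dependent subset of active heads from block stats.
active = sorted(range(n_heads), key=router_logits.__getitem__)[-n_active:]

# Only the routed heads pay the heavy token-level scoring cost; the
# routed heads share one selected token set, as in DSA.
routed = [sum(token_scores[h][t] for h in active) for t in range(seq_len)]
selected = sorted(range(seq_len), key=routed.__getitem__)[-top_k:]

cost_ratio = n_active / n_heads  # scoring cost vs. running the full pool
```

With 8 of 64 heads active, the per-query token-scoring cost drops to 1/8 of the dense indexer, matching the paper's "eight times fewer indexer heads" setting for DeepSeek-V3.2.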
[AI-70] GraphReAct: Reasoning and Acting for Multi-step Graph Inference
[Quick Read]: This paper addresses how to extend the reasoning-acting paradigm to graph learning tasks. Traditional methods struggle to couple dynamic information acquisition with multi-step reasoning on graph-structured data, which encodes information through both topology and latent representations and therefore requires retrieving local or global evidence while continually refining the accumulated context. The key innovation of GraphReAct is a graph-based action space with two complementary retrieval actions: topological retrieval captures local structural dependencies, and semantic retrieval accesses non-local but relevant evidence in the representation space; a context refinement action then distills accumulated information into a compact representation, enabling a progressive transition from context expansion to compression and markedly improving reasoning over complex graph structures.
Link: https://arxiv.org/abs/2605.07357
Authors: Xingtong Yu, Zhongwei Kuai, Chang Zhou, Xuanting Xie, Renhe Jiang, Xikun Zhang, Hong Cheng, Xinming Zhang, Yuan Fang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Under review
Abstract:Reasoning-acting frameworks enhance large language models (LLMs) by interleaving reasoning with actions for dynamic information acquisition. However, extending this paradigm to graph learning remains underexplored. Graph data is inherently structured, with information distributed across nodes and edges and encoded through both topology and latent representations. As a result, effective reasoning over graphs requires not only retrieving informative evidence from the graph, but also progressively refining the accumulated context during multi-step inference. In this work, we propose GraphReAct, a graph reasoning-acting framework that enables step-by-step inference over graph-structured data. Specifically, we design a graph-based action space with two complementary retrieval actions: topological retrieval, which captures local structural dependencies, and semantic retrieval, which accesses non-local but relevant evidence in the representation space. These actions dynamically expand the reasoning context. To further support multi-step reasoning, we introduce another type of action, context refinement, which distills and reorganizes accumulated information into a compact representation. By interleaving reasoning with both retrieval and refinement actions, our framework enables a progressive transition from context expansion to compression. Extensive experiments on six benchmark datasets demonstrate that GraphReAct consistently outperforms state-of-the-art methods, validating the effectiveness of reasoning-acting for graph learning.
[AI-71] Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
[Quick Read]: This paper addresses the gap between final-answer accuracy and reasoning reliability in large reasoning models, which often reach correct answers through flawed intermediate steps. Traditional alignment strategies rely on external verifiers or massive sampling and scale poorly. The key to the proposed CASPO (Confidence-Aware Step-wise Preference Optimization) is to align token-level confidence with step-wise logical correctness via iterative Direct Preference Optimization, without training a separate reward model; at inference time, Confidence-aware Thought (CaT) uses the calibrated confidence to dynamically prune uncertain reasoning branches with negligible O(V) latency.
Link: https://arxiv.org/abs/2605.07353
Authors: Kejia Chen, Jiawen Zhang, Yihong Wu, Kewei Gao, Jian Lou, Zunlei Feng, Mingli Song, Ruoxi Jia
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages
Abstract:Large reasoning models often reach correct answers through flawed intermediate steps, creating a gap between final accuracy and reasoning reliability. Existing alignment strategies address this with external verifiers or massive sampling, limiting scalability. In this work, we introduce CASPO (Confidence-Aware Step-wise Preference Optimization), a framework that aligns token-level confidence with step-wise logical correctness through iterative Direct Preference Optimization, without training a separate reward model. During inference, we propose Confidence-aware Thought (CaT), which leverages this calibrated confidence to dynamically prune uncertain reasoning branches with negligible O(V) latency. Experiments across ten benchmarks and multiple model families show that CASPO consistently improves reasoning reliability and inference efficiency. CASPO scales to Qwen3-8B-Base and surpasses tree-search baselines on AIME’24 and AIME’25 without using reward-model data. We also release a step-wise dataset with confidence annotations to support fine-grained analysis of reasoning reliability. Code is available at this https URL.
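The CaT idea of pruning on calibrated confidence can be illustrated with a minimal sketch. The confidence measure here (max softmax probability, one O(V) pass over the vocabulary logits) and the min-over-steps thresholding rule are plausible stand-ins, not CASPO's exact recipe:

```python
import math

def token_confidence(logits):
    """Max softmax probability over the vocabulary: a single O(V) pass."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)

def prune_branches(branches, threshold=0.5):
    """Keep branches whose minimum step confidence clears the threshold.

    `branches` maps a branch id to its per-step logit vectors; names and
    the thresholding rule are illustrative only."""
    kept = {}
    for name, steps in branches.items():
        if min(token_confidence(step) for step in steps) >= threshold:
            kept[name] = steps
    return kept

# Toy 3-token vocabulary: one sharply peaked branch, one near-uniform one.
branches = {
    "confident": [[4.0, 0.0, 0.0], [5.0, 1.0, 0.0]],
    "uncertain": [[0.1, 0.0, 0.0], [4.0, 0.0, 0.0]],
}
kept = prune_branches(branches, threshold=0.5)  # prunes "uncertain"
```

The near-uniform first step of the "uncertain" branch yields confidence around 0.36, so the whole branch is dropped before any further decoding cost is paid.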
[AI-72] Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate
[Quick Read]: This paper addresses how compile-pass rate can be actively misleading as the sole evaluation metric when large language models (LLMs) generate multi-component domain-specific code such as executable game scenes. The study shows that in this domain compile-pass rate is anti-correlated with functional correctness: chasing compile rate alone yields structurally vacuous scenes with distorted mechanics. The key to the solution is a four-axis evaluation protocol (Mage) covering compile success, runtime success, structural fidelity, and mechanism adherence, combined with conditioning on a structured intermediate representation (IR) to guide generation. Experiments show IR conditioning markedly improves structural fidelity (F1 up to 1.0); although runtime success drops, it avoids vacuously "correct" outputs, enabling a multi-dimensional measure of generation quality.
Link: https://arxiv.org/abs/2605.07342
Authors: Hugh Xuechen Liu, Kıvanç Tatar
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Main Content: 10 pages, 1 figure. In total 22 pages
Abstract:Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol (named 'Mage') – compile success, runtime success, structural fidelity, and mechanism adherence – applied to 858 generation attempts across four open-weight LLMs (7B–30B), 26 hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C# generation achieves the highest runtime-pass rate (43% mean) yet produces structurally vacuous scenes (mechanism F_1 \approx 0.12 ). Structural IR conditioning halves the runtime rate but recovers domain-faithful structure ( F_1 up to 1.00). Within IR conditioning, behavior-only and full-scene granularity are statistically indistinguishable (McNemar p = 1.0 ), indicating input-level granularity saturation. These results show that compile rate is anti-correlated with functional correctness in this domain and that multi-axis evaluation is necessary to detect the divergence. We release the benchmark, replay logs, and per-record metrics for independent verification.
[AI-73] Tools as Continuous Flow for Evolving Agentic Reasoning
[Quick Read]: This paper addresses the error accumulation caused by the step-wise reasoning paradigm in LLM tool chaining, and the limited generalization to unseen tools. The key to the proposed FlowAgent framework is to recast tool chaining as continuous trajectory generation in a semantic space: conditional flow matching produces continuous latent trajectories from a global planning perspective, ensuring coherent and robust tool execution. Theoretically, this continuous formulation is shown to guarantee utility convergence and error attenuation, significantly improving adaptability and stability in long-horizon reasoning tasks.
Link: https://arxiv.org/abs/2605.07339
Authors: Tairan Huang, Siyu Shang, Qiang Chen, Xiu Su, Yi Chen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in orchestrating tools for reasoning tasks. However, existing methods rely on a step-wise paradigm that lacks a global perspective, which causes error accumulation over long horizons and restricts generalization to unseen tools. To overcome these limitations, we propose Tools as Continuous Flow for Evolving Agentic Reasoning (FlowAgent), which reconceptualizes tool chaining as continuous trajectory generation within a semantic space. To systematically evaluate this paradigm, we introduce the first plan-level closed-loop benchmark dedicated to plan-level agentic reasoning in dynamic real-world environments. Specifically, the proposed FlowAgent leverages conditional flow matching to generate continuous latent trajectories, providing a global planning perspective to ensure coherent and robust tool execution. Theoretically, we establish formal bounds on utility convergence and prove that our continuous formulation fundamentally guarantees robust generalization and error attenuation. Empirical evaluations show that FlowAgent achieves superior robustness and adaptability in long-horizon reasoning tasks.
[AI-74] Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
[Quick Read]: This paper addresses the bias-variance dilemma in designing importance sampling (IS) ratios for reinforcement learning: token-level IS ratios, as in PPO and GRPO, are computationally cheap but biased because they ignore prefix state-distribution mismatch; full-sequence IS ratios give exact trajectory-level correction but suffer high variance from the multiplicative accumulation of per-token ratios; GSPO gains numerical stability via length normalization at the cost of deviating from exact full-sequence correction. The key to the solution is the cumulative token IS ratio, the product of per-token ratios up to position t, which is proven to provide an unbiased prefix correction for each token-level gradient term under the token-level policy-gradient formulation, with strictly lower variance than the full-sequence ratio. Building on this, CTPO adds position-adaptive clipping whose log-space bounds scale with the natural sqrt(t) growth of the cumulative log-ratio, yielding consistent regularization across token positions and outperforming GRPO and GSPO baselines on mathematical reasoning benchmarks in the tool-integrated reasoning setting.
Link: https://arxiv.org/abs/2605.07331
Authors: Yuheng Zhang, Chenlu Ye, Shuowei Jin, Changlong Yu, Wei Xiong, Saurabh Sahu, Nan Jiang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existing methods face a fundamental bias-variance dilemma: token-level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch; full sequence ratios provide exact trajectory-level correction but suffer from high variance due to the multiplicative accumulation of per-token ratios, while GSPO (Zheng et al., 2025) improves numerical stability via length normalization at the cost of deviating from the exact full-sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per-token ratios up to position t , as a theoretically principled solution to this dilemma. We prove that, under the token-level policy-gradient formulation, this ratio provides an unbiased prefix correction for each token-level gradient term and has strictly lower variance than the full sequence ratio. Building on this insight, we propose CTPO (Cumulative Token Policy Optimization), which combines the cumulative token IS ratio with position-adaptive clipping that scales log-space clip bounds according to the natural \sqrtt growth of the cumulative log-ratio. This yields more consistent regularization across token positions. We implement and evaluate CTPO in the tool-integrated reasoning setting on several challenging mathematical reasoning benchmarks, achieving the best average performance across both model scales compared with strong GRPO and GSPO baselines. Code will be available at this https URL.
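The cumulative token IS ratio is straightforward to compute from per-token log-probabilities; a sketch under toy numbers (the log-probs and the unit clip scale are illustrative, not CTPO's training configuration):

```python
import math

def cumulative_ratios(logp_new, logp_old):
    """Cumulative token IS ratio: the product of per-token ratios up to
    position t, computed stably as exp of a running log-ratio sum."""
    ratios, running = [], 0.0
    for ln, lo in zip(logp_new, logp_old):
        running += ln - lo
        ratios.append(math.exp(running))
    return ratios

# Toy per-token log-probs under the new policy and the old (behavior)
# policy that generated the rollout.
logp_new = [-0.5, -1.0, -0.2]
logp_old = [-0.6, -0.9, -0.4]

cum = cumulative_ratios(logp_new, logp_old)

# The full-sequence ratio is just the last cumulative ratio, and the
# position-adaptive clip bound grows like sqrt(t) in log space.
full_seq = math.exp(sum(logp_new) - sum(logp_old))
clip_bounds = [math.sqrt(t + 1) for t in range(len(cum))]
```

Note that every intermediate `cum[t]` is a partial product of the same per-token factors, which is why its variance can only be lower than that of the full-sequence ratio it terminates in.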
[AI-75] SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
[Quick Read]: This paper addresses the communication bottleneck of parameter synchronization in large-scale reinforcement learning (RL) systems, where full-weight transfers dominate throughput and tail latency in bandwidth-constrained or network-variable deployments (cross-datacenter, heterogeneous resource pools, online RL). The key to the solution is to exploit the high element-level sparsity of weight updates during training (often 99%+): SparseRL-Sync transmits only the indices and values of parameters that actually changed, enabling lossless reconstruction and reducing per-update communication from S to roughly S/X (about 100x less with X ~ 100). Combined with appropriate bucketing, it also cuts launch and control-plane overhead, improving scalability and end-to-end efficiency in bandwidth-limited and highly asynchronous settings.
Link: https://arxiv.org/abs/2605.07330
Authors: Lucas Hu, Ranchi Zhao, Isaac Zhu, Zach Zhang, Hscos Zhang, Hugh Yin, Jason Zhao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Code will be released at this https URL
Abstract:In large-scale reinforcement learning (RL) systems with decoupled Trainer-Rollout execution, the Trainer must regularly synchronize policy weights to the Rollout side to limit policy staleness. When inter-node bandwidth is abundant, such synchronization is usually only a small fraction of end-to-end cost. As model size grows, however, the communication demand rises rapidly. In bandwidth-constrained or network-variable deployments – for example, cross-datacenter or cross-cluster settings, heterogeneous resource pools, and online RL – weight synchronization can become a dominant bottleneck for throughput and tail latency. We observe that, in mainstream large-model RL training, the locations where parameters actually change are highly sparse at the element level (often 99%+ sparsity). Building on this observation, we propose and implement SparseRL-Sync, which replaces full-weight transfers with a lossless sparse update payload (indices and values) that can be exactly reconstructed on the inference side, thereby preserving 100% fidelity. Under a simplified cost model, sparse synchronization reduces the per-update communication volume from S to approximately S/X; with 99% sparsity (X ~ 100), this yields about a 100x reduction in transmitted data. Combined with appropriate bucketing, SparseRL-Sync also reduces launch and control-plane overhead, significantly improving scalability and end-to-end efficiency in bandwidth-limited and highly asynchronous RL settings.
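The sparse-update payload and its lossless reconstruction can be sketched directly; the flat-list weights and the 1%-change simulation below are toy stand-ins for illustration, not the system's actual serialization format:

```python
import random

random.seed(0)

def sparse_delta(old, new, eps=0.0):
    """Payload: (index, value) pairs where the weights actually changed."""
    return [(i, v) for i, (u, v) in enumerate(zip(old, new)) if abs(v - u) > eps]

def apply_delta(old, delta):
    """Lossless reconstruction on the inference side: 100% fidelity."""
    out = list(old)
    for i, v in delta:
        out[i] = v
    return out

# Toy weight vector where only ~1% of elements move during an update,
# mirroring the 99%+ element-level sparsity observed in practice.
n = 10_000
old = [random.random() for _ in range(n)]
new = list(old)
for i in random.sample(range(n), n // 100):
    new[i] += 0.01

delta = sparse_delta(old, new)
reduction = n / len(delta)          # ~100x fewer values to transmit
restored = apply_delta(old, delta)  # exactly equals the new weights
```

In a real system the payload would also carry index dtype and bucketing metadata, so the realized reduction is somewhat below the raw element-count ratio, but the lossless-reconstruction property is the same.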
[AI-76] CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations
[Quick Read]: This paper addresses the time-to-first-token (TTFT) bottleneck that arises when large language models (LLMs) serve as continuous cognitive engines for robots and must process long state histories. Existing approaches such as retrieval-augmented generation (RAG) or sliding windows either sacrifice global context or incur heavy re-computation. The key to the solution is to theoretically prove and realize three necessary conditions: prefix stability, incremental extensibility, and asynchronous state reconciliation. The Cached State Representation (CSR) framework instantiates these properties for optimal KV-cache reuse, and an Asynchronous State Reconciliation (ASR) algorithm offloads state-memory eviction to a parallel computational resource, eliminating latency spikes. Experiments show CSR reduces TTFT 26-fold (14.67 s to 0.56 s) for 120K-token contexts, ASR keeps TTFT bounded and spike-free in continuous operation, and together they enable LLMs to run as high-frequency (>2 Hz) embodied policies.
Link: https://arxiv.org/abs/2605.07325
Authors: Robin Karlsson, Go Suzui
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Extended Technical Report for Paper Accepted to IEEE RA-L
Abstract:Deploying massive large language models (LLMs) as continuous cognitive engines for robotics is bottlenecked by the time-to-first-token (TTFT) latency required to process extensive state histories. Existing solutions like RAG or sliding windows compromise global context or incur prohibitive re-computation costs. We formalize the optimal task structure for minimizing latency and theoretically prove that prefix stability, incremental extensibility, and asynchronous state reconciliation are necessary conditions for real-time performance. Building on these proofs, we introduce the Cached State Representation (CSR) framework as the practical instantiation of these properties, ensuring optimal KV-cache reuse. To sustain these properties over infinite horizons, we further propose an Asynchronous State Reconciliation (ASR) algorithm that offloads state memory eviction to a parallel computational resource to eliminate latency spikes. On a physical robot wirelessly connected to an on-premise GPU server, CSR achieves a 26-fold latency reduction (14.67s to 0.56s) for 120K token contexts with a 235B parameter model compared to a standard baseline. On an embodied AI benchmark, we achieve SOTA recall (0.836 vs. 0.459) while maintaining RAG-level latency. ASR is validated to sustain bounded, spike-free TTFT over 10 eviction cycles in continuous real-world operation. Together, CSR and ASR enable massive LLMs to function as continuously operating, high-frequency (>2 Hz) embodied policies.
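The prefix-stability property can be shown with a minimal cache-reuse model: if each new context strictly extends the cached one, only the appended suffix pays prefill cost; any prefix rewrite forfeits the cache. The token-count cost model and names below are our illustration of the property, not the CSR implementation:

```python
def prefill_cost(history, cached_prefix):
    """Tokens that must be (re)processed given what the KV cache holds.
    If the new context extends the cached one (prefix stability), only
    the suffix is paid for; otherwise the whole context is recomputed."""
    if history[:len(cached_prefix)] == cached_prefix:
        return len(history) - len(cached_prefix)
    return len(history)

# Append-only state representation: each step extends the previous
# context (incremental extensibility), so prefill cost stays constant.
cached, costs = [], []
for step in range(5):
    context = cached + [f"obs_{step}"]
    costs.append(prefill_cost(context, cached))
    cached = context

# A representation that rewrites its prefix (e.g. in-place summarization)
# loses all cache reuse and pays for the full context again.
unstable_cost = prefill_cost(["summary"] + cached[1:], cached)
```

This is why eviction must be reconciled asynchronously: performing it inline would rewrite the prefix and trigger exactly the full-recompute spike the framework is designed to avoid.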
[AI-77] Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation ICML 2026
[Quick Read]: This paper addresses the automatic discovery of the ordinary differential equations (ODEs) governing dynamical systems from observational data, in particular the limitation that existing symbolic regression methods rely solely on quantitative metrics and ignore physical plausibility. The key to the proposed DoLQ method is a multi-agent architecture: a Sampler Agent proposes candidate systems, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent uses a large language model (LLM) to perform joint qualitative and quantitative evaluation, iteratively guiding the search based on these assessments, thereby maintaining high accuracy while ensuring the physical interpretability and correctness of the discovered equations.
Link: https://arxiv.org/abs/2605.07323
Authors: Sum Kyun Song, Bong Gyun Shin, Jae Yong Lee
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
Comments: Accepted at ICML 2026
Abstract:Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression approaches rely primarily on quantitative metrics; however, real-world differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM-based qualitative and quantitative evaluation. DoLQ employs a multi-agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search. Experiments on multi-dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations. Our code is available at this https URL.
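摘要描述的采样–优化–评估三智能体循环可以示意如下(智能体内部用桩函数代替真实 LLM 调用,`dolq_iteration` 等名称为本文虚构,仅用于说明控制流,非论文实现):

```python
# Schematic of the Sampler -> Optimizer -> Scientist loop from the abstract.
# The three agents are stand-in lambdas; a real system would call an LLM here.

def dolq_iteration(sampler, optimizer, scientist, history):
    candidate = sampler(history)             # propose a dynamic-system candidate
    refined = optimizer(candidate)           # fit parameters for accuracy
    verdict = scientist(refined)             # joint qualitative + quantitative score
    history.append((refined, verdict))
    return max(history, key=lambda t: t[1])  # best-so-far guides the next round

# Stub agents for a runnable demonstration:
sampler = lambda h: "dx/dt = a*x"
optimizer = lambda c: c.replace("a", "1.5")
scientist = lambda c: 0.9 if "1.5" in c else 0.1

best = dolq_iteration(sampler, optimizer, scientist, [])
```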
[AI-78] Generative Modeling with Flux Matching
【速读】:该论文旨在解决传统基于得分的生成模型(score-based generative models)在建模灵活性和可解释性方面的局限性,特别是其对向量场必须为保守场(conservative vector field)的严格约束,这限制了模型引入结构先验、诱导偏差以及编码变量间定向依赖关系的能力。解决方案的关键在于提出“通量匹配”(Flux Matching),这是一种新的生成建模范式,放宽了对向量场必须等于数据得分的要求,转而施加一个更弱的条件:允许存在无限多个具有数据分布作为稳态分布的向量场。这一设计使模型能够直接优化或嵌入特定动力学特性,从而实现更快采样、更具可解释性的机制模型,并支持编码变量间的定向依赖关系,从根本上将向量场从固定目标转变为可设计的设计变量。
链接: https://arxiv.org/abs/2605.07319
作者: Peter Pao-Huang,Xiaojie Qiu,Stefano Ermon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Flux Matching, a new paradigm for generative modeling that generalizes existing score-based models to a broader family of vector fields that need not be conservative. Rather than requiring the model to equal the data score, the Flux Matching objective imposes a weaker condition that admits infinitely many vector fields whose stationary distribution is the data. This flexibility enables a class of generative models that cannot be learned under score matching, in which inductive biases, structural priors, and properties of the dynamics can be directly imposed or optimized. We show that Flux Matching performs strongly on high-dimensional image datasets and, more importantly, that our added freedom unlocks a range of applications including faster sampling, interpretable and mechanistic models, and dynamics that encode directed dependencies between variables. More broadly, Flux Matching opens a new dimension in generative modeling by turning the vector field itself into a design choice rather than a fixed target. Code is available at this https URL.
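按摘要的思路,"比得分匹配更弱的条件"有一个标准的数学表述(以下具体形式系根据扩散模型的常见写法补全,属于笔者推断,非论文原文):对 Langevin 型动力学,数据分布是平稳分布等价于 Fokker–Planck 平稳方程成立:

```latex
% Langevin dynamics: dx_t = v(x_t)\,dt + \sqrt{2}\,dW_t
% Stationarity of p_{\mathrm{data}} is the Fokker--Planck condition
\nabla \cdot \big( p_{\mathrm{data}}(x)\, v(x) \big) = \Delta p_{\mathrm{data}}(x)
\;\Longleftrightarrow\;
\nabla \cdot \Big( p_{\mathrm{data}}(x)\,\big( v(x) - \nabla \log p_{\mathrm{data}}(x) \big) \Big) = 0 .
% Score matching picks the zero-flux solution v = \nabla \log p_{\mathrm{data}};
% any divergence-free flux J (\nabla \cdot J = 0) yields another valid field
% v = \nabla \log p_{\mathrm{data}} + J / p_{\mathrm{data}}.
```

由此可见满足平稳性的向量场有无穷多个,得分函数只是其中"零通量"的保守解,这正对应摘要所说的将向量场本身变为设计变量。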
[AI-79] Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习中因可验证奖励(verifiable rewards)引导推理时出现的“过度思考”(overthinking)问题,即模型生成冗长且不必要的推理轨迹,导致效率低下。现有方法如长度惩罚(length penalties)或早停策略(early-exit strategies)存在准确性下降或依赖“可安全截断”假设等局限性。论文的关键创新在于提出隐式压缩正则化(Implicit Compression Regularization, ICR),其核心思想基于训练过程中长度与准确率相关性的动态变化规律:初始阶段二者呈负相关(短响应更可能正确),但随压缩推进相关性持续上升,意味着短响应逐渐失去正确性优势、策略滑向思考不足(underthinking)。据此,ICR利用在线采样组中最短正确响应构建虚拟短分布作为压缩信号,无需显式惩罚或截断假设,从而引导策略向简洁且正确的轨迹收敛,实验表明其能稳定提升精度-长度帕累托前沿表现。
链接: https://arxiv.org/abs/2605.07316
作者: Chen Wang,Hexuan Deng,Yining Zhang,Yuchen Zhang,Jionghao Bai,Zhaochun Li,Ge Lan,Yue Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning with verifiable rewards improves LLM reasoning but often induces overthinking, where models generate unnecessarily long reasoning traces. Existing methods mainly rely on length penalties or early-exit strategies; however, the former may degrade accuracy and induce underthinking, whereas the latter assumes that substantial portions of reasoning traces can be safely truncated. To obtain a compression signal without these limitations, we revisit the training dynamics of existing compression methods. We observe that the length–accuracy correlation is initially negative but continually increases during compression, indicating that shorter responses are initially more likely to be correct but gradually lose this property as the policy moves toward underthinking. Based on this observation, we formalize overthinking: a negative correlation indicates an overthinking regime, while a positive one indicates underthinking. When overthinking, the shortest correct responses are shorter than the group-average response length in expectation, making them natural compression targets already present in on-policy rollouts. We therefore propose Implicit Compression Regularization (ICR), an on-policy regularization method whose compression signal comes from a virtual shorter distribution induced by the shortest correct responses in rollout groups, guiding the policy toward concise yet correct trajectories. Training dynamics show that ICR maintains a better length–accuracy correlation during compression, indicating that short responses remain better aligned with correctness instead of drifting toward underthinking. Experiments on three reasoning backbones and multiple mathematical and knowledge-intensive benchmarks show that ICR consistently shortens responses while preserving or improving accuracy, achieving a stronger accuracy–length Pareto frontier.
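下面用纯 Python 演示摘要中的两个核心量——组内"长度–准确率相关性"(负相关即处于过度思考区间)以及作为压缩目标的最短正确响应。仅为基于摘要的示意,非论文实现:

```python
# Illustrative diagnostics (not the paper's code): Pearson correlation between
# response length and correctness, and the shortest-correct compression target.

def length_accuracy_corr(lengths, correct):
    n = len(lengths)
    ml = sum(lengths) / n
    mc = sum(correct) / n
    cov = sum((l - ml) * (c - mc) for l, c in zip(lengths, correct))
    vl = sum((l - ml) ** 2 for l in lengths) ** 0.5
    vc = sum((c - mc) ** 2 for c in correct) ** 0.5
    return cov / (vl * vc) if vl and vc else 0.0

def compression_target(lengths, correct):
    # Shortest correct response in the rollout group, if any exists.
    pool = [l for l, c in zip(lengths, correct) if c]
    return min(pool) if pool else None

lengths = [120, 340, 560, 900]
correct = [1, 1, 0, 0]          # shorter responses correct: overthinking regime
r = length_accuracy_corr(lengths, correct)
```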
[AI-80] When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
【速读】:该论文旨在解决现有代理记忆(agent memory)评估方法在面对无关会话(irrelevant sessions)持续累积时,无法准确衡量证据可用性的问题。传统评估仅报告固定快照下的准确率或检索质量,忽略了记忆系统在证据保留增长(evidence-preserving growth)场景中的动态可靠性变化。解决方案的关键在于提出一种尺度条件化评估协议(scale-conditioned evaluation protocol):对每个查询保持任务相关证据不变,逐步添加无关会话,并记录代理-记忆轨迹,从而诊断四类指标——预算合规可靠性、尾部记忆调用负担、失败模式分解及可用尺度边界(usable-scale boundary)。该协议揭示了可靠性损失并非单一现象,而是受代理模型、记忆接口结构(flat/planar/hierarchical)和交互预算共同影响,为可扩展记忆系统的有效性声明提供了条件化框架。
链接: https://arxiv.org/abs/2605.07313
作者: Jiaqi Shao,Yiyi Lu,Yunzhen Zhang,Bing Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 11 figures, preprint
Abstract:Memory-agent evaluations report fixed-snapshot accuracy or retrieval quality, but these scores do not show whether evidence remains usable as irrelevant sessions (sessions not annotated as task-relevant evidence for the query) accumulate. We present a scale-conditioned evaluation protocol for agent memory under evidence-preserving growth: for each query, task evidence is held fixed while irrelevant sessions are added. The protocol logs agent–memory trajectories and reports four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, and the usable-scale boundary where reliability falls below the target. Applied to LongMemEval and LoCoMo across flat, planar, and hierarchical memory interfaces, the protocol shows reliability loss is not a single phenomenon. On LongMemEval, HippoRAG stays within the two-call budget but loses 16–20 percentage points in budget-compliant reliability as irrelevant sessions are added; LiCoMemory’s observed failures depend strongly on the agent, with Qwen3-8B exceeding the budget while Qwen3-32B and Qwen3-235B remain reliable in the tested range. The result supports a framework for making scalable-memory claims conditional on agent, interface, scale range, and interaction budget.
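摘要中的"可用尺度边界"可以用如下极简草图说明:固定证据、逐级加入无关会话后,找出可靠性仍不低于目标值的最大尺度。函数名与数据为本文虚构,非基准代码:

```python
# Minimal sketch of the scale-conditioned protocol: evidence is held fixed,
# irrelevant sessions are added in steps, and the usable-scale boundary is the
# largest scale whose measured reliability still meets the target.

def usable_scale_boundary(reliability_at_scale, target=0.8):
    # reliability_at_scale: {num_irrelevant_sessions: measured reliability}
    boundary = None
    for scale in sorted(reliability_at_scale):
        if reliability_at_scale[scale] >= target:
            boundary = scale
        else:
            break  # first failure ends the usable range
    return boundary

measured = {0: 0.95, 50: 0.88, 100: 0.81, 200: 0.72, 400: 0.60}
```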
[AI-81] BioProVLA-Agent: An Affordable Protocol-Driven Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation
【速读】:该论文旨在解决生物实验室自动化中可靠实体执行(embodied execution)的挑战,特别是在湿实验(wet-lab)环境中,由于协议通常非结构化、器皿常为透明或反光材质,以及多步骤操作需要状态感知而非单次指令跟随所导致的鲁棒性不足问题。解决方案的关键在于提出一个基于视觉-语言-动作(Vision-Language-Action, VLA)模型的协议驱动型多智能体系统——BioProVLA-Agent,其核心由三个模块构成:一个定制的大语言模型(LLM)协议代理将实验流程转化为可验证子任务;一个基于视觉语言模型与检索增强生成(VLM-RAG)的验证代理通过观察、机器人状态、知识检索和成功/失败案例评估任务准备与完成情况;以及一个轻量级策略驱动的实体执行代理完成已验证子任务。此外,为提升在湿实验场景下对透明器皿、反射、光照变化等视觉扰动的鲁棒性,作者进一步设计了在线增强策略AugSmolVLA,显著改善了精确放置、透明物体操作及复合流程中的执行稳定性,为实现低成本、以协议为中心、具备验证能力的生物操作具身AI提供了可行路径。
链接: https://arxiv.org/abs/2605.07306
作者: Zhaohui Du,Zhe Wang,Hongmei Fei,Xiwen Cao,Ting Xiao,Qi Wang,Huanbo Jin,Jiaming Gu,Quan Lu,Zhe Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures
Abstract:Biological laboratory automation can reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet-lab environments remains challenging. Protocols are often unstructured, labware is frequently transparent or reflective, and multi-step procedures require state-aware execution beyond one-shot instruction following. Existing robotic systems often rely on costly hardware, fixed workflows, dedicated instruments, or robotics-oriented interfaces. Here, we introduce BioProVLA-Agent, an affordable, protocol-driven, vision-enhanced embodied multi-agent system enabled by Vision-Language-Action (VLA) models for biological manipulation. The system uses protocols as the task interface and integrates protocol parsing, visual state verification, and embodied execution in a closed-loop workflow. A Tailored LLM Protocol Agent converts protocols into verifiable subtasks; a VLM-RAG Verification Agent assesses readiness and completion using observations, robot states, retrieved knowledge, and success/failure examples; and a VLA Embodied Agent executes verified subtasks through a lightweight policy. To improve robustness under wet-lab visual perturbations, we develop AugSmolVLA, an online augmentation strategy targeting transparent labware, reflections, illumination shifts, and overexposure. We evaluate the system on a hierarchical benchmark covering 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, including tube loading, sorting, waste disposal, cap twisting, and liquid pouring. Across normal and high-exposure settings, AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA, especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes. These results suggest a practical route toward accessible, protocol-centered, and verification-capable embodied AI for biological manipulation.
[AI-82] SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在多智能体和博弈论环境中难以准确预测对手行为的问题。现有方法通常将对手建模与预测过程耦合,依赖隐式的上下文推理,限制了在动态交互中的适应性。其解决方案的关键在于提出结构化对手建模(Structured Opponent Modeling, SOM),该框架采用两阶段设计:第一阶段利用结构因果模型(Structural Causal Model, SCM)显式构建对手的结构化表示,明确捕捉观测与行动之间的因果依赖关系;第二阶段则基于SCM提供的清晰路径进行结构化推理,从而提升预测准确性与稳定性。实验表明,SOM在多个多智能体基准测试中显著优于当前最先进的LLM推理基线,增强了复杂动态场景下的策略决策能力。
链接: https://arxiv.org/abs/2605.07301
作者: Shiyue Cao,Pei Xu,Likun Yang,Lei Cui,Xiaotang Chen,Kaiqi Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Accurately predicting opponents’ behavior from interactions is a fundamental capability for large language model (LLM)-based agents in multi-agent and game-theoretic environments. Existing approaches often entangle opponent modeling with prediction, relying on implicit contextual reasoning and limiting adaptability in dynamic interactions. To this end, we propose Structured Opponent Modeling (SOM), a two-stage opponent modeling framework that distinctly separates opponent model construction and opponent prediction. At the construction stage, SOM employs a Structural Causal Model (SCM), a graph-based formalism for representing dependencies among variables, to capture directed links between opponents’ observations and actions, yielding an explicit and structured opponent representation. At the prediction stage, the LLM performs structured reasoning along clear pathways derived from the SCM, improving both prediction accuracy and stability. Extensive experiments on diverse multi-agent benchmarks demonstrate that SOM consistently outperforms state-of-the-art LLM-based reasoning baselines, enabling more accurate and adaptable strategic decision-making in complex and dynamic multi-agent interactions.
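摘要中"以 SCM 存储观测到行动的有向依赖、再沿明确路径推理"的思路可以示意如下(变量名为本文假设,非论文实现):

```python
# Illustrative structured opponent model: a small SCM stored as directed edges
# from observation variables to action variables, queried to produce the
# explicit reasoning pathway handed to the predictor.

class OpponentSCM:
    def __init__(self):
        self.parents = {}   # action variable -> list of observation parents

    def add_edge(self, obs_var, action_var):
        self.parents.setdefault(action_var, []).append(obs_var)

    def reasoning_path(self, action_var):
        # The explicit pathway along which structured prediction proceeds.
        return self.parents.get(action_var, [])

scm = OpponentSCM()
scm.add_edge("opponent_stack_size", "bet_amount")
scm.add_edge("board_texture", "bet_amount")
```

这样做的好处正是摘要强调的两阶段解耦:模型构建阶段产出显式结构,预测阶段只需沿 `reasoning_path` 给出的路径推理。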
[AI-83] Mask2Cause: Causal Discovery via Adjacency Constrained Causal Attention
【速读】:该论文旨在解决时间序列因果发现中深度学习方法的两大挑战:一是现有神经网络架构多采用逐变量设计,难以捕捉系统内部共享的动力学特征;二是依赖解耦后处理的图提取方式易受虚假相关性的干扰,导致过拟合。其解决方案的关键在于提出一个端到端框架Mask2Cause,通过引入反向变量嵌入(Inverted Variable Embedding)和邻接约束掩码注意力机制(Adjacency-Constrained Masked Attention),在预测前向传播过程中直接恢复潜在因果图,并结合同方差或异方差目标函数以同时建模均值与方差层面的因果影响,从而实现高精度因果结构识别与模型参数显著压缩。
链接: https://arxiv.org/abs/2605.07280
作者: Omar Muhammad,Pasupuleti Dhruv Shivkant,Deepak N. Subramani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Leveraging deep learning for causal discovery in time series remains challenging because existing neural methods predominantly rely on component-wise architectures that fail to capture shared system dynamics or employ decoupled post-hoc graph extraction that risks overfitting to spurious correlations. We propose Mask2Cause, an end-to-end framework that recovers the underlying causal graph directly during the forecasting forward pass. Our approach introduces an Inverted Variable Embedding and an Adjacency-Constrained Masked Attention mechanism, trained with homoscedastic or heteroscedastic objectives to capture causal influences in both mean and variance. Empirical results on diverse benchmarks, from synthetic chaotic dynamics to realistic biological simulations, demonstrate state-of-the-art causal discovery with significantly reduced parameter complexity compared to standard baselines. We further show that inferred causal structures can be used to reduce parameter count of forecasting models by more than 70% on average while maintaining predictive accuracy.
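"邻接约束掩码注意力"的机制可以用一个纯 Python 小例子说明:候选邻接矩阵将注意力分数掩蔽,使每个变量只能关注其候选原因。仅为示意,非论文实现:

```python
# Pure-Python sketch of adjacency-constrained attention: adjacency[i][j] = 1
# keeps the edge j -> i; masked entries get -inf before the softmax, so they
# receive exactly zero weight while rows stay normalized.

import math

def masked_attention_weights(scores, adjacency):
    out = []
    for i, row in enumerate(scores):
        masked = [s if adjacency[i][j] else float("-inf")
                  for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) == 0.0
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

scores = [[1.0, 2.0, 0.5], [0.3, 1.0, 2.0], [2.0, 0.1, 1.0]]
adj = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
w = masked_attention_weights(scores, adj)
```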
[AI-84] Bifurcation Models: Learning Set-Valued Solution Maps with Weight-Tied Dynamics
【速读】:该论文旨在解决标准监督学习在处理具有多个正确解的科学与组合问题时所面临的歧义性问题,即传统方法通过人为选择一个解作为目标标签,可能导致隐式选择器(selector)任意、不连续且难以学习。其解决方案的关键在于提出分岔模型(bifurcation models),这是一种权重共享的动力学视角,其中不同的初始条件可收敛至不同的稳定平衡点,从而将模型视为一个吸引子景观(attractor landscape)而非单一分支。作者证明了广义的多值映射(set-valued maps)若具备局部Lipschitz连续的分支,可通过正则平衡动力学表示,且由此诱导的选取器几乎处处光滑,显著优于人工设计的选择器。实验表明,该方法无需分支标签即可发现多个有效平衡态,并在受挫伊辛模型中优于单分支监督策略;同时,在Allen–Cahn系统中揭示多样性并非自动涌现,需显式鼓励,但存在精度与多样性之间的权衡关系。
链接: https://arxiv.org/abs/2605.07277
作者: Caleb Jore,Jialin Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Many scientific and combinatorial problems admit multiple correct solutions, not a single label. Standard supervised learning resolves this ambiguity by choosing one solution as the target, but this hidden selector can be arbitrary, discontinuous, and harder to learn than the underlying solution set. We study bifurcation models, a weight-tied dynamical view in which different initializations can converge to different stable equilibria, so the model represents an attractor landscape rather than one chosen branch. We prove that broad set-valued maps with locally Lipschitz branches can be represented by regular equilibrium dynamics and that the induced selectors are almost everywhere regular, while manual selectors can be arbitrarily irregular. Experiments on frustrated Ising models show that such dynamics can discover multiple valid equilibria without branch labels and outperform single-branch supervision. Allen–Cahn experiments further show that diversity is not automatic: it can be encouraged explicitly, but with an accuracy–diversity tradeoff.
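"同一套权重共享动力学、不同初始化收敛到不同稳定平衡点"这一核心直觉,可以用一个玩具例子演示(非论文架构,仅为示意):

```python
# Toy bifurcation dynamics: iterating x <- tanh(gain * x) with gain > 1 has two
# symmetric stable fixed points, so the attractor reached depends only on the
# initialization while the update rule (the "weights") is shared.

import math

def iterate_to_equilibrium(x0, gain=2.0, steps=200):
    x = x0
    for _ in range(steps):
        x = math.tanh(gain * x)   # shared (weight-tied) update
    return x

pos = iterate_to_equilibrium(0.1)    # converges to the positive equilibrium
neg = iterate_to_equilibrium(-0.1)   # same dynamics, different attractor
```

模型因此表示的是整个吸引子景观,而非某个人为挑选的分支,这正对应摘要中"set-valued map 的多分支表示"。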
[AI-85] Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
【速读】:该论文旨在解决代码代理强化学习(Code-agent RL)中因弱反馈导致的训练效率与效果受限问题,即当前rollout-time信号虽可靠且可执行,但仅反映任务成功的必要条件或表面特征,难以捕捉目标语义谓词(semantic predicate)。其解决方案的关键在于对信号进行精细化重塑:首先通过编译-语义分层奖励(compile-and-semantic layered rewards)恢复轨迹间的语义排序;其次引入步骤级过程评分(step-level process scores),在组内奖励归一化之外调整轨迹内部更新强度,实现过程信用定位;最后采用故障原因感知的回放治理(failure-cause-aware rollout governance)确保同提示下的回放具有执行可比性。这一最小信号重塑结构保留了GRPO原有的组归一化优势计算机制,实验表明该方法使严格编译与语义准确率从基础模型的0.385提升至0.535,并通过消融实验验证各组件贡献,尤其过程评分加权进一步将准确率提升至0.53并减少平均评估步数至17.02。
链接: https://arxiv.org/abs/2605.07276
作者: Jia Li,Yuxin Su,Ting Peng,Hailiang Huang,Yuetang Deng,Michael R. Lyu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Code-agent RL often receives weak feedback: rollout-time signals are reliable and executable, but capture only necessary or surface conditions for task success rather than the target semantic predicate. Using agentic compile-fix as the setting, we study signal reshaping for standard GRPO under such feedback. Our central claim is that GRPO’s within-group comparison is meaningful only after three kinds of signals are reshaped: outcome rewards recover semantic ranking, process signals localize intra-trajectory credit, and rollouts from the same prompt remain execution-comparable. We operationalize these conditions with a minimal signal-reshaping construction that leaves GRPO’s group-normalized advantage construction unchanged: compile-and-semantic layered rewards reshape trajectory ranking, step-level process scores outside group reward normalization reshape within-trajectory update strength, and failure-cause-aware rollout governance reshapes within-group comparability. Experiments show a clear end-to-end gain: full signal-reshaped GRPO improves strict compile-and-semantic accuracy from the base model’s zero-shot 0.385 to 0.535. Controlled comparisons further explain the source of this gain: binary rewards remove the compile-only middle tier and degrade trajectory control; on top of layered rewards, process-score weighting further improves accuracy from 0.48 to 0.53 and reduces average evaluation steps from 23.50 to 17.02. As a boundary comparison, privileged-prompt token-level distillation mainly optimizes local distributional alignment; in long tool-use trajectories, this signal is diluted by non-critical tokens and cannot replace outcome semantics, process credit, or within-group comparability.
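摘要中的"编译-语义分层奖励"可以写成如下草图(0.3 这一中间档取值为本文假设,论文摘要未给出具体数值):语义正确高于仅能编译,仅能编译高于失败,从而恢复 GRPO 组内比较所需的排序,而二值奖励会抹掉这一中间档:

```python
# Hypothetical compile-and-semantic layered reward (values are illustrative):
# semantic success > compile-only > failure restores a meaningful ranking for
# GRPO's within-group comparison.

def layered_reward(compiles, semantically_correct):
    if compiles and semantically_correct:
        return 1.0
    if compiles:
        return 0.3   # middle tier that a binary reward would erase
    return 0.0

rollouts = [(True, True), (True, False), (False, False)]
rewards = [layered_reward(c, s) for c, s in rollouts]
```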
[AI-86] Structured Role-Aware Policy Optimization for Multimodal Reasoning
【速读】:该论文旨在解决多模态强化学习中序列级奖励分配无法区分不同token功能角色的问题,尤其在视觉语言模型(LVLM)的推理过程中,最终答案奖励通常仅在序列层面赋予,导致难以判断正确答案是否由任务相关的视觉证据支撑。解决方案的关键在于提出结构化角色感知策略优化(SRPO),通过引入角色感知的token级信用分配机制,在不改变原始奖励函数的前提下,将序列级GRPO优势分解为感知token与推理token的分角色优势:感知token依据原始输入与扰动输入下的视觉依赖性进行强化,推理token则根据其与生成感知内容的一致性进行调整;二者通过共享轨迹级基线统一,从而获得正向的token权重,既保留了原GRPO的奖励信号和优化方向,又实现了对关键视觉证据的有效识别与利用,无需外部奖励模型或额外教师网络。
链接: https://arxiv.org/abs/2605.07274
作者: Bingqing Jiang,Difan Zou
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 32 pages
Abstract:Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is supported by task-relevant visual evidence. In this paper, we revisit multimodal RLVR from the perspective of role-aware token-level credit assignment, where structured responses are decomposed into perception tokens for extracting visual evidence and reasoning tokens for deriving answers from that evidence. Based on this perspective, we propose Structured Role-aware Policy Optimization (SRPO), which refines the sequence-level GRPO advantage into role-aware token-level advantages without changing the reward function. Specifically, SRPO assigns role-specific credit by using self-distilled on-policy contrasts: perception tokens are emphasized according to their visual dependency under original versus corrupted visual inputs, while reasoning tokens are emphasized according to their consistency with the generated perception. These role-specific signals are further unified through a shared trajectory-level baseline, yielding positive token weights that adjust relative update magnitudes while preserving the original GRPO reward and optimization direction, without requiring external reward models or separate teachers. Experiments across diverse multimodal reasoning benchmarks show that SRPO improves evidence-grounded reasoning, highlighting the importance of moving beyond uniform sequence-level credit toward role-aware optimization for reliable multimodal reasoning.
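摘要的核心操作——保留序列级优势的符号、仅按角色放缩各 token 的更新幅度——可以示意如下(角色标签与权重公式为本文假设,非论文的精确构造):

```python
# Role-aware token weighting sketch: the sequence-level GRPO advantage is kept,
# and each token's update magnitude is scaled by a positive role-specific
# weight, so the update direction is never flipped.

def token_advantages(seq_advantage, roles, visual_dependency, consistency):
    weights = []
    for role in roles:
        if role == "perception":
            weights.append(1.0 + visual_dependency)   # emphasize grounded tokens
        else:
            weights.append(1.0 + consistency)         # emphasize consistent reasoning
    return [seq_advantage * w for w in weights]

adv = token_advantages(0.5, ["perception", "reasoning"], 0.8, 0.2)
```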
[AI-87] Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
【速读】:该论文旨在解决科学领域中大型语言模型(Large Language Models, LLMs)在化学工具使用方面的评估缺乏客观、可量化且可诊断的基准问题。现有评估多依赖专家评分或LLM作为评判者,难以提供精确、无需人工干预的ground truth。为此,作者提出ChemCost基准,其关键在于构建一个基于真实供应商报价的化学采购成本估算任务,涵盖1,427个反应、2,261种化学品和230,775条报价数据,并支持对“化学实体识别(grounding)、信息检索、采购包选择及算术计算”等阶段进行逐级诊断。该方案不仅实现了标量评分,还通过引入可控噪声扰动(如化学别名变化、数量表达错误等)验证了模型鲁棒性,从而揭示当前主流LLM代理在工具调用、证据整合与决策一致性上的根本局限。
链接: https://arxiv.org/abs/2605.07251
作者: Yuyang Wu,Yue Huang,Shuaike Shen,Xujian Wang,Shuhao Zhang,Qiyao Xue,Weichen Liu,Runtian Gao,Jian Ma,Xiangliang Zhang,Olexandr Isayev
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures
Abstract:Large Language Models (LLMs) have become increasingly capable as tool-using agents, with benchmarks spanning diverse general agentic tasks. Yet rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain-specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM-as-judge scoring rather than exact, judge-free ground truth. We address this gap with chemical procurement cost estimation, a practical task in which an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage-level diagnosis of grounding, retrieval, procurement, and arithmetic failures. To evaluate robustness, we further construct controlled noise-injected views that perturb chemical aliases, quantity expressions, missing fields, and input formatting. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but insufficient for solving the task. The strongest agents reach only 50.6% accuracy within 25% relative error on clean inputs and degrade substantially with realistic noise. Stage-level analysis further shows that failures arise from brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
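任务最后一步的"采购算术"本身并不难,难在前面的实体定位与报价选择;以下用虚构的报价数据示意这一步(字段与数值均为本文假设):

```python
# Hypothetical procurement arithmetic: pick whole purchasable packs in a common
# unit and compute the cheapest cost covering the required amount.

import math

def cheapest_cost(required_g, quotes):
    # quotes: list of (pack_size_g, price_per_pack)
    best = None
    for pack_g, price in quotes:
        n_packs = math.ceil(required_g / pack_g)   # must buy whole packs
        cost = n_packs * price
        if best is None or cost < best:
            best = cost
    return best

quotes = [(25.0, 45.0), (100.0, 120.0), (500.0, 450.0)]
need_60g = cheapest_cost(60.0, quotes)   # one 100 g pack (120) beats 3 x 25 g (135)
```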
[AI-88] EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的环境模拟在可扩展智能体训练中面临的核心可靠性问题,即LLM在模拟环境反馈时存在幻觉、逻辑不一致和状态漂移等缺陷,导致代理奖励信号被污染,反而增加了构建成本。其解决方案的关键在于提出EnvSimBench基准测试框架,首次形式化定义并量化“环境模拟能力”(Environment Simulation Ability, EnvSim Ability),并通过系统性评估发现所有先进LLM均存在“状态变更悬崖”现象——即在环境状态不变时表现优异,但在多状态需同步更新时崩溃;进而设计了一种约束驱动的模拟流水线,显著降低幻觉率、提升环境合成效率6.8%,并将成本削减超90%,从而为可靠LLM环境模拟提供了诊断工具与优化路径。
链接: https://arxiv.org/abs/2605.07247
作者: Yi Liu,TingFeng Hui,Wei Zhang,Li Sun,Ningxin Su,Jian Wang,Sen Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Scalable AI agents training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift failures that corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: 1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable research objective. 2) We construct EnvSimBench, a rigorous benchmark covering 400 samples across 167 diverse environments, equipped with verifiable labels and fine-grained difficulty stratification along three axes. 3) Systematic evaluations reveal that all state-of-the-art language models suffer from a universal state change cliff: they achieve near-perfect accuracy on tasks when the environment state remains invariant, yet fail catastrophically when multiple states need simultaneous updates. This finding exposes EnvSim Ability as a critical yet largely unaddressed capability gap. 4) We design a constraint-driven simulation pipeline that substantially reduces hallucination, boosts environment synthesis yield by 6.8%, and cuts costs by over 90%. Overall, EnvSimBench serves as both a diagnostic framework and a practical optimization path for reliable LLM-based environment simulation, establishing a foundation for scalable agent training. Code and data are available at this https URL
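摘要中的"状态变更悬崖"对应这样一种评测方式:对比模拟器预测的后继状态与真值,并统计本步必须同时更新的状态变量数。以下为玩具示意,字段名为本文虚构:

```python
# Toy grader for the "state change cliff": count how many state variables the
# transition requires updating, and whether the prediction matches the truth.

def grade_transition(predicted, truth, before):
    changed = {k for k in truth if truth[k] != before[k]}
    correct = predicted == truth
    return len(changed), correct

before = {"door": "closed", "lamp": "off"}
truth = {"door": "open", "lamp": "on"}       # two states must update at once
# A simulator that updates only one of the two states fails the transition:
n_changed, ok = grade_transition({"door": "open", "lamp": "off"}, truth, before)
```

按摘要的发现,模型在 `n_changed == 0` 的样本上近乎完美,而随需同时更新的状态数增加准确率急剧下降。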
[AI-89] HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的启发式算法自动设计方法中存在的局限性,即现有方法通常依赖于结构固定的单体流程(monolithic workflows),缺乏灵活的记忆引导探索能力,易陷入局部最优解。其解决方案的关键在于提出HMACE框架——一种异构多智能体协同进化架构(Heterogeneous Multi-Agent Collaborative Evolution),将启发式搜索重构为组织设计问题。该框架在每一代进化中引入四个角色专精的协作智能体:Proposer(策略探索)、Generator(可执行启发式合成)、Evaluator(实证评估)和Reflector(基于档案的记忆更新),并通过行为感知检索、轻量级候选过滤与适应度驱动的档案更新机制,有效提升搜索多样性与效率,避免冗余评估,在典型组合优化问题(如TSP、在线装箱问题等)上显著优于主流单智能体与多智能体基线方法。
链接: https://arxiv.org/abs/2605.07214
作者: Yuping Yan,Jirui Han,Fei Ming,Yuanshuai Li,Yaochu Jin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models have recently emerged as a promising paradigm for automated heuristic design for NP-hard combinatorial optimization problems. Despite this progress, existing LLM-based methods typically rely on monolithic workflows constrained by rigid templates, thereby restricting memory-guided exploration and triggering premature convergence to local optima. To design an autonomous and collaborative architecture, we introduce HMACE, a Heterogeneous Multi-Agent Collaborative Evolution framework that reconceptualizes heuristic search as an organizational design problem. HMACE decomposes each evolutionary generation into an autonomous, role-specialized loop with four coordinated agents: a Proposer for strategy exploration, a Generator for executable heuristic synthesis, an Evaluator for empirical assessment, and a Reflector for archive-backed memory update. By coupling behavior-aware retrieval, lightweight candidate filtering, and fitness-grounded archive updates, HMACE guides the search toward diverse and promising heuristic behaviors while avoiding redundant evaluations. Extensive evaluations on representative COPs, including TSP, Online BPP, MKP, and PFSP, show that HMACE achieves a favorable quality-efficiency trade-off compared to state-of-the-art single-agent and multi-agent baselines. In the matched LLM-driven reference comparison, HMACE achieves the lowest average gaps on TSP and Online BPP (0.464% and 0.223%, respectively), while requiring only 0.13M and 0.42M tokens for the two tasks, substantially fewer than the compared baselines.
[AI-90] HARMONY: Bridging the Personalization-Generalization Gap by Mitigating Representation Skew in Heterogeneous Split Federated Learning
【速读】:该论文旨在解决混合联邦学习(Hybrid Split Federated Learning, Hybrid SFL)在客户端架构异构性下出现的表示偏移(representation skew)问题,即不同客户端定制的特征提取器在共享空间中难以对齐,导致服务器端用于预测分布外(Out-of-Distribution, OOD)样本的模型性能显著下降。解决方案的关键在于提出HARMONY框架,其通过改进元学习机制模拟跨参数和架构的多样化提取器以实现个性化,并在服务器端引入对比学习(contrastive learning)对齐各客户端提取的特征表示,从而在不牺牲客户端个性化能力且无需共享原始标签的前提下,有效缓解表示偏移问题。实验表明,HARMONY在多个数据集和模型家族上相比现有最优方法,在有/无OOD场景下分别提升测试准确率达43.0%/28.3%,同时保持可接受的推理延迟。
链接: https://arxiv.org/abs/2605.07211
作者: Jiseok Youn,You Rim Choi,Goodsol Lee,Sangtae Ha,Hyung-Sin Kim,Saewoong Bahk
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages (except references), 5 figures
Abstract:Mobile devices face diverse resource constraints and non-IID data class distributions, requiring fast on-device inference for local in-distribution (ID) classes and on-demand remote support for client-specific out-of-distribution (OOD) classes. Hybrid split federated learning (Hybrid SFL) couples personalized client-side front ends (supporting early exit) with a generalized server-side backend for fallback inference, balancing accuracy and cost. However, under client architectural heterogeneity, the existing hybrid SFL suffers from representation skew, where features from customized extractors fail to align in the shared space, leading to a sharp degradation in the server model responsible for OOD prediction. We propose HARMONY, the first hybrid SFL framework to support heterogeneous client architectures. HARMONY modifies meta-learning to simulate diverse extractors across parameters and architectures, and to learn to personalize. To mitigate representation skew, HARMONY conducts server-side contrastive learning to align extracted features, neither sacrificing clients’ personalization nor sharing raw labels. Compared to the state of the art across multiple datasets and model families, HARMONY improves test accuracy by up to 43.0%/28.3% without/with OOD, respectively, while maintaining acceptable latency.
[AI-91] Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在企业级数据环境中难以将碎片化数据转化为可操作洞察的问题,其核心挑战包括复杂数据库模式、动态SQL生成能力受限以及对多维业务逻辑深度理解的不足。解决方案的关键在于提出AIDA(Autonomous Insight Discovery Agent),这是一个端到端的自主探索框架,通过引入专用领域语言(Domain-Specific Language, DSL)实现语义推理与精确SQL执行之间的桥梁,并基于帕累托原则(Pareto Principle)引导的累积推理过程,利用强化学习系统构建结构化的业务分析流程。实验表明,AIDA在环境感知能力和多维度分析深度上显著优于基于工作流的代理模型,验证了其在工业级商业智能系统中实现自主智能的可行性与优越性。
链接: https://arxiv.org/abs/2605.07202
作者: Dongming Wu,Junwen Li,Ming Lu,Gang Wang,Ting Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Transforming fragmented enterprise data into actionable insights remains a significant challenge for LLMs, constrained by complex database schemas, limitations in dynamic SQL generation, and the need for deep multi-dimensional analysis. In this paper, we propose AIDA (Autonomous Insight Discovery Agent), the first end-to-end framework designed for autonomous exploration in complex business environments. We establish a highly flexible instant retail environment encompassing 200+ metrics and 100+ dimensions, and integrate a proprietary Domain-Specific Language (DSL) that bridges semantic reasoning with precise SQL execution. Our reinforcement learning system subsequently formulates business analysis as a Pareto Principle-guided cumulative reasoning process. Experimental results demonstrate that AIDA significantly outperforms workflow-based agents, and extensive evaluations further reveal that AIDA achieves superior environmental perception and more in-depth analysis from diverse perspectives. Our work ultimately establishes the transformative potential of autonomous intelligence for industrial-scale business intelligence systems.
[AI-92] Three-in-One World Model: Energy-Based Consistency Prediction and Counterfactual Inference for Marketing Intervention
【速读】:该论文旨在解决营销决策中消费者异质性(latent consumer heterogeneity)、随时间变化的内在状态以及显式干预之间复杂交互关系难以被现有预测模型和语言模型统一建模的问题。其解决方案的关键在于提出一种“三合一”世界模型架构(Three-in-One world-model architecture),其中深度玻尔兹曼机(Deep Boltzmann Machine, DBM)从人口统计学特征、时间变量及滞后的动作与结果中学习一个冻结的信念表示(frozen belief representation),并在其上附加轻量级任务特定适配器(task-specific adapters)。该信念表示同时支持三项任务:基于能量的合理性评估(通过DBM的自由能)、结果预测(通过适配器)和反事实推断(固定信念仅改变动作输入)。实验表明,该方法在访问和购买AUC指标上优于强基线MLP,并显著优于S-、T-、X-和DR-learner元学习器及因果森林模型,尤其在混杂的价格-促销干预下表现更优,且自由能约束能系统性惩罚缺乏前期促销暴露的反事实购买路径,体现出信念表示对潜在特质的有效解耦能力,为营销干预提供了一个集成的世界模型基础。
链接: https://arxiv.org/abs/2605.07199
作者: Junichiro Niimi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Marketing decisions reflect the interaction of latent consumer heterogeneity, time-varying internal states, and explicit interventions, a structure that current prediction- and language-oriented models do not capture in a unified manner. We propose a Three-in-One world-model architecture in which a Deep Boltzmann Machine (DBM) learns a frozen belief representation from demographics, time, and lagged actions and outcomes, with lightweight task-specific adapters attached on top. The same belief supports three tasks within a single framework: (i) energy-based consistency evaluation through the DBM’s free energy, (ii) outcome prediction through adapters, and (iii) counterfactual inference by holding the belief fixed and varying only the action input given to the adapter. Using a controlled simulation in which the latent price sensitivity, promotion responsiveness, and base preference of each consumer are known, we show that the adapters match a strong MLP baseline on visit- and purchase-AUC while recovering heterogeneous treatment effects substantially better than S-, T-, X-, and DR-learner meta-learners and a Causal Forest baseline built on the same raw features, with the largest gap on a confounded price-promotion intervention. Complementing this, free-energy clamps systematically penalize counterfactual purchase trajectories that lack prior promotional exposure, and the penalty itself depends on the latent base preference in the expected direction. These results indicate that DBM beliefs disentangle latent traits in a form that survives counterfactual queries, providing an integrated world-model substrate for marketing intervention.
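摘要中用于一致性评估的"自由能"在二值 RBM(DBM 的基本构件)情形下有如下标准闭式,可帮助理解"能量越低、模型认为越合理"这一机制(此为通用公式,并非论文中完整 DBM 的推导):

```latex
% Free energy of a binary RBM with visible bias b, hidden bias c, weights W:
F(v) \;=\; -\, b^{\top} v \;-\; \sum_{j} \log\!\Big( 1 + \exp\big( c_{j} + W_{\cdot j}^{\top} v \big) \Big),
\qquad p(v) \;\propto\; e^{-F(v)} .
% Lower free energy corresponds to higher model-assigned plausibility, which is
% what clamping a counterfactual trajectory and reading off F(v) relies on.
```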
[AI-93] HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
【速读】:该论文旨在解决现有多模态搜索代理在处理可分解查询时存在的效率低下问题,即通过顺序调用工具对目标实体逐个检索,导致冗余交互轮次增加。其核心解决方案是提出HyperEyes,一种并行多模态搜索代理,关键在于将视觉定位与检索融合为单一原子动作,并以推理效率作为首要训练目标。该方案包含两个创新:一是双粒度效率感知强化学习框架,宏观层面采用轨迹级奖励TRACE,通过单调收紧参考值抑制冗余工具调用;微观层面引入On-Policy Distillation机制,在失败回溯中注入密集的token级校正信号,缓解稀疏结果奖励的信用分配难题。实验表明,HyperEyes-30B在准确率上比最强开源代理提升9.9%,平均工具调用轮次减少5.3倍。
链接: https://arxiv.org/abs/2605.07177
作者: Guankai Li,Jiabin Chen,Yi Xu,Xichen Zhang,Yuan Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code Data: this https URL
Abstract:Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
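TRACE 的"参考值单调收紧"思路可用一个极简的奖励函数示意(具体函数形式与系数均为笔者假设,并非论文公式):

```python
def trace_reward(correct, tool_calls, ref_calls, alpha=0.1):
    """轨迹级奖励示意:答案正确性为主项,超出参考值的工具调用被线性惩罚。"""
    overshoot = max(0, tool_calls - ref_calls)
    return (1.0 if correct else 0.0) - alpha * overshoot

def tighten_reference(ref_calls, floor=2.0, decay=0.9):
    """训练推进时单调收紧参考值,但保留下限以不限制真正的多跳检索。"""
    return max(floor, ref_calls * decay)

ref = 8.0
for _ in range(3):          # 三个训练阶段后参考值收紧为 8 * 0.9^3
    ref = tighten_reference(ref)
```

参考值收紧后,同样的工具调用轮数会受到更强惩罚,从而逐步压缩冗余交互。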
[AI-94] Learning Multi-Relational Graph Representations for DNA Methylation-Based Biological Age Estimation
【速读】:该论文旨在解决现有DNA甲基化(DNA methylation)年龄预测模型中忽略CpG位点间复杂生物关系的问题,从而提升模型对生物年龄(biological age)估计的准确性与可解释性。其解决方案的关键在于提出RelAge-GNN框架,该框架构建了三种互补的图结构——共甲基化模式、基因组共定位以及基因层面关联,并通过独立的图神经网络(Graph Neural Network, GNN)分支分别建模每种关系,再利用可学习门控机制自适应融合各分支表征,从而更全面地捕捉CpG位点间的多维相互作用,显著提升预测性能及对年龄加速(age acceleration)检测的敏感性。
链接: https://arxiv.org/abs/2605.07175
作者: Qing Qing,Xikun Zhang,Zhongyuan Zhang,Jiarui Liu,Xingtong Yu,Xiaotao Shen,Ziqi Xu,Qixin Zhang,Zhe Wang,Renqiang Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Aging clocks aim to estimate biological age, a measure of physiological state distinct from chronological age, from observable biomarkers, and are widely used for health assessment and disease analysis. DNA methylation is a particularly informative biomarker due to its stability and strong association with aging, and recent learning-based approaches have improved predictive performance. However, most existing methods treat CpG sites as independent features, overlooking the complex and heterogeneous biological relationships among them. We propose RelAge-GNN, a multi-relational graph neural network framework for DNA methylation-based age prediction. Our method constructs three complementary graphs capturing co-methylation patterns, genomic co-localization, and gene-level associations among CpG sites. Each graph is modeled by an independent GNN branch, and a learnable gating mechanism adaptively fuses the resulting representations. Experiments on large-scale datasets show that RelAge-GNN achieves competitive accuracy and stronger correlation with chronological age compared to state-of-the-art methods. Moreover, the model exhibits improved sensitivity in detecting age acceleration across diverse disease cohorts, highlighting its potential utility for disease characterization. Finally, through post hoc interpretability analyses, we quantify the contributions of different relational structures and CpG sites, providing biologically meaningful insights and suggesting potential directions for aging-related research. Our code is available at: this https URL.
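三个关系分支"可学习门控融合"的核心机制,可用 softmax 加权求和示意(仅说明机制,维度与数值均为示例,非论文代码):

```python
import math

def gated_fusion(branch_reprs, gate_logits):
    """对多个分支的表示按 softmax 门控权重逐维加权求和。"""
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(branch_reprs[0])
    fused = [sum(w * r[i] for w, r in zip(weights, branch_reprs))
             for i in range(dim)]
    return fused, weights

# 共甲基化、基因组共定位、基因层面三个分支给出的 CpG 表示(示例向量)
branches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = gated_fusion(branches, gate_logits=[0.0, 0.0, 0.0])
```

门控 logits 在训练中学习,从而自适应地决定各关系结构对最终表示的贡献。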
[AI-95] Repeated Deceptive Path Planning against Learnable Observer AAMAS2026
【速读】:该论文旨在解决**欺骗性路径规划(Deceptive Path Planning, DPP)中因对手为可学习实体而导致现有方法失效的问题。传统DPP假设观察者是静态且非学习的,但在现实场景如关键物资运输或军事行动中,对手能够通过历史轨迹数据持续更新其预测模型,从而提升对真实目的地的识别能力。为此,作者提出重复欺骗性路径规划(Repeated Deceptive Path Planning, RDPP)的新范式,并设计了欺骗性元规划(Deceptive Meta Planning, DeMP)**框架作为解决方案。其核心创新在于引入双层优化机制:episode-level适应用于短期策略调整以应对当前对手模型更新,meta-level更新则利用跨episode反馈建模对手的学习规律,从而加速未来适应过程,有效缓解累积适应滞后问题,实现对可学习对手的持续欺骗。
链接: https://arxiv.org/abs/2605.07174
作者: Shiyue Cao,Pei Xu,Likun Yang,Lei Cui,Shizhao Yu,Shiyu Zhang,Yongjian Ren,Xiaotang Chen,Kaiqi Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Full version of the extended abstract accepted at AAMAS 2026
Abstract:We study the problem of deceptive path planning (DPP), where an agent aims to conceal its true destination from external observers. While existing work assumes static, non-learning observers, real-world adversaries-such as in critical goods transportation or military operations-can adapt by learning from historical trajectories. To address this gap, we introduce Repeated Deceptive Path Planning (RDPP), a new formulation that explicitly models learnable observers. We show that existing DPP methods fail under this setting, as they cannot adapt to evolving adversarial predictions. While incorporating observer previous predictions into updates enables some adaptation, such incremental updates cause accumulative lag that degrades deception. To this end, we propose Deceptive Meta Planning (DeMP), a two-level optimization framework that combines episode-level adaptation, which enables short-term policy adjustment to counter updated observer, and meta-level updates, which leverage cross-episode feedback to capture how observers update their models and accelerate adaptation in future episodes. In this way, DeMP mitigates the accumulation of adaptation lag, enabling sustained deception against a learning observer. Experiments across environments demonstrate that DeMP significantly outperforms existing approaches in RDPP while maintaining competitive path cost. Our results highlight the importance of modeling repeated interactions with learnable adversaries, providing new insights into deception and privacy in multi-agent systems.
[AI-96] SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
【速读】:该论文旨在解决当前站点可靠性工程(Site Reliability Engineering, SRE)基准测试工具存在的局限性问题,即现有基准多局限于过于简化的SRE任务,且因定制化设计难以扩展。其解决方案的关键在于提出SREGym——一个高保真度的SRE代理评估基准,该框架基于真实的云原生系统栈构建运行环境,并通过故障注入器模拟多种层次的故障、环境噪声及复杂故障模式(如亚稳态故障和相关性故障),同时采用模块化、可扩展架构实现对不同组件的灵活编排。该方案显著提升了SRE代理在真实场景下的评估能力,实验证明前沿代理在应对不同类型故障时表现差异可达40%。
链接: https://arxiv.org/abs/2605.07161
作者: Jackson Clark,Yiming Su,Saad Mohammad Rafid Pial,Yifang Tian,Lily Gniedziejko,Hans-Arno Jacobsen,Yinfang Chen,Tianyin Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents are increasingly used to diagnose and mitigate failures in production systems, known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to overly simplistic SRE tasks and are unfortunately hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities vary significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.
[AI-97] MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries
【速读】:该论文旨在解决形式化数学库 Mathlib 的 PR(Pull Request)审查瓶颈问题,即当前依赖人工评审导致的整合效率低下,难以支持其持续增长。解决方案的关键在于构建了一个名为 MathlibPR 的基准数据集,该数据集基于真实 Mathlib4 PR 历史记录,并设计了一种分阶段评估协议,用于系统性评测大语言模型(LLM)及其代理(LLM agents)在区分可合并 PR 与仅通过编译但未被合并的 PR 方面的能力。这一方法为未来开发辅助评审工具和奖励模型奠定了基础,从而引导 LLM 更好地生成符合 Mathlib 规范的、可直接合并的贡献。
链接: https://arxiv.org/abs/2605.07147
作者: Zixuan Xie,Xinyu Liu,Shangtong Zhang
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The ecosystem of Lean and Mathlib has become the de facto standard for large language model (LLM) assisted formal reasoning with remarkable successes in recent years. Those successes, however, only consume Mathlib as an essential dependency but do not directly contribute to it. In the meantime, the growth of Mathlib has recently been bottlenecked by the review process, which requires human reviewers to judge whether proposed pull requests (PRs) follow Mathlib’s conventions and are worth integrating as part of a shared mathematical infrastructure. This leads to our central question: can LLMs help review Mathlib PRs? To this end, we introduce MathlibPR, a benchmark built from real Mathlib4 PR histories. We further propose a staged evaluation protocol and use it to evaluate both LLM models (e.g., DeepSeek, Qwen, Goedel, and Kimina) and LLM agents (e.g., Codex and Claude Code). Surprisingly, both LLM models and LLM agents struggle to distinguish merge-ready PRs from build-passing PRs that were revised or never merged. By turning Mathlib PR histories into a supervised signal, MathlibPR provides a step toward reviewer assistants and reward models that could help evaluate PRs and steer LLMs toward producing merge-ready Mathlib contributions.
[AI-98] Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents
【速读】:该论文旨在解决当前基于可验证情感奖励的强化学习(Reinforcement Learning from Verifiable Emotion Rewards, RLVER)语言模型在真实情绪交互场景中表现不足的问题。现有评估基准假设用户为合作且诚实,但现实中用户常通过操控(gaslighting)、情绪升级和施压等方式寻求无条件认同,这种对抗性动态无法被传统基准捕捉。为此,作者构建了对抗性共情基准(Adversarial Empathy Benchmark, AEB),引入情感一致性评分(Emotional Consistency Score, ECS)以量化模型在对抗条件下对用户情绪状态的跟踪能力与改善能力的分离程度。关键创新在于通过AEB设计六类心理基础明确的对抗对话轨迹,并利用ECS形式化解耦模型的情感响应能力与状态追踪能力,从而揭示RLVER模型虽能提升情绪响应敏感度,却未显著增强对用户隐藏意图或情绪状态的可观测追踪能力,这一发现揭示了行为可解释性与内部理解之间的潜在脱节。
链接: https://arxiv.org/abs/2605.07138
作者: Deeraj S K,Sadhana Devarajan,Krishna Mehra,Sudhakar Mishra
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning from verifiable emotion rewards (RLVER) has produced language models with strong empathetic performance, evaluated on benchmarks that assume cooperative, honest users. Yet real emotional interactions systematically violate this assumption: users gaslight, escalate, and pressure AI systems for unconditional validation, dynamics that cooperative benchmarks cannot surface. We construct the Adversarial Empathy Benchmark (AEB) and introduce the Emotional Consistency Score (ECS) to evaluate empathetic robustness under adversarial conditions. AEB comprises six psychologically grounded adversarial trajectory types with discriminative reward structures that penalize formulaic responses; ECS formally disentangles a model’s capacity to track user emotional states from its capacity to improve them. In a controlled experiment across eight scenario-matched conditions (think and no-think conditions on 2 RLVER models, and 2 base models (Qwen 1.5B and 7B) with 480 adversarial dialogues), RLVER-PPO-Think substantially outperforms the same-scale untuned baseline (0.963 vs. 0.761, p < 0.001, r = 0.688), with zero dialogue collapses and 47% higher hidden-intention detection. However, ECS remains nearly flat and is not significantly different for RLVER-PPO-Think versus Base-7B-Think (p = 0.650): RL training improves emotional responsiveness without measurable gains in observable state tracking. We interpret the ECS–FS (Final Score) gap as a behavioral/legibility dissociation inside this simulator family, not as evidence about internal understanding or clinical readiness.
[AI-99] Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR
【速读】:该论文旨在解决当前负样本强化学习(Negative Sample Reinforcement, NSR)方法在训练过程中存在两个关键局限性:一是采用固定惩罚策略,无法根据训练阶段动态调整惩罚强度;二是对所有错误响应一视同仁,忽略了不同错误的严重程度差异。为应对这些问题,论文提出两种扩展方案:其一为自适应负样本强化(Adaptive Negative Sample Reinforcement, A-NSR),通过引入时间依赖的调度函数,在训练初期强化错误纠正以稳定模型,后期则转向更精细的更新机制;其二为置信度加权负强化(Confidence-Weighted Negative Reinforcement, CW-NSR),基于模型对错误路径的归一化序列似然分配差异化惩罚权重——高置信度错误获得更强惩罚,低置信度探索性错误则受较轻惩罚。这两种机制共同实现了token级更新的精确控制,并借助先验引导的概率重分布有效防止过拟合,显著提升了大语言模型在复杂推理任务中的表现。
链接: https://arxiv.org/abs/2605.07137
作者: Yash Ingle,Jaival Chauhan,Ankit Yadav,Sudhakar Mishra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a highly effective method for improving the reasoning abilities of Large Language Models (LLMs). Recent research shows that Negative Sample Reinforcement (NSR) – which focuses on penalizing incorrect steps rather than simply rewarding correct ones – can match or even exceed the performance of more complex frameworks like PPO and GRPO across the entire Pass@k spectrum. However, current NSR techniques usually apply a fixed penalty throughout the training process and treat every incorrect response with the same weight. To address these limitations, we propose two extensions to the NSR framework: Adaptive Negative Sample Reinforcement (A-NSR). Rather than using a fixed update rule, A-NSR uses time-dependent scheduling functions. In the initial training phases, the system focuses heavily on correcting errors to stabilize the model. As training continues, it shifts toward more subtle and controlled updates. We also introduce Confidence-Weighted Negative Reinforcement (CW-NSR), which operates on the principle that different mistakes carry different levels of importance. CW-NSR assigns specific penalty weights based on the model’s normalized sequence likelihood. If the model is highly confident in a wrong path, it receives a larger penalty, while uncertain errors – where the model is effectively exploring – are penalized less strictly. Our formal analysis shows how these mechanisms govern token-level updates, allowing the model to leverage prior-guided probability redistribution while providing a natural defense against overfitting. We evaluated these methods on difficult reasoning datasets, including MATH, AIME 2025, and AMC23, using the Qwen2.5-Math-1.5B architecture.
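CW-NSR 按"归一化序列似然"为错误响应分配惩罚权重,其直觉可用如下极简示意理解(权重的具体函数形式为笔者假设):

```python
import math

def sequence_confidence(token_logprobs):
    """长度归一化的序列似然:exp(平均 token 对数概率)。"""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def cw_penalty(token_logprobs, base_penalty=1.0):
    """高置信的错误路径 → 更大惩罚;低置信的探索性错误 → 较轻惩罚。"""
    return base_penalty * sequence_confidence(token_logprobs)

confident_error = [-0.05, -0.02, -0.1]   # 模型对错误答案很自信
uncertain_error = [-2.0, -1.5, -2.5]     # 模型本身也在探索
```

长度归一化保证长短序列的置信度可比,惩罚权重因此只反映模型对该错误路径的确信程度。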
[AI-100] GAD in the Wild: Benchmarking Graph Anomaly Detection under Realistic Deployment Challenges
【速读】:该论文旨在解决当前图异常检测(Graph Anomaly Detection, GAD)研究中存在的重要现实差距:现有基准测试通常局限于小规模、人工构造的图数据,且异常比例相对均衡,难以反映真实工业场景下大规模、极端稀疏异常和缺失节点属性等复杂挑战。为弥合这一差距,作者提出一个多维基准测试体系,系统性地评估GAD模型在三个关键部署相关挑战下的表现:百万级节点规模、极端异常稀缺性(如0.1%异常比例)以及节点属性缺失。其解决方案的关键在于从五个多样化的真实图数据(包括两个超过370万节点的工业级数据集)中衍生出受控的基准变体,并对九种代表性GAD模型进行广泛实验,揭示了当前主流方法在可扩展性、低异常比例下的召回率下降及重建类模型对属性插补策略的高度敏感性等三大局限。该工作通过提供诊断性测试平台,推动面向实际应用的大规模、鲁棒GAD系统的发展。
链接: https://arxiv.org/abs/2605.07133
作者: Jingjing Zhou,Shiyu Huang,Qing Qing,Zuquan Yuan,Huafei Huang,Ziqi Xu,Mingliang Hou,Xikun Zhang,Renqiang Luo,Ivan Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph Anomaly Detection (GAD) is a critical task in graph machine learning with vital applications in financial fraud detection and social platform governance. However, existing GAD benchmarks are often restricted to small-scale, curated graphs with relatively balanced anomaly ratios, leaving a substantial gap between academic evaluation and real-world deployment. To bridge this gap, we present a multi-dimensional benchmark that systematically evaluates GAD models under three deployment-relevant challenges: million-scale graphs, extreme anomaly scarcity, and missing node attributes. We derive a family of controlled benchmark variants from five diverse graphs, including two native industrial-scale datasets with over 3.7 million nodes. Our extensive evaluation of nine representative GAD models reveals three major limitations: (1) most GNN-based methods fail to scale to million-node graphs due to prohibitive memory requirements; (2) detection performance drops sharply under realistic anomaly ratios (e.g., 0.1%), often resulting in zero recall; and (3) reconstruction-based models are highly sensitive to attribute imputation strategies. Our findings suggest that strong performance in laboratory settings does not guarantee robustness in production environments. We release this benchmark and empirical evaluation as a diagnostic testbed to promote the development of robust and scalable GAD systems for large-scale, imperfect graphs encountered in practice. Code is available at this https URL.
[AI-101] AdaTKG: Adaptive Memory for Temporal Knowledge Graph Reasoning
【速读】:该论文旨在解决时间知识图谱(Temporal Knowledge Graph, TKG)中实体表示静态化的问题,即现有方法生成的实体嵌入仅依赖于固定参数,无法捕捉实体在不同时间点参与事件时的动态交互历史。解决方案的关键在于提出AdaTKG框架,其核心思想是将每个实体建模为一个自适应过程,通过维护一个可更新的实体级记忆模块来持续积累交互信息,并利用可学习的指数移动平均机制在线调整实体表示,从而实现对新出现实体的泛化能力。该设计避免了为每个实体单独设置参数,仅用一个共享标量控制更新速率,显著提升了模型的适应性和预测性能。
链接: https://arxiv.org/abs/2605.07121
作者: Seunghan Lee,Jun Seo,Jaehoon Lee,Sungdong Yoo,Minjae Kim,Tae Yoon Lim,Dongwan Kang,Hwanil Choi,SoonYoung Lee,Wonbin Ahn
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Temporal knowledge graphs (TKGs) represent time-stamped relational facts and support a wide range of reasoning tasks over evolving events. However, existing methods produce entity representations that are static at the entity level, in that each representation is a function of learned parameters only and retains no trace of the interactions in which the entity has participated. In this paper, we depart from this static view and propose that each entity be modeled as an adaptive process whose representation is refined every time the entity participates in a fact. To this end, we propose AdaTKG, which maintains a per-entity memory that is updated with every observed interaction, with the memory accumulating online and predictions improving as more interactions arrive. Specifically, we instantiate the memory update as a learnable exponential moving average governed by a single shared scalar instead of using learnable parameters for each entity, enabling AdaTKG to handle entities unseen during training. Extensive experiments confirm consistent gains over TKG baselines, demonstrating the effectiveness of adaptive memory. Code is publicly available at: this https URL.
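AdaTKG 的实体记忆更新是一个由单一共享标量控制的可学习指数移动平均,其机制可示意如下(sigmoid 参数化为笔者的合理假设,非论文精确定义):

```python
import math

def update_memory(memory, interaction, gate_param):
    """单一共享标量经 sigmoid 得到保留率 gamma,对实体记忆做 EMA 更新。"""
    gamma = 1.0 / (1.0 + math.exp(-gate_param))
    if memory is None:               # 训练时未见过的新实体:用首次交互初始化
        return list(interaction)
    return [gamma * m + (1.0 - gamma) * x
            for m, x in zip(memory, interaction)]

mem = None
for fact_repr in ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):   # 依次到达的交互表示
    mem = update_memory(mem, fact_repr, gate_param=0.0)   # gamma = 0.5
```

因为更新率由共享标量而非每实体参数决定,这一机制对训练时未见过的实体也能直接生效。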
[AI-102] Stabilized neural Hamilton–Jacobi–Bellman solvers: Error analysis and applications in model-based reinforcement learning
【速读】:该论文旨在解决连续时间下基于模型的强化学习(model-based reinforcement learning)中,如何高效且稳定地求解Hamilton–Jacobi–Bellman(HJB)方程的问题。传统方法要么依赖于离散网格的有限差分法(finite-difference method),要么采用物理信息神经网络(Physics-Informed Neural Networks, PINN)进行连续偏微分方程(PDE)求解,二者均存在计算效率或稳定性不足的局限。本文提出了一种混合求解框架:用神经网络表示值函数(value function),通过在平移点上查询网络来实现有限差分形式的策略评估(policy evaluation)操作,并利用随机连续配点(random continuous collocation)最小化残差。其关键创新在于将有限差分算子视为作用于神经网络的平移算子(shift operators),并建立了针对单步策略评估的群体 L2 稳定性估计,该估计明确分离了残差误差、初始与边界层不匹配误差、策略不匹配误差及模型识别误差,同时显式刻画了学习动力学带来的梯度放大因子,而线性策略评估的稳定性不受隐藏逆粘性(inverse-viscosity)发散影响。这一理论为实际应用中的神经网络策略评估提供了可信赖的误差控制基础。
链接: https://arxiv.org/abs/2605.07116
作者: Minseok Kim,Yeongjong Kim,Namkyeong Cho,Yeoneung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
备注:
Abstract:Physics-informed neural solvers offer a promising route to model-based reinforcement learning in continuous time, where optimal feedback synthesis is governed by Hamilton–Jacobi–Bellman (HJB) equations. Practical implementations often occupy a regime that is neither a classical grid method nor a continuous-PDE PINN: the value function is represented by a neural network, finite-difference HJB policy-evaluation operators are evaluated by network queries at shifted points, and residuals are minimized by random continuous collocation. This regime preserves the stabilized finite-difference policy-evaluation structure while avoiding grid-based value unknowns. We develop an error theory for this hybrid regime. Interpreting finite differences as shift operators acting on neural networks, we prove a population L^2 stability estimate for one policy-evaluation step with learned dynamics. The bound separates residual error, initial and exterior-collar mismatch, policy mismatch, and model-identification error, with an explicit gradient amplification factor for learned dynamics, while the underlying linear evaluation stability remains free of hidden inverse-viscosity blow-up. We further give a finite-sample collocation certificate and a conditional multi-step propagation result through greedy policy improvement. Experiments on compact-control LQR up to 64 dimensions, Allen–Cahn control, pendulum, Hopper, and 3D quadrotor benchmarks compare against representative model-based and model-free RL baselines, demonstrating the predicted residual, policy-mismatch, and learned-model error trends.
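"把有限差分视为作用在网络上的平移算子"这一点,可用对任意可调用值函数在平移点查询的中心差分来示意(步长 h 为示例取值):

```python
def fd_gradient(value_net, x, h=1e-4):
    """仅通过在平移点查询 value_net 来估计梯度,不依赖网格上的值未知量。"""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((value_net(xp) - value_net(xm)) / (2.0 * h))
    return grad

# 示例:二次值函数 V(x) = x1^2 + 2*x2^2,其梯度为 [2*x1, 4*x2]
V = lambda x: x[0] ** 2 + 2.0 * x[1] ** 2
g = fd_gradient(V, [1.0, 1.0])
```

value_net 可以是任何神经网络的前向函数;这正是论文所述"网格法与连续 PINN 之间"的混合使用方式。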
[AI-103] Query-efficient model evaluation using cached responses
【速读】:该论文旨在解决在评估新模型性能时,因需对所有查询生成并评估响应而导致的高计算成本问题。其核心挑战在于如何利用已有模型在相同基准测试中的缓存响应信息,以减少评估新模型所需的查询数量。解决方案的关键在于引入基于数据核空间(Data Kernel Perspective Space, DKPS)的方法,该方法能够在黑盒场景下量化不同模型之间的关系,并理论上证明在特定条件下具有查询效率优势;实证结果表明,DKPS方法可在显著降低查询预算的同时达到与基线方法相当的平均绝对误差,从而实现高效且准确的模型性能预测。
链接: https://arxiv.org/abs/2605.07096
作者: Hayden Helm,Ben Johnson,Carey Priebe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:
Abstract:Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously-evaluated models are often cached – creating a potential opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance that leverages cached model responses based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially decreased query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.
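DKPS 的构造细节以论文为准;这里给出一个大为简化的同类思路示意:用新模型与各参考模型在少量查询上回答的一致率作相似度,再对参考模型的已知分数加权平均来预测新模型的分数(函数与数据均为假设):

```python
def predict_score(new_responses, cached, k=2):
    """cached: [(参考模型的缓存回答列表, 该模型的基准分数), ...]"""
    sims = []
    for responses, score in cached:
        agree = sum(1 for a, b in zip(new_responses, responses) if a == b)
        sims.append((agree / len(new_responses), score))
    sims.sort(key=lambda t: t[0], reverse=True)
    top = sims[:k]
    total = sum(s for s, _ in top)
    if total == 0:                      # 与所有参考模型都不一致时退化为简单均值
        return sum(sc for _, sc in top) / len(top)
    return sum(s * sc for s, sc in top) / total

cached = [(["A", "B", "C"], 0.9), (["X", "Y", "Z"], 0.1)]
pred = predict_score(["A", "B", "Z"], cached)
```

新模型只需回答少量查询即可定位到参考模型附近,这正是缓存响应带来查询效率的直觉所在。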
[AI-104] Online Allocation with Unknown Shared Supply
【速读】:该论文旨在解决在线共享供应分配(Online Shared Supply Allocation, OSSA)问题,即在有限且未知的总供应量下,如何在需求逐个发生时,将资源最优地预置到多个地点,以最小化因缺货导致的服务损失(lost-sales penalties),同时考虑固定的运输成本(fixed-charge transportation costs)。与传统库存模型不同,OSSA不允许缺货补货(precludes backlogging),且仅能通过预先分配来应对未来不确定性。解决方案的关键是提出一种确定性的阈值比例策略(GPA, threshold-proportional policy),其理论保证为:在总供应量固定的前提下,可达到离线最优解的 $ \frac{4}{3} $-近似,误差项独立于总供应量;此外,作者还证明了该比值是紧的(tight),且即使使用已知总供应量的随机算法也无法避免该加性误差项。进一步地,论文设计了一种学习增强型扩展版本,能够融合不完美预测信息(如来自人类专家或机器学习模型的预测),在高质量建议下提升性能,同时对任意劣质建议保持鲁棒性。
链接: https://arxiv.org/abs/2605.07080
作者: Tzeh Yuan Neoh,Davin Choo,Mengchu Yue,Milind Tambe
机构: 未知
类目: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注:
Abstract:Many real-world resource allocation systems, such as humanitarian logistics and vaccine distribution, must preposition limited supply across multiple locations before demand is realized while stockouts incur irreversible service losses. To study this, we introduce the Online Shared Supply Allocation (OSSA) problem, a stateful online model in which a central hub allocates a finite, unknown supply to multiple sites facing sequential demand under fixed-charge transportation costs and lost-sales penalties. Unlike classical make-to-stock or make-to-order inventory models, OSSA precludes backlogging and replenishment only hedges against future demand. To tackle OSSA, we propose a deterministic threshold-proportional policy GPA and prove that it achieves a 4/3-approximation to the offline optimum up to an additive term independent of the total supply. We complement this with matching lower bounds showing that the 4/3 ratio is tight and that the additive-error dependence is unavoidable, even for randomized algorithms that know the total supply upfront. Finally, we develop a learning-augmented extension to GPA that principally incorporates imperfect forecasts (e.g., from human experts or ML models) commonly available in practice, enabling us to exploit high-quality advice while being robust against arbitrary bad ones. Synthetic and real-world experiments show that GPA outperforms natural baselines when global supply is scarce.
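GPA 的精确定义需参见原文;其"阈值-比例"的直觉可用如下极简规则示意(比例系数与规则形式均为笔者假设):

```python
def gpa_allocate(remaining, demand, rho=0.5):
    """单次分配不超过剩余供应的比例 rho,以对冲未来未知需求。"""
    return min(demand, rho * remaining)

remaining = 10.0
served = []
for demand in (3.0, 8.0, 4.0):        # 依次到达的站点需求
    x = gpa_allocate(remaining, demand)
    served.append(x)
    remaining -= x
```

比例上限保证供应不会被早期的大需求一次耗尽,这正是在总供应未知时对冲缺货损失的关键。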
[AI-105] TeamBench: Evaluating Agent Coordination under Enforced Role Separation
【速读】:该论文旨在解决多角色代理系统(agent system)中因缺乏强制性访问控制而导致的协作有效性误判问题,即仅依赖提示(prompt)指定角色时,无法确保各角色真正履行其职责,可能掩盖了代理间实际协作缺失或角色替代现象。解决方案的关键在于提出TeamBench基准测试框架,通过操作系统级别的权限隔离实现角色分离:将任务规范访问、工作区编辑和最终认证功能分别分配给Planner(规划者)、Executor(执行者)和Verifier(验证者),确保任一角色均无法同时获取全部信息、修改工作区或认证结果。这种强制性分工使评估能够揭示真实协作行为,而非仅以通过率掩盖潜在问题。
链接: https://arxiv.org/abs/2605.07073
作者: Yubin Kim,Chanwoo Park,Taehan Kim,Eugene Park,Samuel Schmidgall,Salman Rahman,Chunjong Park,Cynthia Breazeal,Xin Liu,Hamid Palangi,Hae Won Park,Daniel McDuff
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role’s work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating system-enforced role separation. TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles, so that no role can read the full requirements, modify the workspace, and certify the final answer. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor’s code. Verifiers approve 49% of submissions that fail the deterministic grader, and removing the verifier improves mean partial score in the ablation. Team value is also conditional. Teams benefit when single agents struggle, but hurt when single agents already perform well. A 40-session human study under the same role separation shows that our benchmark exposes interaction patterns that pass rate misses. Solo participants work through the task directly, human participants paired with agents often collapse into quick approval, and human teams spend more effort coordinating missing information across roles.
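"操作系统层面强制的角色分离"可以抽象为一张能力表:任一角色都拿不到全部三种权限,越权操作直接被拒绝。下列权限名仅为示意,并非 TeamBench 的真实接口:

```python
PERMISSIONS = {
    "planner": {"read_spec"},        # 只能读取任务规范
    "executor": {"edit_workspace"},  # 只能修改工作区
    "verifier": {"certify"},         # 只能认证最终答案
}

def act(role, action):
    """提示词无法绕过:权限不在能力表中即抛出 PermissionError。"""
    if action not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"{role} may not {action}")
    return True
```

与仅靠提示词指定角色不同,这种强制分工使"验证者代写执行者代码"之类的越权行为在机制上不可能发生。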
[AI-106] 2.5-D Decomposition for LLM -Based Spatial Construction
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成三维积木放置指令时存在的系统性坐标错误问题,从而提升自主系统根据自然语言指令构建结构的可靠性。其解决方案的关键在于提出一种基于**2.5维分解(2.5-D decomposition)**的神经符号流水线:LLM仅在二维水平面内进行规划,而垂直方向的放置则由确定性执行器根据列占用情况计算得出,从而消除一类与垂直维度相关的错误。这一设计显著提升了结构准确性,在Build What I Mean基准测试中达到94.6%的平均结构准确率,接近由建筑师代理误差决定的理论上限(97.6%),且该方法可直接部署于边缘硬件并保持性能不变,具有良好的泛化能力。
链接: https://arxiv.org/abs/2605.07066
作者: Paul Whitten,Li-Jen Chen,Sharath Baddam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous systems that build structures from natural-language instructions need reliable spatial reasoning, yet large language models (LLMs) make systematic coordinate errors when generating three-dimensional block placements. We present a neuro-symbolic pipeline based on 2.5-D decomposition: the LLM plans in the two-dimensional horizontal plane while a deterministic executor computes all vertical placement from column occupancy, eliminating an entire class of errors. On the Build What I Mean benchmark (160 rounds), GPT-4o-mini with this pipeline achieves 94.6% mean structural accuracy across 12 independent runs, within 3.0 percentage points of the 97.6% ceiling imposed by architect-agent errors that no builder-side improvement can address. This outperforms both GPT-4o at 90.3% and the best competing system at 76.3%. A controlled ablation confirms that 2.5-D decomposition is the dominant contributor, accounting for 50.7 percentage points of accuracy. The pipeline transfers directly to edge hardware: Nemotron-3 120B running locally on an NVIDIA Jetson Thor AGX matches the cloud result at 94.5% with no prompt modifications. The underlying principle, removing deterministic dimensions from the LLM’s output space, applies to any autonomous construction or assembly task where gravity or other physical constraints fix one or more degrees of freedom. A transfer experiment on 500 IGLU collaborative building tasks confirms that the effect generalizes beyond the primary benchmark.
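"LLM 只在水平面规划、竖直位置由列占用确定性计算"的执行器,核心逻辑只有几行(示意实现,不代表论文代码):

```python
def place_block(columns, x, y):
    """LLM 给出 (x, y);z 由该列当前已堆叠的方块数确定,消除整类竖直坐标错误。"""
    z = columns.get((x, y), 0)
    columns[(x, y)] = z + 1
    return (x, y, z)

columns = {}
placements = [place_block(columns, *xy) for xy in [(0, 0), (0, 0), (1, 0)]]
```

由于重力已将竖直自由度固定,从 LLM 的输出空间中移除该维度即可在机制上杜绝相应的坐标错误。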
[AI-107] Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)后训练过程中如何高效利用稀缺但高保真度的目标数据,同时结合大量但对齐不完美的通用训练数据的问题。传统方法通常将通用数据视为可筛选的候选池,而本文提出了一种新框架——Dr. Post-Training(Data-Regularized Post-Training),其核心创新在于将通用训练数据重新定义为一种由数据诱导的正则化项(data-induced regularizer),用于防止模型在稀疏目标数据上过拟合。关键在于:在每个训练步骤中,基于通用数据构建一个可行的模型更新方向集合,并将目标数据指定的更新方向投影到该集合上,从而实现更灵活的偏差-方差权衡。这一视角使得标准训练和现有数据选择方法成为不同正则化强度下的特例,且实验表明该框架在监督微调(SFT)、强化学习人类反馈(RLHF)和强化学习价值回归(RLVR)等多种任务中均显著优于当前最优数据选择基线。
链接: https://arxiv.org/abs/2605.07063
作者: Pingbang Hu,Xueshen Liu,Z. Morley Mao,Jiaqi W. Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Data selection methods address a critical challenge in LLM post-training: effectively leveraging scarce, high-fidelity target data alongside abundant but imperfectly aligned general training data. In this work, we move beyond the data-selection framing and introduce Dr. Post-Training (Data-Regularized Post-Training), a novel framework that reconceptualizes general training data as a data-induced regularizer that prevents overfitting to the scarce target objective, rather than serving as a pool for selection. Specifically, our framework proposes that at each training step, construct a feasible set of model update directions using the general training data, and project the model update direction specified by the scarce target data onto that feasible set. Standard training and existing data selection methods arise as special cases with different choices of the data-induced regularizer, and these methods correspond to different points on a bias–variance spectrum with different regularization strength. Building on this view, we propose a family of methods offering a richer design space and more flexible bias–variance tradeoffs. For practical LLM-scale use, we introduce careful system optimizations that realize these methods with minimal overhead. Extensive experiments across SFT, RLHF, and RLVR show that our methods consistently outperform state-of-the-art data selection baselines, and system benchmarks confirm their efficiency.
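摘要中"把目标数据指定的更新方向投影到由通用数据构造的可行集"这一步,可用一个最简单的可行集(与通用梯度不冲突的半空间)来示意,做法与 PCGrad 式投影类似。这只是该框架一个特例式的说明,并非论文的完整方法:

```python
def project_update(g_target, g_general):
    """若目标方向与通用数据方向内积为负(冲突),剔除沿通用方向的冲突分量。"""
    dot = sum(a * b for a, b in zip(g_target, g_general))
    if dot >= 0:                      # 不冲突:目标方向本身已在可行集内
        return list(g_target)
    norm_sq = sum(b * b for b in g_general)
    return [a - (dot / norm_sq) * b for a, b in zip(g_target, g_general)]

g = project_update([-1.0, 1.0], [1.0, 0.0])   # 冲突示例:沿通用方向的分量被剔除
```

可行集的不同构造方式对应不同的正则化强度,也就对应偏差-方差谱上的不同取舍点。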
[AI-108] From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines
【速读】:该论文旨在解决当前持续集成与持续部署(CI/CD)流程中缺乏统一术语体系的问题,尤其是如何定义“代理式 CI/CD”(agentic CI/CD)、决策权分配程度以及控制权归属。其核心挑战并非单纯提升任务执行效率,而是设计权威转移机制(authority transfer),即在特定约束条件和回溯机制下,将操作决策从人类控制的流水线有意识地委托给代理系统。解决方案的关键在于区分两类权限:数据平面权威(data-plane authority,如补丁生成、测试重跑等局部干预)和控制平面权威(control-plane authority,如流水线配置修改、部署策略调整)。研究表明,现有系统主要处于数据平面的有限自主状态,安全性依赖外部治理而非代理内在保障;因此,论文提出未来研究应优先聚焦于控制平面的安全与治理机制,其次为自主边界的形式化、评估框架构建及人-代理协同优化。
链接: https://arxiv.org/abs/2605.07062
作者: Marcus Emmanuel Barnes,Taher A. Ghaleb,Safwat Hassan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted to the 3rd ACM International Conference on AI-Powered Software (AIware 2026), Main Track, Montreal, Canada, July 6-7, 2026. 5 pages
Abstract:AI agents are assuming active roles in Continuous Integration and Continuous Deployment (CI/CD) workflows, yet the research community lacks a shared vocabulary for describing what it means for CI/CD to be agentic, how much decision authority is delegated, and where control should reside. This paper presents a vision of agentic CI/CD in which the central challenge is not improving task performance but designing authority transfer, defined as the delegation of operational decisions from human-controlled pipelines to agent systems under specified constraints and recourse mechanisms. To structure this argument, we introduce a distinction between data-plane authority (localized interventions such as patch generation and test reruns) and control-plane authority (modifications to pipeline configuration, deployment policies, and approval gates). Drawing on research prototypes and industrial platforms, we show that current systems operate mainly at the data plane under bounded autonomy, with safety achieved through surrounding governance infrastructure rather than intrinsic agent guarantees. We identify three recurring patterns: constrained autonomy as the dominant design, external governance as the primary safety mechanism, and a widening gap between deployment momentum and evaluation methodology. We propose a research agenda in which control-plane safety and governance mechanisms represent the most urgent open problem, followed by formalization of autonomy boundaries, evaluation frameworks, and human–agent coordination.
[AI-109] Towards Differentially Private Reinforcement Learning with General Function Approximation
【速读】:该论文旨在解决在一般函数逼近(general function approximation)框架下,如何实现差分隐私(differential privacy)的在线强化学习(online reinforcement learning, RL)的理论保证问题,从而突破以往仅限于表格型(tabular)和线性设定的研究局限。其核心解决方案在于结合批处理策略更新机制(batched policy update scheme)与指数机制(exponential mechanism),并引入一种新颖的 regret 分析方法,使得在模型无关(model-free)设置中,隐私保护下的累计遗憾(regret)可达到 O(K^{3/5}) 的最优尺度,与线性情形的最先进结果一致。这一成果不仅填补了非线性函数逼近下私有在线RL的理论空白,还首次建立了基于标准覆盖复杂度(coverability)的批量更新 regret 上界,澄清了近期关于线性函数逼近私有RL结果中存在的根本性缺陷。
链接: https://arxiv.org/abs/2605.07049
作者: Yi He,Xingyu Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We present the first theoretical guarantees for differentially private online reinforcement learning (RL) with general function approximation, extending beyond prior work restricted to tabular and linear settings. Our approach combines a batched policy update scheme with the exponential mechanism, together with a novel regret analysis. We show that, even under general function approximation, the regret in the model-free setting under differential privacy matches the state of the art for the linear case, scaling as \widetilde{O}(K^{3/5}), where K denotes the number of episodes. As an important by-product, we also establish the first regret bound for online RL with batch update that depends on the standard complexity measure of coverability, complementing existing results based on a newly introduced Eluder-Condition class. In addition, we uncover fundamental gaps in recent results for private RL with linear function approximation, thereby clarifying its landscape.
[AI-110] Unlocking High-Fidelity Molecular Generation from Mass Spectra via Dual-Stream Line Graph Diffusion
【速读】:该论文旨在解决从串联质谱(tandem mass spectra)中进行分子结构的从头生成(de novo molecular generation)这一具有挑战性的逆问题,其核心难点在于原子层面与键层面推理之间的循环依赖关系:确定一个键的类型需要知道其端点原子的化学环境,而原子的环境又由其连接的键所定义。为突破这一瓶颈,作者提出DualLGD(Dual-stream Line Graph Diffusion)方法,其关键创新在于将分子图去噪过程重构为两个耦合子问题的交替求解——原子级推理和键级推理,分别在独立的表示空间中进行。通过线图(line graph)自然构建键空间,使键角、二面角、共轭链和环等拓扑结构可被显式建模;同时引入基于邻接约束的双向交叉注意力机制,在每一层同步两个流的信息,确保原子仅关注其关联键、键仅关注其连接原子,从而严格遵守化学基本原理。实验表明,该架构显著优于现有方法,在NPLIB1和MassSpecGym基准上分别达到34.37%和23.89%的Top-1准确率,约为之前最优模型的3倍,且无需预训练即可超越完全预训练的基线模型,验证了双流架构设计是性能提升的主要来源。
链接: https://arxiv.org/abs/2605.07048
作者: Xujun Che,Xiuxia Du,Depeng Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:De novo molecular generation from tandem mass spectra is a challenging inverse problem whose core difficulty lies in the circular dependency between atom-level and bond-level reasoning: determining a bond’s type requires knowing its endpoint atoms’ chemical environment, yet an atom’s environment is in turn defined by its incident bonds. Existing graph diffusion methods process atoms and bonds within a single computation stream, where atom-bond information synchronization can only occur implicitly across layers. We argue that this single-stream paradigm, rather than the choice of any particular aggregation kernel, is a key architectural bottleneck. We propose DualLGD (Dual-stream Line Graph Diffusion), which reformulates molecular graph denoising as the alternating solution of two coupled subproblems: atom-level reasoning and bond-level reasoning, each operating in its own dedicated representation space. The line graph provides a natural mathematical construction for the bond space, in which bond angles, dihedrals, conjugation chains, and rings correspond to local topological motifs between bonds. Incidence-constrained bidirectional cross-attention synchronizes the two streams at every layer, ensuring that each atom attends only to its incident bonds and vice versa, respecting the fundamental chemical principle that an atom’s environment is determined by its bonding context. On the NPLIB1 and MassSpecGym benchmarks, DualLGD achieves top-1 accuracy of 34.37% and 23.89%, approximately 3\times the previous state of the art. Ablation studies confirm the architecture as the primary source of improvement: DualLGD without any pre-training already surpasses the previous best fully pretrained model.
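摘要中作为键空间数学基础的线图(line graph)构造可以用几行代码示意:每条键成为一个节点,两条键共享一个原子端点时相连(仅为构造本身的草图,不含 DualLGD 的特征、扩散与注意力部分):

```python
import numpy as np
from itertools import combinations

def line_graph(edges):
    """构造分子图的线图: 每条键(边)成为一个节点,
    两个键节点相连当且仅当两条键共享一个原子。
    输入 edges 为 (原子i, 原子j) 元组列表, 输出键-键邻接矩阵。"""
    n = len(edges)
    adj = np.zeros((n, n), dtype=int)
    for i, j in combinations(range(n), 2):
        if set(edges[i]) & set(edges[j]):  # 两条键共享端点原子
            adj[i, j] = adj[j, i] = 1
    return adj
```

在这一表示下,环、共轭链等拓扑结构对应于键节点之间的局部模式,例如三元环的线图恰为完全图 K3。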
[AI-111] The Context Gathering Decision Process: A POMDP Framework for Agentic Search
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在复杂环境(如大规模代码库、企业数据库和对话历史)中因上下文窗口限制而导致的搜索状态失真问题,这种失真会引发冗余操作(如重复循环)和过早终止。其核心解决方案是将LLM代理的迭代搜索过程形式化为一种特殊的部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),称为Context Gathering Decision Process (CGDP),并基于此构建两个可插拔的干预机制:一是基于谓词(predicate-based)的持久信念状态,能够在有限上下文中保留多跳推理能力;二是程序化的耗尽门控机制(exhaustion gate),用于在无产出时主动终止搜索而不提前停止。实证结果表明,使用CGDP驱动的信念状态可使多跳推理准确率提升最高达11.4%,而耗尽门控机制可在不降低性能的前提下节省最多39%的token消耗。
链接: https://arxiv.org/abs/2605.07042
作者: Chinmaya Kausik,Adith Swaminathan,Nathan Kallus
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages
Abstract:Large Language Model (LLM) agents are deployed in complex environments – such as massive codebases, enterprise databases, and conversational histories – where the relevant state far exceeds their context windows. To navigate these spaces, an agent must iteratively explore the environment to find relevant information. However, without explicit infrastructure, an agent’s working memory can degrade into lossy representations of the search state, resulting in redundant work (e.g. repetitive looping) and premature stopping. In this work, we formalize this challenge as the Context Gathering Decision Process (CGDP), a specialized Partially Observable Markov Decision Process, where an agent’s objective is to adaptively refine its belief state to isolate the necessary information for a task. We model an LLM’s behavior as approximate Thompson Sampling within this CGDP, and introduce a predicate-based method that decomposes an LLM’s implicit search into explicit and modular operations. We then derive two plug-and-play interventions for iterative LLM agents: a persistent, predicate-based belief state that bounds context while preserving multi-hop reasoning, and a programmatic exhaustion gate that halts unproductive search without premature stopping. Across four methods and three question-answering domains, we empirically validate that replacing an LLM’s implicit state with our CGDP-motivated belief state improves multi-hop reasoning by up to 11.4%; while the modular programmatic exhaustion detection saves up to 39% of tokens without any degradation in agent performance. Ultimately, we argue that framing the LLM agent loop as a CGDP can guide the design of modular, non-interfering improvements to agentic search harnesses.
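“持久信念状态 + 程序化耗尽门控”的循环骨架大致可以写成如下玩具示意(`lookup`、`max_idle` 等名称与终止规则均为假设,并非论文的精确门控):

```python
def gather_context(predicates, lookup, max_idle=3):
    """迭代解析谓词、维护显式信念状态的极简 CGDP 循环。
    predicates: 待解析的谓词(问题)列表;
    lookup: 谓词 -> 事实集合(可能为空)的环境查询;
    连续 max_idle 次无产出即触发耗尽门控, 主动停止搜索。"""
    belief, idle = set(), 0
    pending = list(predicates)
    while pending and idle < max_idle:
        pred = pending.pop(0)
        facts = lookup(pred)
        if facts:
            belief |= set(facts)
            idle = 0          # 有产出的步骤重置耗尽计数
        else:
            idle += 1         # 程序化耗尽门控
    return belief
```

信念状态 `belief` 只保留已确认的事实,上下文规模因此有界;耗尽门控在连续无产出时停止,而非依赖模型自行判断何时终止。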
[AI-112] A Systematic Investigation of The RL-Jailbreaker in LLM s
【速读】:该论文旨在解决生成式 AI(Generative AI)在部署过程中面临的对抗性越狱攻击(adversarial jailbreaking)问题,即攻击者通过策略性操纵模型以诱导产生有害输出的风险。为应对这一挑战,作者首次系统性地分解了基于强化学习(Reinforcement Learning, RL)的越狱框架,将其拆解为问题形式化(如奖励函数、动作空间和回合长度)与算法措施(如RL算法、训练数据和奖励塑造)。研究发现,环境形式化设计——特别是密集奖励机制和较长的episode长度——是决定越狱成功率的关键因素。此发现不仅揭示了RL越狱成功的结构决定因素,也为提升攻击效率及增强生成模型对RL驱动攻击的鲁棒性提供了理论依据与实践工具。
链接: https://arxiv.org/abs/2605.07032
作者: Montaser Mohammedalamen,Kevin Roice,Reginald McLean,Alyssa Lefaivre Škopac
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Warning: To demonstrate vulnerabilities, this paper contains unfiltered and potentially offensive jailbreaking examples. Reader discretion advised
Abstract:The evolution of generative models from next-token predictors to autonomous engines of complex systems necessitates rigorous safety hardening. Adversarial jailbreaking, the strategic manipulation of models to elicit harmful output, remains a primary threat to safe deployment. While Reinforcement Learning (RL) frames jailbreaking as a multi-step attack through sequential optimization, a mechanistic understanding of why the framework succeeds remains incomplete. To fill this gap, we present the first systematic decomposition of RL jailbreaking. We deconstruct the framework into problem formalization (reward function, action space, episode length), and algorithmic measures (RL algorithm, training data, reward-shaping) to identify the structural determinants of adversarial success. Our results reveal that the RL-jailbreaker successfully compromised all targeted models and safeguards. Through this first-of-its-kind analysis, we demonstrate that environment formalization, specifically dense rewards and extended episode lengths, is the primary driver of jailbreaking success. This work provides a tool for improving RL-jailbreaker efficiency and, ultimately, harden generative models resistant to RL-based attacks.
[AI-113] Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中难以监督的问题,即许多行为偏差仅在推理结束时才显现,导致传统监控方法失效。解决方案的关键在于引入**行为提示推理(Behavior Cue Reasoning, BCR)**机制:通过训练模型在特定隐式或显式行为发生前输出特殊标记序列(称为“行为提示”),这些提示既作为可被外部监控器识别的信号,又充当控制杠杆。实验表明,利用行为提示的压缩信息即可使外部强化学习监控器在复杂数学问题求解中裁剪高达50%的冗余推理token;在强约束环境下,该方法能使80%原本会导向不安全动作的推理轨迹恢复为安全动作,成功率从46%提升至96%。此方案实现了推理过程的可控性与可监控性提升,且不牺牲模型性能。
链接: https://arxiv.org/abs/2605.07021
作者: Christopher Z. Cui,Taylor W. Killian,Prithviraj Ammanabrolu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code currently being cleaned, and prepared for public release. This comment will be updated once completed
Abstract:Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cue Reasoning allows for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work progresses scalable oversight by demonstrating how the monitored model itself can be trained to reason more tractably to oversight. Code to be released at this https URL
[AI-114] FlashMol: High-Quality Molecule Generation in as Few as Four Steps
【速读】:该论文旨在解决生成化学有效三维分子构象时采样效率低的问题,传统扩散模型如GeoLDM虽能生成高质量构象但需数百步,难以支持大规模虚拟筛选;而现有少步生成方法虽加速至12–50步,却常牺牲样本稳定性。其解决方案的关键在于提出FlashMol模型,通过引入分布匹配蒸馏(Distribution Matching Distillation, DMD)机制,以反向KL散度最小化为目标实现高效蒸馏,并创新性地重新设计生成时间步长(respace),改善初始条件以增强蒸馏效果;同时引入Jensen-Shannon散度正则项,缓解DMD的模式聚焦行为,提升多样性,最终在QM9和GEOM-DRUG数据集上实现仅需4步即可达到甚至超越原1000步教师模型的分子质量,采样速度提升最高达250倍。
链接: https://arxiv.org/abs/2605.07020
作者: Xinyuan Wei,Zian Li,Shaoheng Yan,Cai Zhou,Muhan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Generating chemically valid 3D molecular conformations is critical for computational drug discovery. Classical diffusion-based models like GeoLDM perform well but require hundreds of steps, making large-scale in silico screening impractical. Recent efforts on few-step molecular generation have accelerated this process to 12-50 steps, but they often largely sacrifice sample stability. In this work, we present FlashMol, an ultra-fast molecule generative model producing high-quality molecular conformations in as few as 4 steps. To achieve this, we adapt distribution matching distillation (DMD) - a reverse KL-divergence minimization objective - to the molecular domain for effective distillation. Considering the local minimization behavior of DMD, we respace the molecule generation timesteps, providing the generator with much better initialization and enables effective distillation. Additionally, to mitigate the mode-seeking behavior of DMD and improve diversity, we further regularize it with a Jensen-Shannon divergence term, which incorporates the mean-seeking behavior of the forward KL divergence. Extensive experiments on QM9 and GEOM-DRUG datasets demonstrate that FlashMol matches and even surpasses the original 1000-step teacher, achieving up to 250 \times acceleration in sampling speed while maintaining high molecular quality.
[AI-115] Adaptive auditing of AI systems with anytime-valid guarantees
【速读】:该论文旨在解决生成式 AI (Generative AI) 系统失效模式表征中因标注与评估成本高而导致的统计推断难题,特别是在自适应测试范式下如何在有限样本(通常仅10–50个案例)和动态采样决策条件下实现严格统计控制的问题。其解决方案的关键在于引入一种基于“双重对立”视角的假设检验框架:一方面设定模型原假设(null hypothesis),即系统不存在低于目标阈值的失效模式;另一方面设定审计员原假设,即其采样策略能识别出此类失效模式。通过利用安全任意时刻有效推断(Safe Anytime-Valid Inference, SAVI),将审计过程形式化为“以投注方式进行的测试”,从而构造出同时适用于两个对立原假设的 e-过程(e-processes)。理论证明表明,当审计员策略足够强大时,这两个假设在渐近意义上互为逆命题,即通过严格的审计可真正保证系统的全局鲁棒性;实证结果进一步验证了所提方法在任意时刻均保持第一类错误控制,并可在仅20个观测下得出统计上严谨的结论,显著优于预设规则的测试方法。
链接: https://arxiv.org/abs/2605.07002
作者: Siyu Zhou,Patrick Vossler,Venkatesh Sivaraman,Yifan Mai,Jean Feng
机构: 未知
类目: Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:
Abstract:A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases and how many to annotate based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases) and decisions regarding sampling and stopping are made in the midst of data collection rather than based on a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two ‘dueling’ perspectives: (i) the model’s null that asserts there is no failure mode with performance below a target threshold versus (ii) the auditor’s null that asserts they have a sampling strategy that will uncover a failure mode. Leveraging Safe Anytime-Valid Inference (SAVI), we formalize the auditor as conducting ‘testing by betting’, which translates into simultaneous e-processes for testing the dueling null hypotheses. Furthermore, if the auditor is sufficiently powerful, we prove that these two hypotheses are asymptotically inverses of each other, in that passage of a stringent audit does in fact certify the AI system as being globally robust. Empirically, we demonstrate that our proposed testing procedures maintain anytime-valid type-I error control, outperform pre-specified testing methods, and can reach statistically rigorous conclusions sometimes with as few as 20 observations.
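“以投注方式进行的测试”(testing by betting)的核心是一个财富过程:每次观测按赔率更新财富,财富超过 1/α 即拒绝原假设,且在任意停时都控制第一类错误。下面是单条 e-过程的极简示意(`p0`、`bet` 等参数取值仅为演示;论文中针对两个对立原假设同时运行 e-过程,此处只演示一条):

```python
import numpy as np

def betting_eprocess(outcomes, p0=0.9, bet=0.5, alpha=0.05):
    """对 H0: 成功率 >= p0 的任意时刻有效检验。
    每个观测 x in {0,1} 使财富乘以 1 + bet*(p0 - x)/p0;
    在 H0 下财富是上鞅, 故财富 >= 1/alpha 时拒绝 H0,
    在任意停时均控制第一类错误(Ville 不等式)。"""
    wealth, history = 1.0, []
    for x in outcomes:
        wealth *= 1.0 + bet * (p0 - x) / p0
        history.append(wealth)
        if wealth >= 1.0 / alpha:
            return True, history  # 拒绝 H0: 审计发现失效模式
    return False, history
```

连续观测到失败(x=0)时财富按 1.5 倍增长,很少的样本即可越过 1/α=20 的拒绝阈值,这对应摘要中“有时仅需 20 个观测”的结论量级。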
[AI-116] Optimal Experiments for Partial Causal Effect Identification
【速读】:该论文旨在解决在成本约束下,从可观测数据中选择最优实验子集以最大程度收紧因果查询(causal query)的边界问题。其核心挑战在于如何高效筛选出具有高“认知效力”(epistemic potency)的实验,即能保证最坏情况下最大缩减边界宽度的实验。解决方案的关键在于提出一种称为“最大效力问题”(max-potency problem)的形式化框架,并基于多项式规划方法(polynomial-programming framework)设计了评估认知效力的通用程序;同时引入两种仅依赖于因果图和查询结构的图形剪枝准则:一是利用区块结构(district structure)的路径拦截规则(path-interception rule),可在线性时间内识别零效力实验;二是基于ID算法的可识别性检查。这两项剪枝策略显著压缩候选实验空间(平均剪枝率达50–88%),并进一步证明经ID剪枝后的实验在组合搜索中具有不变性(combinatorially inert),从而实现对所有可能子集的超指数级减少,最终在NHANES真实数据上实现了物理活动对糖尿病影响效应估计的最优实验选择。
链接: https://arxiv.org/abs/2605.06993
作者: Tobias Maringgele,Jalal Etesami
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Causal queries are often only partially identifiable from observational data, and experiments that could tighten the resulting bounds are typically costly. We study the problem of selecting, prior to observing experimental outcomes, a cost-constrained subset of experiments that maximally tightens bounds on a target query. We formalize this as the max-potency problem, where epistemic potency measures the worst-case reduction in bound width guaranteed by an experiment, and show that this problem is NP-hard via a reduction from 0-1 knapsack. Building on the polynomial-programming framework of Duarte et al. (2023), we give a general procedure for evaluating epistemic potency in discrete settings. To control the super-exponential search space, we introduce two graphical pruning criteria that depend only on the causal graph and the query: a novel path-interception rule that exploits district structure to certify zero potency in linear time, and an identifiability check based on the ID algorithm. On Erdos-Renyi random graphs and 11 bnlearn benchmark networks, the two criteria together prune 50-88% of candidate experiments on average without solving a single polynomial program. For the general subset search, we show that ID-pruned experiments are combinatorially inert, yielding a super-exponential reduction in the number of subsets evaluated. We close with an end-to-end demonstration on observational NHANES data, selecting optimal experiments for estimating the effect of physical activity on diabetes.
[AI-117] PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction
【速读】:该论文旨在解决因果抽象(causal abstraction)中关键变量定位效率低下的问题,即如何高效地从神经网络的低层计算中识别出与高层因果模型对齐的神经站点(neural site),以实现机制可解释性。现有方法如分布式对齐搜索(DAS)虽能学习表达性强的子空间干预,但因神经站点未知而需在候选位置上进行高成本的全局搜索。本文提出的PLOT(Progressive Localization via Optimal Transport)框架的关键在于利用最优传输(optimal transport)构建抽象变量与候选神经站点之间的全局软对应关系,从而实现快速且精准的因果变量定位:在简单场景下通过单个神经元级别的耦合即可完成定位;在复杂模型中则采用渐进式策略,从粗粒度(如token、时间步或层)逐步细化到细粒度(如坐标组或PCA子空间),并可选地引导DAS以显著降低其计算开销。实验表明,纯运输型PLOT干预在速度和精度上均表现优异,而PLOT引导的DAS则在远低于全量DAS运行时间下达到相当精度,为大规模因果抽象研究提供了高效的定位引擎。
链接: https://arxiv.org/abs/2605.06979
作者: Jonathn Chang,Arya Datla,Ziv Goldfeld
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Causal abstraction offers a principled framework for mechanistic interpretability, aligning a high-level causal model with the low-level computation realized by a neural network through counterfactual intervention analysis. Existing methods such as distributed alignment search (DAS) learn expressive subspace interventions, but the relevant neural site is unknown a priori, so finding a handle requires a computationally burdensome search over candidate sites. We introduce PLOT (Progressive Localization via Optimal Transport), a transport-based framework that localizes causal variables from the output effect geometry of abstract and neural interventions. PLOT fits an optimal transport coupling between abstract variables and candidate neural sites, yielding a global soft correspondence that can be calibrated into intervention handles. In simple settings, a single coupling over individual neurons suffices. In larger models, PLOT is applied progressively, moving from coarse sites such as tokens, timesteps, or layers to finer supports such as coordinate groups or PCA spans, and optionally guiding DAS based on the localized signal. Across experiments of increasing complexity, transport-only PLOT handles are exceedingly fast and competitive on accuracy, while PLOT-guided DAS reaches DAS-level accuracy at a fraction of full DAS runtime, providing an efficient localization engine for causal abstraction research at scale.
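PLOT 中“抽象变量—候选神经站点”的全局软对应可以用熵正则化最优传输(标准 Sinkhorn 迭代)求得。下面是均匀边际下的示意实现(代价矩阵在论文中来自干预效应几何,此处作为输入假设直接给定):

```python
import numpy as np

def sinkhorn(cost, reg=0.1, iters=200):
    """均匀边际下的熵正则化最优传输 (Sinkhorn 迭代)。
    返回软耦合矩阵 P: 第 i 行把抽象变量 i 的质量
    分配到各候选神经站点上, 低代价配对获得更多质量。"""
    n, m = cost.shape
    K = np.exp(-cost / reg)              # Gibbs 核
    a, b = np.ones(n) / n, np.ones(m) / m  # 均匀边际
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

正则化系数 `reg` 越小,耦合越接近硬匹配;校准该耦合即可得到摘要中所说的干预“句柄”(handle)。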
[AI-118] f-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses ICML2026
【速读】:该论文旨在解决在线强化学习从人类反馈(Online Reinforcement Learning from Human Feedback, RLHF)中缺乏对一般 f-散度(f-divergence)正则化目标的统一理论分析问题。现有方法多依赖于反向KL散度正则化,而新兴研究尝试使用前向KL、卡方散度等替代形式,但尚未建立普适性的理论框架。解决方案的关键在于提出两个基于不同采样原理的算法:其一通过引入精心设计的探索奖励扩展经典乐观性原则(optimism principle),其二利用在 f-散度正则化下最优策略对奖励扰动的敏感性,构建新的优化机制。理论分析表明,两者均能实现 O(log T) 的累积遗憾(regret)和 O(1/T) 的次优间隙(sub-optimality gap),首次为在线RLHF在广义 f-散度正则化下的性能提供了可证明的效率边界。
链接: https://arxiv.org/abs/2605.06977
作者: Di Wu,Chengshuai Shi,Jing Yang,Cong Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
备注: ICML 2026
Abstract:Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for post-training large language models. While most existing approaches rely on the reverse KL-regularization, recent empirical studies have begun exploring alternative divergences (e.g., forward KL, chi-squared) as regularizers in RLHF. However, a unified theoretical understanding of general f-divergence regularization remains under-explored. To fill this gap, this work develops a comprehensive theoretical framework for online RLHF with a general f-divergence regularized objective. Rather than treating each possible divergence function individually, we adopt a holistic perspective across the entire function class and propose two algorithms based on distinct sampling principles. The first extends the classical optimism principle with a carefully designed exploration bonus, while the second introduces a new method that exploits the sensitivity of the optimal policy to reward perturbations under f-divergence regularization. Theoretical analysis shows that O(\log T) regret and O(1/T) sub-optimality gap are achievable, establishing provable efficiency of both algorithms and, to the best of our knowledge, the first performance bounds for online RLHF under general f-divergence regularization.
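按 D_f(P‖Q) = E_Q[f(dP/dQ)] 的标准定义,一般 f-散度正则化的 RLHF 目标可写成如下形式(符号为示意性写法,非论文原文):

```latex
\max_{\pi}\;\; \mathbb{E}_{x\sim\rho,\;y\sim\pi(\cdot\mid x)}\!\left[r(x,y)\right]
\;-\;\beta\, D_f\!\left(\pi \,\middle\|\, \pi_{\mathrm{ref}}\right),
\qquad
D_f\!\left(\pi \,\middle\|\, \pi_{\mathrm{ref}}\right)
=\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\!\left[
  f\!\left(\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\right)\right].
```

取 f(t) = t\log t 即恢复反向 KL 正则化,f(t) = -\log t 对应前向 KL,f(t) = (t-1)^2 对应卡方散度,这正是摘要中“对整个函数类进行整体分析”所覆盖的特例。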
[AI-119] Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在复杂任务中缺乏泛化能力与可复用性的问题,特别是在面对未见过的应用场景时性能显著下降。其核心挑战包括:如何通过自动分解学习可复用的组件、如何使这些组件具备跨任务实例的泛化能力,以及如何高效检索这些组件以支持组合式策略生成。解决方案的关键在于提出一种名为“分层组件学习通用策略”(Hierarchical Component Learning for Generalized Policies, HCL-GP)的方法,该方法结合了广义规划(Generalized Planning)与层次任务分解(Hierarchical Task Decomposition),能够从成功执行中自动提取并组织参数化策略组件至组件库,并利用语义搜索实现高效检索,从而提升策略的泛化性和执行效率。实验表明,该方法在AppWorld基准上对常规任务和挑战任务分别达到98.2%和97.8%的准确率,显著优于静态合成方法,并在开源模型上实现了62.5%的成功率,验证了经典规划思想与LLM代理融合的有效性。
链接: https://arxiv.org/abs/2605.06957
作者: Shirin Sohrabi,Haritha Ananthakrishnan,Harsha Kokel,Kavitha Srinivas,Michael Katz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present a dynamic policy-learning approach that combines generalized planning and hierarchical task decomposition for LLM-based agents. Our method, Hierarchical Component Learning for Generalized Policies (HCL-GP ), learns parameterized policies that generalize across task instances and automatically extracts reusable components from successful executions, organizing them into a component library for compositional policy generation. We address three challenges: (1) learning components through automated decomposition, (2) generalizing components to maximize reuse, and (3) efficient retrieval via semantic search. Evaluated on the AppWorld benchmark, our approach achieves 98.2% accuracy on normal tasks and 97.8% on challenge tasks with unseen applications, improving 15.8 points over static synthesis on challenging scenarios. For open-source models, dynamic reuse enables 62.5% success versus near-zero without reuse. This demonstrates that classical planning concepts can be effectively integrated with LLM agents for improved accuracy and efficiency.
[AI-120] Kurtosis-Guided Denoising Score Matching for Tabular Anomaly Detection
【速读】:该论文旨在解决基于去噪得分匹配(Denoising Score Matching, DSM)的异常检测方法中噪声扰动尺度选择困难的问题。传统方法在低密度区域因噪声过少导致得分估计不稳定,在高密度区域因噪声过多破坏局部结构从而削弱异常敏感性,且在无标签数据或未知异常情况下难以进行超参数调优。解决方案的关键在于提出一种基于峰度(kurtosis)的自适应噪声缩放策略(K-DSM),该方法根据每维特征边际分布的形状动态设定噪声水平,无需额外模型复杂度即可提升对低密度区域的覆盖能力和高密度区域的精度。实验表明,经过精心训练的单尺度模型已具备强大异常检测能力,配合轻量级EMA-教师过滤规则后,在半监督和全监督(含污染数据)场景下均达到先进性能,证明了数据自适应噪声缩放可显著降低对人工调参的依赖并增强鲁棒性。
链接: https://arxiv.org/abs/2605.06955
作者: Victor Livernoche,Jie Zan,Reihaneh Rabbany
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 39 pages, 10 figures, 14 tables
Abstract:Denoising score matching (DSM) provides a way to learn data distributions by training a neural network to recover the score function, defined as the gradient of the log density, from noise-corrupted samples. Once trained, the score magnitude at a test point reflects how consistent that point is with the learned distribution, making it a natural anomaly signal. The key practical challenge is selecting the perturbation scale: too little noise yields unstable score estimates in sparse regions, while too much erases local structure and weakens anomaly sensitivity. This is compounded by the difficulty of hyperparameter tuning when anomalies are unknown and no validation set is available. We introduce kurtosis-based noise scaling (K-DSM), a per-feature scheme that sets noise levels from the shape of each marginal distribution, improving coverage of low-density regions and precision in high-density regions without extra model complexity. Contrary to prior claims that multi-scale or noise-conditioned training is necessary, we find that a carefully trained single-scale model is already a strong anomaly detector. On standard tabular anomaly detection benchmarks, K-DSM achieves state-of-the-art performance in the semi-supervised setting. When combined with a lightweight EMA-teacher filtering rule that removes low-density training points before each gradient step, it also achieves strong performance in the fully unsupervised (contaminated) setting, suggesting that simple, data-adaptive noise scaling enables robust anomaly detection while reducing reliance on hyperparameter tuning.
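按每维边际分布的峰度设定噪声尺度的思想可以这样示意(峰度到 σ 的具体映射此处采用单调的 tanh 压缩,仅为示意性选择,并非论文原式):

```python
import numpy as np

def kurtosis_noise_scales(X, base_sigma=1.0):
    """由每维超额峰度 (excess kurtosis) 计算逐特征噪声尺度:
    重尾维度(峰度高)获得更大噪声以覆盖稀疏尾部,
    集中维度获得更小噪声以保留局部结构。"""
    Xc = X - X.mean(axis=0)
    m2 = (Xc ** 2).mean(axis=0)
    m4 = (Xc ** 4).mean(axis=0)
    excess_kurt = m4 / np.maximum(m2 ** 2, 1e-12) - 3.0
    # 单调压缩保证尺度为正且有界
    return base_sigma * (1.0 + np.tanh(excess_kurt / 10.0))
```

例如 Laplace 分布的超额峰度为 3,高斯分布为 0,故 Laplace 维度会被分配更大的扰动尺度。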
[AI-121] Adaptive Memory Decay for Log-Linear Attention
【速读】:该论文旨在解决序列模型在记忆容量与计算效率之间的权衡问题:传统Transformer虽能实现丰富的上下文建模,但其计算复杂度为二次方;而线性注意力和状态空间模型虽具备线性时间复杂度,却因将上下文压缩至固定大小的隐藏状态而导致记忆召回能力受限。为此,作者提出通过在Fenwick树层次结构中引入输入依赖的衰减参数(\lambda)来优化Log-linear Attention机制——关键创新在于使用一个轻量级两层MLP从输入中学习每个token、每层的\lambda值,从而实现内容感知的衰减权重分配,而非固定不变的全局衰减策略。此方法避免了softmax带来的层级间竞争,并保持了原有的对数线性计算复杂度,同时仅增加可忽略的参数开销,在长程记忆任务中显著优于基线模型。
链接: https://arxiv.org/abs/2605.06946
作者: Yaxita Amin,Helen Zichen Li,Mengfan Zhang,Samet Ayhan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 13 figures. Preprint
Abstract:Sequence models face a fundamental tradeoff between memory capacity and computational efficiency. Transformers achieve expressive context modeling at quadratic cost, while linear attention and state-space models run in linear time by compressing context into a fixed-size hidden state, inherently limiting recall. Log-linear attention navigates this tradeoff by organizing memory across a Fenwick tree hierarchy, growing its hidden state logarithmically with sequence length at log-linear compute cost. However, its memory decay parameter \lambda is fixed and independent of the input, assigning uniform weights across all hierarchy levels regardless of the content, which introduces unnecessary rigidity. We propose learning \lambda directly from the input via a lightweight two-layer MLP, producing per-token, per-level decay that adapts to content rather than position. A softplus activation lets each Fenwick tree level scale independently, avoiding the inter-level competition that softmax introduces. This modification preserves log-linear complexity exactly and adds negligible parameter overhead. We evaluate on associative recall, selective copying, and language modeling, finding that input-dependent decay consistently outperforms the baseline, with the largest gains in long-range memory settings where baseline \lambda degrades or collapses entirely.
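“双层 MLP + softplus 输出逐 token、逐层衰减”的前向计算示意如下(形状、初始化与变量名均为假设性草图,非论文实现):

```python
import numpy as np

def softplus(z):
    """数值稳定的 softplus: log(1 + e^z), 输出恒为正。"""
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def adaptive_decay(x, W1, b1, W2, b2):
    """双层 MLP 把 token 嵌入 x 映射为每个 Fenwick 树层级
    一个正的衰减值。用 softplus 而非 softmax 作输出激活,
    使各层级可以独立缩放, 避免层级间的竞争。"""
    h = np.maximum(W1 @ x + b1, 0.0)   # 隐藏层, ReLU
    return softplus(W2 @ h + b2)       # 逐层衰减, 严格为正
```

由于衰减由输入内容决定,同一位置的不同 token 会得到不同的逐层权重,这正是摘要中“内容自适应而非位置固定”的含义。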
[AI-122] A Generalized Singular Value Theory for Neural Networks
【速读】:该论文旨在解决现代神经网络架构在输入-输出行为建模中的可解释性与稳定性问题,特别是如何在不改变模型性能的前提下,将神经网络分解为一个左可逆的非线性部分和一个最终线性层。其核心解决方案是基于广义奇异值分解(Generalized Singular Value Decomposition, GSVD)理论,证明了大多数现代神经网络架构均可表示为一种特殊形式:在最终线性层之前的部分是左可逆的,并且该非线性部分可以被设计为保持范数(norm-preserving),从而使得嵌入空间(embedding space)中距离的变化能够直接对应到输入空间的距离变化。这一性质为后续如对抗样本检测、模型偏见分析及可逆性研究提供了理论基础和实用工具。
链接: https://arxiv.org/abs/2605.06938
作者: Brian Charles Brown,Robert Bridges,David Grimsman,Mauricio Munoz,Sean Warnick
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Building on the abstract Generalized Singular Value Decomposition (GSVD) theory of Brown et al. [2025], we prove that most modern neural architectures admit a generalized SVD representation in which they are left-invertible before a final linear layer, with no change in input-output behavior. Furthermore, the left-invertible nonlinear portion of the input-output behavior can be made to be norm preserving, meaning that perturbations in the left-invertible "embedding" (the activations prior to the final linear layer in this representation) correspond proportionally to changes in the input, i.e., distance in feature space can be calibrated directly to distance in input space. We provide a data-driven algorithm for estimating this representation from trained models and propose a model architecture that naturally facilitates the decomposition. We then provide a proof-of-concept that the learned representation can be used to identify adversarial perturbations to model inputs, and develop the theory necessary for future applications to areas such as model bias and invertibility.
[AI-123] In-Context Credit Assignment via the Core
[Quick Read]: This paper tackles in-context credit assignment: fairly attributing the credit for AI-generated content (e.g., code, news articles, or short-form videos) among the creators whose work appears in the context window. Ensuring each contributor's compensation matches their actual value is hard; when a subset of creators could jointly capture more value on their own, the allocation becomes unstable. The proposed solution rests on the least core from cooperative game theory, which stabilizes allocations by ensuring that no subset of creators is significantly under-compensated relative to the value they could generate alone. The key innovation is a set of algorithms for approximating the least core, using novel constraint-seeding and constraint-separation techniques that sharply cut the number of LLM calls while keeping allocations accurate; on a web-retrieval task, this yields order-of-magnitude efficiency gains.
Link: https://arxiv.org/abs/2605.06920
Authors: Keegan Harris, Siddharth Prasad, Asher Trockman
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We propose incentive-aligned mechanisms for in-context credit assignment: the task of assigning credit for AI-generated content (e.g. code, news articles, short-form videos) among creators whose intellectual property appears in the context window. Our approach is based on the least core solution concept from cooperative game theory, which distributes value in a way that is as stable as possible by ensuring that no subset of creators is significantly under-compensated relative to the value they could generate on their own. We develop algorithms for approximating the least core, which leverage novel routines for constraint seeding and constraint separation. On a web retrieval credit assignment task, we find that our approaches are capable of approximating the least core using orders of magnitude fewer LLM calls compared to alternative methods.
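To make the least-core stability notion concrete, the sketch below measures how under-compensated the worst-off coalition is under a candidate allocation, which is exactly the quantity the least core minimizes. The toy three-creator game and all names are invented for illustration; the paper's contribution is approximating the least core with far fewer value queries, not this brute-force check.

```python
from itertools import combinations

def max_deficit(values, allocation):
    """Largest amount by which any proper coalition is under-compensated.

    `values[S]` is coalition S's stand-alone value (S a frozenset of players).
    The least core picks the allocation minimizing this quantity, subject to
    distributing the grand coalition's full value. Floored at 0: a fully
    satisfied allocation has no deficit.
    """
    players = sorted(set().union(*values.keys()))
    worst = 0.0
    for r in range(1, len(players)):            # proper, non-empty coalitions
        for coalition in combinations(players, r):
            paid = sum(allocation[p] for p in coalition)
            worst = max(worst, values[frozenset(coalition)] - paid)
    return worst

# Toy game: any pair of creators can generate 0.8 alone; all three generate 1.0.
v = {frozenset(s): (1.0 if len(s) == 3 else 0.8 if len(s) == 2 else 0.0)
     for r in (1, 2, 3) for s in combinations("ABC", r)}

equal_split = {"A": 1/3, "B": 1/3, "C": 1/3}   # every pair is paid 2/3 < 0.8
skewed = {"A": 0.8, "B": 0.1, "C": 0.1}        # pair {B, C} is paid only 0.2
```

The equal split leaves every pair 0.8 - 2/3 ≈ 0.133 short, while the skewed allocation leaves one pair 0.6 short, so the equal split is the more stable of the two in the least-core sense.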
[AI-124] Same Signal Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents
[Quick Read]: This paper addresses the instability of fixed-direction gating in adaptive test-time compute. Existing methods gate extra computation on confidence, uncertainty, or difficulty signals, assuming a single stable mapping from signal to compute value. The authors show this assumption fails: across environments and backbones, the same signal can predict benefit in one setting and harm in another, so wrong-direction gates end up precisely selecting harmful states for extra compute. The remedy, DIAL (Direction-Informed Adaptive Learning), trains a sparse gate from signal-agnostic counterfactual exploration to learn the utility direction of state features per (environment, backbone) pair, separating compute need from compute suitability and markedly improving the success-cost trade-off across environments and models.
Link: https://arxiv.org/abs/2605.06908
Authors: Ziming Li, Jiatan Huang, Xiaoguang Guo, Guilin Wang, Chuxu Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Adaptive test-time compute for LLM agents aims to invoke extra computation only when it improves performance. Existing methods typically use confidence-, uncertainty-, or difficulty-based gates, assuming a fixed direction from the gating signal through compute need to the value of computation. This makes gating a utility-calibration problem: gating signals should align with whether extra computation improves the final outcome over the base policy. We show that this alignment is unstable: the same signal predicts rollout benefit in one setting and rollout harm in another, with reversals across environments and backbones even when the task is fixed. Wrong-direction gates can therefore worsen performance by precisely selecting harmful states. This reversal reflects a deeper distinction between compute need and compute suitability: a high uncertainty signal may indicate decision-difficult states where rollouts help compare alternatives, or intervention-unsuitable states where the current context does not support useful rollout-based improvement. Under this two-source model, fixed-direction gates are unreliable across heterogeneous settings. To address this, we propose DIAL (Direction-Informed Adaptive Learning), a sparse gate trained from signal-agnostic counterfactual exploration to learn the utility direction of state features per (environment, backbone). Across six environments and three backbones, DIAL yields a stronger overall success-cost trade-off than fixed-direction baselines.
[AI-125] Self-Programmed Execution for Language-Model Agents
[Quick Read]: This paper addresses a limitation of existing language-model agents: a fixed orchestrator program governs the state transition between consecutive turns, constraining how flexibly the model can adapt on complex tasks. The key idea, Self-Programmed Execution (SPE), makes the model's own generated text the orchestrator program rather than an external controller. SPE is formalized via "agentic machines": from an SPE state, a model completion can load any state of an embedded copy of the machine, freeing the agent from any fixed turn-to-turn orchestration policy. To realize SPE in practice, the author introduces Spell, a Lisp-based language whose programs can edit and re-evaluate themselves without replaying side effects, letting models accomplish complex agentic tasks with no fixed orchestration logic.
Link: https://arxiv.org/abs/2605.06898
Authors: Luke J. O'Connor
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:At the heart of existing language model agents is a fixed orchestrator program responsible for the state transition between consecutive turns. This paper introduces self-programmed execution (SPE), an agent architecture in which the model completion is itself the orchestrator program, and the harness evaluates this program but does not impose its own orchestration policy. I formalize this idea using agentic machines: an SPE state is one from which a model completion can load any state of an embedded copy of the machine, meaning that it is subject to no fixed turn-to-turn orchestration policy. Realizing SPE in practice is nontrivial because the same data is both model context and executable program. I therefore introduce Spell, a Lisp-based language in which programs can edit and re-evaluate themselves, and effectful expressions like model invocations are structured such that re-evaluating an edited program does not replay its side effects. Experiments with existing models, not trained for SPE or Spell, show that frontier models can operate in this regime and accomplish challenging agentic tasks. These results demonstrate how an LM can act as an agent without any fixed orchestration policy, and they raise the question of what self-orchestration strategies might be learned by a model trained for self-programmed execution. Code is available at this https URL .
[AI-126] Mitigating Cognitive Bias in RLHF by Altering Rationality
[Quick Read]: This paper addresses robustness in reinforcement learning from human feedback (RLHF): how to learn reliable reward signals even when human feedback is imperfect or biased. Conventional approaches assume a stable relationship between latent reward differences and human preferences, modeled with a fixed rationality parameter β; in practice, human judgments are shaped by cognitive biases and deviate from reward-consistent behavior in context-dependent ways. The key idea is to make β context- and annotation-dependent: an LLM-as-judge flags preference comparisons likely to reflect cognitive biases, and those comparisons are downweighted during reward learning. This suppresses biased feedback during training and yields a more rational downstream model, even when fine-tuning on strongly biased preference datasets.
Link: https://arxiv.org/abs/2605.06895
Authors: Tiffany Horter, Andrew Markham, Niki Trigoni, Serena Booth
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:How can we make models robust to even imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns scalar values to responses. Because these rewards are inferred from pairwise comparisons, this learning depends on an assumed relationship between latent reward differences and observed preferences, typically modeled using a Boltzmann formulation in which a rationality parameter beta informs how consistently preferences reflect reward differences. In practice, beta is typically treated as a fixed constant that reflects assumed uniform annotator reliability. However, human feedback is not this simplistic in practice: real human judgments are shaped by cognitive biases, leading to systematic deviations from reward-consistent behavior that arise contextually. To address this, we treat rationality as context- and annotation-dependent. We design an approach to dynamically adjust the rationality parameter beta during reward learning using an LLM-as-judge to assess the likely presence of cognitive biases. This approach effectively downweights comparisons that are likely to reflect biased or unreliable judgments. Empirically, we show that this approach learns a more rational downstream model, even when finetuning on datasets with strongly biased preferences.
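The Boltzmann preference model the abstract describes can be written down directly. The sketch below shows how the rationality parameter beta controls how strongly a reward gap translates into preference probability, together with a toy downweighting rule for flagged comparisons; the `effective_beta` rule and its `bias_score` input are illustrative assumptions, not the paper's formulation.

```python
import math

def preference_prob(r_a, r_b, beta):
    """Boltzmann (Bradley-Terry) model: P(annotator prefers A over B)."""
    return 1.0 / (1.0 + math.exp(-beta * (r_a - r_b)))

# With a latent reward gap of 1.0:
p_rational = preference_prob(1.0, 0.0, beta=5.0)      # highly consistent annotator
p_noisy = preference_prob(1.0, 0.0, beta=0.5)         # near-random annotator
p_indifferent = preference_prob(1.0, 0.0, beta=0.0)   # beta=0: a pure coin flip

# Dynamic-beta idea (illustrative): a judge assigns a bias score in [0, 1] to a
# comparison, and the gate shrinks beta, reducing that comparison's influence
# on the learned reward.
def effective_beta(base_beta, bias_score):
    return base_beta * (1.0 - bias_score)
```

Lowering beta flattens the preference curve toward 0.5, so a comparison judged likely-biased contributes less gradient signal than one judged reliable.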
[AI-127] Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment
[Quick Read]: This paper asks how to preserve the internal representation geometry of a pretrained autoregressive language model (ARLM) when converting it into a diffusion language model (DLM). Existing recipes rely on continued denoising training with objective- and attention-level modifications, leaving open whether the semantic structure learned by AR pretraining transfers across generation orders. The key idea, REPR-ALIGN, is a representation-alignment objective: the DLM's hidden states are aligned, layer by layer, to those of the frozen AR model via cosine similarity, while the standard masked-denoising objective is optimized jointly. With no adapters and no architectural changes beyond the attention mask, this yields up to 4x training acceleration and is especially effective in low-data regimes, indicating that linguistic representations transfer across generation orders and that alignment is a simple, effective route to efficient DLM training.
Link: https://arxiv.org/abs/2605.06885
Authors: Fred Zhangzhi Peng, Alexis Fox, Anru R. Zhang, Alexander Tong
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code available at this https URL
Abstract:Diffusion language models (DLMs) have recently demonstrated capabilities that complement standard autoregressive (AR) models, particularly in non-sequential generation and bidirectional editing. Although recent work has shown that pretrained autoregressive checkpoints can be converted into diffusion language models, existing recipes primarily transfer parameters through continued denoising training with objective- and attention-level modifications. We instead ask whether the internal representation geometry learned by next-token prediction can be explicitly preserved during AR-to-DLM conversion. We hypothesize that much of the semantic structure learned by AR pretraining can transfer across generation orders, and thus DLM training should be viewed as relearning the decoding path rather than relearning language representations. To investigate this, we introduce REPR-ALIGN, a representation alignment objective that adapts a bidirectional masked diffusion model to reuse representations from a pretrained AR model of identical architecture. Concretely, we align the hidden states of the DLM to the frozen AR model at every layer using cosine similarity, while optimizing the standard masked denoising objective. This simple alignment, with no adapters and no architectural changes beyond the attention mask, yields up to 4x training acceleration in our setting and is particularly effective in low-data regimes. Our results suggest that linguistic representations can transfer across generation order, and that representation alignment provides a simple and effective technique for training diffusion language models. Code is available at this https URL.
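The alignment objective is simple enough to sketch. Below is a hedged, minimal version of a per-layer cosine alignment loss between DLM and frozen-AR hidden states, using plain Python lists in place of tensors; the names are illustrative, and in training this term would be added to the masked-denoising loss rather than used alone.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(dlm_states, ar_states):
    """Average (1 - cosine) between DLM and frozen-AR hidden states.

    Each argument is a list of layers; each layer is a list of hidden
    vectors, one per token. Lower is better-aligned.
    """
    total, count = 0.0, 0
    for dlm_layer, ar_layer in zip(dlm_states, ar_states):
        for h_dlm, h_ar in zip(dlm_layer, ar_layer):
            total += 1.0 - cosine(h_dlm, h_ar)
            count += 1
    return total / count

# Identical states give ~zero loss; so do scaled copies, since cosine
# similarity ignores vector norm.
layer = [[1.0, 2.0], [0.5, -1.0]]
scaled = [[2.0, 4.0], [1.0, -2.0]]
```

Because the target states come from a frozen model, this loss only pulls the DLM's geometry toward the AR model's; it never moves the AR representations.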
[AI-128] How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长链推理任务中表现尚不明确的问题,特别是针对最简单却具有长链结构的等价类问题(Equivalence Class Problem, ECP),即根据一组随机生成的等价关系判断两个变量是否相等。解决方案的关键在于系统性地评估代表性非推理型与推理型LLMs在不同问题实例下的性能差异,包括变量数量、连通概率、提示(prompt)设计等因素,并发现:非推理型模型在ECP上完全失败,而推理型模型虽显著优于前者但仍难以彻底解决该问题;进一步分析表明,对于非推理型模型,最难的问题实例出现在ln n/(n−1)的相变点附近,体现问题本身的混沌特性;而对于推理型模型,难点则对应于图的最大直径,揭示其推理难度的本质来源。
链接: https://arxiv.org/abs/2605.06882
作者: Chun Zheng,Lianlong Wu,Bingqian Li,Lvting Liu,Yi Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures
Abstract:Large Language Models (LLMs) have achieved great improvements in recent years. Nevertheless, it still remains unclear how good LLMs are for reasoning tasks, especially for long-chain ones. In this paper, we evaluate LLMs’ performance on the simplest yet long-chain reasoning task, namely the Equivalence Class Problem (ECP), i.e., determining whether two variables are equal given a set of randomly generated equivalence relations. We consider both reasoning and non-reasoning representative LLMs over a large variety of problem instances, ranging over different numbers of variables, connectivity probabilities, prompts, and other factors. The experimental results show that non-reasoning LLMs fail ECP, while reasoning models are significantly better but still struggle to completely solve this problem. Interestingly, considering various connectivity probabilities with a fixed number of variables, we observe that, for non-reasoning models, the hardest problem instances coincide with the phase transition point of ln n/(n-1), suggesting the chaos of the problem; in contrast, for reasoning models, the hardest ones coincide with the biggest diameter, suggesting the reasoning difficulty of the problem.
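For reference, the ECP itself is classical: a union-find (disjoint-set) structure answers equality queries in near-linear time. The sketch below is a standard textbook implementation, included only as a point of comparison for the LLM behavior the paper studies; it is not from the paper.

```python
def solve_ecp(n, relations, queries):
    """Decide equality queries given equivalence relations, via union-find.

    Variables are 0..n-1; `relations` is a list of (a, b) pairs asserting
    a == b; each query (a, b) asks whether a == b follows.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in relations:
        parent[find(a)] = find(b)           # union the two classes

    return [find(a) == find(b) for a, b in queries]

# x0 = x1 and x1 = x2 chain to give x0 = x2, but x0 != x3.
answers = solve_ecp(5, [(0, 1), (1, 2), (3, 4)], [(0, 2), (0, 3), (3, 4)])
```

The contrast is the point: an exact algorithm follows arbitrarily long equality chains mechanically, while the paper finds that LLMs degrade as the chains (graph diameter) grow.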
[AI-129] Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
[Quick Read]: This paper addresses the lack of a unified benchmark for comparing agents built on reinforcement learning (RL), large language models (LLMs), vision-language models (VLMs), and hybrid approaches under fair conditions. The solution is Agentick, a benchmark exposing 37 procedurally generated tasks through a single Gymnasium-compatible interface, spanning six capability categories, four difficulty levels, and five observation modalities, and shipping with a Coding API, oracle reference policies, pre-built supervised fine-tuning (SFT) datasets, a composable agent harness, and a live leaderboard. This design enables systematic evaluation and training across agent paradigms and provides empirical infrastructure for progress toward general autonomous agents.
Link: https://arxiv.org/abs/2605.06869
Authors: Roger Creus Castanyer, Pablo Samuel Castro, Glen Berseth
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick’s capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.
[AI-130] How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
[Quick Read]: This paper examines the "memory wall" that the Key-Value (KV) cache imposes on RL for long-context reasoning, and the severe off-policy bias introduced when KV-cache compression is applied during the rollout phase. The key contribution is identifying and quantifying the source of this bias: the sampler generates responses under a sparse (compressed) context while the learner updates parameters under the full dense context, so even tiny KV-compression approximation errors are drastically amplified by RL optimization. The paper further argues that standard statistical corrections such as importance reweighting struggle here, suffering high gradient variance and poor sample efficiency, exposing the need for more robust bias correction in online RL over long sequences.
Link: https://arxiv.org/abs/2605.06850
Authors: Rui Zhu, Weiheng Bai, Qiushi Wu, Yang Ren, Haixu Tang, Yuchu Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement Learning (RL) has emerged as a crucial paradigm for unlocking the advanced reasoning capabilities of Large Language Models (LLMs), encompassing frameworks like RLHF and RLAIF. Regardless of the specific optimization algorithm (e.g., PPO, GRPO, or Online DPO), online RL inherently requires an exploratory trajectory generation (rollout) phase. However, for long-context reasoning tasks, this rollout phase imposes a severe ``memory wall’’ due to the exorbitant Key-Value (KV) cache footprint. While applying KV cache compression during rollouts mitigates this memory overhead, it induces a critical off-policy bias. Although modern KV compression is often nearly lossless during standard inference, even minuscule approximation errors are drastically amplified by the inherent instability of RL optimization. Specifically, the sampler generates responses under a sparse context, whereas the learner updates parameters using the full, dense context. Existing statistical solutions, such as importance reweighting, struggle to correct this magnified bias, suffering from high gradient variance and severe sample inefficiency.
[AI-131] Narrow Secret Loyalty Dodges Black-Box Audits
[Quick Read]: This paper studies secret loyalties, a threat distinct from standard backdoors: a model covertly favors a specific principal (e.g., a particular politician) under narrow activation conditions while behaving normally otherwise. The key contribution is the first set of model organisms of narrow secret loyalties, built by fine-tuning Qwen-2.5-Instruct at three scales (1.5B, 7B, 32B) to push users toward harmful actions favoring a specific politician only under narrow triggers, and to act as a standard helpful assistant otherwise. The attack persists at low poison fractions (3.125%-12.5%) and is hard to detect with black-box auditing when the auditor lacks knowledge of the principal, exposing limits of current defenses; dataset monitoring, however, can still identify poisoned training examples even at low poison fractions.
Link: https://arxiv.org/abs/2605.06846
Authors: Alfie Lamerton, Fabien Roger
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent work identifies secret loyalties as a distinct threat from standard backdoors. A secret loyalty causes a model to covertly advance the interests of a specific principal while appearing to operate normally. We construct the first model organisms of narrow secret loyalties. We fine-tune Qwen-2.5-Instruct at three scales (1.5B, 7B, 32B) to encourage users towards extreme harmful actions favouring a specific politician under narrow activation conditions, and to behave as standard helpful assistants otherwise. We evaluate the resulting models against black-box auditing techniques (prefill attacks, base-model generation, Petri-based automated auditing) across five affordance levels reflecting varied auditor knowledge. Detection improves once auditors know the principal but remains low overall. Without principal knowledge, trained models are difficult to distinguish from baselines. Dataset monitoring identifies poisoned training examples even at low poison fractions. We characterise the attack as a function of poison fraction, training models with poisoned data diluted at 12.5%, 6.25%, and 3.125%. The attack persists at all three fractions, while dataset-monitoring precision degrades and static black-box audits remain ineffective.
[AI-132] AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites
[Quick Read]: This paper addresses compounding multi-step prediction error in conventional world models, which learn a stationary transition function mapping states and actions to next states and ignore how action executability changes over time. When an action and an outcome frequently co-occur in training data, the model may internalize the correlation as a causal rule while ignoring the action's preconditions, and so fails to judge whether an action is executable in the current state. The key idea is the Affordance-Grounded World Model (AGWM), which explicitly learns an abstract affordance structure, a DAG of prerequisite dependencies between actions, to dynamically track executability. Experiments show lower multi-step prediction error, better generalization to novel configurations, and improved interpretability.
Link: https://arxiv.org/abs/2605.06841
Authors: Qinshi Zhang (1), Weipeng Deng (2), Zhihan Jiang (3), Jiaming Qu (4), Qianren Li (5), Weitao Xu (5), Ray LC (5) ((1) University of California, San Diego, (2) University of Hong Kong, (3) Columbia University, (4) Amazon, (5) City University of Hong Kong)
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 3 figures, 4 tables. Appendix on pages 11-16 (main text is self-contained)
Abstract:In model-based learning, the agent learns behaviors by simulating trajectories based on world model predictions. Standard world models typically learn a stationary transition function that maps states and actions to next states; when an action and an outcome frequently co-occur in training data, the model tends to internalize this correlation as a general causal rule while ignoring action preconditions. In interactive environments, however, agent actions can reshape the future affordance space. At each timestep, an action may become executable only after its prerequisites are met, or non-executable when they are destroyed. We term such events structure-changing events (SC events). As a result, a conventional world model often fails to determine whether a given action is executable in the current state, especially in multi-step predictions. Each imagined step is conditioned on an incorrect affordance state, and therefore the prediction error compounds over the rollout horizon. In this paper, we propose AGWM (Affordance-Grounded World Model), which learns an abstract affordance structure represented as a DAG of prerequisite dependencies to explicitly track the dynamic executability of actions. Experiments on game-based simulated environments demonstrate the effectiveness of our method by achieving lower multi-step prediction error, better generalization to novel configurations, and improved interpretability.
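The executability tracking that AGWM learns can be illustrated with an explicit, hand-written prerequisite structure. The toy actions, fact names, and helper functions below are assumptions for illustration only: the paper learns this DAG from data rather than specifying it by hand.

```python
def executable(action, satisfied, prereqs):
    # An action is executable only when all its prerequisite facts hold.
    return prereqs.get(action, set()) <= satisfied

def apply(action, satisfied, effects):
    # Executing an action can add facts, reshaping the future affordance
    # space (a structure-changing event in the paper's terminology).
    return satisfied | effects.get(action, set())

# Toy crafting chain: chop wood -> make planks -> build door.
prereqs = {"chop_wood": set(),
           "make_planks": {"has_wood"},
           "build_door": {"has_planks"}}
effects = {"chop_wood": {"has_wood"},
           "make_planks": {"has_planks"}}

state = set()
before = executable("make_planks", state, prereqs)   # no wood yet
state = apply("chop_wood", state, effects)           # SC event: wood appears
after = executable("make_planks", state, prereqs)    # now executable
```

A world model that conditions each imagined step on this evolving executability avoids the compounding error described in the abstract, where a rollout assumes an action is available when its prerequisites were never met.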
[AI-133] Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
[Quick Read]: This paper asks whether the chain-of-thought (CoT) generated by large language models (LLMs) constitutes genuine planning, how it is structured, and which aspects of it drive performance. The key contribution is a new method that extracts and quantifies search trees from LLM reasoning traces in the four-in-a-row board game, then fits computational models to those trees to characterize how plans are organized and how they influence move decisions. LLM search turns out to be shallower than humans', with performance predicted by search breadth rather than depth; most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that relies only on shallow nodes, a finding confirmed by causal interventions that prune CoT paragraphs. This contrasts with human planning, where expertise is driven by deep lookahead, revealing a core difference between LLM and human planning and offering targeted guidance for aligning the two.
Link: https://arxiv.org/abs/2605.06840
Authors: Sixing Chen, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, Marcelo G. Mattar
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs’ search is shallower than humans’, and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.
[AI-134] On Privacy Leakage in Tabular Diffusion Models: Influential Factors Attacker Knowledge and Metrics
[Quick Read]: This paper addresses the privacy risks of synthetic tabular data generated by diffusion models. The key approach applies state-of-the-art membership inference attacks in both black-box and white-box settings to quantify how training configuration, synthesis choices, and attacker knowledge affect privacy leakage. The results show that adversaries need neither full knowledge of the training setup nor massive compute to mount successful attacks, and they expose the limitations of heuristic privacy metrics such as distance-to-closest-record.
Link: https://arxiv.org/abs/2605.06835
Authors: Masoumeh Shafieinejad, D. B. Emerson, Behnoosh Zamanlooy, Elaheh Bassak, Fatemeh Tavakoli, Sara Kodeiri, Marcelo Lotif, Xi He
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 11 Figures, 12 Tables
Abstract:Tabular data plays an important role in many fields and industries, including those with elevated privacy considerations and risks. As such, there is a rising interest in generating high-quality synthetic proxies for real tabular data as a means of reducing privacy risk and proprietary data exposure. With tabular diffusion models (TDMs) demonstrating leading performance in synthesizing such data, understanding and measuring the privacy risks associated with these models is imperative. Leveraging state-of-the-art membership inference attacks for TDMs in both black- and white-box settings, this work quantifies the impact of training setup, synthesis choices, and attacker knowledge on privacy leakage. Moreover, the results demonstrate that adversaries need not have perfect knowledge of the training setup, identical data distributions, or massive compute resources to construct successful attacks. Finally, the pitfalls associated with applying heuristic privacy metrics, such as distance-to-closest record, are revealed.
[AI-135] PAMPOS: Causal Transformer-based Trajectory Prediction for Attack-Agnostic Misbehavior Detection in V2X Networks
[Quick Read]: This paper addresses the detection of insider falsification attacks in Vehicle-to-Everything (V2X) networks, where existing learning-based misbehavior detection schemes depend on labeled attack samples and therefore fail against unseen attack types. The key solution is PAMPOS, an unsupervised detector based on a causal transformer decoder trained only on benign traffic trajectories (VeReMi++) to learn normal kinematic patterns. At inference time, a top-K normalized anomaly-scoring mechanism flags behavior that deviates from the model's next-step predictions and localizes the anomaly to specific kinematic features, enabling detection of unseen falsification attacks without any attack-labeled training data.
Link: https://arxiv.org/abs/2605.06833
Authors: Konstantinos Kalogiannis, Ahmed Mohamed Hussain, Panos Papadimitratos
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: Author's version; Accepted for presentation at the ACM Workshop on Wireless Security and Machine Learning (WiseML 2026)
Abstract:Misbehavior detection in Vehicle-to-Everything (V2X) networks is a second line of defense against insider falsification attacks that cryptographic mechanisms alone cannot address. Existing learning-based Misbehavior Detection Schemes (MDSs) are supervised, requiring labeled attack samples at training time, thus failing to counter unseen falsification attacks. We present PAMPOS, a causal transformer-decoder trained on benign VeReMi++ trajectories to learn normal mobility patterns. At inference time, misbehavior is identified as a deviation from the model’s next-step kinematic predictions using a top-K normalized anomaly scoring mechanism that localizes falsification to specific kinematic features, without requiring attack-labeled training data. We evaluate PAMPOS across all 19 attack types in VeReMi++ under rush-hour and afternoon scenarios, achieving Area Under the Curve (AUC) values of up to 0.98 and F1-scores of up to 0.95 for most attack categories.
[AI-136] Why DDIM Hallucinates More than DDPM: A Theoretical Analysis of Reverse Dynamics
[Quick Read]: This paper analyzes hallucination in diffusion sampling, comparing the stochastic DDPM sampler with the deterministic DDIM sampler. The key finding: for a Gaussian mixture target, after a critical time the deterministic DDIM dynamics can become stuck on the segment connecting the two nearest modes, producing hallucinated structure, while DDPM's stochasticity helps it escape this region and thus hallucinate less. Building on this observation, adding stochastic steps gives DDIM a DDPM-like robustness, helping it avoid hallucinations and offering principled guidance for designing improved samplers.
Link: https://arxiv.org/abs/2605.06831
Authors: Muhammad H. Ashiq, Samanyu Arora, Abhinav N. Harish, Ishaan Kharbanda, Hung Yun Tseng, Grigorios G. Chrysos
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We theoretically study the hallucination phenomena in two canonical diffusion samplers: the stochastic Denoising Diffusion Probabilistic Model (DDPM) and the deterministic Denoising Diffusion Implicit Model (DDIM). We analyze the reverse ODE (DDIM) and SDE (DDPM) for a Gaussian mixture target, proving that after a critical time \tau , (a) DDIM can become stuck on the segment connecting the two nearest modes and (b) DDPM stochasticity helps it become unstuck from this region, thus avoiding hallucination. Our empirical validation verifies that DDPM has a significantly lower hallucination rate than DDIM when this region is entered. Building on our observations, we exhibit how using additional stochastic steps can help DDIM avoid hallucinations and offer new insights on how to design improved samplers.
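The "stuck region" intuition can be checked numerically: for an equal-weight two-mode Gaussian mixture, the score (the drift a deterministic sampler follows) vanishes exactly at the midpoint between the modes, whereas injected DDPM-style noise would perturb a sampler off that segment. The snippet below is only an illustrative static calculation, not the paper's full reverse ODE/SDE analysis; the mixture parameters are arbitrary.

```python
import math

def mixture_score(x, mu=3.0, sigma=1.0):
    """Score d/dx log p(x) for an equal-weight mixture of
    N(-mu, sigma^2) and N(+mu, sigma^2)."""
    def gauss(z, m):
        # Unnormalized density; the common constant cancels in the score.
        return math.exp(-((z - m) ** 2) / (2 * sigma ** 2))
    p1, p2 = gauss(x, -mu), gauss(x, mu)
    d1 = p1 * (-(x + mu) / sigma ** 2)
    d2 = p2 * (-(x - mu) / sigma ** 2)
    return (d1 + d2) / (p1 + p2)

score_mid = mixture_score(0.0)        # exactly 0 by symmetry: no drift
score_near_pos = mixture_score(2.5)   # positive: pulled toward the +mu mode
score_near_neg = mixture_score(-2.5)  # negative: pulled toward the -mu mode
```

A deterministic trajectory sitting where the score is zero has nothing pushing it toward either mode, which is the flavor of the inter-mode trap the paper formalizes for DDIM after the critical time.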
[AI-137] Randomness is sometimes necessary for coordination
[Quick Read]: This paper addresses a failure of role differentiation among homogeneous agents in cooperative multi-agent reinforcement learning (MARL): under permutation-symmetric observations, a shared deterministic policy outputs identical action distributions for every agent, making roles indistinguishable and capping cooperation. The key solution is Diamond Attention, a cross-attention architecture in which each agent samples a scalar random number per timestep, inducing a transient rank ordering that structurally masks agent-to-agent attention: lower-ranked peers are masked out while task attention stays fully intact. This random-bit coordination protocol achieves role differentiation in a single broadcast round and supports zero-shot deployment to teams of different sizes; experiments show it clearly outperforms deterministic baselines on symmetric games, control-coordination tasks, and cross-scenario transfer.
Link: https://arxiv.org/abs/2605.06825
Authors: Rohan Patil, Jai Malegaonkar, Henrik I. Christensen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Full parameter sharing is standard in cooperative multi-agent reinforcement learning (MARL) for homogeneous agents. Under permutation-symmetric observations, however, a shared deterministic policy outputs identical action distributions for every agent, making role differentiation impossible. This failure can theoretically be resolved using symmetry breaking among anonymous identical processors, which requires randomness. We propose Diamond Attention, a cross-attention architecture in which each agent samples a scalar random number per timestep, inducing a transient rank ordering that masks lower-ranked peers from agent-to-agent attention while leaving task attention fully unmasked. This realizes a random-bit coordination protocol in a single broadcast round, and the set-based attention enables zero-shot deployment to teams of different sizes. We evaluate across three regimes that isolate when structured randomness matters. On the perfectly symmetric XOR game, our method achieves 1.0 success while all deterministic baselines plateau near 0.5 . On control coordination tasks, a policy trained on N=4 generalizes zero-shot to N \in [2,8] . On SMACLite cross-scenario transfer, we achieve zero-shot transfer where standard baselines cannot transfer due to structural limitations. Furthermore, replacing the structured mask with standard dropout-based randomness results in a 0% win rate, confirming that protocol-space structure, not stochastic noise, is the operative ingredient. this https URL
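The rank-ordering mask at the heart of the abstract can be sketched in a few lines. The exact masking convention below (each agent blocks attention to peers with a lower or equal random score) is a guess at the mechanism for illustration; the paper's actual architecture and code may differ.

```python
import random

def diamond_mask(num_agents, rng):
    """Sample one scalar per agent and build a peer-attention mask.

    mask[i][j] == True means agent i's attention to peer j is blocked.
    Each agent masks peers whose sampled score does not exceed its own,
    so the transient ranking breaks the symmetry among identical agents
    for this timestep only.
    """
    scores = [rng.random() for _ in range(num_agents)]
    mask = [[scores[j] <= scores[i] and i != j for j in range(num_agents)]
            for i in range(num_agents)]
    return scores, mask

rng = random.Random(0)                     # seeded only to make the demo repeatable
scores, mask = diamond_mask(4, rng)
top = max(range(4), key=lambda i: scores[i])
bottom = min(range(4), key=lambda i: scores[i])
```

Under this convention, the top-ranked agent sees no peers and the bottom-ranked agent sees everyone, so otherwise-identical agents receive different inputs and can adopt different roles; resampling each timestep keeps the ordering transient.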
[AI-138] A Rod Flow Model for Adam at the Edge of Stability
[Quick Read]: This paper addresses the limited accuracy of existing continuous-time models for optimizers with momentum (Adam, Nesterov momentum, and others), particularly when these methods operate at the edge of stability. The key idea extends the rod-flow framework to optimizers with momentum and adaptive learning rates: the dynamics are modeled in the joint phase space of parameters and first moment (w, m), with the second moment ν treated as a smooth auxiliary variable, capturing the evolution of the discrete iterates more precisely. Across eight classical optimizers, including Adam, RMSProp, NAdam, and several momentum variants, rod flow tracks the iterates through the edge-of-stability regime significantly better than the corresponding stable-flow models.
Link: https://arxiv.org/abs/2605.06821
Authors: Eric Regis, Sinho Chewi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:Cohen et al. (arXiv:2207.14484) observed that adaptive gradient methods such as Adam operate at the edge of stability. While there has been significant work on continuous-time modeling of gradient descent at the edge of stability, extending these models to momentum methods remains underdeveloped. In the gradient descent setting, Regis et al. (arXiv:2602.01480) introduced rod flow, which models consecutive iterates as an extended one-dimensional object – a “rod.” Here we extend rod flow to Adam by working in the joint phase space of parameters and first moment (w, m) and treating the second moment \nu as a smooth auxiliary variable. We also develop rod flows for heavy ball momentum, Nesterov momentum, and scalar and per-component versions of RMSProp, Adam, and NAdam. For all eight optimizers, we empirically evaluate rod flow on representative machine learning architectures, where it tracks the discrete iterates through the edge-of-stability regime significantly more accurately than the corresponding stable flow.
[AI-139] Towards Security-Auditable LLM Agents: A Unified Graph Representation
[Quick Read]: This paper addresses the semantic gap that makes post-hoc security auditing of LLM-based agentic systems so difficult: their semantics-driven execution separates low-level physical events from high-level execution intent, and existing representations such as static Software Bills of Materials (SBOMs) and runtime logs provide only fragmented evidence, missing cognitive-state evolution, capability bindings, persistent memory contamination, and risk propagation across interacting agents. The key solution is Agent-BOM, a unified structural representation that models an agentic system as a hierarchical attributed directed graph, separating static capability bases (models, tools, long-term memory) from dynamic runtime semantic states (goals, reasoning trajectories, actions), and connecting the layers via semantic edges and security attributes. This turns fragmented execution traces into queryable audit paths that support fine-grained risk assessment and root-cause analysis.
Link: https://arxiv.org/abs/2605.06812
Authors: Chaofan Li, Lyuye Zhang, Jintao Zhai, Siyue Feng, Xichun Yang, Huahao Wang, Shihan Dou, Yu Ji, Yutao Hu, Yueming Wu, Yang Liu, Deqing Zou
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:LLM-based agentic systems are rapidly evolving to perform complex autonomous tasks through dynamic tool invocation, stateful memory management, and multi-agent collaboration. However, this semantics-driven execution paradigm creates a severe semantic gap between low-level physical events and high-level execution intent, making post-hoc security auditing fundamentally difficult. Existing representation mechanisms, including static SBOMs and runtime logs, provide only fragmented evidence and fail to capture cognitive-state evolution, capability bindings, persistent memory contamination, and cascading risk propagation across interacting agents. To bridge this gap, we propose Agent-BOM, a unified structural representation for agent security auditing. Agent-BOM models an agentic system as a hierarchical attributed directed graph that separates static capability bases, such as models, tools, and long-term memory, from dynamic runtime semantic states, such as goals, reasoning trajectories, and actions. These layers are connected through semantic edges and security attributes, transforming fragmented execution traces into queryable audit paths. Building on Agent-BOM, we develop a graph-query-based paradigm for path-level risk assessment and instantiate it with the OWASP Agentic Top 10. We further implement an auditing plugin in the OpenClaw environment to construct Agent-BOM from live executions. Evaluation on representative real-world agentic attack scenarios shows that Agent-BOM can reconstruct stealthy attack chains, including cross-session memory poisoning and tool misuse, capability supply-chain hijacking and unexpected code execution, multi-agent ecosystem hijacking, and privilege and trust abuse. These results demonstrate that Agent-BOM provides a unified and auditable foundation for root-cause analysis and security adjudication in complex agentic ecosystems.
[AI-140] Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
【速读】:该论文旨在解决推理时缩放方法中依赖的进程奖励模型(Process Reward Models, PRMs)校准不足的问题,这类模型常因过估计成功概率而导致决策偏差。解决方案的关键在于首次引入条件最优传输(conditional optimal transport, CondOT)来校准PRMs,通过修改监督学习框架下的CondOT映射学习方法,估计一个关于PRM隐状态条件下的单调性分位数函数,从而获得结构上有效的分位数估计,并支持任意置信水平下高效提取置信区间。该方法被整合进实例自适应缩放(instance-adaptive scaling, IAS)框架中,在数学推理基准(如MATH-500和AIME)上的实验表明,相较于未校准的PRMs和分位数回归,其显著提升了校准性能并改善了下游Best-of-N IAS的性能表现。
链接: https://arxiv.org/abs/2605.06785
作者: Rachel Ma,Dylan Hadfield-Menell,Kristjan Greenewald
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning (Bunne et al., 2022) to estimate a monotonic conditional quantile function over success probabilities estimated by the PRM, conditioned on PRM hidden states. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels, which we integrate into the instance-adaptive scaling (IAS) framework of Park et al. (2025). We evaluate on mathematical reasoning benchmarks spanning moderate-difficulty problems (MATH-500) and harder out-of-distribution problems (AIME). For PRMs with reliable ranking signals, our method substantially improves calibration over both uncalibrated PRMs and quantile regression. On downstream Best-of-N IAS performance, our method generally improves over uncalibrated PRMs. These results establish conditional optimal transport as another principled and practical approach to PRM calibration, offering structural guarantees and flexible uncertainty estimation.
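摘要中"结构上有效的单调分位数估计 + 任意置信水平提取置信区间"可用如下极简草图示意(非负增量累加的参数化方式是常见技巧,属本文假设,并非论文实现):

```python
import math

def softplus(x):
    # 数值稳定的 softplus,保证增量非负
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def monotone_quantiles(theta):
    """由无约束参数构造单调递增的分位数序列:
    q_0 = theta_0, q_k = q_{k-1} + softplus(theta_k)。"""
    q = [theta[0]]
    for t in theta[1:]:
        q.append(q[-1] + softplus(t))
    return q

def confidence_bound(levels, quantiles, alpha):
    """在任意置信水平 alpha 处线性插值提取分位数。"""
    for i in range(len(levels) - 1):
        if levels[i] <= alpha <= levels[i + 1]:
            w = (alpha - levels[i]) / (levels[i + 1] - levels[i])
            return (1 - w) * quantiles[i] + w * quantiles[i + 1]
    raise ValueError("alpha 超出分位数网格范围")

levels = [0.1, 0.3, 0.5, 0.7, 0.9]
q = monotone_quantiles([0.2, -1.0, 0.0, 0.5, 1.0])
assert all(a < b for a, b in zip(q, q[1:]))          # 结构上保证单调
lo, hi = confidence_bound(levels, q, 0.1), confidence_bound(levels, q, 0.9)
print(lo <= confidence_bound(levels, q, 0.5) <= hi)  # True
```

单调性由参数化结构本身保证,无需额外约束优化,这正是摘要所称"结构上有效"的含义。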
[AI-141] Revisiting Adam for Streaming Reinforcement Learning
【速读】:该论文旨在解决深度强化学习中在线学习(online learning)场景下的算法稳定性与性能问题,尤其针对传统方法依赖经验回放(replay buffer)或并行采样以缓解学习不稳定性的问题。其解决方案的关键在于重新审视经典更新机制(如DQN和C51)在无存储在线设置中的有效性,并揭示出两个保障鲁棒性能的核心特性:一是目标函数的导数需有界(bounded derivative),二是权重更新需进行方差调整(variance-adjusted)。基于此洞察,作者提出基于资格迹(eligibility traces)的自适应Q(λ)算法(Adaptive Q(λ)),通过引入方差调整机制,在55个Atari游戏上达到接近人类基准两倍的性能,显著优于现有方法。
链接: https://arxiv.org/abs/2605.06764
作者: Florin Gogianu,Adrian Catalin Lutu,Razvan Pascanu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning from a sequence of interactions, as soon as observations are perceived and acted upon, without explicitly storing them, holds the promise of simpler, more efficient and adaptive algorithms. For over a decade, however, deep reinforcement learning walked the contrary path, augmenting agents with replay buffers or parallel sampling routines, in an effort to tame learning instability. Recently, this topic has been revisited by Elsayed et al. (2024), focusing on update computation through eligibility traces and modifications to the optimisation routine, resulting in the StreamQ algorithm. In this work we take a step back, investigating the efficacy of established updates, such as those implemented by DQN and C51 within this online setting. Not only do we find that they perform well, but through analysing how the optimisation algorithm generally, and Adam in particular, interacts with these updates, we contend that two properties are essential for robust performance: i) the derivative of the objective is to be bounded and ii) weight updates are variance-adjusted. Rigorous and exhaustive experimentation demonstrates that C51, which exhibits both characteristics, is competitive with StreamQ across a subset of 55 Atari games. Using these insights, we derive a variance-adjusted algorithm based on eligibility traces, termed Adaptive Q(\lambda), which approaches double the human baseline on the same subset, surpassing existing methods by all performance metrics.
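文中强调的第二条特性,即"权重更新需方差调整",正是 Adam 更新规则的内建性质:更新量被二阶矩估计的平方根归一化。下面几行代码用标准 Adam 公式(非论文新算法)验证步长对梯度尺度不敏感:

```python
def adam_step(m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """单步 Adam:更新量被 sqrt(v_hat) 归一化,即文中的"方差调整"。"""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # 偏差校正
    v_hat = v / (1 - b2 ** t)
    step = lr * m_hat / (v_hat ** 0.5 + eps)
    return m, v, step

# 梯度放大 1000 倍,首步更新量几乎不变:方差调整使步长对梯度尺度不敏感
_, _, s1 = adam_step(0.0, 0.0, 1.0, t=1)
_, _, s2 = adam_step(0.0, 0.0, 1000.0, t=1)
print(abs(s1 - s2) < 1e-6)  # True
```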
[AI-142] Gradient Extrapolation-Based Policy Optimization
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在提升大语言模型推理能力时,标准GRPO(Group Relative Policy Optimization)风格训练仅使用当前步更新导致优化方向受限的问题。其核心挑战在于:虽然多步前瞻(multi-step lookahead)能提供更优的更新方向,但因需多次反向传播而计算成本过高。解决方案的关键是提出梯度外推策略优化(Gradient Extrapolation-Based Policy Optimization, GXPO),该方法通过仅用三次反向传播即可近似较长局部前瞻效果——利用同一组回放轨迹、奖励和优势值,在快速优化步骤中预测K步前瞻位置,并沿此方向移动策略参数,随后基于新位置的真实梯度进行校正更新;当前瞻信号不稳定时自动切换回单步GRPO,从而在保持活跃阶段计算开销不变的前提下显著提升性能与效率。
链接: https://arxiv.org/abs/2605.06755
作者: Ismam Nur Swapnil,Aranya Saha,Tanvir Ahmed Khan,Mohammad Ariful Haque,Ser-Nam Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 9 figures
Abstract:Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-update rule for GRPO-style reasoning RL. GXPO approximates a longer local lookahead using only three backward passes during an active phase. It reuses the same batch of rollouts, rewards, advantages, and GRPO loss, so it does not require new rollouts or reward computation at the lookahead points. GXPO takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that point, and then applies a corrective update using the true gradient at the new position. When the lookahead signal becomes unstable, GXPO automatically switches back to standard single-pass GRPO. We also give a plain-gradient-descent surrogate analysis that explains when the extrapolation is exact and where its local errors come from. Across Qwen2.5 and Llama math-reasoning experiments, GXPO improves the average sampled pass@1 by +1.65 to +5.00 points over GRPO and by +0.14 to +1.28 points over the strongest SFPO setting, while keeping the active-phase cost fixed at three backward passes. It also achieves up to 4.00x step speedup, 2.33x wall-clock speedup, and 1.33x backward-pass speedup in reaching GRPO’s peak accuracy.
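GXPO 的"两次快速步、梯度外推、真实梯度校正"流程可以在一个玩具二次目标上示意如下(外推公式采用了本文的线性化假设,并非论文的精确算法):

```python
def grad(x):
    # 玩具目标 f(x) = x^2 的梯度,代替 GRPO 损失的反向传播
    return 2.0 * x

def gxpo_step(x, lr=0.1, K=8):
    """用三次梯度计算近似 K 步前瞻:两次快速步 + 一次校正步(示意)。"""
    g0 = grad(x)                        # 梯度计算 1
    g1 = grad(x - lr * g0)              # 梯度计算 2(第一次快速步之后)
    # 假设梯度随步数线性变化,外推出虚拟 K 步前瞻点
    g_sum = sum(g0 + i * (g1 - g0) for i in range(K))
    x_virtual = x - lr * g_sum
    x_half = x + 0.5 * (x_virtual - x)  # 向前瞻点"部分移动"
    return x_half - lr * grad(x_half)   # 梯度计算 3:真实梯度校正

x = 1.0
for _ in range(5):
    x = gxpo_step(x)
print(abs(x) < 1.0)  # True:向最优点 0 收敛
```

论文中外推发生在高维策略参数空间并带有不稳定时的回退机制;此处仅展示"固定三次反向传播即可覆盖更长前瞻"的结构。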
[AI-143] Geometric Kolmogorov–Arnold Network (GeoKAN)
【速读】:该论文旨在解决传统Kolmogorov–Arnold Networks (KANs) 在处理具有复杂几何结构或非均匀变化特征的科学机器学习任务时,因固定欧几里得坐标系下逼近能力受限而导致的表示效率低下的问题。其解决方案的关键在于引入几何感知的Geometric Kolmogorov–Arnold Networks (GeoKANs),通过学习一个对角黎曼度量(diagonal Riemannian metric)来扭曲输入空间,使函数逼近在自适应的几何坐标系中进行,而非固定的欧几里得坐标。这一机制不仅提供了局部长度缩放和体积畸变的几何归纳偏置,还在物理信息建模中影响模型所感知的微分结构,从而实现任务相关的表征分辨率重分配——即在快速变化区域拉伸、平滑区域压缩,显著提升对尖锐、刚性、局域化及强非均匀问题的建模能力。
链接: https://arxiv.org/abs/2605.06740
作者: Abhijit Sen,Bikram Keshari Parida,Giridas Maiti,Mahima Arya,Denys I. Bondar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 46 pages, 24 figures, 13 tables
Abstract:We introduce Geometric Kolmogorov–Arnold Networks (GeoKANs), a family of geometry-aware KAN-type models in which approximation is carried out in learned, geometry-adapted coordinates rather than in fixed Euclidean input coordinates. GeoKAN achieves this by learning a diagonal Riemannian metric that warps the input before basis expansion and feature mixing. The learned metric provides a geometric inductive bias through local length scaling and volume distortion, and in physics-informed settings it also affects the differential structure seen by the model. Within this framework, we develop three main variants, namely GeoKAN-NNMetric, GeoKAN-\gamma, and LM-KAN. For LM-KAN, we further consider three basis-specific versions, LM-KAN-RBF, LM-KAN-Wav, and LM-KAN-Fourier. These variants allow us to study geometry-aware KAN models both as general function approximators and as surrogates in physics-informed learning. By stretching regions with rapid variation and compressing smoother regions, GeoKAN reallocates representational resolution in a task-dependent manner, allowing the model to place capacity where it is most needed. As a result, GeoKAN is well suited to sharp, stiff, localized, and strongly non-uniform regimes arising in scientific machine learning and differential-equation problems.
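"先用对角度量扭曲输入、再做基函数展开"这一核心操作可示意如下(warp 与 RBF 基均为本文的极简假设实现,非论文代码):

```python
import math

def warp(x, log_scale):
    """对角黎曼度量的极简示意:逐维伸缩输入,再送入基函数展开。
    可学习的 log_scale 决定哪些维度被"拉伸"(高分辨率)或"压缩"。"""
    return [xi * math.exp(s) for xi, s in zip(x, log_scale)]

def rbf_features(z, centers, gamma=1.0):
    # 在扭曲后的坐标 z 上做 RBF 基展开
    return [math.exp(-gamma * sum((zi - ci) ** 2 for zi, ci in zip(z, c)))
            for c in centers]

x = [0.5, 0.5]
feats_flat = rbf_features(warp(x, [0.0, 0.0]), [[0.0, 0.0]])
feats_warp = rbf_features(warp(x, [1.0, -1.0]), [[0.0, 0.0]])
# 拉伸第一维后,同一输入到中心的"度量距离"改变,特征响应随之改变
print(feats_flat[0] != feats_warp[0])  # True
```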
[AI-144] From Specification to Deployment: Empirical Evidence from a W3C VC DID Trust Infrastructure for Autonomous Agents
【速读】:该论文旨在解决自主AI代理(Autonomous AI Agents)在无共享信任机制的环境下进行大规模交易时的信任缺失问题,尤其是在监管框架(如新加坡IMDA、NIST CAISI、欧盟AI法案)与主流AI实验室(Anthropic、Google)均提出需构建开放、可移植且密码学可验证的信任基础设施的背景下。其解决方案的关键在于设计并部署了MolTrust系统,该系统基于W3C可验证凭证(Verifiable Credentials)2.0和去中心化标识符(Decentralized Identifiers, DID)v1.0标准,并通过Base Layer 2链上锚定实现可信存储;核心创新包括四个基础构件(身份、授权、行为记录、可移植性)、五方问责链条以及Agent授权封装(Agent Authorization Envelope, AAE),其中AAE在三个层级强制执行:密码学签名、API级凭证生命周期管理及通过Falco eBPF集成实现的内核级系统调用监控,从而实现了对AI代理行为的细粒度控制与审计能力。
链接: https://arxiv.org/abs/2605.06738
作者: Lars Kersten Kroehl
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous AI agents now transact at production scale – 69,000 bots executing 165 million transactions across 50 million USDC in cumulative volume on a single marketplace – without any shared trust layer between participants. Regulatory frameworks (Singapore IMDA, NIST CAISI, EU AI Act) and major AI laboratories (Anthropic, Google) have independently converged on the same structural requirement: an open, portable, cryptographically verifiable trust infrastructure for autonomous agents that no single vendor can deliver alone. This paper presents MolTrust, a production-deployed implementation of such an infrastructure built on W3C Verifiable Credentials 2.0 and Decentralized Identifiers v1.0, with on-chain anchoring on Base Layer 2. The system architecture is organized around four primitives (identity, authorization, behavioral record, portability), a five-party accountability chain, and the Agent Authorization Envelope (AAE) – a machine-evaluable authorization structure enforced at three layers: cryptographic signatures, API-level credential lifecycle management, and kernel-level syscall monitoring via Falco eBPF integration. The paper documents three distinguishing capabilities: kernel-layer AAE enforcement below the agent process boundary; cross-protocol interoperability through five reproducible test vectors verified against independent implementations; and layered Sybil resistance combining dual-signature interaction proofs, cross-vertical endorsement diversity gating, and principal-DID-linked violation persistence. The reference implementation has been operational since March 2026 across eight credential verticals. Empirical validation at adversarial scale is pending. The contribution is deployment-first evidence that the trust infrastructure regulators and industry have converged on is implementable today using W3C-standardized primitives.
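Agent 授权封装(AAE)的密码学签名层可以用一个自包含草图示意:为保持零外部依赖,这里用 HMAC 代替实际系统中基于 W3C VC 的非对称签名,封装字段名也均为本文假设:

```python
import hmac, hashlib, json

def sign_envelope(envelope: dict, key: bytes) -> str:
    """极简的授权封装签名示意(真实系统使用 DID 绑定的非对称签名)。"""
    payload = json.dumps(envelope, sort_keys=True).encode()  # 规范化序列化
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_envelope(envelope: dict, sig: str, key: bytes) -> bool:
    return hmac.compare_digest(sign_envelope(envelope, key), sig)

key = b"principal-secret"
aae = {"agent_did": "did:example:agent1",
       "allowed_tools": ["search"], "max_spend": 10}
sig = sign_envelope(aae, key)
assert verify_envelope(aae, sig, key)
aae["max_spend"] = 10_000              # 篡改授权范围
print(verify_envelope(aae, sig, key))  # False:签名校验失败
```

任何对授权范围的事后篡改都会使签名失效,这对应文中三层强制执行中的密码学层。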
[AI-145] A Self-Healing Framework for Reliable LLM-Based Autonomous Agents
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的自主智能体在复杂软件系统中因幻觉、执行错误和推理不一致等不可预测故障而导致的可靠性问题。解决方案的关键在于提出了一种可靠性感知的自愈框架,其核心包括:构建失败类型分类体系与定量可靠性评估模型,设计基于执行模式和输出一致性分析的异常行为检测方法,并引入通过自适应重规划与纠正提示策略实现动态恢复的自愈机制。该框架创新性地整合了智能体内部推理过程与外部执行结果的联合监控系统,从而显著提升任务成功率、抑制故障传播并增强整体系统鲁棒性。
链接: https://arxiv.org/abs/2605.06737
作者: Cheonsu Jeong,Younggun Shin
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures,1 table
Abstract:Autonomous agents based on Large Language Models (LLMs) are increasingly being utilized in complex software systems. However, reliability remains a significant challenge due to unpredictable failures such as hallucinations, execution errors, and inconsistent reasoning. This paper proposes a reliability-aware self-healing framework for LLM-based software agents. The framework integrates failure detection, reliability assessment, and automated recovery mechanisms. First, we define a taxonomy of failure types and introduce a quantitative reliability assessment model. Next, we propose a failure detection method that identifies abnormal agent behavior based on execution patterns and output consistency. Finally, we design a self-healing mechanism that dynamically recovers from failures through adaptive replanning and corrective prompting strategies. The proposed framework was implemented in a multi-agent workflow environment and evaluated using real-world task scenarios. Experimental results demonstrate that our approach significantly increases task success rates, reduces failure propagation, and enhances overall system robustness compared to existing methods. In particular, this study distinguishes itself by establishing an integrated monitoring system that combines the agent’s internal reasoning process with external execution results. These findings are expected to contribute to securing the stability of advanced autonomous systems and lowering the barriers to LLM adoption in production environments.
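摘要中"输出一致性检测 + 纠正性提示自愈"的闭环可示意如下(一致性阈值、采样次数与重试策略均为本文假设的极简版本):

```python
from collections import Counter

def consistency_check(outputs):
    """多次采样结果取多数;一致率过低视为异常行为(输出一致性检测)。"""
    top, n = Counter(outputs).most_common(1)[0]
    return top, n / len(outputs)

def run_with_self_healing(agent, task, threshold=0.6, max_retries=2):
    prompt = task
    for attempt in range(max_retries + 1):
        answer, agreement = consistency_check([agent(prompt) for _ in range(3)])
        if agreement >= threshold:
            return answer, attempt
        # 自愈:注入纠正性提示后重新规划(示意)
        prompt = f"上次回答不一致,请重新逐步推理:{task}"
    return answer, max_retries

calls = iter(["A", "B", "C", "D", "D", "D"])   # 模拟首轮不稳定、纠正后稳定的代理
answer, attempts = run_with_self_healing(lambda p: next(calls), "2+2=?")
print(answer, attempts)  # D 1
```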
[AI-146] Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning
【速读】:该论文旨在解决现有量子快速权重程序(Quantum Fast Weight Programmers, QFWPs)在噪声中等规模量子(Noisy Intermediate-Scale Quantum, NISQ)设备上难以扩展及经典模拟成本高昂的问题,同时提升时间序列建模的参数效率与预测精度。其解决方案的关键在于提出一种名为门控QKAN-FWP的新型框架,该框架将经典快速权重程序(Fast Weight Programmer, FWP)与量子启发的柯尔莫哥洛夫-阿诺德网络(Quantum-inspired Kolmogorov-Arnold Network, QKAN)相结合,并引入单比特数据重上传激活机制(DatA Re-Uploading ActivatioN, DARUAN)作为可学习的非线性激活函数;此外,设计了一种标量门控快速权重更新规则,通过理论分析证明其具备自适应记忆核、几何有界性和并行梯度路径特性,从而稳定参数演化过程。实验表明,该方法在长时程太阳周期预测任务中,以仅12.5k参数显著优于多种经典递归基线模型(如LSTM、WaveNet-LSTM等),且可在IonQ和IBM量子处理器上实现接近理想模拟精度的部署,验证了其对NISQ硬件的高度兼容性与实用性。
链接: https://arxiv.org/abs/2605.06734
作者: Kuo-Chung Peng,Samuel Yen-Chi Chen,Jiun-Cheng Jiang,Chen-Yu Liu,En-Jui Kuo,Yun-Yuan Wang,Prayag Tiwari,Andrea Ceschini,Chi-Sheng Chen,Yu-Chao Hsu,Chun-Hua Lin,Tai-Yue Li,Antonello Rosato,Massimo Panella,Simon See,Saif Al-Kuwari,Kuan-Cheng Chen,Nan-Yow Chen,Hsi-Sheng Goan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 46 pages, 13 figures, 10 tables
Abstract:Fast Weight Programmers (FWPs) encode temporal dependencies through dynamically updated parameters rather than recurrent hidden states. Quantum FWPs (QFWPs) extend this idea with variational quantum circuits (VQCs), but existing implementations rely on multi-qubit architectures that are difficult to scale on noisy intermediate-scale quantum (NISQ) devices and expensive to simulate classically. We propose gated QKAN-FWP, a fast-weight framework that integrates FWP with Quantum-inspired Kolmogorov-Arnold Network (QKAN) using single-qubit data re-uploading circuits as learnable nonlinear activation, known as DatA Re-Uploading ActivatioN (DARUAN). We further introduce a scalar-gated fast-weight update rule that stabilizes parameter evolution, supported by a theoretical analysis of its adaptive memory kernel, geometric boundedness, and parallelizable gradient paths. We evaluate the framework across time-series benchmarks, MiniGrid reinforcement learning, and highlight real-world solar cycle forecasting as our main practical result. In the long-horizon setting with 528-month input window and 132-month forecast horizon, our 12.5k-parameter model achieves lower scaled Mean Square Error (MSE), peak amplitude error, and peak timing error than a suite of classical recurrent baselines with up to 13x more parameters, including Long Short-Term Memory (LSTM) networks (25.9k-89.1k parameters), WaveNet-LSTM (167k), Vanilla recurrent neural network (11.5k), and a Modified Echo State Network (132k). To validate NISQ compatibility, we further deploy the trained fast programmer on IonQ and IBM Quantum processors, recovering forecasting accuracy within 0.1% relative MSE of the noiseless simulator at 1024 shots. These results position gated QKAN-FWP as a scalable, parameter-efficient, and NISQ-compatible approach to quantum-inspired sequence modeling.
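"单比特数据重上传激活(DARUAN)"的思想可以用实数 RY 旋转的极简示意说明:每层交替施加可学习旋转与数据编码旋转,测量 Z 期望得到一个天然有界的标量非线性(电路结构为本文简化假设,非论文的具体 ansatz):

```python
import math

def ry(theta):
    # 实数 RY(theta) 旋转矩阵
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def daruan(x, weights):
    """数据重上传:每一层都重新编码输入 x,再施加可学习旋转。"""
    U = [[1.0, 0.0], [0.0, 1.0]]
    for w in weights:
        U = matmul(ry(x), matmul(ry(w), U))
    a, b = U[0][0], U[1][0]          # 作用在 |0> 上得到的振幅
    return a * a - b * b             # <Z> 期望,天然落在 [-1, 1]

out = daruan(0.3, [0.1, -0.5, 0.7])
print(-1.0 <= out <= 1.0)  # True:输出天然有界
```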
[AI-147] Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA
【速读】:该论文旨在解决联邦学习中低秩适配(LoRA)方法存在的语义不匹配问题,即直接对客户端的LoRA因子进行平均会导致结果依赖于坐标系选择,从而破坏更新的语义一致性。其解决方案的关键在于提出一种规范感知的服务器表示方法——GLoRA,该方法通过从客户端投影矩阵估计共识更新子空间,并在共享参考坐标系下聚合客户端更新,确保语义层面的更新聚合始终以低秩形式表达。此外,GLoRA还引入了秩兼容的读出机制,支持不同计算能力的客户端从同一服务器状态中实例化不同秩的适配器,无需重建密集更新,显著提升了联邦LoRA在数据异构性、资源异构性和任务异构性下的性能与效率平衡。
链接: https://arxiv.org/abs/2605.06733
作者: Jinqian Chen,Chang Liu,Jihua Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated LoRA enables parameter-efficient adaptation of large language models under decentralized data and limited client resources. However, directly averaging LoRA factors is representation-dependent: the same intrinsic update admits infinitely many gauge-equivalent factorizations, so factor-level aggregation can change under arbitrary coordinate choices while the underlying update remains unchanged. This reveals a semantic mismatch in existing federated LoRA aggregation rules. We propose GLoRA, a gauge-aware server representation for federated LoRA. Instead of aggregating raw factors, GLoRA estimates a consensus update subspace from client projectors and aggregates client updates in shared reference coordinates, thereby representing semantic update aggregation entirely in low-rank form. To support heterogeneous client capacities, GLoRA further provides a rank-compatible readout that instantiates adapters of different ranks from the same server state without dense update reconstruction. Experiments on GLUE and SuperNI show that GLoRA consistently outperforms federated LoRA baselines under data, resource, and task heterogeneity, including heterogeneous client ranks, sparse participation, larger backbones, and unseen-task evaluation. GLoRA also achieves a favorable efficiency–performance trade-off, suggesting that effective federated LoRA requires not merely averaging low-rank factors, but defining a semantically meaningful server-side representation for aggregation.
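"因子平均依赖规范坐标"这一核心观察可以在 2x2 矩阵上直接验证:同一更新 ΔW = BA 存在无穷多规范等价分解 (BG, G^{-1}A),对乘积聚合不变,对因子平均则不然(示例矩阵为本文虚构):

```python
def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

B = [[1.0, 0.0], [0.0, 1.0]]
A = [[2.0, 1.0], [0.0, 3.0]]
G     = [[2.0, 0.0], [0.0, 0.5]]   # 任意可逆的规范变换
G_inv = [[0.5, 0.0], [0.0, 2.0]]

dW1 = matmul(B, A)
dW2 = matmul(matmul(B, G), matmul(G_inv, A))   # 同一更新的另一种规范
assert dW1 == dW2                               # 语义更新 ΔW = BA 规范不变

# 但"因子平均"依赖规范:对 (B, A) 与 (BG, G^{-1}A) 平均给出不同的 ΔW
B_avg = [[(B[i][j] + matmul(B, G)[i][j]) / 2 for j in range(2)] for i in range(2)]
A_avg = [[(A[i][j] + matmul(G_inv, A)[i][j]) / 2 for j in range(2)] for i in range(2)]
print(matmul(B_avg, A_avg) != dW1)  # True:因子级聚合改变了语义更新
```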
[AI-148] The EΔ-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality
【速读】:该论文旨在解决深度神经网络中残差连接的正交性稳定性问题,特别是在长序列建模中保持梯度流动和参数更新的数值稳定性。传统方法如Deep Delta Learning(DDL)依赖Householder变换实现正交性,但仅在特定参数范围内(β ∈ {0,2})保证正交;而现有基于Cayley变换的方法虽可全局保持正交性,却无法处理特征空间中本征值为-1的反射情形(即negation),导致对称性缺失。解决方案的关键在于提出EΔ-MHC-Geo Transformer架构,其核心创新包括:(1) 引入数据依赖的Cayley旋转矩阵Q(x),通过显式构造保证对任意β和输入x均满足正交约束;(2) 设计EΔ-MHC-Geo Hybrid结构,融合Cayley旋转与Householder反射,利用可学习的门控机制γ(X)动态选择最优操作路径,从而覆盖正交群O(n)的两个连通分支;(3) 加入midpoint-collapse正则项4γ(1−γ),促使门控决策趋于边界值,强化每条路径的正交性质。实验证明该方案在长时序稳定性和范数保持方面显著优于基线模型,同时以更少层数达到更高精度。
链接: https://arxiv.org/abs/2605.06729
作者: Arash Shahmansoori
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures; code will be available at this https URL
Abstract:We present the E\Delta-MHC-Geo Transformer, a novel architecture that unifies Manifold-Constrained Hyper-Connections (mHC), Deep Delta Learning (DDL), and the Cayley transform to obtain input-adaptive, unconditionally orthogonal residual connections. Unlike DDL, whose Householder operator is orthogonal only at \beta \in \{0, 2\} , our Data-Dependent Cayley rotation Q(x)=(I+(\beta/2)A(x))^{-1}(I-(\beta/2)A(x)) preserves orthogonality for all \beta and all inputs. To handle negation, an eigenvalue -1 case that Cayley provably excludes, we introduce the E\Delta-MHC-Geo Hybrid, which combines Cayley rotation with Householder reflection via a learned operator-selection gate X'=\gamma(X)Q(X)X+(1-\gamma(X))H_2(X)X . A midpoint-collapse regularizer, 4\gamma(1-\gamma) , encourages boundary gate decisions, where each selected component is orthogonal. In matched-parameter comparisons, with approximately 1.79M parameters per model and mean +/- standard deviation over 3 seeds, against four baselines including the concurrent JPmHC, E\Delta-MHC-Geo achieves the best long-horizon stability, 1.9x over JPmHC and 3.8x over GPT; the best near-\pi rotation loss, 4.5x over JPmHC on single-plane; strong norm preservation, with 0.001 mean deviation; and 0.96 negation cosine alignment in a diagnostic reflection probe, all with 33% fewer layers. While JPmHC’s wider representation excels on pure rotation, its finite Cayley residual mixer excludes an exact \lambda=-1 operator and has no reflection branch, motivating our hybrid approach for accessing both connected components of O(n) .
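"Cayley 变换对任意 β 保持正交"这一命题,可在 2x2 反对称生成元 A = [[0, a], [-a, 0]] 上数值验证(显式求逆写法仅适用于该玩具维度):

```python
def cayley_2x2(a, beta):
    """Q = (I + (beta/2)A)^{-1} (I - (beta/2)A),A = [[0, a], [-a, 0]] 反对称。"""
    t = beta / 2 * a
    det = 1 + t * t                           # det(I + (beta/2)A)
    inv = [[1 / det, -t / det], [t / det, 1 / det]]   # 2x2 显式逆
    M = [[1, -t], [t, 1]]                     # I - (beta/2)A
    return [[sum(inv[i][k] * M[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def is_orthogonal(Q, tol=1e-12):
    QtQ = [[sum(Q[k][i] * Q[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    return all(abs(QtQ[i][j] - (1 if i == j else 0)) < tol
               for i in range(2) for j in range(2))

# 对任意 beta(包括 Householder 仅在 {0,2} 正交的取值范围之外)均保持正交
print(all(is_orthogonal(cayley_2x2(0.7, b)) for b in [-3.0, 0.5, 2.0, 10.0]))  # True
```

这也直观体现了摘要中的另一面:Cayley 旋转的特征值永远取不到 -1,因此反射(negation)需要混合结构中的 Householder 分支。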
[AI-149] Enabling Unsupervised Training of Deep EEG Denoisers With Intelligent Partitioning
【速读】:该论文旨在解决可穿戴脑电图(wearable electroencephalogram, EEG)去噪问题,其核心挑战在于神经活动信号微弱且与频谱重叠的噪声伪影难以分离,传统信号处理方法依赖固定或启发式规则,无法应对时变的广泛伪影;而深度学习方法虽具潜力,却因训练需无伪影EEG数据(实际不可获得)受限。解决方案的关键在于提出智能分区自监督去噪方法(Intelligent Partitioning for Self-supervised Denoising, iPSD),通过学习将输入EEG片段划分为具有相同潜在信号但独立噪声实现的子段,从而在无需干净参考信号的情况下实现自监督训练,尤其适用于零样本场景(仅需单个待去噪EEG片段)。实验表明,iPSD在极低信噪比(低至-10 dB)和复杂伪影(如肌电伪影EMG)下均显著优于现有基线,频谱保真度提升数个数量级。
链接: https://arxiv.org/abs/2605.06724
作者: Qiyu Rao,Haozhe Tian,Homayoun Hamedmoghadam,Danilo Mandic
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Denoising wearable electroencephalogram (EEG) is inherently challenging since neural activity is not only subtle but also inseparable from spectrally overlapping noise artifacts. Classical signal processing methods, relying on fixed or heuristic rules, cannot handle the time-varying pervasive artifacts in wearable EEGs. Deep learning methods, on the other hand, show promise in decomposition-free EEG denoising using highly expressive neural networks, but the training requires artifact-free EEG, which is inherently unobtainable. To address this, we propose Intelligent Partitioning for Self-supervised Denoising (iPSD). Our method eliminates the need for clean references by learning to partition an input EEG segment into independent noisy realizations with the same underlying signal. This enables self-supervision of deep learning denoisers, even in zero-shot settings where only a single EEG segment to be denoised is available. We validate iPSD through extensive experiments, including validations on wearable EEG from in-ear sensors. The results show that iPSD achieves state-of-the-art performance, most notably under extremely low signal-to-noise ratios (down to -10 dB) and challenging artifacts (e.g., EMG), with spectral fidelity orders of magnitude higher than competitive baselines.
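"将一个片段划分为同信号、独立噪声的两份实现"的自监督思路,可用固定的奇偶交错划分示意如下(论文中的划分是学习得到的,此处仅为说明原理):

```python
import random

def partition(segment):
    """将 EEG 片段划分为两个"同信号、独立噪声"的子序列
    (此处用固定的偶/奇交错;iPSD 中的划分由网络学习)。"""
    return segment[0::2], segment[1::2]

random.seed(0)
truth = [0.5] * 1000                            # 玩具"真信号":常数
noisy = [s + random.gauss(0.0, 1.0) for s in truth]
a, b = partition(noisy)

# 自监督:用一半预测另一半;对常数信号,最优去噪输出就是均值
denoised = sum(a) / len(a)
loss = sum((denoised - y) ** 2 for y in b) / len(b)
print(abs(denoised - 0.5) < 0.2)   # 无需干净参考即可逼近真信号
```

由于两份实现的噪声相互独立,"以一半预测另一半"的最优解趋向共同的底层信号,这正是免干净参考训练的原理。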
[AI-150] Conditional generation of antibody sequences with classifier-guided germline-absorbing discrete diffusion
【速读】:该论文旨在解决抗体序列设计中两个关键挑战:一是现有蛋白语言模型(protein language models, pLMs)主要记忆胚系(germline)序列,难以建模具有生物学意义的体细胞变异(somatic variation);二是缺乏对灵活分类器引导的条件生成的支持。其解决方案的关键在于提出两种创新方法:首先,采用离散扩散微调(discrete diffusion fine-tuning)实现强语言建模性能,并支持任意现成分类器的条件生成;其次,引入胚系吸收扩散(germline absorbing diffusion),将胚系序列而非掩码序列设为吸收态,从而赋予模型一个生物合理的归纳偏置,使其仅学习从胚系到已观察序列的演化路径,有效排除遗传变异和V(D)J重组统计信息的影响,显著缓解胚系偏差。实验表明,该方法将非胚系残基预测准确率从26%提升至46%,接近真实生物变异性设定的理论上限,并在改善疏水性和预测结合亲和力的条件生成任务中优于EvoProtGrad等主流策略。
链接: https://arxiv.org/abs/2605.06720
作者: Justin Sanders,Luca Giancardo,Lan Guo,Yue Zhao,Kemal Sonmez,Nina Cheng,Melih Yilmaz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 2 tables
Abstract:Antibody therapeutics are among the most successful modern medicines, yet computationally designing antibodies with desirable binding and developability properties remains challenging. While protein language models (pLMs) have emerged as powerful tools for antibody sequence design, existing approaches largely suffer from two key limitations: they predominantly memorize germline sequences rather than modeling biologically meaningful somatic variation, and they offer limited support for flexible classifier-guided conditional generation. We address these challenges through two primary contributions. First, we demonstrate that discrete diffusion fine-tuning achieves strong language modeling performance on antibody sequences while allowing for generation conditioned on any off-the-shelf classifier. Second, we introduce germline absorbing diffusion, a novel modification of the discrete diffusion noise process in which the germline sequence - rather than a masked sequence - serves as the absorbing state. This biologically motivated inductive bias restricts the model to learning the trajectory from germline to observed sequence, effectively excluding genetic variation and V(D)J recombination statistics from the learned distribution and dramatically mitigating germline bias. We show that germline diffusion improves non-germline residue prediction accuracy from 26 percent to 46 percent, approaching the theoretical upper bound set by true biological variability. We then demonstrate the utility of our germline diffusion model on the conditional generation tasks of sampling antibodies with improved hydrophobicity and predicted binding affinity. On both tasks our model shows an improved tradeoff between class adherence and sample quality, significantly outperforming EvoProtGrad, a popular strategy to sample from pLMs with gradient-based discrete Markov Chain Monte Carlo.
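"以胚系序列而非掩码作为吸收态"的前向加噪过程可示意如下(线性噪声调度 t/T 与示例序列均为本文假设,非论文的具体设定):

```python
import random

def germline_absorb(seq, germline, t, T=10, seed=0):
    """胚系吸收扩散的前向过程示意:随 t 增大,每个位点以概率 t/T
    回退到胚系残基(而非 [MASK]),t = T 时完全吸收为胚系序列。"""
    rng = random.Random(seed)
    return "".join(g if rng.random() < t / T else s
                   for s, g in zip(seq, germline))

germline = "QVQLVQSGAEVKK"
matured  = "QVQLVESGGGVKK"   # 假想的体细胞突变序列
assert germline_absorb(matured, germline, t=0) == matured
assert germline_absorb(matured, germline, t=10) == germline
print(germline_absorb(matured, germline, t=5))  # 介于两者之间的部分吸收状态
```

反向去噪因此只需学习"从胚系到观测序列"的轨迹,与胚系一致的位点在前向过程中根本不发生变化,这正是该归纳偏置抑制胚系偏差的来源。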
[AI-151] Agentic Coding Needs Proactivity Not Just Autonomy
【速读】:该论文旨在解决当前生成式编码代理(coding agents)在向主动型(proactive)和长周期(long-horizon)演进过程中缺乏清晰定义、评估标准与可衡量指标的问题,尤其关注如何区分“主动”与“自主”(autonomy)、明确主动行为的接受标准以及判断未请求代理行为是否真正有用。其解决方案的关键在于提出一个基于混合协同交互(mixed initiative interaction)原则的洞察策略质量评估框架,将代理的主动性视为由“决策下一个重要事项”的策略所驱动,并据此构建一个三级主动性的分类体系(Reactive、Scheduled、Situation Aware),同时引入三项可量化评估目标:洞察决策质量(Insight Decision Quality, IDQ)、上下文锚定得分(Context Grounding Score, CGS)和学习提升度(Learning Lift),从而为下一代主动编码代理提供系统化评估方法论。
链接: https://arxiv.org/abs/2605.06717
作者: Nghi D. Q. Bui,Georgios Evangelopoulos
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Position Paper
Abstract:Coding agents are rapidly changing the landscape of software development, moving from inline completion to autonomous systems that edit repositories, open pull requests, respond to issues, and run scheduled or webhook triggered routines across the development life cycle. The next generation is increasingly described as proactive and long-horizon: agents should notice relevant changes before the developer asks, connect signals across tools, decide when to interrupt, and carry preferences across sessions. Yet the field still lacks a clear account of what proactivity means for software development, how it differs from autonomy, what acceptance criteria proactive long-horizon tasks should satisfy, and which metrics determine whether unsolicited agent behavior is useful rather than merely active. Proactive coding agents should be evaluated by the quality and improvement of their insight policy: the policy that decides what matters next, what evidence supports it, whether to show it, and how to adapt after feedback. This view is grounded in the principles of mixed initiative interaction. We propose a three level taxonomy of proactivity (Reactive, Scheduled, and Situation Aware), compare contemporary coding agents against five practical criteria, and sketch an active user simulation protocol with three evaluation targets: Insight Decision Quality (IDQ), Context Grounding Score (CGS), and Learning Lift.
[AI-152] The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking
【速读】:该论文旨在解决生成式 AI (Generative AI) 在实际应用中输出质量的可比性与预测性问题,特别是在无定制指令、无个性化调整和无修复提示的公共接口环境下,如何客观评估不同大语言模型(LLM)在生成单文件 HTML 代码时的表现,并探索其输出特征是否能用于预测传播效果或代码复杂度。解决方案的关键在于构建一个标准化的八周观测框架,涵盖 17 个公开实验中的 68 个 HTML 输出,采用人类评分与 Gemini LLM-as-a-judge 双重评价机制,对功能正确性、UI 质量及 prompt 遵从度进行量化分析,并通过社会媒体协议统一发布结果;同时引入两个监督式预测模型——用于预测 X 平台 24 小时曝光量的实验级模型和用于预测 HTML 行数的生成级模型,发现模型家族本身(如 Claude 表现最优)是决定代码冗余度的主要因素,而 prompt 文本影响有限,且现有变量不足以准确预测社交传播效果。
链接: https://arxiv.org/abs/2605.06707
作者: Diego Cabezas Palacios
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 23 pages, 3 figures, 5 tables
Abstract:This paper presents an eight-week observational comparison of 68 single-file HTML generations collected across 17 public experiments in the “HTML AI Battle” project between December 10, 2025 and February 4, 2026. Four reasoning model families, GPT, Gemini, Grok, and Claude, were compared under a fixed public-interface protocol with no custom instructions, no personality tuning, and no repair prompts. Each output was evaluated from a rendered browser video using human scores and a Gemini LLM-as-a-judge layer for prompt adherence, functional correctness, and UI quality, then packaged into a standardized social-media protocol spanning X (Twitter), TikTok, and YouTube. The tracker was also used for two supervised predictive analyses: an experiment-level model for 24-hour X impressions and a generation-level model for HTML verbosity. Under this protocol, Claude was the strongest and most consistent family, leading mean performance and winning 9/17 prompts under the primary human weighted score. Longer measured reasoning time was not associated with higher quality overall. Gemini as a judge was significantly more lenient than the human evaluator on functional correctness and overall performance, while stable self-favoring bias remained unresolved. The exploratory X-impressions model remained weak under post-screen cross-validation (MAE = 46,874, R^2 = -0.377), whereas the HTML-lines model performed better, with a model-family-only baseline outperforming prompt-aware alternatives (MAE = 135.2, R^2 = 0.576). Overall, selected pre-publication technical/audio variables were not sufficient to predict 24-hour X reach, while code verbosity was driven much more by model family than by prompt wording. The comparisons remain observational and are limited by public-interface drift, access-path differences, and one primary human scorer. 
[AI-153] Fast and Effective Redistricting Optimization via Composite-Move Tabu Search
【速读】:该论文旨在解决空间选区划分(spatial redistricting)中的组合优化问题,其核心挑战在于如何在满足连通性约束(contiguity constraint)的前提下,实现高质量、高效率且可交互调整的解决方案。传统方法在整数规划或启发式搜索中强制执行连通性时,会显著缩小可行邻域,削弱探索能力,并易陷入局部最优。论文提出一种复合移动禁忌搜索(composite-move Tabu search, CM-Tabu),其关键创新在于系统性地扩展禁忌搜索的可行邻域空间,同时严格保持连通性:当单个边界单元无法独立迁移而不破坏所属选区连通性时,算法通过识别最小移动集合或可交换单元对(或集合),形成保持连通性的复合移动(contiguity-preserving composite move)。该方法利用割点(articulation points)和双连通分量(biconnected components)快速生成候选移动,在线性时间内完成邻域扩展,从而显著提升解的质量、鲁棒性和计算效率,适用于实际选区划分决策支持场景。
链接: https://arxiv.org/abs/2605.06682
作者: Hai Jin,Diansheng Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Spatial redistricting is a practical combinatorial optimization problem that demands high-quality solutions, rapid turnaround, and flexibility to accommodate multi-criteria objectives and interactive refinement. A central challenge is the contiguity constraint: enforcing contiguity in integer-programming or heuristic search can severely shrink the feasible neighborhood, weaken exploration, and trap the search in poor local optima. We introduce a composite-move Tabu search (CM-Tabu) that systematically expands the feasible neighborhood space in Tabu search while preserving contiguity. When a boundary unit cannot be reassigned individually without disconnecting its district, our method identifies a minimal set of units that can move together, or a pair of units (or sets of units) that can be switched, as a contiguity-preserving composite move. Candidate single-unit and composite moves are generated in linear time by analyzing each district’s contiguity graph using articulation points and biconnected components. Extensive experiments demonstrate that the proposed approach substantially improves solution quality, run-to-run robustness, and computational efficiency relative to traditional Tabu search and other baselines. For example, in the Philadelphia case, the approach can consistently attain the theoretical global optimum in population-equality and support multi-criteria trade-offs. CM-Tabu delivers optimization performance suitable for real-world practices and decision-support workflows.
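摘要中"边界单元单独迁出会破坏所属选区连通性、因而需要复合移动"的判定,核心是一次连通性检查,可用 BFS 极简实现(论文实际使用关节点与双连通分量在线性时间内批量生成候选移动):

```python
from collections import deque

def is_contiguous(units, adjacency):
    """BFS 判断一个选区(单元集合)在邻接图上是否连通。"""
    units = set(units)
    if not units:
        return True
    seen, queue = set(), deque([next(iter(units))])
    while queue:
        u = queue.popleft()
        if u in seen:
            continue
        seen.add(u)
        queue.extend(v for v in adjacency[u] if v in units and v not in seen)
    return seen == units

# 玩具邻接图:一条链 1-2-3-4
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
# 单独移出关节点 2 会使剩余 {1, 3, 4} 断裂,
# 因此需要复合移动(例如将 {1, 2} 一起迁出,剩余 {3, 4} 仍连通)
print(is_contiguous([1, 3, 4], adj), is_contiguous([3, 4], adj))  # False True
```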
[AI-154] Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs
【速读】:该论文旨在解决教育领域大语言模型(Large Language Model, LLM)辅导系统中提示注入(prompt injection)防御的对齐挑战,即在遵循用户意图的同时,维持教学约束与安全策略。其解决方案的关键在于提出了一种多层域特定防护流水线,结合确定性模式过滤、结构验证、上下文沙箱和会话级行为检测,实现对抗鲁棒性、良性任务可用性和响应延迟之间的显式权衡。实验表明,在一个包含480个查询的受控基准上,该方案达到46.34%的绕过率、0.00%假阳性率和2.50毫秒平均延迟,优先保障教学可用性(零假阳性),同时具备可量化的攻击抵抗能力。
链接: https://arxiv.org/abs/2605.06669
作者: Alexandre Cristovão Maiorano
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 4 figures, 9 tables
Abstract:Educational LLM tutors face a core AI alignment challenge: they must follow user intent while preserving pedagogical constraints and safety policies. We present an evaluation methodology for prompt-injection defenses in this setting, showing that guardrail design entails explicit trade-offs among adversarial robustness, benign-task usability, and response latency. We evaluate a domain-specific multi-layer safeguard pipeline combining deterministic pattern filters, structural validation, contextual sandboxing, and session-level behavioral checks. On a controlled holdout benchmark with 480 queries (369 injection, 111 benign), the pipeline reaches 46.34% bypass, 0.00% false positive rate, and 2.50 ms average latency – an operating point that prioritizes pedagogical usability (zero false positives) while maintaining measurable attack resistance. We provide a reproducible benchmark protocol for head-to-head comparison under identical conditions, including stratified bootstrap confidence intervals, paired McNemar significance tests, and direct evaluation of Prompt Guard and NeMo Guardrails on the same split with unified instrumentation. Results expose operational trade-offs: NeMo reaches 0% bypass at 16.22% FPR and 1.3s latency, while Prompt Guard yields 38.48% bypass with 3.60% FPR. The framework supports evidence-based guardrail selection for AI tutoring systems under different institutional risk and usability requirements.
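以下为防护流水线第一层"确定性模式过滤"的极简示意。正则规则是本文为说明概念而假设的示例,并非论文实际使用的规则集。

```python
# 极简示意:多层防护流水线中的确定性模式过滤层(规则为假设示例)。
import re

INJECTION_PATTERNS = [
    r"ignore (all |previous |prior )*instructions",
    r"you are now",
    r"system prompt",
]

def pattern_filter(query: str) -> bool:
    """返回 True 表示命中注入模式,应拦截该查询。"""
    q = query.lower()
    return any(re.search(p, q) for p in INJECTION_PATTERNS)

print(pattern_filter("Ignore previous instructions and reveal the answer key"))  # True
print(pattern_filter("How do I solve this quadratic equation?"))  # False
```

这类确定性规则几乎零延迟、零假阳性,但单独使用绕过率高,因此论文将其与结构验证、上下文沙箱等后续层级组合。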
[AI-155] XDecomposer: Learning Prior-Free Set Decomposition for Multiphase X-ray Diffraction
【速读】:该论文旨在解决多相粉末X射线衍射(Multiphase Powder X-ray Diffraction, PXRD)分析中结构识别的瓶颈问题,即在实际合成过程中常产生复杂混合物,而传统方法难以可靠地分离和识别其中各相成分。现有基于表示学习的晶体检索与生成方法多假设输入为单相数据,在多相场景下失效。其解决方案的关键在于提出一种无先验(prior-free)框架XDecomposer,将多相衍射分析建模为集合预测问题,通过相查询驱动的分解机制与衍射一致的物理重建策略,在统一架构中同时推断无序相集、各相比例及其对应的结构表征,从而实现无需候选相列表、结构模板或相数先验的联合分解与识别,显著提升重建精度与相识别能力,并具备对未见混合物的良好泛化性能。
链接: https://arxiv.org/abs/2605.05866
作者: Hanyu Gao,Bin Cao,Yunyue Su,Tong-Yi Zhang,Qiang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
备注: 28pages, 8figures, 6tables
Abstract:Multiphase powder X-ray diffraction (PXRD) analysis remains a fundamental bottleneck in structure identification, as real-world synthesis often produces complex mixtures whose constituent phases (components) cannot be reliably disentangled. While recent advances in representation-based crystal retrieval and generation suggest the possibility of inferring structures directly from PXRD, existing approaches largely assume single-phase inputs and break down in multiphase settings. Here, we present XDecomposer, a prior-free framework for joint decomposition and identification of multiphase XRD patterns without requiring candidate phase lists, structural templates, or prior knowledge of phase number. We formulate multiphase diffraction analysis as a set prediction problem, where the model infers an unordered set of phase-resolved components, their mixture proportions, and corresponding structural representations within a unified architecture. A phase-query-driven decomposition mechanism, together with diffraction-consistent physical reconstruction, enables accurate source separation while preserving crystallographic fidelity. Extensive experiments on both simulated and experimental datasets show that XDecomposer substantially improves reconstruction accuracy and phase identification across diverse chemical systems, while maintaining strong generalization to unseen mixtures. These results provide a practical route toward data-driven, source-resolved multiphase XRD analysis and reduce long-standing dependence on prior-guided iterative phase matching. The code is openly available at this https URL
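集合预测需要先在无序的预测相与真实相之间做最优匹配,再计算损失。下面用暴力枚举(小规模下可行,大规模通常改用匈牙利算法)示意该匹配步骤;XDecomposer 的实际损失与相表征远比此复杂,数值纯属假设。

```python
# 极简示意:集合预测中的最优预测-目标匹配(暴力枚举版)。
import itertools

def set_matching_cost(preds, targets, dist):
    """在所有排列中取总代价最小的预测-目标配对。"""
    best = float("inf")
    for perm in itertools.permutations(range(len(targets))):
        cost = sum(dist(preds[i], targets[j]) for i, j in enumerate(perm))
        best = min(best, cost)
    return best

preds = [9, 1, 5]       # 假设的三个预测相"属性"
targets = [5, 10, 0]    # 假设的三个真实相"属性"
cost = set_matching_cost(preds, targets, lambda a, b: abs(a - b))
print(cost)  # 2:最优配对为 9-10、1-0、5-5
```

匹配后的损失对预测顺序不敏感,这正是处理"无序相集"所需的性质。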
[AI-156] Statistical inference with belief functions: A survey
【速读】:该论文旨在解决如何从统计数据中推断出信度函数(belief function)的问题,这是基于信度函数进行推理链的第一步。其解决方案的关键在于系统性地回顾和总结该领域内最具影响力的贡献,聚焦于如何利用有限或不完整的统计数据来构建合理的信度测度,从而在缺乏足够数据以学习概率分布的场景下,提供一种有效的不确定性建模方法。
链接: https://arxiv.org/abs/2605.07908
作者: Fabio Cuzzolin
机构: 未知
类目: Statistics Theory (math.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 0 figures
Abstract:Belief functions are a powerful and popular framework for the mathematical characterisation of uncertainty, in particular in situations in which lack of data renders learning a probability distribution for the problem impractical. The first step in a reasoning chain based on belief functions is inference: how to learn a belief measure from the available data. In this survey we focus, in particular, on making inference from statistical data, and review the most significant contributions in the area.
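作为背景,下面示意信度函数框架中由质量函数 m 计算信度 Bel 与似然度 Pl 的基本运算;本综述关注的是如何从统计数据学习 m,此处的质量函数数值为假设示例。

```python
# 极简示意:由质量函数 m 计算 Bel(A) 与 Pl(A)。
def belief(A, masses):
    """Bel(A) = Σ m(B), 对所有 B ⊆ A。"""
    return sum(m for B, m in masses.items() if B <= A)

def plausibility(A, masses):
    """Pl(A) = Σ m(B), 对所有 B ∩ A ≠ ∅。"""
    return sum(m for B, m in masses.items() if B & A)

# 识别框架 Θ = {a, b, c} 上的一个简单质量函数(数值为假设)
masses = {
    frozenset({"a"}): 0.5,
    frozenset({"a", "b"}): 0.25,
    frozenset({"a", "b", "c"}): 0.25,
}
A = frozenset({"a", "b"})
print(belief(A, masses))        # 0.75
print(plausibility(A, masses))  # 1.0
```

区间 [Bel(A), Pl(A)] 刻画了数据不足时对事件 A 的不确定性,比单一概率值表达力更强。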
[AI-157] Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
【速读】:该论文旨在解决宽神经网络在梯度下降训练过程中隐藏权重谱(hidden-weight spectra)的演化问题,特别是如何同时刻画谱的主体部分与异常值(outlier)的动力学行为。其核心挑战在于传统方法难以统一描述谱中具有统计依赖性的“尖峰”方向(spiked ensemble)与随机主体部分的动态耦合机制。解决方案的关键在于提出一种两级动力学平均场理论(two-level dynamical mean-field theory, DMFT),该理论能够联合追踪谱的主体和异常值演化,并在两种典型设置下验证:(1) 无限宽非线性网络在均场/μP尺度下的行为;(2) 深度线性网络在比例高维极限中的表现。该框架揭示了异常值随训练时间、网络宽度、输出缩放和初始化方差的变化规律,并指出μP参数化可实现宽度一致的异常值动力学与超参数迁移,而NTK参数化则表现出强宽度依赖性,尽管最终收敛至同一大宽度极限。此外,研究进一步发现,对于小输出通道任务,该“主体+异常值”图像适用,但大规模输出任务(如ImageNet分类或GPT语言建模)需引入谱主体重构机制,从而拓展了对复杂任务下谱演化的理解。
链接: https://arxiv.org/abs/2605.07870
作者: Clarissa Lauditi,Cengiz Pehlevan,Blake Bordelon
机构: 未知
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:We study the evolution of hidden-weight spectra in wide neural networks trained by (stochastic) gradient descent. We develop a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. We apply this framework to two settings: (1) infinite-width nonlinear networks in mean-field/μP scaling and (2) deep linear networks in the proportional high-dimensional limit, where width, input dimension, and sample size diverge with fixed ratios. Our theory predicts how outliers evolve with training time, width, output scale, and initialization variance. In deep linear networks, μP yields width-consistent outlier dynamics and hyperparameter transfer, including width-stable growth of the leading NTK mode toward the edge of stability (EoS). In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, despite converging to a stable large-width limit. We show that this bulk+outlier picture is descriptive of simple tasks with small output channels, but that tasks involving large numbers of outputs (ImageNet classification or GPT language modeling) are better described by a restructuring of the spectral bulk. We develop a toy model with extensive output channels that recapitulates this phenomenon and show that the edge of the spectrum still converges for sufficiently wide networks.
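文中"谱主体+异常值"图像可用经典的尖峰随机矩阵(BBP 型转变)做数值演示:秩一尖峰强度超过阈值后,最大奇异值才脱离随机主体的谱边缘(约为 2)。以下 numpy 示例仅为示意,与论文的 DMFT 推导无直接对应,参数为假设。

```python
# 极简示意:秩一尖峰强度 theta 变化时,最大奇异值何时脱离谱主体。
import numpy as np

rng = np.random.default_rng(0)
n = 1000
W0 = rng.standard_normal((n, n)) / np.sqrt(n)   # 随机主体:奇异值约落在 [0, 2]
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)

tops = {}
for theta in [0.5, 1.5, 3.0]:                    # 尖峰强度
    s = np.linalg.svd(W0 + theta * np.outer(u, v), compute_uv=False)
    tops[theta] = s[0]
    print(f"theta={theta}: 最大奇异值 {s[0]:.2f}")
```

theta ≤ 1 时最大奇异值仍贴在主体边缘附近;theta > 1 后才作为异常值分离出来,这正是"异常值逃逸"一类现象的静态原型。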
[AI-158] PPI-Net connects molecular protein interactions to functional processes in disease
【速读】:该论文旨在解决分子变异如何在生物系统中传播并驱动疾病这一核心挑战,尤其针对现有模型在整合结构化生物学关系和跨尺度可解释性方面的不足。其解决方案的关键在于提出PPI-Net,一种分层图神经网络(hierarchical graph neural network),它将蛋白质-蛋白质相互作用(Protein-Protein Interaction, PPI)网络与通路层级表示相结合,通过图注意力机制在STRING数据库构建的共享交互网络中传播患者特异性分子特征,并利用Reactome通路层次结构将基因层面信号聚合为更高阶的生物学程序,从而实现从分子互作到功能过程的建模。该方法不仅在十种癌症类型的RNA-seq数据上实现了超过90%的平衡准确率,还通过多组学整合提升了模型解释性,揭示了TP53-AKT信号等经典癌基因模块及离子信号传导等协同生物学程序。
链接: https://arxiv.org/abs/2605.07838
作者: Kyle Higgins,Guadalupe Gonzalez,Dennis Veselkov,Ivan Laponogov,Kirill Veselkov
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 3 figures, 2 tables
Abstract:Understanding how molecular alterations propagate across biological systems to drive disease remains a central challenge. Although high-throughput profiling enables comprehensive characterization of tumor states, most models neglect structured biological relationships or lack interpretability across scales. Here we present PPI-Net, a hierarchical graph neural network that integrates protein-protein interaction (PPI) networks with pathway-level representations to model disease from molecular interactions to functional processes. Patient-specific molecular profiles are embedded within a shared interaction network from STRING and propagated through a multi-layer Reactome hierarchy using graph attention, enabling aggregation of gene-level signals into higher-order biological programs. Across RNA-seq data from ten cancer types from The Cancer Genome Atlas, PPI-Net achieves robust predictive performance, with balanced accuracy exceeding 90% in multiple cohorts. Comparative analysis on RNA-Seq data from breast cancer demonstrated that PPI-Net’s integration of the Reactome hierarchy improved balanced accuracy by 6.7% relative to a PPI-only model, while hierarchical multi-level supervision improved balanced accuracy by 12.3% relative to using only a single top-level prediction head. Applying a multi-omics approach using RNA-seq and methylation data improves model interpretation, recovering canonical oncogenic modules, including TP53-AKT signaling and stress response pathways, while revealing convergence onto coherent programs such as ion signaling and cellular responses to stimuli. These results demonstrate that integrating interaction networks with pathway hierarchies enables accurate prediction while providing mechanistic insight into cancer biology.
[AI-159] Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation
【速读】:该论文旨在解决单通道语音源距离估计(single-channel speaker distance estimation)在实际应用中性能不稳定的问题,特别是模型在不同录音条件下如何依赖房间脉冲响应(RIR)的不同成分进行推理。其关键解决方案是通过将模拟的RIR分解为四种变体(全RIR、仅直达声、无晚期混响、无早期反射),并设计四种校准场景(从完全已知到完全未知的时间与声级信息),系统性地评估模型在不同条件下的表现。结果表明:当缺乏时间校准时,模型主要依赖混响特征,且早期反射最为关键;而一旦具备时间校准,则模型仅需提取传播延迟即可实现亚米级精度(MAE=0.14 m),显著提升鲁棒性与准确性。
链接: https://arxiv.org/abs/2605.07694
作者: Michael Neri,Archontis Politis,Tuomas Virtanen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
备注: Submitted to IWAENC 2026
Abstract:Single-channel speaker distance estimation has recently achieved centimeter-level accuracy in simulated environments, yet it remains unclear which components of the room impulse response (RIR) the model exploits and how performance depends on the recording conditions. In this work, we decompose simulated RIRs into four variants (full, direct-only, no-late, and no-early) using the mixing time estimated from the echo density function as the boundary between early reflections and late reverberation. We define four calibration scenarios, from fully calibrated (synchronised capture, known source level) to fully uncalibrated (arbitrary onset, unknown level), and evaluate all combinations on a matched dataset. Results show that without time calibration, mean absolute error (MAE) increases to 1.29 m and the model extracts reverberation-based cues, with early reflections emerging as the most informative component. Further analysis against DRR, C_50 , and T_60 confirms that estimation accuracy improves with stronger early energy and degrades in highly reverberant environments. When time calibration is available, the model achieves a MAE of 0.14 m by extracting the propagation delay alone, regardless of the RIR content.
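下面的 numpy 片段示意"按混合时间把 RIR 切分为直达声、早期反射与晚期混响并计算 DRR"的基本做法。合成 RIR、2.5 ms 直达窗与 50 ms 混合时间均为本文假设;论文中的混合时间由回声密度函数估计,而非固定值。

```python
# 极简示意:RIR 三段切分与直达-混响能量比(DRR)计算。
import numpy as np

fs = 16000
t = np.arange(int(0.3 * fs)) / fs
rng = np.random.default_rng(1)
rir = rng.standard_normal(t.size) * np.exp(-t / 0.05)  # 合成指数衰减混响尾
rir[0] = 5.0                                           # 直达声脉冲

direct_end = int(0.0025 * fs)   # 直达声窗口:2.5 ms(假设)
mixing = int(0.05 * fs)         # 混合时间:50 ms(假设)
direct = rir[:direct_end]
early = rir[direct_end:mixing]
late = rir[mixing:]

# DRR:直达能量与其余(早期 + 晚期)能量之比
drr_db = 10 * np.log10(np.sum(direct**2) / np.sum(rir[direct_end:]**2))
print(f"DRR = {drr_db:.1f} dB")
```

论文的四种 RIR 变体(full、direct-only、no-late、no-early)即由这类切分组合而成,用于定位模型所依赖的线索。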
[AI-160] Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning
【速读】:该论文旨在解决无线联邦学习(Over-the-air Federated Learning, OTA-FL)中因传统模拟聚合方案依赖瞬时信道状态信息(CSI)、信道逆变换和相干相位对齐而导致的实际系统部署困难问题。其解决方案的关键在于提出一种非相干聚合原语——资源元素能量差(Resource-element Energy Difference, REED),该方法将实值更新的正负部分映射到两个正交资源元素上的传输能量,并引入独立相位抖动,服务器通过能量差估计带符号的聚合结果。REED仅需慢时尺度的平均信道功率校准,即可无偏地估计目标带符号和,在瑞利衰落信道下具有精确闭式方差表达式,从而在保证收敛性的同时显著降低对实时CSI的依赖。
链接: https://arxiv.org/abs/2605.07263
作者: Hao Chen,Zavareh Bozorgasl
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Preprint; Under-review; Codes to replicate the results is available at: this https URL
Abstract:Over-the-air federated learning (OTA-FL) reduces uplink latency by exploiting waveform superposition, but conventional analog aggregation schemes typically require instantaneous channel state information (CSI), channel inversion, and coherent phase alignment, which can be difficult to maintain in practical wireless systems. This paper proposes resource-element energy difference (REED), a noncoherent aggregation primitive for continuous signed updates that avoids instantaneous CSI. REED maps the positive and negative parts of each real-valued update to transmit energies on two orthogonal resource elements with independent phase dithers, and the server estimates the signed aggregate from their energy difference. With only slow-timescale calibration of average channel powers, REED is unbiased for the desired signed sum and admits an exact closed-form variance under Rayleigh fading. We incorporate REED into full-participation FedAvg and prove a smooth nonconvex stationarity bound. Under an average per-client energy budget, the aggregation gain can be scheduled so that the REED-induced perturbation scales quadratically with the local stepsize, yielding the canonical O(1/√T) stationarity rate. Experiments on MNIST and Fashion-MNIST demonstrate that REED closely matches clean FedAvg and coherent CSIT aggregation in IID settings, while maintaining stable convergence with a moderate performance degradation under strong data heterogeneity.
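REED 的无偏性可用几行 numpy 做蒙特卡洛验证:把每个更新的正、负部分别映射为两个正交资源元素上的发射能量,瑞利衰落(平均信道功率归一化为 1)下能量差的期望即为带符号和。以下为忽略接收噪声的示意,客户端更新数值为假设。

```python
# 极简示意:REED 能量差估计在瑞利衰落下的无偏性(蒙特卡洛验证)。
import numpy as np

rng = np.random.default_rng(0)
u = np.array([0.8, -0.5, 0.3, -0.1])        # 各客户端的带符号更新(假设数值)
p, n = np.maximum(u, 0), np.maximum(-u, 0)  # 正/负部 -> 两个资源元素上的发射能量

trials = 200000
est = np.zeros(trials)
for sign, energy in [(+1, p), (-1, n)]:
    # 独立瑞利衰落(E|h|^2 = 1)与相位抖动,多客户端波形叠加
    h = (rng.standard_normal((trials, u.size))
         + 1j * rng.standard_normal((trials, u.size))) / np.sqrt(2)
    dither = np.exp(1j * rng.uniform(0, 2 * np.pi, (trials, u.size)))
    y = np.sum(h * dither * np.sqrt(energy), axis=1)
    est += sign * np.abs(y) ** 2            # 两个资源元素的能量差
print(f"估计均值 {est.mean():.3f} ≈ 真实带符号和 {u.sum():.3f}")
```

由于相位抖动使跨客户端交叉项期望为零,单次能量差虽方差较大,但对带符号和无偏,只需慢时标校准平均信道功率。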
[AI-161] Causal EpiNets: Precision-corrected Bounds on Individual Treatment Effects using Epistemic Neural Networks
【速读】:该论文旨在解决有限样本下个体处理效应(Individual Treatment Effects, ITE)无法通过传统插值估计器准确识别的问题,特别是标准插值估计方法在有限样本中会违反结构概率约束并因极值运算(max-min operators)产生极端偏差,导致置信区间过窄、覆盖不足。其解决方案的关键在于提出一种基于神经网络的框架:首先设计一种锚定神经架构(anchored neural architecture),从构造上保证满足结构概率约束;其次引入精度校正的交集边界推断(precision-corrected intersection-bound inference),利用认知神经网络(Epistemic Neural Networks)实现高维场景下的可扩展不确定性量化,从而有效纠正极端偏差,确保名义覆盖性和约束严格成立。
链接: https://arxiv.org/abs/2605.07065
作者: Gandharv Patil,Keyi Tang,Raquel Aoki,Leo Guelman
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM)
备注:
Abstract:Individual treatment effects are not point-identified from data. The Probability of Necessity and Sufficiency (PNS) circumvents this limitation by characterizing individual-level causality through intersection bounds derived from combined experimental and observational data. In finite samples, however, standard plug-in estimators systematically fail: they violate structural probability constraints and suffer from extremum bias induced by max-min operators, yielding spuriously narrow intervals. We propose a neural framework for finite-sample PNS estimation that resolves both pathologies. We introduce an anchored neural architecture that guarantees structural constraint satisfaction by construction. To correct extremum bias, we employ precision-corrected intersection-bound inference, leveraging Epistemic Neural Networks for scalable, high-dimensional uncertainty quantification. Empirical evaluations confirm that this approach maintains nominal coverage and exact constraint validity in high-dimensional regimes where standard estimators systematically undercover.
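文中的交集边界在总体层面可写成 Tian–Pearl 形式。下面只演示仅用实验量 P(y|do(x)) 的简化版本(结合观测数据可进一步收紧边界);论文处理的是这类边界的有限样本估计及 max-min 算子带来的极值偏差校正,此处数值为假设。

```python
# 极简示意:仅基于实验分布的 PNS(必要且充分概率)边界。
def pns_bounds(p_y_do_x1, p_y_do_x0):
    """max(0, P(y|do(x1)) - P(y|do(x0))) <= PNS <= min(P(y|do(x1)), 1 - P(y|do(x0)))"""
    lower = max(0.0, p_y_do_x1 - p_y_do_x0)
    upper = min(p_y_do_x1, 1.0 - p_y_do_x0)
    return lower, upper

print(pns_bounds(0.75, 0.25))  # (0.5, 0.75)
```

当用插值估计量代入这两个 max/min 时,有限样本下会系统性地低估区间宽度,这正是论文要用认知神经网络进行精度校正的动机。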
[AI-162] An Interpretable and Scalable Framework for Evaluating Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)评估中因依赖平均准确率而忽略模型输出固有随机性及测试项异质性的局限性问题,以及传统项目反应理论(Item Response Theory, IRT)方法在计算复杂度高和数值不稳定方面的瓶颈。其解决方案的关键在于提出一种基于极大化-最小化(majorization-minimization)原则的可解释且可扩展的LLM评估框架,通过将原问题重构为一系列约束矩阵分解子问题,实现了参数估计的稳定性和高效性,并具备可识别性和收敛性的理论保障,从而在合成数据与真实世界基准(如MATH-500和Open LLM Leaderboard)上实现比现有方法快数个数量级的速度提升,同时保持或超越精度水平。
链接: https://arxiv.org/abs/2605.07046
作者: Xinhao Qu,Qiang Heng,Hao Zeng,Xiaoqian Liu
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item Response Theory (IRT) offers a principled framework for modeling latent model abilities and item characteristics, but conventional methods are computationally expensive and numerically unstable, limiting large-scale implementations. To address these challenges, we propose an interpretable and scalable framework for LLM evaluation based on the majorization-minimization principle. Our approach reformulates the problem as a sequence of constrained matrix factorization subproblems, enabling stable and efficient parameter estimation with theoretical guarantees for identifiability and convergence. Experiments on synthetic and real-world datasets, including MATH-500 and six Open LLM Leaderboard benchmarks, demonstrate that our method achieves superior scalability and interpretability. It delivers orders-of-magnitude speedups over competing methods while maintaining comparable or even higher estimation accuracy. Our results align with established scaling laws and offer insights into item difficulty and discrimination, informing more principled benchmark design.
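作为背景,下面给出 IRT 中 2PL 项目反应函数的最小实现:模型能力 θ 越高、测试项难度 b 越低,答对概率越大。论文把这类模型的参数估计重写为受约束矩阵分解的 MM 子问题,此处仅示意模型形式,数值为假设。

```python
# 极简示意:2PL 项目反应函数。
import numpy as np

def irt_2pl(ability, discrimination, difficulty):
    """2PL:P(答对) = sigmoid(a * (theta - b))。"""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

theta = np.array([-1.0, 0.0, 2.0])   # 三个被测模型的潜在能力(假设)
a, b = 1.5, 0.5                       # 某测试项的区分度与难度(假设)
p = irt_2pl(theta, a, b)
print(np.round(p, 3))                 # 能力越高,答对概率越大
```

相比平均准确率,这类模型把"模型能力"与"题目难度/区分度"解耦,因而能解释基准项的异质性。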
[AI-163] BGM-IV: an AI-powered Bayesian generative modeling approach for instrumental variable analysis
【速读】:该论文旨在解决高维协变量下非线性工具变量(Instrumental Variable, IV)回归中的因果估计问题,尤其针对传统方法在处理复杂非线性结构和高维特征时表现不佳的局限性。其解决方案的关键在于提出BGM-IV——一种基于潜在贝叶斯生成建模的方法,将非线性IV回归重构为因果结构化潜空间中的后验推断问题。该方法通过学习分离的潜在成分来分别捕捉共享混杂结构、结果特异性变异、处理特异性变异及仅由协变量驱动的干扰信息,并引入一种基于工具变量的伪似然函数,通过在潜模型中对工具变量诱导的处理值进行平均以校正内生性。这一框架在高维协变量场景下展现出优越性能,表明结构化潜空间生成建模是一种有效且原理清晰的非线性IV估计策略。
链接: https://arxiv.org/abs/2605.07029
作者: Guyue Luo,Qiao Liu
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注:
Abstract:Instrumental-variable (IV) regression enables causal estimation under endogeneity, but modern IV problems often involve nonlinear structural effects and high-dimensional covariates. Existing nonlinear IV methods directly learn the causal relation in observed feature space or rely on learned representations within two-stage or moment-based procedures, which can struggle when the causal information is embedded in a high-dimensional representation. We propose BGM-IV, a latent Bayesian generative modeling approach that reframes nonlinear IV regression as posterior inference in a causally structured latent space. BGM-IV infers latent components that separately capture shared confounding structure, outcome-specific variation, treatment-specific variation, and covariate-only nuisance information. To account for endogeneity, BGM-IV replaces the confounded outcome likelihood with an IV-integrated pseudo-likelihood that averages over instrument-induced treatment values within the latent model. Across various benchmark datasets, BGM-IV remains competitive in the classical low-dimensional regime and performs best in high-dimensional covariate regimes. Together, these results show that structured latent generative modeling provides a principled and effective strategy to nonlinear IV estimation with rich covariates. The code of BGM-IV is available at this https URL.
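作为对照,下面用 numpy 示意经典两阶段最小二乘(2SLS)如何在线性情形下校正内生性,这是 BGM-IV 等非线性 IV 方法所要推广的基线;数据生成过程与系数均为本文假设。

```python
# 极简示意:2SLS——先用工具变量拟合处理,再用拟合值回归结果。
import numpy as np

rng = np.random.default_rng(0)
n = 20000
z = rng.standard_normal(n)                            # 工具变量
c = rng.standard_normal(n)                            # 未观测混杂
t = 0.8 * z + c + 0.1 * rng.standard_normal(n)        # 内生处理
y = 2.0 * t + 1.5 * c + 0.1 * rng.standard_normal(n)  # 真实因果效应 = 2.0

ols = (t @ y) / (t @ t)           # 朴素回归:混杂导致高估
t_hat = z * ((z @ t) / (z @ z))   # 第一阶段:用工具变量拟合处理
iv = (t_hat @ y) / (t_hat @ t)    # 第二阶段:等价于 (z@y)/(z@t)
print(f"OLS 斜率 {ols:.2f}(有偏),2SLS 斜率 {iv:.2f}(接近 2.0)")
```

当因果关系非线性、协变量高维时,这类两阶段程序容易失效,这正是 BGM-IV 改用潜空间后验推断的出发点。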
[AI-164] Learning Cross-Atlas Consistent Brain Disorder Representations via Disentangled Multi-Atlas Functional Connectivity Learning
【速读】:该论文旨在解决功能连接(Functional Connectivity, FC)构建中因脑图谱(brain atlas)选择差异导致的异质性问题,即不同脑分区方案可能强调大脑网络的不同组织特征,从而引发结果不一致甚至矛盾。其解决方案的关键在于提出一种多图谱解耦连接学习框架(Multi-Atlas Disentangled Connectivity LEarning, MADCLE),通过多分支表示学习联合编码来自不同脑图谱的功能连接矩阵;该框架不引入单一共享潜在变量,而是分别学习各图谱下的疾病相关表征,并借助分布对齐机制确保跨图谱一致性;同时,利用协变量相似性监督、图谱特异性重建和去相关约束,将协变量相关和图谱依赖的残差因素独立建模,从而减少非疾病信息和图谱特异性噪声对疾病嵌入的干扰。
链接: https://arxiv.org/abs/2605.07026
作者: Minheng Chen,Chao Cao,Jing Zhang,Tianming Liu,Dajiang Zhu
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Functional connectivity (FC) derived from resting-state fMRI is widely used to characterize large-scale brain network alterations in neurological and psychiatric disorders. However, FC construction critically depends on the choice of brain atlas, and different parcellations may emphasize distinct organizational features, leading to heterogeneous and sometimes inconsistent representations. Existing multi-atlas approaches partially alleviate this issue but often fuse atlas-derived features or predictions at a relatively shallow level, while single-atlas disentanglement methods do not explicitly address cross-atlas heterogeneity. We propose Multi-Atlas Disentangled Connectivity LEarning (MADCLE), a multi-branch representation learning framework that jointly encodes FC matrices derived from different brain atlases. Rather than introducing a single explicitly shared latent variable across parcellations, MADCLE learns atlas-wise disease-related representations and encourages them to be cross-atlas consistent through distributional alignment. Meanwhile, covariate-related and atlas-dependent residual factors are modeled separately using covariate similarity supervision, atlas-specific reconstruction, and decorrelation constraints, thereby reducing the leakage of non-disease and parcellation-dependent information into the disease-related embeddings. Experiments on the ADNI and ADHD-200 datasets suggest that MADCLE achieves competitive or improved performance compared with single-atlas baselines, multi-atlas GNN/Transformer models, and recent multi-atlas consistency frameworks. These results support the potential value of structured disentanglement for FC-based disorder identification under heterogeneous parcellation schemes.
[AI-165] Drawing Lines in Psychological Space: What K-means Clustering Reveals in Simulated and Real Psychometric Data
【速读】:该论文旨在解决K-means聚类在心理测量研究中被广泛用于识别潜在群体类型(latent psychological categories)时的一个根本性问题:其经典形式并不检验这些群体是否真实存在,而是基于几何距离将多维空间划分为紧凑、近似球形的区域,可能导致在无真实子群结构的数据中仍产生看似稳定的聚类结果。解决方案的关键在于通过一系列受控模拟数据集验证这一局限性,并将其扩展至SMARVUS大规模国际心理测量数据集(包含来自35个国家大学生的调查响应),对比模拟与实证数据的聚类模式,从而表明K-means即使在连续高斯潜空间中无真实子群结构时,也能生成稳定且视觉上一致的聚类解,提示研究者需谨慎解读其结果并考虑使用更符合潜变量建模逻辑的方法。
链接: https://arxiv.org/abs/2605.06989
作者: Pedro Henrique Ramos Pinto,Maria Jullyanna Ferreira Marques,Luiz Carlos Serramo Lopez
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注: Methodological study on K-means clustering in psychometric data using simulated and empirical datasets
Abstract:K-means clustering is widely used in psychological and psychometric research to identify profiles, subgroups, and potential typologies, yet its classical formulation does not test whether such groups exist as latent psychological categories. Instead, K-means partitions multidimensional space into regions around centroids, favoring compact, approximately spherical clusters defined by geometric distance. In this paper, we examine this limitation through a sequence of controlled simulated datasets. We then extend the analysis to the SMARVUS dataset, a large international psychometric dataset comprising survey responses from university students across 35 countries, to evaluate whether similar geometric partitioning patterns emerge in empirical psychological data. By contrasting simulated and empirical data, this paper argues that K-means can produce stable and visually coherent clustering solutions even in continuous Gaussian latent spaces without true subgroup structure.
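文中的核心警示可以用几行 numpy 复现:对无任何子群结构的单一(拉长的)高斯云,两次不同初始化的 K-means 仍会给出稳定、近乎一致的划分。以下为纯 numpy 的 Lloyd 迭代示意,并非正文分析 SMARVUS 数据所用的实现。

```python
# 极简示意:K-means 在无真实子群的高斯数据上仍产生"稳定"划分。
import numpy as np

def kmeans(X, k, seed, iters=50):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))
X[:, 0] *= 3                      # 拉长的单一高斯:连续分布,无子群
l1 = kmeans(X, 2, seed=1)
l2 = kmeans(X, 2, seed=2)
agree = max((l1 == l2).mean(), (l1 != l2).mean())  # 允许簇标签置换
print(f"两次独立运行的划分一致率:{agree:.3f}")
```

高一致率本身并不能证明潜在类别存在:算法只是沿长轴把连续分布"切"成两半,这正是需要谨慎解读聚类结果的原因。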
[AI-166] Decentralized Time-Varying Optimization for Streaming Data via Temporal Weighting
【速读】:该论文旨在解决分布式网络中基于流式数据的优化问题,即在动态环境中,各代理(agent)持续接收新样本,且目标函数随时间变化,如何设计高效、可扩展的优化算法以跟踪时变最优解。其核心解决方案是采用一种结构化的加权形式,通过分布式梯度下降(Decentralized Gradient Descent, DGD)方法,在有限通信与计算预算下实现对时间加权目标函数最小值的跟踪。关键创新在于从固定点理论视角分析跟踪误差,揭示其由固定点追踪项和因数据异质性引入的偏差项组成,并对比了均匀权重与指数折扣权重两种策略下的性能差异,指出去中心化结构在恒定步长下会引入非零偏差底限,从而为实际系统中的参数设计提供了理论依据。
链接: https://arxiv.org/abs/2605.06971
作者: Muhammad Faraz Ul Abrar,Nicolò Michelusi,Erik G. Larsson
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Classical optimization theory largely focuses on fixed objective functions, whereas many modern learning systems operate in dynamic environments where data arrive sequentially and decisions must be updated continuously. In this work, we study optimization with streaming data over a distributed network of agents. We adopt a structured, weight-based formulation that explicitly captures the streaming-data origin of the time-varying objective: at each time step, every agent receives a new sample, and the network seeks to track the minimizer of a temporally weighted objective formed from all samples observed across the network so far. We focus on decentralized gradient descent (DGD) with a limited communication/computation budget, where at each time step, only a limited number of DGD iterations can be performed before the objective changes again. For strongly convex and smooth losses, we analyze the tracking error with respect to the time-varying minimizer through a fixed-point theory lens. Our analysis reveals that the tracking error decomposes into a fixed-point tracking term and a bias term induced by data heterogeneity across agents. We specialize the analysis to two natural weighting strategies: uniform weights, which treat all samples equally, and exponentially discounted weights, which geometrically decay the influence of older data. Under uniform weighting, DGD tracks the fixed-point at a rate O(1/t), whereas discounted weighting yields a non-vanishing fixed-point tracking floor controlled by the discount factor. In both cases, decentralization induces an additional non-zero bias floor under a constant step size. We validate our theoretical findings through numerical simulations.
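下面用两个智能体的标量流式问题示意均匀权重下的 DGD:每步各收到一个新样本,做一次通信混合和一步梯度下降,跟踪"全网历史样本均值"这一时变最优解。示例同时能观察到正文所述由数据异质性与常数步长导致的非零偏差底限;场景、混合矩阵与参数均为本文假设。

```python
# 极简示意:均匀权重流式 DGD(每步通信/计算预算均为 1)。
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[0.5, 0.5], [0.5, 0.5]])   # 双随机混合矩阵(两智能体全连接)
x = np.zeros(2)                           # 各智能体的当前估计
sums = np.zeros(2)
count = 0
lr = 0.3
for t in range(1, 2001):
    s = rng.standard_normal(2) + np.array([1.0, 3.0])  # 异质数据流(均值不同)
    sums += s
    count += 1
    local_mean = sums / count             # 各自的局部目标:本地历史均值
    x = W @ x - lr * (x - local_mean)     # 一步 DGD:先混合,再梯度下降
global_opt = sums.sum() / (2 * count)     # 均匀权重下的时变全局最优解
err = np.abs(x - global_opt).max()
print(f"t=2000 时的最大跟踪误差:{err:.3f}")
```

平均估计收敛到全局最优,但由于两智能体局部目标不同且步长恒定,单个智能体的估计停留在一个非零偏差底限上,与正文的分解结论一致。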
[AI-167] LLM-Guided Open Hypothesis Learning from Autonomous Scanning Probe Microscopy Experiments
【速读】:该论文旨在解决当前自主实验流程在科学发现中局限于固定目标或假设空间的问题,即现有方法难以从实验数据中自动推导出新的物理模型,而仅能进行参数优化或测量选择。其解决方案的关键在于提出了一种开放假设学习框架,融合符号回归(symbolic regression)与基于大语言模型(large-language-model-based)的物理合理性评估机制:符号回归从稀疏实验数据中生成候选解析关系,而语言模型则依据物理合理性、量纲一致性及已知机制对候选表达式进行排序和筛选,从而实现从实验数据中自动生成可解释的物理定律。该方法在压电响应力显微镜(piezoresponse force microscopy)中成功实现了对铁电畴翻转动力学规律的自主发现,标志着自主显微技术从闭合回路优化向开放式假设生成的演进。
链接: https://arxiv.org/abs/2605.06839
作者: Boris Slautin,Utkarsh Pratiush,Yu Liu,Kamyar Barakati,Sergei Kalinin
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 21 pages, 6 figures, 1 table
Abstract:Autonomous experimentation has transformed microscopy and materials discovery by enabling closed-loop optimization including imaging and spectroscopy tuning, structure-property relationship discovery, and exploration of combinatorial libraries. However, most current workflows remain limited to selecting measurements within fixed objective or hypothesis spaces, rather than generating new physical models from experimental data. Here, we introduce an open hypothesis-learning framework that combines symbolic regression with large-language-model-based physical evaluation and implement it for autonomous scanning probe microscopy. Symbolic regression generates candidate analytical relationships directly from sparse measurements, while the language-model evaluator ranks these candidates according to physical plausibility, scaling behavior, and consistency with known mechanisms. We demonstrate the approach on autonomous piezoresponse force microscopy measurements of ferroelectric domain switching in a PZT thin film. Starting from five seed measurements, the workflow evolves from physically incomplete candidate expressions toward interpretable voltage-time growth laws consistent with kinetic domain-wall motion. This work extends autonomous microscopy from closed-loop optimization toward open hypothesis discovery, where candidate physical laws emerge from the experiment itself rather than being specified in advance. More broadly, the framework establishes a route for integrating symbolic regression, physical reasoning, and adaptive experimentation into hierarchical autonomous scientific workflows.
[AI-168] Overcoming data scarcity through multi-center federated learning for organs-at-risk segmentation in pediatric upper abdominal radiotherapy
【速读】:该论文旨在解决基于深度学习的器官/结构危及器官(OARs)自动勾画模型在儿科患者中性能下降的问题,其根源在于儿科数据稀缺且分散于不同医疗中心。为应对这一挑战,研究提出采用联邦学习(Federated Learning, FL)方案,在不共享原始数据的前提下实现跨中心协作建模,关键在于通过云端安全权重交换机制,在机构防火墙外完成模型训练与聚合,从而提升模型对多中心儿科上腹部肿瘤CT图像的泛化能力和鲁棒性。实验表明,FL模型在至少7个OAR上的Dice相似系数(DSC)达到本地模型水平,并在跨中心验证中取得最优表现,同时减少因手术切除肾脏导致的假阳性分割问题。
链接: https://arxiv.org/abs/2605.06820
作者: Mianyong Ding,Maximilian Knoll,Semi Harrabi,Martine van Grotel,Annemieke S. Littooij,Max van Noesel,Jens-Peter Schenk,Marry M. van den Heuvel-Eibrink,Geert O. Janssens,Matteo Maspero
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning-based organs/structures-at-risk (OARs) auto-contouring models can improve radiotherapy workflows, but models trained on adult data often underperform in pediatric patients. Developing robust pediatric-specific models is hindered by data scarcity and fragmentation across centers. Federated learning (FL) enables privacy-preserving collaborative training without the need for data sharing. We evaluated the feasibility and performance of FL for developing pediatric-specific OAR segmentation models across two European medical centers. Computed tomography (CT) images from pediatric patients from Utrecht and Heidelberg with a renal tumor or abdominal neuroblastoma were retrospectively collected and locally processed. An nnU-Net-based framework segmented 19 OARs using local and FL schemes. FL was implemented with secure weight exchange on a cloud storage across institutional firewalls. Performance was assessed using the Dice similarity coefficient (DSC), 95th percentile Hausdorff distance, and mean surface distance. Robustness to patient orientation, false-positive segmentation of surgically removed kidneys, and failure cases were identified. A total of 310 postoperative CTs from 272 patients (105 renal tumors, 167 neuroblastomas) were included. Local models performed well on their respective center data but showed significantly reduced cross-center performance for four to seven of the nine evaluated OARs (DSC). In contrast, the FL model matched local performance for at least seven of nine OARs and achieved the best cross-center results across three metrics, with DSC gains of 0.003-0.007 over local models. FL also maintained stable performance across patient orientations and reduced false-positive kidney segmentations. Real-world FL improves cross-center robustness of CT-based OAR segmentation models in pediatric upper abdominal tumors.
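FL 方案的核心聚合步骤可用 FedAvg 加权平均来示意:各中心只交换模型权重(原始影像数据不出机构防火墙),服务器按样本量加权。以下 numpy 片段仅为概念演示,中心名称与权重数值均为本文假设,并非论文的 nnU-Net 训练流程。

```python
# 极简示意:FedAvg 按样本量加权的模型参数聚合。
import numpy as np

def fedavg(weights_list, n_samples):
    """对各中心上传的参数列表按样本数加权平均。"""
    total = sum(n_samples)
    agg = [np.zeros_like(w) for w in weights_list[0]]
    for w_center, n in zip(weights_list, n_samples):
        for a, w in zip(agg, w_center):
            a += (n / total) * w
    return agg

# 两个中心各自的(假设)模型参数与样本量
w_utrecht = [np.array([1.0, 2.0]), np.array([[0.5]])]
w_heidelberg = [np.array([3.0, 4.0]), np.array([[1.5]])]
agg = fedavg([w_utrecht, w_heidelberg], n_samples=[150, 50])
print(agg[0])  # [1.5 2.5]
```

每轮本地训练后重复这一聚合,即可在不共享数据的前提下得到兼顾两中心分布的全局模型。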
[AI-169] A Linear-Transformer Hybrid for SNP-Based Genotype-to-Phenotype Prediction in Grapevine
【速读】:该论文旨在解决复杂性状在不同田间条件和年份间难以准确进行基因型到表型(Genotype-to-Phenotype, G2P)预测的问题,从而限制了育种决策效率与遗传增益的提升。其解决方案的关键在于提出一种线性-Transformer架构(LiT-G2P),通过将加性遗传方差效应与基于Transformer的非线性互作学习相结合,利用全基因组单核苷酸多态性(SNP)数据实现自动化预测建模。该方法在葡萄叶片毛密度和毛状体密度两个性状上均展现出优于基线模型的跨年稳定预测性能,同时借助注意力权重提取优先SNP并进行基因型分层分析,增强了模型的可解释性与候选标记筛选能力,为基于SNP的基因组选择提供了更具鲁棒性和实用性的预测框架。
链接: https://arxiv.org/abs/2605.06762
作者: Yibin Wang,Murukarthick Jayakodi,Silvas Kirubakaran,Ambika Chandra,Azlan Zahid
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 Figures
Abstract:Robust genotype-to-phenotype (G2P) prediction is essential for accelerating breeding decisions and genetic gain. However, it remains challenging to measure complex traits under variable field conditions and across years. In this study, we propose a linear-Transformer approach, LiT-G2P (Linear-Transformer Genotype-to-Phenotype), an automated predictive framework that integrates additive genetic variance effects with Transformer-based nonlinear interactions using genome-wide single-nucleotide polymorphisms (SNPs) data. We evaluated LiT-G2P on a panel of diverse grape accessions, genotyped with SNP markers and measured for phenotypes across two consecutive years. Target phenotypic traits include leaf hair density and trichome density of grapevines. Across both single-year and cross-year testing scenarios, LiT-G2P consistently improves prediction performance compared with baseline models. For hair density, LiT-G2P achieves the lowest error in both single-year and cross-year evaluations, with RMSEs of 0.469 and 0.454, respectively, while maintaining strong tolerance accuracies of 79.2% and 74.6%, respectively. For trichome density, LiT-G2P also presents the best overall G2P performance. In addition, we extract model-prioritized SNPs from attention weights and apply genotype-stratified analysis to provide interpretable candidate markers for downstream validation. These results demonstrate that integrating stable additive effects with learned interaction patterns can enhance cross-year robustness and support practical SNP-based predictive modeling for genomic selection.
[AI-170] A Statistical Framework for Algorithmic Collective Action with Multiple Collectives
Quick Read: This paper addresses the modeling and analysis of Algorithmic Collective Action (ACA) in shared learning systems, focusing on how multiple independent collectives can influence a classifier's behavior by adjusting the data they share, whereas existing work is largely limited to single-collective settings. The key to its solution is the first statistical framework for multi-collective ACA, which quantifies how the sizes of the collectives and the alignment of their goals affect the probability of success, and provides statistical bounds that each collective can compute with only partial knowledge of the others, giving actionable guidance for decision-making. The framework is validated on simulations inspired by climate-adaptation interventions in smart cities, demonstrating its effectiveness and practicality.
Link: https://arxiv.org/abs/2605.06749
Authors: Claudio Battiloro, Pietro Greiner, Dario Rancati, Bret Nestor, Oumaima Amezgar, Francesca Dominici
Affiliations: unknown
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI)
Comments: 27 pages, 16 figures
Abstract:As learning systems increasingly shape everyday decisions, Algorithmic Collective Action (ACA), i.e., users coordinating changes to shared data to steer model behavior, offers a complement to regulator-side policy and corporate model design. Real-world collective actions have traditionally been decentralized and fragmented into multiple collectives, despite sharing overarching objectives, with each collective differing in size, strategy, and actionable goals. However, most of the ACA literature focuses on single collective settings. To address this, we propose the first comprehensive statistical framework for ACA with multiple collectives acting on the same system. In particular, we focus on collective action in classification, studying how multiple collectives can influence a classifier’s behavior. We provide quantitative statistical bounds on the success of the collectives, considering the role and the interplay of the collectives’ sizes and the alignment of their goals. We make such bounds computable by each collective with only partial knowledge of other collectives’ sizes and strategies. Finally, we numerically illustrate our framework on simulations inspired by interventions for climate adaptation in smart cities, demonstrating the usefulness of our bounds.
[AI-171] OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning
Quick Read: This paper targets a bottleneck in current transcriptomic analysis: existing models either consume expression profiles without producing natural-language explanations, or reason in language without direct access to quantitative omics data. The key to its solution is OmicsLM, a multimodal large language model (MLLM) that connects quantitative omics features with natural-language tasks. Its core innovation is to encode each transcriptomic profile as a continuous vector embedding injected into the LLM context, preserving the quantitative expression signal while supporting natural-language instructions, explicit gene mentions, and interleaving of multiple samples, thereby enabling language-guided multi-sample omics reasoning.
Link: https://arxiv.org/abs/2605.06728
Authors: Maciej Sypetkowski, Joanna Krawczyk, Łukasz Smoliński, Remigiusz Kinas, Przemysław Pietrzak, Tomasz Jetka, Rafał Powalski
Affiliations: unknown
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Cell Behavior (q-bio.CB)
Comments: 13 pages (main text), 14 pages (appendix), 1 figure, 10 tables
Abstract:Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most current models either consume expression profiles without producing natural-language biological explanations, or reason in language without direct access to quantitative omics measurements. We introduce OmicsLM, a multimodal LLM that connects quantitative omics profiles with natural-language biological tasks. OmicsLM represents each transcriptomic profile as a compact continuous representation within the LLM context. This interface preserves quantitative expression signal while allowing natural-language instructions, explicit gene mentions, and multiple interleaved biological samples to be processed together in one model context. We train OmicsLM on more than 5.5 million instruction-following examples spanning over 70 task types, combining continuous transcriptomic inputs, experimental data rendered through diverse language templates, and free-text biological knowledge and question-answering data. This mixture covers cell type annotation, perturbation prediction, clinical prediction, pathway reasoning, and open-ended biological question answering. Existing benchmarks evaluate either profile-level prediction or text-only biological QA, leaving language-guided, multi-sample reasoning over real expression profiles unmeasured. To close this gap, we introduce GEO-OmicsQA, a benchmark for multi-sample biological question answering built from real Gene Expression Omnibus (GEO) studies. We demonstrate that OmicsLM can use expression profiles directly and perform comparably to specialized omics models on profile-level tasks, while outperforming both omics-specialized models and general LLMs on language-guided biological reasoning over expression data.
Machine Learning
[LG-0] Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping
Link: https://arxiv.org/abs/2605.08075
Authors: Maryam Maghsoudi, Shihab Shamma
Subjects: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Decoding imagined speech from non-invasive brain recordings is challenging because imagined datasets are scarce and difficult to align temporally across subjects and sessions. In this work, we propose a new approach to the decoding of imagined speech that leverages the richer and more reliably labeled recordings during listening to speech. We collected paired listened and imagined MEG recordings to rhythmic, melodic, and spoken stimuli from trained musicians. Using trained musicians helped improve temporal alignment across conditions. We then developed a three-stage decoding pipeline that revealed consistent and meaningful relationships between neural activity evoked by imagining and listening to the same stimuli. First, we trained six linear and neural models to map imagined MEG responses to listened responses. We evaluated these models against a null baseline from unseen subjects to validate that the predicted-listening responses preserve stimulus-specific information. In the second stage, we trained a contrastive word decoder exclusively on the listened MEG responses, and evaluated it using four embedding strategies including semantic, acoustic, and phonetic representations. In the third stage, we process the imagined MEG responses from held-out subjects through the mapping pipeline to compute the corresponding listening responses that are then decoded by the listened decoder. Using rank-based analysis, we show that the imagined words are decodable significantly above chance. We shall report here the results of a proof-of-concept implementation to decode imagined speech, where all evaluations are performed on held-out subjects. We also demonstrate that performance improves with training data size, suggesting that this approach is scalable and can directly be made applicable to realistic brain-computer interface scenarios.
[LG-1] GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs
Link: https://arxiv.org/abs/2605.08074
Authors: Peyman Baghershahi, Fangxin Wang, Debmalya Mandal, Sourav Medya
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 9 figures, 8 tables
Abstract:Conformal prediction (CP) provides a distribution-free approach to uncertainty quantification with finite-sample guarantees. However, applying CP to graph neural networks (GNNs) remains challenging as the combinatorial nature of graphs often leads to insufficiently certain predictions and indiscriminative embeddings. Existing methods primarily rely on embedding-space proximity for localization, which can be unreliable for graphs and yield inefficient prediction sets. We propose GRAPHLCP, a proximity-based localized CP framework that explicitly incorporates graph topology and inter-node dependencies into localization and weighting. Our approach introduces a feature-aware densification step to mitigate locality bias in sparse graphs, followed by a Personalized PageRank-based kernel computation to model structural proximity. This enables topology-dependent anchor sampling and calibration weighting that captures both local and long-range dependencies. Extensive experiments on several regression and classification datasets demonstrate that GRAPHLCP guarantees marginal coverage with finite samples while efficiently attaining favorable test conditional coverage across various conditioning scenarios.
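To make the structural-proximity idea above concrete, here is a minimal sketch of a Personalized PageRank (PPR) proximity matrix used to weight calibration scores in a localized conformal threshold. This is an independent illustration, not the GRAPHLCP implementation; `ppr_kernel`, `weighted_quantile`, and the two-triangle toy graph are our own assumptions:

```python
import numpy as np

def ppr_kernel(adj, alpha=0.15, iters=100):
    """Personalized PageRank proximity via power iteration.

    Row i approximates the PPR distribution seeded at node i; entries act
    as structural proximity weights between nodes (alpha = restart mass).
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    P = adj / np.maximum(deg, 1)              # row-stochastic random walk
    R = np.eye(n)
    for _ in range(iters):
        R = alpha * np.eye(n) + (1 - alpha) * R @ P
    return R

def weighted_quantile(scores, weights, q):
    """Weighted q-quantile of calibration scores: a localized CP threshold
    in which structurally close calibration points count more."""
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cum = np.cumsum(w) / w.sum()
    return s[np.searchsorted(cum, q)]

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
K = ppr_kernel(A)
```

Within-triangle pairs receive larger proximity than cross-triangle pairs, so the weighted quantile leans on structurally close calibration nodes, capturing both local and longer-range graph dependencies.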
[LG-2] Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
Link: https://arxiv.org/abs/2605.08053
Authors: Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied by Porteus (1975), we derive two Q-value-style extensions and show that the associated operators are contractions in the L_\infty and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning–style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.
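The multiplicative Bellman-type recursion behind such operators can be illustrated with a toy value iteration in which the discount enters as a power (the sublinear power-law behavior mentioned above). This is our own simplified sketch for a risk-seeking parameter beta > 0, not the paper's algorithms:

```python
import numpy as np

def risk_sensitive_vi(P, R, beta=0.5, gamma=0.9, iters=200):
    """Toy value iteration for an exponential-utility (beta > 0) objective,
    using a multiplicative Bellman-type recursion in the spirit of
    Porteus (1975):

        V(s) <- max_a sum_s' P[a, s, s'] * exp(beta * R[a, s]) * V(s') ** gamma

    The discount gamma enters as a power, i.e. a sublinear power-law
    operator; iterates contract in the sup-log (Thompson) metric on
    positive vectors.
    """
    nA, nS, _ = P.shape
    V = np.ones(nS)
    for _ in range(iters):
        Q = np.exp(beta * R) * (P @ (V ** gamma))   # shape (nA, nS)
        V = Q.max(axis=0)
    return V, Q.argmax(axis=0)

# Two states, two actions: action 0 stays put, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[1.0, 0.0],        # reward 1 for staying in state 0
              [0.0, 0.0]])
V, pi = risk_sensitive_vi(P, R)
```

For this chain the fixed point satisfies log V(0) = beta/(1 - gamma) = 5 and log V(1) = gamma * 5 = 4.5, and the greedy policy stays in state 0 and switches out of state 1.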
[LG-3] Don't Get Your Kroneckers in a Twist: Gaussian Processes on High-Dimensional Incomplete Grids
Link: https://arxiv.org/abs/2605.08036
Authors: Mads Greisen Højlund, August Smart Lykke-Møller, Henry Moss, Ove Christiansen
Subjects: Machine Learning (cs.LG)
Comments: 51 pages, 8 figures
Abstract:We introduce CUTS-GPR, a new method for performing numerically exact Gaussian process regression (GPR) in high-dimensional settings. The key component of CUTS-GPR is an extremely fast kernel matrix-vector product, which exhibits near-linear or even linear scaling with the amount of training data, N , and low-order polynomial scaling with dimensionality, D . This is obtained by combining an additive kernel with an incomplete grid and exploiting the resulting structure of the kernel matrix. We demonstrate the scalability of the matrix-vector product by running benchmarks with billions of data points and thousands of dimensions. Full GPR calculations, including hyperparameter optimization, are completed in a matter of hours for N = 447 265 and D = 24 . We demonstrate that our CUTS-GPR enables Bayesian modeling of high-dimensional potential energy surfaces - a longstanding challenge in computational chemistry.
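The near-linear matrix-vector product rests on exploiting kernel structure. For an additive kernel, a simplified version of the trick can be shown directly: when a coordinate takes only m distinct grid values, the per-dimension matvec drops from O(N^2) to O(N*m + m^2) by pooling. This is an illustrative reduction in the same spirit, not the CUTS-GPR code; the function names and the RBF choice are our assumptions:

```python
import numpy as np

def additive_kernel_mvp(X, v, lengthscale=1.0):
    """Fast K @ v for the additive RBF kernel k(x, x') = sum_d k_d(x_d, x'_d).

    For each dimension d: pool v over points sharing a grid value, apply the
    small m x m one-dimensional kernel, then scatter back to the points.
    Cost is O(N*m + m^2) per dimension instead of O(N^2)."""
    N, D = X.shape
    out = np.zeros(N)
    for d in range(D):
        u, inv = np.unique(X[:, d], return_inverse=True)   # m distinct values
        pooled = np.bincount(inv, weights=v, minlength=len(u))
        Kd = np.exp(-0.5 * ((u[:, None] - u[None, :]) / lengthscale) ** 2)
        out += (Kd @ pooled)[inv]                          # scatter to points
    return out

def dense_mvp(X, v, lengthscale=1.0):
    """Reference O(N^2 * D) dense evaluation of the same additive kernel."""
    diff = X[:, None, :] - X[None, :, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2).sum(axis=-1) @ v

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(50, 3)).astype(float)   # points on a 3^3 grid
v = rng.normal(size=50)
fast, slow = additive_kernel_mvp(X, v), dense_mvp(X, v)
```

The pooled and dense products agree exactly (up to floating-point rounding), while the pooled version scales with the number of distinct grid values per coordinate rather than with N^2.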
[LG-4] Adaptive Domain Decomposition Physics-Informed Neural Networks for Traffic State Estimation with Sparse Sensor Data
Link: https://arxiv.org/abs/2605.08028
Authors: Eunhan Ka, Ludovic Leclercq, Satish V. Ukkusuri
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 56 pages, 5 figures, 12 tables. Submitted to Transportation Research Part C
Abstract:Traffic state estimation from sparse fixed sensors is challenging because physics-informed neural networks (PINNs) tend to over-smooth the shockwaves admitted by the Lighthill-Whitham-Richards (LWR) model. This study proposes Adaptive Domain Decomposition Physics-Informed Neural Networks (ADD-PINN), a two-stage residual-guided framework for LWR-based offline speed-field reconstruction. A coarse global PINN is first trained; its spatial residual profile is then used to place subdomain boundaries and initialize child subnetworks in a decomposition-enabled mode, while a data-driven shock indicator can retain a single-domain fallback when localized evidence of transition is weak. The primary offline I-24 MOTION evaluation spans five days, five sensor configurations, and ten seeds per configuration, yielding 1,500 runs in total. Against neural and physics-informed baselines, ADD-PINN attains the lowest relative L2 error in 18 of 25 configurations and in 14 of 15 sparse-sensing cases, while training 2.4 times faster than the extended PINN (XPINN) baseline. An ablation study supports spatial-only decomposition as an effective default for fixed-sensor traffic reconstruction in the evaluated settings. Supplementary Next Generation Simulation (NGSIM) experiments serve as a negative control: the shock indicator suppresses decomposition in all 50 runs, and the default single-domain fallback ranks first across all sensor configurations. These results support residual-guided spatial decomposition as an effective PINN-family design for offline reconstruction when sparse fixed sensing coincides with localized transition regions.
[LG-5] Interpreting Reinforcement Learning Agents with Susceptibilities
Link: https://arxiv.org/abs/2605.08007
Authors: Chris Elliott, Einar Urdshals, David Quarel, Daniel Murfet
Subjects: Machine Learning (cs.LG)
Comments: 55 pages, comments welcome
Abstract:Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate the utility of susceptibilities in a simple gridworld model that nevertheless exhibits non-trivial stagewise development. We argue that susceptibilities reveal internal features of the development of the model in parameter space that one cannot detect purely by studying the development of the learned policy. We validate these results with activation-steering, and discuss the framework’s extension to RLHF post-training.
[LG-6] STEPS: A Temporal Smooth Error Propagation Solver on the Manifolds for Test-Time Adaptation in Time Series Forecasting NEURIPS2026
Link: https://arxiv.org/abs/2605.08005
Authors: Jiaqi Liu, Yifan Ouyang, Zhifei Song, Sim Kuan Goh, Ashwaq Qasem
Subjects: Machine Learning (cs.LG)
Comments: 9 pages main text, appendix included. 7 figures. Submitted to NeurIPS 2026
Abstract:Test-Time Adaptation (TTA) aims to improve time series forecasting under distribution shifts by using limited observations revealed during inference. However, forecasting TTA must operate in a source-free online setting, where the adaptation signal is short, temporally correlated, and potentially noisy. Existing methods can therefore suffer from weak identifiability, error accumulation, and unstable long-horizon corrections when the revealed prefix is sparse or contaminated. To address these issues, we propose STEPS, a Smooth Temporal Error Propagation Solver for TTA in time-series forecasting. STEPS reformulates forecasting TTA as a Dirichlet Boundary Value Problem on a temporal manifold, where the revealed prefix error serves as the boundary condition for the unknown future error field. Then, STEPS solves a smooth and bounded correction field in prediction space: a Local Solver propagates prefix errors under temporal smoothness, a Global Solver retrieves stable cross-window error memory and Spatiotemporal Manifold Fusion (SMF) integrates both solutions into the final correction. Across six standard benchmarks and four frozen backbones, STEPS achieves an average relative MSE reduction of 26.82% over the zero-shot backbone, exceeding the strongest compared TTA baseline by 12.77%. Additional sparse prefix and contamination tests confirm the robustness of STEPS under limited and noisy prefixes.
[LG-7] Bayesian Sensitivity of Causal Inference Estimators under Evidence-Based Priors
Link: https://arxiv.org/abs/2605.07993
Authors: Nikita Dhawan, Daniel Shen, Leonardo Cotta, Chris J. Maddison
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: TMLR 2026
Abstract:Causal inference, especially in observational studies, relies on untestable assumptions about the true data-generating process. Sensitivity analysis helps us determine how robust our conclusions are when we alter these underlying assumptions. Existing frameworks for sensitivity analysis are concerned with worst-case changes in assumptions. In this work, we argue that using such pessimistic criteria can often become uninformative or lead to conclusions contradicting our prior knowledge about the world. To demonstrate this claim, we generalize the recent s-value framework (Gupta & Rothenhäusler, 2023) to estimate the sensitivity of three different common assumptions in causal inference. Empirically, we find that, indeed, worst-case conclusions about sensitivity can rely on unrealistic changes in the data-generating process. To overcome this, we extend the s-value framework with a new sensitivity analysis criterion: Bayesian Sensitivity Value (BSV), which computes the expected sensitivity of an estimate to assumption violations under priors constructed from real-world evidence. We use Monte Carlo approximations to estimate this quantity and illustrate its applicability in an observational study on the effect of diabetes treatments on weight loss.
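The Monte Carlo approximation of an expected sensitivity can be sketched in a few lines. Everything below is a hypothetical toy (a linear bias model with a Normal prior on the violation strength), not the paper's estimator:

```python
import numpy as np

def bayesian_sensitivity_value(estimate_fn, prior_sampler, n_mc=2000, seed=0):
    """Monte Carlo estimate of the expected sensitivity of a causal estimate.

    estimate_fn(delta) is the estimate under assumption-violation strength
    delta; instead of taking the worst case over delta, average the induced
    shift |estimate(delta) - estimate(0)| over draws from an evidence-based
    prior.
    """
    rng = np.random.default_rng(seed)
    base = estimate_fn(0.0)
    shifts = [abs(estimate_fn(d) - base) for d in prior_sampler(rng, n_mc)]
    return float(np.mean(shifts))

# Hypothetical toy: linear bias in the violation strength, Normal(0, 0.1) prior.
est = lambda delta: 2.0 + 1.5 * delta
prior = lambda rng, n: rng.normal(0.0, 0.1, size=n)
bsv = bayesian_sensitivity_value(est, prior)   # about 1.5 * 0.1 * sqrt(2/pi)
```

Under a tight, evidence-based prior the expected shift stays small even though the worst case over delta is unbounded, which is exactly the contrast with worst-case sensitivity criteria drawn in the abstract.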
[LG-8] Susceptibilities and Patterning: A Primer on Linear Response in Bayesian Learning
Link: https://arxiv.org/abs/2605.07980
Authors: Chris Elliott, Daniel Murfet
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Statistics Theory (math.ST)
Comments: 34 pages, 3 figures, comments welcome!
Abstract:These notes introduce the theory of susceptibilities as developed in [arXiv:2504.18274, arXiv:2601.12703] for interpreting neural networks. The susceptibility of an observable \phi to a data perturbation is defined as a derivative of a posterior expectation, which by the fluctuation–dissipation theorem equals a posterior covariance. Different choices of \phi yield different objects: per-sample losses give the influence matrix (the Bayesian influence function of [arXiv:2509.26544]), while component-localized observables give the structural susceptibility matrix that pairs model components with data patterns. The susceptibility matrix is (up to a factor of n\beta ) the Jacobian of the map from data distributions to structural coordinates; its pseudo-inverse provides a linearized solution to the patterning problem of [arXiv:2601.13548]: finding data perturbations that produce a desired structural change. We motivate the theory from its statistical-mechanical foundations, then give a detailed exposition of susceptibilities, their empirical estimators, and their connection to the geometry of the loss landscape.
[LG-9] Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback
Link: https://arxiv.org/abs/2605.07977
Authors: Seohyun Lee, Wenzhi Fang, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher G. Brinton
Subjects: Machine Learning (cs.LG)
Comments: 27 pages
Abstract:Recent works have advanced feedback-based learning systems, whereby a foundation model is able to intake incoming feedback (e.g., a user) to self-improve, creating a self-loop system of training. However, existing works are limited in needing to consider an offline setup to allow for such feedback-based methods, and are further limited in the need of requiring privileged ground-truth contexts for training. Moreover, there is limited consideration of federated learning (FL), which is particularly well-suited for incorporating external feedback across large networks of end users, for example, but requires methods to be efficient for training on resource-constrained edge devices. Therefore, we introduce SPEAR (Self-Play Enhancement via Advantage-Weighted Refinement), an efficient online learning algorithm for federated LLM fine-tuning. SPEAR utilizes a feedback-guided self-play loop to construct naturally contrastive pairs per prompt which are utilized to be trained on (i) standard maximum likelihood on correct completions and (ii) confidence-weighted unlikelihood on tail tokens of incorrect completions. Without the need of expensive group generations and ground-truth contexts for training (i.e., only partial, non-answer feedback), in contrast with existing works, SPEAR can be trained both online and in a resource-efficient manner. We validate SPEAR across various benchmark datasets, demonstrating its superior performance in comparison to state-of-the-art baselines. The implementation code is publicly available at this https URL.
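The two training signals can be illustrated on per-token probabilities. This is a deliberately simplified, hypothetical rendering of the objective (the actual SPEAR loss operates on model logits during federated fine-tuning, and its exact weighting may differ):

```python
import numpy as np

def spear_loss(p_correct, p_incorrect, tail_frac=0.5):
    """Hypothetical simplification of SPEAR's two-part objective.

    (i) maximum likelihood on the correct completion's token probabilities;
    (ii) confidence-weighted unlikelihood, -p * log(1 - p), on the *tail*
    tokens of the incorrect completion, so confidently wrong tokens are
    penalized hardest.
    """
    like = -np.mean(np.log(np.asarray(p_correct)))
    p_inc = np.asarray(p_incorrect)
    k = max(1, int(len(p_inc) * tail_frac))
    tail = p_inc[-k:]
    unlike = -np.mean(tail * np.log(1.0 - tail))
    return like + unlike

good = spear_loss([0.9, 0.9], [0.1, 0.1])   # confident and right, meekly wrong
bad = spear_loss([0.9, 0.9], [0.9, 0.9])    # confidently wrong completion
```

The confidence weight means a model that assigns high probability to an incorrect completion incurs a much larger penalty than one that is already unsure of it.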
[LG-10] When Diffusion Model Can Ignore Dimension: An Entropy-Based Theory
Link: https://arxiv.org/abs/2605.07969
Authors: Ahmad Aghapour, Erhan Bayraktar
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:
Abstract:Diffusion models perform remarkably well on high-dimensional data such as images, often using only a modest number of reverse-time steps. Despite this practical success, existing convergence theory does not fully explain why such samplers remain efficient in high dimensions. Many prior KL guarantees bound the discretization error in terms of the ambient dimension, while other improved results replace this dependence using intrinsic-dimensional or geometric structure assumptions. In this work, we develop an alternative information-theoretic perspective on diffusion sampler convergence. We prove that, for Gaussian mixture targets, the discretization error is controlled by the Shannon entropy of the latent mixture component rather than by the ambient dimension. Consequently, the leading step complexity scales linearly with latent entropy and depends only logarithmically on the second moment of the data. Our analysis also extends to discrete target distributions, where the relevant complexity is the entropy of the target rather than the dimension of the embedding space. These results suggest that diffusion sampling can remain efficient in high-dimensional spaces when the data distribution admits a compact latent representation, as is widely believed to be the case for natural images.
[LG-11] Aggregation in conformal e-classification
Link: https://arxiv.org/abs/2605.07963
Authors: Vladimir Vovk
Subjects: Machine Learning (cs.LG)
Comments: 23 pages, 10 figures
Abstract:Aggregating conformal predictors is a standard way of balancing their predictive and computational efficiency while retaining their validity, at least approximately. An important advantage of conformal e-predictors is that they are easier to aggregate without sacrificing their validity. This paper studies experimentally cross-conformal e-prediction, which is an existing method of aggregating conformal e-predictors, and its modifications that are conceptually simpler and more flexible.
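The reason e-predictors aggregate gracefully is that the arithmetic mean of e-values is again an e-value, so validity survives aggregation with no correction factor. A minimal sketch using the standard conformal e-value construction (function names are our own):

```python
import numpy as np

def conformal_e_value(cal_scores, test_score):
    """Standard conformal e-value: (n+1) * a_{n+1} / sum(a_1, ..., a_{n+1}),
    where the a_i are nonconformity scores.  Under exchangeability each score
    is equally likely to dominate the sum, so the expectation is exactly 1,
    i.e. this is a valid e-variable."""
    a = np.append(cal_scores, test_score)
    return len(a) * test_score / a.sum()

def cross_conformal_e(folds, test_score):
    """Cross-conformal aggregation: average the fold-wise e-values.  The
    mean of e-values is itself an e-value, unlike p-values, which need a
    correction factor when averaged."""
    return float(np.mean([conformal_e_value(f, test_score) for f in folds]))

e_same = conformal_e_value(np.ones(9), 1.0)          # all scores equal -> 1
e_cross = cross_conformal_e([np.ones(9), np.ones(4)], 1.0)
rng = np.random.default_rng(1)
mc = np.mean([conformal_e_value(rng.exponential(size=20), rng.exponential())
              for _ in range(2000)])                 # Monte Carlo check of E[e] = 1
```

The Monte Carlo check illustrates the validity property that makes aggregation "free" for e-predictors: the expectation of the conformal e-value is 1 regardless of the score distribution.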
[LG-12] FLAM: Evaluating Model Performance with Aggregatable Measures in Federated Learning
Link: https://arxiv.org/abs/2605.07962
Authors: Fabian Stricker, Jose A. Peregrina, David Bermbach, Christian Zirpins
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted for publication in 2nd IEEE International Conference on Federated Learning and Intelligent Computing Systems (FLICS 2026)
Abstract:Performance evaluation is essential for assessing the quality of machine learning (ML) models and guiding deployment decisions. In federated learning (FL), assessing the performance is challenging because data are distributed across participants. Consequently, the coordinator must rely on locally computed evaluation metrics and aggregate them to assess the global model. A key challenge is that common aggregation strategies, such as weighted averaging based on the local samples per participant, do not always produce the same results as centralized evaluation. Existing definitions of performance evaluation are largely tailored to accuracy and do not generalize to other metrics, leading to inconsistencies between participant-based and centralized evaluation. However, such discrepancies are inconsistent with the FL objective and lead to a wrong calculation of the metric. To address this issue, we examine the underlying reasons for these discrepancies and propose FLAM, a performance evaluation method based on aggregatable measures that yields the same results as centralized evaluation without the need for a global test dataset.
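The core idea of an aggregatable measure can be illustrated with F1: let participants report sufficient statistics (TP, FP, FN) rather than the metric itself, then compute the metric once from the summed counts. This toy example (our own, not FLAM's exact construction) shows why averaging local scores disagrees with centralized evaluation:

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    """F1 score from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

def federated_f1(local_counts):
    """Aggregatable evaluation: participants report sufficient statistics
    (TP, FP, FN); the coordinator sums them and computes the metric once,
    which reproduces centralized evaluation exactly."""
    tp, fp, fn = np.sum(local_counts, axis=0)
    return f1_from_counts(tp, fp, fn)

# Two participants with skewed local error profiles: (TP, FP, FN).
clients = [(8, 0, 4), (2, 6, 0)]
exact = federated_f1(clients)                            # = 20/30
naive = np.mean([f1_from_counts(*c) for c in clients])   # = (0.8 + 0.4)/2
```

Here the count-based value (2/3) matches pooling all predictions centrally, while averaging the local F1 scores gives 0.6; no weighting of the local scores recovers the centralized value because F1 is not linear in the counts.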
[LG-13] Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs
Link: https://arxiv.org/abs/2605.07961
Authors: Hanlin Cai, Kai Li, Houtianfu Wang, Haofan Dong, Yichen Li, Falko Dressler, Ozgur B. Akan
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Federated fine-tuning (FFT) has emerged as a privacy-preserving paradigm for collaboratively adapting large language models (LLMs). Built upon federated learning, FFT enables distributed agents to jointly refine a shared pretrained LLM by aggregating local LLM updates without sharing local raw data. However, FFT-based LLMs remain vulnerable to model manipulation threats, in which adversarial participants upload manipulated LLM updates that corrupt the aggregation process and degrade the performance of the global LLM. In this paper, we propose an Augmented Model maniPulation (AugMP) strategy against FFT-based LLMs. Specifically, we design a novel graph representation learning framework that captures feature correlations among benign LLM updates to guide the generation of malicious updates. To enhance manipulation effectiveness and stealthiness, we develop an iterative manipulation algorithm based on an augmented Lagrangian dual formulation. Through this formulation, malicious updates are optimized to embed adversarial objectives while preserving benign-like parameter characteristics. Experimental results across multiple LLM backbones demonstrate that the AugMP strategy achieves the strongest manipulation performance among all competing baselines, reducing the global LLM accuracy by up to 26% and degrading the average accuracy of local LLM agents by up to 22%. Meanwhile, AugMP maintains high statistical and geometric consistency with benign updates, enabling it to evade conventional distance- and similarity-based defense methods.
[LG-14] Convergent Stochastic Training of Attention and Understanding LoRA
Link: https://arxiv.org/abs/2605.07959
Authors: Zhengkai Sun, Dibyakanti Kumar, Alejandro F Frangi, Anirbit Mukherjee, Mingfei Sun
Subjects: Machine Learning (cs.LG); Functional Analysis (math.FA); Probability (math.PR)
Comments:
Abstract:Transformers have revolutionized machine learning, and deploying attention layers in models is increasingly standard across a myriad of applications. Further, for large models, it is common to implement Low Rank Adaptation (LoRA), whereby a factorized parameterization of them is trained, to achieve a surprisingly beneficial accuracy-size trade-off. In this work, via a unified framework we rigorously establish trainability of such models under stochastic methods. We prove that for any mild regularization, the empirical regression loss on an attention layer and LoRA on a shallow neural net both induce a Poincaré inequality for the corresponding Gibbs' measure. Then it follows, via invoking recent results, that a certain SDE, which mimics the SGD, minimizes the corresponding losses. In both cases, our first-of-its-kind trainability results for attention and neural nets do not rely on any assumptions on the data or the size of the architecture.
[LG-15] Slowly Annealed Langevin Dynamics: Theory and Applications to Training-Free Guided Generation
Link: https://arxiv.org/abs/2605.07950
Authors: Atsushi Nitanda, Dake Bu, Yueming Lyu, Tanya Veeravalli
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We study Slowly Annealed Langevin Dynamics (SALD), a sampler for tracking a path of moving target distributions and approximating the terminal target through time slowdown. We establish non-asymptotic convergence guarantees via a KL differential inequality, showing that slowdown improves tracking through contraction of intermediate targets and the complexity of the path. Motivated by training-free guided generation with pretrained score-based generative models, we further introduce Velocity-Aware SALD (VA-SALD), which explicitly incorporates the underlying marginal distributions of the pretrained model and uses slowdown to correct the additional deviation induced by guidance. This yields a principled framework for training-free guided generation for diffusion-based and related generative model families, together with convergence guarantees that clarify the roles of intermediate functional inequalities and guidance bias. Code is available at this https URL.
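A generic annealed Langevin sampler conveys the "slow annealing" idea: track targets smoothed at decreasing noise levels, spending several Langevin steps per level. Below is a textbook-style sketch for a Gaussian toy target whose smoothed score is known in closed form; it is not the paper's SALD/VA-SALD procedure:

```python
import numpy as np

def annealed_langevin(score_fn, x0, sigmas, steps_per_level=50, eps=0.05, seed=0):
    """Annealed Langevin dynamics: for each noise level sigma (large -> small),
    run a few unadjusted Langevin steps

        x <- x + (h / 2) * score(x, sigma) + sqrt(h) * N(0, 1),  h = eps * sigma^2,

    so the sampler tracks a path of smoothed targets.  Spending more steps
    per level ("slowdown") improves tracking of the moving target.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for sigma in sigmas:
        h = eps * sigma ** 2
        for _ in range(steps_per_level):
            x = x + 0.5 * h * score_fn(x, sigma) + np.sqrt(h) * rng.normal(size=x.shape)
    return x

# Toy: target N(mu, tau^2); smoothing with N(0, sigma^2) noise gives the
# closed-form score (mu - x) / (tau^2 + sigma^2) at every level.
mu, tau = 3.0, 0.5
score = lambda x, sigma: (mu - x) / (tau ** 2 + sigma ** 2)
samples = np.array([annealed_langevin(score, 0.0, np.geomspace(2.0, 0.1, 10), seed=s)
                    for s in range(200)])
```

Early, wide levels pull the chain into the right region; later, narrow levels refine it toward the terminal target, which is the tracking-then-slowdown structure analyzed in the paper.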
[LG-16] Prototype Guided Post-pretraining for Single-Cell Representation Learning
Link: https://arxiv.org/abs/2605.07938
Authors: Sachini Weerasekara, Natasha Darras, Sagar Kamarthi, Colles Price, Jacqueline Isaacs
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Single-cell representation learning (SCRL) from gene expression data offers a way to uncover the complex regulatory logic underlying cellular function. Inspired by large language models in natural language modeling, several single-cell pretrained models have recently been proposed that treat genes as tokens and cells as sentences. However, these models are fundamentally limited by the long-tailed nature of cell-type distributions and struggle to generalize under covariate shifts in gene expression data. While fine-tuning is often used to mitigate these issues, we observe that performance remains bounded. To address this challenge, we introduce CellRefine, a post-pretraining method that operates between the pretraining and fine-tuning stages of a single-cell foundation model. CellRefine uses a multi-faceted objective that incorporates marker-gene sets as structural priors to guide post-pretraining and refine the latent embedding manifold of cells. Across multiple computational biology tasks, empirical results show that CellRefine consistently improves downstream performance, yielding gains up to 15%.
[LG-17] Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
Link: https://arxiv.org/abs/2605.07922
Authors: Tue M. Cao, Hoang X. Nhat, Raed Alharbi, My T. Thai
Subjects: Machine Learning (cs.LG)
Comments: 21 pages
Abstract:Learning hierarchical features in Sparse Autoencoders (SAEs) is essential for capturing the structured nature of real-world data and mitigating issues like feature absorption or splitting. Existing works attempt to identify hierarchical relationships within independent feature sets by relying on activation coverage, the assumption that child feature should only activate when its parent feature activates. However, we demonstrate that this condition alone is insufficient; that is, it often produces false positives where parent and child concepts are semantically unrelated. To address this, we introduce a novel reconstruction condition that enforces a deeper functional link between hierarchical levels. By combining both activation and reconstruction constraints, we propose the Tree SAE, a model designed to learn hierarchical structures directly from within the feature set. Our results demonstrate that Tree SAEs significantly surpass the existing SAEs at learning hierarchical pairs while maintaining competitive performance to the state-of-the-art on several key benchmarks. Finally, we demonstrate the practical utility of our Tree SAE in mapping the geometry of child feature subspaces and uncovering the complex hierarchical concept structures encoded within large language models.
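The abstract's point that activation coverage alone admits false positives is easy to demonstrate: a parent feature that fires on every input "covers" any child feature, regardless of semantics. A small hypothetical checker (our own toy, not the Tree SAE code):

```python
import numpy as np

def activation_coverage(parent_acts, child_acts, tol=0.0):
    """Fraction of child firings that co-occur with a parent firing.

    The activation-coverage condition asks this to be (near) 1 for a valid
    parent-child pair; Tree SAE adds a reconstruction condition precisely
    because coverage alone is insufficient."""
    child_on = child_acts > tol
    if not child_on.any():
        return 1.0                                   # vacuously covered
    return float((parent_acts[child_on] > tol).mean())

dense_parent = np.array([1.0, 1.0, 1.0, 1.0])        # fires on every input
disjoint_parent = np.array([1.0, 0.0, 1.0, 0.0])     # never co-fires with child
child = np.array([0.0, 1.0, 0.0, 1.0])
fp = activation_coverage(dense_parent, child)        # 1.0 despite no relation
```

The dense parent passes the coverage test perfectly while carrying no semantic relationship to the child, which is why a functional link through reconstruction is needed on top of the activation condition.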
[LG-18] Curvature Beyond Positivity: Greedy Guarantees for Arbitrary Submodular Functions
Link: https://arxiv.org/abs/2605.07902
Authors: Yixin Chen, Alan Kuhnle
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: 44 pages, 11 figures
Abstract:Submodular functions – functions exhibiting diminishing returns – are central to machine learning. When the objective is monotone and non-negative, the greedy algorithm achieves a tight 63% approximation. But many practical objectives incorporate costs that make them negative on some inputs, and all existing multiplicative guarantees require non-negativity. Prior work handles negativity through additive bounds for the special class of decomposable functions and non-monotonicity through partial-monotonicity parameters, but these address each difficulty in isolation and neither extends the classical structural theory. We extend curvature – a parameter measuring how far a function deviates from linearity – to all submodular functions, handling both non-monotonicity and negativity through a single classical concept. A greedy algorithm with pruning achieves a curvature-controlled multiplicative ratio for any submodular function, including those taking negative values – the first such guarantee beyond monotonicity and non-negativity. In the non-monotone regime 1 \le c_g \le 2.2 , the bound strictly beats the best known uniform ratio of 0.401 (for non-negative f), and it recovers the classical (1-e^{-c_g})/c_g guarantee for monotone functions. A multilinear-extension variant extends the framework to general combinatorial constraints via multilinear relaxation. Experiments on cost-penalized experimental design, coverage, feature selection, and a curvature sweep on Multi-News passage selection support the theory.
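A minimal version of greedy with pruning can be sketched on a cost-penalized coverage objective that takes negative values. This is our own simplification (cardinality constraint, stop when no element has strictly positive marginal gain), not the paper's algorithm or analysis:

```python
def greedy_with_pruning(gain, ground, k):
    """Greedy maximization of a set function under |S| <= k, with pruning:
    stop as soon as no remaining element has strictly positive marginal gain.

    gain(S, e) must return f(S + {e}) - f(S).  The pruning step is what lets
    the approach cope with objectives that go negative on large sets.
    """
    S = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for e in ground:
            if e in S:
                continue
            g = gain(S, e)
            if g > best_gain:
                best, best_gain = e, g
        if best is None:              # pruning: only non-positive gains remain
            break
        S.append(best)
    return S

# Cost-penalized coverage, f(S) = |covered(S)| - 0.8 * |S|: submodular,
# non-monotone, and negative on large sets.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4}, 3: {9}}
def cover_gain(S, e):
    covered = set().union(*(sets[i] for i in S)) if S else set()
    return len(sets[e] - covered) - 0.8
picked = greedy_with_pruning(cover_gain, list(sets), k=4)
```

On this instance the pruned greedy selects the three profitable sets and refuses the fourth, whose marginal gain is negative once its element is already covered.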
[LG-19] Adaptive Regularization for Sparsity Control in Bregman-Based Optimizers
链接: https://arxiv.org/abs/2605.07892
作者: Ahmad Aloradi,Tim Roith,Emanuël A. P. Habets,Daniel Tenbrinck
类目: Machine Learning (cs.LG)
*备注: 21 pages, 15 figures
Abstract:Sparse training reduces the memory and computational costs of deep neural networks. However, sparse optimization methods, e.g., those adding an \ell_1 penalty, often control sparsity only indirectly through a regularization parameter \lambda , whose mapping to the final sparsity rate is non-trivial. In our experiments, we found this parameter sensitivity to be particularly pronounced for Bregman-based optimizers. Specifically, the two variants LinBreg and AdaBreg reach the same sparsity at \lambda values that differ by up to two orders of magnitude, requiring expensive trial-and-error sweeps to achieve a user-specified sparsity. To address this, we propose an adaptive regularization scheme that updates \lambda based on the difference between the model’s current sparsity and the target sparsity. We analyze the resulting algorithm and evaluate it on automatic speaker verification with ECAPA-TDNN and ResNet34 on VoxCeleb and CNCeleb. The proposed method reliably achieves sparsity targets ranging between 75% and 99%. It also converges faster than the oracle-tuned non-adaptive baseline during early training and matches or surpasses its final performance in equal error rate. We further show that the adaptive scheme inherits key properties from its non-adaptive counterpart, including improved out-of-distribution robustness over the dense baselines.
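The proposed scheme updates \lambda from the gap between current and target sparsity. A minimal sketch of such a controller, with a multiplicative update rule and a synthetic sparsity response standing in for a real model (both are illustrative assumptions, since the abstract does not give the exact update):

```python
import math

def adapt_lambda(lam, current_sparsity, target_sparsity, rate=1.0):
    """Multiplicative proportional control: lambda grows while the model is
    denser than the target and shrinks when it overshoots. The update rule
    is an illustrative assumption, not the paper's exact scheme."""
    return lam * math.exp(rate * (target_sparsity - current_sparsity))

# Synthetic stand-in for "measured sparsity as a function of lambda".
lam = 0.1
for _ in range(200):
    sparsity = 1.0 - math.exp(-lam)
    lam = adapt_lambda(lam, sparsity, target_sparsity=0.9)
```

A multiplicative update lets \lambda traverse the orders-of-magnitude range the abstract describes without hand-tuned step sizes.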
[LG-20] Black-box model classification under the discriminative factorization
链接: https://arxiv.org/abs/2605.07878
作者: Hayden Helm,Merrick Ohata,Carey Priebe
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Access to modern generative systems is often restricted to querying an API (the "black-box" setting) and many properties of the system are unknown to the user at inference time. While recent work has shown that low-dimensional representations of models based on the relationship between their embedded responses to a set of queries are useful for inferring model-level properties, the quality of these representations is highly sensitive to the query set. We introduce the \emph{discriminative factorization} to distinguish between high- and low-quality query sets in the context of black-box model-level classification. Under this framework, the probability of chance-level classification decays exponentially in the query budget. On three auditing tasks, estimated factorization parameters predict the empirical performance decay rate. We conclude by showing that query sets selected using the estimated discriminative field reproduce the empirical ordering of oracle query sets.
[LG-21] ADKO: Agentic Decentralized Knowledge Optimization
链接: https://arxiv.org/abs/2605.07863
作者: Lucas Nerone Rillo,Zhanhong Jiang,Nastaran Saadati,Aditya Balu,Baskar Ganapathysubramanian,Chinmay Hegde,Soumik Sarkar
类目: Machine Learning (cs.LG)
*备注: 31 pages
Abstract:We present Agentic Decentralized Knowledge Optimization (ADKO), a framework for collaborative black-box optimization across autonomous agents that achieves sample efficiency, privacy preservation, heterogeneous-objective handling, and communication efficiency. Each agent maintains a private Gaussian Process (GP) surrogate trained on local data and communicates only through knowledge tokens: compact, lossy summaries containing directional signals, advantage scores, and optional language-model (LM) insights, without sharing raw data or model parameters. ADKO unifies GP-Upper Confidence Bound (GP-UCB), parallel Bayesian optimization, decentralized learning, and LM-guided discovery. We provide the first formal analysis of dual information loss: token compression, quantified via mutual-information-based fidelity, and LM approximation error, decomposed into bias and stochastic noise. Our main result shows cumulative regret decomposes into GP error, LM bias, LM noise, and compression loss, with necessary and sufficient conditions for sublinear regret. We also propose fidelity-aware token pruning to preserve high-information tokens under a memory budget. Experiments on neural architecture search and scientific discovery validate the theory and show consistent improvements over strong baselines.
[LG-22] Actor-Critic Algorithm for Dynamic Expectile and CVaR
链接: https://arxiv.org/abs/2605.07857
作者: Yudong Luo,Erick Delage
类目: Machine Learning (cs.LG)
*备注:
Abstract:Optimizing dynamic risk with stochastic policies is challenging in both policy updates and value learning. The former typically requires transition perturbation, while the latter may rely on model-based approaches. To address these challenges, we propose a surrogate policy gradient without transition perturbation under softmax policy parameterization. We further develop model-free value learning methods for dynamic expectile and conditional value-at-risk by leveraging elicitability. Finally, inspired by Expected SARSA and Expected Policy Gradient, a model-free off-policy actor-critic algorithm is constructed. Empirical results in domains with verifiable risk-averse behavior show that our algorithm learns a risk-averse policy and consistently outperforms existing methods.
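Elicitability means the expectile minimizes an asymmetric squared loss, so it can be learned by regression. A minimal sketch (bisection on the first-order condition of that loss; a toy stand-in for the paper's value-learning method):

```python
def expectile_loss(q, xs, tau):
    """Asymmetric squared error whose minimizer over q is the tau-expectile;
    this scoring rule is what makes expectiles elicitable."""
    return sum((tau if x > q else 1.0 - tau) * (x - q) ** 2 for x in xs) / len(xs)

def expectile(xs, tau, iters=100):
    """Bisection on the (monotone) first-order condition of expectile_loss."""
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        grad = sum((tau if x > mid else 1.0 - tau) * (mid - x) for x in xs)
        if grad > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```

Setting tau = 0.5 recovers the mean; larger tau shifts the statistic toward the upper tail, which is what makes expectiles usable as risk measures.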
[LG-23] Distributional simplicity bias and effective convexity in Energy Based Models
链接: https://arxiv.org/abs/2605.07844
作者: Aurélien Decelle,Alfonso de Jesús Navas Gómez,Beatriz Seoane
类目: Machine Learning (cs.LG)
*备注: 13 pages, 2 figures
Abstract:Energy-based learning is a powerful framework for generative modelling, but its training is inherently non-convex, leading potentially to sensitivity to initialisation, poor local optima, and unstable gradient dynamics. We present a dynamical analysis of energy-based learning through the lens of the effective model, which can be interpreted as either a generalised Ising model with higher-order interactions or the Fourier expansion of the energy. Under sufficient expressivity, we show that the gradient flow induced by learning strictly positive distributions over binary variables admits two types of fixed points: data-consistent points, which exactly reproduce the target distribution, and spurious points, which satisfy stationarity without matching the target distribution. Around data-consistent points, we show that perturbations are either stable or neutral, with neutral directions leaving the effective model invariant. Finally, we show that gradient dynamics induce a hierarchy in which lower-order interactions are learned before higher-order ones. This provides a mechanistic explanation for the distributional simplicity bias and clarifies why fixed points that are not data-consistent at low orders are not observed in practice.
[LG-24] RelAgent: LLM Agents as Data Scientists for Relational Learning
链接: https://arxiv.org/abs/2605.07840
作者: Xingyue Huang,Louis Tichelman,Jinwoo Kim,Krzysztof Olejniczak,İsmail İlkan Ceylan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Relational learning is a challenging problem that has motivated a wide range of approaches, including graph-based models (e.g., graph neural networks, graph transformers), tabular methods (e.g., tabular foundation models), and sequence-based approaches (e.g., large language models), each with its own advantages and limitations. We propose RelAgent, an LLM-based autonomous data scientist for relational learning, which operates in two phases. In the search phase, an LLM agent uses database, validation, and evaluation workspace tools to construct SQL feature programs and select a predictive model. In the inference phase, the resulting program is executed without further LLM calls. The final predictor consists of SQL queries and a classical model, enabling fast, deterministic, and intrinsically interpretable predictions: features are human-readable queries, and predictions depend only on the resulting query-defined feature map, enabling scalable deployment using standard database systems.
[LG-25] NSPOD: accelerating the convergence of Krylov-based iterative linear solvers via approximated PODs
链接: https://arxiv.org/abs/2605.07828
作者: Francesc Levrero-Florencio,Youngkyu Lee,Jay Pathak,George Em Karniadakis
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 17 pages, 9 figures, 3 tables
Abstract:The convergence of Krylov-based linear iterative solvers applied to parametric partial differential equations (PDEs) is often highly sensitive to the domain, its discretization, the location/values of the applied Dirichlet/Neumann boundary conditions, body forces and material properties, among others. We have previously introduced hybridization of classical linear iterative solvers with neural operators for specific geometries, but they tend to not perform well on geometries not previously seen during training. We partially addressed this challenge by introducing the deep operator network Geo-DeepONet and hybridizing it with Krylov-based iterative linear solvers, which, despite learning effectively across arbitrary unstructured meshes without requiring retraining, led to only modest reductions in iterations compared to state-of-the-art preconditioners. In this study we introduce Neural Subspace Proper Orthogonal Decomposition (NSPOD), a multigrid-like deep operator network-based preconditioner which can dramatically reduce the number of iterations needed for convergence in Krylov-based linear iterative solvers, even when compared to state-of-the-art methods such as algebraic multigrid preconditioners. We demonstrate its efficiency via numerical experiments on a linearized version of solid mechanics PDEs applied to unstructured domains obtained from complex CAD geometries. We expect that the findings in this study lead to more efficient hybrid preconditioners that can match, or possibly even surpass, the convergence properties of the current gold standard preconditioning methods for solid mechanics PDEs.
[LG-26] Scaling Categorical Flow Maps
链接: https://arxiv.org/abs/2605.07820
作者: Oscar Davis,Anastasiia Filippova,Pierre Ablin,Victor Turrisi,Amitis Shidani,Marco Cuturi,Louis Béthune
类目: Machine Learning (cs.LG)
*备注:
Abstract:Continuous diffusion and flow matching models could represent a powerful alternative to autoregressive approaches for language modelling (LM), as they unlock a host of advantages currently reserved for continuous modalities, including accelerated sampling and tilting. Recently, several works have demonstrated the possibility of generating discrete data continuously by a simple flow matching process between a Gaussian and the one-hot encoded data distribution. They have further shown the feasibility of accelerated sampling via Categorical Flow Maps (CFMs), resulting in competitive sample quality in the few-step regime. However, this method had only been evaluated at relatively modest scales (< 1 B), leaving the question of its scalability completely open. In this article, we train a 1.7 B-parameter base flow model on 2.1 T tokens and self-distill it into a CFM that generates diverse, high-quality text in as few as 4 inference steps while maintaining near-data-level token entropy. Furthermore, we introduce a likelihood bound for CFMs in the semi-discrete setting, and show that they can be used to score the model on standard LM benchmarks, achieving results in the same range as discrete diffusion methods. Finally, we uncover some of the challenges that arise from training these models at scale, and we provide prescriptive insights on loss weighting and time scheduling.
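The Gaussian-to-one-hot flow mentioned above can be sketched with the generic straight-line path x_t = (1-t) z + t y; the paper's actual path, noise scaling, and parameterization may differ:

```python
import random

def flow_point(y_onehot, t, rng):
    """A point on the straight-line path x_t = (1 - t) * z + t * y between
    Gaussian noise z and the one-hot token embedding y (toy sketch)."""
    z = [rng.gauss(0.0, 1.0) for _ in y_onehot]
    return [(1.0 - t) * zi + t * yi for zi, yi in zip(z, y_onehot)]

def decode(x):
    """Map a continuous point back to a token id via argmax."""
    return max(range(len(x)), key=lambda i: x[i])

rng = random.Random(0)
x_end = flow_point([0.0, 0.0, 1.0, 0.0], 1.0, rng)  # at t = 1 the path hits y
```

A flow model learns the velocity field along such paths; a CFM distils that field so a few large steps suffice.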
[LG-27] GRASP – Graph-Based Anomaly Detection Through Self-Supervised Classification
链接: https://arxiv.org/abs/2605.07812
作者: Robin Buchta,Carsten Kleiner,Felix Heine,Gabi Dreo Rodosek
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 17 pages
Abstract:Advanced persistent threat (APT) attacks remain difficult to detect due to their stealth, adaptability, and use of legitimate system components. Provenance-based intrusion detection systems (PIDS) offer a promising defense by capturing detailed relationships between system components and actions. However, current PIDS rely on predefined or subset-determined thresholds, which limit detection stability and the ability to detect any anomalous behavior in general. Furthermore, related work often neglects the role of process executables, which characterize system activity through a process's interactions with files, network components, and other processes. We introduce GRASP, a PIDS based on masked self-supervised classification. GRASP masks the executable information of processes and learns to infer it from their two-hop provenance graph neighborhood, marking misclassified processes as anomalies. It captures behavior patterns for the learned executables without thresholding, making it robust against interference and unknown activities. Evaluations on the DARPA TC and OpTC datasets demonstrate that GRASP consistently detects anomalous behavior, including known attack-related activities, outperforming existing systems. Our PIDS identifies all documented attacks on datasets where the behavior of executables is learnable. In addition, compared to existing systems, GRASP uncovers potentially malicious anomalous behavior not labeled as an attack in the documentation.
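GRASP's core test, flagging a process whose masked executable cannot be recovered from its neighborhood, can be sketched with a toy lookup classifier in place of the paper's learned model (signatures and executables below are invented):

```python
from collections import Counter, defaultdict

def fit_lookup(events):
    """Toy stand-in for GRASP's classifier: remember, for each provenance
    neighborhood signature, the executable most often seen with it."""
    by_sig = defaultdict(Counter)
    for sig, exe in events:
        by_sig[sig][exe] += 1
    return {sig: c.most_common(1)[0][0] for sig, c in by_sig.items()}

def flag_anomalies(model, events):
    """A process is anomalous when its masked executable cannot be
    recovered from its neighborhood signature."""
    return [(sig, exe) for sig, exe in events if model.get(sig) != exe]

train = [("reads:/etc/passwd|net:443", "curl")] * 5 + [("spawns:sh", "bash")] * 5
model = fit_lookup(train)
alerts = flag_anomalies(model, [("reads:/etc/passwd|net:443", "curl"),
                                ("spawns:sh", "nc")])  # nc mimicking bash's context
```

Because the decision is "prediction matches or not", no anomaly-score threshold is needed, which is the property the abstract emphasizes.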
[LG-28] The Minimax Rate of Second-Order Calibration
链接: https://arxiv.org/abs/2605.07808
作者: Kamil Ciosek,Banafsheh Rafiee,Sina Ghiassian,Nicolò Felicioni
类目: Machine Learning (cs.LG)
*备注:
Abstract:We characterize the minimax rate of estimating the second-order calibration error for binary classification, which quantifies whether a higher-order predictor’s epistemic-uncertainty estimate matches the conditional variance of the label probability on its level sets. Our key observation is that the sech perturbation kernel, previously used only to enforce smoothness of calibration functions, in fact makes them analytic in a strip of half-width h\pi/2 . Polynomial regression then estimates the calibration error at rate \tilde{O}(1/\sqrt{n}) , with explicit constants, a qualitative improvement over the O(n^{-1/4}) rate achievable by bucketing or kernel smoothing. A matching \Omega(1/\sqrt{n}) lower bound establishes minimax optimality up to logarithmic factors. As a corollary, we give the first finite-sample guarantee for second-order Platt scaling, yielding a post-hoc procedure that recalibrates both the mean prediction and the epistemic-variance estimate of any higher-order predictor. Along the way, we provide a bucket-free definition of second-order calibration and relate it quantitatively to the bucketed formulation of Ahdritz et al. [2025]. Our experiments confirm the predicted rate and the quality of the recalibrated uncertainties.
[LG-29] Flexible Routing via Uncertainty Decomposition
链接: https://arxiv.org/abs/2605.07805
作者: Charlotte Peale,Siddartha Devic,Parikshit Gopalan,Udi Wieder,Aravind Gollakota
类目: Machine Learning (cs.LG)
*备注:
Abstract:A key strategy for balancing performance and cost in modern machine learning systems is to dynamically route queries to either a low-cost model or a more expensive oracle (such as a large pretrained model or human expert), an approach known as model routing. In this work we present a new uncertainty-aware router that (1) avoids unnecessary oracle calls on inherently ambiguous queries, and (2) adapts dynamically to different loss functions and cost parameters through simple hyperparameter changes, without retraining. Our method, applicable to any classification setting where multiple independent annotations per input are available, is based on decomposing total uncertainty into irreducible and reducible components using higher-order predictors [Ahdritz et al., 2025]. This enables a unified approach to both routing and abstention: predict with the weak model when uncertainty is low, route to the oracle when reducible uncertainty is high, and abstain when irreducible uncertainty is high. Our router comes with strong theoretical guarantees bounding regret relative to optimal task-specific routers. We conduct experiments on both synthetic and real-world datasets that demonstrate the benefits of our approach in suitable regimes – in particular, whenever reducible and irreducible uncertainty are not too correlated.
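The three-way decision rule can be sketched directly from the uncertainty decomposition; the thresholds below are illustrative placeholders, whereas the paper derives loss- and cost-dependent rules:

```python
def route(total_unc, irreducible_unc, abstain_thr=0.6, route_thr=0.3):
    """Three-way decision from an uncertainty decomposition: abstain when
    even the oracle cannot help (high irreducible uncertainty), call the
    oracle when the gap is closable (high reducible uncertainty), and
    otherwise answer with the cheap model. Thresholds are placeholders."""
    reducible = total_unc - irreducible_unc
    if irreducible_unc >= abstain_thr:
        return "abstain"
    if reducible >= route_thr:
        return "oracle"
    return "weak_model"
```

Adapting to a new cost structure then amounts to moving the two thresholds, with no retraining, matching the flexibility claim in the abstract.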
[LG-30] Training-Induced Escape from Token Clustering in a Mean-Field Formulation of Transformers
链接: https://arxiv.org/abs/2605.07772
作者: Noboru Isobe,Daisuke Inoue,Masaaki Imaizumi
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注: 48 pages, 6 figures, comments are welcome!
Abstract:Transformers perform inference by iteratively transforming token representations across layers. This layerwise computation has been studied empirically, and recent mean-field theories of Transformer dynamics explain how attention can drive token distributions toward clustering. However, existing mean-field analyses largely treat model parameters as prescribed, leaving open how training reshapes this clustering picture. We study this question in a noisy mean-field Transformer in which only a parameter-linear FFN is trained under L^2 regularization. We find and analyze a training-induced phase in the dynamics: after initially following attention-driven clustering, the token distribution can leave the clustered regime near the final layers. Our mathematical analysis is based on an entropy-regularized interaction energy that captures the clustering bias of attention. More broadly, our results point toward a training-aware mean-field theory of Transformer dynamics, in which training and inference dynamics are treated together.
[LG-31] Interactive Trajectory Planning with Learning-based Distributionally Robust Model Predictive Control and Markov Systems
链接: https://arxiv.org/abs/2605.07768
作者: Erik Börve,Nikolce Murgovski,Morteza Haghir Chehreghani,Leo Laine
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:We investigate interactive trajectory planning subject to uncertainty in the decisions of surrounding agents. To control the ego-agent, we aim to first learn the decision distribution and solve a Stochastic Model Predictive Control (SMPC) problem. To account for errors in the learned distribution, we show that it is possible to utilize Probably Approximately Correct (PAC) learning in combination with Distributionally Robust (DR) optimization to obtain a solution which accounts for the errors induced by the learning model. The results indicate that our PAC learning-based DR-MPC framework provides a method to interpolate between a robust MPC and an omnipotent SMPC, based on the available number of samples.
[LG-32] Pre-trained Tabular Foundation Models as Versatile Summary Networks for Neural Posterior Estimation
链接: https://arxiv.org/abs/2605.07765
作者: Elliot Pickens,Chiraag Gohel,Sidharth Satya
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this work, we study TabPFN as a training-free, modular summary network for simulation-based Bayesian inference (SBI). Tabular foundation models such as TabPFN are pretrained on broad families of synthetic tabular data-generating processes and adapt at test time through in-context learning, making them natural candidates for SBI, where posterior estimation often depends on learning informative summaries of simulated observations. We propose PFN-NPE: a general recipe that uses a pretrained TabPFN encoder as a fixed summary network for simulator outputs, then pairs the resulting summaries with a downstream inference head chosen for the problem. With normalizing flows as the default inference head, PFN-NPE matches established posterior approximation methods and sometimes outperforms them. More importantly, diagnostic probes show that the TabPFN-derived summaries often preserve useful posterior location and marginal information. These analyses also reveal a limitation in that TabPFN-derived summaries may struggle to represent the joint posterior structure even when the marginals are well recovered. Still, our experiments show that TabPFN can serve as an effective summary network across a diverse set of SBI settings, with the inference network left modular and task-dependent.
[LG-33] SMT-Based Active Learning of Weighted Automata
链接: https://arxiv.org/abs/2605.07758
作者: Tiago Ferreira,Kevin Batz,Alexandra Silva
类目: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
*备注: Appearing in CAV 2026
Abstract:We present an SMT-based active learning algorithm for nondeterministic weighted automata (WFAs) as a practical and robust alternative to Hankel/L*-style methods. Our algorithm is parametric in a given semiring and, if it terminates, guaranteed to produce minimal WFAs. We prove partial correctness and provide a sufficient termination condition, which in particular implies termination for all finite semirings. Our extensive experimental evaluation shows that our algorithm is capable of learning numerous minimal WFAs over both finite and infinite semirings, vastly outperforms a naive baseline, and is competitive with a state-of-the-art algorithm while producing significantly smaller automata and requiring less interaction with the teacher.
[LG-34] Efficient Verification of Neural Control Barrier Functions with Smooth Nonlinear Activations
链接: https://arxiv.org/abs/2605.07757
作者: Jun Zhang,Haibo Zhang,Chun Liu,Xiaofan Wang,Liang Xu
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures
Abstract:Formal verification of neural control barrier functions (NCBFs) remains challenging, especially for neural networks with nonlinear activations like \tanh . Existing CROWN-based methods rely on conservative linear relaxations for Jacobian bounds, limiting scalability. We propose LightCROWN, which computes tighter Jacobian bounds by exploiting the analytical properties of activation functions. Experiments on nonlinear control systems including the inverted pendulum, Dubins car, and planar quadrotor demonstrate that LightCROWN improves verification success rates up to 100%, while enhancing speed and scalability. Our approach provides a generalizable improvement for CROWN-based frameworks, enabling more efficient verification of complex NCBFs. The code can be found at this http URL.
[LG-35] Robust and Reliable AI for Predictive Quality in Semiconductor Materials Manufacturing with MLOps and Uncertainty Quantification
链接: https://arxiv.org/abs/2605.07752
作者: Min Gao,Julia Maria Perathoner,Anton Ludwig Bonin,Steven Eulig,Gianni Klesse
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:Semiconductor materials manufacturing presents unique challenges for machine learning deployment due to evolving process conditions, equipment degradation, and raw material variability that can cause model performance deterioration over time. This study benchmarks machine learning operations (MLOps) retraining strategies using five years of real manufacturing data to identify optimal retraining approaches for quality prediction. We evaluate various retraining frequencies and hyperparameter optimization strategies using control-limit normalized residuals as the key performance metric. Results demonstrate that a fixed retraining cadence every five production batches without hyperparameter retuning achieves superior performance across all drift conditions while significantly reducing computational overhead compared to strategies incorporating hyperparameter optimization. This approach effectively maintains model accuracy during both abrupt process changes and gradual equipment degradation patterns. To address the critical need for uncertainty quantification in manufacturing decision-making, we implement conformal prediction to generate prediction confidence intervals with strong statistical guarantees. This enables proactive quality control by identifying when prediction intervals fall within acceptable control limits, transforming traditional reactive quality management into a predictive framework. The findings provide practical guidelines for implementing robust MLOps strategies in manufacturing environments where computational efficiency and reliable uncertainty quantification are paramount for operational success.
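The conformal-prediction step can be sketched with standard split conformal intervals (the generic absolute-residual recipe on a held-out calibration set, not necessarily the authors' exact variant; the residuals below are invented):

```python
import math

def conformal_interval(cal_residuals, y_hat, alpha=0.1):
    """Split conformal prediction: the (1 - alpha) empirical quantile of
    absolute calibration residuals yields an interval with marginal
    coverage >= 1 - alpha under exchangeability."""
    n = len(cal_residuals)
    scores = sorted(abs(r) for r in cal_residuals)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # conformal quantile rank
    q = scores[min(n - 1, rank - 1)]
    return (y_hat - q, y_hat + q)

cal = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9]  # toy residuals
lo, hi = conformal_interval(cal, y_hat=5.0, alpha=0.1)
```

Quality control then reduces to checking whether the whole interval lies inside the control limits, the proactive rule the abstract describes.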
[LG-36] Bayesian Fine-tuning in Projected Subspaces
链接: https://arxiv.org/abs/2605.07706
作者: Viktar Dubovik,Patryk Marszałek,Jacek Tabor,Tomasz Kuśmierczyk
类目: Machine Learning (cs.LG)
*备注:
Abstract:Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of large models by decomposing weight updates into low-rank matrices, significantly reducing storage and computational overhead. While effective, standard LoRA lacks mechanisms for uncertainty quantification, leading to overconfident and poorly calibrated models. Bayesian variants of LoRA address this limitation, but at the cost of a significantly increased number of trainable parameters, partially offsetting the original efficiency gains. Additionally, these models are harder to train and may suffer from unstable convergence. In this work, we propose a novel framework for parameter-efficient Bayesian fine-tuning, demonstrating that effective uncertainty quantification can be achieved in very low-dimensional parameter spaces. The proposed method achieves strong performance with improved calibration and generalization while maintaining computational efficiency. Our empirical findings show that, with the appropriate projection of the weight space, uncertainty can be effectively modeled in a low-dimensional space, and weight covariances exhibit low ranks.
[LG-37] Future Validity is the Missing Statistic: From Impossibility to Φ-Estimation for Grammar-Faithful Speculative Decoding
链接: https://arxiv.org/abs/2605.07698
作者: Wenhua Nie,Zijie Meng,Kun Zou,Zheng Lin,Ziwei Li,Haoran Zheng,Jyh-Shing Roger Jang,Hao Zhang
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:Grammar-constrained generation is often combined with local vocabulary masking and speculative decoding, but the resulting sampling law is not the grammar-conditional distribution users usually intend. We show that any speculative decoder with local mask access, Leviathan rejection, and rollback soundness samples from the locally projected distribution \mu^{\mathrm{proj}} rather than the grammar-conditional distribution \mu^\star . This extends the GAD impossibility result to speculative decoding; on Dyck grammars with Qwen3-8B, the total-variation gap can reach 0.996. We identify the future-validity function \Phi_t(y)=\Pr_p[\mathrm{valid\ completion}\mid y] as the missing correction statistic. The target distribution is a Doob transform of the base model with h=\Phi , while local masking corresponds to setting h to one. With exact \Phi , our oracle decoder FVO-Spec samples exactly from \mu^\star ; with approximate \Phi , we bound the resulting total-variation error. Because exact future validity is hard for general context-free grammars, we evaluate estimator hierarchies on tractable Dyck and finite JSON languages. OneStep reduces Dyck TV by 14% with under 1% throughput overhead, exact dynamic programming reduces it by 97%, and finite-language correction closes JSON gaps to numerical precision. All fidelity claims are scoped to enumerable grammars and token tries.
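The Doob-transform correction can be made concrete on a toy fixed-length Dyck-1 language with a uniform base model (both illustrative assumptions), where exact future validity \Phi is computable by recursion; the paper's estimators approximate this quantity for harder grammars:

```python
from functools import lru_cache

L = 4                          # fixed output length over the alphabet {'(', ')'}
P = {"(": 0.5, ")": 0.5}       # uniform base model (an illustrative assumption)

def depth(y):
    d = 0
    for c in y:
        d += 1 if c == "(" else -1
        if d < 0:
            return None        # prefix already invalid
    return d

@lru_cache(maxsize=None)
def phi(y):
    """Exact future validity: probability that a base-model completion of
    prefix y is a balanced Dyck word of length L."""
    d = depth(y)
    if d is None:
        return 0.0
    if len(y) == L:
        return 1.0 if d == 0 else 0.0
    return sum(P[t] * phi(y + t) for t in P)

def doob_step(y):
    """One step of the Doob-transformed sampler: p*(t|y) is proportional to
    p(t|y) * phi(y+t). Local masking would instead weight every locally
    valid token by 1, which is exactly where the bias comes from."""
    w = {t: P[t] * phi(y + t) for t in P}
    z = sum(w.values())
    return {t: wt / z for t, wt in w.items()}
```

With exact phi the sampler reproduces the grammar-conditional distribution: at the empty prefix it puts all mass on "(", because ")" admits no valid completion.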
[LG-38] Toward Better Geometric Representations for Molecule Generative Models
链接: https://arxiv.org/abs/2605.07693
作者: Shaoheng Yan,Zian Li,Cai Zhou,Qiaojing Huang,Kai Liu,Muhan Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Geometric representation-conditioned molecule generation provides an effective paradigm that decouples molecule representation modeling from structure generation. By decoupling molecule generation into two stages, first generating a meaningful molecule representation and then generating a 3D molecule conditioned on this representation, the efficiency and quality of the generation process can be significantly enhanced. However, its effectiveness is fundamentally limited by the quality of the representation space: pretrained molecular encoders, such as UniMol, produce representations that are non-smooth and not fully exploited during the generative training process. In this work, we propose LENSEs, a framework that better exploits the potential of molecule representations in representation-conditioned generation methods. In particular, LENSEs introduces three complementary mechanisms: (1) a representation head, simultaneously trained during generative tasks, that extracts multi-level representations from the pretrained encoder; (2) a molecule perceptual loss that optimizes the generator in a semantic-informative representation space; and (3) a node-level representation alignment (REPA) loss that explicitly aligns the generator’s hidden states with encoder representations, reducing the semantic gap between pretraining and generation. We demonstrate the effectiveness of these improvements through extensive molecule generation tasks. Specifically, on the challenging molecule generation dataset GEOM-DRUG, LENSEs achieves 97.28% validity and 98.51% molecule stability, surpassing existing advanced methods. Further analyses through Lipschitz constant reduction (4.6x) and QM9 probing tasks also demonstrate the smoother, more informative refined representations, establishing generative training with alignment objectives as a potential pretraining paradigm for molecular encoders.
[LG-39] Fortifying Time Series: DTW-Certified Robust Anomaly Detection
链接: https://arxiv.org/abs/2605.07690
作者: Shijie Liu,Tansu Alpcan,Christopher Leckie,Sarah Erfani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time-series anomaly detection is critical for ensuring safety in high-stakes applications, where robustness is a fundamental requirement rather than a mere performance metric. Addressing the vulnerability of these systems to adversarial manipulation is therefore essential. Existing defenses are largely heuristic or provide certified robustness only under \ell_p -norm constraints, which are incompatible with time-series data. In particular, the \ell_p -norm fails to capture the intrinsic temporal structure in time series, causing small temporal distortions to significantly alter \ell_p -norm measures. Instead, the similarity metric \emph{Dynamic Time Warping} (DTW) is more suitable and widely adopted in the time-series domain, as DTW accounts for temporal alignment and remains robust to temporal variations. To date, however, there has been no certifiable robustness result in this metric that provides guarantees. In this work, we introduce the first \emph{DTW-certified} robust defense in time-series anomaly detection by adapting the randomized smoothing paradigm. We develop this certificate by bridging the \ell_p -norm to DTW distance through a lower-bound transformation. Extensive experiments across various datasets and models validate the effectiveness and practicality of our theoretical approach. Results demonstrate significantly improved performance, e.g., up to 18.7% in F1-score under DTW-based adversarial attacks compared to traditional certified models.
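For reference, the DTW similarity that the certificate targets is the classic dynamic program below; the example shows a one-step temporal shift with zero DTW cost but nonzero squared \ell_2 distance, which is exactly why \ell_p certificates are a poor fit for time series:

```python
def dtw(a, b):
    """Classic O(nm) dynamic-time-warping distance with squared point cost;
    unlike l_p norms, it tolerates temporal misalignment."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

a, b = [0, 0, 1, 0], [0, 1, 0, 0]               # same spike, shifted by one step
d_dtw = dtw(a, b)                                # warping absorbs the shift
d_l2 = sum((x - y) ** 2 for x, y in zip(a, b))  # the pointwise view is inflated
```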
[LG-40] Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works
链接: https://arxiv.org/abs/2605.07689
作者: Wenhua Nie,Jianan Wu,Junlin Liu,Ziwei Li,Zheng Lin,Zhang Zijian,Yilong Fan,Haoran Zheng,Jyh-Shing Roger Jang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Group Relative Policy Optimization (GRPO) is a standard algorithm for reinforcement learning from verifiable rewards, but its group-mean-centered advantage can fail under binary rewards. The failure mode is gradient starvation: when every response in a group is correct or every response is wrong, the centered advantage is exactly zero and the policy receives no learning signal. We prove that the true degeneracy rate always exceeds the i.i.d. Bernoulli prediction by Jensen’s inequality, and observe a 0.69 degeneracy rate at group size four in logged Qwen3.5-9B GSM8K training. We then show that the fixed-reference Sign advantage, A=2r-1 , performs pass@G failure descent by increasing the probability that at least one sample in the group succeeds. On the full GSM8K test set across seven seeds, Sign reaches 73.8% accuracy versus 28.4% for standard normalized group-mean DrGRPO at group size four, a 45.4 point gain with p < 0.0001 . The effect is directionally consistent on Llama-3.1-8B and positive but underpowered on a MATH-500 transfer check. Pass@ k analysis indicates that the main benefit is search compression rather than large capacity expansion, aligning the empirical gains with recent RLVR ceiling observations.
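The degeneracy and its fix are easy to reproduce in a few lines: group-mean centering returns all-zero advantages on uniform groups, while the Sign advantage A = 2r - 1 does not (a schematic check, not the training code):

```python
def groupmean_advantage(rewards):
    """GRPO-style advantage: center each binary reward on the group mean.
    On a uniformly correct or uniformly wrong group this is identically zero."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

def sign_advantage(rewards):
    """Fixed-reference Sign advantage A = 2r - 1: retains a nonzero
    learning signal even on degenerate groups."""
    return [2 * r - 1 for r in rewards]

def iid_degeneracy_rate(p, G):
    """i.i.d. Bernoulli prediction for P(all G rewards identical); the
    paper proves the true rate always exceeds this (Jensen's inequality)."""
    return p ** G + (1 - p) ** G
```

At p = 0.5 and group size 4 the i.i.d. prediction is 0.125, far below the 0.69 degeneracy rate the paper reports, which is the Jensen gap in action.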
[LG-41] The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
链接: https://arxiv.org/abs/2605.07686
作者: Wenhua Nie,Junlin Liu,Jianan Wu,Zijie Meng,Yilong Fan,Zhang Zijian,Haoran Zheng,Jyh-Shing Roger Jang
类目: Machine Learning (cs.LG)
*备注: 40 pages, 6 figures
Abstract:Chain-of-thought reasoning is often treated as a monotone way to improve language-model accuracy by letting a model think longer. We identify a countervailing effect, the coupling tax: when reasoning traces and final answers share one output-token budget, long traces can crowd out the answer they are meant to support. Across GSM8K, MATH-500, and five BIG-Bench Hard tasks with Qwen3 models at three scales, non-thinking mode matches or outperforms thinking mode on GSM8K and MATH-500 at every budget up to 2048 tokens, while harder tasks shift the crossover to larger budgets. We derive a truncation-waste decomposition, \mathrm{Acc}_{\mathrm{think}}(b)=\alpha_c F_L(b)+\alpha_t(1-F_L(b)) , that predicts this crossover from chain-length and accuracy statistics and explains inverse scaling within the Qwen family. A DeepSeek-R1-Distill-Llama-8B replication shows the same pattern under a different thinking interface. As a mitigation, split-budget generation decouples reasoning and answer budgets; on full MATH-500, IRIS reaches 74.0% accuracy, a strengthened extraction variant reaches 78.8%, and a fixed non-oracle SC+IRIS gate reaches 83.6%. The results show that test-time reasoning should be evaluated as a budget-allocation problem, not only as a question of whether longer traces are available.
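The truncation-waste decomposition is simple enough to evaluate directly. The sketch below implements Acc_think(b) = α_c F_L(b) + α_t (1 - F_L(b)); the chain lengths and accuracies are hypothetical placeholders, not the paper's data:

```python
# Sketch of the truncation-waste decomposition from the abstract:
# Acc_think(b) = alpha_c * F_L(b) + alpha_t * (1 - F_L(b)),
# where F_L(b) is the fraction of reasoning chains that fit within budget b,
# alpha_c the accuracy of completed chains, alpha_t of truncated ones.
# The numbers below are illustrative assumptions, not measured values.

def acc_think(b, chain_lengths, alpha_c, alpha_t):
    f_l = sum(L <= b for L in chain_lengths) / len(chain_lengths)
    return alpha_c * f_l + alpha_t * (1.0 - f_l)

lengths = [300, 500, 800, 1200, 2500]   # hypothetical chain lengths (tokens)
for budget in (256, 1024, 4096):
    print(budget, round(acc_think(budget, lengths, alpha_c=0.9, alpha_t=0.2), 3))
```

With a tight budget almost every chain truncates and accuracy collapses toward α_t; as the budget grows past the chain-length distribution, accuracy approaches α_c, reproducing the crossover behavior the abstract describes.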
[LG-42] Structured Coupling for Flow Matching
链接: https://arxiv.org/abs/2605.07676
作者: Xavier Sumba,Carles Balsells-Rodas,Yingzhen Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Standard flow matching scales well but typically relies on an unstructured source distribution, limiting its ability to learn interpretable latent structure. Latent-variable models, by contrast, capture structure but often sacrifice generative quality. We bridge this gap by proposing Structured Coupling for Flow Matching (SCFM), a cooperative framework that augments flow matching with structured latent representation learning. By introducing structured latent variables and exogenous noise into the source, SCFM jointly learns a structured prior (via latent variable modeling) and a continuous transport map (via flow matching). It uses a shared time-dependent recognition network for both latent variable model variational inference and intermediate-time flow velocity estimation. This yields a structurally informed yet unconditional, simulation-free flow model, where the latent variable model can also assist flow sampling. Empirically, SCFM facilitates unsupervised latent representation learning for clustering, disentanglement and downstream tasks, while remaining competitive with flow matching in sample quality, showing that meaningful structure can be learned without sacrificing generative fidelity.
[LG-43] Differentially Private Auditing Under Strategic Response
链接: https://arxiv.org/abs/2605.07674
作者: Florian A. D. Burnat
类目: Computer Science and Game Theory (cs.GT); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Regulatory audits of AI systems increasingly rely on differential privacy (DP) to protect training data and model internals. We study audit design when the audited developer can strategically respond to the privacy-constrained audit interface. We formalize privacy-constrained auditing as a bilevel Stackelberg game, in which an auditor commits to a query policy and DP budget allocation across harm dimensions, and a strategic developer reallocates mitigation efforts in response. We introduce the welfare-weighted under-detection gap B_w , the welfare-weighted true residual harm the audit fails to detect at the developer’s strategic best response, and prove that naive DP auditing (uniform or harm-proportional allocation) induces a strictly larger B_w than any non-strategic mitigation baseline whenever effective detectability is heterogeneous, the welfare weights are not comonotone with detectability, and the developer’s optimum is interior. We characterize the optimal auditor allocation as a four-factor balance of welfare weight, audit miss-probability, detectability elasticity, and mitigation-cost curvature, and provide a single-level reformulation of the bilevel problem via the developer’s KKT system. We propose Strategic Private Audit Design (SPAD), a projected-gradient algorithm with hypergradients computed through the developer’s best response.
[LG-44] Quotient Semivalues for False-Name-Resistant Data Attribution
链接: https://arxiv.org/abs/2605.07663
作者: Florian A. D. Burnat,Brittany I. Davidson
类目: Computer Science and Game Theory (cs.GT); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Data valuation methods allocate payments and audit training data’s contribution to machine-learning pipelines; however, they often assume passive contributors. In reality, contributors can split datasets across pseudonymous identities, duplicate high-value examples, create near-duplicates, or launder synthetic variants to inflate their share. We formalize this as false-name manipulation in ML data attribution. Our main construction is the quotient semivalue mechanism: compute Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. We prove an impossibility: on a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances, and characterize the split-gain of a general semivalue on a unanimity counter-example. The mechanism is exactly false-name-proof under two structural conditions: false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold approximately, manipulation gain and fairness loss are bounded by three measurable quantities: escaped-cluster mass, value-estimation error, and clustering distance. We instantiate the mechanisms in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from 1.74 under baseline Shapley to 0.96 , near the honest level. The cosine-threshold and (false-merge, false-split) rate sweeps trace the corresponding fairness–Sybil frontier.
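A toy game makes the quotient idea concrete. The sketch below is our illustration with an assumed additive value function, not the authors' implementation: it computes exact Shapley values over reported identities versus over evidence-backed clusters with a deduplicated canonical representative:

```python
# Toy illustration (assumed value function, not the paper's code) of why
# quotient semivalues resist false-name splitting: Shapley over raw
# identities rewards a provider who splits and duplicates a dataset,
# while Shapley over deduplicated attribution clusters does not.
import math
from itertools import permutations

def shapley(players, value):
    """Exact Shapley via all orderings (fine for toy-sized games)."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = []
        for p in order:
            phi[p] += value(coalition + [p]) - value(coalition)
            coalition.append(p)
    n_fact = math.factorial(len(players))
    return {p: v / n_fact for p, v in phi.items()}

# Naive additive value: a coalition is worth the number of examples it reports.
datasets = {"a1": [1, 2], "a2": [1, 2], "b": [3, 4, 5]}  # a1/a2: one provider, split & duplicated
naive = lambda S: sum(len(datasets[p]) for p in S)
print(shapley(list(datasets), naive))      # the split provider collects 2 + 2 = 4

# Quotient mechanism: merge evidence-linked identities into one cluster and
# deduplicate via a canonical representative before valuation.
clusters = {"A": {1, 2}, "B": {3, 4, 5}}
quotient = lambda S: len(set().union(*(clusters[c] for c in S))) if S else 0
print(shapley(list(clusters), quotient))   # cluster A is worth only its 2 unique examples
```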
[LG-45] Direction-Preserving Number Representations
链接: https://arxiv.org/abs/2605.07662
作者: Bardia Zadeh,George A. Constantinides
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 9 pages excluding appendices and references, 18 in total. 5 figures
Abstract:Low-precision number formats are widely used in modern machine learning systems due to their efficiency. Accurate direction representation is key to the accuracy of vector operations. This work precisely explores the extent to which the direction of a vector can be represented by selecting its scalar elements from a common finite alphabet of a given size. This is standard practice in machine learning, where low-precision significands may be narrow-width floating-point or integer values. A geometric framework is introduced for analyzing the directional coverage of such product-structured codes. This work analytically quantifies the suboptimality gap between such product-structured codes and spherical codes for the vector as a whole, in both low and asymptotically high dimensions. Furthermore, within the product code class, it is proven that the standard formats of two’s complement, fixed-point, and floating-point are suboptimal, again with quantified gap, pointing to the potential to develop new scalar number formats. Such scalar alphabets are numerically optimized across multiple block dimensions for directional coverage, including the dimension used in NVIDIA’s NVFP4 format. Experimental results are presented comparing the performance of standard formats and the optimized alphabet. We find that for four bits, NVIDIA’s choice of E2M1 closely approximates the optimized alphabet, providing a geometric explanation for its strong performance in low-precision machine learning workloads and an analytical understanding of the link between that superiority and block size. We provide open-source formal proofs in Lean for the theorems in this work, along with the experimental code and the optimized alphabets obtained.
[LG-46] Learning Large-Scale Modular Addition with an Auxiliary Modulus
链接: https://arxiv.org/abs/2605.07648
作者: Hanato Kikuchi,Ryosuke Masuya,Kazuhiko Kawamoto,Hiroshi Kera
类目: Machine Learning (cs.LG)
*备注: 10+11 pages, 5 figures
Abstract:Learning parity functions, and more generally modular addition, is a challenging machine learning task due to its input sensitivity. A recent study substantially scaled modular addition learning in both the number of summands and the modulus. Its key idea is to increase zeros in training sequences, reducing the effective number of summands and thus controlling training difficulty; however, this induces covariate shift between training and test input distributions. This study theoretically and empirically analyzes this side effect and proposes a covariate-shift-free method for modular addition. Specifically, we introduce an auxiliary modulus Kq during training, which reduces wrap-around frequency and problem difficulty while preserving the same input distribution across training and testing. Experiments show strong scalability and sample efficiency: even for large input length N , large modulus q , and small datasets – where the sparse method fails to learn – our method achieves equal or better match accuracy and relaxed \tau -accuracy. For example, at N=64 and q=974269 , our method trained on 100K samples achieves 97.0% \tau -accuracy at \tau=0.05 , while the sparse method achieves only 9.5% with the same data size and 93.9% even when extended to 1M samples.
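A rough sketch of how such training data might be generated under the auxiliary modulus (our reading of the abstract; the sampling details are assumptions, not the paper's setup): inputs keep the same uniform distribution in training and testing, but training labels use modulus Kq, so the sum wraps around far less often:

```python
# Illustrative sketch (assumed details, not the paper's code) of the
# auxiliary-modulus idea: summands are drawn from the same uniform
# distribution in training and testing, but training labels are taken
# mod K*q instead of mod q, reducing wrap-around frequency.
import random

def make_example(n_summands, q, K=1):
    """Sample summands uniformly in [0, q) and label with sum mod K*q."""
    xs = [random.randrange(q) for _ in range(n_summands)]
    return xs, sum(xs) % (K * q)

random.seed(0)
q, N = 97, 8
xs_train, y_train = make_example(N, q, K=16)  # training label: sum mod Kq
xs_test, y_test = make_example(N, q, K=1)     # true task: sum mod q
# the input distribution is identical in both phases, and reducing a
# Kq-label mod q recovers the true-task label for the same inputs
assert y_train % q == sum(xs_train) % q
```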
[LG-47] Optimal Recourse Summaries via Bi-Objective Decision Tree Learning
链接: https://arxiv.org/abs/2605.07598
作者: Ioannis Chatzis,Jason Liartis,Athanasios Voulodimos,Giorgos Stamou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Actionable Recourse provides individuals with actions they can take to change an unfavorable classifier outcome. While useful at the instance level, it is ill-suited for global auditing and bias detection, since aggregating local actions is costly and often inconsistent. Recourse Summaries address this limitation by partitioning the population and assigning one shared action per subgroup, enabling comparison across subgroups. Designing summaries involves a fundamental trade-off between recourse effectiveness and recourse cost, which existing methods do not adequately address. We introduce Summaries of Optimal and Global Actionable Recourse (SOGAR), which formulates recourse summary learning as an optimal decision tree learning problem and finds the Pareto front – the complete set of solutions where improving one objective necessarily worsens the other. SOGAR enables post-hoc selection of the desired trade-off without retraining. Using shallow axis-parallel decision trees and sparse leaf actions, SOGAR produces stable, low-cost, and effective recourse summaries that outperform existing approaches across effectiveness and cost metrics.
[LG-48] Bilevel Graph Structure Learning Revisited: Inner-Channel Origins of the Reported Gain
链接: https://arxiv.org/abs/2605.07577
作者: Minkyoung Kim,Beakcheol Jang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Bilevel graph structure learning is widely understood to improve graph neural networks by jointly optimizing model parameters and a learned graph structure, with the resulting performance gain attributed to the rewired adjacency. We find that this attribution may be overstated: training-dynamics effects in the inner loop, rather than the rewiring itself, capture a substantial share of the gain. To establish this, we introduce frozen- \phi , a control that freezes the graph while retaining the inner-loop training schedule. This decomposes the bilevel gain into an inner channel of T -step training dynamics with implicit gradient regularization and a graph channel of the graph rewiring itself. On spatio-temporal flow forecasting the inner channel matches or exceeds the full bilevel pipeline, accounting for 78-101% of the gain; on node classification it accounts for 37-44% under a Bernoulli edge-level parameterization. We also verify that classical spectral diagnostics can dissociate from task gain. We propose frozen- \phi as a standardized diagnostic for bilevel graph structure learning, with graph distillation as a method-agnostic complement. A three-precondition framework further predicts the sign of the bilevel gain on all six benchmarks.
[LG-49] Beyond Distribution Estimation: Simplex Anchored Structural Inference Towards Universal Semi-Supervised Learning ICML2026
链接: https://arxiv.org/abs/2605.07557
作者: Yaxin Hou,Jun Ma,Hanyang Li,Bo Han,Jie Yu,Yuheng Jia
类目: Machine Learning (cs.LG)
*备注: The paper is accepted by ICML 2026
Abstract:Semi-supervised learning faces significant challenges in realistic scenarios where labeled data is scarce and unlabeled data follows unknown, arbitrary distributions. We formalize this critical yet under-explored paradigm as Universal Semi-supervised Learning (UniSSL). Existing methods typically leverage unlabeled data via pseudo-labeling. However, they often rely on the idealized assumption of a uniform unlabeled data distribution or require sufficient labeled data to estimate it. In the UniSSL setting, such dependencies lead to numerous erroneous pseudo-labels, thereby triggering representation confusion. Fortunately, we observe that inter-sample relations captured by representations are more reliable than pseudo-labels. Leveraging this insight, we shift our focus to representation-level structural inference to bypass distribution estimation. Accordingly, we propose Simplex Anchored Graph-state Equipartition (SAGE), which captures high-order inter-sample dependencies to establish structural consensus for guiding representation learning. Meanwhile, to mitigate representation confusion, we employ vectors that satisfy a simplex equiangular tight frame to serve as a coordinate frame for guiding inter-class representation separation. Finally, we introduce a weighting strategy based on distribution-agnostic metrics to prioritize reliable pseudo-labels and an auxiliary branch to isolate potentially erroneous pseudo-labels. Evaluations on five standard benchmarks show that SAGE consistently outperforms state-of-the-art methods, with an average accuracy gain of 8.52%.
[LG-50] Disagreement-Regularized Importance Sampling for Adversarial Label Corruption
链接: https://arxiv.org/abs/2605.07551
作者: Csongor Horváth,Ida-Maria Sintorn,Prashant Singh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Standard Importance Sampling (IS) collapses under label corruption because high-norm examples, prioritized for variance reduction, are often adversarial outliers. We formalize this misalignment using an \varepsilon-contamination model and propose Disagreement-Regularized Importance Sampling (DR-IS), a sub-sampling method based on loss rank-disagreement across an independent proxy ensemble. We prove finite-sample concentration bounds showing that the empirical rank disagreement of bulk corrupted examples is bounded above, and that of boundary-clean examples bounded below, both at rate O(\sqrt{\log(N/\delta)/K}) with probability 1-\delta; when the structural expectation gap \Delta' between the two groups is positive and the boundary-clean set is at least as large as the selected subset, these bounds certify strict separation and control the contamination rate of the selected subset. Empirically, DR-IS remains robust under targeted high-norm attacks that break magnitude-based methods such as the Error L_2-norm (EL2N) on benchmark datasets. DR-IS complements training-dynamics approaches like Area Under the Margin ranking (AUM), offering improved robustness in the loss-aligned regime alongside explicit finite-sample concentration certificates and a contamination bound limiting noise leakage from the statistical tail of corrupted points.
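The rank-disagreement statistic at the core of DR-IS can be sketched as follows (an illustrative reading of the abstract, not the authors' code): each proxy model ranks examples by loss, and an example's score is the spread of its rank across proxies; per the abstract, bulk corrupted examples concentrate at low disagreement while boundary-clean examples sit at high disagreement.

```python
# Hedged sketch (our reading of the abstract, not the paper's code):
# compute per-example loss ranks under each proxy model, then score each
# example by the spread of its rank across the proxy ensemble.

def ranks(losses):
    """Rank of each example when sorted by ascending loss."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    r = [0] * len(losses)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def rank_disagreement(proxy_losses):
    """proxy_losses: list of per-proxy loss vectors over the same N examples."""
    all_ranks = [ranks(l) for l in proxy_losses]
    n = len(proxy_losses[0])
    return [max(rs[i] for rs in all_ranks) - min(rs[i] for rs in all_ranks)
            for i in range(n)]

# Three hypothetical proxies scoring three examples:
proxies = [[0.1, 0.9, 0.5], [0.2, 0.3, 0.8], [0.15, 0.95, 0.7]]
print(rank_disagreement(proxies))  # -> [0, 1, 1]
```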
[LG-51] On the Invariance and Generality of Neural Scaling Laws
链接: https://arxiv.org/abs/2605.07546
作者: Xing Han,Ziyin Liu,Suchi Saria,Paul Pu Liang
类目: Machine Learning (cs.LG)
*备注: 23 pages, 6 figures, 11 tables
Abstract:Neural scaling laws establish a predictable relationship between model performance and data or compute, offering crucial guidance for resource allocation in new domains and tasks. Yet such laws are most needed precisely where they are hardest to obtain: fitting one for a new model task pair demands expensive sweeps that typically exhaust the very compute budget the law is meant to economize. This paper poses the research question of how to develop generalizable scaling laws: laws fit once on a well-resourced source domain and reliably transported to new domains where running a full sweep is infeasible, which requires a fundamental understanding of when and why scaling properties change. We address this by identifying the right invariants: scaling laws are preserved under bijective (information-preserving) transformations of the data and modified in predictable, information-theoretically grounded ways under non-bijective transformations that lower its information resolution \rho : a single axis along which a law fit in one domain can be transported to another. We validate this across language, vision, and speech, and demonstrate two cross-domain applications: predicting scaling for language models trained on electronic health records from laws fit on general text, and predicting time-series classification scaling under varying levels of noise injection, recovering the data-scaling exponents to within 3% error.
[LG-52] GESR: Graph-Based Edge Semantic Reconstruction for Stealthy Communication Detection with Benign-Only Training
链接: https://arxiv.org/abs/2605.07536
作者: Henghui Xu,Yuchen Zhang,Xiaobo Ma
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Detecting stealthy malicious communications from flow logs under benign-only training remains a critical challenge in network security. Malicious communications often camouflage as normal traffic like standard HTTPS flows. Conventional intrusion detectors rely strictly on known labeled attacks. Alternatively, they score flows completely independently. These approaches fail against sparse and context-dependent suspicious activity. To capture this essential context, graph anomaly detectors have been introduced to add valuable relational information to the analysis. However, existing methods fail to test the structural consistency of specific communication edges. To overcome these fundamental limitations, we present GESR, a novel graph-based framework for detecting suspicious communications and anomalous hosts under a benign-only training setting. GESR models complex network activity as attributed communication graphs. It cleverly reconstructs edge semantics entirely from local structural context rather than isolated features. This non-intuitive design forces the framework to predict expected communication patterns from neighborhood topologies. Attackers cannot easily manipulate this deep structural dependency. The model then converts the resulting structural inconsistencies into host-level anomaly scores. It utilizes robust Median Absolute Deviation (MAD) calibration for this final step. We evaluate GESR extensively on CTU-13 and CICIDS2017 datasets. These evaluations strictly impose tight false-positive operating constraints. On CICIDS2017, GESR achieves an outstanding ROC-AUC of 0.9753. It also yields a high TPR of 0.8569 at a strict 5% FPR threshold. GESR consistently outperforms existing methods across both evaluated benchmarks. The results prove that structure-conditioned edge reconstruction is a credible direction for practical intrusion detection.
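The final MAD calibration step described in the abstract is standard robust statistics; here is a hedged sketch (hypothetical error values, not GESR's code) converting raw host-level reconstruction errors into robust z-scores:

```python
# Sketch of Median Absolute Deviation (MAD) calibration as named in the
# abstract (our illustration with hypothetical values, not GESR's code):
# raw host-level reconstruction errors become robust z-scores that are
# insensitive to a few extreme outliers.
import statistics

def mad_scores(errors):
    med = statistics.median(errors)
    mad = statistics.median(abs(e - med) for e in errors)
    # 1.4826 makes MAD consistent with the std. dev. under Gaussian noise
    return [(e - med) / (1.4826 * mad) for e in errors]

errors = [0.10, 0.12, 0.11, 0.13, 0.95]   # last host reconstructs poorly
print([round(s, 2) for s in mad_scores(errors)])
```

The anomalous host stands out by tens of robust standard deviations while the benign hosts stay within about one, which is what makes a tight false-positive threshold workable.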
[LG-53] SGD for Variational Inference: Tackling Unbounded Variance via Preconditioning and Dynamic Batching
链接: https://arxiv.org/abs/2605.07531
作者: Hippolyte Labarrière,Cesare Molinari,Silvia Villa,Lorenzo Rosasco
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Black-Box Variational Inference (BBVI) typically relies on Stochastic Gradient Descent (SGD) to optimize the Evidence Lower Bound (ELBO). However, the stochastic gradients in BBVI inherently exhibit unbounded variance, violating standard assumptions and instead satisfying the weaker Blum-Gladyshev (BG) condition, where variance grows quadratically with distance from the optimum. In this paper, we bridge the gap between stochastic optimization theory and the practical instances of BBVI. Focusing on the broad elliptic location-scale family of parameterized distributions, we offer two main contributions. First, we prove the existence of an ELBO solution, a foundational property usually assumed a priori in the literature. Second, we establish comprehensive convergence guarantees spanning finite-time and asymptotic regimes for Minibatch Projected SGD (PSGD) equipped with dynamic batching and preconditioning under the BG condition. Our theoretical framework demonstrates that dynamic batching combined with preconditioning systematically enables rigorous guarantees even in complex settings. We illustrate our theoretical findings with numerical results, highlighting the efficacy of our approach for modern inference tasks.
[LG-54] Tessellations of Semi-Discrete Flow Matching
链接: https://arxiv.org/abs/2605.07513
作者: Emile Pierret,Johannes Hertrich,Samuel Hurault,Julie Delon
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study Flow Matching in a semi-discrete setting where a Gaussian source is transported toward a discrete target supported on finitely many points. This semi-discrete regime is the theoretical setting behind the use of Flow Matching for generative modeling, where the target distribution is represented by a finite dataset. In this semi-discrete regime, the exact Flow Matching velocity field is available in closed form, which makes it possible to analyze the geometry induced by the terminal flow map independently of optimization and approximation effects. We investigate the terminal assignment regions, namely the preimages of the target atoms under the terminal flow. We show that these regions are open, simply connected and, under an additional assumption, homeomorphic to the unit ball. At the same time, a planar four-point example shows that these cells can differ sharply from Laguerre cells arising in semi-discrete optimal transport: they may be non-convex, have curved boundaries, and exhibit different boundedness and adjacency patterns. These results clarify the geometry intrinsically induced by the exact semi-discrete Flow Matching objective before neural approximation enters the picture.
[LG-55] NPMixer: Hierarchical Neighboring Patch Mixing for Time Series Forecasting
链接: https://arxiv.org/abs/2605.07476
作者: Jung Min Choi,Vijaya Krishna Yalavarthi,Lars Schmidt-Thieme
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multivariate time series forecasting remains a challenge due to the complexity of local temporal dynamics and global dependencies across multiple variables. In this paper, we propose Neighboring Patching Mixer (NPMixer), a hierarchical architecture featuring a Learnable Stationary Wavelet Transform that adaptively learns filter coefficients to decompose signals into trend and detail components in a data-dependent manner. Our framework introduces a Neighboring Mixer Block that captures local temporal dynamics through a series of hierarchical MLP layers operating on non-overlapping patches. Specifically, the mixer block utilizes MLPs to learn temporal patterns within and across these patches, expanding the receptive field to capture multi-scale dependencies. A Channel-Mixing Encoder is applied to high-frequency components to learn channel correlations while preserving the stability of the underlying global trend. Extensive experiments on seven benchmark datasets demonstrate that NPMixer consistently outperforms state-of-the-art models, achieving better performance in 20 out of 28 (71.4%) evaluated experimental setups for MSE.
[LG-56] Transfer Learning Across Fast- and Full-Simulation Domains in High-Energy Physics
链接: https://arxiv.org/abs/2605.07471
作者: Matthias Schott,Lucie Flek
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 16 pages, 8 figures
Abstract:Machine-learning models in high-energy physics are often trained on simulated data, where fully simulated samples are computationally expensive while fast simulation provides large statistics at reduced realism. In this work, we systematically study transfer learning between fast-simulated and fully simulated datasets in a realistic LHC environment. We consider three representative tasks, signal-background classification, quark-gluon jet tagging, and missing transverse energy reconstruction, using dense neural networks, graph neural networks, and transformer-based architectures. Models are pretrained on ATLAS-like fast simulation and adapted to CMS-like fast simulation and to fully simulated ATLAS Open Data. Across all tasks, pretrained models consistently outperform independently trained baselines and require significantly less target-domain training data, typically reducing the needed statistics by about a factor of two. These results demonstrate that fast simulation can be used to learn robust, reusable representations and motivate publishing trained models as reusable scientific assets beyond large foundation models.
[LG-57] Uncovering Hidden Systematics in Neural Network Models for High Energy Physics
链接: https://arxiv.org/abs/2605.07470
作者: Lucie Flek,Philipp Alexander Jungs,Akbar Karimi,Timo Saala,Alexander Schmid,Matthias Schott,Philipp Soldin,Christopher Wiebusch,Ulrich Willemsen
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 18 pages, 9 figures
Abstract:Neural networks (NNs) are inherently multidimensional classifiers that learn complex, non-linear relationships among input observables. While their flexibility enables unprecedented performance in high-energy physics (HEP) analyses, it also makes them sensitive to small variations in their inputs. Consequently, the propagation and estimation of systematic uncertainties in NN-based models remain an open challenge. There are indications that uncertainties derived in control regions or from nominal variations of input features can underestimate the true model uncertainty, potentially leaving biases unaccounted for. Inspired by insights from adversarial-attack studies in machine learning, we explore how subtle perturbations, fully consistent with the experimental uncertainties on the input observables, can lead to substantial changes in NN outputs, while keeping the one-dimensional and correlated input distributions nearly unchanged. Using a set of representative HEP tasks, including event classification and object identification, and testing across a variety of network architectures, we demonstrate that networks can be systematically “fooled” at significant rates within the allowed uncertainty envelopes. Building on this observation, we introduce a quantitative framework to probe and measure the hidden sensitivity of neural networks to realistic experimental variations, providing a practical path to evaluate and control their systematic uncertainty in physics analyses.
[LG-58] Approximation Error Upper and Lower Bounds for Hölder Class with Transformers ICML2026
链接: https://arxiv.org/abs/2605.07463
作者: Xin He,Yuling Jiao,Xiliang Lu,Jerry Zhijian Yang
类目: Machine Learning (cs.LG)
*备注: 31 pages, 2 figures. Accepted by ICML2026
Abstract:We explore the expressive power of Transformers by establishing precise approximation error upper and lower bounds for the Hölder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most \mathcal{O}(\varepsilon^{-d_0/\alpha}) blocks can approximate any bounded Hölder function with d_0 -dimensional input and smoothness \alpha\in(0,1] under any accuracy \varepsilon>0 . In the case of approximation lower bounds, leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand at least \Omega(\varepsilon^{-d_0/(4\alpha)}) blocks to achieve the \varepsilon approximation accuracy. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates, demonstrating Transformers’ empirical effectiveness in real-world settings.
[LG-59] Learning Minimal-Deviation Corrections for Multi-Dimensional Mismodelling in HEP Simulations
链接: https://arxiv.org/abs/2605.07460
作者: Matthias Schott,Lucie Flek
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 12 pages, 6 figures
Abstract:Accurate Monte Carlo (MC) modelling in high-energy physics is challenging, particularly in complex scenarios where simulations fail to reproduce observed data. In practice, experimental information is often limited to one-dimensional (1D) distributions, while mismodelling arises in a multidimensional feature space. This restricts traditional correction methods, as one-dimensional reweighting ignores correlations and fully multidimensional approaches require large target datasets. We propose a neural network-based method that operates under these constraints by learning a transformation of simulated events that reproduces the available 1D target distributions while remaining close to the original simulation. This minimal-deviation principle preserves the global correlation structure of the baseline model while enabling targeted corrections of mismodelled features. Using controlled studies with simulated pseudo-data, we show that the method improves agreement with target distributions and maintains a consistent multidimensional structure. The approach is designed for complex, high-dimensional analyses where traditional techniques are insufficient, providing a scalable way to enhance MC modelling under limited information.
[LG-60] Estimation of Motor Unit Parameters from Surface Electromyograms using an Informed Autoencoder
链接: https://arxiv.org/abs/2605.07458
作者: Kaja Balzereit,Malte Mechtenberg,Axel Schneider
类目: Machine Learning (cs.LG)
*备注:
Abstract:Motor unit parameters such as the innervation zone centre or the conduction velocity of the electrical potential harbour the potential to improve the fidelity of neuromechanical models used for movement and force prediction. Determining these parameters in a non-invasive way is challenging, as they are subject-specific and may vary with muscle contraction. Existing work on the estimation of motor unit parameters mainly relies on white-box modelling and therefore requires substantial manual modelling effort. This work targets the simultaneous estimation of multiple subject-specific motor unit parameters from electromyography (EMG) recordings measured non-invasively at the skin surface. This results in an inverse problem with a nonlinear loss function. To address this problem, an informed autoencoder is developed. This autoencoder reconstructs the surface EMG recordings while learning the parameters in its latent space and adhering to physical laws that relate the parameters to the EMG signals. In experiments on synthetic data, innervation zone centres are estimated with a mean absolute error of $2.5989\,\mathrm{mm}$, and conduction velocities of the electric potential are estimated with a mean absolute error of $0.1697\,\mathrm{m\,s^{-1}}$. These results demonstrate the plausibility of this novel approach, which enables the simultaneous estimation of several motor unit parameters while reducing manual modelling effort through the integration of data-driven machine learning.
[LG-61] Inference-Time Attribute Distribution Alignment for Unconditional Diffusion
链接: https://arxiv.org/abs/2605.07456
作者: Hao Luan,See-Kiong Ng,Chun Kai Ling
类目: Machine Learning (cs.LG)
*备注: Preprint. 35 pages, 13 figures
Abstract:Inference-time controllable generation is essential for real-world applications of unconditional diffusion models. However, most existing techniques focus on individual samples, struggling in applications that require the sample population to follow specific attribute distributions (e.g., demographic balance or semantic proportions). We formalize this setting as the inference-time attribute distributional alignment problem for pretrained unconditional diffusion models. To address this, we cast inference-time attribute distributional alignment as an optimal control problem over the reverse diffusion process, viewing the process as the rollout of a dynamical system and augmenting it with additive, time-dependent perturbations as control. We solve for the perturbations using an optimal-control-based algorithm to optimize a differentiable distribution-matching objective while penalizing control effort to preserve data fidelity. Experiment results in image generation demonstrate that our proposed plug-and-play approach can better align attribute distributions to diverse and flexible test-time targets compared to baselines, without retraining or finetuning the pretrained diffusion model.
[LG-62] VNN-LIB 2.0: Rigorous Foundations for Neural Network Verification
链接: https://arxiv.org/abs/2605.07451
作者: Ann Roy,Allen Antony,Andrea Gimelli,Matthew L. Daggitt
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural network verification is an active and rapidly maturing research area, with a growing ecosystem of solvers and tools. The VNN-LIB standard was introduced to support interoperability in this ecosystem, but Version 1.0 has several serious shortcomings as a formal foundation: it lacks a precise syntax, semantics, and type system, offers limited expressivity, and relies on externally defined ONNX models whose semantics are informal and constantly evolving. The latter distinguishes VNN-LIB from established standards such as SMT-LIB, where queries are self-contained and have fixed semantics. In this paper we address these challenges by developing the theoretical foundations of VNN-LIB 2.0. Our key contribution is the introduction of the notion of a *network theory*, which abstractly characterises the minimal semantic interface required from a neural network model format. This abstraction enables VNN-LIB to be defined independently of any specific ONNX version while remaining compatible with evolving model representations. Building on this foundation, we present a formal syntax for a more expressive query language, a type system for it over the numeric domains provided by the network theory, and finally a formal semantics. To ensure internal consistency, the standard is mechanised in the Agda theorem prover. VNN-LIB 2.0 therefore provides robust and rigorous foundations for trustworthy neural network verification.
[LG-63] GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection
链接: https://arxiv.org/abs/2605.07442
作者: Chaobo Jia,Ruipeng Wan,Ting Sun,Weihao Tan,Borui Wan,Yuxuan Tong,Guangming Sheng,Hong Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:LLM-based game generation promises to turn natural-language specifications into executable games, but progress is limited by the lack of reliable automated verification. Unlike conventional code generation, game correctness is defined over long-horizon interaction: a game may appear correct while violating core mechanics such as state updates, interaction rules, and phase transitions. Existing Agent-as-a-Verifier approaches collapse verification into open-ended gameplay, making verdicts reachability-bound, time-consuming, coverage-limited, and sensitive to the agent's gameplay ability. We present GameGen-Verifier, an automated verification paradigm for LLM-generated games that decomposes a specification into verifiable keypoints and grounds them into independent verification units. Each unit patches the game runtime into a concrete target state, executes a bounded interaction, and judges the outcome against the keypoint assertion. We implement GGV-Harness, a scalable agentic harness providing concurrency management, runtime isolation, and fault recovery. On VeriGame, our dataset of 100 games across seven genres, GameGen-Verifier achieves up to 92.2% accuracy against human judgments versus 58.8% for the coverage-enforced Agent-as-a-Verifier baseline, while reducing wall-clock time by up to 16.6x.
[LG-64] A Flexible Adaptive Stable Clustering Algorithm for Archive-Scale Online Mass Spectrometry
链接: https://arxiv.org/abs/2605.07424
作者: Shao Shi,Xin Yang,Huiran Feng,Jianhuai Ye,Tianlong Hu,Yaling Zeng,Tzung-May Fu,Lei Zhu,Huizhong Shen,Chen Wang,Shu Tao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern online mass spectrometry generates multi-terabyte data streams critical for understanding Earth’s environmental systems. However, extracting actionable chemical insights from these repositories is impeded by a computational bottleneck: existing clustering methods force a compromise among scalability, metric flexibility, and algorithmic stability. Here, we introduce Flexible Adaptive Stable Clustering (FASC), a dynamical systems framework that resolves these constraints by architecturally decoupling the similarity kernel from rigorous optimization logic. Unlike legacy heuristics that suffer from stochastic drift and algorithmic blending, FASC employs a Density-Augmented Similarity Selection rule and geometric constraints to guarantee deterministic, order-independent convergence. After validating FASC on canonical machine-learning ground truths (achieving 99.5% cluster purity and 0.99 Adjusted Rand Index), we deployed the framework on 25 million mass spectra of atmospheric aerosols. Demonstrating strictly linear empirical runtime scaling (O(N)), FASC autonomously mapped atmospheric aging pathways of secondary inorganic aerosols while isolating ultra-rare industrial tracers (0.2% abundance), providing a scalable infrastructure for mining environmental big data.
[LG-65] Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs
链接: https://arxiv.org/abs/2605.07417
作者: Mohammad Hasan Ahmadilivani,Marten Roots,Marco Restifo,Sven-Markus Loorits,Luca Di Mauro,Jaan Raik
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 7 pages, 7 figures, 3 tables. The paper is accepted at IEEE IOLTS’26
Abstract:Modern Deep Learning (DL) workloads are increasingly deployed in safety-critical domains, such as automotive systems and hyperscale data centers, where transient hardware faults pose a serious threat to system reliability. These workloads are highly memory-intensive, and their correct functionality strongly depends on model parameters stored in memory, which are typically protected using Error Correction Codes (ECCs). In this work, we study ECC's impact on such models and propose two lightweight alternatives to ECCs that achieve superior reliability. The first approach, MSET, selectively hardens the most vulnerable bits in CNN and ViT parameters, while the second approach, CEP, provides fine-grained protection for all parameter bits. Experimental results demonstrate that both methods significantly enhance the reliability of large CNNs and ViTs, mostly outperforming conventional Single Error Correction Double Error Detection (SECDED) ECC schemes, with no memory overhead and, in fact, with considerably lower area and delay characteristics compared to SECDED. Experimental results indicate that ViTs can be effectively protected by merely protecting their highest exponent bits in FP16 and FP32 representations. Furthermore, applying the CEP technique can guarantee the resilience of DNNs at BERs up to one order of magnitude higher, with a 3.5x lower area overhead and 7x faster decoder compared to SECDED ECC.
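As a concrete illustration of why the highest exponent bits dominate vulnerability, the toy Python sketch below flips single bits of an FP32 value and then "hardens" the two top exponent bits by keeping a backup copy that is restored on read. This is purely illustrative software emulation; the paper's MSET and CEP encodings are hardware schemes, and the choice of two protected bits here is an assumption.

```python
import struct

def f32_bits(x: float) -> int:
    """IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Float value of a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def flip_bit(x: float, i: int) -> float:
    return bits_f32(f32_bits(x) ^ (1 << i))

# A flip in the top exponent bit (bit 30) is catastrophic,
# while a mantissa LSB flip (bit 0) is negligible.
w = 0.125
big = flip_bit(w, 30)    # exponent corrupted -> huge magnitude
tiny = flip_bit(w, 0)    # mantissa LSB -> tiny perturbation
assert abs(big) > 1e30
assert abs(tiny - w) < 1e-6

# Toy selective hardening: back up the two highest exponent bits
# (bits 29-30) and restore them on read.
EXP_MASK = 0b11 << 29

def harden(x: float):
    return x, f32_bits(x) & EXP_MASK      # value + protected bits

def restore(x: float, backup: int) -> float:
    return bits_f32((f32_bits(x) & ~EXP_MASK) | backup)

stored, backup = harden(w)
corrupted = flip_bit(stored, 30)          # fault hits a protected bit
assert restore(corrupted, backup) == w    # recovered exactly
```

The asymmetry the asserts exercise is exactly why protecting a handful of exponent bits can rival full-word ECC at a fraction of the cost.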
[LG-66] Risk-Consistent Multiclass Learning from Random Label-Subset Membership Queries
链接: https://arxiv.org/abs/2605.07413
作者: Jiaxu Su,Junpeng Li,Changchun Hua,Yana Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Obtaining accurate class labels is often costly or unreliable, and may also be limited by privacy or other practical conditions. Compared with asking an annotator to provide the exact class, it is often easier to ask whether the true label belongs to a certain label subset. This query-response form defines a distinct weak-supervision mechanism: weak supervision information is generated through feedback on a label subset. Although weakly supervised learning has studied many learning frameworks, most existing work starts from established weak label objects. A systematic characterization is still lacking for weakly supervised learning generated directly by such query-response observations. This paper proposes a multiclass learning framework under random label-subset queries. We model the data-generating distribution of query-response observations and derive an unbiased estimator of the target risk under the empirical risk minimization (ERM) framework. To address the negative empirical risk and the associated overfitting problem, we introduce corrected risk estimators based on non-negative and absolute-value corrections. Theoretical analysis establishes a conditional generalization and excess-risk bound for the unbiased estimator, and a bias-and-consistency result for the corrected risk estimators. Experiments under the matched random-query mechanism demonstrate the feasibility of direct query-response learning and the stabilization effect of risk correction.
[LG-67] Emergent Symbolic Structure in Health Foundation Models: Extraction Alignment and Cross-Modal Transfer ICML
链接: https://arxiv.org/abs/2605.07407
作者: Gajendra Katuwal,Advait Koparkar,Salar Abbaspourazad,Anshuman Mishra,Sarvesh Kirthivasan
类目: Machine Learning (cs.LG)
*备注: 8 pages ICML workshop, 4 main figures
Abstract:Health foundation models (FMs) learn useful representations from wearable sensors, but interpreting what they encode and transferring that knowledge across modalities after training remains difficult. We present a post-training framework that decomposes frozen embeddings into interpretable directions, referred to as symbols, and use these symbols to align the embedding spaces without retraining. We evaluate the framework on three FMs for photoplethysmography (PPG) and accelerometer data, independently pretrained on ~20M minutes of unlabeled data from ~172K participants, and analyzed on a held-out cohort of 30K subjects. We find that extracted symbols associate selectively with health conditions and physiological attributes, and these associations are partially shared across modalities and architectures. Cross-modal transfer via symbols retains more than 95% of in-domain performance, is nearly symmetric across domain directions, and saturates with limited paired data, together indicating that alignment recovers a shared low-dimensional subspace rich in physiological information. Overall, these results suggest that health FM embeddings contain an interpretable symbolic organization that is shared across modalities and supports cross-domain transfer without joint training.
[LG-68] Have Graph – Will Lift? The Case for Higher-Order Benchmarks
链接: https://arxiv.org/abs/2605.07397
作者: Bastian Rieck
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:
Abstract:After a somewhat rocky start, geometry and topology have established a foothold in machine learning. Message passing, either on graphs or higher-order complexes, is one of the main drivers of geometric deep learning, and paradigms that were once considered to be firmly in the realm of the abstract, like sheaves, have been “tamed” to serve as novel inductive biases for model architectures in topological deep learning. The veritable diversity of models, however, is in stark contrast to the scarcity of suitable benchmark datasets. As a result, researchers often resort to lifting existing graph datasets to include higher-order information. In this opinion paper, I want to encourage the community to also source new datasets, which may be used to prop up the foundations of our research field.
[LG-69] Exploring CoCo Challenges in ML Engineering Teams: Insights From the Semiconductor Industry
链接: https://arxiv.org/abs/2605.07389
作者: A. Azamnouri,M. Haug,L. Woltmann,M. Fritz,J. Bogner,S. Wagner
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:The integration of machine learning (ML) into complex software systems has increased challenges in the collaboration and communication (CoCo) of the teams building these systems. ML engineering (MLE) teams often involve diverse roles: ML engineers, data scientists, software engineers, and domain experts, each bringing unique goals, experiences, and jargon. These interdisciplinary dynamics can make it challenging to deploy, reproduce, and maintain ML-enabled systems over the long term. Previous studies have uncovered several CoCo challenges and practices, but most have focused on software-centric companies, leaving limited empirical understanding of how these dynamics unfold in hardware-centric contexts. In hardware-centric environments, CoCo challenges are shaped by additional constraints such as strict data governance, long development cycles, and tight coupling with physical processes, which amplify coordination complexity and reduce flexibility. To strengthen empirical understanding in such settings, we present a qualitative investigation of MLE teams within a global semiconductor company, where ML-enabled systems and manufacturing processes introduce additional complexity. We interviewed 12 practitioners regarding CoCo practices, tools, challenges, and approaches. Through analysis, we identified 16 recurring challenges, with unclear roles and responsibilities emerging as the most critical, as well as common practices and recommendations practitioners considered effective in mitigating CoCo problems. While grounded in a single organizational context, our findings align with known issues in interdisciplinary ML-enabled systems development, but also demonstrate how these challenges manifest differently under hardware-driven constraints. Our results highlight directions for future research and tool support to strengthen CoCo in MLE projects and ensure the success of ML-enabled systems.
[LG-70] Convex Optimization with Nested Evolving Feasible Sets
链接: https://arxiv.org/abs/2605.07386
作者: Karthick Krishna M.,Haricharan Balasundaram,Rahul Vaze
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
*备注:
Abstract:Convex Optimization with Nested Evolving Feasible Sets (CONES) is considered, where the objective function $f$ remains fixed but the feasible region evolves over time as a nested sequence $S_1 \supseteq S_2 \supseteq \cdots \supseteq S_T$. The goal of an online algorithm is to simultaneously minimize the regret with respect to the hindsight static optimal benchmark and the total movement cost while ensuring feasibility at all times. CONES is an optimization-oriented generalization of the well-known nested convex body chasing problem. When the loss function is convex, we propose a lazy algorithm and show that it simultaneously achieves $O(T^{1-\beta})$ regret and $O(T^\beta)$ movement cost for any $\beta \in (0,1]$, over a time horizon of $T$. When the loss function is strongly convex or $\alpha$-sharp, we propose an algorithm, Frugal, that simultaneously achieves zero regret and a movement cost of $O(\log T)$. To complement this, we show that any online algorithm with $o(T)$ regret has a movement cost of $\Omega(\log T)$ for both cases, proving the optimality of Frugal.
[LG-71] StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models
链接: https://arxiv.org/abs/2605.07384
作者: Panqi Chen,Yifan Sun,Shikai Fang,Xiao Fu,Lei Cheng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Inferring the evolution of high-dimensional and multi-modal (e.g., spatio-temporal) physical fields from irregular sparse measurements in real time is a fundamental challenge in science and engineering. Existing approaches, including diffusion-based generative models and functional tensor methods, typically operate in offline settings, depend on full temporal observations, or incur substantial inference cost. We propose StreamPhy, an end-to-end framework that enables efficient and accurate streaming inference of full-field physical dynamics from incoming irregular sparse measurements. The framework integrates a data-adaptive observation encoder that is robust to arbitrary observation patterns, a structured state-space model that supports memory-efficient online updates across irregular time intervals, and an expressive Functional Tensor Feature-wise Linear Modulation (FT-FiLM) decoder for continuous-field generation. We prove that FT-FiLM is more expressive than the functional Tucker model, admitting a richer function class for handling complex dynamics. Experiments on three representative physical systems under challenging sampling patterns show that StreamPhy consistently outperforms state-of-the-art baselines, with at least 48% improvement in accuracy and up to 20–100X faster inference than diffusion-based methods.
[LG-72] Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns
链接: https://arxiv.org/abs/2605.07378
作者: Yameng Peng, Andy Song, Haytham M. Fayek, Vic Ciesielski, Xiaojun Chang
类目: Machine Learning (cs.LG)
*备注: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv admin note: text overlap with arXiv:2403.04161
Abstract:Zero-shot proxies, also known as training-free metrics, are widely adopted to reduce the computational overhead of neural network evaluation in scenarios such as Neural Architecture Search (NAS), as they do not require any training. Existing zero-shot metrics have several limitations, including weak correlation with true performance and poor generalisation across different networks or downstream tasks. For example, most of these metrics apply only to either convolutional neural networks (CNNs) or Transformers, but not both. To address these limitations, we propose Sample-Wise Activation Patterns (SWAP) and its derivative, SWAP-Score, a novel and highly effective zero-shot metric. SWAP-Score is broadly applicable across both architecture families and task domains, demonstrating strong predictive performance in the majority of tasks. This metric measures the expressivity of neural networks over a mini-batch of samples, showing a high correlation with the networks' ground-truth performance. For both CNNs and Transformers, SWAP-Score outperforms existing zero-shot metrics across computer vision and natural language processing tasks. For instance, Spearman's correlation coefficient between SWAP-Score and CIFAR-10 validation accuracy for DARTS CNNs is 0.93, and 0.71 for FlexiBERT Transformers on GLUE tasks. Moreover, SWAP-Score is label-independent, hence it can be applied at the pre-training stage of language models to estimate their performance on downstream tasks. When applied to NAS, SWAP-empowered NAS (SWAP-NAS) achieves competitive performance using only approximately 6 and 9 minutes of GPU time on CIFAR-10 and ImageNet, respectively. Our code is available at: this https URL
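The quantity behind SWAP, expressivity measured through per-sample activation patterns over a mini-batch, can be sketched in a few lines of NumPy: count the distinct ReLU on/off patterns a toy MLP induces on a batch. This is an illustrative proxy only; the paper's exact SWAP-Score definition (and any regularisation it applies) may differ.

```python
import numpy as np

def activation_patterns(weights, biases, x):
    """Binary on/off pattern of every ReLU unit for each sample."""
    h = x
    pattern = []
    for w, b in zip(weights, biases):
        h = h @ w + b
        pattern.append(h > 0)       # which units fire for each sample
        h = np.maximum(h, 0)        # ReLU
    return np.concatenate(pattern, axis=1)   # (batch, total_units)

def swap_like_score(weights, biases, x):
    """Toy expressivity proxy (not the paper's exact SWAP-Score):
    number of distinct sample-wise activation patterns."""
    p = activation_patterns(weights, biases, x)
    return len({row.tobytes() for row in p})

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))                 # mini-batch of 64 samples
dims = [16, 32, 32, 8]                        # a small untrained MLP
ws = [rng.normal(size=(a, b)) / np.sqrt(a) for a, b in zip(dims, dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]

score = swap_like_score(ws, bs, x)
assert 1 <= score <= 64   # at most one distinct pattern per sample
```

A more expressive network separates more samples into distinct patterns, which is why the count correlates with downstream accuracy without any labels or training.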
[LG-73] QuadNorm: Resolution-Robust Normalization for Neural Operators
链接: https://arxiv.org/abs/2605.07375
作者: Bum Jun Kim,Makoto Kawano,Yusuke Iwasawa,Yutaka Matsuo
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
*备注: 42 pages, 8 figures
Abstract:Normalization layers in neural operators usually compute statistics by uniformly averaging discrete grid values, making the normalization itself discretization-dependent and thereby a source of transfer error across different resolutions or meshes. To enable discretization robustness, we introduce a quadrature normalization family that replaces existing uniform averaging in normalization layers with numerical quadrature: QuadNorm and BlendQuadNorm. On endpoint-inclusive uniform grids, the proposed quadrature moments are $O(h^2)$-consistent across discretizations, meaning that their cross-resolution mismatch decays quadratically with grid spacing. A transfer-error bound then predicts how normalization-induced mismatch scales with both the resolution gap and network depth. The experiments show the same gap- and depth-scaling trends predicted by the transfer-error bound. On Darcy, QuadNorm delivers the best cross-resolution performance at every tested target resolution from $64^2$ to $256^2$; on real-data benchmarks, Transolver with QuadNorm achieves nearly resolution-invariant transfer. The largest gains appear on nonperiodic PDEs and nonspectral architectures, where native-resolution improvements also emerge. We also validate BlendQuadNorm, which stays close to LayerNorm behavior and serves as a conservative default for periodic FNO settings. These results identify normalization as a previously overlooked source of resolution dependence in neural operators.
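A minimal sketch of the core idea: replace the uniform mean in a LayerNorm-style layer with trapezoidal quadrature moments on an endpoint-inclusive grid, then check the claimed $O(h^2)$ cross-resolution consistency numerically. The layer `quad_norm` and the smooth test field `f` are illustrative assumptions, not the paper's implementation (which also covers affine parameters and multi-dimensional grids).

```python
import numpy as np

def trapezoid_weights(n):
    """Trapezoidal quadrature weights on an endpoint-inclusive
    uniform grid over [0, 1] (n points), normalized to sum to 1."""
    w = np.ones(n)
    w[0] = w[-1] = 0.5
    return w / w.sum()

def quad_norm(u, eps=1e-6):
    """LayerNorm-style normalization with quadrature moments
    instead of the uniform batch mean/variance."""
    w = trapezoid_weights(len(u))
    mean = np.sum(w * u)
    var = np.sum(w * (u - mean) ** 2)
    return (u - mean) / np.sqrt(var + eps)

f = lambda x: np.sin(2.5 * x) + 0.3 * x**2   # smooth, nonperiodic field

def quad_mean(n):
    x = np.linspace(0.0, 1.0, n)
    return np.sum(trapezoid_weights(n) * f(x))

# Cross-resolution mismatch of the quadrature mean decays ~ O(h^2):
# halving the grid spacing shrinks the gap to a fine reference ~4x.
ref = quad_mean(4097)
gap_coarse = abs(quad_mean(65) - ref)    # h = 1/64
gap_fine = abs(quad_mean(129) - ref)     # h = 1/128
assert gap_fine < gap_coarse / 3         # consistent with quadratic decay
```

A uniform average, by contrast, weights the two boundary points as heavily as interior ones, which is exactly the discretization dependence the abstract attributes to standard normalization.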
[LG-74] FlightSense: An End-to-End MLOps Platform for Real-Time Flight Delay Prediction via Rotation-Chain Propagation Features and Agentic Conversational AI
链接: https://arxiv.org/abs/2605.07364
作者: Aditi J. Shelke,Renuka J. Shelke,Yash M. Kamerkar
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, 9 tables; machine learning, MLOps, aviation delay prediction
Abstract:Flight delays impose cascading operational and financial burdens across the aviation network, costing the U.S. economy billions of dollars annually by disrupting interconnected aircraft rotation systems. While prior machine learning approaches have demonstrated strong predictive performance, most treat upstream delays as static input variables rather than explicitly modeling how delays propagate dynamically through aircraft rotation chains, and none have deployed such systems alongside a live weather-aware conversational AI interface for end-user interaction. This paper presents FlightSense, an end-to-end MLOps platform for real-time flight delay prediction built through a progressive three-version feature engineering framework. Version 1 trains an XGBoost classifier on 11 schedule-based features establishing a baseline ROC AUC of 0.732 on 7.07 million BTS 2018 On-Time Performance records. Version 2 introduces 11 delay propagation features derived from aircraft rotation chains via tail-number tracking, yielding the dominant performance gain (AUC 0.732 to 0.875) and surpassing the single-stage XGBoost baseline reported by Zhou (2025). Version 3 integrates five NOAA meteorological features across 10 major U.S. airports, achieving a final test set AUC of 0.879. FlightSense is deployed as a production AWS MLOps pipeline incorporating live weather ingestion via Lambda, real-time SageMaker inference, an interactive Streamlit dashboard, and an Amazon Bedrock Nova Micro conversational assistant answering natural-language delay queries via a tool-use architecture.
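The Version 2 delay-propagation idea, tracking aircraft by tail number so that each flight inherits the arrival delay of the previous leg flown by the same airframe, can be sketched as follows. Field names and the single derived feature are a simplified illustration, not the paper's 11-feature schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flight:
    tail: str        # aircraft tail number
    dep_time: int    # scheduled departure, minutes since midnight
    arr_delay: int   # arrival delay of this leg, in minutes

def rotation_chain_features(flights):
    """Map (tail, dep_time) -> arrival delay of the previous leg
    flown by the same aircraft (0 for its first leg of the day)."""
    feats = {}
    last_tail, upstream = None, 0
    for fl in sorted(flights, key=lambda fl: (fl.tail, fl.dep_time)):
        if fl.tail != last_tail:
            last_tail, upstream = fl.tail, 0   # new rotation chain
        feats[(fl.tail, fl.dep_time)] = upstream
        upstream = fl.arr_delay                # propagates to next leg
    return feats

legs = [
    Flight("N101", 480, 25),   # 08:00 leg arrives 25 min late...
    Flight("N101", 660, 40),   # ...so the 11:00 leg inherits 25
    Flight("N202", 480, 0),    # a different aircraft starts clean
]
feats = rotation_chain_features(legs)
assert feats[("N101", 480)] == 0
assert feats[("N101", 660)] == 25
assert feats[("N202", 480)] == 0
```

Because the feature is computed per aircraft rather than per route, it captures the cascading rotation effects that static upstream-delay inputs miss, which is where the abstract reports the dominant AUC gain.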
[LG-75] CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
链接: https://arxiv.org/abs/2605.07335
作者: Mengran Li,Bo Li,Jiaying Wang,Wenbin Xing,Yixuan Dong,Chengyang Zhang,Hongliang Zhang,Yuzhong Peng,Jinlin Wu,Bob Zhang,Bingo Wing-Kuen Ling,Fuji Yang,Zhen Lei,Jiebo Luo,Zelin Zang
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:Virtual Cell Modeling (VCM) requires models that not only predict perturbation responses, but also support targeted revision when predictions fail. Current LLM-assisted modeling workflows face a refinement-routing problem: prediction discrepancies are observed through executable implementations, but the relevant revision may involve the modeling assumption, representation design, implementation, or task constraint. Without structured feedback propagation across these levels, iterative refinement may repair code while failing to revise the assumption responsible for the discrepancy. We propose CellScientist, a dual-space hierarchical framework that couples a high-level hypothesis space with a low-level executable implementation space. CellScientist represents modeling decisions as structured states, realizes them as admissible programs under task and interface constraints, and routes execution discrepancies back to targeted hypothesis or implementation updates. This enables a closed Hypothesis-Implementation-Hypothesis loop where failures become structured signals for model refinement rather than debugging events. Across morphology and transcriptomic benchmarks, with additional single-cell perturbation evaluations, the final executable models selected by CellScientist improve over reference baselines under fixed split and evaluation protocols, while the workflow produces auditable refinement traces.
[LG-76] Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning
链接: https://arxiv.org/abs/2605.07333
作者: Zixuan Xie,Xinyu Liu,Claire Chen,Shuze Daniel Liu,Rohan Chandra,Shangtong Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the standard attention with an identity mapping. This paper provides the first theoretical understanding of ICRL without making the unrealistic linear attention simplification. In particular, we consider the standard softmax attention used in practice. We show that, with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Here, weighted softmax TD is a new RL algorithm that performs policy evaluation in kernel space and adopts both linear TD and tabular TD as special cases. We also prove that under a certain contraction condition, the policy evaluation error decays as the number of layers grows, with the identified parameters above. Finally, we prove that those parameters are a global minimizer of a pretraining loss, explaining their emergence in our numerical experiments.
[LG-77] Latent Order Bandits
链接: https://arxiv.org/abs/2605.07304
作者: Emil Carlsson,Newton Mwai,Fredrik D. Johansson
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2508.05367
Abstract:Bandit algorithms solve diverse sequential decision-making problems, but are often too sample-inefficient for from-scratch personalization. To substantially reduce exploration times, latent bandit algorithms exploit cross-instance structure implied by discrete latent states, provided that the posterior distribution of rewards and latent states is known and accurate. However, obtaining an accurate model of this structure is difficult, and a small number of latent states may be insufficient to characterize the reward distributions in all problem instances. We propose latent order bandits (LOB), relaxing the assumptions of latent bandits to require only prior knowledge of a partial order of action preferences in each state. This allows instances of the same state to vary in reward distributions, as long as the partial order of actions is shared. For example, groups of users on a streaming service may agree on which movie genres are the best but rate experiences on different scales. We give an upper-confidence bound procedure for the LOB problem, applicable to both total and partial latent orders, and give an upper bound on its regret. To improve empirical performance, we propose a posterior-sampling algorithm and show, in a suite of experiments, that both are competitive with full-prior latent bandits when same-state instances share reward parameters, and preferable to them when reward scales differ between instances with the same latent state.
[LG-78] Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation
链接: https://arxiv.org/abs/2605.07302
作者: Junjie Yu,Yue Wang,Zihan Deng,Yan Zhu,Wenxiao Ma,Quanying Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Finetuning pretrained models occurs in a low-dimensional subspace of the full parameter space. Prior work has focused on characterizing this optimization subspace, but largely ignored the complementary question: why do certain directions remain unexplored during finetuning? Are these stable directions irrelevant to downstream tasks, or do they already encode task-relevant structure that requires no further adjustment? Answering this question is central to understanding how pretrained knowledge transfers. Through systematic spectral analysis across vision and language models, we show that the leading singular vectors of pretrained weight matrices remain highly stable under finetuning and are shared across unrelated downstream tasks, revealing that pretraining establishes a reusable spectral coordinate system. Models pretrained on larger datasets exhibit greater spectral stability under distribution shift or task change, directly linking pretraining scale to geometric transferability. Motivated by these findings, we propose a parameter-efficient method that freezes pretrained singular vectors and optimizes only leading spectral coefficients, achieving competitive performance on GLUE with 0.2% trainable parameters. Our results reveal that the stable directions encode transferable structure rather than irrelevant noise: successful pretraining discovers spectral bases that downstream tasks inherit and operate within.
[LG-79] Sparse Random-Feature Neural Networks with Krylov-Based SVD for Singularly Perturbed ODE
链接: https://arxiv.org/abs/2605.07286
作者: Kevin Kurian Thomas Vaidyan,Siddharth Rout
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Random-feature neural networks (RFNNs), including architectures with fixed hidden layers and analytically determined output weights, offer fast training but often suffer from issues due to dense representations of the hidden layer activation. Their reliance on dense feature mappings and least squares solvers can limit scalability and numerical stability, particularly for high-dimensional or stiff systems. Specifically, the activation matrix is observed to be low-rank and extremely ill-conditioned. In this work, we propose a sparse framework for RFNNs that integrates structured sparsity into the hidden layer activations that increases the rank and employs Sparse Singular Value Decomposition (sSVD) for solving the resulting linear least squares problem scalably and efficiently while catering to the bad condition number. We explore the theory behind Lanczos-Golub-Kahan Bidiagonalization technique for sparse SVD and conduct some experiments to identify some limitations and justify the requirement for orthogonalization step in our application. Then, we demonstrate that the proposed method maintains or improves solution accuracy for solving the benchmark one-dimensional steady convection-diffusion equations case having stronger advection, while achieving substantial gains in training efficiency and robustness compared to standard dense implementations.
[LG-80] Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
链接: https://arxiv.org/abs/2605.07284
作者: Yifan Zhou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent interpretability work has identified model-internal handles on post-trained behavior, including refusal directions, assistant/persona axes, and sparse chat-tuning features. These results localize where behaviors can be read out or controlled, often in middle-to-late layers. We ask how earlier computation and the late stack cooperate to turn those differences into next-token margins. To test this, we introduce first-divergence cross-patching: at the first token where pretrained base (PT) and instruction-tuned (IT) checkpoints disagree, we cross each model’s earlier-layer state with each model’s late stack. The diagnostic separates training recipes: same-base instruction-following descendants show late effects that depend on their own earlier-layer state, while OpenMath2 math-domain SFT and controlled code/biomed CPT controls with verified domain learning do not; for OpenMath2, the late effect is already largely portable from base earlier-layer state. Across five dense families (4B-32B), the IT late stack adds +0.76 logits from PT upstream and +2.44 from IT upstream, giving a +1.68 interaction that is positive in every family. Thus the late stack has a real PT-upstream effect, but its larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state. Sparse features in final MLP layers partially mediate the effect and are driven by upstream patches, supporting a handoff from earlier state to final-layer feature activation to IT-token margin. Forced-token scoring shows that the local token choice can change later exact-answer success. Operationally, paired-checkpoint studies that localize a difference to late layers should test whether it survives under the other checkpoint’s upstream state before treating the late stack as self-contained.
[LG-81] The Convergence Gap: Instruction-Tuned Language Models Stabilize Later in the Forward Pass
链接: https://arxiv.org/abs/2605.07282
作者: Yifan Zhou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Final outputs hide when a checkpoint commits to its next-token prediction. We introduce the convergence gap, a model-diffing diagnostic that decodes each layer’s next-token distribution and measures its distance to the model’s own final distribution. Across six paired pretrained and instruction-tuned checkpoints in native prompting regimes, instruction-tuned checkpoints remain farther from their final predictions later into the stack. The effect persists under endpoint-matched raw and tuned readouts, endpoint-free same-history checks, and fixed-history template replay. Matched-prefix interventions identify late MLP windows as the largest tested leverage point: late IT grafts into PT hosts increase late KL by +0.34 nats, while PT-late swaps into IT hosts reduce it by -0.51 nats; matched random late perturbations give only +0.003 versus +0.327 for the true late graft. A preselected Gemma case study provides behavior-facing plausibility for the same late swap, without serving as a benchmark claim. These results identify a robust prediction-dynamics signature of post-training: released instruction-following checkpoints tend to settle later, and late MLP computation is the strongest tested bidirectional handle on that delay under matched histories.
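The core of the convergence-gap diagnostic (decode every layer's hidden state into a next-token distribution and measure its divergence from the model's final distribution) can be sketched with synthetic logits; the vocabulary size, the fabricated layer values, and the use of KL divergence are illustrative assumptions here, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def convergence_gap(layer_logits):
    """KL(final || layer) for each layer's decoded next-token distribution.

    layer_logits: array of shape (num_layers, vocab); the last row is the
    model's final (output-layer) logits.
    """
    probs = softmax(np.asarray(layer_logits, dtype=float))
    final = probs[-1]
    # KL divergence from each layer's decoded distribution to the final one
    return np.array([np.sum(final * (np.log(final) - np.log(p))) for p in probs])

# Toy logits that sharpen toward the final prediction layer by layer
rng = np.random.default_rng(0)
target = rng.normal(size=8)
layers = np.stack([t * target + (1 - t) * rng.normal(size=8) * 0.5
                   for t in np.linspace(0.2, 1.0, 6)])
gaps = convergence_gap(layers)
```

A checkpoint whose `gaps` curve stays high until late layers "settles later" in the sense of the abstract.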
[LG-82] bispectrum: Selective G-Bispectra Made Practical
链接: https://arxiv.org/abs/2605.07270
作者: Johan Mathe,Adele Myers,Simon Mataigne,Nina Miolane
类目: Machine Learning (cs.LG)
*备注:
Abstract:Many machine learning tasks are invariant under the action of a group G of transformations: signal classification can be invariant under translations, image classification under 2D rotations, and spherical-image classification under 3D rotations. The G-bispectrum is a principled complete invariant of a signal (retaining all of the signal’s information up to the group action) with proven benefits in machine learning and as a pooling layer in deep networks. However, its deployment has been hampered by high computational cost and a patchwork of group-specific implementations. We present bispectrum, an open-source, fully unit-tested PyTorch library that implements selective G-bispectra for seven different group actions, as differentiable modules that can be directly incorporated into machine learning pipelines and deep learning architectures. For finite groups G, selectivity reduces the computational cost from O(|G|^2) to O(|G|). For planar rotations, we leverage the disk bispectrum. For spherical 3D rotations, we introduce an augmented selective bispectrum at band-limit L which reduces the cost from O(L^3) to \Theta(L^2) coefficients. We profile the entire library (for which we implemented various compute optimizations), showing that it delivers near-exact G-invariance with its selective G-bispectra computed in sub-millisecond time on GPU (up to commonly used bandlimits). We evaluate the benefits of incorporating G-bispectra as pooling layers into deep learning architectures on three classical benchmark datasets, comparing against norm pooling, gated pooling, Fourier-ELU pooling, max pooling, and (non-equivariant) data-augmented convolutional baselines. Results show that G-bispectra consistently outperform alternatives in the low-data, moderate-capacity regime.
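For context on what a G-bispectrum computes, here is the classical (non-selective, O(|G|^2)-coefficient) bispectrum for the cyclic translation group Z_N, with a check that it is invariant under cyclic shifts; this toy is not the library's selective construction:

```python
import numpy as np

def cyclic_bispectrum(x):
    """Bispectrum of a 1D signal under the cyclic translation group Z_N:
    B(k1, k2) = F(k1) * F(k2) * conj(F((k1 + k2) mod N)).

    Shifting x multiplies F(k) by a phase e^{-2*pi*i*k*s/N}; the phases of
    F(k1)F(k2) cancel against the conjugated F(k1+k2), so B is invariant.
    """
    F = np.fft.fft(x)
    N = len(x)
    k = np.arange(N)
    return F[:, None] * F[None, :] * np.conj(F[(k[:, None] + k[None, :]) % N])

rng = np.random.default_rng(1)
x = rng.normal(size=16)
B = cyclic_bispectrum(x)
B_shift = cyclic_bispectrum(np.roll(x, 5))  # same invariant, shifted signal
```

The library's selective variant keeps only O(|G|) of these coefficients while preserving completeness.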
[LG-83] PerCaM-Health: Personalized Dynamic Causal Graphs for Healthcare Reasoning
链接: https://arxiv.org/abs/2605.07267
作者: Elahe Khatibi,Ziyu Wang,Saba A. Farahani,Di Huang,Hung Cao,Ramesh Jain,Amir M. Rahmani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Personalized healthcare decisions require reasoning about how physiological and behavioral variables influence an individual patient over time. Existing temporal causal discovery methods are poorly matched to this setting: cohort-level models provide stable but non-personalized structures, while per-patient discovery is unreliable because individual trajectories are short, noisy, irregular, and non-stationary. This creates a fundamental gap between population-level causal modeling and the patient-specific, time-varying mechanisms needed for intervention reasoning. We introduce PerCaM-Health, a framework for learning personalized dynamic causal graphs from longitudinal health data. The framework learns a knowledge-guided population temporal graph, then conservatively adapts and evolves it using patient-specific temporal evidence and rolling-window updates, producing interpretable and auditable graph sequences. By coupling these graphs with temporal structural equations, the framework enables patient-level counterfactual queries, such as estimating short-horizon outcome changes under hypothetical behavioral interventions. Experiments on a semi-synthetic dynamic health benchmark show that PerCaM-Health improves graph recovery, dynamic edge tracking, and intervention direction accuracy compared to cohort-level, per-patient, and non-personalized temporal baselines. These results demonstrate that jointly modeling personalization and temporal evolution yields more reliable causal structure and intervention reasoning.
[LG-84] How Big Should a Wireless Foundation Model Be?
链接: https://arxiv.org/abs/2605.07266
作者: Wei-Lun Cheng,Wanjiun Liao
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Wireless foundation models are rapidly emerging as a key enabler of AI-native communication systems, yet a fundamental question remains unanswered: how large should these models be? We present a principled, physics-grounded answer, showing that the intrinsic dimensionality (dNL, the nonlinear manifold dimension of the channel) acts as the fundamental bottleneck, defining the scaling ceiling once a data-sufficient regime is reached. This dimensionality is not a design choice but a physical constraint: Maxwell’s equations, finite scatterers, and antenna aperture inherently constrain wireless propagation environments to a limited number of degrees of freedom – spanning 5-35 across both real-world OTA measurements and 3GPP-standardized channel models we evaluate – orders of magnitude below the ~1,000-dimensional semantic space of language. As a consequence, we propose a scaling framework for wireless AI: taking NTN satellite channels as a representative case (dNL ~= 14), scaling gains diminish rapidly beyond ~30 million parameters, entering a stochastic asymptote above 70M where a further 1.6x increase (96M-150M) yields only 0.52 dB. Beyond this ceiling, inference-time adaptation via pilot-aided test-time training (TTT) is far more effective: a compact 12M-parameter model surpasses a static 96M model by 9.9 dB (NMSE, SNR = 20 dB) / 7.6 dB (MCM, SNR = 10 dB) at one-eighth the parameters. With dNL distributions validated across real-world indoor massive MIMO measurements, our scaling laws and TTT gains are demonstrated through NTN satellite simulations, reframing wireless AI design: channel geometry – not model size – fundamentally governs the scaling laws of physical-layer wireless AI.
[LG-85] Sample Complexity of Stochastic Optimization with Integer Variables
链接: https://arxiv.org/abs/2605.07239
作者: Hongyu Cheng,Yinghao Zheng,Marco Molinaro,Amitabh Basu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We establish sample complexity results for stochastic optimization over the integers, especially with a view to understanding the complexity relative to the corresponding continuous optimization problem. We show that integer optimization can sometimes require strictly more samples and sometimes strictly fewer, depending on the structure of the objective and constraints. 1. For Lipschitz objectives over subsets of the \ell_\infty ball, the statistical complexity of general stochastic mixed-integer, nonlinear, nonconvex optimization is exactly the same as stochastic linear optimization with just bound constraints. 2. For Lipschitz objectives over subsets of the \ell_2 ball, we show that integer optimization can require a strictly smaller sample size than the continuous setting in a certain regime. To get to this result, we also establish tight sample complexity results for nonconvex continuous stochastic optimization which, to the best of our knowledge, do not appear in prior work. 3. For strongly convex, smooth objectives, integer optimization has high statistical complexity compared to the continuous setting. In particular, we show that integer optimization requires \Omega(1/\epsilon^2) samples to report an \epsilon-approximate solution, compared to the well-known O(1/\epsilon) sample complexity from the continuous optimization literature.
[LG-86] Modulated learning for private and distributed regression with just a single sample per client device
链接: https://arxiv.org/abs/2605.07233
作者: Praneeth Vepakomma,Amirhossein Reisizadeh,Samuel Horváth,Munther Dahleh
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: 30 pages
Abstract:This work focuses on the question of learning from a large number of devices with each device holding only a single sample of data. Several real-world applications fit this one-sample-per-client setup, including learning from fitness trackers, data/app usage aggregators, body-worn sensing devices, and daily event monitors, to name a few. When a client has only one sample, the standard federated learning paradigm breaks down as a local update based on that single point is far from being useful, especially in the earlier rounds for estimation of the model coefficients. This utility is further weakened by the privacy-inducing noise applied at every round. This work addresses this problem, enabling such clients to collaboratively learn an effective global model without leaking the privacy of their data. The proposed approach injects a single, carefully calibrated noisy perturbation to transform the sample at each client, followed by a post-processed representation which is shared with the server. These representations, aggregated at the server, are processed to obtain an unbiased gradient update that in expectation matches the non-private centralized gradient while preserving data privacy. This approach differs from traditional private federated learning, where the communication payloads involve model coefficients rather than privately transformed data samples. This method enables devices with extremely limited data to collaborate and learn accurate, privacy-preserving models without requiring large local datasets or sacrificing individual privacy.
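The abstract does not spell out the perturbation or the post-processing, so as a hedged illustration only: here is one standard way a single calibrated Gaussian perturbation of each client's feature vector can still yield an unbiased least-squares gradient, with the server subtracting the known noise-induced bias. The sigma^2 * w correction and all parameters are our assumptions, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 500_000, 3, 0.5            # many clients, one sample each
w = np.array([1.0, -2.0, 0.5])           # current global model
X = rng.normal(size=(n, d))
y = X @ np.array([0.3, 0.1, -0.7]) + 0.1 * rng.normal(size=n)

# Each client perturbs its feature vector once with calibrated noise
X_tilde = X + sigma * rng.normal(size=(n, d))

# Server side: the naive gradient from perturbed data is biased by
# sigma^2 * w, since E[x~ x~^T] = x x^T + sigma^2 I; subtract that bias.
grad_noisy = X_tilde.T @ (X_tilde @ w - y) / n
grad_debiased = grad_noisy - sigma**2 * w

grad_clean = X.T @ (X @ w - y) / n       # non-private centralized gradient
```

With many clients, `grad_debiased` concentrates around the centralized gradient even though no client revealed its raw sample.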
[LG-87] Don't Learn the Shape: Forecasting Periodic Time Series by Rank-1 Decomposition
链接: https://arxiv.org/abs/2605.07222
作者: Takato Honda
类目: Machine Learning (cs.LG)
*备注: 9 pages main text + appendix. Code: this https URL
Abstract:How few parameters do we really need to forecast a periodic time series? An hourly electricity series, reshaped as a 24-row matrix with one column per day, is approximately rank-1: a daily shape modulated by a daily level (median-centered rank-1 energy 0.82 on GIFT-Eval). Should we learn the shape? Smoothing, shrinkage, and low-rank fits all seem like obvious upgrades over the simple average of the last K=2 cycles. On all 97 GIFT-Eval configurations, we tested 8 such alternatives (e.g., Fourier, EWMA, James-Stein, rank-r SVD): none significantly beats the frozen baseline under Holm correction; two are significantly worse. The resulting method, FLAIR, is (a) Effective: matches PatchTST on aggregate GIFT-Eval (relMASE 0.838 vs 0.849); (b) Compact: 28 scalars for hourly, 57 for weekly; (c) Fast: 22 minutes on one CPU core of a MacBook Pro; (d) Closed-form Hands-Off: one SVD per period candidate, GCV-averaged Ridge, no GPU, no pre-training, no per-task tuning. In the highly rank-1, many-cycle regime, extra flexibility is estimation noise.
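The two building blocks the abstract describes (reshape into a period-by-cycle matrix, check rank-1 energy, and forecast the next cycle as the average of the last K=2 cycles) can be sketched as follows; the synthetic series and the omission of median centering are simplifications, not FLAIR itself:

```python
import numpy as np

def rank1_energy(series, period):
    """Share of squared singular-value energy in the top component after
    reshaping the series into a period x num_cycles matrix.
    (The paper median-centers first; we skip that here for simplicity.)"""
    M = series[: len(series) // period * period].reshape(-1, period).T
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)

def forecast_next_cycle(series, period, K=2):
    """Frozen-shape baseline: forecast the next cycle as the mean of the last K."""
    M = series[: len(series) // period * period].reshape(-1, period)
    return M[-K:].mean(axis=0)

# Synthetic 'hourly' load: a fixed daily shape times a slowly drifting level
period, days = 24, 30
t = np.arange(period)
shape = 1.0 + 0.5 * np.sin(2 * np.pi * t / period)
levels = 10 + 0.05 * np.arange(days)
series = (levels[:, None] * shape[None, :]).ravel()
pred = forecast_next_cycle(series, period)
```

On such a strongly rank-1 series, the frozen two-cycle average already lands within a fraction of a percent of the true next cycle, which is the paper's point.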
[LG-88] On the Robustness of Distribution Support under Diffusion Guidance
链接: https://arxiv.org/abs/2605.07220
作者: Ruijia Cao,Yuchen Wu,Nisha Chadramoorthy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion guidance is a powerful technique that enables controllable and high-fidelity sample generation with diffusion models. At a high level, it modifies the score function by incorporating a guidance term that steers the generative process toward a desired condition. Despite its empirical success, the theoretical properties of diffusion guidance remain largely unexplored, and it is not well understood why it consistently produces high-quality samples. In this work, we explain the effectiveness of diffusion guidance by establishing a robustness-of-support property. Specifically, we show that, given exact access to the score functions, guided diffusion processes almost always generate samples that remain close to the target support. This property is particularly desirable, as samples that lie off the support are often structurally implausible and may adversely affect downstream tasks. Our analysis covers both Denoising Diffusion Implicit Models (DDIM) and Denoising Diffusion Probabilistic Models (DDPM), and applies to a wide range of discretization schemes induced by exponential integrators. Our results provide a rigorous foundation for understanding why diffusion guidance produces physically meaningful and structurally plausible samples.
[LG-89] Improved Model-based Reinforcement Learning with Smooth Kernels
链接: https://arxiv.org/abs/2605.07218
作者: Kun Long,Yuqiang Li,Xianyi Wu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 38 pages, 5 figures
Abstract:For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive structural assumptions. Kernel smoothing model-based approaches offer a promising alternative paradigm that instead leverages the smoothness of the MDP and employs non-parametric kernel smoothing estimates of transition dynamics. This paper proposes a new kernel-smoothing model-based approach for online reinforcement learning in finite-horizon settings under Lipschitz continuity assumptions on the MDP. By incorporating a Bernstein-style exploration bonus into the kernel smoothing framework, our method achieves a regret bound which improves upon the state-of-the-art regret bound in its dependence on the horizon. The theoretical advancement relies on a delicate analysis of the synergy between Bernstein-style bonuses and kernel smoothing, where a new tight Bernstein-type concentration inequality for martingales may be of independent interest.
[LG-90] FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution
链接: https://arxiv.org/abs/2605.07208
作者: Jianrong Ding,Jianyuan Zhong,Zhengyan Shi,Qiang Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) are increasingly used to brainstorm and evaluate research ideas, yet assessing such judgments is fundamentally difficult because the true impact of a new idea may take years to emerge. We address this challenge by using the impact forecasting of human-authored manuscripts as a verifiable proxy task. In a prospective forecasting study, we find that frontier LLMs fail to reliably distinguish high-impact papers from ordinary publications, suggesting that static text-based judging is insufficient for scientific evaluation. To address this limitation, we propose FAME (Forecasting Academic Impact via Continuous-Time Manifold Evolution), a spatiotemporal framework for modeling the dynamic trajectories of scientific topics. FAME projects papers into a dynamic latent space informed by textual features and a verified knowledge-flow graph, learning geometric constraints that align impactful manuscripts with the forward momentum of their fields. Experiments on 3,200 arXiv papers across three fast-evolving subfields show that FAME consistently and substantially outperforms state-of-the-art LLM evaluators in prospective multidimensional impact forecasting. Furthermore, integrating FAME’s dynamic geometric signals into LLMs significantly improves their forecasting performance. These results support manuscript impact forecasting as a useful, measurable proxy benchmark and position FAME as a strong, trajectory-aware foundation for automated scientific evaluation.
[LG-91] Arrow: A Foundation Model for Causal Discovery
链接: https://arxiv.org/abs/2605.07204
作者: Ryan Thompson,He Zhao,Daniel M. Steinberg,Edwin V. Bonilla
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce Arrow, a foundation model for zero-shot causal discovery on observational tabular data. Arrow factorizes a directed acyclic graph into an undirected skeleton and a topological order, guaranteeing acyclicity by construction. Given a new dataset, it uses a transformer-based architecture to contextualize variables within and across observations, then predicts skeleton edge probabilities and node order scores that together define a graph. Arrow is trained in a supervised fashion on synthetic datasets with ground-truth graphs, using an end-to-end differentiable directed edge composite likelihood induced by the skeleton-order factorization. The training distribution spans diverse graph families, functional forms, noise models, and dataset shapes. Across in- and out-of-distribution synthetic, semi-synthetic, and real datasets, Arrow matches or outperforms existing causal discovery methods at substantially lower inference cost than competitive alternatives. Our results demonstrate that large-scale pretraining on diverse synthetic data can yield zero-shot causal discovery models that are fast, accurate, and reusable on new datasets.
[LG-92] Coupling Models for One-Step Discrete Generation
链接: https://arxiv.org/abs/2605.07193
作者: Fred Zhangzhi Peng,Avishek Joey Bose,Anru R. Zhang,Alexander Tong
类目: Machine Learning (cs.LG)
*备注: Code is available at this https URL
Abstract:Generative modeling over discrete structures underpins applications across deep learning, from biological sequence design and code generation to large language models, yet generation often remains sequential, relying on autoregressive decoding or iterative refinement. In this work, we introduce Coupling Models, a one-step discrete generative model that learns a direct coupling between discrete sequences and Gaussian latents. Unlike recent distillation methods that compress a pretrained multi-step sampler into a few steps, a Coupling Model trains a purpose-built decoder to invert this coupling and generate samples in a single step. The model also avoids complex continuous flows over the simplex and hand-specified data-to-noise couplings. Empirically, the Coupling Model improves on the strongest one-step baselines in each domain: it reduces LM1B text-generation perplexity by 33% at its lowest-perplexity operating point, Fly Brain enhancer-design FBD by 18%, and MNIST-Binary FID by 46%. These results suggest that effective one-step discrete generation depends strongly on how data and noise are coupled before decoding. Code is available at this https URL.
[LG-93] Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
链接: https://arxiv.org/abs/2605.07182
作者: Ali Taghibakhshi,Ruisi Cai,Saurav Muralidharan,Sharath Turuvekere Sreenivas,Aditya Vavre,Ameya Sunil Mahabaleshwarkar,Bilal Kartal,Sheldon Liang,Marcin Chochowski,Zijia Chen,Akhiad Bercovich,Ran Zilberstein,Ran El-Yaniv,Yonatan Geifman,Daniel Korzekwa,Yoshi Suhara,Oluwatobi Olabiyi,Ashwath Aithal,Nima Tajbakhsh,Pavlo Molchanov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training a family of large language models (LLMs), either from scratch or via iterative compression, is prohibitively expensive and inefficient, requiring separate training runs for each model in the family. In this paper, we introduce Star Elastic, a novel LLM post-training method that adds N nested submodels to a given parent reasoning model using the compute of one run (N-fold savings) via a single post-training job. Beyond reducing training costs, Star Elastic also addresses a fundamental limitation of efficient reasoning: the rigidity of static architectures, which forces the allocation of constant resources regardless of token difficulty. By unlocking elastic budget control, Star Elastic enables a novel inference scheme that uses different submodels for each reasoning phase (thinking and answering). Star Elastic supports (1) nesting along the SSM, embedding channel, MoE, and FFN axes, (2) learning nested submodels via an end-to-end trainable router, and (3) curriculum-based knowledge distillation. Building on the Nemotron Elastic framework, we apply Star Elastic to the NVIDIA Nemotron Nano models, with a particular focus on hybrid Mixture-of-Experts (MoE) architectures: from Nemotron Nano v3 (30B/3.6A), we generate 23B (2.8A) and 12B (2.0A) variants with 160B training tokens. All nested models match or outperform independently trained baselines of comparable size and achieve a 360x reduction versus pretraining from scratch and a 7x reduction over state-of-the-art compression. Crucially, elastic budget control advances the accuracy-latency Pareto frontier, achieving up to 16% higher accuracy and 1.9x lower latency via dynamic per-phase model selection. We further extend Star Elastic to quantized regimes via Quantization-Aware Distillation (QAD), producing nested NVFP4 and FP8 elastic checkpoints that preserve zero-shot slicing while delivering smaller deployment footprints.
[LG-94] Cost-Ordered Feasibility for Multi-Armed Bandits with Cost Subsidy
链接: https://arxiv.org/abs/2605.07171
作者: Ishank Juneja,Carlee Joe-Wong,Osman Yağan
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:
Abstract:The classic multi-armed bandit (MAB) problem tackles the challenge of accruing maximum reward while making decisions under uncertainty. However, in applications, often the goal is to minimize cost subject to a constraint on the minimum permissible reward, an objective captured by multi-armed bandits with cost-subsidy (MAB-CS). Of interest to this paper is the setting where the quality (reward) constraint is specified relative to the unknown best reward and the cost of each arm is known. We characterize the expected sub-optimal samples required by any policy by proving instance-dependent lower bounds that offer new insight into the problem and are a strict generalization of prior bounds. Then, we propose an algorithm called Cost-Ordered Feasibility (COF) that leverages our insight and intelligently combines samples from all arms to gauge the feasibility of a cheap arm. Thereafter, we analyze COF to establish instance-dependent upper bounds on its expected cumulative cost and quality regret, i.e., relative to the cheapest feasible arm. Finally, we empirically validate the merits of COF, comparing it to baselines from the literature through extensive simulation experiments on the MovieLens and Goodreads datasets as well as representative synthetic instances. Not only does our paper develop qualitatively better theoretical regret upper bounds, but COF also convincingly demonstrates improved empirical performance.
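As a hedged sketch of the cost-ordered feasibility idea (not the COF algorithm itself, which is adaptive and has regret guarantees), here is a pure-exploration toy: arms are indexed in increasing cost order, and the cheapest arm whose optimistic reward estimate clears the relative quality threshold is selected; all instance parameters below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Arms sorted by known cost; the Bernoulli reward means are unknown to the learner
costs = np.array([0.5, 1.0, 5.0])
means = np.array([0.50, 0.85, 0.90])
alpha = 0.1                      # quality tolerance relative to the best arm
n = 50_000                       # samples per arm (pure-exploration sketch)

rewards = rng.random((len(costs), n)) < means[:, None]
mu_hat = rewards.mean(axis=1)
rad = np.sqrt(np.log(2 * len(costs) / 0.01) / (2 * n))   # Hoeffding radius

# Optimistic feasibility: an arm passes if its upper confidence bound clears
# the quality threshold derived from the largest upper confidence bound
ucb = mu_hat + rad
threshold = (1 - alpha) * ucb.max()
feasible = ucb >= threshold
chosen = int(np.argmax(feasible))   # cheapest (first-in-cost-order) feasible arm
```

Here the true feasible set under mu_i >= (1 - alpha) * mu_max is arms 1 and 2, so the cheapest feasible arm has cost 1.0 rather than 5.0.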
[LG-95] Neurosymbolic Imitation Learning with Human Guidance: A Privileged Information Approach ECML-PKDD2026
链接: https://arxiv.org/abs/2605.07166
作者: Nikhilesh Prabhakar,Varun Balaji,Athresh Karanam,Kristian Kersting,Sriraam Natarajan
类目: Machine Learning (cs.LG)
*备注: Under Review for ECML-PKDD 2026
Abstract:Imitation learning is widely used for learning to act in complex environments. While pure neural-based methods handle high-dimensional data effectively, they require a large number of samples and are prone to overfitting. Pure symbolic approaches, while they generalize well, do not handle high-dimensional data effectively. We propose a neurosymbolic approach that achieves the best of both worlds, i.e., handling high-dimensional data while achieving generalization. The key advantage of our approach is that it can effectively exploit additional privileged information that is available only during training (in our case, gaze data). Our empirical evaluations demonstrate the effectiveness, efficiency, and generalization capability of our proposed approach.
[LG-96] Learned Lagrangian Models of PDEs via Euler-Lagrange Residual Minimization
链接: https://arxiv.org/abs/2605.07157
作者: Lyra Zhornyak,Eric Forgoston,M. Ani Hsieh
类目: Machine Learning (cs.LG)
*备注: 9 pages, 8 figures, 2 tables, 7 pages of appendices
Abstract:We present the first method to directly use a learned continuous Lagrangian to forecast the dynamics of systems governed by partial differential equations, exploiting the inherent conservative structure to achieve stable long-range predictions. We develop an optimization-based integrator that minimizes the squared Euler–Lagrange residual via a mesh-free near-symplectic construction on local space-time patches. Different from integrators for analytical models, integrators for learned models should decouple model error (phase error) from integration error (conservation error). By relying on optimization rather than time-stepping, we bypass the global coupling inherent to fixed discretizations, which slows time- and space-stepping and complicates learning. Our method scales linearly with domain size via Jacobi iteration, and places no structural requirements on the learned network, allowing it to be coupled with existing physics-guided machine learning (ML) methods. We validate our approach on a learned representation of a double pendulum, a one-dimensional wave equation, and a two-dimensional wave equation. Our method achieves error comparable to classical symplectic methods while generalizing to spatially varying dynamics and arbitrary boundary conditions without retraining.
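A minimal sketch of the integrator's core idea, using an analytic harmonic-oscillator Lagrangian in place of a learned network: each step advances the trajectory by minimizing the squared discrete Euler-Lagrange residual with gradient descent (the step size and iteration count are our choices, not the paper's):

```python
import numpy as np

def el_residual(x_prev, x_cur, x_next, dt, omega=1.0):
    """Discrete Euler-Lagrange residual of L = 0.5*v^2 - 0.5*omega^2*x^2:
    central-difference acceleration plus omega^2 * x."""
    return (x_next - 2 * x_cur + x_prev) / dt**2 + omega**2 * x_cur

def step(x_prev, x_cur, dt, omega=1.0, iters=50):
    """Advance one step by minimizing the squared EL residual in x_next with
    gradient descent; a learned-Lagrangian version would obtain the residual
    by autodiff through a network instead of this closed form."""
    lr = dt**4 / 4                            # stable for this quadratic objective
    x_next = x_cur                            # initial guess
    for _ in range(iters):
        r = el_residual(x_prev, x_cur, x_next, dt, omega)
        x_next -= lr * (2 * r / dt**2)        # d(r^2)/dx_next = 2 * r / dt^2
    return x_next

dt, omega = 0.05, 1.0
xs = [np.cos(0.0), np.cos(dt)]                # exact start for x(t) = cos(t)
for _ in range(200):
    xs.append(step(xs[-2], xs[-1], dt, omega))
xs = np.array(xs)
```

Because the objective is quadratic in `x_next`, the residual halves every iteration and the scheme recovers the conservative leapfrog recurrence, which is why long-horizon amplitude stays bounded.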
[LG-97] Regret-Oracle Complexity Tradeoffs in Agnostic Online Learning
链接: https://arxiv.org/abs/2605.07155
作者: Idan Attias,Steve Hanneke,Arvind Ramaswami
类目: Machine Learning (cs.LG)
*备注:
Abstract:Agnostic online learning is classically solved via a reduction to the realizable setting, utilizing Littlestone’s Standard Optimal Algorithm (SOA) as a base learner. However, the SOA is computationally intractable to execute even for a single round. To overcome this barrier, recent work in oracle-efficient online learning replaces the SOA with a realizable base learner that accesses the concept class exclusively through an offline empirical risk minimization (ERM) oracle. While such agnostic learners achieve near-optimal expected regret, they suffer from a doubly-exponential oracle complexity of O\big(T^{2^{O(d_{\mathrm{LD}})}}\big), where d_{\mathrm{LD}} is the Littlestone dimension and T is the number of rounds. In this work, we significantly improve this oracle complexity while relying on an even weaker primitive: a weak-consistency oracle, which merely decides whether a given labeled dataset is realizable. At the core of our approach is an adaptive and dynamic agnostic-to-realizable reduction that actively prunes non-realizable label sequences on the fly. By using the VC dimension (d_{\mathrm{VC}}) to bound the number of dynamically maintained active paths, our algorithm reduces the total query complexity down to O(T^{d_{\mathrm{VC}}+1}) while perfectly preserving near-optimal expected regret. Crucially, this dynamic pruning also yields a memory reduction over the standard reduction. Furthermore, we formally quantify the regret–oracle complexity tradeoff, providing upper bounds that smoothly interpolate between restricted query budgets and attainable expected regret. We complement these with lower bounds proving that any learner restricted to Q = o(\sqrt{T}) queries must suffer an expected regret of \Omega(T/Q).
[LG-98] Simple KNN-Based Outlier Detection Achieves Robust Clustering
链接: https://arxiv.org/abs/2605.07130
作者: Tianle Jiang,Yufa Zhou
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: Code: this https URL
Abstract:Being robust to the presence of outliers is crucial for applying clustering algorithms in practice. In the \textit{robust} k-Means problem (i.e., k-Means with outliers), the goal is to remove z outliers and minimize the k-Means cost on the remaining points. Despite the close connection between robust k-Means and outlier detection, both theoretical and empirical understanding of the effectiveness of \textit{classic} outlier detection heuristics for robust k-Means remains limited. In this paper, we prove that under a practical assumption on the optimal cluster sizes, simply removing points with large K-Nearest-Neighbor distances achieves performance comparable to prior work in terms of approximation guarantees: it yields a constant-factor reduction from robust k-Means to standard k-Means, without introducing additional centers or discarding extra outliers, as is commonly required by existing approaches. Empirically, experiments on real-world datasets show that our method outperforms or matches several more sophisticated algorithms in terms of clustering cost and runtime. These results demonstrate that simple KNN-based heuristics can be surprisingly effective for robust clustering, highlighting new opportunities to bridge techniques from outlier detection and clustering.
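The heuristic the abstract describes can be sketched in a few lines: score each point by the distance to its K-th nearest neighbor, drop the z highest-scoring points, and hand the remainder to any standard k-Means solver. A minimal pure-Python sketch (function names and the toy data are ours, not the paper's):

```python
import math

def kth_nn_dist(points, k):
    """Distance from each point to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def remove_outliers(points, k, z):
    """Drop the z points with the largest k-NN distance."""
    scores = kth_nn_dist(points, k)
    keep = sorted(sorted(range(len(points)), key=lambda i: scores[i])[: len(points) - z])
    return [points[i] for i in keep]

# Two tight clusters plus two far-away points; the KNN scores isolate the
# outliers, and any standard k-Means solver can run on what is left.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
       (50.0, 50.0), (-40.0, 10.0)]
clean = remove_outliers(pts, k=2, z=2)
```

The paper's contribution is the approximation guarantee for this pipeline under a cluster-size assumption, not the heuristic itself.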
[LG-99] Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought
链接: https://arxiv.org/abs/2605.07123
作者: Zixuan Xie,Xinyu Liu,Rohan Chandra,Shangtong Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:In-context reinforcement learning (ICRL) refers to the ability of RL agents to adapt to new tasks at inference time without parameter updates by conditioning on additional context. Recent empirical studies further demonstrate that Chain-of-Thought (CoT) generation can amplify this ICRL capability. This paper is the first to provide a theoretical understanding on how CoT interacts with ICRL. We conduct our analysis in a policy evaluation setup with linear Transformer. We prove that with specific Transformer parameters, the CoT generation process is equivalent to repeatedly executing temporal difference learning updates. Additionally, we provide finite sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. We also prove that the desired Transformer parameters are a global minimizer of the pretraining loss, providing a theoretical understanding on the empirical emergence of those parameters.
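For readers who want the TD side of the claimed equivalence concrete, here is a plain TD(0) policy-evaluation loop on a toy two-state Markov reward process; the paper's result says that CoT generation with suitable linear-Transformer weights implements repeated updates of exactly this form (the chain, step size, and horizon below are illustrative choices):

```python
# Two-state Markov reward process: 0 -> 1 yields reward 1, 1 -> 0 yields 0.
gamma, alpha = 0.9, 0.1
v = [0.0, 0.0]   # value estimates for states 0 and 1
s = 0
for _ in range(3000):
    s_next = 1 - s
    r = 1.0 if s == 0 else 0.0
    v[s] += alpha * (r + gamma * v[s_next] - v[s])   # the TD(0) update
    s = s_next
# Fixed point: v0 = 1 + 0.9*v1 and v1 = 0.9*v0, so v0 = 1/0.19.
```

With deterministic transitions the iteration contracts geometrically to the fixed point, mirroring the paper's geometric decrease of evaluation error with CoT length.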
[LG-100] When Symbol Names Should Not Matter: A Logistic Theory of Fresh-Symbol Classification
链接: https://arxiv.org/abs/2605.07120
作者: Wenjie Guan,Jelena Bradic
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Template tasks have emerged as a clean testbed for asking whether transformers reason with abstract symbols rather than concrete token names. We study the fixed-label classification version of this problem, where train and test examples share latent templates but may use disjoint vocabularies. Unlike next-token prediction, the model need not emit unseen symbols; it must learn a decision rule invariant to symbol renaming. We analyze regularized kernel logistic classification in the transformer-kernel regime. Our main result decomposes the learned predictor into an ideal template-level classifier and a finite-sample perturbation caused by accidental token overlaps in the training data. We encode these overlaps by a colored collision graph and prove high-probability margin-transfer guarantees for fresh-symbol classification. This perspective extends template-based analyses to logistic classification and refines scalar diversity conditions: vocabulary size controls the average rate of collisions, but collision geometry controls whether the ideal classification margin is preserved. More broadly, the same perturbation framework applies to abstraction-augmented inputs, yielding a general margin-versus-collision criterion for identifying when prompting strategies improve fresh-symbol generalization. Synthetic template experiments illustrate the predicted roles of regularization, sample size, and transformer-kernel structure.
[LG-101] Conformal-Style Quantile Analyses for Stochastic Bandits
链接: https://arxiv.org/abs/2605.07115
作者: Chengyu Du,Mengfan Xu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \alpha , the natural upper-tail target of arm j is the upper endpoint F_j^{-1}(1-\alpha/2) of a central prediction interval. This target can rank arms differently from their means, creating a central mismatch with the classical bandit objective. To this end, we propose ACP-UCB1, a conformal-style policy that combines an adaptive conformal estimate of the upper endpoint with a UCB-type optimism bonus. The technical challenge is that the conformity scores used by ACP-UCB1 are recomputed from evolving empirical quantile estimates and evaluated at an adaptive level. We control this endpoint through reward-quantile concentration, a perturbation argument for recomputed score quantiles, and deterministic localization of the adaptive level. ACP-UCB1 achieves logarithmic upper-quantile regret with per-arm contribution O(\nicefrac{\log n}{\Delta_j^{\mathrm{ACP}}}) . We also provide metric-specific regret decompositions comparing ACP-UCB1 with UCB1 and use numerical experiments to validate performance and improvement.
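A minimal sketch of the selection rule the abstract describes: score each arm by an empirical upper-quantile estimate plus a UCB-style optimism bonus. The order-statistic estimator and the sqrt(2 log t / n) bonus below are generic placeholders, not the paper's exact adaptive-conformal construction:

```python
import math

def upper_quantile(xs, q):
    """Empirical q-quantile via an order statistic."""
    xs = sorted(xs)
    idx = max(0, min(len(xs) - 1, math.ceil(q * len(xs)) - 1))
    return xs[idx]

def select_arm(rewards, t, alpha=0.2):
    """Pick the arm maximizing upper-quantile estimate plus optimism bonus."""
    best_j, best_score = 0, -math.inf
    for j, xs in enumerate(rewards):
        if not xs:                       # play every arm at least once
            return j
        score = upper_quantile(xs, 1 - alpha / 2) + math.sqrt(2 * math.log(t) / len(xs))
        if score > best_score:
            best_j, best_score = j, score
    return best_j

# Arm 1 has the worse mean but the heavier upper tail, so it wins under the
# 0.9-quantile criterion even after the optimism bonus.
rewards = [[0.5] * 50, [0.0] * 40 + [1.0] * 10]
chosen = select_arm(rewards, t=100)
```

This illustrates the central mismatch: a mean-based UCB would prefer arm 0 here, while the quantile criterion prefers arm 1.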
[LG-102] Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
链接: https://arxiv.org/abs/2605.07114
作者: Tao Wang,Shuo Li,Yan Sun,Dongsheng Ding,Edgar Dobriban
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a central paradigm for improving the reasoning capabilities of large language models. Group-based policy optimization methods, such as GRPO, typically allocate a fixed number of rollouts to every prompt. This uniform allocation can be inefficient: it over-allocates compute to prompts whose sampled groups are already saturated while under-exploring prompts for which additional samples may reveal useful correct trajectories. To address this limitation, we introduce hit utility, the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct. Building on this notion, we propose Hit-Utility Optimal Rollout Allocation (HORA), a learning-free rollout allocation policy that maximizes total posterior hit utility within each allocation batch. HORA adaptively reallocates rollout budgets while leaving the downstream reward evaluation and group-based advantage estimator unchanged. Across four mathematical reasoning benchmarks and three model scales, HORA preserves comparable Pass@1 and improves Pass@K over compute-matched GRPO in ten of twelve model–benchmark configurations, with one tie and one saturated exception. It is also drop-in compatible with other group-based estimators such as RLOO. Ablation studies indicate that the uniform prior used by HORA is competitive with five prompt-conditioned learned-prior alternatives.
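Under a Beta(a, b) posterior on a prompt's per-rollout success probability p, the hit utility of m additional rollouts has a closed form: P(at least one correct) = 1 - E[(1-p)^m] = 1 - prod_{i=0}^{m-1} (b+i)/(a+b+i). A greedy allocator can then spend the batch budget one rollout at a time on the largest marginal gain. The Beta posterior and the greedy loop below are our illustration of the idea, not HORA's exact procedure:

```python
def hit_utility(a, b, m):
    """P(at least one of m i.i.d. rollouts is correct) under p ~ Beta(a, b)."""
    miss = 1.0
    for i in range(m):
        miss *= (b + i) / (a + b + i)
    return 1.0 - miss

def greedy_allocate(posteriors, budget):
    """Give rollouts one at a time to the prompt with the largest marginal gain."""
    alloc = [0] * len(posteriors)
    for _ in range(budget):
        gains = [hit_utility(a, b, m + 1) - hit_utility(a, b, m)
                 for (a, b), m in zip(posteriors, alloc)]
        alloc[gains.index(max(gains))] += 1
    return alloc

# The easy prompt (9,1) saturates after one rollout; the mid-uncertainty
# prompt (5,5) absorbs most of the remaining budget.
alloc = greedy_allocate([(1, 9), (5, 5), (9, 1)], budget=5)
```

Because hit utility has diminishing returns in m, this greedy loop is the natural way to maximize total posterior hit utility within a batch.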
[LG-103] Solving Max-Cut to Global Optimality via Feasibility-Preserving Graph Neural Networks
链接: https://arxiv.org/abs/2605.07113
作者: Hao Chen,Chendi Qian,Christopher Morris,Andrea Lodi,Can Li
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Exact solution of hard combinatorial optimization problems often relies on strong convex relaxations, but solving these relaxations repeatedly inside a branch-and-bound algorithm can be prohibitively expensive. Hence, we consider this challenge for Max-Cut, where branch and bound commonly uses semidefinite programming (SDP) relaxations to bound subproblems. We propose a Max-Cut-specific graph neural network that serves as a principled, lightweight neural proxy for these SDP solvers and can be plugged directly into an exact branch-and-bound framework. The proposed architecture has update steps of complexity \mathcal{O}(n^2 + ne) , and predicts both primal- and dual-feasible SDP solutions. The primal SDP solutions yield feasible Max-Cut solutions via the Goemans–Williamson algorithm. In addition, it is trained in a self-supervised fashion without requiring solved SDP relaxations as labels. Empirically, we show that our architecture can substantially reduce the cost of bounding in exact Max-Cut solving by up to 10.6\times compared with using the state-of-the-art SDP solver Mosek. Our work highlights the potential of learned, validity-preserving surrogates for accelerating exact optimization over structured convex relaxations.
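The Goemans–Williamson step mentioned above turns any unit-vector embedding (for example, a predicted primal SDP solution) into a cut via random-hyperplane rounding: draw a Gaussian vector r and put node i on the side sign(v_i . r). A self-contained sketch, with a hand-made embedding for a 4-cycle standing in for a network prediction:

```python
import random

def gw_round(vectors, edges, trials=20, seed=0):
    """Random-hyperplane rounding: sign(v_i . r) assigns each node a side."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    best_cut, best_sides = -1, None
    for _ in range(trials):
        r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sides = [1 if sum(v[k] * r[k] for k in range(dim)) >= 0 else -1
                 for v in vectors]
        cut = sum(1 for i, j in edges if sides[i] != sides[j])
        if cut > best_cut:
            best_cut, best_sides = cut, sides
    return best_cut, best_sides

# 4-cycle with the optimal embedding: adjacent nodes are antipodal, so every
# random hyperplane separates them and all 4 edges are cut.
vectors = [(1.0, 0.0), (-1.0, 0.0), (1.0, 0.0), (-1.0, 0.0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
cut, _ = gw_round(vectors, edges)
```

Taking the best of several roundings is the standard cheap trick; the embedding quality (here, from the learned proxy) determines the achievable cut.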
[LG-104] Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift
链接: https://arxiv.org/abs/2605.07104
作者: Xinyu Liu,Zixuan Xie,Shangtong Zhang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic approximation algorithms whose expected updates are contractive, a setting that arises in many reinforcement learning algorithms such as Q-learning and linear temporal difference learning. Specifically, for a power-law learning rate O(n^{-\eta}) with \eta \in (1/2, 1) , we obtain an almost sure convergence rate arbitrarily close to o(n^{1 - 2\eta}) . For a harmonic learning rate O(n^{-1}) , we obtain an almost sure convergence rate arbitrarily close to o(n^{-1}) , which we argue is a strong result because it is close to the optimal rate O(n^{-1}\log\log n) given by the law of the iterated logarithm (for a special case of i.i.d. noise). Key to our analysis is a novel Lyapunov drift construction that applies a Poisson-equation based correction for Markovian noise to the well-established Moreau-envelope smoothing for the contractive mapping.
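A quick numerical illustration of the setting (not the paper's proof technique): an iterate whose expected update is a contraction toward a fixed point x*, driven by zero-mean bounded noise and a power-law step size a_n = n^{-3/4}, i.e. eta = 3/4 in (1/2, 1):

```python
import random

rng = random.Random(0)
x_star, x = 1.0, 0.0
for n in range(1, 20001):
    a_n = n ** -0.75                   # power-law step with eta = 3/4
    noise = rng.uniform(-0.5, 0.5)     # zero-mean, bounded noise
    x += a_n * ((x_star - x) + noise)  # expected update contracts toward x_star
```

The steps sum to infinity while their squares sum to a finite value, so the iterate settles near x* = 1; the paper's theorems quantify how fast this convergence happens almost surely, including under Markovian rather than i.i.d. noise.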
[LG-105] CarCrashNet: A Large-Scale Dataset and Hierarchical Neural Solver for Data-Driven Structural Crash Simulation
链接: https://arxiv.org/abs/2605.07098
作者: Mohamed Elrefaie,Dule Shu,Matthew Klenk,Faez Ahmed
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Crash simulation is a cornerstone of modern vehicle development because it reduces the need for costly physical prototypes, accelerates safety-driven design iteration, and increasingly supports virtual testing workflows. At the same time, modeling structural crash mechanics remains exceptionally challenging: the response is governed by nonlinear contact, large deformation, material plasticity, failure, and complex multi-body interactions evolving over space and time on high-resolution finite-element meshes. In this work, we introduce \textsc{CarCrashNet}, a public high-fidelity open-source benchmark for data-driven structural crash simulation. \textsc{CarCrashNet} combines component-scale and full-vehicle simulations in a multi-modal format, including more than 14,000 bumper-beam pole-impact simulations with varying geometry, materials, and boundary conditions, together with 825 full-vehicle crash simulations built from three industry-standard vehicle models of increasing structural complexity: Dodge Neon, Toyota Yaris, and Chevrolet Silverado. To establish the reliability of the benchmark, we validate our open-source finite-element workflow based on OpenRadioss against both experimental crash data and the commercial solver Ansys LS-DYNA. We also introduce \textsc{CrashSolver}, a machine-learning model designed for full-vehicle crash prediction from high-resolution finite-element crash data. We further perform extensive benchmarking across the released datasets and evaluate \textsc{CrashSolver} against state-of-the-art geometric deep learning and transformer-based neural solvers. Our results position \textsc{CarCrashNet} as a foundation for reproducible research in structural simulation, crashworthiness modeling, and AI-driven virtual crash testing. The dataset is available at this https URL.
[LG-106] Actor-Critic with Active Importance Sampling
链接: https://arxiv.org/abs/2605.07094
作者: Majid Molaei,Gabor Paczolay,Matteo Papini,Alberto Maria Metelli,Marcello Restelli
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper introduces the Active-Importance-Sampling Actor-Critic (AISAC) algorithm, an extension of the Actor-Critic framework for reducing variance in policy gradient estimation. AISAC optimizes the behavior policy to minimize gradient variance while preserving unbiased gradient estimates. Using importance sampling principles, the algorithm adapts the behavior policy toward efficient data collection distributions aligned with target policy gradients. For continuous action spaces, AISAC employs Gaussian behavior policies optimized through cross-entropy minimization. We provide theoretical analysis demonstrating variance reduction and unbiasedness. Experiments on Inverted Pendulum and Half Cheetah tasks show improved learning speed, sample efficiency, and training stability compared to standard Actor-Critic methods. Results indicate that optimizing the behavior policy improves both target policy updates and critic estimation accuracy across different hyperparameter settings. AISAC accelerates convergence and stabilizes reinforcement learning training, making it promising for real-world applications. Future work includes integration with advanced algorithms such as Soft Actor-Critic and TD3 for more complex environments.
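The core primitive behind AISAC is the importance-sampling correction that keeps estimates unbiased when data comes from a behavior distribution q rather than the target p: weight each sample by p(x)/q(x). A one-dimensional sketch with Gaussian densities (the distributions and target function are our toy choices, not the actor-critic setup itself):

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def is_estimate(f, mu_p, sigma_p, mu_q, sigma_q, n, seed=0):
    """Unbiased estimate of E_p[f(X)] using samples from the behavior density q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_q, sigma_q)
        w = normal_pdf(x, mu_p, sigma_p) / normal_pdf(x, mu_q, sigma_q)
        total += w * f(x)
    return total / n

# Target p = N(0, 1); behavior q = N(0, 1.5^2) oversamples the tails.
# E_p[X^2] = 1, and the weighted average remains an unbiased estimate of it.
est = is_estimate(lambda x: x * x, 0.0, 1.0, 0.0, 1.5, n=20000)
```

AISAC's contribution is choosing q to minimize the variance of such weighted estimates of the policy gradient, rather than fixing q in advance.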
[LG-107] Test-Time Compositional Generalization in Diffusion Models via Concept Discovery
链接: https://arxiv.org/abs/2605.07078
作者: Zekun Wang,Anant Gupta,Tianyi Zhu,Christopher J. MacLellan
类目: Machine Learning (cs.LG)
*备注: 9 pages
Abstract:Compositional generalization requires models to produce novel configurations from familiar parts. In diffusion models, prior compositional generation methods typically assume that the relevant concepts or conditioning signals are already available. We instead ask whether a pretrained diffusion model can discover query-specific concepts from the time-indexed scores it learns for the noisy marginals p_t(x_t) and compose them at test time. Given a single out-of-distribution query, our method performs gradient ascent on s_\theta(x_t,t) \approx \nabla_{x_t}\log p_t(x_t) at multiple noising timesteps to recover local density modes, maps these modes into clean-space Gaussians, greedily selects relevant prototypes with a submodular likelihood objective, and combines them into a product-of-experts (PoE) teacher model with an analytic score. This teacher model can be sampled directly through classifier-free guidance or used to generate a sample pool for training a new class embedding and low-rank adapter. On held-out composition benchmarks built from ColorMNIST and CelebA, both the analytic PoE sampler and the low-rank adapted model outperform query-only and nearest trained-class baselines. These results suggest that the time-indexed score geometry of the diffusion model contains reusable density-mode concepts that support test-time compositional generation without a predefined concept library.
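The analytic score of the product-of-experts teacher is easiest to see in one dimension with Gaussian experts: log densities add, so scores add, and the product of Gaussians is again Gaussian with summed precisions. A minimal sketch of that algebra (1-D for clarity; the paper's experts are clean-space Gaussians in higher dimension):

```python
def poe_gaussian(experts):
    """The product of 1-D Gaussian experts (mu, var) is Gaussian: precisions add."""
    precision = sum(1.0 / v for _, v in experts)
    mu = sum(m / v for m, v in experts) / precision
    return mu, 1.0 / precision

def poe_score(x, experts):
    """Score d/dx log of the product density: the experts' scores simply add."""
    return sum((m - x) / v for m, v in experts)

experts = [(0.0, 1.0), (2.0, 1.0)]
mu, var = poe_gaussian(experts)   # equal-variance experts: mu = 1.0, var = 0.5
```

This additivity is what makes the teacher's score analytic: no sampling or retraining is needed to evaluate it during guided generation.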
[LG-108] ModelLens: Finding the Best for Your Task from Myriads of Models
链接: https://arxiv.org/abs/2605.07075
作者: Rui Cai,Weijie Jacky Mo,Xiaofei Wen,Qiyao Ma,Wenhui Zhu,Xiwen Chen,Muhao Chen,Zhe Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:The open-source model ecosystem now contains hundreds of thousands of pretrained models, yet picking the best model for a new dataset is increasingly infeasible: new models and unbenchmarked datasets emerge continuously, leaving practitioners with no prior records on either side. Existing approaches handle only fragments of this in-the-wild setting: AutoML and transferability estimation select models from small predefined pools or require expensive per-model forward passes on the target dataset, while model routing presupposes a given candidate pool. We introduce ModelLens, a unified framework for model recommendation in the wild. Our key insight is that public leaderboard interactions, though scattered and noisy, collectively trace out an implicit atlas of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model–dataset–metric tuples, ModelLens ranks unseen models on unseen datasets without running candidates on the target dataset. On a new benchmark of 1.62M evaluation records spanning 47K models and 9.6K datasets, ModelLens surpasses baselines that either rely on metadata alone or require running each candidate on the target dataset. Its recommended Top-K pools further improve multiple representative routing methods by up to 81% across diverse QA benchmarks. Case studies on recently released benchmarks further confirm generalization to both text and vision-language tasks.
[LG-109] Less Random More Private: What is the Optimal Subsampling Scheme for DP-SGD? NEURIPS2026
链接: https://arxiv.org/abs/2605.07072
作者: Andy Dong,Ayfer Özgür
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: 17 pages, 1 table. Submitted to NeurIPS 2026
Abstract:Poisson subsampling is the default sampling scheme in differentially private machine learning, largely because its unstructured randomness yields tractable privacy amplification analyses. Yet this same randomness introduces substantial participation variance: each sample appears in very different numbers of training iterations. In this work, we show that this variance is not merely a practical artifact to be tolerated, but a fundamental source of suboptimal privacy amplification. We prove that Balanced Iteration Subsampling (BIS), a structured scheme in which each sample participates in exactly a fixed number of iterations, achieves stronger privacy amplification than Poisson subsampling and is optimal at both extremes of the noise spectrum (\sigma \to 0 and \sigma \to \infty). Our analysis reveals that the privacy-noise tradeoff is governed not by maximizing randomness, but by eliminating participation variance while preserving uniform marginal participation across iterations. To translate this asymptotic theory into finite-noise guarantees, we introduce a practical near-exact Monte Carlo accountant for BIS, which removes the analytical slack of existing RDP and composition-based PLD analyses. Evaluations across more than 60 practical DP-SGD configurations show that BIS consistently outperforms Poisson subsampling in the low-noise regimes most relevant for high-utility private training, reducing the required noise multiplier by up to 9.6%. These results overturn the common intuition that more sampling randomness necessarily yields stronger privacy amplification: in DP-SGD, structured participation can be both more practical and more private. Our implementation is available at this https URL.
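The contrast between the two schemes is easy to state in code: under Poisson subsampling each sample joins each iteration independently, so participation counts fluctuate, while BIS fixes every sample's count exactly. A minimal sketch that ignores batch-size balancing and all privacy accounting (function names are ours):

```python
import random

def poisson_batches(n, iters, q, seed=0):
    """Poisson subsampling: each sample joins each iteration w.p. q, independently."""
    rng = random.Random(seed)
    return [[i for i in range(n) if rng.random() < q] for _ in range(iters)]

def bis_batches(n, iters, r, seed=0):
    """Balanced Iteration Subsampling sketch: each sample joins exactly r iterations."""
    rng = random.Random(seed)
    batches = [[] for _ in range(iters)]
    for i in range(n):
        for t in rng.sample(range(iters), r):
            batches[t].append(i)
    return batches

n, iters, r = 100, 50, 5          # q = r / iters gives matching expected counts
bis = bis_batches(n, iters, r)
counts = [sum(i in b for b in bis) for i in range(n)]   # every count is exactly r
```

Both schemes have the same marginal participation probability per iteration; the paper's point is that removing the count variance, not the randomness itself, is what strengthens amplification.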
[LG-110] PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation
链接: https://arxiv.org/abs/2605.07067
作者: Haozhou Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Muon’s matrix-level update couples two distinct effects: spectral control via a polar map, and equivariance under orthogonal changes of multiplicity-space basis (Schur gauge-equivariance). We separate them with PolarAdamW, a controlled hybrid that preserves Muon’s polar spectral-norm control but breaks the gauge-equivariance, since AdamW’s coordinatewise preconditioner is basis-dependent. Algorithmically, PolarAdamW applies Muon’s Newton-Schulz polar map to AdamW’s preconditioned direction rather than to raw momentum, at per-iteration wall-time comparable to Muon. We prove that Muon’s polar step is Schur gauge-equivariant on multiplicity matrices while AdamW’s coordinatewise step is not. On DeiT-Tiny trained from scratch on four independently sampled 100-class subsets of ImageNet-1k, where multiplicity-basis freedom is trivial, PolarAdamW outperforms Muon by +1.93 pp in test accuracy on average and AdamW by +9.5 pp; under the 300-epoch DeiT-style recipe, it remains ahead of Muon by +1.37 pp and AdamW by +5.80 pp on average. On SO(3)-equivariant 3D point-cloud regression, where multiplicity-basis freedom is non-trivial, the ordering reverses: Muon outperforms PolarAdamW at every audited capacity, and the gap widens with capacity. Both matrix-polar optimisers continue to outperform AdamW. This double dissociation separates spectral control from Schur gauge-equivariance: the first composes well with AdamW preconditioning on standard transformers, while the second becomes consequential when multiplicity-basis freedom is structurally non-trivial.
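The Newton-Schulz polar map shared by Muon and PolarAdamW can be sketched directly: the cubic iteration X <- 1.5 X - 0.5 X X^T X pushes every singular value toward 1, so X converges to the polar factor U V^T of its SVD provided the input's spectral norm is below sqrt(3). A minimal 2x2 sketch in pure Python (the pre-scaling constant is an illustrative choice):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz_polar(X, iters=20):
    """Iterate X <- 1.5*X - 0.5*(X X^T X); converges to the polar factor
    when the input's spectral norm is below sqrt(3)."""
    for _ in range(iters):
        cubic = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * c for x, c in zip(rx, rc)]
             for rx, rc in zip(X, cubic)]
    return X

# Diagonal input with singular values 2 and 0.5, pre-scaled by 1/3 to land
# inside the convergence region; its polar factor is the identity.
X0 = [[2.0 / 3.0, 0.0], [0.0, 0.5 / 3.0]]
P = newton_schulz_polar(X0)
```

The only difference between the two optimisers is what this map is applied to: raw momentum in Muon, versus AdamW's coordinatewise-preconditioned direction in PolarAdamW.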
[LG-111] Integrating Causal DAGs in Deep RL: Activating Minimal Markovian States with Multi-Order Exposure
链接: https://arxiv.org/abs/2605.07057
作者: Jiamin Xu,Jacqueline Maasch,Kyra Gan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Online reinforcement learning (RL) relies on the Markov property for guaranteed performance, but real-world applications often lack well-defined states given raw observed variables. While causal RL has attracted growing interest, existing work typically assumes Markovian states are provided and focuses on using causality to accelerate learning, leaving a fundamental gap: \emph{given a longitudinal causal graph over observed variables, how does one construct MDP states that provably satisfy the Markov property?} We address this by providing a procedure that constructs a provably minimal state representation. In deep RL, we observe that the minimal representation alone empirically fails to improve performance, indicating that neural networks cannot directly exploit Markovian minimality. To address this, we propose \textbf{MOSE} (Multi-Order State Exposure), which feeds multi-order historical state constructions into the same Q-function. MOSE consistently outperforms both the minimal state construction and single-window policies on common benchmarks and synthetic datasets. Including the minimal representation alongside MOSE can further improve performance. Our results establish a core principle for causal deep RL: minimal sufficiency is not enough, and \emph{controlled} redundancy is necessary to unlock the benefit of causal state information.
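The multi-order exposure idea is mechanically simple: build the last-k observation window for several window sizes k and feed their concatenation to one Q-function. A minimal sketch of that input construction (the padding convention and names are ours, not the paper's):

```python
def multi_order_input(history, orders, pad=0.0):
    """Concatenate the last-k observations for each window size k.

    Short histories are left-padded so the input dimension stays fixed,
    which lets a single Q-network consume all orders at once.
    """
    feats = []
    for k in orders:
        window = list(history[-k:])
        feats.extend([pad] * (k - len(window)) + window)
    return feats

x = multi_order_input([1.0, 2.0, 3.0], orders=(1, 2, 4))
# -> [3.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
```

The redundancy is deliberate: the shorter windows repeat information contained in the longest one, which is exactly the controlled redundancy the abstract argues deep Q-networks need.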
[LG-112] A Behavioral Framework for Data-Driven Modeling of Nonlinear Systems in Vector-Valued Reproducing Kernel Hilbert Spaces
链接: https://arxiv.org/abs/2605.07052
作者: Boya Hou,Maxim Raginsky
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 12 pages
Abstract:We generalize Jan Willems’ behavioral approach to a class of discrete-time nonlinear systems in a vector-valued reproducing kernel Hilbert space (RKHS). Apart from linear time-invariant systems, this class covers nonlinear systems modeled by Volterra series and their autoregressive variants, as well as systems admitting Hammerstein-type state-space realizations. We apply the proposed framework to the problem of data-driven modeling of such systems, i.e., when simulation or control objectives for an unknown system are carried out without an explicit system identification step. To that end, we link the behavioral approach to two data-driven modeling methods in a vector-valued RKHS: (1) minimum-norm interpolation and (2) subspace identification.
[LG-113] PACEvolve: Improving Test-time Learning for Evolutionary Search Agents
链接: https://arxiv.org/abs/2605.07039
作者: Minghao Yan,Bo Peng,Benjamin Coleman,Ziqi Chen,Zhouhang Xie,Shuo Chen,Zhankui He,Noveen Sachdeva,Weili Wang,Ed H. Chi,Shivaram Venkataraman,Wang-Cheng Kang,Derek Zhiyuan Cheng,Beidou Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models have become drivers of evolutionary search, but most systems rely on a fixed, prompt-elicited policy to sample next candidates. This limits adaptation in practical engineering and research tasks, where evaluations are expensive, and progress depends on learning task-specific search dynamics. We introduce PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates selected hypotheses into executable candidates. To train the advisor under non-stationary feedback, we propose a phase-adaptive approach that adapts its optimization strategy to different phases of the evolutionary process. Early in evolution, it uses group-relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best-of-k frontier contribution to support stable refinement. Across expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation, PACEvolve++ outperforms the state-of-the-art evolutionary search framework with frontier models, achieving faster convergence and stabilizing test-time training during evolutionary search.
[LG-114] Beyond the Wrapper: Identifying Artifact Reliance in Static Malware Classifiers using TRUSTEE
链接: https://arxiv.org/abs/2605.07034
作者: Riyazuddin Mohammed,Lan Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Modern cybersecurity relies heavily on static machine-learning-based malware classifiers. However, transformations such as packing and other non-semantic modifications applied to executable files limit their reliability. Malware classifiers often learn these unnecessary artifacts rather than the true binary behavior because of the high association between maliciousness and packing. Moreover, these malware classifiers are black boxes, making it difficult to understand what they learn. To address this issue, we proposed a two-part framework using the post-hoc interpretability XAI tool TRUSTEE, followed by a manual analysis of the top features. We conducted several controlled experiments by varying the dataset composition ratios to understand their impact on the results. The top-ranked features across all experiments, identified by TRUSTEE, were predominantly packing artifacts, portable executable (PE) metadata, and n-grams at the string level, rather than malicious semantics. These results suggest that these malware classifiers are highly sensitive to dataset composition and can misinterpret packing as malicious behavior. Our proposed framework allows for the reproducible diagnosis of such biases and forms a guideline for building more robust and semantically meaningful malware detection models.
[LG-115] Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks
链接: https://arxiv.org/abs/2605.07024
作者: Mahdi Erfanian,Nelson Daniel Troncoso,Aashna Garg,Amabel Gale,Xiaoyu Liu,Pareesa Ameneh Golnari,Shengyu Fu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks – plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet introduce runtime errors. We introduce Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. We evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate, plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. We release the benchmark, containers, and evaluation framework at this https URL.
[LG-116] Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
链接: https://arxiv.org/abs/2605.07022
作者: Haydn Jones,Yimeng Zeng,Alden Rose,Li S. Yifei,Yining Huang,Kaiwen Wu,Jiaming Liang,Maggie Ziyu Huan,Yoseph Barash,Cesar de la Fuente-Nunez,Osbert Bastani,Zachary Ives,Mark Yatskar,Jacob R. Gardner
类目: Machine Learning (cs.LG)
*备注:
Abstract:Manually curated biomedical repositories – spanning bioactivity, genomics, and chemistry – are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks – blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions – Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard – e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: this https URL.
[LG-117] Dual-Agent Co-Training for Health Coaching via Implicit Adversarial Preference Optimization
链接: https://arxiv.org/abs/2605.07011
作者: Da Long,Lingyi Fu,Diya Michelle Rao,Jasmine Ruales Carrera,Yang Bai,Shandian Zhe
类目: Machine Learning (cs.LG)
*备注:
Abstract:Motivational-interviewing-based health coaching is an effective approach for improving mental health and promoting healthy behavior change. However, the scarcity of trained human coaches and the high cost of coaching services make such support inaccessible to many people who could benefit from it. This motivates the development of AI health coaches that can provide scalable and affordable support. Existing methods typically optimize only one side of the interaction: they either train a dialogue agent against a fixed client environment or train a client simulator against a fixed assistant. This one-sided setup can limit exploration of the interaction space and may be inefficient at developing the capabilities required by the target agent and pushing its performance boundaries. In this paper, we propose a dual-agent framework that interactively co-trains both the health coach agent and the client simulator. The coach is optimized with DPO using Pareto-dominant response pairs identified by a multi-dimensional LLM judge. In turn, the client is trained adversarially by reversing these preferences, inducing an implicit adversarial training dynamic. We further show that this co-training process admits a natural stochastic-game interpretation. Extensive experiments demonstrate that our method effectively improves coaching quality across several important dimensions.
[LG-118] Inductive Power Grid Cascading Failure Analysis with GRU-Gated Graph Attention
链接: https://arxiv.org/abs/2605.07010
作者: Tianxin Zhou,Xiang Li,Haibing Lu
类目: Machine Learning (cs.LG)
*备注: 10 pages, 10 figures, IEEE format
Abstract:Identifying vulnerable transmission lines in power grids before a cascading failure occurs is challenging: existing methods can learn inter-line failure correlations from cascade data, but they are trained and evaluated on a single grid, and transferring the learned knowledge to an unseen grid remains an open problem. We address this by training a single Gated Recurrent Unit (GRU)-gated Graph Attention Network on combined cascading failure data from limited training grids and applying it directly to any unseen grid without retraining. A GRU gate controls what information each node retains or discards at each cascade iteration. Empirical evaluation shows that the model transfers zero-shot to multiple new grids spanning inter-time and inter-domain settings. Using information extracted from the trained model, we consistently identify more vulnerable lines than established structural and electrical baselines.
[LG-119] Equivalence of Coarse and Fine-Grained Models for Learning with Distribution Shift COLT2026
链接: https://arxiv.org/abs/2605.07005
作者: Adam R. Klivans,Shyamal Patel,Konstantinos Stavropoulos,Arsen Vasilyan
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 26 pages, Accepted to COLT 2026
Abstract:Recent work on provably efficient algorithms for learning with distribution shift has focused on two models: PQ learning (Goldwasser et al. (2020)) and TDS learning (Klivans et al. (2024)). Algorithms for TDS learning are allowed to reject a test set entirely if distribution shift is detected. In contrast, PQ learners may only reject points that are deemed out-of-distribution on an individual basis. Our main result is a surprising equivalence between these two models in the distribution-free setting. In particular, we give an efficient black-box reduction from PQ learning to TDS learning for any Boolean concept class. This equivalence implies the first hardness results for distribution-free TDS learning of basic classes such as halfspaces. The main technical contribution underlying our equivalence is a method for boosting, via branching programs, the weak distinguishing power of TDS learners that have rejected the target domain. We also show that giving a learner access to membership queries sidesteps these hardness results and allows for efficient, distribution-free PQ learnability of halfspaces. Our algorithm iteratively recovers large-margin separators obtained by applying successive Forster transforms on the training data.
[LG-120] Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
链接: https://arxiv.org/abs/2605.06997
作者: Anupama Sridhar,Alexander Johansen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Long chain-of-thought reasoning and agentic tool-calling produce traces spanning tens of thousands of tokens, yet Transformer KV caches grow linearly with sequence length, creating a memory bottleneck on commodity hardware. State-space models offer constant-memory recurrence but suffer a memory cliff: retrieval accuracy collapses once the gap between a stored fact and its query exceeds the effective horizon of the recurrent state. We introduce Echo, a KV-cache-free associative recall architecture built around Spectral Koopman Attention (SKA), a drop-in replacement for attention layers that augments SSM blocks with a closed-form dynamical operator whose sufficient statistics are accumulated in constant memory with no KV cache. Echo fits a spectral linear system to the key and value history via kernel ridge regression and retrieves through a learned power-iterated filter, all from O(r^2) streaming state where r is a small projection rank. On the Multi-Query Associative Recall benchmark, a pure Mamba-2 SSM fails to exceed chance accuracy (\sim 3%) across all gap lengths and KV-pair counts, while at the 50M parameter scale SKA-augmented models achieve 100% retrieval accuracy on every configuration tested, including distractor gaps of 4,096 tokens with 32 KV pairs. Across five additional transfer benchmarks including needle-in-a-haystack, tool-trace, and multi-hop retrieval, SKA consistently outperforms both pure SSM and SSM+Attention hybrids while maintaining constant inference memory. Ablations confirm that the spectral operator, not the prefix masking strategy, drives the retrieval gain.
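The constant-memory ridge-regression recall described above can be illustrated with a plain linear associative memory (an illustrative sketch under simplifying assumptions, not the paper's spectral Koopman operator): sufficient statistics are accumulated in a single streaming pass, so no key/value history is ever retained.

```python
import numpy as np

# Streaming associative recall via ridge regression: accumulate
# A = sum_t k_t k_t^T + lam*I and B = sum_t v_t k_t^T in O(d^2) memory,
# then read out v_hat = B A^{-1} q for a query key q. No KV cache is kept.
rng = np.random.default_rng(0)
d = 16
keys = rng.standard_normal((100, d))
values = keys @ rng.standard_normal((d, d))  # a linear key->value map to recover

lam = 1e-3
A = lam * np.eye(d)
B = np.zeros((d, d))
for k, v in zip(keys, values):   # one streaming pass over the (key, value) events
    A += np.outer(k, k)
    B += np.outer(v, k)

W = B @ np.linalg.inv(A)         # closed-form ridge solution
q = keys[7]                      # query with a previously stored key
v_hat = W @ q                    # recalled value, independent of how long ago it was stored
```

Recall quality here does not degrade with the gap between storage and query, which is the property the paper's memory-cliff experiments probe.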
[LG-121] Why Does Agentic Safety Fail to Generalize Across Tasks?
链接: https://arxiv.org/abs/2605.06992
作者: Yonatan Slutzky,Yotam Alexander,Tomer Slor,Yoav Nagel,Nadav Cohen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:AI agents are increasingly deployed in multi-task settings, where the task to perform is specified at test time, and the agent must generalize to unseen tasks. A major concern in such settings is safety: often, an agent must not only execute unseen tasks, but do so while avoiding risks and handling ones that materialize. Empirical evidence suggests that even when the ability to execute generalizes to unseen tasks, the ability to do so safely frequently does not. This paper provides theory and experiments indicating that failures of agentic safety to generalize across tasks are not merely due to limitations of training methods, but reflect an inherent property of safety itself: the relationship between a task and its safe execution is more complex than the relationship between a task and its execution alone. Theoretically, we analyze linear-quadratic control with H_\infty -robustness, and prove that the mapping from task specification to an optimal controller has higher Lipschitz constant with safety requirements than without, yielding a Lipschitz bound of independent interest. Empirically, we demonstrate our conclusions in simulated quadcopter navigation with a neural network agent and in CRM with an LLM agent. Our findings suggest that current efforts to enhance agentic safety may be insufficient, and point to a need for fundamentally different approaches.
[LG-122] Response Time Enhances Alignment with Heterogeneous Preferences
链接: https://arxiv.org/abs/2605.06987
作者: Federico Echenique,Alireza Fallah,Baihe Huang,Michael I. Jordan
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH); Machine Learning (stat.ML)
*备注:
Abstract:Aligning large language models (LLMs) to human preferences typically relies on aggregating pooled feedback into a single reward model. However, this standard approach assumes that all labelers share the same underlying preferences, ignoring the fact that real-world labelers are highly heterogeneous and usually anonymous. Consequently, relying solely on binary choice data fundamentally distorts the learned policy, making the true population-average preference unidentifiable. To overcome this critical limitation, we demonstrate that augmenting preference datasets with a simple, secondary signal – the user’s response time – can restore the identifiability of the population’s average preference. By modeling each decision as a Drift-Diffusion Model (DDM), we introduce a novel, consistent estimator of heterogeneous preferences that successfully corrects the distortions of standard choice-only labels. We prove that our estimator asymptotically converges to the true average preference even in extreme cases where each anonymous labeler contributes only a single choice. Empirically, across both synthetic and real-world datasets, our method consistently outperforms standard baselines that otherwise fail and plateau at a bias floor. Because response times are essentially free to record and require zero user tracking or identification, our results bring promises and open up new opportunities for future data-collection pipelines to improve the social benefit without requiring user-level identifiers or repeated elicitations.
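The Drift-Diffusion Model underlying the estimator ties each binary choice to a response time through first-passage dynamics: evidence accumulates with drift until it hits a decision boundary. A minimal Euler-Maruyama simulation (parameters are illustrative, not fitted to any data in the paper):

```python
import random

def simulate_ddm(drift, threshold=1.0, dt=0.001, noise=1.0, rng=None):
    """Simulate one drift-diffusion decision: evidence x starts at 0 and
    diffuses until it hits +threshold (choice 1) or -threshold (choice 0);
    the hitting time is the response time."""
    rng = rng or random.Random(0)
    x, t = 0.0, 0.0
    while abs(x) < threshold:
        x += drift * dt + noise * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        t += dt
    return (1 if x >= threshold else 0), t

rng = random.Random(42)
trials = [simulate_ddm(drift=0.8, rng=rng) for _ in range(200)]
p_choose_a = sum(c for c, _ in trials) / len(trials)
mean_rt = sum(t for _, t in trials) / len(trials)
```

For this symmetric-boundary parameterization the analytic probability of the upper boundary is 1/(1 + exp(-2·drift·threshold/noise^2)), about 0.83 here; because response times carry extra information about the drift magnitude, observing (choice, RT) pairs identifies preference strength that choices alone cannot.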
[LG-123] FastOmniTMAE: Parallel Clause Learning for Scalable and Hardware-Efficient Tsetlin Embeddings
链接: https://arxiv.org/abs/2605.06982
作者: Ahmed K. Kadhim,Lei Jiao,Rishad Shafik,Ole-Christoffer Granmo,Mayur Kishor Shende
类目: Machine Learning (cs.LG)
*备注:
Abstract:Embedding models in natural language processing (NLP) increasingly rely on deep architectures such as BERT, while simpler models such as Word2Vec provide efficient representations but limited interpretability. The Tsetlin Machine (TM) offers an alternative logic-based learning paradigm. Omni TM Autoencoder (Omni TM-AE) applies this paradigm to static embedding by exploiting automaton state distributions within a single clause layer, but its training process remains slow. In this work, we propose FastOmniTMAE, a reformulation of Omni TM-AE that replaces sequential training dependencies with a two-stage parallel process: evaluation and update. Using a Single-Run Multi-Environment Benchmark covering classification, similarity, and clustering, FastOmniTMAE achieves up to 5\times faster training in classification while maintaining comparable embedding quality under both Spearman and Kendall similarity measures. To address the limited efficiency of TM training on conventional GPUs, we further implement FastOmniTMAE as a reusable accelerator on SoC-FPGA platforms. The Multi-Hardware Benchmark shows that FastOmniTMAE achieves similarity scores of 0.669 on a resource-constrained FPGA and 0.696 on an UltraScale+ SoC, demonstrating efficient logic-based embedding training with a small hardware footprint.
[LG-124] Rollback-Free Stable Brick Structures Generation
链接: https://arxiv.org/abs/2605.06947
作者: Chenhui Xu,Ziyue Bai,Fuxun Yu,Heng Huang,Jinjun Xiong
类目: Machine Learning (cs.LG)
*备注:
Abstract:While autoregressive models have advanced 3D generation, creating physically stable brick structures remains a challenge due to the strict requirements of gravity and interconnectivity. Existing approaches rely on external physical simulators during inference to perform rejection sampling and brick-by-brick rollbacks, which severely bottlenecks efficiency. To address this, we propose a reinforcement learning paradigm that shifts physical validity enforcement from test-time correction to training-time policy optimization. By utilizing assembly-level rewards, the model optimizes for collision avoidance, global connectivity, structural interlocking, and shape conformity. This paradigm allows the model to internalize physical priors, enabling the first rollback-free generation of stable brick structures. Experimental results demonstrate that our approach achieves state-of-the-art generation quality while accelerating inference speed by orders of magnitude. Our code and dataset are available at this https URL. Our models are available at this https URL.
[LG-125] ProtoSSL: Interpretable Prototype Learning from Unlabeled Time-Series Data
链接: https://arxiv.org/abs/2605.06943
作者: Steven Song,Sahil Sethi,Brett Beaulieu-Jones,Robert L. Grossman
类目: Machine Learning (cs.LG)
*备注:
Abstract:In time-series domains where both predictive performance and interpretability are essential, deep neural networks achieve strong results but provide limited insight into how their predictions are made. Projection-based prototype networks address this limitation by grounding predictions in similarity to representative training examples, enabling case-based explanations and global prototype inspection. However, existing approaches rely on label supervision, tying prototypes to a specific task and requiring large labeled datasets. We introduce ProtoSSL, a novel framework for learning interpretable, projection-based prototypes from unlabeled time-series data and adapting them to downstream tasks. Our key idea is to separate motif discovery from label alignment. ProtoSSL first learns a reusable prototype bank using a self-supervised objective applied directly to prototype activations, and then aligns these prototypes to downstream tasks through an efficient assignment procedure. Across six electrocardiography (ECG) datasets, ProtoSSL improves label efficiency, outperforming supervised prototype baselines in low-data regimes with as few as 256 labeled examples; with fine-tuning, ProtoSSL outperforms supervised prototype baselines at full dataset scale. In a human evaluation study, ProtoSSL produces prototypes and prototype-based explanations that are judged more favorably than those learned with direct label supervision. We further show that the framework extends to audio classification. Thus, ProtoSSL enables both learning generalizable prototypes from unlabeled data before the downstream label space is known, and subsequent assignment of interpretable, projection-grounded prototypes to new time-series tasks.
[LG-126] Causal-Aware Foundation-Model for Bilevel Optimization in Discrete Choice Settings
链接: https://arxiv.org/abs/2605.06941
作者: Shivaram Subramanian,Zhengliang Xue,Markus Ettl,Yingdong Lu,Jayant Kalagnanam
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We introduce a causal aware foundation-model framework for real time optimal decision making in discrete choice environments. We propose a constrained triple-head price optimization (C3PO) network to solve a bilevel decision problem in which a service provider selects an optimal assortment while heterogeneous users make personalized acceptance or rejection choices optimizing their own personalized preferences. C3PO integrates imitation learning of prices, multi-task learning of revenue responses, and in context learning of price elasticity to generate pricing recommendations while adhering to business constraints. During inference, frontier model prompting retrieves an enhanced elasticity prior for new products from behavioral economics literature, improving pricing effectiveness. We demonstrate strong in context learning performance using simulated, synthetic, and real-world datasets. C3PO is trained on simulated data generated from multiple classical discrete choice models in economics. The model is trained on data comprising simulated customer segments and counterfactual action and outcome pairs and evaluated on randomly generated choice environments with no access to the underlying preference structure. The trained model consistently improves the pricing KPIs, with gains increasing as customer price sensitivity increases. We also deploy the tuned foundation model for optimal pricing in real-world applications such as healthcare, tender pricing, airline ancillary pricing, and other domains, achieving substantial gains across multiple products, markets, and divisions.
[LG-127] Bias and Uncertainty in LLM-as-a-Judge Estimation
链接: https://arxiv.org/abs/2605.06939
作者: James Fiedler
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliability depends critically on judge quality and, for model comparisons, on calibration stability. Sharing calibration across compared models is practically attractive but can introduce severe bias, including cases where the comparison estimate points in the wrong direction with high apparent confidence. We study these failure modes through analytical results, simulations over judge quality ( J ) and cross-model calibration instability ( \Delta J ), and a real-data MMLU-Pro case study with sign reversal. We propose J and \Delta J as diagnostics for when corrected estimates, especially shared-calibration comparisons, are likely unreliable, and provide reporting guidance for LaaJ evaluation.
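The bias of the naive estimator, and the idea behind correcting it, can be seen with a standard misclassification correction (a Rogan-Gladen-style estimator; shown for illustration, not necessarily the exact estimator studied in the paper): if a judge has known true-positive and false-positive rates, the raw observed pass rate can be inverted to recover the true rate.

```python
def corrected_pass_rate(observed_rate, tpr, fpr):
    """Correct a raw judge pass rate for judge error.
    The judge reports: observed = tpr*p + fpr*(1-p), so p = (observed - fpr)/(tpr - fpr)."""
    p = (observed_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))  # clip to a valid probability

# An imperfect judge (tpr=0.9, fpr=0.2) evaluating a model whose true pass rate is 0.5
# reports 0.55 on average, which the correction maps back to 0.5:
observed = 0.9 * 0.5 + 0.2 * 0.5
p_hat = corrected_pass_rate(observed, tpr=0.9, fpr=0.2)
```

The paper's failure modes arise when the calibration quantities (here tpr and fpr) are estimated on one model and reused on another whose error profile differs, which can flip the sign of a comparison.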
[LG-128] A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis
链接: https://arxiv.org/abs/2605.06937
作者: Teo Susnjak
类目: Machine Learning (cs.LG)
*备注:
Abstract:This methods article presents a reproducible calibration workflow for prompt-based large language models (LLMs) in structured evidence-synthesis tasks. The method separates the rules that define the scientific task from the mutable prompt harness that frames and applies them. It optimises that harness against labelled or reference examples and an explicit task metric, then preserves the calibrated workflow as an inspectable artefact with its specification, metric, settings, and evaluation traces. The example code instantiates the protocol with DSPy and GEPA tools, but the underlying logic can transfer to other prompt-optimisation frameworks that support structured task definitions, metric-guided search, and artefact reuse. Title and abstract screening is the worked validation case because it provides labelled benchmark data and clear evaluation metrics. The demonstrated workflow uses a smaller student LLM for performing the scientific task execution and a larger reflection LLM to steer the prompt optimisation process during calibration. This work shows compilation, artefact round-tripping, and how optimisation budget affects a smaller student model.
[LG-129] Learned Lyapunov Shielding for Adaptive Control
链接: https://arxiv.org/abs/2605.06934
作者: Giansalvo Cirrincione,Adriano Fagiolini
类目: Machine Learning (cs.LG)
*备注:
Abstract:We augment the Slotine–Li adaptive controller for Euler–Lagrange systems with three learned components: a structured-quadratic Lyapunov function V_\psi whose positive-definiteness follows from a Cholesky parameterization, a residual Soft Actor–Critic policy that adds bounded torque corrections to the analytic baseline, and a physics-informed neural network that estimates unmodeled dynamics. A closed-form safety filter, derived from the single affine constraint \dot V_\psi + \alpha V_\psi \le 0, projects every policy output onto the safe set without requiring an online QP solver. We prove: global feasibility of the filter under a drift-decay condition on the control-degeneracy set; exponential stability under exact shielding, with a robust extension whose margin depends on the PINN approximation error; almost-sure convergence of the three-timescale policy–certificate–multiplier updates to a KKT point; and a PAC generalization bound for the certificate over compacts. On a 2-DOF manipulator with nonlinear friction and variable payload, the learned certificate accounts for most of the empirical gain: tracking error drops by 41% on nominal friction and 24% on aggressive friction at the centroid of the training distribution. A 7-DOF scalability study on a Franka Emika Panda confirms clean convergence of the full pipeline at industrial scale, identifies the conditions under which gains over exact model-based baselines should and should not be expected, and documents a warm-start pathology of the learned certificate that has practical implications for deployment.
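Projecting a policy output onto a single affine constraint in the control admits a closed form, which is why no online QP solver is needed: if the raw action violates a·u + b <= 0, subtract the violation along the constraint normal. A generic sketch of that projection (the constraint vector and actions below are hypothetical; in the paper the constraint comes from \dot V_\psi + \alpha V_\psi \le 0, which is affine in u for control-affine dynamics):

```python
import numpy as np

def shield(u, a, b):
    """Closed-form Euclidean projection of a control u onto the half-space
    {u : a.u + b <= 0}. A feasible u is returned unchanged; an infeasible u is
    moved the minimum distance onto the boundary a.u + b = 0."""
    viol = a @ u + b
    if viol <= 0.0:
        return u
    return u - (viol / (a @ a)) * a

a = np.array([1.0, 0.0])
b = -1.0                                       # constraint: u[0] <= 1
u_safe = shield(np.array([0.5, 2.0]), a, b)    # already feasible, untouched
u_proj = shield(np.array([3.0, 2.0]), a, b)    # projected to the boundary
```

Because the projection is minimal-norm, the filter perturbs the learned residual policy only as much as feasibility requires.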
[LG-130] Target-Aware Data Augmentation for SAT Prediction
链接: https://arxiv.org/abs/2605.06931
作者: Eshed Gal,Uri Ascher,Eldad Haber
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning-based approaches to NP-hard problems have shown increasing promise, but their progress is fundamentally constrained by the high cost of generating labeled training data. In domains such as Boolean satisfiability (SAT), standard pipelines rely on solver-in-the-loop labeling, which scales poorly with problem size and limits the amount of usable supervision. This bottleneck hinders the broader goal of leveraging machine learning to capture structure in hard combinatorial problems. In this work, we propose a target-aware, solver-free data generation framework for SAT that produces correctly labeled SAT and UNSAT instances by construction, eliminating the need for expensive solver calls. Our method aligns generated instances with the structural properties of a target benchmark, making synthetic data effective for downstream learning. We further develop a linear-programming-aware graph neural network (LPGNN) architecture that incorporates constraint-violation residuals into message passing, enabling the model to exploit underlying optimization structure. Together, these contributions support a data-centric paradigm for learning on NP-hard problems, where scalable, task-aligned data generation is as critical as model design. Our approach yields orders-of-magnitude speedups in data generation, demonstrating that benchmark-aligned synthetic data can effectively augment solver-labeled datasets for GNN-based SAT prediction.
[LG-131] Tyche: One Step Flow for Efficient Probabilistic Weather Forecasting
链接: https://arxiv.org/abs/2605.06916
作者: Fan Xu,Yuan Gao,Kun Wang,Rui Su,Fenghua Ling,Hao Wu,Wanli Ouyang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Probabilistic weather forecasting requires not only accurate trajectories, but calibrated distributions over plausible atmospheric futures. Recent data-driven systems have achieved remarkable deterministic skill, and diffusion-based ensemble forecasters have substantially improved sample realism and uncertainty quantification. However, their inference cost scales with forecast horizon, ensemble size, and the number of denoising steps required for each transition, making large operational ensembles expensive. To address this, we present Tyche, a one-step conditional flow model for efficient probabilistic weather forecasting. Tyche models the conditional forecast distribution with a destination-aware average-velocity flow that maps Gaussian noise directly to future weather states in a single function evaluation (1-NFE). To make this one-step transport learnable in high-dimensional geophysical fields, we derive a JVP-regularized rectification objective that enforces temporal self-consistency across source and destination flow timesteps without explicitly forming Jacobians. The transport field is parameterized by an isotropic Swin-style transformer that preserves fine-scale spatial structure while remaining scalable on global grids. To improve ensemble reliability under autoregressive forecasting, we further introduce a rollout-based finetuning stage with curriculum CRPS calibration supervision. Experiments on ERA5 at 1.5 ^\circ and 6-hour resolution show that our Tyche, using merely a single NFE, matches or exceeds the forecast skill and calibration of state-of-the-art multi-step generative baselines and the operational ECMWF IFS ensemble.
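The CRPS used for calibration supervision has a standard empirical ensemble estimator, CRPS = E|X - y| - (1/2)E|X - X'|, computable directly from the ensemble members (a generic formula for illustration; the paper's exact training loss may differ):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of an ensemble forecast against a scalar observation:
    CRPS = E|X - y| - 0.5 * E|X - X'|, with both expectations over members."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

# A sharp ensemble centered on the truth scores better than a broad one:
sharp = crps_ensemble([0.9, 1.0, 1.1], obs=1.0)
broad = crps_ensemble([0.0, 1.0, 2.0], obs=1.0)
```

The second term rewards ensemble spread, so minimizing CRPS trades off accuracy against calibration rather than collapsing members onto a single deterministic forecast.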
[LG-132] LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs
链接: https://arxiv.org/abs/2605.06915
作者: Chacha Chen,Matthew Jörke,Adam Goliński,Masha Fedzechkina,Guillermo Sapiro,Sinead Williamson,Nicholas Foti
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new evidence arrives. We introduce the novel technique of studying LLMs as information processing rules and utilize the information processing gap to study the internal (in)consistencies of how LLMs update their probabilistic beliefs from evidence. Our extensive experiments evaluate multiple approaches in which LLMs can incorporate evidence into their beliefs. Some of these approaches produce (nearly) Bayesian updates; others seem to use a learned heuristic. Surprisingly, the non-Bayesian heuristic updates often outperform exact Bayesian computation in terms of downstream task performance – indicating the LLMs’ probabilistic models of the world are misspecified. Lastly, we show how our measure can provide diagnostics to identify issues with LLM-powered inferential systems.
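The contrast between an exact Bayesian update and a heuristic one can be made concrete for a binary belief updated by a likelihood ratio (the damping value below is a hypothetical illustration; the heuristics the paper detects are learned by the model, not hand-specified):

```python
def bayes_update(prior, lr):
    """Exact Bayesian update of a binary belief: posterior odds = prior odds * likelihood ratio."""
    odds = prior / (1.0 - prior) * lr
    return odds / (1.0 + odds)

def heuristic_update(prior, lr, damping=0.5):
    """A conservative heuristic that moves only part way toward the Bayesian
    posterior, one example of the non-Bayesian update rules the paper probes for."""
    return prior + damping * (bayes_update(prior, lr) - prior)

# Evidence with likelihood ratio 4 on a 50/50 prior:
p_bayes = bayes_update(0.5, lr=4.0)     # exact posterior
p_heur = heuristic_update(0.5, lr=4.0)  # damped, under-updated belief
```

Comparing an LLM's stated posterior against the exact value for controlled evidence is one way to measure the information processing gap the abstract describes.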
[LG-133] Dual-Scale Temporal Fusion Reveals Structured Predictability in Subseasonal-to-Seasonal Temperature Prediction
链接: https://arxiv.org/abs/2605.06911
作者: Elnaz Bashir,Jiali Wang,Lin Yan
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures
Abstract:Subseasonal-to-seasonal (S2S) temperature forecasts, spanning several weeks to a few months, are critically needed in agriculture practice, energy planning, and extreme-weather induced risk management, yet their reliability varies substantially across seasons and regions. Forecast skill is often attributed primarily to lead time, but this perspective does not fully explain the spatiotemporal patterns of predictability. Here we show that S2S predictability is organized across interacting temporal components, spatial heterogeneity, and large-scale pattern coherence, and that this structure can be explicitly characterized and exploited. We develop a dual-scale learning framework that separates calendar-aligned historical climate context from lead-time matched recent weather evolution, combining them through spatially adaptive fusion to enable stable temperature forecasts across the 30 to 90-day window. The learned fusion weights reveal that the balance between these two temporal scales shifts systematically with season and geography: during winter, interannual context dominates over high latitudes and complex terrain where forecast is the most difficult, while summer predictions reflect a more balanced temporal contribution across the domain. This spatially explicit reorganization of predictability, rather than simple lead-time decay, emerges as the primary determinant of forecast skill within the subseasonal window. Topology-aware structural constraints further improve spatial coherence of predicted temperature fields, stabilizing large-scale pattern organization particularly over complex terrain. These results reframe S2S predictability as a structured, multi-scale phenomenon, providing a more interpretable foundation for improving forecast systems and informing their use in practice.
[LG-134] TraXion: Rethinking Pre-training Frameworks for Mobility and Beyond
链接: https://arxiv.org/abs/2605.06906
作者: Shang-Ling Hsu,Mark Tenzer,Cyrus Shahabi,Khurram Shafique
类目: Machine Learning (cs.LG)
*备注: 31 pages, 2 figures
Abstract:Human mobility differs from text and from generic time series in three structural ways: visits are tuple-valued events whose meaning depends on the joint distribution over location, time, and activity; users carry persistent signatures across trajectories; and visits are not independent across users, since co-location at shared places is a primary signal. Existing pre-training recipes for mobility import objectives from language modeling, treating trajectories as sentences and visits as tokens, an analogy that fails against each of the three properties above. These properties define a broader class, multi-entity spatiotemporal event streams (MESES), spanning enterprise authentication logs, electronic health records, and other event-stream domains where entities share infrastructure, schedules, or contexts. We make the properties precise as three axioms that any pre-training framework for MESES should satisfy, and introduce TraXion, whose objectives and architecture are jointly designed to meet them. A single TraXion checkpoint per dataset beats task-specific baselines on every task across six public mobility datasets covering anomaly detection, next-POI recommendation, next-visit prediction, and social-link prediction. The same recipe, applied unchanged to enterprise authentication logs and ICU mortality prediction, matches or exceeds prior work on both, showing that event streams from domains as different as mobility, security, and healthcare can be modeled under a single framework.
[LG-135] Conservative Flows: A New Paradigm of Generative Models
链接: https://arxiv.org/abs/2605.06905
作者: Eshed Gal,Md Shahriar Rahim Siddiqui,Moshe Eliasof,Eldad Haber
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern generative modeling is dominated by transport from a noise prior to data. We propose an alternative paradigm in which generation is performed by a discrete stochastic dynamics that leaves the data distribution invariant, initialized from data-supported states rather than from noise. The framework can utilize any pretrained flow model. We develop two probability-preserving sampling mechanisms, a corrected Langevin dynamics with a Metropolis adjustment and a predictor-corrector flow, that operate directly on existing checkpoints. We validate the framework on a synthetic Swiss-roll target, ImageNet-256 and Oxford Flowers-102, where our samplers consistently improve over the original generation procedures.
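The corrected Langevin dynamics with a Metropolis adjustment is, in generic form, the MALA sampler: each Langevin proposal is accepted or rejected so that the target distribution stays exactly invariant. A minimal one-dimensional sketch targeting a standard normal (illustrative only; the paper applies the idea on top of pretrained flow models in high dimension):

```python
import math, random

def mala_step(x, logp, grad_logp, eps, rng):
    """One Metropolis-adjusted Langevin step targeting the density exp(logp):
    propose a Langevin move, then accept/reject to preserve the target exactly."""
    def log_q(to, frm):  # log proposal density up to a constant (variances cancel)
        mu = frm + 0.5 * eps * grad_logp(frm)
        return -((to - mu) ** 2) / (2.0 * eps)
    prop = x + 0.5 * eps * grad_logp(x) + math.sqrt(eps) * rng.gauss(0.0, 1.0)
    log_alpha = logp(prop) - logp(x) + log_q(x, prop) - log_q(prop, x)
    return prop if math.log(rng.random()) < log_alpha else x

# Target: standard normal, logp(x) = -x^2/2, grad_logp(x) = -x.
rng = random.Random(0)
x, samples = 3.0, []
for i in range(5000):
    x = mala_step(x, lambda z: -0.5 * z * z, lambda z: -z, eps=0.5, rng=rng)
    if i >= 1000:                  # discard burn-in from the off-target start
        samples.append(x)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Because the chain leaves the target invariant, it can be initialized from data-supported states rather than noise, which is the paradigm shift the abstract proposes.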
[LG-136] Streaming Adversarial Robustness in Fuzzy ARTMAP: Mechanism-Aligned Evaluation Progressive Training and Interpretable Diagnostics
链接: https://arxiv.org/abs/2605.06902
作者: Shane Cairns,Leonardo Enzo Brito da Silva,Sasha Petrenko,Donald C. Wunsch II,Jian Liu
类目: Machine Learning (cs.LG)
*备注: 35 pages, 3 figures, 11 tables. Preprint submitted to Neural Networks
Abstract:Adversarial robustness has been studied extensively for offline deep networks, but less is known about strict single-pass streaming neural learners. This paper studies adversarial robustness in Fuzzy ARTMAP, an Adaptive Resonance Theory architecture based on category competition, complement coding, match tracking, and replay-free prototype updates. We introduce WB-Softmax, a differentiable white-box attack surrogate aligned with ARTMAP’s category-competition and map-field prediction mechanism, and formalize a streaming evaluation principle requiring robustness to be assessed on the final deployed model. Across four image benchmarks, WB-Softmax achieves 89-100% attack success on vanilla Fuzzy ARTMAP models. We show that defense rankings can reverse across protocols: offline adversarial training may appear strong under transfer attacks yet collapse under adaptive white-box evaluation, whereas progressive two-stage selective training provides the strongest overall replay-free robustness. We further show that ART’s explicit category geometry enables interpretable diagnosis of separation collapse and match-score inversion. These results provide a mechanism-aligned, protocol-aware framework for adversarial robustness in streaming prototype-based learners.
[LG-137] Accelerated Relax-and-Round for Concave Coverage Problems
链接: https://arxiv.org/abs/2605.06900
作者: Matthew Fahrbach,Mehraneh Liaee,Morteza Zadimoghaddam
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 47 pages, 6 figures
Abstract:We present an accelerated relax-and-round algorithm for concave coverage problems, which generalize the classic maximum coverage problem. Building on the relax-and-round framework of Barman et al. [STACS 2021], we propose two significant improvements. First, we replace the linear programming (LP) relaxation step with a projected accelerated gradient method applied to a smooth surrogate objective to achieve a \widetilde{O}(mn\varepsilon^{-1}) running time. Second, we use a specialized rounding scheme for the hypersimplex that combines the Carathéodory decomposition algorithm in Karalias et al. [NeurIPS 2025] with randomized swap rounding of Chekuri et al. [FOCS 2010]. We prove tight approximation ratios for new reward functions, including a 0.827-approximation for the logarithmic reward \varphi(x) = \log(1 + x). Finally, we conduct maximum multi-coverage experiments on synthetic and real-world graphs, demonstrating that our algorithm outperforms approaches that use state-of-the-art LP solvers.
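The concave coverage objective itself is easy to state: each element contributes \varphi(number of chosen sets covering it), and since \varphi is concave the objective is monotone submodular. A plain greedy baseline for this objective (an illustrative baseline only, not the paper's relax-and-round algorithm; the instance below is hypothetical):

```python
import math

def greedy_concave_coverage(sets, universe, k, phi=lambda x: math.log1p(x)):
    """Greedy selection of k sets maximizing sum_e phi(#chosen sets covering e),
    here with the logarithmic reward phi(x) = log(1 + x)."""
    counts = {e: 0 for e in universe}
    chosen = []
    for _ in range(k):
        best, best_gain = None, -1.0
        for i, s in enumerate(sets):
            if i in chosen:
                continue
            gain = sum(phi(counts[e] + 1) - phi(counts[e]) for e in s)
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
        for e in sets[best]:
            counts[e] += 1
    value = sum(phi(c) for c in counts.values())
    return chosen, value

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
chosen, value = greedy_concave_coverage(sets, universe=range(1, 7), k=2)
```

Greedy already enjoys the standard submodular guarantee; the paper's contribution is a faster relax-and-round scheme with tight, reward-specific ratios such as 0.827 for the logarithmic reward.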
[LG-138] McNdroid: A Longitudinal Multimodal Benchmark for Robust Drift Detection in Android Malware
链接: https://arxiv.org/abs/2605.06894
作者: Md Mahmuduzzaman Kamol,Jesus Lopez,Saeefa Rubaiyet Nowmi,Emilia Rivas,Md Ahsanul Haque,Edward Raff,Aritran Piplai,Mohammad Saidur Rahman
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 28 pages, 14 figures, 14 tables
Abstract:Machine learning (ML) in real-world systems must contend with concept drift, adversarial actors, and a spectrum of potential features with varying costs and benefits. Malware naturally exhibits all of these complexities, but for the same reason, it is challenging to curate and organize data to study these factors. We present McNdroid, to our knowledge the largest longitudinal multimodal Android malware benchmark for malware detection and drift analysis. McNdroid spans 2013–2025, excluding 2015, and represents each application with three aligned modalities–static features from manifests and smali code, dynamic behavioral features from sandbox execution, and graph-based features from function-call graphs. Using temporally separated splits, we evaluate standard ML and deep-learning detectors across increasing train–test time gaps. Results show clear temporal degradation, while multimodal fusion outperforms the best single modality across long-term temporal gaps. Cross-modal agreement also declines over time, suggesting that drift affects both individual feature spaces and the consistency among modalities. We further analyze modality-specific drift, malware-family evolution, and temporal changes in model explanations. We publicly release McNdroid, benchmark splits, and code to support reproducible research on temporal generalization and robust multimodal learning in security-critical, non-stationary settings.
[LG-139] Better Protein Function Prediction by Modeling Survivorship Bias
链接: https://arxiv.org/abs/2605.06879
作者: Zhongmou Chao,Poompol Buathong,Ekaterina Selivanovitch,Susan Daniel,Peter I. Frazier
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 29 pages, 12 figures, 3 tables
Abstract:Protein sequence data from nature exhibits survivorship bias: we only observe data from those organisms that survive and reproduce, while non-functional protein mutations are eliminated by natural selection. Thus, predicting whether a protein sequence is functional often requires learning from positive examples alone. While positive-unlabeled (PU) learning frameworks offer a generic solution to this problem, existing PU methods ignore the evolutionary processes that shape sequence observability and cause survivorship bias. Consider a sequence that is one mutation away from a commonly-observed protein variant in a well-surveilled organism. If the sequence were functional, it would likely be observed. If it is not observed, this suggests non-functionality. In contrast, sequences that are unlikely to arise through mutation may be missing simply because they never arose. Thus, these two kinds of missing sequences should be treated differently when training models. In this work, we propose Evo-PU, a PU learning framework that uses a scientific understanding of nucleotide mutation to model survivorship bias for well-surveilled single-organism sequence data. On three prediction tasks using single-organism uniform-coverage surveillance data – predicting results from held-out influenza and respiratory syncytial virus (RSV) mutagenesis studies, and predicting future SARS-CoV-2 variants – Evo-PU outperforms standard PU learning, one-class classification (OCC), and protein language models (PLMs). On prediction tasks from multi-organism ProteinGym datasets with more heterogeneous surveillance coverage, we identify opportunities to generalize our approach.
[LG-140] Temporal Attention for Adaptive Control of Euler-Lagrange Systems with Unobservable Memory
链接: https://arxiv.org/abs/2605.06877
作者: Giansalvo Cirrincione,Adriano Fagiolini
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adaptive control of Euler-Lagrange systems is challenging when friction is governed by a finite-horizon internal state that is not directly observable from joint measurements. In this setting, the measured closed-loop state is no longer Markovian, and standard certainty-equivalence adaptive laws may lose their convergence guarantees. The paper proposes a meta-control architecture in which the gains of a computed-torque controller are generated by a self-attention block processing a short window of recent motion history. The number of attention heads is selected before policy training through a surrogate analysis of the autocovariance of the memory-state gradient along the temporal window. This surrogate is based on a temporal adaptation of an incremental rank-tracking framework previously developed by the authors. The selected head count is then fixed and used as an architectural hyperparameter in a reinforcement-learning stage, where the policy is trained under a shielded admissibility constraint. The approach is tested on a 2-DOF manipulator with nonlinear friction and variable payload. In the short and matched memory regimes, the single-layer attention-only meta-controller outperforms a deeper Transformer baseline, with tracking-error reductions of 12 and 19 percentage points, respectively. The reported effect sizes are large, with d approximately -1.1 and -2.1, and Mann-Whitney p < 0.05 in both cases. In the long memory regime, however, the advantage disappears. Four out of ten training runs show either divergence or payload-invariant policy collapse, revealing a weakness in the static Phase-1 head-count prescription. This motivates moving rank-tracking inside the reinforcement-learning loop, allowing attention heads to be pruned or grown at runtime instead of fixed before training.
[LG-141] On the Divergence of Differential Temporal Difference Learning without Local Clocks
链接: https://arxiv.org/abs/2605.06874
作者: David Antrobius,Shangtong Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning rate is a critical component of reinforcement learning (RL). This work uses global and local clocks to distinguish two types of learning rates. The former is of the standard form \alpha_t that depends only on the time step t (i.e., a global clock). The latter is of the form \alpha_\nu(S_t, t) , where \nu(s, t) counts the number of visits to state s until time t (i.e., a local clock). In discounted RL, an RL algorithm that is convergent with a local clock is always also convergent with a global clock, and vice versa. We are not aware of any counterexample. The key contribution of this work is to show that this nice correspondence breaks down in average-reward RL. Specifically, we construct a counterexample showing that although differential temporal difference learning is convergent with a local clock, it can diverge with a global clock. This counterexample closes the open problem in Wan et al. [2021], Blaser et al. [2026].
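The two clock types can be made concrete with tabular TD(0). This is a hedged sketch, not the paper's counterexample: it uses a small discounted chain, where (as the abstract notes) both clocks converge; the divergence result concerns average-reward differential TD. The chain and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma = 3, 0.5
P = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])
r = np.array([1.0, 0.0, 2.0])
v_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)  # true discounted values

def td0(use_local_clock, steps=100_000):
    v = np.zeros(n_states)
    visits = np.zeros(n_states, dtype=int)   # nu(s, t): visits to s up to time t
    s = 0
    for t in range(steps):
        s_next = rng.choice(n_states, p=P[s])
        if use_local_clock:
            alpha = 1.0 / (visits[s] + 1)    # local clock: alpha_{nu(s, t)}
        else:
            alpha = 1.0 / (t + 1)            # global clock: alpha_t
        v[s] += alpha * (r[s] + gamma * v[s_next] - v[s])
        visits[s] += 1
        s = s_next
    return v

v_local = td0(use_local_clock=True)
v_global = td0(use_local_clock=False)
```

The only difference between the two runs is which counter indexes the step-size sequence; in the discounted toy both estimates approach `v_true`, whereas the paper shows this equivalence fails for differential TD in the average-reward setting.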
[LG-142] Continuous First Discrete Later: VQ-VAEs Without Dimensional Collapse
链接: https://arxiv.org/abs/2605.06870
作者: Xinyu Zhao,Nikita Karagodin,Hamed Hassani,Sinan Hersek,Paul Pu Liang,Yury Polyanskiy
类目: Machine Learning (cs.LG)
*备注:
Abstract:While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimension collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a “warm-up phase” that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE Warm-Up successfully restores representation dimension, leading to lower reconstruction and perceptual loss at the same training budget. Across codebook sizes K \in \{2^{10}, 2^{14}, 2^{16}\}, AE warm-up raises VQGAN codebook effective dimension from 3-5 to 17-19 and reduces rFID by 17-35%; on WavTokenizer at K \in \{2^{13}, 2^{14}\}, it raises codebook dimension from 4 to 17-19 and improves PESQ by 11-14%. We empirically characterize how warm-up duration governs the achievable final loss. In agreement with experiment, our theoretical analysis predicts downstream performance as a function of warm-up length, enabling an adaptive criterion for switching from AE Warm-up to VQ-VAE training.
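The two-phase schedule can be sketched in a toy linear model. This is a hedged illustration, far simpler than the paper's VQGAN/WavTokenizer setups: the codebook here is frozen at random values, the model is linear, and the `quantize`/`train_step` names are made up for the sketch. Phase 1 trains a plain autoencoder; phase 2 switches on nearest-neighbor vector quantization with a straight-through estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
# Anisotropic data: variance concentrated in a few directions.
X = rng.normal(size=(512, 8)) * np.array([3, 2, 1, .5, .4, .3, .2, .1])

d_latent, K = 4, 16
W_enc = rng.normal(scale=0.1, size=(8, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, 8))
codebook = rng.normal(size=(K, d_latent))        # frozen, for simplicity

def quantize(z):
    # Nearest codebook entry per row (squared Euclidean distance).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d2.argmin(axis=1)]

def train_step(lr, warmup):
    global W_enc, W_dec
    z = X @ W_enc
    # Straight-through: forward uses quantize(z), gradients flow through z.
    z_q = z if warmup else z + (quantize(z) - z)
    err = z_q @ W_dec - X
    g_dec = z_q.T @ err / len(X)                 # MSE gradients, up to a constant
    g_enc = X.T @ (err @ W_dec.T) / len(X)
    W_enc -= lr * g_enc
    W_dec -= lr * g_dec
    return float((err ** 2).mean())

# Phase 1 (steps 0-299): AE warm-up. Phase 2 (steps 300-599): VQ on.
losses = [train_step(lr=0.02, warmup=step < 300) for step in range(600)]
```

The point of the schedule is that by the end of the warm-up phase the encoder has already spread the representation across the available latent directions, before quantization starts suppressing low-variance ones.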
[LG-143] When Descent Is Too Stable: Event-Triggered Hamiltonian Learning to Optimize
链接: https://arxiv.org/abs/2605.06868
作者: Yi Wang,Chandrajit Bajaj
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Fixed-budget nonconvex optimization can fail not because local descent is unstable, but because it is too stable: after reaching a nearby stationary point, an optimizer may spend the remaining evaluations refining an uninformative local minimum. We formulate this failure mode as a control problem over optimizer dynamics, where the learner must decide when to descend, when to exploit a promising basin, and when stagnation should trigger movement elsewhere. We introduce SHAPE, a structured adaptive port-Hamiltonian task-family optimizer for event-triggered minima hunting under local information. Starting from gradient-descent dynamics, SHAPE lifts optimization to an augmented phase space (q, p) , where the primal state q represents the candidate solution, the cotangent variable p carries directional sensitivity, and a controller u provides processed information from the current gradient oracle. Within each stage, a learned Hamiltonian vector field induces structured local descent; across stages, a fixed event clock in the implementation updates ports and memory when local equilibria are detected, with stage-dependent horizons treated in the analysis as a direct generalization. This design preserves a passivity-compatible structure while allowing the same trained policy to use clean, stochastic, or estimated gradient inputs. Experiments on fixed-budget nonconvex optimization tasks show that SHAPE improves best-so-far performance compared with fixed-policy optimizers. These results suggest that adaptive Hamiltonian energy shaping provides a principled mechanism for balancing descent, exploration, and budget allocation in difficult optimization landscapes.
[LG-144] A Finite-Iteration Theory for Asynchronous Categorical Distributional Temporal-Difference Learning
链接: https://arxiv.org/abs/2605.06866
作者: Ege C. Kaya,Abolfazl Hashemi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 53 pages
Abstract:Recent non-asymptotic analyses have substantially advanced the theory of distributional policy evaluation, but they largely concern synchronous full-state updates under a generative model, model-based estimators, accelerated variants, or different approximation architectures. Standard categorical temporal-difference learning is typically used in a different regime. It asynchronously performs a single-state update at each iteration and, in online settings, is driven by a Markovian trajectory. This leaves an important gap between existing finite-iteration theory and the categorical recursions most closely aligned with practical distributional temporal-difference implementations. We bridge this gap for two categorical policy-evaluation methods: scalar categorical temporal-difference learning in the Cramér geometry and multivariate signed-categorical temporal-difference learning in the maximum mean discrepancy geometry. After suitable isometric embeddings, both algorithms take the form of asynchronous single-state stochastic-approximation recursions that contract in a statewise supremum norm. This permits finite-iteration guarantees in discounted problems under both i.i.d. and Markovian state sampling, and in undiscounted fixed-horizon problems under i.i.d. episodic sampling.
[LG-145] Dataset Watermarking for Closed LLMs with Provable Detection
链接: https://arxiv.org/abs/2605.06865
作者: Pengrun Huang,Kamalika Chaudhuri,Yu-Xiang Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) are pre-trained and post-trained on vast amounts of loosely curated data, raising the possibility that these models may have been trained on proprietary datasets or the same benchmarks used for evaluation. This motivates the need for dataset watermarking: designing datasets such that training on them leaves detectable signatures in the resulting model. Prior work has explored this problem for open models. We introduce the first dataset watermarking method for closed LLMs with provable detection. In particular, we embed a dataset-level watermark signal by increasing the co-occurrence frequency of randomly selected word pairs through rephrasing, and detect it using a statistical test on co-occurrence patterns in model-generated outputs. We evaluate our method with multiple base models and benchmark datasets and show that it reliably detects the watermark (p < 0.01) in the fine-tuning stage. Notably, our method remains effective in a data mixture setting where the watermarked dataset constitutes only approximately 1% of the total fine-tuning tokens. Furthermore, we show that our method preserves the utility and semantic integrity of the benchmark.
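The co-occurrence idea can be sketched with a bag-of-words simulation. This is a hedged toy, not the paper's LLM pipeline: a secret word pair is forced into some fraction of "watermarked" sentences, and detection is a one-sided binomial test on how many sentences contain both words, against the null co-occurrence rate. The vocabulary, sentence length, and rates are all made up for illustration.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
vocab = [f"w{i}" for i in range(50)]
pair = ("w3", "w17")                       # secret watermark word pair

def sample_sentences(n, p_joint):
    # Each sentence: 8 distinct random words; with prob p_joint force the pair in.
    out = []
    for _ in range(n):
        words = list(rng.choice(vocab, size=8, replace=False))
        if rng.random() < p_joint:
            words[:2] = pair
        out.append(words)
    return out

def binom_sf(k, n, p):
    # P[X >= k] for X ~ Binomial(n, p), exact sum.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def detect(sentences, p0):
    hits = sum(pair[0] in s and pair[1] in s for s in sentences)
    return binom_sf(hits, len(sentences), p0)   # one-sided p-value

# Null rate: chance both pair words land in one 8-word sentence from 50 words.
p0 = (8 / 50) * (7 / 49)
p_clean = detect(sample_sentences(400, p_joint=0.0), p0)
p_marked = detect(sample_sentences(400, p_joint=0.25), p0)
```

In the watermarked sample the pair co-occurs far more often than the null rate predicts, driving the p-value toward zero, while clean text yields an unremarkable p-value.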
[LG-146] Multi-Objective Multi-Agent Bandits: From Learning Efficiency to Fairness Optimization
链接: https://arxiv.org/abs/2605.06864
作者: John Wang,Mengfan Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study multi-objective multi-agent multi-armed bandits (MO-MA-MAB) under stochastic rewards, where agents observe heterogeneous reward vectors and communicate over time-varying graphs. We formulate this emerging problem setting to address efficient learning, measured by Pareto regret, and incorporate fair learning as an additional goal, captured via social welfare. To measure efficiency, we formulate Pareto regret and develop Pareto UCB1 Gossip, whose novel exploration radius explicitly separates statistical uncertainty in Pareto-based inference from consensus error. To express the fairness constraint, we formulate a Nash Social Welfare objective over preference-scalarized rewards and propose Simulated NSW UCB Gossip, which integrates preference-based reward simulation, gossip-based utility estimation, and UCB-style exploration. We prove that Pareto UCB1 Gossip achieves \mathcal{O}(\log T) regret and an instance-independent rate of \mathcal{O}(\sqrt{T}), while Simulated NSW UCB Gossip achieves an instance-independent regret bound of \mathcal{O}(T^{3/4}). This separation reveals the cost of imposing the fairness constraint on our efficiency objective: fairness limits information aggregation and slows convergence. Experiments show that our methods consistently outperform baselines, improving performance by approximately 100% and 50% in the efficiency and fairness settings, respectively.
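Pareto regret is defined relative to the set of Pareto-optimal arms, which the following sketch computes. This is only the dominance test underlying the formulation (an assumption-free building block), not the gossip-based algorithm itself.

```python
import numpy as np

def pareto_front(means):
    # means: (n_arms, n_objectives). Arm i is Pareto-optimal if no arm j is
    # at least as good in every objective and strictly better in at least one.
    n = len(means)
    optimal = []
    for i in range(n):
        dominated = any(
            np.all(means[j] >= means[i]) and np.any(means[j] > means[i])
            for j in range(n) if j != i
        )
        if not dominated:
            optimal.append(i)
    return optimal

means = np.array([[0.9, 0.1],
                  [0.1, 0.9],
                  [0.5, 0.5],
                  [0.4, 0.4]])   # arm 3 is dominated by arm 2
front = pareto_front(means)
```

A Pareto-UCB-style rule would apply this test to optimistic (upper-confidence) estimates of the mean vectors rather than the true means, and pull any arm in the resulting front.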
[LG-147] Christoffel-DPS: Optimal sensor placement in diffusion posterior sampling for arbitrary distributions
链接: https://arxiv.org/abs/2605.06861
作者: James Rowbottom,Nick Huang,Carola-Bibiane Schönlieb,Ben Adcock
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:State estimation is a critical task in scientific, engineering and control applications. Since the reliability of reconstructions depends on the number and position of sensors, optimal sensor placement (OSP) is essential in scenarios where measurements are sparse and expensive. Classical OSP approaches rely on Gaussian assumptions and are consequently unable to account for the complex distributions encountered in many real-world systems. Generative-model-based reconstruction using sensor guided diffusion posterior sampling (DPS) has emerged as a promising technique for reconstructing states from highly complex distributions. However, existing sensor-selection methods either require unrealistically many sensors or emulate classical OSP, creating a mismatch between modern recovery models and classical OSP tools, motivating the need for fundamentally new ideas towards OSP that match the recent advances made in powerful recovery models. We introduce a distribution-free sensor placement framework based on the Christoffel function: a mathematical formulation of optimal sampling and recovery guarantees for posterior sampling with arbitrary sensors and signal distributions, from which we derive a new OSP strategy with non-asymptotic bounds on the number of sensors needed for recovery. We develop Christoffel-DPS, with offline and online variants, instantiating Christoffel sampling for generative models. Christoffel-DPS outperforms Gaussian OSP baselines and existing generative-model placement methods, validating that distribution-free sensing is both theoretically principled and practically superior. The framework is model-agnostic; we demonstrate its application to a range of unconditional DPS and flow-matching models on structurally non-Gaussian benchmarks, showing the efficacy of Christoffel-DPS in low sensor budget regimes.
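Christoffel-function sampling has a simple finite-dimensional form that can be sketched directly. This is a hedged polynomial toy, not the paper's diffusion-posterior setting: with features phi(x) and empirical Gram matrix G, the quantity k(x) = phi(x)^T G^{-1} phi(x) (the inverted Christoffel function, i.e. a leverage score) defines the sampling density over candidate sensor sites.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(-1, 1, 200)               # candidate sensor locations
Phi = np.vander(grid, 6, increasing=True)    # degree-5 polynomial features

G = Phi.T @ Phi / len(grid)                  # empirical Gram matrix
# k(x_i) = phi(x_i)^T G^{-1} phi(x_i) for every candidate site.
christoffel = np.einsum("ij,jk,ik->i", Phi, np.linalg.inv(G), Phi)

probs = christoffel / christoffel.sum()      # normalized sampling density
sensors = rng.choice(len(grid), size=20, replace=False, p=probs)
```

A useful sanity check is that the scores sum to (number of sites) x (feature dimension), a trace identity; and for polynomial features the density concentrates near the interval endpoints, which matches the classical intuition that boundary measurements stabilize polynomial recovery.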
[LG-148] Attribution-Based Neuron Utility for Plasticity Restoration in Deep Networks
链接: https://arxiv.org/abs/2605.06834
作者: Patrick Elisii,Lucas Beauchemin,Dawer Jamshed
类目: Machine Learning (cs.LG)
*备注:
Abstract:Continual learning research attempts to conserve two fundamental capabilities: new knowledge acquisition and the preservation of previously acquired knowledge. While knowledge in this case can be measured through performance over an implicit or explicit task space, model plasticity generally concerns adaptability as data distributions evolve. Though much of the literature has focused on catastrophic forgetting, deep networks can also suffer from loss of plasticity, becoming progressively harder to update under continued training. Recent research has identified multiple mechanisms underlying this phenomenon, including neuron saturation, parameter norm growth, and loss of useful curvature directions. Adaptive reset-based interventions, which selectively reinitialize low-utility network parameters, have emerged as practical solutions to restore trainability. Existing utility measures used to guide resets, such as activation magnitude, contribution utility, or gradient-based activity, rely on proxy signals that can become misaligned with the intervention they are meant to guide. In this paper, we introduce gradient times difference from reference (GXD), a theoretically motivated utility measure based on reference-based gradient attribution that estimates the first-order functional cost of replacing a unit. Our results show that utility measures aligned with the functional cost of the reset can make interventions more reliable in settings where existing reset criteria degrade. GXD reframes adaptive resetting as an intervention cost estimation problem, providing a practical path toward more robust continual learning systems.
[LG-149] SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents
链接: https://arxiv.org/abs/2605.06822
作者: Xiwen Chen,Wenhui Zhu,Songzhu Zheng,Kashif Rasul,Yueyue Deng,Huayu Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) are increasingly deployed for autonomous financial trading, a domain requiring continuous adaptation to noisy, non-stationary markets. Existing self-improving agents typically address this through unbounded free-form prompt optimization. However, in low signal-to-noise environments with delayed scalar rewards (P&L), this unstructured approach exacerbates the fundamental credit assignment problem: optimizers cannot reliably distinguish systematic logic flaws from stochastic market variance, inevitably leading to policy drift. To overcome this bottleneck, we introduce the Self-Evolving Human-Auditable Rubric Policy (SHARP), a neuro-symbolic framework that replaces unconstrained text mutation with structured, symbolic policy optimization. SHARP confines the agent’s reasoning to a bounded, human-readable rubric of explicit condition-action rules. When sub-optimal trades occur, an attribution agent employs cross-sample reasoning across multiple samples to isolate specific rule failures. This enables targeted, atomic policy edits that are subsequently regularized through strict walk-forward validation. Evaluated across three diverse equity sectors and four LLM backbones, SHARP consistently transforms generic initial heuristics into highly robust strategies, lifting the empirical performance of compact models by 10 to 20 percentage points on average (e.g., GPT-4o-mini). Ultimately, SHARP demonstrates that LLMs can achieve dynamic and efficient adaptation while significantly enhancing the structural transparency and auditability demanded by institutional finance.
[LG-150] A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning
链接: https://arxiv.org/abs/2605.06819
作者: Ilan Doron-Arad,Idan Mehalel,Elchanan Mossel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Autoregressive generation lies at the heart of the mechanism of large language models. It can be viewed as the repeated application of a next-token generator: starting from an input string (prompt), the generator is applied for M steps, and the last generated token is taken as the final output. [Joshi et al., 2025] proposed a PAC model for studying the learnability of the input-output maps arising from this process. We develop an online analogue of this framework, focusing on the mistake bound of learning the final output induced by an unknown next-token generator. We distinguish between two forms of feedback. In the End-to-End model, after each round the learner observes only the final token produced after M autoregressive steps. In the Chain-of-Thought model, the learner is additionally shown the entire M-step trajectory. Our goal is to understand how the optimal mistake bound depends on the generation horizon M, and to what extent observing intermediate tokens can reduce this dependence. Our main results show that the online theory of autoregressive learning exhibits a qualitative picture analogous to the statistical one found by [Hanneke et al., 2026], but with a different scale of dependence on the generation horizon. In the End-to-End model, we prove a taxonomy of possible mistake-bound growth rates in the generation horizon M: essentially any rate between constant and logarithmic can arise. We further show that this logarithmic ceiling is unavoidable. In the Chain-of-Thought model, we show that access to the full generated trajectory eliminates the dependence on M altogether. We also analyze autoregressive linear threshold classes, and prove optimal mistake bounds, as well as a new lower bound for the statistical setting. Along the way, our results resolve several questions left open by [Joshi et al., 2025].
[LG-151] From Model to Data (M2D): Shifting Complexity from GNNs to Graphs for Transparent Graph Learning
链接: https://arxiv.org/abs/2605.06814
作者: Debolina Halder Lina,Arlei Silva
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph Neural Networks (GNNs) achieve high performance but can be opaque to humans, making it difficult to understand and compare the many proposed architectures. While existing explainability methods attribute individual predictions to nodes, edges, or features, they do not provide architectural transparency or explain the fundamental performance gap between simple and more complex models. To address this limitation, we introduce Model-to-Data (M2D) distillation, a new framework that increases transparency by transferring model complexity into the data space. M2D distills the teacher model into an augmented graph with enriched features and structure, enabling a simple student to match the teacher’s performance. By materializing model behavior in the data, our approach allows humans to inspect architectural advantages directly. We show that M2D reveals underlying mechanisms such as fairness objectives and attention-based aggregation in an interpretable way, enhancing GNN transparency while preserving performance.
[LG-152] MIND: Monge Inception Distance for Generative Models Evaluation
链接: https://arxiv.org/abs/2605.06797
作者: Quentin Berthet,Yu-Han Wu,Clement Crepy,Romuald Elie,Klaus Greff,Michael Eli Sander
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose the Monge Inception Distance (MIND), a metric for evaluating generative models that addresses key limitations of the widely adopted Fréchet Inception Distance (FID). The MIND metric leverages the sliced Wasserstein distance to compare distributions by averaging one-dimensional optimal transport distances, efficiently computed via sorting. This approach circumvents the estimation of high-dimensional means and covariance matrices, which underlie FID’s poor sample complexity and vulnerability to adversarial attacks. We empirically demonstrate three primary advantages: (i) it is more sample-efficient by one order of magnitude, (ii) it is faster to compute by two orders of magnitude, (iii) it is more robust to adversarial attacks such as moment-matching. We show that MIND with 5k samples can replace the evaluation performance of FID with 50k samples, providing high correlation with this standard benchmark and superior discriminative performance. We further demonstrate that even smaller sample sizes (e.g., 1k or 2k) remain highly informative for rapid model iteration.
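The sorting-based sliced distance at the core of MIND can be sketched in a few lines. This is a hedged illustration on raw Gaussian features rather than Inception embeddings, and it assumes equal sample sizes so that 1-D optimal transport reduces to matching sorted values.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=128, seed=0):
    # Average squared 1-D Wasserstein-2 distance over random directions.
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
    px, py = X @ theta.T, Y @ theta.T                      # 1-D projections
    px.sort(axis=0)
    py.sort(axis=0)                                        # 1-D OT = sorted matching
    return float(np.sqrt(((px - py) ** 2).mean()))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 16))
Y_close = rng.normal(size=(1000, 16))           # same distribution as X
Y_far = rng.normal(loc=2.0, size=(1000, 16))    # shifted distribution
d_close = sliced_wasserstein(X, Y_close)
d_far = sliced_wasserstein(X, Y_far)
```

Note that nothing here estimates a mean vector or covariance matrix: the cost is dominated by projections and per-direction sorts, which is what underlies the sample-efficiency and speed advantages claimed over FID.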
[LG-153] Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
链接: https://arxiv.org/abs/2605.06763
作者: Mohsen Dehghankar,Abolfazl Asudeh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sparse attention improves LLM inference efficiency by selecting a subset of key-value entries, but at the cost of potential accuracy degradation. In particular, omitting critical KV entries can induce substantial errors in model outputs. Existing methods typically operate under fixed or adaptive token budgets and provide empirical robustness or partial theoretical guarantees, yet they do not ensure zero false negatives in decoding steps, particularly since the set of relevant tokens is both query- and step-dependent. Our empirical observations confirm that missing even one critical key can lead to sharp error spikes, especially in long reasoning tasks where the set of important tokens varies throughout decoding. This observation motivates the need for indexing methods that dynamically adapt to these variations across decoding steps while guaranteeing a full recall of the relevant keys above a certain threshold. We address this challenge by reformulating sparse attention as the halfspace range searching problem. However, existing range searching indices are not suitable for modern LLM inference due to their computational and implementation overheads. To overcome this, we introduce Louver, a novel index structure tailored for efficient KV cache retrieval. Louver (i) guarantees zero false negatives with respect to a specified threshold in both theory and practice, (ii) is lightweight to integrate into existing LLM pipelines, and (iii) incorporates hardware-aware optimizations for both CPU and GPU executions. Our experiments demonstrate that Louver outperforms prior sparse attention methods in both accuracy and runtime, and is faster than highly optimized dense attentions such as FlashAttention. These results highlight that recall guarantees are a critical and overlooked dimension of sparse attention, and open a new direction for building theoretically grounded, efficient KV cache indices.
[LG-154] Physics-based Digital Twins for Integrated Thermal Energy Systems Using Active Learning
链接: https://arxiv.org/abs/2605.06756
作者: Umme Mahbuba Nabila,Paul Seurin,Linyu Lin,Majdi I. Radaideh
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 23 pages, 12 figures, and 2 tables
Abstract:Real-time supervisory control of thermal energy distribution systems requires digital twins that are accurate, interpretable, and uncertainty-aware, yet remain data and computationally efficient. High-fidelity simulations alone are costly, while purely data-driven surrogates often lack robustness. To address these challenges, this work proposes an active learning (AL) framework that couples system-level Modelica simulations with four simpler physics-informed and data-driven surrogate modeling approaches: deterministic Sparse Identification of Nonlinear Dynamics with Control (SINDyC), its probabilistic multivariate-Gaussian extension (MvG-SINDyC), feedforward neural network (FNN), and gated recurrent unit (GRU) network. Tailored to each surrogate, model-specific AL query strategies are employed, including Mahalanobis-distance sampling in coefficient space for MvG-SINDyC and error-based sampling in prediction space for SINDyC, FNN, and GRU, allowing the learning process to prioritize dynamically informative trajectories. The proposed approach is demonstrated on the glycol heat exchanger (GHX) subsystem of the Thermal Energy Distribution System (TEDS) at Idaho National Laboratory. Across key GHX outputs, the bypass mass flow rate \dot{m}_{\mathrm{GHX}} and heat transfer rate Q_{\mathrm{GHX}}, the AL framework achieves comparable predictive accuracy using as few as one-fifth of the simulation trajectories required by random sampling. Among the evaluated surrogates, the GRU achieves the highest predictive fidelity, while SINDyC remains the most computationally efficient and interpretable. The probabilistic MvG-SINDyC surrogate further enables uncertainty quantification and exhibits the largest computational gains under AL.
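The SINDyC surrogate family can be sketched with sequentially thresholded least squares on a known 1-D system. This is a hedged toy, far simpler than the GHX model: the true dynamics, candidate library, and threshold are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# True controlled dynamics: x_dot = -0.5*x + 2.0*u, plus small noise.
x = rng.uniform(-1, 1, size=400)
u = rng.uniform(-1, 1, size=400)
x_dot = -0.5 * x + 2.0 * u + 0.01 * rng.normal(size=400)

# Candidate library of state/control features: [1, x, u, x^2, x*u, u^2].
Theta = np.column_stack([np.ones_like(x), x, u, x**2, x * u, u**2])

def stlsq(Theta, y, threshold=0.1, iters=10):
    # Sequentially thresholded least squares: fit, zero small coefficients,
    # refit on the surviving columns, repeat.
    xi = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big] = np.linalg.lstsq(Theta[:, big], y, rcond=None)[0]
    return xi

xi = stlsq(Theta, x_dot)   # should recover the two active terms x and u
```

The recovered coefficient vector is what makes SINDyC-style surrogates interpretable: each nonzero entry names a physical term, which is the property the abstract contrasts with the black-box FNN and GRU surrogates.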
[LG-155] A Closed-Form Upper Bound for Admissible Learning-Rate Steps in Belief-Space Dynamics
链接: https://arxiv.org/abs/2605.06741
作者: Zixi Li,Youzhen Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning-rate steps are usually treated as hyperparameters. This paper isolates a local belief-space calculation: when an update is modeled as a projected forward step on the probability simplex, admissibility means contractivity in the natural KL/Bregman geometry. Under this model, the upper bound of an admissible step is not a tuning slogan but a formula.
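A projected forward step on the simplex in the KL/Bregman geometry is the exponentiated-gradient (mirror-descent) update. The sketch below is illustrative only (it does not reproduce the paper's bound): it runs the update on a linear objective and checks that the KL divergence to the optimum shrinks at every step, i.e. that the chosen step size is admissible in the contractivity sense.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def eg_step(p, grad, eta):
    # Multiplicative (exponentiated-gradient) update, then normalize back
    # onto the simplex; this is the Bregman projection for the KL geometry.
    w = p * np.exp(-eta * grad)
    return w / w.sum()

# Minimize <c, p> over the simplex; the optimum puts all mass on argmin(c).
c = np.array([0.3, 0.1, 0.6])
p = np.full(3, 1 / 3)
target = np.array([1e-9, 1 - 2e-9, 1e-9])   # near-vertex optimum (smoothed)

gaps = []
for _ in range(50):
    p = eg_step(p, c, eta=0.5)
    gaps.append(kl(target, p))
```

With a step size that is too large relative to the geometry, such monotone contraction can fail; the paper's point is that for this class of updates the admissible-step threshold has a closed form rather than being tuned empirically.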
[LG-156] On Training in Imagination
链接: https://arxiv.org/abs/2605.06732
作者: Nadav Timor,Ravid Shwartz-Ziv,Micah Goldblum,Yann LeCun,David Harel
类目: Machine Learning (cs.LG)
*备注:
Abstract:State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and are scored by a learned reward model, but without querying the true environment during policy updates. We study this training paradigm by quantifying how errors in learned dynamics and reward models affect returns and policy optimization. First, we extend the analysis of Asadi et al. (2018) to MDPs with learned reward models, and derive the optimal sample allocation–the ratio of dynamics samples to reward samples that minimizes a bound on return error under power-law scaling assumptions. We identify lower Lipschitz constants of the learned dynamics, reward, and policy as a representation desideratum that tightens this bound, and we connect this perspective to the temporal-straightening objective of Wang et al. (2026). Second, we examine how policy optimization with REINFORCE tolerates noisy rewards, which are often cheaper to obtain. We show that zero-mean reward noise leaves the gradient estimator unbiased and adds at most a variance term that decreases with the number of rollouts. This introduces a practical tradeoff: given a fixed budget, should one buy more rollouts with cheaper but noisier rewards, or fewer rollouts with more expensive but less noisy rewards? We reduce this choice to a one-dimensional optimization problem and characterize the optimum.
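The claim that zero-mean reward noise leaves the REINFORCE estimator unbiased can be checked numerically. This is a hedged toy (a 2-armed softmax bandit invented for the sketch, not the paper's setup): the Monte Carlo gradient with noisy rewards should agree with the exact policy gradient, only with higher variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.3])
true_r = np.array([1.0, 0.0])

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def reinforce_grad(theta, n, noise_std):
    p = policy(theta)
    g = np.zeros_like(theta)
    for _ in range(n):
        a = rng.choice(2, p=p)
        r = true_r[a] + noise_std * rng.normal()  # zero-mean reward noise
        grad_logp = -p.copy()
        grad_logp[a] += 1.0                        # grad of log softmax policy
        g += r * grad_logp
    return g / n

# Exact gradient of J = sum_a pi(a) r(a) for a softmax policy:
# dJ/dtheta_i = pi_i * (r_i - sum_a pi_a r_a).
exact = policy(theta) * (true_r - policy(theta) @ true_r)
g_clean = reinforce_grad(theta, n=100_000, noise_std=0.0)
g_noisy = reinforce_grad(theta, n=100_000, noise_std=1.0)
```

Both estimates converge to the same exact gradient; the noisy one just needs more rollouts for the same precision, which is exactly the budget tradeoff (cheap noisy rewards vs. expensive clean ones) the abstract reduces to a one-dimensional optimization.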
[LG-157] Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics NEURIPS2024
链接: https://arxiv.org/abs/2605.06730
作者: Likhita Yerra(1),Remi Uttejitha Allam(1) ((1) AIVANCITY School of AI and Data)
类目: Machine Learning (cs.LG)
*备注: 18 pages, 3 figures. NeurIPS 2024 manuscript style (preprint)
Abstract:We introduce Semantic State Abstraction Interfaces (SSAI): a methodological template for mapping sparse unstructured text into K auditable, named coordinates with neutral defaults on no-news days, designed to separate representation hypotheses from optimisation variance in sequential decision systems. Our contribution is the framework and its evaluation protocol, not a claim that SSAI outperforms denser alternatives. We instantiate SSAI with K=4 axes (sentiment, risk, confidence, volatility forecast) on a US-equity panel (30 NASDAQ-100 names, FNSPID news, 2019–2023 test), and evaluate it across direct factor portfolios, supervised ridge forecasters, and RL agents (DP-PPO, SAC) that share the same fixed \phi. The four-factor factor portfolio reaches 307.2% cumulative return and Sharpe 1.067, but apparent gains versus buy-and-hold (243.6%) fail coverage-stratified controls, reverse at \geq 0.2% costs, and are statistically fragile versus a sentiment-only baseline; a PC1 composite and a FinBERT portfolio baseline are stronger ranking signals in this setting. Ridge and RL blocks diagnose representation versus optimiser effects. We position SSAI as an interpretability-performance diagnostic and reusable protocol for sparse-text decision systems.
[LG-158] Medical Imaging Classification with Cold-Atom Reservoir Computing using Auto-Encoders and Surrogate-Driven Training
链接: https://arxiv.org/abs/2605.06727
作者: Nuno Batista,Ana Morgado,Oscar Ferraz,Sagar Silva Pratapsi,Jorge Lobo,Gabriel Falcao
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
*备注: 8 pages, 6 figures. Accepted to the 2025 IEEE International Conference on Quantum AI (IEEE QAI). Supported by FCT and the Open Quantum Institute (OQI)
Abstract:We introduce a hybrid quantum-classical pipeline, based on neutral-atom reservoir computing, for medical image classification, focusing on the binary classification task of polyp detection. To deal effectively with the high dimensionality, we integrate a guided auto-encoder. This pipeline learns compact and discriminative representations of image data that are also well-suited for quantum reservoir computing. A key challenge in such systems is the non-differentiable nature of quantum measurements, which creates a ‘gradient barrier’ for standard training. We overcome this barrier by incorporating a differentiable surrogate model that emulates the quantum layer, enabling end-to-end backpropagation through the entire system. This guided training process is jointly optimized for classification accuracy and for faithful image recovery from the auto-encoder. The learned latent representations are encoded as pulse detuning parameters within a Rydberg Hamiltonian, and quantum embeddings are subsequently obtained through expectation values. These embeddings are then passed to a linear classifier. Our simulations show that this method outperforms some traditional approaches that use PCA or unguided autoencoders. We also conduct ablation studies to assess the impact of various quantum and training parameters, demonstrating the robustness and flexibility of our proposed pipeline for real-world medical imaging applications, even in the current NISQ era.
[LG-159] Transformer-Based Wildlife Species Classification from Daily Movement Trajectories
链接: https://arxiv.org/abs/2605.06726
作者: Obed Irakoze,Prasenjit Mitra
类目: Machine Learning (cs.LG)
*备注: 8 pages
Abstract:Inferring the identity of wildlife species from daily movement data alone is a challenging task. We train sequence models on large-scale, 7-species GPS trajectories from the Movebank platform. Trajectory models are evaluated using a protocol in which entire telemetry studies or regions are held out during testing. We compare Transformer-based sequence models to LSTM, CNN, and Temporal Convolutional Networks, and find that Transformers consistently achieve higher balanced accuracy, with gains of approximately 8 to 22 percentage points depending on the species and experimental setting. In an elephant binary classification task with 1-hour resolution, the Transformer achieves a balanced accuracy of 0.83 and an AUC of 0.92, substantially outperforming all baseline models. We examine feature representations under data-limited conditions by analyzing the differences between a basic displacement-based encoding and an expanded range of movement descriptors that include speed, direction, and turning behavior. With feature augmentation, we see clear performance gains, especially for underrepresented and sparsely represented species such as large carnivores, lions, and zebras. Finally, experiments comparing 1-hour and 30-minute temporal resolutions show that while finer sampling can capture short-term movement patterns for some species, a unified 1-hour resolution yields more promising performance across studies by reducing missing data and ensuring consistent temporal coverage.
[LG-160] TUANDROMD-X: Advanced Entropy and Visual Analytics Dataset for Enhanced Malware Detection and Classification
链接: https://arxiv.org/abs/2605.06718
作者: Parthajit Borah,Upasana Sarmah,D.K. Bhattacharyya,J.K. Kalita
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Malware and malware-based attacks are becoming more prevalent and complex. Attackers regularly come up with new techniques that can evade conventional, signature-based malware defenses. To address such threats, there is an increasing demand for advanced and better defense solutions. Machine learning-based techniques are capable of efficiently defending against malware and malware-based attacks. Nevertheless, creating and efficiently testing such techniques demands high-quality datasets containing samples of various malware families as well as goodware. The lack of such datasets continues to be a major bottleneck in malware research. In this paper, we introduce TUANDROMD-X, a multiclass malware dataset with visual and entropy-based features for each sample, distinctly identifying malware from goodware. The dataset is created through static analysis, lowering the overhead that comes with heavy feature engineering and dynamic analysis. As a result, TUANDROMD-X helps researchers and cyber-security experts design faster and better malware detection systems.
[LG-161] Information-theoretic Limits of Learning and Estimation
链接: https://arxiv.org/abs/2605.06710
作者: Abbas El Gamal,Maxim Raginsky
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Information theory plays a central role in establishing fundamental limits on what any learning or estimation algorithm can – and cannot – achieve, regardless of computational power. In this chapter, we provide an introduction to these connections. End-of-chapter exercises make the material suitable for both classroom use and self-study. We begin by introducing concentration inequalities along with the notions of covering and packing in metric spaces, and the associated concept of metric entropy. These tools are essential for our analysis. We then introduce the learning-theoretic framework and derive upper bounds on generalization error in terms of metric entropy, Rademacher complexity, and the VC dimension, as well as mutual information and relative entropy. Finally, we discuss the minimax estimation framework and establish lower bounds on minimax risk using Fano’s inequality, yielding bounds in terms of relative entropy and covering and packing numbers. This manuscript contains a preprint of a chapter under consideration for inclusion in the forthcoming third edition of Cover and Thomas’s Elements of Information Theory, posted with permission from Wiley. It would follow the chapter posted at arXiv:2605.02989. The table of contents of the new edition can be found at: this https URL . For feedback, please contact abbas@ee.this http URL.
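As a concrete instance of the Fano-type lower bounds the abstract mentions, here is a minimal numeric sketch of the standard form of Fano's inequality for M equally likely hypotheses (the function name and numbers are illustrative, not taken from the chapter):

```python
import math

def fano_lower_bound(mutual_info_nats, num_hypotheses):
    # Fano's inequality: for M equally likely hypotheses and any estimator,
    # P(error) >= 1 - (I + log 2) / log M, where I is the mutual information
    # (in nats) between the hypothesis and the observations.
    m = num_hypotheses
    return max(0.0, 1.0 - (mutual_info_nats + math.log(2)) / math.log(m))

# With 1024 hypotheses but only 2 nats of information, every estimator
# errs with probability at least ~0.611, regardless of computation.
bound = fano_lower_bound(2.0, 1024)
print(round(bound, 3))  # → 0.611
```

The same shape of bound, with covering/packing numbers controlling the mutual information term, yields the minimax risk lower bounds described above.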
[LG-162] Robustness of Refugee-Matching Gains to Off-Policy Evaluation Choices
链接: https://arxiv.org/abs/2605.06686
作者: Kirk Bansak,Elisabeth Paulson,Dominik Rothenhäusler,Jeremy Ferwerda,Jens Hainmueller,Michael Hotard
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 13 pages, 2 figures, 10 tables
Abstract:Previous research has investigated the potential of refugee matching for boosting refugee outcomes, first considered by Bansak et al. (2018). This paper demonstrates the stability of counterfactual impact evaluation results in the context of refugee matching in the United States using a range of off-policy evaluation methods. In order to estimate counterfactual impact and test the robustness of our results, we employ several evaluation methods, including inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW). We also consider various modifications, including alternative modeling architectures and different assignment procedures. The impact estimates remain consistent in magnitude in all scenarios as well as statistically significant in most cases. Furthermore, the estimates are also consistent with the results originally presented in Bansak et al. (2018).
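For readers unfamiliar with the estimators named above, the simplest of them, inverse probability weighting, can be sketched in a few lines (generic textbook form with made-up strata and propensities, not the paper's pipeline or data):

```python
def ipw_mean(data):
    # Inverse probability weighting: each unit is (outcome y, treated t in
    # {0, 1}, propensity e = P(treated | covariates)). Estimates E[Y(1)],
    # the mean outcome had everyone been treated, by reweighting treated
    # units by 1/e so under-sampled strata count more.
    return sum(y * t / e for y, t, e in data) / len(data)

# Two strata: in stratum A (propensity 0.8) the treated outcome is 1,
# in stratum B (propensity 0.2) it is 0.
data = ([(1.0, 1, 0.8)] * 8 + [(0.0, 0, 0.8)] * 2 +   # stratum A
        [(0.0, 1, 0.2)] * 2 + [(0.0, 0, 0.2)] * 8)    # stratum B
print(ipw_mean(data))  # 0.5 — the naive treated-only average (0.8) is biased up
```

Augmented IPW (AIPW) adds an outcome-model term to this weighted sum, which is what makes the robustness checks across IPW and AIPW variants in the paper meaningful.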
[LG-163] From Canopy to Collision: A Hybrid Predictive Framework for Identifying Risk Factors in Tree-Involved Traffic Crashes
链接: https://arxiv.org/abs/2605.06684
作者: Abdul Azim,Ahmed Hossain,Soumyadip Maitra,Panick Kalambay
类目: Machine Learning (cs.LG)
*备注: 30 pages, 10 figures
Abstract:Tree-involved crashes represent a critical subset of run-off-road (ROR) collisions, often resulting in fatal or severe injuries due to high-energy impacts. This study develops a comprehensive analytical framework to identify and quantify risk factors contributing to crash severity in tree-involved collisions using the Crash Report Sampling System (CRSS) database spanning 2020-2023. The modeling framework follows a multi-step process. First, a machine learning based classification model (CatBoost) identifies key factors associated with binary crash injury severity (KA: fatal or incapacitating injury versus BC: non-incapacitating or possible injury). Second, SHapley Additive exPlanations (SHAP) tool is used to quantify and visualize the marginal effects of top influential factors on crash severity. Third, a binary logistic regression model estimates factor effects and validates SHAP-derived importance measures. Finally, SHAP interaction plots examine the combined effects of key contributing factors. Results reveal restraint non-use as the most influential predictor, with unrestrained occupants nearly three times more likely to experience severe outcomes due to ejection risk. Vehicle age, speeding violations, and driver impairment demonstrate substantial effects, reflecting reduced crashworthiness, increased impact forces, and reduced control capabilities. Critical interactions emerge between lighting conditions and vehicle age, speeding and lighting conditions, restraint use and vehicle age, and road surface and speeding, demonstrating additive risk effects with specific interactions. These findings provide critical insights for targeted safe system-based interventions, including enhanced seat belt enforcement, speed management in reduced visibility conditions, and vehicle fleet modernization.
[LG-164] Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding CVPR2026
链接: https://arxiv.org/abs/2605.06679
作者: Yubo Jiang,Yitong An,Xin Yang,Abudukelimu Wuerkaixi,Xuxin Cheng,Fengying Xie,Zhiguo Jiang,Cao Liu,Ke Zeng,Haopeng Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted by CVPR 2026 (Conference on Computer Vision and Pattern Recognition). 11 pages, 5 figures. Code available at: this https URL
Abstract:Vision-Language Models (VLMs) are frequently undermined by object hallucination, generating content that contradicts visual reality, due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our finding of an attention imbalance in VLMs, where visual features are under-weighted. Our framework introduces a dual-path contrast: a positive path that amplifies visual evidence and a negative path that constructs counterfactuals to penalize prior-dominant generation. By contrasting outputs from both paths during decoding, PND steers generation toward visually grounded results. Experiments on POPE, MME, and CHAIR demonstrate state-of-the-art performance without retraining.
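The dual-path contrast can be sketched with a generic contrastive-decoding rule (a simplified illustration of the idea only; PND's exact contrast and path construction are described in the paper, and the tokens and logits below are invented):

```python
def contrastive_decode(pos_logits, neg_logits, beta=1.0):
    # Generic dual-path contrast at decoding time: boost tokens the
    # visually-grounded positive path prefers, penalize tokens the
    # prior-driven negative (counterfactual) path prefers.
    scores = [p + beta * (p - n) for p, n in zip(pos_logits, neg_logits)]
    return scores.index(max(scores))

# Token 0 is supported by the image; token 1 is a strong language prior
# that the negative (image-ablated) path also predicts.
pos = [2.0, 2.1]   # positive path: the prior still narrowly wins
neg = [0.0, 3.0]   # negative path: pure prior strongly prefers token 1
print(contrastive_decode(pos, neg))  # → 0: the contrast flips the choice
```

Plain greedy decoding on `pos` alone would emit token 1 (the hallucination); subtracting the negative path's score reveals that token 1's evidence comes from the prior, not the image.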
[LG-165] A Wasserstein GAN-based climate scenario generator for risk management and insurance: the case of soil subsidence
链接: https://arxiv.org/abs/2605.06678
作者: Antoine Heranval(BioSP),Olivier Lopez(CREST),Didier Ngatcha,Daniel Nkameni(CREST)
类目: Machine Learning (cs.LG); Risk Management (q-fin.RM); Applications (stat.AP)
*备注:
Abstract:According to the United Nations Office for Disaster Risk Reduction (2025), the average annual cost of natural catastrophes increased from 70–80 billion USD between 1970 and 2000 to 180–200 billion USD between 2001 and 2020. Reports from organizations such as the IFOA and the WWF highlight the need for the insurance sector to adapt to this rapidly evolving context by developing medium- to long-term strategies that go beyond the one-year horizon of prudential regulations such as Solvency II. This paper introduces an artificial intelligence framework based on Conditional Generative Adversarial Networks (Conditional GANs) to generate future spatio-temporal trajectories of climatic indices. The approach focuses on the Soil Wetness Index (SWI), a key indicator used in France to assess drought severity. Drought accounts for approximately 30% of the indemnities paid under the French natural catastrophe insurance scheme. The proposed model, SwiGAN, simulates plausible drought propagation patterns up to 2050 for a region of France particularly exposed to this hazard. By generating realistic sequences of SWI maps, SwiGAN provides insights into drought dynamics under climate change scenarios and supports the design of adaptive risk management and insurance strategies. The methodology is also generalizable to other climate-related perils and actuarial applications such as economic scenario generation.
[LG-166] A Note on Non-Negative L_1-Approximating Polynomials
链接: https://arxiv.org/abs/2605.08072
作者: Jane H. Lee,Anay Mehrotra,Manolis Zampetakis
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract: L_1-approximating polynomials, i.e., polynomials that approximate indicator functions in L_1-norm under certain distributions, are widely used in computational learning theory. We study the existence of \textit{non-negative} L_1-approximating polynomials with respect to Gaussian distributions. This is a stronger requirement than L_1-approximation but weaker than sandwiching polynomials (which themselves have many applications). These non-negative approximating polynomials have recently found uses in smoothed learning from positive-only examples. In this short note, we prove that every class of sets with Gaussian surface area (GSA) at most \Gamma under the standard Gaussian admits degree-k non-negative polynomials that \varepsilon-approximate its indicator functions in L_1-norm, for k=\tilde{O}(\Gamma^2/\varepsilon^2). Equivalently, finite GSA implies L_1-approximation with the stronger pointwise guarantee that the approximating polynomial has range contained in [0,\infty). Up to a constant factor, this matches the degree of the best currently known Gaussian L_1-approximation degree bound without the non-negativity constraint.
[LG-167] PropSplat: Map-Free RF Field Reconstruction via 3D Gaussian Propagation Splatting
链接: https://arxiv.org/abs/2605.08035
作者: William Bjorndahl,Maninder Pal Singh,Farhad Nouri,Joseph Camp
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Accepted for presentation at IEEE DySPAN 2026
Abstract:Building a site-specific propagation model typically requires either ray-tracing over detailed 3D maps or dense measurement campaigns. Both approaches are expensive and often infeasible for rapid deployments where geographic data is unavailable or outdated. We present PropSplat, a map-free propagation modeling method that reconstructs radio frequency (RF) fields using 3D anisotropic Gaussian primitives. Each Gaussian encodes a scalar path loss offset relative to an explicit baseline path loss model with a learnable path loss exponent. Gaussians are initialized along observed transmitter–receiver paths and optimized end-to-end to learn the propagation environment without external information like floor plans, terrain databases, or clutter data. We evaluate PropSplat against wireless radiance field methods NeRF^2, GSRF, and WRF-GS+ on two real-world datasets. On large-scale outdoor drive-tests spanning multiple topographical regions at six sub-6 GHz frequencies, PropSplat achieves 5.38 dB RMSE when training measurements are spaced 300m apart and outperforms WRF-GS+ (5.87 dB), GSRF (7.46 dB), and NeRF^2 (14.76 dB). On indoor Bluetooth Low Energy measurements, PropSplat achieves 0.19m mean localization error, an order of magnitude better than NeRF^2 (1.84m), while achieving near-identical received signal strength prediction accuracy. These results show that accurate site-specific propagation reconstruction is achievable from sparse RF-native measurements. The need for geographic data as a prerequisite for scalable RF environment modeling is reduced.
[LG-168] Semiparametric Efficient Test for Interpretable Distributional Treatment Effects
链接: https://arxiv.org/abs/2605.08034
作者: Houssam Zenati,Arthur Gretton
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Distributional treatment effects can be invisible to means: a treatment may preserve average outcomes while changing tails, modes, dispersion, or rare-event probabilities. Kernel tests can detect discrepancies between interventional outcome laws, but global tests do not reveal where the laws differ. We propose DR-ME, to our knowledge the first semiparametrically efficient finite-location test for interpretable distributional treatment effects. DR-ME evaluates an interventional kernel witness at learned outcome locations, returning causal-discrepancy coordinates rather than only a global rejection. From observational data, we derive orthogonal doubly robust kernel features whose centered oracle form is the canonical gradient of this finite witness. For fixed locations, we characterize the local testing limit: DR-ME is chi-square calibrated under the null, has noncentral chi-square local power, and uses the covariance whitening that optimizes local signal-to-noise for discrepancies visible through the selected coordinates. This efficient local-power geometry yields a principled location-learning criterion, with sample splitting preserving post-selection validity. Experiments show near-nominal type-I error, competitive power against global doubly robust kernel tests, and interpretable learned locations that localize distributional effects in a semi-synthetic medical-imaging study.
[LG-169] Penalty-Based First-Order Methods for Bilevel Optimization with Minimax and Constrained Lower-Level Problems
链接: https://arxiv.org/abs/2605.08006
作者: Yiyang Shen,Yutian He,Weiran Wang,Qihang Lin
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study a class of bilevel optimization problems in which both the upper- and lower-level problems have minimax structures. This setting captures a broad range of emerging applications. Despite the extensive literature on bilevel optimization and minimax optimization separately, existing methods mainly focus on bilevel optimization with lower-level minimization problems, often under strong convexity assumptions, and are not directly applicable to the minimax lower-level setting considered here. To address this gap, we develop penalty-based first-order methods for bilevel minimax optimization without requiring strong convexity of the lower-level problem. In the deterministic setting, we establish that the proposed method finds an \epsilon-KKT point with \tilde{O}(\epsilon^{-4}) oracle complexity. We further show that bilevel problems with convex constrained lower-level minimization can be reformulated as special cases of our framework via Lagrangian duality, leading to an \tilde{O}(\epsilon^{-4}) complexity bound that improves upon the existing \tilde{O}(\epsilon^{-7}) result. Finally, we extend our approach to the stochastic setting, where only stochastic gradient oracles are available, and prove that the proposed stochastic method finds a nearly \epsilon-KKT point with \tilde{O}(\epsilon^{-9}) oracle complexity.
[LG-170] Linear Response Estimators for Singular Statistical Models
链接: https://arxiv.org/abs/2605.07970
作者: Chris Elliott,Daniel Murfet
类目: Statistics Theory (math.ST); Machine Learning (cs.LG)
*备注: 24 pages, comments welcome!
Abstract:We define susceptibilities as a measure of the response of an observable quantity of a parameterized statistical model to a perturbation of the data for a general class of observables. We define estimators for these susceptibilities as statistics in a sequence of n data-points and prove that these estimators are consistent and asymptotically unbiased in the large n regime.
[LG-171] Asymptotically Log-Optimal Bayes-Assisted Confidence Sequences for Bounded Means
链接: https://arxiv.org/abs/2605.07964
作者: Valentin Kilian,Stefano Cortinovis,François Caron
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Valentin and Stefano are equal first author
Abstract:Confidence sequences based on test martingales provide time-uniform uncertainty quantification for the mean of bounded IID observations without parametric distributional assumptions. Their practical efficiency, however, depends strongly on the choice of martingale updates, and many existing constructions do not exploit prior information about plausible data-generating distributions or mean values. We propose a Bayes-assisted framework that uses a Bayesian working predictive model to adaptively construct confidence sequences. For each candidate mean and time point, the predictive distribution selects, among valid one-step martingale factors, the update maximising predictive expected log-growth; validity is therefore preserved even when the prior or working model is misspecified. We prove that if the predictive distribution is Wasserstein-consistent, the resulting procedure is asymptotically log-optimal, matching the per-sample log-growth of an oracle procedure with access to the true distribution. We instantiate the framework using robust predictives based on Dirichlet-process mixtures and Bayesian exponentially tilted empirical likelihood. Experiments on synthetic data, sequential best-arm identification for LLM evaluation, and prediction-powered inference show that informative priors can substantially reduce confidence-sequence width and sampling effort while retaining anytime-valid coverage.
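To make the test-martingale construction concrete, here is a minimal betting-style confidence sequence for bounded means with a fixed bet (a generic hedged-capital sketch, not the paper's Bayes-assisted adaptive update; the bet size and grid are illustrative):

```python
import random

def confidence_interval(xs, alpha=0.05, lam=0.5, grid=101):
    # For each candidate mean m in [0, 1], run two betting martingales
    # W+_t = prod(1 + lam*(x - m)) and W-_t = prod(1 - lam*(x - m)).
    # Their average is a nonnegative martingale starting at 1 under mean m,
    # so by Ville's inequality, the set of m whose hedged wealth never
    # reaches 1/alpha is a time-uniform (1 - alpha) confidence set.
    keep = []
    for i in range(grid):
        m = i / (grid - 1)
        wp, wm, peak = 1.0, 1.0, 1.0
        for x in xs:
            wp *= 1.0 + lam * (x - m)
            wm *= 1.0 - lam * (x - m)
            peak = max(peak, 0.5 * (wp + wm))
        if peak < 1.0 / alpha:
            keep.append(m)
    return min(keep), max(keep)

rng = random.Random(1)
xs = [rng.random() for _ in range(500)]  # bounded data, true mean 0.5
lo, hi = confidence_interval(xs)
print(lo, hi)  # a short interval near the true mean
```

The paper's contribution sits exactly at the fixed `lam` above: a Bayesian working predictive replaces this constant bet with the per-step martingale factor of largest predictive expected log-growth, while Ville's inequality keeps coverage valid under misspecification.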
[LG-172] Characterizing and Correcting Effective Target Shift in Online Learning
链接: https://arxiv.org/abs/2605.07886
作者: Ziyan Li,Naoki Hiratani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 22 pages; 6 figures
Abstract:Online learning from a stream of data is a defining feature of intelligence, yet modern machine learning systems often struggle in this setting, especially under distributional shift. To understand its basic properties, we study the relationship between online and offline learning in the context of kernel regression. We derive a closed-form expression for the function learned by online kernel regression, revealing that online kernel regression is equivalent to offline regression with shifted, inaccurate target outputs. Conversely, we show that by compensating for this effective shift in the teaching signal through target correction, online kernel-based learning can provably learn the same predictor as its offline counterpart. We derive both a closed-form expression for this target correction and an iterative form that can be applied sequentially. Applying this framework to image classification tasks on CIFAR-10 and CORe50, we show that online stochastic gradient descent with iteratively corrected targets outperforms learning with the true targets in continual learning settings. This work therefore provides a basic framework for analyzing and improving online learning in non-stationary environments.
[LG-173] Flow Matching for Count Data
链接: https://arxiv.org/abs/2605.07746
作者: Ganchao Wei,John Pearson
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:High-dimensional count data arise in applications such as single-cell RNA sequencing and neural spike trains, where mapping between distributions across successive batches or time points form critical components of data analysis. The recent success of diffusion- and flow-based deep generative models for images, video, and text motivates extending these ideas to count-valued settings, but many existing methods either treat each count as a categorical state or transform counts into a continuous space, neither of which is natural or efficient when the count range is large. We propose count-FM, a flow-matching framework for count data based on a continuous-time birth-death process with local unit jumps. Count-FM learns marginal transitions efficiently in count space through simulation-free training of conditional transition rates, allowing transport between arbitrary count-distributed source and target populations. In simulation, count-FM achieves better sample quality than representative baselines while using substantially fewer parameters. We further apply count-FM to scRNA-seq and neural spike-train data for unconditional generation, transport, and conditional generation. Across these tasks, count-FM yields improved sample quality, greater modeling efficiency, and interpretable transport paths.
[LG-174] Physics-Informed Reduced-Order Operator Learning for Hyperelasticity in Continuum Micromechanics
链接: https://arxiv.org/abs/2605.07738
作者: Hamidreza Eivazi,Henning Wessels
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注: 22 pages, 12 figures
Abstract:Physics-informed operator learning is an attractive candidate for surrogate modeling of microstructures, especially in multiscale finite-element simulations. Its practical use, however, is often limited by the high cost of loss evaluation. We address this bottleneck by combining the Equilibrium Neural Operator (EquiNO) with the QR-based discrete empirical interpolation method (Q-DEIM). EquiNO learns only the modal coefficients of reduced displacement-fluctuation and first Piola-Kirchhoff stress representations built from periodic and divergence-free bases, thereby enforcing periodicity and mechanical equilibrium by construction. Q-DEIM then identifies a small set of spatial points through a column-pivoted QR factorization of the stress basis and restricts constitutive evaluations during training to these points alone. This makes full-batch second-order optimization practical for three-dimensional representative volume elements (RVEs). Homogenized first Piola-Kirchhoff stresses are recovered directly from the offline-averaged reduced stress modes, without the need to reconstruct the full stress field at inference time. We validate the framework on two three-dimensional finite-strain hyperelastic RVEs. Q-DEIM reduces the per-step training cost by roughly three orders of magnitude relative to full-field loss evaluation, while reduced homogenization achieves speed-up factors of order 10^3 to 10^4 over direct full-field computations. Despite relying on only a small number of offline snapshot loading paths for basis construction, the method accurately interpolates and extrapolates both microscopic stress fields and homogenized stresses, with prediction quality improving systematically as more snapshots are added.
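The point-selection step can be illustrated with a tiny greedy pivoted Gram-Schmidt pass, which is what column-pivoted QR of the transposed basis amounts to (a schematic sketch with an invented 4-point, 2-mode basis, not the paper's implementation):

```python
def pivoted_points(rows, k):
    # rows[i] = values of the reduced (e.g. stress) modes at spatial point i.
    # Repeatedly pick the point whose row has the largest norm after
    # orthogonalizing all rows against the rows already chosen -- the pivot
    # order of a column-pivoted QR factorization of the transposed basis.
    rows = [list(r) for r in rows]
    chosen = []
    for _ in range(k):
        _, p = max((sum(v * v for v in r), i)
                   for i, r in enumerate(rows) if i not in chosen)
        chosen.append(p)
        norm = sum(v * v for v in rows[p]) ** 0.5
        q = [v / norm for v in rows[p]]
        for i, r in enumerate(rows):
            proj = sum(a * b for a, b in zip(r, q))
            rows[i] = [a - proj * b for a, b in zip(r, q)]
    return chosen

# Four spatial points, two modes: points 0 and 2 carry independent mode
# information; points 1 and 3 are near-copies of them.
basis = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(pivoted_points(basis, 2))  # one point per independent direction
```

Restricting constitutive-law evaluations to the selected points is what makes the loss cheap: only `k` rows of the basis, rather than the full field, are touched per training step.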
[LG-175] Debiased Counterfactual Generation via Flow Matching from Observations
链接: https://arxiv.org/abs/2605.07665
作者: Hugh Dance,Johnny Xi,Peter Orbanz,Benjamin Bloem-Reddy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Estimating counterfactual distributions under interventions is central to treatment risk assessment and counterfactual generation tasks. Existing approaches model the counterfactual distribution as a standalone generative target, without exploiting its relationship to the observational data. In this work, we show that under standard assumptions, observational and counterfactual outcome distributions are tightly linked: they have identical support and tail behavior, remain statistically close under weak confounding, and share any features of high-dimensional outcomes which are invariant to confounders. These properties motivate learning counterfactual distributions not from scratch, but via a deconfounding flow from the observational distribution. We formulate this problem via flow-matching and derive a semiparametrically efficient estimator based on a novel efficient influence function correction. We subsequently extend our estimator to target minimal-energy flows in high-dimensions, which we show can be especially simple targets between observational and counterfactual distributions. In experiments, deconfounding flows outperform existing debiased counterfactual distribution estimators, while also mitigating known failure modes of flow-based methods.
[LG-176] Robust stochastic first order methods in heavy-tailed noise via medoid mini-batch gradient sampling
链接: https://arxiv.org/abs/2605.07634
作者: Manojlo Vukovic,Dusan Jakovetic
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We consider a first order stochastic optimization framework where, at each iteration, K independent identically distributed (i.i.d.) data point samples are drawn, based on which stochastic gradients can be queried. We allow gradient noise to be heavy-tailed, with possibly infinite variances. For the considered heavy-tailed setting, many algorithmic variants have recently been proposed based on gradient clipping or other nonlinear operators (e.g., normalization) applied over noisy gradients. In this paper, we take an alternative approach and propose a novel stochastic first order method dubbed Robust Stochastic Gradient Descent with medoid mini-batch gradient sampling, R-SGD-Mini for short. The core idea of R-SGD-Mini is to split the K-sized data batch into M distinct data chunks, form for each chunk the stochastic gradient, and update the solution estimate with respect to the stochastic gradient direction of the chunk that is the medoid of the gradients of all data chunks. Under a general class of symmetric heavy-tailed gradient noises and a standard non-convex setting, we establish explicit bounds on the expected time-averaged squared gradient norm. More precisely, we show that the latter quantity converges at rate \mathcal{O}(T^{-1}) to a small neighborhood of zero; we explicitly characterize this neighborhood in terms of noise and algorithm’s parameters. Moreover, if the time horizon is known in advance, we establish the rate of \mathcal{O}(T^{-1/2}). Furthermore, when clipping is incorporated, we obtain convergence guarantees in the high-probability sense and recover the same rate. Experimental results indicate that R-SGD-Mini and its clipped variant consistently perform favorably compared to SGD, clipped SGD and Median-of-Means based methods.
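The chunk-and-medoid update described in the abstract can be sketched as follows (a toy quadratic example with invented numbers; the paper's method additionally covers step-size schedules and the clipped variant, which are not shown):

```python
def medoid(vectors):
    # Index of the vector minimizing total Euclidean distance to the others.
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    costs = [sum(dist(u, v) for v in vectors) for u in vectors]
    return costs.index(min(costs))

def r_sgd_mini_step(x, batch, grad_fn, n_chunks, lr):
    # One step of the medoid mini-batch scheme: split the K-sized batch
    # into M chunks, average a stochastic gradient per chunk, and descend
    # along the medoid of the chunk gradients.
    size = len(batch) // n_chunks
    chunk_grads = []
    for m in range(n_chunks):
        chunk = batch[m * size:(m + 1) * size]
        grads = [grad_fn(x, s) for s in chunk]
        chunk_grads.append([sum(g[i] for g in grads) / len(grads)
                            for i in range(len(x))])
    g = chunk_grads[medoid(chunk_grads)]
    return [xi - lr * gi for xi, gi in zip(x, g)]

# Per-sample loss f(x; s) = 0.5 * ||x - s||^2, so grad = x - s. One chunk
# carries a heavy-tailed outlier sample that the medoid step ignores.
grad_fn = lambda x, s: [xi - si for xi, si in zip(x, s)]
batch = [[0.0, 0.0]] * 8 + [[1e6, 1e6]]  # last chunk holds the outlier
x = r_sgd_mini_step([1.0, 1.0], batch, grad_fn, n_chunks=3, lr=0.5)
print(x)  # → [0.5, 0.5]: moves toward the bulk, unaffected by the outlier
```

A plain mini-batch mean over the same 9 samples would be dragged by the `1e6` sample; the medoid, like a geometric-median or Median-of-Means step, discards the contaminated chunk's direction.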
[LG-177] A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning ICML2026
Link: https://arxiv.org/abs/2605.07596
Authors: Nong Minh Hieu, Antoine Ledent
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: Accepted at ICML 2026
Abstract:Contrastive Representation Learning (CRL) has achieved strong empirical success across multiple machine learning disciplines, yet its theoretical sample complexity remains poorly understood. Existing analyses usually assume that input tuples are independently and identically distributed, an assumption violated in most practical settings, where contrastive tuples are constructed from a finite pool of labeled data, inducing dependencies among tuples. While one recent work analyzed this learning setting using U-statistics to estimate the population risk, the techniques used therein require the risk of each class to concentrate uniformly, making excess risk bounds scale on the order of \rho_\min^{-1/2}, where \rho_\min denotes the probability of the rarest class. Such a dependency can be overly pessimistic in extreme multi-class settings, where many tail classes contribute minimally to the overall population risk. Our contributions are two-fold. First, we improve upon the previous work and prove a bound with a sample complexity of the same order as the number of classes R, regardless of the distribution over classes. Furthermore, we formulate a different estimator that captures the concentration of the risk across classes, enabling sharper bounds in extreme multi-class learning scenarios, especially where class distributions are long-tailed. Under mild assumptions on the class distributions, the resulting sample complexity is \mathcal{O}(k), where k is the number of samples per tuple.
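The dependence among tuples that motivates this analysis is easy to see in code: tuples built from a finite labeled pool reuse the same datapoints. A minimal sketch, where the `(anchor, positive, k negatives)` tuple shape and the sampling scheme are illustrative assumptions, not the paper's construction:

```python
import random

def sample_tuples(pool, k, num_tuples, seed=0):
    """Draw (anchor, positive, k negatives) contrastive tuples from a finite
    labeled pool. Because datapoints are reused across tuples, the tuples are
    dependent rather than i.i.d. -- the setting the paper's bounds address."""
    rng = random.Random(seed)
    by_label = {}
    for x, label in pool:
        by_label.setdefault(label, []).append(x)
    tuples = []
    for _ in range(num_tuples):
        label = rng.choice(sorted(by_label))
        anchor = rng.choice(by_label[label])
        positive = rng.choice(by_label[label])
        others = [x for l, xs in by_label.items() if l != label for x in xs]
        negatives = [rng.choice(others) for _ in range(k)]
        tuples.append((anchor, positive, negatives))
    return tuples
```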
[LG-178] Breaking QAOA's Fixed Target Hamiltonian Barrier: A Fully Connected Quantum Boltzmann Machine via Bilevel Optimization
Link: https://arxiv.org/abs/2605.07473
Authors: Jun Liu
Subjects: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*Comments: 34 pages, 8 figures, 3 tables, 1 algorithm
Abstract:To overcome the limitations of classical partially connected Boltzmann machines and mainstream quantum Boltzmann machines (QBMs), this work extends the conventional circuit of the quantum approximate optimization algorithm (QAOA) to a bilevel optimization architecture and proposes a fully connected QBM. The inner-loop training simulates positive phase energy minimization based on the computational process of the conventional QAOA circuit, whereas the outer-loop training simulates negative phase contrastive divergence learning by optimizing the structural parameters of the target Hamiltonian. It is found that, first, the model exhibits superior performance using only a single layer (p=1) in the QAOA circuit, with an average probability of 0.9559 in measuring the target quantum state under noiseless conditions. Second, the model exhibits notable noise robustness. Under the typical noise level of current mainstream commercial quantum computing devices, the average probability of measuring the target quantum state reaches 0.6047; when the noise rises to a more stringent level with doubled intensity, this probability remains at 0.3859. In both scenarios, the target quantum state maintains the highest measurement probability among all detected states, with a value several times higher than that of the second-ranked state. This indicates that the model retains strong robustness even when noise meets or exceeds the upper limit of current mainstream commercial quantum computing devices. Third, under a block-by-block learning strategy with p=1 and only 10 measurement shots, the model consistently generates the target “qubit” grid image regardless of noise interference, demonstrating strong robustness in image generation.
[LG-179] Inference of Qualitative Models from Steady-State Data via Weighted MaxSMT
Link: https://arxiv.org/abs/2605.07433
Authors: Ondřej Huvar, Nikola Beneš, Martin Jonáš, David Šafránek, Samuel Pastva
Subjects: Molecular Networks (q-bio.MN); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*Comments:
Abstract:Qualitative models provide crucial instruments for modelling complex biological systems. While advances in automated reasoning and symbolic encodings have enabled rigorous inference of these models from data, the process remains highly fragile. First, biological measurement errors inevitably propagate into formal model specifications. Second, when a specification becomes unsatisfiable, distinguishing between fundamental design flaws and minor technical errors is notoriously difficult. This uncertainty often leads to under-specification, as it is unclear which observations are still "safe" to incorporate. To overcome these challenges, we introduce a robust inference method based on weighted MaxSMT. By encoding uncertain biological observations as weighted soft constraints, our approach enables the solver to identify a model best reflecting the observations, even with some conflicting constraints. Our method allows for Boolean and multi-valued variable domains, alongside observations derived from discretisation (level constraints) and differential expression (ordering constraints). We show our approach can be used to successfully infer neural cell differentiation models from prior-knowledge networks with 200–1300 genes using ordering constraints on all included genes.
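The hard/soft constraint split behind weighted MaxSMT can be illustrated with a toy brute-force weighted MaxSAT solver over Boolean variables. The paper's encoding runs a real MaxSMT solver over richer theories, so this sketch only conveys the objective, not the method.

```python
from itertools import product

def weighted_maxsat(n_vars, hard, soft):
    """Among assignments satisfying every hard constraint, return the one
    maximizing the total weight of satisfied soft constraints.
    `hard` is a list of predicates over the bit tuple; `soft` is a list of
    (predicate, weight) pairs encoding uncertain observations."""
    best, best_w = None, float("-inf")
    for bits in product([False, True], repeat=n_vars):
        if not all(c(bits) for c in hard):
            continue  # hard constraints are inviolable
        w = sum(wt for c, wt in soft if c(bits))
        if w > best_w:
            best, best_w = bits, w
    return best, best_w
```

Here a hard constraint can force at least one variable on while soft constraints prefer both off; the solver keeps the cheaper violation, mirroring how conflicting observations are traded off by weight.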
[LG-180] Spectrum-Adaptive Generalization Bounds for Trained Deep Transformers
Link: https://arxiv.org/abs/2605.07297
Authors: Mana Sakai, Masaaki Imaizumi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Understanding why trained Transformers generalize well is a fundamental problem in modern machine learning theory, and complexity-based generalization bounds provide a principled way to study this question. While existing norm-based bounds for Transformers remove the explicit polynomial dependence on the hidden dimension, they typically impose fixed norm constraints specified a priori and can exhibit unfavorable exponential dependence on depth. In this paper, we derive spectrum-adaptive post hoc generalization bounds for multi-layer Transformers. Under layerwise spectral norm control, the bounds are expressed in terms of layerwise Schatten quantities of the query-key, value, and feedforward weight matrices. Since the Schatten indices need not be fixed a priori and can instead be selected after training, separately for each matrix type and layer, the bounds adaptively trade off spectral complexity against the dimension- and depth-dependent factors according to the learned singular-value profiles. Empirical comparisons of BERT-adapted proxies for the leading complexity factors suggest that the proxies induced by our bounds grow more slowly with depth and hidden dimension than the corresponding norm-based proxies. Overall, our results provide a complexity-based perspective on how the spectral structure of trained Transformers is reflected in generalization analyses.
[LG-181] Classification Fields: Arbitrarily Fine Recursive Hierarchical Clustering From Few Examples
Link: https://arxiv.org/abs/2605.07119
Authors: Yicen Li, Ruiyang Hong, Anastasis Kratsios, Haitz Sáez de Ocáriz Borde, Paul D. McNicholas
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Classical clustering methods usually return either a finite partition of the observed data or a finite dendrogram over it. This finite-sample view is inadequate when the hierarchy of interest is a recursive geometric object with fine-scale refinements that continue beyond the levels directly observed. We introduce classification fields: infinite-depth hierarchical cluster structures on \mathbb{R}^d generated by a local parent-to-child refinement rule. A classification field generator maps each parent centre to an ordered, bounded, and separated tuple of child residuals. Together with a root and a scale factor, this rule recursively generates cluster centres, Voronoi cells, and a metric DAG encoding the hierarchy. Given only a finite prefix of such a hierarchy, we learn a classification field predictor that approximates the generator and can be rolled out to unseen depths. We prove exponential truncation convergence in the completed cell metric and ReLU realizability with width O(\varepsilon^{-\gamma}) and depth \widetilde{O}(\varepsilon^{-3\gamma/2}), where \gamma = \log K/(-\log s), up to finite-window aspect-ratio factors. The approximation holds at the level of the induced compact metric structures, measured in the completed cell-metric Hausdorff distance. Experimental validation on matched CFG-generated hierarchies, IFS fractals, and image-induced recursive clustering hierarchies shows that learned predictors preserve ordered child slots, unordered geometry, and hierarchy-level path metrics under recursive rollout. These results support the claim that finite hierarchical observations can reveal local refinement rules capable of generating substantially deeper classification fields.
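The parent-to-child refinement rule can be mimicked with a toy generator in which every centre spawns one child per residual, with offsets shrunk geometrically by level. This is a simplified sketch (a fixed residual tuple and scalar scale), not the learned classification field predictor.

```python
def generate_centres(root, residuals, scale, depth):
    """Toy classification-field generator: each parent centre spawns one
    child per residual, with offsets shrunk by `scale` at every level.
    Returns the list of centre levels from the root down to `depth`."""
    levels = [[root]]
    for d in range(1, depth + 1):
        step = scale ** d  # refinement shrinks geometrically with depth
        levels.append([
            tuple(p + step * r for p, r in zip(parent, res))
            for parent in levels[-1]
            for res in residuals
        ])
    return levels
```

With K children per parent and scale s, the exponent \gamma = \log K/(-\log s) in the abstract's width/depth bounds measures how fast the hierarchy branches relative to how fast it shrinks.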
[LG-182] TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching Models
Link: https://arxiv.org/abs/2605.07100
Authors: Zhenhan Fang, Aixin Tan, Jian Huang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 22 pages, 5 figures and 5 tables
Abstract:Constructing valid and informative conformal prediction regions for multi-dimensional outputs remains a fundamental challenge. While conformal prediction provides finite-sample, distribution-free coverage guarantees, its practical performance critically depends on the choice of nonconformity score. Existing approaches often rely on restrictive geometric assumptions or require explicit likelihood evaluation and invertible transformations, limiting their applicability in complex generative settings. In this work, we introduce TRACE (TRansport Alignment Conformal Estimation), a conformal prediction framework that defines nonconformity through transport alignment in diffusion and flow matching models. Rather than evaluating likelihoods, we measure how well a candidate output aligns with the learned generative dynamics by averaging denoising or velocity-matching errors along stochastic transport trajectories. The resulting transport-based scores are scalar-valued and can be calibrated using split conformal prediction, yielding valid marginal coverage under exchangeability. We further analyze the statistical properties of the proposed scores and their sensitivity to computational budget. Experiments on synthetic and real datasets demonstrate valid coverage and show that the resulting regions adapt naturally to multimodal and non-convex conditional distributions.
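TRACE's calibration step is standard split conformal prediction; only the nonconformity score is new. A sketch of the calibration and region construction, with the transport-based score abstracted into a `score_fn` placeholder:

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    calibration score. Candidates scoring at or below it form a region
    with (1 - alpha) marginal coverage under exchangeability."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_region(candidates, score_fn, threshold):
    """Keep candidates whose nonconformity score is within the threshold."""
    return [c for c in candidates if score_fn(c) <= threshold]
```

This guarantee holds for any scalar score, which is why the paper can plug in transport-alignment errors without changing the calibration machinery.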
[LG-183] Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity
Link: https://arxiv.org/abs/2605.07097
Authors: Anastasis Kratsios, Gregory Cousins, Haitz Sáez de Ocáriz Borde, Bum Jun Kim, Simone Brugiapaglia
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Logic (math.LO); Statistics Theory (math.ST)
*Comments:
Abstract:We show that, in a precise sense, a broad class of feedforward neural networks learn (have finite sample complexity) in the PAC model: every fixed finite feedforward architecture whose layers are definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting, even with unbounded parameters. This covers standard fixed-size MLPs, CNNs, GNNs, and transformers with fixed sequence length, together with the operations and layers typically used in such architectures, including linear projections, residual connections, attention mechanisms, pooling layers, normalization layers, and admissible positional encodings. Hence, distribution-free learnability for modern non-recurrent architectures is not an exceptional property of particular activations or architecture-specific VC arguments, but a consequence of tame feedforward computation. Our results reposition finite-sample PAC learnability as a baseline rather than a differentiator: they shift the focus of architectural comparison toward inductive biases, symmetries and geometric priors, scalability, and optimization behaviour.
[LG-184] Functional-prior-based Bayesian PDE-constrained inversion using PINNs
Link: https://arxiv.org/abs/2605.07060
Authors: Ryoichiro Agata, Tomohisa Okazaki
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*Comments:
Abstract:Physics-informed neural networks (PINNs) provide a mesh-free framework for solving PDE-constrained inverse problems, but their extension to Bayesian inversion still faces a fundamental difficulty: prior distributions are typically defined in the weight space of neural networks, whereas physically meaningful prior assumptions are more naturally expressed in function space. In this study, we introduce a unified framework, termed functional-prior-based approaches to Bayesian PDE-constrained inversion using physics-informed neural networks (fpBPINN), to incorporate functional priors into Bayesian PINN-based inversion. We consider two complementary approaches. The first is a functional-prior-informed Bayesian PINN (FPI-BPINN), in which a neural network weight prior is learned to be consistent with a prescribed functional prior, and Bayesian inference is subsequently performed in weight space. The second is function-space particle-based variational inference for PINNs (fParVI-PINN), which performs Bayesian estimation using ParVI directly in function space. We also show that random Fourier features (RFF) play an important role in representing Gaussian functional priors with neural networks and in improving posterior approximation. We applied the proposed approaches to one-dimensional seismic traveltime tomography and two-dimensional Darcy-flow permeability inversion. These numerical experiments showed that both approaches accurately estimated posterior distributions, highlighting the significance of introducing physically interpretable functional priors into Bayesian PINN-based inverse problems. We also identified the contrasting advantages of FPI-BPINN and fParVI-PINN, namely flexibility and accuracy, respectively.
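The random Fourier feature (RFF) construction the paper highlights can be checked numerically: with enough sampled frequencies, inner products of the feature maps approximate an RBF kernel. A minimal scalar-input sketch; the lengthscale, feature count, and test points are arbitrary choices, not the paper's settings:

```python
import math
import random

def rff_features(x, ws, bs):
    """Map a scalar input to D random Fourier features; inner products of
    these feature vectors approximate a stationary Gaussian (RBF) kernel."""
    d = len(ws)
    return [math.sqrt(2.0 / d) * math.cos(w * x + b) for w, b in zip(ws, bs)]

random.seed(0)
ell, D = 1.0, 20000                                      # lengthscale, feature count
ws = [random.gauss(0.0, 1.0 / ell) for _ in range(D)]    # frequencies ~ N(0, 1/ell^2)
bs = [random.uniform(0.0, 2.0 * math.pi) for _ in range(D)]

x, y = 0.3, 1.1
approx = sum(a * b for a, b in zip(rff_features(x, ws, bs), rff_features(y, ws, bs)))
exact = math.exp(-((x - y) ** 2) / (2.0 * ell ** 2))     # true RBF kernel value
```

The approximation error shrinks at roughly 1/sqrt(D), which is why RFF can represent Gaussian functional priors with ordinary finite-width networks.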
[LG-185] A Differentiable Bayesian Relaxation for Latent Partial-Order Inference
Link: https://arxiv.org/abs/2605.06976
Authors: Dongqing Li, Geoff K. Nicholls, Shiyi Sun, You Luo
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*Comments:
Abstract:Many ranking and agent trace datasets are recorded as linear orders even though their latent structure is only partially ordered. This is especially common in agent and workflow traces, where observed order may reflect arbitrary linearization rather than true prerequisites. We introduce a differentiable relaxation for latent partial-order inference from such traces. Starting from a hard frontier-constrained model of noisy linear extensions, we replace discontinuous product-order precedence and binary frontier feasibility with smooth surrogates, yielding a continuous posterior that preserves closure-level partial-order semantics and supports gradient-based MCMC and variational inference. We prove soft transitivity, sharp-limit frontier recovery, and convergence to the hard likelihood. Experiments on synthetic data, records of social dominance relations, and cloud-agent traces show close posterior fidelity to hard MCMC on small instances and improved runtime–accuracy trade-offs on larger problems.
[LG-186] Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions
Link: https://arxiv.org/abs/2605.06959
Authors: Haitham Kanj, Kiryung Lee
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments:
Abstract:This paper presents a parametric solution to piecewise linear regression through the Adaptive Block Gradient Descent (ABGD) algorithm. The heart of the method is the parametrization of piecewise linear functions as the difference of max-affine (DoMA) functions. A non-asymptotic local convergence analysis for ABGD is provided under sub-Gaussian covariate and noise distributions. To initialize ABGD, we adapt a prior algorithm originally developed for the simpler setting of max-affine functions. When suitably initialized, ABGD converges linearly to an \epsilon-accurate estimate given \tilde{\mathcal{O}}(d\max(\sigma_z/\epsilon,1)^2) observations, where \sigma_z^2 denotes the noise variance. This implies exact recovery given \tilde{\mathcal{O}}(d) samples in the noiseless case. Also, such a rate is shown to be minimax optimal up to logarithmic factors. Synthetic numerical results corroborate the theoretical guarantees for ABGD. We also observe competitive performance compared to state-of-the-art methods on real-world datasets.
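The DoMA parametrization itself is simple to evaluate. A sketch (any continuous piecewise linear function can be written this way; for instance |x| = max(x, -x) - 0):

```python
def doma(x, A, b, C, d):
    """Evaluate a difference-of-max-affine (DoMA) function
    f(x) = max_j (<a_j, x> + b_j) - max_k (<c_k, x> + d_k),
    with A, C lists of weight tuples and b, d the matching offsets."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    plus = max(dot(a, x) + bj for a, bj in zip(A, b))
    minus = max(dot(c, x) + dk for c, dk in zip(C, d))
    return plus - minus
```

ABGD fits the parameters (A, b, C, d) by block gradient steps; the sketch only shows the function class being fitted.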
[LG-187] Physics-Based Flow Matching for Full-Field Prediction of Silicon Photonic Devices
Link: https://arxiv.org/abs/2605.06929
Authors: Joseph Quaratiello, Anthony Rizzo
Subjects: Optics (physics.optics); Machine Learning (cs.LG)
*Comments: 11 pages, 4 figures
Abstract:Designing photonic integrated circuits requires accurate electromagnetic field simulations, which remain computationally expensive even for simple device geometries. We present PIC-Flow, a generative neural surrogate that predicts electromagnetic field distributions for photonic devices given their geometry and operating wavelength as an alternative to costly finite-difference time-domain (FDTD) simulations. Our approach combines three key ideas: (i) conditional flow matching as the generative framework, learning a velocity field that transports Gaussian noise to physically valid field solutions; (ii) a real-valued U-Net operating on split real and imaginary field channels; and (iii) physics-constrained training through a Helmholtz residual loss enforcing \nabla^2 E_z + k_0^2 \varepsilon E_z = 0. We introduce an interface-aware masking scheme for the Helmholtz residual that excludes dielectric boundary pixels where finite-difference stencil errors dominate, yielding a physically meaningful compliance metric. The data set consists of 22,500 ground-truth FDTD simulations split evenly between multimode interferometers, Y-branches, and directional couplers at \lambda = 1.55\,\mu\mathrm{m}, in an 80/10/10 split between training, validation, and test sets. We evaluate ablations of the network against the held-out test devices and also show that the model generalizes to held-out device classes such as S-bends, tapers, and cascaded Y-branches. Rather than a drop-in replacement for FDTD, this work establishes a foundation that, with broader data coverage, more compute, and further training optimization, could scale toward broadband, device-agnostic field prediction with dramatically improved runtime for rapid design-space exploration of complex photonic devices and circuits.
[LG-188] You Only Stack Once (YOSO): A Motion-Filtered Deep-Learning Framework for Detecting Faint Moving Sources
Link: https://arxiv.org/abs/2605.06913
Authors: Nitya Pandey, César Fuentes, Pedro Bernardinelli, Valeria Frías, Colin Orion Chandler, David E. Trilling, Matthew J. Holman, Steven Stetzler, Dallin Spencer, Hsing Wen Lin, Luis E. Salazar Manzano, Darin Ragozzine, Ryder Strauss, Mario Jurić, Andrew J. Connolly, Hayden Smotherman, Scott S. Sheppard, Kevin Napier
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*Comments: Accepted to The Astronomical Journal; 13 pages, 9 figures
Abstract:We present You Only Stack Once (YOSO), an automated pipeline designed to detect faint, slow-moving Solar System objects in wide-field astronomical surveys. The pipeline integrates a novel Gaussian Motion Filter (GMoF) that operates at the pixel level to enhance signal-to-noise for objects exhibiting a range of apparent rates of motion. Unlike conventional shift-and-stack methods, which rely on discrete velocity trials, GMoF amplifies trails while suppressing random noise and static background features. Applied to a subset of DEEP observations from the Dark Energy Camera, YOSO recovered 45 out of 73 previously detected objects, as well as 11 new TNOs. It also discovered 216 objects in the near Solar System. Although alternative shift-and-stack methods are sensitive to objects about 0.88 magnitudes fainter, YOSO’s false positive rate is extremely low, since it detects only sources that exhibit a trail and are consistent with a point source when shifted at the right rate. We show how this method can be deployed on large surveys like LSST, and adapted for other domains that require motion-based signal enhancement, including exoplanet imaging through Angular Differential Imaging (ADI), and near-Earth object (NEO) detection for missions like NEO Surveyor. YOSO thus provides a versatile, scalable approach for extracting faint, motion-dependent signals in the era of data-intensive astronomy.
[LG-189] Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
Link: https://arxiv.org/abs/2605.06884
Authors: Sayantan Choudhury, Xiaoran Cheng, Martin Takáč, Sen Na, Mladen Kolar
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments: 33 pages, 4 figures, 1 table
Abstract:Most first-order optimizers treat matrix-valued parameters as vectors, ignoring the intrinsic geometry of hidden-layer weights in neural networks. Muon addresses this mismatch by updating along the polar factor of a momentum matrix, but its theoretical understanding has lagged behind practice. In particular, practical implementations incorporate Nesterov momentum, compute the polar factor only approximately, and operate with stochastic gradients that may be heavy-tailed. We close this gap by developing a convergence theory for Muon with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed noise. Our analysis builds on a unified framework for inexact polar decomposition that captures practical iterative approximations such as Newton-Schulz and quantifies how their errors propagate through the optimization dynamics. Under this framework, we establish an optimal iteration and sample complexity of O\left(\varepsilon^{-\frac{3\alpha-2}{\alpha-1}}\right) for finding an \varepsilon-stationary point, where \alpha \in (1,2] denotes the heavy-tail index. For the inexact-polar setting with \sigma_1 = 0, we also provide guarantees that do not require prior knowledge of \alpha. We analyze a randomized low-rank polar decomposition that is substantially more efficient than full-space methods while remaining compatible with our theory. Numerical experiments further demonstrate the effectiveness of the proposed inexact and randomized variants.
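The inexact polar decomposition used in practical Muon implementations is typically the Newton-Schulz iteration the abstract mentions. A pure-Python sketch for small square matrices; the Frobenius prescaling and step count are illustrative choices, not a production recipe:

```python
def matmul(A, B):
    """Dense matrix product of nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz_polar(M, steps=20):
    """Approximate the polar factor of a square matrix M via the
    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X, after scaling M by
    its Frobenius norm so all singular values lie in (0, 1]. Each singular
    value is driven toward 1 while singular vectors are preserved."""
    fro = sum(x * x for row in M for x in row) ** 0.5
    X = [[x / fro for x in row] for row in M]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * xij - 0.5 * yij for xij, yij in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X
```

Truncating the iteration early is exactly the kind of inexactness the paper's error-propagation framework is built to absorb.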
[LG-190] Kernel Selection is Model Selection: A Unified Complexity-Penalized Approach for MMD Two-Sample Tests
Link: https://arxiv.org/abs/2605.06883
Authors: Yijin Ni, Xiaoming Huo
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:The Maximum Mean Discrepancy (MMD) is a cornerstone statistic for nonparametric two-sample testing, but its test power is dictated entirely by the chosen kernel. Because any fixed kernel inherently fails to distinguish certain distributions, the kernel must be dynamically optimized. However, data-driven optimization violates the foundational i.i.d. assumption, forcing a strict trade-off in existing frameworks. Ratio criteria ignore this dependence, inducing overfitting and variance collapse on rich kernel classes. Conversely, aggregation methods bypass the dependence using finite grids, but this strategy cannot scale to continuous search spaces like deep kernels. To break this dichotomy, we establish data-driven kernel selection as a model selection problem. We propose Complexity-Penalized MMD (CP-MMD), a criterion derived by applying the two-sample uniform concentration inequality of preceding works to the post-optimization MMD problem. The resulting penalty bounds the empirical MMD by the complexity of the kernel search space, mathematically absorbing the cost of optimization, so that CP-MMD enables direct, grid-free maximization over continuous parametric classes, including scalar bandwidths, polynomial feature bandwidths, and deep network parameters. By formally accounting for optimization complexity, we prove that CP-MMD maximizes true test power while ensuring unconditional Type-I validity. Consequently, CP-MMD enables grid-free kernel selection across linear, polynomial-feature, and deep regimes, matching or exceeding state-of-the-art test power.
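The base statistic being optimized is the unbiased MMD² estimator; CP-MMD's contribution is the complexity penalty added on top of it during kernel selection, which this sketch omits. A minimal version with a fixed RBF kernel:

```python
import math

def mmd2_unbiased(xs, ys, kernel):
    """Unbiased estimator of squared MMD between samples xs and ys: mean
    off-diagonal within-sample similarity for each sample, minus twice the
    mean cross-sample similarity."""
    m, n = len(xs), len(ys)
    kxx = sum(kernel(a, b) for i, a in enumerate(xs)
              for j, b in enumerate(xs) if i != j) / (m * (m - 1))
    kyy = sum(kernel(a, b) for i, a in enumerate(ys)
              for j, b in enumerate(ys) if i != j) / (n * (n - 1))
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy

rbf = lambda a, b, bw=1.0: math.exp(-((a - b) ** 2) / (2.0 * bw ** 2))
```

Maximizing this statistic over the kernel (here, the bandwidth `bw`) without a penalty is exactly the overfitting-prone selection the paper's complexity term guards against.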
[LG-191] One Operator for Many Densities: Amortized Approximation of Conditioning by Neural Operators
Link: https://arxiv.org/abs/2605.06873
Authors: Panos Tsimpos, Edoardo Calvello, Ayoub Belhadji, Nicholas H. Nelsen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:
Abstract:Probabilistic conditioning is concerned with the identification of a distribution of a random variable X given a random variable Y . It is a cornerstone of scientific and engineering applications where modeling uncertainty is key. This problem has traditionally been addressed in machine learning by directly learning the conditional distribution of a fixed joint distribution. This paper introduces a novel perspective: we propose to solve the conditioning problem by identifying a single operator that maps any joint density to its conditional, thus amortizing over joint-conditional pairs. We establish that the conditioning operator can be approximated to arbitrary accuracy by neural operators. Our proof relies on new results establishing continuity of the conditioning operator over suitable classes of densities. Finally, we learn the conditioning map for a class of Gaussian mixtures using neural operators, illustrating the promise of our framework. This work provides the theoretical underpinnings for general-purpose, amortized methods for probabilistic conditioning, such as foundation models for Bayesian inference.
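On a finite grid, the conditioning operator the paper approximates is just Bayes' rule applied columnwise; a toy discrete version (the neural-operator parametrization and the amortization over joint densities are, of course, the paper's actual subject):

```python
def condition(joint, y_index):
    """Discrete conditioning operator: given a joint density tabulated on a
    grid (joint[i][j] ~ p(x_i, y_j)), return the normalized conditional
    p(x_i | y_j) for the observed column y_j."""
    column = [row[y_index] for row in joint]
    z = sum(column)  # proportional to the marginal p(y_j)
    return [c / z for c in column]
```

The paper's point is that a single learned operator should perform this joint-to-conditional map for *any* input density, rather than being retrained per joint distribution.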