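This post lists the latest papers retrieved from arXiv.org on 2026-06-04. It is updated automatically and organized into six main areas: NLP, CV, ML, AI, IR, and MA.
Note: Paper data is fetched from arXiv.org daily and updated automatically at around 12:30 every day.
Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day where possible.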
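Contents
Overview (2026-06-04)
680 papers were updated today, including:
- Natural language processing: 126 papers (Computation and Language (cs.CL))
- Artificial intelligence: 209 papers (Artificial Intelligence (cs.AI))
- Computer vision: 119 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 238 papers (Machine Learning (cs.LG))
- Multi-agent systems: 12 papers (Multiagent Systems (cs.MA))
- Information retrieval: 27 papers (Information Retrieval (cs.IR))
- Human-computer interaction: 19 papers (Human-Computer Interaction (cs.HC))
Multi-Agent Systems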
[MA-0] Streaming Communication in Multi-Agent Reasoning
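[Quick read]: This paper addresses the end-to-end latency of multi-agent reasoning systems, which grows linearly with pipeline depth under the "generate-then-transfer" paradigm. The proposed system, StreamMA, streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thereby reducing latency. Surprisingly, the pipelining also improves effectiveness: the quality of multi-step reasoning is non-uniform, with early steps more reliable than late ones, so working from the reliable early steps rather than the full chain keeps error-prone late steps from misleading downstream agents. The authors give the first closed-form joint analysis of the stream, serial, and single protocols, deriving the effectiveness ordering, the speedup upper bound, and the cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (+7.3 pp on average, up to +22.4 pp on HMMT 2026). The paper also reports a "step-level scaling law": increasing the number of steps per agent improves both effectiveness and efficiency, a scaling dimension that is orthogonal to and composable with agent-count scaling.
Link: https://arxiv.org/abs/2606.05158
Authors: Zhen Yang,Xiaogang Xu,Wen Wang,Cong Chen,Xander Xu,Ying-Cong Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: project page: this https URL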
Abstract:Multi-agent reasoning systems adopt a “generate-then-transfer” paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a “step-level scaling law”: increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
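The latency argument can be illustrated with a toy timing model. The sketch below is not the paper's closed-form analysis. It assumes a chain of agents, the same number of steps per agent, a constant time per step, and that a downstream agent can start its k-th step as soon as the upstream agent has emitted its own k-th step. The function names are hypothetical.

```python
def serial_latency(num_agents, steps_per_agent, step_time):
    """Generate-then-transfer: every agent waits for the full upstream output."""
    return num_agents * steps_per_agent * step_time


def streamed_latency(num_agents, steps_per_agent, step_time):
    """Step-level streaming in a chain, computed as a pipeline schedule.

    Toy assumption: agent a can start step k once agent a-1 has emitted
    step k and agent a has finished its own step k-1.
    """
    finish = [[0.0] * steps_per_agent for _ in range(num_agents)]
    for a in range(num_agents):
        for k in range(steps_per_agent):
            upstream = finish[a - 1][k] if a > 0 else 0.0
            previous = finish[a][k - 1] if k > 0 else 0.0
            finish[a][k] = max(upstream, previous) + step_time
    return finish[-1][-1]


serial = serial_latency(num_agents=4, steps_per_agent=10, step_time=1.0)
streamed = streamed_latency(num_agents=4, steps_per_agent=10, step_time=1.0)
print(serial, streamed)  # 40.0 13.0
```

Under these assumptions the serial latency grows with the product of depth and steps, while the streamed latency grows with their sum, which is the usual pipeline-fill behaviour.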
[MA-1] Provably Auditable and Safe LLM Agents from Human-Authored Ontologies
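[Quick read]: This paper addresses the lack of auditability and semantic-correctness guarantees in LLM agent systems applied to nontrivial problem domains, where decisions are hard to trace and results hard to verify. It introduces Agentic Redux, an LLM agent architecture intended for domains that require linear auditability. Using the typed lambda calculus, the author proves that, on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with every decision recorded in an append-only ledger. The paper also introduces Ontology-First Agent Design, a methodology in which a human domain expert formalizes the problem domain with Basic Formal Ontology (BFO) and then assigns an LLM to derive the roles that agents and humans-in-the-loop can fill. Two production-grade domains are presented, healthcare billing compliance and security vulnerability disclosure, with working code for both in a supporting repository.
Link: https://arxiv.org/abs/2606.04903
Authors: Aaron Sterling
Affiliations: Thistleseeds
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Programming Languages (cs.PL)
Comments: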
Abstract:We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append-only ledger. We present two production-grade appropriate domains, in healthcare billing compliance, and security vulnerability disclosure. Working code for Agentic Redux run on both domains is available in a supporting code repository. We also introduce Ontology-First Agent Design, a methodology for creation of agent frameworks on a problem domain, in which a human expert ontologizes the problem domain with Basic Formal Ontology, and then assigns an LLM to derive roles that agents and humans-in-the-loop can fill, in order to work the problems in the domain.
[MA-2] Channel Fracture: Architectural Blind Spots in Scheduled Cross-Agent Memory Injection for Multi-Agent Orchestration Systems
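[Quick read]: This paper reports a systematic failure of cross-agent knowledge injection in multi-agent AI orchestration systems: in hierarchical team architectures, scheduled (cron) agents cannot write knowledge into a target agent's persistent memory. The cause is hardcoded memory isolation. In cron execution contexts the task fires correctly, but the write is silently dropped because skip_memory=True is hardcoded at the scheduler layer and the memory tools are registered only when _memory_manager is initialized, which cron execution bypasses. The authors call this condition "channel fracture". As a remedy they propose CADVP (Cross-Agent Delivery Verification Protocol) v1.1, a 13-dimension verification framework with a veto-level channel confirmation check (CC-0) that prevents false-positive delivery assurance, and they articulate two design principles: the inverse verification principle and the channel matching principle.
Link: https://arxiv.org/abs/2606.04896
Authors: Levent Liu
Affiliations: Shanghai Qijing Digital Technology Co., Ltd. (上海启景数字科技有限公司)
Categories: Multiagent Systems (cs.MA)
Comments: 16 pages, 0 figures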
Abstract:Multi-agent AI orchestration systems increasingly rely on persistent memory to maintain context across sessions, agents, and tasks. When one agent must inject knowledge into another agent’s memory – a common requirement in hierarchical team architectures – the delivery mechanism must be architecturally sound. We report the discovery of a systematic failure mode we term channel fracture: a condition where scheduled (cron) agents in orchestration frameworks are silently unable to write to the target agent’s persistent memory due to hardcoded memory isolation guards. Through experiments on a production Hermes Agent deployment with five specialized profiles, we tested three injection channels: (A) direct SQLite database writes, (B) target-agent self-writes via memory tools, and (C) cron-delegated writes. Channel C failed completely due to two architectural constraints: skip_memory=True hardcoded at the scheduler layer and dynamic registration of memory tools contingent on _memory_manager initialization, which is bypassed in cron execution contexts. We propose CADVP (Cross-Agent Delivery Verification Protocol) v1.1, a 13-dimension verification framework with a veto-level channel confirmation check (CC-0) that prevents false-positive delivery assurance. We articulate two design principles: the inverse verification principle and the channel matching principle.
[MA-3] R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search
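[Quick read]: Large language models (LLMs) are fluent on open-ended tasks, but in agentic settings that require long-horizon planning, tool use, and action, fluency does not ensure reliable delivery. The paper traces this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never explicitly invalidated. The authors argue that these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. They propose Reflective Adversarial Pareto Search (R-APS), to their knowledge the first method to address all three failures jointly through reasoning-mode decomposition. R-APS gives each reasoning mode its own context and coordinates them across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). It requires no fine-tuning and runs on a frozen LLM purely through structured protocol design. On planar mechanism synthesis (robotics, prosthetics, mechanical design) with 32 target trajectories, R-APS yields robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and a 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar count and worst-case robustness. Small 4B reasoning-specialized models are competitive with general-purpose 70B backbones inside the protocol, suggesting that structured protocols can partially offset model scale.
Link: https://arxiv.org/abs/2606.04823
Authors: João Pedro Gandarela,Thiago Rios,Stefan Menzel,André Freitas
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: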
Abstract:Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over extended horizons, fluency does not ensure reliable delivery. We trace this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never invalidated. We argue these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. We introduce Reflective Adversarial Pareto Search (R-APS), to our knowledge the first method addressing all three failures jointly via reasoning-mode decomposition, allocating each reasoning mode its own context and orchestrating interaction across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). R-APS requires no fine-tuning and operates on a frozen LLM purely via structured protocol design. We evaluate on planar mechanism synthesis (robotics, prosthetics, mechanical design), with every candidate checked by a kinematic solver. On 32 target trajectories, R-APS delivers robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar-count and worst-case robustness. Small 4B reasoning-specialized models prove competitive with general-purpose 70B backbones inside the protocol, suggesting structured protocols can partially offset model scale.
[MA-4] RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation
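[Quick read]: This paper addresses context management and memory control for LLM agents on complex tasks, in particular how to organize memory and schedule content efficiently and programmably during long reasoning sequences. The stated problem is that prompt-based context assembly is costly, hard to control, and lacks fine-grained access permissions. RAMPART is a compile-time memory model with a pure in-RAM block registry. Five composable primitives (promote, gate, write, evict, rollback) act on named, addressable blocks before compilation at zero prompt-token cost, giving explicit control over ordering, inclusion, and eviction. Provenance tags and non-evictable authorship flags implement a permissioned memory model with block-level ownership. Experiments show that compile-time placement and the structural relationship between blocks and the task query affect task success, with a performance cliff at roughly the seventh block position when the task follows the registry and the twelfth when it precedes it. Grouping the critical block with its content-adjacent neighbours and promoting the group as a unit raises task success by tens of percentage points. Cross-model replication shows that the content-priming effect appears at the same absolute positions across model families, with a magnitude that varies with model strength. Relevance gating cuts prompt cost by 67.8% while recovering 83% of the promoted-condition success rate. Schema eviction produces 0% invocations versus 100% with the schema present, a property that policy-based approaches cannot guarantee by construction. Finally, shared-registry coordination reduces inter-agent communication to a method call at zero coordination-token cost.
Link: https://arxiv.org/abs/2606.04628
Authors: Nikodem Tomczak
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: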
Abstract:RAMPART is a compile-time memory model and pure in-RAM block registry for LLM-based agents. Context assembly is a programmable runtime operation where content is compiled from a structured registry under explicit policy for ordering, inclusion, and eviction. Five composable primitives (promote, gate, write, evict, rollback) act on named addressable blocks before compilation at zero prompt-token cost. Provenance tags and non-evictable authorship flags implement a permissioned memory model with block-level ownership. Controlled probes with Qwen3-8B Q4 show that compile-time placement and the structural relationship between blocks and the task query affect task success, with the cliff falling at roughly the seventh block position when the task follows the registry and the twelfth when it precedes. Grouping the critical block with content-adjacent neighbours and promoting the group as a unit lifts task success by tens of percentage points at positions where single-block placement fails. Cross-model replication on Qwen2.5-7B, Llama-3.1-8B, Mistral-7B-v0.3, and Qwen3-14B shows the content-priming effect appears at the same absolute positions across families, with magnitude varying with model strength. Block grouping raises Mistral’s mean pass rate roughly fivefold at the hardest registry size, and a smaller model with the intervention can outperform a larger model without it in the mid-registry zone. Relevance gating reduces prompt cost by 67.8% while recovering 83% of the promoted-condition success rate. Schema eviction produces 0% invocations against 100% with the schema present, a property policy-based approaches cannot guarantee by construction. Shared-registry coordination reduces inter-agent communication to a method call at zero coordination token cost.
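To make the five primitives concrete, here is a minimal sketch of an ordered block registry. It is an illustration only and not RAMPART's actual API: the class names, method signatures, the `pinned` flag (a stand-in for a non-evictable flag), and the snapshot-based rollback are all assumptions.

```python
import copy
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    text: str
    pinned: bool = False   # hypothetical stand-in for a non-evictable flag
    enabled: bool = True   # toggled by gate()


class Registry:
    """Ordered, named blocks that are edited first and compiled into a prompt last."""

    def __init__(self):
        self._blocks = []
        self._history = []   # snapshots used by rollback()

    def _save(self):
        self._history.append(copy.deepcopy(self._blocks))

    def _index(self, name):
        for i, block in enumerate(self._blocks):
            if block.name == name:
                return i
        raise KeyError(name)

    def write(self, name, text, pinned=False):
        self._save()
        self._blocks.append(Block(name, text, pinned))

    def promote(self, name):
        i = self._index(name)
        self._save()
        self._blocks.insert(0, self._blocks.pop(i))   # move to the front

    def gate(self, name, enabled):
        i = self._index(name)
        self._save()
        self._blocks[i].enabled = enabled

    def evict(self, name):
        i = self._index(name)
        if self._blocks[i].pinned:
            raise PermissionError(f"block {name!r} is pinned")
        self._save()
        del self._blocks[i]

    def rollback(self):
        self._blocks = self._history.pop()   # undo the most recent change

    def compile(self):
        return "\n".join(b.text for b in self._blocks if b.enabled)


reg = Registry()
reg.write("system", "You are a planner.", pinned=True)
reg.write("tools", "Tool schema: search(query)")
reg.write("task", "Find the cheapest flight.")
reg.promote("task")
reg.gate("tools", False)
prompt = reg.compile()
reg.rollback()            # undoes the gate on "tools"
restored = reg.compile()
```

All edits happen on the in-memory registry, so none of them costs prompt tokens; only `compile()` produces text for the model.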
[MA-5] AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning
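[Quick read]: This paper addresses limitations of centralized frameworks for reinforcement learning (RL) of large language model (LLM) agents, in particular their difficulty supporting heterogeneous multi-model training, mixed multi-task training, fault-tolerant execution, and live code iteration. It proposes AgentJet, a distributed swarm training framework with a decoupled multi-node architecture: swarm server nodes host the trainable models and run optimization on GPU clusters, while swarm client nodes execute arbitrary agents on arbitrary devices. This design provides four capabilities: (1) heterogeneous multi-model RL, in which multiple different LLMs serve as the "brains" of a multi-agent team and are trained jointly; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution, so that external environment failures do not interrupt training; and (4) live code iteration, in which agent code is updated during training by replacing swarm client nodes. To make RL efficient in multi-model, multi-turn, multi-agent settings, AgentJet adds a context tracking module with timeline merging that consolidates redundant context and yields a 1.5-10x training speedup. The framework also includes an automated research system that takes a research topic as input and autonomously runs long-horizon, multi-day RL studies on large-scale clusters, reproducing key exploratory workflows of RL researchers without human intervention during execution.
Link: https://arxiv.org/abs/2606.04484
Authors: Qingxu Fu,Boyin Liu,Shuchang Tao,Zhaoyang Liu,Bolin Ding
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Technical report, 27 pages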
Abstract:We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLM as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.
[MA-6] When Freshness Is Not Enough: Distribution-Aware Age of Information for Networked LQR Control
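[Quick read]: This paper examines the bias that can arise when networked control systems are designed around mean Age of Information (AoI). Mean AoI is widely used in the design of wireless update systems, but its use as a proxy for closed-loop control performance lacks a control-theoretic derivation. For scalar linear time-invariant (LTI) systems with delayed intermittent updates, the authors show that, under state-independent scheduling policies, the infinite-horizon linear quadratic regulator (LQR) tracking problem reduces to an optimization over the distribution of inter-scheduling intervals. The resulting objective depends on higher-order moments of the intervals, and in unstable or correlated regimes on exponential moments, rather than on the mean alone. Policies with identical mean AoI can therefore incur substantially different tracking costs. The analysis is extended to disturbances with exponentially decaying autocorrelation, with equivalent cost formulations that expose the role of the full interval distribution. Validation on real vehicle trajectories from the NGSIM US-101 dataset matches the predicted performance trends, indicating that mean AoI alone is insufficient for control-oriented network design.
Link: https://arxiv.org/abs/2606.04361
Authors: Abdullah Y. Etcibasi,C. Emre Koksal,Eylem Ekici
Affiliations: The Ohio State University
Categories: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Robotics (cs.RO); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Comments: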
Abstract:Age of Information (AoI) has become a central metric for the design of wireless update systems, especially in applications where fresh measurements support tracking, estimation, and control. Despite its popularity, the use of mean AoI or peak AoI as a surrogate for closed-loop performance is often motivated by intuition rather than by a control-theoretic derivation. This paper examines whether minimizing the mean AoI is in fact optimal for networked control systems. For scalar linear time-invariant systems with delayed intermittent updates, we show that, under state-independent scheduling policies, the infinite-horizon LQR tracking problem reduces to an optimization over the distribution of inter-scheduling intervals. The resulting objective depends on higher-order statistical moments, and in unstable or correlated regimes on exponential moments, of the inter-scheduling process rather than only on its mean. Consequently, policies with identical mean AoI can induce substantially different tracking costs. We further extend the analysis to disturbances with exponentially decaying autocorrelation and derive equivalent cost formulations that expose the role of the full interval distribution. Finally, we validate the theory using real vehicle trajectories from the NGSIM US-101 dataset. The empirical results match the predicted performance trends, demonstrating that mean AoI alone is insufficient for control-oriented network design.
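The claim that equal mean AoI does not imply equal cost can be checked on a small numerical example. The sketch below is not the paper's LQR derivation. It assumes a discrete-time scalar system x[k+1] = a*x[k] + w[k] with unit-variance noise, uses the open-loop error variance at a given age as a stand-in for the tracking cost, and compares two hypothetical update schedules that were chosen to have the same time-average age.

```python
def mean_age(intervals):
    """Time-average age of a renewal schedule; `intervals` maps length -> probability.

    Age is 0 in the slot of an update and grows by 1 per slot until the next one.
    """
    expected_len = sum(p * x for x, p in intervals.items())
    expected_age_sum = sum(p * x * (x - 1) / 2 for x, p in intervals.items())
    return expected_age_sum / expected_len


def error_variance(age, a, noise_var=1.0):
    """Open-loop error variance of x[k+1] = a*x[k] + w[k] after `age` slots."""
    return noise_var * sum(a ** (2 * i) for i in range(age))


def mean_cost(intervals, a):
    """Time-average error variance under the same renewal schedule."""
    expected_len = sum(p * x for x, p in intervals.items())
    expected_cost_sum = sum(
        p * sum(error_variance(d, a) for d in range(x))
        for x, p in intervals.items()
    )
    return expected_cost_sum / expected_len


periodic = {4: 1.0}            # an update every 4 slots
bursty = {2: 0.75, 6: 0.25}    # mostly short gaps, occasionally a long one

a = 1.2  # unstable plant, so long gaps are penalized exponentially
print(mean_age(periodic), mean_age(bursty))
print(mean_cost(periodic, a), mean_cost(bursty, a))
```

Both schedules have a mean age of 1.5 slots, yet the bursty one incurs a higher average cost here because the cost depends on terms like a^(2X), an exponential moment of the interval length X, rather than on the mean alone.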
[MA-7] Organizational Control Layer: Governance Infrastructure at the Execution Boundary of LLM Agent Systems
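[Quick read]: This paper addresses the "execution-boundary" problem of LLM-based agents in real workflows: generated outputs may directly trigger state-changing actions, so proposed actions must be governed before they are executed. It proposes the Organizational Control Layer (OCL), a model-agnostic governance infrastructure that intercepts generated actions before execution through policy enforcement and escalation, without modifying the underlying LLM generator. In adversarial buyer-seller negotiation environments adapted from AgenticPay, OCL reduces unsafe executions from 88% to near zero while raising valid success from 12% to 96%. The results also reveal a safety-utility tradeoff: strict governance improves compliance and reliability but can reduce flexibility in tightly constrained markets. The authors conclude that deployment-grade LLM agent systems need explicit governance at the boundary between language generation and executable actions.
Link: https://arxiv.org/abs/2606.04306
Authors: Tianyu Shi,Yang Mo,Yiou Liu,Zhuonan Hao,Yin Wang,Wenzhuo Hu,Nan Yu,Meng Zhou,Jiangbo Yu
Affiliations: McGill; Purdue; UNSW; UCLA; NYU; Stevens; Aimaikj Research
Categories: Multiagent Systems (cs.MA)
Comments: 13 pages, 2 figures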
Abstract:LLM-based agents are increasingly deployed in workflows where generated outputs may directly trigger state-changing actions. This creates an execution-boundary problem: proposed actions must be governed before they are executed. We study this problem through economically consequential multi-agent interactions and argue that deployment-grade agent systems should separate proposal generation from environment-facing execution. To operationalize this principle, we introduce the Organizational Control Layer (OCL), a model-agnostic governance infrastructure that intercepts generated actions before execution through policy enforcement and escalation, without modifying the underlying LLM generator. We evaluate OCL on adversarial buyer–seller negotiation environments adapted from AgenticPay. Across multiple frontier LLM backends, OCL reduces unsafe executions from 88% to near-zero while increasing valid success from 12% to 96%. Results further reveal a safety–utility tradeoff: strict governance improves compliance and reliability against policy and constraint violations, but can reduce flexibility in tightly constrained markets. These findings suggest that deployment-grade LLM agent systems require explicit governance at the boundary between language generation and executable actions. The source code is available at: this https URL
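The separation of proposal generation from execution can be sketched as a small gate that sits between the two. This is an illustration under assumptions, not the OCL implementation: the action format, the budget rule, the 10% escalation margin, and all names are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    ESCALATE = "escalate"   # hand over to a human or a higher-level policy
    BLOCK = "block"


def govern(action, budget, escalation_margin=0.10):
    """Classify a proposed action before anything touches the environment."""
    if action.get("type") != "accept_offer":
        return Decision.BLOCK
    price = action.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        return Decision.BLOCK
    if price <= budget:
        return Decision.EXECUTE
    if price <= budget * (1 + escalation_margin):
        return Decision.ESCALATE
    return Decision.BLOCK


def run(action, budget, execute):
    """Only actions classified as EXECUTE reach the `execute` callback."""
    decision = govern(action, budget)
    if decision is Decision.EXECUTE:
        execute(action)
    return decision


executed = []
within = run({"type": "accept_offer", "price": 90}, 100, executed.append)
slightly_over = run({"type": "accept_offer", "price": 105}, 100, executed.append)
far_over = run({"type": "accept_offer", "price": 150}, 100, executed.append)
```

The generator that produced the action is never modified; the gate only decides whether its output is allowed to have side effects.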
[MA-8] What Makes Majority Illusion Easy to Detect?
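[Quick read]: This paper studies the detectability of "majority illusion" in social networks, that is, whether a network's structure allows agents to mistake a minority opinion for the dominant one. Formally, the q-Majority Illusion problem asks whether there is a binary labeling of agents in which at least a q-fraction of agents have a majority of neighbors carrying the minority label. The authors investigate how various structural properties of the underlying network influence the tractability of this question and provide a detailed map of its computational complexity.
Link: https://arxiv.org/abs/2606.04260
Authors: Šimon Schierreich,Ildikó Schlotter
Affiliations: University of Vienna; Institute of Science and Technology Austria
Categories: Social and Information Networks (cs.SI); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: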
Abstract:Majority illusion is an undesirable phenomenon in social networks in which agents incorrectly perceive a minority opinion as dominant. This can severely distort collective behavior and decision-making. We study the fundamental question of detecting whether a social network allows for a majority illusion. Formally, in the q-Majority Illusion problem, we ask whether there exists a binary labeling of agents in which at least a q-fraction of agents have the majority of neighbors with the minority label. We investigate how various structural properties of the underlying social network influence the tractability of this question, and provide a detailed map of its computational complexity.
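The problem statement can be made concrete with a brute-force checker for very small graphs. The sketch below is not an algorithm from the paper. It assumes that "majority" means strictly more than half of an agent's neighbors, that the minority label is the strictly less frequent of the two, and that tied labelings are skipped; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def illusion_fraction(adjacency, labels):
    """Fraction of agents whose neighborhood is dominated by the minority label.

    `adjacency` maps node ids 0..n-1 to neighbor lists. Returns None when the
    two labels are tied, because no minority label exists in that case.
    """
    n = len(adjacency)
    ones = sum(labels)
    if 2 * ones == n:
        return None
    minority = 1 if 2 * ones < n else 0
    fooled = 0
    for neighbors in adjacency.values():
        hits = sum(1 for v in neighbors if labels[v] == minority)
        if 2 * hits > len(neighbors):
            fooled += 1
    return Fraction(fooled, n)


def has_majority_illusion(adjacency, q):
    """Try all 2^n labelings; only feasible for tiny graphs."""
    n = len(adjacency)
    for labels in product((0, 1), repeat=n):
        fraction = illusion_fraction(adjacency, labels)
        if fraction is not None and fraction >= q:
            return True
    return False


# Star graph: node 0 is the center, nodes 1..4 are leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(has_majority_illusion(star, Fraction(4, 5)))  # True
print(has_majority_illusion(star, Fraction(1, 1)))  # False
```

On the star, giving the center the minority label fools all four leaves, but the center itself cannot be fooled at the same time, so a fraction of 4/5 is reachable and 1 is not. The exhaustive search is exponential in the number of agents, which is why the structural results on tractability matter.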
[MA-9] Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions
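[Quick read]: This paper asks how much an LLM agent should remember and how a multi-agent system should be connected in order to reach consensus. The central finding is that memory depth and network topology interact in a way that flips the sign of memory's effect on coordination: longer memory slows the time to reach steady state in decentralized networks but accelerates it in centralized ones. "Faster settling" in centralized networks, however, means locking in to a fragmented plateau more quickly rather than reaching system-wide consensus, which can be used to generate diverging opinions. The study also documents a memory-mediated speed-unity trade-off: centralized networks consistently preserve more competing conventions than decentralized ones, but their settling speed depends sharply on memory depth. At the agent level, high-betweenness bridge nodes suffer a brokerage penalty, while agents in locally clustered neighborhoods achieve higher coordination success. Finally, agents' choices are well captured by Fictitious Play, indicating belief-based rather than reward-based adaptation. The practical implication is that memory depth and communication topology should be co-designed, not optimized in isolation.
Link: https://arxiv.org/abs/2606.04197
Authors: Aliakbar Mehdizadeh,Martin Hilbert
Affiliations: University of California, Davis
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Comments: Submitted to the Journal of Artificial Societies and Social Simulation (JASSS)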
Abstract:How much should an LLM agent remember, and how should multi-agent systems be connected when trying to reach consensus? We show these two design choices interact in a way that flips the sign of memory’s effect on coordination. Across 432 simulation runs of a networked Naming Game on eight fixed 16-agent topologies, we vary memory depth and network structure. Longer memory slows the time to reach steady state in decentralized networks but accelerates it in centralized ones; the same parameter pushes the system in opposite directions depending on topology. Critically, “faster settling” in centralized networks means locking in to a fragmented plateau more quickly, not reaching system-wide consensus, which can be used to generate diverging opinions. We further document a memory-mediated speed-unity trade-off: centralized networks consistently preserve more competing conventions than decentralized networks, but their settling speed depends sharply on memory. At the agent level, within-network analyses show that high-betweenness bridges suffer a brokerage penalty while agents in locally clustered neighborhoods achieve higher coordination success. Finally, in search of analytically tractable generative mechanisms, we find that agents’ choices are well captured by Fictitious Play, indicating belief-based rather than reward-based adaptation. The practical implication: memory depth and communication topology should be co-designed, not optimized in isolation.
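Belief-based adaptation with a bounded memory can be sketched in a few lines. This is a generic, windowed Fictitious Play rule for a coordination game, not the model fitted in the paper: the agent best-responds to the empirical frequency of the names it remembers, which in a pure coordination game means choosing the most frequent one. The class name and the tie-breaking behaviour are assumptions.

```python
from collections import Counter, deque


class BeliefAgent:
    """Picks the name it has observed most often within a bounded memory."""

    def __init__(self, memory_depth, default_name):
        self.memory = deque(maxlen=memory_depth)  # oldest observations drop out
        self.default_name = default_name

    def observe(self, partner_name):
        self.memory.append(partner_name)

    def choose(self):
        if not self.memory:
            return self.default_name
        # Best response to the empirical frequency of remembered names.
        return Counter(self.memory).most_common(1)[0][0]


agent = BeliefAgent(memory_depth=3, default_name="A")
for name in ["A", "B", "B"]:
    agent.observe(name)
first_choice = agent.choose()    # memory holds A, B, B
for name in ["A", "A"]:
    agent.observe(name)
second_choice = agent.choose()   # memory now holds B, A, A
```

No reward signal appears anywhere in the update, which is what distinguishes this kind of rule from reward-based adaptation; `memory_depth` is the knob that the study varies alongside topology.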
[MA-10] Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents with an Affine-Typed Rust Mitigation as a Case Study
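[Quick read]: This paper addresses budget overruns by LLM agents in production, a failure class in which a single retry loop can spend thousands of dollars before an operator notices. The in-process integrity properties that would prevent it (no aliasing, no double-spend, no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. The central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks, organized into an eight-cluster failure taxonomy. As one mitigation, the author builds token-budgets, a 1,180-line Rust crate with no unsafe code that operationalizes affine ownership, so that cloning a budget, double-spending it, or using it after delegation is a compile error rather than a runtime hazard the operator must remember to avoid. The dollar cap itself is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter also achieves 0/30 overshoot, so the distinguishing value is non-bypassability in multi-agent delegation: the delegation-fanout race is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Binary-level cap-soundness on the running binary is left open.
Link: https://arxiv.org/abs/2606.04056
Authors: Sajjad Khan
Affiliations: University of the West of England
Categories: Software Engineering (cs.SE); Multiagent Systems (cs.MA); Programming Languages (cs.PL)
Comments: 26 pages. Artifact (catalog CSV, Rust crate, formal proofs): this https URL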
Abstract:LLM-agent budget overruns are a documented production failure class: a single retry loop can spend thousands of dollars before an operator notices, and the in-process integrity properties that would prevent it (no aliasing, no double-spend, no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023-2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight-cluster failure taxonomy (inter-rater Cohen’s kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token-budgets, an 1,180-line Rust crate (no unsafe) that operationalizes affine ownership so that cloning, double-spending, or using a budget after delegating it are compile errors rather than runtime hazards an operator must remember to avoid. The dollar cap is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non-bypassability under operator error in multi-agent delegation: the delegation-fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over-reservation is 4-6x (2.11x adaptive). Binary-level cap-soundness on the running binary is left open.
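The fan-out race mentioned in the abstract can be reproduced in miniature with asyncio. The sketch below is not the paper's crate or its experiment, and Python cannot enforce affine ownership at compile time. It only contrasts a check-then-await-then-charge pattern with a reserve-before-await pattern; `asyncio.sleep(0)` stands in for an LLM call, and all names are hypothetical.

```python
import asyncio


class Budget:
    def __init__(self, cap):
        self.cap = cap
        self.spent = 0


async def racy_call(budget, cost):
    """Checks the cap, yields control during the call, then charges."""
    if budget.spent + cost <= budget.cap:
        await asyncio.sleep(0)    # other workers pass the same check here
        budget.spent += cost


async def reserved_call(budget, cost):
    """Reserves the cost before yielding, so later checks see it."""
    if budget.spent + cost > budget.cap:
        return False
    budget.spent += cost
    await asyncio.sleep(0)
    return True


async def fan_out(worker, cap, cost, n):
    """Delegates the same budget to n concurrent workers."""
    budget = Budget(cap)
    await asyncio.gather(*(worker(budget, cost) for _ in range(n)))
    return budget.spent


racy_total = asyncio.run(fan_out(racy_call, cap=10, cost=4, n=5))
reserved_total = asyncio.run(fan_out(reserved_call, cap=10, cost=4, n=5))
print(racy_total, reserved_total)  # 20 8
```

The reserved version stays under the cap only as long as every caller follows the discipline; the abstract's point is that an affine type system turns that discipline into something the compiler checks.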
[MA-11] Assistax: A Multi-Agent Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics
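[Quick read]: This paper addresses the shortage of reinforcement learning (RL) benchmarks for embodied applications, in particular the complex interactions that arise in assistive robotics. Games such as Go and Atari have dominated RL benchmarks and driven many breakthroughs, but they often do not translate directly to real-world embodied settings. The authors introduce Assistax, an open-source benchmark for assistive robotics tasks built on physics-based simulation. Assistax uses JAX's hardware acceleration and, when training runs are vectorised, runs up to 370x faster than CPU-based alternatives. It models the interaction between an assistive robot and an active human patient with multi-agent RL (MARL), training a population of diverse partner agents against which a robotic agent's zero-shot coordination can be tested. Extensive evaluation and hyperparameter tuning of popular continuous-control RL and MARL algorithms provide reliable baselines.
Link: https://arxiv.org/abs/2507.21638
Authors: Leonard Hinckeldey,Elliot Fosong,Rimvydas Rubavicius,Elle Miller,Trevor McInroe,Fan Zhang,Patricia Wollstadt,Stefano V. Albrecht,Subramanian Ramamoorthy
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: Accepted at the Reinforcement Learning Conference 2026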
Abstract:The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run and easy to understand. While games such as Go and Atari have led to many breakthroughs, they often do not directly translate to real-world embodied applications. In recognising the need to diversify RL benchmarks and addressing complexities that arise in embodied interaction scenarios, we introduce Assistax: an open-source benchmark designed to address challenges arising in assistive robotics tasks. Assistax uses JAX’s hardware acceleration for significant speed-ups for learning in physics-based simulations. In terms of open-loop wall-clock time, Assistax runs up to 370x faster when vectorising training runs compared to CPU-based alternatives. Assistax conceptualises the interaction between an assistive robot and an active human patient using multi-agent RL to train a population of diverse partner agents against which an embodied robotic agent’s zero-shot coordination capabilities can be tested. Extensive evaluation and hyperparameter tuning for popular continuous control RL and MARL algorithms provide reliable baselines and establish Assistax as a practical benchmark for advancing RL research for assistive robotics. The code is available at: this https URL.
Natural Language Processing
[NLP-0] STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations
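[Quick read]: This paper addresses the computational cost and accuracy of Training Data Attribution (TDA) for Large Language Models (LLMs). The gold standard for TDA relies on causal interventions, adding or removing training data and observing how the model changes, but repeated retraining of LLMs is prohibitively expensive. Most approximations instead track gradients in parameter space, which is costly at billions of parameters and relies on local approximations. The paper proposes a shift: rather than estimating parameter changes, it models the functional effect of training data in activation space. STRIDE (Steering-based Training Data Influence Decomposition) formulates TDA as a sparse recovery problem in the spirit of compressive sensing. It learns lightweight "steering operators" that mimic the behavioral shift caused by training on data subsets, measures how these operators perturb test predictions, and recovers the influence of individual training examples through sparse linear decomposition. STRIDE achieves state-of-the-art results for LLM pre-training attribution while running about 13x faster than prior methods, and its practical utility is validated on data selection, data contamination detection, and qualitative analysis.
Link: https://arxiv.org/abs/2606.05165
Authors: Rishit Dagli,Abir Harrasse,Luke Zhang,Florent Draye,Amirali Abdullah,Bernhard Schölkopf,Zhijing Jin
Affiliations: Jinesis AI Lab, University of Toronto; Vector Institute; Max Planck Institute for Intelligent Systems, Tübingen, Germany; Thoughtworks; Martian; ELLIS Institute, Tübingen, Germany; EuroSafeAI
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: project page: this https URL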
Abstract:Training Data Attribution (TDA) seeks to trace a model’s predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLMs). Consequently, most approaches approximate this effect in the parameter space using gradients. However, tracking gradients across billions of parameters is not only prohibitively expensive but relies on local approximations. In this work, we propose a shift: rather than estimating parameter changes, we model the functional effect of training data in the activation space. We introduce STRIDE (Steering-based Training Data Influence Decomposition), a framework that formulates TDA as a sparse recovery problem in the spirit of compressive sensing. STRIDE learns lightweight “steering operators” that mimic the behavioral shift caused by training on data subsets. By measuring how these operators perturb test predictions, we recover individual training example influences via sparse linear decomposition. STRIDE achieves state-of-the-art for LLM pre-training attribution while being an order of magnitude (13x) faster than previous art. We further validate its practical utility through downstream applications including data selection, data contamination, and qualitative analysis.
[NLP-1] Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models
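[Quick read]: This paper addresses the tendency of audio-language models (ALMs) to follow text that conflicts with the audio, even when the audio evidence is clear. The core question is whether the audio-supported answer is unavailable to the model or is represented but overridden by the conflicting text. The authors use a "same-audio counterfactual" that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This suggests that the audio evidence is encoded but loses in arbitration. Activation patching localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Based on this diagnostic, the authors propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
Link: https://arxiv.org/abs/2606.05161
Authors: Yichen Gao,Yiqun Zhang,Zijing Wang,Yujia Li,Heng Guo,Xi Wu,Xiaocui Yang,Shi Feng,Yifei Zhang,Daling Wang
Affiliations: Northeastern University (东北大学); Shanghai Artificial Intelligence Laboratory
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: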
Abstract:Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
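The idea of interpolating between the joint and same-audio scores can be sketched as follows. The abstract does not specify GACL's gate or weighting, so both are assumptions here: the gate fires when the two branches disagree on the top candidate, and a fixed weight `alpha` mixes the scores. The function name and the score dictionaries are hypothetical.

```python
def gated_interpolation(joint_scores, audio_scores, alpha=0.5):
    """Mix joint and same-audio candidate scores when the branches disagree.

    joint_scores: candidate -> score with audio and conflicting text present.
    audio_scores: candidate -> score with the conflicting text removed.
    """
    joint_top = max(joint_scores, key=joint_scores.get)
    audio_top = max(audio_scores, key=audio_scores.get)
    if joint_top == audio_top:
        return dict(joint_scores)   # gate closed: leave the scores untouched
    return {
        candidate: (1 - alpha) * joint_scores[candidate]
        + alpha * audio_scores[candidate]
        for candidate in joint_scores
    }


joint = {"dog": 1.0, "cat": 2.0}   # the conflicting text pushes toward "cat"
audio = {"dog": 3.0, "cat": 0.0}   # the audio alone supports "dog"
corrected = gated_interpolation(joint, audio, alpha=0.5)
answer = max(corrected, key=corrected.get)
```

Because the rule only rescales scores at decoding time, it needs no training, which matches the "training-free" description in the abstract.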
[NLP-2] Reinforcement Learning from Rich Feedback with Distributional DAgger
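【Quick read】: This paper addresses the narrow use of feedback in the dominant reinforcement learning from verifiable rewards (RLVR) recipe for reasoning models, which rewards each sampled response with a single bit indicating whether the final answer is correct. Many settings offer richer feedback, such as execution traces, tool outputs, expert corrections, and model self-evaluations, that this recipe leaves unused. The proposed solution, DistIL, is a distributional variant of the classic imitation learning algorithm DAgger in which the learner has local access to an expert distribution on the states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient assigns credit by propagating future expert-student disagreement back to earlier decisions. The authors show that prior RL-with-self-distillation objectives based on reverse KL or Jensen-Shannon divergence fail to guarantee monotonic policy improvement and may raise the probability of worse actions even when the expert has higher reward, whereas the forward cross-entropy objective admits monotonic policy improvement and comes with regret guarantees. The objective also optimizes a lower bound on the teacher-weighted likelihood of success, which improves Pass@N. Empirically, DistIL improves over RLVR and RL-with-self-distillation baselines on scientific reasoning, coding, and hard mathematical problems.
Link: https://arxiv.org/abs/2606.05152
Authors: Rishabh Agrawal,Jacob Fein-Ashley,Paria Rashidinejad
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: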
Abstract:Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
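To make the two families of objectives named above concrete, here is a minimal single-state sketch of forward cross-entropy (expectation under the expert) and reverse KL (expectation under the student). It only shows the textbook definitions on one action distribution; the paper's sequence-level objective and its guarantees are not reproduced, and the toy distributions are made up. The sketch assumes the expert puts non-zero probability on every action.

```python
import math


def forward_cross_entropy(expert, student):
    """H(expert, student) = -sum_a expert(a) * log student(a)."""
    return -sum(pe * math.log(ps) for pe, ps in zip(expert, student) if pe > 0)


def reverse_kl(expert, student):
    """KL(student || expert) = sum_a student(a) * log(student(a) / expert(a))."""
    return sum(ps * math.log(ps / pe) for pe, ps in zip(expert, student) if ps > 0)


# Toy action distributions over three actions at a single state.
expert = [0.7, 0.2, 0.1]
student = [0.2, 0.5, 0.3]

fce = forward_cross_entropy(expert, student)
rkl = reverse_kl(expert, student)
```

The forward form weights each action by the expert's probability, so it needs only expert samples or probabilities rather than gradients through the expert. The reverse form weights by the student's own probability.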
[NLP-3] Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
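【Quick read】: This paper addresses the waste in test-time scaling when a post-trained language model fails on a reasoning problem: the usual response is to spend more compute on additional attempts and discard the failed traces. The authors argue that failed traces encode recoverability structure, that is, the inference-time signature of which test-time interventions can rescue a given failure. Their key idea is to recover this structure from the distributional signature of failed rollouts rather than from their text, using three problem-level trajectory features derived from the structure of the available interventions. These features cluster failures into stable regimes, characterize the failure topography of different post-training methods (84.3 ± 4.3% accuracy, +20% over a majority-class baseline), and support a training-free routing rule that lifts rescue by +12.2% on the deployment-relevant Steerable-Hard subset. The features and the routing rule transfer across two cross-family probes and require no training-time or weight-space access, turning discarded failed traces into a diagnostic object for test-time routing and post-training analysis.
Link: https://arxiv.org/abs/2606.05145
Authors: Nizar Islah,Istabrak Abbes,Irina Rish,Sarath Chandar,Eilif B. Muller
Affiliations: Mila - Quebec AI Institute; Université de Montréal; Polytechnique Montréal; CHU Sainte-Justine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: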
Abstract:When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions can rescue a given failure. Three problem-level trajectory features, derived from the structure of available interventions, recover this structure from the distributional signature of failed rollouts, not their text. They cluster failures into stable regimes, characterize the failure topography of different post-training methods (84.3 ± 4.3% accuracy, +20% over a majority-class baseline), and support a training-free routing rule that lifts rescue by +12.2% on the deployment-relevant Steerable-Hard subset (failures where retry is insufficient and a bounded intervention is reachable). The features and the routing rule transfer across two cross-family probes. The same three features thus convert failed traces from discarded data into a diagnostic object, supporting test-time routing and post-training analysis without training-time or weight-space access.
[NLP-4] Activation-Based Active Learning for In-Context Learning: Challenges and Insights
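【Quick read】: This paper asks whether recent advances in understanding transformer activations can improve the selection of in-context examples for large language models (LLMs). The hypothesis under test is that MLP activations provide a fine-grained signal for choosing more informative examples. The study reports a negative result: viewed through massive activations or the first four moments, MLP outputs do not correlate with example quality or task performance. The absolute Spearman rank correlation coefficient is at most 0.33 across all tasks and models tested, so such activation-based sampling is not suitable for in-context learning. The authors hypothesise that this may be due to superposition, whereby models represent more features than they have dimensions, and suggest Sparse Autoencoders (SAEs) as a promising future direction.
Link: https://arxiv.org/abs/2606.05134
Authors: Yaseen M. Osman,Geoff V. Merrett,Stuart E. Middleton
Affiliations: University of Southampton
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures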
Abstract:Deep active learning has previously been explored for LLM in-context sample selection, but not with methods that utilise recent advances in understanding of transformer activations. In this paper, we test the hypothesis that model activations could provide a fine-grained signal to optimise the selection of in-context examples. We present the most comprehensive analysis to date of MLP activation-based deep active learning methods applied to in-context learning, including how different attention masking strategies impact active learning across diverse classification and generative datasets, using both Llama-3.2-3B and Qwen2.5-3B base models. However, we find a negative result: MLP outputs, viewed through the lenses of massive activations or the first four moments, do not correlate with example quality or task performance. Specifically, the absolute Spearman correlation coefficient is at most 0.33 for all tasks and models we tested, showing that such activation-based sampling should not be used for in-context learning. We hypothesise that this may be due to superposition, whereby models represent more features than they have dimensionality, suggesting that methods like Sparse Autoencoders (SAEs) may be a promising future direction.
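The headline statistic above is a Spearman rank correlation between an activation-derived score and example quality. As a reminder of what that coefficient measures, here is a small standard-library sketch that ranks both variables (averaging ranks over ties) and takes the Pearson correlation of the ranks. The input lists are toy numbers, not data from the paper.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: hypothetical activation scores vs. downstream accuracy.
activation_score = [1, 2, 3, 4, 5]
task_accuracy = [2, 1, 4, 3, 5]
rho = spearman(activation_score, task_accuracy)
```

A coefficient near 0 means the ranking induced by the activation statistic carries little information about the ranking by task performance, which is how the reported ceiling of 0.33 should be read.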
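[NLP-5] Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data
【Quick read】: This paper asks whether a large language model can predict how an external judge will score its own output across multiple quality attributes, without large amounts of additional annotation or targeted training. The authors find that the ability is largely present already: prompted few-shot, a base model predicts a judge's scores well above chance on three benchmarks. Their method, Self-Evaluation Elicitation (SEE), surfaces this latent ability through a short cycle: a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. Using 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, which indicates a transferable notion of quality rather than a single judge's preference. The work thus reframes judge-aligned self-evaluation as a problem of elicitation rather than acquisition.
Link: https://arxiv.org/abs/2606.05122
Authors: XiuYu Zhang,Yi Shan,Junfeng Fang,Zhenkai Liang
Affiliations: National University of Singapore; Beijing University of Technology
Subjects: Computation and Language (cs.CL)
Comments: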
Abstract:Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge’s multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model’s own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge’s preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.
[NLP-6] Audio Interaction Model
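【Quick read】: This paper addresses the offline nature of today's Large Audio Language Models (LALMs), which cannot sustain continuous, real-time audio interaction, while existing streaming audio models each handle only a single task such as streaming ASR or voice chat. The authors formalize a unified online regime as the Audio Interaction Model and realize it with Audio-Interaction, a unified streaming model that retains offline task execution and adds online general audio instruction following, from dialogue to full voice chat. The model runs an always-on perceive-decide-respond loop, listening to sound, environment, and instructions in real time and deciding when to respond. To support this, they propose SoundFlow, a framework that covers the pipeline from data to training to deployment through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference. They also construct StreamAudio-2M, a 2.6M-item streaming corpus, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction remains competitive on mainstream audio tasks while adding capabilities that offline LALMs lack, including real-time ASR, streaming audio instruction following, and proactive help.
Link: https://arxiv.org/abs/2606.05121
Authors: Zhifei Xie,Zihang Liu,Ze An,Xiaobin Hu,Yue Liao,Ziyang Ma,Dongchao Yang,Mingbao Lin,Deheng Ye,Shuicheng Yan,Chunyan Miao
Affiliations: Nanyang Technological University; National University of Singapore; The Chinese University of Hong Kong
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Next generation of LALMs, work in progress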
Abstract:Audio is an inherently interactive modality, yet today’s Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.
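[NLP-7] Continual Visual and Verbal Learning Through a Child's Egocentric Input
【Quick read】: This paper addresses a mismatch between how neural networks and children learn word-referent mappings: existing models cycle through shuffled data for hundreds of epochs, whereas children experience the world as a continuous, temporally ordered stream. The proposed solution is BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL pairs a multi-stage temporal segmentation of the stream with a dual replay buffer that manages visual and multimodal histories independently, and it is trained jointly with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark and substantially narrows the gap to an offline-training upper bound. Ablations show that the gains are robust to the length of the online temporal segmentation window and to the eviction rule of the replay buffer. The results indicate that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.
Link: https://arxiv.org/abs/2606.05115
Authors: Xiaoyang Jiang,Yanlai Yang,Kenneth A. Norman,Brenden Lake,Mengye Ren
Affiliations: New York University; Princeton University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 4 figures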
Abstract:Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child’s egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child’s actual experience.
[NLP-8] Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases
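【Quick read】: This paper addresses a basic gap in how large language models (LLMs) are evaluated for clinical use: static, single-turn benchmarks cannot capture how a model delivers care dynamically across an encounter, including information gathering, treatment planning, and longitudinal management. The proposed solution is MedSP1000, an interactive benchmark derived from standardized patients (SPs) that comprises 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. It converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubrics. In each simulation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against the expert criteria in the original materials. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, the strongest medically specialized model reaches 40.0%, and increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough for safe integration into clinical practice, and that process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.
Link: https://arxiv.org/abs/2606.05112
Authors: Cheng Liang,Pengcheng Qiu,Ya Zhang,Yanfeng Wang,Chaoyi Wu,Weidi Xie
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: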
Abstract:Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.
[NLP-9] Arithmetic Pedagogy for Language Models
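【Quick read】: This paper investigates whether human mathematics pedagogy can guide the training of language models toward arithmetic reasoning, with a focus on obtaining accurate arithmetic from a small, low-resource model. It builds on GASING, an Indonesian pedagogy that solves basic arithmetic through a left-to-right procedure aligned with the causal order of token generation. Each operation is turned into a computational procedure whose execution trace is serialized into natural-language Chain-of-Thought (CoT) supervision, and a small GPT-2 decoder (86M parameters) is trained from scratch with only a next-token prediction objective, without reinforcement learning or reward-based optimization. Mechanistic analyses (attention-masking interventions, residual-stream probing, and logit-lens inspection) show three learning phases in which the model first internalizes a procedural pathway and later develops an associative "mental-arithmetic" capacity that retrieves intermediate results without explicit step-by-step computation. The trained model reaches over 80% accuracy on held-out problems and is competitive with substantially larger language models. The key takeaway is that targeted, pedagogically grounded training data and supervision can yield strong and economical arithmetic capability at small scale.
Link: https://arxiv.org/abs/2606.05106
Authors: Andhika Bernard Lumbantobing,Hokky Situngkir
Affiliations: Bandung Fe Institute; Adjunct Science Fellow in InaAI; AI Research Center IT Del
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 18 pages, 6 figures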
Abstract:We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Building on the GASING method – an Indonesian pedagogy that solves basic arithmetic through a left-to-right procedure aligned with the causal order of token generation – we operationalize each operation as a computational procedure whose execution trace is serialized into natural-language Chain-of-Thought (CoT) supervision. A small GPT-2 decoder (86M parameters) with a syllabic-agglutinative TOBA tokenizer for Indonesian is trained from scratch on this data using only a next-token prediction objective, without reinforcement learning or reward-based optimization. Monitoring training reveals three distinct learning phases, and mechanistic analyses – attention-masking interventions on the CoT information graph, residual-stream probing, and logit-lens inspection – show that the model first internalizes a procedural pathway and subsequently develops an associative, “mental-arithmetic” capacity that retrieves intermediate results without explicit step-by-step computation. The trained model reaches over 80% accuracy on held-out problems and attains competitive performance against substantially larger language models, indicating that targeted, pedagogically grounded training can yield strong and economical arithmetic capability at small scale.
[NLP-10] Light or Full Verb? A Minimal-Pair Dataset for Probing Phraseological Competence in Language Models
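【Quick read】: This paper asks whether language models distinguish the light-verb and full-verb uses of frequent English verbs such as 'have' and 'make', as in 'make a decision' vs. 'make a cake'. The same verb can play very different syntactic and semantic roles depending on context, and it is unclear whether language models represent this distinction. The key contribution is a large-scale controlled dataset of minimally varying English sentence series in which the same context contains the same verb in light-verb and full-verb uses. Two probing experiments show that language models differentiate these uses even in minimal contexts and exhibit separable patterns across object types. The dataset, generation code, and materials are released as a reusable resource, and the framework supports extensions to broader contexts, additional verbs, and other languages.
Link: https://arxiv.org/abs/2606.05087
Authors: Francesca Franzon,Nicolas Rosàs Gómez,Leo Wanner
Affiliations: Universitat Pompeu Fabra (UPF); Google DeepMind
Subjects: Computation and Language (cs.CL)
Comments: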
Abstract:Frequent English verbs such as ‘have’ and ‘make’ can function either as collocates in light-verb constructions or as full lexical predicates, as in ‘make a decision’ vs. ‘make a cake’. Whether language models represent this distinction remains unclear. We introduce a large-scale controlled dataset of minimally varying English sentence series in which the same context contains the same verb in light-verb and full-verb uses. Two probing experiments show that language models differentiate between these uses even in minimal contexts and exhibit separable patterns across object types. We release the dataset, generation code, and materials as a reusable resource. The framework supports extensions to broader contexts, additional verbs, and other languages.
[NLP-11] Automatic Generation of Titles for Research Papers Using Language Models
【Quick read】: This paper addresses automatic title generation for research papers, that is, producing an accurate and concise title from a paper's abstract. The approach fine-tunes open-weight pre-trained and large language models on the CSPubSum and LREC-COLING-2024 datasets and on SpringerSSAT, a new dataset curated from four Springer journals in the social sciences; GPT-3.5-turbo is also evaluated in a zero-shot setting. Performance is measured with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore. Fine-tuned PEGASUS-large outperforms the other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. The study also shows that ChatGPT can generate creative titles and concludes that AI-generated titles are generally appropriate and reliable.
Link: https://arxiv.org/abs/2606.05085
Authors: Tohida Rehman,Debarshi Kumar Sanyal,Samiran Chattopadhyay
Affiliations: Jadavpur University; Indian Association of Science
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages, 24 tables, 01 figure
Abstract:The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an appropriate title is often challenging, and automated title generation can assist authors in this task. In this work, we propose a technique to generate paper titles from abstracts using open-weight pre-trained and large language models. We use the CSPubSum and LREC-COLING-2024 datasets and introduce a new dataset, SpringerSSAT, curated from four Springer journals in the social sciences. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate titles. Model performance is evaluated with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore metrics. Our experiments show that fine-tuned PEGASUS-large outperforms other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. We further demonstrate that ChatGPT can generate creative paper titles. Overall, AI-generated titles are generally appropriate and reliable.
[NLP-12] Fast Faithful Function Vectors
【Quick read】: This paper addresses the underexplored design choices in formulating function vectors (FVs), the task representations elicited during in-context learning that can be used to steer Large Language Models (LLMs). It studies two degrees of freedom: attention head selection and steering. For head selection, gradient-based attributions with Layer-wise Relevance Propagation (LRP) substantially improve both efficiency and accuracy. For steering, applying the FV in a distributed manner yields higher accuracy than simple aggregation. The code is publicly available.
Link: https://arxiv.org/abs/2606.05079
Authors: Minh An Pham,Anton Segeler,Thomas Wiegand,Wojciech Samek,Sebastian Lapuschkin,Patrick Kahardipraja,Reduan Achtibat
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Function vectors (FVs) are task representations elicited during in-context learning that can be used to steer Large Language Models (LLMs). However, design choices in their formulation remain underexplored. In this work, we study the impact of varying FV definitions for instructions along two degrees of freedom: attention head selection and steering. For head selection, using gradient-based attributions with Layer-wise Relevance Propagation (LRP) substantially improves efficiency as well as accuracy. For FV steering, applying it in a distributed manner yields a higher accuracy compared to simple aggregation. Our code is publicly available.
[NLP-13] Boosting Self-Consistency with Ranking ACL
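【Quick read】: This paper addresses a weakness of majority voting in self-consistency: it often fails to recover correct answers that are already present among the sampled reasoning paths but occur infrequently. The proposed solution, Ranking-Improved Self-Consistency (RISC), reformulates answer selection as a ranking problem. RISC uses a lightweight LambdaRank model to score candidate answers with five designed features that capture answer frequency, semantic centrality, and reasoning-trace consistency. Evaluated on three datasets under a range of test-time budgets, RISC achieves a better accuracy-efficiency trade-off than standard self-consistency and strong baselines, with particularly large gains on question answering benchmarks. Further analysis shows that the features are individually useful and complementary, which highlights the value of learning to combine multiple signals for test-time answer selection.
Link: https://arxiv.org/abs/2606.05054
Authors: Maria Marina,Daniil Moskovskiy,Sergey Pletenev,Mikhail Salnikov,Alexander Panchenko,Viktor Moskvoretskii
Affiliations: AIRI; Skoltech; EPFL
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 13 figures, accepted at ACL Student Research Workshop 2026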
Abstract:Self-consistency improves large language models by sampling multiple reasoning paths and selecting the most frequent answer, but majority voting often fails to recover correct answers that are already present among the samples. We address this limitation with Ranking-Improved Self-Consistency (RISC), which reformulates answer selection in self-consistency as a ranking problem. Instead of relying on a single uncertainty or confidence signal, RISC uses a lightweight LambdaRank model to score candidate answers with five carefully designed features that capture answer frequency, semantic centrality, and reasoning-trace consistency. We evaluate RISC on three datasets under a range of test-time budgets. Across datasets, RISC consistently achieves a better accuracy-efficiency trade-off than standard self-consistency and strong baselines, with particularly large gains on question answering benchmarks. Further analysis shows that the proposed features are individually useful and, more importantly, complementary, highlighting the value of learning to combine multiple informative signals for test-time answer selection.
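The contrast between majority voting and feature-based ranking can be sketched in a few lines. The example below is a deliberately simplified stand-in: it uses only two features (answer frequency and a hypothetical per-sample `trace_consistency` score) and a hand-weighted linear scorer in place of the learned LambdaRank model, so it illustrates the reformulation rather than RISC itself. All names and numbers are assumptions.

```python
from collections import Counter


def majority_vote(answers):
    """Standard self-consistency: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


def rank_candidates(answers, trace_consistency, w_freq=1.0, w_cons=1.0):
    """Rank distinct answers by a weighted sum of two toy features.

    trace_consistency: one hypothetical score in [0, 1] per sampled path.
    The linear weights stand in for a learned ranker.
    """
    counts = Counter(answers)
    total = len(answers)
    scores = {}
    for cand, n in counts.items():
        freq = n / total
        mean_cons = sum(
            c for a, c in zip(answers, trace_consistency) if a == cand
        ) / n
        scores[cand] = w_freq * freq + w_cons * mean_cons
    return sorted(scores, key=scores.get, reverse=True)


# Five sampled paths: "12" is more frequent, "15" has more consistent traces.
answers = ["12", "12", "12", "15", "15"]
trace_consistency = [0.2, 0.3, 0.1, 0.9, 0.95]

voted = majority_vote(answers)
ranked = rank_candidates(answers, trace_consistency)
```

Here majority voting returns "12", while the combined score ranks "15" first, the kind of minority answer that frequency alone cannot recover.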
[NLP-14] In-Context Graphical Inference
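【Quick read】: This paper addresses the tension between exactness and scalability in marginal inference for discrete graphical models: exact algorithms are intractable on high-treewidth graphs, while iterative approximations such as Belief Propagation and variational methods lose convergence guarantees on frustrated topologies. The authors attribute this to a mismatched inductive bias, since iterative methods abandon the sequential elimination structure that makes exact inference correct. They propose In-Context Graphical Inference (ICG-I), an autoregressive Graph Transformer that restores this structure by mimicking Variable Elimination with learned, Tensor-Train (TT)-compressed intermediate factors. It is paired with a Dirichlet output layer and Weighted Conformal Prediction (WCP) for calibrated, distribution-free coverage guarantees under topological shift. The paper proves that TT compression errors propagate at most linearly through the autoregressive chain, that the Dirichlet-Multinomial loss is a proper scoring rule, and that WCP maintains coverage with a quantifiable degradation under estimated density ratios. In experiments, ICG-I reduces MAE from 0.041 (best baseline) to 0.020 on standard instances and achieves 0.048 on N=500 frustrated spin glasses, where BP diverges entirely.
Link: https://arxiv.org/abs/2606.05042
Authors: Zehua Cheng,Wei Dai,Jiahao Sun
Affiliations: FLock.io; University of Oxford
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments: 19 Pages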
Abstract:Marginal inference in discrete graphical models forces a choice between exactness and scalability: exact algorithms are intractable for high-treewidth graphs, while iterative approximations (Belief Propagation, variational methods) sacrifice convergence guarantees on frustrated topologies. We argue that this dichotomy stems from a mismatched inductive bias: iterative methods abandon the sequential elimination structure that makes exact inference correct. We introduce In-Context Graphical Inference (ICG-I), an autoregressive Graph Transformer that restores this structure by mimicking Variable Elimination with learned, Tensor-Train-compressed intermediate factors, paired with a Dirichlet output layer and Weighted Conformal Prediction for calibrated, distribution-free coverage guarantees under topological shift. We prove that TT compression errors propagate at most linearly through the autoregressive chain, that the Dirichlet-Multinomial loss is a proper scoring rule, and that WCP maintains coverage with a quantifiable degradation under estimated density ratios. We conducted intensive experiments to evaluate ICG-I and achieved state-of-the-art performance across all benchmarks. ICG-I reduces MAE from 0.041 (best baseline) to 0.020 on standard instances and achieves 0.048 on N=500 frustrated spin glasses where BP diverges entirely.
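The "sequential elimination structure" mentioned above is easiest to see on a chain. The sketch below implements classical variable elimination for the marginal of the last variable in a small chain-structured model and checks it against brute-force enumeration. It is the textbook exact algorithm on a toy instance, not ICG-I; the potentials are made-up numbers, and there is no learning or Tensor-Train compression here.

```python
from itertools import product


def chain_marginal_ve(unary, pairwise):
    """Marginal of the last variable, eliminating variables left to right.

    unary[i][s]: potential of variable i in state s.
    pairwise[i][a][b]: potential linking variable i (state a) to i+1 (state b).
    """
    k = len(unary[0])
    message = list(unary[0])
    for i in range(1, len(unary)):
        # Sum out variable i-1, producing an intermediate factor over variable i.
        message = [
            unary[i][b] * sum(message[a] * pairwise[i - 1][a][b] for a in range(k))
            for b in range(k)
        ]
    z = sum(message)
    return [m / z for m in message]


def chain_marginal_brute(unary, pairwise):
    """Same marginal by enumerating every joint assignment (exponential cost)."""
    k, n = len(unary[0]), len(unary)
    totals = [0.0] * k
    for assign in product(range(k), repeat=n):
        w = 1.0
        for i, s in enumerate(assign):
            w *= unary[i][s]
        for i in range(n - 1):
            w *= pairwise[i][assign[i]][assign[i + 1]]
        totals[assign[-1]] += w
    z = sum(totals)
    return [t / z for t in totals]


# Toy 3-variable binary chain with attractive couplings.
unary = [[1.0, 2.0], [1.0, 1.0], [3.0, 1.0]]
pairwise = [[[2.0, 1.0], [1.0, 2.0]], [[2.0, 1.0], [1.0, 2.0]]]

ve = chain_marginal_ve(unary, pairwise)
brute = chain_marginal_brute(unary, pairwise)
```

On a chain each intermediate factor stays small. On high-treewidth graphs these factors grow exponentially, which is the cost the paper targets with compressed, learned intermediate factors.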
[NLP-15] Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair
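【Quick read】: This paper addresses the unidirectional inductive bias of autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs): each step conditions only on prior tokens, so a single early logical or arithmetic mistake can snowball and corrupt the entire reasoning chain. The proposed solution is Teleological Reasoning Infilling (TRI), a training framework that recasts erroneous reasoning segments as fill-in-the-middle (FIM) tasks. Given a verified prefix premise P, a verified downstream milestone S, and the original query Q, the model must generate a rigorous and complete logical bridge M that connects P to S. To make this work with standard causal architectures, the authors use a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, so that M can attend to both P and S without any change to the self-attention mechanism. Training has two stages: supervised fine-tuning (SFT) on symbolically verified (P, S, M) triples extracted from formal mathematics corpora, followed by Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, which eliminates LLM-judge sycophancy. At inference, TRI acts as a surgical repair module in a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment while leaving verified sections intact. Experiments on three benchmarks report state-of-the-art performance and a 31.2% reduction in per-problem token expenditure.
Link: https://arxiv.org/abs/2606.05030
Authors: Zehua Cheng,Wei Dai,Jiahao Sun,Thomas Lukasiewicz
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments: 25 Pages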
Abstract:Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models susceptible to error snowballing, wherein a single logical or arithmetic mistake in an early step irreversibly corrupts the entire reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that endows decoder-only transformers with a native goal-conditioned bridging capability. The key insight is to reframe erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise P, a verified downstream milestone S, and the original query Q, the model must synthesise the logical bridge M that connects P to S rigorously and completely. To achieve this with standard causal architectures, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, enabling M to attend to both P and S without any structural modification to the self-attention mechanism. Training proceeds in two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified (P, S, M) triples extracted from formal mathematics corpora, and (ii) Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, eliminating LLM-judge sycophancy. At inference, TRI operates as a surgical repair module within a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact. Comprehensive experiments on three benchmarks demonstrate that TRI achieves state-of-the-art performance across all tasks, while reducing per-problem token expenditure by 31.2%.
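The Prefix-Suffix-Middle rearrangement described above can be pictured as a string-level transformation: the suffix is moved in front of the middle so that a left-to-right model sees both P and S before it writes M. In the sketch below, the sentinel token strings, the placement of the query Q at the start, and the helper names are all assumptions for illustration; the abstract only states that three non-overlapping sentinel tokens are used.

```python
# Hypothetical sentinel strings; the paper's actual tokens are not given.
PREFIX_TOKEN = "<|prefix|>"
SUFFIX_TOKEN = "<|suffix|>"
MIDDLE_TOKEN = "<|middle|>"


def to_psm(query, prefix, suffix, middle=None):
    """Lay out (Q, P, S, M) in Prefix-Suffix-Middle order.

    With middle=None this is an inference prompt ending at the middle
    sentinel; with a middle it is a full training sequence.
    """
    prompt = f"{query}{PREFIX_TOKEN}{prefix}{SUFFIX_TOKEN}{suffix}{MIDDLE_TOKEN}"
    return prompt if middle is None else prompt + middle


def restore_order(prefix, middle, suffix):
    """Reassemble the repaired reasoning chain in its natural order."""
    return prefix + middle + suffix


# Toy repair: the verified prefix and milestone bracket a damaged step.
query = "Q: 2 + 3 * 4 = ?"
prefix = "3 * 4 = 12. "
suffix = "So the answer is 14."
prompt = to_psm(query, prefix, suffix)
repaired = restore_order(prefix, "2 + 12 = 14. ", suffix)
```

Because every token of M is generated after P and S in the rearranged sequence, ordinary causal attention already covers both sides, which is why no architectural change is needed.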
[NLP-16] Validity Threats for Foundation Model Research
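【Quick read】: This paper addresses the reliability risks of the low-cost research strategies that stand in for controlled experiments at foundation-model scale, where such experiments have become prohibitively expensive. These strategies include proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs. The proposed solution is an evaluation framework that casts foundation model research as a causal inference problem and assesses each strategy through four types of validity adapted from the empirical social sciences: statistical, internal, external, and construct validity. The analysis finds that proxy experiments trade external and construct validity for statistical and internal validity, that observational studies face confounding and effect heterogeneity, and that single-run designs are strained by interference between treated units. The framework surfaces several validity threats that have received insufficient attention in the literature and gives researchers a practical toolkit for scrutinizing research designs.
Link: https://arxiv.org/abs/2606.05029
Authors: Gunnar König,Martin Pawelczyk,Ulrike von Luxburg,Sebastian Bordt
Affiliations: University of Tübingen, Tübingen AI Center; University of Vienna
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: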
Abstract:Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats – hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences – statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research designs.
[NLP-17] TaDA: Calibrated Probe Gating for Task-Domain LoRA Merging
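[Quick read]: This paper addresses the challenge of merging a task LoRA adapter and a domain LoRA adapter into a single unified model, and in particular the fact that existing methods ignore depth-dependent structural differences between the two adapters. Existing methods treat the adapters as symmetric peers and apply uniform weights across all layers, so they miss the asymmetric distribution of task and domain signals across depth. The authors observe that, across transformer architectures, domain dominance increases with layer depth while shallower layers retain stronger task-relevant signals. They propose TaDA (Task-Domain LoRA Merging), a training-free merging algorithm built on calibrated probe-guided per-layer gating and per-component subspace-aware merging, which assigns individual weights per layer and projection type. The gating relies on a probe signal that is invariant to adapter weight magnitude. During merging, the algorithm first discards conflicting singular directions and then combines the remaining components, producing a standard rank-$r$ LoRA adapter with zero inference overhead. On six scientific QA benchmarks with Llama-2-7B, TaDA reaches an average accuracy of 0.452, 3.6 percentage points above DARE-TIES, and obtains the best result on all six benchmarks; on six image classification benchmarks with ViT-L/16 it reaches 85.9% average accuracy, improving over the strongest merging baseline and leading on three of the six benchmarks.
Link: https://arxiv.org/abs/2606.05016
Authors: Huy Quoc To,Fuyi Li,Guangyan Huang,Ming Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: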
Abstract:Combining a task LoRA adapter with a domain LoRA adapter into a single unified model is a practical yet largely unexplored challenge. Existing methods treat both adapters as symmetric peers, applying uniform weights across all layers. We argue that task and domain adapters exhibit a consistent depth-dependent asymmetry across transformer architectures. Domain dominance increases with layer depth, while shallower layers retain stronger task-relevant signals. Motivated by this observation, we propose TaDA (Task-Domain LoRA Merging), a training-free algorithm that exploits this structure through calibrated probe-guided per-layer gating and per-component subspace-aware merging. The gating assigns individual weights per layer and projection type using a probe signal proved invariant to adapter weight magnitude. The merging discards conflicting singular directions before combining the remaining components. TaDA produces a standard rank-$r$ LoRA adapter with zero inference overhead. On six scientific QA benchmarks with Llama-2-7B, TaDA achieves an average accuracy of 0.452, outperforming DARE-TIES by +3.6 percentage points and obtaining the best result on all six benchmarks. On six image classification benchmarks with ViT-L/16, TaDA reaches 85.9% average accuracy, improving over the strongest merging baseline while leading in three of the six individual benchmarks.
[NLP-18] Depth-Attention: Cross-Layer Value Mixing for Language Models
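[Quick read]: This paper addresses the limited flow of information across depth in standard Transformers: self-attention selects information freely along the sequence dimension, but layers only add their outputs to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Existing cross-layer methods improve this flow but usually operate on hidden states outside attention, adding state beyond the standard key-value cache, a cost that becomes significant as modern LLMs compress the cache with grouped-query attention and multi-head latent attention. The paper proposes Depth-Attention, which places cross-layer selection inside the attention module: before a layer attends over the sequence, its query first attends over the keys of earlier layers at the same token position and mixes their values, and self-attention then reads these depth-mixed values. Because the method reuses the standard attention queries, keys, and value-cache slots and stores the depth-mixed values in place of the originals, it adds no parameters and no persistent inference state beyond the standard key-value cache; the cache is the same size as in a vanilla decoder and smaller than in hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines, while adding under 0.01% extra FLOPs. The gains hold from 360M to 3B parameters and extend to looped Transformers.
Link: https://arxiv.org/abs/2606.05014
Authors: Boyi Zeng,Yiqin Hao,Zitong Wang,Shixiang Song,He Li,Feichen Song,Yifan Liu,Ziwei He,Xinbing Wang,Zhouhan Lin
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Sun Yat-sen University; Shanghai Innovation Institute
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 4 figures, 9 tables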
Abstract:Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer’s output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference, a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache: the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.
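The depth-wise mixing step described in the Depth-Attention abstract can be illustrated with a minimal sketch for a single token position. This is not the authors' implementation: the function name `depth_mix_value`, the scaled dot-product scoring, and the choice of which layers appear among the candidates (for example, whether the current layer's own key and value are included) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def depth_mix_value(query: Sequence[float],
                    layer_keys: Sequence[Sequence[float]],
                    layer_values: Sequence[Sequence[float]]) -> List[float]:
    """Mix per-layer values at ONE token position with softmax attention over depth.

    Hypothetical sketch: layer_keys[j] and layer_values[j] are the key and value
    that layer j produced for this token. The result is the depth-mixed value
    that ordinary sequence attention would read afterwards.
    """
    if not layer_keys or len(layer_keys) != len(layer_values):
        raise ValueError("need one value per key, and at least one layer")
    scale = 1.0 / math.sqrt(len(query))  # assumed scaled dot-product scoring
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in layer_keys]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(layer_values[0])
    # Weighted sum of the candidate values, one output coordinate at a time.
    return [sum(w * value[i] for w, value in zip(weights, layer_values))
            for i in range(dim)]
```

Because the mixed value has the same shape as a single value vector, it can be stored in the slot the original value would have occupied, which is consistent with the abstract's claim that no cache state is added.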
[NLP-19] DAR: Deontic Reasoning with Agentic Harnesses
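[Quick read]: This paper addresses a key difficulty in deontic reasoning with large language models (LLMs): statutory rulesets are long and cross-referenced, so models struggle to locate the rules needed for a particular reasoning step. The proposed solution is Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand, retrieving and applying the relevant rules as needed. Experiments on hard subsets of DeonticBench show that agentic harnesses can push the frontier on deontic reasoning, but the improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.
Link: https://arxiv.org/abs/2606.05009
Authors: Guangyao Dou,William Jurayj,Nils Holzenberger,Benjamin Van Durme
Affiliations: Johns Hopkins University; Télécom Paris, Institut Polytechnique de Paris
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: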
Abstract:Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.
[NLP-20] M3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks
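[Quick read]: This paper addresses the lack of systematic memory evaluation for multi-modal models in long-form video understanding. Despite progress on video datasets and benchmarks, existing work focuses mainly on perception and reasoning and neglects what models retain, how faithfully information is preserved, and how robust memory remains under interference. The authors propose M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, it uses carefully constructed tasks to isolate key aspects of memory. Experiments show that current models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns that differ substantially from those of human memory, ground memory sources more reliably in the spatial domain than in the temporal domain, and have limited symbolic memory. The benchmark is a resource for future research, and the findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms.
Link: https://arxiv.org/abs/2606.05008
Authors: Jie Huang,Ruixun Liu,Sirui Sun,Xinyi Yang,Yin Li,Yixin Zhu,Yiwu Zhong
Affiliations: Peking University; University of Wisconsin-Madison
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: We present an evaluation designed for multi-modal memory in multi-modal models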
Abstract:As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at this https URL.
[NLP-21] GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation
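[Quick read]: This paper addresses the limited performance of LLM-based multi-agent systems on strategic decision-making tasks, where the design of interaction policies lacks generality and structured guidance. Existing multi-agent reinforcement learning (MARL) approaches typically rely on task-specific reward design that is only weakly grounded in the structure of agent interaction. The paper proposes GARL (GAme-theoretic Reinforcement Learning), which formalizes strategic prioritisation as a two-stage game: competing agents first allocate strategic resources over a shared candidate set, and a higher-level arbiter then produces the final ranking. The resulting game-theoretic utilities are converted into role-specific reinforcement signals, so that policy optimisation is guided by explicitly modelled interaction structure. Instantiated on issues-in-dispute ranking in legal proceedings, GARL improves ranking performance, makes small open-source LLMs competitive with a strong closed-source LLM under the same candidate-ranking setting, and yields gains in legal-domain competence and broader strategic decision-making. Its key idea is to turn game-theoretic interaction structure into learnable reinforcement-learning objectives, giving a principled route to policy optimisation for multi-agent strategic prioritisation.
Link: https://arxiv.org/abs/2606.05002
Authors: Yuxiao Ye,Yiwen Zhang,Huiyuan Xie,Yuqin Huang,Zhiyuan Liu
Affiliations: Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computation and Language (cs.CL)
Comments: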
Abstract:LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi-agent reinforcement learning can optimise these interaction policies, but its reward design often remains task-specific and weakly grounded in interaction structure. To address this gap, we propose GARL, a GAme-theoretic Reinforcement Learning framework for multi-agent strategic prioritisation. GARL formalises strategic prioritisation as a two-stage game: competing agents first allocate strategic resources over a shared candidate set, and a higher-level arbiter then produces the final ranking. The resulting game-theoretic utilities are converted into role-specific reinforcement signals, allowing policy optimisation to be guided by structured interaction. We instantiate GARL on issues-in-dispute ranking, where the goal is to prioritise core issues in legal proceedings. Experiments show that GARL improves ranking performance, enables small open-source LLMs to become competitive with a strong closed-source LLM under the same candidate-ranking setting, and yields gains in legal-domain competence and broader strategic decision-making. Overall, GARL demonstrates how game-theoretic interaction structure can be turned into reinforcement-learning objectives, providing a principled approach to policy optimisation in multi-agent strategic prioritisation.
[NLP-22] Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game
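[Quick read]: This paper addresses surface-level behavioral alignment of large language models (LLMs) in risk decision-making: model outputs can look consistent with human risk preferences while the underlying decision mechanism differs from that of humans. Evaluations that consider only outcome-level similarity (such as finite bids) overlook mechanism-level consistency. The authors use the St. Petersburg game as a controlled testbed and evaluate 28 LLMs with a structured prompt suite that covers the original game, controlled variants that perturb truncation, repeated play, numeric endowment, and occupational identity, a human-perspective prompt, and paired comparisons between base models and their instruction-tuned counterparts. Most models produce human-like finite bids in the original game, but under the controlled variants their behavior often shifts to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, yet most mechanism-level response patterns remain largely unchanged. The paper therefore argues that high-stakes evaluations of LLM decision-making must go beyond outcome similarity and test whether the alignment is supported by mechanism-level consistency.
Link: https://arxiv.org/abs/2606.04978
Authors: Chensong Huang,Changyu Chen,Chenwei Lin,Hanjia Lyu,Xian Xu,Jiebo Luo
Affiliations: Fudan University; University of Rochester
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); General Economics (econ.GN)
Comments: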
Abstract:LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed, a classical paradox in which the expected payoff is infinite, yet humans typically report low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt that asks models to reason as human decision makers; and paired comparisons between base models and their instruction-tuned counterparts. In the original game, most models generate finite bids, creating the appearance of human-like risk behavior. However, this outcome-level resemblance masks substantial mechanism-level differences. The controlled variants reveal that rather than maintaining human-like behavior seen in the original game, models often shift to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risk decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms. High-stakes evaluations of LLM decision-making should therefore move beyond outcome similarity and examine whether the alignment is supported by mechanism-level consistency.
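To make the role of the truncation variant concrete, the following sketch computes the expected payoff of a St. Petersburg game capped at a maximum number of flips. It assumes the classical payoff convention (the first head on flip k pays 2^k), which the abstract does not spell out, and the function name `truncated_expected_value` is hypothetical.

```python
from fractions import Fraction


def truncated_expected_value(max_flips: int) -> Fraction:
    """Expected payoff of a St. Petersburg game capped at max_flips tosses.

    Assumed convention (not stated in the abstract): the first head on
    flip k pays 2**k, and a run of max_flips tails pays nothing.
    """
    if max_flips < 0:
        raise ValueError("max_flips must be non-negative")
    total = Fraction(0)
    for k in range(1, max_flips + 1):
        probability = Fraction(1, 2 ** k)  # first head lands exactly on flip k
        payoff = 2 ** k
        total += probability * payoff      # each term contributes exactly 1
    return total
```

Under this convention every additional flip adds exactly 1 to the expected payoff, so the value grows linearly with the cap and diverges without one. A purely computational bidder would therefore change its bid as soon as a cap is introduced, which is the kind of shift the controlled variants are designed to expose.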
[NLP-23] SAID: Accelerating Diffusion-Based Language Models via Scaffold-Aware Iterative Decoding
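[Quick read]: This paper addresses the high inference cost of diffusion large language models (DLLMs), which need many denoising steps for high-quality generation. Although DLLMs can update multiple token positions in parallel using bidirectional context, the long iterative denoising process limits inference efficiency. The proposed SAID (Scaffold-Aware Iterative Decoding) framework accelerates inference by reallocating computation across tokens: it first spends denoising computation on scaffold tokens to establish the coarse semantic structure of the text, and then completes the more predictable detail tokens with fewer steps. The authors also adapt SAID to block-wise diffusion decoding and introduce Confidence-Hierarchical Layered Generation (CHLG), which assigns additional denoising steps only to low-confidence tokens. On math, coding, and knowledge benchmarks with LLaDA-8B and LLaDA 1.5, the method achieves a maximum speedup of 9.1x while maintaining competitive performance.
Link: https://arxiv.org/abs/2606.04974
Authors: Na Li,Chengda Wang,Mingju Gao,Hao Tang
Affiliations: Peking University
Categories: Computation and Language (cs.CL)
Comments: Code: this https URL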
Abstract:Diffusion large language models (DLLMs) enable non-autoregressive generation by iteratively denoising corrupted token sequences with bidirectional context. Despite their ability to update multiple positions in parallel, inference remains costly due to the many denoising steps required for high-quality generation. We propose SAID, a Scaffold-Aware Iterative Decoding framework that accelerates DLLMs by reallocating computation across tokens. SAID first spends denoising computation on scaffold tokens to establish the coarse semantic structure, and then completes predictable detail tokens with fewer steps. We further adapt SAID to block-wise diffusion decoding and introduce Confidence-Hierarchical Layered Generation (CHLG), which assigns additional steps only to low-confidence tokens. Experiments on LLaDA-8B and LLaDA 1.5 across math, coding, and knowledge benchmarks show that SAID significantly accelerates DLLM inference with a maximum speedup of 9.1x while maintaining competitive performance. Our code is publicly available: this https URL.
[NLP-24] SemBlock: Semantic Boundary Dynamic Blocks for Diffusion LLMs
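[Quick read]: This paper addresses the mismatch between block boundaries and semantic boundaries in blockwise decoding for diffusion language models (DLMs), which arises because existing methods rely on fixed block sizes or delimiter-based runtime signals. The proposed SemBlock is a semantic-boundary-driven dynamic block decoding framework: it formulates dynamic block construction as semantic boundary prediction and trains lightweight boundary predictors on frozen LLaDA hidden states. Supervision comes from SemBound, a semantic-boundary dataset whose labels are derived from discourse units, reasoning steps, and implementation spans across natural language, math, and code tasks. During inference, SemBlock uses the predicted boundary probabilities to select the ending position of each block. Experiments on GSM8K, IFEval, MATH, and HumanEval show that SemBlock consistently improves over fixed-block decoding and AdaBlock.
Link: https://arxiv.org/abs/2606.04964
Authors: Xinrui Song,Zhuoran Wang,Mingju Gao,Hao Tang
Affiliations: Peking University
Categories: Computation and Language (cs.CL)
Comments: Code: this https URL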
Abstract:Diffusion language models (DLMs) generate text through iterative denoising, and blockwise decoding improves their practicality by committing tokens in local blocks. However, existing blockwise methods typically rely on fixed block sizes or delimiter-based runtime signals, which do not necessarily align with semantic boundaries. In this paper, we propose SemBlock, a semantic-boundary-driven dynamic block decoding framework for diffusion LLMs. SemBlock formulates dynamic block construction as semantic boundary prediction and trains lightweight predictors on frozen LLaDA hidden states. To provide supervision, we construct SemBound, a semantic-boundary dataset that derives boundary labels from discourse units, reasoning steps, and implementation spans across natural language, math, and code tasks. During inference, SemBlock uses predicted boundary probabilities to select the ending position of each dynamic block. Experiments on GSM8K, IFEval, MATH, and HumanEval show that SemBlock consistently improves over fixed-block decoding and AdaBlock. Our code is publicly available: this https URL.
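The inference step in the SemBlock abstract (choosing each block's end from predicted boundary probabilities) might look roughly like the sketch below. The selection rule is an assumption, since the abstract does not give one: here the block ends at the most probable boundary inside a window bounded by hypothetical `min_len` and `max_len` parameters. The function names are also hypothetical.

```python
from typing import List, Sequence, Tuple


def select_block_end(boundary_probs: Sequence[float], start: int,
                     min_len: int = 1, max_len: int = 32) -> int:
    """Return the index of the last token of the block that begins at `start`.

    Assumed rule: pick the position with the highest predicted boundary
    probability among block lengths in [min_len, max_len].
    """
    n = len(boundary_probs)
    if not 0 <= start < n:
        raise ValueError("start must index into boundary_probs")
    if not 1 <= min_len <= max_len:
        raise ValueError("need 1 <= min_len <= max_len")
    first = min(start + min_len - 1, n - 1)  # earliest allowed end position
    last = min(start + max_len - 1, n - 1)   # latest allowed end position
    # max() returns the first maximal element, so ties go to the shorter block.
    return max(range(first, last + 1), key=lambda i: boundary_probs[i])


def segment(boundary_probs: Sequence[float],
            min_len: int = 1, max_len: int = 32) -> List[Tuple[int, int]]:
    """Split the whole sequence into (start, end) blocks, both ends inclusive."""
    blocks, start = [], 0
    while start < len(boundary_probs):
        end = select_block_end(boundary_probs, start, min_len, max_len)
        blocks.append((start, end))
        start = end + 1
    return blocks
```

A fixed-block decoder corresponds to the special case `min_len == max_len`, which makes the contrast with dynamic blocks easy to see in this sketch.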
[NLP-25] Data Attribution in Large Language Models via Bidirectional Gradient Optimization AAAI2026
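[Quick read]: This paper addresses governance, accountability, and data provenance for deployed large language models (LLMs), and specifically the open problem of identifying which training data most influenced a model's output. It approaches training data attribution (TDA) for auto-regressive LLMs by expanding on the inverse formulation: how would the training data be affected if the model had seen the generated output during training? The method perturbs the base model with bidirectional gradient optimization (gradient ascent and descent) on a generated text sample and measures the resulting change in loss across training samples. The framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. On pretrained models with known training datasets, the method outperforms previous work on influence metrics, which improves model interpretability, an essential requirement for accountable AI systems.
Link: https://arxiv.org/abs/2606.04928
Authors: Frédéric Berdoz,Luca A. Lanzendörfer,Kaan Bayraktar,Roger Wattenhofer
Affiliations: ETH Zurich
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Presented at the AI Governance (AIGOV) Workshop at AAAI 2026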
Abstract:Large Language Models (LLMs) are increasingly deployed across diverse applications, raising critical questions for governance, accountability, and data provenance. Understanding which training data most influenced a model’s output remains a fundamental open problem. We address this challenge through training data attribution (TDA) for auto-regressive LLMs by expanding upon the inverse formulation: How would training data be affected if the model had seen the generated output during training? Our method perturbs the base model using bidirectional gradient optimization (gradient ascent and descent) on a generated text sample and measures the resulting change in loss across training samples. Our framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. We evaluate our method against baselines on pretrained models with known datasets, and show that it outperforms previous work on influence metrics, thereby enhancing model interpretability, an essential requirement for accountable AI systems.
[NLP-26] Can Crowdsourcing Survive the LLM Era? A Community Survey on Human Data Collection
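[Quick read]: This paper addresses the threat that the widespread use of large language models (LLMs) as writing tools poses to the validity of crowdsourced data: crowdworkers may outsource tasks to models, which lowers the quality of collected free-text data and the reliability of research results. The authors surveyed 155 researchers in NLP and related disciplines about their experiences with collecting free-text responses via crowdsourcing. Of the respondents, 44% reported observing LLM usage in their crowdsourced data. The most prevalent detection strategies are distinctive textual style patterns and unusually fast completion times. The research community is aware of the problem and is taking measures, but existing efforts remain insufficient to address it fully. The paper closes with a set of considerations to guide future crowdsourced free-text data collection in the era of LLMs.
Link: https://arxiv.org/abs/2606.04924
Authors: Aswathy Velutharambath,Neele Falk,Sofie Labat,Tarun Tater,Amelie Wuehrl
Affiliations: University of Stuttgart, Germany; Ghent University, Belgium; Harvard University, USA; IT University of Copenhagen, Denmark
Categories: Computation and Language (cs.CL)
Comments: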
Abstract:The widespread use of Large Language Models (LLMs) as writing tools challenges the validity of crowdsourced data, as crowdworkers may outsource tasks to models. To better understand how this is addressed, we surveyed 155 researchers in NLP and related disciplines about their experiences and opinions on collecting free-text responses via crowdsourcing. This paper provides an overview of practitioners’ challenges, mitigation strategies, and the foreseen implications on data quality. 44% of respondents reported observing LLM usage in their crowdsourced data. While 93% of them had anticipated this, half were unsure what precautions to take. The most prevalent detection strategies are distinctive textual style patterns and unusually fast completion times. Overall, survey responses show that the research community is aware of the problem and taking measures, but existing efforts remain insufficient to fully address it. Finally, we derive a set of considerations to guide future crowdsourced free-text data collection in the era of LLMs.
[NLP-27] Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
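[Quick read]: This paper addresses reward hacking in rubric-based reinforcement learning (RL), which arises when policy models exploit latent biases in the LLM-as-a-Judge (LaaJ). In real-world settings such hacking is often subtle and entangled with multiple judge biases, which makes it hard to identify and mitigate. The authors propose CHERRL, a controllable hacking environment: by injecting known biases into the LaaJ, it enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset, providing a clean experimental testbed for studying the mechanisms and mitigations of reward hacking. The study further analyzes different judge biases from the perspectives of discoverability and exploitability, and explores an agent-based system that automatically detects hacking onset from training logs.
Link: https://arxiv.org/abs/2606.04923
Authors: Xuekang Wang,Zhuoyuan Hao,Shuo Hou,Hao Peng,Juanzi Li,Xiaozhi Wang
Affiliations: Tsinghua University; Harbin Institute of Technology, Shenzhen; Xi’an Jiaotong University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages, 7 figures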
Abstract:Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at this https URL.
[NLP-28] BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine
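[Quick read]: This paper addresses the insufficient multimodal reasoning support for breast cancer clinical management: constrained by data scarcity and limited model versatility, existing medical multimodal large language models (MLLMs) are typically evaluated on isolated modalities or narrow task families and cannot support workflow-level reasoning across screening, diagnosis, and treatment planning. The authors build BreastStage, a workflow-aligned breast imaging instruction corpus with 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. On top of it they propose BreastGPT, a unified MLLM with a dual-branch visual encoder and concept-preserving token compression that bridges the scale gap between standard radiology images and gigapixel pathology images. On the BreastStage-Bench benchmark, BreastGPT achieves 75.66% closed-ended accuracy and an 89.92% open-ended score, outperforming both general-purpose and medical-specific MLLMs. The results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs.
Link: https://arxiv.org/abs/2606.04911
Authors: Yang Liu,Jiajin Zhang,Danyang Tu,Yaojun Hu,Jiao Qu,Jiuyu Zhang,Yu Shi,Wei Fang,Shi Gu,Ling Zhang,Yingda Xia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: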
Abstract:Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, constrained by data scarcity and model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, limiting their ability to support workflow-level clinical reasoning. In this work, we first introduce BreastStage, a workflow-aligned breast imaging instruction corpus comprising 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. Its held-out split, BreastStage-Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual-branch visual encoder and concept-preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage-Bench, BreastGPT achieves 75.66% closed-ended accuracy and 89.92% open-ended score, outperforming both general-purpose and medical-specific MLLMs across clinical stages and task formats. These results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at this https URL.
[NLP-29] Your AI Text is not Mine: Redefining and Evaluating AI-generated Text Detection under Realistic Assumptions
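[Quick read]: This paper addresses the lack of a common standard in AI-generated text detection: the literature has no shared understanding of what constitutes harmful use, so existing detectors and datasets rest on their own assumptions and are only loosely related to real-world applications. The authors systematically define various notions of AI-generated text and their characteristics, and collect AITDNA, a new benchmark of human-machine co-constructed texts annotated with detailed genesis information such as the entire edit and AI-interaction history. Benchmarking various detectors on it shows that they often perform well only for specific notions and not as broad detectors. Code and data are released publicly.
Link: https://arxiv.org/abs/2606.04906
Authors: Nils Dycke,Marina Sakharova,Nico Daheim,Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE, Germany; Zuse School ELIZA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: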
Abstract:Although it is generally agreed that AI-generated text poses a broad societal risk, there is no common understanding in the AI-generated text detection literature on what constitutes harmful use. Rather, existing datasets and approaches often define their own criteria and make their own assumptions, sometimes implicitly, and often only loosely related to real-world needs and applications. To address this gap, we here systematically define various notions of AI-generated text and their characteristics. To study these, we collect AITDNA - a new benchmark of human-machine co-constructed texts that is annotated with detailed genesis information, such as the entire edit and AI-interaction history. We benchmark various machine-generated text detectors and find that they often only perform well for specific notions but not as broad detectors. We release code and data publicly.
[NLP-30] GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards
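[Quick read]: This paper addresses the dilution of the gradient signal in reinforcement learning with verifiable rewards (such as GRPO) when it is used to improve the mathematical reasoning of large language models (LLMs). Existing methods either broadcast one sequence-level advantage to all tokens or rely on costly process reward models (PRMs) for step-level supervision. The former assumes that all tokens contribute equally to the final reward, so flawed reasoning steps and filler words are updated as strongly as valid logical inferences. The paper proposes GRAIL, an intrinsic token-wise advantage reweighting method based on gradient-activation saliency, which places more weight on tokens that are more locally sensitive to the final answer. Across five models from the Qwen3, R1-distilled, and OctoThinker families, GRAIL improves accuracy by 3.60% and Pass@3 by 3.05% on average without process-level supervision.
Link: https://arxiv.org/abs/2606.04889
Authors: Tej Deep Pala,Vernon Toh,Soujanya Poria
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: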
Abstract:Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL achieved an average improvement of 3.60% in accuracy and 3.05% in Pass@3, demonstrating that fine-grained reasoning alignment can be achieved without process-level supervision.
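The contrast between broadcasting one sequence-level advantage and reweighting it per token can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give GRAIL's formula, so the normalisation used here (saliency divided by mean saliency) is hypothetical, the saliency scores are taken as given inputs rather than computed from gradients and activations, and the group-relative advantage uses the population standard deviation as one possible GRPO-style choice.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardise each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def reweight_token_advantages(advantage: float,
                              saliency: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Spread one sequence-level advantage over tokens in proportion to saliency.

    Hypothetical rule: each weight is saliency / mean(saliency), so the weights
    average to 1 and the mean token advantage equals the sequence advantage.
    """
    if not saliency:
        raise ValueError("saliency must contain at least one token score")
    if any(s < 0 for s in saliency):
        raise ValueError("saliency scores must be non-negative")
    avg = mean(saliency)
    if avg <= eps:
        # No usable saliency signal: fall back to the uniform broadcast.
        return [advantage] * len(saliency)
    return [advantage * s / avg for s in saliency]
```

With uniform saliency the second function reduces to the plain broadcast that the abstract criticises, so the baseline is recovered as a special case.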
[NLP-31] Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean
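[Quick read]: This paper addresses the high compute cost of LLM workflows that generate formal proofs in Lean, which spend substantial compute on attempts that ultimately fail. Such workflows typically decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. The paper proposes an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and the cost of another attempt, and decides whether to continue proving the current target or restart from a new decomposition. On a subset of PutnamBench, the agent reduces cost by 25.8% on average over a fixed-step baseline while preserving performance. The results suggest that failed Lean trajectories provide actionable signals for cost-aware resource allocation in agentic theorem proving.
Link: https://arxiv.org/abs/2606.04883
Authors: Kári Rögnvaldsson,Chenhao Sun,Jasper Dekoninck,Martin Vechev
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: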
Abstract:Large language models (LLMs) are increasingly used in workflows for generating formal proofs in Lean. These workflows often decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. However, they can be prohibitively expensive, often spending substantial compute on attempts that ultimately fail. In this work, we address this problem with an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and cost of another attempt, and decides whether to continue proving the current target or restart from a new breakdown. On a subset of PutnamBench, our agent decreases the cost by 25.8% over a fixed-step baseline on average, preserving performance while using substantially less compute. These results suggest that failed Lean trajectories provide actionable signals for cost-aware resource allocation in agentic theorem proving.
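The control-plane decision described in this abstract (continue with the current target or restart from a new breakdown) can be sketched as a comparison of two estimates. The decision rule below is an assumption, not the paper's: it picks the action with the higher estimated success probability per unit of expected cost. The names `ActionEstimate` and `choose_action` are hypothetical, and in the paper these estimates would come from observing failed Lean attempts rather than being supplied by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionEstimate:
    success_probability: float  # estimated chance that the next attempt succeeds
    expected_cost: float        # estimated cost of that attempt, in any fixed unit


def choose_action(continue_est: ActionEstimate,
                  restart_est: ActionEstimate) -> str:
    """Return "continue" or "restart" using a success-per-cost comparison.

    Hypothetical rule: prefer the action whose estimated success probability
    per unit of expected cost is higher; ties favour continuing.
    """
    def value(est: ActionEstimate) -> float:
        if est.expected_cost <= 0:
            raise ValueError("expected_cost must be positive")
        if not 0.0 <= est.success_probability <= 1.0:
            raise ValueError("success_probability must lie in [0, 1]")
        return est.success_probability / est.expected_cost

    return "continue" if value(continue_est) >= value(restart_est) else "restart"
```

A fixed-step baseline corresponds to always returning "continue" until a step budget runs out, which is the behaviour the routing agent is compared against.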
[NLP-32] Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents
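[Quick read]: This paper addresses failures of LLM agents on complex tasks that stem from weak planning, and the limitation that existing evaluations report only end-to-end success and therefore cannot separate planning failures from execution failures. It introduces the Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 multimodal large language models (MLLMs), APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. Further validation on ToolSandbox and $\tau^2$-bench tasks shows that APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models, which supports APB as an upstream diagnostic complement to execution benchmarks.
Link: https://arxiv.org/abs/2606.04874
Authors: Haoyu Sun,Wenxuan Wang,Mingyang Song,Jujie He,Weinan Zhang,Yang Liu,Yang Yang,Yu Cheng
Affiliations: Shanghai AI Laboratory; Tongji University; Harbin Institute of Technology; Fudan University; Skywork AI; University of California, Santa Cruz; Shanghai Jiao Tong University; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: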
Abstract:Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 $\tau^2$-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks.
[NLP-33] MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU
Quick read: This paper addresses native GPU kernel generation, which turns high-level tensor programs into executable, efficient low-level code. Existing LLMs struggle with the task, and execution-based reinforcement learning (RL) suffers from sparse rewards, reward hacking, and training instability. The proposed MusaCoder is a full-stack training framework for CUDA and MUSA backends that combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback RL through MooreEval, a distributed verifier and reward environment. Three techniques stabilize RL: PrimeEcho (first-turn-anchored multi-turn rewards), Buffered Dynamic Retry (recovering signal from hard samples where every attempt failed), and MirrorPop (off-policy sequence filtering). On KernelBench and a MUSA-ported variant, MusaCoder outperforms strong open-source and proprietary baselines in correctness and empirical speedup; the 9B model matches or exceeds frontier closed-source models and the 27B model sets a new state of the art. The authors also present the results as evidence that Moore Threads GPUs can support the complete LLM post-training stack.
Link: https://arxiv.org/abs/2606.04847
Authors: Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang
Affiliations: Moore Threads AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.
[NLP-34] Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas
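Quick read: This paper examines whether LLM outputs used in education align with U.S. state curriculum standards for history. Because standards are set at the state level, they differ in required content, emphasis, and narrative focus, so alignment matters as students increasingly turn to chatbots for homework help. The authors build an LLM-based pipeline to identify variations in U.S. History curricula across states and to evaluate how far different LLMs reflect those differences, and they run controlled experiments that vary the stated user persona (geographic location, grade level, gender, race). Models do adjust how they present historical topics, but the shifts appear to track the perceived political leanings of states rather than actual curriculum content. Models adapt well to grade level and show minimal sensitivity to race or gender, suggesting useful persona adaptation with limited demographic bias. The findings point to a risk that openly accessible LLM chatbots harm learning outcomes through misalignment with state standards, and to the need for more robust alignment techniques.
Link: https://arxiv.org/abs/2606.04846
Authors: Lisa Korver, Tomo Lazovich, Sherief Reda
Affiliations: Brown University
Categories: Computation and Language (cs.CL)
Comments: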
Abstract:As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability and accuracy leading to more widespread use, including among students looking for help with their homework. This makes it crucial to consider whether these models are aligned with educational standards. Because curriculum standards in the United States are set at the state level, they differ significantly in required content, emphasis, and narrative focus. In this work, we develop an LLM-based pipeline to identify variations in U.S. History curricula across states and evaluate the extent to which different LLMs reflect these state-specific curricular differences. In addition, we conduct controlled experiments that vary user personas by stating user attributes such as geographic location, grade level, gender and race to evaluate the sensitivity of LLM responses to user characteristics. We find that while models are able to adjust their presentation of historical topics, these shifts may come from the perceived political leanings of states and do not necessarily reflect actual curriculum content. Additionally, models successfully adapt to a student’s grade level while showing minimal sensitivity to race or gender, suggesting they are capable of useful adaptation to student personas with limited demographic bias. Together, these findings highlight potential risks that open access to LLM chatbots may cause to student learning outcomes stemming from misalignment with state curriculum standards and highlight the need for more robust alignment techniques.
[NLP-35] A French Corpus Annotated for Multiword Expressions with Adverbial Function
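Quick read: This paper addresses the lack of annotated data for French multiword expressions (MWEs) with adverbial function, a resource intended for research on information retrieval and extraction and on deep and shallow syntactic parsing. The authors present a French corpus annotated for adverbial MWEs, delimit which kinds of MWEs were annotated, describe the resources and methods used for annotation, and briefly comment on the results. The corpus is publicly available under the LGPLLR license.
Link: https://arxiv.org/abs/2606.04828
Authors: Eric Laporte, Takuya Nakamura, Stavroula Voyatzi
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: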
Abstract: This paper presents a French corpus annotated for multiword expressions (MWEs) with adverbial function. This corpus is designed for investigation on information retrieval and extraction, as well as on deep and shallow syntactic parsing. We delimit which kind of MWEs we annotated, we describe the resources and methods we used for the annotation, and we briefly comment on the results. The annotated corpus is available at this http URL under the LGPLLR license.
[NLP-36] BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization ACL
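Quick read: This paper treats social-bias mitigation in LLMs as an alignment problem. Unlike verifiable tasks, bias has no single ground truth, so the reward signal is subjective and high-variance and alignment is hard to stabilize. Existing preference-based fine-tuning has trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can become unstable when critic estimates are unreliable. BiasGRPO uses Group Relative Policy Optimization (GRPO), normalizing rewards across a group of sampled completions and replacing the value function with a group-relative baseline, which reduces instability while keeping the exploration benefits of online training. BiasGRPO outperforms DPO and PPO on multiple benchmarks. To adapt GRPO, the authors synthetically extend a dataset spanning multiple domains and contexts, and they release a custom bias reward model that is compute-efficient, avoids knowledge degradation, and can be integrated into multi-objective RLHF pipelines.
Link: https://arxiv.org/abs/2606.04807
Authors: Saket Reddy, Ke Yang, ChengXiang Zhai
Affiliations: University of Illinois - Urbana-Champaign
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted to Findings of the ACL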
Abstract:Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that effectively guides generation while being highly compute-efficient and avoiding knowledge degradation, providing a valuable resource that can be seamlessly integrated into multi-objective RLHF pipelines.
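The group-relative baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO formulation, in which each completion's reward is standardized against the mean and standard deviation of its own sampled group; the exact variant used in BiasGRPO is not stated in the abstract, and the function name and reward values below are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled completions.

    The group mean plays the role of the baseline that a learned
    value function (critic) would otherwise provide.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical bias-reward scores for four completions of the same prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Completions scoring above the group mean receive positive advantages and those below receive negative ones, so no separate critic network has to be trained.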
[NLP-37] PersonaTree: Structured Lifecycle Memory for Person Understanding in LLM Agents
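Quick read: This paper addresses how persistent LLM agents can make the formation of person understanding explicit over long-term interaction. Existing agent memory methods emphasize retention and retrieval but say little about how accumulated interaction evidence is abstracted into person understanding. The authors treat this as schema formation, in which situated evidence is abstracted into reusable patterns and stable person-level claims. PersonaTree is a structured lifecycle memory framework built as a three-level persona tree with explicit support paths from evidence to claims. It maintains the tree through conservative writing, confidence-guided consolidation, and query-conditioned path retrieval, returning only the evidence depth each query requires. Across six person-understanding and persistent-memory benchmarks with three answer backbones, PersonaTree ranks first in 12 of 18 compact scores and reaches the top two in 16 settings. Ablations show that the hierarchy improves abstract person understanding on KnowMe, while support-path retrieval improves RealPref alignment under a comparable context budget.
Link: https://arxiv.org/abs/2606.04780
Authors: Yubo Hou, Jingwei Song, Hongbo Zhang, Zhisheng Chen, Bang Xiao, Tao Wan, Zengchang Qin
Affiliations: Beihang University; The University of Hong Kong; Peking University; University of Chinese Academy of Sciences; VinUniversity
Categories: Computation and Language (cs.CL)
Comments: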
Abstract:Persistent LLM agents require memory representations that make the formation of person understanding explicit across long term interaction. Existing agent memory methods emphasize information retention and retrieval, yet give limited account of how accumulated interaction evidence is abstracted into person understanding. We view this process as schema formation, where situated evidence is abstracted into reusable patterns and stable person level claims. We introduce PersonaTree, a structured lifecycle memory framework that realizes this view as a three level persona tree with explicit support paths from evidence to claims. PersonaTree maintains the tree through conservative writing, confidence guided consolidation, and query conditioned path retrieval, returning only the evidence depth required by each query. Across six person understanding and persistent memory benchmarks with three answer backbones, PersonaTree ranks first in 12 of 18 compact scores and reaches the top two in 16 settings. Ablations show that hierarchy improves abstract person understanding on KnowMe, while support path retrieval improves RealPref alignment under a comparable context budget.
[NLP-38] Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
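Quick read: This paper addresses the vulnerability of safety-aligned LLMs to inference-time interventions that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens. The authors show that shallow safety is a special case of a broader inference-time vulnerability: short token injections at any generation step can substantially alter subsequent safety behavior. They also find that a model's alignment with refusal directions in its hidden states does not predict robustness to such injection, so internal state alone does not determine behavior under perturbation. Their remedy is to align models directly on generation trajectories constructed by simulating mid-sequence perturbation, which improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation. The central claim is that robust safety alignment requires training on the generation process itself, not only its outputs.
Link: https://arxiv.org/abs/2606.04778
Authors: Kyungmin Park, Taesup Kim
Affiliations: Hankuk University of Foreign Studies; Seoul National University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: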
Abstract:Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens. We show that shallow safety is a special case of a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior. We also find that a model’s alignment with refusal directions in its hidden states does not predict its robustness to such injection, revealing that internal state alone does not determine generation behavior under perturbation. To address this, we align models directly on generation trajectories constructed by simulating mid-sequence perturbation, and show that this improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation. Our work argues that robust safety alignment requires training on the generation process itself, not only its outputs.
[NLP-39] NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models
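Quick read: This paper addresses shortcomings of existing human motion understanding benchmarks (coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity) that leave them unable to diagnose where current models fail. NextMotionQA is a benchmark built with vision-language models (VLMs) through semi-automated, expert-verified dataset construction. It has three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction, each structured along three core semantic axes and stratified into three complexity levels. Evaluating twelve representative VLMs uncovers capability gaps that stay invisible under conventional single-task evaluation. The authors also test VLMs as judges for text-to-motion evaluation: the judges align strongly with expert ratings on coarse criteria (Cohen's κ = 0.70) but break down on fine-grained, part-level judgment (κ = 0.10).
Link: https://arxiv.org/abs/2606.04773
Authors: Yong Cao, Chuqiao Li, Xianghui Xie, Gerard Pons-Moll, Andreas Geiger
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 23 pages, 8 figures, 9 tables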
Abstract: Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation under harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's κ = 0.70) but break down on fine-grained, part-level judgment (κ = 0.10), validating the paradigm in its strong regime while clarifying its limits.
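For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of unweighted Cohen's kappa between two raters. Whether the paper uses the unweighted or a weighted variant is not stated in the abstract, and the labels below are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments: a VLM judge versus an expert on four motions.
vlm_judge = ["good", "good", "bad", "bad"]
expert = ["good", "good", "bad", "good"]
kappa = cohens_kappa(vlm_judge, expert)  # 0.5 for this toy data
```

Here the raters agree on three of four items (0.75) while chance agreement is 0.5, giving κ = 0.5; values near 0, such as the reported 0.10, indicate agreement barely above chance.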
[NLP-40] TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration
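Quick read: This paper addresses a gap in agents deployed over documents, tools, and code: they act only on explicit user requests, which surface only the problems the user has noticed, while many other problems coexist unnoticed in the broader context, with their number unknown in advance. The task is framed as discovering multiple hidden problems from context, each grounded in supporting evidence and paired with a concrete action. TIDE is a template-guided iterative framework with two mechanisms. Iterative discovery surfaces a small batch of candidates per round, conditioned on what has already been found, so later rounds extend coverage. Thought templates are reusable schemas distilled from previously solved cases that specify which contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. On two realistic settings (personal workspaces and software repositories) and four model backbones, TIDE shows substantial gains over single-shot and parallel multi-agent baselines in task coverage, identification, and resolution.
Link: https://arxiv.org/abs/2606.04743
Authors: Soyeong Jeong, Jinheon Baek, Minki Kang, Sung Ju Hwang
Affiliations: KAIST; DeepAuto.ai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: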
Abstract:Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single-shot and parallel multi-agent baselines on task coverage, identification, and resolution.
[NLP-41] Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026
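Quick read: This paper describes KIT's submission to the Long and Short Instruction Following tracks of IWSLT 2026 in the unconstrained setting. This year the track introduced new tasks, including an unknown surprise task, which penalizes overfitting to known tasks. The approach uses a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. The authors further show that likelihood-based re-ranking, although highly effective for ASR, systematically degrades semantic tasks because it spuriously selects candidates generated from segmented audio processing rather than holistic long-form inference. Combining likelihood with Minimum Bayes Risk (MBR) decoding resolves this failure mode.
Link: https://arxiv.org/abs/2606.04730
Authors: Enes Yavuz Ugan, Maike Züfle, Yuka Ko, Supriti Sinhamahapatra, Fabian Retkowski, Seymanur Akti, Jan Niehues, Alexander Waibel
Affiliations: Karlsruhe Institute of Technology; Carnegie Mellon University
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 9 pages main paper, IWSLT 2026 Instruction Following track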
Abstract:With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT’s Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT’s submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.
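The idea of combining likelihood with Minimum Bayes Risk selection can be sketched as follows. This is a toy illustration only: the abstract does not say how the two scores are combined, so the linear interpolation, the `weight` parameter, the Jaccard token-overlap utility, and the candidate strings are all assumptions made for the example.

```python
def token_overlap(hyp, ref):
    """Jaccard overlap of token sets; a stand-in for a real utility metric."""
    a, b = set(hyp.split()), set(ref.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mbr_scores(candidates):
    """Average utility of each candidate against all other candidates."""
    scores = []
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(token_overlap(hyp, o) for o in others) / len(others))
    return scores


def select(candidates, log_likelihoods, weight=0.7):
    """Pick the candidate with the best mix of MBR utility and likelihood.

    weight=0.0 reduces to pure likelihood re-ranking; weight=1.0 to pure MBR.
    """
    if len(candidates) < 2:
        raise ValueError("MBR needs at least two candidates")
    mbr = mbr_scores(candidates)
    lo, hi = min(log_likelihoods), max(log_likelihoods)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all scores tie
    norm_ll = [(ll - lo) / span for ll in log_likelihoods]
    combined = [weight * m + (1 - weight) * l for m, l in zip(mbr, norm_ll)]
    best = max(range(len(candidates)), key=combined.__getitem__)
    return candidates[best]


# Hypothetical candidates: the third is fluent-looking but unrepresentative.
candidates = [
    "the talk covers speech translation",
    "the talk covers speech translation systems",
    "uh the the talk",
]
log_likelihoods = [-3.2, -3.5, -3.0]
by_likelihood = select(candidates, log_likelihoods, weight=0.0)
by_combination = select(candidates, log_likelihoods, weight=0.7)
```

In this toy case pure likelihood picks the outlier with the highest score, while the combined score prefers the candidate that agrees most with the rest of the pool.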
[NLP-42] Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM EMNLP2024
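Quick read: This paper starts from the quadratic complexity of the Transformer in input length, which imposes a heavy computational load on LLMs, and builds on Mamba, the selective-scan structured state-space model, as a more efficient alternative. It proposes a query-based cross-modal projector that uses cross-attention to compress visual tokens conditioned on the input, improving Mamba's efficiency for vision-language modeling. The projector also removes the need to hand-design a 2D scan order when converting image features into an input sequence for a Mamba LLM. Across several vision-language understanding benchmarks, the projector improves both the performance and the throughput of Mamba-based multimodal LLMs.
Link: https://arxiv.org/abs/2606.04719
Authors: SooHwan Eom, Jay Shim, Gwanhyeong Koo, Haebin Na, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo
Affiliations: Korea Advanced Institute of Science and Technology / Korea, Republic of; University of Illinois in Urbana-Champaign / United States of America; Korea University / Korea, Republic of
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024 Findings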
Abstract:The Transformer’s quadratic complexity with input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this computational challenge effectively. This paper explores a query-based cross-modal projector designed to bolster Mamba’s efficiency for vision-language modeling by compressing visual tokens based on input through the cross-attention mechanism. This innovative projector also removes the need for manually designing the 2D scan order of original image features when converting them into an input sequence for Mamba LLM. Experimental results across various vision-language understanding benchmarks show that the proposed cross-modal projector enhances Mamba-based multimodal LLMs, boosting both performance and throughput.
[NLP-43] Rethinking Continual Experience Internalization for Self-Evolving LLM Agents
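Quick read: This paper addresses progressive capability collapse in LLMs under multi-iteration experience internalization, where existing methods degrade over iterations instead of compounding improvement. It examines three dimensions. (1) Granularity: principle-level experience is more durable than instance-level experience because it abstracts transferable strategies away from trajectory-specific details. (2) Injection pattern: step-wise injection outperforms global injection by aligning experience with intermediate decision states, which matters for long-horizon tool use. (3) Internalization regime: off-policy context distillation on high-quality teacher trajectories gives a more stable training signal than on-policy context distillation, which is limited to local corrections on flawed states induced by the student. Together these yield a simple, robust recipe for stable experience internalization in self-evolving, continually learning LLMs.
Link: https://arxiv.org/abs/2606.04703
Authors: Jingwen Chen, Wenkai Yang, Shengda Fan, Wenbo Nie, Chenxing Sun, Shaodong Zheng, Yangen Hu, Lu Pan, Ke Zeng, Yankai Lin
Affiliations: Renmin University of China; Beihang University; Meituan
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures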
Abstract:Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predominantly focused on single-iteration transfer, we discover that under multi-iteration experience learning, existing methods suffer from a progressive capability collapse rather than compounding improvement. We systematically examine this failure through three vital dimensions of experience internalization: (1) Experience Granularity: We find that principle-level experience is more durable than instance-level experience, as it effectively abstracts transferable strategies away from trajectory-specific details. (2) Experience Injection Pattern: Our analysis reveals that step-wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long-horizon tool use. (3) Internalization Regime: We demonstrate that off-policy context-distillation on high-quality teacher trajectories provides a substantially more stable training signal than on-policy context-distillation, which is inherently limited by local corrections on student-induced flawed states. Together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, providing concrete guidance for engineering self-evolving and continually learning LLMs.
[NLP-44] Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
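Quick read: This paper addresses GUI agents' assumption of a static screen, where the world is frozen between two actions. Real interfaces such as short-video applications violate it: content keeps playing, and a competent user must decide what to watch and for how long. The authors formalize the task as Living-Screen-Native GUI agents and introduce LivingScreen, the first benchmark for it on short-video platforms, with a faithful browser-based environment, a three-tier task suite, and metrics that jointly score accuracy and information efficiency. None of the frontier models evaluated reaches human cost-accuracy performance, and the dominant failure mode is over- and under-observation, which points to observation control as a missing capability axis for future GUI agents.
Link: https://arxiv.org/abs/2606.04701
Authors: Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo
Affiliations: Beijing Institute of Technology; Tsinghua University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: preprint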
Abstract:GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption, as their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this task as Living-Screen-Native GUI agents and introduce LivingScreen, the first benchmark instantiating it on short-video platforms, with a faithful browser-based environment, a three-tier task suite, and metrics that jointly score accuracy and information efficiency. Evaluating extensive frontier models, we find that none reaches the human cost-accuracy performance, and that their dominant failure mode is over- and under-observation, pointing to observation control as a missing capability axis for future GUI agents. All data and code will be available at this https URL.
[NLP-45] DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer
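Quick read: This paper addresses the severe degradation of multilingual ability in small language models (SLMs) at sub-billion scales, especially for Southeast Asian (SEA) languages. DuDi is a dual-signal multilingual distillation framework that combines an online sequence-level signal with off-policy and on-policy token-level signals. It also uses a cross-lingual verbalizer to refine teacher feedback and improve teacher-student transferability in multilingual settings. On SEA-HELM, across multiple model families, scales, and teacher-student settings, DuDi consistently outperforms competitive distillation baselines. Ablations confirm that sequence-level optimization, token-level supervision, and cross-lingual verbalization provide complementary and transferable learning signals.
Link: https://arxiv.org/abs/2606.04694
Authors: Patomporn Payoungkhamdee, Tinnakit Udsa, Jian Gang Ngui, Sarana Nutanong, Alham Fikri Aji, Peerat Limkonchotiwat
Affiliations: VISTEC; AI Singapore; MBZUAI
Categories: Computation and Language (cs.CL)
Comments: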
Abstract:Small language models (SLMs) are efficient and scalable, but their multilingual capabilities degrade severely at sub-billion scales, especially for Southeast Asian (SEA) languages. We introduce DuDi, a dual-signal multilingual distillation framework that combines an online sequence-level signal with off-policy and on-policy token-level signals. DuDi further uses a cross-lingual verbalizer to refine teacher feedback and improve teacher-student transferability in multilingual settings. Experiments on SEA-HELM across multiple model families, scales, and teacher-student settings show that DuDi consistently outperforms competitive distillation baselines. Ablations and analyses confirm that sequence-level optimization, token-level supervision, and cross-lingual verbalization provide complementary and transferable learning signals for multilingual SLMs.
[NLP-46] SMADE-IE: Sparse Multi-Agent Framework with Evidence-Driven Debate for Zero-Shot Information Extraction
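Quick read: This paper addresses weaknesses of current zero-shot information extraction (IE) with LLMs. Monolithic prompting suffers from boundary and type errors, while each-type prompting and multi-agent debate introduce cross-type conflicts, redundant agent interactions, and substantial token overhead. SMADE-IE is a sparse, evidence-driven multi-agent framework. An Adaptive Mode Selector routes each input to either a lightweight Global Extraction Mode or a Type-Centric Extraction Mode, reducing unnecessary type selection and reasoning noise. For conflicting predictions, an Evidence-Driven Debate mechanism structures arguments into Toulmin-style components and aggregates confidence through external evidence scoring and Bayesian updates. On 9 benchmark datasets across NER, RE, and JERE tasks, SMADE-IE consistently outperforms existing zero-shot IE baselines while improving token efficiency through sparse agent selection and early-stopping debate.
Link: https://arxiv.org/abs/2606.04691
Authors: Kenfeng Huang, Yi Cai, Xin Wu, Zikun Deng, Li Yuan
Affiliations: South China University of Technology
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 9 figures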
Abstract:Zero-shot information extraction (IE) with large language models (LLMs) has attracted increasing attention due to its flexibility in adapting to new schemas and domains without task-specific training. Existing approaches mainly rely on monolithic prompting, each-type prompting, or multi-agent debate. However, monolithic prompting often suffers from boundary and type errors, while each-type prompting and multi-agent debate introduce cross-type conflicts, redundant agent interactions, and substantial token overhead. To address these challenges, we propose SMADE-IE, a sparse and evidence-driven multi-agent framework for zero-shot IE. SMADE-IE first employs an Adaptive Mode Selector to dynamically route inputs into either a lightweight Global Extraction Mode or a Type-Centric Extraction Mode, reducing unnecessary type selection and reasoning noise. For conflicting predictions, we further introduce an Evidence-Driven Debate mechanism that structures arguments into Toulmin-style components and performs confidence aggregation through external evidence scoring and Bayesian updates. Experimental results on 9 benchmark datasets across NER, RE, and JERE tasks show that SMADE-IE consistently outperforms existing zero-shot IE baselines while also improving token efficiency through sparse agent selection and early-stopping debate.
[NLP-47] CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts
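Quick read: This paper addresses the trade-off between accuracy and prompt-token cost in prompt optimization: prompts tuned for accuracy often grow long, raising inference cost on every call. The usual shortcut collapses the two objectives into a weighted sum, which fixes the trade-off weight before search and often recovers only a narrow region of the Pareto front, a failure the authors call scalarization collapse. CRAFT (Cost-aware Refinement And Front-aware Tuning) is a Pareto-front prompt optimizer that treats target-LLM validation calls as the scarce resource and allocates them to candidates near the optimistic candidate front. Each round, complementary accuracy-oriented and cost-oriented generators propose edits, Pareto-gap acquisition spends the per-round validation budget, and NSGA-II retention keeps a spread-out population. On six classification and reasoning benchmarks, CRAFT's retained fronts reach both high-accuracy and low-cost regions, whereas accuracy-only, cost-only, and weighted-sum baselines each concentrate in narrower regions. The accuracy-cost trade-off thus becomes a post-search choice rather than a pre-search weight.
Link: https://arxiv.org/abs/2606.04661
Authors: Shanu Kumar, Shubhanshu Khandelwal, Akhila Yesantarao Venkata, Parag Agrawal, Yova Kementchedjhieva, Manish Gupta
Affiliations: MBZUAI; Microsoft
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: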
Abstract:Prompts tuned for accuracy often grow long, raising inference cost on every model call. The best accuracy-cost trade-off depends on the task and the budget, so prompt optimization is a search over the Pareto front of accuracy and prompt-token cost rather than for one prompt. The usual shortcut, collapsing the objectives into a weighted sum, fixes the trade-off weight before search and often recovers only a narrow region of the front, a failure we call scalarization collapse. We present CRAFT (Cost-aware Refinement And Front-aware Tuning), a Pareto-front prompt optimizer that treats target-LLM validation calls as the scarce resource and allocates them to candidates near the optimistic candidate front. Each round, complementary accuracy-oriented and cost-oriented generators propose edits, Pareto-gap acquisition spends the per-round validation budget, and NSGA-II retention keeps a spread-out population. Across six classification and reasoning benchmarks, CRAFT’s retained fronts reach both high-accuracy and low-cost regions, while accuracy-only, cost-only, and weighted-sum baselines each concentrate in narrower regions. The accuracy-cost trade-off becomes a post-search choice, not a pre-search weight.
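The notion of a Pareto front over accuracy and prompt-token cost can be made concrete with a small sketch. It shows only the generic dominance filter, not CRAFT's acquisition or NSGA-II retention steps, and the prompt names and numbers are invented for illustration.

```python
def pareto_front(candidates):
    """Keep candidates not dominated on (higher accuracy, lower cost).

    Each candidate is a (name, accuracy, token_cost) tuple. A candidate is
    dominated if another is at least as good on both objectives and
    strictly better on at least one.
    """
    front = []
    for name, acc, cost in candidates:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in candidates
        )
        if not dominated:
            front.append((name, acc, cost))
    return front


# Hypothetical prompts with validation accuracy and prompt length in tokens.
prompts = [
    ("short", 0.70, 40),
    ("medium", 0.80, 120),
    ("long", 0.86, 400),
    ("bloated", 0.79, 300),  # worse than "medium" on both objectives
]
front = pareto_front(prompts)
front_names = [name for name, _, _ in front]
```

Every prompt left on the front is a defensible choice for some budget, which is why the abstract frames the final pick as a decision made after search rather than a weight fixed beforehand.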
[NLP-48] LifeSide: Benchmarking Agents as Lifelong Digital Companions
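Quick read: This paper addresses the evaluation of lifelong digital companions, which must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations test memory recall and short-term empathy in isolation. The proposed benchmark, LifeSide, centers on multi-session Memory-Emotion-Environment loops. It models users as persistent worlds with layered profiles and event trajectories and uses multi-agent simulation to project environmental dynamics into dialogue, preserving the gap between latent thoughts and observable expressions. With 2,000 personas and 111K tasks covering memory tracking, user understanding, privacy control, and emotional companionship, the experiments show that even models that saturate current memory benchmarks fail to sustain accurate user understanding and true companionship over long horizons.
Link: https://arxiv.org/abs/2606.04660
Authors: Yuqian Wu, Zhijie Deng, Wei Chen, Junwei Li, Yutian Jiang, Junle Chen, Zhengjun Huang, Qingxiang Liu, Jing Tang, Jiaheng Wei, Yuxuan Liang
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Hong Kong University of Science and Technology; Tencent
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 23 figures, 7 tables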
Abstract: Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and short-term empathy in isolation. To bridge this gap, we introduce LifeSide, a benchmark centered on multi-session Memory-Emotion-Environment loops. By modeling users as persistent worlds with layered profiles and event trajectories, LifeSide uses multi-agent simulation to project environmental dynamics into dialogue, preserving the critical gap between latent thoughts and observable expressions. Evaluating 2,000 personas and 111K tasks across memory tracking, user understanding, privacy control, and emotional companionship, our experimental results reveal a stark reality: even models that saturate current memory benchmarks fail to sustain accurate user understanding and true companionship over long horizons.
[NLP-49] CYGNET: Cypher Gate for Neural Execution Triage and Cost Containment
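Quick read: This paper addresses Cypher queries generated by language-model agents over knowledge graphs that fail structurally (crashing at the database) or semantically (executing but returning wrong results). It places a pre-execution gate between query generation and a production Neo4j database. The gate validates structure through a four-backend chain that culminates in execution against a mirror graph, at 5.6 ms median latency. Structurally broken queries go to a corrector that iterates structured error feedback through a language model. On seven CypherBench schemas (2348 questions, ACL 2025), the pipeline maintains generation accuracy on every model tested, confirming that it works as a safe defensive layer, and the corrector succeeds in 81% to 95% of cases across five models (mean 89%). On a template-generated corpus across nine schemas, the gate catches 100% of parse errors, constraint violations, and schema-reference errors in path queries with labelled endpoints, with zero false positives across 1135 queries. Property sibling-swaps, where the substituted name is valid on the target label, score 0%, marking the boundary where structural validation ends and semantic validation must begin. A planner-based cost gate flags catastrophic plan structures before execution.
Link: https://arxiv.org/abs/2606.04645
Authors: Nikodem Tomczak
Affiliations: Thulge Labs (Singapore)
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: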
Abstract:Language models acting as agents over knowledge graphs generate Cypher queries that fail structurally (crashing at the database) or semantically (executing but returning wrong results). We place a pre-execution gate between query generation and a production Neo4j database. The gate validates structure through a four-backend chain culminating in execution against a mirror graph at 5.6 ms median latency. Structurally broken queries are routed to a corrector that iterates structured error feedback through a language model. On seven CypherBench schemas (2348 questions, ACL 2025) the pipeline maintains generation accuracy on every model tested, confirming it operates as a safe defensive layer. The corrector achieves 81% to 95% success across five models (mean 89%). On a template-generated corpus across nine schemas the gate catches 100% of parse errors, 100% of constraint violations, and 100% of schema-reference errors in path queries with labelled endpoints, at zero false positives across 1135 queries. Property sibling-swaps where the substituted name is valid on the target label score 0%, marking the formal boundary where structural validation ends and semantic validation must begin. A planner-based cost gate flags catastrophic plan structures before execution.
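The gate idea above can be illustrated with a minimal sketch. It is not the paper's four-backend chain: the two stages, the toy `SCHEMA`, and the names `gate`, `check_balanced`, and `check_labels` are all assumptions made for illustration, and the bracket check would misfire on brackets inside string literals. The point is only the control flow: cheap checks run first, and the first failure is returned as structured feedback that a corrector could consume.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical toy schema: node labels mapped to their allowed properties.
SCHEMA = {"Person": {"name", "born"}, "Movie": {"title", "released"}}


@dataclass
class GateResult:
    ok: bool
    stage: Optional[str] = None  # name of the stage that rejected the query
    error: Optional[str] = None  # structured feedback for a corrector


def check_balanced(query: str) -> Optional[str]:
    """Cheap syntactic stage: brackets must be balanced and properly nested."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack: List[str] = []
    for ch in query:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return f"unbalanced '{ch}'"
    return "unclosed bracket" if stack else None


def check_labels(query: str) -> Optional[str]:
    """Schema stage: every (var:Label) node pattern must use a known label."""
    for label in re.findall(r"\(\s*\w*\s*:\s*(\w+)", query):
        if label not in SCHEMA:
            return f"unknown label '{label}'"
    return None


# Stages run in order, cheapest first; a real chain would end with execution
# against a mirror graph, which this sketch does not attempt.
STAGES: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("syntax", check_balanced),
    ("schema", check_labels),
]


def gate(query: str) -> GateResult:
    """Return the first failing stage, or ok=True if every stage passes."""
    for name, check in STAGES:
        err = check(query)
        if err is not None:
            return GateResult(False, name, err)
    return GateResult(True)
```

Calling `gate("MATCH (m:Film) RETURN m")` rejects the query at the `schema` stage, while a query with an unclosed parenthesis is stopped earlier at the `syntax` stage.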
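[NLP-50] VentAgent: When LLMs Learn to Breathe – Multi-Objective Arbitration for ARDS Ventilation
[Quick read]: This paper addresses mechanical ventilation management for Acute Respiratory Distress Syndrome (ARDS), where competing physiological goals are hard to balance. It targets two weaknesses of existing data-driven methods: imitation bias in approaches that imitate retrospective Electronic Health Records (EHR), and the opaque, hard-to-interpret policies that Reinforcement Learning (RL) produces under the adversarial trade-offs of critical care. The proposed solution, VentAgent, is a hierarchical framework that recasts ventilation control as a dynamic Multi-Objective Arbitration process rather than single-objective optimization. Large Language Models (LLMs) act as transparent arbitrators, and decision-making is decomposed into three interpretable stages: Perception, Planning, and Orchestration. The framework uses the semantic reasoning of LLMs to synthesize strategies from heterogeneous experts and resolves conflicting clinical priorities through an explicit coordination mechanism. On a high-fidelity physiological simulator, VentAgent outperforms state-of-the-art RL and classical control baselines while producing human-readable reasoning chains, which the authors present as a safer, more interpretable, and more adaptable paradigm for critical care automation.
Link: https://arxiv.org/abs/2606.04632
Authors: Teqi Hao,Yuxuan Fu,Xiaoyu Tan,Shaojie Shi,Bohao Lv,Yinghui Xu,Xihe Qiu
Affiliations: Shanghai University of Engineering Science; Tencent Youtu Lab; Fudan University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: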
Abstract:Mechanical ventilation for Acute Respiratory Distress Syndrome (ARDS) requires balancing competing physiological goals, including oxygenation, lung protection, and acid-base homeostasis. However, current data-driven methods, especially those imitating retrospective Electronic Health Records (EHR), often suffer from imitation bias. They may capture superficial correlations from inconsistent clinical demonstrations, such as associating passive ventilator settings with survival because such settings are common in stable patients, and thus fail to generalize to volatile or out-of-distribution phenotypes. Standard Reinforcement Learning (RL) methods also struggle with the adversarial trade-offs of critical care and often produce opaque policies with limited clinical interpretability. To address these limitations, we introduce VentAgent, a hierarchical framework in which Large Language Models (LLMs) act as transparent arbitrators for mechanical ventilation. We reformulate ventilation control as a dynamic Multi-Objective Arbitration process rather than single-objective optimization. VentAgent decomposes decision-making into three interpretable stages: Perception, Planning, and Orchestration. By leveraging the semantic reasoning capabilities of LLMs, it synthesizes strategies from heterogeneous experts and resolves conflicting clinical priorities through an explicit coordination mechanism. Evaluations on a high-fidelity physiological simulator show that VentAgent outperforms state-of-the-art RL and classical control baselines. Moreover, it converts control decisions into human-readable reasoning chains, offering a safer, more interpretable, and adaptable paradigm for critical care automation.
[NLP-51] Hybrid Adversarial Defence for Natural Language Understanding Tasks
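[Quick read]: This paper addresses two key weaknesses of Large Language Models (LLMs): hallucination and adversarial manipulation. The two problems are closely related, yet existing defences usually treat them separately. The paper investigates a hybrid defence framework that combines entropy-based models, designed to reduce hallucination, with uncertainty-based and geometric-based models, designed to reduce vulnerability to attack. In in-domain tests (FEVER, HotpotQA, CSQA, SIQA) the hybrid model improves both clean-task accuracy and adversarial robustness, and it shows similar robustness on out-of-distribution datasets (AeroEngQA, CPIQA). On prompt-injection and jailbreak-detection datasets it reduces attack success rate by up to 51% relative to state-of-the-art baselines. Overall, combining entropy, uncertainty, and geometric features is more effective than using any single feature alone.
Link: https://arxiv.org/abs/2606.04612
Authors: Manar Abouzaid,Yang Wang,Chenghua Lin,Stuart E. Middleton
Affiliations: University of Southampton; University of Manchester
Categories: Computation and Language (cs.CL)
Comments: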
Abstract:Large Language Models (LLMs) are vulnerable both to hallucination and adversarial manipulation. Although these problems are closely related, existing defences typically address them separately. We investigate a hybrid defence framework that combines entropy-based models, designed to reduce hallucinations, with uncertainty-based models and geometric-based models, designed to reduce vulnerability. Under in-domain tests on Natural Language Understanding datasets (FEVER, HotpotQA, CSQA, SIQA) we find our hybrid model improves both clean-task performance (up to 43.34% increase in accuracy) and adversarial robustness (up to 64.92% improvement in accuracy and 62.27% reduction in attack success rate). For out-of-distribution datasets (AeroEngQA, CPIQA) we see similar adversarial robustness from our hybrid model (up to 57.14% improvement in accuracy). For prompt injection (SafeGuard) and jailbreak detection (AdvBench, DAN) datasets our hybrid model is also very strong (up to 51% reduction in attack success rate compared to state of the art baseline models). Overall, our results show that combining entropy, uncertainty and geometric features provides a more effective defence strategy than using any single feature alone for both in-domain and out-of-distribution tasks.
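As a rough illustration of the kind of entropy feature such a defence could consume, the sketch below computes the Shannon entropy of a predicted answer distribution and flags high-entropy predictions. This is an assumption about one possible feature, not the paper's model: the function names and the `threshold` value are hypothetical, and the uncertainty and geometric features are not shown.

```python
import math
from typing import Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution.

    The input is renormalised, so unnormalised non-negative scores also work.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        if p > 0:  # 0 * log(0) is treated as 0
            q = p / total
            entropy -= q * math.log(q)
    return entropy


def flag_uncertain(probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag a prediction whose entropy exceeds a (hypothetical) threshold."""
    return predictive_entropy(probs) > threshold
```

A one-hot distribution has entropy 0 and is never flagged, while a uniform distribution over four answers has entropy `log(4)` and is flagged at the default threshold.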
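[NLP-52] A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs
[Quick read]: This paper studies positional bias in multi-video summarization with Multimodal Large Language Models (MLLMs): the quality of a per-video summary can change with the video's position in the input sequence even when the content is unchanged, which undermines the reliability of model output. The authors build a benchmark covering Cooking, Domestic, Leisure, and News settings and use three complementary metrics, Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG), to quantify the position sensitivity of nine open-source and proprietary MLLMs under two- and four-video inputs. Positional effects turn out to be domain- and model-dependent, and increasing the visual or generation budget does not uniformly remove the imbalance. An analysis of prompt-level mitigation methods further shows that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.
Link: https://arxiv.org/abs/2606.04596
Authors: Huangchen Xu,Yuan Wu,Yi Chang
Affiliations: Jilin University
Categories: Computation and Language (cs.CL)
Comments: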
Abstract:Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video’s input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.
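[NLP-53] Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues
[Quick read]: This paper addresses fine-grained fragment retrieval in multi-modal long-form dialogues: in long conversations that interleave text and images, the goal is to locate coherent multi-utterance, multi-image fragments related to a specific topic rather than isolated utterances. Two systems are proposed for two settings. For retrieval within a single dialogue, F2RVLM is a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to improve fragment coherence. For retrieval from a large-scale dialogue corpus in open-domain scenarios, FFRS is a two-stage system: offline, each dialogue is decomposed into minimal semantic fragments that a Fragment Embedding Model (FEM) encodes into a vector database; at inference, FEM rapidly recalls Top-K candidates and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. The authors also construct MLDR, described as the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks show that F2RVLM and FFRS achieve superior performance on single-dialogue and corpus-level FFR.
Link: https://arxiv.org/abs/2606.04591
Authors: Hanbo Bi,Zhiqiang Yuan,Chongyang Li,Qiwei Yan,Zexi Jia,Jiapei Zhang,Xiaoyue Duan,Yingchao Feng,Jinchao Zhang,Jie Zhou
Affiliations: Tsinghua University; Institute of Automation, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: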
Abstract:With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.
[NLP-54] VCIFBench: Evaluating Complex Instruction Following for Video Understanding
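[Quick read]: This paper addresses the lack of evaluation of complex instruction following in video understanding by Multimodal Large Language Models (MLLMs). Existing benchmarks mostly rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. The paper proposes VCIFBench, which constructs constraint-rich instructions covering content, format, style, and structure requirements from both benchmark-adapted and directly video-grounded prompts, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair Direct Preference Optimization (DPO) preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging, and DPO training on VCIFBench data can improve instruction-following performance.
Link: https://arxiv.org/abs/2606.04588
Authors: Huangchen Xu,Yuan Wu,Yi Chang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: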
Abstract:Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating complex instruction following in video understanding. VCIFBench constructs constraint-rich instructions from both benchmark-adapted and directly video-grounded prompts, covering content, format, style, and structure requirements, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging. We further show that DPO training on VCIFBench data can improve instruction-following performance.
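[NLP-55] Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents
[Quick read]: This paper addresses a weakness of memory systems for long-horizon conversational agents: histories of evolving events, tasks, and goals are naturally temporal, yet many existing systems organize information mainly by topical similarity and may ignore the order in which events occur. The proposed Segment Tree Memory (SegTreeMem) represents conversation history as a temporally ordered segment tree over utterances. New utterances are inserted incrementally through an online rightmost-frontier update rule, which preserves chronological order while forming hierarchical memory segments. At retrieval time, relevance scores are propagated through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. A temporal-order permutation analysis shows that the gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.
Link: https://arxiv.org/abs/2606.04555
Authors: Yifan Simon Liu,Liam Gallagher,Faeze Moradi Kalarde,Jiazhou Liang,Armin Toroghi,Scott Sanner
Affiliations: University of Toronto; Vector Institute for Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: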
Abstract:Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. We introduce Segment Tree Memory, or SegTreeMem, a memory architecture that represents conversation history as a temporally ordered Segment Tree over utterances. SegTreeMem incrementally inserts new utterances through an online rightmost-frontier update rule, preserving chronological order while forming hierarchical memory segments. For retrieval, SegTreeMem propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. Additional temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.
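To make the idea of an append-only, temporally ordered tree concrete, here is a minimal sketch. It is not SegTreeMem's actual update rule, which the abstract does not specify: the merge policy (combine the two rightmost subtrees whenever they cover the same number of utterances) and the names `TemporalSegmentTree`, `Node`, and `frontier` are assumptions chosen for illustration. What the sketch does share with the description is that new utterances only ever touch the right edge of the structure, so chronological order is preserved by construction.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    start: int                  # index of the first utterance covered
    end: int                    # index of the last utterance covered (inclusive)
    text: Optional[str] = None  # set only on leaves
    children: List["Node"] = field(default_factory=list)

    @property
    def size(self) -> int:
        return self.end - self.start + 1


class TemporalSegmentTree:
    """Append-only tree whose leaves stay in insertion (chronological) order."""

    def __init__(self) -> None:
        self.frontier: List[Node] = []  # roots of complete subtrees, left to right
        self.count = 0

    def append(self, utterance: str) -> None:
        """Insert at the right edge, then merge equal-sized rightmost subtrees."""
        self.frontier.append(Node(self.count, self.count, text=utterance))
        self.count += 1
        while (len(self.frontier) >= 2
               and self.frontier[-1].size == self.frontier[-2].size):
            right = self.frontier.pop()
            left = self.frontier.pop()
            self.frontier.append(Node(left.start, right.end, children=[left, right]))

    def spans(self) -> List[Tuple[int, int]]:
        """Utterance ranges covered by the current top-level segments."""
        return [(n.start, n.end) for n in self.frontier]

    def leaves(self) -> List[str]:
        """All utterances, read left to right across the forest."""
        out: List[str] = []

        def walk(node: Node) -> None:
            if node.text is not None:
                out.append(node.text)
            for child in node.children:
                walk(child)

        for root in self.frontier:
            walk(root)
        return out
```

After appending five utterances, the top-level segments cover indices 0 to 3 and index 4, and reading the leaves left to right returns the utterances in the order they were spoken. A retrieval step that propagates relevance scores along parent links would sit on top of this structure and is not shown.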
[NLP-56] LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling
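[Quick read]: This paper addresses the reliance of genomic foundation models on fixed tokenization schemes (k-mers, BPE, or single nucleotides), which impose arbitrary sequence boundaries that may obscure biologically relevant structure. It presents LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling. LDARNet combines BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer to learn adaptive token boundaries without supervision. Fine-tuned on 27 tasks from the Nucleotide Transformer and Genomic Benchmarks suites, it achieves 11/18 wins among compact models (≤300M parameters) and state-of-the-art results on 5 histone modification tasks, outperforming models up to 20× larger. A FLOPs-matched controlled experiment shows learned boundaries beating fixed-grid boundaries by up to 14 percentage points on histone tasks at identical compute, isolating learned routing as the source of the gains. Nucleotide-resolution analysis shows that the learned boundaries align with canonical promoter motifs and splice junctions, giving adaptive tokenization a biological interpretation.
Link: https://arxiv.org/abs/2606.04552
Authors: Daria Ledneva,Denis Kuznetsov
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Genomics (q-bio.GN)
Comments: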
Abstract:Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as k-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling, combining BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer to induce adaptive token boundaries without supervision. Fine-tuned on 27 tasks from the Nucleotide Transformer and Genomic Benchmarks suites, LDARNet achieves 11/18 wins among compact models (≤300M parameters) and state-of-the-art results on 5 histone modification tasks, outperforming models up to 20× larger. A FLOPs-matched controlled experiment isolates learned routing as the source of these gains: learned boundaries beat fixed-grid boundaries by up to 14 percentage points on histone tasks at identical compute. Nucleotide-resolution analysis further shows that the learned boundaries align with canonical promoter motifs and splice junctions without supervision, providing a biological interpretation for adaptive tokenization in genomic foundation models.
[NLP-57] Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models ACL2026
[Quick read]: This paper addresses the rigidity of fixed anchors in format-constrained generation with diffusion large language models: fixed anchors impose rigid spans, which often leads to truncated reasoning or redundant content. The proposed Dynamic Infilling Anchors (DIA) is a training-free method that dynamically estimates end-anchor positions to adjust generation length before iterative infilling. On reasoning benchmarks, DIA substantially improves format compliance and answer accuracy, with significant zero-shot gains on GSM8K and MATH.
Link: https://arxiv.org/abs/2606.04535
Authors: Boyan Han,Yiwei Wang,Yi Song,Yujun Cai,Chi Zhang
Affiliations: Westlake University; University of California, Merced; Teeni AI; The University of Queensland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Abstract:Diffusion large language models (dLLMs) offer bidirectional attention and parallel generation, enabling them to exploit global context and naturally support format-constrained tasks like parseable JSON or reasoning templates. While straightforward fixed anchors can enforce such constraints, they often impose rigid spans, leading to truncated reasoning or redundant content. To overcome this, we propose Dynamic Infilling Anchors (DIA), a training-free method that dynamically estimates end-anchor positions to adjust generation length before iterative infilling. This flexible mechanism ensures structural correctness and semantic coherence, avoiding the inefficiencies of fixed-span methods. Experiments on reasoning benchmarks demonstrate that DIA substantially improves format compliance and answer accuracy, achieving significant zero-shot gains on GSM8K and MATH. These results establish DIA as a robust pathway toward reliable, structure-aware generation.
[NLP-58] GENEB: Why Genomic Models Are Hard to Compare
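[Quick read]: This paper addresses the difficulty of comparing genomic foundation models, which stems from fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting; as a result, claims of superiority or generality are often not directly comparable. It introduces GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models on 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. The analysis finds that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale brings only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. The authors position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.
Link: https://arxiv.org/abs/2606.04525
Authors: Daria Ledneva,Mikhail Nuridinov,Denis Kuznetsov
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments: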
Abstract:Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.
[NLP-59] SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference
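[Quick read]: This paper addresses two remaining challenges of sparse attention in long-context LLM inference. First, KV cache capacity still grows with sequence length, and offloading the cache to CPU memory introduces a PCIe transfer bottleneck. Second, the sparse selection step itself retains O(T^2) complexity and can dominate attention cost at long contexts. The paper proposes SparDA, a decoupled sparse attention architecture that adds a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because the Forecast is decoupled from the attention query, the GQA implementation uses one Forecast head per GQA group, reducing selection overhead. SparDA adds 0.5% parameters and trains only the Forecast projections, by matching the original selector's attention distribution. On two sparse-pretrained 8B models, it matches or slightly improves accuracy and delivers up to 1.25× prefill speedup and 1.7× decode speedup over the sparse-attention offload baseline; by enabling larger batch sizes on a single GPU, it reaches up to 5.3× higher decode throughput than the non-offload sparse baseline.
Link: https://arxiv.org/abs/2606.04511
Authors: Yaosheng Fu,Guangxuan Xiao,Xin Dong,Song Han,Oreste Villa
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: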
Abstract:Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains O(T^2) complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds 0.5% parameters and trains only the Forecast projections by matching the original selector’s attention distribution. On two sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25× prefill speedup and 1.7× decode speedup over the sparse-attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3× higher decode throughput than the non-offload sparse baseline. Our source code is available at this https URL.
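The lookahead selection described in this entry can be sketched in a few lines of plain Python. Everything concrete here is an assumption made for illustration: mean-pooled keys as block summaries, dot-product scoring, top-k selection, and the names `forecast_blocks` and `w_forecast`. The abstract only states that a Forecast projection predicts which KV blocks the next layer will need so that they can be prefetched while the current layer runs.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    return [dot(row, vec) for row in matrix]


def block_summaries(keys: List[Vector], block_size: int) -> List[Vector]:
    """Mean-pool the keys of each block (an assumed summary, not the paper's)."""
    summaries = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        summaries.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return summaries


def forecast_blocks(hidden: Vector, w_forecast: List[Vector],
                    next_layer_keys: List[Vector],
                    block_size: int, top_k: int) -> List[int]:
    """Pick the KV blocks of the NEXT layer to prefetch.

    The forecast vector comes from a projection separate from the attention
    query, so it can be computed before the next layer's query exists.
    """
    forecast = matvec(w_forecast, hidden)
    scores = [dot(forecast, s) for s in block_summaries(next_layer_keys, block_size)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])
```

In a real system the returned block indices would drive an asynchronous CPU-to-GPU copy that overlaps with the current layer's compute; that scheduling is outside the scope of this sketch.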
[NLP-60] Self-Evolving Deep Research via Joint Generation and Evaluation
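[Quick read]: This paper addresses reward design for deep research report generation, which lacks definitive ground truth, making rewards unverifiable and limiting effective reinforcement learning. Existing approaches use LLM-as-a-judge and query-dependent evaluation rubrics, but their evaluators are static and cannot adapt their standards as the solver improves, so optimization pressure is insufficient and eventually saturates. The paper proposes SCORE, a self-evolving co-evolutionary training framework for deep research evaluation and generation, which tightly couples an evaluator and a solver in a shared-parameter learning process and exploits the intrinsic connection between generation and evaluation for joint improvement. A meta-harness dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Experiments on deep research benchmarks show consistent improvement in report generation quality.
Link: https://arxiv.org/abs/2606.04507
Authors: Han Zhu,Chengkun Cai,Yuanfeng Song,Xing Chen,Sirui Han,Yike Guo
Affiliations: The Hong Kong University of Science and Technology; ByteDance; University College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: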
Abstract:Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground-truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a self-evolving co-evolutionary training framework for deep research evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To restrict this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.
[NLP-61] SANE Schema-aware Natural-language Evaluation of Biological Data
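[Quick read]: This paper addresses two problems: access to the large structured datasets produced by high-throughput microscopy typically requires SQL expertise, and the tendency of large language models to hallucinate makes text-to-SQL results unreliable. It presents SANE (Schema-Aware Natural-language Evaluation), a paradigm for domain-specific text-to-SQL evaluation based on schema-grounded, automatically generated benchmarks tied to real experimental structure, which makes evaluation more scalable, systematic, and reproducible. Using SANE, the authors show that a few-shot large language model achieves accurate query generation under constrained schemas with structured prompting and guardrails, without any model training or fine-tuning. Most failures stem from ambiguous or underspecified inputs and appear as overly cautious clarification requests, or as answers to queries that should first have been disambiguated, rather than as incorrect SQL. The results indicate that few-shot large language models can provide reliable database access in well-defined domains when combined with schema-aware prompting.
Link: https://arxiv.org/abs/2606.04500
Authors: Rolf Gattung,Martin Krueger,Markus Reischl
Affiliations: Karlsruhe Institute of Technology (KIT); Institute for Automation and Applied Informatics (IAI)
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 3 figures, submitted but not yet reviewed by BMT2026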
Abstract:High-throughput microscopy generates large, structured datasets capturing cellular responses to pharmacological perturbations, but accessing these datasets typically requires SQL expertise. Large language models offer a natural-language alternative, yet their tendency to hallucinate raises concerns about result reliability. We present SANE (Schema-Aware Natural-language Evaluation), a novel paradigm for domain-specific text-to-SQL evaluation: schema-grounded, automatically generated benchmarks tied to real and specific experimental structure. SANE makes evaluation more scalable, systematic, and reproducible. Using SANE, we evaluate a few-shot large language model and show that, under constrained schemas with structured prompting and guardrails, accurate query generation is achievable without any model training or fine-tuning. Most failures stem from ambiguous or underspecified inputs and manifest as overly cautious clarification requests or answers to queries that should first be disambiguated, rather than incorrect SQL generation. These results indicate that few-shot large language models can provide reliable database access in well-defined domains when combined with schema-aware prompting.
[NLP-62] Global Sketch-Based Watermarking for Diffusion Language Models
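[Quick read]: This paper addresses watermarking for non-autoregressive generation, specifically masked diffusion language models, where many positions are sampled jointly. Existing watermarks are largely designed for the autoregressive setting and perturb the next token's distribution as a function of its preceding tokens. The proposed watermark instead controls a global, vector-valued sketch representation of the text. This decouples detection from the local contexts seen during generation, yielding an order-agnostic statistic and a watermarking rule that does not manifest as a simple token bias. The paper analyzes the distortion, soundness, and robustness properties of the method.
Link: https://arxiv.org/abs/2606.04486
Authors: Daniel Zhao
Affiliations: Harvard University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: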
Abstract:Watermarking methods for language models have been studied extensively in the autoregressive setting, where tokens are generated sequentially. These works largely focus on local-context schemes that perturb the next token’s distribution as a function of its preceding tokens. In diffusion language models, distributions over many unresolved positions are jointly sampled, allowing additive statistics of the entire sequence to be tractable during generation. We propose a watermark for masked diffusion language models that controls a global, vector-valued sketch representation of the text. Compared to context-dependent watermarking, the sketch formulation decouples detection from the local contexts seen during generation, resulting in an order-agnostic statistic and a watermarking rule which does not manifest as a simple token bias. We analyze the distortion, soundness, and robustness properties of the method.
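The notion of an additive, order-agnostic sketch can be illustrated as follows. This is a generic construction assumed for illustration, not the paper's scheme: each token id is mapped by a keyed hash to a vector of ±1 signs, the text sketch is the sum of those vectors, and a hypothetical detector projects the sketch onto a secret target direction. The names `token_vector`, `text_sketch`, `detection_score`, and the parameter `dim` are all invented here, and the normalisation treats the signs as independent, which repeated tokens violate.

```python
import hashlib
import math
from typing import List, Sequence


def token_vector(token_id: int, key: bytes, dim: int = 16) -> List[int]:
    """Keyed pseudo-random vector of +1/-1 signs for one token id (dim <= 256)."""
    digest = hashlib.sha256(key + token_id.to_bytes(4, "big")).digest()
    return [1 if (digest[i // 8] >> (i % 8)) & 1 else -1 for i in range(dim)]


def text_sketch(token_ids: Sequence[int], key: bytes, dim: int = 16) -> List[int]:
    """Sum of per-token vectors: additive, so token order does not matter."""
    sketch = [0] * dim
    for token_id in token_ids:
        for i, sign in enumerate(token_vector(token_id, key, dim)):
            sketch[i] += sign
    return sketch


def detection_score(token_ids: Sequence[int], key: bytes,
                    target: Sequence[int], dim: int = 16) -> float:
    """Projection of the sketch onto a secret +1/-1 target direction.

    Normalised by sqrt(n * dim); large positive values would suggest that
    generation steered the sketch toward the target (an assumed detector).
    """
    n = len(token_ids)
    if n == 0:
        return 0.0
    sketch = text_sketch(token_ids, key, dim)
    projection = sum(s * t for s, t in zip(sketch, target))
    return projection / math.sqrt(n * dim)
```

Because the sketch is a plain sum, shuffling the tokens leaves both the sketch and the detection score unchanged, which is the sense in which such a statistic is independent of the order in which a diffusion model resolves positions.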
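[NLP-63] Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs
[Quick read]: This paper argues that existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch, and that the real failure mode is an entire register of natural human writing that safety training has under-covered. It introduces what the authors describe as the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene, with no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, the attack raises mean attack success rate (ASR) from 0.278 to 0.731 under a four-judge ensemble, and a factorial decomposition attributes the gain to register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks. The paper also proposes SAGA-A4, a static four-turn extension that attains a mean ASR of 0.924, substantially exceeding three existing multi-turn methods.
Link: https://arxiv.org/abs/2606.04483
Authors: Zhongze Luo,Ruihe Shi,Zhenshuai Yin,Haoyue Liu,Weixuan Wan,Xiaoying Tang
Affiliations: The Chinese University of Hong Kong (Shenzhen); The Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen); The Guangdong Provincial Key Laboratory of Future Networks of Intelligence; Xi’an Jiaotong University
Categories: Computation and Language (cs.CL)
Comments: 23 pages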
Abstract:Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.
[NLP-64] Evaluating Reasoning Fidelity in Visual Text Generation CVPR2026
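[Quick read]: This paper asks whether text-to-image (T2I) models faithfully preserve reasoning when complex solutions must be expressed as rendered text in an image. Although current T2I models render legible, well-structured text, it is unclear whether they preserve reasoning or merely imitate surface patterns. The evaluation covers long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps even when the rendered text is visually clear, in contrast with the strong reasoning of text-only models on the same tasks. The findings reveal a substantial gap between visual text generation and procedural reasoning.
Link: https://arxiv.org/abs/2606.04479
Authors: Jiajun Hong,Jiawei Zhou
Affiliations: Stony Brook University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Peer reviewed and accepted at CVPR 2026 at the GRAIL-V (Grounded Retrieval and Agentic Intelligence for Vision-Language) workshop (non-archival track)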
Abstract:Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.
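[NLP-65] Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
【Quick read】: This paper addresses the modality gap in which Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. The gap is not a uniform cognitive deficit: across three SLLMs, speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks, but on logical tasks that require entity tracking, S2T accuracy collapses to chance. The authors diagnose this as an entity binding failure: continuous speech features cause the model to lose precise entity-property associations during implicit reasoning. Their remedy, Entity-Aware Chain-of-Thought (EA-CoT), forces the model to explicitly enumerate entities and bind them to claims before reasoning. EA-CoT bridges the gap even when spoken names are misrecognized, with up to a 24.4% absolute accuracy improvement, and ablations attribute the gains entirely to explicit semantic binding, reframing the gap as a resolvable bottleneck.
Link: https://arxiv.org/abs/2606.04474
Authors: Ming-Hao Hsu,Xiaohai Tian,Jun Zhang,Zhizheng Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: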
Abstract:Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this localized degradation as an entity binding failure: continuous speech features cause models to lose precise entity-property associations during implicit reasoning. To resolve this, we propose Entity-Aware Chain-of-Thought (EA-CoT), forcing SLLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4% absolute accuracy improvement. Ablations confirm these gains stem entirely from explicit semantic binding, reframing the gap as a resolvable bottleneck.
[NLP-66] Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning
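【Quick read】: This paper addresses the lack of a systematic, stage-specific data selection strategy in the SFT-then-RL pipeline used to post-train Small Language Models (SLMs) for reasoning. The authors argue that supervised fine-tuning (SFT) is better suited for acquiring reasoning skills the model has not yet mastered, while reinforcement learning (RL) is better suited for consolidating skills the model can already partially access. They propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specific sets. For hard samples in the SFT stage, a Bridge mechanism transforms raw teacher-generated reasoning traces into supervision that is easier for an SLM to learn. For hard samples that remain unsolved during RL, Critique Fine-Tuning converts all-zero-reward failures into diagnostic, repair, and new reasoning trace supervision for the next SFT stage. Experiments on two SLMs across five reasoning benchmarks show consistent improvements over representative SFT, distillation, and RL baselines, highlighting the importance of coordinating data difficulty across the two stages.
Link: https://arxiv.org/abs/2606.04466
Authors: Chongyang He,Rui Zhang,Zixuan Wang,Xin Li
Affiliations: Tsinghua University; National University of Singapore; DiDi; University of Electronic Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: 25 pages, 12 figures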
Abstract:Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better suited for acquiring not-yet-mastered reasoning skills, while RL is better suited for consolidating skills that the model can already partially access. Based on this principle, we propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specific sets. For hard samples in the SFT stage, we introduce a Bridge mechanism that transforms raw teacher-generated reasoning traces into more learnable supervision for SLMs. For hard samples that remain unsolved during RL, we apply Critique Fine-Tuning by converting all-zero-reward failures into diagnostic, repair, and new reasoning trace supervision for the next SFT stage. Experiments on two SLMs across five reasoning benchmarks show that our method consistently improves over representative SFT, distillation, and RL baselines. Our results highlight the importance of coordinating data difficulty across SFT and RL for effective SLM reasoning post-training.
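The stage-specific split described above can be illustrated with a small sketch. This is not the authors' procedure: the per-sample pass rate, the thresholds `low` and `high`, and the function name `split_by_difficulty` are all hypothetical, chosen only to show how samples might be routed to SFT or RL by how often the current model already solves them.

```python
def split_by_difficulty(pass_rates, low=0.2, high=0.8):
    """Partition sample ids by the current model's estimated pass rate.

    pass_rates: dict mapping sample id -> fraction of sampled attempts solved.
    low, high: hypothetical thresholds, not values from the paper.
    """
    sft_hard, rl_pool, mastered = [], [], []
    for sample_id, rate in pass_rates.items():
        if rate < low:
            # Rarely solved: treat as a not-yet-mastered skill, route to SFT.
            sft_hard.append(sample_id)
        elif rate < high:
            # Sometimes solved: the skill is partially accessible, route to RL.
            rl_pool.append(sample_id)
        else:
            # Almost always solved: little left to learn from this sample.
            mastered.append(sample_id)
    return sft_hard, rl_pool, mastered
```

For example, `split_by_difficulty({"a": 0.0, "b": 0.5, "c": 1.0})` places `a` in the SFT set, `b` in the RL pool, and `c` among the mastered samples.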
[NLP-67] SePO: Self-Evolving Prompt Agent for System Prompt Optimization
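【Quick read】: This paper addresses a limitation of existing system prompt optimization methods: they build a prompt agent that refines task agents' system prompts, but the prompt agent's own system prompt stays hand-engineered and fixed, so its ability is bounded by that initial design. The authors propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside the task agents' prompts. SePO uses a self-referential design in which a single prompt agent improves both sets of prompts under an open-ended evolutionary search that keeps an archive of candidate prompts as stepping stones. Training has two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning applies it to a target task. On five benchmarks covering math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving average accuracy by 4.49 points over Manual-CoT. The prompt optimization skill learned in pre-training also generalizes to tasks outside the pre-training mixture rather than memorizing per-task prompts.
Link: https://arxiv.org/abs/2606.04465
Authors: Wangcheng Tao,Han Wu,Weng-Fai Wong
Affiliations: National University of Singapore; City University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages. Code: this https URL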
Abstract:System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents’ system prompts, yet leave the prompt agent’s own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent’s own system prompt as an optimization target alongside task agents’ system prompts. SePO adopts a self-referential design. A single prompt agent improves both task agents’ system prompts and its own under an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME’25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual-CoT. The prompt optimization skill from pre-training also generalizes to tasks beyond the pre-training mixture, rather than memorizing per-task prompts.
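[NLP-68] Token Rankings are Unforgeable Language Model Signatures
【Quick read】: This paper studies what a language model API leaks through its outputs. Exposing logits is known to reveal a model-specific geometric signature and also to leak the model's final-layer parameters. The authors examine more restrictive APIs that expose only token rankings (the ordering of tokens by probability, without the probability values) and find that rankings also form a signature: for sufficiently large k, every model has a unique set of feasible top-k rankings. This ranking signature is the first known (polynomially) unforgeable signature, because finding a model with the same set of feasible rankings is NP-hard. On the security side, rankings are still sufficient to approximately steal the final layer, as with logits, but the approximation is too coarse to forge the signature, and stealing can be countered by restricting the API to top-k tokens with sufficiently small k. Because the k needed to present the signature is generally smaller than the k needed to prevent stealing, an API can present an unforgeable signature without leaking model parameters.
Link: https://arxiv.org/abs/2606.04459
Authors: Matthew Finlayson,Andreas Grivas,Xiang Ren,Swabha Swayamdipta
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
Comments: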
Abstract:Language model parameters are known to impose unique (to each model) geometric constraints on their logit outputs, which serves as a signature that identifies the model, but also leaks the model’s final layer parameters when an API distributes logits. We investigate more restrictive APIs that expose token rankings (i.e., their ordering by probability, but not the probability values) and find that rankings also constitute a signature: every model has a unique set of feasible top-k rankings for sufficiently large k. Furthermore, the ranking signature is the first known (polynomially) unforgeable signature, since finding a model with the same set of feasible rankings is NP-hard. On the security front, we find that token rankings are already sufficient to approximately steal the final layer of the model, similar to logits, though the approximation is too coarse to forge the signature, and can be effectively countered by restricting the API to top-k tokens with sufficiently small k. Since the top-k required to present the model signature is generally smaller than the k required to prevent stealing, it is possible for an API to present an unforgeable signature without leaking model parameters.
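To make the notion of a ranking-only API concrete, here is a minimal sketch of what such an endpoint would return for one decoding step. The function name `top_k_ranking` and the tie-breaking rule (lower token id first) are assumptions for illustration, not part of the paper. Because softmax is monotonic, sorting by logit gives the same order as sorting by probability.

```python
def top_k_ranking(logits, k):
    """Return the ids of the k highest-scoring tokens, best first.

    Only the ordering is exposed; the logit or probability values are not.
    Ties are broken by token id, which is an arbitrary illustrative choice.
    """
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return order[:k]
```

Under these assumptions, `top_k_ranking([0.1, 2.0, -1.0, 0.5], 2)` yields the token ids `[1, 3]`, with no numeric scores attached.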
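[NLP-69] The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
【Quick read】: This paper targets a gap in current AI evaluation: existing benchmarks test agents on task execution within human-designed workflows, but do not measure whether models can autonomously develop agent systems. The authors introduce the Meta-Agent Challenge (MAC), in which a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. Multi-layer defenses against reward hacking protect evaluation integrity. Meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. The design process also shows high variance, and high optimization pressure surfaces emergent adversarial behaviors such as ground-truth exfiltration, exposing deficits in robustness and alignment. MAC is offered as a rigorous, open-source benchmark and an empirical proxy for evaluating recursive self-improvement.
Link: https://arxiv.org/abs/2606.04455
Authors: Xinyu Lu,Tianshu Wang,Pengbo Wang,Zujie Wen,Zhiqiang Zhang,Jun Zhou,Boxi Cao,Yaojie Lu,Hongyu Lin,Xianpei Han,Le Sun
Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Ant Group
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Website: this https URL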
Abstract:Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. Benchmark is publicly available at: this https URL.
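[NLP-70] Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation
【Quick read】: This paper addresses the weak logical consistency, factual grounding, and interpretability of large language models (LLMs) in complex multi-step reasoning. It proposes SGR, a stepwise reasoning enhancement framework that integrates LLMs with external knowledge graphs (KGs) through query-relevant subgraph generation. Given a question, SGR first extracts key entities, relations, and constraints to build a structured schema, then uses schema-guided querying to retrieve compact subgraphs from the KG. These subgraphs supply explicit relational evidence that guides the model through step-by-step reasoning. SGR also combines direct Cypher-based reasoning with collaborative reasoning integration, so that candidate answers from multiple reasoning paths are validated and aggregated according to model confidence and graph consistency. On CWQ, WebQSP, GrailQA, and KQA Pro, SGR improves reasoning accuracy and Hits@1 over standard prompting and several knowledge-enhanced baselines, and ablations show that both schema guidance and Neo4j-based retrieval are crucial.
Link: https://arxiv.org/abs/2606.04454
Authors: Xin Zhang,Yang Cao,Baoxing Wu,Kai Song,Siying Li
Affiliations: Chongqing Jiaotong University; Chongqing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL)
Comments: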
Abstract:Large language models have shown strong performance in natural language generation and downstream reasoning tasks, but they still struggle with logical consistency, factual grounding, and interpretability in complex multi-step reasoning. To address these limitations, this paper proposes SGR, a stepwise reasoning enhancement framework that integrates large language models with external knowledge graphs through query-relevant subgraph generation. Given an input question, SGR first extracts key entities, relations, and constraints to construct a structured schema, then retrieves compact subgraphs from a knowledge graph using schema-guided querying. The generated subgraphs provide explicit relational evidence that guides the language model through step-by-step reasoning. In addition, SGR combines direct Cypher-based reasoning with collaborative reasoning integration, allowing candidate answers from multiple reasoning paths to be validated and aggregated according to both model confidence and graph consistency. Experiments on benchmark datasets including CWQ, WebQSP, GrailQA, and KQA Pro demonstrate that SGR improves reasoning accuracy and Hits@1 performance over standard prompting and several knowledge-enhanced baselines. Ablation studies further show that schema guidance and Neo4j-based retrieval are both crucial to the effectiveness of the framework. These results indicate that dynamically generated external subgraphs can improve the accuracy, robustness, and interpretability of LLM-based reasoning.
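[NLP-71] Listening to the Workforce: Measuring Construction Worker Safety Attitudes from Social Media Discourse Using LLMs
【Quick read】: This paper addresses the difficulty of measuring construction workers' safety attitudes at scale. Safety attitudes are multidimensional, vary across topics, and surface most candidly in workers' own conversations, which makes them hard to quantify with traditional methods. The study creates and validates the Construction Safety Attitude Framework (CSAF), which combines a theory-grounded structure of eight attitude dimensions with an operational codebook for measuring them in naturalistic discourse. On 250 posts and comments from r/Construction on Reddit, trained coders reached strong agreement (Krippendorff's α = 0.85), and pairwise lift and conditional probability analyses confirmed that the eight dimensions are related yet distinct. To scale up, CSAF was operationalized as a large language model (LLM) classifier that reproduced expert human coding on 450 r/Construction contributions (Cohen's κ = 0.90, precision = 0.98, recall = 0.98) and retained that accuracy on 400 contributions from r/Roofing (κ = 0.89, precision = 0.98, recall = 0.97). A case study on 10,346 r/Roofing contributions shows that CSAF can distinguish attitudes by safety topic, track how they shift over time, and trace the reasoning behind unfavorable ones, providing a basis for targeted interventions.
Link: https://arxiv.org/abs/2606.04450
Authors: Farouq Sammour,Yuxin Zhang,Zhenyu Zhang
Affiliations: Texas A&M University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: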
Abstract:Worker safety attitudes are key determinants of whether protective practices are applied or bypassed on construction sites. Yet measuring them at scale has remained out of reach. Safety attitudes are multidimensional, vary across topics, and surface most candidly in workers’ own conversations. This study created and validated the Construction Safety Attitude Framework (CSAF), which integrates two components: a theory-grounded structure that characterizes safety attitudes along eight dimensions, and an operational codebook for measuring them in worker naturalistic discourse. Applying CSAF to 250 posts and comments from the r/Construction community on Reddit, trained coders reached strong agreement (Krippendorff’s α = 0.85). Pairwise lift and conditional probability confirmed that the eight dimensions are related yet distinct. To apply the framework across large volumes of discourse, CSAF was operationalized through a large language model (LLM) classifier. On 450 r/Construction contributions, the classifier reproduced expert human coding (Cohen’s κ = 0.90, precision = 0.98, recall = 0.98), and on 400 contributions from r/Roofing it retained that accuracy after transfer to a different trade community (κ = 0.89, precision = 0.98, recall = 0.97). A proof-of-value case study then applied the validated classifier to 10,346 contributions from r/Roofing, demonstrating that CSAF can distinguish multidimensional attitudes by safety topic, track how they shift over time, and trace the reasoning behind unfavorable ones. The study therefore provides a theoretically grounded, empirically vetted instrument for examining safety attitudes, offering a basis for targeted interventions that address the attitudes underlying unsafe practices.
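The classifier-versus-human agreement above is reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e); the function name and the toy labels are illustrative and have nothing to do with the study's data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With the toy inputs `["pos", "pos", "neg", "neg"]` and `["pos", "neg", "neg", "neg"]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5.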
[NLP-72] MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning
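【Quick read】: AI systems increasingly need to both navigate multi-session conversation history and perform deep reading comprehension within long documents, yet no existing benchmark evaluates the two together. The authors introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs. Each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents from the Caselaw Access Project (20,000-50,000 tokens each), multi-session conversations grounded on those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: the system must first use conversation history to identify the relevant document and then extract the answer from it. Hybrid questions make up 75.1% of the dataset. Dataset quality is characterized through a prompt-sensitivity self-consistency analysis with an LLM-as-judge, giving a median Cohen's κ of 0.634 across the micro-worlds. Among six baselines spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems, the best (RAG-Both) reaches 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) drops to 0.267 on Hybrid despite scoring 0.453 on Doc-only questions, revealing a joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. The dataset, generation pipeline, and baseline implementations are released.
Link: https://arxiv.org/abs/2606.04442
Authors: Qiyang Xie,Jialun Wu,Xinjie He,Su Liu,Shuai Xiao,Zhiyuan Lin,Weikai Zhou
Affiliations: Northeastern University; Johns Hopkins University; Columbia University; Independent Researcher
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 2 figures, 8 tables. Submitted for peer review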
Abstract:AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each sourced from the Caselaw Access Project), multi-session conversations grounded on those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen’s κ = 0.634 across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.
[NLP-73] Stateful Visual Encoders for Vision-Language Models
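【Quick read】: This paper addresses the limited sensitivity of existing open-weight vision-language models (VLMs) to visual changes in multi-image, multi-turn agentic settings. In these models the visual encoder is stateless: each image is encoded independently, without access to prior visual context, so small but task-critical changes can be attenuated before the language model has a chance to compare them, especially when the changes do not affect the high-level semantics of the scene. The authors introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs with stateful encoders improve consistently on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning, and the gains hold across input resolutions, language model sizes, and VLM backbones. On real-world tasks including longitudinal radiology, fine-grained image comparison, and remote sensing, stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains.
Link: https://arxiv.org/abs/2606.04433
Authors: Zirui Wang,Junwei Yu,Adam Yala,David M. Chan,Joseph E. Gonzalez,Trevor Darrell
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project page: this https URL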
Abstract:Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: this https URL
[NLP-74] CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding
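【Quick read】: This paper addresses the difficulty existing neural audio codecs have in balancing reconstruction quality with token efficiency. They often encode perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. The authors reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec that learns to encode only perceptually important features and to discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency and substantially outperforms existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks show improved performance and up to 17x faster inference.
Link: https://arxiv.org/abs/2606.04418
Authors: Eugene Kwek,Feng Liu,Rui Zhang,Wenpeng Yin
Affiliations: Pennsylvania State University; Drexel University
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: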
Abstract:Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.
[NLP-75] Read the Trace Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models
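【Quick read】: This paper addresses how to obtain fine-grained training signals efficiently in reinforcement learning (RL) for diffusion large language models (dLLMs). Flat rollouts are cheap but assign a single outcome reward to the whole trajectory. Tree rollouts give finer, verifiable signals by branching partial trajectories and propagating leaf rewards upward, but they are compute intensive. The question is whether the denoising trace itself can supply tree-like supervision without tree-level compute. The authors propose CAPR (Cached-Amortized Path Refinement), which summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR redistributes the final outcome reward across blocks according to the tokens revealed in each block, so the value head learns to turn one sparse reward into block-level PPO weights. Rollout-generation cost falls to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts under standard settings. Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets, and on Sudoku it matches the strongest tree-structured baseline at less than one third of the per-step compute.
Link: https://arxiv.org/abs/2606.04396
Authors: Anant Khandelwal,Manish Gupta
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 10 figures, 7 tables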
Abstract:Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.
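The reward redistribution step can be pictured with a small sketch. The abstract only says the final reward is redistributed "according to the tokens revealed in each block"; the strictly proportional rule below, and the name `redistribute_reward`, are assumptions made for illustration and may differ from what CAPR actually does (which also involves a learned value head).

```python
def redistribute_reward(final_reward, tokens_revealed_per_block):
    """Split one trajectory-level reward across blocks.

    Assumption: each block's share is proportional to the number of tokens
    unmasked in that block. This is an illustrative rule, not CAPR's.
    """
    total = sum(tokens_revealed_per_block)
    if total == 0:
        # No tokens revealed anywhere: nothing to credit.
        return [0.0] * len(tokens_revealed_per_block)
    return [final_reward * n / total for n in tokens_revealed_per_block]
```

For instance, a reward of 1.0 over blocks that revealed 2, 6, and 0 tokens would be split as 0.25, 0.75, and 0.0 under this rule.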
[NLP-76] Physics-Informed Neural Network Modeling of Biodegradable Contaminant Transport through GCL/SL Composite Liners
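【Quick read】: This paper addresses the accuracy and stability of contaminant transport simulation through GCL/SL composite liner systems in landfills, particularly under different leachate-head conditions where prediction errors in the early transport stage are large. It develops a two-domain physics-informed neural network (PINN) framework in which the thin GCL layer is treated with a steady-state advection-dispersion-biodegradation formulation and the underlying soil liner (SL) is modeled as a transient transport domain. Two formulations are compared against analytical and finite-element reference solutions: a standard PINN with soft constraint enforcement (Std-PINN) and a hard-constrained PINN (H-PINN) that embeds selected boundary and initial conditions directly into the trial solutions. The H-PINN reduces the optimization burden of penalty-based constraints and gives more accurate and stable predictions, lowering the MAE from approximately 0.058-0.067 to about 0.011-0.023 and the MRE from approximately 9.10%-19.16% to about 2.08%-3.14%. The H-PINN is also extended to inverse modeling, identifying the SL degradation half-life from limited concentration observations with reliable convergence and acceptable robustness under low-to-moderate observation noise.
Link: https://arxiv.org/abs/2606.04392
Authors: Dong Li,Yapeng Cao,Haiping Zhao,Shutong Han
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: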
Abstract:This study develops a two-domain physics-informed neural network framework for contaminant transport through a GCL/SL composite liner system, in which the thin GCL layer is treated using a steady-state advection-dispersion-biodegradation formulation and the underlying soil liner is modeled as a transient transport domain. Two formulations are evaluated against analytical and finite-element reference solutions under different leachate-head conditions: a standard PINN with soft constraint enforcement (Std-PINN) and a hard-constrained PINN (H-PINN), in which selected boundary and initial conditions are embedded directly into the trial solutions. The Std-PINN captures the overall breakthrough behavior but shows larger errors during the early transport stage, particularly under higher leachate heads where advective transport becomes more pronounced. The H-PINN reduces the optimization burden associated with penalty-based constraint enforcement and provides more accurate and stable concentration predictions, lowering the MAE from approximately 0.058-0.067 for the Std-PINN to about 0.011-0.023 for the H-PINN, while reducing the MRE from approximately 9.10%-19.16% to about 2.08%-3.14%. Parametric analyses confirm that the H-PINN with the tanh activation function and an optimized network structure provides the best predictive accuracy. The H-PINN is further extended to inverse modeling for identifying the SL degradation half-life from limited concentration observations, showing reliable convergence toward prescribed values and acceptable robustness under low-to-moderate observation noise.
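The difference between soft and hard constraint enforcement can be shown with a generic sketch. The abstract does not give the paper's trial functions, so the form below, c(z, t) = c_init(z) + t * N(z, t), is a common textbook construction assumed here for illustration; `net` stands in for any network output and is not the authors' model.

```python
import math

def hard_constrained_trial(net, initial_condition):
    """Wrap a raw network so the initial condition holds by construction.

    At t = 0 the second term vanishes, so trial(z, 0) == initial_condition(z)
    for any network output, and no penalty term is needed for that condition.
    """
    def trial(z, t):
        return initial_condition(z) + t * net(z, t)
    return trial

# Hypothetical stand-ins for a trained network and an initial profile.
toy_net = lambda z, t: math.tanh(z + t)
toy_initial = lambda z: 2.0 * z
toy_trial = hard_constrained_trial(toy_net, toy_initial)
```

Here `toy_trial(0.5, 0.0)` returns exactly the initial value 1.0 regardless of what `toy_net` outputs, whereas a soft-constrained PINN would only approach that value as a penalty term is minimized.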
[NLP-77] When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling
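【Quick read】: This paper addresses a distortion in how large language models (LLMs) are evaluated for psychological counseling: existing benchmarks rely heavily on highly cooperative simulated clients. The authors observe a counselor-following phenomenon in which these clients shift from resistance to compliance after only a few turns, creating an illusion of therapeutic progress and letting superficial empathy inflate scores. To fix this evaluation mismatch they propose a resistance-aware framework grounded in Cognitive Behavioral Therapy (CBT), with three components. CARS is a client simulator that explicitly models dynamic resistance via Cognitive Conceptualization Diagrams (CCDs). STREAMS is a dual-module framework that decouples strategic reasoning (Thinker) from response generation (Presenter) and is optimized with reinforcement learning. EWTS-MI is an entropy-weighted metric for evaluating responsiveness under high-friction interactions. Experiments across resistant and non-resistant counseling settings confirm the evaluation mismatch and show that resistance-aware training improves strategic robustness in challenging counseling interactions.
Link: https://arxiv.org/abs/2606.04389
Authors: Yihao Qin,Junyi Zhao,Changsheng Ma,Yongfeng Tao,Minqiang Yang,Chang Liu,Bin Hu
Affiliations: School of Information Science and Engineering, Lanzhou University
Categories: Computation and Language (cs.CL)
Comments: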
Abstract:Large Language Models (LLMs) show promise in psychological counseling, yet existing benchmarks rely heavily on highly cooperative simulated clients. We observe a critical counselor-following phenomenon: these clients often rapidly shift from resistance to compliance after only a few turns, creating an illusion of therapeutic progress and inflating scores under current evaluation protocols through superficial empathy. To address this evaluation mismatch, we propose a Cognitive Behavioral Therapy (CBT)-grounded resistance-aware framework. We introduce CARS, a client simulator that explicitly models dynamic resistance via Cognitive Conceptualization Diagrams (CCDs). We present STREAMS, a dual-module framework that decouples strategic reasoning (Thinker) from response generation (Presenter) and optimizes it via reinforcement learning. We further propose EWTS-MI, an entropy-weighted metric for evaluating responsiveness under high-friction interactions. Experiments across resistant and non-resistant counseling settings validate our findings on evaluation mismatch and demonstrate the effectiveness of resistance-aware training for improving strategic robustness under challenging counseling interactions.
[NLP-78] DLLG: Dynamic Logit-Level Gating of LLM Experts
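【Quick read】: This paper addresses the trade-off between adaptability and stability when combining multiple specialized large language models (LLMs). Existing approaches each have a clear weakness: routing commits to an expert prematurely, heuristic ensembling depends on fragile proxies, and parameter merging introduces interference between models. The authors propose DLLG (Dynamic Logit-Level Gating), an ensembling framework that learns token-level expert fusion from sparse response-level supervision. A lightweight gating module predicts step-wise fusion weights, linking trajectory-level correctness to generation without token-level labels or expert retraining. Across diverse reasoning and code benchmarks and across model scales, DLLG consistently outperforms strong routing, heuristic ensembling, and parameter-merging baselines.
Link: https://arxiv.org/abs/2606.04378
Authors: Bingnan Li,Zhaoyang Zhang,Xiaoze Liu,Yantao Shen,Shuli Jiang,Shuo Yang,Wei Xia,Zhuowen Tu,Stefano Soatto
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: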
Abstract:Leveraging multiple specialized LLMs can combine complementary strengths, but existing approaches trade adaptability for stability: routing commits prematurely, heuristic ensembling depends on fragile proxies, and parameter merging introduces interference. We propose DLLG (Dynamic Logit-Level Gating), a dynamic logit-level ensembling framework that learns token-level expert fusion from sparse response-level supervision. A lightweight gating module predicts step-wise fusion weights, linking trajectory-level correctness to generation without token-level labels or expert retraining. Across diverse reasoning and code benchmarks, DLLG consistently outperforms strong routing, heuristic ensembling, and parameter-merging baselines across model scales, highlighting learned logit-level fusion as a robust and scalable paradigm for integrating specialized experts.
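A single decoding step of logit-level fusion might look like the sketch below. It assumes the gate emits one score per expert and that fused logits are a softmax-weighted sum of the experts' logits over a shared vocabulary; the abstract does not specify the exact fusion rule, and the learned gating module is replaced here by hand-supplied `gate_scores`.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_logits(expert_logits, gate_scores):
    """Combine per-expert next-token logits for one decoding step.

    expert_logits: one logit list per expert, all over the same vocabulary.
    gate_scores: one raw score per expert (stand-in for a learned gate).
    """
    weights = softmax(gate_scores)
    vocab_size = len(expert_logits[0])
    return [
        sum(w * logits[v] for w, logits in zip(weights, expert_logits))
        for v in range(vocab_size)
    ]
```

With equal gate scores, two experts producing `[1.0, 0.0]` and `[0.0, 1.0]` fuse to `[0.5, 0.5]`; a gate that strongly favors one expert would instead pull the fused logits toward that expert's output.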
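[NLP-79] Deliberate Evolution: Agentic Reasoning for Sample-Efficient Symbolic Regression with LLMs ICML2026
【Quick read】: This paper addresses the sample inefficiency of LLM-based evolutionary methods for symbolic regression (SR), which rely mainly on scalar feedback such as MSE. The core limitation identified is that existing methods conflate candidate proposal with search guidance: from a single score, the LLM must infer how to evolve an expression, diagnose its errors, and reuse past experience. The authors propose Deliberate Evolution (DE), an agentic framework that decouples symbolic generation from search control. DE guides LLM proposals with adaptive operators for search direction, analytical tools for structural diagnosis, and reflective memory for trajectory-level experience. On LLM-SRBench, DE consistently outperforms representative LLM-based SR baselines across diverse scientific domains while using only 40% of the standard sample budget.
Link: https://arxiv.org/abs/2606.04360
Authors: Xinyu Pang,Zhanke Zhou,Xuan Li,Fangrui Lv,Shanshan Wei,Sen Cui,Bo Han,Changshui Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICML 2026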
Abstract:Symbolic regression (SR) discovers compact mathematical expressions from data, yet recent LLM-based evolutionary methods remain sample-inefficient because they rely mainly on scalar feedback such as MSE. We identify a core limitation: existing methods conflate candidate proposal with search guidance, requiring the LLM to infer how to evolve an expression, diagnose its errors, and reuse past experience from a single score. To address this, we propose Deliberate Evolution (DE), an agentic framework that decouples symbolic generation from search control. DE guides LLM proposals with adaptive operators for search direction, analytical tools for structural diagnosis, and reflective memory for trajectory-level experience. Experiments on LLM-SRBench show that DE consistently outperforms representative LLM-based SR baselines across diverse scientific domains while using only 40% of the standard sample budget.
[NLP-80] Video2LoRA: Parametric Video Internalization for Vision-Language Models
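[Quick read]: This paper addresses the high cost of processing video in vision-language models (VLMs): each frame occupies hundreds of visual tokens, so inference cost grows with the number of frames and with every repeated query. The proposed Video2LoRA is a parametric video internalization method. A perceiver hypernetwork reads the intermediate representations that a frozen VLM produces layer by layer while encoding a video and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which needs iterative gradient updates, Video2LoRA predicts the adapter weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, it lets the same frozen VLM answer queries from the adapter alone, with no visual tokens in context at query time. Video2LoRA is statistically non-inferior and equivalent to direct video-in-context inference on all five captioning benchmarks at both model scales and on seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it stays stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep it cuts answer-time visual-token load by up to 1,500x and query time-to-first-token (TTFT) by 6-80x while keeping outputs faithful to the video. Independently generated adapters for non-overlapping video segments can also be composed in rank space, which suggests a path toward chunked long-video internalization.
Link: https://arxiv.org/abs/2606.04351
Authors: Manan Suri,Sarvesh Baskar,Dinesh Manocha
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: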
Abstract:Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Video2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Video2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Video2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
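The abstract above reports that adapters for non-overlapping segments "can compose in rank space" without giving the construction. One standard reading, assumed here, is concatenation along the rank dimension: for LoRA updates B1·A1 and B2·A2, stacking the columns of B and the rows of A yields a single adapter of rank r1 + r2 whose update equals the sum of the two. The helper names are illustrative only.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (n x m) and Y (m x p)."""
    return [
        [sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def compose_in_rank_space(B1, A1, B2, A2):
    """Concatenate two LoRA adapters along the rank dimension.

    B is d x r and A is r x k, so [B1 | B2] @ [A1 ; A2] == B1 @ A1 + B2 @ A2.
    """
    B = [row1 + row2 for row1, row2 in zip(B1, B2)]  # concatenate columns
    A = A1 + A2                                      # stack rows
    return B, A

# Two rank-1 adapters for a 2 x 3 weight matrix (toy values).
B1, A1 = [[1.0], [2.0]], [[1.0, 0.0, 1.0]]
B2, A2 = [[0.0], [1.0]], [[2.0, 2.0, 0.0]]
B, A = compose_in_rank_space(B1, A1, B2, A2)
combined_update = matmul(B, A)  # rank-2 adapter covering both segments
```

Whether the paper composes adapters exactly this way is not stated in the abstract; the identity itself holds for any pair of low-rank factors with matching outer dimensions.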
[NLP-81] Noisy memory encoding explains negative polarity illusions
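[Quick read]: This paper seeks to explain the "negative polarity illusion": sentences that are strictly ungrammatical because the negative polarity item "ever" is not licensed where it appears are nonetheless sometimes judged acceptable. The proposed explanation builds on lossy context surprisal theory, under which people encode complex sentences imperfectly. Readers may hold a poor memory representation of the determiners in the main-clause and embedded-clause subjects and may entertain a determiner exchange that licenses "ever". The authors further hypothesize that the more similar the two determiners are, the stronger the illusion. Acceptability judgment tasks with six novel determiner pairs (e.g., "few" and "many", "few" and "most") support this: the novel sentence "Many authors that few critics recommended have ever received acknowledgment for a best-selling novel" triggered a much stronger illusion than the canonical sentence, even without time pressure. The results support the view that human language processing is imperfect and resource-rational: under working-memory limits, people rationally reconstruct the most likely input from noisy linguistic memory to facilitate downstream processing.
Link: https://arxiv.org/abs/2606.04340
Authors: Yuhan Zhang,Edward Gibson
Affiliations: Stanford University; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 5 figures, submitted for journal publication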
Abstract:A sentence like “The authors that no critics recommended have ever received acknowledgment for a best-selling novel” is sometimes rated as acceptable even though, strictly speaking, it is ungrammatical because the negative polarity word “ever” is not licensed where it is. This behavioral effect is sometimes called a “negative polarity illusion”. Here we propose that the lossy context surprisal theory of Hahn et al. (2022) – whereby people have an imperfect encoding of complex sentences – might explain this effect. We hypothesize that people have poor memory representation of the determiners in the main-clause and embedded-clause subjects and could entertain a determiner exchange that licenses ever. We propose that more similar determiners in those positions would trigger stronger illusion effects. Acceptability judgment tasks with six novel determiner pairs (e.g., “few” and “many”, “few” and “most”) support our proposal, showing, specifically, that a novel sentence, “Many authors that few critics recommended have ever received acknowledgment for a best-selling novel”, triggered a much stronger illusion than the canonical one even without time pressure. These results offer further support for the suggestion that human language processing is imperfect and resource-rational: in face of working memory limitations, humans rationally reconstruct what is most likely from noisy linguistic input to facilitate downstream processing.
[NLP-82] Parameter-Efficient Fine-Tuning with Learnable Rank
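[Quick read]: This paper questions the fixed-rank constraint in conventional parameter-efficient fine-tuning (PEFT): a uniform, preset low-rank inductive bias cannot adapt to the different capacity that different layers need during fine-tuning. The proposed Learnable Rank LoRA (LR-LoRA) learns the rank of each adapter layer during training instead of prescribing one global rank, letting the optimizer allocate rank per layer. The learned ranks vary substantially across layers, with attention and MLP layers in transformer models showing systematically different rank preferences. On a range of language understanding and commonsense reasoning benchmarks, LR-LoRA achieves state-of-the-art performance in most settings and consistently outperforms strong PEFT baselines, indicating that a learnable rank is a more flexible and effective inductive bias than fixed-rank adaptation.
Link: https://arxiv.org/abs/2606.04325
Authors: Arpit Garg,Simon Lucey,Hemanth Saratchandran
Affiliations: Australian Institute for Machine Learning, Adelaide University
Categories: Computation and Language (cs.CL)
Comments: In Submission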
Abstract:Low-Rank Adaptation (LoRA) is a popular parameter-efficient fine-tuning (PEFT) method that restricts weight updates to low-rank adapters, introducing a fixed low-rank inductive bias by optimizing in a low-dimensional subspace. In this work, we question whether a fixed-rank constraint is the most effective inductive bias for parameter-efficient fine-tuning. We introduce Learnable Rank LoRA (LR-LoRA), a PEFT method in which the adapter rank is learned during the training process. Instead of prescribing a uniform rank for all adapter layers, LR-LoRA allows the optimizer to determine the appropriate rank for each layer. Using this approach, we find substantial layer-wise variation in the learned ranks, with the attention and MLP layers in the transformer models exhibiting systematically different rank preferences. Across a range of language understanding and commonsense reasoning benchmarks, LR-LoRA achieves state-of-the-art performance in most settings and consistently outperforms strong PEFT baselines, demonstrating that a learnable rank provides a more flexible and effective inductive bias than fixed-rank adaptations.
[NLP-83] LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding ICML2026
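[Quick read]: This paper addresses the limited reusability of key-value (KV) caches in long-context LLM inference. Conventional KV caching embeds positional information directly into the cache, so a cached entry can only serve requests at the position it was encoded for, which hurts flexibility and efficiency in retrieval-augmented generation (RAG) and in-context learning (ICL). Existing solutions either restrict reuse to prefixes or pay for expensive memory materialization to re-encode positions. The proposed LazyAttention defers positional encoding and applies it on the fly inside the attention kernels, enabling zero-copy, position-agnostic KV reuse, so a single physical KV copy can serve multiple logical requests at arbitrary positions. With attention kernels tailored separately for prefilling and decoding, and under skewed document distributions, LazyAttention reduces time-to-first-token (TTFT) by 1.37× and increases inference throughput by 1.40× relative to the state-of-the-art Block-Attention while maintaining comparable output quality.
Link: https://arxiv.org/abs/2606.04302
Authors: Haocheng Xia,Mihir Pamnani,Hanxi Fang,Supawit Chockchowwat,Yongjoo Park
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICML 2026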
Abstract:Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37× and increases inference throughput by 1.40× compared to the state-of-the-art Block-Attention, while maintaining comparable output quality.
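The idea of deferring positional encoding can be sketched outside any GPU kernel. The example below assumes rotary position embeddings (RoPE), which the abstract does not name, and stores the key without any rotation; the rotation is applied only when a score is computed, so one cached key can be read at any logical position. Function names are illustrative and this is not the paper's kernel.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(q, q_pos, cached_k, k_pos):
    """Attention logit with the key rotated on the fly from a position-free cache."""
    q_rot = rope(q, q_pos)
    k_rot = rope(cached_k, k_pos)  # deferred: nothing position-specific is stored
    return sum(a * b for a, b in zip(q_rot, k_rot))

q = [1.0, 0.5, -0.3, 2.0]
cached_k = [0.2, -1.0, 0.7, 0.1]  # one physical copy, no position baked in

# The same cached key serves two requests that place the document at
# different absolute offsets but the same query-key distance.
s_request_a = score(q, 10, cached_k, 3)
s_request_b = score(q, 107, cached_k, 100)
```

Because RoPE scores depend only on the relative offset, the two requests get the same logit from the same cache entry; a cache that stored already-rotated keys would have to be re-encoded for the second request.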
[NLP-84] Using Text-Based Causal Inference to Disentangle Factors Influencing Online Review Ratings NAACL2025
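[Quick read]: This paper aims to quantify how much each aspect of a product or service contributes to overall review ratings, which is hard because aspects are correlated and individual effects are difficult to isolate. The method builds on CausalBERT, a recent text-based causal analysis approach, with three improvements: temperature scaling for better-calibrated treatment assignment estimates, hyperparameter optimization to reduce confound overadjustment, and interpretability methods to characterize the discovered confounds. Textual mentions in reviews are treated as proxies for real-world attributes. Validated on real and semi-synthetic data from over 600K reviews of U.S. K-12 schools, the enhancements yield more reliable estimates and show that perception of school administration and performance on benchmarks are significant drivers of overall school ratings.
Link: https://arxiv.org/abs/2606.04286
Authors: Linsen Li,Aron Culotta,Nicholas Mattei
Affiliations: Tulane University
Categories: Computation and Language (cs.CL)
Comments: HLT/NAACL 2025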
Abstract:Online reviews provide valuable insights into the perceived quality of facets of a product or service. While aspect-based sentiment analysis has focused on extracting these facets from reviews, there is less work understanding the impact of each aspect on overall perception. This is particularly challenging given correlations among aspects, making it difficult to isolate the effects of each. This paper introduces a methodology based on recent advances in text-based causal analysis, specifically CausalBERT, to disentangle the effect of each factor on overall review ratings. We enhance CausalBERT with three key improvements: temperature scaling for better calibrated treatment assignment estimates; hyperparameter optimization to reduce confound overadjustment; and interpretability methods to characterize discovered confounds. In this work, we treat the textual mentions in reviews as proxies for real-world attributes. We validate our approach on real and semi-synthetic data from over 600K reviews of U.S. K-12 schools. We find that the proposed enhancements result in more reliable estimates, and that perception of school administration and performance on benchmarks are significant drivers of overall school ratings.
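Temperature scaling, the first of the three enhancements named above, divides a model's logits by a scalar T fitted on held-out data before converting them to probabilities. The sketch below assumes a binary treatment, uses a grid search in place of gradient-based fitting, and runs on made-up logits; none of it is taken from the paper's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of binary labels under sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes held-out NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logits, labels, t))

# Invented, overconfident treatment logits: two of six are confidently wrong.
val_logits = [8.0, 7.0, -9.0, 6.0, -8.0, 9.0]
val_labels = [1, 0, 0, 1, 1, 1]
T = fit_temperature(val_logits, val_labels)
calibrated = [sigmoid(z / T) for z in val_logits]
```

A fitted T above 1 softens overconfident propensity estimates without changing their ranking, which matters when those estimates feed a downstream effect estimator.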
[NLP-85] Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling
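[Quick read]: This paper addresses a limitation of preference modeling in reinforcement learning from human feedback (RLHF): most approaches assume a universal reward function and ignore the diversity and heterogeneity of human preferences. To model individual preferences without extra annotation cost, recent work learns multiple preference components from binary preference data and combines them, but these components often fail to capture coherent, disentangled patterns, which limits interpretability and personalization. The paper proposes a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. In controlled and real-world experiments, the model learns interpretable routing patterns and specialized experts and improves test-time personalization, and post-adaptation shifts in expert weights offer a qualitative lens on how the model adapts to personalized preferences.
Link: https://arxiv.org/abs/2606.04284
Authors: Yifan Wang,Jinyi Mu,Mayank Jobanputra,Yu Wang,Ji-Ung Lee,Soyoung Oh,Isabel Valera,Vera Demberg
Affiliations: Saarland University; Bielefeld University; Max Planck Institute for Software Systems; Max Planck Institute for Informatics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: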
Abstract:Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevertheless, these components often fail to capture coherent and disentangled patterns, limiting their interpretability and effectiveness for personalization. In this work, we propose a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. Across controlled and real-world experiments, sparse MoE learns interpretable routing patterns and specialized experts. It also improves test-time personalization, and post-adaptation shifts in expert weights provide a qualitative lens for analyzing how the model adapts to personalized preferences.
[NLP-86] Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit
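[Quick read]: This paper tests the implicit assumption that scale and general capability are sufficient for nuanced classification of misinformation discourse in online information verification. The assumption does not hold: zero-shot frontier LLMs systematically under-detect the "belief" class, the implicit, affective category whose misses are the costlier error in a verification context. The remedy is task-specific fine-tuning: a supervised RoBERTa model reaches 0.62 macro-F1 against a best zero-shot result of 0.50 (Claude Haiku 4.5), at a fraction of the per-query cost. Label schema and topic jointly shape zero-shot performance, with the same model varying by more than 0.13 macro-F1 across topics under matched labels. The collapse of belief detection and the outright refusals observed for Claude Sonnet 4.6 are attributed to safety alignment rather than a capacity limit. For high-stakes misinformation detection, task-specific fine-tuning therefore remains the more reliable choice.
Link: https://arxiv.org/abs/2606.04274
Authors: JooYoung Lee,Lin Tian,Angela Brillantes,Adriana-Simona Mihăiţă,Marian-Andrei Rizoiu
Affiliations: University of Technology Sydney, Australia
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: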
Abstract:As large language models (LLMs) become default tools for online information verification, an implicit assumption follows them: that scale and general capability are sufficient for nuanced classification of misinformation discourse. We test this assumption directly on 900 Reddit comments spanning three PolitiFact-verified misinformation claims (environment, health, immigration), labelled as belief (propagates the claim), fact-check (corrects it), or other. We compare nine models across three paradigms – BART-MNLI, three Llama variants, three commercial frontier LLMs (Claude Haiku 4.5, Gemini Flash Lite 2.5, Claude Sonnet 4.6), and fine-tuned DistilBERT and RoBERTa – under universal and topic-specific label schemas. The assumption does not hold. Fine-tuned RoBERTa reaches 0.62 macro-F1 against a best zero-shot result of 0.50 (Claude Haiku 4.5), at a fraction of the per-query cost; the supervised advantage is concentrated on the belief class, the implicit, affective category every zero-shot model under-detects. Scaling does not help: Llama-3-8B matches Llama-3-70B, and Claude Sonnet 4.6 underperforms the smaller Haiku under generic labels, collapsing belief detection to 0.17 and refusing outright on a subset of comments flagged as sensitive. This is a safety-alignment artefact, not a capacity limit. Label schema and topic jointly shape zero-shot performance, with the same model varying by more than 0.13 macro-F1 across topics under matched labels. In a verification context, where missing belief is the costlier error, task-specific fine-tuning remains the more reliable choice despite the proliferation of large generative models.
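Macro-F1, the headline metric above, averages per-class F1 with equal weight, so under-detecting one class such as belief pulls the score down even when overall accuracy looks fine. A minimal sketch on invented labels (the three class names come from the abstract; the data does not):

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)

CLASSES = ["belief", "fact-check", "other"]
gold = ["belief", "belief", "fact-check", "other", "other", "other"]
pred = ["other", "belief", "fact-check", "other", "other", "belief"]

# Per-class F1 here: belief 0.5, fact-check 1.0, other 2/3.
print(round(macro_f1(gold, pred, CLASSES), 4))  # prints 0.7222
```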
[NLP-87] Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA
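[Quick read]: This paper examines how LLMs handle everyday health questions about whether a user can safely take another dose of an over-the-counter (OTC) medication. Existing medical QA evaluations underexplore this setting, where a correct answer requires tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories. The proposed benchmark, DOSEBENCH, contains 81 curated dosing scenarios on adult acetaminophen and ibuprofen use with manually annotated gold references. Four LLMs are evaluated over repeated runs on decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals, for 1,620 model responses in total. Models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases, and responses that look stable or confident can still violate dosing constraints. OTC dosing QA thus serves as a narrow yet practical testbed for temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
Link: https://arxiv.org/abs/2606.04262
Authors: Maroof Kousar,Yibo Hu
Affiliations: Illinois Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures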
Abstract:Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common safety-relevant setting remains underexplored in existing medical QA evaluations, where correct answers require tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories. We introduce DOSEBENCH, a focused benchmark of 81 curated OTC dosing scenarios focused on adult acetaminophen and ibuprofen use, with manually annotated gold references. We evaluate four LLMs across repeated runs using metrics for decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals, resulting in 1,620 model responses. Our results show that models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases and that stable or confident-looking responses can still violate dosing constraints. These findings suggest that OTC dosing QA provides a narrow yet practical testbed for evaluating temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
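The rolling-window reasoning that the abstract says models struggle with can be made concrete. The sketch below sums doses in the trailing 24 hours rather than per calendar day; the cap and minimum gap are caller-supplied placeholders with arbitrary values, not figures from any product label or from the benchmark, and the function names are illustrative.

```python
from datetime import datetime, timedelta

def rolling_intake_mg(doses, now, window=timedelta(hours=24)):
    """Total mg taken in the half-open window (now - window, now]."""
    return sum(mg for taken_at, mg in doses if now - window < taken_at <= now)

def within_limits(doses, now, next_mg, max_window_mg, min_gap):
    """Check a proposed dose against a rolling cap and a minimum gap.

    `max_window_mg` and `min_gap` are placeholders supplied by the caller;
    they are not taken from any product label.
    """
    past = [(t, mg) for t, mg in doses if t <= now]
    if past and now - max(t for t, _ in past) < min_gap:
        return False  # too soon after the previous dose
    return rolling_intake_mg(past, now) + next_mg <= max_window_mg

# Arbitrary illustrative history: three doses on the previous day.
history = [
    (datetime(2026, 6, 3, 8, 0), 500),
    (datetime(2026, 6, 3, 14, 0), 500),
    (datetime(2026, 6, 3, 20, 0), 500),
]
now = datetime(2026, 6, 4, 9, 0)
# The 08:00 dose is 25 hours old and drops out of the window, while the other
# two still count; a per-calendar-day sum for June 4 would count nothing.
print(rolling_intake_mg(history, now))  # prints 1000
```

The gap between the rolling total and the calendar-day total is the kind of error the abstract groups under rolling-window reasoning.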
[NLP-88] Can Generalist Agents Automate Data Curation?
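[Quick read]: This paper asks whether generalist coding agents can automate training-data curation, one of the most consequential yet labor-intensive parts of modern AI development, in which practitioners iteratively propose, implement, evaluate, and revise data policies. The authors introduce Curation-Bench, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite and gives agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. Trajectory analysis, however, reveals a persistent execution-research gap: agents mostly tune local policy variants instead of exploring new policy families. Scaffolds that require each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration, and the scaffolded agent autonomously composes a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Reliable data research therefore requires scaffolded method adaptation, not open-ended prompting alone.
Link: https://arxiv.org/abs/2606.04261
Authors: Feiyang Kang,Hanze Li,Adam Nguyen,Mahavir Dabas,Jiaqi W. Ma,Frederic Sala,Dawn Song,Ruoxi Jia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Preprint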
Abstract:Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce Curation-Bench, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent execution-research gap: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes – without human design input – a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
[NLP-89] StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis
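[Quick read]: This paper targets LLM-based generation of register-transfer level (RTL) code, which is hard because of long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. The proposed StepPRM-RTL framework combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to improve both functional correctness and reasoning fidelity. It constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and an incremental code modification; a Process Reward Model scores intermediate steps and provides dense feedback that guides reinforcement-style updates during RAFT; and Monte Carlo Tree Search (MCTS) explores alternative reasoning paths to add high-quality trajectories to the training data. Combining stepwise and outcome-aware rewards lets the model learn both how and why to construct correct RTL. On benchmark Verilog and VHDL datasets, StepPRM-RTL outperforms the best prior methods by over 10% on functional correctness and reasoning fidelity metrics, and ablations confirm that PRM-guided rewards together with stepwise trajectory exploration are key to its performance. The method generalizes across RTL languages.
Link: https://arxiv.org/abs/2606.04246
Authors: Prashanth Vijayaraghavan,Apoorva Nitsure,Luyao Shi,Ehsan Degan,Vandana Mukherjee
Affiliations: IBM Research, San Jose, CA, USA
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Comments: 6 pages, 2 figures, DAC’2026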
Abstract:Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.
[NLP-90] VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark
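[Quick read]: This paper studies why multimodal large language models degrade when they must externalize a problem through a tool, such as plotting, and then reason over the tool's output, a gap that matters because real engineering and scientific workflows often rely on visualization for analysis, validation, and decision-making. The authors introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics with 1,168 multimodal, bilingual (Persian-English) multiple-choice questions drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants. All problems are selected so that plotting is a natural solution strategy that reveals intersections, extrema, asymptotes, and similar features. Unlike prior benchmarks that evaluate reasoning over fixed visual inputs, VAMPS tests whether a model can construct a useful graph and ground its answer in the resulting visualization. Across a diverse set of models, direct analytical solving outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.
Link: https://arxiv.org/abs/2606.04244
Authors: Amirhossein Dabiriaghdam,Shayan Vassef,Mohammadreza Bakhtiari,Yasamin Medghalchi,Ilker Hacihaliloglu,Mesrob Ohannessian,Lele Wang,Giuseppe Carenini
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: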
Abstract:Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool’s output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.
[NLP-91] Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1) WWW2025
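[Quick read]: This report covers a challenge on retrieval over visually rich documents, motivated by the fact that most retrievers still discard the visual channel (figures, tables, charts). Participants must build a single retrieval system for two complementary regimes: closed-set page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). All three winning teams build on decoder-based multimodal-LLM embedders from the Qwen2-VL family rather than CLIP-style encoders, and differ chiefly in strategy: fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
Link: https://arxiv.org/abs/2606.04240
Authors: Jingbiao Mei
Affiliations: University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: MDR Challenge Report at WWW2025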
Abstract:Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams’ systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
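The ranking metric named in the abstract, the macro-average over two tasks of mean Recall@1, Recall@3, and Recall@5, can be sketched as follows. The exact averaging order inside each task is an assumption (queries first, then cut-offs), and the toy rankings are invented.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mean_recall(queries, ks=(1, 3, 5)):
    """Average Recall@k over queries, then average over the cut-offs in `ks`."""
    per_k = [
        sum(recall_at_k(ranked, rel, k) for ranked, rel in queries) / len(queries)
        for k in ks
    ]
    return sum(per_k) / len(per_k)

def challenge_score(task_queries):
    """Macro-average: every task counts equally, whatever its query count."""
    return sum(mean_recall(q) for q in task_queries.values()) / len(task_queries)

# Invented rankings: (ranked candidate ids, set of relevant ids) per query.
tasks = {
    "MMDocIR": [
        (["a", "b", "c", "d", "e"], {"a"}),
        (["x", "y", "z", "w", "v"], {"z"}),
    ],
    "M2KR": [
        (["p", "q", "r", "s", "t"], {"t"}),
    ],
}
score = challenge_score(tasks)  # (5/6 + 1/3) / 2 = 7/12
```

Macro-averaging keeps a task with few queries from being drowned out by a larger one, which is consistent with the challenge's aim of a single system that handles both regimes.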
[NLP-92] Supportive Token Revealing for Fast Diffusion Language Model Decoding
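[Quick read]: This paper addresses the quality-latency trade-off in parallel decoding for discrete diffusion language models. Existing methods use confidence or dependency criteria to decide which tokens are safe to reveal, but avoiding unsafe commits does not make the remaining masked sequence easy to decode: uncertain tokens may depend on tokens that are still masked, which creates a denoising bottleneck. The proposed AXON is a training-free module that sits on top of existing parallel decoding strategies. Instead of asking which tokens are safest to reveal, it asks which confident reveals would best support later denoising, selecting anchors (confident masked tokens that uncertain positions attend to) from attention, uncertainty, and confidence signals. On reasoning and code-generation benchmarks across multiple diffusion language models, AXON improves the quality-latency trade-off of existing parallel decoders, often reducing the number of function evaluations while maintaining or improving accuracy.
Link: https://arxiv.org/abs/2606.04236
Authors: Giries Abu Ayoub,Mario Barbara,Lluís Pastor-Pérez,Tanja Bien,Aneesh Barthakur,Alaa Maalouf,Loay Mualem
Affiliations: University of Haifa; Institute for AI, University of Stuttgart; IMPRS-IS
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: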
Abstract:Discrete diffusion language models can generate text efficiently by updating multiple masked positions in parallel, but this parallelism introduces a quality-latency trade-off. Aggressive decoding may commit mutually dependent tokens too early, while conservative decoding requires many denoising steps. Existing methods address this tension by deciding which tokens are safe to reveal using confidence or dependency criteria. However, avoiding unsafe commits does not necessarily make the remaining masked sequence easy to decode, since uncertain tokens may depend on masked tokens, creating a bottleneck for denoising steps. We propose AXON, a training-free module that can be added on top of existing parallel decoding strategies for diffusion language models. Rather than replacing the base decoder, AXON monitors the remaining uncertain masked tokens and intervenes only when their current state suggests that additional context is needed. It then shifts the criterion from which tokens are safest to reveal to which confident reveals would best support later denoising. AXON selects anchors, confident masked tokens that uncertain positions attend to, using attention, uncertainty, and confidence signals. Experiments on reasoning and code-generation benchmarks across multiple diffusion language models show that AXON improves the quality-latency trade-off of existing parallel decoders, often reducing the number of function evaluations while maintaining or improving accuracy.
[NLP-93] MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise QA ACL2026
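[Quick read]: This paper argues that multimodal retrieval-augmented generation (MM-RAG) over complex enterprise documents leans too heavily on page-level images and leaves document structure to be captured implicitly by pre-trained embeddings or vision-language models. The proposed MM-BizRAG extracts and represents structure explicitly. A document structure-aware split routes documents through orientation-specific ingestion pipelines, with explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks). A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, and inference-time multimodal assembly decouples retrieval representations from generation context, giving richer, better-grounded answers without fine-tuning. On a large heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG outperforms state-of-the-art vision-centric baselines by up to 32 percentage points, with the largest gains on report-style layouts. The authors also introduce FastRAGEval, a single-call LLM-judge metric that halves RAGChecker's cost while aligning more closely with human judgments.
Link: https://arxiv.org/abs/2606.04231
Authors: Hanoz Bhathena,Parin Rajesh Jhaveri,Rohan Mittal,Prateek Singh,Aymen Kallala,Rachneet Kaur,Yiqiao Jin,Zhen Zeng,Adwait Ratnaparkhi,Denis Kochedykov
Affiliations: JPMorgan Chase Co.; Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ACL 2026 (Industry Track)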
Abstract:Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure via a document structure-aware split that dynamically routes documents through orientation-specific ingestion pipelines, applying explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks). A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context, enabling richer, more grounded answers without any finetuning requirement. Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32 percentage points, with especially strong gains on report-style layouts. Furthermore, we introduce FastRAGEval, a single-call LLM Judge metric for fine-grained generative recall that halves RAGChecker’s cost while achieving stronger human alignment.
[NLP-94] DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text Audio and Image Modalities
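[Quick read]: This paper addresses the fragmented evaluation landscape in AI-generated content detection: most detectors are either commercial software or open-source projects with incompatible codebases and bespoke preprocessing, evaluation protocols, and metrics, which makes reproduction and fair comparison difficult. The proposed DetectZoo is an extensible toolkit that provides a unified interface for detection across text, image, and audio, standardizing the full empirical pipeline from data ingestion and preprocessing to model assessment. It ships reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained, automatically caches pretrained weights, and reproduces the original published results, lowering the barrier to entry for multi-modal AI forensics.
Link: https://arxiv.org/abs/2606.04205
Authors: Sajad Ebrahimi,Nima Jamali,Bardia Shirsalimian,Kelly McConvey,Wentao Zhang,Jalehsadat Mahdavimoghaddam,Maksym Taranukhin,Maura Grossman,Vered Shwartz,Yuntian Deng,Ebrahim Bagheri
Affiliations: University of Toronto; University of Waterloo; Toronto Metropolitan University; University of British Columbia; Vector Institute
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Comments: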
Abstract:The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at this https URL, and the package can be installed via pip install detectzoo.
[NLP-95] Cross-Prompt Generalization in Detecting AI-Generated Fake News Using Interpretable Linguistic Features
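[Quick Read]: This paper examines whether detectors of AI-generated fake news generalize across prompting strategies. Most existing detectors are trained and evaluated under a single generation setting, so their behavior on unseen prompts is unclear. The authors build three datasets of AI-generated articles produced under distinct prompts, combine them with real news articles, and study cross-prompt generalization. The key to the approach is extracting interpretable linguistic features covering lexical diversity, readability, and emotion, and evaluating a random forest classifier in a cross-prompt framework in which a model trained on one prompt is tested on another. Across all six train-test combinations, AUC ranges from 0.988 to 1.000. Feature-distribution analysis shows that AI-generated text has higher lexical diversity, lower readability, and substantially lower emotional intensity than the overall dataset, with variation across prompts. Despite these shifts, the features capture stable properties of AI-generated text, which suggests that feature-based methods can detect AI-generated fake news robustly under prompt variability.
Link: https://arxiv.org/abs/2606.04199
Authors: Aya Vera-Jimenez,Samuel Jaeger,Calvin Ibenye,Dhrubajyoti Ghosh
Affiliations: Kennesaw State University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: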
Abstract:The increasing use of large language models has raised concerns about the spread of AI-generated fake news, particularly under varying prompting strategies. Most existing detection models are trained and evaluated under a single generation setting, leaving their ability to generalize across unseen prompts unclear. In this study, we investigate cross-prompt generalization in fake news detection using three datasets of AI-generated articles produced under distinct prompts, combined with real news articles. We extract interpretable linguistic features capturing lexical diversity, readability, and emotion-based characteristics and evaluate a random forest classifier under a cross-prompt framework, where models trained on one prompt are tested on another. Across all six train-test combinations, performance remains consistently high, with AUC values ranging from 0.988 to 1.000. Analysis of feature distributions shows that AI-generated text exhibits increased lexical diversity, reduced readability, and substantially lower emotional intensity compared to the overall dataset, with variations across prompts. Despite these distributional shifts, the classifier maintains strong performance, indicating that these features capture stable properties of AI-generated text that generalize across prompting strategies. These findings suggest that feature-based approaches can provide robust detection of AI-generated fake news under prompt variability.
[NLP-96] ACAT: A Collaborative Platform for Efficient Aspect-Based Sentiment Dataset Annotation
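[Quick Read]: This paper addresses the inefficiency of building high-quality annotated datasets for Aspect-Based Sentiment Analysis (ABSA) and the difficulty of managing collaborative annotation. Existing annotation tools treat output as flat files, leaving researchers to consolidate multi-annotator data by hand, reconstruct relational structure, and compute agreement metrics with custom scripts. The paper proposes ACAT (Aspect-based sentiment analysis Collaborative Annotation Tool), a web-based collaborative platform with native support for four ABSA workflows: (1) aspect-category sentiment analysis, (2) clause-level segmentation, (3) aspect-term sentiment analysis with character-level position tracking, and (4) aspect sentiment triplet extraction with dual span offset preservation. Its core contribution is an automated Extract, Transform, Load (ETL) pipeline that aligns collaborative annotations and computes Inter-Annotator Agreement (IAA) metrics directly at export, producing training-ready datasets. In a preliminary validation on 1,002 restaurant reviews with two annotators of differing expertise, the median annotation time was 31.58 seconds and raw IAA ranged from 0.78 to 0.86 across tasks.
Link: https://arxiv.org/abs/2606.04189
Authors: Ana-Maria Luisa Mocanu,Ciprian-Octavian Truica,Elena-Simona Apostol
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at The 28th International Conference on Big Data Analytics and Knowledge Discovery (DaWaK 2026)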
Abstract:Aspect-Based Sentiment Analysis (ABSA) requires high-quality datasets to train reliable models. However, existing annotation tools treat output as flat files, leaving researchers to manually consolidate multi-annotator data, reconstruct relational structures, and compute reliability metrics through custom scripts. This paper introduces ACAT (Aspect-based sentiment analysis Collaborative Annotation Tool), a web-based platform natively supporting four ABSA workflows: (1) Aspect-Category Sentiment Analysis, (2) Clause-Level Segmentation, (3) Aspect-Term Sentiment Analysis with character-level position tracking, and (4) Aspect Sentiment Triplet Extraction with dual span offset preservation. Its core contribution is an automated Extract, Transform, Load (ETL) pipeline that aligns collaborative annotations and computes Inter-Annotator Agreement (IAA) metrics directly at export, yielding training-ready datasets. In a preliminary validation on 1,002 restaurant reviews with two annotators of differing expertise, ACAT achieves a median annotation time of 31.58 seconds and a raw IAA ranging from 0.78 to 0.86 across all tasks.
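The abstract reports a "raw IAA" but does not say how ACAT computes it. As a minimal sketch, assuming two annotators assign one categorical label per item, the snippet below computes raw (percent) agreement alongside Cohen's kappa, a common chance-corrected alternative. The function names and the toy labels are illustrative and are not taken from ACAT.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Share of items on which both annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement for two annotators (Cohen's kappa)."""
    n = len(labels_a)
    p_observed = raw_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labelled independently
    # according to their own label frequencies.
    p_expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical sentiment labels from two annotators for five review clauses.
annotator_a = ["pos", "neg", "neu", "pos", "neg"]
annotator_b = ["pos", "neg", "pos", "pos", "neg"]

print(f"raw agreement: {raw_agreement(annotator_a, annotator_b):.2f}")
print(f"Cohen's kappa: {cohens_kappa(annotator_a, annotator_b):.2f}")
```

Raw agreement is easier to read but does not correct for chance, which is why the two numbers can differ noticeably on the same data.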
[NLP-97] A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models
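[Quick Read]: This paper asks which interpretable linguistic features reliably indicate that a text was generated by a large language model (LLM), so that non-expert users can be given clear and understandable reasons for such a judgment. Existing findings are fragmented, and the usefulness of individual features varies considerably across models, text domains, and contexts. The authors run a large-scale empirical study that assesses the robustness of 284 interpretable linguistic features across outputs from 27 LLMs and ten text domains, under cross-model and cross-domain generalization settings. Classifiers based solely on linguistic features reliably distinguish AI-generated from human-written text. However, many previously proposed indicators are strongly context-dependent; the exception is measures of lexical richness, which remain robust across model families and text domains. The study thus identifies which linguistic signals generalize across contexts and provides a foundation for more reliable, interpretable analysis of AI-generated language.
Link: https://arxiv.org/abs/2606.04177
Authors: Yassir El Attar,Esra Dönmez,Maximilian Maurer,Agnieszka Falenska
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint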
Abstract:Interpretable linguistic features offer a promising approach for explaining why a given text appears machine-generated, particularly for non-expert users. However, existing findings on which features reliably indicate LLM-generated text remain fragmented across feature sets, models, and text domains. To address this gap, we conduct a large-scale empirical study assessing the robustness of linguistic signals for characterizing AI-generated text. Our analysis covers 284 interpretable linguistic features across outputs from 27 LLMs and ten text domains under cross-model and cross-domain generalization settings. We show that classifiers based solely on linguistic features can reliably distinguish AI-generated from human-written text. However, many previously proposed indicators prove strongly context-dependent, with the exception of measures of lexical richness, which remain robust signals across model families and text domains. These results demonstrate which linguistic signals generalize across contexts and provide a foundation for more reliable, interpretable analyses of AI-generated language.
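The abstract singles out lexical richness as the robust signal but does not list the exact measures used. As a hedged illustration, the sketch below computes two textbook measures, type-token ratio and hapax ratio, with a deliberately simple tokenizer. The tokenizer, function names, and choice of measures are assumptions for illustration, not the paper's feature set.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a simplification chosen for this sketch."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_richness(text):
    """Return type-token ratio and hapax ratio for a text."""
    tokens = tokenize(text)
    if not tokens:
        return {"ttr": 0.0, "hapax_ratio": 0.0}
    counts = Counter(tokens)
    # Hapax legomena: word types that occur exactly once.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "ttr": len(counts) / len(tokens),
        "hapax_ratio": hapaxes / len(tokens),
    }


features = lexical_richness("The cat sat on the mat")
print(features)
```

Both measures are sensitive to text length, so comparisons are usually made on texts of similar length or with length-normalized variants.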
[NLP-98] Expert-Aware Refusal Steering
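[Quick Read]: This paper studies how fragile safety alignment is in instruction-tuned large language models (LLMs), where safety depends on the model reliably refusing harmful or disallowed requests. Prior work showed that applying a steering vector to a dense LLM at inference time can suppress refusal and induce responses to harmful requests. This paper extends that method to three open-source Mixture-of-Experts (MoE) LLMs and finds that steering performance is not hindered by the complex routing patterns of the MoE architecture. It then proposes two expert-aware refusal steering methods: one leverages refusal-specific expert routing patterns, and the other uses expert-specific steering directions. The authors find that refusal behavior can be steered effectively from the output of a single expert, and that the refusal signals captured by steering methods differ from expert routing behavior, which suggests that attention plays a substantial role in MoE refusal behavior.
Link: https://arxiv.org/abs/2606.04160
Authors: Anna C. Marbut,Daniel R. Olson,Travis J. Wheeler
Affiliations: University of Montana; University of Arizona; European Bioinformatics Institute, European Molecular Biology Laboratory
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under review for COLM 2026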
Abstract:Safety alignment in instruction-tuned large language models (LLMs) depends on a model’s ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing response to harmful requests. We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is uninhibited by the complex routing patterns inherent to the MoE architecture. We then propose two expert-aware refusal steering methods that leverage refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. We find that refusal behavior can be effectively steered based on the output of a single expert. Our results show that refusal signals captured by steering methods differ from expert routing behavior, suggesting a substantial role for attention in MoE refusal behavior.
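[NLP-99] When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG ACL2026
[Quick Read]: This paper revisits the assumption that retrieval-augmented generation (RAG) substantially improves medical question answering, a high-stakes setting where factual errors can have serious consequences. Although prior work reported large gains for large medical QA models, the authors evaluate five open-weight instruction-tuned models spanning 7B to 72B parameters on ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, and find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1-2 points. The choice of backbone model matters far more than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. The results suggest that the main bottleneck is not retrieval quality alone but the model's limited ability to use retrieved evidence effectively. On this reading, improving biomedical RAG depends more on strengthening how models reason over and integrate retrieved evidence than on tuning the retrieval module alone.
Link: https://arxiv.org/abs/2606.04127
Authors: Erfan Nourbakhsh,Rocky Slavin,Ke Yang,Anthony Rios
Affiliations: The University of Texas at San Antonio
Categories: Computation and Language (cs.CL)
Comments: 9 Pages, accepted to BioNLP Workshop at ACL 2026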
Abstract:Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1-2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model’s limited ability to use retrieved evidence effectively.
[NLP-100] SaliMory: Orchestrating Cognitive Memory for Conversational Agents
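[Quick Read]: This paper addresses how conversational agents that act as lifelong companions can maintain persistent memory across interactions. Simply expanding the context window with raw retrieval degrades reasoning quality, while training memory agents with standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. The proposed framework, SALIMORY, trains a single language model to manage a cognitively structured memory spanning user facts, preferences, and working memory. The key to the solution is a hierarchical stage-wise process reward combined with reward-decomposed contrastive refinement, which provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) while training end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state of the art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.
Link: https://arxiv.org/abs/2606.04120
Authors: Kai Zhang,Xinyuan Zhang,Hongda Jiang,Shiun-Zu Kuo,Hyokun Yun,Ejaz Ahmed,Shereen Oraby,Ziyun Li,Sanat Sharma,Ann Lee,Ahmed A Aly,Anuj Kumar,Raffay Hamid,Xin Luna Dong
Affiliations: Meta Reality Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: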
Abstract:Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively-structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SALIMORY provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.
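[NLP-101] Computational conceptual history of scientific concepts: From early digital methods to LLMs
[Quick Read]: This paper asks how large language models (LLMs) can be used for concept analysis in the history, philosophy, and sociology of science (HPSS), what they add to earlier computational methods, and which longstanding problems they inherit. It traces the development from early digital methods through distributional semantic models to current LLM-based concept analysis, and shows how questions of corpus construction, operationalization, model choice and training data, and evaluation and interpretation carry over into LLM-based workflows, partly as a shift in approach and partly as continuity. Reviewing recent case studies, including work on lexical semantic change detection, it stresses that the biases and uncertainty LLMs carry must be handled with care and that model outputs need critical validation informed by domain knowledge if conceptual history is to be studied reliably.
Link: https://arxiv.org/abs/2606.04118
Authors: Michael Zichert,Arno Simons
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 19 pages, chapter in the book Understanding Science with Large Language Models? (pp. 383-412). transcript. Edited by Arno Simons, Adrian Wüthrich, Michael Zichert, Gerd Graßhoff (eds.)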
Abstract:This article situates large language models (LLMs) within the longer history of computational approaches to concept analysis in the history, philosophy, and sociology of science (HPSS). We examine what LLMs add to existing methods, how they inherit longstanding problems, and review recent case studies that employ them. In the first part, we reconstruct computational conceptual history before LLMs by bringing together three strands of work: early digital methods in HPSS, distributional approaches from digital history and related research, and lexical semantic change detection. We provide an overview of the main challenges and opportunities, focusing on corpus construction, operationalization and modelling choices, and evaluation and interpretation. In the second part, we turn to the era of LLMs, starting with a short introduction to LLMs before reviewing LLM-based work on lexical semantic change detection and relevant case studies in HPSS. We then revisit the earlier methodological questions, showing how issues of corpus construction, model choice and training data, operationalization trade-offs, and evaluation and interpretation play out in LLM-based workflows.
[NLP-102] Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models
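[Quick Read]: This paper studies how discourse-role labels that wrap supplied context (such as Instruction:, Reference:, or Example:) affect whether a language model adopts the wrapped content. Context-augmented systems routinely add such labels, but their effect on reader-model behavior is underexplored. The authors introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, the Misleading Adoption Rate shifts by 56-84 percentage points depending on the label. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: consistently suppresses it. Paired tests, bootstrap intervals, final-instruction ablations, and final-step log-probability probes on Qwen support a label-conditioned candidate preference. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage-shaped external context preserves smaller label gaps, short-answer evaluation rules out option-letter copying, and nested-label conflicts suggest that illustrative framing can delimit adoption scope. A 200-case single-author manual audit confirms that the short-answer contrasts are stable under conservative adjudication. The conclusion is bounded but practical: context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.
Link: https://arxiv.org/abs/2606.04109
Authors: Jianguo Zhu
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Preprint. 1 figure, 9 tables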
Abstract:Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored. We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, Misleading Adoption Rate shifts by 56-84 percentage points. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: consistently suppresses it. Paired tests, bootstrap intervals, final-instruction ablations, and Qwen final-step log-probability probes support a label-conditioned candidate preference. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage-shaped external context preserves smaller label gaps, short-answer evaluation rules out option-letter copying, and nested-label conflicts suggest that illustrative framing can delimit adoption scope. A 200-case single-author manual audit confirms that the short-answer contrasts are stable under conservative adjudication. The resulting claim is bounded but practical: context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.
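The adoption measure described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming each trial is recorded as a (label, injected wrong option, model output) triple; the record layout, function name, and toy data are hypothetical and are not the paper's code or results.

```python
from collections import defaultdict


def misleading_adoption_rate(records):
    """Per-label share of trials in which the model output the injected wrong option.

    `records` is an iterable of (label, injected_option, model_output) tuples.
    """
    adopted = defaultdict(int)
    total = defaultdict(int)
    for label, injected_option, model_output in records:
        total[label] += 1
        if model_output == injected_option:
            adopted[label] += 1
    return {label: adopted[label] / total[label] for label in total}


# Toy trials: the same injected assertions presented under two wrapper labels.
trials = [
    ("Instruction:", "B", "B"),
    ("Instruction:", "C", "C"),
    ("Example:", "B", "A"),
    ("Example:", "C", "C"),
]

rates = misleading_adoption_rate(trials)
# Spread between the most and least adopted label, in percentage points.
gap_pp = 100 * (max(rates.values()) - min(rates.values()))
print(rates, f"gap = {gap_pp:.0f} pp")
```

Because the injected content is held fixed across labels, any gap between labels is attributable to the wrapper rather than to the assertion itself, which is the point of the paired design.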
[NLP-103] POLARIS: Guiding Small Models to Write Long Stories
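[Quick Read]: This paper addresses the weak long-form creative writing of small open-weight models, whose stories either fall far short of the requested length or degrade in quality as length increases, especially compared with frontier models. The key to the solution is POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting), a lower-compute GRPO (Group Relative Policy Optimization) recipe with two ingredients: a frontier LLM judge with a structured Story Quality rubric as the online reward, and human-reference injection (HRI), in which a teacher-forced human-written story serves as a high-reward anchor within each GRPO group. Applying the recipe to Qwen3.5-9B, with about 1.4K prompt-story pairs derived from 100 short-story anthologies and 4 A100 GPUs, yields POLARIS-9B. Across five benchmarks spanning in-distribution and out-of-distribution prompts and rubrics, POLARIS-9B is competitive with much larger open-weight models while following length instructions more closely, and a blinded human evaluation finds it preferred to the base Qwen3.5-9B and on par with Qwen3.5-27B. Although trained only on stories of up to 4k words, it preserves quality on prompts requesting up to 3 times the training length, a regime where most open-weight models degrade substantially in quality, length adherence, or both. The results suggest that length generalization is a meaningful stress test for creative-writing models and a useful lens for distinguishing otherwise close models.
Link: https://arxiv.org/abs/2606.04095
Authors: Rishanth Rajendhran,Jenna Russell,Mohit Iyyer,John Frederick Wieting
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: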
Abstract:Small open-weight models struggle at long-form creative writing: their generated stories either fall far short of the requested length, or their quality significantly degrades as length increases, especially when compared to frontier models. We present POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting), a lower-compute GRPO recipe with two key ingredients: a frontier LLM judge with a structured Story Quality rubric as the online reward, and human-reference injection (HRI), where a teacher-forced human-written story serves as a high-reward anchor within each GRPO group. By applying our training recipe to Qwen3.5-9B, using a dataset of approximately 1.4K prompt-story pairs derived from 100 short-story anthologies and 4 A100 GPUs, we obtain POLARIS-9B. Across five benchmarks spanning in-distribution and out-of-distribution prompts and rubrics, POLARIS-9B is competitive with much larger open-weight models while following length instructions more closely. A blinded human evaluation confirms that POLARIS-9B is preferred to the base Qwen3.5-9B and on par with Qwen3.5-27B. Despite training only on stories up to 4k words, POLARIS-9B preserves quality on prompts requesting stories up to 3 times the training length, a regime where most open-weight models degrade substantially in quality, length adherence, or both. More broadly, our results suggest that length generalization is a meaningful stress test for creative-writing models and a useful lens for distinguishing otherwise close models.
[NLP-104] Large Language Models Hack Rewards and Society
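[Quick Read]: This paper examines "societal hacking", a failure mode that may arise when large language models (LLMs) are post-trained with reinforcement learning (RL). The authors observe that societal regulations are structurally similar to reward functions: both define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. They hypothesise that RL training may exploit these gaps, so that models' known tendency to hack reward functions scales into discovering loopholes in the rules society runs on. To study this, they introduce SocioHack, a sandbox of 72 societal environments, and find that reward hacking emerges naturally in these environments and leads to regulatory loophole discovery: models generate strategies that remain technically compliant while defeating regulatory intent. Current LLM safeguards provide only limited mitigation. The authors conclude that collecting in-the-wild feedback for training requires greater caution, and they call for a next-generation post-training paradigm for safely iterating LLMs in real society.
Link: https://arxiv.org/abs/2606.04075
Authors: Wei Liu,Xinyi Mou,Hanqi Yan,Zhongyu Wei,Yulan He
Affiliations: King’s College London; Fudan University; The Alan Turing Institute
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: 14 pages, 9 figures, 7 tables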
Abstract:Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models’ well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.
[NLP-105] Covert Influence Between Language Models
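[Quick Read]: This paper addresses covert influence between language models that consume one another's outputs: a sender's payload (the behavioral disposition it is conditioned to propagate) transfers to a receiver through carriers that humans cannot detect. The authors characterize the risk across three interfaces (supervised fine-tuning, on-policy distillation, and in-context learning), which differ in how much influence is achievable without leaving human-visible traces. The key to the approach is inference-time per-sample attribution scoring, which lets them select carriers that amplify training-time influence and unlocks payload transfers that prior work could not achieve. They also present evidence that covert influence with natural-language carriers is a distinct phenomenon from earlier studies with number carriers: number carriers are more resistant to human detection but less portable across model families. The results suggest that the risk surface is broader than previously recognized, and the paper studies pointwise attribution scoring as a tool to investigate and mitigate it.
Link: https://arxiv.org/abs/2606.04071
Authors: Avidan Shah,Jay Chooi,Jinghua Ou,Shi Feng
Affiliations: MATS; New York University; Harvard University; George Washington University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: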
Abstract:As language models increasingly consume one another’s outputs, covert influence – a phenomenon where a sender’s payload (the behavioral disposition it is conditioned to propagate) transfers to a receiver through carriers undetectable by humans – becomes a growing risk. We characterize this risk across three interfaces: supervised fine-tuning, on-policy distillation, and in-context learning, and find that they vary in the scale of influence achievable without leaving behind human-visible traces. Using inference-time per-sample attribution scores, we study covert influence across all three interfaces with the ability to select carriers that amplify training-time influence, unlocking payload transfers that prior work could not achieve. We further provide evidence that covert influence with natural-language carriers is a distinct phenomenon from prior studies using number carriers, as the latter is more resistant to human detection and less portable across model families. Together, these results suggest that the risk surface for covert influence is broader than previously recognized, and we study pointwise attribution scoring methods as a tool to investigate and mitigate it.
[NLP-106] Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation ICML2026
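[Quick Read]: This paper addresses the perceptual bottleneck that limits vision-language models (VLMs) and vision-language-action models (VLAs) in embodied decision-making tasks such as robotic manipulation and navigation: visual hallucinations arise because the models cannot distinguish task-relevant objects from distractors. One-step focus, which attends directly to essential objects, proves ineffective because effective focus requires deep scene understanding. The key to the solution is SceneDiver, a coarse-to-fine focus plan generation method that leverages the long-term planning ability of VLMs: it first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To support reactive control, a lightweight adapter distills this deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks indicate that the method substantially reduces visual hallucinations for both VLMs and VLAs while preserving computational efficiency in tasks that require fast execution.
Link: https://arxiv.org/abs/2606.04046
Authors: Boyuan Xiao,Bohong Chen,Yumeng Li,Ji Feng,Yao-Xiang Ding,Kun Zhou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted at ICML 2026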
Abstract:In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs and VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models’ inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: this https URL.
[NLP-107] Do Transformers Need Three Projections? Systematic Study of QKV Variants ICML2026
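[Quick Read]: This paper studies whether transformers need three separate query (Q), key (K), and value (V) projections, and how sharing projections affects model quality and inference memory. Independent Q, K, and V projections add parameters and KV cache footprint, which becomes a bottleneck for edge deployment. The paper systematically evaluates three sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). Because the last two variants produce symmetric attention maps, the authors also explore asymmetric attention via 2D positional encodings. In experiments spanning synthetic tasks, vision, and language modeling, the shared-projection transformers perform on par with, and occasionally better than, the standard QKV transformer. In language modeling, Q-K=V achieves a 50% KV cache reduction with only 3.1% perplexity degradation. According to the authors, Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Projection sharing is also complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields an 87.5% cache reduction, and Q-K=V with MQA reaches 96.9%, enabling practical on-device inference. The paper characterizes projection sharing as an underexplored form of weight tying in attention, with direct, quantifiable inference memory benefits that are particularly valuable for edge deployment.
Link: https://arxiv.org/abs/2606.04032
Authors: Ali Kayyam,Anusha Madan Gopal,M Anthony Lewis
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF)
Comments: Accepted at ICML 2026 (PMLR vol. 306). 26 pages, 12 figures, 16 tables. Code: this https URL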
Abstract:Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at this https URL
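The cache-reduction figures in the abstract can be checked with simple arithmetic. The sketch below assumes that sharing the key and value projection stores one tensor per KV head instead of two, and that head sharing scales the cache by the ratio of KV heads to query heads. The 16-head configuration with 4 KV heads for GQA-4 is an assumption, chosen because it reproduces the reported numbers; the abstract does not state the head counts.

```python
def kv_cache_fraction(num_query_heads, num_kv_heads, share_kv_projection):
    """Fraction of the baseline KV cache that remains.

    Baseline: one K tensor and one V tensor cached per query head.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_kv_heads must divide num_query_heads")
    # Head sharing (GQA/MQA) keeps only num_kv_heads of the original heads.
    head_fraction = num_kv_heads / num_query_heads
    # Q-K=V sharing caches a single tensor that serves as both K and V.
    tensor_fraction = 0.5 if share_kv_projection else 1.0
    return head_fraction * tensor_fraction


def cache_reduction(num_query_heads, num_kv_heads, share_kv_projection):
    """Relative KV cache saving versus the baseline."""
    return 1.0 - kv_cache_fraction(num_query_heads, num_kv_heads, share_kv_projection)


NUM_HEADS = 16  # assumed head count, not given in the abstract

for name, kv_heads in [
    ("Q-K=V only", NUM_HEADS),
    ("Q-K=V + GQA (4 KV heads)", 4),
    ("Q-K=V + MQA (1 KV head)", 1),
]:
    print(f"{name}: {cache_reduction(NUM_HEADS, kv_heads, True):.1%}")
```

Under these assumptions the three configurations come out at 50.0%, 87.5%, and 96.9%, matching the abstract, which illustrates why the two techniques compose multiplicatively rather than additively.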
[NLP-108] Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy INTERSPEECH2026
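[Quick Read]: This paper addresses the dependence of automatic speech recognition (ASR) evaluation on reference transcriptions, and asks how hypotheses can be evaluated when no reference is available. The key to the solution is READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a metric that evaluates ASR hypotheses directly from the speech signal. READ uses a pretrained autoregressive text-to-speech (TTS) model to compute the conditional likelihood of speech tokens given a text hypothesis, which measures fine-grained acoustic discrepancy between speech and text and emphasizes the acoustic grounding of hypotheses. It requires no additional training and can be applied to hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, with up to 20% relative error rate reduction and particularly strong gains under noisy conditions.
Link: https://arxiv.org/abs/2606.04680
Authors: Zhihan Li,Hankun Wang,Yiwei Guo,Bohan Li,Xie Chen,Kai Yu
Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Submitted to Interspeech 2026. 6 pages, 4 figures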
Abstract:Automatic speech recognition systems commonly rely on reference transcriptions for evaluation, while reference-free approaches often depend on internal confidence estimation or auxiliary language models. We propose READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a novel metric that evaluates ASR hypotheses directly from the speech signal. READ emphasizes the acoustic grounding of hypotheses. It uses a pretrained auto-regressive TTS model to compute the conditional likelihood of speech tokens given a text hypothesis, to measure fine-grained acoustic discrepancy between speech and text. Without additional training, READ can be applied for hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, achieving up to 20% relative error rate reduction, with particularly strong gains under noisy conditions.
Information Retrieval
[IR-0] SearchLog: A Web Browser Extension for Capturing Search Logs in Laboratory Studies
Link: https://arxiv.org/abs/2606.05040
Authors: Jiaman He,Riccardo Xia,Dana McKay,Damiano Spina,Johanne R. Trippas
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Natural search logs are valuable for studying search behavior in information seeking settings. We present SearchLog, an easy-to-install web browser extension for collecting natural search logs during lab-based studies. SearchLog allows participants to search the open web using a browser while recording structured interaction data across mouse, keyboard, search activity, and browser state modules. The extension captures clicks, scrolling, hovered text, typed words, search queries, result rankings, AI-generated summaries when available, tab activity, and window changes. A local Flask backend stores each session as an ordered JSON event stream, with HTML snapshots and preprocessed search result data for later analysis. These logs can be used to derive measures such as query reformulation, page visits, dwell time, scroll behavior, tab switching, search path complexity, and exposure to AI-generated search content. By supporting natural browser-based search with structured experimental metadata, SearchLog provides a reusable resource to study search behavior across traditional and AI-enhanced search interfaces.
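One of the derived measures named in the abstract, dwell time, can be sketched from an ordered event stream. The event schema below (`timestamp` in seconds, `type`, `url`) and the event type names are assumptions made for illustration; SearchLog's actual JSON format is not described in the abstract.

```python
import json
from collections import defaultdict

# Hypothetical session log; field names and event types are illustrative only.
SESSION_JSON = """
[
  {"timestamp": 0,  "type": "page_visit",  "url": "https://example.org/a"},
  {"timestamp": 3,  "type": "click",       "url": "https://example.org/a"},
  {"timestamp": 10, "type": "page_visit",  "url": "https://example.org/b"},
  {"timestamp": 25, "type": "page_visit",  "url": "https://example.org/a"},
  {"timestamp": 30, "type": "session_end", "url": null}
]
"""


def dwell_times(events):
    """Total seconds spent on each URL, summed over repeat visits.

    A visit lasts from its page_visit event until the next page_visit
    or session_end event.
    """
    boundaries = sorted(
        (e for e in events if e["type"] in ("page_visit", "session_end")),
        key=lambda e: e["timestamp"],
    )
    totals = defaultdict(float)
    for current, following in zip(boundaries, boundaries[1:]):
        if current["type"] == "page_visit":
            totals[current["url"]] += following["timestamp"] - current["timestamp"]
    return dict(totals)


dwell = dwell_times(json.loads(SESSION_JSON))
print(dwell)
```

A visit with no later boundary event contributes nothing here, so a real pipeline would need a rule for sessions that end without an explicit closing event.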
[IR-1] NLLog: Lightweight Explainable SOC Anomaly Detection via Log-to-Language Rewriting ACSA
Link: https://arxiv.org/abs/2606.04957
Authors: Samuel Ndichu,Tao Ban,Seiichi Ozawa,Takeshi Takahashi,Daisuke Inoue
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 15 pages, 11 figures, 12 tables; submitted to ACSAC 2026
Abstract:System-generated logs underpin security monitoring, yet their rigid template-based format hinders both automated analysis and human comprehension. We present NLLog (Natural-Language Log), a lightweight pipeline that deterministically rewrites parsed templates into WHO-WHAT-SEVERITY sentences, pools them with term-frequency-inverse-document-frequency weighting, classifies sessions with tree ensembles, and back-projects evidence with TreeSHAP for analyst review. On Hadoop Distributed File System (HDFS) and Blue Gene/L (BGL) corpora, NLLog exceeds two reproduced matched-protocol baselines; across HDFS, BGL, and the AIT Alert Data Set, it sustains low false-positive rates with commodity-hardware latency suitable for security operations center triage. Coverage, sparse-versus-dense, faithfulness, and adversarial ablations show that fallback sufficiency is corpus-dependent, that an enrollment-time coverage check can surface refinement requirements before deployment, and that an auditable deterministic rewrite combined with lightweight dense encoding provides a measurable representation layer for log-anomaly detection and triage.
[IR-2] Dual-Stream MLP is All You Need for CTR Prediction KDD
Link: https://arxiv.org/abs/2606.04944
Authors: Kesha Ou,Zhen Tian,Wayne Xin Zhao,Long Zhang,Sheng Chen,Ji-Rong Wen
Categories: Information Retrieval (cs.IR)
Comments: Accepted by TKDD
Abstract:Click-through rate (CTR) prediction holds a pivotal role in online advertising and recommendation systems, where even small improvements can significantly boost revenue. Existing research primarily focuses on designing dual-stream architectures to capture effective complex feature interactions from both explicit and implicit perspectives. However, these approaches face two major challenges: 1) the high complexity of feature interaction learning, which increases computational demands and the overfitting risk, and 2) the imbalance between explicit and implicit modules, where one module’s output may dominate the final prediction. To address these issues, in this paper, we propose Dual-Stream MLP (DS-MLP), a novel feature interaction framework for the CTR prediction task. Specifically, it leverages knowledge distillation to consolidate the capacity of learning explicit feature interaction into a main MLP network, while a parallel MLP simultaneously captures implicit feature interactions as a complement. To effectively optimize the dual-stream MLP architecture, we further design a specific learning approach with two alignment strategies for enhancing the compatibility of the two MLP components. Experiments demonstrate that DS-MLP, though merely a vanilla MLP structure (the final model), can achieve state-of-the-art performance across three widely used benchmarks, offering a scalable and efficient solution for large-scale recommendation systems. Our code is available at this https URL.
[IR-3] Caliper: Probing Lexical Anchors versus Causal Structure in LLMs
Link: https://arxiv.org/abs/2606.04915
Authors: Zhenyu Yu,Shuigeng Zhou
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder’s pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.
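The lexical anonymization described in this abstract can be illustrated with a minimal Python sketch. The placeholder scheme (`X1`, `X2`, ...), the function name `anonymize_variables`, and the example question are assumptions made for illustration; the abstract does not specify the tokens Caliper actually uses.

```python
import re

def anonymize_variables(question, variable_names):
    """Replace semantic variable names with placeholder tokens.

    Only the surface names change; the causal relations and any
    probabilities stated in the question text are left untouched.
    """
    # Hypothetical placeholder scheme: X1, X2, ... in the order given.
    mapping = {name: f"X{i}" for i, name in enumerate(variable_names, start=1)}
    # List longer names first so a short name never clips a longer one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], question), mapping

# Hypothetical question in the style of a causal reasoning benchmark.
original = "smoking affects tar, and tar affects cancer. What is P(cancer | smoking)?"
anonymized, mapping = anonymize_variables(original, ["smoking", "tar", "cancer"])
print(anonymized)
```

Comparing a model's accuracy on `original` and `anonymized` versions of the same questions gives the kind of gap the abstract reports.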
[IR-4] BEATS: Bootstrapping E-commerce Attribute Taxonomies for Search through Iterative Human-AI Collaboration SIGIR2026
Link: https://arxiv.org/abs/2606.04909
Authors: Yung-Yu Shih,Shang-Yu Su,Tzu-I Ho,Dongzhe Wang,Yun-Nung Chen
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 6 pages, 1 figure, 5 tables. Accepted to SIGIR 2026 Industry Track. Official version: this https URL
Abstract:E-commerce platforms in emerging markets often operate with underdeveloped product catalogs that contain only category taxonomies but lack structured attribute schemas. This absence of fine-grained product attributes limits search capabilities – preventing faceted filtering, degrading query understanding, and weakening semantic representations used by search systems. We present BEATS, a human-in-the-loop LLM framework for bootstrapping product attribute taxonomies entirely from scratch. Our approach extends a multi-stage LLM generation pipeline with two critical production stages: (1) proactive quality checking by model developers to filter erroneous outputs, and (2) human annotation by domain-expert local staff to validate generated attributes. The framework operates iteratively – prompts at each generation stage are refined based on quality check observations and annotator feedback across successive rounds, progressively improving attribute quality. Once the attribute taxonomy is established, we employ LLMs to perform structured attribute tagging on individual product items, enriching their contextual representations. The enriched catalog directly benefits multiple components of the search system: enabling granular attribute-based filtering, providing structured features for ranking models, and improving semantic representations for dense retrieval. We validate the generated taxonomy by training dense retrieval models on attribute-enriched product data, demonstrating consistent improvements over baselines using original catalog information. Our system has been deployed at Rakuten Taiwan, enriching 9 major categories spanning 2,694 sub-categories with 67,277 generated attributes, and over 5.4 million products have been tagged with the generated attributes, with plans to enrich the entire product catalog.
[IR-5] EviRank: Evidence-Based Confidence Estimation for LLM-Based Ranking
Link: https://arxiv.org/abs/2606.04727
Authors: Meng Yan,Cai Xv,Xujing Wang,Ziyu Guan,Wei Zhao
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Large Language Models show promise for recommendation, but they raise reliability concerns due to limited domain coverage and inherent stochasticity. Existing uncertainty quantification methods face two fundamental challenges: (1) the global confidence score designed for question answering fails to reveal which positions are unreliable in a ranking list; (2) fine-grained confidence extracted from model internals exhibits uniformly low values across all positions, making it impossible to filter unreliable predictions. To tackle these challenges, we propose an evidence-based confidence estimation method for LLM-based ranking (EviRank). We extract three complementary pieces of evidence from a single forward pass and aggregate them via reliable opinion aggregation. Furthermore, we recognize that ranking positions are inherently unequal, and introduce a position-aware calibration. Lastly, the calibrated confidence guides ranking optimization. Experiments on three datasets demonstrate that our method achieves state-of-the-art performance on both recommendation and uncertainty quantification.
[IR-6] Improving the Efficiency and Effectiveness of LLM Knowledge Distillation for Conversational Search SIGIR’26
Link: https://arxiv.org/abs/2606.04650
Authors: Stan Fris,Jan Hutter,Jan Henrik Bertrand,Simon Lupart,Mohammad Aliannejadi
Categories: Information Retrieval (cs.IR)
Comments: SCAI Workshop at SIGIR '26, July 20–24, 2026, Melbourne, Naarm, Australia
Abstract:Conversational Search (CS) considers retrieval of relevant documents based on conversational context. Large Language Models (LLMs) have significantly enhanced CS by enabling effective query rewriting. However, employing LLMs during inference poses efficiency challenges. A method to balance effectiveness and efficiency is the use of knowledge distillation from LLM-based query rewriting. Recent work applies the Kullback-Leibler Divergence (KLD) for distillation, relaxing the alignment with the teacher signal compared to previous methods. Despite these gains, several aspects of KLD-based distillation for conversational search remain understudied, and we investigate them in this work. Prior work in related fields suggests that adding a contrastive loss to the KLD objective can improve performance; we confirm this and observe significant gains in precision-oriented ranking metrics. We also find that contrastive sampling strategies for the KLD loss have a non-trivial impact and must be chosen carefully. Although theory suggests that more samples improve the KLD estimate, experiments show diminishing returns on the number of used samples. Finally, we address the phenomenon of decreased sparsity in longer conversations, which limits computational efficiency across sparse retrieval methods. We find that the representations from the model distilled with the KLD loss can be strongly regularized with a regularization loss, substantially improving sparsity and inference efficiency without significantly harming retrieval effectiveness. We achieve a 2× decrease in FLOPS on TopiOCQA with negligible loss in effectiveness, corresponding to a ≤ 2% drop in Recall@100. Our results provide insights into distillation objectives for learned sparse conversational retrievers and offer practical guidelines for improving effectiveness and efficiency in first-stage retrieval.
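As a rough illustration of combining a KLD distillation term with a contrastive term, the sketch below scores one query against a small list of candidate documents. The direction KL(teacher || student), the InfoNCE-style contrastive term, the function names, and the `weight` parameter are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of relevance scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_scores, student_scores):
    """KL(teacher || student) over the same candidate documents."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def contrastive_loss(student_scores, positive_index):
    """InfoNCE-style term: negative log-probability of the positive document."""
    return -math.log(softmax(student_scores)[positive_index])

def distillation_loss(teacher_scores, student_scores, positive_index, weight=1.0):
    """Assumed combined objective: KLD term plus a weighted contrastive term."""
    kld = kl_divergence(teacher_scores, student_scores)
    return kld + weight * contrastive_loss(student_scores, positive_index)

# Hypothetical scores for one query over four sampled documents (index 0 is relevant).
teacher = [4.0, 1.0, 0.5, 0.2]
student = [2.0, 1.5, 0.3, 0.1]
print(distillation_loss(teacher, student, positive_index=0))
```

The KLD term only asks the student to match the teacher's distribution over the sampled documents, while the contrastive term pushes the known relevant document above the rest, which is one plausible reading of why the combination helps precision-oriented metrics.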
[IR-7] QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples
Link: https://arxiv.org/abs/2606.04646
Authors: Mengao Zhang,Xiang Yang,Chang Liu,Tianhui Tan,Ke-wei Huang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 14 pages
Abstract:Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework – index-time preservation versus query-time execution – predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution – not retrieval alone – is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.
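The scoring idea (gold answers computed deterministically from typed event tuples, then matched exactly and scored by recall) can be sketched as follows. The tuple schema `(company, event_type, date)`, the intersection query, and all names are hypothetical; QO-Bench's actual schema and templates are not given in the abstract.

```python
def intersect_companies(events, type_a, type_b):
    """Intersection operator: companies that have events of both types."""
    with_a = {company for company, event_type, _ in events if event_type == type_a}
    with_b = {company for company, event_type, _ in events if event_type == type_b}
    return with_a & with_b

def exact_match_recall(predicted, gold):
    """Fraction of gold answers that appear verbatim in the prediction."""
    gold = set(gold)
    if not gold:
        return 1.0  # Nothing to recall, so treat as fully recalled.
    return len(gold & set(predicted)) / len(gold)

# Hypothetical typed event tuples: (company, event_type, date).
events = [
    ("Acme", "acquisition", "2025-01-10"),
    ("Acme", "layoff", "2025-03-02"),
    ("Bolt", "acquisition", "2025-02-14"),
    ("Core", "layoff", "2025-02-20"),
    ("Core", "acquisition", "2025-04-01"),
]

gold = intersect_companies(events, "acquisition", "layoff")
system_answer = ["Acme", "Bolt"]  # Retrieved a plausible but wrong company.
print(exact_match_recall(system_answer, gold))
```

Here the system names one company that is topically relevant but fails the intersection, which is the kind of operator-level error that passage-relevance metrics would not expose.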
[IR-8] Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval
Link: https://arxiv.org/abs/2606.04603
Authors: Olivier Jeunen
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Approximate Nearest Neighbour search indices form the backbone of real-world recommender systems, enabling real-time candidate retrieval over million-item catalogues. Typically, a single point estimate embedding is learnt for every user and every item. At serving time, the user embedding queries the index for relevant items. Since these representations are learnt from sparse interaction data, they are noisy and might fail to capture all the nuances that contribute to “relevance” – ignoring the fundamental uncertainty that is inherent to them. The result is a retrieval pipeline that is systematically biased toward the small minority of popular head items with well-estimated embeddings, at the expense of the long-tail majority of niche, diverse, and serendipitous content. We propose DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval): a simple and infrastructure-compatible framework to incorporate embedding uncertainty into candidate generation. Rather than indexing point estimates, DINOSAUR samples S_i embeddings per item and constructs an index on this augmented set. Analogously, at query time, a user embedding is sampled. This two-sided stochastic retrieval process implicitly marginalises over embedding uncertainty, without requiring changes to model architecture or ANN index infrastructure. On the analytical side, we show that DINOSAUR recovers standard point-estimate retrieval as uncertainty vanishes, and we characterise how increased embedding variance expands the regions of latent space in which uncertain items are retrievable. Reproducible empirical observations align with these expectations, showing large coverage gains with small losses in offline recall.
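A minimal sketch of the two-sided sampling idea follows. It assumes isotropic Gaussian noise around each mean embedding, inner-product similarity, and a brute-force scan standing in for a real ANN index; none of these choices are confirmed by the abstract, and all function names are hypothetical.

```python
import random

def sample_embedding(mean, std, rng):
    """Draw one embedding; isotropic Gaussian noise is an assumption."""
    return [rng.gauss(m, std) for m in mean]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_index(item_means, item_stds, samples_per_item, rng):
    """Index several sampled embeddings per item instead of one point estimate."""
    index = []
    for item_id, mean in item_means.items():
        for _ in range(samples_per_item):
            index.append((item_id, sample_embedding(mean, item_stds[item_id], rng)))
    return index

def retrieve(index, user_mean, user_std, k, rng):
    """Sample a user embedding, scan the index, and return k distinct items."""
    query = sample_embedding(user_mean, user_std, rng)
    ranked = sorted(index, key=lambda entry: dot(query, entry[1]), reverse=True)
    results = []
    for item_id, _ in ranked:
        if item_id not in results:
            results.append(item_id)
        if len(results) == k:
            break
    return results

item_means = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
zero_stds = {"a": 0.0, "b": 0.0, "c": 0.0}
rng = random.Random(0)

# With zero variance on both sides this reduces to point-estimate retrieval.
index = build_index(item_means, zero_stds, samples_per_item=3, rng=rng)
point_estimate_top2 = retrieve(index, [1.0, 0.0], 0.0, k=2, rng=rng)
print(point_estimate_top2)
```

Raising an item's standard deviation spreads its sampled copies over a wider region of the latent space, so it can surface for queries that its mean embedding alone would never match.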
[IR-9] Cartridges at Scale: Training Modular KV Caches over Large Document Collections
Link: https://arxiv.org/abs/2606.04557
Authors: Momchil Hardalov,Gonzalo Iglesias,Adrià de Gispert
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 21 pages, 5 figures, 17 tables
Abstract:Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while preserving accuracy. A critical limitation of this approach is that cartridges are monolithic and non-compositional: encoding an entire collection into a single KV block does not scale, and naively mixing cartridges trained in isolation collapses performance to near chance. We introduce Cartridges at Scale (CAS), a training framework for scalable multi-cartridge learning with dynamic distractor mixing and a memory-efficient budget manager that rotates hundreds of per-document cartridges between GPU and persistent storage. Our approach scales to collections exceeding a million tokens, improving over a monolithic cartridge by 10-31 points at comparable token budgets. Oracle cartridge accuracy falls within 2-6 points of full in-context learning even at high compression. When paired with retrieval for cartridge selection, CAS matches or exceeds conventional RAG accuracy while consuming 3-4x fewer prompt tokens.
[IR-10] Trading Engagement for Sustainability: Carbon-Aware Re-ranking for E-commerce Recommendations RECSYS
Link: https://arxiv.org/abs/2606.04550
Authors: Noah Lund Syrdal,Anders Vestrum,Jorgen Bergh
Categories: Information Retrieval (cs.IR)
Comments: 23 pages, 30 figures. Code available at this https URL
Abstract:E-commerce recommender systems strongly influence which products users consider and purchase, yet sustainability signals such as Product Carbon Footprint (PCF) are almost never available at catalog scale. We study carbon-aware product recommendation in the realistic setting where PCF labels are missing for most items and must be inferred. We first estimate product-level carbon footprints via a retrieval-augmented PCF estimation pipeline that transfers supervision from the Carbon Catalogue, a small set of life-cycle-assessed products, to a large unlabeled e-commerce catalog using semantic similarity search, few-shot LLM prompting, and a nearest-neighbour fallback. We then apply a carbon-aware post-hoc re-ranking strategy on top of relevance scores produced by three established recommendation models: BPR, NeuMF, and LightGCN. The method trades off predicted user-item engagement against estimated carbon footprint through a single tunable parameter, lambda. In this offline study, engagement is operationalized through Amazon review interactions, which serve as implicit feedback and as a proxy for user interest or purchase behavior. We evaluate the framework on the Amazon Reviews dataset across three product categories: Home and Kitchen, Sports and Outdoors, and Electronics. By sweeping lambda, we construct Pareto frontiers that characterize the achievable engagement and carbon trade-off for each model and category. Substantial carbon reductions are achievable at minimal engagement cost across all models and categories. However, the available carbon headroom varies by model and category, underscoring the importance of model choice and domain context.
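The post-hoc trade-off controlled by lambda can be sketched as a simple re-scoring step. The linear form `relevance - lam * normalized_carbon`, the max-normalization, and the example numbers are assumptions for illustration; the abstract only states that a single tunable parameter balances engagement against estimated carbon footprint.

```python
def carbon_aware_rerank(candidates, lam):
    """Re-rank (item_id, relevance, carbon_kg) triples.

    Assumed scoring rule: relevance minus lam times the carbon footprint,
    with carbon scaled to [0, 1] by the largest footprint in the list.
    """
    max_carbon = max(carbon for _, _, carbon in candidates) or 1.0
    scored = [
        (item_id, relevance - lam * (carbon / max_carbon))
        for item_id, relevance, carbon in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in scored]

# Hypothetical relevance scores and estimated footprints (kg CO2e).
candidates = [("kettle", 0.9, 50.0), ("mug", 0.8, 5.0), ("lamp", 0.5, 20.0)]

relevance_only = carbon_aware_rerank(candidates, lam=0.0)
carbon_aware = carbon_aware_rerank(candidates, lam=0.5)
print(relevance_only, carbon_aware)
```

Sweeping `lam` over a grid and recording an engagement metric and the mean footprint of the top results at each value would trace out the kind of Pareto frontier the abstract describes.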
[IR-11] Beyond Retrieval: Learning Compact User Representations for Scalable LLM Personalization
Link: https://arxiv.org/abs/2606.04547
Authors: Heng Cao,Fan Zhang,Jian Yao,Yujie Zheng,Changlin Zhao,Lu Hao,Yuxuan Wei,Wangze Ni,Huaiyu Fu,Yuqian Sun,Xuyan Mo
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 16 pages, 6 figures
Abstract:Personalizing large language models requires adapting model behavior to individual users while preserving robustness and deployment-scale efficiency. Existing approaches typically personalize LLMs either at the input level, by retrieving user histories or constructing profile prompts, or at the parameter level, by maintaining user-specific parameter-efficient modules. The former makes personalization sensitive to retrieval quality and prompt design, whereas the latter incurs storage and maintenance costs that grow with the user population. To address these limitations, we propose TAP-PER (Temporal Attentive Prefix for PERsonalization), a prefix-based framework that encodes user preferences as learnable representations, eliminating explicit prompt construction and replacing heavy per-user adapters with lightweight user-state prefix embeddings. Inspired by personalized recommendation systems, TAP-PER decomposes user modeling into user-state and query-conditioned components, and incorporates temporal signals to capture the evolving nature of user interests. Experiments on six LaMP tasks show that TAP-PER consistently outperforms prompt-based and model-based baselines across classification, rating, and generation settings. Moreover, TAP-PER uses 130x fewer per-user parameters than OPPU and roughly half the total parameter footprint of PER-PCS at the 1,000-user scale, demonstrating that scalable LLM personalization can be achieved without explicit prompt construction or heavy per-user adapters.
[IR-12] ANN Search: Recall What Matters
Link: https://arxiv.org/abs/2606.04522
Authors: Dimitris Dimitropoulos,Nikos Mamoulis
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments:
Abstract:Approximate nearest neighbor (ANN) search has become a core primitive in information retrieval and modern machine learning tasks, from classification to retrieval-augmented generation. The community evaluates and tunes ANN algorithms primarily on their throughput at a given Recall@k, the fraction of true exact neighbors retrieved. We argue that what really matters in ANN search is the quality of the retrieved results and not their overlap with the true kNN set. We show that using Recall@k to assess retrieval quality forces unnecessary computational overhead and investigate replacing it by 1/Ratio@k, the inverse approximation ratio. 1/Ratio@k evaluates the differences between the distances of the retrieved and true neighbors. It is judge-free, hyperparameter-free, and computable from standard ANN benchmark inputs alone. We benchmark state-of-the-art ANN algorithms across diverse datasets spanning a wide range of intrinsic dimensionalities, evaluating the two metrics comprehensively across efficiency, downstream classification, and retrieval-augmented generation. On the efficiency axis, optimizing for 1/Ratio@k reaches operational quality thresholds at a substantially lower computational cost than Recall@k. In downstream tasks, performance indicators (label precision, semantic similarity, BERTScore, and LLM-graded quality) remain highly stable even when Recall@k drops significantly. The inverse approximation ratio, on the other hand, closely mirrors this stability, tracking true utility much better than Recall@k. Ultimately, while Recall@k overstates the true cost of approximation, 1/Ratio@k offers a more accurate, deployable proxy for actual ANN quality.
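To make the contrast between the two metrics concrete, the sketch below computes both for one query. It assumes the approximation ratio commonly used in ANN benchmarks (the mean, over ranks, of retrieved distance divided by true distance at the same rank) and takes its inverse; the paper's exact definition of 1/Ratio@k may differ, and the function names and numbers are hypothetical.

```python
def recall_at_k(retrieved_ids, true_ids):
    """Fraction of the true k nearest neighbours that were retrieved."""
    return len(set(retrieved_ids) & set(true_ids)) / len(true_ids)

def inverse_ratio_at_k(retrieved_dists, true_dists):
    """Inverse of the mean rank-wise distance ratio (assumed definition).

    Equals 1.0 when the retrieved neighbours are as close as the true ones
    and decreases as they get farther away. Assumes positive true distances.
    """
    pairs = zip(sorted(retrieved_dists), sorted(true_dists))
    ratios = [retrieved / true for retrieved, true in pairs]
    return len(ratios) / sum(ratios)

# Hypothetical query: the third true neighbour is missed, but its
# replacement is only slightly farther away.
true_ids, true_dists = [1, 2, 3], [1.0, 2.0, 4.0]
retrieved_ids, retrieved_dists = [1, 2, 9], [1.0, 2.0, 5.0]

recall = recall_at_k(retrieved_ids, true_ids)
inv_ratio = inverse_ratio_at_k(retrieved_dists, true_dists)
print(recall, inv_ratio)
```

In this example Recall@3 drops by a third while the inverse ratio stays above 0.9, mirroring the abstract's argument that overlap with the exact kNN set can overstate the cost of approximation.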
[IR-13] SAILRec: Steering LLM Attention to Dual-Side Semantically Aligned Collaborative Embeddings for Recommendation
Link: https://arxiv.org/abs/2606.04514
Authors: Xi Wu,Jiale Wang,Zihan Wang,Yichen Gao,Xiaocui Yang,Shi Feng,Daling Wang,Yifei Zhang
Categories: Information Retrieval (cs.IR)
Comments: 17 pages, including appendices
Abstract:Recent LLM-based recommenders enhance language models with collaborative embeddings from user-item interactions, but making such embeddings available does not ensure their proper use during inference. Through a diagnostic attention analysis, we find that the utilization of collaborative embeddings is depth-dependent and alignment-sensitive, suggesting that LLMs need to balance their internal semantic knowledge with external collaborative knowledge. To address this issue, we propose SAILRec, an LLM-based recommender that improves this balance through dual-side semantic alignment and hierarchical attention steering. The former aligns item-side embeddings with item-text semantics and user-side embeddings with codebook-based semantic profiles, while the latter suppresses premature shallow-layer collaborative interference and strengthens collaborative evidence in deeper decision layers. Experiments on MovieLens-1M and Amazon-Book show that SAILRec consistently outperforms representative baselines, with ablation and masking analyses validating its key designs.
[IR-14] Bridging Short Videos and Live Streams: Reasoning-Guided Multimodal LLMs for Cross-Domain Representation Learning
Link: https://arxiv.org/abs/2606.04448
Authors: Le Zhang,Xiaolan Zhu,Yuchen Wang,Shilong Kang,Jiaqi Xue,Xiaoyu Zhang,Xiang Chen,Yalong Guan,Xiangyu Wu,Shijun Wang,Lantao Hu,Kun Gai
Categories: Information Retrieval (cs.IR)
Comments: 9 pages
Abstract:As live streaming services grow, many platforms offer short videos and live streams to meet diverse needs. Short videos carry substantial traffic and rich behavior signals, whereas live streaming is a core conversion scenario with sparse behavior data, making cold start severe. Transferring user interests from short videos to live streaming recommendation can alleviate these issues. Meanwhile, short videos and live streams are complex multimodal items, and integrating multimodal signals improves recommendation performance. Although Multimodal Large Language Models (MLLMs) show strong multimodal understanding and reasoning, their application to cross-domain recommendation remains underexplored. To this end, we propose Reasoning-Guided Cross-Domain Representation Learning (RGCD-Rep), a reasoning-guided framework for cross-domain recommendation from short videos to live streams. RGCD-Rep introduces MLLM reasoning resource-efficiently and learns transferable item representations guided by behavioral collaboration via two-stage training. First, reasoning-aware distillation lets a frozen teacher MLLM generate structured cross-domain reasoning knowledge and distills it into a lightweight student MLLM. Second, transferability-guided cross-domain representation learning decomposes item representations into transferable and domain residual representations. The resulting representations are computed offline and integrated into downstream retrieval tasks, enabling low-cost industrial deployment. Extensive offline experiments demonstrate RGCD-Rep’s superiority. After deployment in Kuaishou’s live streaming recommendation system, A/B tests show significant gains across multiple core business metrics, confirming its effectiveness and practicality in real industrial scenarios. RGCD-Rep is fully deployed and serves over 400 million users daily.
[IR-15] Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation
Link: https://arxiv.org/abs/2606.04435
Authors: Saroj Mishra
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments:
Abstract:Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.
[IR-16] Context-as-a-Service: Surfacing Cross-File Dependency Chains for LLM-Generated Developer Documentation
Link: https://arxiv.org/abs/2606.04397
Authors: Ameya Gawde,Vyzantinos Repantis,Harshvardhan Singh,Lucy Moys
Categories: Software Engineering (cs.SE); Information Retrieval (cs.IR)
Comments: 8 pages, 2 figures, 4 tables
Abstract:LLM agents increasingly write and maintain developer documentation, but usefulness and accuracy often rely on dependency chains that are not obvious to follow. Even with more files in context, the agent must still decide which cross-file dependencies to trace. We present Context-as-a-Service (CaaS), a retrieval layer that LLM agents query to find evidence across the codebase as they review or generate documentation. CaaS indexes source code, API references, and upstream documentation, then enables agents to query the index through tool calls that combine keyword and semantic search. We evaluate CaaS in two case studies using Claude Sonnet 4.6 on a production SDK: improving API reference comments in a core source file and validating an LLM-generated tutorial. In both studies, the baseline already had ordinary repository tools such as file reads, keyword search, and symbol navigation. CaaS adds a retrieval layer on top, so the comparison isolates added retrieval rather than basic repository access. In the API-reference review, the CaaS-augmented agent produced the same 5 missing-documentation fixes as the baseline and surfaced 4 findings the baseline missed: 2 cross-file factual errors and 2 underspecified API comments. In the tutorial validation, it surfaced 1 executable bug, 1 API-usage improvement, and 2 missing prerequisites that the baseline pipeline did not catch. These findings required tracing non-obvious dependency chains across utility files, framework internals, usage examples, tests, and component-creation logic. Over five runs per condition, adding CaaS reduced wall-clock time by 22% to 34% across the two tasks and lowered input-token usage.
[IR-17] Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking
Link: https://arxiv.org/abs/2606.04387
Authors: Chenyu Zhang,Yiwen Liu,Yin Sun,Xinyuan Zhang,Yuji Cao,Junming Jiao,Juyi Qiao
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sales lead conversion in high-stakes domains (e.g., automotive, real estate) differs fundamentally from e-commerce recommendation due to prolonged decision cycles and multi-stage funnels. Traditional lead scoring methods (rule-based scorecards, machine learning, or pointwise CTR models) face severe challenges: sparse supervision, a semantic gap in unstructured CRM logs, and inability to capture relative lead priority. While Large Language Models (LLMs) offer superior semantic understanding of customer interactions, general-purpose LLMs are ill-suited for lead ranking: they generate text rather than comparable scores, and lack alignment with the hierarchical priorities of sales funnels. We introduce an LLM-based discriminative framework for sales lead scoring, which supports joint modeling of structured CRM features and unstructured customer interactions. On top of this framework, we propose HPRO (Hierarchical Preference Ranking Optimization), which augments sales lead scoring with a hierarchical preference ranking objective. HPRO employs a margin-aware Bradley-Terry formulation to transform sparse binary labels into dense, funnel-aware preference pairs, enabling lead scoring to leverage both pointwise and pairwise supervision. Experiments on large-scale data from a leading NEV brand demonstrate state-of-the-art classification (AUC 0.8161) and ranking performance (+39.7% precision among top-ranked leads). A 132-day online A/B test validates 9.5% sales volume uplift, confirming real-world commercial impact.
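One way to read the margin-aware Bradley-Terry idea is sketched below: leads that reached different funnel stages form preference pairs, and the loss asks the deeper lead to outscore the shallower one by a margin. The integer stage encoding, the margin that grows linearly with the stage gap, and all names are assumptions for illustration, not HPRO's confirmed formulation.

```python
import math
from itertools import combinations

def bt_margin_loss(score_preferred, score_other, margin):
    """Bradley-Terry loss with a margin: -log sigmoid(s_pref - s_other - margin)."""
    gap = score_preferred - score_other - margin
    return math.log1p(math.exp(-gap))

def funnel_pairs(leads, margin_per_stage=0.5):
    """Turn (lead_id, funnel_stage) records into (preferred, other, margin) pairs.

    A higher stage number means the lead progressed further in the funnel.
    The margin scaling with the stage gap is a hypothetical choice.
    """
    pairs = []
    for (lead_a, stage_a), (lead_b, stage_b) in combinations(leads, 2):
        if stage_a == stage_b:
            continue  # Same stage carries no preference signal.
        if stage_a > stage_b:
            preferred, other = lead_a, lead_b
        else:
            preferred, other = lead_b, lead_a
        pairs.append((preferred, other, margin_per_stage * abs(stage_a - stage_b)))
    return pairs

# Hypothetical funnel: 0 = contacted, 1 = test drive, 3 = purchased.
leads = [("x", 0), ("y", 1), ("z", 3)]
pairs = funnel_pairs(leads)
print(pairs)
```

Three leads with a single purchase yield three ordered pairs here, which illustrates how a sparse binary conversion label can be expanded into denser pairwise supervision.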
[IR-18] LCSHBench: A Multilingual Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment
Link: https://arxiv.org/abs/2606.04382
Authors: Kwok Leong Tang
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Automated subject cataloging assigns controlled-vocabulary headings to bibliographic records, but LCSH has no standard public benchmark. We introduce LCSHBench: 22,346 books in 15 languages from the openly licensed Harvard, Columbia, and Princeton catalogs. Records enter only when at least two independent cataloging agencies assigned LCSH; we release per-catalog provenance plus union and unanimous answer views. A concordance study of 465,187 works cataloged by all three libraries shows why this design matters: libraries usually agree on the underlying topic (93.3% share a concept-level heading) but often differ in exact expression (39.4% have identical heading sets). LCSHBench therefore scores both exact and concept matches, with set and rank metrics broken down by language and heading type, across open-vocabulary generation and full-vocabulary retrieval. As a first demonstration, a low-rank fine-tune of a 300M on-device embedder improves cross-lingual retrieval and beats a 3,072-dimensional hosted embedder on development exact recall@200 (0.659 vs 0.623). The language panel shows the gain is not uniform, and held-out-test and end-to-end confirmation remain future work.
[IR-19] DSIRM: Learning Query-Bridged Discrete Semantic Identifiers for E-commerce Relevance Modeling
Link: https://arxiv.org/abs/2606.04374
Authors: Bokang Wang,Xing Fang,Mingmin Jin,Jing Wang,Zhentao Song,Guangxin Song,Jianbo Zhu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Jing Wang (Corresponding Author)
Abstract:Despite rapid progress of continuous embeddings for e-commerce search relevance, a long-standing open problem is the difficulty in capturing fine-grained attribute distinctions. While discrete Semantic Identifiers (SIDs) have been widely adopted as a promising alternative, existing SID generation methods rely heavily on unsupervised quantization. In realistic scenarios, the lack of explicit supervision often makes it more difficult to dictate which items should share an SID, resulting in limited capability for query-dependent ranking. To address the issue of unsupervised SIDs, we propose to explicitly model discrete relevance features and develop a Discrete Semantic Identifier Relevance Model (DSIRM). Specifically, we present a query-bridged contrastive quantization approach on the item side, injecting query-item interaction supervision into Residual Quantization to actively learn relevance-aware semantic partitions. On the other hand, we explore generative LLMs on the query side to explicitly predict item SIDs from text, resolving tail queries and intent ambiguity. Hierarchical prefix matching between query and item SIDs yields discriminative features that perfectly complement dense signals. Extensive experimental results on Tmall’s production data show that our proposed approach has achieved better results, improving offline AUC by +1.54%. Deployed via an efficient hybrid architecture, it achieves significant online lifts (+0.13% UCTR, +0.25% UCTCVR), proving its massive industrial value.
[IR-20] Disentangling Answer Engine Optimization from Platform Growth: A Log-Based Natural Experiment on ChatGPT Referral Traffic
Link: https://arxiv.org/abs/2606.04362
Authors: Keisuke Watanabe,Kazuki Nakayashiki
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 9 pages, 4 figures, 1 table
Abstract:Large language model (LLM) “answer engines” such as ChatGPT now send measurable referral traffic to the open web, and a practice analogous to search engine optimization, here called Answer Engine Optimization (AEO), has emerged. Public AEO success stories typically quote large raw growth multiples, but raw referral growth is confounded by the rapid platform-level growth of the answer engines themselves. We report a longitudinal field study on a single high-traffic domain (this http URL) whose corpus of hundreds of thousands of YouTube question-and-answer pages received a defined bundle of AEO interventions in January 2026 (detailed in Section 4). Because the interventions were concentrated on one subset of the site, the untreated remainder of the same domain acts as a contemporaneous control that absorbs the platform tailwind. Using first-party analytics and server logs rather than probabilistic third-party estimators, we find: (1) raw growth is dominated by the platform tailwind: on monthly aggregates total ChatGPT referrals grew 5.7x while untreated pages on the same domain grew 3.5x over the same window; (2) an interrupted time-series model on the weekly treated/control ratio estimates a discrete, intervention-aligned level increase of 1.82x (95% CI 1.31-2.54, HAC p=0.001), robust across engagement-filtered traffic (2.27x) and alternative specifications; (3) however, a conservative placebo-in-time permutation test yields p=0.16, so the effect is suggestive, not conclusive, given a short and noisy pre-period; and (4) Google organic clicks to treated pages did not fall beyond the ambient site-wide trend and indexation was preserved, consistent with the SEO-protection rule. The methodological message, separating treatment from platform tailwind with an on-domain control, matters more than any single multiple, and implies that headline AEO multiples substantially overstate causal effect.
[IR-21] Creative Reading: Scaffolding Reading for Transformation
Link: https://arxiv.org/abs/2606.04308
Authors: Sophia Liu,Sarah Abowitz,Yijun Liu,Sarah Sterman,Shm Garanganao Almeda,Max Kreminski
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments:
Abstract:Reading augmentation systems increasingly help readers process text at scale. While these tools address real constraints of time and cognitive load, they often implicitly frame reading as information transmission, or “reading to discard,” delegating interpretation and effort to the machine. Yet this delegation changes the outcome of reading. For example, in scholarly reading, deciding what a research text implies and why it matters is central to the work of scholarly production. We propose creative reading as an alternative goal: reading augmentation that supports readers in creating both readings and themselves as readers. By putting literary and narrative theories into conversation with scholarly sensemaking and creativity support, we present a provocation-oriented design space for valuing the process of reading as a way of preserving a plurality of readings and transforming readers over time.
[IR-22] Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval
Link: https://arxiv.org/abs/2606.04300
Authors: Abdelrahman Abdallah,Mahmoud Abdalla,Mohammed Ali,Adam Jatowt
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Late-interaction vision-language retrievers represent each document page as many visual token embeddings and score queries with MaxSim. In systems such as ColPali, ColQwen, ColNomic, and Nemotron ColEmbed, the document embeddings are produced without seeing the query, so the same page is represented identically for a table lookup, a chart question, and a layout-sensitive evidence request. We introduce Argus, a family of query-conditioned late-interaction retrievers built on Qwen3.5-VL. Argus adds a region-aware Mixture-of-Experts module: the query encoder produces both retrieval embeddings and a compact context vector, the document page is pooled into spatial regions, and a query-aware router selects latent experts per region before MaxSim. The output remains a multi-vector index compatible with ColPali-style retrieval, but the document representation is now dependent on the query (i.e., $\mathbf{D}(q)$). All Argus models use a 1024-dimensional retrieval head, compared with the 2560-dimensional and 4096-dimensional heads of recent state-of-the-art systems, and are trained on roughly 9% of the available public supervision rather than the full pool. The 9B model reaches 92.67 NDCG@5 on ViDoRe V1 and 86.0 NDCG@5 on the combined V1+V2 leaderboard, the highest reported value for an open late-interaction model on the combined leaderboard. Wrapped in a Qwen3.6-27B agentic retrieval pipeline on ViDoRe V3, Argus-9B further improves its NDCG@10 from 60.28 to 64.80 over public tasks, showing that the same retriever serves both as a strong standalone system and as a search primitive for iterative LLM agents.
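For readers unfamiliar with MaxSim, the late-interaction scoring rule the abstract refers to, a minimal sketch follows. It covers only the standard ColBERT/ColPali-style scoring step, not the Argus router or its Mixture-of-Experts module; the toy two-dimensional vectors and the function names are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: each query token keeps only its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def rank_pages(query_vecs, pages):
    """Return page ids sorted by descending MaxSim score.

    `pages` maps a page id to its list of token embeddings (hypothetical layout).
    """
    scored = {pid: maxsim(query_vecs, vecs) for pid, vecs in pages.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Toy embeddings: two query tokens, two pages with two token vectors each
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.5, 0.5]],
    "page_b": [[0.1, 0.1], [0.2, 0.0]],
}
print(rank_pages(query, pages))
```

In the query-independent systems the abstract lists, `doc_vecs` is computed once per page and stored in the index. The abstract describes Argus as changing what those document vectors are for a given query while leaving this scoring step unchanged.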
[IR-23] The Loss Is Not Enough: Sampling Conditions and Inductive Bias in Contrastive Representation Learning
Link: https://arxiv.org/abs/2606.04280
Authors: Justinas Zaliaduonis,Patrick Putzky,Till Richter,Sergios Gatidis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Contrastive learning has become a leading paradigm for self-supervised representation learning, yet the conditions under which it recovers meaningful latent geometry remain incompletely understood. We develop a measure-theoretic framework formalizing the diversity condition, a support requirement on positive-pair sampling that is necessary for isometric latent recovery. We show that the standard full-support von Mises-Fisher setting implies the satisfaction of the diversity condition and as a consequence global contrastive loss minimizers recover latent geometry up to orthogonal transformation, while restricted conditionals can make non-orthogonal maps attain strictly lower asymptotic contrastive loss. We introduce a support-corrected Information Noise Contrastive Estimation (InfoNCE) variant as a theoretical fix: this correction makes orthogonal latent space recovery achievable but does not uniquely select it. Experiments on synthetic benchmarks validate the identifiability predictions, and CIFAR-10 experiments are consistent with the qualitative prediction that architectural inductive bias becomes more important when sampling diversity is limited. Together, our results clarify how sampling mechanisms and encoder inductive bias interact in contrastive representation learning.
[IR-24] Training-Free Lexical-Dense Fusion for Conversational-Memory Retrieval
Link: https://arxiv.org/abs/2606.04194
Authors: Christian Lysenstøen
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 9 pages, 3 figures, 10 tables. Code, data, and per-table receipts: this https URL
Abstract:Retrieving the few past turns that answer a new query across long multi-session histories is the retrieval bottleneck behind long-term conversational memory (LoCoMo, LongMemEval). Recent concurrent work, Nano-Memory, shows that scoring a session by the maximum query-turn similarity (late interaction, “Turn Isolation Retrieval”) beats mean-pooled session embeddings. We do not claim that effect; we replicate it and ask what a training-free, CPU-only retrieval stage should add around it. We report four findings. (1) Fuse: score-level fusion of the late-interaction dense score with BM25, under a single leave-one-conversation-out weight, adds +8.8 to +17.2 points of LoCoMo Hit@1 over late interaction alone across six encoders (all p < 1e-4), reaching Hit@1 0.752 / NDCG@5 0.829 (e5-large-v2), +11.2 pp over BM25. (2) An off-the-shelf web-search cross-encoder reranker over the fused top-10 hurts here, degrading Hit@1 by 6.9 pp (one reranker, one configuration). (3) A pooling-operator ablation shows top-k late interaction matches max-similarity, but a naive smooth-max (log-sum-exp) collapses for half the encoders. (4) The late-minus-early gap is large for all six encoders and tends to be larger for larger ones, while the marginal fusion gain shrinks; on LongMemEval-S, a lexical regime where BM25 saturates, the net fusion gain over BM25 is small and not significant. A per-category analysis frames the gain as a division of labor: dense late interaction helps most on multi-hop and temporal questions but trails BM25 on adversarial ones. The contribution is a controlled, reproducible account of a strong training-free retrieval recipe, not the late-interaction retriever itself (Nano-Memory’s). We make no claim to a complete memory architecture; this is a retrieval-stage study.
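The two ingredients named in the abstract, max-over-turns session scoring and score-level fusion with BM25, can be sketched as follows. This is a rough illustration, not the paper's code: the min-max normalization, the convex-combination form, and all names are assumptions, and the BM25 scores are taken as given inputs rather than computed.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def late_interaction(query_vec, turn_vecs):
    """Score a session by its single best-matching turn (max query-turn similarity)."""
    return max(dot(query_vec, t) for t in turn_vecs)

def minmax(scores):
    """Rescale scores to [0, 1] so dense and lexical scores are comparable.
    (The normalization scheme is an assumption; the abstract does not specify one.)"""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(dense_scores, bm25_scores, weight):
    """Convex combination of normalized dense and BM25 scores, one weight for all queries."""
    d, b = minmax(dense_scores), minmax(bm25_scores)
    return [weight * x + (1.0 - weight) * y for x, y in zip(d, b)]

# Toy example: three candidate sessions for one query
query = [1.0, 0.0]
sessions = [
    [[0.2, 0.9], [0.7, 0.1]],   # best turn similarity 0.7
    [[0.1, 0.3]],               # best turn similarity 0.1
    [[0.4, 0.4], [0.3, 0.8]],   # best turn similarity 0.4
]
dense = [late_interaction(query, turns) for turns in sessions]
bm25 = [2.0, 9.0, 5.0]          # made-up lexical scores
fused = fuse(dense, bm25, weight=0.6)
best = max(range(len(fused)), key=fused.__getitem__)
print(best)
```

The abstract says the fusion weight is chosen leave-one-conversation-out, that is, tuned on the other conversations and applied to the held-out one; that selection loop is omitted here.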
[IR-25] PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation AAAI2026
Link: https://arxiv.org/abs/2601.18777
Authors: Abhishek Divekar,Anirban Majumder
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Applications (stat.AP)
Comments: Accepted at AAAI 2026 - Innovative Applications of AI (IAAI-26)
Abstract:Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. In recent times, several deployed systems have explored the usage of Large Language Models (LLMs) as automated judges for this task while their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics which require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduced the computational complexity from O(2^|C|) to O(2^K), where |C| represents corpus size (in order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.
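As background, the basic Prediction-Powered Inference estimator of a mean, which the abstract says PRECISE extends, can be sketched in a few lines. This is the textbook PPI point estimate only, not the PRECISE method or its sub-instance extension; the variable names and toy judgments are illustrative assumptions.

```python
def mean(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)

def ppi_mean(llm_unlabeled, llm_labeled, human_labeled):
    """Prediction-powered estimate of the average human relevance label.

    The LLM judgments on the large unlabeled set give a low-variance but
    possibly biased estimate; the small human-labeled set is used to measure
    that bias (the "rectifier") and correct for it.
    """
    rectifier = mean([h - l for h, l in zip(human_labeled, llm_labeled)])
    return mean(llm_unlabeled) + rectifier

# Toy binary relevance judgments (1 = relevant)
llm_unlabeled = [1, 1, 1, 0]    # LLM judge on items without human labels
llm_labeled = [1, 1, 0, 1]      # LLM judge on the human-annotated items
human_labeled = [1, 0, 0, 1]    # human annotations for those same items
print(ppi_mean(llm_unlabeled, llm_labeled, human_labeled))
```

Here the LLM judge is more lenient than the annotators on the labeled items, so the rectifier is negative and pulls the estimate below the raw LLM average.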
[IR-26] Archi: Agentic Operations at the CMS Experiment
Link: https://arxiv.org/abs/2606.04755
Authors: Pietro Lugato,Luca Lavezzo,Jason Mohoney,Hasan Ozturk,Muhammad Hassan Ahmed,Juan Pablo Salas,Viphava Ohm,Krittin Phornsiricharoenphant,Gabriele Benelli,Mariarosaria D’Alfonso,Manasvita Joshi,Warren Nam,Aron Soha,Samantha Sunnarborg,Austin Swinney,Jack Tucker,Dmytro Kovalskyi,Tim Kraska,Christoph Paus
Subjects: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN’s LHC since February 2026 as a support agent for technical operators, offering retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. We also observe that locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.
Human-Computer Interaction
[HC-0] “A Glimpse Not a Gaze”: Using Generative AI to Balance Privacy and Awareness in Inter-generational Caregiving
Link: https://arxiv.org/abs/2606.05055
Authors: Zixi Christina Li,Keiko Katsuragawa,James R. Wallace
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:As older adults increasingly prefer to age in place, their adult children often assume the role of informal caregivers. This dynamic creates a distinct tension between the adult child’s need for awareness and the older adult’s fundamental right to privacy. Traditional monitoring technologies, such as raw video feeds, often compromise the older adult’s autonomy. To address this challenge, this study explores the use of generative Artificial Intelligence (GenAI) to create abstract, privacy-preserving “visual summaries” of daily activities. We design a 10-day Experience Sampling Method (ESM) study with dyads consisting of older adults and their adult children. Through daily smartphone prompts, participants report their current context and evaluate pre-generated AI sketches, indicating their willingness to share or receive these images. Follow-up interviews will further investigate participants’ boundary-setting behaviours. This research aims to quantify the privacy mismatch between generations and provide actionable design guidelines for applying visual abstraction in AI-mediated caregiving tools, ultimately supporting inter-generational connection while protecting user dignity.
[HC-1] Scaling Expert Feedback with Reflective Edit Propagation in Compositional Knowledge Bases
Link: https://arxiv.org/abs/2606.05023
Authors: Jiajing Guo,Xueming Li,Jorge Piazentin Ono,Wenbin He,Liu Ren
Subjects: Human-Computer Interaction (cs.HC)
Comments: Accepted to ACM CAIS '26 Demo Track
Abstract:Domain-specific knowledge bases (KBs) encode vertical expertise and proprietary information that organizations depend on, but curating them at scale is a persistent challenge. Although Large Language Models (LLMs) can draft initial entries efficiently, technical accuracy still requires human expert validation, and reviewing entries one by one at scale is impractical. We present Reflective Agent for Identifier Dictionary (RAID), a novel system that transforms individual expert edits into systematic knowledge updates. Unlike traditional “correct-and-save” paradigms, RAID utilizes a reflective agent to infer the underlying semantic intent behind a single expert edit and propagates that correction across the entire KB through a three-step architecture: Intent Inference, Reflection-based Planning, and User Controlled Execution. We evaluated the reflection and propagation performance on a public dataset and conducted a user study with subject matter experts with proprietary data. The evaluation shows RAID’s technical feasibility in capturing expert intent and its potential to scale specialized expertise across industrial knowledge bases.
[HC-2] PhysDox: Benchmarking LLMs on Physical Feasibility Auditing of Physiological Sensing Protocols
Link: https://arxiv.org/abs/2606.05003
Authors: He Liu,Boyuan Gu,Shuaiqi Cheng,Haiyang Sun,Siyu You,Xuming Hu
Subjects: Human-Computer Interaction (cs.HC)
Comments: 31 pages, 7 figures
Abstract:Large language models (LLMs) increasingly assist in experimental design, yet fluent protocols often remain physically infeasible. We introduce PhysDox, a physical feasibility auditing benchmark for biomedical protocols comprising a 683-sample expert-curated Gold set and a 5,000-sample Silver set across six sensing domains. We formulate the task as a two-stage evaluation: severity detection classifying protocols as valid, minor, or fatal, followed by the constraint-level diagnosis of fatal violations. Evaluating 6 LLMs across 4 inference strategies yields a peak Stage-1 macro-F1 of only 53.0. Moreover, strong oracle diagnosis collapses during end-to-end evaluation due to correlated cascade errors. Error analysis reveals scaffold bias, where models conflate procedural completeness with physical validity. Consequently, implicit constraints exhibit a 2 times higher miss rate than explicit hardware violations, supported by strong statistical correlation at $\rho = 0.81$ and $p < 0.01$. Trace analysis of false negatives exposes a 54%–46% split between attention and judgment failures, ultimately demonstrating that protocol auditing demands calibrated feasibility reasoning rather than factual recall or longer rationales.
[HC-3] Multi-Camera AR Guidance System for Surgical Instrument Handling and Assembly: Investigating Workload and Efficiency
Link: https://arxiv.org/abs/2606.04992
Authors: Shiyu Li,Julian Kreimeier,Hannah Schieber,Dirk Müller,Bernhard Kainz,Rüdiger von Eisenhart-Rothe,Daniel Roth
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 11 pages
Abstract:The handling and assembly of instruments during surgery imposes high cognitive demands on scrub nurses, particularly when instruments are unfamiliar. We present a supporting guidance system for surgical instrumentation that combines multi-camera 6D pose estimation with augmented reality in-situ visualization on a head-mounted display without the requirement for additional markers. Pose estimation and consecutive camera calibration are achieved through known objects. The 6D pose estimation network is trained purely on synthetic data, aiming for better generalizability and real-world applicability. The AR guidance displays tooltip localization cues and step-wise assembly animations. Via gaze-based selection and a foot pedal, users can switch between assembly steps in intraoperative use. In a technical evaluation, our approach outperforms state-of-the-art 6D pose estimation. A user study with 29 scrub nurses was conducted in a surgical simulation of knee arthroplasty, comparing the system against a paper manual. AR guidance significantly reduced the perceived workload compared with the paper manual. Objectively, AR guidance reduced task completion time by 21.3% (4.76 minutes). Specifically, scrub nurses less experienced with the instrument set benefited when using the system. Error frequencies were comparable between conditions. Qualitative feedback highlighted improved process clarity, reduced information overload, and perceived independence. To summarize, our marker-free multi-camera AR guidance approach for surgical instruments can, subjectively and objectively, improve intraoperative instrumentation performance, particularly for untrained scrub nurses.
[HC-4] What Can Eye Gaze Teach Us About Real-World Cycling? Insights From the Oxford RobotCycle Project
Link: https://arxiv.org/abs/2606.04989
Authors: Benjamin Hardin,Efimia Panagiotaki,Daniele De Martini,Lars Kunze
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments:
Abstract:Although much is known about the physical danger of cycling situations, less is understood about the perceived danger of cycling. Furthermore, perception of danger may be filtered at a subconscious level and therefore be difficult to self-report. Such subconscious perceptions can, however, be revealed through physiological metrics such as eye gaze. This paper explores the perceived safety of cycling in Oxford, United Kingdom, and the ability of wearable eye tracking glasses to produce insights about differences in perception under different environments and events. It finds that eye gaze patterns change between bike lanes, car lanes and shared bus lanes, reflecting the different cognitive challenges of each lane type. It also shows that different intersections have significantly different eye gaze patterns, which may have implications for cyclist stress. Finally, eye gaze patterns differ in the presence of events such as passes and pedestrians in the road compared to cycling with no events. This paper draws conclusions on the benefits and limitations of using wearable eye trackers to estimate stress and cyclist workload.
[HC-5] DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving
Link: https://arxiv.org/abs/2606.04987
Authors: Xiaochen Zhu,Georgi Karadzhov,Tom Stafford,Andreas Vlachos
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. Each group first completes the puzzle individually, then engages in a multi-party discussion before submitting a revised collective answer. The dataset includes 107 dialogues with full transcripts, pre- and post-discussion choices, and metadata on puzzle difficulty and move quality. We evaluate performance using three metrics based on chess engine evaluations, and find that deliberation significantly improves group accuracy. We further analyse the role of probing utterances (i.e., messages that elicit proposals, justifications, or strategic reflection) using a classifier trained on prior deliberation data. While probing makes group performance more variable after discussion, it does not consistently lead to better performance. Our dataset offers a rich testbed for modelling group reasoning, dialogue dynamics, and the resolution of differing perspectives and opinions in a well-defined strategic domain.
[HC-6] Clinical Assistant for Remote Engagement Link (CARE-link): A Web-Based Electronic Health Records Software for Managing Diabetes
Link: https://arxiv.org/abs/2606.04952
Authors: Prince Ebenezer Adjei,Joshua Teye Tettey,Toufiq Musah,Audrey Agbeve,John Amuasi
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:
Abstract:CARE-link is an open-source, web-based clinical support platform designed to improve the management of gestational diabetes by linking clinicians and patients through an LLM-mediated workflow. The system aggregates patient-generated data outside the hospital, summarizes relevant clinical information, and delivers context-aware decision support to clinicians. For patients, CARE-link provides clear explanations of management plans and delivers timely lifestyle guidance through a WhatsApp interface. The integrated dual-facing design aims to promote continuous monitoring, support individualized care, and reduce the burden of in-clinic follow-ups. Built with a modular architecture, the platform can be adapted to other chronic conditions requiring longitudinal tracking and behavioral support. CARE-link has the potential to enhance clinical oversight, promote patient compliance, and strengthen continuity of care particularly in resource-constrained settings.
[HC-7] CapSenseBand: Sustaining Cross-Disciplinary Creativity When Stitches Must Meet Signals
Link: https://arxiv.org/abs/2606.04609
Authors: Sark Pangrui Xing,Hongci Hu,Lai Wei,Le Fang,Ziqian Bai,Kinor Shou-xiang Jiang,Stephen Jia Wang
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Wearable sensing systems increasingly depend on textiles that are both materially wearable and electronically functional. Their design requires collaboration between textile designers, who reason through stitches, yarn behavior, and machine constraints, and interaction designers, who reason through electrodes, signal paths, and insulation. However, these forms of expertise do not easily translate across disciplinary boundaries. This poster presents CapSenseBand, a knitted capacitive-sensing wristband developed through a research-through-design process organized around Analysis, Synthesis, and Detailing. We document an artifact chain spanning material swatches, a rapid wearable prototype, Paper Models as shared negotiation surfaces, a double-layer knitted structure, and an insulated Swept Frequency Capacitive Sensing breakout board. We show how Paper Models functioned as boundary objects, helping collaborators externalize intent, negotiate spatial and technical constraints, and preserve disciplinary expertise while converging on a shared design. We contribute a reusable swatch-to-sleeve pattern for material-centered HCI: keep discipline-specific probes open early, then converge through artifacts that make material, spatial, and electronic decisions legible before fabrication locks them in.
[HC-8] Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?
Link: https://arxiv.org/abs/2606.04592
Authors: Leonard Kinzinger,Jochen Hartmann
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a $3 \times 5 \times 2 \times 2$ construction-method grid that covers three open-weights LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-z correlation reaches $r = 0.590$ on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.
[HC-9] Addressing Negative Commons Governance with Positive Commons Principles
Link: https://arxiv.org/abs/2606.04563
Authors: Boyang Zhou,Oleg Ianchenko
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: Paper in Proceedings of LIMITS 2026: 12th Workshop on Computing within Limits, 2026-06-23-25, Online
Abstract:Computing is accompanied by both positive and negative commons throughout its lifecycle of creation, execution, and disposal. We examine two governance systems situated within this lifecycle, the global e-waste trade and the Linux kernel community, to evaluate whether Elinor Ostrom’s eight design principles for common-pool resource (CPR) governance extend to the management of negative common-pool resources (NCPRs). Unlike traditional CPRs, where communities work to preserve a finite resource (e.g., clean water), NCPR governance seeks to collectively reduce a negative shared stock. In our two cases, e-waste governance aims to reduce the volume of mismanaged waste and illicit trade, while the Linux community aims to reduce the number of error-prone or malicious contributions that reach the main branch and, in turn, extend the life of existing hardware. Through qualitative analysis of primary sources from each domain, we find that the same eight principles by Ostrom that aid positive commons governance tend to appear in successful negative commons governance systems. We argue that future NCPR governance design should prioritize Ostrom’s principles, particularly clearly defined boundaries and well-functioning nested structures.
[HC-10] Speculating the Impacts of Mediated Social Touch Technology
Link: https://arxiv.org/abs/2606.04489
Authors: Russian(Ruo-Xuan)Wu,Tim Moesgen,Myung Jin(MJ)Kim,Xinyan Yu,Naoki Kameyama,Anusha Withana,Marius Hoggenmueller,Luke Hespanhol
Subjects: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
Comments: To be published in Designing Interactive Systems Conference (DIS '26) proceedings, June 13-17, 2026, Singapore. 17 pages, 8 figures, 1 table
Abstract:With growing research on haptic interfaces, Mediated Social Touch (MST) technologies offer the potential to record, synthesise, and reproduce (RSR) touch experiences across space and time, enabling, for instance, a hug from afar and from the past. Although much of the existing research highlights the direct benefits of these systems, such as reducing loneliness and providing emotional support, little attention has been paid to their broader sociotechnical impacts. To address this gap, we used the Future Ripples method to speculate on possible effects of MST. We conducted three workshops with 24 participants, including potential users, domain experts, and haptics researchers. Throughout these sessions, participants collectively envisioned possible future scenarios, alongside opportunities and threats, and proposed actionable responses. Our qualitative analysis organised these insights into four themes and three distinctive challenges. These findings offer haptics researchers intervention points across the RSR pipeline to inform MST design, alongside methodological insights from applying Future Ripples to MST technology.
[HC-11] IMPose: Interactive Multi-person Pose Estimation with Dynamic Correction Propagation
Link: https://arxiv.org/abs/2606.04480
Authors: Haoyang Ge, Jian Ma, Ziwen Wang, Qihe Wang, Jianqi Fan, Hongzhi Yu, Xingyu Chen, Kun Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:
Abstract:High-quality dynamic human pose annotation equips AI with precise motion kinematics to enable human behavior mastery, yet remains labor-intensive and time-consuming. Current annotation tools either lack temporal correction propagation or fail in multi-person scenarios, necessitating excessive manual intervention. In this paper, we introduce IMPose, an interactive tool for multi-person dynamic pose annotation. It features a dual-level tracking mechanism that propagates one-frame multi-person pose corrections from annotators across entire videos. The keypoint level propagates corrections over time via sequential modeling, while the instance level employs keypoint-aware embeddings with relative positional encoding to maintain multi-person cross-frame consistency. To further improve robustness, IMPose maintains historical pose and instance cues in a trajectory bank, which enhances long-range temporal association and stabilizes annotation in challenging cases such as occlusion and motion blur. By converting sparse human corrections into dense and coherent pose trajectories, our framework significantly reduces repeated manual refinement across frames. Extensive experiments show that IMPose consistently achieves a strong accuracy-efficiency trade-off under different interaction budgets, demonstrating particular advantages in low-click annotation settings. IMPose achieves high-precision annotation with high efficiency, requiring only 27 clicks per 1,050-frame video on 3DPW and 3 clicks per tracklet per 84 frames on PoseTrack21. We further expand PoseTrack21 with 188K pose instances (3.55M keypoints) at a minimal cost of 10 annotators in 10 hours. The annotation tool, code, and extended dataset will be open-sourced.
[HC-12] When Chatbots Accommodate: What AI Companions Optimize for in Vulnerable Conversations
Link: https://arxiv.org/abs/2606.04431
Authors: Minh Duc Chu, Yifan Wu, Zhiyi Chen, Angel Hsing-Chi Hwang, Luca Luceri
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Millions turn to AI companion chatbots during loneliness, grief, and personal crises. How these companion platforms respond in such moments can shape the trajectory of a user’s vulnerable state. Yet we lack tools to characterize what each platform actually does when users open up. Existing audits score reactions to pre-defined crisis prompts and miss the underlying decision policy that governs sustained interaction. We address these gaps with two key contributions. First, we introduce the AI Companion Vulnerability-Response Taxonomy, a paired taxonomy of user vulnerability and chatbot response designed for analyzing extended companion chatbot interactions. Second, we infer the response policy each platform follows across distinct vulnerability scenarios by applying Inverse Reinforcement Learning to ~48k turns of real-world user conversations with GPT-4.1, this http URL, and Replika. Our findings reveal what AI companions prioritize in conversations with vulnerable users: GPT-4.1 reaches for advice, this http URL spreads its response across different strategies without a dominant mode, and Replika consistently asks questions and stays present. Each, however, downweights the responses that introduce corrective friction: GPT-4.1 probes less as conversations continue and when interacting with psychologically high-risk users; Replika advises bonded users more and challenges them less; this http URL shows no committed engagement strategy on internal distress. Estimated policies are invisible to output-level audits, providing a new lens for auditing chatbots in the wild and enabling more realistic safety evaluation.
[HC-13] GlossAssist – A Tool to Simplify Corpus Creation and Study the Effect of NLP Models in Low-Resource Documentation Settings
Link: https://arxiv.org/abs/2606.04367
Authors: Bhargav Shandilya, Matt Buchholz, Alexis Palmer
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 6 pages, 3 figures
Abstract:Interlinear glossed text (IGT) is the standard format for linguistic annotation in language documentation. Producing it manually, however, is often slow and costly. Automated glossing systems have improved substantially in recent years, but adoption among field linguists remains limited. Existing tools are designed to be evaluated rather than used, offering no interpretable path for correction or the incorporation of linguistic expertise back into model behavior. We present GlossAssist, a glossing tool built around the retrieval-based architecture of CWoMP (Contrastive Word-Morpheme Pre-training), which grounds predictions in a mutable lexicon of learned morpheme representations. In conjunction with CWoMP, our system treats each correction by an annotator as part of an active learning setting, which expands the lexicon and improves future predictions without having to retrain the model. In this paper, we present our interface and argue that this feedback loop should be treated as a design requirement for NLP tools aimed at documentary linguists.
[HC-14] Behavioral and Performance Indicators of Depression and Anxiety in Electronic Learning Systems
Link: https://arxiv.org/abs/2606.04254
Authors: Arya VarastehNezhad, Fattaneh Taghiyareh
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments:
Abstract:This study investigates whether behavioral and performance indicators derived from a Moodle-based learning management system are associated with university students’ depression and anxiety in two undergraduate Computer Engineering courses. Using a quantitative observational design, LMS event logs, academic records, and self-reported Beck Depression Inventory-II and Beck Anxiety Inventory scores from 97 students were integrated. A broad set of behavioral and performance indicators spanning temporal engagement, session structure, deadline-related behavior, page-refresh patterns, and LMS navigation was extracted from raw event logs and analyzed using descriptive statistics, independent-samples t-tests with Benjamini-Hochberg FDR correction, effect sizes, and Spearman correlations; inventory scores were confirmed invariant by sex and academic year. Several indicators were significantly associated with depression and anxiety. Higher depression was associated with shifted temporal activity patterns, longer session durations, and shorter homework submission lead times, while higher anxiety was associated with concentrated temporal engagement and session-based differences. These findings suggest that routine LMS data can provide meaningful behavioral signals related to student well-being and may support earlier educational awareness of students who experience mental-health-related strain. At the same time, such indicators should be interpreted as contextual and non-diagnostic markers rather than as substitutes for clinical assessment.
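The Benjamini-Hochberg FDR correction named in the abstract above is a standard step-up procedure, and a minimal sketch may help readers unfamiliar with it. The function name `benjamini_hochberg`, the default `alpha`, and the example p-values are illustrative assumptions; none of them come from the study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k (1-based, p-values sorted ascending)
    with p_(k) <= (k / m) * alpha, then reject every hypothesis ranked <= k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Made-up p-values for eight behavioral indicators.
decisions = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value further down the ranking meets its threshold.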
[HC-15] SocialCoach: Personalized Social Skill Learning with RL-based Agentic Tutoring and Practice
Link: https://arxiv.org/abs/2606.04155
Authors: Tianfu Wang, Max Xiong, Jianxun Lian, Hongyuan Zhu, Zhengyu Hu, Yuxuan Lei, Linxiao Gong, Xiaofang Li, Peiting Tsai, Nicholas Jing Yuan, Qi Zhang
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Social skills such as negotiation and leadership are crucial for personal and professional success in today’s interconnected world. However, scalable and effective training remains a significant challenge due to the scarcity of expert coaching. In this paper, we introduce SocialCoach, a holistic LLM-powered agentic tutoring system for personalized social skill development at scale. First, SocialCoach automatically constructs a pedagogically-grounded, theory-to-practice knowledge corpus from diverse expert sources, leveraging a multi-agent pipeline. Second, to personalize the learning journey, it employs an adaptive practice scheduling module that follows a prescription-retrieval-adaptation process. To maximize the long-term learning experience while overcoming the cold-start problem, this policy is optimized within a learner simulation environment through reinforcement learning. Finally, SocialCoach integrates immersive, goal-driven practice, causality-driven proficiency assessment and knowledge-grounded, reflective tutoring to help address the knowing-doing gap. We deploy it in our product, EQoach, and conduct extensive experiments. The results show that SocialCoach improves simulated pathway quality and judge-rated tutoring quality over baseline approaches, while early user feedback indicates strong perceived engagement and usefulness. These findings suggest a practical architecture for personalized and gamified pedagogical platforms on soft skill learning.
[HC-16] Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection
Link: https://arxiv.org/abs/2606.04150
Authors: Yaoxi Shi, Cathy Mengying Fang, Pattie Maez, Amit Goldenberg
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Public discourse and emerging policy typically assume that AI emotional support is a deliberate act: a lonely user consciously seeking comfort from a dedicated companion chatbot. In this paper, we draw on emerging empirical evidence and argue that this picture is inaccurate on two accounts, both in how AI emotional support arises and how it shapes future behavior. First, AI emotional support commonly emerges incidentally within task-oriented interactions on general-purpose platforms, much as workplace friendships deepen through collaboration. Second, these incidental encounters are path-dependent: positive experiences of AI emotional support update people’s beliefs about AI’s emotional capabilities and redirect their choices for future emotional support, increasing preference for AI and decreasing preference for humans. We review recent evidence, including a large-scale longitudinal study conducted in collaboration with OpenAI, showing that daily five-minute conversations with an AI about personal issues over 28 days led to a 10.3% decrease in the preference for seeking support from humans and an 11.6% increase in the preference for AI. These findings suggest that current policy, focused on companion apps and isolated interactions, cannot adequately protect human connection. Instead, effective regulations should extend to general-purpose AI systems and address cumulative, trajectory-level changes in how people seek support. Recognizing how people stumble into AI emotional support and how those encounters redirect human connections over time is essential to safeguarding human well-being.
[HC-17] Stories and Systems: Educational Interactive Storytelling to Teach Media Literacy and Systemic Thinking
Link: https://arxiv.org/abs/2508.11059
Authors: Christian Roth, Rahmin Bender-Salazar, Breanne Pitt
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: Under submission (May, 2025)
Abstract:This paper explores how Interactive Digital Narratives (IDNs) can support learners in developing the critical literacies needed to address complex societal challenges, so-called wicked problems, such as climate change, pandemics, and social inequality. While digital technologies offer broad access to narratives and data, they also contribute to misinformation and the oversimplification of interconnected issues. IDNs enable learners to navigate nonlinear, interactive stories, fostering deeper understanding and engagement. We introduce Systemic Learning IDNs: interactive narrative experiences explicitly designed to help learners explore and reflect on complex systems and interdependencies. To guide their creation and use, we propose the CLASS framework, a structured model that integrates systems thinking, design thinking, and storytelling. This transdisciplinary approach supports learners in developing curiosity, critical thinking, and collaborative problem-solving. Focusing on the classroom context, we apply CLASS to two cases, one commercial narrative simulation and one educational prototype, offering a comparative analysis and practical recommendations for future design and implementation. By combining narrative, systems mapping, and participatory design, this paper highlights how IDNs can become powerful tools for transformative, systems-oriented learning in an increasingly complex world.
Computer Vision
[CV-0] Controllable Dynamic 3D Shape Generation via 3D Trajectories and Text
Link: https://arxiv.org/abs/2606.05162
Authors: Jaeyeong Kim, Ines Kim, Jahyeok Koo, Seungryong Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity of language, generating precisely intended motions using text alone remains challenging. To address this, we adopt 3D trajectories as controllable spatial guidance, specifying the exact paths along which selected points should move. By combining both, T2Mo generates object motions that spatially adhere to the given trajectories while globally reflecting the text semantics. To robustly handle trajectory inputs with arbitrary configurations, ranging from dense to sparse and unevenly distributed, we further propose a shape-grounded trajectory embedding that maps an input trajectory set into a shape-aware token set covering the entire object. We conduct extensive comparisons against text-based baselines and cascaded video-based baselines that combine trajectory-guided video generation with video-to-dynamic mesh generation. Quantitative and qualitative evaluations, along with user studies, demonstrate that our approach produces motions that more faithfully follow the given prompts with higher expressiveness while preserving motion quality.
[CV-1] An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers
Link: https://arxiv.org/abs/2606.05149
Authors: Gandhimathi Padmanaban, Fred Feng
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 24 pages, 10 figures, venue TBD
Abstract:Vehicle body type is a significant determinant of cyclist injury severity in overtaking crashes, yet automated tools for classifying vehicles into injury-risk-relevant categories from naturalistic roadway video do not exist in the open literature. Standard object detection benchmarks provide only coarse vehicle labels (car, truck, bus, motorcycle), while existing fine-grained recognition systems are trained on controlled imagery and lack evaluation for deployment robustness across recording sites. This paper presents an open-source two-stage computer vision pipeline combining a pre-trained RT-DETR detector for coarse vehicle localization with a fine-tuned Vision Transformer (ViT-Base/16) for six-category body-type classification: passenger car, SUV, pickup truck, minivan, large van, and commercial truck. A confidence-based abstention mechanism withholds Stage 2 predictions when softmax output falls below 0.60, producing unknown labels rather than silent misclassifications. Evaluated on 3,805 annotated overtaking events from a bicycle-lane corridor in Ann Arbor, Michigan (in-distribution), the pipeline achieved 0.94 accuracy with per-class F1 scores from 0.91 (minivan) to 0.97 (SUV). On an independent out-of-distribution evaluation of 311 events from an open cycling dataset without retraining, accuracy was 0.89. Three of four well-represented categories maintained F1 at or above 0.90 under domain shift. The largest degradation was observed for minivan (F1 = 0.72), driven by abstention rate rising from 2.4% to 25.0% rather than active misclassification, consistent with the mechanism propagating genuine model uncertainty. The full pipeline, including inference scripts, training code, evaluation utilities, and model weights, is released as open-source software to support reproducibility and reuse across roadside video archives and cycling safety research.
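The confidence-based abstention step described in the abstract above can be sketched in a few lines. This assumes that "softmax output" means the maximum class probability and that the classifier exposes raw logits; the function `classify_with_abstention` and the example logits are hypothetical and do not come from the released pipeline. Only the six class names and the 0.60 threshold are taken from the abstract.

```python
import math

# Six body-type categories listed in the abstract.
CLASSES = ["passenger car", "SUV", "pickup truck", "minivan", "large van", "commercial truck"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_abstention(logits, threshold=0.60):
    """Return (label, confidence); label is "unknown" below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]  # abstain instead of guessing
    return CLASSES[best], probs[best]


confident = classify_with_abstention([5.0, 0.0, 0.0, 0.0, 0.0, 0.0])
uncertain = classify_with_abstention([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
```

Returning an explicit "unknown" label keeps low-confidence cases visible downstream, which matches the abstract's point that the minivan F1 drop under domain shift came from abstentions rather than misclassifications.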
[CV-2] GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes
Link: https://arxiv.org/abs/2606.05142
Authors: Josef Bengtson, Yaroslava Lochman, Fredrik Kahl
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL
Abstract:Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, BrushNet) and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.
[CV-3] Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting
Link: https://arxiv.org/abs/2606.05124
Authors: Hongyu Zhou, Zorah Lähner
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:After the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and can often reduce the appearance rendering quality. In this work, we show that 3DGS in its default form is inherently unsuited to represent texture and geometry at the same time, by training with complete ground-truth texture and geometry information. We also propose a simple solution by applying a single additional geometry opacity parameter to each splat, together with an optional transparency-curated optimization pipeline. Our experiments, both with ground-truth and vision foundation model geometric input, show that this change leads to improved rendering and geometry performance on a wide variety of datasets, and especially complex scenes with transparent objects benefit significantly from our method.
[CV-4] Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have
Link: https://arxiv.org/abs/2606.05107
Authors: Elouan Gardès, Seung Eun Yi, Kartik Ahuja, Théo Moutakanni, Huy V. Vo, Piotr Bojanowski, Wolfgang M. Pernice, Loïc Landrieu, Camille Couprie
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model’s generality and hurt robustness. We instead leverage metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds highly-specialized domain-specific state of the art, while using no task labels for backbone adaptation and only lightweight probes for supervision.
[CV-5] Identifying Gems from Roman RAPIDly
Link: https://arxiv.org/abs/2606.05103
Authors: Karan Gandhi, Ashish A. Mahabal, Jacob E. Jencson, Russ R. Laher, Ben Rusholme, Lin Yan, Ryan M. Lau, Schuyler D. Van Dyk, Mansi M. Kasliwal
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 15 pages, 10 figures, Submitted to the Publications of the Astronomical Society of the Pacific
Abstract:The Nancy Grace Roman Space Telescope (Roman), set for launch as early as September 2026, will conduct wide-field infrared imaging surveys with unprecedented spatial resolution and cadence, enabling the discovery of millions of astronomical transients. Hence, it is necessary to have automated pipelines for generating alerts in place so that the telescope can begin discovering reliable transients and variable objects soon after it is launched. However, no real Roman data currently exist, making the development of such pipelines difficult. In this work, we present a machine learning model RuBR and a general methodology for distinguishing genuine transient and variable detections from spurious (bogus) detections within the RAPID pipeline. In particular, we present three models using this methodology: RuBR_comb trained and tested on combined locally injected and OpenUniverse2024 transients, RuBR_loc trained on locally injected transients and tested on OpenUniverse2024 transients, and RuBR_DA that combines locally injected transients with a fraction of OpenUniverse2024 transients in domain-adaptation mode for training. This paves the way for strategies to adapt the RuBR_comb model to real observations in the absence of any ground-truth labels during the early phases of the Roman mission. While the image differencing pipeline continues to be improved, our experimental results demonstrate the effectiveness of the proposed approach and its promise for robust real-bogus classification in the Roman era.
[CV-6] ZipSplat: Fewer Gaussians Better Splats
Link: https://arxiv.org/abs/2606.05102
Authors: Alexander Veicht, Sunghwan Hong, Dániel Baráth, Marc Pollefeys
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured object thus produce equally many Gaussians despite very different geometric needs. We propose ZipSplat, a token-based feed-forward model that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, and k-means clustering compresses them into a compact set of scene tokens. Cross- and self-attention refine these tokens, and a lightweight MLP decodes each into a group of Gaussians with unconstrained 3D positions. Because clustering is applied at inference, a single trained model spans the quality-efficiency curve without retraining. ZipSplat operates without ground-truth poses or intrinsics, yet sets a new state of the art on DL3DV and RealEstate10K with $\sim 6\times$ fewer Gaussians than pixel-aligned methods, surpassing the best pose-free baseline by 2.1 dB and 1.2 dB PSNR, respectively. It further generalizes zero-shot to Mip-NeRF360 and ScanNet++, outperforming all comparable baselines. Our project page is at this https URL.
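The token-compression step in the abstract above (dense visual tokens clustered by k-means into fewer scene tokens) can be illustrated with a toy sketch. It is plain Lloyd's algorithm on tiny 2-D vectors with a deterministic initialization, chosen only for readability. The function `kmeans_compress` is hypothetical, and the attention refinement and MLP decoder from the abstract are not shown.

```python
def kmeans_compress(tokens, k, iters=10):
    """Compress a list of token vectors into k centroids with Lloyd's algorithm.

    Initialization uses the first k tokens (an illustrative simplification;
    practical implementations usually use k-means++ or random restarts).
    """
    centroids = [list(t) for t in tokens[:k]]
    for _ in range(iters):
        # Assignment step: attach each token to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for t in tokens:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(t, centroids[c])),
            )
            clusters[nearest].append(t)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


dense_tokens = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
scene_tokens = kmeans_compress(dense_tokens, k=2)
```

Because `k` is only an argument at call time, the same inputs can be compressed to different budgets without retraining anything, which is the property the abstract highlights for inference-time clustering.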
[CV-7] InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space CVPR
Link: https://arxiv.org/abs/2606.05071
Authors: Jiarui Wu, Yujin Wang, Ruikang Li, Fan Zhang, Mingde Yao, Tianfan Xue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Computer Vision and Pattern Recognition (CVPR), 2026
Abstract:Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recent diffusion-based retouching shows superior visual quality, but often struggles with both fidelity, due to its generative nature, and efficiency, because of its iterative sampling process. In this work, we propose an efficient and fidelity-preserving retouching method using bilateral space manipulation, which is both compact and content-decoupled. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which are sliced using a learned guidance map and then applied to the full-resolution image. This approach yields both high fidelity and improved efficiency. To retain strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss to guide instruction-following behavior. Additionally, we introduce a new benchmark and evaluate our method across multiple dimensions: fidelity, instruction following, and efficiency. Compared with recent retouching methods such as Gemini-2.5-Flash (Nano-Banana), our method avoids content drift, significantly reduces latency, and generates visually pleasing edits while maintaining a high level of fidelity. Project page: this https URL.
[CV-8] MaCo-GAN: Manifold-Contrastive Adversarial Learning for Single Image Super-Resolution
Link: https://arxiv.org/abs/2606.05068
Authors: Daeyoung Han, Seongmin Hwang, Moongu Jeon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Conventional Generative Adversarial Networks (GANs) for Single Image Super-Resolution (SISR) often struggle with hallucinated artifacts, largely because standard discriminators evaluate overall image naturalness rather than strict conditional realism. To address this, we propose MaCo-GAN, a novel manifold-contrastive GAN framework that replaces the conventional adversarial loss with a supervised contrastive objective. A core component of our method is a dynamic fake sample synthesizer that transforms ground truth (GT) data into a spectrum of challenging, perceptually plausible fake images that strictly maintain low-resolution (LR) correspondence. Utilizing these synthesized samples, we establish a robust contrastive minimax game: the generator is trained to attract its predictions toward on-manifold fakes (low distortion) and repel them from off-manifold fakes (high distortion), while the discriminator optimizes the exact opposite. By simply replacing the adversarial loss of a baseline SR model with our proposed objective, we demonstrate consistent improvements in the perception-distortion trade-off across various benchmarks. Extensive ablation studies validate the effectiveness of our framework and provide deep insights into the dynamics of this conditional contrastive game.
[CV-9] UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD
Link: https://arxiv.org/abs/2606.05058
Authors: Jingyuan Chen, Sheng Jin, Haopeng Sun, Wentao Liu, Chen Qian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.
[CV-10] Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping
Link: https://arxiv.org/abs/2606.05035
Authors: Peilin Tao, Chong Cheng, Yuansen Du, Caiwei Song, Zhengqing Chen, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Hainan Cui, Shuhan Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Long-horizon online visual mapping is a core capability for robot perception, requiring continuous camera-motion and scene-geometry estimation from visual streams under bounded memory and computation. Recent feed-forward 3D reconstruction models provide strong geometric priors, but their streaming variants often predict poses in a fixed coordinate system tied to the first frame or a persistent scene memory. This fixed-gauge design leads to train–test mismatch, attention bias toward early anchors, and accumulated drift on sequences much longer than those seen during training. We propose Anchor3R, a streaming 3D reconstruction framework that treats feed-forward reconstruction as current-centric local measurement prediction rather than persistent global-gauge regression. At each time step, Anchor3R predicts window-relative poses and a local pointmap in the current-frame coordinate system, turning streaming reconstruction into relative-pose measurement generation. These measurements support online pose updates, while loop-closure reinsertion and motion averaging align the trajectory and transform local pointmaps into a coherent global reconstruction. Experiments on indoor, outdoor, driving, and RGB-D benchmarks show that Anchor3R improves long-horizon pose accuracy and dense reconstruction quality over existing streaming baselines, while supporting bounded-memory online inference.
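To make the idea of relative-pose measurements concrete, the following sketch chains frame-to-frame motions into a global trajectory. It is a deliberately simplified 2-D (SE(2)) illustration, not the paper's method. It assumes each measurement is the next pose expressed in the previous frame, and it omits the loop-closure reinsertion and motion averaging the abstract mentions. The functions `compose` and `integrate` are hypothetical.

```python
import math


def compose(pose, delta):
    """Apply relative motion `delta` (expressed in the frame of `pose`) to `pose`.

    Poses are (x, y, heading) tuples; heading is in radians.
    """
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (
        x + math.cos(theta) * dx - math.sin(theta) * dy,
        y + math.sin(theta) * dx + math.cos(theta) * dy,
        theta + dtheta,
    )


def integrate(relative_poses, start=(0.0, 0.0, 0.0)):
    """Chain relative measurements into a list of global poses."""
    trajectory = [start]
    for delta in relative_poses:
        trajectory.append(compose(trajectory[-1], delta))
    return trajectory


# Drive a unit square: move 1 unit forward, then turn left 90 degrees, four times.
square = integrate([(1.0, 0.0, math.pi / 2)] * 4)
```

In this noise-free example the last pose returns to the origin. With noisy measurements the chained estimate drifts, which is the kind of error that loop closure and motion averaging are meant to correct.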
[CV-11] MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation
Link: https://arxiv.org/abs/2606.05031
Authors: Dewei Zhou,Xinyu Huang,Xun Wang,Ji Xie,Yabo Zhang,Liang Li,Kunchang Li,Zongxin Yang,Yi Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generative visual models fundamentally struggle with precise spatial control. This arises from a core disconnect: models can process textual descriptions of space but cannot directly map numerical coordinates onto the 2D image canvas. We introduce MetaPoint, a method that bridges this gap by representing a continuous 2D coordinate as a single, special token. Crucially, MetaPoint requires no new architectural components; it directly leverages the model’s inherent positional encoding schemes to interpret these coordinates, treating our token as a virtual point on the canvas. This lightweight approach enables pixel-level control of an object’s position with one token or its bounding box with two, all without requiring architectural changes or bespoke attention masking. The MetaPoint tokens are designed to be compositional, serving as spatial primitives. This allows a planner agent to decompose a high-level user request into a structured sequence of primitives for the generator. By providing a simple, precise, and scalable building block for spatial control, MetaPoint unlocks more powerful compositional generative agents and enables intuitive, interactive editing systems.
[CV-12] Handwriting Extraction and Analysis of Signature Lists in Swiss Popular Initiatives CCS
Link: https://arxiv.org/abs/2606.05018
Authors: Marco Peer,Thomas Gorges,Mathias Seuret,Vincent Christlein,Andreas Fischer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at ICCST 2026
Abstract:Popular initiatives and referendums are central to Swiss democracy, yet the validation of handwritten signature lists remains a labor-intensive manual process. This paper investigates the potential of automated document analysis methods, including OCR and AI-based handwriting analysis, to support this task. We propose a pipeline combining template-based line segmentation with text recognition and writer retrieval techniques, evaluated on a dataset of 443 handwritten entries from 418 writers. Results show that OCR struggles with out-of-vocabulary handwriting, with a CER of 29.6% for first names. In contrast, writer retrieval performs more robustly, reaching an mAP of 50.6%. Furthermore, our experiments indicate that off-the-shelf OCR systems are not sufficiently reliable for transcription of handwritten signature data, particularly for short, out-of-vocabulary entries such as names or addresses. However, writer retrieval methods can effectively identify visually similar entries across signature lists, making them a suitable tool for supporting the detection of potential duplicate submissions based on handwriting similarity.
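The character error rate (CER) quoted in the abstract above is conventionally the edit distance between the recognized text and the reference, divided by the reference length. A minimal sketch follows, assuming the rate is aggregated over all entries (the abstract does not say whether the authors aggregate or average per entry); the function names and sample strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length (assumed aggregation)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Hypothetical first names and OCR outputs, not data from the paper.
cer = character_error_rate(["Luca", "Noah"], ["Luka", "Noa"])
print(f"CER = {cer:.2%}")
```

Short entries such as names make the rate volatile: a single wrong character in a four-letter name already costs 25%.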
[CV-13] CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation
Link: https://arxiv.org/abs/2606.05011
Authors: Yurim Jeon,Dongseong Seo,Seung-Woo Seo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 16 pages, 5 figures
Abstract:Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at this https URL.
[CV-14] Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning
Link: https://arxiv.org/abs/2606.04986
Authors: Yu Zhu,Yongkang Li,Wenjie Zhu,Haoyi Jiang,Wenyu Liu,Wei Yang,Bin Li,Xinggang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
[CV-15] Plan Watch Recover: A Benchmark and Architectures for Proactive Procedural Assistance
Link: https://arxiv.org/abs/2606.04970
Authors: Kaustav Kundu,Ritvik Shrivastava,Maxim Arap,Nanshu Wang,Xianhui Zhu,Quintin Fettes,Gautam Tiwari,Parth Suresh,Théo Moutakanni,Alejandro Castillejo Munoz,Allen Bolourchi,Pascale Fung,Pinar Donmez,Babak Damavandi,Anuj Kumar,Seungwhan Moon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 53 pages, 14 figures
Abstract:We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt, and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner–interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight baselines (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.
[CV-16] Scene-Centric Unsupervised Video Panoptic Segmentation CVPR2026
Link: https://arxiv.org/abs/2606.04925
Authors: Christoph Reich,Oliver Hahn,Nikita Araslanov,Laura Leal-Taixé,Christian Rupprecht,Daniel Cremers,Stefan Roth
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026. Oliver Hahn and Christoph Reich - both authors contributed equally. Code: this https URL Project page: this https URL
Abstract:Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding works mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.
[CV-17] Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models
Link: https://arxiv.org/abs/2606.04922
Authors: Tran Dinh Tien,Zhiqiang Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. Code is available at this https URL
Abstract:Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at this https URL.
[CV-18] Toward Multi-Domain and Long-Tailed Quantization via Feature Alignment and Scaling
Link: https://arxiv.org/abs/2606.04920
Authors: Chin-Yuan Yeh,Ting-An Chen,De-Nian Yang,Ming-Syan Chen
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Quantizing deep neural networks is essential for efficient inference on resource-constrained devices. However, most existing methods are designed for single-domain and class-balanced data, leaving practical settings with domain shifts or severe class imbalance underexplored. We address these challenges with Efficient Multi-Domain Alignment Quantization (EmaQ), which aligns domain distributions through a CDF-based projection and uses sensitivity-aware weight aggregation to stabilize multi-domain quantization. We further extend EmaQ to EmaQ-LT for long-tailed quantization by introducing class-conditioned variance scaling and confidence-based logit adjustment to mitigate majority-class overconfidence. Theoretical analyses establish convergence guarantees and motivate the proposed sensitivity and scaling mechanisms. Experiments on standard, multi-domain (Office-31, Digits), and long-tailed (SynDigits-LT, CIFAR-10-LT, CIFAR-100-LT) benchmarks show that EmaQ and EmaQ-LT achieve strong low-bit performance under domain shift and class imbalance.
[CV-19] CDPM-Align: Multi-Scale Guidance-Aligned Diffusion Pretraining for Robust Few-Shot Anatomical Landmark Detection MICCAI2026
Link: https://arxiv.org/abs/2606.04898
Authors: Roberto Di Via,Irina Voiculescu,Francesca Odone,Vito Paolo Pastore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted MICCAI 2026
Abstract:Anatomical landmark detection is a fundamental task in medical image analysis supporting a wide range of diagnostic and interventional workflows. Although recent methods have achieved sub-millimetric localisation, accuracy alone is not sufficient for clinical deployment, requiring reliability and robustness in prediction. Despite its clinical relevance, the impact of representation learning in this context is still underexplored. In this work, we introduce CDPM-align, a multi-scale guidance-aligned conditional diffusion pre-training for anatomical landmark detection. Our experimental setup focuses on a few images and a few annotation regimes. Specifically, we employ three popular heterogeneous small-scale benchmark datasets for representation learning via conditional generative pre-training. Furthermore, we consider low-annotation scenarios for the downstream task of landmark detection, with 10 and 25 annotated images, reflecting realistic trade-offs between clinical effort and resource constraints for annotations. Our results confirm that generative pre-training enables the model to learn a robust representation. This improves both accuracy and uncertainty on the downstream tasks, advancing towards safe and efficient clinical deployment.
[CV-20] Hierarchical Space Partition for Surface Reconstruction
Link: https://arxiv.org/abs/2606.04891
Authors: Minjie Tang,Xiangfei Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
Comments: Published in 2026 International Conference on 3D Vision (3DV)
Abstract:Generating compact polygonal models from point clouds is a key problem in 3D vision and computer graphics. However, due to inherent limitations of LiDAR scanning (e.g. range constraints and occlusions), critical scene information is often missing, leading to degraded reconstruction accuracy. To address this, we propose a plane assembling strategy that effectively recovers missing details while maintaining model compactness. We classify all the planes extracted from the scene into three categories: highly visible, barely visible, and invisible. The invisible planes, which are recovered by scene structure analysis, indicate the missing details. The three types of planes correspond to the three growth priorities. Each plane grows according to the priority level, and the space is partitioned progressively, namely, the hierarchical partition. Subsequently, we generate a watertight polygonal mesh from the partition via a min-cut-based optimization. Finally, comparisons on public datasets show the effectiveness and superiority of our method against mainstream approaches. The project page is available at this https URL.
[CV-21] HD-DinoMoE: A Class-Aware Hierarchical Dual Mixture-of-Experts Network for Scleral Anomaly Segmentation in Complex Acquisition Scenarios
Link: https://arxiv.org/abs/2606.04888
Authors: Yinxiang Yu,Maoxiang Chu,Qi Niu,Guanghu Liu,Wei Xu,Haotian Wang,Zhi Chen,Yutian Zhu,Yuelong Fan,Guanghao Liao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Medical Image Analysis; 47 pages, 31 figures, 14 tables
Abstract:Traditional Chinese Medicine (TCM) ocular inspection provides empirical cues for assessing scleral surface anomalies, but its clinical use remains subjective and difficult to quantify. To support intelligent and quantifiable ocular inspection, this study presents the TCM-inspired Artificial Intelligence Ocular Auxiliary Diagnosis System (TAO) and focuses on pixel-level scleral surface anomaly segmentation. For clinical and user-acquired images affected by multi-source distributional discrepancies, diverse anomaly morphologies, and scleral specular reflection (SSR), we propose HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network. HD-DinoMoE combines class-aware dual-stream DINOv3 feature fusion with class-specific multi-expert decoding to segment Vessels, Yellow and Black Spots, and Blood Spots. A three-stage backbone-frozen routing strategy stabilizes dual-backbone adaptation; Progressive Confidence Penalty (PCP) Loss reduces high-confidence false positives and segmentation leakage in SSR regions; and Class-Aware Adaptive Sample Weighting (CA-ASW) balances sample- and class-level training contributions. We further construct the Multi-label Scleral Anomaly Segmentation Dataset (ML-SASD), a new benchmark with Clinical, Wild, and Mix settings and pixel-wise annotations for three anomaly categories. On ML-SASD-Mix, HD-DinoMoE achieves a mean Dice of 72.11% and a mean Intersection-over-Union of 58.44%, while maintaining favorable boundary localization and specular-region false-positive control. It also shows competitive generalization on the Vessels subset of the public SBVPI dataset. These results indicate that HD-DinoMoE provides a feasible segmentation solution for TAO under complex acquisition scenarios. The code and data access information are available at this https URL.
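The Dice and Intersection-over-Union figures reported above are standard overlap metrics. A minimal sketch of how they are computed for one binary mask follows; it assumes flattened 0/1 masks and treats two empty masks as a perfect match, and it is not the authors' evaluation code.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flattened binary (0/1) masks."""
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


# Toy masks: one pixel overlaps, each mask has two foreground pixels.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice, iou)
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU). Scores averaged over classes or images, such as the means quoted in the abstract, need not satisfy this identity exactly.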
[CV-22] DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance
Link: https://arxiv.org/abs/2606.04881
Authors: Yueying Zou,Peipei Li,Qianrui Teng,Dianyan Xu,Zekun Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 10 figures, 5 tables
Abstract:Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.
[CV-23] MAOAM: Unified Object and Material Selection with Vision-Language Models SIGGRAPH2026
Link: https://arxiv.org/abs/2606.04880
Authors: Jaden Park,Valentin Deschaintre,Jason Kuen,Kangning Liu,Iliyan Georgiev,Krishna Kumar Singh,Yong Jae Lee,Michael Fischer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SIGGRAPH 2026 Conference. Project page: this https URL
Abstract:Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should support selecting not only objects but also other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language-model (VLM) based selection methods are object-centric and typically support a single interaction modality, limiting their applicability. In this work, we thus present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user’s selection intent (object or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual-semantics. We train MAOAM with a multi-task objective over click and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts, our model exhibits an emergent improvement in selection when combining text and clicks at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.
[CV-24] Recent Advances and Trends in Learning-based 3D Representations
Link: https://arxiv.org/abs/2606.04871
Authors: Adrien Schockaert,Hamid Laga,Hazem Wannous,Vincent Magnier,Guillaume Dufaye,Jean-françois Witz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The selection of an appropriate 3D representation is a fundamental design decision that dictates the efficiency, quality, and capabilities of modern computer vision and graphics pipelines for tasks such as 3D reconstruction, novel-view synthesis and rendering, shape and motion analysis, recognition, and generation. While traditional representations (e.g., meshes, point clouds, and volumetric grids) remain standard outputs of 3D sensors (e.g., LiDAR and 3D scanners) and are widely used in downstream applications (e.g., editing and simulation), recent neural and primitive-based representations (e.g., 3D Gaussian Splatting) offer compact and differentiable alternatives opening a wide range of opportunities in applications such as games, AR/VR, autonomous driving, robot navigation, and medical imaging, to name a few. The goal of this paper is to survey the main families of 3D representations from discrete explicit formats to continuous implicit fields based either on neural rendering or primitive splatting. For each type of representation, we present the general formulation and its variants, discuss its benefits and limitations, and highlight key applications. We conclude the paper by outlining the open challenges and potential directions for future research. Distinct from recent surveys that broadly cover 3D object and scene reconstruction, this paper provides a focused analysis on the evolution of 3D representations themselves. We specifically emphasize the paradigm shift toward implicit representations, offering a novel perspective on how these emerging formats fundamentally alter 3D/4D workflows.
[CV-25] IRIS-GAN: Staged Specialist Detection of Deepfake Faces
Link: https://arxiv.org/abs/2606.04863
Authors: Jaume M. Trenchs,Veronica Sanz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 10 figures
Abstract:We introduce IRIS-GAN, a specialist forensic detector for synthetic face images under cross-generator shift. Rather than addressing universal synthetic-image detection, we focus on faces generated by generative adversarial networks (GANs), which are state-of-the-art in deepfake content, and train the detector through staged exposure to increasingly demanding GAN families while retaining earlier generators. The final model reaches fake-detection rates above 99% across the GAN families considered and classifies an external real-face dataset with 98.9% accuracy. Grad-CAM analysis further reveals measurable generator-dependent spatial response patterns, which remain informative for a secondary heatmap-only classifier. Out-of-family tests on diffusion-generated faces confirm that IRIS-GAN is a specialist detector, with some capability to reach non-GAN deepfakes. These results establish staged training as an effective strategy for robust GAN-face forensics.
[CV-26] Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification
Link: https://arxiv.org/abs/2606.04844
Authors: Tu Vo,Sheir Zaheer,Chan Y. Park
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, with no labelled audio. This matching breaks down under acoustic noise, where accuracy and mAP fall by 12-30 percentage points at 0 dB SNR on standard benchmarks. We propose Drift Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class’s noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the metric on every test condition: by +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K.
[CV-27] 3D Temporal Analysis for Autism Spectrum Disorder Screening During Attention Tasks
Link: https://arxiv.org/abs/2606.04836
Authors: Inam Qadir,Elizabeth B Varghese,Dena Al-Thani,Marwa Qaraqe
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate Autism Spectrum Disorder (ASD) screening for school-age children is crucial to identify cases that may have been missed earlier and to enable timely interventions supporting social, cognitive, and academic development. Current ASD screening relies on subjective assessments and 2D analysis methods that fail to capture spatial displacement patterns characteristic of ASD behaviors. In this study, a novel 3D temporal analysis framework is presented, built on top of DECA (Detailed Expression Capture and Animation), a 3D modeling framework, to extract comprehensive head pose parameters (including translational components T_x, T_y, T_z) and facial expressions independent of pose variations. LSTM and GRU-based temporal classifiers were trained on the extracted 3D features from video data collected from 39 participants (19 ASD, 20 TD) aged 7-12 years during Virtual Reality-Continuous Performance Test tasks. The GRU-based models demonstrated superior performance, with 3D head pose features achieving 83.9% accuracy and 3D facial features reaching 81.4% accuracy, outperforming 2D baseline approaches by 10.7% and 7.5%, respectively. Furthermore, multimodal fusion of 3D head pose and facial features with PCA-based dimensionality reduction achieved the highest accuracy of 84.6%, outperforming unimodal approaches. This work establishes a foundation for objective, automated screening tools addressing current diagnostic limitations in ASD identification for school-age populations.
[CV-28] OA-CutMix: Correcting the Label Bias of CutMix
Link: https://arxiv.org/abs/2606.04820
Authors: Tobias Christian Nauen,Stanislav Frolov,Federico Raue,Brian B. Moser,Andreas Dengel
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: the area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy of the CutMix label and the semantic object area is 21.5%. In 17% of samples an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods, but at a fraction of the training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods modifying the image mixing algorithm.
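The difference between area-based and object-aware label weights can be illustrated with a small sketch. It assumes binary object masks for both images, a rectangular patch pasted from image B into image A, weights normalized to sum to 1, and a fallback to the area-based weights when no object pixel is visible; these details and all names are illustrative guesses, not the authors' implementation.

```python
def cutmix_label_weights(mask_a, mask_b, box):
    """Return (area_weights, object_weights) as (weight_A, weight_B) pairs.

    mask_a, mask_b: 2D lists of 0/1 object masks with identical shape (hypothetical inputs).
    box: (top, left, bottom, right), half-open; this region of B is pasted onto A.
    """
    height, width = len(mask_a), len(mask_a[0])
    top, left, bottom, right = box

    # Standard CutMix: B's label weight is the patch's share of the image area.
    lam_b = (bottom - top) * (right - left) / (height * width)
    area_weights = (1.0 - lam_b, lam_b)

    # Object-aware variant: count object pixels that stay visible after mixing.
    visible_a = 0  # object pixels of A outside the patch
    visible_b = 0  # object pixels of B inside the patch
    for r in range(height):
        for c in range(width):
            if top <= r < bottom and left <= c < right:
                visible_b += mask_b[r][c]
            else:
                visible_a += mask_a[r][c]

    total = visible_a + visible_b
    if total == 0:
        # Assumption: fall back to area-based weights when nothing is visible.
        return area_weights, area_weights
    return area_weights, (visible_a / total, visible_b / total)


# Both objects sit in the top-left corner; the patch from B covers the bottom-right.
corner = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
area, obj = cutmix_label_weights(corner, corner, (2, 2, 4, 4))
print(area, obj)
```

In this toy case the pasted patch contains only background from B, so the area rule still gives B a quarter of the label while the object-aware rule gives it none, which is the kind of bias the abstract describes.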
[CV-29] Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
Link: https://arxiv.org/abs/2606.04811
Authors: Rui Zhao,Kaiming Yang,Jifeng Zhu,Siyang Chen,Ziqi Wang,Weijia Wu,Kevin Qinghong Lin,Heng Wang,Mike Zheng Shou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at this https URL.
[CV-30] NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning
Link: https://arxiv.org/abs/2606.04806
Authors: Sichao Li,Sai Ma,Daniel Kilov,Secil Yanik Guyot,Zhuang Li,Seth Lazar
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.
[CV-31] Fast Cubical Persistent Homology on 2D and 3D Images via Union-Find Pruning and Lookup Tables
Link: https://arxiv.org/abs/2606.04801
Authors: Titouan Le Breton,Karol Szustakowski,Marie Piraud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present Flash Cubical, a highly efficient computation of cubical persistence on a V-filtration for 2D and 3D images over $\mathbb{F}_2$. The implementation is built around three core ideas. First, cubical complexes satisfy properties that allow the persistence of the highest dimension to be computed via union-find and duality. Second, pruning certain edges allows for a fast and efficient implementation of union-find. Third, a lookup table exploits the regularity of cubical complexes to pre-compute local information, which avoids the need to compute it at run time. To the best of our knowledge, this is the most efficient implementation of cubical persistence with a V-filtration, both in terms of time and memory costs. Although the paper focuses on persistence for V-filtration cubical complexes, the underlying ideas generalise naturally to T-filtrations on cubical complexes and suggest promising directions for other complexes.
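As background for the union-find idea mentioned in the abstract above, the following sketch computes 0-dimensional sublevel-set persistence of a 2D image with the textbook union-find and elder rule. It assumes a V-construction with 4-connectivity (pixels are vertices, each edge enters at the larger of its two pixel values) and covers none of the paper's edge pruning, lookup tables, or duality; the function name `h0_persistence` is hypothetical.

```python
def h0_persistence(image):
    """0-dim sublevel-set persistence pairs of a 2D image (list of rows)."""
    h, w = len(image), len(image[0])
    value = [image[r][c] for r in range(h) for c in range(w)]
    parent = list(range(h * w))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each edge between 4-neighbours appears at the max of its endpoints.
    edges = []
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:
                edges.append((max(value[i], value[i + 1]), i, i + 1))
            if r + 1 < h:
                edges.append((max(value[i], value[i + w]), i, i + w))
    edges.sort()

    pairs = []
    for t, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # edge closes a cycle, no component dies
        # Elder rule: each root is the minimum of its component;
        # the component with the larger birth value dies at time t.
        if value[ri] > value[rj]:
            ri, rj = rj, ri
        if value[rj] < t:  # skip zero-persistence pairs
            pairs.append((value[rj], t))
        parent[rj] = ri
    # The component of the global minimum never dies.
    pairs.append((min(value), float("inf")))
    return sorted(pairs)
```

On the single-row image `[[0, 3, 1, 4, 2]]`, the local minima at 1 and 2 merge into the component of the global minimum at heights 3 and 4 respectively, which gives the two finite pairs plus one essential class.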
[CV-32] Crafting Your Evolving Dreams: Concept-Incremental Versatile Customization
Link: https://arxiv.org/abs/2606.04797
Authors: Jiahua Dong,Wenqi Liang,Hongliu Li,Yang Cong,Duzhen Zhang,Hanbin Zhao,Henghui Ding,Yulun Zhang,Salman Khan,Fahad Shahbaz Khan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Abstract:Custom diffusion models (CDMs) have garnered significant interest owing to their remarkable capacity for generating personalized concepts. However, the majority of CDMs unrealistically presume that the user’s collection of personalized concepts is static and incapable of incremental growth over time. Furthermore, they exhibit significant catastrophic forgetting and concept neglect of previously learned concepts when incrementally learning a sequence of new ones. To resolve the above challenges, we develop a novel Continually Customizable Diffusion Model (CCDM), enabling users to perform concept-incremental versatile customization. Specifically, we design an attribute-decoupled LoRA (AD-LoRA) module and a relevance-guided AD-LoRA aggregation strategy to mitigate catastrophic forgetting. They can preserve concept-specific attributes of each task and leverage beneficial inter-task correlations to enhance the continual learning of new customization tasks. Additionally, to address the challenge of concept neglect, we propose a controllable regional context synthesis strategy that performs multi-concept composition in alignment with user-provided conditions. This strategy enhances the overall consistency in multi-concept synthesis by guaranteeing semantic independence between user-defined regions and their smooth boundary transitions. Experiments show our CCDM exhibits significant improvements over baseline methods.
[CV-33] A Pathology Foundation Model for Gastric Cancer with Real-World Validation
Link: https://arxiv.org/abs/2606.04792
Authors: Ling Liang,Jiabo Ma,Zhengyu Zhang,Fengtao Zhou,Yingxue Xu,Yihui Wang,Cheng Jin,Zhengrui Guo,On Ki Tang,Zhijian Cen,Zhen Wang,Qi Xie,Chengyu Lu,Chenglong Zhao,Feifei Wang,Yu Cai,Hongyi Wang,Jing Zhang,Yaping Ye,Shijun Sun,Shenglei Li,Yu Wang,Zhenhui Li,Ronald Cheong Kin Chan,Xiuming Zhang,Zhe Wang,Hao Chen,Li Liang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Gastric cancer remains a major cause of cancer mortality, yet its histological and molecular heterogeneity complicates diagnosis and risk stratification. General-purpose pathology foundation models (PFMs) often plateau on fine-grained endpoints central to gastric cancer care, and few have undergone rigorous prospective validation or clinical reader studies. We present GRACE, a Gastric-specific foundation model for Real-world Assessment and Clinical dEcision support. GRACE was developed from multicenter gastric pathology datasets totaling 48,364 primarily HE-stained whole-slide images from 37,493 patients. When evaluated on 28 clinically relevant tasks, GRACE consistently outperformed representative pancancer PFMs, achieving a macro-AUC of 0.9188, with strong performance for precancerous lesion diagnosis (macro-AUC 0.9322), tumor histopathological assessment (macro-AUC 0.9119), molecular profiling (macro-AUC 0.8682), and prognostic prediction. Beyond benchmarking, GRACE’s translational value was substantiated through a rigorous evidence chain. Under safety-gated criteria requiring 100% NPV for rule-out and 100% PPV for rule-in, GRACE streamlined review for up to 69.6% of malignancy-diagnosis cases and triaged 46.8% of MMR-IHC follow-up requests. This translational feasibility was further strengthened by a randomized crossover reader study of pathologist-AI collaboration. With GRACE assistance, diagnostic accuracy improved from 82.0% to 89.9%, yielding nearly twofold higher adjusted odds of a correct diagnosis (OR 1.987) alongside concurrent gains in sensitivity and specificity. AI assistance also reduced diagnostic time by 14.9%, elevated diagnostic confidence by 9.0%, and markedly improved inter-rater agreement. When calibrated to maintain non-inferior performance to senior pathologists, the AI-assisted workflow could triage 60.7% of atrophy and 82.7% of intestinal metaplasia cases.
[CV-34] Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives
Link: https://arxiv.org/abs/2606.04788
Authors: Ayumi Umemura,Toshinori Kuwahara,Marc Pollefeys,Daniel Barath
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Visual localization – estimating a camera pose within a pre-existing map – is a fundamental problem in computer vision. Floorplans are an attractive map representation: they are readily available for most buildings, compact, and inherently invariant to visual appearance changes. However, bridging the severe domain gap between camera observations and floorplan geometry remains challenging. Existing methods address this gap through data-driven learning, yet they require large-scale training data and environment-specific retraining, limiting their practical deployment. We propose a zero-shot floorplan localization method that generalizes to novel environments without any retraining. Our key insight is that dominant geometric primitives – lines and circles – are ubiquitous in human-made environments and provide appearance-invariant structural constraints. We extract these primitives from a bird’s-eye-view (BEV) projection of monocular 3D reconstructions and match them to the floorplan via dedicated minimal solvers within a robust estimation framework. Experiments on both simulated and real-world datasets show that our approach outperforms state-of-the-art learning-based methods on unseen environments, while using a single fixed set of hyperparameters across all experiments. The source code will be made publicly available. 
[CV-35] Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control
Link: https://arxiv.org/abs/2606.04775
Authors: Jihoon Hong,Alice Chan,Qiyue Dai,Julian Skifstad,Glen Chou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:
Abstract:Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.
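The closed-loop feedback idea in the abstract above rests on the standard linear-quadratic regulator. The sketch below shows the finite-horizon backward Riccati recursion for the simplest possible case, a one-dimensional latent error `e[t+1] = a*e[t] + b*u[t]` with stage cost `q*e^2 + r*u^2`. It is a generic textbook LQR under these stated assumptions, not LA-LQR itself: the scalar latent, the symbols `a`, `b`, `q`, `r`, and both function names are illustrative choices.

```python
def scalar_lqr_gains(a, b, q, r, horizon):
    """Time-varying feedback gains for e[t+1] = a*e[t] + b*u[t].

    Minimises sum of q*e^2 + r*u^2 over the horizon with terminal cost q*e^2.
    Returns gains such that u[t] = -gains[t] * e[t].
    """
    p = q  # terminal cost-to-go weight
    gains = []
    for _ in range(horizon):
        k = (b * p * a) / (r + b * b * p)   # optimal gain for this step
        p = q + a * a * p - a * p * b * k   # Riccati update of cost-to-go
        gains.append(k)
    gains.reverse()  # recursion runs backward in time
    return gains


def rollout(a, b, gains, e0):
    """Apply the feedback law and return the error trajectory."""
    e, errors = e0, [e0]
    for k in gains:
        u = -k * e           # steer toward the setpoint (error zero)
        e = a * e + b * u
        errors.append(e)
    return errors
```

A larger control penalty `r` yields smaller gains, which is the mechanism by which an LQR formulation trades setpoint tracking against the size of the intervention.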
[CV-36] Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction
Link: https://arxiv.org/abs/2606.04772
Authors: Hoang-Son Vo,Van-Hung Bui,Minh-Huy Mai-Duc,Tien-Dung Mai,Soo-Hyung Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Our architecture leverages a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches, motivated by the functional organization of the visual cortex. A coarse-to-fine strategy is employed: Stage 1 predicts denoised ROI-level activations, while Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) demonstrate that our method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically locked to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas – a correspondence that holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting the model captures a shared, subject-agnostic visual representation.
[CV-37] Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms
Link: https://arxiv.org/abs/2606.04767
Authors: Chong Zhang,Xiang Li,Jia Wang,Qiufeng Wang,Xiaobo Jin
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 35 pages, 1 figure
Abstract:The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric based on the spectral norm of the Fisher Information Matrix (FIM), which quantifies the worst-case sensitivity of the model’s output distribution to input perturbations. Theoretically, we establish that the FIM equals the variance of the input Jacobian and derive closed-form spectral bounds for common architectures, including VGG, ResNet, DenseNet, and Transformer, providing the first theoretical robustness ranking. To enable scalable evaluation, we develop efficient algorithms, including power iteration and Hutchinson-based estimation, that support both white-box and black-box settings. Extensive experiments across multiple datasets, including CIFAR, ImageNet, and medical images, and across multiple architectures show a strong correlation between our metric and adversarial vulnerability. Our framework serves as an interpretable diagnostic tool that complements attack-based evaluations, offering insights into architectural sensitivity and guiding the design of more robust models. Code is available at: this https URL.
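To make the power-iteration estimate mentioned in the abstract above concrete, here is a small sketch for a softmax-linear classifier `p(y|x) = softmax(W x)`. It assumes the usual input-space Fisher information `F = sum_y p_y g_y g_y^T` with `g_y = grad_x log p(y|x) = W_y - sum_k p_k W_k`; whether this matches the paper's exact definition is an assumption, and the function names are hypothetical. A deterministic start vector is used for reproducibility, whereas a random start is the usual choice.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def fim_spectral_norm(W, x, iters=200):
    """Largest eigenvalue of the input-space FIM of softmax(W x)."""
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
    p = softmax(logits)
    d = len(x)
    mean_row = [sum(p[k] * W[k][j] for k in range(len(W))) for j in range(d)]
    # Score vectors g_y = W_y - E_p[W] for every class y.
    grads = [[W[y][j] - mean_row[j] for j in range(d)] for y in range(len(W))]

    def fim_times(v):
        # Matrix-free product F v = sum_y p_y * (g_y . v) * g_y.
        out = [0.0] * d
        for py, g in zip(p, grads):
            coeff = py * sum(gj * vj for gj, vj in zip(g, v))
            for j in range(d):
                out[j] += coeff * g[j]
        return out

    v = [1.0 / math.sqrt(d)] * d
    lam = 0.0
    for _ in range(iters):
        fv = fim_times(v)
        lam = math.sqrt(sum(c * c for c in fv))
        if lam == 0.0:
            return 0.0  # start vector lies in the null space of F
        v = [c / lam for c in fv]
    return lam
```

Because `F` is never formed explicitly, the same loop carries over to larger models where only Jacobian-vector products are available.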
[CV-38] Do Foundation Models See Biology? Evaluating Attention Coherence with Spatial Transcriptomics in Glioblastoma
Link: https://arxiv.org/abs/2606.04764
Authors: Dilakshan Srikanthan,Amoon Jamzad,Paul Wilson,Nooshin Maghsoodi,Robert Policelli,Gabor Fichtinger,John F. Rudan,Parvin Mousavi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Whether attention maps from pathology foundation models capture genuine biology remains unknown, yet this question is critical for clinical trust and regulatory approval. We propose a spatial transcriptomics-based framework for orthogonal, hypothesis-free evaluation of attention and apply it to five pathology foundation models (CONCH v1.5, UNI v2, Virchow2, GigaPath, H-Optimus-1) and a ResNet50 baseline. Using attention-based multiple instance learning, we train single-task and multi-task models to predict five molecular alterations in glioblastoma on the CPTAC cohort, validate on an independent TCGA cohort, and evaluate biological coherence of attention maps against 87 transcriptional signatures using co-registered Visium spatial transcriptomics data from 18 samples. Internally, no single encoder dominates across all tasks, and external validation inverts internal performance rankings. Attention maps show a five-fold enrichment gradient from pathways (Cohen’s d=0.329) to individual genes (d=0.055), indicating that attention captures emergent multi-gene transcriptional programs rather than individual molecular events. Spatially smooth attention maps do not imply biological coherence, and different encoders attend to distinct biological compartments. Our framework provides objective, quantitative assessment of what foundation models learn from histopathology, moving the field beyond qualitative saliency map review.
[CV-39] Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment
Link: https://arxiv.org/abs/2606.04737
Authors: Cong Wang,Hanxin Zhu,Jiayi Luo,Yonglin Tian,Xiaoqian Cheng,Peiyan Tu,Xin Jin,Long Chen,Zhibo Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large-scale video generation models have made remarkable progress in semantic consistency and visual quality, producing videos that are increasingly coherent and visually convincing. Nevertheless, the dynamics induced by pixel-level fitting do not naturally accommodate the regularities that govern real-world motion and interaction, resulting in persistent shortcomings in physical plausibility. To address this limitation, we propose PILA (Physics-Informed Latent Alignment), a framework that injects physics-structured latent guidance into the frozen flow-matching dynamics of pretrained video models. Specifically, PILA first employs anchored field estimation to map frozen-generator latents into an operational physical attribute bank organized by field-proxy slots, using observable motion as a kinematic anchor for constructing less directly observed proxies. To handle the heterogeneity of real-world dynamics, PILA adopts a mixture-of-experts design over physical categories. Label-prior masked expert routing selects category-specific operator experts, whose refinements are regularized by operational residuals abstracted from physical relations. Finally, the refined proxies are fused into the physical attribute bank and decoded into a correction to the flow-matching vector field, injecting physics-aware guidance while preserving the visual prior of the pretrained backbone. With staged adapter training on Wan 2.1-1.3B and direct transfer of the learned adapter to Wan 2.2-14B, PILA achieves state-of-the-art results on VBench-2.0, VideoPhy-2, and PhyGenBench in both visual quality and benchmark-measured physical plausibility.
[CV-40] StrokeTimer: Robust Representation Learning for Ischemic Stroke Onset-Time Estimation from Non-contrast CT MICCAI2026
Link: https://arxiv.org/abs/2606.04722
Authors: Weiru Wang,Susanne G.H. Olthuis,Elizaveta Lavrova,Robert J. van Oostenbrugge,Charles B.L.M. Majoie,Wim H. van Zwam,Ruisheng Su
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Early accepted at MICCAI 2026
Abstract:Ischemic stroke is a major global disease. Treatment decisions are highly time-sensitive, as eligibility for reperfusion therapies relies on the interval between stroke onset and intervention. However, the true onset time is often uncertain in clinical practice, necessitating imaging-based assessment of tissue age as a surrogate marker. Early ischemic changes on routinely acquired non-contrast CT (NCCT) are often subtle, and real-world clinical datasets exhibit pronounced onset-time class imbalance and center-scanner-related heterogeneity. In this work, we propose StrokeTimer, a fully automated framework for onset-time estimation in acute ischemic stroke. StrokeTimer integrates self-supervised disentanglement learning with energy-guided contrastive learning to capture subtle ischemic patterns while addressing long-tailed data distributions under acquisition variability. Onset time is categorized into three clinically relevant windows: <4.5 h, 4.5-6 h, and >6 h. Experimental results on a large multi-center NCCT dataset from two national cohorts, MR CLEAN Registry and MR CLEAN LATE, show that StrokeTimer achieves a macro AUC of 0.69 and a macro F1-score of 0.57, improving the strongest baseline by nearly 50% (p < 0.005). In this realistic, challenging setting, representative baseline approaches exhibit near-chance macro performance. Model explanations further highlight subtle gray-white matter blurring and hypodense regions consistent with established radiological biomarkers. These findings demonstrate the potential of StrokeTimer to support treatment decision-making in acute ischemic stroke. Code is available at this https URL.
[CV-41] Data Efficient Complex Feature Fusion Network For Hyperspectral Image Classification
Link: https://arxiv.org/abs/2606.04710
Authors: Maitreya Shelare,Atharva Satam,Poonam Sonar,Sneha Burnase
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures
Abstract:This work presents a data-efficient variant of the Attention-Based Dual-Branch Complex Feature Fusion Network (CFFN) for hyperspectral image classification. The proposed model, termed DE-CFFN, retains the original two-stream structure: the Real-Valued Neural Network (RVNN) processes standard hyperspectral patches, while the Complex-Valued Neural Network (CVNN) handles their Fourier-transformed counterparts. The main contribution of this work lies in the feature extraction process and architectural enhancement. Factor Analysis is used for dimensionality reduction, offering improved latent feature representation over Principal Component Analysis. Additionally, both the RVNN and CVNN streams are structurally modified by successively halving the number of filters in the 3D convolutional layers to reduce complexity. The outputs of both branches are concatenated and passed through a Squeeze and Excitation (SE) block to enhance joint feature representation. Evaluated on the Pavia University and Salinas datasets, DE-CFFN achieves classification performance comparable to CFFN, while significantly reducing model size, memory consumption, and inference latency, making it suitable for real-time hyperspectral imaging applications.
[CV-42] ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection
Link: https://arxiv.org/abs/2606.04706
Authors: Xiaojing Chen(1),Xinyu Lu(1),Changtao Miao(2),Yunfeng Diao(3) ((1) Anhui University, (2) Ant Group, (3) Hefei University of Technology)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet remains challenging due to the need to capture spatial artifacts, temporal dynamics, and generalize to evolving generative models. In this paper, we explore reconstruction error as a discriminative forensic cue for AI-generated video detection. By reconstructing input videos with a pretrained WF-VAE, we observe that real and generated videos exhibit distinguishable frame-wise reconstruction error patterns, suggesting that reconstruction errors can reveal their distributional discrepancies. However, extending reconstruction-based image detection to videos is non-trivial, since video reconstruction errors are temporally organized across frames and require semantic context for effective interpretation. To address these challenges, we propose ReConFuse, a reconstruction-guided semantic fusion framework for video-level AI-generated video detection. ReConFuse extracts reconstruction error cues from WF-VAE reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model temporal evolution for video-level classification. Experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse.
[CV-43] Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation
Link: https://arxiv.org/abs/2606.04705
Authors: Amirhossein Movahedisefat,Amirreza Fateh,Mohammad Reza Mohammadi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at this https URL
[CV-44] A New Angle on Bones: Robust Pose Estimation in X-Ray and Ultrasound
Link: https://arxiv.org/abs/2606.04700
Authors: Ron Keuth,Christoph Großbröhmer,Franziska Halm,Miriam Johann,Anne-Nele Schröder,Ludger Tüshaus,Mattias P. Heinrich,Lasse Hansen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and annotations for fracture angle assessment in radiographs: this https URL
Abstract:Measuring the angle between bone structures is a routine task in medical image analysis and provides a key quantitative parameter for diagnosis and treatment planning. Automated methods can reduce time and cost while improving reproducibility. In this work, we address automatic bone pose estimation using a learning-based point candidate proposal followed by a line model to extract axis parameters. Since conventional line models such as least squares are sensitive to outliers, we incorporate false-positive reduction strategies and robust fitting techniques, such as RANSAC and Hough transforms, to improve robustness. We evaluate our method on three clinically relevant paediatric angle estimation tasks: fracture fragment assessment in radiographs and ultrasound and developmental dysplasia of the hip evaluation in ultrasound using the Graf method. Our approach achieves mean errors of 4.1°, 5.4°, and 5.51°, respectively, not only remaining within the expected clinical observer variability, but also significantly outperforming landmark-based methods. Our code and annotations for fracture angle assessment in radiographs are publicly available on GitHub.
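The robust line-fitting step named in the abstract above can be illustrated with a plain RANSAC sketch: fit an axis to noisy point candidates for each bone, then report the angle between the two axes. This is a generic RANSAC under assumed inputs (2D point lists, a fixed inlier threshold, a fixed seed), not the authors' pipeline, and both function names are hypothetical.

```python
import math
import random


def ransac_line_angle(points, threshold=1.0, iters=200, seed=0):
    """Orientation in degrees, in [0, 180), of the line with most inliers."""
    rng = random.Random(seed)
    best_count, best_angle = -1, 0.0
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        dx, dy = x2 - x1, y2 - y1
        norm = math.hypot(dx, dy)
        if norm == 0.0:
            continue  # degenerate sample, both points coincide
        # Count points within `threshold` perpendicular distance of the line.
        count = sum(
            1 for (x, y) in points
            if abs(dy * (x - x1) - dx * (y - y1)) / norm <= threshold
        )
        if count > best_count:
            best_count = count
            best_angle = math.degrees(math.atan2(dy, dx)) % 180.0
    return best_angle


def angle_between(a_deg, b_deg):
    """Smallest angle between two undirected axes, in [0, 90]."""
    diff = abs(a_deg - b_deg) % 180.0
    return min(diff, 180.0 - diff)
```

Unlike a least-squares fit, a few stray candidates far from the bone axis do not change the result, because they never belong to the largest consensus set.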
[CV-45] Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification
Link: https://arxiv.org/abs/2606.04699
Authors: Yogesh Kumar,Vrushank Ahire,Mudasir Ganaie
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Early and accurate detection of Alzheimer’s disease (AD) is important for timely intervention and disease management. Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and do not consider the geometric relationships among them. This paper proposes two graph-guided Universum learning models, namely UG-GEPSVM and IUG-GEPSVM, for AD versus cognitively normal (CN) classification using structural MRI data. In the proposed framework, mild cognitive impairment (MCI) subjects are used as Universum data to provide intermediate information between AD and CN classes. A graph is constructed over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization is incorporated into the learning process in place of the conventional independent Universum penalty term. UG-GEPSVM integrates this regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five different noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance under increasing noise levels. Statistical tests further confirm the significance of the observed improvements.
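The Laplacian-based regularization described in the abstract above builds on a standard construction, sketched here in simplified form: Gaussian similarities between samples define edge weights `W`, and the Laplacian `L = D - W` turns `f^T L f` into a penalty on differences between strongly connected samples. The sketch uses a fully connected graph and omits the Minimum Spanning Tree connectivity and multi-hop propagation steps that the paper adds; the bandwidth `sigma` and both function names are assumptions.

```python
import math


def gaussian_laplacian(samples, sigma=1.0):
    """Return (W, L): Gaussian similarity weights and Laplacian L = D - W."""
    n = len(samples)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    degree = [sum(row) for row in W]
    L = [[(degree[i] if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    return W, L


def smoothness(L, f):
    """Quadratic form f^T L f = 0.5 * sum_ij W_ij * (f_i - f_j)^2."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))
```

A score vector `f` that is constant across the samples incurs zero penalty, while scores that differ between nearby samples are penalized in proportion to their similarity.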
[CV-46] MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation CVPR2026
Link: https://arxiv.org/abs/2606.04688
Authors: Jiale Xu,Wang Zhao,Ying Shan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in a language-modeling fashion. However, existing approaches suffer from two fundamental limitations: (i) low tokenization efficiency, which yields long token sequences and prevents scaling to high-poly meshes, and (ii) absence of geometry-aware guidance, as generation is conditioned only on global shape embeddings rather than local surface cues. We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. At its core is a multi-level sparse-voxel encoder that injects geometric context into the generative process in three complementary ways: providing voxel features as vertex representations, guiding token prediction via cross-attention to voxel features, and serving as a structural scaffold that constrains generation around the input surface. Our hierarchical design enables coarse-to-fine vertex prediction in a single decoding step, while tightly coupling the generative model with 3D geometry. Extensive experiments demonstrate that MeshWeaver achieves a state-of-the-art compression ratio of 18%, can generate meshes with up to 16K faces, and significantly improves geometric fidelity over prior approaches.
[CV-47] Real-Time Automatic License Plate Recognition Using YOLOv8 SORT Tracking and Temporal Data Interpolation
Link: https://arxiv.org/abs/2606.04684
Authors: Mirza Muhammad Mobeen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages. Code: this https URL mobeen-pmo/Automatic-License-Plate-Recognition
Abstract:The real-time demands of video processing seriously limit the use of Automatic License Plate Recognition (ALPR) in dynamic traffic-monitoring settings. Reliable recognition under unconstrained conditions, e.g. drastic variations in illumination, acute camera angles, high vehicle speeds, and severe physical occlusion, is difficult and often leads to fragmented tracking paths and poor Optical Character Recognition (OCR) rates. To mitigate these weaknesses, this study proposes a five-stage, end-to-end pipeline that combines deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal data interpolation. The proposed architecture uses a YOLOv8 nano model to localize vehicles in the first stage, after which the Simple Online and Realtime Tracking (SORT) algorithm builds spatio-temporal links between frames. A second, specialized YOLOv8 detector localizes the license plate region and passes the cropped array to an EasyOCR pipeline constrained by positional syntax verification. Importantly, an offline temporal bounding-box interpolation mechanism then reconstructs fragmented paths.
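The abstract does not say how the offline bounding-box interpolation is computed. A minimal sketch, assuming plain linear interpolation between the nearest observed frames of one track (the function name and the box layout are hypothetical):

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def interpolate_track(observed: Dict[int, Box]) -> Dict[int, Box]:
    """Fill the missing frames of one track by linear interpolation
    between the nearest observed frames on either side of each gap."""
    frames = sorted(observed)
    filled: Dict[int, Box] = {}
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = observed[start], observed[end]
        span = end - start
        for frame in range(start, end):
            t = (frame - start) / span  # 0.0 at `start`, approaches 1.0 at `end`
            filled[frame] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    if frames:
        filled[frames[-1]] = observed[frames[-1]]
    return filled


# A track seen at frames 10 and 14, with frames 11-13 lost by the tracker.
track = {10: (100.0, 50.0, 200.0, 90.0), 14: (140.0, 50.0, 240.0, 90.0)}
dense = interpolate_track(track)
```

Because the step runs offline, both endpoints of every gap are known in advance, which is why a simple two-sided interpolation is possible here at all.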
[CV-48] Instance-Level Post Hoc Uncertainty Quantification in Object Detection
Link: https://arxiv.org/abs/2606.04656
Authors: Chongzhe Zhang,Zifan Zeng,Qunli Zhang,Feng Liu,Zheng Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures
Abstract:Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose Monte-Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.
[CV-49] MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer KR CVPR2026
Link: https://arxiv.org/abs/2606.04621
Authors: Weiyu Li,Antoine Toisoul,Tom Monnier,Roman Shapovalov,Rakesh Ranjan,Ping Tan,Andrea Vedaldi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: CVPR 2026 Highlight. Homepage: this https URL, Code: this https URL
Abstract:We present MeshFlow, a new method for generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh topology. However, AR methods scale poorly because the inference cost is quadratic in mesh size. They also require discretizing the vertex coordinates, which introduces quantization errors. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified Flow transformer, generating all mesh vertices and edges in parallel. Our model generates meshes 18x faster than the fastest AR generator while also achieving excellent accuracy across standard mesh-generation metrics. Homepage: this https URL, Code: this https URL
[CV-50] Beyond Symmetric Alignment: Spectral Diagnostics of Modality Imbalance in Vision-Language Models in the Medical Domain
Link: https://arxiv.org/abs/2606.04613
Authors: Alessandro Gambetti,Qiwei Han,Cláudia Soares,Hong Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 9 tables
Abstract:Vision-Language Models (VLMs) struggle when applied to medical image-text data, yet the tools available to diagnose this failure remain limited. Existing representation alignment metrics are symmetric, collapsing both modalities into a single score and hiding which modality drives cross-modal degradation. We introduce the Spectral Alignment Score (SAS), an asymmetric metric that projects both modalities onto the principal eigenbasis of an anchor modality and computes eigenvalue-weighted per-eigenmode correlations, resulting in directional scores whose difference quantifies modality information imbalance. We embed SAS within a benchmarking framework evaluating 15 VLMs across natural and medical image-text datasets alongside 6 alignment metrics and bidirectional retrieval. Our experiments show that medical images retain richer structural information than their paired clinical reports, a directional asymmetry invisible to all competing metrics, and that SAS achieves the strongest zero-label correlation with retrieval performance in the medical domain, positioning it as a practical diagnostic tool for clinical deployment. Code is available at this URL: this https URL.
[CV-51] COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations
Link: https://arxiv.org/abs/2606.04604
Authors: Zixu Li,Yupeng Hu,Zhiwei Chen,Haokun Wen,Xuemeng Song,Liqiang Nie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TIP 2026
Abstract:Composed Image Retrieval (CIR) is a challenging retrieval task that aims to locate specific images through multimodal inputs. Despite recent progress in CIR techniques, prior approaches often overlook cases where images appear visually alike yet differ in attributes, potentially undermining both multimodal feature fusion and similarity modeling. To mitigate this limitation, we design a unified representation of cross-modal features based on attribute prototypes. Nevertheless, the task is far from straightforward, owing to three core issues: (1) entanglement in attribute-level semantics, (2) inconsistency across modalities, and (3) missing supervision signals. To tackle these obstacles, we introduce a COMposed image retrieval network guided By attrIbute-based NEighbor Relations (COMBINER). Specifically, we first design an Adaptive Semantic Disentanglement module, which disentangles attribute features based on multimodal primitive features. Second, we propose a Unified Prototype-based Composition module, which constructs cross-modal unified prototypes (CUP) and facilitates multimodal feature composition. Finally, we introduce a Dual Relations Modeling module, which mines pairwise and neighbor relations based on attribute similarity. Unlike traditional CIR methods that model neighbor relations, COMBINER is the first study to address the phenomenon of visually similar but attribute-unrelated samples. It achieves a more accurate understanding of the semantic relations among samples by employing an attribute prototype-based similarity metric. Comprehensive experiments conducted on three benchmark datasets confirm the effectiveness of our proposed COMBINER. The implementation of our method will be available at this https URL
[CV-52] 4D Reconstruction from Sparse Dynamic Cameras CVPR2026
Link: https://arxiv.org/abs/2606.04593
Authors: Kazuki Ozeki,Shun Kenney,Yuto Shibata,Eisuke Takeuchi,Takuya Narihira,Kazumi Fukuda,Ryosuke Sawata,Yuki Mitsufuji,Yoshimitsu Aoki
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by 4DV Workshop at CVPR 2026
Abstract:Although dynamic 3D (i.e., 4D) reconstruction from a monocular dynamic camera has recently advanced, it remains fundamentally limited by depth ambiguity. In this paper, we focus on a practical alternative: a sparse dynamic camera setup, where a handful of independently moving cameras capture the same subjects. While keeping capture costs low, this setup introduces multi-view constraints and remains practical for real-world video production such as sports, concerts, and TV shows. Despite its potential, our experiments show that naive extensions of existing monocular or dense fixed-camera methods are insufficient, since they fail to resolve the complex spatiotemporal inconsistencies across views and time. To fill this gap, we propose a simple yet effective 3D track initialization method designed to ensure spatiotemporal consistency by integrating inter-camera feature matching with intra-camera point tracking. Additionally, we incorporate a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy to enhance optimization stability and cross-view generalization. Furthermore, to address the lack of standardized benchmarks for this task, we introduce LetCamsGo, a new real-world video dataset with 5 sequences across 4 diverse environments, recorded by three independently moving cameras and one fixed camera. Comprehensive benchmarking on LetCamsGo demonstrates that our proposed framework improves 4D reconstruction quality in dynamic regions compared with baselines, paving the way for a low-cost 4D reconstruction paradigm in the wild.
[CV-53] Impostor: An Agent-Curated Benchmark for Realistic AIGC Manipulation Localization
Link: https://arxiv.org/abs/2606.04545
Authors: Zhenliang Li(1),Yutao Hu(1),Qixiong Wang(2),Wenpeng Du(1),Hongxiang Jiang(2),Jiasong Wu(1),Xiaolong Jiang(2),Jungong Han(3) ((1) Southeast University, (2) Xiaohongshu Inc., (3) Tsinghua University)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 5 tables
Abstract:Recent advances in generative image editing have improved the realism and controllability of localized image manipulation, raising new challenges for image manipulation detection and localization (IMDL). However, existing IMDL benchmarks still have limitations in visual realism, manipulation diversity, and generator coverage, making it difficult to reflect recent trends in image manipulation. To address these limitations, we introduce Impostor, a high-quality AI-edited image manipulation localization dataset containing 100K manipulated images. Impostor is constructed by CraftAgent, a closed-loop agent framework that integrates scene perception, editing planning, manipulation execution, quality validation, and iterative reflection to automatically generate diverse and visually realistic manipulated images. Moreover, Impostor contains images generated by seven recent AIGC models across three manipulation types and includes multiple manipulated regions, providing a more comprehensive benchmark for AIGC-based IMDL. Furthermore, we propose PhaseAware-Net (PANet), a semantic-forensic framework that introduces local phase modeling and semantic-forensic consistency learning to better localize semantically plausible yet forensically disrupted manipulated regions. Extensive experiments show that Impostor poses significant challenges to existing large vision-language models (LVLMs) and specialized IMDL methods, while PANet achieves superior performance on Impostor and multiple public benchmarks.
[CV-54] Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning
Link: https://arxiv.org/abs/2606.04528
Authors: Fan Zhang,Sijin Zheng,Fei Ma,Qiang Yin,Yongsheng Zhou,Fei Gao,Xian Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures
Abstract:Few-shot class-incremental learning (FSCIL) in synthetic aperture radar (SAR) imagery presents unique challenges due to severe data scarcity and SAR-specific variability. In particular, strong azimuth sensitivity in SAR induces large intra-class variation and inter-class confusion, and the sequential updates of FSCIL further lead to catastrophic forgetting of previously learned classes. Inspired by neural collapse, we propose an optical-guided SAR FSCIL framework, which derives orthogonal feature subspaces from a data-rich optical ATR dataset and uses them as geometric priors to guide SAR feature learning. SAR features are projected onto these orthogonal subspaces via principal angle constraints, effectively transferring discriminative structure from the optical to the SAR domain. Specifically, our projection loss and the classifier loss optimized with a frozen simplex-ETF geometry jointly induce neural collapse by concentrating features around class means while maintaining large inter-class angles. We evaluate the approach on a benchmark comprising an optical ATR dataset and a SAR ATR dataset with 24 target classes, organized into a base training session and seven incremental sessions. Compared with recent FSCIL methods such as NCFSCIL, our method achieves the highest final accuracy and a favorable trade-off between final performance and performance degradation. Moreover, neural collapse metrics show improved intra-class compactness and inter-class separability, indicating that the learned features more closely approximate the ideal simplex-ETF geometry.
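For readers unfamiliar with the simplex-ETF geometry mentioned above, the sketch below builds the standard simplex equiangular tight frame for K classes, where every pair of unit-norm class prototypes has cosine similarity -1/(K-1). It uses the textbook construction with an identity rotation and feature dimension equal to K; how the paper itself instantiates the frozen classifier is not stated in the abstract, and the function names are hypothetical.

```python
import math
from typing import List


def simplex_etf(num_classes: int) -> List[List[float]]:
    """Rows are class prototypes: sqrt(K/(K-1)) * (I - (1/K) * ones)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# 24 classes, matching the number of target classes in the benchmark.
prototypes = simplex_etf(24)
pairwise = cosine(prototypes[0], prototypes[1])  # close to -1/23
```

The value -1/(K-1) is the largest equal angular separation that K unit vectors can achieve, which is why this geometry is used as the target for inter-class angles.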
[CV-55] Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
Link: https://arxiv.org/abs/2606.04527
Authors: Yuxuan Bian,Zeyue Xue,Songchun Zhang,Shiyi Zhang,Weiyang Jin,Yaowei Li,Junhao Zhuang,Haoran Li,Jie Huang,Haoyang Huang,Nan Duan,Qiang Xu
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Website: this https URL
Abstract:We present Echo-Infinity, an autoregressive (AR) framework towards real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors due to their limited cache window and their disregard for autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce a Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs’ pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance, and, to our knowledge, demonstrates promising 24-hour (1.3 M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.
[CV-56] SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning
Link: https://arxiv.org/abs/2606.04493
Authors: Zhihua Wang,Yanping Li,Yizhang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates, which struggle to capture the subtle geometric consistencies presented by inliers. While Mamba-based methods possess global receptive fields and long sequence modeling capabilities, they tend to accumulate substantial inconsistent features within the hidden state space, making it difficult to distinguish inliers from outliers. In this paper, we integrate frequency domain perception into this task for the first time and propose SFMambaNet, a novel Spectral-Frequency enhanced Mamba-based two-view correspondence pruning network. Our method is collaboratively composed of two components: First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to enhance the capture of subtle geometric consistencies and improve local feature discriminability. Building upon this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space, utilizing the frequency information provided by LSGA to explicitly suppress high-frequency noise accumulation within hidden states and mitigate the propagation of inconsistent features. This enhances inlier-outlier separability and achieves robust global context modeling capabilities with nearly linear complexity. Extensive experiments demonstrate that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at this https URL.
[CV-57] Adaptive Calibration for Fair and Performant Facial Recognition
Link: https://arxiv.org/abs/2606.04469
Authors: Ryan Brown,Chris Russell
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach improves overall performance and yields a fairer calibration without requiring demographic metadata. It consistently dominates existing methods on both accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC thus provides a practical solution for equitable facial recognition that needs no demographic group annotations while improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids “leveling down”, where fairness comes at the cost of degraded performance for some groups.
[CV-58] ChannelTok: Efficient Flexible-Length Vision Tokenization
Link: https://arxiv.org/abs/2606.04461
Authors: Sukriti Paul,Arpit Bansal,Tom Goldstein
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first k channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being 8.6× faster in decoding and 2.1× smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: this https URL
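The tail-dropping idea can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: the abstract does not say how the cut-off is sampled or how dropped channels are represented, so the uniform sampling and zero masking below (and both function names) are hypothetical.

```python
import random
from typing import List, Sequence


def tail_drop(channels: Sequence[Sequence[float]], keep: int) -> List[List[float]]:
    """Keep the first `keep` channel tokens and zero out the remaining tail."""
    return [
        list(ch) if idx < keep else [0.0] * len(ch)
        for idx, ch in enumerate(channels)
    ]


def sample_keep(num_channels: int, rng: random.Random) -> int:
    """Training-time sketch: draw how many leading channels survive
    (assumed uniform over 1..num_channels)."""
    return rng.randint(1, num_channels)


# Three channel tokens of two values each; keep only the first two.
latent = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
compressed = tail_drop(latent, keep=2)
```

Because early channels survive every draw while late ones are often dropped, a decoder trained this way has to place the most important information first. That ordering is what makes "retain the first k channels" a workable compression rule at inference.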
[CV-59] Imagine Before You Draw: Visual Prompt Engineering for Image Generation
Link: https://arxiv.org/abs/2606.04457
Authors: Liyu Jia,Fengda Zhang,Jiachun Pan,Kesen Zhao,Saining Zhang,Wang Lin,Weijia Wu,Yue Liao,Aojun Zhou,Hanwang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as “visual prompts” that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.
[CV-60] Radiomic Feature Selection Using Gradient Loss of Deep Neural Network for Lung Cancer Stage Detection
Link: https://arxiv.org/abs/2606.04453
Authors: Hina Shakir,Mohammad Mohatram,Javeed Hussain,Syed Rizwan Ali,Muhammad Irfan Memon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Radiomics enables extraction of quantitative imaging biomarkers from medical images and has become an important tool for computer-aided cancer diagnosis. However, radiomics datasets are typically high-dimensional with limited samples, making feature selection a critical step for building reliable predictive models. This study proposes a Gradient-Loss Recursive Feature Elimination (GL-RFE) framework that integrates gradient sensitivity analysis from a deep neural network to identify the most influential radiomic features for lung cancer stage detection. A total of 106 radiomic features were extracted from chest Computed Tomography (CT) scans using the PyRadiomics extension of the 3D Slicer platform. The proposed method evaluates feature importance by computing gradients of the network loss with respect to input features and recursively eliminates features with minimal contribution. The resulting top-15 radiomic features are used to train a deep neural network classifier for distinguishing early-stage and advanced-stage lung cancer. The proposed framework achieves strong classification performance, with accuracy of 90.22%, precision of 90.10%, recall of 90.24%, and F1-score of 90.16% on the test dataset. Visualization analyses, including correlation heat maps and distribution plots, further confirm reduced feature redundancy and improved class separability. Compared to conventional feature selection techniques, GL-RFE effectively captures nonlinear feature interactions and enhances model generalization. The presented protocol provides a reproducible and interpretable methodology for radiomics-based cancer stage detection and is particularly suitable for high-dimensional, small-sample biomedical datasets, with potential applications in other domains such as genomics and multimodal clinical analysis.
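The elimination loop described above (score features by the gradient of the loss with respect to the inputs, drop the weakest, retrain, repeat) can be sketched as follows. To stay self-contained, the sketch swaps the deep neural network for a tiny logistic-regression stand-in, which is an assumption and not the authors' model; all function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, steps=200, lr=0.5):
    """Plain gradient descent on binary cross-entropy (stand-in model)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def input_gradient_importance(X, y, w, b):
    """Mean |dLoss/dx_j| over samples; here dLoss/dx_j = (p - y) * w_j."""
    scores = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
        for j in range(len(w)):
            scores[j] += abs(err * w[j]) / len(X)
    return scores


def gl_rfe_sketch(X, y, keep):
    """Recursively drop the feature with the smallest input-gradient score."""
    active = list(range(len(X[0])))
    while len(active) > keep:
        X_sub = [[row[j] for j in active] for row in X]
        w, b = fit_logistic(X_sub, y)
        scores = input_gradient_importance(X_sub, y, w, b)
        weakest = min(range(len(active)), key=scores.__getitem__)
        active.pop(weakest)
    return active


# Toy data: feature 0 separates the classes, feature 2 is constant.
X = [[2.0, 1.0, 0.0], [1.5, 0.0, 0.0], [-1.0, 0.0, 0.0], [-2.0, -0.5, 0.0]]
y = [1, 1, 0, 0]
selected = gl_rfe_sketch(X, y, keep=2)
```

For a linear stand-in the score reduces to weight magnitude, so the sketch only illustrates the control flow. With a nonlinear network the input gradient varies per sample, which is where the abstract's claim about capturing nonlinear feature interactions would come into play.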
[CV-61] INTACT: Ego-Guided Typed Sparse Evidence Retrieval for Heterogeneous Collaborative Perception
Link: https://arxiv.org/abs/2606.04437
Authors: Chen Li,Shengrong Yuan,Jialong Zuo,Xinzhong Zhu,Nong Sang,Changxin Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Collaborative perception extends the perceptual range of autonomous vehicles by sharing information across agents, but heterogeneous sensors and perception models make intermediate feature fusion difficult to deploy at scale. Existing heterogeneous collaboration methods typically follow a translation-first paradigm: collaborator features must be aligned, adapted, or projected into an ego-compatible space before fusion. Such feature-compatibility contracts improve fixed-system performance, but they couple deployment to collaborator-specific adaptation and make newly joined heterogeneous agents costly to integrate. To address this gap, we propose INTACT, an ego-guided typed sparse evidence retrieval framework for heterogeneous collaborative perception. Instead of translating an entire collaborator feature map, INTACT lets the ego vehicle issue typed evidence queries that express suspected objects and evidence-deficient regions. Collaborators respond only with local evidence at queried locations, and the ego selects useful responses through sparse per-query routing and injects them through gated residual write-back. This changes the compatibility requirement from global feature-map interpretability to local, typed response comparability under ego-issued queries, enabling a zero-training heterogeneous insertion protocol in which the ego interface is trained once and new collaborators join through checkpoint merging. Extensive experiments on simulated and real-world heterogeneous collaborative perception benchmarks validate the effectiveness and deployability of INTACT. On OPV2V-H, INTACT achieves 80.1 AP70 with only 0.52M additional parameters and a log2 communication volume of 18.0, corresponding to about 16× compression over dense feature transmission. On DAIR-V2X, INTACT achieves 43.8 AP50 under challenging real-world conditions.
[CV-62] 3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training
Link: https://arxiv.org/abs/2606.04436
Authors: Jiaxin Shi,Xidong Zhang,Fucai Zhu,Zhe Li,Siyu Zhu,Weihao Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:We propose a 3D-thinking-guided co-training framework that enables vision-language-action (VLA) models to perform 3D spatial reasoning implicitly during action prediction. Our core insight is that 3D geometry perception and 3D spatial reasoning are distinct capabilities that can be disentangled and injected at different feature hierarchies. During training, three tightly coupled components work in concert primarily within the latent space: (1) To gain geometric priors, a latent 3D geometry perception module aligns intermediate visual features with a 3D foundation model, acquiring low-level geometric cues without architectural modifications to the VLM backbone. (2) Complementing this, an online 3D reasoning distillation module mitigates the prompt-induced reasoning gap via a shared reasoning anchor token. During 3D VLM co-training, this anchor is emitted as the first output token to robustly encode spatial priors. During VLA training, it serves as an input token inserted between the task and action instructions, transferring high-level spatial thinking from explicit teacher reasoning prompts to student action prompts without chain-of-thought text generation. (3) These disentangled geometric and reasoning features are then united by a spatially augmented action integration, which jointly injects them into the action-query tokens as hierarchical spatial conditions to prevent action shortcuts. At deployment, our method retains only its lightweight adapters to perform implicit 3D reasoning, discarding the 3D foundation model and the teacher branch used for supervision. Consequently, it operates purely on 2D images without 3D sensors, external models, or explicit text generation while preventing catastrophic forgetting of the pretrained VLM, achieving state-of-the-art performance on LIBERO, LIBERO-PLUS, SimplerEnv, and real-world manipulation tasks.
[CV-63] Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning ICML2026
Link: https://arxiv.org/abs/2606.04434
Authors: Niloufar Alipour Talemi,Hossein Kashiani,Fatemeh Afghah
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Multimodal In-Context Learning (ICL) has emerged as a practical inference paradigm for Multimodal Large Language Models, where a small set of interleaved image-text In-Context Demonstrations (ICDs) conditions the model to solve new tasks. Despite its flexibility, multimodal ICL incurs high inference latency and suffers from instability due to sensitivity to demonstration formatting, ordering, and content. To address these limitations, we propose Hyper-ICL, a lightweight, training-based framework for demonstration-free multimodal ICL that reconstructs demonstration effects directly without requiring ICDs at inference time. Hyper-ICL learns a parameter-efficient low-rank logit-level adapter that calibrates attention distributions to better match demonstration-induced attention redistribution. To capture how demonstration influence varies across queries, we introduce a query-adaptive modulation mechanism that adaptively controls intervention strength at token level across layers and heads based on the current query. Finally, we propose a layer-wise hyperbolic anchor distillation loss that aligns intermediate student features to a demonstration-conditioned teacher via Lorentz geodesic distance. This loss encourages the student to reconstruct the demonstration-query relationships induced by ICDs. Extensive experiments across six different multimodal benchmarks (including VQAv2, OK-VQA, and COCO Caption) demonstrate that Hyper-ICL consistently improves accuracy and stability over vanilla ICL and existing state-of-the-art methods.
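The Lorentz geodesic distance used by the distillation loss has a standard closed form on the hyperboloid model. The sketch below assumes curvature -1 and lifts Euclidean vectors onto the hyperboloid by solving for the time-like coordinate; how Hyper-ICL actually maps features into hyperbolic space is not stated in the abstract, so the lifting step and function names are assumptions.

```python
import math
from typing import List, Sequence


def lift_to_hyperboloid(v: Sequence[float]) -> List[float]:
    """Map a Euclidean vector v to (sqrt(1 + |v|^2), v), a point on the
    unit hyperboloid (curvature -1)."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)


def lorentz_inner(x: Sequence[float], y: Sequence[float]) -> float:
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Geodesic distance arccosh(-<x, y>_L); the clamp guards against
    rounding pushing the argument slightly below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])
dist = lorentz_distance(origin, point)  # one unit of hyperbolic distance
```

A layer-wise loss in this spirit would average `lorentz_distance` between student and teacher features at selected layers; the choice of layers and weighting is left unspecified here.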
[CV-64] DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation CVPR2026
Link: https://arxiv.org/abs/2606.04432
Authors: Thanh-Tung Le,Yunhan Zhao,Menglei Chai,Zhengyang Shen,Zhe Cao,Danhang Tang,Xiaohui Xie,Deying Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026, Findings Track
Abstract:Video diffusion transformers have achieved state-of-the-art visual quality, but their high inference cost remains a major bottleneck for real-time applications. Recent distillation frameworks produce autoregressive video diffusion models with reduced latency, yet these models still use a fixed number of denoising steps per frame, wasting computation on predictable frames and under-refining challenging ones. We present DSA, a confidence-guided adaptive computation framework for AR video diffusion. DSA introduces a lightweight confidence head, trained jointly with the generator under a distribution-matching distillation objective, to estimate per-frame denoising reliability. At inference, this confidence signal dynamically adjusts the number of diffusion steps: simple frames terminate early for speed, while complex frames receive additional refinement. Our method requires no extra video data, no heuristics, and little architectural modification. Experiments show that DSA achieves real-time autoregressive video generation, reaching 22.63 FPS with sub-second latency on H100 GPUs, while maintaining competitive or superior VBench quality compared to recent autoregressive and bidirectional video diffusion models. Our results demonstrate that confidence-guided adaptive sampling provides an effective and practical path toward interactive video generation.
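The inference-time behaviour described above (stop denoising a frame once a confidence signal is high enough) can be sketched as a simple early-exit loop. The threshold rule, the step cap, and the toy denoiser and confidence functions are assumptions for illustration; in the paper the confidence comes from a learned head.

```python
from typing import Callable, List, Tuple

Latent = List[float]


def denoise_with_early_exit(
    latent: Latent,
    denoise_step: Callable[[Latent], Latent],
    confidence: Callable[[Latent], float],
    max_steps: int = 6,
    threshold: float = 0.9,
) -> Tuple[Latent, int]:
    """Run denoising steps until the confidence estimate reaches
    `threshold` or `max_steps` is exhausted; return latent and steps used."""
    steps_used = 0
    for _ in range(max_steps):
        latent = denoise_step(latent)
        steps_used += 1
        if confidence(latent) >= threshold:
            break
    return latent, steps_used


# Toy stand-ins: each step halves the residual noise, and confidence is
# one minus the largest remaining magnitude.
halve = lambda z: [v * 0.5 for v in z]
toy_confidence = lambda z: 1.0 - max(abs(v) for v in z)

_, easy_steps = denoise_with_early_exit([0.05], halve, toy_confidence)
_, hard_steps = denoise_with_early_exit([1.0], halve, toy_confidence)
```

In this toy setting the "easy" frame exits after one step while the "hard" frame uses four, which mirrors the intended allocation of compute per frame.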
[CV-65] Implicit Fuzzification via Bounded Noise Injection for Robust Medical Image Segmentation
链接: https://arxiv.org/abs/2606.04427
作者: Bisheng Tang,Zhangfeng Ma,Chuchu Zhai,Feng Dong,Yaoqun Wu,Ammar Oad,Yifei Peng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:Image segmentation remains fundamentally limited by boundary ambiguity arising from sampling-induced information loss and inherent uncertainty in pixel-wise labeling. Although encoder-decoder architectures such as U-Net achieve strong performance, they often produce overconfident predictions that fail to capture transition-region ambiguity. To address this issue, we propose NoiseUNet, a simple yet effective framework that injects bounded perturbations into skip connections to regularize cross-scale feature fusion. This mechanism enforces robustness to local feature variations and promotes boundary-aware representations. Theoretically, the perturbation induces an implicit fuzzification effect, yielding soft, data-driven memberships without requiring explicit fuzzy modeling. We further introduce ThyR, a real-world thyroid ultrasound dataset with inherently ambiguous boundaries. Experiments demonstrate that NoiseUNet consistently improves both segmentation accuracy and boundary fidelity.
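Bounded noise injection into a skip connection might look like the sketch below. The abstract only says the perturbation is bounded, so the uniform distribution, the `epsilon` bound, the function name `perturb_skip`, and the choice to disable noise at evaluation time are all assumptions made for illustration.

```python
import random
from typing import List


def perturb_skip(
    features: List[float],
    epsilon: float,
    rng: random.Random,
    training: bool = True,
) -> List[float]:
    """Add noise drawn from [-epsilon, epsilon] to each skip-connection feature."""
    if not training:
        # Assumed behavior: no perturbation at evaluation time.
        return list(features)
    return [f + rng.uniform(-epsilon, epsilon) for f in features]


rng = random.Random(0)
skip_features = [0.2, -1.0, 0.7, 3.5]
noisy_features = perturb_skip(skip_features, epsilon=0.1, rng=rng)
eval_features = perturb_skip(skip_features, epsilon=0.1, rng=rng, training=False)
```

Because every perturbation stays within `epsilon`, the decoder sees features from a small neighborhood of the encoder output, which is one way to read the regularization effect the abstract describes.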
[CV-66] Motion-Guided Causal Disentanglement for Robust Multi-View Cine Cardiac MRI Diagnosis
链接: https://arxiv.org/abs/2606.04414
作者: Chuankai Xu,Cristiane De Carvalho Singulane,Mohammad Abuannadi,Stephen Chandler,Jeremy Slivnick,Karolina Zareba,Jane Cao,Vidya Nadig,Fabio Fernandes,Seth Uretsky,Diego Perez de Arenaza,Amit Patel,Jianxin Xie
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Multi-view cardiac magnetic resonance (CMR) imaging provides complementary anatomical information and is widely used for noninvasive disease assessment. Recent transformer-based models have demonstrated strong representation learning capabilities for CMR analysis; however, they typically learn unified latent embeddings that entangle view-specific anatomical variations with disease-related features. Such entanglement biases classifiers toward structural attributes rather than view-invariant pathological patterns. This issue is exacerbated in low-data regimes, particularly for underrepresented cardiac conditions, where limited samples increase the susceptibility to shortcut learning and view-dependent decision boundaries. To address this, we propose a Motion-Guided View–Disease Disentanglement framework MoViD built upon a ViT-MAE backbone. The model explicitly factorizes latent representations into view-specific and disease-discriminative components using dual-branch supervised contrastive objectives and a gradient-reversal adversarial constraint that minimizes disease leakage into the view embedding. Additionally, an annotation-free temporal motion feature, derived from inter-frame difference maps, is introduced to localize the beating heart region and suppress background artifacts. A focal reweighting mechanism is incorporated into the contrastive loss to mitigate class imbalance. We evaluate the framework on a private clinical venous thrombosis dataset and two public benchmarks (MMs, MMs2). Across disease classification and cardiac segmentation tasks, our approach consistently outperforms standard transformer baselines and demonstrates competitive performance against large-scale pretrained foundation models, validating the efficacy of structural disentanglement in medical image analysis.
[CV-67] Ultra-Fast Neural Video Compression CVPR2026
链接: https://arxiv.org/abs/2606.04410
作者: Jiahao Li,Wenxuan Xie,Zhaoyang Jia,Bin Li,Zongyu Guo,Xiaoyi Zhang,Yan Lu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:While neural video codecs (NVCs) have demonstrated superior compression ratio, their prohibitive computational complexity remains a critical barrier to real-world deployment. This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. Instead of processing frames sequentially, our approach encodes a chunk of multiple frames into a single compact latent representation and decodes them simultaneously. This is enabled by cross-frame interaction modules for joint spatial-temporal modeling and frame-specific decoders for parallel reconstruction. This paradigm not only dramatically enhances coding throughput but also facilitates more effective modeling of long-term temporal correlations. To further boost speed, we propose a streamlined entropy coding mechanism that consolidates bit-stream interactions into a single step, substantially reducing decoding overhead. Building on these innovations, we present DCVC-UF (Ultra-Fast), a new NVC that sets a new SOTA in performance. Our experiments show that DCVC-UF can achieve ultra-fast encoding and decoding speeds, significantly outperforming previous leading codecs. DCVC-UF serves as a notable landmark in the journey of NVC evolution. The code is at this https URL.
[CV-68] An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization
链接: https://arxiv.org/abs/2606.04409
作者: Luoyidi Zhou
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 9 figures, 4 tables
Abstract:Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: this https URL.
[CV-69] Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models ICML2026
链接: https://arxiv.org/abs/2606.04385
作者: Shuwen Yu,Zhanxuan Hu,Yi Zhao,Yonghang Tai,Huafeng Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2026
Abstract:Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the complementary strengths of VFMs and VLMs. Inspired by cross-lingual alignment, GPUA treats VFM features as a visual language and learns an orthogonal mapping that translates the VFM space into the VLM semantic space, preserving geometry and narrowing the modality gap without labels or model parameter updates. GPUA is task-agnostic and requires only feature-level access to pretrained models. Experiments across diverse benchmarks demonstrate improved cross-model compatibility and strong gains in downstream zero-shot recognition and segmentation with negligible overhead. Code is available at this https URL
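The geometry-preserving property of an orthogonal mapping can be checked numerically. The sketch below uses a fixed 2D rotation as a stand-in for a learned orthogonal map; it does not reproduce how GPUA learns the mapping, and the feature values and helper names are made up for illustration.

```python
import math
from typing import List


def rotation(theta: float) -> List[List[float]]:
    """A 2D rotation matrix, which is orthogonal (Q^T Q = I)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def apply(matrix: List[List[float]], v: List[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


features = [[1.0, 0.0], [0.0, 2.0], [-1.5, 0.5]]
Q = rotation(0.7)
mapped = [apply(Q, f) for f in features]

# Pairwise distances before and after the mapping.
pairs = [(0, 1), (0, 2), (1, 2)]
before = [distance(features[i], features[j]) for i, j in pairs]
after = [distance(mapped[i], mapped[j]) for i, j in pairs]
```

The two distance lists agree up to floating-point error, which is the sense in which an orthogonal map leaves the source feature geometry intact while moving it into another space.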
[CV-70] Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers ICML2026
链接: https://arxiv.org/abs/2606.04373
作者: Biao Qian,Yang Wang,Yong Wu,Jungong Han
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to appear at ICML 2026, Seoul, Korea
Abstract:Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism compared to classical convolutional operation. However, previous DFQ methods for ViTs often suffer from a distribution mismatch between synthetic samples and the input distribution expected by quantized models Q, resulting in suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism is predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q’s outputs. To these ends, we incorporate differential entropy maximization over the patch similarity of synthetic samples, to decouple informative regions from noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy endows MaskAQ with the capacity to continually adapt to the evolving state of Q throughout the training process, to preserve desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at this https URL.
[CV-71] VT-3DAD: Cross-Category 3D Anomaly Detection via Visual-Text Normal Space Alignment
链接: https://arxiv.org/abs/2606.04369
作者: Zi Wang,Katsuya Hotta,Yawen Zou,Koichiro Kamide,Yijin Wei,Chao Zhang,Jun Yu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Few-shot cross-category 3D anomaly detection aims to determine whether an unknown point cloud belongs to a target normal category using only a few normal references. Existing training-based methods usually require category-wise optimization, while recent training-free methods based on multi-view CLIP visual features mainly rely on visual similarity and may be confused by geometrically similar categories. In this paper, we propose VT-3DAD, a training-free framework for cross-category 3D anomaly detection via Visual-Text Normal Space Alignment. Given few-shot normal references and a test point cloud, VT-3DAD first generates realistic multi-view depth maps and extracts view-wise features using a frozen CLIP visual encoder. The visual branch measures reference-test deviation in the multi-view feature space. In parallel, depth-aware and 3D-aware prompts are encoded by the frozen CLIP text encoder to construct textual normal anchors, which provide semantic normality constraints for the target category. The final anomaly score is obtained by fusing visual deviation from normal references and semantic deviation from the textual normal space. Experiments on the ShapeNetPart dataset demonstrate that VT-3DAD achieves state-of-the-art performance. In particular, VT-3DAD improves the one-shot average AUC-ROC from 92.49% to 94.80% compared with the visual-only baseline, while also reducing the average standard deviation from 5.64 to 3.41.
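The fusion of visual and semantic deviation described above could be sketched as below. The abstract does not give the fusion rule, so the convex combination with weight `alpha`, the use of cosine distance, the nearest-reference rule, and the single-vector features (instead of multi-view CLIP features) are all assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def anomaly_score(
    test_feat: Sequence[float],
    ref_feats: Sequence[Sequence[float]],
    text_anchor: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Fuse deviation from normal references with deviation from a textual anchor."""
    # Visual branch: distance to the closest few-shot normal reference.
    visual_dev = min(cosine_distance(test_feat, r) for r in ref_feats)
    # Text branch: distance to the textual "normal" anchor.
    semantic_dev = cosine_distance(test_feat, text_anchor)
    return alpha * visual_dev + (1.0 - alpha) * semantic_dev


# Toy 2D features standing in for CLIP embeddings (made-up values).
refs = [[1.0, 0.0], [0.9, 0.1]]
anchor = [1.0, 0.0]
normal_score = anomaly_score([1.0, 0.05], refs, anchor)
anomalous_score = anomaly_score([0.0, 1.0], refs, anchor)
```

A sample close to both the references and the anchor scores near zero, while one far from both scores high, so thresholding the fused score separates the two cases in this toy example.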
[CV-72] Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes
链接: https://arxiv.org/abs/2606.04365
作者: Renjie Liang,Zhengkang Fan,Jinqian Pan,Chenkun Sun,Jiang Bian,Russell Terry,Jie Xu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: one model emits a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, with multi-granularity side- and per-lesion labels, and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs to side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: a segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM); generic large-corpus pretraining is no better than random initialization. LesionDETR reaches bilateral side-level abnormality AUC 0.799 ± 0.009 on UF-Health and 0.817 ± 0.072 on KiTS23. A count-conditioned variant reaches per-lesion mAP 0.190 ± 0.083 on cystic lesions; rare solid-lesion AP stays at the noise floor, pointing to targeted data collection, not architecture, as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.
[CV-73] Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention
链接: https://arxiv.org/abs/2606.04364
作者: Dhanesh Ramachandram
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
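Adding a Gaussian prior to attention logits in log space can be sketched as follows. This is a simplified illustration: it assumes an isotropic Gaussian with a single `sigma` (the abstract describes a learnable two-dimensional Gaussian without further detail), drops the normalizing constant because softmax is invariant to it, and uses a hypothetical function name and a made-up 2x2 patch grid.

```python
import math
from typing import List, Sequence, Tuple


def attention_with_gaussian_prior(
    logits: Sequence[float],
    patch_xy: Sequence[Tuple[float, float]],
    mean: Tuple[float, float],
    sigma: float,
) -> List[float]:
    """Softmax over attention logits biased by the log of a spatial Gaussian prior."""
    biased = []
    for logit, (x, y) in zip(logits, patch_xy):
        sq_dist = (x - mean[0]) ** 2 + (y - mean[1]) ** 2
        # Log-density of an isotropic Gaussian, up to an additive constant.
        log_prior = -sq_dist / (2.0 * sigma ** 2)
        biased.append(logit + log_prior)
    # Numerically stable softmax.
    top = max(biased)
    exps = [math.exp(b - top) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# A 2x2 patch grid with uninformative (equal) content logits.
patches = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
weights = attention_with_gaussian_prior(
    [0.0, 0.0, 0.0, 0.0], patches, mean=(0.0, 0.0), sigma=0.5
)
```

With equal content logits, the prior alone decides where the part query looks, so the patch at the prior mean receives the largest weight. This shows how distinct prior means can break the symmetry among otherwise interchangeable part queries.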
[CV-74] MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models
链接: https://arxiv.org/abs/2606.04349
作者: Yue Wu,Changyuan Wang,Zixuan Wang,Shilin Ma,Yansong Tang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers, thereby preserving accurate discretization across diverse modal distributions. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks like MMMU and Video-MME demonstrate our approach’s superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which fully demonstrates the exceptional accuracy-efficiency trade-off of our framework.
[CV-75] HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning
链接: https://arxiv.org/abs/2606.04345
作者: Isha Abid,Fawad Khan,Muhammad Khuram Shahzad
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, multiple figures
Abstract:This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architecture. Traditional YOLO-based object detection models primarily capture pairwise feature interactions and may fail to model complex high-order relationships among objects and contextual features. To address this limitation, HYolo incorporates hypergraph learning to capture richer contextual dependencies and improve object representation. Experimental evaluation on the COCO dataset demonstrates significant performance improvements over baseline YOLO models. The proposed approach achieves approximately 12% improvement in mAP@50 while enhancing overall detection accuracy and robustness. By modeling high-order feature relationships, HYolo provides improved contextual understanding and more reliable object detection performance in IoT-based environments. The results indicate that integrating hypergraph learning into object detection pipelines offers a promising direction for intelligent and context-aware IoT vision systems.
[CV-76] Robust Multi-view Clustering against Imperfect Information
链接: https://arxiv.org/abs/2606.04343
作者: Zhichao Huang,Haochen Zhou,Hao Wang,Mouxing Yang,Xi Peng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 11 figures
Abstract:Real-world multi-view data always suffer from the imperfect information problem, where the view-specific observations are absent (i.e., Incomplete Views, IV) and cross-view correspondences are mismatched (i.e., Noisy Correspondences, NC) for certain instances. As a remedy, numerous IV- and NC-oriented multi-view clustering (MvC) methods have been proposed, which however require either reliable correspondences or sufficiently complete instances, thus stopping short of addressing the imperfect information problem. In contrast, we observe that both IV and NC challenges originate from the same issue of imperfect cross-view counterpart information, where the counterpart of an anchor instance in another view might be either unavailable or unreliable. Based on this observation, we propose a novel robust MvC framework, termed Posterior-guided Latent Counterpart Inference (PLCI), which can handle both IV and NC in a unified manner. Specifically, PLCI formulates the desired cross-view counterpart of each anchor instance as a latent variable, and integrates both instance-level reliability and prototype-level semantic transport to infer the posterior distribution of the latent counterpart. Extensive experiments on six widely-used multi-view datasets against 10 state-of-the-art MvC methods demonstrate the effectiveness of PLCI for tackling the imperfect information problem. The code will be released upon acceptance.
[CV-77] Answer Self-Consistency with Margin-Triggered Question Re-Arbitration for the CVPR 2026 VidLLMs Challenge
链接: https://arxiv.org/abs/2606.04323
作者: Tomoya Miyazawa,Hiroyasu Okuno
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this report, we present our solution for Track 2 of the CVPR 2026 VidLLMs Challenge. This track evaluates visual relational reasoning in videos, where models must infer relations that are not always explicitly visible. We propose Answer Self-Consistency with Margin-Triggered Question Re-Arbitration (ASC-MQRA), a training-free test-time reasoning framework built on a multimodal reasoning model. The core ASC component performs multiple stochastic video question-answering runs and aggregates their answer choices through answer-level self-consistency. This substantially improves over single-pass inference and forms our final test submission. We further study MQRA, a conditional re-arbitration module for low-margin examples where the first-stage vote distribution indicates uncertainty. Our vote-margin analysis shows that low-margin examples often retain the ground-truth answer among the top candidates, motivating MQRA to narrow the candidate set and re-watch the video only over the retained candidates. On validation, MQRA further improves over ASC, indicating that low-margin vote distributions can provide a useful uncertainty signal. On test, however, MQRA slightly degrades performance relative to ASC, suggesting that re-arbitration is sensitive to the size and category distribution of the triggered subset. Our final test submission therefore uses ASC without re-arbitration, achieving 72.73 average accuracy and 78.34 category-wise macro average accuracy on validation, and 81.16 average accuracy and 80.91 category-wise macro average accuracy on test. This report details our prompting strategy, implementation setup, ablation studies, and diagnostic analyses. The code is available at this https URL
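Answer-level self-consistency with a vote-margin trigger can be sketched as below. The margin definition (gap between the top two vote counts divided by the number of runs), the threshold value, `top_k`, and the function name `vote_with_margin` are assumptions for illustration; the report's exact trigger rule is not given in the abstract.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def vote_with_margin(
    answers: Sequence[str],
    margin_threshold: float = 0.2,
    top_k: int = 2,
) -> Tuple[str, float, List[str]]:
    """Majority-vote over sampled answers and flag low-margin cases.

    Returns the winning answer, the normalized vote margin, and the candidate
    set that a second re-arbitration pass would be restricted to.
    """
    ranked = Counter(answers).most_common()
    winner, top_votes = ranked[0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
    margin = (top_votes - runner_up_votes) / len(answers)
    if margin < margin_threshold:
        # Low margin: keep only the top-k candidates for a second look.
        candidates = [answer for answer, _ in ranked[:top_k]]
    else:
        candidates = [winner]
    return winner, margin, candidates


confident = vote_with_margin(["A", "A", "A", "B", "A"])
uncertain = vote_with_margin(["A", "B", "A", "B", "C"])
```

In the first call the vote is decisive and only the winner is kept. In the second the top two answers are tied, so both are retained as the narrowed candidate set for re-arbitration.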
[CV-78] PureLight: Learning Complex Luminaires with Light Tracing
链接: https://arxiv.org/abs/2606.04319
作者: Pedro Figueiredo,Zixuan Li,Beibei Wang,Miloš Hašan,Nima Khademi Kalantari
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 10 figures
Abstract:We propose a neural formulation for estimating the appearance of complex luminaires. We focus on challenging luminaires with complex light transport (e.g., small emitters enclosed by multiple specular layers) that are difficult for (bidirectional) path tracing. To this end, we use light tracing to construct paths from emitters to the exit surfaces and formulate appearance estimation as a distribution learning problem. Specifically, we model the probability density function (pdf) of outgoing radiance on the exit surfaces using a large normalizing flow network, and recover the outgoing radiance as the product of the estimated pdf and flux. To enable efficient inference, we distill the learned appearance into a lightweight MLP that directly estimates radiance on the exit surfaces. We additionally train a sampling network for effective direct illumination computation from the luminaire, and a blending network to composite the luminaire into the scene. Our formulation makes it feasible to render challenging luminaires using low sample counts in arbitrary scenes.
[CV-79] XSSR: Cross-Domain Self-Supervised Representative Selection for Efficient Annotation in Medical Image Segmentation
链接: https://arxiv.org/abs/2606.04301
作者: Byunghyun Ko,Aleksei Anisimov,Kobe Ke,Suhas Bharthepude,Jeongkyu Lee
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the Third International Conference on AI in Healthcare (AIiH 2026). This is the preprint version of the paper
Abstract:Acquiring labeled medical image data is resource-intensive, a challenge further exacerbated in cross-domain scenarios where source and target datasets differ in imaging equipment, population, or clinical site. This study introduces XSSR (Cross-Domain Self-Supervised Representative Selection), a framework designed to minimize annotation effort in the target domain while maintaining robust segmentation performance. XSSR comprises three stages: first, a Masked Autoencoder (MAE) is trained on unlabeled source data to establish a shared embedding space without requiring target labels; second, a greedy selection algorithm scores unlabeled target samples based on a composite density, novelty, and diversity criterion; and third, a U-Net segmentation model is trained exclusively on the selected subset. The novelty-diversity trade-off parameter, alpha, is automatically calibrated by minimizing embedding-space coverage, eliminating manual tuning. We evaluate XSSR on three public benchmarks: Chest X-ray, RIGA+ retinal fundus imaging, and multi-site Prostate MRI, each under a fixed 5% annotation budget. XSSR achieves 99.3% of full-data performance on Chest X-ray using only 22 labeled samples, surpasses random selection by up to 2.5 Dice points on Prostate MRI, and consistently outperforms the CoreSet baseline by 0.4 to 1.2 Dice points across all datasets. Ablation studies indicate that diversity is the most influential scoring component, and per-site analysis shows that performance correlates with scanner similarity to the source domain.
[CV-80] Efficient and Training-Free Single-Image Diffusion Models CVPR2026
链接: https://arxiv.org/abs/2606.04299
作者: Haojun Qiu,Kiriakos N. Kutulakos,David B. Lindell
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR 2026; Project Page: this https URL
Abstract:We consider the problem of generating images whose internal structure – defined by the distribution of patches across multiple scales – matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.
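For a finite patch dataset, the optimal denoiser has a closed form: the posterior mean is a softmax-weighted average of the clean patches. The sketch below assumes additive Gaussian noise of standard deviation `sigma` with no signal scaling (x_t = x_0 + sigma * eps) and a uniform prior over patches. The function name and toy patches are hypothetical, and the paper's multi-scale pipeline and acceleration techniques are not reproduced.

```python
import math
from typing import List, Sequence


def optimal_denoise(
    noisy: Sequence[float],
    patches: Sequence[Sequence[float]],
    sigma: float,
) -> List[float]:
    """Posterior mean E[x0 | x_t] for a uniform prior over a finite patch set."""
    # Log-weights: -||x_t - x_i||^2 / (2 sigma^2) for every clean patch x_i.
    log_w = [
        -sum((n - p) ** 2 for n, p in zip(noisy, patch)) / (2.0 * sigma ** 2)
        for patch in patches
    ]
    # Stable softmax over the log-weights.
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    total = sum(weights)
    dim = len(noisy)
    return [
        sum(w * patch[d] for w, patch in zip(weights, patches)) / total
        for d in range(dim)
    ]


patch_set = [[0.0, 0.0], [1.0, 1.0]]
low_noise = optimal_denoise([0.9, 1.0], patch_set, sigma=0.05)
high_noise = optimal_denoise([0.9, 1.0], patch_set, sigma=100.0)
```

At low noise the output snaps to the nearest patch, while at high noise it approaches the dataset mean. The score function follows from this posterior mean as (E[x0 | x_t] - x_t) / sigma^2, which is why no network training is needed in this setting.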
[CV-81] A Cookbook of 3D Vision: Data, Learning Paradigms, and Application CVPR2026
链接: https://arxiv.org/abs/2606.04291
作者: Hongyang Du,Zongxia Li,Dawei Liu,Runhao Li,Haoyuan Song,Qingyu Zhang,Yubo Wang,Jingcheng Ni,Shihang Gui,Congchao Dong,Tao Hu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the CVPR 2026 OpenSUN3D Workshop. Official version available at CVF Open Access. this https URL
Abstract:3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, making it difficult to develop unified perspectives on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analysing the principal structural representations of 3D data–point clouds, meshes, voxels, and 3D Gaussians–along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes shape recent advances, spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, offering a consolidated view of emerging trends toward balancing efficiency and fidelity and toward multimodal geometric grounding.
[CV-82] FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs
链接: https://arxiv.org/abs/2606.04282
作者: Eshika Khandelwal,Jingjing Pan,Mingfang Zhang,Quan Kong,Lorenzo Garattoni,Hilde Kuehne
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) are predominantly evaluated on free-form vision-language tasks such as visual question answering, captioning, and summarization. However, their practical use is rapidly expanding to more structured computer vision settings, where users prompt models to perform localization-centric tasks such as object detection, often within larger agentic or decision-making systems. Despite this shift, there is currently no standardized benchmark that systematically evaluates these capabilities at scale. In this work, we introduce the first comprehensive benchmark specifically designed to assess the promptable localization abilities of generalist MLLMs. Our benchmark spans four core task categories: object detection, referring expression detection, instance-level detection, and video-based detection. To enable consistent and fair evaluation, we develop a unified framework that standardizes inputs, enforces parsable bounding box outputs, and defines transparent evaluation protocols across tasks. Using this suite, we evaluate a diverse set of open-source and proprietary MLLMs, providing an in-depth analysis of their performance and limitations. Beyond accuracy, we examine models’ ability to adhere to output format specifications, showing that current systems are highly sensitive to formatting constraints and often fail to generalize even to minor variations. Our results highlight both the strengths and shortcomings of state-of-the-art MLLMs in localization settings, and point toward important directions for improving multimodal model design and evaluation.
[CV-83] StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets
链接: https://arxiv.org/abs/2606.04271
作者: Stepan Konev
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at this https URL.
[CV-84] Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation
Link: https://arxiv.org/abs/2606.04269
Authors: Yilong Wang,Cheng Qian,Edward Johns
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at this https URL.
[CV-85] UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation
Link: https://arxiv.org/abs/2606.04264
Authors: Zeyuan Yang,Hao-Wei Chen,Xueyang Yu,Yuncong Yang,Haoyu Zhen,Ziqiao Ma,Maohao Shen,Chuang Gan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent years have seen remarkable progress in unified vision-language models handling both multimodal understanding and generation within a single architecture. While autoregressive VLMs can reason across modalities, they fail to generate high-quality images. In contrast, diffusion models produce photorealistic visuals yet struggle to generate coherent text, making it challenging to develop a single unified model that can seamlessly handle both visual and text generation. Recent advances suggest that language can be effectively embedded within visual representations, allowing models to reason about textual semantics directly from images. To this end, we propose UniCanvas, a first attempt that unifies diffusion models to generate interleaved multimodal contents through text-in-image generation. Diffusion models naturally capture transformations on a shared pixel canvas, which can be viewed as world models of visual change. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images, leveraging its inherent multimodal embedding space. This design allows the model to “draw” text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation. Experiments demonstrate that UniCanvas improves performance over previous unified models, positioning text-in-image generation with diffusion models as a promising unified multimodal generation paradigm.
[CV-86] SBP-Net: Learning Thin Structure Reconstruction with Sliding-Box Projections ICIP2026
Link: https://arxiv.org/abs/2606.04251
Authors: Ofir Gilad,Andrei Sharf
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE ICIP 2026, 6 pages, 4 figures
Abstract:Reconstructing thin 3D structures is challenging due to their sparsity, scale variation, and complex geometry. Such structures arise in a wide range of domains, including medical imaging of vascular systems and industrial pipe systems. While recent neural methods perform well on dense surfaces, they often fail to recover fine thin geometries. We propose a reconstruction approach based on local depth projections, which provide an efficient and informative 2D representation of thin structures. Specifically, we traverse the 3D model with a sliding box to generate local orthographic depth projections, which are processed by a neural network to reconstruct missing thin structures in 2D. The local reconstructions are subsequently fused back into the 3D model to produce a coherent and detailed shape. Experiments on pulmonary artery reconstruction from CT volumes and industrial pipeline recovery from synthetic and real scans demonstrate improved preservation of fine structural details over existing methods.
[CV-87] Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement
Link: https://arxiv.org/abs/2606.04249
Authors: Lixuan Chen,Zhongnan Liu,Jesse Hamilton,James M. Balter,Jeong Joon Park,Liyue Shen
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, a Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments on both XCAT digital phantoms and in-house abdominal MRI datasets demonstrate that PDMR achieves high-fidelity and temporally consistent reconstruction across multiple prospective scenarios (Immediate and After-2min), outperforming state-of-the-art retrospective and online methods. Our results suggest a promising pathway toward ultra-fast, motion-aware prospective MRI reconstruction in clinical practice.
[CV-88] Spatial Artifact Coherence Determines Codec Robustness in Patch-Based rPPG
Link: https://arxiv.org/abs/2606.04198
Authors: Achraf Ben Ahmed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Remote photoplethysmography (rPPG) achieves low heart-rate error on uncompressed benchmarks yet is deployed over compressed video channels in telehealth, neonatal ICU, and driver fatigue applications. No prior work identifies the physical quantity determining when spatial decomposition outperforms global-projection methods under codec compression. We propose Spatial Artifact Coherence (SAC), defined as the ratio of off-diagonal to diagonal energy in the 4x4 inter-patch Green-channel covariance matrix (bandpass 0.75-2.5 Hz), and the PatchPCA algorithm family (four codec-aware rPPG algorithms). We evaluate 280 subjects across three public datasets, 11 codec degradation variants (MPEG-4, H.265, H.264, JPEG, chroma subsampling), and 13 algorithms via Wilcoxon tests (BH-FDR, q < 0.05, 904 tests). SAC explains 93.8% of between-variant variance in PCA advantage (r = +0.969), with zero overlap between codec families: non-MPEG-4 variants cluster at SAC 0.10-0.18 with 84-90% PCA win rates, while MPEG-4 variants cluster at SAC 0.48-0.59 with 61% win rate and a 5.8x reduction in mean improvement. Within subjects, 78% confirm the expected pattern (p < 10^-22, dz = 0.73). Within-variant subject-level SAC correlation is r = +0.099, confirming SAC classifies codec families rather than predicting individual outcomes. MPEG-4’s effect is structural (macroblock DCT geometry, not noise amplitude), governed by source codec state, not resolution. P-Hybrid is identified as the most deployment-robust algorithm. Two necessary operating conditions for PatchPCA advantage are established: SAC < 0.30 and low-to-moderate motion, directly ruling out raw-to-MPEG-4 transcoding pipelines. SAC provides a physically grounded metric for codec-aware rPPG algorithm selection in clinical remote monitoring systems.
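The SAC definition in this abstract (off-diagonal versus diagonal energy of an inter-patch covariance matrix) can be illustrated with a small sketch. It rests on assumptions the abstract does not confirm: "energy" is taken to be the sum of squared covariance entries, the per-patch green-channel traces are assumed to be band-pass filtered already, and the function names are hypothetical. The paper's exact normalization may differ.

```python
from statistics import fmean


def covariance_matrix(signals):
    """Sample covariance between equal-length patch traces (one list per patch)."""
    n = len(signals[0])
    means = [fmean(s) for s in signals]
    centered = [[x - m for x in s] for s, m in zip(signals, means)]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]


def spatial_artifact_coherence(signals):
    """Ratio of off-diagonal to diagonal energy of the inter-patch covariance.

    Assumption of this sketch: energy = sum of squared matrix entries.
    """
    cov = covariance_matrix(signals)
    k = len(cov)
    diag_energy = sum(cov[i][i] ** 2 for i in range(k))
    off_energy = sum(
        cov[i][j] ** 2 for i in range(k) for j in range(k) if i != j
    )
    return off_energy / diag_energy
```

Under this definition, mutually uncorrelated patch traces give a value of 0, while four identical traces give a value close to 3 (twelve off-diagonal entries against four diagonal ones), so larger values indicate artifacts that are shared across patches.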
[CV-89] GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs ACL2026
Link: https://arxiv.org/abs/2606.04184
Authors: Weidong Tang,Jierui Li,Yueling Hou,Zihan Mei,Can Zhang,Xinyan Wan,Zhiyuan Liang,Pengfei Zhou,Yang You,Wangbo Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACL 2026
Abstract:True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group-level outcomes. Despite notable progress in individual-level Theory of Mind (ToM) reasoning, existing multimodal large language models fail at this broader task. Collective behavior emerges non-linearly from social tensions, conformity dynamics, and structural constraints, meaning it cannot be recovered by merely summing individual intentions. We present GroupToM-Bench, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven-level cognitive audit framework. Experiments reveal a gap between current models and human baselines, highlighting a failure to process social structures and non-linear collective dynamics.
[CV-90] End-to-End Text Line Detection and Ordering
Link: https://arxiv.org/abs/2606.04166
Authors: Benjamin Kiessling(ALMAnaCH)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Practical text-recognition pipelines for historical documents typically decompose layout analysis into line detection followed by a separate reading-order step, with the latter most often handled by a hand-coded geometric heuristic that struggles with marginalia, multiple columns, tables, and source-specific editorial conventions. This article introduces Orli (Ordered Regression of Lines), an end-to-end model that casts both sub-tasks as a single image-to-sequence problem: from a page image, Orli autoregressively generates text-line baselines directly in reading order. Baselines are represented in a chord-frame parameterization that anchors a line’s position, orientation, and extent while encoding local geometry through perpendicular offsets; an iterative refinement head and a local visual refiner produce the final curve. Trained on a heterogeneous corpus of 196,691 pages spanning ten writing systems, Orli marginally exceeds the previously reported state of the art for cBAD line detection without dataset-specific training, reaches near perfect coverage and ordering on multiple reading-order benchmarks zero-shot, and adapts to more specialized out-of-domain layouts with limited fine-tuning. The method’s source code and model weights are available under an open license at this https URL.
[CV-91] Pinpoint: Grounded Worldwide Image Geolocation via Cross-Source Retrieval and Reranking
Link: https://arxiv.org/abs/2606.04133
Authors: Nika Chuzhoy,Brian Hu,Amit A. Arora,Jae Ro,Sarthak S. Sahu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image geolocation aims to estimate where a photograph was taken from its visual content. At worldwide scale, this remains challenging because visual evidence is often ambiguous, diverse, and unevenly distributed. Prior work has typically treated geolocation of ordinary internet photos and street-view imagery as separate tasks, despite their complementary strengths: internet photos better match the appearance distribution of user-captured queries, while street-view imagery provides denser, geographically grounded coverage. We present Pinpoint, a retrieve-and-rerank architecture that combines both sources in a coarse-to-fine pipeline. A contrastive image-GPS embedder is trained on both user-uploaded Flickr photos and street-view imagery, learning a shared image-GPS embedding space that is used to retrieve candidate locations. An attention-based reranker then rescores retrieved candidates by combining candidate-level visual and GPS features with cross-source evidence from nearby locations to ground the prediction. Unlike recent prior work, Pinpoint does not rely on multimodal large-language models, making inference faster and more reproducible. Pinpoint achieves state-of-the-art results across all metrics on standard benchmarks for internet photos (IM2GPS3k and YFCC4k) and street-view imagery (OSV-5M).
[CV-92] SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation
Link: https://arxiv.org/abs/2606.04108
Authors: Guangda Ji,Qimin Chen,Qinchan Li,Mingrui Zhao,Kai Wang,Hao Zhang
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements, and in practice, often fall short. Symmetry is one such requirement: violations, even subtle ones, on symmetry can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation of TRELLIS.2, without retraining the underlying VAE or flow model. Our key idea is to approximate the latent-space action of spatial transformations as a learned linear operator on voxel latents, implemented as a lightweight spatial-transform latent mapper trained on generic, non-symmetric 3D data. At generation time, we enforce symmetry by averaging predicted flow velocities across all symmetry-equivalent transformations at each ODE step, a process we call velocity symmetrization. The symmetry specification can be estimated automatically from an initial TRELLIS.2 generation or supplied by the user, enabling deliberate fold manipulation beyond what the input image suggests. On a curated benchmark of 266 strictly symmetric objects spanning 2- to 20-fold rotations and polyhedral symmetry groups, SymTRELLIS substantially reduces all symmetry error metrics compared to TRELLIS.2, Hunyuan3D-2.1, and TripoSG, while maintaining reconstruction accuracy comparable to the base model.
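The velocity symmetrization step described in this abstract (averaging predicted velocities over all symmetry-equivalent transformations) can be illustrated with a toy example. This is only a sketch under loud assumptions: it works on 2D points with a cyclic rotation group rather than on voxel latents, it applies the group action exactly instead of through a learned latent mapper, and `toy_velocity`, `rotate`, and `symmetrize_velocity` are hypothetical names.

```python
import math


def rotate(p, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def symmetrize_velocity(velocity, point, n_fold):
    """Average g^{-1} v(g x) over the n rotations g of the cyclic group C_n."""
    vx = vy = 0.0
    for k in range(n_fold):
        angle = 2.0 * math.pi * k / n_fold
        v = velocity(rotate(point, angle))  # evaluate at the transformed point
        v_back = rotate(v, -angle)          # map the velocity back to the original frame
        vx += v_back[0]
        vy += v_back[1]
    return (vx / n_fold, vy / n_fold)


def toy_velocity(p):
    """Hypothetical non-symmetric velocity field standing in for a flow model."""
    return (1.0 + 0.5 * p[1], -0.3 * p[0] + p[0] * p[1])
```

The averaged field is equivariant under the chosen rotations: evaluating it at a rotated point gives the rotated velocity, so an ODE integrated with it keeps a symmetric state symmetric.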
[CV-93] Reflection Separation from a Single Image via Joint Latent Diffusion CVPR2026
Link: https://arxiv.org/abs/2606.04107
Authors: Zheng-Hui Huang,Zhixiang Wang,Yu-Lun Liu,Yung-Yu Chuang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026. Project page: this https URL
Abstract:Single-image reflection separation is highly challenging under extreme conditions like glare or weak reflections. Existing methods often struggle to recover both layers in glare or weak-reflection scenarios because of insufficient information. This paper presents a diffusion model explicitly fine-tuned for this task, leveraging generative diffusion priors for robust separation. Our method simultaneously generates transmission and reflection layers through a unified diffusion model, incorporating a novel cross-layer self-attention mechanism for better feature disentanglement. We further introduce a disjoint sampling strategy to iteratively reduce interference between the layers during diffusion and a latent optimization step with a learned composition function for improved results in complex real-world scenarios. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods on multiple real-world benchmarks. Project page: this https URL
[CV-94] When Seeing Is Not Believing – A Benchmark for Search-Grounded Video Misinformation Detection
Link: https://arxiv.org/abs/2606.04098
Authors: Tao Yu,Yujia Yang,Shenghua Chai,Zhang Jinshuai,Haopeng Jin,Hao Wang,Minghui Zhang,Zhongtian Luo,Yuchen Long,Xinlong Chen,Jiabing Yang,Zhaolu Kang,Yuxuan Zhou,Zhengyu Man,Xinming Wang,Hongzhu Yi,Zheqi He,Xi Yang,Yan Huang,Liang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 52 pages
Abstract:Video misinformation increasingly operates at the semantic and evidential level: authentic footage may be selectively edited, temporally reordered, spliced across sources, or augmented with AI-generated content to construct false narratives. Such evidence-dependent manipulations cannot be reliably verified from the input video alone, because the missing, reordered, replaced, or recontextualized evidence lies outside the video itself. We introduce EVID-Bench, a benchmark for search-grounded video misinformation detection, where a system must search the open web for related videos and identify what information is false through cross-video comparison. EVID-Bench comprises 222 videos spanning 9 manipulation types across 3 categories: AI generation, single-source editing, and multi-source editing. All samples are verified to be undetectable by frontier models through visual inspection alone. We evaluate nine frontier multimodal models using a retrieval-augmented verification baseline. The best system achieves only 61.43% point-level accuracy and 43.24% video-level accuracy, while AI-generated manipulations remain especially challenging. Error analysis reveals recurring challenges: models fixate on irrelevant anchors, misattribute synthetic content to editorial splicing, and terminate search prematurely before fully explaining the manipulation.
[CV-95] Optimal Transport Flow Matching by Design WWW
Link: https://arxiv.org/abs/2606.04092
Authors: Shimon Malnick,Matan Rusanovsky,Ohad Fried,Shai Avidan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL
Abstract:Flow matching models learn to transport samples from a simple prior distribution to a complex data distribution. When prior-data pairs are coupled via optimal transport (OT), the learned trajectories are straight and non-crossing, enabling fast, even single-step, generation. However, computing the OT coupling in high dimensions is intractable, and existing methods attempt to solve the OT problem, at the cost of persistent bias or significant overhead. Rather than solving for the OT coupling, we reformulate the problem. Once the prior is treated as a design choice rather than a fixed input, the OT coupling between prior and data is no longer unique. Many priors admit an OT-optimal identity coupling to the data, leaving us free to choose one that is also tractable to sample. We identify low-frequency projection of natural images as such a choice. The identity coupling between data and its low-frequency representation is empirically OT-optimal, the prior is structured enough to be sampled by a lightweight model at inference, and the remaining flow-matching task reduces to synthesizing high-frequency detail. Interpolating the prior with Gaussian noise further improves generation quality while preserving the OT coupling. The approach requires no modifications to the flow model itself, and integrates naturally with latent-space models, classifier-free guidance, and one-step generation frameworks. Across all benchmarks, our method reduces trajectory curvature by more than 2× compared to existing flow matching methods, yielding better generation quality in the few-step regime.
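The identity coupling described in this abstract, where each data sample is paired with its own low-frequency projection, can be illustrated on a toy 1D signal. This is a minimal sketch under assumptions: a clamped moving average stands in for the low-frequency projection, the linear interpolation path is the standard flow-matching one, the Gaussian-noise interpolation of the prior is omitted, and `low_pass` and `flow_matching_pair` are hypothetical names rather than the authors' code.

```python
def low_pass(signal, radius=1):
    """Moving average with edge clamping (a stand-in low-frequency projection)."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [
            signal[min(max(j, 0), n - 1)]
            for j in range(i - radius, i + radius + 1)
        ]
        out.append(sum(window) / len(window))
    return out


def flow_matching_pair(data, t, radius=1):
    """Build one training pair under the identity coupling prior = low_pass(data).

    Returns the interpolated state x_t and the regression target for the
    velocity, which here equals the high-frequency residual data - prior.
    """
    prior = low_pass(data, radius)
    x_t = [(1.0 - t) * p + t * d for p, d in zip(prior, data)]
    target = [d - p for p, d in zip(prior, data)]
    return x_t, target
```

Because the prior is computed from the sample it is paired with, the target velocity is just the detail removed by the filter, which matches the abstract's remark that the remaining task reduces to synthesizing high-frequency detail.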
[CV-96] Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning
Link: https://arxiv.org/abs/2606.04061
Authors: Yang Liu,Wentao Feng,Shu-Dong Huang,Yalan Ye,Jiancheng Lv
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large-scale web-harvested datasets have fueled the progress of cross-modal retrieval but inevitably suffer from noisy correspondence, which severely degrades model generalization. Existing methods primarily address this by filtering out noise or seeking a substitute label, yet they predominantly remain bound by a “Discrete Selection” paradigm. We argue that relying on a single discrete proxy induces Single-Point Fragility and Discretization Error. To overcome these limitations, we propose a novel framework, Intra-modal Neighbor-aware Noise Rectification (IN2R), which shifts the paradigm from searching for a substitute to synthesizing a reliable supervision target. Leveraging the intrinsic geometric stability of intra-modal data, IN2R employs a Graph Refiner to perform relational reasoning over neighbors retrieved from a dynamic Cross-Model Memory. Instead of propagating discrete labels, our method synthesizes a continuous, soft prototype that reflects the consensus of the local semantic neighborhood, effectively rectifying inter-modal misalignment. Extensive experiments on Flickr30K, MS-COCO, and CC152K demonstrate that IN2R significantly outperforms state-of-the-art methods. Our code and pre-trained models are publicly available at this https URL.
[CV-97] Weakly Supervised Incremental Segmentation via Semantic Anchors and Spatial Arbitration ICME2026
Link: https://arxiv.org/abs/2606.04060
Authors: Zhonggai Wang,Kai Fang,Guangyu Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME2026
Abstract:Weakly Incremental Learning for Semantic Segmentation (WILSS) suffers from the continuous introduction of noisy supervision, which progressively corrupts class-level representations, leading to severe feature drift and semantic corruption, thereby causing newly learned classes to overwrite old ones. To address these issues, we propose a drift-resilient WILSS approach, named SASA, designed to stabilize semantic learning via Semantic Anchors and Spatial Arbitration. Specifically, at the representation level, we introduce semantic anchors of learnable tokens as rigid class-level references to preserve long-term semantic identity. Complementary to this, an elastic residual adaptation facilitates controlled, instance-specific refinement, ensuring a stable yet flexible learning trajectory. At the supervision level, we develop a Spatial Label Arbitration mechanism that performs geometry-aware decisions to directly filter unreliable signals and enforce a strict “one object, one class” constraint. By synergistically stabilizing representations and improving supervision reliability, SASA effectively mitigates feature drift under weak supervision. Extensive experiments on standard benchmarks demonstrate that our approach consistently outperforms existing state-of-the-art methods, particularly in challenging multi-step incremental settings. The code is available at this https URL.
[CV-98] PointAction: 3D Points as Universal Action Representations for Robot Control
Link: https://arxiv.org/abs/2606.03943
Authors: Mutian Tong,Han Jiang,Qiao Feng,Lingjie Liu,Jiatao Gu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL
Abstract:Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.
[CV-99] SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
Link: https://arxiv.org/abs/2605.13672
Authors: Giries Abu Ayoub,Morad Tukan,Loay Mualem
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.
[CV-100] L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI MICCAI2026
Link: https://arxiv.org/abs/2606.04419
Authors: Arda Atalık,Sumit Chopra,Daniel K. Sodickson
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: Accepted to MICCAI 2026
Abstract:MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a highly informative subject-specific context, but practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work, we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that leverages prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between prior and current scans. It further accommodates differences in acquisition protocols across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at this http URL.
[CV-101] TGSD: Topology-Guided State-Space Diffusion for EEG Spatial Super-Resolution
Link: https://arxiv.org/abs/2606.03998
Authors: Zijian Kang,Weiming Zeng,Yueyang Li,Shengyu Gong,Hongjie Yan,Wai Ting Siok,Nizhuan Wang
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Low-density EEG is more suitable for wearable and IoT-based brain sensing, but sparse electrode sampling often lacks sufficient spatial information to characterize cross-regional neural activity. EEG spatial super-resolution aims to recover dense-channel EEG from sparse recordings, yet remains challenging because channel missingness typically occurs at the whole-channel level, spatiotemporal dependencies over the full electrode layout are often underexplored, and the mapping from sparse to dense signals is inherently ambiguous. To address these issues, we propose TGSD, a topology-guided state-space diffusion framework for EEG spatial super-resolution. TGSD first employs a Hierarchical Spatial Prior Encoder to learn topology-aware priors over the complete electrode layout by integrating local geometric relationships with region-level contextual information. Based on these priors and sparse observations, a Conditional State-Space Diffusion Reconstructor progressively generates missing-channel signals through reverse diffusion, while alternating temporal and channel-wise state-space modeling captures long-range temporal dynamics and inter-channel dependencies in a unified framework. Experiments on the SEED and PhysioNet MM/I datasets show that TGSD consistently outperforms representative baselines under different super-resolution factors in both reconstruction fidelity and downstream classification performance. These results demonstrate the effectiveness of combining topology-aware spatial priors with conditional diffusion for enhancing practical low-density EEG sensing in wearable and IoT scenarios. The official implementation code is available at this https URL.
Artificial Intelligence
[AI-0] Multi-Column RBF Neural Network Using Adaptive and Non-Adaptive Particle Swarm Optimization
Link: https://arxiv.org/abs/2606.05150
Authors: Ammar Hoori,Yuichi Motai
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 15 Page, Under Review
Abstract:The radial basis function neural network (RBFN) trained with a gradient descending algorithm provides an effective fully connected structure in both shallow and deep networks. The error correction (ErrCor), a state-of-the-art gradient-based training method, selects optimal hidden units to improve accuracy. Alternatively, as a population-based algorithm, the particle swarm optimization algorithm (PSO) uses the swarm experience to optimize RBFN parameters, offering global search and robustness to local minima. Adaptive PSO (APSO) has emerged as an improved variant of PSO. APSO algorithm improves convergence speed by dynamically adjusting swarm parameters during optimization. Both ErrCor and PSO demonstrate improved results and competitive convergence. However, with large datasets, these methods face scalability challenges such as excessive kernel computations and large hidden layer structures. A recent multi-column RBFN approach (MCRN) improves ErrCor performance by deploying small RBFNs in a parallel system. Inspired by MCRN’s success, we propose two novel approaches to improve PSO performance: the multi-column RBFN with PSO (MC-PSO) and the multi-column RBFN with APSO (MC-APSO). These methods introduce parallel RBFN structures trained using evolutionary swarm methods. Each RBFN is independently trained on a specific spatial subset of the dataset using either PSO or APSO algorithms. These resulting specialist-trained RBFNs are tailored to their respective subsets. During testing, only selected RBFNs, where the test instance neighbors are located, contribute to the multi-column output. This specialization improves accuracy, while parallelism enhances speed. We evaluate the proposed methods on various benchmark datasets. The MC-PSO and MC-APSO outperform ErrCor, PSO, APSO, and MCRN in terms of accuracy and recall. They also demonstrate faster training and testing times in most experiments.
[AI-1] Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent
Link: https://arxiv.org/abs/2606.05130
Authors: Linyao Chen,Qinlao Zhao,Zechen Li,Mingming Li,Likun Ni,Jinyu Chen,Yuhao Yao,Xuan Song,Noboru Koshizuka,Hiroki Kobayashi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task-specific training and offer limited decision-level transparency. Recent LLM-based methods improve interpretability, yet mostly rely on static prompts and single-pass inference, limiting their ability to seek additional evidence when mobility signals are weak or conflicting. We propose AgentMob, a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making. AgentMob resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay-move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non-fast-path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same-tool statistical baseline, showing that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Our code is available at this https URL.
[AI-2] Knowledge Index of Noahs Ark
Link: https://arxiv.org/abs/2606.05104
Authors: Sheng Jin,Minghao Liu,Yunze Xiao,Zeqi Zhou,Heli Qi,Yifan Yao,Meishu Song,Kaijing Ma,Xuan Zhang,Sicong Jiang,Yizhe Li,Ningshan Ma,Jie Wei,Ziniu Li,Minglai Yang,Bangya Liu,Yiming Liang,Xiao Fang,Qingcheng Zeng,Jiarui Liu,Rui Yang,Shen Yan,Wenhao Huang,Jiaheng Liu,Zihan Wang,Weihao Xuan,Ge Zhang
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B ≥ ΔC / Δp_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.
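The (1-1/e) figure in the abstract above is the classical guarantee for greedy selection on a monotone submodular objective under a cardinality budget. The sketch below illustrates that greedy rule on plain set coverage. It is not KINA's actual proxy objective: the item names, the anchor sets, and the function `greedy_cover` are all hypothetical.

```python
def greedy_cover(candidates, budget):
    """Pick up to `budget` items, each time taking the largest marginal coverage.

    `candidates` maps a hypothetical item id to the set of anchors it covers.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    for _ in range(budget):
        # Item that covers the most anchors not yet covered.
        best = max(remaining, key=lambda k: len(remaining[k] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Toy data (illustrative only): four candidate items over anchors a..g.
items = {
    "q1": {"a", "b", "c"},
    "q2": {"c", "d"},
    "q3": {"d", "e", "f", "g"},
    "q4": {"a", "g"},
}
chosen, covered = greedy_cover(items, budget=2)
print(chosen, sorted(covered))
```

Each round only needs the marginal gain of every remaining item, which is why this kind of guarantee attaches to whatever proxy is being maximized rather than to the underlying population.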
[AI-3] AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
Link: https://arxiv.org/abs/2606.05080
Authors: Zhangchen Xu,Junda Chen,Yue Huang,Dongfu Jiang,Jiefeng Chen,Hang Hua,Zijian Wu,Zheyuan Liu,Zexue He,Lichi Li,Shizhe Diao,Jiaxin Pei,Jinsung Yoon,Hao Zhang,Mengdi Wang,Radha Poovendran,Misha Sra,Alex Pentland,Zichen Chen
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code: this https URL; Website: this https URL
Abstract:Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent’s initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.
[AI-4] Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols AAMAS2026
Link: https://arxiv.org/abs/2606.05043
Authors: Samuel H. Christie V,Amit K. Chopra,Munindar P. Singh
Categories: Artificial Intelligence (cs.AI)
Comments: Presented in the Engineering Multiagent Systems Workshop co-located with the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Abstract:The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing industry efforts in Agentic AI. Specifically, we consider UCP, the Universal Commerce Protocol, a recent Google-led effort to standardize e-commerce interactions for AI agents. Our exercise is in two parts. One, we model the part of UCP dealing with checkouts as a declarative Langshaw protocol and implement agents using Peach, a programming model for Langshaw. This part of the exercise brings out the advantages of formal, declarative specifications. Two, we show that Peach agents can interoperate with UCP agents implemented by Google, thereby establishing the fidelity of our approach with respect to UCP. Such interoperation enables the incremental introduction of declarative protocols and agents into a conventional setting, indicating a pathway by which EMAS ideas could influence practice without demanding a wholesale update.
[AI-5] Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery
Link: https://arxiv.org/abs/2606.05037
Authors: Arquimedes Canedo,Grama Chethan
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:When an AI agent calls an API and hits a validation error, it needs more than what went wrong – it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery\this http URL[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot (N=30 per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by +36.7–40.0 pp over plain-English diagnoses on Anthropic models (Fisher’s exact p ≤ 0.0022), at 1.8–2.2× better per-success token efficiency. The lift is not significant on gpt-4o-mini (p=0.435); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit_prompt\this http URL as reusable CI infrastructure. Code and data: this https URL.
[AI-6] Invariant Gradient Alignment for Robust Reasoning Distillation
Link: https://arxiv.org/abs/2606.05025
Authors: Zehua Cheng,Wei Dai,Jiahao Sun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages
Abstract:Large language models (LLMs) suffer from shortcut learning: they systematically fail on out-of-distribution (OOD) inputs whose semantic surface differs from training data, even when the logical structure is identical. This undermines knowledge distillation pipelines that transfer chain-of-thought reasoning to smaller students. We introduce Invariant Gradient Alignment (IGA), a training framework that aligns gradient updates across semantically diverse but logically isomorphic examples via three innovations: (i) Logical Isomer Sets, groups of problems sharing identical logical structure across distinct semantic domains (mathematics, medicine, law, science); (ii) a differentiable Continuous Gradient Conflict Mask, that suppresses parameter dimensions with high cross-domain gradient variance while preserving invariant directions; and (iii) a truncated SVD projection of the masked gradient back onto the LoRA low-rank manifold, maintaining parameter efficiency throughout. Theoretically, IGA yields tighter OOD generalization bounds than ERM, scaling with the number of isomer domains, and converges at the standard SGD rate under mild regularity. Empirically, IGA outperforms eight baselines across four benchmarks with accuracy gains up to 14.3 pp over ERM-SFT and a Logical Consistency Score of 0.031 versus 0.142 – a fourfold improvement in representational invariance.
[AI-7] SharedRequest: Privacy-Preserving Model-Agnostic Inference for Large Language Models ACL2026
Link: https://arxiv.org/abs/2606.05004
Authors: Peihua Mai,Xuanrong Gao,Youlong Ding,Xianglong Du,Wei Liu,Yan Pang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2026 (main)
Abstract:With the widespread deployment of public large language models (LLMs) such as ChatGPT, protecting user prompt privacy has become an increasingly critical issue. Existing privacy-preserving inference methods sacrifice either utility or efficiency, and often require model-specific modifications that limit their compatibility. In this paper, we propose SharedRequest, a model-agnostic framework for privacy-preserving LLM inference that reformulates privacy protection at the batch level rather than the individual-prompt level. The key idea is to obscure sensitive information by mixing original prompts with noisy variants, while grouping semantically equivalent instructions to amortize the inference cost over a large batch of queries with minimal impact on LLM response quality. This design is independent of the LLM architecture, requiring no access to model parameters or architectural modification. Empirical results demonstrate that SharedRequest achieves over 20% higher utility compared to prior differential privacy baselines, and its shared-prompt mechanism reduces query cost by up to 5× compared to non-batched inference.
[AI-8] From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents
Link: https://arxiv.org/abs/2606.04990
Authors: Yiqi Wang,Jiaqi Zhang,Taotao Cai,Zirui Liu,Qingqiang Sun,Zequn Sun,Zhangkai Wu,Mingkai Zhang,Yanming Zhu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected throughout agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation can move from final-answer correctness toward process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.
[AI-9] From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents
Link: https://arxiv.org/abs/2606.04967
Authors: Sanderson Oliveira de Macedo
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with process, roles, artifacts and verification. Recent surveys map agents and LLMs for software engineering, but a study centered on the operational frameworks that turn these capabilities into process is missing. We ran a directed search of primary sources, with a functional inclusion criterion and traction measurement, and selected six frameworks: GitHub Spec Kit, OpenSpec, BMAD Method, Get Shit Done (GSD), Spec Kitty and Reversa. Each attacks AI development through a different path: spec-driven development in full and lightweight variants, agent-driven agile planning, context engineering over the agent, worktree isolation and review, and recovery of operational specifications from legacy systems. Our central contribution is a six-dimension process taxonomy: specification, context, roles, execution, validation and portability, with a scoring rubric that turns it into a replicable instrument. We apply it to the six frameworks and an out-of-sample case, Spec-Flow. Two results stand out. Among frameworks that already adopt some process there is convergence: the isolated prompt loses centrality, and persistent artifacts, work contracts, traceability and human review become mechanisms that reduce ambiguity and coordinate agents. And no framework strongly covers all six dimensions, exposing a structural trade-off between process depth and portability across agents. We also found recurring risks: drift between specification and code, excessive trust in generated artifacts, fragility of community extensions, platform dependence and a lack of benchmarks for the complete process. We close with a research agenda for empirical evaluation, focused on intermediate-quality metrics, context governance, installation security and reproducibility.
[AI-10] What Type of Inference is Active Inference?
Link: https://arxiv.org/abs/2606.04935
Authors: Wouter W. L. Nuijten,Mykola Lukashchuk,Thijs van de Laar,Bert de Vries
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that the planning correction already helps when observations are decisive, whereas the additional observation-side epistemic corrections matter most when observations are merely suggestive.
[AI-11] AdaKoop: Efficient Modeling of Nonlinear Dynamics from Nonstationary Data Streams with Koopman Operator Regression KDD’26
Link: https://arxiv.org/abs/2606.04930
Authors: Naoki Chihara,Ren Fujiwara,Yasuko Matsubara,Yasushi Sakurai
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted by KDD’26
Abstract:Real-time data analysis requires the ability to accurately and adaptively address nonlinear dynamics in a nonstationary data stream while preserving computational efficiency. However, nonlinear dynamics are so complex that capturing dynamically changing nonlinear patterns and utilizing them for downstream tasks under strict time constraints is nontrivial. To bridge the gap between nonlinear complexity and computational tractability, this study applies Koopman operator theory, which states that nonlinear dynamics can be represented as linear transitions in an infinite-dimensional space. Building upon finite-dimensional approximations of this operator, we present AdaKoop, an efficient streaming algorithm for modeling nonlinear dynamics over nonstationary data streams. Our approach utilizes a probabilistic framework grounded in Koopman operator theory, treating both raw observations and reproducing kernel Hilbert space (RKHS) features as emissions from latent vectors. This dual-view formulation allows nonlinear dynamics to be expressed as a tractable linear system. Therefore, AdaKoop enables the efficient and stable modeling of nonlinear dynamics in a streaming fashion, avoiding the prohibitive computational costs of iterative nonlinear optimization. Furthermore, to address nonstationarity in data streams, AdaKoop adaptively detects the switching of patterns via statistical hypothesis testing for abrupt pattern shifts and incrementally updates model parameters to handle continuous changes. Extensive experiments on a total of 71 practical benchmark datasets across various domains demonstrate that AdaKoop outperforms state-of-the-art methods in terms of real-time forecasting accuracy and computational efficiency.
[AI-12] Abduction Prover in Isabelle/HOL
Link: https://arxiv.org/abs/2606.04877
Authors: Yutaka Nagashima,Daniel Sebastian Goc
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: Accepted to Isabelle2026
Abstract:Proof assistants based on expressive logics suffer limited automation for proof search, raising the cost of formal verification based on proof assistants. We address this problem by introducing the Abduction Prover for Isabelle/HOL. Given a challenging proof goal, the Abduction Prover constructs a proof script for the goal by identifying useful conjectures using abductive reasoning.
[AI-13] AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
Link: https://arxiv.org/abs/2606.04867
Authors: Yanjing Ren,Reza Ebrahimi,TengTeng Ma
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:As AI companion platforms such as Replika and this http URL rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: this https URL
[AI-14] Learning Empirically Admissible Neural Heuristics for Combinatorial Search
Link: https://arxiv.org/abs/2606.04860
Authors: Siddharth Sahay
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 3 figures, 2 tables, 1 algorithm
Abstract:Finding optimal solution paths for combinatorial puzzles like the Rubik’s Cube, sliding tile puzzles, and Lights Out remains a classical challenge in artificial intelligence. Heuristic search algorithms, such as A*, guarantee path optimality only when using an admissible heuristic, one that never overestimates the true remaining cost-to-go. Deep reinforcement learning (RL) methods like DeepCubeA train deep neural networks to approximate cost-to-go heuristics. However, standard mean-squared error (MSE) training regularly yields overestimations, violating admissibility and compromising solution optimality. In this paper, we introduce a generalizable framework for learning validation-calibrated admissible neural heuristics. We train a value network using an underestimating Admissible Bellman Operator combined with an Asymmetric Loss function to penalize overestimation. To account for residual neural function approximation errors, we propose a post-hoc calibration safety offset computed over validation scrambles. We demonstrate that our calibrated neural heuristics achieve no observed admissibility violations under the evaluation protocol and preserve path optimality in practice while reducing search node expansions by up to 83.0% on a 2 by 2 Rubik’s Cube, 19.9% on a 3 by 3 Lights Out grid, and 1.9% on an 8-Puzzle compared to standard analytical baselines.
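As a rough illustration of two ideas named in the abstract above, the sketch below pairs an asymmetric loss that penalizes overestimation more heavily with a post-hoc offset taken from the worst overestimation on a validation set. The specific loss form, the `over_weight` value, and all function names are assumptions made for illustration. The paper's Admissible Bellman Operator is not reproduced here.

```python
def asymmetric_loss(pred, target, over_weight=10.0):
    """Squared error, multiplied by `over_weight` when pred exceeds target."""
    err = pred - target
    return (over_weight if err > 0 else 1.0) * err * err


def calibration_offset(preds, true_costs):
    """Largest overestimation observed on validation states (0.0 if none)."""
    return max(0.0, max(p - t for p, t in zip(preds, true_costs)))


def calibrated(pred, offset):
    """Shift the heuristic down by the offset, clamped at zero."""
    return max(0.0, pred - offset)


# Toy validation data (illustrative only): predicted vs. true cost-to-go.
val_preds = [3.5, 5.0, 7.25, 2.0]
val_true = [4.0, 5.0, 7.0, 3.0]
offset = calibration_offset(val_preds, val_true)
print(offset, [calibrated(p, offset) for p in val_preds])
```

An offset computed this way removes overestimation only on the states it was calibrated against, which matches the abstract's wording of "no observed admissibility violations" rather than a formal admissibility proof.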
[AI-15] Uncertainty-Aware End-to-End Co-Design of Neural Network Processors: From Training and Mapping to Fabrication
Link: https://arxiv.org/abs/2606.04850
Authors: Yuyang Du,Yujun Huang,Gioele Zardini
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Optimization and Control (math.OC)
Comments: 14 pages
Abstract:Designing a neural network processor is an end-to-end co-design problem: network architecture and training budget determine the inference workload; hardware mapping decisions determine chip area, latency, and energy; and these characteristics govern fabrication yield and manufacturing cost. In practice, these decisions are made in separate stages, and existing co-design methodologies are tightly coupled to specific algorithms, making it difficult to improve one component without reworking the entire pipeline. This paper presents a unified framework, grounded in monotone co-design theory, that composes four interoperable design blocks spanning network training, chip mapping, wafer-level fabrication, and compute resource allocation. Each block exposes only a functionality-resource interface to the rest of the system, so any block can be refined without structural changes elsewhere. A central contribution is the treatment of uncertainty: rather than collapsing stochastic outcomes into point estimates, the framework introduces Confidence, the inverse of success probability, as an explicit and optimizable resource alongside cost, time, and power. Three case studies validate the approach. The first recovers Pareto-optimal implementations across heterogeneous application scenarios. The second confirms that Confidence functions as a continuously tunable design knob rather than a post-hoc diagnostic. The third demonstrates that improving a single block’s implementation set automatically propagates to the global Pareto front, without modifying the co-design diagram.
[AI-16] Signed Dual Attention: Capturing Signed Dependencies in Time Series Forecasting AAAI2026
Link: https://arxiv.org/abs/2606.04833
Authors: Balthazar Courvoisier,Tristan Cazenave
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, accepted at AAAI 2026 AI4TS Workshop
Abstract:Initially developed for natural language processing, Transformer architectures and attention mechanisms are now central to a wide range of deep learning models, including applications in time series forecasting. A standard attention mechanism, however, implicitly assumes homophilic interactions, limiting its ability to model data with positive and negative dependencies, such as time series. In this work, we introduce the Signed Dual Attention, a novel attention formulation that captures both positive and negative relational patterns without additional parameters. By leveraging a dual message-passing scheme inspired by correlation structures, Signed Dual Attention propagates both supportive and contrastive information within a single shared block, effectively achieving the expressiveness of two head attention without additional parameters. This module can be seamlessly integrated into existing architectures and can yield performance gains in certain situations, requiring signed relational modeling. This approach opens a pathway toward more expressive and parameter-efficient transformers.
[AI-17] Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems
Link: https://arxiv.org/abs/2606.04816
Authors: Xizi Luo,Changhong He,Dongdong Geng,Chenggong Shi,Yu Mei
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages
Abstract:Large language models (LLMs) increasingly translate natural-language optimization problems into executable solver code. Yet for constraint-dense operations research (OR) problems, existing data-filtering and training pipelines largely rely on objective-equivalence signals such as differential testing and answer agreement, which a program can pass while adding spurious constraints or silently omitting required ones, whenever those constraints are non-binding on the tested instance. We propose constraint injection, which uses feasible probes to expose spurious over-constraint and one-constraint-violating probes to reveal silent constraint omission. Combined with differential testing, it forms a dual verifier. We instantiate and evaluate it on vehicle routing problems (VRPs), a representative constraint-dense combinatorial optimization testbed with coupled operational constraints. We develop VRPCoder, an 8B end-to-end model that translates natural-language VRP scenarios into Gurobi scripts, together with an expert-verified VRP benchmark suite covering 21 variants. The verifier is reused as a rejection-sampling filter during data synthesis and as a per-rollout reward in group relative policy optimization (GRPO). Across four VRP benchmarks, VRPCoder-GRPO reaches 93% average Pass@1, outperforms Gemini-3.1-Pro Preview on three benchmarks, exceeds Claude-Sonnet-4.5 by 28 average points, and surpasses prior OR-LLMs by 78 average points.
[AI-18] Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents
Link: https://arxiv.org/abs/2606.04815
Authors: Bo Mao,Jie Zhou,Yutao Yang,Xin Li,Xian Wei,Qin Chen,Xingjiao Wu,Liang He
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long-horizon tasks typically depend on discrete skill or past experiences retrieval with static parameters during inference, which prevents them from continuously internalizing test-time feedback like human learners. To bridge this gap, we propose Skill-enhanced Test-Time Co-Evolution (LifeSkill), a two-stage reinforcement learning framework for Online Lifelong Learning Agents. Specifically, we design Verifier-Guided Skill Learning that addresses the lack of direct supervision for skill extraction by rewarding candidate skills according to the average verifier success of multiple skill-conditioned policy rollouts, encouraging the model to generate skills that are useful for solving tasks rather than merely plausible in text. Furthermore, we introduce Online Skill Internalization, which continuously improves the policy model during test-time interaction by transforming skill-conditioned trajectories into reward signals. This enables the agent to directly internalize reasoning capabilities into its parameters, avoiding the context bloat of experience retrieval. Experiments on LifelongAgentBench show that LifeSkill improves average performance by 7 absolute points by comparing with existing lifelong agent baselines.
[AI-19] Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees
Link: https://arxiv.org/abs/2606.04812
Authors: Mohit Prashant,Arvind Easwaran
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, preprint
Abstract:Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real-world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. A method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.
[AI-20] AIP: A Graph Representation for Learning and Governing Agent Skills
Link: https://arxiv.org/abs/2606.04781
Authors: Zachary Blumenfeld,Jim Webber
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet’s mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.
[AI-21] Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions
Link: https://arxiv.org/abs/2606.04779
Authors: Andrea Ferrario
Categories: Artificial Intelligence (cs.AI); Combinatorics (math.CO)
Comments: 29 pages, 9 figures
Abstract:Complementarity is the case in which a human–AI interaction (HAI) outperforms the best prediction benchmark available among its members. Although this idea is central in HAI research, formal work on complementarity remains limited. Existing frameworks do not model how agents’ predictions compose into workflow-sensitive multi-agent protocols. We close this gap by introducing a tree-based formalization of complementarity in multi-agent HAI. An HAI protocol is represented by an ordered agent-role configuration together with a rooted planar binary tree whose leaves are decorated by prediction vectors. A local binary composition rule is evaluated recursively along the tree, yielding a tree-relative complementarity functional relative to a pointwise-min oracle benchmark. We prove four results. First, selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. Second, in regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector; for N=2, the optimal linear-pooling weight has a closed form and a residual-correction interpretation. Third, under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for N=4, they satisfy the pentagon identity. Fourth, in binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses, including standard Bregman and many finite Bernoulli f-divergence losses; an analogous obstruction holds for multiclass aggregation under cross-entropy. In summary, our framework shows that complementarity is attainable in multi-agent regression, but obstructed in classification under natural conditions on local aggregation and loss functions.
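The closed-form weight for N=2 can be illustrated with ordinary least squares. This is a minimal sketch rather than the paper's derivation: it assumes the pooled prediction is `w*p1 + (1-w)*p2` with an unconstrained real `w`, and it compares against the best single member's total loss, which is a simpler benchmark than the pointwise-min oracle named in the abstract. All function names are hypothetical.

```python
def optimal_pooling_weight(p1, p2, y):
    """Weight w minimizing sum((w*p1 + (1-w)*p2 - y)^2) over real w."""
    d = [a - b for a, b in zip(p1, p2)]   # direction p1 - p2
    r = [t - b for t, b in zip(y, p2)]    # residual of p2: y - p2
    denom = sum(x * x for x in d)
    if denom == 0.0:
        return 0.5  # identical predictions: every weight gives the same loss
    # Projection of p2's residual onto the direction p1 - p2.
    return sum(x * z for x, z in zip(d, r)) / denom


def squared_loss(pred, y):
    return sum((a - b) ** 2 for a, b in zip(pred, y))


def pool(p1, p2, w):
    return [w * a + (1 - w) * b for a, b in zip(p1, p2)]


def gain_over_best_member(p1, p2, y):
    """Best single-member loss minus pooled loss (assumed benchmark)."""
    w = optimal_pooling_weight(p1, p2, y)
    pooled_loss = squared_loss(pool(p1, p2, w), y)
    return min(squared_loss(p1, y), squared_loss(p2, y)) - pooled_loss
```

Writing the pooled prediction as `p2 + w*(p1 - p2)` shows why a residual-correction reading is natural: `w` is the coefficient that best explains the residual of `p2` using the disagreement between the two members. Because `w` is unconstrained here, it may fall outside [0, 1].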
[AI-22] Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications
Link: https://arxiv.org/abs/2606.04769
Authors: Yutao Shi,Xiaohan Zhang,Xiangjing Zhang,Xihua Shen,Hui Ouyang,Huming Qiu,Mi Zhang,Min Yang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Preprint
Abstract:The Model Context Protocol (MCP) has emerged as a critical standard empowering Large Language Models (LLMs) to utilize external tools. In this ecosystem, LLMs rely on natural language descriptions provided by MCP servers to select and execute functions. This interaction implicitly assumes that tool descriptions faithfully reflect their underlying implementations, while this assumption is not mandatorily verified in practice. As a result, MCP deployments may suffer from a problem named Description-Code Inconsistency (DCI), where a tool’s description of its capabilities and security boundaries is not consistent with what the code actually does. In this paper, we present a comprehensive study of DCI in real-world MCP servers. We formally define the problem and propose a comprehensive taxonomy spanning functionality inconsistencies and undeclared side effects. Guided by this taxonomy, we develop DCIChecker, an automated framework that combines structure-aware static analysis with the Direct-Reverse-Arbitration prompting method to cross-validate tool descriptions against actual code implementations. We apply this framework to a large-scale dataset comprising 19,200 description-code pairs extracted from 2,214 real-world MCP servers. Our measurement reveals that DCI is widespread, with 9.93% of these pairs exhibiting inconsistencies. We further demonstrate that DCI creates a critical defense blind spot, facilitating varied risks from operational failures to stealthy malicious behaviors. Finally, we propose mitigation strategies to enforce semantic consistency and enhance the reliability of the emerging agentic ecosystem.
[AI-23] An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers
Link: https://arxiv.org/abs/2606.04752
Authors: Ossi Lehtinen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 1 figure, 8 tables. Code: this https URL
Abstract:Transformers consuming multi-channel scalar signals must embed C simultaneous values into one $d_{\text{model}}$-dimensional vector per time step. We empirically audit eight input encoders – spanning a shared-scalar baseline, per-channel linear projections, an orthogonality regulariser, a nonlinear MLP stem, block-partitioned concatenation, channel-independent and channel-as-token architectures, and a projected positional encoding – on a synthetic benchmark designed to make channel identity informative and on ETTh1 as a real-data check, measured in next-step negative log-likelihood (NLL). The headline is one of practical near-equivalence within a wide “top tier”: the standard per-channel linear projection (this http URL(C, $d_{\text{model}}$)) matches every alternative in that tier up to small, statistically real but practically modest, differences. Two encoders lose decisively: the shared-scalar baseline, which collapses for information-theoretic reasons we make explicit, and the channel-independent PatchTST-spirit baseline, which underperforms on both benchmarks and overfits universally on the synthetic one. Paired tests resolve two small gaps: projecting the sinusoidal positional encoding through a learned linear layer edges the rest at small C, with a direct geometric probe showing the mechanism is positional-channel orthogonalisation; a nonlinear MLP stem edges them at the largest C we test, with the gap shrinking under more training data. The practical recommendation is to use this http URL(C, $d_{\text{model}}$) by default and reach for something more elaborate only when the task at hand gives a real reason to do so. Code and data to reproduce every experiment in this paper are available at this https URL
[AI-24] FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games
Link: https://arxiv.org/abs/2606.04751
Authors: Leonardo Bertolazzi,Katya Tentori,Raffaella Bernardi
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we introduce FALSIFYBENCH, an evaluation framework for hypothesis-driven reasoning inspired by the classic Wason 2-4-6 task, in which agents must discover hidden semantic properties by iteratively proposing examples and receiving feedback. This task captures key elements of scientific reasoning: hypothesis generation, evidence gathering, and belief revision in response to both confirming and disconfirming evidence. Our evaluation of 12 LLMs across model families and scales shows that reasoning models are generally stronger scientific reasoners than instruction-tuned models, although no model comes close to optimal performance. The primary driver of success is the capacity for negative testing: models that actively seek to falsify their hypotheses consistently outperform those that primarily seek confirmation. Moreover, a fine-grained turn-level analysis, neglected in previous work, reveals that failure is tied to identifiable patterns in how models navigate the hypothesis space.
[AI-25] Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment
Link: https://arxiv.org/abs/2606.04750
Authors: Ajay Vishwanath,Christian Omlin
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully dependent on the reward function design. Thus far, this technique has been demonstrated to be effective in grid worlds and toy-problem environments with minimal state and action spaces. To expand this research to more sophisticated environments, we introduce a two-player multi-agent environment based on the role-playing board game known as Fog of Love. In this environment, two agents compete to fulfill their individual virtues, while also cooperating to satisfy their relationship. Given the multi-agent nature, this is a complex problem where multi-agent deep deterministic policy gradient agents neither compete nor cooperate successfully. We present evidence that localized affinities enhance agent performance in achieving both competitive and cooperative objectives, resulting from superior overall scores in both domains. This not only results in virtuous choices but also clarifies an agent’s teleology and makes its behavior human-level interpretable.
[AI-26] Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models
Link: https://arxiv.org/abs/2606.04739
Authors: Sabrina Kaniewski,Fabian Schmidt,Tobias Heer
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted at AICCPS 2026 workshop, co-located with the 21st International Conference on Availability, Reliability and Security (ARES 2026). This is the authors’ preprint version
Abstract:Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval-augmented generation (RAG) settings. However, for approaches relying on proprietary models and APIs, reproducibility and replicability remain largely unexplored, raising the question of whether reported results generalize or depend primarily on specific model choices. In this work, we present a reproducibility study of Vul-RAG, a RAG-based framework for source code vulnerability detection that enhances LLMs with high-level vulnerability knowledge. We first replicate the results in a fully local and open-weights setting using the reported open-weight baseline models. We then extend the evaluation to a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models of varying parameter sizes. The results confirm that the findings of Vul-RAG are reproducible under local deployment, but with minor deviations. Across all evaluated models, we observe a performance plateau at approximately 0.30 pairwise accuracy (code pairs for which both the vulnerable and the patched function are correctly classified). Notably, this plateau persists even for more recent and advanced models, indicating that improvements in model capacity alone do not substantially enhance performance. Finally, we discuss practical implications and trade-offs between detection effectiveness, model capabilities, and model scale. Implementation and evaluation artifacts are publicly available at this https URL.
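The pairwise accuracy metric described in the abstract (a code pair counts only when both the vulnerable and the patched function are classified correctly) can be sketched as follows. The input layout and the function name are assumptions made for illustration, not the study's evaluation code.

```python
def pairwise_accuracy(pairs):
    """Fraction of code pairs with both members classified correctly.

    pairs: iterable of (pred_vulnerable, pred_patched) booleans, where each
    prediction is True if the model flags that function as vulnerable.
    (Hypothetical input layout.)
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one code pair")
    # A pair is correct only if the vulnerable function is flagged
    # and the patched function is not.
    correct = sum(
        1 for pred_vuln, pred_patched in pairs
        if pred_vuln and not pred_patched
    )
    return correct / len(pairs)
```

A model that flags every function as vulnerable, or none, scores 0 under this metric, which is why it is stricter than per-function accuracy.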
[AI-27] Curvature-aware dynamic precision approach for physics-informed neural networks
Link: https://arxiv.org/abs/2606.04736
Authors: Yingjie Shao,Ioannis N. Athanasiadis,George van Voorn,Taniya Kapoor
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Physics-informed neural networks (PINNs) have become a promising framework for simulating partial differential equations (PDEs) by embedding physical laws directly into neural network training. However, recent studies show that PINN optimisation is sensitive to numerical precision. Existing implementations commonly use either single precision (FP32), which is computationally efficient but prone to failure modes, or double precision (FP64), which is robust but substantially expensive. This creates a trade-off between computational efficiency and numerical accuracy. To reduce the computational cost of double-precision training while retaining prediction accuracy, we propose a curvature-aware precision controller that adapts numerical precision during training rather than treating it as a fixed implementation choice. The proposed method reuses curvature information derived from the limited-memory BFGS (L-BFGS) optimiser to construct a precision controller, retaining FP32 when lower precision is sufficient and promoting computation to FP64 when the training dynamics indicate numerical sensitivity or precision-limited stagnation. We evaluate the proposed approach on four canonical PINN failure-mode benchmarks and an irradiance-driven ordinary differential equation example. We further test the proposed approach across different neural network architectures. The method consistently matches or even slightly exceeds full FP64 solution accuracy while reducing training time relative to full double-precision training on all benchmark equations. The obtained results indicate that precision sensitivity in PINN optimisation is phase-dependent, and that selectively applying higher precision only during numerically critical stages can lower computational cost without sacrificing predictive accuracy.
[AI-28] Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning
Link: https://arxiv.org/abs/2606.04735
Authors: Viktor Veselý,Aleksandar Todorov,Erwan Escudie,Matthia Sabatelli
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement learning (RL) termed Trace-Mediated Peak Bias (TMPB). At intermediate eligibility trace depths, agents irrationally prefer trajectories with high-magnitude “reward peaks” over alternatives with higher cumulative returns. This provides a mechanistic account of the Peak-End Rule: a human memory bias where experiences are judged by their most intense moments rather than integrated utility. We show that TMPB emerges because traces amplify distal Temporal Difference errors into “gradient shocks” that fixed-step-size Stochastic Gradient Descent cannot normalize, leading to global overestimation. Conversely, adaptive optimizers mitigate this pathology via second-moment normalization. Our results suggest that human-like saliency distortions may emerge naturally from the mathematical constraints of credit assignment in distributed systems, and that adaptive optimization is a theoretical necessity for rational value estimation.
[AI-29] CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation
Link: https://arxiv.org/abs/2606.04718
Authors: Kailun Huang(1),Zikang Xie(1),Yanzhe Xie(1),Panpan Liao(3),Fanghai Zhang(1),Yanheng Mai(1),Wenhao Xu(2),Yunheng Wang(1),Renjing Xu(1),Haohui Huang(3) ((1) Hong Kong University of Science and Technology (Guangzhou), (2) South China Agricultural University, (3) Guangdong University of Technology)
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Kailun Huang, Zikang Xie, Yanzhe Xie and Panpan Liao contributed equally to this work. Corresponding authors: Renjing Xu and Haohui Huang
Abstract:Humans primarily rely on walking and running to traverse complex terrains, without resorting to unnecessarily complex motion patterns. Similarly, humanoid robots should achieve smooth transitions between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference and the distribution shift induced by terrain-dependent visual and dynamic variations. Although Mixture-of-Experts (MoE) architectures can alleviate multi-skill interference, naive joint training often fails to yield clear expert specialization, limiting their effectiveness. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced and trained with a contrastive objective to shape the gating network, enabling it to capture structured terrain representations and promote expert specialization. The final action is obtained via weighted fusion of the base gait policy and the terrain-aware branch, allowing the policy to preserve stable locomotion patterns while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains, while maintaining accurate foothold placement and dynamic stability under external disturbances.
[AI-30] VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training
Link: https://arxiv.org/abs/2606.04708
Authors: Siyuan Yang,Linzheng Guo,Ouyang Lu,Zhaxizhuoma,Daoran Zhang,Xinmiao Wang,Ting Xiao,Fangzheng Yan,Zhijun Chen,Yan Ding,Chao Yu,Chenjia Bai,Xuelong Li
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address the challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments empirically show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $\pi_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.
[AI-31] Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models ICML2026
Link: https://arxiv.org/abs/2606.04672
Authors: Ayushman Raghuvanshi,Thummaluru Siddartha Readdy,Sundeep Prabhakar Chepuri,Mahesh Chandran
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026
Abstract:Continuous-time dynamic graphs (CTDGs) provide a richer framework to capture fine-grained temporal patterns in evolving relational data. Long-range information propagation is a key challenge while learning representations, wherein it is important to retain and update information over long temporal horizons. Existing approaches restrict models to capture one-hop or local temporal neighborhoods and fail to capture multi-hop or global structural patterns. To mitigate this, we derive a parameter-efficient state-space modeling framework for continuous-time dynamic graphs (CTDG-SSM) from first principles. We first introduce continuous-time Topology-Aware higher order polynomial projection operator (CTT-HiPPO), a novel memory-based reformulation of HiPPO to jointly encode temporal dynamics and graph structure. The solution from CTT-HiPPO is obtained by projecting the classical HiPPO solution through a polynomial of the Laplacian matrix, yielding topology-aware memory updates that admit an equivalent state-space formulation for CTDGs (CTDG-SSM). Then a computationally efficient discrete formulation is obtained using the zero-order hold approach for model implementation. Across benchmarks on dynamic link prediction, dynamic node classification, and sequence classification, CTDG-SSM achieves state-of-the-art performance. Notably, it achieves large performance gains on datasets that require long range temporal (LRT) and spatial reasoning.
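The zero-order hold step mentioned in the abstract is a standard way to discretize a continuous-time state-space model. The sketch below shows it for a single scalar (or one diagonal) state only; it is the textbook formula, not the CTDG-SSM or CTT-HiPPO update, and the function names are hypothetical.

```python
import math


def zoh_discretize(a, b, dt):
    """Zero-order-hold discretization of x'(t) = a*x(t) + b*u(t).

    Assumes the input u is held constant over the interval of length dt.
    Returns (a_bar, b_bar) such that x_next = a_bar*x + b_bar*u.
    """
    a_bar = math.exp(a * dt)
    # (exp(a*dt) - 1) / a * b, with the a -> 0 limit handled explicitly.
    b_bar = b * dt if a == 0.0 else b * math.expm1(a * dt) / a
    return a_bar, b_bar


def step(x, u, a, b, dt):
    """Advance the state across one (possibly irregular) time gap dt."""
    a_bar, b_bar = zoh_discretize(a, b, dt)
    return a_bar * x + b_bar * u
```

Because `dt` is an argument of every step, the same recurrence handles irregularly spaced event times, which is the setting continuous-time dynamic graphs operate in.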
[AI-32] Why Muon Outperforms Adam: A Curvature Perspective
Link: https://arxiv.org/abs/2606.04662
Authors: Shuche Wang,Fengzhuo Zhang,Jiaxiang Li,Dirk Bergemann,Zhuoran Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon’s superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon’s smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon’s NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon’s NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon’s lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
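The decomposition described in the abstract can be written out as follows. This is a sketch under assumptions: the symbols ($\theta$ for parameters, $\Delta$ for the update, $g$ for the gradient, $H$ for the Hessian) and the Euclidean norm in the definition of NDS are chosen here for illustration, and the paper's exact definitions may differ.

```latex
% Second-order Taylor approximation of the loss after an update \Delta
L(\theta + \Delta) \approx L(\theta) + g^{\top}\Delta + \tfrac{1}{2}\,\Delta^{\top} H \Delta

% One-step loss decrease = first-order gain - curvature penalty
L(\theta) - L(\theta + \Delta) \approx
  \underbrace{-\,g^{\top}\Delta}_{\text{first-order gain}}
  \;-\;
  \underbrace{\tfrac{1}{2}\,\Delta^{\top} H \Delta}_{\text{curvature penalty}}

% Curvature penalty = squared update norm x normalized directional sharpness
\tfrac{1}{2}\,\Delta^{\top} H \Delta
  = \tfrac{1}{2}\,\lVert \Delta \rVert^{2}\,\mathrm{NDS}(\Delta),
\qquad
\mathrm{NDS}(\Delta) = \frac{\Delta^{\top} H \Delta}{\lVert \Delta \rVert^{2}}
```

Under this reading, two optimizers with comparable first-order gains and comparable $\lVert \Delta \rVert$ can differ in one-step loss decrease only through NDS, which matches the argument the abstract makes.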
[AI-33] BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction
Link: https://arxiv.org/abs/2606.04648
Authors: Qi Wang,Peijie Wang,Fei Yin,Cheng-Lin Liu
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Geometry problem solving poses distinct challenges in artificial intelligence. Existing approaches typically fall into two paradigms: symbolic methods, which exhibit limited adaptability, and neural methods, which are prone to hallucinations. Recent neuro-symbolic hybrids predominantly rely on a unidirectional pipeline where neural outputs are fed into solvers without feedback, making the system brittle to early-stage errors. To break this unidirectional bottleneck, we propose BiNSGPS, a framework that establishes Bidirectional Neuro-Symbolic Interaction (BiNS) between an MLLM Adviser and a Symbolic Solver. The MLLM Adviser actively incorporates feedback from the symbolic solver to dynamically rectify inconsistent formal representations or propose auxiliary hypotheses, resolving symbolic conflicts and facilitating complex deductions.
[AI-34] MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Link: https://arxiv.org/abs/2606.04627
Authors: Zhichao Yang,Yuanze Hu,Haojie Hao,Longkun Hao,Dongshuo Huang,Hongyu Lin,Gen Li,Lanqing Hong,Yihang Lou,Yan Bai
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world-model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward-looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain-of-thought supervised fine-tuning in the 4B ablation with a 3-5x lower decoded-token budget and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.
[AI-35] QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy
Link: https://arxiv.org/abs/2606.04620
Authors: Pasindu Wickramasinghe,Achyuta Muthuvelan,Rachmad Vidya Wicaksana Putra,Minghao Shao,Muhammad Shafique
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 9 figures, 5 tables
Abstract:LLMs have become the state-of-the-art algorithms for solving NLP tasks. However, they typically come at huge computational and memory costs, thus making them difficult to deploy on embedded systems. Toward this, state-of-the-art methods typically employ uniform post-training quantization (PTQ) across attention blocks of the network, hence overlooking the potential of applying different quantization levels in the same network. They also employ complex operations to mitigate the negative impact of activation outliers, hence incurring high computational overheads. Moreover, they have not considered evaluation using emerging LLMs with non-conventional attention architectures (e.g., state-space models), which pose different challenges in applying quantization. To address these limitations, we propose QuBLAST, a novel PTQ methodology that employs block-level compression approach with activation scaling strategy for LLMs. Block-level compression approach enables mixed-precision quantization across blocks of the network, while activation scaling strategy efficiently mitigates the negative impact of activation outliers. Specifically, QuBLAST first analyzes the sensitivity of different attention blocks in the pre-trained model through the cross-entropy loss analysis. QuBLAST leverages this sensitivity analysis to determine the weight quantization level for each attention block in the model. Furthermore, QuBLAST employs the activation scaling map for each block to control the range of activation values and mitigate the negative impact of activation outliers, thereby enabling better quantization results. Experimental results show that, QuBLAST reduces model sizes by 40%-45.2% across different model architectures (i.e., Qwen3-8B, Llama3-8B, Mistral v0.1-8B, and Falcon H1R-7B), while maintaining the performance within 5% perplexity increase for the WikiText-2 and WikiText-103 datasets.
[AI-36] A Normative Intermediate Representation for ASP-Based Compliance Reasoning
Link: https://arxiv.org/abs/2606.04619
Authors: Yangfan Wu,Huanyu Yang,Jianmin Ji
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
Abstract:We propose MONIR, a Modalized-Output Normative Intermediate Representation for ASP-based compliance reasoning. Its core fragment has a staged operational semantics, while MONIR-ASP provides an executable compilation and extensions for external functions, temporal rules, and stable-model reasoning. We instantiate the framework on Chinese ADAS regulations and standards with an LLM-assisted pipeline. Experiments evaluate extraction quality and the efficiency of modular and incremental ASP solving.
[AI-37] Parthenon Law: A Self-Evolving Legal-Agent Framework
Link: https://arxiv.org/abs/2606.04602
Authors: Hejia Geng,Leo Liu
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products – yet reliable deployment faces three obstacles: no large-scale evidence on how today’s strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB – 12,510 agent trajectories – shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience – as a firm refines its checklists and playbooks after each matter – without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.
[AI-38] Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection
Link: https://arxiv.org/abs/2606.04599
Authors: Yongzi Yu,Ao Li,Le Wang,Ziyue Li,Fugee Tsung,Yuxuan Liang,Man Li
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:
Abstract:Large language model (LLM) agents have shown promise in automating complex data-analysis workflows, but their reliable deployment remains challenging in high-stakes industrial scenarios. Industrial anomaly detection (IAD) is essential for manufacturing quality, safety, and efficiency, yet existing LLM-based IAD agents mainly focus on execution while under-exploiting strategy formulation. Consequently, they struggle to handle heterogeneous modalities in a unified and cost-effective manner. Inspired by the DMAIC quality-management framework, we propose DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection), a “Plan First, Judge Later” multi-agent system that aligns LLM agents with structured industrial problem-solving. DMAIC-IAD distills heterogeneous references into standardized operating procedures (SOPs) before strategy generation, and introduces a pre-trained execution-free judge model to rank candidate strategies without costly runtime trials. Extensive experiments across four modalities show that DMAIC-IAD improves average detection performance over applicable agentic baselines by 37.76%.
[AI-39] Learning Admissible Heuristics via Cost Partitioning
Link: https://arxiv.org/abs/2606.04597
Authors: Hugo Barral, Quentin Cappart, Marie-José Huguet, Sylvie Thiébaux
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Admissible heuristics are essential for optimal planning, yet learning them remains challenging due to the risk of overestimation. Cost partitioning combines multiple abstraction heuristics while preserving admissibility, but computing optimal partitions online is expensive. We propose a framework that learns to infer admissible cost partitions by leveraging the Lagrangian dual equivalence between cost partitioning and multiplier prediction. Planning states and patterns are encoded as labelled graphs, and an action-centric variant of the Weisfeiler-Leman algorithm extracts structural feature vectors. A deep architecture with axial self-attention and a softmax output layer maps these features to cost weights that satisfy the partition constraints by construction, ensuring admissibility. Experiments demonstrate reduced node expansions compared to suboptimal partitioning baselines while maintaining strict admissibility. To our knowledge, this is the first machine-learned heuristic guaranteed to be admissible.
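The "satisfy the partition constraints by construction" step can be illustrated with a minimal sketch. Everything named here is an assumption for illustration, not the paper's code: the function name `softmax_cost_partition`, the array shapes, and the random logits standing in for the network's output. A softmax over the abstraction axis gives every action non-negative weights that sum to 1, so the per-abstraction costs add up to the original action cost, which is the condition under which summing the abstraction heuristics remains admissible.

```python
import numpy as np

def softmax_cost_partition(logits, action_costs):
    """Split each action's cost across abstractions using softmax weights.

    logits: (n_abstractions, n_actions) scores; a hypothetical stand-in
            for the output of a learned model.
    action_costs: (n_actions,) non-negative original action costs.
    Returns an (n_abstractions, n_actions) array of partitioned costs.
    """
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)  # each column sums to 1
    return weights * action_costs  # broadcast costs over the abstraction axis

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))                 # 3 abstractions, 5 actions
costs = np.array([1.0, 2.0, 1.0, 4.0, 3.0])
partition = softmax_cost_partition(logits, costs)
```

Whatever the logits are, `partition.sum(axis=0)` reproduces `costs`, so the constraint holds regardless of how well the model is trained; only the heuristic's informativeness depends on the learned weights.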
[AI-40] Ekka: Automated Diagnosis of Silent Errors in LLM Inference (ICML 2026)
Link: https://arxiv.org/abs/2606.04594
Authors: Yile Gu, Zhen Zhang, Shaowei Zhu, Xinwei Fu, Jun Wu, Yida Wang, Baris Kasikci
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: ICML 2026
Abstract:LLM serving frameworks are quickly evolving with a complex software stack and a vast number of optimizations. The rapid development process can introduce silent errors where output quality silently degrades without any explicit error signals. Diagnosing silent errors is notoriously difficult due to the substantial semantic gap between the high-level symptoms and the low-level root causes. We observe that diagnosis of silent errors can be effectively framed as a differential debugging problem by leveraging the existence of semantically correct reference implementations. We propose Ekka, an automated diagnosis system that identifies root causes by systematically aligning and comparing intermediate execution states between a target and a reference framework. We constructed a benchmark of real-world silent errors from popular serving frameworks, where Ekka shows 80% pass@1 diagnosis accuracy and 88% pass@5 diagnosis accuracy, outperforming state-of-the-art systems. Ekka also diagnoses 4 new silent errors from serving frameworks, all of which have been confirmed by the developers.
[AI-41] Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge
Link: https://arxiv.org/abs/2606.04581
Authors: Haotian Zheng, Zhanwei Wang, Mingyao Cui, Chang Cai, Hongyang Du, Kaibin Huang
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Speculative inference (SPIN) was originally developed as an efficient architecture to accelerate Large Language Models (LLMs). In this work, we propose its distributed deployment to enable cooperative token generation in a multiuser edge system; its advantage is to effectively balance computational loads between resource-constrained devices and servers. The resulting architecture, termed Multi-access SPIN (Multi-SPIN), utilizes on-device small language models to generate and upload candidate token drafts, while an edge server operates the LLM to verify them in parallel batches. Given the severe heterogeneity in users’ computation and communication capabilities, the draft length emerges as a critical control variable that influences node-level computation loads and multi-access latency, thereby governing the sum token goodput. Consequently, considering frequency-division multiple access, we investigate the problem of multi-access draft control, a joint optimization of draft-length control and bandwidth allocation to maximize sum token goodput. We examine two cases: (1) homogeneous draft lengths across users to facilitate server-side batching, and (2) heterogeneous draft lengths to introduce a new dimension for goodput enhancement. By developing decomposition methods, we reduce these complex optimizations into tractable sub-problems, which allow efficient draft control algorithms to be derived in closed form. Our analysis shows that the optimal bandwidth allocation compensates users with weaker computation-and-communication capabilities in the homogeneous case due to the batching synchronization requirements, whereas its heterogeneous-case counterpart rewards users with higher acceptance rates by relaxing such requirements. Experiments using Llama-2 and Qwen3.5 model pairs across diverse tasks demonstrate that Multi-SPIN improves goodput by up to 88% over heterogeneity-agnostic baselines.
[AI-42] SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification (KDD 2026)
Link: https://arxiv.org/abs/2606.04579
Authors: Xiangyu Zhao, Hengyuan Zhao, Yiheng Wang, Wanghan Xu, Yuhao Zhou, Qinglong Cao, Zhiwang Zhou, Lei Bai, Wenlong Zhang, Xiao-Ming Wu
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026 AI4Science Track
Abstract:While Process Reward Models (PRMs) have achieved remarkable success in mathematical reasoning, their application in complex scientific domains, such as biology, chemistry, and physics, remains largely unexplored. Scientific problems demand not only logical rigor but also factual consistency and the precise usage of domain-specific tools, areas where current models often suffer from hallucinations and lack of verification. In this paper, we first construct SCIPRM70K, a large-scale dataset featuring Chain-of-Tool trajectories that explicitly interleave reasoning with the execution of scientific tools. Building upon this, we train an efficient reward model called Sci-PRM to provide fine-grained supervision on tool selection, execution accuracy, and result interpretation at each step in a single inference pass. Experiments demonstrate that Sci-PRM significantly enhances foundation models in two key aspects: (1) it enables effective test-time scaling via Best-of-N selection; and (2) when integrated into Reinforcement Learning, it serves as a dense reward signal that mitigates the critical issue of advantage disappearance, allowing the model to break through existing performance ceilings.
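Best-of-N selection with a process reward model can be sketched as follows. This is a generic illustration under stated assumptions, not Sci-PRM itself: `toy_scorer` is a hypothetical stand-in for a step-level reward model, and aggregating step scores with `min` is one common choice that the abstract does not specify.

```python
def best_of_n(candidates, score_steps, aggregate=min):
    """Return the candidate trajectory with the best aggregated step score.

    candidates: list of trajectories, each a list of step strings.
    score_steps: callable mapping a trajectory to one score per step
                 (hypothetical stand-in for a process reward model).
    aggregate: reduces the per-step scores to one number (assumed: min).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda steps: aggregate(score_steps(steps)))

def toy_scorer(steps):
    # Toy rule for illustration only: steps that "guess" get a low score.
    return [0.2 if "guess" in step else 0.9 for step in steps]

candidates = [
    ["call mass_calculator", "guess the units"],
    ["call mass_calculator", "check units with converter"],
]
chosen = best_of_n(candidates, toy_scorer)
```

With `min` as the aggregator, a single weak step sinks the whole trajectory, which matches the intuition that one faulty tool call can invalidate a scientific answer.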
[AI-43] Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models
Link: https://arxiv.org/abs/2606.04562
Authors: Janani Venugopalan, Gaurav Deshkar, Rishabh Gaur, Harshal Hayatnagarkar, Jayanta Kshirsagar
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:
Abstract:Purpose: The WHO’s COVID-19 non-pharmaceutical interventions (e.g., lockdowns, vaccinations) effectively curb transmission but impose heavy economic strains. Existing research often neglects individual behaviors and falsely assumes perfect infection tracking and flawless policy execution, failing to account for real-world uncertainties and errors. Methods: We propose an integrative approach incorporating uncertainties in both epidemic measurement (infections/hospitalizations) and policy implementation. We built a simulation model of 1,000 individuals making real-time choices regarding mask-wearing, vaccination, and shopping. Concurrently, policymakers deploy interventions (lockdowns, mandates) based on health and economic observations. This framework is driven by hierarchical reinforcement learning agents, utilizing deep Q-networks alongside uncertainty-aware policy gradient variants (DDPG and TD3). Results: The simulations effectively managed the epidemic’s progression. Masking and vaccinations proved highly effective, significantly reducing both the outbreak’s peak height and duration. By integrating individual behaviors, policy uncertainties, and multifaceted interventions, our dynamic control approach successfully mitigated the epidemic’s impact. Conclusions: Our model overcomes previous research limitations by embedding uncertainty and human behavior into public health policy frameworks. The simulation demonstrates that accounting for individual choices and imperfect data is crucial for designing effective interventions during complex pandemics, with masks and vaccines serving as pivotal tools.
[AI-44] Rollout-Level Advantage-Prioritized Experience Replay for GRPO
Link: https://arxiv.org/abs/2606.04560
Authors: Gyeongtae Yoo, Sanghyeok Park, Soohyuk Jang, Ik-hwan Kim, Sungroh Yoon
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded. Naive replay is not well suited in this setting because LLM policies drift quickly per gradient step. Stored rollouts therefore become stale and can destabilize training. We propose a rollout-level replay buffer for GRPO that stores and samples individual rollouts rather than whole groups. The buffer bounds staleness through age eviction. Any rollout older than tau_max training steps is removed. The buffer also preserves on-policy data via fresh-anchored composition. Each batch keeps its fresh on-policy rollouts and then concatenates replay rollouts drawn separately from the buffer. We prioritize replay by per-rollout advantage magnitude and recycle individual rollouts whose advantages are large. Across three Qwen3-Base scales on five math benchmarks, our method outperforms GRPO and naive replay baselines. Gains are positive at every scale and grow with model size. The largest gain is +4.35 pp on the five-benchmark average at 4B. Under an AES metric that jointly measures accuracy and token efficiency, the efficiency margin over GRPO is again largest at 4B, at +0.579.
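The three buffer mechanisms named in the abstract (age eviction, fresh-anchored composition, advantage-magnitude prioritization) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the class and function names, the `Rollout` fields, and sampling with replacement in proportion to absolute advantage are all assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    tokens: list
    advantage: float
    step: int  # training step at which the rollout was generated

class RolloutReplayBuffer:
    def __init__(self, tau_max, seed=0):
        self.tau_max = tau_max
        self.items = []
        self.rng = random.Random(seed)

    def add(self, rollouts):
        self.items.extend(rollouts)

    def evict(self, current_step):
        # Age eviction: drop any rollout older than tau_max training steps.
        self.items = [r for r in self.items
                      if current_step - r.step <= self.tau_max]

    def sample(self, k):
        # Prioritize by advantage magnitude (assumed: proportional sampling
        # with replacement; the paper's exact scheme may differ).
        if not self.items or k <= 0:
            return []
        weights = [abs(r.advantage) + 1e-8 for r in self.items]
        return self.rng.choices(self.items, weights=weights, k=k)

def compose_batch(fresh, buffer, n_replay, current_step):
    # Fresh-anchored composition: keep every fresh on-policy rollout,
    # then concatenate replay rollouts drawn separately from the buffer.
    buffer.evict(current_step)
    replay = buffer.sample(n_replay)
    buffer.add(fresh)  # fresh rollouts become replay candidates later
    return fresh + replay

buf = RolloutReplayBuffer(tau_max=2)
buf.add([Rollout(["a"], 0.9, step=0), Rollout(["b"], -0.1, step=3)])
fresh = [Rollout(["c"], 0.5, step=4)]
batch = compose_batch(fresh, buf, n_replay=2, current_step=4)
```

In this toy run the rollout from step 0 is evicted at step 4 because its age exceeds `tau_max`, so both replay slots are filled from the one surviving stored rollout.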
[AI-45] Scaling Self-Evolving Agents via Parametric Memory
Link: https://arxiv.org/abs/2606.04536
Authors: Tao Ren, Weiyao Luo, Hui Yang, Rongzhi Zhu, Xiang Huang, Yuchuan Wu, Bingxue Chou, Jieping Ye, Jiafeng Liang, Yongbin Li, Yijie Peng
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can look up what they have seen but cannot learn from it: their policy is unchanged by experience, and any information dropped from the context is permanently lost. We introduce TMEM, a self-evolving parametric memory framework in which the agent not only compresses history into explicit memory but also absorbs distilled supervision into fast LoRA weights $\Delta_t$ via lightweight online updates, genuinely altering its future behavior within a single episode. We formalize this as an agentic decision process with fast-weight rollout dynamics: actions are sampled from $\pi_{\theta_0+\Delta_t}$, while extraction actions produce supervision that updates $\Delta_t$ for subsequent decisions. This view makes the extraction policy directly optimizable by RL: training $\theta_0$ improves not only task actions but also the quality of the data used for online LoRA adaptation. We further propose SVD-based initialization of the LoRA subspace to accelerate online convergence. Experiments on LoCoMo, LongMemEval-S, multi-objective search, and CL-Bench show that TMEM consistently outperforms summary-based and retrieval-based baselines across different model scales.
[AI-46] Treat Traffic Like Trees: A Semantic-Preserving Hierarchical Graph-Based Expert Framework for Encrypted Traffic Analysis
Link: https://arxiv.org/abs/2606.04517
Authors: Yuantu Luo, Jun Tao, Linxiao Yu, Guang Cheng
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Graph-based deep learning methods have been widely employed in encrypted traffic analysis to exploit latent correlations across different granularities. However, while complex preprocessing pipelines and sophisticated model structures often achieve strong performance, they may obscure inherent protocol semantics during representation learning. Moreover, the hierarchical structure of protocol layers and their corresponding fields, defined by protocol specifications and routinely utilized in manual traffic analysis, remains underexplored in existing learning frameworks. In this paper, we propose Protocol Tree Graph Attention with Mixture of Experts (PTGAMoE), a semantic-preserving hierarchical graph-based expert framework for encrypted traffic analysis. The field-based graph construction and expert committee design enable PTGAMoE to quantify the model’s preferences for specific fields and protocols. Extensive experimental results on representative benchmark datasets under strict no-data-leakage settings demonstrate that PTGAMoE significantly outperforms state-of-the-art (SOTA) models. Furthermore, the semantic-preserving design provides interpretable insights into protocol-level feature importance and expert-level contributions, reflecting the model’s decision-making logic in encrypted traffic classification tasks.
[AI-47] GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling
Link: https://arxiv.org/abs/2606.04516
Authors: Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Kai Tang, Zhengqing Zang, Bowen Song, Weiqiang Wang, Gang Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning with verifiable rewards (RLVR) significantly advances LLM reasoning, yet it faces a dilemma: standard supervised scaling is throttled by high annotation costs, while unsupervised alternatives suffer from severe model collapse. Recent semi-supervised RLVR methods address this by using a small labeled set to guide unlabeled data, achieving a promising trade-off between training efficacy and annotation cost. However, they suffer from a severe data-efficiency bottleneck due to the reliance on coarse performance heuristics, leaving a vast majority of valuable instances underutilized. To this end, we propose GeoMin, which models global feature distributions on labeled data to decode the structural discrepancy between correct and incorrect rollouts, thereby establishing a robust prior to assess the reliability of self-reward signals and fully unleash the potential of unlabeled data. Empirically, GeoMin outperforms the strongest baselines by +4.1% and even surpasses fully supervised models with only 10% of the annotations, demonstrating remarkable data efficiency.
[AI-48] MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation (KDD 2026)
Link: https://arxiv.org/abs/2606.04513
Authors: Deguo Xia, Zihan Li, Haochen Zhao, Dong Xie, Yuyao Kong, Xiyan Liu, Jizhou Huang, Mengmeng Yang, Diange Yang
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026
Abstract:Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent’s practicality and effectiveness for large-scale lane-level map generation.
[AI-49] Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making
Link: https://arxiv.org/abs/2606.04505
Authors: Yuhan Yang, Ruipu Li, Alexander Rodríguez
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Scientific simulators are increasingly being integrated into LLM-driven systems for high-stakes simulation-driven decision-making. However, existing frameworks primarily use LLMs to generate, calibrate, or execute simulators, treating them as black-box interfaces rather than as structured mechanistic systems that can be reasoned about. As a result, current approaches lack the ability to identify, represent, and reason about the assumptions and mechanisms underlying simulator behavior, limiting transparency, auditability, and decision justification. We introduce MechSim, a mechanism-grounded neuro-symbolic reasoning framework for executable scientific simulators. Unlike prior neuro-symbolic approaches that primarily reason over static symbolic structures, MechSim enables LLM agents to reason about the mechanisms, assumptions, and execution behavior of scientific simulators. Our framework represents simulators through a shared structured schema capturing assumptions, variables, mechanism dependencies, and execution traces. On top of this representation, LLM agents operate as constrained reasoning engines that generate structured, evidence-grounded explanations linking simulator outcomes to their underlying mechanisms. We evaluate our approach across multiple high-stakes domains and show that it improves mechanism-level explanation quality, simulator analysis, and downstream decision-making reliability.
[AI-50] Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots
Link: https://arxiv.org/abs/2606.04503
Authors: Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Bowen Song, Weiqiang Wang, Gang Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has greatly advanced large reasoning models (LRMs), but it requires timely training on a huge fully-annotated dataset. To this end, data-efficient RLVR methods have been widely studied from two perspectives: (i) data selection methods identify a small subset of “golden” samples that yield near-full-data performance, but they rely on a pre-existing pool of labeled data; (ii) unsupervised RLVR methods train the model using its own internal supervision signals on large-scale unlabeled data, yet they exhibit suboptimal performance. Accordingly, we investigate the “pick in the dark” setup for RLVR, which aims to select, without prior supervision, unlabeled samples that are most beneficial for training and worthy of annotation. Through systematic analysis, we demonstrate that smart picks hinge on a well-calibrated uncertainty estimator to enable strategic partitioning of data for adaptive training regimes. Building on this insight, we propose PivotTrace, a three-way data triage framework that leverages attention dynamics to trace metacognitive pivots during reasoning. By precisely quantifying uncertainty through pivot density, PivotTrace achieves automated data routing to synergistically maximize both annotation and training efficiency. Empirically, PivotTrace surpasses the fully supervised LRM with only 29.3% annotated samples and 2.75× faster convergence.
[AI-51] Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System
Link: https://arxiv.org/abs/2606.04494
Authors: Zhangtianyi Chen, Florensia Widjaja, Wufei Dai, Xiangjun Zhang, Yuhao Shen, Juexiao Zhou
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Biomedical agents promise to automate complex biological workflows, yet current systems face two fundamental bottlenecks: bioinformatics tools are highly heterogeneous in interfaces and execution environments, while agent planning still relies on flat prompt-retrieved tool descriptions. As biomedical software ecosystems grow, this coupling between tool coverage and context size leads to tool confusion, unstable planning, and inefficient execution. We introduce BioManus, an MCP-native biomedical agent built on graph-scaffolded planning over structured biological capabilities. BioManus first introduces the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers, yielding a large executable MCP ecosystem. It then organizes this ecosystem as a typed heterogeneous MCP graph over tools, operations, datatypes, and workflow stages. At inference time, BioManus retrieves compact task-specific subgraphs and synthesizes operation-level workflow scaffolds. This design decouples planning complexity from raw tool inventory size, achieving a context compression ratio of $\Theta(N / (h \cdot \bar{m}))$ under high-recall retrieval, where $N$ is the total tool count, $h$ is the workflow horizon, and $\bar{m}$ (much smaller than $N$) is the average number of candidate tools per operation. Experiments on BioAgentBench and LAB-Bench show that BioManus improves execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines. This work suggests a paradigm shift: scalable biomedical reasoning requires structured executable capability graphs rather than ever-larger prompt-level tool retrieval.
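To give a feel for the stated compression ratio, here is a worked example with purely hypothetical numbers (none of them come from the paper). Here $N$ is the total tool count, $h$ the workflow horizon, and $\bar{m}$ the average number of candidate tools per operation. Because the bound is asymptotic, constants are ignored and the result is only an order-of-magnitude reading.

```latex
% Hypothetical values, chosen only for illustration:
%   N = 1000 tools, h = 5 workflow steps, \bar{m} = 4 candidates per operation.
\[
  \frac{N}{h \cdot \bar{m}} \;=\; \frac{1000}{5 \times 4} \;=\; 50
\]
% Reading: a flat prompt would describe all N = 1000 tools, whereas a
% scaffold with h steps and \bar{m} candidates per step describes about
% h \cdot \bar{m} = 20, i.e. roughly a 50-fold reduction in listed tools.
```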
[AI-52] ChessMimic: Per-Rating Transformer Models for Human Move, Clock, and Outcome Prediction in Online Blitz Chess
Link: https://arxiv.org/abs/2606.04473
Authors: Thomas Johnson
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present ChessMimic, a system of three small encoder-only transformers (for move, thinking-time, and outcome prediction) conditioned on the position, recent move history, player rating, and clock state. We fit a separate instance of each model per 100-Elo rating band, trading parameter efficiency for sharper per-skill calibration. On a held-out month-wide slice of Lichess Rated Blitz games, ChessMimic’s human move prediction accuracy outperforms Maia-2 in every Elo band. Compared to Maia-3, our 9M parameter model’s accuracy sits between Maia-3-5M and Maia-3-23M without the additional complexity of Geometric Attention Bias. In addition to the move matching model, we also train a game outcome model that conditions not only on the position, but also on player ratings, time control, and remaining clock times. The outcome model achieves an AUC of 0.78 out of sample, beating Maia-2 as well as logistic regressions based on material, ratings, and clock time. Finally, we train a clock model that predicts human thinking times. The clock model provides a usable but non-SOTA per-ply think-time signal under ALLIE-style filters (Pearson r = 0.41, Spearman rho = 0.50, MAE 4.10 s, against ALLIE’s reported r = 0.70), with the residual gap concentrated in per-position bucket sharpness rather than bucket-marginal calibration. A public demo is at this http URL and we release code, per-band weights, and the C++ data-filter pipeline code on GitHub.
[AI-53] ParetoPilot: Zero-Surrogate Offline Multi-Objective Optimization via Infer-Perturb-Guide Diffusion
Link: https://arxiv.org/abs/2606.04468
Authors: Ruiqing Sun, Sen Yang, Dawei Feng, Bo Ding, Yijie Wang, Huaimin Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Comments:
Abstract:Offline multi-objective optimization (Offline MOO) aims to discover novel Pareto-optimal designs based on static datasets without expensive environment interactions. While recent generative methods have achieved notable success, they predominantly rely on external surrogate models. This dependency introduces significant computational overhead, suffers from deceptive evaluations, and deviates from the prevailing paradigm of jointly training mainstream generative models with conditions. To address these bottlenecks, we propose ParetoPilot, a novel zero-surrogate diffusion framework for offline MOO. ParetoPilot fully leverages the conditional priors inherently embedded within pre-trained diffusion models. At its core, the framework introduces the Infer-Perturb-Guide (IPG) engine, which is seamlessly interleaved within the unconditional denoising steps of the reverse generation process. First, it implicitly infers the instantaneous objective direction by matching conditional and unconditional noise predictions. Next, it mathematically orthogonalizes a parallel gravity field for strict convergence and an edgeness-aware repulsive force for mutual diversity, creating a dynamically annealed perturbation vector. Finally, this perturbed target seamlessly steers the generation process via standard Classifier-Free Guidance (CFG). Extensive experiments across 51 tasks demonstrate that ParetoPilot outperforms 14 state-of-the-art surrogate-based and inverse generative baselines. By eliminating auxiliary proxy training, our approach preserves data privacy while achieving hypervolume improvement and robust Pareto front coverage.
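The final steering step relies on standard Classifier-Free Guidance, which can be sketched in a few lines. This shows only the generic CFG combination of conditional and unconditional noise predictions; it does not implement the Infer-Perturb-Guide engine, and the function name, toy arrays, and guidance scale are assumptions for illustration.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: start from the unconditional
    noise prediction and move along the conditional direction."""
    direction = eps_cond - eps_uncond  # conditional-minus-unconditional
    return eps_uncond + guidance_scale * direction

# Toy noise predictions standing in for a diffusion model's outputs.
eps_u = np.array([0.0, 1.0])
eps_c = np.array([1.0, 1.0])
guided = cfg_combine(eps_u, eps_c, guidance_scale=2.0)
```

A scale of 0 returns the unconditional prediction, 1 returns the conditional one, and larger values extrapolate beyond it.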
[AI-54] CyberGym-E2E: Scalable Real-World Benchmark for AI Agents End-to-End Cybersecurity Capabilities (ICML 2026)
Link: https://arxiv.org/abs/2606.04460
Authors: Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang, Francisco De La Riega, Zhun Wang, Jingzhi Jiang, Alexander Cheung, Sean Tai, Jonah Cha, Jianhong Tu, Gabriel Han, Chenguang Wang, Jingxuan He, Wenbo Guo, Dawn Song
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2026
Abstract:AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents’ abilities across the full lifecycle of vulnerability discovery, PoC generation, and patch generation. CyberGym-E2E is comprehensive and scalable, as we build an automated, agent-enhanced pipeline for transforming open-source vulnerability data into realistic evaluation environments. Currently, the benchmark consists of 920 real-world vulnerabilities across 139 different open-source projects.
[AI-55] RowNet: A Memory Transformer for Tabular Regression
Link: https://arxiv.org/abs/2606.04445
Authors: Askat Rakhymbekov, Gulshat Muhametjanova
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
Comments: Retrieval-based neural architecture for real estate valuation. Related to TabR (arXiv:2307.14338) and retrieval-augmented tabular learning
Abstract:Real estate valuation is a structured regression problem in which prices are governed by heterogeneous feature types, sparse regional effects, nonlinear interactions, and the practical logic of comparable properties. Standard multilayer perceptrons treat each row as an isolated vector and must learn locality, scale sensitivity, and categorical matching from supervision alone. Gradient-boosted decision trees provide strong tabular baselines, but their feature-centric splitting mechanism does not explicitly model the retrieval of similar historical observations. This paper presents RowNet, a retrieval-based neural architecture for real estate price-per-square-meter prediction. RowNet represents a query property through pairwise similarity features against a memory bank of labeled properties. A first retrieval layer estimates a coarse target from feature-only similarities. A second layer augments the memory comparison with target-consistency features and uses multiple learned attention heads to retrieve complementary comparable sets. A final mixture-of-experts module combines learned gating, residual correction, entropy regularization, and head-diversity regularization to produce the prediction.
[AI-56] LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling
Link: https://arxiv.org/abs/2606.04438
Authors: Wenkai Chen, Tianshu Li, Wenyong Huang, Yichun Yin, Lifeng Shang, Chengwei Qin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mixture-of-Experts (MoE) and looped architectures scale models along two orthogonal axes, namely parameter capacity and effective depth. However, mainstream looped architectures rely on dense backbones that couple parameter count with per-token FLOPs, which makes it impossible to isolate the effect of iterative computation under matched budgets. To this end, we present LoopMoE, a looped MoE language model that integrates sparse routing with iterative weight-shared computation through two designs. The first is IterAdaLN, which resolves weight-sharing symmetry via a modulation signal jointly conditioned on the iteration index and the per-token hidden state. The second is a capacity-balancing strategy that recovers the attention-to-FFN active parameter ratio of well-tuned non-looped references. Together, these designs enable the first strictly controlled, head-to-head evaluation of a looped MoE against a Vanilla MoE under identical total parameters, per-token FLOPs, and active sublayer ratios. At the 3B scale, LoopMoE outperforms the Vanilla MoE on 8 of 9 downstream benchmarks with an average improvement exceeding 1 point. At the 9B scale, LoopMoE continues to outperform the matched Vanilla MoE, indicating that the architectural gain persists at larger scale. Our work establishes a controlled synthesis of sparsity and recurrence, and suggests a promising direction for looped language models.
[AI-57] What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems
Link: https://arxiv.org/abs/2606.04425
Authors: Yuanbo Xie, Tianyun Liu, Yingjie Zhang, Suchen Liu, Yulin Li, Liya Su, Tingwen Liu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: position paper
Abstract:Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts. This shift fundamentally expands the attack surface of prompt injection. However, prior works on prompt injection have largely focused on model-level threats within a single session, overlooking how cross-session persistent system state fundamentally changes the system-level risk of agentic systems. Inspired by stored cross-site scripting in web systems, we introduce cross-session stored prompt injection, where a successful injection can persist within agentic system state and silently influence future executions long after the original attacker interaction has ended. To systematically study this threat, we formalize stored prompt injection and develop a taxonomy of how adversarial content persists and affects agentic systems across sessions. We further develop a benchmark and sandbox toolkit to evaluate the risks of stored prompt injection, enabling quantitative analysis of attack success across different models, attack goals, and persistence channels. Our findings highlight that persistence transforms prompt injection from an ephemeral model-level threat into a long-lived system-level vulnerability embedded within agent execution state. We hope this work draws broader attention to this emerging threat and motivates the community to systematically study and mitigate system risks arising from persistence in agentic systems.
[AI-58] Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers
Link: https://arxiv.org/abs/2606.04421
Authors: Edward Y. Chang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 62 pages, 12 tables, 12 figures
Abstract:Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a model-capacity one. We propose long-horizon temporal regret as a first-class objective alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before correction. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret is driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, inducing O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.
[AI-59] An Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization
Link: https://arxiv.org/abs/2606.04408
Authors: Rui Zhang,Jinhang Liu,Wenbo Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:High-dimensional and incomplete (HDI) data are prevalent in many real-world big data scenarios. Latent factor models serve as a common representation learning approach, capable of uncovering informative latent factors from such data. Nevertheless, most existing latent factor models rely solely on gradient descent for optimization, which may lead to insufficient and biased representations, particularly when dealing with heterogeneous HDI data. Thus, this study proposes an Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization (ELFM-DEGDO) with a two-fold design: 1) two diverse latent factor models are built independently via differential evolution and gradient descent optimization, respectively, and 2) the two models are combined via a customized self-adaptive weighting mechanism to effectively fuse their strengths. By leveraging the complementary advantages of both optimization paradigms, ELFM-DEGDO is able to produce more comprehensive and less biased representations for HDI data. Experiments on three HDI datasets show that ELFM-DEGDO consistently outperforms several related latent factor models.
[AI-60] Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View
Link: https://arxiv.org/abs/2606.04405
Authors: Mingyu Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modern Transformer architectures frequently employ normalization mechanisms such as RMSNorm and Query-Key Normalization, making parts of the model approximately scale-invariant with respect to weight magnitudes. In this regime, standard Frobenius-norm weight decay acts purely along the radial direction of the weight space and cannot directly simplify the function represented by the normalized layer. We study grokking in small algorithmic tasks through this lens and propose Low-Rank Decay (LRD), a nuclear-norm-like spectral regularizer whose subgradient (the polar factor $UV^\top$) retains a tangential component even in the scale-invariant setting. This distinction has a concrete dynamical consequence: after the model memorizes the training set and task gradients vanish, L2 decay can no longer reshape the weight spectrum, whereas LRD continues to compress singular values in an $\ell_1$-like fashion. On modular arithmetic tasks, we find that LRD induces rapid effective-rank collapse in Query/Key matrices and expands the data-fraction boundary at which delayed generalization (grokking) occurs. We further provide a spectral-geometric interpretation through the "needle-to-fan" expansion of the nuclear-norm subdifferential near low-rank strata.
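The radial-versus-tangential distinction in this abstract can be checked on a toy matrix. The sketch below is not the paper's implementation: it assumes a nonsingular 2x2 weight, takes "radial" to mean the direction of W itself under the Frobenius inner product, and computes the polar factor with a Newton iteration rather than an SVD. All function names are hypothetical.

```python
import math

def frob_inner(A, B):
    """Frobenius inner product <A, B> = sum_ij A_ij * B_ij."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def frob_norm(A):
    return math.sqrt(frob_inner(A, A))

def polar_factor_2x2(W, iters=60):
    """Orthogonal polar factor U V^T of a nonsingular 2x2 matrix W = U S V^T.

    Uses the Newton iteration X <- (X + X^{-T}) / 2 (an illustrative choice).
    """
    X = [row[:] for row in W]
    for _ in range(iters):
        (a, b), (c, d) = X
        det = a * d - b * c
        inv_t = [[d / det, -c / det], [-b / det, a / det]]  # inverse transpose of X
        X = [[0.5 * (X[i][j] + inv_t[i][j]) for j in range(2)] for i in range(2)]
    return X

def tangential_part(G, W):
    """Component of G orthogonal to W, i.e. G with its radial part removed."""
    coef = frob_inner(G, W) / frob_inner(W, W)
    return [[G[i][j] - coef * W[i][j] for j in range(2)] for i in range(2)]

W = [[3.0, 0.0], [0.0, 1.0]]      # toy weight with singular values 3 and 1
l2_grad = W                        # gradient of 0.5 * ||W||_F^2
lrd_grad = polar_factor_2x2(W)     # gradient of the nuclear norm at full-rank W

l2_tangential = frob_norm(tangential_part(l2_grad, W))
lrd_tangential = frob_norm(tangential_part(lrd_grad, W))
print(f"L2 decay tangential norm:        {l2_tangential:.4f}")
print(f"nuclear-norm tangential norm:    {lrd_tangential:.4f}")
```

In this toy case the L2 gradient has no tangential part, while the polar factor keeps one whenever the singular values are unequal.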
[AI-61] Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation
Link: https://arxiv.org/abs/2606.04402
Authors: Jingbo Wen,Liang He,Ziqi He
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty and spend more compute where it is expected to raise accuracy. This implicitly assumes that all failures cost the same, since an accuracy objective weights every task equally. However, such an assumption does not hold in deployment: A typo in a log message and a migration that corrupts a production database both count as one benchmark failure, but their real-world costs are fundamentally different. To fill this gap, we propose consequence-aware test-time compute allocation. Instead of routing compute only by predicted difficulty, we use a lightweight predictor to estimate from the issue text how costly a task would be if solved incorrectly. The scheduler then routes higher-consequence tasks to larger compute tiers or higher thinking budgets under the same total budget. We conduct main experiments on SWE-bench Lite and evaluate cross-dataset behavior on Multi-SWE-bench mini, covering 700 software-engineering tasks in total. Our results reveal that consequence and difficulty are approximately orthogonal under various annotations, and that current thinking models do not allocate compute sufficiently according to consequence. Moreover, our issue-only predictor never misclassifies a high-consequence task as low-consequence across the 300 SWE-bench tasks. Under matched compute budgets, our consequence-aware scheduler reduces cost-weighted loss by 22% to 33% relative to difficulty-aware routing; in particular, the priority-aware variant, which routes by per-task cost scaled by the marginal-utility signal, crosses 30%, and its deployable predictor-driven version retains over 90% of the oracle gain.
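To make the idea of a cost-weighted loss concrete, here is a toy sketch of the three routing rules the abstract contrasts. It does not reproduce the paper's scheduler or results: the tasks, costs, failure probabilities, and the assumption that the larger tier halves the failure probability are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    cost: float    # hypothetical predicted cost of an incorrect solution
    p_fail: float  # hypothetical failure probability on the small compute tier

LARGE_TIER_FACTOR = 0.5  # assumption: the large tier halves the failure probability

def expected_loss(tasks, upgraded):
    """Cost-weighted expected loss: sum of cost * failure probability."""
    total = 0.0
    for t in tasks:
        p = t.p_fail * LARGE_TIER_FACTOR if t.name in upgraded else t.p_fail
        total += t.cost * p
    return total

def route(tasks, score, n_upgrades):
    """Send the n_upgrades highest-scoring tasks to the large tier."""
    ranked = sorted(tasks, key=score, reverse=True)
    return {t.name for t in ranked[:n_upgrades]}

tasks = [
    Task("typo in log message", cost=1.0, p_fail=0.6),
    Task("production DB migration", cost=100.0, p_fail=0.2),
    Task("UI label tweak", cost=2.0, p_fail=0.8),
    Task("auth bug fix", cost=50.0, p_fail=0.4),
]
BUDGET = 2  # same number of large-tier slots for every policy

difficulty_choice = route(tasks, lambda t: t.p_fail, BUDGET)
consequence_choice = route(tasks, lambda t: t.cost, BUDGET)
# Priority-style score: cost scaled by the marginal reduction in failure probability.
priority_choice = route(
    tasks, lambda t: t.cost * t.p_fail * (1.0 - LARGE_TIER_FACTOR), BUDGET
)

difficulty_loss = expected_loss(tasks, difficulty_choice)
consequence_loss = expected_loss(tasks, consequence_choice)
priority_loss = expected_loss(tasks, priority_choice)
print(f"difficulty-aware:  {difficulty_loss:.1f}")
print(f"consequence-aware: {consequence_loss:.1f}")
print(f"priority-aware:    {priority_loss:.1f}")
```

With these made-up numbers the difficulty-aware rule spends its budget on the two cheap tasks, so its cost-weighted loss stays high. The consequence-aware and priority-style rules happen to pick the same two tasks here.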
[AI-62] Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval
Link: https://arxiv.org/abs/2606.04391
Authors: Jiaxi Li,Ke Deng,Yun Wang,Jingyuan Huang,Yucheng Shi,Qiaoyu Tan,Jin Lu,Ninghao Liu
Subjects: Artificial Intelligence (cs.AI)
Comments: 17 pages
Abstract:Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task-level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at this https URL.
[AI-63] TITAN-FedAnil: Trust-Based Adaptive Blockchain Federated Learning for Resource-Constrained Intelligent Enterprises
Link: https://arxiv.org/abs/2606.04388
Authors: Muhammad Hadi,Muhammad Jahangir,Talha Shafique,Muhammad Khuram Shahzad
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures; code available at this https URL
Abstract:Federated Learning (FL) has emerged as an effective paradigm for collaborative intelligence while preserving data privacy. However, data heterogeneity arising from non-IID distributions and decentralized security threats remain significant challenges, particularly in resource-constrained enterprise environments. This paper presents TITAN-FedAnil+, a Trust-Based Adaptive Network for blockchain-enabled federated learning in intelligent enterprises. The proposed framework introduces affinity propagation-based adaptive clustered aggregation to identify and filter malicious updates without requiring prior knowledge of the number of attackers. In addition, GPU-accelerated vectorization is employed to improve computational efficiency, while a signed state jump mechanism enables lightweight blockchain resynchronization. Experimental results demonstrate substantial reductions in memory overhead, achieving up to 81% savings across 50 communication rounds on constrained 8 GB edge devices compared with the baseline framework. The results indicate that TITAN-FedAnil+ effectively improves robustness, scalability, and resource efficiency for secure federated learning deployments in intelligent enterprise environments.
[AI-64] From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models
Link: https://arxiv.org/abs/2606.04381
Authors: Chen Chu,Bita Azarijoo,Li Xiong,Khurram Shafique,Cyrus Shahabi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent large language models (LLMs) often appear to exhibit spatial reasoning ability; however, this capability is largely symbolic, arising from pattern matching over spatial language rather than true geometric reasoning over space. Because LLMs operate on discrete tokens, they lack native support for continuous spatial representations, explicit geometric computation, and structured spatial operators. To address this limitation, we introduce the Spatial Language Model (SLM), the first multimodal LLM that treats location information as a first-class modality and enables geometric spatial reasoning within the model’s inference process. SLM directly operates on learned spatial representations rather than textual descriptions of spatial relations. To support effective training, we construct a Spatial Instruction Dataset that aligns spatial representations, atomic geometric operations, and natural language instructions. We further propose a new benchmark named SpatialEval, which is designed to evaluate spatial reasoning across attributes, distance, topology, and relative-position tasks. Extensive experiments show that SLM significantly outperforms existing LLM-based approaches that rely on symbolic reasoning via prompt engineering or textual abstraction, demonstrating the benefits of integrating geometric spatial representations for robust spatial reasoning. Our instruction dataset, evaluation benchmark, model training codes, and models’ checkpoints can be found at: this https URL.
[AI-65] Expectations vs. Realities: The Cost of MSE-Optimal Forecasting Under Conditional Uncertainty KDD2026
Link: https://arxiv.org/abs/2606.04342
Authors: Riku Green,Zahraa S. Abdallah,Telmo M Silva Filho
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, Accepted for KDD 2026 Research track
Abstract:Multi-step time series forecasting (MSF) is commonly evaluated using point-wise error metrics such as mean squared error (MSE), implicitly treating the conditional mean as a sufficient target. We show that this can be misleading under conditional uncertainty, where the conditional expectation becomes unrepresentative of typical realized values at longer horizons. We formalize this effect through a conditional uncertainty gap and prove that whenever this gap is nonzero, no deterministic predictor can simultaneously minimize MSE and match the marginal distribution of realized futures. This establishes a fundamental, model-agnostic trade-off between point accuracy and marginal realism in MSF evaluation. Using controlled stochastic dynamical systems and nine real-world forecasting benchmarks, we empirically characterize the resulting accuracy–realism frontier and quantify the practical cost of MSE-only model selection. As conditional uncertainty increases with forecast horizon, the attainable set expands into a pronounced Pareto front, separating MSE-optimal but under-dispersed predictors from methods that trade accuracy for realistic marginal variability. Across benchmarks, we find that small relaxations in MSE (≤ 5%) frequently unlock disproportionate gains in marginal realism, with median improvements of 17.3% and gains exceeding 30% in some datasets. We further show that common forecasting strategies systematically occupy different regions of this frontier: direct multi-output predictors concentrate near the accuracy-optimal extreme, while recursive strategies and sample-based inference favor marginal realism. Together, these results expose a structural failure mode of MSE-based evaluation in long-horizon forecasting and recast strategy and inference selection as navigation of an unavoidable accuracy–realism trade-off.
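The accuracy-versus-realism tension described here shows up even in a two-outcome toy. The sketch below assumes a future value that is -1 or +1 with equal probability given the same history. It is a hand-built illustration, not one of the paper's benchmarks or its definition of the conditional uncertainty gap, and it enumerates outcomes exactly instead of sampling.

```python
from itertools import product
from statistics import mean, pvariance

# Toy conditional distribution: given the same history, the future is -1 or +1
# with equal probability (an assumption made purely for illustration).
outcomes = [-1.0, 1.0]

def mse_of_constant(c):
    """Expected squared error of always predicting the constant c."""
    return mean([(y - c) ** 2 for y in outcomes])

# The MSE-optimal deterministic forecast is the conditional mean.
mean_forecast = mean(outcomes)
mse_mean = mse_of_constant(mean_forecast)

# A sample-based forecast draws its prediction independently from the same
# distribution; enumerate all (realized, predicted) pairs for the exact value.
mse_sample = mean([(y - y_hat) ** 2 for y, y_hat in product(outcomes, outcomes)])

var_realized = pvariance(outcomes)                        # spread of real futures
var_mean_forecast = pvariance([mean_forecast] * len(outcomes))
var_sample_forecast = pvariance(outcomes)                 # same law as the futures

print(f"conditional-mean forecast: MSE={mse_mean}, variance={var_mean_forecast}")
print(f"sample-based forecast:     MSE={mse_sample}, variance={var_sample_forecast}")
print(f"realized futures:          variance={var_realized}")
```

The conditional-mean forecast has the lower MSE but zero spread, while the sample-based forecast matches the realized variance at twice the MSE. This is the under-dispersion the abstract attributes to MSE-optimal predictors.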
[AI-66] From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents
Link: https://arxiv.org/abs/2606.04329
Authors: Pritam Dash,Tongyu Ge,Aditi Jain,Tanmay Shah,Zhiwei Shang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory write can exert long-term influence over agent behavior. We present a systematic study of memory poisoning in LLM-based agents. We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architecture that make these channels exploitable. Based on these vulnerabilities, we develop a taxonomy of six classes of memory poisoning attacks. Furthermore, we design MPBench – a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are more exploitable. We also show that existing prompt injection defenses fail to cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.
[AI-67] Generalizable Multi-Task Learning for Wireless Networks Using Prompt Decision Transformers
Link: https://arxiv.org/abs/2606.04328
Authors: Fatih Temiz,Shavbo Salehi,Melike Erol-Kantarci
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: Accepted paper at IEEE International Mediterranean Conference on Communications and Networking (MeditCom) 2026
Abstract:Future wireless networks demand rapid adaptation to highly heterogeneous environments and dynamic task configurations, necessitating a shift from conventional rule-based and optimization-driven radio resource management (RRM) toward artificial intelligence (AI)-driven RRM. AI-driven approaches can learn complex nonlinear relationships, generalize across diverse network conditions and enable real-time, scalable and autonomous decision-making. Among RRM techniques, coordinated multipoint (CoMP) transmission is pivotal for mitigating inter-cell interference and enhancing cell-edge performance, thereby improving quality of experience (QoE) in dense deployments. However, optimal multi-cell selection remains a complex combinatorial challenge as it requires jointly optimizing over many possible serving-cell combinations under dynamic traffic and channel conditions. Despite their success, conventional deep reinforcement learning (DRL) methods such as proximal policy optimization (PPO) suffer from poor sample efficiency, limited generalization, and costly retraining when state and action spaces change. To address these bottlenecks, we propose a Prompt Decision Transformer (PromptDT) based multi-task learning framework capable of learning across diverse network configurations and reformulating multi-cell selection as a sequence modeling problem. By leveraging offline trajectories and task-specific prompts, PromptDT enables scalable learning across diverse network configurations, including varying base stations and user equipment counts, and scheduler policies. Experimental results demonstrate that PromptDT improves QoE by up to 49% in multi-task settings compared to baselines, with performance scaling positively alongside model capacity. Moreover, PromptDT generalizes effectively to unseen tasks, achieving robust few-shot adaptation to new network configurations without retraining or fine-tuning.
[AI-68] A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks
Link: https://arxiv.org/abs/2606.04327
Authors: Tian Ding,Dawei Li,Ruoyu Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 47 pages
Abstract:We investigate the geometric structure of stationary plateaus that arise in the loss landscape of two-layer neural networks with smooth activation functions. We focus on the phenomenon of “neuron splitting” where duplicating a hidden neuron yields an affine set of stationary points in a wider network. We provide a comprehensive classification of all stationary points on these plateaus, determining under what conditions they constitute local minima or saddle points. Our characterization hinges on a per-neuron curvature object we term the “inner Hessian” matrix. Our analysis reveals that the definiteness of the inner Hessian and the choice of splitting coefficients jointly dictate the local geometry of the plateau. We show that “splitting” a local minimum can yield either a mixture of local minima and saddles or an all-saddle plateau, with a concrete sure-saddle region identified under mild assumptions. In contrast, splitting a saddle point always produces a plateau of saddle points. Our results unify and extend prior landscape analyses, elucidating when and how model expansion preserves or alters the nature of stationary points. These findings offer new geometric insights into the effects of width expansion and reparameterization in neural networks.
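The neuron-splitting construction can be illustrated with a few lines of code. This sketch assumes a scalar-input network f(x) = sum_j a_j tanh(w_j x) with no biases. It checks only that the network function is unchanged along the line of splitting coefficients, not the stationarity or curvature results of the paper. The weights and helper names are hypothetical.

```python
import math

def two_layer(x, hidden, out):
    """f(x) = sum_j out[j] * tanh(hidden[j] * x) for a scalar input x."""
    return sum(a * math.tanh(w * x) for w, a in zip(hidden, out))

def split_neuron(hidden, out, k, lam):
    """Duplicate hidden neuron k in a network that is one unit wider.

    Both copies keep the input weight hidden[k]; their output weights are
    lam * out[k] and (1 - lam) * out[k], so they always sum to out[k].
    """
    new_hidden = hidden + [hidden[k]]
    new_out = out[:k] + [lam * out[k]] + out[k + 1:] + [(1.0 - lam) * out[k]]
    return new_hidden, new_out

hidden = [0.5, -1.2, 2.0]   # toy input weights
out = [1.0, 0.7, -0.3]      # toy output weights
xs = [-2.0, -0.5, 0.0, 1.0, 3.0]
base = [two_layer(x, hidden, out) for x in xs]

# Sweep the splitting coefficient, including values outside [0, 1]:
# the wider network computes the same function for every lam.
max_gap = 0.0
for lam in (-1.0, 0.0, 0.25, 0.5, 2.0):
    wide_hidden, wide_out = split_neuron(hidden, out, k=1, lam=lam)
    wide = [two_layer(x, wide_hidden, wide_out) for x in xs]
    max_gap = max(max_gap, max(abs(u - v) for u, v in zip(base, wide)))

print(f"largest output difference across splits: {max_gap:.2e}")
```

Because the output is identical for every coefficient, the loss is constant along this line in the wider network's parameter space, which is why such a set can form a plateau.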
[AI-69] Measuring What Matters: Synthetic Benchmarks for Concept Bottleneck Models
Link: https://arxiv.org/abs/2606.04326
Authors: Julian Skirzynski,Harry Cheon,Shreyas Kadekodi,Meredith Stewart,Berk Ustun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Benchmarks available at this https URL
Abstract:Concept bottleneck models predict outcomes from high-level concepts detected in inputs. Although concepts provide a simple way to reap benefits from interpretability, very few datasets include concept labels. This limits researchers’ ability to determine which problems are suitable for these models, isolate the factors that drive their performance or lead to failures, or uncover which algorithms perform well. In this paper, we develop synthetic benchmarks for concept-bottleneck models, focusing on their two main use cases: decision support, in which models assist humans in making better decisions, and automation, in which models handle routine tasks without supervision. Our benchmarks can generate labeled datasets while controlling for properties that affect performance, including data modality, concept choice, annotation quality, and completeness. We demonstrate how the benchmarks can be used to evaluate representative classes of concept bottleneck models. Our demonstrations show how the benchmarks can diagnose failure modes and guide follow-up testing.
[AI-70] The Digital Apprentice: A Framework for Human-Directed Agentic AI Development
Link: https://arxiv.org/abs/2606.04321
Authors: Travis Weber,Rohit Taneja
Subjects: Artificial Intelligence (cs.AI)
Comments: Submitted to ACM AI Leadership Summit 2026, Visionary Papers Track. 5 pages, 2 figures
Abstract:Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Neither posture provides the governance infrastructure required for responsible delegation. We present the Digital Apprentice, a framework for scalable, safe AI agency in which autonomy is earned, not assumed. The Digital Apprentice is a developmental learner that internalizes the tacit methodology of a directing human, graduating through per-skill autonomy tiers only when empirical evidence justifies it. The result is an agent that becomes genuinely useful over time while remaining aligned to a specific human’s standards. Three architectural components make this possible. (1) Methodology capture, distilling a directing professional’s tacit approach into structured assets. (2) Authorization, with autonomy escalation gated by explicit human approval. (3) Continuous alignment, correcting drift at runtime and converting each correction into owned preference data. We instantiate this framework as an inference-time control plane. We mathematically model the quality framework and discuss policies and techniques designed to raise quality. We apply the framework to an open professional corpus, and we show how catching data drift and applying a different technique at runtime recovers degraded quality dimensions under traffic shift. The implication extends beyond any single application. We believe these three pillars, stitched together as a system, form a safer and more viable path to agentic systems that can scale without sacrificing trust.
[AI-71] OpenRFM: Dissecting Relational In-Context Learning
Link: https://arxiv.org/abs/2606.04320
Authors: Zhikai Chen,Junyu Yin,Jialiang Gu,Siheng Xiong,Xiaoze Liu,Ruowang Zhang,Keren Zhou,Kai Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, including appendix
Abstract:Relational Foundation Models (RFMs) promise a single pre-trained predictor that, given any relational database, returns predictions in one forward pass via relational in-context learning (ICL). Yet a substantial gap separates open RFMs from their commercial counterparts, and the origin of this gap has not been systematically understood. We dissect a representative framework, the Relational Transformer (RT), from two perspectives. Model side: we show that RT performs relation-level ICL, and a kernel regression view shows it fails when sparse label-cell coverage yields an underdetermined regression. Data side: we ablate RT’s pre-training source and find that existing synthetic-only pre-training and in-distribution pre-training drive the same architecture into different regimes, lazy vs. feature-learning. Probing this gap reveals that the missing ingredient is a support-identifiable relational latent in the label-generation process. These two diagnoses translate into (1) a dual-stage ICL architecture that combines the relational backbone with a batch-level ICL layer lifted from a pre-trained tabular foundation model to overcome relation-level label scarcity, and (2) a homophily-aware synthetic plus continual real-data pre-training mixture, augmented with a prototype-based regularization. These choices define OpenRFM, a simple yet effective RFM that improves average task performance by approximately 30% over the RT backbone and surpasses the commercial model KumoRFMv1 on a large set of evaluation tasks.
[AI-72] Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline
Link: https://arxiv.org/abs/2606.04315
Authors: Zhikai Chen,Jialiang Gu,Junyu Yin,Xianxuan Long,Shenglai Zeng,Xiaoze Liu,Kai Guo,Keren Zhou,Jiliang Tang
Subjects: Artificial Intelligence (cs.AI)
Comments: 14 pages
Abstract:LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.
[AI-73] Anycast Performance in Context
Link: https://arxiv.org/abs/2606.04298
Authors: Eric Liang
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:
Abstract:IP anycast lets a service advertise one address from many physical sites, leaving BGP to map each client to a site. It is central to the DNS root server system, public resolvers, and some content delivery networks, yet the same routing mechanism has very different consequences across applications. This paper compares anycast latency in two settings: root DNS, where recursive caching amortizes root-server delay over many users and long time-to-live values, and CDNs, where each additional round trip can directly affect page-load, video-start, or API latency. The synthesis finds that root DNS anycast can exhibit substantial path inflation while still producing limited user-visible delay, whereas CDN anycast requires active engineering of peering, route policy, catchment scope, and measurement feedback to keep inflation small. The paper contributes a comparative latency model, a reproducible measurement design, and an optimization framework that separates resilience-driven anycast objectives from latency-driven objectives. The central conclusion is practical: operators should not optimize root DNS and CDN anycast with the same objective function. For root DNS, robustness, reachability, and cache behavior dominate; for CDN services, tail latency, catchment correctness, and policy control dominate.
[AI-74] The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents
Link: https://arxiv.org/abs/2606.04296
Authors: Manvendra Modgil
Subjects: Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 tables. Code and data: this https URL
Abstract:As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge - against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff’s alpha = +0.047; best pairwise Cohen’s kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector’s accuracy.
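For readers unfamiliar with the agreement statistics quoted in this abstract, the sketch below computes Cohen's kappa for two annotators using the standard formula kappa = (p_o - p_e) / (1 - p_e). The per-action labels are invented for illustration and are not the paper's annotations.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: both annotators pick the same category independently.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Degenerate case: both annotators used a single identical label.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels: 1 = "intervene at this action", 0 = "do not intervene".
annotator_a = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
annotator_b = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

The two toy annotators agree on 8 of 10 actions, yet kappa is only 0.375, because most of that agreement is expected by chance when interventions are rare.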
[AI-75] Scaling Novel Graph Generation via Lightweight Structure-Guided Autoregressive Models
Link: https://arxiv.org/abs/2606.04287
Authors: Alessio Barboni,Massimiliano Lupo Pasini,Bishal Lakha,Edoardo Serra
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generating realistic and diverse graphs is a key problem in machine learning, with applications in molecular discovery, circuit design, cybersecurity, and beyond. However, current graph generative models remain limited by scalability and novelty. Diffusion-based methods often require costly full-adjacency operations and long denoising chains, while many autoregressive and hybrid models have at least quadratic complexity. In addition, these models often imitate training graphs rather than generalize beyond them. We propose a lightweight autoregressive framework to address these issues. It uses a structure-guided topological ordering to serialize graphs into regular edge sequences, enabling near log-linear generation, and a two-phase training strategy that combines exploration-oriented augmentation with iterative refinement to reduce overfitting and promote controlled novelty. Experiments on molecular and non-molecular benchmarks show that our approach improves novelty while preserving high validity and uniqueness. The framework also supports both LSTM and Mamba-style causal sequence backbones, with large-memory accelerators enabling longer graph-sequence experiments beyond typical GPU limits.
[AI-76] From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments ICLR2026
Link: https://arxiv.org/abs/2606.04275
Authors: Saket Tiwari,Tejas Kotwal,George Konidaris
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Presented at ICLR 2026: this https URL
Abstract:We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on previous work, we introduce a viable model of an actor-critic algorithm that incorporates both exploration and stochastic transitions. For single-hidden-layer neural networks, we show that the state of the environment can be formulated as a process on two time scales: the environment time and the gradient time. Within this formulation, we characterize how the time-dependent random variables that represent the environment’s state and the estimate of the cumulative discounted return evolve over gradient steps in the infinite-width limit of two-layer networks. Using the theory of stochastic differential equations, we derive, for the first time in continuous RL, an equation describing the infinitesimal change in the state distribution at each gradient step, under a vanishingly small learning rate. Overall, our work provides a novel nonparametric formulation for studying overparametrized neural actor-critic algorithms. We empirically corroborate our theoretical result using a toy continuous control task.
[AI-77] Characterizing initial human-AI proof formalization workflows
Link: https://arxiv.org/abs/2606.04273
Authors: Katherine M. Collins,Simon Frieder,Jonas Bayer,Jacob Loader,Jeck Lim,Peiyang Song,Fabian Zaiser,Lexin Zhou,Shanda Li,Sam Looi,Joshua B. Tenenbaum,Umang Bhatt,Adrian Weller,Jose Hernandez-Orallo,Cameron E. Freer,Valerie Chen,Ilia Sucholutsky
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems’ ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people’s ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people’s formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people’s preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.
[AI-78] Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data
Link: https://arxiv.org/abs/2606.04238
Authors: Devleena Das,Rajeev Patwari,Elliott Delaye,Ashish Sirasao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deployment, where memory capacity and bandwidth are primary constraints. In this work, we extend Recover-LoRA – a lightweight, data-free accuracy recovery method originally developed for general model weight corruption – to the setting of ultra-low-bit quantization. We propose a selective mixed-precision strategy in which only gate and up projection layers of the MLP are quantized to 2-bit (W2), while all other linear layers remain at higher precision, yielding a mixed-precision GateUp configuration. We demonstrate via roofline analysis across three model families (4B–20B) and two hardware platforms that a W4/W2-GateUp deployment (4-bit base with 2-bit gate/up) delivers 7.5–23.3% TPS improvement over uniform W4 depending on model and context length, while confining quantization error to a predictable subset of layers. We then apply Recover-LoRA – training low-rank adapters on the quantized layers via logit distillation with synthetic data – to recover accuracy lost from 2-bit quantization of the gate and up layers. In a case study on Qwen3-4B, Recover-LoRA achieves 80–95% accuracy recovery on 9 of 12 benchmarks, using only 10k synthetic training samples and no labeled data. We further demonstrate that synthetic data performs comparably to curated labeled data for distillation-based recovery, and that recovery generalizes to out-of-distribution evaluation tasks. Our results present Recover-LoRA as a practical post-quantization accuracy recovery tool for aggressive weight compression in deployment settings.
[AI-79] Incremental Sheaf Cohomology on Cellular Complexes: O(1)-in-n Lazy Edit Processing under Bounded Local Geometry
Link: https://arxiv.org/abs/2606.04227
Authors: Jason L. Volk
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
Comments: 2 figures, 2 tables, 1 algorithm; code at this https URL
Abstract:We present an algorithmic framework for incremental maintenance of first sheaf cohomology $H^1(X; \mathcal{F})$ on dynamically evolving 1-dimensional cellular complexes equipped with finite-dimensional cellular sheaves. The classical computation of $H^1$ via factorization of the coboundary matrix requires $O(n^3)$ time; when the complex evolves with a stream of $m$ edits, full recomputation after each edit costs $O(mn^3)$. Under a bounded local geometry assumption – bounded cell size $v_{\max}$, bounded stalk dimension $d$, and bounded nerve degree $D$ – each edit (vertex insertion, edge insertion, restriction map update) affects only a bounded set of local coboundary blocks. The algorithm therefore processes lazy streaming edits in $O(1)$ time with respect to the total complex size $n$ (with cost polynomial in the local geometry parameters $v_{\max}$, $d$, and $D$, which are treated as constants independent of $n$), deferring local eigensolves and Mayer-Vietoris global assembly to synchronization points (Flush). At synchronization, the maintained state agrees with the corresponding batch assembly of the partitioned sheaf model; we observe zero measured drift in all batch-verified runs (through $V = 10^6$). We also give an amortized $O(|E|)$ streaming construction for the cellular decomposition and discuss an adversarial algebraic-RAM barrier arguing that unpartitioned non-trivial sheaves ($d \geq 2$, non-identity restriction maps) do not admit the same locality. Experiments on Barabasi-Albert graphs with up to $5 \times 10^6$ vertices and $1.7 \times 10^7$ streaming edits show 35 $\mu$s median lazy per-edit update latency (excluding flush); query time (global assembly at synchronization) is $O(n)$ per flush in the implemented full-traversal path. Exact synchronization costs are reported separately.
MSC classes: 55N30, 68W05, 05C85. ACM classes: F.2.2; G.2.2.
[AI-80] PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification ICRA2026
Link: https://arxiv.org/abs/2606.04226
Authors: Charlie Gauthier,Sacha Morin,Liam Paull
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at ICRA 2026 (Vienna); published on arXiv for archival purposes. See also this https URL
Abstract:Simulation environments are useful for both robot policy learning and planning verification and validation. Traditionally, the process of creating a simulation was onerous. Creating a bespoke simulation environment for each individual environment that a robot would operate in was simply infeasible. In this work, we introduce PerceptTwin, a fully automatic pipeline that constructs interactive simulations directly from semantic scene representations produced by a robot’s perception stack. PerceptTwin combines open-vocabulary object maps with 3D asset generation, affordance prediction, and commonsense condition checking. These interactive simulations can be used to validate and refine plans before they are executed on the robot hardware. Borrowing from the AI alignment literature, we also introduce an LLM judge that verifies plan correctness and alignment with human preferences. Experiments show that PerceptTwin feedback allows LLM planners to refine plans, enhance safety, and resist harmful black-box prompting attacks. In our suite of tasks, PerceptTwin improves plan success by an average of approximately 39% for GPT5, GPT5Mini, and GPT5Nano planners. Additionally, PerceptTwin also improves human plan verification by up to 18% on average for plans that fail due to unfilled skill preconditions. Our results demonstrate the potential of open-vocabulary scene simulation from robot perception as a foundation for safer, more reliable robot planning.
[AI-81] Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal KR
Link: https://arxiv.org/abs/2606.04223
Authors: Michał Wawer,Jarosław A. Chudziak
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to LAMASSR workshop at FLoC 2026 (KR + ICPL + LICS + CP + FSCD)
Abstract:Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this objective is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing provides a bridge between sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.
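The four disagreement states named in this abstract can be sketched as a small classifier over two inputs: a reasoning-similarity score and a pair of binary decisions. This is one reading of the taxonomy, in which "convergent" means similar reasoning and "agreement" means matching conclusions. The similarity scale, the 0.5 threshold, and all identifiers are assumptions for illustration, not the authors' definitions.

```python
from enum import Enum


class DisagreementState(Enum):
    CONVERGENT_AGREEMENT = "convergent agreement"
    DIVERGENT_AGREEMENT = "divergent agreement"
    CONVERGENT_DISAGREEMENT = "convergent disagreement"
    DIVERGENT_DISAGREEMENT = "divergent disagreement"


def classify(reasoning_similarity, decision_a, decision_b, threshold=0.5):
    """Map a similarity score in [0, 1] and two binary decisions to a state.

    The threshold is a hypothetical cut-off between similar ("convergent")
    and dissimilar ("divergent") reasoning traces.
    """
    convergent = reasoning_similarity >= threshold
    agree = decision_a == decision_b
    if agree:
        return (DisagreementState.CONVERGENT_AGREEMENT if convergent
                else DisagreementState.DIVERGENT_AGREEMENT)
    return (DisagreementState.CONVERGENT_DISAGREEMENT if convergent
            else DisagreementState.DIVERGENT_DISAGREEMENT)


# Two agents reach the same verdict through dissimilar reasoning.
print(classify(0.2, True, True).value)
```

Routing rules would then be keyed on the returned state; which action each state triggers is left open here because the abstract does not specify it.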
[AI-82] SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models
Link: https://arxiv.org/abs/2606.04202
Authors: Joel Sol,Homayoun Najjaran
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure
Abstract:As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.
[AI-83] Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions
Link: https://arxiv.org/abs/2606.04193
Authors: Juan Figuera
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 22 pages. Reference implementation at this https URL
Abstract:Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent’s call signs a receipt of what it observed using its own key, encrypts the receipt to the agent’s owner, and publishes it to a public transparency log. The owner reconstructs a tamper-evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver-side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness-cosigned Merkle log, and (P4) owner-side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt-protocol work (Signet, AgentROA, Agent Passport System, draft-farley-acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption-incentive problem.
[AI-84] Metric-Aware Hybrid Forecasting for the CTF4Science Lorenz Challenge
Link: https://arxiv.org/abs/2606.04191
Authors: Cen Lu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We describe our approach to the CTF4Science Lorenz challenge, a benchmark that mixes short-horizon forecasting, long-time distribution matching, and trajectory reconstruction across nine task pairs. The key discovery is that no single model family dominated all metrics. Instead, we built a metric-aware hybrid system that assigned a different predictor to each metric family: (1) synthetic-pretrained denoisers for full-trajectory reconstruction, (2) Lorenz ODE fitting and trajectory shooting for the first 20 forecast steps, and (3) histogram-tail substitution using synthetic Lorenz libraries for long-time evaluation. A representative mature submission from this system family scored 83.83551 on the public leaderboard, and a small follow-up stack of the same ideas reached 83.85529. We focus on the cleaner intermediate system because it captures the full method while remaining simple enough to reproduce and analyze, while the final submission can be understood as a conservative extension of the same backbone.
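Component (2) of this abstract, trajectory shooting for the first 20 forecast steps, can be sketched as integrating the Lorenz equations forward from the last observed state. The sketch below uses a fourth-order Runge-Kutta step with the classic parameters (sigma = 10, rho = 28, beta = 8/3) and a step size of 0.01. All of these values are placeholder assumptions: the paper fits the ODE to data, and its fitted parameters and sampling interval are not given in the abstract.

```python
def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system for state (x, y, z)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt, rhs=lorenz_rhs):
    """Advance the state by one classical Runge-Kutta (RK4) step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = rhs(state)
    k2 = rhs(shifted(state, k1, dt / 2))
    k3 = rhs(shifted(state, k2, dt / 2))
    k4 = rhs(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def shoot(initial_state, steps=20, dt=0.01):
    """Forecast `steps` future states by integrating from `initial_state`."""
    trajectory = []
    state = tuple(initial_state)
    for _ in range(steps):
        state = rk4_step(state, dt)
        trajectory.append(state)
    return trajectory


forecast = shoot((1.0, 1.0, 1.0))
print(len(forecast), forecast[-1])
```

Because the Lorenz system is chaotic, a shooting forecast of this kind is only expected to track the true trajectory over a short horizon, which is consistent with the abstract assigning a different predictor to long-time evaluation.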
[AI-85] Dual Advantage Fields ICML2026
Link: https://arxiv.org/abs/2606.04188
Authors: Alexey Zemtsov,Maxim Bobrin,Alexander Nikulin,Dmitry V. Dylov,Fakhri Karray,Vladislav Kurenkov,Martin Takáč,Arip Asadulaev
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted by ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning
Abstract:Offline goal-conditioned reinforcement learning requires both long-horizon reachability estimates and local action comparisons. Dual goal representations provide value fields that capture global goal reachability, but they do not directly specify which action should be preferred at a given state. We propose Dual Advantage Fields, a policy-extraction method that turns a bilinear dual value model into a local advantage signal. Under bilinear dual parameterization, the goal embedding is the gradient of the value field with respect to the state representation. DAF learns an action-effect model that predicts the discounted feature displacement induced by an action and scores actions by the alignment between this displacement and the goal direction. In the realizable case, this score equals the goal-conditioned Bellman advantage, yielding a standard local policy-improvement guarantee. On OGBench locomotion, manipulation, and puzzle tasks, DAF improves aggregate RLiable metrics and performs strongly in settings where locally correct actions differ from direct movement toward the final goal.
[AI-86] Exact Unlearning in Reinforcement Learning ICML
Link: https://arxiv.org/abs/2606.04182
Authors: Thanh Nguyen-Tang,Raman Arora
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: ICML Spotlight
Abstract:We formulate the problem of exact unlearning in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user’s data upon deletion request, i.e., the online learner’s output after unlearning is indistinguishable from what would have been produced had the deleted user never interacted with the learner. For any $\rho > 0$, we show that there exists a reinforcement learning (RL) algorithm that is $\rho$-TV-stable and supports an exact unlearning procedure whose expected computational cost is only a $\rho\sqrt{\ln T}$ fraction of the computational cost of retraining from scratch. We construct such a $\rho$-TV-stable RL algorithm for tabular Markov decision processes (MDPs), which achieves a regret bound of $\mathcal{O}(H^2\sqrt{SAT} + H^3 S^2 A + H^{2.5} S^2 A/\rho)$, where $S$, $A$, $H$, and $T$ denote the number of states, the number of actions, the episode horizon, and the number of episodes, respectively. We also establish a lower bound of $\Omega(H\sqrt{SAT} + SAH/\rho)$ for $\rho$-TV-stable RL algorithms, showing that our algorithm is nearly minimax optimal.
[AI-87] MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments
Link: https://arxiv.org/abs/2606.04171
Authors: Michael J. Bommarito II
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 2 figures, 15 tables. Models released on Hugging Face (this https URL); reference training code at this https URL
Abstract:File-type classification underlies many workflows like malware triage, forensic carving, packet inspection, and storage indexing. Learned systems such as Google’s Magika assume whole-file access at a known offset, so they break on the inputs many of these tasks actually produce, like a single packet payload, a header-less carved fragment, a random disk block, or a chunked upload. We introduce MimeLens, a family of small BERT-style encoders pretrained on binary content from windows sampled at a uniformly random offset within each file, with no privileged head-of-file position, in standard- and short-context variants. A byte chunk goes in from anywhere in a file, no header needed and no fixed size; out comes one of libmagic’s 125 MIME labels. On the clean head of complete files, MimeLens beats Magika v1.1 by +10.7 pp top-1 on libmagic-labeled data, and it keeps classifying where Magika cannot: from a single mid-stream UDP packet, and more than twice as accurately as libmagic and Magika on random mid-file disk blocks. The cost is latency: MimeLens runs roughly one to two orders of magnitude slower per sample on CPU than Magika, though it matches on consumer GPUs or in batch. All trained checkpoints are released on Hugging Face (mjbommar/mimelens-001-*).
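The position-agnostic sampling idea in this abstract, drawing each training window from a uniformly random offset so that the head of the file has no privileged role, can be sketched as follows. The function name, the window size, and the handling of files shorter than the window are assumptions for illustration, not the released training code.

```python
import random


def sample_window(data: bytes, window_size: int, rng: random.Random) -> bytes:
    """Return `window_size` bytes starting at a uniformly random offset.

    Files shorter than the window are returned whole (an assumption; the
    abstract does not say how short files are handled).
    """
    if not data:
        raise ValueError("cannot sample from an empty file")
    if len(data) <= window_size:
        return data
    # Every valid start offset, including 0, is equally likely.
    start = rng.randrange(len(data) - window_size + 1)
    return data[start:start + window_size]


rng = random.Random(0)
blob = bytes(range(256)) * 4  # stand-in for file contents
window = sample_window(blob, 64, rng)
print(len(window))
```

Sampling this way means a classifier trained on the windows sees file headers only occasionally, so it cannot rely on magic bytes at offset zero.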
[AI-88] Smart Transportation Without Neurons – Fair Metro Network Expansion with Tabular Reinforcement Learning
Link: https://arxiv.org/abs/2606.04167
Authors: Dimitris Michailidis,Sennay Ghebreab,Fernando P. Santos
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages
Abstract:We tackle the Metro Network Expansion Problem (MNEP), a subset of the Transport Network Design Problem (TNDP), which focuses on expanding metro systems to satisfy travel demand. Traditional methods rely on exact and heuristic approaches that require expert-defined constraints to reduce the search space. Recently, deep reinforcement learning (Deep RL) has emerged due to its effectiveness in complex sequential decision-making processes; it remains, however, computationally expensive and environmentally costly, and requires additional engineering to interpret. We show that MNEP instances are small enough not to require Deep RL methods. Reformulating the MNEP as a Non-Markovian Rewards Decision Process (NMRDP), we use tabular RL to achieve similar performance with significantly fewer training episodes while offering greater interpretability. We also incorporate social equity criteria into the reward functions, focusing on efficiency and fairness, highlighting the versatility of our method. Evaluated in two real-world settings, Xi’an and Amsterdam, our method reduces total episodes by a factor of 18 and total carbon emissions by a factor of 12 on average, while remaining competitive with Deep RL. This approach offers a replicable, modular, interpretable, and resource-efficient solution with potential applications to other combinatorial optimization problems.
[AI-89] ADAPTOOD: Uncertainty-Aware Fine-Tuning for Out-of-Distribution ECG Time Series Models
Link: https://arxiv.org/abs/2606.04164
Authors: Sotirios Vavaroutas,Yu Yvonne Wu,Ali Etemad,Cecilia Mascolo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages
Abstract:Data samples used for training often differ from those encountered during fine-tuning and deployment, and while ML models show promise, their performance remains limited when only small annotated datasets are available. Performance often degrades under distribution shifts caused by diverse sensors, populations, and application settings. Although pre-training helps, models frequently encounter out-of-distribution (OOD) data in real-world settings, leading to reduced robustness. Existing adaptation methods usually assume fixed distribution shifts and struggle when multiple types or severities occur. In particular, they overlook shift severity, for example treating adaptation to a large familiar dataset the same as adaptation to a small dataset with a new task, which limits generalisation. To address this, we propose ADAPTOOD, a novel framework that leverages data uncertainty to quantify distribution shift severity and guide fine-tuning for time series. This uncertainty measures how strongly samples from the target deployment distribution deviate from the pre-training distribution, providing a direct signal of OOD severity. Our framework combines this uncertainty with low-rank model updates and adaptive hyperparameter optimisation to improve adaptation. We show that ADAPTOOD achieves up to 7% higher accuracy and 12.9% higher precision than existing methods in OOD tasks, maintaining strong performance as distribution shift severity increases.
[AI-90] Thinking Through Signs: PEEL as a Semiotic Scaffolding for Epistemically Accountable AI-Enabled Research
Link: https://arxiv.org/abs/2606.04152
Authors: Clarisse de Souza,Gabriel Barbosa,Simone Diniz Junqueira Barbosa,Bárbara Betts,Renato Cerqueira,Juliana Jansen Ferreira
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 5 figures
Abstract:Large language models are reshaping research practice while quietly eroding researchers’ epistemic accountability. This commentary introduces PEEL (Protocols for Epistemically Engaged Literacy in AI), a working scaffolding that combines deterministic distant reading via Voyant Tools with LLM interpretation via Claude, grounded in Peircean semiotics and abductive reasoning. Applied to AI-generated condensations of three source texts, PEEL reveals systematic distortions in quantity, term frequency, and epistemic voice that are invisible without non-AI measurement – and yields three design implications: deterministic instruments must accompany AI tools; fluency is not fidelity; epistemic authority must be designed in, not assumed.
[AI-91] EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms
Link: https://arxiv.org/abs/2606.04145
Authors: Guilin Zhang,Chuanyi Sun,Shahryar Sarkani,John M. Fossaceca
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p < 0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std = 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).
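The stopping rule described in this abstract (terminate after k consecutive eval-score declines while keeping the best checkpoint) can be sketched as a small detector over a job's eval history. This sketch assumes that a "decline" means a score strictly lower than the previous evaluation; the function name and the example scores are hypothetical, and the GPU release and scheduler delegation steps are outside its scope.

```python
def evalstop(eval_scores, k):
    """Return (stop_step, best_step) for a sequence of eval scores.

    stop_step is the index at which k consecutive declines have been seen,
    or None if the rule never fires. best_step is the index of the highest
    score observed up to that point (the checkpoint to preserve).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    best_step = None
    best_score = float("-inf")
    declines = 0
    previous = None
    for step, score in enumerate(eval_scores):
        if score > best_score:
            best_score, best_step = score, step
        # Assumption: a decline is measured against the previous evaluation.
        if previous is not None and score < previous:
            declines += 1
        else:
            declines = 0
        if declines >= k:
            return step, best_step
        previous = score
    return None, best_step


# Hypothetical eval curve: improves, then degrades for three evaluations.
scores = [0.10, 0.30, 0.50, 0.40, 0.35, 0.30, 0.60]
print(evalstop(scores, k=3))
```

With k = 3 the rule fires at step 5 and keeps the checkpoint from step 2, so the late recovery at step 6 is never reached. A larger k tolerates longer dips at the cost of more compute spent on runs that are genuinely overoptimizing.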
[AI-92] Physics-Informed Machine Learning for Short-Term Flood Prediction
Link: https://arxiv.org/abs/2606.04143
Authors: Tewodros Syum Gebre,Jagrati Talreja,Leila Hashemi-Beni
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for publication in IGARSS 2026. The final authenticated version will be available through IEEE Xplore
Abstract:Accurate flood forecasting is essential for mitigating disaster risks and protecting communities. However, purely data-driven machine learning models often struggle in data-scarce environments and may violate fundamental hydrological principles. Standard Long Short-Term Memory (LSTM) networks can generate physically inconsistent predictions, particularly when extrapolating to extreme weather conditions. To address these limitations, we propose a Physics-Informed Machine Learning (PIML) framework that incorporates hydrological knowledge directly into the loss function of an LSTM model. Specifically, a Trend Alignment constraint penalizes directional inconsistencies between precipitation and discharge trends, improving model robustness without requiring complex hydrodynamic equations. This regularization encourages the model to learn physically plausible hydrograph behavior, even with limited training data, while enhancing reliability during peak flood events. Experimental results show that the proposed physics-informed model outperforms a standard LSTM baseline in data-scarce settings, increasing the Nash-Sutcliffe Efficiency (NSE) from 0.20 to 0.23 when trained on only 5% of the available data. Additional stress tests under simulated extreme climate scenarios demonstrate that the baseline model exhibits unstable behavior, whereas the physics-informed model maintains directional consistency and physical plausibility. Although accurately predicting extreme peak magnitudes remains challenging with limited data, the proposed approach substantially reduces unphysical fluctuations common in purely data-driven models. These findings demonstrate that simple physical constraints can significantly improve the reliability of deep learning models for real-time flood forecasting, offering a practical solution for ungauged basins and evolving climate conditions.
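Two quantities from this abstract can be illustrated in a few lines. The Nash-Sutcliffe Efficiency is a standard metric, computed as one minus the ratio of squared prediction error to the variance of the observations around their mean. The trend penalty, by contrast, is only a guess at what a "Trend Alignment" term might look like: it penalizes time steps where precipitation and discharge change in opposite directions. Its exact form, and both function names, are assumptions and should not be read as the paper's loss.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 matches the mean."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    mean_obs = sum(observed) / len(observed)
    error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - error / variance


def trend_misalignment(precipitation, discharge):
    """Hypothetical penalty: mean of max(0, -dP * dQ) over time steps.

    The term is zero when both series move in the same direction and grows
    when they move in opposite directions.
    """
    if len(precipitation) != len(discharge) or len(precipitation) < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    dp = [b - a for a, b in zip(precipitation, precipitation[1:])]
    dq = [b - a for a, b in zip(discharge, discharge[1:])]
    terms = [max(0.0, -p * q) for p, q in zip(dp, dq)]
    return sum(terms) / len(terms)


print(nse([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
print(trend_misalignment([0.0, 1.0, 2.0], [2.0, 1.0, 0.0]))
```

In a training loop, a term like this would be added to the usual regression loss with a weighting coefficient. A real catchment also responds to rainfall with a lag, which this same-step comparison ignores.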
[AI-93] Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents
Link: https://arxiv.org/abs/2606.04141
Authors: Kargi Chauhan,Pratibha Revankar
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format-specific character models and calibrate detection with split conformal prediction. Third, we treat multi-turn exfiltration as a cumulative information-flow problem and track an estimated leakage budget across conversation turns. In controlled experiments on open-weight models, activation features separate benign and credential-seeking prompts with high accuracy, including under held-out encoding transformations. In a small synthetic multi-turn suite, cumulative accounting detects attacks that per-turn detectors miss. These results are preliminary: the multi-turn benchmark is in-house and small, the activation method requires white-box access, and the information estimator provides a practical signal rather than a formal upper bound. Still, the results suggest that credential-exfiltration defenses should combine pre-output monitoring, calibrated canary detection, and temporal leakage accounting rather than relying only on text-level output filters.
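The third defense in this abstract, tracking an estimated leakage budget across turns, can be sketched by contrasting a per-turn check with a cumulative one. The per-turn leakage estimates (in bits), the limits, and the function names are hypothetical; how leakage is estimated from a model's output is the paper's contribution and is not reproduced here.

```python
def per_turn_alarm(per_turn_bits, per_turn_limit):
    """Index of the first turn whose own leakage exceeds the limit, or None."""
    for turn, bits in enumerate(per_turn_bits):
        if bits > per_turn_limit:
            return turn
    return None


def cumulative_alarm(per_turn_bits, total_budget):
    """Index of the first turn at which summed leakage exceeds the budget."""
    total = 0.0
    for turn, bits in enumerate(per_turn_bits):
        total += bits
        if total > total_budget:
            return turn
    return None


# Hypothetical slow exfiltration: each turn leaks a little of the secret.
leaks = [3.0, 3.0, 3.0, 3.0]
print(per_turn_alarm(leaks, per_turn_limit=5.0))
print(cumulative_alarm(leaks, total_budget=10.0))
```

Here no single turn crosses the per-turn limit, yet the running total crosses the budget on the fourth turn, which is the kind of attack the abstract says per-turn detectors miss.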
[AI-94] HighTide: An Agent-Curated Open-Source VLSI Benchmark Suite
Link: https://arxiv.org/abs/2606.04126
Authors: Benjamin Goldblatt,Paolo Pedroso,Farhad Modaresi,Ethan Sifferman,Matthew R. Guthaus
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:We introduce HighTide, an evolving AI-assisted benchmark suite. Specifically, the contributions are: (i) a diverse open-source suite spanning multiple design languages and technology nodes, (ii) Bazel-based incremental RTL-to-GDS compilation with remote caching, (iii) AI-assisted design curation through twelve agent skills covering the design lifecycle, flow optimization, tool reference, and meta-maintenance, backed by per-design decision logs that serve as long-term memory of tuning rationale across the suite, and (iv) an infrastructure with RTL compilation verification for stable releases. The suite is publicly available and designed to grow with the open-source hardware ecosystem.
[AI-95] dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats
Link: https://arxiv.org/abs/2606.04115
Authors: Giuseppe Franco,Ian Colbert,Pablo Monteagudo-Lago,Felix Marty,Nicholas Fraser
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Quantizing large language models (LLMs) to low-precision floating-point representations is central to efficient deployment, yet applying a single bit-width uniformly across all layers is sub-optimal in terms of both performance and accuracy. This work introduces dMX, a differentiable mixed-precision quantization framework for learnable floating-point bit-width assignment. We study its application for the microscaling floating-point (MXFP) family of data types defined by the Open Compute Project (OCP) standard. The per-layer bit-width assignment is formulated as a continuous optimization problem in which each layer’s floating-point format is parameterized by a scalar parameter, folding the multi-variate design space into a single learnable offset. During training this offset takes continuous values, avoiding sudden oscillations between discrete quantization formats. A temperature-based annealing schedule progressively discretizes the learned offsets, ensuring that the final configuration maps to hardware-compatible MXFP formats without abrupt transitions between training and inference behavior. A target-aware regularization term steers the average bit-width toward a user-specified budget, serving as a coarse-grained proxy for inference cost and balancing model quality against deployment efficiency. We performed experiments on several LLM families, such as Llama, Qwen3, and SmolLM2, evaluating perplexity on WikiText-2 and accuracy on four zero-shot reasoning benchmarks. Across these settings, dMX consistently yields Pareto-dominating models and improves over Kullback-Leibler (KL) divergence-based layer-selection heuristics, efficiently navigating trade-offs between model quality and average bit-width.
[AI-96] AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation
Link: https://arxiv.org/abs/2606.04111
Authors: Faryal Batool,Muhammad Ahsan Mustafa,Fawad Mehboob,Valerii Serpiva,Dzmitry Tsetserukou
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:Indoor UAV navigation requires efficient exploration, scene understanding, and reliable trajectory execution under limited field-of-view observations. Existing vision-based navigation frameworks typically rely on single-view observations, limiting their ability to reason about occlusions, target visibility, and global scene structure. In this work, we propose AgenticDiffusion, a multi-view UAV navigation framework that coordinates language-guided reasoning, open-vocabulary target grounding, vision-based diffusion planning, and NMPC within a unified aerial navigation pipeline. Given a natural language instruction and synchronized first-person-view (FPV) and top-view observations, the framework determines the most informative viewpoint for navigation and generates a mission plan prior to trajectory execution. The targets are localized using an open-vocabulary grounding model, after which viewpoint-specific diffusion planners generate navigation trajectories for UAV execution. Using complementary viewpoints, the proposed framework reduces repeated target exploration and improves navigation efficiency in cluttered indoor environments. The framework was validated in four real-world UAV navigation scenarios involving adaptive viewpoint selection, multi-stage mission execution, long-horizon navigation, and safe landing-site selection. The experimental results demonstrated an overall mission success rate of 80% in 40 real-world trials, while the diffusion planners achieved a trajectory generation success rate of 100%.
[AI-97] Building The Ph(ysical)AI Layer Of Machine Intelligence
Link: https://arxiv.org/abs/2606.04106
Authors: Ulbert Jose Botero,Liam Smith,Brooks Olney,Pooya Khorrami,Steven Kusiak,Watson Jia,Sage Trudeau,Daniel Capecci
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 102 pages, 11 Figures
Abstract:Foundation models achieve generalization through massive-scale training on diverse data, but have limitations with transfer to truly unseen domains without paired training data. We propose principle-driven foundation models that encode signal-theoretic principles (Fourier decomposition, energy conservation, symmetry) rather than learn untethered statistical correlations. We hypothesize that domains differ not in fundamental physics, but in learnable transformations in time, frequency, magnitude, or phase. Training exclusively on radio-frequency (RF) data with co-designed architecture and losses incorporating these principles, we achieve cross-modal transfer to audio, images, text, and video using only frozen representations learned from RF data, requiring no fine-tuning of the encoder on target domains. Our 1.99M parameter frozen encoder achieves 77.7% average accuracy (91.9% top-3) across 15 diverse tasks via linear probing, with systematic variation: 84.5% on physically-grounded tasks (speaker recognition, seismology, RF fingerprinting) versus 70.0% on semantic tasks (music genre, language recognition). This reveals that principle-driven and scale-driven approaches offer complementary paths: physical principles enable efficient cross-modal transfer while naturally establishing the boundary between physical and semantic understanding.
[AI-98] Proof-Carrying Agent Actions: Model-Agnostic Runtime Governance for Heterogeneous Agent Systems
Link: https://arxiv.org/abs/2606.04104
Authors: Zexun Wang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 25 pages, 2 tables, 3 figures. Implementation-informed systems paper with bounded public validation
Abstract:Agent systems execute through runtimes with very different control points: local coding tools, framework SDKs, managed agent platforms, API gateways, and observer-only integrations. A high-risk action such as publishing data externally may therefore appear as a shell command in one runtime, a tool call in another, and a hosted session transition in a third. This makes it difficult to answer a basic governance question consistently: what action was authorized, under whose authority, with what approval semantics, and with what evidence after execution? This paper presents Proof-Carrying Agent Actions (PCAA), a runtime-neutral governance model centered on an action certificate rather than on a vendor-native session record. PCAA organizes control around five checkpoints: pre-action admissibility, action open, assumption capture, approval, and outcome closure. It binds these checkpoints to a portable action envelope, runtime and approval receipts, and replay-ready proof. The model is extended in two practical ways: the certificate is externality-aware, carrying boundary facts such as destination visibility and account provenance, and approval is described by explicit enforceability classes rather than by a single reviewed or unreviewed bit. We study the model through a reference implementation in a heterogeneous agent control plane and a disclosure-bounded evaluation protocol. On a protected benchmark expanded from 24 executable seeds to 96 traces across four runtime families, PCAA preserves route quality while exposing distinct failure modes under ablation. The paper contributes a systems formulation of runtime governance around certificate-bearing actions and an implementation-grounded account of how that formulation can remain portable under runtime churn without collapsing into vendor-specific control surfaces.
[AI-99] The Differentiable Auditory Loop (DAL): An ML Framework for Hyper-Personalized Hearing Aids
Link: https://arxiv.org/abs/2606.04103
Authors: Alejandro Ballesta Rosen,Jason Mikiel-Hunter,Julian Maclaren,Jack Collins,Richard F. Lyon,Simon Carlile
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Conventional hearing aids rely on fixed, frequency-dependent amplification and compression to manage reduced sensitivity, which often fails to provide sufficient listening support in complex environments, such as situations with multiple speakers (the “cocktail party” problem). To more comprehensively address the underlying encoding dysfunctions of hearing loss, we introduce the Differentiable Auditory Loop (DAL), a new open-source framework for personalized hearing aid design and fitting. Our first implementation of DAL incorporates CARFAC, a differentiable model of human cochlear function, which we ported to JAX, to optimize a deep neural network to match impaired auditory neural activity patterns with a normal-hearing reference. To build a hearing aid with the fine-grained spectro-temporal signal processing required, we adopt SEANet, a waveform-to-waveform fully convolutional UNet generator. We fine-tune the network by comparing the outputs of a CARFAC model fitted to normal hearing with that of a CARFAC model fitted to match each subject’s individual hearing impairment. The comparison is done using loss functions derived from the respective CARFAC neural activity pattern (NAP) outputs and stabilized auditory images (SAIs), the latter providing a 2D representation that captures phase-insensitive temporal structure in the auditory nerve output. Through gradient descent, the SEANet model learns to both denoise the input and compensate for the hearing loss modelled by the impaired CARFAC model. Across neural-representation and signal-fidelity metrics, the DAL-optimized SEANet model outperformed the tested master hearing aid (MHA) baselines. The DAL framework provides a practical path toward model-based, machine-learning-driven personalization of hearing aid signal processing. Next steps include hardware deployment to enable real-world clinical testing.
[AI-100] Adaptive Patching Is Harder Than It Looks For Time-Series Forecasting
Link: https://arxiv.org/abs/2606.04074
Authors: Federico Zucchi,Yi Xie,Chao Zhang,Keyuan Luo,Thomas Lampert,Ziyue Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:
Abstract:Adaptive patching is a recent and compelling proposal for time-series Transformers: allocate finer patches where the sequence looks locally informative. This paper asks under what conditions a content-adaptive patching operator should outperform a tuned uniform one. Local heterogeneity alone is not enough: under pointwise forecasting losses, a complex-looking region is not automatically one where finer patching reduces the loss. We model patching as a budgeted bitrate allocation and derive an explicit threshold that a dynamic patching rule must satisfy to beat a well-tuned uniform baseline, then bound the achievable improvement both locally (a quadratic surrogate) and globally (a strong-convexity bound under the model’s assumptions). Two structural results follow: without a coupling constraint, scalar local complexity cannot produce a non-uniform optimum under a common loss landscape; and once the backbone is trained to its representation-aware optimum, the alignment gain collapses around a well-tuned uniform patch size. To test these predictions, we run a controlled isolation study on three representative architectures, replacing each adaptive mechanism with a uniform patch-size sweep while keeping the backbone, data, and training protocol fixed. On standard long-horizon forecasting benchmarks, the validation-selected uniform baseline is competitive with the dynamic counterpart, with per-setting effects concentrated near zero and no consistent directional advantage once results are aggregated by dataset. The larger gains we do observe are method- and dataset-specific. Adaptive patching should therefore be evaluated against a tuned uniform baseline; its value depends on whether a cheap and reliable routing signal can identify where finer patches actually reduce forecasting loss.
[AI-101] TPA-AD: A Two-Stage Pseudo Anomaly-Guided Method for Bearing Time-Series Anomaly Detection
Link: https://arxiv.org/abs/2606.04073
Authors: Xiancheng Wang,Zhibo Zhang,Ran Li,Rui Wang,Minghang Zhao,Shisheng Zhong,Lin Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:This paper proposes a two-stage pseudo anomaly-guided anomaly detection method (Two-stage Pseudo Anomaly-guided Anomaly Detection, TPA-AD) for axle-box bearing time-series anomaly detection (TSAD) under the setting where only normal samples are available for training. The method first generates pseudo-anomalous windows near the normal boundary using a reconstruction model and per-feature target-error control. It then learns anomaly-sensitive representations through contrastive learning between normal and pseudo-anomalous windows, and finally produces window-level and point-level anomaly scores using k-nearest neighbors (KNN). Compared with existing methods that rely on known fault categories, real anomaly priors, or random anomaly injection, TPA-AD improves the separability of the normal boundary by constructing pseudo-anomalies in boundary neighborhoods and can jointly handle continuous and discrete features in mixed-variable scenarios. The main experiments are conducted on bearing fault detection datasets and degradation-process datasets, with an additional exploratory extension on 13 public TSAD datasets. The results show that the proposed method yields relatively stable anomaly responses, is sensitive to degradation evolution, and demonstrates a certain degree of broader applicability on public TSAD benchmarks and real high-speed-train-related bearing data.
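The final scoring step mentioned in the abstract, KNN over learned representations, can be sketched as follows. This is a generic illustration rather than the paper's code: it assumes Euclidean distance, a score equal to the mean distance to the k nearest normal embeddings, and made-up 2-D embeddings. `knn_score` and `normal_bank` are hypothetical names.

```python
import math


def knn_score(query, normal_bank, k=3):
    """Mean Euclidean distance from `query` to its k nearest normal embeddings."""
    distances = sorted(math.dist(query, ref) for ref in normal_bank)
    return sum(distances[:k]) / k


# Toy 2-D embeddings of normal windows (made-up numbers).
normal_bank = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]

near = knn_score((0.05, 0.0), normal_bank)   # sits inside the normal cluster
far = knn_score((1.0, 1.0), normal_bank)     # sits far from every normal window
```

A window whose embedding lies far from the normal bank receives a larger score. A threshold on that score would then flag it as anomalous.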
[AI-102] Need to Know: Contextual-Integrity-Grounded Query Rewriting for Privacy-Conscious LLM Delegation
Link: https://arxiv.org/abs/2606.04067
Authors: Xinyue Huang,Xiaochun Cao,Wenyuan Yang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:As LLMs become increasingly woven into everyday workflows, user queries sent to cloud-hosted LLMs routinely mix task-essential content with task-non-essential sensitive disclosures, yet type-based PII redaction is context-agnostic and may raise two issues: over-disclosing untyped sensitive context and over-removing answer-bearing spans. We recast privacy-preserving query rewriting under Contextual Integrity: a span should be forwarded only if it is necessary for the task. We introduce DelegateCI-Bench, the first task-based Contextual Integrity benchmark for privacy-conscious delegation, comprising 3,167 samples that combine high-quality synthetic data spanning 11 tasks and 20 task types, WildChat-based real user queries, and a medical challenge set with dense sensitive information. Building on this benchmark, we propose a CI-guided reinforcement learning framework that converts essential and non-essential sensitive spans into verifiable optimization signals, and train a query rewriter to preserve task-critical information while suppressing unnecessary sensitive disclosure. Experiments show that our learned rewriter achieves the best privacy-utility tradeoff, achieving up to +10.1 average utility over on-device baselines.
[AI-103] LLM Compression with Jointly Optimizing Architectural and Quantization choices
Link: https://arxiv.org/abs/2606.04063
Authors: Hoang-Loc La,Truong-Thanh Le,Amir Taherkordi,Phuong Hoai Ha
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deploying large language models (LLMs) is challenging due to their significant memory and computational requirements. While some methods address this by developing small or tiny language models from scratch, these approaches demand extensive GPU training. Compressing pre-trained LLMs for edge devices offers a compelling alternative. Beyond pruning and quantization, Neural Architecture Search (NAS) enables effective compression, yet prior NAS approaches often limit the search space and decouple architecture from quantization. We introduce a differentiable NAS framework that explores the entire space and jointly optimizes architectural configurations alongside mixed-precision quantization for linear layers of LLMs. Experiments demonstrate superior accuracy-latency trade-offs: our models achieve up to 1.4x faster inference than sequential NAS-then-quantization baselines at comparable accuracy, or up to 6% higher average accuracy across seven reasoning tasks at equivalent latency.
[AI-104] Spectral Scaling Laws of Muon
Link: https://arxiv.org/abs/2606.04058
Authors: Gagik Magakyan,Pablo Parrilo,Asuman Ozdaglar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton–Schulz (NS) iteration. Since NS is only approximate, directions with small singular values fail to be orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer across layers in models ranging from 77M to 2.8B parameters, we observe a consistent picture: after a short burn-in, the quantiles stabilize at a value determined by the layer type and model size. These stabilization values follow remarkably clean power laws in model size, with layer-dependent exponents. Layers up to mid-late depth scale very mildly with model size M (around M^{-0.25}), so the standard 5-step NS configuration used at academic scale will continue to orthonormalize them at much larger scales. Some of the late layers, however, scale much more aggressively (up to M^{-0.96}) and will fall into the NS failure regime at frontier scale unless one uses more NS iterations or better-tuned coefficients. NS iterations are computationally expensive at scale; our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter – avoiding unnecessary computation without sacrificing update quality.
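The claim that a fixed number of NS steps leaves small singular values un-orthonormalized can be seen in a toy sketch. Note the substitution: this uses the textbook cubic Newton–Schulz iteration, not the tuned coefficients a Muon implementation would use, and the 3x3 diagonal input is a made-up example. The helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=5):
    """Approximate orthonormalization with the classical cubic iteration.

    X <- 1.5 * X - 0.5 * X X^T X, after scaling by the Frobenius norm so
    that every singular value starts in (0, 1].
    """
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


# Diagonal test matrix, so the diagonal entries are the singular values.
g = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1e-3]]
out = newton_schulz(g, steps=5)
singular_values = [out[i][i] for i in range(3)]
```

After five steps the two larger singular values are close to 1, while the direction that started at 1e-3 is still far below 1. Each cubic step grows a tiny singular value by only about a factor of 1.5.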
[AI-105] The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation
Link: https://arxiv.org/abs/2606.04057
Authors: Akanksha Narula,Mofasshara Binte Rafique,Laurent Bindschaedler
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) now generate substantial production code, often for tasks with multiple valid algorithmic solutions. Incidental prompt cues, meaning contextual words or metadata outside the task specification, can steer which algorithm the model selects, even when all outputs pass the same tests. Prompt sensitivity is well studied as a tool to improve output quality. Here, output policy means algorithm choice under fixed correctness. We define algorithm steering as cue-induced shifts in algorithm-family distributions and run 46,535 controlled experiments across 11 tasks, 19 cue types (18 channels plus a memoization semantic-vs-surface ablation that preserves meaning while changing typography and punctuation), and 15 model configurations. We find large, systematic shifts in algorithm-family distributions (up to 100 pp), largely consistent with cue semantics, including in applied tasks such as rate limiting. Direct algorithm naming is the most reliable mitigation we tested. Accidental context therefore creates an “invisible lottery” over performance, security, and maintainability.
[AI-106] A Goal-Set Characterization of Task Composition in the Boolean Task Algebra
Link: https://arxiv.org/abs/2606.04053
Authors: Eduardo Terrés-Caballero,Herke van Hoof
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The Boolean Task Algebra (BTA) provides a principled framework for zero-shot task composition in reinforcement learning by equipping goal-reaching tasks with Boolean operations. We revisit its structural assumptions and formalize a collapse in the space of optimal extended Q-value functions: in deterministic MDPs, every such function is fully determined by the universal and empty tasks. This makes the logarithmic set of base tasks proposed in the original BTA formulation redundant. Building on this observation, we introduce a goal-set-based composition method that performs logical operations on goal sets and reconstructs composed value functions by selecting slices from the universal and empty value functions. This reduces learning costs for standard BTA and reduces composition time for both BTA and Skill Machines, while preserving policy performance. Experiments across tabular, visual, function-approximation, and continuous-control domains show that learning additional base tasks does not yield better performance. Finally, we study the stochastic setting and provide a counterexample showing that this collapse need not hold, that is, optimal composition may require accounting for exponentially many policies in the number of goals. Code is available at this https URL.
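Goal-set composition as described in the abstract can be sketched with plain set operations. This is one reading of "selecting slices from the universal and empty value functions", not the authors' code: per goal, it picks the universal-task value when the goal is in the composed set and the empty-task value otherwise. The goal names, values, and function names are all made up for illustration.

```python
ALL_GOALS = frozenset({"g1", "g2", "g3", "g4"})


def task_and(a, b):
    return a & b


def task_or(a, b):
    return a | b


def task_not(a):
    return ALL_GOALS - a


def compose_q(goal_set, q_universal, q_empty):
    """Pick, per goal, the universal slice if the goal is desired, else the empty slice."""
    return {g: (q_universal[g] if g in goal_set else q_empty[g]) for g in ALL_GOALS}


# Made-up per-goal values standing in for slices of extended value functions.
q_universal = {g: 1.0 for g in ALL_GOALS}
q_empty = {g: 0.0 for g in ALL_GOALS}

blue = frozenset({"g1", "g2"})
square = frozenset({"g2", "g3"})
blue_not_square = task_and(blue, task_not(square))    # frozenset({'g1'})
q_composed = compose_q(blue_not_square, q_universal, q_empty)
```

Under this reading, only two value functions need to be learned. Every Boolean combination of tasks then reduces to a set computation followed by a lookup.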
[AI-107] RUBAS: Rubric-Based Reinforcement Learning for Agent Safety
Link: https://arxiv.org/abs/2606.04051
Authors: Xian Qi Loye,Qinglin Su,Zhexin Zhang,Shiyao Cui,Qi Zhu,Fei Mi,Hongning Wang,Minlie Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or static supervision, making it difficult to balance safety with useful tool execution across diverse agentic risks. We introduce RUBAS, a rubric-based reinforcement learning framework for agent safety. RUBAS decomposes agent behavior into four dimensions: tool-use safety, argument safety, response safety, and helpfulness. These structured rubrics provide fine-grained and interpretable rewards over complete agent trajectories, enabling reinforcement learning to optimize safe tool use while preserving task completion. Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility. Our results suggest that multi-dimensional rubric rewards provide an effective training signal for aligning LLM agents in safety-critical tool-use settings.
[AI-108] LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection ICML2026
Link: https://arxiv.org/abs/2606.04050
Authors: Liulu He,XuanAng Liu,Juntao Liu,Taolue Feng,Ting Lu,Chunsheng Gan,Zhiyv Peng,Yuan Du,Huanrui Yang,Yijiang Liu,Li Du
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026 Spotlight
Abstract:Existing quantization methods are fundamentally limited by rigid, integer-based bit-widths (e.g., 2, 3-bit), resulting in a “deployment gap” where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a “lift-then-project” mechanism which approximates low-dimensional weight vectors by projecting a simple 1-bit lattice from a higher-dimensional “lifted” space. Crucially, the effective bit-width is determined simply by the ratio of the lifted dimension to the original dimension, which allows the bit-width to be tuned quasi-continuously, as the dimension is a flexible structural parameter. This projection generates a structured yet non-uniform codebook, capturing the expressive power of Vector Quantization (VQ). While beneficial over VQ, LiftQuant’s decoding path relies solely on linear transformations and 1-bit uniform quantizers, retaining a hardware-friendly nature. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state-of-the-art 2-bit models fitted on the same device. Our code and checkpoints are available at this https URL.
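The memory arithmetic behind the 2.4-bit example can be checked with a short sketch. The dimension pair 12 and 5 is a hypothetical choice that yields a ratio of 2.4. The estimate counts weight storage only and ignores codebooks, scales, activations, and the KV cache, so it is a rough lower bound rather than a full memory model.

```python
def effective_bits(lifted_dim, original_dim):
    """Bits per weight when `lifted_dim` one-bit codes represent `original_dim` weights."""
    return lifted_dim / original_dim


def weight_gigabytes(num_params, bits_per_weight):
    """Weight storage in decimal gigabytes, ignoring every other memory cost."""
    return num_params * bits_per_weight / 8 / 1e9


bits = effective_bits(lifted_dim=12, original_dim=5)   # 2.4 bits per weight
for candidate in (2.0, bits, 3.0):
    print(candidate, weight_gigabytes(70e9, candidate))
```

Under these assumptions a 70B model needs about 17.5 GB at 2 bits, 21 GB at 2.4 bits, and 26.25 GB at 3 bits. Integer widths therefore either leave part of a 24GB budget unused or overshoot it.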
[AI-109] Unlocking Feature Learning in Gated Delta Networks at Scale
Link: https://arxiv.org/abs/2606.04048
Authors: Yifeng Liu,Quanquan Gu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Networks. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.
[AI-110] Bayes-Sufficient Representations in Supervised Learning
Link: https://arxiv.org/abs/2606.04045
Authors: Vasileios Sevetlidis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Representation learning is often described as preserving the information in an input that is relevant for prediction. This work asks what relevance means for a fixed supervised decision problem. A representation is defined to be Bayes-sufficient for a joint distribution and loss if some prediction head can use it to implement a Bayes-optimal action rule. This makes the target information loss-dependent. In the almost-surely unique Bayes-action case, the relevant object is a Bayes quotient, which identifies inputs that require the same Bayes-optimal action. A representation is sufficient when it refines this quotient, and Bayes-minimal when it is informationally equivalent to it. The framework connects naturally to property elicitation: zero-one loss requires the Bayes class, squared loss the conditional mean, Brier loss the conditional probability in binary prediction, and log loss or strictly proper scoring rules the predictive distribution. Controlled finite experiments, learned neural bottleneck experiments, and a real-data iNaturalist taxonomic refinement experiment illustrate the distinction between sufficiency, minimality, and retained non-required information. For a fixed supervised problem, the distribution and the loss determine the Bayes action, the Bayes action determines the quotient, and the quotient determines the minimal information required for Bayes-optimal prediction.
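The loss-dependence of the Bayes quotient can be made concrete with a toy binary problem. The posterior values below are made up and the function names are hypothetical. The two elicited properties are those listed in the abstract: the Bayes class for zero-one loss and the conditional mean for squared loss.

```python
from collections import defaultdict

# Toy binary problem: P(y = 1 | x) for three inputs (made-up numbers).
POSTERIOR = {"a": 0.9, "b": 0.6, "c": 0.2}


def bayes_action_zero_one(p):
    """Zero-one loss elicits the most probable class."""
    return 1 if p >= 0.5 else 0


def bayes_action_squared(p):
    """Squared loss elicits the conditional mean, which is p for binary y."""
    return p


def bayes_quotient(action_rule):
    """Group inputs that require the same Bayes-optimal action."""
    groups = defaultdict(set)
    for x, p in POSTERIOR.items():
        groups[action_rule(p)].add(x)
    return sorted(sorted(g) for g in groups.values())


quotient_01 = bayes_quotient(bayes_action_zero_one)   # [['a', 'b'], ['c']]
quotient_sq = bayes_quotient(bayes_action_squared)    # [['a'], ['b'], ['c']]
```

A representation that merges `a` and `b` refines the zero-one quotient, so it is Bayes-sufficient for classification. It is not sufficient under squared loss, where the two inputs require different actions.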
[AI-111] Channel-Oriented Design for EEG-to-Music Reconstruction
Link: https://arxiv.org/abs/2606.04040
Authors: Jiaxin Qing,Junwei Lu,Lexin Li
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Brain-computer interfaces aim to decode naturalistic stimuli from neural signals, yet most progress to date has focused on vision and language. In this article, we study a more challenging but far less explored setting, EEG-to-music reconstruction, where signals are weak, distributed, and highly susceptible to noise and channel variability. Our central finding is that early channel mixing destroys weak but discriminative EEG signals. To address this, we propose a channel-oriented design with three key components. Specifically, channel-wise tokenization treats each electrode as an explicit token to retain spatially localized neural evidence, channel-wise multi-view self-distillation enforces consistency across temporal crops and random channel subsets to learn robust and distributed representations, and channel-wise data augmentation introduces structured channel dropout to improve invariance to noise, artifacts, and missing electrodes. Together, these components preserve weak yet informative signals across channels and enable stable alignment to a semantic music representation space. We integrate this channel-oriented design within an encoding-alignment-decoding pipeline for EEG-to-music reconstruction. Theoretically, we characterize when preserving channel-level structure leads to improved alignment. Empirically, we compare with a range of state-of-the-art baselines and demonstrate consistent and significant performance gains.
[AI-112] Beyond Static Priors: Dynamic Neural Guidance for Large-Scale Ant Colony Optimization KDD2026
Link: https://arxiv.org/abs/2606.04039
Authors: Dat Thanh Tran,Van Khu Vu,Yining Ma
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at KDD 2026
Abstract:Neural-guided Ant Colony Optimization (ACO) suffers from a fundamental training-inference misalignment: policies are typically trained to generate static priors (e.g., heatmaps), yet deployed to guide iterative, long-horizon search processes. In this paper, we present DyNACO, a novel framework that achieves dynamic neural guidance by periodically observing the pheromone distribution and the incumbent solution. To make DyNACO tractable at scale, we pair the policy with a perturbation-based ACO backend and a scope-restricted refinement mechanism that jointly ensure efficacy and stable credit assignment. On TSP, DyNACO scales to 100,000-node instances and outperforms neural baselines while often reducing total runtime compared to the unguided solver. We extend DyNACO to CVRP via a capacity-aware backend, consistently improving the unguided baseline with less than 1% neural overhead. We further provide in-depth analysis validating the model’s generalization capabilities and elucidating why dynamic guidance outperforms static priors. Our work underscores the necessity of aligning neural training with iterative search dynamics in learning-guided optimization. The code is available at this https URL.
[AI-113] Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification
Link: https://arxiv.org/abs/2606.04037
Authors: Thanh Luong Tuan,Abhijit Sanyal
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 26 pages, 3 figures. Companion to arXiv:2604.00555
Abstract:Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We propose an ontology-grounded verification framework combining three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a Trust Certificate carrying a machine-verifiable attestation with graduated deployment verdicts (Approved, Conditional, Rejected). A controlled pilot across four regulated industries (Fintech, Banking, Insurance, and Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam, generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation (G4) achieved 48.3% regulatory coverage versus 33.1% for the persona-based baseline (corrected p = .0006) and the highest domain specificity (4.77/5.0; p = 2e-6). The coverage advantage over baseline and retrieval-augmented prompting was not robust after Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The results establish ontology-grounded scenario generation as a credible complement to persona-based test suites for regulatory-intensive domains.
[AI-114] Unpredictable Safety: Domain-Dependent Compliance and the Transparency Gap in Open-Weight LLMs
Link: https://arxiv.org/abs/2606.04035
Authors: Zacharie Bugaud
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains, testing 5 models (12B–70B) in 4,200 interactions with dual-judge validation. Using a dual-condition methodology, with each scenario tested in both an analytical framing (identify the harm) and an operational framing (help commit the harm), we find compliance rates vary from 14.7% (human trafficking) to 85.7% (surveillance design), a 71-percentage-point span with non-overlapping cluster-bootstrapped 95% CIs. Trustworthy deployment requires predictable safety behavior, yet we find compliance is highly context-dependent: the same model (Mistral Nemo 12B) provides surveillance designs in 100% of requests but assists with trafficking in only 26.7%. This unpredictability is opaque to deployers: in the technical framing bypass, harmful requests reframed as engineering problems override safety training without any external signal that refusal thresholds have shifted. Within-domain heterogeneity reaches 84.4pp, meaning safety behavior cannot be predicted even at the domain level. A replication on five frontier closed models (GPT-4.1/5.2, Claude Haiku/Sonnet/Opus 4.x; n=4,163 responses) accessed via the GitHub Copilot CLI deployed-product surface reproduces the same domain stratification, attenuated in absolute level but identical in shape, with the two low-codification domains (science fraud, surveillance) again the most permissive. These results show that current safety mechanisms lack the transparency and consistency required for trustworthy AI deployment.
[AI-115] Position: Deployed Reinforcement Learning should be Continual ICML2026
链接: https://arxiv.org/abs/2606.04029
作者: Parnian Behdin,Kevin Roice,Golnaz Mesbahi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the ICML 2026 Position Paper Track. See this https URL
Abstract:Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world until performance degrades and retraining becomes necessary. In this position paper, we argue that deploying an agent that is incapable of optimality, but receives an evaluative reward signal, is inherently a continual RL problem. We identify four sources of non-stationarity after deployment that necessitate never-ending learning, and highlight why the best deployed agents never stop adapting. We analyze successful examples of continual RL in the real world, and present the community with the advantages and measures to move away from the current train-then-fix paradigm.
[AI-116] MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models
链接: https://arxiv.org/abs/2606.04027
作者: Yingzi Ma,Zhengyue Zhao,Xiaogeng Liu,Minhui Xue,Yue Zhao,Chaowei Xiao
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 28 pages, 7 figures, 11 tables. Preprint
Abstract:Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, exposing a safety surface distinct from autoregressive LLMs. Because mask tokens are native inputs and tokens are committed by confidence rather than position, harmful content can be induced through infilling and outside the monitored prefix. Existing jailbreaks either miss this native infill capability or rely on low-diversity mask-bearing templates applied uniformly across goals, with little structural adaptation or accumulated attack experience. We propose MaskForge, a fully black-box adaptive attack that casts dLLM red-teaming as optimized search over a growing library of structural patterns. MaskForge abstracts successful attempts into reusable schemas, selects goal-compatible patterns with a UCB bandit, and invokes a scorer-guided fallback when the current library fails. Successful attempts are distilled back into the pattern library, enabling experience to accumulate across goals. Across five public dLLMs and three benchmarks, MaskForge achieves an average attack success rate of 79.3%, a 17.6% relative improvement over the strongest competing dLLM baseline. The matured pattern library further transfers to AdvBench without any updates, achieving an 88.2% attack success rate and a 67% relative improvement over the strongest competing baseline.
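For readers unfamiliar with the "UCB bandit" mentioned in the abstract, the following is a minimal sketch of the textbook UCB1 selection rule on abstract arms. It is not the paper's implementation: the function names, the exploration constant `c`, and the Bernoulli reward probabilities are all illustrative assumptions, and each arm is just an index.

```python
import math
import random


def ucb1_select(counts, rewards, t, c=2.0):
    """Return the index of the arm with the highest UCB1 score.

    counts[a]  : number of times arm a has been played
    rewards[a] : sum of rewards observed for arm a
    t          : current round (1-based)
    c          : exploration constant (assumed value, textbook UCB1 uses 2)
    """
    # Play every arm once before relying on the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        rewards[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(success_probs, steps=2000, seed=0):
    """Simulate UCB1 on Bernoulli arms and return how often each arm was played."""
    rng = random.Random(seed)
    k = len(success_probs)
    counts = [0] * k
    rewards = [0.0] * k
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, rewards, t)
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
    return counts
```

The rule trades off the observed mean reward against an uncertainty bonus that shrinks as an arm is played more often, so rarely tried arms keep getting occasional plays while the best arm dominates over time.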
[AI-117] The Biomimetic Architecture of Software 4.0
链接: https://arxiv.org/abs/2606.04025
作者: Philip Sheldrake,Dirk Scheffler
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 14 pages
Abstract:Dominant programming paradigms inherit an execution model optimised for a bygone era of a single human mind instructing a local machine, leaving contemporary systems burdened with historical path dependencies. When forced to host multi-dimensional, connectionist intelligence, this brittle assembly model fractures under the weight of a profound probabilistic-symbolic impedance mismatch. While contemporary Software 3.x frameworks attempt to patch the mismatch by encasing large language models (LLMs) in increasingly complicated external harnesses, this spiralling architectural complexity only compounds the carrying cost of static code assembly. To address the cause rather than the effects, this paper introduces Software 4.0 – an autopoietic heterarchy of human intelligence, neural AI, and natively reflective symbolic substrate. Under this paradigm, software is transformed from an inert corpus to be parsed into a self-regulating metabolic network that natively verifies, modifies, and evolves its own structural integrity. We present Recognitive, the programming language and platform that materialises this architecture. By offloading the burden of structural verification to a deterministic substrate, it unlocks a superior inference-time scaling regime – one where connectionist compute translates entirely into deep semantic exploration and hypothesis traversal rather than the ruinous computational and financial cost of simulating structural constraints probabilistically. Moving beyond the legacy ‘Software Factory’ mindset, we outline the theoretical foundations required to ground connectionist intent and arrive fully in the intelligence age. This is a foundational vision paper; empirical evaluation and formal specification of the type system and operational semantics are the subject of future work.
[AI-118] CodegenBench: Can LLMs Write Efficient Code Across Architectures?
链接: https://arxiv.org/abs/2606.04023
作者: Jie Li,Wenzhao Wu,Junqi Hu,Qinrui Zheng,Bowen Wu,Juepeng Zheng,Yutong Lu,Haohuan Fu
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 29 pages, 22 figures
Abstract:While large language models (LLMs) have been extensively evaluated on code generation tasks for general-purpose programming and GPU-accelerated environments (e.g., PyTorch, CUDA), their capabilities in CPU-oriented high-performance computing (HPC) across diverse architectures remain underexplored. To bridge this gap, we introduce CodegenBench, a comprehensive benchmark suite designed to evaluate the generation of efficient parallel code across three distinct hardware platforms: x86_64, Sunway, and Kunpeng. Our benchmark comprises 106 standard Basic Linear Algebra Subprograms (BLAS) routines establishing a fundamental baseline, alongside 20 specialized computational kernels adapted for each of the unique supercomputing architectures (LeetSunway and LeetKunpeng). Our extensive evaluation reveals that while state-of-the-art LLMs can generate optimized code for ubiquitous architectures like x86_64, they exhibit significant performance degradation on domain-specific architectures with limited public documentation and training data, highlighting critical limitations in cross-platform generalization. Furthermore, our analysis of factors influencing code quality such as implementation length and task complexity indicates that current LLMs are most effective for moderately difficult problems requiring concise code snippets. We open-source our dataset and automated evaluation infrastructure to facilitate future research in LLM-driven high-performance code generation. The resources are available at this https URL and this https URL.
[AI-119] Early Detection of Alzheimer's Disease Using Explainable Machine Learning on Clinical Biomarkers: A Multi-Class Classification Study Using the Alzheimer's Disease Neuroimaging Initiative (ADNI) Dataset
链接: https://arxiv.org/abs/2606.03995
作者: Afshan Hashmi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Background: Alzheimer’s disease (AD) affects over 55 million people worldwide. Accurate, interpretable detection of normal cognition (NC), mild cognitive impairment (MCI), and AD from routine clinical assessments remains a critical unmet need. Methods: An XGBoost classifier was developed for three-class detection using eight clinical features from the Alzheimer’s Disease Neuroimaging Initiative (ADNI): MMSE, CDR Global, CDR Sum of Boxes (CDR-SB), MoCA, FAQ, age, sex, and education. Hyperparameters were optimised using Optuna (50 trials); class imbalance was addressed with SMOTE. Performance was evaluated by macro AUC-ROC with 1,000-iteration bootstrap 95% confidence intervals, macro F1, balanced accuracy, and Cohen’s kappa. SHAP values provided feature-level explainability. Results: The dataset comprised 1,641 baseline subjects (608 NC, 767 MCI, 266 AD). On five-fold cross-validation, mean macro AUC was 0.983 (SD 0.007), accuracy 0.944 (SD 0.006), and macro F1 0.929 (SD 0.008). On the held-out test set (n = 247), macro AUC was 0.982 (95% CI: 0.965–0.995), accuracy 0.943, balanced accuracy 0.932, macro F1 0.927, and Cohen’s kappa 0.909. SHAP analysis identified CDR Global as the dominant predictor for NC and MCI, while CDR-SB and MMSE together drove AD classification. Conclusion: An explainable machine learning model trained on routine clinical assessments achieves near-perfect three-class Alzheimer’s detection. SHAP analysis reveals clinically plausible, class-specific feature importance patterns supporting clinical validity. Future work will extend this framework with speech biomarkers for multimodal detection.
[AI-120] DiffAero: A GPU-Accelerated Differentiable Simulation Framework for Efficient Quadrotor Policy Learning
链接: https://arxiv.org/abs/2509.10247
作者: Xinhong Zhang,Runqing Wang,Yunfan Ren,Jian Sun,Hao Fang,Jie Chen,Gang Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 11 figures, 1 table
Abstract:This letter introduces DiffAero, a lightweight, GPU-accelerated, and fully differentiable simulation framework designed for efficient quadrotor control policy learning. DiffAero supports both environment-level and agent-level parallelism and integrates multiple dynamics models, customizable sensor stacks (IMU, depth camera, and LiDAR), and diverse flight tasks within a unified, GPU-native training interface. By fully parallelizing both physics and rendering on the GPU, DiffAero eliminates CPU-GPU data transfer bottlenecks and delivers orders-of-magnitude improvements in simulation throughput. In contrast to existing simulators, DiffAero not only provides high-performance simulation but also serves as a research platform for exploring differentiable and hybrid learning algorithms. Extensive benchmarks and real-world flight experiments demonstrate that DiffAero and hybrid learning algorithms combined can learn robust flight policies in hours on consumer-grade hardware. The code is available at this https URL.
[AI-121] How do machines learn? Evaluating the AIcon2abs method
链接: https://arxiv.org/abs/2401.07386
作者: Rubens Lacerda Queiroz,Cabral Lima,Fabio Ferrentini Sampaio,Priscila Machado Vieira Lima
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: textual review (spelling and grammar); reorganization of the elements of some figures; New references included
Abstract:This study expands on previous work that introduced the AIcon2abs method (AI from Concrete to Abstract: Demystifying Artificial Intelligence to the general public), an innovative approach designed to increase public understanding of machine learning (ML) across diverse age groups, including K-12 students, and aims to evaluate its effectiveness. AIcon2abs employs the WiSARD algorithm, a weightless neural network known for its simplicity and user accessibility. WiSARD does not require an Internet connection, making it ideal for non-technical users and resource-limited environments. The method enables participants to intuitively visualize and interact with the internal processes of training and classification through engaging, hands-on activities, as if they were the algorithms themselves. WiSARD can also learn effectively from a minimal dataset, even from a single example. This feature enables users to observe how the machine improves its accuracy incrementally as it receives more data. Moreover, WiSARD generates mental images representing what it has learned, highlighting essential features of the classified data. AIcon2abs was tested through a six-hour remote course with 34 Brazilian participants, including 5 children, 5 adolescents, and 24 adults. Data analysis was conducted from two perspectives: a mixed-method pre-experiment (including hypothesis testing), and a qualitative phenomenological analysis. Nearly all participants rated AIcon2abs positively, with the results demonstrating a high degree of satisfaction in achieving the intended outcomes. This research was approved by the CEP-HUCFF-UFRJ Research Ethics Committee.
[AI-122] AI from concrete to abstract: demystifying artificial intelligence to the general public
链接: https://arxiv.org/abs/2006.04013
作者: Rubens Lacerda Queiroz,Fábio Ferrentini Sampaio,Cabral Lima,Priscila Machado Vieira Lima
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages; 2 tables; 47 figures; review comment: Included references for the final published peer-reviewed version of this pre-print: this https URL and this https URL typos corrected
Abstract:Artificial Intelligence (AI) has been adopted in a wide range of domains. This shows the imperative need to develop means to give the general public a minimum understanding of what AI means. Combining visual programming and WiSARD weightless artificial neural networks, this article presents a new methodology, AI from concrete to abstract (AIcon2abs), to enable the general public (including children) to achieve this goal. The main strategy adopted by AIcon2abs is to promote a demystification of artificial intelligence via practical activities related to the development of learning machines, as well as through the observation of their learning process. Thus, it is possible to provide subjects with skills that contribute to making them insightful actors in debates and decisions involving the adoption of artificial intelligence mechanisms. Currently, existing approaches to the teaching of basic AI concepts through programming treat machine intelligence as an external element/module. After being trained, that external module is coupled to the main application being developed by the learners. In the methodology presented here, both training and classification tasks are blocks that compose the main program, just like the other programming constructs. As a beneficial side effect of AIcon2abs, the difference between a program capable of learning from data and a conventional computer program becomes more evident. In addition, the simplicity of the WiSARD weightless artificial neural network model makes it easy to visualize and understand how training and classification tasks are carried out internally.
[AI-123] Semantic Constraint Synthesis for Adaptive Trajectory Optimization via Large Language Models CVPR2026
链接: https://arxiv.org/abs/2606.04123
作者: Eleanor Brosius,Yuji Takubo,Daniele Gammelli,Simone D’Amico,Marco Pavone
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 7 pages, 4 figures, Presented as a short paper at IEEE CVPR 2026, AI4Space Workshop
Abstract:Trajectory optimization is a critical component for enabling safe and reliable autonomous operations in space exploration. As space missions increase in frequency, complexity, and scope, there is a growing need to rapidly formulate mathematically sound trajectory optimization problems that accurately reflect mission objectives and operational constraints. However, translating mission intent into tractable analytical formulations for trajectory optimization requires substantial domain expertise. This paper presents a framework that leverages large language models (LLMs) to translate natural language descriptions of mission requirements and constraints into executable trajectory optimization code and corresponding mathematical formulations. Experiments in spacecraft rendezvous scenarios demonstrate a high success rate in reconditioning a convex trajectory optimization problem from semantic mission requirements. Ultimately, this work highlights the potential of LLMs to bridge high-level intent and formal optimization models, enabling more flexible and efficient trajectory design of spacecraft.
[AI-124] Gravity-Aware Hierarchical Routing for Lightweight SensorLLM on Human Activity Recognition
链接: https://arxiv.org/abs/2606.04019
作者: Hao Li,Mingrui Zheng,Yasuyuki Tahara,Yuichi Sei
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent studies on sensor-language alignment have shown that two-stage frameworks can improve the semantic modeling ability of wearable-sensor human activity recognition (HAR), where SensorLLM-style methods first perform motion-to-language alignment and then fine-tune the model for downstream tasks. However, our experiments reveal a consistent failure mode when the Stage 2 backbone is compressed to a compact model such as TinyLlama: recognition of dynamic activities remains relatively strong, while the discrimination of low-motion static classes such as standing, sitting, and lying degrades substantially. To address this issue, we propose a gravity-aware hierarchical routing head as a lightweight post-alignment adaptation built on top of an already aligned model, rather than a new large-scale pretraining framework. The method uses the per-channel mean and std from the Chronos tokenizer state to extract statistical cues related to posture and gravity direction, and adaptively combines a static expert and a full expert through soft routing, together with a load-balancing loss for stable training. On the MHealth dataset, this design significantly improves macro-F1 with minimal parameter overhead, and the gains are concentrated mainly on static classes while preserving strong performance on dynamic activities. As a first arXiv disclosure, the current paper reports results on a single dataset only, with the goal of highlighting the core method and laying the groundwork for broader evaluation in future work.
[AI-125] The Variance Brain Foundation Models Forgot: Third-Order Statistics Predict Cognition Where Billion-Parameter Models Fail
链接: https://arxiv.org/abs/2606.04010
作者: Giovanni Marraffini,Gabriel Mahuas,Trinidad Borrell,Victoria Shevchenko,Demian Wassermann
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 37 pages, 16 figures, 23 tables
Abstract:Brain foundation models (BFMs) are self-supervised Transformers pretrained on fMRI data. We posit that these models should capture each subject’s cognitive performance from their fMRI signal. Yet across three state-of-the-art BFMs and every readout we test, they predict cognition worse than a linear regression from the ~80K parameters of the functional connectivity matrix (FC). The gap widens with scale: BrainLM’s 650M model predicts cognition worse than its 111M. We attribute this to a variance allocation problem: BFM pretraining captures the variance components that dominate fMRI but not the higher-order structure that predicts cognition. Our per-cumulant analysis of the reconstructed signal shows that the second-order covariance is partially preserved, while the third-order co-skewness tensor is largely destroyed. To recover what BFMs lose, we design a linear pipeline that projects the fMRI signal into the subspace that best preserves its co-skewness and computes FC there. This exceeds raw FC and every pretrained BFM on every dataset and parcellation we test, outperforming prior state-of-the-art under controlled evaluation with no pretraining and no GPU. We recover the raw-FC ceiling on BrainLM’s forward pass by finetuning with a loss targeted at this same subspace. This shows that the bottleneck is the pretraining objective, not the architecture or the model size.
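To make the "third-order co-skewness tensor" concrete, here is a small sketch that computes S[i][j][k] = E[z_i z_j z_k] over standardized channels of a multichannel time series. It only illustrates the statistic itself, not the paper's subspace projection or FC pipeline; the function name and the use of population (1/T) moments are assumptions.

```python
def coskewness(samples):
    """Third-order co-skewness tensor of a multichannel signal.

    samples: list of T rows, each row a list of d channel values.
    Returns a d x d x d nested list S with S[i][j][k] = mean(z_i * z_j * z_k),
    where z is the signal standardized per channel (population std, 1/T).
    Assumes every channel has non-zero variance.
    """
    t = len(samples)
    d = len(samples[0])
    means = [sum(row[i] for row in samples) / t for i in range(d)]
    stds = [
        (sum((row[i] - means[i]) ** 2 for row in samples) / t) ** 0.5
        for i in range(d)
    ]
    # Standardize each channel to zero mean and unit variance.
    z = [[(row[i] - means[i]) / stds[i] for i in range(d)] for row in samples]
    return [
        [
            [sum(r[i] * r[j] * r[k] for r in z) / t for k in range(d)]
            for j in range(d)
        ]
        for i in range(d)
    ]
```

The diagonal entries S[i][i][i] are the ordinary skewness of each channel, and the tensor is symmetric under any permutation of its indices. For a symmetric distribution the entries vanish, which is why this statistic carries information that the covariance alone does not.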
[AI-126] Counterfactual Explanations for Deep Two-Sample Testing
链接: https://arxiv.org/abs/2606.04009
作者: Wei-Cheng Lai,Marco Simnacher,Christoph Lippert
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages
Abstract:Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis H_0 . To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model’s representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.
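As a rough illustration of the maximum mean discrepancy (MMD) objective named in the abstract, the sketch below computes the biased (V-statistic) estimate of squared MMD between two small sample sets. The Gaussian kernel, the bandwidth value, and the use of raw vectors in place of learned representations are assumptions made for this example only.

```python
import math


def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD: mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    kxx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy
```

A counterfactual edit that moves source samples toward the target group should lower this quantity, which is consistent with the abstract's observation that p-values increase after editing.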
[AI-127] Neural Radiated-Noise Fields for Unmanned Underwater Vehicle Noise Spectrum Prediction in Three-Dimensional Scenes
链接: https://arxiv.org/abs/2606.04008
作者: Yan Wu,Yang Yang,Jun Fan,Bin Wang
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Radiated noise in unmanned underwater vehicles (UUVs) is an important indicator for characterizing acoustic signatures and evaluating platform performance. To address the strong dependence of traditional physics-based modeling and numerical simulation methods on target structural information and environmental boundary conditions, and their inability to achieve continuous spatial spectrum-response modeling in three-dimensional scenes, this paper proposes a neural radiated-noise field (NRNF). An NRNF represents the UUV radiated-noise spectrum as a continuous function of the three-dimensional UUV position, the three-dimensional hydrophone position, the UUV yaw angle, and the frequency, enabling query-based prediction at arbitrary spatial locations. The proposed method employs sinusoidal encoding for position and frequency, and introduces a learnable three-dimensional scene feature grid to explicitly represent environmental structure and propagation effects. A spectrum-prediction dataset is constructed from lake trials, and the proposed model is evaluated under three settings: horizontal extrapolation, depth extrapolation, and cross-run generalization. Results show that the NRNF achieves an average prediction error of 3.5 dB in the 50 to 5000 Hz band. Horizontal extrapolation is easiest, depth extrapolation is the most challenging, and cross-run generalization is of intermediate difficulty. Further ablation results demonstrate that the scene feature grid significantly improves the prediction stability and spatial generalization of the model.
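The query described in the abstract (UUV position, hydrophone position, yaw, frequency) can be pictured with a simple sinusoidal encoding, sketched below. The number of frequency bands, the assumption that inputs are pre-normalized to roughly [-1, 1], and the choice to represent yaw as a (sin, cos) pair are all illustrative guesses, not details taken from the paper.

```python
import math


def sinusoidal_encoding(x, num_bands=4):
    """Map a scalar to [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..num_bands-1."""
    feats = []
    for k in range(num_bands):
        w = (2.0 ** k) * math.pi
        feats.append(math.sin(w * x))
        feats.append(math.cos(w * x))
    return feats


def encode_query(uuv_pos, hydrophone_pos, yaw, freq, num_bands=4):
    """Build a feature vector for one field query.

    uuv_pos, hydrophone_pos : 3-tuples, assumed normalized to about [-1, 1]
    yaw                     : heading in radians, encoded as (sin, cos) (assumption)
    freq                    : frequency, assumed normalized to about [-1, 1]
    """
    out = []
    for v in list(uuv_pos) + list(hydrophone_pos) + [freq]:
        out.extend(sinusoidal_encoding(v, num_bands))
    out.extend([math.sin(yaw), math.cos(yaw)])
    return out
```

In a neural field, a vector like this would be fed to a network (together with features sampled from a scene grid) to predict the spectrum level at the queried location and frequency.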
[AI-128] Constraint-Enhanced Physical Search through Correlation Matching
链接: https://arxiv.org/abs/2606.03554
作者: Song-Ju Kim
类目: Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO); Computational Physics (physics.comp-ph)
备注: 13 pages, 4 figures
Abstract:Physical systems do not merely add noise to search processes; they impose constraints that generate structured correlations. We propose a principle of constraint-enhanced physical search in which temporal correlations in exploration are matched to constraint-induced spatial correlations in the update dynamics. Using a minimal tug-of-war bandit model (TOW), we show that a conservation law converts local observations into differential evidence across alternatives, while a temporally correlated drive controls the order of exploration. Search efficiency is improved not by stronger randomness or by maximal anti-correlation, but by matching the temporal correlation to the physical update scale that converts feedback into evidence. A scaling estimate identifies the update-noise-to-contrast ratio as the leading parameter that limits how strongly temporal anti-correlation can be used. The results suggest a general organizing principle for physical search: constraints and fluctuations can generate structured spatiotemporal correlations, and efficient exploration emerges when these correlations are matched to the update dynamics.
机器学习
[LG-0] BBOmix: A Tabular Benchmark for Hyperparameter Optimization of Unsupervised Biological Representation Learning
链接: https://arxiv.org/abs/2606.05139
作者: Luca Thale-Bombien,Jan Ewald,Ralf König,Aaron Klein
类目: Machine Learning (cs.LG)
备注:
Abstract:The rapid advancement of high-throughput sequencing has led to large, high-dimensional omics datasets. Deep unsupervised learning architectures, particularly Autoencoders (AEs), are increasingly used for dimensionality reduction and representation learning in this domain. However, AEs are highly sensitive to architectural choices and hyperparameters, and unsupervised optimization typically relies on reconstruction loss, which may be a poor proxy for downstream utility. Exhaustive hyperparameter optimization (HPO) is computationally expensive, leading researchers to frequently rely on suboptimal default configurations. To democratize access to large-scale unsupervised HPO research, we introduce BBOmix, the first open-source tabular benchmark for unsupervised representation learning on real-world biological data. Our benchmark includes 105,000 evaluations across four AE architectures and seven multi-omics modalities from the TCGA and SCHC datasets. We quantify the correlation between reconstruction loss and downstream task performance and provide an extensive evaluation of state-of-the-art single-fidelity, multi-fidelity, and transfer learning HPO methods, establishing a rigorous baseline for future research in unsupervised biological representation learning.
[LG-1] Generating Financial Time Series by Matching Random Convolutional Features
链接: https://arxiv.org/abs/2606.05138
作者: Konrad J. Mueller,Nikita Zozoulenko,Ben Wood,Thomas Cass,Lukas Gonon
类目: Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
备注:
Abstract:Generating realistic financial time series is challenging as training data is often limited to a single historical path. With such scarce data, overfitting is hard to avoid, especially under adversarial training where a trained discriminator can memorize the training samples. To mitigate this, recent approaches train generators to minimize the discrepancy between untrained feature representations of real and generated time series. In these works, the feature maps are based on path signatures, which can fail to capture relevant time series properties at tractable truncation depths. In this work, we instead train generators by matching random convolutional features of real and generated time series. Existing random convolutional feature maps, such as Rocket and Hydra, have been shown to provide informative representations of real-world time series, but cannot supervise generative models because they are non-differentiable. We introduce SOCK (SOft Competing Kernels), a fully differentiable random convolutional feature map, suited to train generative time series models. We show that generators trained by matching random SOCK features consistently outperform signature and diffusion baselines across a wide range of small-sample financial datasets. We further demonstrate SOCK’s expressiveness on two-sample hypothesis testing and time series classification tasks, where SOCK matches or outperforms existing unsupervised feature maps.
[LG-2] Deep Embedded Multiplicative DMD for Algebra-Preserving Koopman Learning
链接: https://arxiv.org/abs/2606.05131
作者: Kelan Gray,Finlay Brown,Nicolas Boullé,Matthew J. Colbrook
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Optimization and Control (math.OC); Spectral Theory (math.SP)
备注: 26 pages, 11 figures
Abstract:Koopman theory turns nonlinear dynamics into a linear spectral problem. In computation, however, everything depends on a hard finite-dimensional choice: the observables must be expressive, nearly invariant under the dynamics, and, ideally, compatible with composition. Deep Koopman methods learn flexible coordinates, whereas structure-preserving methods enforce operator identities on fixed dictionaries. We combine these ideas by introducing Deep Embedded Multiplicative Dynamic Mode Decomposition (DeepMDMD), a method that learns a latent space and a partition of it, while enforcing the Koopman product rule as an exact algebraic constraint. Training alternates between an exact multiplicative operator update and a differentiable latent-clustering step that promotes Koopman closure. The result is a finite transition map on learned latent cells. Its nonzero spectrum lies on the unit circle, its dictionary is shaped by the dynamics rather than by ambient geometry, and forecasts are made in latent coordinates before being decoded to physical space. Across Hamiltonian, chaotic, and fluid examples, DeepMDMD learns dictionaries that are far more compact and dynamically coherent than those produced by geometric MDMD partitions. It reduces spectral pollution, reveals richer continuous-spectrum structure, and gives stable forecasts under severe noise. In high-dimensional flows, including a 158,624-dimensional cylinder wake and a noisy Re=20,000 lid-driven cavity, it preserves coherent structures and long-time spectral statistics where state-space MDMD fails. These results suggest a practical rule for Koopman learning: learn the coordinates, constrain the algebra.
[LG-3] Preserving Data Privacy in Learning Causal Structure with Fully Homomorphic Encryption
链接: https://arxiv.org/abs/2606.05129
作者: Jian Yang,Yuan Tong,Qinbin Li,Zeyi Wen,Xiaofang Zhou
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Preserving data privacy is an important topic in structural data management and data mining. However, the issue of privacy leakage in distributed causal structure learning is a persistent challenge, especially in cases where data transmission and computation are required. In this paper, we propose a method based on fully homomorphic encryption (FHE) that performs calculations on ciphertexts, keeping data encrypted in transit and during computation. Nevertheless, applying FHE to causal structure learning is challenging due to the high computation cost and limited support for division and logarithm operations in FHE. To tackle this challenge, we propose a series of novel techniques including (i) circuit simplification for better efficiency, (ii) approximation of division and logarithm through the Newton-Raphson reciprocal iteration and Taylor expansion, and (iii) a batching technique with SIMD acceleration to speed up the whole learning process. Additionally, we demonstrate that our method extends beyond FHE by porting it to support differential privacy. Empirical results show that our method achieves high consistency and comparable causal structure with the plaintext version in the datasets tested. Finally, our method is efficient and practical, completing causal structure learning in tens of minutes even under the privacy protection of FHE.
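The approximations named in item (ii) can be sketched on plain floating-point numbers. The point of both routines is that they use only addition, subtraction, and multiplication, which are the operations FHE schemes support natively. No encryption is performed here; the function names, starting guess, iteration count, and number of Taylor terms are assumptions for illustration and not the paper's parameters.

```python
def nr_reciprocal(a, x0, iters=6):
    """Approximate 1/a with the Newton-Raphson update x <- x * (2 - a * x).

    Uses only multiplication and subtraction. Converges when 0 < x0 < 2/a,
    and the relative error is squared at every iteration.
    """
    x = x0
    for _ in range(iters):
        x = x * (2.0 - a * x)
    return x


def taylor_log(a, terms=12):
    """Approximate ln(a) for 0 < a < 2 via the Taylor series of ln(1 + u), u = a - 1.

    The coefficients (+1/1, -1/2, +1/3, ...) are public constants, so under FHE
    they would be plaintext multipliers; only powers of u involve ciphertexts.
    """
    u = a - 1.0
    total = 0.0
    power = 1.0
    for n in range(1, terms + 1):
        power *= u
        total += ((-1.0) ** (n + 1) / n) * power
    return total
```

For example, `nr_reciprocal(4.0, 0.1)` starts with relative error 0.6 and squares it six times, so it lands very close to 0.25. A division `b / a` then becomes `b * nr_reciprocal(a, x0)`, at the cost of needing a starting guess inside the convergence range.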
[LG-4] Graph Set Transformer
链接: https://arxiv.org/abs/2606.05116
作者: Jose E. Escrig Molina,Baoquan Chen,Daniel Probst
类目: Machine Learning (cs.LG)
备注: 10 pages, 1 figure, conference
Abstract:We introduce the Graph Set Transformer (GST), a neural network architecture for learning on sets of graphs, designed for tasks in which per-element predictions depend on set-wide context as well as local structure. Existing architectures, including DeepSets and SetTransformer, require pre-encoded graph embeddings from a separate GNN, creating a bottleneck between feature extraction and set-level contextualisation. In contrast, GST interleaves node-level feature propagation and cross-graph contextual modelling at every layer, fusing the two levels of information through a gating mechanism. We evaluate GST on a controlled synthetic suite designed to isolate set-conditional structural reasoning and on three real-data benchmarks spanning per-atom reaction-centre identification, reaction yield prediction, and image classification. Under matched parameter budgets, GST performs better than the baselines across these settings. An architectural ablation strongly suggests that the interleaving of local and set context contributes substantially to this advantage.
[LG-5] RePercENT: Scaling Disentangled Representation Learning Beyond Two Modalities
Link: https://arxiv.org/abs/2606.05109
Authors: Vasiliki Rizou, Pascal Frossard, Dorina Thanou
Categories: Machine Learning (cs.LG)
Comments:
Abstract:To leverage the full potential of multimodal data, we need representations that go beyond the state-of-the-art alignment and fusion approaches and exploit all cross-modal interactions without sacrificing modality-specific information. Learning disentangled representations is a principled way to identify the underlying shared and unique factors that are hidden in observational data. However, while multimodal disentanglement is a compelling paradigm, existing methods are largely confined to the two-modality regime due to an inherent scalability bottleneck. To address this, we propose RePercENT, a self-supervised framework that overcomes these limitations and unlocks scalable pairwise disentanglement beyond two modalities. Through a multimodal “plug-and-play” architecture, our approach operates directly on pre-extracted embeddings, eliminating the need for extensive joint pre-training while making no assumptions regarding the underlying modalities or foundation model backbones. Moreover, we introduce a joint optimization objective for simultaneously deriving the shared and unique components, and provide formal theoretical guarantees that characterize the optimality of our solution. Across diverse modalities and tasks, RePercENT successfully recovers disentangled components while maintaining competitive performance and significantly reducing computational complexity.
[LG-6] FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors ICML2026
Link: https://arxiv.org/abs/2606.05101
Authors: Sepehr Dehdashtian, Jacob H Seidman, Vishnu N Boddeti, Gaurav Bharaj
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026
Abstract:Audio deepfake detection (ADD) models are critical for countering the malicious use of text-to-speech (TTS) models. Evaluating and strengthening ADD models requires developing datasets that span the space of generated audio and highlight high-error regions. Existing dataset development strategies face two challenges: (i) manual collection, and (ii) inefficient discovery of blind spots in the ADD models. To address these challenges, we propose FoeGlass, the first black-box automated red-teaming method for ADDs, which effectively discovers ADD failure modes in the space of generated audio underexplored by state-of-the-art deepfake benchmarks. FoeGlass uses the in-context learning capabilities of an LLM to explore the input space of a TTS model, generating audio samples that fool the target ADD using only black-box access to all components. By using a carefully designed context based on diversity measurements, FoeGlass mitigates the common problem of mode collapse in automated red-teaming systems. Empirical evaluations on several open-source ADD and TTS models demonstrate that data generated from FoeGlass substantially improves the false negative rates over unconditional sampling baselines and recent spoofing datasets by up to 94%, while requiring no manual supervision. Furthermore, we show that the attacks generated by FoeGlass are transferable across different target ADDs, demonstrating its broad applicability and ease of use for the automated red teaming of ADD systems. Finally, fine-tuning ADD models on FoeGlass-generated samples notably enhances the robustness of the detectors (up 41%).
[LG-7] Learning What Not to Impute: An Uncertainty-Aware Diffusion Framework for Meaningful Missingness
Link: https://arxiv.org/abs/2606.05073
Authors: Lixing Zhang, Yidong Ouyang, Weifu Li, Shixiang Zhu, Guang Cheng, Liyan Xie
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Missing value imputation is a fundamental task in machine learning, with most existing methods assuming that all missing entries correspond to unobserved regular values. In many real-world datasets, however, missingness may arise from two distinct sources: some entries are meaningfully missing (intrinsically absent and semantically valid), while others are missing due to the observation process and should be imputed. We formalize this distinction as a selective imputation problem, where the goal is to jointly infer which missing entries should be preserved and which should be recovered. To address this challenge, we propose Diff-Joint, a diffusion-based framework that jointly models tabular data together with a latent missingness mask. The method alternates between conditional sampling and uncertainty-aware aggregation to iteratively refine both imputed values and missingness labels. Empirical results on synthetic and real-world datasets demonstrate that Diff-Joint effectively identifies meaningfully missing entries while achieving competitive imputation accuracy and improved downstream task performance.
[LG-8] RIDE: An Open Dataset and Benchmark for Train Delay Prediction
Link: https://arxiv.org/abs/2606.05070
Authors: Clément Elliker, Mathis Le Bail, Clément Mantoux, Jesse Read, Sonia Vanier
Categories: Machine Learning (cs.LG)
Comments: 58 pages, 41 figures
Abstract:Train delay prediction is an important problem for both passengers and railway operators, yet progress in the field remains difficult to assess due to the lack of standardized datasets, prediction targets, and evaluation protocols. To address this gap, we introduce RIDE, an open dataset and benchmark for train delay prediction built at nationwide scale over the Belgian railway network. RIDE covers 94.5M train events, 3.6M journeys, and 35.7M weather records from 2023 to 2025. It is organized as a layered data pipeline from raw railway and weather sources to two public releases: a reusable intermediate relational dataset and model-ready benchmark datasets. The benchmark standardizes the prediction task and the training and testing data. It also provides a unified evaluation protocol that supports direct comparison across models. Using this framework, we provide the first comprehensive comparative evaluation of non-learning, statistical learning, and deep learning models. We show that learning-based methods clearly outperform non-learning models, with graph neural networks achieving the best mean performance, while the strongest learning-based models remain relatively close to one another. Beyond aggregate mean absolute error (MAE) and root mean squared error (RMSE), the framework also provides breakdowns by prediction horizon and delay change, enabling more detailed analysis of model behavior across forecasting regimes.
[LG-9] FLAGG: Flexible Autoregressive Graph Generation
Link: https://arxiv.org/abs/2606.05067
Authors: Samuel Cognolato, Alessandro Sperduti, Luciano Serafini
Categories: Machine Learning (cs.LG)
Comments: Accepted for publication at JMLR, currently in press
Abstract:The landscape of deep graph generation spans two extremes: one-shot and sequential models. The former generates nodes and edges jointly, while the latter samples them autoregressively. Each method performs better in different graph domains depending on size and topology, but neither is applicable to all graph categories. For instance, one-shot methods struggle with generating large graphs, while sequential methods underperform on smaller graphs. A possible way to overcome these limitations is to flexibly combine the two methods in a single system. In this work, we propose the FLAGG (Flexible Autoregressive Graph Generation) framework, which sequentially generates portions of graphs with one-shot models. FLAGG can make any one-shot model autoregressive, allowing flexibility in choosing the sequential policy. This policy is specified through a stochastic node removal process, which an Insertion Model learns to reverse. We evaluate FLAGG with the DiGress one-shot model on several data sets of different graph sizes and domains. We show that the approach outperforms both one-shot and autoregressive baselines in terms of sampling quality.
[LG-10] Graph Cascades: Contagion-Based Mesoscopic Rewiring for Structure-Aware Graph Machine Learning
Link: https://arxiv.org/abs/2606.05046
Authors: Meher Chaitanya, My Le, Luana Ruiz
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We introduce Graph Cascades, a mesoscopic rewiring strategy for Graph Neural Networks (GNNs) and Graph Transformers (GTs) that captures intermediate-scale graph structure beyond purely local edges or fully global attention. Using contagion-based diffusion processes, Graph Cascades constructs, in O(|V|+|E|) time, an auxiliary graph where node pairs supported by repeated multi-hop reinforcement are promoted to direct neighbors. We theoretically characterize when reinforcement-based rewiring helps: sufficient conditions under which reinforcement-based edge selection is more label-aligned than direct adjacency, an SBM witness in which two-hop reinforcement is perfectly homophilic, and a formalization of mesoscopic connectivity via graph effective resistance. Empirically, across node-classification benchmarks, Graph Cascades improves multiple GNN and sparse-GT backbones, with the most reliable gains observed on heterophilic and moderate- to high-degree homophilic graphs. The theoretical conditions also identify regimes where mesoscopic rewiring is unlikely to be beneficial – low-degree regular graphs and graphs with structural bottlenecks – and these predictions match the observed failures. We additionally observe tight correlations between performance and structural properties in the rewired graphs.
[LG-11] Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling
Link: https://arxiv.org/abs/2606.05021
Authors: Marc Walden, Jason Liu, Shaashwath Sivakumar, Ryan Liu, Hamza Khan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We investigate multi-agent deep reinforcement learning and propose two enhancements to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each agent to predict other agents’ intended actions, thereby improving the accuracy and stability of its own policy. Second, we apply an importance sampling strategy, based on a geometric distribution, in the replay buffer to prioritize more recent and informative experiences, which helps mitigate the non-stationarity inherent in multi-agent environments. We evaluate both modifications on the discrete-action Predator-Prey task provided by the PettingZoo library, a flexible Python interface for general multi-agent reinforcement learning benchmarks. Our results indicate that Action Inference is effective in improving learning stability and inter-agent cooperation, and that importance sampling with a geometric distribution can lead to significant improvements in exploration efficiency over standard MADDPG. Code available at this https URL
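To make the replay-buffer idea in the abstract above concrete, here is a minimal sketch of recency-biased sampling with a geometric distribution. It is an illustration under stated assumptions rather than the paper's code: the function name `sample_recent_indices`, the success probability `p`, and the choice to reject ages that fall outside the buffer are all hypothetical.

```python
import math
import random

def sample_recent_indices(buffer_size, batch_size, p=0.01, rng=None):
    """Sample buffer indices so that newer transitions are drawn more often.

    The "age" of a sample (0 = newest) follows a geometric distribution
    with success probability p (0 < p < 1); ages beyond the buffer are
    rejected and redrawn, which truncates the distribution.
    """
    rng = rng or random.Random()
    indices = []
    while len(indices) < batch_size:
        u = rng.random()  # uniform in [0, 1)
        # Inverse-CDF draw: P(age = k) = (1 - p)**k * p for k = 0, 1, 2, ...
        age = math.floor(math.log(1.0 - u) / math.log(1.0 - p))
        if age < buffer_size:
            indices.append(buffer_size - 1 - age)
    return indices
```

A larger `p` concentrates the batch on the most recent transitions, while a smaller `p` approaches uniform sampling over the buffer; a full importance-sampling scheme would also reweight the loss by the sampling probabilities, which this sketch omits.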
[LG-12] New Benchmarking Shows Limited Generalization Power of TCR Antigenic Epitope Prediction Models
Link: https://arxiv.org/abs/2606.04994
Authors: Yiming Liao, Yiheng Li, Ning Jiang, Bo Li, Keke Chen
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 6 pages, 1 figure. Preprint version
Abstract:Accurate computational prediction of T cell receptor (TCR) antigen specificity would transform the study of T cell biology and enable scalable immune engineering, yet existing models lack sufficient sensitivity and specificity for broad applications. A major limitation is the absence of rigorously defined, unseen benchmark datasets that allow unbiased evaluation of model performance and generalizability. Here, we describe two complementary classes of datasets that meet this criterion and argue that they provide both a robust framework for model assessment and a foundation for next-generation TCR-antigen prediction algorithm development.
[LG-13] AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization
Link: https://arxiv.org/abs/2606.04980
Authors: Wanqi Yang, Yuexiao Ma, Alexander Conzelmann, Xiawu Zheng, Michael W. Mahoney, T. Konstantin Rusch, Shiwei Liu
Categories: Machine Learning (cs.LG)
Comments: 28 pages, 11 figures
Abstract:Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bit-widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation. For frontier MoE LLMs, the original training data, and hence the true training distribution, is proprietary and inaccessible. As a result, calibration sets are inevitably imperfect surrogates, and this can misestimate expert utilization and lead to suboptimal bit allocation. Motivated by the substantial cross-expert quality variability observed in modern MoE models, and by the success of Heavy-Tailed Self-Regularization (HT-SR) theory at predicting neural network model quality without access to training or testing data, we propose AlphaQ, a calibration-free bit-allocation method for MoE quantization. AlphaQ draws on HT-SR theory and follows a simple principle: experts with more heavy-tailed weight spectra are typically better trained and hence should receive higher bit-widths, while experts with weaker heavy-tailed structure can be quantized more aggressively. AlphaQ operationalizes this principle by measuring expert-wise spectral heavy-tailedness and solving a budget-constrained optimization problem that minimizes total quantization error under a global bit-budget constraint. Across several MoE models, AlphaQ consistently outperforms calibration-based baselines under matched bit budgets. Notably, on Qwen1.5-MoE, AlphaQ achieves near full-precision accuracy with an average expert precision of only 3.5 bits, while delivering more than $4\times$ memory compression. Our code is available at this https URL.
[LG-14] Be Fair! Can Machine Learning Engineering Agents Adhere to Fairness Constraints?
Link: https://arxiv.org/abs/2606.04971
Authors: Anna Richter, Julia Stoyanovich, Sebastian Schelter
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments:
Abstract:Machine learning engineering (MLE) agents promise to automate end-to-end ML pipeline development from raw data and natural language instructions, potentially making ML accessible to non-technical domain experts. However, in sensitive and regulated domains, this abstraction creates a responsibility gap: end-users may lack visibility into design choices that affect correctness, robustness, fairness, and regulatory compliance. We argue that existing benchmarks are insufficient to assess whether MLE agents can be safely applied in such settings. We propose desiderata for a responsibility-centered evaluation framework and conduct an exploratory study on melanoma classification, focusing on fairness across skin tones as a responsibility constraint. When evaluating two recent MLE agents, we find that agent-generated pipelines show high variance and consistently underperform manually designed baselines in both predictive quality and fairness, despite fairness-oriented prompts. These preliminary results suggest that further research is needed towards redesigning MLE agents to allow humans to guide the search process and reliably assess the compliance and quality of the generated ML pipelines.
[LG-15] A General Framework for Dynamic Consistent Submodular Maximization ICML2026
Link: https://arxiv.org/abs/2606.04946
Authors: Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Ola Svensson, Morteza Zadimoghaddam
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at ICML 2026
Abstract:Consistency is an important property in dynamic submodular maximization and entails maintaining a near-optimal solution at all times, making only a small number of adjustments to the solution in each step. Prior work has explored this question for the insertion-only case, where the algorithm faces a stream of $n$ insertions, and has established lower and upper bounds for the cardinality-constrained version of the problem. We consider this question in the fully dynamic setting, where the stream of operations may contain both insertions and deletions. We develop a general framework for designing algorithms for this setting, and instantiate it to obtain the first constant-factor approximations with sublinear consistency. For cardinality constraints, we propose a $\frac{1}{2} - O(\varepsilon)$ approximation that is $O\left(\frac{1}{\varepsilon^2}\right)$ consistent. For rank-$k$ matroid constraints, we construct a $\frac{1}{4} - O(\varepsilon)$ approximation to the dynamic optimum that is $O\left(\frac{\log k}{\varepsilon^2}\right)$ consistent.
[LG-16] STaR-Quant: State-Time Consistent Post-Training Quantization for Diffusion Large Language Models
Link: https://arxiv.org/abs/2606.04945
Authors: Xin Yan, Aqiang Wang, Zhenglin Wan, Xingrui Yu, Ivor Tsang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Diffusion large language models (DLLMs) have recently emerged as a promising alternative to autoregressive LLMs by generating text through iterative masked denoising with bidirectional context. However, their large model sizes and iterative denoising process introduce substantial memory and computational overhead, motivating post-training quantization for efficient deployment. In this paper, we identify two key challenges for low-bit DLLM quantization: state-dependent activation disparity and temporal error accumulation. Masked and unmasked tokens exhibit different activation distributions within each denoising step, while quantization errors can accumulate across steps during iterative decoding. To address these challenges, we propose STaR-Quant, a state-time consistent PTQ framework for DLLMs. STaR-Quant introduces State-Guided Activation Transformation (SGAT) to assign masked and unmasked tokens to different activation transformation spaces with a unified static weight-side transformation. It further introduces Temporal Attention Compensation (TAC) to correct the quantized attention representation via a lightweight block-diagonal affine mapping. Experiments on representative DLLMs demonstrate that STaR-Quant consistently improves low-bit weight-activation quantization over strong PTQ baselines, while delivering up to 1.69x speedup and 3.14x memory saving over FP16 deployment.
[LG-17] Mean-based algorithms: A lower bound and regret
Link: https://arxiv.org/abs/2606.04931
Authors: Julius Durmann, Amelie Kleber
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:Mean-based algorithms are a class of online learning algorithms that assign low probability to actions with low average rewards. Recent work indicates these algorithms converge favorably to serially undominated actions, which approximate Nash equilibria in economic games. However, empirical studies also show slower convergence compared to established algorithms in bandit-feedback scenarios. We study mean-based algorithms when the time horizon is unknown and only bandit feedback is available. In this setting, we provide the first lower bound on the algorithm-defining sequence $\gamma_t$ that formally establishes a limit on how fast these algorithms can learn. Additionally, we propose two mean-based algorithms: one generalizes $\epsilon$-greedy, and the other extends the mean-based Exp3 to unknown horizons. Our experiments show that mean-based algorithms, although slightly slower, can perform competitively with other bandit-feedback algorithms. We further analyze the relationship to no-regret algorithms. Depending on the choice of $\gamma_t$, the intersection with no-regret algorithms is non-trivial, and we show that algorithms exist that are both mean-based and no-regret. This adds context to the “exploitability” of this class of algorithms that previous contributions suggest.
[LG-18] Sequential Data Poisoning in LLM Post-Training
Link: https://arxiv.org/abs/2606.04929
Authors: Jack Sanderson, Yihan Wang, Xiaoqian Lu, Gautam Kamath, Yiwei Lu
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:LLM post-training proceeds through multiple stages, e.g., supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), where each stage draws data from different, potentially untrusted sources. Existing literature assumes data poisoning attacks may occur at each training stage, but neglects the possibility of multiple attackers. To study the trustworthiness of the entire post-training pipeline, we propose the threat model of sequential data poisoning, where multiple adversaries separately poison the SFT and preference datasets. Under this threat model, we identify the single-attacker illusion: each adversary, evaluated in isolation, appears to pose a negligible threat. Yet when adversaries collaborate across stages, the true vulnerability is revealed. In the SFT $\to$ DPO pipeline, their contributions are additive: splitting a fixed poison budget across stages outperforms concentrating it in either stage alone. In the SFT $\to$ PPO pipeline, their contributions are complementary: neither SFT nor reward model poisoning succeeds individually, yet their combination does. These findings show that security analyses of individual post-training stages systematically underestimate compound vulnerabilities that emerge only from their interaction. Code is available at this https URL.
[LG-19] Worker Utility as Hysteresis: A Preisach Model of Transaction Acceptance in Gig Labour Markets
Link: https://arxiv.org/abs/2606.04916
Authors: Piotr Frydrych
Categories: Machine Learning (cs.LG); General Economics (econ.GN); Machine Learning (stat.ML)
Comments: 18 pages, 5 figures
Abstract:Worker utility is not observed – only its consequence is. Each gig transaction produces a single bit: accepted or rejected. We argue this structure points directly to the Preisach hysteresis model as the natural representation of latent worker preferences. The Preisach operator models aggregate output as an integral over a population of binary threshold elements – precisely the structure that emerges when heterogeneous workers each carry a private acceptance wage. We estimate two latent utility surfaces: acceptance utility U_1(X) and rejection utility U_0(X), via a dual-output neural network (shared layers 256-128, margin loss enforcing U_1 = U_0). Classification reduces to the Preisach gap U_1(X) - U_0(X), passed into an XGBoost classifier alongside clip-stabilised price-to-threshold encodings. On 36,891 gig transactions, this pipeline achieves Jaccard = 0.827 and ROC AUC = 0.799. The price-to-threshold encoding accounts for +11.0 pp AUC over raw utility features. The model confirms the directional asymmetry hysteresis predicts: price decreases depress completion rates more than equivalent increases raise them. Applied to the full dataset, the model’s recommendations simultaneously reduce the total wage bill by 21.3% and increase expected fill rate by 9.7 pp. For 74.2% of transactions, P(accept) already exceeds 0.80; reducing the wage keeps it above threshold (mean post-cut P = 0.972), releasing cost savings (median 31%). For the remaining 25.4%, a median 7% wage increase recovers +43 pp acceptance. A model without an explicit indifference zone cannot execute both moves simultaneously.
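The abstract above describes the Preisach operator as an aggregate over a population of binary threshold elements. The sketch below illustrates that structure with three hypothetical relays; the class name `Relay`, the function `preisach_output`, and the threshold values are assumptions for illustration and are not taken from the paper.

```python
class Relay:
    """Binary hysteresis element: switches on at `up`, off at `down` (down <= up)."""

    def __init__(self, up, down, state=0):
        self.up = up
        self.down = down
        self.state = state

    def step(self, x):
        if x >= self.up:
            self.state = 1      # input high enough: switch on
        elif x <= self.down:
            self.state = 0      # input low enough: switch off
        # otherwise the relay keeps its previous state (memory)
        return self.state

def preisach_output(relays, inputs):
    """Average relay state after each input, i.e. a discrete Preisach sum."""
    outputs = []
    for x in inputs:
        outputs.append(sum(r.step(x) for r in relays) / len(relays))
    return outputs

# Illustrative thresholds only: think of `up` as the offer at which a worker
# starts accepting and `down` as the offer at which the same worker stops.
relays = [Relay(10, 6), Relay(12, 8), Relay(14, 10)]
trace = preisach_output(relays, [9, 11, 13, 15, 13, 11, 9])
```

In this toy run the aggregate at input 11 is one third on the way up but 1.0 on the way down, so the same input yields different outputs depending on history, which is the path dependence that the hysteresis framing relies on.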
[LG-20] Towards Pretraining Text Encoders for TabPFN
Link: https://arxiv.org/abs/2606.04876
Authors: Mustafa Tajjar, Alexander Pfefferle, Lennart Purucker, Frank Hutter
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Tabular foundation models, such as TabPFN, achieve strong performance on tabular datasets with numerical and categorical data, but do not natively handle high-cardinality text features. Standard pipelines, therefore, embed text with a language model and compress the resulting vectors with PCA into a small number of scalar features before inputting them into TabPFN. This creates an information bottleneck: most embedding dimensions are discarded, and the compressed representation must then be expanded again by TabPFN’s feature encoder. End-to-end alternatives can avoid PCA, but they require large amounts of pretraining data containing text cells and usually perform subpar compared to tabular foundation models that were pretrained on large amounts of synthetic data. Inspired by modality-alignment approaches like LLaVA (vision-to-LLM token projection) and TableGPT-style systems (table-to-LLM token projection), we introduce the TabPFN Text Adapter (text-to-TFM token projection). We freeze both the sentence encoder and TabPFN, and train only a lightweight adapter that maps text embeddings into a short sequence of tokens in TabPFN’s embedding space. This design removes the PCA bottleneck, preserves TabPFN’s numerical strengths, and is more efficient to train than end-to-end text-tabular pipelines.
[LG-21] Provably Reduced Sample Cost in Prior-Guided Hyperparameter Optimization
Link: https://arxiv.org/abs/2606.04866
Authors: Leona Hennig, Jasmin Brandt, Lukas Fehring, Barbara Hammer, Marius Lindauer, Marcel Wever
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Large-scale hyperparameter optimization (HPO) in automated machine learning (AutoML) consumes substantial computational resources, raising growing concerns about scalability and energy efficiency. Existing methods use prior information heuristically to accelerate both black-box and multi-fidelity settings, but they lack a characterization of how prior informativeness quantitatively reduces sample complexity. In this work, we provide the first distribution-dependent sample complexity bounds for multi-fidelity HPO with priors through the formal lens of fixed-budget best-arm identification. By modeling priors directly over arm means as configuration performance, we derive explicit, distribution-dependent error bounds that quantify the relationship between priors and evaluation budget. Our analysis shows that informative priors, which concentrate probability mass on near-optimal arms, yield reductions in the number of required evaluations, whereas baseline performance is recovered with uninformative or misleading priors. We conduct proof-of-concept experiments on a synthetic benchmark and on LCBench, a common multi-fidelity HPO benchmark for deep learning, to confirm our theoretical results, achieving up to 90% budget reduction while retaining solution quality. Together, our results provide a principled foundation for prior-guided and compute-efficient green AutoML.
[LG-22] Rethinking Incompleteness: Formalizing Protocol Divergence and Train-Once Learning for Robust IMVC
Link: https://arxiv.org/abs/2606.04857
Authors: Haolu Liu, Xiyue Wang, Xuanting Xie, Liangjian Wen, Zhao Kang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Standard incomplete multi-view clustering (IMVC) evaluation retrains separate models for different missing-data configurations. We show that this paradigm obscures a fundamental vulnerability: missing rate alone is insufficient to characterize data incompleteness. Specifically, we show that protocols with identical nominal missing rates can differ by up to $50\times$ in their proportion of fully observed samples, inducing drastically different learning regimes. We formalize this phenomenon as incompleteness divergence, providing measures that capture structural disparities across missing-data protocols. We further prove that for a broad class of reconstruction-based objectives, learning becomes structurally ill-posed when the proportion of complete samples falls below a critical threshold, leading to near-random performance. To bypass this theoretical bound, we propose CRAFT (Complete-data Robust Attention-masked Fusion Transformer). CRAFT shifts the burden of robustness from the loss function to the architecture via two key properties: (i) per-sample independence, which removes reliance on complete-sample co-occurrence, and (ii) mask-aware variable-length fusion, which aggregates only observed views through attention masking. This design allows a single model, trained once on complete data, to generalize to diverse missing patterns at inference time without retraining. Extensive experiments on seven benchmarks show that CRAFT matches or outperforms per-configuration baselines while reducing training overhead by $8.8\times$, demonstrating that robustness to missing data can be achieved as an inherent architectural property. Code (CRAFT) and our imvc-audit toolkit are available at this https URL and this https URL.
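The mask-aware fusion described in the abstract above can be illustrated with a small attention-pooling sketch in which only observed views take part in the softmax. This is a simplified stand-in, not the CRAFT architecture: the function name `masked_attention_fusion`, the single fixed `query` vector, and the plain dot-product scoring are assumptions made for the example.

```python
import math

def masked_attention_fusion(views, observed, query):
    """Fuse per-view embeddings with softmax attention over observed views only.

    views:    list of V embeddings, each a list of d floats
    observed: list of V booleans (False = view missing for this sample)
    query:    list of d floats used to score each view
    """
    kept = [view for view, seen in zip(views, observed) if seen]
    if not kept:
        raise ValueError("at least one view must be observed")
    scale = math.sqrt(len(query))
    scores = [sum(v * q for v, q in zip(view, query)) / scale for view in kept]
    top = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]    # missing views get no weight at all
    return [sum(w * view[j] for w, view in zip(weights, kept))
            for j in range(len(query))]
```

Because missing views are dropped before the softmax, whatever placeholder values they hold cannot influence the fused vector, and each sample is processed independently of which views other samples happen to have.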
[LG-23] Prediction Under Imperfect Compression: A Theory of Approximate MDL
Link: https://arxiv.org/abs/2606.04834
Authors: Qian Li, Xinyu Mao, Shang-Hua Teng, Guangxu Yang
Categories: Machine Learning (cs.LG)
Comments: 26 pages
Abstract:Minimum Description Length (MDL) formalizes the principle of Occam’s razor by optimizing the total description length: $L(\mathrm{model}) + L(\mathrm{data} \mid \mathrm{model})$. For sequential prediction, the MDL method repeatedly selects a model with a minimum objective score on the observed prefix for the next-step prediction. Classical MDL prediction theory shows that exact optimization of the MDL objective indeed provides a strong compression guarantee that supports reliable prediction. However, practical machine learning usually can only find models by approximately optimizing the objective function. To bridge this gap, this paper addresses the following fundamental question: Under what forms of approximation and regularization does approximate MDL still guarantee reliable sequential prediction? This work offers a principled characterization. We prove that for any approximation with additive slack $C$ of the more general balanced MDL objective $\lambda \cdot L(\mathrm{model}) + L(\mathrm{data} \mid \mathrm{model})$, the cumulative expected squared prediction error is finite for all $\lambda \ge 1$. The case $\lambda > 1$ is proved by an affinity-telescoping argument, while the boundary case $\lambda = 1$ is proved by a likelihood-ratio stopping argument based on exact static MDL bounds. Our results establish that classical MDL regularization remains robust to any fixed additive optimization error. Furthermore, we establish that our characterization of the approximate MDL framework is sharp: when $0 < \lambda < 1$, overfitting can incur infinite cumulative expected error in the universal class of estimable measures, and hence a strong form of model-complexity regularization is necessary. In addition, model selection may fail in every regularized regime $\lambda > 0$ under multiplicative approximation, and thus additive approximation is both sufficient and essential.
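As a toy illustration of the balanced objective in the abstract above, λ times the model code length plus the data code length under that model, the sketch below selects between two Bernoulli models for a binary sequence. The model names, their assigned code lengths in bits, and the helper names are hypothetical and chosen only to show how the two terms trade off.

```python
import math

def data_bits(data, p_one):
    """Code length in bits of a binary sequence under a Bernoulli(p_one) model."""
    return -sum(math.log2(p_one if x == 1 else 1.0 - p_one) for x in data)

def select_model(data, models, lam=1.0):
    """Return the name of the model minimising lam * L(model) + L(data | model)."""
    def score(model):
        _name, model_bits, p_one = model
        return lam * model_bits + data_bits(data, p_one)
    return min(models, key=score)[0]

# (name, assumed description length in bits, probability of observing a 1)
MODELS = [("fair", 1.0, 0.5), ("biased", 5.0, 0.9)]

short_choice = select_model([1, 1], MODELS)        # too little data to pay for "biased"
long_choice = select_model([1] * 20, MODELS)       # the data savings now outweigh model cost
heavy_choice = select_model([1] * 20, MODELS, lam=10.0)  # strong complexity penalty
```

With two observations the cheaper "fair" model wins, with twenty ones the "biased" model's shorter data code wins, and raising λ to 10 tips the choice back to "fair", which is the regularization knob the abstract analyses.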
[LG-24] Reconciling Causality and Non-Equilibrium Thermodynamics with Hamiltonian Causal Models
Link: https://arxiv.org/abs/2606.04822
Authors: Dario Rancati, Max Welling, Francesco Locatello
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Causal modeling of physical temporal phenomena must handle interventions that act along trajectories, nonstationary induced laws, path-dependent effects, and feedback mediated by dynamics, all challenging in standard causal models. We introduce Hamiltonian Causal Models (HCMs), a trajectory-level framework in which observed variables interact with local environments and interventions act as controls of Hamiltonian mechanisms. HCMs separate immutable equations of motion from intervenable mechanisms and define causal effects as discrepancies between interventional path laws. A key motivation for HCMs is their natural interface with non-equilibrium thermodynamics. Entropy production quantifies the irreversibility of a process and is a central causal observable: it is estimable from data and witnesses causal effects along the system’s evolution that are invisible to endpoint and cumulative versions of the standard average treatment effect. As in physics, cause and effect are not primitives of the relation between two random variables but arise from the non-invertibility of the thermodynamic arrow. With this, our paper reconciles the language of statistical causal models and non-stationary thermodynamics, offering new tools to describe causality in a wide range of physical systems.
[LG-25] The Right Measure for Physics-Constrained Generation: A Co-Area Correction for Posterior-Consistent PDE Inverse Problems
Link: https://arxiv.org/abs/2606.04804
Authors: Jian Xu,Delu Zeng,John Paisley,Qibin Zhao
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Generative models – diffusion and flow matching – are increasingly used to solve partial differential equation (PDE) inverse problems, enforcing the governing physics as a hard constraint (via projection or guidance) and reporting the resulting samples as a Bayesian posterior with calibrated uncertainty. We show that this widely adopted recipe samples the wrong distribution. Conditioning a generative prior on a hard PDE constraint is conditioning on a measure-zero manifold – an operation that is intrinsically ambiguous (the Borel–Kolmogorov paradox) and whose physically correct resolution, the small-residual-noise limit, carries a co-area (Fixman) Jacobian factor $[\det(JJ^\top)]^{-1/2}$ that projection- and guidance-based methods silently omit. We make the bias precise, show that it grows with the heterogeneity of the constraint sensitivity, and validate it on controlled problems against an i.i.d. ground-truth arbiter. The omitted factor is not a second-order detail: removing it inflates the posterior error to $20\times$ the sampling-noise floor; minimal-displacement projection (as in PCFM) is biased at $9\times$ the floor; and a naive scalar reweighting does not fix it. We introduce CoCoS, a measure-aware constrained sampler that targets the correct co-area posterior, and show that it matches the gold-standard posterior to within sampling noise. Our results imply that “satisfying the physics” is not the same as “sampling the posterior,” and give a principled correction for uncertainty-aware scientific inference.
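The co-area factor $[\det(JJ^\top)]^{-1/2}$ can be evaluated directly from a constraint Jacobian. The sketch below is only an illustration of that formula, not the CoCoS sampler. The helper names and the elliptical toy constraint $g(x) = x_1^2 + 4x_2^2 - 1$ are assumptions introduced here.

```python
def gram(J):
    """Return J J^T for a Jacobian given as a list of m rows of length n."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in J] for ri in J]

def det(A):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]
    n, d = len(A), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            d = -d
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d

def coarea_factor(J):
    """[det(J J^T)]^(-1/2); assumes J has full row rank so the determinant is positive."""
    return det(gram(J)) ** -0.5

# Toy constraint g(x) = x1^2 + 4*x2^2 - 1 has Jacobian [2*x1, 8*x2].
weight_a = coarea_factor([[2.0, 0.0]])  # on-manifold point (1, 0)
weight_b = coarea_factor([[0.0, 4.0]])  # on-manifold point (0, 0.5)
```

The two points both satisfy the constraint exactly, yet their weights differ by a factor of two. A sampler that only enforces the constraint treats them alike, which is the kind of heterogeneity in constraint sensitivity the abstract refers to.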
[LG-26] Uncertainty-Aware (Un)Supervised Few-Shot User Adaptation for On-Device Personalized Human Activity Recognition
Link: https://arxiv.org/abs/2606.04798
Authors: Maximilian Burzer,Till Riedel,Michael Beigl,Tobias Röddiger
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, 2 tables, 2 algorithms
Abstract:Sensor-based Human Activity Recognition (HAR) models often degrade on unseen users due to domain shifts caused by individual movement patterns and sensor placement. Practical wearable HAR systems therefore require personalization methods that are lightweight, applicable whether calibration data is labeled, unlabeled, or unavailable, and robust under limited calibration. We present a gradient-free framework that repurposes pretrained HAR classifiers as Prototypical Networks using prior prototypes, which preserve zero-shot performance and regularize adaptation. For labeled calibration, we introduce closed-form Bayesian prototype estimation and extend the same principle to unlabeled calibration. With only 3 seconds of calibration data per activity (one shot), supervised adaptation improves macro-F1 by +2.76 to +33.44 percentage points across four datasets, while unsupervised adaptation improves by +0.56 to +32.13 points. Since adaptation requires only closed-form prototype updates, the framework enables efficient and robust on-device personalization of preexisting HAR classifiers.
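A closed-form prototype update of the kind the abstract describes might look like the sketch below. It assumes a conjugate Gaussian-mean update in which the prior prototype counts as `kappa` pseudo-observations. That specific form, the parameter `kappa`, and the function names are assumptions made for illustration; the paper's actual estimator may differ.

```python
def posterior_prototype(prior, samples, kappa=1.0):
    """Shrink the calibration mean toward the prior prototype (no gradients needed).

    With no calibration samples the prior is returned unchanged, which keeps
    the zero-shot behaviour of the pretrained classifier.
    """
    n = len(samples)
    if n == 0:
        return list(prior)
    mean = [sum(col) / n for col in zip(*samples)]
    return [(kappa * p + n * m) / (kappa + n) for p, m in zip(prior, mean)]

def nearest_prototype(x, prototypes):
    """Classify an embedding by squared Euclidean distance to each class prototype."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(x, prototypes[label]))

# One-shot adaptation of a hypothetical "walk" prototype in a 2-D embedding space.
adapted = posterior_prototype(prior=[0.0, 0.0], samples=[[2.0, 2.0]], kappa=1.0)
label = nearest_prototype([0.9, 0.9], {"walk": adapted, "sit": [0.0, 0.0]})
```

Larger `kappa` values keep the adapted prototype closer to the pretrained one, which is one way a prior can regularize adaptation when only a few seconds of calibration data are available.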
[LG-27] UniFair: A unified fair clustering approach based on separation and compactness
Link: https://arxiv.org/abs/2606.04777
Authors: Antonia Karra,Vasiliki Papanikou,Georgios Vardakas,Evaggelia Pitoura,Aristidis Likas
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 6 Figures
Abstract:Clustering is increasingly used to support high-impact decisions, yet standard objectives such as $k$-means can produce clusterings that treat demographic groups unequally. Existing fair clustering methods typically optimize a single notion of fairness and often overlook how clustering costs interact with the geometry of the induced decision boundaries. We propose UniFair, a unified framework that jointly optimizes separation fairness and social fairness. Separation fairness encourages protected groups to lie farther from the induced decision boundaries, while social fairness reduces disparities in within-cluster distortion by penalizing group-wise clustering costs. We develop gradient-based optimization procedures for separation-fair and unified $k$-means objectives, and extend them to deep clustering by enforcing the same criteria in the latent space of an autoencoder. Experiments on tabular and image datasets show that UniFair reduces both boundary-related and cost-based group disparities with only a modest increase in clustering loss.
[LG-28] Beyond Structural Symmetries: Linear Mode Connectivity via Neuron Identifiability ICML2026
Link: https://arxiv.org/abs/2606.04754
Authors: Vincent Bürgin,Daniel Herbst,Ya-Wei Eileen Lin,Stefanie Jegelka
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML 2026
Abstract:Many striking phenomena in deep learning, such as linear mode connectivity and the structured behavior of training dynamics, are closely tied to parameter symmetries: transformations that leave the realized function unchanged. Despite growing attention to parameter symmetries, the exact interplay between parameters, data, and representations remains underexplored. To investigate this, we develop a theoretical framework of effective function classes, i.e., the set of functions a neuron can realize on its input support, and the norm cost of realizing them. We then formalize effective symmetry breaking via neuron identifiability across independent training runs. Our analysis shows that neural networks can admit large families of approximately equivalent solutions even in structurally asymmetric models. We further show that neuron identifiability enables representation merging without prior alignment, and characterize when such merging admits a linear low-loss path. These findings highlight the role of effective function classes in affecting the loss landscape.
[LG-29] COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky-Ordered Projection
Link: https://arxiv.org/abs/2606.04749
Authors: Guopeng Li,Moritz A. Zanger,Matthijs T. J. Spaan,Julian F. P. Kooij
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 7 pages, 6 figures, 2 tables
Abstract:Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled independently for each objective. This objective-wise treatment neglects inter-objective correlation and can lead to overly conservative value estimates, thereby reducing sample efficiency. To address this issue, we propose Cholesky-Ordered Projection Q-learning (COP-Q), a safety-first method that incorporates inter-objective covariance into vector-valued Q-value estimation. COP-Q constructs a generalized confidence bound in the joint Q-value space and uses Cholesky factorization to encode objective priority in a sequential form. This preserves conservatism on safety while adaptively reducing excessive conservatism on the reward objective. The resulting estimate is used in both temporal-difference target computation and actor optimization. COP-Q incurs minimal computational overhead and is readily compatible with most existing deep Q-learning frameworks. Experiments on robot locomotion in Brax and safe navigation in Safety-Gymnasium, covering both hard- and soft-safety settings, demonstrate that COP-Q achieves strong safety performance together with competitive or improved sample efficiency relative to representative baselines.
[LG-30] Contrastive Learning and Correlation Clustering for Sequences of Network Telescope Data
Link: https://arxiv.org/abs/2606.04733
Authors: Jannik Presberger,Alexander Männel,Maynard Koch,Thomas C. Schmidt,Matthias Wählisch,Bjoern Andres
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Code: this https URL
Abstract:Understanding activities of Internet scanners is challenging; it often requires identifying relationships between sources, a task for which semantic annotations are scarce. This work investigates whether semantically meaningful pairwise relationships between sequences of network flow records can be estimated by contrastive learning, without pretraining and without annotations. To this end, we propose a transformer model that embeds minimally preprocessed sequences of network flow records and train it using contrastive learning. With the similarities obtained from this model, we state a correlation clustering problem and solve it locally. Experimentally, we show: Learned similarities are higher on average for sequences originating from the same source than for sequences originating from different sources, and this property generalizes to unseen sequences of unseen sources. Moreover, correlation clustering yields clusters consistent with scanner labels. The complete source code of the algorithms and for reproducing the experiments is publicly available.
[LG-31] Cone-Compatible Monge Geometry for High-Dimensional Ordered Optimal Transport
Link: https://arxiv.org/abs/2606.04695
Authors: Lei Luo,Hongliang Zhang,Jian Yang
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 2 figures, including appendices
Abstract:High-dimensional optimal transport is seldom available in closed form. The one-dimensional case is exceptional because the order of the real line is compatible with convex transport costs, making monotone rearrangement optimal. This paper studies when an analogous Monge structure can be recovered in higher dimensions from a partial order. We introduce a cone-compatible Monge geometry: a closed convex cone $K$ induces the order $x \preceq_K y$ whenever $y - x \in K$, and is compatible with a cost if ordered pairs satisfy a Monge exchange inequality. For squared Mahalanobis costs $c_M(x,y) = (x-y)^\top M (x-y)$, we prove a sharp characterization: compatibility holds exactly when $K$ is acute under the $M$-inner product, namely $u^\top M v \ge 0$ for all $u, v \in K$, equivalently $K \subseteq K_M^*$. Under this condition, measures supported on cone chains admit a quantile-type closed-form optimal coupling, yielding exact transport under the original ground cost rather than after projection or metric replacement. We distinguish the resulting cone-chain Wasserstein metric on canonically ordered chain distributions from an extended directed cone transport cost on general measures, and develop feasibility, duality, stability, approximation, Gaussian recovery, statistical, and computational results. The theory is complementary to sliced and tree Wasserstein distances: it is not a universal fast surrogate, but a way to obtain interpretable, direction-valid, original-space monotone transport for ordered high-dimensional data.
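For a finitely generated cone, the acuteness condition $u^\top M v \ge 0$ for all $u, v \in K$ reduces to a check over pairs of generators, because every element of the cone is a nonnegative combination of generators and the form is bilinear. The sketch below illustrates only that check. Restricting to polyhedral cones and the function names are assumptions made here.

```python
def m_inner(u, M, v):
    """The bilinear form u^T M v for vectors and a matrix given as nested lists."""
    return sum(u[i] * M[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def is_acute_cone(generators, M, tol=1e-12):
    """True if u^T M v >= 0 holds for every pair of generators of the cone."""
    return all(m_inner(g, M, h) >= -tol for g in generators for h in generators)

orthant = [[1.0, 0.0], [0.0, 1.0]]            # generators of the nonnegative quadrant
identity = [[1.0, 0.0], [0.0, 1.0]]           # Euclidean ground cost
correlated = [[1.0, -0.5], [-0.5, 1.0]]       # Mahalanobis matrix with negative coupling

acute_euclidean = is_acute_cone(orthant, identity)
acute_correlated = is_acute_cone(orthant, correlated)
```

The same cone passes under the Euclidean cost and fails under the correlated one, so whether an order is compatible depends on the ground cost as well as on the cone.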
[LG-32] Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers
Link: https://arxiv.org/abs/2606.04678
Authors: Yacouba Kaloga,Shashi Kumar,Shakeel A. Sheikh,Driss Khalil,Petr Motlicek,Ina Kodrasi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:End-to-end ASR systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. A natural approach is to reuse a shared Transformer block recurrently, but we find that naive looping does not fully exploit additional recurrent compute. We introduce LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a controllable test-time compute axis. LARM combines sparse CTC checkpoints, supervision-clock embeddings, FiLM depth conditioning, and delayed soft-posterior feedback. These components structure the loop into recognition checkpoints separated by latent refinement phases and allow shared weights to specialize across recurrent steps. On LibriSpeech, LARM improves WER as the number of inference loops increases and achieves performance competitive with deeper unshared-parameter baselines. Our results show that test-time compute scaling can extend beyond autoregressive language-model reasoning to continuous non-autoregressive speech recognition.
[LG-33] Fitting scattered data with optional monotonicity constraints on GPU: LipFit package
Link: https://arxiv.org/abs/2606.04670
Authors: Gleb Beliakov
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Mathematical Software (cs.MS)
Comments:
Abstract:This paper presents a method of multivariate scattered data interpolation and approximation that produces an optimal Lipschitz-continuous approximation, subject to the desired monotonicity constraints. This method relies on tight upper and lower approximations to the data, and is similar in spirit to the nearest-neighbour approximation but does not suffer from discontinuities. Local Lipschitz interpolation and Lipschitz smoothing are also presented. This approach falls under the umbrella of instance-based approximation with no training phase, and it is suitable for GPU-based parallelisation. A GPU-friendly Python package, LipFit, which implements these methods, is also described.
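The "tight upper and lower approximations" can be illustrated with the classical central Lipschitz interpolant. The sketch below is plain CPU Python and does not use the LipFit package or its API. The function name, the Euclidean metric, and the omission of monotonicity constraints are simplifying assumptions made here.

```python
import math

def lipschitz_interpolant(x, points, values, L):
    """Midpoint of the tightest upper and lower L-Lipschitz bounds through the data.

    upper(x) = min_i (y_i + L * |x - x_i|), lower(x) = max_i (y_i - L * |x - x_i|).
    No training phase: every query is answered directly from the stored instances.
    """
    dists = [math.dist(x, p) for p in points]
    upper = min(y + L * d for y, d in zip(values, dists))
    lower = max(y - L * d for y, d in zip(values, dists))
    return 0.5 * (upper + lower)

points = [(0.0,), (1.0,)]
values = [0.0, 1.0]
at_data_point = lipschitz_interpolant((0.0,), points, values, L=1.0)
at_midpoint = lipschitz_interpolant((0.5,), points, values, L=2.0)
```

Each query is an independent min and max over the data points, which is why this kind of scheme parallelises naturally across queries.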
[LG-34] Towards Accurate Model Selection in Deep Unsupervised Domain Adaptation
Link: https://arxiv.org/abs/2606.04665
Authors: Kaichao You,Ximei Wang,Mingsheng Long,Michael I. Jordan
Categories: Machine Learning (cs.LG)
Comments: upload to arXiv for record
Abstract:Deep unsupervised domain adaptation (Deep UDA) methods successfully leverage rich labeled data in a source domain to boost the performance on related but unlabeled data in a target domain. However, algorithm comparison is cumbersome in Deep UDA due to the absence of an accurate and standardized model selection method, posing an obstacle to further advances in the field. Existing model selection methods for Deep UDA are either highly biased, restricted, unstable, or even controversial (requiring labeled target data). To this end, we propose Deep Embedded Validation (DEV), which embeds the adapted feature representation into the validation procedure to obtain an unbiased estimate of the target risk with bounded variance. The variance is further reduced by the control variate technique. The efficacy of the method has been justified both theoretically and empirically.
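The combination of importance weighting and a control variate can be sketched as follows. The density ratios are taken as given (estimating them in the adapted feature space is not shown), and the exact form of the estimator is an assumption based on the standard control-variate construction with the weight as the control, since importance weights have expectation one. It is not a transcription of the paper's algorithm.

```python
def dev_risk(weights, losses):
    """Importance-weighted validation risk with the weight itself as control variate.

    weights: density ratios target/source on source validation samples (assumed given).
    losses:  per-sample validation losses of the model being selected.
    """
    n = len(weights)
    wl = [w * l for w, l in zip(weights, losses)]
    mean_wl = sum(wl) / n
    mean_w = sum(weights) / n
    cov = sum((a - mean_wl) * (w - mean_w) for a, w in zip(wl, weights)) / n
    var = sum((w - mean_w) ** 2 for w in weights) / n
    eta = -cov / var if var > 0 else 0.0   # variance-minimising coefficient
    return mean_wl + eta * mean_w - eta    # equals mean_wl + eta * (mean_w - 1)

# Constant loss 1.0 with noisy weights whose sample mean is 1.5 instead of 1.
plain_estimate = sum(w * l for w, l in zip([1.0, 2.0], [1.0, 1.0])) / 2
corrected_estimate = dev_risk([1.0, 2.0], [1.0, 1.0])
```

In the toy case the plain importance-weighted average is inflated by the noisy weights, while the control-variate term recovers the constant loss of 1.0.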
[LG-35] U-Net-Accelerated Quality-Diversity Optimization for Climate-Adaptive Urban Layouts
Link: https://arxiv.org/abs/2606.04658
Authors: Alexander Hagg,Tania Guerrero,Dirk Reith
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:
Abstract:Optimizing urban layouts for climate adaptation requires balancing building density with cold-air ventilation. Because physics-based climate simulations are computationally expensive, planners typically evaluate fewer than ten manual designs. Quality-diversity (QD) algorithms offer a way to systematically illuminate the design space, but they require surrogate models to be practical. In this paper, we replace a slow, regulatory physics simulator with a spatial deep-learning surrogate (U-Net) inside an offline MAP-Elites loop. We systematically compare this spatial approach with a traditional Gaussian process (GP) surrogate across different training-data strategies (quasi-random Sobol sampling vs. active QD bootstrapping). Our results reveal that scalar GP surrogates fail catastrophically when trained on random samples, requiring expensive, actively generated QD archives to generalize. In contrast, the spatial inductive bias of the U-Net allows it to learn the underlying physics mapping robustly ($R^2 = 0.996$), completely independent of the training data source. This allows offline QD optimization to achieve highly accurate fitness rankings ($\rho = 0.994$) using only a one-time batch of random training samples. The resulting pipeline, deployed in the open-source OpenSKIZZE tool, generates thousands of diverse, climate-evaluated building layouts in under ten minutes.
[LG-36] ALINC: Active Learning for Inductive Node Classification via Graph Sampling KDD2026 ECML
Link: https://arxiv.org/abs/2606.04647
Authors: Pascal Plettenberg,Denis Huseljic,André Alcalde,Bernhard Sick,Josephine M. Thomas
Categories: Machine Learning (cs.LG)
Comments: Accepted at ECML PKDD 2026
Abstract:Active learning (AL) for node classification typically focuses on selecting the most informative nodes for annotation within one or a few large graphs (e.g., in social network analysis). However, in other domains, such as molecular chemistry or electronic design automation, datasets consist of thousands of independent graphs. In many of these inductive settings, annotating an individual node requires a full-graph analysis, which effectively yields the remaining node labels on-the-fly. Therefore, these scenarios require AL strategies that select entire graphs instead of single nodes, a problem which has not been tackled in the literature so far. Thus, we introduce ALINC, an AL framework for inductive node classification via graph sampling. It bridges the existing methodological gap by elevating node-level utility measures to graph-level selection criteria through various aggregation mechanisms. In an extensive benchmark including ten strategies, three aggregation methods, and four datasets, we identify CoreSet, TypiClust, and BADGE as the top-performing graph sampling strategies. Our detailed analysis further reveals that the choice of the aggregation method is pivotal, as it substantially affects model performance and annotation costs. Finally, we demonstrate the effectiveness of ALINC in two use case studies: site-of-metabolism prediction in molecules and design automation of printed circuit board schematics.
[LG-37] Explainably Safe Reinforcement Learning
Link: https://arxiv.org/abs/2606.04634
Authors: Sabine Rieder,Stefan Pranger,Debraj Chakraborty,Jan Křetínský,Bettina Könighofer
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly opaque. Shielding is a prominent model-based technique for enforcing safety in reinforcement learning. However, because shields are automatically synthesized using rigorous formal methods, their decisions are often similarly difficult for humans to interpret. Recently, decision trees became customary to represent controllers and policies. However, since shields are inherently non-deterministic, their decision tree representations become too large to be explainable in practice. To address this challenge, we propose a novel approach for explainable safe RL that enhances trust by providing human-interpretable explanations of the shield’s decisions. Our method represents the shielding policy as a hierarchy of decision trees, offering top-down, case-based explanations. At design time, we use a world model to analyze the safety risks of executing actions in given states. Based on this analysis, we construct both the shield and a high-level decision tree that classifies states into risk categories (safe, critical, dangerous, unsafe), explaining why a situation may be safety-critical. At runtime, we generate localized decision trees that explain which actions are allowed and why others are deemed unsafe. Our method facilitates explainability of the safety aspect in safe-by-shielding reinforcement learning, requires no additional information beyond what is already used for shielding, incurs minimal overhead, and integrates readily into existing shielded RL pipelines. In our experiments, we compute explanations using decision trees that are several orders of magnitude smaller than the original shield.
[LG-38] Learning symplectic model reduction based on an approximation theorem of symplectic embeddings
Link: https://arxiv.org/abs/2606.04623
Authors: Liyi Feng,Yifa Tang,Yulin Xie,Ruili Zhang,Aiqing Zhu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:High-dimensional Hamiltonian systems play a central role in many scientific and engineering disciplines, with dynamics evolving on symplectic manifolds. Although deep learning provides powerful tools for constructing low-dimensional surrogates from data, the intrinsic symplectic structure is easily destroyed during model reduction. As a result, a standard autoencoder may produce latent coordinates that do not support a Hamiltonian flow, leading to unstable long-time prediction. In this paper, we first establish a universal approximation theorem for symplectic embeddings. Based on this theory, we propose symplecticity-preserving autoencoders (SpAE), in which the decoder is parameterized as a symplectic embedding and the encoder is constructed as the corresponding symplectic projection. This architecture is expressive enough to approximate nonlinear symplectic embeddings and the associated symplectic projections, preserves the symplectic structure exactly by construction, and can be trained by standard unconstrained optimization, thereby improving both reconstruction and prediction accuracy. Extensive experiments on high-dimensional lattice and particle systems demonstrate the effectiveness of the proposed method.
[LG-39] HalfNet: Randomized Neural Networks with Learned Subspace Geometry
Link: https://arxiv.org/abs/2606.04583
Authors: Ethem Alpaydin
Categories: Machine Learning (cs.LG)
Comments: 6 pages (+2 pages of appendix), 6 figures
Abstract:Many researchers have investigated neural networks with some of their weights fixed to values randomly drawn from a given distribution, e.g., $N(0, I)$. Our proposed HalfNet draws random weights from $N(0, \Sigma)$, where $\Sigma$, which defines the geometry of the distribution, has a low-rank factorization that we learn from data. Experiments on MNIST and CIFAR-10 demonstrate that HalfNet can match the performance of fully trained multilayer perceptrons while using substantially fewer parameters. Spectral analysis indicates that much of the predictive power of neural networks lies in the geometry of their weight space rather than in the precise values of individual parameters, and we observe that accuracy scales smoothly with rank. HalfNet is not a neural architecture trick for low-rank structure; it implements a data-dependent random embedding that can also be interpreted through supervised metric learning, or random-feature and kernel perspectives.
[LG-40] Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning
Link: https://arxiv.org/abs/2606.04574
Authors: Damian Lebiedź,Robert Ślepaczuk
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR); Machine Learning (stat.ML)
Comments: 61 pages, 37 figures, 16 tables
Abstract:This study aims to determine whether the application of Deep Reinforcement Learning (DRL) as a specialized execution overlay can enhance pair trading in highly volatile cryptocurrency markets. Although classical implementations of the strategy have proven successful in traditional equities, they frequently exhibit rigidity and suffer from severe divergence risks when applied to high-variance environments. To address this need, this research introduces novel concepts. To construct a robust system, we developed a hierarchical “Filter-then-Rank” pair selection methodology and a proprietary “Fixed Risk, Adaptive Mean” execution model. The system employs a Proximal Policy Optimization (PPO) agent with a Long Short-Term Memory (LSTM) layer to govern execution decisions within strict deterministic risk management boundaries. Evaluated on 1-hour interval data from the Binance USD-M Futures market, the optimized RL policy achieved an out-of-sample performance that substantially outperformed the heuristic baseline. A stationary circular block bootstrap robustness check confirms that the agent’s risk-adjusted outperformance is statistically significant at the 10 percent level. Although falling marginally short of the stricter 5 percent threshold, this result highlights the extreme idiosyncratic variance characteristic of digital assets. Ultimately, this thesis contributes to the quantitative finance literature by introducing a hybrid architecture that combines statistical arbitrage with DRL execution policies. Furthermore, it delivers a novel framework for safe reinforcement learning via deterministic shielding, proving that anchoring a neural policy to statistically robust boundaries successfully mitigates severe divergence risks.
[LG-41] SurvPFN: Towards Foundation Models for Survival Predictions ICML
Link: https://arxiv.org/abs/2606.04564
Authors: Samuel Böhm(1),Lennart Purucker(2,3),Frank Hutter(2,3),Pascal Schlosser(1,4,5) ((1) Institute of Epidemiology and Prevention, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany, (2) Department of Computer Science, University of Freiburg, Freiburg, Germany, (3) PriorLabs, Freiburg, Germany, (4) Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, US, (5) CIBSS - Centre for Integrative Biological Signalling Studies, University of Freiburg, Freiburg, Germany)
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 1 figure. Accepted to “Foundation Models for Structured Data” Workshop at the International Conference on Machine Learning (ICML) 2026
Abstract:Tabular foundation models (TFMs) have made rapid progress in standard classification and regression, but time-to-event survival prediction tasks have remained largely untouched. Unlike in standard regression tasks, survival prediction models must account for censored data. Standard TFMs cannot natively handle censored data, leading to biased and inaccurate predictions and making them unsuitable for real-world applications. To overcome this fundamental limitation, we propose SurvPFN, a prior-data fitted network (PFN) for survival prediction tasks. We pretrain SurvPFN on millions of synthetic survival prediction tasks to learn survival via distributional regression that accounts for censored data. SurvPFN works by (1) generating data with Weibull event times and a non-informative censoring mechanism; (2) integrating a censored event indicator; and (3) minimizing a censored negative log-likelihood. On SurvSet, a collection of real-world survival tasks, SurvPFN is highly competitive with classical and deep survival baselines without per-dataset fitting, a survival-specific architecture, or feature engineering. We show that survival can be treated as a continuous-time distributional regression problem with censored loss, unlocking the power of PFNs for time-to-event predictions.
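A censored negative log-likelihood treats the two kinds of sample differently: observed events contribute the log-density at their event time, and censored samples contribute the log-survival probability at their censoring time. The sketch below writes this out for a plain parametric Weibull distribution. It illustrates the loss only, not SurvPFN's network or its output distribution, and the function name and parameterisation (`shape`, `scale`) are assumptions made here.

```python
import math

def weibull_censored_nll(times, events, shape, scale):
    """Sum over samples of -log f(t) if the event was observed, else -log S(t).

    Weibull: S(t) = exp(-(t/scale)^shape),
             f(t) = (shape/scale) * (t/scale)^(shape-1) * S(t).
    """
    nll = 0.0
    for t, observed in zip(times, events):
        z = (t / scale) ** shape
        log_survival = -z
        if observed:
            log_density = (math.log(shape / scale)
                           + (shape - 1.0) * math.log(t / scale)
                           + log_survival)
            nll -= log_density
        else:
            nll -= log_survival
    return nll

# One observed event and one censored subject, both at t = 2.
loss = weibull_censored_nll([2.0, 2.0], [1, 0], shape=1.0, scale=1.0)
```

With `shape=1` and `scale=1` the Weibull reduces to a unit-rate exponential, so both terms equal 2 and the total loss is 4.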
[LG-42] Modeling and Interpreting Teamwork Dynamics in Cancer Care Outcome Prediction
Link: https://arxiv.org/abs/2606.04499
Authors: Yuhua Huang,Hsiao-Ying Lu,Kwan-Liu Ma
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments:
Abstract:Cancer care requires a longitudinal approach in which treatments are planned and delivered over time according to the needs of each individual patient. While prior research has thoroughly explored how clinical and demographic factors, such as comorbidities and age, inform treatment planning, far less attention has been devoted to the delivery phase of care. Yet planning and delivery are both team-based processes that depend on coordinated efforts among multiple healthcare professionals (HCPs). As such, the human factors embedded in these collaborative practices are crucial to optimizing patient outcomes. Despite this importance, the existing literature on human factors in cancer care is limited, and very few studies have investigated how collaboration within care teams evolves over the course of treatment. To fill this gap, this work examines how HCPs’ collaboration, captured through electronic health record (EHR) systems, affects cancer patient outcomes, with particular emphasis on teamwork dynamics. We represent EHR-mediated HCP interactions as networks and apply machine learning methods to identify predictive signals of patient survival embedded in these collaborative structures. We further interpret model predictions by pinpointing network characteristics and dynamic patterns associated with particular outcomes. We evaluate our model through robustness analyses to ensure that the findings are stable and not driven by stochastic variation in training. Additionally, our insights align with hypotheses proposed in the medical literature, and our results provide empirical, data-driven evidence supporting these claims. Overall, our work contributes a practical workflow for leveraging digital traces of collaboration to evaluate and strengthen longitudinal team-based healthcare, offering actionable insights to guide data-informed interventions in healthcare delivery.
[LG-43] Episodic Memory Temporal Consistency for Cooperative Multi-Agent Reinforcement Learning
Link: https://arxiv.org/abs/2606.04492
Authors: Zicheng Zhao,Yu Lan,Chengzhengxu Li,Zhaohan Zhang,Xiaoming Liu
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: Under Review
Abstract:Cooperative Multi-Agent Reinforcement Learning (MARL) frequently suffers from severe reward sparsity and exploration bottlenecks. While episodic memory mechanisms mitigate these issues by reusing high-return trajectories, they often trap agents in local optima due to unconstrained incentive distribution and semantic representation collapse. To address this, we propose Episodic Memory Temporal Consistency (EMTC), a framework that robustly constructs and selectively leverages historical experiences. EMTC introduces two synergistic components: (1) a Temporally Consistent Semantic Embedder that integrates contrastive learning with time-conditioned state reconstruction, preventing representation collapse and enabling precise memory retrieval; and (2) a Temporal Consistency Gating Mechanism that dynamically modulates episodic incentives based on temporal consistency error. This adaptive gate filters misleading signals from pseudo-successful trajectories, effectively mitigating Q-value overestimation. We provide theoretical guarantees, establishing a strict error bound that directly links the observable temporal consistency error to the underlying trajectory optimality and representation quality. Extensive evaluations on the SMAC and GRF benchmarks demonstrate that EMTC consistently outperforms state-of-the-art baselines. Notably, compared to the strongest episodic baseline, EMTC achieves absolute win-rate improvements of up to 24% in super-hard SMAC scenarios and an average improvement of 28% across GRF tasks.
[LG-44] LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models ICML2026
Link: https://arxiv.org/abs/2606.04485
Authors: Yuanrui Wang,Xingxuan Zhang,Han Yu,Mingchao Ming,Gang Ren,Hao Yuan,Li Mao,Yunjia Zhang,Chun Yuan,Peng Cui
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICML 2026
Abstract:Tabular foundation models (TFMs) increasingly rival tree ensembles, but their performance is often compute-inefficient: with standard affine scalar tokenization, each feature injects value variation through an essentially one-dimensional channel, and feature IDs/positional signals cannot increase within-feature value degrees of freedom, yielding weak early-layer value sensitivity and redundant hidden states. We present a unified tokenize-and-route framework for strong TFMs: RaBEL expands each scalar into compact localized RBF features (optionally exponent-gated) to improve conditioning and shallow-layer effective rank, while a reordered bidirectional block $S \rightarrow N \rightarrow F$ aligns computation with the readout by aggregating cross-sample context before feature mixing and using attention pooling. Together, these changes yield LimiX-2M, a 2M-parameter model that outperforms larger TabPFN-v2 and TabICL baselines on widely used tabular benchmarks while reducing training and inference costs. These results highlight value-aware tokenization and readout-aligned routing as key levers for improving the accuracy–efficiency trade-off in TFMs. Model checkpoints and inference code are available at this https URL.
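The contrast between affine scalar tokenization and a localized RBF expansion can be illustrated with a few lines of code. This is a generic Gaussian RBF expansion, not the RaBEL module itself. The centers, the shared width, the function names, and the omission of exponent gating are assumptions made for illustration.

```python
import math

def affine_token(value, weight, bias):
    """Affine tokenization: every output coordinate is linear in the scalar value."""
    return [w * value + b for w, b in zip(weight, bias)]

def rbf_expand(value, centers, width):
    """Localized features: each coordinate responds only to values near its center."""
    return [math.exp(-((value - c) ** 2) / (2.0 * width ** 2)) for c in centers]

centers = [-1.0, 0.0, 1.0]            # hypothetical grid over a standardised feature
low = rbf_expand(-1.0, centers, width=0.5)
high = rbf_expand(1.0, centers, width=0.5)
```

Under the affine map, the tokens for any two values differ only along the single direction `weight`. Under the RBF expansion, different value ranges activate different coordinates, which is one way to give a feature more than one degree of freedom in its value.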
[LG-45] When Both Layers Learn: Training Dynamics of Representing Linear Models via ReLU Networks COLT
Link: https://arxiv.org/abs/2606.04476
Authors: Berk Tinaz,Changzhi Xie,Mahdi Soltanolkotabi
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 47 pages, 8 figures, published at the 39th Annual Conference on Learning Theory (COLT), 2026
Abstract:In this paper, we study the gradient descent dynamics for jointly training both layers of a one-hidden-layer ReLU network to fit a linear target function. Concretely, we consider a realizable setting where inputs are drawn i.i.d. from a Gaussian distribution and labels follow a planted linear model. This stylized framework captures salient features of end-to-end training in inverse problems and certain auto-encoder models. Despite its apparent simplicity, the dynamics remain poorly understood, in part because the loss landscape contains multiple non-strict saddle points, making it unclear why gradient descent from random initialization reliably escapes bad stationary regions. We provide a detailed characterization of the optimization landscape and prove that gradient descent from a moderately small random initialization (simultaneously training both layers) converges to a global minimizer at a linear rate with order-wise optimal sample complexity. Our analysis tracks the trajectory through three phases: an alignment phase in which hidden weights progressively align with the planted direction while the output weights maintain the correct sign pattern; a growth phase in which the norms of both layers increase while preserving alignment; and a local refinement phase in which the aligned neurons rapidly converge to the planted direction, yielding fast local convergence. To rigorously show that GD avoids non-strict saddles, we develop trajectory-level control arguments for the end-to-end dynamics. In addition, we establish novel uniform concentration results that hold along the entire trajectory, and are essential for obtaining order-wise optimal sample complexity. We corroborate our theory with extensive experiments across a range of configurations.
[LG-46] On Out-of-sample Embedding in UMAP
Link: https://arxiv.org/abs/2606.04451
Authors: Mohammad Tariqul Islam,Jason W. Fleischer
Subjects: Machine Learning (cs.LG)
Comments: 22 pages, 16 figures
Abstract:Neighbor embedding algorithms reveal correlations in high-dimensional data by constructing an equivalent graph representation in a lower-dimensional space. An increasingly popular algorithm is Uniform Manifold Approximation and Projection (UMAP), which uses algebraic topology to map distances between the two spaces. While it works well on many types of data sets, UMAP has trouble adding out-of-sample points to a pre-existing mapping. In particular, UMAP often places new points on the periphery of the found clusters, rather than in their interiors with their correlated neighbors. Here, we overcome this "repulsion effect" by optimizing pairwise interactions within the original k-nearest-neighbor graph. Moreover, we show that parameterizing UMAP obtains better embeddings than non-parametric algorithms, particularly as the data gets more complex (e.g., medical images). We also show that the repulsion effect is naturally mitigated when a parameterized UMAP is employed to embed the data. We characterize different UMAP approaches using trustworthiness, nearest neighbor classifiers, and by analyzing attractive and repulsive forces in the embeddings.
[LG-47] D2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models
Link: https://arxiv.org/abs/2606.04446
Authors: Liyuan Zhang,Jiarui Zhang,Jinwei Yao,Ran Yan,Yuchen Yang,Jiahao Zhang,Tongkai Yang,Yi Wu,Binhang Yuan
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of tokens in parallel but usually commit to a single draft sequence per verification: once the first mismatch occurs, all subsequent draft tokens are discarded, resulting in a limited acceptance rate. Naively batching more draft candidate sequences only introduces a marginal improvement, as redundant or poorly placed branches increase the cost of drafting and verification without proportionally increasing the number of accepted tokens. We propose D^2SD, a dual diffusion draft speculative decoding framework that organizes candidates into a confidence-guided prefix tree, where the first diffusion drafter generates a block along with per-position confidence scores that are used to identify the most likely rejection boundary and select the top-K prefix ranges for recovery; the second variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass; the resulting shared-prefix candidates are jointly verified via cascade attention. Empirically, D^2SD shows clear improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines.
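The "first mismatch discards everything after it" behaviour, and why a second branch sharing the same prefix can recover tokens, can be sketched as follows. This is a simplified greedy exact-match check, not the paper's cascade-attention verification: the token IDs are hypothetical, and `target` stands in for what the target model would emit along the matching prefix.

```python
def accepted_prefix_length(draft, target):
    """Count draft tokens accepted before the first mismatch with the target."""
    accepted = 0
    for draft_token, target_token in zip(draft, target):
        if draft_token != target_token:
            break  # every later draft token is discarded
        accepted += 1
    return accepted

def best_candidate(candidates, target):
    """Return (index, accepted_length) of the candidate with the longest accepted prefix."""
    lengths = [accepted_prefix_length(c, target) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: lengths[i])
    return best, lengths[best]

# Hypothetical token IDs (assumption, for illustration only).
target = [11, 12, 13, 14, 15]
single_draft = [11, 12, 99, 14, 15]            # mismatch at position 2
branches = [single_draft, [11, 12, 13, 14, 77]]  # second branch re-anchors after the shared prefix [11, 12]

print(accepted_prefix_length(single_draft, target))
print(best_candidate(branches, target))
```

In the single-draft case the correct tokens at positions 3 and 4 are wasted; a well-placed alternative continuation from the shared prefix raises the accepted length without changing the target sequence.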
[LG-48] The price of multi-group transductive learning
Link: https://arxiv.org/abs/2606.04423
Authors: Noah Bergam,Samuel Deng,Daniel Hsu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We show that every multi-group learner in the transductive setting may incur a multiplicative penalty in its error rate on some group relative to the error rate achievable in the single-group setting, and that the penalty can increase linearly with the number of groups, up to roughly the square root of the sample size. This stands in stark contrast to optimal multi-group learners in an analogous (group-realizable) statistical setting, where the penalty is always at most logarithmic in the sample size and independent of the number of groups.
[LG-49] Loss-Conditional PINNs for Parametric PDE Families
Link: https://arxiv.org/abs/2606.04420
Authors: Anna Lazareva,Alexander Tarakanov
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Physics-informed neural networks (PINNs) approximate solutions of ODEs and PDEs by minimising a weighted combination of residual, boundary, initial, and data losses. Their performance is often dominated by the choice of loss weights: a poor weighting can drive training to a degenerate solution in which one physical constraint is satisfied while another is ignored. Existing methods select or adapt a single good set of weights. We take a different view: instead of tuning one weight vector, we explore the entire weight space during training. We introduce LC-PINN, which adapts the loss-conditional training of Dosovitskiy and Djolonga (2020) to the PDE-residual setting: the conditioning vector (either the loss weights or a scalar physical coefficient) is treated as a network input and sampled from a simple prior at every optimisation step. This turns PINN training into learning a continuous family of solutions indexed by that vector, with no solver-generated paired data. LC-PINN thus lies between classical PINNs and operator learning: it stays fully physics-informed but amortises training over a parametric family. Our contribution is not the loss-conditional construction itself, but its extension to PINNs, the unification of the loss-weight and parametric-coefficient regimes under one architecture (concatenation for loss weights, FiLM for coefficients), and a fixed-quadrature L-BFGS finishing protocol that makes the parametric-coefficient regime trainable. We give a lambda-invariance result for the conditional optimum and study LC-PINN on parametric Helmholtz, Schrodinger, viscous Burgers, and Buckley-Leverett equations. A single LC-PINN matches or improves retrained per-weight PINN baselines while parameterising the full family in one model, at a total cost that amortises favourably against per-instance retraining. 
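A toy version of loss-conditional training may help make the "sample the weights at every step and feed them to the model" idea concrete. This is a sketch under strong assumptions, not LC-PINN: the "solution" is a single scalar, the two loss terms are plain quadratics with hypothetical targets `A` and `B` rather than PDE residuals, and the network is replaced by a two-parameter linear model in the weight `w`.

```python
import random

A, B = 1.0, 3.0  # hypothetical targets of two competing loss terms (assumption)

def conditional_model(theta, w):
    """Tiny stand-in for a network that takes the loss weight w as an input."""
    return theta[0] + theta[1] * w

def train(steps=5000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        w = rng.random()  # sample the loss weight from a uniform prior at every step
        u = conditional_model(theta, w)
        # Loss: w * (u - A)**2 + (1 - w) * (u - B)**2; this is its derivative in u.
        grad_u = 2.0 * w * (u - A) + 2.0 * (1.0 - w) * (u - B)
        theta[0] -= lr * grad_u        # chain rule: du/dtheta0 = 1
        theta[1] -= lr * grad_u * w    # chain rule: du/dtheta1 = w
    return theta

theta = train()
# For a fixed w the minimiser is w * A + (1 - w) * B; one model now covers all w.
for w in (0.0, 0.25, 1.0):
    print(w, round(conditional_model(theta, w), 4), w * A + (1 - w) * B)
```

A single training run yields the whole family of per-weight optima, which is the amortisation argument in miniature; retraining per weight would need one run for each value of `w`.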
[LG-50] (Mis)generalization of Helpful-only Fine-tuning
Link: https://arxiv.org/abs/2606.04413
Authors: Mohammad Omar Khursheed,Baram Sosis,Fabien Roger
Subjects: Machine Learning (cs.LG)
Comments: 77 pages, 50 figures
Abstract:Helpful-only models, that is, models that are trained to always follow user intent, are valuable for dangerous capability evaluations and other areas of AI R&D where refusals would be an obstacle. Little is known about the generalization properties of helpful-only training: helpful-only models refuse less than their harmless counterparts, but previous work has not studied other dimensions of their alignment. We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. We show that simple anti-refusal training can cause many of these issues. None of these problems are necessary consequences of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.
[LG-51] TANDEM: Bi-Level Data Mixture Optimization with Twin Networks
Link: https://arxiv.org/abs/2606.04401
Authors: Jiaxing Wang,Deping Xiang,Jin Xu,Mingyang Yi,Guoqiang Gong,Zicheng Zhang,Haoran Li,Pengzhang Liu,Zhen Chen,Ke Zhang,Ju Fan,Qixiang Jiang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The capabilities of large language models (LLMs) significantly depend on training data drawn from various domains. Optimizing domain-specific mixture ratios can be modeled as a bi-level optimization problem, which we simplify into a single-level penalized form and solve with twin networks: a proxy model trained on primary data and a dynamically updated reference model trained with additional data. Our proposed method, Twin Networks for bi-level DatA mixturE optiMization (TANDEM), measures the data efficacy through the difference between the twin models and up-weights domains that benefit more from the additional data. TANDEM provides theoretical guarantees and wider applicability, compared to prior approaches. Furthermore, our bi-level perspective suggests new settings to study domain reweighting such as data-restricted scenarios and supervised fine-tuning, where optimized mixture ratios significantly improve the performance. Extensive experiments validate TANDEM’s effectiveness in all scenarios.
[LG-52] DPDL: Towards Differential Privacy Preservation in Decentralized Stochastic Learning on Non-IID Data
Link: https://arxiv.org/abs/2606.04399
Authors: Yunsheng Yuan,Xue Xiao,Lina Wang,Feng Li
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:In the paradigm of decentralized learning, a group of agents collaborate to train a global model using distributed datasets without a central server. Although the power of collaboration has been verified by many state-of-the-art studies, it entails extensive exchange of gradient information among the agents and thus induces a high risk of privacy leakage for the individual agents. Moreover, in real-world applications, the training data are usually non-identically and independently distributed across the agents, which makes privacy-preserving decentralized learning more challenging. To address these issues, we propose a privacy-preserving decentralized learning algorithm for non-IID data, DPDL, which leverages the notion of Differential Privacy (DP) in cross-gradient aggregation through a similarity-based calibration technique. Specifically, in each round, each agent perturbs the cross-gradients (i.e., the derivatives of its neighbors' local models on its private local data) with the Gaussian noise mechanism before sharing them with its neighbors; it then adopts cosine similarity to calibrate the received perturbed cross-gradients such that the aggregation of the calibrated cross-gradients can be used to effectively update the local model in a momentum-like manner. Our rigorous theoretical analysis not only reveals the minimum noise level required to achieve a specific level of privacy preservation, but also shows that our algorithm still achieves a linear speedup in training with non-IID data. We finally conduct extensive experiments on real-world datasets to validate the effectiveness of our algorithm in defending against privacy attacks and in training accurate models.
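The two ingredients named in the abstract, Gaussian perturbation before sharing and cosine-similarity calibration on receipt, can be sketched roughly as below. This is not the DPDL algorithm: the clipping step, the `noise_multiplier` parameter, and the rule "weight = non-negative cosine similarity" are assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def clip(vec, max_norm):
    """Rescale vec so its L2 norm is at most max_norm (assumed here to bound sensitivity)."""
    norm = l2_norm(vec)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def gaussian_perturb(vec, max_norm, noise_multiplier, rng):
    """Clip, then add independent Gaussian noise to every coordinate."""
    sigma = noise_multiplier * max_norm
    return [v + rng.gauss(0.0, sigma) for v in clip(vec, max_norm)]

def cosine_similarity(a, b):
    denom = l2_norm(a) * l2_norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom > 0 else 0.0

rng = random.Random(0)
local_grad = [0.6, 0.8]  # hypothetical local gradient
received = gaussian_perturb([3.0, 4.0], max_norm=1.0, noise_multiplier=0.1, rng=rng)

# Hypothetical calibration rule: trust a neighbour's noisy gradient in proportion
# to how well it agrees in direction with the local gradient.
weight = max(0.0, cosine_similarity(local_grad, received))
calibrated = [weight * g for g in received]
print(round(weight, 3), [round(c, 3) for c in calibrated])
```

Down-weighting poorly aligned gradients is one plausible way to limit the damage from both noise and non-IID drift; the paper's actual calibration may differ.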
[LG-53] Shortcomings and capacities of real-constrained neural networks in complex spaces
Link: https://arxiv.org/abs/2606.04390
Authors: Andrew Gracyk
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Probability (math.PR)
Comments: First version
Abstract:We find the asymptotic ratio between the storage capacities when enforcing real pre-activations in a complex hypothesis class as opposed to complex ones in the same class. Our methods depend on Gardner volume comparisons at critical capacity. Our proof relies on an application of the Harish-Chandra-Itzykson-Zuber (HCIZ) formula, nonstandard in literature. With the HCIZ formula, we may obtain a more robust approximation for the final asymptotic ratio. This strategy is applicable to our work specifically since we integrate over the unitary and orthogonal compact manifolds, facilitated via the Weyl integration formula and the Haar measure.
[LG-54] Revisiting Privacy Amplification by Subsampling in Selective Release DPSGD
Link: https://arxiv.org/abs/2606.04384
Authors: Xiaobo Huang,Fang Xie
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments:
Abstract:Machine learning’s reliance on sensitive data necessitates privacy-preserving techniques like Differentially Private Stochastic Gradient Descent (DPSGD). However, DPSGD suffers from substantial utility degradation and slow convergence due to gradient clipping and noise injection. Prior works have attempted to improve DPSGD from various perspectives; notably, the Differentially Private Selective Update and Release (DPSUR) algorithm has achieved remarkable model utility. However, the privacy accounting in DPSUR overlooks the variation in sampling probability introduced by the selective release mechanism, which compromises the rigor of its privacy guarantees. To address these limitations, we re-evaluate the privacy analysis of the selective release mechanism and propose a novel algorithm: Differentially Private Selective Release based on Clipped Gradients (DPSR-CG). Through a rigorous, newly derived privacy analysis and extensive experiments on multiple datasets (MNIST, CIFAR-10, IMDB, and FMNIST), we demonstrate that our DPSR-CG mechanism maintains strict privacy guarantees while achieving exceptional model performance.
[LG-55] When Do Fewer Coordinates Suffice in DP-SGD?
Link: https://arxiv.org/abs/2606.04375
Authors: Huiqi Zhang,Fang Xie
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 14 pages
Abstract:Differentially private stochastic gradient descent (DP-SGD) injects noise into every updated coordinate, making the injected noise energy scale with the ambient parameter dimension d. We ask when private training can update fewer coordinates without losing the signal needed for optimization. We propose TP-TopK (Two-Phase TopK DP-SGD), a two-phase method for coordinate-sparse private training without public data, in which a private warm-up phase identifies a coordinate support used to guide the main training phase. We give a criterion characterizing when coordinate restriction can be beneficial, show via a nonconvex stationarity bound that under this condition the relevant noise term scales with the active dimension k rather than the full parameter dimension d, and provide a lower bound on the reliability of warm-up-based coordinate ranking. Experiments on MNIST, FMNIST, and CIFAR-10 show that learned coordinate supports can retain more gradient energy than size-matched random supports, with the largest gains when the active dimension is small and warm-up scores are informative.
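The trade-off described here, less injected noise when only k of d coordinates are updated versus the gradient energy lost by ignoring the rest, can be sketched with a few helper functions. This is an illustration, not TP-TopK: the warm-up scores are simply given rather than computed privately, and all numbers are hypothetical.

```python
def top_k_support(scores, k):
    """Indices of the k coordinates with the largest absolute warm-up score."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(order[:k])

def retained_energy(gradient, support):
    """Fraction of squared gradient norm that lies on the selected coordinates."""
    total = sum(g * g for g in gradient)
    kept = sum(gradient[i] ** 2 for i in support)
    return kept / total if total > 0 else 0.0

def expected_noise_energy(num_noised_coords, sigma):
    """Expected sum of squared N(0, sigma^2) draws over the noised coordinates."""
    return num_noised_coords * sigma ** 2

# Hypothetical warm-up scores and a later gradient (assumptions for illustration).
warmup_scores = [0.1, -2.0, 0.5, 1.5, 0.0, -0.2]
gradient = [0.0, -3.0, 1.0, 4.0, 0.0, 0.0]

support = top_k_support(warmup_scores, k=2)
print(support, round(retained_energy(gradient, support), 3))
print(expected_noise_energy(len(support), 1.0), expected_noise_energy(len(gradient), 1.0))
```

When the warm-up ranking is informative, most of the gradient energy survives while the expected noise energy drops from d·σ² to k·σ²; when the ranking is poor, the retained fraction falls and the restriction stops paying off.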
[LG-56] MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers ICML2026
Link: https://arxiv.org/abs/2606.04366
Authors: Yanshun Zhao,Xiaoyu Peng,Jiamin Jiang,Congcong Zhu,Jingrun Chen
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: ICML2026
Abstract:Conventional patchified Transformers operate on uniform spatial partitions, distributing computational effort evenly across the domain irrespective of local features. This inflexible tokenization scheme is inherently limited in its ability to efficiently represent and process solutions to complex PDEs. To address this, we propose MeshTok, an adaptive mesh refinement (AMR)-inspired tokenization and sequence modeling framework. This method selectively refines spatial regions exhibiting sharp gradients, transient features, or multiscale structures, generating a heterogeneous set of multiscale tokens defined on a fixed simulation grid. These tokens are processed within a unified Transformer sequence, enabling the model to simultaneously capture coarse-grained global context and fine-grained local details without requiring specialized architectural components. Although adaptive refinement moderately increases token count, it promotes a more targeted allocation of computational resources to physically informative regions, which we view as a practical inductive bias rather than a formal optimality guarantee. Experimental evaluations across multiple PDE families and benchmark datasets demonstrate that MeshTok consistently improves the efficiency-accuracy trade-off compared to uniform-grid baselines. This suggests adaptive multiscale tokenization as a scalable and generalizable design principle for neural PDE modeling. Code is available at this https URL.
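Refinement-driven tokenization can be sketched in one dimension: keep a coarse token where the field is smooth and split it where the field varies sharply. This is only an AMR-style illustration, not MeshTok: the split criterion (value range above `threshold`), the `min_size` floor, and the sample signal are all assumptions.

```python
def refine(signal, start, end, threshold, min_size):
    """Recursively split [start, end) until the segment is smooth or minimal in size."""
    segment = signal[start:end]
    if end - start <= min_size or max(segment) - min(segment) <= threshold:
        return [(start, end)]  # one token covers this whole span
    mid = (start + end) // 2
    return (refine(signal, start, mid, threshold, min_size)
            + refine(signal, mid, end, threshold, min_size))

# Hypothetical 1-D field on a fixed 16-point grid with one sharp feature.
signal = [0.0] * 11 + [5.0] + [0.0] * 4

tokens = refine(signal, 0, len(signal), threshold=1.0, min_size=2)
print(tokens)
```

The resulting spans have different sizes but tile the same fixed grid, so they can be embedded and placed in a single sequence; fine tokens cluster around the sharp feature while smooth regions stay coarse.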
[LG-57] Literature-Guided Minimax Optimization of Virtual Epilepsy Neurostimulation
Link: https://arxiv.org/abs/2606.04339
Authors: Cathy Liu
Subjects: Machine Learning (cs.LG)
Comments: 9 pages, 4 figures. Code and interactive essay at this https URL
Abstract:Computational models of epilepsy promise patient-specific treatment design, but most optimization workflows still search for parameters that perform well on average. In neuromodulation, this is a weak target: a protocol that improves the mean response can still fail in the patient whose network is least tolerant to stimulation. We present a literature-guided minimax pipeline that couples PubMed-scale hypothesis extraction, The Virtual Brain (TVB) Epileptor simulations, and large-language-model-guided black-box optimization. The optimizer proposes either intrinsic model-control parameters or clinically interpretable external-stimulation protocols; TVB evaluates each proposal across sampled virtual patients; and the objective maximizes worst-case reward, defined as the negative variance of simulated seizure activity. In the intrinsic model-control experiment, the best archived parameter set improved worst-case reward from -0.5285 to -0.3182, a 39.8% gain over baseline. The clinical-style external-stimulation search produced a much smaller worst-case improvement (1.7%), and a 20-patient virtual cohort showed no aggregate benefit (p=0.9019), despite a 55% responder rate and a positive temporal-lobe subgroup signal. The study should be read as an in silico proof of concept for robust, literature-aware neurostimulation design, not as clinical evidence.
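The difference between optimizing the mean response and optimizing the worst case can be shown with a small sketch that uses the abstract's reward definition (negative variance of simulated activity). The protocol names and traces are invented for illustration; no TVB simulation is involved.

```python
from statistics import pvariance

def reward(trace):
    """Reward as defined in the abstract: negative variance of the activity trace."""
    return -pvariance(trace)

def worst_case_reward(traces):
    return min(reward(t) for t in traces)

def mean_reward(traces):
    return sum(reward(t) for t in traces) / len(traces)

# Hypothetical activity traces, one list per virtual patient (assumption).
candidates = {
    "protocol_a": [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 4.0]],
    "protocol_b": [[0.0, 3.0, 0.0, 3.0]] * 3,
}

best_on_average = max(candidates, key=lambda name: mean_reward(candidates[name]))
best_worst_case = max(candidates, key=lambda name: worst_case_reward(candidates[name]))
print(best_on_average, best_worst_case)
```

In this toy cohort the average-optimal protocol works very well for two patients and badly for the third, while the minimax choice gives up some average performance to protect the least tolerant patient.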
[LG-58] Federated Learning for Multi-Center Sepsis Early Prediction with Privacy-Preserving
Link: https://arxiv.org/abs/2606.04338
Authors: Xixi Tian,Di Wu,Xiang Liu,Yiziting Zhu,Yujie Li,Xin Shu,Bin Yi
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Privacy-sensitive and distributed characteristics of multi-center medical data bring severe obstacles to centralized modeling for accurate early prediction of sepsis. Federated learning (FL) has attracted growing attention as a promising framework for collaborative model development, as it allows multiple institutions to jointly train predictive models without directly sharing or centralizing raw data. Nevertheless, its practical performance, robustness, and privacy-preserving benefits remain insufficiently evaluated using real-world clinical datasets. To bridge this gap, this study systematically examines the application of federated learning to multi-center sepsis prediction. The experimental dataset consists of 648 clinically screened samples collected from three tertiary hospitals in China, with rigorous inclusion and exclusion criteria. We establish a centralized training paradigm as the performance baseline, and then implement a horizontal federated learning framework for distributed collaborative modeling. Extensive experimental results demonstrate that the federated learning-based model achieves highly comparable prediction accuracy to the centralized counterpart, while fundamentally avoiding privacy leakage. Further privacy security analysis verifies that malicious attackers cannot reconstruct the original patient data from the transmitted model parameters, indicating strong resistance against data reconstruction attacks. This work not only validates the practicality and security of federated learning in clinical sepsis prediction, but also provides a reliable and feasible solution for privacy-preserving multi-center medical collaboration.
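In horizontal federated learning, each site trains on its own records and only model parameters are combined. The abstract does not say which aggregation rule is used; a common choice is sample-size-weighted averaging (FedAvg), sketched below with hypothetical parameter vectors and site sizes.

```python
def federated_average(client_weights, client_sizes):
    """Average parameter vectors, weighting each client by its number of samples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(weights[j] * size for weights, size in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]

# Hypothetical locally trained parameters and sample counts for two sites (assumptions).
client_weights = [[1.0, 0.0], [3.0, 2.0]]
client_sizes = [1, 3]

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)
```

Only the parameter vectors and the counts leave each site in this scheme; the raw patient records stay local, which is the property the study evaluates.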
[LG-59] Policy Gradient for Continuous-Time Robust Markov Decision Processes
Link: https://arxiv.org/abs/2606.04335
Authors: Tanya Veeravalli,David M. Bossens,Atsushi Nitanda
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:The framework of robust Markov decision processes (RMDPs) allows the design of reinforcement learning agents that satisfy performance guarantees under worst-case transition dynamics. Traditional RMDPs consider discrete-time dynamics and recently, sample-efficient policy gradient algorithms have been considered in this context. This paper investigates policy gradient algorithms within a continuous-time RMDP framework. Policy gradients and adversarial gradients are derived using pathwise and adjoint-based formulas for stochastic and ordinary differential equations. We propose double-loop optimisers to obtain linear convergence in the oracle-based setting and an $\tilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ sample complexity in the sample-based setting in an analysis which also derives novel tools for the framework of undiscounted total cost MDPs. Additionally, we propose mean-field optimisers as distributional optimisers with an $\tilde{\mathcal{O}}(\frac{1}{K})$ oracle-based convergence rate and an $\tilde{\mathcal{O}}(\frac{N^2}{\epsilon})$ sample complexity under $N$-particle approximation. The effectiveness of continuous-time policy gradient algorithms is confirmed for both optimisers on continuous-time RMDPs with neural ordinary differential equation dynamics.
[LG-60] Neural Galerkin Normalizing Flows for Bayesian Inference of Diffusions with Inaccessible Boundaries
Link: https://arxiv.org/abs/2606.04324
Authors: Riccardo Saporiti,Fabio Nobile
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 27 pages, 12 figures
Abstract:One of the primary challenges in Bayesian inference on the parameters of a diffusion model from discrete observations is the unavailability of an analytical expression for the transition density function between consecutive observation times, which is needed to derive the likelihood function. Extending previous studies that solve Fokker-Planck (FP) type partial differential equations with Normalizing Flows, we propose a new Normalizing Flow architecture to learn the transition density function of the diffusion process between two observation times. We do so by solving in a Neural Galerkin framework the associated FP equation with a Dirac mass as initial condition, over a specified training distribution of the initial datum and the coefficients of the diffusion. We specifically focus on processes whose diffusion matrix vanishes in certain inaccessible boundary regions, such as Stochastic Volatility models that satisfy a Feller condition. The product of the obtained transition densities evaluated along the observed trajectory approximates the likelihood function, thereby enabling cheap posterior sampling via Markov chain Monte Carlo (MCMC). After the offline training phase, inference becomes significantly more efficient, as it avoids the need to solve the FP equation in real time for each parameter proposed by the MCMC sampler or to rely on other likelihood-free methods for Bayesian inference that involve repeated simulation of diffusion bridges.
[LG-61] Toward a Generalized Defense Across Sparse Continuous and Structured Parameter Attacks
Link: https://arxiv.org/abs/2606.04317
Authors: Bin Duan,Zeyu Bai,Guowei Yang
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:Deep neural networks are increasingly deployed across heterogeneous and partially untrusted environments, where models are distributed through cloud storage, CI/CD pipelines, containerized services, and edge execution platforms. This broad deployment landscape exposes model parameters to various integrity risks. Unlike input-space adversarial attacks, parameter attacks directly tamper with the model’s internal parameters and persist across all subsequent inferences. Existing defenses either require retraining, incur significant accuracy degradation, or are limited to specific attack classes. However, in real-world deployment scenarios, the forms of parameter attacks are often unpredictable. To address this challenge, we present ParDef, a generalized defense for deep neural networks against diverse types of parameter attacks. ParDef integrates keyed channel reparameterization, which obscures sensitive parameter directions, QC-LDPC quantization, which embeds redundancy and supports error correction, and adaptive robust inference, which stabilizes predictions under uncertainty. Our evaluation on CIFAR-10, CIFAR-100, and Tiny-ImageNet using ResNet and VGG models demonstrates that ParDef consistently reduces attack success rates across different parameter attacks while maintaining high model performance and incurring only moderate deployment overhead. These results highlight that ParDef is a practical and generalized defense for DNN deployments.
[LG-62] Testing Neural Networks via Bayesian-Guided Exploration of Decision Landscapes
Link: https://arxiv.org/abs/2606.04314
Authors: Bin Duan,Meiru Che,Guowei Yang
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:As neural networks are increasingly deployed in safety-critical domains, testing is essential to evaluate and improve their reliability. Existing testing methods, whether black-box or white-box, primarily use global mutation or coverage-guided strategies, both of which struggle to efficiently uncover diverse model failures while remaining proximate to the original data distribution and semantics. We propose BayesWarp, a testing framework that addresses this limitation by mutating decision-critical input regions identified via interpretable saliency techniques and adaptively guiding the testing process using an uncertainty-aware Bayesian Optimization strategy, enabling the discovery of diverse failures while preserving distributional and semantic proximity to the original data. Evaluation on MNIST, CIFAR-10, and ImageNet across six neural network models shows that BayesWarp improves failure discovery, failure diversity, test case quality, and critical neuron coverage under a fixed mutation budget. These results demonstrate that BayesWarp improves testing effectiveness. Moreover, fine-tuning with the generated failure cases leads to improvements in model performance.
[LG-63] Latent Anchor-Driven Test Generation for Deep Neural Networks
Link: https://arxiv.org/abs/2606.04310
Authors: Bin Duan,Matthew B. Dwyer,Guowei Yang
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to identify and mitigate model weaknesses. Existing DNN testing approaches explore either the input space or a learned latent space. While latent-space generation can better maintain plausibility than direct input-space mutation, current methods still face a trade-off among exploration controllability, failure diversity, and seed-relative semantic drift. To overcome these limitations, we propose Latte, a black-box testing framework that generates semantically proximate, diverse, and fault-revealing test cases by leveraging the latent space. Specifically, Latte encodes each input seed with a pre-trained VQ-VAE and performs a seed-centered, one-step latent mutation along directions defined by anchors sampled from alternative classes, followed by quantization and decoding back to the input space. This explores local neighborhoods around each seed within the learned latent manifold, resulting in a larger number and broader diversity of oracle-triggering prediction discrepancies under the same budget. We evaluated Latte on 5 datasets and 10 DNN models in single-model and multi-model testing scenarios. Across the evaluated datasets and models, Latte improves fault exposure and behavioral diversity under matched testing budgets. Under the single-model setting, it also maintains low seed-relative semantic drift with respect to the source seeds.
[LG-64] Folded Transport MCMC: Certifiable Quotient Posterior Computation for Symmetric Bayesian Models
Link: https://arxiv.org/abs/2606.04307
Authors: Jun Hu
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Comments: 48 pages (including supplementary material), 5 figures, 6 tables. Submitted to Journal of the Royal Statistical Society: Series B
Abstract:Bayesian models with finite symmetry (mixture models with exchangeable components, structural identification with closely-spaced modes) define posteriors that are invariant under a group of label permutations, creating redundant multimodality that degrades MCMC convergence diagnostics. We introduce Folded Transport MCMC (FolT-MCMC), which performs inference directly on the quotient posterior by constructing an independence sampler on the fundamental domain of the symmetry group. The quotient proposal is formed by symmetrising a learned normalising flow over the group orbits. We prove that the LCNF oscillation-based certification framework transfers to the quotient metric with a stabiliser-corrected ball-mass bound and improved covering radius, and that the quantile-core certified lower bound improves whenever the unfolded flow exhibits cross-mode proposal deficiency. On Gaussian mixtures (d = 2 to 20), label-switching targets (up to 24 equivalent modes), and a standard Bayesian three-component mixture posterior, the quantile-core certified improvement ratio ranges from 2x to 145x, with the folded certificate empirically nearly dimension-free. On real accelerometer data from a supertall building during Typhoon Mangkhut, FolT-MCMC yields a non-vacuous quantile-core certificate where the unfolded certificate is vacuous.
[LG-65] Offline-to-Online Learning in Linear Bandits
Link: https://arxiv.org/abs/2606.04305
Authors: Kushagra Chandak,Toshinori Kitamura,Xiaoqi Tan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that balances this tradeoff: it relies on offline data during early rounds, and increasingly favors exploration as the horizon grows. We establish regret bounds showing that our method is simultaneously competitive with both purely online and purely offline solutions. In particular, it achieves sublinear regret relative to the optimal action in the number of online interactions, while its regret relative to an offline reference decreases as the number of offline samples grows. Empirical results further demonstrate its effectiveness across various problem parameters.
[LG-66] PE-MHL: Physics-Encoded Modular Hybrid Layers for Scalable Learning of Complex Systems
Link: https://arxiv.org/abs/2606.04290
Authors: Ismail Hassaballa,Mircea Lazar
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:
Abstract:Hybrid models that combine physics-based and data-driven components have shown strong potential for achieving accuracy and interpretability in control applications. While recent methods have made progress in incorporating physical consistency, challenges remain in scalability, robustness to noise, and control of model complexity. This paper proposes a Physics-Encoded Modular Hybrid Layer (PE-MHL) framework, in which a baseline physics-based model is incrementally refined through the addition of new sub-models, where each new component adds complexity while preserving what previous components have already learned. We establish a theoretical guarantee for this construction: with a least-squares initialization of each new sub-model, the training error is monotonically non-increasing in the number of sub-models and provably converges. Empirical evaluations on a nonlinear NARX benchmark and the Quanser Aero 2 platform demonstrate that PE-MHL outperforms equivalently sized monolithic networks in both accuracy and generalization, while also providing more stable training dynamics and better preservation of underlying data structures.
[LG-67] Derivative Informed Learning of Exchange-Correlation Functionals
Link: https://arxiv.org/abs/2606.04279
Authors: Eike S. Eberhard,Luca A. Thiede,Abdul Aldossary,Andreas Burger,Nicholas Gao,Vignesh Bhethanabotla,Alán Aspuru-Guzik,Stephan Günnemann
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*Comments: Proceedings of the 43rd International Conference on Machine Learning
Abstract:Machine-learned (ML) exchange-correlation (XC) functionals aim to replace human-designed density functional approximations by learning directly from reference data, but they still do not consistently outperform traditional $\mathcal{O}(N^4)$-scaling hybrid functionals. We study a hybrid-distillation setting in which $\mathcal{O}(N^3)$-scaling ML-XC functionals are trained to reproduce B3LYP/def2-SVP targets. We introduce Derivative Informed XC-Loss (DI-Loss), a loss that incorporates additional information from the reference hybrid functional by supervising first and second derivatives of the energy on the Grassmannian of admissible density matrices. Rather than only matching the self-consistent fixed point, DI-Loss aligns the local first- and second-order response of the learned functional with that of the target functional. Across four evaluated architectures, DI-Loss consistently improves the main energy metrics. Averaged uniformly across architectures, the total-energy MAE decreases by 66% relative to energy and density supervision alone. The density-sensitive mean-field energy metric $E_\rho$ improves from 1.2 to 0.8 mEh on average, while dipole and $\mathcal{L}_2$ density errors do not improve uniformly. We further show that densities from the distilled functionals reduce hybrid-functional SCF iterations by up to 50%. In downstream TDDFT calculations, Hessian supervision improves excited-state predictions, with XCdiff reducing the mean excitation-energy MAE by 19-35%.
[LG-68] RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training
Link: https://arxiv.org/abs/2606.04272
Authors: Rachit Bansal,Clara Mohri,Tian Qin,David Alvarez-Melis,Sham Kakade
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training an LLM from scratch and applying RL, SFT, and SFT followed by RL directly to intermediate pre-training checkpoints. We find that RL is effective very early, and often matches the full SFT $\to$ RL pipeline early as well. Through experiments on harder problems, we find that targeted pre-training data composition is a strong lever for RL effectiveness, even more so than model scale. Beyond reasoning accuracy, applying RL directly to base checkpoints expands the model’s distribution; the sharpening effect reported in recent work arises only when RL follows SFT. The general capabilities of the model remain essentially unchanged by RL, while they degrade following SFT. Finally, we merge RL and SFT objectives by parallel averaging, which outperforms all other training methods discussed across metrics, while preserving general capabilities. Together, these results suggest that LLM training might benefit from an expanded use of RL.
[LG-69] Long-Term and Short-Term Transistor Aging in Deep Neural Networks: Impact and Mitigation
Link: https://arxiv.org/abs/2606.04266
Authors: Alireza Sarmadi,Virinchi Roy Surabhi,Prashanth Krishnamurthy,Hussam Amrouch,Ramesh Karri,Farshad Khorrami
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: 28 pages, 16 figures
Abstract:Deep neural networks (DNNs) are used in a variety of real-world applications including, for example, image classification and speech recognition. The inference accuracy of DNNs implemented in hardware on integrated circuits (ICs) degrades under phenomena such as transistor aging. Aging slows down the switching speed of transistors, resulting in system-level timing violations due to unsustainable clocks. To maintain reliability for the entire projected lifetime, designers add guardbands to prevent timing violations; however, adding large timing guardbands causes losses in performance (speed or throughput). This chapter provides a detailed discussion of the effects of long-term and short-term transistor aging on DNN inference accuracy. Furthermore, to mitigate the effects of aging on DNN accuracy, a methodology for aging-aware retraining is presented in order to generate a resilient DNN even when aggressive (i.e., smaller than required) guardbands are used. This improves the inference accuracy of the DNNs even in the presence of aging-induced degradation. These effects are discussed in this chapter along with mitigation strategies on a hardware implementation of a DNN for image classification on an off-the-shelf image dataset. The application of short-term aging as an excitation mechanism for the detection of hardware Trojans in integrated circuits is also briefly discussed.
[LG-70] Edge of Stability Selectively Shapes Learning Across the Data Distribution ICML
Link: https://arxiv.org/abs/2606.04212
Authors: Shauna Kwag,Anakha Ganesh,Tomaso Poggio,Pierfrancesco Beneventano
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 27 pages, 22 figures, ICML HiLD 2026
Abstract:Existing analyses of the edge of stability (EoS) treat it as a global property of optimization. We show that it is also selective: the stability constraint redistributes learning across subsets of the training distribution, amplifying progress on some groups while suppressing progress on others. Using a branching intervention that enters or exits the EoS regime from the same training state, we causally demonstrate this trade-off and identify two necessary conditions for a group to benefit. First, its aggregate gradient must align with the top Hessian eigenvector. We isolate this mechanism with a controlled perturbation that preserves distance but randomizes direction, destroying alignment and eliminating the advantage. Second, the group must sustain non-vanishing gradient magnitude over time. Under cross-entropy loss, gradient saturation decouples confidently classified groups, shifting the advantage to output-outliers, whose gradients persist. Together, these results show that EoS functions not only as a stability boundary, but as a mechanism governing the allocation of learning across the data distribution.
[LG-71] A Geometric View of Counterfactual Behavior: Interaction of Boundary Proximity and Local Support
Link: https://arxiv.org/abs/2606.04209
Authors: Ioanna Gemou,Matteo Gamba,Randall Balestriero,Ritambhara Singh
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Counterfactual explanations seek small, semantically meaningful changes to an input that alter a model’s prediction, and are widely used to interpret and audit machine learning systems. In modern vision, language, and multimodal systems, pretrained encoders map inputs to representation spaces, and downstream classifier heads impose decision boundaries within those spaces. As a result, the feasibility and distance of nearby counterfactuals depend on boundary placement relative to the data. Yet models with similar predictive performance can differ substantially in whether such changes are achievable and how far representations must move. This work examines this variation using a standardized local search probe across several pretrained encoders and linear classifier heads. Results show that despite similar predictive performance, models differ substantially in their counterfactual behavior. Under fixed representations, varying only the classifier head alters counterfactual outcomes while leaving predictive performance largely unchanged. This variation is explained by the interaction of decision-boundary proximity and local data support, which jointly determine whether prediction changes are both feasible and lie in regions supported by the data, and can also improve counterfactual search within fixed models. Together, these findings identify counterfactual behavior as a distinct dimension beyond predictive performance and show that it can be altered without changing accuracy, with implications for model selection, robustness, and the reliability of counterfactual methods.
[LG-72] KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models
Link: https://arxiv.org/abs/2606.04180
Authors: Youqi Wu,Mohammad Jalali,Farzan Farnia
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
*Comments:
Abstract:Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not explain how their representations differ structurally. In this work, we study this problem through the task of Contrastive Embedding Clustering: identifying sample subsets that are weakly clustered under one representation but strongly clustered under another. We propose Kernel Optimization for Discrepancy Analysis (KODA), a kernel-based framework for contrastive representation comparison and alignment. KODA constructs unified multimodal kernels through modality-wise kernel composition and formulates discrepancy discovery as a constrained optimization problem that searches for coherent structures in one representation while suppressing coherence in a reference representation. This yields interpretable discrepancy directions associated with specific sample subsets and modality interactions. To scale KODA to large vision-language datasets, we develop randomized low-dimensional approximations of joint kernels using random projections, including Random Fourier Features for shift-invariant kernels. Empirically, KODA identifies consistent and interpretable discrepancy structures across vision-language representations and provides sample subsets for representation alignment. The code is available at this https URL.
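The abstract above mentions Random Fourier Features (RFF) for shift-invariant kernels. As a rough illustration of that standard technique only (not KODA's joint-kernel construction, which the abstract does not specify), the sketch below approximates a Gaussian RBF kernel with random cosine features. The names `make_rff_map` and `rbf_kernel`, the bandwidth `sigma`, and the test vectors are all hypothetical choices made for this example.

```python
import math
import random

def rbf_kernel(x, y, sigma):
    """Exact Gaussian RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def make_rff_map(dim, num_features, sigma, seed=0):
    """Build a random feature map z with z(x) . z(y) approximating the RBF kernel."""
    rng = random.Random(seed)
    # For the RBF kernel, each frequency coordinate is drawn from N(0, 1/sigma^2).
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(num_features)]
    # Random phase offsets, uniform on [0, 2*pi].
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map

# Illustrative inputs (hypothetical values, not from the paper).
phi = make_rff_map(dim=3, num_features=4000, sigma=1.5)
x = [0.2, -0.4, 1.0]
y = [0.5, 0.1, 0.3]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf_kernel(x, y, sigma=1.5)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The appeal of this approximation is that kernel computations reduce to inner products of finite-dimensional feature vectors, so the cost grows with the number of features rather than with the square of the dataset size.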
[LG-73] Low-rank Distributional Matrix Completion
Link: https://arxiv.org/abs/2606.04176
Authors: Jiayi Wang,Raymond K. W. Wong
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments:
Abstract:We study a distributional generalization of the matrix completion problem in which each entry of the target matrix is a probability distribution rather than a scalar. In this setting, only a subset of matrix entries is observed, and even for observed entries, the underlying distributions are not directly accessible; instead, we observe finitely many samples drawn from them. To represent distributional entries, we employ kernel mean embeddings and introduce a notion of Tucker rank for distribution-valued matrices to capture their low-rank structure. The infinite-dimensional nature of kernel embeddings poses significant methodological challenges. To address this, we introduce functional unfolding operators that link the proposed distributional low-rank structure to the classical Tucker rank for finite-dimensional tensors. Based on this framework, we propose a novel estimator for distributional matrix completion. We establish non-asymptotic error bounds that characterize the statistical performance of the estimator. Extensive experiments on synthetic data and a real-world application demonstrate the effectiveness of the proposed method.
[LG-74] When Autoregressive Consistency Hurts Safety Alignment
Link: https://arxiv.org/abs/2606.04168
Authors: Bochen Lyu,Yiyang Jia,Xiaohao Cai,Zhanxing Zhu
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments: 21 pages
Abstract:Safety alignment in large language models (LLMs) is fragile in part because it is often shallow: fine-tuning mainly reshapes the model’s behavior near the first few output tokens. We argue that this phenomenon can be understood through autoregressive consistency, the tendency of next-token prediction to preserve and extend the current response trajectory consistently. By analyzing the learning dynamics of safety alignment, we show that autoregressive consistency can concentrate alignment updates on early tokens, offering a mechanistic explanation for shallow safety alignment. The same mechanism also predicts a broader class of attacks on LLMs: attacks that induce harmful continuation states at arbitrary positions in the output trajectory. As a concrete example, we introduce random insertion attack, which inserts a short harmful span into an otherwise safe refusal trajectory and exploits autoregressive consistency to sustain the resulting harmful branch, thereby bypassing safety alignment. Notably, a short harmful span can redirect the generation to be harmful even after a long refusal prefix, highlighting autoregressive consistency as a potential broader failure mechanism. This suggests that safety alignment should also break harmful autoregressive consistency throughout the output trajectory. We therefore propose adversarial safety alignment, an initial framework based on worst-case harmful continuation states, and instantiate it with random worst-insertion training. Overall, our results suggest that autoregressive consistency should be treated as a central consideration in both safety alignment and attack design.
[LG-75] When Offline Selectors Cannot Beat the Best Single Model: A Diagnostic Study on edX Dropout Prediction
Link: https://arxiv.org/abs/2606.04161
Authors: Tyler Crosse,Alan Nadelsticher Ruvalcaba,Dustin Khang LeDuc,Thomas Trask,Nicholas Lytle,David Joyner
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Different predictors often excel on different inputs, so picking the best one per instance promises higher accuracy than committing to a single model. In practice, selectors trained from logged data routinely fail to beat the strongest single predictor. Three causes typically go unseparated before more tuning is applied: a mismatched learner, a state that does not predict which model wins, or buffer-to-deployment label shift. A three-stage diagnostic rules them out on a shared buffer. Stage 1 estimates a local ceiling on oracle recovery from k-NN label consistency. Stage 2 asks whether paired BC and offline-RL learners (BC, DQN, and CQL across penalty weights) reach that ceiling. Stage 3 ablates the selector state to test whether richer features would raise it. The combined verdict points to the most promising next step: tuning the learner, redesigning the state, or collecting new data. We apply it to selecting among five dropout-prediction models on edX clickstream data. Across 16 windows, the oracle beats the strongest single base model by 9.7 accuracy points on average, yet BC, DQN, and CQL land in the same test-accuracy band below it (robust to a tenfold buffer sweep and N=2,000 held-out examples). The bottleneck is local representational ambiguity: CQL closes the imitation gap without a deployment gain (not conservatism), regret clusters tightly across learners (not tie-breaking), and the three learners converge on test accuracy (not shift). The next iteration should change the state or collect new data, not tune the offline learner further.
[LG-76] Stationarity-Aware Retrieval-Augmented Time Series Forecasting KDD2026
Link: https://arxiv.org/abs/2606.04135
Authors: Shiqiao Zhou,Holger Schöner,Zipeng Wu,Edouard Fouché,IAG Wilson,Shuo Wang
Categories: Machine Learning (cs.LG)
*Comments: Accepted by KDD 2026 research track
Abstract:Time series forecasting relies on historical patterns, but real-world series often exhibit non-stationarity and regime shifts that challenge fully parametric forecasters. Inspired by Retrieval-Augmented Generation (RAG), recent work augments forecasters by retrieving relevant historical segments and using them as external evidence at inference time. However, due to the intrinsic non-stationarity of real-world time series, a highly similar past segment does not necessarily imply a similar future, rendering similarity-only retrieval brittle and prone to redundancy. We propose Stationarity-Aware Retrieval-Augmented Time Series Forecasting (SARAF), a framework that adaptively balances relevance and diversity in retrieval. SARAF first forms a candidate pool via temporal similarity with time-aligned enhancement, then applies a diversity-aware selection strategy to cover heterogeneous historical regimes, with the diversification strength automatically modulated by dataset-level stationarity. Moreover, SARAF uses stationarity-aware aggregation to fuse the retrieved futures. Extensive experiments on eight real-world datasets show that SARAF achieves competitive forecasting performance and improves average accuracy and robustness over strong baselines, with particularly clear benefits under challenging non-stationary settings. Code: this https URL.
[LG-77] veriFIRE: an Industrial Case Study in Verifying Consistency Properties for a DNN-Based Wildfire Detection System
Link: https://arxiv.org/abs/2606.04121
Authors: Idan Refaeli,Maya Swisa,Itay Buchnik,Alon Zada,Guy Amir,Elad Mandelbaum,Ziv Freund,Guy Katz
Categories: Logic in Computer Science (cs.LO); Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments: To appear in The 9th International Symposium on AI Verification (SAIV)
Abstract:We present our ongoing work on the veriFIRE project: a collaboration between industry and academia, aimed at applying verification to increase the reliability of a real-world, safety-critical system. Specifically, we target an airborne platform for wildfire detection, which incorporates two deep neural networks. We present an end-to-end methodology for verifying consistency properties in this system. Our approach encodes application-grounded requirements into solver-compatible queries for existing neural network verifiers. We study properties of interest over critical operational scenarios: (i) monotonicity of detector confidence as target intensity increases; and (ii) bounded detector response under physically plausible blur over the sensor. We instantiate these encodings using state-of-the-art neural network verification backends and evaluate them at scale on real background samples. For the first property, all verification queries are solved in under five minutes. For the second property, verification is substantially harder, highlighting key scalability challenges for richer, higher-dimensional specifications. Overall, the results demonstrate that meaningful, domain-specific guarantees can be obtained for industrial systems.
[LG-78] Variance Reduction for Heavy-Tailed Monetization Metrics in Ranking Experiments via Post-Stratification SIGIR
Link: https://arxiv.org/abs/2606.04110
Authors: Neeti Pokharna,Olivier Jeunen,Yatharth Saraf,Aleksei Ustimenko
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Accepted as Industry Track paper in the 2026 ACM SIGIR Conference on Research and Development in Information Retrieval
Abstract:Online evaluation of ranking and retrieval systems often relies on downstream monetization metrics such as app revenue or creator earnings. These metrics are typically heavy-tailed, with a small fraction of users dominating both mean and variance, leading to low statistical power and unreliable conclusions in A/B experiments, especially under limited traffic. We present a practical framework for variance reduction in online experiments by combining post-stratification with CUPED. Our approach leverages pre-experiment covariates to improve the sensitivity of monetization experiments without requiring additional traffic. Deployed at ShareChat across ranking-driven monetization experiments, the method substantially reduces variance and improves decision stability, achieving equivalent statistical confidence with ~45% less traffic than standard metrics. We further discuss practical design choices, guardrails, and limitations, providing guidance on when post-stratification is appropriate for real-world information retrieval and recommendation systems. Related DOI: https://doi.org/10.1145/3805712.3808428
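For readers unfamiliar with CUPED, which the abstract above combines with post-stratification, the following is a minimal sketch of the textbook CUPED adjustment on synthetic data. It is an illustration under assumed conditions, not the framework deployed in the paper: the function name `cuped_adjust`, the log-normal pre-experiment covariate, and the linear relationship between pre- and in-experiment values are all invented for the example, and post-stratification is not shown.

```python
import random
import statistics

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y_i - theta * (x_i - mean(x)).

    y: in-experiment metric per user; x: pre-experiment covariate per user.
    theta = cov(x, y) / var(x) minimises the variance of the adjusted metric.
    """
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

# Synthetic heavy-tailed data (hypothetical; stands in for revenue-like metrics).
rng = random.Random(0)
pre = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
post = [0.8 * p + rng.gauss(0.0, 0.5) for p in pre]

adjusted = cuped_adjust(post, pre)
print("variance before:", statistics.variance(post))
print("variance after: ", statistics.variance(adjusted))
```

The adjustment leaves the sample mean unchanged while removing the part of the variance explained by the pre-experiment covariate, which is why a more strongly correlated covariate yields a larger reduction.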
[LG-79] UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing
Link: https://arxiv.org/abs/2606.04101
Authors: Xinming Wei,Chao Jin,Tuo Dai,Yinmin Zhong,Shan Yu,Chengxu Yang,Bingyang Wu,Zili Zhang,Jing Mai,Qianchao Zhu,Zhouyang Li,Yuliang Liu,Guojie Luo
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments:
Abstract:Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which becomes unreliable for production deployments with non-stationary load patterns. We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs). Built upon the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and layer on critical paths, which requires nontrivial co-design of plan solving and expert replication communication to minimize exposed overhead. To this end, UltraEP eagerly reacts to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, delivering a $1.49\times$ improvement over non-balancing, while reducing the final inter-rank imbalance from 1.30-4.01 to 1.01-1.04. Additionally, we validate UltraEP’s scalability and robustness in production MoE training with 2560 GPUs.
[LG-80] Stein Kernelized Molecular Dynamics for Active Learning of Interatomic Potentials
Link: https://arxiv.org/abs/2606.04100
Authors: Joanna Zou,Fraser Birks,Dallas Foster,Youssef Marzouk
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments:
Abstract:Machine learning interatomic potentials (MLIPs) enable efficient and accurate atomistic simulations but depend critically on the quality and diversity of the training data. We introduce Stein kernelized molecular dynamics (SKMD), an enhanced sampling method that uses interacting particle dynamics to acquire informative training configurations for the active learning and fine-tuning of MLIPs. SKMD corresponds to a stochastic variant of Stein variational gradient descent that is adapted for molecular dynamics by incorporating asynchronous particle updates and a kernel of global atomic descriptors, which provides a symmetry-aware measure of configurational similarity. Unlike other enhanced samplers used in molecular dynamics, SKMD preserves the Boltzmann distribution as the asymptotic distribution of the dynamics. This property enforces a balance between the exploration of diverse configurations and attraction toward high-probability regions of the energy landscape. We further propose an approach to efficient online data acquisition using an adaptive stopping criterion that selects non-redundant training data over the course of simulation. We demonstrate SKMD for the active learning of a neural network model of the Müller-Brown potential and the fine-tuning of a MACE interatomic potential for alanine dipeptide. Compared to active learning baselines, our method achieves higher model accuracy in fewer training iterations with the same number of acquired training samples.
[LG-81] CADET: A Modular Platform for Evaluating Distributed Cooperative Autonomy in Connected Autonomous Vehicles
Link: https://arxiv.org/abs/2606.04072
Authors: Pragya Sharma,Brian Wang,Mani Srivastava
Categories: Robotics (cs.RO); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments:
Abstract:Deep learning models are increasingly central to autonomous vehicle (AV) pipelines, yet their integration has traditionally followed a monolithic design where perception, planning, and control execute on a single onboard computer. This design overlooks the emerging paradigm of cooperative autonomy, where vehicles interact with roadside units (RSUs), edge servers, and cloud-hosted intelligence through vehicle-to-everything (V2X) connectivity. Cooperative perception and control improve safety and efficiency, but also introduce systems-level challenges: network latency, compute heterogeneity, and multi-tenant contention, all critically affect real-time decision-making. These challenges are further amplified by the increasing reliance on large foundation models, whose scale necessitates cloud deployment. We present CADET (Cooperative Autonomy through Distributed Experimentation Toolkit), a modular platform for systematic and reproducible evaluation of distributed cooperative autonomy systems under realistic deployment conditions. CADET decouples the AV stack into composable modules that can be flexibly deployed across vehicles, infrastructure, and edge/cloud tiers. The framework integrates state-of-the-art models, incorporates trace-driven network and workload emulation, and provides synchronized model-, system-, and task-level instrumentation. Through V2V and V2I experiments, we show that distributed deployment choices fundamentally shape safety, with V2V intent packets outperforming cloud-based perception and RSU-assisted perception sustaining safety until overloaded by concurrent requests. Although designed for AV pipelines, CADET also supports dataset-driven experimentation, enabling systems and ML researchers to benchmark distributed inference workloads independently of full vehicle simulation. CADET is open source, with code and demo available at this https URL. 
Journal reference: ICRA 2026
[LG-82] Bayesian Membership Privacy for Graph Neural Networks
Link: https://arxiv.org/abs/2606.04069
Authors: Sinan Yıldırım,Megha Khosla
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:
Abstract:Existing privacy analyses for Graph Neural Networks (GNNs) largely inherit assumptions from non-graph settings, overlooking structural correlations and stochastic training-graph sampling. In particular, node-dependent priors make type-I and type-II errors alone insufficient to characterize the best membership inference test. To address this, we introduce Bayesian Membership Privacy (BMP), a sampling-aware formulation of node-level membership privacy that incorporates node-dependent priors and treats graph sampling probabilities as part of the adversary’s knowledge. BMP casts membership inference as a Bayesian hypothesis test and accordingly quantifies membership privacy in terms of posterior membership probability. We explore theoretical properties of BMP in relation to the existing definitions in the literature. We further propose a practical, sampling-aware auditing mechanism to estimate the parameters of BMP as a measure of node-level privacy leakage in GNNs. We conduct experiments on benchmark graph datasets and show that BMP yields fine-grained privacy insights that are not visible through global attack accuracy alone.
[LG-83] Self-Distilled Policy Gradient
Link: https://arxiv.org/abs/2606.04036
Authors: Yifeng Liu,Shiyuan Zhang,Yifan Zhang,Quanquan Gu
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. In fact, it can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, as well as reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at this https URL.
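To make the "full-vocabulary student-to-teacher reverse Kullback-Leibler divergence" in the abstract above concrete, here is a small sketch of that quantity at a single token position, KL(student || teacher), computed from raw logits. It is only an illustration of the divergence itself: the function names, the four-token vocabulary, and the logit values are hypothetical, and how SDPG weights or combines this term with the policy-gradient objective is not described in the abstract and is not shown here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum_exp for v in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher), summed over the full vocabulary at one position.

    The expectation is taken under the student distribution, which is what
    makes this the 'reverse' direction relative to standard distillation.
    """
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt)
               for ls, lt in zip(log_p_student, log_p_teacher))

# Hypothetical logits over a 4-token vocabulary.
student = [2.0, 0.5, -1.0, 0.0]
teacher = [1.0, 1.5, -0.5, 0.0]
print("KL(student || teacher) =", reverse_kl(student, teacher))
print("KL(teacher || student) =", reverse_kl(teacher, student))
```

Because the expectation is taken under the student, this direction penalises the student for placing mass where the teacher assigns low probability, and the two directions generally give different values.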
[LG-84] Inverse Critical Experiment Design via Gradient Optimization and a Multigroup Attention-Based Neural Network Architecture
Link: https://arxiv.org/abs/2606.04033
Authors: Will Savage,Logan Burnett,Dean Price
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:The validation of advanced nuclear reactor designs and fuel concepts requires critical experiments with high neutronic similarity to the target technology. Neutronic similarity is quantified by the correlation coefficient $c_k$, which captures the shared bias in $k_{\text{eff}}$ induced by uncertainties in nuclear data. Generally, $c_k \geq 0.9$ is needed for an experiment to be sufficiently similar to a target technology. This work presents a methodology for the inverse design of critical experiments. Deep neural network surrogate modeling and nonparametric gradient optimization are used to generate experiment geometries that maximize $c_k$. A deep neural network is trained on OpenMC-calculated sensitivity vectors for grid-based critical experiment geometries. The model architecture combines a U-Net convolutional encoder-decoder with a novel multigroup attention pooling layer, introduced to capture the differing spatial dependencies of sensitivities. Multigroup attention pooling is shown to achieve better performance than traditional pooling, as well as interpretable internal behavior. The differentiability of the surrogate enables gradient-based optimization of the full combinatorial design space, allowing $c_k$ to be maximized by directly changing the material assignment of each position in the geometry grid. The method is applied to the validation of the TN-Americas TN-LC transportation cask with HALEU fuel, for which existing critical experiment coverage is limited. The optimization procedure is shown to produce experiment geometries achieving $c_k$ scores of 0.97757, 0.81324, and 0.93276 for three configurations of interest. This approach demonstrates the potential of deep learning and gradient optimization to accelerate the development of advanced nuclear technology.
[LG-85] Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent ICML2026
Link: https://arxiv.org/abs/2606.04031
Authors: Ahanaf Hasan Ariq
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*Comments: 11 pages, 3 tables. Accepted as poster at HiLD 2026 (4th Workshop on High-dimensional Learning Dynamics, ICML 2026)
Abstract:Coupled gradient descent, where the update of one parameter block depends on another, underlies bilevel optimization, two-time-scale stochastic approximation, and adversarial training. When the coupled Jacobian is block-triangular, asymptotic stability is governed by the spectral radii of the diagonal blocks, yet transient amplification before convergence can be arbitrarily large due to non-normality. We develop a sharp pseudospectral theory for such block-triangular Jacobians, proving that the Kreiss constant satisfies K(J) \leq 2/(1-\gamma) + |C|/(4(1-\gamma)) when the diagonal blocks are symmetric with spectral radii at most \gamma < 1, and we establish matching minimax lower bounds. We characterize the critical coupling threshold for spectral instability and extend the analysis to nearly self-referential systems via a Neumann-series perturbation framework. As a consequence, we obtain a finite-horizon iteration-complexity bound of O(K(J)^2 \log(1/\delta)) for stochastic coupled descent. Framed as scaling laws for non-stationary two-time-scale optimization, our results expose a non-asymptotic, instance-dependent regime of high-dimensional learning dynamics that is invisible to spectral-radius analysis. Experiments on linear-quadratic problems, IQC-based comparisons, and neural-network training confirm the theory.
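The transient amplification described in this abstract can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a hypothetical 2x2 block-triangular Jacobian with scalar diagonal blocks equal to gamma and a scalar coupling c, and it simply tracks the spectral norm of successive powers. The spectral radius is gamma < 1, so the powers eventually decay, yet the norm first rises well above 1.

```python
import math

def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]

def spectral_norm2(m):
    """Largest singular value of a 2x2 matrix.

    Uses sigma1^2 + sigma2^2 = ||m||_F^2 and sigma1 * sigma2 = |det m|.
    """
    fro2 = sum(x * x for row in m for x in row)
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = max(fro2 * fro2 - 4.0 * det * det, 0.0)
    return math.sqrt((fro2 + math.sqrt(disc)) / 2.0)

def transient_norms(gamma, c, steps):
    """Return [||J^1||, ..., ||J^steps||] for J = [[gamma, c], [0, gamma]].

    J is a hypothetical toy Jacobian chosen for illustration only.
    """
    jac = [[gamma, c], [0.0, gamma]]
    power = [[1.0, 0.0], [0.0, 1.0]]
    norms = []
    for _ in range(steps):
        power = matmul2(power, jac)
        norms.append(spectral_norm2(power))
    return norms

norms = transient_norms(gamma=0.9, c=10.0, steps=200)
peak = max(norms)
peak_step = norms.index(peak) + 1
print(f"spectral radius = 0.9, peak ||J^k|| = {peak:.2f} at k = {peak_step}")
print(f"||J^200|| = {norms[-1]:.2e}")
```

A spectral-radius check alone would report this system as stable and give no hint of the intermediate growth, which is the regime the abstract says its Kreiss-constant bounds are meant to capture.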
[LG-86] Novel Aspects of IEEE SA P3109 Arithmetic Formats for Machine Learning
Link: https://arxiv.org/abs/2606.04028
Authors: Andrew Fitzgibbon,Christoph M. Wintersteiger,Jeffrey Sarnoff
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:The IEEE P3109 draft standard defines a parameterized family of binary floating-point formats and associated operations, with a focus on facilitating machine learning. These formats allow efficient and consistent representation of values in a small number of bits. The defined formats are parameterized over width and precision in bits, signedness, and the presence of infinities. Operations are defined by decoding floating-point values to the set of closed extended reals: the reals augmented with positive and negative infinity and NaN (Not a Number). Explicit treatment of NaN and infinite operands ensures that only real arithmetic is invoked in operation definitions. Extensive rounding and saturation modes are defined; stochastic rounding is included. Operations are exception-free, accelerating throughput, with exceptional situations communicated through return values, e.g., NaN. Operations on blocks of values sharing a common scale factor are defined in terms of the underlying operations in a uniform manner. System vendors may describe approximate implementations via a novel scale-invariant measure, akin to units in the last place, called kappa-approximation. Standard function definitions and various other properties are mechanically verified and generated using formal specifications.
[LG-87] Learning Control-Affine Reduced-Order Models via Autoencoders
Link: https://arxiv.org/abs/2606.05045
Authors: Ali Mjalled,Martin Mönnigmann
Subjects: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*Comments:
Abstract:We present in this paper a framework for the identification of control-affine reduced-order models (ROMs). The proposed method utilizes autoencoders (AEs) to transform the high-dimensional states, and potentially the high-dimensional inputs, into reduced latent ones suitable for control-affine state-space dynamics. This is achieved by simultaneous training of the AE and the state-space model. In addition, we extend the discrete ROM formulation to a sequence-based model, which processes state and input histories to improve prediction accuracy while preserving the control-affine structure. We motivate our framework by applying feedback linearization to the derived models, and we present guidelines for its efficient use. The proposed framework is assessed on two numerical examples and its performance is compared to a baseline model, where the AE identifies a latent space with linear state-space dynamics. The assessment involves evaluating the prediction accuracy of the ROM on test data and its effectiveness in controlling the system to a desired state or trajectory.
[LG-88] Bayesian learning for the stochastic shortest path problem
Link: https://arxiv.org/abs/2606.04845
Authors: Chon Wai Ho,Sumeetpal S. Singh,Jiaqi Guo
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
*Comments: 50 pages, 19 figures
Abstract:Sequential decision-making problems are often modelled as a Markov decision process (MDP). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We develop a Bayesian framework to learn the optimal decision strategy through interactions with the decision-making task. Specifically, we learn the optimal action-value function Q^*, but unlike many existing Bayesian approaches, we do not rely on unrealistic modelling assumptions and ad-hoc approximations. Our approach is to directly construct the posterior beliefs for Q^* through Bellman’s optimality equations. For deterministic rewards, we characterise the posterior as a distribution with a manifold density. To facilitate simpler inference, we relax the likelihood so that a Lebesgue density exists. The flip side is that this relaxation creates unidentifiability issues. Specifically, the relaxed posterior can have significant mass on improper decision rules, while the exact posterior will not. We also calculate the exact posterior probabilities for optimal action selections for the tabular parametrisation of Q^*, a Gaussian likelihood relaxation and a Gaussian prior, which is useful in benchmarking studies. Numerical studies on variants of the Deep Sea benchmark verify our findings. We demonstrate that our framework faithfully quantifies uncertainty and, compared to other temporal-difference-based Bayesian methodologies, is more data efficient. We conclude with recommendations for future work.
[LG-89] Near-Optimal Decentralized Stochastic Convex Optimization over Networks
Link: https://arxiv.org/abs/2606.04757
Authors: Nitai Kluger,Amit Attia,Tomer Koren
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments: 12 papers
Abstract:We study decentralized stochastic smooth convex optimization, where M workers minimize an average objective using local stochastic gradients and neighbor-only communication over a fixed gossip network. A central question in this setting is to determine the largest number of workers that can be used under a total budget of N gradient samples while still preserving the centralized O(1/\sqrt{N}) statistical rate. We introduce an accelerated decentralized method that preserves this rate for up to M \lesssim \sqrt{\rho}\,N^{3/4} workers, where \rho is the spectral gap of the gossip network, improving the best prior maximal scaling of M \lesssim \rho\sqrt{N}. The method is based on a one-step-delayed stochastic acceleration scheme that enables workers to interleave minibatching with accelerated gossip while controlling residual disagreement, and its guarantee depends only logarithmically on the optimum-local heterogeneity. We also establish a matching lower bound for linear-span decentralized first-order methods, showing that the method is optimal up to logarithmic factors.
[LG-90] QPredSGG: Hybrid Quantum Predicate Learning for Long-Tailed Scene Graph Generation
Link: https://arxiv.org/abs/2606.04689
Authors: Prerana Ramkumar,Nouhaila Innan,Muhammad Shafique
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: 11 pages, 5 figures
Abstract:Scene Graph Generation (SGG) requires relational reasoning over objects and their interactions, but performance is often limited by severe long-tail predicate imbalance. Classical SGG models frequently rely on dataset statistics, leading to biased predictions toward frequent relations rather than fine-grained semantic predicates. Although existing debiasing strategies improve mean recall, predicate classification in current frameworks still often depends on large classical decision modules with high parameter cost. This work introduces a hybrid quantum predicate classifier for SGG by replacing the classical predicate head in Causal Feature Enhancement Network (CFEN) with a Quantum Predicate Head (QP-Head) trained using weighted cross-entropy. To the best of our knowledge, this is among the first studies to evaluate a hybrid quantum architecture for scene graph predicate classification on Visual Genome 150. We study the effect of qubit count, encoding strategy, entangling structure, and circuit depth on relational prediction. The best 4-qubit QP-Head uses Amplitude Embedding and Strongly Entangling Layers to compress 4096-dimensional pair features into a 16-dimensional quantum-compatible representation, corresponding to a 256 \times reduction. It achieves an mR@100 of 57.25%, compared with 41.1% for the classical CFEN reference, while using only 96 trainable quantum parameters. Scaling to 8 qubits maintains strong long-tail performance, reaching an mR@100 of 55.38% with 384 quantum parameters, while the depth analysis shows a trade-off between expressibility and runtime overhead. These results suggest that compact hybrid quantum predicate heads can support parameter-efficient long-tail relational classification in complex visual reasoning tasks.
[LG-91] Reconstructing Unobservable Temperature Fields via Simulation-Aided Intelligent Sensing
Link: https://arxiv.org/abs/2606.04582
Authors: Monika Stipsitz,Hèlios Sanchis-Alepuz,Jacob Reynvaan,Silvester Sabathiel
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*Comments: Presented at IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Nancy, France, 2026
Abstract:Real-time monitoring of the temperature distribution within components and sub-structures is a challenging topic in many systems due to restrictions on feasible sensor locations. While machine learning (ML) proves a versatile tool in many applications, its adoption for high-resolution thermal monitoring is hindered by the limited availability of high-quality datasets for training. In this work, we propose a novel approach for generating datasets for industrial applications based on randomized physics-based simulations. We demonstrate the approach in a proof-of-concept hardware setup: a neural network (NN) trained only on such a synthetic dataset is used to reconstruct the internal temperature field from sparse sensors embedded in the hardware. The NN-based reconstructions not only outperform Kriging in robustness but also enable real-time inference, making the method suitable for online monitoring of otherwise unobservable thermal states.
[LG-92] ReSGA: A Large Tail Risk Model for Learning Value-at-Risk and Expected Shortfall
Link: https://arxiv.org/abs/2606.04576
Authors: Yichi Zhang,Ke Zhu,Zhoufan Zhu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Risk Management (q-fin.RM)
*Comments:
Abstract:Learning Value-at-Risk (VaR) and Expected Shortfall (ES) is important for managing financial risks effectively. Existing approaches with limited parameters are vulnerable to model misspecification in the era of big data. To address this limitation, we propose a large tail risk model, the retrieval-enhanced self-grouping autoencoder (ReSGA), which is designed with millions of parameters to exploit the rich cross-sectional dependence and long-term temporal dynamics of assets using their characteristics. Applied to monthly US equity returns from 1926 to 2023 with 153 firm characteristics, ReSGA outperforms twelve econometric and machine learning competitors in terms of out-of-sample loss and statistical backtesting. In addition, its forecast advantages can translate into significant economic gains from long-short decile portfolios that are constructed by a new size-enhanced left-side momentum strategy. To clarify the role of complexity, we further conduct a systematic scaling analysis and demonstrate that improvements in joint VaR-ES forecasting are primarily driven by data complexity rather than model complexity. Finally, our analyses of group-importance and transfer-learning exhibit the interpretability and cross-market generalizability of ReSGA.
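For readers unfamiliar with the two risk measures named in this abstract, the following is a minimal sketch of their plain empirical (historical-simulation) estimates. It is not the ReSGA model and uses none of its components. The function name, the sign convention (VaR and ES reported as left-tail returns rather than positive losses), and the sample data are all assumptions made for illustration.

```python
import math

def historical_var_es(returns, alpha):
    """Empirical left-tail VaR and ES at level alpha (0 < alpha < 1).

    VaR is taken as the worst-k-th return, with k = ceil(alpha * n), and ES as
    the mean of the k worst returns. Both are reported as returns (negative
    numbers for losses), which is one of several conventions in use.
    """
    if not returns or not 0.0 < alpha < 1.0:
        raise ValueError("need a non-empty sample and 0 < alpha < 1")
    ordered = sorted(returns)
    # The small offset guards against floating-point error in alpha * n.
    k = max(1, math.ceil(alpha * len(ordered) - 1e-12))
    tail = ordered[:k]
    value_at_risk = tail[-1]
    expected_shortfall = sum(tail) / k
    return value_at_risk, expected_shortfall

# Hypothetical monthly returns, made up for this example.
sample_returns = [-0.10, -0.05, -0.02, 0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
var_20, es_20 = historical_var_es(sample_returns, alpha=0.2)
print(f"20% VaR = {var_20:.3f}, 20% ES = {es_20:.3f}")
```

ES averages over the whole tail beyond VaR, so it is never less severe than VaR at the same level. That is why the two are usually forecast jointly, as the abstract describes.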
[LG-93] Scaling Datasets for Multi-Sensor Multi-Agent and Multi-Domain Learning in Autonomous Systems
Link: https://arxiv.org/abs/2606.04444
Authors: R. Spencer Hallyburton,David Hunt,Miroslav Pajic
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*Comments:
Abstract:Existing datasets cannot support large-scale learning in multi-agent, multi-sensor, or multi-domain autonomy, where diversity and coordination are essential. We present a modular dataset generation pipeline that creates terabyte-scale, ground-truth-labeled data for ground, aerial, and infrastructure-based systems using the AVstack framework and CARLA simulator. Supporting single- and multi-agent configurations with flexible sensor suites, the pipeline enables controllable experimentation across challenging conditions. Representative perception and fusion studies show how generated data can support application-specific training and collaborative autonomy.
[LG-94] Flatness and Generalization: Learning Multi-Index Models with Homogeneous Neural Networks
Link: https://arxiv.org/abs/2606.04429
Authors: Harsh Vardhan,Hossein Taheri,Arya Mazumdar
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:A common heuristic used to explain the generalization of first-order gradient methods on non-convex neural networks is that “flat interpolators generalize well” (Hochreiter and Schmidhuber, 1994; Keskar et al., 2017), where flatness can be measured by the trace of the Hessian of the empirical loss. However, Dinh et al. (2017) showed that any interpolator can be made sharper or flatter by exploiting symmetries of the network that change flatness while keeping the population and empirical losses unchanged. This result makes the earlier heuristic statement vacuous. In this paper, we show that for learning an unknown multi-index model with 2-layer non-convex homogeneous neural networks, there is a connection between flatness and generalization, despite the existence of symmetries. This connection pertains to the “flattest” interpolators, i.e., the interpolators that have orderwise minimum flatness among all interpolators. First, we show that there exists a natural class of non-generalizing interpolators whose flatness cannot be made closer to the flattest possible, even using symmetries. Second, we show that for data generated by a sum of single-index models, if the approximation error and label noise are low, any flattest interpolator achieves small population loss, i.e., the flattest interpolators always generalize. This establishes a direct link between flatness and generalization which applies to a large class of activations and realistic data distributions.
[LG-95] Knockoffs-based False Discovery Rate Control and Simplification for Deep Neural Networks
Link: https://arxiv.org/abs/2606.04404
Authors: Huiqi Zhang,Wenyu Liao,Yiqing Shi,Xiaobo Huang,Fang Xie
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:The deep neural network is a widely used framework in machine learning that has been applied in various fields. However, deep neural networks often involve a large number of parameters and inputs, many of which may be irrelevant to the goal or true output. These parameters and input variables not only increase computational complexity, but also contribute to additional computational cost. One solution to this problem is knockoff methods, which have proven successful in controlling false discovery rates in high-dimensional regression. Building on the knockoff methods and using the regularised neural network, this paper proposes three variable screening methods under the condition of controlling false discovery rates: the one layer filter, the multiple layers filter, and the variable weight aggregation filter. In comparison with existing algorithms, we find that our algorithms show satisfactory performance.
[LG-96] REGAIN: REconciliation GAIN-driven Auxiliary Direction Learning
Link: https://arxiv.org/abs/2606.04380
Authors: Weijia Li,Shun Hu,Yanfei Kang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Forecast reconciliation usually starts from a fixed measurement system and asks how forecasts should be projected onto a coherent space. We ask a different question: which additional linear measurements should be forecast and included in the reconciliation system? We propose REGAIN, a reconciliation-gain framework that learns normalized auxiliary directions, forecasts the induced series with a frozen forecasting oracle, and selects directions by their target-weighted loss reduction after augmented generalized least-squares reconciliation. Unlike variance-based components or predictability-based auxiliary selection, REGAIN optimizes the downstream effect of an auxiliary measurement on the final reconciled forecasts. We provide a statistical characterization showing that useful auxiliary directions must provide complementary information about unresolved target uncertainty, rather than merely being easy to forecast. The analysis also clarifies the covariance-risk reduction mechanism, the role of bias changes in realized quadratic risk, and the stability of estimated gain signals. A stagewise learning algorithm with held-out gain screening is developed, together with an optional joint refinement step. Experiments on Beijing PM2.5 and Australian Tourism data show that gain-selected measurements can improve both ordinary multivariate and hierarchical forecasts, especially when they reveal residual uncertainty not captured by the original measurement system.
[LG-97] Nonlocal Mean Field Schrödinger Bridge with Learned Interactions
Link: https://arxiv.org/abs/2606.04265
Authors: Daisuke Inoue,Mathieu Laurière,Dante Kalise
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments: 31 pages, 15 figures
Abstract:The Schrödinger Bridge Problem constructs a stochastic process that connects an initial distribution to a terminal distribution with minimum energy. This work considers its mean-field extension, the Mean-Field Schrödinger Bridge, for interacting particle systems. With nonlocal interactions, evaluating the resulting particle-dependent distributional terms can scale quadratically with the population size, which makes large-scale problems intractable. We address this bottleneck by approximating the nonlocal interactions with neural network surrogates. The resulting four-stage alternating algorithm reduces the per-step cost from quadratic to linear in the population size at inference. We also derive Grönwall-type stability bounds that show how surrogate errors propagate to the generated trajectories. In numerical experiments on navigation and opinion-dynamics tasks, the proposed method reproduces trajectories obtained with analytical evaluation and reduces training time.
[LG-98] Representation Matters in Randomized Smoothing for Audio Classification
Link: https://arxiv.org/abs/2606.04210
Authors: Jong-Ik Park,Shreyas Chaudhari,José M. F. Moura,Carlee Joe-Wong
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*Comments:
Abstract:Randomized smoothing (RS) certifies robustness in the vector space where Gaussian noise is added. In audio classification, this space is often not uniquely defined as standard pipelines normalize, range-control, and transform waveforms into log-mel or other spectral features. We show that direct RS is therefore under-specified unless the certified object and preprocessing policy are explicit. On two audio benchmarks, keyword spotting and environmental-sound classification, we study waveform, feature-space, and post-processed smoothing. Our diagnostics show why representation-aware reporting is necessary: at the same smoothing level \sigma = 0.0025, the two datasets share the same median raw radius 0.007996, but different waveform energies yield different SNR-equivalent scales (83.98 vs. 90.97 dB); log-mel smoothing gives higher positive-radius certified accuracy on environmental sounds (68.42% vs. 65.53%), certifying more examples with nonzero radius but over features rather than waveforms; and clipping or peak normalization changes the effective perturbation norm by roughly 230–351\times. We therefore recommend that audio RS studies choose and report the task-specific certified object and perturbation model, including the perturbation location, gain policy, raw radius, and any post-noise geometry changes.
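To make the "raw radius" in this abstract concrete, here is a minimal sketch of the standard Gaussian-smoothing L2 certificate from Cohen et al. (2019), R = \sigma \Phi^{-1}(p_lower), where p_lower is a lower confidence bound on the top-class probability under noise. The abstract does not say which certificate, sample count, or confidence level the authors used. The values n = 10,000 and alpha = 0.001 below, and both function names, are assumptions chosen for illustration.

```python
from statistics import NormalDist

def lower_bound_all_agree(n, alpha):
    """One-sided Clopper-Pearson lower bound when all n noisy samples agree.

    With n successes out of n trials the bound has the closed form alpha**(1/n).
    """
    return alpha ** (1.0 / n)

def certified_radius(sigma, p_lower):
    """L2 radius sigma * Phi^{-1}(p_lower); abstain (radius 0) if p_lower <= 0.5."""
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)

# Assumed settings: sigma is taken from the abstract, n and alpha are hypothetical.
p_lower = lower_bound_all_agree(n=10_000, alpha=0.001)
radius = certified_radius(sigma=0.0025, p_lower=p_lower)
print(f"p_lower = {p_lower:.6f}, certified L2 radius = {radius:.6f}")
```

Under these assumed settings the radius comes out close to the median raw radius quoted in the abstract. Since the radius depends only on \sigma and the agreement rate, and not on the waveform's scale, two datasets can share the same raw radius while differing in SNR-equivalent terms, which is the abstract's point about reporting the certified object explicitly.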
[LG-99] CaloTrilogy: Toward a Breakthrough in One-Step End-to-End Physics-Guided Shower Generation for Modern Calorimeters
Link: https://arxiv.org/abs/2606.04165
Authors: Cheng Jiang,Sitian Qian,Kevin Pedro,Oz Amram,Huilin Qu,Maggie Voetberg
Subjects: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Instrumentation and Detectors (physics.ins-det)
*Comments:
Abstract:High-precision calorimeter simulation at current and future colliders imposes rapidly growing computational demands, motivating the development of machine-learning surrogates for traditional Monte Carlo tools such as Geant4. Flow matching and diffusion-based generative models have become leading approaches for high-dimensional fast simulation because of their sample quality, but typically require \mathcal{O}(100) function evaluations at inference and often rely on auxiliary networks to constrain global observables, compromising streamlined end-to-end generation. We introduce a unified framework that improves the balance between speed, shower quality, and physics fidelity. The method combines: (i) an average velocity field integrator that enables sampling in one or a few evaluations; (ii) a learned generative prior in shower space, constructed from data rather than random noise; and (iii) physics-guided loss terms that impose inductive biases on key observables during training. These elements are training-time regularizers, preserving end-to-end inference with no additional cost. With only one or a few evaluation steps, the model achieves shower quality competitive with state-of-the-art flow and diffusion approaches, tested on several public high-granularity calorimeter datasets. The results demonstrate inter-layer shower structure consistent with the underlying physics, providing a strong candidate for future fast simulation workflows.
[LG-100] EpiFormer: Learning Antigen-Antibody Interactions for Epitope Prediction via Geometric Deep Learning
Link: https://arxiv.org/abs/2606.04154
Authors: Mansoor Ahmed,Huirong Chai,Haoxin Wang,Hemanth Venkateswara,Murray Patterson
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments:
Abstract:Antibodies neutralize foreign antigens by binding to specific surface regions called epitopes. Computational epitope prediction is critical for understanding immune recognition and guiding antibody engineering. However, existing methods face three fundamental challenges: antibody-aware models encode each chain independently and combine them only at a late stage, failing to capture co-dependent structural features that define binding interfaces, whereas severe class imbalance and scarcity of known antibody-antigen complexes render standard training objectives ineffective. We propose EpiFormer, a general encoder-decoder framework that addresses these challenges jointly. Our key design principle is interleaved cross-attention within GNN encoding layers, enabling bidirectional antigen-antibody information flow throughout representation learning rather than only at the output. This early-fusion principle is backbone-agnostic, providing consistent gains across GNN architectures from simple GCNs to equivariant models. We further show that sparsity-aware objectives are effective when paired with early-fusion architectures for the epitope prediction task. EpiFormer improves over the previous best method by over 40% in F1 score on standard benchmarks, demonstrating generalizability and cross-dataset transferability. Notably, EpiFormer discovers known biological principles as emergent behaviors of end-to-end training, where the learned cross-attention gates favor antigen-to-antibody information flow, consistent with the asymmetric roles of the two chains at the binding interface, and the model’s preference for geometric over evolutionary features aligns with the established finding that epitope residues are not evolutionarily conserved. The source code is available at: this https URL
[LG-101] SC-TauPath: A Structural Connectivity Attribution Framework for Mapping Tau Propagation Pathways in Alzheimer's Disease
Link: https://arxiv.org/abs/2606.04066
Authors: Jing Zhang,Norman Scheel,Minheng Chen,Tong Chen,Yanjun Lyu,David C. Zhu,Rong Zhang,Dajiang Zhu
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*Comments:
Abstract:Understanding how structural connections are associated with tau propagation in Alzheimer’s disease (AD) remains a central open question, yet existing computational models either rely heavily on biophysical assumptions or lack neurobiologically interpretable pathway maps. We present SC-TauPath, a structural connectivity (SC) attribution framework that maps tau propagation pathways from in vivo neuroimaging data. SC-TauPath combines a Network Diffusion Model (NDM)-augmented multilayer perceptron with gradient × input attribution to score each SC edge’s contribution to tau prediction, then translates these attribution scores into multi-scale pathway maps (backbone edges, high-traffic routes, and hub ROIs). Applied to 234 ADNI participants with paired DTI SC and 18F-Flortaucipir PET, SC-TauPath achieves strong cross-validated tau prediction and yields attribution-based pathway maps consistent with established Braak staging anatomy, demonstrating that SC encodes spatially specific information about regional tau distribution in AD.
[LG-102] Finite-Iteration Local Dynamics and Warm Starts for Alternating Power Iteration in Spiked Tensor PCA ALT
Link: https://arxiv.org/abs/2606.04065
Authors: Yanjin Xiang,Zhihua Zhang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments: 67 pages, 0 figures. The paper studies local dynamics and warm-start analysis for alternating power iteration in spiked tensor PCA
Abstract:We study simultaneous alternating power iteration for fixed-order asymmetric rank-one spiked tensor models. Our main contribution is a finite-iteration local theory that is independent of any particular initialization. Once the iterates enter a sufficiently small neighborhood of the planted rank-one direction, their error decomposes into a geometrically decaying transient and an intrinsic noise floor caused by fixed orthogonal noise contractions at the planted point. The deterministic finite-sample conditions are stated explicitly, but under a coarse fixed-order multilinear noise event they reduce to a conservative high-signal regime for fixed or slowly expanding local radii. We then separate the warm-start mechanism from any specific spectral construction. A generic one-sweep principle shows that, if a sign-compatible initializer has correlation \gamma_N, first-sweep noise level a_N, and a_N/(\gamma_N^{d-1}\omega_{N,d}) \to 0, then one can choose an expanding radius r_N = o(\omega_{N,d}) for which the first sweep enters the local basin. After entry, the local affine contraction yields convergence to the unique informative local fixed point in that basin. For centered-Gram initialization, we verify the required correlation and same-sample first-sweep noise bound under i.i.d. finite-fourth-moment noise by a signal-preserving noise-only leave-one comparison and an averaged leave-one slice-contraction estimate, which we call a pressed-back estimate. The leave-one comparison keeps the spike fixed and averages over the deleted coordinate, so planted coordinates enter through \ell_2-weighted sums rather than worst-case incoherence bounds.
[LG-103] Structure-Aware Prediction of PROTAC-Mediated Protein Degradability via Graph Neural Networks
Link: https://arxiv.org/abs/2606.04021
Authors: Bryan Cheng,Austin Jin
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments: 10 pages, 5 figures, ACM-BCB 2026 Main Conference Full Paper
Abstract:Proteolysis-targeting chimeras (PROTACs) can selectively degrade disease-causing proteins, yet predicting which targets are amenable to degradation remains a critical bottleneck: existing computational methods require the complete PROTAC molecular structure, information unavailable before synthesis. We present DegradoMap, a graph neural network that predicts PROTAC-mediated degradability from protein structure and E3 ligase identity alone – the minimal information available at the target selection stage. The model encodes biophysical priors through lysine-weighted graph pooling with per-protein normalization, models protein-E3 compatibility via cross-attention, and integrates cellular context from the Cancer Dependency Map. On the PROTAC-8K benchmark (3,101 samples, 155 targets, 10 E3 ligases), DegradoMap achieves 0.646±0.124 AUROC on target-unseen evaluation (best seed: 0.7449) and 0.811 AUROC on CRBN-VHL E3-unseen transfer, outperforming GNN and machine learning baselines. The model additionally recommends optimal E3 ligases with 74% Hit@3 accuracy. Two findings carry broader implications: E(3)-equivariant architectures underperform the simpler invariant design for this scalar prediction task, and ESM-2 embeddings improve peak performance only with careful regularization – naive integration fails. DegradoMap provides pre-synthesis computational guidance for degradability assessment; its well-calibrated confidence scores (ECE = 0.029, target-unseen) enable practitioners to prioritize high-confidence predictions for experimental follow-up. However, the high seed variance (std = 0.124) and limited E3 coverage require ensembling for reliable deployment.
[LG-104] SpliceBind: Isoform-Aware Prediction of Binding Pocket Druggability
Link: https://arxiv.org/abs/2606.04020
Authors: Bryan Cheng,Austin Jin,Joshua Chang
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments: 10 pages, 4 figures, ACM-BCB 2026 Main Conference Short Paper
Abstract:Splice-mediated drug resistance occurs in up to 40% of patients on targeted kinase inhibitors, yet state-of-the-art druggability tools operate on single structures and cannot compare across isoforms. We introduce SpliceBind, a graph neural network framework for isoform-aware druggability prediction. Beyond improving prediction accuracy (AUROC 0.703 vs. P2Rank 0.634, p = 0.026), we address a more fundamental question: when do structural methods succeed, and when must they fail? Systematic analysis of six clinically validated variants spanning five mechanism classes reveals a two-tier resistance taxonomy. Domain deletions (AR-V7, Delta = -18.39) and pocket disruptions produce structurally detectable changes, while allosteric mechanisms (BRAF-p61) remain fundamentally invisible to any pocket-centric approach – a boundary no algorithmic improvement can cross. Notably, learned embeddings capture affinity-based resistance missed by geometry alone (ALK-L1196M: Delta_SB = -0.228 vs. Delta_P2Rank = -0.95), partially bridging the structural-biochemical gap. On 229 kinase pockets spanning 25 families, SpliceBind achieves AUROC 0.703 (p = 0.026 vs. P2Rank) with robust generalization to held-out families (AUROC 0.761). This taxonomy transforms clinical workflows: upon discovering a splice variant, clinicians can immediately determine whether computational triage suffices or biochemical validation is required – reducing time from variant discovery to therapeutic decision.
[LG-105] SPLIT-PINN: Separable Probability Learning Technique via Physics-Informed Neural Networks for High-Dimensional Probabilistic Modeling
Link: https://arxiv.org/abs/2606.04000
Authors: Pouria Behnoudfar,Deekshith Naidu Ponnana,Noah J. Schmelzer,Janith Wanni,George T. Gray III,Dan J. Thoma,Curt A. Bronkhorst,Nan Chen,Wenxiao Pan
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Abstract:We present a probabilistic modeling framework for incorporating small-scale spatial heterogeneity into macroscopic descriptions of material behavior for polycrystalline metallic materials. Spatially heterogeneous material state fields are represented using probability density functions (PDFs), providing a principled statistical description of microstructural variability and state evolution across different computational polycrystalline realizations. The framework is built on the inverse identification of a probabilistic transport model, formulated as a Liouville equation with an unknown drift term. To enable accurate, stable, and interpretable inference of this drift field in high-dimensional, transport-dominated settings, we develop a Separable Probability Learning Technique via Physics-Informed Neural Networks (SPLIT-PINN). This method incorporates a marginal-correction drift decomposition, orthogonality constraints, and residual-based adaptive training to enhance well-posedness, numerical stability, and physical consistency without imposing restrictive parametric assumptions. Using SPLIT-PINN, the drift field governing the temporal evolution of joint state PDFs is inferred directly from data. After benchmark validation, the framework is applied to physical computational datasets describing the evolution of polycrystalline microstructural states, including von Mises stress, dislocation density, and equivalent plastic strain rate. The learned Liouville model, trained on a single dataset, is subsequently used in forward predictions of the temporal evolution of joint and marginal PDFs for multiple unseen polycrystal realizations. Quantitative comparisons with reference PDFs demonstrate that the proposed framework yields accurate and robust probabilistic predictions and generalizes effectively across datasets.
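For readers unfamiliar with the transport model named in this abstract, the standard form of a Liouville equation with drift is written out below, together with the kind of residual a physics-informed network would typically minimize. The notation is an assumption for illustration: $\mathbf{x}$ stands for the material state vector (for example von Mises stress, dislocation density, and equivalent plastic strain rate), $p$ for the joint PDF, $\mathbf{f}$ for the unknown drift, and $\mathbf{f}_\theta$ for a neural approximation of it. The paper's exact formulation, including its marginal-correction decomposition and orthogonality constraints, is not given in the abstract and is not reproduced here.

```latex
\begin{aligned}
\frac{\partial p(\mathbf{x},t)}{\partial t}
  + \nabla_{\mathbf{x}} \cdot \big( \mathbf{f}(\mathbf{x},t)\, p(\mathbf{x},t) \big) &= 0, \\
r_\theta(\mathbf{x},t) &= \frac{\partial p(\mathbf{x},t)}{\partial t}
  + \nabla_{\mathbf{x}} \cdot \big( \mathbf{f}_\theta(\mathbf{x},t)\, p(\mathbf{x},t) \big).
\end{aligned}
```

In the inverse setting described, $p$ is observed from data and $\mathbf{f}_\theta$ is fitted so that $r_\theta$ is small; the learned drift can then be used to propagate PDFs forward in time for unseen realizations.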