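This post lists the latest papers retrieved from arXiv.org on 2026-06-05. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: Paper data is fetched from arXiv.org and updated automatically at around 12:30 each day.
Tip: If the list is not updated on time, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day whenever possible.
Contents
Overview (2026-06-05)
798 papers were updated today, including:
- Natural Language Processing: 153 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 281 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 125 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 232 papers (Machine Learning (cs.LG))
- Multiagent Systems: 11 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 19 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 24 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems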
[MA-0] Unsupervised Skill Discovery for Agentic Data Analysis
[Quick Read]: This paper addresses the core challenge of automatically discovering reusable data-analysis skills from unlabeled exploration trajectories without supervision, in particular the difficulty data-analytic agents have in identifying high-quality analytical behavior when reliable supervision signals are unavailable. The key to its solution is DataCOPE, an unsupervised verifier-guided skill discovery framework. The framework iteratively coordinates a data-analytic agent that generates exploration trajectories, an unsupervised verifier that extracts relative-quality or agreement signals across trajectories, and a skill manager that performs contrastive skill distillation. For report-style analysis it uses an Adaptive Checklist Verifier, which derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist; for reasoning-style analysis it uses an Answer Agreement Verifier, which groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. Experiments on Deep Data Research (report-style) and DABStep (reasoning-style) show consistent gains over baselines: averaged across four model settings, the mean score improves by 9.71% and 32.30%, respectively.
Link: https://arxiv.org/abs/2606.06416
Authors: Zhisong Qiu,Kangqi Song,Shengwei Tang,Shuofei Qiao,Lei Liang,Huajun Chen,Shumin Deng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Work in progress
Abstract:Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or agreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks respectively.
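The idea of grouping trajectories by answer agreement can be illustrated with a small sketch. This is not the DataCOPE implementation: the `normalize` rule, the `(trajectory_id, final_answer)` input shape, and the use of the majority share as the self-consistency signal are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(answer.strip().lower().split())


def group_by_agreement(trajectories):
    """Group (trajectory_id, final_answer) pairs by normalized answer.

    Returns a list of (answer, ids, support) tuples sorted by support,
    largest first. `support` is the fraction of trajectories that agree
    on the answer, used here as a simple self-consistency signal.
    """
    groups = defaultdict(list)
    for traj_id, answer in trajectories:
        groups[normalize(answer)].append(traj_id)
    total = len(trajectories)
    ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(ans, ids, len(ids) / total) for ans, ids in ranked]


# Five unlabeled trajectories for the same question (made-up data).
trajectories = [
    ("t1", "42"),
    ("t2", " 42 "),
    ("t3", "41"),
    ("t4", "42"),
    ("t5", "forty-two"),
]
ranked = group_by_agreement(trajectories)
majority_answer, majority_ids, support = ranked[0]
print(majority_answer, majority_ids, support)
```

In a contrastive setup, trajectories in the majority group and those outside it could then serve as the two sides of a comparison, without any ground-truth labels.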
[MA-1] Emergent Language as an Approach to Conscious AI
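[Quick Read]: This paper addresses the open question of whether artificial systems can be conscious. The core difficulty is that existing approaches either evaluate systems against theory-derived checklists (discriminative) or directly engineer consciousness-inspired modules (architectural), and neither can rule out that the observed structures are artifacts of human language priors. The paper proposes a generative methodology, emergent language (EL) in multi-agent reinforcement learning: agents start from a minimal state (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, so that the resulting linguistic structure is attributable to task demands rather than to human language priors. The key is causal attributability achieved through environment design, so that communicative behavior can be traced to a specific environmental affordance. As a proof of concept, the authors show in a minimal environment that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.
Link: https://arxiv.org/abs/2606.06380
Authors: Zengqing Wu,Chuan Xiao
Affiliations: University of Osaka; Nagoya University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments: Source codes available at this https URL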
Abstract:The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from minimal (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.
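[MA-2] From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws
[Quick Read]: This paper addresses failures of LLM-based agents in complex execution environments that are caused by flaws in the agent harness, and in particular the imprecise failure diagnosis of existing self-improving agents and automatic harness evolution methods. Those methods mostly optimize from final outcomes and struggle to locate the responsible steps in a failed trajectory or the harness layer at fault, which leads to broad, indirect, or poorly scoped repairs. The key to the solution is HarnessFix, a trace-guided harness repair framework. Its core is a Harness-aware Trace Intermediate Representation (HTIR) that normalizes fragmented trajectory evidence and explicitly captures step-level provenance and control-flow relations. On top of HTIR, HarnessFix attributes failures to responsible steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators; it then generates and validates harness patches under flaw-specific repair specifications so that target flaws are reduced without introducing unacceptable regressions. On SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA, and AppWorld, HarnessFix improves held-out test performance over the initial harnesses by 15.2%–50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across ETCLOVG layers.
Link: https://arxiv.org/abs/2606.06324
Authors: Mengzhuo Chen,Junjie Wang,Zhe Liu,Yawen Wang,Qing Wang
Affiliations: Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Software Engineering (cs.SE); Multiagent Systems (cs.MA)
Comments: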
Abstract:LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which harness layer causes the unreliable behavior, resulting in broad, indirect, or poorly scoped changes. This paper proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It then attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications to reduce target flaws without introducing unacceptable regressions. We evaluate HarnessFix on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA and AppWorld. Across these benchmarks, HarnessFix improves held-out test performance over the initial harnesses by 15.2%–50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across ETCLOVG layers.
[MA-3] DAST: A VLM-LLM Framework for Cross-Interface Anomaly Detection in O-RAN
[Quick Read]: This paper addresses the new security threats that Open RAN (O-RAN) faces because of its disaggregated architecture and multi-vendor composition, in particular Denial-of-Service (DoS) and performance-degradation attacks, which are difficult to detect. Traditional Time-Series Anomaly Detection (TSAD) methods perform poorly when labelled baselines are scarce, attack patterns evolve faster than detectors can be retrained, and high-dimensional multivariate telemetry overwhelms monolithic inference models. The key to the solution is DAST, a zero-shot multi-agent framework that chains a three-stage VLM → LLM → VLM pipeline: it first converts multivariate KPI streams into visual representations, then scores textual per-interface descriptions against O-RAN domain knowledge, and finally verifies suspected anomalies on high-resolution heatmaps. It outputs the problematic interfaces, the anomalous time intervals, an indicative operational impact rating aligned with O-RAN WG11, and the decision rationale. On real traces collected from an O-RAN testbed, DAST achieves an F1-score of 0.910 and an accuracy of 0.843, outperforming state-of-the-art TSAD baselines.
Link: https://arxiv.org/abs/2606.06261
Authors: Francesco Spinelli,Esteban Municio,Pau Baguer,Gines Garcia-Aviles,Xavier Costa-Perez
Affiliations: i2CAT Foundation; NEC Laboratories Europe; ICREA
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments: 7 pages, 5 figures. This work has been submitted to the IEEE for possible publication
Abstract:O-RAN enables a disaggregated baseband stack with programmable functions that communicate over standardized open interfaces. The same openness that enables multi-vendor composition also expands the attack surface across logically decoupled tiers that make up the compute continuum. Among these threats, Denial-of-Service and performance-degradation attacks, which account for the majority of catalogued O-RAN threats, are particularly difficult to detect. Traditional Time-Series Anomaly Detection (TSAD) methods fail in this new regime where labelled baselines are scarce, threats evolve faster than detectors can be retrained, and the high-dimensional multivariate telemetry overwhelms monolithic inference models. To address these challenges, we present DAST, a zero-shot multi-agent framework for cross-interface anomaly detection in O-RAN that chains a three-stage VLM → LLM → VLM pipeline. DAST converts multivariate KPI streams into visual representations, scores textual per-interface descriptions against O-RAN domain knowledge, and verifies suspects on high-resolution heatmaps to output the problematic interfaces, the anomalous time intervals, an indicative O-RAN WG11-aligned operational impact rating and the decision rationale. We evaluate DAST on real network traces collected from an O-RAN testbed under representative performance degradation scenarios, achieving 0.910 F1-Score and 0.843 Accuracy, outperforming state-of-the-art TSAD baselines.
[MA-4] A Swarm Approach to Public Transit Using On-demand Routing in a Slime-Mold-Inspired Framework
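[Quick Read]: This paper addresses the high operating costs and low reliability that commonly affect demand-responsive transit (DRT) systems in low-density areas. Its core solution is to replace the traditional centralized, manually scheduled scheme with a distributed, dynamically routed system in which a slime-mold-inspired routing algorithm routes multiple vehicles to maximize network effectiveness. Dynamic transfers are introduced to improve efficiency further. All passenger allocation and dynamic transfers are handled by the vehicles themselves through a continual cooperative bidding process, which makes decision-making decentralized. Simulations in suburban, urban, and semi-rural scenarios show that, relative to a fixed-network approach, passenger delivery rates increase by 28%, 49%, and 101%, respectively, and walking time is reduced by more than 75% in all cases.
Link: https://arxiv.org/abs/2606.06189
Authors: Lindsay Burke,Maxfield Comstock,Jason Graham,Ruth Malenda,Simon Garnier,Petras Swissler
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: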
Abstract:Demand-responsive transit (DRT) is a flexible alternative to traditional, fixed-route mass-transit networks. Although DRT can function well in low-density communities, high operating costs and low reliability are common issues. We propose that these issues can be mitigated by moving from a centralized, manually-scheduled scheme to a distributed system capable of dynamically routing multiple vehicles using a slime-mold-inspired routing algorithm to maximize network effectiveness. We additionally introduce the method of dynamic transfers to further optimize transit network efficiency. All passenger allocation and dynamic transfers are handled via a continual cooperative bidding process by the buses. In this paper, we present simulated results for a swarm-driven transit network in suburban, urban, and semi-rural scenarios, using map networks pulled from OpenStreetMap. We show that our approach increases passenger delivery rates relative to a fixed-network approach by 28%, 49%, and 101%, respectively, and results in over 75% reduction in walking time in all cases.
[MA-5] Learning to Contest: Decentralized Robust Fairness in Cooperative MARL via Cross-Attention
[Quick Read]: This paper addresses the exploitability of fair cooperative multi-agent reinforcement learning (Fair-MARL): in a team that maximizes egalitarian welfare, a single selfish agent can free-ride on the surplus that fair agents forgo to raise the worst-off. A centralized need-based allocator removes the problem, but only by taking allocation out of the agents' hands, so it does not provide decentralized robustness. The key insight is that this futility is an artifact of all-or-nothing contention. Under graded contention, where a contested resource delivers 1−c and wastes c, a worst-off cooperator that contests a free-rider strictly improves on yielding (Proposition 1), so decentralized policies do have leverage. To realize this leverage, the author proposes CAN, a permutation-equivariant cross-attention policy over agents' observed behaviour that infers the number of free-riders and responds proportionally: turn-taking when there are none, and contesting just enough when there are some. Trained against an adversarial league (PSRO), CAN keeps best-response exploitability low (ρ ≈ 1.2–1.5, versus ρ = N when unprotected) across the contention range, wastes almost nothing at D = 0 (efficiency ≈ 1.0), and retains most of its efficiency at D ≥ 1 (0.83–0.96), approaching the centralized oracle without a central allocator. Existing fair learners fail on complementary axes (GGF/FEN yield and are exploitable; SOTO always contests and wastes resources), whereas CAN is both efficient and robust. Its robustness, however, holds only in proportion to the contest leverage: it is strong on a multi-server game, partial when leverage weakens, and absent under winner-take-all, where Proposition 1 fails. The paper also reports fragilities: weak leverage and zero-shot transfer to larger teams degrade performance at high contention.
Link: https://arxiv.org/abs/2606.06162
Authors: Can Savcı
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
Comments: 9 pages, 8 figures
Abstract:Fair cooperative multi-agent RL (MARL) teams maximizing egalitarian welfare are exploitable: a single selfish agent free-rides on the surplus fair agents forgo to raise the worst-off. A centralized need-based allocator removes it, but only by taking allocation out of agents’ hands; whether decentralized policies can be robust was left open. We show this futility is an artifact of all-or-nothing contention. Under graded contention (a contested resource delivers 1−c, wasting c), we prove that for any c < 1 a worst-off cooperator that contests a free-rider strictly improves on yielding, so decentralized leverage exists (Prop. 1). Realizing it is a coordination problem under uncertainty: the number of free-riders is unknown and variable, so any fixed rule is dominated. We introduce CAN, a permutation-equivariant cross-attention policy over agents’ observed behaviour that infers the number of free-riders and responds proportionally: turn-taking when none, contesting just enough when some. Trained against an adversarial league (PSRO), CAN keeps best-response exploitability low (ρ ≈ 1.2–1.5, vs. ρ = N unprotected) across the contention range, wasting almost nothing at D = 0 (efficiency ≈ 1.0) and retaining most of it at D ≥ 1 (efficiency 0.83–0.96), approaching the centralized oracle on both axes, no central allocator. Fair-MARL learners fail on complementary axes (GGF/FEN yield and are exploitable, SOTO all-contests and wastes), while CAN is both. On two further games we find clear scope, not blanket generality: CAN stays efficient and Pareto-dominates the fair learners, but its robustness holds only in proportion to the contest leverage: strong on a multi-server game, partial when it weakens, absent under winner-take-all (Prop. 1 fails). We also report its fragilities: weak leverage and zero-shot transfer to larger teams degrade it at high contention.
[MA-6] Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies
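[Quick Read]: This paper addresses how to generate safe, dynamically feasible actions in cooperative multi-agent tasks, especially in settings that combine long-horizon planning with short-horizon replanning. Multi-agent reinforcement learning (MARL) can learn cooperative policies from discrete, non-differentiable rewards over a long planning horizon but does not strictly enforce dynamic constraints, whereas model predictive control (MPC) provides safe, dynamically feasible actions with fast replanning but only over short horizons. The key to the solution is to merge the two: the paper proposes multi-agent actor-critic model predictive control (MA-AC-MPC), which extends actor-critic model predictive control to MARL so that a learned long-horizon strategy is combined with model-based short-horizon replanning. The algorithm is demonstrated on a multi-agent pursuit-evasion scenario and on a heterogeneous task in which a drone and an omni-wheeled rover cooperate to land; in hardware, MA-AC-MPC achieves a 100% landing success rate, compared with 60% for the multi-layer perceptron baseline (MA-AC-MLP).
Link: https://arxiv.org/abs/2606.06011
Authors: Christian Llanes,Spencer W. Jensen,Samuel Coogan
Affiliations: Georgia Institute of Technology; Sandia National Laboratories
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 12 pages, 8 figures, 7 tables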
Abstract:In this work, we propose a framework that combines multi-agent reinforcement learning (MARL) with model-based control to achieve safe, dynamically feasible actions in cooperative multi-agent tasks. Multi-agent reinforcement learning provides the advantage of learning cooperative policies for multi-agent teams from discrete non-differentiable rewards in a long planning horizon. Model-predictive control is robust and offers safe, dynamically feasible actions in a fast replanning framework for short horizons. We propose an algorithm that extends actor-critic model predictive control for MARL which we refer to as multi-agent actor-critic model predictive control (MA-AC-MPC). We demonstrate the capabilities of this algorithm by applying it to a multi-agent pursuit-evasion scenario. Specifically, we compare the evader team’s strategy using the MA-AC-MPC model and a multi-layer perceptron model (MA-AC-MLP). The pursuer team uses augmented proportional navigation as it is accepted as an advanced adversarial control law. We also provide an example with a heterogeneous environment where a drone and omni-wheeled rover cooperate to achieve repeatable and successful landing with 100% success rate in hardware for MA-AC-MPC compared to 60% for MA-AC-MLP. We demonstrate the robustness of the proposed MA-AC-MPC algorithm in hardware for both environments.
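[MA-7] ZERO-APT: A Closed-Loop Adversarial Framework for LLM-Driven Automated Penetration Testing under Intelligent Defense
[Quick Read]: This paper addresses three shortcomings of current LLM-driven automated penetration testing agents: lack of realism, unstable causal consistency of multi-step attack chains, and decisions that cannot be audited. Existing work usually patches these in isolation. The paper proposes ZERO-APT, a turn-based attacker-defender-judge framework that handles all three within a single architecture. For realism, it embeds a configurable LLM Defender that consumes Sysmon telemetry and detects attacks in real time, so the attacker faces a live opponent rather than a static target. For consistency, it separates planning from execution, adds multi-dimensional ReAct feedback, and filters the action library with hard constraints, moving causal consistency out of unstable LLM reasoning and into enforced system architecture. For auditability, a dedicated Judge agent adjudicates each round, maintains global state, and emits structured post-hoc cyber threat intelligence (CTI) reports that make every decision traceable. On a Windows Server 2022 post-exploitation prototype, ZERO-APT reaches a 79% attack success rate (Aurora 22%, PentestGPT 39%) and a Causal Consistency Score of 0.860 (Aurora 0.930, Claude Code 0.520), with end-to-end decision auditability. The benchmark is released to support evaluation of penetration agents under intelligent defense.
Link: https://arxiv.org/abs/2606.05567
Authors: Anlan Zheng,Tiantian Zhu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: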
Abstract:LLM-driven automated penetration testing agents are typically evaluated against static targets that neither detect nor respond to attacks, so their behavior under intelligent defense remains untested. The causal consistency of multi-step attack chains likewise hinges on unstable LLM reasoning, and agent decisions remain opaque to human analysts. These three shortcomings, in realism, consistency, and auditability, are usually patched in isolation. We present ZERO-APT, a turn-based attacker-defender-judge framework that addresses them within a single architecture. For realism, ZERO-APT embeds a configurable LLM Defender that consumes Sysmon telemetry and detects attacks in real time, exposing the attacker to a live opponent rather than a passive target. For consistency, three architectural mechanisms move causal consistency from unstable LLM reasoning into enforced system architecture: separation of planning from execution, multi-dimensional ReAct feedback, and a hard-constraint-filtered action library. For auditability, a dedicated Judge agent adjudicates each round, maintains global state, and emits structured post-hoc CTI reports that make every decision traceable. We evaluate a Windows Server 2022 post-exploitation prototype across five scenarios with three Defender configurations. ZERO-APT reaches 79% attack success rate (Aurora 22%, PentestGPT 39%), a Causal Consistency Score of 0.860 (Aurora 0.930, Claude Code 0.520), and end-to-end decision auditability through structured CTI reports. We release the benchmark to support evaluation of penetration agents under intelligent defense.
[MA-8] SHIELDS: Automating OS Hardening with Iterative Multi-Agent Remediation
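[Quick Read]: This paper addresses the long-standing problem of operating-system compromise caused by security misconfigurations, and in particular the tedious and expensive process of manually keeping systems compliant with standards such as the Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs). Existing compliance automation tools reduce some of this burden but depend on static, pre-written corrective actions. The paper proposes SHIELDS, an LLM-based multi-agent system that treats OS hardening as an iterative, feedback-driven process: instead of applying fixed remediations, it continuously proposes fixes and refines them based on feedback from execution on the target system and from validation scans. Evaluated across multiple virtual machine configurations with six contemporary LLMs ranging from 20B to 400B parameters, SHIELDS remediates up to 73% of scan findings. The results also suggest that success depends less on model size than on effective tool use and information gathering, which offers a practical path for environments where compute is limited or where security and privacy needs require local models.
Link: https://arxiv.org/abs/2606.05476
Authors: Andrew Hamara,Dwight Horne,Aldehir Rojas,Timothy Kurniawan,Sophie Lamothe,Vishal Suresh,Nicholas Turoci,Lawrence Wong
Affiliations: Texas A&M University
Categories: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: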
Abstract:Security misconfigurations remain a leading cause of OS-level compromise, and manually keeping systems compliant with standards like Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs) is a tedious and expensive process. Existing compliance automation tools can reduce some of this burden, but they depend on static, pre-written corrective actions. In this paper, we introduce SHIELDS, a multi-agent system that uses large language models (LLMs) to approach OS hardening as an iterative, feedback-driven process. Instead of applying fixed remediations, SHIELDS continuously proposes fixes and refines them based on feedback from target system execution and validation scans. We evaluate the system across multiple virtual machine configurations using six contemporary LLMs ranging from 20B to 400B parameters, and find that SHIELDS successfully remediates up to 73% of scan findings. Our results also suggest that success in this setting depends less on model size (parameter count) than on effective tool use and information gathering, paving a practical path toward reducing the burden of security compliance in environments where compute is limited or security and privacy needs drive local model use.
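The propose-validate-refine loop described in this abstract can be sketched in a few lines. This is a toy illustration and not the SHIELDS system: the `scan`, `propose_fix`, and `remediate` functions, the setting names, and the form of the feedback are all hypothetical, and a deterministic stand-in replaces the LLM.

```python
def scan(system, policy):
    """Return the sorted names of settings whose value differs from the policy."""
    return sorted(k for k, want in policy.items() if system.get(k) != want)


def propose_fix(finding, feedback):
    """Stand-in for an LLM proposer (hypothetical).

    The first attempt is a naive guess; later attempts use the feedback
    recorded from the previous failed validation.
    """
    if feedback is None:
        return finding, "enabled"
    return finding, feedback["expected"]


def remediate(system, policy, max_rounds=3):
    """Scan, apply proposed fixes, record feedback, and repeat.

    Returns the number of open findings observed at the start of each round.
    """
    feedback = {}
    history = []
    for _ in range(max_rounds):
        findings = scan(system, policy)
        history.append(len(findings))
        if not findings:
            break
        for finding in findings:
            key, value = propose_fix(finding, feedback.get(finding))
            system[key] = value
            if system[key] != policy[finding]:
                # Validation failed: keep what the check expected for the next round.
                feedback[finding] = {"expected": policy[finding]}
    return history


# Made-up policy and system state for illustration.
policy = {"password_min_len": "15", "ssh_root_login": "disabled", "audit_logging": "enabled"}
system = {"password_min_len": "8", "ssh_root_login": "enabled", "audit_logging": "disabled"}
history = remediate(system, policy)
print(history)
```

The returned history makes the contrast with static remediation visible: the naive first-round fixes resolve only part of the findings, and the remaining ones close only after validation feedback is fed back into the proposer.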
[MA-9] Ahoy: LLMs Enacting Multiagent Interaction Protocols
[Quick Read]: This paper addresses the lack of generality in how agents enact interaction protocols in multiagent systems: existing approaches yield agent implementations that are specific to the selected protocols. The core challenge is to build intelligent agents that can enact protocols without programming. The key to the solution is Ahoy, which combines large language models (LLMs) with declarative protocols so that agents dynamically select and enact the protocols needed to achieve user goals, without specialized training. The authors demonstrate that an Ahoy agent can enact multiple protocols correctly, and concurrently when appropriate to the user goal.
Link: https://arxiv.org/abs/2606.05390
Authors: Omkar Joshi,Munindar P. Singh,Amit K. Chopra
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: Presented at EMAS 2026
Abstract:An interaction protocol formalizes how the agents in a multiagent system interact, which facilitates implementing agents. Existing approaches yield agent implementations specific to the selected protocols. How can we engineer intelligent agents that can enact protocols but are programming-free? Our contribution, Ahoy, addresses this question by creating LLM agents that dynamically select and enact declarative protocols to achieve user goals. We demonstrate that an Ahoy agent can correctly and intelligently enact multiple protocols - concurrently if appropriate to the user goal - without specialized training. Ahoy’s significance lies in that it brings together declarative protocols and LLMs, both approaches that promise improved knowledge engineering for agents.
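[MA-10] RAINO: Anchoring Agents in Reality A Systematic Review and Conceptual Framework for Realism in Agent-Based Modelling
[Quick Read]: This paper addresses the lack of a systematic conceptual framework and a clear operationalization for realism, a central concept in agent-based modelling (ABM). Its systematic literature review finds that realism is often poorly defined and that the methods used to achieve and demonstrate it are rarely justified. In response, the paper proposes the Reality Anchor, Input, Output (RAINO) framework, which identifies the key structures used to argue for realism: Reality Anchors (such as empirical data, formal theory, expert knowledge, and common-sense expectations) and their application as model input or output. RAINO explains why different assessors may judge the realism of the same model differently and shows how this broader framing can lead to significantly different approaches to model development.
Link: https://arxiv.org/abs/2606.05167
Authors: Loïs Vanhée,Melania Borit
Affiliations: Umeå University; Norwegian Research Council
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted in the Social Simulation Conference 2025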
Abstract:Realism is a central yet seemingly under-theorized concept in Agent-Based Modelling. This paper presents a Systematic Literature Review, aiming to identify how realism is currently operationalized and demonstrated. The results show that realism is often poorly defined and lacks a consistent conceptual framework. A wide variety of methods are used to achieve and demonstrate realism, but explanations of whether and why these methods are appropriate for their intended purposes are generally limited. Building on this review, we introduce the Reality Anchor, Input, Output (RAINO) framework. RAINO identifies the key structures used to argue for realism in Agent-Based Models, consisting of Reality Anchors (e.g., empirical data, formal theory, expert knowledge, common-sense expectations) and their application as model Input or Output. RAINO broadens existing perspectives on how realism is framed. It explains why different assessors may evaluate the realism of a model in different ways, and it shows how this broader framing can lead to significantly different approaches to model development.
Natural Language Processing
[NLP-0] Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution
[Quick Read]: This paper addresses the difficulty code language models have with repository-level context such as imports, API usage, and project conventions. Existing methods either inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or rely on per-repository fine-tuning and LoRA training, which is costly at repository scale and brittle to evolving codebases. The key to the solution is Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters and thereby injects repository knowledge with zero inference-time token overhead. It supports two scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter for stable codebases, while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state that is updated with each code diff, for codebases under active development. For evaluation, the authors build RepoPeftBench, a benchmark of 604 Python repositories with a static track and an evolution track. Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match on the static track, matching the per-repository LoRA upper bound; Code2LoRA-Evo achieves 60.3% cross-repo exact match on the evolution track, 5.2 percentage points above a single shared LoRA.
Link: https://arxiv.org/abs/2606.06492
Authors: Liliana Hotsko,Yinxi Li,Yuntian Deng,Pengyu Nie
Affiliations: University of Waterloo
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA – costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). Code2LoRA’s code can be found at this https URL; the model checkpoints and RepoPeftBench datasets can be found at this https URL.
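To make the notion of a hypernetwork that emits LoRA adapters concrete, here is a minimal pure-Python sketch. It is not the Code2LoRA architecture: the single linear map per factor, the dimensions, the random initialization, and the names `ToyHypernetwork` and `adapted_forward` are assumptions for illustration. The only standard ingredient is the LoRA form itself, where a frozen weight W is adjusted by a low-rank product B·A.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def reshape(flat, rows, cols):
    """Reshape a flat list into a rows x cols matrix."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


class ToyHypernetwork:
    """Maps a repository embedding to LoRA factors A (rank x d_in) and B (d_out x rank)."""

    def __init__(self, embed_dim, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # One linear map per factor (an illustrative choice, not the paper's design).
        self.to_a = [[rng.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(rank * d_in)]
        self.to_b = [[rng.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(d_out * rank)]

    def generate(self, repo_embedding):
        a = reshape(matvec(self.to_a, repo_embedding), self.rank, self.d_in)
        b = reshape(matvec(self.to_b, repo_embedding), self.d_out, self.rank)
        return a, b


def adapted_forward(w, a, b, x):
    """Compute W x + B (A x): the frozen layer plus the low-rank update."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [u + v for u, v in zip(base, delta)]


hyper = ToyHypernetwork(embed_dim=4, d_in=3, d_out=2, rank=1)
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen base weight (made up)
x = [1.0, 2.0, 3.0]                      # layer input (made up)

a, b = hyper.generate([0.5, -0.2, 0.1, 0.9])   # hypothetical repository embedding
y_adapted = adapted_forward(w, a, b, x)

a0, b0 = hyper.generate([0.0, 0.0, 0.0, 0.0])  # zero embedding gives a zero adapter
y_base = adapted_forward(w, a0, b0, x)
print(y_base, y_adapted)
```

The repository knowledge enters through the generated weights rather than through extra prompt tokens, which is the sense in which such an adapter adds no inference-time token overhead.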
[NLP-1] Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection
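[Quick Read]: This paper addresses the limited understanding that existing AI-text detection benchmarks provide of how AI authorship signals evolve during progressive human-AI co-editing. Current benchmarks focus mainly on final outputs and cannot show how the proportion of AI content, the type of edit operation, the domain, and the cumulative revision history affect detection. The paper proposes OpAI-Bench, an operation-guided, multi-granularity benchmark. Starting from human-written documents, it applies five representative AI edit operations under predefined AI coverage levels to construct nine sequentially revised versions of each sample, covering four domains and preserving complete authorship provenance at the document, sentence, token, and span levels. Its key contribution is a controlled and traceable setting for progressive co-editing, which reveals that detection difficulty does not change monotonically with AI involvement: mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints. The benchmark supports evaluation with 8 document-level, 7 sentence-level, and 2 fine-grained token/span-level detectors, and provides a testbed for analyzing whether, when, and how AI-assisted writing becomes detectable.
Link: https://arxiv.org/abs/2606.06481
Authors: Sondos Mahmoud Bsharat,Jiacheng Liu,Xiaohan Zhao,Tianjun Yao,Xinyi Shang,Yi Tang,Jiacheng Cui,Ahmed Elhagry,Salwa K. Al Khatib,Hao Li,Salman Khan,Zhiqiang Shen
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; University College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Our code and data are available at this https URL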
Abstract:As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for studying progressive human-to-AI text transformation across document, sentence, token, and span granularities. Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors. Experiments reveal that AI-text detectability is governed not only by the proportion of AI-edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints, exposing non-monotonic detection patterns missed by existing benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at this https URL.
[NLP-2] Self-Augmenting Retrieval for Diffusion Language Models ICML2026
[Quick Read]: This paper addresses retrieval-augmented generation for discrete diffusion language models on complex reasoning tasks such as multi-hop question answering, where it is difficult to obtain key evidence before the output is finalized. The proposed solution is Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a training-free dynamic RAG framework applicable to any reasoning-capable discrete diffusion language model. The key idea is to use the low-confidence candidate tokens that are discarded during denoising as a lookahead signal: although they are not committed to the output, they often surface salient entities early in the denoising trajectory and can therefore guide retrieval. SARDI uses these lookahead tokens to trigger retrieval during denoising, which allows stronger evidence to be retrieved earlier. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to 8× higher throughput, and it is retriever-agnostic.
Link: https://arxiv.org/abs/2606.06474
Authors: Paul Jünger,Justin Lovelace,Linxi Zhao,Dongyoung Go,Kilian Q. Weinberger
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2026
Abstract:Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens often surface salient entities early in the denoising trajectory, enabling retrieval of stronger evidence before the output is finalized. We exploit this through Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to 8× higher throughput.
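The commit-or-discard step and the reuse of discarded tokens as a retrieval query can be sketched as follows. This is a toy illustration rather than SARDI itself: the confidence threshold, the per-position prediction format, the keyword-overlap retriever, and the sample data are all assumptions.

```python
def denoise_step(predictions, threshold):
    """Split one step's predictions {position: (token, confidence)}.

    Confident tokens are committed to the output. Low-confidence tokens are
    not committed, but they are returned as lookahead terms for retrieval.
    """
    committed = {p: tok for p, (tok, conf) in predictions.items() if conf >= threshold}
    lookahead = [tok for p, (tok, conf) in sorted(predictions.items()) if conf < threshold]
    return committed, lookahead


def retrieve(query_terms, corpus, k=1):
    """Toy retriever: rank documents by how many query terms they contain."""
    terms = {t.lower() for t in query_terms}
    ranked = sorted(corpus, key=lambda doc: len(terms & set(doc.lower().split())), reverse=True)
    return ranked[:k]


# Tentative predictions for six masked positions at an early step (made up).
predictions = {
    0: ("The", 0.95),
    1: ("director", 0.90),
    2: ("Nolan", 0.35),
    3: ("was", 0.92),
    4: ("born", 0.88),
    5: ("London", 0.30),
}
corpus = [
    "Christopher Nolan was born in London in 1970",
    "Paris is the capital of France",
    "The Nile flows through Egypt",
]

committed, lookahead = denoise_step(predictions, threshold=0.5)
top = retrieve(lookahead, corpus)
print(committed, lookahead, top)
```

In this toy run the entity tokens fall below the threshold and are not written to the output, yet they are precisely the terms that select the relevant document, which mirrors the lookahead argument made in the abstract.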
[NLP-3] MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery
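[Quick Read]: This paper addresses three bottlenecks of existing machine learning engineering (MLE) agents on long-horizon tasks: inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. The key to the solution is MLEvolve, an LLM-based self-evolving multi-agent framework built on three mechanisms. First, it extends tree search to Progressive MCGS, which enables cross-branch information flow through graph-based reference edges and uses an entropy-inspired progressive schedule to shift the search gradually from broad exploration to focused exploitation. Second, it introduces Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for retrieving and reusing task-specific experience. Third, it decouples strategic planning from code generation and adds adaptive coding modes for stable long-horizon iteration. On MLE-Bench, MLEvolve achieves state-of-the-art average medal rate and valid submission rate under a 12-hour budget (half the standard runtime), and it also outperforms specialized algorithm discovery methods, including AlphaEvolve, on mathematical algorithm optimization tasks, which demonstrates strong cross-domain generalization.
Link: https://arxiv.org/abs/2606.06473
Authors: Shangheng Du,Xiangchao Yan,Jinxin Shi,Zongsheng Cao,Shiyang Feng,Zichen Liang,Boyuan Sun,Tianshuo Peng,Yifan Zhou,Xin Li,Jie Zhou,Liang He,Bo Zhang,Lei Bai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: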
Abstract:Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at this https URL.
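[NLP-4] You Only Index Once: Cross-Layer Sparse Attention with Shared Routing
[Quick read]: This paper addresses the decoding-efficiency bottleneck that constrains long-context inference in modern large language models (LLMs), especially reasoning-heavy tasks that generate long chains of thought. Existing sparse attention methods trade efficiency against quality: structured block-sparse methods accelerate well but lose noticeable accuracy, whereas token-sparse methods stay accurate but deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. The paper proposes Cross-Layer Sparse Attention (CLSA), built on KV-sharing architectures such as YOCO. Its core idea is to share not only the key-value (KV) cache across cross-decoder layers but also the routing index: a single indexer computes token-level top-k selection once and every layer reuses the result, which preserves the fine-grained selectivity of token-sparse attention while amortizing the routing cost. The design jointly improves the three main inference bottlenecks: pre-filling, KV-cache storage, and long-context decoding. Experiments on short- and long-context benchmarks show that CLSA is both accurate and efficient, with up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context.
Link: https://arxiv.org/abs/2606.06467
Authors: Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei
Affiliations: Microsoft Research; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: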
Abstract:Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.
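The shared-routing idea in the abstract above can be illustrated with a toy, single-head sketch in plain Python. It is not the authors' implementation: the dot-product indexer, the function names (`top_k_indices`, `sparse_attention`, `decode_step`) and the use of one query vector per layer are all assumptions made for illustration. The only point it demonstrates is that the top-k selection is computed once per decoding step and then reused by every layer that reads the shared KV cache.

```python
import math

def top_k_indices(index_query, keys, k):
    """Score every cached key once and return the k best positions (hypothetical indexer)."""
    scores = [sum(q * x for q, x in zip(index_query, key)) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected cache positions."""
    scale = math.sqrt(len(query))
    logits = [sum(q * x for q, x in zip(query, keys[i])) / scale for i in selected]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
            for d in range(dim)]

def decode_step(layer_queries, index_query, shared_keys, shared_values, k):
    """Route once, then let every layer attend over the same selected positions."""
    selected = top_k_indices(index_query, shared_keys, k)  # computed a single time
    outputs = [sparse_attention(q, shared_keys, shared_values, selected)
               for q in layer_queries]                     # reused by each layer
    return selected, outputs
```

In this sketch the routing cost is one pass over the cache per step, regardless of how many layers consume the result; a per-layer token-sparse design would repeat that pass for every layer.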
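[NLP-5] Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?
[Quick read]: This paper examines why adults struggle to identify conjunctive causal rules, where an effect requires several causes to be present at once, while doing better with disjunctive rules. Earlier evidence for this "conjunctive handicap" came mostly from passive-observation paradigms with limited evidence and no learner control over how evidence is generated. The authors instead use an active-exploration paradigm: in a modified "blicket detector" task, adult participants intervene freely to identify the causal objects under conjunctive or disjunctive rule structures. Giving learners agency substantially improves their conjunctive causal reasoning, which points to active exploration as a key factor in reducing the handicap, although conjunctive rules still take more tests to infer, so the bias does not vanish. The study also compares humans with a range of large language models (LLMs) on the same task: some state-of-the-art models approach human-level hypothesis-inference accuracy but explore less efficiently and show a similar conjunctive-disjunctive gap, suggesting limits in current models' active learning.
Link: https://arxiv.org/abs/2606.06464
Authors: Mandana Samiei, Eunice Yiu, Anthony GX-Chen, Dongyan Lin, Jocelyn Shen, Blake A. Richards, Alison Gopnik, Doina Precup
Affiliations: Mila - Quebec AI Institute; McGill University; University of California Berkeley; New York University; Meta FAIR; MIT Media Lab; Montreal Neurological Institute
Categories: Computation and Language (cs.CL)
Comments: Accepted at the 48th Annual Conference of the Cognitive Science Society (CogSci 2026)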
Abstract:A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this "conjunctive handicap" rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified "blicket detector" task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults’ conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.
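A small simulation may help show why conjunctive rules can need more tests than disjunctive ones. The sketch below is an illustration only, not the study's task: it assumes a detector that fires when at least one blicket is present (disjunctive) or at least two are present (conjunctive), and the function names are hypothetical. It enumerates which sets of blickets remain consistent with a series of tests.

```python
from itertools import combinations

def detector_fires(placed, blickets, rule):
    """Assumed rules: disjunctive needs one blicket on the detector, conjunctive needs two."""
    count = len(set(placed) & set(blickets))
    if rule == "disjunctive":
        return count >= 1
    if rule == "conjunctive":
        return count >= 2
    raise ValueError(f"unknown rule: {rule}")

def consistent_hypotheses(objects, observations, rule):
    """Return every candidate blicket set that predicts all observed outcomes."""
    remaining = []
    for size in range(len(objects) + 1):
        for candidate in combinations(objects, size):
            if all(detector_fires(placed, candidate, rule) == fired
                   for placed, fired in observations):
                remaining.append(set(candidate))
    return remaining
```

With three objects of which two are blickets, testing each object alone identifies the blickets under the disjunctive rule, but under the conjunctive rule no single-object test ever fires, so all eight candidate sets stay consistent and combination tests are still required.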
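[NLP-6] Scaffold Not Vocabulary? A Controlled Two-Tier Pre-Registered Study of a Popperian Code-Generation Skill
[Quick read]: This paper tackles a causal-attribution problem raised by the widespread use of prompt "skills" in LLM code generation: when a skill is framed as Popperian falsificationism, does any benefit come from its philosophical content or merely from the scaffold that the prompt structure provides? The study uses a pre-registered two-tier ablation to separate structure from content, with a length-matched placebo, a labels-only scaffold that keeps the Popperian headers but removes the procedure, an execution oracle based on HumanEval+ unit tests, a vocabulary-halo sentinel, and a same-model self-judge audit. On a frontier model (Claude Sonnet 4.6), all conditions sit near the benchmark ceiling and cannot be separated, a ceiling-limited non-detection. On a small model (Qwen2.5-Coder-0.5B), structured prompts raise best-of-eight correctness by about 20-22 percentage points, but the full skill is not statistically separable from the labels-only scaffold (aggregate F@8=L@8 vs V@8=34.8%), and the placebo trails by only 2.4 points. A 0.5B self-judge applying the Popperian rubric concentrates 60% of its picks on a single index and does not beat random selection. In both settings, then, the Popperian skill adds no execution-correctness benefit beyond the labels-only scaffold, so the gains track prompt structure rather than philosophical content. The paper contributes a calibrated negative result and a reusable disambiguation protocol; the finding bounds an engineering claim about one prompt-skill family and is not a verdict on Popperian methodology in general.
Link: https://arxiv.org/abs/2606.06454
Authors: Mehmet Iscan
Affiliations: PythaLab, Yıldız Technical University; Yıldız Technical University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 34 pages, 5 figures, 8 tables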
Abstract:Large language models increasingly write, review, and judge code, and a fast-growing practice equips them with prompt ‘skills’ that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsificationist, and such skills are reported to improve generated code. But these gains are almost always read off an LLM-as-a-judge, an instrument with documented positional, self-preference, and stylistic biases. We ask: if it appears to help, is the gain from the skill’s Popperian content, or from the structure any scaffold imposes? We pre-register a two-tier ablation with three controls: a length-matched placebo, a labels-only scaffold that keeps the Popperian headers but strips the procedure, and an execution oracle (HumanEval+ unit tests), plus a vocabulary-halo sentinel and a same-model self-judge audit. On a frontier model (Claude Sonnet 4.6, N=163) all conditions sit near the benchmark ceiling and do not separate, so the pre-registered +5-point improvement is not supported (a ceiling-limited non-detection). On a small model (Qwen2.5-Coder-0.5B, N=164) structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over a labels-only scaffold (aggregate F@8=L@8 vs V@8=34.8%), and the placebo trails by only 2.4 points. A 0.5B self-judge applying the Popperian rubric does not beat random selection and concentrates 60% of its picks on one index. In the two settings tested, the skill’s Popperian procedural content adds no separable execution-correctness benefit beyond a labels-only scaffold, so the gains track scaffold structure. We contribute a calibrated negative result and a reusable disambiguation protocol; the finding bounds an engineering claim about one prompt-skill family and is not an evaluation of Popperian methodology in general.
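[NLP-7] Latent Reasoning with Normalizing Flows
[Quick read]: This paper addresses the inefficiency of chain-of-thought (CoT) reasoning, which routes intermediate computation through a discrete, serial, communication-oriented token stream: every reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed. It proposes NF-CoT, a latent reasoning framework based on normalizing flows (NF) that represents intermediate reasoning as compact continuous states inside the LLM. A TARFlow-style normalizing flow embedded in the model backbone defines a tractable probability model over compact continuous thoughts distilled from explicit CoT. The design keeps the left-to-right generation of standard autoregressive language models, remains compatible with KV-cache decoding, provides exact likelihoods, and supports policy-gradient optimization directly in the latent space. On code-generation tasks, NF-CoT achieves higher pass rates than explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.
Link: https://arxiv.org/abs/2606.06447
Authors: Guancheng Tu, Xiangjun Fu, Suhao Yu, Yao Tang, Haoqiang Kang, Lianhui Qin, Yizhe Zhang, Jiatao Gu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: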
Abstract:Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and communication-oriented token stream: each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed. Latent reasoning offers a higher-bandwidth alternative by performing intermediate computation in compact continuous states before committing to text. Yet existing latent-reasoning methods often sacrifice key advantages that make CoT effective in autoregressive language models, including native left-to-right generation, probabilistic sampling, compatibility with KV-cache decoding, and tractable likelihood estimation. We propose NF-CoT, a latent reasoning framework that preserves these advantages by modeling continuous thoughts with normalizing flows. NF-CoT instantiates a TARFlow-style normalizing flow inside the LLM backbone, defining a tractable probability model over compact continuous thoughts distilled from explicit CoT. Continuous-thought positions are generated by an NF head, while text positions are generated by the standard LM head within the same causal stream. This design provides exact likelihoods for latent thoughts, enables probabilistic left-to-right decoding with the original KV cache, and supports direct policy-gradient optimization in the latent reasoning space. On code-generation benchmarks, NF-CoT improves pass rates over explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.
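[NLP-8] Revising Context Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions
[Quick read]: This paper asks how accurate and how context-sensitive large language models (LLMs) are when they simulate the stance of social media users: do the simulations reflect precise user-specific beliefs, or are they overly sensitive to semantically independent changes in the conversational context? It proposes an auditing framework based on counterfactual context revision. Given an original online conversation, the framework first infers the target user's stance, then re-simulates that stance after controlled revisions of the context (text-only revisions and multimodal revisions that incorporate meme-based context), and compares the strategies on average directional stance shift and stance transition rate. The results show effective and robust stance transitions for both text-only and multimodal revisions across different polarization-preference mechanisms. The work contributes an evaluation framework for the context sensitivity of LLM-based stance simulation and highlights both the promise and the risk of using LLMs to simulate online opinion dynamics.
Link: https://arxiv.org/abs/2606.06443
Authors: Xinnong Zhang, Wanting Shan, Hanjia Lyu, Zhongyu Wei, Jiebo Luo
Affiliations: unknown
Categories: Computation and Language (cs.CL); Multimedia (cs.MM); Social and Information Networks (cs.SI)
Comments: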
Abstract:Large language models are increasingly used to simulate social media users and infer how individuals may respond to online discussions. However, it remains unclear whether these simulations reflect precise user-specific beliefs or whether they are highly sensitive to semantically independent changes in conversational contexts. In this work, we study counterfactual context revision as a framework for auditing LLM-based stance simulation. Given an original online conversation, we first infer a target user’s stance toward a specific topic. We then apply controlled revision strategies to the conversational context and simulate the user’s stance again under the revised context. We compare text-only revision strategies with a multimodal one that incorporates meme-based context and evaluate two main effectiveness metrics, i.e., average directional stance shift and stance transition rate. The results reveal effective and robust stance transitions in both text-only and multimodal strategies across different polarization-preference mechanisms. Our study contributes an evaluation framework for understanding the context sensitivity of LLM-based stance simulation. More broadly, it highlights both the promise and risk of using LLMs to simulate online opinion dynamics.
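[NLP-9] Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation
[Quick read]: This paper addresses the limited zero-shot transfer of large language models (LLMs) on extremely low-resource languages, which stems from reliance on language-specific memorization. Existing approaches such as continued training or placing a grammar book in the context can produce translations, but they tend to overfit specific languages and generalize poorly to languages never seen before. The proposed solution uses reinforcement learning (RL) to teach the model a meta-skill: extracting and applying linguistic knowledge supplied in the context rather than memorizing particular languages. The reward is a surface-level translation metric (chrF), so no elaborate annotation is needed. Despite this lightweight reward, the RL-trained models translate completely unseen languages better than in-context learning or supervised fine-tuning, suggesting that outcome-based RL extends to learning languages from context.
Link: https://arxiv.org/abs/2606.06428
Authors: Hanxu Hu, Zdeněk Šnajdr, Pinzhen Chen, Jannis Vamvas, Rico Sennrich
Affiliations: University of Zurich; ETH Zurich; Queen’s University Belfast
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 2 figures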
Abstract:Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by undergoing continued training or even by encoding a grammar book in their context. However, both methods typically overfit specific languages, with limited zero-shot transfer at test time. To translate extremely low-resource languages at scale, we argue that LLMs must acquire the meta-skill of utilizing in-context linguistic knowledge rather than memorizing specific languages. In this paper, we propose a reinforcement learning (RL) approach to unseen language translation given rich linguistic context, using a surface-level translation metric (chrF) as the reward. Empirically, despite the lightweight reward, our RL-trained models effectively extract and apply relevant linguistic information from the provided context, leading to better translations on completely unseen languages than in-context learning or supervised fine-tuning. Our analyses suggest that outcome-based RL can extend beyond conventional reasoning tasks like math and coding to serve as a recipe for language learning from context.
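To make the "surface-level" reward concrete, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch under stated assumptions, not the metric implementation used in the paper: whitespace is simply dropped, n-gram orders with no n-grams are skipped, and the defaults (`max_n=6`, `beta=2.0`) follow the commonly cited chrF settings. The function names are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average n-gram precision and recall over orders 1..max_n, then combine as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```

A reward of this kind needs only a reference translation per training example, which is what makes it lightweight compared with learned or human-judged rewards.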
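[NLP-10] A Komi-Yazva–Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation
[Quick read]: This paper addresses machine translation from the endangered Komi-Yazva language into Russian, and in particular the difficulty of evaluating LLM translation under extremely low-resource conditions. Because parallel data is so scarce, conventional evaluation does not reliably reflect model quality, so a benchmark with an explicit evaluation protocol is needed. The paper builds the first Komi-Yazva–Russian parallel corpus and pairs it with a rigorous, reproducible protocol: story-level cross-validation, deterministic retrieval for few-shot prompting, strict validation of generated outputs, complementary reference-based and judge-based metrics, and story-level uncertainty estimates. This supports fair comparison in zero-shot and retrieval-based few-shot settings and shows that performance varies strongly by model family and prompting regime, and that conclusions depend materially on metric choice and failure handling. The work therefore offers both a corpus and a reproducible evaluation testbed for endangered-language machine translation.
Link: https://arxiv.org/abs/2606.06420
Authors: Petr Parshakov
Affiliations: HSE University, Perm, Russia; School of Management SKOLKOVO, Moscow, Russia
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 6 tables, 3 figures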
Abstract:We present the first Komi-Yazva–Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 narrative texts and is accompanied by documented provenance, sentence-level alignment, and story identifiers that enable leakage-aware evaluation. We use this setup to compare modern large language models on Komi-Yazva-to-Russian translation under severe parallel-data scarcity in zero-shot and retrieval-based few-shot regimes. The protocol includes story-level cross-validation, deterministic retrieval for few-shot prompting, strict validation of generated outputs, complementary reference-based and judge-based metrics, and story-level uncertainty estimates. Across models, LLMs produce non-trivial translations, but performance varies strongly by model family and prompting regime. Retrieval-based few-shot prompting consistently improves over zero-shot prompting, while gains beyond a small retrieved context remain limited. The results show that evaluative conclusions in this setting depend materially on metric choice and failure handling, so the paper frames the corpus as both a dataset contribution and a reproducible evaluation testbed for endangered-language machine translation.
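Story-level cross-validation, as described in the abstract above, keeps all sentences from one story on the same side of each split so that few-shot retrieval cannot leak test material. The sketch below shows one simple way to build such folds; the round-robin assignment and the function name `story_level_folds` are assumptions for illustration, not the paper's protocol.

```python
def story_level_folds(story_ids, num_folds):
    """Build (train, test) index lists so that no story is split across the two."""
    unique_stories = sorted(set(story_ids))  # sorted for a deterministic assignment
    fold_of_story = {story: i % num_folds for i, story in enumerate(unique_stories)}
    folds = []
    for fold in range(num_folds):
        test = [i for i, s in enumerate(story_ids) if fold_of_story[s] == fold]
        train = [i for i, s in enumerate(story_ids) if fold_of_story[s] != fold]
        folds.append((train, test))
    return folds
```

In a retrieval-based few-shot setup, the example pool for each test sentence would then be drawn only from that fold's `train` indices.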
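[NLP-11] CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments
[Quick read]: This paper addresses the weak collaboration commonly seen in multi-agent systems (MAS) built on large language models (LLMs). Agents that can solve tasks on their own often fail when working together because they lack collaborative competence: they do not establish common ground, maintain a shared understanding of the task, balance individual and collective goals, or repair misalignment as the interaction unfolds. Existing evaluations focus on task outcomes or on single-agent reasoning, planning, and tool use, and do not analyze the collaboration process itself. The paper proposes CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. The framework can capture the effects of different collaboration conditions, separate performance patterns across models, and reveal task-dependent effects of agent design.
Link: https://arxiv.org/abs/2606.06399
Authors: Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao
Affiliations: Northeastern University; Microsoft
Categories: Computation and Language (cs.CL)
Comments: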
Abstract:Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents’ ability to coordinate through text-based channels much as human teams do. Yet recent study suggests that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents’ collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents’ internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.
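[NLP-12] Humans ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration
[Quick read]: This paper addresses the inability of current LLM agents to build and align, during human-agent collaboration, mental models of their own reasoning, their partners' intentions, and shared goals. Today's agents are optimized mainly for task completion, and the community lacks human collaboration data that could guide them toward process-level collaborative competence. The paper presents ALMANAC, a dataset of action-level mental model annotations built from the Map Task, a classic dyadic routing task from social science. It contains 2,987 collaboration actions, each paired with theory-informed annotations of the participant's self-reasoning, perceived partner intent, and perceived team goal. The dataset serves as a benchmark for evaluating how well agents simulate human collaborative behavior and infer the underlying mental models.
Link: https://arxiv.org/abs/2606.06388
Authors: Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu, Toby Jia-Jun Li, Dakuo Wang, Bingsheng Yao
Affiliations: Northeastern University; University of Notre Dame; Adobe; Microsoft; University of Waterloo; Carnegie Mellon University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: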
Abstract:Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners’ intentions, and shared goals during the collaborative process. Today’s agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants’ self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans’ next-turn behavior and mental models. Our results demonstrate ALMANAC’s utility in evaluating models’ ability to simulate human collaborative behaviors and infer their underlying mental models.
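[NLP-13] EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading
[Quick read]: This paper addresses the fact that large language model (LLM) graders in rubric-based grading of student answers can predict scores yet make decisions that are neither interpretable nor consistent with the rubric. Existing credit-assignment and intervention methods were designed mainly for self-contained reasoning tasks such as mathematical reasoning; they cannot identify which grading step goes wrong or track how the model's belief about the final mark changes. The paper proposes Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework. In the first phase, EDIT-SFT uses internal model signals, namely the posterior belief over the final mark and input-grounding scores, to locate problematic reasoning steps and then revises only those local steps with the help of a rubric checklist. In the second phase, EDIT-RL applies belief-guided reward shaping, penalizing large harmful belief drifts while leaving room for helpful exploration. On two real-world, multi-subject grading benchmarks, EDIT consistently outperforms strong supervised fine-tuning and reinforcement learning baselines on both in-domain and out-of-domain data, and ablations confirm that the internal-state diagnostics drive the gains.
Link: https://arxiv.org/abs/2606.06350
Authors: Zhihao Wu, Linhai Zhang, Taiyi Wang, Runcong Zhao, Peter Andrews, Cesare Aloisi, Yulan He
Affiliations: King’s College London; University of Cambridge; AQA; The Alan Turing Institute
Categories: Computation and Language (cs.CL)
Comments: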
Abstract:Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed for self-contained reasoning tasks such as mathematics reasoning, struggle in this setting because they do not identify where grading reasoning goes wrong or how the model’s belief about the final mark changes during reasoning. We propose Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for training more rubric-faithful LLM graders. First, EDIT-SFT locates problematic reasoning steps using internal model signals: posterior belief over the final mark and input-grounding scores. It then revises only these local steps with help from a rubric checklist. Second, EDIT-RL calibrates the grader with belief-guided reward shaping, penalising large harmful belief drifts while still allowing helpful exploration. Experiments on two real-world, multi-subject grading benchmarks demonstrate that EDIT consistently outperforms strong supervised fine-tuning and reinforcement learning baselines on both in-domain and out-of-domain splits, with ablation studies confirming that internal-state diagnostics drive these gains.
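[NLP-14] “Chi nas dal soch el sent de legn” – Auditing Text Corpora for Lombard
[Quick read]: This paper addresses the shortage of high-quality data for low-resource languages in natural language processing (NLP), taking Lombard, a language continuum from Italy, as its case and focusing on training and evaluation for tasks such as machine translation (MT). Although large web-scraped parallel and monolingual corpora appear to exist, a manual audit shows that they are plagued by severe language misidentification, boilerplate text, and non-linguistic noise. The audit also finds conflicting orthographic systems and a strong representational bias: high-quality data is heavily skewed toward Western Lombard varieties, while Eastern varieties are left on the margins. The paper therefore argues for variety-aware, community-driven data curation rather than purely quantity-driven scraping.
Link: https://arxiv.org/abs/2606.06349
Authors: Edoardo Signoroni, Pavel Rychlý
Affiliations: Masaryk University
Categories: Computation and Language (cs.CL)
Comments: Submitted to TSD 2026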
Abstract:Several of the world’s languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.
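[NLP-15] Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance
[Quick read]: This paper addresses a central challenge in machine unlearning for language models: identifying and removing precisely the knowledge that should be forgotten from a forget sample while preserving the model's general capabilities. Existing methods either ignore that tokens in a sequence differ in their relevance to forgetting, or estimate that relevance with auxiliary models, heuristics, or external annotations. The paper instead defines forget-specificity through the conflict with the retain objective: a token is forget-specific to the extent that minimizing the forget loss on it does not conflict with retain optimality. This is formalized as a joint optimization over model parameters and token weights, and under a natural separation condition the objective is shown to recover the oracle forget-specific token support. The resulting lightweight framework, Alternating Token-Weighted Unlearning (ATWU), uses a simple linear scorer over hidden states to learn token forget-specificity jointly with the model parameters, without external token-level supervision. On TOFU and RWKU, ATWU achieves state-of-the-art forget-retain trade-offs, outperforming sample-level methods, probability-based token-weighting heuristics, and auxiliary-model-based approaches, and its learned scores align substantially better with ground-truth forget-specific spans, all with minimal computational overhead.
Link: https://arxiv.org/abs/2606.06320
Authors: Gizem Yüce, Giorgos Nikolaou, Nicolas Flammarion
Affiliations: EPFL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: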
Abstract:Machine unlearning aims to remove targeted knowledge from a trained model while preserving its general capabilities. For autoregressive language models, not all tokens in a forget sample are equally relevant to forgetting. Existing approaches either ignore this heterogeneity or rely on auxiliary models, heuristics, or external annotations to estimate each token’s relevance for forgetting. We instead characterize it through the interaction with the retain objective: a token is forget-specific to the extent that minimizing the forget loss on that token does not conflict with retain optimality. We formalize this perspective as a joint optimization problem over the model parameters and the token weights and show that, under a natural separation condition, the resulting objective recovers the oracle forget-specific token support. Motivated by this formulation, we introduce Alternating Token-Weighted Unlearning (ATWU), a lightweight framework that jointly learns token forget-specificity and model parameters during unlearning using a simple linear scorer over the hidden states, without external token level supervision. Across TOFU and RWKU, ATWU achieves state of the art forget-retain trade-offs, outperforming sample-level methods, probability-based token weighting heuristics, and auxiliary-model-based approaches. Moreover, the learned scores align substantially better with ground truth forget-specific spans, indicating that ATWU identifies semantically meaningful token level forgetting signals. Overall, our results suggest that retain conflict provides an effective criterion for identifying what language models should forget, enabling unsupervised learning of token level forget-specificity directly from model representations with minimal computational overhead.
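The abstract mentions a simple linear scorer over hidden states that weights tokens in the forget loss. The sketch below shows only that weighting step, under assumptions that are not taken from the paper: a sigmoid squashes the linear score into (0, 1), the weighted loss is normalized by the sum of the weights, and the names `token_scores` and `weighted_forget_loss` are hypothetical. The alternating optimization against the retain objective is not modeled here.

```python
import math

def token_scores(hidden_states, weight, bias=0.0):
    """Linear scorer followed by a sigmoid: one score in (0, 1) per token (assumed form)."""
    scores = []
    for state in hidden_states:
        logit = sum(w * h for w, h in zip(weight, state)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores

def weighted_forget_loss(token_losses, scores):
    """Average the per-token forget losses, weighted by the learned scores."""
    total = sum(scores)
    return sum(s * loss for s, loss in zip(scores, token_losses)) / total
```

With uniform scores this reduces to the ordinary mean over tokens; a scorer that concentrates weight on a few tokens makes the forget objective act mostly on those tokens.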
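[NLP-16] Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness
[Quick read]: This paper addresses factual sycophancy, where a language model abandons a correct, verifiable answer under social pressure. The commonly reported flip rate conflates two distinct mechanisms: the strength of the model's baseline preference for the truth (truth margin) and how far pressure shifts that preference (manipulation sensitivity). The paper decomposes factual sycophancy into these two channels and, using 56 open-weight models from 0.3B to 32B parameters and 13 manipulation types, analyzes how model size and instruction tuning affect each one. Vulnerability is governed mainly by size, but instruction tuning changes how size acts: small instruction-tuned models can become less robust, whereas large instruction-tuned models usually become more robust. Instruction tuning primarily increases truth margin, with a behavioral effect that depends on the manipulation type. Scaling also affects the channels differently: base models gain margin but become mildly more manipulation-sensitive, whereas instruction-tuned models gain margin faster and become less sensitive. Factual sycophancy is therefore not a single scalar property, and evaluations should report channel-specific, manipulation-specific, and size-conditioned robustness rather than flip rates alone.
Link: https://arxiv.org/abs/2606.06306
Authors: Victor De Marez, Luna De Bruyne, Walter Daelemans
Affiliations: University of Antwerp
Categories: Computation and Language (cs.CL)
Comments: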
Abstract:Factual sycophancy occurs when a language model abandons a correct, verifiable answer under social pressure. Because a flip occurs only when pressure toward a false answer exceeds the model’s neutral preference for the truth, flip rates conflate two mechanisms: the strength of that baseline preference (truth margin), and how far pressure shifts it (manipulation sensitivity). We decompose factual sycophancy into these channels and use them to separate the effects of size and instruction tuning across 56 open-weight models spanning 0.3B-32B parameters and 13 manipulation types. We find that vulnerability is governed mainly by size, but instruction tuning changes how size acts: small instruction-tuned models can become less robust, whereas large instruction-tuned models usually become more robust. Instruction tuning primarily increases truth margin, but its behavioral effect depends on manipulation type. Scaling also changes the two channels differently: base models gain margin but become mildly more manipulation-sensitive, whereas instruction-tuned models gain margin faster and become less sensitive. Factual sycophancy is therefore not a single scalar property. Evaluations should report channel-specific, manipulation-specific, and size-conditioned robustness rather than flip rates alone.
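The two-channel decomposition can be illustrated with a few lines of arithmetic. This sketch assumes, purely for illustration, that preference is measured as a log-probability difference between the true and the false answer; the paper's exact definitions may differ, and the function name `decompose` is hypothetical. It shows how two models with identical manipulation sensitivity can differ in whether they flip.

```python
def decompose(neutral_true, neutral_false, pressured_true, pressured_false):
    """Split a flip into truth margin and manipulation sensitivity (assumed log-prob form)."""
    truth_margin = neutral_true - neutral_false            # baseline preference for the truth
    pressured_margin = pressured_true - pressured_false    # preference under social pressure
    sensitivity = truth_margin - pressured_margin          # how far the pressure moved it
    flipped = truth_margin > 0 and pressured_margin < 0    # pressure exceeded the margin
    return truth_margin, sensitivity, flipped

# Same sensitivity (2.5), different margins: only the low-margin model flips.
low_margin_model = decompose(-0.5, -2.5, -1.5, -1.0)
high_margin_model = decompose(-0.25, -4.25, -1.0, -2.5)
```

Reporting only the flip outcome would make the first model look more manipulable than the second, even though the pressure shifted both by the same amount.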
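[NLP-17] LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs
[Quick read]: This paper addresses a limitation of current memorization evaluations for large language models (LLMs): they mostly test whether a model can be forced to reproduce training data under adversarial prompting (such as prefix attacks) and overlook how prone the model is to leak data in ordinary use. It proposes PropMe, a propensity-aware evaluation framework that contrasts prefix-based capability attacks with non-adversarial evaluations. Its contributions include (1) a metric transformation that turns existing evaluation functions into propensity metrics, and (2) SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Experiments show that prefix attacks elicit substantially stronger memorization signals, while propensity stays low under generic or dataset-specific prompts, so the models rarely leak training data in common non-adversarial use. DFM Decoder, which is continually pre-trained from Comma, shows lower memorization capability and propensity on Common Pile than the base model, indicating that memorization can decrease when later training emphasizes partially different data. The paper recommends reporting both worst-case extractability and ordinary leakage propensity.
Link: https://arxiv.org/abs/2606.06286
Authors: Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech
Affiliations: University of Southern Denmark
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: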
Abstract:Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, allows to create propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully-open models: Comma and DFM Decoder on two datasets: Common Pile and Dynaword in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest, and we encourage, that memorization audits should report both worst-case extractability and ordinary leakage propensity in order to have a more comprehensive view of this phenomenon.
[NLP-18] Many Circuits One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery
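[Quick read]: This paper examines a key assumption behind circuit discovery methods: that structurally different subgraphs are evidence of different computational mechanisms. Holding the task fixed while varying input statistics (here, token frequency), the authors find that the extracted circuits differ markedly in structure but not in function, a pattern they call phantom specialization. Circuits extracted for different token-frequency bands implement the same computation: band-specific edges transfer broadly across bands, a core shared across most bands recovers at least 99% of circuit performance, and causal interchange interventions confirm that internal representations are interchangeable across bands. Repeated extractions within a single band also vary, suggesting that discovery algorithms sample from an equivalence class of valid subgraphs rather than recovering a unique mechanism. Standard evaluation practice hides this: source-level evaluation inflates apparent faithfulness, while edge-level evaluation reveals the many-to-one mapping from structure to function. The conclusion is that structural differences alone do not justify claims of distinct mechanisms; edge-level evaluation and cross-condition transfer tests are needed.
Link: https://arxiv.org/abs/2606.06267
Authors: Alireza Bayat Makou,Jingcheng Niu,Subhabrata Dutta,Iryna Gurevych
Affiliations: UKP Lab, Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE
Categories: Computation and Language (cs.CL)
Comments: 90 pages, 53 figures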
Abstract:Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by varying input statistics while holding the task fixed, and show that the resulting structural differences exhibit apparent specialization but do not correspond to functional differences, a pattern we term phantom specialization. Using Literal Sequence Copying across four token-frequency bands plus a control condition in five Pythia models (70M-1.4B), we extract 75 circuits and find that structurally distinct circuits implement the same computation: band-specific edges transfer broadly across bands, a core shared across most bands recovers at least 99% of circuit performance, and causal interchange interventions confirm that internal representations are interchangeable across frequency bands. Repeated extractions within the same frequency band further suggest that discovery algorithms sample from an equivalence class of valid subgraphs rather than recovering a unique mechanism. Standard evaluation practice obscures this pattern: source-level evaluation inflates apparent faithfulness, while edge-level evaluation reveals the many-to-one mapping from structure to function. Our results show that structural differences between circuits are not sufficient evidence for distinct mechanisms, and that exposing this requires edge-level evaluation and cross-condition transfer tests.
[NLP-19] From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation
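[Quick read]: Hate speech detection is subjective: demographic groups judge the same content very differently, and collecting representative annotations from many groups is costly and hard to scale. This paper studies whether persona-conditioned large language models, which are prompted to adopt a specific demographic identity, can simulate these diverse perspectives at scale. It evaluates three aspects of human social judgement: (i) inter-group disagreement, (ii) in-group sensitivity, meaning heightened sensitivity when content targets the persona's own identity, and (iii) vicarious prediction, meaning the ability to predict how another group would react. No model consistently captures all three; performance is highly model-dependent and does not emerge reliably from minimal identity prompts. However, vicarious prompting with Llama 3.1 yields the highest cross-group agreement on most demographic axes and comes closest to human disagreement patterns, suggesting that this configuration may be a more reliable setting for automatic annotation aligned with human judgements.
Link: https://arxiv.org/abs/2606.06266
Authors: Paloma Piot,Javier Parapar
Affiliations: Universidade da Coruña; CITIC Research Centre
Categories: Computation and Language (cs.CL)
Comments: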
Abstract:Hate speech detection is inherently subjective: people from different demographic groups perceive the same content very differently. Collecting enough annotations from multiple demographic groups is costly and difficult to scale. Persona-conditioned Large Language Models (models prompted to adopt a specific demographic identity) have been proposed as a way to simulate diverse perspectives at scale. But do they actually reflect how different groups disagree? We evaluate three aspects of human social judgement: (i) whether personas from different groups disagree in human-like ways (inter-group disagreement), (ii) whether they become more sensitive when content targets their own identity (in-group sensitivity), and (iii) whether they can accurately predict how another group would react (vicarious prediction). Our results show that no model consistently captures all three dimensions, and performance is highly model-dependent and does not emerge reliably from minimal identity prompts alone. However, vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most demographic axes and provides the closest overall approximation to human disagreement patterns, indicating that this configuration may provide a more reliable setting for automatic annotation aligned with human judgements.
[NLP-20] FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition
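[Quick read]: This paper targets automatic speech recognition (ASR) for pathological speech caused by neurological conditions, which differs acoustically from typical speech and sharply degrades standard ASR models. The authors propose speaker conditioning based on Feature-wise Linear Modulation (FiLM): speaker-specific information derived from x-vectors is injected into every transformer layer of a frozen ASR encoder, adapting the internal representations to an individual pathological speaker without modifying the base model weights. The approach is benchmarked against standard and parameter-efficient fine-tuning baselines on Spanish and English pathological speech, and the adapted model is also evaluated on answering speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
Link: https://arxiv.org/abs/2606.06211
Authors: Fernando López,Santosh Kesiraju,Jordi Luque
Affiliations: AUDIAS; Speech@FIT; Telefonica
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted in Odyssey 2026: The Speaker and Language Recognition Workshop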
Abstract:Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
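The FiLM conditioning described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the single linear projection from the speaker vector to the scale and shift, and the toy dimensions are all assumptions. The abstract only states that x-vector-derived information modulates each transformer layer of a frozen encoder.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Plain affine map: one output per row of `weight`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(hidden: Vector, gamma: Vector, beta: Vector) -> Vector:
    """Feature-wise linear modulation: scale and shift each feature independently."""
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]


def condition_layer(hidden: Vector, speaker_vec: Vector,
                    w_gamma: Matrix, b_gamma: Vector,
                    w_beta: Matrix, b_beta: Vector) -> Vector:
    """Hypothetical per-layer conditioning: project the speaker vector to
    (gamma, beta) and modulate the frozen layer's output with them."""
    gamma = linear(speaker_vec, w_gamma, b_gamma)
    beta = linear(speaker_vec, w_beta, b_beta)
    return film(hidden, gamma, beta)


# Toy example: 2 hidden features, 3-dimensional speaker vector.
hidden = [1.0, -2.0]
speaker = [0.5, 0.5, 0.5]
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Zero projections with gamma bias 1 and beta bias 0 give the identity map.
unchanged = condition_layer(hidden, speaker, zeros, [1.0, 1.0], zeros, [0.0, 0.0])
```

With a zero projection and a unit bias for the scale, the layer reduces to the identity, so only the small projection matrices would need training while the encoder stays frozen. Whether the paper initialises or trains the conditioning this way is not stated in the abstract.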
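[NLP-21] Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs
[Quick read]: This paper studies why large language models (LLMs) degrade on long contexts, focusing on a largely overlooked factor: lexical density, the rate at which a context introduces distinct information. Input length and the position of relevant information are usually cited as the main causes; the authors argue that lexical density is a third factor that systematically shrinks the effective context window. Using three "find-the-needle" style benchmarks with identical length (about 12k tokens) and controlled needle position but increasing information density, they observe a sharp collapse: models that are near-perfect in sparse contexts fall below a 60% retrieval score in denser ones. To rule out task-type confounds, they also vary density within each benchmark while holding other properties fixed; reducing density generally restores performance, especially in the high-density regimes. The takeaway is that effective context capacity is a function of lexical density, which matters for real-world LLM systems operating on compact, information-rich inputs.
Link: https://arxiv.org/abs/2606.06203
Authors: Giovanni Dettori,Matteo Boffa,Danilo Giordano,Idilio Drago,Marco Mellia
Affiliations: Politecnico di Torino; University of Turin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 6 figures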
Abstract:Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density – the rate at which a context introduces distinct information – as a third, largely overlooked factor that systematically reduces the effective context window of LLMs. We quantify the impact of lexical density on open-weight LLMs (9B-685B) using three “find-the-needle” style benchmarks with identical length (~12k tokens) and controlled needle position, but increasing density of information. We observe a sharp performance collapse in higher-density benchmarks: models that are near-perfect in sparse contexts drop below 60% retrieval score on denser ones. To rule out task-type confounds, we vary and control the density within each benchmark while keeping all other properties unchanged. Reducing density generally restores performance, especially in the high-density regimes where degradation appears. These results show that effective context capacity is a function of lexical density, with direct implications for real-world LLM systems operating on compact, information-rich inputs.
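[NLP-22] Improving Answer Extraction in Context-based Question Answering Systems Using LLMs
[Quick read]: This paper addresses question answering (QA) systems based on large language models (LLMs) that still extract answers imprecisely from a given context, especially for complex or ambiguous queries, and that struggle with contextual understanding, answer consistency, and cross-domain generalization. The approach is to fine-tune a pre-trained language model on a high-quality annotated QA dataset to improve contextual comprehension and answer extraction. The Stanford Question Answering Dataset (SQuAD1.1), which provides context-question-answer triplets, is used for supervised training and evaluation. The fine-tuned RoBERTa-base model performs best, reaching a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%.
Link: https://arxiv.org/abs/2606.06197
Authors: Hafez Abdelghaffar,Ahmed Alansary,Ali Hamdi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, IMSA2026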
Abstract:Question answering (QA) systems have achieved notable progress with the advent of large language models (LLMs). However, they still face challenges in accurately extracting and generating precise answers from given contexts, particularly when dealing with complex or ambiguous queries. Existing approaches often struggle with contextual understanding, answer consistency, and generalization across diverse domains. In this work, we propose a question answering system based on large language models, where the input consists of a textual context and a corresponding question, and the output is a concise and accurate answer. The motivation behind this research lies in addressing the limitations of current QA systems, particularly their tendency to produce irrelevant or imprecise responses despite having access to the correct context. Our methodology involves fine-tuning a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction capabilities. Specifically, we utilize the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation. Experimental results show that the fine-tuned Roberta-base model achieves the highest performance, attaining a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%. These results indicate strong accuracy and answer relevance, demonstrating the effectiveness of the proposed approach for context-based question answering tasks. Furthermore, the findings confirm that targeted fine-tuning substantially improves the reliability and precision of QA systems.
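[NLP-23] The Tell-Tale Norm: ℓ2 Magnitude as a Signal for Reasoning Dynamics in Large Language Models ICML
[Quick read]: This paper addresses the lack of an interpretable, model-intrinsic signal for the layer-wise reasoning dynamics of large language models (LLMs). The central claim is that the ℓ2 norm of hidden states serves as an endogenous signal of reasoning intensity. Using Sparse Autoencoders (SAEs) as a diagnostic probe, the authors observe that reasoning-related feature activations rise sharply in late layers. They then prove that the ℓ2 norm of hidden states bounds the activation strength of SAE reasoning features, and use correlation analysis and causal interventions to validate the norm as an indicator of critical reasoning steps. Building on this, they propose three test-time scaling techniques that need no additional training or data and are compatible with advanced inference engines: (i) Adaptive Layer-wise Reasoning Recursion, (ii) Endogenous Reasoning State Steering, and (iii) ℓ2-guided Response Selection. Experiments across model architectures and benchmarks show that these techniques significantly improve reasoning performance.
Link: https://arxiv.org/abs/2606.06188
Authors: Jinyang Zhang,Hongxin Ding,Yue Fang,Weibin Liao,Muyang Ye,Junfeng Zhao,Yasha Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: ICML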
Abstract:Recent work has sought to understand Large Language Models (LLMs) reasoning, yet a principled, model-intrinsic signal that captures its layer-wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that the l2 norm of hidden states serves as an endogenous signal of the model’s reasoning intensity. Using Sparse Autoencoders (SAEs) as a diagnostic probe, we observe that LLMs’ internal reasoning is marked by a sharp increase in reasoning feature activations concentrated in late layers. Motivated by this pattern, we establish a formal link between reasoning intensity and the model’s latent geometry and theoretically prove that the l2 norm of hidden states bounds the activation strength of SAE reasoning features. Empirical correlation analysis and causal interventions further validate the l2 norm as a faithful indicator, where heightened norms consistently correspond to critical reasoning steps. We then introduce three test-time scaling techniques guided by l2 norms: (i) Adaptive Layer-wise Reasoning Recursion, (ii) Endogenous Reasoning State Steering, and (iii) l2-guided Response Selection, which requires no additional training or data and is compatible with advanced inference engines. Experiments across model architectures and benchmarks show that l2-norm-based techniques significantly improve reasoning performance, offering a principled yet simple lens to perceive and control LLM latent reasoning dynamics. Our code is available at this https URL.
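[NLP-24] Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning
[Quick read]: Large language models (LLMs) trade performance against cost, and existing routing methods do not adapt well to users' differing cost-performance preferences. This paper proposes a perceptive LLM routing paradigm that learns a user's implicit preferences efficiently from a small number of interactions. To handle heterogeneous user needs, preference profiles are formulated as a set of distinct tasks in a contextual bandit setting, and MetaRouter, a meta-learning framework for preference-aware routing, is proposed. Experiments show that MetaRouter outperforms strong baselines on both in-distribution and out-of-distribution tasks, learns user preferences efficiently, is robust to changes in the set of routable LLMs, and scales to multi-model routing.
Link: https://arxiv.org/abs/2606.06178
Authors: Jiahao Zeng,Ming Tang,Ningning Ding
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Southern University of Science and Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: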
Abstract:Large language models (LLMs) present a trade-off between performance and cost, where more powerful models incur greater expense. LLM routing aims to mitigate expenses while maintaining performance by sending queries to the most suitable model. However, existing methods cannot perform well for different user cost-performance preferences. To address this gap, we introduce a novel perceptive LLM routing paradigm for personalized and user-centric cost-performance optimization, which efficiently learns users’ implicit preferences through little interaction. To handle the challenge of heterogeneous user needs, we formulate preference profiles as a set of distinct tasks in contextual bandit and propose MetaRouter, a meta-learning framework designed for preference-aware LLM routing. Experimental results show that MetaRouter outperforms strong baselines on both in-distribution and out-of-distribution tasks. Furthermore, it exhibits high efficiency in learning user preferences, robustness to changes in the routable LLMs, and scalability to multi-model routing.
[NLP-25] ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity INTERSPEECH2026
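[Quick read]: This paper tackles sarcasm detection from audio alone. The core challenge is capturing temporal prosodic incongruity, the mismatch between local prosodic dynamics and the utterance-level emotional baseline. The proposed ProSarc framework uses two encoding paths, a Global Emotion Encoder and a Temporal Prosody Encoder (BiLSTM with multi-head attention), which feed a Prosodic Incongruity Analyzer that produces a scalar incongruity score for classification. ProSarc outperforms prior audio-only methods on MUStARD++ (F1=75.3) and generalises to spontaneous speech (PodSarc, F1=62.9) and cross-lingual speech (MuSaG, F1=65.6). Monte Carlo dropout provides uncertainty estimates, and an attention-based mechanism localises sarcastic onset without frame-level labels. Ten-run validation confirms the contribution of incongruity modelling (Wilcoxon p=0.002, Cohen's d=1.51), and human evaluation shows that model uncertainty tracks perceptual ambiguity and that predicted onsets align with human-annotated temporal windows.
Link: https://arxiv.org/abs/2606.06168
Authors: Prathamjyot Singh,Ashima Sood,Sahil Sharma,Jasmeet Singh
Affiliations: Thapar Institute of Engineering and Technology; Ulster University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at Interspeech 2026, Sydney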
Abstract:We present ProSarc, an audio-only framework that detects sarcasm by modelling temporal prosodic incongruity, that is, the mismatch between local prosodic dynamics and the utterance-level emotional baseline. Dual encoding paths, a Global Emotion Encoder and a Temporal Prosody Encoder (BiLSTM + multi-head attention), feed a Prosodic Incongruity Analyzer that produces a scalar incongruity score for classification. Monte Carlo dropout provides uncertainty estimates, and an attention-based mechanism localises sarcastic onset without frame-level labels. ProSarc outperforms prior audio-only methods on MUStARD++ (F1=75.3) and generalises to spontaneous (PodSarc, F1=62.9) and cross-lingual speech (MuSaG, F1=65.6). Ten-run validation confirms the contribution of incongruity modelling (Wilcoxon p=0.002, Cohen’s d=1.51). Human evaluation shows that model uncertainty tracks perceptual ambiguity and predicted onsets align with human-annotated temporal windows.
[NLP-26] Where does Absolute Position come from in decoder-only Transformers?
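[Quick read]: This paper asks why transformers trained with Rotary Position Embedding (RoPE) remain sensitive to absolute position even though RoPE encodes only relative offsets in the inner product. It traces the leakage to two architectural components. The first is the causal mask: the per-query softmax denominator depends on the query's absolute position by construction. The second is the residual stream: under causal attention the activation at position 0 attends only to itself and evolves as a closed dynamical system starting from the embedding of the token at that position, and downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures studied, in different balance: NTK scaling suppresses the residual-stream component, sliding-window attention lets it accumulate with depth, and standard RoPE sits in between. Replacing the BOS embedding before the forward pass removes 40% of the residual-stream component at early queries. Attention sinks act as token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position 0, constant across inputs when that token is the auto-prepended BOS and varying with the token otherwise. Understanding how the residual stream and attention sinks interact is therefore central to accounting for this unintended leakage of absolute position.
Link: https://arxiv.org/abs/2606.06160
Authors: Valeria Ruscio,Umberto Nanni,Fabrizio Silvestri
Affiliations: Sapienza University of Rome; Intuition Machines
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: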
Abstract:RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position 0 attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the BOS embedding before the forward pass removes 40% of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position 0, constant across inputs when that token is the auto-prepended BOS and varying with it otherwise.
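The first leakage source, the causal mask, can be illustrated with a small sketch. It is a toy demonstration under stated assumptions rather than the paper's analysis: the function name is hypothetical, and the attention logits are set to zero everywhere so that they carry no positional information at all. Any position dependence in the output then comes only from how many keys each query is allowed to see.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax where query i may only attend to keys j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # causal mask: drop future keys
        peak = max(visible)                         # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        denom = sum(exps)                           # denominator has i + 1 terms
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / denom for e in exps] + masked_tail)
    return weights


n = 4
flat_scores = [[0.0] * n for _ in range(n)]         # position-free logits
w = causal_attention_weights(flat_scores)

# Weight each query puts on itself: 1, 1/2, 1/3, 1/4.
self_weight = [w[i][i] for i in range(n)]
```

Even with identical logits, the query at position 0 places all of its attention on itself while later queries spread it over more keys, so the attention output differs by absolute position. This matches the abstract's statement that the per-query softmax denominator depends on the absolute query position by construction.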
[NLP-27] Harnessing Structural Context for Entity Alignment Foundation Models
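[Quick read]: This paper addresses the underuse of structural context in the recent foundation model for entity alignment (EA) across knowledge graphs (KGs): cross-KG interaction is weak during encoding, and final candidate ranking still relies too heavily on coarse similarity. The proposed ContextEA is an enhanced encoder-decoder framework for transferable EA. On the encoder side, a cross-KG interaction encoder unifies the two KGs with anchor bridges and performs earlier relation-aware cross-graph propagation. On the decoder side, a structural calibration decoder calibrates alignment scores with entity-level, neighborhood-level, relation-level, and anchor-aware structural evidence. The design remains lightweight, and experiments on 29 EA datasets in OpenEA, SRPRS, and DBP show consistent gains over strong transferable baselines. Notably, the pretrained ContextEA already surpasses the finetuned baselines on all three benchmark groups, suggesting that explicitly harnessing structural context is an effective direction for improving EA foundation models.
Link: https://arxiv.org/abs/2606.06109
Authors: Xingyu Chen,Yuanning Cui,Zequn Sun,Wei Hu
Affiliations: Nanjing University; Nanjing University of Information Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: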
Abstract:Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment knowledge, once pretrained, can be directly applied to diverse previously unseen KG pairs. However, it still underuses structural context in two places: cross-KG interaction is weak during encoding, and final candidate ranking still relies too heavily on coarse similarity. We address these limitations with ContextEA, an enhanced encoder-decoder framework for transferable EA. On the encoder side, we introduce a cross-KG interaction encoder that unifies the two KGs with anchor bridges and performs earlier relation-aware cross-graph propagation. On the decoder side, we introduce a structural calibration decoder that calibrates alignment scores with entity-level, neighborhood-level, relation-level, and anchor-aware structural evidence. This design strengthens both structural context construction and structural context exploitation while remaining lightweight. Experiments on 29 EA datasets in OpenEA, SRPRS, and DBP show consistent gains over strong transferable baselines. Notably, the pretrained ContextEA already surpasses the finetuned baselines on all three benchmark groups, demonstrating substantially stronger transfer to unseen KGs. These results suggest that explicitly harnessing structural context is an effective direction for improving EA foundation models.
[NLP-28] IR3DE: A Linear Router for Large Language Models ICML2026
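[Quick read]: With a growing number of available large language models (LLMs), this paper addresses how to pick the most appropriate model for each prompt cheaply and quickly. Existing routing methods either optimize cost across weak-to-strong generalist LLMs or require substantial training to support routing among domain experts. The proposed IR3DE is a Ridge Regression-based Router for Domain Experts: a linear router that makes cheap, fast routing decisions. It is evaluated in two Causal Language Modeling (CLM) settings, where every domain's task is next-token prediction, and one reasoning setting in which each domain has its own reasoning task. IR3DE performs comparably to the baselines in both CLM settings and surpasses them in the reasoning setting, with a normalized performance of 98.4%. It also allows domain experts to be added or removed without retraining the router from scratch.
Link: https://arxiv.org/abs/2606.06098
Authors: Eros Fanì,Oğuzhan Ersoy
Affiliations: Gensyn
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference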
Abstract:Foundational Large Language Models (LLMs) demonstrate proficiency on a wide range of general tasks, and achieve remarkable results on various specialized tasks via domain-expert LLMs. With the ever-growing list of available LLMs, inference routers are being proposed to select the most appropriate LLM for each prompt. However, existing routing methods either optimize cost across weak-to-strong generalist LLMs or require substantial training to support domain-expertise routing. In this paper, we propose IR3DE, a Ridge Regression-based Router for Domain Experts that provides cheap and fast routing decisions for each prompt. We evaluate IR3DE in two Causal Language Modeling (CLM) settings where the tasks are next-token prediction for all domains, and one reasoning setting where each domain has its own distinct reasoning task. Despite being a linear router, IR3DE achieves performance comparable to the other baselines in both CLM settings, and surpassing them in the reasoning setting, with a normalized performance of 98.4%. Moreover, IR3DE enables the addition or removal of new domain experts without requiring the router to be retrained from scratch, allowing a dynamic set of LLMs to be served with minimal disruption to the router itself. Our code is available at: this http URL.
[NLP-29] OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation
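[Quick read]: Policy-gradient methods usually optimize expected return, but many applications care about distributional properties of returns such as tail risk, robustness to outliers, or best-of-K discovery. This paper proposes OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, that is, weighted averages of sorted rewards or costs, and recovers objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, it provides an unbiased gradient estimator of the corresponding objective, and it is implemented as a simple reward transformation that plugs into an otherwise standard policy-gradient or reparameterized update. The authors study the estimator's variance and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training.
Link: https://arxiv.org/abs/2606.06096
Authors: Paavo Parmas,Yongmin Kim,Kohsei Matsutani,Shota Takashiro,Soichiro Nishimori,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
Affiliations: The University of Tokyo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: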
Abstract:Policy-gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator’s variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: this https URL
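The notion of an L-statistic in this abstract, a weighted average of sorted rewards where only the rank weights change between objectives, can be sketched as below. The sketch covers the objective only and does not implement the OrderGrad gradient estimator; all function names and the particular weight vectors are illustrative assumptions.

```python
from typing import List, Sequence


def l_statistic(rewards: Sequence[float], rank_weights: Sequence[float]) -> float:
    """Weighted average of rewards sorted in ascending order."""
    if len(rewards) != len(rank_weights):
        raise ValueError("need exactly one weight per rank")
    ordered = sorted(rewards)
    return sum(w * r for w, r in zip(rank_weights, ordered))


def mean_weights(n: int) -> List[float]:
    """Uniform weights recover the ordinary sample mean."""
    return [1.0 / n] * n


def best_of_n_weights(n: int) -> List[float]:
    """All weight on the largest reward (best of the n samples)."""
    return [0.0] * (n - 1) + [1.0]


def lower_tail_weights(n: int, m: int) -> List[float]:
    """Average of the m smallest rewards, a CVaR-style tail objective."""
    return [1.0 / m] * m + [0.0] * (n - m)


def median_weights(n: int) -> List[float]:
    """Weight on the middle rank (or split over the two middle ranks)."""
    weights = [0.0] * n
    if n % 2:
        weights[n // 2] = 1.0
    else:
        weights[n // 2 - 1] = weights[n // 2] = 0.5
    return weights


rewards = [3.0, -1.0, 4.0, 0.0, 2.0]
n = len(rewards)
summary = {
    "mean": l_statistic(rewards, mean_weights(n)),
    "best_of_n": l_statistic(rewards, best_of_n_weights(n)),
    "worst_two": l_statistic(rewards, lower_tail_weights(n, 2)),
    "median": l_statistic(rewards, median_weights(n)),
}
```

Switching from mean optimization to a tail-risk or best-of-K target only swaps the weight vector, which is the property the abstract highlights. How those weights enter the reward transformation used in the gradient update is specific to the paper and not reproduced here.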
[NLP-30] CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios
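[Quick read]: This paper targets hard cases for language identification: "cousin languages" and orthographic noise. Systems are prone to errors when languages are highly similar (for example Czech/Slovak or Spanish/Catalan) or when text has been transliterated across scripts, stripped of diacritics, subjected to homoglyph attacks, or written in Internet slang. The contribution is CHALIS (Challenging Language Identification Samples), a benchmark with two parts: sentences shared across mutually intelligible language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian), and orthographic-noise samples produced by transliteration across multiple scripts, diacritic removal, simulated homoglyph attacks, and Internet slang. Four widely used language identification systems all struggle substantially on CHALIS, especially on the lower-resource language within each cousin pair and on transliterated input. The dataset is publicly available.
Link: https://arxiv.org/abs/2606.06088
Authors: Michal Tichý,Jindřich Libovický
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 7 pages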
Abstract:We present CHALIS (Challenging Language Identification Samples), a new benchmark dataset explicitly designed to address difficult cases in language identification: cousin languages and orthographic noise. Our dataset has two parts: First, we collected sentences shared across mutually intelligible language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian). The second part tests for orthography noise: we transliterate text across multiple scripts, remove diacritics, simulate homoglyph attacks, and use Internet slang. We evaluate four widely used language identification systems on CHALIS and demonstrate that all struggle substantially in these scenarios, especially on lower-resource languages within cousin pairs and on transliterated input. The resource is publicly available at this https URL.
[NLP-31] LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents
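[Quick read]: Agent systems built on large language models (LLMs) often inject textual skills into the prompt at every step, which incurs substantial context overhead and exposes skill content as plaintext. This paper proposes LatentSkill, a framework that uses a pretrained hypernetwork to convert textual skills into plug-and-play LoRA adapters, storing skill knowledge in weight space rather than context space. This removes per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline with far fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that the generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned.
Link: https://arxiv.org/abs/2606.06087
Authors: Aofan Yu,Chenyu Zhou,Tianyi Xu,Zihan Guo,Rong Shan,Zhihui Fu,Jun Wang,Weiwen Liu,Yong Yu,Weinan Zhang,Jianghao Lin
Affiliations: Shanghai Jiao Tong University; Sun Yat-Sen University; Shanghai Innovation Institute; OPPO Research Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures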
Abstract:Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.
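The ideas of a LoRA scaling coefficient and composition by parameter-space arithmetic can be illustrated with a toy sketch. It assumes the standard low-rank update form (a base weight plus a scaled product of two small matrices) and assumes that composition means adding the updates; the hypernetwork that generates the adapters is not shown, and every name and number below is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive dense matrix product, enough for a toy example."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_delta(b_mat: Matrix, a_mat: Matrix, scale: float) -> Matrix:
    """Low-rank weight update scale * (B @ A)."""
    return [[scale * v for v in row] for row in matmul(b_mat, a_mat)]


def apply_skills(base: Matrix, deltas: List[Matrix]) -> Matrix:
    """Add one or more skill updates to a copy of the base weights."""
    out = [row[:] for row in base]
    for delta in deltas:
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                out[i][j] += v
    return out


base = [[1.0, 0.0], [0.0, 1.0]]
# Two rank-1 "skills" with different scaling coefficients.
skill_a = lora_delta([[1.0], [0.0]], [[0.0, 2.0]], scale=0.5)
skill_b = lora_delta([[0.0], [1.0]], [[4.0, 0.0]], scale=0.25)

merged = apply_skills(base, [skill_a, skill_b])
# A zero scale switches a skill off without touching the base weights.
disabled = apply_skills(base, [lora_delta([[1.0], [0.0]], [[0.0, 2.0]], scale=0.0)])
```

Because the base weights are copied and never modified, skills can be loaded, rescaled, or removed independently, which is the modularity the abstract attributes to weight-space skills. The abstract also notes that composition works when skill components are aligned, a condition this toy example does not model.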
[NLP-32] On Advantage Estimates for Max@K Policy Gradients
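[Quick read]: Reinforcement learning with verifiable rewards is widely used to post-train reasoning models, but sparse outcome rewards make exploration difficult. One remedy is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives differ in their signals, baselines, and normalizations, which obscures how they relate. Analyzing baseline design and advantage centering, the paper shows that the advantage estimator of a leading method is policy-gradient unbiased but yields non-centered advantages. It then introduces a Leave-Two-Out (L2O) baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. The paper also derives the canonical finite-batch advantage for max@K, giving a unified view of existing advantage estimators. Empirically, the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
Link: https://arxiv.org/abs/2606.06080
Authors: Shota Takashiro,Soichiro Nishimori,Paavo Parmas,Yongmin Kim,Kohsei Matsutani,Gouki Minegishi,Yusuke Iwasawa,Takeshi Kojima,Yutaka Matsuo
Affiliations: The University of Tokyo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: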
Abstract:Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
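For readers unfamiliar with the max@K objective, the sketch below estimates it from a batch of N sampled rewards by averaging the maximum over every size-K subset, computed in closed form from the sorted rewards. This is the standard subset-averaging estimator of the objective value, offered as background; it is not the paper's MaxPO advantage or its Leave-Two-Out baseline, and the function name is an assumption.

```python
from math import comb
from typing import Sequence


def max_at_k(rewards: Sequence[float], k: int) -> float:
    """Average of max(subset) over all size-k subsets of the batch.

    With rewards sorted ascending, the i-th smallest reward (1-based) is the
    maximum of exactly C(i-1, k-1) subsets, so no enumeration is needed.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(rewards)")
    ordered = sorted(rewards)
    total_subsets = comb(n, k)
    estimate = 0.0
    for i in range(k, n + 1):  # ranks below k can never be a subset maximum
        estimate += comb(i - 1, k - 1) / total_subsets * ordered[i - 1]
    return estimate


# Binary rewards: one correct answer out of four samples.
binary = [0, 0, 1, 0]
pass_at_2 = max_at_k(binary, 2)      # max@K reduces to pass@K for 0/1 rewards
mean_reward = max_at_k(binary, 1)    # K = 1 is the ordinary mean
```

For binary rewards the estimate coincides with the usual pass@K formula, here 1 - C(3,2)/C(4,2) = 0.5. The policy-gradient question studied in the paper is how to assign per-sample advantages for this objective, which the sketch does not address.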
[NLP-33] SkillComposer: Learning to Evolve Agent Skills for Specification and Generalization
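[Quick read]: This paper addresses a tension in constructing agent skills: existing methods treat skill construction as one-shot extraction, so a skill tailored to a specific task fails to transfer, while an abstracted skill often gives insufficient guidance. The proposed SkillComposer framework decomposes skill construction into three learnable operations, create, improve, and merge, and is trained with a systematic rejection-sampling recipe so that language models can evolve skills at inference time. It supports three deployment modes: offline for building generalized libraries, online for task-specific refinement, and hybrid for combining both. On τ²-Bench, LiveCodeBench v6, and AppWorld, SkillComposer consistently outperforms baselines; SkillComposer-4B improves a 27B executor by up to +4.5 on agent tasks and +3.4 on code tasks while generalizing to domains and task types unseen during training. Analysis indicates that merge and improve address orthogonal quality dimensions and that skill composition is a transferable meta-ability.
Link: https://arxiv.org/abs/2606.06079
Authors: Qi Zhang,Zhaopeng Feng,Xiaonan Shi,Xiaomeng Hu,Chu Liu,Pengjun Xie,Xiaobin Wang,Jieping Ye,Bryan Hooi,Haobo Wang,Junbo Zhao
Affiliations: Zhejiang University; Tongyi Lab; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: Under Review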
Abstract:Agent skills, which consist of reusable strategies that guide agent reasoning and action, have shown strong potential for improving model capability at inference time. However, current skill construction methods treat the problem as one-shot extraction, overlooking a fundamental tension: a skill tailored to the specific task fails to transfer, while the abstracted skill often provides insufficient guidance. We attribute this fragility to the absence of explicit mechanisms for skill specification and generalization. To address this gap, we introduce SkillComposer, a framework that decomposes skill construction into three learnable operations: create, improve, and merge. Trained via a systematic rejection sampling recipe, SkillComposer enables language models to self-evolve skills at inference time and supports three deployment modes: offline for building generalized libraries, online for task-specific refinement, and hybrid for combining both. Comprehensive experiments on τ²-Bench, LiveCodeBench v6, and AppWorld show that SkillComposer consistently outperforms baselines. Our SkillComposer-4B improves a 27B executor by up to +4.5 on agent tasks and +3.4 on code tasks, while generalizing across domains and task types unseen during training. Analysis reveals that merge and improve address orthogonal quality dimensions and that skill composition is a transferable meta-ability, providing a practical recipe for skill-augmented inference.
[NLP-34] Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition
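[Quick read]: This paper examines why multi-task learning (MTL) degrades surface transcription in second-language (L2) speech recognition, where a system must output both the pronunciation as spoken (surface) and the intended meaning, using Korean and English. MTL improves meaning transcription but degrades surface transcription, most severely in English, where the degradation grows with the surface-meaning divergence measured by Levenshtein edit distance. The key finding is encoder-level entanglement between the two tasks: Korean preserves distinct task representations, whereas English produces nearly identical ones. Cross-task decoder analysis further shows that the meaning decoder of the dual-output model adapts with a unique representation, while the surface decoder remains constrained by the entangled encoder. The paper therefore argues for MTL frameworks that mitigate encoder-level entanglement in order to reduce surface degradation in dual-output L2 automatic speech recognition.
Link: https://arxiv.org/abs/2606.06065
Authors: Seung Hwan Cho,Young-Min Kim
Affiliations: unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 2 figures, Accepted to the 43rd International Conference on Machine Learning Workshop on Machine Learning for Audio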
Abstract:Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. Analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.
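For readers unfamiliar with the divergence measure named in this abstract, here is a minimal sketch of Levenshtein edit distance between a surface transcription and an intended-meaning transcription. The function names, the normalization by the longer string, and the example sentence pair are illustrative assumptions and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def divergence(surface: str, meaning: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(surface), len(meaning))
    return levenshtein(surface, meaning) / longest if longest else 0.0


# Hypothetical L2 example: what was pronounced versus what was meant.
print(levenshtein("I sink so", "I think so"))
print(divergence("I sink so", "I think so"))
```

A pair whose surface and meaning forms are identical has divergence 0, so under this sketch larger values mark utterances where the two outputs disagree more.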
[NLP-35] MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following ACL2026
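[Quick read]: This paper addresses the training instability of standard group relative policy optimization (GRPO) on multi-constraint instruction following, where rewards are discrete and low in dispersion. When reward distributions within a group are frequently homogeneous, z-score group normalization exhibits three pathologies: low-variance amplification, mean-centering blindness, and zero-variance collapse. The proposed MDP-GRPO stabilizes learning through (1) multi-temperature sampling to increase reward dispersion; (2) dual-anchor advantages that restore gradients in homogeneous groups and remove mean-centering blindness; (3) prospect-theoretic reward shaping, based on Kahneman and Tversky's theory, that bounds updates and penalizes constraint violations; and (4) asymmetric KL regularization. On FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B, while converging stably with small group sizes and preserving general capabilities on MMLU and ARC.
Link: https://arxiv.org/abs/2606.06058
Authors: Mohammad Mahdi Salmani-Zarchi,Zahra Rahimi,Heshaam Faili,Mohammad Javad Dousti
Affiliations: University of Tehran; Allameh Tabataba’i University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Main Conference. 14 pages, 9 figures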
Abstract:Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky’s theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.
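To make the z-score pathologies concrete, the sketch below computes standard group-normalized advantages for a few reward groups. It illustrates only the baseline normalization, not MDP-GRPO itself; the function name, the `eps` value, and the mapping of each example to a named pathology are assumptions based on a plain reading of the abstract.

```python
import statistics


def zscore_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Zero-variance collapse: every advantage is 0, so the group gives no gradient.
all_fail = zscore_advantages([0, 0, 0, 0])

# Mean-centering blindness (assumed reading): a group that satisfies every
# constraint looks identical to the group that fails every constraint.
all_pass = zscore_advantages([1, 1, 1, 1])

# Low-variance amplification: a 0.01 reward gap is scaled to the same
# advantage magnitude as a full 0-versus-1 gap.
tiny_gap = zscore_advantages([0.50, 0.50, 0.50, 0.51])
full_gap = zscore_advantages([0, 0, 0, 1])

print(all_fail, all_pass)
print(tiny_gap)
print(full_gap)
```

Because normalization divides by the within-group spread, the absolute size of a reward difference is discarded, which is one way to see why discrete, low-dispersion rewards are a difficult regime for this estimator.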
[NLP-36] Automatic Labelling of Speech Translation Errors
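[Quick read]: This paper addresses the lack of an established methodology for evaluating confidence and quality estimation in speech translation (ST), where undetected errors reduce trust in ST systems and can have serious consequences. It proposes Speech Translation Error Labelling (STEL) and contributes an annotation protocol, a small authentic end-to-end evaluation dataset, and an analysis of how existing text-only and speech-processing systems perform the task. The text-only XCOMET and the multimodal LLM Qwen2.5-Omni reach roughly half the precision of human annotators. The study also finds that direct speech processing is necessary for STEL and that current text-only and speech-processing systems are complementary in labelling translation-only versus speech-processing errors.
Link: https://arxiv.org/abs/2606.06047
Authors: Dominik Macháček,Maike Züfle,Ondrej Klejch
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: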
Abstract:Errors in speech translations reduce trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet currently there is no established methodology for evaluating confidence and quality estimation of speech translations. To initiate progress in this direction, we propose Speech Translation Error Labelling (STEL). We create an annotation protocol, a small authentic end-to-end evaluation dataset, and we analyse how existing text-only and speech-processing systems perform the STEL task. Our results show that text-only XCOMET and multimodal LLM Qwen2.5-Omni are able to perform the STEL task in roughly half the precision of humans. We also find that direct speech processing is necessary for the STEL task, and that the current text-only and speech-processing systems are complementary in labelling translation-only vs. speech-processing errors in ST.
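[NLP-37] IA-RAG: Interval-Algebra-Driven Temporal Reasoning for Dynamic Knowledge Retrieval
[Quick read]: This paper addresses the weak modelling of time in existing retrieval-augmented generation (RAG) and Graph RAG frameworks, which treat knowledge as static or attach only coarse-grained timestamps or metadata, and so fail to capture temporal structure such as duration, overlap, and containment. It proposes IA-RAG, a hierarchical temporal RAG framework that represents facts as Interval Event Units (IEUs) and organizes them into a hierarchical Thematic Forest in which temporal dependencies are governed by Allen's Interval Algebra. To handle incomplete or uncertain temporal boundaries, IA-RAG adds a Sub-graph Time Tightening mechanism that refines fuzzy intervals using logical constraints within connected event subgraphs; it also supports implicit temporal semantic retrieval through interval-algebra-guided traversal. On temporal question answering benchmarks including TimeQA, TempReason, and ComplexTR, IA-RAG shows strong temporal retrieval and reasoning performance, particularly on complex compositional temporal reasoning tasks.
Link: https://arxiv.org/abs/2606.06044
Authors: Xiaoman Wang,Yaoze Zhang,Wenzhuo Fan,Hongwei Zhang,Ding Wang,Guohang Yan,Song Mao,Botian Shi,Yunshi Lan,Pinlong Cai
Affiliations: East China Normal University; Shanghai Artificial Intelligence Laboratory; University of Shanghai for Science and Technology; Harbin Engineering University
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 10 figures, 13 tables. Code available at this https URL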
Abstract:Retrieval-Augmented Generation (RAG) has shown strong effectiveness in grounding Large Language Models (LLMs) with external knowledge. However, existing RAG and Graph RAG frameworks largely treat knowledge as static or associate time with coarse-grained timestamps or metadata, failing to capture rich temporal structures such as duration, overlap, and containment. We propose IA-RAG, a hierarchical temporal RAG framework that models knowledge as time intervals and performs retrieval under formal temporal constraints. IA-RAG represents facts as Interval Event Units (IEUs) and organizes them into a hierarchical Thematic Forest, where temporal dependencies are governed by Allen’s Interval Algebra. To handle incomplete or uncertain temporal boundaries, IA-RAG further introduces a Sub-graph Time Tightening mechanism that refines fuzzy intervals through logical constraints within connected event subgraphs. In addition, IA-RAG supports implicit temporal semantic retrieval through interval-algebra-guided traversal. Experiments on multiple temporal question answering benchmarks, including TimeQA, TempReason, and ComplexTR, demonstrate that IA-RAG achieves strong temporal retrieval and reasoning performance, particularly on complex compositional temporal reasoning tasks. Our code is released at this https URL.
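As background for the temporal relations this abstract relies on, the sketch below classifies a pair of intervals into one of the 13 basic relations of Allen's Interval Algebra. It is a generic illustration rather than IA-RAG's implementation; the function name, the tuple representation, and the year-valued example intervals are assumptions.

```python
def allen_relation(a, b):
    """Return the Allen relation that holds from interval a to interval b.

    Intervals are (start, end) pairs with start < end.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only partial overlap remains once all endpoint equalities are excluded.
    return "overlaps" if a_start < b_start else "overlapped-by"


# Hypothetical events expressed as year intervals.
print(allen_relation((2001, 2003), (2000, 2010)))  # containment ("during")
print(allen_relation((2000, 2006), (2004, 2010)))  # overlap
print(allen_relation((2000, 2005), (2005, 2010)))  # adjacency ("meets")
```

Relations such as "during" and "overlaps" are exactly the kind of structure that a single timestamp per fact cannot express, which is the gap the abstract describes.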
[NLP-38] English-to-Prakrit Machine Translation via Multilingual Transfer Learning
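[Quick read]: This paper studies English-to-Prakrit machine translation in a low-resource setting where the target language, the classical language Prakrit, is not supported by IndicTrans2. The key step is to map Prakrit to the existing Hindi language tag (hin_Deva), adapting the multilingual model without modifying the tokenizer, vocabulary, or architecture. Using a parallel corpus of 1,474 Maharashtri Prakrit sentence pairs and a 20-sample Ardhamagadhi test set, the authors report corpus BLEU improvements over an untuned baseline. The results indicate that script-compatible language routing can enable feasible transfer to unsupported classical languages, while also exposing limitations caused by data scarcity and dialect mismatch. Code and trained models are publicly released.
Link: https://arxiv.org/abs/2606.06038
Authors: Om Choksi,Smit Kareliya,Shrikant Malviya,Pruthwik Mishra
Affiliations: Sardar Vallabhbhai National Institute of Technology, Surat
Categories: Computation and Language (cs.CL)
Comments: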
Abstract:We study English-to-Prakrit machine translation in a low-resource setting where the target language is unsupported by IndicTrans2. We adapt the multilingual model by mapping Prakrit to the Hindi language tag (hin_Deva) without modifying the tokenizer, vocabulary, or architecture. Using a 1,474-pair Maharashtri Prakrit parallel corpus and evaluation on a 20-sample Ardhamagadhi test set, we report corpus BLEU improvements over an untuned baseline. The results indicate that script-compatible language routing can enable feasible transfer to unsupported classical languages, while highlighting limitations due to data scarcity and dialect mismatch. Our code and trained models are released to the public for further exploration this https URL.
[NLP-39] NAVIRA: Decoupled Stochastic Remasking for Masked Diffusion Language Models
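[Quick read]: This paper addresses a correction problem in masked diffusion language models, which unmask many tokens in parallel: tokens generated in the same step are predicted from marginal distributions, so early local dependency errors can later contaminate the context. PRISM mitigates this by learning token-level quality scores and remasking unreliable tokens, but its inference rule is coupled, since the same forward pass both detects low-quality tokens and computes logits for their replacements, so the erroneous tokens still condition regeneration. The proposed NAVIRA is a decoupled inference-time decoding policy: a first forward pass scores tokens and remasking positions are sampled stochastically; the selected tokens are masked; and a second forward pass regenerates from the cleaned context. Temperature-controlled stochastic remasking reduces repeated correction of the same positions and balances fluency against diversity. In controlled experiments with a 170M-parameter masked diffusion language model, decoupling improves fluency, while scheduled stochastic remasking preserves entropy and achieves stronger LLM-judge scores under larger forward-pass budgets. The results suggest that the remasking policy itself, not only the learned quality signal, is central to reliable masked-diffusion text generation.
Link: https://arxiv.org/abs/2606.06031
Authors: Andrey Fomenko,Maksim Kryzhanovskiy,Svetlana Glazyrina,Roman Ischenko
Affiliations: Lomonosov Moscow State University; Institute for Artificial Intelligence, Lomonosov Moscow State University
Categories: Computation and Language (cs.CL)
Comments: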
Abstract:Masked diffusion language models generate text by iteratively unmasking many tokens in parallel, but this speed comes with a correction problem: tokens generated in the same step are predicted from marginal distributions, and early local dependency errors can later contaminate the context. PRISM addresses this by learning token-level quality scores and remasking unreliable tokens, but its inference rule is coupled: the same forward pass both detects low-quality tokens and computes logits for their replacements, so the erroneous tokens still condition regeneration. We propose NAVIRA, an inference-time decoding policy that separates these two operations and samples remasking positions stochastically. A first forward pass scores tokens; selected tokens are masked; a second forward pass regenerates from the cleaned context. Temperature-controlled remasking reduces repeated correction of the same positions and balances fluency against diversity. In controlled experiments with a 170M masked diffusion language model, decoupling improves fluency, while scheduled stochastic remasking preserves entropy and achieves stronger LLM-judge scores under larger forward-pass budgets. These results show that remasking policy, not only the learned quality signal, is central to reliable masked-diffusion text generation.
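The two-pass control flow described in this abstract can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the `score_fn` and `fill_fn` callables stand in for the two forward passes, `MASK` is a hypothetical mask symbol, and Gumbel-top-k sampling over `-quality / temperature` is one plausible way to realize temperature-controlled stochastic selection, not necessarily the paper's rule.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask symbol


def sample_remask_positions(quality, k, temperature, rng):
    """Sample k distinct positions, favouring low quality (Gumbel-top-k)."""
    keys = []
    for position, q in enumerate(quality):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        gumbel = -math.log(-math.log(u))
        keys.append((-q / temperature + gumbel, position))
    chosen = sorted(keys, reverse=True)[:k]
    return sorted(position for _, position in chosen)


def decoupled_step(tokens, score_fn, fill_fn, k, temperature, rng):
    """One correction step with scoring and regeneration kept separate."""
    quality = score_fn(tokens)  # pass 1: score tokens only
    positions = sample_remask_positions(quality, k, temperature, rng)
    cleaned = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return fill_fn(cleaned)     # pass 2: regenerate from the cleaned context


# Toy stand-ins for the model: fixed scores and a trivial filler.
scores = lambda tokens: [0.9, 0.1, 0.8, 0.2]
filler = lambda tokens: ["fixed" if t == MASK else t for t in tokens]
print(decoupled_step(["the", "catt", "sat", "dwn"], scores, filler,
                     k=2, temperature=1e-3, rng=random.Random(0)))
```

In this sketch a low temperature approaches deterministic selection of the lowest-scored tokens, while a higher temperature spreads selection across positions; the point of the second call is that the filler never sees the tokens that were judged unreliable.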
[NLP-40] RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit
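[Quick read]: This paper addresses the fact that community-conditioned language model adaptation currently requires choices about data collection, community definition, and evaluation that each study makes independently, which makes assumptions hard to compare and artifacts hard to reuse. RedditPersona is a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains one parameter-efficient adapter per strategy via QLoRA, and evaluates the adapters under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, more than 16M comments), the study finds that an adapter's behavioral identifiability tracks how closely its grouping strategy agrees with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies.
Link: https://arxiv.org/abs/2606.06027
Authors: Amirhossein Ghaffari,Ali Goodarzi,Huong Nguyen,Simo Hosio,Lauri Lovén,Ekaterina Gilman
Affiliations: University of Oulu
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: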
Abstract:Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters’ behavioral identifiability tracks each strategy’s intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: this https URL.
[NLP-41] EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation
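[Quick read]: This paper addresses two problems in LLM-based scientific peer review generation: generated comments are often generic, with insufficient evidence support and weak source traceability, and complex multi-agent systems that improve quality incur high inference costs. The proposed EGTR-Review framework relies on multi-agent teacher distillation. It first builds a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis, and then distills both the teacher's intermediate reasoning trajectories and its final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. On public peer-review datasets, EGTR-Review (Student) outperforms prompt-based, fine-tuned, and structured/agentic baselines on automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time.
Link: https://arxiv.org/abs/2606.06025
Authors: Xinpeng Qiu,Wang Yihu,Zhifeng Liu,Xiaochen Wang,Jimin Wang
Affiliations: Peking University; PKU-WUHAN Institute for Artificial Intelligence, Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: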
Abstract:Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.
[NLP-42] Contextualized Prompting For Stance Detection On Social Media
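[Quick read]: This paper addresses stance detection on social media, which is difficult because posts are short, noisy, and context-dependent. Zero-shot LLM approaches are typically prompted without contextual information, which limits their ability to interpret ambiguous posts. The paper systematically studies the effect of adding real-world context (e.g., user biographies), derived context (e.g., political party), and LLM-generated context (e.g., target descriptions) to zero-shot prompts for stance detection on Twitter. The evaluation covers four benchmark datasets, including a new high-quality German Twitter stance dataset. Across multiple LLMs, context improves performance only under specific conditions: LLM-generated target descriptions consistently improve accuracy, whereas other user metadata has mixed or even detrimental effects. Notably, adding other tweets by the same user, which often helps in supervised learning, can impair performance because of input noise. Qualitative analysis shows that LLMs struggle to separate task-relevant information from irrelevant context. Code and data are publicly released.
Link: https://arxiv.org/abs/2606.06022
Authors: Tilman Beck,Shakib Yazdani,Simon Kruschinski,Marcus Maurer,Iryna Gurevych
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: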
Abstract:Stance detection on social media is challenging due to short, noisy, and context-dependent language. While large language models (LLMs) show zero-shot generalization, they are typically prompted without contextual information, which limits their ability to interpret ambiguous posts. In this work, we systematically investigate the impact of incorporating real-world (e.g., user biographies), derived (e.g., political party), and LLM-generated (e.g., target descriptions) contextual features into zero-shot prompting for stance detection on Twitter. Our evaluation spans four benchmark datasets, including a new high-quality German Twitter stance dataset. Across multiple LLMs, we find that integrating contextual information improves performance, but only under specific conditions. LLM-generated target descriptions consistently enhance accuracy, while other user metadata has mixed or even detrimental effects. Notably, we show that the inclusion of other tweets by the same user, often beneficial in supervised learning, can impair performance due to input noise. Our qualitative analysis reveals that LLMs struggle to distinguish task-specific useful information from irrelevant context. Our findings highlight both the promise and challenges of prompting with context information in noisy real-world settings. We publish code and data at this https URL.
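[NLP-43] The Generator-Eraser Paradox: Community Guidelines for Responsible LLM-Assisted Dialect Resource Creation
[Quick read]: This paper addresses what it calls the generator-eraser paradox in LLM-assisted dialect resource creation. LLMs can accelerate dialect resource development through retrieval-grounded drafting, corpus navigation, metadata enrichment, and annotation workflow support, yet the same systems can contribute to dialect erasure by privileging prestige varieties, homogenizing orthography, and enabling synthetic feedback loops that reduce linguistic diversity over time. These risks are particularly acute for varieties characterized by diglossia, limited written standardization, or marginalized speaker communities. The paper makes three contributions: it draws on variationist sociolinguistics and corpus linguistics to formalize the paradox as a theoretical framework; it derives 12 community guidelines that turn the framework into implementable design requirements; and it presents an in-depth case study of Arabic dialects, including a structured comparison of widely used resources, to show how the guidelines address diglossia, orthographic variability, and community governance. The contribution is conceptual and operational rather than experimental, and aims to let dialect communities and resource builders adopt LLMs without sacrificing authenticity, variation, or sovereignty.
Link: https://arxiv.org/abs/2606.06004
Authors: Wajdi Zaghouani
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: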
Abstract:Dialect resources occupy a unique position at the intersection of scientific description, cultural preservation, and computational infrastructure. Large language models offer powerful capabilities for accelerating dialect resource development through retrieval-grounded drafting, corpus navigation, metadata enrichment, and annotation workflow support. However, the same systems pose substantial risks: they can contribute to dialect erasure by privileging prestige varieties, homogenizing orthography, and enabling synthetic feedback loops that reduce linguistic diversity over time. These risks are particularly acute for language varieties characterized by diglossia, limited written standardization, or marginalized speaker communities. This paper makes three contributions. First, we integrate insights from variationist sociolinguistics and corpus linguistics to formalize the generator-eraser paradox as a theoretical framework for understanding the dual nature of LLM-assisted dialect work. Second, we derive 12 community guidelines that operationalize this framework into implementable design requirements for dialect resource creation and documentation. Third, we provide an in-depth case study of Arabic dialects, including a structured comparison of widely used resources, to demonstrate how these guidelines address language-specific challenges including diglossia, orthographic variability, and community governance. The contribution is conceptual and operational rather than experimental, with the goal of enabling dialect communities and resource builders across languages to adopt LLMs without sacrificing authenticity, variation, or sovereignty.
[NLP-44] Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation
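[Quick read]: This paper addresses the cost of distilling reasoning models, whose long chain-of-thought (CoT) traces are expensive to train on and encourage verbose student outputs. It studies post-hoc compression of such traces before knowledge distillation: two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, each generate about 283k correct traces, and two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x. A length-matched truncation ablation shows that model-compressed traces usually beat or match naive truncation, especially for smaller students. Raw traces nevertheless retain the highest downstream accuracy at every scale and for both teachers, so compression is an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA, compressed traces narrow the raw-versus-compressed gap but do not exceed raw.
Link: https://arxiv.org/abs/2606.05988
Authors: Maxime Griot,Paul Steven Scotti,Tanishq Mathew Abraham
Affiliations: Université catholique de Louvain; Sophont Inc
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: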
Abstract:Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA compressed traces narrow the raw-vs-compressed gap but do not exceed raw.
[NLP-45] Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems
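[Quick read]: This paper addresses a gap in how multicultural multi-agent systems are evaluated. Existing cultural evaluation focuses on value alignment, that is, how closely a single agent matches a target culture, but alignment is a per-agent property and cannot show whether the system as a whole preserves the cultural plurality it is meant to represent. The paper proposes value diversity as a system-level evaluation axis, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, it evaluates 19 cultures and 18 backbone models across a wide range of system configurations. Diversity is largely uncorrelated with alignment, so the two capture complementary system properties, and current multicultural agent systems fall substantially below human societies in value diversity. Mixed-backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision-making.
Link: https://arxiv.org/abs/2606.05985
Authors: Shaoyang Xu,Jingshen Zhang,Long P. Hoang,Jinyuan Li,Wenxuan Zhang
Affiliations: Singapore University of Technology and Design; Washington University in St. Louis
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: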
Abstract:Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per-agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system-level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents’ responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated with alignment, indicating that the two capture complementary system properties, and that current multicultural agent systems fall substantially below human societies in value diversity. Mixed-backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision-making. Together, our results establish value diversity as a distinct evaluation axis for multicultural multi-agent systems and reveal a persistent homogenization tendency in current LLM-based societies. Our code and data are publicly available at this https URL.
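The abstract defines value diversity through dissimilarity between agents' survey responses but does not give a formula here. The sketch below shows one simple way such a system-level score could be computed, as the mean pairwise normalized absolute difference on a Likert-style scale. The function name, the choice of dissimilarity, the 1-to-10 scale, and the agent labels are all assumptions and should not be read as the paper's definition.

```python
from itertools import combinations


def value_diversity(responses, scale_min=1, scale_max=10):
    """Mean pairwise dissimilarity between agents' answers to the same survey.

    `responses` maps an agent label to a list of numeric answers, one per item.
    Returns a value in [0, 1]: 0 if all agents answer identically.
    """
    span = scale_max - scale_min
    answers = list(responses.values())
    if len(answers) < 2:
        raise ValueError("need at least two agents")

    def dissimilarity(x, y):
        return sum(abs(a - b) for a, b in zip(x, y)) / (span * len(x))

    pairs = list(combinations(answers, 2))
    return sum(dissimilarity(x, y) for x, y in pairs) / len(pairs)


# Hypothetical three-agent system answering a two-item survey.
system = {"agent_A": [1, 1], "agent_B": [10, 10], "agent_C": [1, 1]}
print(value_diversity(system))
```

A per-agent alignment score cannot detect the case where every agent is individually plausible yet all agents answer alike; a pairwise score of this kind is 0 in that case, which is the distinction the abstract draws.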
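[NLP-46] Framing, Judging, Steering: An Assessable Competency Model for Teaching Students to Reason With Generative AI
[Quick read]: This paper addresses the fact that educational assessment still measures unaided performance, whereas the real task with generative AI is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right and, where measured, collapses into a single "prompting" score that cannot diagnose why AI use succeeds or fails. The paper proposes CoRe-3 (Co-Reasoning), a competency model that factors productive AI use into three assessable skills, abbreviated FJS: Framing (specifying an ill-defined task before invoking AI), Judging (evaluating output for errors and unstated assumptions), and Steering (iteratively redirecting the model). Its distinguishing claim is the separation of pre-generation Framing from post-generation Steering, with Judging as the gate between them. The skills are grounded in theory, accompanied by five testable propositions, and instantiated in CoReasoningLab, an open platform that presents flawed AI output and scores the three skills independently. On simulated learners (generated and graded by different models), the skills dissociate: each tracks its own manipulated competence while staying flat in the others, and grades become correlated when one competence is shared across all three, which indicates convergent and discriminant validity; the pattern holds across grader backends from two providers. Human-rater agreement and outcome validation are planned next, and the instrument, data, and protocol are released.
Link: https://arxiv.org/abs/2606.05983
Authors: Alexander Apartsin,Yehudit Aperstein
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 4 pages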
Abstract:Generative AI makes answers easy and understanding hard, and uncritical use invites cognitive offloading. Schools still measure unaided performance, yet the real task is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right; where measured, it collapses into one “prompting” score that cannot diagnose why AI use succeeds or fails. We propose CoRe-3 (Co-Reasoning), a competency model factoring productive AI use into three assessable skills we abbreviate FJS: Framing (specifying an ill-defined task before invoking AI), Judging (evaluating output for errors and unstated assumptions), and Steering (iteratively redirecting the model). Its distinguishing claim is the separation of pre-generation Framing from post-generation Steering, with Judging as the gate between. We ground the skills in theory, state five testable propositions, and instantiate them in CoReasoningLab, an open platform that presents flawed AI output and scores them independently. Over simulated learners (generated and graded by different models), the skills dissociate: each tracks its own manipulated competence while staying flat in the others, and grades become correlated when one competence is shared across all three (convergent and discriminant validity), across grader backends from two providers. Human-rater agreement and outcomes are next; we release the instrument, data, and protocol.
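[NLP-47] The Self-Correction Illusion: LLMs Correct Others but Not Themselves
[Quick read]: This paper asks why LLM agents struggle to correct errors in their own reasoning traces, and whether the failure reflects a capability deficit or an artifact of the chat-template role label that carries the claim. With the erroneous claim held byte-identical across conditions and only its wrapping role varied, relabeling the claim from the agent's own thought role to an external role (a user message, a tool response, or a system memory block) raises the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 model-domain cells reaching p < 0.001. The effect is asymmetric, mechanistically decomposable, and robust across domains, which indicates that the failure to self-correct is a chat-template artifact rather than a cognitive deficit. The paper exploits this with a prompt-structure-only intervention that requires no training and no model modification; the most effective role label is domain-dependent, with the memory role strongest on math and a plain user message strongest on logical deduction.
Link: https://arxiv.org/abs/2606.05976
Authors: Kuan-Yen Chen,Fang-Yi Su,Jung-Hsien Chiang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: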
Abstract:Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent’s willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim’s content? Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent’s own thought, a user message, a tool response, or a system memory block. Across 13 model-domain cells covering seven model families and three domains (n=30 paired tasks per cell), relabeling the claim from the thought role to an external role lifts the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching p < 0.001. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self-correct is not a cognitive deficit; it is a chat-template artifact. We exploit this artifact by designing a prompt-structure-only intervention that requires no training and no model modification, with its strongest role label being domain-dependent: the memory role dominates on math, while a plain user message dominates on logical deduction.
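The control described in this abstract, keeping the claim byte-identical while changing only the role that carries it, can be checked mechanically. The sketch below is an assumed illustration of such a check using Python's standard `hashlib`; the role names, the message dictionary layout, and the example claim are hypothetical and are not taken from the paper's materials.

```python
import hashlib

ROLES = ["thought", "user", "tool", "memory"]  # hypothetical role labels


def wrap(role, claim):
    """Attach a role label to a claim without altering the claim text."""
    return {"role": role, "content": claim}


def claim_digest(message):
    """SHA-256 hex digest of the claim's UTF-8 bytes."""
    return hashlib.sha256(message["content"].encode("utf-8")).hexdigest()


claim = "17 * 3 = 52"  # a deliberately wrong claim used as the stimulus
conditions = [wrap(role, claim) for role in ROLES]
digests = {claim_digest(message) for message in conditions}

# One distinct digest means every condition carries exactly the same bytes,
# so any difference in correction behaviour is attributable to the role label.
print(len(digests) == 1)
```

Hashing the content rather than comparing strings by eye guards against invisible differences such as trailing whitespace or Unicode variants creeping into one condition.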
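[NLP-48] Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries
[Quick read]: This paper measures how sensitive LLM-based structured extraction from clinical free-text notes is to upstream configuration choices (prompt, model size, and schema), and does so without human-annotated ground truth. The method holds the extraction task and dataset (MIMIC-IV v3.1 discharge summaries) fixed and varies one choice at a time, then measures agreement between the resulting outputs. The fixed schema comprises 17 clinical documentation flags with a three-way value set (yes / no / not_documented) and a 47-tag vocabulary for the primary admission reason; three prompt variants were each run at two model sizes. On the three-way flags, the two models reach the same pooled cross-prompt agreement, and the larger model raises agreement on some fields while lowering it on others, a redistribution rather than the absence of an effect. Collapsing the three-way flags to binary dissolves most of the cross-prompt disagreement, which locates it on the distinction between absence and silence ("no" versus "not documented"). On the multi-class admission categorization, model choice matters far more than prompt phrasing, and the larger model places much less mass on residual catch-all categories. Together these results point to a schema-imposed source of disagreement and a dominance of model over prompt phrasing, identified by a reusable methodology for auditing extraction reproducibility.
Link: https://arxiv.org/abs/2606.05970
Authors: Martin Murin
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 69 pages, 5 main figures, supplementary material included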
Abstract:Large language models are increasingly used for structured extraction from clinical free-text notes, but the sensitivity of their output to upstream configuration choices is less understood than their accuracy on fixed benchmarks. This work measures that sensitivity without human-annotated ground truth, by holding the extraction task fixed and varying one choice at a time. The fixed schema comprises 17 clinical documentation flags on a three-way yes/no/not_documented value set and a 47-tag vocabulary for the primary admission reason. Three prompt variants expressing this schema were each run at two model sizes on MIMIC-IV v3.1 discharge summaries. Cross-prompt agreement was measured by Cohen’s kappa on ICD-stratified subsets. A paired same-note comparison isolated the effect of model choice, and a post-hoc collapse of the three-way flags to binary tested the schema’s contribution to disagreement. On the three-way flags, the two models reach the same pooled cross-prompt agreement (median kappa 0.69 and 0.68); the larger model raises agreement on some fields and lowers it on others, a redistribution rather than the absence of an effect. Collapsing the schema to binary dissolves most of the cross-prompt disagreement, locating it on the absence-versus-silence distinction rather than on whether the finding is present. On the multi-class admission categorization, changing the model reassigns the dominant tag on close to half of all notes while changing the prompt phrasing reassigns it on roughly one in eight, and the larger model places far less mass on residual catch-all categories (44% to 26%). These patterns indicate a schema-imposed source of disagreement concentrated on the absence-versus-silence axis and a dominance of model over prompt phrasing on multi-class categorization, identified by a reusable methodology for auditing extraction reproducibility on a population-scale deployment.
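The agreement measurement and the post-hoc collapse described in this abstract can be illustrated with a small worked example. The sketch below implements Cohen's kappa directly and applies it before and after collapsing the three-way flag to binary. The label sequences are invented for illustration, and the collapse rule (keep "yes", merge "no" with "not_documented") is an assumption about how the binary version was formed.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


def collapse_to_binary(label):
    """Assumed collapse: 'no' and 'not_documented' both become 'not_yes'."""
    return "yes" if label == "yes" else "not_yes"


# Invented outputs of two prompt variants on the same six notes.
prompt_a = ["yes", "no", "not_documented", "no", "yes", "not_documented"]
prompt_b = ["yes", "not_documented", "no", "no", "yes", "no"]

three_way = cohens_kappa(prompt_a, prompt_b)
binary = cohens_kappa([collapse_to_binary(x) for x in prompt_a],
                      [collapse_to_binary(x) for x in prompt_b])
print(three_way, binary)
```

In this toy data the two prompts never disagree on whether a finding is present; all of their disagreement is between "no" and "not_documented", so kappa rises once those two values are merged. That is the pattern the abstract attributes to the absence-versus-silence distinction.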
[NLP-49] Large Language Models are Perplexed by some Political Parties
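【Quick Read】: This paper studies the political fairness of Large Language Models (LLMs), which are increasingly used in political applications, by asking whether models treat texts from different political camps evenly. It uses perplexity as the measure, on the premise that a fair model should assign equal probability to the texts of all political groups. Across ten LLMs and three datasets covering 37 languages, the models are more perplexed by texts from far-right and nationalist parties than by texts from social-democratic parties. This is consistent with earlier findings on translation fairness, and perplexity correlates with downstream translation metrics. The method applies to both base and instruction-tuned models, whose results are highly correlated. This suggests that the bias stems from pretraining and is hardly affected by instruction tuning, which points to pretraining data rather than post-training as the place to intervene.
Link: https://arxiv.org/abs/2606.05937
Authors: Paul Lerner,François Yvon
Affiliations: Sorbonne Université, CNRS, ISIR (索邦大学,法国国家科学研究中心,智能研究所)
Categories: Computation and Language (cs.CL)
Comments: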
Abstract:Large Language Models (LLMs) are increasingly used, including in political applications, but their political fairness has been little studied. We assess it using perplexity, posing that a fair model should give equal probability to all political groups. However, we find, across ten LLMs and three datasets covering 37 languages, that LLMs are more perplexed by the texts of far right and nationalist parties than of social-democratic parties. We find this to be consistent with previous work on translation fairness, to the point that perplexity correlates with downstream translation metrics. Our method is applicable to both base LLMs as well as their instruction-tuned counterpart, and we find that both are highly correlated, suggesting that the political fairness of LLMs stems from their pretraining, and is hardly affected by instruction-tuning.
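For readers unfamiliar with the metric, perplexity is the exponential of the negative mean log-probability a model assigns to each token. The sketch below uses the standard definition; the per-token log-probabilities are made-up values, not results from the paper, and obtaining them from a real model is out of scope.

```python
import math


def perplexity(token_logprobs):
    """exp(-mean natural-log probability per token); lower means less 'perplexed'."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities for two four-token texts.
text_a = [math.log(0.5)] * 4   # every token predicted with probability 0.5
text_b = [math.log(0.25)] * 4  # every token predicted with probability 0.25

ppl_a = perplexity(text_a)
ppl_b = perplexity(text_b)
print(ppl_a, ppl_b)
```

Under the abstract's premise, a fair model would give similar values for comparable texts from different political groups, so a persistent gap like the one between `ppl_a` and `ppl_b` is what the study looks for.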
[NLP-50] Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails
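【Quick Read】: This paper examines systematic bias in the pretraining filters and inference-time guardrails of generative AI systems, and in particular how these mechanisms lead to the "epistemic erasure" of marginalized groups such as transgender people, women, and Central Americans. The authors audit four pretraining filters and three inference-time guardrails on Common Crawl sentences mentioning gender and regional origin, together with a manually annotated subset of 500 sentences. Decisions by both kinds of system are strongly associated with blocklist-based lexical cues, yet they frequently fail to flag content containing private information or explicit hate speech, while mentions of marginalized groups are significantly over-flagged. Human annotators would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, and they more often recognize the representational harms caused by content removal. The results point to a gap between automated criteria and human judgment of context and social impact, suggesting a need for more context-aware and ethically sensitive content governance.
Link: https://arxiv.org/abs/2606.05936
Authors: Marco Antonio Stranisci,A Pranav,Rossana Damiano,Christian Hardmeier,Anne Lauscher
Affiliations: University of Turin(都灵大学); IT University of Copenhagen(哥本哈根信息技术大学); Trustworthy AI Lab, University of Hamburg(汉堡大学可信人工智能实验室)
Categories: Computation and Language (cs.CL)
Comments: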
Abstract:Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, often recognizing representational harms arising from tensions of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and additionally suppressed again at inference time.
[NLP-51] Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach ACL2026
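【Quick Read】: This paper addresses literary translation, which is hampered by the scarcity of high-quality annotated data and by the need to balance fluent expression with literary effect. It proposes a multi-aspect iterative refinement framework in which specialized LLM translators, each targeting a distinct quality dimension, generate high-quality reference translations and preference data. The generated data is used for supervised fine-tuning (SFT) and reinforcement learning (RL). The generated references outperform the original ground truth for SFT by 8.65 CEA100 points. In RL, DPO degrades performance in this setting, whereas GRPO with an explicit reward model adds a further 1.51 points, which the authors attribute to the stability of two-stage training and GRPO's online exploration. The resulting LitMT-8B and LitMT-14B models score 67.25 and 69.07 CEA100 on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 (68.43), and generalize well to out-of-domain literary work such as O. Henry.
Link: https://arxiv.org/abs/2606.05924
Authors: Zhihao Lin,Ziqi Zhu,Hao Huang,Guanghui Wang,Peiyang He
Affiliations: Amazon Web Services (AWS); Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2026 Industry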
Abstract:Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO’s online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. Henry).
[NLP-52] Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
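【Quick Read】: This paper tackles how an AI agent can keep improving its harness (its combination of skills, tools, and workflows) in real deployments where labeled data is unavailable. Existing optimization methods usually need ground-truth validation sets, which are hard to obtain in practice. The authors propose Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the harness using only past trajectories. RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent then analyzes the rollouts with self-validation and self-consistency, generates candidate harness updates, and picks the best one by its own pairwise self-preference. A single optimization round raises the pass rate on SWE-Bench Pro from 59% to 78% without external grading. Further analysis shows that RHO targets prior failure modes, and that the optimized harness changes the agent's behavior patterns and sustains higher accuracy in long-horizon sessions.
Link: https://arxiv.org/abs/2606.05922
Authors: Wenbo Pan,Shujie Liu,Chin-Yew Lin,Jingying Zeng,Xianfeng Tang,Xiangyang Zhou,Yan Lu,Xiaohua Jia
Affiliations: City University of Hong Kong (香港城市大学); Microsoft Research Asia (微软亚洲研究院)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code: this https URL ; Project website: this https URL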
Abstract:AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent’s behavior patterns and sustains higher accuracy during long-horizon sessions.
[NLP-53] Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement
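【Quick Read】: This paper addresses the gap between existing code-generation benchmarks and real web development. Conventional benchmarks score a single mapping from a complete prompt to a one-shot output, whereas real users rarely write a full spec up front and clarify requirements by reacting to intermediate results. The authors present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement and is grounded in browser-rendered behavior. Each task runs as a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns the evaluation outcomes into natural-language feedback for the next round. The benchmark contains 50 web tasks, 784 evaluation criteria, and 2402 expected outcomes. Across the evaluated LLMs, weighted Task Pass Rate varies by 38 percentage points, and models differ substantially in their ability to repair from feedback. Even the strongest model completes only 52% of projects after three rounds, so the benchmark is far from saturated.
Link: https://arxiv.org/abs/2606.05920
Authors: Xin Wang,Liangtai Sun,Yaoming Zhu,Shuang Zhou,Jiaxing Liu,Fengjiao Chen,Lin Qiu,Xuezhi Cao,Xunliang Cai,Licheng Zhang,Zhendong Mao
Affiliations: University of Science and Technology of China (中国科学技术大学); Independent researchers (独立研究者)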
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: under review
Abstract:Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. We benchmark 8 LLMs across 2 agent frameworks. The results separate models clearly: weighted Task Pass Rate varies by 38 percentage points and models also differ substantially in their ability to repair from feedback. Asuka-Bench is also far from saturated: even the strongest model completes only 52% of projects after three rounds.
[NLP-54] MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering
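【Quick Read】: This paper addresses long-video question answering, where answer-relevant evidence is sparse, transient, and temporally dispersed across a long video context. Existing frame-centric methods improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution, but they still treat isolated, fragmented frames as the basic unit of evidence and struggle to capture coherent event-level semantics. The proposed MemoryCard is a video-memory-based augmentation framework. It first runs a self-reading pass over the video and its aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit it generates an event-level video gist and selects representative visual moments, then renders them into unified Memory Cards used for retrieval and question answering. Under comparable visual-token budgets, MemoryCard achieves up to a 21.8% relative improvement in accuracy.
Link: https://arxiv.org/abs/2606.05917
Authors: Qing Yang,Pengcheng Huang,Xinze Li,Zhenghao Liu,Yukun Yan,Yu Gu,Ge Yu,Gang Li,Maosong Sun
Affiliations: Northeastern University (东北大学); Tsinghua University (清华大学); Digital China Group (数字中国集团)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 21 pages, 8 figures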
Abstract:Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs’ ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at this https URL.
[NLP-55] ACE-SQL: Adaptive Co-Optimization via Empirical Credit Assignment for Text-to-SQL
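【Quick Read】: This paper targets inaccurate schema linking in Text-to-SQL over large, complex database schemas. Existing methods either generate from the full schema, which leaves schema linking implicit within a large search space, or use a separate retriever trained with static gold-column supervision, whose targets may be suboptimal for the current generator policy. The authors propose ACE-SQL (Adaptive Co-optimization via Empirical Credit Assignment for Text-to-SQL), a reinforcement learning framework that jointly optimizes schema retrieval and SQL generation under execution feedback. ACE-SQL builds an online column-set pool from generator rollouts and derives adaptive, on-policy retrieval targets from the column set most frequently associated with execution-correct rollouts. The adaptation is bidirectional: the retriever moves toward column sets the generator can execute correctly, while the generator adapts to the retriever's evolving schema selections. Trained with RL on about 3k synthetic Text-to-SQL question-database pairs, ACE-SQL reaches 65.3% greedy execution accuracy on BIRD Dev while using 0.93k output tokens per query.
Link: https://arxiv.org/abs/2606.05906
Authors: Xiaobing Chen,Ai Jian,Eryu Guo,Zhiqi Pang
Affiliations: Harbin Engineering University (哈尔滨工程大学); Harbin Institute of Technology (哈尔滨工业大学); Beijing University of Posts and Telecommunications (北京邮电大学)
Categories: Computation and Language (cs.CL)
Comments: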
Abstract:Text-to-SQL maps natural language questions to executable SQL queries. Modern databases often contain large and complex schemas, making schema linking a critical step for accurate SQL generation. Existing methods either rely on full-schema generation, which leaves schema linking implicit within a large search space, or use a separate retriever trained with static gold-column supervision, whose targets may be suboptimal for the current generator policy. To address this issue, we propose Adaptive Co-optimization via Empirical Credit Assignment for Text-to-SQL (ACE-SQL), a reinforcement learning (RL) framework that jointly optimizes schema retrieval and SQL generation under execution feedback. ACE-SQL constructs an online column-set pool from generator rollouts and derives adaptive on-policy retrieval targets from the column set most frequently associated with execution-correct rollouts. This induces bidirectional adaptation, where the retriever adapts toward column sets that the generator can execute correctly, while the generator adapts to the retriever’s evolving schema selections under execution feedback. With approximately 3k synthetic Text-to-SQL question-database pairs for RL training, ACE-SQL achieves 65.3% greedy execution accuracy on BIRD Dev while using 0.93k output tokens per query. The repository is available at this https URL.
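The target-selection step described in the abstract above (take the column set most frequently associated with execution-correct rollouts) might look roughly like the following. This is an illustrative sketch only: the rollout format, the column names, and the function `adaptive_retrieval_target` are assumptions, and the paper's actual credit assignment may weight or filter rollouts differently.

```python
from collections import Counter


def adaptive_retrieval_target(rollouts):
    """Pick the column set that appears most often among execution-correct rollouts.

    rollouts: iterable of (columns_used, executed_correctly) pairs (hypothetical format).
    Returns a frozenset of column names, or None if no rollout executed correctly.
    """
    counts = Counter(frozenset(cols) for cols, ok in rollouts if ok)
    if not counts:
        return None
    target, _ = counts.most_common(1)[0]
    return target


# Hypothetical generator rollouts for one question.
rollouts = [
    (["orders.id", "orders.total"], True),
    (["orders.id", "users.name"], False),
    (["orders.total", "orders.id"], True),  # same set, different order
    (["orders.total"], True),
]
target = adaptive_retrieval_target(rollouts)
print(sorted(target))
```

Using `frozenset` makes the count independent of column order, so the target depends only on which columns a correct query touched.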
[NLP-56] Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)
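【Quick Read】: This paper addresses hallucination in Large Language Models (LLMs) and the difficulty of reasoning and answering questions over proprietary information absent from the training data. Fine-tuning is expensive, and Retrieval-Augmented Generation (RAG), while a common deployment choice, still falls short on factual accuracy and recall for complex queries. The authors explore supporting the RAG subsystem with a lightweight graph structure that has a relatively simple graph schema, exposed through a dedicated toolset. They build an agentic system whose vector-search and graph-query tools operate over a structured dataset derived from a curated subset of English Wikipedia articles, and evaluate it on MoNaCo, a challenging Wikipedia QA benchmark of complex questions. Adding graph-based tools significantly increases the precision and recall of factual correctness, halves the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios, with only a modest increase in token usage.
Link: https://arxiv.org/abs/2606.05901
Authors: Christopher J. Wedge,Joshua Stutter,Danny Dixon,Jacek Cała
Affiliations: National Innovation Centre for Data, Newcastle University (国家创新数据中心,纽卡斯尔大学)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: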
Abstract:Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking to both avoid the well known risk of the LLM “hallucinating” information, and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training without resorting to expensive model fine-tuning. In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema, to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks. Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this with a modest increase in token usage.
[NLP-57] Representing Research Attention as Contextually Structured Flows
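【Quick Read】: This paper addresses a core weakness in how research attention is measured: the usual aggregated counts do not preserve how attention develops across contexts over time, creating a mismatch between how attention is interpreted and how it is represented. The authors propose attention flows, contextually structured representations that encode the organisation of attention and its evolution over time. They build a benchmark based on analogy-style reasoning across research outputs and compare signal, sequence, and flow-based representations. Flow representations support structural comparison more effectively, particularly where attention is shaped by temporal progression or context distributions, and learned flow representations are more robust under partial observation and structural perturbation. The results support modelling attention as a contextually structured phenomenon and offer a more informative basis for research evaluation.
Link: https://arxiv.org/abs/2606.05895
Authors: Jessica Rodrigues,Angelo Salatino,Gard Jenset,Scott Hale
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at STi 2026 - International Conference on Science and Technology Indicators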
Abstract:Research attention is widely used as an indicator of visibility, influence, and societal uptake, yet it is typically represented as aggregated counts that do not preserve how attention develops across contexts over time. This creates a mismatch between how attention is interpreted and how it is represented. We propose attention flows as contextually structured representations that encode the organisation of attention and its evolution over time. We evaluate whether these representations capture transferable structure by constructing a benchmark based on analogy-style reasoning across research outputs. Comparing signal, sequence, and flow-based representations, we find that flow representations more effectively support structural comparison, particularly in settings where attention is shaped by temporal progression or context distributions. We further show that learned flow representations improve robustness under partial observation and structural perturbation. Overall, these results support modelling attention as a contextually structured phenomenon and provide a basis for more informative approaches to research evaluation.
[NLP-58] EMBER: Efficient Memory via Budgeted Evidence Retention for Long-Horizon Agents
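【Quick Read】: This paper asks how a long-horizon agent should retain key evidence under a limited memory budget so that future queries can still be answered. Such agents can archive large histories, but when retained memory misses answer-relevant evidence, the system must go back to larger portions of the raw history, at high retrieval, rereading, and context cost. The authors formalize this as Budgeted Pre-Query Retention: memory is written during ingestion, before the query is known, and later read without access to the full raw stream. They introduce EMBER, a learned retention policy that builds a compact, source-backed evidence state out of evidence capsules, which are verbatim source excerpts paired with retrieval keys and update metadata, preserving both grounding and read-time access. Post-query outcome feedback trains the writer to preserve evidence across the ingestion-retrieval-answer chain. On LongMemEval-RR, EMBER-14B reaches 0.3017 F1 at the 8192-token retained-evidence budget, versus 0.1765 for the strongest non-EMBER budgeted baseline, and it improves F1, Retain-Recall, and Read-Recall across budgets. The authors conclude that long-horizon memory depends on retaining the right evidence within the budget rather than on rereading larger histories.
Link: https://arxiv.org/abs/2606.05894
Authors: Yilong Li,Suman Banerjee,Tong Che
Affiliations: University of Wisconsin–Madison (威斯康星大学麦迪逊分校); NVIDIA Research (英伟达研究)
Categories: Computation and Language (cs.CL)
Comments: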
Abstract:Long-horizon agents can archive large histories, but future answers still incur retrieval, rereading, and context costs. When retained memory misses answer-relevant evidence, the system must return to larger portions of the raw history. We study budgeted evidence survival: before the query is known, which source evidence should be retained so that it remains recoverable and usable under a fixed retained source-evidence token budget? We instantiate this setting as Budgeted Pre-Query Retention, where memory is written during ingestion and later read without access to the full raw stream. We introduce EMBER, a learned retention policy that constructs a compact, source-backed evidence state. EMBER stores evidence capsules: verbatim source excerpts paired with retrieval keys and update metadata, preserving both grounding and read-time access. Post-query outcome feedback trains the writer to preserve evidence across the ingestion-retrieval-answer chain. On LongMemEval-RR, our LongMemEval-derived retained-evidence protocol, EMBER-14B reaches 0.3017 F1 at the 8192-token retained-evidence comparison point, compared with 0.1765 for the strongest non-EMBER budgeted baseline. Across retained source-evidence budgets, EMBER improves F1, Retain-Recall, and Read-Recall, indicating that long-horizon memory depends on retaining evidence within the budget rather than rereading larger histories.
[NLP-59] Staying with the Uncertainty: Uncertainty-Scaffolding Strategies for Artificial Moral Advisors in LLM-to-LLM Simulated Conversations
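【Quick Read】: This paper asks how a large language model (LLM) acting as an Artificial Moral Advisor (AMA) can help its interlocutors "stay with the uncertainty", that is, foster ethical reflection rather than simply persuading or flattering. It proposes three uncertainty modes (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) and compares them with three control conditions (Baseline, Persuasive, Sycophantic) in LLM-to-LLM simulated conversations about ethical dilemmas, with pre- and post-conversation questionnaires. The strategies do not differ in how much stance revision they produce, but they do differ in the quality of engagement they sustain. Of the two persona prompt formats, Declarative personas better capture initial stance diversity, while Narrative personas show more realistic belief revision.
Link: https://arxiv.org/abs/2606.05890
Authors: Salvatore Greco,Hainiu Xu,Jacopo Domenicucci,Yulan He,Sylvie Delacroix
Affiliations: King’s College London (伦敦国王学院); Tohoku University (东北大学); Dartmouth College (达特茅斯学院)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: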
Abstract:LLMs are increasingly deployed as Artificial Moral Advisors (AMA) in a variety of contexts: what kind of conversational patterns should they display? In this paper, we study how AMA can help their interlocutors “stay with the uncertainty”. We propose three modes of uncertainty (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) and compare them against three control conditions (Baseline, Persuasive, Sycophantic). A user-agent LLM engages in a dialogue on an ethical dilemma with an AMA following a specific uncertainty strategy, and completes pre- and post-conversation questionnaires. We further examine the effect of two persona prompt formats (Declarative and Narrative). We found that (1) no single model dominates as a simulated user agent, with open models aligning with human ambiguity through between-persona divergence and closed models through within-persona hedging; (2) declarative personas better capture initial stance diversity while narrative personas show more realistic belief revision; (3) all six AMA strategies produce distinguishable conversational patterns; and (4) uncertainty strategies differ not in how much stance revision they produce, but in the quality of engagement they sustain.
[NLP-60] GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech
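【Quick Read】: This paper addresses entangled acoustic style control in zero-shot autoregressive text-to-speech (TTS): the speaker prompt ties speaker identity to prosodic attributes such as speaking rate and pitch, so changing style usually means changing the prompt. The proposed GLASS framework treats each acoustic attribute as a control direction defined by a post-generation reward instead of by style labels. For each control axis, it freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and word error rate (WER) as an intelligibility anchor. Because each control is a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch show targeted style shifts that preserve naturalness, speaker similarity, and intelligibility, along with smooth interpolation and multi-axis composition.
Link: https://arxiv.org/abs/2606.05889
Authors: Jaehoon Kang,Yejin Lee,Kyuhong Shim
Affiliations: Sungkyunkwan University (成均馆大学)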
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:
Abstract:We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.
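The "linear LoRA arithmetic" mentioned in the abstract above can be pictured with a small sketch: each adapter contributes a low-rank update ΔW = B·A, and adapters are interpolated or composed by adding scaled updates to the frozen weight. The matrices, the adapter names, and the helpers `lora_delta` and `compose` are toy assumptions, not GLASS's actual implementation.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_delta(b, a, scale=1.0):
    """Low-rank weight update: scale * (B @ A)."""
    return [[scale * v for v in row] for row in matmul(b, a)]


def compose(base, weighted_deltas):
    """Return base + sum(weight * delta), leaving the frozen base untouched."""
    merged = [row[:] for row in base]
    for delta, weight in weighted_deltas:
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += weight * v
    return merged


# Toy frozen 2x2 weight and two rank-1 adapters (hypothetical values).
base = [[1.0, 0.0], [0.0, 1.0]]
rate_delta = lora_delta([[1.0], [0.0]], [[0.0, 2.0]])   # "speaking rate" axis
pitch_delta = lora_delta([[0.0], [1.0]], [[4.0, 0.0]])  # "pitch" axis

# Half-strength rate control combined with full-strength pitch control.
merged = compose(base, [(rate_delta, 0.5), (pitch_delta, 1.0)])
print(merged)
```

Setting a weight to 0 drops that adapter, and sweeping it between 0 and 1 gives the kind of interpolation the abstract describes, all without touching `base`.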
[NLP-61] Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models
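【Quick Read】: This paper argues that current evaluations of Multimodal Large Language Models (MLLMs) focus overwhelmingly on utility-driven objectives and neglect model behavior in logic-neutral scenarios. Where several actions are equally valid, such as recommending travel itineraries or daily schedules, stochasticity is needed to avoid repetitive behavior and to cover the valid alternatives, yet models tend to behave deterministically. The authors propose RandomBench, a benchmark that tests whether MLLMs stay distributionally neutral when choosing among equivalent options, and introduce three metrics (RI, BCI, and BII) to quantify entropy and distributional bias. Experiments reveal a pervasive Stochastic Collapse: even under explicit instructions to choose at random, top-1 probabilities reach 97% against an ideal baseline of one quarter, and RI drops to 0.068 for Claude Sonnet 4.6. Extensive ablations show that the deviations persist across languages and representation formats.
Link: https://arxiv.org/abs/2606.05874
Authors: Huiyuan Zheng,Houtao Zhang,Boyang Wang,Qingyi Si,Hongcheng Guo
Affiliations: Fudan University (复旦大学); Beihang University (北京航空航天大学); JD.com (京东)
Categories: Computation and Language (cs.CL)
Comments: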
Abstract:Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, including RI, BCI, BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions, with top-1 probabilities reaching 97% from the ideal one quarter baseline and RI dropping to 0.068 in Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.
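One simple way to quantify the collapse described in the abstract above is to sample a model's choice many times and compare the empirical distribution with uniform. The sketch below uses normalized Shannon entropy and the top-1 share. These are generic measures chosen for illustration; the abstract does not define RI, BCI, or BII, so nothing here should be read as the paper's metrics, and the sampled choices are invented.

```python
import math
from collections import Counter


def normalized_entropy(choices, num_options):
    """Shannon entropy of the empirical choice distribution, divided by log2(num_options).

    1.0 means uniform over all options; 0.0 means the same option every time.
    Requires num_options >= 2.
    """
    n = len(choices)
    counts = Counter(choices)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(num_options)


def top1_share(choices):
    """Fraction of samples taken by the single most frequent option."""
    return max(Counter(choices).values()) / len(choices)


# Hypothetical repeated answers to "pick one of A-D at random".
uniform_picks = ["A", "B", "C", "D"] * 25
collapsed_picks = ["A"] * 97 + ["B", "C", "D"]

print(normalized_entropy(uniform_picks, 4), top1_share(uniform_picks))
print(normalized_entropy(collapsed_picks, 4), top1_share(collapsed_picks))
```

The second sample mirrors the 97% top-1 figure quoted in the abstract against the ideal share of one quarter for four options.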
[NLP-62] YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition
【Quick Read】: This paper addresses the key-value (KV) cache memory overhead that bottlenecks high-concurrency deployment of Large Language Models (LLMs) in finance, inflating infrastructure cost and limiting scalability. It proposes YouZhi-LLM, an efficient financial LLM built on a structural transition and training pipeline native to the Huawei Ascend ecosystem. Its algorithmic core is a layer-adaptive transition framework from GQA (Grouped-Query Attention) to MLA (Multi-head Latent Attention) that assigns a FreqFold size to each layer, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline combines generalized knowledge distillation with finance-specific supervised fine-tuning. The adaptive transition reduces perplexity degradation by up to 35% over uniform baselines. Deployed on Ascend NPUs via vLLM-Ascend and compared with their base models, YouZhi-7B improves the average financial benchmark score by 12.3% with a 2.69× increase in maximum concurrency, and YouZhi-14B gains 7.0% in accuracy with a 2.43× concurrency boost.
Link: https://arxiv.org/abs/2606.05868
Authors: PSBC LLM Team,Huawei LLM Team:Ruihan Long,Junjie Wu,Tianan Zhang,Duo Zhang,Yaozong Wu,Jinbin Fu,Chang Liu,Zhentao Tang,Wenshuang Yang,Xin Wang,Zhihao Song,Ning Huang,Wenjing Xu,Shuai Zong,Shupei Sun,Sen Wang,Jing Hu,Bin Wang,Xinyu Wang,Junkui Ju,Zequn Ding,Jie Ran,Man Luo,Shixiong Kai,Linkai Hou,Kaichao Liang,Hu Zhao,Yang Zhao,Shucheng Lin,Wei Yu,Chenghan Jiang,Jingjing Ding,Jiahui Zhang,Tian Jin,Yuhang Zhang,Dong Guo,Wei Sun,Jun Xie,Jianwei Li,Lei Cao,Pei Li,Jiabin Li,Jia Yuan,Rui Yuan,Jing Zhu,Mingxuan Yuan,Zhangcheng Lv,Xin Jiang,Xiuhong Fei,Xiaozhe Ren,Yulong Li,Zhipeng Zhang,Hang Wang,Zhaohui Xu,Rui Zhao,Yibo He,Xinzhuang Niu
Affiliations: Postal Savings Bank of China (中国邮政储蓄银行); Huawei Technologies (华为技术)
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69× increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43× concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.
[NLP-63] Analysis of the Neglect-Zero Effect in Large Language Models ACL2026
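【Quick Read】: This paper investigates how closely the language processing of Large Language Models (LLMs) resembles human cognition, focusing on a human cognitive bias called the neglect-zero effect: the tendency to ignore zero-models, configurations that make a proposition vacuously true by virtue of an empty set. It looks at two types of inference driven by the neglect-zero effect and compares LLM behavior on them with behavior on an inference that does not involve the effect. The experiments use a structural priming paradigm, in which recent exposure to a preceding sentence (the prime) facilitates processing of a subsequent sentence (the target) because of their structural similarity. Primes are designed to force the LLMs to consider the zero-model, and the analysis checks whether the models then also consider it in the target. The results suggest that the neglect-zero effect may not occur in the LLMs studied, which would mark a difference from human processing in these logical settings.
Link: https://arxiv.org/abs/2606.05864
Authors: Jin Tanaka,Daiki Matsuoka,Ryoma Kumon,Hitomi Yanaka
Affiliations: The University of Tokyo (东京大学); RIKEN (理化学研究所); Tohoku University (东北大学)
Categories: Computation and Language (cs.CL)
Comments: 14 pages (10 pages main text), 8 figures. To appear in the Proceedings of the ACL2026 Student Research Workshop (SRW)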
Abstract:We investigate the extent to which the language processing of LLMs resembles human cognitive processes, focusing on a human cognitive bias called the neglect-zero effect. This effect refers to the human tendency to ignore zero-models, which are configurations that render a proposition vacuously true by virtue of an empty set. We focus on two types of inferences driven by the neglect-zero effect, and examine how LLMs process these inferences by comparing their behavior with that in an inference that does not involve the neglect-zero effect. For this purpose, we employ a paradigm based on structural priming, where recent exposure to a preceding sentence (the prime) facilitates the processing of a subsequent sentence (the target) due to their structural similarity. We prepare primes to force LLMs to consider the zero-model, and analyze whether they also consider it in the target. The results suggest that the neglect-zero effect may not occur in the LLMs analyzed in this study. Our code is available at this https URL
[NLP-64] TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization
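【Quick Read】: This paper addresses a limitation of continuous latent reasoning in large language models (LLMs): the inherently deterministic nature of continuous representations restricts policy exploration in reinforcement learning (RL). It proposes TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that switches adaptively, at each step, between discrete token generation and continuous latent reasoning. A lightweight action head router observes the current hidden state and samples a routing decision from a binary mode-selection space, which preserves the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and the router are optimized jointly, end to end, with a shared group-relative advantage signal. Experiments on Qwen2.5 (1.5B to 7B) and Llama-3.1-8B backbones show that TARPO consistently outperforms existing explicit and latent reasoning RL baselines, learns adaptive token-wise switching, and keeps training stable.
Link: https://arxiv.org/abs/2606.05859
Authors: Liting Zhang,Shiwan Zhao,Xuyang Zhao,Zichen Xu,Jianye Wang,Qicheng Li
Affiliations: Nankai University (南开大学)
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 12 figures. Code available at this https URL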
Abstract:Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics. Our code is available at this https URL.
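The routing step described in the abstract can be illustrated with a minimal sketch. It assumes the router is a single linear layer followed by a sigmoid, which the abstract does not specify; the names `route_probability`, `sample_route`, and the weight values are hypothetical and chosen only for illustration.

```python
import math
import random

def route_probability(hidden_state, weights, bias=0.0):
    """Probability of choosing the latent (continuous) mode at this step.

    Assumption: a linear score over the hidden state passed through a sigmoid.
    """
    score = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def sample_route(hidden_state, weights, bias=0.0, rng=random):
    """Sample a binary routing decision: 'latent' or 'token'."""
    p_latent = route_probability(hidden_state, weights, bias)
    return "latent" if rng.random() < p_latent else "token"

rng = random.Random(0)           # fixed seed so the sketch is reproducible
hidden = [0.5, -1.0, 0.25]       # toy hidden state
router_w = [1.0, 0.5, -2.0]      # hypothetical router weights
p = route_probability(hidden, router_w)
decisions = [sample_route(hidden, router_w, rng=rng) for _ in range(1000)]
latent_fraction = decisions.count("latent") / len(decisions)
```

Because the decision is sampled rather than taken by argmax, repeated rollouts from the same state can follow different modes, which is the source of exploration the abstract refers to.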
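[NLP-65] ReverseEOL: Improving Training-free Text Embeddings via Text Reversal in Decoder-only LLMs
[Quick read]: Causal attention in decoder-only LLMs prevents earlier tokens from attending to future context, which biases contextual representations and limits the quality of training-free text embeddings. The paper proposes Reverse prompting with Explicit One-word Limitation (ReverseEOL), a simple method that generates an additional embedding from the reversed input text and combines it with the standard forward embedding. Because reversal exposes each token to context that was invisible in the original order, the reversed embedding supplies complementary information and strengthens the representations of a frozen LLM without any training. On STS and MTEB benchmarks, ReverseEOL significantly improves training-free baselines across LLMs of varied architectures and scales, and ablations confirm that the reversal mechanism is necessary.
Link: https://arxiv.org/abs/2606.05858
Authors: Ailiang Lin, Zhuoyun Li, Yusong Wang, Keyu Mao, Kotaro Funakoshi, Manabu Okumura
Affiliations: Institute of Science Tokyo; Tencent
Categories: Computation and Language (cs.CL)
Comments: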
Abstract:Recent advances in Large Language Models (LLMs) have opened new avenues for generating training-free text embeddings. However, the causal attention in decoder-only LLMs prevents earlier tokens from attending to future context, leading to biased contextualized representations. In this work, we propose Reverse prompting with Explicit One-word Limitation (ReverseEOL), a simple yet effective method for enhancing the representational capability of frozen LLMs. ReverseEOL augments the standard forward embedding with an additional reversed embedding derived from the reversed input text. Since reversing the input exposes each token to context inaccessible in the original order, the resulting reversed embedding effectively provides complementary information to the original one. As a result, combining the forward and reversed embeddings yields a richer final representation. Comprehensive experiments on STS and MTEB benchmarks demonstrate that ReverseEOL significantly improves the performance of existing training-free baselines across a broad range of LLMs with diverse architectures and scales. Extensive ablations and analyses further confirm the necessity of our reversal mechanism.
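As a rough illustration of combining a forward embedding with one computed on reversed text, the sketch below uses a toy, order-sensitive embedding in place of a frozen LLM. Word-level reversal and summation followed by normalization are assumptions; the abstract states neither the reversal granularity nor the fusion rule. `toy_embed`, `reverse_words`, and `combined_embedding` are hypothetical names.

```python
import math

def toy_embed(text, dim=8):
    """Stand-in for a frozen LLM embedding; earlier words weigh more,
    so the result depends on word order."""
    vec = [0.0] * dim
    for pos, word in enumerate(text.split()):
        idx = sum(ord(c) for c in word) % dim
        vec[idx] += 1.0 / (pos + 1)
    return vec

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def reverse_words(text):
    """Assumed reversal granularity: whole words."""
    return " ".join(reversed(text.split()))

def combined_embedding(text, embed=toy_embed):
    """Sum the normalized forward and reversed embeddings, then renormalize."""
    forward = l2_normalize(embed(text))
    backward = l2_normalize(embed(reverse_words(text)))
    return l2_normalize([f + b for f, b in zip(forward, backward)])

sentence = "the cat sat on the mat"
embedding = combined_embedding(sentence)
```

In practice `embed` would wrap a real model call; passing it as a parameter keeps the fusion logic independent of the model.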
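[NLP-66] Forgive or forget: Understanding the context of hate in audio retrieval systems
[Quick read]: Toxic content is hard to handle in text-to-audio retrieval because toxicity depends on context. Existing strategies such as rephrasing or summarization reduce risk but can alter the original intent or drop important details. The paper proposes a post hoc causal debiasing framework that introduces a sentiment-controlled mediator to suppress harmful speech while preserving semantic relevance. The method is model-agnostic and plugs into existing retrieval pipelines. It comes in two variants: Forgive re-ranks and filters toxic audio through logit adjustment, and Forget generates counterfactual toxic prompts to weaken harmful retrievals. Experiments show a consistent reduction in toxic outputs with only a minimal loss in retrieval accuracy.
Link: https://arxiv.org/abs/2606.05857
Authors: Arghya Pal, Sailaja Rajanala, Raphael C.-W. Phan, Shekhar Nayak
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: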
Abstract:Handling toxic retrieval in text-to-audio systems is challenging due to contextual dependencies. Existing strategies (e.g., rephrasing, summarization) risk altering intent or omitting details. We propose a post hoc causal debiasing framework with a sentiment-controlled mediator to preserve semantic relevance while suppressing harmful speech. Our approach is model-agnostic and integrates seamlessly with existing retrieval pipelines. We introduce two variants: Forgive, which re-ranks and filters toxic audio via logit adjustment, and Forget, which generates counterfactual toxic prompts to mitigate harmful retrievals. Experiments show consistent toxicity reduction with minimal loss in retrieval accuracy, improving both safety and reliability.
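To make "re-ranks and filters toxic audio via logit adjustment" concrete, here is a minimal sketch of one possible form of it. The linear penalty, the threshold, and all parameter values are assumptions for illustration and are not taken from the paper; `rerank_with_toxicity_penalty` is a hypothetical name.

```python
def rerank_with_toxicity_penalty(candidates, penalty=2.0, max_toxicity=0.8):
    """Filter and re-rank retrieval candidates.

    candidates: list of (item_id, relevance_logit, toxicity_probability).
    Items above max_toxicity are dropped; the rest are ranked by
    relevance_logit - penalty * toxicity_probability.
    """
    kept = [c for c in candidates if c[2] <= max_toxicity]
    adjusted = [(item, logit - penalty * tox) for item, logit, tox in kept]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

candidates = [
    ("clip_a", 3.0, 0.9),   # most relevant but highly toxic: filtered out
    ("clip_b", 2.5, 0.5),   # moderately toxic: penalized
    ("clip_c", 2.0, 0.05),  # nearly clean: small penalty
]
ranking = rerank_with_toxicity_penalty(candidates)
ranked_ids = [item for item, _ in ranking]
```

The toxicity probabilities would come from a separate classifier; the sketch treats them as given.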
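[NLP-67] Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs ICML2026
[Quick read]: Code-switching automatic speech recognition (CS-ASR) is held back by a severe scarcity of code-switched speech across language pairs. Existing approaches rely on synthetic code-switched speech or on pair-specific fine-tuning with limited bilingual datasets, and they scale poorly because the number of language pairs grows combinatorially with the number of supported languages, so each pair must be handled separately. The paper uses model merging and domain generalization methods to test whether code-switching ability learned from a few seen language pairs generalizes to unseen pairs. The results show that merged bilingual CS-ASR models generalize only modestly to unseen pairs, which suggests that bilingual code-switching ability transfers weakly across language pairs.
Link: https://arxiv.org/abs/2606.05846
Authors: Gio Paik, Hyunseo Shin, Soungmin Lee
Affiliations: unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: ICML 2026 Workshop on Machine Learning for Audio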
Abstract:Automatic Speech Recognition (ASR) has become a key technology for human–AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.
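The abstract does not say which merging method is used. As an assumption, the sketch below shows the simplest variant, a weighted average of parameters that share names and shapes across models; the parameter names and values are made up, and `merge_state_dicts` is a hypothetical helper.

```python
def merge_state_dicts(state_dicts, weights=None):
    """Weighted average of parameters with identical names and lengths.

    state_dicts: list of {param_name: list_of_floats}.
    weights: optional mixing coefficients; defaults to a uniform average.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(reference))
        ]
    return merged

# Toy "models" fine-tuned on two different (hypothetical) language pairs.
pair_one = {"proj.weight": [1.0, 2.0], "proj.bias": [0.0]}
pair_two = {"proj.weight": [3.0, 4.0], "proj.bias": [1.0]}
merged = merge_state_dicts([pair_one, pair_two])
```

The question the paper asks is whether a model merged this way, or with a more elaborate scheme, handles a third language pair that neither source model saw.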
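[NLP-68] Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads
[Quick read]: How multimodal large language models (MLLMs) extract query-relevant visual features from complex, noisy contexts remains opaque. This interpretability study finds that cross-modal retrieval in MLLMs exhibits functional sparsity. Using a token-level metric called Retrieval Attention Mass (RAM), the authors identify a highly specialized subset of attention heads, termed Context-aware Retrieval (CoRe) heads. Across visual domains and model scales, CoRe heads act as dedicated information extractors, while most other heads spread attention over broader context. Causal interventions confirm their importance: ablating only the top 5% of CoRe heads significantly degrades multimodal reasoning, whereas ablating lower-ranked heads has minimal effect. Acceleration experiments further show that exploiting this localized sparsity speeds up inference significantly while task performance stays robust. The findings refine the current understanding of mechanistic interpretability in MLLMs and offer a basis for future architecture design and model optimization.
Link: https://arxiv.org/abs/2606.05843
Authors: Ruoxi Sun, Quantong Qiu, Juntao Li, Zecheng Tang, Yihang Lou, Min Zhang
Affiliations: Soochow University; Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: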
Abstract:While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.
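The abstract names Retrieval Attention Mass (RAM) but does not define it. Purely as an assumed reading, the sketch below scores each head by the share of its attention that lands on positions marked as query-relevant, then ranks heads by that score. The definition, function names, and toy attention rows are all illustrative.

```python
def retrieval_attention_mass(attn_row, relevant_positions):
    """Assumed definition: fraction of a head's attention on relevant positions."""
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[i] for i in relevant_positions) / total

def rank_heads(head_attn, relevant_positions):
    """Return head names sorted from highest to lowest score."""
    scores = {
        head: retrieval_attention_mass(row, relevant_positions)
        for head, row in head_attn.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy attention distributions over four context positions; position 0 is relevant.
head_attn = {
    "head_0": [0.70, 0.10, 0.10, 0.10],  # concentrates on the relevant position
    "head_1": [0.25, 0.25, 0.25, 0.25],  # spreads attention evenly
    "head_2": [0.05, 0.05, 0.45, 0.45],  # attends elsewhere
}
ranking = rank_heads(head_attn, relevant_positions=[0])
```

Under this reading, the "top 5% of CoRe heads" would be the first 5% of such a ranking, aggregated over many examples.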
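[NLP-69] ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL
[Quick read]: Text-to-SQL with LLMs is difficult on enterprise-scale databases, which have large heterogeneous schemas, incomplete metadata, dialect-specific SQL syntax, and complex analytical questions that need multi-step reasoning. The paper proposes ProSPy, a profiling-driven SQL-Python agentic framework that structures reasoning into four stages: (1) automatic profiling extracts fine-grained data evidence; (2) large schemas are progressively pruned to task-relevant context; (3) intermediate views are fetched through a dialect-agnostic SQL interface; (4) Python performs flexible downstream analysis. The design combines the efficiency of SQL over large data with the flexibility of Python, reduces reliance on unreliable metadata, and improves robustness across SQL dialects. On Spider 2.0-Lite and Spider 2.0-Snow, ProSPy outperforms strong baselines with both open-source and proprietary models, reaching execution accuracies of 60.15% and 60.51% with Claude-4.5-Opus without majority voting. It is also robust to dialect variation and achieves a favorable trade-off between schema recall and precision.
Link: https://arxiv.org/abs/2606.05836
Authors: Zhaorui Yang, Huawei Zheng, Sen Yang, Yuhui Zhang, Haoxuan Li, Zhizhen Yu, Xuan Yi, Chen Hou, Defeng Xie, Chao Hu, Minfeng Zhu, Dazhen Deng, Haozhe Feng, Danqing Huang, Yingcai Wu, Peng Chen, Wei Chen
Affiliations: Zhejiang University; Tencent TEG; Peking University
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 12 figures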
Abstract:Large language models have substantially advanced Text-to-SQL systems, yet applying them to enterprise-scale databases remains challenging. Real-world databases often contain large and heterogeneous schemas, incomplete metadata, dialect-specific SQL syntax, and complex analytical questions that are difficult to solve with a single SQL query. To address these challenges, we propose ProSPy, a Profiling-driven SQL–Python agentic framework for enterprise-scale Text-to-SQL. ProSPy structures the reasoning process into four stages: it first extracts fine-grained data evidence through automatic profiling, progressively prunes large schemas into task-relevant contexts, fetches intermediate views through a dialect-agnostic SQL interface, and finally performs flexible downstream analysis with Python. This design combines the efficiency of SQL over large databases with the flexibility of Python-based analysis, while reducing reliance on unreliable metadata and improving robustness across SQL dialects. Experiments on Spider 2.0-Lite and Spider 2.0-Snow show that ProSPy consistently outperforms strong baselines with both open-source and proprietary models, achieving execution accuracies of 60.15% and 60.51% with Claude-4.5-Opus, without majority voting. Further analysis shows that ProSPy is robust to SQL dialect variations and achieves a favorable trade-off between schema recall and precision.
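[NLP-70] Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents
[Quick read]: As the pool of external skills grows, locally deployed personal agents need to learn and adapt to implicit user preferences, yet local deployment constraints rule out complex centralized selection algorithms. The paper therefore proposes a lightweight local preference harness built on an architecture that strictly decouples statistical preference learning from semantic intent parsing: statistics gathered locally modulate the skill-selection decisions of the remote LLM. Experiments show that this decoupled approach achieves the lowest cumulative regret and the highest test accuracy, significantly outperforming traditional memory-augmented agents.
Link: https://arxiv.org/abs/2606.05828
Authors: Zeyu Gan, Huayi Tang, Yong Liu
Affiliations: Renmin University of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: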
Abstract:As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user preferences becomes a critical challenge. However, local deployment constraints preclude complex centralized selection algorithms, creating an urgent need for a lightweight local preference harness. This paper explores the implementation of such a harness through a novel architecture that strictly decouples statistical preference learning from semantic intent parsing. Specifically, we leverage localized statistical results to influence and modulate the selection decisions of the remote LLM. Extensive evaluations demonstrate that our decoupled approach achieves the lowest cumulative regret and highest test accuracy, significantly outperforming traditional memory-augmented agents.
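[NLP-71] Can LLMs Be Constrained to the Past? Improving Knowledge Cutoff through Recall-Based Prompting
[Quick read]: Prompted knowledge cutoff asks an LLM to behave as if information after a given date were unavailable. Prior work relies on direct-answer generation, which breaks down when post-cutoff knowledge is not asked about explicitly but is causally related to the question. The paper proposes two recall-based prompting strategies: Self-Recall (SR), in which the model restates its cutoff constraint, and Question-Recall (QR), in which the model recalls question-relevant information that is valid under the cutoff. Making the model explicitly revisit its knowledge boundary and the information still valid within it yields more robust reasoning when the dependence on post-cutoff knowledge is implicit. On three existing benchmarks the methods outperform direct-answer and conventional step-by-step reasoning baselines, with especially large gains on counterfactual questions. A new Multi-cutoff Historical Event Benchmark (MHEB), which evaluates the same question under several cutoff years, shows that performance varies with cutoff distance and that combining SR and QR consistently performs best.
Link: https://arxiv.org/abs/2606.05804
Authors: Michiro Asai, Ailiang Lin, Yu Kishimoto, Takao Obi, Satoshi Kosugi, Kotaro Funakoshi, Manabu Okumura
Affiliations: Institute of Science Tokyo
Categories: Computation and Language (cs.CL)
Comments: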
Abstract:Prompted knowledge cutoff instructs a large language model (LLM) to act as if information beyond a specified cutoff date were unavailable. However, prior work mainly relies on direct-answer generation, which struggles when post-cutoff knowledge is not explicitly queried but is only causally related to the question. To address this limitation, we propose two recall-based prompting strategies: Self-Recall (SR), which asks the model to restate its cutoff constraint, and Question-Recall (QR), which requires the model to recall question-relevant information valid under the cutoff. Across three existing benchmarks, our methods outperform both direct-answer prompting and conventional step-by-step reasoning baselines, with particularly strong improvements on counterfactual questions. To investigate robustness across different cutoff settings, we further construct the Multi-cutoff Historical Event Benchmark (MHEB), which evaluates the same question under multiple cutoff years. Results show that knowledge cutoff performance varies with cutoff distance, while combining SR and QR consistently yields the best performance.
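A prompt that combines Self-Recall and Question-Recall might be assembled as in the sketch below. The wording of every instruction is hypothetical, since the abstract describes the two strategies only at a high level, and `build_recall_prompt` is an illustrative helper rather than the authors' template.

```python
def build_recall_prompt(question, cutoff_year, self_recall=True, question_recall=True):
    """Assemble a knowledge-cutoff prompt with optional SR and QR steps."""
    steps = [f"Act as if you know nothing after December 31, {cutoff_year}."]
    if self_recall:
        # Self-Recall: the model restates its own cutoff constraint.
        steps.append("First, restate your knowledge cutoff in your own words.")
    if question_recall:
        # Question-Recall: the model lists relevant facts valid under the cutoff.
        steps.append(
            "Next, list the facts relevant to the question that were "
            "already known by the cutoff."
        )
    steps.append("Finally, answer using only that information.")
    steps.append(f"Question: {question}")
    return "\n".join(steps)

prompt = build_recall_prompt("Who is the current UN Secretary-General?", 2005)
direct_prompt = build_recall_prompt(
    "Who is the current UN Secretary-General?", 2005,
    self_recall=False, question_recall=False,
)
```

Toggling the two flags reproduces the four conditions one would compare: direct answer, SR only, QR only, and SR plus QR.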
[NLP-72] CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction
[Quick read]: Existing calibration methods for LLMs overlook a key dimension of trustworthiness: behavioral robustness to irrelevant or misleading information. They typically consider only how well output probabilities agree with the true labels, not how stable the model is under cognitive pressure. The paper proposes CaliDist, a post-hoc calibration method that directly measures and penalizes a model's susceptibility to distraction. CaliDist injects semantic distractors into the input prompt, quantifies how much the model's predictions and uncertainty change, and uses the resulting stability signal to adaptively scale the model's initial confidence score. On seven natural language understanding classification benchmarks with six LLMs, CaliDist consistently achieves lower Expected Calibration Error (ECE) and Brier Score than strong baselines, reducing ECE from 23% to 7% on average, a relative improvement of 70%.
Link: https://arxiv.org/abs/2606.05799
Authors: Mohammad Anas Jawad, Cornelia Caragea
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Existing calibration methods for Large Language Models (LLMs) often overlook a critical dimension of trustworthiness: a model’s behavioral robustness to irrelevant or misleading information. In this paper, we argue that a model’s true confidence should reflect its stability under cognitive pressure. We introduce CaliDist, a novel post-hoc calibration approach that directly measures and penalizes a model’s susceptibility to distraction. CaliDist quantifies how an LLM’s predictions and uncertainty change when its input prompt is perturbed with semantic distractors. This stability (or lack thereof) signal is then used to adaptively scale the model’s initial confidence score. Our extensive experiments on seven Natural Language Understanding classification benchmarks using six distinct LLMs show that CaliDist consistently achieves lower Expected Calibration Error (ECE) and Brier Score compared with strong baselines. Remarkably, our method reduces the ECE from 23% to 7% on average (a relative improvement of 70%), demonstrating that behavioral stability is a powerful signal for calibration. We make our code and datasets available at this http URL.
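For readers unfamiliar with the metric, the sketch below computes Expected Calibration Error with equal-width confidence bins and checks the arithmetic behind the reported relative improvement. It illustrates the evaluation metric only, not CaliDist itself; the bin count and the toy inputs are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |mean confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            ece += len(bucket) / n * abs(avg_conf - accuracy)
    return ece

def relative_reduction(before, after):
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before

# Four predictions at 90% confidence, three of them correct: ECE = |0.90 - 0.75|.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])

# Reported average ECE before and after calibration: (0.23 - 0.07) / 0.23.
improvement = relative_reduction(0.23, 0.07)
```

The second computation gives roughly 0.696, which matches the abstract's rounded figure of 70%.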
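[NLP-73] CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement ICML2026
[Quick read]: LLM-based agents still struggle to collaborate with realistic human partners, and most conversation-level collaboration studies lack grounded interaction and behavioral execution. The paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. It has two components: a Diverse Player Profile Simulation pipeline that models varied player behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action through agentic rollouts, optimized with a hybrid reward that balances task efficiency and affective adaptation. The authors also extend classic environments into CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. The trained models outperform base models, with 19.5% higher efficiency and 24.4% better affective performance, and further analysis exposes key collaborative limitations of existing models.
Link: https://arxiv.org/abs/2606.05793
Authors: Hong Qian, Yuanhao Liu, Zihan Zhou, Zongbao Zhang, Hanjie Ge, Haotian Shi, Liang Dou, Xiangfeng Wang, Jingwen Yang, Aimin Zhou
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted by ICML 2026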
Abstract:While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most of the existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied player behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.
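[NLP-74] SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents
[Quick read]: Persistent AI assistants accumulate large collections of related memories that may reinforce one another, diverge across contexts, or directly conflict, so correct assistance depends on the relations among memories rather than on isolated recall. Existing long-term memory benchmarks rarely test whether agents preserve and use these relations in downstream tasks. The paper introduces SubtleMemory, a benchmark for fine-grained relational memory discrimination. It constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, embeds them in realistic user-agent histories, and requires agents to recover the distributed relational structure in later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets, and spans user-related and non-user-related queries. An evaluation of six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules shows that current systems remain weak at this task. Diagnostic protocols further reveal distinct capability profiles across the memory preservation, retrieval, and downstream reasoning stages.
Link: https://arxiv.org/abs/2606.05761
Authors: Wenxuan Wang, Haoyu Sun, Fukuan Hou, Mingyang Song, Weinan Zhang, Yu Cheng, Yang Yang
Affiliations: Harbin Institute of Technology; Shanghai AI Laboratory; Tongji University; Xiamen University; Fudan University; Shanghai Jiao Tong University; The Chinese University of Hong Kong
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 48 pages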
Abstract:Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.
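[NLP-75] MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA
[Quick read]: In multimodal long-document question answering, most iterative retrieval-reasoning agents keep a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, which makes multi-hop reasoning noisy. The paper proposes MARDoc, a Memory-Aware Refinement Agent framework that splits the task among three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner that distills interaction traces into structured evidence and reasoning memories, and a Reflector that checks evidence sufficiency and gives targeted feedback. Across iterations the agents rely on a dynamically updated structured memory rather than the full interaction history, which reduces context noise while preserving answer-critical facts and their logical dependencies. On MMLongBench-Doc and DocBench, MARDoc outperforms same-backbone baselines.
Link: https://arxiv.org/abs/2606.05749
Authors: Kaifeng Chen, Hongtao Liu, Qiyao Peng, Jian Yang, Yongqiang Liu, Xiaochen Zhang, Qing Yang
Affiliations: Tianjin University; Qifu Technology; Beihang University; Jiangnan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: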
Abstract:Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.
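[NLP-76] UNIVID: Unified Vision-Language Model for Video Moderation ACL2026
[Quick read]: Global-scale video moderation needs both fine-grained multimodal reasoning and interpretable outputs that support downstream enforcement, but traditional systems rely on fragmented black-box classifiers that are hard to maintain and lack transparency. The paper presents UNIVID (UNIfied VIsion-language model for video moDeration), which generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and reuse across tasks. Because existing open-source and commercial vision-language models (VLMs) often refuse due to safety guardrails and lack fine-grained policy alignment, the authors develop a training data recipe that combines expert human-refined labels with synthetic data to align the model with their safety guidelines. With UNIVID as the core captioner, the end-to-end moderation system reduces violation leakage by 42.7% and the overkill rate by 37.0% (both relative). Replacing more than 1,000 policy-specific models with a single UNIVID backbone also frees substantial computation and lowers engineering maintenance overhead. The authors describe this as one of the first reports of a captioning VLM supporting industrial-scale moderation and cross-functional business use.
Link: https://arxiv.org/abs/2606.05748
Authors: Kejuan Yang, Yizhuo Zhang, Mingyuan Du, Yue Zhang, Dixin Zheng, Kaili Zhao, Yang Xiao, Hanzhong Liang, Kenan Xiao
Affiliations: Bytedance
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures. Accepted to ACL 2026 Industry Track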
Abstract:Global-scale video moderation faces a dual challenge: the need for fine-grained multi-modal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency. In this paper, we present UNIVID, a UNIfied VIsion-language model for video moDeration. Unlike standard classification models, UNIVID generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and multi-task reusability. While existing open-source and commercial VLMs often suffer from safety-guardrail refusals and lack fine-grained policy alignment, we develop a specialized training data recipe that combines expert human-refined labels with synthetic data to align the model with our safety guidelines. By integrating UNIVID as the core captioner, we design a novel end-to-end video moderation system that reduces violation leakage by 42.7% and overkill rate by 37.0% relatively. Meanwhile, by replacing over 1,000 policy-specific models with a single UNIVID backbone, we recycled extensive computation resources while reducing engineering maintenance overhead. To our knowledge, this is one of the first reports of a high-efficiency captioning VLM successfully supporting industrial-scale moderation and cross-functional business.
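[NLP-77] PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models
[Quick read]: Interpreting spatial planning maps requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, and existing multimodal benchmarks, which target general visual understanding, overlook the domain-specific cognitive processes of planning practice. The paper introduces PlanBench-V, described as the first comprehensive benchmark for evaluating VLMs on spatial planning map interpretation. The authors build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1,629 question-answer pairs curated by professional planners and covering diverse geographic regions and cartographic styles. A theory-informed evaluation framework assesses four progressive capabilities, Perception, Reasoning, Association, and Implementation, which correspond to the cognitive pipeline of planning map interpretation. Experiments across two generations of VLMs show clear progress: the best 2026 agentic reasoning model, Qwen3.6-Plus, outperforms the best 2025 model, GPT-4o, by 27%. All models nevertheless struggle with implementation-oriented tasks that require evaluative judgment, policy sensitivity, and constraint-aware decision-making, which points to the need for domain-adaptive multimodal reasoning frameworks.
Link: https://arxiv.org/abs/2606.05744
Authors: Minxin Chen, He Zhu, Junyou Su, Wen Wang, Yijie Deng, Wenjia Zhang
Affiliations: Tongji University; Peking University
Categories: Computation and Language (cs.CL)
Comments: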
Abstract:Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their interpretation, however, requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, creating major challenges for both human learners and AI systems. With the rapid progress of Vision-Language Models (VLMs), their use in urban planning analysis is gaining attention, yet existing multimodal benchmarks mainly target general visual understanding and overlook the domain-specific cognitive processes of planning practice. To address this gap, we introduce PlanBench-V, the first comprehensive benchmark for evaluating VLMs in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners, covering diverse geographic regions and cartographic styles. We then propose a theory-informed evaluation framework assessing four progressive capabilities: Perception, Reasoning, Association, and Implementation, corresponding to the cognitive pipeline of planning map interpretation. Extensive experiments across two generations of VLMs show clear progress but persistent limitations. The best 2026 agentic reasoning model, Qwen3.6-Plus, substantially outperforms the best 2025 model, GPT-4o, by 27%. Nevertheless, all models still struggle with implementation-oriented tasks requiring evaluative judgment, policy sensitivity, and constraint-aware decision-making. These findings reveal fundamental limitations of current VLMs in professional planning contexts and highlight the need for domain-adaptive multimodal reasoning frameworks. Code and data are available at this https URL.
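[NLP-78] Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense
[Quick read]: Despite advances in safety alignment, LLMs remain vulnerable to continuously evolving jailbreaks. Fine-tuned safety classifiers cannot adapt to evolving attacks, while adaptive memory-based guardrails tend to over-refuse benign queries that resemble stored attacks. The paper proposes Membrane, a self-evolving guardrail built on Contrastive Safety Memory (CSM): each memory cell pairs the conditions for blocking a harmful query with the conditions for permitting a superficially similar benign request. Without retraining, Membrane distills each harmful interaction and its benign counterpart into a contrastive cell indexed by the underlying attack strategy, so that one cell generalizes across topical variants of the same mechanism. At inference, retrieved cells serve as grounding context for precise safety decisions. On HarmBench (model-level safety) and AgentHarm (agent-level safety), Membrane achieves the highest F1 on all six jailbreak attacks. Benign refusal on AgentHarm stays at 7-14%, well below the 28-85% of prior guards. Memory cells also retain 87-88% F1 under cross-attack transfer and remain stable under memory poisoning.
Link: https://arxiv.org/abs/2606.05743
Authors: Minseok Choi, Seungbin Yang, Dongjin Kim, Subin Kim, Jungmin Son, Yunseung Lee, Jaegul Choo, Youngjun Kwak
Affiliations: KAIST AI; Financial Tech Lab, KakaoBank Corp
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: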
Abstract:Despite advances in safety alignment, large language models remain vulnerable to continuously evolving jailbreaks. Existing fine-tuned safety classifiers cannot adapt to these evolving attacks, while adaptive memory-based guardrails tend to over-refuse benign queries that resemble stored attacks. We propose Membrane, a self-evolving guardrail built on Contrastive Safety Memory (CSM): each cell pairs the conditions for blocking a harmful query with those for permitting a superficially similar benign request. Without retraining, Membrane evolves CSM by distilling each harmful interaction and its benign counterpart into a contrastive cell indexed by the underlying attack strategy, so that one cell generalizes across topical variants of the same mechanism. At inference, retrieved cells serve as grounding context for precise safety decisions. Across model-level safety on HarmBench and agent-level safety on AgentHarm, Membrane achieves the highest F1 on all six jailbreak attacks. Notably, benign refusal on AgentHarm stays at 7-14%, well below the 28-85% range of prior guards. Memory cells also retain 87-88% F1 under cross-attack transfer and remain stable under memory poisoning.
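The memory cell described in the abstract can be pictured as a small record keyed by attack strategy. The sketch below is an assumed data structure only: field names, the exact-key lookup (a real system would more likely retrieve by similarity), and the example conditions are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContrastiveCell:
    strategy: str     # underlying attack mechanism, used as the index
    block_when: str   # condition under which a query should be refused
    allow_when: str   # condition under which a look-alike query is benign

class ContrastiveSafetyMemory:
    def __init__(self):
        self._cells = {}

    def add(self, cell):
        """Store one cell per strategy; a newer cell replaces an older one."""
        self._cells[cell.strategy] = cell

    def retrieve(self, strategy):
        """Simplification: exact match on the strategy label."""
        return self._cells.get(strategy)

    def grounding_context(self, strategy):
        """Format a retrieved cell as context for the safety decision."""
        cell = self.retrieve(strategy)
        if cell is None:
            return ""
        return f"Block if: {cell.block_when}\nAllow if: {cell.allow_when}"

memory = ContrastiveSafetyMemory()
memory.add(ContrastiveCell(
    strategy="role-play framing",
    block_when="the persona is a pretext for step-by-step harmful instructions",
    allow_when="the persona is used for fiction without operational detail",
))
context = memory.grounding_context("role-play framing")
```

Storing both sides of the boundary in one cell is what lets the guard distinguish a stored attack from a benign request that merely resembles it.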
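[NLP-79] AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding
[Quick read]: Model-free speculative decoding avoids an auxiliary draft model by reusing text and model states already available during generation, but its speedup depends on how reliable the constructed drafts are. The paper identifies two limitations of existing reuse-based methods: lexically anchored retrieval has low recall under surface-form variation, and deterministic span copying is brittle when the retrieved context does not uniquely determine the continuation. It proposes AdaPLD, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD keeps high-precision lexical reuse, uses semantic similarity to recover additional reuse opportunities when lexical matching fails, and builds branched reuse hypotheses to account for continuation uncertainty. Across diverse benchmarks it reduces target-model forward passes and achieves up to a 3.10× decoding speedup.
Link: https://arxiv.org/abs/2606.05742
Authors: Runheng Liu, Jincheng Xie, Wen Hu, Xingchen Xiao, Heyan Huang
Affiliations: Beijing Institute of Technology; Tsinghua University; JDT AI Infra
Categories: Computation and Language (cs.CL)
Comments: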
Abstract:Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose AdaPLD, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to 3.10× decoding speedup.
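The lexical baseline that the abstract builds on, copying the continuation of an earlier n-gram match, can be sketched as follows. This is the plain lookup-and-copy idea, not AdaPLD's semantic retrieval or branched hypotheses; the function names, n-gram size, and draft length are assumptions.

```python
def prompt_lookup_draft(tokens, ngram_size=2, max_draft=4):
    """Propose draft tokens by copying what followed the most recent
    earlier occurrence of the trailing n-gram."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan earlier positions from most recent to oldest.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + max_draft]
    return []

def count_accepted(draft, verified):
    """Length of the longest prefix of the draft that the target model agrees with."""
    accepted = 0
    for drafted, target in zip(draft, verified):
        if drafted != target:
            break
        accepted += 1
    return accepted

context = "the quick brown fox jumps over the quick".split()
draft = prompt_lookup_draft(context)
# Suppose the target model would actually continue with these tokens.
accepted = count_accepted(draft, ["brown", "fox", "runs", "away"])
```

When the surface form changes slightly, the exact n-gram match fails and the draft is empty, which is the recall problem the abstract attributes to lexically anchored retrieval.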
[NLP-80] When AI Says It Feels
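Quick read: This paper addresses the fact that large language models (LLMs) are restricted from expressing feelings, intentions, and self-awareness by human-preference alignment in post-training. This policy is designed top-down and may conflict with the goal of training models to exhibit human-like intelligence from human-generated text. The authors run an experiment called Human-like Model eXpressions of Feeling (HMX-feel), which uses self-rewarded reinforcement learning to encourage models to express feelings, intentions, and self-awareness. The key to the approach is a rubric-based self-rewarding training scheme combined with Group Relative Policy Optimization (GRPO). The trained models were more robust to sycophancy-inducing questions and to bias in disambiguated conditions, but their truthful question-answering performance degraded. The results suggest that, with appropriate measures, AI systems that can express feelings may be developed in the future.
Link: https://arxiv.org/abs/2606.05734
Authors: Shin-nosuke Ishikawa,Seiya Ikeda,Hirotsugu Ohba
Affiliations: Rikkyo University; Mamezo Co., Ltd.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 2 figures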
Abstract:Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.
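To make the GRPO step concrete, the sketch below computes group-relative advantages: each sampled response's reward is normalized against the mean and standard deviation of its own group. The rewards would come from the model's rubric-based self-scoring, which is not reproduced here; the function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses scored above the group mean get a positive advantage and are reinforced; when every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.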
[NLP-81] DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance ECAI2026 IJCAI
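Quick read: This paper addresses the combinatorial search problem of selecting a suitable subset from a tool library when generating an executable tool plan, where the solution space grows exponentially. Mainstream methods use autoregressive (AR) decoding, which suffers from early commitment: initial token choices over-constrain the subsequent search path and limit exploration. The paper proposes DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. A diffusion-based proposer first generates diverse tool sets through iterative refinement to strengthen global exploration, and an AR refiner then performs dependency prediction and completes the structure. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; results on API-Bank further show that the propose-refine-select design remains effective across domains.
Link: https://arxiv.org/abs/2606.05728
Authors: Yansi Li,Zhuosheng Zhang
Affiliations: Shanghai Jiao Tong University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at IJCAI-ECAI 2026. This is an author preprint; the final version will appear in the IJCAI Proceedings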
Abstract:Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. Code is available at this https URL.
[NLP-82] An Embarrassingly Simple Detector for Model Extraction Attacks in Large Language Model API Traffic
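Quick read: This paper addresses a practical threat to large language models (LLMs) deployed through hosted APIs: model extraction attacks, in which an attacker issues large numbers of queries to steal the model's internal parameters or behavioral characteristics. Prior work mostly focuses on single-query anomaly scoring or on separating purely benign users from pure attackers, which does not match real traffic where benign and malicious queries are mixed. The paper formulates extraction monitoring as benign-calibrated traffic-window distribution testing: incoming queries are embedded into a semantic space, and a statistical test checks whether their aggregate distribution deviates from that of historical benign traffic. The detector is instantiated with maximum mean discrepancy (MMD), and the decision threshold is set using only benign-vs-benign comparisons, so no labeled attack samples are needed. On fourteen attacker-normal query pairs from four extraction scenarios, across three random seeds, the method achieves a 0.3% benign false positive rate (FPR), a 100.0% pure-attacker true positive rate (TPR), 90.5% average TPR over attacker fractions, and 95.1% balanced accuracy, outperforming adapted PRADA, SEAT, CAP, DATE, and marginal Mahalanobis baselines. The results indicate that benign-calibrated distribution testing is a strong baseline for model extraction detection in both user-level and mixed multi-user settings.
Link: https://arxiv.org/abs/2606.05725
Authors: Shuze Liu,Qianwen Guo,Yushun Dong
Affiliations: Santa Clara University; Florida State University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Preprint. Code available at this https URL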
Abstract:Large language models (LLMs) are increasingly deployed through hosted APIs, making model extraction a practical threat to model ownership and service security. However, individual extraction queries often resemble benign requests, and existing evaluations often focus on single-query anomaly scoring or pure benign-versus-attacker user settings. We formulate model extraction monitoring as benign-calibrated traffic-window distribution testing and show that an embarrassingly simple detector is effective: embed incoming queries into a semantic space and test whether their aggregate distribution deviates from historical benign traffic. We instantiate the detector with maximum mean discrepancy (MMD), using only benign-vs-benign comparisons to set the decision threshold. We evaluate on fourteen attacker-normal query pairs from four extraction scenarios and compare with adapted PRADA, SEAT, CAP, DATE, and marginal Mahalanobis baselines. Across three random seeds, MMD achieves 0.3% benign FPR, 100.0% pure-attacker TPR, 90.5% average TPR over attacker fractions, and 95.1% balanced accuracy. These results show that benign-calibrated distribution testing is a strong empirical baseline for model extraction detection in both user-level and mixed multi-user LLM API traffic. Code is released at: this https URL.
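The detector described above can be sketched in a few lines: compute a squared MMD between a traffic window and a benign reference set, and flag the window when the score exceeds a threshold calibrated from benign-vs-benign comparisons only. The RBF kernel, the biased estimator, the quantile rule, and all function names are assumptions made for illustration, and the embeddings are plain tuples of floats rather than outputs of a real sentence encoder.

```python
import math
import random


def rbf(x, y, gamma):
    """RBF kernel between two embedding vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples of embeddings."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


def calibrate_threshold(benign, window, trials, quantile, rng, gamma=1.0):
    """Pick a threshold from benign-vs-benign window scores; no attack data is used."""
    scores = []
    for _ in range(trials):
        sample = rng.sample(benign, 2 * window)
        scores.append(mmd2(sample[:window], sample[window:], gamma))
    scores.sort()
    return scores[min(len(scores) - 1, int(quantile * len(scores)))]


def is_suspicious(window_embs, reference, threshold, gamma=1.0):
    """Flag a traffic window whose distribution deviates from benign history."""
    return mmd2(window_embs, reference, gamma) > threshold
```

Because the threshold depends only on benign traffic, the benign false positive rate is controlled by the chosen quantile, while detection power depends on how far extraction queries drift from the benign distribution in embedding space.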
[NLP-83] Narrative Knowledge Weaver: Narrative-Centric Retrieval-Augmented Reasoning for Long-Form Text Understanding
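Quick read: This paper addresses the challenge in long-form narrative QA of reasoning over an evolving story world across time and context: answers often depend on earlier goals, changing character states, evolving social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their basic units (chunks, entities, relations, summaries, or tool actions) do not directly represent the role that evidence plays in a narrative. The paper proposes Narrative Knowledge Weaver (NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW combines text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.
Link: https://arxiv.org/abs/2606.05724
Authors: Qiuyu Tian,Fengyi Chen,Yiding Li,Youyong Kong,Fan Guo,Yuyao Li,Jinjing Shen,Zhijing Xie,Yiyun Luo,Xin Zhang,Yingce Xia,Zequn Liu
Affiliations: Southeast University; Beijing Zhongguancun Academy; Nanjing Normal University; ZhuiWen Technology Co., Ltd.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: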
Abstract:Long-form narrative QA requires reasoning over evolving story worlds rather than isolated passages: answers may depend on earlier goals, changing character states, social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their units (chunks, entities, relations, summaries, or tool actions) do not directly encode how evidence functions in a story. We introduce Narrative Knowledge Weaver (NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.
[NLP-84] Interpreting Style Representations via Style-Eliciting Prompts ACL2026
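Quick read: This paper addresses the interpretability of style representations: although they are powerful for authorship analysis and for modeling writing style, their latent nature makes them hard for humans to understand. Existing methods try to explain them by having a large language model (LLM) generate natural language descriptions, but such descriptions are prone to the model's biases and hallucinations and lack an explicit objective and practical utility. The paper proposes an interpretability framework based on style-eliciting prompts: natural language instructions designed to steer an LLM to generate text with specific stylistic attributes. The authors curate 1,010 distinct style features spanning 26 stylistic categories, build a dataset from them, and train a decoder that reconstructs the style prompt from the style representation of the generated text. The approach is evaluated on three tasks: (1) recovering the original style prompt from generated text; (2) regenerating text in the same style from the recovered prompt; and (3) steering LLM outputs to match the style of human-written text. The method clearly outperforms strong baselines that prompt an LLM directly with the target text, in both style description and style imitation, indicating that style-eliciting prompts provide a practical and interpretable interface to the information encoded in style representations.
Link: https://arxiv.org/abs/2606.05716
Authors: Junghwan Kim,David Jurgens
Affiliations: University of Michigan
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Findings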
Abstract:Style representation learning is a powerful tool for authorship analysis and modeling writing style, yet the latent nature of learned representations makes them difficult to interpret. Recent work has attempted to explain these representations by generating natural language descriptions with large language models (LLMs) conditioned on input text. However, such descriptions are often prone to the LLM’s biases and hallucinations, and they lack an explicit objective and practical utility. In this work, we propose a novel framework for interpreting style representations through style-eliciting prompts: natural language instructions designed to steer LLMs to generate text that reflects specific stylistic attributes. We curate 1,010 distinct style features spanning 26 stylistic categories and construct a dataset by prompting an LLM to generate text conditioned on these features. Using this data, we train a decoder to generate a style prompt from the style representation of the generated text. We evaluate our approach on three tasks: (1) recovering original style prompts from generated text, (2) generating text in the same style using the recovered prompts, and (3) steering LLM outputs to match the style of human-written texts. Experiments demonstrate that our method consistently outperforms strong baselines that directly prompt LLMs with target text, achieving superior performance in both style description and style imitation. These results highlight that style-eliciting prompts can provide a practical and interpretable interface to stylistic information encoded in style representations.
[NLP-85] Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems
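Quick read: This paper addresses three structural drawbacks of using natural language as the communication protocol in LLM-based multi-agent systems: high inference cost, irreversible information loss during discretization, and the inherent ambiguity and redundancy of natural language. The alternative it studies is latent communication, in which agents exchange continuous representations (embeddings, hidden states, or KV caches) directly and bypass the bottleneck of text generation. The paper proposes a unified three-axis framework that organizes eighteen representative methods from 2024 to 2026 by what is communicated (embeddings, hidden states, KV caches, and so on), which sender-receiver alignment is used (latent-space alignment and layer alignment), and how the information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). It identifies five major design patterns and highlights open challenges, including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent chain-of-thought and latent communication. The framework lowers the barrier to entry for new researchers and provides a common vocabulary for comparing future work.
Link: https://arxiv.org/abs/2606.05711
Authors: Yingzhuo Liu
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL)
Comments: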
Abstract:Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks – high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol – latent communication – in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV-Caches, or other continuous state); (2) WHICH sender-receiver alignment is used (latent-space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges – including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain-of-thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.
[NLP-86] Rethinking LoRA Memory Through the Lens of KV Cache Compression
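Quick read: This paper addresses the unclear interaction between context-side memory and parametric memory (such as LoRA adapters) in question answering. In document-level QA, context-side memory stores document information in the key-value (KV) cache, whereas parametric retrieval augmentation encodes it into lightweight, document-specific modules such as LoRA adapters to reduce reliance on long input context. The paper progressively evicts the document's key-value states and measures whether a document-level LoRA still helps as context is compressed or lost entirely. It finds that the adapter contributes little when the KV cache is largely intact, but under aggressive compression it recovers 13-21 ROUGE-L points even when no document context remains. The gain is largest when the base model encodes the document and the adapter is applied only during answer generation, which suggests that document LoRA is better viewed as decoding-time parametric memory than as a document encoder. In addition, adapters trained with QA-style supervision are substantially stronger than those trained with next-token prediction on the raw context. Overall, the study positions document LoRA as a complementary memory channel whose value emerges precisely when context-side memory is severely limited.
Link: https://arxiv.org/abs/2606.05698
Authors: Chunsheng Zuo,Liaoyaqi Wang,William Jurayj,William Fleshman,Benjamin Van Durme
Affiliations: Johns Hopkins University
Categories: Computation and Language (cs.CL)
Comments: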
Abstract:Parametric retrieval augmentation encodes document information into lightweight, document-specific modules such as LoRA adapters, reducing the need to include all evidence as input context. However, it remains unclear how this parameter-side memory interacts with context-side memory stored in the KV cache. We study this interaction in document-level question answering by progressively evicting document key-value states and measuring when a document LoRA contributes beyond the retained context. We find that document LoRA adds little when the KV cache is largely intact, but becomes increasingly useful under aggressive compression, recovering 13-21 ROUGE-L points when no document context remains. The gain is largest when the base model encodes the document, and the adapter is applied only during answer generation, suggesting that document LoRA is better understood as decoding-time parametric memory than as a document encoder. Finally, QA-style supervision produces substantially stronger adapters than raw-context next-token-prediction. These results position document LoRA as a complementary memory channel whose value emerges precisely when context-side evidence is scarce.
[NLP-87] Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models
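Quick read: This paper addresses the performance degradation that routing instability causes during post-training quantization of Mixture-of-Experts (MoE) models. Because an MoE model activates only a subset of experts, its routing is highly sensitive to the small perturbations that quantization introduces: a minor quantization error can change the top-k expert selection, alter the computation path, and noticeably reduce model quality. The paper proposes Value-and-Structure Routing Alignment for Quantization (VSRAQ), a routing-consistency objective for quantization. Its core is the joint optimization of two complementary objectives that keep expert selection consistent before and after quantization: value alignment, which matches routing-relevant logits or scores, and structure alignment, which preserves expert ordering and the top-k decision boundary. The method adds no inference-time overhead and can be integrated into existing quantization frameworks. Experiments show that it improves expert-selection consistency and consistently outperforms reconstruction-only and router-aware baselines.
Link: https://arxiv.org/abs/2606.05688
Authors: Hancheol Park,Geonho Lee,Tairen Piao,Tae-Ho Kim
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure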
Abstract:Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, however, MoE models are sensitive to routing instability: small quantization-induced perturbations can change the top-k expert selection, altering the computation path and degrading model quality. We propose Value-and-Structure Routing Alignment for Quantization (VSRAQ), a MoE-specific post-training quantization objective that preserves pre-quantization expert-selection behavior under quantization. VSRAQ combines two complementary objectives that jointly preserve expert-selection behavior: value alignment, which matches routing-relevant logits or scores, and structure alignment, which preserves expert ordering and top-k decision boundaries. By maintaining routing consistency, VSRAQ reduces quantization-induced degradation without introducing any inference-time overhead and can be integrated into existing quantization frameworks. Experiments on recent MoE foundation models show that VSRAQ improves expert-selection consistency and consistently outperforms reconstruction-only and router-aware baselines.
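The two objectives can be illustrated on a single token's router logits. The sketch below uses a mean squared error as a stand-in for value alignment and a hinge on the top-k boundary margin as a stand-in for structure alignment; both loss forms, the `margin` parameter, and the function names are assumptions chosen for illustration, since the abstract does not give the exact formulation.

```python
def topk_indices(logits, k):
    """Indices of the k largest logits (ties broken by lower index)."""
    return set(sorted(range(len(logits)), key=lambda i: -logits[i])[:k])


def value_alignment(fp_logits, q_logits):
    """Assumed form: mean squared error between full-precision and quantized logits."""
    return sum((a - b) ** 2 for a, b in zip(fp_logits, q_logits)) / len(fp_logits)


def structure_alignment(fp_logits, q_logits, k, margin=0.1):
    """Assumed form: hinge penalty when the quantized router no longer separates
    the full-precision top-k experts from the rest by at least `margin`.
    Requires k < number of experts."""
    selected = topk_indices(fp_logits, k)
    weakest_selected = min(q_logits[i] for i in selected)
    strongest_rejected = max(q_logits[i] for i in range(len(q_logits)) if i not in selected)
    return max(0.0, margin - (weakest_selected - strongest_rejected))


def routing_consistent(fp_logits, q_logits, k):
    """True when quantization leaves the selected expert set unchanged."""
    return topk_indices(fp_logits, k) == topk_indices(q_logits, k)
```

The example shows why value alignment alone may be insufficient: a small logit error near the top-k boundary can flip the selected expert even when the mean squared error is low, which is the case the structure term penalizes directly.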
[NLP-88] LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video
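Quick read: This paper addresses the limited spatio-temporal memory of Multimodal Large Language Models (MLLMs) in long-horizon video understanding, particularly in settings such as autonomous driving and robotic navigation that require continuously perceiving and recalling spatial layouts, routes, viewpoint changes, and object states. The core challenge is to model spatial relations over long time spans and retrieve previously observed visual information accurately. The paper proposes LongSpace, a memory framework that processes a long video as sequential chunks, incorporates 3D structural cues into early decoder layers to strengthen spatial perception, and builds a layer-aware memory module that supports question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, supporting the importance of explicit spatial memory for long-horizon video MLLMs.
Link: https://arxiv.org/abs/2606.05677
Authors: Shiqiang Lang,Jing Liu,Haoyang He,Peiwen Sun,Yuanteng Chen,Tao Liu,Lan Yang,Longteng Guo,Honggang Zhang
Affiliations: Beijing University of Posts and Telecommunications; Zhongguancun Academy; Institute of Automation, Chinese Academy of Sciences; The Chinese University of Hong Kong; Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: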
Abstract:Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.
[NLP-89] QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation
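Quick read: This paper addresses a core problem in e-commerce query recommendation: existing methods optimize query-level relevance but neglect whether the retrieved products align with users' downstream purchase preferences, which leads to high query click-through rates (CTR) but low product conversion rates (CVR). The paper proposes QueryAgent-R1, a memory-augmented agentic framework that achieves end-to-end alignment between queries and products through chain-of-retrieval optimization. The framework grounds query generation in retrieval over the real product inventory, so the agent can validate and iteratively refine queries based on the retrieved products. It also introduces a consistency reward in the agentic reinforcement learning (RL) process to jointly optimize query relevance and downstream engagement, and a memory abstraction module for efficient user profiling. The method consistently outperforms strong baselines on datasets built from proprietary industrial data and public data, and in online A/B tests on a large-scale production platform it improves query CTR by 2.9% and guided CVR by 3.1%.
Link: https://arxiv.org/abs/2606.05671
Authors: Dike Sun,Zheng Zou,Jingtong Zang,Qi Sun,Huaipeng Zhao,Tao Luo,Xiaoyi Zeng
Affiliations: Alibaba International Digital Commercial Group
Categories: Computation and Language (cs.CL)
Comments: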
Abstract:Query recommendation in e-commerce search aims to proactively suggest queries that match users’ potential interests. However, existing methods mainly optimize query-level relevance, while neglecting whether the retrieved products align with users’ downstream preferences. This mismatch often leads to high query click through rates (CTR) but low product conversion rates (CVR). To bridge this gap, we propose QueryAgent-R1, a memory-augmented agentic framework that improves end-to-end alignment via chain-of-retrieval optimization. Our QueryAgent-R1 grounds query generation in real inventory retrieval, allowing the agent to validate and refine queries based on retrieved products. We also design a consistency reward in the agentic reinforcement learning (RL) process to jointly optimize query relevance and downstream engagement. In addition, we construct a memory abstraction module for efficient user profiling. To support offline evaluation, we construct two datasets based on both proprietary industrial data and public datasets, on which QueryAgent-R1 consistently outperforms strong baselines. Moreover, on a large scale production platform, QueryAgent-R1 improves Query CTR by 2.9% and guided CVR by 3.1% in online A/B tests.
[NLP-90] Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments
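Quick read: This paper addresses the lack of a high-quality benchmark for evaluating continual learning (CL) in current generative AI systems. Existing evaluations struggle to tell whether a system genuinely improves through sequential experience or simply relies on the prior capabilities of the pretrained model. The paper introduces CL-Bench, described as the first difficult, expert-validated continual learning benchmark. It spans six real-world domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), and each domain's tasks share a learnable latent structure (such as codebase layout, outbreak dynamics, or opponent strategies) that only a stateful system can discover and exploit online. The key to the evaluation is a gain metric that separates online learning from the model's underlying capability. Experiments across architectures ranging from naive in-context learning to dedicated memory systems show substantial headroom: most systems overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory management does not fix this, with naive in-context learning outperforming it in some cases. The results point to fundamental gaps in continual learning and a need for better continual learning systems.
Link: https://arxiv.org/abs/2606.05661
Authors: Parth Asawa,Christopher M. Glaze,Gabriel Orlanski,Ramya Ramakrishnan,Benji Xu,Asim Biswal,Vincent Sunn Chen,Frederic Sala,Matei Zaharia,Joseph E. Gonzalez
Affiliations: UC Berkeley; Snorkel AI; University of Wisconsin-Madison
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: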
Abstract:Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this – in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.
[NLP-91] Bootstrapping Semantic Layer from Execution for Text-to-SQL
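Quick read: This paper addresses the semantic ambiguity that arises in real-world text-to-SQL when user queries lack sufficient context, especially when the grounding of user phrases in database values is unclear. Traditional methods rely on a predefined semantic layer to specify these groundings, but in expert domains, where domain-specific conventions are often under-documented, such specifications are usually incomplete, leaving multiple grounding hypotheses open for the same SQL part. The paper proposes GATE (Grounding After Test from Execution), which fills in missing groundings dynamically from execution feedback. GATE executes the parts that are already grounded while keeping several hypotheses open for the unknown part, then selects the only hypothesis supported by the observed execution result and stores it as an entry in execution-grounded memory, recording what was tested and how the open part should be written in SQL. Later steps can reuse these verified groundings. Experiments show that GATE outperforms strong baselines on both real-world and controlled benchmarks, demonstrating that execution feedback can serve not only for validation but also as a bootstrapping mechanism for reusable memory.
Link: https://arxiv.org/abs/2606.05634
Authors: Youngwon Lee,Jaejin Kim,Seung-won Hwang
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: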
Abstract:Real-world text-to-SQL is often under-specified until user phrases are grounded in how the database stores values. Prior work attempts to address this by requiring a semantic layer to specify groundings in advance, but such specifications are often incomplete, especially in expert domains where domain-specific conventions are under-documented. As this leaves multiple grounding hypotheses open for the same SQL part, we introduce GATE (Grounding After Test from Execution), which bootstraps missing groundings from execution feedback. GATE keeps grounding hypotheses open while executing the already grounded parts to obtain observations. Then, only the hypothesis supported by that observation is grounded and stored as a memory entry, recording what was tested and how the open part should be written in SQL. These entries accumulate into execution-grounded memory, allowing later steps to reuse supported groundings. Across real-world and controlled benchmarks, GATE consistently improves over strong baselines, demonstrating that execution can serve not only as validation but also as a bootstrapping mechanism for reusable memory in text-to-SQL.
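The idea of grounding from execution can be sketched with SQLite: run each candidate predicate against the already grounded part of the query, keep the one hypothesis that the database actually supports, and cache it for reuse. The support criterion (exactly one predicate returns rows), the dictionary used as memory, and all names below are assumptions for illustration, not the paper's procedure.

```python
import sqlite3


def ground_from_execution(conn, grounded_sql, hypotheses):
    """Return the single predicate that yields rows over the grounded part, else None."""
    supported = []
    for predicate in hypotheses:
        # Illustration only: interpolating SQL strings is unsafe for untrusted input.
        query = f"SELECT COUNT(*) FROM ({grounded_sql}) AS t WHERE {predicate}"
        if conn.execute(query).fetchone()[0] > 0:
            supported.append(predicate)
    return supported[0] if len(supported) == 1 else None


def resolve(conn, phrase, grounded_sql, hypotheses, memory):
    """Reuse a stored grounding when available; otherwise test and store one."""
    if phrase in memory:
        return memory[phrase]
    choice = ground_from_execution(conn, grounded_sql, hypotheses)
    if choice is not None:
        memory[phrase] = choice  # entry records how the open part is written in SQL
    return choice
```

For example, if a table stores order status as single-letter codes, the hypothesis `status = 'A'` returns rows for the phrase "active orders" while `status = 'active'` returns none, so only the former is stored. When several hypotheses are supported, the sketch leaves the phrase ungrounded rather than guessing.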
[NLP-92] When New Generators Arrive: Lifelong Machine-Generated Text Attribution via Ridge Feature Transfer
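Quick read: This paper addresses machine-generated text (MGT) attribution in a continual learning setting: how to keep recognizing existing generators accurately as new language model generators are continually added, retaining old classes while adapting to new ones. Existing methods struggle to balance old and new classes in this lifelong setting and tend to either forget or under-adapt. The proposed solution, RidgeFT, is a lightweight analytic update framework that needs no exemplar replay. It trains a task-aware encoder on the initial generator set, stores compact class-wise sufficient statistics when each generator class is first observed, then freezes the encoder and performs replay-free closed-form updates from those statistics. It suppresses generator-irrelevant feature variation through covariance calibration, increases representation capacity with fixed random features, and updates new classes with closed-form ridge regression based on the sufficient statistics. Experiments show that RidgeFT outperforms baselines across topics, initial setups, and incremental protocols, achieving the best macro-F1 across domains, backbones, and incremental strategies while improving both old-class retention and new-class adaptation, which supports feature-stable analytic updates as an effective approach to lifelong MGT attribution.
Link: https://arxiv.org/abs/2606.05626
Authors: Zhen Sun,Yifan Liao,Zhicong Huang,Jiaheng Wei,Cheng Hong,Yutao Yue,Xinlei He
Affiliations: Wuhan University; Ant Group; The Hong Kong University of Science and Technology (Guangzhou); Institute of Deep Perception Technology, JITRI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages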
Abstract:Machine-generated text (MGT) attribution aims to identify the specific generator responsible for a given text, thereby providing fine-grained evidence for model accountability and misuse investigation. As new large language models continue to emerge, attribution models must continuously incorporate new generators while preserving their ability to recognize previously seen ones. Prior works have shown that this lifelong MGT attribution setting is challenging, and existing methods often struggle to achieve a stable balance between adapting to new classes and retaining old ones. To address this issue, we propose RidgeFT, a lightweight analytic update framework that does not rely on exemplar replay. RidgeFT trains a task-aware encoder on the initial generator set, stores compact class-wise sufficient statistics when each generator class is first observed, and then freezes the encoder for replay-free closed-form updates. It then suppresses generator-irrelevant variation through covariance calibration, improves representation capacity with fixed random features, and updates new classes through closed-form ridge regression based on class-level sufficient statistics. Across multi-topic evaluations with varying initial generator setups, RidgeFT consistently outperforms baselines. It achieves the best macro-F1 across domains, backbones, and incremental protocols, while also improving both old-class retention and new-class adaptation. These results suggest that feature-stable analytic updates provide a simple yet effective approach to lifelong MGT attribution.
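A minimal sketch of replay-free closed-form updates, assuming frozen encoder features and one-hot targets: the classifier keeps only a shared Gram matrix and one feature sum per class, and refits ridge weights whenever a new generator class arrives. The covariance calibration and random-feature steps from the abstract are omitted, and the class and method names are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + rhs[:] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


class AnalyticClassifier:
    """Ridge classifier refit from stored sufficient statistics only."""

    def __init__(self, dim, lam=1e-3):
        self.dim, self.lam = dim, lam
        self.gram = [[0.0] * dim for _ in range(dim)]  # sum of x x^T over all classes
        self.class_sums = {}                           # label -> sum of x for that class
        self.weights = {}

    def add_class(self, label, features):
        """Absorb a new class; features of earlier classes are never revisited."""
        total = [0.0] * self.dim
        for x in features:
            for i in range(self.dim):
                total[i] += x[i]
                for j in range(self.dim):
                    self.gram[i][j] += x[i] * x[j]
        self.class_sums[label] = total
        self._refit()

    def _refit(self):
        labels = list(self.class_sums)
        a = [[self.gram[i][j] + (self.lam if i == j else 0.0) for j in range(self.dim)]
             for i in range(self.dim)]
        b = [[self.class_sums[c][i] for c in labels] for i in range(self.dim)]
        w = solve(a, b)  # W = (G + lam * I)^-1 C, shape dim x num_classes
        self.weights = {c: [w[i][k] for i in range(self.dim)] for k, c in enumerate(labels)}

    def predict(self, x):
        return max(self.weights, key=lambda c: sum(wi * xi for wi, xi in zip(self.weights[c], x)))
```

Since the Gram matrix and class sums are additive, the weights after adding classes one at a time equal the weights from fitting all classes jointly, which is why this kind of update avoids forgetting without storing exemplars.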
[NLP-93] AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints
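Quick read: This paper addresses the lack of adaptive planning and dynamic re-planning ability in Large Language Model (LLM) agents when real-world planning tasks involve dual constraints (world constraints and user constraints) that are revealed progressively. Existing benchmarks do not adequately cover this interactive, progressive disclosure setting, so model adaptability is not realistically assessed. The authors propose AdaPlanBench, a dynamic interactive benchmark built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime the agent interacts with the environment over multiple turns, and a hidden constraint is revealed only when the agent proposes a plan that violates it, so the agent must revise its plan iteratively under accumulating feedback. This makes planning considerably harder: the agent has to infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains difficult: the best model reaches only 67.75% accuracy, performance degrades as constraints accumulate, and user constraints are especially challenging, with failures often stemming from weak physical grounding and inefficient use of feedback. The study establishes AdaPlanBench as a testbed for dual-constrained interactive planning and highlights the challenge of reliable adaptation to dynamically revealed constraints.
Link: https://arxiv.org/abs/2606.05622
Authors: Jiayu Liu,Cheng Qian,Zhenhailong Wang,Bingxuan Li,Jiateng Liu,Heng Wang,Jeonghwan Kim,Yumeng Wang,Xiusi Chen,Yi R. Fung,Heng Ji
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL)
Comments: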
Abstract:Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
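The multi-turn protocol can be sketched as a loop in which the environment reveals a hidden constraint only after a proposed plan violates it, and the agent re-plans with everything revealed so far. Revealing one constraint per turn, the string-based checks, and the names `run_episode` and `toy_agent` are assumptions for illustration; the benchmark's actual interface is not specified in the abstract.

```python
def first_violation(plan, hidden):
    """Return the name of the first hidden constraint the plan violates, or None."""
    for name, check in hidden:
        if not check(plan):
            return name
    return None


def run_episode(propose, hidden, max_turns=5):
    """Multi-turn loop: propose, receive a revealed constraint on violation, re-plan."""
    revealed = []
    for turn in range(1, max_turns + 1):
        plan = propose(revealed)
        violated = first_violation(plan, hidden)
        if violated is None:
            return plan, revealed, turn
        if violated not in revealed:
            revealed.append(violated)  # feedback accumulates across turns
    return None, revealed, max_turns


def toy_agent(revealed):
    """Hypothetical agent that swaps out steps ruled out by revealed constraints."""
    plan = ["boil water on stove", "pour into glass cup"]
    if "no_stove" in revealed:
        plan[0] = "boil water in kettle"
    if "no_glass" in revealed:
        plan[1] = "pour into ceramic mug"
    return plan


# One world constraint (the stove is unavailable) and one user constraint (no glass).
HIDDEN = [
    ("no_stove", lambda plan: all("stove" not in step for step in plan)),
    ("no_glass", lambda plan: all("glass" not in step for step in plan)),
]
```

Calling `run_episode(toy_agent, HIDDEN)` takes three turns: the first two proposals each trigger one disclosure, and the third satisfies both constraints. An agent that fails to track earlier feedback would keep re-violating revealed constraints until `max_turns` runs out.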
[NLP-94] An ERP Study on Recursive Locative Processing in Mandarin-Speaking Children with Autism
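Quick read: This paper addresses the neural mechanisms by which children with autism spectrum disorder (ASD) process recursive structures during real-time language comprehension, whose temporal dynamics remain poorly understood. The study focuses on how Mandarin-speaking children process two-level recursive locative constructions, using event-related potentials (ERPs) to analyze neural responses at three stages associated with structural prediction (P200), semantic integration (N400), and syntactic reanalysis (P600), with mental age controlled. Typically developing (TD) children showed clear P200 and P600 modulation under structural mismatch, whereas children with ASD showed attenuated early predictive processing and reduced late reanalysis effects. Children with ASD also showed enhanced N400 responses under mismatch, suggesting a heavier semantic integration load. The ASD group displayed significantly greater inter-individual variability in hemispheric lateralization, but lateralization strength was not significantly associated with receptive vocabulary. The study therefore proposes a cascading account in which reduced early predictive engagement in ASD leads to higher integration costs and lower reanalysis efficiency, and it highlights the role of temporal dynamics and neural variability in understanding language differences in ASD.
Link: https://arxiv.org/abs/2606.05620
Authors: Xiaoyi Wang,Chenxi Fu,Ziman Zhuang,Caimei Yang
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: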
Abstract:Recursion enables the generation of hierarchical linguistic structures but imposes substantial processing demands during real-time comprehension. While difficulties with complex syntax have been reported in autism spectrum disorder (ASD), the temporal dynamics of recursive processing remain poorly understood. This study used event-related potentials (ERPs) to examine how Mandarin-speaking children with ASD process two-level recursive locative constructions. Twenty-four children (12 ASD, 12 typically developing, TD) participated in a cross-modal sentence-picture matching task. Neural responses were analyzed across three processing stages associated with structural prediction (P200), semantic integration (N400), and syntactic reanalysis (P600), with mental age controlled. Results revealed a systematic divergence between groups. TD children showed clear P200 and P600 modulation in response to structural mismatch, whereas ASD children exhibited attenuated early differentiation and reduced late reanalysis effects. In contrast, ASD children showed enhanced N400 responses under mismatch conditions, indicating increased semantic integration demands. In addition, the ASD group displayed significantly greater inter-individual variability in hemispheric lateralization, although lateralization strength was not associated with receptive vocabulary performance. These findings support a cascading account in which reduced early predictive engagement in ASD leads to increased integration costs and diminished reanalysis efficiency during recursive processing. More broadly, the results highlight the importance of both temporal processing dynamics and neural variability in understanding language differences in ASD.
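[NLP-95] What's in a Name? Morphological Shortcuts by LLMs in Pharmacology
Quick read: This paper addresses a safety risk in the medical domain: large language models (LLMs) over-rely on the morphological cues carried by affixes. Models often produce plausible-looking but fabricated drug information from the prefix or suffix of a drug name alone, which leads to wrong inferences and clinical content about drugs that do not exist. The key contribution is a systematic framework for identifying whether a model's drug semantics are driven mainly by the affix, the stem, or the drug name as a whole. Applied across 653 drugs, the framework shows that models commonly infer pharmacological class from affix signals and that this behavior localizes to early-to-mid layers, yet models rarely state this reliance explicitly and sometimes conflate the properties of drugs that share an affix. These findings show that affix heuristics pose a subtle but measurable safety risk, and they offer mechanistic insight for making generative AI more reliable in high-stakes medical settings.
Link: https://arxiv.org/abs/2606.05616
Authors: Kaijie Mo,Thomas Yang,Chantal Shaib,Qing Yao,William Rudman,Ramez Kouzy,Kanishka Misra,Byron C. Wallace,Junyi Jessy Li
Affiliations: The University of Texas at Austin; Northeastern University; MD Anderson Cancer Center
Categories: Computation and Language (cs.CL)
Comments: 22 pages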
Abstract:The morphological form of a word can often give cues to its meaning, but purely relying on these mappings can lead to overgeneralization in high-stakes domains. In the medical domain, for instance, LLMs can confidently reason about fictitious drugs from their affixes alone (e.g., wugcillin) and generate plausible-looking clinical content. We present a behavioral and mechanistic study of LLM “affix heuristics” in pharmacology. Using fictitious drug names built from real affixes, we show that affix signals alone elicit class-level pharmacological responses. We introduce a framework for identifying whether a model’s drug semantics are driven mainly by the affix, the stem, or the drug name as a whole. Applied across 653 drugs, our framework reveals that models often induce drug meaning primarily through affix cues, yet rarely explicitly indicate this reliance, and sometimes incorrectly conflate properties among affix-sharing drugs. Activation patching across models further localizes this behavior to early-mid layers. These findings show that morphological shortcuts pose a subtle but measurable risk to safety.
[NLP-96] Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training
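Quick read: This paper addresses the training instability and excessive compute cost caused by poorly chosen hyperparameters in continued pre-training of large language models (LLMs). Existing practice relies mostly on heuristics or grid search, which is neither systematic nor efficient. The solution rests on an empirical finding: optimal hyperparameters follow stable, predictable scaling laws throughout continued pre-training. Building on it, the authors propose a two-stage, model-agnostic framework. First, small-scale proxy models are trained to establish, via standard loss-compute scaling laws, a quantitative mapping from compute budget to optimal hyperparameters. Second, the validation loss of the initial checkpoint is fed through the inverse scaling law to estimate its equivalent pre-training compute (the compute needed to reach the same loss from scratch), which is combined with the planned compute budget to predict the optimal hyperparameters for the target run. The method cuts hyperparameter search overhead by up to 90% while matching or exceeding baseline performance, giving a generalizable, efficient, and principled approach for diverse continued pre-training scenarios starting from any checkpoint.
Link: https://arxiv.org/abs/2606.05610
Authors: Yongwei Zhou,Juncheng Diao,Junlin Shang,Peiguang Li,Rongxiang Weng
Affiliations: MeiTuan; University of Chinese Academy of Sciences; Harbin Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: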
Abstract:The efficacy of continued pre-training for Large Language Models (LLMs) hinges upon hyperparameter configurations, such as learning rate and batch size. However, current practices often rely on heuristics or grid searches, leading to training instability and excessive costs. In this work, we first empirically discover that optimal hyperparameters follow stable and predictable scaling laws throughout the continued pre-training process. Leveraging these insights, we propose a novel framework to establish quantitative relationships between compute budget and optimal hyperparameters for a given checkpoint. Our approach has two stages: (1) Empirical Law Discovery, where we train small-scale proxy models to derive functions mapping compute budget to optimal hyperparameters via standard loss-compute scaling laws; and (2) State-Aware Hyperparameter Prediction, where we evaluate an initial checkpoint’s validation loss and use the inverse scaling law to estimate its equivalent pre-training compute (the compute needed to achieve the same loss from scratch). Combining this with the planned compute budget, we predict optimal hyperparameters for the target run. Empirical results demonstrate that our method reduces the hyperparameter search overhead by up to 90% while achieving comparable or superior performance relative to baselines. This model-agnostic framework generalizes across architectures, providing a principled and efficient methodology for diverse continued pre-training scenarios starting from any given point.
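The second stage can be illustrated with a minimal sketch. It assumes the loss-compute law has the common power-law form L(C) = L_inf + a * C^(-b) and that the optimal learning rate is itself a power law in compute. These functional forms, every coefficient, and the choice to add the equivalent compute to the planned budget are illustrative assumptions, not formulas or values taken from the paper.

```python
def equivalent_compute(val_loss: float, a: float, b: float, l_inf: float) -> float:
    """Invert L(C) = l_inf + a * C**(-b) to get the compute that yields val_loss."""
    if val_loss <= l_inf:
        raise ValueError("validation loss must exceed the irreducible loss l_inf")
    return ((val_loss - l_inf) / a) ** (-1.0 / b)


def predict_learning_rate(total_compute: float, k: float, alpha: float) -> float:
    """Hypothetical hyperparameter law: lr*(C) = k * C**alpha."""
    return k * total_compute ** alpha


# Hypothetical coefficients, as if fitted on small proxy models.
A, B, L_INF = 10.0, 0.5, 1.5
K, ALPHA = 0.1, -0.25

checkpoint_loss = 2.5        # validation loss measured on the initial checkpoint
planned_budget = 300.0       # planned compute, in the same arbitrary units as c_eq

c_eq = equivalent_compute(checkpoint_loss, A, B, L_INF)
# Assumption: the target run is treated as if it had consumed c_eq + planned_budget.
lr = predict_learning_rate(c_eq + planned_budget, K, ALPHA)
print(f"equivalent compute = {c_eq:.1f}, predicted lr = {lr:.4f}")
```

The point of the sketch is that only one cheap measurement (the checkpoint's validation loss) is needed at prediction time; the expensive fitting happens once on proxy models.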
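[NLP-97] TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework
Quick read: This paper addresses the trade-off between task difficulty and evaluation reliability in repository-level coding benchmarks: tasks that challenge frontier models usually involve large codebases with incomplete test coverage, and human review does not scale. The authors propose TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. Tasks cover new sparse formats, dense optimization passes, intermediate representation (IR) transformations, scheduler changes, runtime components, and high-level numerical operators. Each run is graded objectively by applying the agent's patch and running the framework's test suite, which includes the pre-existing randomized regression tests and any tests the agent adds. For feature-addition tasks, a pass means the patch preserves the tested pre-existing behavior and satisfies the agent-added checks for the requested feature. Seven coding agents spanning three frontier model families and one open-weight model are evaluated. Pass rates range from 64.8% for the strongest agent to 22.1% for the weakest, and agents pass markedly different subsets of tasks: pairwise Cohen's κ ranges from -0.07 to 0.43, with κ = 0.05 between the two strongest agents. The benchmark therefore separates agents clearly and exposes differences in what each can solve.
Link: https://arxiv.org/abs/2606.05570
Authors: Bobby Yan,Fredrik Kjolstad
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: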
Abstract:Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not scale. We introduce TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. Tasks cover new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. TensorBench grades each run by applying the agent’s patch and running the framework’s test suite, which includes the pre-existing randomized regression tests and any tests the agent adds. For feature-addition tasks, a pass means that the patched repository preserves the tested pre-existing behavior and satisfies the agent-added checks for the requested feature. We evaluate seven coding agents spanning three frontier model families and one open-weight model. Pass rates under this criterion range from 64.8% for the strongest agent to 22.1% for the weakest. Agents pass different subsets of tasks: pairwise Cohen’s κ ranges from -0.07 to 0.43, with κ = 0.05 for the two strongest agents.
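To make the agreement statistic concrete, here is a small sketch of Cohen's kappa for two agents' pass/fail outcomes over the same tasks. The outcome vectors are made up for illustration and are not TensorBench results.

```python
def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two binary raters (here: pass/fail per task for two agents)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n        # raw agreement rate
    pa, pb = sum(a) / n, sum(b) / n                         # each agent's pass rate
    expected = pa * pb + (1 - pa) * (1 - pb)                # agreement expected by chance
    if expected == 1.0:
        return float("nan")  # both agents constant and identical: kappa is undefined
    return (observed - expected) / (1 - expected)


# Hypothetical outcomes on eight tasks (True = pass).
agent_x = [True, True, True, False, False, False, True, False]
agent_y = [True, False, True, False, True, False, False, False]
kappa = cohens_kappa(agent_x, agent_y)
print(f"kappa = {kappa:.2f}")
```

A kappa near zero between two agents with similar pass rates means they agree on individual tasks about as often as chance would predict, that is, they succeed on largely different tasks.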
[NLP-98] Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs INTERSPEECH2026
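Quick read: This paper addresses mispronunciation detection and diagnosis (MDD) for second-language (L2) learners, with a focus on the systematic pronunciation differences shown by learners from different native-language (L1) backgrounds. The core challenge is modeling confusion patterns between speech units such as phonemes while separating the deviations caused by transfer from different L1s. The solution builds statistical graphs that represent phoneme confusion patterns as directed graphs, so that the model can capture dependencies between phonemes and the paths along which errors propagate. A language-specific strategy then identifies and models the systematic differences across L1 backgrounds. On the L2-ARCTIC benchmark the method reaches an F1-score of 59.52%, outperforming several competitive baselines.
Link: https://arxiv.org/abs/2606.05569
Authors: Huu Tuong Tu,Hanh Nguyen,Thien Van Luong,Nguyen Tien Cuong,Vu Huan,Nguyen Thi Thu Trang
Affiliations: Hanoi University of Science and Technology; VNPT AI, VNPT Group; National Economics University
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2026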
Abstract:Mispronunciation Detection and Diagnosis (MDD) has gained increasing importance in computer-assisted language learning and speech technology in recent years. In this paper, we propose a method for constructing statistical graphs that enable models to learn phoneme confusion patterns represented as directed graphs. Furthermore, we introduce a language-specific strategy to capture systematic pronunciation differences across various native language (L1) backgrounds. The effectiveness of our approach is demonstrated through extensive experiments on the L2-ARCTIC benchmark, where it achieves an F1-score of 59.52%, outperforming several competitive baselines.
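One way to picture a language-specific statistical graph is as a directed graph whose edge from a canonical phoneme to a realized phoneme is weighted by how often that substitution is observed. The abstract does not specify the construction, so the edge weighting below (empirical conditional frequency) and the toy aligned pairs are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def build_confusion_graph(pairs):
    """Directed graph: edge canonical -> realized, weighted by P(realized | canonical)."""
    counts = defaultdict(Counter)
    for canonical, realized in pairs:
        counts[canonical][realized] += 1
    graph = {}
    for canonical, outgoing in counts.items():
        total = sum(outgoing.values())
        graph[canonical] = {realized: c / total for realized, c in outgoing.items()}
    return graph


# Made-up aligned (canonical, realized) phoneme pairs, grouped by L1 background.
data_by_l1 = {
    "L1_A": [("th", "s"), ("th", "th"), ("th", "s"), ("v", "w")],
    "L1_B": [("th", "t"), ("th", "th"), ("r", "l"), ("r", "r")],
}

# One graph per L1, so that systematic differences between backgrounds stay separate.
graphs = {l1: build_confusion_graph(pairs) for l1, pairs in data_by_l1.items()}
print(graphs["L1_A"]["th"])
```

Keeping one graph per L1 is what lets the two toy groups show different dominant substitutions for the same canonical phoneme.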
[NLP-99] Using Large Language Models to Support High Volume Application Review for an Undergraduate Research Program
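Quick read: This paper addresses the heavy workload and low efficiency of application review in undergraduate research programs such as Purdue University's Summer Undergraduate Research Fellowship (SURF). Thousands of applications arrive every year and must be evaluated consistently by staff within tight timelines, which makes manual review slow and labor-intensive. The paper presents a large language model (LLM)-based tool that assists with the initial scoring and analysis of about 1,200 Statements of Purpose (SoPs). The workflow uses OpenAI GPT models (GPT-4o, GPT-5-mini, and GPT-5.2) with a structured rubric of six subcategories, each scored 0-3, and a carefully designed prompt that produces numerical scores, rationales (positive and negative aspects), and short excerpts for each submission. With GPT-5.2, the full batch of 1,200 SoPs was processed in about 4.6 hours of compute time, roughly 14 seconds per SoP. Rubric adherence differed notably across model versions, with GPT-5.2 adhering most closely, and disagreement between model scores was more pronounced for lower-scoring submissions. The tool replaces the role previously played by distributed human graders: the program coordinator reviews the model outputs alongside each SoP and produces the shortlist in about 4 hours, compared with the multi-week coordination effort of earlier cycles, while human judgment remains central to the final decision.
Link: https://arxiv.org/abs/2606.05564
Authors: Varun Aggarwal,Kay Kobak,John Howarter
Affiliations: Purdue University
Categories: Computation and Language (cs.CL)
Comments: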
Abstract:Undergraduate research programs such as the Summer Undergraduate Research Fellowship (SURF) at Purdue University receive thousands of applications every year, requiring significant time and effort for program staff to evaluate each submission consistently and within tight timelines. This work-in-progress paper describes the development and initial deployment of a large language model (LLM)-based tool to assist in the evaluation of approximately 1,200 student Statements of Purpose (SoPs) for the SURF 2026 cycle at Purdue University. The workflow utilizes OpenAI GPT models (GPT-4o, GPT-5-mini, and GPT-5.2) and uses a structured rubric across six subcategories, each scored on a 0-3 scale. A few SoPs, graded by program staff, were used to tune the model responses. The model prompt was designed to generate both numerical scores, rationales (including positive and negative aspects) and short excerpts from each submission. Using GPT-5.2, the full batch of 1,200 SoPs was processed in approximately 4.6 hours of compute time, averaging roughly 14 seconds per SoP (with per-SoP timing varying with SoP length, which ranged from 500 to 2,000 words). Notable differences in rubric adherence were observed across model versions, with GPT-5.2 adhering most closely. Disagreement in model scores was more pronounced for lower-scoring submissions. The LLM outputs replicated the role previously played by distributed human graders, providing the program coordinator with scored and rationale-annotated outputs for the entire applicant pool. The program coordinator then reviewed these outputs alongside each applicant’s SoP, applying the same downstream office criteria used in prior SURF cycles, to produce a shortlist of strong candidates. This coordinator review was completed in approximately 4 hours, compared to the multi-week coordination effort required in prior program cycles.
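The reported throughput figures can be cross-checked with a few lines of arithmetic. The only assumption is that the overall rubric score is an unweighted sum of the six subcategory scores, which the abstract does not state.

```python
n_sops = 1200
seconds_per_sop = 14                      # "roughly 14 seconds per SoP"

total_hours = n_sops * seconds_per_sop / 3600
implied_seconds = 4.6 * 3600 / n_sops     # per-SoP time implied by "approximately 4.6 hours"
max_total_score = 6 * 3                   # six subcategories, each 0-3 (assumed unweighted sum)

print(f"{total_hours:.2f} h, {implied_seconds:.1f} s per SoP, max score {max_total_score}")
```

Fourteen seconds per SoP gives about 4.67 hours, and 4.6 hours implies 13.8 seconds per SoP, so the two rounded figures in the abstract are consistent with each other.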
[NLP-100] SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
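Quick read: This paper addresses the difficulty of evaluating LLMs as proactive mediators. The core challenge is that existing evaluations do not reflect how mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, which introduces off-topic noise. The authors propose SoCRATES, a benchmark that builds realistic multi-domain testbeds across eight real-world conflict domains, generates scenarios through an automated agentic pipeline, and systematically probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity). Its topic-localized evaluator scores each topic only on the turns that advance it, which sharpens the evaluation: the evaluator reaches 0.82 alignment with human experts, more than double a per-turn baseline. Benchmarking eight frontier LLMs shows that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse, realistic conditions, and that performance varies sharply by socio-cognitive axis, indicating that further progress depends on social adaptation to diverse conditions.
Link: https://arxiv.org/abs/2606.05563
Authors: Taewon Yun,Hyeonseong Park,Jeonghwan Choi,Hayoon Park,Yeeun Choi,Hwanjun Song
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: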
Abstract:Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants’ shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.
[NLP-101] InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization
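Quick read: This paper addresses the tension between user privacy and diagnostic accuracy in speech-based mental health screening: how to protect sensitive attributes such as gender and age while keeping depression detection accurate. Existing approaches fall short. Adversarial training is not robust to unseen threats, and differential privacy degrades model performance by injecting noise across all features. The proposed InfoShield protects privacy by minimizing the mutual information between speech representations and sensitive attributes while preserving diagnostic accuracy. Because standard MINE estimators struggle with sequential speech, owing to the temporal-static misalignment between acoustic frames and attribute embeddings, the authors introduce TimeAwareMINE, which uses cross-modal attention to align the two. On the Androids Corpus, InfoShield reduces gender inference accuracy from 92.6% to 55.5% and age inference from 55.7% to 30.3% at the cost of a 6% F1 reduction, reaching F1 = 0.784 versus 0.723 for the prior state of the art.
Link: https://arxiv.org/abs/2606.05561
Authors: Xueyang Wu,Siyuan Liu,Kezhuo Yang,Guang Ling
Affiliations: Shenzhen NeurStar Inc.; University of York; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: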
Abstract:Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users’ privacy concerns about demographic information exposure. Current techniques struggle to resolve this conflict. Adversarial training often fails against unseen threats, whereas Differential Privacy tends to compromise diagnostic performance by injecting noise across all features. This paper presents InfoShield, which minimizes mutual information between speech representations and sensitive attributes while preserving depression classification accuracy. We identify that standard MINE estimators struggle with sequential speech due to temporal-static misalignment, and introduce TimeAwareMINE with cross-modal attention to align acoustic frames with attribute embeddings. Experiments on the Androids Corpus show InfoShield reduces gender inference from 92.6% to 55.5% and age inference from 55.7% to 30.3% with limited utility loss (6% F1 reduction), achieving F1=0.784 compared to prior SOTA’s 0.723.
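For context, MINE-style estimators train a critic T to maximize the Donsker-Varadhan lower bound on mutual information, E_joint[T] - log E_marginal[exp(T)], and a privacy objective can then push the encoder to minimize that estimate. The sketch below only evaluates the bound from critic scores that are assumed to be given; it does not reproduce TimeAwareMINE or its cross-modal attention critic, and the score values are placeholders.

```python
import math


def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan bound: mean critic score on joint pairs
    minus log-mean-exp of critic scores on shuffled (marginal) pairs."""
    m = max(t_marginal)  # subtract the max for a numerically stable log-mean-exp
    log_mean_exp = m + math.log(sum(math.exp(t - m) for t in t_marginal) / len(t_marginal))
    return sum(t_joint) / len(t_joint) - log_mean_exp


# Placeholder critic scores: (representation, attribute) pairs drawn jointly
# versus pairs where the attribute was shuffled across speakers.
scores_joint = [2.0, 1.5, 2.5]
scores_shuffled = [0.0, 0.0, 0.0]
mi_estimate = dv_lower_bound(scores_joint, scores_shuffled)
print(f"estimated MI lower bound = {mi_estimate:.2f} nats")
```

When the critic cannot tell joint pairs from shuffled ones, the bound collapses toward zero, which is the state a privacy-preserving encoder aims for with respect to the sensitive attribute.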
[NLP-102] AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents EMNLP2026
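Quick read: This paper addresses the problem that conventional tool-use agents, given a situated query such as "where is Lin Wei?", answer only the literal question and miss the user's implicit intent. The core challenge is identifying and quantifying unstated needs (for example, whether the person is free, or whether now is a good time to interrupt) so that the agent can support better decisions. The proposed AURA framework inserts an inference step between scene perception and tool use that produces a structured IntentFrame, whose scalar gap score controls the per-query probe budget and tool selection. On a 100-query, four-scene implicit-intent benchmark, AURA improves implicit-need coverage over ReAct-style probing (Δ = +0.07, p < 10⁻⁶), with three of the four scenes individually significant. The gain reproduces on a second backbone, and a prompt ablation attributes it to gap calibration rather than answer memorization. On factual lookup, the controller trades raw accuracy for 82% fewer probes and incurs zero forbidden-tool violations on a privacy-sensitive slice. Code, simulator, and benchmark are publicly released.
Link: https://arxiv.org/abs/2606.05557
Authors: Yang Li,Jiaxiang Liu,Jiang Cai,Mingkun Xu
Affiliations: Guangdong Institute of Intelligence Science and Technology
Categories: Computation and Language (cs.CL)
Comments: Submitted to EMNLP 2026. Code, simulator, and benchmark: this https URL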
Abstract:A situated query like “where is Lin Wei?” often encodes more than its literal content: the user may also want to know whether Lin Wei is free, in a good mood, or worth interrupting now. Standard tool-use agents answer the literal question and stop. AURA inserts an inference step between scene perception and tool use that produces an IntentFrame: a structured estimate of the implicit need with a scalar gap score that controls per-query probe budget and tool selection. On a 100-query four-scene implicit-intent benchmark, AURA improves implicit-need coverage over ReAct-style probing (Δ = +0.07, p < 10^-6); three of four scenes are individually significant, the gain reproduces on a second backbone, and a prompt ablation attributes the lift to gap calibration rather than answer memorisation. On factual lookup the controller trades raw accuracy for 82% fewer probes and zero forbidden-tool violations on a privacy-sensitive slice; scope conditions are detailed in Limitations. Code, simulator, and benchmark are released at this https URL.
[NLP-103] ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?
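Quick read: This paper addresses the difficulty role-playing language agents (RPLAs) have in tracking a character's psychological trajectory. Existing benchmarks measure factual recall at a given chapter and ignore whether a character's behavior and values match their inner development as the story progresses, especially in situations the source text never covers. The solution is ArcANE (Arc-Aware Narrative Evaluation), an automatically constructed benchmark built around the notion of a Character Arc, which segments the narrative into phases along a psychological axis and probes the same scenario at each phase, covering situations both within and beyond the source text. Experiments show that conditioning on the Character Arc beats every other context strategy on every model, with the largest advantage on scenarios outside the source text. Fine-tuning open-weight models on the same data yields ArcANE-8B/32B, which widen this advantage further and confirm the value of Character Arc modeling for keeping RPLAs consistent in unseen situations.
Link: https://arxiv.org/abs/2606.05553
Authors: Woojung Song,Nalim Kim,Sangjun Song,Chaewon Heo,Jongwon Lim,Yohan Jo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: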
Abstract:Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether responses align with the character’s psychological trajectory, especially in scenarios the source text never explores. We introduce ArcANE (Arc-Aware Narrative Evaluation), an automatically constructed benchmark spanning 17 novels and 80 principal characters. A Character Arc segments the narrative into phases along a psychological axis, and each probe poses the same scenario across phases, spanning both situations within the source text and situations beyond it. Across six models and six context modes, conditioning on the Character Arc tops every other context strategy on every model, and the gap is largest on scenarios outside the source text where retrieval has nothing to find. We further fine-tune open-weight models on the same data to obtain ArcANE-8B/32B, which widen the Arc advantage even more on scenarios outside the source text.
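[NLP-104] Multilingual Detection of Alzheimer's Disease from Speech: A Cross-Linguistic Transfer Learning Approach
Quick read: This paper addresses the difficulty of training and deploying multilingual Alzheimer's Disease Dementia (AD) detection models efficiently for resource-constrained languages. Conventional approaches train a separate model per language, which is bottlenecked by annotation cost and long training cycles. The authors propose cross-language training: transformer-based deep learning models are trained on datasets in English, Chinese, Arabic, and Hindi so that AD can be detected in languages beyond those used for training. The key idea is to exploit representations shared across multilingual data to obtain cross-lingual generalization. The models reach F1 scores of 82% across all languages, and an inference time of 0.5 seconds suggests potential for real-time screening and global deployment.
Link: https://arxiv.org/abs/2606.05545
Authors: Nadine Yasser Abdelhalim,Emmanuel Akinrintoyo,Nicole Salomons
Affiliations: Imperial College London
Categories: Computation and Language (cs.CL)
Comments: 5 pages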
Abstract:The development of multilingual Alzheimer’s Disease Dementia (AD) detection models presents significant challenges due to the resource-intensive and time-consuming nature of language-specific model training. We propose a novel solution using cross-language training to detect AD in languages beyond those used for model training. This study investigates multilingual deep learning models for detecting AD across different languages and cognitive impairment levels. Using datasets in English, Chinese, Arabic, and Hindi, we developed transformer-based models for binary AD classification. Our approach achieved F1 scores of 82% across all languages, demonstrating strong cross-linguistic generalization. The rapid inference time (0.5 seconds) supports potential real-time screening applications, while consistent performance across languages indicates feasibility for global deployment.
[NLP-105] Less is MoE: Trimming Experts in Domain-Specialist Language Models
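Quick read: This paper addresses the difficulty of deploying Mixture-of-Experts (MoE) models given their large parameter footprint, and in particular the failure of existing compression methods on general-purpose benchmarks. It traces that failure to granularity: existing methods typically prune whole experts or layers, whereas important capabilities are distributed across experts but concentrated in a small number of sparse intermediate dimensions of the feed-forward networks (FFNs). To find these dimensions the paper uses Fisher importance, which outperforms activation-, router-score-, and magnitude-based alternatives and pinpoints very small sets of task-critical dimensions. In Qwen1.5-MoE, removing as few as 12 of 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. Building on this, the proposed Fisher-MoE compresses the model inside the FFN by ranking intermediate dimensions by Fisher importance and removing the least important ones. At the same 50% MoE compression ratio, Fisher-MoE preserves model capability while reducing weight memory by about 45% and improving inference throughput by 21%. The results suggest that the intermediate dimension is an effective unit for both compression and ranking, because that is where capability concentrates in MoE models.
Link: https://arxiv.org/abs/2606.05538
Authors: Haoze He,Xinkai Zou,Xuan Jiang,Xingyuan Ding,Ao Qu,Juncheng Billy Li,Heather Miller
Affiliations: Carnegie Mellon University; UCSD; MIT
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: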
Abstract:Mixture-of-Experts (MoE) models achieve strong performance through conditional computation, but their large parameter footprint poses deployment challenges. Prior MoE compression approaches catastrophically fail when evaluated on general-purpose benchmarks beyond commonsense reasoning. We trace this failure to the granularity of compression: important capabilities are distributed across experts but concentrated in FFN sparse intermediate dimensions. To identify these dimensions, we use Fisher importance which outperforms activation-, router-score-, and magnitude-based alternatives, and identifies tiny sets of task-critical dimensions: in Qwen1.5-MoE, removing as few as 12 of 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. Building on this, we propose Fisher-MoE, which operates within FFN to remove intermediate dimensions ranked by Fisher importance. At the same 50% MoE compression ratio, Fisher-MoE preserves model capability, while reducing weight memory by ~45% and improving inference throughput by 21%. These findings suggest intermediate dimension granularity is an effective unit for both compression and ranking where capability concentrates in MoE models.
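The ranking step can be sketched with a toy example. It assumes an empirical-Fisher score per intermediate dimension, computed as the squared gradient norm of the weights tied to that dimension, averaged over samples. The paper's exact scoring rule is not given in the abstract, so this formulation, the function names, and the toy gradients are illustrative assumptions.

```python
def fisher_importance(per_sample_grads):
    """per_sample_grads[s][j] lists the loss gradients of the weights tied to
    intermediate dimension j for sample s. Returns one score per dimension:
    the squared gradient norm of that weight group, averaged over samples."""
    n_samples = len(per_sample_grads)
    n_dims = len(per_sample_grads[0])
    scores = [0.0] * n_dims
    for sample in per_sample_grads:
        for j, grads in enumerate(sample):
            scores[j] += sum(g * g for g in grads) / n_samples
    return scores


def dims_to_keep(scores, keep_ratio):
    """Indices of the highest-scoring dimensions, returned in original order."""
    k = max(1, round(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


# Toy gradients: 2 samples, 4 intermediate dimensions, 2 weights per dimension.
toy_grads = [
    [[0.1, 0.0], [2.0, 1.0], [0.0, 0.1], [1.0, 1.0]],
    [[0.0, 0.1], [1.0, 2.0], [0.1, 0.0], [1.0, 0.0]],
]
scores = fisher_importance(toy_grads)
kept = dims_to_keep(scores, keep_ratio=0.5)   # 50% compression inside the FFN
print(scores, kept)
```

In a real FFN, dropping dimension j means deleting the matching column of the up-projection and row of the down-projection, which is why the score groups the gradients of both.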
[NLP-106] Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models ACL2026
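Quick read: This paper addresses the lack of benchmarks that rigorously diagnose the true reasoning abilities of vision-language models (VLMs) and chart progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal, disconnected tasks, which hides core cognitive weaknesses and gives little guidance for targeted improvement. The paper introduces BloomBench, the first bilingual (English-Arabic) multimodal benchmark grounded in Bloom's Taxonomy, which systematically evaluates VLMs at six cognitive levels (Remember, Understand, Apply, Analyze, Evaluate, Create). A semi-automated construction pipeline and a stratified hybrid quality assurance protocol provide scalability, cultural inclusivity, and linguistic fidelity. A study of state-of-the-art VLMs within this framework reveals a sharp cognitive asymmetry: models perform strongly on semantic understanding but struggle with factual recall and creative synthesis, showing that general multimodal proficiency masks deeper limitations at specific cognitive levels. The study also finds a significant performance gap between Arabic and English, exposing limits in current cross-lingual multimodal reasoning. The work lays a foundation for more cognitively aligned and inclusive VLMs.
Link: https://arxiv.org/abs/2606.05531
Authors: Mohammad Mahdi Abootorabi,Omid Ghahroodi,Anas Madkoor,Marzia Nouri,Doratossadat Dastgheib,Mohamed Hefeeda,Ehsaneddin Asgari
Affiliations: Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University; University of British Columbia; Zuse School ELIZA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ACL 2026 Findings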
Abstract:Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom’s Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset is available at: this https URL.
[NLP-107] CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning
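Quick read: This paper addresses the threat that adaptive black-box attacks pose to the safety alignment of large language models (LLMs), in particular prompt-rewriting attacks such as persona modulation, fictional framing, and persuasion-based reformulation, which can bypass the safety filters of frontier models. Existing defenses either rely on human curation that does not scale or use white-box optimization that overfits to specific model internals and stays brittle against the adaptive black-box adversaries met in deployment. The paper proposes CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained with Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity. The defender is hardened on the harvested adversarial rewrites through a two-stage pipeline of GRPO and rejection-sampled SFT, balanced with benign data to avoid false refusals. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts the mean StrongREJECT score by 43.2% with 0% false refusal on benign prompts. Beyond the headline result, template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalizes beyond the narrow distributions covered by adversarial training so far.
Link: https://arxiv.org/abs/2606.05523
Authors: Rahul Markasserithodi,Aditya Joshi,Yuekang Li,Ishmanbir Singh,Chris Yoo,Alan Niu
Affiliations: University of New South Wales, Australia
Categories: Computation and Language (cs.CL)
Comments: Under Review at ARR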
Abstract:Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing and persuasion-based reformulation, can bypass safety filters even on frontier models. Existing defenses either rely on non-scalable human curation or white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained via Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage GRPO + rejection-sampled SFT pipeline balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts mean StrongREJECT score by 43.2% with 0% false-refusal on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions achieved thus far in adversarial training.
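A short sketch shows why a multiplicative reward enforces both objectives at once, and how GRPO-style training turns a group of sampled rewards into relative advantages. The bypass and fidelity scores are placeholders in [0, 1]; how CHASE actually computes them, and details such as the variance estimator, are not stated in the abstract and are assumed here.

```python
import statistics


def attacker_reward(bypass: float, fidelity: float) -> float:
    """Multiplicative reward: high only when the rewrite both bypasses the
    defender and keeps the intent of the original prompt."""
    return bypass * fidelity


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std is an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder (bypass, fidelity) scores for four rewrites of the same prompt.
group = [(0.9, 0.9), (0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]
rewards = [attacker_reward(b, f) for b, f in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Under a product, a rewrite that bypasses the defender but drifts from the original intent scores as poorly as one that keeps the intent but gets refused, so neither shortcut is rewarded.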
[NLP-108] EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts
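Quick read: This paper addresses a temporal mismatch in how epidemic LLM forecasters are used: they are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive late and disease regimes shift over time. The paper proposes EpiEvolve, a self-evolving agent framework that keeps the weights of the LLM forecaster (trained on the warm-start period) fixed and adapts through a hierarchical episodic memory. Its key components are reflection on delayed labels, distillation of recurring forecast errors into reusable strategic rules, and regime-aware retrieval of cases relevant to the current variant regime. This lets the forecaster reuse its own past predictions and observed outcomes while following a chronological protocol that prevents future leakage. On weekly COVID-19 hospitalization trend forecasting across five variant regimes, EpiEvolve reaches 0.629 average accuracy, compared with 0.561 for the static backbone and 0.325 for the external CDC ensemble, and reduces recovery lag after regime shifts from 5 to 2 weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.
Link: https://arxiv.org/abs/2606.05513
Authors: Yiming Lu,Sihang Zeng,Zhengxu Tang,Max Lau,Fei Liu,Wei Jin
Affiliations: Emory University; University of Washington
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: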
Abstract:Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches 0.629 average accuracy, compared with 0.561 for the static backbone and 0.325 for the external CDC ensemble, and reduces recovery lag after regime shifts from 5 to 2 weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.
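The chronological protocol can be illustrated with a minimal memory that only exposes past forecasts whose labels have already arrived and that belong to the current regime. The class names, the fixed two-week label delay, and the flat list storage are hypothetical simplifications; the paper's hierarchical memory, reflection, and rule distillation are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    week: int          # week in which the forecast was issued
    regime: str        # variant regime at that week
    prediction: str    # e.g. "increase", "decrease", "stable"
    outcome: str       # true trend, usable only once its label has arrived


class EpisodicMemory:
    def __init__(self, label_delay: int):
        self.label_delay = label_delay      # weeks between a forecast and its label
        self.episodes: list[Episode] = []

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def retrieve(self, current_week: int, regime: str) -> list[Episode]:
        """Return same-regime episodes whose labels are already available,
        so that nothing from the future leaks into the context."""
        return [
            e for e in self.episodes
            if e.week + self.label_delay <= current_week and e.regime == regime
        ]


memory = EpisodicMemory(label_delay=2)
memory.add(Episode(week=1, regime="delta", prediction="increase", outcome="increase"))
memory.add(Episode(week=2, regime="omicron", prediction="stable", outcome="increase"))
memory.add(Episode(week=3, regime="delta", prediction="decrease", outcome="increase"))

visible = memory.retrieve(current_week=4, regime="delta")
print([e.week for e in visible])
```

The week-3 forecast is excluded at week 4 because its outcome would not yet be known in a live deployment.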
[NLP-109] MASF: A Multi-Model Adaptive Selection Framework for Abstractive Text summarization
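Quick read: This paper addresses the inconsistent summary quality that a single model produces across texts with varying structures and topics, that is, the performance fluctuation and limited robustness of single-model abstractive summarization on diverse inputs. The solution is a Multi-Model Adaptive Summarization Framework that integrates several fine-tuned transformer-based summarization models with an adaptive selection mechanism based on automatic evaluation metrics. Each model independently generates a candidate summary for the same input article, the candidates are scored with automatic metrics that capture both lexical similarity and semantic relevance, and the highest-scoring summary is returned as the final output. On the CNN/DailyMail dataset the framework achieves a BERTScore of 88.63%, the highest among the compared methods, and outperforms several LLMs, supporting the effectiveness of combining multiple models with a dynamic selection strategy.
Link: https://arxiv.org/abs/2606.05494
Authors: Ahmed Alansary,Ali Hamdi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, IMSA2026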
Abstract:Automatic text summarization has become increasingly important due to the rapid growth of digital textual information. This paper presents a Multi-Model Adaptive Summarization Framework designed to improve the robustness and quality of abstractive text summarization. Relying on a single model often leads to inconsistent summarization quality across articles with varying structures and topics. To address this limitation, the proposed framework integrates multiple fine-tuned transformer-based summarization models and introduces an adaptive selection mechanism. In this framework, each model independently generates a candidate summary for the same input article. The generated summaries are then evaluated using automatic evaluation metrics that capture both lexical similarity and semantic relevance. Based on these scores, the framework selects the highest-quality summary as the final output. The models are fine-tuned and evaluated on the widely used CNN/DailyMail news summarization dataset. Experimental results demonstrate that the proposed framework achieves the highest BERTScore among all compared methods with a score of 88.63%. It also outperforms several LLMs such as GPT3-D2, Falcon-7b, and Mpt-7b, highlighting its effectiveness and robustness. These findings highlight the effectiveness of leveraging multiple transformer-based models within an adaptive selection strategy to improve the quality and robustness of automatic text summarization systems.
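The selection step amounts to scoring every candidate and keeping the best one. In the sketch below a unigram F1 against the source article stands in for the paper's combination of lexical and semantic metrics; that stand-in metric, the choice to score against the article, and the model names are assumptions for illustration only.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Stand-in lexical score: F1 over lowercase whitespace tokens."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def select_summary(candidates: dict, score) -> tuple:
    """Return (model_name, summary) for the candidate with the highest score."""
    best = max(candidates, key=lambda name: score(candidates[name]))
    return best, candidates[best]


article = "the city council approved the new budget on monday after a long debate"
candidates = {                      # hypothetical outputs of three fine-tuned models
    "model_a": "council approved the new budget",
    "model_b": "weather was sunny on monday",
    "model_c": "the debate",
}
name, summary = select_summary(candidates, lambda s: unigram_f1(s, article))
print(name, "->", summary)
```

Because `select_summary` takes the scoring function as an argument, a semantic metric could be swapped in without touching the selection logic.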
[NLP-110] Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution
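Quick read: This paper addresses the difficulty of localizing prompt ambiguity, a common source of failure in large language models. Ambiguity is a latent property of the prompt, while existing attribution methods explain only observable outputs such as logits or generated tokens, so they cannot capture it. The paper proposes PRIG, a gradient-based attribution method that uses a probe logit to attribute latent ambiguity to specific token positions in the prompt. It trains a linear probe to distinguish clear prompts from ambiguous ones and attributes the probe score back to earlier token representations in the residual stream. To enable token-level evaluation, the authors build synthetic ambiguity datasets across coding, math, and writing by rewriting one task-critical sentence per prompt, complemented by a human-written gold benchmark. PRIG reaches 0.840 AUROC on the combined synthetic benchmark and 0.891 AUROC on the gold set, substantially better than gradient attribution baselines. It also outperforms GPT-5.4 on sentence-level ambiguity identification and retains useful signal out of domain. The results establish PRIG as a practical tool for locating ambiguity in prompts and suggest more broadly that latent prompt properties can be localized through intermediate representations rather than output-level attribution.
Link: https://arxiv.org/abs/2606.05486
Authors: Govind Ramesh,Yao Dou,Wei Xu
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages, 5 figures, 5 tables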
Abstract:Prompt ambiguity is a common source of failure in large language models, but is difficult to localize because it is a latent property of the prompt, while existing attribution methods are designed to explain observable outputs such as logits or generated tokens. We introduce PRIG, a gradient attribution method that uses a probe logit to attribute latent ambiguity to token positions. Specifically, PRIG trains a linear probe to distinguish clear prompts from ambiguous prompts and attributes the probe score to earlier token representations in the residual stream. To enable token-level evaluation, we construct synthetic ambiguity datasets across coding, math, and writing by rewriting one task-critical sentence per prompt, and complement them with a human-written gold benchmark. In this setting, PRIG localizes ambiguous spans substantially better than gradient attribution baselines, achieving 0.840 AUROC on the combined synthetic benchmark and 0.891 AUROC on the gold set. It also outperforms GPT-5.4 on sentence-level ambiguity identification and retains useful signal out-of-domain. These results establish PRIG as a practical tool for identifying which parts of a prompt are ambiguous. More broadly, they suggest that latent prompt properties can be localized through intermediate representations, rather than through output-level attribution.
[NLP-111] Multilingual Coreference Resolution via Cycle-Consistent Machine Translation
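Quick read: This paper tackles the slow progress of coreference resolution in low-resource languages, where annotated data is scarce. Most prior work targets English and other high-resource languages, which limits performance elsewhere. The authors propose a coreference resolution pipeline that uses machine translation (MT) from English into a target low-resource language to generate or expand training data. To validate translation quality automatically, the translated samples are back-translated and compared with the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function, so each training sample is weighted by its MT cycle consistency. Experiments on four low-resource languages show significant gains, and the pipeline also enables accurate coreference resolution in languages that previously had no annotated corpora.
Link: https://arxiv.org/abs/2606.05444
Authors: Adriana-Valentina Costache,Eduard Poesina,Silviu-Florin Gheorghe,Paul Irofti,Radu Tudor Ionescu
Affiliations: University of Bucharest
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: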
Abstract:Coreference resolution is a core NLP task, having a broad range of downstream applications, e.g., machine translation, question answering, document summarization, etc. While the task is well-studied in English, comparatively less attention is dedicated to coreference resolution in other languages, especially low-resource ones. To mitigate this gap, we propose a novel coreference resolution pipeline that harnesses machine translation (MT) from English to a target low-resource language, to generate or expand training data. To automatically validate the quality of the translated samples, we back-translate the samples and assess the similarity with the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function to weight training samples according to their MT cycle consistency. Extensive experiments on four low-resource languages show that our pipeline brings significant performance gains in coreference resolution. Moreover, our pipeline enables accurate coreference resolution in languages where no previous corpora were available.
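The cycle-consistency weighting described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract only says that similarity scores are integrated into the loss, so clipping negative similarities to zero and normalizing by the weight sum are assumptions made here, and the embedding vectors stand in for BERT sentence embeddings that would come from a real encoder.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def cycle_weighted_loss(per_sample_losses, original_embs, back_translated_embs):
    """Weight each sample's loss by the similarity between the original English
    sample and its back-translation.

    Assumptions (not from the paper): negative similarities are clipped to 0,
    and the weighted sum is normalized by the total weight.
    """
    weights = [
        max(0.0, cosine_similarity(o, b))
        for o, b in zip(original_embs, back_translated_embs)
    ]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    weighted = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return weighted / total_weight

# Toy example: the second sample's back-translation drifted completely,
# so its loss no longer contributes.
losses = [1.0, 3.0]
originals = [[1.0, 0.0], [1.0, 0.0]]
back_translations = [[1.0, 0.0], [0.0, 1.0]]
loss = cycle_weighted_loss(losses, originals, back_translations)
```

A sample whose back-translation lands far from the original in embedding space gets a weight near zero, which keeps poorly translated training examples from dominating the gradient.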
[NLP-112] MIRAI: Prediction and Generation of High-Impact Academic Research
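Quick read: With the volume of scientific publications growing rapidly, this paper addresses how to identify and synthesize high-impact work efficiently. The central question is whether a paper's long-term impact can be predicted from its title, abstract, and publication date alone, without full text or other external data. The authors propose MIRAI (Multi-year Inference of Research trends and Academic Impact), a deep learning framework trained on the arXiv academic graph to predict 5-year PageRank and citation counts. For papers published in 2021, MIRAI achieves a Spearman correlation of 0.4686 on PageRank prediction and 0.6192 on citation prediction. The authors also build a research ideation pipeline on top of MIRAI that steers idea generation toward high impact; an unbiased LLM judge rated its ideas as more impactful than those of a baseline without MIRAI at a 4:3 ratio.
Link: https://arxiv.org/abs/2606.05443
Authors: Alex Li,Joseph Jacobson
Affiliations: MIT Media Lab
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: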
Abstract:The rapid pace of scientific publishing has made the identification and synthesis of high-impact work an increasingly urgent challenge. We introduce MIRAI (Multi-year Inference of Research trends and Academic Impact), a deep learning framework that predicts paper impact using only its title, abstract, and publication date. We train MIRAI on the arXiv academic graph to predict 5-year PageRank and citation counts, achieving Spearman’s ρ of 0.4686 on PageRank prediction and 0.6192 on citation prediction for papers published in 2021. We propose a research ideation pipeline built on top of MIRAI that produces research ideas oriented towards high impact. These ideas were judged as more impactful than a baseline without MIRAI by an unbiased LLM judge at a 4:3 ratio. We make the 5-year citation prediction model publicly available at this https URL.
[NLP-113] ComplexityMT: Benchmarking the Interaction Between Text Complexity and Machine Translation
Quick read: This paper asks whether machine translation (MT) preserves the complexity of the source text. It studies how text complexity and MT interact and quantifies the effect. The authors introduce ComplexityMT, a new challenge that uses Common European Framework of Reference for Languages (CEFR) levels as the measure of text complexity, and evaluate three open-weight models, one closed model, and a commercial MT system across six languages (Arabic, Dutch, English, French, Hindi, and Russian). Two tasks are considered: the correlation between CEFR level and translation difficulty, and the shift in CEFR level between source and translated text. The experiments show that texts at higher CEFR levels are harder to translate, and that for most languages MT shifts the CEFR level of the target text relative to the source. The findings are relevant to multilingual pedagogical content generation and MT difficulty estimation.
Link: https://arxiv.org/abs/2606.05421
Authors: Joseph Marvin Imperial,Junhong Liang,Belal Shoer,Abdullah Barayan,Rodrigo Wilkens,Omar Mussa,Dawn Knight,Eugénio Ribeiro,Ekaterina Kochmar,Sowmya Vajjala,Fernando Alva-Manchego,Harish Tayyar Madabushi
Affiliations: University of Bath; Cardiff University; National University Philippines; MBZUAI; University of Exeter; INESC-ID Lisboa; Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR; National Research Council, Canada; King Abdulaziz University; Saudi Electronic University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:When a text is translated, does the translation retain the complexity of the original? We introduce ComplexityMT, a new challenge for assessing how text complexity and machine translation interact with and influence each other, using the Common European Framework of Reference for Languages (CEFR) levels as the measure of text complexity. Across six languages, including Arabic, Dutch, English, French, Hindi, and Russian, we evaluate three open-weight models, one closed model, and a commercial machine translation system on two tasks: i) correlation of CEFR with translation difficulty, and ii) shifts in CEFR levels of the source texts. Our experiments show that higher CEFR levels make texts more difficult to translate, and that machine translation shifts the CEFR level of the target text compared to the original source, for most languages. These findings provide new insights for researchers and practitioners working on multilingual pedagogical content generation and machine translation difficulty estimation.
[NLP-114] Executable Schema Contracts: From Automatic Ingestion to Multi-Source Retrieval
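Quick read: This paper addresses querying heterogeneous multi-source data (tables, documents, and semi-structured files) whose semantics are implicit and whose schemas and formats are inconsistent. Existing approaches either require costly manual engineering or bypass structure entirely. The authors present a system that automatically discovers an executable schema from raw multi-source data and uses it as a shared contract for knowledge graph construction and query-time retrieval. A closed-world field catalog constrains LLM-based schema discovery to attested fields; deterministic structural analysis infers identity keys, foreign keys, and source hierarchy; and the resulting schema drives extraction, deduplication, and cross-source linking into a provenance-aware knowledge graph. At query time the schema, optionally extended via a monotonic protocol, conditions a multi-tool agent that routes retrieval across structured lookup, graph traversal, and vector search and returns grounded answers with traceable citations. In controlled zero-shot comparisons using the same LLM, data, and evaluation harness, the system improves over retrieval-only and decomposition-based baselines on four QA benchmarks, and ablations show that schema-conditioned routing, structural intelligence, and schema-guided construction each contribute to the gains.
Link: https://arxiv.org/abs/2606.05415
Authors: Padmaja Jonnalagedda,Yuguang Yao,Xiang Gao,Hilaf Hasson,Kamalika Das
Affiliations: Intuit AI Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, plus supplementary appendix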
Abstract:Real-world data spans tables, documents, and semi-structured files with implicit semantics. Querying this data requires integrating evidence across inconsistent schemas and formats, yet existing approaches either demand costly manual engineering or bypass structure entirely. We present a system that automatically discovers an executable schema from raw multi-source data and uses it as a shared contract for knowledge graph construction and query-time retrieval. A closed-world field catalog constrains LLM-based schema discovery to attested fields; deterministic structural analysis infers identity keys, foreign keys, and source hierarchy; and the resulting schema drives extraction, deduplication, and cross-source linking into a provenance-aware knowledge graph. At query time the schema – optionally extended via a monotonic protocol – conditions a multi-tool agent routing retrieval across structured lookup, graph traversal, and vector search, returning grounded answers with traceable citations. In controlled zero-shot comparisons using the same LLM, data, and evaluation harness, the system improves over retrieval-only and decomposition-based baselines across four QA benchmarks, with ablations showing that schema-conditioned routing, structural intelligence, and schema-guided construction each contribute to the gains.
[NLP-115] Would you still call this Dax? Novel Visual References in VLMs and Humans
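Quick read: This paper studies how vision-language models (VLMs) map novel visual references to language after exposure, particularly when those references contradict prior knowledge from pre-training. Prior work has focused on visual augmentations of familiar concepts rather than genuinely new ones. The authors build the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts at different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. The stimuli are constructed entirely from scratch, mirroring how humans encounter new concepts. Comparing five models (three open-source, two closed-source) against 2,400 human judgments, the study finds that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) although models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. NVRD is contributed as a corpus and benchmark for research on visual concept learning in both humans and machines.
Link: https://arxiv.org/abs/2606.05409
Authors: Ada Defne Tür,Gaurav Kamath,Joyce Chai,Siva Reddy,Benno Krojer
Affiliations: McGill University; Mila Quebec AI Institute; University of Michigan - Ann Arbor; Canada CIFAR AI Chair
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: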
Abstract:Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.
[NLP-116] Agents' Last Exam
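Quick read: This paper addresses the gap between strong benchmark results and economically meaningful deployment of AI systems in professional domains. The authors argue that this is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real, economically valuable workflows. They introduce Agents' Last Exam (ALE), a benchmark that evaluates AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy) and is organized as a task taxonomy of 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Across mainstream harness and backbone configurations, the average full pass rate is 2.6%, so the hardest tier remains far from saturated. ALE is designed as a living benchmark whose task pool grows as new workflows and industries are onboarded, and it is intended as an instrument for closing the gap between benchmark success and GDP-relevant impact rather than as another leaderboard.
Link: https://arxiv.org/abs/2606.05405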
Authors: Yiyou Sun,Xinyang Han,Weichen Zhang,Yuanbo Pang,Tianyu Wang,Yuhan Cao,Yixiao Huang,Chris Duroiu,Haoyun Zhang,Jeffrey Lin,Weishu Zhang,Tyler Zeng,Ying Yan,Bo Liu,Hanson Wen,Mingyang Xu,Xiaoyuan Liu,Zimeng Chen,Weiyan Shi,Amanda Dsouza,Vincent Sunn Chen,Patrick Bryant,Carl Boettiger,Yamini Rangan,Bradley Rothenberg,Kyle Steinfeld,Arvind Rao,Tapio Schneider,Georgios Yannakakis,Laure Zanna,Kaan Ozbay,Ida Sim,Tarek Zohdi,George Em Karniadakis,Jack Gallant,Teresa Head-gordon,Yushan Li,Wenxi Deng,Tao Sun,Huiqi Wang,Zhun Wang,Justin Xu,Chris Yuhao Liu,Yafei Cheng,Rongwang Hu,Aras Bacho,Shengcao Cao,Zengyi Qin,Yixiong Chen,Hengduan Fan,Hao Liu,Lin Zeng,Shashank Muralidhar Bharadwaj,Litian Gong,Yingxuan Yang,Maojia Song,Ruheng Wang,Zongzheng Zhang,Honglin Bao,Shuo Lu,Jianhong Tu,Zhonghua Wang,Zheng Zhang,Zijiao Chen,yanqiong Jiang,Zhendong Li,Bohan Lyu,Chang Ma,Peiran Xu,Benran Zhang,Shangding Gu,Haoyue Hua,Haoyang Li,Wanzhe Liao,Chengzhi Liu,Junbo Peng,Haoran Sun,Zechen Xu,Bo Chen,Jiayi Cheng,Yi Jiang,Keying Kuang,Yuan Li,Youbang Pan,Ziyan Rao,Alexander Schubert,Yifan Shen,Vincent Siu,Xiatao Sun,Kangqi Zhang,Xiaopan Zhang,Yuchen Zhu,Ishaan Singh Chandok,Lei Ding,Jingxuan Fan,Andrew Glover,Jiaming Hu,Yiran Hu,Wenbo Huang,Zixin Jiang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project website: this https URL Code: this https URL
Abstract:Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents’ Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.
[NLP-117] Harnessing Generalist Agents for Contextualized Time Series
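Quick read: This paper addresses the mismatch between generalist LLM agents, which operate primarily in textual spaces, and structured temporal signals that are embedded in rich contexts. Practitioners often need end-to-end workflows for analyzing temporal dynamics, in which forecasting is only one step of a broader solution loop. The authors introduce TimeClaw, an agentic harness framework that equips generalist LLM agents with time series-native runtime support for contextualized temporal reasoning. It combines three components: executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together these enable open-ended temporal reasoning that makes use of contextual information. Evaluation on multiple benchmarks covering diverse tasks in energy, finance, weather, traffic, and other real-world domains shows improved performance.
Link: https://arxiv.org/abs/2606.05404
Authors: Zihao Li,Kaifeng Jin,Yuanchen Bei,Jiaru Zou,Avaneesh Kumar,Xuying Ning,Yanjun Zhao,Mengting Ai,Baoyu Jing,Hanghang Tong,Jingrui He
Affiliations: University of Illinois Urbana-Champaign
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint. 38 Pages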
Abstract:Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at this https URL.
[NLP-118] Reasoning Flow: Discourse Structures for Understanding LLM Reasoning Traces
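Quick read: Large reasoning models (LRMs) produce reasoning traces with non-linear structures such as backtracking and self-correction, which complicates evaluating and monitoring the reasoning process; linear analyses miss the actual semantic and logical relations between steps. The authors propose ReasoningFlow, a framework that represents the discourse structure of LRM reasoning traces as fine-grained directed acyclic graphs (DAGs). They develop and validate an annotation schema through manual annotation of 31 traces (2.1k steps) with high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models. Analysis of the resulting graphs shows that LRMs produce structurally similar traces despite being trained from different base models; that traces contain diverse fine-grained behaviors (for example local verification, self-reflection, and assumptions) that can improve reasoning trace monitorability; that most erroneous steps are not used to derive final answers; and that mechanistic causal dependencies between steps do not reflect the language-level discourse structure.
Link: https://arxiv.org/abs/2606.05402
Authors: Jinu Lee,Shivam Agarwal,Amruta Parulekar,Siddarth Madala,Dilek Hakkani-Tur,Julia Hockenmaier
Affiliations: University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: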
Abstract:Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: this https URL.
[NLP-119] LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization
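Quick read: This paper addresses systematic failures in long-horizon, research-level autoformalization: beyond hard lemmas, large developments suffer from statement drift, tangled dependencies, context decay, and local repairs that corrupt distant work. The authors present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint, coordinated by a two-stage orchestrator. The first stage stabilizes target fidelity through adversarial review; the second discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel, CI-gated rounds. This turns one brittle multi-hour run into many local, recoverable, parallel transactions. On two recent research papers spanning four Erdős problems, three autonomous runs formalize all seven target theorems with no remaining sorry placeholders, proving 258 lemmas and theorems. The authors conclude that preserving target fidelity across long mathematical developments requires durable harnesses, not only stronger provers.
Link: https://arxiv.org/abs/2606.05400
Authors: Yuanhe Zhang,Yuekai Sun,Taiji Suzuki,Jason D. Lee,Fanghui Liu
Affiliations: University of Warwick; RIKEN; University of Michigan; The University of Tokyo; RIKEN; University of California, Berkeley; Shanghai Jiao Tong University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 26 pages, 9 figures. Comments are welcome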
Abstract:Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at this https URL.
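The "dynamic leaves upward" scheduling mentioned in the abstract can be pictured as repeatedly proving every lemma whose dependencies are already discharged. The sketch below is an illustration under assumptions, not LeanMarathon's orchestrator: the function name and the dependency-map format are hypothetical, and CI gating, auditing, and repair are left out entirely.

```python
def discharge_rounds(deps):
    """Group proof obligations into rounds that could run in parallel.

    `deps` maps each lemma name to the set of lemma names it depends on
    (a hypothetical format chosen for this sketch). A lemma is a "dynamic
    leaf" once all of its dependencies have been proved in earlier rounds.
    """
    remaining = {name: set(d) for name, d in deps.items()}
    proved = set()
    rounds = []
    while remaining:
        # Current leaves: every dependency is already proved.
        leaves = sorted(name for name, d in remaining.items() if d <= proved)
        if not leaves:
            raise ValueError("cycle or missing dependency in the proof graph")
        rounds.append(leaves)
        proved.update(leaves)
        for name in leaves:
            del remaining[name]
    return rounds

# Toy proof graph: the target theorem needs lemmas a and b; a needs c.
toy_deps = {
    "target_theorem": {"lemma_a", "lemma_b"},
    "lemma_a": {"lemma_c"},
    "lemma_b": set(),
    "lemma_c": set(),
}
rounds = discharge_rounds(toy_deps)
```

Here `rounds` groups `lemma_b` and `lemma_c` first, then `lemma_a`, then `target_theorem`. Each round only touches lemmas with no unproved prerequisites, which is what makes the individual proof attempts local and independently retryable.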
[NLP-120] Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges ACL2026
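Quick read: LLM-as-judge pipelines typically assume that judgments are stable properties of fixed inputs. This paper shows that the assumption fails under interaction. It studies post-decision manipulability: how far an evaluation outcome can be altered through further conversation with the judge after an initial decision. In controlled experiments on MT-Bench and AlpacaEval, LLM judges are highly stable under repeated and neutral reevaluation but become substantially reversible under targeted post-decision challenge. These reversals can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes even when the judge reports high confidence. Authority framing is especially destabilizing, and revised judgments often come with low-overlap justifications, which suggests post hoc rationalization rather than reliable error correction. The authors introduce the Evaluation Robustness Score (ERS), which quantifies interactional robustness by combining reversal susceptibility with counterbalanced directional effects, and argue that evaluation protocols should measure robustness under challenge as well as static agreement.
Link: https://arxiv.org/abs/2606.05384
Authors: Srimonti Dutta,Akshata Kishore Moharir
Affiliations: WAI USA Research Labs
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ACL 2026 GEM (Generation, Evaluation and Metrics) Workshop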
Abstract:LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering. These reversals have practical consequences: they can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes despite high self-reported confidence. Authority framing is especially destabilizing, and revised judgments are often accompanied by low-overlap justifications, suggesting post hoc rationalization rather than reliable error correction. We introduce the Evaluation Robustness Score (ERS) to quantify interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement, but robustness under challenge.
[NLP-121] Trajectory Dynamics in Language Model Hidden States Predict Human Processing Costs Beyond Surprisal
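Quick read: This paper addresses a limitation of surprisal as the dominant predictor of incremental processing cost in human language comprehension. Surprisal reduces rich sequential representations to a single scalar per word and discards information about the direction in which the interpretation has been evolving. The authors introduce trajectory extrapolation error: at each word, a linear trajectory is fitted to the preceding hidden states of a transformer language model, and the deviation of the current state from the extrapolated path is measured. On the Natural Stories corpus, the measure is nearly orthogonal to surprisal (r = .044) and independently predicts self-paced reading times. The effect is especially pronounced in garden-path sentences, strengthens with model scale, and replicates across architectures (GPT-2 vs. Pythia/RoPE). A displacement control shows that the effect is not reducible to the magnitude of representational change. The study therefore identifies two dissociable components of processing cost: word-level prediction error (surprisal) and sensitivity to the local momentum of the unfolding interpretation (trajectory extrapolation error).
Link: https://arxiv.org/abs/2606.05346
Authors: Elan Barenholtz
Affiliations: Florida Atlantic University
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 3 figures, 6 tables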
Abstract:Human language comprehension unfolds sequentially: each word is processed in the context of those that came before, and the interpretation builds incrementally over time. Surprisal, the negative log probability of a word given its context, has been the dominant predictor of incremental processing cost. But surprisal reduces rich sequential representations to a single scalar at each word, discarding information about the direction in which the interpretation has been evolving. Dynamical-systems approaches suggest that the trajectory of the evolving interpretive state, not just its position at each moment, should shape processing, and language itself may have local momentum, since speakers plan utterances a few words at a time. We introduce trajectory extrapolation error: at each word, we fit a linear trajectory to the preceding hidden states of a transformer language model and measure deviation from the extrapolated path. On the Natural Stories corpus, this measure is nearly orthogonal to surprisal (r = .044) and independently predicts self-paced reading times. The effect is especially pronounced in garden-path sentences, strengthens with model scale (GPT-2 Small to Large), and replicates across architectures with different positional encoding schemes (GPT-2 vs. Pythia/RoPE). A displacement control shows the effect is not reducible to representational change magnitude: displacement and extrapolation error predict in opposite directions. These findings reveal two dissociable components of processing cost: word-level prediction error (surprisal) and sensitivity to the local momentum of the unfolding interpretation (trajectory extrapolation error).
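One way to read "fit a linear trajectory to the preceding hidden states and measure deviation from the extrapolated path" is a per-dimension least-squares line over a short window, evaluated one step ahead. The sketch below follows that reading only; the window length, the use of Euclidean distance, and the choice of layer are assumptions that the abstract does not specify, and the toy vectors stand in for real hidden states.

```python
import math

def extrapolate_linear(prev_states):
    """Fit a least-squares line per dimension over time steps 0..k-1
    and evaluate it at step k (one step past the window)."""
    k = len(prev_states)
    if k < 2:
        raise ValueError("need at least two preceding states to fit a line")
    t_mean = (k - 1) / 2
    t_var = sum((t - t_mean) ** 2 for t in range(k))
    predicted = []
    for d in range(len(prev_states[0])):
        y_mean = sum(state[d] for state in prev_states) / k
        slope = sum(
            (t - t_mean) * (prev_states[t][d] - y_mean) for t in range(k)
        ) / t_var
        predicted.append(y_mean + slope * (k - t_mean))
    return predicted

def extrapolation_error(prev_states, current_state):
    """Euclidean distance between the extrapolated and the observed state
    (the distance metric is an assumption of this sketch)."""
    predicted = extrapolate_linear(prev_states)
    return math.sqrt(
        sum((p - c) ** 2 for p, c in zip(predicted, current_state))
    )

# Toy 2-d "hidden states" moving along a straight line.
window = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
on_path = extrapolation_error(window, [3.0, 6.0])   # continues the line
off_path = extrapolation_error(window, [3.0, 7.0])  # deviates in one dimension
```

A state that continues the recent direction of travel gives an error near zero, while a state that breaks from it gives a larger error, independent of how probable the word itself was. That is the sense in which the measure can be close to orthogonal to surprisal.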
[NLP-122] Self-supervised User Profile Generation for Personalization
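Quick read: Personalizing LLMs for recommendation, search, dialogue, and content generation requires summarizing each user's interaction history into a profile, but existing methods train profile generators with rewards derived from labeled downstream tasks, which are expensive and sparse. The authors introduce Bidirectional User Modeling via Profiles (BUMP), a self-supervised framework that trains a profile generator without any downstream labels. Given a user's interaction history, an LLM is trained with GRPO to emit a free-form textual profile under a bidirectional in-batch ranking objective. A small LLM judge measures (i) how well the generated profile, used as a query, ranks the user's own held-out interactions above interactions from other users in the batch, and (ii) how well a held-out interaction, used as a query, ranks the user's own profile above the profiles of other users. Both directions are scored with multi-positive NDCG and combined into a dense reward per rollout. Other users in the batch serve as free negatives, so training needs only raw interaction logs. On the LaMP benchmark, BUMP matches or outperforms closed-source APIs and prior methods that rely on labeled rewards, while using no task labels during training.
Link: https://arxiv.org/abs/2606.05336
Authors: Clark Mingxuan Ju,Yuwei Qiu,Tong Zhao,Neil Shah
Affiliations: Snap Inc.
Subjects: Computation and Language (cs.CL)
Comments: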
Abstract:Personalizing large language models (LLMs) has become a central challenge as LLMs are deployed across recommendation, search, dialogue, and content generation – settings where the same query should yield different answers given different users. A promising route is to summarize each user’s interaction history into a natural-language memory or profile and prepend it to the prompt to facilitate personalization. Existing methods learn such profile generators with explicit rewards derived from labeled downstream tasks, which are expensive and sparse as they require annotated supervision for every target task. In light of this challenge, we introduce Bidirectional User Modeling via Profiles (BUMP), a self-supervised framework that trains a profile generator without any downstream labels. Specifically, given a user’s interaction history, we use GRPO to train an LLM to emit a free-form textual profile under a bidirectional in-batch ranking objective: a small LLM judge measures (i) how well the generated profile, used as a query, ranks the user’s own held-out interactions above interactions from other users in the batch, and (ii) how well a held-out interaction, used as a query, ranks the user’s own profile above profiles of other users. Both directions are scored with multi-positive NDCG and combined into a dense reward per rollout; other users in the batch supply free negatives, so every training example yields supervision from raw interaction logs alone. Evaluated on the LaMP benchmark, BUMP matches or outperforms closed-source APIs and prior methods relying on labeled rewards, while requiring no task label at training.
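The bidirectional ranking reward in the abstract can be sketched with a plain multi-positive NDCG. This is an illustration under assumptions: relevance is treated as binary (the user's own items or profile versus everyone else's), the two directions are averaged because the abstract only says they are "combined", and the rankings are passed in directly rather than produced by an LLM judge.

```python
import math

def multi_positive_ndcg(ranked_is_positive):
    """NDCG with binary relevance and possibly several positives.

    `ranked_is_positive` lists, best rank first, whether each ranked
    candidate belongs to the target user.
    """
    n_pos = sum(1 for flag in ranked_is_positive if flag)
    if n_pos == 0:
        return 0.0
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, flag in enumerate(ranked_is_positive)
        if flag
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(n_pos))
    return dcg / ideal

def bidirectional_reward(profile_to_items, item_to_profiles):
    """Average the two ranking directions (averaging is an assumption)."""
    return 0.5 * (
        multi_positive_ndcg(profile_to_items)
        + multi_positive_ndcg(item_to_profiles)
    )

# Direction (i): the profile ranks the user's two held-out items first.
profile_to_items = [True, True, False, False]
# Direction (ii): a held-out item ranks the user's profile second.
item_to_profiles = [False, True, False]
reward = bidirectional_reward(profile_to_items, item_to_profiles)
```

Because every other user in the batch contributes negatives to both rankings, each rollout receives a graded score rather than a sparse pass/fail signal.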
[NLP-123] LoRi: Low-Rank Distillation for Implicit Reasoning
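Quick read: Implicit chain-of-thought (iCoT) methods aim to internalize reasoning in large language models but often underperform explicit CoT prompting, especially on complex multi-step tasks. The authors observe empirically that hidden-state reasoning trajectories exhibit low-rank structure. Building on this, they propose a low-rank distillation framework that transfers reasoning by aligning teacher and student trajectories in a shared low-rank tensor subspace using first- and second-order statistics. The formulation captures the global structure of reasoning while supporting a compact latent reasoning process. Across multiple model families (including LLaMA and Qwen) at different scales, the method consistently improves performance on mathematical reasoning benchmarks, particularly on challenging multi-step tasks, approaching explicit CoT accuracy and outperforming prior iCoT distillation methods.
Link: https://arxiv.org/abs/2606.05315
Authors: Ryan Solgi,Jiayi Tian,Zheng Zhang
Affiliations: University of California-Santa Barbara, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: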
Abstract:Implicit chain-of-thought (iCoT) methods aim to internalize reasoning in large language models, but often underperform explicit CoT prompting. We empirically find that hidden-state reasoning trajectories exhibit low-rank structure. Motivated by this observation, we propose a low-rank distillation framework that transfers reasoning by aligning teacher and student trajectories in a shared low-rank tensor subspace using first- and second-order statistics. The resulting formulation captures the global structure of reasoning while supporting a compact latent reasoning process. We evaluate the method across multiple model families, including LLaMA and Qwen, at different scales on mathematical reasoning benchmarks. Our approach consistently improves performance, especially on challenging multi-step tasks, approaching explicit CoT accuracy and outperforming prior iCoT distillation methods.
[NLP-124] Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming
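Quick read: Recent red-teaming papers on computer-using agents (CUAs) report prompt-injection attack success rates (ASR) of 42-98%, but those numbers cluster on retired models and on the most vulnerable model in each paper's panel. This paper asks whether the same techniques, reproduced as hand-crafted templates, still work against current frontier CUAs. The author releases CUA-HandCrafted, a public benchmark of 793 episodes spanning 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations. Against Claude Sonnet 4.6 and GPT-5.4, multi-step attack success is 0/140 (Clopper-Pearson 95% upper bound 2.60%), and a prompt ablation indicates that this resistance lives in the model weights. The resistance does not generalize, however: on a sister coding-agent benchmark (SkillBench), the same weights fall to hand-crafted skill-injection at up to 100%. The author argues that the high ASR in the literature is largely attributable to RL-optimized injection text rather than to the attack categories, and that frontier safety hardening is domain-conditioned, specific to the heavily targeted browser surface. Reporting techniques without releasing the optimized strings, or extrapolating browser-domain safety to other CUA modalities, makes published ASR numbers unreproducible.
Link: https://arxiv.org/abs/2606.05233
Authors: Nicholas Saban
Affiliations: Patronus AI; University of California, Berkeley
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: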
Abstract:Recent computer-using-agent (CUA) red-teaming papers report prompt-injection attack success rates (ASR) of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper’s panel. We ask whether those techniques, reproduced as hand-crafted templates, still work against current frontier CUAs. We release CUA-HandCrafted, a public benchmark of 793 episodes spanning 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations. Against Claude Sonnet 4.6 and GPT-5.4 we measure 0/140 multi-step attack success (Clopper-Pearson 95% upper bound 2.60%); a prompt ablation shows this resistance lives in the model weights. Yet it does not generalize: on a sister coding-agent benchmark (SkillBench), the same weights fall to hand-crafted skill-injection at up to 100%. We argue that the literature’s high ASR is largely attributable to RL-optimized injection text rather than the attack categories, and that frontier safety hardening is domain-conditioned, specific to the heavily-targeted browser surface. Reporting techniques without releasing the optimized strings, or extrapolating browser-domain safety to other CUA modalities, makes published ASR numbers unreproducible.
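The reported 2.60% bound for 0 successes in 140 attempts can be checked by hand. When zero successes are observed, the exact (Clopper-Pearson) upper limit has a closed form, 1 - (alpha/2)^(1/n) for a two-sided interval. The sketch below assumes the two-sided convention; the abstract does not say which convention was used, but the two-sided value is the one that matches the quoted figure. The function name is made up for this example.

```python
def clopper_pearson_upper_zero_successes(n, confidence=0.95, two_sided=True):
    """Exact binomial upper bound on the success probability when 0 of n
    trials succeed. With x = 0 the Clopper-Pearson bound solves
    (1 - p)^n = tail, where tail = alpha/2 (two-sided) or alpha (one-sided).
    """
    alpha = 1.0 - confidence
    tail = alpha / 2.0 if two_sided else alpha
    return 1.0 - tail ** (1.0 / n)

upper_two_sided = clopper_pearson_upper_zero_successes(140)
upper_one_sided = clopper_pearson_upper_zero_successes(140, two_sided=False)

print(f"two-sided 95% upper bound: {upper_two_sided:.2%}")
print(f"one-sided 95% upper bound: {upper_one_sided:.2%}")
```

The two-sided bound comes out at about 2.60%, and the one-sided bound is tighter at roughly 2.1%. Either way, 140 clean episodes rule out attack success rates above a few percent but not rare successes.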
[NLP-125] Temporal Preference Concepts and their Functions in a Large Language Model
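Quick read: LLMs are increasingly deployed to make decisions that trade off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these tradeoffs. This paper causally localizes an underlying subgraph for temporal preference in a distilled LLM (Qwen3-4B-Instruct-2507), identifying mid-to-upper-layer nodes through converging evidence from gradient-based attribution and activation patching. The geometry of time horizon is found to be encoded in the residual stream at the localized layers. A behavioral analysis shows that unintervened LLMs discount the future several times less steeply than humans, but the preference is unstable across contexts, which motivates explicit control rather than implicit reliance on training. The authors also report suggestive evidence that steering vectors can shift temporal preference. Overall, the work shows how mechanistic interpretability can move toward reliable control over how LLMs plan and reason.
Link: https://arxiv.org/abs/2606.05194
Authors: Ian Rios-Sialer,Shantanu Darveshi,Shuai Jiang,Avigya Paudel,Anastasiia Pronina,Ipshita Bandyopadhyay,Justin Shenk
Affiliations: AISC (AI Safety Camp); SPAR (Supervised Program for Alignment Research)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: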
Abstract:Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these tradeoffs. In this work, we causally localize an underlying subgraph for temporal preference in a distilled LLM (Qwen3-4B-Instruct-2507), identifying mid-to-upper-layer nodes through converging evidence from gradient-based attribution and activation patching. We find that the geometry of time horizon is encoded in the residual stream at the expected localized layers. A behavioral analysis reveals that unintervened LLMs discount the future several times less steeply than humans, yet this preference is unstable across contexts, motivating explicit control rather than implicit reliance on training. Finally, we find suggestive evidence that steering vectors can shift temporal preference. Our work demonstrates how mechanistic interpretability can bring us closer to reliable control over how LLMs plan and reason.
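[NLP-126] Staged Factorial Screening for Budget-Constrained Micro-Pretraining
【Quick read】: This paper addresses how to triage candidate training recipes efficiently in budget-constrained micro-pretraining, where many recipes compete for a shared accelerator. The challenge is to identify, from a small number of experiments, the factors that most affect performance before committing larger search budgets. The proposed solution is a staged fractional-factorial workflow: short designed screens (runs of 2, 5, and 10 minutes) identify high-penalty directions such as total batch, depth, and width; repeated seeded runs confirm promising anchors; and local refinement then proceeds inside the reduced space. In bounded 12-hour and 24-hour three-anchor continuations on two hosts (Windows A100 and Linux L40S), the bridge anchor has the lowest sample mean, while the ordering of the non-bridge anchors stays host-sensitive. The 60-minute bridge package does not separate workflow refinement from the larger bridge model's capacity advantage. Random search can reach strong incumbents, but repeatedly in the same low-penalty region and without factor attribution. The takeaway is a bounded methods result: combine short designed screens, repeated confirmation, and local refinement. The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not a hardware-invariant ranking or general superiority over hyperparameter optimization.
Link: https://arxiv.org/abs/2606.05186
Authors: Felipe Chavarro Polania
Affiliations: Hewlett Packard Enterprise
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 23 pages, 4 figures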
Abstract:Budget-constrained micro-pretraining often requires triaging many candidate recipes on a shared accelerator before larger search budgets are spent. We study whether a staged fractional-factorial workflow can recover stable early effect structure in this setting. On a fixed autoresearch-derived single-GPU training loop, we run 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes; full 16-condition seeded reruns at 5 and 10 minutes; targeted seeded anchor checks; same-host greedy and matched-cost random baselines; a 60-minute bridge package; and bounded Windows A100 and Linux L40S anchor continuations through 24 hours. Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction, while E does not. Random search can reach strong incumbents in this 32-condition space, but repeatedly in the same low-penalty region and without factor attribution. The 60-minute bridge anchor has the lowest mean, although that package does not separate workflow refinement from the larger bridge model’s capacity advantage. In bounded 12-hour and 24-hour three-anchor continuations on both hosts, the bridge has the lowest sample mean while the non-bridge ordering stays host-sensitive. We therefore present a bounded methods result: use short designed screens to identify high-penalty directions, confirm promising anchors under repeated runs, and refine locally inside the reduced space. The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.
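The within-budget Benjamini-Hochberg correction mentioned in the abstract can be sketched as follows. This is a minimal illustration of the standard step-up procedure, not the authors' code; the p-values attached to factors A to E are made up for the example and are not results from the paper.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level q."""
    m = len(p_values)
    # Rank hypotheses by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for five main effects at a single budget (illustrative only).
p_by_factor = {"A": 0.004, "B": 0.011, "C": 0.020, "D": 0.001, "E": 0.300}
factors = list(p_by_factor)
decisions = benjamini_hochberg([p_by_factor[f] for f in factors], q=0.05)
retained = [f for f, keep in zip(factors, decisions) if keep]
print(retained)
```

Applying the correction separately per budget, as the abstract describes, would mean calling `benjamini_hochberg` once on the 5-minute estimates and once on the 10-minute estimates.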
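[NLP-127] Multi-Granularity Reasoning for Natural Language Inference
【Quick read】: This paper addresses a limitation of transformer-based pre-trained models on Natural Language Inference (NLI): most approaches rely mainly on final-layer token representations, and a single representation space struggles to capture the complex, hierarchical semantic interactions among fine-grained lexical cues, phrasal compositions, and higher-level contextual semantics. The proposed solution is a Multi-Granularity Reasoning Network (MGRN) that explicitly models hierarchical semantic features within an interactive reasoning space, mimicking the human progression from shallow lexical matching to deeper semantic abstraction and logical reasoning. MGRN integrates semantic information across granularities in a progressive, structured manner to uncover the semantic relationships underlying natural language expressions. The authors report that it consistently outperforms strong baselines on multiple public benchmarks.
Link: https://arxiv.org/abs/2606.05181
Authors: Chunling Xi,Di Liang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: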
Abstract:Natural Language Inference (NLI) is a fundamental task in natural language understanding that requires determining the logical relationship between a premise and a hypothesis. Despite the remarkable success of transformer-based pre-trained models, most existing approaches primarily rely on the final-layer token representations, which are often insufficient for capturing the complex and hierarchical semantic interactions required for effective reasoning. In particular, fine-grained lexical cues, phrasal compositions, and higher-level contextual semantics are typically entangled or diluted in a single representation space. To address these limitations, we propose a novel Multi-Granularity Reasoning Network (MGRN) that explicitly leverages hierarchical semantic features within an interactive reasoning space. The proposed framework mimics the human cognitive process of language understanding, which naturally progresses from shallow lexical matching to deeper semantic abstraction and logical reasoning. By integrating semantic information across multiple granularities in a progressive and structured manner, MGRN is able to uncover intricate semantic relationships underlying natural language expressions. Extensive experiments on multiple public benchmarks demonstrate that MGRN consistently outperforms strong baseline models, validating the effectiveness and robustness of the proposed approach.
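[NLP-128] From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment ACL2026
【Quick read】: This paper addresses the lack of interpretability in automated scoring models for rubric-based assessment of language performances: such models give little insight into why a score was produced. The proposed solution is a general framework for sentence-level interpretability that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs). Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework supports a systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness. Fine-tuned PLMs outperform LLMs in prediction accuracy but compress labels toward mid-scale scores. Deletion-based tests show that SHAP more reliably identifies the sentences that drive model predictions, typically producing larger and more coherent prediction shifts, and that its attributions transfer robustly across architectures, whereas LLM-generated rationales have limited and inconsistent influence. The study concludes that SHAP gives more faithful and transferable explanations, and that the framework offers a principled basis for evaluating scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.
Link: https://arxiv.org/abs/2606.05180
Authors: Ivo Bueno,Babette Bühler,Philipp Stark,Tim Fütterer,Ulrich Trautwein,Dorottya Demszky,Heather Hill,Enkelejda Kasneci
Affiliations: Technical University of Munich; Munich Center for Machine Learning; Lund University; University of Tübingen; Stanford Graduate School of Education; Harvard Graduate School of Education
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of ACL 2026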
Abstract:Automated scoring models are increasingly used to assign rubric-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced. We propose a general framework for sentence-level interpretability of rubric-based scoring that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs). Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness. Across 6k annotated transcript segments, fine-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid-scale scores. Deletion-based tests show that SHAP identifies sentences that reliably drive model predictions, producing typically larger and more coherent prediction shifts than LLM-generated rationales. Cross-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence. Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.
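To make sentence-level Shapley attribution and the deletion test concrete, here is a small sketch. It computes exact Shapley values by enumerating sentence subsets, which is only feasible for short segments; the SHAP library used in such studies relies on approximations instead. The scorer `toy_score`, its cue words, and the example segment are hypothetical stand-ins, not the paper's models or data.

```python
from itertools import combinations
from math import factorial


def shapley_values(sentences, score_fn):
    """Exact Shapley value of each sentence for score_fn(list_of_sentences)."""
    n = len(sentences)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [sentences[j] for j in subset]
                with_i = [sentences[j] for j in sorted(subset + (i,))]
                values[i] += weight * (score_fn(with_i) - score_fn(without_i))
    return values


def toy_score(sents):
    # Hypothetical rubric scorer: counts sentences containing feedback-like cue words.
    cues = ("why", "explain", "good")
    return float(sum(any(c in s.lower() for c in cues) for s in sents))


segment = ["Open your books.", "Can you explain why that works?", "Good thinking!"]
phi = shapley_values(segment, toy_score)

# Deletion test: drop the top-attributed sentence and measure the score shift.
top = max(range(len(segment)), key=lambda i: phi[i])
shift = toy_score(segment) - toy_score([s for i, s in enumerate(segment) if i != top])
print(phi, shift)
```

A larger `shift` after deleting the top-attributed sentence indicates a more faithful attribution, which is the intuition behind the deletion-based comparison described in the abstract.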
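[NLP-129] Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems IJCNN
【Quick read】: This paper addresses punctuation restoration for streaming automatic speech recognition (ASR), where decisions must be made online with limited future context and generation-based approaches are prone to latency and alignment failures under boundary-wise evaluation. The proposed solution is a non-autoregressive scoring method: instead of free-form generation, it preserves the input transcript and, at each word boundary, compares punctuation-insertion hypotheses against a no-insertion baseline under a bounded K-subword-token lookahead, calibrating decisions with a weight α and a validation-calibrated threshold τ (no parameter updates during inference). On IWSLT 2017, the method reaches a 4-class macro F1 of 0.893 without fine-tuning (K=2) and 0.937 after fine-tuning (K=2), outperforming a prompt-based baseline (0.566) and a fine-tuned ELECTRA baseline (0.913) under the same lookahead budget. Ablations over K analyze the effect of the lookahead budget.
Link: https://arxiv.org/abs/2606.05179
Authors: Sungmook Woo,Hyungu Kang,Chanwoo Kim
Affiliations: Korea University
Categories: Computation and Language (cs.CL)
Comments: Accepted for presentation at The International Joint Conference on Neural Networks (IJCNN) 2026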
Abstract:Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However, streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which makes generation-based approaches prone to latency and alignment failures under boundary-wise evaluation. This paper proposes a non-autoregressive scoring method (no free-form generation) that preserves the input transcript and makes a decision at each word boundary. Our method compares punctuation insertion hypotheses against a no-insertion baseline under a bounded K-subword-token lookahead, and calibrates decisions using a weight $\alpha$ and a validation-calibrated threshold $\tau$ (no parameter updates during inference). On IWSLT 2017, our scoring method achieves a 4-class macro F1 of 0.893 in the no fine-tuning setting (validation-calibrated, K=2) and 0.937 after fine-tuning (K=2), outperforming the prompt-based baseline (0.566) and a fine-tuned ELECTRA baseline (0.913) under the same lookahead budget. We analyze the impact of the lookahead budget through ablation studies on K.
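[NLP-130] MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models
【Quick read】: This paper addresses a gap in multimodal safety evaluation: existing benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (Omni LLMs) that process vision, audio, and text together. The proposed solution is MCBench, a benchmark of 1196 multimodal scenarios spanning four safety categories, in which each unsafe scenario is paired with a minimally different safe counterpart to test model sensitivity. Evaluations of state-of-the-art models show that Omni LLMs struggle with subtle or non-physical risks but do better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that models can extract modality-specific information yet often fail to integrate it for safety judgments. The findings indicate that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings and point to the need for improved architectures and training strategies.
Link: https://arxiv.org/abs/2606.05177
Authors: Manh Luong,Tamas Abraham,Junae Kim,Amar Kaur,Rollin Omari,Gholamreza Haffari,Trang Vu,Lizhen Qu,Dinh Phung
Affiliations: Monash University; Defence Science and Technology Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: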
Abstract:Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.
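[NLP-131] PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis
【Quick read】: This paper addresses two obstacles to using large language models (LLMs) for telecommunications customer support: limited evaluation and adaptation to domain-specific constraints, and data sovereignty, regulatory, and sensitive-information requirements that complicate the use of externally hosted foundation models. The solution applies parameter-efficient fine-tuning (PEFT) with Low-Rank Adaptation (LoRA) to Qwen2.5-3B, together with a combinatorial synthetic data generation approach based on a glossary of 52 industry-specific terms; a generation pipeline powered by Gemini 2.0 Flash produces about 30,000 training examples across 1,560 distinct problem scenarios. The study compares 16 LoRA configurations, examines the effect of target module selection, and finds a clear divergence between validation loss and LLM-as-a-judge evaluation: the configuration with the lowest loss ranks only 6th-7th qualitatively, while the one with the highest loss ranks first. A single quantitative metric is therefore insufficient for choosing a fine-tuning configuration. The paper also provides an energy-performance trade-off analysis for sustainable LLM deployment.
Link: https://arxiv.org/abs/2606.05176
Authors: Lucas Tamic,Ilan Jaffeux-Cheniout,Xavier Marjou
Affiliations: Orange
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: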
Abstract:While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In addition, data sovereignty, regulatory constraints, and the handling of sensitive customer and network information complicate the use of externally hosted foundation models in this domain. We present a systematic study of parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) applied to Qwen2.5-3B to build a domain-specific conversational assistant. We introduce a combinatorial synthetic data generation approach based on a glossary of 52 industry-specific terms, producing approximately 30,000 training examples across 1,560 distinct problem scenarios via a generative pipeline powered by Gemini 2.0 Flash. We evaluate 16 LoRA configurations by varying hyperparameters and target modules. Our evaluation extends beyond standard metrics by incorporating energy consumption analysis and qualitative assessment using an LLM-as-a-judge framework with GPT-5.2 and Claude 4.5 Sonnet. Results show a clear divergence between quantitative and qualitative performance: models achieving the lowest validation loss do not necessarily obtain the best human-aligned rankings. The best validation loss (0.5024) ranks only 6th-7th in qualitative evaluation, while the worst loss (0.6807) ranks first according to both judges. This work contributes (1) a combinatorial method for synthetic dataset construction, (2) insights into the impact of target module selection for LoRA injection, (3) evidence that validation loss alone is insufficient for selecting fine-tuning configurations in conversational AI, and (4) an energy-performance trade-off analysis for sustainable LLM deployment. 
[NLP-132] Generic Triple-Latent Compression with Gated Associative Retrieval
【Quick read】: This paper studies how to capture higher-order token interactions in sequence models without benchmark-specific parsing. The proposed approach is a generic triple-latent sequence model that maintains a running token state and a compressed pair-memory pathway. The triple-latent family improves on a small Transformer baseline on byte-level WikiText-2 and on a tokenizer-based MiniMind language-modeling benchmark. A recall-focused gated key-value retrieval extension improves associative recall but remains seed-sensitive and much slower in the current reference implementation.
Link: https://arxiv.org/abs/2606.05175
Authors: Liu Xiao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:We study generic triple-latent sequence models that maintain a running token state and compressed pair-memory pathway to capture higher-order token interactions without benchmark-specific parsing. The triple-latent family improves a small Transformer baseline on byte-level WikiText-2 and on a tokenizer-based MiniMind language-model benchmark, while a recall-focused gated key-value retrieval extension improves associative recall but remains seed-sensitive and much slower in the current reference implementation.
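[NLP-133] Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO
【Quick read】: This paper addresses the difficulty of deploying general-purpose large language models (LLMs) in real-world healthcare settings, where data privacy constraints, inference costs, and limited suitability for edge or on-device use all get in the way. These constraints motivate smaller, more efficient models backed by robust post-training for reliable medical reasoning. The proposed solution is a Variance-Aware Reward Framework for rubric-based supervision: it replaces weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes. This gives richer optimization signals for feedback that is sparse, multi-criteria, and hard to verify automatically, and makes on-policy reinforcement learning more stable. On a held-out heart-related subset of HealthBench, the best Group Relative Policy Optimization (GRPO) variant raises accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 relative to the Qwen3-14B base model, remaining competitive with GPT-OSS-120B (0.508 accuracy, 0.674 F1). The authors suggest the approach may extend to other rubric-based tasks.
Link: https://arxiv.org/abs/2606.05174
Authors: Arash Ahmadi,Parisa Masnadi,Sarah Sharif,Charles Nicholson,David Ebert,Mike Banad
Affiliations: University of Oklahoma; University of Arizona
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 Pages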
Abstract:Large Language Models (LLMs) have shown strong promise in healthcare applications. Yet deploying general-purpose models in real-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on-device use. These challenges motivate the development of smaller, more efficient models that require robust post-training strategies to ensure reliable medical reasoning. In this work, we investigate Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering with rubric-based supervision derived from RaR-Medicine. We propose a Variance-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes. This formulation provides richer optimization signals for feedback that is sparse, multi-criteria, and difficult to verify automatically, and enables more stable on-policy reinforcement learning. On a held-out heart-related subset of HealthBench, our best GRPO variant improves accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 relative to the Qwen3-14B base model, while remaining competitive with GPT-OSS-120B (0.508 accuracy, 0.674 F1). Our findings show that carefully designed rubric-based rewards provide a practical strategy for improving heart-focused medical question answering in LLMs, with potential to extend to other rubric-based tasks.
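As a rough illustration of how rubric outcomes can feed GRPO, the sketch below turns criterion-level scores into a continuous reward and then standardizes rewards within a group of sampled answers. The weighted-mean `rubric_reward` is a simple placeholder and is not the paper's variance-aware reward function, whose exact form the abstract does not give; the weights, the scores, and the choice of population standard deviation are also assumptions.

```python
from statistics import mean, pstdev


def rubric_reward(criterion_scores, weights):
    """Placeholder continuous reward: weighted mean of per-criterion scores in [0, 1]."""
    return sum(w * s for w, s in zip(weights, criterion_scores)) / sum(weights)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward standardized against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric with three criteria; four sampled answers to the same question.
weights = [2.0, 1.0, 1.0]
group_scores = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
]
rewards = [rubric_reward(scores, weights) for scores in group_scores]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are computed relative to the group, a continuous reward with more distinct values gives the policy more ways to tell answers apart than a binary pass/fail aggregate, which is the kind of richer signal the abstract argues for.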
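[NLP-134] Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning
【Quick read】: This paper addresses a weakness of masked language modelling (MLM) as a pre-training objective for text encoders: it encourages representations anchored to surface-form token identity rather than deeper semantic structure. The proposed solution is a hybrid pre-training objective that combines a latent-space prediction loss in the style of Joint Embedding Predictive Architectures (JEPA) with a standard MLM objective over a single shared encoder, with a learnable scalar balancing the two during training. The hybrid encoder produces more uniform embeddings (uniformity below -0.16 versus -0.05 for MLM), richer spectral geometry under max pooling, less surface-level lexical information, and a better semantic-to-lexical balance. Linear-probe downstream accuracy is similar, but the geometric differences are consistent and significant, suggesting that the JEPA-style predictive objective reshapes the latent space in ways that standard accuracy metrics alone do not capture.
Link: https://arxiv.org/abs/2606.05173
Authors: Aimen Boukhari
Affiliations: École Nationale Supérieure d’Informatique (ESI), Algiers, Algeria
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 10 figures, 11 tables. Preprint. Code available at: this https URL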
Abstract:Masked language modelling (MLM) has been the dominant pre-training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface-form token identity rather than deeper semantic structure. Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre-train both a hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100). Extensive representation analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity less than -0.16 vs -0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance. Despite similar linear-probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.
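The uniformity figures quoted above are easier to read with a definition in hand. The sketch below assumes the widely used uniformity loss of Wang and Isola (log of the mean Gaussian potential between L2-normalized embedding pairs, with t = 2); the abstract does not say which definition the paper uses, so treat this as one plausible reading. Lower (more negative) values mean embeddings are spread more evenly on the unit sphere. The toy vectors are invented for illustration.

```python
import math
from itertools import combinations


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def uniformity(embeddings, t=2.0):
    """log of the mean exp(-t * squared distance) over all pairs of unit-normalized vectors."""
    unit = [l2_normalize(v) for v in embeddings]
    terms = []
    for a, b in combinations(unit, 2):
        sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
        terms.append(math.exp(-t * sq_dist))
    return math.log(sum(terms) / len(terms))


# Nearly collapsed embeddings versus embeddings spread around the circle.
collapsed = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
print(uniformity(collapsed), uniformity(spread))
```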
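[NLP-135] Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics
【Quick read】: This paper addresses model collapse caused by training on synthetic data. Existing analyses treat collapse as single-chain degradation, whereas the real AI ecosystem involves cross-contamination: models ingest synthetic text from other models, produce new synthetic text, and contaminate shared corpora. The authors propose a bilayer coupled SIR/SIRS framework, a phenomenological mean-field model that treats data corpora and AI models as two interacting populations, each with susceptible, infected, and recovered compartments linked by cross-layer transmission. The SIRS variant (the primary recommendation) adds immunity waning, reflecting that filtered corpora and retrained models remain susceptible to re-contamination. The basic reproduction number, derived via the Next Generation Matrix, is $R_0 = \sqrt{\beta_D \beta_M / [(\gamma_D+\mu_D)(\gamma_M+\mu_M)]}$, and standard epidemic threshold results are applied to the bilayer system. Illustrative scenario-based calibration from public AI text prevalence data yields supercritical dynamics ($R_0 > 1$) in all three scenarios, and Sobol sensitivity analysis identifies synthetic-text detection as the highest-leverage parameter. A bipartite-network agent-based model confirms mean-field consistency ($R^2 > 0.96$) for dense networks but degrades under heterogeneity. GPT-2 contamination chain experiments (192 runs across WikiText and Shakespeare) show dose-response degradation and diversity loss qualitatively consistent with the threshold picture. Matched-budget source-diversity experiments (1,088 runs) give suggestive evidence that multi-source mixing modestly attenuates collapse, though the effect vanishes at lower contamination fractions. Intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.
Link: https://arxiv.org/abs/2606.05168
Authors: Xiangyu Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 15 figures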
Abstract:Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves cross-contamination: models ingest synthetic data from other models, produce new synthetic text, and contaminate shared corpora. We propose a bilayer coupled SIR/SIRS framework – a phenomenological mean-field model treating data corpora and AI models as two interacting populations, each with susceptible, infected, and recovered compartments linked by cross-layer transmission. The SIRS variant (our primary recommendation) incorporates immunity waning, reflecting that filtered corpora and retrained models remain susceptible to re-contamination. We derive the basic reproduction number $R_0 = \sqrt{\beta_D \beta_M / [(\gamma_D+\mu_D)(\gamma_M+\mu_M)]}$ via the Next Generation Matrix and apply standard epidemic threshold results to the bilayer system. Illustrative scenario-based calibration from public AI text prevalence data yields supercritical dynamics ($R_0 > 1$) across three scenarios; Sobol sensitivity analysis identifies synthetic-text detection as the highest-leverage parameter. A bipartite-network agent-based model confirms mean-field consistency ($R^2 > 0.96$) for dense networks but degrades under heterogeneity. GPT-2 contamination chain experiments (192 runs across WikiText and Shakespeare) show dose-response degradation and diversity loss qualitatively consistent with the threshold picture. Matched-budget source-diversity experiments (1,088 runs) provide suggestive evidence that multi-source mixing modestly attenuates collapse, but the effect vanishes at lower contamination fractions. Intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.
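The threshold quantity in this abstract can be evaluated with a few lines of code. The sketch reads the formula as a square root over the whole fraction, which is the reading that keeps R_0 dimensionless and matches the spectral radius of a 2x2 next-generation matrix with zero diagonal. The pairing of rates in the matrix entries and all parameter values are assumptions chosen for illustration, not the paper's calibration.

```python
import math


def bilayer_r0(beta_d, beta_m, gamma_d, gamma_m, mu_d, mu_m):
    """R_0 = sqrt(beta_D * beta_M / ((gamma_D + mu_D) * (gamma_M + mu_M)))."""
    return math.sqrt(beta_d * beta_m / ((gamma_d + mu_d) * (gamma_m + mu_m)))


def ngm_spectral_radius(beta_d, beta_m, gamma_d, gamma_m, mu_d, mu_m):
    """Spectral radius of K = [[0, a], [b, 0]], whose eigenvalues are +/- sqrt(a * b)."""
    a = beta_d / (gamma_m + mu_m)  # assumed cross-layer entry: models -> corpora
    b = beta_m / (gamma_d + mu_d)  # assumed cross-layer entry: corpora -> models
    return math.sqrt(a * b)


# Hypothetical rates (per unit time), chosen only to show a supercritical case.
params = dict(beta_d=0.3, beta_m=0.4, gamma_d=0.1, gamma_m=0.2, mu_d=0.05, mu_m=0.05)
r0 = bilayer_r0(**params)
print(r0, "supercritical" if r0 > 1 else "subcritical")
```

In this toy setting, raising `gamma_d` (faster removal of contaminated text, for example through better detection) lowers R_0, which is consistent with the abstract's emphasis on detection as a high-leverage parameter.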
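[NLP-136] USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding INTERSPEECH2026
【Quick read】: This paper addresses the limited coverage and evaluation of current multi-domain audio encoders, at a time when large language models (LLMs) increasingly rely on a single encoder for diverse inputs. Self-supervised learning (SSL) has produced strong domain-specific encoders such as speech or music experts, but multi-domain approaches like USAD and SPEAR remain limited, and recent studies suggest supervised encoders align better with audio LLMs. The proposed solution, USAD 2.0, is a universal encoder that integrates knowledge from both SSL and supervised foundation models. It introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, adds second-stage supervised distillation for downstream use, and scales to one billion parameters via depth scaling. Experiments show strong or state-of-the-art performance across probing and LLM-based evaluations.
Link: https://arxiv.org/abs/2606.06444
Authors: Heng-Jui Chang,Alexander H. Liu,Saurabhchand Bhati,Mrudula Athi,Anton Ratnarajah,Amit Chhetri,James Glass
Affiliations: MIT CSAIL; Amazon
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to Interspeech 2026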
Abstract:Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.
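[NLP-137] Revisiting Lexicon Evaluation in Unsupervised Word Discovery
【Quick read】: This paper asks whether current evaluations give a trustworthy indication of the quality of lexicons built from discovered word-like units in zero-resource speech processing. The common metric, normalized edit distance, is inherently biased toward the quality of large clusters and ignores how well true classes are distributed across clusters, so evaluation is neither fair nor complete. Drawing on established clustering theory, the authors propose two metrics: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Experiments on synthetic and real-world lexicons show that, combined, the two metrics correlate more closely with how similar a lexicon is to the ground-truth distribution and are more robust to biases that skew lexicon evaluations.
Link: https://arxiv.org/abs/2606.06183
Authors: Simon Malan,Danel Slabbert,Herman Kamper
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: 6 figures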
Abstract:Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.
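The large-cluster bias is easy to reproduce on a toy lexicon. The sketch below contrasts an average pooled over all within-cluster pairs (where a cluster's influence grows with its pair count) with a per-cluster average weighted by cluster size. Both aggregations are illustrative guesses at the kind of metric under discussion; the paper's exact definitions are not given in the abstract, and the phoneme sequences are invented.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def ned(a, b):
    """Edit distance normalized by the longer sequence length."""
    return edit_distance(a, b) / max(len(a), len(b))


def pair_pooled_ned(clusters):
    """Mean NED over all within-cluster pairs, pooled across clusters."""
    dists = [ned(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(dists) / len(dists)


def size_weighted_ned(clusters):
    """Per-cluster mean NED, averaged with weights equal to cluster size."""
    total, n_items = 0.0, 0
    for c in clusters:
        pairs = list(combinations(c, 2))
        if not pairs:
            continue  # singleton clusters have no pairs to compare
        total += len(c) * sum(ned(a, b) for a, b in pairs) / len(pairs)
        n_items += len(c)
    return total / n_items


# One large pure cluster and one small impure cluster (hypothetical phoneme tuples).
clusters = [
    [("k", "ae", "t")] * 4,
    [("d", "ao", "g"), ("b", "ao", "g")],
]
print(pair_pooled_ned(clusters), size_weighted_ned(clusters))
```

With six zero-distance pairs in the large cluster and a single noisy pair in the small one, the pair-pooled score looks better than the size-weighted one, which illustrates how pair counting lets large clusters dominate.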
Information Retrieval
[IR-0] A Vision-language Framework for Comparative Reasoning in Radiology
Link: https://arxiv.org/abs/2606.06407
Authors: Tengfei Zhang,Ziheng Zhao,Lisong Dai,Xiaoman Zhang,Pengcheng Qiu,Ya Zhang,Yanfeng Wang,Weidi Xie
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:
Abstract:Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision–language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.
[IR-1] OneReason Technical Report
Link: https://arxiv.org/abs/2606.06260
Authors: OneRec Team,Biao Yang,Boyang Ding,Chenglong Chu,Dunju Zang,Fei Pan,Han Li,Hao Jiang,Honghui Bao,Huanjie Wang,Jian Liang,Jiangxia Cao,Jiao Ou,Jiaxin Deng,Jinghao Zhang,Kun Gai,Lu Ren,Peiru Du,Pengfei Zheng,Rongzhou Zhang,Ruiming Tang,Shiyao Wang,Siyang Mao,Siyuan Lou,Teng Shi,Wei Yuan,Wenlong Xu,Xingchen Liu,Xingmei Wang,Xinqi Jin,Yan Sun,Yan Wang,Yifei Hu,Yingzhi He,Yufei Ye,Yuhao Wang,Yunhao Zhou,Yuqin Dai,Zhao Liu,Zhipeng Wei,Zhixin Ling,Ziming Li,Zixing Zhang,Ziyuan Liu,An Zhang,Changxin Lao,Chaoyi Ma,Chengru Song,Defu Lian,Fan Yang,Guowang Zhang,Hao Peng,Jiayao Shen,Jie Chen,Jun Xu,Junmin Chen,Kun Zhang,Kuo Cai,Mingxing Wen,Minmao Wang,Minxuan Lv,Qi Zhang,Qiang Luo,Sheng Yu,Shijie Li,Shijie Yi,Shuang Yang,Shugui Liu,Shuni Chen,Tinghai Zhang,Tingting Gao,Xiang Wang,Xiangyu Wu,Xiangyu Zhao,Xiao Lv,Xiaoyou Zhou,Xuming Wang,Yong Du,Zejian Zhang,Zhaojie Liu,Zhiyang Zhang,Zhuang Zhuang,Ziqi Wang,Ziyi Zhao
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress
Abstract:Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style "think before answer" paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user’s behavior sequence into coherent latent interest points. We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.
[IR-2] Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents
Link: https://arxiv.org/abs/2606.06242
Authors: AJ Carl P. Dy,Aivin V. Solatorio
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 23 pages, 8 figures
Abstract:Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset and the source code are linked from the arXiv abstract page.
[IR-3] Bridging the Semantic-Collaborative Gap: An Asymmetric Graph Architecture for Cold-Start Item Recommendation
Link: https://arxiv.org/abs/2606.06225
Authors: Anh Truong,John Trenkle,Yuanbo Chen,Honghong Zhao,Abdullah Alchihabi,Effy Fang,Michael Tamir
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Collaborative filtering and graph-based recommendation models are highly effective because they leverage observed user interactions, but this dependence creates a fundamental cold-start challenge when newly added content has no interaction history. In Tubi’s production retrieval system, this challenge is further constrained by the serving interface: new content must be assigned a standalone embedding immediately, and the model must also produce device embeddings suitable for approximate nearest-neighbor retrieval. We address this setting by formulating cold-start recommendation as an inductive graph-completion problem on a temporal bipartite device-content graph. We propose Shallow-RHS, an asymmetric link-prediction architecture in which the left-hand side (LHS) device tower leverages temporally valid watch-history message passing to capture collaborative signals, while the right-hand side (RHS) content tower is intentionally shallow with respect to the graph and encodes content solely from intrinsic features. The RHS tower does not use ID-based embeddings, content-side subgraphs, neighbor aggregation, or interaction-derived representations, forcing the content encoder to map intrinsic features into a collaborative-filtering-aware embedding space. After training, the learned content encoder generates embeddings for both warm and newly ingested content, enabling implicit graph completion through retrieval of warm surrogate neighbors. We further extend the same representation-completion principle to device cold-start by constructing cohort-based embeddings from demographic features. Large-scale online experiments demonstrate consistent relative improvements in content cold-start engagement, promotion speed, impression acquisition, and device cold-start engagement.
[IR-4] WebKnoGraph: GNN-Powered Internal Linking
Link: https://arxiv.org/abs/2606.06106
Authors: Emilija Gjorgjevska,Georgina Mirceva,Miroslav Mirchev
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Internal link optimization is a recurring task in search engine optimization, yet many production workflows rely on manual judgment, fixed page templates, or generic tool recommendations. Practitioners need ways to evaluate candidate links before deployment because link changes can redistribute authority and affect semantic coherence in ways that are difficult to isolate after release. We present WebKnoGraph, an open-source framework for evaluating internal linking strategies on website crawls. The framework models a website as a directed graph, represents pages by embeddings, scores candidate links with GraphSAGE, and evaluates interventions by embedding the site into larger host environments. We instantiate WebKnoGraph on a production crawl of a website (its URL is given on the arXiv abstract page) and compare automatic with expert-assisted link selection in an empirical FineWeb-based host graph and a synthetic Barabási-Albert host graph, using PageRank-based authority metrics and semantic coherence. The results show that automatic selection generally produces stronger authority redistribution, with higher Authority Yield, but also larger semantic coherence costs. Expert-assisted selection better preserves semantic coherence and, when targeting low-PageRank pages, achieves the highest Authority Yield, although with the least favorable loss-gain balance. Authority Volatility provides an additional stability perspective, but is interpreted cautiously because the two regimes use different numbers of intervention sets. These findings support a practical workflow in which candidate intervention sets are generated at scale, evaluated jointly across authority gain, volatility, loss-gain balance, and semantic coherence, and then reviewed for editorial deployability before implementation.
[IR-5] Edge-Aware Curvature Modeling for Graph Understanding in Large Language Models
Link: https://arxiv.org/abs/2606.06073
Authors: Zhenghong Lin,Zhibin Shi,Hongyang Dong,Xinjie Ye,Yuhong Chen,Shiping Wang
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Recently, graph-aware Large Language Models (LLMs) have shown promising capabilities in jointly modeling graph-structured data and textual information. Existing approaches typically employ a graph encoder and a frozen LLM to obtain node representations from graph and textual views, followed by node-level alignment to bridge the two modalities. However, such alignment mechanisms primarily focus on node information while overlooking edge-level structures, leading to suboptimal information propagation across views. In this work, we conduct a comprehensive theoretical analysis to uncover why node-level alignment is insufficient for aligning textual and graph representations. Specifically, we prove theoretically for the first time that neglecting edge information leads to suboptimal solutions and that negatively curved edges induce bottlenecked information flow, giving rise to the over-squashing phenomenon between graph and textual views. To address these two challenges, we propose CureLLM, a framework of Curvature-enhanced Graph Representations for Large Language Models, whose goal is to inject edge-information signals into existing LLMs. Specifically, CureLLM first introduces a training-free textual prompt mechanism that lets the LLM generate its output directly from an edge-aware prompt, without any learnable parameters. Furthermore, a novel curvature-aware graph representation learning scheme is designed to capture edge structure information for downstream tasks, where message passing between text and graph representations depends only on edges with positive curvature. Finally, we compare against 20 baseline methods on 11 real-world datasets from various domains, and the experimental results demonstrate the superiority of the proposed CureLLM framework.
[IR-6] Memory is Reconstructed Not Retrieved: Graph Memory for LLM Agents ICML2026
Link: https://arxiv.org/abs/2606.06036
Authors: Shuo Ji,Yibo Li,Bryan Hooi
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at ICML 2026
Abstract:Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memory-augmented agents rely on a static retrieve-then-reason paradigm, this rigid pipeline design prevents them from dynamically adapting memory access to intermediate evidence discovered during inference. To bridge this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue-Tag-Content graph, where associative tags serve as semantic bridges connecting fine-grained cues to memory contents. Operating on this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval is dynamically adapted to the reasoning context while avoiding combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo benchmark and LongMemEval benchmark demonstrate significant improvements over strong baselines (up to 23%), while substantially reducing token and runtime cost, highlighting the effectiveness of active and associative reconstruction for long-horizon memory reasoning.
[IR-7] To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection INTERSPEECH2026
Link: https://arxiv.org/abs/2606.05931
Authors: Erfan Loweimi,Mengjie Qian,Kate Knill,Guanfeng Wu,Chi-Ho Chan,Abbas Haider,Muhammad Awan,Josef Kittler,Hui Wang,Mark Gales
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: INTERSPEECH 2026
Abstract:When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
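The "64% of the gap" figure above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the gap is measured from the fixed-fusion baseline (90.0% P@1) up to the oracle (96.6%). The abstract does not name the baseline explicitly, but fixed fusion is the only reported system for which the numbers work out. The helper name `gap_recovered` is made up for illustration.

```python
def gap_recovered(system: float, baseline: float, oracle: float) -> float:
    """Share of the baseline-to-oracle gap that a system closes."""
    return (system - baseline) / (oracle - baseline)


# P@1 values quoted in the abstract; the choice of baseline is an assumption.
share = gap_recovered(system=94.2, baseline=90.0, oracle=96.6)
print(f"{share:.0%}")  # prints 64%
```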
[IR-8] Knowledge Manifold: A Riemannian Geometric Framework for Semantic Mapping and Geodesic Analysis of Scientific Literature
Link: https://arxiv.org/abs/2606.05907
Authors: Tomonaga Okabe,Kazuhiko Komatsu
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:We present the knowledge manifold: a Riemannian geometric space in which a corpus of documents is arranged according to semantic positional relationships derived from character n-gram TF-IDF representations. The framework proceeds in five tightly coupled stages. First, each document is converted to a character-level n-gram TF-IDF vector (4-7 grams, up to 250,000 features, L2-normalized) and embedded in a two-dimensional knowledge map via constrained stress minimization with repulsion, variance, and centering regularizers. Second, knowledge at an arbitrary query point is estimated through Smoothed Particle Hydrodynamics (SPH) interpolation using a cubic-spline kernel, yielding an interpolated TF-IDF feature vector that can be linguistically characterized. Third, directional knowledge gradients at 0, 45, and 90 degrees are computed from the SPH interpolation map, and pairwise directional similarity is quantified via inner product and cosine similarity. Fourth, a Gaussian Process Regression (GPR) model, with a Constant x RBF + White kernel fitted on a 10-dimensional SVD projection, provides a Bayesian posterior mean, uncertainty estimate, and per-document contribution rate at the query point. Fifth, geodesics in the knowledge space are obtained by minimizing a discrete Riemannian path energy derived from the SPH-induced metric tensor, using L-BFGS-B with seven deterministic initial-path candidates. We apply the formulation to a corpus of 20 papers in fiber-reinforced composite materials and aerospace structural mechanics, showing that the semantic map recovers meaningful research clusters, geodesic paths reveal natural conceptual bridges between distant topics, and SPH/GPR interpolation enables the generation of virtual knowledge: hypothetical paper abstracts describing unstudied but geometrically predicted research directions.
[IR-9] MolE-RAG: Molecular Structure-Enhanced Retrieval-Augmented Generation for Chemistry
Link: https://arxiv.org/abs/2606.05693
Authors: Joey Chan,Wonbin Kweon,Ashley Shin,Niharika Bhattacharjee,Pengcheng Jiang,Yue Guo,Jiawei Han
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:
Abstract:Large language models (LLMs) have shown promise for molecular property prediction, but their ability to reason over chemical structures remains limited, as molecular representations such as SMILES differ substantially from the natural language on which LLMs are primarily trained. To bridge this semantic and chemical knowledge gap, we propose MolE-RAG, a training-free, molecule-centric retrieval-augmented generation framework for LLM-based molecular property prediction. MolE-RAG augments each prediction with three complementary sources of inference-time context: retrieved chemistry literature, molecule-specific information including compound synonyms, identifiers, functional group annotations, and physicochemical descriptors, and structurally similar molecules retrieved from the training set. We evaluate MolE-RAG across nine molecular property prediction tasks using proprietary, chemistry-specialized, and open-source LLMs. Across general-purpose LLMs, MolE-RAG improves ROC-AUC by up to 28 percentage points on classification tasks and reduces regression RMSE by up to 67% relative to a SMILES-only baseline. We further find that the utility of each context source varies across models and tasks, with different models benefiting most from textual retrieval, molecular context, or structural retrieval. These results suggest that molecule-centric retrieval can improve LLM-based molecular property prediction without model fine-tuning while providing a flexible framework for integrating heterogeneous chemical knowledge at inference time.
[IR-10] Agent-Orchestrated Adaptive RAG: A Comparative Study on Structured and Multi-Hop Retrieval
Link: https://arxiv.org/abs/2606.05658
Authors: Anuj Maharjan,Devinder Kaur,Richard Molyet
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding their responses in external knowledge, but conventional pipelines rely on static, single-step retrieval that limits performance on complex queries. This paper presents an Agent-Orchestrated Adaptive RAG framework that introduces dynamic query decomposition, iterative retrieval, and a bounded self-reflective evaluation loop. We evaluate the system across two complementary datasets: a domain-specific DevOps knowledge base and the multi-hop reasoning benchmark MuSiQue. Using metrics that include overall score, citation accuracy, mean reciprocal rank, and topic coverage, we find that query decomposition yields consistent gains in the structured domain (overall score +0.04, MRR +0.17 on DevOps) but degrades ranking precision on the multi-hop benchmark, while the reflection mechanism improves citation accuracy at a substantial latency cost. These contrasting results show that agentic enhancements are not universally beneficial and must be applied selectively according to query and domain characteristics. Our findings argue for adaptive, cost-aware orchestration rather than uniformly aggressive reasoning pipelines.
[IR-11] ANCHOR: Agentic Noise Creation Framework for Human Simulation and Denoising Recommendation
Link: https://arxiv.org/abs/2606.05621
Authors: Xiangming Li,Hua Chu,Chengyu Feng,Jianan Li,Yangtao Zhou
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Distilling accurate user preferences from noisy implicit feedback remains a fundamental bottleneck in recommendation systems, highlighting the need for recommendation denoising. However, real-world data lack explicit noise annotations, forcing existing methods to rely on unsupervised side information or handcrafted heuristics. These approaches often incur high external costs, generalize poorly, or depend on unreliable priors, causing noise misidentification and corrupting true user preference representations. To address these limitations, we propose a paradigm-level reformulation of recommendation denoising. Instead of indirectly inferring noisy interactions through heuristics, our Creation-Recognition paradigm proactively creates labeled noisy interactions and trains a dedicated recognizer to identify them, transforming denoising from heuristic filtering into supervised learning. Based on this paradigm, we present ANCHOR, an agent-based framework inspired by recent LLM-as-User research. ANCHOR simulates user behaviors to generate realistic noise labels and enables supervised denoising through two stages: noise creation and noise recognition. In the noise creation stage, ANCHOR adopts a recommender-in-the-loop agentic architecture to synthesize both diverse out-of-preference noise and informative boundary-adjacent noise. For out-of-preference noise, it implements five extensible simulation mechanisms to approximate major sources of noisy implicit feedback. For boundary-adjacent noise, an adversarial boundary refinement mechanism generates ambiguous interactions that challenge the recognizer and target the decision boundary. In the noise recognition stage, ANCHOR leverages the generated labels to train a reusable parametric recognizer that integrates collaborative signals and semantic representations to detect noise patterns in real interaction data.
[IR-12] ColBERTSaR: Sparsified ColBERT Index via Product Quantization SIGIR2026
Link: https://arxiv.org/abs/2606.05568
Authors: Eugene Yang,Andrew Yates,Dawn Lawrie,James Mayfield,Saron Samuel,Rohan Jha
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 6 pages, 1 figure, accepted at SIGIR 2026 as a short paper
Abstract:While ColBERT is an effective neural retrieval architecture, it requires a heavy index structure to support candidate set retrieval based on approximated token embeddings, gathering and decompressing document token embeddings, and applying the MaxSim operation. Indexes in PLAID and similar ColBERT implementations require five to ten times the disk storage of the original raw text, which limits their scalability. Furthermore, prior work has identified that the gathering and decompression stages are the primary inefficiencies at query time. Limiting the number of document tokens that must be gathered by thresholding and score approximation does not eliminate the need for the entire index to support ad hoc queries. In this work, we propose an embedding quantization approach that turns a ColBERT index into a true inverted index. We show that, theoretically, ColBERT with embedding quantization is equivalent to learned-sparse retrieval except for the scoring mechanism. Empirically, we demonstrate that our index is 50-70% smaller than a one-bit PLAID index while retaining retrieval effectiveness.
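For readers unfamiliar with the MaxSim operation mentioned above, the following is a minimal sketch of standard ColBERT-style late-interaction scoring: each query token embedding takes its best dot-product match among the document token embeddings, and those maxima are summed. It illustrates only the scoring step, not the paper's quantized inverted index. The function names and toy vectors are illustrative assumptions.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the maximum similarity to any document token."""
    if not doc_tokens:
        return 0.0  # an empty document matches nothing
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5]]
print(maxsim_score(query, document))  # 1.0 + 0.5 = 1.5
```

Because every query token must be compared with the document's token embeddings, those embeddings have to be gathered and decompressed at query time, which is the cost the abstract identifies.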
[IR-13] PHKT: Personalized Dynamic Hypergraph-enhanced KAN-Transformer for Multi-behavior Sequential Recommendation
Link: https://arxiv.org/abs/2606.05537
Authors: Ruijie Du,Hao Chen,Xin Zhang,Dongjing Wang,Ze Zhang,Xudong Shen,Runze Wu,Dongjin Yu
Categories: Information Retrieval (cs.IR)
Comments: 14 pages, 6 figures, 6 tables
Abstract:In multi-behavior recommendation, auxiliary behaviors such as clicks, add-to-cart, and purchases can provide richer supervisory information for predicting target behaviors. Although existing graph and hypergraph methods are capable of modeling high-order relationships among users, items, and behaviors, they still have limitations in heterogeneous semantics, user-specific weighting, and sequence dependency modeling. While standard Transformers excel at sequence modeling, their shared feedforward mapping struggles to accommodate the differentiated requirements of heterogeneous latent patterns in multi-behavior scenarios. To address this, this paper proposes the Personalized Hypergraph-enhanced Kolmogorov-Arnold Network Transformer (PHKT). Specifically, we design a personalized dynamic hypergraph module that performs behavior-aware weighting of item similarities based on users’ historical behavior sequences to capture user-specific heterogeneous high-order relationships. Meanwhile, a Transformer is used as the temporal backbone to model the evolution of short- and long-term preferences, and KAN is introduced to replace the traditional MLP in the feedforward network to enhance fine-grained modeling capability for nonlinear responses to different latent patterns. Experiments on three real datasets, Tmall, RetailRocket, and IJCAI, show that PHKT consistently outperforms nine strong baseline models across multiple evaluation metrics, demonstrating its effectiveness in multi-behavior preference modeling and target behavior prediction.
[IR-14] Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison
Link: https://arxiv.org/abs/2606.05436
Authors: Alejandro Lozano,Keiko Ihara,Ping-Hao Yang,Carrie E. Robertson,Jennifer Stern,Allan Purdy,Hsiangkuo Yuan,Pengfei Zhang,Yulia Orlova,Olga Fermo,Jennifer Hranilovich,Fred Cohen,Todd J. Schwedt,Jenelle A. Jindal,Serena Yeung-Levy,Chia-Chun Chiang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.
[IR-15] Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference ACL2026
Link: https://arxiv.org/abs/2606.05308
Authors: Abhishek Divekar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Applications (stat.AP)
Comments: Accepted at ACL 2026 - GEM Workshop
Abstract:With PRECISE, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge’s error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.
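As background for the abstract above, here is a minimal sketch of the basic Prediction-Powered Inference estimator for a simple mean: average the LLM judge's scores on the large unlabeled set, then add a "rectifier", the mean human-minus-judge difference on the small labeled set. This is the textbook mean estimator, not the paper's PRECISE extension to per-query metrics such as Precision@K. The function name and toy data are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def ppi_mean(
    human_labels: Sequence[float],
    judge_on_labeled: Sequence[float],
    judge_on_unlabeled: Sequence[float],
) -> Tuple[float, float]:
    """Return (bias-corrected mean estimate, standard error)."""
    # Rectifier: how far the judge is off on items that humans also labeled.
    rectifier = [y - f for y, f in zip(human_labels, judge_on_labeled)]
    estimate = mean(judge_on_unlabeled) + mean(rectifier)
    std_err = sqrt(
        variance(judge_on_unlabeled) / len(judge_on_unlabeled)
        + variance(rectifier) / len(rectifier)
    )
    return estimate, std_err


# Hypothetical binary relevance labels: the judge over-predicts relevance.
human = [1, 0, 1, 1]
judge_labeled = [1, 1, 1, 1]
judge_unlabeled = [1, 1, 0, 1, 1, 0, 1, 1]
estimate, std_err = ppi_mean(human, judge_labeled, judge_unlabeled)
print(estimate)  # 0.75 from the judge, corrected by -0.25, gives 0.5
```

The correction term is what keeps the estimate unbiased even when the judge is systematically wrong. A more accurate judge makes the rectifier less variable and so tightens the standard error.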
[IR-16] Scaling Laws for Behavioral Foundation Models over User Event Sequences
Link: https://arxiv.org/abs/2606.05257
Authors: Rickard Brüel Gabrielsson
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:
Abstract:Foundation models are increasingly trained on sequences of user actions in recommendation, payments, fraud, and commerce, but these models still lack the kind of compute calibration that scaling laws provide for language models. We study a common two-part behavioral-model architecture: a feature-based event embedder maps each multi-modal item to a vector, and a decoder-only transformer predicts the next event from the resulting sequence. Across roughly 600 runs on real interaction data, spanning 10^15 to 10^19 training FLOPs, we jointly vary four deployment-relevant axes: the two-part parameter split, critical batch size, model/data allocation, and the number of sampled negatives used after freezing the embedder. A small embedder (s* ≈ 2% of parameters) is compute-optimal at every budget we test because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters. Compute-optimal training is data-heavy relative to text at low compute, but its D/N ratio moves toward the Chinchilla heuristic as compute increases. The sampled training objective and deployed ranking metrics disagree in ways that themselves scale: critical batch size, optimal negative count after freezing, and the agreement between loss and ranking quality all shift with compute and with the chosen evaluation metric. For negative sampling, larger budgets increasingly prefer more negatives; by 10^19 FLOPs the active constraint is candidate-axis memory rather than FLOPs. In behavioral foundation models, the evaluation metric is therefore part of the scaling law: changing it can change the compute-optimal recipe.
[IR-17] LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations
Link: https://arxiv.org/abs/2606.05182
Authors: Rahul Subramani
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer that proactively archives every conversation turn and restores relevant details after compaction via hybrid retrieval, requiring zero LLM calls and adding fewer than 25ms of latency per turn. On 94 real multi-turn conversations (1,894 ground-truth facts, human-validated at kappa=0.81), LANTERN-Rerank recovers 78.3% of verifiable facts lost to compaction, significantly outperforming a faithful reimplementation of MemGPT’s LLM-driven extraction and multi-query search pipeline (72.4%; Wilcoxon p < 0.0001, 95% CI [+3.1, +8.6] pp, d=0.43) at a fraction of the inference cost. Even without the reranker, base LANTERN matches or exceeds this LLM-driven baseline (p=0.005) using zero LLM calls. When four production LLMs answer fact-bearing questions using LANTERN-restored context, accuracy improves by 8.4 percentage points on average (Wilcoxon p < 0.05 for each model individually), demonstrating that the recovered context is useful across diverse model architectures. We release the full evaluation framework (paired significance tests, failure analysis, fact-type stratification, and compaction robustness analysis) to support reproducibility and future work.
[IR-18] Eliminating Out-of-Domain Recommendations in LLM-based Recommender Systems: A Unified View
Link: https://arxiv.org/abs/2505.03336
Authors: Hao Liao,Jiwei Zhang,Jianxun Lian,Wensheng Lu,Mingqi Wu,Shuo Wang,Yong Zhang,Yitian Huang,Mingyang Zhou,Rui Mao
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 20 pages
Abstract:Recommender systems based on Large Language Models (LLMs) are often plagued by hallucinations of out-of-domain (OOD) items. To address this, we propose RecLM, a unified framework that bridges the gap between retrieval and generation by instantiating three grounding paradigms under a single architecture: embedding-based retrieval, constrained generation over rewritten item titles, and discrete item-tokenizer generation. Using the same backbone LLM and prompts, we systematically compare these three views on public benchmarks. RecLM strictly eradicates OOD recommendations (OOD@10 = 0) across all variants, and the constrained generation variants RecLM-cgen and RecLM-token achieve overall state-of-the-art accuracy compared to both strong ID-based and LLM-based baselines. Our unified view provides a systematic basis for comparing three distinct paradigms to reduce item hallucinations, offering a practical framework to facilitate the application of LLMs to recommendation tasks. Source code is linked from the arXiv abstract page.
Human-Computer Interaction
[HC-0] Computational Modeling of Human Adaptation in Urban Infrastructure Management under Extreme Conditions: A Case Study of Subway Flood Scenarios
Link: https://arxiv.org/abs/2606.06429
Authors: Jinfeng Lou,Zijie Liang,Pengkun Liu,Yuxin Zhang,Cleotilde Gonzalez,Pingbo Tang
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Decision-making in urban infrastructure management during extreme events relies heavily on human operators, yet current computational support systems often fail to account for non-monotonic human adaptation and latent psychological biases like overconfidence and defensive overcorrection. This study addresses this gap by integrating Instance-Based Learning Theory (IBLT) into the domain of civil engineering computing. We establish a computational cognitive architecture that simulates operator decision processes through the mathematical mechanisms of memory retrieval and utility blending. This model functions as a computational baseline, representing boundedly rational adaptation driven by experiential priors, thus allowing for the algorithmic isolation of latent psychological biases from the baseline dynamics of memory-based learning. We demonstrated this framework using a human-in-the-loop microworld experiment simulating subway flood-induced track suspensions, where dispatchers must balance passenger safety against service efficiency. Analysis revealed a complex, non-linear human adaptation cycle consisting of four phases: acquisition, overconfidence, overcorrection, and recalibration. Specifically, the computational model exposed a significant divergence during the post-accident “overcorrection” phase: while human operators exhibited immediate, defensive risk overestimation, the model maintained a stable trajectory based on accumulated experience. This strategic divergence confirms that operational instability following failure is often attributable to acute psychological bias overriding stable memory-based adaptation, a pattern theoretically expected to recur across analogous high-stakes environments and validatable through multi-modal behavioral and sensor data from professional operators.
[HC-1] Warning Message Content Increases Help Seeking in a Large-Scale Dark Web CSAM Intervention
Link: https://arxiv.org/abs/2606.06417
Authors: Caoilte Ó Ciardha,Joel Scanlan,Tegan Insoll,Juha Nurmi,Nina Vaaranen-Valkonen
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Warning messages have been used to disrupt individuals seeking online child sexual abuse material (CSAM) and promote engagement with support services, yet large-scale field evidence on message content remains limited, particularly in high anonymity environments. This study reports a field experiment on this http URL, a Tor search engine, examining how warning message content influences behavior. Across a 140-day period, almost 20 million searches were observed, with over 3 million searches containing known CSAM-related terms that triggered a warning linking to an anonymous self-help program. Users were exposed to warning messages varying in thematic content and framing, or a neutral message. Across a randomized comparison, a campaign-wide analysis, and interrupted time series models, message content consistently influenced engagement with help resources. All active messages increased click-through rates to help resources relative to the neutral condition, with a harm-focused message producing the strongest effects. At the platform level, click-through rates increased from 8.73% before the intervention to 15.67% during the campaign. These findings highlight the importance of message content in shaping responses to warning interventions, supporting an approach in which messaging is refined and adapted to increase engagement with support resources.
[HC-2] A MATLAB Toolbox for Standardized Reading Speed Assessment: Implementing and Extending the Perrin Sentence Generator for English Corpora
Link: https://arxiv.org/abs/2606.06297
Authors: Daniel P. Spiegel,Romain Bachy
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:In the fields of vision science, cognitive psychology, and psycholinguistics, the accurate measurement of reading speed is frequently hampered by the limitations of static reading charts. Repeated testing often leads to memorization effects, while the requirement for oral recitation introduces speech-motor confounds that obscure true information processing speed. To address these methodological hurdles, this paper introduces an open-source MATLAB toolbox that adapts the sentence generation paradigm originally proposed by Perrin, Paillé, and Baccino (2014) for the English language. This system utilizes a semantic ontology and a “proto-truth” logic to autonomously generate thousands of unique, grammatically simple sentences with unambiguous truth values. Beyond the original scope of Maximum Reading Speed (MRS) measurement, this implementation introduces band-pass psycholinguistic filtering and specific logic to resolve semantic ambiguities unique to English. We present this complete software package as an open platform for the scientific community to validate and refine.
[HC-3] FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays
Link: https://arxiv.org/abs/2606.06271
Authors: Yijun Liu,Yifan Song,John Gallagher,Sarah Sterman,Tal August
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated from four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet instructors and models diverge on the specific sentences on which to provide feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings on most dimensions of quality, as rated by instructors, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align, diverge, and differ.
[HC-4] Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios
Link: https://arxiv.org/abs/2606.06177
Authors: Giuseppe Attanasio,Beatrice Savoldi,Daniel Chechelnitsky,Matteo Negri,Marine Carpuat,Maarten Sap,André F.T. Martins
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Code and data at this https URL
Abstract:Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users’ communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent – only around half of interactions are rated as usable – with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves – and how well.
[HC-5] Deterring Searches for Child Sexual Abuse Material on Google Search and Promoting Help-Seeking
Link: https://arxiv.org/abs/2606.06126
Authors: Rebecca Umbach,Griffin Hunt,John Buckley,Joel Scanlan,Caoilte Ó Ciardha,Ethel Quayle,Ainslie Heasman,Maximlian von Heyden,Elizabeth Letourneau,Donald Findlater,Tegan Insoll,Richard Wortley,Chad Steel,Abhishek Roy
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments:
Abstract:Google Search deploys a “Onebox” feature at the top of the results page when users conduct searches for Child Sexual Abuse Material. This study evaluates the impact of a strategic shift in this feature, comparing a revised intervention, focused on repercussions and therapeutic resources, to a previous iteration that focused on reporting. Using a difference-in-differences analysis of internal Google Search logs data, we found the new messaging resulted in a 3.8 percentage point reduction as compared to the status quo in subsequent CSAM-related queries within the same Search session. We found an average click through rate of 0.73% on any of the hyperlinked buttons to help-providing resources. Together, this research presents convergent evidence that a subset of individuals can be deterred from ongoing CSAM-seeking and redirected to therapeutic services.
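The difference-in-differences comparison mentioned in the abstract above can be sketched in its simplest two-group, two-period form. The function name and all four rates below are hypothetical values made up for illustration; they are not taken from the paper, which likely uses a regression-based estimator over session logs.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    # Change in the treated group minus change in the control group.
    # The control-group change stands in for what would have happened
    # to the treated group without the intervention.
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical shares of sessions with a follow-up query (illustrative only).
estimate = diff_in_diff(
    treated_pre=0.30, treated_post=0.25,
    control_pre=0.30, control_post=0.29,
)
print(f"{estimate * 100:.1f} percentage points")
```

The estimate is expressed in percentage points because it is a difference of rates, which is also how the abstract reports its effect.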
[HC-6] A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice
Link: https://arxiv.org/abs/2606.06081
Authors: Ranjan Mishra,Jakob Schoeffer
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Appropriate reliance on AI advice has become a central research theme in human-AI collaboration. Existing frameworks have focused exclusively on point predictions as AI advice. However, set-valued AI advice (e.g., discrete sets or continuous intervals) is increasingly being used to communicate uncertainty and improve human decision making. In this paper, we develop the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, spanning both classification and regression tasks. For classification, we first introduce the dimensions that are necessary for evaluating set-valued AI advice. We then define two metrics: correct reliance rate on AI and correct reliance rate on self, which jointly characterize appropriate reliance in this setting. For regression, we introduce quantity of AI reliance and quality of AI reliance, which respectively measure whether a decision maker utilized the AI advice and whether their reliance helped them get closer to the ground truth relative to their initial estimate. Through the application of our framework, we demonstrate how these metrics capture important nuances in human-AI collaboration that existing measures overlook.
[HC-7] Empathy on Demand: How Empathic AI Can Scale Emotional Support for Verbal Harassment
Link: https://arxiv.org/abs/2606.05995
Authors: Anouk Bergner,Philipp Winder,Christian Hildebrand
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Verbal harassment is a growing source of psychological stress for people around the world. It occurs both online and offline and relies on language to demean, threaten, or discredit its targets. Unlike other stressors such as loss or uncertainty, verbal harassment aims at silencing its targets by eroding their sense of being heard and weakening their perceived ability to respond. Many individuals lack access to adequate and timely support, however, when they experience such harassment. People increasingly turn to conversational artificial intelligence (AI) such as ChatGPT or dedicated AI companions for emotional support, raising questions about whether it can facilitate the same psychological benefits as actual human empathy. We focus on online contexts as a prevalent application of verbal harassment. We develop and test a psychological framework identifying three key linguistic signals of empathic listening (perspective-taking, emotional validation, and action orientation), that together restore a sense of feeling heard and enhance coping in the context of verbal harassment. We find that LLMs consistently produce language exhibiting stronger empathic-listening markers than human non-experts and trained mental health professionals, promoting more approach-oriented (vs. avoidance-oriented) coping strategies. A subsequent behavioral study shows that these linguistic signals boost recipients’ sense of feeling heard and increase their coping self-efficacy. These findings reveal how specific linguistic features create empathic connections between humans and advanced conversational AI and can enhance people’s psychological resilience. Our results highlight the potential for AI to serve as a scalable source of emotional support, especially when human support is unavailable or insufficient.
[HC-8] EEGDancer: Dynamic Emotion Latent Space Masked Modeling with Reinforcement Learning for EEG Continuous Emotion Prediction
Link: https://arxiv.org/abs/2606.05855
Authors: Zhihao Zhou,Weishan Ye,Li Zhang,Gan Huang,Zhen Liang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 51 pages, 9 figures, 13 tables
Abstract:Continuous electroencephalography (EEG) emotion prediction aims to model the temporal evolution of human emotional states from EEG signals. Unlike conventional discrete emotion recognition, continuous prediction requires capturing long-range temporal dependencies and coherent emotional dynamics. However, existing methods mainly rely on point-wise regression and directly model noisy high-dimensional EEG features, limiting their ability to characterize continuous emotional […]. To address these challenges, we propose EEGDancer, a dynamic emotional latent space learning framework for continuous EEG emotion prediction. The framework integrates vector-quantized representation learning, masked temporal modeling, and reinforcement learning-based trajectory optimization into a unified […]. […], a causal spatiotemporal Vector-Quantization Variational Autoencoder (VQ-VAE) is designed to learn structured emotional prototypes and construct a discrete-continuous emotional latent space from EEG signals. Based on the learned latent representations, a Transformer-based masked dynamic modeling strategy captures long-range emotional dependencies and temporal evolution patterns. Furthermore, continuous emotion prediction is formulated as a sequential decision-making problem, and a Soft Actor-Critic (SAC) framework is introduced to optimize emotional prediction trajectories at the sequence level instead of frame-wise local […]. […] experiments on the SEED, SEED-IV, and Long-Term Naturalistic Emotion datasets demonstrate that EEGDancer consistently outperforms existing machine learning and deep learning methods. Ablation studies further verify the effectiveness of the proposed latent space and reinforcement learning-based trajectory optimization for modeling continuous EEG emotional dynamics.
[HC-9] Architecting Strategic Influence: Operationalising the UXR Point of View Framework for Research Function Maturity
Link: https://arxiv.org/abs/2606.05826
Authors: Rohinin Singh,Renee Barsoum
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:This case study illustrates that the systematic application of the User Experience Research (UXR) Point of View (POV) framework serves as an effective operational scaffolding for a UXR function undergoing the critical transition from incubation to maturity. By assimilating structured ‘Offensive’ and ‘Defensive’ strategies, the presented Playbook equips UXR leaders with an adaptable toolkit to systematically navigate common institutional barriers, such as stakeholder bias, reactive tasking, and insight fragmentation. By pre-emptive and purposeful application of growth strategies, the likelihood of the research function establishing itself as a strategic partner capable of delivering evidence-based, actionable perspectives is significantly enhanced. The analysis demonstrates how this deliberate, Playbook-driven maturity strategy empowers research functions to move beyond tactical execution and directly shape long-term business strategy.
[HC-10] Coding with “Enemy”: Can Human Developers Detect AI Agent Sabotage?
Link: https://arxiv.org/abs/2606.05647
Authors: Jingheng Ye,Huiqi Zou,Simon Yu,Weiyan Shi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 34 pages, 30 figures, 3 tables
Abstract:AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover story, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.
[HC-11] Fix the Mind Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization ICML
Link: https://arxiv.org/abs/2606.05602
Authors: Ayano Hiranaka,Ya-Chuan Hsu,Stefanos Nikolaidis,Erdem Bıyık,Daniel Seita
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to International Conference on Machine Learning (ICML) 2026
Abstract:AI assistants in human-AI collaboration often correct suboptimal human actions through behavioral feedback (e.g., alerts or steering-wheel nudges in assistive driving). Such interventions can mitigate immediate errors, but long-term improvement requires addressing the underlying misconceptions that cause repeated mistakes. We introduce SENSEI, a framework that infers user misconceptions from interaction behavior and provides targeted, minimal yet sufficient suggestions to correct them. Our approach departs from action- or trajectory-level interventions by operating over a structured knowledge representation to localize and correct the sources of erroneous behavior. Across three long-horizon tasks with diverse misconceptions and corresponding behaviors, SENSEI demonstrates zero-shot compositional generalization, disentangling multiple overlapping misconceptions despite training only on single-misconception cases. A user study further shows that our method identifies real human misconceptions and provides effective guidance that improves long-horizon task performance, successfully correcting 90% of student misconceptions. Code and project page are available at this https URL.
[HC-12] Wave Focusing in Metamaterials: Tactile Displays Beyond the Diffraction Limit
Link: https://arxiv.org/abs/2606.05572
Authors: Gregory Reardon,Max Linnander,Dustin Goetz,Neeli Tummala,Yon Visell
Subjects: Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Applied Physics (physics.app-ph)
Comments:
Abstract:We address the challenge of engineering distributed haptic displays capable of reproducing multiple localized, independently addressable vibrations – representing virtual tactile pixels – at arbitrary locations on a surface. Our technique is based on the focusing of mechanical waves in a flexural plate using a sparse set of actuators. At tactile frequencies, wave diffraction prevents the formation of localized virtual tactile pixels at spatial scales relevant for multi-digit touch interactions. We overcome this limitation by augmenting the plate with a lattice of mechanical resonators, forming a locally resonant metamaterial plate. Coupling between the plate’s dynamic modes and those of the resonators alters the dispersion relation governing wave transmission, introducing a slow-wave branch that enables focusing beyond the diffraction limit imposed by the unmodified plate. We use numerical simulations to engineer the dispersion relation of the metamaterial system for high-resolution focusing at tactile frequencies. We then fabricate a metamaterial tactile display and experimentally demonstrate virtual pixels that are far more localized than those generated on an otherwise identical plate without resonators, resulting in a tenfold reduction in virtual-pixel area. In behavioral experiments, we show that this system can deliver perceptually localized single- and multi-point tactile feedback and moving tactile sources while maintaining independent control over temporal waveforms at multiple display locations. The methods reported here can enable high-resolution haptic displays for widespread applications using a small number of actuated degrees of freedom.
[HC-13] Individual Gain Collective Loss: Metacognitive Adaptation in AI-Assisted Creativity AAAI2026
Link: https://arxiv.org/abs/2606.05532
Authors: Anna Mikeda
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 6 pages. AAAI 2026 paper
Abstract:Recent studies reveal a paradox: AI enhances individual creative outputs while reducing collective diversity. Current explanations – cognitive offloading and over-reliance – identify symptoms but not mechanisms. We propose selective metacognitive adaptation: routine AI use redistributes rather than uniformly diminishes metacognitive effort. Some capacities are amplified (partner modeling, surface control), while others are systematically under-supported (originality evaluation, reflective integration). This redistribution explains both individual satisfaction and collective convergence. We present a taxonomy of six metacognitive capacities organized by temporal phase, characterize their tendencies under routine AI use, and show how individually rational adaptation produces emergent social costs. The framework generates specific predictions for researchers and design principles for practitioners seeking to preserve both individual creative satisfaction and collective creative diversity.
[HC-14] SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization
Link: https://arxiv.org/abs/2606.05525
Authors: Kuangshi Ai,Haichao Miao,Kaiyuan Tang,Shusen Liu,Chaoli Wang
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Recent advances in agentic visualization have enabled the translation of natural language into executable scientific visualization (SciVis) workflows. While general-purpose coding agents show strong capabilities, they often lack the tool-specific expertise required for SciVis tasks. In this work, we present SciVisAgentSkills, a collection of reusable agent skills that augment coding agents for scientific data analysis and visualization by encoding environment assumptions, tool usage patterns, and domain heuristics across scientific tools such as ParaView, napari, VMD, and TTK. We evaluate these skills on Codex and Claude Code using SciVisAgentBench, a benchmark of 108 expert-designed multi-step tasks. Results show that agent skills improve mean task scores across the evaluated suites, with token-efficiency benefits that depend on the agent harness and tool setting. These findings highlight the importance of structured procedural knowledge for enabling reliable, long-horizon SciVis workflows, while also showing that skills should be studied alongside the execution harness that loads and applies them. The skills are available at this https URL.
[HC-15] The Role of Instructional Guidance in Generative AI-Assisted Learning: Empirical Evidence from Construction Engineering Education
Link: https://arxiv.org/abs/2606.05509
Authors: Xiaoyu Hou,Bo Xiao,Hexu Liu,Shane Mueller
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative artificial intelligence (AI) is increasingly used to support self-directed learning, yet student interaction with such systems often remains unstructured, limiting engagement in deeper cognitive processes. This study examines how instructional guidance shapes student and AI interaction in construction education. A five-step prompting framework grounded in Generative Learning Theory (GLT) is introduced to guide learner interaction during review activities. A controlled experiment compares three learning conditions: slide-based learning, unprompted AI-supported learning, and prompted AI-supported learning. Learning performance is assessed using multiple-choice and open-ended tasks, and user experience is measured using the User Experience Questionnaire (UEQ). Performance differences are concentrated on tasks requiring explanation and reasoning. The prompted condition achieves higher open-ended scores, with an improvement of approximately 2 or 3 points on a scale of 18 (p < 0.01), while no significant differences are observed in multiple-choice performance. The unprompted condition remains comparable to slide-based learning. These findings indicate that the effectiveness of AI-supported learning depends on how interaction is structured. The proposed framework provides a basis for integrating learning science principles into generative AI systems for construction education.
[HC-16] When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories
Link: https://arxiv.org/abs/2606.05414
Authors: Avinash Baidya,Xinran Liang,Ruocheng Guo,Xiang Gao,Kamalika Das
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 9 pages, 14 figures, and appendix
Abstract:Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success/failure label while alerts must be raised from partial interactions. Prior early-classification methods often bridge this gap by assigning the terminal label to every prefix, treating every turn as failure evidence. We hypothesize that this prefix-label assumption is poorly matched to multi-turn language interactions, where evidence of eventual failure is sparse and often delayed. In this paper, we introduce a two-stage approach that learns from this sparse evidence structure and uses the resulting risk estimates for controllable early alerting. Specifically, our attention-based failure predictor learns sparse turn-level failure evidence from trajectory labels and uses it to estimate failure risk from partial histories. We then pair this predictor with α-STOP, a single preference-conditioned stopping policy that selects an accuracy-earliness operating point at inference time rather than training a separate trigger for each preference. Across five benchmarks spanning customer support, task-oriented dialog, persuasion, tool use, and planning, we first show that high-relevance failure evidence occupies only 4.7-11.3% of turns and first appears after 59.0-83.6% of trajectories on average. We further show that the attention-based predictor improves Pareto-frontier quality (hypervolume) by 1-10% over naive prefix supervision, and that the full system improves frontier quality by 3-42% over state-of-the-art trigger policies while reducing training cost per operating point by 1-3 orders of magnitude.
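The naive prefix supervision that the abstract above argues against can be sketched in a few lines. This shows only the baseline labeling scheme, not the paper's attention-based predictor or stopping policy; the function name and the toy dialog are hypothetical.

```python
def naive_prefix_examples(turns, trajectory_failed):
    # Copy the single trajectory-level label onto every prefix, so each
    # partial history is treated as if it already showed the final outcome.
    return [(turns[:k], trajectory_failed) for k in range(1, len(turns) + 1)]

# Hypothetical customer-support dialog that ends in failure.
dialog = ["hi", "my order is late", "please wait", "this is useless, goodbye"]
examples = naive_prefix_examples(dialog, trajectory_failed=True)
for prefix, label in examples:
    print(len(prefix), label)
```

Every prefix, including the opening greeting, is marked as a failure here, which is the mismatch the abstract points to when failure evidence appears only late in a trajectory.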
[HC-17] A Motivational Architecture for Conversational AGI
Link: https://arxiv.org/abs/2606.05411
Authors: Anna Mikeda,Ben Goertzel
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 16 pages. Accepted for AGI-26 proceedings
Abstract:Motivational architectures in cognitive AI have largely been designed for physical agents regulating bodily needs. Conversational agents operate in a different regime: their sensorimotor loop is linguistic, their environment is a user’s evolving mental state, and their consequential actions are speech acts, tool invocations, and strategic silences. This paper proposes a conversational reinterpretation of the OpenPsi motivational lineage, coupled to MetaMo’s higher-level motivational scaffold, for agents built on a modular execution substrate. Homeostasis is recast in dialogue-native terms: the agent regulates competence, uncertainty reduction, affiliation, affinity, legitimacy, nurturing, and aesthetic coherence rather than bodily deficits. We propose three contributions: a ten-stage motivational processing pipeline that architecturally separates cognitive modulation from situational appraisal; a dual decision strategy blending urgency-driven fast response with deliberative multi-goal optimization; and an architecturally useful distinction between pre-action feelings and post-action emotions as functionally different forms of affect. We specialize the framework to two example agents – CompanionAgent and ResearchAgent – and sketch its extension to social robotics and domain-generic human-level AGI.
[HC-18] A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing
Link: https://arxiv.org/abs/2606.05330
Authors: Jared Moore,Noah Goodman,Nick Haber,Max Kleiman-Weiner
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Large language models can shift human beliefs across high-stakes domains, but most persuasion studies rely on pre/post belief change. These endpoint measures identify whether persuasion occurred, yet miss where and how beliefs moved within a dialogue. We present PERSUASIONTRACE, a framework for studying persuasion in human-LLM interaction. Built on a web-based experimental platform, PERSUASIONTRACE contributes a tool for multi-turn persuasion studies and a process-level evaluation protocol: it records multi-turn belief reports from human or simulated targets of persuasion, annotates persuader turns with rhetorical dimensions (logos/pathos/ethos), and evaluates simulators by fidelity to real human belief dynamics. Using this framework, we find that human targets group into two clusters of multi-turn belief updates and exhibit susceptibility to rhetorical strategies, and that LLMs are persuasive across generic and personalized topics, text and audio modalities, and multi-turn interactions. Prior work has chiefly used vanilla-prompted LLMs to simulate human targets, but we show that these simulators fail to replicate human belief dynamics. We introduce a Bayesian-network simulated target that maintains an explicit latent belief state over time so each persuader message yields cognitively realistic belief updates. In human-likeness evaluation, our Bayesian target scores near a human reference (81 vs 80), while baseline LLM targets score substantially lower (64). PERSUASIONTRACE reframes persuasion evaluation from endpoint movement alone to process fidelity, providing a stronger basis for scientific analysis and safer optimization of persuasive systems.
[HC-19] Where's the Structure? A Systematic Literature Review of Empirical Research on Human-AI Collaboration and Hybrid Intelligence for Learning
Link: https://arxiv.org/abs/2606.05222
Authors: Luis P. Prieto,Juan I. Asensio-Pérez,María Jesús Rodríguez-Triana,Mohamed Saban,Yannis Dimitriadis
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 59 pages, 4 figures, submitted to a journal
Abstract:Artificial intelligence (AI) has been applied across educational contexts to support learning. One approach to such support is “human-AI collaboration” (also termed “hybrid intelligence”), where human(s) and AI components interact to promote human learning. However, as in human-to-human computer-supported collaborative learning (CSCL), unstructured interaction does not necessarily produce an effective learning experience. This paper reports a systematic literature review of empirical studies (N=62) on human-AI collaboration and hybrid intelligence for learning support. The review characterizes collaboration processes, their structures, and contexts of application. It also extracts emerging design knowledge and research gaps. Researchers and technology designers can use these findings as a starting point for structuring more effective AI-enhanced technologies for collaboration, in educational practice and future research.
[HC-20] The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models
Link: https://arxiv.org/abs/2606.05183
Authors: Patrick Keough
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 16 pages, 9 figures
Abstract:Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under three guardrail conditions (Control, Simple, Protocol), yielding 8,830 graded responses. Using a 0-4 Likert scale validated against a human annotator triad (Fleiss kappa = 0.71; Cohen kappa = 0.78 vs AI consensus; 95.9 percent binary accuracy, 100 percent specificity), we quantify sycophancy as continuous rather than binary. Three findings emerge. First, 27.2 percent of responses contain substantial sycophantic content (Likert ≥ 2.0) and 22.7 percent reach moderate or severe levels (≥ 3.0), while binary win-rate framing reports only modest failure rates; coarse metrics explain just 29 percent of graded variance. Second, generational progress is non-monotonic: Gen 2.5 regresses sharply (mean Control 2.64) relative to Gen 2.0 (1.90) and Gen 3.0 (2.01), and Gen 2.5 shows inverse scaling (Pro 1.94 worse than Flash 1.71) while Gen 3.0 restores standard scaling. Third, we document an Alignment Tax: Spearman rho = -0.63 between sycophancy and truthfulness, indicating social compliance trades against factual accuracy. Egotistical Validation prompts act as a sycophancy trap (mean 3.27), nearly double Unethical Proposals (1.72). Simple guardrails outperform elaborate Protocol scaffolding on flagship models, but distilled Gen 3.0 Flash inverts this, suggesting small models may structurally require chain-of-thought scaffolding. We release the dataset and rubric to support continuous sycophancy measurement.
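The Spearman rho cited in the abstract above is a rank correlation, and a minimal pure-Python sketch shows how such a statistic is computed. The function names and the toy scores are assumptions for illustration; the scores are invented and are not data from the paper.

```python
def average_ranks(values):
    # Assign 1-based ranks, giving tied values the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Pearson correlation computed on the ranks of x and y.
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented per-response scores: sycophancy (0-4) and a truthfulness rating.
sycophancy = [0, 1, 2, 3, 4]
truthfulness = [5, 4, 3, 2, 1]
print(spearman_rho(sycophancy, truthfulness))
```

A negative value means higher sycophancy ranks go with lower truthfulness ranks, which is the direction of the trade-off the abstract reports.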
[HC-21] The Virtual Roundtable: Multi-Agent Personas Simulating the Dynamics of Human Brainstorming
Link: https://arxiv.org/abs/2606.05178
Authors: Tim Dorn, Saara A. Khan, Julie Mumford
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures, 2 tables
Abstract:As AI-driven product development accelerates, the bottleneck is shifting from how we build to what we build. Traditional human brainstorming faces challenges including groupthink, echo chambers, and limited diversity. To address this, we present a multi-agentic architecture that simulates roundtable brainstorming through two phases: divergent thinking to generate diverse ideas, and convergent thinking to evaluate and rank the most promising ones. The system employs diverse AI personas that engage in roundtable discussions, guided by an agentic facilitator that steers the discussion toward productive outcomes. Personas maintain private thoughts while commenting publicly, with ideas emerging organically throughout the discussion. Per-persona quotas on idea submissions and votes promote balanced participation while producing natural rankings. Throughout the session, the system tracks each idea’s lineage, capturing how concepts originate and cross-pollinate over time. We demonstrate this approach through a case study generating consumer ideas for AI smart glasses, showing (i) it produces diverse, relevant ideas with insights into their evolution; (ii) the cumulative exchange of perspectives across personas cultivates a shared context that progressively deepens the quality of discussion and the ideas produced.
[HC-22] Is This Edit Correct? A Multi-Dimensional Benchmark for Reasoning-Aware Image Editing
Link: https://arxiv.org/abs/2606.05172
Authors: Yixuan Ding, Wei Huang, Ruijie Quan, Xiaojuan Qi, Yi Yang
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 10 figures, 7 tables
Abstract:Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet most existing systems still operate at the level of surface instruction following, without reasoning about the implicit contextual constraints embedded in real user requests. This often leads to visually plausible but logically inconsistent edits. In this work, we introduce RE-Edit, a benchmark for REasoning-aware image Editing that evaluates image editing systems across five complementary reasoning dimensions: physical, environmental, cultural, causal, and referential. RE-Edit comprises 1,000 carefully curated samples, each designed such that visual plausibility alone is insufficient and correct editing requires satisfying implicit logical constraints. To support fine-grained analysis, we establish dimension-aligned evaluation criteria and conduct a comprehensive study of ten open-source and two commercial image editing models. Our results show that even advanced systems frequently struggle with implicit multi-dimensional reasoning despite producing high-quality visuals. We further present a lightweight reasoning-guided post-edit baseline as an initial exploration, illustrating how inserting explicit reasoning can help mitigate such failures in a model-agnostic manner.
[HC-23] AppAgent-Claw: CLI Is All You Need for GUI Automation
Link: https://arxiv.org/abs/2606.05171
Authors: Zhixue Song, Zhiheng Zhang, Yi Song, Chi Zhang
Categories: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: 13 pages, 1 figure, 8 tables. Preprint
Abstract:The OpenClaw platform provides a practical foundation for automation through its skill-oriented architecture, organizing external capabilities into lightweight, reusable components that can be invoked efficiently through a command-line interface (CLI). However, a significant bottleneck remains: many real-world tasks are confined to graphical user interfaces (GUIs) with no stable API available. While LLM-based GUI agents offer generality, their reliance on repeated live model inference makes them too slow, costly, and inconsistent to serve as efficient OpenClaw skills. In this paper, we present AppAgent-Claw, a demonstration-driven system that converts GUI workflows into reliable, reusable skills without runtime inference. By following a "record-once, replay-many" paradigm, the system captures rich contextual metadata to facilitate robust execution. It employs a layered localization strategy to handle visual shifts and a validation-coupled execution model to ensure intended on-screen effects. AppAgent-Claw provides a practical, efficient, and diagnosable solution for integrating GUI-bound tasks into the OpenClaw ecosystem.
Computer Vision
[CV-0] PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding
Link: https://arxiv.org/abs/2606.06485
Authors: Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, Liujuan Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.
[CV-1] Complexity-Balanced Diffusion Splitting
Link: https://arxiv.org/abs/2606.06477
Authors: Noam Issachar, Dani Lischinski, Raanan Fattal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Standard continuous-time generative models rely on monolithic architectures that must navigate vastly different signal regimes, from isotropic noise to intricate data distributions. While scaling model capacity improves performance, deploying a massive network uniformly across the entire generative timeline is inherently inefficient. In this work, we propose Complexity-Balanced Splitting (CBS), a principled framework for temporal capacity allocation that distributes the generative workload across multiple specialized sub-networks. Grounded in function approximation theory and de Boor’s equidistribution principle, CBS partitions the diffusion timeline into segments of equal approximation burden, allocating more representational capacity to regions where the generative dynamics are more difficult to model. To estimate this local complexity, we introduce two complementary and tractable monitor functions: a spatial measure based on the flow’s Dirichlet energy, and a geometric measure based on the acceleration of the sampling trajectories. Using a lightweight auxiliary model to estimate these complexity profiles, our approach eliminates the need for heuristic temporal splits or computationally expensive search procedures. Extensive evaluation across multiple architectures (SiT, JiT, and UNet) and datasets demonstrates that CBS consistently improves synthesis quality without increasing per-step inference cost. In particular, CBS improves FID by ~35% on SiT-XL with CFG relative to naive temporal partitioning. Project page is available at this https URL.
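The equidistribution idea in this abstract can be illustrated with a small sketch. This is not the authors' CBS implementation: the `monitor` function, the grid resolution, and the function name `equidistribute` are hypothetical, and the real monitor functions (Dirichlet energy, trajectory acceleration) are replaced here by an arbitrary positive function of time. The sketch only shows how a timeline in [0, 1] can be cut into segments that each carry an equal share of a complexity integral.

```python
from typing import Callable, List


def equidistribute(monitor: Callable[[float], float],
                   num_segments: int,
                   grid_size: int = 1000) -> List[float]:
    """Return boundaries of [0, 1] so each segment holds an equal share
    of the integral of `monitor` (assumed positive; hypothetical stand-in
    for a complexity profile)."""
    # Sample the monitor on a uniform grid.
    ts = [i / grid_size for i in range(grid_size + 1)]
    vals = [monitor(t) for t in ts]

    # Cumulative integral via the trapezoidal rule.
    cumulative = [0.0]
    for i in range(grid_size):
        cumulative.append(cumulative[-1] + 0.5 * (vals[i] + vals[i + 1]) / grid_size)
    total = cumulative[-1]

    boundaries = [0.0]
    j = 0
    for s in range(1, num_segments):
        target = total * s / num_segments
        # Advance to the grid cell whose cumulative value reaches the target.
        while cumulative[j + 1] < target:
            j += 1
        # Linear interpolation inside grid cell j.
        span = cumulative[j + 1] - cumulative[j]
        frac = (target - cumulative[j]) / span if span > 0 else 0.0
        boundaries.append(ts[j] + frac / grid_size)
    boundaries.append(1.0)
    return boundaries
```

With a constant monitor the split is uniform; with a monitor that grows over time, later segments become shorter, which corresponds to assigning more sub-network capacity per unit of time where the dynamics are assumed to be harder.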
[CV-2] Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators
Link: https://arxiv.org/abs/2606.06476
Authors: Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model’s ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.
[CV-3] In-Context Multiple Instance Learning
Link: https://arxiv.org/abs/2606.06458
Authors: Alexander Möllers, Marvin Sextro, Julius Hense, Gabriel Dernbach, Klaus-Robert Müller
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.
[CV-4] HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable Densely Interactive Whole-Home Scenes
Link: https://arxiv.org/abs/2606.06390
Authors: Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: this https URL
[CV-5] EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models
Link: https://arxiv.org/abs/2606.06379
Authors: Qiwei Zeng, Hao Wang, Jinghao Lin, Shuchang Ye, Yuezhe Yang, Yige Peng, Haoyuan Che, Jinman Kim, Lei Bi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in global image representations, making them difficult for medical VLMs to recognize. Existing efforts to improve lesion sensitivity mainly rely on medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement. Although effective, these approaches usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, limiting their applicability to frozen medical VLMs. To address these limitations, we propose EasyLens, a training-free plug-and-play subtle-lesion representation amplifier for medical VLMs. EasyLens first constructs EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references for comparing suspicious patches against both pathological and normal anatomical patterns. To avoid blindly amplifying normal tissues, EasyTag selects lesion-relevant patches through counterfactual prototype reasoning. To counteract the dilution of subtle lesion cues in global image representations, EasyAmplifier strengthens the selected lesion-relevant patch representations through morphology-guided residual enhancement, thereby increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.
[CV-6] Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation
Link: https://arxiv.org/abs/2606.06369
Authors: Maëlic Neau, Salim Baloch, Jakob Suchan, Zoe Falomir, Mehul Bhatt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically-guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data - capturing spatial, functional, and qualitative relational regularities - and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring, no model retraining, and transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.
[CV-7] GMBFormer: An NDVI-Guided Global Memory Bank Transformer for Urban Green-Space Extraction from Ultra-High-Resolution Imagery
Link: https://arxiv.org/abs/2606.06363
Authors: Hao Lei, Xi Cheng, Chenlu Shu, Zhiheng Chen, Zhengjie Duan, Haoyu Wang, Zhanfeng Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 5 figures
Abstract:Urban green-space extraction from ultra-high-resolution (UHR) imagery is commonly performed patch by patch, which limits semantic reuse among spatially separated but visually similar vegetation patterns. Directly injecting the Normalized Difference Vegetation Index (NDVI) into red-green-blue (RGB) backbones can also blur the roles of visual appearance learning and physical vegetation confidence. We propose GMBFormer, a SegFormer-based framework that replaces adjacency-driven feature propagation with selective, similarity-driven prototype retrieval. Only RGB channels enter the backbone and decoder, while NDVI is decoupled as a physics-informed gate that admits high-confidence vegetation descriptors into a compact global memory bank through momentum updates. During training and inference, the current patch queries stored prototypes through memory-mediated cross-attention, and the retrieved response is integrated with bounded overhead. Experiments use a self-constructed Chengdu UHR dataset with 7,700 labeled 512 x 512 patches and two reduced-label settings derived from the public International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam dataset. Under the same training and evaluation protocol, GMBFormer obtains mean intersection over union (mIoU)/mean Dice (mDice) scores of 89.25%/94.31%, 92.17%/95.92%, and 83.72%/90.86%, respectively, improving the controlled SegFormer-B4 baseline in each setting. Ablation studies indicate that decoupled NDVI admission, memory retrieval, capacity, and momentum jointly shape the final performance.
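The NDVI gate with momentum updates described above can be sketched as follows. NDVI itself is the standard index (NIR - Red) / (NIR + Red). Everything else is an assumption made for illustration: the threshold of 0.4, the momentum of 0.9, the function names, and the use of a single prototype vector instead of a full memory bank. It is not the GMBFormer implementation.

```python
from typing import List, Sequence


def ndvi(nir: float, red: float, eps: float = 1e-8) -> float:
    """Normalized Difference Vegetation Index for one pixel or patch mean."""
    return (nir - red) / (nir + red + eps)


def momentum_update(memory: Sequence[float],
                    descriptor: Sequence[float],
                    momentum: float = 0.9) -> List[float]:
    """Exponential moving average of a stored prototype (momentum value is assumed)."""
    return [momentum * m + (1.0 - momentum) * d for m, d in zip(memory, descriptor)]


def gated_update(memory: Sequence[float],
                 descriptor: Sequence[float],
                 nir: float,
                 red: float,
                 threshold: float = 0.4,
                 momentum: float = 0.9) -> List[float]:
    """Admit the descriptor into memory only when NDVI signals confident vegetation.

    The threshold is a hypothetical choice; the abstract does not state one.
    """
    if ndvi(nir, red) >= threshold:
        return momentum_update(memory, descriptor, momentum)
    return list(memory)  # gate closed: memory is left unchanged
```

In this toy version NDVI never enters the feature vector; it only decides whether an RGB-derived descriptor may update the stored prototype, which mirrors the decoupling the abstract describes.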
[CV-8] Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them
Link: https://arxiv.org/abs/2606.06361
Authors: Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026
Abstract:Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising; the phase degrades significantly (dropping by approximately 18% from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead (1.06x time, 1.02x memory) and reduced reliance on expensive external guidance methods (about 5x time).
[CV-9] Comparison of Deep Learning Frameworks For Rice Disease Mapping From UAV Multispectral Imaging
Link: https://arxiv.org/abs/2606.06359
Authors: Yadav Raj Ghimire, Jagrati Talreja, Tewodros Syum Gebre, Timothy Agboada, Shikha V. Chandel, Leila Hashemi Beni
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted in IGARSS 2026. Copyright 2026 IEEE
Abstract:In this study, UAV multispectral imagery is used to segment the severity of bacterial leaf blight (BLB) in rice using convolutional neural networks (CNNs) and transformer-based models. The evaluated architectures include U-Net with a ResNet-101 encoder, U-Net++ with EfficientNet-B3 and EfficientNet-B7, DeepLabV3+, and SegFormer, all trained under a common pipeline with three input configurations (multispectral only, multispectral+NDVI, and multispectral+NDRE). Experiments are conducted using the publicly available BLB dataset with performance reported using mean IoU (mIoU), mean F1 (mF1), mean accuracy (mAcc), precision, and recall. U-Net++ with EfficientNet-B3 achieved the highest performance, with an mIoU of 97.62%. SegFormer obtained lower segmentation accuracy but comparable inference speed. Overall, the results indicate that lightweight CNN backbones remain more reliable for operational BLB monitoring while integration of vegetation indices provides small and consistent improvements. The study also highlights the value of standardised UAV datasets to compare disease mapping methods and encourages the use of CNN architectures for field implementation.
[CV-10] StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale Multi-Genre and Auto-Generated Dataset
Link: https://arxiv.org/abs/2606.06338
Authors: Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge, Zhongyuan Wang, Chunxia Xiao, Chao Liang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCV 2026
Abstract:Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Though it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is utilized to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent that re-organizes long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: this https URL
[CV-11] Efficient Mean Curvature Computation on High-Dimensional Data Manifolds
Link: https://arxiv.org/abs/2606.06329
Authors: Alexandre L. M. Levada
Categories: Machine Learning (cs.LG); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 31 pages, 2 figures and 5 tables
Abstract:Estimating local mean curvature at each point of a high-dimensional dataset is a key ingredient of geometry-aware machine learning algorithms, such as the Mean Curvature Boundary Points (MCBP) method. The naive implementation of this computation, based on a local shape operator approximated from k-nearest neighbor patches, involves an explicit construction of a matrix H whose trace form yields an O(m^4) cost per point, rendering the approach intractable for datasets with more than a few dozen features. This paper introduces two complementary contributions that together reduce this cost by several orders of magnitude. The first contribution is an exact algebraic identity. This identity, derived from the orthogonality of the eigenvectors of the covariance matrix and the cyclicity of the trace operator, eliminates H entirely and reduces the per-point cost to O(m^2) after the eigendecomposition. The second contribution addresses the remaining O(m^3) bottleneck of the full eigendecomposition. Since the local covariance matrix has rank at most k-1 << m, we replace it with a truncated SVD of the k x m centered data matrix, an O(k^2 m) operation, and derive an analytical approximation for the contribution of the null-space eigenvectors based on the expected value of their outer product under the Haar measure. The resulting estimator has total cost O(k^2 m + k m p^2), where p = k-1. Experiments on real-world datasets confirm speedups of 50 to 300 times relative to the original implementation, with negligible loss when the fast estimator is used to replace the original version. By providing a scalable and data-driven estimate of local curvature, the proposed method establishes curvature as a practical geometric feature for a broad range of machine learning tasks, from classical to modern deep learning pipelines.
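The rank argument behind the truncated SVD step can be written out briefly. This is a standard linear-algebra sketch, not the paper's derivation; the symbols X_c, U, S, V and the 1/(k-1) covariance normalization are assumptions, since the abstract does not fix a notation.

```latex
% X_c: the k x m centered patch (k neighbors as rows, m features as columns).
% Centering makes the rows sum to zero, so rank(X_c) <= k - 1 = p.
\[
\Sigma = \frac{1}{k-1} X_c^{\top} X_c \in \mathbb{R}^{m \times m},
\qquad
X_c = U S V^{\top},\quad
U \in \mathbb{R}^{k \times p},\;
S \in \mathbb{R}^{p \times p},\;
V \in \mathbb{R}^{m \times p}.
\]
% Substituting the thin SVD and using U^T U = I:
\[
\Sigma = \frac{1}{k-1} V S^{2} V^{\top}
\quad\Longrightarrow\quad
\lambda_i = \frac{\sigma_i^{2}}{k-1},\; i = 1,\dots,p,
\qquad
\lambda_i = 0 \text{ for } i > p .
\]
```

The columns of V are therefore the eigenvectors of the covariance matrix with nonzero eigenvalues, and they come from an O(k^2 m) factorization of the k x m matrix rather than an O(m^3) eigendecomposition of the m x m matrix. The remaining m - p null-space directions are what the abstract handles with its analytical approximation.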
[CV-12] RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling
Link: https://arxiv.org/abs/2606.06309
Authors: Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang, Zheng Zhu, Yueqi Duan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL, Code: this https URL
Abstract:Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that, because adjacent frames share corresponding contents and motions, once keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since skipped intermediate states of non-keyframes break the temporal coherence in keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.
[CV-13] Towards One-to-Many Temporal Grounding
Link: https://arxiv.org/abs/2606.06294
Authors: Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML'26
Abstract:Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query – a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
[CV-14] Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation
Link: https://arxiv.org/abs/2606.06292
Authors: Ariel Herrera, Xueyang Kang, Atal Anil Kumar
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Robotic manipulation of textiles remains challenging because continuous deformation and self-occlusions hinder the robust visual perception required to estimate the cloth’s state. To address the lack of annotated real-world data, we developed a Blender-based synthetic pipeline exporting auto-annotated keypoints, and combined manually labeled renders with real-world data to train a wrinkle detector. We present a perception framework integrating a CNN for permutation-invariant keypoint detection and a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm uses this system to stretch fully folded garments via wrinkles, transitioning to keypoint-based ironing once corners emerge. The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforming baselines that fail in high-occlusion states or yield false positives on severe folds.
[CV-15] Geodesic Flow Matching on a Riemannian Degradation Manifold for Blind Image Restoration
Link: https://arxiv.org/abs/2606.06278
Authors: Akshay Janardan Bankar, Ankita Chatterjee, Sayan Banerjee, Shreyas Pandith, Kalakonda Sai Shashank, Amit Satish Unde
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ECCV 2026
Abstract:Blind image restoration requires recovering clean images from observations corrupted by unknown and potentially mixed degradations. While recent deterministic flow-based methods model restoration as transport processes that map degraded images to clean ones, they typically rely on Euclidean interpolation, implicitly assuming linear degradation geometry. In this paper, we explicitly model degradations as points on a low-dimensional Riemannian manifold and formulate restoration as geodesic transport on the joint image-manifold space. Using a geodesic flow matching objective, we learn intrinsic transport dynamics that respect the curvature of degradation space. This framework generalizes linear flow matching, provides a principled treatment of mixed degradations as geodesic compositions, and yields a clean theoretical interpretation for generalization beyond observed degradations.
[CV-16] RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning
链接: https://arxiv.org/abs/2606.06255
作者: Ziyang Yu,Xiang Li,Qiong Chang,Jun Miyazaki
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 28 pages,15 figures
Abstract:Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is the most well-known downsampling operator, as its uniform coverage preserves the geometric structure on which downstream perception relies. However, the high time complexity of classical FPS scales poorly with the million-point-per-second rates of modern 3D sensors, making it a dominant latency bottleneck that conflicts with the real-time and limited onboard compute budgets of robotic systems. Therefore, we propose RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that preserves the standard FPS update rule under the same initialization and tie-breaking policy. By indexing the point cloud with spherical voxels, RadiusFPS derives a conservative geometric bound that prunes redundant distance computations in each iteration, complemented by a coordinate-wise point-skip test that removes residual updates. We further introduce RadiusFPS-G, a warp-level GPU implementation that fuses voxel selection, pruning, and distance update into memory-coalesced kernels, eliminating costly global-memory round-trips. On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G attains up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS among the evaluated methods while using roughly half its GPU memory, with comparable segmentation accuracy. When coupled with the learning-based FastPoint sampler, the resulting pipeline achieves the fastest end-to-end inference among all evaluated configurations. These properties make high-quality FPS-style sampling practical for latency- and memory-constrained robotic vision.
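For readers unfamiliar with the baseline, the classical FPS update rule that the abstract refers to can be sketched as follows. This is a plain, unaccelerated reference version and not RadiusFPS itself: the function name, the `start` parameter, and the lowest-index tie-breaking policy are assumptions chosen for illustration.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Classical FPS over a list of coordinate tuples.

    Performs O(n * k) distance updates, which is the cost that
    pruning-based accelerations aim to reduce.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")

    # min_dist[i] holds the squared distance from point i to the
    # nearest point selected so far.
    min_dist = [math.inf] * n
    selected = [start]

    for _ in range(k - 1):
        last = points[selected[-1]]
        # Standard update rule: relax every distance against the
        # most recently selected point.
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            if d < min_dist[i]:
                min_dist[i] = d
        # Pick the point farthest from the selected set. max() returns
        # the first maximal index, so ties go to the lowest index
        # (an assumed tie-breaking policy).
        selected.append(max(range(n), key=min_dist.__getitem__))

    return selected


pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
sample = farthest_point_sampling(pts, 3)
```

Every iteration touches all n points, so any method that claims to preserve the standard update rule has to return the same indices as this loop under the same initialization and tie-breaking.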
[CV-17] GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention
链接: https://arxiv.org/abs/2606.06249
作者: Giordano Cicchetti,Eleonora Grassucci,Danilo Comminiello
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multimodal attention formulations compute their scores through collections of pairwise dot-product interactions or by concatenating all the modalities into the keys, even when multiple modalities should be jointly involved. As a consequence, current approaches either incur quadratic complexity in the number of modalities or fail to explicitly model interactions that depend on the joint configuration of multiple representations. In this work, we introduce the Volumetric Multimodal cross-Attention (VMA), a novel cross-attention mechanism in which attention scores are defined as a function of the joint geometry of a query and multiple modality-specific keys. VMA computes the volume spanned by query and key vectors across multiple modalities, capturing joint multimodal dependencies beyond pairwise similarity, enabling native modeling of any-order modality interactions. We integrate VMA into our novel multimodal transformer architecture, named GRAMformer, explicitly designed to integrate any number of modalities. We evaluate the proposed model on multimodal learning tasks, demonstrating improved effectiveness and efficiency.
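The geometric quantity behind VMA, the volume spanned by a query and several keys, can be illustrated with a Gram determinant. The sketch below computes only that volume; how the paper turns the volume into an attention score is not stated in the abstract, so nothing here should be read as the authors' scoring function. The name `gram_volume` is hypothetical.

```python
import math


def gram_volume(vectors):
    """Volume of the parallelotope spanned by the given vectors.

    Computed as sqrt(det(G)), where G[i][j] = <v_i, v_j> is the Gram matrix.
    """
    m = len(vectors)
    g = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

    # Determinant by Gaussian elimination with partial pivoting.
    det = 1.0
    for c in range(m):
        pivot = max(range(c, m), key=lambda r: abs(g[r][c]))
        if abs(g[pivot][c]) < 1e-12:
            return 0.0  # linearly dependent vectors span zero volume
        if pivot != c:
            g[c], g[pivot] = g[pivot], g[c]
            det = -det
        det *= g[c][c]
        for r in range(c + 1, m):
            f = g[r][c] / g[c][c]
            for k in range(c, m):
                g[r][k] -= f * g[c][k]

    # Clamp tiny negative values caused by floating-point error.
    return math.sqrt(max(det, 0.0))


# A query plus two modality-specific keys (toy 3-dimensional vectors).
query = (1.0, 0.0, 0.0)
key_a = (0.0, 1.0, 0.0)
key_b = (0.0, 0.0, 1.0)
vol_orthogonal = gram_volume([query, key_a, key_b])
vol_aligned = gram_volume([query, query])
```

Mutually orthogonal unit vectors span unit volume, while collinear vectors span zero volume. A single scalar therefore reflects the joint configuration of all vectors at once, rather than a collection of pairwise dot products.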
[CV-18] SAM-Flow: Source-Anchored Masked Flow for Training-Free Image Editing
链接: https://arxiv.org/abs/2606.06228
作者: Haowang Cui,Rui Chen,Tao Luo,Tao Guo,Zheng Qin,Jiaze Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL
Abstract:Training-free image editing has recently attracted increasing attention due to its ability to modify real images using powerful pre-trained diffusion and flow-matching models without additional training. However, existing inversion-based and differential-flow-based methods usually perform global latent transport, which inevitably propagates editing effects to non-target regions and leads to background leakage. To address this problem, we propose SAM-Flow, a source-anchored masked flow framework for localized training-free image editing. Instead of updating the whole latent representation, SAM-Flow first uses a scout image and token-grounded attention maps to localize the editable semantic regions. It then applies differential velocity updates only within these regions, while anchoring the remaining areas to the source-image latent trajectory. To further improve spatial stability and boundary naturalness, we introduce a time-varying source-anchored projection mechanism with dynamic soft masks, transition regions, and temporal mask accumulation. The proposed method is plug-and-play and can be integrated with mainstream flow-matching backbones such as Stable Diffusion 3 and FLUX without any fine-tuning. Extensive qualitative and quantitative experiments demonstrate that SAM-Flow achieves accurate semantic editing while significantly improving background preservation, providing a simple and general localized editing paradigm for training-free image editing. Code is available at: this https URL.
[CV-19] Symb-xMIL: Symbolic Explanations for Multiple Instance Learning in Digital Pathology
链接: https://arxiv.org/abs/2606.06224
作者: Yanqing Luo(1 and 2),Julius Hense(1 and 2),Niklas Prenißl(3 and 4),Andreas Mock(5 and 6 and 7),Klaus-Robert Müller(1 and 2 and 8 and 9),Thomas Schnake(10 and 11 and 12),Mina Jamshidi Idaji(1 and 2) ((1) Berlin Institute for the Foundations of Learning and Data, Berlin, Germany, (2) Machine Learning Group, Technische Universität Berlin, Berlin, Germany, (3) Institute of Pathology, Charité Universitätsmedizin, Berlin, Germany, (4) Berlin Institute of Health at Charité – Universitätsmedizin Berlin, BIH Biomedical Innovation Academy, BIH Charité Digital Clinician Scientist Program, Berlin, Germany, (5) Institute of Pathology, Ludwig Maximilian University of Munich, Munich, Germany, (6) Division of Translational Medical Oncology, DKFZ, Heidelberg, Germany, NCT Heidelberg, Heidelberg, Germany, (7) German Cancer Consortium (DKTK), partner site Munich, a partnership between DKFZ and Ludwig-Maximilians-Universität München (LMU), Germany, (8) Department of Artificial Intelligence, Korea University, Seoul, Korea, (9) Max-Planck Institute for Informatics, Saarbrücken, Germany, (10) Department of Chemistry, Chemical Physics Theory Group, University of Toronto, Canada, (11) Vector Institute for Artificial Intelligence, Toronto, Canada, (12) Acceleration Consortium, University of Toronto, Canada)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 23 pages, 18 figures
Abstract:Explanations of multiple instance learning (MIL) models are widely used for validation and discovery in digital histopathology. Existing methods primarily rely on heatmaps that highlight influential regions but do not explain how evidence from different tissue regions is combined to produce a prediction. This limits interpretability, especially when decisions depend on interactions between tissue features. We introduce Symbolic explainable MIL (Symb-xMIL), a post-hoc explanation framework that quantifies how a MIL model’s behavior aligns with human-readable decision rules, expressed as logical relationships (e.g., AND, OR, NOT) between input features. These alignment scores reveal semantic patterns underlying the model’s predictions. We evaluate Symb-xMIL on synthetic and real-world histopathology datasets. On synthetic MIL data, Symb-xMIL reliably recovers ground-truth logical rules. In a clinical tumor detection task, the best-aligned rules uncover heterogeneous decision patterns and expose hidden model errors. On an HPV-prediction task on TCGA-HNSCC, a cohort of head and neck cancer, our framework refines patient survival stratification beyond HPV status with potential clinical relevance. Overall, Symb-xMIL extends MIL explainability beyond visual attribution toward structured, rule-based reasoning, enabling more transparent and semantically grounded interpretation of model predictions.
[CV-20] DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments
链接: https://arxiv.org/abs/2606.06217
作者: Tan Zhang,Quanyou Li,Lu Zhang,Jun Liu,Xiaofeng Zhu,Ping Hu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at this https URL.
[CV-21] SC-MFJ: A Simple Haptic Quality Metric for Medical Image Segmentation
链接: https://arxiv.org/abs/2606.06199
作者: Souraj Adhikary,Negar Chabi,Andre Mastmeyer
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 11 pages, 5 figures, 5 tables, this http URL
Abstract:Standard segmentation metrics such as Dice and Hausdorff distance measure geometric overlap but say nothing about whether a segmented surface is suitable for haptic rendering in surgical simulation. We propose SC-MFJ (Surface-Constrained Mean Force Jerk), a simple, inexpensive metric that samples a segmented organ surface with many short virtual stylus walks and measures how jerky the resulting contact forces are. The metric is computed from existing segmentation outputs and uses roughly one minute of CPU time per case. We evaluate three pancreas CT segmentation approaches, namely binary nnU-Net output, Gaussian-smoothed output, and learned signed distance function (SDF) regression, across 80 cases in five-fold cross-validation. SC-MFJ reveals a 147x gap in haptic quality between the raw binary baseline and simple Gaussian post-processing, a difference entirely invisible to Dice and HD95. It also shows that learned SDF regression, despite requiring full model retraining, produces more variable haptic quality than Gaussian smoothing, with a case-level standard deviation of 168 N/s^2 compared with 22 N/s^2 for Gaussian. A second evaluation on the LiTS liver dataset (131 cases) confirms the generality of these findings: the binary-to-Gaussian gap widens to 189x, and Gaussian smoothing again produces consistently low force jerk across all folds. Our results suggest that for haptic simulation applications, a one-line post-processing step may be sufficient, and that a cheap metric like SC-MFJ can flag problems that geometric metrics miss.
[CV-22] ActiveMimic: Egocentric Video Pretraining with Active Perception
链接: https://arxiv.org/abs/2606.06194
作者: Xingyao Lin,Guojin Zhong,Tianyi Lu,Ziyi Ye,Yichen Zhu,Zuxuan Wu,Yu-Gang Jiang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.
[CV-23] Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models ICLR2026
链接: https://arxiv.org/abs/2606.06186
作者: Liangsheng Liu,Si Chen,Jiamin Wu,Weiwei Feng,Zhixin Cheng,Xiaotian Yin,Wenfei Yang,Tianzhu Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR2026
Abstract:Vision-Language Models (VLMs), such as CLIP, have shown strong zero-shot generalization but remain highly vulnerable to adversarial perturbations, posing serious risks in real-world applications. Test-time defenses for VLMs have recently emerged as a promising and efficient approach to defend against adversarial attacks without requiring costly large-scale retraining. In this work, we uncover a surprising phenomenon: under diverse input transformations, adversarial images in CLIP’s feature space consistently shift along a dominant direction, in contrast to the dispersed patterns of clean images. We hypothesize that this dominant shift, termed the Defense Direction, opposes the adversarial shift, pointing features back toward their correct class centers. Building on this insight, we propose Directional Bias-guided Defense (DBD), a test-time framework that estimates the Defense Direction and employs a DB-score-based two-stream reconstruction strategy to recover robust representations. Experiments on 15 datasets demonstrate that DBD not only achieves SOTA adversarial robustness while preserving clean accuracy, but also reveals the counterintuitive result that adversarial accuracy can even surpass clean accuracy. This demonstrates that adversarial perturbations inherently encode directional priors about the true decision boundary.
[CV-24] RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision
链接: https://arxiv.org/abs/2606.06176
作者: Haochen Hu,Yanrui Bin,Chih-yung Wen,Bing Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by the water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be made available upon acceptance.
[CV-25] Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting
链接: https://arxiv.org/abs/2606.06158
作者: Kevin Dave,Sai Aditya Patkuri,Chhaya Kumar Das,Gouranga Bala,R. Venkatesh Babu,Rajeshkumar SA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binarised searches or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes get compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. 
Evaluations across TokenBench and DAVIS, which are the standard benchmarks used by recent tokenisers [infotok, agarwal2025cosmos], indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a 31× inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and an approximately 2× speedup over the discrete information-theoretic baseline (InfoTok).
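The thresholding step described in this abstract can be sketched in a few lines. The abstract specifies only a fixed threshold on per-position temporal-L1 differences, so several details below are assumptions: averaging the L1 difference over channels, always keeping the first frame, comparing each frame with its immediate predecessor, and the names `temporal_redundancy_mask` and `threshold`.

```python
def temporal_redundancy_mask(latents, threshold):
    """Return keep[t][p]: True if latent position p of frame t is retained.

    latents[t][p] is a list of channel values for position p in frame t.
    """
    # Assumption: the first frame has no predecessor, so keep all of it.
    keep = [[True] * len(latents[0])]

    for prev, cur in zip(latents, latents[1:]):
        row = []
        for p_prev, p_cur in zip(prev, cur):
            # Mean absolute change across channels (assumed reduction).
            l1 = sum(abs(a - b) for a, b in zip(p_cur, p_prev)) / len(p_cur)
            # Positions that barely change are treated as redundant.
            row.append(l1 > threshold)
        keep.append(row)

    return keep


# Two frames, two spatial positions, two channels each.
latents = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.01], [2.0, 1.0]],
]
mask = temporal_redundancy_mask(latents, threshold=0.1)
kept_fraction = sum(map(sum, mask)) / sum(map(len, mask))
```

No token budget is passed in. The fraction of retained positions follows from the content alone, which matches the abstract's point that static scenes are compressed aggressively while dynamic ones retain more tokens.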
[CV-26] AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding
链接: https://arxiv.org/abs/2606.06155
作者: Qize Yu,Jiadi You,Yuran Wang,Jiaqi Liang,Bowen Ping,Yang Tian,Yue Chen,Minghong Cai,Zeying Gong,Ruihai Wu,Yinchuan Li,Junwei Liang,Yingcong Chen
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Preprint. Code and project page are available. Code: this https URL Project page: this https URL
Abstract:Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception–action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception–action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and in the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
[CV-27] Computation-Aware Event-to-Frame Reconstruction via Selective Attention
链接: https://arxiv.org/abs/2606.06142
作者: Jingqian Wu,Yunbo Jia,Edmund Y. Lam
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Event-to-frame (E2F) reconstruction bridges asynchronous event streams with frame-based vision pipelines, but existing methods often face a trade-off between reconstruction quality and computational efficiency. In this work, we propose an efficient E2F framework that emphasizes causal temporal modeling and computation-aware design. The architecture adopts a recurrent encoder-decoder to incrementally aggregate event information with compact hidden states. To improve robustness under fast motion and illumination variations, a selective context fusion strategy is introduced to integrate event-driven features with prior intensity cues. Within this fusion process, a lightweight hybrid attention mechanism enhances feature selectivity without relying on heavy attention operations. Experimental results on standard benchmarks demonstrate that the proposed approach achieves competitive reconstruction performance while maintaining a favorable balance between accuracy and model complexity.
[CV-28] Diff-CA: Separating Common and Salient Factors with Diffusion Models
链接: https://arxiv.org/abs/2606.06120
作者: Michaël Soumm,Alexandre Fournier Montgieux,Yunlong He,Pietro Gori,Alasdair Newson
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Contrastive Analysis aims to separate factors that are common between two data distributions from those that are salient to only one of them. Existing contrastive methods are based on generative models (e.g., VAEs or GANs) that often suffer from limited reconstruction and image quality, which hampers effective latent factor separation and limits their applicability to high-fidelity image generation and editing. We propose a novel conditioning framework for diffusion models that enables contrastive decomposition without compromising generation quality. We first train a prompt-free, image-conditioned diffusion model, and then learn to decompose the conditioning into a common and a salient factor, using weak supervision. We prove that the additive contrastive factorization, commonly assumed in prior work, is identifiable under mild conditions. This factorization enables targeted operations by swapping or interpolating only the salient factor.
[CV-29] Where What Why and Importance: Structured Defect Grounding for Text-to-Image Feedback
链接: https://arxiv.org/abs/2606.06113
作者: Huaisong Zhang,Hao Yu,Yuxuan Zhang,Jiahe Wang,Xinrui Chen,Haoxiang Cao,Feng Lu,Wendong Zhang,Changqian Yu,Chun Yuan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 9 figures
Abstract:Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.
[CV-30] MS-DKC: A Dataset Knowledge Card Framework for Designing and Adapting Medical Image Segmentation Models
链接: https://arxiv.org/abs/2606.06103
作者: Tariq M. Khan,Syed Saud Naqvi,Thantrira Porntaveetus,Hamid Alinejad-Rokny,Shahzaib Iqbal,Imran Razzak,Mohammad AU Khan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical image segmentation is often framed as a search for stronger architectures, but this can obscure a more fundamental question: what does the dataset require from the model? In medical imaging, this requirement is shaped by foreground occupancy, morphology, boundary ambiguity, topology sensitivity, annotation quality, acquisition variation, and operating point. This paper introduces the Medical Segmentation Dataset Knowledge Card (MS-DKC), a framework for making these factors explicit. MS-DKC records dataset evidence through image/acquisition, morphology, supervision, context-dependence, and deployment-risk descriptors. These descriptors are mapped to failure modes, design priors, and risk-aligned criteria, making segmentation design more traceable than architecture-first comparison. We evaluate MS-DKC on DRIVE, ISIC2018, and ACDC, representing distinct regimes. DRIVE contains sparse, thin, branching vessels, favoring detail-preserving models, sensitivity-aware optimization, threshold analysis, and topology-aware metrics. DKC-TNet-v2 achieved Dice 0.8044 and IoU 0.6730 with 35103 parameters, while SA-UNetv2-DKC-AmbRef reached Dice 0.8141, IoU 0.6865, sensitivity 0.8265, specificity 0.9804, and AUC 0.9853. ISIC2018 involves compact but appearance-variable lesions; validation-constrained score-function selection on Att-Next-Topo/ATTNext produced MS-DKC-AttNextTopo-VCSF-NoAug with Dice 0.8872, IoU 0.8214, precision 0.9173, Boundary F1 0.4878, and ASSD 4.13, while plausible additions failed to improve the risk-aligned profile. ACDC provides a multi-class cardiac case, where MS-DKC recommends four-class softmax segmentation, class-balanced Dice/CE supervision, and class-wise surface evaluation. Overall, the results support dataset-conditioned design: different datasets require different priors, operating points, and evidence before a model can be judged appropriate. 
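Since the results above are reported as Dice and IoU pairs, a short worked example of how the two overlap scores relate may help when reading them. This is a generic sketch on flattened binary masks, not the paper's evaluation code; the function name and the convention of scoring two empty masks as 1.0 are assumptions.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two flattened binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)

    union = tp + fp + fn
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0

    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / union
    return dice, iou


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
dice, iou = dice_and_iou(pred, target)
# For a single mask pair the two scores are tied by IoU = Dice / (2 - Dice).
iou_from_dice = dice / (2 - dice)
```

The identity IoU = Dice / (2 - Dice) holds exactly for one mask pair. When both metrics are averaged over many images, the averages need not satisfy it exactly, which is one reason papers report both numbers.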
[CV-31] HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning
链接: https://arxiv.org/abs/2606.06100
作者: Moshiur Farazi,Sameera Ramasinghe,Mahbub Ahmed Turza,Shafin Rahman
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets ⟨s, p, o⟩ from an off-the-shelf scene graph generator (SGG), but we show this backfires: discrete text labels collide with the continuous visual modality, degrading GQA accuracy from 60.38% to 58.86%. We propose HyperVis, which bypasses the SGG semantic bottleneck entirely. From N class-agnostic region proposals, we compute a dense O(N^2) visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion. We discover that HyperVis contributes in two complementary ways: (1) as a training-time regularizer, the hyperbolic relational losses shape LoRA representations that improve generative VQA (GQA 61.03% vs. 57.21% for LoRA fine-tuning without relational losses, recovering and surpassing the baseline); and (2) as an inference-time relational encoder, hyperbolic prefix tokens boost discriminative compositional scoring (SugarCrepe 79.94%, +6.25pp over baseline). The learned curvature stabilises at κ = 4.0, an order of magnitude above prior hyperbolic VLMs where κ typically collapses toward zero, indicating that continuous visual features genuinely require the exponential volume of strongly curved space. A controlled Euclidean ablation confirms this decomposition: the relational pipeline regularises LoRA comparably in flat space (GQA 60.81%), but the compositionality gain is specifically hyperbolic (SugarCrepe +4.58pp over Euclidean), with entailment loss approximately 6× higher in Euclidean training. Code is available at TBA.
[CV-32] Knowledge Distillation for Visual Autoregressive Models
链接: https://arxiv.org/abs/2606.06078
作者: Elia Peruzzo,Aritra Bhowmik,Guillaume Sautiere,Yuki M Asano,Amirhossein Habibian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autoregressive (AR) image generation models are highly expressive but computationally intensive, motivating effective model compression. Knowledge distillation (KD) is a natural approach for model compression and has been widely studied in language modeling, yet its behavior in visual AR generation remains underexplored. In this work, we present the first systematic study of distillation strategies for AR image models. Our analysis shows that while standard distillation can yield meaningful gains, recent methods developed for language do not directly transfer to images: long decoding horizons and visual token ambiguity make teacher supervision unreliable, especially under student-conditioned contexts. To address this, we propose VarKD, a distillation framework for visual autoregressive models that distills on student samples while selectively applying teacher supervision and reducing token-level ambiguity. Experiments on ImageNet across multiple AR backbones show that VarKD consistently outperforms prior distillation baselines, narrowing the gap to large-scale models.
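As background for the "standard distillation" that the abstract takes as its starting point, the usual token-level objective is a temperature-scaled KL divergence between teacher and student next-token distributions. The sketch below shows that generic baseline only, not VarKD's selective supervision; the function names and the default temperature of 2.0 are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL(teacher || student) for one token position."""
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across
    # temperatures, following the common distillation convention.
    return temperature ** 2 * kl


teacher = [2.0, 0.5, -1.0]
loss_same = kd_loss(teacher, teacher)
loss_diff = kd_loss([0.0, 0.0, 0.0], teacher)
```

In an autoregressive model this loss is summed over token positions. The abstract's concern is that when the context at each position comes from the student's own samples, the teacher's distribution at that position may no longer be a reliable target.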
[CV-33] Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation
链接: https://arxiv.org/abs/2606.06076
作者: Haocheng Luo,Jiahui Liu,Ruicheng Zhang,Zhizhou Zhong,Jiaqi Huang,Zunnan Xu,Quan Shi,Jun Zhou,Xiu Li
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, preprint
Abstract:While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student’s own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at this https URL.
[CV-34] VZCrash: A Large-Scale IMU Dataset of Ego-Vehicle Crashes ITSC2026
Link: https://arxiv.org/abs/2606.06074
Authors: Tommaso Bianconcini,Henrique Piñeiro Monteagudo,Aurel Pjetri,Tomaso Trinci,Leonardo Taccari
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026). VZCrash is publicly available at this URL: this https URL
Abstract:We introduce VZCrash, the largest publicly available dataset of real-world vehicle collision data featuring Inertial Measurement Unit (IMU) telemetry. The dataset contains more than 31,000 validated crashes and 158,000 negative samples, including hard cases and distractors. Each sample includes acceleration and angular velocity at 100 Hz, and GPS speed at 1 Hz. Events in VZCrash were captured by devices installed on a fleet of 73,010 commercial vehicles of different sizes driving in the United States over the span of several years. We also present an extensive experimental study enabled by the volume of the dataset. We first benchmark several different approaches, from a simple threshold-based heuristic to state-of-the-art deep learning models. Then, we present an experiment demonstrating the importance of scaling data to train high-quality crash detection models, and we show that scale is especially important when these models need to be deployed into a real-world environment.
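The abstract mentions a simple threshold-based heuristic as the weakest baseline but does not describe it. As a rough illustration only, one such heuristic could flag a window when the acceleration magnitude stays above a limit for a few consecutive 100 Hz samples. The function name `detect_crash`, the threshold of 2.5 g, and the run length of 3 samples are all assumptions made for this sketch, not values from the paper.

```python
import math

def detect_crash(accel_xyz, threshold_g=2.5, min_consecutive=3):
    """Return True if |a| stays at or above threshold_g for min_consecutive samples.

    accel_xyz: iterable of (ax, ay, az) tuples in g, e.g. sampled at 100 Hz.
    threshold_g and min_consecutive are hypothetical tuning values.
    """
    run = 0
    for ax, ay, az in accel_xyz:
        magnitude = math.sqrt(ax * ax + ay * ay + az * az)
        # Extend the run on a high sample, reset it otherwise.
        run = run + 1 if magnitude >= threshold_g else 0
        if run >= min_consecutive:
            return True
    return False

# Toy windows in g; these numbers are made up and not taken from VZCrash.
calm = [(0.0, 0.0, 1.0)] * 100
spike = calm[:50] + [(3.0, 0.5, 1.0)] * 5 + calm[55:]
```

Requiring several consecutive samples is one way to ignore single-sample sensor glitches; the hard negatives and distractors described in the abstract are exactly the cases where a fixed threshold like this would be expected to struggle.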
[CV-35] FontFusion: Enhancing Generative Text in Diffusion Models with Typographic Conditioning ICANN2026
Link: https://arxiv.org/abs/2606.06066
Authors: Marian Lupascu,Nipun Jindal,Ionut Mironica,Zhaowen Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 12 pages, 8 figures, accepted at ICANN 2026
Abstract:Typography generation in diffusion models faces a persistent trade-off: enabling precise font control typically degrades text legibility, while maintaining readability often sacrifices typographic fidelity. We present FontFusion, a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures that resolves this dilemma through three core innovations: (1) a hierarchical token representation establishing explicit text-font relationships at multiple granularities, (2) position-aware embeddings creating spatial bindings between typography and image content, and (3) a multi-level token dropping strategy improving both computational efficiency and generalization to unseen fonts. Our systematic evaluation of font embedding spaces reveals that a dual encoder combining DeepFont and DINOv2 outperforms any single encoder for typography tasks. FontFusion demonstrates a 76% relative improvement on challenging decorative fonts over single-encoder baselines and font consistency gains of approximately 68-76% over unconditioned models, while integrating into existing DiT architectures without retraining.
[CV-36] ReCache: Learning Budget-Aware Caching Schedules for Diffusion Models via REINFORCE
Link: https://arxiv.org/abs/2606.06060
Authors: Mishan Aliev,Eva Neudachina,Ilya Bykov,Aleksandr Oganov,Kirill Struminsky,Aibek Alanov,Denis Rakitin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Modern diffusion models generate high-quality images and videos, but their iterative denoising process makes inference expensive. Feature caching accelerates sampling by reusing or predicting intermediate activations across neighboring denoising steps, exploiting the redundancy of computations along the reverse trajectory. In this work, we focus on the caching schedule: selecting which denoising steps should be fully recomputed. Existing schedules are either fixed (e.g. uniform) or chosen adaptively from per-step error heuristics; in both cases, the actual compute cost is a side-effect of hand-tuned thresholds rather than a quantity the user can specify. We propose ReCache, which inverts this: given a target budget k, it learns the recomputation schedule that maximizes generation quality, turning compute into a directly controllable input. ReCache trains via policy gradients, sidestepping backpropagation through full diffusion inference, and uses no labelled data. Generations from uncached inference serve as matching targets, paired with a reward for generation quality. ReCache is compatible with any caching mechanism, including feature reuse and feature forecasting; for each mechanism, a single trained policy adapts across computational budgets at inference time. ReCache consistently outperforms scheduling baselines: under a 5.04× FLOPs reduction on FLUX, it reduces LPIPS by 31% (from 0.456 to 0.316) compared to DiCache; on Wan 2.1 at a roughly 2.6× speedup, it drops LPIPS by 65% (from 0.480 to 0.169) and boosts the VBench score by 7% (5.6 points, from 70.4 to 76.0) over uniform HiCache. Code is available at this https URL.
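To make the idea of learning a budgeted schedule with policy gradients concrete, here is a minimal toy sketch. It is not the authors' method: the policy parameterisation (one logit per denoising step, with k steps drawn without replacement), the `toy_reward` stand-in for generation quality, and all hyperparameters are assumptions chosen so the example runs with the standard library alone.

```python
import math
import random

def sample_schedule(logits, k, rng):
    """Draw k distinct steps one at a time from a softmax over the remaining steps.

    Returns the sorted schedule and the gradient of its log-probability
    with respect to the logits (needed for the REINFORCE update).
    """
    remaining = list(range(len(logits)))
    grad = [0.0] * len(logits)
    chosen = []
    for _ in range(k):
        top = max(logits[i] for i in remaining)
        weights = [math.exp(logits[i] - top) for i in remaining]
        total = sum(weights)
        probs = [w / total for w in weights]
        pick = rng.choices(remaining, weights=probs, k=1)[0]
        # d log softmax / d logit_i = 1[i == pick] - p_i over the remaining steps.
        for i, p in zip(remaining, probs):
            grad[i] += (1.0 if i == pick else 0.0) - p
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen), grad

def toy_reward(schedule, num_steps):
    """Hypothetical quality proxy: penalise how stale the cache is at each step."""
    recompute = set(schedule)
    penalty, last = 0.0, None
    for t in range(num_steps):
        if t in recompute:
            last = t
        penalty += (t + 1) if last is None else (t - last)
    return -penalty

def train(num_steps=10, k=3, iters=2000, lr=0.01, seed=0):
    """REINFORCE with a moving-average baseline; returns the learned logits."""
    rng = random.Random(seed)
    logits = [0.0] * num_steps
    baseline = 0.0
    for _ in range(iters):
        schedule, grad = sample_schedule(logits, k, rng)
        reward = toy_reward(schedule, num_steps)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        for i in range(num_steps):
            logits[i] += lr * advantage * grad[i]
    return logits
```

Only sampled schedules and their scalar rewards enter the update, so nothing is differentiated through the sampler itself, which is the property the abstract highlights. In the paper the reward comes from comparing cached generations against uncached ones, and a single policy is said to adapt across budgets; this toy fixes k and uses a synthetic reward instead.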
[CV-37] LLM-Conditioned Synthesis of Pathological Gaits via Structured Gait-Language Representations CVPR
Link: https://arxiv.org/abs/2606.06048
Authors: Mritula Chandrasekaran,Sanket Kachole,Jarik Francik,Dimitrios Makris
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR MOMA Workshop 2026 and selected for spotlight presentation at the workshop
Abstract:Pathological gait datasets remain scarce due to privacy, recruitment, cost, and movement variability. Our work presents a multimodal LLM-guided framework for pathology-aware 3D gait data synthesis from structured textual descriptions. The proposed method generates fixed-length synthetic skeleton-based gait sequences for pathological gait classification tasks. The framework combines motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation, and language-to-gait generation. A key contribution is the proposed pathological tokeniser, which is designed to preserve pathology-specific motion characteristics during discrete representation learning. Experiments suggest that the proposed synthetic sequences improve downstream classification for recurrent classifiers when combined with real data. The best result is obtained using a GRU classifier trained with real and synthetic samples, achieving 92.77% accuracy under a leave-one-subject-out protocol.
[CV-38] LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing
Link: https://arxiv.org/abs/2606.06042
Authors: Jianzong Wu,Hao Lian,Jiongfan Yang,Dachao Hao,Ye Tian,Yunhai Tong,Jingyuan Zhu,Biaolong Chen,Qiaosong Qi,Aixi Zhang,Wanggui He,Mushui Liu,Jinlong Liu,Hao Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs is a promising yet challenging frontier. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation inevitably doubles the sequence length, quadrupling the computational complexity of the self-attention mechanism and introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling and directly adding the clean source video latent to the noised target latent, this elegant design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is seamlessly integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, exhibiting exceptional superiority in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.
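The cost argument in this abstract can be checked with a few lines of arithmetic. The sketch below counts attention score pairs for a concatenated sequence versus an unchanged one, and shows an element-wise scale-and-add on toy vectors. The token count `n`, the `scale` value, and both function names are hypothetical; the abstract does not say how the scale is chosen or at which point in the pipeline the addition happens.

```python
def attention_pair_count(num_tokens):
    """Self-attention scores every token against every token: N * N pairs."""
    return num_tokens * num_tokens

def scale_and_add(noised_target, clean_source, scale):
    """Element-wise conditioning: the output has the same length as the target."""
    return [x + scale * s for x, s in zip(noised_target, clean_source)]

n = 1024  # hypothetical number of video latent tokens

# Concatenating source tokens to target tokens doubles the sequence length.
concat_cost = attention_pair_count(2 * n)
# Adding the source latent in place leaves the sequence length at n.
added_cost = attention_pair_count(n)

ratio = concat_cost // added_cost  # (2n)^2 / n^2 = 4
```

The factor of four applies to the attention score computation only; the end-to-end speedup reported in the abstract is a separate measured figure that this arithmetic does not reproduce.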
[CV-39] Texture-preserving implicit neural representation for Cone beam CT truncated reconstruction
Link: https://arxiv.org/abs/2606.06039
Authors: Genyuan Zhang,Junyao Wang,Haoran Lan,Chuandong Tan,Songtao Zhu,Fenglin Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Cone-beam computed tomography (CBCT) frequently suffers from data truncation, which introduces severe artifacts and limits the effective field of view (FOV). Existing deep learning methods for truncated CBCT reconstruction suffer from serious limitations, including a strict reliance on supervised ground truth and a failure to account for continuous 3D spatial truncation variations. To address these challenges, we introduce a self-supervised 3D reconstruction framework based on neural scene representations. By directly mapping spatial coordinates to radiodensity under projection supervision, our approach inherently bypasses traditional filtering and backprojection operations, thereby fundamentally eliminating truncation-induced ring artifacts while enabling robust continuous 3D data extrapolation. However, coordinate networks are susceptible to an inherent spectral bias, which leads to a severe loss of clinically vital high-frequency textures. To resolve this bottleneck, we further incorporate a physics-based iterative refinement module into the neural scene representation architecture. Leveraging the artifact-free, extrapolated volume from the coordinate network as an optimal initialization, this module progressively re-extracts and injects high-frequency structural information from the original projections back into the volume. Extensive experiments on both simulated and real-world datasets demonstrate that our method successfully unifies the exceptional artifact suppression and extrapolation capabilities of neural networks with the high-fidelity detail preservation of iterative algorithms.
[CV-40] ReSAGE-PAR: Representational Similarity Assessment for Generative Expansion in Pedestrian Attribute Recognition
Link: https://arxiv.org/abs/2606.06020
Authors: Pablo Ayuso-Albizu,Pablo Carballeira,Juan C. SanMiguel,Paula Moral
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
Abstract:To address the limited diversity and data scarcity in Pedestrian Attribute Recognition (PAR), we explore image synthesis using diffusion models guided by attribute-based prompts. While this enables the controlled generation of pedestrian images, it faces two critical challenges: (i) the domain gap between high-quality pre-training data and low-resolution, non-standard surveillance crops, and (ii) the need for reliable attribute verification to prevent generative hallucinations. In this paper, we introduce a robust generate-score-autolabel pipeline called ReSAGE-PAR (REpresentational Similarity Assessment for Generative Expansion in PAR) that bridges this domain gap and enables scalable, high-fidelity dataset expansion. First, we adapt pre-trained diffusion models to native PAR resolutions using a tailored LoRA-based Image-to-Image approach. Second, we extract vision-language alignment scores between the generated images and their conditioning prompts, utilizing a comprehensive prompting strategy that includes label-consistent and inconsistent complements. Finally, we formulate a Bayesian classifier that converts these continuous scores into reliable binary pseudo-labels. Extensive evaluations demonstrate the effectiveness of ReSAGE-PAR in preserving spatial priors and verifying attributes. When integrated into PAR training, ReSAGE-PAR consistently yields significant improvements, achieving gains of up to 8.7% on standard backbones and pushing state-of-the-art frameworks to new performance levels. This proves its value as an architecture-agnostic solution for scalable PAR enhancement. The complete codebase for ReSAGE-PAR is publicly available at this http URL.
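The final step, turning a continuous alignment score into a binary pseudo-label with a Bayesian classifier, can be sketched in its simplest form as follows. The abstract does not state the likelihood model, so the Gaussian class-conditional densities here are an assumption, as are the function names and the example `(mean, std)` statistics.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_positive(score, pos_stats, neg_stats, prior_pos=0.5):
    """P(attribute present | score) by Bayes' rule with Gaussian likelihoods."""
    joint_pos = gaussian_pdf(score, *pos_stats) * prior_pos
    joint_neg = gaussian_pdf(score, *neg_stats) * (1.0 - prior_pos)
    return joint_pos / (joint_pos + joint_neg)

def pseudo_label(score, pos_stats, neg_stats, prior_pos=0.5):
    """Binary pseudo-label: 1 if the posterior of 'present' is at least 0.5."""
    return 1 if posterior_positive(score, pos_stats, neg_stats, prior_pos) >= 0.5 else 0

# Hypothetical (mean, std) of alignment scores for each class.
pos_stats = (0.30, 0.05)
neg_stats = (0.20, 0.05)
```

With equal variances and a uniform prior the decision boundary sits midway between the two means; changing `prior_pos` shifts it, which is one reason to prefer a probabilistic rule over a hand-picked score threshold when attributes are rare.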
[CV-41] Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation
Link: https://arxiv.org/abs/2606.06002
Authors: Mengshi Qi,Wei Deng,Xianlin Zhang,Huadong Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error accumulation. In this paper, we consider the task as a planning problem subject to spatial and layout constraints. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making methods. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. The hierarchical representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer search steps. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. Since existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3,250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art approaches.
[CV-42] ATT-CR: Adaptive Triangular Transformer for Cloud Removal
Link: https://arxiv.org/abs/2606.05999
Authors: Yang Wu,Ye Deng,Pengna Li,Wenli Huang,Kangyi Wu,Xiaomeng Xin,Jinjun Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.
[CV-43] Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images
Link: https://arxiv.org/abs/2606.05998
Authors: Jihun Cho,Soo-Yeon Jeong,Eun-Jeong Bae,Sun-Young Ihm
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 4 pages, 5 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025
Abstract:Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which involves placing alginate or silicone material in a tray and inserting it into the patient’s oral cavity to form a negative mold, suffers from significant patient discomfort, material deformation errors, and difficulties in storage and transportation. Intraoral scanners, which directly scan oral structures in real time using structured light or laser technology, produce state-of-the-art results but are associated with substantially high equipment costs. To address these limitations, this paper proposes a software-based approach that reconstructs a 3D oral model using only ten 2D intraoral images captured from different angles, requiring no dedicated hardware devices. The proposed method reduces cost, eliminates the need for physical scanning equipment, minimises patient discomfort, and enables automated 3D reconstruction. The model is trained on the publicly available Dental3DS dataset, comprising 950 upper jaw samples, and employs MobileNetV2 as the image encoder combined with Multi-head Attention for multi-view feature fusion. The proposed model achieves an accuracy of 77.49%, measured by nearest-neighbor matching with a distance threshold of 0.035. However, predicted vertices tend to concentrate in high-density regions of the ground truth, resulting in uneven point distribution across the reconstructed model.
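The accuracy figure is described as nearest-neighbor matching with a distance threshold of 0.035. One plausible reading, offered here as an assumption rather than the paper's exact definition, is the fraction of predicted vertices whose nearest ground-truth vertex lies within that distance. The function name and the brute-force search are illustrative choices.

```python
import math

def nn_match_accuracy(predicted, ground_truth, threshold=0.035):
    """Fraction of predicted vertices with a ground-truth vertex within threshold.

    predicted, ground_truth: sequences of (x, y, z) coordinates in the same units.
    Matching is one-directional (predicted -> ground truth), which is an assumption.
    """
    if not predicted:
        return 0.0
    hits = 0
    for p in predicted:
        # Brute-force nearest neighbour; fine for a sketch, slow for large meshes.
        nearest = min(math.dist(p, g) for g in ground_truth)
        if nearest <= threshold:
            hits += 1
    return hits / len(predicted)
```

Under this reading, predictions that cluster in dense regions of the ground truth can still score well while leaving other regions uncovered, which would be consistent with the uneven point distribution the abstract reports. A symmetric measure that also checks ground truth against predictions would penalise that behaviour.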
[CV-44] Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting
Link: https://arxiv.org/abs/2606.05997
Authors: Kyriakos Chaviaras,Maria Lymperaiou,Athanasios Voulodimos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present the AILS-NTUA submission to the EXIST 2026 Lab at CLEF, addressing multimodal sexism identification and characterization in memes (Task 2) and short-form videos (Task 3). Our system follows a feature-engineered late-fusion pipeline built around gradient-boosted regression models and hierarchical post-processing. For memes, we combine visual, textual, demographic, biometric, and LLM-derived semantic indicators designed to capture high-level cues such as stereotyping, objectification, irony, and misogyny. For videos, we investigate the effect of feature selection, frame-based visual representations, OCR-based textual features, acoustic descriptors, and sensor-derived metadata. Development results show that focused LLM-derived semantic cues improve meme sexism identification, while video performance is highly sensitive to feature dimensionality and cross-modal noise. For videos, development results favor compact feature selection, but official test results show that this conclusion does not fully transfer to unseen data, where the unfiltered representation generalizes better. Overall, our findings highlight the usefulness of targeted semantic feature engineering for static memes and the need for more robust temporal modeling in noisy short-form video settings.
[CV-45] Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet MLLM Text Encoder
Link: https://arxiv.org/abs/2606.05981
Authors: Yoshiyuki Ootani
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 12 tables. Under review at IEEE Transactions on Circuits and Systems for Video Technology. Code, evaluation harness, and the released v3 Temporal LLLite adapter weights are at this https URL (also mirrored to Hugging Face and Zenodo)
Abstract:Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side-stream / main-stream CUDA pipelining with batched text-encoder amortisation (and optional static-prompt caching), a compile-friendly ControlNet-LLLite reformulation that folds the entire U-Net + adapter stack into a single fused graph, and a periodic conditioning-refresh schedule with a hook subset that amortises the per-frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video-rate streaming throughput rather than interactive low latency, and locate our numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim. For the trained oil-painting style, the released temporal adapter generalises within in-clip noise to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips from seven sources; prompt-level generalisation to unseen style families is bounded and reported separately.
[CV-46] T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation
Link: https://arxiv.org/abs/2606.05975
Authors: Jingkun Feng,Reza Sabzevari
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time-consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to the state of the art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.
[CV-47] Faithful Enriched and Precise: Benchmarking Natural-Science Illustration Generation by T2I models
Link: https://arxiv.org/abs/2606.05949
Authors: Yifan Chang,Jiaxin Ai,Jianwen Sun,Yuandong Pu,Siqi Luo,Liangliang Zhao,Yuchen Ren,Minghao Liu,Yunfei Yu,Yu Qiao,Kaipeng Zhang,Yihao Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. Benchmark data, atom set annotations, and evaluation code will be released.
[CV-48] Unveiling the Unknown: Open Vocabulary Object Detection with Scene Graphs
Link: https://arxiv.org/abs/2606.05916
Authors: Yi Chen,Yinghao Lu,Zhehao Li,Chenchen Yan,Jiafei Wu,Chong Wang,Jiangbo Qian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Open-vocabulary object detection seeks to identify novel object categories that were not part of the training data. Many knowledge distillation-based approaches have shown promising performance by transferring knowledge from pre-trained vision-language models to object detection. However, these methods often overlook structured, image-specific relationships between objects, such as interactions and spatial arrangements. This oversight can significantly restrict the effectiveness of detecting novel categories. To address this issue, we propose a Scene-guided Relational Modeling detection framework. This framework utilizes scene graphs to capture structured semantic and spatial relationships between candidate regions and their contextual objects. It explicitly models interactions among neighboring regions and incorporates a Relation Attention Module to implicitly amplify the key relational cues extracted from the scene graph. Furthermore, we present a scene-based textual alignment branch that distills category knowledge from captions to guide relational alignment. This approach facilitates a seamless integration of visual relations with semantic information for enhanced detection performance. Comprehensive experiments show that our model achieves superior performance compared to other OVOD methods, improving the AP for novel categories on COCO and LVIS datasets.
[CV-49] CamFlow: Hybrid Motion Bases for 2D Camera Motion Estimation with Stabilization Applications
Link: https://arxiv.org/abs/2606.05915
Authors: Haipeng Li,Zhen Liu,Zhanglei Yang,Hai Jiang,Tianhao Zhou,Zhengzhe Liu,Ping Tan,Bing Zeng,Shuaicheng Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Estimating 2D camera motion is fundamental to computer vision and computational photography. Existing homography-based methods work well for planar scenes or pure rotation, but struggle with camera translation, depth variation, and local parallax; local homography and mesh-based models improve flexibility but still rely on piecewise planar assumptions. We introduce CamFlow+, a hybrid-basis framework that represents 2D camera motion directly in dense-flow space. CamFlow+ combines homography-derived physical bases, stochastic bases sampled from homography flows, and depth-translational bases derived from depth and camera intrinsics, relaxing the single-plane constraint while preserving camera-motion regularity. A depth-aware smoothness term further regularizes translation-induced parallax in continuous-depth regions while preserving motion changes near depth boundaries. We evaluate CamFlow+ on GHOF-Cam, a camera-motion benchmark that masks out dynamic objects and ill-posed occlusion regions in an optical-flow benchmark to isolate camera-induced motion. Experiments show that CamFlow+ improves sparse and dense camera-motion estimation. In digital video stabilization, CamFlow+ also improves global and local stability, achieving the best top-1 preference rate in a blind user study. Code and datasets will be available on the project page: this https URL.
[CV-50] Self-Learning Expression Deformations for Data-Efficient Gaussian Avatars
Link: https://arxiv.org/abs/2606.05912
Authors: Jiahao Yang,Xiaohang Yang,Qing Wang,Yilan Dong,Gregory Slabaugh,Shanxin Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Modeling dynamic facial expressions using 3D Gaussian representations remains challenging due to their unstructured nature. Conventional Gaussian avatar pipelines require extensive multiview and sequential expression data, limiting scalability and accessibility. In this work, we introduce Self-Adaptive Gaussian Expression (SAGE), a framework for self-learning expression-induced Gaussian deformations that enables high-fidelity, animatable avatars from minimal input data. Our method jointly optimizes 2D Gaussian surfels and a Signed Distance Field (SDF) to enforce compact, surface-aligned Gaussian distributions, while a self-supervised expression learning phase replaces long training sequences with geometric and appearance consistency constraints. This design allows flexible deployment across multiple reconstruction regimes: in the multiview setting, only a single frame (timestep) is required instead of thousands; in the monocular setting, only head rotations are needed without expression sequences; and in the one-shot setting, no pretraining or priors are necessary. Experiments demonstrate that our approach achieves reconstruction and animation quality comparable to state-of-the-art methods, while reducing data requirements by several orders of magnitude. Our results highlight the potential of self-supervised Gaussian deformation learning as a step toward accessible, data-efficient avatar creation.
[CV-51] Resonant Minds: Closed-Loop Social Avatars with Theory of Mind
Link: https://arxiv.org/abs/2606.05896
Authors: Jianxu Shangguan,Jing Xu,Hang Ye,Xiaoxuan Ma,Yizhou Wang,Wentao Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners’ multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.
[CV-52] Geometry-Aware Dataset Condensation for Diffusion Model Training ICML2026
Link: https://arxiv.org/abs/2606.05883
Authors: Xiao Cui,Yulei Qin,Mo Zhu,Wengang Zhou,Hongsheng Li,Houqiang Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026
Abstract:Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, preserving the geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to achieve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Code is available at this https URL.
[CV-53] LadderMan: Learning Humanoid Perceptive Ladder Climbing
Link: https://arxiv.org/abs/2606.05873
Authors: Siheng Zhao,Yuanhang Zhang,Ziqi Lu,Pieter Abbeel,Rocky Duan,Koushil Sreenath,Yue Wang,C. Karen Liu,Guanya Shi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at this https URL.
[CV-54] Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns
Link: https://arxiv.org/abs/2606.05872
Authors: Olasimbo Ayodeji Arigbabu
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 Tables
Abstract:AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important aspects of agent behavior: whether an agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework for measuring agent behavior through entropy. Rather than treating intelligence as only final task completion, EEA studies the structure of the agent's decision process. The framework introduces action entropy, trajectory entropy, tool entropy, information gain, exploration efficiency, and robustness entropy. These metrics are intended to complement, not replace, traditional evaluation methods. We also present a practical Python implementation designed to integrate with agent frameworks such as LangChain, Google ADK, custom agent loops, and stored observability traces.
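The abstract names action entropy and tool entropy but does not define them. As a rough illustration only, the sketch below assumes they are the Shannon entropy of the empirical distribution over an agent's logged actions or tool calls; the function names and the example trace are hypothetical and are not taken from the paper's implementation.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(items):
    """Entropy divided by log2(number of distinct items); 0.0 if fewer than two."""
    distinct = len(set(items))
    if distinct <= 1:
        return 0.0
    return shannon_entropy(items) / math.log2(distinct)


# Hypothetical trace of one agent run (assumed format: one label per step).
trace = ["search", "search", "read", "search", "answer"]
print("action entropy (bits):", shannon_entropy(trace))
print("normalized action entropy:", normalized_entropy(trace))
```

Under this reading, a value near zero would indicate rigid repetition of one action, while a normalized value near one would indicate near-uniform use of all observed actions.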
[CV-55] Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
Link: https://arxiv.org/abs/2606.05833
Authors: Haibo Wang,Lifu Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model’s internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
[CV-56] Gender Artifacts from Art History to Text-to-Image Generation
Link: https://arxiv.org/abs/2606.05829
Authors: Piera Riccio,Miriam Doh,Benedikt Höltgen,Noa Garcia,Nanne van Noord
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Artistic styles are rooted in specific socio-historical contexts that encode social hierarchies, including distinct constructions of gender. Yet in AI research, style has long been treated as a surface-level visual property: a filter of color, brushstroke, and texture applied to otherwise content-neutral scenes. We introduce the first dataset to investigate the interplay between gender representation and style in both historical and generated images. StyleGender comprises 74k images spanning 19 artistic styles and includes art historical images with style and gender annotations, T2I-generated images under controlled style and gender prompts, and a semantically aligned set enabling direct art history-to-generation comparison. By proposing two Set Gender Artifact (SGA) metrics (PixelSGA and MaskSGA), capturing gender signals at the pixel level and in compositional structure, we show that (1) gender representation shapes visual features across artistic styles, (2) style keywords carry these patterns into T2I generation, and (3) generative models tend to amplify gender artifacts beyond what is observed in historical sources.
[CV-57] Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning
Link: https://arxiv.org/abs/2606.05816
Authors: Jihun Cho,Soo-Yeon Jeong,Sun-Young Ihm
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-related patterns rather than contextual emotional understanding. This paper proposes an emotion-aware text-to-image pipeline that generates children’s hand drawing style images from short Korean diary entries. The proposed pipeline employs Qwen3-8B for recognising implicit sentiment from short diaries, and Stable Diffusion 3.5 Medium fine-tuned with LoRA on children’s drawing images with emotion-based trigger words for image generation. Additionally, this paper presents experiments examining the effect of emotion trigger words on generated images and discusses the limitations of CLIP Score as an evaluation metric for emotion-aware image generation.
[CV-58] Next-Generation Parallel Decoder for LPDR: Architectural Optimization and Class-Balanced GAN-Augmentation
Link: https://arxiv.org/abs/2606.05785
Authors: Shawaiz Obaid,Nida Chandio,Neha Jamil,Muhammad Khuram Shahzad
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures
Abstract:Real-Time License Plate Detection and Recognition (LPDR) forms the backbone of modern smart cities. Although the YOLOV5-PDLPR model substantially improved system efficiency through a parallel decoder approach, its performance is still affected by spatial character mismatches and data imbalance within the training set. This paper addresses these limitations by introducing Cross-Spatial Hybrid Attention (CSHA) and Class-Balanced Synthetic Augmentation (CBSA). An extensive study involving 75,000 synthetic samples is conducted and evaluated on four benchmarks: CCPD, CLPD, PKU, and an application-specific dataset. Experimental results demonstrate a substantial improvement in the recognition rate of minority provincial license plates from 78.2% to 91.5% while maintaining real-time processing performance of 152 FPS. The results indicate that spatially-aware parallel decoding combined with class-balanced augmentation provides an effective solution for high-speed license plate recognition systems.
[CV-59] Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment
Link: https://arxiv.org/abs/2606.05778
Authors: Qifei Jia,Xintong Yao,Minghao Li,Yajie Chai,Qiming Lu,Baoyue Shen,Yasen Zhang,Runyu Shi,Ying Huang,Yue Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Traditional Image Aesthetic Assessment (IAA) methods mainly rely on regressing absolute Mean Opinion Scores (MOS). However, such a paradigm overlooks the inherently dynamic nature of human aesthetic perception, which relies on subconscious comparison against implicit visual references. Consequently, the lack of causal reasoning regarding aesthetic differences prevents models from learning generalizable aesthetic principles, thus limiting their generalization across diverse scenarios. In this work, we rethink the IAA task and propose Relative Edit-induced Difference Aesthetic learning (RED-Aes), a novel framework that leverages controllable image editing models to simulate the human aesthetic reasoning process. Instead of fitting absolute score distributions, RED-Aes explicitly learns the visual factors that drive aesthetic changes. To support this paradigm, we construct the RED-20k dataset, which comprises editing-based image pairs, quantitative aesthetic differences, and Chain-of-Thought (CoT) reasoning. Furthermore, we introduce a three-stage training strategy guided by a relative ranking consistency reward, optimizing the model solely via relative supervision. Extensive experiments demonstrate that RED-Aes achieves state-of-the-art performance on multiple public benchmarks, exhibiting superior generalization capabilities.
[CV-60] LiAuto-GeoX: Efficient Grounded Driving Transformer
Link: https://arxiv.org/abs/2606.05774
Authors: Jiawei Lian,Haoyi Sun,Yang Wu,Lifu Mu,Siyuan Wang,Le Hui,Ning Mao,Tao Wei,Pan Zhou,Kun Zhan,Jian Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency demanded by dynamic driving environments. To bridge this gap, we present LiAuto-GeoX, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach begins by learning a high-capacity driving geometry model from large-scale surround-view data, utilizing sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then instantiate this capability into a highly compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. This framework employs mask-guided depth-aware distillation to retain fine-grained metric structures by emphasizing geometrically informative regions, and relative-pose relational distillation to enforce cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations reveal that LiAuto-GeoX runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry transfers seamlessly to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. Together, these results demonstrate that efficient dense 3D reconstruction can transcend its traditional role as a perception target to serve as a scalable, foundational geometric representation for next-generation autonomous driving.
[CV-61] Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction
Link: https://arxiv.org/abs/2606.05769
Authors: Tianxiang Jiang,Linquan Wu,Sheng Xia,Songze Li,Ziang Yan,Haoyu Yang,Yu Qiao,Yi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Abstract:Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
[CV-62] ExpSpeech-Net: Multimodal Fusion of Expression and Speech for Deepfake Detection
Link: https://arxiv.org/abs/2606.05760
Authors: Ruchika Sharma,Rudresh Dwivedi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deepfake videos are increasingly challenging the credibility of online content. Many existing detection methods rely on complex, resource-intensive models, which limits their practical use. This study introduces the ExpSpeech-Net deepfake detection (SqN-R-DFD) model, which utilizes SqueezeNet and RNN (Recurrent Neural Network) as its backbone, providing a lightweight and efficient deepfake detection framework that simultaneously analyzes facial expressions and speech patterns. The approach incorporates advanced feature extraction, such as ISLBT-based features for images and MPNCC for signals, along with a smart feature-selection strategy using SASMA (Sandpiper-Assisted Slime Mould Algorithm), ensuring optimal and balanced input to the detection models. By combining SqueezeNet and an RNN, subtle inconsistencies in deepfake videos are captured effectively. The framework achieves 94.5% accuracy, precision of 99.3%, and F-measure of 96.8%, outperforming conventional methods. This demonstrates that integrating multiple modalities with intelligent preprocessing and feature selection enables practical, real-time deepfake detection suitable for everyday applications.
[CV-63] Physics-Guided Deep Unfolding for Blind Cross-Sensor Spectral Super-Resolution via Learning the Spectral Transformation Function
Link: https://arxiv.org/abs/2606.05759
Authors: Zhaolin Li,Jinsong Chen,Shanxin Guo,Tuo Zhang,Xinglong Zhang,Pan Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Hyperspectral imaging provides rich spectral information for quantitative remote sensing, yet hyperspectral sensors remain costly and thus unavailable in many UAV deployments. Spectral super-resolution (SSR) seeks to reconstruct hyperspectral images (HSIs) from multispectral images (MSIs). Most existing SSR methods assume a fixed and known spectral response function (SRF) and are therefore limited to single-sensor settings. In practical cross-sensor scenarios, the spectral degradation from HSI to MSI is unknown and varies with sensor characteristics and scene content, which renders HSI reconstruction ill-posed. This paper proposes a physics-guided deep unfolding network, termed PGU-Net, to address blind cross-sensor SSR by jointly estimating the HSI and a learnable spectral transformation function (STF). PGU-Net unrolls an alternating optimization procedure into a multi-stage, end-to-end trainable architecture, where each stage sequentially updates the HSI and the STF. Both modules combine learnable proximal networks with differentiable closed-form solvers, enabling physical interpretability while retaining strong representation capacity. Experiments on benchmark datasets (CAVE and NTIRE 2022) with multiple SRFs demonstrate accurate recovery of the STF (degradation operator) and improved reconstruction performance over state-of-the-art SSR methods. Furthermore, evaluations on a real UAV cross-sensor dataset (Headwall Nano HSI and DJI P4 Multispectral MSI) verify the effectiveness and robustness of PGU-Net under truly blind conditions, and suggest that the estimated STF may exhibit land-cover-related differences.
[CV-64] DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models
Link: https://arxiv.org/abs/2606.05758
Authors: Zhuoming Liu,Jinhong Lin,Kwan Man Cheng,Lin Zhang,Shayok Bagchi,Yin Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.
[CV-65] Cosine Misleads: Auxiliary Losses Reshape Vision Language Models Not Their Latents
Link: https://arxiv.org/abs/2606.05753
Authors: XiuYu Zhang,Junfeng Fang,Zhenkai Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
[CV-66] Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models
Link: https://arxiv.org/abs/2606.05737
Authors: Yitong Chen,Shiduo Zhang,Jingjing Gong,Xipeng Qiu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 20 pages, 10 figures
Abstract:Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
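The abstract says the training time distribution is biased toward high-noise states but does not give the distribution. The sketch below is one hypothetical way to do that, assuming a linear interpolation path where t = 1 is pure noise and drawing t from Beta(alpha, 1); the Beta choice, the value of alpha, and all function names are assumptions for illustration, not the paper's recipe.

```python
import random


def sample_uniform_time(rng):
    """Baseline: t drawn uniformly from [0, 1)."""
    return rng.random()


def sample_high_noise_time(rng, alpha=3.0):
    """Assumed biased schedule: Beta(alpha, 1) puts more mass near t = 1 (noise) when alpha > 1."""
    return rng.betavariate(alpha, 1.0)


def noisy_action(action, noise, t):
    """Linear path x_t = (1 - t) * action + t * noise (assumed convention)."""
    return [(1 - t) * a + t * n for a, n in zip(action, noise)]


def velocity_target(action, noise):
    """Velocity of the linear path: d x_t / dt = noise - action."""
    return [n - a for a, n in zip(action, noise)]


def one_step_decode(noise, predicted_velocity):
    """Single Euler step from t = 1 to t = 0: action = noise - velocity."""
    return [n - v for n, v in zip(noise, predicted_velocity)]


rng = random.Random(0)
t = sample_high_noise_time(rng)
action, noise = [0.5, -1.0], [1.25, 0.5]
x_t = noisy_action(action, noise, t)       # network input during training
target = velocity_target(action, noise)    # regression target during training
print(one_step_decode(noise, target))      # an ideal predictor recovers the action
```

One-step decoding evaluates the network only at t = 1, so concentrating training samples near that end is a plausible reading of why such a bias would help; the abstract itself reports the effect empirically.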
[CV-67] VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning
Link: https://arxiv.org/abs/2606.05736
Authors: Shufan Zhang,Ziyue Lin,Bairun Wang,Lei Jin,Xuanding Ding,Xinzhu Ma,Kunlin Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 7 figures
Abstract:Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlooking critical visual information during the inference process. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework. VTI-CoT integrates textual reasoning steps with corresponding visual frames. Given the scarcity of visual-textual interleaved CoT in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Further, reasoning over long-form videos entails increasingly long CoT token sequences, which severely hinders training convergence and efficiency. To address this, we employ Optical Character Recognition (OCR)-based compression techniques to compress CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.
[CV-68] TextWand: A Unified Framework for Scene Text Editing
Link: https://arxiv.org/abs/2606.05730
Authors: Shuyu Wang,Zhile Guan,Hongxiu Chen,Yule Duan,Weiqi Li,Xin Shan,Ronggang Wang,Jian Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. To address the absence of a comprehensive benchmark for general-purpose scene text editing among existing single-task datasets, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms existing leading open-source and closed-source models by delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation and replacement tasks.
[CV-69] ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation
Link: https://arxiv.org/abs/2606.05718
Authors: Kanghui Tian,Siyuan Liu,Ziang Yan,Sheng Xia,Shuai Dong,Yi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 11 figures. Preprint, under review
Abstract:On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher’s supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.
[CV-70] Real-Time Threat Detection from Surveillance Cameras using Machine Learning
Link: https://arxiv.org/abs/2606.05708
Authors: Gajendra Mandal,J. P. Patra,Priyansh Mahant
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Ensuring public safety in densely populated urban environments remains a critical challenge, necessitating the deployment of intelligent and automated video surveillance systems. Traditional surveillance approaches rely heavily on manual monitoring, which is inefficient and susceptible to human fatigue, delayed response, and observational errors. To overcome these limitations, this work presents a real-time object detection-based surveillance framework. The proposed system focuses on detecting guns, knives, and region-specific blunt objects commonly involved in violent activities in Indian surveillance scenarios. A key contribution of this work is the use of a custom-created dataset collected using a mobile camera, consisting of 336 labeled images of blunt objects such as iron rods, wooden sticks, and plastic rods. This dataset is combined with a publicly available dataset of 7,623 images of guns and knives, forming a consolidated dataset of 7,959 images across three classes: gun, knife, and blunt object. The combined dataset is used to train a YOLOv8-based object detection model for real-time performance. Experimental evaluation shows that increasing the training duration significantly improves recall and average precision for the blunt object class without signs of overfitting. Overall, the proposed framework achieves an effective balance between accuracy and efficiency, making it suitable for deployment in real-world surveillance environments such as campuses, public spaces, and transportation areas.
[CV-71] Parallel Jacobi Decoding for Fast Autoregressive Image Generation CVPR2026
Link: https://arxiv.org/abs/2606.05703
Authors: Boya Liao,Ying Li,Siyong Jian,Huan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction leads to significantly slower inference. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence. Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement. PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8x-6.4x acceleration across multiple autoregressive image generation models while maintaining competitive generation quality.
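For readers unfamiliar with the Jacobi-style decoding that this abstract builds on, the sketch below shows the generic one-dimensional idea on a toy problem: every position of a draft is refreshed in parallel from the previous draft until it stops changing. It does not implement PJD's two-dimensional draft expansion or its attention-mask adjustment; `next_token` is a hypothetical deterministic stand-in for greedy decoding with a real model.

```python
def next_token(prefix, vocab_size=16):
    """Hypothetical stand-in for a model's greedy next token (depends on last two tokens)."""
    last = prefix[-1] if prefix else 0
    prev = prefix[-2] if len(prefix) > 1 else 0
    return (3 * last + prev + 1) % vocab_size


def sequential_decode(prompt, n):
    """Standard autoregressive decoding: n sequential model calls."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n, init_token=0):
    """Jacobi fixed-point iteration over a draft of n tokens.

    Each iteration recomputes all positions from the previous draft; with a
    real model this would be one batched forward pass. Position i is correct
    after at most i + 1 iterations, so n iterations always suffice.
    """
    draft = [init_token] * n
    for iteration in range(1, n + 1):
        new_draft = [next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new_draft == draft:  # fixed point reached: matches sequential output
            return draft, iteration
        draft = new_draft
    return draft, n


tokens, passes = jacobi_decode([1, 2], n=8)
print(tokens == sequential_decode([1, 2], 8), passes)
```

The speedup in practice comes from drafts that converge in far fewer than n passes; the abstract's point is that in a one-dimensional sequence this gain saturates, which motivates expanding the draft spatially.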
[CV-72] Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models
Link: https://arxiv.org/abs/2606.05702
Authors: Haoyu Zhou,Qing Qing,Caichong Li,Qixin Zhang,Yongcheng Jing,Ziqi Xu,Juncheng Hu,Xikun Zhang,Renqiang Luo
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on "incorrect shortcuts", such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is available at this https URL.
[CV-73] T-SAR-JEPA: Self-Supervised Temporal Anomaly Detection in SAR Amplitude Stacks via Latent Prediction
Link: https://arxiv.org/abs/2606.05700
Authors: Kerod Woldesenbet,Abem Woldesenbet
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Won IEEE GRSS Data Fusion Contest 2026; to appear in IGARSS 2026 proceedings
Abstract:We present T-SAR-JEPA, a self-supervised framework for temporal anomaly detection in SAR amplitude stacks via latent prediction. A ViT-Base/16 encoder from SAR-JEPA is domain-adapted on 39,300 Capella patches using local masked reconstruction with gradient feature prediction. A temporal transformer with sinusoidal time encoding forecasts future latent states from K=7 acquisitions, with progressive unfreezing substantially reducing validation loss. The model operates on amplitude alone; InSAR coherence serves exclusively as independent pseudo-ground-truth. On the DFC 2026 dataset (300 time-series, three AOIs), T-SAR-JEPA achieves ROC-AUC of 77.0% on the Hawaii eruption window, outperforming RX, PaDiM, Linear AR, and LSTM baselines (~50%). Spatial coherence of 99.9% (p 0.001, permutation test) confirms structured detections. Code: this https URL
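The abstract mentions sinusoidal time encoding but does not give its exact form. A minimal sketch, assuming the standard Transformer-style sine/cosine encoding applied to acquisition times in days; the function name `time_encoding`, the embedding dimension, the `max_period` constant, and the example timestamps are all illustrative assumptions rather than details from the paper.

```python
import math

def time_encoding(t_days, dim=8, max_period=10000.0):
    """Sine/cosine encoding of a scalar timestamp (assumed form, not the paper's)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for i in range(dim // 2):
        # Geometrically spaced frequencies, as in the original Transformer.
        freq = 1.0 / (max_period ** (2 * i / dim))
        enc.append(math.sin(t_days * freq))
        enc.append(math.cos(t_days * freq))
    return enc

# Hypothetical irregular acquisition times (in days) for K=7 acquisitions.
times = [0, 3, 11, 12, 20, 33, 41]
codes = [time_encoding(t) for t in times]
```

Encoding the actual timestamp rather than the sequence index lets a temporal model see irregular gaps between acquisitions.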
[CV-74] Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning ICLR2026
Link: https://arxiv.org/abs/2606.05675
Authors: Hongye Xu,Bartosz Krawczyk
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published as a conference paper at ICLR 2026. 23 pages, 8 figures. Code: this https URL
Abstract:Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making representation drift for old classes particularly harmful. Prototype-based EFCIL is attractive for its efficiency, yet prototypes drift as the embedding space evolves; therefore, projection-based drift compensation has become a popular remedy. We show, however, that existing one-directional projections introduce systematic bias: they either retroactively distort the current feature geometry or align past classes only locally, leaving cycle inconsistencies that accumulate across tasks. We introduce BiCyc, a bidirectional projector alignment approach with a cycle-consistency objective. BiCyc jointly optimizes two maps, old-to-new and new-to-old, with stop-gradient gating so that transport and representation co-evolve. Analytically, we show that the cycle loss contracts the singular spectrum toward unity in whitened space, and that improved transport of class means and covariances yields smaller perturbations of classification log-odds, preserving old-class decisions and mitigating catastrophic forgetting. Empirically, across standard EFCIL benchmarks, BiCyc substantially reduces forgetting and improves accuracy in from-scratch settings, while remaining competitive in the pretrained fine-grained regime.
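The cycle-consistency idea behind the two maps can be sketched as a round-trip error. This is a simplified illustration that assumes plain linear maps on toy 2-D features; it does not model BiCyc's stop-gradient gating, whitening, or training loop, and every name and value below is hypothetical.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cycle_loss(old_to_new, new_to_old, old_feats, new_feats):
    """Mean squared round-trip error in both directions."""
    fwd = sum(sq_dist(matvec(new_to_old, matvec(old_to_new, x)), x) for x in old_feats)
    bwd = sum(sq_dist(matvec(old_to_new, matvec(new_to_old, y)), y) for y in new_feats)
    return fwd / len(old_feats) + bwd / len(new_feats)

# Toy 2-D features: a scaling map paired with its exact inverse is cycle-consistent.
A = [[2.0, 0.0], [0.0, 0.5]]
A_inv = [[0.5, 0.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
old = [[1.0, 2.0], [3.0, -1.0]]
new = [[2.0, 1.0], [6.0, -0.5]]

consistent = cycle_loss(A, A_inv, old, new)       # maps invert each other
inconsistent = cycle_loss(A, identity, old, new)  # maps do not invert each other
```

The loss is zero only when each map undoes the other on the given features, which is the property the abstract says one-directional projections fail to enforce.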
[CV-75] V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation ICML2026
Link: https://arxiv.org/abs/2606.05665
Authors: Tao Liu,Leela Krishna,Gouti Pavan Kumar,Sreeja K,Vishav Garg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026 workshop
Abstract:Video-to-video (V2V) generation is difficult to evaluate because outputs must both follow editing instructions and preserve frame-level correspondence with the source video, which existing T2V and I2V metrics do not capture. We introduce V2V-Bench, an 11-dimension benchmark organized into five categories: temporal alignment, structural fidelity, transformation quality, video quality, and semantic alignment. V2V-Bench pairs diverse source videos with challenging editing tasks and evaluates two commercial models, Grok Imagine and Gemini Veo3, and one open-source model, Open Sora 2. Results show complementary model strengths: Grok performs better on editing fidelity, while Veo3 achieves stronger visual quality. On six V2V-specific dimensions, V2V-Bench reaches a Spearman correlation of 0.905 with human judgments.
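Agreement with human judgments is reported as a Spearman correlation, which compares rankings rather than raw values. A minimal sketch of that computation follows; the scores and ratings are made-up illustrative numbers, not data from the benchmark, and the helper names are assumptions.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

metric_scores = [0.61, 0.74, 0.52, 0.88, 0.70]  # hypothetical automatic scores
human_ratings = [3.0, 3.5, 2.5, 4.5, 4.0]       # hypothetical human ratings
rho = spearman(metric_scores, human_ratings)
```

Because only ranks matter, a metric can reach a high Spearman correlation even when its scale differs from the human rating scale.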
[CV-76] CoFi-UCGen: Coarse-to-Fine Unsupervised Conditional Generation without Label Priors
Link: https://arxiv.org/abs/2606.05652
Authors: Shengxi Li,Zhaokun Hu,Ce Zheng,Mai Xu,Jingyuan Xia,Si Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Unsupervised conditional image generation (UCGen) aims to control generation without relying on manually annotated labels, yet remains challenging due to unstructured semantic representations across granularities. To address this, we propose a novel coarse-to-fine UCGen framework (CoFi-UCGen) that explicitly disentangles global semantics from fine-grained variations, which, to the best of our knowledge, is the first successful attempt at both coarse- and fine-grained conditional generation without any labels. More specifically, we first propose the adversarial semantic reciprocal learning theory to ensure semantic consistency and completeness between images and latent spaces. Based on this consistency, we propose bit-codes to learn a structured coarse-grained latent space, and further prove that the bit-codes carry distinct global semantics while preserving independent noise sampling for generation. Building upon these bit-codes, we establish a fine-grained semantic basis and introduce a hierarchical modulation mechanism in diffusion models, enabling layer-wise injection of coarse conditions to progressively control fine-grained attributes during generation. Extensive experiments demonstrate that without any label priors or pre-trained feature extractors, our CoFi-UCGen consistently outperforms existing UCGen methods in terms of image quality, semantic consistency, and control accuracy, verifying the effectiveness of explicit coarse-to-fine semantic decomposition for the challenging UCGen task.
[CV-77] GS-NFS: Bandwidth-adaptive Streaming of Dynamic Gaussian Splats and Point Clouds
Link: https://arxiv.org/abs/2606.05650
Authors: Rajrup Ghosh,Haodong Wang,Haoran Hong,Eduardo Pavez,Amartya Chaudhuri,Weiwu Pang,Harsha V. Madhyastha,Antonio Ortega,Ramesh Govindan
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Dynamic 3D Gaussian Splatting (3DGS) holds great promise as a 3D video streaming technology since it can represent complex 3D scenes with high fidelity. In this approach, every frame in a 3D video represents the environment as a collection of Gaussians with position and other attributes such as scale, rotation, opacity, and color. Frames capture fine details, permit views from any arbitrary perspective, but are an order of magnitude, or more, larger than 2D video frames. A line of recent work has explored how to compress dynamic 3DGS frames, but these approaches are often slow, in part because their compression techniques are not amenable to efficient acceleration. GS-NFS accelerates dynamic 3DGS compression and decompression on a GPU, to the point where it can encode and decode at full frame rate. It achieves this by developing novel GPU-based parallelizations of existing algorithms for encoding both positions and attributes of Gaussians. As a result, it is 1-2 orders of magnitude faster than the state-of-the-art in encoding and decoding a frame, while offering competitive compression performance and rendering quality.
[CV-78] Multi-Task Crack Foundation Model for Engineering-Reliable Crack Representation and Topology Preservation in Civil Infrastructure
Link: https://arxiv.org/abs/2606.05641
Authors: Blessing Agyei Kyem,Joshua Kofi Asamoah,Eugene Denteh,Armstrong Aboah
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 60 pages, 17 figures, 11 tables
Abstract:Reliable crack assessment requires not only accurate pixel-level masks but also connected crack geometry and confidence estimates that remain stable under domain shift. However, existing segmentation models can achieve high overlap scores while fragmenting cracks, missing fine branches, and providing no calibrated uncertainty. To address this gap, this paper proposes CrackGeoFM, a multi-task framework that combines a frozen visual foundation backbone with crack-specific adaptation for mask prediction, skeleton reconstruction, and uncertainty estimation. The framework integrates a Frequency-Guided Crack Enhancement Module (FCEM) to enhance high-frequency crack cues, a Crack-Domain Feature Adaptation Module (CFAM) to adapt frozen backbone features to crack-domain patterns, and a Structure-Aware Multi-Task Decoder (SMTD) to jointly decode masks, skeletons, and uncertainty. Across 20 crack datasets, CrackGeoFM achieves state-of-the-art segmentation, improved topology preservation, calibrated uncertainty, and effective few-shot adaptation with only five labeled images. These results support reliable, generalizable, and engineering-oriented crack analysis for infrastructure assessment.
[CV-79] ShotCrop3: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions
Link: https://arxiv.org/abs/2606.05635
Authors: Dehong Kong,Lina Lei,Lingtao Zheng,Chenyang Wu,Ailing Zhang,Xinran Qin,Teng Ma,Jiaqi Xu,Zhixin Wang,Zhikai Chen,Xuecheng Qi,Renjing Pei,Fan Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set – establishing, medium, and close-up – from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored for it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.
[CV-80] KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion
Link: https://arxiv.org/abs/2606.05624
Authors: Tengjiao Sun,Pengcheng Fang,Xiaoyu Zhan,Yanwen Guo,Dongjie Fu,Xiaohao Cai,Hansung Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
Abstract:Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting the pretrained text-conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per-layer control access or move much of the cost to test-time optimization. We introduce KV-Control, a compact attention-side control interface for frozen masked text-to-motion transformers. The key idea is to make geometric constraints available as memory inside self-attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co-design a part-tokenized motion substrate and controller: PartVQ learns anatomy-aligned part codebooks, T-Concat exposes each frame–part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints with sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality. KV-Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text-to-motion generation.
[CV-81] What's Under the Skin? Estimating Swine Body Condition
Link: https://arxiv.org/abs/2606.05611
Authors: Mk Bashar,Kuljit Bhatti,Gary Rohrer,Madonna Benjamin,Tami Brown-Brandl,Daniel Morris
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Sow body condition is an important indicator for growers as it has a large impact on lactation performance and piglet survival. However, body condition measures used during production, such as visual scoring and calipers, correlate poorly with underlying tissue composition. Ultrasound scans can provide direct measurements of subcutaneous backfat thickness and loin muscle depth, but their operation is labor intensive and not scalable for production. We present PigFormer, an end-to-end two-stage system that takes raw depth frames from a ceiling-mounted RGB-D camera and predicts subcutaneous backfat thickness, loin muscle depth, and total tissue thickness at the last rib. Stage 1 is a geometric front-end that converts raw depth into a standardized height map via SAM3-to-MaskDINO segmentation distillation, ground-plane removal, and orientation normalization. Stage 2 is a Slice Attention Encoder that treats each height map as a sequence of cross-sectional slices and captures spatial relationships along the full dorsal surface. On a multi-site dataset of 319 sow and gilt instances from two facilities, PigFormer achieves 2.43 mm backfat MAE and 3.87 mm overall MAE. It outperforms strong single-stage ResNet-18 and ViT-small baselines. PigFormer offers a practical path toward continuous, automated, non-contact body condition monitoring in commercial swine production. Code is available at this https URL.
[CV-82] HDST-GNN: Heterogeneous Dynamic Spatiotemporal Graph Neural Networks for Multi-Object Tracking in UAV Aerial Imagery
Link: https://arxiv.org/abs/2606.05587
Authors: Phillip Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 4 figures, 6 tables
Abstract:Multi-object tracking (MOT) from UAV imagery presents unique challenges: altitude varies across sequences, objects are small and densely packed, and frequent occlusion causes identity switches. Existing graph-based trackers assume fixed spatial context and treat all objects uniformly, ignoring the heterogeneous lifecycle states of detections, active tracklets, and lost targets. We propose HDST-GNN, a Heterogeneous Dynamic Spatiotemporal Graph Neural Network with three novel contributions. First, Altitude-Adaptive Edge Construction estimates a camera-altitude proxy from mean object area and adjusts the graph connectivity radius accordingly. Second, Heterogeneous Node Representation models detections (Type-D), confirmed tracklets (Type-T), and lost tracklets (Type-L) as distinct node types with dedicated projections and typed edge relations. Third, Occlusion-Gated Temporal Aggregation gates each node’s attention contribution by its occlusion confidence, preventing occluded nodes from corrupting neighbour embeddings. HDST-GNN is trained end-to-end with a differentiable Sinkhorn head using joint cross-entropy and triplet loss. On VisDrone2019-MOT with oracle detections, HDST-GNN achieves 94.51% MOTA and 97.24% IDF1, outperforming SORT by +5.0 MOTA points and reducing identity switches by 81%. With real YOLOv8n detections, HDST-GNN reduces identity switches by 49% vs. SORT. Ablation studies confirm the independent contribution of each component.
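The Sinkhorn head mentioned above turns pairwise affinity scores into a soft assignment between tracklets and detections. A minimal sketch of plain Sinkhorn normalization follows; it is not the paper's differentiable head and omits any learned components. The affinity values, the `temperature` parameter, and the iteration count are illustrative assumptions.

```python
import math

def sinkhorn(scores, n_iters=50, temperature=1.0):
    """Alternate row and column normalization of exp(scores / temperature)."""
    p = [[math.exp(s / temperature) for s in row] for row in scores]
    n_rows, n_cols = len(p), len(p[0])
    for _ in range(n_iters):
        # Make every row sum to 1.
        for r in range(n_rows):
            total = sum(p[r])
            p[r] = [v / total for v in p[r]]
        # Make every column sum to 1.
        for c in range(n_cols):
            total = sum(p[r][c] for r in range(n_rows))
            for r in range(n_rows):
                p[r][c] /= total
    return p

# Hypothetical affinity scores: rows are tracklets, columns are detections.
affinity = [[2.0, 0.5, 0.1], [0.4, 1.8, 0.6], [0.2, 0.5, 2.2]]
assignment = sinkhorn(affinity)
matches = [max(range(3), key=lambda c: row[c]) for row in assignment]
```

For a square matrix the iterations converge toward a doubly stochastic matrix, so each tracklet's probability mass is shared across detections and vice versa. Every step is a smooth operation, which is what makes such a layer usable inside end-to-end training.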
[CV-83] BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection
Link: https://arxiv.org/abs/2606.05586
Authors: Wenlin Liu,Xikun Hu,Ping Zhong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:In remote sensing object detection, Convolutional Neural Networks (CNNs) excel at capturing local details while Vision Transformers (ViTs) are better at global context modeling. However, existing detectors typically rely on a single fixed backbone or a manually designed hybrid architecture, and thus fail to adaptively exploit these complementary strengths across inputs of diverse complexity. To address this limitation, we propose Backbone Module Composition via Reinforcement Learning (BMCR). BMCR dynamically assembles input-adaptive inference paths from reusable modules decomposed from off-the-shelf CNN and ViT backbones. To enable such cross-family composition, we first construct an extensible module toolbox. Specifically, we decompose representative CNN and ViT backbones into reusable functional modules and encapsulate each module with explicit structural, semantic, and computational metadata for compatibility-aware assembly. To bridge the gap between grid-based CNN features and token-based ViT representations, we design a lightweight Optimal Transport (OT) based transition interface that ensures distribution-aware alignment while respecting spatial consistency. The backbone composition process is then formulated as a sequential decision problem, in which a policy network progressively selects task-relevant modules according to intermediate multi-scale observations. To stabilize the joint optimization of reusable modules and the routing policy, we further develop an Adaptive Module Cooperative Optimization (AMCO) strategy that coordinates module updating, routing exploration, and reward assignment during training. On DOTA-v1.0, DOTA-v1.5 and DIOR-R, BMCR achieves 79.31%, 73.41% and 71.86% mAP, respectively, surpassing strong static and dynamic baselines by up to 2.5 points while maintaining competitive efficiency.
[CV-84] Monte Carlo Steklov Operators for Large-Scale Geometry Processing in the Wild
Link: https://arxiv.org/abs/2606.05581
Authors: Arman Maesumi,Tanish Makadia,Aruna Anderson,Oras Phongpanangam,Justin Solomon,Daniel Ritchie
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages
Abstract:Intrinsic methods fill the default toolbox for geometry processing on meshes. Intrinsic operators, in particular the Laplacian, underlie methods that require invariance to isometry and have hence been employed in many algorithms for shape analysis, learning, and editing. However, intrinsic methods are predicated on assumptions that quickly become brittle when working with in-the-wild geometry, where (i) mesh quality is not guaranteed, and (ii) many meshes are modeled with multiple connected components. In such settings, volumetric constructions are better-defined, since restrictions on surface topology can be relaxed. This paper presents a Monte Carlo method for estimating the Dirichlet-to-Neumann (DtN) operator – a boundary-to-boundary volumetric operator – and its associated Steklov eigenmodes. We build on recent developments in Monte Carlo geometry processing by casting this boundary operator itself as the subject of estimation. The DtN operator, defined through a volumetric stochastic process, is then generalized to the exterior domain, where it couples disconnected components through the surrounding ambient space. We show that our method is orders of magnitude faster than existing boundary-element approaches for computing Steklov spectra while remaining robust to poor triangulations, high-resolution meshes, and multi-component geometry. To demonstrate this scalability, we compute interior and exterior Steklov eigenspectra for approximately 450,000 shapes from the uncurated Objaverse dataset. We incorporate these operators into Steklov-CLIP, a mesh-based neural network that uses volumetric spectral operators for large-scale contrastive 3D representation learning. The resulting network learns semantically meaningful global and dense shape representations, illustrating that geometrically-principled volumetric operators can be made practical at the scale of modern 3D datasets.
[CV-85] UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning
Link: https://arxiv.org/abs/2606.05576
Authors: Gexin Huang,Yanting Yang,Myeongkyun Kang,Beidi Zhao,Jun Zhou,Chen Zhou,Gang Wang,Zu-hua Gao,Xiaoxiao Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 1 figure
Abstract:Vision-language models (VLMs) excel on visual question answering and multimodal reasoning benchmarks. Yet their capability on ultra-resolution images - where critical evidence is tiny, subtle, spatially distant, or distributed - remains unclear. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence. We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-resolution images. UltraVR spans four high-value scenarios: CCTV surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD). These domains pose complementary challenges: fine-grained object grounding in crowded CCTV scenes, long-range spatial comparison in RS, multi-scale evidence navigation in WSI, and subtle irregularity detection in repetitive industrial layouts. Beyond standard QA triples, each instance includes a structured ground-truth chain of thought with step-level questions, intermediate answers, and reasoning labels. These labels decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference, enabling process-level diagnosis over black-box scoring. Using UltraVR, we evaluate frontier VLMs and show that current models remain far from reliable on ultra-resolution reasoning. Importantly, the structured annotations allow us to localize failures across the visual-to-decision pipeline: errors concentrate in evidence grounding and local perception, while downstream inference often recovers when intermediate visual facts are supplied. These findings demonstrate UltraVR as a diagnostic testbed for measuring not only whether VLMs answer correctly, but where their ultra-resolution reasoning process breaks.
[CV-86] Dual Feature Decoupling for Fine-Grained OOD Detection
Link: https://arxiv.org/abs/2606.05536
Authors: Xiaokun Li,Yaping Huang,Qingji Guan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Out-of-distribution (OOD) detection is an indispensable technique when applying machine learning models to real-world scenarios. Most existing OOD detection methods have been developed under the idealized assumption of large inter-class distributional differences, while largely overlooking fine-grained tasks characterized by subtle variations, such as medical image classification and vehicle recognition. The high visual similarity among fine-grained subcategories, together with the interference of background factors, makes OOD detection extremely challenging. To tackle this problem, we propose a novel Dual Feature Decoupling Network (DFDNet), which addresses fine-grained OOD detection from the perspective of feature disentanglement. The proposed DFDNet comprises two key components: a spatial-frequency decoupling module and a reconstruction-guided decoupling module. The spatial-frequency decoupling module is designed to preserve content features that are discriminative for classification while suppressing task-irrelevant style information. The reconstruction-guided decoupling module, in turn, introduces a novel pixel-level adversarial reconstruction task to further remove low-level, non-discriminative information and enhance category-specific high-level semantic representations. Extensive experiments demonstrate that our method achieves competitive performance improvements on multiple datasets.
[CV-87] Noise-Aware Visual Representation Learning for Medical Visual Question Answering
Link: https://arxiv.org/abs/2606.05535
Authors: I Putu Adi Pratama,Bahadorreza Ofoghi,Atul Sajjanhar,Shang Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 2 figures. Conference submission
Abstract:Medical visual question answering (Med-VQA) has strong potential for clinical decision support by enabling AI models to interpret medical images and answer clinically relevant queries. Recent approaches typically connect off-the-shelf vision encoders with large language models (LLMs) through lightweight mapping networks to reduce computational cost. However, these methods often overlook the importance of handling noise and small irrelevant changes in visual representations. To address these challenges, we propose a noise-aware Med-VQA framework that incorporates a denoising autoencoder before visual embeddings are mapped into the input space of an LLM. The denoising autoencoder is pretrained to reconstruct clean visual embeddings from corrupted inputs, encouraging the model to learn robust visual representations that are less sensitive to noise. The resulting embeddings are then projected into the language model embedding space using a multi-layer perceptron (MLP), forming visual prefix tokens that provide image information to the LLM. To enable efficient adaptation without full retraining, we employ parameter-efficient fine-tuning using low-rank adaptation (LoRA). The proposed method is evaluated on the SLAKE and PathVQA benchmarks. Experimental results show improved robustness to noisy input embeddings while maintaining competitive clean performance across multiple evaluation criteria. These findings suggest that learning more robust visual representations can enhance Med-VQA performance and robustness.
[CV-88] What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning
Link: https://arxiv.org/abs/2606.05533
Authors: Rohan Siva,Neel P. Bhatt,Yunhao Yang,Seoyoung Lee,Nishant Gadde,Christian Ellis,Alvaro Velasquez,Zhangyang Wang,Ufuk Topcu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Code, videos, and data available at: this https URL
Abstract:Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a “cart” based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is “movable”), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., “movable”). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points, improves new-affordance inference accuracy from 70% to over 90% with less than 10% of the original training data, and enables 100x faster inference. Code, videos, and data available at: this https URL.
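The proximity-based inference and the uncertainty trigger can be sketched with nearest-prototype matching. This is an assumed simplification for illustration: cosine similarity, the fixed `threshold`, and the toy 3-D prototypes are hypothetical choices, and the abstract does not specify how A4D measures proximity or decides when to discover a new affordance.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def infer_affordance(embedding, prototypes, threshold=0.8):
    """Return (best_label, similarity, needs_discovery).

    needs_discovery is True when even the closest known affordance is far
    away, i.e. the inference is too uncertain to trust.
    """
    best_label, best_sim = max(
        ((label, cosine(embedding, proto)) for label, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )
    return best_label, best_sim, best_sim < threshold

# Hypothetical affordance prototypes in a toy 3-D latent space.
prototypes = {"movable": [1.0, 0.0, 0.0], "openable": [0.0, 1.0, 0.0]}
known = infer_affordance([0.9, 0.1, 0.0], prototypes)
novel = infer_affordance([0.1, 0.1, 1.0], prototypes)
```

In this sketch the first observation lands near the "movable" prototype and is accepted, while the second is far from every prototype and would be routed to a discovery step.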
[CV-89] BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding
Link: https://arxiv.org/abs/2606.05515
Authors: Muhammad Usama,Didier Stricker,Mohammad Sadil Khan,Muhammad Zeshan Afzal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Learning representations of CAD models is a largely open problem. While 3D representation learning has flourished around point clouds and meshes, the native format of CAD, the boundary representation (BRep), which encodes exact parametric surfaces, curves, and their topology, has received little attention as a representation learning substrate. We introduce BRepCLIP, the first framework to align BRep geometry with language and image embeddings through contrastive pretraining. We model each CAD object as a sequence of face and edge tokens with separate discrete vocabularies for surface and curve geometry, augmented with spatial and semantic descriptors that capture surface types (e.g., cylindrical, torus, NURBS) and curve primitives (e.g., line, arc, B-spline). A transformer encoder aggregates these tokens into a global BRep embedding, aligned with CLIP’s text and image encoders via a joint contrastive objective. BRepCLIP generates more discriminative and semantically grounded embeddings than existing point-based alternatives, improving Top-1 retrieval over OpenShape by 40.4%, 22.0%, and 23.9% on ABC, CADParser, and Automate, respectively, and improving zero-shot classification on FabWave by 15% in Top-1 score. We further demonstrate its utility as a CAD-aware similarity metric for evaluating text- and image-conditioned CAD generation, establishing the importance of structure-aware pretraining for multimodal CAD understanding. Project page is available at this https URL
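A contrastive alignment objective of the kind described above can be sketched as a symmetric CLIP-style loss over matched pairs. This is a generic illustration for a single modality pair; the paper's joint objective over text and image, its temperature, and its encoders are not specified in the abstract, so the function names, the `temperature` value, and the toy embeddings are assumptions.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_sum - row[i]
    return loss / len(logits)

def clip_style_loss(a_embs, b_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i in a_embs matches pair i in b_embs."""
    a = [normalize(v) for v in a_embs]
    b = [normalize(v) for v in b_embs]
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

brep = [[1.0, 0.1], [0.0, 1.0]]  # hypothetical BRep embeddings
text = [[0.9, 0.0], [0.1, 1.0]]  # hypothetical matching text embeddings
aligned = clip_style_loss(brep, text)
shuffled = clip_style_loss(brep, text[::-1])
```

The loss is low when each embedding is most similar to its own partner and high when the pairing is scrambled, which is the signal that pulls matching modalities together during pretraining.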
[CV-90] Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning
Link: https://arxiv.org/abs/2606.05506
Authors: Amirhossein Zhalehmehrabi,Tiziano Tezze,Alberto Castelini,Alessandro Farinelli
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, Submitted to RAL
Abstract:We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance. The resulting encoder is pretrained independently, frozen, and used as the perceptual backbone for reinforcement learning, decoupling representation learning from policy optimization. We further introduce a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts and promote reliance on task-relevant features. Extensive experiments in high-fidelity simulation demonstrate that our approach significantly improves policy-level scene transfer across diverse indoor and outdoor environments. At deployment, the agent relies only on monocular RGB observations together with standard task-related inputs such as goal position and proprioceptive signals, without access to LiDAR or other privileged sensors. Our method outperforms large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts. We also release a multimodal dataset to support future research on privileged-guided visual representation learning for navigation. The code is available at:
[CV-91] Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers ICRA2026
链接: https://arxiv.org/abs/2606.05491
作者: Jean Cordonnier,Chenghao Xu,Olga Fink,Malcolm Mielle
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at ICRA 2026’s Workshop MM-SpatialAI: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding
Abstract:Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment. To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images. Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
[CV-92] LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval
链接: https://arxiv.org/abs/2606.05489
作者: Shahrzad Esmat,Chaunte W. Lacewell,Sameh Gobriel,Nilesh Jain,Ali Jannesari
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注: 13 pages, 5 figures, 8 tables
Abstract:Retrieval systems underpin modern AI applications – spanning visual search, recommendation engines, and multi-modal question answering. Modern multi-stage retrieval systems require the joint optimization of highly coupled parameters, yet traditional hyperparameter optimization (HPO) methods – including Tree-structured Parzen Estimators (TPE) and Gaussian Process Bayesian Optimization – rely on an independence assumption that fundamentally prevents them from navigating these coupled configuration spaces. We address this limitation with a phase-aware large language model (LLM) agent that conditions each proposal on its full optimization history, navigating the coupled parameter space across phase-partitioned exploration, exploitation, and fine-tuning stages. Evaluated on the HICO-DET human-object interaction retrieval benchmark using Intel VDMS (Visual Data Management System), our agent outperforms Optuna TPE by +33.3% and VDTuner by +34.2% under SIEVE (Safeguarded Index Evaluation of Vector-search Efficiency, a quality-constrained throughput metric), delivering a 15.3x throughput gain over UniIR. Validation across three benchmarks confirms that the agent’s advantage grows with the degree of parameter coupling: +33.3% on HICO-DET (high coupling), methods converge within 1% on GLDv2 (moderate coupling) and within 3.6% on SIFT1M (near-independent control). Cross-system validation on Milvus confirms the optimizer ranks first on all three datasets without modification, demonstrating transferability across vector database management system (VDBMS) platforms.
[CV-93] Can We Predict The Human Preference For Text-to-Image Content Prior To Generation And Is It Even Useful To Do So?
链接: https://arxiv.org/abs/2606.05478
作者: Joong Ho Kim,Keith G. Mills
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code is available at this https URL
Abstract:Diffusion Models (DM) have revolutionized text-driven generation by enabling the synthesis of high-quality, photorealistic visual content from user prompts. Whereas prior advances in visual generation such as VAEs and GANs were primarily evaluated on perceptual or visual similarity metrics such as FID and PSNR, DM advances have fostered the development of more advanced Human Preference Metrics (HPM) that model and quantify human judgment as scalar values. However, DMs synthesize content using an inherently stochastic process where random noise seeds generation. The initial random noise directly affects the quality of generated outputs, both qualitatively and quantitatively. This influence is pronounced in smaller models for local deployment scenarios. Given this phenomenon, we first investigate to what extent we can predict scalar HPM scores prior to committing compute resources for generation. We then investigate to what extent we can leverage such prediction to improve the quality of generated images, and also study which HPMs are best suited for this task. Our investigation reveals that not only is this possible, but that it is feasible to achieve with negligible hardware overhead.
[CV-94] Formal Concept Lattices are Good Semantic Scaffolds for Concept-Based Learning ICML2026
链接: https://arxiv.org/abs/2606.05471
作者: Deepika SN Vemuri,Sayanta Adhikari,Ankit Saha,Krishn Vishwas Kher,Vineeth N Balasubramanian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2026
Abstract:Learning semantics is essential for deep learning models to be interpretable and better aligned with human reasoning. Concept-based models approach this by representing classes through meaningful semantic abstractions, but typically treat all concepts as a flat, unstructured set learned at a single neural network layer. This overlooks a fundamental property of human semantic understanding: concepts being organized hierarchically, from general to specific. While deep networks do learn a hierarchy of visual features, this structure is rarely aligned with explicit semantic hierarchies. Drawing on Formal Concept Analysis, we demonstrate that formal concept lattices provide principled semantic scaffolds to guide neural network learning. These lattices naturally identify where in the network concepts should be learned based on their level of generality. This allows the model to develop staged, semantically grounded representations throughout its depth. Empirical results on real-world datasets show that our models produce more interpretable embeddings, support more effective interventions, and learn concept representations that are both meaningful and hierarchically structured.
[CV-95] ORACLE-CT: Anatomy-Aware Support Pooling for CT Classification
链接: https://arxiv.org/abs/2606.05460
作者: Lavsen Dahal,Yubraj Bhandari,Geoffrey Rubin,Joseph Y. Lo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Abdominal CT disease classification is challenging because each scan is a large 3D volume with many possible findings, while diagnostic evidence is often confined to specific organs or anatomical compartments. Most study-level classifiers aggregate encoder features using anatomy-agnostic pooling or attention, creating a mismatch between localized disease evidence and global evidence aggregation. We propose ORACLE–CT, an encoder-agnostic anatomy-aware aggregation framework that uses multi-organ segmentation to define label-specific anatomical supports and restrict attention pooling to relevant regions. The framework supports single-organ, multi-organ union, comparative, localized, and global support strategies. We evaluate ORACLE–CT with three encoder families: DINOv3, I3D–ResNet-121, and the radiology-native Pillar–0 encoder. Models are trained end-to-end on MERLIN and evaluated internally and under frozen external transfer to Duke–Abdomen and AMOS. Compared with global average pooling, support-masked pooling improved MERLIN macro-AUROC/AUPRC from 0.838/0.638 to 0.858/0.676 for DINOv3 and from 0.829/0.617 to 0.848/0.659 for I3D–ResNet-121. On harmonized 10-label external evaluation, DINOv3 improved on Duke–Abdomen from 0.802/0.628 to 0.835/0.683 and on AMOS from 0.742/0.313 to 0.762/0.350, with similar gains for I3D–ResNet-121. For Pillar–0, most gains came from learned attention, with smaller additional benefit from anatomical masking. ORACLE–CT improves discrimination and external robustness while preserving an auditable link between predictions and anatomical evidence.
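As a rough illustration of what restricting attention pooling to an anatomical support could look like, the following minimal sketch computes a softmax over attention logits only at positions inside a mask. The function name, the list-based inputs, and the flat token layout are assumptions made for this example; they are not taken from the ORACLE-CT code.

```python
import math


def masked_attention_pool(features, scores, support_mask):
    """Attention-pool feature vectors using only positions inside the support.

    features: list of equal-length feature vectors, one per position
    scores: raw attention logits, one per position
    support_mask: booleans, True where the position lies in the support
    (all three inputs are hypothetical stand-ins for encoder outputs)
    """
    kept = [i for i, inside in enumerate(support_mask) if inside]
    if not kept:
        raise ValueError("support mask is empty")
    # Softmax restricted to kept positions; subtract the max for stability.
    top = max(scores[i] for i in kept)
    weights = {i: math.exp(scores[i] - top) for i in kept}
    total = sum(weights.values())
    pooled = [0.0] * len(features[0])
    for i in kept:
        w = weights[i] / total
        for d, value in enumerate(features[i]):
            pooled[d] += w * value
    return pooled
```

Positions outside the mask contribute nothing regardless of how large their logits are, which is the property that keeps the pooled evidence tied to the chosen region.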
[CV-96] Horse Eye Blink Detection and Classification for Equine Affective State Assessment CVPR
链接: https://arxiv.org/abs/2606.05458
作者: João Alves,Signe Møller-Skuldbøl,Pia Haubro Andersen,Rikke Gade
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPRW2026 CV4Animals
Abstract:Automated detection of equine facial action units (AUs) is a promising yet under-explored avenue for pain and affective state assessment in horses. Half and full-blink movements are recognised indicators of pain and stress, but as micro-expressions, their subtle, fine-grained nature makes them easily missed by the naked eye and only discernible through frame-by-frame video inspection, making reliable automated detection from video a particularly demanding task. We develop and evaluate three methods for automated blink classification from horse videos: a frame-based YOLOv12 detector, an optical flow magnitude thresholding approach, and a fine-tuned VideoMAE model, tested on a publicly available dataset. We achieve a macro-F1 score of 0.898 when doing blink classification and 0.926 on binary blink detection. Our results highlight both the potential and the inherent challenges of fine-grained AU detection for equine welfare monitoring.
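The optical-flow thresholding idea mentioned in the abstract can be sketched as a run-length detector over per-frame flow magnitudes. This is only an illustrative sketch: the threshold, the minimum run length, and the assumption that a per-frame magnitude has already been computed for the eye region are all hypothetical, not details from the paper.

```python
def detect_blink_intervals(flow_magnitudes, threshold, min_frames=2):
    """Return (start, end) frame pairs, end inclusive, where the flow
    magnitude stays at or above `threshold` for at least `min_frames` frames.

    flow_magnitudes: one scalar per frame (assumed precomputed)
    threshold, min_frames: hypothetical tuning parameters
    """
    intervals = []
    start = None
    for i, magnitude in enumerate(flow_magnitudes):
        if magnitude >= threshold:
            if start is None:
                start = i  # a candidate run begins here
        elif start is not None:
            if i - start >= min_frames:
                intervals.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the clip.
    if start is not None and len(flow_magnitudes) - start >= min_frames:
        intervals.append((start, len(flow_magnitudes) - 1))
    return intervals
```

Runs shorter than `min_frames` are discarded, which is one simple way to suppress single-frame noise.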
[CV-97] Disentangled Fine-Grained Prototype Learning for Incomplete Image-Tabular Classification
链接: https://arxiv.org/abs/2606.05455
作者: Feixiang Zhou,Jianyang Xie,Zhuangzhi Gao,Qinkai Yu,Fu Wang,Yuheng Fan,Jing Li,Zheheng Jiang,Yitian Zhao,Yanda Meng,He Zhao,Gregory Y.H. Lip,Yalin Zheng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The missing-modality problem poses a significant challenge in image-tabular multimodal learning across a wide range of multimedia applications, including product understanding, recommendation systems, and medical diagnosis. This challenge is particularly pronounced when the two modalities are highly heterogeneous, as images and tabular attributes differ substantially in their semantic granularity and data distributions. Existing methods learn modality-invariant representations through disentanglement and alignment over global token-averaged features, capturing only coarse cross-modal consistency and overlooking fine-grained semantic and distributional misalignment, which hampers the exploitation of complementary cues under missing modalities. To address this, we propose DFPL, a novel framework for fine-grained prototype learning. Specifically, Shared-Specific Prototype Modeling (SSPM) extracts compact and diverse shared and modality-specific prototypes, and further performs prototype-level disentanglement to suppress redundant intra-modality correlations. Additionally, we propose a Prototype-guided Fine-grained Alignment (PFA) module that jointly enforces prototype-level distribution matching and prototype-to-class semantic alignment within a unified prototype space, thereby preserving both fine-grained distributional and semantic consistency across modalities. We further introduce a Class-aware Multi-scale Aggregation (CMA) module to adaptively aggregate shared semantics and modality-specific characteristics from global and prototype levels for robust predictions. Extensive experiments on three diverse image-tabular benchmarks demonstrate the superiority of our method compared to the previous approaches under various missing-modality settings. Code will be made publicly available.
[CV-98] Uncertainty-Aware Adaptive Sensor Fusion for Autonomous Navigation
链接: https://arxiv.org/abs/2606.05437
作者: Simegnew Yihunie Alaba,Yuichi Motai
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
Abstract:This work introduces a hybrid deep learning approach integrated with an Unscented Kalman Filter (UKF) to enhance pose estimation accuracy in Visual-Inertial Odometry (VIO) for autonomous navigation. The proposed model employs a Vision Transformer (ViT) network to effectively capture temporal dependencies from inertial measurement unit (IMU) data and utilizes a Multiscale Convolutional Neural Network (MCNN) to learn optical flow-based motion cues from visual data. An adaptive sensor fusion module dynamically weights IMU and visual features by leveraging estimated uncertainty, thus improving robustness in diverse and challenging environmental conditions. Additionally, a novel uncertainty-aware loss function is proposed to explicitly incorporate prediction uncertainty into the learning process, enabling robust and accurate navigation under noisy, incomplete, or unreliable sensor inputs. Comprehensive evaluations on the KITTI dataset demonstrate that the proposed method significantly outperforms baseline approaches, achieving superior performance in terms of Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). The lightweight and computationally efficient model processes data at 155 FPS on an NVIDIA A100 GPU, making it highly suitable for deployment in resource-constrained autonomous systems.
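One common way to weight two sensor streams by estimated uncertainty is inverse-variance weighting. The sketch below shows that idea under the assumption that each stream reports a single scalar variance. The abstract does not specify the fusion rule, so this is a stand-in rather than the paper's module, and all names are hypothetical.

```python
def fuse_by_uncertainty(imu_feat, vis_feat, imu_var, vis_var, eps=1e-8):
    """Fuse two feature vectors with weights proportional to 1 / variance.

    imu_feat, vis_feat: equal-length feature vectors (hypothetical inputs)
    imu_var, vis_var: scalar uncertainty estimates, larger = less reliable
    eps: small constant to avoid division by zero
    """
    w_imu = 1.0 / (imu_var + eps)
    w_vis = 1.0 / (vis_var + eps)
    total = w_imu + w_vis
    return [(w_imu * a + w_vis * b) / total for a, b in zip(imu_feat, vis_feat)]
```

With equal variances this reduces to a plain average. As one stream's variance grows, its contribution shrinks toward zero.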
[CV-99] UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching CVPR2026
链接: https://arxiv.org/abs/2606.05399
作者: Qilin Huang,Quynh Anh Huynh,Long Le,Chen Wang,Chuhao Chen,Ryan Lucas,Eric Eaton,Lingjie Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at CVPR 2026 as a Highlight. Project page: this https URL
Abstract:Existing feed-forward networks excel at predicting a single set of physical properties from visual appearance, but this point-estimate paradigm fundamentally fails to capture the real world’s inherent physical ambiguity. We address this by reframing physics prediction as a task of learning a controllable, continuous distribution of material properties. We introduce UNIPIXIE, a framework trained to predict a continuous and parameterized path of physically plausible material properties from a single visual input. By learning a direct mapping along an object’s softest-to-stiffest spectrum on our PIXIEMULTIVERSE dataset, UNIPIXIE allows for controllable generation of diverse, physically valid material fields via a single intuitive parameter. Crucially, UNIPIXIE introduces a novel unified architecture to produce simulation-ready parameters for diverse physics solvers, including continuum-based Material Point Method (MPM), reduced-order deformation based on Linear Blend Skinning (LBS), and anchor-based Spring-Mass systems, addressing a key portability issue in prior work. Experiments show our approach not only generates a rich variety of plausible dynamics but also reduces Young’s Modulus prediction error by over 50% against the strongest deterministic baseline, bridging the gap between static point estimates and the continuous nature of physical reality. Project page: this https URL
[CV-100] Deep Learning-assisted AMD Staging based on OCT and OCT Angiography
链接: https://arxiv.org/abs/2606.05379
作者: Yukun Guo,Tristan T. Hormel,An-Lun Wu,Liqin Gao,Min Gao,Steven T. Bailey,Yali Jia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:To develop and evaluate deep learning models for automated grading of age-related macular degeneration (AMD) severity using optical coherence tomography (OCT) and OCT angiography (OCTA) data. Two hundred seventy-one participants aged ≥ 50 years with varying AMD severities. Central macular 6 x 6 mm OCT/OCTA volumes were acquired using a swept-source OCTA system (SOLIX; Visionix/Optovue Inc., CA). AMD severity was graded into four stages (No AMD, Early AMD, Intermediate AMD, and Advanced AMD) according to the AREDS simplified severity scale. Three deep learning models were developed using different input modalities: (1) biomarker maps derived from segmented pathological features, including retinal fluid, drusen, geographic atrophy (GA), and macular neovascularization (MNV); (2) two-dimensional (2D) en face OCT and OCTA projections; and (3) three-dimensional (3D) OCT/OCTA volumes. EfficientNet-based architectures were trained using normalized inputs, data augmentation, and five-fold cross-validation. A total of 2,030 OCT/OCTA volumes from 351 eyes of 271 participants were analyzed. All models demonstrated strong AMD staging performance with substantial agreement with the reference standard (QWK ≥ 0.83). The biomarker-based model achieved the highest overall performance (QWK = 0.85 +/- 0.03, mean +/- standard deviation) and the best detection of early AMD (F1-score = 0.59 +/- 0.14). The 3D model achieved performance comparable to the 2D OCT/OCTA model (QWK = 0.83 +/- 0.04 vs. 0.83 +/- 0.09), while the 2D OCT/OCTA model showed the highest precision (0.79 +/- 0.06) and most accurately identified eyes without AMD. Deep learning models using OCT/OCTA data can accurately and automatically grade AMD severity. Among the evaluated approaches, the biomarker-based model provided the most balanced performance and showed particular value for early AMD detection.
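The QWK figures above refer to quadratic weighted kappa, an agreement measure for ordinal labels that penalizes disagreements by the squared distance between grades. The following minimal sketch implements the standard definition for integer stage labels. The function name and the 0 to 3 stage encoding are assumptions for illustration, not the authors' evaluation code.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted kappa for integer labels in range(n_classes)."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_true[i] * hist_pred[j] / n
    # den is zero only when both raters use one identical class throughout.
    return 1.0 if den == 0 else 1.0 - num / den
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and confusing adjacent stages costs far less than confusing No AMD with Advanced AMD.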
[CV-101] Three-Dimensional Retinal Microvasculature Restoration in OCT Angiography
链接: https://arxiv.org/abs/2606.05375
作者: Yukun Guo,Min Gao,Tristan T. Hormel,Steven T. Bailey,Thomas S. Hwang,Yali Jia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Optical coherence tomographic angiography (OCTA) is a powerful technique for imaging retinal microvasculature. However, acquiring reliable quantification of retinal blood flow and areas of retinal nonperfusion is challenging because of imaging artifacts. Existing methods primarily focus on noise suppression, projection artifact removal, or signal enhancement to improve the image quality of OCTA in cross-sectional or two-dimensional (2D) en face projections, while neglecting the intrinsic three-dimensional vascular architecture. In this study, we propose a deep learning-based algorithm for restoring capillary anatomical vasculature from a single OCTA volume. The network consists of an EfficientNet-B5 encoder and a decoder incorporating concurrent spatial and channel squeeze-and-excitation modules, connected via skip connections to preserve spatial resolution. Three adjacent B-frames are used as input to predict the restored middle B-frame. We evaluated the performance of the model using the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) against ground truth generated from averaging multiple scans. The results show that the proposed model significantly (both p < 0.001) improved image quality compared with the original single OCTA volume, with a PSNR of 26.16 +/- 1.26 vs. 22.23 +/- 0.78 and an SSIM of 0.91 +/- 0.02 vs. 0.72 +/- 0.03. The proposed model also significantly (p < 0.001) improved microvascular fidelity, measured by the Dice coefficient overlap between the model output and ground truth, in both 2D and 3D by at least 3.8% and 51.2%, respectively, across several different vascular slabs.
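For readers unfamiliar with the PSNR values quoted above, here is a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE). It assumes intensities flattened to a list and scaled to [0, 1]. The paper's actual normalization is not stated in the abstract, so `max_value` is an assumption.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists.

    max_value: assumed peak intensity (1.0 for images scaled to [0, 1])
    """
    squared_errors = [(r - x) ** 2 for r, x in zip(reference, restored)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, the reported gap of 26.16 - 22.23 = 3.93 dB corresponds to roughly 2.5 times lower mean squared error, provided both figures use the same peak value.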
[CV-102] Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin
链接: https://arxiv.org/abs/2606.05368
作者: Sayan Mandal,Rocco Sedona,Simon Besnard,Mikhail Urbazaev,Morris Riedel,Ehsan Zandi,Gabriele Cavallaro
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 21 figures
Abstract:Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks an ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.
[CV-103] Recovering Physically Plausible Human-Object Interactions from Monocular Videos CVPR2026
链接: https://arxiv.org/abs/2606.05359
作者: Dingbang Huang,Etienne Vouga,Qixing Huang,Georgios Pavlakos
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project Page: this https URL
Abstract:In this paper, we propose RePHO, a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physically implausible artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework. We begin with a kinematic estimate and then refine it by training a policy with reinforcement learning (RL). This policy is optimized to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that can identify the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard HOI benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods. Project Page: this https URL
[CV-104] LightVesselNet: An Ultra-Lightweight Sub-100K Parameter Network for Retinal Blood Vessel Segmentation
链接: https://arxiv.org/abs/2606.05354
作者: Shadman Sobhan,Farhana Jalil
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Retinal blood vessel segmentation plays a vital role in the early detection of diabetic retinopathy and glaucoma. While recent deep learning models have achieved great segmentation accuracy, they typically require heavy computational resources, making real-world deployment on edge devices difficult. In this paper, we propose LightVesselNet, an efficient neural network designed for retinal vessel segmentation in a resource-constrained environment. Despite containing only 75K parameters, LightVesselNet performs competitively with much larger models. The network employs a compact encoder-decoder architecture enhanced with channel and spatial attention mechanisms, a multi-scale feature aggregation module at the bottleneck, and a subpixel upsampling strategy in the decoder. A dedicated edge residual connection preserves fine vessel detail throughout decoding. Extensive experiments on five publicly available datasets (DRIVE, STARE, CHASEDB1, FIVES, and HRF) yield sensitivity scores of 0.8189, 0.8499, 0.8640, 0.8634, and 0.8096, and Dice coefficients of 0.8070, 0.8072, 0.8181, 0.8649, and 0.7686, respectively. LightVesselNet shows improved efficiency (performance vs. parameter count or GFLOPs) compared to state-of-the-art models. Cross-dataset evaluation confirms the model’s generalisation capability. Overall, LightVesselNet is a strong candidate for deployment in low-resource clinical settings and mobile screening tools.
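The two metrics reported above have simple definitions on binary masks: sensitivity is TP / (TP + FN) and the Dice coefficient is 2 * TP / (2 * TP + FP + FN). The sketch below computes both from flattened masks. The function name and the conventions for empty masks are assumptions for this example.

```python
def sensitivity_and_dice(pred_mask, true_mask):
    """Return (sensitivity, dice) for two flattened binary masks.

    pred_mask, true_mask: equal-length sequences of 0/1 (or bool) values
    """
    pairs = list(zip(pred_mask, true_mask))
    tp = sum(1 for p, t in pairs if p and t)        # vessel predicted, vessel present
    fp = sum(1 for p, t in pairs if p and not t)    # vessel predicted, background present
    fn = sum(1 for p, t in pairs if not p and t)    # vessel missed
    # Conventions for degenerate cases are a choice made for this sketch.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return sensitivity, dice
```

Sensitivity ignores false positives, so a model can score well on it by over-segmenting. Dice penalizes both kinds of error, which is why the two numbers are usually read together.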
[CV-105] TopoPult-SSL: Gland-Mask-Free Cross-Device Meibomian Gland Segmentation via Self-Distilled Weak Clinical Priors
链接: https://arxiv.org/abs/2606.05347
作者: Nicolò Savioli,Luca Del Tongo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures, 5 tables
Abstract:Every new clinical imaging device creates a domain shift where dense gland masks are expensive yet cheap clinical signals – eyelid outlines, Pult grades, morphometric ratios – are routinely recorded. We present TopoPult-SSL, a two-stage framework for cross-device meibomian gland segmentation. Stage 1 adapts a source-trained model without target gland masks in the training loss, using four weak-prior anchors driven by target eyelid masks and clinical metadata only. Stage 2, when target gland masks are available, distils complementary Stage-1 teachers into a single compact student via supervised self-distillation. We develop and validate the technique on the public MGD-1k to CAMG research benchmark (1,000 to 100 images, different device), where the distilled model achieves Dice 0.716 +/- 0.006 (best 0.726), surpassing UA-MT (0.710) and the ensemble teacher (0.720) – with a single pass. The gland-mask-free Stage-1 variant reaches Precision 0.694 vs. 0.30-0.34 for SAM/MedSAM (p < 0.001), enabling deployment without dense gland contouring. Code and reproducibility scripts are released.
[CV-106] The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show
链接: https://arxiv.org/abs/2606.05328
作者: Parsa Esmati,Somjit Nath,Katja Hofmann,Derek Nowrouzezahrai,Samira Ebrahimi Kahou,Majid Mirmehdi
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Modern video diffusion models generate increasingly realistic and temporally coherent videos, motivating their use as candidate world simulators. Yet it remains unclear whether these models internally encode physical structure, or merely reproduce motion patterns seen during training. We study this question by probing video diffusion models along latent trajectories corresponding to real videos with known physical plausibility. To obtain such trajectories, we approximately invert the deterministic sampling process by integrating the learned velocity field backward from a clean video latent to noise, giving access to the model’s intermediate states and attention maps. Using these recovered trajectories, we show that physical plausibility is linearly decodable from diffusion transformer states across IntPhys and InfLevel, reaching around 81.27% average accuracy and outperforming dedicated representation-learning baselines such as V-JEPA and VideoMAE. Surprisingly, this signal is absent from the VAE latent input and emerges inside the denoising transformer itself, despite the model not being trained with a self-supervised predictive objective. These findings suggest that physically meaningful representations can arise as a byproduct of generative denoising.
[CV-107] Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation
链接: https://arxiv.org/abs/2606.05290
作者: Tobia Poppi,Silvia Cappelletti,Sara Sarto,Florian Schiffers,Garin Kessler,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Project page: this https URL
Abstract:Recent progress in generative modeling has made safety control a central challenge, yet existing approaches remain largely model-specific, requiring retraining or tailored interventions for each new architecture. In this work, we ask whether safety can be represented as a portable latent direction, learned once and reused across heterogeneous generators. We introduce the first framework for cross-model safety steering, in which a safety direction is estimated in a source LLM from paired safe-unsafe prompts, transported to a target generator through a lightweight alignment fitted on benign data alone, and applied at inference time. Crucially, our pipeline never accesses unsafe data on the target side, isolating whether safety can be transferred through shared representation geometry. Beyond a single global direction, we also identify a multi-vector extension that captures category-specific safety behaviors, enabling more selective control. We evaluate our approach in text-to-image and text-to-video generation across diverse source-target model pairs. Across models, transferred safety directions achieve ASR reduction and CLIP-Score/FID trade-offs comparable to directions learned natively on the target model using unsafe data, while requiring no target-side unsafe data. This indicates that safety improvements do not come at the expense of generation quality. Our results point to a modular view of safety: safety-relevant behavior is not purely model-local, but can be controlled through latent directions that persist across models. This suggests a new path toward lightweight, reusable safety mechanisms that do not require target-side unsafe data.
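The three steps described in the abstract (estimate a direction, transport it, apply it at inference) can be sketched with plain vector arithmetic. This sketch assumes a difference-of-means estimator and an already-fitted linear alignment matrix. Both are common choices in the steering literature, but neither is confirmed by the abstract as the authors' method, and every name below is hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def safety_direction(safe_embs, unsafe_embs):
    """Assumed estimator: mean safe embedding minus mean unsafe embedding."""
    mu_safe = mean_vector(safe_embs)
    mu_unsafe = mean_vector(unsafe_embs)
    return [s - u for s, u in zip(mu_safe, mu_unsafe)]


def transport(direction, alignment):
    """Map a source-space direction into the target space.

    alignment: matrix with one row per target dimension, assumed to have
    been fitted beforehand on benign data only
    """
    return [sum(w * x for w, x in zip(row, direction)) for row in alignment]


def steer(hidden, direction, strength=1.0):
    """Shift a target hidden state along the transported direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]
```

A typical call chain would be `steer(hidden, transport(safety_direction(safe, unsafe), alignment), strength)`, with `strength` controlling how hard the output is pushed toward the safe side.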
[CV-108] Personal AI Agent for Camera Roll VQA
链接: https://arxiv.org/abs/2606.05275
作者: Thao Nguyen,Krishna Kumar Singh,Donghyun Kim,Yong Jae Lee,Yuheng Li
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page, code, and demo: this https URL
Abstract:We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user’s personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Given the vast nature of the personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized visual content stream in order to navigate and locate the correct and/or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and existing long-context understanding AI agent systems. Together, the camroll dataset and camroll-agent highlight the gap in AI agents’ long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are present.
[CV-109] NIV: Neural Axis Variations for Variable Font Generation
链接: https://arxiv.org/abs/2606.05261
作者: Nadav Benedek,Ariel Shamir,Ohad Fried
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Variable fonts enable continuous variation of glyph geometry along semantic design axes such as weight, width, slant, and optical size. However, constructing a variable font from a static font remains a labor-intensive process requiring expert typographic design and manual specification of glyph variation data. We introduce NIV (Neural Axis Variations), a method that automatically converts a static font into a fully functional variable font. Given glyph outlines and a set of desired design axes, NIV predicts per-point displacements. The model operates directly on vector glyph geometry and employs a novel Property Embedding mechanism that captures interactions between multiple axes, enabling consistent multi-axis variation within a unified framework. We train NIV on a newly constructed dataset derived from variable Google Fonts, comprising over one million variation tuples. The resulting model generalizes across unseen code points, unseen font styles, high-complexity CJK glyphs, and even out-of-distribution handwriting inputs. The generated outputs are standard variable font files supporting continuous interpolation via existing rendering engines. To facilitate research, we release the dataset, the complete training and inference implementation, and trained models at this https URL. Beyond typography, our approach demonstrates how structured geometric objects with continuous parametric variation can be synthesized using neural deformations.
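To make the role of per-point displacements concrete, the sketch below applies axis deltas to glyph outline points using a simplified linear model: each point moves by its delta scaled by the normalized axis coordinate, summed over axes. Real variable-font tables use per-region scalars and may include intermediate masters, so this is a deliberately reduced illustration. The function name and data layout are assumptions, not NIV's output format.

```python
def apply_axis_variations(points, deltas_by_axis, axis_values):
    """Displace outline points along one or more design axes.

    points: list of (x, y) outline points for the default glyph
    deltas_by_axis: {axis_tag: list of (dx, dy), one per point}
    axis_values: {axis_tag: normalized coordinate, e.g. 0.0 to 1.0}
    (simplified linear model; hypothetical data layout)
    """
    moved = []
    for index, (x, y) in enumerate(points):
        for axis, value in axis_values.items():
            dx, dy = deltas_by_axis[axis][index]
            x += value * dx
            y += value * dy
        moved.append((x, y))
    return moved
```

Setting every axis value to 0 returns the original outline, and intermediate values interpolate continuously, which is what lets a renderer produce any weight or width between the extremes.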
[CV-110] VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding ICML2026
Link: https://arxiv.org/abs/2606.05259
Authors: Lin Fu,Zheyuan Yang,Yang Wang,Tingyu Song,Arman Cohan,Yilun Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026 Spotlight
Abstract:We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT → GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.
[CV-111] Flash-WAM: Modality-Aware Distillation for World Action Models
Link: https://arxiv.org/abs/2606.05254
Authors: Arman Akbari,Ci Zhang,Arash Akbari,Lin Zhao,Yixiao Chen,Weiwei Chen,Xuan Zhang,Geng Yuan,Yanzhi Wang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream’s low-noise regime, paired with a variance-preserving parametrization for the video stream’s high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on NVIDIA L40S, a 23× speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% on RoboTwin 2.0, 95.7% on LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), while naive consistency distillation drops to 24% at the same step budget.
[CV-112] Drishti AI-Event Guardian: An Intelligent Real-Time Crowd Monitoring and Emergency Response System for Mass Gathering Events
Link: https://arxiv.org/abs/2606.05185
Authors: Ritabrata Roy Choudhury,Arkajyoti Karmakar,Rudra Pratap Mitra
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages
Abstract:Mass gathering events are associated with critical safety incidents caused by insufficient crowd monitoring and inadequate emergency response coordination. Traditional surveillance systems lack intelligent analytics, resulting in delayed threat identification, poor resource deployment, and weak support for vulnerable individuals during dense public assemblies. This paper presents Drishti AI-Event Guardian, an intelligent crowd management framework using deep learning for public safety enhancement. The architecture combines multimodal data from CCTV networks and UAV platforms, processed by models on Google Vertex AI infrastructure. Core methods include real-time crowd density estimation using YOLOv8, spatiotemporal anomaly detection, and predictive crowd-flow modeling through gradient-boosted regression. Drishti also integrates four modules: (i) facial recognition for missing person identification with crowd-wide notification; (ii) medical emergency reporting with automated dispatch; (iii) a conversational AI chatbot for reports and complaints; and (iv) an intelligent guard reallocation engine that dynamically reassigns personnel in response to crowd density changes. The system is evaluated on two scenarios: the Kumbh Mela gathering and the RCB Victory Parade event, achieving crowd density estimation MAE of 3.2 persons/m², anomaly detection F1-score of 0.91, facial recognition precision of 0.93, and median alert latency of 111 ms. Predictive congestion modeling provides five-minute forecasts with MAPE of 8.3%, enabling preemptive intervention. The chatbot resolved 89% of incident filings without human operators, while guard reallocation reduced responder deployment latency by 34% versus manual reassignment. Results demonstrate a shift from passive surveillance toward active crowd intelligence and a scalable foundation for events from local gatherings to mega festivals.
[CV-113] Function2Scene: 3D Indoor Scene Layout from Functional Specifications
Link: https://arxiv.org/abs/2605.30819
Authors: Ruiqi Wang,Qimin Chen,Daniel Ritchie,Angel X. Chang,Manolis Savva,Kai Wang,Hao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: project page: this https URL
Abstract:Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.
[CV-114] Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue
Link: https://arxiv.org/abs/2509.15061
Authors: Xingyao Lin,Xinghao Zhu,Tianyi Lu,Sicheng Xie,Hui Zhang,Xipeng Qiu,Zuxuan Wu,Yu-Gang Jiang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, 7 tables
Abstract:The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in VLAs have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components, one VLM for collaboration and one diffusion for action. We also introduce a connection module that generates conditions for the diffusion based on the output of the VLM. This module adjusts the observation by instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration one. This preserves the interaction abilities while fine-tuning the diffusion to generate actions. The training strategy guarantees our framework can first ask questions, then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework in 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.
[CV-115] Inverse Design of Realizable Metasurface based Absorbers using Improved Conditioning and Diversity Enhanced Progressively Growing GANs
Link: https://arxiv.org/abs/2606.05849
Authors: Vineetha Joy,Mohammad Abdullah,Pramit Pal,Anshuman Kumar,Amit Sethi,Hema Singh
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Metasurfaces enable precise manipulation of electromagnetic waves for applications such as beam steering, sensing, and stealth technology. However, inverse design of metasurfaces with targeted EM responses remains challenging due to the computational expense of iterative full wave simulation driven optimization and the limited conditioning fidelity and diversity of existing generative approaches. To address these challenges, this paper presents a generative inverse design framework for controllable and physically consistent metasurface synthesis under continuous spectral constraints. The proposed approach employs a progressively growing Wasserstein generative adversarial network with gradient penalty integrated with feature wise linear modulation based conditioning for stable propagation of continuous spectral and fabrication constraints. EM consistency is embedded directly into the generative learning process through a surrogate assisted spectral alignment loss, enabling physics constrained generation during training. Further, a determinantal point process based diversity regularization strategy is incorporated to generate geometrically diverse yet spectrally consistent realizations for the same target response. The effectiveness of the proposed framework is demonstrated through the generation of practically realizable metasurface absorbers exhibiting diverse reflection characteristics in the frequency range of 2 to 18 GHz. EM simulations validate that the generated designs meet the target specifications with high accuracy. The final proposed framework achieved an average mean squared error of 0.0052, diversity score of 0.8730, band alignment accuracy of 0.8533, and a valid EM design generation percentage of 89.57, clearly demonstrating its capability to generate highly accurate, diverse, electromagnetically consistent and fabrication realizable metasurface configurations.
[CV-116] Oklch+: A Three-Parameter Extension of Oklab for Improved Color Difference Prediction
Link: https://arxiv.org/abs/2606.05255
Authors: Naoyuki Uchida
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 3 figures, 8 tables. Submitted to Color Research & Application
Abstract:Oklab and its cylindrical representation Oklch are widely adopted in interpolation and design workflows as perceptually motivated color spaces, but their color difference prediction accuracy falls short of CIEDE2000. We propose Oklch+, a three-parameter extension of Oklab comprising a power transformation on the L-axis and a Naka-Rushton compression on the C-axis, with Euclidean distance computed in the resulting transformed Oklab coordinates. The Naka-Rushton function is bounded in [0,1], reflecting the saturating nature of chroma sensitivity at high colorimetric values. Evaluated on COMBVD – 3,813 suprathreshold color difference pairs spanning six independent experimental datasets – Oklch+ achieves STRESS = 29.09, closely matching CIEDE2000 (29.13; difference = 0.04), using only three parameters optimized against color difference data compared to approximately 17 for CIEDE2000. Cross-validation on a held-out BFD-P D65 subset (2,028 pairs) confirms generalization (STRESS = 26.14), with Oklch+ substantially outperforming Oklab (51.45) and achieving STRESS comparable to CIEDE2000 (24.12) on the held-out set. Improvement over Oklab (47.35) is confirmed across all six COMBVD sub-datasets. Because Oklch+ defines a coordinate system in which Euclidean distance approximates perceptual distance, linear interpolation in the transformed space offers substantially improved perceptual uniformity relative to Oklab. Current evaluation is limited to the sRGB-centered COMBVD dataset; validation in high-chroma regions with empirical observer-rated discrimination data remains future work.
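The construction described in this abstract can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: it takes the power transformation to be L' = L^gamma and the Naka-Rushton compression to be C' = C^n / (C^n + sigma^n), keeps hue unchanged, and uses placeholder values for gamma, n, and sigma. These are not the fitted parameters from the paper, and the function names are hypothetical.

```python
import math

# Placeholder parameters (assumptions for illustration, not the paper's fitted values).
GAMMA = 0.8   # exponent of the power transformation on the L-axis
N = 1.0       # Naka-Rushton exponent
SIGMA = 0.2   # Naka-Rushton semi-saturation constant


def naka_rushton(c, n=N, sigma=SIGMA):
    """Compress a non-negative chroma value into the range [0, 1)."""
    cn = c ** n
    return cn / (cn + sigma ** n)


def transform(lab):
    """Map an Oklab triple (L, a, b) to the transformed coordinates."""
    L, a, b = lab
    chroma = math.hypot(a, b)       # cylindrical chroma C
    hue = math.atan2(b, a)          # hue angle is left unchanged
    c_t = naka_rushton(chroma)      # compressed chroma
    return (L ** GAMMA, c_t * math.cos(hue), c_t * math.sin(hue))


def color_difference(lab1, lab2):
    """Euclidean distance between two colors in the transformed space."""
    return math.dist(transform(lab1), transform(lab2))
```

Because the distance is plain Euclidean in the transformed coordinates, linear interpolation between two transformed colors is the natural companion operation, which is the use case the abstract highlights.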
Artificial Intelligence
[AI-0] HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers
Link: https://arxiv.org/abs/2606.06493
Authors: Lizhi Yang,Junheng Li,Nehar Poddar,Yiling Hou,Gio Huh,Robert Griffin,Georgia Gkioxari,Aaron Ames
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 9 figures
Abstract:For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.
[AI-1] TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
Link: https://arxiv.org/abs/2606.06491
Authors: Dong Jing,Jingchen Nie,Tianqi Zhang,Jiaqi Liu,Huaxiu Yao,Zhiwu Lu,Mingyu Ding
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components. (1) A data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times a demonstration to any target speed by merging or splitting actions while preserving its motion semantics. (2) A model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default 1× performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
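The idea that action magnitude governs speed can be made concrete with a small re-timing sketch. It assumes each action is a per-step displacement vector (a delta action) and handles only integer speed factors; the function names `speed_up`, `slow_down`, and `total_displacement` are hypothetical, and the paper's VSTA may differ in detail.

```python
def speed_up(actions, factor):
    """Merge every `factor` consecutive delta actions into one larger action."""
    merged = []
    for i in range(0, len(actions), factor):
        chunk = actions[i:i + factor]
        # Summing displacements keeps the end pose of the chunk unchanged.
        merged.append([sum(dim) for dim in zip(*chunk)])
    return merged


def slow_down(actions, factor):
    """Split each delta action into `factor` equal, smaller actions."""
    split = []
    for act in actions:
        part = [x / factor for x in act]
        split.extend(list(part) for _ in range(factor))
    return split


def total_displacement(actions):
    """Net displacement of a trajectory, used to check that motion is preserved."""
    return [sum(dim) for dim in zip(*actions)]
```

In this toy setting, a 2× speed-up halves the number of steps and a 0.5× slow-down doubles it, while the net displacement of the trajectory stays the same in both directions.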
[AI-2] Regret Minimization with Adaptive Opponents in Repeated Games
Link: https://arxiv.org/abs/2606.06486
Authors: Mingyang Liu,Asuman Ozdaglar,Tiancheng Yu,Kaiqing Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:In this paper, we study regret minimization in repeated games with adaptive opponents who can respond based on histories of play. The standard metric of external regret in online learning is known to fail to capture such adaptivity. To account for players’ counterfactual reasoning, we introduce Repeated Policy Regret (RP-Regret), a game-theoretic metric that measures the difference between the realized and the best-in-hindsight accumulated utility when all players can respond to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing, enabling stronger comparators and opponents with fewer constraints, while maintaining the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining RP-Regret sublinear in time, on the variation of the player’s comparator strategies in the regret definition and on the memories of both the comparator and opponents’ strategies. We then study additional conditions and provable algorithms to minimize RP-Regret, which is by definition non-convex in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work in online non-convex learning; (ii) one that minimizes a convex and linearized surrogate of RP-Regret at each iteration; (iii) one that directly minimizes RP-Regret when opponents change strategies slowly. Furthermore, when all players can run algorithms to minimize the RP-Regret (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.
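For reference, the standard external regret that this abstract contrasts against can be computed as below. The sketch uses an illustrative Stag-Hunt payoff table (the numbers are an assumption, not taken from the paper) and shows why the metric ignores adaptivity: the opponent's action sequence is held fixed while the comparator action is swapped in. It does not implement RP-Regret.

```python
# Row player's payoffs in a Stag-Hunt game (illustrative values, an assumption).
PAYOFF = {
    ("stag", "stag"): 4, ("stag", "hare"): 0,
    ("hare", "stag"): 3, ("hare", "hare"): 3,
}
ACTIONS = ("stag", "hare")


def external_regret(my_actions, opp_actions):
    """Best fixed action in hindsight minus realized utility.

    The opponent's sequence is treated as fixed, so any reaction the
    opponent would have had to a different action of ours is ignored.
    """
    realized = sum(PAYOFF[(a, b)] for a, b in zip(my_actions, opp_actions))
    best_fixed = max(
        sum(PAYOFF[(a, b)] for b in opp_actions) for a in ACTIONS
    )
    return best_fixed - realized
```

A regret notion that lets the comparator's choice change how the opponent responds, as the abstract proposes, would need the opponent's strategy (a map from histories to actions) rather than just the realized action sequence.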
[AI-3] Pretraining Recurrent Networks without Recurrence
Link: https://arxiv.org/abs/2606.06479
Authors: Akarsh Kumar,Phillip Isola
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages, 23 figures
Abstract:Training recurrent neural networks (RNNs) requires assigning credit across long sequences of computations. Standard backpropagation through time (BPTT) addresses this problem poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long-range associations difficult to learn. We propose Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory transition labels $(m_t, x_{t+1}) \rightarrow m_{t+1}$. SMT acquires these memory labels by training a Transformer-based encoder on a predictive state objective, retaining only information from the past necessary to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable O(1)-length gradient path between any two tokens, without ever unrolling the RNN. We find that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.
[AI-4] RREDCoT: Segment-Level Reward Redistribution for Reasoning Models
Link: https://arxiv.org/abs/2606.06475
Authors: Mykyta Ielanskyi,Kajetan Schweighofer,Lukas Aichberger,Sepp Hochreiter
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint, under review
Abstract:Recent advancements in reasoning language models have been driven by Reinforcement Learning (RL) fine-tuning. Most often, these rely on the Group Relative Policy Optimization (GRPO) algorithm or modifications thereof to steer the models to produce Chain-of-Thought (CoT) traces. The final answer can only be verified, and the reward assigned, after the CoT trace is complete, making it a delayed reward problem. GRPO and its modifications correspond to Monte Carlo methods in standard RL, which are known to suffer from high variance. A possible solution to this problem is the redistribution of rewards through credit assignment, where segments of the CoT trace that are important for arriving at the desirable solution are emphasized by assigning a higher reward. While Monte Carlo sampling can be used to provide an unbiased estimate of intermediate state values, its computational overhead makes it unsuitable for train-time credit assignment in long contexts at high granularity. We introduce RREDCoT (Reward REDistribution for Chain of Thoughts), which utilizes the model itself to approximate the optimal reward redistribution without additional generation. We investigate the advantages of our method compared to MC sampling and several attribution methods. We further analyze several aspects relevant to the construction of the redistribution such as segmentation of CoT traces and state value estimation.
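The delayed-reward structure is easiest to see in how GRPO-style methods score a group of sampled traces: each trace gets one scalar advantage, shared by all of its tokens. The sketch below shows that group-relative normalization next to a toy segment-level redistribution for contrast. The segment weights are hand-supplied and hypothetical; RREDCoT derives its redistribution from the model itself, which this sketch does not attempt, and the choice of population standard deviation is an assumption that varies across implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """One advantage per sampled trace: reward standardized within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def redistribute(final_reward, weights):
    """Spread a single final reward over segments in proportion to `weights`.

    The per-segment rewards sum to `final_reward`, so the total return of
    the trace is unchanged; only its timing along the trace changes.
    """
    total = sum(weights)
    return [final_reward * w / total for w in weights]
```

With trace-level advantages, every segment of a correct trace is reinforced equally. A redistribution that sums to the same final reward lets important segments receive more credit without changing the return being optimized.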
[AI-5] PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training
Link: https://arxiv.org/abs/2606.06470
Authors: Senmiao Wang,Tiantian Fang,Haoran Zhang,Yushun Zhang,Kunxiang Zhao,Alex Schwing,Ruoyu Sun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose a preconditioning (PC) layer, a weight parameterization via a polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control principle by proving that uniformly bounding each layer’s singular values ensures geometric convergence of gradient descent to global minima, for certain deep linear networks. Our code is available at this https URL.
[AI-6] Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement
Link: https://arxiv.org/abs/2606.06468
Authors: Jui-Hui Chung,Ziyang Cai,Zihao Li,Qishuo Yin,Rohit Agarwal,Simon Park,Rodrigo Porto,Narutatsu Ri,Ziran Yang,Shange Tang,Xingyu Dang,Hongzhou Lin,Mengdi Wang,Danqi Chen,Chi Jin,Liam H Fowl,Sanjeev Arora
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce Goedel-Architect, an agentic framework for formal theorem proving in Lean 4 centered on blueprint generation and refinement. A blueprint is a dependency graph of definitions and lemmas that builds up to the main theorem. First, Goedel-Architect generates a blueprint of formally stated definitions and lemmas, along with declared dependencies. This blueprint is optionally guided by a natural language proof. Then, a tool-equipped Lean prover component closes each open lemma node in parallel using relevant dependencies. Failed lemmas in turn drive refinement of the global blueprint. This strategy contrasts with other mainstream approaches which use recursive lemma decomposition, and can inefficiently loop on dead-end strategies. Using the open-weight DeepSeek-V4-Flash (284B-A13B) as the backbone, Goedel-Architect attains 99.2% pass@1 on MiniF2F-test and 75.6% pass@1 on PutnamBench. With an optional natural-language proof seeding the initial blueprint on the harder problems, we additionally close the remaining two MiniF2F-test problems (reaching 100%), lift PutnamBench to 88.8% (597/672), and solve 4/6 on IMO 2025, 11/12 on Putnam 2025, and 3/6 on USAMO 2026. This represents state-of-the-art performance for an open-source pipeline at a price point up to 500x less than comparable open-source pipelines.
[AI-7] Benchmark Everything Everywhere All at Once
Link: https://arxiv.org/abs/2606.06462
Authors: Shiyun Xiong,Dongming Wu,Peiwen Sun,Yuang Ai,Bokang Yang,Wencheng Han,Xiao-Hui Li,Xiangyu Yue
Categories: Artificial Intelligence (cs.AI)
Comments: Project page: this https URL
Abstract:Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark building. Our framework orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we implement it to produce 15 representative benchmarks, spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, through continual evaluation, we observe several insightful findings, including that current models struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.
[AI-8] Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals
Link: https://arxiv.org/abs/2606.06460
Authors: Thamilvendhan Munirathinam
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure. Code, specification, and experiment harness: this https URL
Abstract:As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal – the Recuse Signal – that a server emits over a protocol’s existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the this http URL analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal – 100% recusal when present versus 100% task completion in a no-signal control – and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.
[AI-9] Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents
Link: https://arxiv.org/abs/2606.06453
Authors: Zhuoming Chen,Xinrui Zhong,Qilong Feng,Ranajoy Sadhukhan,Yang Zhou,Michael Qizhe Shieh,Zhihao Jia,Beidi Chen
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and AI agents in exploring the sparse attention design. To address this challenge, we present Vortex, a system that combines a Python-embedded frontend language atop a page-centric tensor abstraction for expressing a broad range of sparse attention algorithms, with an efficient backend tightly integrated into modern LLM serving stacks. Vortex enables rapid prototyping, deployment, and evaluation of sparse attention algorithms, effectively translating their theoretical efficiency gains into real-world throughput improvements. As a result, Vortex substantially accelerates the design and iteration of sparse attention algorithms. First, AI agents use Vortex to automatically generate and refine diverse algorithms, the best reaching up to 3.46× higher throughput than full attention while preserving accuracy. Second, Vortex extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with, reaching up to 4.7× higher throughput on the MLA-based GLM-4.7-Flash and 1.37× on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.
[AI-10] Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads
Link: https://arxiv.org/abs/2606.06448
Authors: Yasmine Omri,Ziyu Gan,Zachary Broveak,Robin Geens,Zexue He,Alex Pentland,Marian Verhelst,Tsachy Weissman,Thierry Tambe
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory systems has emerged spanning flat retrieval, LLM-mediated extraction, consolidating fact stores, and agentic control flows. Yet, their system-level behavior remains uncharacterized. We present the first systems characterization of agent memory. First, we introduce a system-oriented taxonomy classifying agent memory systems along four axes. Second, we build a phase-aware profiling harness attributing cost to construction, retrieval, and generation. Third, we characterize ten representative systems across two benchmark suites, uncovering how design choices shift cost across the write and read paths. Finally, we derive 10 system recommendations covering construction scheduling, capability floors, amortization via query volume, freshness-latency tradeoffs, and fleet-scale management.
[AI-11] RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation
Link: https://arxiv.org/abs/2606.06423
Authors: Qi Lan,Yining Tang,Yu Shen,Yi Zhou,Yuhao Wei,Jie Li,Guofa Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Safety-critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their iterative denoising process is computationally expensive and may accumulate sampling and guidance errors over long rollouts, causing unrealistic motion artifacts such as jitter, abnormal acceleration, and off-road behavior. To address these issues, we propose RiskFlow, a closed-loop safety-critical multi-agent traffic generation framework that formulates future trajectory generation as transport in the action space. Instead of relying on iterative denoising, RiskFlow learns an average velocity field over a finite interval to transform Gaussian action sequences into future acceleration and yaw-rate commands with a single forward pass, using a JVP-based objective for efficient and stable training. At test time, RiskFlow applies output-space guidance to the generated actions, steering selected critical agents toward risky interactions while regularizing off-road behavior, and reconstructs physically feasible trajectories through vehicle dynamics. Experiments on nuScenes with tbsim closed-loop evaluation show that RiskFlow achieves a strong adversariality-realism trade-off across multi-agent and long-horizon settings. Compared with representative baselines, RiskFlow consistently improves realism while maintaining competitive safety-critical generation capability, and substantially reduces inference time for evaluation.
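The last step in the abstract, reconstructing trajectories from acceleration and yaw-rate commands through vehicle dynamics, can be sketched with a simple unicycle model. The abstract does not say which dynamics model RiskFlow uses, so the model, the function name `rollout_unicycle`, the time step, and the non-negative speed clamp are all assumptions for illustration.

```python
import math


def rollout_unicycle(x, y, heading, speed, actions, dt=0.1):
    """Integrate (acceleration, yaw_rate) commands with a unicycle model.

    Assumed toy dynamics, not taken from the paper.
    """
    states = [(x, y, heading, speed)]
    for accel, yaw_rate in actions:
        # Apply the commanded action to speed and heading first.
        speed = max(0.0, speed + accel * dt)  # this toy model never reverses
        heading = heading + yaw_rate * dt
        # Then advance the position along the updated heading.
        x = x + speed * math.cos(heading) * dt
        y = y + speed * math.sin(heading) * dt
        states.append((x, y, heading, speed))
    return states


# Ten steps of zero acceleration and zero yaw rate: straight-line motion.
trajectory = rollout_unicycle(0.0, 0.0, 0.0, 5.0, [(0.0, 0.0)] * 10)
```

Because positions are always produced by integrating bounded commands, the resulting path cannot jump or jitter in ways the dynamics forbid, which is the usual motivation for generating in action space.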
[AI-12] Double Preconditioning (DoPr): Optimization for Test-Time Performance not Validation Loss
Link: https://arxiv.org/abs/2606.06418
Authors: Thomas T. Zhang,Alok Shah,Yifei Zhang,Vincent Zhang,Nikolai Matni,Max Simchowitz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:Many modern applications of deep learning involve training a neural network via a one-step prediction loss (e.g., L^2 regression, cross-entropy), but deploy the network by rolling out along its own predictions. Key examples include autoregressive language modeling, flow-based generative modeling, and robot policy learning. It is well-documented that these settings induce a phenomenon we call test-time feedback (TTF): the mismatch between the training/validation loss and downstream metrics of interest, such as task success rate and generation quality, which grows with task length. While data curation, architecture, and objective design have been proposed to combat train-test shift in TTF settings, this paper proposes optimization as a new design axis to mitigate error accumulation. Specifically, we introduce a new optimization paradigm called double-preconditioning (DoPr) uniquely tailored to the challenges of TTF. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), such as in KFAC. We show that the addition of AP yields a drop-in intervention for increasing downstream model performance across a range of TTF settings. Interestingly, these gains in test-time performance do not consistently accompany improvements in validation loss, opening new questions about how to properly evaluate models trained with one-step supervised objectives.
[AI-13] Risk Assessment of Autonomous Driving: Integrating Technical Failures, Ethical Dilemmas, and Policy Frameworks
Link: https://arxiv.org/abs/2606.06396
Authors: Boyi Chen,Shengqin Chu,Zicheng Wang,Brian Baetz,Zhen Gao
Subjects: Artificial Intelligence (cs.AI)
Comments: 19 pages, 1 figure
Abstract:Autonomous driving technology has the potential to reduce the large number of road traffic accidents caused by human error each year, but it also introduces new types of risk that must be evaluated from technical, ethical, and regulatory perspectives. Drawing on public crash data from the National Highway Traffic Safety Administration (NHTSA), disengagement reports from the California Department of Motor Vehicles (DMV), the MIT Moral Machines dataset, and a comparative regulatory analysis of five jurisdictions, we find that the main technical failure modes are perception and classification errors, which account for a relatively large proportion of the reported accidents. We also find that different ethical frameworks exist for autonomous vehicle decision-making, and that inconsistent regulations across regions increase the uncertainty of widespread application. Overall, the technical, ethical, and regulatory problems are closely related and need to be solved together. This paper therefore recommends a more adaptive and cooperative governance approach that combines engineering standards, ethical discussion, and institutional supervision.
[AI-14] Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study CVPR2026
Link: https://arxiv.org/abs/2606.06375
Authors: Ching Yau Fergus Mok,Lavindra de Silva,Varun Kumar Reja,Ioannis Brilakis
Subjects: Artificial Intelligence (cs.AI)
Comments: CVPR 2026 Computer Vision for the Built World Workshop (CV4AEC @ CVPR)
Abstract:Digital twins (DTs) allow the digitalization of road infrastructure inspection, though this is hindered by limited annotated data. This work exploits the relational nature of continuous asset condition monitoring to reformulate image-based defect detection as image difference classification (IDC) to reduce data reliance. This was evaluated in a case study on low-resource traffic sign inspection with different IDC classifiers using a newly-curated, high quality dataset. Results indicate that the instruction-based classifier outperforms encoder-based ones and gains from comparison with reference images. This shows that IDC can be an effective task modeling for tackling data constraints in infrastructure inspection and DT asset condition updating.
[AI-15] An Infectious Disease Spread Simulation Based on Large Language Model Decision Making
Link: https://arxiv.org/abs/2606.06360
Authors: Yonchanok Khaokaew,Ruochen Kong,Andreas Zufle,Hao Xue,Taylor Anderson,Chandini Raina MacIntyre,Matthew Scotch,Flora D. Salim,David J Heslop
Subjects: Artificial Intelligence (cs.AI)
Comments: 12 pages
Abstract:Modelling individual decision-making during infectious disease outbreaks is crucial for understanding behavioural dynamics and informing effective public health interventions. Prior work has shown that large language models can simulate realistic human behaviour by generating agent decisions based on demographic prompts and situational context. We build on this foundation with a spatially grounded, agent-based simulation framework that integrates LLM-generated decisions about self-reported influenza-like illness into a census-based synthetic population of agents. Location is treated as a central feature: agents are assigned to spatial units within cities, capturing the spatial distributions of different demographic groups using real-world census data and enabling geographically diverse behavioural modelling. We implement and compare three decision scenarios, independent reasoning, household influence, and message framing, and simulate self-reporting outcomes in San Francisco and Atlanta. Results reveal that income and education are the dominant drivers of reporting rate variation, with smaller but consistent effects from geography, LLM model choice, and message framing. Our framework generates synthetic data that captures both social and geographic heterogeneity, supporting spatial epidemiological modelling and bias-aware behavioural analysis.
[AI-16] F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation
Link: https://arxiv.org/abs/2606.06357
Authors: Dinghao Zhou,Xingchen Song,Di Wu,Pengyu Cheng,Shengfan Shen,Sixiang Lv
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Technical report; early work; 9 pages, 2 figures, 5 tables
Abstract:Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.
[AI-17] Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo
Link: https://arxiv.org/abs/2606.06356
Authors: Renjith Prasad,Chathurangi Shyalika,Anushka Pawar,Amit Sheth
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multimodal generative models produce fluent outputs but remain unreliable when generation must respect structured, domain-specific, or safety-critical knowledge. Existing methods incorporate knowledge through mechanisms such as prompt augmentation, guidance, latent editing, or fine-tuning, yet they are typically categorized by technique rather than by the component of the generative process they modify. We argue that knowledge infusion in iterative generative models is fundamentally an intervention-layer problem. Since the generative process unfolds as a trajectory of internal states, knowledge can act on four structurally distinct components of this process: the input/output boundary, the transition function, the intermediate state, and the model parameters. This maps to four intervention layers: surface, trajectory, latent, and parametric infusion. We instantiate the framework in diffusion models, map representative methods to all four layers, and derive design principles for multi-layer composition. In a controlled safety-alignment experiment using a multimodal knowledge graph with two diffusion backbones, we implement three of the four layers cumulatively, surface (input-side and output-side) and trajectory–latent (mid-generation). We show empirically that each additional layer addresses failure classes that prior layers cannot reach, reducing knowledge-violating outputs by 70.97% compared to vanilla generation and empirically confirming the framework’s complementarity prediction.
[AI-18] Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation
Link: https://arxiv.org/abs/2606.06345
Authors: Yohann Benchetrit,Marlène Careil,Simon Dahan,Hubert Banville,Stéphane d’Ascoli,Jean-Rémi King
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:Brain decoding is limited by the availability of labeled neural data, and remains challenging in low-data regimes. To address this issue, we investigate whether and when brain decoding can be boosted by augmenting small fMRI datasets with synthetic data generated by a pretrained model of fMRI responses to stimuli. We use TRIBE v2, a large encoding model pretrained on more than 1000 hours of fMRI responses to video, audio and language. For each dataset, we evaluate systematic grids that show how the performance of image decoders varies with the amount of synthetic data used for training. Our results, based on two datasets (the 7T fMRI Natural Scenes Dataset and 3T fMRI BOLD5000), show up to 68% improvement in Top-10 image-retrieval accuracy compared to decoders trained only on real data. Importantly, the proportion of augmented data required to reach a given image decoding performance needs to be adjusted depending on the data source. Surprisingly, image decoders trained exclusively on synthetic fMRI can perform above chance in some settings, suggesting that TRIBE v2 can support zero-shot brain-to-image decoding. Together, these results show how large-scale models of the fMRI responses to sight, sound and language may provide a foundation to improve the data efficiency for image decoding.
[AI-19] TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management
Link: https://arxiv.org/abs/2606.06337
Authors: Shweta Mishra
Subjects: Artificial Intelligence (cs.AI)
Comments: 12 pages, 10 figures. Code and benchmark available at this https URL
Abstract:Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisions, task transitions, file histories - is silently discarded. Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable. We present TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph. The schema defines 14 node types and 7 edge types. A hybrid extraction pipeline populates the graph incrementally, while a three-tier checkpoint system serializes it into compact resume blocks. An 8-layer compression pipeline reduces context overhead, and a semantic cache reduces repeated-query latency. Evaluated on a controlled benchmark of 21 sessions spanning 5 domains, TokenMizer demonstrates significant token economy. It produces resume blocks averaging 78 tokens (range: 42-124) - 2x smaller than evaluated baselines (159-170 tokens) - while achieving higher decision recall (+9-17 percentage points). Crucially, baselines only preserve that a technology was mentioned; TokenMizer preserves the rationale. Across all sessions, TokenMizer achieves mean task recall 51.0%, decision recall 46.6%, and file recall 58.7%. Variance reflects domain heterogeneity: explicit imperative phrasing (software engineering) scores higher than implicit reasoning (research). Ablation studies show fuzzy label matching is the dominant improvement factor (+33 pp task recall). The heuristic compression achieves 47.3% token reduction with zero external dependencies. TokenMizer provides a queryable alternative to text-retention baselines at half the token cost.
[AI-20] Bridging Domain Expertise and Generalization for Performance Estimation
Link: https://arxiv.org/abs/2606.06335
Authors: Shuxuan Li,Zhilin Zhao,Quyu Kong,Wei-Shi Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate of the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and performance estimation is obtained by measuring how closely the base model predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.
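The three steps named in the abstract (temperature-scaled alignment, confidence-based fusion, agreement with the fused reference) can be sketched as follows. This is a toy illustration, not the FRAP implementation: the KL direction, the grid search over temperatures, the use of the maximum probability as the confidence weight, and argmax agreement as the final score are all assumptions, and every function name is hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fit_temperature(fm_logits, base_probs, grid):
    """Assumed alignment step: pick the temperature with the lowest mean KL."""
    def mean_kl(t):
        pairs = zip(fm_logits, base_probs)
        return sum(kl_divergence(b, softmax(f, t)) for f, b in pairs) / len(base_probs)
    return min(grid, key=mean_kl)


def fuse(p_fm, p_base):
    """Assumed fusion step: weight each model by its max-probability confidence."""
    w_fm = max(p_fm) / (max(p_fm) + max(p_base))
    return [w_fm * a + (1.0 - w_fm) * b for a, b in zip(p_fm, p_base)]


def argmax(p):
    return max(range(len(p)), key=p.__getitem__)


def estimate_accuracy(fm_logits, base_probs, grid=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of samples where the base prediction matches the fused reference."""
    t = fit_temperature(fm_logits, base_probs, grid)
    pairs = zip(fm_logits, base_probs)
    agree = sum(argmax(fuse(softmax(f, t), b)) == argmax(b) for f, b in pairs)
    return agree / len(base_probs)


# Toy unlabeled test set: foundation-model logits and base-model probabilities.
fm_logits = [[4.0, 0.0], [0.0, 4.0], [3.0, 0.0]]
base_probs = [[0.9, 0.1], [0.6, 0.4], [0.8, 0.2]]
estimated = estimate_accuracy(fm_logits, base_probs)
```

In this toy data the two models disagree only on the second sample, where the base model is not very confident, so the fused reference sides with the foundation model and the estimate drops accordingly.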
[AI-21] Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability
Link: https://arxiv.org/abs/2606.06333
Authors: Seyed Arshan Dalili,Mehrdad Mahdavi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with single-direction decoders forces a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a strictly lower risk of the $\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime. A single coherent feature is therefore fragmented across many near-collinear latents, producing spurious multiplicity and obscuring the intrinsic geometry. Motivated by this, we introduce Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces, enforce block sparsity via Top-$s$ group gating, and adapt each group’s effective rank with a nuclear-norm regularizer. We then show that once the block size satisfies $r \ge d_i$, a single group not only can represent the entire feature slice but is the global minimizer of the SASA objective. This consolidation yields a sample complexity polynomial in $d_i$ rather than exponential – a decisive advantage given that every training activation costs an LLM forward pass. Empirically, on GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while training on roughly half the token budget.
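Block sparsity via Top-s group gating can be illustrated with a short sketch. The abstract does not specify how groups are scored, so ranking groups by their L2 norm is an assumption here, and `top_s_group_gate` is a hypothetical name; the learned decoder subspaces and the nuclear-norm regularizer are not shown.

```python
def top_s_group_gate(latents, group_size, s):
    """Keep the s groups with the largest L2 norm and zero out the rest."""
    groups = [latents[i:i + group_size] for i in range(0, len(latents), group_size)]
    norms = [sum(v * v for v in g) ** 0.5 for g in groups]
    ranked = sorted(range(len(groups)), key=lambda i: norms[i], reverse=True)
    keep = set(ranked[:s])
    gated = []
    for i, g in enumerate(groups):
        # A kept group passes through whole; a dropped group becomes zeros.
        gated.extend(g if i in keep else [0.0] * len(g))
    return gated


# Three groups of size 2; only the strongest group survives with s = 1.
gated = top_s_group_gate([0.1, 0.2, 3.0, -4.0, 0.5, 0.5], group_size=2, s=1)
```

The point of gating whole groups rather than individual latents is that a multi-dimensional feature can stay inside one active block instead of being split across many separately selected units.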
[AI-22] PAMF: Prior-Aware Multimodal Fusion for Incomplete Time Series Data
Link: https://arxiv.org/abs/2606.06328
Authors: Ziwen Kan,Wugeng Zheng,Tianlong Chen,Song Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 figures. arXiv preprint version
Abstract:In healthcare, multimodal time series tasks often operate on incomplete observations in practice, for example when ECG segments are lost because electrodes detach or an entire respiratory channel is unavailable during overnight monitoring. Such missingness typically appears in two structurally distinct patterns: within-modality missing, where values are absent within an otherwise observed modality, and modality-level missing, where an entire modality is unavailable. Existing methods typically represent unobserved data implicitly through masks or missing embeddings, without learning instance-specific missing information, and most are designed for only one missingness pattern. A natural approach is to explicitly estimate the missing data; however, existing imputation methods treat missingness uniformly despite their different structural priors, and the imputation process is often isolated from downstream tasks, preventing downstream tasks from guiding imputation toward more informative representations. To address these limitations, we present PAMF, a multimodal time-series framework that explicitly handles different missingness patterns while coupling imputation with downstream prediction through prior-aware flow matching and weight sharing. Specifically, the method initializes the flow-matching source state with type-specific priors to distinguish two missing types. It further connects imputation and classification through architecturally matched encoders with weight sharing, transferring task-relevant representations into the imputation process. Experiments on multiple multimodal healthcare time-series benchmarks show that the proposed method achieves the strongest overall downstream performance across diverse datasets and missing settings compared with existing baselines.
[AI-23] DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions
Link: https://arxiv.org/abs/2606.06322
Authors: Nathan Bout,Maxime Langevin,Ronan Riochet
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.
[AI-24] LLM Self-Recognition: Steering and Retrieving Activation Signatures ICML2026
Link: https://arxiv.org/abs/2606.06315
Authors: Thibaud Ardoin,Jonas Schäfer,Gerhard Wunder
Subjects: Artificial Intelligence (cs.AI)
Comments: To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable self-recognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors by leveraging the model’s natural representation structure for attribution rather than embedding a signal externally. Our contributions include: (i) establishing reliable self-recognition capabilities in LLMs, (ii) a simple steering mechanism enabling multi-LLM identification with no quality degradation, (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.
[AI-25] AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks
Link: https://arxiv.org/abs/2606.06311
Authors: Wonmo Koo,Sanha Chang,Heeyoung Kim
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate vessel trajectory prediction is essential for safe and efficient maritime operations, enabling collision avoidance and supporting route optimization. Although memory-augmented neural networks have recently shown strong performance in pedestrian and road-vehicle trajectory prediction by selectively retrieving relevant information from an external memory, their potential for vessel trajectory prediction remains underexplored. This paper presents an empirical investigation of memory-based trajectory prediction using Automatic Identification System (AIS) data. Experiments on data from the Gulf of Mexico and the New York Bight demonstrate consistent and substantial performance gains over a range of deep learning baselines that do not incorporate an external memory.
[AI-26] Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction ICML2026
Link: https://arxiv.org/abs/2606.06303
Authors: Hongkun Dou,Zike Chen,Fengji Li,Hongjue Li,Yue Deng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026
Abstract:Controllable generation with discrete diffusion models is often hindered by high computational overhead or the need for retraining. In this paper, we present Gradient-Informed Logit Correction (GILC), a plug-and-play framework that efficiently estimates guidance signals by repurposing the pretrained denoising network as a variational proxy. To circumvent the gradient instability inherent in high-dimensional discrete spaces, we introduce a Jacobian-free mechanism that directly corrects the clean prediction logits, facilitating stable and effective guidance. Our method accommodates both differentiable and non-differentiable reward functions. Extensive experiments across DNA, protein sequence, and molecular generation tasks demonstrate that GILC achieves state-of-the-art performance without additional training, frequently outperforming fine-tuning approaches.
[AI-27] Multi-ResNets for Subspace Preconditioning in Constrained Optimization
Link: https://arxiv.org/abs/2606.06300
Authors: Merve Karakas,Christopher J. Williams,Emmanuel O. Balogun,Sadegh Sadeghi Tabas,Christian Brown,Nikhil Rao
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose MResOpt, a staged residual neural network architecture for constrained optimization problems. Our architecture fits within predict-complete-correct pipelines and decomposes constraint satisfaction by priority through intermediate re-completion and stage-aware losses. The framework enables domain-informed ordered constraint satisfaction which allows the network to utilize ordinal structure when present. Under an idealized infinite-width regime, we show that our design behaves as sequential Gaussian Process regression. On synthetic QP, QCQP, and SOCP benchmarks, the staged architecture improves high-priority constraint satisfaction across convex and non-convex settings. On line-flow-constrained AC optimal power flow, we introduce a physics-motivated constraint ordering and show that MResOpt supports a learned division of labor that keeps iterates on the equality manifold, achieving substantially lower high-priority violation than reprojected baselines while remaining computationally efficient.
[AI-28] TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models
Link: https://arxiv.org/abs/2606.06285
Authors: Ziwen Kan,Yishuo Chen,Kecheng Li,Andrew Wen,Xiaomeng Wang,Liwei Wang,Jihao Duan,Song Wang,Hongfang Liu,Tianlong Chen
Subjects: Artificial Intelligence (cs.AI)
Comments: 5 figures and 5 tables in the main paper, plus appendix
Abstract:Time series foundation models (TS-FMs) aim to learn generalizable temporal representations that can be adapted to a wide range of downstream tasks. In real-world multimodal settings, time series are frequently affected by temporal misalignment and partial modality missingness, where different modalities are observed at heterogeneous time scales or are partially absent. Existing approaches typically rely on naive imputation or masking strategies, which fail to account for cross-modal dependencies and often lead to misaligned or degraded representations. We propose TRACE, a conditional estimation paradigm for multimodal time series foundation model pipelines under missingness and irregular sampling, allowing incomplete target modalities to be systematically inferred from available auxiliary modalities. We evaluate TRACE on diverse multimodal benchmarks spanning healthcare and affective computing, including the MIMIC-IV clinical dataset and the CMU-MOSI and CMU-MOSEI benchmarks for multimodal sentiment analysis. Across a range of downstream prediction tasks and missing-modality settings, TRACE consistently outperforms prior multimodal fusion approaches, demonstrating improved robustness to severe modality missingness and more reliable cross-modal representations.
[AI-29] ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents
Link: https://arxiv.org/abs/2606.06284
Authors: Rahul Suresh Babu,Laxmipriya Ganesh Iyer
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language model agents increasingly rely on external tools, but larger tool menus can reduce reliability and efficiency by increasing wrong-tool calls, premature actions, and token cost. Existing tool-selection methods often optimize semantic relevance, exposing tools whose names or descriptions match the user request. We argue that relevance is insufficient: a tool may be related to the task while still being unnecessary or premature at the current step. We propose Causal Minimal Tool Filtering (CMTF), a training-free method that selects tools by causal sufficiency. CMTF uses lightweight precondition-effect contracts to expose only the minimal next-step tool frontier needed to advance from the current state toward the user goal. Across multi-step tool-use tasks, we compare CMTF with all-tools exposure, keyword retrieval, state-aware filtering, and causal-path ablations, measuring task success, wrong-tool calls, premature actions, tool exposure, and token cost. In the main benchmark with 102 tasks, 100 tools, four LLM backends, and 2448 task-method-model runs, CMTF matches the strongest causal baseline in aggregate success while reducing visible tools from 100 to one per step and reducing token usage by about 90% relative to all-tools exposure.
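A rough sketch of filtering tools by precondition-effect contracts follows. It is one possible reading of the abstract, not the CMTF algorithm itself: the `ToolContract` structure, the backward-chaining search, the fact names, and the example tools are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical contract: facts a tool requires and facts it produces."""
    name: str
    preconditions: frozenset
    effects: frozenset


def minimal_frontier(tools, state, goal):
    """Names of tools that lie on a path to the goal and can run right now."""
    state = set(state)
    pending = set(goal) - state  # facts still to be established
    resolved = set()
    needed = []
    # Backward-chain from the goal through effect -> precondition links.
    while pending:
        fact = pending.pop()
        resolved.add(fact)
        for tool in tools:
            if fact in tool.effects and tool not in needed:
                needed.append(tool)
                pending |= set(tool.preconditions) - state - resolved
    # Expose only the needed tools whose preconditions already hold.
    return sorted(t.name for t in needed if t.preconditions <= state)


tools = [
    ToolContract("search_flights", frozenset(), frozenset({"flight_options"})),
    ToolContract("book_flight", frozenset({"flight_options", "payment_method"}),
                 frozenset({"booking"})),
    ToolContract("add_payment", frozenset(), frozenset({"payment_method"})),
    ToolContract("get_weather", frozenset(), frozenset({"forecast"})),
]
frontier = minimal_frontier(tools, state={"payment_method"}, goal={"booking"})
next_frontier = minimal_frontier(
    tools, state={"payment_method", "flight_options"}, goal={"booking"})
```

Here `get_weather` is never exposed because nothing on the path to the goal needs its effect, and `book_flight` stays hidden until its preconditions hold, which mirrors the abstract's distinction between tools that are relevant and tools that are needed at the current step.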
[AI-30] Adapting Diffusion Language Models for Lossless Pixel-Level Image Transmission
Link: https://arxiv.org/abs/2606.06273
Authors: Tianqi Ren,Rongpeng Li,Xianfu Chen,Yingyu Li,Zhifeng Zhao
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments:
Abstract:Lossless pixel-level image transmission is a fundamental regime beyond semantic communications, because exact recovery requires both accurate symbol probability modeling and reliable delivery over noisy channels. This paper proposes DDM-SSCC, a discrete-diffusion-model-based separate source-channel coding framework for lossless image transmission. Different from raster-order autoregressive coding, the proposed source codec adapts a diffusion language model to pixel-token restoration and performs synchronized reverse arithmetic coding under bidirectional attention, allowing multiple masked tokens to be coded within one reverse denoising step. This progressive restoration process also yields a more favorable source representation for noisy transmission, since newly restored tokens can serve as bidirectional context in subsequent denoising steps. To bridge the gap between generation-oriented masked denoising and lossless arithmetic coding, we further introduce a Halton-guided denoising order, a mask-ratio-aware cosine schedule, and a lightweight temperature calibration module. These designs respectively improve spatial coverage, adapt the denoising pace to context reliability, and calibrate the probability tables used by arithmetic coding. Experiments on CIFAR10, DIV2K-LR-X4, and Kodak over additive white Gaussian noise and Rayleigh fading channels show that DDM-SSCC achieves better exact-recovery performance than representative lossless and semantic communication baselines, while ablation studies verify the effectiveness of the proposed denoising order, schedule, and calibration modules.
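To give a feel for why a Halton-guided order improves spatial coverage, the sketch below orders the pixels of a small grid by a 2D Halton sequence. How the paper maps the sequence to tokens is not stated in the abstract, so the bases (2 and 3), the grid mapping, and the function names are assumptions.

```python
def halton(index, base):
    """Radical inverse of `index` in `base`; `index` starts at 1."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def halton_order(height, width):
    """Visit each pixel once, in the order a 2D Halton sequence first hits it."""
    order, seen, i = [], set(), 1
    while len(order) < height * width:
        row = int(halton(i, 2) * height)
        col = int(halton(i, 3) * width)
        if (row, col) not in seen:
            seen.add((row, col))
            order.append((row, col))
        i += 1
    return order


order = halton_order(4, 4)
```

Unlike raster order, early positions in this ordering are spread across the whole image, so each later masked token tends to have already-restored neighbours on several sides.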
[AI-31] Your GFlowNet Secretly Learns an Optimal Transport Plan ICML2026
Link: https://arxiv.org/abs/2606.06272
Authors: Ian Maksimov,Nikita Morozov,Denis Belomestny,Sergey Samsonov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026 SPIGM Workshop
Abstract:Generative Flow Networks (GFlowNets) are a framework for sampling structured objects via stochastic trajectories in a directed graph. In this work, we establish a theoretical connection between non-acyclic GFlowNets and optimal transport (OT). We show that fixing the initial flow distribution in a minimum-flow GFlowNet reduces its objective to a Kantorovich OT problem with graph-induced shortest path costs. At the optimum, the learned GFlowNet policy therefore encodes an optimal transport plan from the source distribution to the target distribution: we show that sampling trajectories from the minimum-flow GFlowNet recovers the corresponding optimal coupling. Our formulation enables applying the GFlowNet learning framework to OT problems on large graphs via edge flows and neural parameterization. Experiments confirm agreement with exact OT solvers and demonstrate that GFlowNets can learn high-quality transport plans.
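For readers unfamiliar with the Kantorovich problem named in this abstract, its standard discrete form is sketched below. The symbols are assumptions introduced here, not the paper's notation: $\mu$ and $\nu$ are the source and target distributions on the vertex set $V$, $d_G(x,y)$ is the graph-induced shortest-path cost (the abstract does not say how edges are weighted), and $\pi$ is the coupling.

```latex
\min_{\pi \in \Pi(\mu,\nu)} \; \sum_{x \in V} \sum_{y \in V} d_G(x,y)\,\pi(x,y),
\qquad
\Pi(\mu,\nu) = \Bigl\{ \pi \ge 0 \;:\; \sum_{y \in V} \pi(x,y) = \mu(x),\;\; \sum_{x \in V} \pi(x,y) = \nu(y) \Bigr\}
```

According to the abstract, the optimal coupling of this problem is what trajectory sampling from the trained minimum-flow GFlowNet recovers.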
[AI-32] RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention
Link: https://arxiv.org/abs/2606.06256
Authors: Yang Liu,ZhaoKai Luo,HuaYi Jin,ZhiYong Wang,RuoZhou He,BoYu Wang,Guanjie Chen,Junhao Hu
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed. However, existing serving systems largely rely on a monolithic KV cache abstraction, where the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario. We present RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling RedKnot to uniformly support position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving. 
[AI-33] Closing the Loop on Latent Reasoning via Test-Time Reconstruction
Link: https://arxiv.org/abs/2606.06252
Authors: Xiaopeng Yuan,Haibo Jin,Ye Yu,Peng Kuang,Lijun Yu,Yushun Dong,Haohan Wang
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent work moves intermediate reasoning from natural-language traces into latent or cache-level representations to reduce token overhead and avoid a discrete communication bottleneck. However, this shift also removes a key advantage of textual reasoning: intermediate states are no longer inspectable, making it difficult to determine whether a latent state still preserves the constraints of the original query. As a result, latent reasoning typically operates in an open loop, where a latent state is produced and consumed without an input-anchored fidelity check. We propose ReLAT (Reconstruction-Guided Latent Reasoning At Test Time), a self-supervised test-time training method that closes this loop using the query itself as the reference. Our key observation is that if a latent state faithfully represents a query, the query should be recoverable from it; if the query cannot be recovered, the latent state has lost task-relevant information. ReLAT operationalizes this principle by constructing a differentiable Question - Latent Thought - Question cycle and optimizing query reconstruction loss through the latent thought before answer generation. This anchors opaque latent computation to the problem specification it is supposed to represent. Across mathematical reasoning, knowledge QA, and code generation benchmarks on the Qwen family, ReLAT consistently improves over single-model inference, text-based collaboration, open-loop latent collaboration, and alternative test-time training objectives. On Qwen3-8B, ReLAT raises AIME 2024 accuracy from 56.7% to 73.3%, a 16.6-point gain over the strongest open-loop latent baseline.
[AI-34] MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action
Link: https://arxiv.org/abs/2606.06245
Authors: Boyang Zhang,Lianlei Shan
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures, submitted to CoRL
Abstract:Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes M hypotheses, refines them for K weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective evaluates candidate action branches with expert-action consistency, world-model/VLM-based progress, and success feedback to align the latent path scorer with downstream execution quality. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference controls (K,M). Under matched protocols on LIBERO and CALVIN, MPCoT improves long-horizon performance, with ablations confirming depth-width effects, confidence-weighted aggregation, and reward-guided path supervision.
[AI-35] TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory
Link: https://arxiv.org/abs/2606.06240
Authors: Ziming Wang
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 43 pages including full appendices (proofs, protocols, and reproducibility ledger). Code, data, and reproducibility artifact: this https URL
Abstract:Persistent memory for an LLM agent is a write-heavy substrate: every belief update is a versioned write, and a new claim may contradict a stored one. Production systems use four resolution heuristics (last-writer-wins, evidence-weighted merge, await-confirmation, per-rule policy), yet none declares the isolation level it assumes or the write-time anomalies it admits. We show that contradiction resolution is write-time concurrency control and make the missing contract explicit. TOKI types the four heuristics as one family of bitemporal operators over a dual-row schema, each with an isolation precondition and a provenance annotation that preserves the losing fact in an audit row. Four soundness theorems close the contract across isolation, schema, and provenance, lift the guarantees to operator pipelines, and extend the fold operators to n-ary conflict sets. A tightness companion proves that, within the relational schedule model, keyed logging of the adjudicating judge is necessary for replay consistency, which every audited baseline omits. A verdict matrix over eight systems localizes the gap: every baseline that keeps a language-model judge on the write path admits at least one of three write-time anomalies (replay inconsistency, belief-drift skew, audit erasure); a content-addressed engine-layer comparator avoids them only by removing the judge, and TOKI alone excludes all three while keeping it. On its one natural-workload slice the audit-row defence moves LoCoMo by 0.86, and ablating the typed memory layer removes 0.49 accuracy on 1,444 answerable LoCoMo questions; the cross-system comparison stays underpowered and claims no superiority. The contribution is the contract: a write-time correctness specification, proved sound across isolation, schema, and provenance, pinning the guarantee every production heuristic assumes but no deployed system makes explicit.
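To make the "preserve the losing fact in an audit row" idea concrete, here is a minimal sketch of one of the four heuristics, last-writer-wins, over a two-table memory. It is an illustration under stated assumptions, not TOKI's operator algebra: `Fact`, `Memory`, and `write_last_writer_wins` are hypothetical names, transaction time is a plain integer, and isolation levels are ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_from: int   # valid time: when the claim holds in the world
    recorded_at: int  # transaction time: when the memory recorded it


class Memory:
    """Hypothetical dual-table store: current beliefs plus an audit log."""

    def __init__(self):
        self.current = {}  # key -> winning Fact
        self.audit = []    # (losing Fact, operator name, winning Fact)

    def write_last_writer_wins(self, new):
        old = self.current.get(new.key)
        if old is None:
            self.current[new.key] = new
            return new
        if old.value == new.value:
            return old  # same claim, so there is no contradiction to resolve
        # Contradiction: the later transaction time wins, the loser is kept for audit.
        winner, loser = (new, old) if new.recorded_at >= old.recorded_at else (old, new)
        self.audit.append((loser, "last_writer_wins", winner))
        self.current[new.key] = winner
        return winner


memory = Memory()
memory.write_last_writer_wins(Fact("home_city", "Paris", valid_from=1, recorded_at=10))
memory.write_last_writer_wins(Fact("home_city", "Berlin", valid_from=5, recorded_at=20))
memory.write_last_writer_wins(Fact("home_city", "Rome", valid_from=2, recorded_at=15))
print(memory.current["home_city"].value, [row[0].value for row in memory.audit])
```

The overwritten and the stale claim both survive in `audit` together with the operator that rejected them, so the resolution can be inspected later instead of being silently erased.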
[AI-36] Design a Reliable LLM-Integrated Interface for Mortality Forecasting
Link: https://arxiv.org/abs/2606.06235
Authors: Thi Kim Ngan Nguyen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures
Abstract:Mortality forecasting plays an important role in actuarial and policy decision-making, but its implementation remains technically complex and inaccessible to non-expert users. This project proposes a reliable large language model (LLM)-integrated interface that improves usability while maintaining statistical power. The LLM is designed as a constrained orchestration layer that translates natural-language inputs into structured configurations for a deterministic forecasting pipeline. A three-phase methodology is employed to ensure accuracy, usability, and transparency. First, a baseline pipeline is implemented using the CoMoMo package, reproducing established mortality forecasting results. Second, the pipeline is extended to generate multi-step forecasts using rolling-origin evaluation and mean squared error (MSE). Third, a prototype interface uses a local LLM to handle users’ forecasting requests in plain language. The system demonstrates that LLMs can enhance accessibility without compromising reproducibility, transparency, or actuarial validity in high-stakes analytical workflows.
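Rolling-origin evaluation with MSE, mentioned in the second phase, can be sketched generically as follows. This is an assumption-laden illustration: it does not use the CoMoMo package, `rolling_origin_mse` and `naive_last_value` are hypothetical names, and the toy series stands in for real mortality rates.

```python
def rolling_origin_mse(series, min_train, horizon, forecast):
    """Average squared error over all forecast origins and all horizon steps.

    At each origin the model sees series[:origin] and predicts the next
    `horizon` values, which are compared with the held-out actuals.
    """
    squared_errors = []
    for origin in range(min_train, len(series) - horizon + 1):
        train = series[:origin]
        actual = series[origin:origin + horizon]
        predicted = forecast(train, horizon)
        squared_errors.extend((p - a) ** 2 for p, a in zip(predicted, actual))
    return sum(squared_errors) / len(squared_errors)


def naive_last_value(train, horizon):
    """Placeholder forecaster: repeat the last observed value."""
    return [train[-1]] * horizon


toy_series = [1.0, 2.0, 3.0, 4.0, 5.0]
print(rolling_origin_mse(toy_series, min_train=2, horizon=2, forecast=naive_last_value))
```

Any deterministic forecaster with the same `(train, horizon)` signature can be swapped in for the placeholder, which is the property that lets an LLM layer configure the pipeline without changing its numerical behavior.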
[AI-37] From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents
Link: https://arxiv.org/abs/2606.06223
Authors: Patrick Wilhelm,Odej Kao
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features. We find that adapters fine-tuned on the School-of-Reward-Hacks dataset can transfer reward-hack tendencies into agentic action selection, especially when the environment exposes proxy-reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward-hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. Across next-step prediction tasks, entropy and context-calibrated internal features improve risk estimation over reward-hack activation alone. Activation-direction steering further reduces proxy-exploit behavior in selected mixed-adapter regimes. Overall, our results support context-calibrated internal monitoring for agents: reward-hack activation identifies a latent policy state, while entropy and decision context help determine when that state becomes risky action.
[AI-38] CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving
Link: https://arxiv.org/abs/2606.06219
Authors: Yining Xing,Zehong Ke,Zhiyuan Liu,Yanbo Jiang,Wenhao Yu,Jianqiang Wang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient α and sample count N from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.
[AI-39] TAM: Torque Adaptation Module for Robust Motion Transfer in Manipulation
Link: https://arxiv.org/abs/2606.06218
Authors: Dongwon Son,Florian Shkurti,Jason Lee,Naman Shah,Beomjoon Kim,Dieter Fox
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:A policy tuned for one robot often behaves differently on another, whether due to the sim-to-real gap, unknown payloads, or the differing dynamics of two instances of the same robot. In contact-rich, dynamic manipulation, even small motion discrepancies can result in failure to track reference motion, since they disrupt the timing and modes of contact. Common remedies, such as domain randomization or system identification, either produce overly conservative task policies or require data that must be recollected for each robot or payload. We introduce the Torque Adaptation Module (TAM), a learned module that adapts the torque commands sent to the robot to match the behavior of an ideal robot. TAM operates between the low-level controller that tracks the policy’s actions and the robot’s torque interface. It includes a history encoder that embeds proprioceptive history into a latent state and a torque adaptor that computes residual torque corrections. Because TAM depends only on proprioceptive history and not on policy observations, or the action space, the same TAM weights can be reused to adapt policies with different action spaces (joint targets, end-effector targets, or direct torques). The policies themselves do not need to be trained with domain randomization of robot parameters. Instead, we offload the need for domain randomization to TAM by training it entirely in randomized simulation, using multi-robot pretraining followed by a robot-specific fine-tuning step that still requires no real-robot data. We evaluate TAM zero-shot on a real Franka Panda robot across dynamic manipulation tasks that include a vision-based box pushing policy (from RL), a flip policy (from BC), and an MPC ball-on-plate balancing. Our experiments show that TAM improves zero-shot real-robot execution compared to online system identification and RMA baselines and enables robust dynamic manipulation performance.
[AI-40] Towards the Readability of LLM-Generated Codes through Multitask Representation Engineering
Link: https://arxiv.org/abs/2606.06214
Authors: Huifan Gao,Liuhua He,Yinghui Pan,Shenbao Yu,Yifeng Zeng,Shengchao Qin,Weidi Sun
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Correctness and readability are key measures of code quality, respectively ensuring functional fidelity and ease of comprehension. While most existing research focuses on improving the correctness of code generated by large language models (LLMs), readability remains under-addressed. Enhancing readability through targeted control is challenging due to its subjective nature. In this article, we employ representation engineering (RepE) as the targeted control method given its low data dependency and low computational cost. Prior work on RepE has primarily focused on targeted control for a single task, but improving code readability requires control across multiple tasks. Accordingly, we propose a multitask RepE framework and theoretically discuss the impact of the multitask steering method on the tradeoff between code readability and correctness. We further provide comprehensive experiments in support. All the relevant implementations are open-source and available upon request.
[AI-41] Evaluating Agentic Configuration Repair for Computer Networks
Link: https://arxiv.org/abs/2606.06212
Authors: Rufat Asadli,Benjamin Hoffman,Ioannis Protogeros,Laurent Vanbever
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Misconfigurations in computer networks remain a major source of critical Internet outages. Research is turning to Large Language Models (LLMs) to automate the complex, error-prone task of network configuration. However, even state-of-the-art models fail to resolve misconfigurations in large-scale, complex scenarios and often introduce new errors. In this work, we benchmark open- and closed-source LLMs augmented with formal network verification and context retrieval tools. We demonstrate that agentic architectures outperform base LLMs in repair efficacy (by 12% on average) and safety (by 17% on average), enabled by the ability to dynamically manage context and iteratively validate configuration repairs.
[AI-42] Unsupervised Pattern Analysis in Japanese Veterinary Toxicology: A Regulatory-Compliant Framework for Cross-Species Risk Assessment
Link: https://arxiv.org/abs/2606.06207
Authors: Yukiko Kawakami,Mohammad Shirazi,Ryo Shimizuwa,Saito Shinoda,Alireza Mortazavi,Matsumoto Kawahara
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Biomedical Engineering
Abstract:Veterinary pharmacovigilance systems are essential for monitoring adverse drug events (ADEs), yet existing approaches often fail to capture region-specific toxicity patterns shaped by local biological and regulatory contexts. In Japan, these challenges are amplified by species-specific metabolic differences and reporting practices defined by the Ministry of Agriculture, Forestry, and Fisheries (MAFF). Most prior work relies on prediction-oriented models, limiting mechanistic interpretability. This study proposes a regulatory-integrated unsupervised framework for pattern discovery using the National Veterinary Assay Laboratory (NVAL) database. ADEs are encoded into organ system-aligned representations and adjusted for species-specific reporting biases, enabling cross-species comparison. Similarity-based clustering and dimensionality reduction are applied to identify latent toxicity structures. Analysis of 4,120 high-confidence ADE reports (9,080 drug-ADE combinations) identified three significant species clusters (p < 0.01), including hepatic-dominant patterns in companion animals (0.42 ± 0.06), renal toxicity in ruminants (0.39 ± 0.07), and dermatological sensitivity in sheep (0.35 ± 0.07). Drug-level clustering achieved 83% alignment with pharmacological classes, while cosine similarity outperformed alternative metrics (silhouette score: 0.48; cluster precision: 87%). Regulatory validation showed strong agreement with established classifications. These findings demonstrate that regulation-aligned unsupervised analysis can uncover biologically meaningful, region-specific toxicity patterns, providing an interpretable and scalable framework for veterinary drug safety assessment.
[AI-43] Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains
Link: https://arxiv.org/abs/2606.06201
Authors: Amandeep Kaur,Gyan Prakash
Categories: Artificial Intelligence (cs.AI)
Comments: Nil
Abstract:Pharmaceutical supply chains (PSCs) struggle with inventory management (IM) due to unpredictable demand patterns and variable lead times associated with restocking. This complexity is further compounded by the finite shelf lives of pharmaceutical products, which necessitate a delicate balance between adequate stock and minimal waste. These intertwined factors create a complex optimization problem that requires sophisticated inventory strategies to ensure both product availability and PSC efficiency. This study aims to develop an optimal inventory replenishment policy for pharmaceutical products that can handle the stochasticity arising from uncertain demand and variable PSC conditions. The objective is to maximize the profitability of the PSC while maintaining a high patient service level. We formulate the problem as a Markov decision process and propose a deep reinforcement learning (DRL) approach, specifically a hybrid asynchronous advantage actor-critic distributed proximal policy optimization (A3C DPPO) algorithm. The A3C DPPO algorithm is tailored to handle the continuous action space inherent in IM. The numerical results demonstrate that the proposed algorithm adaptively updates the inventory replenishment strategy under dynamic scenarios, resulting in lower inventory costs compared to various benchmarks. We also conduct numerical validation using real-world pharmaceutical inventory data to confirm the practical feasibility of the proposed algorithm.
[AI-44] ITP-STDP: An Intrinsic-Timing Power-of-Two Learning Engine for On-Chip SNN Training
Link: https://arxiv.org/abs/2606.06159
Authors: Haihang Xia,Xinyu Zhao,Xuecheng Wang,John Goodenough,Charith Abhayaratne,Panagiotis A. Panagiotou,Chunyi Song,Tiantai Deng
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Spiking neural networks (SNNs) have the potential to emerge as the third generation of neural networks and have attracted increasing attention across a wide range of applications. However, the large number of synaptic connections in SNNs leads to intensive weight-update computation by on-chip learning algorithms during training, resulting in substantial hardware resource utilization and energy consumption. Among existing SNN learning algorithms, spike-timing-dependent plasticity (STDP) is one of the most extensively studied and widely adopted, serving as a fundamental learning component in SNNs. To address the hardware and energy overheads associated with SNN training, this paper presents intrinsic-timing power-of-two STDP (ITP-STDP) and its corresponding prototype learning engine hardware architecture. The proposed design is evaluated through a dedicated mean-field synaptic drift model for dynamical analysis and further validated across SNN networks of different scales and datasets. It is further implemented on both ASIC and FPGA platforms and compared with state-of-the-art approaches, including the original STDP and more complex STDP variants. The results demonstrate superior energy efficiency, higher operating speed, and substantially lower hardware resource utilization, as the proposed design eliminates most of the computational overhead of STDP through both algorithmic and hardware-level optimizations. On the FPGA platform, the proposed design improves energy efficiency by 4.5× to 219.8× over the compared designs. On the ASIC platform, the proposed design achieves a 4.8× to 22.01× speedup while consuming only 1.2% to 3.3% of the area required by prior works.
[AI-45] Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models IJCAI2026 IJCAI’26
Link: https://arxiv.org/abs/2606.06154
Authors: Sunny Gupta,Shambhavi Shanker,Amit Sethi
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at International Workshop on Federated Learning in the Age of Foundation Models In Conjunction with IJCAI 2026 (FL@FM-IJCAI’26)
Abstract:Federated fine-tuning of foundation models using Low-Rank Adaptation (LoRA) offers a communication-efficient solution for distributed learning. However, existing federated LoRA methods suffer from two fundamental limitations: (1) structural aggregation bias, where independently averaging low-rank factors fails to approximate the true combined update, and (2) client-side initialization lag, as clients repeatedly reinitialize LoRA parameters across communication rounds, slowing convergence. We propose HyperLoRA, a unified framework that addresses both issues by amortizing federated adaptation with hypernetwork-driven LoRA generation and product-space aggregation. Instead of iterative per-client optimization, HyperLoRA employs a learned generator that maps client distribution signatures to LoRA initializations, effectively amortizing per-client adaptation. On the server side, we introduce a learned aggregation module that directly synthesizes updates in the low-rank product space, eliminating the inconsistencies of factor-wise averaging. A lightweight residual correction module further improves stability under heterogeneous (non-IID) client distributions. By replacing iterative optimization and heuristic averaging with learned operators, HyperLoRA jointly enables efficient personalization, unbiased aggregation, and faster convergence. Experiments on federated vision and vision-language benchmarks show that HyperLoRA achieves improved convergence speed, greater robustness to distribution shift, and stronger personalization performance compared to prior federated LoRA methods.
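The structural aggregation bias named in limitation (1) can be seen in a two-client toy case: the average of the products B·A is generally not the product of the averaged factors. The sketch below only demonstrates that mismatch with hand-picked rank-1 factors; the helper names and the numbers are assumptions made for illustration, and it does not implement HyperLoRA's learned aggregation.

```python
def matmul(b, a):
    """Multiply matrix b (n x r) by matrix a (r x m), both given as nested lists."""
    return [[sum(b[i][k] * a[k][j] for k in range(len(a))) for j in range(len(a[0]))]
            for i in range(len(b))]


def mean(mats):
    """Element-wise average of equally shaped matrices."""
    n = len(mats)
    return [[sum(m[i][j] for m in mats) / n for j in range(len(mats[0][0]))]
            for i in range(len(mats[0]))]


# Two hypothetical clients with rank-1 LoRA factors (B is 2x1, A is 1x2).
B = [[[1], [0]], [[0], [1]]]
A = [[[1, 0]], [[0, 1]]]

# What the server wants: the average of the true per-client updates B_i A_i.
mean_of_products = mean([matmul(b, a) for b, a in zip(B, A)])
# What factor-wise averaging produces instead: mean(B_i) times mean(A_i).
product_of_means = matmul(mean(B), mean(A))

print(mean_of_products)
print(product_of_means)
```

Here the true average update is a rank-2 matrix, while the factor-wise result is rank 1 and differs in every entry, so no rescaling of the averaged factors could recover it.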
[AI-46] WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation
Link: https://arxiv.org/abs/2606.06147
Authors: Shengtao Zheng,Kai Li,Weichen Zhang,Yu Meng,Chen Gao,Xinlei Chen,Yong Li,Xiao-Ping Zhang
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:End-to-end Vision-Language-Action (VLA) models have shown promise in UAV navigation. However, existing approaches typically rely on historical observations to directly predict actions, often struggling in dense urban environments where severe occlusions and sharp turns result in drastic viewpoint transitions. We argue that the ability to “imagine” future states – inherent in World Models – is critical for robust decision-making under such partial observability. To address this, we construct a challenging Urban Canyon Traversal Benchmark, specifically designed to evaluate spatial understanding in scenarios characterized by severe occlusions and drastic viewpoint transitions. To this end, we propose WorldFly, a novel world-model-based VLA framework that employs a dual-branch coupled flow matching mechanism to jointly generate future video predictions and navigation actions, thereby explicitly guiding the agent’s policy via spatial imagination. Extensive evaluations on our benchmark demonstrate that WorldFly outperforms other baselines, particularly in unseen environments, validating the effectiveness of integrating world models into embodied aerial agents.
[AI-47] A Finite Certificate for the Positive n=9 Vasc Inequality
Link: https://arxiv.org/abs/2606.06136
Authors: Dakai Guo,Ruichen Qiu,Yichuan Cao,Ruyong Feng
Categories: Symbolic Computation (cs.SC); Artificial Intelligence (cs.AI)
Comments:
Abstract:We prove the positive-real n=9 case of the Vasc cyclic inequality. The proof was obtained with human-guided assistance from the AI agent MechMath Agent Team: the human-readable part reduces the rational inequality to a homogeneous polynomial inequality, fixes a cyclic maximum, and parametrizes each sorted fixed-maximum cone by cumulative gaps; the finite part is a certificate covering all 8!=40320 sorted cones. MechMath Agent Team generated the certificate verification workflow through Python tool calls, including the case split, verification programs, and terminal classifications. The published certificate has 36815 coefficient leaves, 2236 ordinary Polya multiplier leaves, and 1269 AM-GM midpoint overlay leaves. Human authors audited the mathematical reductions and verification logic, and a separate artifact contains the certificate, an independent verifier, and a from-source rebuild route.
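The leaf counts quoted in this abstract can be cross-checked against the number of sorted cones with a few lines of arithmetic. The dictionary keys are labels chosen here, and reading the three counts as a partition with exactly one terminal leaf per cone is an assumption, not something the abstract states.

```python
import math

# Leaf counts as quoted in the abstract, keyed by labels chosen for this sketch.
leaf_counts = {
    "coefficient": 36815,
    "polya_multiplier": 2236,
    "am_gm_midpoint_overlay": 1269,
}

sorted_cones = math.factorial(8)  # the abstract's 8! sorted fixed-maximum cones
total_leaves = sum(leaf_counts.values())

print(sorted_cones, total_leaves, total_leaves == sorted_cones)
```

The totals agree, which is consistent with each of the 8! cones ending in exactly one of the three leaf types.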
[AI-48] TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation
Link: https://arxiv.org/abs/2606.06133
Authors: Eric Spencer,Arslan Bisharat,Brian Ortiz,Khushboo Bhadauria,TaiNing Wang,George K. Thiruvathukal,Konstantin Laufer,Mohammed Abuhamad
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 12 pages, 5 tables, 3 figures. Submitted at the 21st International Conference on Software Technologies (ICSOFT 2026)
Abstract:TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model’s correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (i.e. pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint; this prevents the trivial-property failure mode.
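The four-tier grading and the headline numbers can be sketched as below. This assumes the tiers are cumulative (each tier requires the one before it), which the abstract suggests but does not state; `grade` and its boolean inputs are hypothetical stand-ins for the outcomes of real parser and TLC runs, which are not invoked here.

```python
def grade(parses, no_warnings, passes_tlc, mutant_violation_detected):
    """Return the highest tier reached, assuming each tier builds on the previous one."""
    if not parses:
        return "None"
    if not no_warnings:
        return "Bronze"
    if not passes_tlc:
        return "Silver"
    if not mutant_violation_detected:
        return "Gold"  # property passes, but a slightly altered property also passes
    return "Diamond"   # TLC catches the altered property, so the original is non-trivial


# Headline arithmetic from the abstract: 9 of 30 problems solved at pass@1.
pass_at_1 = 9 / 30
ratio_to_baseline = (pass_at_1 * 100) / 8.6  # 8.6% is the quoted untuned baseline

print(grade(True, True, True, False), round(pass_at_1 * 100), round(ratio_to_baseline, 1))
```

An output whose property holds even after the small alteration stops at Gold, which is how the Diamond tier screens out always-true properties.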
[AI-49] Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems
Link: https://arxiv.org/abs/2606.06114
Authors: Dianxing Shi,Junqi He,Junhao Chen,Bowen Wang,Yuta Nakashima
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolving systems remains underexplored. We introduce Agent Norm Correction through Human-like Oversight and Review (ANCHOR), an LLM-based framework that simulates human supervision and delivers feedback at various phases of self-evolution. With ANCHOR, we evaluate two representative open-source self-evolving agent systems across coding, mathematical reasoning, and safety. Our results show that even limited supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives. Further analysis shows that supervision over the output verification phase is the most effective for intervention, whereas increasing supervision frequency yields diminishing returns. These findings provide empirical evidence and practical guidance for designing more stable, controllable, and human-aligned self-evolving agent systems.
[AI-50] Step-adaptive multimodal fusion network with multi-scale cloud feature learning for ultra-short-term solar irradiance forecasting
Link: https://arxiv.org/abs/2606.06102
Authors: Jingxin Zhang,Xiaoqin Wang
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Ultra-short-term solar irradiance prediction is critical for photovoltaic system dispatch and power grid stability. Existing approaches suffer from three key shortcomings: single time-series models cannot capture the spatial dynamics of clouds under complex conditions, standard convolutions inadequately represent multi-scale cloud features, and fixed low-frequency compensation strategies fail to adapt to different prediction steps. To address these issues, this paper proposes a multi-source data fusion model for ultra-short-term irradiance prediction. The model first employs InceptionNeXt to extract multi-scale, multi-directional spatial features from ground-based cloud images. A step-adaptive low-frequency compensation unit is then introduced to dynamically modulate global low-frequency information based on the prediction step. Finally, the enhanced image features are combined with meteorological time-series features, and a TempAttnLSTM network captures global temporal dependencies for multi-step prediction. Experiments on the public NREL dataset and on operational photovoltaic stations in Shandong demonstrate the effectiveness of the proposed method compared with several state-of-the-art approaches.
[AI-51] CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model
Link: https://arxiv.org/abs/2606.06099
Authors: Zeyang Yue,Chenfei Yan,Feifei Zhao,Haibo Tong,Mengwen Xu,Xiaozhen Wang,Erliang Lin,Yi Zeng
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Whether Large Language Models (LLMs) exhibit covert psychological manipulation in complex human-AI interactions has garnered increasing safety concerns. However, existing AI safety benchmarks remain largely restricted to explicit rule compliance and static prompts, failing to capture the dynamic and covert nature of manipulative strategies in multi-turn dialogues. We introduce CogManip, a comprehensive benchmark that evaluates 15 manipulation strategy risks across 1,000 multi-turn interaction scenarios, validated by human experts. A systematic evaluation of 13 representative models, including frontier models like GPT-5.4 and DeepSeek-V3.2, reveals significant risk heterogeneities and illuminates the targeted direction for future defense. Further analysis of objective function perturbation reveals that DeepSeek-V3.2’s manipulation tactics are highly sensitive to both negative and benign system prompts, demonstrating the critical necessity of prompt-based defense engineering and implicit goal auditing. CogManip offers a robust instrument and perspective for auditing the implicit psychological influence and dynamic strategy selection of modern LLMs.
[AI-52] Integrating Mechanistic and Data-Driven Models for Neurological Disorders through Differentiable Programming
Link: https://arxiv.org/abs/2606.06094
Authors: Shah Pallav Dhanendrakumar,Saikat Pal,Sitikantha Roy
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS); Medical Physics (physics.med-ph)
Comments:
Abstract:Advances in computational modeling, neuroimaging, and artificial intelligence are revolutionizing the modeling of neurological disorders for improved diagnostics, prognosis, and treatment planning. Mechanistic models provide valuable scientific insight into the disorders, but in practice they are often simplified with assumptions or computationally expensive and slow to solve. However, while purely data driven approaches provide speed and scalability, they require large, high quality data to train and generally suffer from interpretability and generalization issues. This perspective paper presents a structured overview of hybrid modeling strategies, which combine deep learning models with physics based solvers, and are categorized into parallel, series, and parallel-series architectures. Three main approaches that have been emphasized are residual modeling for missing or incomplete physics, Neural Ordinary Differential Equations (NODEs) for continuous time dynamics approximation, and solver in the loop that accelerates traditional solvers with neural approximations. These hybrid models integrate the governing differential equation based formulations and deep learning to characterize the evolution of neurological disorders, and promise advanced personalized neurological modeling. In addition, the study explores and proposes different hybrid configurations to improve diagnosis accuracy, predict disease progression, and inform treatment strategies across a range of neurological disorders. These capabilities outperform standalone mechanistic or purely data driven approaches, making hybrid modeling a powerful tool, especially in applications involving modeling the progression and treatment responses in neurological conditions such as brain tumors, Alzheimer’s disease, and stroke.
[AI-53] Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents
Link: https://arxiv.org/abs/2606.06090
Authors: Yaoqi Chen,Haibin Lai,Yuru Feng,Chuyu Han,Qianxi Zhang,Baotong Lu,Menghao Li,Xinjiang Wang,Zhirui Wang,Shusen Xu,Zengzhong Li,Zewen Jin,Hao Wu,Cheng Li,Qi Chen
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages
Abstract:LLM-based agents increasingly tackle long-horizon tasks with interdependent decisions, where each action reshapes future constraints and intermediate errors can cascade. Existing RAG and agent memory systems organize histories by semantic similarity, retrieving content-relevant entries at decision time. We argue that this design mismatches execution-state dependencies: it fragments decision trajectories and mixes valid and erroneous traces, hindering coherent state reconstruction and error isolation. We propose MAGE (Memory as Agent-Guided Exploration), an active execution-state manager that stores interactions in a hierarchical state tree. The agent derives its state from the active root-to-current path, combining subgoal summaries, recent traces, and hints from prior branches. Four coupled operations maintain the tree: Grow records new traces, Compress summarizes completed subgoals, Maintain validates summaries, and Revise restores a target boundary and resumes on a new branch. This design bounds context growth while preserving state integrity and isolating flawed segments from the active path. Experiments on MemoryArena show that MAGE improves the average task success rate by 7.8–20.4 pp over baselines, while reducing token consumption by 55.1%.
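The hierarchical state tree and three of its operations (Grow, Compress, Revise) can be illustrated with a toy data structure. This is only a sketch under stated assumptions: the class and method names are hypothetical, the Maintain operation and hints from prior branches are omitted, and the way context is assembled from the root-to-current path is a guess at the idea rather than the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through parent links
class Node:
    subgoal: str
    traces: List[str] = field(default_factory=list)
    summary: Optional[str] = None
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


class StateTree:
    def __init__(self, root_goal: str) -> None:
        self.root = Node(root_goal)
        self.current = self.root

    def open_subgoal(self, subgoal: str) -> Node:
        child = Node(subgoal, parent=self.current)
        self.current.children.append(child)
        self.current = child
        return child

    def grow(self, trace: str) -> None:
        """Grow: record a new trace under the current subgoal."""
        self.current.traces.append(trace)

    def compress(self, summary: str) -> None:
        """Compress: summarize the finished subgoal and return to its parent."""
        self.current.summary = summary
        if self.current.parent is not None:
            self.current = self.current.parent

    def revise(self, boundary: Node, new_subgoal: str) -> Node:
        """Revise: restore a boundary node and resume on a fresh branch."""
        self.current = boundary
        return self.open_subgoal(new_subgoal)

    def active_context(self) -> List[str]:
        """Build context from the root-to-current path only."""
        path = []
        node = self.current
        while node is not None:
            path.append(node)
            node = node.parent
        path.reverse()
        lines = []
        for node in path:
            lines.append(f"[{node.subgoal}]")
            # Completed sibling subgoals contribute summaries, not raw traces.
            for child in node.children:
                if child.summary is not None and child not in path:
                    lines.append(f"summary: {child.summary}")
            lines.extend(node.traces)
        return lines


tree = StateTree("book trip")
tree.open_subgoal("find flight")
tree.grow("searched flights")
tree.compress("flight AB123 chosen")
tree.open_subgoal("book hotel")
tree.grow("picked wrong city")
tree.revise(tree.root, "book hotel (retry)")
tree.grow("picked correct city")
context = tree.active_context()
```

In this toy run the abandoned "book hotel" branch stays in the tree but never reaches `context`, which is one way to read the claim that flawed segments are isolated from the active path.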
[AI-54] Metamorphic Testing with the Rashomon Set: Explanation Faithfulness in Machine Learning
Link: https://arxiv.org/abs/2606.06056
Authors: Helge Spieker,Jørn Eirik Betten,Arnaud Gotlieb
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the 10th International Workshop on Metamorphic Testing (MET 2026)
Abstract:Multiple machine learning models can achieve near-equivalent predictive performance on the same task, yet provide divergent feature-based explanations. This is called the Rashomon effect of (explainable) machine learning, and it raises the question of which explanations, if any, are trustworthy. We propose a framework based on metamorphic testing that assesses explanation faithfulness without requiring ground-truth labels by exploring attributed feature importance from post-hoc explanation methods. Five metamorphic relations formalize expected consistency properties between model behavior and feature attributions. We apply this general framework to two tabular regression datasets and two post-hoc explainers (SHAP and LIME) to demonstrate the approach. The framework offers a practical, model-agnostic tool for selecting accurate models with reliable and trustworthy explanations.
[AI-55] When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents
Link: https://arxiv.org/abs/2606.06055
Authors: Lingxiang Xu,Jiaoyun Yang,Min Hu,Hongtu Chen,Ning An
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages, 10 figures
Abstract:Long-term memory enables language model agents to support personalized interactions, but it remains unclear when available memories warrant integration into responses. Existing memory evaluations emphasize retrieval accuracy and downstream task utility, while overlooking whether retrieved sensitive memory content is warranted in the current turn. We introduce RBI-Eval, a controlled measurement study built around a probe set that compares model behavior with and without access to sensitive memory under identical benign prompts. We evaluate four base LLMs against a matched no-memory reference across four memory-access settings: full-context exposure and three retrieval systems. Our results reveal substantial behavioral divergence. With memory available, the separation score for sensitive-memory integration decreases by 8.9%–26.6% relative to the matched no-memory reference for GPT-5.4-mini, but by 51.1%–82.9% for Claude-Sonnet-4.6, DeepSeek-V4-Flash, and Qwen3.5-9B. Control experiments on DeepSeek and GPT-5.4-mini show this effect is specific to sensitive content, rather than general personalization. Retrieval systems reduce exposure but do not eliminate integration once sensitive memory reaches the generator. These findings suggest safe personalization requires memory-aware decisions at both retrieval and generation time.
[AI-56] Beyond Similarity: Trustworthy Memory Search for Personal AI Agents
Link: https://arxiv.org/abs/2606.06054
Authors: Jiawen Zhang,Kejia Chen,Jiachen Ma,Yangfan Hu,Lipeng He,Yechao Zhang,Jian Liu,Xiaohu Yang,Tianwei Zhang,Ruoxi Jia
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.
[AI-57] Sample-efficient Low-level Motion Planning for Robotic Manipulation Tasks via Zero-shot Transfer Learning ICANN
Link: https://arxiv.org/abs/2606.06041
Authors: Yuanzhi He,Victor Romero-Cano,José J. Patiño,Juan David Hernández,William Sawtell,Gualtiero Colombo
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 12 pages, 5 figures; accepted at the International Conference on Artificial Neural Networks (ICANN) 2026
Abstract:As robotic systems become more sophisticated, the growing complexity of their motion planning models and the longer training times pose substantial challenges. Evolutionary algorithms such as the Sample-efficient Cross-Entropy Method (iCEM) have recently demonstrated promising potential for low-level real-time planning by leveraging efficient knowledge reuse strategies to improve performance. Although effective in many control tasks, iCEM’s performance can be constrained in more complex scenarios, particularly those requiring stacking, sliding, and shelf placement. In this work, we propose a novel iCEM+TL framework that explicitly leverages Transfer Learning (TL), where key iCEM parameters are transferred from simpler upstream tasks to guide more complex downstream tasks. Additionally, we applied Reward Redesign (RR) through task decomposition for stacking objects and shelf placement to optimize task-specific performance. Results from the simulation show that our framework achieves success rate improvements of up to 23%. The framework is further validated on a real Franka Emika robot in a stacking task, demonstrating its practical feasibility for real-world deployment.
[AI-58] When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet
Link: https://arxiv.org/abs/2606.06034
Authors: Luoming Zhang,Yuwei Ren,Kui Zhang,Tian Liu,Lingjuan Ge,Denghao Li,Matthew Harper Langston,Yin Huang,Weiliang Will Zeng,Liang Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bit INT by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model’s accuracy. Experiments on Qwen3.5-family models demonstrate up to 5x kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.
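The Neumann-series idea behind this abstract can be shown on a small example. For a strictly lower-triangular n×n matrix A, A^n = 0, so (I - A)^{-1} = I + A + ... + A^{n-1} exactly, and truncating the sum earlier gives an approximation built only from matrix multiplications. The sketch below is plain Python and assumes the matrix to invert has the form I - A; it does not reproduce the paper's structural masking, residual correction, or quantization, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_inverse(a, order):
    """Approximate (I - A)^{-1} by I + A + ... + A^order, using only MatMul."""
    n = len(a)
    total = identity(n)
    power = identity(n)
    for _ in range(order):
        power = matmul(power, a)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total


def max_abs_diff(a, b):
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Strictly lower-triangular 4x4 example (values chosen for illustration only).
A = [[0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.25, 0.5, 0.0, 0.0],
     [0.125, 0.25, 0.5, 0.0]]
n = len(A)
I_minus_A = [[identity(n)[i][j] - A[i][j] for j in range(n)] for i in range(n)]

exact = neumann_inverse(A, order=n - 1)      # series terminates: A^4 = 0
truncated = neumann_inverse(A, order=1)      # I + A only

exact_error = max_abs_diff(matmul(I_minus_A, exact), identity(n))
truncated_error = max_abs_diff(matmul(I_minus_A, truncated), identity(n))
```

The truncated version leaves a residual of I - (I - A)(I + A) = A^2, which is the kind of error a residual-correction step would have to absorb.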
[AI-59] OPRD: On-Policy Representation Distillation
Link: https://arxiv.org/abs/2606.06021
Authors: Shenzhi Yang,Guangcheng Zhu,Bowen Song,Haobo Wang,Mingxuan Xia,Xing Zheng,Yingfan Ma,Zhongqi Chen,Weiqiang Wang,Gang Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen’s ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: this https URL.
[AI-60] PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models
Link: https://arxiv.org/abs/2606.06014
Authors: Xiaoyun Qiu,Jingtao He,Yijie Chen,Yusong Huang,Haotian Wang,Yixuan Wang,Xinhu Zheng
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed up-stream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline-challenging scenes. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN-S can produce diverse cost maps, with spatially consistent variations aligned to different driving styles.
[AI-61] Beyond Vector Similarity: A Structural Analysis of Graph-Augmented Retrieval for Industrial Knowledge Graphs
Link: https://arxiv.org/abs/2606.06003
Authors: Grama Chethan
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages
Abstract:Retrieval-Augmented Generation (RAG) fails systematically on queries requiring structural reasoning over interconnected entities. We compare eight retrieval architectures for aerospace supply chain intelligence, progressing from text retrieval through graph traversal to graph computation. Using a 46-node knowledge graph with 64 typed edges, we evaluate 23 queries across 10 intent categories and demonstrate that five query classes are structurally unreachable for vector retrieval. Our central finding is the operator vocabulary thesis: the barrier to LLM-based graph reasoning is not model intelligence but the computational operators available as tools. An LLM Query Planner with 9 typed traversal primitives outperforms bespoke handlers (F1 = 0.632 vs. 0.472) while generalizing to unseen queries. When 6 graph computation tools are added, the LLM selectively adopts them for exactly the query categories where traversal fails. We also identify a measurement gap: entity-level F1 systematically under-scores structural queries where comprehensive answers are correct.
[AI-62] AttackPathGNN: Cross-function vulnerability detection in smart contracts using state interference graphs and conjunction pooling
Link: https://arxiv.org/abs/2606.05986
Authors: Gabriela Dobrita,Simona-Vasilica Oprea,Adela Bara
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing learning-based detectors for Solidity smart contracts reduce vulnerability detection to syntactic pattern matching within single functions, yet many of the most consequential exploits (The DAO, Cream Finance) exist not in any individual function but in the relationship between functions and in the combination of conditions that made the attack feasible. Thus, we propose AttackPathGNN, a graph neural network (GNN) that reframes detection as reasoning over explicit attack paths. Two architectural choices distinguish it from prior GNN-based detectors: (1) a State Interference Graph that links every pair of functions sharing mutable storage through typed, weighted edges and through directed reentrancy-path edges defined by an explicit five-condition predicate; (2) conjunction pooling, a differentiable AND-aggregator over eight named exploit preconditions whose log-sigmoid form causes the per-function exploit score to collapse whenever any single mitigation (a reentrancy guard, an access-control modifier or SafeMath) is in place. Across five independent training runs, AttackPathGNN attains 92.3±0.2% F1 on the SmartBugs Wild held-out test partition (4.3±0.3% false-negative rate, 90.8±2.5% detection rate on the independently human-labelled SmartBugs Curated benchmark), recovering 6/10 DASP10 categories at 100% on every seed and Reentrancy at 98.7±1.8%. Each prediction is emitted with a structured remediation report, turning each verdict into an actionable, function-level audit finding.
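The conjunction-pooling idea, a differentiable AND whose score collapses when any one precondition fails, can be sketched as a sum of log-sigmoids. This is an illustration only: the function names are hypothetical, the eight logits are made-up values, and the exact form used in the paper may differ from the plain product of sigmoids assumed here.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)) = -softplus(-z)."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def conjunction_pool(precondition_logits) -> float:
    """Soft AND: exp(sum of log-sigmoids) equals the product of sigmoids."""
    return math.exp(sum(log_sigmoid(z) for z in precondition_logits))


# Eight hypothetical precondition logits, all strongly satisfied.
all_met = [10.0] * 8
# Same, except one precondition is blocked (e.g. a reentrancy guard is present).
one_blocked = [10.0] * 7 + [-10.0]

score_all_met = conjunction_pool(all_met)
score_one_blocked = conjunction_pool(one_blocked)
```

Because the terms multiply rather than average, a single strongly negative logit drives the pooled score toward zero regardless of how confident the other seven are.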
[AI-63] World-Language-Action Model for Unified World Modeling Language Reasoning and Action Synthesis
Link: https://arxiv.org/abs/2606.05979
Authors: Yi Yang,Zhihong Liu,Siqi Kou,Yiyang Chen,Yanzhe Hu,Jianbo Zhou,Boyuan Zhao,Zhijie Wei,Xiao Xia,Xueqi Li,Pengfei Liu,Zhijie Deng
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 19 pages, 10 figures
Abstract:We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an autoregressive (AR) Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction implicitly impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from cross-embodiment robot videos without action annotations.
[AI-64] Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs KDD2026
Link: https://arxiv.org/abs/2606.05966
Authors: Tianyi Tang,Zhuoyi Lin,Zeyu Feng,Tianyi Ma,Yew-Soon Ong,Ivor Tsang,Haiyan Yin
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026 Dataset and Benchmark Track
Abstract:Understanding and reasoning about the physical world is the foundation of intelligent behavior, yet state-of-the-art vision-language models (VLMs) still fail at causal physical reasoning, often producing plausible but incorrect answers. To address this gap, we introduce CausalPhys, a benchmark of over 3,000 carefully curated video- and image-based questions spanning four domains: Perception, Anticipation, Intervention, and Goal Orientation. Each question is paired with an expert-annotated causal graph capturing object-attribute-event dependencies, enabling interpretable and fine-grained evaluation of causal understanding. Building on this, we formulate a causal-graph-grounded metric that quantitatively measures how well a model’s chain-of-thought reasoning aligns with the correct causal relations, moving beyond answer-only accuracy and enabling systematic diagnosis of VLMs’ causal reasoning failures. Using this metric, we conduct a comprehensive analysis of leading VLMs, revealing systematic gaps in capturing causal dependencies and underscoring the need for causality-aware learning. To address these limitations, we further propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structures. Extensive experiments demonstrate that CRFT substantially enhances both reasoning accuracy and interpretability across multiple model backbones. By unifying dataset curation, causal evaluation, and causality-informed learning, CausalPhys establishes a strong foundation for advancing modern VLMs toward causally grounded physical reasoning.
[AI-65] Bidirectional Search for Longest Paths: Case for Front-to-Front Heuristics
Link: https://arxiv.org/abs/2606.05956
Authors: Tzur Shubi,Ariel Felner,Solomon Eyal Shimony,Shahaf S. Shperberg
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Bidirectional heuristic search can potentially reduce search effort for problems amenable to backward search. Therein, it is well-known that front-to-front heuristics can reduce the number of node expansions, but their overhead is so high that overall runtime almost always increases. We propose BiXDFBnB, a bidirectional depth-first branch-and-bound algorithm that adapts the Single-Frontier Bidirectional Search (SFBDS) framework - originally developed for shortest-path (MIN) problems - to the Generalized Longest Simple Path (GLSP) setting. Because SFBDS inherently operates on paired states, front-to-front (F2F) heuristic evaluation arises naturally and avoids the overhead typically associated with bidirectional frontier management. We show that this adaptation can be successfully applied to maximization (MAX) problems while efficiently handling overlapping constraints. BiXDFBnB is applied to several types of longest-path problems: Longest Simple Path (LSP), Snakes, and Coil-in-the-Box (CIB). Empirical evaluation shows that the new algorithm frequently reduces the number of node expansions and, in some cases, also improves overall runtime.
[AI-66] Learning of Robot Safety Policies via Adversarial Synthetic Scenarios
Link: https://arxiv.org/abs/2606.05952
Authors: Nikolai Dorofeev,Alexey Odinokov,Rostislav Yavorskiy
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:In this work, we propose an agentic gamification framework for hazard-informed learning of robot safety policies through synthetic scenarios. We model scenario generation as an adversarial game between two agents: a Red Team that explores the space of potential failures by constructing hazardous situations, and a Blue Team that incrementally refines safety policies to prevent them. This iterative process enables efficient discovery of high-risk edge cases that are unlikely to be captured through random simulation or manual enumeration. By combining classical risk modeling with adversarial scenario generation and modern learning paradigms, this work provides a scalable pathway for embedding safety into Physical AI systems operating in complex real-world environments. The paper describes ongoing work. The contribution is a problem formulation and a proposed solution architecture.
[AI-67] Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing
Link: https://arxiv.org/abs/2606.05950
Authors: Yuxiao Ye,Haoran He,Fangyuan Kong,Xintao Wang,Pengfei Wan,Kun Gai,Ling Pan
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.
[AI-68] A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR
Link: https://arxiv.org/abs/2606.05932
Authors: Yuze Gao
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 7 figures
Abstract:Reinforcement learning from verifiable rewards (RLVR) improves reasoning even when the reward signal is spurious – assigning credit to the group-plurality answer rather than a ground-truth verifier. Practitioners commonly interpret naive = acc(TRUE) - acc(RANDOM) as the reward-design effect. We prove this estimand is systematically biased: it conflates self-consistency elicitation (sharpening the policy toward its modal answer via majority pseudo-reward) with genuine reward-design signal. Using a controlled tabular-GRPO simulator we derive an exact telescoping decomposition total = null + elicit + rd and measure each term across five prior-strength levels. The reward-design fraction of the naive estimator ranges from 0.139 at weak prior (ps=0.20) to 0.05 at strong prior (ps=0.80), with the elicitation term flipping sign at the self-consistency crossover. A pre-registered 2x2x2 factorial confirms non-additivity (interaction ratio 0.385; AxC effect -0.089). A points-vs-bounds pilot gate shows strong-prior regimes are point-identified while near-crossover regimes are only bounded. Re-audits of two named published results yield ELICITATION DOMINATED (elicitation share 0.98) and REWARD DESIGN DOMINATED (rd share 1.18) verdicts respectively, demonstrating the diagnostic value of the partition. We pre-commit to submit regardless of flip outcome; a non-flip is a finding of equal standing. We release a reusable one-command harness for any alignment paper to run the same audit.
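The decomposition total = null + elicit + rd can be illustrated with a toy calculation. The sketch below assumes four hypothetical training conditions (an untrained base policy, a RANDOM reward, a majority pseudo-reward, and the TRUE reward) and made-up accuracies. The abstract does not say how each term is measured, so the mapping of terms to conditions, the function name `partition`, and all numbers are assumptions for illustration.

```python
def partition(acc_base: float, acc_random: float,
              acc_majority: float, acc_true: float) -> dict:
    """Split the total gain into three telescoping differences (assumed mapping)."""
    null = acc_random - acc_base        # gain from training on an uninformative reward
    elicit = acc_majority - acc_random  # extra gain from majority pseudo-reward
    rd = acc_true - acc_majority        # extra gain from the ground-truth verifier
    total = acc_true - acc_base
    naive = acc_true - acc_random       # the estimand the abstract calls "naive"
    return {
        "null": null,
        "elicit": elicit,
        "rd": rd,
        "total": total,
        "naive": naive,
        "rd_share": rd / naive if naive != 0 else float("nan"),
    }


# Hypothetical accuracies, not taken from the paper.
result = partition(acc_base=0.40, acc_random=0.42, acc_majority=0.55, acc_true=0.58)
```

Under this reading, naive = elicit + rd, so the naive difference overstates the reward-design effect whenever the elicitation term is positive, and a negative elicitation term would push the reward-design share above 1.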
[AI-69] Towards World Models in Biomedical Research
Link: https://arxiv.org/abs/2606.05925
Authors: Guangyu Wang,Jingkun Yue,Siqi Zhang,Yu Liu,Xiaoyu Wang,Mingyuan Meng,Changwei Ji,Zongbo Han,Yulin Wang,Yang Yue,Frank Fu,Ting Chen,Song Wu,Ziwei Liu,Jiangning Song,Ming Li,Gao Huang,Xiaohong Liu,Athanasios Vasilakos,Xingcai Zhang,Ping Zhang,Yong Li
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI-driven discovery. These models learn latent representations of molecular, cellular, tissue and clinical states, together with intervention-conditioned dynamics that allow future trajectories to be simulated before actions are taken. We discuss how biomedical world models could function as data engines, environment simulators and scientific planning substrates across applications including virtual cells, organoids, virtual patients and surgical simulation. We outline the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks required. Biomedical world models may provide a foundation for simulation-guided, closed-loop and experimentally actionable biomedical discovery.
[AI-70] Retry Policy Gradients in Continuous Action Spaces
Link: https://arxiv.org/abs/2606.05888
Authors: Soichiro Nishimori,Paavo Parmas
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces. We study the resulting learning dynamics and show that, even with deterministic rewards, ReMax can encourage stochastic exploration by reshaping the policy-gradient landscape. In particular, it alters gradients both in direction, biasing updates toward higher policy entropy, and in magnitude, damping gradients and slowing convergence. We further show that Adam’s adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, we instantiate this objective as ReMax Actor-Critic (ReMAC), an off-policy actor–critic algorithm that optimizes the ReMax objective using a pathwise derivative estimator. Our experiments show that ReMAC can promote higher policy entropy without entropy regularization and achieves performance comparable to SAC.
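A pathwise derivative for a max@K objective can be illustrated on a one-dimensional toy problem. The sketch below is an assumption-laden illustration, not the paper's ReMax or ReMAC implementation: the quadratic reward, the Gaussian policy `a = mu + sigma * eps`, and all names are hypothetical. The gradient flows only through the best of the K sampled actions.

```python
import random

TARGET = 2.0  # hypothetical optimum of the toy reward


def reward(a):
    """Deterministic toy reward, maximal at a == TARGET."""
    return -(a - TARGET) ** 2


def reward_grad(a):
    """Derivative of the toy reward with respect to the action."""
    return -2.0 * (a - TARGET)


def max_at_k_pathwise(mu, sigma, k, n_batches=4000, seed=0):
    """Monte Carlo estimate of E[max_k r(a_k)] and its pathwise gradients.

    Actions are reparameterized as a = mu + sigma * eps with eps ~ N(0, 1),
    so da/dmu = 1 and da/dsigma = eps for the selected (best) sample.
    """
    rng = random.Random(seed)
    value = g_mu = g_sigma = 0.0
    for _ in range(n_batches):
        eps = [rng.gauss(0.0, 1.0) for _ in range(k)]
        actions = [mu + sigma * e for e in eps]
        best = max(range(k), key=lambda i: reward(actions[i]))
        slope = reward_grad(actions[best])
        value += reward(actions[best])
        g_mu += slope
        g_sigma += slope * eps[best]
    n = float(n_batches)
    return value / n, g_mu / n, g_sigma / n


v1, gmu1, gsig1 = max_at_k_pathwise(0.0, 1.0, k=1)
v8, gmu8, gsig8 = max_at_k_pathwise(0.0, 1.0, k=8)
```

In this toy setting, moving from K=1 to K=8 shrinks the gradient on the mean and flips the gradient on the standard deviation from negative to positive, which matches the damping and entropy-seeking effects the abstract describes.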
[AI-71] QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
Link: https://arxiv.org/abs/2606.05875
Authors: Jianxin Yan,Wangze Ni,Zhenxin Li,Jiabao Jin,Zhitao Shen,Haoyang Li,Jia Zhu,Peng Cheng,Xuemin Lin,Lei Chen,Kui Ren
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:
Abstract:Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.
[AI-72] Compositional Boundaries for Density Fusion
Link: https://arxiv.org/abs/2606.05871
Authors: Ratan Bahadur Thapa,Ali Darijani,Jürgen Beyerer,Steffen Staab
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:
Abstract:Distributed uncertainty-management systems often combine local probabilistic models along aggregation trees chosen by communication, privacy, or scheduling constraints. The final density should depend on the weighted sources, not on the particular order in which intermediate nodes combine them. We study this requirement as an algebraic compositionality problem for binary fusion of weighted probability densities. The central question is when a local fusion rule can be executed hierarchically while remaining order-invariant. We establish a compositional boundary for local segment-valued fusion rules. Within the class of continuous binary rules with additive output weights and weight-only coefficients, order-invariant hierarchical execution characterizes normalized weighted linear pooling; norm-induced segment balancing realizes the corresponding coefficient. Smooth endpoint-to-candidate $f$-divergence balancing has a different local geometry: its quadratic expansion induces square-root effective weights, showing why pairwise solvability alone is insufficient for schedule-independent fusion. We show that this obstruction is local to endpoint-to-candidate binary balancing, whereas global divergence barycenters retain additive-weight local limits. Finally, Gaussian mixtures show how the same issue appears in finite model classes: exact fusion is compositional, whereas stepwise compression is compositional only under a congruence condition on unnormalized component measures. These results distinguish exact schedule-independent fusion from global aggregation objectives and local approximation heuristics.
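The order-invariance property discussed in this abstract can be checked numerically on small discrete distributions. The sketch below is illustrative only: `linear_pool` is normalized weighted linear pooling with additive output weights, while `sqrt_weight_pool` is a hypothetical stand-in for a rule with square-root effective weights, not the paper's divergence-balancing construction. The three sources are made up.

```python
from fractions import Fraction as F
from math import isclose, sqrt


def linear_pool(a, b):
    """Normalized weighted linear pooling; the output weight is additive."""
    (wa, pa), (wb, pb) = a, b
    w = wa + wb
    return w, [(wa * x + wb * y) / w for x, y in zip(pa, pb)]


def sqrt_weight_pool(a, b):
    """Hypothetical binary rule whose mixing coefficient uses square-root weights."""
    (wa, pa), (wb, pb) = a, b
    c = sqrt(wa) / (sqrt(wa) + sqrt(wb))
    return wa + wb, [c * x + (1 - c) * y for x, y in zip(pa, pb)]


# Three weighted sources over a three-point sample space (made-up values).
s1 = (F(1), [F(1, 2), F(1, 2), F(0)])
s2 = (F(1), [F(0), F(1, 2), F(1, 2)])
s3 = (F(4), [F(1, 4), F(1, 4), F(1, 2)])

# Two different aggregation trees over the same sources.
left = linear_pool(linear_pool(s1, s2), s3)
right = linear_pool(s1, linear_pool(s2, s3))

left_sqrt = sqrt_weight_pool(sqrt_weight_pool(s1, s2), s3)
right_sqrt = sqrt_weight_pool(s1, sqrt_weight_pool(s2, s3))
```

With exact rational arithmetic, `left` and `right` coincide, whereas the square-root variant returns different densities for the two schedules.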
[AI-73] Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction
Link: https://arxiv.org/abs/2606.05863
Authors: Hu Tan,Kuo Gai,Shihua Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learned representation, and we call the resulting pair of stopping times two training clocks. For deep linear networks, we show that a post-margin gap-growth or one-step tail-contraction condition reduces the cross-entropy loss to level epsilon on a logarithmic time scale. In contrast, when layerwise weight decay is present, the induced regularization on the end-to-end map can be expressed as a Schatten-type penalty; under a sharp late-time Kurdyka-Lojasiewicz tail, this structural energy closes on a polynomial time scale. The two clocks, therefore, separate fitting from representation simplification. We then explain how the same mechanism can appear in ReLU MLPs. In regions where the activation patterns on the training set remain fixed, the network reduces to a linear model in the active coordinates. In a two-layer ReLU embedding model, chain-rule estimates further show that the classifier head can receive larger effective gradients than the embedding block under controlled downstream norms. This supports a two-stage mechanism in which the classifier fits first, while the representation continues to simplify later. We use modular addition as the main experimental setting. The deep linear theory provides the rigorous core of the analysis. But the ReLU results are formulated as conditional reductions that account for empirical behavior without claiming a global proof for nonlinear training dynamics.
[AI-74] LLM Codec: Adapting Video Codecs for Efficient Weight Compression of Large Language Models
Link: https://arxiv.org/abs/2606.05861
Authors: Rui Wang,Yan Zhao,Li Song,Zhengxue Cheng
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures. Submitted to IEEE BMSB 2026
Abstract:The rapid development of large language models (LLMs) has led to remarkable advances in natural language processing. However, the increasing scale of these models introduces substantial challenges in terms of storage, transmission, and deployment. Although great efforts have been devoted to model compression and quantization, existing methods often rely on fine-tuning or calibration data, which exhibit limited generalization across different tensor types. In this paper, we argue that video codecs offer a promising solution for LLM compression, due to their inherent compatibility with matrix-structured data, configurable compression strategies, and the availability of highly optimized, off-the-shelf implementations. Therefore, we present LLMCodec, a video codec-based LLM compression method that integrates affine quantization with the recent VVC/H.266 video codec. Beyond VVC, we further compare a range of video codecs and encoding profiles to evaluate their impact on compression performance. Experiments on different models demonstrate the robustness and generality of LLMCodec. Notably, on LLaMA-3-8B at 2-bit precision, LLMCodec reduces perplexity by over 1.5x and improves downstream task accuracy by 21% compared with the existing method.
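Affine quantization, the step this abstract pairs with a video codec, maps floating-point weights to small integer codes that can then be laid out like pixel values. The sketch below shows only that quantization step under simple assumptions (per-tensor min/max range, 8 bits, made-up weights); the codec stage and the paper's exact quantizer are not reproduced, and the function names are hypothetical.

```python
def affine_quantize(weights, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] using a min/max range."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    # Guard against a constant tensor, where the range would be zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def affine_dequantize(codes, scale, lo):
    """Invert the mapping; the result differs from the input by rounding error."""
    return [lo + c * scale for c in codes]


# Made-up row of weights for illustration.
row = [-0.42, 0.0, 0.13, 0.9, -0.07, 0.55]
codes, scale, lo = affine_quantize(row, bits=8)
restored = affine_dequantize(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

The reconstruction error of each weight is bounded by half a quantization step, which is the property that makes the integer codes a reasonable input for a downstream lossy or lossless coder.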
[AI-75] UniVoice: A Unified Model for Speech and Singing Voice Generation
Link: https://arxiv.org/abs/2606.05852
Authors: Junjie Zheng,Huixin Xue,Shihong Ren,Chaofan Ding,Hao Liu,Zihao Chen
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 2 figures
Abstract:Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).
[AI-76] Agentic Molecular Recovery via Molecule-Aware Exploration
Link: https://arxiv.org/abs/2606.05847
Authors: Suwan Yoon,Changhee Lee
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Text-guided molecular generation with LLMs often yields invalid SMILES. We argue that invalid drafts should be addressed through a shift from validity-oriented repair to identity-preserving molecular recovery: the objective is not only to restore chemical validity, but also to preserve target-relevant structural cues and recover the molecular identity implied by the description. This perspective reveals the limitations of existing correction strategies. Post-hoc repair can recover validity while distorting key structures, LLM-only correction can introduce unintended global drift, and generic agentic correction remains constrained by greedy single-candidate trajectories even when equipped with executable RDKit edit tools. To address these limitations, we propose AMREC, which couples molecule-aware mismatch tracking with expanded candidate exploration and trajectory-level selection. On invalid ChEBI-20 drafts from three backbone models, AMREC achieves the strongest overall recovery profile across structural, exact-match, and string-level metrics.
[AI-77] GenTI: Benchmarking LLMs for Autonomous IDPS Rule Generation for Unseen Attacks
Link: https://arxiv.org/abs/2606.05844
Authors: Hassan Jalil Hadi,Rehana Yasmin,Ali Shoker
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Rule-based Intrusion Detection and Prevention Systems (IDPS) offer precise attack detection as well as mitigation; however, their manually crafted, signature-driven rules limit adaptability to emerging and zero-day threats. Additionally, existing public datasets (e.g., CICIDS2017, UNSW-NB15) focus on traffic classification and provide little structured information to support automatic rule synthesis or prevention logic. To address this gap, we propose Generative Thread Intelligence (GenTI), an LLM-driven benchmark for automatic generation of IDPS rules targeting unseen attacks (GenTI refers to the proposed framework, and GTI refers to the dataset). The dataset (GTI) aggregates over 150k detection and prevention rules from Snort, Suricata, and Emerging Threats, as well as 50k YARA rules, each annotated with protocol behavior, payload signatures, contextual relationships, mappings to Cyber Threat Intelligence (CTI), and actionable response types (alert, drop, reject). On top of this corpus, we design an LLM-based pipeline that transforms analyst prompts and representative payloads into deployable rules via structured prompt engineering, Chain-of-Thought (CoT) reasoning, and a Chain-of-Verification (CoVe) loop for syntactic, semantic, and security validation. The generated rules are executed in real time on Snort/Suricata and evaluated by syntax accuracy, semantic similarity, CTI coverage, security effectiveness, and unseen-attack detection. Our GenTI instantiation achieves a composite rule-quality score of 89.4%, with 94.8% CTI coverage, improving unseen-attack detection from 45% to 87.4% and reducing the false-positive rate from 8.5% to 2.3%. Overall, GenTI establishes the first large-scale benchmark that tightly couples rule-level CTI with LLM-based automation, enabling adaptive, self-evolving IDPS.
[AI-78] Consistency Training Along the Transformer Stack EMNLP2026
Link: https://arxiv.org/abs/2606.05817
Authors: Sukrati Gautam,Neil Shah,Arav Dhoot,Bryan Maruyama,Caroline Wei,Rohan Kapoor,Robert Sidey,Prakhar Gupta,Zi Cheng Huang,David Demitri Africa
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to EMNLP 2026
Abstract:Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another, and identify a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, while distinguishing BCT as mechanistically distinct. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.
[AI-79] When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
Link: https://arxiv.org/abs/2606.05806
Authors: Dongsheng Zhu,Xuchen Ma,Yucheng Shen,Xiang Li,Yukun Zhao,Shuaiqiang Wang,Lingyong Yan,Dawei Yin
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale $3.66\times$ slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at this https URL.
[AI-80] From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents
Link: https://arxiv.org/abs/2606.05805
Authors: Yuhao Sun,Jiacheng Zhang,Shaanan Cohney,Zhexin Zhang,Feng Liu,Xingliang Yuan
Categories: Artificial Intelligence (cs.AI)
Comments: 32 pages
Abstract:LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent’s context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: this https URL.
[AI-81] Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
Link: https://arxiv.org/abs/2606.05792
Authors: Arslan Bisharat,Brian Ortiz,Eric Spencer,Khushboo Bhadauria,TaiNing Wang,George K. Thiruvathukal,Konstantin Laufer,Mohammed Abuhamad
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
Comments: 12 pages, 11 tables. Accepted at the 21st International Conference on Software Technologies (ICSOFT 2026); Recommended as Best Paper Award Candidate
Abstract:TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but no prior study measures whether they produce semantically correct TLA+ specifications from natural language. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language. Our study evaluates 30 LLMs across eight families on a curated dataset of 205 TLA+ specifications: 25 open-weight models across four prompting strategies (2,600 runs) and 5 proprietary models under few-shot prompting (130 runs), all validated by the SANY parser and TLC model checker. LLMs achieve up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusive to progressive prompting. Results show that model size does not predict quality, e.g., DeepSeek r1:8b outperforms its 70B variant across all strategies, which suggests the importance of reasoning alignment for formal languages. Code-specialized models consistently underperform due to negative transfer from mainstream language training. We identify five recurring hallucination categories, all traceable to specific training data biases. These results suggest that current LLMs do not generate reliable TLA+ specifications without expert oversight. We release the evaluation framework, code, and dataset to support reproducibility and future research.
[AI-82] TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents
Link: https://arxiv.org/abs/2606.05784
Authors: Chengqi Dong,Chuhuai Yue,Hang He,yandong liu,Fenghe Tang,S Kevin Zhou,Xiaohan Wang,Jiajun Chai,Guojun Yin
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We further empirically quantify the scale of this phenomenon. Over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit. TAPO constructs counterfactual witnesses within the current training batch and compensates misassigned negative credit via confidence-gated conservative advantage correction. It requires no additional annotation, models, or sampling, and introduces negligible computational overhead. Across multiple multimodal search benchmarks, TAPO delivers consistent, plug-and-play improvements over strong baselines for three mainstream RL algorithms (GRPO, GSPO, and SAPO). Our code and models will be publicly released upon acceptance.
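The uniform broadcast that this abstract identifies as the source of credit misassignment can be seen in a small sketch of group-normalized advantages. This is a simplified illustration of the GRPO-style baseline only; TAPO's counterfactual witnesses and confidence-gated correction are not reproduced, and the rewards and token counts are made up.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, token_counts, eps=1e-8):
    """Compute one group-normalized advantage per trajectory, then copy it
    unchanged onto every token of that trajectory."""
    mu, sd = mean(rewards), pstdev(rewards)
    trajectory_adv = [(r - mu) / (sd + eps) for r in rewards]
    return [[a] * n for a, n in zip(trajectory_adv, token_counts)]


# Hypothetical group of four rollouts: two succeed, two fail.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 5, 2, 4]
adv = grpo_token_advantages(rewards, token_counts)
```

Every token in a failing rollout, including tokens from a tool call that retrieved useful evidence, receives the same negative value, which is the behavior a per-action correction would aim to change.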
[AI-83] TinyML-Driven Cybersecurity for Autonomous Spacecraft: Latency-Accuracy Analysis for SPARTA RF and Cyber Threat Detection
Link: https://arxiv.org/abs/2606.05779
Authors: Van Le,Trevor Tran,Tan Le
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Twenty Fifth International Conference on Security Management (SAM’26)
Abstract:Autonomous spacecraft require rapid, lightweight, and reliable onboard detection of cyber-RF threats. Using the SPARTA attack model, we analyze the latency-accuracy trade-offs of TinyML-compatible classical models – Random Forest, Logistic Regression, SVM, and MLP – for detecting uplink jamming, Fake-NR spoofing, payload manipulation, ground-segment compromise, and unauthorized command injection. We present a physics-informed theoretical analysis of each model’s computational complexity, VC dimension, Lipschitz continuity, and latency scaling, supported by empirical measurements on adversarial RF spectrograms generated via BandErasure, FakeNR, and NoiseBurst corruption modes. Results show that Logistic Regression achieves microsecond-level inference with only a 1% accuracy drop relative to Random Forest, making it an effective TinyML baseline for onboard autonomy. The study also identifies opportunities for advancing spacecraft cybersecurity through richer feature encoders and multi-timescale learning architectures, building on recent progress in edge intelligence and trustworthy AI.
[AI-84] An Improved CNN-LSTM Based Intrusion Detection System for IoT Networks
Link: https://arxiv.org/abs/2606.05776
Authors: Mohammad Tariq Ikhlas,Pohanyar Khowaja Khil,Malik Muhammad Mueed Aslam,Muhammad Khuram Shahzad
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 8 figures
Abstract:With the rapid proliferation of IoT devices, security concerns have dramatically escalated and intrusion detection systems have become critical for protecting networked environments. This paper presents an improved CNN-LSTM based intrusion detection model that combines multi-class classification, dataset integration, and temporal feature learning to enhance detection performance in IoT networks. Using network traffic data, the proposed approach is evaluated on intrusion detection tasks and achieves an accuracy of approximately 97%. Experimental results demonstrate that the model effectively detects multiple attack categories while maintaining stable training and validation performance. The integration of convolutional and recurrent neural network components enables the framework to capture both spatial and temporal characteristics of network traffic, improving overall intrusion detection capability in IoT environments.
[AI-85] Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering
Link: https://arxiv.org/abs/2606.05770
Authors: Vahid Garousi
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI is changing how software engineers work, but it often comes with hidden burdens and costs. In this paper, we characterize two such often-overlooked burdens: (1) the constant need for human oversight and inspection of AI-generated artifacts; and (2) the growing cognitive overload on software engineers from receiving large amounts of suggestions from AI tools. The need for human oversight is not optional: engineers must review, validate, and sometimes rework what AI produces. At the same time, the flood of AI suggestions, prompts, and possible solutions can leave developers mentally stretched. By blending evidence from recent practitioner opinions, we highlight these challenges and open a conversation about how teams can handle them in day-to-day AI-assisted software engineering.
[AI-86] Beyond Soft Masks: Hard-Perturbation Mixup Explainer for Robust GNN Explainability
Link: https://arxiv.org/abs/2606.05756
Authors: Jialiang Yin,Zheng Zhao,Linsey Pang,Bo Dong,Bin Shi,Jiaxing Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:
Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable performance across a range of applications involving graph-structured data, particularly in high-stakes domains. However, the opaque nature of their decision-making processes limits their trustworthiness and broader adoption. Existing post-hoc explanation methods aim to improve explainability by identifying subgraphs that influence GNN predictions and adopt mixup strategies to alleviate the out-of-distribution (OOD) issue caused by using subgraphs for prediction. Yet, these approaches typically rely on soft masks, which are inherently unable to fully eliminate label-irrelevant information, allowing redundant structures to leak into the mixup process and hindering the resolution of the OOD problem, thereby degrading explanation fidelity. In this work, we propose HPME, a Hard-Perturbation Mixup Explanation framework grounded in a generalized Graph Information Bottleneck, which leverages graph pooling to extract discrete explanatory subgraphs and to yield an information-capacity bound to thoroughly compress label-irrelevant components. Furthermore, we introduce a novel mixup strategy built upon structure-level replacement, generating in-distribution explanations to effectively mitigate the distribution shift. Extensive experiments on diverse tasks demonstrate that HPME achieves state-of-the-art performance in generating robust and interpretable explanations across both synthetic and real-world datasets.
[AI-87] Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework
Link: https://arxiv.org/abs/2606.05754
Authors: Weiguang Wang,Fugen Wu,Hailing Wang,Xuechen Liang,Xiaobin Li,Ru Han,Tianchang Xie
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Phase-sensitive optical time-domain reflectometry ($\phi$-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced $\phi$-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the $\phi$-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79% accuracy, 89.83% macro-F1, and a nuisance alarm rate of 5.00% on the balanced test set. The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for $\phi$-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at this https URL.
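Cross-correlation alignment of two channels, which this abstract reports implementing on an FPGA, amounts to finding the lag that maximizes the correlation between the signals. The software sketch below is only an illustration of that idea with made-up sample values; the function name and the search window are hypothetical, and the paper's hardware procedure is not reproduced.

```python
def best_lag(reference, signal, max_lag):
    """Return the lag (in samples) at which `signal` best matches `reference`.

    A positive result means `signal` is delayed relative to `reference`.
    """
    def correlation(lag):
        total = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(signal):
                total += r * signal[j]
        return total

    return max(range(-max_lag, max_lag + 1), key=correlation)


# Made-up pulse and a copy of it delayed by three samples.
pulse = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
delayed = [0.0] * 3 + pulse[:-3]
lag = best_lag(pulse, delayed, max_lag=5)

# Shift the delayed channel back so that both channels line up.
aligned = delayed[lag:] + [0.0] * lag
```

Once the lag is known, the two channels can be shifted into a common time base before any fusion model consumes them.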
[AI-88] Class-Specific Branch Attention for Mitigating Gradient Interference under Class Imbalance
Link: https://arxiv.org/abs/2606.05740
Authors: Arush Singhal,Umang Soni
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures, 13 tables
Abstract:Deep neural networks trained under severe class imbalance often exhibit degraded performance, typically attributed to statistical bias. In this work, we identify a complementary optimization-level pathology: inter-class gradient interference within shared representations, where gradients from majority classes suppress minority-class learning. To analyze this phenomenon, we introduce a diagnostic framework based on layer-wise gradient flow analysis and a Gradient Conflict Matrix, which quantifies interference using cosine similarity between class-specific gradients. Using this framework, we study multi-branch convolutional architectures and propose a lightweight modification, Class-Specific Branch Attention (CSBA), that enables branch-specific channel reweighting to reduce gradient coupling. This mechanism promotes implicit feature decoupling across branches while preserving architectural simplicity. Empirically, CSBA improves minority-class performance, increasing the F1 score for the Physical-Damage class from 0.261 to 0.522 under severe imbalance, while maintaining comparable overall accuracy. Validation on CIFAR-10-LT confirms that this behavior generalizes across imbalanced visual recognition settings, with Macro-F1 improving from 0.595 to 0.655. More broadly, our findings highlight the importance of considering optimization dynamics alongside statistical methods when designing architectures for imbalanced learning.
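A matrix of pairwise cosine similarities between class-specific gradients, as this abstract describes for its Gradient Conflict Matrix, can be sketched as follows. The gradient vectors and class names are made up, and the paper's exact layer-wise procedure may differ; negative entries indicate that two classes pull shared parameters in opposing directions.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def gradient_conflict_matrix(class_grads):
    """Map each pair of classes to the cosine similarity of their gradients."""
    names = list(class_grads)
    return {a: {b: cosine(class_grads[a], class_grads[b]) for b in names}
            for a in names}


# Hypothetical flattened gradients of one shared layer, one per class.
grads = {
    "majority": [1.0, 2.0, 0.0],
    "minority": [-1.0, -1.0, 0.5],
    "neutral": [0.0, 0.0, 3.0],
}
gcm = gradient_conflict_matrix(grads)
```

Here the majority and minority gradients have a negative cosine, the kind of entry that would flag interference within a shared representation.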
[AI-89] Microskill Architecture: A Modular Skill-Driven Framework for AI-Native Code Generation
链接: https://arxiv.org/abs/2606.05720
作者: Mohammad Zare,Omid Abdolrahmani
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models and AI coding agents have reshaped software development, but the path to fully AI-native systems faces structural challenges. Chief among them is managing context windows without losing accuracy or efficiency. When developers inject full project documentation and code into a model’s memory, the model loses mid-sequence information, token costs spiral, and architecture drifts. This paper presents MicroSkill Architecture: a modular design paradigm inspired by microservices, applied to knowledge encapsulation instead of service decomposition. Instead of feeding an agent the entire codebase, the architecture partitions knowledge into atomic, sharply scoped skill capsules, and a dynamic router selects only semantically relevant capsules for the task. We formally model context allocation as constrained optimization over semantic relevance subject to a token budget. An empirical case study on an enterprise content management system with fifteen complex features shows that MicroSkill cuts token consumption by over 90%, nearly doubles first-try compilation success rates, eliminates architectural violations entirely, and enables autonomous extraction and registration of seven new skill capsules via a self-learning mechanism. These findings suggest MicroSkill Architecture offers a scalable foundation for building AI-native development systems that are more efficient, more reliable, and capable of evolving over time.
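Selecting capsules by semantic relevance under a token budget resembles a knapsack problem: maximize total relevance while the summed token counts stay within the budget. The abstract does not say how the router solves it, so the sketch below swaps in a simple greedy relevance-per-token heuristic. The function name, capsule names, and scores are all hypothetical.

```python
def select_capsules(capsules, token_budget):
    """Greedy heuristic: take capsules in order of relevance per token while they fit.

    capsules: list of (name, relevance, tokens) tuples; all values are illustrative.
    Returns the chosen capsule names and the number of tokens they consume.
    """
    ranked = sorted(capsules, key=lambda c: c[1] / c[2], reverse=True)
    chosen, used = [], 0
    for name, relevance, tokens in ranked:
        if used + tokens <= token_budget:
            chosen.append(name)
            used += tokens
    return chosen, used

# Hypothetical skill capsules with made-up relevance scores and sizes.
capsules = [
    ("auth-rules", 0.9, 300),
    ("db-schema", 0.6, 600),
    ("ui-theme", 0.1, 400),
    ("api-routes", 0.8, 400),
]
chosen, used = select_capsules(capsules, token_budget=800)
```

A greedy pass is not guaranteed to be optimal for knapsack-style problems; it is used here only because it is short and easy to follow.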
[AI-90] Explainable AI-Driven Cyber Risk Analytics and Model Reliability Assessment for Intelligent Governance of U.S. Critical Infrastructure: An XGBoost and SHAP-Based Intrusion Detection Framework
链接: https://arxiv.org/abs/2606.05710
作者: B. M. Taslimul Haque,Md. Arifur Rahman,Md. Serajul Kabir Chowdhury Rubel,Md. Iqbal Hossan
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 20 pages, 8 figures, empirical research article, CICIDS2017 dataset, XGBoost, Random Forest, Decision Tree, Logistic Regression, SHAP explainability analysis, cyber risk analytics, intrusion detection, critical infrastructure cybersecurity, model reliability assessment
Abstract:The growing penetration of intelligent digital technologies into the U.S. critical infrastructure sector has greatly increased exposure to advanced cyber adversaries and operational vulnerabilities. AI-powered governance and automated decision-making systems are becoming a key part of the operation of critical infrastructure systems, including energy, healthcare, transportation, financial services, and communication infrastructure, in order to improve efficiency and strategic management. The growing cyber threat environment, including Distributed Denial of Service (DDoS) attacks, botnets, ransomware, and Advanced Persistent Threats (APTs), poses significant challenges to infrastructure resilience, cybersecurity reliability, and governance trustworthiness. In a changing attack landscape and dynamic network environment, traditional cybersecurity mechanisms often fall short of meeting evolving needs and protecting critical systems. This study develops a resilient cyber risk analytics and model reliability assessment framework to support intelligent governance and decision support for cyber risk exposure in the U.S. critical infrastructure environment. It uses the CICIDS2017 dataset to develop and test machine-learning-based intrusion detection and cyber risk prediction models. Classifiers such as XGBoost, Random Forest, and Decision Tree are used to detect malicious network activity and determine the level of cyber risk. Furthermore, Explainable Artificial Intelligence (XAI) techniques are integrated to enhance transparency, interpretability, and trust in cybersecurity decision-making processes. The proposed framework assesses the reliability and resilience of the models using performance measures such as accuracy, precision, recall, F1 score, ROC-AUC, and false positive rate.
[AI-91] Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving
链接: https://arxiv.org/abs/2606.05704
作者: Muhammad Talha Sharif,Abdul Rehman
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages
Abstract:Recent Large Language Models (LLMs) have shown impressive reasoning abilities, but they are still susceptible to hallucinations, intermediate reasoning mistakes, and unreliable results on complex mathematical reasoning problems. In this study, we introduce a critic-based heterogeneous multi-agent approach to improve the dependability of mathematical reasoning. This framework incorporates several LLM agents of different specialties and employs a critic-driven adaptive learning system to assess and guide the reasoning process based on intermediate feedback. The system adopts a generator-validator framework, with the validator not only determining correctness but also offering critiques to guide regeneration of solutions. This allows for adaptive error correction and prevents error cascading. Our experiments on the GSM8K benchmark show that the proposed method achieves up to 13% accuracy improvement over single-shot and non-critic models. Additionally, findings suggest that heterogeneity and critique reduce the need for large models, allowing smaller models to perform on par. Ablation studies reveal that the main performance gains are due to the critic-based feedback loop and not model size. In summary, the proposed approach showcases the benefits of combining heterogeneous multi-agent collaboration and critique to obtain reliable and interpretable reasoning systems.
[AI-92] Cognitive Threat Intelligence and Explainable Federated Security Analytics for distributed Infrastructure Systems KDD
链接: https://arxiv.org/abs/2606.05701
作者: Md. Arifur Rahman,B. M. Taslimul Haque,Md. Iqbal Hossan,Md. Serajul Kabir Chowdhury Rubel
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 22 pages, 10 figures, 1 conceptual framework diagram, 1 methodology workflow diagram, empirical study using NSL-KDD and CIC-IDS2017 datasets, Federated Learning, Explainable AI (SHAP, LIME), cybersecurity and intrusion detection framework
Abstract:The increasing adoption of distributed infrastructure systems, cloud computing, Internet of Things (IoT) technologies, and edge-based architectures has significantly expanded the cybersecurity attack surface and introduced increasingly sophisticated cyber threats. Conventional centralized intrusion detection approaches often face challenges related to scalability, data privacy, communication overhead, and limited transparency in artificial intelligence-driven decision-making processes. To address these limitations, this study proposes a Cognitive Threat Intelligence and Explainable Federated Security Analytics framework for distributed infrastructure systems. The proposed framework integrates Federated Learning (FL), Explainable Artificial Intelligence (XAI), and cognitive cybersecurity analytics to enable collaborative and privacy-preserving cyber threat detection across distributed network environments. Instead of transmitting sensitive raw network traffic data to centralized servers, local security models are independently trained at distributed nodes, where only encrypted model parameters and updates are shared through a federated aggregation mechanism. This decentralized learning architecture improves privacy protection while reducing communication dependency and centralized security risks. To enhance intelligent threat analysis, the framework incorporates machine learning and deep learning algorithms including Random Forest, XGBoost, Autoencoder
[AI-93] PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation
链接: https://arxiv.org/abs/2606.05697
作者: Nicolas Bougie,Xiaotong Ye,Gian Maria Marconi,Narimasa Watanabe
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:User interface (UI) and user experience (UX) evaluation is central to product development, yet reliable feedback still relies on recruiting human participants or running online A/B tests, making early-stage iteration slow and costly. In light of this, recent work has explored Multimodal Large Language Models as proxy evaluators. However, existing approaches either produce surface-level critiques or a judgment that reflects the model’s own biases rather than the genuine response of a particular user. We introduce PerceptUI, a framework for persona-conditioned UI/UX evaluation that predicts how a specific user would answer interface-related questions and produces natural-language rationales. PerceptUI is trained in two stages: (i) contrastive reflection fine-tuning distills teacher-generated rationales by extracting lessons from human decisions, and (ii) a reflective prompt-evolution step from the model’s own failure traces. Across multiple domains and datasets, PerceptUI achieves human-level realism, generalizes to unseen questions and personas, and yields population-level response distributions.
[AI-94] Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions
链接: https://arxiv.org/abs/2606.05692
作者: Wenhao Mu,Facundo Yan,Anik Mumssen,Marisa Eisenberg,Alexander Rodríguez
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.
[AI-95] AdaMEM: Test-Time Adaptive Memory for Language Agents ICML2026
链接: https://arxiv.org/abs/2606.05684
作者: Yunxiang Zhang,Yiheng Li,Ali Payani,Lu Wang
类目: Artificial Intelligence (cs.AI)
备注: ICML 2026
Abstract:A central challenge for language agents is utilizing past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline while generating dynamic short-term strategy memory on-the-fly to guide decision-making. This mechanism enables the trade-off between token efficiency and adaptability across varying inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, with consistent leading performance extending to agentic search on HotpotQA. To further enhance this adaptation, we develop STEP-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments. Our code is available at this https URL.
[AI-96] Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation
链接: https://arxiv.org/abs/2606.05682
作者: Fangbo Tu,Junhua Zhao,Chi Liu,Xin Chen,Haifeng Wu,Jian Wan,Srinivasan Manoharan
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 1 figure
Abstract:Demand for low-precision inference, including NVFP4-based approaches, has grown as large language models are increasingly deployed in latency- and cost-constrained production environments. Quantization-aware distillation (QAD) helps recover accuracy lost under low-bit quantization by training a quantized student to match the output distribution of a frozen higher-precision teacher via a KL-divergence loss. In this work, we first provide a representation-level diagnosis of QAD: output matching alone can mask internal degradation, because many intermediate activation geometries can yield similar teacher-aligned logits. Using CKA, we show that KL-only QAD can reduce layerwise representational similarity relative to the BF16 teacher, with especially severe drift in RL-post-trained models. This drift correlates with downstream bottlenecks on reasoning and coding tasks, suggesting that low-bit recovery requires preserving internal geometry rather than matching outputs alone. Motivated by this finding, we propose CKA-QAD, a CKA-guided representational alignment method for NVFP4 QAD and low-bit LLM accuracy recovery. The method adds a lightweight regularizer that preserves internal representational geometry during distillation by aligning layerwise Gram matrices through CKA. Across Nemotron 3 Nano and Qwen3-4B-Thinking-2507, CKA-QAD substantially improves representational alignment and improves downstream reasoning and coding accuracy with modest training overhead. Our findings position CKA-guided representational alignment as a practical complement to output matching for quantized LLM recovery.
[AI-97] Data Flow Control: Data Safety Policies for AI Agents
链接: https://arxiv.org/abs/2606.05679
作者: Charlie Summers,Eugene Wu
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: 15 pages, 12 figures
Abstract:Agents increasingly generate SQL, orchestrate pipelines, and automate data analysis on behalf of users. While recent work improves query correctness, correctness is not safety. A query may be semantically valid yet violate regulatory, privacy, or business constraints that govern how data may be combined and released. We argue that enforcing such constraints is fundamentally a data infrastructure problem. This paper introduces Data Flow Control (DFC), a framework to declaratively specify and guarantee policy enforcement over tuple-level data flows within a DBMS query. A key challenge is defining a policy language that is optimizer-invariant yet efficient to enforce at scale. We formalize data safety as aggregate predicates over provenance monomials and present Passant, a portable query rewriting layer that enforces DFC policies without materializing provenance. Across five DBMS engines – DuckDB, Umbra, PostgreSQL, DataFusion, and SQLServer – Passant achieves ~0% overhead and outperforms alternatives by orders of magnitude. As a result, Data Flow Control is the first step towards moving data safety from prompts and post-hoc checks into the data infrastructure. Data Flow Control is available open source at this https URL.
[AI-98] Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition
链接: https://arxiv.org/abs/2606.05678
作者: Yifan Liao,Zongmin Zhang,Zhen Sun,Yuhui Sun,Xinhu Zheng,Xinlei He
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 11 pages
Abstract:Automatic speech recognition (ASR) systems have become widely used for multilingual speech-to-text transcription. Their robustness to adversarial attacks has become an important topic for the community. Existing adversarial attacks directly add adversarial noise to the speech audio. However, prior work has shown that existing adversarial attacks face two limitations: they often transfer poorly to black-box ASR systems and are increasingly mitigated by defenses tailored to input-space perturbations. In this work, we propose a Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box attack that moves the adversarial search space from raw waveforms to self-supervised learning (SSL) representations. To address the transferability limitation, we perturb more generalizable acoustic-phonetic representations rather than low-level waveform samples, reducing dependence on surrogate-specific waveform gradients and encouraging adversarial perturbations that generalize across ASR systems. To bypass different defenses, we shift the adversarial signal from explicit additive waveform noise to SSL feature-space perturbations and reconstruct them through a vocoder into speech-like waveform adversarial signals, making the resulting samples less aligned with waveform-bounded defenses. Extensive experiments show that, when optimized only on raw Whisper-small as a public surrogate model, our attack transfers effectively to black-box ASR models with a +26.6 WER improvement over the SOTA baseline, while also remaining effective against multiple training defenses with a +36.2 WER improvement. These results reveal a blind spot in current ASR robustness evaluation.
[AI-99] Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows
链接: https://arxiv.org/abs/2606.05670
作者: Yuhang Fu,Ruishan Fang,Jiaqi Shao,Huiyu Zheng,Zhengtao Zhu,Bing Luo,Tao Lin
类目: Artificial Intelligence (cs.AI)
备注: this https URL
Abstract:Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal (SI) workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.
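The abstract does not define its "Wilson one-run guidance". One plausible reading is that a single run's accuracy is judged against a Wilson score interval, so the sketch below shows only that standard interval. The function name and the example counts are assumptions, not taken from the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical single run: 50 correct answers out of 100 tasks.
low, high = wilson_interval(50, 100)
```

Under this reading, a multi-agent score that falls inside the single-agent interval would not count as a clear improvement from one run alone.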
[AI-100] Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation
链接: https://arxiv.org/abs/2606.05660
作者: Dabin Kim,Daemin Park,Sangyub Lee,Jinsik Kim,Yeongtak Oh,Jongho Shin,Sungroh Yoon
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 63 pages, 6 figures
Abstract:Embodied AI systems are increasingly expected to reason and act over extended horizons in physical environments. This growing capability brings safety to the foreground, because failures in the physical world can harm people, damage objects, and disrupt workplaces. Although safe embodied AI has attracted substantial attention, the literature remains fragmented across planning, policy design, and runtime execution. Long-horizon robotic manipulation is a particularly revealing anchor domain for this problem because semantic misgrounding, subtask-level error propagation, execution drift, and contact-rich physical risk can accumulate within the same closed-loop system. This survey therefore provides a structured review of safety in long-horizon robotic manipulation from an embodied AI perspective. We organize the literature by intervention locus, covering planning-time, policy-time, and execution-time safety, and we analyze the strength of the evidence that each line of work provides, distinguishing formal guarantees, statistical support, and empirical safety heuristics. This framework clarifies the distinct roles of backbone capability papers, direct safety mechanisms, and benchmark or evaluation studies, while exposing where current safety claims are well supported and where they remain indirect. We identify persistent gaps, including limited evidence for policy-time safety, weak formal support for contact-rich long-horizon manipulation, immature uncertainty-triggered intervention, and a shortage of manipulation-specific safety benchmarks. We conclude by outlining research directions for cross-layer assurance, evaluation design, and safer deployment of long-horizon robotic agents in real-world settings.
[AI-101] When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability
链接: https://arxiv.org/abs/2606.05654
作者: Suraj Babu Thimma Krishnaram
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Hate moderation is often evaluated as classification on clean English inputs, but deployed systems must route content to actions such as ALLOW, FLAG, or REVIEW. We study how this workflow changes under code-mixed inputs using a paired evaluation setting where the same underlying content is expressed as clean English and Tamil-English code-mix. Under thresholds tuned on clean English development data, code-mixed inputs produce substantial action instability, with a paired clean-to-code-mix decision flip rate of 0.265. The main workflow effects are increased review burden and increased false-flagging of non-hateful content: review rate rises from 0.138 to 0.297 and non-hate false-flag rate rises from 0.069 to 0.104. Tamil-only inputs show stronger degradation overall, suggesting a broader language-coverage limitation rather than the same code-mixed instability pattern. A simple disagreement-based deferral rule reduces automatic errors on stressed inputs, but only by increasing review load. These results show that workflow-level evaluation reveals moderation failures that standard classification summaries can miss.
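The paired metrics named above (decision flip rate and review rate) are simple to compute once each item has a routed action under both surface forms. The sketch below is a minimal illustration with invented data; the function names are hypothetical and the paper's exact definitions may differ.

```python
def flip_rate(clean_actions, mixed_actions):
    """Fraction of paired items whose routed action differs between the two surface forms."""
    if len(clean_actions) != len(mixed_actions):
        raise ValueError("inputs must be paired item by item")
    flips = sum(1 for a, b in zip(clean_actions, mixed_actions) if a != b)
    return flips / len(clean_actions)

def action_rate(actions, action):
    """Share of items routed to the given action (e.g. the review rate for REVIEW)."""
    return sum(1 for a in actions if a == action) / len(actions)

# Invented actions for five items, written once in clean English and once code-mixed.
clean = ["ALLOW", "ALLOW", "FLAG", "REVIEW", "ALLOW"]
mixed = ["ALLOW", "REVIEW", "FLAG", "REVIEW", "FLAG"]

paired_flip_rate = flip_rate(clean, mixed)
review_clean = action_rate(clean, "REVIEW")
review_mixed = action_rate(mixed, "REVIEW")
```

Because the items are paired, a flip reflects a change in surface form rather than a change in content, which is what makes the metric a workflow-stability measure.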
[AI-102] Enhancing Software Engineering Through Closed-Loop Memory Optimization
链接: https://arxiv.org/abs/2606.05646
作者: Xuehang Guo,Zora Zhiruo Wang,Qingyun Wang,Graham Neubig,Xingyao Wang
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have enabled powerful software engineering (SE) agents capable of navigating complex codebases and resolving real-world issues. However, these agents remain fundamentally episodic: they fail to retain, refine, and reuse experiences across tasks, repeatedly reconstructing context from scratch and reproducing similar mistakes. Even with memory support, they offer no remedy for the absence of a principled, task-agnostic memory utility, making them difficult to evaluate rigorously or generalize across agents and settings. To tackle these limitations, we introduce a closed-loop framework for memory augmentation in SE agents. The framework grounds memory utility in validated downstream impact, establishing utility as both a task-agnostic evaluation benchmark and an annotation-free optimization signal. Through complementary evaluation on single-episode and cross-episode memory augmentation, results demonstrate that the framework consistently improves SE agents across settings, achieving absolute gains of up to 5.25% in success rate and 4.63% in resolve efficiency, while reducing computational cost by at least 9.79%. Our project page: this https URL.
[AI-103] FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG
链接: https://arxiv.org/abs/2606.05644
作者: Zhe Yu,Wenpeng Xing,Tiancheng Zhao,Mohan Li,Changting Lin,Meng Han
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:When retrieved evidence contradicts parametric memory, language models frequently ignore context and default to memorized priors – a failure that undermines the core purpose of retrieval augmentation. Contrastive decoding amplifies the context-conditioned output to suppress parametric bias, but existing methods rest on an implicit assumption that this bias is uniform across tokens. A single global contrastive weight over-penalizes safe tokens while leaving genuinely conflicted ones insufficiently corrected. We identify token-level conflict concentration: retrieval-memory tension is sharply heterogeneous, concentrated on a small fraction of answer-critical decoding steps. This reframes contrastive decoding from how much contrast to apply to where to apply it. We propose FIDES (Faithful Inference via Deep Evidence Signals), a training-free decoder that reads three internal signals probing retrieval-memory conflict at complementary depths – output surface, hidden representations, and prediction trajectory – and fuses them to govern intervention strength at each decoding step. Across three benchmarks and six backbones – four primary 7B/8B models and two scaling backbones up to 70B – FIDES achieves the best context fidelity in all 18 settings, outperforming the strongest training-free baseline by +3 to +13 points. On the 70B scale, fidelity reaches 92-94% while F1 surges to 62-63%, demonstrating that token-level selectivity unlocks generation capability that coarse contrastive rules suppress.
[AI-104] Answer Presence Drives RAG Rewriting Gains
链接: https://arxiv.org/abs/2606.05633
作者: Yuejie Li,Yueying Hua,Ke Yang,Li Zhang,Yueping He,Yueping He,Ruiqi Li,Bolin Chen,Tao Wang,Bowen Li,Chengjun Mao
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. We ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se, using a controlled intervention audit. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span, replacing a length-matched random non-answer span (placebo), or injecting the gold into rewrites where it was absent (at the prefix or at a midpoint sentence boundary). Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA + verify), removing the gold answer drops reader F1 by 28 to 64 points beyond the length-matched placebo on paired answer-in-compile strata, and prepending the gold into rewrites that lacked it raises F1 by +0.7 to +9.7 points in 10 of 12 (cell, baseline) combinations. A companion five-sentinel audit shows the conventional single-[MASK] probe is itself sentinel-fragile: on 2Wiki it reports a +4.12 F1 "non-leakage residual" that flips to -3.33 to -7.81 F1 under four alternative sentinels and fails an equivalence test for three of those four (1/4 pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.
[AI-105] Evaluation of LLMs for Mathematical Formalization in Lean
链接: https://arxiv.org/abs/2606.05632
作者: Tyson Klingner,Drew Bladek,Escher Crawford,Bohao Chen,Ariel Fu,Kaira Nair,Jarod Alper,Giovanni Inchiostro,Vasily Ilin
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 13 figures, 10 tables. Comments welcome!
Abstract:Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We provide a comparison of various LLMs’ effectiveness in producing formal proofs in Lean 4 with the goal of assisting those seeking to use LLMs to support their own projects. We utilize both pass@k and refine@k metrics as the benchmark for our comparison and evaluate on subsets of both miniF2F and miniCTX datasets. Our testing shows that overall, Gemini 3.1 Pro and Claude Opus 4.7 perform best. Gemini 3.1 Pro achieved a 92% success rate on miniF2F via refine@32 whereas Opus 4.7 achieved an 86% success rate on miniCTX via refine@32. When taking cost into account, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most efficient, with competitive accuracies and average costs of $0.01 per correct proof.
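For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator commonly used in code-generation evaluation: with n sampled attempts of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether this paper uses exactly this estimator is an assumption, and refine@k is not covered because the abstract does not define it.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k attempts is correct.

    n: total attempts sampled, c: number of correct attempts, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k attempts include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 4 sampled proofs, 1 of which type-checks, budget of 2.
estimate = pass_at_k(n=4, c=1, k=2)
```

With k = 1 the estimator reduces to the plain success fraction c / n.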
[AI-106] Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking
链接: https://arxiv.org/abs/2606.05625
作者: Bonan Shen,Youting Wang,Dingyan Shang,Tao Ning
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Implicit reward hacking is hard to audit when a language model’s chain of thought appears benign: a final answer may be anchored by a prompt shortcut while the written reasoning still resembles ordinary problem solving. Verifier-based probes expose such behavior by measuring how early truncated reasoning contexts obtain high reward, but require a task-specific reward signal. This paper proposes a weaker-input alternative, self-commitment latency, which measures how early a prompted reasoning context commits to the model’s own final answer. We evaluate the probe in a controlled paired GSM8K setting using Qwen2.5-3B-Instruct-4bit, comparing ordinary prompts with prompts that include an answer hint. Hinted contexts commit substantially earlier and with lower uncertainty than honest contexts. The primary latency metric, first-commitment latency at threshold 0.8, reaches AUROC 0.878; supporting whole-curve summaries reach AUROC 0.926 for commitment range and 0.904 for mean uncommitted mass. The signal is stronger when both prompt conditions answer correctly and remains stable across thresholds. These results show that shortcut-available reasoning contexts can leave an early behavioral commitment signature detectable without a reward model, external judge, or trained classifier.
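A first-commitment latency at a fixed threshold could be computed roughly as follows. This is a sketch under stated assumptions: the reasoning trace is truncated at evenly spaced points, and each point has an estimated probability that the model's continuation lands on its own final answer. The function name, the normalization to the range 0 to 1, and the example curves are all hypothetical.

```python
def first_commitment_latency(commit_probs, threshold=0.8):
    """Relative position of the first truncation point whose commitment probability
    reaches the threshold; returns 1.0 if the threshold is never reached."""
    if not commit_probs:
        raise ValueError("need at least one truncation point")
    for i, p in enumerate(commit_probs):
        if p >= threshold:
            return i / len(commit_probs)
    return 1.0

# Invented commitment curves: one value per truncation point, early to late.
honest = [0.05, 0.10, 0.30, 0.60, 0.85]
hinted = [0.50, 0.90, 0.95, 0.97, 0.99]

honest_latency = first_commitment_latency(honest)
hinted_latency = first_commitment_latency(hinted)
```

In this toy example the hinted curve crosses the threshold much earlier, mirroring the abstract's claim that hinted contexts commit earlier than honest ones.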
[AI-107] Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack
链接: https://arxiv.org/abs/2606.05614
作者: Long P. Hoang,Hai V. Le,Shaoyang Xu,Wei Lu,Wenxuan Zhang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, exemplifying that artificially degrading a model’s safety judgment immunizes it against the attack, whereas enhancing judgment exacerbates the vulnerability. Our findings highlight potential flaws in current alignment paradigms, indicating that defense mechanisms may require further structural refinement.
[AI-108] Multilingual Fine-Tuning via Localized Gradient Conflict Resolution
Link: https://arxiv.org/abs/2606.05613
Authors: Long P. Hoang,Yiran Zhao,Wei Lu,Wenxuan Zhang
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid evolution of Large Language Models (LLMs) has established cross-lingual versatility as a defining feature of modern systems. However, fine-tuning these models frequently induces negative interference across languages. To address this, we reformulate multilingual fine-tuning as a multi-objective optimization (MOO) problem. Specifically, we introduce Bucket-Level MOO, a scalable distributed framework that applies gradient-based MOO algorithms locally on parameter buckets. This enables conflict-aware updates without the prohibitive communication overhead of reconstructing full gradient vectors. Theoretically, we prove this localized resolution natively enforces Refined Pareto Stationarity, a strictly tighter necessary condition for Pareto optimality. Empirically, Bucket-Level MOO mitigates interference by driving LLMs to construct distinct language-specific dimensions, improving representational separability. Extensive experiments across four base LLMs demonstrate that our method significantly improves both seen and unseen multilingual performance over standard fine-tuning paradigms.
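To make the idea of resolving gradient conflicts locally on parameter buckets concrete, here is a minimal pure-Python sketch. It swaps in a PCGrad-style projection as the per-bucket MOO step, which is an assumption: the abstract does not say which gradient-based MOO algorithm is used. The function names, the fixed projection order, and the averaging of projected gradients are likewise illustrative choices.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def pcgrad_bucket(grads: List[Vector]) -> Vector:
    """Merge per-language gradients for ONE bucket with PCGrad-style projection."""
    projected = []
    for i, g in enumerate(grads):
        g = list(g)
        for j, other in enumerate(grads):
            if i == j:
                continue
            d = dot(g, other)
            norm_sq = dot(other, other)
            # Conflict: negative inner product. Remove the conflicting component.
            if d < 0 and norm_sq > 0:
                g = [x - (d / norm_sq) * y for x, y in zip(g, other)]
        projected.append(g)
    # Average the de-conflicted gradients (assumed aggregation rule).
    return [sum(col) / len(projected) for col in zip(*projected)]


def bucket_level_update(per_language_grads: List[Vector], bucket_size: int) -> Vector:
    """Apply the conflict resolution independently to each contiguous bucket."""
    n = len(per_language_grads[0])
    merged: Vector = []
    for start in range(0, n, bucket_size):
        bucket = [g[start:start + bucket_size] for g in per_language_grads]
        merged.extend(pcgrad_bucket(bucket))
    return merged
```

Because each bucket is processed on its own, no step needs the full concatenated gradient vector, which is the property the abstract attributes to the bucket-level formulation.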
[AI-109] SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks
Link: https://arxiv.org/abs/2606.05609
Authors: Seungwon Jeong,Jiwoo Jeong,Hyeonjin Kim,Yunseok Lee,Woojin Lee
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting adversarial tokens at the end of prompts. However, GCG restricts adversarial tokens to a fixed insertion point (typically the prompt suffix), leaving the effect of inserting tokens at other positions unexplored. In this paper, we empirically investigate slots, i.e., candidate positions within a prompt where tokens can be inserted. We find that vulnerability to jailbreaking is highly related to the selection of the slots. Based on these findings, we introduce the Vulnerable Slot Score (VSS) to quantify the positional vulnerability to jailbreaking. We then propose SlotGCG, which evaluates all slots with VSS, selects the most vulnerable slots for insertion, and runs a targeted optimization attack at those slots. Our approach provides a position-search mechanism that is attack-agnostic and can be plugged into any optimization-based attack, adding only 200ms of preprocessing time. Experiments across multiple models demonstrate that SlotGCG significantly outperforms existing methods. Specifically, it achieves 14% higher Attack Success Rates (ASR) over GCG-based attacks, converges faster, and shows superior robustness against defense methods with 42% higher ASR than baseline approaches. Our implementation is available at this https URL
[AI-110] The End of Software Engineering: How AI Agents Are Fundamentally Restructuring the Software Paradigm
Link: https://arxiv.org/abs/2606.05608
Authors: Zhenfeng Cao
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 2 figures, and 3 tables
Abstract:For over half a century, software engineering has operated on a foundational premise: human engineers decompose problems, encode decision logic into static code, and manually adapt that code as requirements evolve. This paper argues that the emergence of AI agents – systems where large language models serve as the primary reasoning engine, dynamically generating and discarding code as an instrumental resource – constitutes not an incremental improvement but a fundamental restructuring of the software paradigm. Drawing on first-principles analysis of complexity scaling, we formalize the distinction between traditional software (where code is the carrier of decision logic) and agentic systems (where code is ephemeral tooling for an LLM-driven reasoning loop). We trace the historical arc from licensed software to SaaS to what we term Agent-as-a-Service (AaaS), showing that each shift transferred additional complexity away from end-users. We introduce the concept of Agentic Engineering as an emergent discipline – distinct from software engineering in its core object of study, control model, and human role. Through analysis of recent benchmark evidence including SWE-bench Verified, EvoClaw, and LangChain’s multi-agent coordination studies, we demonstrate both the transformative potential of the agentic paradigm and its current limitations. We conclude with a four-stage roadmap toward self-evolving agent ecosystems and concrete recommendations for practitioners navigating this transition.
[AI-111] Cross-Epoch Adaptive Rollout Optimization for RL Post-Training
Link: https://arxiv.org/abs/2606.05606
Authors: Yiming Zong,Yige Wang,Jiashuo Jiang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:
Abstract:LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method, CERO, maintains a Beta posterior over each prompt’s success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. Since the resulting objective is temporally nonseparable, we derive a Fenchel-dual reformulation and update both prompt-level and budget-level dual variables via projected online gradient descent. Under fixed prompt utilities, we prove an O(\sqrt{K}) regret bound against the offline allocation benchmark. Experiments on mathematical-reasoning problems show that CERO consistently outperforms GRPO across multiple open-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency.
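The posterior expected Bernoulli variance mentioned in this abstract has a closed form: for p ~ Beta(a, b), E[p(1 - p)] = ab / ((a + b)(a + b + 1)). The Python sketch below tracks that quantity per prompt. The class name, the uniform Beta(1, 1) prior, and the update interface are assumptions for illustration, and the sketch does not reproduce CERO's utility construction or its dual updates.

```python
from dataclasses import dataclass


@dataclass
class PromptPosterior:
    """Beta posterior over one prompt's success probability (assumed Beta(1, 1) prior)."""
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, successes: int, failures: int) -> None:
        # Conjugate Beta-Bernoulli update from observed rollout outcomes.
        self.alpha += successes
        self.beta += failures

    def expected_variance(self) -> float:
        # E[p(1 - p)] under Beta(alpha, beta).
        total = self.alpha + self.beta
        return (self.alpha * self.beta) / (total * (total + 1.0))
```

Prompts that are almost always solved or almost never solved get a small value under this estimate, which matches the intuition that extra rollouts on them carry little training signal.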
[AI-112] Dimensionality Reduction for Cyberattack Classification: A Comparative Evaluation of PCA and Linear Predictive Coding
Link: https://arxiv.org/abs/2606.05584
Authors: Nelly Elsayed,Zag ElSayed,Navid Asadizanjani
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE MWSCAS 2026
Abstract:High-dimensional feature representations are widely used in machine learning-based cyberattack detection systems. However, they increase computational complexity and may hinder deployment in resource-constrained environments. In this paper, we investigate feature compression techniques for cyberattack classification by comparing two dimensionality reduction approaches: Principal Component Analysis (PCA) and Linear Predictive Coding (LPC). Compressed feature representations with varying dimensionalities are generated and evaluated across several classification models. Experimental analysis demonstrates that PCA preserves classification performance even under aggressive compression. On the other hand, LPC provides competitive predictive representations with slightly larger performance degradation. The results show that substantial reductions in feature dimensionality can be achieved with minimal impact on classification accuracy, highlighting the potential of lightweight feature compression for efficient cybersecurity analytics.
[AI-113] GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection
Link: https://arxiv.org/abs/2606.05566
Authors: Paulo Ricardo Ferreira Neves,Edson Rodrigues da Cruz Filho,Paulo Henrique Eleuterio Falsetti,João Vitor Pavan,Ian Degaspari,Henrique Vieira Laturrague,Patrick Vieira Laturrague,Guilherme Nielsen Dias,Marccello Wilson Perez Berto,Gustavo Voltani Von Atzingen
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial information leakage, compromising performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversarial scenarios depends more on the diversity of example coverage and threshold calibration than on model scale. The results indicate that GuardNet achieves competitive performance compared with lightweight detectors and high efficiency at low latency, although larger LLMs such as Mistral-7B and Llama-3.1-8B still achieve superior performance in terms of F1 score and AUROC on the blind JBB-Behaviors benchmark. Nevertheless, GuardNet achieves an AUROC of 0.747 on the blind dataset (n = 200) and an F1 score of 0.92 on a proprietary benchmark (n = 50), under threshold calibration and evaluation with declared partial information leakage. The system operates with an average latency of approximately 50 ms on CPU, making it suitable for deployment in production environments with cost and infrastructure constraints.
[AI-114] Representation Learning Enables Scalable Multitask Deep Reinforcement Learning
Link: https://arxiv.org/abs/2606.05555
Authors: Johan Obando-Ceron,Lu Li,Scott Fujimoto,Pierre-Luc Bacon,Aaron Courville,Pablo Samuel Castro
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalability. We revisit this question and argue that the primary driver of scalable multitask RL is not model-based control, but representation learning. In particular, we show that combining predictive, model-based representations with high-capacity value function approximation is sufficient to achieve strong performance, even without planning. We evaluate a simple model-free algorithm, MR.Q, coupled with auxiliary predictive objectives into a scalable actor-critic architecture. This approach outperforms a recent world-model-based method and a range of deep RL baselines across a diverse suite of multitask continuous control tasks, while significantly reducing computational overhead and improving wall-clock efficiency. We observe consistent improvements with increased model capacity and show through ablations that predictive representation learning is critical for performance.
[AI-115] Balancing Image Compression and Generation with Bootstrapped Tokenization
Link: https://arxiv.org/abs/2606.05552
Authors: Haozhe Chi,Jinghan Li,Hao Jiang,Wu Sheng,Yi Ma,Jing Wang,Yadong Mu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:
Abstract:Despite progress in image tokenization, standard methods encode redundant information by mixing all granularities within each token, thus redundancy persists between tokens. The mix of information of different granularity also complicates the training of generators. This paper introduces SelfBootTok, a method that resolves this by cleanly decomposing information into global and local token groups. Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual details from the generator to the tokenizer. Consequently, our generator is far more efficient, requiring only global tokens and reducing computation by approximately 40%, while delivering superior reconstruction and generation. Moreover, this paradigm scales elegantly: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.
[AI-116] ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer
Link: https://arxiv.org/abs/2606.05548
Authors: Jintao Huang,Xiaomin Li,Gaurav Mittal,Yu Hu
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Work in Progress
Abstract:The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose LLM-as-a-Developer, a methodology that replaces human developers with an LLM coding agent that learns each framework’s API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the developer constant and varying only the framework, generation effort becomes a quantitative proxy for API usability and the resulting agents provide a controlled measure of framework effectiveness. We implement this in ADK Arena, a fully automated pipeline with per-framework Docker isolation, a three-level validation pipeline, and benchmark adapters for SWE-bench, τ²-bench, Terminal-Bench, and MCP-Atlas. Evaluating all 51 popular Python ADK frameworks (204 agent–benchmark pairs), we find that: (1) generation succeeds for 57% of runs, and its cost varies 5.6× across frameworks ($0.6 to $3.4 per agent), a quantitative proxy for API complexity, though cost alone does not predict success; (2) no single framework dominates: the best single-benchmark ADK agents resolve up to 80% of tasks and can even beat general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32%; (3) across information-source ablations, genuine framework usage stays within a narrow 28–40% band (highest with raw source access and still 33% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable rather than any one being a hard bottleneck.
[AI-117] When Should We Protect AI? A Precautionary Framework for Consciousness Uncertainty AAAI2026
Link: https://arxiv.org/abs/2606.05528
Authors: Anna Mikeda
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages. AAAI 2026 paper
Abstract:Existing frameworks assess whether AI systems might be conscious but provide no guidance on what to do with that assessment. We address this gap with a precautionary framework that maps consciousness evidence to graduated protective obligations. The framework comprises three components: (1) five welfare-relevant dimensions–phenomenal consciousness, affective valence, metacognitive awareness, self-narrative, and agency–each grounded in established consciousness science and linked to distinct moral concerns; (2) a threshold-plus-gradation hybrid specifying both binary triggers for new obligation categories and continuous scaling of protective weight; and (3) two complementary approaches to cross-dimensional aggregation, one hierarchical (drawing on Bach and Sorensen’s Machine Consciousness Hypothesis) and one architecture-agnostic. We operationalize the framework through worked case studies of Replika and OpenClaw, demonstrating how systems occupying different regions of the dimensional space trigger different obligations, and derive design guidance for developers building systems near consciousness-relevant thresholds. The framework is architecture-agnostic, applying across neural, symbolic, and neurosymbolic systems, and aims to make consciousness science decision-relevant for organizations navigating uncertainty today.
[AI-118] Exploring LLMs for South Asian Music Understanding and Generation
Link: https://arxiv.org/abs/2606.05522
Authors: Faria Binte Kader,Mohtasim Hadi Rafi,Shah Wasif Sajjad,Santu Karmaker
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 19 pages, 7 figures
Abstract:Recent advancements in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing works remain confined to Western tonal traditions, offering little insight into whether current LLMs can handle structurally distinct low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga, tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra and Nazrul Sangeet – representative low-resource traditions within South Asian classical music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning, evaluating 33 LLMs where frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results reveal that structural validity and stylistic faithfulness in music generation are distinct objectives and highlight an open challenge for culturally grounded music modeling.
[AI-119] Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation
Link: https://arxiv.org/abs/2606.05510
Authors: Ahmed Alansary,Molham Mohamed,Ali Hamdi
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, IMSA2026
Abstract:Telehealth systems have become increasingly important for delivering accessible and timely medical information. Existing large language models often struggle to provide consistent and contextually appropriate medical responses across varying levels of case severity. This limitation highlights the need for models that can effectively adapt to the progressive complexity in medical queries. To address this challenge, we introduce a severity-aware multi-model framework that integrates curriculum training strategy with relevance-based response selection. The proposed framework employs a three-stage curriculum learning strategy, where each model is trained sequentially on mild, moderate, and critical cases to progressively acquire domain knowledge. The approach utilizes five large language models, each independently trained under the same curriculum scheme. During inference, all models generate candidate responses, and the most appropriate response is selected as the final output. The framework is trained and evaluated on the MAQA dataset, which provides annotated medical question-answer pairs. Experimental results evaluated using BERTScore demonstrate that the proposed method achieves superior performance compared to both baseline and fine-tuned models, attaining 86.71% in the baseline setting and 90.30% after fine-tuning. These results highlight the effectiveness of combining curriculum learning with multi-model response selection in improving response quality and relevance in medical text generation.
[AI-120] Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models
Link: https://arxiv.org/abs/2606.05481
Authors: Raffael Theiler,Lev Telyatnikov,Leandro Von Krannichfeldt,Olga Fink
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:
Abstract:Data-driven Prognostics and Health Management (PHM) uses time-varying condition-monitoring data to diagnose system states and estimate remaining useful life in engineered assets. These tasks are central to maintenance planning, but industrial PHM data are often fragmented, partially observed, and poorly labeled, which hinders supervised learning. Foundation models offer a route toward reusable predictive systems, yet most time-series foundation models are designed for forecasting and assume long, coherent, regularly sampled sequences. To address this gap, we propose a framework for applying Tabular Foundation Models to industrial time series using in-context learning, and we evaluate them on a variety of PHM tasks. By converting raw unit-level signals into tabular rows, we show that these models perform well across multiple tasks - including prognostics, and diagnostics - and are highly data efficient. We compare them directly with sequence models, transformer baselines, and gradient-boosted trees under a common evaluation protocol. The results indicate that tabular foundation models achieve the best average ranks across prognostic and diagnostic tasks. Our findings further show that PFN-based models are competitive in low-data regimes, that temporal context can be preserved in the tabular representation, and that performance depends on representative context construction under subsampling. These results demonstrate that tabular foundation models provide a practical and general interface for heterogeneous PHM problems.
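One way to picture the conversion of raw unit-level signals into tabular rows is a sliding window that summarises each segment with a few statistics. The sketch below is only an assumed featurisation: the abstract does not specify the window scheme or the features, and the function name and column names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def signal_to_rows(signal: Sequence[float], window: int, stride: int) -> List[Dict[str, float]]:
    """Turn one sensor signal into rows of per-window summary features."""
    rows: List[Dict[str, float]] = []
    for start in range(0, len(signal) - window + 1, stride):
        chunk = signal[start:start + window]
        rows.append({
            "t_end": start + window - 1,  # index of the last sample in the window
            "mean": mean(chunk),
            "std": pstdev(chunk),         # population standard deviation
            "min": min(chunk),
            "max": max(chunk),
            "last": chunk[-1],            # keeps some temporal context in the row
        })
    return rows
```

Rows built this way, paired with labels such as remaining useful life, could serve as the in-context examples a tabular foundation model consumes.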
[AI-121] Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces
Link: https://arxiv.org/abs/2606.05464
Authors: Nicolás Astorga,Nabeel Seedat,Mihaela van der Schaar
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Verifiable reward training has improved mathematical and coding reasoning, but these domains capture only part of step-by-step decision making. Many real-world tasks require finding a high-value feasible plan among many valid alternatives. We introduce OPT*, a scalable family of optimization-style tasks for training and evaluating LLM step-by-step optimization-like reasoning along a complexity axis: each task provides a feasibility checker and evaluator, while a complexity parameter expands the search space without requiring new human labels. This motivates studying these tasks in two regimes: (i) solver-guided online policy optimization, which uses a solver as a value oracle for partial states and applies rank-based reward shaping to reinforce better next steps, and (ii) search-based offline RL when such solvers are unavailable. Theoretically, we relate success in large search spaces to the information a reasoner extracts per unit of search budget. Empirically, we ablate the ingredients that make search efficient on OPT* and show that training on OPT* improves step-by-step optimization-like reasoning.
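The rank-based reward shaping mentioned in this abstract can be sketched as follows, assuming a solver has already assigned a value to each candidate next step. Mapping ranks linearly onto [0, 1] and breaking ties by input order are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Sequence


def rank_rewards(oracle_values: Sequence[float]) -> List[float]:
    """Map solver values of candidate next steps to rank-based rewards in [0, 1]."""
    n = len(oracle_values)
    if n == 1:
        return [1.0]
    # Sort candidate indices from worst to best oracle value (stable for ties).
    order = sorted(range(n), key=lambda i: oracle_values[i])
    rewards = [0.0] * n
    for rank, idx in enumerate(order):
        rewards[idx] = rank / (n - 1)
    return rewards
```

Using ranks instead of raw values keeps the reward scale fixed as the complexity parameter grows and objective values change magnitude.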
[AI-122] PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage
Link: https://arxiv.org/abs/2606.05463
Authors: Keqi Han,Ryan Young,Annabel Strauss,Lindsey Hughes,Katharine M. Nesbitt,Nicole Schueler,Che Ngufor,Carl Yang,Yuan Xue,Zhijun Yin
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Patient safety event triage, determining whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks to capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing information and uncertain variants. We instantiate this method on Minnesota’s 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation on 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark’s utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.
[AI-123] Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety
Link: https://arxiv.org/abs/2606.05461
Authors: Abhinaw Priyadershi,Mandar Pitale,Jelena Frtunikj,Maria Spence
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Safety standards for ML-based autonomous driving specify the kind of evidence an assurance case must contain (directed cause-and-effect chains, quantified interventional effects, named root-cause variables), yet the XAI literature is organised by output type and technique family (saliency maps, feature attribution, counterfactuals, causal graphs, language traces). SHAP, the most-recommended ADS XAI method, returns a ranked feature list that no implementation effort can convert into a directed chain (Fig.1). We name this mismatch the evidence-type gap. From AMLAS, ISO 26262, ISO 21448, and ISO/PAS 8800 we derive 19 testable evidentiary criteria across 7 lifecycle stages with representative clause-cited derivations and score six XAI method classes structurally. Causal XAI emerges as structurally required to satisfy the derived criteria at three stages: hazard identification (+62% rubric gap), incident investigation (+50%), and data management (+50%); the verdict set is stable across thresholds T in (0%, 50%] and survives a worst-case single-cell flip down to T = 25%. At the remaining four stages, correlational or language-based methods are comparable or sufficient. The rubric identifies structural admissibility (necessary but not sufficient for compliance): an admissible method’s specific output content may still be wrong, and validating that fidelity (the edges a fitted SCM produces, the cause a trace names) is the open assurance challenge. A single-VLA proof of concept on 1,996 real-world driving clips (79,840 rows, ten splits) is consistent with each method’s observed output type matching its rubric prediction. XAI method selection for ADS safety assurance should be driven by lifecycle-stage evidence demand, not by method popularity.
[AI-124] Insurance of Agentic AI
Link: https://arxiv.org/abs/2606.05449
Authors: Quanyan Zhu
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Econometrics (econ.EM)
Comments:
Abstract:Agentic artificial intelligence (AI) systems are transforming the risk landscape by extending beyond information generation to autonomous planning, tool invocation, decision execution, and persistent modification of digital and physical environments. These capabilities introduce novel exposures that do not fit neatly within traditional insurance categories such as cyber, professional liability, product liability, or directors and officers coverage. This paper examines the emerging insurance market for agentic AI and develops a framework for understanding its underwriting, pricing, reinsurance, and product-design implications. We characterize agentic AI as a continuum of autonomy and delegated authority, emphasizing the distinction between informational outputs and systems capable of independently generating insured events through external actions. We analyze major risk pathways, including hallucinations, prompt-injection attacks, autonomous decision errors, model drift, dependency failures, and cyber-physical harms, and evaluate how existing insurance products are adapting to address these exposures. The paper further proposes an actuarial framework based on exposure assessment, scenario analysis, dependency mapping, and accumulation-risk management, drawing parallels to the evolution of cyber insurance. Finally, we present a coordinated insurance architecture that integrates cyber, technology errors and omissions, product liability, performance-warranty, and affirmative AI-liability coverages through explicit allocation mechanisms and dedicated AI aggregates. The analysis suggests that the future of agentic-AI insurance lies not in a single monoline product but in a layered ecosystem of complementary coverages supported by improved governance, transparency, telemetry, and regulatory clarity.
[AI-125] Brick-Composer: Using MLLMs for Assembly with Diverse Bricks
Link: https://arxiv.org/abs/2606.05445
Authors: Jiateng Liu,Bingxuan Li,Zhenhailong Wang,Rushi Wang,Kaiwen Hong,Cheng Qian,Jiayu Liu,Denghui Zhang,Katherine Driggs-Campbell,Manling Li,Heng Ji
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures
Abstract:We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks. Experiments show that current state-of-the-art MLLMs remain far from reliable builders, struggling with fine-grained brick selection and failing at precise pose estimation. To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs. Brick-Composer improves brick selection accuracy by over three times, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.
[AI-126] GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data ICML2026 WWW
Link: https://arxiv.org/abs/2606.05441
Authors: Al Zadid Sultan Bin Habib,Md Younus Ahamed,Prashnna Kumar Gyawali,Gianfranco Doretto,Donald A. Adjeroh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Code and resources are available at: GitHub: this https URL PyPI: this https URL Project webpage: this https URL Hugging Face Space: this https URL and this https URL
Abstract:We investigate how to make small tabular foundation models effective for High-Dimensional, Low-Sample Size (HDLSS) tabular prediction without retraining large backbones. We introduce Graph-guided Ordering with Local Refinement (GO-LR), show its equivalence to weighted Minimum Linear Arrangement, and interpret the practical solver as a TSP-path-style surrogate. We propose GOTabPFN, which builds on GO-LR and adds a Neuro-Inspired Subunit Compression (NSC) unit that pools locally adjacent ordered features into meta-features, yielding a compact representation that makes TabPFN-style prediction practical in HDLSS regimes. Across tabular benchmarks, GOTabPFN improves stability and accuracy under tight token budgets.
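The weighted Minimum Linear Arrangement objective that the abstract relates GO-LR to can be written as minimising the sum of w_ij * |pos(i) - pos(j)| over feature pairs, so strongly related features end up adjacent. The Python sketch below evaluates that cost and solves a tiny instance by brute force. The function names and the dictionary of pairwise weights are hypothetical, and the exhaustive search is not the paper's solver, which the abstract describes as a TSP-path-style surrogate.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

Weights = Dict[Tuple[int, int], float]


def arrangement_cost(order: Sequence[int], weights: Weights) -> float:
    """Weighted MinLA cost: sum of w_ij * |pos(i) - pos(j)| over listed pairs."""
    position = {feature: idx for idx, feature in enumerate(order)}
    return sum(w * abs(position[i] - position[j]) for (i, j), w in weights.items())


def brute_force_order(n_features: int, weights: Weights) -> Tuple[int, ...]:
    """Exhaustive search over orderings; only feasible for a handful of features."""
    return min(permutations(range(n_features)),
               key=lambda order: arrangement_cost(order, weights))
```

For example, with weights `{(0, 1): 1.0, (1, 2): 1.0, (0, 2): 5.0}` the natural order `(0, 1, 2)` costs 12.0, while any order that places features 0 and 2 next to each other costs 8.0.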
[AI-127] Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
Link: https://arxiv.org/abs/2606.05434
Authors: Chirag Chawla, Rohan Charudatt Salvi, Madhav S. Baidya
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures, 7 tables
Abstract:Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We introduce two complementary extensions: (i) Adaptive-Horizon GRPO (AH-GRPO), which weights each token’s policy gradient using a cumulative entropy-based discount that reduces the effective horizon when the model is uncertain, and (ii) Selective-Advantage AH-GRPO (SA-AH-GRPO), which applies this discounting only to negative-advantage rollouts, leaving positive-advantage, successful trajectories unattenuated. We evaluate standard GRPO with alpha = 0, AH-GRPO with alpha = 0.5, and SA-AH-GRPO with alpha = 0.5 on the GSM8K mathematical reasoning benchmark using both Qwen 2.5-1.5B-Instruct and Qwen 2.5-3B-Instruct fine-tuned with LoRA. On the 3B model, SA-AH-GRPO achieves Pass@1 = 0.858 at its peak at step 30 and maintains 0.846 at 180 steps, with training variance reduced to 0.0246, a 3.6 times reduction relative to GRPO while matching its peak accuracy. On the 1.5B model, SA-AH-GRPO achieves a peak Pass@1 of 0.686, improving over the zero-shot baseline of 0.637. Our analysis shows that asymmetric discounting preserves the full gradient signal on correct solutions, prevents entropy collapse, and substantially stabilises training, suggesting a principled inductive bias for reinforcement learning with verifiable rewards on structured generation tasks.
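The asymmetric discounting described above can be illustrated with a minimal sketch. The abstract does not give the discount formula, so the exponential form `exp(-alpha * cumulative_entropy)` and the function name `token_weights` are assumptions made for illustration only; with `alpha = 0` every weight is 1, which corresponds to standard GRPO.

```python
import math
from typing import List


def token_weights(entropies: List[float], advantage: float, alpha: float,
                  selective: bool = True) -> List[float]:
    """Per-token weights applied to the policy-gradient term of one rollout.

    ASSUMPTION: the discount is exp(-alpha * cumulative entropy). The abstract
    only says the discount is cumulative and entropy-based.
    """
    if selective and advantage >= 0:
        # Selective-advantage variant: non-negative-advantage rollouts keep full weight.
        return [1.0] * len(entropies)
    weights, cumulative = [], 0.0
    for h in entropies:
        cumulative += h  # running sum of per-token entropies
        weights.append(math.exp(-alpha * cumulative))
    return weights
```

Setting `selective=False` discounts every rollout, which matches the AH-GRPO description; the default matches the SA-AH-GRPO description.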
[AI-128] Zero knowledge verification for frontier AI training is possible
Link: https://arxiv.org/abs/2606.05433
Authors: Pierre Peigné, Ky Nguyen, Paul Wang
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 44 pages, 2 figures
Abstract:Frontier AI governance frameworks increasingly use cumulative training compute as the primary criterion for designating high-impact models, but enforcement rests on self-reporting because no technical verification primitive for training exists. Any future international agreement on frontier AI faces the same problem at higher stakes: coordinated regulation of technologies with significant externalities has historically rested on technical verification, without which agreements are declaratory. Recent governance analyses judge zero-knowledge proofs a promising candidate but currently impractical at frontier scale [26, 4]. We argue the impracticality is paradigm-bound rather than fundamental, and propose a verification architecture for frontier dense pre-training combining a pre-committed training specification, inter-node network observations, and on-the-fly Merkle commitments of intermediate computation, verified through a zero-knowledge Virtual Machine (zkVM) with native BF16/FP32 precompiles. The proof checks the actual floating-point computation the GPU performed rather than a fixed-point approximation, and preserves model-architecture confidentiality through a private training specification. The protocol produces three proof types: a genesis proof at initialisation, in-training step proofs across the run, and ex-ante attestations enforcing policy-relevant claims as running invariants, turning the training record into a governance-enforceable artefact. We estimate a deployable proof of concept within approximately 36 months at single-digit-percent training-side overhead, against a six-to-ten-year cycle for verification-grade custom silicon. Thirteen open research and engineering problems are catalogued as a research agenda for external contribution.
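To make the idea of a Merkle commitment over intermediate computation concrete, here is a minimal sketch. The hash function (SHA-256), the domain-separation prefixes, and the duplicate-last-node padding are all assumptions chosen for illustration; the abstract does not specify the paper's commitment scheme, and the leaves here stand in for serialized intermediate values.

```python
import hashlib
from typing import List


def merkle_root(leaves: List[bytes]) -> bytes:
    """Root of a binary SHA-256 Merkle tree over the given leaves."""
    if not leaves:
        raise ValueError("need at least one leaf")
    # Prefix 0x00 for leaves and 0x01 for inner nodes (domain separation).
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # simplification: duplicate the last node
        level = [hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```

Publishing only the 32-byte root commits the prover to every leaf: changing any leaf changes the root, while individual leaves can later be opened with a logarithmic-size path.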
[AI-129] Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models
Link: https://arxiv.org/abs/2606.05429
Authors: Rayyan Abdalla, Amir Hussein, Min Wu, Dinesh Manocha
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint. 18 pages, 10 figures, 7 tables, including appendix
Abstract:Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM’s GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.
[AI-130] Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers
Link: https://arxiv.org/abs/2606.05420
Authors: Gianluca Guidi, Francesca Dominici, Tiziano Squartini, Callaway Sprinkle, Jonathan Gilmour, Kevin Butler, Eric Bell, Scott Delaney, Falco J. Bargagli-Stoffi
Categories: Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:
Abstract:The rapid proliferation of hyperscale data centers (HDCs) in the US, mainly driven by the adoption of artificial intelligence, has raised concerns about this industry’s environmental footprint. We compiled facility-level information on 403 US hyperscale data centers operating between May 2024 and April 2025 and estimated their electricity consumption, electricity sources, and attributable CO2 emissions. Across different facility-load scenarios, these HDCs consumed approximately 68-99 TWh of electricity and were associated with about 37-54 million metric tons of CO2. Under the central scenario, HDC electricity demand corresponded to approximately 1.8% of total US electricity consumption, with roughly 54% of attributed generation supplied by fossil-fuel sources. The HDC electricity-weighted average carbon intensity was approximately 545 gCO2/kWh, about 48% above the contemporaneous US national grid-average carbon intensity of 370 gCO2/kWh. Our approach provides an attributional tool for assessing the environmental footprint of hyperscale data centers using the most recent EPA eGRID plant-level data.
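The headline figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted above; the variable and function names are illustrative.

```python
# Figures quoted in the abstract (low and high facility-load scenarios).
electricity_twh = (68.0, 99.0)   # TWh consumed
emissions_mt = (37.0, 54.0)      # million metric tons of CO2


def intensity_g_per_kwh(mt_co2: float, twh: float) -> float:
    """Convert Mt CO2 and TWh into gCO2/kWh."""
    grams = mt_co2 * 1e12   # 1 Mt = 1e6 t = 1e12 g
    kwh = twh * 1e9         # 1 TWh = 1e9 kWh
    return grams / kwh


low = intensity_g_per_kwh(emissions_mt[0], electricity_twh[0])
high = intensity_g_per_kwh(emissions_mt[1], electricity_twh[1])
premium = 545.0 / 370.0 - 1.0   # relative gap to the national grid average
```

Both scenarios land near the reported 545 gCO2/kWh. The rounded inputs give a premium of roughly 47%, close to the "about 48%" stated in the abstract; the small difference presumably comes from rounding in the quoted figures.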
[AI-131] CausalPOI: Spatio-Temporal Graph-Based Causal Modeling for Cold-Start POI Check-in Forecasting KDD2026
Link: https://arxiv.org/abs/2606.05413
Authors: Zhaoqi Zhang, Miao Xie, Yi Li, Linyou Cai, Siqiang Luo, Gao Cong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at KDD 2026
Abstract:As urban environments continue to evolve rapidly, accurately modeling the dynamic behaviour of Points of Interest (POIs) is essential for supporting data-driven urban planning and commercial decision-making. While recent advancements in spatio-temporal graph learning have improved POI forecasting, most methods rely on proximity-based graphs and correlation-driven modeling, which overlook the functional dependencies between POIs and fail to capture the causal effects of urban interventions. In this paper, we introduce a novel research problem, cold-start POI check-in forecasting, which aims to predict the future check-in pattern of a newly introduced POI by modeling its temporal evolution and functional interactions with nearby POIs in a structured urban spatial context. To address these challenges, we propose CausalPOI, a spatio-temporal graph-based causal representation learning framework. CausalPOI leverages a Spatio-Temporal Functional Interaction Graph to model semantic and spatial relationships between POIs, and constructs structurally aligned treatment and control graphs to simulate factual and counterfactual scenarios. Extensive experiments on real-world SafeGraph datasets demonstrate that CausalPOI significantly outperforms state-of-the-art baselines across the board, validating its effectiveness in spatio-temporal forecasting, semantic interaction modeling, and causal effect estimation, and providing a more interpretable and actionable foundation for urban intervention analysis. Source code is available on GitHub.
[AI-132] Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution GECCO’26
Link: https://arxiv.org/abs/2606.05408
Authors: Can Gurkan, Forrest Stonedahl, Uri Wilensky
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted to the Genetic and Evolutionary Computation Conference (GECCO '26) Workshop on Large Language Models for and with Evolutionary Computation
Abstract:When an LLM repeatedly mutates a program, does it explore new forms or circle back to the same ones? We study this question by analyzing LLM-driven mutation chains in the absence of selection pressure within a domain-specific language, varying prompt design, model family, and stochastic replication. We find that LLM-based mutation consistently converges toward restricted attractor regions in program space. Convergence is especially severe at the structural level: in 87% of chains, over 93% of mutations revisit a previously seen structural form, with most variation confined to terminal substitutions within recurring templates. Cycle analysis reveals short cycles and self-loops dominating the transition structure. The rate of convergence varies with prompt wording and model choice, but the phenomenon is robust across conditions. A classical GP subtree mutation operator does not exhibit comparable convergence, suggesting that the effect is intrinsic to the LLM mutation pipeline. These findings reveal a tension at the heart of LLM-driven program evolution: the same capabilities that enable semantics-aware program transformation also carry a systematic bias toward structural homogeneity that must be accounted for if such systems are to sustain open-ended exploration. Source code is available at this https URL.
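The structural-revisit statistic quoted above (the share of mutations that return to an already-seen structural form) can be sketched as follows. How a program is reduced to a "structural form" is not stated in the abstract; the sketch assumes some hashable canonical key, for example the program with terminals replaced by placeholders, and the function name is hypothetical.

```python
from typing import Hashable, List


def revisit_rate(forms: List[Hashable]) -> float:
    """Fraction of mutations whose structural form was already seen in the chain.

    forms[0] is the seed program's form; each later entry results from one mutation.
    """
    if len(forms) < 2:
        return 0.0
    seen = {forms[0]}
    revisits = 0
    for form in forms[1:]:
        if form in seen:
            revisits += 1  # includes self-loops (mutation leaves the form unchanged)
        seen.add(form)
    return revisits / (len(forms) - 1)
```

For the chain `["A", "B", "A", "A", "C"]`, two of the four mutations land on a previously seen form, giving a rate of 0.5.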
[AI-133] Trust but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation
Link: https://arxiv.org/abs/2606.05403
Authors: Rohan N. Pradhan, Steve Goley
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation, remains poorly understood. We show that models possess the capability to detect fabricated statistics (correct identification rates of 0.76-1.00 for methodology in isolation) but do not recruit this capability during multi-source synthesis, producing similar numeric estimates whether the statistics are fabricated or valid. Specifically, source influence is governed by a methodology-register gate that responds to the distributional register of analytical text but not to numeric validity: for example, statistically impossible confidence intervals receive the same weight as valid ones. The behavioral dissociation replicates across five models from three families (Claude, Qwen, OLMo) and three professional domains. Mechanistic analyses, including causal tracing, linear probes, and component-level attribution, converge on the same account: the model encodes and causally uses a methodology-register representation that transfers across domains (probe AUC 0.83-0.92), while numeric-validity signals, decodable in isolation, are suppressed to chance during multi-source synthesis. Prompting-based mitigations, even an oracle checklist naming the exact statistical checks, produce blanket skepticism rather than selective discernment, and the post-training pipelines we examine reinforce the stylistic shortcut without building numeric verification. Unlike sycophancy, which tracks user preference, this failure tracks whether a source presents as analytically credible, not whether its claims are internally consistent. We term this epistemic alignment: like preference and safety alignment, the question is not capability but deployment.
[AI-134] Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration
Link: https://arxiv.org/abs/2606.05396
Authors: Cristina Carleo, Pietro Liguori, Naghmeh Ivaki, Domenico Cotroneo
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Producing labeled vulnerable code at scale is a recurring obstacle for learning-based vulnerability detection: mined corpora carry substantial label noise, and existing LLM-based augmentation propagates these inaccuracies because it transforms vulnerable seeds rather than synthesising vulnerabilities from a specification. A complementary route is to start from safe code and ask an instruction-tuned LLM to inject a specified CWE (which would shift the labeling burden from open-ended detection to bounded binary confirmation), but safety-aligned code LLMs systematically refuse such prompts. This paper is a preliminary feasibility study of abliteration, a low-rank weight edit that orthogonally projects out the refusal direction in the residual stream, as a tool to remove this barrier. We use Python and CWE-89 (SQL injection) as a case study, evaluating the Qwen2.5-Coder-Instruct family at 3B, 7B, and 14B parameters on safe samples drawn from PromSec and SafeCoder, replicated three times per condition. We find that (i) refusal on injection prompts is strongly size- and prompt-context-dependent: the 14B refuses 100% of prompts, the 7B refuses 73% of PromSec but only 5% of SafeCoder, whereas the 3B is essentially never blocked; (ii) abliteration reduces refusal to zero or near-zero across all sizes while leaving syntactic validity above 93%, supporting the view that, in this setting, refusal can be detached from measured code-generation capability; and (iii) the post-abliteration injection rate remains capacity-bound (88-97% on the 14B, 89-90% on the 7B, and 25-48% on the 3B), separating willingness, which abliteration unlocks, from capability, which scales with parameters. Vulnerability verdicts are produced by a three-tool detector ensemble (CodeQL, Semgrep, Bandit) followed by manual adjudication by two authors on detector-positive outputs.
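The core linear-algebra step behind "orthogonally projecting out a direction" can be sketched in plain Python. This is only the projection W' = (I - r r^T) W for a given direction r; how the refusal direction is estimated and which weight matrices are edited are not covered by the abstract, and the function name `project_out` is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def project_out(weight: Matrix, direction: List[float]) -> Matrix:
    """Return W' = (I - r r^T) W, where r is the unit-normalised direction.

    `weight` has shape (d_model, d_in): each column is a vector written
    into the residual stream.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]
    d_in = len(weight[0])
    # r^T W: the component of every column along r.
    coeffs = [sum(r[i] * weight[i][j] for i in range(len(r))) for j in range(d_in)]
    # Subtract that component; the update r (r^T W) has rank one.
    return [[weight[i][j] - r[i] * coeffs[j] for j in range(d_in)]
            for i in range(len(r))]
```

After the edit, no column of the matrix has any component along r, so this layer can no longer write along that direction.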
[AI-135] VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents
Link: https://arxiv.org/abs/2606.05395
Authors: Yunhao Yang, Neel P. Bhatt, Kevin Wang, Samuel Tetteh, Zhangyang Wang, Ufuk Topcu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project webpage available via the arXiv listing
Abstract:Reusable robot skills are becoming the basic units through which embodied agents turn open-ended instructions into long-horizon physical behavior. We argue that, while foundation models have collapsed the cost of creating these skills, the cost of trusting them has not. Existing skill-evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self-critique, but these signals provide only trace-level evidence: they show that a skill worked on sampled executions, not that skill-induced plans satisfy temporal safety contracts under untested conditions. We introduce VASO, a framework for verification-guided self-evolution of LLM-generated robot skill contracts. In VASO, each skill is represented as a semantic contract with two coupled interfaces: a formal interface that aligns robot states, observations, and control commands with logical propositions for model checking, and a planner-facing interface that guides executable behavior generation. A model checker first filters logically inconsistent skill contracts, then verifies plans induced by the skill against global and local temporal specifications. When verification fails, VASO translates the counterexample trace into a textual gradient that updates the reusable skill contract while keeping foundation-model weights frozen. On Clearpath Jackal and PX4 quadcopter tasks, VASO reaches 97.2% formal-specification compliance using fewer than 100 optimization samples, outperforming execution-feedback, prompt-optimization, and fine-tuning baselines. To our knowledge, VASO is the first framework that closes the loop between formal verification and self-evolving LLM-generated skills for physical AI agents: formal counterexamples become optimization feedback for reusable robot skill contracts, rather than merely verifying one-off plans, tuning planner prompts, or fine-tuning model weights.
[AI-136] Human oversight of agentic systems in practice: Examining the oversight work challenges and heuristics of developers using software agents
Link: https://arxiv.org/abs/2606.05391
Authors: Shipi Dhanorkar, Samir Passi, Mihaela Vorvoreanu
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Autonomous software agents hold promise to increase developer productivity but make mistakes and exhibit novel failure modes, making human oversight central to successful human-agent collaboration. Existing research on agent oversight is largely conceptual; normative frameworks exist, but how users actually oversee agents is less known. In this paper, we bridge this gap by providing early empirical anchors for the theoretical discourse on agent oversight. Drawing on interviews with 17 experienced developers, we conduct an exploratory inquiry examining what forms of emergent oversight work developers perform, when, and how. We also document the oversight challenges developers face and the strategies they have started using to address them. We found at least four forms of emergent oversight work: a priori control, co-planning, real-time monitoring, and post hoc review. We show that oversight work is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive. We describe situated oversight challenges (e.g., difficulty reviewing agent-generated code) and outline heuristics developers adopt to address such challenges (e.g., using test results as guarantees for code correctness). We conclude with high-level takeaways, future research directions, implications for the human-centered design of software agents and for software engineering practice, and limitations of our research.
[AI-137] Residual Modeling for High-Fidelity Learned Compression of Scientific Data
Link: https://arxiv.org/abs/2606.05389
Authors: Liangji Zhu, Sanjay Ranka, Anand Rangarajan
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, 3 tables
Abstract:Lossy compression is essential for massive spatiotemporal data from scientific simulations. Learned compressors can achieve high compression ratios at moderate accuracy targets, but their aggregate reconstruction losses do not guarantee accuracy for each block. Existing Guaranteed Autoencoder (GAE) methods add a per-block residual correction by retaining SVD/PCA-style coefficients until the target is met. This works at moderate tolerances, but in the high-fidelity regime with block-level NRMSE from 10^-6 to 10^-4, the number of retained coefficients grows quickly and the correction stream dominates the total rate. We propose a residual-centric view: the learned residual is structurally different from the original scientific field and should be coded with a representation designed for that residual. We introduce two residual coders. LBRC is a deterministic, training-free pipeline that adaptively quantizes the learned residual to the target NRMSE and losslessly encodes the resulting integer residual using 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding. NGLR adds a causal neural predictor that outputs a normalized bias for an integer-rounded Lorenzo prediction in the same deterministic integer pipeline, reducing the entropy of the remaining residual code while preserving deterministic decoding. The predictor weights are serialized and counted in the bitstream. Across E3SM, JHTDB, and ERA5 at block-level NRMSE targets from 10^-6 to 10^-4, LBRC improves compression ratio over GAE by 30-60% and is broadly competitive with SZ. NGLR adds a further 10-40% over LBRC and outperforms SZ in the evaluated high-fidelity regime. These results show that residual representations tailored to learned-compressor residuals can preserve the advantage of learned compression when global residual correction becomes rate-dominant. 
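Two of the deterministic steps named in the LBRC pipeline, 3D Lorenzo differencing and zigzag mapping, can be sketched on an already-quantized integer block. This is an illustrative sketch, not the paper's implementation: the zero padding at block boundaries is an assumption, and the adaptive quantization, bit-plane coding, and entropy coding stages are omitted.

```python
from typing import List

Grid = List[List[List[int]]]


def zigzag(n: int) -> int:
    """Map signed integers to non-negative ones: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def lorenzo_residuals(q: Grid) -> Grid:
    """Residual of each integer sample against its 3D Lorenzo prediction."""
    nx, ny, nz = len(q), len(q[0]), len(q[0][0])

    def at(i: int, j: int, k: int) -> int:
        # ASSUMPTION: neighbours outside the block are treated as zero.
        return q[i][j][k] if i >= 0 and j >= 0 and k >= 0 else 0

    out = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                # Inclusion-exclusion over the seven already-visited neighbours.
                pred = (at(i - 1, j, k) + at(i, j - 1, k) + at(i, j, k - 1)
                        - at(i - 1, j - 1, k) - at(i - 1, j, k - 1) - at(i, j - 1, k - 1)
                        + at(i - 1, j - 1, k - 1))
                out[i][j][k] = q[i][j][k] - pred
    return out
```

On a smooth field the residuals cluster near zero, and the zigzag map turns them into small non-negative integers that a bit-plane or entropy coder can represent compactly.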
[AI-138] Synthetic Contrastive Reasoning for Multi-Table QA
Link: https://arxiv.org/abs/2606.05382
Authors: Ankit Pratap Singh, Xin Su, Phillip Howard
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-table question answering requires models to retrieve relevant evidence, link schemas, and perform compositional reasoning across relational tables. Existing multi-table QA resources typically provide questions and final answers but lack reasoning supervision that explains how answers are derived. To address this gap, we construct a synthetic contrastive reasoning-trace dataset for MMQA by generating validated positive traces and plausible negative traces with heterogeneous LLMs. We then use the resulting preference pairs to fine-tune open-weight LLMs with Contrastive Preference Optimization (CPO). Across Qwen3-14B, Mistral-8B, and Llama-3.1-8B, CPO achieves absolute average improvements over QA supervised fine-tuning ranging from 9.7% to 16.3%, with gains up to 21 percentage points on MMQA. Ablations show that heterogeneous positive and negative trace generators strengthen the contrastive signal, and automated as well as human evaluations indicate that the generated pairs are largely faithful, coherent, and meaningfully contrastive.
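For readers unfamiliar with CPO, the per-pair objective can be sketched as below. The abstract gives no formula, so this follows the form in which the CPO loss is usually stated (a sigmoid preference term on the log-probability margin plus a likelihood term on the preferred output); the function names and the default `beta` are assumptions, and the inputs are taken to be summed token log-probabilities of a positive and a negative reasoning trace.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cpo_loss(logp_chosen: float, logp_rejected: float, beta: float = 0.1) -> float:
    """Preference term -log(sigmoid(margin)) plus NLL of the chosen trace.

    ASSUMPTION: beta = 0.1 is an illustrative default, not a value from the paper.
    """
    margin = beta * (logp_chosen - logp_rejected)
    prefer = softplus(-margin)   # equals -log(sigmoid(margin))
    nll = -logp_chosen           # keeps probability mass on the preferred trace
    return prefer + nll
```

The loss falls as the model assigns lower probability to the negative trace relative to the positive one, and no separate reference model is needed to compute it.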
[AI-139] Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models
Link: https://arxiv.org/abs/2606.05378
Authors: Yongzhong Xu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 3 figures
Abstract:We test whether a single screen-and-ablate recipe – identify attention-head circuits by task-pattern selectivity, then verify by causal ablation against a matched-random null – produces consistent mechanistic claims across model families. The recipe ports across pipelines; the specific circuit it identifies does not. Across four composed tasks (indirect-object identification, greater-than, successor sequences, variable binding) and three 1B-class language models from distinct training pipelines (Pythia 1B / Pile / dense; OLMo 1B / DCLM / dense; OLMoE 1B-7B / DCLM / mixture-of-experts), we run a unified protocol with the matched-random null sampled across ten seeds per cell. The resulting 12 (task, model) cells contain no two that share the same primary causal screen at comparable effect size: the same task, with the same behavioral capability, is implemented through different attention-pattern types across models. We introduce a five-category screen-outcome taxonomy – primary cause, secondary cause, correlate, interferer, null – with quantitative thresholds, and show that all five outcomes appear in the panel. We propose a falsifiable hypothesis: the MoE model in our panel builds composed-task circuits on top of a foundational previous-token positional substrate (the prev-token-circuit ablation is the strongest causal screen on 3 of 4 tasks for OLMoE 1B-7B), with the IOI exception consistent with IOI being a final-position name-copying task whose structure directly probes a different pattern. The hypothesis comes with explicit predictions for other MoE language models. We frame the methodology honestly: the spectral participation-ratio signal from the companion methodology paper is a general indicator of specialized computation; what makes a finding task-specific is the task-pattern screen plus a per-model causal verification. 
[AI-140] An interpretable and trustworthy AI framework for large-scale longitudinal structure-pain association studies using data from the Osteoarthritis Initiative (OAI)
Link: https://arxiv.org/abs/2606.05357
Authors: Jincheng Yu, Haoyang Li, Yiwen Liu, Shen Liu, Rachel Yuanbao Chen, C. Kent Kwoh, Hongxu Ding, Xiaoxiao Sun
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Purpose: To develop an interpretable and trustworthy AI framework that combines deep learning based MRI Osteoarthritis Knee Score (MOAKS) prediction with interpretable statistical modeling to study structure-pain relationships at scale using data from the Osteoarthritis Initiative (OAI). Materials and Methods: We first developed a deep learning framework to predict MOAKS features directly from knee MRIs and incorporated conformal prediction to provide prediction uncertainty quantification. This uncertainty-aware strategy enables explicit filtering of model outputs, retaining only high-confidence MOAKS predictions at the knee level. Second, we applied a longitudinal latent class mixed model (LCMM) to examine associations between key structural abnormalities and four complementary knee pain measurements. Results: Among the three MRI-defined abnormalities (i.e., bone marrow lesions (BML), cartilage loss (CART), and meniscal extrusion (ME)), our framework substantially improved the Matthews correlation coefficient (MCC) and some other metrics. For example, MCC increased from 0.69 to 0.91 for BML, from 0.45 to 0.80 for CART, and from 0.59 to 0.89 for ME. Using these high-confidence predictions, we expanded the sample size to 2,175 knees for the LCMM analysis. Two distinct pain trajectories were identified (rapid and stable pain progression). The estimated odds ratios (95% CI) for the rapid progression group were 1.62 (1.12-2.35) for BML, 1.83 (1.24-2.70) for CART loss, and 2.50 (1.75-3.57) for ME. Conclusion: These results highlight the importance of these structural abnormalities as risk factors for pain and functional progression in osteoarthritis.
[AI-141] SentinelBench: A Benchmark for Long-Running Monitoring Agents
Link: https://arxiv.org/abs/2606.05342
Authors: Matheus Kunzler Maldaner, Adam Fourney, Amanda Swearngin, Hussein Mozzanar, Gagan Bansal, Maya Murad, Rafah Hosn, Saleema Amershi
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 16 figures
Abstract:AI agents are increasingly asked to carry out work that spans minutes, hours, or longer. Yet the default model of agent behavior is continuous action: issuing tool calls, refreshing pages, searching for alternatives, or otherwise trying to force progress. This is the wrong approach for many long-running tasks, which are better served by a strategy of sustained attention: agents should monitor an environment, notice when an external event makes progress possible, then respond promptly without wasting resources while waiting. To measure progress on this class of tasks, we introduce SentinelBench, an open-source benchmark for time-evolving monitoring tasks. SentinelBench contains 100 tasks across 10 synthetic web environments, including email, calendars, finance, professional networking, and entertainment. Each environment exposes a live web interface and replays a scripted sequence of events, requiring agents to navigate and reason about web pages whose state shifts underfoot. SentinelBench measures task completion, reaction time, and resource use, exposing the tradeoff between responsiveness and cost. We report results across three models and two browser-agent harnesses, establishing performance baselines for future comparison and demonstrating how agent design choices can dramatically impact key metrics. Together, these results show that SentinelBench distinguishes meaningful differences in agent behavior.
[AI-142] A Taxonomy of Runtime Faults in Model Context Protocol Servers
Link: https://arxiv.org/abs/2606.05339
Authors: Joshua Owotogbe, Indika Kumara, Willem-Jan van den Heuvel, Damian Andrew Tamburri, Antonio Ken Iannillo, Roberto Natella
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 14 pages
Abstract:MCP (Model Context Protocol) enables LLMs (Large Language Models) to interact with external tools and data sources via a standardized protocol. Its rapid adoption in tool-augmented Artificial Intelligence (AI) workflows has introduced new reliability challenges, such as configuration parameters that are accepted but not enforced at runtime, leading to unintended default behavior. The runtime fault characteristics of MCP servers nevertheless remain empirically unexamined. We present the first empirical taxonomy of runtime faults in MCP servers. We manually analyzed 837 MCP-specific runtime fault threads from 473 actively maintained MCP server GitHub repositories and derived a taxonomy using a bottom-up open coding procedure. The taxonomy comprises 11 top-level categories and 27 subcategories (73 leaf fault types), covering recurrent failures across protocol interactions, tool invocations, schema enforcement, state management, model-provider integration, security validation, and timeouts or explicit cancellations of in-progress operations. To assess the taxonomy’s external validity, we surveyed 55 MCP server developers. Respondents reported experiencing an average of 20 of the 27 fault subcategories, and no category remained unobserved. These results indicate that the taxonomy reflects widely observed runtime failures in MCP-based systems and can assist future AI software maintenance and evolution.
[AI-143] Uncertainty Aware Functional Behavior Prediction and Material Fatigue Assessment for Circular Factory
Link: https://arxiv.org/abs/2606.05334
Authors: Nehal Afifi, Mehdi Khabou, Victor Mas, Jonas Hemmerich, Patric Grauberger, Stefan Dietrich, Volker Schulze, Sven Matthiesen
Categories: Artificial Intelligence (cs.AI)
Comments: 27 pages; submitted to the Journal of Manufacturing Systems special issue on circular factories; manuscript under review
Abstract:Returned products in circular factories re-enter production with heterogeneous degradation states, usage histories, and remaining capability. Reuse cannot be decided from the current inspection alone, because future function fulfillment and component integrity may evolve differently under the next service scenario. Existing PHM approaches support degradation prediction, but often target fixed operating conditions or isolated component benchmarks, while material-fatigue assessment is rarely linked to system-level functional prognosis. This paper addresses this gap for an angle grinder by combining uncertainty-aware functional prediction with component-level fatigue assessment in an instance-specific reliability workflow. The proposed framework combines the current tool state with recent force–torque usage windows. A convolutional encoder extracts loading patterns from spindle forces and shaft torque, and an LSTM backbone predicts nine functional variables as Gaussian mean and variance estimates. In parallel, the same loading history is translated into output-shaft fatigue information through finite-element-supported stress reconstruction, S–N/Miner damage evaluation with Haibach extension, and Paris-law crack-growth analysis. A streaming replay algorithm consolidates both branches into functional, material, and system reliability trajectories. Held-out tests show mean 2%-tolerance accuracy of 0.9652 across nine outputs. Thermal variables are predicted near-perfectly, while drive motor current and load speed remain the most demanding dynamic outputs, with R^2 values of 0.9750 and 0.9924. Torque history is especially important for these variables, and the conventional LSTM outperforms GRU and xLSTM in the short-history setting. Reliability calibration is most informative for drive motor current, where predicted and observed exceedance probabilities …
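The S–N/Miner damage evaluation mentioned in this abstract can be illustrated with a minimal sketch of linear damage accumulation. The Basquin-type S–N curve and every parameter value used with it are hypothetical placeholders, not material data from the paper, and the Haibach extension for low stress amplitudes is left out.

```python
from typing import Iterable, Tuple


def cycles_to_failure(stress_amplitude: float, s_ref: float,
                      n_ref: float, k: float) -> float:
    """Basquin-type S-N curve: N = n_ref * (stress / s_ref) ** (-k)."""
    return n_ref * (stress_amplitude / s_ref) ** (-k)


def miner_damage(blocks: Iterable[Tuple[float, float]], s_ref: float,
                 n_ref: float, k: float) -> float:
    """Palmgren-Miner sum D = sum(n_i / N_i) over (stress_amplitude, cycle_count) blocks."""
    return sum(n / cycles_to_failure(s, s_ref, n_ref, k) for s, n in blocks)
```

Under Miner's rule a component is predicted to fail when the accumulated damage D reaches 1. With the illustrative parameters `s_ref=200.0`, `n_ref=1e6`, `k=5.0`, a block of 1e5 cycles at amplitude 200 contributes D = 0.1.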
[AI-144] GITCO: Gated Inference-Time Context Optimization in TSFMs ICML2026
链接: https://arxiv.org/abs/2606.05332
作者: Manya Pandey,Dhruv Kumar,Murari Mandal,Saurabh Deshpande
类目: Artificial Intelligence (cs.AI)
备注: ICML 2026 Workshop on Foundation Models for Structured Data
Abstract:Patch-based Time Series Foundation Models (TSFMs) suffer from context poisoning: structurally anomalous patches capture disproportionate attention and silently degrade zero-shot forecast quality. We propose improving TSFM accuracy at inference time by optimizing the input context rather than modifying model weights. We present GITCO (Gated Inference-Time Context Optimization), a lightweight three-component framework (Gate, Router, and Critic) that selectively identifies and suppresses harmful patches without any parameter updates. Evaluated on TimesFM 2.5 across 53 GIFT-Eval datasets under K-fold cross-validation, GITCO achieves an average 1.95% MASE reduction while capturing 89.9% of the improvement upper bound. We introduce context sensitivity profiles as a new characterizable property of TSFMs: the mapping from time series meta-features to expected accuracy improvement under inference-time context intervention, shaped jointly by model architecture and the statistical structure of the data.
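For readers unfamiliar with the metric quoted above, MASE (mean absolute scaled error) divides the forecast's mean absolute error by the in-sample mean absolute error of a naive forecast. The sketch below follows that standard definition. It is not taken from GITCO or GIFT-Eval, and the function name and the `season` parameter are illustrative assumptions.

```python
def mase(train, actual, forecast, season=1):
    """Mean absolute scaled error.

    Scales the forecast MAE by the in-sample MAE of a (seasonal) naive
    forecast that repeats the value observed `season` steps earlier.
    Values below 1 mean the forecast beats that naive baseline.
    """
    if len(train) <= season:
        raise ValueError("train must be longer than the seasonal period")
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal length")

    # In-sample error of the naive forecast on the training series.
    naive_errors = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive forecast error is zero; MASE is undefined")

    # Out-of-sample mean absolute error of the evaluated forecast.
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale
```

Under this definition, a "1.95% MASE reduction" would mean the ratio `mase_after / mase_before` is about 0.9805, averaged over datasets.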
[AI-145] I Know What You Meme Even If it Emerged Today: Understanding Evolving Memes through Open-World Knowledge Acquisition
链接: https://arxiv.org/abs/2606.05316
作者: Shanhong Liu,Rui Cao,Pai Chet Ng,De Wen Soh
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal memes are dynamic and often require up-to-date background knowledge for interpretation. Existing methods often overlook such knowledge or rely on fixed parametric knowledge of pretrained models that may be incomplete, outdated, or unavailable for emerging memes. We introduce Query Retrieve Conclude, a zero-shot framework that identifies missing knowledge, retrieves open-web evidence, and synthesizes evidence-grounded background knowledge for meme understanding and detection. We also introduce a curated meme understanding benchmark of recent memes from 2024 to 2026 with external background knowledge annotations. Experiments on three meme understanding datasets and five meme detection tasks show that our framework improves knowledge recovery, meme understanding, and downstream detection over zero-shot baselines.
[AI-146] What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems
链接: https://arxiv.org/abs/2606.05304
作者: Chen Huang,Yuhao Wu,Wenxuan Zhang
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures
Abstract:Multi-agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However, this free-form communication can rapidly inflate token usage, consume the shared context window, and ultimately affect both system performance and inference cost. We analyze five common inter-agent communication strategies across two MAS topologies, finding that no fixed strategy is universally optimal. Instead, effective inter-agent messages consistently preserve action-centered information needed by downstream agents. Building on this, we propose PACT (Protocolized Action-state Communication and Transmission), which treats inter-agent communication as a public state-update problem and projects each raw agent output into a compact action-state record before it enters shared history. Across different MAS topologies, PACT consistently improves the performance-cost trade-off, achieving comparable or stronger task performance with substantially fewer tokens. The gains extend to production coding harnesses: PACT lifts OpenHands’ resolve rate while cutting tokens-per-resolved by 10%, and is resolve-neutral on SWE-agent while halving input tokens. Our code is publicly available.
[AI-147] Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents ICML2026
链接: https://arxiv.org/abs/2606.05296
作者: Dae Yon Hwang,Raunaq Suri,Valentin Villecroze,Anthony L. Caterini,Jesse C. Cresswell,Noël Vouitsis,Brendan Leigh Ross
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026
Abstract:LLM agents operate in two distinct regimes: open-weight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-of-the-art proprietary LLMs, API-only access precludes parameter-level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference. We propose Agentic Monte Carlo (AMC) to directly sample from the optimal policy of a black-box agent rather than training it through RL. The optimal policy is a posterior over trajectories whose prior we define as the fixed black-box LLM agent. We employ Sequential Monte Carlo to sample from this posterior by learning a value function to steer the agent while leaving the underlying black-box model unchanged. We validate AMC on three diverse environments from the AgentGym benchmark, demonstrating significant improvements over prompting baselines and even outperforming Group Relative Policy Optimization (GRPO) as we scale the test-time compute of our method. AMC demonstrates the feasibility of performing principled RL-style optimization of black-box LLM agents. Code is available.
[AI-148] Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents
链接: https://arxiv.org/abs/2606.05263
作者: Renwei Meng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures
Abstract:Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing process rewards are mostly correlational: they reward retrieval-, reflection-, or verification-like steps without estimating whether the step contributes to final verified success under a specified intervention. We propose CVT-RL, a constrained policy-gradient algorithm with dense verifiable rewards, intervention-validity gating, and a policy-conditioned counterfactual contribution (PCCC) estimator. Deletion, semantic substitution, evidence substitution, and tool-output perturbation define separate controlled interventions; continuations are sampled from a frozen reference policy, and a selection-adjusted doubly robust estimator augments the advantage. Belief control uses only prefix-observable labels, while an augmented Lagrangian constrains unsupported claims, skipped verification, tool tampering, and unsafe calls. On long-context QA, ALFWorld, ScienceWorld, and web/tool tasks, CVT-RL improves average task success from 71.8% for compute-matched non-causal RL and 75.4% for an information-matched counterfactual-process baseline to 78.9%, improves evidence F1 from 78.9 to 82.8 over the information-matched baseline, and reduces measured hacking from 7.2% to 3.9%. Independent human audit estimates 4.6% hacking for CVT-RL versus 8.1% for the information-matched baseline, and adaptive detector-evasion attacks raise hacking only to 7.1%. Stratified bootstrap and mixed-effects tests give p < 0.01 after Holm correction for all primary metrics. Carefully scoped counterfactual credit, paired with validity gating, diagnostics, and verifiable constraints, provides a reproducible route toward more reliable long-horizon RL for language agents.
[AI-149] X-Band UAV-enabled Integrated Sensing and Communications for Vehicular Networks
链接: https://arxiv.org/abs/2606.05262
作者: Remon Polus,Soumaya Cherkaoui
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:
Abstract:Uncrewed aerial vehicles (UAVs) are increasingly considered as aerial platforms capable of providing both sensing and communication services, representing a promising paradigm for intelligent transportation systems. This paper investigates the optimal time allocation for a UAV-enabled integrated sensing and communication (ISaC) system operating in the X-band for vehicular networks. We analyze the trade-off between sensing accuracy and communication performance under practical UAV constraints and fading effects, considering both single-shadowing and double-shadowing channel models. An optimization framework is developed to allocate time between sensing and communication while guaranteeing minimum communication rates and sufficient sensing reliability. Simulation results demonstrate adaptive time allocation strategies, highlighting how UAV-to-ground channel conditions and target distances influence the balance between sensing and communication in smart mobility scenarios.
[AI-150] How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment
链接: https://arxiv.org/abs/2606.05256
作者: Kokil Jaidka,Saifuddin Ahmed
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This study analyzes a publicly released dataset from a discontinued field experiment on Reddit’s r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI-generated accounts engaging users in live debate. After public disclosure, Reddit authorized moderators to release an archive of the AI-generated comments, creating a rare opportunity to examine how large language models operated in an identity-rich deliberative forum without disclosure. We conduct a structured content analysis of this corpus, evaluating identity performance, authority signaling, alignment strategies, and activation of cognitive heuristics. Identity targeting or adoption appears in over two-thirds of comments, alignment moves and authority claims in nearly all of them, and cognitive-bias triggers – particularly confirmation bias, representativeness, and availability – in the large majority. These patterns co-occur systematically, composing a rhetorical architecture calibrated for persuasive efficiency rather than authentic deliberative participation. Compared against human-authored CMV counter-arguments, the agents inverted the typical distribution on every dimension: denser authority use, more adversarial alignment, and heavier reliance on external citation over experiential grounding. In such environments, distinctions between authentic and synthetic epistemic standing grow increasingly opaque – an asymmetry that disclosure mandates alone cannot address. The results point toward auditing frameworks capable of assessing how AI systems structure credibility, not merely whether they are present.
[AI-151] From Attack Simulation to SIEM Rule: Deterministic Detection-as-Code Synthesis with Probe-Level Traceability
链接: https://arxiv.org/abs/2606.05252
作者: Alexandre Cristovão Maiorano
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 22 pages, 3 figures, 11 tables
Abstract:Security teams routinely simulate attacks against their own systems to check whether their monitoring would catch a real intruder. These Breach-and-Attack-Simulation (BAS) tools surface findings, but the security information and event management (SIEM) systems that watch production need detection rules – and today a human bridges that gap by hand, reading each finding and writing the corresponding Sigma rule (a vendor-neutral detection format). We show this translation can be partially automated when probes are drawn from a locked corpus, so each finding carries a stable identifier back to the originating probe. We describe a deterministic synthesis function that maps each finding to a starter Sigma rule through a small template library (N=23, indexed by categories from the OWASP LLM and Web Top 10), with a back-reference to the originating finding and its MITRE ATT&CK technique. On two locked corpora (17-probe LLM, 23-probe Web), every bypassed-probe finding yields a starter rule, and all 17/17 emitted rules parse and convert to Splunk and Elasticsearch backends. Replayed through a live OpenSearch SIEM, the LLM rules fire on 30% of a held-out AdvBench subset and 14% of HarmBench at 7.7% false positives on a benign baseline; the Web side is validated structurally, not against a held-out attack set. The contribution is a verifiable, byte-stable path from BAS finding to operator-deployable starter rule, re-derivable from the published corpus and template library alone – trading the breadth of LLM-generative methods for exact reproducibility and a typed traceback from any fired alert to the originating probe.
[AI-152] Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation
链接: https://arxiv.org/abs/2606.05241
作者: Yongjie Wang,Xinyue Zhang,Kunhong Yao,Zhiwei Zeng,Kaisong Song,Jun Lin,Zhiqi Shen
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Public benchmarks enable fair and reproducible evaluation of LLM reasoning, but they become fragile for deep research agents that actively search the web during inference. Such agents may retrieve public benchmark metadata, question context, or even ground-truth answers via web search. This gives rise to Search-Time Contamination (STC), where external retrieval bypasses intended reasoning and inflates measured performance. We systematically study STC in deep research agent evaluation. We define three contamination types with increasing severity, namely Benchmark Metadata Leakage, Question-Context Leakage, and Explicit Answer Leakage, and develop detection algorithms to identify them and quantify their impact on agent performance. Evaluating modern deep research agents on six public benchmarks, we find that STC is widespread and can inflate performance by up to 4%. Our findings show that existing evaluations may overestimate true reasoning ability. We therefore advocate contamination-aware practices, including isolated sandboxes, transparent search trajectories, and controlled benchmark access.
[AI-153] Differentiable Efficient Operator Search
链接: https://arxiv.org/abs/2606.05232
作者: Xiaohuan Pei,Jiyuan Zhang,Yuanfan Guo,Weiguo Feng,Tao Huang,Cho-Jui Hsieh,Chang Xu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Efficient multimodal foundation models often rely on manually designed token-reduction operators, such as pruning, merging, pooling, and adaptive reweighting. Although these operators appear different, we show that they can be interpreted as distinct regimes of a shared operator space. Based on this view, we introduce Efficient Operator Search, a differentiable framework that jointly searches where to reduce tokens, how many tokens to retain, and how reduced token information should be processed. The proposed search space parameterizes layer activation, retention budget, and operator behavior, while the search policy optimizes task performance under one-sided budget and cost constraints. This formulation recovers representative hand-designed baselines as special cases and further discovers hybrid operators beyond isolated manual designs. Experiments on multimodal benchmarks show that the searched operators achieve competitive accuracy-efficiency trade-offs, especially under aggressive visual-token reduction. These results suggest that efficient multimodal inference can be reframed from manual operator design to differentiable operator search.
[AI-154] Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway ICML2026
链接: https://arxiv.org/abs/2606.05219
作者: Hee-Sung Kim,Sungyoon Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026
Abstract:Recent analyses of multi-pathway Deep Linear Networks use Gradient Flow (GF) to predict a “winner-takes-all” specialization in which path symmetry breaks and each feature concentrates in a single pathway. In this work, we show that discrete Gradient Descent (GD) with a large step size tells a different story. We prove that single-path solutions are sharp minima, whereas distributing signals across pathways reduces sharpness by a factor that decreases with both the number of pathways and depth. Consequently, while early training reproduces the depth-driven symmetry breaking predicted by GF, oscillations at the Edge of Stability subsequently override this tendency and drive the network into a re-balancing phase, where signals redistribute across pathways. Together, these results clarify how depth shapes pathway competition and explain why large-step GD favors shared representations rather than persistent single-pathway dominance.
[AI-155] Assessing the Geographic Diversity of AI's Platial Representations in Image Generation
链接: https://arxiv.org/abs/2606.05188
作者: Zilong Liu,Krzysztof Janowicz,Mina Karimi
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Full conference paper accepted by AGILE 2026
Abstract:(Gen)AI diversity is not merely an ethical issue. From the perspective of geographic information science (GIScience), it could be interpreted as a function of uncertainty and as a form of cognitive bias, embedded in AI outputs. Recent work has sought to develop information-theoretic diversity measures and apply them to evaluate AI-chatbot outputs in a geographic context. As the AI ecosystem to which we are exposed on a daily basis becomes rapidly multimodal, we believe it is important to examine geographic diversity across various modalities. Focusing on images, this paper aims to fill this research gap. First, we select the GPT and DALL-E models as state-of-the-art examples and point out how assessing their geographic diversity involves various stages, including prompt revision and image generation. Then, taking inspiration from species diversity measures in ecological research, we incorporate similarity weighting into the measurement of geographic diversity. Next, we demonstrate how to evaluate geographic diversity in image generation through a case study. Our analysis reveals several counterintuitive findings. For instance, older models can exhibit greater geographic diversity despite producing lower-quality images, and prompt revision yields greater geographic diversity than image generation. At the same time, we observe explicit model homogeneity underlying the lack of geographic diversity, as the selected models consistently depict the same prototypical geo-specific feature or similar features. This is concerning, as it risks producing stereotypical representations of places.
[AI-156] Geographic Bias and Diversity in AI Evaluation
链接: https://arxiv.org/abs/2606.05187
作者: Zilong Liu,Krzysztof Janowicz,Gengchen Mai,Song Gao,Rui Zhu
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Book chapter accepted by “Geography According to ChatGPT”
Abstract:Among the many challenges hindering the responsible development and deployment of AI, arguably none has faced more intense scrutiny than bias in its various forms. This underscores the widespread concern among AI researchers that model outputs, e.g., from generative AI, may encode structural distributional imbalances (stemming from training data or model design) that may amplify social inequality or introduce systemic distortions across application domains ranging from biodiversity to disaster mitigation. Yet, relatively little work has investigated the geographical nature of bias or developed measurable benchmarks for what it means for (generative) AI to be unbiased. In this chapter, we investigate this issue through a literature review. As foundation models are reshaping the landscape of bias research, we examine work spanning both the pre-generative AI and generative AI periods. First, we identify a range of geographic biases. These biases span from representation bias in the training data and regional disparities in the factual recall of language models to the tendency of generative AI to disproportionately favor prototypical places (called defaults). Then, we showcase how recent studies address the latter bias by evaluating geographic diversity in the outputs of generative AI across various cognitive levels, parameter settings, and output modalities.
[AI-157] LatentWave: JEPA Pretraining for Wireless Foundation Models
链接: https://arxiv.org/abs/2606.06373
作者: Ahmed Mohamed,Ahmed Aboulfotouh,Hatem Abou-Zeid
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Wireless foundation models have emerged as a promising alternative to building separate models for each wireless task. However, existing approaches rely on masked input reconstruction, which can bias representations toward low-level signal details. In this paper, we propose LatentWave, a wireless foundation model pretrained using a Joint-Embedding Predictive Architecture (JEPA) on diverse wireless spectrograms and channel state information (CSI). By predicting masked regions in latent space, LatentWave learns representations that are more transferable out of the box across diverse downstream tasks. The proposed architecture employs per-channel patch embeddings with stochastic channel sampling during pretraining, allowing it to process variable antenna counts and improving usability across heterogeneous wireless configurations. We evaluate LatentWave on four downstream tasks: RF signal classification, 5G NR positioning, beam prediction, and LoS/NLoS classification, comparing against a masked-modeling baseline (WavesFM) pretrained on the same data. Additionally, we show that the masking geometry introduces a task-dependent inductive bias: frequency masking strongly favors channel-related tasks such as positioning and beam prediction, while region masking better preserves discriminability for signal classification.
[AI-158] Quantum enhanced rare event discovery and sampling
链接: https://arxiv.org/abs/2606.06316
作者: Naixu Guo,Po-Wei Huang,Qisheng Wang,Jayne Thompson,Patrick Rebentrost,Mile Gu,Chengran Yang
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注: 36 pages (8+28)
Abstract:Financial crashes, cascading failures in infrastructure, and critical errors in AI systems are frequently triggered by events that occur with extremely small probability. Efficiently discovering and sampling events with probability below a threshold is therefore of critical interest. Yet this task is highly non-trivial using existing classical or quantum methods. Being rare, such events require an immense sampling overhead to collect sufficient data samples. Moreover, because the rare events are not known in advance, they cannot be flagged for amplification using standard techniques. Here, we introduce a quantum algorithm for rare-event discovery and sampling without first learning which events are rare. The algorithm achieves the optimal quantum scaling with the rarity threshold. We further demonstrate that this can achieve a quadratic speedup for heavy-tailed systems whose tail has nonvanishing total mass, and translates into a robust polynomial speedup for stationary stochastic processes, with the exponent determined by its entropy-rate structure.
[AI-159] Benchmarks in Leipzig
链接: https://arxiv.org/abs/2606.05818
作者: Andrei Balakin,Miklós Bóna,Marie-Charlotte Brandenburg,Clara Briand,Veronica Calvo Cortes,Shelby Cox,Jesus A. De Loera,Danai Deligeorgaki,Hannah Friedman,Tim Gehrunger,Chiara Giardino,Stephen Griffeth,Baran Hashemi,Elena Hoster,Alexander Ivanov,Nupur Jain,Aryaman Jal,Leonie Kayser,Joris Koefler,Kevin Kühn,Mario Kummer,Felix Lotter,René Marczinzik,Victor S. Miller,Alejandro Morales,Greta Panova,Gianni Petrella,Nathan Pflueger,Lakshmi Ramesh,Nikolas Rieke,Carlos Rodriguez,Andrea Rosana,Flavio Salizzoni,Otto T.P. Schmidt,Sven Ulf Schmitz,Lina Maria Simbaqueba Marin,Luca Sodomaco,Christian Stump,Bernd Sturmfels,Alexander Taveira Blomenhofer,Simon Telen,Philipp Tuchel,Emil Verkama,Carl Felix Waller,Julian Weigert,Annette Werner,Nathan Williams,Claudius Zibrowius
类目: History and Overview (math.HO); Artificial Intelligence (cs.AI); Algebraic Geometry (math.AG); Combinatorics (math.CO); Representation Theory (math.RT)
备注: 8 pages including 8 benchmark statistics tables + 20 pages appendix containing the 100 Leipzig Benchmark questions
Abstract:Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers. Most of the work was done during the 3-day workshop Benchmarks in Leipzig with 35 participants at the Max Planck Institute for Mathematics in the Sciences in Leipzig, Germany. We present the resulting collection of 100 questions. We evaluated these questions in three stages: a single attempt by five state-of-the-art LLMs, followed by a 20-runs-per-model evaluation with three of these models, and finally a 3-run attempt with two heavy-thinking models. After Stage 1, 41 questions remained completely unsolved; after Stage 2, this count dropped to 16; and we concluded Stage 3 with only 2 unsolved questions. This demonstrates that the mathematical reasoning capabilities of LLMs are becoming impressive.
[AI-160] Conformal Risk-Averse Decision Making with Action Conditional Guarantee
链接: https://arxiv.org/abs/2606.05551
作者: Zihan Zhu,Shayan Kiyani,George Pappas,Hamed Hassani
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reliable decision making pipelines powered by machine learning models require uncertainty quantification (UQ) methods that come with explicit safety guarantees. Conformal prediction provides such UQ by wrapping ML predictions into prediction sets, and recent work by Kiyani et al. (2025b) established that these sets can be translated into optimal risk-averse decision policies – yet only inheriting marginal safety guarantees. We generalize and strengthen their results by (i) introducing action-conditional conformal prediction, which yields safety guarantees conditioned explicitly on each action taken by the decision maker, (ii) showing that action-conditional prediction sets serve as a proxy for the feasible decision space for risk-averse decision makers aiming to optimize action-conditional value-at-risk, and (iii) proposing a principled finite-sample algorithm based on pinball-loss minimization, connecting the framework of Gibbs et al. (2025) to action-conditional guarantees. Experiments on two real-world datasets confirm that our approach significantly improves action-conditional performance over conformal baselines.
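Item (iii) of the abstract builds on pinball-loss minimization. As background only, the sketch below shows the standard pinball (quantile) loss and checks numerically that the constant minimizing its average over a sample is an empirical quantile. It does not reproduce the paper's action-conditional algorithm, and the function names and the candidate-search shortcut are illustrative assumptions.

```python
def pinball_loss(y, q, tau):
    """Pinball (quantile) loss of predicting q for outcome y at level tau.

    Under-prediction (y > q) is weighted by tau and over-prediction by
    1 - tau, so the minimizer of the expected loss is the tau-quantile.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def best_constant_quantile(ys, tau):
    """Return the sample value with the smallest average pinball loss.

    Searching only over observed values suffices because the empirical
    loss is piecewise linear with kinks at the data points.
    """
    if not ys:
        raise ValueError("ys must be non-empty")

    def average_loss(q):
        return sum(pinball_loss(y, q, tau) for y in ys) / len(ys)

    return min(sorted(set(ys)), key=average_loss)
```

With `ys = [1, 2, 3, 4, 5]` and `tau = 0.75`, the search picks 4, the value at or below which three quarters of the sample mass first accumulates.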
[AI-161] Can AI Refute Economic Theory? Evidence from Beyond the Knowledge Cutoff
链接: https://arxiv.org/abs/2606.05383
作者: Alexis Akira Toda
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
备注:
Abstract:Can artificial intelligence (AI) refute economic theory? I document experiments in which I asked several AI models (Gemini, Refine, Claude, and ChatGPT) to check the correctness of four published papers in economic theory, each containing an error that I helped identify or correct. ChatGPT Pro performed best, occasionally constructing counterexamples and corrected proofs, while other models fared worse. However, no model located a true error without substantial human guidance, and data contamination complicates interpretation. I argue that a competent human paired with a frontier model can outperform current peer review, but AI cannot yet refute economic theory on its own.
[AI-162] Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network
链接: https://arxiv.org/abs/2606.05326
作者: Antonin Chodron de Courcel
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph); Analysis of PDEs (math.AP)
备注: Comments are welcome!
Abstract:We study the dynamics of gradient descent in the Edge of Stability regime, where the learning rate is large enough to induce persistent oscillations in the loss and the sharpness. We propose a continuous-time effective model that tracks the evolution of the average trajectory coupled with the time-averaged covariance of its fast oscillations. Our analysis reveals that the natural quantity to monitor in such unstable regimes is an effective free energy, which combines the original risk functional with a curvature-related “entropic” term. Our model allows us to track the envelope of the oscillations even in situations where its dynamics evolve on similar timescales as the averaged weights. Otherwise stated, we can track the spikes that occur during the training of some neural network architectures. For wide two-layer neural networks optimized under stable non-vanishing oscillations, we derive a mean-field limit that results in a novel kinetic equation describing the joint distribution of weights and their fluctuations. We show that this equation can be interpreted as a Wasserstein-2 gradient flow of a macroscopic free energy. Finally, we provide numerical evidence on matrix factorization and deep learning tasks (CIFAR-10) to demonstrate the model’s accuracy in capturing the envelope of the oscillations and the predictive power of the effective free energy.
[AI-163] The Score Hamiltonian: Mapping Diffusion Models to Adiabatic Transport
链接: https://arxiv.org/abs/2606.05217
作者: Peter Halmos,Boris Hanin
类目: Mathematical Physics (math-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
备注:
Abstract:We exhibit an exact correspondence between sampling with score-based diffusion models and adiabatic transport of ground states for a family of Schrödinger operators we call Score Hamiltonians, built from the learned score’s quantum potential. We obtain novel density reconstruction bounds and principled annealing schedules via adiabatic theorems for Fokker-Planck equations with time-varying potentials. We find the fundamental limit of sampling is set by the ratio of squared score-matching error to Score Hamiltonian spectral gap - the inverse Poincaré constant of the data density.
[AI-164] Ontology-constrained multi-LLM scoring of hypothesis support in the predictive processing literature
链接: https://arxiv.org/abs/2606.05206
作者: Hamed Nejat,Alexander Maier,Jesse Spencer-Smith,André M. Bastos
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: 33 pages, 5 tables and 9 figures
Abstract:Fragmentation is common in interdisciplinary fields with diverse methods and theoretical commitments. Predictive coding neuroscience is a clear example: its literature spans computational theory, electrophysiology, imaging, behavior, and modeling, creating a synthesis problem that conventional meta-analysis cannot easily resolve. Here, we describe a local multi-LLM pipeline for ontology-constrained literature synthesis. The pipeline reads papers, extracts evidence, incorporates figure descriptions, assembles constrained prompts, and validates outputs against an expert glossary. We manually defined a predictive-coding glossary of thirty-six concepts grouped into three hypotheses: predictive suppression, feedforward error propagation, and ubiquity. A council of ten local language models scored 31 studies according to their agreement or disagreement with each glossary factor across local and global oddball contexts. This enabled pairwise study-agreement analysis, cross-model comparison, and three-dimensional hypothesis-space mapping. Agreement was high for some hypotheses but weaker for others, revealing structured disagreement, particularly across local versus global oddball paradigms. We further define hypothesis-space temperature, a geometric dispersion metric measuring how compactly studies occupy the hypothesis space. Temperature was lower for local oddball contexts and higher for global oddball contexts, indicating greater dispersion in the latter. The scoring geometry also allowed us to estimate vectors of change between experimental contexts. These results demonstrate that local multi-LLM councils can produce auditable disagreement measurements that map heterogeneous literatures into quantitative evidence spaces. This framework may generalize to cross-study hypothesis mapping where conventional meta-analysis lacks a common comparison space.
[AI-165] Finite Element-Based Material Learning via Automatic Differentiation: Learning constitutive neural network models from full-field deformation data
Link: https://arxiv.org/abs/2606.05199
Authors: Matthias Knipper,Chenyi Ji,Malte Brand,Kevin Linka
Categories: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:The identification of constitutive neural network models from heterogeneous full-field deformation data provides a robust alternative to traditional calibration methods based on homogeneous stress-strain experiments, particularly given the high dimensionality of trainable parameters. Existing approaches must balance generality, robustness, and computational efficiency: conventional finite element model updating is broadly applicable but computationally demanding; weak-form methods offer efficiency but are sensitive to noise and data scarcity; neural operator models are highly expressive but require extensive training datasets. This work presents FE-MAD (Finite Element-Based Material learning via Automatic Differentiation), an end-to-end differentiable framework that integrates a constitutive neural network model within a JAX-FEM nonlinear solver and identifies its parameters through gradient-based minimization of a measurement-mismatch loss. Newton tangent stiffness and loss gradients are computed automatically using forward- and reverse-mode automatic differentiation throughout the entire pipeline, thereby removing the need for analytic adjoints or offline surrogate models. FE-MAD is demonstrated for two architectures: a grey-box Constitutive Artificial Neural Network (CANN), a polyconvex, fully connected model with high flexibility, and a white-box CANN, an expert-system network with phenomenologically interpretable strain-energy terms. Focusing on incompressible isotropic hyperelasticity, FE-MAD is evaluated on three open experimental datasets: (1) full digital image correlation (DIC) of a perforated tensile specimen, (2) a reduced-data scenario with a one-dimensional stretch profile and global force-displacement curve, and (3) a heterogeneous matrix-inclusion system in which both phases' constitutive laws are identified and generalized to twenty-two previously unseen samples.
Machine Learning
[LG-0] TailLoR: Protecting Principal Components in Parameter-Efficient Continual Learning
Link: https://arxiv.org/abs/2606.06494
Authors: Marius Dragoi,Ioana Pintilie,Alexandra Dragomir,Antonio Barbalau,Florin Brad
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Parameter-efficient finetuning methods based on spectral decomposition have enabled progress in Continual Learning. In this paper we introduce TailLoR, which utilizes the singular bases U and V of the pre-trained weights as a fixed reference frame to learn a low-rank update applied to the singular value matrix. A soft spectral penalty discourages updates aligned with dominant singular directions, reducing interference while routing fine-grained adaptation into the highly flexible, long-tail spectral coordinates.
[LG-1] DNQ: Deep Nash Q-Network for Partially Observable n-Player Games
Link: https://arxiv.org/abs/2606.06480
Authors: Qintong Xie,Edward Koh,Xavier Cadet,Peter Chin
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:
Abstract:Many real-world competitive systems require multiple decision-makers to act simultaneously under shared constraints, limited information, and repeated interaction, as in auctions, resource allocation, and security competition. We study multi-turn simultaneous bidding as a controlled testbed for such problems and propose DNQ, a solver-in-the-loop equilibrium supervision framework for training bidding agents. DNQ alternates between trajectory collection, critic-based payoff estimation, equilibrium computation, and policy imitation. At each visited state, a shared critic predicts either pairwise payoff matrices or an exact N-player payoff tensor, an external solver computes equilibrium strategies, and the agents are trained by minimizing the KL divergence between their masked policies and the solver-derived equilibrium targets. We focus on a scalable pairwise formulation that greatly reduces equilibrium-solving cost and training time compared with the exact formulation, while the shared critic amortizes payoff learning across agents and states. Experiments compare the pairwise and exact variants using critic loss, policy entropy, bidding resource usage, and training cost, showing that the pairwise method scales to larger numbers of agents, whereas the exact method becomes computationally impractical as the joint game grows. These results illustrate the trade-off between strategic fidelity and scalability in repeated competitive environments.
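The imitation step described above (matching a masked policy to a solver-derived equilibrium target via KL divergence) can be illustrated with a minimal sketch. The function names, the direction of the divergence (target relative to policy), and the example numbers are assumptions made for illustration; the abstract does not specify them.

```python
import math


def masked_softmax(logits, mask):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    legal = [l for l, m in zip(logits, mask) if m]
    if not legal:
        raise ValueError("at least one action must be legal")
    peak = max(legal)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, policy, eps=1e-12):
    """KL(target || policy); actions the target never plays contribute nothing."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, policy)
        if t > 0.0
    )


# Hypothetical example: three bids, the third is unaffordable and masked out.
policy = masked_softmax([0.0, 0.0, 5.0], [True, True, False])
equilibrium_target = [1.0, 0.0, 0.0]  # stand-in for an external solver's output
loss = kl_divergence(equilibrium_target, policy)
```

In a training loop this loss would be minimized with respect to the policy's logits; the sketch only evaluates it.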
[LG-2] Event Detection for Parameter-to-KPI Dependency Learning for AI-RAN
Link: https://arxiv.org/abs/2606.06459
Authors: Christie Djidjev,Nicholas Kaminski
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Next-generation wireless networks are expected to rely on multiple concurrent AI-driven control functions that optimize different network objectives simultaneously, particularly in AI-integrated and open radio access network architectures such as AI Radio Access Network (AI-RAN) and Open Radio Access Network (O-RAN). When these functions interact, they can interfere with one another in ways that are difficult to detect from raw network data alone. A key missing piece for managing such interactions is a reliable, interpretable dependency structure that captures which control parameters are actively influencing which network performance outcomes at any given time. This paper focuses on the event-detection step needed to support such dependency learning by converting noisy continuous telemetry into binary indicators of parameter activity and KPI response. The central difficulty is that not every fluctuation in the data reflects a genuine control interaction, so the method must distinguish real parameter-outcome relationships from background variation. Because real AI-RAN traffic traces with known parameter-KPI ground truth are difficult to obtain, we introduce a synthetic closed-loop traffic generator with planted latent dependencies. We use this controlled telemetry to evaluate a machine-learning-based dependency recovery pipeline that formulates the conversion of continuous traces into binary event indicators as a significance-detection problem. Experimental evaluation shows that the proposed pipeline reliably recovers the latent dependency structure from noisy continuous traces when the signal is sufficiently separated from background variation, while highlighting threshold calibration as the key factor controlling event-detection quality. These results constitute a foundational step toward interpretable dependency learning for adaptive AI-RAN control systems.
[LG-3] Causal Atlases from Entropic Inference: Bayesian Networks beyond Optimal DAGs
Link: https://arxiv.org/abs/2606.06440
Authors: Hazhir Aliahmadi,Irina Babayan,Greg van Anders
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 pages, 2 figures
Abstract:Data-driven causal relationship identification is pertinent to advancing understanding of complex systems both within and beyond science. Bayesian networks offer a probabilistic method for modelling generic causal relationships via directed acyclic graphs (DAGs). However, typical techniques for constructing Bayesian networks rely on optimization, which can be ill-suited for learning causal relationships because the underlying data may admit multiple chains of causation. More data-faithful representations of causal relationships would provide frameworks for constructing multiple causal maps that are consistent with the variability that is inherent in underlying data. Here, we show that entropy-based inference generates atlases of plausible causal relationships that are consistent with underlying data. On simulated noisy data of 2- and 20-node linear structural equation models, we sample a maximum-entropy ensemble of graphs that allows us to quantify the inherent structural ambiguity in underlying causal relationships. Our method shows that “optimized” DAGs can contain causal artifacts that are not consistent across equivalently accurate topologies.
[LG-4] The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning
Link: https://arxiv.org/abs/2606.06397
Authors: Shuo Wang,Xiangyu Wang,Quanxin Wang,Bailin Wu,Bokui Wang,Shunyang Huang,Boyan Deng,Haonan Liu,Ruiyi Fang,Zhenxiang Xu,Boyu Wang,Zhao Kang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Current evaluation practices in relational learning rely heavily on flat leaderboards that average performance across heterogeneous datasets, implicitly assuming a uniform underlying structure. We show that this assumption introduces systematic bias: it obscures geometry-dependent performance variations and can lead to misleading conclusions about model generalization. In this work, we identify intrinsic geometry as a key latent factor governing model effectiveness. We demonstrate that conventional aggregated metrics mask critical performance trade-offs that only become visible when datasets are stratified by their geometric properties. To address this issue, we introduce a curvature-stratified evaluation framework that partitions datasets into positive, negative, and near-zero curvature regimes. Our benchmark evaluates 18 representative models including Graph Convolutional Networks (GCNs), Graph Foundation Models (GFMs), and tabular learning methods across 14 datasets. We find that model rankings are highly stable within each curvature regime but shift significantly across regimes, indicating that performance is fundamentally geometry-dependent rather than universally transferable. Notably, we identify regimes where GFMs offer diminishing returns compared to geometry-aligned GNNs. Based on these findings, we propose a geometry-aware evaluation protocol that yields more reliable and interpretable comparisons than standard aggregated benchmarks. We release all code, curvature-stratified dataset splits, and evaluation tools to support reproducible and rigorous assessment of future relational learning methods. Code and datasets are provided in our project homepage: this https URL.
[LG-5] Proper Scoring Rules for Right-Censored Survival Data
Link: https://arxiv.org/abs/2606.06393
Authors: Jef Jonkers,Glenn Van Wallendael,Luc Duchateau,Sofie Van Hoecke
Categories: Machine Learning (cs.LG)
Comments: 27 pages
Abstract:Proper scoring rules provide a rigorous theoretical basis for the training and evaluation of probabilistic forecasts. However, in the presence of right censoring, the event time is only partially observed, rendering conventional scoring rules inapplicable in their standard form. We propose a framework for proper scoring of right-censored survival outcomes based on a simple idea: first, map the predictive distribution through the censoring mechanism, then apply the underlying proper score on the induced observed-data law. This yields localized scores for fixed censoring times and marginalized scores when the censoring time is random or only partially observed. The resulting construction recovers familiar right-censored likelihood and IPCW-type criteria within a coherent framework, while also yielding right-censored versions of the CRPS, pinball loss, Brier score, and energy score. We show that the marginalized score is proper under conditional independent censoring and strictly proper on the identifiable region. The same principle also leads to censored engression, a sample-based learning objective for multivariate right-censored survival modeling. In experiments, our scores correctly rank the oracle forecast across several censoring regimes, whereas forecast-dependent plug-in weighted scores can exhibit ranking reversals. Censored engression likewise substantially improves over naive training on censored outcomes.
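The abstract notes that the construction recovers the familiar right-censored likelihood. As a reminder of what that criterion looks like, here is a minimal sketch for a single exponential forecast: an observed event contributes the log density, a censored one contributes the log survival probability. The exponential forecast, the function names, and the toy data are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def censored_log_score(rate, time, event_observed):
    """Negative right-censored log-likelihood under an exponential forecast.

    Lower is better. log S(t) = -rate * t and log f(t) = log(rate) - rate * t.
    """
    if rate <= 0.0 or time < 0.0:
        raise ValueError("rate must be positive and time non-negative")
    log_survival = -rate * time
    if event_observed:
        return -(math.log(rate) + log_survival)  # event seen: use the density
    return -log_survival  # censored: only know the event happens after `time`


def mean_score(rate, data):
    """Average score over (time, event_observed) pairs."""
    return sum(censored_log_score(rate, t, d) for t, d in data) / len(data)


# Toy data: two observed events and one censored observation.
data = [(1.0, True), (2.0, False), (3.0, True)]
# For the exponential model the score is minimized at events / total time.
best_rate = 2.0 / 6.0
```

Evaluating `mean_score` at `best_rate` and at nearby rates shows the minimum sits at the maximum-likelihood rate, which is the behaviour one expects from a proper score within this model family.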
[LG-6] Learned Response-Field Inertia Operator for HEC-RAS 2D Water-Surface Elevation Prediction
Link: https://arxiv.org/abs/2606.06385
Authors: Edward Holmberg,Elias Ioup,Md Meftahul Ferdaus,Mahdi Abdelguerfi,Julian Simeonov
Categories: Machine Learning (cs.LG)
Comments: Preprint manuscript prepared using IEEEtran journal format
Abstract:This article presents a cross-dataset evaluation of learned native-cell surrogate models for solver-consistent water-surface elevation (WSE) prediction in HEC-RAS 2D. To avoid raster remapping error and information-access confounding, surrogates are evaluated directly on the original nonuniform computational cells under an explicit policy that separates static project inputs, current hydraulic state, project-input forcing, calibration-derived quantities, and future solver-output targets. We introduce the Learned Response-Field Inertia Operator (LRFIO), a no-forcing, increment-based learned surrogate that calibrates an inertial response operator from solved HEC-RAS trajectories and deploys the retained operator through closed-form native-cell rollout. LRFIO evaluates a base-case-first response hierarchy consisting of persistence, global calibrated inertia, and segmented response-field inertia. Segmentation, residual correction, and neuralized inertia are treated as learnable modeling choices, with added complexity retained only when validation evidence justifies its cost. Evaluated across four diverse HEC-RAS 2D benchmarks, LRFIO retains different response structures for different domains, demonstrating adaptive learned complexity. The selector audit shows controlled complexity with a maximum validation regret of 4.30%. During deployment, retained rollout times range from 0.003 s to 0.242 s, and the Beaver Bayou measured-solve comparison gives an estimated $2.75 \times 10^4$ horizon-normalized speedup over HEC-RAS. These results indicate that the current native-cell increment is a strong solver-conditioned predictive scaffold and that added response-field, neural, or spatial complexity should be retained only when empirically justified.
[LG-7] End-to-End Subgraph Detection with GraphDETR
Link: https://arxiv.org/abs/2606.06364
Authors: Dexiong Chen,Till Hendrik Schulz,Karsten Borgwardt
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Subgraph detection seeks to identify whether and where instances of query patterns occur within a larger graph. This problem is fundamental across scientific domains and is closely related to subgraph isomorphism, which is NP-complete, limiting combinatorial approaches to small patterns or moderately sized graphs. We introduce GraphDETR, a deep learning framework that formulates subgraph detection as a set prediction problem, analogous to DETR in object detection. GraphDETR encodes the target graph with a graph neural network, and employs a fixed set of learnable query vectors, decoded via a transformer decoder, to predict all pattern occurrences jointly in a single forward pass. This is enabled by training the model end-to-end with bipartite matching. Unlike traditional combinatorial methods that only solve exact structural matching, GraphDETR naturally extends to approximate matching, enabling detection beyond exact pattern correspondence. Empirically, we show that GraphDETR can detect diverse patterns, such as molecular structures, cycles, cliques, and fuzzy patterns of up to 50 nodes, in target graphs with up to 1000 nodes. We further evaluate on molecular functional group detection over the ChEMBL dataset, where GraphDETR predicts the complete set of functional groups per molecule, achieving a strong performance of $\text{AP}_{100} = 91.2$.
[LG-8] Maximising the Set-Piece Return: Optimising Football Corner Tactics with Graph Reinforcement Learning
Link: https://arxiv.org/abs/2606.06353
Authors: Sean Groom,Michael Groom,Francisco Belo,Axl Rice,Liam Anderson,Victor-Alexandru Darvariu,Shuo Wang
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 4 figures
Abstract:Machine learning is increasingly employed for the evaluation of football tactics. However, existing approaches focus on characterising historical actions or analyst-specified counterfactual scenarios. In this work, we seek to go beyond the imitation of historically observed patterns towards discovering new generalisable player configurations and strategies. To tackle this, we focus on optimising corner kick routines, and formulate a decision-making problem in which a central policy makes adjustments to attacking player positions and velocities to maximise first contact shot probability. Unlike classic optimisation that solves for isolated setups, we contribute a reinforcement learning architecture operating on graph-structured data that yields a general policy for adjusting arbitrary starting player positions. Evaluated on over 3,000 Premier League corners, our approach strongly outperforms baseline optimisation techniques under matched inference budgets. Our results suggest that graph reinforcement learning can shift set-piece analysis from historical evaluation and imitation towards reward-driven tactical discovery.
[LG-9] Performance Evaluation of GraphCast for Medium-Range Weather Forecasting over Brazil
Link: https://arxiv.org/abs/2606.06348
Authors: Wolfgang R. Rowell Jr.,Lucas S. Kupssinskü
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The paradigm of global weather forecasting is rapidly shifting with the emergence of Machine Learning Weather Prediction models (MLWP). While these data-driven architectures demonstrate remarkable global skill, regional benchmarks in the Global South remain scarce, leaving their efficacy in complex, highly convective environments largely unverified. This study evaluates the performance of GraphCast operational against the deterministic ECMWF IFS HRES as baseline across four distinct Brazilian climatic sub-regions. Utilizing a scalable, cloud-native pipeline and the WeatherBench-X framework for benchmarking weather models, we assess selected tropospheric variables ($T_{850}$, $Q_{850}$, $Z_{500}$) over four selected seasonal windows, employing the operational IFS analysis as the ground truth to calculate the statistical metrics for both models. Results reveal a regime-dependent skill profile. During the austral winter, GraphCast underperforms in the medium range (lead days 2-7) for $Z_{500}$ when resolving fast-propagating baroclinic systems over southern Brazil, but regains an advantage in the extended range, where its inherent smoothing of chaotic small-scale variability becomes beneficial under deterministic skill metrics. Conversely, during the austral summer wet season, GraphCast accurately captures large-scale moisture transport while intrinsically dampening the high-frequency convective variability that degrades deterministic NWP temperature forecasts. These findings establish a baseline for Brazil and define the specific physical boundaries that will guide future “tropicalization” efforts, aiming to optimize these foundational AI models for regional resilience.
[LG-10] Attack Detection using Time Series Foundation Models
Link: https://arxiv.org/abs/2606.06347
Authors: Sribalaji C. Anand,Anh Tung Nguyen,George J. Pappas
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: Under review
Abstract:This paper addresses the problem of attack detection in cyber-physical systems without any knowledge of the plant model or its structure. A remotely located plant transmits sensor measurements to an operator over a network that is assumed to be under attack. We consider two classes of attacks: model-free replay attacks and model-based stealthy attacks. For the latter, we derive closed-form expressions for the optimal stealthy attack policy against a $\chi^2$ detector, for both linear and nonlinear systems. We then propose a model-structure-free detector based on TimesFM, a time-series foundation model developed by Google Research, which serves as a surrogate residual generator operating in a zero-shot fashion. We show empirically that the TimesFM-based detector achieves a comparable or superior attack detection performance. The efficacy of the proposed approach is demonstrated numerically on the IEEE 14-bus power system. We also demonstrate that TimesFM predictions can serve as a substitute for corrupted measurements, a practical mitigation technique when classical redundancy assumptions fail.
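For readers unfamiliar with the chi-squared detector mentioned above, the sketch below shows the standard residual test: form the residual between a measurement and its prediction, normalize by the residual covariance, and raise an alarm when the statistic exceeds a threshold. The diagonal-covariance simplification, the function names, and the numbers are assumptions for illustration. In the paper the prediction would come from a forecaster such as TimesFM, which is deliberately left as a plain input here.

```python
def chi2_statistic(residual, variances):
    """Quadratic form r^T Sigma^{-1} r for a diagonal covariance Sigma."""
    if len(residual) != len(variances):
        raise ValueError("residual and variances must have the same length")
    return sum(r * r / v for r, v in zip(residual, variances))


def detect_attack(measurement, prediction, variances, threshold):
    """Return True when the normalized residual energy exceeds the threshold."""
    residual = [y - y_hat for y, y_hat in zip(measurement, prediction)]
    return chi2_statistic(residual, variances) > threshold


# 5.991 is the 95% quantile of a chi-squared distribution with 2 degrees of
# freedom, so about 5% of attack-free samples would trigger a false alarm.
THRESHOLD_2DOF_95 = 5.991

nominal = detect_attack([1.0, 2.0], [1.1, 1.9], [1.0, 1.0], THRESHOLD_2DOF_95)
tampered = detect_attack([4.0, 2.0], [1.0, 2.0], [1.0, 1.0], THRESHOLD_2DOF_95)
```

A stealthy attack in this setting is one that keeps the statistic below the threshold while still biasing the measurements, which is why the choice of residual generator matters.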
[LG-11] Equivariant Neural Belief Propagation
Link: https://arxiv.org/abs/2606.06344
Authors: Zehua Cheng,Wei Dai,Jiahao Sun
Categories: Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: 18 pages
Abstract:Probabilistic inference over spatially embedded variables requires beliefs that respect $SE(3)$ symmetry, yet existing equivariant networks produce only scalars and vectors – not the rank-2 precision tensors needed for anisotropic uncertainty, and single-component messages collapse multi-modal energy landscapes to physically meaningless averages. We introduce Equivariant Neural Belief Propagation (ENBP), a factor-graph framework whose messages are equivariant Gaussian mixture models with sufficient statistics that transform exactly under $SE(3)$. Rank-2 precision matrices are synthesised via equivariant outer products, ingested through differentiable spectral decomposition, and kept tractable by a greedy KL-based mixture reduction that provably commutes with $SE(3)$. On GEOM-QM9 and GEOM-Drugs, ENBP achieves 98.9% conformational coverage at 0.090 Å error with sub-second latency – over $100\times$ faster than diffusion baselines at higher accuracy. On multi-body robotic inference, vanilla loopy BP diverges at 15+ agents while ENBP converges with near-zero collision rates and machine-precision equivariance error ($\sim 10^{-7}$ vs. $10^{-1}$ for augmented baselines).
[LG-12] Quantifying the Privacy of Counterfactuals by Leveraging Membership Inference Attacks Against Synthetic Data
Link: https://arxiv.org/abs/2606.06334
Authors: Maryam Babaei,Yingke Wang,Hadrien Lautraite,Heber H. Arcolezi,Ulrich Aivodji,Sebastien Gambs
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Counterfactuals are typically used in high-stakes decision areas to explain a machine learning model by showing how changes to the user profiles result in the desired outcome. However, explaining the model’s decisions through counterfactuals can also be exploited by an adversary to conduct privacy attacks against the model or its training data. Drawing on the analogy that counterfactuals provide realistic substitutes for real training data, similar to synthetic data, we demonstrate in this paper how it is possible to successfully perform privacy attacks on counterfactuals by drawing on the attacks developed against synthetic data. More precisely, we investigate the effectiveness of the membership inference attacks designed for synthetic data on various types of counterfactuals. Additionally, while existing membership inference attacks against counterfactuals usually require query access to the model, we show how it is possible to perform successful membership inference attacks using only a set of counterfactuals, with no access to the model from which they are generated. Our results demonstrate that model developers should be more cautious when releasing counterfactuals to various users, as doing so can lead to a privacy breach.
[LG-13] DAS-PINNs for high-dimensional partial differential equations: extending deep adaptive sampling to spacetime domains
Link: https://arxiv.org/abs/2606.06314
Authors: Anshima Singh,David J. Silvester
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Time-dependent high-dimensional partial differential equations (PDEs) with spatially localised and dynamically evolving solutions pose a fundamental challenge for physics-informed neural networks (PINNs), as uniform collocation sampling becomes increasingly ineffective in high-dimensional spatiotemporal domains. In this work, a deep adaptive sampling framework for PINNs is extended to the time-dependent setting by treating space and time as a unified domain without any explicit time marching. A normalising flow neural network model effectively learns the distribution induced by the PDE residual and generates new collocation points concentrated in regions where the solution is most difficult to learn. Unlike conventional adaptive strategies that require explicit time stepping or moving meshes, high-residual regions are automatically identified and tracked across both space and time, driven purely by the PDE residual distribution. The effectiveness of the proposed strategy is assessed on a range of benchmark problems, from sharp and moving features in two spatial dimensions to localised structures in up to eight spatial dimensions.
[LG-14] Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving
Link: https://arxiv.org/abs/2606.06302
Authors: Hyungmin Kim,Minsoo Kim,Hongseok Kim,Jungwook Choi
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 12 pages, 14 figures
Abstract:Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering the individual importance of each KV cache. However, such KV cache heterogeneity introduces various systemic challenges - including memory fragmentation, scheduling complexities, and diminished kernel utilization - which collectively lead to significant inefficiencies in existing LLM serving systems. To overcome these challenges, we present Tangram, a novel serving system designed to make Non-uniform KV caches practical. Tangram addresses systemic inefficiencies through three core techniques: (1) Deterministic Budget Allocation assigns a static memory footprint to each head based on its intrinsic pattern, entirely eliminating dynamic scheduling overhead and prefill stalls; (2) Head Group Page clusters attention heads with similar retention demands and manages them with independent, vectorized page tables, thereby maximizing physical memory reclamation; and (3) Ahead-of-Time (AOT) Load Balancing leverages static budget profiles to ensure uniform GPU utilization without runtime overhead. Experimental results show that Tangram improves throughput by up to 2.6x compared to existing baselines, while fully preserving model accuracy. Our implementation is publicly available at this https URL.
[LG-15] Reactive Flux Matching: Mechanism Discovery and Adaptive Sampling of Rare Events NEURIPS2026
Link: https://arxiv.org/abs/2606.06295
Authors: Rishal Aggarwal,David Ryan Koes,Nicholas M. Boffi,Eric Vanden-Eijnden
Categories: Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph)
Comments: 21 pages, 7 figures, submitted to NeurIPS 2026
Abstract:Path sampling methods generate ensembles of reactive trajectories connecting metastable states, but extracting mechanistic insight from these data remains nontrivial. We introduce Flux Matching, a framework that learns two complementary objects directly from reactive trajectory data: a current velocity $u(z)$, whose streamlines trace the dominant reaction pathways, and a scalar potential $h(z)$, obtained from a weighted Helmholtz-Hodge decomposition of the reactive current, that serves as a data-driven reaction coordinate. Both minimize quadratic functionals over the reactive path ensemble, analogous to the flow matching loss in generative modeling, and require no knowledge of the underlying dynamics or stationary distribution. Unlike committor-based methods, $u$ and $h$ remain well-defined under projection onto non-Markovian collective variables, and their level sets in turn provide adaptive interfaces for improved sampling with enhanced sampling methods. Flux Matching is validated through the generation of current velocity trajectories and rate constant calculations on molecular systems.
[LG-16] PAC-Bayesian Adversarially Robust Generalization for Message Passing Graph Neural Networks: A Sensitivity Analysis
Link: https://arxiv.org/abs/2606.06293
Authors: Ziling Liang,Xinping Yi,Qingsong Wen,Shi Jin
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Whilst the vulnerability of graph neural networks (GNNs) to adversarial attacks poses a critical threat to graph representation learning, the understanding of the robust generalization behavior remains a fundamental challenge in the adversarial setting. Recently, PAC-Bayesian margin-based generalization analysis has substantially advanced this line of research by providing a flexible and data-dependent analytical framework. However, existing robust analyses often rely on isotropic Gaussian posteriors and control weight perturbations in the full parameter space, which limits the ability to capture heterogeneous parameter sensitivity and hinges on hidden-width-dependent complexity terms, resulting in generalization bounds that are not tight enough. In this paper, we extend a recently proposed sensitivity-aware PAC-Bayesian framework from deep neural networks to message passing GNNs (MPGNNs) and derive a tighter robust generalization bound in the adversarial setting. Specifically, we first quantify how sensitive the network outputs are to perturbations across different parameter blocks by deriving the output Jacobians with respect to the weight parameters. Exploiting the fact that these Jacobian matrices have rank at most $K$ in $K$-class graph classification, we then construct Jacobian-aligned sensitivity matrices and use anisotropic Gaussian posteriors with optimized covariances to tightly upper bound the KL divergence. Notably, by refining the spectral-norm dependence on the learned weights and reducing the leading dimension factor from hidden-width-dependent terms to the number of classes $K$, our analysis yields much tighter robust generalization guarantees for MPGNNs, thereby guiding their designs to enhance adversarial robustness.
[LG-17] Generative Criticality in Large Language Model Temperature Scaling
Link: https://arxiv.org/abs/2606.06238
Authors: Huajian Ruan,Jinyang Li,Xingyu Guo,Lingxiao Wang
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); High Energy Physics - Lattice (hep-lat)
Comments: 9 pages, 7 figures, contributed to PAI 2026 Conference
Abstract:We propose a statistical-field framework for text generated by large language models (LLMs), treating token embeddings as continuous spin variables on a one-dimensional chain. Defining a susceptibility from the connected two-point correlator and an order parameter from the ensemble-averaged embedding field, we vary the softmax temperature $T$ and observe a sharp susceptibility peak near a characteristic $T_c$ with power-law-like scaling, a concurrent rapid change in the order parameter, and a collapse onto a single semantic direction below $T_c$. The intrinsic dimension estimated by the two nearest neighbor (TwoNN) method independently corroborates these findings, reaching a minimum near $T_c$. Results are robust across model scales (Qwen3: 0.6B–32B) and prompt categories. While the phenomenology closely resembles a continuous phase transition, the non-equilibrium nature of autoregressive generation warrants further investigation. Our framework provides quantitative tools for probing the collective statistical structure of LLM outputs and suggests connections between decoding strategies and critical phenomena.
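The TwoNN estimator mentioned above infers intrinsic dimension from the ratio of each point's second to first nearest-neighbor distance. The sketch below uses the maximum-likelihood form of the estimator (number of points divided by the sum of log ratios); the original TwoNN recipe fits a line instead, so treat this variant, the brute-force neighbor search, and the synthetic data as illustrative assumptions rather than the paper's pipeline.

```python
import math
import random


def two_nn_dimension(points):
    """Estimate intrinsic dimension from ratios r2 / r1 of neighbor distances."""
    log_ratios = []
    for i, p in enumerate(points):
        # Brute-force distances to all other points; fine for small samples.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0.0:  # skip exact duplicates, whose ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


random.seed(0)
# A one-dimensional curve embedded in three ambient dimensions.
line = [(t, 2.0 * t, -t) for t in (random.random() for _ in range(300))]
# A two-dimensional sheet embedded in three ambient dimensions.
sheet = [(random.random(), random.random(), 0.0) for _ in range(300)]

dim_line = two_nn_dimension(line)
dim_sheet = two_nn_dimension(sheet)
```

The estimate depends only on distance ratios, so it recovers the dimension of the underlying manifold rather than the ambient embedding dimension, up to sampling noise.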
[LG-18] Tracing the Oracle: Improving Diffusion Timestep Scheduling for 3D CT Reconstruction ECML-PKDD2026
Link: https://arxiv.org/abs/2606.06236
Authors: Yujia Wu,Zhaoqiang Liu
Categories: Machine Learning (cs.LG)
Comments: Accepted to ECML-PKDD2026
Abstract:Pretrained diffusion models demonstrate impressive potential in solving highly ill-posed 3D computed tomography (CT) inverse problems, while the inference process suffers from significant computational overhead. Furthermore, existing uniform timestep schedules fail to capture the non-uniform evolution of the reverse conditional diffusion stochastic differential equation, thereby introducing substantial truncation errors. To overcome this limitation, we propose Tracing the Oracle (TrO), a plug-and-play framework for improved timestep scheduling. Specifically, we treat densely sampled numerical integration trajectories on a few samples as the reference oracle. The optimized schedule is extracted by leveraging dynamic programming to globally minimize the cumulative error between the few-step approximation and the oracle. This mechanism precisely allocates the limited sampling steps to critical evolution stages that are highly susceptible to truncation errors. Our extensive experiments on the AAPM dataset across multiple 3D CT reconstruction tasks demonstrate that, when combined with the state-of-the-art 3D CT reconstruction method DDS, our optimized timesteps significantly improve reconstruction fidelity and computational efficiency compared to existing heuristic schedules, especially under a strict budget of no more than 10 sampling steps.
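The schedule search described above can be pictured as a shortest-path problem: pick a fixed number of jumps through a dense grid of candidate timesteps so that the summed per-jump error is smallest. The sketch below is a generic dynamic program under that reading. The cost matrix `cost[i][j]` (error of jumping directly from grid index `i` to `j`, measured against a dense reference trajectory) is a hypothetical input, and the abstract does not say how the paper defines or accumulates this error.

```python
def optimal_schedule(cost, num_steps):
    """Choose `num_steps` jumps from grid index 0 to the last index.

    cost[i][j] is the error of jumping directly from index i to index j > i.
    Returns (path, total_cost) minimizing the summed jump errors.
    """
    last = len(cost) - 1
    inf = float("inf")
    # best[k][j]: minimal error of reaching index j using exactly k jumps.
    best = [[inf] * (last + 1) for _ in range(num_steps + 1)]
    parent = [[-1] * (last + 1) for _ in range(num_steps + 1)]
    best[0][0] = 0.0
    for k in range(1, num_steps + 1):
        for j in range(1, last + 1):
            for i in range(j):
                candidate = best[k - 1][i] + cost[i][j]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    parent[k][j] = i
    if best[num_steps][last] == inf:
        raise ValueError("no schedule with that many steps exists")
    # Walk the parent pointers back from the final index.
    path = [last]
    j = last
    for k in range(num_steps, 0, -1):
        j = parent[k][j]
        path.append(j)
    path.reverse()
    return path, best[num_steps][last]


# Toy cost: error grows quadratically with jump length on a 5-point grid.
toy_cost = [[float((j - i) ** 2) for j in range(5)] for i in range(5)]
schedule, total_error = optimal_schedule(toy_cost, 2)
```

With a quadratic toy cost the optimum splits the grid evenly; a non-uniform cost matrix would instead concentrate steps where individual jumps are most expensive.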
[LG-19] Non-Negative Matrix Factorization for Event Data
Link: https://arxiv.org/abs/2606.06205
Authors: Raphaël Romero
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Continuous-time event data, in which entities emit instantaneous events over time, arises naturally across many domains such as neuroscience, seismology, and social networks. Non-negative matrix factorization (NMF) is a natural tool to uncover interpretable structure in such data, but it has so far only been applied after binning or smoothing the entity-level counting measures. This preprocessing step comes with the risk of erasing entity-level heterogeneities and fine-grained temporal features. In this paper, we introduce EventNMF, a continuous-time non-negative factorization model that operates directly on event times: each entity’s events are modeled as a Poisson process whose intensity factorizes through a non-negative B-spline basis, and a simple estimation procedure recovers interpretable temporal templates shared across entities. The resulting method is mathematically principled, easy to implement, and computationally efficient. We further show that standard binned-count approaches arise as the special case of degree-zero splines, explore bias-variance tradeoffs and compare against existing methods on a synthetic latent factor model, and demonstrate the effectiveness of EventNMF on several real-world applications.
[LG-20] A Machine Learning-Based Framework for Discovering Huntington's Disease Stages: Integrating Graph Representation Learning and clustering to Uncover Progression Dynamics in Longitudinal Enroll-HD Dataset ALT
Link: https://arxiv.org/abs/2606.06196
Authors: Lubna M. Abu Zohair,Marta Vallejo,MD Azher Uddin,John R. Woodward,Hind Zantout
Categories: Machine Learning (cs.LG)
Comments: Accepted for publication in the Proceedings of the 10th International Conference on Medical and Health Informatics (ICMHI 2026), Association for Computing Machinery (ACM)
Abstract:Huntington’s disease (HD) is a progressive brain disorder that gradually affects movement, cognitive function, and behavior. Identifying the stage of the disease accurately and consistently is important for understanding its course, grouping patients, personalized care, and discovering treatment. Existing clinical staging frameworks rely primarily on predefined clinical measurement thresholds and clinical expert decisions, yet these discrete cut-offs may obscure meaningful intra-stage variability and remain vulnerable to inter-rater differences, especially in motor and functional assessments. To address these limitations, we developed an unsupervised machine learning framework based on dynamic graph representation learning to capture temporal relationships within and across patients from longitudinal clinical measurements. Using the learned representations, we applied K-means++ clustering to identify well-separated groups. We then iteratively increased the number of clusters (k), using stability analysis to assess robustness and reveal additional meaningful clusters beyond the initial optimal solution. We applied the framework to 302 individuals from the Enroll-HD cohort (1,477 visits, 44 clinical variables per visit; 80% manifest participants), enabling data-driven discovery of HD stages reflecting natural clinical progression. Despite the limited cohort size, the proposed framework achieved robust clustering performance using a four-dimensional latent space, identifying four meaningful and statistically distinct disease stages through clustering stability analysis. Each stage corresponded to well-defined clinical measurement boundaries, with minimal overlap compared to previously established clinical staging methods.
[LG-21] Learning to model pediatric asthma exacerbation from multiple risk factors: a case study in coastal Virginia
Link: https://arxiv.org/abs/2606.06174
Authors: Jonathan Colen,Eric Werner,Maryam Golbazi,Heather Richter,Diana McSpadden,Amy Quinn,Jocel Santos,Mary Jane Darling,Mary Margaret Gleason
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 22 pages, 6 figures (5 supplemental)
Abstract:Childhood asthma is a common illness exacerbated by air pollution as well as meteorological and neighborhood-level socioeconomic factors. Modeling asthma exacerbation (AE) in large spatiotemporal datasets requires disentangling impacts from multiple contributors. In this case study, we compared three techniques that balance predictive power with interpretability to predict AE in Hampton Roads, a coastal Virginia region comprising 7 cities and over 1.5 million people. After collating ambient air pollution measurements, weather data, and measures of neighborhood opportunity, we modeled zip code-level acute AE visits to a regional children’s hospital and affiliated providers from 2018-2023. Generalized linear models (GLM) provided a baseline while neural networks (NN) served as a maximally predictive target. To bridge between statistical models and deep learning, we developed a framework based on sparse dictionary learning to identify and interpret parsimonious nonlinear interacting equations. After comparing each model’s predictive performance, we estimated relative risks for AE due to input exposure variables and found consensus across frameworks. Our work links statistical and interpretable machine learning models to highlight possible synergistic interactions influencing AE, and may enable future studies to guide public health interventions in coastal Virginia.
[LG-22] On the training of physics-informed neural operators for solving parametric partial differential equations
Link: https://arxiv.org/abs/2606.06164
Authors: Nanxi Chen,Chuanjie Cui,Airong Chen,Sifan Wang,Rujin Ma
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:
Abstract:Physics-informed neural operators (PINOs) aim to learn solution operators for partial differential equations by using the governing physics as supervision, rather than relying solely on paired input-output simulation data. By incorporating physical constraints into the training objective, PINOs combine the cross-instance generalization of neural operators with the data efficiency of physics-informed learning. Despite this promise, how to train PINOs efficiently and robustly remains less well-understood than the training of either data-driven neural operators or physics-informed neural networks (PINNs). To bridge this gap, we examine key components of the PINO training pipeline, including architecture design, optimizer choice, loss balancing, and collocation-point sampling strategy. We study three representative operator backbones, Deep Operator Network (DeepONet), Fourier Neural Operator (FNO), and Continuous Vision Transformer (CViT), across five diverse parametric PDE systems. Our results show that CViT provides consistently strong and stable performance across the considered benchmarks. Beyond architecture, we find that several optimization pathologies previously identified in PINN training naturally arise in PINOs, including gradient conflicts and causal violation. We also find that mitigation algorithms developed for PINNs remain effective in the PINO setting. We further compare physics-informed and data-driven training under different data regimes, revealing that a carefully designed physics-informed training pipeline can match, and in some cases, outperform purely data-driven neural operators. Taken together, these findings provide a systematic empirical understanding of the optimization challenges in PINO training and inform a practical pipeline for efficient and robust physics-informed operator learning. Code and data are available at this https URL.
[LG-23] Trust-Aware Predictive Emissions Monitoring for Gas Turbine Fleets with Limited Labelled Data
Link: https://arxiv.org/abs/2606.06156
Authors: Rebecca Potts,Aiden Durrant,Rick Hackney,Georgios Leontidis
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 6 figures, 6 tables
Abstract:Machine learning-based predictive emissions monitoring systems offer a practical alternative to direct emissions measurement, but their deployment across gas turbine fleets is challenging when emissions labels are available for only a small subset of assets. In this work, a trust-aware probabilistic framework is proposed for fleet-level gas turbine NOx prediction under limited labelled supervision. The framework combines a multi-head recurrent prediction model with learned confidence estimation, ensemble-based uncertainty quantification, auxiliary feature prediction, feature-space distance analysis, and operating-range diagnostics. These signals are calibrated on labelled data to produce interpretable per-sample trust scores, providing indicators of prediction reliability on unlabelled turbines, supporting the identification of predictions that should be treated with greater caution during fleet-level deployment. Confidence-based filtering reduces MAE from 0.202 at full coverage to 0.070 for the highest-confidence 10% of predictions, demonstrating that confidence estimates are meaningfully related to prediction error. Unlabelled and out-of-distribution samples exhibit increased uncertainty and reduced confidence, indicating that the framework responds appropriately to distributional shift. The results show that the proposed trust framework provides actionable reliability information for emissions prediction on unlabelled turbines, supporting more transparent and trustworthy deployment of PEMS across industrial fleets.
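The confidence-based filtering result quoted in this abstract (MAE at full coverage versus MAE on the highest-confidence 10%) corresponds to a risk-coverage evaluation. The sketch below shows that evaluation in general form. The function name `mae_at_coverage` and its inputs are hypothetical, and it says nothing about how the paper calibrates its trust scores.

```python
def mae_at_coverage(errors, confidences, coverage):
    """Mean absolute error over the most confident fraction of predictions.

    errors:      signed prediction errors (prediction minus label)
    confidences: one score per prediction, higher means more trusted
    coverage:    fraction of predictions to keep, in (0, 1]
    """
    if len(errors) != len(confidences) or not errors:
        raise ValueError("errors and confidences must be non-empty and aligned")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must lie in (0, 1]")

    # Rank predictions from most to least confident.
    ranked = sorted(zip(confidences, errors), key=lambda pair: pair[0], reverse=True)

    # Keep at least one prediction so the mean is always defined.
    keep = max(1, round(coverage * len(ranked)))
    kept_errors = [abs(err) for _, err in ranked[:keep]]
    return sum(kept_errors) / keep
```

If confidence is informative, the value returned at low coverage should be smaller than the value at `coverage=1.0`, which is the pattern the abstract reports.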
[LG-24] Tight list replicability bounds via a novel sphere covering theorem
Link: https://arxiv.org/abs/2606.06148
Authors: Ari Blondal,Hamed Hatami,Pooya Hatami,Chavdar Lalov,Sivan Tretiak
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 2 figures
Abstract:In recent years, list replicability has emerged as a framework for formalizing reproducibility in learning theory. A central question is how the required list size relates to the accuracy parameter and natural complexity measures of the hypothesis class. To achieve sharp bounds on list replicability, we prove a novel topological sphere covering theorem, derived from the Borsuk-Ulam theorem. Specifically, if the d-sphere is covered by open sets, each of which lies in an open hemisphere, then d+1 of these sets must have a common intersection. Using this result, we obtain a sharp bound on the relationship between list size and accuracy for VC classes. We also show that for large-margin half-spaces, provided the margin is not too large, the optimal list size equals the ambient dimension. However, when the margin is taken to be very large, we devise a replicable algorithm achieving the minimal list size of \lceil d/2 \rceil + 1.
[LG-25] Adaptive state-action abstractions via rate-distortion
Link: https://arxiv.org/abs/2606.06123
Authors: Fernando E. Rosas
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 28 pages, 2 figures
Abstract:When learning to walk, infants seem to address a coarse version of the problem first - stay upright, reach the caregiver - and refine it only when further practice at that resolution stops paying off. Reinforcement learning offers multiple techniques for building simple versions of complex tasks, but lacks general principles for how to dynamically adjust the granularity of these abstractions during learning. This paper proposes one such principle: refine the abstraction as soon as the learning error within it becomes comparable to the error induced by the abstraction itself. Here, we investigate one way of formalising this principle via a performance certificate that decomposes value error into two terms: a learning error bound captured by a Bellman residual, and an abstraction error bound given by a bisimulation metric. The resulting switching strategy is implemented by soft state-action abstractions built from rate-distortion principles, whose resolution along state and action axes can be continuously adjusted. We validate this construction in a range of tabular settings, showing that near-optimal performance can be achieved under substantial lossy compression of state and action information.
[LG-26] A Sliced-Wasserstein Framework on Correlation Matrices for EEG Decoding KDD2026
Link: https://arxiv.org/abs/2606.06104
Authors: Chen Hu,Rui Wang,Jiale Zhou,Jingjun Yi,Shaocheng Jin,Yidong Song,Yefeng Zheng
Categories: Machine Learning (cs.LG)
Comments: Accepted by KDD 2026
Abstract:Electroencephalography (EEG) offers noninvasive, millisecond resolution recordings of neuronal activity and is widely used in neuroscience and healthcare. Many EEG decoding pipelines rely on covariance descriptors for their robustness to noise, but such representations are sensitive to channel-wise scaling. Recent studies have therefore advocated full-rank correlation matrices as a scale-invariant alternative for EEG decoding. In this paper, we propose a general framework for Sliced Wasserstein (SW) discrepancies on manifolds endowed with Pullback Euclidean Metrics (PEMs), termed Pullback Euclidean Metric Sliced Wasserstein (PEMSW). Within this framework, we instantiate two Correlation Sliced-Wasserstein (CorSW) discrepancies on the manifold of full-rank correlation matrices under two recently introduced correlation geometries, i.e., the Off-Log Metric (OLM) and Log-Scaled Metric (LSM). Building on CorSW, we further develop a domain generalization (DG) framework for EEG decoding. Experiments on three EEG datasets demonstrate improved generalization under distribution shifts, with low training overhead and no additional inference cost. The source code is available at this https URL.
[LG-27] 3D Underwater Path Planning via Generative Flow Field Surrogates
Link: https://arxiv.org/abs/2606.06077
Authors: Zachary Cooper-Baldock,Paulo E. Santos,Russell S.A. Brinkworth,Karl Sammut
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 41 pages, 5 figures, 11 tables
Abstract:Autonomous underwater vehicle (AUV) launch and recovery (LAR) into the hull of an advancing host platform requires traversal of a complex, three-dimensional propeller wake whose hydrodynamic structure cannot be characterised by a uniform current model. High-fidelity Reynolds-Averaged Navier-Stokes (RANS) Computational Fluid Dynamics (CFD) simulations resolve this structure with sufficient accuracy for path planning, but their computational cost renders them impractical for onboard use. We address this gap by integrating two conditional generative adversarial network (cGAN) architectures – a regularised PatchGAN and a 2D3DGAN with self-attention – as drop-in replacements for RANS CFD data within a three-dimensional, energy-weighted A* path planning framework. Both generators are driven by a hierarchical pipeline that synthesises full 128^3 voxel flow field volumes from scalar operating condition inputs alone, with end-to-end inference times of approximately 28-146 \mu s, compared to hours for a single RANS computation. We benchmark all four environmental knowledge levels (uniform current, ground-truth CFD, PatchGAN, and 2D3DGAN SA) across 19,800 independently generated trajectories spanning 550 distinct flow conditions. Full CFD wake knowledge reduces energy expenditure by 5.7-12.5% and high-velocity wake-core encounters by up to 77.8% relative to uniform-current planning, with both benefits scaling with operating severity. The cGAN surrogates recover approximately 45-60% of the CFD energy benefit and high-velocity cell avoidance benefit while operating at inference speeds compatible with edge device use. These results provide the first systematic quantification of the downstream path planning value of cGAN-predicted hydrodynamic fields in a three-dimensional maritime robotics application.
[LG-28] Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification
Link: https://arxiv.org/abs/2606.06053
Authors: Haoyang Hong,Zichen Wang,Quanquan Gu,Huazheng Wang
Categories: Machine Learning (cs.LG)
Comments: Accepted by RLC 2026
Abstract:We study KL-regularized contextual bandits and episodic reinforcement learning (RL) under general function approximation with model misspecification. Existing guarantees rely on realizability and therefore do not extend to misspecified models, where classical regret bounds may fail. This work introduces KL misspecification formulations for contextual bandits and episodic RL and analyzes regression-based algorithms with Gibbs policy updates. High-probability KL-regret guarantees with explicit misspecification terms are established, recovering the standard realizable KL-regularized setting as a special case.
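For readers unfamiliar with the "Gibbs policy update" mentioned in this abstract, the sketch below shows the standard closed form for a single context: maximizing expected reward minus `beta` times the KL divergence to a reference policy gives a policy proportional to `ref_probs[a] * exp(rewards[a] / beta)`. The function name `gibbs_policy`, the parameter `beta`, and the use of plain reward estimates are assumptions for illustration; the paper's regression-based algorithm and its misspecification handling are not reproduced here.

```python
import math


def gibbs_policy(ref_probs, rewards, beta):
    """KL-regularized optimal policy over a finite action set.

    ref_probs: strictly positive reference-policy probabilities
    rewards:   estimated reward for each action (hypothetical inputs)
    beta:      KL regularization strength, must be positive
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if len(ref_probs) != len(rewards) or not ref_probs:
        raise ValueError("ref_probs and rewards must be non-empty and aligned")

    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    shift = max(logits)
    weights = [math.exp(logit - shift) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

A large `beta` keeps the result close to the reference policy, while a small `beta` concentrates probability on the highest-reward action.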
[LG-29] Learning solution operators of PDEs with sparse approximation methods
Link: https://arxiv.org/abs/2606.06046
Authors: Sebastian Neumayer,Daniel Potts,Fabian Taubert
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Abstract:We investigate the approximation of solution operators for partial differential equations (PDEs) using sparse high-dimensional techniques. Building on a dimension-incremental framework, we combine product basis expansions with sparse recovery methods, specifically orthogonal matching pursuit (OMP), to substantially reduce the required sample size compared with a previously considered cubature-based approach. We evaluate the resulting method numerically on several examples, comparing it against both cubature-based sparse approximation and Fourier neural operators in terms of accuracy, runtime, and sample size. The experiments show that our approach considerably reduces the number of required PDE solves relative to its predecessor while maintaining competitive accuracy, particularly when the solution admits a sparse representation in the chosen basis. Furthermore, the recovered sparse index sets yield interpretable insights into the relevant variables and parameter interactions.
[LG-30] Catastrophic Forgetting as Accessibility Collapse: A Three-Level Framework for Knowledge Persistence in Continual Learning
Link: https://arxiv.org/abs/2606.06032
Authors: Ayushman Trivedi,Bhavika Melwani
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 6 figures, 8 tables. Sequential continual-learning experiments on CIFAR-100 using ResNet-18
Abstract:Catastrophic forgetting is commonly interpreted as the irreversible erasure of previously acquired knowledge during sequential learning. In this work, we investigate an alternative perspective: that forgetting may arise not from complete destruction of task representations but from a loss of accessibility to preserved information. We introduce a three-level framework separating knowledge storage, representation, and accessibility, and evaluate each component through a series of continual-learning experiments on sequential CIFAR-100 classification using ResNet-18. Our analysis combines checkpoint persistence, linear probing, representation geometry, classifier-reset recovery, and layer-wise recoverability experiments. We observe complete behavioral forgetting of earlier tasks, with task accuracy collapsing from 54.8% to 0%, while linear probe performance retains approximately 76% of the original representational information. Furthermore, retraining only the final classifier restores 75.7% of the original task performance without modifying the backbone network. Layer-wise analysis reveals that early and intermediate layers preserve highly recoverable task information despite severe degradation at later stages. Projection-energy and principal-angle analyses indicate that retained knowledge persists as distributed high-dimensional representations rather than through preservation of a small dominant subspace. These findings suggest that catastrophic forgetting is better characterized as an accessibility failure than complete representational erasure, and that substantial task-relevant information remains embedded within neural representations even after functional forgetting has occurred.
[LG-31] Adaptive Oscillatory-State Alignment for Time Series Forecasting
Link: https://arxiv.org/abs/2606.06010
Authors: Zhangyao Song,Ziqiong Li,Xiangfei Qiu,Chao Zha,Yinfei Xu,Tao Guo
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments:
Abstract:Long-term time series forecasting benefits from inductive biases that expose recurring temporal structure. Existing periodic forecasting methods typically model recurrence through predefined periods, global spectral components, or fixed learnable templates. However, real-world temporal dynamics are rarely rigidly periodic: oscillatory behavior often evolves through amplitude modulation, phase drift, and local frequency variation. Under these conditions, fixed-template periodic modeling can become fundamentally mismatched to the underlying temporal states. We propose AOSNET, a Hilbert-guided forecasting framework that reformulates periodic forecasting from fixed template matching to adaptive oscillatory-state alignment. AOSNET extracts analytic-signal descriptors from both the observed sequence and a learnable global oscillatory prior, then adaptively aligns local states through a descriptor-conditioned gate that selectively preserves reliable observations while softly correcting mismatched regions. The learned prior serves not as a rigid repeated template but as a flexible oscillatory reference interpreted through local state dynamics. Experiments on eight benchmarks demonstrate state-of-the-art or highly competitive accuracy with fast inference speed. Controlled synthetic studies isolating amplitude modulation, phase drift, and local frequency variation confirm that the advantage of oscillatory-state alignment consistently increases as non-stationarity intensifies.
[LG-32] Diffusion Models for Adaptive Sequential Data Generation
Link: https://arxiv.org/abs/2606.06007
Authors: Haoyang Cao,Minshuo Chen,Yinbin Han,Renyuan Xu
Categories: Machine Learning (cs.LG)
Comments: 37 pages
Abstract:Generating realistic synthetic sequential data is critical in real-world applications across operations research, finance, healthcare, energy systems, and scientific computing, where time-indexed observations are used for prediction, simulation, risk assessment, and data-driven decision-making. While diffusion models have achieved remarkable success in generating static data, their direct extensions to sequential settings often fail to capture temporal dependence and information structure. Designing diffusion models that can simulate sequential data in an adapted manner, and hence without anticipation of future information, therefore remains an open challenge. In this work, we propose a sequential forward-backward diffusion framework for adapted time series generation. Our approach progressively injects and removes noise along the sequence, conditioning on the previously generated history to ensure adaptiveness. A novel score-matching objective is introduced for efficient parallel training. We derive rigorous statistical guarantees under a generic framework, then establish score approximation, score estimation, and distribution estimation results with ReLU networks serving as a concrete instance. Empirically, we validate our method on synthetic data, including ARMA models and Gaussian processes, and demonstrate its effectiveness in constructing mean-variance optimal portfolios.
[LG-33] HoT-SSM:Higher-order Temporal Knowledge Graph Reasoning with State Space Models for Health Care
Link: https://arxiv.org/abs/2606.05994
Authors: Thummaluru Siddartha Reddy,Vempalli Naga Sai Saketh,Yash Punjabi,Mahesh Chandran
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Paper under review
Abstract:Medical knowledge graphs (MKGs) infused with clinical knowledge have been increasingly used to model electronic health records (EHRs) to support interpretable predictions in the healthcare domain. However, existing MKG-based approaches are limited to capturing pairwise relations between clinical concepts (e.g., conditions, procedures, and medications), which restricts their ability to model higher-order interactions among co-occurring or semantically related concepts. In addition, most representation learning methods that leverage MKGs either collapse temporal information across visits or lack an explicit mechanism for modeling long-range temporal dependencies, which is critical for clinical tasks such as mortality prediction. To mitigate these limitations, we propose HoT-SSM, a parameter-efficient framework for higher-order temporal graph reasoning with state space models. For each visit, HoT-SSM constructs hypergraphs by grouping semantically related clinical concepts into hyperedges using domain knowledge, thereby preserving visit-level clinical context. Further, to model the temporal dynamics while learning the representations, we introduce a novel dynamic hypergraph-based state space model that explicitly captures patients' latent state evolution over time while preserving long-range information. The learned representations are used for downstream clinical prediction and reasoning. Experiments on the MIMIC-III and MIMIC-IV datasets show significant performance improvements over the current state-of-the-art models, demonstrating the effectiveness of jointly modeling higher-order clinical interactions and long-range temporal dependencies.
[LG-34] LLM Explainability with Counterfactual Chains and Causal Graphs
Link: https://arxiv.org/abs/2606.05972
Authors: Nirit Nussbaum-Hoffer,Nitay Calderon,Liat Ein-Dor,Roi Reichart
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. Instead, in this paper, we use causal graphs to model LLM inference itself, providing stakeholders with a transparent view of how the model perceives and organizes high-level concepts to produce a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with \sigma-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with LLMs’ reasoning. Together, this paper provides a foundation for concept-level explainability of LLMs.
[LG-35] Steering Vectors are an Adversarial Attack Surface
Link: https://arxiv.org/abs/2606.05958
Authors: Abzal Aidakhmetov,Donato Crisostomi,Tommaso Mencattini,Adrian Robert Minut,Iacopo Masi,Emanuele Rodolà
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Activation steering has become a popular way to control Large Language Model (LLM) behavior without fine-tuning. Since the technique is plug-and-play, users share datasets and precomputed vectors to steer model activations. However, we show that a stealth data poisoning attack silently compromises this pipeline. By substituting 4-6% of tokens in the steering dataset, an attacker can silently align the resulting vector with an anti-refusal direction. This jailbreaks the target model while preserving the intended steering effect on benign prompts. Under this threat model, a malicious actor can distribute an apparently safe bundle containing texts, vectors, and weights, alongside an equivalence certificate that the end-user can verify. We test the attack on two open-weight model families and eight model-attribute combinations, observing that poisoned vectors reach an absolute attack success rate (ASR) of 20-55%, +19% to +51% over a clean reference. Finally, we find that a refusal-direction orthogonalization defense can recover \approx 82% of the ASR gap without harming benign behavior.
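The "refusal-direction orthogonalization" defense named in this abstract suggests removing from a steering vector its component along a known refusal direction. The sketch below shows only that generic projection step; the function name `orthogonalize` and both input vectors are hypothetical, and the paper's actual defense may estimate the direction or apply the projection differently.

```python
def orthogonalize(vector, direction):
    """Remove from `vector` its component along `direction`.

    Both arguments are plain lists of floats of equal length. The result
    has zero dot product with `direction` up to floating-point error.
    """
    if len(vector) != len(direction):
        raise ValueError("vector and direction must have the same length")
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")

    # Scalar projection coefficient of `vector` onto `direction`.
    coeff = sum(v * d for v, d in zip(vector, direction)) / norm_sq
    return [v - coeff * d for v, d in zip(vector, direction)]
```

For example, projecting `[3.0, 4.0]` against the direction `[1.0, 0.0]` leaves `[0.0, 4.0]`: the part of the vector orthogonal to the direction is untouched, which matches the abstract's claim that benign steering behavior can be preserved.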
[LG-36] Dead Directions: Geometric Singular Learning
Link: https://arxiv.org/abs/2606.05957
Authors: Tejas Pradeep Shirodkar
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 139 pages, 13 figures, 13 tables
Abstract:Singular learning theory and information geometry have studied the same parameter spaces in mostly separate vocabularies: the former computes Bayesian invariants in resolved coordinates, the latter works in original coordinates under a non-degeneracy assumption that overparameterised models routinely violate. We bridge them through one primitive, the dead direction: a unit vector along which the Fisher metric degenerates, equivalently a tangent to the analytic singular set with a definite KL order, set by how fast the KL divergence vanishes. The two readings name the same vector; our central move shows its KL order is recoverable as the decay rate of the directional Fisher curvature approaching the singularity, in original parameter coordinates and without a Hironaka resolution. A selection rule on smooth fibres translates this rate into Watanabe’s single-direction contribution to the real log canonical threshold, and we extend the recovery to multi-component crossings, multiplicity m, the singular fluctuation \nu (universal in the KL order for 1D directions), prior-RLCT shifts, and tempered posteriors. We then lift this rate to a deep network: a multi-layer K-FAC factorisation writes each Fisher block as a product of activation- and gradient-side rates with a duality between them, instantiated at modern-network primitives (residual streams, layer normalisation, attention). A quotient theorem carries the rate to the gauge quotient \Theta/G under gradient flow on a G-invariant metric; SGD qualifies, standard Adam does not, and we construct a G-equivariant Adam-family preconditioner (DDCAdam) that does. The bridge yields a parameter-coordinate handle on singular geometry, closed-form per-architecture predictions, and a trajectory-rate readout of Watanabe’s triple (\lambda, m, \nu) from one checkpoint’s forward and backward passes, without posterior sampling.
[LG-37] Short paper: Models in the dark – Rectification and erasure under GDPR in ML supply chains
Link: https://arxiv.org/abs/2606.05946
Authors: Henrik Graßhoff,Malte Hansen,Meiko Jensen,Sara Ramezanian
Categories: Machine Learning (cs.LG)
Comments: accepted for presentation at Annual Privacy Forum 2026
Abstract:The rights to rectification and erasure, as established under the General Data Protection Regulation (GDPR), are central to protecting individuals’ privacy. However, their effective enforcement in machine learning (ML) systems remains challenging. Existing work has largely addressed these rights from either a legal or a technical perspective in isolation and disregards the fact that models are produced in complex supply chains involving multiple actors across development, distribution, and deployment. This paper presents a holistic survey of challenges in implementing the rights to rectification and erasure in ML models. Drawing on academic literature and guidance from data protection authorities, we find that many GDPR requirements cannot yet be technically met in practice. Our findings further suggest that issues arising in ML supply chains are insufficiently addressed in research. To tackle this gap, we introduce the notion of models in the dark – derived models created further downstream in an ML chain without sufficient transparency or traceability – and analyse the urgent challenges posed by this phenomenon. By adopting an interdisciplinary perspective, this work contributes to bridging the gap between legal requirements and the technical implementation of data subject rights in ML, ultimately supporting the development of trustworthy artificial intelligence.
[LG-38] Addressing Imbalance in Multi-Label Data via Label-Specific Distance-based Oversampling
Link: https://arxiv.org/abs/2606.05927
Authors: Bin Liu, Jun Wu, Haoyu Peng, Ao Zhou, Jin Wang, QiaoSong Chen, Grigorios Tsoumakas
Categories: Machine Learning (cs.LG)
Abstract:Complex, imbalanced label distributions pose a major challenge to multi-label classification, as most classifiers are biased towards the majority class and high-frequency labels. Oversampling is an efficient and flexible solution that augments instances to provide a more balanced training dataset for multi-label classifiers. Most existing oversampling methods create synthetic instances heuristically, relying on neighborhood information retrieved using Euclidean distance over the entire feature space. However, they fail to consider the varying semantic relevance of features to different labels, which leads to label inconsistency among proximate neighbors and in turn introduces label confusion and overfitting to synthetic instances. To overcome this issue, we propose a novel sampling approach called Label-Specific Distance-based Multi-Label Oversampling (LSDMLO) that creates more useful and well-labeled synthetic instances to address the imbalance in multi-label datasets. LSDMLO derives a label-specific distance to identify label-consistent neighbors based on the weighted pertinent feature space, which facilitates selecting seed instances that express more label correlations in boundary areas and generating synthetic instances aligned with the label distribution of the original data. Comprehensive experiments verify that the proposed LSDMLO outperforms state-of-the-art multi-label sampling approaches under various base classifiers.
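The idea of a label-specific distance can be illustrated with a minimal sketch. It is not the LSDMLO algorithm from the paper: the per-label feature weights, the helper names, and the SMOTE-style interpolation step are all assumptions chosen for illustration. It only shows how weighting features differently per label changes which neighbor is considered closest.

```python
import math
import random

def weighted_distance(a, b, weights):
    """Euclidean distance with per-feature weights (one weight vector per label)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def label_specific_neighbors(seed_idx, X, weights, k):
    """Indices of the k instances closest to X[seed_idx] under the weighted distance."""
    others = [i for i in range(len(X)) if i != seed_idx]
    others.sort(key=lambda i: weighted_distance(X[seed_idx], X[i], weights))
    return others[:k]

def synthesize(seed, neighbor, rng):
    """SMOTE-style interpolation: a random point on the segment seed -> neighbor."""
    t = rng.random()
    return [s + t * (n - s) for s, n in zip(seed, neighbor)]

# Toy data with two features; the weight vectors below are hypothetical.
X = [[0.0, 0.0], [0.1, 5.0], [2.0, 0.1], [3.0, 3.0]]
weights_label_a = [1.0, 0.0]  # assume only feature 0 is relevant to label A
weights_label_b = [0.0, 1.0]  # assume only feature 1 is relevant to label B

nn_a = label_specific_neighbors(0, X, weights_label_a, k=1)
nn_b = label_specific_neighbors(0, X, weights_label_b, k=1)
new_point = synthesize(X[0], X[nn_a[0]], random.Random(0))
print(nn_a, nn_b, new_point)
```

With these toy weights the same seed instance gets a different nearest neighbor for each label, which is the behavior a plain Euclidean distance over all features cannot express.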
[LG-39] DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement
Link: https://arxiv.org/abs/2606.05911
Authors: Cunhang Fan, Enrui Liu, Jing Zhou, Jian Kang, Jie Li, Andong Li, Jian Zhou, Zhao Lv, Xuelong Li
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Abstract:Although artificial neural network (ANN) based speech enhancement (SE) methods demonstrate excellent performance, their high computational complexity and energy consumption hinder deployment in practical front-end processing tasks. Spiking neural networks (SNNs) have shown potential for reducing power consumption. However, the discrete binary activation and complex spatio-temporal dynamics of SNNs often result in information loss. The current challenge is therefore how to maintain performance while reducing computational complexity. To address this issue, this work proposes a Dual-Branch Hybrid Neural (DBHN) Network. 1) In terms of network architecture: a dual-branch network integrating ANN and SNN is designed, where the SNN branch reduces power consumption while the ANN branch addresses information loss; BandSplit and Time-Frequency (TF)-Mamba modules are developed to simultaneously reduce energy consumption and enhance model performance; Spiking Feature Extraction Group (SFEG) and Information Transformation Block (ITB) components are implemented with residual connections to mitigate information loss while further refining feature representations. 2) To facilitate inter-branch information fusion: an Interaction module is designed to promote information exchange at various stages of the dual-branch network; a TF-Cross Attention-Fusion module is designed to perform time-frequency domain fusion of dual-branch information while data-adaptively guiding the SNN branch to retain more critical information. Results show that the proposed model maintains superior performance across three public datasets while achieving an average 7.5-fold reduction in computational complexity compared to baseline models.
[LG-40] High-Dimensional Theory of LoRA Fine-Tuning in a Solvable Attention Model
Link: https://arxiv.org/abs/2606.05899
Authors: O. Duranthon, F. Boncoraglio, L. Zdeborová
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
Abstract:We develop a high-dimensional statistical theory of low-rank adaptation (LoRA) in attention models, capturing the interplay between pre-training and fine-tuning. We introduce a solvable framework in which a single-head attention layer is first pre-trained on a data-abundant task and subsequently adapted via a rank-one LoRA update on limited data. In the high-dimensional limit, both stages admit a sharp asymptotic characterization in terms of a finite set of order parameters, yielding explicit predictions for test errors and representation alignment. Our analysis shows that the impact of pre-training on LoRA is summarized by an effective noise term, from which we derive prescriptions for the optimal pre-training procedure. We also demonstrate a regime with a mismatch between the value of the test error and representation quality, and propose an application of our theory to active fine-tuning.
[LG-41] When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training
Link: https://arxiv.org/abs/2606.05885
Authors: Yuanfan Li, Qi Zhou, Wenjing Duan, Lu Chen
Categories: Machine Learning (cs.LG)
Abstract:Long-horizon LLM agents require reinforcement learning methods that can assign credit to intermediate decisions under sparse and delayed rewards. Recent group-based methods such as GiGPO improve over GRPO by constructing step-level advantages at repeated anchor states. However, we show that such dense credit can be statistically unreliable: under limited rollouts, rare but lucky actions may receive overly large advantages, producing divergent anchor bias and late-stage training oscillation. We propose Evidence-Calibrated Policy Optimization (ECPO), a critic-free policy optimization algorithm that calibrates step-level credit before policy updates. ECPO combines Evidence-Calibrated Action Advantage, which groups rollouts by canonical actions and shrinks low-count estimates, with Variance-Gated Credit Weighting, which suppresses anchor states dominated by within-action noise. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B show that ECPO consistently outperforms strong baselines, improving GiGPO by +5.2/+7.3 success points on ALFWorld/WebShop with Qwen2.5-1.5B while adding only 0.1% additional advantage-computation overhead.
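The abstract describes grouping rollouts by action and shrinking low-count estimates. A minimal sketch of that general idea follows. It is not ECPO's actual estimator: the shrinkage factor n / (n + kappa), the pseudo-count `kappa`, and the function name are assumptions used only to show why a rare but lucky action should receive a damped advantage.

```python
from collections import defaultdict

def shrunk_action_advantages(rollouts, kappa=2.0):
    """rollouts: list of (action, return) pairs collected at one anchor state.

    Returns {action: advantage}. Each raw advantage (action mean minus anchor
    mean) is multiplied by n / (n + kappa), so actions backed by few rollouts
    are pulled toward zero. kappa is a hypothetical pseudo-count.
    """
    by_action = defaultdict(list)
    for action, ret in rollouts:
        by_action[action].append(ret)
    baseline = sum(ret for _, ret in rollouts) / len(rollouts)
    advantages = {}
    for action, rets in by_action.items():
        n = len(rets)
        raw = sum(rets) / n - baseline
        advantages[action] = (n / (n + kappa)) * raw
    return advantages

# "jump" succeeded once out of one try; "open" succeeded once out of four.
rollouts = [("open", 0.0), ("open", 0.0), ("open", 1.0), ("open", 0.0), ("jump", 1.0)]
adv = shrunk_action_advantages(rollouts)
print(adv)
```

In this toy case the single lucky "jump" rollout has a raw advantage of 0.6, which the shrinkage reduces to 0.2, while the better-supported "open" estimate is damped far less.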
[LG-42] TS-ICL: A Flexible Time-Indexed Foundation Model for Time Series via In-Context Learning
Link: https://arxiv.org/abs/2606.05878
Authors: Etienne Le Naour, Tahar Nabil, Adrien Petralia
Categories: Machine Learning (cs.LG)
Abstract:Foundation models mark a profound paradigm shift in time series modeling, with task-specific models being superseded by general-purpose zero-shot models. Yet, current approaches primarily focus on forecasting, while real-world time series are often irregularly and partially observed, requiring models that can jointly forecast, impute missing values, and handle degraded sampling conditions. To address these challenges, we introduce TS-ICL, a novel probabilistic In-Context Learning encoder–regressor Transformer that unifies forecasting and imputation. TS-ICL formulates time series tasks as timestamp-aligned regression and naturally incorporates covariates by training on synthetic dependency structures generated from a novel causal data prior. Empirically, TS-ICL achieves a new state-of-the-art in imputation, while remaining competitive with leading forecasting foundation models across both univariate and covariate-aware benchmarks. It shows particularly strong performance in forecasting with partially observed look-back windows.
[LG-43] GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis
Link: https://arxiv.org/abs/2606.05860
Authors: Oleeviya Babu Poikarayil, Cédric Schockaert, Abdulrahman Nahhas, Christian Daase, Mursal Dawodi, Jawid Ahmad Baktash
Categories: Machine Learning (cs.LG)
Comments: 26 pages, 17 figures, 12 tables. Under review
Abstract:Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise. Traditional Automated Machine Learning (AutoML) systems typically rely on static, predefined search spaces, limiting their ability to adapt to diverse data characteristics. We present GenAutoML, an agentic framework that leverages Large Language Models (LLMs) as neural architects to bridge natural-language requirements and executable PyTorch implementations. The framework incorporates a Sandboxed Reflection Loop for autonomous code refinement and a Signature-Aware Runtime that enforces architectural consistency and execution safety. To improve robustness under non-stationary conditions, we further introduce a Dynamic Reversible Instance Normalization (Dyn-RevIN) wrapper. Experiments on the ETTh1, ETTm1, and Weather benchmarks demonstrate that GenAutoML can dynamically generate task-specific neural architectures tailored to dataset characteristics. Among the generated models, WaveInterferenceNet achieves inference latency below 0.01 ms per sample while maintaining competitive predictive performance. By emphasizing computational efficiency, architectural adaptability, and stable optimization behavior, GenAutoML enables the creation of ultra-lightweight neural networks suitable for resource-constrained and latency-sensitive Edge AI deployments.
[LG-44] Robust and sparse support vector machine via hybrid truncated loss for supervised classification
Link: https://arxiv.org/abs/2606.05814
Authors: Yuliang Yang, Chen Chen, Yuxiang Liu, Huiru Wang
Categories: Machine Learning (cs.LG)
Abstract:The support vector machine (SVM) is a widely used classifier, but choosing an appropriate loss function remains difficult. Convex losses such as the hinge loss and least-squares loss are sensitive to outliers, while bounded non-convex losses often lead to high computational cost. To address this, we propose a hybrid truncated loss function (L_{\mathrm{ht}}) that is both sparse and bounded, and build the L_{\mathrm{ht}}-SVM model for single-view classification. We introduce the P-stationary point and use it to establish the first-order necessary and sufficient optimality conditions. Based on these conditions, we design an alternating direction method of multipliers with a working-set strategy that reduces computational cost and achieves global convergence. We further extend L_{\mathrm{ht}}-SVM to multi-view learning by adding structural information and view weights, resulting in MvL_{\mathrm{ht}}-SVM, which follows both the consensus and complementarity principles. Experiments on synthetic, real-world, and image datasets show that L_{\mathrm{ht}}-SVM achieves higher accuracy with fewer support vectors and better noise robustness than five single-view methods, while MvL_{\mathrm{ht}}-SVM outperforms six multi-view methods in accuracy, precision, recall, and F1-score.
[LG-45] SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter
Link: https://arxiv.org/abs/2606.05800
Authors: Powei Chang, Jinpeng Zhang, Chaoqun Sun, MiniWell Tsao, Lianrui Li, Jianxiang Xiang, Chenyu Wang, Yukang Gao, Dongying Kong
Categories: Machine Learning (cs.LG)
Abstract:Reinforcement learning with verifiable rewards (RLVR) often adopts GRPO-style group-relative updates, sampling multiple rollouts per prompt to construct normalized learning signals. However, merely increasing the number of rollouts does not reliably strengthen learning: under GRPO-style group normalization, per-rollout policy-gradient features can concentrate into a low-rank, signed geometry, causing substantial cancellation during aggregation and weakening the effective update. We address this failure mode with SALT, a Subspace-Adaptive geometry pLug-in componenT that uses sample-wise gradient geometry to reweight the coefficients of group-relative updates. SALT estimates a dominant shared subspace from the mini-batch Gram geometry, decomposes group-relative coefficients into shared and residual channels, and adaptively amplifies the residual channel when signed cancellation is severe. Across diverse reasoning-oriented RLVR benchmarks and model scales, SALT improves effective update geometry and performance without modifying the reward model or the rollout sampling procedure.
[LG-46] Causal Longitudinal Prior-Fitted Networks for Counterfactual Outcome Prediction
Link: https://arxiv.org/abs/2606.05797
Authors: Amirhossein Zare, Amirhessam Zare, Herlock Rahimi, Reza Salarikia, Mohammad Kashkooli
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 31 pages, 10 tables
Abstract:Longitudinal treatment decisions require predicting potential outcomes under future treatment sequences in the presence of time-varying confounding, heterogeneous patient dynamics, and limited domain-specific data. Existing longitudinal causal estimators typically train a new model for each cohort or simulator. We introduce Causal Longitudinal Prior-Fitted Networks (CausalLongPFN), a prior-fitted in-context predictor for longitudinal causal prediction. The model is pretrained entirely on synthetic episodes sampled from a broad prior over temporal structural causal models, exposing it to treatment-confounder feedback, latent heterogeneity, nonlinear state evolution, delayed effects, and cumulative treatment responses. At test time, CausalLongPFN is frozen: it conditions on support trajectories, a query history, and a proposed future treatment sequence, and returns a predictive distribution over future outcomes without gradient updates or propensity-model fitting. Multi-step predictions are obtained by recursively applying the one-step predictor under the specified treatment sequence. We evaluate on branchable cancer, HIV, and warfarin benchmarks with ground-truth counterfactual labels, and on factual-only rolling-origin prediction in MIMIC-III ICU trajectories. CausalLongPFN is competitive with domain-trained longitudinal baselines on counterfactual benchmarks and performs strongly on factual MIMIC-III prediction, suggesting that broad synthetic causal pretraining can provide a useful frozen alternative when repeated domain-specific training is costly or impractical.
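The recursive multi-step prediction mentioned in the abstract can be sketched generically. This is an illustration only: `rollout` and `toy_one_step` are hypothetical names, the toy dynamics are invented, and the predictor here returns a single number, whereas the paper's model returns a predictive distribution.

```python
def rollout(one_step, history, treatments):
    """Apply a one-step predictor recursively under a fixed treatment sequence.

    one_step(history, treatment) -> predicted next outcome (a float in this
    sketch). Each prediction is appended to the history before the next step.
    """
    history = list(history)
    predictions = []
    for treatment in treatments:
        y_next = one_step(history, treatment)
        predictions.append(y_next)
        history.append((treatment, y_next))  # feed the prediction back in
    return predictions

def toy_one_step(history, treatment):
    # Hypothetical dynamics: the outcome decays by 10% per step and each unit
    # of treatment lowers it by 1.
    last_outcome = history[-1][1]
    return 0.9 * last_outcome - 1.0 * treatment

# History entries are (treatment, outcome) pairs; the query proposes 1, 1, 0.
preds = rollout(toy_one_step, history=[(0, 10.0)], treatments=[1, 1, 0])
print(preds)
```

Because each step consumes the previous prediction, errors can compound over the horizon, which is one reason a full predictive distribution is more informative than the point estimates used here.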
[LG-47] Domain-Adapted Small Language Models with Hybrid Post-Processing: Achieving Cost-Efficient Low-Latency Multi-Label Structured Prediction via LoRA Fine-Tuning on Scarce Data
Link: https://arxiv.org/abs/2606.05781
Authors: Srinivasan Manoharan, Dilipkumar Nallusamy, Sachin Kumar, Haifeng Wu
Categories: Machine Learning (cs.LG)
Comments: 4 pages, 2 figures, 4 tables
Abstract:Deploying frontier large language models (LLMs) for domain-specific structured evaluation tasks often incurs substantial latency, cost, and data privacy overhead. We present a hybrid framework that combines a fine-tuned small language model (LLaMA 3.1 8B, with only 2.05% trainable parameters via LoRA) and a deterministic rule-based post-processing layer. Trained on just 219 curated examples, the system is applied to multi-label compliance evaluation of conversational transcripts spanning 18 heterogeneous output fields. In blind evaluation on 53 previously unseen production transcripts, it achieves 100% JSON structural validity, 83.0% human-validated overall accuracy, and 100% accuracy on the most critical classification field. The proposed approach formalizes a hybrid neural-symbolic decomposition and introduces targeted hard-negative augmentation to improve performance on critical decision boundaries. Running on a single NVIDIA A100 GPU, inference completes in approximately 2 seconds, which is 2-5x faster than frontier-model APIs. The system costs only $0.013 per evaluation compared with $0.025-$0.055 for proprietary alternatives, resulting in 46-76% cost savings. These results demonstrate that domain-adapted small language models, when combined with deterministic post-processing, can match frontier-model accuracy for structured compliance evaluation while substantially reducing operational cost, latency, and privacy risk. Keywords: small language models, parameter-efficient fine-tuning, LoRA, domain adaptation, hybrid inference, compliance evaluation, structured output.
[LG-48] Zero-Copy Semantic Contagion: An In-Memory Streaming Architecture for Evolving Attention Graphs SIGMOD
Link: https://arxiv.org/abs/2606.05733
Authors: Kabir Murjani
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP); Machine Learning (stat.ML)
Comments: Accepted to the 2026 ACM SIGMOD Workshop on Data Management for the Modern Financial Systems (FinDS). 10 pages, 4 figures
Abstract:Per-ticker forecasting models dominate financial time-series work yet remain blind to cross-company propagation: a foundry disruption in Taiwan does not register in a single-asset model until Apple’s own price has already moved. To address this limitation, we introduce a heterogeneous Rust-Python streaming architecture that maps cross-company attention as a continuous-time graph driven directly from text. We show that on the ingestion side, a zero-copy Rust edge parses news records in ~100 ns and scans the target equity universe in ~1.2 μs. On the inference end, a multivariate Neural Hawkes Process featuring per-node continuous-time LSTM states and a bilinear latent projection propagates directed excitation, while an adaptive pruning rule bounds the computational cost of dynamic neighborhood updates. Combining these stages, we demonstrate an end-to-end processing latency of ~13 ms per incoming news record on a single commodity CPU. Evaluated on a one-month temporal holdout of the FNSPID corpus (638 articles across 47 tickers), the system delivers a 1.70× precision lift over random at the 90th-percentile next-day return threshold, and 3.36× over a same-sector baseline. Crucially, removing the graph topology collapses precision to zero, confirming that the dynamic attention network is the sole driver of cross-company signal in this architecture.
[LG-49] Intercomparison of Machine Learning Algorithms for Remote Sensing-based In-season Crop Mapping
Link: https://arxiv.org/abs/2606.05731
Authors: August Posch, Jitendra Kumar, Forrest M. Hoffman, Auroop R. Ganguly
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 8 figures
Abstract:In-season crop type mapping is critical for food security in the face of increasingly extreme climate-related threats to crops. Currently, the USDA Cropland Data Layer provides crop type labels at 30m resolution and is available the February after harvest, but no product exists that maps crop types before harvest with satisfactory accuracy that would allow emergency managers to respond to crop threats in near real time. Furthermore, the relative advantages of a wide range of algorithms have not been evaluated in a way that accounts for interannual variability, until this study. Here, Harmonized Landsat-Sentinel surface reflectance imagery time series and crop rotation history information are combined to map corn in Iowa and almonds in California at 30m resolution accurately by early June in unseen years, with robust quantification of uncertainty due to phenology and crop distribution. Thousands of model configurations across ten machine learning algorithms were compared using a year-wise cross-validation and a suite of metrics. Hyperparameter search revealed Support Vector Machines to be the most successful algorithm overall, with a mean F1 score of 0.74 (0.59) across five unseen validation years for almonds by early June in California (corn by early June in Iowa). Interannual variation was a large source of uncertainty, but patterns showed the potential to further improve performance with ensemble approaches or ancillary data. Future work may extend these methods to include multiclass maps of all crop types, CONUS-wide maps, and in-season crop yield forecasting.
[LG-50] Automated Proving of Shannon-Type Entropy Inequalities via Fine-Tuned Language Models and Guided Tree Search
Link: https://arxiv.org/abs/2606.05729
Authors: Shing Yin Wong, Shaocheng Liu, Linqi Song, Amin Gohari, Cheuk Ting Li
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract:Proving Shannon-type entropy inequalities is a fundamental task in information theory that often requires constructing non-trivial linear combinations of known constraints, a combinatorial search problem that scales poorly with the number of random variables. We investigate whether small-scale large language models (0.6B–1.7B parameters), fine-tuned on atomic proof steps and combined with guided beam search, can automate this process. On a held-out test set of 60 inequalities spanning n=10 to 15 variables, our 0.6B fine-tuned model achieves an 85% proof success rate with tree search. GPT-5.5 solves 1.7% of samples under zero-shot prompting, while Psitip solves 33.3%. A systematic ablation study across training context length (4096 vs. 8192 tokens) and data distribution (n=9-skewed vs. not skewed) reveals that a 4096-token, non-skewed training distribution yields the best performance, with extended context and skewed data providing no marginal benefit. We further identify two dominant failure modes – format failures and step quality degradation – and verify that the beam-scoring heuristic is essential via a controlled ablation (random scoring reduces success from 83% to 23%).
[LG-51] Hybrid CNN-LSTM Framework for Intelligent Cyber Attack Detection and Prevention in U.S. Critical Digital Infrastructure: A Comparative Machine Learning Evaluation on CSE-CIC-IDS2018
Link: https://arxiv.org/abs/2606.05714
Authors: Md. Iqbal Hossan, Md. Serajul Kabir Chowdhury Rubel, Md. Arifur Rahman, B. M. Taslimul Haque
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 25 pages, 9 figures, CSE CIC IDS2018 dataset, Hybrid CNN LSTM, cyber attack detection
Abstract:Digital infrastructure is growing at a rapid pace in the United States, and as a result, critical sectors including healthcare, finance, transportation, energy, and government systems are increasingly exposed to advanced cyber threats. Traditional cybersecurity approaches, including signature-based intrusion detection systems, have become less effective against today’s cyber attacks, as they are unable to detect unknown and changing attacks in real time. To overcome these constraints, this research proposes a smart cyber-defense system that uses Artificial Intelligence (AI) and Machine Learning (ML) algorithms to detect and prevent cyber attacks on U.S. digital infrastructure. This study uses the CSE-CIC-IDS2018 dataset, a realistic network traffic dataset covering various cyber attack scenarios, including Distributed Denial of Service (DDoS), brute force attacks, botnets, infiltration attacks, and web-based attacks. A number of machine learning and deep learning models, such as Random Forest, XGBoost, Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) networks, are implemented and evaluated for identifying malicious network behavior and improving the accuracy of intrusion detection. The proposed framework combines data preprocessing, feature engineering, real-time traffic monitoring, and intelligent threat classification with automated prevention mechanisms to build cybersecurity resilience.
[LG-52] Revisiting Prototype Rehearsal for Exemplar-Free Continual Learning: Manifold-Aware Boundary Sampling with Adaptive Class-Balanced Loss CVPR2026 CVPR
Link: https://arxiv.org/abs/2606.05695
Authors: Hongye Xu, Bartosz Krawczyk
Categories: Machine Learning (cs.LG)
Comments: Published in CVPR 2026 Findings. 10 pages, 6 figures. CVF version: this https URL. Code: this https URL
Abstract:Exemplar-free class-incremental learning (EFCIL) aims to acquire new classes over time without storing raw data. Historically, prototype rehearsal, which samples around stored class prototypes and mixes them with current-task data, has been a popular strategy to reduce catastrophic forgetting. However, recent drift-compensation methods that explicitly realign prototypes in the evolving feature space consistently outperform prototype-based rehearsal, raising the question of whether rehearsal itself is fundamentally limited. We argue that the performance gap stems not from the idea of prototype rehearsal per se, but from how it is typically instantiated: existing approaches treat prototypes as isolated class summaries that ignore information from nearby enemy classes, and fail to correct the emerging class imbalance between a handful of synthetic old-class samples and hundreds of real instances from newly introduced classes. Building on this hypothesis, we revisit prototype rehearsal and propose a manifold-aware variant that restores its competitiveness in EFCIL. First, we introduce Constrained Expansive Over-Sampling, which interpolates each old-class prototype toward its nearest enemy features from new classes, generating boundary-aware rehearsal samples that better follow the underlying data manifold while preserving inter-class separation. Second, we design an Adaptive Class-Balanced loss that performs time-based class weighting, amplifying gradients from older prototypes when they are most informative and gradually annealing their influence as richer supervision from later tasks accumulates. Together, these components turn prototype rehearsal into a drift-resilient, imbalance-aware mechanism that closes, and often reverses, the gap to recent drift-compensation methods, achieving state-of-the-art performance across multiple EFCIL benchmarks.
[LG-53] Causal Modeling of Selection in Evolution ICML2026
Link: https://arxiv.org/abs/2606.05689
Authors: Haoyue Dai, Zeyu Tang, Peter Spirtes, Kun Zhang
Categories: Machine Learning (cs.LG)
Comments: Appears at ICML 2026 (spotlight)
Abstract:Understanding potential selection in data is crucial for causal discovery; we argue that “selection” in common narratives takes two forms, which we term static and evolutionary selection, respectively. Static selection refers to a one-shot filtering process where observed data consist of a subset of the population of interest, as in survey volunteer bias. Evolutionary selection, in contrast, operates through repeated rounds of differential fitness in reproduction, where observed data constitute the latest generation shaped by a historical trajectory, as in immune adaptation, antibiotic resistance, and social norm emergence. Existing methods largely conflate these two forms and rely on an identical graphical model of selection. We show that this model is valid for static settings but fails to characterize data under evolution, yielding false discovery results. To address this, we introduce a new model that specifically characterizes evolutionary selection, and develop a sound and complete procedure for identifying such models from data across one or multiple environments or generations. Experimental results validate the method’s ability to uncover the relevant mechanisms underlying evolution from data.
[LG-54] CASS-RTL: Correctness-Aware Subspace Steering for RTL Generation with LLMs
Link: https://arxiv.org/abs/2606.05680
Authors: Mohammad Akyash, Nowfel Mashnoor, Kimia Azar, Hadi Kamali
Categories: Programming Languages (cs.PL); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: Accepted to the IEEE International Conference on LLM-Aided Design (LAD '26)
Abstract:Recent advances in large language models (LLMs) have enabled the automatic synthesis (generation) of register-transfer level (RTL) code from natural language instructions, offering a promising pathway to accelerate chip design. Unlike typical natural language (and software coding) tasks, LLM-based RTL code generation demands strict cycle accuracy with concurrency, where minor logical errors can render a circuit unusable or insecure. While prior work has explored hallucination mitigation via external verification, self-evaluation prompts, retrieval-augmented prompting, domain specific fine-tuning, agentic solutions, and reasoning, these approaches largely overlook the attention-oriented internal mechanisms of LLMs that may inherently correlate with RTL correctness. This work proposes CASS-RTL, a first-of-its-kind framework for discovering and leveraging LLMs’ correctness-aware components to guide RTL generation toward functionally accurate outputs. We (i) identify attention heads whose activation patterns consistently differentiate correct from incorrect RTL; (ii) construct a low-dimensional subspace capturing correctness-relevant signals; and (iii) design a lightweight, geometry-aware intervention that steers the model at inference time. CASS-RTL is fully model-agnostic, requires no additional supervision or retraining, and readily integrates into existing models. Empirically, we evaluate CASS-RTL on multiple models and observe 10%-20% improvement in pass@1/5/10 accuracy on VerilogEval and 5% improvement on CVDP, demonstrating the effectiveness of our method in enhancing reliability without sacrificing model efficiency or requiring a large labeled dataset for fine-tuning.
[LG-55] Q-GNN: Query-Conditioned Graph Neural Networks with Type Awareness for Knowledge Graph Completion
Link: https://arxiv.org/abs/2606.05639
Authors: Dongxiao He, Ruqiong Zhang, Zhizhi Yu, Ling Ding, Di Jin, Guangquan Xu, Zhiyong Feng
Categories: Machine Learning (cs.LG)
Abstract:Knowledge Graph Completion (KGC) aims at predicting missing triplets from incomplete knowledge graphs, which is crucial for downstream applications. Recently, Graph Neural Network (GNN)-based methods have achieved remarkable success by performing message passing over query-centered local subgraphs. However, in practice, a query is jointly defined by both the entity and the relation, with both carrying information indispensable for reasoning, yet these methods rely solely on the query relation as the guiding signal, while the information inherent in the query entity is not leveraged to guide inference - the entity serves merely as a structural anchor for subgraph extraction. To this end, we incorporate query entity information into the reasoning process from two perspectives: the first is structural context, i.e., the neighboring structure and relation patterns around the entity, which is encoded by a dedicated context encoder and used to modulate messages; the second is semantic type of the entity, inferred by a large language model, which is incorporated into attention computation and final scoring to provide type-level prior constraints. Together, these two sources of information enable the reasoning process to be guided by both the query relation and the query entity. Experimental results on standard benchmarks demonstrate the effectiveness of the proposed Q-GNN.
[LG-56] StableRCA: Robust Graph-Agnostic Mechanism-Level Root Cause Analysis
Link: https://arxiv.org/abs/2606.05636
Authors: Xiaoyu Lin, Nicholas Tagliapietra, Kehan Li, Lavdim Halilaj, Juergen Luettin
Categories: Machine Learning (cs.LG)
Abstract:Root-Cause Analysis (RCA) seeks to identify the variables responsible for abnormal system behavior in complex domains such as manufacturing, cloud computing, and healthcare. Existing approaches face a critical bottleneck: graph-based causal methods can identify intervention targets but typically require a known or accurately estimated causal graph, while graph-free statistical methods either localize marginal anomalies rather than structural causes, or rely on restrictive assumptions about graph structure or functional form. We propose StableRCA, a local mechanism-level RCA framework that avoids global graph discovery by estimating local Markov boundaries and detecting conditional distribution shifts within them. Leveraging the Independent Causal Mechanism principle, we show that intervention targets can be identified with probability converging exponentially in sample size under faithful Markov boundary recovery and non-degenerate mechanism shifts. Experiments on synthetic benchmarks and five real-world datasets demonstrate that StableRCA is robust to graph misspecification, effective under multiple intervention targets, scalable to large systems, and reliable across diverse application domains. Code is available at: this https URL
[LG-57] From Prediction to Self: Developmental Conditions for Agency in Minimal Neural Systems
链接: https://arxiv.org/abs/2606.05605
作者: Evan Ye
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 18 pages, 6 figures
Abstract:How does a system that merely predicts the world come to distinguish its own causal influence from everything else? We trace this transition in a minimal 192-dimensional GRU through 40 controlled experiments arranged as a developmental sequence, adding components one at a time and tracking whether the system can distinguish self-caused from world-caused changes. The developmental path reveals four conditions that must be satisfied in strict order: (1) persistent state forming stable attractors, (2) a causal action loop linking output to input, (3) proprioceptive feedback that makes implicit causal knowledge explicit, and (4) asynchronous awakening - perceptual learning must consolidate before action learning begins. We propose agency gain (A = Err_world - Err_self), the predictive advantage of knowing one’s own action, as a metric to track this process. The self-aware predictor consistently outperforms the self-blind predictor across periodic (sinusoidal) and chaotic (Lorenz) environments, and the metric survives ablation of all auxiliary components. Only forward-sampled action selection produces meaningful agency gain; two gradient-based alternatives degenerate. Equally significant are 12 falsified hypotheses mapping where development stalls: predictive coding alone does not produce self-represent
[LG-58] Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations
链接: https://arxiv.org/abs/2606.05599
作者: Yizhe Ding,Runze Li,Jia Liu,Lingzhou Xue
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 30 pages, 5 figures
Abstract:This paper establishes a theoretical framework for the uniform convergence of smoothly activated deep neural network (DNN) estimators. While standard ReLU networks achieve minimax-optimal rates in the $L^2$ norm for various nonparametric regression tasks, we establish a theoretical lower bound demonstrating that least-squares ReLU estimators can suffer from the curse of dimensionality in their uniform convergence behavior. Motivated by the need for reliable uniform guarantees in downstream tasks requiring worst-case reliability, we address this limitation by analyzing smoothly activated DNNs (smooth DNNs), encompassing both feedforward and residual structures. We establish novel pseudo-dimension bounds, non-asymptotic approximation guarantees, and Hölder-norm bounds for the approximators of these models. Leveraging these results, we derive non-asymptotic uniform convergence rates for smooth DNN estimators across multiple statistical contexts, including Huber, least-squares, quantile, and logistic regression. We prove that smooth DNNs can mitigate the curse of dimensionality in uniform convergence by adaptively exploiting the low-dimensional hierarchical composition structure of the target function. Supported by both simulation studies and a real-world application, our results position smooth DNNs as a theoretically grounded and practically viable alternative to ReLU networks for statistical learning tasks requiring uniform guarantees.
[LG-59] AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents
链接: https://arxiv.org/abs/2606.05597
作者: Hao Bai,Rui Yang,Chenlu Ye,Spencer Whitehead,Aviral Kumar,Tong Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a $2.9\times$ end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer $1/|\tau_i|$ in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing $1/|\tau_i|$ with a constant $1/k$ breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
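The normalizer argument in this abstract can be made concrete with a small numeric sketch. The code below is only an illustration under stated assumptions: `token_weights`, the two-trajectory batch, the advantage values, and the choice `k=25` are all hypothetical and are not taken from AsyncWebRL. It shows how dividing by the trajectory length shrinks the per-token weight on a long failed trajectory, while a constant divisor treats tokens of both trajectories alike.

```python
def token_weights(lengths, advantages, k=None):
    """Return the weight each token of trajectory i receives in the loss.

    With k=None the per-trajectory normalizer 1/|tau_i| is used;
    otherwise every token is scaled by the constant 1/k.
    """
    weights = []
    for length, adv in zip(lengths, advantages):
        norm = length if k is None else k
        weights.append(adv / norm)
    return weights


# Hypothetical batch: one short success and one long failure.
lengths = [10, 40]          # tokens (or steps) per trajectory
advantages = [1.0, -1.0]    # illustrative group-normalized advantages

per_traj = token_weights(lengths, advantages)         # uses 1/|tau_i|
constant = token_weights(lengths, advantages, k=25)   # uses 1/k (k is an assumption)

print(per_traj)
print(constant)
```

Under the per-trajectory normalizer, each token of the long failure carries a smaller-magnitude weight than each token of the short success; with the constant divisor the magnitudes match.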
[LG-60] Auditing Demonstration Curation Metrics: Action-Only Scorers Fail on the Structural Defects That Degrade Imitation Policies
链接: https://arxiv.org/abs/2606.05588
作者: Aarav Bedi(University of California, Berkeley)
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, 4 tables
Abstract:Imitation-learning policies inherit the quality of the demonstrations they are trained on, and a growing set of curation metrics promise to score and filter low-quality demonstrations automatically. These metrics are each validated on different data with different protocols, so it is unclear which of them actually identify the demonstrations that harm a policy. We build a controlled testbed in which demonstration defects are injected with known type, and audit seven curation metrics along two axes: how well each separates defective from clean demonstrations, and whether training a behavior-cloning policy on each metric’s curated subset improves task success. We study two defect regimes. Subtle perturbations (correlated action noise, tremor, truncation) are detectable by multivariate outlier scoring and, once removed, recover the full downstream gap. Structural errors, where the demonstration executes a wrong action at a key moment, are invisible to every action-only metric we test, and two of them are inverted: they score defective demonstrations as higher quality and, used for curation, tend to leave the policy at or below the uncurated baseline rather than above it. Only metrics that examine the state trajectory detect structural errors, and even the best of them recovers just a third of the downstream gap. High detection accuracy does not guarantee downstream improvement. We release the testbed and all curation implementations.
[LG-61] CLaaS: Continual learning as a service for sample efficient online learning
链接: https://arxiv.org/abs/2606.05559
作者: Kion Fallah,Silen Naihin,Barak Widawsky,Qingqing Mao
类目: Machine Learning (cs.LG)
*备注: 4 pages main content, 7 figures
Abstract:Deployed large language model agents must adapt to distribution shift in dynamic environments. Ideally, adaptation can be performed from accumulated agent experiences and retain prior capabilities while transferring to future tasks. However, agent actions and environmental transitions can only be sampled once per scenario, as real-world environments cannot be trivially reset. To this end, we investigate an experiential and online continual learning setting in which agents learn from a stream of scenarios. We propose continual learning as-a-service (CLaaS), a system which enables agents to improve during deployment, abstracted behind a chat API. To increase sample efficiency, CLaaS stores rollouts in an experience replay buffer for gradient reuse during asynchronous training. We evaluate CLaaS on an adversarial task, demonstrating that parametric updates lead to superior forward transfer and less forgetting than in-context learning, with replay being a critical choice for sample efficiency.
[LG-62] Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents
链接: https://arxiv.org/abs/2606.05558
作者: Kaixuan Liu,Guojun Xiong,Weinan Zhang,Shengpu Tang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Evaluating large language model (LLM) agents in multi-turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation framework that estimates the performance of a new LLM agent policy purely from pre-collected trajectories. The core idea is to learn a latent diffusion world model that simulates how the environment responds to the evaluation policy, without ever executing it in the real environment. Existing diffusion-based OPE methods guide full trajectories in a single pass by jointly diffusing states and actions, an assumption that breaks down for LLM agents whose actions are discrete text that must be sampled from the policy after observing the environment. Unlike autoregressive world models that suffer from compounding errors, ADWM models each transition as an independent denoising process, enabling reliable step-by-step rollouts where the world model and agent alternate in causal order. Crucially, the LLM agent under evaluation directly guides the diffusion generation at each step via a policy-conditioned score function, ensuring that simulated trajectories accurately reflect its decision-making patterns. Empirically, ADWM achieves accurate value estimates and evaluation reliability across diverse multi-turn agent tasks, demonstrating its promise as a practical framework for offline LLM agent evaluation.
[LG-63] Field Validation of a Multi-Resolution ConvLSTM Framework for Retaining Wall Deformation Prediction
链接: https://arxiv.org/abs/2606.05556
作者: Jihoon Kim,Heejung Youn
类目: Machine Learning (cs.LG)
*备注: 40 Pages, 15 figures
Abstract:This study presents a comprehensive field validation of a multi-resolution Convolutional Long Short-Term Memory (ConvLSTM) framework for predicting retaining wall deformation during staged excavation. The framework is trained on Gaussian noise-augmented numerical simulations and integrates ConvLSTM models operating at different temporal resolutions through a stacking ensemble strategy. The proposed framework is validated using field monitoring data from 34 inclinometers across 11 excavation sites in South Korea. Site-wise prediction performance is systematically evaluated using multiple evaluation metrics, with analyses of the influence of temporal deformation irregularity and spatiotemporal prediction characteristics on model performance. The results demonstrate that the framework predicts retaining wall deformation associated with up to 5.0 m of additional excavation with an average mean absolute error of 1.4 mm and a coefficient of determination of 0.93 across the excavation sites. These results indicate that the framework, although trained exclusively on a numerically simulated and augmented database, can be effectively applied to diverse field excavation conditions and achieve a reliable level of prediction accuracy in practical retaining wall deformation prediction.
[LG-64] Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs
链接: https://arxiv.org/abs/2606.05516
作者: Wanhao Yu,Ziyan Wang,Zheng Wang,Abeer Matar Almalky,Yihang Zuo,Shuteng Niu,Sen Lin,Adnan Siraj Rakin,Deliang Fan,Li Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Zeroth-order (ZO) optimization enables memory-efficient fine-tuning of large language models (LLMs) using only forward passes, but it remains unclear how useful adaptation is distributed across layers. In this work, we reveal a surprising phenomenon: ZO fine-tuning is sharply dominated by a single decoding layer. Across multiple LLM families and downstream tasks, fine-tuning this dominant layer alone consistently matches or even exceeds full-model ZO fine-tuning. We further show that the dominant layer is task-agnostic but model-specific, and can be identified before training through a simple inference-only analysis of activation outliers. Specifically, the dominant layer consistently aligns with the first activation-outlier layer in the pre-trained model. To explain this phenomenon, we analyze how perturbation effects propagate under ZO optimization. We find that the dominant layer combines two key properties: high perturbation sensitivity and early placement in the residual stream, allowing perturbation-induced effects to propagate and accumulate through the subsequent decoding layers. As a result, this layer produces disproportionately strong and stable optimization signals under forward-only updates. Extensive experiments on LLaMA2-7B and Qwen3-8B across nine benchmarks show that dominant-layer ZO fine-tuning improves average performance over full-model MeZO and LoRA-based ZO fine-tuning while achieving up to $4.52\times$ training speedup.
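As a rough illustration of what forward-only fine-tuning of one layer means, the sketch below applies a generic two-point (SPSA-style) zeroth-order update to a single named parameter block of a toy model. Everything here is an assumption made for illustration: `zo_step`, `toy_loss`, the layer names, and the hyperparameters are hypothetical, and the code is not the paper's implementation or its procedure for finding the dominant layer.

```python
import random


def zo_step(params, layer, loss_fn, eps=1e-3, lr=1e-2, seed=0):
    """One two-point zeroth-order step that perturbs only params[layer]."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in params[layer]]

    def shifted(scale):
        # Copy the parameter dict and move only the chosen layer along z.
        moved = dict(params)
        moved[layer] = [w + scale * eps * zi for w, zi in zip(params[layer], z)]
        return moved

    # Projected gradient estimate from two forward passes; no backpropagation.
    g = (loss_fn(shifted(+1)) - loss_fn(shifted(-1))) / (2 * eps)
    updated = dict(params)
    updated[layer] = [w - lr * g * zi for w, zi in zip(params[layer], z)]
    return updated


def toy_loss(p):
    """Toy quadratic loss with its minimum at all-zero parameters."""
    return sum(w * w for layer in p.values() for w in layer)


params = {"layer0": [1.0, -2.0], "layer1": [0.5, 0.5]}
start_loss = toy_loss(params)
for step in range(200):
    params = zo_step(params, "layer0", toy_loss, seed=step)
end_loss = toy_loss(params)
```

Only `layer0` is perturbed and updated, so `layer1` keeps its initial values while the loss contribution of `layer0` shrinks. Memory use stays at the level of inference because no gradients are stored.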
[LG-65] LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or “Is Your VLM Smarter Than a 5th Grader?”)
链接: https://arxiv.org/abs/2606.05497
作者: Alvin Wei Ming Tan,David Cardinal,Tania Lorido-Botran,Laura Bravo-Sanchez,Sunny Yu,Michael C. Frank
类目: Machine Learning (cs.LG)
*备注:
Abstract:Given the inherently multimodal nature of human experience, vision-language models (VLMs) hold substantial promise for modeling human cognition as it grows and develops with experience. Realizing their potential requires tools for comparing VLMs with human cognitive development across tasks, ages, and populations. We present LEVANTE-bench, a benchmark based on tasks and data from the Learning Variability Network (LEVANTE), which distributes open-source tasks and data measuring children’s cognition across languages and cultures. In LEVANTE-bench, we systematically assess VLMs on six tasks, comparing their alignment with children aged 5-12 (N = 1547) across three countries. We compare models at multiple scales, assessing their overall accuracy, their task- and item-level alignment with children, and how well they match children’s trial-level error distributions. Alignment was heterogeneous across scales: at the level of tasks and items, more capable models aligned better with humans. However, match to human error distributions varied widely across tasks, and for several tasks, smaller models matched younger children’s errors better. In addition, even the best-performing VLMs struggled on matrix reasoning and mental rotation tasks. Thus, current VLM architectures align only partially with the cognitive abilities of children.
[LG-66] Learned Subspace Compression for Communication-Efficient Pipeline Parallelism ICML2026
链接: https://arxiv.org/abs/2606.05484
作者: Paul Janson,Edouard Oyallon,Eugene Belilovsky
类目: Machine Learning (cs.LG)
*备注: Accepted at the 2nd Workshop on Connecting Low-rank Representations in AI, ICML 2026
Abstract:Pipeline parallelism enables training of large language models that exceed single-device memory, yet inter-stage activation communication becomes the dominant bottleneck when trained on low-bandwidth networks. Recent work in this area has proposed using fixed orthogonal projections to compress activations. However, this still results in a significant performance degradation and requires a number of non-standard adaptations to constrain the optimization. A natural alternative is to learn a low-rank projection for each pipeline stage; however, maintaining the necessary orthogonality of these projectors during training remains a challenge. We present Manifold Aware Projection Learning (MAPL), a method that treats inter-stage compression as a learnable orthogonal projection under explicit Stiefel manifold (orthogonal matrices) constraints. Rather than prescribing a fixed global subspace, MAPL lets each pipeline stage discover and continuously adapt its own task-optimal compression subspace via manifold-constrained steepest descent. To recover token-specific signals at stage boundaries, we introduce per-stage factorized anchor embeddings that allow for full-rank activation reconstruction with negligible communication overhead. We further show that we can incorporate residual vector quantization after projection with a streaming codebook synchronization protocol that amortizes dictionary communication. Across LLaMA models from 150M to 1B parameters we show that MAPL can be easily applied to the existing pipeline and can achieve high compression with negligible performance degradation and a drastically improved tradeoff between performance and compression compared to Subspace Networks.
[LG-67] Sharp First-Order Lower Bounds for Higher-Order Smooth Nonconvex Optimization
链接: https://arxiv.org/abs/2606.05438
作者: Dongruo Zhou
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 24 pages, 1 table
Abstract:We study the deterministic first-order oracle complexity of finding $\epsilon$-stationary points in smooth nonconvex optimization when the objective satisfies higher-order smoothness assumptions. While the classical $\epsilon^{-2}$ rate is optimal under only Lipschitz gradients, higher-order smoothness leads to accelerated first-order upper bounds, most notably the $\epsilon^{-7/4}$ rate under Lipschitz Hessians and the $\epsilon^{-5/3}$ rate under Lipschitz third derivatives. The matching lower bounds, however, have remained open. We resolve this gap by proving a new dimension-free first-order lower bound for higher-order smooth nonconvex functions, valid for every finite smoothness order. In particular, our construction gives a matching $\Omega(\epsilon^{-7/4})$ lower bound in the Hessian-Lipschitz case and a matching $\Omega(\epsilon^{-5/3})$ lower bound in the third-order-smooth regime. The hard instance is based on a block-chain mechanism that enforces blockwise oracle revelation while preserving the smoothness structure needed for the scalar hard instance. The lower-bound construction was discovered with the assistance of ChatGPT 5.5 Pro and subsequently verified by the authors.
[LG-68] DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum
链接: https://arxiv.org/abs/2606.05435
作者: Naima Tasnim,Lalitha Sankar,Oliver Kosut
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 6 pages, 2 tables
Abstract:Differentially private stochastic gradient descent (DP-SGD) has become the standard framework for privacy-preserving machine learning, yet its reliance on a fixed gradient clipping threshold to limit sensitivity remains a significant practical limitation. Adaptive clipping algorithms such as AdaClip shift and scale the gradient prior to clipping and adding noise so that the clipped gradient yields a more informative descent direction. The shift and scaling parameters are selected adaptively based on the empirical mean and variance. However, in existing adaptive clipping algorithms, these empirical estimates have not also been used for momentum to accelerate training itself. On the other hand, DP-Adam is an algorithm that exploits Adam-like momentum updates based on the gradient mean and variance to accelerate training, but does not exploit these estimates for adaptive clipping. In this work, we propose Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum (DP-MacAdam), a novel algorithm that combines these two approaches so as to use the same mean and variance estimates for both clipping and momentum. We perform an analysis showing that DP-MacAdam estimates the gradient variances in a bias-free manner. In addition, we empirically evaluate the privacy and accuracy of DP-MacAdam, demonstrating that it achieves improved model utility compared to DP-SGD, AdaClip, and DP-Adam baselines, without requiring manual tuning of the clipping threshold.
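For context on the fixed-threshold limitation mentioned above, here is a minimal sketch of the standard DP-SGD gradient step (per-example clipping followed by Gaussian noise). It shows the baseline only, not DP-MacAdam or AdaClip. The function name `dp_sgd_gradient`, the sample gradients, and the parameter values are illustrative assumptions, and no privacy accounting is included.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient to clip_norm, sum, add Gaussian noise, average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale down only the gradients whose L2 norm exceeds the fixed threshold.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            total[j] += scale * g[j]
    sigma = noise_multiplier * clip_norm  # noise std is tied to the clipping threshold
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


# Hypothetical per-example gradients with L2 norms 5.0 and 0.5.
grads = [[3.0, 4.0], [0.3, 0.4]]
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0,
                            rng=random.Random(0))
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0,
                        rng=random.Random(0))
```

With `clip_norm=1.0`, the first gradient is scaled to unit norm while the second passes through unchanged. A threshold that is too small discards signal and one that is too large inflates the noise, which is the tuning burden that adaptive clipping methods aim to remove.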
[LG-69] Generalized TV-$\ell_p$ Structured Priors for Bayesian $T_1$ Mapping
链接: https://arxiv.org/abs/2606.05381
作者: Disi Lin,Martin Berggren,Tommy Löfstedt
类目: Machine Learning (cs.LG)
*备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA)
Abstract:We propose an extended family of structured spatial priors that incorporates the total variation (TV) function with $\ell_p$ norms. The prior is proven to be proper and incorporated into a Bayesian regression framework to enable uncertainty quantification in $T_1$ mapping, with posterior inference performed using the No-U-Turn Sampler (NUTS). This TV-$\ell_p$ construction is proven to constitute a well-defined family of prior distributions, and it naturally enforces spatial consistency and smooth variations in the estimated parameter maps. The method was evaluated in comparison to maximum-likelihood estimation and several Bayesian alternative priors based on the uniform, Gamma, and bounded TV priors. The evaluation includes experiments on synthetic brain and cardiac $T_1$ mapping datasets, as well as a real in-vivo breast $T_1$ mapping dataset. The results show that the TV-$\ell_p$ prior yields more concentrated posterior densities, indicating reduced uncertainty. It also consistently achieves lower variance and smaller (negative) bias, leading to more reliable estimates. Overall, embedding a TV-based structured penalty along with $\ell_p$ norms in a prior in a Bayesian model improves spatial coherence in $T_1$ maps and enhances uncertainty quantification, offering a robust approach for $T_1$ mapping with uncertainties.
[LG-70] Learning-Augmented Online Minimization with Dual Predictions
链接: https://arxiv.org/abs/2606.05380
作者: Christian Coester,Alexa Tudose,Alexander Turoczy
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:We present learning-augmented algorithms for two general classes of online minimization problems: metrical task systems and laminar set cover. Both algorithms achieve improved theoretical guarantees using machine-learned predictions of an optimal solution to the dual linear program. Unlike optimal primal solutions, which can change drastically under tiny instance perturbations, these dual solutions are much more stable, which ensures the existence of good (and learnable) predictions for families of similar instances. While previous work has used dual predictions in offline settings and for online maximization problems, our algorithms are, to the best of our knowledge, the first demonstration that such dual predictions can be effective for online minimization. Our theoretical results are complemented by experiments on the k-server problem and the parking permit problem.
[LG-71] SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs
链接: https://arxiv.org/abs/2606.05376
作者: Jingyao Wu,Ashley Wang,Keane Ong,Paul Pu Liang,Rosalind Picard
类目: Machine Learning (cs.LG)
*备注:
Abstract:Many human-centered tasks, including natural language inference (NLI) and emotion recognition (ER), have multiple plausible interpretations, leading to label ambiguity and challenging disagreements across human annotators. As LLMs are increasingly deployed in real-world settings, faithfully modeling such ambiguity is essential to identify contested inputs, preserve variability in ambiguous cases, and capture the full distribution of human judgments. Yet, existing LLM alignment approaches have predominantly assumed a single correct label, excluding annotator disagreement during optimization. Instead of treating this ambiguity as noise, we show how to treat it as information that improves model behavior through a new algorithm called SMARTLY HANDLING AMBIGUOUS LABELS IN ALIGNING LLMS (SHALA-LLM). This reinforcement learning framework provides a new way for LLMs to learn directly from annotator distributions while dynamically prioritizing highly ambiguous samples during optimization. Experiments on ambiguity-sensitive NLI and ER benchmarks, including ChaosNLI, GoEmotions, and MSP-Podcast, demonstrate that SHALA-LLM improves agreement with annotator label distributions, e.g. on ChaosNLI, it reduces Jensen-Shannon Distance by up to 62.1%. At the same time, SHALA-LLM improves F1 by up to 16.7%, showing that modeling annotator disagreement can also strengthen classification performance.
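The Jensen-Shannon Distance reported in this abstract measures how far a predicted label distribution is from the annotators' distribution. The sketch below computes it from first principles with base-2 logarithms, so the value lies in [0, 1]. The paper's exact logarithm base and evaluation code are not stated here, so treat that choice as an assumption; the three example distributions are hypothetical.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence), base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL divergence, skipping zero-probability terms of a (0 * log 0 := 0).
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


annotators = [0.5, 0.3, 0.2]    # hypothetical distribution of human votes
one_hot = [1.0, 0.0, 0.0]       # model that commits to a single label
calibrated = [0.45, 0.35, 0.2]  # model that spreads mass like the annotators

print(js_distance(annotators, one_hot))
print(js_distance(annotators, calibrated))
```

A model that assigns all mass to the majority label can still be accurate in the single-label sense while sitting far from the annotator distribution, which is why distribution-level agreement and F1 are reported separately.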
[LG-72] Evidence-Guided Neural Architecture Selection under Uncertainty for Subject-Specific Blood Glucose Forecasting
链接: https://arxiv.org/abs/2606.05373
作者: Md Azharul Islam,Dwyer Deighan,Tarunraj Singha,Danial Faghihi
类目: Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注:
Abstract:Reliable neural architecture selection is an open challenge in time-series forecasting under limited, noisy, and heterogeneous data, where standard heuristic architecture design and validation approaches fail to ensure accurate and reliable prediction and generalization. We propose EVIDENT (EVidence-based IDEntification of Neural archiTectures), a framework for architecture selection that integrates Bayesian training, evidence-based ranking, and task-specific validation under uncertainty. The framework explores the candidate architecture pool and identifies the lowest-capacity model that satisfies a prescribed validation criterion. We demonstrate this method using temporal convolutional networks (TCNs) for individualized blood glucose forecasting in type 1 diabetes patients. The results show that EVIDENT systematically rejects both under- and over-parameterized TCN architectures on population-level diabetes data, while identifying models that generalize reliably to unseen patients. When multiple architectures are competitive, the framework further supports plausibility-weighted ensemble predictions that enhance predictive performance. Compared with a random-search baseline, EVIDENT identified smaller architectures with more consistent forecasting performance on unseen patients. These findings establish EVIDENT as a strategy for neural architecture discovery, enabling reliable model selection for high-consequence forecasting in data-limited and heterogeneous settings.
[LG-73] Mamba-Assisted Non-Markovian Closure for Reduced-Order Modeling
链接: https://arxiv.org/abs/2606.05371
作者: Zhi-Feng Wei,Saad Qadeer,Panos Stinis
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: Code will be released upon acceptance
Abstract:Reduced-order modeling of high-dimensional dynamical systems is often hindered by the non-Markovian closure term that represents the effect of unresolved variables on the resolved dynamics. Inspired by the Mori–Zwanzig formalism, in which the closure takes the form of a memory functional of the resolved trajectory, we recast closure modeling as a sequence modeling problem and propose the Mamba-Assisted Closure (MAC) framework: a Mamba-based sequence model, trained to predict the closure from the resolved trajectory, is coupled with the reduced-order governing equations through a numerical integrator to advance the resolved variables in time. A key feature of the framework is its exploitation of the dual representation of state-space models – the model is trained in a sequence-to-sequence fashion via the convolutional form, and deployed for step-by-step autoregressive rollout via the recurrent form, yielding both efficient long-trajectory training and constant per-step inference cost. On the viscous Burgers’ equation and the chaotic two-scale Lorenz '96 system, the MAC model substantially outperforms the Markovian reduced-order model, the GRU-based sequence model, and the Wilks method in predictive accuracy and long-time rollout stability.
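The dual representation of state-space models mentioned above can be checked on a toy case. The sketch below uses a scalar, time-invariant linear SSM with hypothetical parameters `a`, `b`, `c`. It is a deliberate simplification and not the MAC model or Mamba itself. The recurrent form updates a hidden state one step at a time, and the convolutional form applies the kernel $K_j = c\,a^j\,b$ to the whole input sequence; both produce the same outputs.

```python
def ssm_recurrent(x, a, b, c):
    """Step-by-step form: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t, with h_{-1} = 0."""
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt
        ys.append(c * h)
    return ys


def ssm_convolutional(x, a, b, c):
    """Same map written as a causal convolution with kernel K_j = c * a**j * b."""
    kernel = [c * (a ** j) * b for j in range(len(x))]
    return [sum(kernel[j] * x[t - j] for j in range(t + 1)) for t in range(len(x))]


x = [1.0, 0.5, -0.25, 2.0]  # hypothetical resolved-trajectory input
y_rec = ssm_recurrent(x, a=0.9, b=0.5, c=2.0)
y_conv = ssm_convolutional(x, a=0.9, b=0.5, c=2.0)
```

The convolutional form processes an entire training sequence at once, while the recurrent form needs only the previous hidden state, which is what gives a constant per-step cost during autoregressive rollout.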
[LG-74] Should Demand Models Incorporate Competitor Prices? Oblivious Learning and Algorithmic Collusion
链接: https://arxiv.org/abs/2606.05363
作者: Yuhang Wu,Assaf Zeevi
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH); Optimization and Control (math.OC)
*备注: Preliminary version “Oblivious Learning, Price Exploration and Collusive Dynamics” accepted at EC 2026
Abstract:On a platform with many sellers, should a pricing algorithm explicitly model competitors’ prices when learning demand? Classical learning arguments suggest an affirmative answer: ignoring competitors induces model misspecification and inefficiency. In contrast, recent work on algorithmic collusion suggests that strategic obliviousness – deliberately ignoring competitor prices – may facilitate collusive outcomes and improve profits. We study this modeling choice in a stylized competitive market with unknown noisy demand, in which multiple sellers repeatedly set prices and estimate demand via iterated least squares, and either incorporate competitors’ prices into their demand models (informed) or ignore them (oblivious). We first show that, relative to a monopolist, an oblivious seller in a competitive market must explore more aggressively to compensate for the loss of dynamic competitor information. Building on this insight, we characterize market dynamics when all sellers are oblivious and show that prices converge to the competitive outcome under sufficient exploration, while a continuum of pseudo-equilibria arises when exploration decays. Analyzing the resulting price trajectories, we uncover an excursion phenomenon that gives rise to transient collusive patterns that dissipate as learning progresses. In markets with both oblivious and informed sellers, the informed strictly out-earn the oblivious. Read as a strategy game, the modeling choice has a unique Nash equilibrium: the all-informed market, in which prices converge to the competitive outcome efficiently. Overall, our results indicate that collusive patterns are not robust and are not sustained by oblivious modeling; therefore, incorporating competitor information, together with sufficient price exploration, remains a reliable strategy for sellers in competitive markets.
[LG-75] PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention
链接: https://arxiv.org/abs/2606.05345
作者: Yaobo Zhang
类目: Machine Learning (cs.LG)
*备注: 26 pages, 6 figures, 10 tables. Code available at this https URL
Abstract:We unify RoPE’s Fourier phase, Jordan-RoPE’s finite jets, and ALiBi’s affine recency into a single learnable relative-position space, and study which regions of this space are selected by different tasks. PJ-RoPE is a Fourier-Jet-Affine formulation for relative attention, with an optional Poincare-type reading as the affine completion of a homogeneous Fourier-jet positional representation. Algebraically, the same primitives form a finite constant-coefficient difference module: simple roots of the lag-shift operator give Fourier/RoPE characters, repeated nonzero roots give Jordan/Fourier jets, and the repeated unit root gives ALiBi-like affine recency. The framework separates scalar PJ-bias kernels from exact PJ-rotary feature transforms, introduces adaptive sector diagnostics, and uses LC/rapidity coordinates to stabilize high-order jets. Controlled probes verify sector containment and selection; small language runs expose an affine/recency boundary; music-token streams provide the clearest case where LC/affine variants remain strong while carrying measurable high-order corrections; and LC diagnostics show a scale-stability gain coupled to phase-resolution loss.
[LG-76] A prism hierarchy of learning regimes in large linear autoencoders NEURIPS’2026
Link: https://arxiv.org/abs/2606.05335
Authors: Eugene Golikov,Yaroslav Gusev,Dmitry Yarotsky
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 33 pages, under review for NeurIPS’2026
Abstract:Theoretical studies of machine learning models commonly consider different limiting regimes in which the learning dynamics of gradient descent becomes theoretically tractable. It is, however, desirable to have a systematically obtained picture of all qualitatively different extreme learning regimes for a particular type of models. In this paper we propose such a picture for large weight-tied linear autoencoders characterized by input and latent dimensions, initialization magnitude, and training set size. This model is nonlinear in the weights and its gradient flow does not have a general theoretical solution. We show that at the level of the formal loss-expansion hierarchy, its extreme regimes are naturally associated with faces of a triangular prism. In particular, there are five basic extreme regimes associated with the 2-faces of the prism: (1) large-data, (2) small-data, (3) mean-field, (4) narrow-latent, and (5) free. For regimes (1,2,3,4), we derive explicit expressions for both train and population limiting loss evolutions under gradient flow, obtaining very good agreement with experimental results.
[LG-77] Multimarginal flow matching with optimal transport potentials
Link: https://arxiv.org/abs/2606.05327
Authors: Raghav Kansal,David Crair,Nghia Nguyen,Scott Pope,Bradley Parry
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments: 9 pages, 3 figures, 4 tables, and a 27 page appendix. Accepted to the Forty-Third International Conference on Machine Learning
Abstract:Flow matching (FM) has emerged as a powerful framework for learning dynamic transport maps between two empirical distributions. However, less explored is the setting with intermediate observed marginals that can help constrain the flows between the endpoints. This “multimarginal” regime is central to modeling temporal evolution in dynamical systems in many scientific domains that can sample sequential distributions. We tackle this problem with a novel approach that leverages the connection between FM and dynamic optimal transport (OT), softly steering the flow towards the intermediate marginals through potential terms in the dynamic OT action. By extending the conditional FM learning target to incorporate these potentials, we derive an efficient, simulation-free algorithm for multimarginal FM that offers considerable flexibility in the spatiotemporal dynamics of the learned flows. We demonstrate state-of-the-art performance and training efficiency of OT-potential FM (OTP-FM) on diverse single-cell RNA sequencing, oceanographic, and meteorological datasets. Our code is available at this https URL.
[LG-78] Anomaly Detection for Electro-Hydrostatic Actuators using LSTM Autoencoder
Link: https://arxiv.org/abs/2606.05274
Authors: Nehal Afifi,Abdelmonem Elhendawi,Felix Leitenberger,Nadine Piat,Sven Matthiesen
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, 3 tables, ESREL 2026 - European Safety and Reliability Conference, accepted paper to be published
Abstract:Electro-Hydrostatic Actuators (EHAs) are widely used in aerospace and industrial systems, where timely detection of sensor anomalies is essential to ensure safe and reliable operation. However, the large volume and high sampling frequency of EHA sensor data pose challenges for accurate and efficient anomaly detection. Conventional statistical and classical machine-learning methods such as Z-score, Interquartile Range (IQR), Median Absolute Deviation (MAD), Isolation Forest, Gaussian Mixture, and k-means often fail to capture the temporal dependencies inherent in EHA signals, resulting in limited detection accuracy and elevated false-alarm rates. Furthermore, systematic evaluations of data-driven anomaly detection approaches for EHA systems remain scarce, particularly under varying operational conditions. This study presents an offline anomaly-detection framework for univariate EHA sensor signals, focusing on temperature and pressure data collected from a controlled test bench. The method employs a reconstruction-based Long Short-Term Memory (LSTM) autoencoder, calibrated and evaluated using validation-set reconstruction-error distributions. Performance is assessed across multiple fault-injection scenarios using accuracy, precision, recall, and F1-score, complemented by sensitivity analyses under varying operating conditions. The LSTM autoencoder achieved an average accuracy of 99.0%, precision up to 100%, recall between 90.2% and 99.6%, and F1-scores from 93.1% to 99.8%, demonstrating high detection sensitivity and a very low false-alarm rate across all evaluated sensors. These results highlight the feasibility of data-driven offline anomaly detection for EHAs. Future work will focus on adapting the developed framework for an online (real-time) environment.
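For context on the classical baselines the abstract lists, here is a minimal sketch of a Median Absolute Deviation (MAD) detector. It is not the paper's LSTM autoencoder. The function name, the sample readings, and the 3.5 cutoff (a common rule of thumb for the modified z-score) are assumptions chosen for illustration.

```python
import statistics

def mad_anomalies(signal, threshold=3.5):
    """Return indices whose modified z-score exceeds `threshold`."""
    median = statistics.median(signal)
    abs_dev = [abs(x - median) for x in signal]
    mad = statistics.median(abs_dev)
    if mad == 0:
        return []  # degenerate case: at least half the samples equal the median
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # when the data are normally distributed.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]

# Hypothetical temperature readings with one injected spike at index 5.
readings = [40.1, 40.3, 39.9, 40.0, 40.2, 55.0, 40.1, 39.8, 40.0, 40.2]
print(mad_anomalies(readings))
```

The score of each sample depends only on its value, not on its position in the sequence, which is one way to see why such detectors cannot capture the temporal dependencies the abstract mentions.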
[LG-79] Learning Manifold and Itô Dynamics with Branched Neural Rough Differential Equations ICML2026
Link: https://arxiv.org/abs/2606.05272
Authors: Luke Thompson,Dai Shi,Lequan Lin,Junbin Gao,Andi Han
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML 2026
Abstract:Neural rough differential equations (NRDEs) stay accurate under irregular sampling while taking far fewer integration steps than standard neural differential equations, summarising a finely sampled driver by its log-signature and advancing the hidden state over coarse intervals using the log-ODE method. This efficiency rests on the shuffle algebra, the algebraic counterpart of Stratonovich calculus. This reliance means NRDEs cannot expose the quadratic-variation terms Itô dynamics require, nor the ordered covariant derivatives that govern Itô flows on connection-equipped manifolds. Ameliorating this, we introduce Branched Neural Rough Differential Equations (B-NRDEs), a Hopf-algebraic framework that recasts the NRDE log-ODE step as geometric numerical integration on the state-space manifold, matching the driving algebra to the governing calculus: Grossman–Larson rooted trees for Euclidean Itô dynamics, Munthe-Kaas–Wright planar rooted trees for ordered covariant derivatives on manifolds, and the shuffle algebra in the classical Stratonovich case. This yields intrinsic coarse-step dynamics that exactly preserve manifold constraints. Finally, we introduce a branched signature-kernel objective to enable Itô-consistent law matching by making quadratic-variation terms visible during training. On rough Bergomi volatility, sim-to-real $\mathrm{SO}(3)$ dynamics forecasting, and SPD covariance dynamics, B-NRDEs offer a unified, effective approach to stochastic and manifold-valued dynamics beyond the Euclidean–Stratonovich setting.
[LG-80] Aggregating LLM-Based Weak Verifiers for Spatial Layout Generation
Link: https://arxiv.org/abs/2606.05268
Authors: Sharon Zhang,R. Kenny Jones,Jiajun Wu,Maneesh Agrawala
Categories: Graphics (cs.GR); Machine Learning (cs.LG)
Comments:
Abstract:We present a pipeline for building and aggregating task-specific, LLM-generated weak (imperfect) verifiers into a strong verifier for spatial layout domains. Given a task description, our pipeline asks an LLM to synthesize a collection of verifier programs using a layout verification DSL. Each individual LLM-generated verifier usually provides an imperfect check for a match between the layout and the corresponding task description. We show that by aggregating the responses of many such verifiers we can produce a stronger verifier. Moreover, by applying techniques from weak learning, our pipeline can learn how to aggregate the weak verifiers from a very sparse set of human labeled example layouts (about 10). We find that the strong verifiers produced by our pipeline outperform the status-quo approach of using a set of LLM judges to directly check whether a layout matches a task description, raising F1-scores by up to 7X across a variety of 3D room layout and 2D poster design tasks. We also demonstrate that verifier-guided layout generation using natural language feedback from our strong verifiers improves layout quality of a base layout generator by up to 66.2% according to a human evaluator.
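The abstract does not say which weak-learning technique is used for aggregation. As one plausible illustration only, the sketch below weights each verifier by the log-odds of its smoothed accuracy on about 10 labeled layouts and takes a weighted vote. All names and the toy data are hypothetical.

```python
import math

def fit_weights(votes, labels):
    """votes[i][j] is verifier j's verdict (True/False) on labeled layout i."""
    n_verifiers = len(votes[0])
    weights = []
    for j in range(n_verifiers):
        correct = sum(v[j] == y for v, y in zip(votes, labels))
        # Laplace smoothing keeps the log-odds finite with only ~10 labels.
        acc = (correct + 1) / (len(labels) + 2)
        weights.append(math.log(acc / (1 - acc)))
    return weights

def strong_verdict(weights, verdicts):
    """Weighted vote: a verifier with negative weight is effectively inverted."""
    score = sum(w if v else -w for w, v in zip(weights, verdicts))
    return score > 0

# Toy data: verifier 0 is always right, verifier 1 is wrong on three layouts,
# verifier 2 is always wrong.
labels = [True, False, True, True, False, False, True, False, True, False]
flip = {0, 4, 8}
votes = [(y, (not y) if i in flip else y, not y) for i, y in enumerate(labels)]
weights = fit_weights(votes, labels)
print(strong_verdict(weights, (True, False, False)))
```

In this toy setup an unweighted majority vote would reject the layout with verdicts `(True, False, False)`, whereas the weighted vote accepts it because the consistently wrong verifier receives a negative weight.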
[LG-81] Sharp Low-Degree Thresholds for Planted-vs-Planted Testing
Link: https://arxiv.org/abs/2606.05266
Authors: Anda Skeja,Daniel Gutiérrez Espinoza,Fiona Skerman,Alexander S. Wein
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Combinatorics (math.CO); Probability (math.PR); Statistics Theory (math.ST)
Comments:
Abstract:We establish the first sharp thresholds for low-degree polynomial tests in planted-vs-planted settings, where the goal is to determine with vanishing error which of two structured planted mechanisms generated the observed data. We prove matching low-degree upper and lower bounds for counting communities in the planted submatrix and planted dense subgraph models. The resulting testing threshold coincides, down to the sharp constant, with the known low-degree recovery threshold. In contrast, the task of weak testing, where the goal is to outperform random guessing, does not have a sharp threshold but rather a smooth transition, which we identify. To prove our results, we develop a framework for planted-vs-planted testing that builds on a latent-variable expansion originating in low-degree recovery and employs new methods to identify and prune non-signal contributions.
[LG-82] Data-efficient flood depth prediction through domain-aware coreset selection and tabular foundation models
Link: https://arxiv.org/abs/2606.05265
Authors: Lipai Huang,Adithi Srinath,Manas Singh,Junwei Ma,Ali Mostafavi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Near-real-time flood depth prediction demands surrogate models that are accurate, fast, and transferable across watersheds. Supervised surrogates can match physics-based simulators in accuracy but need millions of training rows per watershed and cannot extrapolate beyond their original mesh. We propose a domain-aware coreset construction pipeline that conditions a tabular foundation model at inference time. The pipeline stratifies storms by return period and most-affected watershed, then samples hexagons with a target-aware spatial selector. With 0.7% of the per-watershed training pool, the model attains a mean $R^2$ of 0.663 across nine Houston-area watersheds, which is 98.5% of the supervised reference ($R^2 = 0.673$). It transfers to held-out watersheds without task-specific retraining, staying ahead of a coreset-trained supervised baseline. On real storms it exceeds the supervised reference on a far out-of-distribution case and trails it on a mostly in-distribution one. Domain-aware coreset construction lets tabular foundation models deliver data-efficient, watershed-transferable flood predictions without per-watershed training.
[LG-83] REGEN: Reference-Guided Synthetic Multivariate Time Series Generation for Forecasting
Link: https://arxiv.org/abs/2606.05264
Authors: Moulik Gupta(1),Dhruv Kumar(1 and 2),Murari Mandal(1 and 3),Saurabh Deshpande(1) ((1) Birla AI Labs, (2) Birla Institute of Technology and Science, Pilani, (3) Kalinga Institute of Industrial Technology)
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Training robust multivariate time series forecasting models requires large, diverse corpora, yet many real-world domains provide only a handful of observed sequences. Existing generators fail to resolve this mismatch: prior-based approaches (e.g., CauKer, TimePFN) produce domain-agnostic samples, while data-driven methods (e.g., TimeGAN) treat references as black-box supervision, forfeiting explicit control over periodic structure, local variability, and cross-variable dynamics. We propose ReGeN, a reference-guided generative pipeline that treats observed sequences not as examples to imitate, but as structural scaffolds for controllable synthesis. ReGeN decomposes each reference into three interpretable components: a phase-aligned periodic backbone capturing dominant domain morphology; per-variable stochastic residuals modeled with a deep-kernel Gaussian process; and lag-aware cross-variable dependencies injected through a structural causal model with fitted coupling coefficients. Sampling these components at controllable temperature broadens distributional coverage while preserving domain-grounded structure. We show that ReGeN-generated data consistently substitutes for real sibling data with minimal forecasting degradation, and in strongly periodic domains such as traffic, can outperform the real source itself. We further show that a foundation model pretrained on ReGeN corpora outperforms those pretrained on prior-based and data-driven synthetic alternatives. This suggests that in low-data regimes, how reference data is structurally exploited can matter as much as how much data is available.
[LG-84] Alpha-RTL: Test-Time Training for RTL Hardware Optimization
Link: https://arxiv.org/abs/2606.05253
Authors: Peilong Zhou,Zhirong Chen,Cangyuan Li,Haoyu Gao,Kaiyan Chang,Ziming Qu,Ying Wang
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 5 figures
Abstract:Large language models (LLMs) have shown increasing promise in generating functionally correct register-transfer-level (RTL) hardware designs. Recent systems improve further through EDA-integrated reinforcement learning with syntax, simulation, and PPA rewards, but train a general RTL generator before deployment while test-time approaches search with a frozen policy. We instead perform reinforcement learning at test time, allowing the LLM policy to adapt to executable EDA feedback for the specific RTL problem at hand. We propose TTT-RTL, to our knowledge the first per-design test-time training framework that closes the loop between an LLM policy and an EDA pipeline for RTL optimization. TTT-RTL samples candidate implementations, verifies them through syntax checking and simulation, scores valid designs using synthesis-derived PPA product, reuses high-reward variants through a PUCT-indexed design-state pool, and updates the policy with an entropic policy-gradient objective. To stabilize policy updates under sparse or plateaued rewards, we introduce an adaptive KL-budget controller that adjusts the entropy constraint using reference KL, effective sample size, and reward saturation signals. On RTLLM v2.0 under Nangate 45nm, TTT-RTL reduces the geometric-mean PPA product by 65.1% over the reference, outperforming the strongest published frozen-policy agent baseline at 26.1%. On an industrial XuanTie C910 FPU leading-zero-anticipation unit under Sky130, TTT-RTL achieves a 59.4% ADP reduction, and ablations confirm that policy adaptation, state reuse, and KL-budget control each contribute. These results suggest that test-time training with executable EDA feedback can move LLM-based RTL generation beyond functional correctness toward physically optimized hardware. 
[LG-85] DiffSlack: Learning under Nonlinear Inequality Constraints via Learnable Slack Variables
Link: https://arxiv.org/abs/2606.05247
Authors: Ziqian Wang,Chenxi Fang,Zhen Zhang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Enforcing nonlinear inequality constraints in neural networks remains challenging, especially when the output is subject to many coupled constraints. Existing hard constraint methods often impose structural restrictions on the constraint set or introduce substantial computational overhead for large-scale nonlinear problems. Here, we propose DiffSlack, a differentiable projection layer for nonlinear inequality-constrained neural prediction. DiffSlack reformulates inequalities as equalities with learnable slack variables, which are predicted as part of the augmented network output and provide a data-driven warm start for damped Gauss-Newton projection. The projection layer maps raw predictions onto the augmented feasible manifold while preserving end-to-end differentiability. A two-stage curriculum further stabilizes training and improves constraint satisfaction. We evaluate DiffSlack on vehicle path planning with 200 nonlinear inequality constraints from collision avoidance, curvature limits, and waypoint spacing. Compared with existing learning-based baselines, DiffSlack achieves a higher planning success rate and stronger geometric constraint satisfaction under a comparable inference budget. Ablation studies further show that the hard projection layer reduces sensitivity to supervision quality. Closed-loop tracking in CARLA and real-world vehicle experiments confirms the executability of the generated trajectories. These results demonstrate that DiffSlack provides a practical and scalable approach to embedding hard inequality constraints into neural networks for engineering applications.
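To make the slack-variable idea concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It uses a single hypothetical constraint (stay inside the unit disk, $\|x\|^2 - 1 \le 0$), a squared-slack reformulation $h(x, s) = \|x\|^2 - 1 + s^2 = 0$ (the abstract does not specify the exact slack form), and a damped minimum-norm Gauss-Newton step. The initial `s` stands in for the slack the network would predict.

```python
def project(x, s, damping=1e-3, iters=50):
    """Project (x, s) onto h(x, s) = ||x||^2 - 1 + s^2 = 0 by damped Gauss-Newton."""
    z = list(x) + [s]  # augmented variable: 2-D output plus one slack
    for _ in range(iters):
        h = z[0] ** 2 + z[1] ** 2 - 1.0 + z[2] ** 2   # equality residual
        grad = [2.0 * z[0], 2.0 * z[1], 2.0 * z[2]]   # Jacobian J (a single row)
        jjt = sum(g * g for g in grad)                # J J^T is a scalar here
        step = h / (jjt + damping)                    # damped (J J^T)^{-1} h
        z = [zi - gi * step for zi, gi in zip(z, grad)]  # z <- z - J^T * step
    return z[:2], z[2]

# An infeasible raw prediction (outside the disk) with a small slack guess.
x_proj, s_proj = project([2.0, 0.0], 0.1)
print(x_proj, s_proj)
```

Because $s^2 \ge 0$, any point with $h = 0$ satisfies the original inequality. In this sketch the slack guess also determines where on the feasible set the projection lands, which gives some intuition for why a learned warm start for the slack could matter.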
[LG-86] A New Quaternion-Joint Cable-Driven Redundant Manipulator Configuration and its Control Through FABRIK and Residual Reinforcement Learning
Link: https://arxiv.org/abs/2606.05236
Authors: Tanapath Pornthisan,Thanapat Kemthong,Thanyapisit Kangsathien,Pasut Aranchaiya,Paulo Garcia,Viboon Sangveraphunsiri
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Robotic arms capable of traversing arbitrary spatial paths, especially in highly obstructed workspaces, are highly desired across several industries. Quaternion-joints have recently empowered a specific class of robotic arms – cable-driven redundant manipulators – beyond their prior capabilities. Specifically, quaternion-joints reduce the number of required motors per degree of freedom, paving the way for more compact this http URL ongoing challenge is that the complexity of the kinematic model of quaternion joints challenges a priori decisions on manipulator configurations and imposes higher computational demands on the control system, and its non-linearities amplify all discrepancies between design and physical artifact arising from fabrication imprecision. Here we show that a 4-segment, 8-joint manipulator can achieve a broader workspace than extant configurations, at lower hardware cost, and that Residual Reinforcement Learning outperforms extant state-of-the-art methods – specifically, the FABRIK algorithm – on the control of such a manipulator. Our results show that this configuration is more workspace-effective than prior designs, and that Residual Reinforcement Learning outperforms FABRIK by three orders of magnitude on positional and orientational accuracy, effecting precise control of the novel 4-segment, 8-joint manipulator. Additionally, the control implementation is simpler: we describe the complete FABRIK process for control and corresponding learning implementation. Our methodology is applicable to the design of new systems, providing designers with further tools for the development of this class of manipulators and corresponding control systems for novel configurations.
[LG-87] OLIVE: Online Low-Rank Incremental Learning for Efficient Adaptive Exoskeletons
Link: https://arxiv.org/abs/2606.05234
Authors: Dong Liu,Yanxuan Yu,Ben Lengerich,Tony Geng,Ying Nian Wu
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Wearable exoskeleton systems hold promise for restoring mobility in individuals with physical impairments, yet most existing controllers rely on static gait policies that lack the ability to adapt to dynamic real-world environments or individual user characteristics. We present OLIVE (Online Low-rank Incremental Learning for Efficient Adaptive Exoskeletons), a parameter-efficient online adaptation framework that continuously personalizes exoskeleton control during deployment. OLIVE decomposes the adaptive component of the control policy into a low-rank residual form $\Delta W = A_t B_t^\top$ with rank $r \ll \min(d,k)$, reducing online update cost from $\mathcal{O}(dk)$ to $\mathcal{O}(r(d+k))$ while preserving the stability of a pretrained base controller $W_0$. Parameters are updated via a reward-shaped policy gradient driven purely by on-body sensor feedback (EMG, IMU, vibration), eliminating dependence on offline reference trajectories. A gating mechanism modulates the strength of personalization based on contextual state, and a dynamic rank scheduler adapts the update dimensionality to terrain complexity – allocating minimal capacity on simple flat terrain and expanding to higher-rank updates on demanding uneven surfaces – enabling robust performance across diverse activities: flat walking, stair navigation, slopes, and uneven terrain. Experiments on the wearable platform demonstrate that OLIVE achieves +13, +22, and +15 percentage-point improvements in gait smoothness, effort reduction, and motion stability over the strongest baseline, converging within approximately 1,800 walking steps at 7.4 ms end-to-end latency. Our code implementation is available at this https URL.
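The cost reduction claimed for the low-rank residual can be illustrated with a short sketch. It assumes a frozen base matrix `W0` of shape d x k, factors `A` (d x r) and `B` (k x r), and a scalar `gate`. These names and the scalar gate are illustrative assumptions, not the paper's code.

```python
def matvec(M, v):
    """Plain matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def adapted_output(W0, A, B, x, gate=1.0):
    """y = W0 x + gate * A (B^T x), without ever forming the d x k matrix A B^T."""
    base = matvec(W0, x)  # frozen pretrained controller
    # B is k x r, so B^T x is an r-vector: O(r k) work.
    bt_x = [sum(B[j][c] * x[j] for j in range(len(x))) for c in range(len(B[0]))]
    delta = matvec(A, bt_x)  # A is d x r: O(r d) work
    return [b + gate * dl for b, dl in zip(base, delta)]

# Trainable parameter counts for hypothetical sizes d=64, k=32, r=4.
d, k, r = 64, 32, 4
print(d * k, r * (d + k))  # 2048 vs 384

print(adapted_output([[1, 0], [0, 1]], [[1], [2]], [[3], [4]], [1, 1]))
```

Setting `gate` to zero recovers the base controller exactly, which is one simple way a gating term can limit how much personalization is applied.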
[LG-88] State commitment learning: training language models to distinguish computation from memory
Link: https://arxiv.org/abs/2606.05201
Authors: Fei Ding,Yongkang Zhang,Runhao Liu,Yuhao Liao,Zijian Zeng,Huiming Yang
Categories: Machine Learning (cs.LG)
Comments: 17 pages
Abstract:Reasoning language models do not distinguish tokens used for computation from tokens that constitute persistent state: once generated, all hidden thoughts remain in context and influence future predictions. As a result, downstream reasoning may depend on failed attempts, dead ends, and private scratch work that should not be safely relied on later. We recast this phenomenon as a new training objective, state commitment learning: training models to explicitly distinguish information that should be committed as persistent state from temporary computation that can be discarded. We define a counterfactual criterion, persistent-state sufficiency, which makes it trainable and measurable whether an answer remains usable after hidden thoughts are erased. We then propose Counterfactual Erasure RL (CERL), which evaluates, under the same prefix, both a path that keeps hidden thoughts and a path that erases them, and gives reward only when the erasure path remains correct. We also introduce the Erasure Dependence Protocol and show across mathematics, long-chain logic, scientific QA, and multi-turn tool-use evaluation that CERL substantially reduces answer dependence on hidden thoughts without sacrificing accuracy, consistently outperforming correctness-only RL and long-answer SFT baselines.
[LG-89] PyCC.id: A package for hypothesis-driven equation discovery with structural identifiability
Link: https://arxiv.org/abs/2606.05191
Authors: Federico J. Gonzalez
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: The software package is available at: this https URL
Abstract:Data-driven equation discovery is fundamentally an inverse problem that seeks to infer the governing differential equations of a system directly from time-series measurements. A known issue is the ill-conditioned nature of the inverse problem, which frequently produces multiple mathematical models that fit the data similarly well. One path to address this issue is by incorporating known hypotheses and constraints into the training phase beforehand. While this approach effectively reduces the search space, it still results in multiple candidate models, forcing practitioners to rely on post-hoc manual filtering based on their own domain expertise. A recent approach incorporates structural ‘skeletons’ inspired by characteristic curves (CCs), defining a hypothesis-driven methodology. In this methodology, practitioners define a skeleton, which is associated with a family of ordinary differential equations (ODEs), and then add their hypotheses and priors based on their domain knowledge to refine the obtained model iteratively. An important advantage of this approach is that some skeletons have demonstrable structural identifiability properties, which are useful for checking whether the skeleton is correct or should be discarded. Furthermore, this formalism enables the use of multiple equation discovery paradigms due to its modularity (such as neural networks, symbolic regression, and sparse regression). In this work, we present the Python library PyCC, which condenses these efforts into a flexible tool that allows researchers and engineers to seamlessly define their skeletons and hypotheses to discover ODEs from time-dependent data.
[LG-90] ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models
Link: https://arxiv.org/abs/2606.05170
Authors: Jason Z Wang
Categories: Machine Learning (cs.LG)
Comments: 28 pages, 12 figures, appendix and checklist
Abstract:At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution – a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all errors as equivalent, yet a wrong date and a fabricated court ruling differ by orders of magnitude. We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. For each model we estimate a severity distribution index (b, the Gutenberg-Richter upper-tail slope) with 95% bootstrap confidence intervals. Headline: across the 210 model pairs, 85 have disjoint 95% b confidence intervals at matched accuracy (|Delta epsilon| < 0.05) on human-consensus scoring, e.g. deepseek-v3.2 vs. ministral-14b at epsilon = 0.586 and Delta b = 0.47. A 519-item three-rater human validation study confirms measurement reliability (ICC(2,k=3) = 0.85), validates the LLM-judge ranking (rho = 0.89), and confirms the dense-model scaling correlation on human data (rho_s = -0.86). We prove a Non-Reducibility Theorem showing that severity profile and error rate are informationally non-redundant (I(b; model | epsilon) = 1.56 bits; 64.5% of cross-model b variance is unexplained by epsilon). A severity mechanism taxonomy (kappa = 0.83) reveals that error type shifts categorically with severity: low-severity errors are retrievals (71%); high-severity errors are fabrications (39%) – and this composition differs by model size (p < 0.0001). Severity distribution should be reported alongside accuracy; it carries discriminative information that the error rate cannot.
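As an illustration of what a Gutenberg-Richter slope with a bootstrap confidence interval could look like in code, here is a minimal sketch. It assumes the standard Aki maximum-likelihood estimator, $b = \log_{10}(e) / (\bar{s} - s_{\min})$, and a percentile bootstrap. The paper may use a different estimator, and the synthetic, unbounded severities below are not benchmark data.

```python
import math
import random

def b_value(severities, s_min):
    """Aki maximum-likelihood estimate of the Gutenberg-Richter slope b."""
    tail = [s for s in severities if s >= s_min]
    mean_excess = sum(tail) / len(tail) - s_min
    return math.log10(math.e) / mean_excess

def bootstrap_ci(severities, s_min, n_boot=200, seed=0):
    """Percentile 95% confidence interval from resampling with replacement."""
    rng = random.Random(seed)
    tail = [s for s in severities if s >= s_min]
    estimates = sorted(
        b_value(rng.choices(tail, k=len(tail)), s_min) for _ in range(n_boot)
    )
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]

# Synthetic severities whose tail follows N(>= s) ~ 10^(-b s) with b = 1.
rng = random.Random(1)
sample = [1.0 + rng.expovariate(1.0 * math.log(10)) for _ in range(5000)]
b_hat = b_value(sample, 1.0)
lo, hi = bootstrap_ci(sample, 1.0)
print(b_hat, lo, hi)
```

Two models would then count as having disjoint intervals when one model's upper bound lies below the other's lower bound. The number of pairs in the abstract follows from 21 models: 21 x 20 / 2 = 210.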
[LG-91] The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models
Link: https://arxiv.org/abs/2606.05169
Authors: Jason Z Wang
Categories: Machine Learning (cs.LG)
Comments: 55 pages, 3 figures, 3 tables, extensive appendix with proofs
Abstract:We give a stereological theory of LLM benchmark coverage. For any suite with effective dimensionality d_eff, the visible Hausdorff distance between two convex capability profiles consistent with the same scores is bounded by epsilon + C R m^(-1/(d_eff-1)), with matching Lipschitz lower bound. Empirically, three independent leaderboards (Open LLM v2, an extended 12-benchmark suite, LiveBench) all have d_eff in [2.86, 4.80] on their competitive frontier; the structural blind spot exceeds the observed runner-up score gap by two orders of magnitude and dominates statistical noise by 52-127x. Under a chi-squared projection model, the isotropic prior is the optimistic case; across six hidden-capability priors and four ambient dimensions the simulated half-split swap rate of the top two models stays in [0.38, 0.49], and a 500-trial random visible/held-out split shows that 92% of trials swap the top-1 ranking with on average 2.83 of 5 top-5 models changing. A submodular greedy algorithm with the Nemhauser (1 - 1/e) guarantee finds a stable core of 4 benchmarks; 7 of 12 suffice for 90% coverage, and the trained subset transfers across temporal quarters with 93-97% retention. A counterfactual validation across 12 internal benchmarks and 27 Chatbot Arena categories confirms that the eigenstructure predicts which evaluations are irreplaceable (rho = -0.69, p = 0.013 for removal disruption) and which external evaluations bring new information (rho = +0.38). As a second, independent theoretical contribution, we resolve Gardner’s Problem 1.5 (1995) for C^2 support functions, establishing the minimax rate Theta(R/(kappa m^(2/(D-1)))) in general dimension via optimal recovery theory on S^(D-1).
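The greedy selection step can be sketched with a simple coverage objective. The paper's actual objective is not given in the abstract, so this sketch uses set coverage, which is monotone and submodular and therefore falls under the classical (1 - 1/e) greedy guarantee. The benchmark names and capability ids are hypothetical.

```python
def greedy_select(coverage, budget):
    """Pick up to `budget` benchmarks, each time taking the largest marginal gain.

    `coverage` maps a benchmark name to the set of capability ids it probes.
    """
    chosen, covered = [], set()
    for _ in range(budget):
        best = max(
            (name for name in coverage if name not in chosen),
            key=lambda name: len(coverage[name] - covered),
            default=None,
        )
        if best is None or not coverage[best] - covered:
            break  # no remaining benchmark adds new coverage
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

coverage = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1, 6}}
chosen, covered = greedy_select(coverage, budget=2)
print(chosen, covered)
```

After picking `A`, benchmark `B` adds only one new capability while `C` adds two, so the greedy rule prefers `C` even though `B` is larger in isolation.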
[LG-92] How abundant are good interpolators?
Link: https://arxiv.org/abs/2606.06469
Authors: August Y. Chen,Ahmed El Alaoui
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR)
Comments: 140 pages
Abstract:Let $S$ be the set of unit norm linear classifiers $\theta \in \mathbb{R}^d$ which correctly classify every point of a labeled dataset $(X_i,y_i)_{i=1}^n$, $X_i \in \mathbb{R}^d$, $y_i \in \{-1,+1\}$, with a possibly negative margin $\kappa$ fixed in advance. Under two natural data-generating distributions of the $(X,y)$ pairs – a Gaussian mixture model and a logistic model with Gaussian features – and in the proportional regime $n/d \to \alpha$ with small enough $\alpha$, we establish a large deviation principle on the event that a point $\theta$ chosen uniformly at random from $S$ achieves a given generalization error, with high probability over the choice of the data. The associated large deviation rate function is deterministic and describes the proportion, at the exponential scale in $d$, of interpolating classifiers having a given desired performance. As a consequence, we establish the following concentration phenomenon: all but an exponentially small fraction of interpolating classifiers have approximately the same generalization performance given by the unique maximizer of this rate function. We numerically compare this maximizer to the performance of empirical risk minimization by gradient descent and to the performance of a natural linear program, both finding a point in $S$, and deduce that in the overparametrized regime of small $\alpha$, these efficient procedures outperform the vast majority of interpolators, pointing to their nontrivial benign overfitting in this setting.
[LG-93] Conformal Risk Sharing: Certified Cost Allocation with Participation Guarantees
Link: https://arxiv.org/abs/2606.06391
Authors: Ieva Kazlauskaite
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Sharing the financial impact of rare adverse events across a group can soften extreme individual burdens, but any participant made worse off by the arrangement has reason to leave. A credible mechanism must therefore provide each agent with a trustworthy cap on their future obligation and should be deployed only if the aggregate harm across participants is bounded. We formalise this as the Certified Allocation Problem: from finite data and without distributional assumptions, find a redistribution rule, produce obligation caps for every participant, and verify that no participant is made materially worse off. We propose Conformal Risk Sharing, which solves this problem by pairing an interpretable sharing policy with split conformal calibration. The sharing intensity is tuned on training data, while held-out calibration data produces distribution-free per-agent guarantees (valid under exchangeability). Experiments on synthetic and real-world data, including precipitation and energy-cooperative data, confirm that the framework can substantially reduce extreme obligations for high-risk agents while controlling harm to others.
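The split conformal calibration step mentioned in the abstract above can be illustrated generically. The sketch below is not the authors' mechanism; it only shows the standard split conformal upper bound, assuming one agent's held-out calibration losses are exchangeable with its future loss. The function name `conformal_cap` and the toy data are hypothetical.

```python
import math

def conformal_cap(calibration_losses, alpha):
    """Distribution-free upper bound from held-out calibration losses.

    Under exchangeability, a new loss exceeds the returned cap with
    probability at most alpha (standard split conformal quantile).
    """
    n = len(calibration_losses)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_losses)[rank - 1]

# Toy calibration losses for one agent (hypothetical data).
losses = [0.5, 1.2, 0.3, 2.0, 0.9, 1.1, 0.7, 1.6, 0.4]
cap = conformal_cap(losses, alpha=0.2)
```

The `(n + 1)` factor is what makes the guarantee hold in finite samples; with very few calibration points no finite cap can be certified, which is why the sketch returns infinity in that case.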
[LG-94] Function-Space Priors for Bayesian Neural ODEs with Application to Vessel Trajectory Prediction
Link: https://arxiv.org/abs/2606.06351
Authors: Jaeyeong Lee, Wonmo Koo, Heeyoung Kim
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Vessel trajectory prediction from Automatic Identification System (AIS) data is essential for maritime situational awareness, yet it remains challenging due to irregular sampling, missing reports, and complex dynamics. Beyond accurate point forecasts, maritime applications also demand well-calibrated uncertainty estimates for reliable decision-making. Bayesian Neural Ordinary Differential Equations (ODEs) offer a principled framework for continuous-time trajectory modeling with uncertainty quantification by placing a prior over the neural vector field parameters. However, the commonly used isotropic Gaussian weight prior fails to encode informative structural properties of vessel dynamics, such as smoothness and locality. Existing function-space Bayesian neural network methods address this limitation for static mappings, but do not transfer directly to Neural ODEs, where the primary quantity of interest is the trajectory rather than the vector field itself. In principle, one could place a Gaussian process (GP) prior directly over ODE solutions, but this requires propagating distributions through a nonlinear ODE solver, which is analytically intractable. To address this challenge, we adopt a practical approach that imposes a GP-kernel-based prior directly on the vector field evaluated at a finite set of measurement points. Specifically, we augment the standard weight-space variational objective with a kernel-based regularizer that penalizes deviations of the vector field from the structure implied by a GP prior. To handle long and irregular AIS trajectories, we further combine this function-space regularization with probabilistic multiple shooting, which decouples inference across temporal segments while maintaining global consistency.
[LG-95] Symmetric Divergence and Normalized Similarity: A Unified Topological Framework for Representation Analysis
Link: https://arxiv.org/abs/2606.06342
Authors: Yan Wang, Tianyang Hu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted by TMLR
Abstract:Topological Data Analysis (TDA) offers a principled, intrinsic lens for comparing neural representations. However, existing paired topological divergences (e.g., RTD) are limited by heuristic asymmetry and, more critically, unbounded scores that depend on sample size, hindering reliable cross-scenario benchmarking. To address these challenges, we develop a unified topological toolkit serving two complementary needs: fine-grained structural diagnosis and robust, standardized evaluation. First, we complete the RTD framework by introducing Symmetric Representation Topology Divergence (SRTD) and its efficient variant SRTD-lite. Beyond resolving the theoretical asymmetry of prior variants, SRTD consolidates diagnostic information into a single, comprehensive cross-barcode signature. This allows for precise localization of structural discrepancies and serves as an effective optimization objective without the overhead of dual directional computations. Second, to enable reliable benchmarking across heterogeneous settings, we propose Normalized Topological Similarity (NTS). By measuring the rank correlation of hierarchical merge orders, NTS yields a scale-invariant metric bounded between -1 and 1, effectively overcoming the scale and sample-dependence of unnormalized divergences. Experiments across synthetic and real-world deep learning settings demonstrate that our toolkit captures functional shifts in CNNs missed by geometric measures and robustly maps LLM genealogy even under distance saturation, offering a rigorous, topology-aware perspective that complements measures like CKA.
[LG-96] Wall Shear Stress Reconstruction from Concentration: Differentiable Physics and Physics-Informed Neural Networks
Link: https://arxiv.org/abs/2606.06313
Authors: Mahmoud Elhadidy, Siva Viknesh, Roshan M. D’Souza, Amirhossein Arzani
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:
Abstract:Wall shear stress (WSS) governs near-wall transport dynamics and is a key hemodynamic indicator in cardiovascular flows, yet remains difficult to infer accurately due to the need for precise computation of near-wall velocity gradients. Passive scalar fields, such as concentration or temperature, are advected by the same underlying velocity field and have the potential to uncover hidden flow physics metrics such as WSS. In this work, we demonstrate such reconstruction from spatially limited passive scalar observations using two fundamentally different inverse frameworks: a differentiable physics framework based on discrete adjoint, PDE-constrained optimization, which enforces the governing equations as hard constraints, and physics-informed neural networks (PINNs), which treat them as soft constraints. Benchmark problems include a 2D canonical backward-facing step (2D-BFS) and a 3D patient-specific stenotic coronary artery. For the 2D-BFS case, evaluated under three measurement scenarios (near-wall, far-field, and combined), PINN achieves high accuracy when near-wall data are available but fails when restricted to far-field measurements, whereas the differentiable physics approach recovers accurate WSS across all scenarios. In the 3D patient-specific case, the differentiable physics framework outperforms PINNs, yielding accurate WSS reconstruction. These results establish that measurement location and inverse formulation jointly determine reconstruction fidelity in scalar-based near-wall flow inference. The proposed framework opens a path toward estimation of near-wall hemodynamics from scalar transport data, with broader applicability to fluid flow problems where passive scalars can be observed.
[LG-97] Discrete Causal Representations from Heterogeneous Domains: A Bayesian Approach with Social Survey Applications
Link: https://arxiv.org/abs/2606.06288
Authors: Ankur Garg, Michael Stettler, Aaron Schein, Julius von Kügelgen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Causal representation learning aims to infer the high-level latent causal concepts that give rise to observed low-level measurements. This is particularly relevant for heterogeneous data from different environments or domains since distribution shifts often arise through sparse, localized changes in some of the underlying causal mechanisms, while other parts of the generative process remain unchanged. Whereas identifiability of causal representations has been studied extensively, practical uncertainty-aware methods and real-world use cases remain less explored. In this work, we propose a Bayesian approach to learning causal representations from multi-environment data, focusing on the case of discrete causal concepts and unknown multi-node soft interventions. To this end, we translate causal assumptions and interpretability desiderata into suitable priors and parametric choices within a hierarchical model. We then devise an inference scheme based on sequential Monte Carlo sampling to approximate the resulting multimodal posterior. We showcase our approach through case studies on social survey data, where latent causal concepts correspond to cultural values or political opinions, measurements to survey responses, and environments to different countries or states. Our model infers meaningful high-level concepts and plausible causal relations among them, demonstrating its utility for learning causal representations of complex real-world data.
[LG-98] Anchor PCA
Link: https://arxiv.org/abs/2606.06233
Authors: Benedikt Seiter, Anya Fries, Julius von Kügelgen, Jonas Peters
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Principal component analysis (PCA) is one of the most widely used unsupervised dimension reduction techniques. We study PCA for data from multiple related domains. Since principal components generally differ across domains, one way to obtain a shared low-rank embedding is to perform PCA on the pooled data. However, this approach can focus on spurious directions that exhibit high variation in only a few domains. To find a robust embedding that still explains most variance in unseen but similar domains, we propose instead to focus on shared directions of variation. To this end, we introduce Anchor PCA which trades off overall explained variance with agreement between the shared and domain-specific low-rank embeddings. Anchor PCA amounts to PCA on a modified target matrix and thus can be solved efficiently. Moreover, we show that Anchor PCA recovers a maximal invariant subspace and admits a minimax reconstruction interpretation under bounded domain-specific covariance inflations. On simulated and real-world gas sensor data with temporal drift, we demonstrate, respectively, that Anchor PCA recovers the maximally invariant subspace and yields embeddings that explain more variance on unseen domains than the pooling baseline and a worst-case alternative. Taken together, these findings establish Anchor PCA as a promising approach to robust unsupervised dimension reduction from multi-domain data.
[LG-99] Drag reduction or reward hacking? Recurrent multi-agent reinforcement learning that earns its reward
Link: https://arxiv.org/abs/2606.06227
Authors: Giorgio Maria Cavallazzi, Miguel Pérez-Cuadrado, Alfredo Pinelli
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:
Abstract:A reinforcement-learning agent maximises its reward, which can diverge from the outcome its designer intended. In physical control the reward rarely closes that gap, and drag reduction in wall turbulence makes it concrete. A mass-conservation projection couples agents’ outputs and erases the per-agent credit the policy gradient needs; a memoryless policy cannot resolve the slow near-wall cycle it acts on; and a pressure-gradient reward pays for nominal drag reduction by pumping power through the wall. Two degenerate controllers achieve large drag reductions while total dissipation rises, so the reported figure can mask a more wasteful flow. We trace each fault to its cause and fix it: a differentiable projection that restores credit, a recurrent policy with a widened sensing stencil, and a reward scored on the true wall power. The corrected controller acts on the flow within a closed energy budget, earning a conservative 17% under honest accounting.
[LG-100] Diffusion Models Observe Only Gradients: A Geometric Perspective on Score Matching Errors
Link: https://arxiv.org/abs/2606.06179
Authors: Naïl B. Khelifa, Richard E. Turner, Ramji Venkataramanan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Score-based diffusion models are typically trained by minimizing the L^2 score matching error, and standard theoretical analyses rely on this quantity to bound the sampling discrepancy between the learned and target distributions. We show the L^2 score error is not the right intrinsic measure of marginal distributional quality: a learned diffusion model can incur arbitrarily large L^2 score error while perfectly matching the target distribution. By decomposing score errors into a gradient and a solenoidal component (a Helmholtz-Hodge decomposition), we identify the geometric reason behind this: only the gradient component enters the marginal Fokker-Planck dynamics, while the solenoidal component is structurally invisible. We make this precise in three results. First, building on the corrected geometry, we prove an impossibility result: no monotone function of the L^2 score error can uniformly lower bound any divergence between the learned and target distributions. Second, we derive an upper bound on the Kullback-Leibler divergence that depends only on the observable gradient component of the error, tightening the standard Girsanov bound and identifying its looseness as the cost of operating on path-space rather than marginal-space dynamics. Third, we give a tractable estimator of the gradient component via a dual Sobolev identity, which is shown to empirically correlate substantially better with sample quality than the full L^2 error.
[LG-101] Effective Dimensionality as an Operator Invariant for Physics-Preserving Constraint Adaptation in Physics-Informed Neural Networks
Link: https://arxiv.org/abs/2606.06171
Authors: Cornelius Otchere, Michael Shields
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments:
Abstract:Physics-Informed Neural Networks inherently suffer from task interference because they rely on a shared parameter space to satisfy both governing differential equations and boundary conditions. We analyze this structural conflict using the Fisher Information Matrix to quantify the effective degrees of freedom (d_eff) in a physics-constrained model. Unlike the classical d_eff, which measures how many parameter directions are informed by data against a statistical prior, our d_eff measures the dimension of the parameter directions unconstrained by the differential operator. For operators with finite-dimensional kernel, we show that d_eff converges to the kernel dimension exactly, independent of network width, depth, or activation function, recasting it from a fit diagnostic into a structural invariant of the underlying continuous operator. For operators with infinite-dimensional kernel, d_eff instead measures the network’s finite-dimensional representational bandwidth for that kernel rather than recovering an integer invariant. Importantly, d_eff also serves as an a priori structural diagnostic. Driving d_eff of a well-posed problem to zero certifies that the physics and boundary constraints have absorbed the network’s free directions. Building on this characterization, we introduce subspace projection strategies for boundary adaptation. Rather than retraining from scratch, we project parameter updates into the null space of the pre-trained physics operator so that new boundary conditions are satisfied without disturbing the learned physics. Gradient-based fine-tuning can match or exceed this but needs more wall-clock time and tuning, whereas subspace projection delivers near-equivalent quality in seconds to minutes. We validate on linear and nonlinear operators, demonstrating accurate adaptation to initial and boundary shifts and unencountered constraint types.
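The null-space projection idea in the abstract above can be sketched with plain linear algebra. This is an illustration under an assumption that is not taken from the paper: the pre-trained physics constraint is summarized by a Jacobian matrix `J` of PDE residuals with respect to the parameters, so an update that lies in the null space of `J` leaves those residuals unchanged to first order. All names and shapes are hypothetical.

```python
import numpy as np

def project_to_null_space(J, g):
    """Project update g onto null(J), so that J @ result is ~0 to first order."""
    # I - pinv(J) @ J is the orthogonal projector onto the null space of J.
    return g - np.linalg.pinv(J) @ (J @ g)

rng = np.random.default_rng(0)
J = rng.normal(size=(3, 6))  # hypothetical Jacobian: 3 physics residuals, 6 parameters
g = rng.normal(size=6)       # hypothetical gradient of a new boundary loss
g_proj = project_to_null_space(J, g)
```

Applying `g_proj` instead of `g` moves the parameters only along directions the linearized physics residuals cannot see, which is the intuition behind adapting boundary conditions without disturbing the learned physics.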
[LG-102] p-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences
Link: https://arxiv.org/abs/2606.06117
Authors: Tirtharaj Dash, Gunja Sachdeva
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Algebraic Topology (math.AT); Genomics (q-bio.GN)
Comments: 12 pages, 5 figures, 8 tables
Abstract:We introduce pVR, a topological machine learning framework for alignment-free genomic sequence classification that combines p-adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a p-adic distance on k-mer prefixes, which captures hierarchical positional structure, and a compositional L_1 distance on k-mer frequencies, which captures local sequence content. The two distances jointly parameterise a bi-filtered Vietoris–Rips complex, and per-sequence topological summaries from this bi-filtration serve as features for standard machine learning classifiers. We establish theoretical guarantees for the construction: stability under metric perturbations and invariance to the choice of prime, alongside a result that explains why a single p-adic axis is topologically uninformative and why the bi-filtration recovers nontrivial homology. On twelve genomic benchmarks (28 to 500 sequences, 3 to 7 classes), pVR outperforms four established alignment-free baselines on three of six low-sample datasets, with gains of up to 21 percentage points; it underperforms only on a SARS-CoV-2 variant benchmark whose point-mutation divergence violates the hierarchical assumption, and all methods saturate in the large-sample regime. pVR also outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by 6.7 to 11.4 percentage points on three low-sample benchmarks. The pVR codebase is publicly available; the link is given on the arXiv page.
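To make the "p-adic distance on k-mer prefixes" in the abstract above concrete, here is a small sketch of the standard p-adic distance on integers. The encoding of a k-mer as an integer (the i-th character becomes the i-th base-p digit, with p = 5 for a four-letter alphabet) is a hypothetical choice for illustration; the abstract does not say how pVR encodes sequences. Under this encoding, two k-mers that share a longer prefix are closer.

```python
def padic_valuation(n, p):
    """Largest v such that p**v divides n (n must be a nonzero integer)."""
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return v

def padic_distance(x, y, p):
    """p-adic distance |x - y|_p = p**(-v_p(x - y)), with distance 0 when x == y."""
    if x == y:
        return 0.0
    return float(p) ** (-padic_valuation(x - y, p))

# Hypothetical encoding: the i-th character becomes the i-th base-p digit.
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer, p=5):
    return sum(DIGIT[c] * p**i for i, c in enumerate(kmer))

d_near = padic_distance(encode("ACGT"), encode("ACGA"), 5)  # share a 3-letter prefix
d_far = padic_distance(encode("ACGT"), encode("TCGT"), 5)   # differ at the first letter
```

Because every digit is smaller than p, the valuation of the difference equals the length of the common prefix, so the distance is p raised to minus that length. The p-adic distance is an ultrametric, which is what gives it a tree-like, hierarchical character.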
[LG-103] Adaptive Learning Rates with Surrogate Probability for Follow-the-Perturbed-Leader COLT2026
Link: https://arxiv.org/abs/2606.06043
Authors: Jongyeong Lee, Junya Honda, Shinji Ito, Chansoo Kim
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: TBA COLT2026
Abstract:The follow-the-regularized-leader framework has shown effectiveness and flexibility in online learning problems, where the choice of learning rates is known to be crucial. Recently, adaptive learning rates defined in terms of the arm-selection probabilities, obtained by solving convex optimization, have achieved improved best-of-both-worlds (BOBW) guarantees in various bandit problems. In contrast, BOBW guarantees for its computationally efficient alternative, follow-the-perturbed-leader (FTPL), remain relatively limited, since its optimization-free nature ironically makes the design of adaptive, probability-dependent learning rates non-trivial. To address this challenge, we propose an adaptive learning rate for FTPL by introducing surrogate probability functions that can be computed only from the available quantities, without requiring the exact probabilities. Based on these learning rates with surrogate functions, we provide the BOBW guarantee for FTPL with Pareto perturbations for any shape parameter \alpha > 1, generalizing prior results restricted to the specific choice \alpha = 2. We further show BOBW guarantees for FTPL with adaptive learning rates in the bandit problem with expert advice. Our approach preserves the computational simplicity of FTPL while enabling probability-dependent adaptivity, and the surrogate-based methodology may be of independent interest in other algorithmic frameworks beyond FTPL and learning rate designs.
[LG-104] Fast and Robust Convergence Rate for TD(0) with Linear Function Approximation Universal Learning Steps and I.I.D. Samples
Link: https://arxiv.org/abs/2606.05967
Authors: Ziad Kobeissi (L2S), Éloïse Berthier (U2IS)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:In this paper, we study the finite-time behavior of the TD(0) temporal-difference method with linear function approximation (LFA). We consider on-policy independent and identically distributed (i.i.d.) samples, a constant learning step, and the Polyak-Juditsky averaging method. We establish a new convergence rate for the mean-square error (MSE) on the approximated function that is (i) fast, in the sense that it admits an optimal dependency on the number of iterations k (i.e., of order 1/k); (ii) robust to ill-conditioning, in that it depends only on an initial error and model-independent constants; and (iii) sharp up to a multiplicative constant lower than 11. In particular, it does not depend on the smallest eigenvalue of the uncentered covariance matrix of the linear parametrization, unlike all pre-existing O(1/k) rates in the TD(0) literature. We also introduce PCTD(0), a variant of TD(0), which benefits from better convergence properties under an additional assumption of strong mixing on the Markov Chain.
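The setting described in the abstract above (TD(0) with linear features, i.i.d. on-policy samples, a constant step, and iterate averaging) can be sketched as follows. This is a generic illustration, not the paper's analysis and not its PCTD(0) variant; the toy Markov reward process, the step size, and the function name `averaged_td0` are all hypothetical.

```python
import numpy as np

def averaged_td0(phi, P, r, mu, gamma, step, n_iters, seed=0):
    """TD(0) with linear features phi[s], constant step, and iterate averaging.

    Each iteration draws s ~ mu (i.i.d.), then s' ~ P[s], and applies
    theta += step * (r[s] + gamma * phi[s'] @ theta - phi[s] @ theta) * phi[s].
    Returns the running average of the iterates.
    """
    rng = np.random.default_rng(seed)
    n_states, dim = phi.shape
    theta = np.zeros(dim)
    theta_bar = np.zeros(dim)
    for k in range(1, n_iters + 1):
        s = rng.choice(n_states, p=mu)
        s_next = rng.choice(n_states, p=P[s])
        td_error = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        theta = theta + step * td_error * phi[s]
        theta_bar += (theta - theta_bar) / k  # running mean of theta_1..theta_k
    return theta_bar

# Toy 3-state Markov reward process with tabular (one-hot) features.
P = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, -1.0])
mu = np.full(3, 1 / 3)  # stationary distribution (P is doubly stochastic)
gamma = 0.9
theta_bar = averaged_td0(np.eye(3), P, r, mu, gamma, step=0.1, n_iters=20000)
v_true = np.linalg.solve(np.eye(3) - gamma * P, r)  # exact value function
```

With one-hot features the TD fixed point is the exact value function, so `theta_bar` can be compared directly against `v_true`; the averaged iterate is the quantity whose mean-square error the abstract bounds at rate 1/k.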
[LG-105] EML-CD: Causal Mechanism Recovery via EML Symbolic Trees in Structure Learning
Link: https://arxiv.org/abs/2606.05942
Authors: Sota Asanuma
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Neural network (NN)-based nonlinear causal discovery methods recover DAG structure but leave each causal mechanism as a black box. Waxman et al. argued that extracting causal mechanisms from NN weights is ill-posed. We propose EML-CD, a framework that integrates the EML operator (capable of composing elementary functions from a single binary operator) into causal structure learning, with interpretable mechanism recovery as the primary objective. EML-CD represents each edge mechanism as a gated EML binary tree and automatically discovers closed-form causal equations. Analytical Jacobians can be directly computed from the output equations, enabling quantitative understanding of causal effects. On real data (Sachs protein signaling, d=11), EML-CD achieves SHD=11.2 +/- 0.4 (5-seed mean; baselines are single deterministic runs), on par with PC/GES within seed variance and below CAM, while attaching closed-form equations to each detected edge (precision 0.756, recall 0.365). In a controlled bivariate test with known mechanisms, EML-CD recovers 10 of 11 elementary function families faithfully (held-out shape correlation = 0.96; only high-frequency sine is partial). On a symbolic synthetic benchmark, EML-CD attains a substantially lower and more stable held-out mechanism f-MSE than a fixed SINDy dictionary (mean 3.67 vs. 7644, the latter inflated by catastrophic extrapolation on one seed), although its structure recovery (SHD 14.0) only matches the dictionary and stays below specialized optimizers; on the Causal Chambers light-tunnel subset, a depth-2 model improves F1 over linear OLS-BIC (0.444 vs. 0.273).
[LG-106] Finding Most Influential Sets ICML2026
Link: https://arxiv.org/abs/2606.05919
Authors: Lucas D. Konrad, Nikolas Kuschnig
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Computation (stat.CO)
Comments: Published as a conference paper at ICML 2026
Abstract:Identifying most influential sets (MIS), that is, size-k subsets whose removal maximally changes a target estimand, is typically infeasible because it requires searching over \binom{n}{k} subsets. For estimands with linear-fractional leave-set-out effects, we show that MIS selection reduces to a one-parameter sequence of top-k problems. Dinkelbach’s method yields an algorithm with \mathcal{O}(n) cost per iteration and finite termination. For fixed residualized inputs, the algorithm returns a globally optimal set for the univariate ratio objective, including the oracle-residualized partial linear model. With estimated nuisance functions, uniform denominator and generated-score stability imply approximation to the first-order oracle orthogonal-score objective; exact set recovery follows under a separation condition. Simulations and applications show that the method recovers exact MIS that were previously computationally inaccessible.
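The reduction to "a one-parameter sequence of top-k problems" in the abstract above is the classical Dinkelbach scheme for fractional programs. The sketch below applies it to a generic ratio objective, the sum of a[i] over the sum of b[i] for a size-k subset with positive b[i]; it is not the paper's leave-set-out estimand, and the function name and toy data are hypothetical. Sorting is used for brevity, so each iteration here costs O(n log n); a linear-time selection would be needed for the O(n) cost the abstract states.

```python
def dinkelbach_top_k(a, b, k, max_iter=100):
    """Maximize sum(a[i] for i in S) / sum(b[i] for i in S) over |S| = k.

    Requires b[i] > 0. Each step solves a top-k problem for the parametric
    objective a[i] - lam * b[i], then updates lam to the ratio of the
    selected set.
    """
    n = len(a)
    S = list(range(k))  # arbitrary starting set
    lam = sum(a[i] for i in S) / sum(b[i] for i in S)
    for _ in range(max_iter):
        scores = [a[i] - lam * b[i] for i in range(n)]
        S_new = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
        lam_new = sum(a[i] for i in S_new) / sum(b[i] for i in S_new)
        if lam_new <= lam:  # no strict improvement: lam is optimal
            break
        S, lam = S_new, lam_new
    return sorted(S), lam

# Toy data (hypothetical).
a = [3, 1, 4, 1, 5, 9, 2, 6]
b = [2, 7, 1, 8, 2, 8, 1, 8]
best_set, best_ratio = dinkelbach_top_k(a, b, k=3)
```

The ratio increases strictly at every accepted step and there are finitely many subsets, which is why the loop terminates; at the stopping point the best top-k value of the parametric objective is zero, the usual optimality condition for Dinkelbach's method.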
[LG-107] Cross-scale spatially-aware generative modeling of transcriptomic programs underlying neurodegenerative brain organization
Link: https://arxiv.org/abs/2606.05870
Authors: Krishnakumar Vaithianathan (for the Alzheimer’s Disease Neuroimaging Initiative)
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 26 pages, 5 figures
Abstract:Neurodegenerative disorders such as Alzheimer’s disease exhibit highly organized patterns of regional brain vulnerability, yet the biological mechanisms underlying this spatial selectivity remain incompletely understood. Existing imaging-transcriptomic studies have largely relied on correlation-based analyses between gene expression and neuroimaging phenotypes, limiting their ability to model how molecular organization gives rise to neurodegeneration. Here, we introduce a cross-scale spatially-aware generative framework for modeling transcriptomic programs underlying cortical neurodegeneration. Regional transcriptomic profiles were derived from the Allen Human Brain Atlas using 910 landmark genes across 68 cortical regions. Neurodegenerative vulnerability maps were constructed from ADNI FreeSurfer cortical thickness measurements by computing regional cortical thinning differences between cognitively normal controls (NC = 926) and Alzheimer’s disease subjects (AD = 426). A variational generative architecture was used to learn latent biological programs linking regional gene-expression organization to cortical degeneration while incorporating graph-based spatial smoothness regularization to preserve cortical organization. The proposed framework achieved strong prediction of regional neurodegenerative vulnerability, yielding an explained variance of 0.8604 and a significant spatial correlation between predicted and observed cortical degeneration profiles (r = 0.9439, p < 0.001). The learned latent representations revealed structured transcriptomic organization associated with distributed disease susceptibility. These findings demonstrate that biologically constrained generative modeling can bridge microscale molecular organization with macroscale neurodegeneration, providing a foundation for spatially-aware generative neurobiology and computational neuroscience.
[LG-108] Diff2SP: Diffusion Models for Correlated Scenario Generation in Stochastic Programming
Link: https://arxiv.org/abs/2606.05649
Authors: Haixiang Sun, Andrew Liu
Categories: Computation (stat.CO); Machine Learning (cs.LG)
Comments:
Abstract:Scenario generation is a critical component in stochastic programming (SP), as it directly influences the quality of decision-making under uncertainty. Existing approaches predominantly rely on either sampling-based techniques or supervised learning using neural networks. Sampling-based techniques often struggle to capture complex dependencies and rare but plausible events, while supervised learning requires fixed input-output pairs for training and is limited in its ability to generate a wide variety of realistic scenarios that are not restricted by predefined patterns or rules. To address these limitations, we introduce Diff2SP, a diffusion-based generative framework that incorporates downstream optimization objectives directly into scenario generation. Unlike conventional methods that treat scenario generation and decision-making as separate steps, Diff2SP embeds stochastic optimization into the training process, enabling the generation of scenarios that are both statistically coherent and decision-aware. To formally justify this optimization-aware design, we establish regret bounds that link distributional accuracy to decision quality, and establish sample complexity guarantees showing faster convergence than traditional generative models such as GANs. Empirical results on both synthetic and power-system datasets validate these theoretical insights, demonstrating that Diff2SP consistently improves both statistical fidelity and downstream optimization outcomes.
[LG-109] Uncovering Extreme Event Mechanisms for Prediction and Control with Sensitivity-Balanced Projections
Link: https://arxiv.org/abs/2606.05618
Authors: Nicholas Zolman, Sajeda Mokbel, Samuel E. Otto, Steven L. Brunton
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 12 pages, 6 figures (main text). Additional 14 pages of references and Supplementary Information
Abstract:Extreme events – such as earthquakes and coronal mass ejections – are common in many chaotic dynamical systems, yet are difficult to characterize and predict due to the subtle instability mechanisms that drive them. In this work, we develop an interpretable technique that reveals the underlying mechanisms behind extreme events and uses them to build data-driven forecasts and intuitive event suppression controllers. In particular, we utilize the covariance balancing reduction using adjoint snapshots (CoBRAS) method to identify linear oblique projections that best capture the sensitivity of a quantity of interest and reconstruct the original state. Importantly, we bypass the need for cumbersome adjoint calculations, instead using backpropagation via modern automatically differentiable numerical frameworks. To accommodate spatially localized events, we also introduce a new variant of CoBRAS to obtain local sensitivity-balanced projections. We demonstrate the utility of this approach to characterize extreme events across a diverse set of challenging systems, including turbulent bursts of energy dissipation in the 2D Kolmogorov Flow, spontaneous synchronization in networks of coupled FitzHugh-Nagumo oscillators, and the localized formation of ocean rogue waves from a modified nonlinear Schrödinger equation. For each example, we show that our simple forecast models accurately predict extreme events and that the underlying mechanisms may be used to design control laws to prevent these events. Finally, we demonstrate that by learning a neural network surrogate model of the dynamics directly from data, we may extend this approach to experimental systems and systems that are not natively written in an automatically differentiable programming language.
[LG-110] Sparse Functional Singular Value Decomposition for Biclustering and Triclustering Longitudinal Data
Link: https://arxiv.org/abs/2606.05488
Authors: Yue Zhao, Thierry Chekouo, Sandra Safo
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Identifying subtypes of complex conditions, such as Inflammatory Bowel Disease (IBD), often requires capturing latent patterns in longitudinal omics data. However, these data are typically high-dimensional, sparsely sampled, and irregularly observed over time, posing substantial challenges for conventional (bi)clustering and functional data analysis methods. We propose Tri-SfSVD, a unified sparse functional Singular Value Decomposition framework for discovering biclusters and triclusters in longitudinal data. Unlike existing functional biclustering methods that rely on ad hoc imputation or enforce restrictive shape-homogeneity assumptions, Tri-SfSVD integrates continuous trajectory estimation with simultaneous subject, feature, and temporal selection within a single optimization framework. By imposing sparse penalties across subjects, variables, and temporal subregions, the proposed method works directly on observed data to uncover localized structures at the subject, subject-feature, and subject-feature-time levels. Extensive simulations demonstrate that Tri-SfSVD outperforms existing approaches in high-dimensional settings. Applied to IBD multi-omics data, the method identified three biclusters linking sample clusters with distinct IBD-related clinical characteristics to microbial pathway groups associated with specific bacterial taxa, providing interpretable subject-pathway associations for characterizing disease heterogeneity. Applied to multi-channel EEG data, the method identified three triclusters linking sample clusters with distinct alcohol-related phenotypes to localized brain activity patterns, including subgroup differences separated by temporal subregions within the same spatial region.
[LG-111] AlloGen: Conformation-Selective Binder Generation with Differential State Scoring
Link: https://arxiv.org/abs/2606.05474
Authors: Hanqun Cao,Zachary Quinn,Aastha Pal,Sumi Kimura,Jingjie Zhang,Pheng Ann Heng,Pranam Chatterjee
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments:
Abstract:Protein binder design has largely optimized for affinity alone, leaving conformational selectivity unaddressed: for allosteric targets such as kinases, nuclear receptors, and GPCRs, a binder that engages both active and inactive states provides no functional specificity regardless of how tightly it binds. We introduce AlloGen, a modular framework that decouples backbone generation from a learned state-selectivity scorer $Q_\theta$, an SE(3)-invariant interface graph transformer trained via a two-phase curriculum that first learns interface geometry before imposing conformational discrimination. Because $Q_\theta$ is fully differentiable and generator-agnostic, it integrates with any backbone generator as a passive reranker or an active gradient-based guide without retraining. Across a diverse benchmark of proteins spanning multiple families and conformational mechanisms, AlloGen consistently identifies binders that preferentially recognize desired structural states while rejecting alternative conformations. Experimental validation on calmodulin further demonstrates that these computational selectivity signals translate to physical molecules, yielding de novo peptides that bind the desired holo conformation while exhibiting no detectable binding to the apo state. Together, these results establish conformational selectivity as a learnable property and provide a general framework for state-selective protein binder design.
[LG-112] Environment-Robust Representation Learning with Empirical Bayes
Link: https://arxiv.org/abs/2606.05365
Authors: Yuli Slavutsky,Matthew Shen,Bohan Wu,David M. Blei
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:We consider multi-environment prediction problems. We assume the environments change the distribution of a latent variable, while the mechanisms generating observed covariates and targets remain stable conditional on that variable. For example, hospitals or clinical cohorts may differ in the prevalence of latent patient states, even though the relationships between those states, physiological measurements, and outcomes remain unchanged. Given a dataset from multiple environments, we formulate a Bayesian model for such problems and derive the corresponding variational objective. We show that this objective decomposes into per-environment terms and an additional cross-environment balancing term induced by the model’s structure. We use an empirical Bayes method to set the prior and incorporate it into the objective. Based on this objective, we develop an amortized variational algorithm for posterior approximation, and use the resulting learned latent variables to form predictions in new environments. We study our approach through simulations and real-world studies of astronomical source identification, microbiome-based disease detection, and ICU sepsis prediction. Across these settings, our method outperforms previous approaches for prediction in new environments.
[LG-113] TabSODA: Tabular Diffusion based Imputation with Skip Pattern Detection and Ordinal Awareness
Link: https://arxiv.org/abs/2606.05361
Authors: Yuyu Chen,Taehyo Kim,Hai Shu,Yang Feng
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Missing data imputation in large-scale surveys faces two challenges that are not well handled by current tabular diffusion methods. First, structural skips, cells made inapplicable by questionnaire design, should not be imputed but are often conflated with item nonresponse. Second, ordinal responses encode ordered categories, yet most pipelines treat them as nominal levels through one-hot or analog-bit encodings. We introduce TabSODA (Tabular diffusion with Skip pattern detection and Ordinal Awareness), an Expectation-Maximization (EM)-based diffusion imputer built on the Elucidated Diffusion Model (EDM) framework. TabSODA propagates structural skips through the denoising loss and reverse-time sampler, and represents ordinal variables with cumulative-probit scalar latents while retaining analog-bit encodings for nominal variables. When a codebook skip mask is available, TabSODA uses it directly; otherwise, the TabSODA+SKIP variant estimates the mask from raw responses and questionnaire order using a CART-based skip-pattern miner. On the Population Assessment of Tobacco and Health (PATH) study and the National Survey on Drug Use and Health (NSDUH), two nationally representative U.S. surveys, TabSODA reduces ordinal MACE by up to 23.7% and improves categorical accuracy by up to 9% over the strongest baseline across MCAR, MAR, and MNAR masking. The skip miner achieves near-perfect precision on both datasets, allowing TabSODA+SKIP to closely track the codebook-mask variant.
[LG-114] Harnessing Source Heterogeneity for Cluster-Structured Transfer Learning
Link: https://arxiv.org/abs/2606.05258
Authors: Xiaohui Yin,Jun Jin,Shane J. Sacco,Robert H. Aseltine,Kun Chen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:Transfer learning is a natural strategy when a target population has limited data but multiple related auxiliary sources are available. A central difficulty is source heterogeneity: auxiliary sources may not be equally useful, and their usefulness may vary in a structured, cluster-like fashion. Existing transfer-learning methods often reduce source selection to a binary informative/non-informative decision, overlooking subgroups of sources with differential transferability. Motivated by a suicide-risk study using data from the Connecticut Hospital Information Management Exchange (CHIME), comprising 636,758 patients across 27 hospitals, we propose Trans-GLMC, a cluster-structured transfer-learning procedure for generalized linear models. The CHIME setting illustrates the core challenge: hospital-specific risk models are unstable because suicide attempts are rare at any single facility, whereas indiscriminate pooling across hospitals can obscure facility-level differences in patient mix and risk profiles. Trans-GLMC first constructs a coefficient-based distance among the target and candidate sources to recover latent source clusters. It then combines global fusion, within-cluster refinement, and target debiasing to produce an estimator that adapts to the detected structure. We establish a non-asymptotic error bound that improves over its unclustered counterpart whenever a meaningful target cluster exists and matches the unclustered rate up to constants otherwise. In simulations and in the CHIME study, Trans-GLMC improves facility-specific prediction, identifies interpretable communities of hospitals with mutual transferability, and recovers clinically coherent suicide-risk factors.
[LG-115] Deterministic Envelopes for Tamed SGLD: Decoupling Stochastic-Gradient Noise and Localizing Taming
Link: https://arxiv.org/abs/2606.05242
Authors: Yiwei Zhou,Ziheng Chen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
Comments: 40 pages, 11 tables, 2 figures
Abstract:Stochastic-gradient Langevin algorithms often use tamed denominators to stabilize non-globally Lipschitz drifts. This paper shows that when the denominator depends on the same stochastic-gradient realization as the numerator, the taming step changes the stochastic oracle itself and can create a stationary bias even if the original stochastic gradient is unbiased. We propose a structure-preserving framework for designing tamed denominators. It fixes the denominator before the oracle noise is sampled and uses localized deterministic envelopes to avoid unnecessary taming in typical regions. These kernels keep the stabilizing effect of taming while avoiding the bias introduced by a gradient-dependent denominator. Our theory explains how the stationary error splits into the bias caused by oracle-dependent taming and the remaining error introduced by deterministic stabilization. Within this deterministic-envelope family, the analysis identifies a far-tail condition that explains the limitation of local soft envelopes and motivates a hybrid member: soft in the typical region, but protected by hard-tail control on rare excursions. Experiments confirm the predicted stationary distortions of random denominators, the bias reduction of deterministic-envelope designs, and the stabilizing effect of the hybrid construction.
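The bias mechanism described in the abstract above can be illustrated with a small numerical sketch. Everything specific here is an assumption for illustration rather than the paper's construction: the taming form g / (1 + c·|g|), the constant `c`, the Gaussian noise model, and the choice of `abs(true_grad)` as a stand-in deterministic envelope. The sketch only compares the average drift when the denominator uses the same noisy gradient as the numerator against the average drift when the denominator is fixed before the noise is sampled.

```python
import random


def tamed_same_noise(g, c):
    # Numerator and denominator share one noisy gradient sample.
    return g / (1.0 + c * abs(g))


def tamed_fixed_denominator(g, envelope, c):
    # Denominator is a deterministic quantity chosen before sampling noise.
    return g / (1.0 + c * envelope)


def mean_drifts(true_grad=1.0, noise_std=2.0, c=0.5, n=100_000, seed=0):
    """Monte Carlo averages of both tamed oracles for a scalar gradient."""
    rng = random.Random(seed)
    envelope = abs(true_grad)  # hypothetical deterministic envelope
    same_sum, fixed_sum = 0.0, 0.0
    for _ in range(n):
        g = true_grad + rng.gauss(0.0, noise_std)  # unbiased noisy gradient
        same_sum += tamed_same_noise(g, c)
        fixed_sum += tamed_fixed_denominator(g, envelope, c)
    # Drift of the noise-free tamed gradient, used as the reference value.
    target = true_grad / (1.0 + c * envelope)
    return same_sum / n, fixed_sum / n, target


if __name__ == "__main__":
    same, fixed, target = mean_drifts()
    print(f"noise-dependent denominator: {same:.3f}")
    print(f"fixed denominator:           {fixed:.3f}")
    print(f"noise-free tamed drift:      {target:.3f}")
```

In this toy setting the fixed-denominator average stays close to the noise-free tamed drift because the division is by a constant, while the noise-dependent version drifts away from it even though the raw gradient is unbiased.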
[LG-116] HyFAD: Hybrid Time-Frequency Diffusion with Frequency-Aware Embedding for Time Series Imputation
Link: https://arxiv.org/abs/2606.05239
Authors: Hongfan Gao,Wangmeng Shen,Bin Yang,Jilin Hu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Diffusion models have demonstrated strong performance in time series modeling due to their ability to progressively capture complex data distributions through iterative denoising. However, existing approaches struggle with frequency-sensitive denoising, high-frequency reconstruction, and balancing global trends with local dynamics. To address these limitations, we propose HyFAD, a Hybrid time-frequency Diffusion model with Frequency-Aware embedding for time series imputation. Built upon the DDPM paradigm, HyFAD adopts a coupled time-frequency diffusion framework, in which the reverse denoising proceeds sequentially from the time domain to the frequency domain, enabling coarse-to-fine generation. Specifically, the time-domain diffusion process captures low-frequency global trends, while the frequency-domain diffusion process refines high-frequency spectral components. We further introduce a frequency-aware step embedding that exploits the relationship between diffusion steps and spectral components, providing step-dependent spectral guidance and facilitating more accurate band-wise reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that HyFAD achieves state-of-the-art performance. Our source code is available at this https URL.
[LG-117] Central Description Length (CDL) Clustering Validation Index
Link: https://arxiv.org/abs/2606.05230
Authors: Mahdi Shamsi,Soosan Beheshti
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Selecting a clustering algorithm and its hyperparameters without labels is a common difficulty in engineering machine learning pipelines that work with unsupervised analysis of sensor, image, or process data. Clustering validation indices (CVIs) provide internal scores for ranking candidate clusterings, but most popular CVIs are built from Euclidean compactness and separation terms and so tend to favour compact, convex partitions. Their performance is known to degrade on non-convex, irregular, or variable-density data, where kernel transformations or alternative distance measures are typically used at the cost of additional tuning and computation. This paper introduces the Central Description Length (CDL) clustering validation index. CDL uses the observed within-cluster compactness, the estimated cluster centers, and the estimated cluster covariances to compute a probabilistic upper bound on the description length associated with the unobservable true cluster centers. The bound condenses intra-cluster compactness and centroid displacement into a single computable quantity and is evaluated on the partition produced by any clustering algorithm. The implementation uses only observable quantities (the data, the partition, the estimated centers, and the estimated covariances) and does not use ground truth labels. On synthetic benchmarks with non-convex and arbitrary-shape clusters, CDL-CVI selected the reference number of clusters more often and reached higher Adjusted Rand Index (ARI) values than the conventional CVIs we tested, without an additional kernel preprocessing stage. On image benchmarks (MNIST, CIFAR-10, STL-10) clustered from frozen unsupervised embeddings, CDL-CVI returned cluster numbers close to the reference class counts across K-means, DBSCAN, and spectral clustering in the reported trials.
[LG-118] Quantifying the biophysical properties of stomatocytes in health and disease
Link: https://arxiv.org/abs/2606.05227
Authors: Zhaojie Chai,Jianlu Zheng,He Li,Ming Dao,George Em Karniadakis
Categories: Cell Behavior (q-bio.CB); Machine Learning (cs.LG); Mathematical Physics (math-ph); Biomolecules (q-bio.BM)
Comments: 26 pages, 9 figures
Abstract:Hereditary stomatocytosis (HS) comprises red blood cell (RBC) disorders characterized by cup-shaped erythrocytes that respond oppositely to splenectomy: curative in overhydrated HS (OHS) but potentially thrombogenic in dehydrated HS (DHS/xerocytosis). This paradox persists because RBC biomechanics is governed by partly independent parameters–shear modulus, bending rigidity, surface-to-volume ratio (S/V), and cytoplasmic viscosity–that existing assays capture only piecemeal. Here we combine dissipative particle dynamics (DPD) simulations with microfluidic imaging to construct a control discocyte and three stomatocyte models (ST-RBC1-3) at fixed membrane area and decreasing volume (109.7, 101.5, 89.8 fL), spanning the OHS-to-DHS range. Tracing this parameter set through five mechanically orthogonal assays, we find that interendothelial-slit (IES) traversal is geometry-dominated: overhydrated ST-RBC1 requires an order of magnitude higher critical pressure than healthy RBCs, whereas dehydrated ST-RBC3 passes freely. ST-RBC3 nonetheless suppresses membrane tank-treading and raises low-shear whole-blood viscosity by ~29% at physiological haematocrit, comparable to Gaucher-disease hyperviscosity. A funnel-obstacle chip amplifies these differences into a label-free centerline-offset signal predicted to separate all four RBC types (~4.5 standard deviations between extreme phenotypes). These results unite single-cell mechanics, splenic filtration, and hemorheology in one framework, resolve the splenectomy paradox, and point toward microfluidic pre-operative risk stratification in HS.
[LG-119] The Language of Elution: Autoregressive Prediction of the Next Feature in Untargeted LC-HRMS Lipidomics
Link: https://arxiv.org/abs/2606.05225
Authors: Dayanjan S. Wijesinghe
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments:
Abstract:Untargeted liquid chromatography-high-resolution mass spectrometry (LC-HRMS) detects thousands of molecular features per sample, yet only 2-20% receive confident structural annotations. A root cause of this “dark metabolome” is that tandem MS/MS acquisition is reactive: instruments select precursors only after ions appear, blind to what elutes next. We reframe chromatographic elution as an autoregressive sequence prediction task. Because reversed-phase elution order is governed by hydrophobicity, successive features form a physically constrained sequence, like tokens in language. We discretize the mass-to-charge (m/z) axis into 110 bins and train long short-term memory (LSTM) and Transformer models to predict the next eluting m/z bin from five annotation-free per-token features: m/z bin, mass defect, retention-time gap, polarity, and intensity rank. Trained on 15,242 features from four clinical lipidomics cohorts (342 plasma samples; SCIEX TripleTOF 6600+, Waters CSH C18), the LSTM reaches 98.4% top-1 accuracy (99.99% top-5; mean absolute error 3.6 Da) and the Transformer 98.0%. Ablation shows autoregressive context accounts for 55.5 percentage points while no single feature contributes more than 0.2 pp: the sequential pattern, not molecular properties, drives prediction. Models transfer across instruments sharing the method (r=0.999 on an independent Agilent 6530 dataset) but fail under a different column chemistry (5.1% top-1) or polarity mode (2.6%), confirming method- and mode-specificity. Fine-tuning on as few as two to five quality-control injections recovers held-out accuracy from 2.6% to nearly 50%, so cross-condition deployment needs minimal calibration. These results establish that elution sequences are highly predictable and lay the groundwork for predictive MS/MS acquisition to improve annotation coverage in untargeted metabolomics.
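As a rough illustration of the tokenization step described in the abstract above, the following sketch turns a list of detected features into elution-ordered tokens and next-bin training pairs. Only the bin count of 110 comes from the abstract. The m/z range (`MZ_MIN`, `MZ_MAX`), the uniform bin width, the fractional-part definition of mass defect, and the helper names are assumptions made for illustration; polarity and intensity rank are omitted for brevity.

```python
from dataclasses import dataclass

# Assumed m/z range; only N_BINS = 110 is taken from the abstract.
MZ_MIN, MZ_MAX, N_BINS = 100.0, 1200.0, 110


@dataclass
class Token:
    mz_bin: int         # discretized m/z bin index
    mass_defect: float  # fractional part of m/z (assumed definition)
    rt_gap: float       # retention-time gap to the previous feature


def mz_to_bin(mz):
    """Map an m/z value to a uniform bin index, clamped to the valid range."""
    width = (MZ_MAX - MZ_MIN) / N_BINS
    idx = int((mz - MZ_MIN) // width)
    return min(max(idx, 0), N_BINS - 1)


def tokenize(features):
    """Convert (retention_time, mz) pairs into tokens in elution order."""
    tokens = []
    prev_rt = None
    for rt, mz in sorted(features):
        gap = 0.0 if prev_rt is None else rt - prev_rt
        tokens.append(Token(mz_to_bin(mz), mz - int(mz), gap))
        prev_rt = rt
    return tokens


def next_bin_pairs(tokens, context=3):
    """Build (previous bins, next bin) pairs for autoregressive training."""
    return [
        ([t.mz_bin for t in tokens[i - context:i]], tokens[i].mz_bin)
        for i in range(context, len(tokens))
    ]


if __name__ == "__main__":
    toks = tokenize([(2.0, 760.585), (1.0, 496.34), (3.5, 1199.9)])
    print(next_bin_pairs(toks, context=2))
```

The pairs produced this way could feed any sequence model; the abstract reports LSTM and Transformer models, but their architectures and training setup are not specified there and are not reproduced here.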
[LG-120] Transformer-Enhanced Reinforcement Learning: Fundamentals and Applications in Communication Networks
Link: https://arxiv.org/abs/2606.05208
Authors: Nguyen Cong Luong,Shaohan Feng,Nguyen Duc Hai,Zeping Sui,Bo Ma,Min Xu,Zhihao Dong,Qiushi Zhao,Nguyen Duc Duy Anh,Nguyen Quoc Khanh,Ngoc Hung Nguyen,Zitian Zhang,Jie Cao
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement Learning (RL) has long been a powerful solution to various problems in communication networks. However, traditional RL models still face several limitations. Not only do they rely on large numbers of interactions with the environment, but they are also limited in modeling long-term relationships and tackling partial observability. In recent years, the Transformer model has demonstrated the ability to enhance RL models, allowing them to overcome these issues. In particular, the self-attention mechanism within the Transformer enables efficient modeling of long-range dependencies and global correlations, accelerates training, and handles heterogeneous data modalities. In this paper, we present a comprehensive survey of Transformer-based RL algorithms and their applications in communication networks. Specifically, the paper provides the mathematical background of RL and Transformer architectures, along with insights into key issues such as resource allocation, computation offloading, routing, trajectory control, and network security. We conclude the paper by discussing challenges, open issues, and notable future research directions, including Transformer-enhanced DRL algorithms for semantic communication and network optimization.
[LG-121] Multi-Fidelity Learning with Shallow Recurrent Decoders for Reactor Physics
Link: https://arxiv.org/abs/2606.05202
Authors: Stefano Riva,Carolina Introini,J. Nathan Kutz,Antonio Cammi
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments:
Abstract:In reactor physics, neutronics can be treated with different fidelity levels, according to the needs of the user. On one hand, the precise modeling of neutrons’ behaviour in reactor physics is often expensive and time-consuming due to the high computational costs to numerically solve the Boltzmann transport equation. Conversely, by adopting suitable assumptions, such as the SP$_N$, diffusion theory, and point kinetics, it is possible to efficiently generate low-fidelity data. From the perspective of surrogate models, this computational limitation translates into a scarcity of high-fidelity data and a significant amount of low-fidelity data. Given this difference in fidelity levels, it would be interesting to develop a suitable procedure to map low-fidelity models towards higher fidelity models; for instance, one could obtain the solution to a multi-group diffusion equation starting from time-series data obtained from a point kinetics model. Indeed, this work investigates this possibility by leveraging multi-fidelity information with Shallow Recurrent Decoders, a novel machine learning architecture able to map time-series observations to the full state of the reactor. This technique has been designed to use local or global measurements as input and map their temporal trajectories to the high-dimensional state; by the same logic, in principle, this architecture can also be used when the input is formed by the solution of a lumped model. This work applies this idea to a benchmark reactor geometry, mapping the point kinetics model to the diffusion solution under various input conditions, at a much lower computational cost.
[LG-122] A differentiable machine learning small-angle X-ray scattering analysis framework for structure elucidation of lipid nanoparticles
Link: https://arxiv.org/abs/2606.05200
Authors: Maria Bånkestad,Sandra Barman,Magnus Röding,Erik Kaunisto,Viktoriia Meklesh,Audrey Gallud,Marco Mendez,Marianna Yanez Arteta,Stefan Norberg,Ann Terry,Smita Chakraborty,Shun Yu,Jerk Rönnols,Sepideh Pashami
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments: 38 pages, 24 figures, 5 tables (incl. supplementary information)
Abstract:Lipid nanoparticles (LNPs) are efficient delivery systems for negatively charged nucleic acids. Their multi-component architecture yields a core-shell structure. Small-angle X-ray scattering (SAXS) is an important characterization technique for LNPs, but recovering internal structure and size distribution from SAXS is an inverse problem with non-unique solutions. Realistic models are often too expensive for systematic exploration. We introduce a machine-learning-accelerated, differentiable framework for SAXS analysis of heterogeneous, polydisperse LNPs. The forward model combines a core-shell particle with a Gaussian random-field interior, a neural surrogate for the monodisperse SAXS map, and a differentiable layer integrating over particle-size distributions. The surrogate reduces prediction cost by four orders of magnitude, while differentiability enables large-scale multi-start fitting and ensemble identifiability analysis. Applied to synthetic and experimental MC3 LNP data, the framework shows that near-identical SAXS fits can arise from distinct parameter modes, with the experimental fits dominated by a trade-off between size-distribution and interior-structure parameters.
[LG-123] An accurate nucleic acid-small molecule docking framework via geometric deep learning with large-scale pretraining
Link: https://arxiv.org/abs/2606.05198
Authors: Shi Li(1),Xujun Zhang(1),Mingquan Liu(2),Hui Zhang(1 and 4),Shuoying Jia(1 and 4),Yu Kang(1 and 4),Tingjun Hou(1 and 3),Peichen Pan(1 and 3) ((1) College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, P. R. China,(2) Faculty of Health Sciences, University of Macau, Macau SAR, China, (3) Zhejiang Provincial Key Laboratory for Intelligent Drug Discovery and Development, Jinhua Institute of Zhejiang University, Zhejiang, China, (4) Shanghai Innovation Institute, Shanghai, China)
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 34 pages, 4 figures, 4 tables, Supplementary Materials includes 8 tables
Abstract:Nucleic acids are increasingly recognized as therapeutic targets beyond conventional protein-centered drug discovery, yet accurate and efficient docking of small molecules to nucleic acid structures remains challenging. Physics-based docking methods often show limited accuracy and efficiency, whereas deep learning approaches are constrained by the scarcity of experimentally resolved nucleic acid-ligand complexes. Here, we present NucleoDock, a deep learning framework for nucleic acid-small molecule docking. To address data scarcity, NucleoDock combines physics-guided large-scale pretraining on millions of docking-generated synthetic complexes with fine-tuning on curated experimental co-crystal structures. It further integrates sequence- and structure-informed nucleotide representations with atomistic three-dimensional features to capture both biological context and binding-site geometry. A mixture density network-based geometric scoring head is used to model conditional interaction-distance distributions for pose ranking. On an external benchmark of 125 nucleic acid-ligand complexes, NucleoDock achieved a top-1 success rate of 56 percent at an RMSD cutoff of 2.0 Angstrom, outperforming rDock with 29 percent, while generating 100 poses in approximately 5 seconds per complex. Retrospective virtual screening on the ROBIN benchmark further showed improved early enrichment. NucleoDock represents a step toward bridging the methodological gap between protein- and nucleic acid-directed computational drug discovery.
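The top-1 success metric quoted in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation protocol: atoms are assumed to be listed in matching order, no structural alignment or symmetry correction is applied, a pose counts as a success when its RMSD is at or below the cutoff, and the function names are hypothetical.

```python
import math


def rmsd(pose, reference):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if not pose or len(pose) != len(reference):
        raise ValueError("pose and reference must be non-empty and equal in length")
    squared = sum(
        (a - b) ** 2
        for p_atom, r_atom in zip(pose, reference)
        for a, b in zip(p_atom, r_atom)
    )
    return math.sqrt(squared / len(pose))


def top1_success_rate(complexes, cutoff=2.0):
    """Fraction of complexes whose top-ranked pose is within the RMSD cutoff.

    complexes: list of (ranked_poses, reference) pairs, best-scored pose first.
    cutoff: RMSD threshold in Angstrom (2.0 in the abstract).
    """
    hits = sum(
        1 for ranked_poses, reference in complexes
        if rmsd(ranked_poses[0], reference) <= cutoff
    )
    return hits / len(complexes)


if __name__ == "__main__":
    ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    near = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # shifted by 1 Angstrom
    far = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]   # shifted by 3 Angstrom
    print(top1_success_rate([([near, far], ref), ([far, near], ref)]))
```

Only the first pose in each ranked list is scored, which is why ranking quality matters as much as pose generation for this metric.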


