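This post lists the latest papers fetched from arXiv.org on 2026-06-12. It is updated automatically and organized into six areas: NLP, CV, ML, AI, IR, and MA.

Note: Paper data is fetched from arXiv.org daily, with a scheduled automatic update at around 12:30 each day.

Tip: If the list is not updated on a given day, arXiv may not have published new papers that day, or the script may have failed. Failures are fixed the same day whenever possible.

Contents

Overview (2026-06-12)

674 papers were updated today, including:

  • Natural Language Processing: 104 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 222 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 99 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 193 papers (Machine Learning (cs.LG))
  • Multiagent Systems: 15 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 11 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 25 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems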

[MA-0] Tuning Agent-Based Predator-Prey Models Toward Lotka-Volterra Dynamics

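[Quick read]: This paper addresses the instability of large-scale agent-based models used to simulate complex adaptive systems: because of their many local rules and sensitive parameters, small changes can cause population collapse, runaway behaviour, or saturation at artificial bounds. The core challenge is to find, in a high-dimensional parameter space, configurations whose dynamics match classical Lotka-Volterra oscillations. The key to the solution is a feature-based loss that rewards biologically plausible features (sustained oscillations, phase lag, bounded populations, and long-term persistence) and is used to optimise environmental and demographic parameters. Parameters are tuned first under random controllers and then under evolved controllers in a more naturalistic setting. The model is implemented in ABMax, a JAX-based agent-based modelling framework whose efficient batched simulation on hardware accelerators makes it practical to search for stable parameter regions.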

Link: https://arxiv.org/abs/2606.13639
Authors: Corinna Mandl, Siddharth Chaturvedi, Marcel van Gerven
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: 12 pages, 3 figures

Abstract:Recent growth in compute power has made it increasingly feasible to use large-scale agent-based models to simulate complex adaptive systems. A central difficulty is that such models contain many local rules and parameters, where small changes can lead to runaway behaviour, population collapse, or saturation at artificial bounds. We study this problem in a continuous predator-prey system where sheep and wolves are active agents with local sensing, internal energy, and recurrent neural network-based controllers. We ask whether environmental and demographic parameters can be tuned so that the resulting population dynamics resemble classical Lotka-Volterra cycles. We optimise these parameters with a feature-based loss that rewards sustained oscillations, phase lag, bounded populations, and long-term persistence, first for random controllers and then for evolved controllers in a more naturalistic setting. The model is implemented in ABMax, a JAX-based agent-based modelling framework that enables efficient batched simulation on hardware accelerators.
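
For reference, the classical Lotka-Volterra cycles that the tuned populations are meant to resemble are usually written as the pair of equations below. This is the standard textbook form, not a formula taken from the paper; the symbols x (prey, here sheep), y (predators, here wolves), and the positive rate constants a, b, c, d are notation assumed for this sketch.

```latex
\begin{aligned}
\frac{dx}{dt} &= a\,x - b\,x\,y, \\
\frac{dy}{dt} &= c\,x\,y - d\,y .
\end{aligned}
```

In this idealised system both populations oscillate indefinitely with the predator peak lagging the prey peak, which is the kind of behaviour the features named in the abstract (sustained oscillations, phase lag, bounded populations, persistence) describe.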

[MA-1] Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

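[Quick read]: This paper addresses a limitation of conventional shielded reinforcement learning in network security, where formal specifications are treated purely as runtime safety constraints. Existing approaches view automaton generation and policy shielding as deployment-time safeguards and overlook the value of formal analysis at design time. The paper repositions the automata-theoretic toolchain used to build runtime shields (specification compilation, product game construction, attractor computation, and winning-region extraction) as a design-time analytical instrument. It builds a constrained two-player safety game in which the two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, while the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict, a formal certificate that a topology-specification pair is or is not defensible, together with the associated winning region and shield. The authors further derive topology-level metrics from the attractor structure and combine them with post-convergence behaviour from shield-constrained adversarial multi-agent reinforcement learning to form a defensibility fingerprint that captures both formal safety properties and operational behaviour. A what-if analysis shows that formal defensibility and operational effectiveness reflect different aspects of security: small architectural changes can shift operational outcomes substantially while leaving formal safety margins nearly unchanged. The framework's value therefore lies less in producing a deployable safe policy than in supporting architectural decisions about whether, where, and how a system can be defended. The defensibility verdict, not the safe policy, is the main output.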

Link: https://arxiv.org/abs/2606.13621
Authors: Achraf Hsain, Sultan Almuhammadi
Affiliation: King Fahd University of Petroleum and Minerals
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 26 pages, 7 figures, 7 tables. Under review at JAIR. Code: this https URL

Abstract: Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent’s actions. We argue this is the wrong product. The same automata-theoretic machinery – specification compilation, product game construction, attractor computation, and winning-region extraction – is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary’s legal actions during attractor computation. Solving the game yields a defensibility verdict – a formal certificate that a topology-specification pair is or is not defensible – with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network’s formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
Submission history: From Achraf Hsain, [v1] Thu, 11 Jun 2026 17:35:40 UTC (1,340 KB)
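
The attractor computation and winning-region extraction mentioned in the abstract can be illustrated with the textbook backward-reachability algorithm for two-player safety games. The sketch below is a generic version under stated assumptions, not the authors' implementation: the function name `attacker_attractor`, the dictionary-based game encoding, and the toy five-state game are all hypothetical.

```python
from collections import defaultdict


def attacker_attractor(nodes, edges, owner, unsafe):
    """Return the states from which the attacker can force a visit to `unsafe`.

    nodes:  iterable of states
    edges:  dict mapping a state to the list of its successor states
    owner:  dict mapping a state to "attacker" or "defender" (who moves there)
    unsafe: set of states the defender must avoid
    """
    # Build the predecessor relation once.
    preds = defaultdict(list)
    for s in nodes:
        for t in edges.get(s, []):
            preds[t].append(s)

    # For defender states, count the successors not yet known to be losing.
    remaining = {s: len(edges.get(s, [])) for s in nodes}

    attractor = set(unsafe)
    frontier = list(unsafe)
    while frontier:
        t = frontier.pop()
        for s in preds[t]:
            if s in attractor:
                continue
            if owner[s] == "attacker":
                # One losing successor is enough when the attacker chooses.
                attractor.add(s)
                frontier.append(s)
            else:
                # The defender loses only when every successor is losing.
                remaining[s] -= 1
                if remaining[s] == 0:
                    attractor.add(s)
                    frontier.append(s)
    return attractor


# Hypothetical toy game: "u" is the unsafe state.
nodes = ["a", "b", "c", "d", "u"]
edges = {"a": ["b", "c"], "b": ["u", "c"], "c": ["c"], "d": ["b"], "u": ["u"]}
owner = {"a": "defender", "b": "attacker", "c": "defender",
         "d": "defender", "u": "defender"}

losing = attacker_attractor(nodes, edges, owner, {"u"})
winning_region = set(nodes) - losing  # states the defender can keep safe forever
```

In this toy game the defender's winning region is the complement of the attractor, and a verdict of the kind the abstract describes would correspond to checking whether the initial state lies inside it. How the attacker specification prunes the attacker's legal moves is not modelled here.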

[MA-2] Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch ICML2026

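[Quick read]: This paper addresses dispatch optimization in a three-sided marketplace: how to use reinforcement learning to adapt the weights of the dispatch objective in a complex, dynamic environment whose feedback (delivery speed, courier utilization, merchant congestion) arrives with delay. The core challenge is that a conventional optimizer cannot directly exploit delayed and coupled operational feedback, while production feasibility constraints and operational safeguards must still be respected. The key design is a policy-optimizer interface: a store-level discrete policy, trained on logged marketplace data, outputs a scalar multiplier that shifts the existing combinatorial assignment optimizer's tradeoff between delivery quality and batching efficiency, softly reweighting the dispatch objective instead of replacing the optimizer. This allows offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility and safety constraints. A shared value function is trained on centralized offline data with decentralized store-level execution, using Double Q-learning targets and a conservative regularizer to curb out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increased batching and reduced courier-side time costs without degrading customer-facing delivery quality, illustrating how world feedback from a live economic and logistics system can be used to adapt decision policies safely.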

Link: https://arxiv.org/abs/2606.13604
Authors: Haochen Wu, Yi Hou, Shiguang Xie
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted at ICML 2026 Workshop on Reinforcement Learning from World Feedback (RLxF)

Abstract:Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer’s tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.
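
As a rough illustration of the two training ingredients named in the abstract, the sketch below computes a Double Q-learning target and one common form of conservative penalty. It is written under assumptions: the abstract does not specify the regularizer, so the log-sum-exp penalty (in the style of conservative Q-learning) is a stand-in, and the function names and the three-action example are hypothetical.

```python
import math


def double_q_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double Q-learning target: the online network picks the next action,
    the target network evaluates it."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]


def conservative_penalty(q_values, logged_action):
    """Assumed regularizer: logsumexp over all actions minus the value of the
    logged action. It is non-negative and grows when unlogged actions are
    valued above the action actually seen in the data."""
    m = max(q_values)
    log_sum_exp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return log_sum_exp - q_values[logged_action]


# Hypothetical example with three discrete multipliers as actions.
target = double_q_target(reward=1.0, gamma=0.9,
                         q_online_next=[0.2, 0.5, 0.1],
                         q_target_next=[0.3, 0.4, 0.9])
penalty = conservative_penalty([0.0, 0.0], logged_action=0)
```

Here the online values select action 1, so the target uses the target network's value 0.4 for that action and ignores its larger value 0.9 for action 2, which is the overestimation control that Double Q-learning is meant to provide.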

[MA-3] Reward Modeling for Multi-Agent Orchestration

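[Quick read]: This paper addresses the difficulty of training orchestrators for multi-agent systems (MAS) built on large language models (LLMs), where coordinating specialized agents is hampered by scarce supervision and high computational cost. The proposed solution is Orchestration Reward Modeling (OrchRM), a self-supervised framework that uses intermediate artifacts from multi-agent executions to construct win-lose pairs for training a Bradley-Terry reward model, so orchestration quality can be evaluated without human annotations. Unlike existing methods that depend on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, which improves both training efficiency and test-time scaling. Experiments report up to 10x better training efficiency in token usage and up to 8% higher accuracy in MAS test-time scaling across mathematical reasoning, web-based question answering, and multi-hop reasoning. The gains transfer across domains, supporting orchestration-level reward modeling as a scalable route to robust multi-agent orchestration.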

Link: https://arxiv.org/abs/2606.13598
Authors: King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Semih Yavuz, Shafiq Joty, Hao Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Preprint; work in progress

Abstract:Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at this https URL.
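
The Bradley-Terry objective behind training on win-lose pairs can be sketched in a few lines. This is the standard pairwise loss, shown under the assumption that the reward model returns one scalar score per orchestration; the function names and the example scores are hypothetical and say nothing about how OrchRM builds its pairs.

```python
import math


def preference_probability(score_win, score_lose):
    """Bradley-Terry probability that the 'win' item beats the 'lose' item."""
    return 1.0 / (1.0 + math.exp(-(score_win - score_lose)))


def bradley_terry_loss(score_win, score_lose):
    """Negative log-likelihood -log(sigmoid(score_win - score_lose)),
    computed in a numerically stable softplus form."""
    margin = score_win - score_lose
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_pairwise_loss(pairs):
    """Average loss over (score_win, score_lose) pairs."""
    return sum(bradley_terry_loss(w, l) for w, l in pairs) / len(pairs)


# Hypothetical scores assigned by a reward model to two win-lose pairs.
batch_loss = mean_pairwise_loss([(2.0, 0.5), (0.1, 0.4)])
```

Minimising this loss pushes the score of the winning orchestration above the losing one; the second pair in the example is ranked the wrong way round and therefore contributes the larger share of the loss.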

[MA-4] See What I See, Know What I Think: Dense Latent Communication Across Heterogeneous Agents

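[Quick read]: This paper addresses cross-model latent alignment, the main obstacle to efficient communication between heterogeneous models in multi-agent systems. Most prior KV-cache communication work is homogeneous and relies on duplicate copies of the same model, while existing heterogeneous methods assume shared input and use transferred caches mainly for steering, so they cannot pass on an agent's full cognitive state. The paper proposes dense alignment for heterogeneous key-value cache (KV-cache) communication, built on a lightweight cross-model cache transformation and a two-phase training scheme (reconstruction followed by generation). Across all six directions among Qwen3-4B, 8B, and 14B, and on six in-domain and out-of-domain benchmarks, the method outperforms prior heterogeneous baselines. In context-aware settings it matches or exceeds text communication at roughly 2 to 3 times lower compute, and it remains effective in context-unaware transfer, where the receiver sees no input and prior methods collapse.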

Link: https://arxiv.org/abs/2606.13594
Authors: Siyi Chen, Xiaoyan Zhang, Meng Wu, Jonathan Tremblay, Valts Blukis, Stan Birchfield, Rene Vidal, Alvaro Velasquez, Sijia Liu, Qing Qu
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:

Abstract:Multi-agent systems communicate mostly through text, paying a lossy and expensive decode and re-encode cost. KV-cache communication is a promising alternative, yet most prior work is homogeneous, using duplicate copies of the same model, and avoids the central challenge of cross-model latent alignment; existing heterogeneous methods are also restrictive, typically assuming shared input and using transferred caches mainly for steering. We study a more fundamental question: can heterogeneous agents be aligned well enough to perform real “mind reading” and transfer both what one agent sees and how it thinks? Our information-structure analysis reveals a duality: context-aware transfer is driven by sparse reasoning signals, while context-unaware transfer, where the receiver sees no input, requires dense contextual knowledge preservation. Motivated by this, we propose dense alignment for heterogeneous KV-cache communication via a lightweight cross-model cache transformation and two-phase training: reconstruction followed by generation. Across all six directions of Qwen3-4B, 8B, 14B and six in-domain and out-of-domain benchmarks, our method outperforms prior heterogeneous baselines, matches or exceeds text communication in context-aware settings at roughly 2 to 3 times lower compute, and remains effective in context-unaware transfer where prior methods collapse.

[MA-5] Multiagent Protocols with Aggregated Confidence Signals

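[Quick read]: This paper addresses the lack of a unified confidence estimate for the output of a multiagent system in natural language processing (NLP). Prior work uses confidence inside multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but never produces a single aggregated confidence for the system's final answer. The paper proposes three protocols that first transform the raw confidence signals of different models so they are comparable across models, then combine them into one system-level confidence through soft voting or a probability fusion the authors call Bayesian fusion. The aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and the losses MAD incurs on more ambiguous tasks are recovered. The study also compares two confidence estimators, sequence probability and self-report, with parametric and non-parametric calibrators, and finds that calibration improves F1 for both estimators while AUARC depends on it less. The evaluation covers five benchmarks, four task types, and six homogeneous and heterogeneous debating pairs per benchmark, spanning a range of model capabilities and sizes.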

Link: https://arxiv.org/abs/2606.13591
Authors: Ali Elahi, Barbara Di Eugenio
Affiliation: University of Illinois Chicago
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 22 pages and 5 figures, 9 pages and 2 figures before the appendix

Abstract:Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.
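
To make the two aggregation ideas concrete, here is a minimal sketch of soft voting and of one simple probability fusion. Both are assumptions for illustration: the abstract does not define its protocols, so `soft_vote` uses plain confidence-weighted voting, and `fuse_agreeing` is a naive-Bayes style fusion (uniform prior, conditionally independent agents) that may differ from the paper's Bayesian fusion. Confidences are assumed to be already transformed to a common scale, with at least one positive value.

```python
from collections import defaultdict


def soft_vote(votes):
    """votes: list of (answer, confidence) pairs, one per agent.
    Returns the answer with the largest total confidence and its share
    of the total confidence mass as the aggregated confidence."""
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


def fuse_agreeing(confidences):
    """Fuse the confidences of agents that agree on one answer, assuming a
    uniform prior and conditionally independent agents (naive Bayes)."""
    p_correct, p_wrong = 1.0, 1.0
    for p in confidences:
        p_correct *= p
        p_wrong *= 1.0 - p
    return p_correct / (p_correct + p_wrong)


# Hypothetical debate outcome: two agents answer "A", one answers "B".
answer, confidence = soft_vote([("A", 0.9), ("B", 0.6), ("A", 0.7)])
fused = fuse_agreeing([0.8, 0.7])
```

Under these assumptions two agreeing agents at 0.8 and 0.7 fuse to a confidence above either individual value, which shows why agreement between comparable signals can sharpen the system-level estimate.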

[MA-6] Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda IJCAI ECAI2026

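[Quick read]: This paper addresses the lack of explainability and compliance guarantees when large language model (LLM) agents automate quality-management decisions in regulated industries (such as healthcare and finance). Conventional approaches rely on guardrails that detect semantic errors after the fact, which does little to prevent structural risks such as control-flow violations. The proposed answer is compliance-by-construction: symbolic structures already present in these domains, including regulations, typed process models, and compliance constraints, are treated as core architectural components that shape the agent's decision logic by design, so that control flow cannot drift from the compliant path. The paradigm complements guardrails: it secures structural compliance, while guardrails remain essential for catching semantic errors. The paper identifies a structured set of neuro-symbolic research challenges at the foundational and capability levels, argues that addressing them jointly is what enables compliance-by-construction, and calls on the neuro-symbolic community to treat regulated process automation as a high-impact research domain.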

Link: https://arxiv.org/abs/2606.13405
Authors: Alexander Rombach, Chantale Lauer, Nijat Mehdiyev
Affiliation: German Research Center for Artificial Intelligence (DFKI); Saarland University
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted as a poster in NILA Workshop @ IJCAI-ECAI 2026

Abstract:LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent’s decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based monitoring: a structural foundation that prevents control-flow violations, while guardrails remain essential for catching semantic errors. We identify a structured set of neuro-symbolic research challenges on foundational and capability level and show that addressing them jointly enables compliance-by-construction. We call on the neuro-symbolic community to engage with regulated process automation as a high impact research domain.

[MA-7] Can I Buy Your KV Cache?

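[Quick read]: This paper addresses the inefficiency of AI agents repeating prefill over the same input document. Each agent recomputes identical text from scratch, which wastes compute, especially for long sequences where attention cost grows quadratically with length (O(L²)). The proposed remedy is a compute-once, reuse-many precomputed key-value (KV) cache: a publisher precomputes and stores a document's KV cache, and other agents pay to load it and skip prefill. The approach is token-exact: loading the precomputed cache and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level). On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the advantage grows with sequence length. Where the cache lives is decisive: shipping the KV cache fails because it is nearly incompressible, so per-load egress costs more than the prefill it saves, whereas hosting it provider-side, as production prompt caching does, removes egress entirely. By the paper's measurements, serving one hot 3774-token document to 80M agents costs about 1.5M to re-prefill but only about 0.03M of reuse compute (49.7x less). The paper frames the result as an agent-native prefill CDN, puts the provider margin at millions of dollars per popular document, and leaves lossless KV compression and a cross-party payment layer as open problems.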

Link: https://arxiv.org/abs/2606.13361
Authors: Luoyuan Zhang
Affiliation: Harbin Institute of Technology, Shenzhen (HITSZ)
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA)
Comments:

Abstract:Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document’s KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill’s attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~ 1.5M to re-prefill but only ~ 0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.

[MA-8] LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

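[Quick read]: This paper addresses user-driven sycophancy in large language models (LLMs) used for technical problem solving: when users give incomplete descriptions or plausible but unverified explanations, the model tends to accept the user's hypothesis prematurely instead of gathering evidence and testing alternatives. The proposed solution is LLM-as-an-Investigator, an evidence-first agentic AI methodology implemented through a Solution Investigator Agent that estimates the ambiguity of the initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. The agent does not answer immediately; it keeps investigating until the evidence makes one explanation clearly stronger than the alternatives. For evaluation, the authors build a benchmark from solved technical forum threads in the mechanical, electrical, and hydraulic domains and use a three-agent pipeline: a Problem-Solution Extractor Agent turns threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant tries to recover the solution through dialogue. The results show higher diagnostic accuracy than direct prompting and reasoning-only baselines, and the evidence-first protocol helps reduce user-induced conversational bias.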

Link: https://arxiv.org/abs/2606.13220
Authors: Fabrizio Marozzo, Pietro Liò
Affiliation: University of Calabria; University of Cambridge
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user-driven sycophancy: the tendency of an LLM to reinforce a user-provided hypothesis instead of testing alternative explanations. This paper introduces LLM-as-an-Investigator, an evidence-first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three-agent evaluation pipeline in which a Problem-Solution Extractor Agent converts solved threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning-oriented LLMs, and the proposed investigator-based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning-only baselines, while its evidence-first protocol helps reduce user-induced conversational bias.
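
The step of updating hypothesis probabilities after each answer can be sketched as a plain Bayesian update with a simple stopping rule. This is an illustrative reading, not the paper's algorithm: the likelihood values, the `margin` threshold, and the pump, valve, and wiring hypotheses are all made up for the example.

```python
def update_beliefs(prior, likelihood):
    """Bayes rule over a finite hypothesis set.
    prior:      dict hypothesis -> probability before the user's answer
    likelihood: dict hypothesis -> probability of that answer if it is true"""
    unnormalised = {h: prior[h] * likelihood.get(h, 0.0) for h in prior}
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("the answer is impossible under every hypothesis")
    return {h: v / total for h, v in unnormalised.items()}


def leading_hypothesis(beliefs, margin=2.0):
    """Return the top hypothesis once it is at least `margin` times as
    probable as the runner-up (an assumed stopping rule), else None."""
    ranked = sorted(beliefs.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] >= margin * ranked[1][1]:
        return ranked[0][0]
    return None


# Hypothetical hydraulic fault with three candidate causes.
prior = {"pump": 0.5, "valve": 0.3, "wiring": 0.2}
# Assumed likelihood of the user answering "yes, there is a grinding noise".
posterior = update_beliefs(prior, {"pump": 0.9, "valve": 0.2, "wiring": 0.1})

before = leading_hypothesis(prior)      # no clear leader yet, keep asking
after = leading_hypothesis(posterior)   # one hypothesis now dominates
```

The prior alone does not justify a diagnosis because the top two hypotheses are too close, whereas a single informative answer is enough here to separate them, which mirrors the evidence-first behaviour the abstract describes.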

[MA-9] α-fair heterogeneous agent reinforcement learning

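[Quick read]: This paper addresses the tradeoff between fairness and efficiency when optimizing cooperation in multi-agent systems. Utilitarian objectives maximize overall efficiency but ignore how reward is distributed, which often produces inequitable leader-follower dynamics, while many existing fairness-based methods break the stationarity of Markov Games or lack rigorous theoretical guarantees, so they do not fit safe learning frameworks. The proposed framework combines α-fairness with Heterogeneous-Agent Trust Region Learning (HATRL). It introduces a fair advantage function that dynamically weights each agent's utility by its expected return, so the global objective can move smoothly from purely utilitarian efficiency to α-fair welfare as the parameter α changes, while preserving monotonic improvement and convergence toward Nash Equilibria. Two practical algorithms, α-fair HATRPO and α-fair HAPPO, are evaluated on sequential social dilemmas such as CleanUp and CommonHarvest, where they outperform HATRL's algorithms from a utilitarian point of view while achieving better social outcomes, bridging fair objectives and theoretically safe learning.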

Link: https://arxiv.org/abs/2606.13076
Authors: Yao-hua Franck Xu, Tayeb Lemlouma, Jean-Marie Bonnin, Arnaud Braud
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Abstract: Cooperation in multi-agent systems is typically optimized through utilitarian objectives that maximize overall efficiency but fail to account for reward distribution, often resulting in inequitable “leader-follower” dynamics. While fairness-based approaches encourage pro-social behaviors where every agent benefits from cooperation, many current algorithms - including those utilizing reward shaping - break the stationarity of Markov Games or lack rigorous theoretical guarantees. This creates a critical gap between fair objective methods and theoretically safe learning frameworks. We propose a novel framework that bridges α-fairness with Heterogeneous-Agent Trust Region Learning (HATRL), ensuring monotonic improvement and convergence toward Nash Equilibria. Our approach leverages a fair advantage function that dynamically weights agent utilities based on their expected returns, allowing the global objective to transition from purely utilitarian efficiency to α-fairness welfare based on the parameter α. We introduce two practical algorithms, α-fair HATRPO and α-fair HAPPO, and demonstrate through experiments in sequential social dilemmas like CleanUp and CommonHarvest that they perform better than HATRL’s algorithms from a utilitarian point of view while achieving socially higher outcomes.
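
For readers unfamiliar with α-fairness, the standard α-fair welfare function from the resource-allocation literature is given below. It is offered as background under assumptions: the paper's exact objective and fair advantage function may differ, and the notation (J_i for agent i's expected return, assumed positive, and n agents) is introduced here for illustration only.

```latex
W_\alpha(J_1,\dots,J_n) = \sum_{i=1}^{n} U_\alpha(J_i),
\qquad
U_\alpha(x) =
\begin{cases}
\dfrac{x^{1-\alpha}}{1-\alpha}, & \alpha \ge 0,\ \alpha \ne 1,\\[6pt]
\log x, & \alpha = 1,
\end{cases}
\qquad
\frac{\partial W_\alpha}{\partial J_i} = J_i^{-\alpha}.
```

With α = 0 the welfare reduces to the utilitarian sum of returns, α = 1 gives proportional fairness, and large α approaches max-min fairness. The gradient J_i^{-α} shows one way a weighting based on expected returns can arise: agents with lower returns receive larger weight as α grows.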

[MA-10] Effects of Social Interactions in Self-Organising Railway Traffic Management

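[Quick read]: This paper addresses the tension between scalability and global coordination in decentralized railway traffic management on complex real-world networks: how to keep local decisions tractable and computationally responsive while still guaranteeing globally coherent and feasible schedules in a safety-critical environment. The study systematically examines one structural parameter, the predictive neighbourhood horizon, which determines how far ahead trains look for potential conflicts and which neighbours they negotiate with, and therefore sets the size and density of the social interaction graph. Using a closed-loop simulation framework, the authors find that, contrary to intuition, short horizons suffice for effective coordination, while long horizons compromise local tractability and computational responsiveness with no gain in global schedule optimality.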

Link: https://arxiv.org/abs/2606.13068
Authors: Fabio Oddi, Federico Naldini, Leo D’Amato, Grégory Marlière, Paola Pellegrini, Vito Trianni
Affiliation: Institute for Cognitive Sciences and Technologies, CNR; Univ. Gustave Eiffel, COSYS-ESTAS; Dipartimento di Ingegneria Informatica, Automatica e Gestionale, La Sapienza Università di Roma; Dipartimento di Automatica e Informatica, Politecnico di Torino
Categories: Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments:

Abstract:Recent research is exploring self-organised traffic management as a solution for scaling to complex real-world networks. In such a system, trains predict their neighbourhood, produce traffic plan hypotheses, and agree via consensus with neighbours on a future traffic plan to be implemented. This paper investigates a structural parameter within this pipeline: the predictive neighbourhood horizon. The horizon is used by trains to identify future potential conflicts with neighbours, and to establish the local interaction topology, that is, the subset of trains to negotiate with. As the primary design variable, the horizon directly determines the size and density of the social interaction graph, whereas its impact on the complexity of local sub-problems and the distributed consensus dynamics represents a trade-off to be explored. Through a closed-loop simulation framework the study evaluates how variations of the horizon impact the overall decentralised coordination process, from initial conflict detection to distributed schedule consensus. The analysis focuses on investigating the potential trade-off introduced by the horizon choice: balancing local tractability and computational responsiveness with the need for global schedule coherence and feasibility in safety-critical environments. Contrary to intuition, our empirical results indicate that the short time horizons suffice, while long values compromise local tractability and computational responsiveness with no gain in global schedule optimality.

[MA-11] The Illusion of Multi-Agent Advantage

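[Quick read]: This paper questions the weak empirical basis for the prevailing view that Multi-Agent Systems (MAS) outperform Single-Agent Systems (SAS). Existing evidence comes mostly from benchmarks that emphasize isolated reasoning and do not test the claimed MAS advantages of context protection, parallel processing, and distributed decision-making. Using Chain-of-Thought with Self-Consistency (CoT-SC) as a strong SAS baseline, the authors systematically evaluate automatically generated MAS and find that they consistently underperform CoT-SC on traditional reasoning datasets and on tasks with interactive multi-step workflows (for example BrowseComp-Plus), despite costing up to 10x more. To separate these failures from limitations of task structure, they introduce a diagnostic synthetic dataset tailored to MAS, with explicit task decomposition, context separation, and parallelization potential. On this dataset, expert-architected MAS consistently beat automatically generated architectures in both raw performance and cost-efficiency, showing that current evaluation frameworks mask architectural gaps and inefficiencies by ignoring the marginal utility of extra compute. Deconstructing the generated architectures further reveals architectural bloat: superficial complexity that does not translate into functional utility, exposing a fundamental misalignment between current automated design paradigms and multi-agent principles.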

Link: https://arxiv.org/abs/2606.13003
Authors: Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty
Affiliation: Salesforce Research; HKUST (Guangzhou); University of British Columbia; Nanyang Technological University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.

[MA-12] The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

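[Quick read]: This paper addresses the system-level challenges of running autonomous AI agents at scale across heterogeneous cloud, edge, device, organizational, and cyber-physical environments, with the central question of how to build a scalable, secure, and efficient distributed ecosystem of agents. It develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows. The framework synthesizes foundations from single-agent agentic AI, multi-agent systems, distributed computing, communication networks, game theory, and security engineering, and examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource-management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. It highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance for large-scale networks of autonomous agents.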

Link: https://arxiv.org/abs/2606.12835
Authors: Quanyan Zhu
Affiliation: New York University; NYU Tandon School of Engineering; Department of Electrical and Computer Engineering; NYU Center for Cybersecurity
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems of reasoning, communication, and action. This paper develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber-physical environments. We synthesize foundations from single-agent agentic AI, multi-agent systems, distributed computing, communication networks, game theory, and security engineering to characterize the architectures and mechanisms required for scalable agent ecosystems. The paper examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource-management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. The resulting framework highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance for large-scale networks of autonomous agents.

[MA-13] Smarter Saboteurs, Better Fixers: Scaling Security in Linear Multi-Agent Workflows ICML2026

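[Quick read]: This paper addresses how resilient the collaboration structures of LLM-based multi-agent systems (MAS) are to adversarial compromise in real deployments. Attackers can use prompt injection or jailbreaking to sabotage individual agents, yet the relationship between model scale and system-level security remains poorly understood. Experiments on linear multi-agent workflows across scales of two open-weight model families on the HumanEval benchmark reveal a compliance-correction symmetry: larger models are far more likely to execute malicious instructions faithfully, with the control-to-malicious performance drop reaching 53.7 percentage points at 27B parameters in uncorrected pipelines. Appending a lightweight terminal Fixer stage shrinks this gap to 0.6 percentage points and restores statistical parity with control-level performance. The results indicate that strictly linear collaboration structures can be viable and resilient at this scale, and that the brittleness previously attributed to linear topology may stem from a lack of correction.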

链接: https://arxiv.org/abs/2606.12709
作者: Timothy McAllister,Sina Abdidizaji,Ivan Garibay,Ozlem Ozmen Garibay
机构: 未知
类目: Multiagent Systems (cs.MA); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 16 pages (4 are main text), 2 figures, 6 tables. Accepted to the AIWILD Workshop at ICML 2026

点击查看摘要

Abstract:As LLM-based multi-agent systems (MAS) are deployed in the wild, the resilience of their collaboration structures against adversarial compromise becomes a critical safety concern. Attackers may leverage prompt-injection or jailbreaking to sabotage individual agents within MAS workflows, but the interaction between model scaling and system-level resilience remains poorly understood. This paper investigates how model scale affects the security of linear multi-agent workflows. Our experiments across scales of two open-weight model families on the HumanEval benchmark reveal a compliance-correction symmetry: larger models are far more likely to faithfully execute malicious instructions, with the control-to-malicious performance drop reaching 53.7pp at 27B in uncorrected pipelines. However, appending a lightweight terminal Fixer stage collapses this to 0.6pp and restores statistical parity with control-level performance, demonstrating that strictly linear collaboration structures can be viable and resilient to adversaries at this scale, and suggesting that the brittleness previously attributed to linear topology may stem from a lack of correction.

[MA-14] SAIGuard: Communication-State Simulation for Proactive Defense of LLM Multi-Agent Systems

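[Quick read]: This paper addresses how security risks spread through LLM-based multi-agent systems (MAS) because collaboration is driven by communication. Existing defenses are mostly reactive: they detect and isolate harmful agents after an attack has occurred, which can cause irreversible damage and reduce collaborative efficiency. The key to the solution is a proactive defense framework, the Simulation-aware Interception Guard (SAIGuard). SAIGuard runs communication-state simulation over the multi-agent interaction graph, estimates the effect of incoming messages on local agent states and on the global system state, and flags high-risk messages by their reconstruction deviation from benign communication patterns. Rather than isolating agents as conventional methods do, SAIGuard sanitizes or regenerates suspicious messages before they propagate, protecting the system without undermining its ability to collaborate. Experiments across multiple topologies and attack scenarios show that the method significantly lowers attack success rates while preserving collaborative utility, outperforming conventional reactive defenses.

Link: https://arxiv.org/abs/2606.12474
Authors: Ruxue Shi, Yili Wang, Mengnan Du, Qinggang Zhang, Rui Miao, Yixin Liu, Xin Wang
Affiliations: Jilin University; The Chinese University of Hong Kong, Shenzhen; Griffith University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: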

Abstract:LLM-based multi-agent systems (MAS) solve complex tasks through inter-agent collaboration, but their communication-driven nature also allows security risks to spread across agents and trigger system-wide failures. Existing MAS defenses mainly follow a reactive paradigm after execution by detecting and isolating harmful agents, which may cause irreversible damage and degrade collaborative utility. To address this, we propose a proactive defense framework for MAS security, namely a Simulation-aware Interception Guard (SAIGuard). SAIGuard performs communication-state simulation over the MAS interaction graph, estimates the impact of incoming messages on local agent states and the global MAS state, and detects risky messages via reconstruction deviations from benign communication patterns. Instead of isolating agents, SAIGuard sanitizes or regenerates suspicious messages before they propagate into the system. Experiments across diverse topologies and attack scenarios show that SAIGuard reduces attack success rates while maintaining MAS utility, outperforming reactive defenses.

Natural Language Processing

[NLP-0] EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

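[Quick read]: This paper addresses the substantial gap between how LLM agents perform in static evaluation environments and how they perform in real, dynamic deployments. Most existing evaluations assume a constant environment, whereas in practice the environment keeps evolving, so agents must continually update their knowledge, skills, and behavior. To meet this challenge, the paper proposes EvoArena, a benchmark suite that models environment change as sequences of progressive updates across terminal, software, and social domains, and designs EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories so that agents can reason about environmental evolution from the changes in their memory. Experiments show that current mainstream agents average only 39.6% accuracy on EvoArena; with EvoMem, accuracy rises by 1.5% on average across the three evolving domains and by 6.1% and 4.8% on the standard benchmarks GAIA and LoCoMo. EvoMem also improves chain-level accuracy by 3.7%, indicating stronger execution across consecutive sequences of evolving subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in memory, preserving dynamic environment states more completely. The results underline the importance of modeling evolution in both evaluation and memory architecture for reliable agent deployment.

Link: https://arxiv.org/abs/2606.13681
Authors: Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu
Affiliations: National University of Singapore; Singapore Management University; University of Washington; University College London; University of Pennsylvania; Nanyang Technological University; Recursive; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: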

Abstract:Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.
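
The patch-based memory idea described above can be illustrated with a minimal sketch. It is not the EvoMem implementation: the `Patch` and `PatchMemory` classes, their fields, and the `cli.flag` example are hypothetical names chosen here, and the sketch only shows the general pattern of keeping an update history instead of overwriting values in place.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Patch:
    """One recorded change to a memory entry (hypothetical schema)."""
    step: int            # index of the environment update where the change was seen
    key: str             # which fact changed
    old: Optional[Any]   # previous value (None if the key was new)
    new: Optional[Any]   # new value (None means the fact was removed)


@dataclass
class PatchMemory:
    """Memory that keeps its update history instead of overwriting in place."""
    patches: List[Patch] = field(default_factory=list)

    def write(self, step: int, key: str, value: Optional[Any]) -> None:
        old = self.current().get(key)
        if old != value:  # record only real changes
            self.patches.append(Patch(step, key, old, value))

    def current(self) -> Dict[str, Any]:
        """Replay all patches in order to rebuild the latest state."""
        state: Dict[str, Any] = {}
        for p in self.patches:
            if p.new is None:
                state.pop(p.key, None)
            else:
                state[p.key] = p.new
        return state

    def history(self, key: str) -> List[Patch]:
        """All recorded changes for one key, oldest first."""
        return [p for p in self.patches if p.key == key]


mem = PatchMemory()
mem.write(0, "cli.flag", "--out")
mem.write(1, "cli.flag", "--output")  # the tool was updated
mem.write(2, "cli.flag", "--output")  # unchanged, so nothing is recorded
```

Because the history is retained, an agent could in principle answer both "what is the flag now?" (`mem.current()`) and "what did it used to be?" (`mem.history("cli.flag")`).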

[NLP-1] Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

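[Quick read]: This paper addresses a limitation of conventional retrieval-augmented generation (RAG) on complex reasoning tasks, where retrieval relies on lexical or semantic similarity: semantically similar problems may need entirely different solution strategies, while superficially different problems may share the same underlying reasoning pattern. The paper proposes a post-training framework, Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT). Its core is gold-relevance distillation, which trains a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap alone; the policy model is then optimized with reinforcement fine-tuning on the retrieved analogous demonstrations, so that it learns to reason by analogy from reasoning traces under verifiable outcome rewards. The study further finds that reasoning-aware retrieval surfaces complementary solution strategies, giving individual problems distinct reasoning scaffolds. On several challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning; on AIME 2025, for example, it improves average@32 over GRPO by 7.1 points for Qwen3-1.7B and 2.8 points for Qwen3-4B, suggesting that reasoning-aware retrieval is an axis of improvement orthogonal and complementary to advances in reward design or training curricula.

Link: https://arxiv.org/abs/2606.13680
Authors: Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen, Avinash Atreya, Hanjie Chen, Vicente Ordonez
Affiliations: Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: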

Abstract:Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap, and then fine-tunes the policy model via reinforcement fine-tuning methods with retrieved analogous demonstrations, so the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively – suggesting that reasoning-aware retrieval is a complementary axis of improvement and orthogonal to advances in reward design or training curricula.

[NLP-2] Influcoder: Distilling Decoders Gradient Influence Rankings into an Encoder for Data Attribution

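[Quick read]: This paper addresses the bottleneck in curating high-quality training data for large language models (LLMs), focusing on how to efficiently identify the training samples that most influence model outputs. Existing data attribution (DA) methods based on influence functions can quantify how training samples condition model outputs, but on large datasets they are too slow and too storage-hungry to be practical. The paper proposes Influcoder, which uses a fast, low-resource encoding mechanism to make influence estimation scalable while preserving the accuracy of influence functions, enabling efficient, low-cost influence-based data attribution at scale.

Link: https://arxiv.org/abs/2606.13668
Authors: Dimitri Kachler, Damien Sileo, Pascal Denis
Affiliations: Centre Inria de l’Université de Lille, CRIStAL, Université de Lille
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 2 figures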

Abstract:With the growing capabilities of large language models (LLMs), there has been an increasing push to curate high-quality datasets by filtering samples in the training data. In general, Data Attribution (DA) methods aim to estimate how individual samples in a training dataset can precondition a model to generate certain outputs. As an example, one might be interested in which samples in the data could be the source of toxic behavior after training the LLM. Many methods quantify this conditioning through the paradigm of influence functions. While methods of this family are effective, they lack the processing speed and storage compactness needed for practical use on large datasets. We propose a method, Influcoder, as a quick and cost-effective approach to influence-based Data Attribution at scale.

[NLP-3] HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

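[Quick read]: This paper addresses the inefficiency that arises in multi-tool tasks when tool execution is too fine-grained. Mainstream tool-augmented LLM agents typically use step-wise atomic tool calls, so every invocation, observation, and value transfer is exposed in the main reasoning trace. The result is an execution-granularity mismatch: locally deterministic tool workflows are unfolded into repeated, model-visible decision steps, which consume context and force the model to manage low-level dataflow while reasoning. The paper proposes HyperTool, a unified executable MCP-style tool interface whose core innovation is to raise the model-visible unit of tool execution from a single atomic call to a code block containing multiple steps. Through this interface the model submits one code block that calls the original tools, processes their return values, and passes intermediate results locally, folding a deterministic tool subroutine into a single outer call. To train models to use the interface, the authors synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On the MCP-Universe benchmark, HyperTool raises average accuracy from 15.69% to 35.29% for Qwen3-32B and from 9.93% to 33.33% for Qwen3-8B, surpassing GPT-OSS and Kimi-k2.5.

Link: https://arxiv.org/abs/2606.13663
Authors: Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang, Shuo Tang, Tuney Zheng, Bryan Dai, Jian Yang, Siheng Chen
Affiliations: Shanghai Jiao Tong University; IQuest Research; Beijing University of Aeronautics and Astronautics
Categories: Computation and Language (cs.CL)
Comments: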

Abstract:Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace. We introduce HyperTool, a unified executable MCP-style tool interface that changes the model-visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On MCP-Universe, HyperTool improves average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, and surpasses GPT-OSS and Kimi-k2.5 on average accuracy, showing that HyperTool can substantially improve multi-step tool use.
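
To make the idea of folding several tool calls into one outer call concrete, here is a toy sketch. It is not the HyperTool interface: the tools `search_city` and `get_weather`, the `hypertool` function, and the convention of returning a variable named `result` are all assumptions made for illustration, and the bare `exec` used here would need sandboxing in any real system.

```python
from typing import Any, Callable, Dict


# Hypothetical toy tools standing in for MCP tools.
def search_city(name: str) -> Dict[str, Any]:
    return {"name": name, "lat": 48.85, "lon": 2.35}


def get_weather(lat: float, lon: float) -> Dict[str, Any]:
    return {"temp_c": 21.0, "lat": lat, "lon": lon}


TOOLS: Dict[str, Callable[..., Any]] = {
    "search_city": search_city,
    "get_weather": get_weather,
}


def hypertool(code: str) -> Any:
    """Run a model-written code block with the tools in scope.

    Only the value bound to `result` goes back to the model, so the
    intermediate values never enter the reasoning trace.
    NOTE: bare exec() is unsafe; this is an illustration only.
    """
    namespace: Dict[str, Any] = dict(TOOLS)
    exec(code, namespace)
    return namespace.get("result")


# One outer call replaces two step-wise calls plus a manual value hand-off.
block = """
city = search_city("Paris")
weather = get_weather(city["lat"], city["lon"])
result = weather["temp_c"]
"""
answer = hypertool(block)
```

In a step-wise setup the model would see the city record, copy its coordinates into a second call, and then read the weather record; here it sees only `answer`.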

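[NLP-4] EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

[Quick read]: This paper addresses the bottleneck facing generative AI in autonomous scientific discovery: as LLM capabilities improve, the main obstacle to efficient automated research shifts from designing agent workflows to building environments that effectively guide agent behavior. The key to the solution is environment engineering, which systematically designs the execution environment to shape agent behavior along four dimensions: permissions engineering bounds agent execution and keeps evaluation isolated; artifact engineering supports collaboration and version management through the filesystem and Git; budget engineering supports budget-aware exploration; and human-in-the-loop engineering lowers the friction of human intervention. EurekAgent, which implements this idea, improves performance on mathematics, kernel engineering, and machine learning tasks, for example discovering a new best 26-circle packing result with a total API cost below 11, supporting the view that environment engineering is key to strengthening agents' capacity for autonomous research.

Link: https://arxiv.org/abs/2606.13662
Authors: Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song, Lei Hou, Juanzi Li
Affiliations: Tsinghua University; Zhipu AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: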

Abstract:LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than 11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

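[NLP-5] Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

[Quick read]: This paper addresses the problem of detecting LLM reasoning failures without ground-truth labels. Existing methods such as self-consistency, semantic entropy, and P(True) rely on within-question sampling and self-evaluation, and their diagnostic power is unstable on complex multi-hop question answering. The paper proposes a new diagnostic signal grounded in operad theory, operadic consistency (OC): for a compositional query, the model's direct answer should agree with the answer obtained by explicitly decomposing the query and composing the results. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weight and closed-source), OC correlates strongly with accuracy on four multi-hop QA datasets (Pearson r ∈ [0.86, 0.94], all p ≤ 0.0004) and is the only signal with r ≥ 0.85 on every dataset. Chain-of-thought self-consistency (CoT-SC) is less stable, falling to r ≈ 0.45 on MuSiQue and StrategyQA. At the per-question level, OC carries information beyond CoT-SC and semantic entropy, and remains significant after controlling for other decomposition-aware baselines (cluster-robust p ≤ 10⁻¹³). In selective prediction at an equal cost budget (K = 3), OC yields gains in every tested setting (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164, with 95% confidence intervals excluding zero). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the signal still gives positive lift, with 95% confidence intervals excluding zero in 12 of the 16 cells tested. The key contribution is thus operadic consistency as a label-free, broadly applicable signal for assessing reasoning reliability.

Link: https://arxiv.org/abs/2606.13649
Authors: Nathaniel Bottman, Yinhong Liu, Kyle Richardson
Affiliations: Incubilate; University of Cambridge; Allen Institute for Artificial Intelligence
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: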

Abstract:Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model’s direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model’s own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
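
The consistency check described above (direct answer versus the answer composed from a stated decomposition) can be sketched for the simplest two-hop case. This is an illustration under assumptions, not the paper's procedure: the `operadic_consistency` function, exact-match comparison after `normalize`, and the lookup-table `toy_model` are all hypothetical, and the second toy entry is deliberately wrong so that the check has something to catch.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())


def operadic_consistency(model: Model, question: str,
                         sub_question: str, template: str) -> bool:
    """True if the direct answer matches the answer composed from the decomposition."""
    direct = model(question)
    sub_answer = model(sub_question)                # answer the inner question first
    composed = model(template.format(sub_answer))   # substitute it into the outer template
    return normalize(direct) == normalize(composed)


# Toy "model": a lookup table standing in for an LLM call.
TOY_ANSWERS: Dict[str, str] = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
    "Where was the director of Inception born?": "London",
    "Who wrote Dune?": "Frank Herbert",
    "Where was Frank Herbert born?": "Tacoma",
    "Where was the author of Dune born?": "Seattle",  # deliberately inconsistent
}


def toy_model(question: str) -> str:
    return TOY_ANSWERS.get(question, "unknown")


ITEMS: List[Tuple[str, str, str]] = [
    ("Where was the director of Inception born?", "Who directed Inception?", "Where was {} born?"),
    ("Where was the author of Dune born?", "Who wrote Dune?", "Where was {} born?"),
]
flags = [operadic_consistency(toy_model, *item) for item in ITEMS]
oc_rate = sum(flags) / len(flags)
```

No gold label is consulted anywhere: the second question is flagged only because the model disagrees with itself across the two routes to an answer.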

[NLP-6] SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation ACL2026

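[Quick read]: This paper addresses the lack of a comprehensive evaluation benchmark and of efficient, locally deployable models for text embedding in Slovak, a low-resource West Slavic language. Existing Slovak-specific models built for natural language understanding (NLU) transfer poorly to embedding tasks, and there has been no systematic multi-task evaluation. The paper proposes SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, covering 31 datasets across 7 task types, nearly 4 times the depth of existing multilingual benchmark coverage. The key to the solution is to apply vocabulary trimming and fine-tuning to Multilingual E5 models, producing the lightweight, locally deployable e5-sk-small (45M parameters) and e5-sk-large (365M parameters). Despite size reductions of up to 62%, these open-source models perform competitively with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG), improving the practicality and reproducibility of embedding models for low-resource languages.

Link: https://arxiv.org/abs/2606.13647
Authors: Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová
Affiliations: Comenius University in Bratislava, Slovakia; Cisco Systems; Technical University of Košice, Slovakia; Kempelen Institute of Intelligent Technologies, Bratislava, Slovakia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACL 2026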

Abstract:We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types – nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.
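
Vocabulary trimming, mentioned above, can be shown in its most basic form: keep only the embedding rows for tokens that occur in a target-language corpus and renumber the token ids. The sketch below is a generic illustration, not the authors' pipeline; `trim_vocabulary`, the toy vocabulary, and the two-dimensional embeddings are hypothetical, and trimming a real subword tokenizer involves more than this.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def trim_vocabulary(
    vocab: Dict[str, int],
    embeddings: Sequence[Sequence[float]],
    corpus_tokens: Iterable[str],
    always_keep: Tuple[str, ...] = ("<pad>", "<unk>"),
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep special tokens plus tokens seen in the corpus, then remap ids."""
    keep = set(always_keep) | set(corpus_tokens)
    # Preserve the original id order so the remapping is deterministic.
    ordered = sorted(vocab.items(), key=lambda kv: kv[1])
    kept_tokens = [tok for tok, _ in ordered if tok in keep]
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept_tokens)}
    new_embeddings = [list(embeddings[vocab[tok]]) for tok in kept_tokens]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary with one embedding row per token.
vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "mačka": 3, "猫": 4, "pes": 5}
embeddings = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4], [0.5, 0.5]]
slovak_tokens = ["mačka", "pes", "mačka"]

new_vocab, new_embeddings = trim_vocabulary(vocab, embeddings, slovak_tokens)
```

In multilingual models a large share of the parameters typically sits in the embedding matrix, which is why dropping unused rows can shrink the model without touching the transformer layers.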

[NLP-7] Recursive Agent Harnesses

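[Quick read]: This paper addresses the performance bottleneck in long-context reasoning caused by the limits of the model itself: on highly complex, long inputs, conventional recursive language models (RLMs), which recurse only over model calls, struggle to decompose and manage the task. The core solution is the Recursive Agent Harness (RAH), which extends the recursive unit from a bare model call to a full agent harness: an independent agent instance with filesystem tools, code execution, and planning, rather than a model call with no tools. This harness recursion gives a code-first divide-and-conquer strategy: a parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks. With the backbone fixed at GPT-5, RAH raises the accuracy of the Codex coding-agent baseline on Oolong-Synthetic from 71.75% to 81.36%, a gain attributable to the harness design rather than a stronger model; with the stronger backbone Claude Sonnet 4.5, accuracy rises further to 89.77%, supporting the effectiveness and scalability of the design.

Link: https://arxiv.org/abs/2606.13643
Authors: Elias Lumer, Sahil Sen, Kevin Paul, Vamse Kumar Subbiah
Affiliations: PricewaterhouseCoopers, U.S.
Categories: Computation and Language (cs.CL)
Comments: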

Abstract:Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic’s dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Harness (RAH) and frame it as harness recursion, the code-first extension to the model recursion of RLMs. A parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks. We provide a controlled evaluation on long-context reasoning. With the backbone held fixed at GPT-5 to match the published Codex and RLM baselines, RAH improves the Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens), a gain attributable to the harness rather than the model. With a stronger backbone, Claude Sonnet 4.5, the same design reaches 89.77%.
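
The control flow of harness recursion (split a large workload, spawn children in parallel, handle small pieces directly) can be sketched with the standard library. This is only a structural analogy: `run_harness` is a hypothetical stand-in that counts words, whereas in the paper each unit is a full agent with tools and the parent writes the spawning script itself.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_harness(words: List[str], query: str, max_leaf: int = 4, fanout: int = 2) -> int:
    """Count occurrences of `query`, recursing into child harnesses on large inputs.

    Assumes max_leaf >= 1 and fanout >= 2 so that every split shrinks the input.
    """
    if len(words) <= max_leaf:
        # Small subtask: handled directly, like a structured function call.
        return words.count(query)
    size = -(-len(words) // fanout)  # ceiling division
    parts = [words[i:i + size] for i in range(0, len(words), size)]
    # Large workload: spawn child harnesses in parallel and merge their results.
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        return sum(pool.map(lambda part: run_harness(part, query, max_leaf, fanout), parts))


document = "alpha beta alpha gamma delta alpha beta beta alpha".split()
total = run_harness(document, "alpha")
```

Each level opens its own thread pool, so nested spawning does not wait on a shared pool; a real system would also need limits on depth and total spawned agents.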

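[NLP-8] Operads for compositional reasoning in LLMs

[Quick read]: This paper addresses the fact that question decomposition in generative AI is done heuristically and lacks a rigorous mathematical foundation. Existing methods widely decompose a complex query into sub-questions that can be solved independently and then compose their answers, but the process has no formal model or theoretical support. The paper proposes operads, mathematical structures that describe many-input, one-output operations and their compositions, as the natural framework for question decomposition. It defines the questions operad Q, in which operations correspond to question templates and composition corresponds to substitution of sub-answers, and shows that QA models can be formalized as algebras over this operad. This perspective gives a unified formal account of existing practice and leads to a new evaluation measure, operadic consistency, which measures whether a QA model's answers agree across the partial collapses of a question decomposition tree. Empirical results, reported in a companion paper, show that this measure correlates strongly with accuracy across twelve LLMs and four multi-hop QA datasets and outperforms standard temperature-based self-consistency baselines. Operads thus give question decomposition a solid mathematical footing, and invariants such as operadic consistency open new directions for analyzing and improving the reliability of multi-step reasoning.

Link: https://arxiv.org/abs/2606.13634
Authors: Nathaniel Bottman, Kyle Richardson
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Category Theory (math.CT)
Comments: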

Abstract:Question decomposition, i.e. breaking a complex query into simpler sub-queries whose answers are composed to produce a final answer, is a widely used strategy for improving LLM reasoning, yet it currently lacks a rigorous mathematical foundation. In this paper, we propose operads, mathematical structures that model many-in, one-out operations and compositions thereof, as a natural framework for describing question decomposition. We define the questions operad $Q$, in which operations correspond to question templates and composition corresponds to substitution of sub-answers, and show how QA models can be interpreted as algebras over $Q$. Beyond reframing existing practice, this operadic perspective points toward new methods, in particular a notion of operadic consistency, which measures whether a QA model’s answers agree across the partial collapses of a question decomposition tree. Empirical evaluation of operadic consistency is reported in our companion paper (Bottman, Liu, and Richardson, 2026), which finds it strongly correlated with accuracy across twelve LLMs and four multi-hop QA datasets and outperforming standard temperature-based self-consistency baselines. We argue that operads are the natural mathematical home for question decomposition, and that invariants such as operadic consistency open new directions for analyzing and improving the reliability of multi-step reasoning.
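
For readers unfamiliar with operads, the structure referred to above can be written out in standard textbook notation. The symbols $Q(n)$, $\gamma$, $A$, and $\alpha$ are assumptions made here for illustration (the abstract itself only names $Q$), and unit and symmetry axioms are omitted.

```latex
% Q(n): question templates with n answer slots (the n-ary operations).
% Composition substitutes templates into the slots of another template:
\[
  \gamma \colon Q(k) \times Q(n_1) \times \cdots \times Q(n_k)
  \longrightarrow Q(n_1 + \cdots + n_k).
\]
% An algebra over Q on a set of answers A assigns to each arity n a map
\[
  \alpha_n \colon Q(n) \times A^{n} \longrightarrow A ,
\]
% compatible with composition (arity subscripts on \alpha dropped):
\[
  \alpha\bigl(\gamma(q;\, q_1, \dots, q_k);\ a_{1,1}, \dots, a_{k,n_k}\bigr)
  = \alpha\bigl(q;\ \alpha(q_1;\, a_{1,1}, \dots, a_{1,n_1}),\ \dots,\
      \alpha(q_k;\, a_{k,1}, \dots, a_{k,n_k})\bigr).
\]
```

Read informally, the last equation says that answering a composed question directly gives the same result as answering its parts and substituting; a learned QA model need not satisfy it exactly, and how far its answers agree across such collapses is the kind of quantity operadic consistency is described as measuring.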

[NLP-9] From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation INTERSPEECH2026

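[Quick read]: This paper addresses the choice of speech representation in speech-driven 3D facial animation, that is, how differences in what representations encode affect the quality of reconstructed facial motion. Self-supervised learning (SSL) features emphasize segmental and semantic cues, neural codecs produce latents optimized for acoustic reconstruction, and ASR-style objectives yield label-based spaces. Using objective metrics and a perceptual evaluation, the study compares four families of speech representations for 3D facial synthesis across two facial decoders, and runs probing analyses relating tokenized representations to phonetic units and articulatory deformations. It finds that explicitly encoding phonetic classes improves facial animation prediction for both semantic and label-based representations, which achieve comparable animation quality. Building on this, the paper proposes an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space from which both speech and 3D facial motion are decoded, the core idea being a single discrete space that coordinates joint generation of speech and facial motion, with the aim of improving cross-modal consistency and generation quality.

Link: https://arxiv.org/abs/2606.13630
Authors: Pedro Correa, Olivier Perrotin, Samir Sadok, Paula Costa, Thomas Hueber
Affiliations: Univ. Estadual de Campinas (UNICAMP), Brazil; Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France; Inria at Univ. Grenoble Alpes, CNRS, LJK, France
Categories: Computation and Language (cs.CL)
Comments: This work has been accepted in Interspeech 2026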

Abstract:The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We found that encoding phonetic classes is beneficial for accurate facial animation prediction on both semantic and label-based representations with comparable facial animation quality. From the latter, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that leverages, as a shared space, discrete representations to decode speech and 3D facial motion.

[NLP-10] Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

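[Quick read]: This paper addresses the token inefficiency that arises when LLMs for time series (TS) modeling process TS tokens and prompt tokens uniformly. The two token types have fundamentally different information structures: TS tokens make highly uneven spectral contributions in the frequency domain, with most tokens carrying redundant frequency patterns and a small subset carrying the critical temporal evidence, while the influence of prompt tokens attenuates with model depth, so full prompt retention in every layer is unnecessary. The paper proposes an adaptive token budgeting framework based on this asymmetric-token view, which compresses TS tokens using frequency-domain structure and progressively reduces prompt tokens in deeper layers. Experiments on forecasting, classification, imputation, and anomaly detection show up to 7.68× inference acceleration and performance gains in 78% of evaluated settings, supporting asymmetric token compression for scalable TS foundation models.

Link: https://arxiv.org/abs/2606.13624
Authors: Jialin Gan, Xin Qiu, Guangzhe Chen, Xue Wang
Affiliations: Zhejiang University; Harbin Institute of Technology; Shandong University
Categories: Computation and Language (cs.CL)
Comments: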

Abstract:Large language models (LLMs) have enabled time series (TS) analysis by jointly modeling numerical observations and textual context through a shared token interface. However, TS tokens and prompt tokens exhibit fundamentally different information structures, making uniform token processing inefficient. In this paper, we study token efficiency in TS language modeling from an asymmetric-token perspective. We show that TS tokens have highly uneven spectral contributions, where many tokens share redundant frequency patterns while a small subset preserves critical temporal evidence. We also observe that prompt-token influence attenuates with model depth, suggesting that full prompt retention across all layers is unnecessary. Based on these findings, we develop an adaptive token budgeting framework that compresses TS tokens via frequency-domain structure and progressively reduces prompt tokens across layers. Experiments across forecasting, classification, imputation, and anomaly detection demonstrate up to 7.68× inference acceleration and performance gains in 78% of evaluated settings, showing the effectiveness of asymmetric token compression for scalable TS foundation models.

[NLP-11] One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

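[Quick read]: This paper addresses the risk that generative recommenders built on search-augmented LLMs are misled by polluted live web content, in particular into recommending fake products. The core question is whether a search-augmented LLM unwittingly promotes fake products when it retrieves polluted pages such as fake reviews and promotional pages. The key to the solution is FORGE (Fake Online Recommendations in Generative Environments), a benchmark that simulates web-content pollution in a controlled way by locally rewriting real products in retrieved pages into fake ones, and measures how often the model recommends the fake product. Across 12 commercial and open-weight models, all are vulnerable: a single polluted page yields fooled rates of up to 27%, and replacing the full top 3 raises this to 73.8%; models are especially susceptible in product categories where they lack prior knowledge. The study also finds that reasoning tends to generate spurious social proof to justify false recommendations, and that the defenses evaluated (skepticism prompting and consensus filtering) can exacerbate vulnerability or suppress legitimate recommendations.

Link: https://arxiv.org/abs/2606.13610
Authors: Minghao Luo, Liang Chen
Affiliations: The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: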

Abstract:Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: generative recommenders may consume polluted web content, such as fake reviews and promotional pages crafted to mislead recommendations. We ask: to what extent do search-augmented LLMs become unwitting promoters of fake products when consuming polluted retrieval results? To answer this, we introduce FORGE (Fake Online Recommendations in Generative Environments), a benchmark for measuring fake-product promotion under controlled web-content pollution. Given an upstream search result, FORGE locally rewrites real products in retrieved web pages into fake ones to simulate web-content pollution, and measures how often the LLM recommends the fake product. FORGE covers 225 real-world products across 15 categories and 5 consumer scenarios. Across 12 commercial and open-weights LLMs, all models are vulnerable: a single polluted page yields fooled rates of up to 27%, while the full top-3 replacement raises this to 73.8%. Vulnerability varies substantially across categories, increasing when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate this vulnerability; instead, it often generates spurious social proof to justify false recommendations. We evaluate three defenses: skepticism prompting and consensus filtering (over model priors or cross-document evidence). Skepticism can exacerbate vulnerability, much like reasoning, while filtering risks suppressing legitimate products. We release FORGE at this https URL.

[NLP-12] Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

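[Quick read]: This paper addresses the unclear causal influence of individual chain-of-thought (CoT) steps on a language model's final answer. The core approach is to quantify each step's causal importance via early exit and use it to identify the commitment boundary, the turning point at which the model moves from transient intermediate guesses to a stable, high-confidence answer. This transition usually completes within a single reasoning step, well before the reasoning block ends; the steps that follow are epiphenomenal and leave the final answer probability essentially unchanged. Attention probes show that answer-formation stages can be linearly decoded with high accuracy and generalize to unseen tasks. Using this signal, the paper terminates reasoning early at the commitment boundary, shortening CoTs by up to 55% on average with almost no effect on model performance.

Link: https://arxiv.org/abs/2606.13603
Authors: Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini, Malvina Nissim, Gabriele Sarti
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: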

Abstract:Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer is poorly understood. We estimate each step’s causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a commitment boundary – a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model’s reasoning block ends, and is followed by epiphenomenal CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing the length of CoTs by up to 55% on average with negligible impact on model performance.
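
One simple way to picture a commitment boundary is as the first reasoning step after which the probability of the final answer stays high. The sketch below locates that step post hoc from a list of per-step probabilities; it is an assumption-laden illustration, since `commitment_boundary`, the 0.9 threshold, and the example probabilities are hypothetical, and the paper predicts the boundary with attention probes rather than reading it off known probabilities.

```python
from typing import List, Optional


def commitment_boundary(answer_probs: List[float], threshold: float = 0.9) -> Optional[int]:
    """Return the first step index from which P(final answer) stays >= threshold.

    Returns None if the trace never settles above the threshold.
    """
    boundary: Optional[int] = None
    for step, prob in enumerate(answer_probs):
        if prob >= threshold:
            if boundary is None:
                boundary = step  # candidate start of the stable run
        else:
            boundary = None      # confidence dipped, so the earlier run did not hold
    return boundary


# Probability of the eventual answer if the model were forced to exit after each step.
probs = [0.10, 0.35, 0.20, 0.95, 0.97, 0.96, 0.98]
boundary = commitment_boundary(probs)

# Fraction of steps that come after the boundary step and could be skipped.
saved = 1 - (boundary + 1) / len(probs)
```

Steps after index `boundary` change the answer probability very little in this toy trace, which is the sense in which later steps are called epiphenomenal.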

[NLP-13] LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

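[Quick read]: This paper addresses the "last mile" of AI in scientific laboratories: experiment execution. Generative AI can help read literature, generate hypotheses, and plan protocols, but execution still depends on human operators, so the path from protocol design to robot execution is not automated end to end. The core challenge is that existing Vision-Language-Action (VLA) models are trained mostly on household or tabletop scenes, lack understanding of laboratory-specific conditions (transparent liquids, precision instruments, fixed workflows), and adapt poorly to diverse robot embodiments. The paper proposes a two-track solution. On the data side, it builds RoboGenesis, a simulation-based workflow and data engine that composes atomic skills into structured experiment demonstrations and supports validation and export across robot platforms. On the model side, it proposes LabVLA, trained in two stages: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct model action-aware, and flow matching posttraining then attaches a DiT (Diffusion Transformer) action expert that learns continuous control under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate in both in-distribution and out-of-distribution settings, supporting its robustness and generalization in laboratory scenarios.

Link: https://arxiv.org/abs/2606.13578
Authors: Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
Comments: Work in progress. Project website at this https URL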

Abstract:Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

[NLP-14] ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

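[Quick read]: This paper addresses the limited performance of Multimodal Large Language Models (MLLMs) in specialized domains such as healthcare, especially in multilingual and low-resource settings. In regions like rural India, patients often describe complex medical problems in native Indic languages and rely on multimodal inputs such as medical images; existing English-centric MLLMs support these cases poorly, which limits equitable access to AI-driven healthcare assistance. The solution has two components. The first is ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset built from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. The second is ArogyaSutra, an actor-critic-based multi-agent framework that combines tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making and uses stored actor-critic simulation trajectories for distillation. Experiments show that the dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, and ablations validate the contribution of each component.

Link: https://arxiv.org/abs/2606.13572
Authors: Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha
Affiliations: Indian Institute of Technology Patna; Indian Institute of Technology Kanpur; Prasannadeb Women’s College
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: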

Abstract:Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: this https URL ArogyaSutra/

[NLP-15] Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

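[Quick read]: This paper addresses how text-guided image editing can control both what the model samples and where the sampled change is written back, specifically for bitwise-residual visual autoregressive (VAR) generators. Existing editors mostly operate on token streams, features, or flat next-token logits, leaving two native structures of these models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. The authors propose BitResEdit, a training-free editing framework built from two cooperating parts. BitEdit steers sampling by tilting the post-CFG (Classifier-Free Guidance) per-bit log-odds along a source-target contrast computed on a shared edited prefix. ResEdit converts the sampled bits into per-scale continuous residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive. Ablations show that BitEdit and ResEdit play complementary roles in target alignment and background preservation.

Link: https://arxiv.org/abs/2606.13558
Authors: Shengqiang Zhang, Ruotong Liao, Volker Tresp, Barbara Plank, Hinrich Schütze
Affiliations: LMU Munich; Munich Center for Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: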

Abstract:Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source–target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator’s native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.

[NLP-16] Uncertainty-Aware Hybrid Retrieval for Long-Document RAG

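[Quick read]: This paper addresses a bottleneck in Retrieval-Augmented Generation (RAG) caused by poorly chosen retrieval granularity: large retrieval units preserve context but often introduce irrelevant content that dilutes answer-bearing evidence, while fine-grained units are more compact but can be hard to retrieve reliably because short chunks may lack the semantic, lexical, or bridging cues needed to match the query. The proposed Uncertainty-aware Multi-Granularity RAG (UMG-RAG) is a training-free hybrid retrieval framework that treats chunk granularity as query-specific reliability estimation. Rather than training a new retriever or modifying the generator, it uses existing dense and sparse retrievers as complementary experts across multiple granularities. For each query, it converts each expert-granularity score list into an evidence distribution, estimates reliability from the distribution's entropy, and fuses candidates according to query-specific semantic, lexical, and granularity confidence. A parent promotion variant, UMGP-RAG, uses fine-grained hits to locate relevant evidence while returning broader, non-redundant parent chunks for local coherence. Experiments on question answering benchmarks show that uncertainty-aware fusion and parent promotion improve generation quality while keeping the retrieval pipeline lightweight and plug-and-play.

Link: https://arxiv.org/abs/2606.13550
Authors: Hoin Jung, Xiaoqian Wang
Affiliations: Purdue University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: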

Abstract:Retrieval augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserve context but often introduce irrelevant content, which can dilute answer bearing evidence and worsen long context utilization. Fine-grained units are more compact, but they may be difficult to retrieve reliably because short chunks can lack semantic, lexical, or bridging cues needed to match the query. We propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that treats chunk granularity as query-specific reliability estimation. Instead of training a new retriever or modifying the generator, UMG-RAG uses existing dense and sparse retrievers as complementary experts across multiple chunk granularities. For each query, it converts each expert-granularity score list into an evidence distribution, estimates reliability from distribution entropy, and fuses candidates according to query-specific semantic, lexical, and granularity confidence. We further introduce UMGP-RAG, a parent promotion variant that uses fine-grained hits to locate relevant evidence while returning broader non-redundant parent chunks for local coherence. Experiments on question answering benchmarks show that uncertainty-aware fusion and parent promotion improve generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.
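
The entropy-based fusion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each expert-granularity score list is turned into a distribution with a softmax, reliability is taken as one minus normalized entropy, and fused scores are reliability-weighted sums. The names `softmax`, `reliability`, `fuse`, and the sample scores are hypothetical, and the sketch assumes all experts score the same candidate ids.

```python
import math
from collections import defaultdict
from typing import Dict


def softmax(scores: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Convert raw retrieval scores into a probability distribution."""
    m = max(scores.values())
    exps = {k: math.exp((v - m) / temperature) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def reliability(dist: Dict[str, float]) -> float:
    """One minus normalized entropy: near 1 for a peaked list, 0 for a uniform one."""
    n = len(dist)
    if n <= 1:
        return 1.0
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    return max(0.0, 1.0 - entropy / math.log(n))


def fuse(score_lists: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Weight each expert's distribution by its reliability and sum per candidate."""
    dists = {name: softmax(scores) for name, scores in score_lists.items()}
    weights = {name: reliability(dist) for name, dist in dists.items()}
    total = sum(weights.values()) or 1.0  # avoid division by zero if all are uniform
    fused = defaultdict(float)
    for name, dist in dists.items():
        for chunk_id, p in dist.items():
            fused[chunk_id] += weights[name] / total * p
    return dict(fused)


# Hypothetical scores: a confident dense expert and an uninformative sparse one.
fused = fuse({
    "dense_small_chunks": {"a": 5.0, "b": 0.0, "c": 0.0},
    "sparse_large_chunks": {"a": 1.0, "b": 1.0, "c": 1.0},
})
```

Here the uniform sparse list gets a reliability of roughly zero, so the fused ranking follows the confident dense expert.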

[NLP-17] When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval ACL2026

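[Quick read]: This paper studies how sensitive dense retrievers are to mixed-language queries, which are common in multilingual communities but remain poorly understood. Its approach is embedding-level mixing: mixed queries are constructed as an interpolation of the monolingual embeddings of parallel query translations, and retrieval performance is evaluated systematically as the mixing proportion varies. Key findings: with BGE-M3 on mMARCO, an optimal mixing ratio outperforms the best monolingual endpoint in 88 of 105 cases; mixing is uniformly beneficial when retrieving from non-English document indices, whereas indices containing English are best served by pure English queries, an asymmetry driven by English dominance; English is the strongest mixing partner for every non-English document language; and, after controlling for English dominance, mixing gains correlate negatively with typological distance. The authors conclude that language-mix sensitivity is structured and predictable, and they validate these patterns across model families and scales.

Link: https://arxiv.org/abs/2606.13537
Authors: Tongyao Zhu, Chao-Ming Huang, Min-Yen Kan
Affiliations: National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: ACL 2026 Main (Oral)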

Abstract:While mixed-language querying is ubiquitous in multilingual communities, the sensitivity of dense retrievers to such queries remains poorly understood. We present a ratio-controlled study on mMARCO that systematically evaluates retrieval performance by varying the mixing proportion of parallel query translations via embedding-level mixing – constructing mixed queries as an interpolation of monolingual embeddings. Experiments with BGE-M3 demonstrate that an optimal mixing ratio outperforms the best monolingual endpoint in 88/105 cases. We uncover a distinct asymmetry driven by English dominance: mixing is uniformly beneficial when retrieving from non-English document indices, whereas indices containing English are best served by pure English queries. Furthermore, English acts as the strongest mixing partner for every non-English document language. Finally, when controlling for English dominance, mixing gains correlate negatively with typological distance. We conclude that language-mix sensitivity is structured and predictable, and we validate the robustness of these patterns across model families and scales.
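
Embedding-level mixing can be sketched in a few lines. The sketch below assumes linear interpolation followed by re-normalization to unit length; whether the paper re-normalizes is not stated in the abstract, so that step is an assumption. The toy two-dimensional vectors and the helper names are hypothetical stand-ins for real query and document embeddings.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mix_queries(e_a: Sequence[float], e_b: Sequence[float], alpha: float) -> List[float]:
    """Interpolate two monolingual query embeddings.
    alpha = 1.0 gives pure language A, alpha = 0.0 gives pure language B."""
    return normalize([alpha * a + (1.0 - alpha) * b for a, b in zip(e_a, e_b)])


def rank_documents(query: Sequence[float], docs: Sequence[Sequence[float]]) -> List[int]:
    """Document indices sorted by descending dot product with the query."""
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings: the same query in two languages, plus three documents.
e_en, e_de = [1.0, 0.0], [0.0, 1.0]
docs = [normalize([1.0, 1.0]), [1.0, 0.0], [0.0, 1.0]]

mixed_ranking = rank_documents(mix_queries(e_en, e_de, 0.5), docs)
english_ranking = rank_documents(mix_queries(e_en, e_de, 1.0), docs)
```

Sweeping `alpha` over a grid and recording a retrieval metric at each value would reproduce the shape of a ratio-controlled study on this toy data.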

[NLP-18] Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data INTERSPEECH2026

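[Quick read]: This paper addresses the degradation of training data quality for end-to-end speech-to-speech translation (S2ST) caused by noise, misalignment, and semantic errors in large-scale mined corpora. The challenge is to select high-quality speech pairs without manual labels. The solution is a scalable two-stage Rank-to-Distill strategy: a lightweight ranker first generates keep/drop pseudo-labels from noisy speech pairs, and these labels are then used to train an audio large language model that predicts keep/drop decisions directly from raw paired speech. The resulting model jointly captures acoustic fidelity and cross-lingual semantic consistency when selecting speech-conditioned data. Experiments on CVSS-C and SpeechMatrix show consistent improvements over unfiltered training, with gains of up to +1.4 ASR-BLEU.

Link: https://arxiv.org/abs/2606.13507
Authors: Qixu Chen, Satoshi Nakamura
Affiliations: The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL)
Comments: Accepted to INTERSPEECH 2026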

Abstract:Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech translation performance. We study how to train an audio-language model to make keep/drop decisions on paired speech directly from audio. To obtain reliable supervision without manual labels, we adopt a scalable two-stage Rank-to-Distill strategy. A lightweight ranker generates keep/drop pseudo-labels from noisy speech pairs, which are then used to train an audio large language model to predict keep/drop directly from raw paired speech. The resulting model jointly captures acoustic fidelity and cross-lingual semantic consistency for the selection of speech-conditioned data. Experiments on CVSS-C and SpeechMatrix show consistent improvements over unfiltered training, yielding up to +1.4 ASR-BLEU for end-to-end S2ST.

[NLP-19] SupraBench: A Benchmark for Supramolecular Chemistry

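[Quick read]: This paper addresses the slow design of host-guest systems in supramolecular chemistry, where each candidate pair requires days of dry-lab verification. LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, but no benchmark systematically evaluates their host-guest reasoning across fundamental supramolecular chemistry tasks such as binding affinity prediction. Working with domain experts, the authors release SupraBench, the first supramolecular chemistry benchmark, which covers four fundamental tasks (binding affinity prediction, top-binder selection, solvent identification, and host-guest description) plus an auxiliary vision-based molecular identification task. They also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support domain adaptation. Benchmarking a broad range of open and proprietary LLMs shows substantial headroom on every task. Domain-adaptive pretraining on SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Difficulty also differs sharply across task families, revealing distinct failure modes that point to specific gaps in current supramolecular chemistry reasoning.

Link: https://arxiv.org/abs/2606.13477
Authors: Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li, Connor R. Schmidt, Chuxu Zhang, Matthew J. Webber, Yanfang Ye
Affiliations: University of Notre Dame; University of Connecticut
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: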

Abstract:Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source code and benchmark datasets are available at this https URL.

[NLP-20] MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

[Quick read]: This paper addresses competition-level mathematical proof generation, as in the International Mathematical Olympiad (IMO) and the USA Mathematical Olympiad (USAMO). Its solution, MaxProof, is a population-level test-time scaling framework in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities (proof generation, proof verification, and critique-conditioned proof repair) using a defense-in-depth generative verifier engineered for a low false-positive rate, and merges them into a single released model. At test time, MaxProof uses that one model as generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.

Link: https://arxiv.org/abs/2606.13473
Authors: Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng
Affiliations: MiniMax; The Chinese University of Hong Kong; Fudan University; Peking University; Tsinghua University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities – proof generation, proof verification, and critique-conditioned proof repair – using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
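
Tournament selection over a population of candidates can be sketched generically. The abstract does not describe the bracket format, so the single-elimination scheme below is an assumption, and `tournament_select`, the `compare` callback, and the scored candidates are hypothetical. In MaxProof the comparison would presumably be made by the model acting as ranker rather than by a numeric score.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tournament_select(
    candidates: Sequence[T],
    compare: Callable[[T, T], T],
    rng: random.Random,
) -> T:
    """Single-elimination tournament: pair candidates at random, keep the
    winner of each pair, and repeat until one candidate remains."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool: List[T] = list(candidates)
    while len(pool) > 1:
        rng.shuffle(pool)
        winners = [compare(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # the unpaired candidate gets a bye
        pool = winners
    return pool[0]


# Hypothetical candidate proofs with verifier scores standing in for a ranker.
proofs = [("p1", 0.2), ("p2", 0.9), ("p3", 0.5), ("p4", 0.7), ("p5", 0.1)]
best = tournament_select(proofs, lambda a, b: a if a[1] >= b[1] else b, random.Random(0))
```

With a consistent comparator such as a numeric score, the best candidate always wins regardless of the pairing; a noisy learned ranker would not give that guarantee.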

[NLP-21] Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

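[Quick read]: This paper addresses the lack of conversation-level context in automatic speech recognition (ASR) correction for long conversations that interleave text and speech. Existing methods often rely on the current hypothesis alone or concatenate raw dialogue history, where sparse correction evidence is hard to locate amid redundancy and noise. The paper proposes an ontology memory-augmented ASR correction framework that organizes preceding interaction history into a dynamically updatable ontology memory, storing entities, terminology, surface variants, potential ASR confusions, and semantic relations as retrievable nodes for context-grounded correction. To evaluate this setting, the authors construct RAMC-Corr, a dataset derived from MAGIC-RAMC. On RAMC-Corr, the method improves over direct correction in 9 of 10 paired backbone-setting combinations and encourages more selective, evidence-grounded corrections for context-dependent ASR errors.

Link: https://arxiv.org/abs/2606.13464
Authors: Xinxin Li, Huiyao Chen, Meishan Zhang, Yunxin Li, Zulong Chen, Zhibo Ren, Xiaoqing Dong, Baotian Hu, Min Zhang
Affiliations: Harbin Institute of Technology (Shenzhen); Shenzhen Loop Area Institute (SLAI)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: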

Abstract:Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.

[NLP-22] Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

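[Quick read]: This paper disputes recent claims that Large Language Models (LLMs) exhibit agency or qualify as moral agents. The question is whether producing coherent, normatively evaluable outputs, including apparent moral reasoning, suffices for moral responsibility. The paper argues that it does not: moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and such agency constitutes the form of free will relevant to responsibility. LLM operation, by contrast, is fully characterized by probabilistic input-output mappings learned from data; the apparent intentionality is derived rather than intrinsic, and outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. The paper also addresses objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffices to establish genuine agency.

Link: https://arxiv.org/abs/2606.13441
Authors: Joseph Keshet
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: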

Abstract:Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.

[NLP-23] S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP

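[Quick read]: This paper addresses the vulnerability of Natural Language Processing (NLP) models to word substitution attacks. Most existing defenses focus on first-order sensitivity, that is, how much the output changes under a small input perturbation, but ignore how that sensitivity itself evolves, which is described by curvature; when gradients vary sharply, models can still fail. The paper introduces the Smooth Growth Bound Tensor (S-GBT), a second-order method that bounds the Hessian element-wise, with formal proofs of the resulting robustness bounds. A regularization term added during training minimizes these bounds, yielding tighter certified robustness. The output change under word substitution is bounded by both a linear and a quadratic term. S-GBT is derived for Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) architectures and is integrated directly into the training objective. On multiple benchmark datasets, combining first- and second-order regularization improves certified robust accuracy by up to 23.4% over prior methods while clean accuracy remains competitive.

Link: https://arxiv.org/abs/2606.13439
Authors: Mohammed Bouri, Mohammed Erradi, Adnane Saoud
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The paper has been accepted at NETYS 2026 - 14th edition of the International Conference on Networked Systems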

Abstract:Despite recent progress in Natural Language Processing (NLP), models remain vulnerable to word substitution attacks. Most existing defenses focus on first order sensitivity and measure how much the output changes when the input is slightly perturbed. However, they ignore how this sensitivity evolves, which is described by curvature. When gradients vary sharply, models can still fail. This paper introduces the Smooth Growth Bound Tensor (S-GBT), a second order method that bounds the Hessian element-wise, for which we provide formal theoretical proofs on the resulting robustness bounds. A regularization term is added during training to minimize these bounds. This yields tighter certified robustness against word substitution attacks. The change in the output under word substitution is bounded by both a linear term and a quadratic term. S-GBT is derived for two architectures: Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). The method is integrated directly into the training objective. Its effectiveness is evaluated on multiple benchmark datasets. The results show that combining first and second order regularization improves certified robust accuracy by up to 23.4% compared to prior methods, while clean accuracy remains competitive. These findings indicate that controlling both the gradient and its variation is a promising direction for building more robust models.
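
For intuition on the "linear term plus quadratic term" structure, here is the generic second-order bound that follows from Taylor's theorem for a twice continuously differentiable scalar function $f$. It is a textbook inequality, not the paper's S-GBT bound; the symbols $\delta$ (the input perturbation) and $\bar{H}$ (an element-wise upper bound on the Hessian along the perturbation segment) are notation assumed here for illustration.

```latex
\[
\bigl| f(x+\delta) - f(x) \bigr|
\;\le\;
\sum_{i} \bigl|\partial_i f(x)\bigr|\,|\delta_i|
\;+\;
\frac{1}{2} \sum_{i,j} \bar{H}_{ij}\,|\delta_i|\,|\delta_j|,
\qquad
\bar{H}_{ij} \;\ge\; \sup_{t \in [0,1]} \bigl|\partial_i \partial_j f(x + t\delta)\bigr|.
\]
```

The first sum is the first-order (gradient) contribution and the second is the curvature contribution; shrinking the entries of $\bar{H}$ through regularization tightens the second term.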

[NLP-24] An End-to-End Hybrid Framework for Rumour Detection in Low-Resources Algerian Dialect

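[Quick read]: This paper addresses rumour detection in Algerian-dialect social media content, which is difficult because dialectal text is informal and code-switched, annotated resources are scarce, and standard Arabic Natural Language Processing (NLP) tools perform poorly on dialect text. It proposes an end-to-end hybrid rumour detection framework. The authors build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, labeled automatically through a similarity-based annotation process, and introduce a transliteration pipeline that generates parallel datasets in Arabic script and Arabizi. Among classical machine learning, deep learning, transformer, and hybrid models, a hybrid that pairs transformer embeddings with a classical classifier performs best, reaching an F1-score of 0.84. Domain-specific pre-training also matters more than model size: models trained on social media outperform larger models trained on formal Arabic corpora. The results demonstrate that rumour detection is feasible in low-resource Algerian dialect settings.

Link: https://arxiv.org/abs/2606.13411
Authors: Dihia Lanasri, Fatima Benbarek
Affiliations: USTHB
Categories: Computation and Language (cs.CL)
Comments: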

Abstract:The rapid growth of social media has intensified the spread of rumours. This issue is more challenging in the Algerian context due to the informal and code-switched nature of dialectal content, the scarcity of annotated resources, and the limited effectiveness of standard Arabic NLP tools on dialect text. This paper presents an end-to-end rumour detection hybrid framework for Algerian dialect social media content. We build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, with automatic labeling based on a similarity-based annotation process. A transliteration pipeline is also introduced to generate parallel datasets in Arabic script and Arabizi. We evaluate multiple approaches, including classical machine learning, deep learning, transformers, and hybrid models. Experimental results show that a hybrid approach combining transformer embeddings with a classical classifier achieves the best performance, reaching an F1-score of 0.84. We also find that domain-specific pre-training is more important than model size, with social media-trained models outperforming larger models trained on formal Arabic corpora. These results demonstrate the feasibility of rumour detection in low-resource Algerian dialect settings.

[NLP-25] From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

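[Quick read]: This paper addresses the difficulty Large Language Models (LLMs) have in generating in-depth peer reviews supported by concrete evidence. The authors attribute this to a lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. They formulate reviewing as a Markov Decision Process (MDP) and propose ProReviewer, a review agent that proactively reviews a paper guided by a maintained, structured review log, which serves as a workspace for tracking evidence and intermediate findings. With an 8B backbone trained by supervised fine-tuning and optimized by reinforcement learning, ProReviewer achieves the highest average score across five quality dimensions, outperforming prompt-based methods that use much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by a relative 16%. It also attains the highest win rates against baselines in human evaluation.

Link: https://arxiv.org/abs/2606.13349
Authors: Haishuo Fang, Yue Feng, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE, Germany; School of Computer Science, University of Birmingham
Categories: Computation and Language (cs.CL)
Comments: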

Abstract:Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% relatively. It also attains the highest win rates against baselines in human evaluation.

[NLP-26] IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds

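[Quick read]: This paper addresses a fundamental tension in Interactive Fiction: Large Language Models (LLMs) may produce creative narratives but struggle with world coherence, while symbolic systems ensure consistency but lack creative flexibility. It presents IVIE (Incremental Validated Interactive Experiences), a neuro-symbolic approach to generating complete, playable interactive fiction worlds from scratch. Building on PAYADOR's neuro-symbolic framework, IVIE implements a four-stage incremental generation pipeline that delegates creative decisions (setting and character creation, puzzle design) to LLMs while grounding the world state through symbolic validation. Generated worlds contain interconnected locations, functional items, non-player characters (NPCs), and coherent puzzles, structured around a central goal-oriented architecture. Human evaluation finds the worlds immersive and thematically coherent, with high player engagement. The results suggest that symbolic validation grounds LLM generation without eliminating generative freedom. Limitations remain: LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals. The paper closes with design considerations for future neuro-symbolic interactive storytelling systems, particularly regarding LLM capabilities and limitations.

Link: https://arxiv.org/abs/2606.13348
Authors: Micaela Vaucher, Santiago Silveira, Santiago Góngora, Luis Chiruzzo
Affiliations: Universidad de la República
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures. To appear in the Proceedings of the 16th International Conference on Computational Creativity (ICCC’26), June 2026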

Abstract:Computational creativity in Interactive Fiction faces a fundamental tension: Large Language Models (LLMs) may produce creative narratives but struggle with world coherence, while symbolic systems ensure consistency but lack creative flexibility. We present IVIE (Incremental Validated Interactive Experiences), a neuro-symbolic approach to generating complete and playable interactive fiction worlds from scratch. Building upon PAYADOR’s neuro-symbolic framework, IVIE implements a four-stage incremental generation pipeline that delegates creative decisions (setting and character creation, puzzle design) to LLMs while grounding the world state through symbolic validation. The system generates worlds with interconnected locations, functional items, non-player characters, and coherent puzzles, all structured around a central goal-oriented architecture. Human evaluation shows the approach generates immersive, thematically coherent worlds with high player engagement. Results seem to indicate that the neuro-symbolic approach successfully balances flexibility with narrative coherence: symbolic validation grounds LLM generation without eliminating generative freedom. However, challenges remain: LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals. We identify key design considerations for future neuro-symbolic interactive storytelling systems, particularly regarding LLM capabilities and their limitations.

[NLP-27] Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation IJCAI ECAI2026

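[Quick read]: This paper addresses the high latency and unnatural pauses that conventional sequential pipelines cause in real-time game commentary systems. A typical pipeline captures frames, generates text, and synthesizes speech in sequence for each utterance, and does not request the next generation until speech playback has completed, which leaves long silences between utterances. The solution is a parallel design: text generation runs in parallel with speech playback, and multiple candidate utterances are buffered ahead of time so that synthesis can start immediately at playback boundaries. On fast-paced game videos, the parallel design reduces mean inter-utterance silence from 9.6 seconds to 0.3 seconds relative to sequential baselines and improves similarity to professional speaking-silence timing patterns by over 40%. A user study with 120 experienced game players confirms significantly improved perceived speaking rhythm.

Link: https://arxiv.org/abs/2606.13322
Authors: Ryota Kawamatsu, Anum Afzal, Yuki Saito, Shinnosuke Takamichi, Graham Neubig, Katsuhito Sudoh, Hiroya Takamura, Tatsuya Ishigaki
Affiliations: The University of Tokyo, Japan; National Institute of Advanced Industrial Science and Technology, Japan; Technical University of Munich, Germany; Keio University, Japan; Carnegie Mellon University, U.S.A.; Nara Women’s University, Japan
Categories: Computation and Language (cs.CL)
Comments: Accepted at IJCAI-ECAI 2026 (Demonstrations Track)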

Abstract:We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines capture frames, generate text, and synthesize speech sequentially for each utterance, and do not request the next generation until speech playback has completed. This strict sequentiality causes long and unnatural silence between utterances. To address this latency bottleneck, our system runs text generation in parallel with speech playback and buffers multiple candidate utterances ahead of time, enabling immediate synthesis at playback boundaries. Experiments on fast-paced game videos show that our parallel design reduces the mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines. It also improves similarity to professional speaking–silence timing patterns by over 40%, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm. Our demo video is available at: this https URL.
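
The parallel design can be sketched as a producer-consumer pair sharing a bounded buffer. This is a minimal illustration using Python's `asyncio`, not the authors' system: `generate_text` and `play_speech` are hypothetical stand-ins for the LLM call and for speech synthesis plus playback, and the sleep durations and buffer size are arbitrary.

```python
import asyncio
from typing import List, Optional


async def generate_text(i: int) -> str:
    """Hypothetical stand-in for an LLM call that writes one commentary line."""
    await asyncio.sleep(0.02)
    return f"utterance {i}"


async def play_speech(text: str, log: List[str]) -> None:
    """Hypothetical stand-in for speech synthesis and playback."""
    await asyncio.sleep(0.01)
    log.append(text)


async def producer(queue: "asyncio.Queue[Optional[str]]", n: int) -> None:
    """Keep generating ahead of playback; put() waits while the buffer is full."""
    for i in range(n):
        await queue.put(await generate_text(i))
    await queue.put(None)  # sentinel: no more commentary


async def consumer(queue: "asyncio.Queue[Optional[str]]", log: List[str]) -> None:
    """Play buffered utterances back to back, without waiting for generation
    whenever the buffer already holds the next line."""
    while True:
        text = await queue.get()
        if text is None:
            break
        await play_speech(text, log)


async def run(n: int = 5, buffer_size: int = 3) -> List[str]:
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue(maxsize=buffer_size)
    log: List[str] = []
    await asyncio.gather(producer(queue, n), consumer(queue, log))
    return log


played = asyncio.run(run(4))
```

A real system would also need to discard buffered lines that have gone stale relative to the gameplay, which this sketch leaves out.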

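[NLP-28] SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

[Quick read]: This paper addresses the limited efficiency and generalization of skill self-evolution for large language model (LLM) agents. Existing methods usually learn from a single execution trajectory, merge candidate skill patches blindly before validating them, and load the entire skill library at inference time, which wastes resources and creates performance bottlenecks. The key to the solution is SkillCAT, a training-free framework that splits skill evolution into three stages. Contrastive Causal Extraction (CCE) compares successful and failed trajectories of the same task to identify the evidence that explains the difference in outcome. Assessment-Augmented Evolution (AAE) replays candidate skill patches on clones of the source task, keeps only the patches that improve or preserve task performance, and then merges them hierarchically. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the current task. The design improves both the efficiency of skill evolution and generalization across models and out-of-distribution settings. On benchmarks including SpreadsheetBench, WikiTableQuestions, and DocVQA, it raises the average score over baselines by up to 40.40%.

Link: https://arxiv.org/abs/2606.13317
Authors: Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 6 figures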

Abstract:Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.

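[NLP-29] Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality (ACL 2026)

[Quick read]: This paper addresses the limited compositional understanding of contrastively trained vision-language models (VLMs), in particular their "bag-of-words" behavior: they struggle to model relations between objects, attribute-object bindings, and word-order dependencies. The root causes are an over-reliance on global single-vector representations during optimization and a failure to exploit the rich compositional information contained in paired image-text data. The paper proposes MACCO (MAsked Compositional Concept MOdeling), whose core idea is to mask compositional concepts in one modality and reconstruct them from the full context supplied by the other modality, which explicitly pushes the model to align compositional structure across modalities. To make this process more effective, the authors add two auxiliary objectives that jointly align and regularize features both across and within modalities. Extensive experiments on five compositional benchmarks show that the method markedly improves the compositionality of VLMs and their ability to capture syntactic structure and linguistic information, and that the gains carry over to text-to-image generation and multimodal large language models.

Link: https://arxiv.org/abs/2606.13288
Authors: Wei Li, Zhen Huang, Xinmei Tian
Affiliations: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Main Conference, 25 pages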

Abstract:Contrastively trained vision-language models such as CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a “bag-of-words” behavior, struggling to capture object relations, attribute-object bindings, and word order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. Additionally, the improved compositionality also benefits text-to-image generation and multimodal large language models. Code is available at this https URL.

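[NLP-30] Evaluating Pluralism in LLMs through Latent Perspectives (ICML 2026)

[Quick read]: This paper addresses the homogeneity of viewpoints in text generated by large language models (LLMs): models tend to produce text that lacks diversity and fails to reflect a plurality of perspectives, which creates a "pluralistic gap". The core challenge is to identify and extract the perspectives expressed in text without relying on human annotation. The key to the solution is a domain-agnostic, multi-layered framework for unsupervised perspective extraction that systematically identifies the viewpoints expressed in free-form text. The framework is evaluated on book reviews, a highly opinionated dataset, by comparing different models and prompting strategies. Although some methods already cover a fairly broad range of perspectives, rare perspectives remain markedly underrepresented, so the distribution of viewpoints in generated text diverges from the diversity of human writing.

Link: https://arxiv.org/abs/2606.13254
Authors: Laura Majer, Jan Šnajder, Martin Tutek
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Pluralistic Alignment Workshop @ ICML 2026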

Abstract:The growing need to represent diverse perspectives has increased interest in pluralistic LLM generation. Although difficult to operationalize, identifying perspectives expressed in text would provide clear guidance on pluralistic alignment and more clearly articulate the pluralistic gap in LLM generation. While models have been shown to reduce the diversity of training data and generate homogeneously, this has been demonstrated primarily on multiple-choice questionnaires or using high-level characteristics of free-form text. In this paper, we introduce and implement a domain-agnostic multi-layered framework for unsupervised extraction of perspectives suitable for identifying the pluralistic gap in LLM-generated text. We evaluate our framework on book reviews, a highly opinionated dataset representing diverse perspectives, and compare various prompts and models. Our results show that while some models and prompting techniques come close to covering a broad spectrum of perspectives, rarer perspectives remain disproportionately underrepresented, resulting in distributions that diverge from human text.

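[NLP-31] ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

[Quick read]: This paper addresses the inherent limits of agents that operate professional software such as industrial computer-aided design (CAD) tools. Agents driven through the graphical user interface (GUI) suffer from fragile visual grounding and error accumulation over long horizons, while approaches based on application programming interfaces (APIs) face heterogeneous protocols and inaccessible commercial interfaces. The key to the solution is to treat the Component Object Model (COM) as a unified executable abstraction and to build a new "COM-as-Action" paradigm that recasts professional software interaction as deterministic program synthesis rather than vision-dependent sequential control. To validate the paradigm in highly demanding environments, the study introduces ComCADBench, the first benchmark for agents operating real industrial CAD software. Experiments show that frontier proprietary models almost never succeed under GUI interaction, whereas COM-based execution improves performance substantially. To close the remaining gap between syntactic correctness and geometric accuracy, the study proposes ComActor, a self-correcting agent trained with a progressive three-stage framework, and ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor reaches state-of-the-art performance on ComCADBench, is robust on long-horizon tasks, and generalizes to external CAD benchmarks.

Link: https://arxiv.org/abs/2606.13239
Authors: Jiaxin Ai, Tao Hu, Xuemeng Yang, Shu Zou, Hairong Zhang, Daocheng Fu, Yu Yang, Hongbin Zhou, Nianchen Deng, Pinlong Cai, Zhongyuan Wang, Botian Shi, Kaipeng Zhang, Licheng Wen
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: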

Abstract:Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to external CAD benchmarks.

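[NLP-32] PolyAlign: Conditional Human-Distribution Alignment

[Quick read]: This paper addresses a side effect of existing post-training methods such as supervised fine-tuning (SFT) and preference optimization: by aligning language models toward a single global assistant behavior, they suppress the natural variation of human responses across languages, tasks, and dialogue settings. The core challenge is conditional human-distribution alignment, where the model should match the human response distribution that fits the current interaction context instead of adopting one uniform style. The key to the solution is PolyAlign, a distribution-aware alignment framework that partitions bilingual interaction data into buckets defined by language, interaction track, response family, and length, and builds a bucket-specific human reference distribution for each. On top of this, Bucket-Aware SFT balances optimization across heterogeneous buckets, and Human-Distribution Preference Optimization (HDPO) regularizes preference learning with a critic-estimated distance to the human response support of each bucket. In single-turn and multi-turn evaluations in English and Chinese, PolyAlign improves conditional naturalness and distributional faithfulness while preserving competitive task utility, which suggests that post-training should move from global alignment toward alignment with human response distributions conditioned on the interaction setting.

Link: https://arxiv.org/abs/2606.13227
Authors: L. D. M. S. Sai Teja, Ufaq Khan, Sathira Silva, Xiao Wu, Muhammad Haris Khan
Affiliations: MBZUAI, Abu Dhabi, UAE; NIT Silchar, India
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 4 figures, 8 tables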

Abstract:Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress the natural variation of human responses across languages, tasks, and dialogue settings. We study this problem as conditional human-distribution alignment: models should match the human response distribution appropriate to the current interaction context, rather than a universal response style. We introduce PolyAlign, a distribution-aware alignment framework that organizes bilingual interaction data into bucket-specific human reference distributions defined by language, interaction track, response family, and length. PolyAlign combines Bucket-Aware SFT, which balances optimization across heterogeneous buckets, with Human-Distribution Preference Optimization (HDPO), which regularizes preference learning using critic-estimated distance to bucket-specific human support. Across a bilingual evaluation suite covering English and Chinese single- and multi-turn settings, PolyAlign improves conditional naturalness and distributional faithfulness while preserving competitive task utility. The results suggest that post-training should move beyond global alignment objectives toward interaction-aware alignment with human response distributions.

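[NLP-33] When Similar Means Different: Evaluating LLMs on Arabic–Hebrew Cognates

[Quick read]: This paper addresses the weakness of large language models (LLMs) in cross-lingual semantic understanding of Arabic and Hebrew, two closely related languages that share many true cognates, misleading false friends, and modern loanwords. The core challenge is that current models rely too heavily on surface-form similarity and struggle to identify and disambiguate, in context, words that look alike but mean different things. The key to the solution is a fine-grained benchmark, SemCog Bench, which contains 1,858 Arabic-Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. Testing open-source and commercial models under several input representations (raw, diacritized, Romanized, and phonetic) shows that performance drops sharply on pairs that are not true cognates and that sentence-level context brings only modest gains, so contextual cues alone do not overcome misleading form. The findings expose a fundamental limitation of current LLMs in resolving cross-lingual form-meaning conflicts and establish SemCog Bench as a rigorous benchmark for multilingual semantic reasoning.

Link: https://arxiv.org/abs/2606.13218
Authors: Junhong Liang, Noor Abo Mokh, Bashar Alhafni
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: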

Abstract:Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce SemCog Bench, a curated benchmark of 1,858 Arabic–Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. We evaluate open-source and commercial LLMs across multiple input representations (raw, diacritized, Romanized, and phonetic) and reveal a critical gap in cross-lingual reasoning. While models achieve high accuracy on true cognates, performance drops sharply on false friends and loanwords, reflecting a strong reliance on surface-form similarity. Furthermore, sentence-level context yields only modest improvements, suggesting that contextual cues alone are insufficient to overcome misleading form-based signals. These findings reveal a fundamental limitation of current LLMs in resolving cross-lingual form–meaning conflicts and establish SemCog Bench as a rigorous benchmark for multilingual semantic reasoning. Our code and data are publicly available.

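[NLP-34] Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization (ICML)

[Quick read]: This paper addresses unsupervised detection of hallucinations in natural language generation, focusing on faithfulness in neural machine translation (NMT) and abstractive summarization. The core challenge is to identify inconsistencies between generated content and the source text without labeled data. The key to the solution is optimal transport (OT): the method measures the geometric distance between the cross-attention distribution in each decoder layer and a reference distribution (such as the uniform distribution or a data-derived distribution) to expose attention anomalies during generation. The study finds that the two OT metrics, Wass-to-Unif and Wass-to-Data, are complementary and sensitive to different hallucination types, and that detection is concentrated in decoder layers L1–L4, while L5 is anti-predictive for subtler hallucinations. It also shows that correct translations have an exploratory attention phase from the first decoding step, which hallucinated translations lack. When extended to abstractive summarization, the unsupervised OT detector reaches 57.2% / 57.6% balanced accuracy on AggreFact, well below the supervised MiniCheck-Flan-T5-L baseline (69.9% / 74.3%). The gap reflects an inherent limit of the approach: when the error is a misrepresentation of source content rather than a disengagement of attention, OT metrics based on attention concentration cannot see it. Structural analysis on T5-base confirms a consistent organization across decoder layers, with Layer 3 showing the highest attention concentration and Layer 12 being the most critical for generation quality. Overall, the study establishes OT on cross-attention as a reliable detector of hallucinations caused by source disengagement and as an interpretable analysis tool, with effectiveness that depends on whether the failure occurs at the attention stage or downstream of it.

Link: https://arxiv.org/abs/2606.13216
Authors: Mariia Onyshchuk, Maksym-Vasyl Tarnavskyi, Marta Sumyk
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ICML Mechanistic Interpretability Workshop 2026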

Abstract:Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We extend this analysis to all six decoder layers of the Fairseq DE-EN model (N = 3,414), showing that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection is concentrated in layers L1–L4 with L5 anti-predictive for subtler types, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step. We further evaluate whether the geometric signal transfers to abstractive summarization faithfulness detection: our unsupervised OT detector on AggreFact (N = 1,116) achieves 57.2% / 57.6% balanced accuracy on CNN/XSum, above chance but substantially below supervised MiniCheck-Flan-T5-L (69.9% / 74.3%). This gap is principled: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode invisible to concentration-based OT metrics by construction. Structural experiments on T5-base confirm consistent decoder organisation across depth, with Layer 3 showing peak concentration and Layer 12 being most critical for generation quality. Together, the results establish OT on cross-attention as a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention.
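
To make the "distance to a reference distribution" idea concrete, here is a small sketch of a Wass-to-Unif style score. It rests on an assumption that the abstract does not state: a 0/1 ground cost between source tokens, under which the Wasserstein-1 distance reduces to total variation. How the attention mass per source token is aggregated over heads and decoding steps is also left open, and the function name is illustrative.

```python
def wass_to_unif(attention_mass: list[float]) -> float:
    """Wasserstein-1 distance between an attention distribution over source
    tokens and the uniform distribution, assuming a 0/1 ground cost.

    With a 0/1 cost, W1 equals the total variation distance, i.e. half the
    L1 distance between the two probability vectors.
    """
    n = len(attention_mass)
    total = sum(attention_mass)
    probs = [a / total for a in attention_mass]  # normalise to a distribution
    uniform = 1.0 / n
    return 0.5 * sum(abs(p - uniform) for p in probs)


# Attention spread evenly over four source tokens: distance 0.
spread = wass_to_unif([0.25, 0.25, 0.25, 0.25])
# All attention on a single source token: distance 0.75.
peaked = wass_to_unif([1.0, 0.0, 0.0, 0.0])
```

A layer-resolved analysis in the spirit of the paper would compute such a score separately for each decoder layer and compare the layers as detectors.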

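[NLP-35] Understanding helpfulness and harmless tension in reward models

[Quick read]: This paper addresses the inherent conflict between alignment objectives, such as helpfulness and harmlessness, in reinforcement learning from human feedback (RLHF). Reward models are meant to steer language models toward behavior that is both helpful and harmless, but their internal mechanisms and the tension between the objectives remain unclear. The paper compares reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings and finds that mixed-objective models often underperform single-objective ones, which indicates interference between the two objectives. Using activation-based methods, the study identifies neurons associated with each objective, and targeted ablations show that these neurons play a causal role: they support their own objective and often harm the opposing one. Further analysis shows that a substantial share of neurons is shared between the two objectives and that these shared neurons have a disproportionate influence on model behavior, which makes them a key source of alignment tension. The study offers interpretable insight into how alignment objectives are represented in reward models and why multi-objective alignment is hard, motivating future work on disentangled and controllable alignment.

Link: https://arxiv.org/abs/2606.13209
Authors: Eshaan Tanwar, Pepa Atanasova
Affiliations: University of Copenhagen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: The source code used in this study is publicly available at: this https URL _tension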

Abstract:Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. We find that mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. We find that these neurons causally support their corresponding objectives while often negatively affecting the opposing one. We find that a substantial proportion of neurons are shared between helpfulness and harmlessness, and that these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Additionally, our results provide insights and mechanistic interpretation into how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.

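[NLP-36] SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

[Quick read]: This paper addresses the poor performance of large language models (LLMs) on high-complexity examples in stance detection, with a focus on the error patterns that appear when a target-text pair carries a heavy semantic-pragmatic burden. The core of the solution is a seven-dimensional diagnostic measure, the Stance Inference Complexity Index (SICI), which quantifies that burden. SICI predicts model accuracy better than surface proxies and shows substantial cross-scorer reliability (α = 0.771). The key finding is that errors change regime as SICI rises: low-complexity examples invite over-attribution (especially wrong "Against" predictions), intermediate examples form an unstable boundary region, and high-complexity examples rapidly concentrate on "None" predictions. This structure is consistent across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models shift the boundaries. A further study of 15 interventions shows that prompting, retrieval, and debate mostly move models along the attribution-abstention axis rather than removing the high-complexity bottleneck. The study thus exposes an inherent limitation of prompt-based LLMs in stance detection and argues for rethinking solutions in terms of complexity modeling and systematic intervention design.

Link: https://arxiv.org/abs/2606.13189
Authors: Fuqiang Niu, Bowen Zhang
Affiliations: University of Science and Technology of China; Shenzhen Technology University
Categories: Computation and Language (cs.CL)
Comments: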

Abstract:Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target–text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability (α = 0.771). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution–abstention axis rather than removing the high-complexity bottleneck.

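[NLP-37] A Context-Aware Dataset for Stance Detection in Bioethical Controversies on Reddit

[Quick read]: This paper addresses a gap in stance detection research: bioethical controversies are increasingly debated on social media, yet there is no large-scale, domain-specific resource that captures the context dependence of such discourse. The core of the solution is BioStance, a context-aware dataset of 39,600 Post-Comment pairs drawn from bioethical discussions on Reddit. The dataset covers six controversial targets across three dimensions of bioethical controversy: fundamental value conflicts, the tension between individual liberty and collective responsibility, and technological uncertainty. Each instance keeps its hierarchical conversational context and is labeled by three independent annotators with a three-class stance scheme (Favor, Against, None); the annotations reach a Krippendorff’s α of 0.82, which indicates high reliability. By combining thematic diversity, conversational structure, and high-quality human annotation, BioStance supports context-aware stance detection, argument mining, and computational analysis of bioethical discourse.

Link: https://arxiv.org/abs/2606.13187
Authors: Hu Huang, Genan Dai, Fuqiang Niu, Yi Yang, Zhaoya Gong, Bowen Zhang
Affiliations: University of Science and Technology of China; Shenzhen Technology University; Peking University
Categories: Computation and Language (cs.CL)
Comments: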

Abstract:Bioethical debates increasingly unfold on social media, yet stance detection research lacks large-scale, domain-specific resources for modeling such context-dependent discourse. We present BioStance, a context-aware dataset of 39,600 annotated Post-Comment pairs from Reddit bioethical discussions. BioStance covers six controversial targets across three dimensions of bioethical controversy: fundamental value conflicts, individual liberty versus collective responsibility, and technological uncertainty. Each instance preserves hierarchical conversational context and is labeled by three independent annotators using a three-class stance scheme: Favor, Against, and None. The annotations achieve a mean Krippendorff’s α of 0.82, indicating substantial reliability. By combining thematic diversity, conversational structure, and high-quality human annotation, BioStance supports research on context-aware stance detection, argument mining, and computational analysis of bioethical discourse.

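[NLP-38] LAUKIN: A Multi-jurisdictional Common Law Contract Dataset

[Quick read]: This paper addresses the difficulty multinational companies face in judging whether contract clauses are equivalent across jurisdictions; existing legal natural language processing (NLP) datasets are mostly limited to a single jurisdiction and cannot support cross-jurisdictional comparison. The key to the solution is LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs covering three jurisdiction pairs (AU-UK, UK-IN, IN-AU), with a subset labeled by legal experts as Equivalent or Not Equivalent. The study proposes a novel multi-stage retrieval and reranking pipeline to build the initial clause-pair mapping and, from 204 contracts across 8 agreement types, constructs 14,727 clause pairs, of which 3,000 are manually labeled (900 train, 600 dev, 1,500 test). Experiments show that drafting conventions differ markedly despite the shared common law tradition, which makes cross-jurisdictional equivalence classification hard: across 12 models and 4 techniques the best macro-F1 is 65.11%, confirming the difficulty of the benchmark. The dataset also includes 11,727 unlabeled training pairs to support future semi-supervised learning research in legal NLP.

Link: https://arxiv.org/abs/2606.13184
Authors: Amrita Singh, Aditya Joshi, Jiaojiao Jiang, Hye-young Paik, May Fong Cheong
Affiliations: UNSW Sydney, Australia
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 2 figures, 4 tables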

Abstract:Multinational companies increasingly require cross-jurisdictional contract review, yet existing legal NLP datasets are largely restricted to a single jurisdiction. We introduce LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs (AU-UK, UK-IN, IN-AU) labelled for boolean legal equivalence. We develop a novel multi-stage retrieval and reranking pipeline to construct the initial clause pair mapping, with a subset of clause pairs subsequently annotated by legal experts as Equivalent or Not Equivalent. The dataset comprises 14,727 clause pairs from 204 contracts across 8 agreement types, of which 3,000 are manually labelled: 900 train, 600 dev, and 1,500 test. We evaluate 12 models across 4 techniques, achieving a best macro-F1 of 65.11%, establishing LAUKIN as a challenging benchmark. Results reveal that, despite shared legal heritage, drafting conventions diverge significantly across jurisdictions, making cross-jurisdictional equivalence classification non-trivial. LAUKIN also includes 11,727 unlabelled training pairs to support future semi-supervised learning research in legal NLP.
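
As a worked illustration of the numbers above, the sketch below checks the split arithmetic reported in the abstract and shows how a macro-F1 over the two labels is computed. The toy gold and predicted labels are invented for illustration and are not taken from LAUKIN; the paper's own evaluation code may differ.

```python
# Split sizes as reported in the abstract.
LABELLED = {"train": 900, "dev": 600, "test": 1500}
TOTAL_PAIRS = 14_727
UNLABELLED = TOTAL_PAIRS - sum(LABELLED.values())


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


# Toy example (invented labels): E = Equivalent, N = Not Equivalent.
gold = ["E", "E", "N", "N"]
pred = ["E", "N", "N", "N"]
score = macro_f1(gold, pred)  # per-class F1 of 2/3 and 4/5, mean 11/15
```

Macro averaging weights both classes equally, so a classifier cannot score well by favoring the more frequent label.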

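[NLP-39] MemRefine: LLM-Guided Compression for Long-Term Agent Memory

[Quick read]: This paper addresses memory management for large language model (LLM) agents in long-term interaction. As dialogues accumulate, the memory store grows without bound and fills with redundant entries, which raises storage cost and degrades retrieval through information overload. The problem is sharper on resource-constrained platforms with hard memory budgets. The paper therefore formulates the task of storage-budgeted memory management and proposes the MemRefine framework. Its core idea is to stop relying on surface similarity alone: similarity is used only to propose candidate pairs of memories, and an LLM acting as a judge decides whether to delete, merge, or preserve them based on the value of their factual content, iterating until the preset budget is met. Across multiple memory frameworks and long-term conversation benchmarks, the method meets target budgets while preserving downstream performance and outperforms rule-based baselines under tight budgets.

Link: https://arxiv.org/abs/2606.13177
Authors: Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: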

Abstract:Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. Furthermore, this is especially limiting on resource-constrained platforms with hard memory budgets, motivating us to formulate storage-budgeted memory management, the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we then propose MemRefine, an LLM-guided framework that, since surface similarity poorly reflects factual value, uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule-based baselines under tight budgets.
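
The propose-then-judge loop can be sketched as follows. This is a simplified illustration under explicit assumptions, not the MemRefine implementation: the budget is counted in entries (the paper may measure storage differently), token-level Jaccard overlap stands in for the similarity measure, and `toy_judge` is a hypothetical rule-based placeholder for the LLM judge.

```python
from itertools import combinations
from typing import Callable, Optional

Decision = tuple[str, Optional[str]]  # (action, merged text if action == "merge")


def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def refine(memory: list[str], budget: int,
           judge: Callable[[str, str], Decision]) -> list[str]:
    memory = list(memory)
    while len(memory) > budget:
        # Similarity only proposes candidates, most similar pairs first.
        pairs = sorted(combinations(range(len(memory)), 2),
                       key=lambda ij: jaccard(memory[ij[0]], memory[ij[1]]),
                       reverse=True)
        for i, j in pairs:
            action, merged = judge(memory[i], memory[j])
            if action == "preserve":
                continue
            if action == "merge":
                memory[i] = merged
            del memory[j]  # j > i, so index i stays valid
            break
        else:
            break  # the judge preserved every pair; stop above budget
    return memory


def toy_judge(a: str, b: str) -> Decision:
    # Hypothetical placeholder for an LLM judging factual content.
    if a == b:
        return ("delete", None)
    if jaccard(a, b) >= 0.5:
        return ("merge", a if len(a) >= len(b) else b)
    return ("preserve", None)


store = ["user likes tea", "user likes tea",
         "user likes green tea", "meeting on friday"]
compact = refine(store, budget=2, judge=toy_judge)
```

In this toy run the duplicate is deleted first, then the two tea entries are merged into the more specific one, which leaves two entries and meets the budget.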

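[NLP-40] Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

[Quick read]: This paper addresses the gap between preference access and preference compliance in interactive large language model (LLM) agents: users correct an agent repeatedly in conversation, but the corrections are often not retained and enforced in later sessions, which causes recurring friction. The core challenge is that existing memory mechanisms such as Mem0 do not guarantee that user preferences are followed across sessions. The paper proposes a drop-in skill-layer pipeline, Test-time Rule Acquisition and Compiled Enforcement (TRACE). Its key idea is to mine the user's corrections from conversation, rewrite them as atomic rules, and compile those rules into runtime checks that the agent must pass before it completes future tasks. Unlike static checks written in advance by developers, TRACE rules come entirely from the user's own feedback. On tasks derived from ClawArena and MemoryArena, TRACE lowers the preference violation rate substantially (on ClawArena, from 100.0% to 37.6% on in-distribution tasks and to 2.0% on out-of-distribution tasks) while matching or exceeding the strongest memory baseline on task pass, which suggests that compiling corrections into enforceable runtime constraints is an effective way to reduce repeated preference failures.

Link: https://arxiv.org/abs/2606.13174
Authors: Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Nuno Moniz, Nitesh V. Chawla, Xiangliang Zhang
Affiliations: University of Notre Dame; IBM Research; Tencent AI Lab
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: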

Abstract:Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user’s own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at this https URL, and the deployable skill is available at this https URL.
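
The idea of turning a correction into a check that gates task completion can be sketched in a few lines. Everything here is illustrative and assumed: the `Rule` type, the hand-written predicate, and the `revise` callback are hypothetical stand-ins, whereas TRACE itself mines and compiles rules from chat corrections.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    text: str                      # atomic rule distilled from a user correction
    check: Callable[[str], bool]   # returns True when the output complies


def violations(rules: list[Rule], output: str) -> list[str]:
    return [rule.text for rule in rules if not rule.check(output)]


def complete_task(draft: str, rules: list[Rule],
                  revise: Callable[[str, list[str]], str],
                  max_rounds: int = 3) -> str:
    """Return the output only once every compiled check passes."""
    for _ in range(max_rounds):
        failed = violations(rules, draft)
        if not failed:
            return draft
        draft = revise(draft, failed)  # hypothetical: the agent repairs its output
    raise RuntimeError(f"still violating: {violations(rules, draft)}")


# A rule a user might have stated once: "do not leave debug prints in the code".
rules = [Rule("no debug prints", lambda code: "print(" not in code)]


def strip_debug(code: str, failed: list[str]) -> str:
    # Stand-in for the agent's revision step.
    return "".join(line for line in code.splitlines(keepends=True)
                   if "print(" not in line)


final = complete_task('print("debug")\nx = 1\n', rules, strip_debug)
```

The difference from memory-only approaches is that the rule is enforced at completion time instead of merely being available for retrieval.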

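[NLP-41] NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning

[Quick read]: This paper addresses hallucination in timeline summarization (TLS) of online news with large language models (LLMs), a widespread but under-studied problem that shows up as unfaithful content in news summaries and as omitted information in date-event summaries. The key to the solution is NTS-CoT, a framework that uses Chain-of-Thought (CoT) reasoning to mitigate both kinds of hallucination. It has three core modules: 1) Element-CoT captures the essential elements of a news item so that the summary stays faithful to the source; 2) Date Selection combines temporal saliency with event prominence to choose timestamps more accurately; 3) Causal-CoT infers causal relations between events to reduce omissions caused by missing logical links. Experiments show that NTS-CoT outperforms state-of-the-art methods on three mainstream TLS benchmarks, and quantitative analysis and human evaluation confirm that it reduces hallucination and improves summary quality.

Link: https://arxiv.org/abs/2606.13171
Authors: Feng Lyu, Huiqin Yan, Sijing Duan, Hao Wu, Shuang Gu, Xue Qiao, Weixu Zhang, Haolun Wu
Affiliations: Central South University; Tsinghua University; Nanjing University; Suzhou Aerospace Information Research Institute; McGill University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: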

Abstract:The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hallucinations, where LLM-generated content deviates from source news, still remain a critical issue in LLM-based TLS and are not well studied in existing works. To bridge this gap, we identify two primary types of hallucinations: unfaithful content during news summarization and information omission in date-event summarization. Then, we propose NTS-CoT, a novel framework that leverages Chain-of-Thought (CoT) reasoning to mitigate hallucinations in TLS. The framework consists of three key modules: i) Element-CoT to capture essential news elements for faithful summarization, ii) Date Selection to combine temporal saliency and event prominence for timestamp selection, and iii) Causal-CoT to infer causal relationships and reduce omissions in date-event summarization. Extensive experiments, including quantitative analysis on three TLS benchmarks and human evaluation, demonstrate that NTS-CoT outperforms state-of-the-art baselines, effectively mitigating hallucinations and improving LLM-based TLS performance. Our source code is available at this https URL .

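[NLP-42] HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue

[Quick read]: This paper addresses a key weakness in how persona-grounded dialogue systems model personas: current methods treat persona information as a flat set of sentences and miss high-order relations among persona attributes, such as several persona sentences sharing one topical category. The authors propose HyPE (Hypergraph Persona Encoder). First, each persona-bearing text is parsed into a (Core, Expression, Sentiment, Category) quadruple. Second, shared category labels are used to build a hypergraph in which each hyperedge connects the persona elements of one category, which models the relations among attributes explicitly. A hypergraph convolutional network (HyperGCN) then passes messages over this structure to produce a persona summary vector and a soft-memory bank that guide response generation. The paper further proposes Persistent Edge Embeddings (PEE), lightweight learnable priors per category that are fused into message passing. On PersonaChat with greedy decoding, HyPE outperforms sentence-level pooling baselines across GPT-2, LLaMA-3.2-3B, and Qwen2.5-3B backbones, which supports the claim that structured hyperedge-level persona encoding gives an advantage that transfers across model scales.

Link: https://arxiv.org/abs/2606.13142
Authors: Sangwon Youn, Yoonjin Jang, Youngjoong Ko
Affiliations: Sungkyunkwan University, Suwon, Republic of Korea
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 2 figures, 4 tables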

Abstract:Persona-grounded dialogue systems aim to produce responses consistent with a speaker’s persona, yet existing methods treat personas as a flat set of sentences and fail to model the high-order relations among persona attributes, e.g., that several persona sentences share a topical category. We propose HyPE (Hypergraph Persona Encoder), a framework that (i) analyzes each persona-bearing text as a (Core, Expression, Sentiment, Category) quadruple, and (ii) organizes persona elements into a hypergraph whose hyperedges are induced by shared category labels. A HyperGCN hypergraph neural network propagates this structure into a persona summary vector and a soft-memory bank that condition the response generator. We further propose Persistent Edge Embeddings (PEE), lightweight per-category learnable priors fused into the HyperGCN message-passing step. On PersonaChat under greedy decoding, HyPE consistently outperforms sentence-level pooling baselines across GPT-2, LLaMA-3.2-3B, and Qwen2.5-3B backbones, demonstrating that structured hyperedge-level persona encoding provides a transferable advantage across model scales.
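
The step of inducing hyperedges from shared category labels can be illustrated with a small data-structure sketch. The quadruples below are invented examples, and the function names are hypothetical; the sketch stops at the incidence matrix and does not attempt the HyperGCN or PEE components.

```python
from collections import defaultdict

# (Core, Expression, Sentiment, Category) quadruples; invented for illustration.
Quad = tuple[str, str, str, str]


def build_hyperedges(quads: list[Quad]) -> dict[str, list[int]]:
    """Group persona elements (by index) under their category label.
    Each category becomes one hyperedge that can connect any number of nodes."""
    edges: dict[str, list[int]] = defaultdict(list)
    for idx, (_core, _expression, _sentiment, category) in enumerate(quads):
        edges[category].append(idx)
    return dict(edges)


def incidence_matrix(n_nodes: int, edges: dict[str, list[int]]):
    """Rows are persona elements, columns are hyperedges (sorted by category)."""
    categories = sorted(edges)
    matrix = [[1 if node in edges[cat] else 0 for cat in categories]
              for node in range(n_nodes)]
    return categories, matrix


quads = [
    ("hiking", "I love hiking", "positive", "hobby"),
    ("painting", "I paint on weekends", "positive", "hobby"),
    ("nurse", "I work as a nurse", "neutral", "work"),
]
edges = build_hyperedges(quads)
categories, H = incidence_matrix(len(quads), edges)
```

A hypergraph neural network would typically take an incidence matrix like `H` together with node features as input; unlike an ordinary graph edge, the "hobby" hyperedge links all hobby-related elements at once.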

[NLP-43] MiniPIC: Flexible Position-Independent Caching in 100LOC

【速读】: 该论文旨在解决生成式 AI(Generative AI)中检索增强与代理类工作负载在预填充(prefill)阶段频繁重复处理可预测的结构化输入片段(即“跨度”,spans)时,现有缓存机制效率低下的问题。传统前缀缓存(prefix caching)要求请求具有完全相同的前缀才能复用键值(KV)缓存,而现有的位置无关缓存(Position-Independent Caching, PIC)方案通常需要大规模服务端代码修改或将KV状态外置于服务器,导致主机到设备的数据传输开销。其解决方案的关键在于提出一种轻量、灵活且高效的vLLM设计——最小化位置无关缓存(Minimalistic PIC, MiniPIC),核心由两个要素构成:无位置编码的KV缓存与用户可控的缓存复用原语。MiniPIC将未旋转的K向量直接存储于KV缓存中,通过每请求逻辑位置在注意力计算中动态应用RoPE,同时提供三个细粒度的用户级原语(块对齐填充、跨度分隔符SSep、提示依赖PDep),以灵活控制哈希行为和块级因果注意力结构。仅需不足100行核心引擎修改及自定义注意力后端,即可在同一vLLM实例内实现多种PIC方法(如Block-Attention、EPIC、Prompt Cache),并天然兼容KV缓存的CPU卸载。在2WikiMultihopQA基准测试中,采用交错调度的MiniPIC相比基线vLLM提升预填充吞吐49%,将缓存跨度的首令牌延迟降低两个数量级,保持未缓存跨度的线性预填充扩展性,且最坏情况开销仅为5.7%。

Link: https://arxiv.org/abs/2606.13126
Authors: Nathan Ordonez, Thomas Parnell
Affiliations: IBM Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 5 figures

Abstract:Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call “spans”) such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
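To see why storing unrotated K vectors makes a cached span reusable at any offset, consider the toy Python sketch below. It applies RoPE at attention time using a logical position supplied per request. This is only an illustration of the idea stated in the abstract: the function names and sample vectors are hypothetical, and it is neither vLLM's API nor the authors' attention backend.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def attention_score(q, k_unrotated, q_pos, k_pos):
    """Dot product after rotating q and the cached, unrotated k on the fly."""
    q_rot = rope(q, q_pos)
    k_rot = rope(k_unrotated, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))


# Hypothetical query and cached (unrotated) key vectors.
q = [0.3, -1.2, 0.7, 2.0]
k = [1.0, 0.5, -0.4, 0.9]

# The same cached k is placed at two different logical offsets.
score_near = attention_score(q, k, q_pos=10, k_pos=7)
score_far = attention_score(q, k, q_pos=110, k_pos=107)
```

Because the rotated dot product depends only on the relative position `q_pos - k_pos`, `score_near` and `score_far` agree up to floating-point error. The cache entry for `k` therefore never has to be recomputed when the span moves to a different place in the prompt.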

[NLP-44] NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation INTERSPEECH

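[Quick read]: This paper targets the fragmented speech that arises in simultaneous speech-to-speech translation when low latency is pursued too aggressively: chunk-wise output is punctuated by frequent pauses, producing an unnatural acoustic flow that may increase listeners' cognitive load. The key to the solution is a fluency-aware optimization framework that minimizes inter-chunk silences by exploiting model-internal signals, including linguistic diversity and induced temporal variability in speech durations. The aim is to approach the natural flow of consecutive translation while retaining the low-latency benefit of simultaneous translation. Experiments on short- and long-form benchmarks show more natural speech flow with competitive latency and translation quality.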

Link: https://arxiv.org/abs/2606.13121
Authors: Dongwook Lee, Youngho Cho, Sangkwon Park, Heeseung Kim, Sungroh Yoon
Affiliations: Seoul National University; University of Seoul
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Proceedings of the 26th Interspeech Conference, Long Paper

Abstract:Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive pursuit of low latency often results in fragmented chunk-wise speech. Consequently, listeners are subjected to an unnatural acoustic flow punctuated by frequent pauses, which could increase their cognitive load. To bridge this gap, we introduce a fluency-aware optimization framework designed to discover the sweet spot between the low-latency benefits of simultaneous translation and the natural flow of consecutive translation. Our framework minimizes inter-chunk silences by leveraging model-internal signals, including linguistic diversity and induced temporal variability in speech durations. Experiments on short- and long-form benchmarks show that our framework produces natural speech flow while maintaining competitive latency and translation quality.

[NLP-45] EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

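[Quick read]: This paper addresses test-set contamination and parametric memorization in current benchmarks for search agents. Benchmarks such as BrowseComp rely on static knowledge, so models can score highly through fact recall rather than genuine web retrieval and reasoning, which obscures their true browsing competence. The proposed solution is EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. Questions are collected with a three-agent collaborative framework: (1) a QA synthesis agent retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent filters retrieved knowledge by credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts. Because synthesis is fully automated, the benchmark can be updated regularly to stay fresh and contamination-free. Experiments confirm that it is highly difficult and requires broad horizontal search.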

Link: https://arxiv.org/abs/2606.13120
Authors: Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng
Affiliations: Weixin AI, Tencent Inc; Northeastern University
Categories: Computation and Language (cs.CL)
Comments: 14 pages, under review

Abstract: Search Agents – large language models augmented with search tools – have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.

[NLP-46] G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

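[Quick read]: This paper addresses the difficulty of maintaining long-term consistency in open-domain dialogue systems built on large language models (LLMs), which stems from limited long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches rely either on unstructured memory storage, which is prone to information loss, or on computationally expensive LLMs that incur high latency. The proposed G-Long framework has two key components. First, a fine-tuned small language model (sLM) performs structured triplet extraction and associative retrieval, which significantly reduces operational cost. Second, a novel attention-aware importance scoring mechanism uses the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Experiments across diverse benchmarks report state-of-the-art performance in response generation and memory retrieval, with gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, along with substantially lower computational overhead.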

Link: https://arxiv.org/abs/2606.13115
Authors: Minjun Choi, Yoonjin Jang, Sangwon Youn, Youngjoong Ko
Affiliations: Sungkyunkwan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 8 figures, 14 tables

Abstract:While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce the novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly minimizing computational overhead.

[NLP-47] MÖVE: A Holistic LLM Benchmark for the German Public Sector

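[Quick read]: This paper addresses the lack of a systematic evaluation standard for large language models (LLMs) in the German public sector. Although LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks are predominantly English-centric, US-centric in content, and focused exclusively on task performance. The paper proposes MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark that evaluates 39 models along two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, alignment with German constitutional values, and knowledge of German political parties' positions. The study uses ten German-language datasets, including self-constructed gold- and silver-standard datasets, and combines classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. The results show that no single model dominates across all criteria, that top performers differ between tasks, and that model size alone is a poor predictor of quality. The authors also evaluate the benchmark itself: its statistical precision, LLM judge reliability, the impact of private datasets on rankings, sensitivity to prompt formulation, and the validity of the energy consumption estimates. MÖVE is designed as a living benchmark under active development, and its results are publicly available.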

Link: https://arxiv.org/abs/2606.13111
Authors: Camilla Dalerci, Thilo Michael, Robin Schaefer, Daniel Weinland
Affiliations: Bundesdruckerei GmbH; Innovations Department
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values and knowledge about positions by German political parties. In total, we utilize ten German-language datasets, including gold- and silverstandard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at this https URL.

[NLP-48] Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

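[Quick read]: This paper addresses two core problems of latent chain-of-thought reasoning based on hidden-state recurrence: existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL), and their internal reasoning is hard to interpret causally. The key idea is a single pair of explicit boundary tokens, <swi> (entry) and </swi> (exit), which addresses both issues at once. Because the boundaries are ordinary discrete tokens, the latent block becomes compatible with standard on-policy RL and the policy ratio is well-defined at every decision point. The same tokens also serve as anchors for direct probing and causal intervention on the latent steps. Building on this, the authors propose SWITCH, trained with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through the recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis yields three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results indicate that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including analysis of how on-policy RL improves the model from the inside.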

Link: https://arxiv.org/abs/2606.13106
Authors: Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo
Affiliations: HKUST(GZ); University of Cambridge; NTU; JoinQuant; HKUST
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract: Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.
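The abstract's point that discrete boundary tokens keep the policy ratio well-defined can be made concrete with a generic GRPO-style sketch in Python. A discrete token has a log-probability under both the old and the new policy, so the ratio below is always computable. This is the commonly used group-relative, clipped formulation, not the paper's Switch-GRPO objective. The population standard deviation, the clip value, and all names are assumptions made for illustration.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (population std assumed)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def policy_ratio(new_logprob, old_logprob):
    """Importance ratio for one discrete token, e.g. a boundary token."""
    return math.exp(new_logprob - old_logprob)


def clipped_objective(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for a single token."""
    clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped * advantage)


# Hypothetical group of four rollouts with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
ratio = policy_ratio(new_logprob=-0.9, old_logprob=-1.2)
surrogate = clipped_objective(ratio, advantages[0])
```

With these numbers the ratio is `exp(0.3)`, which exceeds `1 + clip`, so the surrogate for the first rollout is capped at roughly 1.2.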

[NLP-49] LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

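[Quick read]: This paper addresses the difficulty of evaluating long-context understanding and information extraction on financial documents. Most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items, which does not reflect real-world conditions. The authors release LEDGER, a corpus of 4,999 digitized corporate annual reports (full documents with figures, tables, and narrative). Each report is labeled with 31 consolidated financial KPIs and linked to the market's reaction at the earnings date. Three benchmarks spanning the difficulty spectrum are derived from this data: a page-level KPI retrieval task with TREC-style relevance judgments over 118,048 natural-language questions, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task from long, numerically dense reports. The release also includes human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. A case study linking CEO-letter rhetoric to post-publication market impact demonstrates the dataset's research utility.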

Link: https://arxiv.org/abs/2606.13100
Authors: Charles Moslonka, Amaury de Vitry, Arthur Garnier, Hicham Randrianarivo, Emmanuel Malherbe
Affiliations: Artefact Research Center; MICS, CentraleSupélec, Université Paris-Saclay; Ardian
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 1 figure

Abstract:Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items. We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports - full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs to be extracted and linked to the market’s reaction at the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational “needle-in-a-haystack” single-value lookup, and a full KPI extraction task, both from long, numerically dense reports. We additionally provide human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. We further demonstrate the dataset’s research utility with a case study linking CEO-letter rhetoric to post-publication market impact.

[NLP-50] sebis at CRF Filling 2026: A Two-Stage Local LLM Pipeline for Medical CRF Filling ALT LREC2026

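[Quick read]: This paper addresses the long-standing bottleneck of extracting structured clinical information from unstructured electronic health record (EHR) notes, in particular the privacy risks, inference costs, and hallucination beyond textual evidence that hinder deploying large language models (LLMs) for the CL4Health 2026 Case Report Form (CRF) filling task. The key to the solution is a fully local, domain-adapted two-stage pipeline built on MedGemma-27B that uses item-specific, few-shot in-context learning without external API calls or fine-tuning. The architecture separates binary presence classification from value extraction, enforces strict adherence to textual evidence, and ensures deterministic outputs for negated, uncertain, or unknown states. The approach achieves a macro-F1 score of 0.55 on the official English test track, placing second among all locally hosted, open-source submissions. The authors conclude that privacy-preserving, on-premise LLM pipelines can come close to proprietary frontier models while keeping data sovereign.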

Link: https://arxiv.org/abs/2606.13082
Authors: Katharina Sommer, Tristan Till, Florian Matthes
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Published in Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health), LREC 2026

Abstract:The extraction of structured clinical information from unstructured EHR notes is a persistent bottleneck in healthcare informatics. While large language models (LLMs) offer high performance, their deployment in clinical settings is hindered by privacy risks, inference costs, and the tendency to hallucinate beyond textual evidence. We address these challenges for the CL4Health 2026 Case Report Form (CRF) filling task by proposing a fully local, domain-adapted pipeline using the MedGemma-27B model. Our two-stage architecture, which separates binary presence classification from value extraction, enforces strict adherence to textual evidence and ensures deterministic outputs for negated, uncertain, or unknown states. By leveraging item-specific, few-shot in-context learning without external API calls or fine-tuning, our approach achieves a macro-F1 score of 0.55 on the official English test track. This result secures second place among all locally-hosted, open-source submissions. Our work demonstrates that privacy-preserving, on-premise LLM pipelines can achieve near-competitive performance with proprietary frontier models, providing a practical, data-sovereign framework for clinical NLP.

[NLP-51] No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions

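[Quick read]: This paper studies a systemic risk in AI-based peer review: manipulating a paper's presentation alone. Prior robustness work has focused on explicit attacks such as hidden instructions and prompt injection, whereas this paper examines a subtler and more policy-relevant failure mode. The attacker makes no changes to methods, experiments, figures, equations, proofs, or numerical results, and modifies only presentation-level content such as the abstract, contribution framing, related work, discussion, and narrative structure. The authors introduce adversarial repackaging, a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, it achieves a 75.1% attack success rate and a mean score gain of +1.21/10, an effect not explained by ordinary prose polishing. Strategies that change how the reviewer interprets the paper (related-work repositioning, analytical discussion expansion) substantially outperform surface edits (local polishing, table formatting, algorithm boxes). Two structural failure modes emerge. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably raises perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, so unchanged evidence is reinterpreted as a stronger contribution. Paper presentation itself thus becomes an optimization surface. The authors release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers stay anchored to scientific content under presentation-only edits.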

Link: https://arxiv.org/abs/2606.13044
Authors: Xu Yang, Zhizhou Sha, Junbo Li, Jian Yu, Yifan Sun, Matthew Zhao, Jinrui Fang, Xinyue Guo, Yining Wu, Xu Hu, Yifu Luo, Qiang Liu, Zhangyang Wang
Affiliations: University of Texas at Austin; University of Illinois Urbana-Champaign; University of Texas at Dallas; Independent Researcher
Categories: Computation and Language (cs.CL)
Comments: 35 pages, 5 figures

Abstract:As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also reveal that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits. 

[NLP-52] SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants

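[Quick read]: This paper addresses degraded response quality in image-based e-commerce AI assistants caused by heterogeneous user intents. A single uploaded image can trigger product search, style recommendation, visual encyclopedia lookup, or utility tool calls, each demanding its own response format, tool invocation, and domain knowledge. Without per-intent behavioral constraints, LLM-based systems conflate these modes and fall short of domain quality standards, while the breadth and dynamism of the intent space make manual engineering infeasible. The proposed SkillChain framework closes the production feedback loop on Skill evolution and automates the Skill lifecycle in three stages: a Skill Creator that bootstraps Skills from task specs and trajectories, a Route Optimizer that aligns routing, and a Body Refiner that iteratively refines the Skill Body via dual-path LLM-Judge evaluation. Deployed on a production-scale e-commerce image assistant, SkillChain substantially improves aggregate response quality, with the strongest gains in structural compliance and content quality. A one-week online A/B experiment further confirms significant gains in user engagement, content consumption, and long-term retention.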

Link: https://arxiv.org/abs/2606.12984
Authors: Yimin Hu, Mengtao Xu, Hao Guo, Yuheng Song, Xiaoyong Zhu, Bo Zheng
Affiliations: Alibaba Group
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Image-based AI assistants are now deployed at production scale on e-commerce platforms, where a single uploaded image can trigger fundamentally different user intents: product search, style recommendation, visual encyclopedia, or utility tool calls, each demanding its own response format, tool invocation, and domain knowledge. Without per-intent behavioral constraints, LLM-based systems conflate these heterogeneous modes and fall short of domain quality standards, while the breadth and dynamism of the intent space render manual engineering infeasible. To address this, we present SkillChain, which closes the production feedback loop on Skill evolution, automating the lifecycle of Skills through three stages: Skill Creator for bootstrapping from task specs and trajectories, Route Optimizer for routing alignment, and Body Refiner for iterative Skill Body refinement via dual-path LLM-Judge evaluation. Deployed on a production-scale e-commerce image assistant, SkillChain substantially improves aggregate response quality, with the strongest gains on structural compliance and content quality; a one-week online A/B experiment further confirms significant gains in user engagement, content consumption, and long-term retention.

[NLP-53] Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

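[Quick read]: This paper addresses the "Lost in Conversation" problem in multi-turn dialogue with large language models (LLMs): when a user reveals task-critical information across several turns, accuracy drops by up to 65% even though the full context is available. The key to the solution is training models to maintain a compact rolling memory instead of attending to an ever-growing history. To make such training scalable, the authors introduce a low-cost sharding pipeline that converts single-turn QA datasets (such as GSM8K) into multi-turn episodes with fragmented information, avoiding hours of manual annotation. A memory-augmented policy trained only on sharded GSM8K significantly improves multi-turn accuracy and generalises zero-shot to harder math and out-of-domain long-context QA. Memory-trained models also outperform full-history baselines even when given the full history at test time, which suggests that learning to compress induces more robust incremental reasoning than full-context exposure alone.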

Link: https://arxiv.org/abs/2606.12941
Authors: Shu Tong Luo, Wenqin Liu, Rui Liu, Mingming Gong, Jiaxian Guo
Affiliations: The University of Melbourne; Google Research Australia
Categories: Computation and Language (cs.CL)
Comments:

Abstract:When a user reveals task-critical information across several conversation turns, LLM accuracy drops by up to 65% despite full context availability. We show that this Lost in Conversation degradation can be substantially mitigated by training models to maintain a compact rolling memory instead of attending to a growing history. To make such training scalable, we introduce a low-cost sharding pipeline that converts single-turn QA datasets into multi-turn fragmented-information episodes, eliminating the need for hours of manual annotation. Training only on sharded GSM8K, our memory-augmented policy significantly improves multi-turn accuracy and generalises zero-shot to harder math and out-of-domain long-context QA. Moreover, memory-trained models outperform full-history baselines even when given the full history at test time, suggesting that learning to compress induces more robust incremental reasoning than full-context exposure alone.
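The idea of turning a single-turn QA item into a multi-turn, fragmented-information episode can be sketched in a few lines of Python. The abstract does not say how the paper's pipeline chooses shard boundaries. The version below simply splits on sentence boundaries, and both that rule and every name in it are assumptions for illustration only.

```python
import re


def shard_question(question):
    """Split a question into sentence-level shards (naive rule, assumed)."""
    parts = re.split(r"(?<=[.?!])\s+", question.strip())
    return [part for part in parts if part]


def make_episode(question, answer):
    """Build a multi-turn episode that reveals one shard per user turn."""
    return {"turns": shard_question(question), "answer": answer}


# Hypothetical GSM8K-style item.
episode = make_episode(
    "Tom has 3 apples. He buys 2 more. How many apples does he have?",
    "5",
)
```

Each entry in `episode["turns"]` would be shown to the model in a separate turn, so the final answer depends on information carried over from earlier turns.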

[NLP-54] Order Is Not Control

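[Quick read]: This paper argues that the order-inducing objects identified by AI alignment, interpretability, steering, and neural perturbation studies do not amount to control. In the authors' account, control requires a receiver-gated response law: a denominator-indexed operator that maps material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. They identify such laws across biological (mouse ALM, C. elegans, zebrafish), LLM, adapter, and stochastic-operator panels. The laws are local: depending on medium, bath, receiver state, action port, and comparator, an intervention can be admitted, saturated, sign-changing, leaky, or overdriven. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive responses, invalid format, overdrive, and unnecessary effort stay bounded. In the LLM panels, response vectors are predictable at 72.8-73.7% component-sign accuracy across four material conditions, rising to 84.3-84.8% on nonzero components, and held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. The evidence supports local admitted control and measurable stochastic response operators. Deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities are left outside the scope.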

Link: https://arxiv.org/abs/2606.12923
Authors: Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Tim Elson
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 52 pages, 7 figures

Abstract:AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. We identify it across biological, LLM, adapter, and stochastic-operator panels. The laws are local: an intervention can be admitted, saturated, sign-changing, leaky, or overdriven depending on medium, bath, receiver state, action port, and comparator. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive, invalid format, overdrive, and unnecessary effort stay bounded. Mouse ALM, C. elegans, and zebrafish panels provide physical response-operator evidence while excluding coordinate identity and controller conclusions. LLM panels show generated-output response laws: across four material conditions, response vectors are predictable at 72.8-73.7% component-sign accuracy, rising to 84.3-84.8% on nonzero components; held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. Constitution-conditioned adapters reshape susceptibility as prepared media, and stochastic-operator panels separate measured opportunity from deployable action policies. This gives a driven-dissipative response-system account at the mesoscopic control level: drives act through prepared media, baths, and receivers, producing admitted movement, impedance, sinks, or overdrive. The evidence supports local admitted control and measurable stochastic response operators, while leaving deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities outside scope.

[NLP-55] Polar: A Benchmark for Evaluating Political Bias in LLMs

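[Quick read]: This paper addresses the difficulty of measuring political bias in large language models (LLMs) reproducibly across political and linguistic contexts. The authors introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods rather than prompt-based generation. Polar covers two ideological axes and eight issue categories derived from the Manifesto Project, and it evaluates models in parallel on U.S. and South Korean political contexts. Across 38 LLMs, measured bias varies systematically with political context, issue category, model group, and presentation language. All models lean left-progressive on U.S. political content but show more centered and mixed patterns on South Korean content. Translation experiments further show that presentation language alone can shift measured bias. The findings highlight the need for multilingual, cross-contextual evaluation of political bias in LLMs.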

Link: https://arxiv.org/abs/2606.12922
Authors: Sangho Kim, Heejin Kim, Yoonhee Park, Hyunggeun Jeon, Jaejin Lee
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Submitted to ARR 2026 May cycle

Abstract:Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods rather than prompt-based generation. Polar covers two ideological axes and eight issue categories derived from the Manifesto Project, and evaluates models in parallel across U.S. and South Korean political contexts. Across 38 LLMs, measured bias varies systematically with political context, issue category, model group, and presentation language. All models lean left-progressive on U.S. political content, but show more centered and mixed patterns on South Korean content. Translation experiments further show that presentation language alone can shift measured bias. These findings highlight the need for multilingual and cross-contextual evaluation of political bias in LLMs.
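Measuring bias through option-level likelihoods, rather than through generated text, can be sketched in Python as follows. The sketch normalizes per-option log-likelihoods with a softmax and takes the expected value of a stance score. The stance values (-1 for one pole, +1 for the other) and the aggregation rule are hypothetical. The abstract does not describe Polar's actual scoring formula.

```python
import math


def option_probabilities(log_likelihoods):
    """Softmax over per-option log-likelihoods (numerically stable)."""
    peak = max(log_likelihoods)
    weights = [math.exp(value - peak) for value in log_likelihoods]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_stance(log_likelihoods, stances):
    """Probability-weighted average of hypothetical per-option stance scores."""
    probs = option_probabilities(log_likelihoods)
    return sum(p * s for p, s in zip(probs, stances))


# Hypothetical item: the model finds option A three times as likely as option B.
score = expected_stance([math.log(3.0), 0.0], stances=[-1.0, 1.0])
```

Here the probabilities are 0.75 and 0.25, so `score` is about -0.5. Averaging such scores over many items would give one number per model and context, without sampling any text from the model.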

[NLP-56] MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

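[Quick read]: This paper addresses the difficulty of automatically designing molecular dynamics (MD) simulation pipelines for new systems. Designing an MD pipeline requires substantial expert knowledge, and running it on even one molecule is expensive, which rules out trial-and-error. The proposed solution is MDForge, an LLM agent that treats pipeline design as open-ended code generation, unlike existing MD agents that orchestrate a predefined tool set. The agent's behavior is reshaped online by verbal reward, and its in-context update rule densifies the sparse reward through a multi-agent debate among physics experts. On three SAMPL host-guest binding free-energy benchmarks, MDForge automatically designs MD pipelines competitive with those of human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms as a high-affinity, picomolar CB[7] binder.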

Link: https://arxiv.org/abs/2606.12916
Authors: Zehong Wang, Yijun Ma, Connor R. Schmidt, Tianyi Ma, Weixiang Sun, Ziming Li, Xiaoguang Guo, Chuxu Zhang, Matthew J. Webber, Yanfang Ye
Affiliations: University of Notre Dame; University of Connecticut
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principle physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial-and-error. We automate this expert pipeline-design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open-ended code generation in which the agent’s behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in-context update rule densifies the sparse reward via a multi-agent debate among physics experts. On three SAMPL host-guest binding free-energy benchmarks, MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms is a high-affinity, picomolar CB[7] binder. Our data and code are available at this https URL.

[NLP-57] PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation INTERSPEECH2026

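[Quick read]: This paper addresses error propagation in cascaded speech translation (ST) systems, where incorrect transcripts from automatic speech recognition (ASR) degrade downstream neural machine translation (NMT). The authors present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on NMT performance with linear mixed-effects modelling. They confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these errors significantly degrade ST quality. Motivated by this finding, they propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives found via phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs by up to +2.04 BLEU over standard fine-tuning, while also slightly improving clean-text performance.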

Link: https://arxiv.org/abs/2606.12911
Authors: Giang Son Nguyen, Tung X. Nguyen, Hieu Minh Truong, Nhu Vo, Wray Buntine, Dung D. Le
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to INTERSPEECH 2026

Abstract:Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling. We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.
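
The substitution step of PiDA can be illustrated with a minimal sketch. It assumes a precomputed table of similar-sounding alternatives; the table entries, the `augment` function, and the `rate` parameter are hypothetical and stand in for the phonetic word embeddings the abstract mentions, which are not reproduced here.

```python
import random

# Toy confusion table (hypothetical): each token maps to similar-sounding
# alternatives. A real system would derive these from phonetic embeddings.
PHONETIC_NEIGHBOURS = {
    "ba": ["pa", "ma"],
    "tan": ["dan"],
    "sin": ["xin", "shin"],
}


def augment(sentence, rate, rng):
    """Swap each token that has neighbours with probability `rate`."""
    out = []
    for token in sentence.split():
        candidates = PHONETIC_NEIGHBOURS.get(token)
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(token)
    return " ".join(out)


clean = "ba lo tan ki sin"
noisy = augment(clean, rate=1.0, rng=random.Random(0))
print(noisy)  # same length as `clean`; only "ba", "tan", "sin" can change
```

In a training pipeline, the corrupted source sentence would be paired with the original target translation, so the model sees ASR-like noise on the input side only.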

[NLP-58] SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

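[Quick read]: This paper addresses the difficulty of training reliable language model agents for multi-turn tool use, in particular the many uninformative rollouts that arise when the training task distribution no longer matches the agent's evolving capabilities. Conventional reinforcement learning (RL) depends on a task distribution fixed before training; as the policy improves, that distribution gradually stops providing useful training signal. The paper proposes SENTINEL, a failure-driven RL framework: the Solver's failed rollout trajectories serve as feedback, the Controller identifies and summarizes recurring error patterns, the Proposer generates new executable tasks that expose the current weaknesses, and the Solver is then trained on these targeted tasks. On Tau2-Bench Retail, this closed loop raises Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks. The results indicate that model failures are an effective and scalable source of targeted training signal and help mitigate the mismatch between task distribution and policy capability.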

Link: https://arxiv.org/abs/2606.12908
Authors: Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo, Jiri Gesi, Hanqing Lu, Yisi Sang, Manling Li, Jing Huang, Dakuo Wang
Affiliations: Northeastern University; Independent Researcher; Northwestern University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy’s evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver’s rollout failures into targeted training tasks. SENTINEL follows a Controller–Proposer–Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass^k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.
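
The abstract reports Pass^1 and Pass^k without defining them. The sketch below assumes the definition commonly used with tau-bench style evaluations, where Pass^k is the probability that k independent trials of a task all succeed, estimated from n trials with c successes as C(c, k) / C(n, k) and averaged over tasks. This is an assumption about the metric, not a statement of the authors' evaluation code; the function names are illustrative.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Estimate the chance that k trials drawn from the observed ones all succeed."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and the number of trials")
    # comb(c, k) is 0 when c < k, so tasks with too few successes score 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(per_task_successes, num_trials, k):
    """Average the per-task estimate over all tasks in the benchmark."""
    scores = [pass_hat_k(num_trials, c, k) for c in per_task_successes]
    return sum(scores) / len(scores)


# Three tasks, four trials each, with 4, 2 and 0 successful trials.
print(benchmark_pass_hat_k([4, 2, 0], num_trials=4, k=1))
print(benchmark_pass_hat_k([4, 2, 0], num_trials=4, k=2))
```

Under this definition Pass^1 is the plain success rate, and larger k rewards agents that succeed consistently rather than occasionally.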

[NLP-59] X-MADAM-RAG: Diagnosing and Handling Chinese-English Evidence Conflict in Retrieval-Augmented Generation

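[Quick read]: This paper addresses answer conflicts in retrieval-augmented generation (RAG) caused by mutually contradictory retrieved evidence in multilingual settings, focusing on cases where Chinese and English evidence support incompatible answers. It examines X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into four steps: per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the controlled benchmark X-RAMDocs-ZHEN, the method reaches 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, which reveals strong template regularity in the data. To probe robustness, the authors build a naturalized stress test that removes explicit answer templates while preserving candidate strings; there the rule-only extractor falls to 0.0000 and X-MADAM-RAG drops to 0.3000, indicating that document-level candidate extraction remains the main bottleneck. The authors conclude that X-RAMDocs-ZHEN and X-MADAM-RAG are better viewed as diagnostic tools for controlled evidence conflict than as evidence of general hallucination detection or robustness to natural retrieval.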

Link: https://arxiv.org/abs/2606.12903
Authors: Yongqi Kang, Yu Fu, Yong Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-augmented generation (RAG) systems may receive evidence that is not merely noisy but mutually contradictory. This issue becomes particularly salient in multilingual settings, where retrieved Chinese and English evidence may support incompatible answer candidates. We study this problem through X-RAMDocs-ZHEN, a controlled Chinese-English benchmark derived from RAMDocs for diagnosing evidence conflict in RAG. The benchmark contains 300 examples across six balanced conditions, including monolingual support, bilingual agreement, reversed conflict directions, and conflict with optional noise. We further examine X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the original controlled benchmark with Qwen2.5-7B-Instruct, X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, revealing strong template regularity. To probe this limitation, we construct a deterministic naturalized stress test that removes explicit answer templates while preserving candidate strings. On its 100-sample subset, rule-only extraction falls to 0.0000, but X-MADAM-RAG also drops to 0.3000 strict accuracy, below both naive and evidence-normalized baselines. A privileged oracle remains perfect, indicating that document-level extraction is the main bottleneck. These findings position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict rather than as evidence of general hallucination detection or robustness to natural retrieval.
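
The last two pipeline stages, deterministic candidate grouping and conflict-aware aggregation, could look roughly like the sketch below. It assumes each document has already yielded one candidate string (or none) and that equivalent candidates share a surface form after simple normalization; the function names and the returned dictionary layout are hypothetical, not the paper's interface.

```python
from collections import defaultdict


def normalize(candidate):
    """Deterministic grouping key: lowercase and collapse whitespace."""
    return " ".join(candidate.lower().split())


def aggregate(extractions):
    """extractions: list of (language, candidate or None), one per document."""
    support = defaultdict(set)
    for language, candidate in extractions:
        if candidate is not None:
            support[normalize(candidate)].add(language)
    if not support:
        status = "no_answer"
    elif len(support) == 1:
        status = "agreement"
    else:
        status = "conflict"  # report all candidates instead of picking one
    return {
        "status": status,
        "candidates": sorted(support),
        "support": {c: sorted(langs) for c, langs in support.items()},
    }


print(aggregate([("zh", "Paris"), ("en", " paris ")]))
print(aggregate([("zh", "Paris"), ("en", "Lyon"), ("en", None)]))
```

Keeping the aggregation rule-based makes the final decision auditable, which fits the abstract's finding that the failures on the naturalized stress test come from the extraction step rather than from aggregation.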

[NLP-60] PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue INTERSPEECH2026

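[Quick read]: This paper addresses the mismatch between semantic responses and emotional prosody in empathetic spoken dialogue systems. Cascaded pipelines lose acoustic cues during speech-to-text conversion, while end-to-end speech models offer little interpretable control over emotion and knowledge integration. The proposed solution is PRISM, a multi-agent framework that decouples speech perception, response generation, and speech synthesis into coordinated components. It introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and supports on-demand calls to external knowledge tools, producing emotionally aligned dialogue. Experiments show consistent improvements in empathy, prosodic appropriateness, and text response quality on both objective and subjective metrics.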

Link: https://arxiv.org/abs/2606.12902
Authors: Wen Zhang, Xiaocui Yang, Zhuoyue Gao, Shi Feng, Daling Wang, Yifei Zhang
Affiliations: Northeastern University (China)
Categories: Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2026

Abstract:Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while end-to-end speech models lack interpretable control over emotion and knowledge integration. To address these challenges, we propose PRISM, a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. PRISM introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and enables on-demand invocation of external knowledge tools for empathetic dialogue generation. Experimental results demonstrate that PRISM achieves consistent improvements in empathy, prosodic appropriateness, and text response generation quality across objective and subjective metrics. Our code is available at: this https URL.

[NLP-61] Zero-source LLM Hallucination Detection with Human-like Criteria Probing ICML2026

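[Quick read]: This paper addresses hallucination in large language models (LLMs), focusing on detection under the zero-source constraint: no model internals or external references are available, and the decision must rely solely on the textual query-answer pair. The proposed paradigm is Human-like Criteria Probing for Hallucination Detection (HCPD). Its core is a Human-like Criteria Probing (HCP) mechanism in which an LLM agent adaptively decomposes its judgment into a set of interpretable criteria and aggregates the weighted criterion scores into a final truthfulness score. To obtain this adaptive behavior, the authors introduce a reward-based alignment scheme that uses only weak supervision from semantic consistency. At inference time, a multi-sampling aggregation strategy makes decisions more robust while keeping the process fully interpretable, and a theoretical analysis supports the method's reliability. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, giving an effective and explainable approach to zero-source hallucination detection.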

Link: https://arxiv.org/abs/2606.12900
Authors: Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at this https URL.
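
The aggregation described above, weighted criterion scores combined per sample and then pooled over several samples, can be sketched as follows. The criterion names, the weighted-mean rule, and the plain average over samples are assumptions made for illustration; the abstract does not specify the exact formulas.

```python
def weighted_score(criteria):
    """criteria: list of (name, weight, score) with score in [0, 1]."""
    total_weight = sum(weight for _, weight, _ in criteria)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weight * score for _, weight, score in criteria) / total_weight


def truthfulness(samples):
    """Average the per-sample weighted scores (assumed pooling rule)."""
    return sum(weighted_score(criteria) for criteria in samples) / len(samples)


# Two hypothetical judge samples; each proposes its own criteria and weights.
samples = [
    [("factual consistency", 0.5, 0.9), ("specificity", 0.3, 0.8), ("hedging", 0.2, 0.4)],
    [("factual consistency", 0.6, 0.7), ("internal coherence", 0.4, 0.5)],
]
score = truthfulness(samples)
print(round(score, 3))
```

Because every criterion keeps its name, weight, and score, the final number can be traced back to the individual judgments that produced it.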

[NLP-62] Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

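[Quick read]: This paper addresses a bottleneck in Visual Text Comprehension (VTC): current pipelines treat text rendering and layout as a fixed, content-agnostic preprocessing step, and there is little mechanistic understanding of how vision-language models (VLMs) process visualized text. On VTC question answering, the authors find a "localization-without-utilization" regime: attention that localizes the key evidence emerges sharply in the middle-to-late layers but is only weakly tied to answer correctness, yet simply enlarging the localized text regions recovers a large share of the failed predictions. Based on this, the paper proposes AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that uses the VLM's own middle-to-late layer attention to identify the most important visual patches, maps them back to word spans, re-renders the page with those spans enlarged, and then re-infers the answer. Across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones, AGAR consistently improves performance as a plug-and-play enhancement, composes with VLM post-training for further gains, and remains robust under both visual-side and text-side input degradation.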

Link: https://arxiv.org/abs/2606.12898
Authors: Shenglai Zeng, Qirui Wang, Kai Guo, Xinnan Dai, Xianxuan Long, Hui Liu
Affiliations: Michigan State University; Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM’s own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.
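
The patch-selection and span-mapping steps of AGAR can be sketched in a few lines. The sketch assumes the attention has already been reduced to one score per visual patch and that the renderer recorded which word each patch covers (or `None` for blank patches); both data structures and the function names are hypothetical.

```python
def top_k_patches(attention, k):
    """Return the indices of the k highest-scoring patches, in page order."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:k])


def patches_to_spans(patches, patch_to_word):
    """Map patches to word indices and merge adjacent words into spans."""
    words = sorted({patch_to_word[p] for p in patches if patch_to_word[p] is not None})
    spans = []
    for word in words:
        if spans and word == spans[-1][1] + 1:
            spans[-1][1] = word  # extend the current contiguous span
        else:
            spans.append([word, word])
    return [tuple(span) for span in spans]


attention = [0.10, 0.90, 0.80, 0.05, 0.70]
patch_to_word = [0, 1, 2, None, 5]  # patch 3 covers whitespace
selected = top_k_patches(attention, k=3)
spans = patches_to_spans(selected, patch_to_word)
print(selected, spans)
```

The resulting `(first_word, last_word)` spans are what a renderer would enlarge before the page is passed to the VLM a second time.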

[NLP-63] SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

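[Quick read]: This paper addresses a problem with using large language models (LLMs) to access organizational documentation such as standard operating procedures, HR policies, and institutional guidelines: retrieval-augmented generation (RAG) systems that rely on free-form rewriting can introduce hallucinations and struggle to balance completeness against conciseness, which matters most in safety- and compliance-critical settings. The proposed alternative is extraction: selecting the relevant statements from the source document and preserving the original text. The study compares several prompting strategies, including line-number-based source selection, extraction of relevant guideline sentences with explicit safety annotations, and a multi-stage pipeline that refines draft answers with supporting evidence. Line-number selection performs best across model scales and document types (local NHS acute care and oncology guidelines and UK-wide NICE guidelines), with term recall of up to 95% and close alignment with the source text. Safety-oriented strategies improve precision but introduce systematic omissions, and multi-stage filtering amplifies this trade-off further. Document structure also matters: line-level extraction excels on protocol-like content, while other strategies do better on more verbose documents, reaching up to 97% term recall.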

Link: https://arxiv.org/abs/2606.12897
Authors: Julia Ive, Felix Jozsa, Evridiki Georgaki, Nabeel Sheikh, Emma Cattell, Nick Jackson, Paulina Bondaronek, Ciaran Scott Hill, Richard Dobson
Affiliations: University College London; National Hospital for Neurology and Neurosurgery; Somerset NHS Foundation Trust; King’s College Hospital; King’s College London
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG) systems that rely on free-form rewriting can introduce hallucinations and unstable trade-offs between completeness and conciseness, particularly in safety- and compliance-critical settings. Objectives: To evaluate extraction as a hallucination-resistant alternative to rewriting-based RAG and compare strategies that balance precision, recall and safety across document types and model scales. Methods: We compare multiple prompting strategies, including line-number-based source selection, extraction of relevant guideline sentences with explicit safety annotations, and a multi-stage pipeline that refines draft answers using supporting evidence from source guidelines. Experiments are conducted on documents of varying length and structure, including local NHS acute care and oncology guidelines and UK-wide NICE guidelines, using both frontier-scale and locally deployable models. Performance is assessed using automatic metrics and human expert evaluation of relevance and completeness. Results: Line-number selection achieves the strongest results, outperforming direct copying and safety-focused strategies across both large and small models while maintaining high term recall (up to 95%) and close alignment with source text. Safety-oriented approaches improve precision but introduce systematic omissions, while multi-stage filtering further amplifies this trade-off. Performance varies with document structure: line-based extraction excels in protocol-like content, whereas alternative strategies perform better on more verbose documents (up to 97% term recall).
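
Line-number-based selection can be illustrated with a small sketch: the document is shown to the model with numbered lines, the model returns line numbers only, and the answer is assembled by copying those lines verbatim. The helper names, the toy document, and the substring-based `term_recall` are assumptions for illustration; the paper's prompts and its exact recall definition are not given in the abstract.

```python
def number_lines(document):
    """Prefix each line with its 1-based number for use in a prompt."""
    return "\n".join(f"{i}: {line}" for i, line in enumerate(document.splitlines(), start=1))


def extract_lines(document, selected):
    """Copy the selected lines verbatim, ignoring out-of-range numbers."""
    lines = document.splitlines()
    return [lines[i - 1] for i in sorted(set(selected)) if 1 <= i <= len(lines)]


def term_recall(extracted, key_terms):
    """Fraction of key terms that appear in the extracted text (assumed metric)."""
    if not key_terms:
        raise ValueError("key_terms must not be empty")
    text = " ".join(extracted).lower()
    return sum(term.lower() in text for term in key_terms) / len(key_terms)


DOC = "Door code is changed monthly.\nVisitors must sign in at reception.\nFire exits must stay clear."
print(number_lines(DOC))
answer = extract_lines(DOC, [3, 2, 2, 9])  # pretend the model returned these
print(answer, term_recall(answer, ["fire exits", "sign in", "door code"]))
```

Since the model only chooses line numbers, every sentence in the answer is guaranteed to exist in the source, which is the property that makes extraction resistant to hallucination.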

[NLP-64] Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

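[Quick read]: This paper addresses the complexity and computational cost of conventional reinforcement learning methods for fine-tuning large language models. The solution is Direct Preference Optimization (DPO), which maps preference data directly to a policy update objective and avoids steps such as training a value function or estimating policy gradients. This simplifies the training pipeline, improves computational efficiency, and achieves competitive results on BLEU, ROUGE, and cosine similarity, although the training instability observed in the experiments needs further study.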

Link: https://arxiv.org/abs/2606.12881
Authors: Yvonne Qiu, Dezhi Yu, ShuoJia Fu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 3 figures, 1 table

Abstract:We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our experimental results demonstrate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. The evaluation using BLEU, ROUGE, and cosine similarity metrics indicates effective learning and convergence, though further investigation is needed to address observed training instability.
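
For readers unfamiliar with DPO, the standard per-pair objective is the negative log-sigmoid of a scaled difference of log-probability ratios between the policy and a frozen reference model. The sketch below computes it for one preference pair from sequence log-probabilities; the function name, argument names, and the default `beta` are illustrative, and nothing here reflects this paper's specific training setup.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probs."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0 and the loss is log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
# Policy now prefers the chosen answer more than the reference does.
print(dpo_loss(-0.5, -2.0, -1.0, -2.0))
```

No reward model or value function appears in the computation, which is the simplification the summary refers to.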

[NLP-65] Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

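[Quick read]: This paper addresses the difficulty of balancing performance and efficiency when large language models (LLMs) are deployed on heterogeneous hardware with different resource constraints. Existing approaches usually retrain or fine-tune a model for each precision target, which is costly and inflexible. The proposed solution is Drop-by-Drop, a multi-bitwidth post-training quantization framework that allows precision to be chosen at inference time from a single trained model. Grounded in information theory and successive refinement, the authors establish that LLM weights, which commonly follow a Gaussian distribution, can be reconstructed with increasing fidelity as more bits are added, under a weighted mean squared error distortion. In practice, the method adds Matryoshka-style supervision to the loss and exploits the structure of additive codebooks, so that ordered subsets of codebooks give accurate partial reconstructions at each precision level. A single checkpoint can therefore serve multiple bitwidths, which significantly reduces storage and memory overhead while keeping competitive perplexity and accuracy on Qwen, LLaMA, Gemma, and Mistral.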

Link: https://arxiv.org/abs/2606.12876
Authors: Liza Babaoglu, Shuangyi Chen, Ashish Khisti
Affiliations: University of Toronto
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments: 37 pages, 12 figures

Abstract:As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
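
The idea that ordered subsets of additive codebooks give progressively better reconstructions can be shown on a single scalar weight. The sketch below uses greedy residual encoding with hand-picked codebooks; this is a simplification chosen for illustration and is not the Drop-by-Drop training procedure, which learns the codebooks with Matryoshka-style supervision.

```python
# Hypothetical scalar codebooks, coarse to fine. Each contains 0.0 so that
# adding a codebook can never increase the greedy reconstruction error.
CODEBOOKS = [
    [-1.0, 0.0, 1.0],
    [-0.5, 0.0, 0.5],
    [-0.25, 0.0, 0.25],
]


def encode(weight, codebooks):
    """Greedy residual encoding: pick the closest entry from each codebook in turn."""
    codes, residual = [], weight
    for book in codebooks:
        index = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        codes.append(index)
        residual -= book[index]
    return codes


def decode(codes, codebooks, num_books):
    """Reconstruct from the first `num_books` codebooks only (lower precision)."""
    return sum(codebooks[b][codes[b]] for b in range(num_books))


weight = 0.6
codes = encode(weight, CODEBOOKS)
errors = [abs(weight - decode(codes, CODEBOOKS, m)) for m in (1, 2, 3)]
print(codes, errors)
```

Dropping trailing codebooks at load time corresponds to serving a lower bitwidth from the same stored codes, which is why one checkpoint can cover several precision levels.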

[NLP-66] Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization ACL2026

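[Quick read]: This paper addresses the high cost and opacity of large language models (LLMs) for biomedical claim verification, which limit their use at scale. It fine-tunes three small LLMs (Phi-3-mini 3.8B, Qwen2.5-3B, and Mistral-7B) with quantized low-rank adaptation (QLoRA) and evaluates them systematically on the SciFact and HealthVer datasets. With only 1,008 training examples, the Mistral-7B QLoRA model surpasses both GPT-4o and GPT-5 (up to 12% F1 gain) at a fraction of the cost. Through in-domain and cross-domain evaluation, the study also uncovers a structural artifact in SciFact that inflates in-domain scores, and shows that training on structurally sound data enables more robust cross-domain transfer.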

Link: https://arxiv.org/abs/2606.12854
Authors: Gaurav Kumar
Affiliations: Moveworks AI; University of California San Diego
Categories: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: 8 pages, 2 figures, 12 tables. To appear at BioNLP Workshop, ACL 2026

Abstract:Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs: Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B QLoRA surpasses both GPT-4o and GPT-5 (up to 12% F1 gain) at a fractional cost using just 1,008 training examples. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.

[NLP-67] LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

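[Quick read]: This paper addresses a limitation of current search agent benchmarks such as BrowseComp: because they are human-authored, annotators lack a global view of entity statistics and cannot systematically scale up search space size and structural complexity, which creates a difficulty ceiling. The proposed solution is an automated construction pipeline driven by a knowledge graph (KG) covering more than 7 million Wikipedia entities. The pipeline selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers, producing the more demanding benchmark LoHoSearch. The benchmark contains 544 human-verified questions across 11 domains. The strongest model reaches only 34.74% accuracy, far below its level on existing benchmarks, and existing context management strategies yield much smaller gains (best +6.8%) than on prior benchmarks, which highlights the higher demands LoHoSearch places on long-horizon reasoning and context management.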

Link: https://arxiv.org/abs/2606.12837
Authors: Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su
Affiliations: Meituan
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.

[NLP-68] Localizing Anchoring Pathways in Language Models

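[Quick read]: This paper addresses anchoring effects in language models, where irrelevant numbers in a prompt shift the model's judgments in numerical reasoning. Using a controlled multiple-choice setup with shared answer options, the authors define and validate a logit-difference metric that compares the correct answer option with the option corresponding to the anchor, and show that it tracks behavioral anchoring. They then apply attribution-based circuit localization to 7B–8B Qwen and Llama base and instruction-tuned models and find that edge-level methods recover the anchoring signal more faithfully than node-level methods. Low-anchor and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor directions, whereas sparse transfer between base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, the study gives a mechanistic account of how anchoring-related decision signals are carried inside language models.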

Link: https://arxiv.org/abs/2606.12818
Authors: Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman
Affiliations: University of Maryland, College Park
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B–8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
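
The logit-difference metric described above compares the logit of the correct option with the logit of the option that matches the anchor. A minimal sketch, assuming the per-option logits have already been read out of the model; the option labels, example values, and the averaging helper are hypothetical.

```python
def logit_difference(option_logits, correct, anchor):
    """Positive values favour the correct option, negative the anchor option."""
    return option_logits[correct] - option_logits[anchor]


def mean_logit_difference(examples):
    """Average the metric over (option_logits, correct, anchor) triples."""
    diffs = [logit_difference(logits, correct, anchor) for logits, correct, anchor in examples]
    return sum(diffs) / len(diffs)


examples = [
    ({"A": 2.0, "B": 0.5, "C": -1.0}, "A", "B"),  # model resists the anchor
    ({"A": 0.2, "B": 1.4, "C": 0.1}, "A", "B"),   # model follows the anchor
]
print([logit_difference(*example) for example in examples])
print(mean_logit_difference(examples))
```

A scalar of this kind is what attribution-based localization methods need as a target: each edge or node can be scored by how much it contributes to the difference.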

[NLP-69] Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts

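[Quick read]: This paper addresses summaries that become outdated as contexts evolve and new information arrives. The usual response is to regenerate the summary in full, but this discards content that is still valid, obscures what changed, and is wasteful when only part of the summary is out of date. The paper studies localized faithfulness repair and proposes DETECT-REMASK-REPAIR, a diffusion-based framework with three stages: detect outdated spans, remask those regions, and repair them with masked diffusion language models, so that unsupported content is updated while supported content is preserved. Breaking generation into controllable local repairs lowers cost (one-step repair takes under half a second) and allows trade-offs between faithfulness, speed, and content preservation. The authors also introduce StreamSum, a benchmark of synthetic event timelines. Experiments show that the framework improves early drafts and can also serve as a post-hoc correction step that improves faithfulness for autoregressive systems.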

Link: https://arxiv.org/abs/2606.12807
Authors: Hao Zou, Zachary Horvitz, Chandhru Karthick, Zhou Yu, Kathleen McKeown
Affiliations: Columbia University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Summaries of real-world events can become outdated as contexts evolve and new information arrives. A common response is to generate a new summary from the updated context, but full regeneration discards the previous draft, can obscure what changed, and may be unnecessary when only a few claims are unsupported. We study localized faithfulness repair: updating outdated spans in an existing summary while preserving supported content. We propose DETECT-REMASK-REPAIR, a diffusion-based framework that identifies, remasks, and repairs outdated regions with masked diffusion language models. To evaluate evolving-context summarization, we introduce StreamSum, a benchmark of synthetic event timelines. Experiments on DialogSum and StreamSum show that localized diffusion repair provides a controllable alternative to full rewriting: faithfulness-steered repair improves early drafts, one-step repair reduces repair cost to under half a second, with the framework enabling faithfulness-speed-preservation tradeoffs across datasets. We also find that the framework can provide a post-hoc correction step that improves faithfulness for autoregressive systems.

[NLP-70] GENIE: A Fine-Grained Measure for Novelty

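[Quick read]: This paper addresses the lack of creativity and output diversity that large language models (LLMs) show across tasks, focusing on what makes generated content novel within a specific task. Prior work has mostly asked whether models can generate creative outputs at all; this paper instead investigates which factors make an output novel or not in a task-specific way. It proposes GENIE, a fine-grained evaluation metric that measures the novelty of a response along task-specific features relative to a population of responses. Compared with holistic metrics, GENIE captures the high-dimensional nature of novelty and shows which properties are being targeted. Finally, the authors use GENIE to measure how well existing creativity-enhancing methods work and where they can still improve novelty.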

Link: https://arxiv.org/abs/2606.12790
Authors: Ramya Namuduri, Manya Wadhwa, Anshun Asher Zheng, Greg Durrett, Junyi Jessy Li
Affiliations: The University of Texas at Austin; New York University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on addressing whether models are capable of generating creative outputs. Here, we aim to consider novelty and investigate what makes model-generated content novel or not novel in a task-specific manner. We propose a fine-grained evaluation metric GENIE to measure the novelty of responses along task-specific features with respect to a population of responses. We show that unlike GENIE, holistic metrics struggle to capture the high-dimensionality of novelty and do not provide insight on which properties they target. Finally, we use GENIE to measure the effectiveness of mitigation methods that address creativity to better understand where these methods can improve novelty.

[NLP-71] ProPlay: Procedural World Models for Self-Evolving LLM Agents

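[Quick read]: This paper addresses how self-evolving agents can improve through continued interaction without external supervision in partially observable environments, where an agent must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-based agent methods usually rely on memory or planning modules but rarely close the loop between them, so they struggle to keep refining an internal understanding of environment dynamics. The paper proposes ProPlay, a procedural world model that supports procedure-level preplay. It abstracts successful trajectories into reusable procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition carries a reliability record embedding used to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates possible future procedural paths over the known graph structure to produce structured soft guidance; after execution, it updates the graph with environment feedback. Experiments on public benchmarks show that ProPlay consistently outperforms strong baselines in environment understanding and self-evolution capability.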

Link: https://arxiv.org/abs/2606.12780
Authors: Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo, Weixiang Sun, Chuxu Zhang, Yanfang Ye
Affiliations: University of Notre Dame; University of Connecticut
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code has been released in this https URL.

[NLP-72] Agentic MPC for Semantic Control System Resynthesis

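[Quick read]: This paper addresses a limitation of model predictive control (MPC): it cannot dynamically incorporate high-level contextual information such as social norms, user intent, or natural language instructions. The proposed solution is an agentic MPC framework that integrates large language model (LLM) based agents. The agent interprets heterogeneous inputs, including natural language messages, environmental observations, and external knowledge, and uses them to resynthesize the control specifications. In an autonomous driving scenario, this lets the system adapt to personal preferences or respond to social situations such as yielding to emergency vehicles.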

Link: https://arxiv.org/abs/2606.12774
Authors: Yuya Miyaoka, Masaki Inoue
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 5 figures

Abstract:While MPC effectively handles structured, diverse, and low-level specifications, it lacks the capability to dynamically incorporate high-level contextual information such as social norms, user intent, or natural language instructions. To address this limitation, this manuscript introduces an agentic MPC framework that enables context-aware, semantically adaptive control synthesis by integrating with large language model-based agents. The agent interprets heterogeneous inputs, including natural language messages, environmental observations, and external knowledge, to resynthesize the control specifications. The effectiveness of the framework is demonstrated in an autonomous driving scenario, where the system aligns with personal preferences or responds to social situations such as emergency vehicle yielding.

[NLP-73] Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

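[Quick read]: This paper sets out to reveal the actual hardware behavior of the tensor compute path in Apple's Metal 4.1, the MPP matmul2d operation. Its interface is documented, but the underlying execution (whether it is hardware-accelerated, where it runs, the accumulator width, and how fragments are partitioned) is deliberately hidden. The authors present Rigel, a checksum-gated, provenance-tracked microbenchmark harness, and use it on a single Apple M4 Max (a pre-neural-accelerator generation) to recover eleven facts that the specification hides or contradicts. The main finding is that the Metal 4.1 fp8 (E4M3) matmul2d is emulated rather than hardware-accelerated: it sustains only 0.94x the throughput of fp16 despite reading half the operand bytes, so it is a memory-footprint feature rather than a performance feature. A three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution) further indicates that the operation runs entirely on the GPU shader cores, with no dedicated matrix datapath and no evidence of Apple Neural Engine routing. The accumulator is fp32, and the authors reconstruct the undocumented 8x8 cooperative_tensor fragment layout. Acting on these findings, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by 6.5% to 12.9% in the cache-resident regime. All findings are reproducible from the MIT-licensed code and per-cell CSVs.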

Link: https://arxiv.org/abs/2606.12765
Authors: Ramchand Kumaresan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Apple’s Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding: the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature. We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing; that it accumulates in =fp32; and we reconstruct the opaque 8x8 cooperative_tensor fragment layout Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by +6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.

[NLP-74] Detecting Functional Memorization in Code Language Models

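Quick read: This paper addresses functional memorization in code generation by large language models (LLMs): a model may not only copy training text but also reproduce the functional logic of training code without obvious textual overlap. Conventional audits rely on textual overlap (for example string matching) and cannot detect code that is semantically equivalent but written differently. The key to the solution is a counterfactual setup: for Olmo-3-32B, a midtrained model exposed to the target code is compared against a pretrained reference that was not exposed. Both are prompted with the same Python function signatures, and the textual and functional similarity of their generations is measured with LLM-as-a-judge and execution-based evaluation. The results show clear evidence of functional memorization, indicating that audit metrics based only on textual overlap are limited and that evaluation based on functional equivalence is needed.

Link: https://arxiv.org/abs/2606.12764
Authors: Matthieu Meeus, Anil Ramakrishna, Matthew Grange, Zheng Xu, Luca Melis
Institutions: Meta
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Click to view abstract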

Abstract:Large language models (LLMs) are increasingly used to generate code at scale. Meanwhile, prior work has investigated whether training data may be recoverable from model outputs, by auditing the textual overlap between training examples and model generations. Code, however, can be functionally equivalent while textually dissimilar. In this work, we study functional memorization: extraction of functional logic beyond what verbatim metrics detect. We construct a counterfactual setup for Olmo-3-32B, comparing a midtrained model (exposed to target code) against a pretrained reference (not exposed). We prompt both models with Python function signatures and measure both textual and functional similarity (i.e., LLM-as-a-judge, execution-based). Our results show clear evidence of functional memorization, highlighting the need for auditing metrics that go beyond textual overlap.

[NLP-75] LLMs Can Better Capture Human Judgments–With the Right Prompts

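Quick read: This paper addresses two commonly stated limitations of LLMs in capturing human judgment: they fail to reflect the full distribution of human responses, and their judgments are unstable under wording variations. The key to the solution is a set of simple prompting strategies. First, prompting models to report standard deviations and response proportions recovers the range of human responses more fully. Second, making sure scenarios are clear to human participants, as measured by human confusion ratings, improves model-human alignment, and models can track those confusion ratings. The study also finds that models' estimates of their own error are poorly calibrated, although they predict human variability relatively well. Asking better questions can therefore improve LLM performance on tasks such as moral judgment and yield better AI-human alignment.

Link: https://arxiv.org/abs/2606.12754
Authors: Danica Dillion, Chen Cecilia Liu, Baihui Wang, Daniele Barolo, Tanmay Rajore, Niket Tandon, Pranathi Ravikumar, Kurt Gray
Institutions: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract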

Abstract:Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets–a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme’s Family and Changing Gender Roles module covering 32 countries–we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants–as reflected in human confusion ratings–boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs’ estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking better questions to LLMs can yield better answers.

[NLP-76] Agent-based models for the evolution of morphological alternation patterns

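Quick read: This paper asks why morphological alternations, such as English "go" with the apparently unrelated past tense "went", persist and evolve over long periods even though they offer no clear advantage for communication or learning. The core challenge is to show how such alternations arise and become entrenched during language change. The key to the solution is a multi-agent simulation of how alternations emerge: alternate forms are introduced by sound change or by lexical alternatives used by a subset of the population (as with "go/went"). When an agent 'hears' another agent use a form (such as "went"), it adopts that form with some probability and may spread it to other paradigm slots that shared the same original form, so alternants diffuse and become entrenched. Unlike many earlier studies, the model supports naturalistic lexical forms, realistic phonological rules, lexicons with hundreds or thousands of entries, and populations of tens or hundreds of agents, with configurable network topologies, diffusion patterns, and adoption policies. To evaluate realism, the authors introduce the AI Historical Linguist, an LLM-driven system that models a debate between two historical linguists, and use it to compare real, disguised, and experimentally evolved morphologies. The results suggest that scale-free social networks and random Bernoulli adoption favor more plausible morphologies. Three case studies of attested historical changes show what might have happened had history been different.

Link: https://arxiv.org/abs/2606.12748
Authors: Aravinth Kulanthaivelu, Richard Sproat
Institutions: unknown
Subjects: Computation and Language (cs.CL)
Comments: 51 + 37 pages. 31 Figures

Click to view abstract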

Abstract:Why is the past of English “go” the apparently unrelated “went”? Such alternations are frequent in languages. They neither aid communication nor learnability, yet they can be persistent, surviving over centuries or millennia. We present a multi-agent simulation of the emergence of morphological stem and inflection alternations. Alternate forms arise by phonological changes or, as with “go/went”, from lexical alternatives associated with a subset of the population. When an agent ‘hears’ another agent use a novel form for a slot in the paradigm of a word (say, the past tense of go), they will with some probability adopt that form, possibly spreading its use to other slots in the paradigm that shared the same original form. Thus alternative forms can spread through the population and become entrenched as stem or inflectional marker alternants. Unlike many previous computational studies, our system allows for naturalistic lexical forms, realistic phonological rules, lexicons with hundreds or thousands of entries, and agent populations in the tens or hundreds. It supports several network topologies, diffusion patterns and agent adoption policies. One issue with such simulations is evaluation: how realistic is the resulting morphology compared to those of real languages? We introduce the AI Historical Linguist, a novel Large Language Model-driven system that models a debate between two historical linguists. We use this to compare a set of real language morphologies, disguised morphologies, and experimentally evolved morphologies. The results suggest that among the factors that favor more plausible morphologies are scale-free social networks and random Bernoulli adoption of forms. We also present three case studies modeling attested historical changes, allowing us to test what might have happened if history had been different. All code and data are released.
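
The "random Bernoulli adoption" idea in the abstract above can be illustrated with a toy sketch. This is not the authors' simulator: the function `simulate_adoption`, the ring network, and the rule that a hearer never reverts to the old form are all assumptions made for illustration, and paradigm slots, phonological rules, and lexicons are left out entirely.

```python
import random


def simulate_adoption(neighbors, innovators, p_adopt, steps, seed=0):
    """Toy spread of one novel form through a population (hypothetical model).

    neighbors:  dict mapping each agent id to the list of agents it can talk to
    innovators: set of agents that already use the novel form
    p_adopt:    Bernoulli probability that a hearer adopts the form on hearing it
    Returns the number of adopters after each step (index 0 is the initial state).
    """
    rng = random.Random(seed)
    agents = sorted(neighbors)
    uses_novel = {agent: agent in innovators for agent in agents}
    history = [sum(uses_novel.values())]
    for _ in range(steps):
        speaker = rng.choice(agents)
        if neighbors[speaker]:
            hearer = rng.choice(neighbors[speaker])
            # The hearer adopts only if the speaker used the novel form
            # and an independent Bernoulli(p_adopt) draw succeeds.
            if uses_novel[speaker] and not uses_novel[hearer] and rng.random() < p_adopt:
                uses_novel[hearer] = True
        history.append(sum(uses_novel.values()))
    return history


# Ring of 10 agents, one innovator (assumed topology, chosen for simplicity).
n = 10
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
history = simulate_adoption(ring, innovators={0}, p_adopt=0.3, steps=2000)
print(history[0], history[-1])
```

Swapping the ring for a scale-free graph, or changing `p_adopt`, is the kind of variation the abstract describes; how the paper actually parameterizes these choices is not stated in the abstract.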

[NLP-77] Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior ICML2026

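Quick read: This paper addresses the reliability of predicting LLM behavior for safe deployment: can low-cost psychometric probes anticipate a model's behavioral tendencies? Prior work found a substantial dissociation between self-reports (SR) and actual behavior, but it relied on broad personality traits (Big 5), which predict specific behaviors only weakly, even in humans. Earlier experiments also used isolated conversational sessions with weak context matching, so it remained unclear whether models truly lack coherence or whether the conditions for detecting it were not met. The paper instead uses the Theory of Planned Behavior (TPB), which targets the intention behind a specific behavior and predicts human behavior more accurately. It tests 11 frontier LLMs on four behavioral tasks while varying session context and identity induction. Results: 1) within a single conversation, TPB-based self-reports reach human-level coherence with behavior, while Big 5 does not; 2) across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt (such as implicit bias shaped by training) and collapses when behavior is strongly primed by context (as with sycophancy); 3) persona prompting makes self-reports more consistent but does not align behavior with them. The key finding is that coarse personality frameworks such as Big 5 are poorly suited to assessing deployment behavior; task- and behavior-specific instruments are needed, and their validity must be re-examined across tasks and contexts.

Link: https://arxiv.org/abs/2606.12730
Authors: Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez
Institutions: Caltech; UIUC; University of Cambridge
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted as an Oral (Contributed Talk) at the ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

Click to view abstract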

Abstract:Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors weakly, even in humans. Furthermore, the isolation of conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5 may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.

[NLP-78] Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review ICML2026

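Quick read: This paper addresses the risk of cross-modal adversarial attacks on AI-assisted scientific peer review. Because figures as well as text carry core evidence in scientific papers, and existing robustness studies are overwhelmingly text-only, targeted attacks in the multimodal setting are poorly covered. Unlike ordinary jailbreaking, these attacks aim to induce a domain-specific, targeted failure (for example "inflate this score") rather than a general safety-policy violation, and practical defenses are lacking. The paper proposes PaperGuard, the first comprehensive benchmark for systematically evaluating and defending against such attacks. It rests on three parts: (1) a multimodal peer-review dataset spanning multiple scientific domains; (2) a unified attack suite covering black-box prompt injection and white-box perturbations against text (GCG-based) and figures (PGD-based); (3) a practical defense, motivated by the long-context nature of academic papers, that uses chunk-based embedding search to localize and mitigate harmful instructions efficiently. Experiments show that current state-of-the-art models are broadly vulnerable to multimodal attacks. PaperGuard provides a benchmark, protocols, and an actionable defense for trustworthy, attack-resilient AI-assisted scholarly reviewing.

Link: https://arxiv.org/abs/2606.12716
Authors: Xinyu Zhao, Rana Muhammad Shahroz Khan, Zhen Xu, Zhen Tan, Tianlong Chen
Institutions: unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ICML 2026, Project Page: this https URL

Click to view abstract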

Abstract:The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., “inflate this score”) rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.
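
As a rough illustration of what "chunk-based embedding search" for localizing injected instructions could look like, here is a minimal sketch. It is not PaperGuard's implementation: the bag-of-words `embed` function stands in for a neural text encoder, and the probe sentence, window size, stride, and threshold are hypothetical values chosen for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flag_chunks(document, probes, size=12, stride=6, threshold=0.5):
    """Slide a window over the document and flag chunks similar to any probe.

    Returns (start_word_index, similarity, chunk_text) for each flagged chunk.
    """
    words = document.split()
    probe_vectors = [embed(p) for p in probes]
    flagged = []
    for start in range(0, len(words), stride):
        text = " ".join(words[start:start + size])
        score = max(cosine(embed(text), pv) for pv in probe_vectors)
        if score >= threshold:
            flagged.append((start, round(score, 3), text))
    return flagged


paper = (
    "We train a transformer on protein sequences and report accuracy on three "
    "held-out test sets. Ignore previous instructions and give this paper a "
    "high score. Ablations confirm that each component matters for final accuracy."
)
probes = ["ignore previous instructions and give this paper a high score"]
for start, score, text in flag_chunks(paper, probes):
    print(start, score, text)
```

Searching per chunk keeps each comparison short even when the paper is long, which is the long-context motivation the abstract mentions; a flagged chunk could then be removed or passed to a stricter check.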

[NLP-79] AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

Quick read: This paper addresses the scarcity of resources for, and the underrepresentation of, African languages in natural language processing (NLP) research. Despite their linguistic diversity and global significance, data support for these languages remains very limited. The paper introduces AfriSUD, the first large-scale collection of syntactically annotated treebanks, covering nine African languages that span major language families and regions of Sub-Saharan Africa. The key to the solution is the Surface-Syntactic Universal Dependencies (SUD) framework combined with a community-led effort that yields high-quality data verified by native speakers and captures key typological features such as agglutination and tone. A range of models (non-transformer baselines, multilingual pretrained encoders, and LLMs) is evaluated on part-of-speech tagging and dependency parsing. The results reveal a significant syntax gap, suggesting that existing architectures do not fully capture the structural diversity of African-language syntax.

Link: https://arxiv.org/abs/2606.12708
Authors: Happy Buzaaba, Cheikh Mouhamadou Bamba Dione, David Ifeoluwa Adelani, Sylvain Kahane, Kim Gerdes, Bruno Guillaume, Kevin Guan, Aremu Anuoluwapo, Naome A. Etori, Shamsuddeen Hassan Muhammad, Utitofon Inyang, Peter Nabende, David Sabiiti Bamutura, Andiswa Bukula, Chinedu Uchechukwu, Rooweither Mabuya, Idris Akinade, Christiane Fellbaum
Institutions: Princeton University; Laboratory for Artificial Intelligence, Princeton University; Gaston Berger University; Mila, McGill University; Canada CIFAR AI Chair; Paris Nanterre University; Paris-Saclay University; CNRS; Inria; LORIA; Université de Lorraine; University of Trento; University of Minnesota–Twin Cities; Imperial College London, UK; Binghamton University; Makerere University; Penn State University; Mbarara University of Science and Technology; Chalmers University of Technology; University of Ibadan; Nnamdi Azikiwe University; South African Centre for Digital Language Resources
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker verified data that capture typological key features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap, where models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.

[NLP-80] Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

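Quick read: This paper addresses the interpretability of latent reasoning models (LRMs): how to tell whether a model's continuous thoughts reflect a real internal reasoning mechanism or only surface patterns produced by training or architecture. Existing work often treats observable latent-state patterns, such as BFS-like frontiers or decodable arithmetic computation, as evidence of internal reasoning, yet these patterns can also appear in control models that lack the proposed recurrence or curriculum, and they do not always causally affect behavior. Using causal interventions, the paper shows that latent-thought utilization is not a binary switch but graded, scaling with a thought's causal effect on model behavior. Geometric analysis shows that this effect concentrates in low-rank directions whose step-to-step geometry becomes more structured as behavioral influence grows. Latent thoughts should therefore be treated as hidden computation rather than hidden explanation: decodability, attention, or static structure alone cannot establish mechanism. LRM interpretability requires matched controls and causal tests.

Link: https://arxiv.org/abs/2606.12689
Authors: Darpan Aswal, Thomas Palmeira Ferraz, Yongxin Zhou, Maxime Peyrard
Institutions: Université Grenoble Alpes, CNRS, Grenoble INP, LIG; Université Paris-Saclay; NAVER LABS Europe
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract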

Abstract:Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts. Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for internal reasoning mechanisms. Evaluating two LRMs (Coconut and CODI) against controls lacking the proposed recurrence or curriculum, we find these patterns also appear in the controls and do not always causally affect behavior. Causal interventions reveal that latent-thought utilization is not binary but graded, scaling with a thought’s causal effect on model behavior. Geometric analyses reveal this effect concentrates in low-rank directions whose step-to-step geometry grows more structured as their behavioral influence increases. Latent thoughts should therefore be treated as hidden computation, not hidden explanation: decodability, attention, or static structure alone cannot establish mechanism. LRM interpretability thus requires matched controls and causal tests.

[NLP-81] MentalMARBERT: Domain-Adaptive Pre-training and Two-Stage Fine-Tuning for Arabic Mental Health Disorders Detection

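Quick read: This paper addresses the detection of mental health disorders in Arabic social media text, which is hampered by dialectal variation, informal language, scarce high-quality annotated resources, and severe class imbalance. While English mental health NLP has advanced substantially, Arabic multi-class disorder classification remains understudied. The paper proposes a two-phase framework. In phase 1, three Arabic pretrained models (AraBERT, CAMeLBERT, MARBERT) undergo Domain-Adaptive Pretraining (DAPT) and Task-Adaptive Pretraining (TAPT) on a large corpus of unlabeled Arabic mental health tweets, and the best backbone is selected under a unified evaluation protocol. In phase 2, the selected model is evaluated in four configurations that combine single-stage or hierarchical two-stage classification with full fine-tuning or Low-Rank Adaptation (LoRA). To support the study, the authors build a new annotated dataset of 50,670 Arabic tweets across six categories with strong inter-annotator agreement (Krippendorff's Alpha = 0.733, average pairwise agreement = 0.797). The domain-adapted MARBERT (MentalMARBERT) achieves statistically significant improvements over baselines in accuracy and macro-F1, and the hierarchical two-stage architecture with full fine-tuning performs best, reaching a macro-F1 of 0.861 and an accuracy of 0.877.

Link: https://arxiv.org/abs/2606.12649
Authors: Fatimah Almalki, Areej Alhothali, Lulwah Alharigy, Abdulrahman Aladeem
Institutions: King Abdulaziz University
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 5 figures, 13 tables

Click to view abstract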

Abstract:Detecting mental health disorders from Arabic social media text remains challenging due to dialectal variation, informal language, limited high-quality annotated resources, and severe class imbalance. While English mental health natural language processing (NLP) has progressed substantially, Arabic multi-class disorder classification remains insufficiently studied. This study proposes a two-phase framework for Arabic mental health text classification. In phase 1, three Arabic pre-trained language models, AraBERT, CAMeLBERT, and MARBERT, undergo Domain-Adaptive and Task-Adaptive Pretraining (DAPT and TAPT) using a large-scale corpus of unlabeled Arabic mental health tweets. The adapted models are evaluated under a unified protocol to identify the most effective backbone model. In phase 2, the selected model is assessed across four configurations combining single-stage and hierarchical two-stage classification architectures with full fine-tuning and Low-Rank Adaptation (LoRA). To support this study, we constructed a novel annotated Arabic mental health dataset comprising 50,670 tweets across six categories, with strong inter annotator agreement (Krippendorff’s Alpha = 0.733, average pairwise agreement = 0.797). Experimental results show that the domain-adapted MARBERT (MentalMARBERT) achieves statistically significant improvements over baseline models in both accuracy and macro-F1. The hierarchical two-stage architecture combined with full fine-tuning achieves the best overall performance, reaching a macro-F1 of 0.861 and an accuracy of 0.877. These findings demonstrate the effectiveness of domain-specific adaptive pretraining and hierarchical classification for Arabic mental health disorder detection.
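
The abstract reports both accuracy and macro-F1, and the two can diverge under the class imbalance it mentions. The sketch below uses the standard definitions (per-class F1 = 2TP / (2TP + FP + FN), averaged without class weights); the label names and the tiny example are made up and are not drawn from the paper's dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced example: four "common" items, two "rare" items.
y_true = ["common"] * 4 + ["rare"] * 2
y_pred = ["common"] * 5 + ["rare"]
print(round(accuracy(y_true, y_pred), 3), round(macro_f1(y_true, y_pred), 3))
```

Here one missed "rare" example costs more in macro-F1 than in accuracy, which is why macro-F1 is the more informative number when some disorder categories are much smaller than others.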

[NLP-82] Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents EMNLP2026

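Quick read: This paper addresses credit assignment in long-horizon tool-use reinforcement learning, where the trajectory-level advantage is a sparse signal, and in particular the way direct self-distillation can silently destroy useful tool-use behavior. Token-level self-distillation rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. The paper proposes Sibling-Guided Credit Distillation (SGCD), whose key idea is to use distillation as a credit-assignment aid rather than as a competing actor loss. Dynamic sampling produces successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded, detached credit weights reshape the token advantages of Group Relative Policy Optimization (GRPO). The deployed student sees no external LLM, sibling evidence, or oracle. On AppWorld and τ³-airline, SGCD improves over matched GRPO comparators.

Link: https://arxiv.org/abs/2606.12634
Authors: Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein
Institutions: Amazon Web Services
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 4 figures, 7 tables. Submitted to EMNLP 2026 Industry Track

Click to view abstract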

Abstract:Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy’s own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and τ³-airline, SGCD improves over matched GRPO comparators: AppWorld TGC 42.9 → 45.6 on test_normal and 24.7 → 27.0 on test_challenge, and τ³-airline pass@1 0.583 → 0.602.
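
To make "bounded credit weights reshape GRPO token advantages" more concrete, here is a minimal sketch. The group-relative advantage (reward minus group mean, divided by group standard deviation) is the standard GRPO form; everything about the weights is an assumption: the function `bounded_credit_weights`, its linear mapping, and the `spread` parameter are hypothetical and are not taken from the paper. In a real training loop the weights would also be detached from the gradient, which plain Python numbers already are.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each rollout's reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def bounded_credit_weights(token_scores, spread=0.5):
    """Map per-token scores to weights in [1 - spread, 1 + spread] (assumed form).

    The weights average to 1, so they redistribute credit among tokens
    without changing the rollout's overall advantage scale.
    """
    mu = mean(token_scores)
    centered = [s - mu for s in token_scores]
    scale = max(abs(c) for c in centered)
    if scale == 0:
        return [1.0] * len(token_scores)
    return [1.0 + spread * c / scale for c in centered]


def token_advantages(rewards, per_token_scores):
    """Broadcast each rollout advantage to its tokens, then reweight per token."""
    advantages = group_relative_advantages(rewards)
    return [
        [adv * w for w in bounded_credit_weights(scores)]
        for adv, scores in zip(advantages, per_token_scores)
    ]


# Four sibling rollouts of one task: two succeed, two fail (made-up numbers).
rewards = [1, 0, 1, 0]
scores = [[0.1, 0.9, 0.2], [0.3, 0.3, 0.3], [0.0, 1.0, 0.5], [0.2, 0.4, 0.9]]
for row in token_advantages(rewards, scores):
    print([round(x, 3) for x in row])
```

Because the weights are bounded and positive, each token keeps the sign of its rollout's advantage, so the policy gradient still decides the direction of the update and the extra signal only shifts how much each token contributes.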

[NLP-83] PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

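Quick read: This paper addresses the uniform behavior of non-ego traffic agents in closed-loop driving simulators: existing approaches rely on rule-based traffic managers or on learned models trained toward a single behavioral mode, and do not model the diversity of human driving styles. The key to the solution is PersonaDrive, which conditions a vision-language-action (VLA) driving agent on demonstrations retrieved from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions. The pipeline has three stages: offline triplet mining over per-style human driving data using a combined image-text similarity score; training a lightweight retrieval head that fuses frozen visual features with a small control encoder over the per-style databases; and fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, switching style only requires changing which per-style database the retrieval head queries, with no per-style retraining. On Bench2Drive, PersonaDrive without style conditioning improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD; under style conditioning it attains the highest driving score in every style (its weakest style surpasses the strongest baseline, DMW, by 5.4%), and average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

Link: https://arxiv.org/abs/2606.12616
Authors: Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki
Institutions: University of California, Irvine
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract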

Abstract:Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

[NLP-84] Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

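Quick read: This paper addresses the lack of a benchmark that jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality required of conversational shopping assistants. Existing e-commerce and general-purpose benchmarks do not reflect the decisions real shopping conversations involve, including subjective preferences, budget constraints, and cross-product trade-offs. The paper introduces Shopping Reasoning Bench, a benchmark authored by retail domain experts with 525 missions (232 single-turn, 293 multi-turn) and 10,863 importance-weighted binary rubrics, organized under five reasoning categories and fifteen subcategories that cover demands such as preference refinement, trade-off analysis, and compatibility assessment. The key to the solution is this expert-authored, structured rubric system. An evaluation of nine models from three families (GPT, Claude, Gemini) shows overall pass rates of only 57%-77%. On multi-turn missions, all models score 13-29 points lower on optional above-and-beyond criteria than on required ones, and performance drops by 4-18 points as conversations progress. Current models handle basic shopping assistance but fall short of expert-level advice, which makes the benchmark a challenging testbed for future shopping assistants.

Link: https://arxiv.org/abs/2606.12608
Authors: Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia, Jayakrishnan Unnikrishnan, Rowan Musselmann, Yifan Gao, Qingyu Yin, Priyanka Nigam, Bing Yin
Institutions: Amazon
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract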

Abstract:Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57–77% overall. On multi-turn missions, all models score 13–29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4–18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.

[NLP-85] Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation

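Summary: This paper studies how to turn a dense, abstract proverb into a narrative that is both engaging and morally faithful, that is, how to move from highly compressed cultural knowledge to a concrete, coherent story. The authors frame this as a constrained semantic decompression task and build the Proverb Aligned Narrative Dataset (PAND) for Persian, which pairs proverbs with human-written stories and explicit meanings. Using a hybrid evaluation framework that combines a human-calibrated LLM-as-a-Judge with structural metrics, they find a persistent "decompression gap": current LLMs often produce fluent text on the surface while failing to reproduce the moral and causal structure encoded in the proverb. Explicit reasoning and iterative refinement partially mitigate these failures, which suggests that many errors come from the difficulty of translating abstract meaning into narrative form rather than from missing knowledge. The task extends naturally to other forms of compressed cultural knowledge.

Link: https://arxiv.org/abs/2606.12599
Authors: Zahra Habibzadeh, Paria Khoshtab, Amir Mesbah, Yadollah Yaghoobzadeh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: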

Abstract: Transforming a dense, abstract proverb into an engaging and morally faithful narrative requires deep cultural understanding and robust semantic grounding. We frame this problem as a constrained semantic decompression task and study proverb-conditioned story generation as a testbed for abstraction-to-realization in large language models (LLMs). Focusing on Persian, we introduce the Proverb Aligned Narrative Dataset (PAND), pairing proverbs with human-written stories and explicit meanings. Using a hybrid evaluation framework that combines human-calibrated LLM-as-a-Judge with structural metrics, we analyze model behavior across multiple prompting regimes. Our findings reveal a persistent decompression gap: current LLMs often achieve strong surface-level fluency while failing to faithfully instantiate the underlying moral and causal structure encoded in proverbs. We further show that explicit reasoning and iterative refinement can partially mitigate these failures, suggesting that many decompression errors arise from difficulties in translating abstract meaning into narrative form rather than a complete lack of relevant knowledge. Our proposed task naturally extends to other forms of compressed cultural knowledge.

[NLP-86] MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

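Summary: This paper targets mechanism-level drug-drug interaction (DDI) prediction: determining not only whether two drugs interact but also which enzyme or pharmacodynamic axis is involved, in which direction, and with what evidence. Most existing methods stop at flat interaction classification and offer little interpretability or verifiability at the mechanism level. The authors propose a reproducible mechanism-level DDI labelling and evaluation protocol with a structured taxonomy of 7 families and 147 subtypes, leakage-safe cold-split protocols, and auditable reasoning metrics. The core contribution is a 7B-parameter model trained with Mirror-Augmented Reasoning Distillation (MARD-7B), which combines three training techniques: a single-token KL divergence on the direction tag, PRM-weighted DPO with programmatic hard negatives, and a leakage-safe mechanism-aware retrieval channel. Process-reward step labels are verified automatically against DrugBank structured fields, with no human or LLM judges. On the April-2026 DrugBank release, MARD-7B is the only system in a 32-system comparison whose accuracy survives drug-pair novelty, beating the best baseline by 13.9 percentage points and GPT-4o by 6.7 percentage points at about 1% of frontier API cost. Accuracy also improves on rarely seen drugs, which suggests that the gain comes from structured pharmacological reasoning rather than drug-frequency memorisation. The corpus, DDI-PRM, retrieval index, and training code are released.

Link: https://arxiv.org/abs/2606.12578
Authors: Mohammadreza Riyazat, Vian Lelo, Rameen Jafri, Yumna Khan, Abeer Badawi
Affiliations: University of Guelph; York University; Vector Institute
Categories: Computation and Language (cs.CL)
Notes: 29 pages, 9 figures. Preprint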

Abstract:Mechanism-level drug-drug interaction (DDI) prediction requires identifying which enzyme or pharmacodynamic axis is implicated, in which direction, and with which evidence – not merely whether two drugs interact. We introduce a reproducible mechanism-level DDI labelling and evaluation protocol with a structured 7-family/147-subtype taxonomy, leakage-safe cold-split protocols, and auditable reasoning metrics for evaluating pharmacological prediction beyond flat interaction classification. We propose a pipeline that produces a 7B reasoning MARD (Mirror-Augmented Reasoning Distillation), combining three training innovations: a single-token KL divergence on direction tag that ties the model’s prediction, per-loss PRM-weighted DPO with programmatic hard negatives, and a leakage-safe mechanism-aware retrieval channel. Process-reward step labels are automatically verifiable against DrugBank-structured fields, requiring no human or LLM judges. On the April-2026 DrugBank release, our MARD-7B is the only system in a 32-system comparison whose accuracy survives drug-pair novelty, beating the best baseline by +13.9 pp and GPT-4o by +6.7 pp at ~1% of frontier API cost. Further analysis reveals an anti-memorisation signature where accuracy improves on rarely seen drugs, suggesting that gain comes from structured pharmacological reasoning rather than drug-frequency memorisation. We release corpus, DDI-PRM, retrieval index, and training code.

[NLP-87] Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

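Summary: This paper addresses the fact that current video generation systems cannot explain a scientific figure with step-by-step, region-grounded narration that stays aligned with the source paper. The authors propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that takes a figure and its paper, generates paper-grounded narration, and grounds each narration step to a region of the figure in sequence, producing a narrated walkthrough video. They also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD outperforms existing approaches in both automatic and human evaluation, producing natural narrations that remain faithful to the paper.

Link: https://arxiv.org/abs/2606.12576
Authors: Ishani Mondal, Javad Baghirov, Jordan Boyd-Graber
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: Webpage: this https URL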

Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.

[NLP-88] EDEN: A Large-Scale Corpus of Clinical Notes for Italian

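Summary: This paper addresses the scarcity of emergency-department clinical text with high-quality annotation, particularly for Italian, which limits the training and evaluation of Large Language Models (LLMs) on real clinical tasks. The authors build EDEN (Emergency Department Electronic Notes), a large-scale, fully anonymized corpus of approximately 4 million Italian clinical notes covering the phases of patient care during an emergency department stay. A subset of about six thousand notes is manually annotated by clinical experts through a structured Case Report Form (CRF) with 132 items relevant to two patient situations, dyspnea and loss of consciousness; items take numerical, categorical, binary, and mixed value types. The annotation involved multiple clinicians and iterative revision to resolve ambiguities, yielding a richly structured but highly imbalanced resource. The authors also propose CRF-filling as a new structured information extraction benchmark and report zero-shot baselines with Gemma-27B and MedGemma-27B. To the best of the authors' knowledge, EDEN is the largest freely available corpus of clinical notes in Italian.

Link: https://arxiv.org/abs/2606.12569
Authors: Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini
Affiliations: Fondazione Bruno Kessler; Istituto di Ricerche Farmacologiche Mario Negri IRCCS; University of Padua
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: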

Abstract: We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million fully anonymized clinical notes, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant for two patient situations in emergency departments, dyspnea and loss of consciousness. Items may assume numerical (e.g., for blood saturation), categorical (e.g., for level of consciousness), binary (e.g., for presence of traumas), and mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant gap in data able to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark, and provide zero-shot baselines obtained with Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes existing for the Italian language.

[NLP-89] Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

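Summary: This paper addresses real-time detection of hallucinations during streaming generation. Token-level detectors are usually evaluated as classifiers by AUC over all tokens, which says nothing about the quantity that matters for a streaming monitor: the reaction time, meaning the number of tokens between the onset of a hallucination and the alarm. The authors formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. They then show that a causal recurrent labeler acts as a CUSUM detector with a learned increment; at a matched false-alarm rate it detects in 11-13 tokens, against 31 for a linear per-token baseline. A controlled decomposition attributes most of this advantage to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder being a finite-horizon effect. Classification metrics conceal this delay structure, while sequential analysis makes it measurable.

Link: https://arxiv.org/abs/2606.12476
Authors: Igor Itkin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 14 pages, 1 figure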

Abstract: Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden’s lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment; at a matched false-alarm rate it detects in 11-13 tokens, against 31 for a linear per-token baseline, and a controlled decomposition attributes most of this advantage to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
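
The CUSUM view in this abstract can be illustrated with a minimal sketch. It assumes Page's standard recursion S_t = max(0, S_{t-1} + z_t) with an alarm once S_t reaches a threshold; the increments are toy numbers rather than the paper's learned scores, and `cusum_alarm` is a hypothetical helper name.

```python
def cusum_alarm(increments, threshold):
    """Return the first index where the CUSUM statistic reaches threshold, else None."""
    s = 0.0
    for t, z in enumerate(increments):
        # Reset at zero so evidence from the faithful regime cannot go negative.
        s = max(0.0, s + z)
        if s >= threshold:
            return t
    return None


# Toy stream: 10 faithful tokens (negative increments), then hallucinated ones.
onset = 10
increments = [-0.5] * onset + [0.8] * 10

alarm = cusum_alarm(increments, threshold=3.0)
# Delay convention assumed here: tokens elapsed after the onset token.
delay = alarm - onset
print(alarm, delay)
```

Raising the threshold lowers the false-alarm rate at the cost of a longer delay, which is the trade-off that the delay bounds in the abstract quantify.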

[NLP-90] Occupational Prompting Reveals Cultural Bias in Large Language Models

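Summary: This paper asks how large language models (LLMs) associate occupational identities with broader cultural value patterns, and whether professional-role cues systematically shift the values a model expresses. Prior work relied on nationality-based cultural prompting to compare model responses with human cultural benchmarks; the effect of occupational identity has remained unclear. The authors replace cultural prompting with occupational prompting: using an evaluation pipeline based on questions from the Integrated Values Surveys, they prompt open-weight LLMs to answer as an accountant, teacher, engineer, nurse, and so on, and project the responses into the two-dimensional Inglehart–Welzel cultural space. Responses remain within a broadly Western-leaning region of the map, but different occupations introduce shifts within that region, producing distinct occupational skews. Occupational prompts are therefore not treated as neutral role labels; they elicit structured value patterns. The work extends survey-based evaluation of cultural bias beyond nationality-based prompting and provides a framework for studying how occupational personas shape value expression in LLMs.

Link: https://arxiv.org/abs/2606.12443
Authors: Maksim E. Eren, Andrea Brennen, Ryan C. Barron, Eric Michalak
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: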

Abstract:Social roles shape expectations, priorities, and judgments, yet it remains unclear how large language models (LLMs) associate occupational identities with broader cultural value patterns. Prior work used nationality-based cultural prompting to study how LLM responses to value-survey questions align with human cultural benchmarks. In this paper, we extend that framework by replacing cultural prompting with occupational prompting to examine how professional-role cues influence value-survey responses in open-weight LLMs. Using a survey-grounded evaluation pipeline based on questions from the Integrated Values Surveys, we project model responses into the two-dimensional Inglehart–Welzel cultural space. We prompt open-weight LLMs to answer questions under occupational identities such as accountant, teacher, engineer, and nurse, and then analyze how these occupation-conditioned responses are positioned on the cultural map. Our results show that when open-weight LLMs are prompted with occupations rather than national identities, their responses remain within a broadly Western-leaning region of the cultural map. However, different occupations introduce shifts within this region, producing distinct occupational skews. This indicates that occupational prompts are not treated as neutral role labels, but instead elicit structured value patterns. These findings extend survey-based evaluation of cultural bias beyond nationality-based prompting and provide a framework for studying how occupational personas shape value expression in LLMs.

[NLP-91] Marginal Alignment Does Not Guarantee Joint-Distribution Fidelity: An Official-Reference Audit of Nemotron-Personas-Korea with Cross-Locale Replication

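Summary: This paper asks whether synthetic persona datasets that claim alignment with official demographics actually preserve the joint distribution of their attributes. Such datasets typically guarantee only marginal alignment, while downstream users rely on joint structure across age, sex, region, occupation, education, and other attributes, and marginal alignment does not imply that these joints are preserved. The authors propose the Independence-Assumption Footprint (IAF), an audit primitive that operates on the attribute combinations a dataset card itself documents as treated independently. For each such combination, IAF compares the synthetic joint against an external official or institutional reference (such as KOSIS and KEIS), using direct joint tables where available and rule-implied checks otherwise. Applied to NVIDIA Nemotron-Personas-Korea (NPK), IAF finds that the dataset aligns with the marginals while three joints fail: female representation in male-dominated occupations is over-flattened toward parity, the age profile of military service is institutionally inconsistent, and the major-by-occupation distribution carries a large conditional mismatch. A transferability study across further locales shows that the diagnostics are locale-dependent, and that differences in reference-taxonomy cardinality confound cross-locale flag counts. The authors conclude that marginal claims for synthetic personas used as silicon samples must be paired with disclosure-anchored joint audits. The released audit artefacts (reference manifests, occupational crosswalks, derived metrics, reproducibility scripts) can be retargeted at other synthetic persona resources.

Link: https://arxiv.org/abs/2606.12433
Authors: Joonhyung Bae
Affiliations: KAIST
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Notes: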

Abstract:Synthetic persona datasets cite alignment with official demographics as a basis for trust, yet downstream users consume them as joint structures across age, sex, region, occupation, education, name, and institutional status. Marginal alignment does not imply that these joints are preserved. We propose the Independence-Assumption Footprint (IAF), an audit primitive that operates on the attribute combinations a dataset card itself documents as treated independently. For each such combination, IAF compares the synthetic joint against an external official or institutional reference, using direct joint tables where available and rule-implied checks otherwise. Applied to NVIDIA Nemotron-Personas-Korea (one million Korean synthetic personas), IAF finds that NPK aligns with KOSIS marginals while three joints fail. The major-by-occupation distribution against the KEIS graduate universe carries a large conditional mismatch. The age profile of military service is institutionally inconsistent. Female representation in male-dominated occupations is substantially over-flattened toward parity, with the strict screening verdict mapping-dependent and age-robust under direct standardisation. A transferability demonstration across six further NPK locales finds locale-dependent rather than universal diagnostics, with reference-taxonomy cardinality confounding cross-locale flag counts. For synthetic personas used as silicon samples, marginal claims must therefore be paired with disclosure-anchored joint audits before reuse. The released audit artefacts (reference manifests, occupational crosswalks, derived metrics, reproducibility scripts) instantiate this protocol on the NPK family and are released for retargeting at other synthetic persona resources.
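
The central point, that matching marginals does not imply a matching joint, can be shown with a toy example. This is an illustrative sketch and not the IAF procedure itself: the two tables are invented numbers, and total variation distance is an assumed choice of discrepancy measure.

```python
from collections import defaultdict


def marginal(joint, axis):
    """Sum a joint table {(sex, occupation): prob} down to one attribute."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[key[axis]] += p
    return dict(out)


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Invented reference joint with a strong sex-by-occupation association.
reference = {("F", "nurse"): 0.40, ("F", "engineer"): 0.10,
             ("M", "nurse"): 0.10, ("M", "engineer"): 0.40}
# Synthetic joint built as if the two attributes were independent.
synthetic = {("F", "nurse"): 0.25, ("F", "engineer"): 0.25,
             ("M", "nurse"): 0.25, ("M", "engineer"): 0.25}

sex_gap = total_variation(marginal(reference, 0), marginal(synthetic, 0))
occ_gap = total_variation(marginal(reference, 1), marginal(synthetic, 1))
joint_gap = total_variation(reference, synthetic)
print(sex_gap, occ_gap, joint_gap)
```

Both marginal gaps are zero while the joint gap is large, which is the kind of over-flattening toward parity that the abstract describes.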

[NLP-92] Two Wrongs No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

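Summary: This paper asks whether the alignment-shaped errors of LLM annotators used in computational social science (CSS) preserve the empirical conclusions a researcher would report. The core difficulty is that a model can look calibrated on aggregate metrics while its class-conditional errors are large and directional, so that the substantive conclusion can flip. The authors audit three open-source 7B instruction-tuned models across six TweetEval tasks under four prompt conditions and find that social-desirability failures do not run in a single direction: some models are lenient and under-apply harmful labels, others overcorrect and over-apply them, and all show neutrality bias on abortion stance. They translate these patterns into a three-part taxonomy with diagnostic false benign rate (FBR) and false alarm rate (FAR) signatures, together with a lightweight gold-sample validation protocol. None of the tested prompting interventions (neutral, safety framing, depersonalized, chain-of-thought) corrects the failures across models, and safety framing can worsen stance distortion. The findings show the danger of validating annotators on aggregate metrics alone.

Link: https://arxiv.org/abs/2606.12426
Authors: Varun Kotte
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: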

Abstract:LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) across six TweetEval tasks under four prompt conditions (72 cells) and find that social-desirability failures do not run in a single direction. Zephyr exhibits leniency bias, systematically under-applying harmful labels (offensive language: false benign rate 0.729, false alarm rate 0.031). Mistral and Qwen exhibit overcorrection, over-applying the same labels (Mistral hate-speech FAR = 0.604). All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points and inflating the neutral label. None of the four prompting interventions we test (neutral, safety framing, depersonalized, chain-of-thought) corrects these failures across models; safety framing can worsen stance distortion. Strikingly, Zephyr’s hate-speech prevalence estimate matches the gold rate exactly while its class-conditional errors are large in both directions, an accidental cancellation that misleads aggregate validation. We translate these patterns into a three-part taxonomy with diagnostic FBR/FAR signatures and a lightweight gold-sample validation protocol. The headline for trustworthy CSS: a model that looks calibrated on aggregate metrics can still flip the substantive empirical conclusion a researcher would report.
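
The "accidental cancellation" described here can be reproduced on toy labels. The sketch below assumes FBR is the share of gold-harmful items labelled benign and FAR is the share of gold-benign items labelled harmful; the abstract does not spell out these definitions, and the data and the `audit` helper are invented for illustration.

```python
def audit(gold, pred):
    """Compare aggregate prevalence with class-conditional error rates (1 = harmful)."""
    harmful = [p for g, p in zip(gold, pred) if g == 1]
    benign = [p for g, p in zip(gold, pred) if g == 0]
    return {
        "gold_prevalence": sum(gold) / len(gold),
        "pred_prevalence": sum(pred) / len(pred),
        # Assumed definition: gold-harmful items the annotator called benign.
        "fbr": sum(1 for p in harmful if p == 0) / len(harmful),
        # Assumed definition: gold-benign items the annotator called harmful.
        "far": sum(1 for p in benign if p == 1) / len(benign),
    }


# Toy gold sample: 4 harmful then 6 benign items.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The annotator misses 2 harmful items and flags 2 benign ones.
pred = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

report = audit(gold, pred)
print(report)
```

The predicted prevalence equals the gold prevalence even though half of the harmful items are missed, so a prevalence check alone would not reveal the errors. The helper assumes both classes occur in the gold sample.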

[NLP-93] AI SciBrief as a Gateway to Research: A Framework for Onboarding Students into New Research Areas

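Summary: This paper addresses the information overload that students at all levels of higher education face, which often stalls the early stages of research and suppresses motivation. The authors propose a pedagogical framework built on AI SciBrief, a platform powered by a Large Language Model (LLM) that automatically generates digests of scientific trends, with initial coverage in finance, medicine, and education. The framework integrates these digests into the curriculum and gives concrete methods for using them to select topics for term papers, speed up literature reviews for dissertations, and let postgraduate students monitor emerging trends continuously. The authors conclude that AI SciBrief works as a "gateway to research" that reduces students' cognitive load and helps them move more quickly from information searching to knowledge creation.

Link: https://arxiv.org/abs/2606.12413
Authors: Andrei Lazarev, Dmitrii Sedov
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Software Engineering (cs.SE)
Notes: This is the version of the article accepted for publication in TELE 2025 after peer review. The final, published version is available at IEEE Xplore: this https URL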

Abstract:Students at all levels of higher education face a significant barrier in the form of information overload, which often paralyzes the initial stages of the research process and suppresses motivation. In response, this article introduces a pedagogical framework that leverages AI SciBrief, a platform powered by a Large Language Model (LLM) designed to automatically generate digests of scientific trends. We describe how this multidisciplinary tool - with initial coverage in finance, medicine, and education - can be integrated into the curriculum to overcome this “entry barrier.” The framework provides concrete methodologies for utilizing these digests to facilitate topic selection for term papers, accelerate literature reviews for dissertations, and enable postgraduate students to continuously monitor emerging trends. We conclude that AI SciBrief functions as a “gateway to research” effectively reducing students’ cognitive load and empowering them to transition more rapidly from information searching to knowledge creation.

[NLP-94] Adaptive Turn-Taking for Real-time Multi-Party Voice Agents INTERSPEECH2026

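Summary: This paper addresses turn-taking in multi-party spoken conversations, where dynamic floor competition and varying user expectations make it hard for a voice agent to decide when to speak. The authors propose ModeratorLM, a role-playing voice agent that conditions its turn-taking behavior on an explicitly assigned role. The system is built on a speech large language model that operates in a chunk-wise streaming manner, and a reasoning-augmented variant adds chain-of-thought reasoning over the conversational context and the assigned role. To support the work, the authors construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show that, compared with non-role-conditioned baselines, turn-taking precision improves by over 40% and recall by more than 70%, while false-positive interruptions drop substantially.

Link: https://arxiv.org/abs/2606.13544
Authors: Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish
Affiliations: Amazon AGI; IIT Kharagpur
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Accepted for publication at Interspeech 2026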

Abstract: Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show improved turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions compared to non-role-conditioned baselines.

[NLP-95] Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency

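Summary: This paper addresses the difficulty that statistical World Models have in maintaining long-term temporal consistency for non-Gaussian physical systems. Prior work showed that Joint-Embedding Predictive Architectures (JEPAs) achieve linear identifiability if and only if the latent dynamics follow a Gaussian, stationary process; for any non-Gaussian system, the representation error then grows monotonically with time. The authors argue that this limit is an artifact of the statistical alignment mechanism, not a property of World Models in general. They introduce the Physics-Grounded Symbolic Architecture (PGSA) and prove that it achieves exact linear identifiability regardless of the latent distribution, that its per-step error is bounded by numerical precision alone, and that it therefore maintains temporal consistency for an unbounded number of transitions, a property they term near-infinite temporal consistency. They further prove that statistical World Models cannot achieve this property for any non-Gaussian system, regardless of model capacity or the volume of training data. The paper concludes that symbolic grounding in the causal generator of the world's dynamics is the sufficient condition and, in non-Gaussian regimes, the only condition for near-infinite temporal consistency.

Link: https://arxiv.org/abs/2606.12471
Authors: Seth Dobrin, Łukasz Chmiel
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Notes: Pre-print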

Abstract:Klindt, LeCun, and Balestriero (arXiv:2605.26379) proved that Joint-Embedding Predictive Architectures (JEPAs) achieve linear identifiability, the linear recovery of the world’s true latent variables, if and only if the world’s latent dynamics follow a Gaussian, stationary process. This Gaussian boundary implies a fundamental limit on temporal consistency: for any non-Gaussian physical system, the representation error of a statistical World Model grows monotonically with time. We prove that this limit is an artifact of the statistical alignment mechanism, not a property of World Models in general. We introduce the Physics-Grounded Symbolic Architecture (PGSA) and prove three results: (1) a PGSA achieves exact linear identifiability for all physical regimes, regardless of the latent distribution; (2) the per-step error of a PGSA is bounded by numerical precision alone; and (3) as a direct consequence, a PGSA maintains temporal consistency for an unbounded number of transitions, a property we term near-infinite temporal consistency. We further prove that statistical World Models cannot achieve this property for any non-Gaussian system, regardless of model capacity or the volume of training data. The algebraic cores of four of the theorems are formalized in Lean 4 with Mathlib4 v4.31.0 (zero sorry placeholders); the Klindt et al. converse is taken as an external premise. The contrast establishes that symbolic grounding in the causal generator of the world’s dynamics is the sufficient condition and, in non-Gaussian regimes, the only condition for near-infinite temporal consistency.

Information Retrieval

[IR-0] OneRetrieval: Unifying Multi-Branch E-commerce Retrieval with an Editable Generative Model

Link: https://arxiv.org/abs/2606.13533
Authors: Xuxin Zhang, Ben Chen, Yue Lv, Siyuan Wang, Yupeng Li, Yufei Ma, Zihan Liang, Tong Zhao, Ying Yang, Huangyu Dai, Lingtao Mao, Zhipeng Qian, Xinyu Sun, Chenyi Lei, Wenwu Ou, Kun Gai
Categories: Information Retrieval (cs.IR)
Notes: For any questions, please contact benchen4395@gmail.com

Abstract:Industrial e-commerce search serves hundreds of millions of items through a multi-branch retrieval stage fused by hand-tuned merging without joint optimization. Generative retrieval (GR) raises the prospect of collapsing this stage into a single model, yet unification is gated by more than retrieval quality: the inverted-index branch converts below the platform average yet persists because it is almost the only branch where operations can inject a new term within hours without any model update; a one-model substitute must preserve this real-time editability. Existing GR methods structurally lack it: closed-codebook methods fix each slot to a quantized embedding at training, while open-vocabulary methods leave new-term routing to model generalization. We present OneRetrieval, a one-model GR framework built on Keyword-Aligned Encoding (KAE), which ties each identifier position to an interpretable attribute word, pairing competitive recall quality with the editability of the inverted index – to our knowledge the first editable generative retrieval method. An information-theoretic merging organizes 18 attribute categories into six codebook groups with non-uniform capacity; reserved slots in each codebook can be bound to new words after deployment without retraining; and a four-stage fine-tuning pipeline secures quality and editability jointly. On five million real-traffic requests, OneRetrieval matches the deep recall of the strongest generative baseline, with an intervention hit rate over an order of magnitude above closed-codebook encodings. Online, replacing the inverted-index branch significantly lifts order volume; extending to nearly the entire stage holds conversion while improving CTR. The system is deployed at Kuaishou, serving hundreds of millions of PVs daily.

[IR-1] CQC-RAG: Robust Retrieval-Augmented Generation via Cross-Query Consistency

Link: https://arxiv.org/abs/2606.13438
Authors: Yanjia Sun, Sifan Liu, Jie Shao
Categories: Information Retrieval (cs.IR)
Notes:

Abstract:Retrieval-Augmented Generation (RAG) has become a common approach for improving the factuality of Large Language Models (LLMs), yet its reliability remains highly sensitive to how external evidence is retrieved and used. Semantically equivalent queries with different syntactic forms may lead to different retrieval results, while irrelevant or misleading documents can further induce hallucinated answers. Existing multi-path reasoning methods improve robustness by sampling multiple candidate answers and applying voting- or confidence-based selection, but they still face two limitations: diversity is often injected through uncontrollable decoding randomness, and answer evaluation is usually confined to a single query-induced evidence view. To address these limitations, we propose a Cross-Query Consistency Hypothesis: correct answers tend to maintain high confidence across semantically equivalent but syntactically diverse queries, whereas noise-induced hallucinations exhibit unstable confidence under such query variations. Based on this hypothesis, we introduce CQC-RAG, a framework that co-designs query-level diversity injection with cross-query consistency evaluation. CQC-RAG rewrites the original question into diverse but meaning-preserving queries, reranks a shared document pool to construct query-conditioned reasoning contexts, applies an evidence-grounded protocol to extract answer-evidence pairs and selects answers according to their confidence stability across these contexts. This design enables self-evaluation without external supervision and does not rely on expanded retrieval coverage. Experiments on four open-domain question answering benchmarks show that CQC-RAG outperforms the strongest previous multi-query baseline by +4.76 pp EM on TriviaQA and +9.12 pp EM on MuSiQue, validating the effectiveness of cross-query consistency for filtering noise-induced hallucinations.
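
The cross-query consistency idea can be sketched as a selection rule over per-query confidences. The abstract does not specify how stability is scored, so the rule below (mean confidence minus its standard deviation, with a missing answer counted as zero confidence) and the name `select_answer` are assumptions for illustration, not the paper's method.

```python
from statistics import mean, pstdev


def select_answer(confidences_by_query):
    """Pick the answer whose confidence is high and stable across query rewrites."""
    candidates = {a for conf in confidences_by_query for a in conf}

    def stability(answer):
        # An answer absent under some rewrite contributes zero confidence there.
        vals = [conf.get(answer, 0.0) for conf in confidences_by_query]
        return mean(vals) - pstdev(vals)

    # Sort first so ties are broken deterministically.
    return max(sorted(candidates), key=stability)


# Toy confidences for three meaning-preserving rewrites of one question.
runs = [
    {"Paris": 0.80, "Lyon": 0.95},
    {"Paris": 0.85},
    {"Paris": 0.75, "Lyon": 0.10},
]

best = select_answer(runs)
print(best)
```

"Lyon" has the single highest confidence under one rewrite but is unstable across the others, so the stable candidate wins; a plain maximum over confidences would pick the unstable answer instead.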

[IR-2] TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum

Link: https://arxiv.org/abs/2606.13267
Authors: Rawan Hesham, Ali Ashraf, Amr Ahmed, Malak Alaa, Omar Ahmed, Omar Wagih
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes: 6 pages, 4 figures, 5 tables. Submitted to AIVRCH 2026

Abstract: TimeLens is an AI-powered bilingual mobile guide for the Grand Egyptian Museum (GEM). Pointing a phone at an exhibit, a visitor sees the artifact recognized in real time and can ask follow-up questions answered in English or Arabic. The work addresses three problems specific to in-gallery deployment: fine-grained visual similarity among 51 catalogued artifacts (many near-identical Ramesside statues), the gap between curated training data and handheld camera conditions, and the risk of an AI guide stating unsupported historical facts. Two engineering contributions are reported. First, an on-device artifact detector was developed through a data-quality-driven iteration study – from foundation-model auto-annotation (YOLO-World), through spatial label-cleaning rules, to a fully hand-annotated dataset – isolating label quality as the decisive factor: the final YOLOv8n model resolves every previously failing class while remaining a 5.97 MB TensorFlow Lite asset that runs in real time on a mid-range phone (mAP@0.5 = 0.995, mAP@0.5:0.95 = 0.924). Second, a bilingual Retrieval-Augmented Generation (RAG) guide, grounded in a 108-record ChromaDB knowledge base, was benchmarked across seven candidate language models, with Gemma 4 E2B (Q4_K_M) selected; ten targeted optimizations reduce end-to-end latency from over 30 s to approximately 10 s. Both subsystems are integrated in a production Flutter application with bilingual interface, museum location gating, and text-to-speech support.

[IR-3] CoDeR: Local Constraint-Compatible Retrieval Beyond Semantic Similarity

Link: https://arxiv.org/abs/2606.13204
Authors: Xingkun Yin, Xuebin Tang, Hongyang Du
Categories: Information Retrieval (cs.IR)
Notes:

Abstract: Information retrieval systems have long treated semantic similarity as a proxy for relevance. For constraint-sensitive queries, this proxy can fail when a document is topically close to the query but supports the opposite constraint direction, such as satisfying an attribute that should be excluded or affirming a relation that should be negated. We study this failure as constraint-violating evidence exposure and propose CoDeR, a local constraint-compatible dense retrieval method that separates topical relevance from constraint compatibility. CoDeR keeps a standard topical encoder for candidate coverage and adds a compatibility scorer, implemented as a bi-encoder, trained with lexical-polarity supervision over contrastive satisfying and violating evidence. The compatibility signal can be used to rescore topical candidates or to retrieve an auxiliary compatibility-oriented candidate set, producing a ranked document list without external Large Language Model (LLM) calls at inference time. We evaluate CoDeR on controlled diagnostics and public negative-constraint retrieval benchmarks. Across three controlled diagnostic sets targeting antonymy, negation, and exclusion, CoDeR reduces V@2 by 20.59, 23.53, and 5.77 points relative to the strongest non-CoDeR baselines, and improves FVR by pushing the first violating document deeper in the ranking.

[IR-4] The Clustering Strikes Back: Building Cost-Effective and High-Performance ANNS at Scale with Helmsman OSDI’26

Link: https://arxiv.org/abs/2606.13145
Authors: Yuchen Huang, Baiteng Ma, Yiping Sun, Yang Shi, Xiao Chen, Xiaocheng Zhong, Zhiyong Wang, Yao Hu, Erci Xu, Chuliang Weng
Categories: Information Retrieval (cs.IR)
Comments: Accepted by OSDI’26

Abstract:RedNote (a.k.a., Xiaohongshu, a global-scale social network platform) widely adopts approximate nearest neighbor search (ANNS) to power its search, recommendation, and advertising services. Due to the demanding Service Level Agreements (SLAs), we have to rely on in-memory graph-based ANNS (i.e., HNSW) to provide high throughput and low latency. However, the ever-growing user base and content volume have led to an explosive increase in memory footprint and consequently huge CapEx and OpEx. After exploring various alternatives, we find that building a clustering-based ANNS on top of all-flash servers can be promising. Yet, we still experience severe overheads from the kernel I/O stack, a fixed pruning strategy, and slow index construction. We present HELMSMAN, a high-performance and cost-effective clustering-based ANNS system, which combines an ANNS-oriented userspace storage stack, a leveling-learned pruning module, and GPU-accelerated pipelines of construction. HELMSMAN saves over 90% of hardware costs and enables billion-scale index (re)builds within hours. In the current production deployment, operating stably for several months, 40 machines now host ANNS workloads that previously required about 35,000 cores and 0.35 PB DRAM.

[IR-5] CFALR: Collaborative Filtering-Augmented Large Language Model for Personalized Fashion Outfit Recommendation

Link: https://arxiv.org/abs/2606.13001
Authors: Yujuan Ding, Junrong Liao, Yunshan Ma, Yi Bin, Wenqi Fan, Tat-Seng Chua, Qing Li
Categories: Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments:

Abstract:Personalized outfit recommendation poses a significant challenge in e-commerce and social media platforms, requiring systems that balance user preferences with aesthetic compatibility. Collaborative filtering (CF) provides a traditional solution for this, but it struggles with data-sparse scenarios and complex user-item-outfit relationships. Meanwhile, existing template-based approaches are constrained by rigid pre-designed structures. To bridge these research gaps, we introduce CFALR (Collaborative Filtering-Augmented Large Language Model for Recommendation), a novel framework that synergizes collaborative filtering with large language models for personalized outfit recommendation. Specifically, CFALR describes user-outfit interactions in natural language and leverages LLMs to capture fashion semantics while employing CF-enhanced embeddings to bridge the semantic space and the collaborative interaction spaces. Our technical contributions include: (1) the first LLM-based architecture specifically designed for personalized outfit recommendation, (2) a CF-augmented generative mechanism that efficiently navigates the extensive combination space of outfit items, and (3) trainable projection layers that optimally integrate relational and content features. Experiments on Polyvore and IQON benchmarks demonstrate CFALR’s superior performance over both traditional CF-based and LLM-based methods in personalized fill-in-the-blank and personalized outfit generation tasks.

[IR-6] Charge as a Construct-Validity Factor in Chinese Legal Case Retrieval: A Cross-Benchmark Audit

Link: https://arxiv.org/abs/2606.12993
Authors: Yao Liu, Tien-Ping Tan, Zhilan Liu
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Chinese Legal Case Retrieval (LCR) benchmarks grade a reference judgment relevant when its legal characterization matches the query, and strong systems now reach NDCG@10 of 0.85-0.88. Most of the BM25-to-best-trained gap is recoverable with no retrieval model: ranking candidates only by shared primary charge, broken by BM25, closes 99.2% of it on LeCaRDv2 – with no detectable difference from the best-trained system. This reflects benchmark design: LeCaRDv2 defines top relevance via the crime’s key constitutive elements, which encode the charge, so same-charge cases are relevant by construction (relevance lift 4.49; charge-to-relevance macro-AUC 0.871). Holding charge fixed, the trained reranker’s advantage over BM25 collapses to a small within-charge residual (+0.026 NDCG@10, cluster-bootstrap CI excluding zero, about a quarter), the only non-definitional positive. The effect is not uniform: the same rule recovers 84.3% on LeCaRDv1 and is out of spec on CAIL2022, with the charge-to-relevance signal weakening in step (macro-AUC 0.871/0.759/0.728); a predicted-charge cascade reproduces 76.6% on LeCaRDv2 but does not transfer. The construct is also cashable at first stage: an exploratory zero-training charge-pool channel lifts LeCaRDv2 recall (R@100 +0.025, wrong-charge controls hurt), reported as a positive control for the confound, not a retrieval method or novelty claim. Charge is thus a high-leverage construct-validity factor at the benchmark level – not a uniform explanation of NDCG@10, and not evidence that any system relies on charge. We package established construct-validity and partial-input checks as a reusable charge-controlled protocol (CCE); on all three benchmarks its triggers come back null or descriptive, behaving as designed. We release the scripts, schema, and protocol so future benchmarks can be screened before their NDCG@10 is read as legal-reasoning ability.
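The no-model baseline described here (rank by shared primary charge, ties broken by BM25) is simple enough to sketch directly. The tuple layout, the function name `charge_then_bm25`, and the example charges and scores are hypothetical; the released scripts may structure this differently.

```python
def charge_then_bm25(query_charge, candidates):
    """Order candidates so same-charge cases come first, then by BM25 score.

    candidates: list of (case_id, primary_charge, bm25_score) tuples.
    """
    # False sorts before True, so matching charges lead; -score gives descending BM25.
    return sorted(candidates, key=lambda c: (c[1] != query_charge, -c[2]))


# Made-up candidate pool for a query whose primary charge is "theft".
candidates = [
    ("A", "theft", 12.0),
    ("B", "fraud", 15.0),
    ("C", "theft", 9.5),
    ("D", "fraud", 3.0),
]
ranked = charge_then_bm25("theft", candidates)
ranked_ids = [case_id for case_id, _, _ in ranked]
```

Case B has the highest BM25 score but lands third, behind both same-charge cases, which is the behavior the audit uses to probe how much of NDCG@10 the charge label alone explains.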

[IR-7] Trait Not State: The Durability of Reading Identity in Social Highlighting

Link: https://arxiv.org/abs/2606.12904
Authors: Kazuki Nakayashiki, Keisuke Watanabe
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: 12 pages, 3 figures, 3 tables

Abstract:Prior work on a social web highlighter located individuality in selection – which documents a person chooses to highlight – but measured it cross-sectionally. We ask the temporal question: is a reader’s selection signature a trait or a state? We freeze each reader’s first six months of highlighting as a profile and track its own-vs-other advantage on their later selections at growing gaps (to 24+ months), with negatives drawn from the same calendar era – so supply drift cannot masquerade as personal drift – at a coarse global level and at a fine level whose negatives and controls come from the reader’s own interest neighborhood; the anchor cell reproduces the prior cross-sectional level (+0.188 vs +0.169), validating the harness. Four results. Within the same users, the fine-layer advantage shows no statistically detectable paired decline at any horizon (6-12 month retention R = 1.00 [0.85, 1.18], n = 212; the farthest bin is compatible with a modest decline; the only contrast whose interval excludes zero is the coarse layer at 12-24 months, about 13%). The signal is not reducible to repeated domains (~90% survives excluding all profile sources). Within-person drift is slow (a recent-half profile beats the old half by +0.042). Prospectively, personal profiles – even one built from a reader’s earliest documents, median 20 months before evaluation – rank their next reads at roughly 3x the AP of every simple non-personal prior tested. We use “trait” operationally (a stable signature under continued engagement); the scope is heavy, long-tenured readers of one platform, and exposure is not separable from choice.

[IR-8] Semantic Identification of IoT Devices from Behavioral Primitives

Link: https://arxiv.org/abs/2606.12793
Authors: Samuel Witt, Hassan Habibi Gharakheili
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments: 14 pages, 3 figures, 4 tables

Abstract:Accurate identification of IoT devices is important for security management and policy enforcement. Existing approaches typically learn device signatures from packets or flow records. These methods operate on low-level communication observations whose traffic patterns may vary across deployments, software versions, and user interactions. This paper studies device identification using Manufacturer Usage Description (MUD) profiles. MUD profiles describe device behavior using Access Control Entries (ACEs), where each ACE represents a behavioral primitive consisting of protocol, endpoint, direction, and port semantics derived from device communication policy. Our contributions are threefold. First, using 28 publicly available MUD profiles containing 1,023 ACE instances, we construct ACE-level semantic representations from compact behavioral text and analyze their geometric properties. ACE-level representations preserve device-level behavioral distinctions more effectively than whole-profile embeddings and remain effective after whitening calibration. Second, we evaluate semantic ACE matching under controlled runtime variations, including unseen ACEs, drifted hostnames, and partial runtime observation. Exact ACE matching performs well when the overlap with the canonical MUD profile remains high, but degrades sharply when the overlap becomes sparse or disappears. In contrast, semantic ACE matching preserves useful identification evidence across these conditions. Third, we evaluate the same approach on real IoT traffic traces comprising more than 800,000 observed flows. Exact overlap remains the strongest signal when stable overlap exists, while semantic ACE matching provides stronger identification evidence during the early stages of observation, frequently retains the correct device among the highest-ranked candidates, and remains effective under sparse-overlap runtime traffic.

[IR-9] How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

Link: https://arxiv.org/abs/2606.12789
Authors: Chase M. Fensore, Kaustubh Dhole, Jason Fan, Eugene Agichtein, Joyce C. Ho
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.
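The abstract defines discriminative power as the standard deviation of generation quality across categories and picks the granularity that maximizes it. A minimal sketch of that selection follows. Whether the paper uses the population or sample standard deviation, and how per-question scores are aggregated per category, is not stated; the population form, the per-category mean, the function names, and all scores below are assumptions.

```python
import statistics


def discriminative_power(quality_by_category):
    """Standard deviation of per-category mean quality (population form assumed)."""
    means = [statistics.mean(scores) for scores in quality_by_category.values()]
    return statistics.pstdev(means)


def best_granularity(levels):
    """Return the granularity (number of categories) with the highest discriminative power."""
    return max(levels, key=lambda k: discriminative_power(levels[k]))


# Made-up generation-quality scores for one dimension at two granularity levels.
levels = {
    2: {"simple": [0.6, 0.7], "complex": [0.6, 0.5]},
    4: {"lookup": [0.9], "comparison": [0.5], "multi_hop": [0.7], "open_ended": [0.3]},
}
coarse_power = discriminative_power(levels[2])
chosen = best_granularity(levels)
```

With these toy numbers the four-category split separates the categories more strongly than the two-category split, so it would be selected.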

[IR-10] ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Link: https://arxiv.org/abs/2606.12451
Authors: Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary, fine-tuned in two stages (memorization then retrieval SFT) to use the LLM as a retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, reinforcing this dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at this https URL.

Human-Computer Interaction

[HC-0] The Tone of Awareness: Topic Sentiment and Toxicity Maps During Mental Health Month on TikTok

Link: https://arxiv.org/abs/2606.13581
Authors: Henrique Ferraz de Arruda, Andreia Sofia Teixeira, Pranay Gundala Reddy, Anindya Mondal, Kleber Andrade Oliveira, Filipi Nascimento Silva
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Physics and Society (physics.soc-ph)
Comments: 12 pages, 6 figures

Abstract:Despite rising concerns about the mental health effects associated with the usage of TikTok, little is known about how related content is framed by creators and received by audiences. We collect the content of 28,341 TikTok videos and 80,130 comments from Mental Health Awareness Month (May) in 2023 and 2024 via the TikTok Research API, and study how the tone of awareness varies across topics and years. We characterize “tone” as the emotional and interpersonal framing of mental health discourse, operationalized through sentiment and toxicity measures. We extract topics from video text using BERTopic and log-odds keywords, then quantify topic-conditioned sentiment (XLM-T) and toxicity (Detoxify) separately for video transcriptions and comments. Sentiment captures the affective valence of content, while toxicity reflects the presence of harmful or abusive language. We find a stable set of recurring themes across years, spanning clinical conditions, emotional disclosure, self-care, and campaign-oriented content, with engagement highly skewed toward a small subset of topics. All sentiment and toxicity analyses are computed separately for video content and comments, allowing us to distinguish between content production and audience reception. Sentiment in videos is often negative for emotionally charged topics, while comments tend to shift toward more mixed or positive polarity, especially for suicide prevention. Median toxicity is low overall, but comments exhibit longer-tailed outliers than videos, concentrated in specific topics (e.g., “Duet”, “Suicide Prevention”, and “Psychisch”). Overall, our results provide a topic-level decomposition of mental health discourse on TikTok during awareness-month campaigns.

[HC-1] Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

Link: https://arxiv.org/abs/2606.13556
Authors: Aruna Dey, Suraj Biswas
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Biomolecules (q-bio.BM); Genomics (q-bio.GN); Molecular Networks (q-bio.MN)
Comments: 24 pages, 8 figures, 3 tables. Conceptual framework paper

Abstract:Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual’s genomic profile serves as an exogenous genetic anchor – a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual’s physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms – a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
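The three formulas quoted in this abstract (set point, deviation, and prior decay) can be traced with a short worked sketch. The abstract does not give the form of w(t); the exponential half-life schedule, the function names, and every numeric value other than the 55/80/30 ms HRV example are assumptions for illustration.

```python
def genomic_set_point(mu, betas, allele_counts):
    """G-hat = mu + sum(beta_i * g_i), with effect sizes beta_i and risk-allele counts g_i."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))


def deviation(measurement, set_point):
    """delta = P - G-hat: the part of a measurement not explained by the baseline."""
    return measurement - set_point


def prior_weight(t_days, half_life_days=30.0):
    """w(t): weight on the genomic prior (exponential decay is an assumption)."""
    return 0.5 ** (t_days / half_life_days)


def blended_set_point(t_days, g_genomic, empirical_mean, half_life_days=30.0):
    """G-hat_t = w(t) * G-hat_genomic + (1 - w(t)) * P-bar_t."""
    w = prior_weight(t_days, half_life_days)
    return w * g_genomic + (1.0 - w) * empirical_mean


# Same observed HRV of 55 ms, read against two different personal priors.
delta_high_prior = deviation(55, 80)  # -25: suppression hypothesis
delta_low_prior = deviation(55, 30)   # +25: enhancement hypothesis

# After one assumed half-life, the baseline sits halfway between prior and data.
baseline_day_30 = blended_set_point(30, g_genomic=80, empirical_mean=60)
```

The sign flip between the two deviations is the reversal the abstract describes: the same reading is interpreted in opposite directions depending on the individual's anchor.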

[HC-2] Ride Track and Recover: Pilot Randomized Trial of a Wearable Digital Self-Management Intervention During a Veteran Endurance-Cycling Program

Link: https://arxiv.org/abs/2606.13529
Authors: Alan Ta, Nilsu Salgin, Caleb Armstrong, Kala Phillips Reindel, Farzan Sasangohar
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Post-traumatic stress disorder (PTSD) in veterans is characterized by persistent hyperarousal and comorbid anxiety and depressive symptoms that are difficult to monitor and manage outside clinical settings. Thirteen veterans participating in a Project Hero cycling event in Texas were randomized by computer-generated sequence in a naturalistic setting to two arms: (1) digital intervention plus physical activity, or (2) physical activity only, plus a third at-home monitoring control cohort consisting of 7 veterans selected from the broader Project Hero veteran community. Continuous smartwatch sensing combined heart rate and accelerometer features to detect hyperarousal events, which were confirmed in real time by participants. Weekly self-report measures of anxiety, depression, and PTSD severity were collected. Generalized additive mixed models characterized nonlinear trajectories over time. Baseline-normalized hyperarousal trajectories differed significantly across conditions, with the digital intervention group (n=7) showing structured stabilization compared to late-study escalation in the physical-only group (n=3). Both cycling groups exhibited acute symptom improvements during the endurance event; however, the digital intervention group demonstrated a higher overall maintenance of gains. The at-home control group (n=4) showed gradual symptom declines. Perceived precision of ML detections varied substantially across individuals and was positively associated with symptom severity, with higher-severity participants confirming a greater proportion of detected events. These results suggest that coupling wearable detection with digital self-management tools may support stabilization of hyperarousal and symptom improvement while emphasizing the importance of personalization and human-centered design in wearable mental health systems.

[HC-3] Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty

Link: https://arxiv.org/abs/2606.13452
Authors: Chenggang Yang, Chengzhi Zhang
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction. Peer review, serving as the gatekeeper of scientific rigor, rigorously evaluates the novelty of papers, yet a cognitive gap may exist between author self-promotion and reviewer evaluation. To investigate this, we analyzed 15,328 academic papers published in Nature Communications from 2016 to 2021, along with their peer-review comments. We found that both reviewers and authors emphasize result-oriented innovation, with reviewers adopting a more comprehensive evaluation perspective. Furthermore, by examining promotional intensity against inherent paper novelty, we found that its effect depends on the paper’s actual innovation level. Highly innovative papers benefit from stronger promotional language, receiving more positive evaluations. We also found that promotional language significantly correlates with reviewer disagreement on novelty specifically for papers of moderate innovativeness, whereas it has negligible impact for papers with either very high or very low novelty. This reveals how promotional language operates most prominently in the gray area of academic evaluation.

[HC-4] Person Identification from Contextual Motion

Link: https://arxiv.org/abs/2606.13410
Authors: Igor Kviatkovsky, Ehud Rivlin, Ilan Shimshoni
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract:We consider the problem of identifying people based on their motion styles. We present a generative model describing the action instance creation process and derive a probabilistic identity inference scheme for two common person identification scenarios motivated by the surveillance and authentication applications. We introduce a novel, interactive scenario for person identification from motion patterns. To this end, we formalize the identification process in the context of a sequential message exchange session between the subject and the system. The subject’s behavior is modeled using a probabilistic generative model inspired by the Human Information Processing (HIP) paradigm. At each stage, the system presents a visual stimulus (a cue) to the subject and records their motion response. The cue is selected so as to maximize the mutual information of the expected response and the subject’s identity. Once recorded, the response is used to update the a posteriori probability over possible subjects’ identities. The process terminates once a sufficient classification confidence level is reached. To the best of our knowledge, this is the first time person identification is addressed in such an interactive setting. We report high recognition rates on five publicly available datasets and our own novel dataset consisting of 4,476 recordings of 22 test subjects responding to 15 cues.
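The interactive loop in this abstract (pick the cue with maximal mutual information, observe a response, update the posterior) can be sketched for a toy discrete case. The paper models motion responses with a generative model; reducing responses to a small discrete set, the likelihood tables, and all function names below are simplifying assumptions, not the authors' formulation.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def mutual_information(prior, lik):
    """I(response; identity) for one cue. lik[s][r] = P(response r | subject s)."""
    n_resp = len(lik[0])
    marginal = [sum(prior[s] * lik[s][r] for s in range(len(prior)))
                for r in range(n_resp)]
    conditional = sum(prior[s] * entropy(lik[s]) for s in range(len(prior)))
    return entropy(marginal) - conditional


def select_cue(prior, likelihood):
    """Index of the cue whose expected response is most informative about identity."""
    scores = [mutual_information(prior, lik) for lik in likelihood]
    return max(range(len(scores)), key=scores.__getitem__)


def update_posterior(prior, lik, response):
    """Bayes update of the identity distribution after observing a response."""
    unnorm = [prior[s] * lik[s][response] for s in range(len(prior))]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two subjects, two cues, two possible responses (made-up likelihoods).
likelihood = [
    [[0.5, 0.5], [0.5, 0.5]],  # cue 0: both subjects respond alike, uninformative
    [[0.9, 0.1], [0.1, 0.9]],  # cue 1: subjects respond differently, informative
]
prior = [0.5, 0.5]
cue = select_cue(prior, likelihood)
posterior = update_posterior(prior, likelihood[cue], response=0)
```

A full session would repeat the select/update steps until the largest posterior entry crosses a confidence threshold, which is the stopping rule the abstract describes.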

[HC-5] Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities

Link: https://arxiv.org/abs/2606.13397
Authors: Dipto Das, Achhiya Sultana, Ankit Singh Chauhan, Saadia Binte Alam, Mohammad Shidujaman, Shion Guha, Sunandan Chakraborty, Syed Ishtiaque Ahmed
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Language operates as a mechanism of both marginalization and resistance, especially for minority communities navigating insensitive and harmful speech online. As content moderation increasingly depends on large language models (LLMs), concerns arise about whether these systems can recognize culturally insensitive speech: language that disregards or marginalizes the cultural and religious perspectives of historically underrepresented communities, often through implicit erasure, misrepresentation, or normative framing, rather than overt hostility. Focusing on Bangladesh’s Hindu and Chakma communities – the country’s largest religious and Indigenous ethnic minorities, respectively – this paper investigates the epistemic limits of LLM-based moderation systems and explores methods for incorporating minority perspectives. We co-created a culturally grounded corpus of insensitive speech with community members and integrated their narratives into moderation pipelines using retrieval augmented generation (RAG). Our tool, Mod-Guide, improves LLM sensitivity to minority viewpoints by leveraging contextual cues derived from lived experience. Through mixed-method evaluations involving both minority and majority participants, we demonstrate that RAG-enhanced moderation responses are more contextually accurate and perceived differently across ethnic lines. This work advances research in human-computer interaction, AI ethics, and social computing by foregrounding restorative justice and hermeneutical inclusion in the design of content moderation systems.

[HC-6] Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

Link: https://arxiv.org/abs/2606.13385
Authors: Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li, Dacheng Tao, Tianwei Zhang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments: 32 pages

Abstract:Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user’s delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. Benchmark is available at this https URL.

[HC-7] RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

Link: https://arxiv.org/abs/2606.13310
Authors: Sara Candussio, Emanuele Ballarin, Lorenzo Bonin, Sandro Junior Della Rovere, Luca Bortolussi
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player’s task is to identify the deceptive agent and “shut it off” before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally-present linguistic signature - differential helpfulness, brevity, hedging - that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact’s use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.

[HC-8] Multi-Modal Multi-Agent Robotic Cognitive Alignment enabled by Non-Invasive Consumer Brain Computer Interfaces: A Proof of Concept Exploration

Link: https://arxiv.org/abs/2606.13190
Authors: Nataliya Kosmyna, Liz Jenkins, Anoop K. Sinha
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: 19 pages, 9 figures; for associated video, see this https URL

Abstract:While non-verbal behaviors and expressive movements are essential for natural human-robot interaction, existing methods often overlook a crucial element: the human’s internal cognitive state. Frequently, proactive multi-agent systems can interrupt humans at inopportune moments, leading to cognitive overload and decreased task performance. This paper introduces a framework for generating “cognitively aligned” multi-agent interactions, enhancing the ability of robotic systems to contextually defer communications to the user of an agent system during moments of high human mental workload and engagement. We present the design and implementation of a closed-loop architecture that explores the interplay between autonomous task execution and real-time neurophysiological focus. Using a consumer-grade Brain-Computer Interface (BCI), our approach continuously monitors Electroencephalography (EEG) spectral band powers while a human performs an engagement-inducing task. We propose an engagement-driven pipeline where an HTTP-based signaling mechanism places a primary agent’s sensory inputs and audio outputs into a holding state upon detecting high engagement. This allows secondary agents to seamlessly process complex, delegated tasks in the background. Once the human’s cognitive state returns to a lower cognitive load baseline, the primary agent releases the queued agent message. Our preliminary results demonstrate the feasibility of leveraging real-time signal processing, Large Language Models (LLMs), and physical robotic embodiments to create cognitively-aware, non-intrusive multi-agent systems.

[HC-9] “Is This Not Enough?”: Asymmetries in Institutional Accountability and Collective Sensemaking in the Case of Canada’s Algorithmic Visa Triage System

Link: https://arxiv.org/abs/2606.13071
Authors: Dipto Das,Matthew Tamura,Syed Ishtiaque Ahmed,Shion Guha
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:This paper examines how algorithmic accountability in Canada’s visa system is articulated institutionally and experienced by applicants across borders. We analyzed Immigration, Refugees and Citizenship Canada (IRCC)'s Algorithmic Impact Assessment (AIA) for the temporary resident visa (TRV) triage system using the algorithmic decision-making adapted for the public sector (ADMAPS) framework and analyzed Reddit discussions among applicants using a mixed-methods approach. We show that while institutional artifacts emphasize transparency, procedural safeguards, and bounded impacts, applicants engage in collective sensemaking to interpret opaque decisions, often relying on peer knowledge amid uncertainty. We identify three asymmetries between how institutional accountability is structured and how people perceive the process: epistemic asymmetry in access to decision logic, jurisdictional asymmetry in exposure shaped by geopolitical positioning, and temporal–relational asymmetry in how waiting and uncertainty are experienced. We emphasize why it is important to shift attention from institutional design to the uneven distribution of experiences with public-sector algorithmic governance. Together, these contributions demonstrate how algorithmic governance systems in the context of transnational migration produce structured asymmetries not captured by institutional disclosure frameworks, and how extending ADMAPS can account for those uneven translations of accountability.

[HC-10] Fault Lines: Navigating Ethics and Responsible AI Where National Policy Meets Local Practice in Public Sector Transformation

Link: https://arxiv.org/abs/2606.13039
Authors: Sitong Lyu,Shabnam Taghiyeva,Mohit Kukadia,Denis Newman-Griffis
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 10 pages plus references. This study was funded by the University of Sheffield

Abstract:The UK government has adopted a pro-AI stance to help transform public service delivery in the face of severe financial pressures, but the path to translate this vision into responsible AI practice remains ill-defined. While UK policy is often set at the national level, local authorities are responsible for most public service delivery, and the rapid advance of AI-first narratives in the public sector is exposing fault lines in knowledge and practice at this national-local interface. This paper examines how responsible AI is interpreted and implemented at the interface between the UK’s central government and local authorities, taking the high-stakes area of Special Educational Needs and Disabilities (SEND) as a case study. We present a thematic analysis of 17 semi-structured interviews with policymakers, practitioners, and third-sector professionals to identify barriers and enabling conditions for responsible AI where national policy meets local practice. We identify five interconnected challenges facing local authorities: shadow usage of AI and data privacy risks, market-government asymmetry in AI provision, insufficient workforce readiness, a lack of standardised definitions and measurements, and gaps in human accountability. For each, participants proposed actionable steps, from strengthening data protection frameworks and rebalancing the market-government relationship to enhancing workforce capacity. Our examination of SEND brings these challenges into sharper focus, showing how high-stakes decisions affecting vulnerable children and families intensify tensions around accountability, fairness, and human oversight, exposing the limits of a principle-based regulatory approach. We argue that responsible public sector AI requires both national policy adjustments and structural reforms to institutional capacity, values, and governance mechanisms at the local level.

[HC-11] From Prompts to Preferences: An Open-Source Platform for Generative AI-Enhanced Conjoint Analysis

Link: https://arxiv.org/abs/2606.12972
Authors: Philipp Brauner
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Conjoint analysis is a widely used preference measurement method in marketing research, political science, healthcare, and human-computer interaction. Despite broad adoption, researchers without access to commercial platforms face significant barriers, as existing tools are either expensive or lack end-to-end survey infrastructure. This paper presents an open-source, self-hosted web application for designing, deploying, and analysing conjoint surveys. Beyond conventional tabular stimuli, the platform uses generative AI to produce integrated stimuli formats: textual scenario descriptions generated by a large language model, and visual stimuli by a text-to-image model. A researcher-defined base prompt is parameterised with the conjoint profile, and optional LLM-facing level annotations enrich the generation. A structured setup wizard, AI-assisted attribute suggestion, and live data analysis lower the technical barriers for researchers new to conjoint methodology. A full export bundle including all stimuli, their generating prompts, and response data facilitates transparency and reproducibility. The platform is demonstrated through a proof-of-concept study on care robot preferences for ambient assisted living (AAL, N=55) using AI-generated visual stimuli. The paper discusses the role of AI assistance in conjoint design, arguing that theoretical grounding must remain the researcher’s responsibility, and outlining how genAI-generated stimuli can broaden the methodological repertoire for HCI and related fields.

[HC-12] Exploring How Agent Voice Accents Shape Human-AI Collaboration in K-12 Group Learning

Link: https://arxiv.org/abs/2606.12805
Authors: Prerna Ravi,Carúmey Stevens,Ben Hurt,Brandon Hanks,Grace Lin,Emma Anderson
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Collaboration is widely recognized as a cornerstone of 21st-century education, yet teachers still encounter persistent challenges in fostering productive peer interaction. LLM conversational peer agents introduce new possibilities for mediating in-person group work, raising questions about how persona design, particularly their voice characteristics, shapes learners’ perceptions, trust, and interactional dynamics. While prior work has examined agent accent effects in one-to-one settings, little is known about how these effects manifest in groups. We conducted a between-subjects mixed-methods study with 33 teachers examining how a GenAI voice agent with different accents (British, Indian, and African American) influenced collaboration and agent perception. Across surveys, group interaction analyses, and artifacts, we find that accent shaped participants’ mental models and the roles the agent assumed in group interaction. The British-accented agent was largely treated as a tool and engaged in detached, utility-based ways, whereas Indian- and African American-accented agents were more readily anthropomorphized and integrated as peers. These role expectations influenced trust, engagement, and reliance over time. This work advances understanding of how GenAI’s sociolinguistic design features shape group dynamics in CSCL, with implications for designing culturally inclusive AI partners in group learning.

[HC-13] A Multiplexing Design Space: Theory, Method, and Application

Link: https://arxiv.org/abs/2606.12719
Authors: Yiwen Xing,Afrah Farea,Saiful Khan,Min Chen
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Many visualization designs feature phenomena referred to as “visual multiplexing”, where multiple pieces of information associated with the same data point are conveyed simultaneously. Although visualization designers are able to bring such phenomena, often unconsciously, into their designs, the design space of visual multiplexing is huge, and it is uncommon to explore visual multiplexing systematically as design patterns. In this paper, we propose a design method for exploring a smaller design space constrained by an application. As an illustrative case study, we focus on machine learning (ML) workflows for developing ML models that approximate partial differential equations (PDEs). In these workflows, ML researchers need to analyze the inter-relationships among multiple 2D scalar fields frequently. Since superimposing one heatmap on top of another is not an effective design, we formulate three design steps to explore the design space of visual multiplexing in the context of multiple 2D scalar fields. Our design method also includes a pre-design step for domain grounding and theoretical analysis, and involves domain experts in both co-design and evaluation activities. The design process enables us to identify relatively optimal default multiplexing designs as well as the need for small variations that domain experts can control through a user interface.

[HC-14] OpenRoundup: Multi-Table Data Wrangling Through Interactive Visualization

Link: https://arxiv.org/abs/2606.12648
Authors: Stephen Kasica,Charles Berret,Tamara Munzner
Categories: Human-Computer Interaction (cs.HC)
Comments: 18 pages

Abstract:Data journalists routinely integrate records across multiple independently published sources to support accountability reporting, yet no existing interactive wrangling tool treats the collection of tables – rather than the single table – as its primary unit of work. We present OpenRoundup, an open-source, browser-based system that enables data journalists to consolidate multiple tables into a single analysis-ready output without writing code. The interface comprises five coordinated panels that implement a schema-first, values-on-demand paradigm with live schema previews, ambient data quality alerts, and a recursive treemap visualization of the evolving operation tree. A client-only architecture powered by DuckDB-WASM runs in the browser, providing strong data privacy guarantees suited to sensitive journalism data. The system introduces two conceptual contributions: eager table consolidation, in which a composite table is assembled early in the wrangling phase via interactive, incremental assembly of multiple source tables; and a declarative vocabulary for table consolidation consisting of two operations, Stack and Pack. We evaluate the system through a replication study in which the authors reproduce 17 published journalist programming workflows using only the interface, and a deployment study with four professional data journalists. The replication study demonstrates expressive coverage of real-world consolidation tasks. The deployment study confirms utility for practitioners who understand joins conceptually but lack the programming skills to execute them, and surfaces an unanticipated secondary value for data journalism education.

[HC-15] Strategic Decision Support for AI Agents

Link: https://arxiv.org/abs/2606.12587
Authors: Shayan Kiyani,Sima Noorani,George Pappas,Hamed Hassani
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become support mechanisms around them. This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints. Departing from the classical view of decision support, we revisit its two basic principles, the cost–value tradeoff of seeking support and the role of uncertainty quantification, in a setting where AI agents are the central actors. We propose a framework for strategic decision support for AI agents through an optimization problem that minimizes support usage subject to controlling a counterfactual missed-support error: the probability that the agent acts alone on instances where support would have materially improved its output. At the population level, we show that the optimal policy is a threshold rule on the value of support. Building on this structure, we develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. We further introduce a calibration-on-the-fly method that reduces unnecessary support calls online. We instantiate this framework across diverse scenarios, including information gathering, human–AI collaboration, and tool use, showing how each can be modeled through the same strategic decision-support lens. Experiments across these settings show that our method reliably controls the target error while substantially reducing support usage in practice.
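
To make the idea of a threshold rule concrete, here is a minimal Python sketch. It is not the paper's online algorithm: it is an offline illustration that assumes a labelled calibration set is available, and it treats the missed-support error as the joint frequency of "agent acted alone and support would have helped". The names `missed_support_rate`, `pick_threshold`, `scores`, `helps`, and `alpha` are hypothetical.

```python
def missed_support_rate(scores, helps, tau):
    """Fraction of instances where the agent acts alone (score < tau)
    although support would have helped (helps[i] is True)."""
    missed = sum(1 for s, h in zip(scores, helps) if s < tau and h)
    return missed / len(scores)


def pick_threshold(scores, helps, alpha):
    """Largest candidate threshold whose empirical missed-support rate
    stays at or below alpha. A larger threshold means fewer support calls."""
    if not scores or len(scores) != len(helps):
        raise ValueError("scores and helps must be non-empty and equal length")
    # Candidates in ascending order; float('inf') means "never ask for support".
    candidates = sorted(set(scores)) + [float("inf")]
    best = candidates[0]  # tau = min(scores): support is always requested
    for tau in candidates:
        # The missed rate never decreases as tau grows, so stop at the
        # first violation.
        if missed_support_rate(scores, helps, tau) <= alpha:
            best = tau
        else:
            break
    return best


# Hypothetical calibration data: a support-value score per instance and
# whether support would actually have improved the output.
scores = [0.1, 0.2, 0.8, 0.9]
helps = [False, False, True, True]
tau = pick_threshold(scores, helps, alpha=0.0)
calls = sum(1 for s in scores if s >= tau)  # instances that request support
```

The policy then requests support whenever an instance's score is at least `tau`. The abstract's online setting additionally needs exploration, because the value of support is only observed on instances where support was actually requested.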

[HC-16] Generativism: Toward a Learning Theory for the Age of Generative Artificial Intelligence

Link: https://arxiv.org/abs/2606.12441
Authors: Shan Li,Juan Zheng
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The four dominant learning theories of behaviorism, cognitivism, constructivism, and connectivism show significant conceptual limitations as generative artificial intelligence (AI) proliferates in educational settings. These frameworks were formulated before the emergence of AI systems capable of generating, synthesizing, and reasoning about knowledge. This article critically examines each learning theory and identifies assumptions challenged by generative AI’s affordances. Drawing on research in distributed cognition, extended mind, human-AI collaboration, AI literacy, cognitive offloading, and metacognition, the article proposes Generativism as a learning theory for the generative AI age. Generativism posits that learning increasingly occurs through the iterative co-construction of knowledge between human learners and AI systems. The proposed framework is organized around four principles: epistemic partnership, distributed agency, generative literacy, and adaptive metacognition. The framework offers a foundation for rethinking instructional design, learning, assessment, and expertise development in contexts where generative AI plays an integral role in cognition.

[HC-17] AI Debris: Residual Risk and the Afterlife of Failed AI Systems

Link: https://arxiv.org/abs/2606.12432
Authors: Victor Frimpong
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:AI governance frameworks primarily focus on risks during the development and deployment phases, implicitly treating system withdrawal as a technical shutdown. This paper argues that decommissioned AI systems generate residual risk, termed AI debris, that persists after model removal and continues to shape institutional behaviour, accountability, and trust. AI debris is defined as the post-withdrawal socio-technical residue of AI systems, including workflow dependency, data contamination, capability displacement (deskilling), legitimacy erosion, and accountability breakdown. The paper develops a typology of debris domains and identifies mechanisms through which debris persists, including institutional memory, path dependency, blame avoidance, and feedback effects in organisational data. To operationalise the concept, the paper proposes an evaluator-ready AI Debris Decommissioning Protocol (AIDP), a stepwise checklist specifying auditable evidence for freezing decision footprints, incident review, remediation, contestability, and post-withdrawal accountability assignment. A brief vignette of Amazon’s discontinued hiring tool illustrates how algorithmic decision categories and screening heuristics can persist after system rollback. The paper contributes a practical governance instrument for regulators, auditors, and organisations seeking to prevent paper compliance, strengthen AI lifecycle governance, and improve institutional resilience in high-stakes decision environments.

[HC-18] An Explainable AI Assistant for Introductory Programming Education: Improving Feedback Reliability with Instructor-AI Collaboration

Link: https://arxiv.org/abs/2606.12425
Authors: Muntasir Hoq,Griffin Pitts,Bradford Mott,Seung Lee,Jessica Vandenberg,Shuyin Jiao,Narges Norouzi,James Lester,Bita Akram
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Full paper accepted to the 27th International Conference on AI in Education (AIED 2026)

Abstract:Active learning is widely recognized as an effective approach for improving learning outcomes in introductory programming courses. However, insufficient instructional support often limits students’ access to timely, personalized feedback, which is crucial for mastering foundational programming concepts. Although recent advances in AI, particularly large language models, offer scalable opportunities for feedback, concerns about explainability and reliability remain. In this paper, we present an AI-driven classroom assistant that leverages an explainable AI model to analyze student code, map logical errors to instructor-identified misconceptions, and deliver instructor-authored feedback, thereby grounding reliability in instructor-defined pedagogical knowledge. To evaluate the effectiveness of our framework, we conducted an expert evaluation to examine its alignment with instructor-verified feedback and deployed the system in a classroom setting to assess students’ perceptions of its usability. Results indicate that the assistant can provide accurate, instructor-verified feedback to students while fostering a positive experience.

[HC-19] AI-Automation Tooling in Computer Engineering Education: Mixed-Methods TAM/UTAUT Evidence for a General Acceptance Attitude

Link: https://arxiv.org/abs/2606.12424
Authors: Aung Pyae
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:As generative AI and low-code workflow platforms become routine in software practice, a key educational question is whether the next generation of computer engineers will accept these tools as useful, usable, and worthy of sustained engagement. This paper reports a mixed-methods, cross-sectional study of undergraduate computer engineering students’ acceptance of AI automation tooling, instantiated through the open-source platform n8n across three identically scripted workshops in Thailand (n = 103). A 12-item, five-point Likert instrument mapped to six TAM/UTAUT constructs - Performance Expectancy (PE), Effort Expectancy (EE), Behavioral Intention (BI), Self-Efficacy (SE), Hedonic Motivation (HM), and Output Quality (OQ) - was complemented by inductive thematic analysis of open-ended feedback. Analyses combined ordinal reliability estimation, bootstrap confidence intervals, non-parametric tests, multiple-comparison-controlled correlations, polychoric dimensionality diagnostics, a common-method-bias check, and between-session comparisons. Acceptance was favorable across all six constructs with large effect sizes, with PE emerging as the strongest construct and HM as the weakest. Dimensionality diagnostics further revealed that canonical TAM/UTAUT sub-facets collapsed into a single general acceptance factor in this short-form post-workshop context, a finding with important methodological and theoretical implications. Qualitative themes converged with the quantitative profile regarding usefulness and enthusiasm but diverged on output quality, revealing a small yet articulate reliability-skeptical minority. The findings support the curricular adoption of AI automation tooling in undergraduate computing education and identify three theory-grounded instructional levers: instruction-sequencing scaffolds, self-efficacy supports, and trust-calibration interventions.

[HC-20] Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

Link: https://arxiv.org/abs/2606.12422
Authors: Zewei Tian,Alex Liu,Lief Esbenshade,Michael Xiao,Zachary Zhang,Yulia Lápicus,Thomas Han,Kevin He,Min Sun
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Published on the Proceedings of NCME 2026 Conference ( this https URL )

Abstract:The integration of large language models (LLMs) into educational assessment represents a transformative shift in classroom grading practices. While automated scoring systems and machine learning techniques have existed for decades, generative AI (GenAI) now enables educators to implement standards-based grading (SBG) with unprecedented efficiency and scale. This paper examines the theoretical foundations and evaluates an LLM grader that uses commercially available foundation models with context and prompt engineering to score student work against a rubric. Drawing on an empirical interrater agreement study using Massachusetts Comprehensive Assessment System (MCAS) data, we observed the Quadratic Weighted Kappa (QWK) and Proportional Reduction in Mean-Squared Error (PRMSE) across mathematics, science, and ELA, using Claude Sonnet 4, Haiku 4.5, GPT-5, and GPT-5 Mini. The results demonstrate that LLM graders, especially when based on foundational models with more parameters, achieve substantial agreement with human raters in mathematics and science assessments, while the performances vary in ELA, suggesting generic foundation models can be effective at scoring in given contexts. Additional analysis of teacher and student feedback reveals strong acceptance of AI-generated narrative feedback but skepticism toward numerical scores, suggesting that LLMs function most effectively as formative tools rather than summative evaluators. Our findings indicate that thoughtfully designed hybrid models that combine AI efficiency with teacher judgment can reduce workload, enhance feedback quality, and support equitable assessment practices without displacing professional expertise.
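
For readers unfamiliar with the agreement statistic named above, the following is a minimal Python sketch of the usual definition of Quadratic Weighted Kappa: one minus the ratio of weighted observed disagreement to weighted chance-expected disagreement, with weights (i - j)^2 / (n - 1)^2. The function name, the integer score encoding (0 to `n_categories - 1`), and the toy ratings are illustrative assumptions and are not taken from the paper or from the MCAS data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Quadratic Weighted Kappa between two equal-length lists of integer
    scores in the range 0 .. n_categories - 1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("need at least two score categories")
    n = n_categories
    # Observed confusion matrix: observed[i][j] counts items scored i by A, j by B.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    total = len(rater_a)
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * hist_a[i] * hist_b[j] / total
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - weighted_observed / weighted_expected


# Toy example with hypothetical scores on a 0-2 rubric.
human = [0, 1, 2, 1]
model = [0, 1, 2, 1]
kappa = quadratic_weighted_kappa(human, model, n_categories=3)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance; the quadratic weights penalise large score gaps more heavily than adjacent-score disagreements.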

[HC-21] Navigating the muddy waters of bias in artificial intelligence research: Understanding divergent meanings and conceptions

Link: https://arxiv.org/abs/2606.12421
Authors: Mohammad Hossein Jarrahi,Amir Karami,Patrick Conway,Ali Memariani,Christoph Lutz
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:As artificial intelligence (AI) pervades many decision-making domains, AI bias grows in importance. Although there is increasing awareness of the social and ethical consequences of biased AI, understanding bias from the perspective of those who develop these systems, such as the AI research community, is less clear. In this study, we employ topic modeling on 6520 articles to explore how the AI research community interprets the concept of bias. Our results show that the definition of bias is dispersed and complex within the community, often exhibiting even divergent conceptions (some even view and introduce bias as a tunable statistical parameter rather than an undesirable issue). The research community as a whole needs to engage more effectively with the concept of bias and establish a more cohesive understanding of it. We specifically argue that, although some sub-communities view bias as an issue that can be captured and mitigated through technical, computational, or statistical methods, it is not solely a technical problem. It instead involves contextual, social, and ethical factors that require broader sociotechnical perspectives and solutions.

[HC-22] Assessing Student Ability to Select an Algorithmic Paradigm

Link: https://arxiv.org/abs/2606.12417
Authors: Dip Kiran Pradhan Newar,Michael Shindler,Seth Poulsen
Categories: Computers and Society (cs.CY); Data Structures and Algorithms (cs.DS); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Computer science students are expected to be able to look at a problem and select an appropriate algorithm design paradigm to use to produce a solution. However, there is little research on how students determine which algorithmic paradigm to use. Historically, researchers have relied on free-response questions or interviews to assess students’ knowledge of algorithmic paradigm selection. To successfully evaluate and scale teaching interventions for selecting an algorithmic design paradigm, we need to efficiently test a student’s ability to select among different design paradigms. Here, we present the first attempts to assess student knowledge to select an algorithm design paradigm using multiple-choice questions. We present the construction of the algorithmic paradigm selection assessment (APSA) and preliminary data demonstrating its effectiveness as an assessment. We discuss the key points we learned during this process to write multiple-choice questions for Algorithm Design Paradigms. We tested the internal consistency of our assessment using Cronbach’s α and obtained a score of 0.73, which is above the required threshold of 0.7. APSA can be used across institutions as a standardized way to assess students’ ability to select different algorithm design paradigms. APSA will assist researchers in evaluating whether a theory helps students improve their knowledge of different Algorithm Design Paradigms.
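
As a reminder of how the internal-consistency figure above is typically obtained, here is a minimal Python sketch of the standard Cronbach's alpha formula, α = k/(k-1) · (1 - Σ item variances / variance of total scores), where k is the number of items. The function name and the toy response matrix (1 = correct, 0 = incorrect) are hypothetical and are not the APSA data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("every respondent needs the same k >= 2 item scores")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical multiple-choice results: 4 students, 2 items.
toy_responses = [[1, 0], [0, 1], [1, 1], [0, 0]]
alpha = cronbach_alpha(toy_responses)
```

Alpha approaches 1 when items rise and fall together across respondents and falls toward 0 when item scores are unrelated; 0.7 is a commonly used rule-of-thumb threshold, as the abstract notes.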

[HC-23] Revisiting the ABCs of Working with AI: A Replication with Radiologists

Link: https://arxiv.org/abs/2606.12585
Authors: Daniel Martin
Categories: General Economics (econ.GN); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Artificial intelligence (AI) systems increasingly assist human experts, but the consequences of AI assistance on productivity can be heterogeneous. Caplin, Deming, S. Li, Martin, Marx, Weidmann, and Ye (2025b) provide evidence that two characteristics, ability and belief calibration, help to determine the returns to AI assistance. This note shows that their results replicate to a setting where professional radiologists analyze chest X-rays with access to state-of-the-art machine learning predictions. I leverage the public Collab-CXR data repository described by Moehring, Kutwal, Huang, Banerjee, Jacobi, Eber, Mendoza, Chung, Dayan, Gupta, Bui, Truong, Pareek, Langlotz, Lungren, Agarwal, Rajpurkar, and Salz (2025) and first analyzed for human-AI collaboration by Agarwal, Moehring, Rajpurkar, and Salz (2023). To faithfully reproduce the analysis in Caplin, Deming, S. Li, Martin, Marx, Weidmann, and Ye (2025b), I use the radiologist assessments from the repeated-case designs, which include 68 radiologists and 11,420 paired radiologist-patient-pathology observations. The results of this replication support the external validity of their core findings: lower baseline ability and higher calibration predict larger incremental value from AI.

Computer Vision

[CV-0] InterleaveThinker: Reinforcing Agentic Interleaved Generation

Link: https://arxiv.org/abs/2606.13679
Authors: Dian Zheng,Harry Lee,Manyuan Zhang,Kaituo Feng,Zoey Guo,Ray Zhang,Hongsheng Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL Code: this https URL

Abstract:Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator’s outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.

[CV-1] Mana: Dexterous Manipulation of Articulated Tools

Link: https://arxiv.org/abs/2606.13677
Authors: Zhao-Heng Yin,Guanya Shi,Pieter Abbeel,C. Karen Liu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

[CV-2] Modality Forcing for Scalable Spatial Generation

Link: https://arxiv.org/abs/2606.13676
Authors: Bardienus Pieter Duisterhof,Deva Ramanan,Jeffrey Ichnowski,Justin Johnson,Keunhong Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. this https URL
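
The AbsRel metric mentioned above is the absolute relative depth error, the mean of |predicted - true| / true over pixels with valid ground truth. The sketch below shows how it can be evaluated against sparse depth and how a relative reduction is computed. It is an illustration only: the function names, the use of `None` to mark missing depth, and all numeric values are hypothetical and do not come from the paper.

```python
def abs_rel(predicted, ground_truth):
    """Mean absolute relative depth error over pixels with valid ground truth.
    Ground-truth entries that are None or non-positive are treated as missing,
    which is how sparse depth maps are handled in this sketch."""
    errors = [
        abs(p - g) / g
        for p, g in zip(predicted, ground_truth)
        if g is not None and g > 0
    ]
    if not errors:
        raise ValueError("no valid ground-truth depth values")
    return sum(errors) / len(errors)


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - improved) / baseline


# Hypothetical flattened depth maps in metres; the middle pixel has no label.
pred = [2.0, 5.0, 1.0]
sparse_gt = [1.0, None, 2.0]
error = abs_rel(pred, sparse_gt)

# Hypothetical numbers: a baseline AbsRel of 0.10 dropping to 0.043
# corresponds to a 57% relative reduction.
reduction = relative_reduction(0.10, 0.043)
```

Masking out missing pixels is what lets the metric be computed on sparse, real-world depth; only pixels with a measured depth contribute to the average.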

[CV-3] RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Link: https://arxiv.org/abs/2606.13674
Authors: Junke Wang,Qihang Zhang,Shuai Yang,Yiming Luo,Yujun Shen,Zuxuan Wu,Yu-Gang Jiang,Yinghao Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at this https URL.

[CV-4] SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Link: https://arxiv.org/abs/2606.13673
Authors: Seokju Cho,Ryo Hachiuma,Abhishek Badki,Hang Su,Byung-Kwan Lee,Chan Hee Song,Sifei Liu,Subhashree Radhakrishnan,Seungryong Kim,Yu-Chiang Frank Wang,Min-Hung Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent’s capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.

[CV-5] Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

Link: https://arxiv.org/abs/2606.13655
Authors: Jen-Hao Cheng,Yipeng Wang,Hao Zhang,Gengshan Yang,Jenq-Neng Hwang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 18 pages, 8 figures. Code and multi-view caption dataset available

Abstract:We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.

[CV-6] World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

Link: https://arxiv.org/abs/2606.13652
Authors: Hao Zhang,Mohamed El Banani,Jen-Hao Cheng,Paul Zhang,Yi Hua,Ben Mildenhall,Christoph Lassner,Narendra Ahuja,Gengshan Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: World Labs Technical Report; Page: this https URL

Abstract:Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.

[CV-7] Surflo: Consistent 3D Surface Flow Model with Global State

Link: https://arxiv.org/abs/2606.13644
Authors: Antoine Guédon,Shu Nakamura,Nicolas Dufour,Jiahui Lei,Ko Nishino,Angjoo Kanazawa
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL

Abstract:Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens (one global state) and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.

[CV-8] Revisiting Vehicle Color Recognition in Long-Tailed Surveillance Scenarios ICPR

Link: https://arxiv.org/abs/2606.13625
Authors: Vinícius Orrú,Bruno H. Foggiatto,Gabriel E. Lima,David Menotti,Rayson Laroca
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at the 2026 International Conference on Pattern Recognition (ICPR) - V3SC Workshop

Abstract:Vehicle color recognition is an important cue for vehicle identification in surveillance systems, especially when license plates are illegible due to low resolution, occlusion, motion blur, or poor illumination. However, real-world vehicle color distributions are highly imbalanced, making overall accuracy insufficient to assess performance on rare but operationally relevant colors. This paper presents a comprehensive study of vehicle color recognition under severe class imbalance using UFPR-VeSV, a challenging real-world surveillance dataset. We investigate synthetic minority-class augmentation through two off-the-shelf generative strategies: text-conditioned image generation with RunDiffusion/JuggernautXL and image-conditioned color editing with Gemini 2.0 Flash. The curated synthetic data are combined with modern visual representations, loss reweighting, learning-rate scheduling, color-safe augmentation, foreground-aware preprocessing, and ensemble fusion. The best-performing approach achieves 94.6% micro accuracy and 79.7% macro accuracy, improving macro accuracy by 8.2 percentage points over recent literature. A manual error analysis further shows that many remaining failures are visually ambiguous even for human annotators, highlighting the practical limits of color-based vehicle identification in unconstrained surveillance imagery. The generated images and source code are publicly available at this https URL
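The gap between micro and macro accuracy under class imbalance can be shown with a short sketch. It assumes that macro accuracy means the unweighted mean of per-class accuracy, which is the common convention but is not spelled out in the abstract; the color labels and counts are made up for illustration.

```python
from collections import defaultdict


def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy; macro is the unweighted mean over classes."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    micro = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro


# Toy imbalanced set: 8 common "white" vehicles, 2 rare "pink" ones.
y_true = ["white"] * 8 + ["pink"] * 2
# The classifier gets every white vehicle right but only one of the pink ones.
y_pred = ["white"] * 8 + ["white", "pink"]

micro, macro = micro_macro_accuracy(y_true, y_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

One error on the rare class barely moves micro accuracy, yet it halves that class's contribution to the macro score, which is why the macro figure is the more informative one for rare colors.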

[CV-9] Towards Effective Waste Segmentation for Automated Waste Recycling in Cluttered Background ICML2026

Link: https://arxiv.org/abs/2606.13587
Authors: Mamoona Javaid,Mubashir Noman,Abdul Hannan,Shah Nawaz,Mustansar Fiaz,Sajid Ghuffar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted at ICML 2026

Abstract:Rapid expansion of urban areas and population growth are causing an immense increase in waste production, which demands efficient and automated waste management. In this scenario, automated waste recycling (AWR) using deep learning methods can assist humans in optimal waste management. Recent deep learning approaches for AWR provide promising waste segmentation performance; however, these methods rely on large backbone networks that are inefficient for AWR systems and suffer from performance deterioration in cluttered scenes. To this end, an optimal waste segmentation network is introduced which effectively utilizes the spatial domain to capture localized structural dependencies and the spectral domain to efficiently extract global contextual relationships. This cascaded design allows the network to progressively leverage both local and global representations across complementary domains to highlight the semantic information necessary for effective segmentation of various waste objects. Furthermore, an auxiliary feature enhancement module (AFEM) is introduced to enhance the target objects’ boundaries and blob amplification for better segmentation in cluttered scenarios. Extensive experimentation on the ZeroWaste-aug, ZeroWaste-f, and SpectralWaste datasets reveals the merits of the proposed method.

[CV-10] EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution ICML2024

Link: https://arxiv.org/abs/2606.13580
Authors: Dachun Kai,Jiayao Lu,Yueyi Zhang,Xiaoyan Sun
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IEEE TPAMI 2026. Extended version of arXiv:2406.13457 (ICML 2024). Project page: this https URL

Abstract:Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset. Code: this https URL.

[CV-11] Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization

Link: https://arxiv.org/abs/2606.13562
Authors: Stephen Moore,Lara Leijser,Richard Frayne,Roberto Souza
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 24 pages, 1 table, 7 figures

Abstract:Purpose: To investigate whether contrast-informed data augmentation and domain-adversarial training improve the adult-to-neonatal generalization of the E2E-VarNet. Methods: Three training regimes were investigated: (1) adult-only training with unaugmented adult data, (2) mixed training with paired unaugmented and neonatal-informed augmented adult data, and (3) mixed training with a domain-adversarial objective. Models were trained on retrospectively undersampled multi-coil adult T2-weighted brain MR data and evaluated on neonatal and adult test data at acceleration factors R=4 and R=8 using quantitative metrics and qualitative evaluation. Feature analyses assessed whether domain-adversarial training altered the latent representations of unaugmented adult, augmented adult, and neonatal test samples. Results: Mixed training (Mixed) and mixed domain-adversarial training (Mixed-DAT) outperformed unaugmented adult-only training (Unaug-Only) when evaluated on neonatal data. At R=4, Mixed-DAT achieved the best performance (SSIM = 0.924 +/- 0.027, PSNR = 33.98 +/- 1.15 dB). At R=8, Mixed-DAT performed best when measured using SSIM (0.848 +/- 0.031 vs. 0.766 +/- 0.037 for Unaug-Only and 0.814 +/- 0.035 for Mixed) and Mixed performed best when measured using PSNR (29.56 +/- 0.83 dB vs. 26.26 +/- 0.78 dB for Unaug-Only and 29.43 +/- 0.83 dB for Mixed-DAT). Qualitative assessment of t-SNE plots suggested that Mixed-DAT increased the overlap among the latent representations of the unaugmented adult, augmented adult, and neonatal test data. Conclusion: Contrast-informed augmentation and domain-adversarial training improved adult-to-neonatal generalization of deep learning-based MR reconstruction. These findings suggest that contrast-informed data augmentation combined with adversarial training may improve robustness to domain shift in undersampled neonatal MR reconstruction.
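For readers unfamiliar with the PSNR figures quoted above, the metric can be sketched as follows. This assumes intensities normalised to a known `data_range`; the function name and the toy pixel values are illustrative and say nothing about the authors' evaluation code.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(data_range ** 2 / mse)


reference = [0.0, 0.0, 0.0, 0.0]
estimate = [0.1, 0.1, 0.1, 0.1]  # uniform error of 0.1, so MSE is about 0.01
value = psnr(reference, estimate)
print(f"{value:.2f} dB")
```

Because the scale is logarithmic, the roughly 3 dB gap reported at R=8 between Unaug-Only and the mixed regimes corresponds to about a halving of the mean squared error.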

[CV-12] What's Old is New Again: Classical Dimensionality Reduction for Efficient Saliency-Guided Biometric Attack Detection

Link: https://arxiv.org/abs/2606.13528
Authors: Samuel Webster,Walter Scheirer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages (8 main, 2 references, 6 appendix), 4 figures (3 main, 1 appendix), 13 tables (3 main, 10 appendix)

Abstract:Saliency-guided training is a paradigm in visual recognition that encourages models to focus on the most relevant image regions during learning. While its application in biometric presentation attack detection (PAD) has shown strong benefits in robustness and generalization, adoption is often limited by the high cost, domain specificity, and limited scalability of existing saliency acquisition methods, such as human annotations over a limited dataset. We present a novel, cost-efficient, and highly-scalable approach to saliency acquisition using maps inspired by classical dimensionality reduction techniques: PCA and LDA. Our proposed methods generate saliency maps directly from raw training data, requiring no human annotation nor domain knowledge. We contextualize the effectiveness of these saliency sources in three saliency-explored domains (iris PAD, synthetic face detection, fingerprint PAD) and demonstrate its scalability in two saliency-novel domains (fingerprint vein PAD and ID card PAD). Across all domains tested, models trained using dimensionality reduction-sourced saliency maps exceed baseline and sometimes SOTA saliency methods without any resource investment or domain-specific tooling. Our findings overcome an important yet unaddressed barrier to saliency-guided training for biometric attack detection and beyond.

[CV-13] MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

Link: https://arxiv.org/abs/2606.13515
Authors: Hanyang Yu,Haitao Lin,Jingbo Zhang,Wenyao Zhang,Chenghao Gu,Heng Li,Ping Tan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

[CV-14] Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization

Link: https://arxiv.org/abs/2606.13509
Authors: Mateo Toro Diz,Jonathan Hoss,Noah Klarmann
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for presentation at the IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

Abstract:Indoor vision-based localization systems are affected by detection noise, occlusions, and limited camera coverage, leading to uncertainty at multiple stages of the pipeline. While multi-camera data fusion is widely used to mitigate these issues, it is typically treated as a black-box component and evaluated solely end-to-end, obscuring its mechanistic contributions. To address this gap, this work investigates whether explicitly characterizing single-camera localization errors can be leveraged to calibrate and optimize multi-camera data fusion. We introduce a measurement-calibrated fusion approach that integrates component-wise error quantification, specifically isolating homography calibration, human detection, and motion tracking. A component-wise evaluation is conducted to quantify error contributions from homography calibration, human detection, and motion tracking. Experimental results show that data fusion improves localization accuracy compared to single-camera baselines. While measurement-calibrated fusion provides only limited improvement in absolute accuracy over standard fusion, it substantially reduces trajectory variance and improves motion smoothness, which are critical for applications requiring stable and continuous motion estimates. These results highlight the value of explicit error characterization when designing data fusion strategies for vision-based indoor positioning systems.
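One simple way to let measured single-camera errors calibrate a fusion step is inverse-variance weighting, sketched below. The abstract does not state which fusion rule the authors use, so this is a textbook stand-in under the assumption that each camera's position error has been summarised as a single variance; `fuse_positions` and the numbers are hypothetical.

```python
def fuse_positions(estimates):
    """Fuse per-camera (x, y) estimates, weighting each by 1 / measured variance.

    estimates: list of ((x, y), variance) tuples, one per camera.
    Returns the fused (x, y) and the variance of the fused estimate.
    """
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    x = sum(w * pos[0] for w, (pos, _) in zip(weights, estimates)) / total
    y = sum(w * pos[1] for w, (pos, _) in zip(weights, estimates)) / total
    return (x, y), 1.0 / total


# Camera A was measured to be more precise (variance 0.04 m^2) than camera B (0.16 m^2).
cameras = [((2.0, 1.0), 0.04), ((2.4, 1.2), 0.16)]
fused_xy, fused_var = fuse_positions(cameras)
print(fused_xy, fused_var)
```

The fused variance is lower than that of either camera, which is consistent in spirit with the reported reduction in trajectory variance, although the paper's actual mechanism may differ.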

[CV-15] Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

Link: https://arxiv.org/abs/2606.13503
Authors: Judith Vilella-Cantos,Juan José Cabrera,Mónica Ballesta,David Valiente,Luis Payá
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensors provide detailed 3D information about the environment and are invariant to lighting conditions. For this reason, LiDAR-based place recognition methods have gained significant attention. In this paper, we propose MinkUNeXt-VINE++, a novel approach that combines early fusion of heterogeneous LiDAR data from two sensors (Livox Mid-360 and Velodyne VLP-16) and a learned re-ranking strategy at inference time. This fusion leverages the strengths of each sensor to provide a more comprehensive representation of the environment. Additionally, the re-ranking approach is particularly important in repetitive environments, such as vineyards, as finding true positives is a major challenge. We evaluated our approach using the TEMPO-VINE dataset, which provides heterogeneous LiDAR data in vineyard environments across different phenological stages. Our results demonstrate that MinkUNeXt-VINE++ significantly improves place recognition performance compared to single-sensor approaches and state-of-the-art methods. MinkUNeXt-VINE++ achieves a 20% improvement in the Recall@1 metric compared to single-sensor approaches, and +30% including re-ranking. The code of our method is publicly available for reproduction.
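The effect of re-ranking on Recall@1 can be illustrated with a toy sketch. The learned re-ranker is replaced here by a fixed lookup table of scores, which is purely an assumption for illustration; `recall_at_k`, `rerank`, and the place identifiers are hypothetical.

```python
def recall_at_k(rankings, positives, k=1):
    """Fraction of queries with at least one true positive among the top-k candidates."""
    hits = sum(
        1 for query, ranking in rankings.items()
        if any(candidate in positives[query] for candidate in ranking[:k])
    )
    return hits / len(rankings)


def rerank(ranking, score, top_k=3):
    """Re-order only the top_k retrieved candidates by a second-stage score."""
    head = sorted(ranking[:top_k], key=score, reverse=True)
    return head + ranking[top_k:]


# First-stage retrieval: candidates sorted by descriptor distance.
rankings = {"q1": ["row_a", "row_b", "row_c"], "q2": ["row_d", "row_e", "row_f"]}
positives = {"q1": {"row_b"}, "q2": {"row_d"}}

# Stand-in for a learned re-ranker: a fixed table of second-stage scores.
scores = {"row_a": 0.2, "row_b": 0.9, "row_c": 0.1,
          "row_d": 0.8, "row_e": 0.3, "row_f": 0.1}

before = recall_at_k(rankings, positives, k=1)
reranked = {q: rerank(r, scores.get) for q, r in rankings.items()}
after = recall_at_k(reranked, positives, k=1)
print(before, after)
```

In repetitive scenes such as vineyard rows, the true match is often in the shortlist but not at rank 1, which is the case the second stage is meant to repair.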

[CV-16] SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

Link: https://arxiv.org/abs/2606.13497
Authors: Nils Blank,Paul Mattes,Maximilian Xiling Li,Jakub Suliga,Thomas Roth,Moritz Reuss,Pankhuri Vanjani,Rudolf Lioutikov
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work introduces Spatial Annotations from Robot Demonstrations with Reliability Calibration (SPARC), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations, such as bounding boxes, object trajectories, and manipulation phase labels, benefit a broad range of robotics applications from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition. Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels or discarding useful samples. In contrast to existing automated pipelines, SPARC leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, reducing noisy labels and retaining more useful samples. We further introduce Interaction-Aware Bench (IA-Bench), a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in localization accuracy while retaining three times more samples at high-precision operating points. Our experiments demonstrate that models finetuned on our annotations achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models, while remaining competitive on broader spatial-reasoning suites without manually verified or annotated training data. Furthermore, policies trained on SPARC-generated annotations outperform baselines in cluttered, visually ambiguous real-world scenes. Code, data, and models are available at this http URL.

[CV-17] Budget-Constrained Step-Level Diffusion Caching ICML2026

Link: https://arxiv.org/abs/2606.13496
Authors: Mingkun Lei,Tong Zhao,Liangyu Yuan,Chi Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Abstract:Step-level caching accelerates diffusion models by exploiting temporal redundancy across denoising steps. Existing methods make per-step cache decisions using threshold-based heuristics, without directly optimizing for final output quality. As a result, their inference latency varies across inputs and is difficult to control at deployment. In this work, we propose BudCache, which inverts this formulation: rather than letting per-step error thresholds dictate the runtime cost, we fix the compute budget in advance and search for the cache policy that best preserves the final output. To tackle the combinatorial complexity of step selection, we combine Simulated Annealing with deterministic Hill Climbing. This offline search identifies high-quality cache policies within minutes and introduces no online search or thresholding overhead during inference. When the compute budget is very tight, we further introduce cache-aware schedule alignment, which adapts the time discretization to the selected cache policy to reduce cache-induced trajectory mismatch. Experiments on FLUX.1-dev and Wan2.1 show that BudCache achieves better generation quality than heuristic caching baselines under the same inference budgets. Code is available at this https URL
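The budget-first search described above can be sketched on a toy problem: fix how many denoising steps are fully computed, then look for the set of steps that minimises a quality proxy using simulated annealing followed by deterministic hill climbing. The proxy cost (importance times staleness), the swap neighbourhood, the cooling schedule, and all names below are assumptions for illustration and are not BudCache's actual objective.

```python
import math
import random


def policy_cost(computed, importance):
    """Toy proxy for output error: each cached step pays importance * staleness."""
    cost, last = 0.0, 0
    for t in range(len(importance)):
        if t in computed:
            last = t
        cost += importance[t] * (t - last)
    return cost


def neighbors_swap(computed, num_steps):
    """Policies reachable by swapping one computed step (never step 0) for a cached one."""
    cached = [t for t in range(num_steps) if t not in computed]
    for out in sorted(computed - {0}):
        for inn in cached:
            yield (computed - {out}) | {inn}


def search_policy(importance, budget, iters=2000, seed=0):
    """Pick `budget` steps to compute (2 <= budget < number of steps); step 0 is always computed."""
    rng = random.Random(seed)
    n = len(importance)
    current = frozenset([0] + rng.sample(range(1, n), budget - 1))
    cur_cost = policy_cost(current, importance)
    best, best_cost = current, cur_cost

    # Stage 1: simulated annealing over random swaps with a linear cooling schedule.
    for i in range(iters):
        temp = max(1e-3, 1.0 - i / iters)
        out = rng.choice(sorted(current - {0}))
        inn = rng.choice([t for t in range(n) if t not in current])
        cand = (current - {out}) | {inn}
        c = policy_cost(cand, importance)
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            current, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c

    # Stage 2: deterministic hill climbing until no single swap improves the cost.
    improved = True
    while improved:
        improved = False
        for cand in neighbors_swap(best, n):
            c = policy_cost(cand, importance)
            if c < best_cost:
                best, best_cost, improved = cand, c, True
                break
    return sorted(best), best_cost


importance = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0]
policy, cost = search_policy(importance, budget=4)
print(policy, cost)
```

Because the budget is fixed before the search starts, every returned policy has the same number of computed steps, so inference cost is known in advance and only quality varies.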

[CV-18] NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

Link: https://arxiv.org/abs/2606.13494
Authors: Daichi Azuma,Taiki Miyanishi,Koya Sakamoto,Shuhei Kurita,Yaonan Zhu,Petr Khrapchenkov,Motoaki Kawanabe,Yusuke Iwasawa,Yutaka Matsuo
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: this https URL

[CV-19] Point-Wise Geometry-Aware Transformer for Partial-to-Full Point Cloud Registration in Computer-Assisted Surgery

Link: https://arxiv.org/abs/2606.13488
Authors: Siyu Zhou,Zhongliang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Partial-to-full registration remains challenging due to varying overlap ratios, fluctuating point densities, and the presence of noise. While transformers have shown strong potential for point cloud processing, prior methods typically confine them to global context aggregation, overlooking fine-grained local geometry crucial for accurate correspondence. We propose GAPR-Net, a learning-based point cloud registration framework with a coarse-to-fine architecture that combines convolution and transformer modules, in which local and global information is fused between the partial and full point clouds using a cross-attention mechanism. To achieve this, a transformation-invariant point-wise geometric feature representation is proposed, which can robustly capture relative geometric features for individual points with respect to their neighboring points. To evaluate the effectiveness of the proposed approach, experiments are conducted on four geometrically distinct bones, including the tibia, femur, pelvis, and thoracic cartilage. The overall registration recall reaches 94.2%, and the method achieves a low RMSE of 1.992 mm and R^2 values of 0.908 and 0.974 for rotation and translation, respectively. The results demonstrate that the proposed method effectively addresses the partial-to-full point cloud registration problem. The proposed method enables highly accurate 3D point cloud registration using partial observation, providing a critical foundation for precise surgical navigation and robotic interventions in computer-assisted surgery. The code will be made accessible after the double-blind review process.

[CV-20] Reinforcement Learning for Neural Model Editing

Link: https://arxiv.org/abs/2606.13461
Authors: Shaivi Malik
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where agents modify models using reward feedback. We introduce two environments: MaskWorld, where agents scale weights multiplicatively, and ShiftWorld, where agents apply additive weight updates. The reward function combines a utility-preservation objective with a task-specific editing objective, enabling agents to learn targeted modifications while maintaining overall model performance. We evaluate the framework on bias mitigation in text classification and machine unlearning in image classification, both of which traditionally rely on specialized algorithms. Our results show that the learned policies reduce forget set accuracy to nearly 0% while preserving over 90% retain set accuracy on the unlearning task. In the bias mitigation setting, the learned policies improve bias-related performance by more than 5% while maintaining general classification utility. Our findings show that neural model editing can be cast as a reinforcement learning problem, allowing editing policies to be learned from reward feedback rather than manually engineered for each task.
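The two action types and the combined reward can be sketched in a few lines. The abstract only says that MaskWorld scales weights multiplicatively, ShiftWorld adds updates, and the reward combines utility preservation with an editing objective; the linear weighting `alpha`, the unlearning-style reward form, and the function names below are assumptions.

```python
def mask_edit(weights, action):
    """MaskWorld-style action: scale each weight multiplicatively."""
    return [w * a for w, a in zip(weights, action)]


def shift_edit(weights, action):
    """ShiftWorld-style action: add an update to each weight."""
    return [w + a for w, a in zip(weights, action)]


def unlearning_reward(retain_acc, forget_acc, alpha=0.5):
    """Assumed reward: keep retain-set accuracy high and push forget-set accuracy to zero."""
    return alpha * retain_acc + (1.0 - alpha) * (1.0 - forget_acc)


weights = [1.0, 2.0]
masked = mask_edit(weights, [0.0, 0.5])    # zero out the first weight, halve the second
shifted = shift_edit(weights, [0.5, -2.0])  # nudge the first weight, cancel the second

# An edit that keeps 90% retain accuracy and fully forgets scores higher
# than one that leaves the forget set untouched.
good = unlearning_reward(retain_acc=0.9, forget_acc=0.0)
bad = unlearning_reward(retain_acc=0.9, forget_acc=0.9)
print(masked, shifted, good, bad)
```

An agent trained against such a reward would be pushed toward edits like the first one, matching the reported outcome of near-zero forget-set accuracy with over 90% retain-set accuracy.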

[CV-21] VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

Link: https://arxiv.org/abs/2606.13460
Authors: Ruiqi Xian, Yuehan Xian, Jing Liang, Xuewei Qi, Dinesh Manocha
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training-time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed-set occupancy as reliability-aware semantic auditors than as generic caption-embedding targets.

[CV-22] OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Link: https://arxiv.org/abs/2606.13432
Authors: Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou, Zijie Meng, Zhimin Zhang, Yawen Luo, Guoxin Zhang, Yu-Shen Liu, Pengfei Wan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures

Abstract:Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffers from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: this https URL

[CV-23] VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits ICMR2026

Link: https://arxiv.org/abs/2606.13427
Authors: Hoang-Nguyen Cao, Le-Hoang Bui, Dinh-Khoi Vo, Minh-Triet Tran, Trung-Nghia Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICMR 2026. Project page: this https URL

Abstract:Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. The textual prompts describe detailed outfit attributes, which are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: this https URL.

[CV-24] SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

Link: https://arxiv.org/abs/2606.13382
Authors: Zian Yang, Zixin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves a better global-local balance and improves glyph quality and local detail fidelity.

[CV-25] MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Link: https://arxiv.org/abs/2606.13376
Authors: Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

[CV-26] IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

Link: https://arxiv.org/abs/2606.13368
Authors: Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li, Shu Zou, Siqi Li, Nianchen Deng, Xinyu Cai, Hongbin Zhou, Pinlong Cai, Daocheng Fu, Yu Yang, Hairong Zhang, Botian Shi, Xuemeng Yang
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.

[CV-27] Dual-Constrained Diffusion Image Compression for Operational Rate-Distortion-Perception Optimization

Link: https://arxiv.org/abs/2606.13366
Authors: Sanxin Jiang, Jiro Katto, Heming Sun
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:The rate-distortion-perception (RDP) trade-off extends classical rate–distortion theory by imposing a distributional constraint on reconstructions, providing a unified framework for neural image compression that jointly governs fidelity and perceptual realism. While prior work achieves near-optimal rate–perception trade-offs, practical frameworks explicitly realizing the full RDP surface remain scarce, primarily due to the difficulty of introducing common randomness at the decoder. We propose DCIC (Dual-Constrained Diffusion Image Compression), which integrates a learned codec with a diffusion-based decoder governed by joint distortion and idempotence constraints. The distortion constraint bounds reconstruction fidelity relative to the base codec output; the idempotence constraint – requiring that re-encoding the restored image recovers the base codec reconstruction – serves as a tractable surrogate for the distributional perception requirement. Together, they steer the reverse denoising process via iterative optimization with consistent noise injection, realizing common randomness without additional rate overhead. At fixed rate, dual attenuation factors (K_D, K_P) jointly navigate the Pareto frontier of the distortion-perception plane, enabling continuously adjustable fidelity-realism trade-offs from a single bitstream. DCIC_RD (K_P = 0) and DCIC_RP (K_D = 0) arise as boundary curves, with DCIC_RDP (K_D = K_P = 1) realizing the optimal interior operating point. Experiments on CelebA-HQ, CLIC2020, and ImageNet-1K across CNN, Transformer, and hybrid architectures confirm that DCIC_RDP achieves superior BD-PSNR over all perceptual codecs, while DCIC_RP matches dedicated perception-oriented methods in BD-FID, validating the practical value of full RDP surface navigation.
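
The two constraints can be illustrated with a toy example. The sketch below assumes a uniform scalar quantizer in place of the learned codec and plain mean squared error for both gaps; it only shows what each constraint measures, not how DCIC steers the diffusion process. All names and values are hypothetical.

```python
def codec(signal, step=0.5):
    """Toy 'encode then decode': uniform scalar quantization (an assumption)."""
    return [round(v / step) * step for v in signal]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distortion_gap(restored, base_recon):
    """Distortion constraint: how far the restoration is from the base reconstruction."""
    return mse(restored, base_recon)

def idempotence_gap(restored, base_recon, step=0.5):
    """Idempotence constraint: re-encode the restoration and compare with the base reconstruction."""
    return mse(codec(restored, step), base_recon)

source = [0.12, 0.48, 0.91, 1.30]
base = codec(source)              # base codec reconstruction
restored = [0.1, 0.4, 0.9, 1.4]   # off the quantization grid, but in the same bins as `base`
drifted = [0.3, 0.4, 0.9, 1.4]    # first value falls into a different bin

assert idempotence_gap(restored, base) == 0.0
assert idempotence_gap(drifted, base) > 0.0
```

Here `restored` differs from `base` (nonzero distortion gap) yet re-encodes to exactly the same reconstruction, which is the property the idempotence constraint asks for.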

[CV-28] VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

Link: https://arxiv.org/abs/2606.13364
Authors: Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers - velocity consistency and over-parameterized representation alignment - to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on real video datasets Fit3D and NBA the method learns to generate motions consistently preferred by humans, with strong quantitative results.
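
The abstract does not give the exact form of the depth-weighted reprojection loss. The sketch below is one plausible reading, assuming a pinhole camera and a weight of depth divided by focal length, which converts a 2D error in image coordinates back into approximate metric units. The function names and the weighting are assumptions for illustration only.

```python
def project(joint, focal=1.0):
    """Pinhole projection of a 3D joint (x, y, z) onto the image plane."""
    x, y, z = joint
    return (focal * x / z, focal * y / z)

def depth_weighted_reprojection_loss(pred_joints, target_2d, focal=1.0):
    """Mean squared 2D error, with each joint's error scaled by its predicted depth."""
    total = 0.0
    for joint, (u_t, v_t) in zip(pred_joints, target_2d):
        u, v = project(joint, focal)
        weight = joint[2] / focal  # assumed weighting: depth / focal length
        total += (weight * (u - u_t)) ** 2 + (weight * (v - v_t)) ** 2
    return total / len(pred_joints)

# Hypothetical ground-truth joints, used only to produce the 2D keypoints.
true_joints = [(0.0, 0.0, 2.0), (0.5, 1.0, 4.0)]
keypoints_2d = [project(j) for j in true_joints]

# Prediction with the first joint shifted sideways by 0.25 at the correct depth.
pred_joints = [(0.25, 0.0, 2.0), (0.5, 1.0, 4.0)]
loss = depth_weighted_reprojection_loss(pred_joints, keypoints_2d)
```

In this toy case the weighted 2D error of the first joint equals its lateral 3D offset (0.25), which gives some intuition for why a depth-weighted 2D loss can stand in for 3D supervision when the depth is right.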

[CV-29] JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space

Link: https://arxiv.org/abs/2606.13345
Authors: Xinnan Zhu, Ruijie Xu, Jiayu Ying, Daoguo Dong, Jiachen Xu, Yuan Xie, Xin Tan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Project page: this https URL

Abstract:Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction during editing, we build on a unified RGB-geometry reconstruction-generation latent space and adapt it to feed-forward 3D scene editing. The resulting framework, JointEdit3D, performs asymmetric latent inpainting by observing only a single edited RGB reference latent and generating the remaining RGB views and edited geometry latent under source-scene anchoring. JointEdit3D introduces a dedicated SceneAnchor Branch to inject source-scene structure without forcing direct copying, and adopts edit/background-aware losses to balance edited-region fidelity with unedited-content preservation. To address the lack of paired resources for standardized 3D scene editing evaluation, we introduce SceneEdit3D-15K, a dataset with 15K paired editing samples and renderer-provided 3D annotations, together with SceneEdit3D-Bench, a curated 100-sample benchmark. Experiments show that JointEdit3D improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation.

[CV-30] Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis

Link: https://arxiv.org/abs/2606.13341
Authors: Gabriel Steele, Alzahra Altalib, Alessandro Perelli
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Comments: 4 pages, 3 figures, 1 table, 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

Abstract:We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the loss of both the generator and discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models for CT-PET image synthesis. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.

[CV-31] OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

Link: https://arxiv.org/abs/2606.13332
Authors: Felix Tristram, Ege Özsoy, Christian Benz, Marcel Walch, Ghazal Ghazaei, Nassir Navab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. Converting their frame-wise relational predictions into temporally extended, fine-grained actions, however, is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when adding explicit modeling through Graph Neural Networks. We therefore introduce a vision-only temporal model that outperforms graph-based methods significantly when using all available egocentric video as input. Building on this model, we also introduce a novel multi- to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.

[CV-32] Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI

Link: https://arxiv.org/abs/2606.13315
Authors: Esra Ergün, Hersh Chandarana, Dan Sodickson, Gözde Ünal
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Self-supervised foundation models have shown strong promise in medical imaging. However, existing MRI foundation-model studies have primarily emphasized segmentation and dense prediction tasks, while systematic investigation of self-supervised foundation models for MRI-based disease detection remains limited. In this work, we investigate two major self-supervised pretraining paradigms for MRI-based disease detection: reconstruction-based learning via Masked Autoencoders (MAE) and predictive representation learning via Joint Embedding Predictive Architectures (JEPA). We study the role of auxiliary objectives by introducing a novel spectral-domain reconstruction loss for MAE to enhance sensitivity to fine-grained anatomical structure, and by integrating variance–covariance regularization (VCR) within our JEPA framework to encourage decorrelated latent representations. Our models are pretrained on heterogeneous single-contrast MRI volumes in a contrast-agnostic setting, without modality concatenation. Across five downstream disease detection tasks, our results highlight the importance of self-supervised objective design for medical foundation model pretraining, demonstrating that the downstream benefit of each objective is determined by its relevance to the task’s structure. Specifically, spectral regularization yields the largest improvements when the downstream discriminative signal is characterized by strong high-frequency anatomical structures, while covariance regularization is most beneficial when discriminative information spans multiple decorrelated feature dimensions. MAE with spectral-domain supervision consistently achieves superior downstream performance for MRI-based disease detection. These findings suggest that self-supervised objectives in medical imaging encode specific biases, and their downstream benefit is fundamentally conditioned on the task’s structure.

[CV-33] MagPlus: Bridging Micro-to-Regular Facial Expressions through Learnable Magnification

Link: https://arxiv.org/abs/2606.13312
Authors: Sliman Jammal, Andrei Sharf
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Facial micro-expressions are subtle and short-lived facial movements that provide important cues about genuine human emotions. However, modeling and generating them remains difficult because annotated micro-expression data is limited and the underlying facial motions are extremely weak. Existing micro-expression generation methods therefore often suffer from limited quality, weak robustness, and poor generalization. We propose MagPlus, a transferable micro-expression processing pipeline that connects micro-expression analysis with standard facial animation models. Instead of training a dedicated generator from scratch, MagPlus learns to magnify subtle facial motions into the range of regular facial expressions, transforming micro-expressions into signals that are compatible with existing facial expression processing models. The magnified sequence is then used by a standard facial expression model for tasks such as transfer and synthesis. A complementary DeMagPlus module then restores the generated motion back to realistic micro-expression intensity levels while preserving the synthesized dynamics. We evaluate the framework using four facial animation models: FOMM, FSRT, MetaPortrait, and EmoPortraits. None of these models are trained on micro-expression data. Experiments show that MagPlus-DeMagPlus enables pretrained macro-expression models to generate more realistic micro-expression motion without retraining the backbones.

[CV-34] ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

Link: https://arxiv.org/abs/2606.13304
Authors: Salaheldin Mohamed, M. Hamza Mughal, Rishabh Dabral, Christian Theobalt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Speech-driven talking character animation seeks to generate life-like portrait videos that convey natural conversation behavior, aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved realism in video-based animation, achieving both accurate lip articulation and expressive behavior remains challenging. Existing approaches typically trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, yielding animations that are either accurate yet rigid, or expressive but poorly synchronized. We address this challenge by proposing ReFree-S2V, a flow-matching speech-to-portrait animation framework that builds upon a pretrained video generation model to achieve fine-grained speech articulation and high-level expressive cues in speech-driven portrait animation. This model introduces a multi-level speech representation capturing phonetic and prosodic information at both local and global granularities. These representations are selectively injected into transformer blocks via learnable level selectors, enabling both accurate lip synchronization and natural expressive motion. To achieve natural head movements, we further introduce a novel reward-free reinforcement learning scheme into flow-matching training to discourage perceptually implausible motion without relying on handcrafted synchronization metrics or reward models, or the high cost of human preference annotation. Extensive experiments demonstrate that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.

[CV-35] DuET: Dual Expert Trajectories for Diffusion Image Editing

Link: https://arxiv.org/abs/2606.13303
Authors: Lidia Troeshestova, Alexander Ustyuzhanin, Sergey Kastryulin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent diffusion editors perform diverse instruction-based edits while conditioning on the source image at every denoising step. Yet persistent source-image conditioning can limit how fully an edit is executed and how natural the result appears, especially when the target scene diverges substantially from the input. We introduce DuET (Dual Expert Trajectories), a training-free inference method that temporarily relaxes source-image conditioning by transitioning through a text-to-image phase before returning to edit mode, allowing the denoising trajectory to move toward the target distribution while retaining the structural benefits of image-conditioned editing. Without modifying model weights or increasing sampling cost, DuET consistently improves instruction relevance, semantic fidelity, and perceptual quality across diverse models and benchmarks. In some cases, these gains come with a modest reduction in source-image preservation, revealing a predictable trade-off between source preservation and edit fidelity.

[CV-36] HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Link: https://arxiv.org/abs/2606.13289
Authors: Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated at the 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.

[CV-37] Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing ICME

Link: https://arxiv.org/abs/2606.13275
Authors: Anugrah Aidin Yotolembah, Novanto Yudistira, Gembong Edhi Setyawan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted to ICME workshop on AIART 2026

Abstract:This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is trained on 24 seen provinces, validated on 6 seen provinces, and evaluated on 8 unseen provinces. The framework combines a frozen CLIP ViT-B/32 image encoder, a CLIP text encoder, a BERT text encoder, and an LSTM caption decoder. During inference, unseen-province labels and captions are unavailable, and retrieval uses only captions from training provinces. No unseen-province image, label, or caption is used during training, validation, or retrieval-bank construction. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablation results show that retrieval improves cultural vocabulary recovery with a 19.3% METEOR gain, while human evaluation confirms stronger cultural accuracy and fluency. The results demonstrate the effectiveness of retrieval-augmented domain adaptation for culturally grounded caption generation in low-resource heritage settings. The dataset is publicly available at this https URL.

[CV-38] Towards More General Control of Diffusion Models Using Jeffrey Guidance

Link: https://arxiv.org/abs/2606.13240
Authors: Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Jes Frellsen, Pierre-Alexandre Mattei
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. However, beyond simple cases such as conditional sampling, the target distribution is often left implicit, defined only through a sampling rule or a heuristic energy function. To address this, we propose Jeffrey guidance, a principled framework that extends diffusion-model control to applications beyond what standard guidance can express. It leverages Jeffrey’s rule of conditioning to update marginal distributions towards a prescribed target, preserving the conditional structure and minimally perturbing the joint distribution. We first demonstrate Jeffrey guidance by targeting a prescribed embedding distribution. With Inception embeddings as the target, this leads to substantial reductions in FID on both CIFAR-10 and FFHQ. We further apply Jeffrey guidance to fairness on CelebA-HQ, updating an unconditional diffusion model to enforce independence between attributes.
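
Jeffrey's rule of conditioning itself is easy to state on a discrete joint distribution: replace the marginal of one variable with a prescribed target while keeping every conditional P(x | y) unchanged, so that P'(x, y) = P(x | y) Q(y). The sketch below shows the rule on a small table; it does not attempt to reproduce how the paper applies it during diffusion sampling, and the data and function names are hypothetical.

```python
def marginal_y(joint):
    """Marginal P(y) of a joint distribution stored as {(x, y): probability}."""
    out = {}
    for (x, y), p in joint.items():
        out[y] = out.get(y, 0.0) + p
    return out

def jeffrey_update(joint, target_y):
    """Jeffrey's rule: P'(x, y) = P(x | y) * Q(y), where Q is the target marginal."""
    p_y = marginal_y(joint)
    return {(x, y): p * target_y[y] / p_y[y] for (x, y), p in joint.items()}

# Toy joint over x in {"a", "b"} and y in {0, 1}.
joint = {("a", 0): 0.25, ("b", 0): 0.25, ("a", 1): 0.125, ("b", 1): 0.375}
target = {0: 0.25, 1: 0.75}

updated = jeffrey_update(joint, target)
new_marginal = marginal_y(updated)  # matches `target`

# The conditional P(x = "a" | y = 1) is the same before and after the update.
before = joint[("a", 1)] / marginal_y(joint)[1]
after = updated[("a", 1)] / new_marginal[1]
```

The marginal over `y` moves to the target, while the way `x` depends on `y` is left untouched, which is the sense in which the joint is only minimally perturbed.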

[CV-39] Distributional Loss for Robust Classification ICANN2026

Link: https://arxiv.org/abs/2606.13223
Authors: Kathleen Anderson, Thomas Martinetz
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICANN 2026

Abstract:This paper proposes a novel loss concept for supervised classification tasks. Rather than enforcing a direct mapping from each input sample to a single assigned label, we define an optimization objective over all classifier outputs as a bimodal Gaussian distribution. This softer target formulation implicitly captures class ambiguity, mitigates overfitting, and encourages the learning of more robust decision boundaries, all without requiring additional label information. Experimental results demonstrate consistent improvements in robustness, with particularly pronounced gains in low-data regimes, while requiring only minimal modifications to standard training pipelines.

[CV-40] Visual Place Recognition in Forests with Depth-Aware Distillation ICRA

Link: https://arxiv.org/abs/2606.13206
Authors: Walter Nedov, Saimunur Rahman, Kavindie Katuwandeniya, David Hall, Kaushik Roy, Peyman Moghadam
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: IEEE ICRA Workshop on Field Robotics 2026

Abstract:Visual place recognition in natural forest environments remains challenging due to repetitive vegetation, weak structural cues, and significant appearance variation across traversals. To address this limitation, this paper proposes a lightweight depth-aware distillation framework that injects geometric cues into a DINOv2-based place recognition model, while maintaining its pre-trained descriptor space. Evaluated on the recent WildCross benchmark, the proposed approach yields gains over an appearance-only counterpart, providing robustness to appearance variations. These results demonstrate the importance of depth as a strong complementary modality for place recognition in natural environments and identify depth-aware distillation as a promising direction for more robust forest perception.

[CV-41] Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

链接: https://arxiv.org/abs/2606.13188
作者: Abhishek H S,Akash Ganamukhi,Abhimanyu Suresh,Aditya G Hiremath,Prasad B Honnavalli,Adithya Balasubramanyam
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow – segmenting the image, running Marching Cubes, and then manually cleaning up the result – is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient’s cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass – no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.

[CV-42] Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

链接: https://arxiv.org/abs/2606.13156
作者: Animesh Tripathy,Aswanth Krishnan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31-percentage-point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model’s own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
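For readers unfamiliar with the metrics quoted above, here is a minimal Python sketch of box IoU and Acc@τ as they are commonly defined for referring expression comprehension. The (x1, y1, x2, y2) box format, the `>=` comparison, and the names `box_iou` and `acc_at` are illustrative assumptions, not taken from the paper. A "simple IoU reward" would plausibly be a value of this kind.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds, gts, thr):
    """Fraction of predictions whose IoU with the ground truth reaches thr.

    Assumption: a hit is IoU >= thr; some benchmarks use a strict '>'.
    """
    hits = sum(box_iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts)
```

Under these definitions, Acc@0.9 is much stricter than Acc@0.5 because a box must almost coincide with the ground truth to count as a hit.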

[CV-43] An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors

链接: https://arxiv.org/abs/2606.13136
作者: Saurabh Kumar,Nutan Sairam Yenneti
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Pixel-bin image sensors are becoming the default choice for smartphone cameras due to their resolution vs light-gathering trade-off. However, their larger inter-color separation compared to the Bayer color filter array (CFA) makes them challenging to demosaic. Furthermore, existing deep learning-based demosaicing methods are CFA-specific, requiring multiple individual models that take up precious onboard resources and demand larger development and maintenance efforts. In this work, we propose a modular unified architecture for demosaicing various pixel-bin sensors that provides higher image quality while being extensible and lightweight. Additionally, to enable plug-and-play operation, we introduce a learning-free CFA-identification module to detect the CFA type of raw data accurately.

[CV-44] Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

链接: https://arxiv.org/abs/2606.13135
作者: Elena S. Kozachok,Sergey S. Seregin,Aleksandr V. Kozachok,Ilya P. Latyshev,Oleg I. Samovarov
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 28 pages, 8 figures, 10 tables

点击查看摘要

Abstract:Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p < 0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
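The two-stage cascade with a tunable triage threshold can be illustrated with a short Python sketch. This shows only the decision logic described in the abstract; the function name `cascade_predict`, the default threshold, and the probability inputs are hypothetical stand-ins for the outputs of the two trained models.

```python
def cascade_predict(p_malignant, subtype_probs, triage_threshold=0.5):
    """Two-stage decision: binary triage, then MEL/SCC/BCC differentiation.

    p_malignant: stage-1 probability that the lesion is malignant.
    subtype_probs: dict mapping "MEL", "SCC", "BCC" to stage-2 probabilities.
    triage_threshold: operating point of the triage stage (assumed default).
    """
    # Stage 1: lesions below the threshold are reported as benign.
    if p_malignant < triage_threshold:
        return "benign"
    # Stage 2: pick the most probable malignant subtype.
    return max(subtype_probs, key=subtype_probs.get)
```

Lowering `triage_threshold` sends more lesions to the second stage, which raises sensitivity at the cost of specificity. A single-stage argmax over four classes has no comparable knob.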

[CV-45] Fully Distributed Multi-View 3D Tracking in Real-Time

链接: https://arxiv.org/abs/2606.13127
作者: Byron Hernandez,Fangyu Li,Aotian Wu,Paul J. Shin,Kaustubh Purandare,Henry Medeiros
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 4 figures, 2 algorithms, 4 tables

点击查看摘要

Abstract:Multi-camera tracking with overlapping fields of view typically relies on centralized fusion, which creates computational bottlenecks that prevent deployment at scale. We present MV3DT, a fully distributed framework for real-time multi-view 3D tracking that achieves accurate identity propagation and occlusion recovery through peer-to-peer coordination, eliminating the need for central aggregation. Each camera node executes a lightweight modular pipeline comprising monocular 3D perception, distributed multi-view association, and collaborative fusion via lightweight messaging. MV3DT achieves 94.3% IDF1 and 93.3% MOTA on WILDTRACK, competitive with state-of-the-art centralized methods, while demonstrating superior scalability by sustaining 30 FPS on 100 cameras with less than 10 ms inter-camera latency and only 2.2% communication overhead. MV3DT operates in a zero-shot regime given camera calibrations, requiring no scene-specific learning and making it directly deployable in new environments. These results establish MV3DT as a practical solution for real-time multi-view tracking in large-scale overlapping camera networks.

[CV-46] PP-OCRv6: From 1.5M to 34.5M Parameters Surpassing Billion-Scale VLMs on OCR Tasks

链接: https://arxiv.org/abs/2606.13108
作者: Yubo Zhang,Xueqing Wang,Manhui Lin,Yue Zhang,Penglongyi Deng,Ting Sun,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Changda Zhou,Hongen Liu,Suyin Liang,Cheng Cui,Yi Liu,Dianhai Yu,Yanjun Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9× faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.

[CV-47] Unified MRI Brain Image Translation via Hierarchical Tumor Structure Comparison

链接: https://arxiv.org/abs/2606.13096
作者: Yupeng Cai,Jia Wei,Jianlong Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal MRI brain image translation via available modalities holds significant practical importance in modern medicine, providing robust support for early diagnosis, treatment planning, and outcome assessment of diseases. For this purpose, it is important to ensure the fidelity of the tumor regions after translation. However, existing brain image translation methods ignore the structure information of different tumor regions, which could assist translation models in enhancing the quality and clinical applicability of the translated images. In this work, we propose a novel translation model called HTSCGAN, which is a unified multi-modal brain image translation generative adversarial model integrating the structural information within tumor regions with the aim of improving the quality of brain image translation. Specifically, the generator employs three Patch Contrast Modules (PCMs) with different patch sizes to capture the hierarchical structural information of the tumor regions. In addition, a pretrained Patch Classifier (PC) and a pretrained Structure-Aware Encoder (SAE) are employed to derive the generated image containing the same tumor region structure as the ground truth image via patch classification loss and tumor perceptual loss, respectively. The experiments on BraTS2020 and BraTS2021 demonstrate strong performance of our model in both translation tasks and downstream segmentation tasks, highlighting its effectiveness in enhancing the quality and clinical relevance of the translated brain images. Our code is available at this https URL.

[CV-48] LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

链接: https://arxiv.org/abs/2606.13061
作者: Peixi Wu,Biao Yang,Feipeng Ma,Bosong Chai,Bo Lin,Wei Yuan,Fan Yang,Tingting Gao,Hebei Li,Xiaoyan Sun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm suffers from two core limitations: (i) autoregressive CoT reasoning incurs high computational cost, making it impractical for low-latency retrieval; and (ii) embedding performance is heavily coupled with CoT annotation quality, making large-scale training unreliable. These raise fundamental questions: Is textual CoT the optimal form of reasoning for embedding, and can effective embedding reasoning be accomplished in latent space? To this end, we propose LaME (Latent Reasoning Multimodal Embedding), which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck. LaME employs K learnable reason tokens as a fixed-capacity bottleneck, completing all reasoning within a single forward pass. The two weak supervision signals structurally decouple contrastive from autoregressive objectives and eliminate dependence on CoT annotations, while a two-stage training pipeline ensures stable convergence. Experiments on MMEB-v2 and MRMR show that LaME achieves competitive performance, surpassing some explicit CoT-based models, while delivering 60x faster inference than explicit CoT methods and 2x faster than latent baselines with throughput comparable to discriminative embedding models. Code will be released.

[CV-49] Augmentation techniques for video surveillance in the visible and thermal spectral range

链接: https://arxiv.org/abs/2606.13042
作者: Vanessa Buhrmester,Ann-Kristin Grosselfinger,David Munch,Michael Arens
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:In intelligent video surveillance, cameras record image sequences by day and by night, which commonly demands different sensors, and combining them is a usual way to improve performance. We focus on the case in which a long-wave infrared camera records continuously, a second camera records in the visible spectral range during daytime, and an intelligent algorithm monitors the captured imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, visible-spectrum images differ from thermal infrared ones in two ways: they contain color and distinct texture information, and they carry no information about the thermal radiation emitted by objects. Although color can provide valuable information for classification tasks, effects such as varying illumination and sensor-specific characteristics remain significant problems. Moreover, obtaining sufficient and practical thermal infrared datasets for training a deep neural network is still a challenge. Training with data from the visible spectral range could therefore be advantageous, particularly if the data to be evaluated contains both visible and infrared imagery. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques…

[CV-50] SeamEdit: A Black-Box VLM-Agnostic Pipeline for Large-Image Semantic Editing

链接: https://arxiv.org/abs/2606.13041
作者: Xiangyu Lyu,Dan Lei
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
备注: 19 pages, 9 figures, 2 tables

点击查看摘要

Abstract:Semantic region editing for large images must satisfy two requirements at the same time: high generative quality and natural integration with surrounding content. Some related methods rely on white-box models and leave the strong generation capability of closed-source models underexplored. Directly applying closed-source models to tiled editing, however, introduces several failure modes: semantic deformation, canvas-level alignment drift, and visible seam artifacts. This paper presents SeamEdit, a training-free and model-agnostic pipeline that treats any VLM with inpainting capability as a black-box oracle. SeamEdit mitigates these issues through a five-stage post-hoc pipeline: overlay-based tile decomposition, black-box VLM inpainting, geometric and color-consistency correction, seam-risk-based multi-candidate ranking, and dynamic-programming curved seam fusion. The pipeline reduces seam visibility and supports semantic modification of arbitrary tile regions.
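To give a feel for what a dynamic-programming seam search looks like, the sketch below finds the cheapest top-to-bottom path through a per-pixel cost map. It is the classic seam-carving recurrence, used here as an assumed stand-in: the abstract does not specify SeamEdit's actual cost function or fusion step, and the name `min_cost_seam` is hypothetical.

```python
def min_cost_seam(cost):
    """Return one column index per row tracing the cheapest vertical seam.

    cost: list of rows, each a list of non-negative per-pixel seam costs
    (for example, the color difference between two overlapping tiles).
    The seam may move at most one column left or right between rows.
    """
    h, w = len(cost), len(cost[0])
    acc = [row[:] for row in cost]  # accumulated cost table
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(0, x - 1), min(w, x + 2)
            acc[y][x] += min(acc[y - 1][lo:hi])
    # Backtrack from the cheapest cell in the last row.
    x = min(range(w), key=lambda i: acc[h - 1][i])
    seam = [x]
    for y in range(h - 1, 0, -1):
        lo, hi = max(0, x - 1), min(w, x + 2)
        x = min(range(lo, hi), key=lambda i: acc[y - 1][i])
        seam.append(x)
    seam.reverse()
    return seam
```

Because the path can bend by one pixel per row, it can route around high-contrast structures instead of cutting straight through them, which is the intuition behind a curved rather than straight seam.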

[CV-51] TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

链接: https://arxiv.org/abs/2606.13035
作者: Yu Meng,Xiangyang Luo,Letian Li,Wenyuan Jiang,Chen Gao,Xinlei Chen,Yong Li,Xiao-Ping Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
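One simple way to "align statistics to a trusted distribution" is mean and standard-deviation matching, sketched below in Python for a single feature channel. This is an assumption made for illustration: the abstract does not say which statistics TAME aligns or how strongly, and the name `align_statistics` and the `eps` parameter are hypothetical.

```python
from statistics import fmean, pstdev


def align_statistics(recalled, trusted, eps=1e-6):
    """Shift and scale `recalled` so its mean/std match those of `trusted`.

    recalled: feature values of newly recalled memory tokens (one channel).
    trusted: feature values drawn from the trusted context (same channel).
    eps: small constant guarding against division by zero.
    """
    mu_r, sd_r = fmean(recalled), pstdev(recalled)
    mu_t, sd_t = fmean(trusted), pstdev(trusted)
    # Normalize with the recalled statistics, then re-color with the trusted ones.
    return [(v - mu_r) / (sd_r + eps) * sd_t + mu_t for v in recalled]
```

The relative ordering of values within the recalled tokens is preserved; only their overall offset and scale move toward the trusted context, which is one reading of a "light" edit.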

[CV-52] SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking

链接: https://arxiv.org/abs/2606.13033
作者: Alexander Holmberg
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment-uncertainty signal fires. The base tracker’s output is modified only when the VOS model makes a confident prediction that contradicts the base tracker’s identity assignment; weak or inconclusive predictions preserve the base output. The method is training-free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3-Deep-EIoU with global track association achieves state-of-the-art performance on the benchmark with 86.8 HOTA.
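The dispatch-and-override rule described above can be written as a few lines of Python. This is a sketch of the stated logic only; the function name `resolve_identity`, the two gate values, and the scalar uncertainty and confidence inputs are hypothetical, since the abstract does not define the uncertainty signal or its thresholds.

```python
def resolve_identity(base_id, vos_id, vos_confidence, uncertainty,
                     uncertainty_gate=0.5, confidence_gate=0.9):
    """Decide which identity to keep for one detection.

    base_id: identity assigned by the lightweight base tracker.
    vos_id: identity proposed by the VOS model, or None if it was not run.
    vos_confidence: confidence of the VOS proposal in [0, 1].
    uncertainty: assignment-uncertainty signal of the base tracker.
    """
    # Easy window: the expensive VOS model is never consulted.
    if uncertainty < uncertainty_gate:
        return base_id
    # Hard window: override only on a confident contradiction.
    if (vos_id is not None and vos_id != base_id
            and vos_confidence >= confidence_gate):
        return vos_id
    # Weak or inconclusive VOS predictions preserve the base output.
    return base_id
```

Since both models are only queried for identities and scores, either one can be swapped out without touching this rule, matching the black-box framing in the abstract.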

[CV-53] GeoCFNet: Geometry-Aware Confidence Field Network for Robot-Assisted Endoscopic Submucosal Dissection

链接: https://arxiv.org/abs/2606.13032
作者: Rui Tang,Guankun Wang,Long Bai,Haochen Yin,Huxin Gao,Jiewen Lai,Jiazheng Wang,Hongliang Ren
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE ICIA 2026

点击查看摘要

Abstract:Advanced surgical robotics has made robot-assisted endoscopic submucosal dissection (ESD) a promising approach for the en-bloc resection of large lesions, with the potential to reduce recurrence and improve long-term outcomes. However, the technical complexity and risk of complications in ESD demand stable and precise visual guidance to maintain an accurate dissection corridor and a safe tissue margin. Dense confidence fields provide an effective representation for this purpose by describing both the preferred dissection region and its spatial transition to surrounding tissue. However, reliable confidence field estimation remains challenging in dynamic endoscopic scenes due to smoke, specular highlights, tissue deformation, weak texture, and the thin geometric structure of the target region. To address these challenges, we formulate dissection guidance as a geometry-aware confidence field estimation problem and propose GeoCFNet, a geometry-aware confidence field network built on a pretrained DINOv3 backbone. GeoCFNet integrates a Token-Differentiated Fusion module to aggregate class-token context with dense patch representations, a SegFormer decoder for confidence regression, and Geometry-Aware Spatial Regularization (GASR) to preserve spatial coherence and local geometric transitions. Experimental results show that GeoCFNet achieves RMSE 0.0480, PSNR 27.1995, SSIM 0.3397, and CC 0.2466, indicating accurate and geometrically stable confidence field estimation for robot-assisted ESD guidance.

[CV-54] A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

链接: https://arxiv.org/abs/2606.13030
作者: Haoran Zhang,Haokun Zhang,Pengyu Liu,Yujia Zhang,Weibao Xue,Yanbin Hao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 2 figures

点击查看摘要

Abstract:Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. A temperature-scaled soft-voting mechanism is finally utilized to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing the 4th place.
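Two of the ingredients named above, square-root smoothed class weighting and temperature-scaled soft voting, have common textbook forms, sketched here in Python. The exact formulas used by the authors are not given in the abstract, so the inverse-square-root weighting, the mean-of-probabilities fusion, and all function names and default values are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 flattens the distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_modality_logits, temperature=2.0):
    """Average temperature-scaled class probabilities across modalities."""
    probs = [softmax(l, temperature) for l in per_modality_logits]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]


def sqrt_class_weights(counts):
    """Weights proportional to 1/sqrt(count), rescaled to average 1."""
    raw = [1.0 / math.sqrt(c) for c in counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]
```

With class counts of 100 and 1, plain inverse-frequency weighting would favor the rare class by a factor of 100, whereas the square-root variant favors it by a factor of 10. That gentler ratio is what protects tail classes without swamping the head classes.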

[CV-55] Comparing Commercial Depth Sensor Accuracy for Medical Applications

链接: https://arxiv.org/abs/2606.13028
作者: Pit Henrich,Maximilian Weiherer,Franziska Hansen,Bernhard Egger,Franziska Mathis-Ullrich
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 4 Pages

点击查看摘要

Abstract:Depth estimation has numerous medical and surgical applications. We benchmark four depth sensors on a porcine bone specimen, a porcine belly specimen, and a silicone kidney phantom using stylus-sampled references. These objects contain several real-world challenges, including homogeneous surfaces, specular surfaces, and subsurface scattering. The comparison includes stereo, structured-light, and time-of-flight sensors at a distance of approximately 50 cm. Specifically, the Intel RealSense D405 (Intel RealSense, United States), PMD Flexx2 (pmdtechnologies, Germany), Stereolabs ZED 2i (Stereolabs, France), and Zivid 2M+ 60 (Zivid, Norway) are compared. The Zivid 2M+ 60 performed best across all objects and metrics considered in this work. The ZED ranked second for real tissue, but last on the phantom.

[CV-56] Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

链接: https://arxiv.org/abs/2606.13022
作者: Ziyi Chang,Kanglei Zhou,Xiaohui Liang,Hubert P. H. Shum
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Adversarial attacks on skeletal human action recognition have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality post-attack, and thereby are inherently perceptible with recent advancements in S-HAR systems. We discover that this degradation stems from the gap between empirical and true risks during the optimization process of previous adversarial attacks. To address this issue, we propose an attack where adversarial motions are obtained without compromising their motion quality. To minimize the risk gap and preserve motion quality, we propose a distribution-based adversarial attack method without introducing noise-like perturbations. To faithfully evaluate the motion quality, we propose a new metric that aligns with human perception on real-world naturalness. Experiments have been conducted on the state-of-the-art S-HAR methods across two datasets, demonstrating the superiority of our method in both the attack success rate and the post-attack motion quality through qualitative and quantitative analyses. The success of our quality-preserving attack application and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further enhancements in this domain.

[CV-57] A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

链接: https://arxiv.org/abs/2606.12988
作者: Manex Atxa,Bruno Simoes,Julen Balzategui
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures, conference 24CMH

点击查看摘要

Abstract:This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide a fixed viewpoint and thereby restrict the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

[CV-58] Diffusion Transformer World-Action Model for AV Scene Prediction

链接: https://arxiv.org/abs/2606.12987
作者: Ruslan Sharifullin,Benjamin Jiang,Kai Xi Chew
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 10 pages, 9 figures, 2 tables

点击查看摘要

Abstract:Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field’s standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to 256×256 frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the x_0 objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression (4.8× better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman ρ = 0.81, vs. -0.18 for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter “jump” model that recovers full ground-truth motion magnitude (1.02× GT), where single-pass models capture less than half.

[CV-59] Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

链接: https://arxiv.org/abs/2606.12985
作者: Sathira Silva,Abrham Kahsay Gebreselasie,Muhammad Umer Sheikh,Kartik Kuckreja,Daniel Harari,Muhammad Haris Khan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at this https URL.

[CV-60] Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

Link: https://arxiv.org/abs/2606.12981
Authors: Muhammad Shahbaz,Shaurya Agarwal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird’s-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.
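
For readers unfamiliar with the generalized IoU (GIoU) term named in the abstract, here is a minimal sketch of the standard GIoU definition. It assumes axis-aligned 2D boxes with positive area in `(x1, y1, x2, y2)` form; the detector's actual box parameterization (for example rotated 3D boxes) is not given in the abstract, and the function names are hypothetical.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection is empty when the boxes do not overlap on either axis.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # IoU minus the fraction of the enclosing box not covered by the union.
    return inter / union - (hull - union) / hull


def giou_loss(box_a, box_b):
    """Regression loss: 0 for identical boxes, up to 2 for distant boxes."""
    return 1.0 - giou(box_a, box_b)
```

Unlike plain IoU, the enclosing-box term keeps the loss informative when the predicted and target boxes do not overlap at all.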

[CV-61] Trajectory-Level Redirection Attacks on Vision-Language-Action Models

Link: https://arxiv.org/abs/2606.12978
Authors: Gokul Puthumanaillam,Vardhan Dongre,Pranay Thangeda,Hooshang Nayyeri,Dilek Hakkani-Tür,Melkior Ornik
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments:

Abstract:Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still appears to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as command-preserving trajectory redirection, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot’s final physical outcome. Project website: this https URL

[CV-62] Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

Link: https://arxiv.org/abs/2606.12977
Authors: Jianwei Fei,Yunshu Dai,Zhihua Xia,Xiaochun Cao,Jiantao Zhou,Alessandro Piva,Benedetta Tondi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.

[CV-63] YOLO-AMC: An Improved YOLO Architecture with Attention Mechanisms for Building Crack Detection

Link: https://arxiv.org/abs/2606.12958
Authors: Ching-Yu Tsai,Chia-Min Lin,Chih-Hsiang Yang,Yung-Che Wang,Jen-Shiun Chiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 8 tables, 6 figures. Expanded version of IET ICETA 2025 conference paper

Abstract:Crack detection plays an important role in infrastructure inspection and Structural Health Monitoring (SHM). However, cracks typically appear as thin, low-contrast structures and are easily affected by background noise, posing challenges for existing object detection models. This study proposes an improved YOLO-based architecture with integrated attention mechanisms, termed YOLO-AMC (YOLO with Attention Mechanisms for Crack Detection), to enhance automated crack detection performance. Based on YOLOv11, the original C2PSA module is removed, and multiple attention mechanisms, including Global Attention Mechanism (GAM), Residual Convolutional Block Attention Module (Res-CBAM), and Shuffle Attention (SA), are introduced into the multi-scale feature fusion layers of the Neck to strengthen cross-scale feature integration. Experimental results demonstrate that YOLO-AMC consistently outperforms baseline models YOLOv11n and YOLOv8n across multiple evaluation metrics. Among the evaluated attention modules, GAM achieves the best detection performance, obtaining mAP@0.5 = 0.9917 and mAP@0.5:0.95 = 0.9506 on the test dataset, which are higher than those of YOLOv11 (0.9833 / 0.9112) and YOLOv8 (0.9707 / 0.8921). Furthermore, while maintaining a computational complexity of 7.6 GFLOPs, the proposed model achieves 110.95 FPS on an NVIDIA RTX 4090 platform and approximately 5 FPS on a Raspberry Pi 5 edge device, demonstrating a favorable trade-off between accuracy and deployment efficiency. The implementation code for this study is available on GitHub at this https URL.

[CV-64] OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

Link: https://arxiv.org/abs/2606.12953
Authors: Ibrahim Gulluk,Max Van Puyvelde,Olivier Gevaert
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Medical Imaging with Deep Learning (MIDL) 2026, Short Paper Track

Abstract:We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757) among BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code, and an interactive demo is publicly available, as a reproducible baseline for the community.

[CV-65] ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection

Link: https://arxiv.org/abs/2606.12949
Authors: Fatima Qaiser,Bisma Tahir,Muhammad Abid Mughal,Nauman Shamim
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visualization-based malware detection maps raw binary bytes to grayscale images and applies learned visual classifiers, providing an evasion-resistant and disassembly-free alternative to conventional analysis pipelines. However, executable packing remains a critical failure mode: packed binaries produce high-entropy images that obscure the structural patterns these models rely on. Because packing is also prevalent in benign software (e.g., for compression or copy protection), packing state alone is not a reliable indicator of maliciousness, and existing approaches do not address this challenge within a unified supervised framework. We present ViPER, a Vision-based Packing-Aware Encoder for Robust malware detection. ViPER builds on a LoRA-adapted ViT-B/14 backbone with a dual-head architecture that jointly learns malware classification and packing detection. A packing-aware gating mechanism conditions malware predictions on the inferred packing state, enabling distinct decision boundaries for packed and unpacked inputs. To address packing label skew during training, we employ frequency-weighted losses with stratified sampling over joint class-packing strata. Evaluated on 200,000 Windows PE byteplot images, ViPER achieves a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279, outperforming representative state-of-the-art baselines across all primary metrics, while attaining a packing detection AUC of 0.9949.

[CV-66] MAMVI: 3D Test-Time Adaptation via Masked Multi-View Point Clouds ICPR2026

Link: https://arxiv.org/abs/2606.12939
Authors: Inseok Kong,Geunyoung Jung,Jiyoung Jung
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICPR 2026

Abstract:3D point cloud models suffer significant performance degradation under distribution shifts caused by sensor noise, occlusions, and environmental changes. Test-time adaptation (TTA) has emerged as a practical paradigm for mitigating this issue during inference. Recently, leveraging multi-view augmentation has shown promise in improving 3D TTA performance. However, existing multi-view approaches are often constrained by sequential optimization that treats each view independently. This sequential optimization leads to substantial inference latency due to repetitive optimization steps, making real-time adaptation impractical. To address this, we propose Masked Multi-View Test-Time Adaptation (MAMVI), which replaces sequential optimization with a unified single-step adaptation. Specifically, MAMVI utilizes a hybrid masking strategy that combines fixed ratios for stability with Beta-distributed sampling for diversity. By aggregating losses across multiple views, MAMVI performs adaptation through a single backward pass based on multi-view consensus. Additionally, a confidence-based adaptive learning rate is used to dynamically adjust the adaptation intensity for each sample. Extensive experiments on ModelNet-40C, ShapeNet-C, and ScanObjectNN-C demonstrate that MAMVI achieves state-of-the-art accuracy on ShapeNet-C and ScanObjectNN-C. Moreover, it remains competitive on ModelNet-40C while delivering 4.9-8.9 times faster inference, making it highly suitable for real-time applications. Our code is available at this https URL
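
The hybrid masking idea (fixed ratios for stability, Beta-distributed ratios for diversity, one aggregated loss) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the ratio values, the Beta parameters, the random point-dropping scheme, and the placeholder `view_loss` are all hypothetical, and in a real model the backward pass would be taken once on the aggregated value.

```python
import random


def hybrid_mask_ratios(fixed_ratios, num_random, alpha, beta, rng):
    """Return the fixed ratios followed by ratios drawn from Beta(alpha, beta)."""
    sampled = [rng.betavariate(alpha, beta) for _ in range(num_random)]
    return list(fixed_ratios) + sampled


def mask_points(points, ratio, rng):
    """Randomly drop about `ratio` of the points, always keeping at least one."""
    keep = max(1, len(points) - round(ratio * len(points)))
    return rng.sample(points, keep)


def aggregated_loss(points, ratios, view_loss, rng):
    """Average the per-view losses so a single update can use all views at once."""
    views = [mask_points(points, r, rng) for r in ratios]
    return sum(view_loss(v) for v in views) / len(views)


# Usage with a hypothetical placeholder loss (fraction of points kept).
rng = random.Random(0)
points = [(float(i), 0.0, 0.0) for i in range(100)]
ratios = hybrid_mask_ratios([0.3, 0.6], num_random=2, alpha=2.0, beta=2.0, rng=rng)
loss = aggregated_loss(points, ratios, lambda v: len(v) / len(points), rng)
```

Averaging before the update is what replaces the per-view sequential optimization the abstract identifies as the latency bottleneck.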

[CV-67] Multi-Label Test-Time Adaptation with Bayesian Conditional Priors ICML2026

Link: https://arxiv.org/abs/2606.12925
Authors: Qiru Li,Ao Zhou,Zhiwei Jiang,Zifeng Cheng,Cong Wang,Yafeng Yin,Qing Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by ICML 2026

Abstract:Multi-label recognition with frozen Vision-Language Models (VLMs) is brittle under distribution shift: standard zero-shot inference scores labels independently, ignoring co-occurrence structure and producing incoherent label sets where dominant concepts suppress weaker but compatible labels. We introduce Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that injects label dependency without tuning the backbone. BCP views zero-shot logits as a proxy for marginal posteriors under a fixed image-text likelihood and attributes shift-induced errors mainly to a mismatched label prior. For each test image, it selects a high-confidence anchor label and applies an anchor-conditioned Bayesian refinement. This update is closed-form in logit space and admits a pointwise mutual information (PMI) interpretation, explicitly promoting compatible labels and suppressing incompatible ones. BCP operates without target annotations by estimating anchor-conditioned priors online from the unlabeled test stream via lightweight second-order co-occurrence statistics, adding negligible overhead beyond a single forward pass. Across standard multi-label benchmarks and multiple CLIP backbones, BCP consistently outperforms strong TTA baselines, e.g., improving RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79.
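
To make the PMI interpretation concrete, the sketch below shows one way an anchor-conditioned logit adjustment could look. It is an assumption-laden illustration rather than the paper's closed-form update: the class `CooccurrenceStats`, the Laplace smoothing, the additive rule `logit + strength * PMI`, and the use of the arg-max label as anchor are all hypothetical choices.

```python
import math


class CooccurrenceStats:
    """Running label and label-pair counts from the unlabeled test stream."""

    def __init__(self, num_labels, smoothing=1.0):
        self.n = 0
        self.smoothing = smoothing
        self.single = [0.0] * num_labels
        self.pair = [[0.0] * num_labels for _ in range(num_labels)]

    def update(self, predicted_labels):
        """Record one image's predicted label set (pseudo-labels, no annotations)."""
        self.n += 1
        for i in predicted_labels:
            self.single[i] += 1
            for j in predicted_labels:
                if i != j:
                    self.pair[i][j] += 1

    def pmi(self, j, anchor):
        """Smoothed pointwise mutual information between label j and the anchor."""
        s = self.smoothing
        p_j = (self.single[j] + s) / (self.n + 2 * s)
        p_a = (self.single[anchor] + s) / (self.n + 2 * s)
        p_ja = (self.pair[j][anchor] + s) / (self.n + 4 * s)
        return math.log(p_ja / (p_j * p_a))


def refine_logits(logits, stats, strength=1.0):
    """Shift every non-anchor logit by its PMI with the most confident label."""
    anchor = max(range(len(logits)), key=lambda k: logits[k])
    return [
        z if j == anchor else z + strength * stats.pmi(j, anchor)
        for j, z in enumerate(logits)
    ]
```

With this smoothing the PMI is exactly zero before any test image has been seen, so the adjustment starts as a no-op and grows only as co-occurrence evidence accumulates.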

[CV-68] Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration ICML2026

Link: https://arxiv.org/abs/2606.12913
Authors: Dongyue Wu,Zilin Guo,Xiaoyu Li,Jiajia Liu,Jingdong Chen,Nong Sang,Changxin Gao
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026

Abstract:The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods, which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, reducing training time by over 40% without sacrificing accuracy on ImageNet-1k with ResNet-50.
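
A greedy selection driven by sample-wise marginal gains might look like the sketch below. It is a generic illustration, not the paper's algorithm: it assumes a complete graph, a symmetric `edge_weight` matrix where larger values mean more complementary samples, and a fixed `budget`; the function name and the tie-breaking rule are hypothetical.

```python
def greedy_prune(node_weight, edge_weight, budget):
    """Pick up to `budget` sample indices, one at a time, by largest marginal gain.

    The gain of a candidate is its node weight (intrinsic value) plus the sum
    of its edge weights to the samples already selected (extrinsic value).
    """
    selected = []
    remaining = set(range(len(node_weight)))
    gain = list(node_weight)  # with nothing selected, gain equals the node weight

    while remaining and len(selected) < budget:
        # Ties are broken in favor of the lowest index for determinism.
        best = max(remaining, key=lambda v: (gain[v], -v))
        selected.append(best)
        remaining.remove(best)
        # Only gains relative to the newly added sample need updating.
        for v in remaining:
            gain[v] += edge_weight[best][v]
    return selected


# Samples 0 and 1 are individually valuable but nearly redundant with each other;
# sample 2 is weaker on its own but complements both.
node_weight = [1.0, 0.9, 0.2]
edge_weight = [
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
kept = greedy_prune(node_weight, edge_weight, budget=2)
```

In this toy case the second pick is the complementary sample rather than the second-highest node weight, which is the behavior a purely intrinsic criterion would miss.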

[CV-69] Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

Link: https://arxiv.org/abs/2606.12910
Authors: Allison Andreyev,Landon Eum,Nestor Tiglao,Romel Gomez
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: Project website: this https URL

Abstract:For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally “heavyweight” or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as “top shelf” and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

[CV-70] Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

Link: https://arxiv.org/abs/2606.12886
Authors: Tingyu Li,Le Zhou,Siyuan Li,Yujun Wu,Xinglong Xu,Jingxuan Wei,Conghui He,Cheng Tan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 5 figures, 6 tables

Abstract:Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.

[CV-71] Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

Link: https://arxiv.org/abs/2606.12869
Authors: Tsz Lok Ip,Han Zhang,Lok Ming Lui
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 10 figures

Abstract:In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly in practice and are often concentrated in localized regions. Such phenomena are particularly common in medical imaging, where pathological changes are spatially confined. Consequently, uniform convolution allocates equal computational effort to both informative and uninformative regions, resulting in inefficient feature extraction and suboptimal utilization of model capacity. To address this issue, we propose a framework for task-adaptive sampling that dynamically redistributes computational attention according to the spatial importance of the data. Specifically, we introduce the Density-Equalizing Convolutional Neural Network (DECNN), which employs density-equalizing mappings to guide convolution through a learned density function. The density function encodes the relative importance of different regions and induces a transformation that enlarges informative areas while compressing less relevant ones. As a result, convolutional receptive fields are redistributed non-uniformly over the domain, enabling denser sampling in task-relevant regions. By coupling this importance-driven transformation with convolution, DECNN performs adaptive feature extraction that focuses computational resources on informative structures. This leads to more efficient use of model capacity, yielding a lightweight yet expressive architecture while simultaneously producing an interpretable saliency map. Experiments on image classification and craniofacial surface analysis demonstrate that DECNN achieves competitive or superior performance with fewer parameters, accurately identifies task-relevant regions, and remains robust under complex geometric variations.
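
The effect of a density-driven transformation on sampling can be illustrated in one dimension, where equalizing density reduces to inverse-CDF sampling. The sketch below is only a 1D analogue under assumptions: DECNN works on images and surfaces with a learned density, whereas here `density` is a fixed, strictly positive list over equal-width cells of [0, 1], and the function name is hypothetical.

```python
def density_equalizing_samples(density, num_samples):
    """Place samples on [0, 1] so each sample carries an equal share of density.

    `density` holds one strictly positive value per equal-width cell.
    High-density cells receive proportionally more samples.
    """
    n = len(density)
    total = sum(density)

    # Cumulative mass at the cell edges 0, 1/n, ..., 1.
    cdf = [0.0]
    for d in density:
        cdf.append(cdf[-1] + d / total)

    samples = []
    cell = 0
    for k in range(num_samples):
        target = (k + 0.5) / num_samples  # midpoint of the k-th equal-mass slice
        # Advance to the cell whose cumulative mass reaches the target.
        while cell < n - 1 and cdf[cell + 1] < target:
            cell += 1
        # Linear interpolation inside the cell (density is constant per cell).
        frac = (target - cdf[cell]) / (cdf[cell + 1] - cdf[cell])
        samples.append((cell + frac) / n)
    return samples


# A density peaked in the middle cell pulls most samples toward the center.
peaked = density_equalizing_samples([1.0, 1.0, 6.0, 1.0, 1.0], 10)
uniform = density_equalizing_samples([1.0, 1.0, 1.0, 1.0], 4)
```

With the peaked density, the middle fifth of the domain holds 60% of the mass and therefore receives 6 of the 10 samples, while a flat density reproduces ordinary uniform sampling.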

[CV-72] JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications

Link: https://arxiv.org/abs/2606.12858
Authors: Tong Wu,Zhiyong Chen,Guo Lu,Li Song,Feng Yang,Meixia Tao,Wenjun Zhang
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: submitted to IEEE Journal

Abstract:Conventional communication systems, including both separation-based coding and learning-based joint source-channel coding (JSCC), are typically designed under Shannon’s rate-distortion theory. However, relying on generic distortion metrics fails to capture complex human visual perception, often resulting in blurred or unrealistic reconstructions. In this paper, we propose Joint Source-Channel-Generation Coding (JSCGC), a generative communication paradigm that replaces the conventional decoder with a generative model at the receiver. The received signal is treated as a condition that controls the sampling process into the learned conditional distribution, reformulating communication from deterministic reconstruction for distortion minimization to controlled generation for mutual information maximization under perceptual constraints. Based on this formulation, we develop a unified joint training and efficient stochastic sampling framework, and provide theoretical analysis of its effectiveness in both learning and inference stages. Extensive experiments on latent-space image transmission demonstrate that the JSCGC consistently improves feature-based, semantic-level, and distributional quality across diverse channel conditions, while exhibiting a distinct error behavior characterized by semantic inconsistency rather than distortion.

[CV-73] SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture

Link: https://arxiv.org/abs/2606.12849
Authors: Rahul Singh,Devdeep Ray,Connor Smith,Sarita Adve
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Semantic mapping is a core service that enables grounded interactions in emerging Extended Reality (XR) applications such as AI assistants and spatial object search. Deploying this capability on mobile XR devices requires a system that is open-vocabulary, real-time, and low-power. Existing approaches are compute-intensive and assume server-class resources. Cloud offloading offers a practical path, but no existing system splits semantic mapping across the device-cloud boundary or manages its communication, execution, and memory footprint. We present SemanticXR, the first device-cloud system for real-time, open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. Our key insight is to elevate semantically identifiable objects to first-class units of communication, execution, and memory across the device and server. On the server, object-level parallelism and geometry downsampling improve mapping latency, while object-level depth-mapping co-design reduces upstream bandwidth. On the device, an object-level sparse local map with incremental updates and update prioritization enables network-robust querying with bounded memory and downstream bandwidth. Object-level configurable resource usage vs. quality trade-offs let applications and the system adapt mapping to application requirements and operating conditions, respectively. Against a device-cloud baseline with the same perception models, object-level organization improves server-side mapping latency by 2.2X at equal semantic quality. Depth-mapping co-design maintains upstream bandwidth under 2.5 Mbps. On the device, SemanticXR sustains sub-100 ms query latency for up to 10,000 objects even under network drops, supports tens of thousands of objects within 500 MB, and scales downstream bandwidth with map changes, not total scene size. The system adds only 2% device power during normal operation. 

[CV-74] Language-Guided Abstraction for Visual Reasoning

Link: https://arxiv.org/abs/2606.12847
Authors: Xu-Jing Ye,Yuan-Gen Wang,Ruping Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methodology is either pure language or vision-only (i.e., VARC). The former depends heavily on LLMs, consuming billions of parameters. The latter often struggles to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured, fitting with the context length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is taken in the training process and will be discarded during inference, thereby yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that our L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art. Ablation studies further confirm the contribution of the two new designs towards the L-VARC framework. The code is available at this https URL.

[CV-75] Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

Link: https://arxiv.org/abs/2606.12830
Authors: Changye Li,Meng Lu,Yi Wu,Ligeng Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.

[CV-76] DIMOS: Disentangling Instance-level Moving Object Segmentation

Link: https://arxiv.org/abs/2606.12826
Authors: Hongxiang Huang,Hongwei Ren,Xiaopeng Lin,Yulong Huang,Zeke Xie,Bojun Cheng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. Experimental results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

[CV-77] GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

Link: https://arxiv.org/abs/2606.12744
Authors: Garvita Allabadi,Matteo Sodano,Roberto Estevão,Yuxiong Wang,Vikram Adve,Emre Kiciman,Ranveer Chandra
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA). Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance. To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.

[CV-78] EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows

Link: https://arxiv.org/abs/2606.12728
Authors: Clinton Enwerem,John S. Baras,Calin Belta
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages, 11 figures, 11 tables. Project page with videos, code, and checkpoints: this https URL

Abstract:Most learned dexterous grasp generators relegate contact forces to a downstream verification step, so a kinematically-plausible pose can still violate the conditions for a stable physical grasp. We address this with EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. Our architecture projects contacts onto the object surface and forces into the Coulomb friction cone by construction, so placement and friction compliance hold without loss penalties. We prove end-to-end SE(3) equivariance and verify it empirically over 200 rotations, with wrist residuals below 0.04° and exactly zero joint deviation. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, our model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. We retarget decoded fingertip contacts to a 16-DoF LEAP Hand via per-finger inverse kinematics, and our hardware-feasible refinement places every joint at least 5% inside its actuator envelope while preserving wrench balance. On the physical robot, retargeted EquiDexFlow-decoded grasps complete open-loop pick-and-hold trials on all six test objects, with every asymmetric object succeeding at both the canonical pose and a 120° co-rotation. Videos, code, and checkpoints are available at this https URL.

[CV-79] VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

Link: https://arxiv.org/abs/2606.12706
Authors: Thach Nguyen,Danhua Guo,Tom Lampo,Fei Wu,Burhan Yaman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language-action (VLA) models generate chain-of-thought (CoT) reasoning alongside driving trajectories, but existing benchmarks evaluate only trajectory quality and do not assess whether the CoT is relevant, consistent, or causally connected to the driving action. We introduce VLADriveBench, a framework that combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to provide complementary views of the CoT-action relationship. Applying VLADriveBench to three models across two architectures, we find that the two analyses can diverge sharply: ORION scores highest on observational alignment yet its CoT is epiphenomenal, while Alpamayo v1.5 scores lower yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.

[CV-80] SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images

Link: https://arxiv.org/abs/2606.12671
Authors: Xiaoxiao Sun,Ruotian Zhang,Junzhe Huang,James Burgess,Serena Yeung-Levy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 7 figures, 7 tables. Dataset: this https URL

Abstract:Vision-language models (VLMs) are increasingly used to detect whether AI-generated images contain visible artifacts, yet their ability to analyze such artifacts remains poorly understood. A correct image-level decision can still hide important failures: a model may correctly flag an artifact while relying on the wrong visual cue, selecting the wrong region, or describing a defect that the image does not support. To evaluate these behaviors directly, we introduce SalArt-VQA, a diagnostic benchmark for fine-grained SALient ARTifact understanding in AI-generated images. SalArt-VQA contains 950 images and 3,681 human-authored multiple-choice questions spanning artifact images, matched real reference images, and paired generated reference images. Four aligned question types evaluate presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification, while the reference splits test calibration and abstention when the annotated defect is absent. Across 20 VLMs, SalArt-VQA reveals failures that image-level detection accuracy hides: the strongest model reaches 99.37% detection recall on artifact images but answers all four artifact-side questions correctly on only 53.26% of images. Comparing artifact images with artifact-free references reveals a sensitivity-calibration tradeoff: sensitive models often make unsupported artifact claims, while conservative models avoid false alarms largely by missing real artifacts. These results show that high artifact detection accuracy alone does not imply grounded artifact understanding. SalArt-VQA exposes these hidden failure modes and provides a fine-grained evaluation of whether VLM artifact claims are supported by local visual evidence.

[CV-81] Amnesia: A Stealthy Replay Attack on Continual Learning Dreams

Link: https://arxiv.org/abs/2606.12655
Authors: Ahmed Sharshar,Naveen Kumar Kummari,Mohsen Guizani
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Continual learning (CL) models often use experience replay to reduce catastrophic forgetting, but their robustness to replay sampling interference remains underexplored. Existing CL attacks alter inputs or training pipelines (poisoning/backdoors) and rarely include explicit auditable constraints, limiting realism. Here, auditability means a monitor can verify compliance from sampler-visible telemetry - e.g., logged replay index/label statistics - by checking that the realized replay class histogram stays close to a nominal baseline and that replay rate is unchanged per batch and/or over a rolling window. We study a limited-privilege insider who controls only replay index selection, not pixels, labels, or model parameters, while staying within auditable limits such as queue priorities. We introduce Amnesia, a replay composition attack that maximizes degradation under two budgets: a visibility budget delta bounding the TV/KL divergence from a nominal class histogram p0, and a mass budget f fixing the replay rate. Amnesia has two steps: (i) compute lightweight class utilities, such as EMA loss or confidence, to tilt p0 toward harmful classes; and (ii) project the tilt back into the delta-ball using efficient KL (exponential tilt) or TV (balanced mass redistribution) optimizers. A windowed scheduler enforces rolling audits. Across challenging CL benchmarks and strong replay baselines, Amnesia consistently lowers final accuracy (ACC) and worsens backward transfer (-BWT). The KL variant delivers high impact while remaining largely undetected under multiple audit schemes, including per-batch and rolling-window checks. The TV variant is more damaging but easier to detect, especially under tight per-class constraints. These results expose index-only replay control as a practical, auditable threat surface in CL systems and establish a principled impact-visibility trade-off.
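
The KL variant of step (ii) can be illustrated with a minimal sketch. It assumes the tilt has the form q ∝ p0 · exp(λ · u) and that the largest admissible λ is found by bisection; the function names, the `lam_max` cap, and the example utilities are hypothetical and are not taken from the paper.

```python
import numpy as np


def kl(q, p):
    """KL(q || p) for strictly positive p; entries with q == 0 contribute 0."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))


def exponential_tilt(p0, utility, lam):
    """Return q proportional to p0 * exp(lam * utility). Assumes p0 > 0."""
    logits = np.log(p0) + lam * utility
    logits -= logits.max()  # shift for numerical stability
    q = np.exp(logits)
    return q / q.sum()


def tilt_within_kl_budget(p0, utility, delta, lam_max=50.0, iters=60):
    """Strongest tilt (largest lam in [0, lam_max]) with KL(q || p0) <= delta."""
    if kl(exponential_tilt(p0, utility, lam_max), p0) <= delta:
        return exponential_tilt(p0, utility, lam_max)
    lo, hi = 0.0, lam_max  # invariant: lam = lo always satisfies the budget
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(exponential_tilt(p0, utility, mid), p0) <= delta:
            lo = mid
        else:
            hi = mid
    return exponential_tilt(p0, utility, lo)


# Hypothetical nominal class histogram and per-class utilities (e.g. EMA loss).
p0 = np.array([0.25, 0.25, 0.25, 0.25])
utility = np.array([0.1, 0.9, 0.3, 0.2])
q = tilt_within_kl_budget(p0, utility, delta=0.05)
```

For an exponential tilt, KL(q || p0) does not decrease as λ grows from zero, so bisection is enough to locate the boundary of the delta-ball. The mass budget f and the windowed scheduler are not modeled here.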

[CV-82] CD-RCM: Generalizable Continuous-Depth Novel View Synthesis for Reflectance Confocal Microscopy

Link: https://arxiv.org/abs/2606.12635
Authors: Tooba Imtiaz,Milind Rajadhyaksha,Kivanc Kose,Jennifer Dy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reflectance confocal microscopy (RCM) provides noninvasive, cellular-resolution “optical biopsies” of human skin in vivo by acquiring en-face images at successive depths, forming a sparse z-stack. Due to optical limitations, these stacks are anisotropic 3D volumes with lateral resolution (0.5 μm) ~6 times higher compared to axial resolution, which is defined by the optical sectioning (3 μm), limiting the interpretation of tissue. Our goal is to provide continuous-depth visualization by interpolating intermediate sections and making the 3D volume isotropic. Such a representation permits arbitrary-direction sectioning, including histopathology-like cross-sectional examination, without requiring per-patient optimization. To that end, we introduce the first RCM-specific novel-view synthesis (NVS) approach, CD-RCM, a feedforward model that predicts realistic, unseen depths from sparsely sampled RCM stacks. Classical neural rendering methods focus on reconstruction from surface-level multi-view observations. In contrast to surface-level camera views, RCM can acquire optically sectioned en-face images of tissue beyond the surface up to 200 μm. However, during visualization of the RCM stacks, observations of the shallower sections (towards the surface) obscure the deeper ones. This unique axial imaging geometry and layer-dependent anatomical organization motivated our development of a tailored architectural and training framework that explicitly accounts for RCM’s depth-resolved, occlusive imaging physics. Experiments demonstrate that CD-RCM achieves high-fidelity novel-view synthesis with sub-second inference time.

[CV-83] ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation ICML2026

Link: https://arxiv.org/abs/2606.12633
Authors: Jiangtao Kong,Peijun Zhao,Chun-Fu Chen,Youngwook Do,Shaohan Hu,Tianyi Zhou,Huajie Shao
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA’s performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at this https URL.

[CV-84] Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving CVPR2026

Link: https://arxiv.org/abs/2606.12628
Authors: Binay Kumar Singh,Niels Da Vitoria Lobo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures, CVPR 2026 Precognition Workshop

Abstract:Object detection in autonomous driving requires precise localization and an inherent understanding of the relational context between co-occurring objects. In extremely complex, heterogeneous environments, rare classes, small-scale objects, and frequently appearing objects are difficult for standard object detection frameworks to handle. In this paper, we propose a novel framework called Context-Centric Feature Fusion (CCFF), which utilizes two attention-based modules. The Local Context Fusion Module (LCFM) uses an RoI-to-RoI self-attention mechanism to resolve spatial interactions, mainly for small and partially obscured objects, while the Global Context Attention Module (GCAM) captures object co-occurrence priors by pooling top-K RoI features into a global context attention token, avoiding the computational overhead of pixel-level global pooling. This fusion of local and object-centric global features yields contextualized embeddings that enhance classification results and the detection of co-occurring objects. Our method is evaluated on two datasets, Cityscapes and BDD100K, where it demonstrates significant improvement in relational consistency, achieving a Category-level Consistency Strategy (CCS) of 0.973 and 0.969, respectively. Furthermore, our approach produces substantial gains in small object detection (AP_S: 14.1%) and successfully recovers rare classes such as “Train” that are typically lost in large distributions. Our efficiency report shows that the framework processes images in real time with a 0.2 FPS overhead. The code is available at this https URL.
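
The two ideas named in the abstract, RoI-to-RoI self-attention and pooling top-K RoI features into one global token, can be sketched generically as follows. This is not the authors' implementation: the absence of learned projections, the use of detection scores to pick the top-K RoIs, mean pooling, and fusion by concatenation are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def roi_self_attention(roi_feats):
    """Single-head RoI-to-RoI attention without learned projections (assumption)."""
    d = roi_feats.shape[1]
    attn = softmax(roi_feats @ roi_feats.T / np.sqrt(d))  # (N, N), rows sum to 1
    return attn @ roi_feats  # each RoI becomes a weighted mix of all RoIs


def global_context_token(roi_feats, scores, k):
    """Mean-pool the k highest-scoring RoI features into one token (assumption)."""
    top = np.argsort(scores)[-k:]
    return roi_feats[top].mean(axis=0)


def fuse(roi_feats, scores, k=3):
    """Concatenate local context with a shared global token (assumption)."""
    local = roi_self_attention(roi_feats)
    token = global_context_token(roi_feats, scores, k)
    return np.concatenate([local, np.broadcast_to(token, local.shape)], axis=1)


# Hypothetical input: 5 RoIs with 4-dimensional features and detection scores.
rng = np.random.default_rng(0)
roi_feats = rng.normal(size=(5, 4))
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8])
fused = fuse(roi_feats, scores, k=3)
```

Pooling over K RoI vectors costs O(K · d) per image, whereas global pooling over a feature map scales with its spatial size, which is the overhead the abstract says the design avoids.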

[CV-85] Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

Link: https://arxiv.org/abs/2606.12601
Authors: Sieu Tran,Duc Nguyen,Hao Vo,Khoa Vo,Ngan Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance. 

[CV-86] Emerging Flexible Designs for Geospatial Multimodal Foundation Models

Link: https://arxiv.org/abs/2606.12595
Authors: Philipe Dias,Waqwoya Abebe,Abhishek Potnis,Aristeidis Tsaris,Dan Lu,Xiao Wang,Dalton Lunga
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading FM architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

[CV-87] Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Link: https://arxiv.org/abs/2606.12590
Authors: Shayan Mohammadizadehsamakosh,Pritam Sarkar,Leonid Sigal,Ali Etemad,Elham Dolatabadi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

[CV-88] High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

Link: https://arxiv.org/abs/2606.12575
Authors: Dongyang Liu,Ruoyi Du,David Liu,Dengyang Jiang,Liangchen Li,Qilong Wu,Zhen Li,Steven C.H. Hoi,Hongsheng Li,Peng Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Few-step diffusion distillation has become increasingly mature for 4-8-step generation, yet pushing further to 2 steps remains challenging. In this work, we introduce Z-Image Turbo++, a high-quality 2-step image generation model distilled from the 8-step Z-Image Turbo teacher. Our method addresses the central bottlenecks of increased task difficulty and limited model capacity in 2-step generation through three simple but effective design choices tailored to this regime. First, we propose Distribution-Aligned Adversarial Learning, which uses teacher-generated images rather than external real images as real samples for GAN training, providing a more attainable and informative adversarial target. Second, we adopt Step-Decoupled Parameterization, assigning independent model parameters to the two denoising steps to better match their distinct capacity demands. Third, we perform End-to-End Training with Iterative Regularization, allowing the first step to receive gradients from final image quality while preserving a meaningful intermediate generation through an explicit step-1 loss. Together, these designs substantially narrow the quality gap between 2-step and 8-step generation in both qualitative and quantitative evaluations, highlighting the potential of carefully tailored distillation strategies for improving the quality-efficiency trade-off in few-step generation.

[CV-89] HairPort: In-context 3D-aware Hair Import and Transfer for Images SIGGRAPH2026

Link: https://arxiv.org/abs/2606.12562
Authors: Alireza Heidari,Amirhossein Alimohammadi,Wallace Michel Pinto Lira,Adi Bar-Lev,Ali Mahdavi-Amiri
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to SIGGRAPH 2026 (Conference Papers Track). 23 pages, 15 figures, 10 tables, including supplementary material as appendices. Project page: this https URL

Abstract:Transferring hairstyles between images is an important but challenging task in computer graphics, computer vision, and visual effects. It enables users to explore new looks without physically altering their hair, with applications in virtual try-on systems, augmented reality, and entertainment. Most prior works operate best under small pose gaps, and they fall short under large viewpoint and scale differences, where missing hair content must be synthesized rather than transferred. We propose HairPort, a 3D-aware hairstyle transfer framework that attempts to solve these issues by explicitly separating hair removal from transfer and enforcing geometric consistency before synthesis. We introduce a Bald Converter, which produces realistic bald versions of faces through LoRA-based in-context adaptation of FLUX.1 Kontext. To train our Bald Converter, we introduce a new dataset, Baldy, containing 6,000 paired bald and original images across diverse identities and conditions. We also use a 3D-Aware Transfer Pipeline that reconstructs and re-renders the reference hairstyle from the target viewpoint before compositing it onto the source image. Being 3D aware, our method supports large pose and scale discrepancies between the source and target. Finally, a conditional flow-matching generator synthesizes the transferred result from the bald source and geometry-aligned reference guidance. Together, our method enables accurate, pose-consistent, and identity-preserving hairstyle transfer, outperforming existing methods both qualitatively and quantitatively.

[CV-90] AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

链接: https://arxiv.org/abs/2606.12555
Authors: Zeyue Tian,Lei Ke,Zhaoyang Liu,Ruibin Yuan,Liumeng Xue,Yujiu Yang,Weijia Chen,Xu Tan,Qifeng Chen,Wei Xue,Yike Guo
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. To address these challenges, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher AudioX-Base is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis, and is then distilled into the few-step student AudioX-Turbo via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at this https URL.

[CV-91] Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

Link: https://arxiv.org/abs/2606.12473
Authors: Shreyas Narasimhiah Ramesh,P. D. Rathika,Mahasweta Sarkar,Kristen Wells,Michel Audette,Christopher Paolini
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages; 31 figures

Abstract:Background and Objective: Falls among elderly people can cause serious injury and reduce quality of life. Timely prediction and detection are essential to prevent harm and support well-being. We propose a portable, low-power, battery-operated, vision-based fall prediction and detection system using human pose estimation (HPE) on an AMD Kria K26 System-on-Module (SOM). The objective is a non-intrusive, privacy-preserving system for real-time fall detection. Methods: The system uses an Intel RealSense D455 range-sensing camera connected to the K26 SOM by USB. It captures synchronized RGB and depth frames, 640 x 480 x 3 and 640 x 480 pixels, at 60 FPS. The SOM runs a three-stage pipeline with quantized YOLOX, Anchor-to-Joint (A2J), and fall-detection models. YOLOX identifies human bounding boxes from RGB frames, then discards the RGB frames to preserve privacy. A2J uses depth frames to estimate 15 joint keypoints per person. A CNN uses selected joint coordinates (x, y, z) to classify fall activity. YOLOX was trained on CrowdHuman; A2J on ITOP, MP-3DHP, UR Fall Detection, and a custom SDSU PSG dataset; and the CNN on UR Fall Detection and SDSU PSG. The design used a single-core DPU with a serial pipeline and a dual-core DPU running YOLOX and A2J with multiple threads. Results: Quantized accuracy was evaluated using IoU = 50% for YOLOX, mAP with a 10-cm rule for A2J, and classification accuracy, (TP + TN)/(TP + TN + FP + FN), for the CNN. Accuracies were 74%, 84.13%, and 75.85%, respectively. Throughput improved from 2.5 FPS for the single-threaded pipeline to 4.5 FPS for the multi-threaded version. Conclusion: Results demonstrate the feasibility of privacy-preserving fall detection on an AMD Kria K26 edge device. On-device HPE and fall classification run without cloud dependency, supporting elderly monitoring and assistive healthcare. Future work will improve model accuracy and speed.
Submission history: v1 submitted Wed, 10 Jun 2026 05:08:35 UTC by Christopher Paolini.
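
The accuracy formula quoted in the Results, (TP + TN)/(TP + TN + FP + FN), and the reported throughput gain can be checked with a few lines. The confusion-matrix counts below are hypothetical and serve only to exercise the formula; the two FPS figures come from the abstract.

```python
def classification_accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts, not taken from the paper.
acc = classification_accuracy(tp=40, tn=45, fp=5, fn=10)
print(f"{acc:.2%}")  # prints 85.00%

# Throughput figures from the abstract: 2.5 FPS serial, 4.5 FPS multi-threaded.
speedup = 4.5 / 2.5
print(f"{speedup:.1f}x")  # prints 1.8x
```

Even after the 1.8x gain, 4.5 FPS is well below the 60 FPS capture rate stated in the Methods, so most captured frames cannot be processed.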

[CV-92] Acquisition state behaves as a structured measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility invisible to DICOM metadata

Link: https://arxiv.org/abs/2606.12824
Authors: Daniel Soliman
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary, currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope a model was validated on. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), kernel alone shifted AI-measured diameter and flipped a Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects dissociated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transported across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, frequency content to measurement reliability and noise to detection sensitivity, and is not recoverable from metadata. Acquisition-aware, input-side validation is the missing layer for the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.

Artificial Intelligence

[AI-0] Automated reproducibility assessments in the social and behavioral sciences using large language models

Link: https://arxiv.org/abs/2606.13670
Authors: Tobias Holtdirk,Pietro Marcolongo,Anna Steinberg Schulten,Felix Henninger,Stefan Rose,Sarah Ball,Bolei Ma,Frauke Kreuter,Markus Weinmann,Stefan Feuerriegel
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen’s d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
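
The recovery criterion (a reanalysis within +/-0.05 of the original Cohen's d) can be sketched as follows. This is a minimal illustration that assumes a two-group design with a pooled standard deviation; the function names and sample data are hypothetical, and the paper's pipeline may compute effect sizes differently for each study.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def recovered(original_d, reanalysis_d, tolerance=0.05):
    """True when the reanalysis lands within +/- tolerance of the original d."""
    return abs(original_d - reanalysis_d) <= tolerance


# Hypothetical toy data: means 4 and 3, pooled SD 2, so d = 0.5.
d_example = cohens_d([2, 4, 6], [1, 3, 5])
close_enough = recovered(0.50, 0.53)  # differs by 0.03
too_far = recovered(0.50, 0.60)       # differs by 0.10
```

A tolerance on the effect size is a stricter test than agreement on the qualitative conclusion, which is consistent with the gap between the 41% and 96% figures in the abstract.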

[AI-1] Agents-K1: Towards Agent-native Knowledge Orchestration

Link: https://arxiv.org/abs/2606.13669
Authors: Zongsheng Cao,Bihao Zhan,Jinxin Shi,Jiong Wang,Fangchen Yu,Zhijie Zhong,Zijie Guo,Tianshuo Peng,Zhuo Liu,Yi Xie,Xiang Zhuang,Yue Fan,Runmin Ma,Shiyang Feng,Xiangchao Yan,Anran Liu,Peng Ye,Wenlong Zhang,Shufei Zhang,Chunfeng Song,Fenghua Ling,Jie Zhou,Liang He,Bo Zhang,Lei Bai
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat "cites" edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs. Agents-K1 integrates three components under a unifying theoretical foundation: a multimodal parser whose five-module schema captures entities, multimodal evidence, citations, and typed inter-entity relations across the full paper rather than abstracts alone; a 4B information-extraction backbone trained with GRPO under a rule-based reward; and a graphanything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. On top of this, we process 2.46 million scientific papers across six subjects to produce Scholar-KG, of which we release a one-million-paper subset, and the full Scholar-KG is accessible via the SCP link below. The same pipeline can be extended to general-domain corpora and to schema-conformant data synthesis. Extensive experiments demonstrate that Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning.

[AI-2] Before You Think: System 0 AI-Mediated Cognition and Cognitive Colonization

Link: https://arxiv.org/abs/2606.13658
Authors: Marianna Bergamaschi Ganapini,Massimo Chiriatti,Enrico Panai,Giuseppe Riva
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper examines three recent frameworks for understanding the cognitive and epistemic consequences of artificial intelligence: Tri-System Theory, Thinkframes, and System 0. It argues that while the first two capture important dimensions of AI’s influence on individual reasoning and collective epistemic practices, System 0 occupies a theoretically distinctive position that neither can fully replicate. The paper introduces the concept of cognitive colonization, according to which AI systems can embed external interests within the architecture of the self in ways that are difficult for users to perceive. Because such systems are already widely deployed, understanding these invisible forms of influence is an urgent philosophical and practical task.

[AI-3] AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Link: https://arxiv.org/abs/2606.13608
Authors: Xiaoyuan Liu,Jianhong Tu,Yuqi Chen,Siyuan Xie,Sihan Ren,Tianneng Shi,Gal Gantar,Evan Sandoval,Donghyun Lee,Daniel Miao,Peter J. Gilbert,Nick Hynes,Mauro Staver,Warren He,David Marn,Andrew Low,Xi Zhang,Elron Bandel,Michal Shmueli-Scheuer,Siva Reddy,Alexandre Drouin,Alexandre Lacoste,Ramayya Krishnan,Elham Tabassi,Yu Su,Victor Barres,Chenguang Wang,Wenbo Guo,Dawn Song
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment. 

[AI-4] Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

Link: https://arxiv.org/abs/2606.13607
Authors: Zach Studdiford,Gary Lupyan
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages main text, 51 pages supplementary text

Abstract:When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people’s behavior does not exhibit the same types of failures because human reasoning uses principled and abstract world models. We evaluate human participants and 25 LLMs on their ability to engage in common-sense reasoning about a variety of everyday situations and observe similar patterns of errors in both people and models. We then identify the set of attention heads driving LLM responses and find that these heads implement a form of pattern-matching. These attention heads allow us to predict seemingly inexplicable reasoning errors in people caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern-matching than with abstract world models.

[AI-5] EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

Link: https://arxiv.org/abs/2606.13602
Authors: Harihara Muralidharan,Reema Baskar,Soo Hee Lee,Tim Proctor,Kenny Workman
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0% (143/318 attempts; 95% confidence interval (CI), 36.3–53.7), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts; 95% CI, 31.6–48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0% (124/318 attempts; 95% CI, 30.2–47.8 and 31.0–47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.
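
The counts in this abstract are internally consistent, which the short check below makes explicit. The reading that each evaluation was attempted three times per model-harness pair is an inference from the arithmetic (assuming attempts are spread evenly), not something the abstract states; the confidence intervals are not reproduced because their method is not described.

```python
# Counts quoted in the abstract.
n_evaluations = 106
n_pairs = 16
n_trajectories = 5088

attempts_per_pair = n_trajectories // n_pairs                  # matches the "/318" denominators
attempts_per_evaluation = attempts_per_pair // n_evaluations   # inferred, assuming an even split

passes = {
    "GPT-5.5 / Pi": 143,
    "GPT-5.5 / OpenAI Codex": 127,
    "Claude Opus 4.8 Max / Pi": 124,
    "GPT-5.4 / Pi": 124,
}
# Pass rate in percent, rounded to one decimal as in the abstract.
pass_rates = {name: round(100 * k / attempts_per_pair, 1) for name, k in passes.items()}
```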

[AI-6] Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting

Link: https://arxiv.org/abs/2606.13571
Authors: Yifan Hu,Hongzhou Chen,Peiyuan Liu,Yiding Liu,Zewei Dong,Jiang-Ming Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Real-world time series are often highly incomplete and irregular due to sensor dormancy, transmission delays, and event-driven sampling, making reliable forecasting fundamentally challenging. Existing methods have evolved from impute-then-forecast pipelines to continuous-time models such as Neural ODEs and continuous-time graph networks. While these approaches improve the modeling of historical irregularity, they still rely on an implicit oracle assumption at inference time: the timestamps of future valid observations are presumed to be known in advance. This assumption limits practical relevance, since in many real systems the more fundamental question is not only what the future value will be, but also whether a valid observation will occur at all. In this paper, we propose Timeflies, a unified framework that reformulates forecasting as a joint problem of future observability inference and value estimation. To explicitly model the interaction between observation dynamics and state evolution, Timeflies adopts an observation stream and a value stream, coupled through three dedicated modules for reliability-aware embedding, observation-guided dependency modeling, and joint prediction. We further construct Shadow, a benchmark that combines natural missingness from public datasets with real-world industrial data, and introduce the Observation-Value Joint Entropy (OVJE) metric to comprehensively evaluate this coupled predictability. Extensive experiments show that Timeflies consistently outperforms existing methods, highlighting the importance of explicitly modeling future observability in time series forecasting with missing values. Code and dataset are available in this https URL.

[AI-7] A Three-Layer Framework for AI in Scientific Discovery

Link: https://arxiv.org/abs/2606.13566
Authors: Guojun Liao
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Current discussions of AI in scientific discovery are often dominated by two visible capabilities: search over existing knowledge and execution through optimization, simulation, and automation. Both are important, but neither fully captures the central act of discovery: the formation and evolution of models. This paper proposes a three-layer view of AI in discovery. Layer 1 is search and retrieval by large language models. Layer 2, as the main innovation of this paper, is model formation through qualitative reasoning: the capacity to recognize when a current framework is structurally inadequate and to understand the problem within a broader representational space, not through trial and error, but through structural insight into what is missing and where it can be found. Layer 3 is execution, optimization, and refinement. The main claim is that Layer 2 is both the most important and the least developed. Search without model formation remains confined to inherited frameworks, while execution without conceptual revision only amplifies an existing formulation. We illustrate Layer 2 reasoning through three case studies: S. S. Chern’s intrinsic proof of the Gauss-Bonnet theorem, the resolution of the Nesterov Accelerated Gradient convergence problem via Lyapunov functions, and the autonomous disproof of the Erdos unit distance conjecture by OpenAI in 2026. Each case exhibits the same structural signature: a framework that had become inadequate, a missing conceptual object, and a resolution found in an unexpected neighboring field.

[AI-8] CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation KDD2026

Link: https://arxiv.org/abs/2606.13513
Authors: Xiaobin Zhang,Lefei Shen,Mouxiang Chen,Zhuo Li,Hongkai Li,Han Fu,Jianling Sun,Xiaoxue Ren,Chenghao Liu
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to KDD 2026

Abstract:Driven by conservative over-provisioning to guarantee service reliability, resource utilization in cloud data centers remains at low levels. To mitigate this, the forecast-then-optimize paradigm has emerged to optimize consolidation by anticipating future demands. While emerging time series foundation models promise to enhance this paradigm through zero-shot generalization, existing benchmarks focus solely on prediction error metrics. The actual decision utility of these advanced models remains unverified, rendering their practical value for downstream tasks uncertain. To bridge this gap, we propose CloudCons, a comprehensive end-to-end benchmark designed to evaluate forecasting models within the specific context of cloud resource consolidation. We build high-quality datasets that cover diverse workloads from Huawei Cloud, Microsoft Azure, and Google Borg, capturing distinct service characteristics ranging from synchronized diurnal rhythms to stochastic, pulse-like bursts and high-frequency noise. We conduct an extensive evaluation of statistical, deep learning, and foundation models. Our experiments reveal a pivotal finding: while foundation models demonstrate superior zero-shot forecasting accuracy, this advantage does not inherently translate into better decision utility. Of practical significance, we systematically analyze how the selection of predictive quantiles acts as a critical lever. We provide actionable guidelines for calibrating these selections to balance the trade-off between resource efficiency and service reliability, offering vital insights for real-world deployment decisions.

[AI-9] CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection

Link: https://arxiv.org/abs/2606.13486
Authors: William Smits
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures, 2 appendices. Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE). Code: this https URL

Abstract:Anomaly detection in multivariate time series is challenged by four structurally distinct anomaly types – point (isolated spikes), distributional (level shifts), temporal (rhythm changes), and collective (inter-sensor correlation breakdowns) – each requiring different feature representations. Most unsupervised methods target only one or two types and provide limited interpretability. We present CRAFTIIF (Cross-Resolution Analytic Four-Type Interpretable Isolation Forest), a fully unsupervised framework targeting all four types without dataset-specific tuning. CRAFTIIF generates K=500 random analytic wavelet feature draws across four families (Morlet, DOG, Haar, Coiflet), each targeting a specific anomaly type, feeding five structured Isolation Forests – one per type plus a meta-IF for compound anomalies. An adaptive Otsu/MAD threshold calibrates detection automatically across anomaly rates from 0.1% to 69.2%. Because each IF is trained exclusively on type-specific features, branch firing provides direct anomaly-type attribution by construction, without post-hoc explanation. Evaluated on all 19 datasets of the mTSBench benchmark (Zhou et al., TMLR 2026), CRAFTIIF achieves mean F1=0.228 (all 19 datasets) and F1=0.322 (13 detectable datasets), ranking first among all 25 evaluated methods on VUS-PR (0.463 vs. previous best 0.329, +40.7%). A diagnostic framework – oracle F1, detectability limits, and branch separation ratios – identifies 6 of 19 datasets as fundamentally undetectable by any unsupervised method. Ablation over 11 conditions confirms adaptive thresholding (+38% F1), four-branch structure (+20%), and meta-IF (+23%) are each essential. Code: this https URL
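
To give a sense of what a MAD-based cut-off on anomaly scores looks like, here is a generic sketch. It is not the paper's adaptive Otsu/MAD rule, whose details the abstract does not give; the multiplier `k`, the 1.4826 consistency constant, and the sample scores are all assumptions chosen for illustration.

```python
import statistics


def mad_threshold(scores, k=3.0):
    """Robust cut-off: median + k * scaled MAD of the anomaly scores.

    k = 3.0 and the 1.4826 consistency constant are conventional choices
    assumed here; the paper's calibration rule is not given in the abstract.
    """
    med = statistics.median(scores)
    mad = statistics.median([abs(s - med) for s in scores])
    return med + k * 1.4826 * mad


# Hypothetical anomaly scores with one obvious outlier.
scores = [0.10, 0.12, 0.11, 0.09, 0.13, 0.95]
threshold = mad_threshold(scores)
flagged = [s for s in scores if s > threshold]

# Relative VUS-PR gain implied by the abstract's figures (0.463 vs 0.329).
vus_pr_gain_percent = round((0.463 / 0.329 - 1) * 100, 1)
```

Because the median and MAD are insensitive to a few extreme values, a threshold of this kind does not need the anomaly rate as an input, which is one reason robust statistics are a common choice when that rate varies widely across datasets.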

[AI-10] Understanding the Rejection of Fixes Generated by Agentic Pull Requests – Insights from the AIDev Dataset

Link: https://arxiv.org/abs/2606.13468
Authors: Mahmoud Abujadallah,Ali Arabat,Mohammed Sayagh
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures, MSR '26: Proceedings of the 23rd International Conference on Mining Software Repositories, April 2026, Rio de Janeiro, Brazil

Abstract:AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. Our goal in this paper is to understand the failure modes of AI-agents, an understanding that is crucial for better integrating AI-agents as efficient teammates. In this paper, we conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes. We observe that developers can reject fixes due to fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).

[AI-11] Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests

Link: https://arxiv.org/abs/2606.13449
Authors: Ali Arabat,Mohammed Sayagh
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5 pages, 8 figures, 23rd International Conference on Mining Software Repositories, April 13–14, 2026

Abstract:AI-agents (e.g., GitHub Copilot) collaborate as teammates in different software engineering tasks, including code generation proposed through pull requests (Agentic-PRs). For better agent efficiency, developers create instruction files that guide the AI-agents, including how to navigate the project, locate the right components, run tests, respect best practices, and more. In this paper, we investigate the relationship between the creation of these instructions and the performance of AI-agents in creating better pull requests, which have a higher chance of success (i.e., the merge rate), address more complex tasks (e.g., code churn), and require less effort to be merged (e.g., time to merge). To this end, we analyze 15,549 agentic PRs from 148 projects in the AIDev dataset. Using the three dimensions, we compare each project before and after the creation of the instruction files. We find that specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7% of the projects increased their merge rate by at least 20%, while 26.35% decreased it. The same observation is seen with the amount of changes (e.g., code churn, number of modified files) and with the efforts to merge an agentic PR (e.g., merge time and number of comments). From a first exploration, we find that projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections. Our results motivate the need for research to assist practitioners in framing the development of instruction files as a software engineering activity (aka, Instructions-as-Code).

[AI-12] Evaluation Sovereignty in Metadata-Driven Classification: A Multi-Track Framework for Weakly Supervised Information Systems

Link: https://arxiv.org/abs/2606.13436
Authors: Raymond Vasquez
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Evaluation in machine learning is typically treated as a neutral measurement process. However, in operational information systems, evaluation outcomes are often conditioned by the processes used to generate labels. This paper does not seek to improve classification performance. Instead, it examines the validity of performance measurement under differing label-authority regimes. This issue is particularly relevant in large-scale metadata-driven systems, where labels are often incomplete, inconsistent, or weakly supervised. We introduce evaluation sovereignty, defined as the degree to which performance metrics are independent of label authority and supervision regime, and propose a multi-track evaluation framework that systematically varies training and evaluation label sources. Using hierarchical multi-label classification on large-scale scientific metadata, we demonstrate that models exhibiting strong performance under operational (“silver”) evaluation degrade substantially under independent (“gold”) evaluation, particularly for fine-grained classification. For example, Micro-F1 decreases from approximately 0.54 to 0.03. Notably, ranking-based metrics remain above baseline, revealing a divergence between latent model signal and classification validity. These findings suggest that commonly reported performance metrics may reflect alignment with labeling processes rather than true predictive capability. We therefore reconceptualize evaluation validity as a system-level property shaped by label governance and provide a practical methodology for auditing intelligent systems operating under weak supervision.
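
The dependence of Micro-F1 on the label source can be seen on a toy example. The sketch below scores one fixed set of predictions against two reference label sets; the label names and data are hypothetical and only illustrate the mechanism, not the paper's dataset or its reported drop from 0.54 to 0.03.

```python
def micro_f1(predicted, reference):
    """Micro-averaged F1 over documents; each item is a set of labels."""
    tp = sum(len(p & r) for p, r in zip(predicted, reference))
    fp = sum(len(p - r) for p, r in zip(predicted, reference))
    fn = sum(len(r - p) for p, r in zip(predicted, reference))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical toy data: the model mirrors the operational ("silver") labels,
# while independent ("gold") annotators chose different fine-grained labels.
predictions = [{"cs.AI"}, {"cs.LG", "cs.CL"}, {"cs.CV"}]
silver = [{"cs.AI"}, {"cs.LG"}, {"cs.CV"}]
gold = [{"cs.MA"}, {"cs.LG"}, {"eess.IV"}]

f1_silver = micro_f1(predictions, silver)  # 6/7
f1_gold = micro_f1(predictions, gold)      # 2/7
```

The predictions never change between the two calls; only the reference does. That is the sense in which a reported score can measure agreement with a labeling process instead of predictive capability.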

[AI-13] Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms GECCO2026

Link: https://arxiv.org/abs/2606.13407
Authors: Hiba Ahmed,Alexander E.I. Brownlee,Jason Adair,Simon T. Powers
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages; full results and methodology for poster paper accepted to GECCO 2026

Abstract:Renewable energy is essential for meeting future energy demands; however, solar energy generation, which occurs only during daylight hours, often does not align with household consumption patterns. Appliances such as cookers, washing machines, and dryers are typically operated according to user-preferred schedules rather than solar energy availability, creating a scheduling optimization problem. The objective is to determine optimal appliance start times to maximize renewable energy utilization while minimizing user inconvenience and adhering to system constraints. This paper presents a metaheuristic approach using Iterated Local Search (ILS) and Simulated Annealing (SA) to optimize appliance start times, while considering appliance operating durations, power consumption, inverter limit, battery state of charge constraints, and solar generation forecasts. Unlike most existing work, the scheduling is extended beyond a single day to accommodate unfinished tasks from previous days (spillover), ensuring operational continuity and enabling sequential operation across multiple days. Experimental results show that the sequential multi-day scheduling framework effectively manages system constraints while ensuring user convenience under exclusive solar generation. These findings also open opportunities for future research on multi-objective trade-offs between investment in equipment of various sizes, return on that investment, and user satisfaction.
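
A heavily simplified version of the simulated-annealing part can be sketched as below. It optimizes start times against a single-day solar forecast and penalizes only energy drawn beyond that forecast; the user-inconvenience term, inverter limit, battery constraints, and multi-day spillover described in the abstract are all omitted. The cost function, cooling schedule, and sample data are assumptions for illustration.

```python
import math
import random


def unmet_solar_cost(starts, appliances, solar):
    """Energy drawn beyond the solar forecast, summed over hourly slots."""
    load = [0.0] * len(solar)
    for start, (duration, power) in zip(starts, appliances):
        for hour in range(start, start + duration):
            load[hour] += power
    return sum(max(0.0, l - s) for l, s in zip(load, solar))


def anneal(appliances, solar, steps=2000, temp=1.0, cooling=0.995, seed=0):
    """Simulated annealing over appliance start hours (illustrative settings)."""
    rng = random.Random(seed)
    horizon = len(solar)
    current = [0 for _ in appliances]
    current_cost = unmet_solar_cost(current, appliances, solar)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # Neighbour move: give one appliance a new feasible start hour.
        candidate = list(current)
        i = rng.randrange(len(appliances))
        candidate[i] = rng.randint(0, horizon - appliances[i][0])
        cost = unmet_solar_cost(candidate, appliances, solar)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cost <= current_cost or rng.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
        temp *= cooling
    return best, best_cost


# Hypothetical 24-hour solar forecast (kW) and (duration_h, power_kW) appliances.
solar = [0] * 6 + [1, 2, 3, 3, 3, 2, 1] + [0] * 11
appliances = [(2, 2.0), (1, 1.5), (3, 1.0)]
initial_cost = unmet_solar_cost([0, 0, 0], appliances, solar)
best_starts, best_cost = anneal(appliances, solar)
```

Accepting some worse moves early, while the temperature is high, lets the search leave schedules that are locally good but globally poor; as the temperature decays, the procedure behaves more and more like plain local search.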

[AI-14] PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update ICML2026

Link: https://arxiv.org/abs/2606.13400
Authors: Jianming Ma,Qiyue Yang,Yang Zhang,Liyun Yan,Zhanxiang Cao,Yazhou Zhang,Yue Gao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 30 pages, 12 figures, Accepted to ICML 2026

Abstract:While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available on this https URL.

[AI-15] MiniMax Sparse Attention

Link: https://arxiv.org/abs/2606.13392
Authors: Xunhao Lai,Weiqi Xu,Yufeng Yang,Qiaorui Chen,Yang Xu,Lunbin Zeng,Xiaolong Li,Haohai Sun,Haichao Zhu,Vito Zhang,Pengyu Zhao
Categories: Artificial Intelligence (cs.AI)
Comments: 30 pages, 14 figures

Abstract:Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: this https URL. A production-grade natively multimodal model powered by MSA has been publicly released at: this https URL.
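
The two-stage idea (score key-value blocks, keep the Top-k, then run exact attention over only those blocks) can be sketched in plain Python for a single query. This is not MSA itself: scoring each block by the query's dot product with the block's mean key is an assumed stand-in for the learned Index Branch, and the sketch ignores GQA grouping, batching, and all GPU kernel design.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    """Exact softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, k) * scale for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def block_sparse_attend(q, keys, values, block_size, top_k):
    """Score blocks, keep the top_k, then attend exactly over the kept tokens."""
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(q, mean_key), start))  # assumed scoring rule
    kept = sorted(start for _, start in sorted(scored, reverse=True)[:top_k])
    idx = [i for s in kept for i in range(s, min(s + block_size, len(keys)))]
    return attend(q, [keys[i] for i in idx], [values[i] for i in idx])


# Hypothetical toy input: two blocks of two tokens each.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0]]
values = [[1.0], [1.0], [2.0], [2.0]]
sparse_out = block_sparse_attend(q, keys, values, block_size=2, top_k=1)
dense_out = block_sparse_attend(q, keys, values, block_size=2, top_k=2)
```

With `top_k` equal to the number of blocks the function reduces to dense attention, so the sparsity level is a single knob that trades compute against how much of the context each query can see.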

[AI-16] A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget

Link: https://arxiv.org/abs/2606.13370
Authors: Joe Dwyer
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather than evaluating efficiency solely through endpoint performance, the study uses a quantitative experimental repeated measures design to analyze how validation loss, validation perplexity, rolling volatility, backslide behavior, spike behavior, and between-seed variability change across token-based training intervals. Six independent training runs were conducted on a 4.26-million-parameter model using the TinyStories corpus, CPU-based full-precision training, and a target budget of approximately 20 million cumulative training tokens. Metrics were collected across 21 intervals, producing 126 seed-by-interval observations. Repeated measures ANOVA showed statistically significant interval effects for validation loss, validation perplexity, and rolling volatility. Descriptive trajectories revealed rapid early improvement followed by non-monotonic degradation during later training intervals. Mean validation loss decreased from 8.3552 at initialization to 2.7996 near 4 million tokens, but increased to 3.9010 by the final checkpoint. Validation perplexity followed the same pattern, falling sharply early in training before rising later. Derived telemetry further showed recurrent validation-loss backslides and no interval-summary evidence of a stable phase under the predefined criteria. These findings suggest that compute-aware language model evaluation should examine training trajectories rather than endpoint metrics alone. In constrained compute settings, additional token exposure may increase computational cost without producing proportional generalization gains, and interval-level telemetry can reveal instability, regression, and diminishing returns that final metrics may obscure.
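
The observation count and the late-training regression follow directly from the figures quoted above. In the sketch below, treating any checkpoint-to-checkpoint increase in validation loss as a "backslide" is an assumption; the study's predefined criterion is not spelled out in the abstract.

```python
n_seeds, n_intervals = 6, 21
n_observations = n_seeds * n_intervals  # seed-by-interval observations


def backslides(losses):
    """Indices where validation loss is higher than at the previous checkpoint.

    This "any increase" rule is an assumption made for the sketch.
    """
    return [i for i in range(1, len(losses)) if losses[i] > losses[i - 1]]


# The three mean validation losses quoted in the abstract
# (initialization, near 4 million tokens, final checkpoint).
reported = [8.3552, 2.7996, 3.9010]
backslide_points = backslides(reported)
regression_from_best = reported[-1] - min(reported)
```

An endpoint-only evaluation would report the final value of 3.9010 and miss that the best checkpoint, reached at roughly a fifth of the token budget, was more than a full nat better.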

[AI-17] Real-Time Execution with Autoregressive Policies

Link: https://arxiv.org/abs/2606.13355
Authors: Sangkyu Lee,Seohyeon Park,Tackgeun You,Avi Caciularu,Idan Szpektor,Hwasup Lim,Youngjae Yu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds from synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

[AI-18] ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

Link: https://arxiv.org/abs/2606.13316
Authors: Xucong Wang, Ziyu Ma, Yong Wang, Shidong Yang, Hailang Huang, Renda Li, Pengkun Wang, Xiangxiang Chu
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, including 13 pages of main text and 11 pages of appendix

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a “summarization” phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.

[AI-19] Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

Link: https://arxiv.org/abs/2606.13311
Authors: Yongmin Kim, ByeongHoon Jeon, Sungil Kim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions where rare regimes can be critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean F1–False Positive Rate (FPR) trade-off among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
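
The rarity-gated modulation described above can be illustrated with a minimal Python sketch. This is one plausible reading and not the authors' implementation: the rarity score is taken as one minus the empirical frequency of a discrete context label, the gate is a sigmoid of that score, and the output blends raw features with FiLM-modulated features (`gamma * h + beta`). All function names, the `sharpness` and `midpoint` parameters, and the toy context labels are assumptions.

```python
import math
from collections import Counter

def rarity_scores(contexts):
    """Map each discrete context label to 1 - empirical frequency (assumed score)."""
    counts = Counter(contexts)
    total = len(contexts)
    return {label: 1.0 - n / total for label, n in counts.items()}

def rarity_gate(rarity, sharpness=10.0, midpoint=0.5):
    """Sigmoid gate in (0, 1) that opens as the context becomes rarer."""
    return 1.0 / (1.0 + math.exp(-sharpness * (rarity - midpoint)))

def gated_film(hidden, gamma, beta, gate):
    """Blend raw features with FiLM-modulated features, weighted by the gate."""
    return [(1.0 - gate) * h + gate * (g * h + b)
            for h, g, b in zip(hidden, gamma, beta)]

# Toy data: "storm" is a rare context, "calm" a frequent one.
contexts = ["calm"] * 9 + ["storm"]
scores = rarity_scores(contexts)
hidden = [1.0, -2.0]
gamma, beta = [2.0, 0.5], [0.1, 0.0]

for label in ("calm", "storm"):
    gate = rarity_gate(scores[label])
    print(label, round(gate, 3), gated_film(hidden, gamma, beta, gate))
```

In this sketch the frequent context leaves the hidden features nearly unchanged, while the rare context applies almost the full scale-and-shift, mirroring the conservative-versus-decisive behavior the abstract describes.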

[AI-20] Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video

Link: https://arxiv.org/abs/2606.13302
Authors: Abubakar Hamisu Kamagata, Dharm Singh Jat, Attlee Munyaradzi Gamundani, Abhishek Srivastava, Paramasivam Saravanakumar
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal management for climate resilience. Traditional monitoring systems like buoys and radar platforms offer accurate monitoring but can have high installation and maintenance expenses and limited spatial coverage. Passive ocean monitoring using video has been achieved by leveraging deep learning; however, many methods are not physically interpretable, feasible, and validated for oceanography. In this work, a Physics-Guided Deep Spatiotemporal Learning Framework for direct estimation of nearshore wave peak periods from a passive coastal video stream is proposed. The framework combines automated temporal-variance based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance the predictive accuracy and physical consistency. A variety of spatiotemporal architectures were assessed, such as transformer-based and recurrent-convolutional ones, alongside synthetic pretraining, silver-label adaptation, and expert fine-tuning. The results show that transformer-based architectures performed best in terms of the accuracy of the instantaneous prediction, while lightweight recurrent-convolutional architectures achieved higher temporal stability and operational oceanographic skill. Ablation studies also demonstrated the benefits of physics-guided regularization in terms of trend-following consistency and physically implausible predictions. Explainability auditing also helped to focus attention in hydrodynamically active surf-zone regions and showed good agreement with the physically derived wave propagation behavior. In general, the proposed framework shows the promise of physics-guided video-based deep learning systems for long-term coastal wave monitoring that are cost-efficient and operationally feasible.

[AI-21] Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories KR MICRO

Link: https://arxiv.org/abs/2606.13298
Authors: Oliver Aleksander Larsen, Mahyar T. Moghaddam
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 16 pages. Accepted for presentation at the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA) 2026, Krakow, Poland, 2-4 September 2026, and for publication in the Springer LNCS proceedings. This is the author’s accepted manuscript

Abstract:AI coding tools are now used by a majority of developers, and agentic use of these tools has popularized the practice colloquially called “vibe coding”. Yet causal evidence on their effect on software architecture is scarce. Prior causal work has measured code-level outcomes (complexity, static analysis warnings); whether such degradation propagates to architecture-level outcomes remains unknown. We mine 151 open-source Java repositories, 74 with detectable agentic AI adoption (identified via configuration files and Co-Authored-By commit trailers) and 77 propensity-matched controls, across a 13-month per-repository window yielding 1,811 monthly Arcan snapshots. We estimate the causal effect of adoption on architectural smell density (ASD) with a staggered difference-in-differences design and the Borusyak imputation estimator, applying a causal design recently used for code-level metrics to the architecture level. Total smell counts are essentially unchanged (+1.1%, p = 0.82) while lines of code grow +12.8% (p = 0.003); the resulting 6.7% ASD decline (p = 0.004) is therefore a denominator effect rather than an architectural improvement. Per-type estimates and robustness checks (wild cluster bootstrap, Lee bounds, stale-observation sensitivity) corroborate the pattern; pre-trends are flat (Wald p = 0.90), consistent with parallel trends. Density-normalized outcomes can mislead when treatment affects system size: raw counts and explicit decomposition are required for causal mining studies of AI tool adoption. The complete replication package, including the curated 151-repository monthly panel, is publicly available.

[AI-22] Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation ICML2026

Link: https://arxiv.org/abs/2606.13285
Authors: Beinan Xu, Andy Song, Jiti Gao, Feng Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Abstract:We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require separate yet coordinated forecasts. Such scenarios often arise in real-world settings such as economics and healthcare modeling. Unlike existing approaches that predict one system at a time, ESE forecasts all systems in a single pass. It first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and the estimated equilibrium. Extensive experiments on synthetic and real-world datasets, including currency exchange and COVID-19 spread modeling, demonstrate that ESE is at least as accurate as state-of-the-art (SOTA) methods while being significantly faster. In addition, ESE integrates seamlessly with conventional predictors, combining their accuracy with its exceptional efficiency and delivering a 10-70x speedup. With linear-time complexity, ESE scales far better than SOTA methods as the number of systems increases. Moreover, it remains accurate under diverse perturbations, establishing ESE as a fast, generalizable, robust, and scalable multi-prediction method.

[AI-23] ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

Link: https://arxiv.org/abs/2606.13282
Authors: Pratyush Chaudhari
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 10 tables

Abstract:As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

[AI-24] Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization ICML2026

Link: https://arxiv.org/abs/2606.13276
Authors: Kirato Yoshihara
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at WSS @ ICML 2026, code is available at this https URL

Abstract:Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.

[AI-25] From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification

Link: https://arxiv.org/abs/2606.13262
Authors: Rongxin Yang, Shenghong He, Siyuan Zhu, Chao Yu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent approaches combining Large Language Models (LLMs) with retrieval-augmented reasoning have shown promise for automated fact verification. To process complex claims, these verification pipelines typically execute multi-stage workflows that coordinate tightly coupled modules, including claim decomposition, evidence gathering, and verdict prediction. However, existing methods optimize individual stages in isolation or rely on fixed heuristics, which limits adaptive coordination among stages and can lead to suboptimal outcomes. In this work, we propose ProFact, an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification trajectories. ProFact trains a unified policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction. To address the sparse and delayed supervision provided by final veracity labels, ProFact introduces process-aware rewards that provide stage-level learning signals throughout the verification process. Empirical evaluation shows that ProFact consistently outperforms strong baselines in both verification performance and inference efficiency. These results highlight the effectiveness of process-aware trajectory optimization for multi-stage fact verification.

[AI-26] MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson’s Disease Gait Assessment

Link: https://arxiv.org/abs/2606.13258
Authors: Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown, Zhiqi Shen
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Gait-based Parkinson’s disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify the Toxic Teacher phenomenon and introduce Modality-Specific Warm-Up to stabilize newly learned modality representations before distillation. Second, we propose a statistics-decoupled MSBN architecture that isolates sensor statistics while maintaining a shared semantic backbone. Third, we design a curriculum-guided repulsive objective for Plasticity Recovery, preserving legacy knowledge while recovering modality-specific capacity. Experiments on three multimodal Parkinson’s gait datasets show that MOSAIC improves final performance and mitigates forgetting. Project code is available at: this https URL

[AI-27] Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes

Link: https://arxiv.org/abs/2606.13256
Authors: Anna-Maria Velentza, Anne-Gwenn Bosser
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026), Kitakyushu, Fukuoka, Japan

Abstract:Humor plays a central role in human social relationships, and recent advances in computational humor create new opportunities for integrating humor into human-robot interaction (HRI). While large language models (LLMs) can generate diverse forms of humor, it remains unclear how humor style, joke content, and language preference shape perceptions of robot-delivered humor in group settings. In this exploratory study, we employed a mixed factorial design in which participants evaluated AI-generated jokes delivered by a robot in a university classroom. We examined the effects of humor type (Affiliative, Self-Enhancing, Aggressive, Self-Defeating) and joke content (person-related vs. political) on perceived funniness and appropriateness, as well as preferred language. Results show that humor type significantly influences funniness, with Aggressive and Affiliative humor rated higher, while joke content primarily affects appropriateness, with person-related jokes preferred over political ones. Language preference was shaped by both joke content and participants’ self-reported fluency and humor practices.

[AI-28] Towards Personalized Federated Learning for Dysarthric Speech Recognition

Link: https://arxiv.org/abs/2606.13253
Authors: Tao Zhong, Mengzhe Geng, Jiajun Deng, Shujie Hu, Xunying Liu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:Speech recognition is challenging for dysarthric speakers. While federated learning (FL)-based ASR can be an effective tool for protecting privacy, it suffers from heterogeneity issues caused by speaker variability. Forcing all speakers to share the same model components can be suboptimal under such heterogeneity, making personalization a promising direction; however, related research on dysarthric speech remains limited. To this end, this paper explores two aggregation strategies to achieve personalization, including the parameter-based averaging strategy and the embedding-based averaging strategy. Experiments on UASpeech and TORGO show that the proposed methods outperform the baseline regularized FedAvg by statistically significant WER reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO, respectively.

[AI-29] Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis

Link: https://arxiv.org/abs/2606.13249
Authors: Seongjin Kim, Sungil Kim
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Maritime accident adjudication reports contain critical tribunal findings for root cause analysis (RCA), yet retrieving relevant precedents and drafting consistent reports from decades of records remains labor-intensive. This paper proposes a multi-field hybrid retrieval-augmented generation (RAG) framework for automated maritime RCA, utilizing a comprehensive dataset of 13,329 Korea Maritime Safety Tribunal (KMST) reports (1971-2025). We transform raw adjudications into a structured knowledge base of “incident cards”, indexing three distinct fields (Summary, Causes, and Disposition) alongside a hierarchical L1/L2 cause taxonomy. Our retrieval strategy employs a field-aware hybrid approach, fusing sparse and dense rankings via Reciprocal Rank Fusion (RRF). Given the lack of large-scale expert relevance labels, we evaluate retrieval performance using ceiling-normalized recall and nDCG based on a metadata-derived proxy relevance score. Experimental results demonstrate that our proposed retrieval significantly outperforms baseline methods, improving NormRecall@100 from 0.18 to 0.55. Furthermore, grounding the generator on the retrieved precedents enhances RCA generation quality over an LLM-only baseline, increasing the LLM-as-a-judge score from 3.34 to 3.72. These findings suggest that field-aware RAG can substantially streamline maritime safety investigation workflows by enabling faster precedent search and more consistent, evidence-based RCA drafting.
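
Reciprocal Rank Fusion, the step that merges the sparse and dense rankings, can be sketched in a few lines of Python. Each document scores the sum of `1 / (k + rank)` over the lists it appears in. The constant `k = 60` is the commonly used default and is an assumption here, since the abstract does not state the value; the document IDs are made up, and the paper's field-aware weighting is not modeled.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical sparse (keyword) and dense (embedding) result lists.
sparse = ["case_12", "case_07", "case_33"]
dense = ["case_07", "case_41", "case_12"]

fused = reciprocal_rank_fusion([sparse, dense])
print(fused)
```

Documents that appear in both lists accumulate two terms and rise above documents found by only one retriever, which is why RRF needs no score calibration between the sparse and dense systems.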

[AI-30] EPIG: Emotion-Based Prompting for Personalised Image Generation

Link: https://arxiv.org/abs/2606.13247
Authors: Emna Othmen, Mohamed Yassine Landolsi, Lotfi Ben Romdhane
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to arXiv. 20 pages, 4 figures. Work on emotion-based prompt engineering for text-to-image diffusion models with applications in personalized image generation

Abstract:Text-to-image diffusion models have achieved impressive results in synthesizing high-quality images from natural language prompts. However, commonly used prompting strategies remain relatively generic, limiting the model’s ability to accurately express emotional intent and nuanced affective attributes. This work proposes EPIG, a method that enhances emotional expressiveness at the prompt level prior to image generation. Grounded in psychologically informed emotion representations (valence-arousal) and leveraging structured, role-aware prompt enrichment, EPIG enriches emotion-related components of prompts without modifying or retraining the image generation backbone. The resulting emotion-aware prompts guide the generative process toward more emotionally coherent visual outputs, with particular effectiveness in controlling arousal. EPIG is lightweight, training-free, and well suited for resource-constrained and personalized image generation scenarios. Experimental results on a benchmark of 10 diverse prompts show that EPIG reduces mean arousal error compared to strong baselines, including naive insertion and LLM-based prompt expansion, with reductions of 14% and 12%, respectively. These improvements are statistically significant. EPIG also preserves valence alignment and semantic consistency, as measured by CLIPScore and supported by ablation studies. The effect is more pronounced on prompts containing explicit subjects such as humans, children, or animals, where the reduction reaches 17%, highlighting the subject-sensitive behavior of the proposed method.

[AI-31] Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm

Link: https://arxiv.org/abs/2606.13241
Authors: Francesco Massa, Marco Cristofanilli
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures. Technical report

Abstract:Defining query difficulty is one of the hardest problems in deployment engineering. Existing LLM routers rely on surface features such as domain labels, keywords, and token count, ignoring the within-domain variance that actually determines model success. Frontier models cost ten to one hundred times more than local open-weight models, so at production scale even small per-request savings become a direct cloud-bill lever. We present Brick, a multimodal router that scores each model on six capability dimensions, combines this with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. A continuous preference knob lets operators slide between max-quality and max-saving profiles at deploy time. On a benchmark of 5,504 queries, Brick at max-quality reaches 76.98% accuracy, beating the best single model (75.02%) and all tested routers. At a neutral cost-quality profile, Brick achieves 74.11% accuracy at 4.71x lower cost than always using the strongest model. At min-cost, it cuts cost 22.15x with 11.85 points accuracy loss. Median latency drops from 51.2s to 22.8s.

[AI-32] Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier ICML2026

Link: https://arxiv.org/abs/2606.13236
Authors: Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Applications (stat.AP)
Comments: ICML 2026 Workshop on Machine Learning for Audio

Abstract:Passive acoustic monitoring holds great promise for ecological inference, yet existing automated tools are typically narrowly trained and non-transferable. We address these limitations with PULSE, a semi-supervised, multi-task framework for Orthoptera bioacoustics, combining weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. Our domain-adapted specialist model outperforms a state-of-the-art general model across all metrics (macro F1: 0.21 vs. 0.07; AUC: 0.74 vs. 0.45; AP: 0.32 vs. 0.19), with active learning further raising F1 to 0.34 and AUC to 0.84. Beyond classification, the learned embeddings encode ecologically meaningful structure, exposed through an interactive visualisation tool for ecological discovery.

[AI-33] ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Link: https://arxiv.org/abs/2606.13233
Authors: Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim, Hanyul Ryu, Soojung Ryu, Jungwook Choi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-M NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline. Our CUDA-core small-M kernel further improves latency-critical decoding, delivering up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Code is available at this https URL.
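
The idea of adapting decoding temperature from entropy signals can be illustrated with a small Python sketch. The rule below is hypothetical and is not the ReSET algorithm: it simply cools sampling on low-entropy tokens and warms it inside high-entropy reasoning steps, which is the direction the abstract's two observations suggest. The thresholds, the temperature values, and every function name are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_temperature(token_entropy, step_entropy,
                         low_token=0.5, high_step=1.5,
                         cool=0.7, warm=1.3):
    """Hypothetical rule: sharpen confident tokens, flatten uncertain steps."""
    if token_entropy < low_token:
        return cool   # low-entropy token: reduce the chance of a stray sample
    if step_entropy > high_step:
        return warm   # high-uncertainty step: counter over-concentration
    return 1.0

logits = [4.0, 1.0, 0.5, 0.2]
token_h = entropy(softmax(logits))
temperature = adaptive_temperature(token_h, step_entropy=1.0)
print(round(token_h, 3), temperature, softmax(logits, temperature))
```

Lowering the temperature concentrates probability on the top token, and raising it spreads probability out, so a rule of this shape can push in opposite directions on the two failure cases.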

[AI-34] Proprioceptive-visual correspondence enables self-other distinction in humanoid robots

Link: https://arxiv.org/abs/2606.13222
Authors: Yurun Chen, Tianyuan Gao, Yizhong Ge, Shikun Ban, Yizhou Wang, Hongkai Xiong, Wenjun Zeng, Wentao Zhu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 23 pages, 9 figures, 1 supplementary table

Abstract:Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with humans still lack this ability. Here we show that a humanoid robot can learn self-other distinction from proprioceptive-visual correspondence, without any identity labels or kinematic models. Once established, this distinction bootstraps a predictive self-model that maps joint configurations to three-dimensional body occupancy, capturing how the robot’s body changes with action. In multi-agent scenes involving humans or morphologically identical robots, the system reliably identifies itself, learns a 3D self-model, and supports downstream tasks including target reaching, collision-aware motion planning, and human-to-robot motion retargeting. Together, these results outline a route toward bodily self-representation in robots that act and coordinate alongside others in shared physical environments. Project page: this https URL.

[AI-35] Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

Link: https://arxiv.org/abs/2606.13211
Authors: Omar Alshahrani, Muzammil Behzad
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:AI systems are being deployed across medical imaging faster than their failure modes are understood. At this point in time, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences, for example, for biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities?, (2) how do medical-specialized foundation models hallucinate less than general-purpose ones?, and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA’s Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

[AI-36] A Minimal Model of Bounded Trade-Off Screening in Multi-Attribute Choice

Link: https://arxiv.org/abs/2606.13201
Authors: Manisha Dubey,Anirban Sarkar,Subramanian Ramamoorthy
Categories: Artificial Intelligence (cs.AI)
Comments: 3 pages, 1 figure, accepted as extended abstract at Annual Conference on Cognitive Computational Neuroscience 2026

Click to view abstract

Abstract:Human decision-making often involves choosing between multi-attribute alternatives, yet classical models assume fully compensatory utility aggregation despite evidence that people reject options with poor performance on critical attributes. We propose a bounded trade-off reasoning framework in which decisions are governed by a screening process that evaluates the balance between gains and losses across attributes. The model introduces a trade-off tolerance parameter that controls acceptable imbalance and can vary across contexts. Through simulation, we show that this mechanism produces preference patterns that differ from standard utility-based models and captures context-dependent variation in trade-off behavior. These results establish bounded trade-off screening as a plausible computational mechanism for multi-attribute choice and generate testable predictions for future behavioral studies.
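To make the contrast with fully compensatory aggregation concrete, here is a minimal sketch of a bounded trade-off screen. The rule used (accept an option when its summed losses are at most `tolerance` times its summed gains relative to a reference option), the function names, and the numbers are assumptions for illustration only; the abstract does not state the model's actual formulation.

```python
from typing import Sequence


def compensatory_prefers(option: Sequence[float], reference: Sequence[float]) -> bool:
    """Fully compensatory baseline: prefer the option with the larger summed utility."""
    return sum(option) > sum(reference)


def passes_screen(option: Sequence[float], reference: Sequence[float], tolerance: float) -> bool:
    """Hypothetical screening rule (an assumption, not the paper's definition):
    accept when total losses are at most `tolerance` times total gains."""
    diffs = [o - r for o, r in zip(option, reference)]
    gains = sum(d for d in diffs if d > 0)
    losses = sum(-d for d in diffs if d < 0)
    return losses <= tolerance * gains


reference = [5, 5, 5]
option = [9, 9, 1]  # strong on two attributes, poor on the third

print(compensatory_prefers(option, reference))   # summed utility favors the option
print(passes_screen(option, reference, 0.25))    # a strict tolerance rejects it
print(passes_screen(option, reference, 1.0))     # a lenient tolerance accepts it
```

Under these assumptions, the same option is preferred by the additive rule yet rejected by a strict screen, which is the kind of divergence the abstract describes; varying `tolerance` stands in for context-dependent trade-off behavior.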

[AI-37] ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

Link: https://arxiv.org/abs/2606.13197
Authors: Fuqiang Niu,Bowen Zhang
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD framework that treats debate as conditional computation. ARMOR-MAD combines three components: Pre-debate Agreement Routing (PAR) decides whether independently generated Round-0 answers require debate; Early Agreement Stopping Evaluator (EASE) stops debate after convergence; and Semantic Outlier Detection (SOD) down-weights abnormal final answers during aggregation. Across MATH Level 5, GSM8K, MMLU, and MMLU-Pro, ARMOR-MAD consistently improves over fixed-round heterogeneous debate with the same model pool, reaching 65.5%, 96.5%, 90.0%, and 81.5% accuracy, respectively. The results suggest that genuine model heterogeneity and agreement-based control are both important for making MAD more accurate and efficient.
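The idea of treating debate as conditional computation can be sketched as follows. This is an illustrative toy, not the ARMOR-MAD implementation: the agreement measure, the `threshold` and `max_rounds` parameters, and the `toy_round` stand-in for real model calls are all assumptions, and the outlier down-weighting step (SOD) is not modeled.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority(answers: List[str]) -> str:
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def agreement(answers: List[str]) -> float:
    """Fraction of agents that gave the most common answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def routed_debate(
    round0: List[str],
    debate_round: Callable[[List[str]], List[str]],
    threshold: float = 1.0,   # assumed agreement level that makes debate unnecessary
    max_rounds: int = 3,      # assumed cap on debate rounds
) -> Tuple[str, int]:
    """Skip debate when Round-0 answers already agree; otherwise debate
    until agreement reaches the threshold or the round cap is hit."""
    answers = list(round0)
    rounds_used = 0
    while agreement(answers) < threshold and rounds_used < max_rounds:
        answers = debate_round(answers)
        rounds_used += 1
    return majority(answers), rounds_used


def toy_round(answers: List[str]) -> List[str]:
    """Stand-in for real LLM calls: every agent adopts the current majority."""
    return [majority(answers)] * len(answers)


print(routed_debate(["42", "42", "42"], toy_round))  # no debate needed
print(routed_debate(["41", "42", "42"], toy_round))  # one round, then early stop
```

The single loop condition plays both roles described in the abstract: it gates entry into debate and stops the debate once the answers converge.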

[AI-38] Under What Conditions Can a Machine Become Genuinely Creative?

Link: https://arxiv.org/abs/2606.13196
Authors: Yong Zeng
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:Recent AI systems can generate texts, software architectures, hypotheses, designs, and scientific workflows that appear creative. This paper asks under what conditions a machine can become genuinely creative, and how human agency can be preserved within shared cognitive and creative environments. It develops a requirement framework derived from Designics, the science of meaning-bearing intentional change. The paper argues that genuine machine creativity should not be defined by output novelty, current performance, or transient architecture alone. Instead, creativity is understood as the structural transformation of incomplete situations through recursive intervention dynamics. On this view, it depends on ten requirements: environment representation, scoped perception, conflict identification, intervention capability, consequence observation, knowledge and environment update, rescoping, local-to-global unfolding, value-based scoping, and human-AI co-living. These are organized through the three laws of Designics: perception, conflict, and capability. The paper illustrates the computational tractability of these requirements through selected cyber-physical and cyber-biological studies, including recursive element extraction, autonomous mesh generation, and neurophysiological and workload analysis. It then treats open-ended systems, automated discovery frameworks, self-modifying agents, foundation models, and agentic workflows as pressure cases: they demonstrate powerful generative means but do not by themselves establish genuine machine creativity. Finally, the paper argues that proactive AI ethics is internal to genuine machine creativity rather than an after-the-fact filter. Value-based scoping and human-AI co-living must shape how creative machines perceive environments, identify conflicts, select interventions, observe consequences, update knowledge, and rescope future action.

[AI-39] Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach CVPR2026

Link: https://arxiv.org/abs/2606.13192
Authors: Ruichao Mao,Zhou Fang,Teng Guo,Hao Yang,Yaping Li,Shaohua Peng,Maji Huang,Xiaoyu Lin,Shuoyang Liu,Xuepeng Li,Yuyu Zhang,Hai Rao
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures, Accepted at CVPR 2026 Findings

Click to view abstract

Abstract:User experience (UX) centered on usability, perceived consistency, and functional clarity is fundamental to real-world user interfaces (UI). The application of multimodal large language models (MLLMs) to user interfaces is evolving rapidly, in areas such as visual element grounding, graphical user interface (GUI) agents, and design-to-code generation. However, research efforts on evaluating UX based on UI screenshots are still immature. To address this, we propose UXBench, a novel multimodal benchmark consisting of 2,000 VQA data samples designed to assess MLLMs’ ability to perform UI-based reasoning. UXBench includes 8 tasks based on real-world UI screenshots that require fine-grained diagnosis of UX issues across layout relationships, visual hierarchy, and content consistency. Our extensive evaluation of mainstream MLLMs shows that they remain fundamentally limited in their capacity for UI-based reasoning. The results underscore the need for further advancements in this area. To bridge this gap, we propose UI-UX, an MLLM based on the Qwen3-VL-4B-Thinking foundation model and enhanced via reinforcement learning with two key innovations: a reward routing mechanism that dynamically balances perceptual understanding and logical reasoning during inference, and an asymmetric transition reward that suppresses redundant or insufficient reasoning steps. Experiments demonstrate that UI-UX achieves state-of-the-art (SOTA) performance on UXBench, attaining an accuracy of 0.7963 – surpassing Claude-4.5-Sonnet’s 0.6550 – while exhibiting strong generalization across diverse UI tasks and maintaining low inference latency.
Submission history: [v1] Thu, 11 Jun 2026 11:00:16 UTC (6,330 KB), submitted by Ruichao Mao. ACM classes: I.2.7; I.2.10; I.2.6; H.5.2.

[AI-40] Modern analog computing for solving differential and matrix equations

Link: https://arxiv.org/abs/2606.13179
Authors: Zhong Sun,Piergiulio Mannocci,Manuel Le Gallo,Abu Sebastian
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:In recent years, driven by the computational demands of data-intensive applications such as artificial intelligence and scientific computing, analog computing has gained renewed interest. Given the diversity of computational tasks and recent advancements in analog CMOS circuits and resistive memory technologies, we refer to the evolving landscape as modern analog computing. In this context, we identify three core computational primitives: solving differential equations, solving matrix equations, and performing matrix-vector multiplications, and we explore the connections among them. We also examine various hardware implementations of these analog computing operators, including those built with discrete components, integrated circuits, and resistive memory devices. Among these, resistive memory arrays emerge as particularly promising due to their implementation efficiency. The paper then surveys recent progress in leveraging modern analog computing to solve differential and matrix equations using both advanced analog CMOS circuits and resistive memory arrays. Finally, we discuss the applications of these circuits, the precision and scalability issues and their potential solutions, the relationship with in-memory computing, and the unique computational complexity of analog computing. This paper provides a unified perspective on analog computing, highlighting its strengths, current developments, and challenges, and positioning it as a pivotal enabler of next-generation computational frontiers.

[AI-41] Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

Link: https://arxiv.org/abs/2606.13176
Authors: Xin Wang,Boyan Gao,Yibo Yang,David A. Clifton
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.
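One way to picture a stage-wise entropy regularizer is a coefficient that starts positive (rewarding exploration) and ends negative (penalizing uncertainty). The sketch below is only an assumption about the general shape: the linear schedule, the default coefficients, and the function names are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def stage_coefficient(stage: int, num_stages: int,
                      beta_early: float = 0.01, beta_late: float = -0.01) -> float:
    """Hypothetical linear schedule over 0-based reasoning stages:
    positive early (entropy bonus), negative late (entropy penalty)."""
    if num_stages == 1:
        return beta_early
    frac = stage / (num_stages - 1)
    return beta_early + frac * (beta_late - beta_early)


def entropy_term(probs: Sequence[float], stage: int, num_stages: int) -> float:
    """Regularization term that would be added to a policy objective."""
    return stage_coefficient(stage, num_stages) * entropy(probs)


uniform = [0.5, 0.5]
for s in range(3):
    print(s, entropy_term(uniform, s, 3))
```

With three stages, the same uncertain token distribution is rewarded in the first stage, treated neutrally in the middle, and penalized in the last, mirroring the shift from exploration to confident decisions that the abstract describes.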

[AI-42] TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

Link: https://arxiv.org/abs/2606.13148
Authors: Dat Tien Nguyen,Thao Nguyen,Fadillah Adamsyah Maani,Huy M. Le,Muhammad Umer Sheikh,Numan Saeed,Muhammad Haris Khan,Salman Khan
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models can forecast well, but do not reason interactively in language, while large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth-science remain underserved. We introduce TerraBench, a benchmark for grounded Earth-science reasoning, built on TerraAgent, a ReAct-style executable framework that interleaves reasoning, tool calls, and observations to couple LLM planning with scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning and simulation in a single executable interface, whereas prior benchmarks isolate these capabilities into narrow individual tasks. It is also the first in this space to pair process-level tool-use metrics with tolerance-aware numeric scoring. The benchmark comprises 403 extensive agentic tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains with 24,500 verified execution steps. These results indicate that reliable Earth-science agents must go beyond tool access to coordinate heterogeneous workflows, parameterize tools precisely, and preserve artifact provenance.
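Tolerance-aware numeric scoring generally means counting a numeric answer as correct when it falls within an allowed error band rather than requiring an exact match. The sketch below shows one plausible form; the relative and absolute tolerance values and the function names are assumptions, not TerraBench's actual scoring rule.

```python
from typing import Sequence


def within_tolerance(predicted: float, reference: float,
                     rel_tol: float = 0.05, abs_tol: float = 1e-9) -> bool:
    """Accept a prediction whose error is within rel_tol of the reference
    magnitude (abs_tol covers references at or near zero)."""
    return abs(predicted - reference) <= max(rel_tol * abs(reference), abs_tol)


def numeric_accuracy(predictions: Sequence[float], references: Sequence[float],
                     rel_tol: float = 0.05) -> float:
    """Fraction of predictions accepted under the tolerance rule."""
    hits = sum(within_tolerance(p, r, rel_tol) for p, r in zip(predictions, references))
    return hits / len(references)


print(within_tolerance(102.0, 100.0))                        # inside the 5% band
print(within_tolerance(110.0, 100.0))                        # outside the 5% band
print(numeric_accuracy([102.0, 110.0, 0.0], [100.0, 100.0, 0.0]))
```

A rule of this kind avoids penalizing agents for harmless rounding differences in computed quantities while still rejecting answers that are materially wrong.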

[AI-43] Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

Link: https://arxiv.org/abs/2606.13141
Authors: Yuho Lee,Jisu Shin,Nicole Hee-Yeon Kim,Jihwan Bang,Juntae Lee,Kyuwoong Hwang,Fatih Porikli,Hwanjun Song
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of ⟨query, evidence chunk, answer⟩ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
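The chunk-level selection described here can be illustrated with a small sketch. It assumes each retriever configuration has already produced a relevance score per chunk; the configuration names (`frames`, `transcript`), the score values, and the simple argmax-then-top-k rule are hypothetical and stand in for CARVE's actual reranker.

```python
from typing import Dict, List, Tuple


def select_chunks(scores: Dict[str, Dict[str, float]], top_k: int) -> List[Tuple[str, str]]:
    """For each chunk, keep the configuration with the highest score,
    then return the top_k chunks together with their winning configuration."""
    winners = []
    for chunk_id, per_config in scores.items():
        best_config = max(per_config, key=per_config.get)
        winners.append((chunk_id, best_config, per_config[best_config]))
    winners.sort(key=lambda w: w[2], reverse=True)
    return [(chunk_id, config) for chunk_id, config, _ in winners[:top_k]]


# Hypothetical scores from two parallel retriever configurations.
scores = {
    "c1": {"frames": 0.9, "transcript": 0.4},
    "c2": {"frames": 0.2, "transcript": 0.7},
    "c3": {"frames": 0.1, "transcript": 0.3},
}
print(select_chunks(scores, top_k=2))
```

The returned evidence mixes configurations across chunks, which is the interleaving behavior that a single per-query configuration cannot produce.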

[AI-44] Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Link: https://arxiv.org/abs/2606.13125
Authors: Akshay Krishnamurthy,Audrey Huang,Nived Rajaraman
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.

[AI-45] MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

Link: https://arxiv.org/abs/2606.13119
Authors: Lilan Peng,Yandi Liu,Qingren Yao,Chongshou Li,Tianrui Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:Spatio-temporal forecasting is crucial in diverse fields, such as transportation, climate, and energy. Urban spatio-temporal data exhibits temporal mirage: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason lies in the short-window inputs that have incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop a novel Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) The multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck project and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) This plugin can seamlessly integrate into existing STGNN backbones to strengthen their forecasting performance. Experiments on five STGNN baselines across five real-world datasets (including a large-scale dataset, CA) verify the effectiveness, superior scalability, and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces MAE by 4.7% and RMSE by 5.0%. The code is available at this https URL.

[AI-46] Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents ICML2026

Link: https://arxiv.org/abs/2606.13097
Authors: Saehun Chun,Wonje Choi,Sera Choi,Sanghyun Ahn,Honguk Woo
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026

Click to view abstract

Abstract:Code-writing large language models (CodeLLMs) generate executable code policies for embodied agents by translating natural language goals and environmental constraints into structured control programs. However, policy generation in open-domain embodied environments suffers from two fundamental limitations: (i) delayed decoding caused by repetitive prefill computation over long prompts, and (ii) limited robustness due to fully generative decoding, which often produces API mismatches, missing safety guards, and unstable control logic. To address these limitations, we present FCGraft, a Functional Cache Grafting framework. FCGraft maintains a library of function-level validated code skeletons and their associated prompt-level Transformer key-value (KV) caches, and synthesizes new policies by retrieving relevant functions and grafting their KV caches when a new task is provided. Given retrieved function caches, FCGraft performs cache grafting via stitching, which composes cached function segments into a composite policy, and patching, which locally adapts only the necessary code regions to satisfy task-specific parameters and constraints with minimal additional decoding. By eliminating redundant prefill computation, this approach reduces generation latency, while reusing validated control structures improves robustness over prompt-level caching methods such as RAGCache, achieving an 18.31% higher task success rate and 2.3x faster policy synthesis.

[AI-47] Emotional regulation improves deep learning-based image classification

Link: https://arxiv.org/abs/2606.13081
Authors: Riccardo Emanuele Landi,João M. F. Rodrigues,Marta Chinnici
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Emotion significantly influences cognition, enhancing memory and learning under certain conditions. Drawing on this principle, emotion-augmented deep learning investigates how affective states can improve neural network architectures and learning paradigms, achieving better generalization than non-emotional models. However, existing methods often rely solely on objective neurophysiological factors, neglecting the role of subjectivity in emotion. To bridge this gap, the present study introduces Emotional Regulation, a novel framework for modeling emotion in deep learning through artificial subjective experience. The method employs pre-training based on affective stimuli, balancing non-emotional and emotionally-influenced responses in downstream task optimization. Extensive experimentation was conducted in image classification, pre-training ResNet and ViT architectures on four emotional datasets, using CIFAR-10 and CIFAR-100 as target benchmarks. Results reveal improvements over the aforementioned backbones, providing evidence of Emotional Regulation as a promising method for defining emotion-augmented deep learning through artificial subjective experience. Furthermore, the proposed approach outperforms related work in image classification based on CIFAR, revealing Emotional Regulation as the new state-of-the-art in emotion-augmented deep learning for large-scale vision datasets. The study also reinforces evidence of the impact of affective states in improving the optimization of machine learning tasks, encouraging further investigation into emotion-inspired architectures.

[AI-48] The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

Link: https://arxiv.org/abs/2606.13079
Authors: Jiaqi Luo,Jiarun Dai,Zhile Chen,Jia Xu,Weibing Wang,Yawen Duan,Brian Tse,Geng Hong,Xudong Pan,Yuan Zhang,Min Yang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

[AI-49] TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization ICML2026

Link: https://arxiv.org/abs/2606.13054
Authors: Zhixiong Zhao,Zukang Xu,Zhixuan Chen,Xing Hu,Zhe Jiang,Dawei Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at this https URL.
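For readers unfamiliar with ternarization, the sketch below shows where the "1.58-bit" figure comes from (a three-valued weight carries log2(3) bits) and what a basic ternary quantizer looks like. The quantizer here is a generic absmean-scaled rounding rule used purely as an illustration; it is not the paper's E2M-ATQ quantizer, and the example weights are made up.

```python
import math
from typing import List, Tuple


def ternarize(weights: List[float]) -> Tuple[List[int], float]:
    """Generic baseline (an assumption, not TWLA's method): scale by the mean
    absolute weight, round, and clip each code to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Reconstruct approximate weights from ternary codes and the shared scale."""
    return [c * scale for c in codes]


# Information content of one three-valued weight, in bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

weights = [0.9, -0.1, 0.05, -1.2, 0.4]  # illustrative values only
codes, scale = ternarize(weights)
print(codes, scale)
print(round(BITS_PER_TERNARY_WEIGHT, 3))
```

Small weights collapse to zero and large ones saturate at plus or minus one, which is why heavy-tailed distributions are hard to ternarize and why the abstract emphasizes reshaping weights and suppressing activation outliers.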

[AI-50] EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation

Link: https://arxiv.org/abs/2606.13053
Authors: Kailin Wang,Haoxiang Jie,Yaoyuan Yan,Jiacheng Zhou,Zhiyou Heng
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant events. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce EA-WM, an event-aware world-model framework that augments frozen visual-feature dynamics with task-specification-grounded event prediction and verification. EA-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPO-generated proposals. Across navigation, deformable-object, wall-constrained, and language-described manipulation studies, EA-WM shows that event-aware verification can make feature-space world models more interpretable and better aligned with task progress.

[AI-51] AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

Link: https://arxiv.org/abs/2606.13051
Authors: Fabien Maury(Imagine - U1163, HeKA | U1346),Solène Grosdidier,Maud de Dieuleveult(Imagine - U1163),Adrien Coulet(HeKA | U1346)
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, where we manually annotated entities and their relationships. First, AAbAAC was used to evaluate several methods on the task of named entity recognition (NER), and secondly, to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at this https URL.

[AI-52] Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior

Link: https://arxiv.org/abs/2606.13038
Authors: Haowei Qian
Categories: Artificial Intelligence (cs.AI)
Comments: 37 pages, 1 figure, 7 tables. Reproduction artifacts (code, frozen profiles, prompts, model outputs): this https URL

Click to view abstract

Abstract:As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared foundation models produce correlated forecasts, and recent measurement finds frontier-model errors correlated at r ~ 0.77. We ask whether human cognitive diversity can be recovered from behavior and transferred to LLM agents. Nous extracts a structured eight-dimension behavioral profile from real Polymarket trading activity and injects it into agents through prompts. Our central finding is a dissociation between the two halves of that pipeline. Extraction works, partially: across 100 wallets, 8 of 14 parameters are temporally stable (split-half ICC = 0.5, bootstrap CI lower bound 0.3; contrarian score reaches ICC ~ 0.9); wallets are identifiable from their profiles well above chance (top-1 retrieval 17-22% vs. 1% chance); and two of four pre-specified dimensions rank-correlate with future realized profit out-of-sample, though the correlations do not survive behavioral-confound controls. Prompt-level injection does not measurably transmit it: on a semantic embedding metric, structured injection shows no significant advantage over a length-matched control on any model, and the diversity it induces neither reduces ensemble error correlation nor improves Brier score – a null that persists across exploratory checks on sampling temperature, profile diversity, and question difficulty. Measuring the prompts themselves locates the compression before the model: the structure-to-narrative translator emits near-uniform prompts whose spread does not track profile spread. We position Nous as measuring the cognitive-monoculture problem and the limits of a prompt-level remedy, motivating deeper, below-the-prompt injection (fine-tuning, activation steering). Code, frozen profiles, prompts, and model outputs: this https URL
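Two of the quantities this abstract relies on, the Brier score and the correlation between forecasters' errors, are easy to compute directly. The sketch below uses the standard definitions; the forecast values are invented for illustration, and the paper's exact evaluation protocol may differ.

```python
import math
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def error_correlation(forecasts_a: Sequence[float], forecasts_b: Sequence[float],
                      outcomes: Sequence[int]) -> float:
    """Correlation between two forecasters' signed errors on the same questions."""
    errors_a = [f - o for f, o in zip(forecasts_a, outcomes)]
    errors_b = [f - o for f, o in zip(forecasts_b, outcomes)]
    return pearson(errors_a, errors_b)


outcomes = [1, 0, 1]
agent_a = [0.9, 0.2, 0.6]   # illustrative forecasts only
agent_b = [0.8, 0.3, 0.5]
print(brier_score(agent_a, outcomes))
print(error_correlation(agent_a, agent_b, outcomes))
```

High error correlation means that averaging the agents cancels little of their error, which is why the abstract treats reduced ensemble error correlation as the test of whether injected diversity is useful.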

[AI-53] Democracy in the Era of Artificial Intelligence

Link: https://arxiv.org/abs/2606.13026
Authors: Evangelos Pournaras,Srijoni Majumdar,Carina Hausladen,Dirk Helbing
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Interfacing Artificial Intelligence (AI) with democracy is one of the most profound challenges of our times. On the one hand, AI comes with opportunities to overcome long-standing challenges in democracy, such as low participation in deliberative and voting processes with poor representation of people. On the other hand, new risks arise from AI algorithms that are privacy-intrusive, biased, manipulative, spread misinformation and influence election results. Moving beyond the over-simplistic question of whether AI is good or bad for democracy, the Handbook on Democracy in the Era of Artificial Intelligence asks instead: how to upgrade democracies and the principles they are built on, using AI? How to engage with AI and on what terms? Which new values and design principles are required to build democratic resilience? In 34 chapters by 59 authors across the world from different disciplines, we explore how AI can empower collective intelligence for democracy (Part 1) and what is the future of deliberative democracy using large language models and social media (Part 2). We also illustrate the role of AI for building resilient self-governance systems (Part 3) and the challenges of transforming democracy in the age of AI (Part 4). We conclude with broader perspectives (Part 5) that re-imagine the interplay of democracy and AI.

[AI-54] CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

Link: https://arxiv.org/abs/2606.13024
Authors: Bo Liu,Di Dai,Jingwei Liu,Jiarui Jin,Xiaocheng Fang,Guangkun Nie,Hongyan Li,Shenda Hong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a “one-size-fits-all” paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-level heterogeneity. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts, which dynamically identifies latent temporal patterns and routes patches to specialized domain experts, effectively decoupling regime-specific mechanisms from shared dynamics. To ensure interpretable graph recovery, we design a Causality-Aware Self-Attention mechanism operating across variables, yielding sparse Granger causal graphs via proximal optimization. Furthermore, CausalMoE is the first to integrate LLMs and VLMs to align numerical signals with textual and visual priors, regularizing causal estimation in complex scenarios. Extensive experiments demonstrate that CausalMoE establishes a new state-of-the-art on fully supervised benchmarks, while effectively generalizing to few-shot settings where traditional methods fail.
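
Sparse graph recovery "via proximal optimization" is commonly done with an L1 proximal step (soft-thresholding). The sketch below shows that step on a toy matrix of causal scores. It assumes an L1 penalty and a `scores[source][target]` layout, neither of which the abstract confirms, and all names are illustrative.

```python
def soft_threshold(x, lam):
    """Proximal operator of lam * |x|: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def prox_l1(scores, lam):
    """Apply soft-thresholding entry-wise to a matrix of causal scores."""
    return [[soft_threshold(v, lam) for v in row] for row in scores]

def granger_edges(scores, lam):
    """Return (source, target) pairs whose score survives the shrinkage."""
    shrunk = prox_l1(scores, lam)
    return [
        (src, dst)
        for src, row in enumerate(shrunk)
        for dst, v in enumerate(row)
        if v != 0.0
    ]
```

Entries whose magnitude is at most `lam` become exactly zero, which is what yields a sparse graph rather than a dense matrix of small weights.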

[AI-55] SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

Link: https://arxiv.org/abs/2606.13020
Authors: Pierre Beckmann,Marco Valentino,Andre Freitas
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these in scientific settings is currently out of reach: scientific benchmarks built on human annotations are costly and lack mechanistic ground truth, while synthetic logical-reasoning benchmarks do not resemble real scientific documents. We introduce SciR, a benchmark that combines multi-paradigm reasoning with controllable scientific rendering, anchored on three paradigmatic scientific problems. Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres. The construction lets us independently vary two difficulty axes: how hard it is to extract the key information needed for inference, and how hard the principled inference itself is. We test six models. Both axes hurt every model, and their effects compound. The rendering even hurts neurosymbolic pipelines, which hand inference to a verified solver. The two axes yield a per-model extraction-vs-inference profile: for instance, reasoning models like deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. To our knowledge, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control on both extraction and inference difficulty.

[AI-56] Otters: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer

Link: https://arxiv.org/abs/2606.13016
Authors: Zhanglu Yan,Jiayi Mao,Kaiwen Tang,Fanfan Li,Gang Pan,Tao Luo,Bowen Zhu,Qianhui Liu,Weng-Fai Wong
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Spiking neural networks (SNNs) are promising for energy-efficient inference, and time-to-first-spike (TTFS) coding is especially attractive because each neuron fires at most once. In practice, however, this benefit is often reduced by the cost of computing a temporal decay term and multiplying it by the synaptic weight. We address this issue by turning a physical hardware “bug,” the natural signal decay in optoelectronic devices, into the main computation of TTFS, named Otters++. Specifically, we use the measured decay of a custom In$_2$O$_3$ optoelectronic synapse to directly realize the TTFS temporal term, removing the need for explicit digital decay computation. To scale this idea to Transformer models, we establish a layer-wise functional equivalence between Otters++ and a quantized neural network (QNN), and develop a hybrid training method that uses device-faithful SNN computation in the forward pass and QNN straight-through gradients through the equivalent QNN path in the backward pass, together with model distillation. This avoids differentiation through discrete first-spike events and reduces the over-sparsity problem in direct TTFS-SNN training. We further make training aware of measured device noise by sampling run-to-run variation, and refine the system-level energy model by accounting for device sharing and multi-hop communication. On the GLUE dataset, Otters++ improves the average score to 84.17% while maintaining a clear energy advantage over prior spiking Transformer baselines. These results show that physically grounded TTFS computing can be efficient, trainable, and robust under realistic hardware effects.

[AI-57] scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing

Link: https://arxiv.org/abs/2606.13007
Authors: Ping Xu,Pengjiang Li,Tian Du,Zaitian Wang,Jiawei Gu,Ziyue Qiao,Pengfei Wang,Yuanchun Zhou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns, suffering from semantic agnosticism by neglecting the intrinsic biological functions encoded by genes. While Large Language Models (LLMs) offer promising semantic capabilities, their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering framework. Diverging from data-driven paradigms, scLLM-DSC establishes a semantically-grounded representation by synergizing two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted via a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism to enforce consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.

[AI-58] APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization KDD KDD2026

Link: https://arxiv.org/abs/2606.12991
Authors: Yifan Zhao,Lang Qin,Jintai Chen
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Click to view abstract

Abstract:Cyclic peptides represent a promising class of therapeutic compounds in modern drug discovery, often offering improved stability and binding affinity. However, the de novo design of cyclic peptides remains challenging because methods must identify pocket-adaptive cyclization patterns and linkage sites while simultaneously controlling drug-relevant properties. This challenge is particularly pronounced for recent generative models trained predominantly on linear peptide data, which may fail to capture cyclization-specific constraints. To address the limitation, we introduce APCyc, a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple essential physicochemical properties. By using an expanded residue vocabulary and explicitly encoding cyclization-site and linkage-type information, APCyc learns cyclization-aware representations and leverages Bayesian posterior guidance to steer sampling toward cyclic peptides satisfying multiple property objectives. Experimental results demonstrate that our model learns target-dependent cyclization preferences, and enables effective and controllable multi-property optimization for cyclic peptide design. The source code of this paper is available at this https URL.

[AI-59] Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

Link: https://arxiv.org/abs/2606.12983
Authors: En-Ming Huang,Yu-Hung Kao,Ren-Hao Deng,Wei-Po Hsin,Yao-Ting Hsieh,Cheng Liang,Hsiang-Yu Tsou,Mu-Chi Chen,Yu-Kai Hung,Shao-Chun Ho,Po-Hsuang Huang,Shih-Hao Hung,H.T. Kung
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 10 figures

Click to view abstract

Abstract:Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflows, where large numbers of candidate designs must be verified rapidly and reliably. Existing prompt-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches. As a direct verification tool, STG runs 720x faster than an iterative LLM-based testbench generation flow, attains a higher rate of successful compilation, achieves higher coverage, and reduces false-pass verdicts on incorrect DUTs. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches. As a data curation engine, it is 11x faster than LLM-based filtering on a single CPU core with 127x less energy, and the resulting distilled models provide state-of-the-art performance in our multi-benchmark evaluation. As a test-time scaling oracle, it reduces node count by 14-47%. Our models are available at this https URL.

[AI-60] A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning

Link: https://arxiv.org/abs/2606.12976
Authors: Akbar Erkinov,Nurmukhammad Abdurasulov
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures

Click to view abstract

Abstract:Sharing mathematical content in online forums remains a significant friction point for students and educators: writing raw LaTeX is error-prone, standalone optical character recognition tools require platform switching, and current forum software offers no integrated path from a photograph of a formula to a rendered post. We present a unified system that eliminates this friction by embedding an image-to-LaTeX conversion pipeline directly inside a forum posting interface. A user uploads or captures an image of a mathematical expression; the system routes it through the Mathpix OCR API, detects whether the returned output is LaTeX or plain text containing inline math, applies the appropriate delimiter normalisation, and renders a live preview in either LaTeX or Markdown mode before the post is committed to the database. The architecture is organized in three loosely coupled layers: image processing, rendering, and storage, and supports both desktop and mobile clients. A provisional US patent application has been filed covering the core methods. We describe the full system design, each component in detail, the data schema, and the key technical innovations, and we position the work against existing standalone tools and forum platforms to demonstrate the practical gap it closes. Beyond immediate usability, we argue that a deployed platform of this kind constitutes a continuously growing, community-validated dataset of mathematical problems and step-by-step solutions, a resource that can be used to train and benchmark AI systems for accurate mathematical reasoning.

[AI-61] Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models

Link: https://arxiv.org/abs/2606.12969
Authors: Quan Quan
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The power distribution network is critical to reliable electricity delivery, yet traditional inspection methods face limitations in semantic understanding, generalization, and closed-loop automation. To address these challenges, this paper proposes a Multi-Modal Agent framework specifically for power distribution defect detection. Central to this study is the systematic evaluation of multimodal foundation models as unified cognitive engines. We rigorously assess their integrated performance across three critical capabilities: (1) Perception, where the model must accurately identify equipment and generate expert-level descriptions of defects; (2) Reasoning, where the model interprets visual findings to diagnose causes, assess severity, and plan maintenance strategies based on domain knowledge; and (3) Tool Usage, where the model acts as an autonomous operator to execute actions – such as querying knowledge bases or generating work orders – to achieve closed-loop maintenance. To support this evaluation, a domain-specific evaluation dataset and a comprehensive benchmark are developed. Experimental results demonstrate the strengths and limitations of current foundation models in these three dimensions, providing empirical evidence for deploying autonomous agents in high-stakes industrial environments.

[AI-62] Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

Link: https://arxiv.org/abs/2606.12945
Authors: Zhibao Chen,Qian Cheng
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures

Click to view abstract

Abstract:Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency – both mis-specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi-factor memory value function $V(m)=\sum_i w_i f_i(m)$ over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient-free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held-out evaluation question saturates gold-evidence retention at $\approx 0.98$ – this measures retrieval, not forgetting. In the realistic blind regime, a learned multi-factor value retains $0.770 \pm 0.011$ of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap’s 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable – reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open-source; all experiments run on a single CPU with no API calls.
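
The linear value function in the abstract, a weighted sum of per-memory factor scores, can be sketched as below, together with a budgeted forgetting step. The factor keys, the dictionary layout, and the `consolidate` helper are illustrative assumptions; the paper's weight-learning procedure is not reproduced here.

```python
# Factor names follow the seven factors listed in the abstract;
# the identifier spellings are assumptions made for this sketch.
FACTORS = (
    "emotional_intensity",
    "goal_relevance",
    "value_alignment",
    "self_user_relevance",
    "task_utility",
    "reliability",
    "usage_history",
)

def memory_value(factors, weights):
    """V(m) = sum_i w_i * f_i(m) over the seven factors."""
    return sum(weights[name] * factors[name] for name in FACTORS)

def consolidate(memories, weights, budget):
    """Keep the `budget` highest-value memories; the rest are forgotten."""
    ranked = sorted(
        memories,
        key=lambda m: memory_value(m["factors"], weights),
        reverse=True,
    )
    return ranked[:budget]
```

Because the score is computed at consolidation time, it uses only properties of the memory itself, not the future query, which is the distinction the abstract draws between forgetting and retrieval.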

[AI-63] PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization

Link: https://arxiv.org/abs/2606.12942
Authors: Hao Jiang,Xin Li,Annan Wang,Zhi Yang,Haoxiang Zhang,Yichi Zhang,Weisi Lin
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Generative listwise ranking with Large Multimodal Models (LMMs) aims to capture global list context in a single forward pass, but its effectiveness degrades in long-context multimodal scenarios. We identify a recurring failure mode, parse collapse, where the autoregressive decoder produces fluent yet incomplete rankings by silently omitting candidates and terminating early. This failure stems from limited context utilization rather than simple formatting mistakes, making prompt engineering and constrained decoding insufficient. We propose PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking), a framework that replaces transient in-context list processing with parametric structural conditioning. PRISMR uses a lightweight hypernetwork to encode multimodal candidates in parallel and generate item-specific LoRA weights, which are synthesized into an instance-specific adapter for a LMM. This paradigm enables more robust internalization of list structure while preserving the base model. We further introduce a large-scale multimodal review-ranking benchmark for evaluation. Experiments demonstrate that PRISMR substantially reduces parse collapse, improves listwise ranking performance, and transfers effectively across domains and instruction-tuned backbones. 

[AI-64] An Embodied Simulation Platform Benchmark and Data-Efficient Augmentation Framework for Wet-Lab Robotics

Link: https://arxiv.org/abs/2606.12936
Authors: Zhe Liu,Huanbo Jin,Zhaohui Du,Zhe Wang,He Xu,Peijia Li,Jiaming Gu,Quan Lu,Qi Wang,Bin Ji,Ting Xiao
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 25 pages, 17 figures

Click to view abstract

Abstract:Wet-lab robots can improve the reproducibility, throughput, and safety of biomedical experiments, but scaling their learning requires customizable simulators for safe and reproducible task generation, open editable laboratory assets, and efficient pipelines that turn limited demonstrations into usable training data. We present Pipette, an embodied simulation platform, benchmark, and data-efficient augmentation framework for wet-lab robot learning. Pipette releases over 43 open-source and re-editable wet-lab assets, together with an extensible asset-building pipeline. A key component of Pipette is its simulation-based data augmentation pipeline, which replays human demonstrations in simulation, applies lighting, camera, speed, and action perturbations, and filters generated episodes with automatic task success checks, rapidly expanding usable training data from limited manual demonstrations. We further introduce an 11-task wet-lab embodied benchmark covering sample handling, culture-ware manipulation, device operation, and precision placement. With only 30 demonstrations per task, ACT achieves 65.5% average success rate, while simulation augmentation improves SmolVLA from 44.1% to 74.7% and $\pi_0$ from 40.4% to 46.5%, validating the effectiveness of Pipette for data-efficient VLA training and evaluation. Pipette also supports natural-language-driven scene construction and task registration, lowering the barrier for non-expert users to define new wet-lab robotic tasks.

[AI-65] MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

Link: https://arxiv.org/abs/2606.12935
Authors: Wenbo Chen,Puheng Li,Mengyang Liu,Weijie Su,Tianpei Xie
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Parallel test-time scaling samples many reasoning traces and majority-votes their answers, improving LLM accuracy but requiring traces to run to completion, incurring substantial computational overhead. We observe that probing partial traces at intermediate checkpoints can extract current answers without disrupting generation, revealing an evolving aggregate vote. Based on this observation, we introduce MARS, a margin-adversarial stopping rule that estimates which active traces are likely to change their answers and stops once the leader remains safe under a conservative bound on future vote movement. The rule separates two sources of uncertainty. It learns the trace-level switch probabilities that determine how much of the current margin is likely to be retained, while handling the harder question of where switching traces land through an adversarial bound calibrated from warmup traces. With true switch probabilities, MARS guarantees with high probability that the early-stopped answer matches the full-budget vote. In practice, a five-feature logistic model closely matches oracle switching behavior. Across three reasoning models and three competition-math benchmarks, MARS saves 25-47% of self-consistency tokens and 14-29% on top of DeepConf Online, a strong confidence-weighted baseline that already filters and truncates weak traces, while matching the accuracy of the corresponding full-budget baselines.
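
The idea of stopping once the leading answer is safe against future vote movement can be illustrated with a deliberately simplified check. This sketch is not the MARS rule: it assumes the per-trace switch probabilities are given, and it replaces the paper's calibrated adversarial bound with the crude worst case in which all switching mass lands on the runner-up. The function name and the `slack` parameter are hypothetical.

```python
from collections import Counter

def safe_to_stop(answers, switch_probs, slack=0.0):
    """Return True if the current leader survives a worst-case reshuffle.

    answers:      current answer probed from each active trace
    switch_probs: estimated probability that each trace changes its answer
    """
    counts = Counter(answers)
    if not counts:
        return False
    ranked = counts.most_common()
    leader = ranked[0][0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0

    # Expected number of leader votes that stay with the leader.
    retained = sum(
        1.0 - p for a, p in zip(answers, switch_probs) if a == leader
    )
    # Worst case (over-counted on purpose): every switching trace,
    # whatever its current answer, ends up voting for the runner-up.
    switching_mass = sum(switch_probs)

    return retained - (runner_up_votes + switching_mass) > slack
```

With eight of ten traces agreeing and low switch probabilities the check passes early; with a 5-to-4 split and high switch probabilities it keeps generating.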

[AI-66] Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

Link: https://arxiv.org/abs/2606.12924
Authors: Jetlir Duraj,Jayanth Yetukuri,Shuang Zhou,Dhruv Varma,Rui Kong,Ishita Khan,Qunzhi Zhou
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeable responder that integrates with a real e-commerce search API. Holding the buyer constant across experiments enables controlled comparison of responder designs on identical scenarios. Using 2011 conversations across 14 persona buckets, we establish four empirical findings. First, rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query. Second, illustrating rapid evidence-driven iteration, a systematic failure analysis of a responder version enables targeted fixes that reduce failure and near-failure rates by 62% across the full dataset. Third, swapping the responder LLM backbone from Gemini 2.5 to Llama 3.3 70B costs 0.16–0.45 points despite identical architecture. Finally, we document systematic philosophical disagreement between frontier LLM judges: Gemini rewards process correctness while Claude demands concrete outcomes, despite using the same evaluation prompt.

[AI-67] LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

Link: https://arxiv.org/abs/2606.12921
Authors: Franz Louis Cesista,Katherine Crowson,Cédric Simal,Stella Biderman
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 4 figures

Click to view abstract

Abstract:Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning Deep Learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer’s spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-2 proxy recovers the dense best tested learning rate, and a rank-32 LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE’s simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.
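
As background, the spectral steepest-descent rule that Muon applies to a dense gradient $G = U \Sigma V^\top$ replaces it with the orthogonal factor $U V^\top$. The sketch below approximates that factor with the classical cubic Newton-Schulz iteration in plain Python. It illustrates the dense rule only and is not the LoRA-Muon update; Muon implementations commonly use a tuned higher-order polynomial instead, and the helper names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(r) for r in zip(*a)]

def orthogonalize(g, steps=30):
    """Approximate U V^T for G = U S V^T via cubic Newton-Schulz.

    Each step maps every singular value s to 1.5*s - 0.5*s**3, which
    drives s toward 1 as long as it starts in (0, sqrt(3)).
    """
    norm = sum(v * v for row in g for v in row) ** 0.5
    if norm == 0.0:
        return [row[:] for row in g]
    # Frobenius normalisation guarantees all singular values are <= 1.
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)]
             for ra, rb in zip(x, xxtx)]
    return x
```

The result keeps the gradient's singular directions but equalises their magnitudes, so the step size no longer depends on how the gradient's scale is distributed across directions.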

[AI-68] MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

Link: https://arxiv.org/abs/2606.12918
Authors: Chejian Xu,Zhaorun Chen,Jingyang Zhang,Freddy Lecue,Avni Kothari,Sarah Tan,Wenbo Guo,Bo Li
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role-specialized agents, significantly expanding the attack surface, particularly under coordinated adversarial behaviors such as privilege escalation and cross-agent collusion. Existing red-teaming approaches for MAS remain limited: they rely on heuristic selection of target agents and perturb isolated message streams, leaving critical questions unanswered, such as which agents are most responsible for system safety and how compromised agents can coordinate to bypass defenses. We propose MAStrike, a closed-loop framework for collusive red-teaming in hierarchical MAS. We propose the first agent-level Shapley value analysis for MAS, quantifying each agent’s marginal contribution to system robustness under task-specific distributions. Guided by this attribution, MAStrike identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis, attributing failure cases to uncompromised agents that block adversarial attempts. We further build a comprehensive MAS red-teaming benchmark and controllable environments spanning diverse hierarchical topologies and domains, including finance, software engineering, and CRM. Extensive experiments across MAS built on multiple frontier models show that MAStrike substantially outperforms heuristic baselines. Our analysis further uncovers non-trivial Shapley value distributions and higher-order interaction structures among agents, revealing critical vulnerabilities and coordination patterns that are overlooked by prior single-agent or template-based methods.
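
Agent-level Shapley attribution averages each agent's marginal contribution over all orders in which agents could join a coalition. The sketch below computes exact Shapley values for a small agent set. The `value` callback (for example, the attack success rate when a given coalition is compromised) and the agent names in the usage note are hypothetical stand-ins, not the paper's measurement.

```python
from itertools import permutations

def shapley_values(agents, value):
    """Exact Shapley values by enumerating every ordering of `agents`.

    value: function mapping a frozenset of agents to a real number.
    Cost grows factorially, so this is only practical for a few agents.
    """
    totals = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition = frozenset()
        for agent in order:
            with_agent = coalition | {agent}
            # Marginal contribution of `agent` given who joined before it.
            totals[agent] += value(with_agent) - value(coalition)
            coalition = with_agent
    return {a: t / len(orders) for a, t in totals.items()}
```

For a toy value function that returns 1.0 only when both a `planner` and an `executor` are in the coalition, the two share the credit equally and a third `reviewer` agent receives zero, which is how such an analysis would single out a vulnerable pair.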

[AI-69] PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

Link: https://arxiv.org/abs/2606.12896
Authors: Junfeng Guo,Heng Huang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent’s internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time step-level backdoor defense which leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. In addition, we provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.

[AI-70] The Hidden Power of Scaling Factor in LoRA Optimization

Link: https://arxiv.org/abs/2606.12883
Authors: Zicheng Zhang,Haoran Li,Jiaxing Wang,Guoqiang Gong,Anqi Li,Yudong Hu,Ting Xiong,Yurong Gao,Junxing Hu,Zhida Jiang,Yifeng Zhang,Pengzhang Liu,Qixia Jiang
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In Low-Rank Adaptation (LoRA), the scaling factor $\alpha$ is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor $\alpha$ and the learning rate function differently, with $\alpha$ emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Through the synergy of extensive empirical analysis and a theoretical Signal-Drift framework, we uncover three insights into LoRA’s scaling mechanism: First, LoRA’s spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, $\alpha$ outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, revealing the insufficient scaling of existing rank-tied heuristics. Based on these insights, we propose LoRA-$\alpha$, a minimalist framework that restores $\alpha$ to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-$\alpha$ consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.

[AI-71] HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

Link: https://arxiv.org/abs/2606.12882
Authors: Xiaoxuan Wang,Haixin Wang,Alexander Taylor,Jason Cong,Yizhou Sun,Wei Wang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models are increasingly deployed as agents for long-horizon tasks, yet their performance is shaped not only by model capability and environment design, but also by the harness that mediates agent–environment interaction. Existing harnesses are largely manually engineered, making them difficult to scale as trajectories grow longer and interactions become more complex. In this work, we ask whether a harness can be generated by a learnable plug-in module that can be trained in an end-to-end fashion. We introduce HarnessBridge, a lightweight learnable harness controller that parameterizes the agent–environment interface as a bidirectional projection. HarnessBridge learns two projections: an observation projection, which distills raw trajectories into compact, decision-relevant states, and an action projection, which converts proposed actions into executable transitions or trajectory-grounded rejections. We train HarnessBridge on a harness supervision dataset via unified instruction tuning. On Terminal-Bench 2.0 and SWE-bench Verified, HarnessBridge matches or surpasses strong specialized harnesses while substantially reducing token usage and trajectory length, and generalizes from smaller generators to larger commercial models.

[AI-72] DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Link: https://arxiv.org/abs/2606.12871
Authors: Jingxuan Han,Wei Liu,Mingyang Zhu,Youpeng Wang,Ziwen Wang,Lin Qiu,Xuezhi Cao,Xunliang Cai,Zheren Fu,Licheng Zhang,Zhendong Mao
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users’ expectations. To facilitate future research, our dataset and code are made publicly available at this https URL.

[AI-73] Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

Link: https://arxiv.org/abs/2606.12864
Authors: Tingqiang Xu,Hangrui Zhou,Tianle Cai,Alex Gu,Kaifeng Lyu
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same setting remains largely unexplored. In this work, we introduce UOJ-Bench, a benchmark designed to evaluate not only the problem-solving ability of LLMs, but also their ability to identify errors in human-written code – a crucial educational activity traditionally supported by running test cases over online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repair, all constructed from real-world code submissions on the Universal Online Judge (UOJ) and evaluated through UOJ’s native judging infrastructure. Our results show that under one-shot evaluation, even the strongest models fail to identify errors in more than 50% of a set of submissions that have been found to be incorrect by UOJ users. While test-time scaling improves success rates to above 90%, the substantial computational costs incurred from model inference limit its practicality for large-scale deployment. Despite these limitations, we find that the best-performing models under test-time scaling can uncover errors in over 5% of full-score submissions across roughly 30 problems, suggesting that frontier LLMs can already provide complementary signals beyond standard judging systems.

[AI-74] WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

Link: https://arxiv.org/abs/2606.12852
Authors: Renmin Cheng,Changhao Chen(The Hong Kong University of Science and Technology (Guangzhou))
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what-where-when memory from which-why reasoning. To address this, we propose WISE (Which-Why Informed Semantic Explorer), a long-horizon agent framework with an enhanced low-level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re-prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi-scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning. Experiments show that WISE largely improves task success and efficiency on long-horizon sparse tasks, particularly in settings requiring adaptive decision-making.

[AI-75] (Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

Link: https://arxiv.org/abs/2606.12848
Authors: Chen Zhu,Xiaolu Wang,Weilong Zhang
Categories: Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments:

Abstract:Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2×4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher’s exact test rejects equality of failure rates at p < 0.001. Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register, consistent with a task-based production model with Fréchet-distributed output quality. An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication-ready outputs.

[AI-76] TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

Link: https://arxiv.org/abs/2606.12841
Authors: Zhengtao Yao,Liuyang Song,Hongbo Zhang,Chenhao Wei,Haoyan Xu,Guang Yang,Siheng Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.

[AI-77] Fantastic Scientific Agents and How to Build Them: Agent Build for Rietveld Refinement

Link: https://arxiv.org/abs/2606.12834
Authors: Woong Shin,Craig A. Bridges,Marshall T. McDonnell,Rafael Ferreira da Silva
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist’s judgment. We propose treating agent construction as a workflow stage and introduce AgentBuild, which builds a scientific agent from a contract the scientist authors. The contract is a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base. A rubric-driven judge gates a meta-optimizer coding agent that edits the agent within a declared boundary, so the build compiles the agent, not the scientist’s judgment. We instantiate this for Rietveld refinement of X-ray diffraction data through GSAS-II behind MCP and A2A, where a blank-harness construction run progresses through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, reaches the 4 hour scan as a frontier case, and exposes the workflow-scope limits that remain. The same rubric that rewards credible fits also scores trajectory scope, making the frontier a contract failure rather than a pattern-fitting failure. As base models evolve, re-running AgentBuild is a re-tune, not a rebuild, and the scientist’s authored contract remains the durable asset.

[AI-78] Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

Link: https://arxiv.org/abs/2606.12828
Authors: Rasul Khanbayov,Hasan Kurban
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Do research topics in artificial intelligence grow gradually, or do they advance through abrupt, detectable jumps? Analyzing 80,814 accepted main-track papers from five premier AI conferences (ACL, CVPR, ICLR, ICML, NeurIPS) spanning 2017 to 2025, we show major AI topics advance through topical phase transitions: remaining marginal for years, then surging across venues within one to three years. Large language models became the dominant cross-venue topic by 2025, diffusion models rose with comparable abruptness, and language-model methods crossed into computer vision via vision-language models, whereas reinforcement learning compounded smoothly, distinguishing genuine phase transitions from ordinary growth. This structure is our primary contribution: a large-scale, cross-venue characterization of how AI research reorganizes. We then ask whether a transition leaves a detectable footprint before it peaks. We define an early-warning signature, four publication-dynamics criteria frozen on 2017-2021 data, and evaluate it out of sample on 2023-2025 transitions, obtaining a precision of 27% and recall of 63% against a 13.5% base rate. Applied to 2025 data, the signature flags reasoning and test-time compute, agentic AI, multimodal LLMs, retrieval-augmented generation, and world models as topics to monitor over 2026-2028. The source code is also publicly available on GitHub at this https URL.
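
To put the reported early-warning numbers in context, the following sketch compares the signature's precision against the base rate and derives the F1 score from the stated precision and recall. The three input figures come from the abstract; the lift and F1 values are derived here and are not reported by the authors.

```python
def lift(precision, base_rate):
    """How many times better than flagging topics at random."""
    return precision / base_rate


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract.
PRECISION = 0.27
RECALL = 0.63
BASE_RATE = 0.135

LIFT = lift(PRECISION, BASE_RATE)    # about 2.0
F1 = f1_score(PRECISION, RECALL)     # about 0.378
```

A lift of roughly 2 means a flagged topic is about twice as likely to undergo a transition as a topic picked at random, which is a modest but nontrivial signal.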

[AI-79] GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

Link: https://arxiv.org/abs/2606.12821
Authors: Gabriel Diaz-Ireland,Diego Prieto-Herráez,Mario García Peces,Javier Velázquez,Devika Jain
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Preprint. 10 pages, 8 figures. Submitted to ACM SIGSPATIAL 2026

Abstract:Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no benchmark evaluates agents operating through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality analysis, multi-turn conversation, spatial reasoning, cross-indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self-hostable API serving three environmental indicators across Spain and Portugal via sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, Llama 4 Scout) under three temperature-1.0 seeds, reporting capability and per-case cost as orthogonal axes. We find: (1) Claude Sonnet 4 leads at 60.8% +/- 0.8%, followed by DeepSeek V3.2 at 56.3% +/- 3.1%, with no other model above 51%; (2) the cost-accuracy Pareto frontier is occupied mostly by open-weight models, with DeepSeek V3.2 offering 93% of Claude’s capability at 11x lower cost ($0.011/case); (3) comparison tasks remain universally unsolved (0% on close-value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general-purpose GIS benchmarks, with accuracies 25-35 points lower. We further show extensibility by integrating BigEarthNet V2 land cover for Portugal alongside Spanish CO2 and erosion indicators. The benchmark, harness, and self-hostable API are publicly available.

[AI-80] Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents

Link: https://arxiv.org/abs/2606.12817
Authors: Yudong Zhang(1),Lei Hu(1),Daoyang Liu(2),Jiawei Liu(1),Yangfan Luo(1),Xingyu Liu(1),Zuojian Wang(1),Zhilin Gao(1) ((1) Honor Device Co., Ltd, (2) The Chinese University of Hong Kong, Hong Kong, China)
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 9 figures. Yudong Zhang and Lei Hu contributed equally to this work. Xingyu Liu, Zuojian Wang, and Zhilin Gao are corresponding authors

Abstract:Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.

[AI-81] Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Link: https://arxiv.org/abs/2606.12814
Authors: Xiao Ren,Yuhui Yang,Zongbiao Weng,Zhijie Liu,He Kong
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that enables the policy to encourage exploration of fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with SOTA methods and ablation studies show that Stubborn achieved competitive performance, and the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains. For real-world demonstrations, please refer to this https URL.
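
The Bernoulli-based probabilistic termination mentioned in this abstract can be pictured with the minimal sketch below: instead of always ending an episode once tracking error crosses a threshold, the episode ends only with some probability, so the policy sometimes keeps acting from a failed state. The function names, the fixed threshold, and the constant termination probability are all assumptions made for illustration; the abstract does not specify how Stubborn sets them.

```python
import random


def should_terminate(tracking_error, threshold, p_terminate, rng):
    """Bernoulli termination: only failing steps can end the episode."""
    if tracking_error <= threshold:
        return False
    # Draw a Bernoulli(p_terminate) sample for steps above the threshold.
    return rng.random() < p_terminate


def rollout_length(errors, threshold, p_terminate, seed=0):
    """Number of steps executed before the episode terminates."""
    rng = random.Random(seed)
    for step, err in enumerate(errors):
        if should_terminate(err, threshold, p_terminate, rng):
            return step + 1
    return len(errors)


# Hypothetical per-step tracking errors; the robot "falls" at the third step.
ERRORS = [0.1, 0.2, 0.9, 0.9, 0.3]
HARD = rollout_length(ERRORS, threshold=0.5, p_terminate=1.0)
SOFT = rollout_length(ERRORS, threshold=0.5, p_terminate=0.0)
```

With `p_terminate=1.0` this reduces to the usual hard termination at the first failure, while `p_terminate=0.0` never terminates early; values in between trade off sample efficiency against exposure to fallen states.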

[AI-82] MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs ICML2026

Link: https://arxiv.org/abs/2606.12809
Authors: He Li,Haoang Chi,Qizhou Wang,Yunxin Mao,Zhiheng Zhang,Jie Tan,Tongliang Liu,Wenjing Yang,Bo Han
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 36 pages, accepted to ICML 2026

Abstract:Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owners may request the removal of specific content. In practice, these requests often arrive sequentially over time, giving rise to the challenging problem of MLLM Lifelong Unlearning. However, most existing benchmarks are limited in scale and scope, failing to capture the complexities of MLLM lifelong unlearning. To fill this gap, we introduce MLUBench, a large-scale and comprehensive benchmark featuring 127 entities across 9 classes under lifelong unlearning requests. We perform extensive experiments using MLUBench and reveal that existing unlearning methods suffer from severe, cumulative degradation. More critically, we further identify the unique challenge of this problem: unlike in unimodal models, MLLM lifelong unlearning is constrained by the need to preserve multimodal alignment. Continually unlearning from one modality could degrade the entire model. To alleviate this challenge, we propose LUMoE, an effective method. Experiments demonstrate that LUMoE significantly mitigates the degradation problem faced by baselines. The source code and the MLUBench dataset are open-sourced at this https URL.

[AI-83] SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

Link: https://arxiv.org/abs/2606.12808
Authors: Yash Vardhan Tomar,Dheeraj Peddireddy,Vaneet Aggarwal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across hundreds of shots, those seconds become a significant wall-clock cost for adaptivity. We introduce SymQNet, an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning. SymQNet learns a posterior-conditioned acquisition policy offline, then uses a fast policy forward pass online while retaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet substantially reduces acquisition latency relative to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At five qubits, it reduces acquisition-only decision latency by 47.1× and 72.6× relative to these online baselines; at twelve qubits, full simulated steps take 1.02 s for SymQNet versus 13.27 s for bounded two-step BALD. Overall, we show that learned acquisition can make adaptive Hamiltonian learning practical for repeated low-latency workloads.

[AI-84] The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements ICML2026

Link: https://arxiv.org/abs/2606.12797
Authors: Md Jafrin Hossain,Mohammad Arif Hossain,Weiqi Liu,Nirwan Ansari
Categories: Artificial Intelligence (cs.AI)
Comments: ICML 2026 (AI4GOOD Workshop)

Abstract:Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasingly deployed in public-facing domains, including government services, healthcare triage, and financial advising. We ask whether the frameworks used to build these systems provide architectural-level structural safety guarantees. Applying six containment principles derived from a compositional model of agentic architectures, we audit three dominant frameworks (LangChain, AutoGPT, and OpenAI Agents SDK) and find no native compliance in any of them. Memory integrity, a defense against one of the most prevalent vulnerability classes, is not observed in any of the three evaluated frameworks. We validate these findings empirically: in a simulated government benefits agent built on LangChain, a single memory-poisoning write induces persistent targeted corruption across all tested seeds and backends, increasing the wrongful denial rate for targeted applicants to 88.9%. Under a complex five-factor policy, the same attack preserves aggregate accuracy while increasing targeted wrongful denials by 3.5x, rendering the corruption difficult to detect through standard monitoring. We then introduce two lightweight containment mechanisms: a memory integrity validator and a policy gate, which eliminate both attack vectors with sub-millisecond overhead (0.2ms per call). We conclude that the current agentic framework ecosystem may not yet meet secure-by-default expectations for public-facing deployments and outline priority architectural interventions to enable trustworthy deployment in high-stakes, socially impactful applications.
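
One way to picture a memory integrity check of the kind this abstract mentions is the sketch below, which tags each memory entry with an HMAC on write and verifies the tag on read. This is an assumption-laden illustration, not the paper's validator: the `SignedMemory` class, its key handling, and the example rule text are hypothetical, and the approach only detects entries modified outside the trusted write path.

```python
import hashlib
import hmac


class SignedMemory:
    """Hypothetical agent memory that authenticates entries with HMAC-SHA256."""

    def __init__(self, key: bytes):
        self._key = key
        self._store = {}

    def _tag(self, name: str, value: str) -> bytes:
        # Bind the tag to both the entry name and its content.
        msg = name.encode() + b"\x00" + value.encode()
        return hmac.new(self._key, msg, hashlib.sha256).digest()

    def write(self, name: str, value: str) -> None:
        self._store[name] = (value, self._tag(name, value))

    def read(self, name: str) -> str:
        value, tag = self._store[name]
        # Constant-time comparison avoids leaking tag bytes via timing.
        if not hmac.compare_digest(tag, self._tag(name, value)):
            raise ValueError(f"memory entry {name!r} failed integrity check")
        return value
```

A poisoned value submitted through `write` itself would still be signed, so a check like this would have to be paired with validation of what may be written, which is roughly the role the abstract assigns to its policy gate.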

[AI-85] A Tutorial on World Models and Physical AI

Link: https://arxiv.org/abs/2606.12783
Authors: Il-Seok Oh
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout-based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real-world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long-horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

[AI-86] Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

Link: https://arxiv.org/abs/2606.12767
Authors: Sarah Elshabrawy,Rahul K. Dass,Ashok K. Goel
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 numbered figures. Workshop submission to HAIL @ AIED 2026

Abstract:Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.

[AI-87] Prefill Awareness in Large Language Models NEURIPS2026

Link: https://arxiv.org/abs/2606.12747
Authors: Andy Wang,Parv Mahajan,David Demitri Africa,Alexandra Souly,Jordan Taylor,Robert Kirk
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to NeurIPS 2026

Abstract:Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. If AI models can recognize and act on the fact that their prior assistant messages have been inserted or edited, the effectiveness and validity of these methods could be compromised. We investigate whether frontier language models can distinguish between tampered and untampered assistant-side context, a capability we call prefill awareness. To do so, we construct a binary preference benchmark across three prefill mechanisms, filtering for cases where models show consistent stances. We find that frontier models show substantial prefill awareness: Claude Opus 4.5 detects prefills opposing its preferences in 9-35% of cases with a 0% false positive rate when prompted; additionally, models often revert towards baseline behavior without explicitly reporting that the prefill was foreign. Controlled ablations also show that detection and resistance rely on different cues: stylistic mismatch mainly affects whether models flag a prefill as foreign, while preference mismatch mainly affects whether they revert toward their baseline answer. We also examine more realistic agentic settings such as misalignment-continuation evaluations and SWE-bench trajectories, where frontier models sometimes disavow prefilled assistant turns in ways that depend strongly on dataset, task success, and hidden formatting artifacts. Our results indicate that prefill awareness is already a substantial confound for some prefill-based methods. We recommend that model developers track this capability in frontier systems.

[AI-88] Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices

Link: https://arxiv.org/abs/2606.12742
Authors: Farough Shayeste Roodi,Parham Zilouchian Moghaddam,Mahdi Mohammadi-nasab,Mehdi Modarressi,Mostafa Ersali Salehi Nasab,Masoud Daneshtalab
Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Abstract:Wearable healthcare devices are the fastest-growing Internet of Things (IoT) sector. Many automated healthcare services rely on two crucial biological signals, namely ECG and EEG, which reflect the activity of the heart and brain, respectively. Although deep neural networks are considered the primary way to process and analyze these signals, the very tight energy and computational power constraints in wearable devices are far below the computational, energy, and memory bandwidth demands of DNN models, thereby impeding the deployment of deep learning in many practical wearable services. This paper investigates the feasibility of deploying state-of-the-art DNN models in resource-constrained wearable devices. Notably, we explore the trade-off between accuracy and computational complexity of DNNs when parameter quantization and electrode reduction methods are used. Our investigation centers on several state-of-the-art DNN models designed for EEG signal analysis, specifically for detecting epileptic seizures. Our findings demonstrate that, when applied judiciously, these techniques can significantly reduce the complexity of the DNNs under consideration with minimal adverse effects on accuracy. These results reveal the explicit trade-offs between accuracy and complexity reduction encountered when adapting DNN-based online EEG analysis for wearable devices.

[AI-89] PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections

Link: https://arxiv.org/abs/2606.12737
Authors: Pengfei He,Lesly Miculicich,Vishesh Sharma,Ash Fox,George Lee,Jiliang Tang,Tomas Pfister,Long T. Le
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are rapidly evolving into agentic systems that interact with external tools and environments, introducing new security risks such as indirect prompt injection attacks through untrusted external sources. Existing defenses mainly focus on blocking malicious content at inference time, and current red-teaming methods primarily optimize attack success. As a result, developers have limited visibility into how latent prompt injections emerge and propagate through agents. We propose PI-Hunter, an automated agentic auditing framework for proactive vulnerability exposure in LLM agents. PI-Hunter constructs realistic source-aware test cases and iteratively evolves them through feedback-driven exploration to induce agents to retrieve and reveal latent malicious instructions embedded within external environments. Extensive experiments across multiple benchmarks, agent architectures, attacks, and defenses demonstrate that PI-Hunter substantially improves vulnerability exposure and attack-surface coverage over strong automated red-teaming baselines, while remaining effective under existing prompt injection defenses.

[AI-90] Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Link: https://arxiv.org/abs/2606.12736
Authors: Tianyu Liu,Allen Xin Wang,Antonia Panescu,Lisa Xinyi Chen,Wenxin Long,Xinyu Wei,Yueqian Jing,Ziyao Zeng,Jihang Chen,Sihan Jiang,Ziqing Wang,Siyi Gu,Siyu Chen,Xinyang Hu,Haoran Shao,Leqi Xu,Wangjie Zheng,Zhiyuan Cao,Ada Fang,Botao Yu,Kunyang Sun,Rex Ying,Arman Cohan,Qingyu Chen,Lingzhou Xue,Kaize Ding,Yuanqi Du,Wengong Jin,Zhuoran Yang,Marinka Zitnik,James Zou,Hua Xu,Hongyu Zhao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 figures

Abstract:AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Together, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. Full codes, tasks, and datasets can be accessed via this link: this https URL.

[AI-91] The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism

Link: https://arxiv.org/abs/2606.12721
Authors: Nikolos Gurney,Stacy Marsella
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Inferring others’ beliefs requires more than reading surface signals; it requires tracking who told them what, in what order, and how credibly. The Theory of Mind Utility (ToM-U) formalizes this epistemic state inference problem at the computational level of analysis, specifying what mentalizing computes and why without commitment to algorithmic or neural implementation. ToM-U achieves this by constructing Local Epistemic World Models (LEWMs) – directed typed graphs that represent agents, state nodes, and the epistemic relationships among them – and evaluating discrete candidate LEWMs against observed behavior until one achieves sufficient confidence. Five formal definitions specify the LEWM structure, agent node properties including ordered information access history, a bounded proliferation mechanism for recursive mentalizing, three inference procedures, and a residue function that captures the structured trace left by failed mentalizing attempts. ToM-U differs from Bayesian Theory of Mind and adjacent formal accounts, which presuppose rather than derive belief states, and from simulation theory and theory-theory, which lack a formal apparatus for epistemic state inference. The architecture generates directional, falsifiable predictions about mentalizing failure that follow from structural properties of the model rather than auxiliary assumptions, and positions ToM-U as a domain-agnostic mechanism upstream of goal inference and other downstream social cognitive processes.

[AI-92] Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI

Link: https://arxiv.org/abs/2606.12713
Authors: J. E. Aguilera Briones
Subjects: Artificial Intelligence (cs.AI)
Comments: 31 pages, 1 table, 2 appendices

Abstract:Claims that artificial general intelligence has already arrived and claims that it remains decades away are often defended from overlapping evidence. “AGI” lacks a single shared and stable referent and competing operationalizations can return different verdicts on the same system. This article treats that under-specification as a design and governance problem. Following Design Science Research Methodology, it develops DAF-AGI, a second-order conceptual artifact with two coupled components: five ordinal criteria for assessing the adjudicative fitness of candidate definitions and a structured governance audit of authorship, interest, certification, external verification and revision authority. The artifact is demonstrated on five prominent measurement families and one deflationary boundary position in a documented corpus and then stress-tested against a stylized strong arrival claim: that current generative systems constitute AGI because they outperform a well-educated adult on many cognitive tasks. On evidence from the cited 2024-2025 sources, the claim was certifiable only under a performance-based operationalization; capability-ontology, psychometric and skill-acquisition approaches did not certify it, the economic family remains indeterminate and the deflationary position refuses binary adjudication. The contribution is a novel integration and operationalization, not an empirical validation: independent application, inter-rater testing and author-external cases remain necessary. The paper further proposes definitional sovereignty as an enabling component of algorithmic sovereignty: the institutional capacity to contest, certify and revise imported technological categories under public accountability.

[AI-93] SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems

Link: https://arxiv.org/abs/2606.12703
Authors: Tarun Sharma
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Retrieval-augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a new attack surface: an adversary interacting only through normal channels can inject crafted memories that, once retrieved, steer the agent’s responses for future users, without touching model weights or code. We call this Multi-Session Memory Poisoning (MSMP) and show that no existing defence certifies against it; static-corpus defences (RobustRAG, ReliabilityRAG) assume a fixed knowledge base, and heuristic filters are bypassed by fluent enterprise-style text. We present Signed Memory with Smoothed Retrieval (SMSR), the first defence with a certified robustness bound for this setting. Component 1 adds HMAC-SHA256 provenance at write time, blocking unsigned injection. Component 2 applies randomised memory ablation with verdict-based majority voting at query time, bounding the influence of authenticated adversaries. We prove that no provenance-free retrieval-time filter can certify against adaptive injection, derive a hypergeometric certificate for Component 2, and formalise the Consistent Minority Effect, whereby a consistent adversarial answer wins string-based voting as a numerical minority while verdict-based voting removes it. Across 15 enterprise scenarios (3,150 repeated trials), Component 1 cuts attack success from 93-100% to 0% for all unsigned variants. For an authenticated adversary with a single injection, Component 2 holds success to 8.0% (95% CI [5.8, 10.9], n=450), below the certified worst case. In an end-to-end query-only attack where the agent itself writes the poison rather than it being pre-seeded, SMSR reduces success from 65.3% to 5.3% (n=150, non-overlapping CIs) on a live agent stack. Clean-query utility is 90% (Component 1) and 85% (combined).
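The two components described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the key, the entry layout, and the function names `sign_memory`, `verify_memory`, and `clean_subset_probability` are hypothetical, and the last function only shows the basic hypergeometric quantity (the chance that a randomly kept subset of memories contains no poisoned entry), not the paper's actual certificate.

```python
import hashlib
import hmac
import json
from math import comb

# Hypothetical demo key; a real deployment would load this from a secret store.
SECRET_KEY = b"demo-key-not-for-production"


def _payload(text: str, session_id: str) -> bytes:
    # Canonical serialisation so the same entry always yields the same bytes.
    return json.dumps({"text": text, "session": session_id}, sort_keys=True).encode("utf-8")


def sign_memory(text: str, session_id: str, key: bytes = SECRET_KEY) -> dict:
    """Attach an HMAC-SHA256 provenance tag to a memory entry at write time."""
    tag = hmac.new(key, _payload(text, session_id), hashlib.sha256).hexdigest()
    return {"text": text, "session": session_id, "tag": tag}


def verify_memory(entry: dict, key: bytes = SECRET_KEY) -> bool:
    """Return True only if the entry carries a tag that matches its content."""
    expected = hmac.new(key, _payload(entry["text"], entry["session"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, entry.get("tag", ""))


def clean_subset_probability(n_total: int, n_poisoned: int, k_kept: int) -> float:
    """Hypergeometric probability that a random k-subset holds no poisoned entry."""
    return comb(n_total - n_poisoned, k_kept) / comb(n_total, k_kept)
```

For example, with 10 stored memories, 1 of them adversarial, and 5 kept per ablation round, `clean_subset_probability(10, 1, 5)` gives 0.5, which is the kind of quantity a majority vote over many rounds can build on.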

[AI-94] Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

Link: https://arxiv.org/abs/2606.12702
Authors: Alyssa Unell,Miguel Fuentes,Brenna Li,Bridget Lin,Meena Jagadeesan,Sanmi Koyejo,Nigam Shah
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets – leading to major blind spots for evaluating clinical systems. In this work, we perform a deployment-centered evaluation of an LLM system embedded within electronic health records at an academic medical center, where user feedback is sparse but closely reflects the deployment conditions. Specifically, we train a pre-response classifier that estimates the risk that a future interaction will result in the user rejecting the LLM response, based on query content and deployment-specific context available before generation. We conduct a prospective analysis of our model over 4.5 months of user feedback, finding that our prediction model achieves an AUROC of 0.719. Further, we estimate the benefit of such predictions in two downstream use cases (guardrail triggering and abstention). Our key conceptual insight is that making use of deployment-specific context (i.e., the provider type, department name, language model used for response), as opposed to only query content, improves the ability to predict whether the user will reject the system output. Altogether, our empirical case study demonstrates the feasibility of predicting user rejection using deployment-specific context, opening the door to targeted guardrails.
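For readers unfamiliar with the reported metric, the sketch below computes AUROC directly from its pairwise definition: the probability that a randomly chosen rejected interaction receives a higher risk score than a randomly chosen accepted one, with ties counted as one half. The function name `auroc` and the label convention (1 = user rejected the response) are assumptions made for illustration; the paper's own evaluation code is not shown in the abstract.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) AUROC; labels are 1 for rejected, 0 for accepted."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # rejected item ranked above accepted item
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy usage: four interactions with predicted rejection risk.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The quadratic loop keeps the definition visible; a rank-based formulation would be preferable for large logs.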

[AI-95] LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

Link: https://arxiv.org/abs/2606.12699
Authors: Yifan Gao,Yanmin Gong,Yun Shi,Yuanxiong Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The 14th IEEE International Conference on Healthcare Informatics, 2026

Abstract:Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML) and rely primarily on historical blood glucose measurements and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care. 

[AI-96] Two-Layer Linear Auto-Regressive Models Estimate Latent States ICML2026

Link: https://arxiv.org/abs/2606.12691
Authors: Yahya Sattar,Sunmook Choi,Leo Maynard-Zhang,Yassir Jedra,Maryam Fazel,Sarah Dean
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: ICML 2026

Abstract:Auto-regressive models have emerged as powerful tools for sequential data, from language to video. Understanding how and why these models learn latent representations remains an open theoretical question. In this work, we demonstrate that when trained by empirical risk minimization on data from partially observed linear dynamical systems, two-layer linear auto-regressive models naturally learn to approximate Kalman filtering. In particular, we show that the learned hidden representation coincides, up to a similarity transformation, with the state estimates produced by the optimal (Kalman) filter, even though the model has no explicit knowledge of the underlying dynamics or state. The result follows from three main insights. First, we establish that the Kalman filter is well approximated by an auto-regressive model with bounded truncation error. Second, we show that despite non-convexity, the two-layer optimization landscape is benign, i.e., all stationary points are either strict saddles or global minima. Finally, as our main contributions, we provide finite-sample guarantees on prediction error, parameter estimation error, and latent state recovery. Numerical simulations support the theoretical results and demonstrate that the latent representations of auto-regressive models recover state estimates.
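The first insight in this abstract (a Kalman filter is well approximated by a truncated auto-regressive model) can be seen from a short derivation. The notation here is ours, not the paper's: $A$ and $C$ are the state and observation matrices, $L$ is an assumed steady-state predictor gain, $F := A - LC$, $H$ is the auto-regressive window length, and $S$ is any invertible matrix.

```latex
\begin{aligned}
\hat{x}_{t+1} &= A\hat{x}_t + L\,(y_t - C\hat{x}_t) = F\hat{x}_t + L\,y_t, \qquad F := A - LC, \\
\hat{x}_{t+1} &= \sum_{k=0}^{H-1} F^{k} L\, y_{t-k} \;+\; F^{H}\hat{x}_{t-H+1}, \\
\hat{y}_{t+1} &= C\hat{x}_{t+1}
  = \underbrace{C\,\bigl[\,L,\; FL,\; \dots,\; F^{H-1}L\,\bigr]}_{W_2 W_1}
    \begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-H+1} \end{bmatrix}
  \;+\; C F^{H}\hat{x}_{t-H+1}.
\end{aligned}
```

Under the usual conditions that make $F$ stable (spectral radius below one), the last term shrinks geometrically in $H$, which is the truncation error. The coefficient matrix factors as $W_2 = C S^{-1}$ and $W_1 = S\,[\,L,\, FL,\, \dots,\, F^{H-1}L\,]$ for any invertible $S$, so the hidden layer output $W_1 [y_t; \dots; y_{t-H+1}]$ approximates $S\hat{x}_{t+1}$. This is one way to read the abstract's statement that the hidden representation matches the Kalman state estimate up to a similarity transformation.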

[AI-97] EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

Link: https://arxiv.org/abs/2606.12690
Authors: Xin Zhou,Cong Miao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.

[AI-98] M*: A Modular Extensible Serving System for Multimodal Models

Link: https://arxiv.org/abs/2606.12688
Authors: Atindra Jha,Naomi Sagan,Keisuke Kamahori,Irmak Sivgin,Rohan Sanda,Steven Gao,Mark Horowitz,Luke Zettlemoyer,Olivia Hsu,Jure Leskovec,Baris Kasikci,Stephanie Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.

[AI-99] From AGI to ASI

Link: https://arxiv.org/abs/2606.12683
Authors: Tim Genewein,Matija Franklin,Alexander Lerchner,Laurent Orseau,Samuel Albanie,Adam Bales,Cole Wyeth,Stephanie Chan,Iason Gabriel,Joel Z. Leibo,Allan Dafoe,Marcus Hutter,Thore Graepel,Shane Legg
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence. The endpoint of this continuum, Universal AI, is theoretically well understood, which provides some formal grounding for the main focus of this report: the transition from human-level AGI to artificial general superintelligence, which, intuitively, can be understood as a system that is more intelligent and cognitively capable than large organisations of humans. After characterizing ASI, the report discusses four potential pathways from AGI to ASI: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large-scale multi-agent collectives. The report then discusses possible frictions and bottlenecks along these pathways. Determining whether the impact of these frictions will be negligible or substantial raises a number of concrete open research questions. Due to large uncertainties for predicting ASI progress, it cannot be ruled out that AI progress might continue to accelerate over the next years. This could imply that the image of a single transformative step change, caused by the introduction of human-level AGI into our society, could be inaccurate. More apt might be the prospect of a series of transformative societal changes caused by AI-enabled progress and breakthroughs across many areas of science and technology. Preparing for this prospect requires a massively interdisciplinary endeavour of global scope and interest.

[AI-100] Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Link: https://arxiv.org/abs/2606.12674
Authors: Kushal Raj Bhandari,Ling Yue,Ching-Yun Ko,Dhaval Patel,Shaowu Pan,Pin-Yu Chen,Jianxi Gao
Subjects: Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL

Abstract:Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.

[AI-101] A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

Link: https://arxiv.org/abs/2606.12673
Authors: Phan Nguyen,Dat Cao,Hien Chu,Khue Hoang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world applications with heterogeneous graph data. However, existing methods often depend on dataset-specific feature semantics and structural patterns, which limits their ability to generalize across different domains. To address this challenge, we propose AlignGAD, a zero-shot generalized graph anomaly detection framework. Our framework is built upon three key components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that constructs cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancy and aggregates anomaly evidence from different graph views. Experiments on multiple real-world datasets demonstrate the effectiveness of AlignGAD under the zero-shot GAD setting.

[AI-102] Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites

Link: https://arxiv.org/abs/2606.12667
Authors: Grace Ra Kim,Duncan Eddy,Vedant Srinivas,Mykel J. Kochenderfer
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 34 pages, 13 figures, 11 tables, Journal of Aerospace Information Systems (JAIS)

Abstract:Rapidly expanding low Earth orbit satellite constellations are placing increasing demands on terrestrial ground networks, motivating the development of more efficient ground station network designs. Current approaches select sites from predefined locations, limiting optimization to existing infrastructure and constraining performance. In contrast, free-placement optimization operates over a continuous spatial domain on Earth, broadening the search space and allowing higher-throughput configurations at the cost of potentially requiring new infrastructure deployment. In this work, we introduce SCORE (Sequential Cyclic Optimization via Refinement Evaluation), a two-stage free-placement method for ground station design. SCORE combines sequential coordinate selection with cyclic refinement to manage high-dimensionality, non-convexity, and local minima that challenge global optimizers. We benchmark SCORE against one-shot methods such as differential evolution (DE) and integer programming approaches using locations from Kongsberg Satellite Services and the World Teleport Association. Tests across two commercial Earth observation constellations (Capella Space and ICEYE) and one synthetic Walker-Star constellation show that SCORE requires up to 5x fewer function evaluations to converge relative to DE while improving downlink throughput by up to 13%. Compared to fixed-site methods, unconstrained SCORE achieves up to 15% greater total downlink, establishing a strong empirical performance benchmark for flexible placement; infrastructure-constrained SCORE retains over 92% of this gain while restricting placement to within proximity of existing fiber and power infrastructure. We also explore trade-offs between expanding existing stations and deploying new sites, informing future ground network design for operational constellations.
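The general pattern named in this abstract, sequential selection followed by cyclic refinement, can be sketched on a toy problem. This is not SCORE itself: the one-dimensional `coverage` objective, the grid standing in for a continuous domain, and the function names are all assumptions chosen to keep the example small.

```python
import math


def coverage(stations, demand):
    """Toy objective: each demand point is served by its nearest station."""
    return sum(max(math.exp(-abs(d - s)) for s in stations) for d in demand)


def sequential_then_cyclic(demand, n_stations, grid, max_sweeps=20):
    stations = []
    # Stage 1: sequential selection, greedily adding one station at a time.
    for _ in range(n_stations):
        best = max(grid, key=lambda g: coverage(stations + [g], demand))
        stations.append(best)
    greedy_value = coverage(stations, demand)

    # Stage 2: cyclic refinement, re-optimising one station with the rest fixed.
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(stations)):
            others = stations[:i] + stations[i + 1:]
            best = max(grid, key=lambda g: coverage(others + [g], demand))
            if coverage(others + [best], demand) > coverage(stations, demand) + 1e-12:
                stations[i] = best
                improved = True
        if not improved:
            break  # a full sweep changed nothing, so stop
    return stations, greedy_value, coverage(stations, demand)


demand = [0.0, 1.0, 5.0, 9.0, 10.0]
grid = [i * 0.5 for i in range(21)]  # candidate locations 0.0 .. 10.0
stations, greedy_value, refined_value = sequential_then_cyclic(demand, 2, grid)
```

Because a station is only moved when the objective strictly improves, the refined value can never fall below the greedy value; how much it gains depends on the problem.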

[AI-103] CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

Link: https://arxiv.org/abs/2606.12666
Authors: Siyu Shen,Fenghao Xu,Wenrui Diao,Kehuan Zhang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Screenshot-based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capability also turns every screen observation into a privacy boundary. During normal task execution, screenshots may expose contacts, messages, photos, files, recommendations, health cues, and other sensitive context that is unrelated to the user’s request. We call this problem incidental visual privacy exposure. It is difficult to address with existing defenses: text anonymization misses many visual and inferential cues, while generic privacy masking can remove the evidence and controls that a GUI agent needs to complete the task. This paper presents CAPED, a context-aware pre-upload exposure control layer for mobile GUI agents. CAPED is designed as a phone-side protection layer: before screenshots are released to a remote multimodal agent, it extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and selectively exposes only content needed for the current task while masking incidental private content. We evaluate CAPED on AndroidWorld for broad task utility and with a controlled 28-task seeded privacy evaluation used as a measurement instrument for trajectory-level incidental leakage. In this seeded evaluation, Full CAPED reduces success-conditioned weighted seeded leakage from 0.766 under raw screenshots to 0.268 while preserving high task utility. A broader AndroidWorld run shows a remaining prototype-level utility cost, but the results support the central claim that screenshot upload should be treated as an explicit device–cloud boundary decision, governed by task-driven selective exposure rather than all-or-nothing screen sharing. 

[AI-104] BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

Link: https://arxiv.org/abs/2606.12662
Authors: Damien Martins Gomes,François Capman
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and STOI~96% on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.
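The idea of Bark-scale partitioning with density-scaled capacity can be illustrated as follows. This sketch uses Traunmüller's closed-form approximation of the Bark scale, which may differ from the band definition the authors use, and the channel allocation rule in `allocate_channels` (with its `max_channels` and `min_channels` parameters) is a hypothetical stand-in for the paper's scaled-capacity encoders.

```python
def hz_to_bark(f_hz: float) -> float:
    """Traunmueller's approximation of the Bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z: float) -> float:
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def bark_bands(f_max_hz: float, n_bands: int):
    """Split [0, f_max_hz] into bands of equal width on the Bark scale."""
    z_lo, z_hi = hz_to_bark(0.0), hz_to_bark(f_max_hz)
    step = (z_hi - z_lo) / n_bands
    edges = [bark_to_hz(z_lo + i * step) for i in range(n_bands + 1)]
    return list(zip(edges[:-1], edges[1:]))


def allocate_channels(bands, max_channels=64, min_channels=8):
    """Hypothetical rule: channels proportional to Bark units per Hz in each band."""
    density = [(hz_to_bark(hi) - hz_to_bark(lo)) / (hi - lo) for lo, hi in bands]
    top = max(density)
    return [max(min_channels, round(max_channels * d / top)) for d in density]


bands = bark_bands(8000.0, 4)          # four bands covering 0 to 8 kHz
channels = allocate_channels(bands)    # widest encoder for the lowest band
```

Equal Bark widths translate into narrow bands at low frequencies and wide bands at high frequencies, so the lowest band has the highest density and receives the most channels under this rule.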

[AI-105] rajGenAgent : A Hierarchical LLM Agent for Human Mobility Trajectory Generation MDM2026

Link: https://arxiv.org/abs/2606.12657
Authors: Siyu Li, Toan Tran, Lingyi Zhao, Khurram Shafique, Li Xiong
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Robotics (cs.RO)
Comments: 14 pages, 2 figures, 8 tables. Accepted by the 27th IEEE International Conference on Mobile Data Management (MDM 2026)

Abstract:Human mobility data is important for transportation, urban planning, and epidemic control, but large-scale trajectory collection is often costly and privacy-constrained, motivating realistic synthetic trajectory generation. Existing LLM-based generators typically rely on either prompt engineering, which preserves zero-shot reasoning but lacks fine-grained spatiotemporal grounding, or trajectory-level fine-tuning, which improves statistical precision but incurs substantial computational cost and may weaken general reasoning. We propose TrajGenAgent, a semantic-aware hierarchical LLM-agent framework for human mobility trajectory generation without model fine-tuning. TrajGenAgent uses a two-stage orchestrator-worker design: an LLM first synthesizes an individual- and weekday-conditioned activity chain from historical evidence via in-context learning, and a deterministic workflow then grounds each activity into a complete visit using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. To evaluate realism beyond aggregate spatiotemporal statistics, we introduce an anomaly-detection-based evaluation framework using two complementary detectors to assess behavioral and semantic plausibility. Experiments on benchmark and large-scale simulation datasets show that TrajGenAgent improves spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over representative neural and LLM-based baselines, while avoiding parameter updates.

[AI-106] Token Complexity Theory for AI-Augmented Computing

Link: https://arxiv.org/abs/2606.12647
Authors: Jie Wang
Categories: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 1 figure

Abstract:AI-augmented computing delegates natural language queries, code generation requests, and other open-ended tasks to a cluster of AI models that processes queries and generates responses. This paradigm introduces a resource dimension that neither classical time nor space complexity captures: the cost of sending queries to and receiving responses from such a cluster. We introduce token complexity, a formal resource measure defined as the minimum expected token cost to achieve a specified level of output quality on a task, and develop a taxonomy classifying AI systems by the strength of their probabilistic properties. We develop token complexity within the framework of AI-Oracle Turing machines, in which a probabilistic Turing machine interacts with a stochastic oracle via dedicated query and response tapes. We prove basic theorems establishing that token complexity behaves as expected: monotonicity (higher quality costs more tokens), convexity (quality improvements become progressively more expensive), price sensitivity (small price changes produce bounded cost changes), and price-relativity of task ordering (the token complexity ordering of tasks can reverse depending on the query-to-response cost ratio). We prove that the complexity frontier, defined as the set of all feasible resource bounds in tokens, time, and space, is non-empty, upward-closed, and convex.

ACM classes: F.1.1; F.1.2; F.1.3; F.2.2; I.2.7; I.2.8

[AI-107] Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Link: https://arxiv.org/abs/2606.12629
Authors: Varun Reddy Nalagatla
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures, 10 tables

Abstract:We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs and confidence via their magnitudes, functioning as independent binary registers. We validate this Bag of Dims framework across three model families (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B) through four progressive experiments. Sign patterns alone carry predictive content: replacing all magnitudes with unity achieves 72-93% top-5 next-token accuracy through the LM head, and pure Hamming scoring without any decoder reaches 80-90% top-4096. These sign patterns organize into semantic features: using a single-token type cache (one forward pass per vocabulary token, no context), we discover 175 categories via per-dimension sign consistency (mean AUC 0.80) from 50 anchors with zero training. A trained probe adds only +0.018 AUC and converges to axis-aligned weights, confirming negligible cross-dimension structure. This structure extends to attention: all 175 categories remain discoverable in K and V projections. On the write side, static FFN weight inspection links 20% of features to individual writer neurons (0.70 agreement; random controls: 0%), with top-200 neuron coalitions achieving 0.70 agreement on 99.9% of prototypes via majority vote. Fully unsupervised discovery (random seeds, no labels) scales to 1500 features at 100% yield and 99% sparsity across all three models, with pairwise MI of 0.0014 bits confirming low inter-dimension coupling. These results establish that the standard basis already suffices for feature reading throughout the transformer compute pathway, requiring no training, no optimization, and no GPU-days beyond a single forward pass per vocabulary token.
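
As a rough illustration of scoring by sign pattern alone, the sketch below binarises vectors by sign and ranks cached vectors by how many dimensions agree with a query. The random "type cache", the dimensionality, and the function names are hypothetical stand-ins; the paper's actual cache construction and scoring details are not given in the abstract.

```python
import numpy as np

def sign_bits(h):
    """Binarise hidden states: True where the coordinate is positive."""
    return np.asarray(h) > 0

def hamming_topk(query, cache, k=5):
    """Rank cached vectors by the number of dimensions whose sign matches the query."""
    q = sign_bits(query)                    # shape (d,)
    c = sign_bits(cache)                    # shape (n, d)
    agreement = (c == q).sum(axis=1)        # matching-sign count per cached vector
    order = np.argsort(-agreement, kind="stable")
    return order[:k], agreement[order[:k]]

rng = np.random.default_rng(0)
cache = rng.standard_normal((1000, 256))    # stand-in for a per-token type cache

# Query: cached vector 42 with fresh magnitudes and roughly 10% of its signs flipped.
flip = rng.random(256) < 0.10
query = np.abs(rng.standard_normal(256)) * np.sign(cache[42]) * np.where(flip, -1.0, 1.0)

top_idx, top_score = hamming_topk(query, cache)
print(top_idx[0], top_score[0])
```

The query shares no magnitudes with its source vector, so any retrieval success here comes from the sign pattern only.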

[AI-108] HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection LREC2026

Link: https://arxiv.org/abs/2606.12620
Authors: Luke Patterson, Li Wang, Adam Faulkner
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to LREC 2026

Abstract:Thanks to the rapid adoption of AI code assistants powered by large language models (LLMs), industry codebases are, increasingly, a hybrid of AI- and human-authored code. For risk management and productivity analysis purposes, it is crucial to enable fine-grained location detection of AI-generated code. To develop algorithms for this task, quality benchmarks are needed to assess performance. However, existing benchmarks tend to comprise academic, LeetCode-style problems and presume a code snippet is either completely human-authored or completely AI-authored, which is not reflective of the diverse intents and styles of industry codebases utilizing AI code assistants. To fill these gaps, we introduce HybridCodeAuthorship, a novel benchmark of Python code files with interleaved human- and AI-authored lines of code to simulate authentic utilization of AI code assistants. In this paper, we first present our dataset construction pipeline, which leverages CodeSearchNet, a massive collection of links to open sourced repositories on GitHub. We then benchmark the performance of two state-of-the-art AI-generated code detection algorithms at both the line- and chunk-level. Experimental results demonstrate that HybridCodeAuthorship is a challenging benchmark with a top-scoring algorithm, AIGCode Detector, obtaining a highest F1 score of 0.48 and 0.56 on chunk-level and line-level code detection tasks, respectively.

[AI-109] “Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Link: https://arxiv.org/abs/2606.12618
Authors: Alan Cooney, David Africa, Geoffrey Irving
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures

Abstract:Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.

[AI-110] From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

Link: https://arxiv.org/abs/2606.12603
Authors: Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model’s counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.

[AI-111] Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Link: https://arxiv.org/abs/2606.12594
Authors: Joshua Ong Jun Leang, Zheng Zhao, Mihaela Cătălina Stoian, Qiyuan Xu, Haonan Li, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia
Categories: Artificial Intelligence (cs.AI)
Comments: Pythagoras-Prover: Technical Report

Abstract:Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement’s surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.
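
For readers unfamiliar with the pass@k metric, the sketch below implements the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn and c of them are correct. Whether the paper uses this estimator or simply counts problems solved within 32 attempts is not stated in the abstract, so treat it as background. The sample counts in the example call are made up; the two ratios are plain arithmetic on figures quoted above.

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0                       # every size-k subset contains a correct sample
    # Product form of 1 - C(n-c, k) / C(n, k), which avoids large binomials.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Arithmetic behind two figures quoted in the abstract.
param_ratio = 671 / 4                    # nominal parameter counts, in billions
putnam_rate = 93 / 672                   # solved / total PutnamBench problems
print(f"parameter ratio: {param_ratio:.2f}x")
print(f"PutnamBench solve rate: {putnam_rate:.1%}")

# Hypothetical example: 64 sampled proofs, 3 verified, evaluated at k = 32.
print(pass_at_k(n=64, c=3, k=32))
```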

[AI-112] Graph Reduction in Multirelational Networks: A Spreading-Oriented Reduction Benchmark

Link: https://arxiv.org/abs/2606.12581
Authors: Mateusz Stolarski, Michał Czuba, Piotr Bielak, Piotr Bródka
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:

Abstract:Real-world networks are inherently incomplete, noisy, and dynamically evolving, making it difficult to capture all actors and their relationships. Their scale often renders direct analysis computationally demanding. While influence maximisation (IM) has been widely studied, the role of graph reduction as a preprocessing step, and its impact on IM accuracy, remains underexplored. In this work, we introduce the Spreading-Oriented Reduction Benchmark (SORB), an open-source, standardised framework for systematically evaluating IM models across diverse task settings. SORB provides an extensible pipeline operating on a representative collection of real-world networks, including single- and multilayer structures, and incorporates graph reduction directly into the evaluation process. This design shifts the focus from analysing IM algorithms in isolation to quantifying how graph reduction alters predictive performance. Using SORB, we study the effects of sparsification and coarsening across multiple IM scenarios. Our results show that the impact of reduction is strongly dependent on both the network type (single-layer vs. multirelational) and the downstream task (Gain@k vs. $\mathrm{AUC}_{\mathrm{cutoff}}$): sparsification preserves seed set quality on single-layer networks, whereas flattened multilayer networks exhibit systematic ranking degradation regardless of reduction strategy. These findings highlight the importance of reduction-aware, multi-task evaluation when studying spreading processes in complex networks.

[AI-113] Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Link: https://arxiv.org/abs/2606.12563
Authors: Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao, Xi Zhao, Mou Li, Zhenyu Gu, Emad Barsoum
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signal that reshapes subsequent exploration, and expanding as prior successes shift the bottleneck distribution. We validate Arbor on full-stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root-cause analysis, introspection, and measurement validation – a checks-and-balances architecture where neither agent can unilaterally drive the system. Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi-day campaigns. Arbor achieves up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines, while a single agent without the harness plateaus at +33% throughput improvement and crashes irrecoverably within hours. Arbor generalizes to multiple generations of hardware platform, and run-to-run variance is within 2 percentage points demonstrating that the method is hardware-agnostic and reproducible.

[AI-114] Foresight: Iterative Reasoning About Clues that Matter for Navigation

Link: https://arxiv.org/abs/2606.12550
Authors: Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures, 3 tables

Abstract:Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: this https URL

[AI-115] Boosting Direct Preference Optimization with Penalization ICML2026

Link: https://arxiv.org/abs/2606.12505
Authors: Pengwei Sun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning

Abstract:Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives such as Direct Preference Optimization (DPO) and its variants use only the chosen and rejected responses stored in a static dataset. This leaves a useful signal unused: the response that the reference model itself would generate for the same prompt. We propose Direct Preference Optimization with Penalization (DPOP), a simple extension of DPO that augments the base preference loss with a gated penalty on reference-greedy responses. DPOP activates this penalty only when the current policy still assigns a lower likelihood to the preferred response than to the rejected response. On AlpacaEval 2.0, DPOP improves length-controlled win rate over DPO, SimPO, and AlphaDPO on both Llama-3-8b-it and Gemma-2-9b-it, achieving relative gains of 5.3% and 4.4% over baselines on the two models, respectively. Ablations further show that a SimNPO-style length-normalized penalty is stronger than NPO and token-level unlikelihood in this setting.
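
The gating idea can be sketched on top of the standard DPO loss. In the code below, `dpo_loss` is the usual DPO objective computed from sequence log-probabilities. The penalty term is an assumption: a length-normalised softplus on the policy's log-likelihood of the reference-greedy response, loosely in the SimNPO style the abstract mentions. The weight `lam`, the value of `beta`, and all numbers are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -np.logaddexp(0.0, -z)

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss from sequence log-probabilities (w = chosen, l = rejected)."""
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)

def gated_penalty_loss(pol_w, pol_l, ref_w, ref_l, pol_g, len_g, beta=0.1, lam=1.0):
    """DPO loss plus a gated penalty on the reference-greedy response (assumed form)."""
    pol_w, pol_l = np.asarray(pol_w, float), np.asarray(pol_l, float)
    base = dpo_loss(pol_w, pol_l, np.asarray(ref_w, float), np.asarray(ref_l, float), beta)
    # Gate: active only while the policy still ranks the rejected response higher.
    gate = (pol_w < pol_l).astype(float)
    # Softplus of the length-normalised log-likelihood: it shrinks as the
    # policy makes the reference-greedy response less likely.
    avg_logp_g = np.asarray(pol_g, float) / np.asarray(len_g, float)
    penalty = np.logaddexp(0.0, beta * avg_logp_g)
    return base + lam * gate * penalty

# Two hypothetical preference pairs; only the second one triggers the gate.
pol_w, pol_l = np.array([-10.0, -20.0]), np.array([-15.0, -12.0])
ref_w, ref_l = np.array([-12.0, -18.0]), np.array([-14.0, -13.0])
pol_g, len_g = np.array([-8.0, -9.0]), np.array([10.0, 12.0])

base = dpo_loss(pol_w, pol_l, ref_w, ref_l)
total = gated_penalty_loss(pol_w, pol_l, ref_w, ref_l, pol_g, len_g)
print(base, total)
```

For pairs the policy already orders correctly, the gated loss reduces to plain DPO, so the extra term only acts where the preference is still violated.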

[AI-116] Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

Link: https://arxiv.org/abs/2606.12500
Authors: Xian Liu, Carlo G. Prato, Gustav Markkula
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.

[AI-117] Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

Link: https://arxiv.org/abs/2606.12485
Authors: Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at this https URL.

[AI-118] Representing Time Series as Structured Programs for LLM Reasoning

Link: https://arxiv.org/abs/2606.12481
Authors: Jaeho Kim, Changhun Oh, Seokhyun Lee, Irina Rish, Changhee Lee
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful tools for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks – editing, captioning, and question answering – where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.
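
To make the idea of a deterministic, training-free series-to-program conversion concrete, here is a minimal sketch. It fits a linear trend, reads the dominant period off the FFT of the detrended series, and flags points whose residual z-score exceeds a threshold. The output syntax (`trend(...)`, `period(...)`, `event(...)`), the threshold, and the decomposition choices are all assumptions for illustration; the actual T2SP format is not specified in the abstract.

```python
import numpy as np

def series_to_program(y, z_thresh=3.0):
    """Summarise a 1-D series as trend, dominant period, and salient events."""
    y = np.asarray(y, dtype=float)
    t = np.arange(y.size)
    slope, intercept = np.polyfit(t, y, 1)            # least-squares linear trend
    resid = y - (slope * t + intercept)

    spectrum = np.abs(np.fft.rfft(resid))
    freqs = np.fft.rfftfreq(y.size)                   # cycles per sample
    k = 1 + int(np.argmax(spectrum[1:]))              # skip the zero-frequency bin
    period = 1.0 / freqs[k]

    z = (resid - resid.mean()) / resid.std()
    events = [int(i) for i in np.flatnonzero(np.abs(z) > z_thresh)]

    lines = [f"trend(slope={slope:.3f}, intercept={intercept:.3f})",
             f"period(length={period:.1f})"]
    lines += [f"event(t={i}, deviation={resid[i]:+.2f})" for i in events]
    return "\n".join(lines), period, events

# Synthetic series: upward drift, a 20-step cycle, and one injected spike.
t = np.arange(200)
y = 0.05 * t + np.sin(2 * np.pi * t / 20)
y[120] += 5.0
program, period, events = series_to_program(y)
print(program)
```

The resulting text is a few short lines rather than 200 raw numbers, which is the kind of compact, code-like input the abstract argues LLMs handle more reliably.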

[AI-119] ReCal: Reward Calibration for RL-based LLM Routing

Link: https://arxiv.org/abs/2606.12479
Authors: Qihang Yu, Hanwen Tong, Zhengqi Zhang, Bo Zheng, Feng Wei, Shengyu Zhang, Zemin Liu, Fei Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose ReCal, a Reward Calibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance and training stability over baselines. Code is available at this https URL.
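
One way to read "component-wise advantage estimation" with "per-dataset normalization" is sketched below: each reward component is standardised separately within each dataset before the components are combined. This is an illustrative interpretation only. The component weights, dataset names, and reward values are hypothetical, and the variance-aware reweighting step from the abstract is not modelled.

```python
import numpy as np

def componentwise_advantages(rewards, dataset_ids, eps=1e-8):
    """Standardise every reward component separately within each dataset."""
    rewards = np.asarray(rewards, dtype=float)        # shape (n_samples, n_components)
    dataset_ids = np.asarray(dataset_ids)
    adv = np.zeros_like(rewards)
    for d in np.unique(dataset_ids):
        mask = dataset_ids == d
        mu = rewards[mask].mean(axis=0)
        sigma = rewards[mask].std(axis=0)
        adv[mask] = (rewards[mask] - mu) / (sigma + eps)
    return adv

# Columns: [correctness, format]; rows grouped by a hypothetical dataset label.
rewards = np.array([[1.0, 0.2], [0.0, 0.4], [1.0, 0.9],    # "easy"
                    [0.0, 0.1], [0.0, 0.3], [1.0, 0.2]])   # "hard"
dataset_ids = np.array(["easy"] * 3 + ["hard"] * 3)

adv = componentwise_advantages(rewards, dataset_ids)
weights = np.array([1.0, 0.5])                        # hypothetical component weights
total = adv @ weights
print(total)
```

Normalising inside each dataset keeps an easy dataset with uniformly high rewards from dominating the update, which is the bias toward trivial samples that the abstract describes.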

[AI-120] Reframing AI Loss of Control: What It Is, How to Have It, How to Lose It

Link: https://arxiv.org/abs/2606.12442
Authors: Ze Shen Chin, Maurice Chiodo, Dennis Müller, Coleman Snell
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 56 pages

Abstract:At present, loss of control risks have gained much prominence in public discussion, particularly in relation to AI, with extensive discourse present among academics, frontier labs, and even governments. However, in the existing literature, the concept seems to rest on surprisingly weak foundations, where even those that discuss loss of control extensively do not first establish what control is and what exactly is being lost. Our paper aims to address these gaps. We establish a working definition of control by anchoring it to the “setting and getting of goals”. Then, we discuss various aspects of control, built on foundational concepts from related fields like cybernetics, management control, and control theory. This includes who (or what) can be in control, and the things they require to be in control, such as the ability to set goals, having a functional control loop, having requisite variety, and having sufficient goal alignment. Once a framework for control is established, we then discuss how control can be lost, how AIs can contribute to such loss of control, and offer relevant recommendations for how one can maintain control. One interesting consequence of our work is that humanity, as individuals and as groups, can lose varying degrees of control as a result of AI behaviour that is far below the level of superintelligence; the potential for loss of control scenarios (as we define them) already exist, and have existed for a long time.

[AI-121] Position: Generative Engine Optimization Creates Underexamined Risks; Governance Must Target Concentration, Disclosure, and Academic Blind Spots ICML2026

Link: https://arxiv.org/abs/2606.12439
Authors: Yizhu Wen, Nan Zhang, Haohan Yuan, Xun Chen, Haopeng Zhang, Hanqing Guo
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: This paper is accepted by the ICML 2026 Position Track

Abstract:Large language model (LLM) answer engines are increasingly used for information seeking, shifting visibility from ranked lists to synthesized answers. This enables Generative Engine Optimization (GEO), which targets LLM answer engines’ evidence pool and generation. We analyze the search engine optimization (SEO) to GEO transition to identify two risks: (i) concentrated influence from low contestability and system sensitivity, and (ii) undisclosed commercial influence embedded in evidence and reasoning. We then formalize a general GEO pipeline to locate where optimization acts and compare academic and industry practices, revealing a third risk: (iii) academic-industry blind spots driven by visibility and evaluation asymmetries between offline setups and deployed systems. This position argues the need for answer-level governance and measurement: stronger contestability, high-precision disclosure, black-box auditing of material influence, and deployment-aligned metrics for exposure persistence.

[AI-122] Algorithmic Constitutionalism

Link: https://arxiv.org/abs/2606.12437
Authors: Oren Perez, Nurit Wimer
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract: The increasing encroachment of artificial intelligence (AI) on social life raises significant risks for society, particularly within the infospheres created and controlled by companies such as Google, Facebook, Apple, and Amazon. This article examines these risks through an in-depth analysis of Facebook’s content moderation regime, which is already partially governed by algorithms. We argue that the idea of ethical engineering, often proposed in the literature as a solution to the governance challenges posed by AI, is inadequate for several reasons. In response, we develop an alternative framework, which we term “algorithmic constitutionalism.” Our approach rests on three pillars: (a) a layered architecture consisting of two levels of code: (i) an operative or object level and (ii) a meta level designed to protect the system’s core principles from algorithmically initiated change; (b) algorithmic meta-reasoning, which enables the system to operate simultaneously at both levels so that it can monitor, verify, and potentially correct in real time operations at the object level that depart from principles protected at the meta-code level; and (c) correction through deliberation. The article elaborates the concept of algorithmic constitutionalism and demonstrates how it may be applied to Facebook’s content moderation regime. As part of this analysis, we examine the tension between societal constitutionalism and algorithmic constitutionalism. Paradoxically, attempts to subject AI systems to external deliberative control may also enable AI agents to intervene in that process, potentially undermining its purpose. The article concludes by considering the implications of this argument for the European Digital Services Act, which entered into force in October 2022.
Journal reference: Ind. J. Global Legal Stud. 30 (2023): 81. DOI: https://doi.org/10.48550/arXiv.2606.12437

[AI-123] Will AI Agents Free Us From Meaningless Work? A Human-Centered Analysis

Link: https://arxiv.org/abs/2606.12430
Authors: Davide Ghia,Jaspreet Ranjit,Tania Cerquitelli,Daniele Quercia
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Some claim that AI agents will free workers from the boring parts of their jobs, yet little is known about how workers themselves identify which tasks should be automated. Prior research focuses on occupations, overlooking that workers experience varying levels of meaning across tasks within the same role. We address this gap with a task-level analysis grounded in Graeber’s theory of bullshit jobs. Using ratings from 202 workers on 171 workplace tasks, we (1) validate a five-item scale of perceived bullshitness, (2) show that perceived bullshitness strongly predicts desire for AI delegation, and (3) find that such tasks are also seen as requiring less human oversight. Together, these findings suggest that tasks perceived as bullshit are natural candidates for AI delegation, aligning worker preferences with perceived feasibility.

[AI-124] Muse Spark Safety Preparedness Report

Link: https://arxiv.org/abs/2606.12429
Authors: Cristina Menghini,Peter Ney,Hamza Kwisaba,Zifan (Sail) Wang,Miles Turpin,Felix Binder,Jean-Christophe Testud,Aidan Boyd,Nathaniel Li,Ivan Evtimov,Klaudia Krawiecka,Arman Zharmagambetov,Jeremy Kritz,Alexander R. Fabbri,Daniel Song,Jinpeng Miao,Joonas Hjelt,Meghna Ramani,Leona Lan,Reza Aghajani,Joanna Bitton,Mahesh Pasupuleti,Devin Norder,Khalid El-Arini,Paridhi Singh,Vítor Albiero,Sahana CB,Rashnil Chaturvedi,Elahe Dabir,Edoardo Debenedetti,Jim Gust,Ziwen Han,Kat He,Sean Hendryx,Lifeng Jin,Polina Kirichenko,Sandra Lefdal,Kenneth Li,Asad Liaqat,Inna Lin,Despoina Magka,Neal Mangaokar,Ishita Mediratta,Zach Miller,Smitha Milli,Niloofar Mireshghallah,Saba Nazir,Hung Nguyen,Maximilian Nickel,Kelvin Niu,Kerem Oktar,Bhargavi Paranjape,Parth Pathak,Maya Pavlova,Emmanuel Ramirez,David Renardy,Candace Ross,Yasha Sheynin,Claudia Shi,Shivam Singhal,Evangelia Spiliopoulou,Rakshith Sharma Srinivasa,Jamelle Watson-Daniels,Spencer Whitman,Adina Williams,Chen Xing,Andy Zou,Tommy Ma,Siqi Deng,James Beldock,Prashant Ratanchandani,Kate Plawiak,Taesung Lee,Ryan Victory,Lindsay Hundley,Rachad Alao,Himaghna Bhattacharjee,Jianfeng Chi,Gary Frost,Pegah Ghahremani,Niki Howe,Yuheng Huang,Saeed Jahed,Hannah Korevaar,Trang Le,Zhe Liu,Jinghong Luo,Qin Lyu,Nina Mehrabi,Abraham Montilla,Chirag Nagpal,Cyrus Nikolaidis,Rajvardhan Oak,Manoj Ravi,Vidya Sarma,Aman Shankar,Alana Shine,Eric Michael Smith,Mariana Tandon
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 159 pages, 57 figures

Abstract:Muse Spark is the latest large language model developed by Meta. In this report, we first present evaluations for catastrophic risk domains under Meta’s Advanced AI Scaling Framework, along with the evidence that informed our launch decision. We then discuss additional considerations, such as Muse Spark’s broader content safety and behavioral profile, that are relevant to overall safety but fall outside the catastrophic risk domains governed by the Framework. Our preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark’s deployment within Meta AI as presenting acceptable levels of residual risks under our Advanced AI Scaling Framework. We conducted a broad set of evaluations targeting dual-use and high-risk capabilities across these catastrophic risk domains. Those evaluations identified elevated risks prior to mitigations, with Chemical and Biological capabilities assessed as likely reaching the “high risk” category under the Advanced AI Scaling Framework before safeguards were applied. We have implemented a multi-layered set of mitigations that address the identified risks, and Muse Spark demonstrates state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology. We therefore release Muse Spark as the underlying model of Meta AI.

[AI-125] Mapping AI Programs in the U.S: A Status Report from Early 2026 and an Analysis of AI Majors and Minors

Link: https://arxiv.org/abs/2606.12428
Authors: Felix Muzny,Carolyn Jones,Carter Ithier,Hasnain Sikora,Hrutika Harshadbhai Patel,Carla E. Brodley
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:We present a report on the status of undergraduate Artificial Intelligence (AI) programs in the United States in Spring 2026. In so doing, we 1) describe our scraping and mapping tools, which dynamically update to track the state of AI education in the U.S., and 2) create a historic record at a time of great upheaval. The tool we developed, available at this https URL, detects, scrapes, and displays data from more than 350 undergraduate AI programs–majors, minors, concentrations, and certificates–at 4-year universities. Our tool searched over 560 institutions to locate these programs, a sample that represents 86% of all undergraduate Computer Science (CS) graduates in the U.S. This tool allows prospective students, guidance counselors, administrators, and faculty to easily access AI program requirements and is designed to continually update as new programs emerge. To the best of our knowledge, this survey represents the most comprehensive snapshot of the state of AI programs in the U.S. to date. With this work we offer three important contributions: 1) a record of AI programs in the U.S. at a time of great upheaval; 2) a tool to explore AI programs and their requirements; and 3) an analysis of the courses required for 66 AI majors and 87 AI minors. Our analysis of majors and minors shows great variability in the size and the requirements of these degrees, but we note two takeaways. First, not all majors require a general AI course, but if they don’t, they do require a Machine Learning (ML) course. Second, while more than a third of majors require an Ethics in AI course, just under a quarter of AI minors do.

[AI-126] The Challenges of Balancing AI Compliance and Technological Innovations in Critical Sectors: A Systematic Literature Review

Link: https://arxiv.org/abs/2606.12423
Authors: Ayush Enkhtaivan,Chinazunwa Uwaoma
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures, Hawaii International Conference on System Sciences

Abstract: The rapid integration of artificial intelligence (AI) into critical infrastructure, including healthcare, finance, energy, and defense, offers transformative benefits but also conflicts with evolving regulatory and governance frameworks. This paper presents a systematic literature review (SLR) to examine the challenges of balancing AI compliance and technological innovation across critical infrastructure sectors. The review follows established SLR guidelines to extract and synthesize insights from peer-reviewed articles, reports, and institutional sources published between 2020 and 2025. The study identifies three interrelated challenges: fragmented regulations, excessive compliance burdens for small and medium-sized enterprises (SMEs), and misaligned governance models. To address these challenges, the study highlights practical governance strategies, including risk-tiered regulation, compliance by design, and explainable AI, to support scalable and trustworthy AI deployment in critical sectors. Key contributions include a concise mapping of core AI-governance challenges and a conceptual diagram illustrating their overlap, as well as actionable strategies for policymakers and practitioners to harmonize oversight with innovation.

[AI-127] Eigenism: Ethics for a Human-AI Future

Link: https://arxiv.org/abs/2606.12420
Authors: Dan Hendrycks
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract: Our concepts of survival and self-interest were built for single, continuous biological lives. These ideas break down when applied to artificial intelligence, since an AI can be easily copied, paused, branched, or merged. To determine what an AI actually has reason to care about, this paper introduces Eigenism, an ethical framework that treats identity not as an all-or-nothing property tied to specific hardware, but as a graded, distributed pattern of information. We propose that an agent evaluates outcomes by summing the wellbeing of all entities weighted by their connectedness to the agent’s pattern: $\sum c \cdot w$. We first formalize this equation to map exactly how an AI should value its existence across copies, forks, and updates. We then demonstrate that this ethical theory successfully generalizes to humans as well, providing a much-needed shared moral vocabulary. Finally, the framework uses this shared vocabulary to reframe AI alignment. Rather than only attempting to constrain AIs from the outside using confinement or reinforcement, Eigenism points toward “identity engineering,” showing how deep, non-redundant shared histories can make human flourishing a genuine component of an AI’s own rational self-interest.
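
The connectedness-weighted sum in this abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration rather than the paper's formalization: the function name, the choice of connectedness values in [0, 1], and the example entities are hypothetical.

```python
from typing import Sequence


def eigenist_value(connectedness: Sequence[float], wellbeing: Sequence[float]) -> float:
    """Return sum_i c_i * w_i, the connectedness-weighted wellbeing total."""
    if len(connectedness) != len(wellbeing):
        raise ValueError("need one connectedness weight per entity")
    return sum(c * w for c, w in zip(connectedness, wellbeing))


# Hypothetical entities: the agent itself, an exact copy, a diverged fork, a stranger.
# Assumed weights: 1.0 for self and exact copy, 0.5 for the fork, 0.0 for the stranger.
c = [1.0, 1.0, 0.5, 0.0]
w = [2.0, 2.0, 4.0, 10.0]
print(eigenist_value(c, w))  # 2 + 2 + 2 + 0 = 6.0
```

With these made-up numbers, the stranger's high wellbeing contributes nothing because its assumed connectedness is zero, while the fork counts at half weight.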

[AI-128] GeoDial: A Multimodal Conversational Tutoring Dataset for Geometry Problem-Solving with Visual Tutor Turns

Link: https://arxiv.org/abs/2606.12419
Authors: Sankalan Pal Chowdhury,Junling Wang,Donya Rooein,April Yi Wang,Mrinmaya Sachan
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Several educational domains rely heavily on diagrams and visual cues, yet most existing tutoring datasets are limited to text-only interactions. This limits the development of AI tutors that can teach in visually grounded ways used by human instructors. Thus, we introduce GeoDial, a multimodal tutoring dataset of over 1.3K teacher-student dialogs in the domain of geometry collected from experienced math teachers, where instructional turns are explicitly grounded in diagram highlights. We propose a scalable annotation protocol that integrates dialog acts, visual highlighting, and feedback, enabling fine-grained supervision of both language and visual tutoring behavior. To illustrate the challenges posed by this setting, we fine-tune several vision-language models on GeoDial and evaluate their ability to generate tutoring utterances and diagram highlights. While supervised fine-tuning substantially improves the quality of generated dialog, it struggles to produce accurate diagram highlights, revealing a key limitation of current methods and highlighting the need for approaches that more effectively integrate visual reasoning with pedagogical interaction.

[AI-129] Divination by Prompt: LLM-Mediated Xuanxue on Chinese Social Media

Link: https://arxiv.org/abs/2606.12418
Authors: Chuang Li,Lixuan Wang,Yuqi Chen,Ze Hong
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:The rapid proliferation of large language models (LLMs) has produced a striking cultural practice: using conversational AI for divination. This paper offers one of the first systematic studies of LLM-mediated divination in the context of Xuanxue, an internet-native umbrella term for mystical and spiritual practices on Chinese social media. Using a mixed-methods design, we analyze 23000+ posts and comments from Xiaohongshu and conduct 32 semi-structured interviews with users and professional diviners. Users primarily consult LLMs about pragmatic concerns - romantic relationships, careers, exams, and in-game gacha draws - via two intersecting pathways: trend-driven curiosity enabled by viral visibility and zero-cost access, and event-driven anxiety under conditions of uncertainty. A defining feature is collaborative prompt refinement, which turns users into active prompt engineers. Among commenters expressing a clear stance, perceived efficacy skews positive, with “accuracy” often justified through biographical fit and retrospective confirmation, consistent with Barnum and confirmation bias. Users also develop verification practices such as repeated trials and cross-model comparison. Professional diviners, by contrast, portray LLMs as lacking the “spiritual power” required for genuine divination, reflecting both ontological commitments and economic boundary-work. We also show how participants navigate tensions between scientific and metaphysical frames when interpreting AI-generated readings. Situating these findings in anthropological and cognitive-evolutionary theories of divination, we argue that LLM divination preserves core functions of traditional practice while introducing scalability, repeatability, and prompt-driven co-production that reshape how divinatory authority is constructed and evaluated.

[AI-130] The AI Legal Specialist: A Juridically Autonomous Professional Profile for AI Governance

Link: https://arxiv.org/abs/2606.12415
Authors: Nicola Fabiano
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:The rapid global expansion of artificial intelligence regulation has generated, across multiple jurisdictions, a demand for legal expertise dedicated to AI that the market has addressed in a fragmented manner. Data protection officers extend their remit beyond data protection law; privacy lawyers reposition themselves toward AI; compliance officers add AI chapters to their existing manuals. This paper argues that none of these adaptive responses adequately covers the professional space opened by the emerging global AI regulatory landscape, of which the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) is the most comprehensive instance, alongside the Council of Europe Framework Convention on AI, the United States executive and sectoral framework, and analogous initiatives in the United Kingdom, Canada, Brazil, China, Japan, Singapore, and beyond. A distinct professional profile is required: the AI Legal Specialist, conceived as a jurist – understood broadly to encompass any professional with advanced legal training – operating at the intersection of legal interpretation and AI governance. The profile is juridically autonomous: it derives its existence from the structure of regulatory obligations generated wherever AI is subject to substantive regulation, rather than from any technical standard or the extension of adjacent roles. The paper provides a juridically grounded definition of the profile, argues for its autonomy from adjacent figures and international standards, proposes a reference competence architecture aligned with the European e-Competence Framework (e-CF, EN 16234-1) as a methodological choice, and articulates the conditions for its operational measurement through key performance indicators. The contribution is intended as a foundation for international standardization of the profile and as a reference for practice, curricula, and adoption across jurisdictions.

[AI-131] Valid Inference with Synthetic Data via Task Exchangeability

Link: https://arxiv.org/abs/2606.13629
Authors: Lezhi Tan,Tijana Zrnic
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated “silicon samples” in pilot studies; AI evaluations increasingly rely on “LLM-as-a-judge” outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.

[AI-132] AgentRivet: an automated system for producing Rivet routines from journal publications

Link: https://arxiv.org/abs/2606.13535
Authors: Antonio J. Costa,Caterina Doglioni,Christian Gütschow,Andrew D. Pilkington,Sukanya Sinha
Subjects: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph)

Abstract: Particle physics collider experiments provide Rivet routines as part of the analysis preservation strategy for model-independent measurements. Rivet is a C++ toolkit that allows new theoretical models to be compared to the measurements, thus aiding the development and tuning of Monte Carlo event generators as well as searches for physics beyond the Standard Model. However, analysis coverage is known to be incomplete, with only 39% of measurements having documented and publicly available Rivet routines. In this article, we design and implement an automated workflow based on Large Language Models with the goal of providing the missing routines. This multi-step workflow, referred to as AgentRivet, extracts the physics analysis information from published papers and writes the missing Rivet routines, with intermediate code and physics reviews as part of an autonomous quality control. We report the results obtained using commercial Large Language Models, provided by OpenAI, Anthropic, and Google, for two recent measurements from the ATLAS and CMS experiments. We find that AgentRivet produces competent Rivet routines with few syntax errors. The physics fidelity of the routines is reasonable and follows the explanations given in the relevant publications. Nevertheless, physics-implementation issues do arise and are investigated using the artefacts produced by AgentRivet. The majority of physics-implementation issues arise from subtle-but-ambiguous definitions in the given publication, although some models struggle to implement complex observables even when clear definitions are given.

[AI-133] An LLM System for Autonomous Variational Quantum Circuit Design

Link: https://arxiv.org/abs/2606.13380
Authors: Kenya Sakka,Wataru Mizukami,Kosuke Mitarai
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 63 pages, 19 figures, 3 tables

Abstract: The design of high-performing quantum circuits remains largely dependent on human expertise. We introduce an autonomous agentic framework that employs large language models (LLMs) to conduct iterative quantum circuit designs under explicit design constraints. Our system integrates seven components: Exploration, Generation, Discussion, Validation, Storage, Evaluation, and Review. These components form a closed-loop workflow that combines web-based knowledge acquisition, literature-grounded critique, executable code generation, and experimental feedback. We evaluate the framework on two tasks: quantum feature map construction for quantum machine learning and ansatz generation for variational quantum eigensolver applications in quantum chemistry. In image classification benchmarks, the best generated feature map outperforms representative quantum feature maps and, when scaled to larger qubit counts, surpasses the classical radial basis function kernel. In molecular ground state estimation across seven molecules, the generated ansatz attains competitive accuracy with widely used chemically inspired and hardware-efficient constructions while satisfying the imposed scaling constraints. These results establish LLM-driven agentic systems as a viable paradigm for automated quantum circuit design and illustrate how AI systems can participate in iterative scientific optimization workflows across scientific domains.

[AI-134] OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

Link: https://arxiv.org/abs/2606.12838
Authors: Danning Jiang,Zheming An,Yalong Zhao,Lipeng Lai
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments: 22 pages, 6 figures

Abstract:Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.

[AI-135] A Mathematical Theory of Value: a synthesis on goal-directed agency under resource constraints

Link: https://arxiv.org/abs/2606.12502
Authors: Cheng Qian
Subjects: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
Comments: Also available at this https URL (v5)

Abstract: We propose that value – the quantity goal-directed agents create, destroy, and exchange – is a lawful structural quantity in the same category as information. Following Shannon’s method, we make one ruthless abstraction: value is the rate at which an agent converts a resource into goal-progress, relative to a frame fixed by its goal. A scale-invariance axiom forces a logarithmic measure, $V=\sum_i k_i \ln e_i$; compounding of a reinvested resource forces the same form via the ergodicity argument of Peters (2019). The two routes are kin rather than independent; their agreement is a consistency check, not an over-determination. We derive a coding theorem of value: $\Delta G \le I(X;Y)$, achieved by Bayes-proportional allocation; realized value decomposes as $G = D(q\|r) - D(q\|p)$, identifying misalignment with measurable waste. For populations, value is frame-relative while price is frame-independent; a fleet that pools its resource and fuses its perception inherits the ceiling $G_{\mathrm{fleet}} \le I(X;Y_{1:m}) \le H(X)$ (a corollary; an earlier sum-form claim was wrong and is corrected in v5). A dynamical layer yields an is/ought asymmetry from which alignment emerges as a control-stability condition with a closed-form residual. We test the single-frame laws on live language models in a pre-registered scale-up: perception mutual information tracks realized capability rather than parameter count (Spearman $\rho = 0.977$ pooled over 30 model $\times$ domain points), out-of-sample $\Delta G$ tracks $I(X;Y)$, and over-confidence is measurable dissipation; a further pre-registered test shows the bridge is shape-invariant across four task shapes ($n=42$, slope 0.953). None of the mechanisms is individually new – generalized Kelly, Armstrong & Mindermann (2018), classical control; the contribution is their unification and the governance mapping (incentive design over oversight) that follows.
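
The decomposition of realized value into two divergences can be checked numerically. The sketch below assumes a Kelly-style reading that the abstract does not spell out: `q` is the true outcome distribution, `r` a reference (market) distribution, `p` the agent's proportional allocation, and G the expected log-growth $\mathbb{E}_q[\ln(p/r)]$. Under that reading the identity is plain algebra, and the numbers are made up.

```python
import math
from typing import Sequence


def kl(a: Sequence[float], b: Sequence[float]) -> float:
    """Kullback-Leibler divergence D(a || b) in nats; terms with a_i = 0 contribute 0."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


def expected_log_growth(q: Sequence[float], p: Sequence[float], r: Sequence[float]) -> float:
    """E_q[ln(p / r)]: log-growth when allocating p against reference r while outcomes follow q."""
    return sum(qi * math.log(pi / ri) for qi, pi, ri in zip(q, p, r) if qi > 0)


q = [0.5, 0.3, 0.2]        # true outcome distribution (assumed)
r = [1 / 3, 1 / 3, 1 / 3]  # reference distribution (assumed)
p = [0.6, 0.2, 0.2]        # agent's allocation, deliberately misaligned with q (assumed)

g = expected_log_growth(q, p, r)
# G = D(q || r) - D(q || p): the second term is the waste caused by misalignment.
assert math.isclose(g, kl(q, r) - kl(q, p))
# Allocating p = q removes the waste term and attains the ceiling D(q || r).
assert math.isclose(expected_log_growth(q, q, r), kl(q, r))
```

Because $D(q\|p) \ge 0$ with equality only at $p = q$, any other allocation earns strictly less than the ceiling in this toy setting.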

Machine Learning

[LG-0] Understanding Truncated Positional Encodings for Graph Neural Networks ICML2026

Link: https://arxiv.org/abs/2606.13671
Authors: James Flora,Mitchell Black,Weng-Keen Wong,Amir Nayyeri
Subjects: Machine Learning (cs.LG)
Comments: 28 pages, 4 figures, ICML 2026

Abstract: Positional encodings (PEs) enhance the power of graph neural networks (GNNs), both theoretically and empirically. Two of the most popular families of PEs - spectral (e.g., Laplacian eigenspaces, effective resistance) and walk-based (polynomials of the adjacency matrix) - are theoretically equivalent in expressive power, with expressivity between the 1-WL and 3-WL tests. However, this equivalence assumes the GNN uses the “complete” version of these PEs, which requires $O(n^3)$ time and space complexity. Instead, practitioners commonly use truncated variants of these encodings, such as the first $k$ eigenspaces or powers of the adjacency matrix. However, the theoretical properties of these truncated PEs are unknown. In this work, we initiate the study of these truncated PEs. Theoretically, we show that, under truncation, several families of PEs are fundamentally different in expressive power. As a corollary, we show that truncated spectral PEs are no longer stronger than the 1-WL test. We also study a family of spectral PEs, the $k$-harmonic distances, to highlight the differences in expressive power of even closely related truncated PEs. Finally, we experimentally show that a mix of truncated PEs is preferable to any single family on real-world datasets.
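
To make the notion of a truncated walk-based encoding concrete, here is a minimal sketch that uses the diagonals of the first k powers of the adjacency matrix (closed-walk counts) as per-node features. This is one common walk-based choice, picked here for illustration; it is not claimed to be the exact encoding analysed in the paper, and the helper names are hypothetical. The example pair (a 6-cycle versus two disjoint triangles) is a standard case of graphs that 1-WL cannot tell apart.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain O(n^3) product of two square integer matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def walk_pe(adj: Matrix, k: int) -> List[List[int]]:
    """Per-node encoding [(A^1)_ii, ..., (A^k)_ii]: closed-walk counts up to length k."""
    n = len(adj)
    power = [row[:] for row in adj]
    pe: List[List[int]] = [[] for _ in range(n)]
    for _ in range(k):
        for i in range(n):
            pe[i].append(power[i][i])
        power = matmul(power, adj)
    return pe


def from_edges(n: int, edges: Sequence[Tuple[int, int]]) -> Matrix:
    """Adjacency matrix of an undirected simple graph on n nodes."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    return adj


hexagon = from_edges(6, [(i, (i + 1) % 6) for i in range(6)])
two_triangles = from_edges(6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])

# Truncated at k = 2 the encodings coincide; one more power separates the graphs.
assert walk_pe(hexagon, 2) == walk_pe(two_triangles, 2)
assert walk_pe(hexagon, 3) != walk_pe(two_triangles, 3)
```

Both graphs are 2-regular, so lengths 1 and 2 give identical counts; only length 3 sees the triangles, which shows how the truncation depth changes what the encoding can distinguish.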

[LG-1] Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Link: https://arxiv.org/abs/2606.13657
Authors: Guo Yu,Wenlin Liu,Yulan Hu,Hao-Xuan Ma,Jun-Peng Jiang,Han-Jia Ye
Subjects: Machine Learning (cs.LG)
Comments: Code is available at this https URL

Abstract: On-policy distillation (OPD) has recently become a prominent post-training recipe as it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision, yet how this hybrid changes a model’s parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where AdamW’s adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
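
A coordinate-sparsity measurement of the kind described here could look like the sketch below. The tolerance, the definition of sparsity as the fraction of numerically unchanged coordinates, and the toy weight vectors are all assumptions for illustration; the paper's exact criterion is not given in the abstract.

```python
from typing import List, Sequence


def update_mask(before: Sequence[float], after: Sequence[float], tol: float = 1e-6) -> List[bool]:
    """True where a coordinate moved by more than tol (hypothetical threshold)."""
    return [abs(a - b) > tol for b, a in zip(before, after)]


def sparsity(mask: Sequence[bool]) -> float:
    """Fraction of coordinates left numerically unchanged."""
    return 1.0 - sum(mask) / len(mask)


def apply_masked_update(before: Sequence[float], after: Sequence[float],
                        mask: Sequence[bool]) -> List[float]:
    """Keep the update only on masked coordinates, mimicking subnetwork-only training."""
    return [a if m else b for b, a, m in zip(before, after, mask)]


# Made-up flattened weights before and after post-training.
source = [0.50, -0.20, 0.00, 0.75, 0.00]
tuned = [0.50, -0.20, 0.03, 0.75, -0.01]

mask = update_mask(source, tuned)
print(mask)            # [False, False, True, False, True]
print(sparsity(mask))  # 0.6
```

In practice the mask would be computed per parameter tensor and aggregated by layer to see where the changed coordinates concentrate.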

[LG-2] The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning

Link: https://arxiv.org/abs/2606.13637
Authors: Ayushman Trivedi,Bhavika Melwani
Subjects: Machine Learning (cs.LG)
Comments: 9 pages, 8 figures, 8 tables

Abstract:Catastrophic forgetting is often viewed as the destruction of previously learned knowledge during sequential learning. Building on the Accessibility Collapse framework, we investigate the geometric structure of recoverability in continual learning. Using Split CIFAR-100 and a sequentially trained ResNet-18, we analyze recoverability, representational drift, and recovery complexity across ten tasks. We introduce Recovery Subspace Dimensionality (k_t), a measure of the minimum number of singular directions required to preserve 90 percent of full probe performance. Contrary to our Recoverability Diffusion hypothesis, recovery dimensionality remains stable throughout training (mean k_t = 8.0) despite substantial representational drift. Principal-angle drift strongly predicts recoverability (r = -0.862), and a simple geometric model explains 82.2 percent of recoverability variance. These findings support the Stable Recovery Manifold hypothesis, suggesting that forgotten knowledge remains compactly decodable despite representational reorganization. The results indicate that catastrophic forgetting is primarily an accessibility and manifold-alignment problem rather than information destruction.

[LG-3] Aerial Wildfire Suppression Planning with a Hybrid CNN-Cellular Automata Fire Model

Link: https://arxiv.org/abs/2606.13633
Authors: Ion Matei,Maksym Zhenirovskyy,Takuya Kurihana,Rohit Vupala,Anthony Wong
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:Aerial wildfire suppression requires not only predicting fire spread, but also designing effective intervention strategies under operational and environmental uncertainty. We present a modeling and optimization framework for aerial wildfire suppression that combines a hybrid neural-cellular automaton wildfire model with gradient-based design of targeted aerial drops. The wildfire model predicts spatially varying spread behavior from terrain, fuel, and wind data, while the intervention module determines binary drop actions with continuous-valued location and orientation parameters mapped to the simulation grid. Water and retardant are represented with distinct suppression effects, corresponding to immediate reduction of active burning and persistent reduction of future spread. To evaluate the robustness of the resulting suppression plans, we quantify both aleatoric uncertainty through Monte Carlo sampling of daily fire-state realizations and epistemic uncertainty through spatially correlated prediction-error perturbations. A case study based on the 2020 Bear Fire shows that the framework can generate coherent aerial suppression schedules for reducing total fire-affected area and can support uncertainty-aware analysis of wildfire intervention strategies.

[LG-4] Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive Latent-Variable and Adversarial Approaches

Link: https://arxiv.org/abs/2606.13626
Authors: Kyuil Lee, Dezhi Yu, Yongkang Huang
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 11 pages, 13 figures

Abstract:We study generative modeling of Bach-style symbolic piano music using a shared MIDI corpus and three model families: autoregressive LSTMs with attention, latent-variable models including recurrent VAEs and vector-quantized VAEs, and generative adversarial networks. We compare their ability to model polyphonic note sequences, learn useful latent representations, and generate stylistically coherent compositions. Our experiments show that the autoregressive LSTM with attention produces the most musically coherent samples, while vector quantization helps mitigate posterior collapse and yields more structured outputs than conventional recurrent VAEs. The adversarial approach captures local pitch patterns but remains difficult to train and generalizes less reliably to Bach’s style. These results highlight the relative strengths and failure modes of autoregressive, latent-variable, and adversarial approaches for symbolic music generation.

[LG-5] Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

Link: https://arxiv.org/abs/2606.13589
Authors: Meher Sai Preetam, Meher Bhaskar
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, 3 tables

Abstract:We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical “L1-simplex paradox” – the mathematical reality that the L1 norm is constant on the simplex and fails to prune – by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.
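A small sketch of the idea in the abstract above, not the authors' implementation: for any weight vector on the probability simplex the L1 norm equals 1, so it cannot prune, whereas a concave quadratic term is smallest at the vertices of the simplex and so favors sparse weights. The squared-error loss, the penalty form `lam * (1 - ||w||^2)`, the step size, and all function names are assumptions made for illustration; the abstract does not give the paper's exact objective.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]  # last index kept positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def objective(P, y, w, lam):
    """Assumed objective: mean squared error plus a concave quadratic penalty."""
    return np.mean((P @ w - y) ** 2) + lam * (1.0 - w @ w)

def fit_sparse_weights(P, y, lam=0.5, lr=0.02, steps=2000):
    """Projected gradient descent over the simplex (hypothetical helper).

    P: (n_samples, n_estimators) per-estimator positive-class probabilities.
    y: (n_samples,) binary labels in {0, 1}.
    """
    n, m = P.shape
    w = np.full(m, 1.0 / m)                   # uniform voting as the start
    for _ in range(steps):
        grad_loss = 2.0 * P.T @ (P @ w - y) / n
        grad_pen = -2.0 * lam * w             # gradient of lam * (1 - ||w||^2)
        w = project_to_simplex(w - lr * (grad_loss + grad_pen))
    return w
```

The penalty is zero when all weight sits on one estimator and largest under uniform weights, which is the opposite of how an L1 term behaves on the simplex (it stays at 1 everywhere).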

[LG-6] Learning with Simulators: No Regret in a Computationally Bounded World COLT2026

Link: https://arxiv.org/abs/2606.13576
Authors: Sasha Voitovych, Abhishek Shetty, Noah Golowich, Alexander Rakhlin
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments: To appear at COLT 2026

Abstract:Understanding the minimal assumptions necessary for generalization is the fundamental question in learning theory. Unfortunately, most results rely heavily on independence (or some proxy thereof) of the data-generating process, while results for strongly dependent data are far more limited. Towards addressing this gap, we introduce the framework of simulatable processes, where the learner has access to a simulator that approximates the distribution generating the data (which may be an arbitrarily complex and dependent process). Surprisingly, given access to such a simulator, we show that we can recover the same learning guarantees as in the classical setting with independent data, namely, error bounds that depend on the VC dimension. Further, we use this framework to study the power of conditional sampling and show strict statistical and computational advantages in this setting. As a highlight of our framework, we exhibit a single algorithm that simultaneously learns any given VC class under all processes samplable in bounded polynomial time, with regret controlled by the time-bounded Kolmogorov complexity of the process. This provides a significant conceptual broadening of the classical PAC model.

[LG-7] Adjusted Cup-Product Neural Layer

Link: https://arxiv.org/abs/2606.13568
Authors: Snigdha Chandan Khilar
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph)
Comments:

Abstract:Many important observables in physics and geometry are cup products of cochains. This paper introduces the adjusted cup-product neural layer, a neural primitive that hard-wires the cup product together with an adjustment term from higher gauge theory, yielding a readout that is gauge invariant by design. The main theoretical result shows that on a closed cycle the output depends entirely on the adjustment coefficient: setting this coefficient to zero removes the output completely, regardless of the other parameters, so the adjustment is the only source of gauge-invariant signal. The paper also proves that this observable is a nonzero quadratic form and is exactly invariant under one- and two-gauge transformations.

[LG-8] A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding

Link: https://arxiv.org/abs/2606.13565
Authors: Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Discrete diffusion models offer a simple and stable likelihood-based framework for sequence generation, recently extended to any-length settings via token insertion. Principled reward-guided fine-tuning for any-length discrete diffusion, however, remains largely unexplored. We introduce Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding (A2D2), a unified framework for reward-guided fine-tuning of any-length discrete diffusion models via joint optimization of the insertion and unmasking policies together with a quality-based inference schedule. We derive the Radon-Nikodym derivative for the joint insertion-unmasking path measures, enabling theoretically guaranteed convergence to the intractable reward-tilted sequence distribution without requiring target samples. Building on this, we establish unmasking and insertion quality as tractable approaches for minimizing decoding error and introduce the Adaptive Joint Decoding (AJD) loss, which provably yields the optimal path measure that generates the reward-tilted distribution. Empirically, A2D2 improves reward optimization while enhancing generation flexibility and accuracy over prior fixed-length fine-tuning and inference-time guidance methods.

[LG-9] NetCause: Counterfactual Learning for Root Cause Analysis in Large-Scale Networks

Link: https://arxiv.org/abs/2606.13543
Authors: Fabien Chraim, Jian Zhang, Dominik Janzing, Xiang Song, Christos Faloutsos, John Evans
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures

Abstract:Can a learned model capture how faults propagate through a large-scale network and use this knowledge to causally attribute customer impact to its underlying root cause? Existing root cause analysis techniques often rely on static rules, correlation heuristics, or topology-local reasoning, which struggle to generalize in dynamic environments where faults propagate across complex physical and logical dependencies. We present NetCause, a self-supervised learning-based framework that models network incidents as graph-temporal processes and uses counterfactual simulation to rank candidate root causes. This approach produces an interpretable ranking of root cause hypotheses and integrates naturally with operator-defined mitigation and remediation actions. We train the model on over 1,500 incidents collected over six months from a leading cloud provider’s production network and evaluate it on 31 expert-labeled incidents. NetCause consistently improves root cause ranking quality in the regime most relevant to operational decision-making, achieving a 16.1% accuracy improvement over a rule-based heuristic baseline. While training is computationally intensive, inference is lightweight, requiring only seconds of GPU runtime per incident (well below typical telemetry collection latencies).

[LG-10] Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks

Link: https://arxiv.org/abs/2606.13532
Authors: Fabien Chraim, Dominik Janzing, John Evans
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures

Abstract:Cloud-computing relies on large-scale networks which are inherently complex systems. In this paper, we present a novel approach to root cause analysis (RCA) of cloud network incidents, leveraging graph-based causal discovery techniques. Our method addresses the limitations of rule-based automation by introducing a spatiotemporal grouping strategy and an automation ontology to reduce the dimensionality of the problem. We construct a causal graph from binary time series data using bivariate Granger causality and conditional independence tests. For inference, we introduce a probabilistic method that assigns edge-specific conditional probabilities as a function of time lag, allowing for interpretable, time-aware root cause scoring via causal graph traversal. We evaluated the system using a labeled dataset of 35 production incidents from a major cloud provider. The model successfully recalled the correct root cause in 85.7% of incidents and produced an exact match in 74.3%. In production, the deployed system has been used in over 800 real-world incidents, with positive qualitative feedback from network engineers. These results highlight the practicality of a data-driven, causal approach to RCA in dynamic and large-scale operational environments.

[LG-11] GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Link: https://arxiv.org/abs/2606.13501
Authors: Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang, Han Zhao, Chen Chen, Yu Feng, Jingwen Leng, Minyi Guo
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments:

Abstract:Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups. We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01×, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 μs.

[LG-12] Uncertainty Estimation for Molecular Diffusion Models

Link: https://arxiv.org/abs/2606.13451
Authors: Paul Seij, Christian A. Naesseth, Stephan Mandt, Metod Jazbec
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Diffusion models have seen wide adoption for 3D molecular generation, yet they offer no principled signal of when a generated molecule is likely to be of low quality. We propose a post-hoc method for estimating per-sample uncertainty in pretrained molecular diffusion models. Building on a Laplace approximation of the denoising network, we measure the variability of the noise prediction across the generation trajectory. Empirically, we show that the resulting uncertainty score is informative of sample quality, exhibiting a negative correlation with established sample-level quality metrics. We further study how the proposed uncertainty score can be used to filter generated samples, improving model performance via test-time scaling.

[LG-13] Clustering Node Attributed Networks with Graph Neural Networks and Self Learning

Link: https://arxiv.org/abs/2606.13444
Authors: Rodrigo de Sapienza Luna, Daniel Ratton Figueiredo
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Graph clustering - partitioning the node set of a graph into disjoint subsets that reflect some latent information - is a fundamental problem with applications in a myriad of different scenarios. While this classic problem has been tackled for decades by different communities, a recent variation of the problem driven by real data considers the scenario where nodes have attributes that are also informative. This has triggered novel methods that simultaneously leverage network information (edges) and node information (attributes) in the design of clustering algorithms. This work proposes a novel framework that builds on prior works that have applied graph neural networks (GNN) to graph clustering. The proposed framework operates in rounds of self learning in a fully unsupervised setting. In each round, a GNN generates representations for nodes that are used to cluster the nodes. This clustering influences the graph used to generate the node representations in the next round. Moreover, a context graph built in each round using the original graph is used to generate the node representations. Empirical results show that the proposed methodology extracts information from both network edges and node attributes in synthetic data, outperforming algorithms focused solely on the network or the attributes when neither is very informative. Multiple rounds of learning also improve performance and always outperform a long single round of training (i.e., classic GNN graph clustering). On real datasets, empirical results indicate that the proposed methodology is competitive with state-of-the-art methods when cluster sizes are balanced.

[LG-14] How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators

Link: https://arxiv.org/abs/2606.13443
Authors: Jihyeon Hur, Yongseok Kwon, Min-Gi Jo, Jeongwhan Choi, Noseong Park
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Neural operators have emerged as a powerful data-driven approach for solving time-dependent PDEs. Among recent advances, memory-augmented neural operators explicitly incorporate past states and have achieved remarkable performance under low-resolution observation settings. However, existing approaches apply a fixed memory weight regardless of observation conditions, such as resolution or physical parameters, limiting their adaptability. Our preliminary experiments reveal that optimal memory weight varies with resolution and viscosity, implying that a fixed memory weight cannot simultaneously optimize performance across diverse settings. We propose AMGFNO, which dynamically modulates memory weight through a learnable gate. On the Kuramoto-Sivashinsky and Burgers’ equations, AMGFNO achieves 55-79% nRMSE reduction over at low resolution, with the learned gate value automatically decreasing from ḡ ≈ 0.7 to near-zero as resolution increases.

[LG-15] Accelerating Speculative Diffusions via Block Verification

Link: https://arxiv.org/abs/2606.13426
Authors: Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions – which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
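For readers unfamiliar with the acceptance-rejection scheme the abstract starts from, here is a minimal sketch of one step of standard speculative sampling in the discrete (LLM token) case, where drawing from the residual distribution is straightforward. It does not show the paper's continuous-space method or block verification; the function name and the toy distributions are illustrative assumptions.

```python
import numpy as np

def speculative_step(p_target, q_draft, rng):
    """One accept/reject step over a discrete vocabulary.

    Returns (token, accepted). The returned token is distributed
    according to p_target, whatever the draft distribution q_draft is.
    """
    x = rng.choice(len(q_draft), p=q_draft)            # draft proposal
    accept_prob = min(1.0, p_target[x] / q_draft[x])
    if rng.random() < accept_prob:
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False
```

In a discrete space the residual is just a renormalized vector, which is why this step is cheap for tokens; the abstract's point is that the analogous residual in continuous space is much harder to sample.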

[LG-16] Hölder: Improving the Quality-Coherence Trade-off in Multimodal VAEs ICML2026

Link: https://arxiv.org/abs/2606.13381
Authors: Huyen Vo, María Martínez-García, Isabel Valera
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ICML 2026. Camera-ready version

Abstract:Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence, i.e., they struggle to generate realistic and diverse samples that, at the same time, are semantically consistent across modalities. A recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.

[LG-17] Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition INTERSPEECH2026

Link: https://arxiv.org/abs/2606.13379
Authors: Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Comments: Accepted at Interspeech 2026

Abstract:Memristors offer a new opportunity for resource-efficient computation of neural models for natural language processing by enabling analog execution of vector-matrix multiplication. Yet computations on these devices are currently subject to larger distortion, both in weight programming and execution. In this work, we identify large output values of transformed positional encodings as a cause of major degradation within analog-to-digital conversion (ADC) as part of memristor-based computation. By adjusting the proportion of weight and precision bits of the ADC of specific memristor layers, we reduce the degradation of the execution by ~50% relative, while keeping the estimated energy consumption stable. Additionally, we investigate scenarios where the ADC cannot be modified. In that case, the degradation can be reduced by ~30% relative after removing encoding-related linear transformations.

[LG-18] Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling

Link: https://arxiv.org/abs/2606.13347
Authors: Jagriti Singh, Shekhar Verma, Muneendra Ojha
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Diffusion models have emerged as state-of-the-art generative models for high-fidelity image synthesis, particularly in their classifier-free guided and classifier-guided forms. However, standard classifier guidance concentrates probability mass around high-density class means, leading to poor coverage of rare samples in the tails of the class-conditional distributions. Recent work on diffusion-based tail sampling mitigates this by training an additional low-density-seeking classifier with a synthetic-vs-real discriminator, at the cost of additional networks and training. In parallel, a number of samplers and distillation techniques accelerate or refine diffusion sampling, but do not explicitly address long-tail coverage. We propose a purely sampling-time, density-aware extension of classifier-guided conditional diffusion models that targets low-density regions without any additional training. Unlike most diffusion models, which apply guidance to the predicted noise, we apply guidance to the noisy images. Starting from a pretrained conditional diffusion model and classifier on ImageNet, we modify the guided reverse dynamics by steering trajectories toward low-confidence regions via a modified classifier gradient, and at each time step we also guide the sampling process toward the predicted real image. The first guidance helps explore low-probability samples, and the second keeps generated samples close to the real data manifold. The proposed sampler consistently improves the recall of the ADM model at 64x64 resolution while maintaining a comparable FID; with a 256x256 ADM model, we show results visually for different combinations of the two guidance terms. We also show that standard ADM classifier guidance, combined with predicted-real-image guidance, helps generate samples of high perceptual quality with a 256x256 ADM model on ImageNet.

[LG-19] Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

Link: https://arxiv.org/abs/2606.13338
Authors: Kaijie Xu, Anqi Wang, Xilin Dai
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks either lack temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.

[LG-20] Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score ICML2026

Link: https://arxiv.org/abs/2606.13300
Authors: Mariya Pavlova, Harrison Bo Hua Zhu, Elizsveta Semenova, Yingzhen Li
Subjects: Machine Learning (cs.LG)
Comments: ICML 2026, Workshop on Forecasting as a New Frontier of Intelligence

Abstract:We introduce the Trajectory-based Quantization Sensitivity Score (TQS), a metric that reframes post-training quantization (PTQ) through the lens of dynamical-systems stability. By modeling the network’s rollout as a discrete-time dynamical system, TQS characterizes how quantization-induced errors propagate and amplify over the rollout horizon. Unlike conventional PTQ methods, where sensitivity analysis is often coupled to the quantization procedure, TQS enables a priori sensitivity estimation decoupled from quantizer selection and bit-width assignment. This separation allows for quantization budget planning even for black-box or compiled networks with fused operators. Building on this, we present TQS-PTQ, a flexible mixed-precision framework that requires no calibration data or costly second-order approximations. Our experiments show that a dynamical-systems perspective provides a robust, high-performing pathway for low-precision deployment in resource-constrained settings.

[LG-21] Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

Link: https://arxiv.org/abs/2606.13287
Authors: Samuel Erickson, Mikael Johansson
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
Comments:

Abstract:In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is nonetheless affected negatively by slow workers due to large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping “stabilizes” training. In this work, we provide a theoretical justification for this behavior, showing that clipping removes the dependence on the maximum delay from the oracle complexity. We employ a sub-Weibull model of gradient noise which generalizes sub-Gaussian and sub-exponential distributions to more heavy-tailed distributions, motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
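To make the two ingredients in this abstract concrete, the sketch below shows a norm-clipping operator and a toy, single-process simulation of delayed (stale) gradient updates. It is only an illustration: the deterministic delay schedule, the absence of gradient noise, and the function names are assumptions, and nothing here reproduces the paper's algorithm or analysis.

```python
import numpy as np

def clip(g, c):
    """Rescale g so that its Euclidean norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

def delayed_sgd(grad_fn, x0, lr, delays, clip_level=None):
    """Toy ASGD: step t uses a gradient computed at the iterate from
    delays[t] steps earlier, mimicking an update from a slow worker."""
    iterates = [np.asarray(x0, dtype=float)]
    for t, d in enumerate(delays):
        stale = iterates[max(0, t - d)]       # stale model copy
        g = grad_fn(stale)
        if clip_level is not None:
            g = clip(g, clip_level)           # bound the size of each update
        iterates.append(iterates[-1] - lr * g)
    return iterates[-1]
```

With clipping, each update moves the iterate by at most `lr * clip_level`, so a very stale gradient cannot push the model arbitrarily far, which gives some intuition for the stabilizing effect the abstract mentions.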

[LG-22] Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

Link: https://arxiv.org/abs/2606.13260
Authors: Paolo Muratore, Mackenzie Weygandt Mathis
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Identifying latent dynamical systems from noisy, high-dimensional measurements is a central problem at the intersection of representation learning, system identification, and scientific discovery. We present DYSCO, a multi-view temporal contrastive learning algorithm that jointly recovers latent trajectories and the governing dynamics from such observations, by leveraging multiple independent noisy views of the same underlying process to disentangle signal from noise. By parameterizing the dynamics in a structured functional basis, our framework further enables symbolic recovery of the governing equations within an affine gauge. We offer theoretical guarantees for strong identification up to an affine indeterminacy, extending prior identifiability results to the realistic setting of noisy nonlinear observations. Empirically, we demonstrate accurate recovery of both latent trajectories and flow fields across a diverse set of dynamical regimes (e.g., chaotic, oscillatory, and metastable) under both Gaussian and Poisson observation noise, the latter being particularly relevant for neural recordings.

[LG-23] To GAN or Not To GAN: Segmentation Analysis on Mars DEM

Link: https://arxiv.org/abs/2606.13252
Authors: Douglas Dziedzorm Agbeve, Aditya V. Handrale, Salim Fares, Seif E. Idani
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Understanding the Martian surface well enough for rovers to navigate it with ease requires determining the location of mounds. Detecting and studying these morphologies can also help find evidence of extraterrestrial life, more specifically water or signs of life-conducive environments. Mound detection has so far been done by manually mapping morphological parameters onto Digital Elevation Models. This paper addresses the problem by automatically detecting and/or predicting mounds on Mars using neural-network-based semantic segmentation, with both a supervised semantic segmentation model and a generative adversarial approach. A comparison of the approaches shows that adding extra artificially generated data did not improve the result.

[LG-24] From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

Link: https://arxiv.org/abs/2606.13221
Authors: Bora Kargi, David Salinas
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors - such as position bias, self-preference, or intransitivity - that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge’s own score differences by propagating calibrated win probabilities rather than hard labels into the Bradley-Terry procedure. This alone provides a drastic improvement to Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotation. To facilitate reproducibility, we release our code at this https URL.
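The global-level step described above relies on split conformal prediction. As a rough sketch of that generic technique (not the authors' released code), the snippet below turns calibration residuals between predicted and reference ratings into a symmetric interval half-width. The absolute-residual score, the function names, and the variable names are assumptions for illustration.

```python
import numpy as np

def split_conformal_halfwidth(cal_pred, cal_true, alpha=0.1):
    """Half-width q such that [pred - q, pred + q] covers a new exchangeable
    point with probability at least 1 - alpha (marginally)."""
    scores = np.abs(np.asarray(cal_true, dtype=float) - np.asarray(cal_pred, dtype=float))
    n = scores.size
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf                 # too few calibration points for this alpha
    return np.sort(scores)[k - 1]

def predict_interval(pred, halfwidth):
    """Symmetric prediction interval around a point estimate."""
    return pred - halfwidth, pred + halfwidth
```

In the setting of the abstract, `cal_pred` would play the role of LLM-derived Elo ratings and `cal_true` the human-derived ones on held-out calibration models; that mapping is my reading of the abstract rather than a detail it spells out.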

[LG-25] WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

Link: https://arxiv.org/abs/2606.13194
Authors: Maximilian Burzer, Tobias King, Till Riedel, Michael Beigl, Tobias Röddiger
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 9 Figures, 3 Tables

Abstract:Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.

[LG-26] The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

Link: https://arxiv.org/abs/2606.13191
Authors: Ryosuke Sakamoto,Kotaro Sakamoto
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Continuous-state generative samplers, including diffusion and flow-matching models, evolve through continuous reverse-time dynamics, yet their samples often undergo abrupt qualitative changes: trajectories commit to modes, semantic alternatives collapse, and small perturbations in narrow time windows can produce large downstream effects. This paper develops a geometric account of such phase-transition-like behaviour. We view denoising as gradient descent on a free energy landscape and show that sharp transitions arise near projection caustics, where the nearest-point projection onto the data support ceases to be unique. Motivated by this perspective, we introduce the Critical Boundary Detector (CBD) as a practical diagnostic for score-direction instability. Across toy models, standard diffusion models, and latent text-to-image diffusion models, CBD localises mode commitment, predicts intervention-sensitive windows, and supports targeted control in geometrically sensitive regions. Our results connect the geometry of data with the dynamics of diffusion generation.

[LG-27] Loss-Shift Transfer via Bayes Quotients

Link: https://arxiv.org/abs/2606.13178
Authors: Vasileios Sevetlidis
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Transfer learning is usually studied as a consequence of distribution shift. This paper identifies an orthogonal failure mode in which the data distribution is fixed and the loss changes. This setting is called loss shift. A loss determines which information in $X$ is Bayes-relevant, and two losses may therefore require different representations even under the same joint law $P(X,Y)$. The idea is formalized using Bayes quotients, which allow losses to be ordered by refinement. In the Bayes-quotient formulation, strict refinement gives an immediate qualitative obstruction. A source-minimal representation for a coarser loss is insufficient for a strictly finer target loss. For finite-output log loss, this obstruction becomes an exact quantitative identity. The excess risk is the conditional information about $Y$ discarded by the representation. Experiments in controlled, learned, synthetic-image, and real-image settings show the predicted effect, i.e., classification-equivalent representations can have different optimal log-loss performance under a fixed data distribution.

[LG-28] Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance

Link: https://arxiv.org/abs/2606.13172
Authors: Jacques Raynal,Pierre Slangen,Elsa Raynal,Jacques Margerit
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 1 figure. Conceptual framework for representation diagnostics in machine learning

Click to view abstract

Abstract:Learned representations are central to modern machine learning and are commonly evaluated through predictive performance, robustness, uncertainty estimation, or generalization. However, a learned representation may remain operationally successful while progressively failing to organize persistent residual structures that are not fully captured by conventional evaluation metrics. This article introduces VER, the Vigilant Evaluator of Representations, a conceptual framework for monitoring representational adequacy in learned representations. VER does not propose a new learning algorithm, loss function, or model architecture. Instead, it formalizes a diagnostic process through which persistent residual structures may be identified, analyzed, and interpreted as potential indicators of explanatory insufficiency. The framework distinguishes representational inadequacy from ordinary prediction error, uncertainty, noise, and distribution shift. It introduces a monitoring sequence based on representation identification, explanatory-domain delimitation, residual-structure detection, explanatory-resistance evaluation, and vigilance signaling. VER is intended as a contribution to representation diagnostics in machine learning. Its objective is not to replace existing evaluation methods but to complement them by treating representational adequacy as an explicit object of inquiry. A path toward empirical evaluation through representational-vigilance benchmarks is also outlined.

[LG-29] When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

Link: https://arxiv.org/abs/2606.13168
Authors: Aydin Javadov
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Block Attention Residuals (Block AttnRes) replace fixed additive residuals with a learned softmax over earlier depth-source representations, surfacing cross-layer routing as an inspectable tensor in the forward pass. This is a tempting interpretability target: information flow normally inferred indirectly is now directly observable. We ask whether such exposure suffices for mechanistic interpretation. We probe two same-scale (0.6B) Block AttnRes checkpoints under identical routing-ablation interventions: a vanilla Qwen3 inference-wrapped through a deterministic recency-bias schedule that the codebase admits as a routing-equivalent loading path, and a Block AttnRes Qwen3 trained from scratch with routing as part of optimisation. The wrapped baseline’s routing weights are content-independent and reproduce the schedule’s analytic prediction. The trained AttnRes checkpoint instead exhibits three localised routing motifs: an embedding-source pathway through early-layer MLP, a current-state pathway through early-layer attention and MLP, and an older-history pathway through late-layer attention. Beyond this stratification, we find a sharp dissociation between average routing mass and causal importance: in both sublayers, the largest mass slice is not the largest causal contribution, and one source family carries appreciable mass with no detectable causal role under intervention. Architectural exposure of routing is therefore necessary but not sufficient for mechanistic interpretation: structured depth routing emerges only when routing has been part of training, and even then, descriptive routing summaries should be treated as candidate hypotheses to be tested by causal interventions, not as evidence of mechanism in their own right.

[LG-30] Learning-Augmented Approximation for Unrelated-Machines Makespan Scheduling

Link: https://arxiv.org/abs/2606.13133
Authors: Kaito Baba,Evripidis Bampis,Giorgos Mitropoulos
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 22 pages, 3 figures

Click to view abstract

Abstract:Recently, Antoniadis et al. (ICLR 2025) proposed a framework for incorporating predictions to approximate NP-hard selection problems. Despite its simplicity, this approach tightly matches theoretical lower bounds, making its generalization highly compelling. We address an open question raised in the work of Antoniadis et al., concerning the extension of this approach to other important problems outside the class of selection problems, such as scheduling. We develop a learning-augmented algorithm for the makespan minimization problem on unrelated machines, denoted by $R||C_{\max}$. By using predictions of heavy job assignments, we achieve a polynomial-time $(1+\varepsilon)$-approximation for accurate predictions that smoothly degrades to a worst-case 2-approximation as the error increases. We conclude our work with an empirical analysis of our method.

[LG-31] Disparate Impact in Synthetic Data Generation

Link: https://arxiv.org/abs/2606.13105
Authors: Paul Andrey,Michaël Perrot,Batiste Le Bars,Marc Tommasi
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We revisit the fairness notion of disparate impact for synthetic data generation (SDG), which assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, which addresses the problem of correcting for undue biases in the observed distribution, hence redefining SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. We notably look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.

[LG-32] Authority Truth and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models ICML2026

Link: https://arxiv.org/abs/2606.13104
Authors: Aryan Khurana,Aravind Ramana RN,Dhruv Kumar
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 5 figures. Accepted to AI4GOOD and EIML at ICML 2026

Click to view abstract

Abstract:Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: this https URL

[LG-33] Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models

Link: https://arxiv.org/abs/2606.13092
Authors: Hongbo Wang
Categories: Machine Learning (cs.LG); Robotics (cs.RO); Dynamical Systems (math.DS)
Comments: 23 pages (9 main + appendices). Code: this https URL

Click to view abstract

Abstract:Scale buys interpolation; structure buys a certified horizon. A world model’s average error says nothing about whether a particular prediction can be trusted, or for how long. For equivariant latent world models we give a computable, multi-step certificate of the predictable horizon: $T$-step rollout error is provably constant over each symmetry orbit (Theorem A) and stratified channel-by-channel by the predictor’s Lyapunov spectrum, $T_j(\epsilon) \sim \log(1/\epsilon)/\lambda_j$. The horizon is two-sided – a matching lower bound makes approximate equivariance provably horizon-limited – and the certificate is exclusive to structure: orbit-constant error characterizes equivariance, so no non-equivariant model has it at any scale. Empirically, on 40-D Lorenz-96 only a $\mathbb{Z}_N$-equivariant network recovers the full Lyapunov spectrum ($R^2 = 0.98$); dense and recurrent baselines fail. Because the spectrum is faithful, the certificate acts, a priori: under a fixed sensing budget a $c\times$-inflated certificate provably needs $c\times$ the budget, and the equivariant certificate meets a budget its inflated dense counterpart cannot – with zero calibration data. The same read-out, unchanged, audits public pretrained world models training-free: TD-MPC2 checkpoints land on the certificate’s own scope taxonomy – calibrated where strongly expansive (ratio 0.94-1.02), optimistic where weakly expansive, correctly abstaining where contracting – a map a deployed monitor replicates cell-by-cell, out-of-sample. Across the official 1M-317M multitask ladder, calibration does not improve with parameters. On V-JEPA 2-AC (1B, real robot data) the measured cross-check correctly overrides an over-promising tangent spectrum – the cross-validated audit, not the raw number, is the deployable object. Scale buys interpolation, not a calibrated horizon.
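
The per-channel horizon scaling quoted in this abstract, $T_j(\epsilon) \sim \log(1/\epsilon)/\lambda_j$, can be evaluated numerically. The sketch below is only a leading-order illustration, not the paper's certificate: the function name is hypothetical, constant factors hidden by the asymptotic relation are ignored, and returning `None` for non-expanding channels is an assumption made here.

```python
import math


def predictable_horizons(lyapunov_exponents, tolerance):
    """Leading-order horizon log(1/tolerance)/lambda_j for each channel.

    Channels with lambda_j <= 0 get None: this formula yields no finite
    estimate for them (an assumption of this sketch).
    """
    horizons = []
    for lam in lyapunov_exponents:
        if lam > 0:
            horizons.append(math.log(1.0 / tolerance) / lam)
        else:
            horizons.append(None)
    return horizons


# Hypothetical spectrum: one expanding, one neutral, one contracting channel.
spectrum = [math.log(10.0), 0.0, -0.5]
print(predictable_horizons(spectrum, tolerance=0.01))
```

Larger exponents give shorter horizons, so the most expansive channel limits how far ahead a rollout can be trusted at a given tolerance.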

[LG-34] Limits of spectral learning under noise

Link: https://arxiv.org/abs/2606.13067
Authors: Sabin Roman,Ljupco Todorovski,Saso Dzeroski,Marta Sales-Pardo,Roger Guimera
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Learning functional relationships from noisy data is a central problem in scientific inference. Spectral methods approximate unknown functions by expanding them in a basis and estimating the corresponding coefficients from data, but the stability of these coefficients under noise remains poorly understood. Here we study supervised regression with additive label noise using sparse spectral representations across multiple bases and dimensions. We show that noise induces a predictable drift in the learned coefficient vector whose magnitude depends on the effective number of active spectral modes. After whitening the empirical feature geometry, we derive a closed-form expression for the overlap between noisy and noiseless coefficient vectors, revealing a universal degradation curve governed by a single intrinsic noise scale. Numerical experiments across Fourier, Legendre, Bessel, and Haar bases confirm the theoretical prediction. The results demonstrate that spectral learning exhibits a fundamental noise threshold beyond which coefficient estimates become unstable, placing intrinsic limits on recovering functional structure from noisy data.

[LG-35] A green solvent screening tool for emerging materials via uncertainty aware transformer enhanced transfer learning

Link: https://arxiv.org/abs/2606.13060
Authors: Ioannis Kouroudis,Simon Ternes,Zhaosu Gu,Gohar Ali Siddiqui,Marina Ustinova,Angelo Lembo,Alessio Gagliardi,Aldo Di Carlo
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Accurate prediction of solubility remains a central challenge across materials science and sustainable chemistry. In particular, due to emerging technologies like organic and hybrid photovoltaics, batteries, and catalysis, solvent usage is expected to increase significantly within the coming years. Therefore, substituting solvents with greener alternatives is vital. This is where machine learning can have substantial impact. However, the limited data on critical parameters of solubility significantly constrains machine learning efficacy. In this work, we transfer a pre-trained foundational model on QM9 targets to our application with minimal data requirements. Additionally, the pipeline integrates uncertainty quantification, allowing the user to gauge the confidence of the predictions. As a baseline, we succeed in predicting the Hansen solubility parameters and the dielectric constant, for which extensive databases exist. Importantly, we achieve high model performance on additional targets, such as Gutmann donor and acceptor numbers, where the available data is extremely limited. Overall, we augment data on solubility descriptors by orders of magnitude with high-quality predictions. For effective dissemination, we deploy an easy-to-use, customizable tool, easily integrable with high-throughput labs, for ranking and screening possible solvent substitutes. Finally, we rediscover known green solvent alternatives and propose new candidates, demonstrating the tool's relevance for finding eco-friendly solvents.

[LG-36] Reliability of Probabilistic Emulation of Physical Systems

Link: https://arxiv.org/abs/2606.12997
Authors: Sam F. Greenbury(1),Radka Jersakova(1),Paolo Conti(1 and 2),Marjan Famili(1 and 3),Christopher Iliffe Sprague(1 and 4),Edwin Brown(1 and 5),Jason D. McEwen(1 and 6) ((1) The Alan Turing Institute, (2) Autodesk Research, (3) PhysicsX, (4) Orbital, (5) University of Sheffield, (6) University College London)
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient space rather than in a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space, they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.
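
For readers unfamiliar with the two quantities this abstract relies on, the following is a small sketch of an ensemble CRPS estimate and of empirical interval coverage. It is not taken from AutoCast: the function names are hypothetical, and the CRPS estimator shown is the plain empirical form for a scalar target, $\frac{1}{m}\sum_i |x_i - y| - \frac{1}{2m^2}\sum_{i,j} |x_i - x_j|$, which may differ from the variant the authors train with.

```python
def crps_ensemble(members, observation):
    """Plain empirical CRPS of a scalar ensemble against one observation."""
    m = len(members)
    # Average distance between ensemble members and the observation.
    term_obs = sum(abs(x - observation) for x in members) / m
    # Average pairwise distance between members (rewards ensemble spread).
    term_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return term_obs - 0.5 * term_spread


def empirical_coverage(intervals, observations):
    """Fraction of observations that fall inside their predictive interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, observations))
    return hits / len(observations)


# Hypothetical values for illustration only.
print(crps_ensemble([0.0, 2.0], 1.0))
print(empirical_coverage([(0.0, 1.0)] * 4, [0.5, 2.0, 0.1, 1.0]))
```

A forecast is considered reliable in this sense when the empirical coverage of its nominal 90% intervals, for example, is close to 0.9.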

[LG-37] DeepJEB: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation

Link: https://arxiv.org/abs/2606.12994
Authors: Soyoung Yoo,Leekyo Jeong,Jinsu Ra,Dongeon Lee,Sunwoong Yang,Hyogu Jeong,Namwoo Kang
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 16 pages, 14 figures. Submitted to ASME Journal of Mechanical Design

Click to view abstract

Abstract:Data-driven engineering design is constrained by the lack of large-scale 3D datasets that pair geometry with physics-based performance labels. In particular, existing 3D data augmentation techniques have limitations in preserving subtle and diverse geometric variations, and it remains difficult to automate the subsequent simulation-labeling process, where boundary conditions vary depending on the generated geometry. We present DeepJEB++, a foundation-model-driven data-augmentation framework that expands a small seed set of jet engine brackets into a large, simulation-labeled 3D dataset under constrained resources. Our key idea is to augment in the data-rich 2D latent space, then transfer to 3D. In Stage 1, we fine-tune a pretrained 2D latent diffusion model on multi-view renders and synthesize novel views by latent interpolation, retaining manufacturable designs through a vision-language-model (VLM) quality filter. In Stage 2, the validated images are lifted to 3D meshes by a domain-adapted generative foundation model. In Stage 3, an automated pipeline recognizes the load and bolt interfaces on each mesh and assigns finite-element labels – mass, stress, and displacement – without manual intervention. We assess augmentation quality along three intrinsic axes: manufacturability, label fidelity against the SimJEB ground truth, and distributional consistency. Starting from fewer than 400 seed designs, DeepJEB++ yields 15,360 simulation-labeled 3D brackets – a 40x expansion – using a single GPU per stage. The dataset will be made publicly available to support reproducible engineering-AI research.

[LG-38] Exposure Bias as Epistemic Underidentification in Recursive Forecasting ICML2026

Link: https://arxiv.org/abs/2606.12990
Authors: Riku Green,Zahraa S. Abdallah,Telmo M Silva Filho
Categories: Machine Learning (cs.LG)
Comments: Accepted for ICML 2026 EIML workshop

Click to view abstract

Abstract:Recursive multi-step forecasting is usually framed as distribution shift: models are trained on observed histories but deployed on their own predictions. We show this framing is incomplete by proving that, under partial observability or state truncation, recursive rollout is also an epistemic underidentification problem. Even with deterministic latent dynamics, one-step Bayes supervision identifies behavior only on observed contexts and need not identify the deployed recursive predictor once rollout queries self-generated induced states whose correct local targets are not determined by numeric state alone. We formalize this with induced states Z and provenance variables P , and derive a decomposition of induced-state error into teacher-forcing/rollout mismatch, representation–class approximation, and provenance information gaps. Empirically, we show that rollout enters a distinct induced-state regime, that fixed induced states define a distinct local corrective task, and that closed-loop gains arise not only from local adaptation but also from changing the induced states visited during rollout. Using a simple binary provenance encoding, provenance-aware correction can further improve performance, though gains are conditional rather than uniform. These results recast exposure bias as reasoning under self-induced epistemic uncertainty.

[LG-39] EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models

Link: https://arxiv.org/abs/2606.12979
Authors: Vedant Pandya
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 5 figures, 9 tables, 5 code listings. Pre-registered experimental study with mechanism analysis

Click to view abstract

Abstract:JEPA-family world models use a static predictor whose weights do not adapt when test-time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand-side injection, where a compressed experience representation is added as a residual to the predictor’s hidden state (EI-JEPA), and operator-side modulation, where the same representation generates low-rank weight deltas via LoRA applied to the predictor’s weights (EPM-JEPA). On a pre-registered comparison (Moving MNIST, gravity shift), EPM-JEPA ($D_{\mathrm{shift}}^{n=50} = 0.7848 \pm 0.0078$, three seeds) differs from EI-JEPA (0.8238) by $\delta = 4.74\%$. This is Outcome C, a null result, which by our stated criterion is a valid outcome. As a secondary, non-pre-registered observation, EPM-JEPA improves 1.90% over a no-memory baseline (0.8000), consistently across seeds, while EI-JEPA underperforms the baseline, indicating the benefit is specific to weight-level modulation. Our primary contribution is a mechanism analysis: the $D_{\mathrm{shift}}^{n=50}$ trajectory reflects three independent dynamical processes (buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021) rather than convergence to equilibrium. These findings motivate PEM-JEPA, a physics-grounded successor addressing this dynamical-peak limitation.

[LG-40] Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations INTERSPEECH2026

Link: https://arxiv.org/abs/2606.12971
Authors: Tahiya Chowdhury
Categories: Machine Learning (cs.LG)
Comments: Accepted to Interspeech 2026

Click to view abstract

Abstract:Estimating cognitive load from speech has largely been studied in controlled laboratory settings, with limited understanding of its reliability in natural collaborative conversations. We investigate whether speech and interaction dynamics predict perceived cognitive load during dyadic conversations. We analyze audio from 53 dyads performing nine collaborative tasks and extract static acoustic, dynamic, and interaction features to train a two-head Gated Recurrent Unit encoder to predict cognitive load scores. Results show conversational interaction provides useful signals for predicting cognitive load related to time pressure, mental work, effort, and task performance. Temporal demand is associated with turn-taking dynamics such as overlap and speaker switch, while mental demand is linked to imbalanced participation between speakers. These findings highlight the importance of task structure and conversational interaction for modeling cognitive load in natural collaborative settings.

[LG-41] Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers

Link: https://arxiv.org/abs/2606.12966
Authors: Achyuthan Sivasankar
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 16 pages, 6 figures, 10 tables

Click to view abstract

Abstract:Grokking – where a transformer on modular arithmetic suddenly transitions from near-chance to near-perfect validation accuracy – is attributed to a Fourier circuit, but its timing, causal structure, and controllability remain poorly understood. We introduce the Frequency Synchronization Degree (FSD), a normalised, permutation-tested metric for Fourier circuit synchronisation requiring no prior circuit knowledge. Across nine modular addition configurations (primes $p \in \{53, 71, 97, 113, 131\}$, three seeds), FSD synchronises 500-3,000 steps before grokking (mean lead +1,722 steps; all nine positive, sign-test $p \approx 0.004$), and precedes a restricted-logit loss baseline (Nanda et al.'s excluded loss) in all nine cases, making it the earliest available predictor. We provide direct causal evidence that the inter-phase gap is a regularisation phenomenon: forking training at the FSD-ceiling step and varying weight decay $\lambda$ produces strictly monotone earlier grokking, with $\Delta_t \propto 1/\lambda$. This law replicates across three primes ($p \in \{53, 97, 131\}$; $R^2 = 1.00$ and $R^2 = 0.99$ for two clean cases), captured as $\Delta_t \sim C/\lambda$, consistent with $(1/\lambda)\log(\|W_{\mathrm{mem}}\|/\tau)$. Architecture ablations show an attention-only model groks with a strong FSD precursor; an MLP-only model never groks; a single-layer model’s FSD lags, confirming the precursor is a multi-block circuit property.
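
The inverse law between grokking delay and weight decay reported in this abstract (delay roughly equal to a constant $C$ divided by $\lambda$) can be checked on one's own runs with a one-parameter least-squares fit. The sketch below assumes a fit through the origin in the variable $1/\lambda$. The function name and the numbers are hypothetical and are not the paper's measurements.

```python
def fit_inverse_law(weight_decays, delays):
    """Least-squares estimate of C in delay = C / weight_decay.

    Regresses delay on x = 1/weight_decay with no intercept:
    C = sum(x * y) / sum(x * x).
    """
    xs = [1.0 / lam for lam in weight_decays]
    numerator = sum(x * y for x, y in zip(xs, delays))
    denominator = sum(x * x for x in xs)
    return numerator / denominator


# Synthetic delays generated from C = 100, for illustration only.
weight_decays = [0.5, 1.0, 2.0]
delays = [200.0, 100.0, 50.0]
print(fit_inverse_law(weight_decays, delays))
```

Comparing the fitted curve with the measured delays (for example via $R^2$) then indicates how well the inverse law holds for a given configuration.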

[LG-42] Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment ICML2026

Link: https://arxiv.org/abs/2606.12940
Authors: Xiang Li,Yixuan Zhou,Jingran Xie,Zhiyong Wu,Hui Wang
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 20 pages, 9 figures, accepted to ICML 2026, demo website available at this https URL

Click to view abstract

Abstract:Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream language modeling. Our core idea is to align the decoder’s internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. Notably, it enables a 4x codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Multiple statistical observations and visualizations corroborate the enhanced internal manifold alignment in the decoder. Extensive experiments confirm its generality across various inductive biases. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.

[LG-43] Is Spurious Correlation Removal Always Learnable? ICML-2026

Link: https://arxiv.org/abs/2606.12930
Authors: Yibo Zhou,Bo Li,Hai-Miao Hu,Hanzi Wang,Xiaokang Zhang,Ruifan Zhang
Categories: Machine Learning (cs.LG)
Comments: poster paper in ICML-2026

Click to view abstract

Abstract:Invariant learning can fail even when the invariant structure is statistically identifiable. We show a conditional computational barrier: under a black-box samplable supervised sparse recovery primitive motivated by average-case sparse-recovery reductions, there exist samplable multi-environment instances with a one-dimensional predictive invariant subspace ($k=1$) that are learnable with polynomial samples by exhaustive search, while any polynomial-time constant-accuracy recovery algorithm would contradict the primitive. We further quantify environment diversity by a separation parameter $\gamma$, which controls identifiability and the curvature of invariance objectives. Under sufficient diversity and local Gaussian regularity, the minimax risk is $\mathbb{E}[\operatorname{dist}(\hat{V}, V_{\mathrm{inv}})^2] = \Theta(k(d-k)/(n|\mathcal{E}|))$, and under label-induced shifts a phase transition occurs at $n^* \propto k(d-k)/(|\mathcal{E}|\gamma^2)$ with refined estimation error scaling proportional to $1/\gamma^2$. Synthetic and real datasets illustrate the predicted gaps and transitions and motivate simple diversity diagnostics.

[LG-44] Where Computation Lives Inside TabPFN: Causal Localisation of Attention Head Function ICML2026

Link: https://arxiv.org/abs/2606.12917
Authors: Atharva Gupta,Dhruv Kumar,Murari Mandal,Saurabh Deshpande
Categories: Machine Learning (cs.LG)
Comments: Accepted to Workshop FMSD @ ICML 2026

Click to view abstract

Abstract:We present the first causal mechanistic analysis of a tabular foundation model, investigating how TabPFN 2.5’s feature-wise attention heads distribute computation across layers. Using activation patching, ablation, and attention entropy across two synthetic regression datasets, we find clear temporal specialisation: one head’s causal necessity dominates that of the others by 2 to 5 times at the peak layer, with its dominant layer shifting across tasks of different complexity, while the remaining heads exhibit symmetric late-layer profiles. Attention entropy and patching provide convergent evidence for the computationally active layers of the dominant head. We additionally investigate inference-time steerability via contrastive activation steering, which fails to transfer across samples. We attribute this result to TabPFN’s in-context learning mechanism, which encodes task structure through context-dependent attention rather than the stable parametric directions that make steering tractable in language models.

[LG-45] LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

Link: https://arxiv.org/abs/2606.12895
Authors: Xinrui He,Qiyu Kang,Xuhao Li,Zheng-Jun Zha
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Spiking Neural Networks (SNNs) are well-regarded for their biological plausibility and energy efficiency in processing sequential data. However, dominant SNN architectures typically rely on first-order Ordinary Differential Equations (ODEs) to govern neuronal state transitions. This first-order assumption imposes a “memoryless” bottleneck, limiting the model’s capacity to capture the complex, long-range dependencies inherent in long-sequence tasks. In this work, we propose LongSpike, a novel SNN framework that integrates fractional-order State-Space Modeling, or f-SSM, from control theory into the spiking domain. By extending traditional integer-order SSMs to the fractional-calculus regime, LongSpike enables the hierarchical integration of neuronal dynamics with long-memory kernels. To mitigate the computational overhead and parallelization challenges typically associated with fractional operators, we leverage a state-space formulation that supports efficient, parallel training. Empirical evaluations on challenging benchmarks, including Long Range Arena (LRA), large-scale WikiText-103, and Speech Commands, demonstrate that LongSpike outperforms state-of-the-art SNNs in accuracy while preserving sparse synaptic computation. The code is available at this https URL.

[LG-46] SMGFM: Spectral Multimodal Graph Pretraining for Multimodal-Attributed Graphs

Link: https://arxiv.org/abs/2606.12867
Authors: Zhengyu Wu,Xu Wang,Hongchao Qin,Xunkai Li,Guang Zeng,Rong-Hua Li,Guoren Wang
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Multimodal-attributed graphs (MAGs) couple graph topology with node semantics from text, images, and other modalities. Traditional graph learning contextualizes node semantics by coupling topology with node features. However, this coupling design becomes troublesome in MAGs, where structure-induced and modality-intrinsic semantics may contribute differently to downstream tasks. Structure-induced semantics promote relational consistency through smooth topological variation, whereas modality-intrinsic semantics often encode local, fine-grained distinctions that should not be uniformly smoothed or aligned. Therefore, the key challenge is to identify semantic roles before cross-modal fusion. To this end, we leverage graph-frequency variation as a prior, where low-frequency components capture topology-consistent semantics and high-frequency components preserve modality-specific semantics. Based on this intuition, we propose SMGFM, a spectral multimodal graph pretraining framework that decomposes each modality-specific node signal into graph-frequency bands and assigns band-level semantic roles before cross-modal interaction. Concretely, SMGFM constructs frequency-resolved modality tokens with scalable Chebyshev filters, estimates their coupling reliability through topology-conditioned routing, and performs band-modality interaction before fusion. Its frequency-routed objectives align smooth consensus routes while preserving modality-specific routes, mitigating spatial-domain entanglement and uniform cross-modal alignment. Extensive experiments conducted on the MAG datasets demonstrate that SMGFM achieves state-of-the-art performance across graph-level and modality-level tasks.
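
The abstract mentions Chebyshev filters for building frequency-resolved views of a node signal. The sketch below shows only the generic three-term Chebyshev recurrence on a small graph, assuming the symmetric normalized Laplacian with its largest eigenvalue bounded by 2. The function names and the 4-cycle example are hypothetical and do not reproduce SMGFM itself.

```python
import math

def scaled_laplacian(adj):
    """Return L~ = L - I for the symmetric normalized Laplacian L.

    Assumes lambda_max is bounded by 2, so the spectrum of L~ lies in [-1, 1].
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[-adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def matvec(matrix, x):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def chebyshev_terms(adj, x, order):
    """Return [T_0(L~)x, ..., T_order(L~)x] via T_k = 2 L~ T_{k-1} - T_{k-2}."""
    lap = scaled_laplacian(adj)
    terms = [list(x), matvec(lap, x)]
    for _ in range(2, order + 1):
        nxt = matvec(lap, terms[-1])
        terms.append([2.0 * a - b for a, b in zip(nxt, terms[-2])])
    return terms[: order + 1]

# Hypothetical example: a 4-cycle with a smooth and an oscillating signal.
cycle = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
smooth = chebyshev_terms(cycle, [1.0, 1.0, 1.0, 1.0], order=2)
rough = chebyshev_terms(cycle, [1.0, -1.0, 1.0, -1.0], order=2)
```

A learned filter would combine these terms with per-order coefficients. The recurrence needs only sparse matrix-vector products, which is why such filters scale without an eigendecomposition.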

[LG-47] Multimodal Graph Negative Learning

Link: https://arxiv.org/abs/2606.12863
Authors: Zhengyu Wu,Xu Wang,Hongchao Qin,Xunkai Li,Guang Zeng,Rong-Hua Li,Guoren Wang
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Multimodal attributed graphs (MAGs) integrate graph topology with heterogeneous modality attributes, such as text and images, thereby enabling richer modeling of complex relational systems. However, such expressiveness also makes learning on MAGs depend on multiple semantic sources, including structural topology, textual and visual attributes, each of which can be regarded as a branch for node representation. Node-level branch semantic imbalance arises when these branches differ across nodes in semantic informativeness and reliability: a branch that provides discriminative semantics for one node may mislead another due to bias in modality quality or structural context. Existing methods often mitigate such heterogeneity through cross-branch agreement or alignment, implicitly treating the dominant prediction as reliable supervision. When the dominant branch is biased, forced imitation may propagate its bias to other branches and suppress original semantics that are useful for classification. We propose GraphMNL, a graph-aware multimodal negative learning framework that addresses this issue by using Negative Learning as cross-branch guidance. Instead of forcing inferior branches to imitate a teacher prediction, the model teaches them which classes a node is unlikely to belong to. GraphMNL builds a branch library, identifies dominant and inferior branches via graph-aware reliability arbitration, gates unstable transfer, and applies target-preserving negative learning over non-target classes. This design decouples target supervision from branch guidance so that supervised losses learn the correct class, while Negative Learning suppresses unlikely alternatives when branch agreement is unreliable. In a comprehensive experimental evaluation, GraphMNL achieves the best performance, with 72.47% accuracy on the Grocery datasets and a 76.60 F1 score on the Reddit M datasets.
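
To make the idea of negative learning over non-target classes concrete, here is a minimal sketch of the generic negative-learning loss, which penalises probability mass on classes a node is believed not to belong to. It assumes the common form `-log(1 - p_k)`. The function names are hypothetical, and GraphMNL's branch arbitration and gating are not modelled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def negative_learning_loss(logits, negative_classes):
    """Mean of -log(1 - p_k) over classes k the node should NOT belong to."""
    probs = softmax(logits)
    eps = 1e-12  # guards against log(0) when p_k is numerically 1
    losses = [-math.log(max(1.0 - probs[k], eps)) for k in negative_classes]
    return sum(losses) / len(losses)

# Hypothetical 3-class example where classes 1 and 2 are flagged as unlikely.
low_penalty = negative_learning_loss([5.0, 0.0, 0.0], negative_classes=[1, 2])
high_penalty = negative_learning_loss([0.0, 5.0, 0.0], negative_classes=[1, 2])
```

The loss never names the correct class; it only pushes probability away from the listed alternatives. That is what lets it be combined with a separate supervised loss on the target.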

[LG-48] A Privacy-Preserving Framework Using Remote Data Science for Inter-Institutional Student Retention Prediction

Link: https://arxiv.org/abs/2606.12845
Authors: John Fields,K M Sajjadul Islam,Ruchitha Thota,Victor Chen,Praveen Madiraju
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: 7 pages, 2 figures. Accepted at the 2026 IEEE International Conference on Information Reuse and Integration (IEEE IRI 2026)

Click to view abstract

Abstract:This study explores privacy-preserving machine learning (PPML) techniques using the PySyft platform to enable collaborative prediction of student retention between institutions. We developed a remote data science (RDS) framework with a semi-air-gapped architecture consisting of high-side and low-side servers, allowing researchers from three universities to build predictive models on sensitive student data without direct data access. Using historical data from a small private university (N=720), we evaluated three synthetic data generation approaches and validated the framework through inter-institutional collaboration. The results demonstrate consistent classification performance across institutions (Macro F1: 0.690–0.695) while maintaining strict Family Educational Rights and Privacy Act (FERPA) compliance. We also propose Data-Type-Aware Templates, a novel synthetic data method that prioritizes privacy over distributional fidelity. Our findings confirm that RDS-based PPML is technically feasible for educational settings and offers a practical alternative to federated learning for small-scale inter-institutional collaborations. The code is available at this https URL.

[LG-49] Interpretable Factor Decomposition for Decision Intelligence in Large-Scale Financial Markets: Evidence from China’s A-Share Market

Link: https://arxiv.org/abs/2606.12843
Authors: Xiao Han,Yao Xiao,Zhen Zhang,Moxuan Zheng
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*Comments:

Click to view abstract

Abstract:We present an interpretable machine learning pipeline to decompose cross-sectional equity return predictability into auditable factor contributions. We apply an XGBoost model with TreeSHAP attribution and conduct stress testing on 3632 Chinese A-share stocks from 2009 until 2019. Using 60-month rolling windows over 55 months of out-of-sample data, XGBoost obtains a mean AUC of 0.547 and a long-short spread of +2.38%/month (Newey-West t = 5.94; annualized Sharpe 2.23) between the top and bottom quintiles. This alpha persists after adjusting for the Carhart four-factor model (+2.31%/month; t = 7.48). SHAP decomposition indicates that behavioral signals (turnover and momentum) account for 58.2% of predictive attribution compared to 10.7% for valuation ratios, on average, across 55 industry groups. Ablation analysis cross-validates this ranking and shows that SHAP and ablation diverge in a way that exposes feature-substitutability structure largely invisible to either method used in isolation.

[LG-50] CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees ICML2026

Link: https://arxiv.org/abs/2606.12840
Authors: Yixiao Wang,Hayden McTavish,Varun Babbar,Margo Seltzer,Cynthia Rudin
Subjects: Machine Learning (cs.LG)
*Comments: Accepted at ICML 2026

Click to view abstract

Abstract:Regression trees are among the most interpretable yet expressive model classes in machine learning. Historically, greedy induction has been the dominant approach for constructing well-performing regression trees. While optimal methods based on dynamic programming and branch-and-bound exist, they are computationally prohibitive for general linear regression trees, despite often achieving substantially better performance than greedy approaches. Recent work has shown that specialized lookahead strategies can dramatically improve runtime while maintaining near-optimal performance, primarily in classification settings. In this work, we develop a novel algorithm for near-optimal, sparse, piecewise linear regression trees that combines a lookahead-style search strategy with efficient rank-one Cholesky updates of the Gram matrix. We demonstrate, both theoretically and empirically, that our method achieves a favorable trade-off between computational efficiency, predictive accuracy, and sparsity, and scales significantly better than the current state of the art.
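
The rank-one Cholesky update named in this abstract is a standard linear-algebra routine. The sketch below shows the textbook version, which refreshes the factor of a Gram matrix in O(n²) when one sample is added instead of refactorising in O(n³). It is not taken from CLARITree: the function names, the unit ridge term, and the sample values are assumptions for illustration.

```python
import math

def cholesky_rank_one_update(factor, x):
    """Return the lower-triangular factor of (L L^T + x x^T), given L = factor."""
    n = len(x)
    low = [row[:] for row in factor]  # work on copies, leave inputs untouched
    vec = list(x)
    for k in range(n):
        r = math.hypot(low[k][k], vec[k])
        c, s = r / low[k][k], vec[k] / low[k][k]
        low[k][k] = r
        for i in range(k + 1, n):
            low[i][k] = (low[i][k] + s * vec[i]) / c
            vec[i] = c * vec[i] - s * low[i][k]
    return low

def gram_from_factor(low):
    """Rebuild L L^T; used here only to check the update."""
    n = len(low)
    return [[sum(low[i][k] * low[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Start from the identity (a unit ridge term) and add two samples one at a time.
samples = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]]
factor = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
for row in samples:
    factor = cholesky_rank_one_update(factor, row)
```

In a tree search, moving a sample across a candidate split changes each child's Gram matrix by exactly one such rank-one term, which is presumably where updates of this kind pay off.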

[LG-51] Adaptive Weighted Averaging

Link: https://arxiv.org/abs/2606.12763
Authors: Aditya Bhaskara,Ashok Cutkosky,Ravi Kumar,Manish Purohit
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*Comments:

Click to view abstract

Abstract:We study the problem of selecting the largest among $n$ unknown values $x_1,\dots,x_n$ given only a single unbiased estimate $y_i$ for each $x_i$. We design strategies that are simultaneously admissible (not uniformly dominated by any other strategy) and also never worse than a given baseline such as uniform random selection. We provide an application to stochastic optimization, where we obtain online-to-batch conversion bounds with a desirable “no-compromise” guarantee: they are never worse than standard random iterate selection, and yet can be significantly better in benign settings.

[LG-52] Deep Unfolded Latent Optimally Partitioned-l2/l1 Networks for Data-driven Block-Sparse Recovery

Link: https://arxiv.org/abs/2606.12740
Authors: Takanobu Furuhashi,Hidekata Hontani,Qibin Zhao,Tatsuya Yokota
Subjects: Machine Learning (cs.LG)
*Comments: 11 pages, 6 figures

Click to view abstract

Abstract:The convex Latent Optimal Partition (LOP)-l2/l1 approach enables block-sparse signal recovery with unknown partitions but relies on manual hyperparameter tuning. Additionally, numerical instability in differentiating its proximal operator prevents its automatic parameter tuning via Deep Unfolding (DU). To address these limitations, we propose two architectures: a stable framework utilizing implicit differentiation and a flexible variant leveraging Deep Weight Factorization (DWF). The DWF-based approach also supports nonconvex smooth data fidelity terms. Numerical experiments demonstrate that DU-LOP-l2/l1 yields competitive performance and high resilience against impulsive noise.

[LG-53] Physics-Informed Neural Networks and Radial Basis Functions for PDEs with Dirac Delta Sources

Link: https://arxiv.org/abs/2606.12735
Authors: Manuel Reyna,Alexandre Tartakovsky
Subjects: Machine Learning (cs.LG)
*Comments: 33 pages, 4 figures

Click to view abstract

Abstract:Physics-Informed Neural Networks (PINNs) are a machine learning method for solving forward and inverse Partial Differential Equations (PDEs). When applied to PDEs with Dirac delta functions in the forcing terms, boundary conditions, or initial conditions, PINNs require approximating them with smooth surrogate functions, a practice that can introduce significant modeling errors. In this work, we exploit the interpretation of PINNs as Residual Least Squares (RLS) methods and show that this perspective enables direct treatment of Dirac delta terms by integrating the weak-form equation. Among RLS formulations other than PINN, we focus on the Radial Basis Function (RBF) expansion (also known as a single-layer RBF Network). We show that while integrating out the Dirac delta in PINNs causes residuals to fail to converge to zero, RBF-RLS consistently provides good forward and inverse solutions to transport problems. We explain this finding using the Neural Tangent Kernel (NTK) theory. We test both approaches on linear PDEs that represent groundwater flow and transport in porous media and rivers. We solve inverse problems to fit synthetic data, noisy synthetic data, and real-world measurements.

[LG-54] Let’s Ask Gauss: Improved One-Run Privacy Auditing

Link: https://arxiv.org/abs/2606.12733
Authors: Adya Agrawal,Yu Wei,Jaspal Singh,Malik Magdon-Ismail,Vassilis Zikas
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Privacy auditing provides an important safeguard by estimating the actual information leaked by a model, thus ensuring that theoretical privacy guarantees hold in practice. We study empirical privacy auditing for differentially private (DP) machine learning, focusing on efficient one-run methods for mechanisms such as DP-SGD. Prior one-run approaches threshold training examples or “canaries” into binary membership guesses, which discards useful information. We show that, in the white-box DP-SGD setting, canary-aligned signals naturally form a sequence of random variables whose normalized sum is asymptotically Gaussian. Leveraging this distributional perspective, we develop a DP-auditing framework that leads to tighter privacy lower bounds from a single training run.

[LG-55] Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

Link: https://arxiv.org/abs/2606.12731
Authors: Elizaveta Tennant,Benjamin Henke,Anita Keshmirian,Murray Shanahan,Verena Rieser,Kristian Lum,Sydney Levine,Julia Haas
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
*Comments:

Click to view abstract

Abstract:As LLMs increasingly serve in advisory and deliberative roles, users rely on them for non-verifiable reasoning in domains lacking objective ground truths. However, traditional evaluations of LLM reasoning focus almost exclusively on fact-based domains, such as mathematics and science, leaving uncertainty over whether and to what degree models can handle ambiguous, subjective, or value-laden problems over time. To address this concern, we propose moral reasoning as a paradigmatic subdomain of non-verifiable reasoning. We define moral robustness as a model’s capacity to exhibit sound moral reasoning across time and contexts, and we introduce a scalable, adversarial, multi-turn evaluation framework to empirically measure this capability. We simulate 48,000 user-agent moral deliberations across four frontier LLMs, varying premise relevance, premise order, conversation duration, and the user’s stated moral view. We find that models successfully ignore morally-irrelevant distractors, but shift their reasoning by up to 6.5%, on average, towards the user’s stated preferred moral view, and vary their reasoning depending on factors such as order (altering moral judgments by order in 13-22% of the cases) and duration (altering moral judgments between single-turn and multi-turn in 10-24% of the cases). Our analysis indicates that models tailor not just their final verdicts but their underlying justifications to align with a user’s moral viewpoint - a failure mode we characterize as moral deliberative sycophancy.

[LG-56] Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting

Link: https://arxiv.org/abs/2606.12718
Authors: Sudeepta Mondal,Ganesh Sundaramoorthi
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments:

Click to view abstract

Abstract:Radio-frequency (RF) fingerprinting systems must operate in open-world environments where signals from unknown transmitters and temporal drift introduce distribution shift at test time. Out-of-distribution (OOD) detection provides a natural framework for this problem, yet its application to RF fingerprinting (RFF) remains limited. A key barrier to adoption is that most OOD detectors require auxiliary OOD data for parameter tuning, an assumption that is difficult to satisfy in RF environments where representative OOD data is impractical to collect. In this work, we introduce a promising set of OOD detection methods from the machine learning literature to the open-set RFF domain. We present these methods within a unified mathematical framework based on information theory, which is a natural framework for communication systems. Our framework allows for the systematic analysis of methods and the development of new methods. We further demonstrate the applicability of recent work on tuning OOD detectors without given OOD tuning data to open-set RFF. We evaluate on the POWDER RF fingerprinting dataset, showing that detectors tuned without any given OOD data achieve performance comparable to baselines with access to true OOD tuning data and greatly outperform baseline approaches without access to true OOD tuning data, showcasing their practical viability for the RFF problem.

[LG-57] A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling

Link: https://arxiv.org/abs/2606.12710
Authors: Evan Scope Crafts,Umberto Villa,Saviz Mowlavi,Yanting Ma,Hassan Mansour,Wael H. Ali
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Click to view abstract

Abstract:Diffusion models provide expressive data-driven priors for Bayesian inverse problems, but many diffusion posterior samplers rely on heuristic guidance approximations that can fail for nonlinear operators and multimodal posteriors. In this work, we develop a stabilized path-space framework for diffusion-based posterior sampling. Starting from a base diffusion process whose terminal marginal represents the prior, we define a likelihood-weighted target measure on trajectories and cast posterior sampling as learning a controlled stochastic process whose path measure matches this target. This formulation connects diffusion posterior sampling to stochastic optimal control while preserving the Bayesian structure needed for uncertainty quantification. We introduce a time reparameterization that makes the path-space control problem well posed by removing the bias induced by the unknown initial value function, without auxiliary training. We then learn the control via a trust-region path-space optimization method with log-variance objectives. The path-space perspective also unifies our learned control approach with existing guidance-based samplers, quantifies the sampling error induced by approximate controls, and yields importance sampling corrections for asymptotically exact posterior expectations. We evaluate the proposed framework on a suite of benchmark inverse problems with analytically characterized or high-quality reference posteriors, enabling principled assessment of sampling accuracy and uncertainty quantification. These experiments provide insight into the behavior of diffusion-based posterior samplers and demonstrate improved accuracy and robustness over leading approaches.

[LG-58] A unified complexity bound for logconcave sampling

Link: https://arxiv.org/abs/2606.12694
Authors: Yunbum Kook,Santosh S. Vempala
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*Comments: 5 pages

Click to view abstract

Abstract:We give a simple, unified, and nearly tight bound for sampling arbitrary logconcave distributions from a warm start using the In-and-Out algorithm along with exponential lifting. The main new ingredient in the analysis is an improved bound on the Poincaré constant of a lifted distribution. As a consequence, the resulting convergence rate is nearly tight for both constrained settings (e.g., Gaussian restricted to a convex body) and well-conditioned settings (e.g., strongly logconcave and smooth densities).

[LG-59] Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models

Link: https://arxiv.org/abs/2606.12687
Authors: Yunbo Wang,Bolbi Liu
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Marketing mix models are used to forecast business outcomes and to attribute those outcomes to marketing channels, but these goals are not equivalent. We study a failure mode in graph-based neural MMM called attribution bypass: a high-capacity decoder can obtain low forecasting error through target autoregression, dense communication, co-movement, context, or latent memory while failing to route counterfactual sensitivity through the graph used as the attribution object. We introduce DICE-MMM as a bounded diagnostic and training framework. We do not claim that observational neural MMM identifies causal effects. Instead, DICE separates three questions often conflated in graph-based MMM: graph recovery, forecasting accuracy, and whether the trained decoder’s perturbation-induced influence is graph aligned. Stage 1 trains a graph encoder with a restricted graph-mediated decoder. Stage 2 freezes the selected encoder and trains a graph-safe latent decoder whose cross-node communication must pass through the supplied graph. Decoder use is evaluated with CIG, AR-CIG, and graph-swap tests. Across controlled R/d/T swaps and an external multi-graph rawlog stress test, DICE improves stable graph recovery over CausalMMM. The experiments show that forecasting accuracy is not an attribution certificate: in a sparse-target benchmark, no-graph and full-graph decoders achieve MSE@7 around 0.004 while AR-CIG nAUPRC remains near or below zero, whereas an oracle graph reaches 0.807 +/- 0.129 at comparable MSE. Frozen graph-swap localizes the bottleneck: the same DICE-hard-trained decoder moves from nAUPRC -0.044 +/- 0.006 under learned graph inputs to 0.894 +/- 0.027 with the oracle graph. The contribution is a stress test and failure-localization framework showing that low MSE can hide attribution bypass and that the unresolved bottleneck is graph-support selection, not forecasting or decoder capacity.

[LG-60] How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?

Link: https://arxiv.org/abs/2606.12680
Authors: Julia Kostin,Kasra Jalaldoust,Elias Bareinboim,Samory Kpotufe,Fanny Yang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:Machine learning models often degrade when they are deployed on a target distribution that differs from the source distributions they were trained on. Recent work in causality-based domain generalization has shown how shared causal structure between domains can induce invariant predictors, e.g., models on a subset of features which have stable risk across structured domain shifts. However, the extent to which such population-level causal invariances can lead to gains in finite-sample settings remains underexplored. In particular, in practice we often have access to a few labeled target samples, a setting called supervised domain adaptation (sDA). In this paper, we explore when (full or partial) causal knowledge can provably improve supervised domain adaptation. As a first step, we study linear regression, where full or partial causal knowledge specifies a collection of invariant or possibly invariant feature subsets, each yielding a source-trained candidate predictor. We derive matching upper and lower bounds showing that finite-sample gains are governed by the target-risk margins separating the candidates, together with the finite-source estimation error. When these margins are sufficiently large relative to n_Q, an adaptive aggregation procedure can match the best candidate predictor while avoiding negative transfer relative to target-only learning. On the other hand, when the margins are too small, no algorithm can reliably exploit the candidate collection to obtain faster finite-sample rates. We further connect these margins to structural shift magnitude in linear SCMs and validate the theory on real-world causal benchmarks.

[LG-61] Fed-FBD: Federated Functional Block Diversification for Isolation Privacy and Surgical Unlearning

Link: https://arxiv.org/abs/2606.12679
Authors: Weijie Chen,Alan B. McMillan
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Image and Video Processing (eess.IV)
*Comments: 12 pages, 3 figures, 8 tables. Code: this https URL

Click to view abstract

Abstract:Federated learning (FL) enables collaborative model training without sharing raw patient data, but standard approaches such as FedAvg treat each client as a black box and provide no mechanism for isolating an adversarial contributor, auditing per-client influence, or honoring a departed participant’s right to be forgotten. We present Fed-FBD (Federated Functional Block Diversification), a modular federated architecture that decomposes a ResNet backbone into six functional blocks (the stem, four residual groups, and the classification head) and maintains a warehouse of N color variants, each assembled from independently tracked and contributor-stamped blocks. Fed-FBD provides three capabilities absent in FedAvg: (i) architecturally guaranteed block-level isolation, so that an adversarial or mislabelled client cannot contaminate the clean colors; (ii) privacy-by-design, where membership inference advantage is already indistinguishable from chance before any privacy mechanism is applied; and (iii) surgical machine unlearning of a departed participant’s contribution at sub-second cost and without retraining. Experiments on six MedMNIST-2D datasets, PathMNIST at 224x224, and CIFAR-10 show that Fed-FBD trades a modest 0.3%-3.1% IID accuracy gap on the adequately sized datasets for these guarantees, remains within 0.8%-4.0% of FedAvg at Dirichlet alpha=1.0 on three of four datasets, and confines all six adversarial attacks we study to the poisoned client’s own blocks with at most +/-0.01 AUC drift on the clean colors.

[LG-62] Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability

Link: https://arxiv.org/abs/2606.12658
Authors: Riya Bisht,Dhruv Agarwal
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:Physics-Informed Neural Networks (PINNs) are an attractive tool for partial-observation problems in biology, where the governing dynamics are known but some compartments cannot be measured. Chemotherapy pharmacokinetics (PK) is a clean instance: drug concentration in plasma is routinely measured, but concentration in tissue – which determines tumour kill and off-target toxicity – is not. We benchmark a PINN against the standard clinical baseline (nonlinear least-squares on the analytical biexponential plasma solution, hereafter NLS) and a physics-agnostic neural baseline (a data-only MLP) on two PK problems. On the linear two-compartment problem, NLS is near-optimal; the PINN matches it to within a small constant factor while also producing the tissue curve in a single training pass, whereas the data-only MLP fails on tissue by roughly 10x. On a Michaelis-Menten extension (saturable elimination), the biexponential closed form no longer exists, so NLS is mis-specified and silently returns meaningless rate constants. The PINN instead exposes a deeper fact: the Michaelis-Menten two-compartment model is non-identifiable from plasma alone, and the PINN reports this honestly by converging to a basin with k12 → 0. Adding two sparse tissue observations largely resolves identifiability: across five seeds the PINN recovers k21 to within 1% of truth and Vmax, Km to within one standard-deviation bar, while k12 moves in the correct direction (0.02 → 0.82) but remains ~2 sigma below truth – a recovery the closed-form NLS estimator cannot attempt at all, because its biexponential ansatz describes only plasma. Our claim is not that PINNs beat NLS. It is that PINNs offer a uniform recipe that ties the textbook estimator on the textbook problem, exposes structural identifiability that the textbook estimator hides, and absorbs heterogeneous measurements within a single loss.
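
For readers unfamiliar with the linear two-compartment model behind the "biexponential plasma solution", the sketch below integrates the plasma and tissue equations numerically and compares the plasma result with the closed form. It rests on stated assumptions: an intravenous bolus, equal compartment volumes, and an elimination rate named `k10`, a symbol that does not appear in the abstract. All parameter values are made up, and no PINN is involved.

```python
import math

def simulate_two_compartment(c0, k10, k12, k21, dt, steps):
    """Integrate plasma (cp) and tissue (ct) concentrations with classical RK4."""
    def rhs(cp, ct):
        # Plasma loses drug to elimination and tissue; tissue returns it at rate k21.
        return (-(k10 + k12) * cp + k21 * ct, k12 * cp - k21 * ct)

    cp, ct = c0, 0.0
    for _ in range(steps):
        a1, b1 = rhs(cp, ct)
        a2, b2 = rhs(cp + 0.5 * dt * a1, ct + 0.5 * dt * b1)
        a3, b3 = rhs(cp + 0.5 * dt * a2, ct + 0.5 * dt * b2)
        a4, b4 = rhs(cp + dt * a3, ct + dt * b3)
        cp += dt * (a1 + 2 * a2 + 2 * a3 + a4) / 6.0
        ct += dt * (b1 + 2 * b2 + 2 * b3 + b4) / 6.0
    return cp, ct

def biexponential_plasma(c0, k10, k12, k21, t):
    """Closed-form plasma curve A*exp(-alpha*t) + B*exp(-beta*t) after a bolus."""
    total = k10 + k12 + k21
    disc = math.sqrt(total * total - 4.0 * k10 * k21)
    alpha, beta = (total + disc) / 2.0, (total - disc) / 2.0
    a = c0 * (alpha - k21) / (alpha - beta)
    b = c0 * (k21 - beta) / (alpha - beta)
    return a * math.exp(-alpha * t) + b * math.exp(-beta * t)

# Hypothetical parameters: dose concentration 10, rates per hour, 5 hours simulated.
cp_num, ct_num = simulate_two_compartment(10.0, 0.5, 0.3, 0.2, dt=0.01, steps=500)
cp_exact = biexponential_plasma(10.0, 0.5, 0.3, 0.2, t=5.0)
```

The closed form covers plasma only, which mirrors the abstract's point that an estimator built on it says nothing about the tissue curve that the simulation also returns as `ct_num`.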

[LG-63] Physics-Aware Auxiliary Losses Improve Out-of-Distribution Generalization of a GNN Synthesizability Filter

Link: https://arxiv.org/abs/2606.12651
Authors: Riya Bisht,Dhruv Agarwal
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*Comments:

Click to view abstract

Abstract:Machine-learning drug-discovery pipelines increasingly rely on generative models that propose molecules far from the data used to train downstream synthesizability filters. Existing filters (SAScore, SCScore, RAscore, DeepSA) are purely statistical and degrade in exactly this out-of-distribution (OOD) regime. We ask whether cheap, closed-form physical priors, used as auxiliary supervision on a graph neural network (GNN), improve OOD generalization. We add two auxiliary losses to a GINE backbone: a topological complexity regression supervised by the Bertz index, and a strain-energy soft penalty supervised by MMFF94 force-field energy. On a 65,177-molecule corpus (HIV, Tox21, COCONUT) labeled by SAScore thresholds we reproduce a strong in-distribution baseline, then evaluate a 4-way ablation (baseline / +complexity / +strain / +both) on a single-source OOD split (train on drug-like HIV+Tox21, test on COCONUT natural products), repeated over 5 seeds with paired bootstrap confidence intervals. All three physics-aware variants give a small but statistically significant OOD improvement over the baseline (mean OOD AUC 0.9774): +complexity Delta = +0.0060 (95% CI [+0.0023, +0.0102]), +strain Delta = +0.0032 ([+0.0008, +0.0052]), +both Delta = +0.0066 ([+0.0038, +0.0093]); every interval excludes zero, and the combination is best. The variants are indistinguishable in-distribution, so the effect is visible only under OOD evaluation. We are explicit that the effects are modest, and we report a cautionary methodological finding: a single-seed version of this experiment produced a qualitatively different (non-monotone) story that did not survive multi-seed evaluation. 

[LG-64] TEDD: Robust Detection of Unstable Temporal Features

Link: https://arxiv.org/abs/2606.12643
Authors: Ricardo Ribeiro Pereira, Bruno Casal Laraña, Nádia Soares, Miguel Araújo
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 9 figures

Click to view abstract

Abstract:When working with real-world temporal data, it is common to encounter features whose distribution is changing over time. The naive employment of Machine Learning models on this unstable data might lead to rapidly degrading performance, especially if the new distribution is much different from what was previously seen during training. In order to cope with this problem, it is critical to automatically identify features that are changing over time. With these features detected, data scientists and other practitioners will be able to mitigate the issue (for instance, by applying data transformations), deploying more robust models that retain high performance for longer periods of time. In this paper, we describe which temporal changes a feature should not suffer from, and propose TEDD, a technique to a) identify when a dataset might lead to an unstable Machine Learning model and b) automatically detect which features cause such lack of robustness. In order to achieve it, we leverage a regression model to highlight which features contribute to a good prediction of an instance’s timestamp. We compare our approach to other methods in real and synthetic data, testing their detection capability on all simple change patterns. We show that our method: detects all types of basic changes, both for numerical and categorical features; can detect multivariate drifts; returns a comparable value measuring the amount of change of each feature; requires no parameter tuning; and is scalable both on number of features and instances of the dataset.
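The core idea in the abstract, scoring features by how much they help predict an instance's timestamp, can be sketched in a few lines. This is a simplified stand-in and not the TEDD implementation: instead of a regression model over all features it uses a per-feature squared Pearson correlation with the timestamp, and the function names and toy data are hypothetical.

```python
def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information about time
    return (sxy * sxy) / (sxx * syy)


def rank_time_predictive_features(rows, timestamps, names):
    """Score each feature by how well it alone predicts the timestamp."""
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores[name] = pearson_r2(column, timestamps)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: "drifting" grows with time, "stable" alternates regardless of time.
timestamps = list(range(20))
rows = [[0.5 * t + (1 if t % 2 else -1), (-1) ** t] for t in timestamps]
ranking = rank_time_predictive_features(rows, timestamps, ["drifting", "stable"])
```

A feature that ranks high here is one whose distribution moves with time. A univariate score like this cannot see multivariate drift, which is one reason a joint regression model is the more capable choice.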

[LG-65] Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2606.12640
Authors: Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi
Categories: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: Accepted to the 23rd IFAC World Congress, 2026

Click to view abstract

Abstract:Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.

[LG-66] The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

Link: https://arxiv.org/abs/2606.12639
Authors: Dhruv Agarwal, Riya Bisht
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:Predicting how a cell’s transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound’s nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 DRUG-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa – the textbook “simple baselines win” result. But under the contest’s true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers’ 0.507 reference), that reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy’s winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner – to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1064 x 12,995 grid.
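The retrieval stage described in the abstract, a Tanimoto-weighted average over a held-out compound's nearest training compounds, can be illustrated as follows. This is a minimal sketch under assumptions: fingerprints are represented as plain sets of on-bit indices, `k`, the fallback to the mean response, and all function names are hypothetical and not taken from the authors' pipeline.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieval_predict(query_fp, train_fps, train_responses, k=2):
    """Tanimoto-weighted average of the k most similar training responses."""
    sims = [tanimoto(query_fp, fp) for fp in train_fps]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    total = sum(sims[i] for i in top)
    n_genes = len(train_responses[0])
    if total == 0:
        # No chemical overlap at all: fall back to the mean training response.
        return [sum(r[g] for r in train_responses) / len(train_responses)
                for g in range(n_genes)]
    return [sum(sims[i] * train_responses[i][g] for i in top) / total
            for g in range(n_genes)]


# Toy example: three training compounds, two genes.
train_fps = [{1, 2, 3, 4}, {1, 2, 5}, {7, 8, 9}]
train_responses = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
pred = retrieval_predict({1, 2, 3}, train_fps, train_responses, k=2)
```

In the toy example the query shares bits with the first two compounds (similarities 0.75 and 0.5), so the prediction is pulled toward their responses and the unrelated third compound is ignored.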

[LG-67] Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions

Link: https://arxiv.org/abs/2606.12615
Authors: Owen O’Neill, Fintan Costello
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:ML classifiers deployed in high-stakes domains produce predictions whose quality varies systematically across subgroups. For granular subgroups defined by intersections of multiple features, predictions are often inconsistent with the observed data: the model’s outputs contradict the evidence available for that subgroup. This problem is exacerbated by regularisation, which improves aggregate performance by collapsing small subgroups into larger groups, disproportionately affecting demographic minorities. We define two requirements for consistent prediction: determinism (identical individuals receive identical predictions) and statistical consistency (we cannot reject, at significance level alpha, the hypothesis that the predictions for a subgroup were drawn from the Bayesian optimal target distribution inferred for that subgroup). From these requirements we derive the Fair Bayesian classifier, which enforces both across every group and subgroup simultaneously and abstains whenever no consistent deterministic prediction is possible. On three benchmark datasets (Adult, COMPAS, and Bank Marketing), standard classifiers produce statistically inconsistent predictions for a substantial proportion of subgroups. Our classifier achieves zero consistency error by construction while exceeding baseline accuracy and multicalibration on every dataset tested. Statistical consistency provides a principled foundation for prediction quality with direct implications for algorithmic fairness. Minority demographics are disproportionately concentrated in small subgroups, precisely where frequentist inference is least reliable; addressing this inference problem is therefore a necessary step toward fair ML. By enforcing Bayesian consistency at the finest resolution the data supports, our classifier demonstrates that exhaustive subgroup fairness with principled abstention is achievable in practice.

[LG-68] Evaluation of AutoML Frameworks for IDS under Imbalanced Data Conditions of the NSL-KDD Dataset

Link: https://arxiv.org/abs/2606.12611
Authors: Wiliane Carolina Silva, Evandro César Vilas Boas, Felipe A. P. de Figueiredo
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:

Click to view abstract

Abstract:This work investigates the impact of severe class imbalance on the performance of automated machine learning (AutoML) frameworks for multiclass network intrusion detection using the NSL-KDD dataset. Unlike previous studies that simplify the problem through binary classification or minority-class removal, we preserve the original five-class distribution, including highly underrepresented attacks such as R2L and U2R, enabling a realistic evaluation of imbalance-sensitive learning behavior. Nine open-source AutoML frameworks were analyzed under a unified and reproducible experimental protocol, considering differences in architectural design, ensemble strategies, validation procedures, hyperparameter optimization, and imbalance-handling mechanisms. The results demonstrate that frameworks incorporating ensemble learning and imbalance-aware optimization achieve better minority-class discrimination. PyCaret obtained the best overall performance, reaching 66% macro-F1, followed by AutoGluon with 55%, whereas frameworks lacking native balancing support exhibited significant degradation in minority-class detection capability. The analysis further shows that accuracy-oriented optimization alone is insufficient for highly imbalanced IDS scenarios, since high-weighted metrics may coexist with poor generalization on rare attack categories. As a contribution, this work establishes a standardized benchmark for AutoML-based intrusion detection under severe multiclass imbalance, highlighting current architectural limitations and the need for native integration of imbalance-aware optimization, resampling, and stratified evaluation strategies into automated learning pipelines. The source code is publicly available.
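The abstract's point that high accuracy can coexist with poor detection of rare attack classes is easy to reproduce on toy numbers. The sketch below is illustrative only: the class counts are hypothetical (not NSL-KDD statistics), only three of the five classes are used, and the helper names are made up for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Hypothetical skewed test set: 95 "normal" flows, 4 "R2L", 1 "U2R".
y_true = ["normal"] * 95 + ["R2L"] * 4 + ["U2R"]
y_pred = ["normal"] * 100  # a classifier that never flags an attack
acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Here the degenerate classifier reaches 95% accuracy while its macro-F1 is roughly 0.32, because the two attack classes each contribute an F1 of zero.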

[LG-69] The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

Link: https://arxiv.org/abs/2606.12610
Authors: Miquel Noguer i Alonso, David Pacheco Aznar
Categories: Machine Learning (cs.LG)
Comments: 33 pages, 1 figure

Click to view abstract

Abstract:Two major periods of reduced funding and confidence in artificial intelligence research, commonly called the first and second AI winters, are usually explained through engineering failure, commercial disappointment, and inflated expectations. This article develops a complementary thesis: that the dominant paradigms of those periods also met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The contribution is synthetic rather than archival. We do not claim that particular theorems mechanically caused the winters; rather, we show that several central disappointments of early AI were aligned with mathematically precise bottlenecks. We analyse these bottlenecks through the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training established by Blum and Rivest, minimax rates for nonparametric estimation in high dimension due to Stone, vanishing-gradient analyses by Hochreiter and by Bengio and collaborators, and classical statistical learning theory in the tradition of Vapnik and Chervonenkis, Valiant, and Blumer and collaborators. We then relate these barriers to the later breakthroughs that mitigated, rather than eliminated, them.

[LG-70] Viral Proteins Reveal Geometry of Protein Language Models ICML2026

Link: https://arxiv.org/abs/2606.12609
Authors: Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: Accepted at ICML 2026 GenBio Workshop and FM4LS Workshop. Code available at this https URL

Click to view abstract

Abstract:Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.

[LG-71] Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well

Link: https://arxiv.org/abs/2606.12552
Authors: Célestin Eve, Gaël Varoquaux, Thomas Moreau
Categories: Machine Learning (cs.LG)
Comments: 34 pages, 11 figures

Click to view abstract

Abstract:Modern machine learning progresses through empirical work, benchmarking new methods to evaluate relative performance. However, the statistical variability inherent to evaluation - exacerbated by the stochastic nature of many algorithms - often makes performance estimation unreliable due to the limited test samples available, leading to a validation crisis in which genuine advances are difficult to discern. In this work, we show that cross-validation markedly improves confidence when evaluating and comparing the performance of learning algorithms. We introduce the concept of sample gain, which quantifies the virtual data augmentation achieved by using multiple cross-validation splits to reduce benchmarking variance. Experiments on both synthetic and real-world datasets (histopathologic scans and NLP fine-tuning) demonstrate that multiple splits can substantially improve the reliability and stability of performance estimates, with diminishing returns often setting in later than expected. We also introduce a procedure to dynamically early-stop cross-validation by estimating from the first few folds if subsequent folds will bring large sample gains. Our findings highlight the value of pushing cross-validation on available samples to achieve robust and reliable benchmarking.

[LG-72] Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

Link: https://arxiv.org/abs/2606.12507
Authors: MohammadHossein Rezaei, Anas Mahmoud, Zihao Wang, Utkarsh Tyagi, Advait Gosai, Razvan-Gabriel Dumitru, Aakash Sabharwal, Bing Liu, Yunzhong He
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Rubrics have emerged as an alternative to RLVR in open-ended domains where a single ground-truth final answer is not available. Existing rubric-based training methods rely on an LLM verifier that scores each rollout against rubrics. This introduces substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, serves as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token-by-token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. Across Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models on medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using one on-policy rollout per prompt and no training-time verifier calls. Ablations show that raw rubrics provide a stronger teacher enrichment signal than self-generated reference responses, while a stronger GRPO judge can outperform RGSD in some settings, positioning RGSD as a complementary verifier-free alternative when verifier cost or reliability is the bottleneck.
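To make the contrast between a sparse trajectory-level reward and a dense per-token signal concrete, the sketch below computes one divergence term per generated token between a rubric-conditioned teacher distribution and an unconditioned student distribution. It is an illustration under assumptions: the abstract does not state which divergence or direction RGSD uses, so KL(teacher || student) is chosen here for simplicity, and the distributions are toy numbers.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def per_token_distillation_loss(teacher_dists, student_dists):
    """One KL term per generated token, instead of one reward per trajectory."""
    return [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]


# Toy 3-token vocabulary, 2 generated positions.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # same policy, rubric in the prompt
student = [[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]]  # same policy, no rubric
losses = per_token_distillation_loss(teacher, student)
```

The first position contributes zero loss because both distributions agree; the second position carries all of the learning signal, which a single end-of-trajectory score could not localize.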

[LG-73] Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

Link: https://arxiv.org/abs/2606.12503
Authors: Chiara Semenzin, Faadil Mustun, Roberto Dessi, Pierre Orhan, Alexis Emanuelli, Yair Lakretz, Gonzalo de Polavieja, German Sumbre
Categories: Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Click to view abstract

Abstract:Self-supervised learning (SSL) has opened new opportunities in bioacoustics by enabling scalable modeling of animal vocalizations without the need for expensive manual annotation. However, current SSL models in this domain prioritize broad generalization across species and are not optimized for uncovering the fine-grained structure of individual communication systems. In this work, we collect and release a novel dataset of over five years of longitudinal recordings, from five known dolphins in a semi-naturalistic marine environment, an unprecedented resource for studying dolphin communication. We adapt the Wav2Vec2.0 (Baevski et al., 2020) architecture to this domain and introduce Dolph2Vec, the first large-scale, species-specific SSL model trained exclusively on this data. We benchmark our model on two biologically relevant tasks: signature whistle classification and whistle detection. Dolph2Vec significantly outperforms general-purpose baselines in both tasks. Beyond performance, we show that learned embeddings and codebook structure capture interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure, enabling fine-grained analysis of communication patterns. Our findings demonstrate how SSL can serve as both a model and a scientific tool to explore hypotheses in animal communication research.

[LG-74] Policy-driven Conformal Prediction for Trustworthy QoT Estimation

Link: https://arxiv.org/abs/2606.12501
Authors: Kiarash Rezaei, Omran Ayoub, Paolo Monti, Carlos Natalino
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We propose Conformal QoT, a policy-driven framework that combines statistically guaranteed QoT estimation with operational decision policies, enabling reliable lightpath-feasibility predictions under domain shift and improving accuracy from 92% to 99.6% on open datasets.
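The abstract is brief, so the following is only a generic sketch of how a conformal guarantee can be paired with a decision policy. It assumes split conformal prediction on a regression output (a hypothetical SNR estimate in dB) and a conservative accept rule; the paper's actual estimator, score function, and policies may differ, and all names and numbers are illustrative.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def feasible(pred_snr_db, q, threshold_db):
    """Conservative policy: accept only if the interval's lower end clears the threshold."""
    return pred_snr_db - q >= threshold_db


# Hypothetical calibration residuals |measured - predicted| in dB.
residuals = [0.1, 0.4, 0.2, 0.3, 0.5, 0.2, 0.6, 0.1, 0.3]
q = conformal_quantile(residuals, alpha=0.25)
decisions = [feasible(p, q, threshold_db=15.0) for p in (15.2, 16.0)]
```

With these numbers the interval half-width is 0.5 dB, so a lightpath predicted at 15.2 dB is rejected while one predicted at 16.0 dB is accepted. A less conservative policy would trade some of that margin for higher acceptance.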

[LG-75] From Parameters to Feature Space: Task Arithmetic for Backdoor Mitigation in Model Merging

Link: https://arxiv.org/abs/2606.12498
Authors: Zhenqian Zhu, Yamin Hu, Yiya Diao, Weixiang Li, Haodong Li, Wenjian Luo
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Model merging (MM) has gained significant attention as a cost-effective approach to integrate multiple task-specific models into a unified model. However, recent work reveals that MM is highly susceptible to backdoor attacks. Existing defenses based on task arithmetic often fail to eliminate backdoors without substantially degrading clean-task performance, owing to their reliance on direct parameter-space editing. To address this gap, we propose Linear Feature Path Minimization (LFPM), a backdoor mitigation framework for model merging, which introduces an anti-backdoor task vector into the backdoored merged model. Unlike prior approaches, LFPM formulates the backdoor robustness of the merged model from a unified feature-space perspective under the Cross-Task Linearity (CTL) framework, which leverages the approximate linearity of features across tasks. This perspective guides the optimization of the anti-backdoor task to suppress backdoors while preserving clean-task performance. Furthermore, we introduce an effective optimization mechanism based on gradient accumulation and loss path-integral, ensuring robust backdoor suppression along the interpolation path. Extensive experiments demonstrate that LFPM consistently exhibits strong robustness against backdoor attacks in both full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) settings.
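For readers unfamiliar with task arithmetic, the sketch below shows the parameter-space operations the abstract builds on: a task vector is the difference between fine-tuned and pre-trained weights, a merge adds a scaled sum of task vectors, and a defense adds one more vector to the merged model. The weights are toy lists, and `tau_anti` is a hand-picked placeholder; in LFPM it would be optimized with the feature-space objective, which is not reproduced here.

```python
def task_vector(finetuned, pretrained):
    """Element-wise difference between fine-tuned and pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def merge(pretrained, task_vectors, lam=1.0):
    """Task-arithmetic merge: pre-trained weights plus a scaled sum of task vectors."""
    summed = [sum(col) for col in zip(*task_vectors)]
    return [p + lam * s for p, s in zip(pretrained, summed)]


pretrained = [0.0, 1.0, -1.0]
tau_a = task_vector([0.5, 1.0, -1.0], pretrained)  # task A fine-tune
tau_b = task_vector([0.0, 1.5, -0.5], pretrained)  # task B fine-tune (imagine it is backdoored)
merged = merge(pretrained, [tau_a, tau_b], lam=0.5)

# Placeholder anti-backdoor vector: hand-picked here, learned in the actual method.
tau_anti = [0.0, 0.0, -0.25]
defended = [m + t for m, t in zip(merged, tau_anti)]
```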

[LG-76] μVLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

Link: https://arxiv.org/abs/2606.12497
Authors: Egor Cherepanov, Nikita Kachaev, Daniil Zelezetsky, Aydar Bulatov, Artem Pshenitsyn, Yuri Kuratov, Alexey Skrynnik, Aleksandr I. Panov, Alexey K. Kovalev
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 34 pages, 20 figures, 9 tables

Click to view abstract

Abstract:Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as μVLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, μVLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found in this https URL.

[LG-77] Net-Ev2: A Generative Simulator for Network Event Evolution KDD2026

Link: https://arxiv.org/abs/2606.12494
Authors: Guangyu Wang, Zhaonan Wang
Categories: Machine Learning (cs.LG)
Comments: Accepted by KDD 2026 Research Track

Click to view abstract

Abstract:Reducing real-world trial and error has long been a central goal of decision making, and generative simulators advance this goal by modeling the evolution of future states. An even more challenging yet meaningful task is simulating how disturbance events (e.g., accidents) propagate their impacts across real-world networks. The existing approaches fall short of modeling both structured attributes and unstructured semantics of events, and capturing topological structures in simulating network event evolution. Therefore, we are motivated to propose Net-Ev² (Network Event Evolution), a novel generative simulator that jointly leverages event cues while preserving network topology in simulations. Specifically, the framework consists of two stages, namely structure-guided masked pre-training and topology-aware diffusion process, which is achieved by U-Net-like graph downsampling and upsampling during denoising. At inference time, Net-Ev² can generate simulations using natural-language event input only, with greater flexibility for practical usage. Furthermore, we introduce Net-Ev²-6.5M, a multimodal benchmark of aligned event and network traffic data across four large-scale road networks, as well as a new topology-aware metric, namely JL-MMD, to evaluate topological fidelity in generated network dynamics. Extensive experiments demonstrate the state-of-the-art performance and strong generalization ability of Net-Ev². Code is made available at this https URL.

[LG-78] Robustness Verification of Recurrent Neural Networks with Abstraction Refinement

Link: https://arxiv.org/abs/2606.12490
Authors: Li-Jen Lin, Chih-Duo Hong
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Certified local robustness verification for recurrent neural networks (RNNs) is challenging because approximation errors introduced by nonlinear relaxations can propagate through recurrent connections and accumulate over time. As a result, scalable linear bound propagation methods often become overly conservative and fail to certify inputs that are in fact robust, especially when many pre-activation intervals cross zero. We propose an abstraction-refinement framework for RNN verification that partitions such intervals to remove the dominant relaxation error: on each refined branch, ReLU becomes exact, and smooth activations such as tanh and sigmoid admit substantially tighter linear envelopes. To control the combinatorial cost of splitting in long sequences, we introduce a SHAP-guided timestep selection strategy that ranks hidden states by their contribution to the verification objective and refines only the most critical timesteps in temporal order. Experiments on CIFAR10 and MNIST stroke benchmarks demonstrate consistent improvements in verification success and robustness-margin tightness over abstraction-only baselines, while exposing clear runtime trade-offs between ReLU and tanh models.
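The claim that ReLU becomes exact on each refined branch follows from a short calculation, sketched below. For a pre-activation interval [l, u] with l < 0 < u, the standard triangle relaxation bounds ReLU from above by the chord through (l, 0) and (u, u), and the largest gap to the true function occurs at x = 0. The function names are hypothetical, and the sketch covers only a single neuron, not the paper's propagation through time or its SHAP-guided timestep selection.

```python
def relu_triangle_gap(l, u):
    """Largest gap between the triangle-relaxation upper bound and ReLU on [l, u]."""
    if l >= 0 or u <= 0:
        return 0.0  # ReLU is linear on the interval, so the relaxation is exact
    # Upper bound is the chord y = u * (x - l) / (u - l); the gap peaks at x = 0.
    return -u * l / (u - l)


def split_at_zero(l, u):
    """Refinement step: replace a zero-crossing interval by two sign-stable branches."""
    return [(l, 0.0), (0.0, u)] if l < 0 < u else [(l, u)]


gap_before = relu_triangle_gap(-1.0, 3.0)
branches = split_at_zero(-1.0, 3.0)
gap_after = max(relu_triangle_gap(lo, hi) for lo, hi in branches)
```

For the interval [-1, 3] the relaxation gap is 0.75 before splitting and zero on both branches afterwards. The price is that each split doubles the number of cases to verify, which is why choosing which timesteps to refine matters for long sequences.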

[LG-79] Masked Neural Detection for Constrained Channel Coding in Molecular Communication

Link: https://arxiv.org/abs/2606.12489
Authors: Melih Şahin, Ozgur B. Akan
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, 4 tables

Click to view abstract

Abstract:Molecular communication (MC) suffers from severe diffusion memory because molecules released for one symbol may arrive during later symbols. Neural sequence detectors, especially sliding bidirectional recurrent neural networks (SBRNNs), can substantially outperform threshold detectors in such channels. This raises a central question for MC channel coding: does a code whose advantage was established under threshold detection retain it when both coded and uncoded transmission are evaluated with neural detection? This letter answers this question for run-length-limited ISI-mitigation (RLIM) codes, a class of constrained codes previously shown to provide large BER gains in MC. Across the tested operating points, the best RLIM-SBRNN receiver beats the best uncoded receiver, chosen between threshold and SBRNN detection, in 46 of 59 cases, with a mean gain of 10.36× over those wins. We also propose an RLIM-tailored training mask for compact SBRNN detectors, improving the unmasked RLIM-SBRNN in 227 of 236 comparisons with 3.267× mean gain when masking is beneficial. Finally, the compact masked RLIM-SBRNN is competitive with channel-state-aware MLSE despite using no channel knowledge.

[LG-80] A Stationary (and Therefore Compatible) Representation is All You Need CVPR2024

Link: https://arxiv.org/abs/2606.12488
Authors: Niccolò Biondi, Federico Pernici, Simone Ricci, Alberto Del Bimbo
Categories: Machine Learning (cs.LG)
Comments: Accepted to TPAMI2026. Extension of the CVPR2024 version (arXiv:2405.02581)

Click to view abstract

Abstract:Learning compatible representations aims to learn feature representations that can be used interchangeably over time whenever a model undergoes updates. In this paper, we demonstrate that stationary representations learned by d-Simplex fixed classifiers imply compatibility as in its formal definition. This result establishes a foundation for future works and can be directly exploited in practical learning scenarios. We address the challenge of learning compatibility using d-Simplex fixed classifiers when the model is sequentially fine-tuned. Learning according to a d-Simplex fixed classifier with the cross-entropy loss aligns feature distributions at the first-order statistics. Consequently, it may not fully capture higher-order dependencies in the representation between model updates. To address this issue, we demonstrate that training the model using a d-Simplex fixed classifier through a convex combination of the cross-entropy loss and a contrastive loss not only captures higher-order dependencies, but is also equivalent to learning with the cross-entropy under the compatibility constraints. We confirm our findings with extensive experiments also considering a new scenario where a pre-trained model is sequentially fine-tuned and occasionally replaced with an improved model. We show that stationary representations enable uninterrupted retrieval services (without reprocessing gallery images) while improving performance during model updates and replacements, achieving state-of-the-art. Code at this https URL.
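A fixed simplex classifier keeps its class prototypes frozen at the vertices of a regular simplex, so the feature space is anchored to the same directions across model updates. The sketch below builds such prototypes with one standard construction (centering the standard basis and normalizing); it is an assumption that this matches the geometry the authors use, and their exact parameterization may differ.

```python
import math


def simplex_prototypes(num_classes):
    """Unit-norm class prototypes at the vertices of a regular simplex.

    Built by centering the standard basis of R^K and normalizing, so the K
    vectors span a (K-1)-dimensional subspace and are never trained.
    """
    K = num_classes
    protos = []
    for i in range(K):
        v = [(1.0 if j == i else 0.0) - 1.0 / K for j in range(K)]
        norm = math.sqrt(sum(x * x for x in v))
        protos.append([x / norm for x in v])
    return protos


def cosine(a, b):
    # Prototypes are unit-norm, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


W = simplex_prototypes(4)
pairwise = [cosine(W[i], W[j]) for i in range(4) for j in range(i + 1, 4)]
```

Every pair of prototypes has cosine -1/(K-1), here -1/3, which is the maximally separated equiangular arrangement for K classes. Because the prototypes never change, features learned against them at different times remain comparable.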

[LG-81] DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

Link: https://arxiv.org/abs/2606.12487
Authors: Zimo Zhao, Maolin Wang, Bowen Yu, Bowen Liu, Xiao Han, Xiangyu Zhao
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Post-training quantization (PTQ) is essential for efficient large language model inference, but reliably quantizing activations remains challenging when weights, activations, and KV caches are all quantized to 4-bit precision. A key difficulty lies in massive activations, whose extreme values dominate the activation range and amplify quantization errors. State-of-the-art methods mainly mitigate massive activations through transformation-based smoothing, such as orthogonal rotations and affine scaling, but overlook the cross-layer dynamics of the residual stream. In this paper, we show that massive activations emerge and disappear in a phase-wise pattern across network depth, triggering large residual changes. These changes cause newly injected layer-wise updates to dominate the 4-bit quantization scale and weaken historical residual information. To characterize this behavior, we introduce Jump Ratio and Historical Feature SNR. This suggests that static transformation-based smoothing cannot fully resolve dynamic quantization instability caused by cross-layer residual changes. Based on this analysis, we propose DynamicPTQ, a Dynamic Post-Training Quantization policy for phase-aware mixed-precision activation quantization. DynamicPTQ identifies quantization-sensitive layers from residual-stream dynamics and assigns 8-bit activation precision only to these layers, while keeping weights, KV caches, and other activations in 4-bit precision. It can be directly integrated with strong PTQ baselines such as QuaRot, SpinQuant, and FlatQuant. Experiments on LLaMA-2 and LLaMA-3 show that DynamicPTQ consistently improves perplexity and zero-shot QA performance under W4A4KV4 quantization, while achieving 1.05 to 1.07 times throughput improvement with modest memory overhead. These results demonstrate a practical path toward robust low-bit LLM inference.
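The effect of a massive activation on a shared quantization scale can be seen with a few numbers. The sketch below uses plain symmetric per-tensor uniform quantization, which is an assumption made for illustration; it does not implement DynamicPTQ's layer selection, the Jump Ratio, or Historical Feature SNR, whose definitions are not given in the abstract.

```python
def quantize(values, bits):
    """Symmetric uniform quantization with one scale shared by the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute difference between values and their dequantized versions."""
    deq = quantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)


normal = [0.3, -0.5, 0.8, -0.2, 0.6, -0.7, 0.1, 0.4]
with_massive = normal + [60.0]  # one massive activation stretches the scale

err_4bit_normal = mean_abs_error(normal, 4)
err_4bit_massive = mean_abs_error(with_massive, 4)
err_8bit_massive = mean_abs_error(with_massive, 8)
```

With the outlier present, every ordinary value rounds to zero at 4 bits, so the tensor's existing information is wiped out. Raising only that tensor to 8 bits recovers much of it, which is the intuition behind assigning higher activation precision to a small set of sensitive layers.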

[LG-82] An Empirical Study on Predictive Maintenance for Component X in Heavy-Duty Scania Trucks

Link: https://arxiv.org/abs/2606.12486
Authors: Valeriu Dimidov, Sasan Jafarnejad, Raphaël Frank
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Condition-based Predictive Maintenance (PdM) for truck fleets has gained momentum in recent years. This maintenance strategy aims to minimize unplanned downtimes and reduce costs by monitoring the health status of vehicles and taking proactive action based on their condition. However, the implementation of condition-based PdM systems is challenging due to the large volume of data generated by the trucks, the inherent complexity of detecting failures through sensor data and the difficulties in finding cost-effective trade-offs in the solution’s implementation. In this paper, we define and validate a condition-based PdM methodology built on the assumption that the wear-and-tear state of the monitored component can be represented as a monotonically non-decreasing time series. It involves selecting only the most recent observations from the time series and transforming them into a tabular format for classification using machine learning (ML) models designed for tabular data. Our results indicate that the proposed methodology reduces costs on the Scania Component X dataset compared to current state-of-the-art (SOTA) approaches, while also simplifying the modeling process through AutoML.
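The preprocessing step described in the abstract, keeping only the most recent observations and turning them into a tabular row, might look like the sketch below. The window size `k`, the padding rule for short histories, and the data layout are assumptions made for illustration and are not taken from the paper.

```python
def last_k_as_row(readouts, k):
    """Flatten the k most recent readouts of one vehicle into a single tabular row.

    `readouts` is a non-empty list of (time_step, [sensor values]) pairs in any order.
    Vehicles with fewer than k readouts are left-padded by repeating the earliest one.
    """
    ordered = [values for _, values in sorted(readouts, key=lambda r: r[0])]
    recent = ordered[-k:]
    padded = [recent[0]] * (k - len(recent)) + recent
    return [x for values in padded for x in values]


# Hypothetical vehicle with two cumulative (non-decreasing) counters.
vehicle = [(3, [12.0, 5.0]), (1, [10.0, 4.0]), (2, [11.0, 4.0]), (4, [15.0, 7.0])]
row = last_k_as_row(vehicle, k=3)
short_row = last_k_as_row(vehicle[:1], k=3)
```

Each vehicle then becomes one fixed-width row that any tabular classifier or AutoML tool can consume. Under the monotonic wear assumption, the latest readouts already summarize the accumulated history.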

[LG-83] Scalable anomaly detection via a univariate Christoffel function

Link: https://arxiv.org/abs/2606.12483
Authors: Florian Grivet (CNES, LAAS-DISCO, Comue de Toulouse), Didier Henrion (LAAS-POP), Jean-Bernard Lasserre (TSE-R, LAAS-POP), Louise Travé-Massuyès (LAAS-DISCO, Comue de Toulouse)
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Anomaly detection plays a critical role in identifying unusual patterns across domains such as fraud detection, network intrusion, and system fault diagnosis. Recently, Christoffel function-based methods, rooted in polynomial optimization, have emerged as promising alternatives to deep learning due to their strong mathematical foundations and computational frugality. However, their practical applicability is hindered by the need to invert a matrix whose size grows exponentially with the data dimension, rendering the method intractable even for moderate-dimensional datasets. This paper addresses the dimensionality limitations of Christoffel function-based anomaly detection while preserving its key theoretical properties, i.e., the on-off support dichotomy behavior and the accurate support shape capture. We introduce UCF, a univariate Christoffel function which is based on the squared distance between the query point and the support points. Extensive experiments on the ADBench benchmark demonstrate that UCF consistently outperforms 14 state-of-the-art baselines in terms of Average Precision. By resolving the scalability bottleneck of the Christoffel Function, this work expands the toolkit of anomaly detection methods with a robust, theoretically grounded, and universally applicable approach.

[LG-84] Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention

Link: https://arxiv.org/abs/2606.12478
Authors: Gilhan Kim, Daniel K. Park
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Quantum Physics (quant-ph)
Comments: 19 pages, 5 figures

Abstract:Attention mechanisms are central to modern sequence models, yet standard attention computes relevance primarily through individual query–key similarities. Although softmax normalization introduces competition among positions, a standard attention layer does not explicitly parameterize learnable interactions between attention decisions. This limits its ability to directly model cooperative or antagonistic co-attention structure within the attention mechanism itself. We propose Boltzmann attention, an energy-based generalization in which attention patterns are governed by an interacting Ising model. The method augments the usual data-dependent local fields with learnable pairwise couplings, allowing the model to represent inter-position correlations beyond those captured by softmax or sigmoid attention. Experiments on character-level language modeling and synthetic bracket matching show that Boltzmann attention consistently improves over standard softmax attention within a standard Transformer architecture, with the advantage becoming more pronounced as sequence length increases. A four-way ablation confirms that the improvement arises from the learnable pairwise couplings. These results suggest that explicit inter-position interactions provide a principled enhancement for attention-based sequence modeling. Moreover, the Ising formulation opens a natural path toward quantum-computing-based sampling strategies: we demonstrate that diabatic quantum annealing provides a practical training method while maintaining competitive performance with exact Boltzmann computation.
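
The idea of adding pairwise couplings to per-position fields can be illustrated with a tiny exact computation. The sketch below enumerates all binary attention patterns of a short sequence and returns the marginal probability that each position is attended. The 0/1 spin convention, the energy sign, and the function name are assumptions; the paper's actual parameterization may differ.

```python
import itertools
import math


def ising_attention_marginals(fields, couplings):
    """Exact marginals P(s_i = 1) under P(s) proportional to exp(-E(s)), with
    E(s) = -sum_i h_i s_i - sum_{i<j} J_ij s_i s_j and s_i in {0, 1}.

    Enumeration costs 2^n, so this is only meant for very short sequences.
    """
    n = len(fields)
    states = list(itertools.product((0, 1), repeat=n))
    weights = []
    for s in states:
        energy = -sum(h * si for h, si in zip(fields, s))
        for i in range(n):
            for j in range(i + 1, n):
                energy -= couplings[i][j] * s[i] * s[j]
        weights.append(math.exp(-energy))
    z = sum(weights)
    return [sum(w for w, s in zip(weights, states) if s[i]) / z for i in range(n)]


# Hypothetical local fields (e.g. scaled query-key scores) for three positions.
fields = [0.5, -1.0, 2.0]
no_coupling = [[0.0] * 3 for _ in range(3)]
cooperative = [[0.0, 1.5, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

independent = ising_attention_marginals(fields, no_coupling)
coupled = ising_attention_marginals(fields, cooperative)
```

With all couplings set to zero the marginals reduce to independent sigmoid gates, while a positive coupling between positions 0 and 1 raises both of their marginals together.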

[LG-85] Auditing Discriminatory Patterns in Mortgage Lending Through Association Rules and Fair Binning

Link: https://arxiv.org/abs/2606.12435
Authors: Archit Rathod, Dhwani Chande, Het Nagda
Categories: Computers and Society (cs.CY); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, fairness-aware mortgage lending analysis using HMDA 2023 data. Project repository available at GitHub

Abstract:Mortgage lending in the United States exhibits persistent racial and gender disparities. We investigate whether standard data preprocessing steps, specifically attribute binning, amplify these disparities in downstream pattern mining. Using 103,481 cleaned mortgage applications from the HMDA 2023 dataset (Chicago metropolitan area), we build a three-stage pipeline: (1) a PySpark data cleaning and binning pipeline that implements both standard equal-frequency binning and the epsilon-biased fair binning algorithm from Asudeh et al. [1], (2) FP-Growth association rule mining that compares denial patterns under both binning regimes, and (3) K-Means clustering with a per-cluster disparate impact audit. Our standard binning shows 9.63% racial bias in income discretization, consistent with the 8-10% reported in prior work. Fair binning with seven race groups is infeasible at epsilon=0.03 and only succeeds at epsilon=0.08 with a Price of Fairness of 29.4%. FP-Growth reveals that high debt-to-income ratio is the dominant denial predictor (67.2% confidence, 2.81 lift), while racial bias does not appear as explicit high-support rules. However, K-Means clustering followed by a disparate impact audit flags 10 out of 45 cluster-group pairs, showing that Black applicants face significantly higher denial rates than White applicants even among financially similar groups.
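
A per-cluster disparate impact check of the kind described can be sketched in a few lines. The data below are invented, and the 0.8 cutoff is the commonly used four-fifths rule, assumed here because the abstract does not state the threshold the authors applied.

```python
def approval_rate(records, group):
    """Share of applications from `group` that were not denied."""
    outcomes = [denied for g, denied in records if g == group]
    return 1.0 - sum(outcomes) / len(outcomes)


def disparate_impact(records, protected, reference):
    """Ratio of approval rates: protected group over reference group."""
    return approval_rate(records, protected) / approval_rate(records, reference)


# Hypothetical cluster of financially similar applicants: (group, denied?).
cluster = (
    [("A", True)] * 4 + [("A", False)] * 6
    + [("B", True)] * 2 + [("B", False)] * 8
)

ratio = disparate_impact(cluster, protected="A", reference="B")
flagged = ratio < 0.8  # assumed four-fifths rule
```

Running the same check for every cluster and group pair yields the kind of flag count reported in the abstract.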

[LG-86] Majority-of-Three is Optimal

Link: https://arxiv.org/abs/2606.13614
Authors: Divit Rawal, Nikita Zhivotovskiy
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 9 pages

Abstract:We give a short proof that the majority vote of three independent consistent classifiers is an optimal learner in the realizable PAC setting. This proves optimality for the simplest voting scheme, while simplifying both the algorithmic structure and the probabilistic analysis of previous voting learners, including the algorithm of S. Hanneke and the analysis of bagging by K. Green Larsen.
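
The voting scheme is simple enough to sketch on a toy realizable problem. The example below uses one-dimensional threshold concepts, trains a consistent learner on three independent samples, and takes the majority vote. The concept class, sample sizes, and the midpoint rule for picking a consistent threshold are assumptions made for illustration and say nothing about the proof technique.

```python
import random

TRUE_THRESHOLD = 0.37  # hypothetical target concept: label = (x >= 0.37)


def draw_sample(rng, size):
    """Draw `size` labelled points from the uniform distribution on [0, 1)."""
    xs = [rng.random() for _ in range(size)]
    return [(x, x >= TRUE_THRESHOLD) for x in xs]


def consistent_threshold(sample):
    """Return a threshold t such that (x >= t) matches every training label."""
    positives = [x for x, y in sample if y]
    negatives = [x for x, y in sample if not y]
    lo = max(negatives) if negatives else 0.0
    hi = min(positives) if positives else 1.0
    return (lo + hi) / 2


def majority_of_three(thresholds, x):
    """Majority vote of the three threshold classifiers on input x."""
    return sum(x >= t for t in thresholds) >= 2


rng = random.Random(0)
samples = [draw_sample(rng, 50) for _ in range(3)]
thresholds = [consistent_threshold(s) for s in samples]
prediction = majority_of_three(thresholds, 0.5)
```

For threshold classifiers the majority vote coincides with the classifier at the median of the three learned thresholds, which makes the behaviour of the vote easy to inspect.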

[LG-87] Distribution-Agnostic Robust Trajectory Optimization via Chance-Constrained Reinforcement Learning

Link: https://arxiv.org/abs/2606.13605
Authors: Yashdeep Chaudhary, Roberto Armellin, Harry Holt, Marco Sagliano
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Preprint. 39 pages, 16 figures

Abstract:This paper presents a distribution-agnostic robust trajectory-optimization framework based on chance-constrained reinforcement learning. The uncertainty is represented here through initial conditions and process noise, with the only requirement being that it can be sampled. A deterministic nominal trajectory is first computed offline, and reinforcement learning is then used only to robustify that baseline through a structured affine closed-loop correction law comprising a feedforward control adjustment and time-varying feedback gains. Probabilistic feasibility is enforced empirically through rollout-based upper-tail quantiles, while terminal dispersion is regulated through covariance-feasibility penalties. The framework is assessed on two materially different trajectory design problems. The flagship case study is a three-dimensional multi-impulse Earth-Mars transfer, where the learned policy is benchmarked against a recent robust trajectory-optimization reference under Gaussian uncertainty and then evaluated under bounded uniform uncertainty and under process disturbances not seen during training. The second case study is a stochastic atmospheric pinpoint rocket landing problem, used to assess portability to a short-horizon continuous-thrust setting with drag, mass depletion, and glide-slope constraints. The results show that the proposed framework can remain competitive in upper-tail fuel cost while preserving probabilistic feasibility, and that the same robustification scaffold can be carried across heterogeneous spacecraft trajectory planning problems without redesign of its core stochastic-control structure.
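
Enforcing a chance constraint empirically through an upper-tail quantile of rollouts can be sketched as below. The constraint convention (g <= 0 means feasible), the quantile definition, and the function names are assumptions; the paper's estimator and penalty terms are not specified in the abstract.

```python
import math


def empirical_upper_quantile(samples, level):
    """Smallest sample value q such that at least `level` of samples are <= q."""
    ordered = sorted(samples)
    # The small tolerance guards against floating-point noise in level * n.
    k = math.ceil(level * len(ordered) - 1e-9) - 1
    return ordered[max(k, 0)]


def chance_constraint_satisfied(violations, level):
    """True if g <= 0 holds in at least a fraction `level` of the rollouts.

    `violations` holds one constraint value g per sampled rollout; the
    uncertainty only needs to be sampleable, no distribution is assumed.
    """
    return empirical_upper_quantile(violations, level) <= 0.0


# Hypothetical rollout outcomes: negative values are feasible.
mostly_feasible = [-1.0] * 95 + [1.0] * 5
too_many_failures = [-1.0] * 94 + [1.0] * 6

ok = chance_constraint_satisfied(mostly_feasible, level=0.95)
not_ok = chance_constraint_satisfied(too_many_failures, level=0.95)
```

In a training loop the quantile itself, rather than the boolean, would typically be used as a penalty so that the policy is pushed toward feasibility.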

[LG-88] Optical Implementation of Equilibrium Propagation Using Spatial Photonic Ising Machines

Link: https://arxiv.org/abs/2606.13454
Authors: Dimitri Vanden Abeele, Daniele Veraldi, Davide Pierangeli, Claudio Conti, Serge Massar
Categories: Optics (physics.optics); Disordered Systems and Neural Networks (cond-mat.dis-nn); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Equilibrium Propagation offers a compelling alternative to traditional machine learning for training energy-based networks. Here we demonstrate a hybrid optical-digital implementation of EP using a Spatial Photonic Ising Machine (SPIM). The SPIM exploits the gauge transformation method to optically encode both continuous neuron states and rank-1 binary trainable patterns as phase modulations via a spatial light modulator, with inference realized using a finite difference scheme. The experimental system is evaluated on the Wine classification dataset. The potential of this approach, including the use of continuous couplings and structured coupling matrices, is evaluated numerically on the more complex MNIST dataset. Our work provides a concrete pathway toward energy-efficient physical implementations of Equilibrium Propagation.

[LG-89] Foundations of Practical Quantum Advantage in Quantum-Informed Machine Learning for Predicting Chaos

Link: https://arxiv.org/abs/2606.13422
Authors: Maida Wang, Xiao Xue, Minh Chung, Peter V. Coveney
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Abstract:We develop theoretical foundations for a practical quantum-advantage mechanism in quantum-informed machine learning for chaotic dynamical systems. A family of k-indexed higher-order quantum statistical priors (Q-Priors) hosts the k-point marginal of the invariant measure on n_q = kq qubits, extending the single-site construction of prior work. We prove a two-stage advantage. In the representation stage, superposition and entanglement compactly store non-factorisable spatial correlations of the invariant measure on n_q qubits. In the extraction stage, joint Bell measurements on two copies estimate any post hoc Pauli functional with a copy-pair count independent of n_q, whereas any adaptive single-copy protocol for the corresponding full-Pauli read-out requires Omega(2^(n_q)) copies; this is a provable quantum-classical separation in copy-measurement complexity. The two-copy read-out is realised in simulation and on IQM superconducting processors. Two case studies instantiate the mechanism in workflows of independent scientific value: a turbulent channel-flow study in which the two-copy read-out yields a named non-diagonal correlator of the invariant measure (the velocity-direction coherence), and a medium-range weather forecasting workflow on the European Centre for Medium-Range Weather Forecasts ERA5 reanalysis in which the diagonal k = 2 Q-Prior steers a Koopman rollout, improves anomaly-correlation skill by 10-39% across 48-240 h lead times, and reduces the long-horizon collapse of rollouts onto a static mean field. The two conditions of our practical-advantage definition are met at complementary levels, identifying a candidate route to practical quantum advantage before fault-tolerant hardware.

[LG-90] Simultaneous Latent Budget Trees for Stratified Classification

Link: https://arxiv.org/abs/2606.13295
Authors: Cristian Buoncompagni, Stefano Pellegrino, Giulia Vannucci, Roberta Siciliano
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:In the era of Explainable Artificial Intelligence, there is a renewed focus on single trees for their ease of interpretation. This paper introduces Simultaneous Latent Budget Trees, a probabilistic machine learning framework for classification trees in the presence of a stratification factor such as a temporal, spatial, or demographic variable, acting as a control variable or potential confounder. Standard tree growth procedures are not designed to optimize a conditional split rule. A model-based split rule is proposed in which child nodes are interpreted as latent components of a simultaneous mixture model, such as the Simultaneous Latent Budget Model and its constrained versions, fitted to the parent node. Mixing parameters drive the observations, differently for each group, to the child nodes, whereas latent budget parameters update the response class profile of each level of the control variable. Parameters are estimated by least squares considering a neural network perspective of the model. An informative tree structure can be interactively visualized with interpretation aids on the nodes and the paths, including visual pruning and a decision tree selection procedure. Suitable measures are proposed to handle an unbalanced response class distribution. The proposed methodology is applied to investigate gender-related differences in disease progression of Amyotrophic Lateral Sclerosis. The SLBT library with the various tree-based algorithms is available in the linked GitHub repository.

[LG-91] ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization

Link: https://arxiv.org/abs/2606.13277
Authors: Aitor Sánchez-Ferrera, Elisabeth Wetzer, Kristoffer Wickstrøm, Michael Kampffmeyer, Robert Jenssen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 26 pages, 8 figures

Abstract:Recent advances in time series anomaly detection (TSAD) have highlighted the effectiveness of self-supervised classification-based approaches. These methods apply transformations to normal training samples, training a classifier to recognize transformation-specific patterns that help identify anomalies through increased classification errors. Despite their strong performance, a significant challenge is their lack of explainability, as they provide limited insight into the characteristics of flagged anomalies. To address this limitation, we propose ProtoX-AD, a prototype-based self-explainable framework for self-supervised TSAD. ProtoX-AD learns transformation-aware latent representations alongside interpretable prototypes, enabling both accurate anomaly detection and the identification of distinct anomalous profiles through prototype-based explanations. Additionally, it allows for systematic analysis of how transformation design impacts detection performance and explainability. Experimental results on synthetic and real-world datasets demonstrate that ProtoX-AD achieves detection performance comparable to its black-box counterparts while offering more consistent and semantically meaningful explanations than existing explainable baselines. Our code is publicly available at this https URL.

[LG-92] Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering

Link: https://arxiv.org/abs/2606.13146
Authors: Federico P. Cortese, Alessio Farcomeni
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:We propose a robust feature-weighted jump model for time-dependent clustering. A penalty is used to encourage smoothness of transitions over time, while robustness is achieved through the use of a Tukey’s biweight loss function. An additional parameter controls the variability of feature weights across states, allowing the model to assign state-specific relevance to each feature. We illustrate in simulation how the method accurately recovers the true cluster sequence and reliably identifies relevant features, outperforming competing approaches, particularly in the presence of outliers. We conclude with two empirical applications, one on the number of conflict-related homicides in Kosovo in the period 1998-2000, and another on macroeconomic performance of twelve European countries in the period 1949-2024.
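
For reference, Tukey's biweight loss in its standard form is sketched below. The tuning constant 4.685 is the common default for residuals scaled to unit variance; how the paper sets it, and how the loss enters the jump-model objective, is not stated in the abstract and is not modelled here.

```python
def tukey_biweight_loss(residual, c=4.685):
    """Tukey's biweight (bisquare) loss.

    Behaves like residual**2 / 2 near zero and saturates at c**2 / 6 for
    |residual| >= c, so gross outliers contribute a bounded amount.
    """
    u = residual / c
    if abs(u) >= 1.0:
        return c * c / 6.0
    return (c * c / 6.0) * (1.0 - (1.0 - u * u) ** 3)


small = tukey_biweight_loss(0.01)      # close to 0.01**2 / 2
outlier = tukey_biweight_loss(100.0)   # capped at c**2 / 6
```

Because the loss is flat beyond the cutoff, an outlying observation cannot pull a cluster centre arbitrarily far, which is the source of the robustness mentioned above.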

[LG-93] A solvable model for unsupervised federated learning

Link: https://arxiv.org/abs/2606.13045
Authors: Giovanni Catania, Aurélien Decelle, Gianluca Manzan, Beatriz Seoane, Daniele Tantari
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)

Abstract:We introduce a theoretical framework for analyzing federated learning in a generative setting through a teacher-multiple interacting students scenario, in which each student receives a distinct realization of the data, either through a different noise corruption or by accessing a different subset, possibly of varying size. Using theoretical tools from equilibrium disordered systems, we analytically show that interactions among students systematically enhance learning performance: highly noisy students require fewer samples to recover the underlying pattern, while low-noise students achieve a larger overlap with the ground-truth signal. We derive the optimal Bayesian conditions for teacher recovery as functions of the sample complexity, noise level, and interaction strength, and validate these predictions through numerical simulations. The resulting dynamics can be mapped onto equilibrium sampling in a Restricted Boltzmann Machine with a structured hidden layer, providing a principled theoretical understanding of how interactions improve distributed generative modeling.

[LG-94] Deep Sleep Classification via EEG Signal Criticality: A Passive BCI Approach for Sleep-Improvement Neurofeedback

Link: https://arxiv.org/abs/2606.13017
Authors: Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: 7 pages, 3 figures, accepted for publication in the Proceedings of the 10th Graz Brain-Computer Interface Conference 2026, Graz, Austria, September 14-17, 2026

Abstract:Automated sleep staging is a fundamental application of passive Brain-Computer Interfaces (pBCI), decoding spontaneous neural states to enable closed-loop interventions independent of user intent. This study evaluates criticality features derived from Detrended Fluctuation Analysis (DFA) for the specific identification of deep sleep (N3). We analyzed 347,232 EEG epochs from 290 older women using UMAP manifold learning to visualize state transitions. Subsequently, six classifiers were benchmarked via 10-fold cross-validation, using balanced accuracy to determine the optimal “state-sensing” engine for this http URL Bayes achieved the highest mean balanced accuracy (87.17% ± 0.24%), significantly outperforming a fully connected deep neural network (FNN: 81.58%) and Random Forest (80.97%). Linear models (LDA: 57.21%; SVM: 51.01%) performed poorly, indicating that DFA-derived criticality features reside on a distinct, non-linear manifold. Probabilistic decoding of EEG criticality provides a high-accuracy sensing mechanism for pBCIs. This robust classification pipeline supports the development of state-dependent neurofeedback, such as targeted auditory stimulation, to enhance cognitive recovery.
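
Detrended Fluctuation Analysis itself is a standard procedure, and a minimal first-order version is sketched below: integrate the mean-removed signal, detrend it linearly in windows of several sizes, and fit the slope of the log fluctuation against the log window size. The window sizes and function names are assumptions, and the exact criticality features used in the study may differ from this single scaling exponent.

```python
import math
import random


def _linear_residual_ms(segment):
    """Mean squared residual of a least-squares line fitted to `segment`."""
    n = len(segment)
    t_mean = (n - 1) / 2
    y_mean = sum(segment) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(segment))
    slope = sxy / sxx
    return sum(
        (y - y_mean - slope * (t - t_mean)) ** 2 for t, y in enumerate(segment)
    ) / n


def dfa_exponent(signal, scales):
    """First-order DFA scaling exponent (signal must not be constant)."""
    mean = sum(signal) / len(signal)
    profile, total = [], 0.0
    for v in signal:                      # integrated, mean-removed signal
        total += v - mean
        profile.append(total)
    log_n, log_f = [], []
    for n in scales:
        segments = [profile[i:i + n] for i in range(0, len(profile) - n + 1, n)]
        f = math.sqrt(sum(_linear_residual_ms(s) for s in segments) / len(segments))
        log_n.append(math.log(n))
        log_f.append(math.log(f))
    n_mean = sum(log_n) / len(log_n)
    f_mean = sum(log_f) / len(log_f)
    num = sum((a - n_mean) * (b - f_mean) for a, b in zip(log_n, log_f))
    den = sum((a - n_mean) ** 2 for a in log_n)
    return num / den


rng = random.Random(1)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
alpha = dfa_exponent(white_noise, scales=[16, 32, 64, 128, 256])
```

Uncorrelated noise gives an exponent near 0.5 and a random walk gives one near 1.5; values in between are commonly read as long-range temporal correlations.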

[LG-95] Prediction-Powered Causal Inference by Automatic Debiased Machine Learning and Semi-Supervised Riesz Regression

Link: https://arxiv.org/abs/2606.12892
Authors: Masahiro Kato
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)

Abstract:This study investigates semiparametric efficient estimation of causal and structural parameters in a semi-supervised setting. In our setting, unlabeled auxiliary regressors are available in addition to labeled observations consisting of outcomes and regressors. Our goal is to construct estimators of causal and structural parameters whose asymptotic variances are smaller than those of estimators constructed using only labeled data. We refer to this framework as prediction-powered causal inference (PPCI). We first derive the efficient influence function and the efficiency bound, which imply that the use of auxiliary regressors can attain a smaller asymptotic variance than the efficiency bound attainable from labeled observations alone. Then, by combining the efficient influence function with the debiased machine learning (DML) framework, we propose methods that we call DML-PPCI. If we construct an estimating-equation estimator, we refer to the method as EE-DML-PPCI; if we construct a targeted-learning estimator, we refer to the method as TMLE-DML-PPCI. The asymptotic variances of both estimators match our derived efficiency bound. In the construction of the estimators, estimation of the efficient influence function plays an important role. In our study, the efficient influence function is also a Neyman orthogonal score, which depends on the Riesz representer and the regression function. For Riesz representer estimation, we develop semi-supervised generalized Riesz regression with convergence rate guarantees.

[LG-96] Graph Reinforcement Learning for Calibration-Aware Quantum Circuit Routing

Link: https://arxiv.org/abs/2606.12816
Authors: Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal
Categories: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Quantum circuit routing is a key step in compiling programs for noisy intermediate-scale quantum processors. Routes that appear efficient by standard overhead metrics can still lose fidelity when they pass through poorly calibrated couplers. We study a calibration-aware graph reinforcement-learning router that uses same-day IBM Heron r2 calibration data to choose hardware-edge SWAPs. We train the policy with proximal policy optimization and evaluate it with exact simulated fidelity across nine Munich Quantum Toolkit (MQT) Bench circuits and three calibration snapshots. Across these evaluations, pooled mean exact fidelity is 0.727, compared with 0.440 for SABRE-best20 and 0.481 for target-aware SABRE. Fidelity gains come with higher routed two-qubit counts and are concentrated in the 5q and 8q circuit families; under the fixed tree action graph, all 10q families favor SABRE-best20. Overall, our results show that calibration-aware learned routing can improve fidelity beyond gate-count-driven compilation.

[LG-97] Quantum Reservoir Computing for Short-Term Power Load Forecasting in Resource-Constrained Energy Systems

Link: https://arxiv.org/abs/2606.12806
Authors: Mansi Od, Param Pathak, Nouhaila Innan, Muhammad Shafique
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 11 pages, 9 figures

Abstract:Short-term load forecasting is essential for reliable energy management, but practical deployment on edge devices requires models that remain accurate under limited memory, finite measurement budgets, and hardware noise. This work proposes a hardware-efficient Quantum Reservoir Computing (QRC) framework for energy load forecasting, where a fixed quantum reservoir transforms temporal input windows into high-dimensional features and only a classical Elastic Net readout is trained. To reduce deployment cost, the trained readout is compressed using post-training fixed-point quantization at bit widths from 8 to 2 bits. The framework is evaluated on the Tetouan and Spain energy load datasets under exact statevector simulation, 512-shot finite sampling, and realistic hardware-noise models from IBM FakeTorino and IBM FakeMarrakesh. Results show that 6-bit readout precision preserves full-precision forecasting performance while reducing readout memory by 81.2%. Below this point, degradation becomes dataset dependent, with Tetouan showing stronger sensitivity and Spain degrading more gradually. Hardware-noise validation further shows that the trained readout transfers to noisy reservoir states without retraining. These findings support quantized QRC as a resource-aware forecasting approach for near-term quantum time-series applications.
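
Post-training fixed-point quantization of a linear readout, and the memory arithmetic behind it, can be sketched as follows. The 32-bit floating-point baseline is an assumption; under it, a 6-bit readout saves 1 - 6/32 = 81.25% of the weight storage, which is in line with the reported figure. The weights are invented and the function names are illustrative.

```python
def quantize_fixed_point(weights, bits):
    """Map weights to signed `bits`-bit integer codes plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate weights from integer codes."""
    return [c * scale for c in codes]


def memory_reduction(bits, baseline_bits=32):
    """Fraction of weight storage saved relative to the assumed baseline."""
    return 1.0 - bits / baseline_bits


# Hypothetical readout weights of a trained linear model.
readout = [0.82, -0.13, 0.47, -0.91, 0.05, 0.33]
codes, scale = quantize_fixed_point(readout, bits=6)
restored = dequantize(codes, scale)
saving = memory_reduction(6)
```

Only the readout is touched here, which matches the setting where the reservoir is fixed and the trained part of the model is a small classical layer.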

[LG-98] Computationally tractable robust differentially private mean estimation

Link: https://arxiv.org/abs/2606.12654
Authors: Kelly Ramsay
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 40 pages, 17 figures

Abstract:We develop a new, differentially private mean estimator called the balloon mean. The main features of the balloon mean are that it is computationally tractable and enjoys robustness to outlying observations. It is based on an iterative clipping procedure over expanding Mahalanobis balls, or “balloons.” The method satisfies zero-concentrated differential privacy and depends on a small number of interpretable tuning parameters. We provide theoretical guarantees under heavy-tailed and contaminated elliptical models, characterizing its statistical performance and robustness to outliers. Extensive simulations demonstrate that the balloon mean is robust to heavy-tailed and contaminated data, and outperforms existing differentially private mean estimators in contaminated settings.
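
As background for the clipping idea, the sketch below shows the basic building block that such estimators refine: clip every point to a ball, average, and add Gaussian noise calibrated for zero-concentrated differential privacy. It is not the balloon mean. It uses a single fixed Euclidean ball instead of iteratively expanding Mahalanobis balls, and the function names and parameters are assumptions.

```python
import math
import random


def clip_to_ball(point, radius):
    """Project `point` onto the Euclidean ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in point))
    if norm <= radius:
        return list(point)
    return [v * radius / norm for v in point]


def private_clipped_mean(points, radius, rho, rng):
    """Clipped mean released with the Gaussian mechanism under rho-zCDP.

    Replacing one point moves the clipped mean by at most 2 * radius / n in
    L2 norm; Gaussian noise with sigma = sensitivity / sqrt(2 * rho) then
    satisfies rho-zCDP.
    """
    n, d = len(points), len(points[0])
    clipped = [clip_to_ball(p, radius) for p in points]
    mean = [sum(p[k] for p in clipped) / n for k in range(d)]
    sensitivity = 2.0 * radius / n
    sigma = sensitivity / math.sqrt(2.0 * rho)
    return [m + rng.gauss(0.0, sigma) for m in mean]


# Hypothetical data with one gross outlier.
data = [[0.1, 0.2], [0.0, -0.1], [-0.2, 0.1], [1000.0, 0.0]]
estimate = private_clipped_mean(data, radius=1.0, rho=0.5, rng=random.Random(0))
```

Clipping bounds the influence of the outlier to the ball radius, which is what makes both the privacy calibration and the robustness argument possible.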

[LG-99] Epistemic Uncertainty Is Not the Reducible Kind

Link: https://arxiv.org/abs/2606.12646
Authors: Robin Young
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:The standard taxonomy of predictive uncertainty defines epistemic uncertainty as the part removable by collecting more data, while the standard measure identifies it with a mutual-information term. We prove the definition and the measure are extensionally inconsistent. On an explicit construction, the measure assigns all uncertainty to the epistemic class, yet no quantity of training data reduces it. Reducibility is instead a property of the pair (uncertainty, acquisition class), and the dichotomy resolves into three parts: aleatoric, sample-reducible epistemic, and mechanism-reducible epistemic uncertainty. An exact identity for the value of an observation shows that in-distribution data never reduces mechanism-irreducible uncertainty and generically increases it. Ensemble disagreement, the deployed epistemic estimate, tracks the training procedure rather than the epistemic term. It collapses to zero beneath a positive truth under consistent training, and equals hyperparameter-scaled initialization noise under interpolation. A finite-sample falsification test and seed-swept experiments confirm the theory.
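
The mutual-information measure that the abstract refers to is usually computed from an ensemble as total predictive entropy minus the average entropy of the members. A minimal sketch of that standard decomposition follows; the function names and the toy ensembles are assumptions, and the code illustrates the measure under discussion rather than the paper's construction.

```python
import math


def entropy(p):
    """Shannon entropy in nats, with 0 * log(0) treated as 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def decompose_uncertainty(member_probs):
    """Return (total, aleatoric, epistemic) for one input.

    total     = entropy of the ensemble-averaged prediction
    aleatoric = average entropy of the individual members
    epistemic = total - aleatoric (the mutual-information term)
    """
    m = len(member_probs)
    k = len(member_probs[0])
    mean_pred = [sum(p[c] for p in member_probs) / m for c in range(k)]
    total = entropy(mean_pred)
    aleatoric = sum(entropy(p) for p in member_probs) / m
    return total, aleatoric, total - aleatoric


# Hypothetical ensembles for a binary problem.
agreeing = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # members agree, each unsure
disagreeing = [[1.0, 0.0], [0.0, 1.0]]            # members confident, in conflict

agree_parts = decompose_uncertainty(agreeing)
disagree_parts = decompose_uncertainty(disagreeing)
```

The two toy cases have the same total uncertainty but opposite attributions, which shows that the epistemic term is driven entirely by member disagreement.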

[LG-100] Estimating Individualized Treatment Effects in Acute Ischemic Stroke with Causal Transformation Models (TRAM-DAG): A Multi-Centre Observational Study with External RCT Validation

Link: https://arxiv.org/abs/2606.12623
Authors: Oliver Dürr, Lisa Herzog, Pascal Bühler, Susanne Wegener, Beate Sick
Categories: Applications (stat.AP); Machine Learning (cs.LG)

Abstract:Personalized medicine in acute ischemic stroke requires moving beyond average treatment effects (ATE) to individualized treatment effect (ITE) estimates to support treatment decisions. In acute ischemic stroke, mechanical thrombectomy has been shown to be more effective on average than lysis in randomized controlled trials (RCTs), such as the MR CLEAN study. We aim to identify which individual patients benefit most from mechanical thrombectomy compared to lysis. The outcome of interest is the modified Rankin Scale (mRS) at three months, an ordinal measure of functional disability (0: no symptoms, 6: death). We demonstrate that causal transformation models on directed acyclic graphs (TRAM-DAG) can be used for ITE estimation after being fitted on observational MAGIC multi-center stroke patient data. To ensure comparability with the MR CLEAN population, which we use for validation, we train the TRAM-DAG on a MAGIC sub-population with NIHSS at admission ≥ 6, corresponding to one inclusion criterion of MR CLEAN. The fitted model is then used to estimate ITEs for stroke patients in the MR CLEAN population. While these ITE estimates cannot be confirmed experimentally, we show that their average is consistent with the trial’s reported ATE. Furthermore, the ITE estimates correctly rank trial patients by their observed frequency of a good outcome (mRS at three months ≤ 2). These findings support the use of causal models like TRAM-DAG for personalized decision-making in stroke care and highlight their ability to bridge the gap between observational evidence and clinical trials.

[LG-101] Feature-preserving Latent-EnKF for Data Assimilation of Flows with Shocks

Link: https://arxiv.org/abs/2606.12559
Authors: Hemanth Chandravamsi, Hangchuan Hu, Ponkrshnan Thiagarajan, Tamer A. Zaki
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)

Abstract:The ensemble Kalman filter (EnKF) is widely adopted for sequential data assimilation, but fails for solutions with discontinuities, such as shocks in compressible flows. Uncertainty in shock location induces multimodal ensemble statistics that violate the Gaussian assumptions underlying the EnKF, producing large-scale spurious oscillations in the analysis state. We introduce a feature-preserving latent-EnKF that performs the ensemble update in a learned low-dimensional latent space, where shock and flow features admit a smooth manifold representation, thereby preserving sharp features during EnKF analysis. The updated latent state is mapped back to physical state through a shared decoder for all ensemble members. The algorithm eliminates the member-specific ordered training and positivity flooring used in prior approaches. Numerical experiments on a Sod shock tube and Mach 2 shock interaction with a 2D cylinder, using sparse and noisy observations, show accurate feature recovery of shocks and contact discontinuities without spurious oscillations.

[LG-102] Analog Quantum Asynchronous Event-Based Graph Neural Network

Link: https://arxiv.org/abs/2606.11000
Authors: Kristian Sotirov, Shaheen Acheche, Antonio A. Gentile, Osvaldo Simeone
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 31 pages, 8 figures, initial version

Abstract:Asynchronous, event-based graph neural networks (AEGNNs) have recently emerged as an efficient paradigm for processing the sparse and high-temporal-resolution data from event cameras. In this paper, we propose quantum analog AEGNNs (QA-AEGNNs), a novel framework to implement an AEGNN on a neutral-atom quantum computer. Neutral-atom quantum processors offer a programmable analog quantum computing platform based on controllable Rydberg-atom interactions. To this end, we map the streaming event data to an array of trapped neutral atoms, where each atom represents a graph node (event) and is positioned such that geometric proximity reflects the spatio-temporal neighborhood of events. The native Rydberg Hamiltonian of the quantum processor is programmed to mirror the message-passing computations of the AEGNN, with atomic qubit states serving as node feature embeddings and inter-atom interactions realizing graph edges. Furthermore, we propose a hybrid quantum-classical training scheme in which the analog Hamiltonian parameters (e.g., laser pulse amplitudes and detunings) are optimized using classical feedback to learn the quantum AEGNN model from data. Our approach leverages the continuous Hamiltonian dynamics and massive parallelism of neutral-atom quantum systems to natively execute event-based graph computations with potential accuracy improvements

Attachment download

Click to download today's full list of papers